Personally, I found this full of tricky little details.
Almost all the examples you can find online are for document clustering,
which made experimenting with this quite a struggle.
Here are the steps:
1. File conversion (.csv => sequence file)
Personally I think this step is critical.
Mahout doesn't seem to ship a command line tool that does this directly,
so you have to write your own Java converter.
Here is my Java code.
Please excuse its roughness.
import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.mahout.math.DenseVector;
import org.apache.mahout.math.NamedVector;
import org.apache.mahout.math.VectorWritable;

public class Convert2Seq {
    public static void main(String[] args) throws Exception {
        // The converted data is written to point-vector.
        // NOTE: neither Java nor Hadoop expands "~"; replace these
        // two paths with real absolute paths on your system.
        String output = "~/point-vector";
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path path = new Path(output);
        SequenceFile.Writer writer =
            new SequenceFile.Writer(fs, conf, path, Text.class, VectorWritable.class);

        // All the CSV files live under the factory directory.
        File folder = new File("~/factory");
        File[] listOfFiles = folder.listFiles();

        // Convert each CSV under factory into name-vector pairs,
        // appending them all into point-vector.
        for (int i = 0; i < listOfFiles.length; i++) {
            String input = listOfFiles[i].toString();
            VectorWritable vec = new VectorWritable();
            try {
                BufferedReader br = new BufferedReader(new FileReader(input));
                String s;
                while ((s = br.readLine()) != null) {
                    String[] spl = s.split(",");
                    String key = spl[0];   // first column is the row name
                    double[] colvalues = new double[spl.length - 1];
                    for (int k = 1; k < spl.length; k++) {
                        colvalues[k - 1] = Double.parseDouble(spl[k]);
                    }
                    NamedVector nmv = new NamedVector(new DenseVector(colvalues), key);
                    vec.set(nmv);
                    writer.append(new Text(nmv.getName()), vec);
                }
                br.close();
            } catch (Exception e) {
                System.out.println("ERROR: " + e);
            }
        }
        writer.close();
    }
}
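For reference, the converter above assumes each CSV row carries the row name in the first column and numeric features in the rest. A made-up example (machine01 and machine02 are hypothetical names, not from my data):

machine01,1.2,3.4,5.6
machine02,2.1,0.4,7.7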
Next, remember to specify the classpath when compiling:
javac -classpath /{hadoop_home}/hadoop-core-0.20.2-cdh3u1.jar:\
/{mahout_home}/{mahout_version}/core/target/mahout-core-0.5.jar:\
/{mahout_home}/{mahout_version}/core/target/mahout-core-0.5-job.jar:\
/{mahout_home}/{mahout_version}/core/target/mahout-core-0.5-sources.jar:\
/{mahout_home}/{mahout_version}/math/target/mahout-math-0.5.jar:\
/{mahout_home}/{mahout_version}/math/target/mahout-math-0.5-sources.jar:\
/{mahout_home}/{mahout_version}/math/target/mahout-math-0.5-tests.jar \
Convert2Seq.java

java -Djava.library.path=/{hadoop_home}/lib/native/Linux-amd64-64 \
-cp .:/usr/local/hadoop-0.20.2-cdh3u1/hadoop-core-0.20.2-cdh3u1.jar:\
/{mahout_home}/{mahout_version}/core/target/mahout-core-0.5.jar:\
/{mahout_home}/{mahout_version}/core/target/mahout-core-0.5-job.jar:\
/{mahout_home}/{mahout_version}/core/target/mahout-core-0.5-sources.jar:\
/{mahout_home}/{mahout_version}/math/target/mahout-math-0.5.jar:\
/{mahout_home}/{mahout_version}/math/target/mahout-math-0.5-sources.jar:\
/{mahout_home}/{mahout_version}/math/target/mahout-math-0.5-tests.jar \
Convert2Seq
That's basically it.
I can't resist complaining one more time: this conversion step is a real pain!
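If you want to sanity-check the conversion before moving on, here is a minimal read-back sketch. This is my own addition, assuming the same Hadoop 0.20 API as above and that the path matches what the converter wrote:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.mahout.math.VectorWritable;

// Reads point-vector back and prints every row name with its vector.
public class ReadSeq {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        // assumed to be the same path the converter wrote to
        SequenceFile.Reader reader =
            new SequenceFile.Reader(fs, new Path("~/point-vector"), conf);
        Text key = new Text();
        VectorWritable value = new VectorWritable();
        while (reader.next(key, value)) {
            System.out.println(key + " => " + value.get());
        }
        reader.close();
    }
}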
2. Canopy clustering
When running k-means, Mahout requires an initial set of clusters.
Mahout's canopy clustering can generate one for us.
$mahout canopy -i point-vector -o center-vector \
  -dm org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure \
  -t1 500 -t2 250 -ow -cl
# the results are stored in center-vector
Then you can feed its result (that's center-vector!) into k-means clustering.
For how to set t1 and t2, and the theory behind them, see the online Mahout documentation; the sketch below gives some intuition.
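My understanding of what t1 and t2 do, as a minimal single-machine sketch (this is the general canopy idea, not Mahout's actual MapReduce implementation; plain Euclidean distance is used for brevity):

import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Canopy sketch: t1 > t2. A point within t2 of a center is consumed and
// can never seed another canopy; a point within t1 merely joins the canopy.
public class CanopySketch {
    static double dist(double[] a, double[] b) {
        double s = 0;
        for (int i = 0; i < a.length; i++) s += (a[i] - b[i]) * (a[i] - b[i]);
        return Math.sqrt(s);
    }

    static List<double[]> canopyCenters(List<double[]> points, double t1, double t2) {
        List<double[]> centers = new ArrayList<double[]>();
        List<double[]> pool = new ArrayList<double[]>(points);
        while (!pool.isEmpty()) {
            double[] center = pool.remove(0);  // arbitrary point becomes a new center
            centers.add(center);
            Iterator<double[]> it = pool.iterator();
            while (it.hasNext()) {
                double d = dist(center, it.next());
                if (d < t2) it.remove();       // strongly covered: removed from the pool
                // points with t2 <= d < t1 stay and may seed other canopies
            }
        }
        return centers;
    }
}

So larger t1/t2 values mean fewer, coarser canopies (and thus fewer initial centers for k-means).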
3. Kmeans clustering
The input is the converted point-vector data.
center is the initial clusters produced by canopy.
For output, just pick whatever name you like.
$mahout kmeans -i point-vector -c center-vector -o clusters \
  -dm org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure \
  -x 20 -cl -k 15 -ow
One caveat: when -k is given, Mahout picks k random points as the initial centroids and writes them over the -c path, so the canopy centers above are effectively ignored; drop -k if you want to keep them.
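For intuition about what -x bounds, here is one Lloyd iteration as a minimal sketch of my own (not Mahout's code): -x 20 caps how many times this assign-then-recompute step is repeated before Mahout stops waiting for convergence.

// One k-means (Lloyd) iteration: assign each point to its nearest center
// under squared Euclidean distance, then recompute each center as the
// mean of its assigned points.
public class LloydSketch {
    static double[][] lloydStep(double[][] points, double[][] centers) {
        int k = centers.length, dim = centers[0].length;
        double[][] sums = new double[k][dim];
        int[] counts = new int[k];
        for (double[] p : points) {
            int best = 0;
            double bestD = Double.MAX_VALUE;
            for (int c = 0; c < k; c++) {
                double d = 0;
                for (int j = 0; j < dim; j++) {
                    d += (p[j] - centers[c][j]) * (p[j] - centers[c][j]);
                }
                if (d < bestD) { bestD = d; best = c; }
            }
            counts[best]++;
            for (int j = 0; j < dim; j++) sums[best][j] += p[j];
        }
        for (int c = 0; c < k; c++) {
            if (counts[c] > 0) {
                for (int j = 0; j < dim; j++) sums[c][j] /= counts[c];
            } else {
                sums[c] = centers[c]; // keep an empty cluster's old center
            }
        }
        return sums;
    }
}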
4. Dumping the clustering results
After the run finishes,
the results live under clusters/clusteredPoints.
Since they are in sequence file format, you need seqdumper to turn them into something human-readable.
There is usually more than one file, so use a for loop to dump them all back into plain text:
typeset -i i
let i=1
for file in `$hadoop fs -ls clusters/clusteredPoints | grep 'part' | awk '{print $8}'`
do
  $mahout seqdumper -s $file -o ~/$i.txt
  let i++
done