Personally I think this has a lot of fiddly little details.
Almost every example you can find online is about clustering text documents,
which made experimenting with this pretty painful.
Here are the steps:
1. Conversion (.csv => sequence file)
In my opinion this is the critical part.
Mahout doesn't seem to ship a command line tool you can use directly for this,
so you have to write your own Java converter.
Here is my Java code --
please excuse my humble attempt.
import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.mahout.math.DenseVector;
import org.apache.mahout.math.NamedVector;
import org.apache.mahout.math.VectorWritable;

public class Convert2Seq {
    public static void main(String[] args) throws Exception {
        // The converted file is named point-vector.
        // Note: Java does not expand "~", so substitute a real absolute path here.
        String output = "~/point-vector";
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path path = new Path(output);
        SequenceFile.Writer writer =
                new SequenceFile.Writer(fs, conf, path, Text.class, VectorWritable.class);
        // All of the csv files live under the factory directory.
        File folder = new File("~/factory");
        File[] listOfFiles = folder.listFiles();
        // Convert the csv files under factory into name-vector format,
        // one at a time, and append them into point-vector.
        for (int i = 0; i < listOfFiles.length; i++) {
            String input = listOfFiles[i].toString();
            VectorWritable vec = new VectorWritable();
            try {
                BufferedReader br = new BufferedReader(new FileReader(input));
                String s;
                while ((s = br.readLine()) != null) {
                    // First column is the record's name/key; the rest are feature values.
                    String[] spl = s.split(",");
                    String key = spl[0];
                    int val = 0;
                    // Fixed-size array: assumes at most 1000 feature columns;
                    // trailing entries stay 0 if a row has fewer.
                    double[] colvalues = new double[1000];
                    for (int k = 1; k < spl.length; k++) {
                        colvalues[val] = Double.parseDouble(spl[k]);
                        val++;
                    }
                    NamedVector nmv = new NamedVector(new DenseVector(colvalues), key);
                    vec.set(nmv);
                    writer.append(new Text(nmv.getName()), vec);
                }
                br.close();
            } catch (Exception e) {
                System.out.println("ERROR: " + e);
            }
        }
        writer.close();
    }
}
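Before compiling, it can be worth sanity-checking what got written. Here is a minimal sketch that reads the first few entries back out of point-vector (PeekSeq is just a name I made up for illustration; it uses the same Hadoop/Mahout classes as the converter above, and the same path caveat about "~" applies):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.mahout.math.VectorWritable;

public class PeekSeq {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        // Same output path that Convert2Seq wrote to.
        SequenceFile.Reader reader =
                new SequenceFile.Reader(fs, new Path("~/point-vector"), conf);
        Text key = new Text();
        VectorWritable value = new VectorWritable();
        int shown = 0;
        // Print the first five (name, vector) pairs and stop.
        while (reader.next(key, value) && shown < 5) {
            System.out.println(key + " => " + value.get().asFormatString());
            shown++;
        }
        reader.close();
    }
}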
Next, remember to specify the classpath when you compile:
javac -classpath /{hadoop_home}/hadoop-core-0.20.2-cdh3u1.jar\
:/{mahout_home}/{mahout_version}/core/target/mahout-core-0.5.jar\
:/{mahout_home}/{mahout_version}/core/target/mahout-core-0.5-job.jar\
:/{mahout_home}/{mahout_version}/core/target/mahout-core-0.5-sources.jar\
:/{mahout_home}/{mahout_version}/math/target/mahout-math-0.5.jar\
:/{mahout_home}/{mahout_version}/math/target/mahout-math-0.5-sources.jar\
:/{mahout_home}/{mahout_version}/math/target/mahout-math-0.5-tests.jar Convert2Seq.java
Then run the converter with the same classpath:
java -Djava.library.path=/{hadoop_home}/lib/native/Linux-amd64-64 \
-cp .:/{hadoop_home}/hadoop-core-0.20.2-cdh3u1.jar\
:/{mahout_home}/{mahout_version}/core/target/mahout-core-0.5.jar\
:/{mahout_home}/{mahout_version}/core/target/mahout-core-0.5-job.jar\
:/{mahout_home}/{mahout_version}/core/target/mahout-core-0.5-sources.jar\
:/{mahout_home}/{mahout_version}/math/target/mahout-math-0.5.jar\
:/{mahout_home}/{mahout_version}/math/target/mahout-math-0.5-sources.jar\
:/{mahout_home}/{mahout_version}/math/target/mahout-math-0.5-tests.jar Convert2Seq
That's basically it.
I can't help complaining one more time: this conversion step is a huge pain!
2. Canopy clustering
When you run kmeans, Mahout insists on having an initial set of clusters,
and canopy clustering in Mahout can generate one for us:
$mahout canopy -i point-vector -o center-vector -dm org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure -t1 500 -t2 250 -ow -cl
The result is stored in center-vector, and you can then use it (yes, center-vector!) to do the kmeans clustering~
For how to set t1 and t2 and the theory behind them, see the online Mahout documentation; roughly, t1 is the loose threshold and t2 the tight one, so t1 > t2.
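One thing that helped me reason about t1/t2: they are compared against raw outputs of the distance measure you pass with -dm, so with SquaredEuclideanDistanceMeasure they live on a squared scale. A tiny sketch (class name and numbers are made up for illustration):

import org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure;
import org.apache.mahout.math.DenseVector;
import org.apache.mahout.math.Vector;

public class DistanceDemo {
    public static void main(String[] args) {
        Vector a = new DenseVector(new double[] {1.0, 2.0, 3.0});
        Vector b = new DenseVector(new double[] {4.0, 6.0, 3.0});
        SquaredEuclideanDistanceMeasure dm = new SquaredEuclideanDistanceMeasure();
        // Squared Euclidean distance: (4-1)^2 + (6-2)^2 + (3-3)^2 = 25.
        // The -t1/-t2 thresholds in the canopy command above are compared
        // against values on this (squared) scale.
        System.out.println(dm.distance(a, b));
    }
}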
3. Kmeans clustering
The input is the converted point-vector data,
-c takes the initial clusters that canopy generated,
and for the output just pick whatever name you like~
$mahout kmeans -i point-vector -c center-vector -o clusters -dm org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure -x 20 -cl -k 15 -ow
Here -x 20 caps the number of iterations, -k 15 asks for 15 clusters, and -cl tells Mahout to also assign every point to its final cluster, which is what produces the clusteredPoints directory used in the next step.
4. Dumping the clustering result
After the run finishes,
the results are stored in clusters/clusteredPoints.
Since that's sequence file format, we have to call in seqdumper to turn it into something a human can read.
There is usually more than one file, so use a for-loop to beat them all back into their original form~~~
typeset -i i
let i=1
for file in `$hadoop fs -ls clusters/clusteredPoints | grep 'part' | awk '{print $8}'`
do
    $mahout seqdumper -s $file -o ~/$i.txt
    let i++
done
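If you'd rather skip seqdumper and read clusteredPoints straight from Java, here is a minimal sketch in the same style as the converter above. As far as I can tell, kmeans with -cl writes (IntWritable cluster id, WeightedVectorWritable point) pairs in this version of Mahout; the class name DumpClusteredPoints and the exact part file name are my assumptions, so adjust them to whatever hadoop fs -ls shows:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.mahout.clustering.WeightedVectorWritable;

public class DumpClusteredPoints {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        // One of the part files under clusters/clusteredPoints;
        // the name here is an assumption, check with hadoop fs -ls.
        Path part = new Path("clusters/clusteredPoints/part-m-00000");
        SequenceFile.Reader reader = new SequenceFile.Reader(fs, part, conf);
        IntWritable clusterId = new IntWritable();
        WeightedVectorWritable point = new WeightedVectorWritable();
        // Print each point together with the id of the cluster it was assigned to.
        while (reader.next(clusterId, point)) {
            System.out.println("cluster " + clusterId + ": " + point.getVector());
        }
        reader.close();
    }
}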
