Thursday, August 29, 2013

[R] Big Data (1): Multiple Inputs with the rmr2 Package

rmr2 is a package that lets you run map-reduce jobs on Hadoop from R.

If you don't yet know what Hadoop and MapReduce are, please see this article.

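What I had done before, and nearly every demo I could find online, takes a single dataset, distributes it by key, and processes the pieces in parallel. The problem I have been working on lately, however, requires sending data from different sources, with different structures (schemas), to the same node for computation. I went looking for demo code online and found nothing, which gave me quite a headache.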

After a lot of trial and error, it actually worked!!! So here is a post to show everyone what to do when you have several datasets~~~

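The three datasets used for the demo: A, B, and C
Here is what we want to do: bring the rows of A, B, and C that share the same key together on the same node.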

Here is what the data looks like (I generated it at random).
> head(A)
  key         A1         A2         A3
1   1 -1.0348640  1.8177439  0.1352576
2   3 -1.3735558 -0.2437948 -0.4509937
3   2 -0.2034888  1.0297576  0.6305115
4   1  0.2270242  0.9087429 -0.4122123
5   3 -0.3290382 -0.4840644 -0.3688641
6   3 -0.1822808 -1.1303439 -0.4175791
> head(B)
  key           B1         B2          B3          B4
1   3  0.363861211  1.5158812 -1.20591630  0.08659873
2   3 -0.001122564  0.1037150  0.25809288  1.60135858
3   2  1.465006010 -0.4572000 -1.89767865 -0.60817508
4   2  1.964748118  0.8320135  0.50937176  0.77755846
5   1 -0.390667063  0.8493213  0.09075889  0.36958850
6   1 -0.768992997  0.7308232 -0.82277576  0.33674132
> head(C)
  key         C1         C2
1   1  0.4192436  0.1554905
2   2 -0.2125050 -1.4696090
3   1  0.3410178  0.3290381
4   2 -2.2716488  1.3529220
5   2 -0.2565037 -0.1575783
6   2 -0.4259541  0.1968482
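
The code that generated A, B, and C is not shown in the post. The following is a minimal sketch of how comparable data could be produced in base R. The helper `make_data`, the seed, and the use of `rnorm` are assumptions for illustration; only the column names and the row counts (30, 20, and 40) are taken from the output shown in this post.

```r
# Hypothetical reconstruction: the original generating code is not shown.
set.seed(1)  # arbitrary seed, only so the sketch is reproducible

make_data <- function(n, prefix, n_cols, keys = 1:3) {
  # n rows: a random key column plus n_cols standard-normal columns
  values <- as.data.frame(matrix(rnorm(n * n_cols), nrow = n))
  names(values) <- paste0(prefix, seq_len(n_cols))
  cbind(key = sample(keys, n, replace = TRUE), values)
}

A <- make_data(30, "A", 3)  # columns: key, A1, A2, A3
B <- make_data(20, "B", 4)  # columns: key, B1, B2, B3, B4
C <- make_data(40, "C", 2)  # columns: key, C1, C2

str(A)
```

The numbers will differ from the ones printed in this post, but the three schemas are the same, which is all the later steps depend on.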
Next, use the split function to break each dataset apart by key; the result is a list with one data frame per key.
> A_wk<- split(A,A$key)
> head(A_wk)
$`1`
   key         A1         A2          A3
1    1 -1.0348640  1.8177439  0.13525756
4    1  0.2270242  0.9087429 -0.41221232
13   1 -0.3573558 -1.2751938 -0.03458945
14   1 -0.7453312 -1.1503369  1.08577621
16   1  0.8465141  0.2366092  0.91902192
19   1  1.4896751  0.3213586 -0.98961302
23   1 -0.3143099  2.0754432 -1.29391057

$`2`
   key          A1          A2          A3
3    2 -0.20348880  1.02975761  0.63051151
7    2  1.53614888  1.96886642  0.04558348
9    2 -0.07431669 -0.03700706 -1.47200277
10   2  0.66355253  0.07667024 -1.22673427
17   2  1.13416422 -1.69200417 -0.13861365
18   2 -0.99696590  0.24465904  0.54821302
24   2 -0.44196754 -0.28170710 -0.73912548
27   2 -1.34823336 -0.06120274  2.12261003
30   2  1.83861168 -0.02982669  0.14312250

$`3`
   key         A1         A2         A3
2    3 -1.3735558 -0.2437948 -0.4509937
5    3 -0.3290382 -0.4840644 -0.3688641
6    3 -0.1822808 -1.1303439 -0.4175791
8    3  1.2317304 -0.7572487 -0.4401060
11   3 -0.4940248 -0.1259619 -1.1145702
12   3 -1.4488153 -0.9855823 -0.7537385
15   3 -0.6147528  0.6804414 -0.7799006
20   3  0.5340705 -0.2427455 -1.5272875
21   3 -1.2019567 -0.1434495 -0.3046498
22   3  0.1311908 -0.4900816  0.8861471
25   3 -1.1544569 -0.1732862 -2.2312314
26   3 -1.3264688 -0.6784207  1.5171326
28   3  1.1866616 -1.9195358  0.3591871
29   3  0.7476575 -0.3390230 -1.6448516

> B_wk<- split(B,B$key)
> head(B_wk)
$`1`
   key          B1         B2          B3            B4
5    1 -0.39066706  0.8493213  0.09075889  0.3695884990
6    1 -0.76899300  0.7308232 -0.82277576  0.3367413230
8    1 -1.05567857 -0.5663445  0.32075285  1.0807403069
9    1 -0.07174419 -0.6553943 -0.30134811 -1.2155568454
13   1 -0.39760179  0.5973388 -0.43153826 -0.0003626449
14   1  0.38234556 -0.4762401  0.90686094 -3.8579677970
15   1 -0.73177601 -0.6438049 -1.52620752  0.0814186088
20   1  0.07927141 -1.7918052 -1.27799659  0.1533002628

$`2`
   key          B1         B2          B3         B4
3    2  1.46500601 -0.4572000 -1.89767865 -0.6081751
4    2  1.96474812  0.8320135  0.50937176  0.7775585
7    2  0.35856113  0.1048170 -0.76270331 -0.5114040
10   2  1.11808088 -0.5707235 -0.16225111 -1.0749321
12   2 -0.32032713 -0.7167343 -0.03320639  0.2495948
18   2  1.36768551  1.1874533  2.13816520  1.0105115
19   2  0.02825361  0.8781400 -1.44303311  1.2391620

$`3`
   key           B1         B2         B3          B4
1    3  0.363861211  1.5158812 -1.2059163  0.08659873
2    3 -0.001122564  0.1037150  0.2580929  1.60135858
11   3  0.180619368 -0.8830636  0.7562675  1.09992035
16   3  2.644110325 -1.8546195  2.4887309 -0.03694847
17   3  1.131906794  0.9559589 -1.9111856  1.16240718

> C_wk<- split(C,C$key)
> head(C_wk)
$`1`
   key          C1         C2
1    1  0.41924361  0.1554905
3    1  0.34101785  0.3290381
14   1 -1.27793718 -0.2728525
18   1 -0.72084207  0.3307406
20   1  0.05193866 -0.4465938
27   1  1.03553670  1.7562845
29   1  2.35281628 -1.6114928
37   1  0.02295450 -0.9392724

$`2`
   key         C1         C2
2    2 -0.2125050 -1.4696090
4    2 -2.2716488  1.3529220
5    2 -0.2565037 -0.1575783
6    2 -0.4259541  0.1968482
7    2 -1.1866331 -0.3687882
13   2  0.7102612 -1.1221971
15   2  0.1592025  0.2775758
17   2  0.3842816 -0.6379072
22   2  0.1516206 -1.0723437
24   2 -0.4070279 -0.2998299
25   2  0.7779358 -0.2862851
28   2 -0.6860207 -0.9731296
30   2 -0.1319815  1.4057571
31   2  0.2845947 -0.3443439
33   2  0.7927496  0.9126125
34   2  0.4733910 -1.6850074

$`3`
   key          C1           C2
8    3 -0.13961645  0.204808027
9    3  1.49999162 -0.435241747
10   3 -0.94473626  1.768523536
11   3 -1.68380914 -0.172574070
12   3  1.12455947  0.611700128
16   3  0.19875147 -1.356228028
19   3  0.78590745 -0.796733981
21   3 -1.02839096  0.254459297
23   3 -1.00747198  0.593401435
26   3  0.81168025 -1.447740656
32   3  0.74883355 -0.170628912
35   3 -0.82805688 -1.792942718
36   3 -0.68229982  0.001420327
38   3  1.11185787  1.452839232
39   3  0.01296254 -0.676003236
40   3 -0.70681824 -0.027688693

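Once the data is split, use the keyval function with the names of the list elements (extracted with names) as the keys. In other words, this pairs each chunk of data with its key.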
> Total<- keyval(c(names(A_wk),names(B_wk),names(C_wk)),c(A_wk,B_wk,C_wk))
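
To see why the two `c(...)` calls line up, here is a small base R sketch that does not need rmr2 or Hadoop. The toy data frames are made-up values, and the final `split` is only a plain-R imitation of the grouping that happens before reduce, not what rmr2 does internally.

```r
# Toy stand-ins for A and B (hypothetical values, much smaller than the post's data)
A <- data.frame(key = c(1, 2, 1), A1 = c(0.1, 0.2, 0.3))
B <- data.frame(key = c(2, 1, 2), B1 = c(1.1, 1.2, 1.3), B2 = c(2.1, 2.2, 2.3))

A_wk <- split(A, A$key)
B_wk <- split(B, B$key)

keys   <- c(names(A_wk), names(B_wk))  # one key per chunk, in chunk order
values <- c(A_wk, B_wk)                # list of data frames, same order as keys

# keyval needs the i-th key to describe the i-th value
stopifnot(length(keys) == length(values))

# Plain-R imitation of grouping by key (not rmr2 itself):
# each key ends up with one chunk per source dataset.
grouped <- split(unname(values), keys)
sapply(grouped, length)
```

Because `names()` and `c()` walk the lists in the same order, each key in the first vector matches the data frame at the same position in the second.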

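Then feed it into mapreduce! Since keyval has already assigned a key to each chunk, no map step is needed here; the reduce step alone restores the data into the shape I want. First split the chunks by their column names (the datasets here have different column names; if yours share the same names, add some other condition to tell them apart), then combine the chunks within each group into a data.frame.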

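> Demo<- mapreduce(
+     input = to.dfs(Total),
+     reduce = function(k,v){
+         # group the chunks for this key by their column-name signature
+         test<- split(v,sapply(v,function(x) paste(colnames(x),collapse = "")))
+         # stack the chunks in each group into one data.frame
+         test<- lapply(test,function(x) Reduce(rbind,x))
+         # emit one data.frame per source dataset (assumes all three are present)
+         keyval(k,list(test[[1]],test[[2]],test[[3]]))
+     }
+ )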
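
The reduce function above indexes `test[[1]]` through `test[[3]]`, so it relies on every key having data from all three datasets. The following is a minimal sketch of a more defensive variant, written as a plain function so it can be tried in base R without a cluster. The name `combine_by_schema`, the `"|"` separator, and the toy input are assumptions for illustration, not part of the original job.

```r
# Plain function with the same (k, v) shape as the reduce step; v is assumed
# to be a list of data frames that all belong to key k.
combine_by_schema <- function(k, v) {
  # Signature of each chunk = its column names joined with "|"
  # (the separator avoids collisions such as c("a", "bc") vs c("ab", "c"))
  signature <- vapply(v, function(x) paste(colnames(x), collapse = "|"),
                      character(1))
  parts <- split(v, signature)
  # Stack chunks that share a schema; this also works when a group has one chunk
  lapply(parts, function(x) do.call(rbind, unname(x)))
}

# Toy input for key "1": two chunks from A, one from B, none from C
v <- list(
  data.frame(key = 1, A1 = 0.1),
  data.frame(key = 1, B1 = 1.1, B2 = 2.1),
  data.frame(key = 1, A1 = 0.3)
)
out <- combine_by_schema("1", v)
names(out)      # one entry per schema actually present for this key
nrow(out[[1]])  # number of rows in the stacked A chunks
```

Returning a named list means a key with a missing dataset yields a shorter list instead of a "subscript out of bounds" error. If rmr2 is installed, `rmr.options(backend = "local")` should allow the whole job to be tried without a Hadoop cluster, though that is not something this post tested.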
Now let's look at the result:
> from.dfs(Demo)
$key
[1] "1" "1" "1" "2" "2" "2" "3" "3" "3"

$val
$val[[1]]
   key         A1         A2          A3
1    1 -1.0348640  1.8177439  0.13525756
4    1  0.2270242  0.9087429 -0.41221232
13   1 -0.3573558 -1.2751938 -0.03458945
14   1 -0.7453312 -1.1503369  1.08577621
16   1  0.8465141  0.2366092  0.91902192
19   1  1.4896751  0.3213586 -0.98961302
23   1 -0.3143099  2.0754432 -1.29391057

$val[[2]]
   key          B1         B2          B3            B4
5    1 -0.39066706  0.8493213  0.09075889  0.3695884990
6    1 -0.76899300  0.7308232 -0.82277576  0.3367413230
8    1 -1.05567857 -0.5663445  0.32075285  1.0807403069
9    1 -0.07174419 -0.6553943 -0.30134811 -1.2155568454
13   1 -0.39760179  0.5973388 -0.43153826 -0.0003626449
14   1  0.38234556 -0.4762401  0.90686094 -3.8579677970
15   1 -0.73177601 -0.6438049 -1.52620752  0.0814186088
20   1  0.07927141 -1.7918052 -1.27799659  0.1533002628

$val[[3]]
   key          C1         C2
1    1  0.41924361  0.1554905
3    1  0.34101785  0.3290381
14   1 -1.27793718 -0.2728525
18   1 -0.72084207  0.3307406
20   1  0.05193866 -0.4465938
27   1  1.03553670  1.7562845
29   1  2.35281628 -1.6114928
37   1  0.02295450 -0.9392724

$val[[4]]
   key          A1          A2          A3
3    2 -0.20348880  1.02975761  0.63051151
7    2  1.53614888  1.96886642  0.04558348
9    2 -0.07431669 -0.03700706 -1.47200277
10   2  0.66355253  0.07667024 -1.22673427
17   2  1.13416422 -1.69200417 -0.13861365
18   2 -0.99696590  0.24465904  0.54821302
24   2 -0.44196754 -0.28170710 -0.73912548
27   2 -1.34823336 -0.06120274  2.12261003
30   2  1.83861168 -0.02982669  0.14312250

$val[[5]]
   key          B1         B2          B3         B4
3    2  1.46500601 -0.4572000 -1.89767865 -0.6081751
4    2  1.96474812  0.8320135  0.50937176  0.7775585
7    2  0.35856113  0.1048170 -0.76270331 -0.5114040
10   2  1.11808088 -0.5707235 -0.16225111 -1.0749321
12   2 -0.32032713 -0.7167343 -0.03320639  0.2495948
18   2  1.36768551  1.1874533  2.13816520  1.0105115
19   2  0.02825361  0.8781400 -1.44303311  1.2391620

$val[[6]]
   key         C1         C2
2    2 -0.2125050 -1.4696090
4    2 -2.2716488  1.3529220
5    2 -0.2565037 -0.1575783
6    2 -0.4259541  0.1968482
7    2 -1.1866331 -0.3687882
13   2  0.7102612 -1.1221971
15   2  0.1592025  0.2775758
17   2  0.3842816 -0.6379072
22   2  0.1516206 -1.0723437
24   2 -0.4070279 -0.2998299
25   2  0.7779358 -0.2862851
28   2 -0.6860207 -0.9731296
30   2 -0.1319815  1.4057571
31   2  0.2845947 -0.3443439
33   2  0.7927496  0.9126125
34   2  0.4733910 -1.6850074

$val[[7]]
   key         A1         A2         A3
2    3 -1.3735558 -0.2437948 -0.4509937
5    3 -0.3290382 -0.4840644 -0.3688641
6    3 -0.1822808 -1.1303439 -0.4175791
8    3  1.2317304 -0.7572487 -0.4401060
11   3 -0.4940248 -0.1259619 -1.1145702
12   3 -1.4488153 -0.9855823 -0.7537385
15   3 -0.6147528  0.6804414 -0.7799006
20   3  0.5340705 -0.2427455 -1.5272875
21   3 -1.2019567 -0.1434495 -0.3046498
22   3  0.1311908 -0.4900816  0.8861471
25   3 -1.1544569 -0.1732862 -2.2312314
26   3 -1.3264688 -0.6784207  1.5171326
28   3  1.1866616 -1.9195358  0.3591871
29   3  0.7476575 -0.3390230 -1.6448516

$val[[8]]
   key           B1         B2         B3          B4
1    3  0.363861211  1.5158812 -1.2059163  0.08659873
2    3 -0.001122564  0.1037150  0.2580929  1.60135858
11   3  0.180619368 -0.8830636  0.7562675  1.09992035
16   3  2.644110325 -1.8546195  2.4887309 -0.03694847
17   3  1.131906794  0.9559589 -1.9111856  1.16240718

$val[[9]]
   key          C1           C2
8    3 -0.13961645  0.204808027
9    3  1.49999162 -0.435241747
10   3 -0.94473626  1.768523536
11   3 -1.68380914 -0.172574070
12   3  1.12455947  0.611700128
16   3  0.19875147 -1.356228028
19   3  0.78590745 -0.796733981
21   3 -1.02839096  0.254459297
23   3 -1.00747198  0.593401435
26   3  0.81168025 -1.447740656
32   3  0.74883355 -0.170628912
35   3 -0.82805688 -1.792942718
36   3 -0.68229982  0.001420327
38   3  1.11185787  1.452839232
39   3  0.01296254 -0.676003236
40   3 -0.70681824 -0.027688693
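OK, exactly what I wanted!
The first three datasets belong to key "1".
The middle three datasets belong to key "2".
The last three datasets belong to key "3".
All done!!!
You can add more to the reduce step, such as joins between the tables.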
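
As one possible illustration of such a join, the sketch below chains base R's `merge` over the per-key pieces. The three small data frames are hypothetical stand-ins for what reduce would hold for a single key; in a real job they would come from the split-and-stack step rather than being typed in.

```r
# Hypothetical per-key pieces, as they might look inside reduce for key "1"
a      <- data.frame(key = c(1, 1), A1 = c(0.1, 0.3))
b      <- data.frame(key = c(1, 1, 1), B1 = c(1.1, 1.2, 1.3))
c_part <- data.frame(key = 1, C1 = 9.9)

# Chain merge() over the pieces; by = "key" makes the join column explicit
joined <- Reduce(function(x, y) merge(x, y, by = "key"),
                 list(a, b, c_part))

dim(joined)
```

Since every row in a piece carries the same key, this join pairs every row of one piece with every row of the others, so the row count is the product of the piece sizes. If that is not what you want, aggregating each piece first (for example with `colMeans`) before combining is probably the safer route.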
