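If you are not yet familiar with Hadoop and MapReduce, please read this article first.
Most of what I have built before, and nearly every demo I could find online, takes a single dataset, partitions it by key, and runs the computation in parallel. The problem I have been working on recently, however, requires sending data from different sources with different structures (schemas) to the same node for computation. I went looking for demo code online and found none, which gave me quite a headache.
After a lot of trial and error, it finally worked! So here is a post showing what to do when you have several datasets.
The three datasets used for the demo: A, B, and C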
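Here is what we want to do: bring the rows of A, B, and C that share the same key together on the same node.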
The data looks like this (I just generated it randomly).
> head(A)
  key         A1         A2         A3
1   1 -1.0348640  1.8177439  0.1352576
2   3 -1.3735558 -0.2437948 -0.4509937
3   2 -0.2034888  1.0297576  0.6305115
4   1  0.2270242  0.9087429 -0.4122123
5   3 -0.3290382 -0.4840644 -0.3688641
6   3 -0.1822808 -1.1303439 -0.4175791
> head(B)
  key           B1         B2          B3          B4
1   3  0.363861211  1.5158812 -1.20591630  0.08659873
2   3 -0.001122564  0.1037150  0.25809288  1.60135858
3   2  1.465006010 -0.4572000 -1.89767865 -0.60817508
4   2  1.964748118  0.8320135  0.50937176  0.77755846
5   1 -0.390667063  0.8493213  0.09075889  0.36958850
6   1 -0.768992997  0.7308232 -0.82277576  0.33674132
> head(C)
  key         C1         C2
1   1  0.4192436  0.1554905
2   2 -0.2125050 -1.4696090
3   1  0.3410178  0.3290381
4   2 -2.2716488  1.3529220
5   2 -0.2565037 -0.1575783
6   2 -0.4259541  0.1968482

Next, use the split function to break each dataset into one piece per key; the pieces are stored as a list.
> A_wk <- split(A, A$key)
> head(A_wk)
$`1`
   key         A1         A2          A3
 1   1 -1.0348640  1.8177439  0.13525756
 4   1  0.2270242  0.9087429 -0.41221232
13   1 -0.3573558 -1.2751938 -0.03458945
14   1 -0.7453312 -1.1503369  1.08577621
16   1  0.8465141  0.2366092  0.91902192
19   1  1.4896751  0.3213586 -0.98961302
23   1 -0.3143099  2.0754432 -1.29391057

$`2`
   key          A1          A2          A3
 3   2 -0.20348880  1.02975761  0.63051151
 7   2  1.53614888  1.96886642  0.04558348
 9   2 -0.07431669 -0.03700706 -1.47200277
10   2  0.66355253  0.07667024 -1.22673427
17   2  1.13416422 -1.69200417 -0.13861365
18   2 -0.99696590  0.24465904  0.54821302
24   2 -0.44196754 -0.28170710 -0.73912548
27   2 -1.34823336 -0.06120274  2.12261003
30   2  1.83861168 -0.02982669  0.14312250

$`3`
   key         A1         A2         A3
 2   3 -1.3735558 -0.2437948 -0.4509937
 5   3 -0.3290382 -0.4840644 -0.3688641
 6   3 -0.1822808 -1.1303439 -0.4175791
 8   3  1.2317304 -0.7572487 -0.4401060
11   3 -0.4940248 -0.1259619 -1.1145702
12   3 -1.4488153 -0.9855823 -0.7537385
15   3 -0.6147528  0.6804414 -0.7799006
20   3  0.5340705 -0.2427455 -1.5272875
21   3 -1.2019567 -0.1434495 -0.3046498
22   3  0.1311908 -0.4900816  0.8861471
25   3 -1.1544569 -0.1732862 -2.2312314
26   3 -1.3264688 -0.6784207  1.5171326
28   3  1.1866616 -1.9195358  0.3591871
29   3  0.7476575 -0.3390230 -1.6448516

> B_wk <- split(B, B$key)
> head(B_wk)
$`1`
   key          B1         B2          B3            B4
 5   1 -0.39066706  0.8493213  0.09075889  0.3695884990
 6   1 -0.76899300  0.7308232 -0.82277576  0.3367413230
 8   1 -1.05567857 -0.5663445  0.32075285  1.0807403069
 9   1 -0.07174419 -0.6553943 -0.30134811 -1.2155568454
13   1 -0.39760179  0.5973388 -0.43153826 -0.0003626449
14   1  0.38234556 -0.4762401  0.90686094 -3.8579677970
15   1 -0.73177601 -0.6438049 -1.52620752  0.0814186088
20   1  0.07927141 -1.7918052 -1.27799659  0.1533002628

$`2`
   key          B1         B2          B3         B4
 3   2  1.46500601 -0.4572000 -1.89767865 -0.6081751
 4   2  1.96474812  0.8320135  0.50937176  0.7775585
 7   2  0.35856113  0.1048170 -0.76270331 -0.5114040
10   2  1.11808088 -0.5707235 -0.16225111 -1.0749321
12   2 -0.32032713 -0.7167343 -0.03320639  0.2495948
18   2  1.36768551  1.1874533  2.13816520  1.0105115
19   2  0.02825361  0.8781400 -1.44303311  1.2391620

$`3`
   key           B1         B2         B3          B4
 1   3  0.363861211  1.5158812 -1.2059163  0.08659873
 2   3 -0.001122564  0.1037150  0.2580929  1.60135858
11   3  0.180619368 -0.8830636  0.7562675  1.09992035
16   3  2.644110325 -1.8546195  2.4887309 -0.03694847
17   3  1.131906794  0.9559589 -1.9111856  1.16240718

> C_wk <- split(C, C$key)
> head(C_wk)
$`1`
   key          C1         C2
 1   1  0.41924361  0.1554905
 3   1  0.34101785  0.3290381
14   1 -1.27793718 -0.2728525
18   1 -0.72084207  0.3307406
20   1  0.05193866 -0.4465938
27   1  1.03553670  1.7562845
29   1  2.35281628 -1.6114928
37   1  0.02295450 -0.9392724

$`2`
   key         C1         C2
 2   2 -0.2125050 -1.4696090
 4   2 -2.2716488  1.3529220
 5   2 -0.2565037 -0.1575783
 6   2 -0.4259541  0.1968482
 7   2 -1.1866331 -0.3687882
13   2  0.7102612 -1.1221971
15   2  0.1592025  0.2775758
17   2  0.3842816 -0.6379072
22   2  0.1516206 -1.0723437
24   2 -0.4070279 -0.2998299
25   2  0.7779358 -0.2862851
28   2 -0.6860207 -0.9731296
30   2 -0.1319815  1.4057571
31   2  0.2845947 -0.3443439
33   2  0.7927496  0.9126125
34   2  0.4733910 -1.6850074

$`3`
   key          C1           C2
 8   3 -0.13961645  0.204808027
 9   3  1.49999162 -0.435241747
10   3 -0.94473626  1.768523536
11   3 -1.68380914 -0.172574070
12   3  1.12455947  0.611700128
16   3  0.19875147 -1.356228028
19   3  0.78590745 -0.796733981
21   3 -1.02839096  0.254459297
23   3 -1.00747198  0.593401435
26   3  0.81168025 -1.447740656
32   3  0.74883355 -0.170628912
35   3 -0.82805688 -1.792942718
36   3 -0.68229982  0.001420327
38   3  1.11185787  1.452839232
39   3  0.01296254 -0.676003236
40   3 -0.70681824 -0.027688693
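If you want to try the split step yourself without my data, here is a minimal sketch in base R. The code that generates A, B, and C is an assumption (this post only says the values were made up), so your numbers will not match the output above; only the row counts (30, 20, 40) and the column names are taken from it.

```r
# Toy stand-ins for A, B and C. The generating code is assumed, not the original.
set.seed(1)
A <- data.frame(key = sample(1:3, 30, replace = TRUE),
                A1 = rnorm(30), A2 = rnorm(30), A3 = rnorm(30))
B <- data.frame(key = sample(1:3, 20, replace = TRUE),
                B1 = rnorm(20), B2 = rnorm(20), B3 = rnorm(20), B4 = rnorm(20))
C <- data.frame(key = sample(1:3, 40, replace = TRUE),
                C1 = rnorm(40), C2 = rnorm(40))

# split() returns a named list holding one data.frame per distinct key value.
A_wk <- split(A, A$key)
B_wk <- split(B, B$key)
C_wk <- split(C, C$key)

names(A_wk)         # the key values, as character strings
sapply(A_wk, nrow)  # how many rows fell under each key
```

The element names of each list are what get reused as keys in the next step.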
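After splitting, we call the keyval function and use the names of the list elements (extracted with the names function) as the keys. In effect this just pairs each piece of data with its key.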
> Total<- keyval(c(names(A_wk),names(B_wk),names(C_wk)),c(A_wk,B_wk,C_wk))
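To see what is being paired up here, the two arguments can be inspected in plain R before they ever reach keyval. This is only a sketch on tiny hypothetical data (two datasets, two keys); it does not call rmr2 at all.

```r
# Tiny hypothetical stand-ins for the split lists (not the data from this post).
A_wk <- split(data.frame(key = c(1, 2, 1), A1 = c(0.1, 0.2, 0.3)), c(1, 2, 1))
B_wk <- split(data.frame(key = c(2, 1), B1 = c(1.5, 2.5)), c(2, 1))

keys   <- c(names(A_wk), names(B_wk))  # one key per list element
values <- c(A_wk, B_wk)                # c() on lists concatenates them

keys            # "1" "2" "1" "2"
length(values)  # 4: the i-th value goes with the i-th key
```

The same key shows up once per dataset, and that repetition is what later brings the pieces from different datasets together in one reduce call.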
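Then feed it into mapreduce! Because keyval has already attached a key to every piece of data, no map step is needed here; the reduce step directly restores the data to the shape I want. First split the pieces by their column names (in this example the column names differ between datasets; if yours share the same names, add some other condition to tell them apart), then combine each group back into a data.frame.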
> Demo <- mapreduce(
+   input = to.dfs(Total),
+   reduce = function(k, v) {
+     test <- split(v, sapply(v, function(x) paste(colnames(x), collapse = "")))
+     test <- lapply(test, function(x) Reduce(rbind, x))
+     keyval(k, list(test[[1]], test[[2]], test[[3]]))
+   }
+ )

Now let's look at the result.
> from.dfs(Demo)
$key
[1] "1" "1" "1" "2" "2" "2" "3" "3" "3"

$val
$val[[1]]
   key         A1         A2          A3
 1   1 -1.0348640  1.8177439  0.13525756
 4   1  0.2270242  0.9087429 -0.41221232
13   1 -0.3573558 -1.2751938 -0.03458945
14   1 -0.7453312 -1.1503369  1.08577621
16   1  0.8465141  0.2366092  0.91902192
19   1  1.4896751  0.3213586 -0.98961302
23   1 -0.3143099  2.0754432 -1.29391057

$val[[2]]
   key          B1         B2          B3            B4
 5   1 -0.39066706  0.8493213  0.09075889  0.3695884990
 6   1 -0.76899300  0.7308232 -0.82277576  0.3367413230
 8   1 -1.05567857 -0.5663445  0.32075285  1.0807403069
 9   1 -0.07174419 -0.6553943 -0.30134811 -1.2155568454
13   1 -0.39760179  0.5973388 -0.43153826 -0.0003626449
14   1  0.38234556 -0.4762401  0.90686094 -3.8579677970
15   1 -0.73177601 -0.6438049 -1.52620752  0.0814186088
20   1  0.07927141 -1.7918052 -1.27799659  0.1533002628

$val[[3]]
   key          C1         C2
 1   1  0.41924361  0.1554905
 3   1  0.34101785  0.3290381
14   1 -1.27793718 -0.2728525
18   1 -0.72084207  0.3307406
20   1  0.05193866 -0.4465938
27   1  1.03553670  1.7562845
29   1  2.35281628 -1.6114928
37   1  0.02295450 -0.9392724

$val[[4]]
   key          A1          A2          A3
 3   2 -0.20348880  1.02975761  0.63051151
 7   2  1.53614888  1.96886642  0.04558348
 9   2 -0.07431669 -0.03700706 -1.47200277
10   2  0.66355253  0.07667024 -1.22673427
17   2  1.13416422 -1.69200417 -0.13861365
18   2 -0.99696590  0.24465904  0.54821302
24   2 -0.44196754 -0.28170710 -0.73912548
27   2 -1.34823336 -0.06120274  2.12261003
30   2  1.83861168 -0.02982669  0.14312250

$val[[5]]
   key          B1         B2          B3         B4
 3   2  1.46500601 -0.4572000 -1.89767865 -0.6081751
 4   2  1.96474812  0.8320135  0.50937176  0.7775585
 7   2  0.35856113  0.1048170 -0.76270331 -0.5114040
10   2  1.11808088 -0.5707235 -0.16225111 -1.0749321
12   2 -0.32032713 -0.7167343 -0.03320639  0.2495948
18   2  1.36768551  1.1874533  2.13816520  1.0105115
19   2  0.02825361  0.8781400 -1.44303311  1.2391620

$val[[6]]
   key         C1         C2
 2   2 -0.2125050 -1.4696090
 4   2 -2.2716488  1.3529220
 5   2 -0.2565037 -0.1575783
 6   2 -0.4259541  0.1968482
 7   2 -1.1866331 -0.3687882
13   2  0.7102612 -1.1221971
15   2  0.1592025  0.2775758
17   2  0.3842816 -0.6379072
22   2  0.1516206 -1.0723437
24   2 -0.4070279 -0.2998299
25   2  0.7779358 -0.2862851
28   2 -0.6860207 -0.9731296
30   2 -0.1319815  1.4057571
31   2  0.2845947 -0.3443439
33   2  0.7927496  0.9126125
34   2  0.4733910 -1.6850074

$val[[7]]
   key         A1         A2         A3
 2   3 -1.3735558 -0.2437948 -0.4509937
 5   3 -0.3290382 -0.4840644 -0.3688641
 6   3 -0.1822808 -1.1303439 -0.4175791
 8   3  1.2317304 -0.7572487 -0.4401060
11   3 -0.4940248 -0.1259619 -1.1145702
12   3 -1.4488153 -0.9855823 -0.7537385
15   3 -0.6147528  0.6804414 -0.7799006
20   3  0.5340705 -0.2427455 -1.5272875
21   3 -1.2019567 -0.1434495 -0.3046498
22   3  0.1311908 -0.4900816  0.8861471
25   3 -1.1544569 -0.1732862 -2.2312314
26   3 -1.3264688 -0.6784207  1.5171326
28   3  1.1866616 -1.9195358  0.3591871
29   3  0.7476575 -0.3390230 -1.6448516

$val[[8]]
   key           B1         B2         B3          B4
 1   3  0.363861211  1.5158812 -1.2059163  0.08659873
 2   3 -0.001122564  0.1037150  0.2580929  1.60135858
11   3  0.180619368 -0.8830636  0.7562675  1.09992035
16   3  2.644110325 -1.8546195  2.4887309 -0.03694847
17   3  1.131906794  0.9559589 -1.9111856  1.16240718

$val[[9]]
   key          C1           C2
 8   3 -0.13961645  0.204808027
 9   3  1.49999162 -0.435241747
10   3 -0.94473626  1.768523536
11   3 -1.68380914 -0.172574070
12   3  1.12455947  0.611700128
16   3  0.19875147 -1.356228028
19   3  0.78590745 -0.796733981
21   3 -1.02839096  0.254459297
23   3 -1.00747198  0.593401435
26   3  0.81168025 -1.447740656
32   3  0.74883355 -0.170628912
35   3 -0.82805688 -1.792942718
36   3 -0.68229982  0.001420327
38   3  1.11185787  1.452839232
39   3  0.01296254 -0.676003236
40   3 -0.70681824 -0.027688693

OK, exactly what I wanted!
The first three datasets belong to key "1"
The middle three datasets belong to key "2"
The last three datasets belong to key "3"
All done!
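If you do not have a cluster at hand, the logic of the reduce function can be checked in plain R first. The sketch below is a local stand-in, not rmr2 itself: the toy data, the `reducer` helper, and the hand-made grouping that imitates the shuffle are all assumptions made for illustration. It uses `do.call(rbind, ...)`, which for data.frames with identical columns should give the same result as `Reduce(rbind, ...)`.

```r
# Hypothetical toy input: two datasets with different columns, two keys.
A <- data.frame(key = c(1, 2, 1), A1 = c(0.1, 0.2, 0.3))
B <- data.frame(key = c(2, 1, 2), B1 = c(1.5, 2.5, 3.5), B2 = c(9, 8, 7))
A_wk <- split(A, A$key)
B_wk <- split(B, B$key)
keys   <- c(names(A_wk), names(B_wk))
values <- c(A_wk, B_wk)

# Same idea as the reduce function above: bucket the data.frames by their
# pasted column names, then stack each bucket into one data.frame.
reducer <- function(k, v) {
  schema <- sapply(v, function(x) paste(colnames(x), collapse = ""))
  lapply(split(v, schema), function(x) do.call(rbind, unname(x)))
}

# Stand-in for the shuffle: collect the values of each key, then reduce them.
groups <- split(unname(values), keys)
result <- Map(reducer, names(groups), groups)

names(result)         # "1" "2"
names(result[["1"]])  # "keyA1" "keyB1B2"
```

Picking the pieces by name, for example `result[["1"]]$keyA1`, is a bit safer than `test[[1]]`, because positions depend on how the pasted column names happen to sort. As far as I know rmr2 also offers `rmr.options(backend = "local")` for running a real job without Hadoop, which is worth trying before going to the cluster.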
You can do more inside reduce, such as joining the tables with one another.
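As a rough idea of what such a join could look like, here is a sketch using base R's merge on the pieces that one reduce call would hold for a single key. The data frames `a` and `b` are hypothetical, and whether a full join or a summary is the right thing depends on your problem.

```r
# Hypothetical per-key pieces, shaped like what one reduce call receives.
a <- data.frame(key = c(1, 1), A1 = c(0.1, 0.3))
b <- data.frame(key = c(1, 1, 1), B1 = c(1.5, 2.5, 3.5))

# Every row shares the same key, so joining on "key" pairs each row of a
# with each row of b.
joined <- merge(a, b, by = "key")
nrow(joined)  # 6, i.e. nrow(a) * nrow(b)

# Often a per-dataset summary is enough and avoids the blow-up in rows.
summary_row <- data.frame(key = a$key[1],
                          mean_A1 = mean(a$A1),
                          mean_B1 = mean(b$B1))
```

Keep in mind that the joined table grows with the product of the row counts, so for large groups it is usually cheaper to summarise each dataset first and combine the summaries.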