{"id":24094,"date":"2019-12-07T08:00:33","date_gmt":"2019-12-06T23:00:33","guid":{"rendered":"https:\/\/www.techscore.com\/blog\/?p=24094"},"modified":"2019-12-09T10:02:07","modified_gmt":"2019-12-09T01:02:07","slug":"parquet_examine","status":"publish","type":"post","link":"https:\/\/www.techscore.com\/blog\/2019\/12\/07\/parquet_examine\/","title":{"rendered":"Preliminary research for moving from CSV to Apache Parquet to cut AWS S3 costs"},"content":{"rendered":"

This is the Day 7 article of TECHSCORE Advent Calendar 2019<\/a>.<\/p>\n

As the name Amazon Simple Storage Service suggests, what S3 offers is very simple, yet it ends up being used for a wide variety of purposes.
\nThat variety means things can slide into chaos unless they are managed carefully. Stray data that was \"only supposed to sit there temporarily\" or that \"we meant to deal with eventually\" can become part of day-to-day operations before anyone notices, to the point where it can no longer be touched easily.<\/p>\n

Of the AWS accounts I use regularly, the one that has been in operation the longest turned out to contain some data in poor shape as well.
\nThe person who reviews whether we use AWS appropriately from a cost perspective asked me, \"Why is S3 storage usage growing so quickly?\" I could not answer on the spot, and when I investigated I found the S3 bucket responsible.<\/p>\n

Uncompressed CSV data is driving up S3 costs<\/h2>\n

I recognized this bucket. It holds results data that is exactly the \"we meant to deal with it eventually\" kind, stored uncompressed.
\nThe CSV data was first extracted because the business needed it urgently, and operation started without any compression or lifecycle configuration. Downstream processing was later built on top of it, and it became one of those stray data sets that nobody could touch anymore.
\nFor something that runs so quietly, its storage usage climbs steadily and visibly.
\n\"Eventually\" means now! Taking this as a good opportunity, I investigated how to reduce the cost.<\/p>\n

Target outcomes<\/h2>\n
    \n
  • Add file compression to the operation to reduce S3 storage charges<\/li>\n
  • Keep the effort required for the added processing as small as possible<\/li>\n
  • Keep the impact on downstream processing as small as possible<\/li>\n<\/ul>\n

    S3 is inexpensive when used appropriately (at the time of writing, December 2019, even S3 Standard storage costs 0.025 USD\/GB for the first 50 TB\/month in the Tokyo region), so the more effort the fix takes, the smaller the net saving becomes. I concluded that it is important not to lose sight of achieving this with as little time and effort as possible.<\/p>\n

    The downstream processing currently in operation accesses and aggregates the data with Apache Spark. I assume this usage will not change in the future.
    \nLooking only at how easy the compression step is, gzip seemed like a good choice. AWS Glue can process gzip files without decompressing them first, so the impact on downstream processing should also stay small.<\/p>\n
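    As a rough illustration of how small the gzip step itself is, here is a minimal Python 3 sketch using only the standard library. The helper name gzip_file and the throwaway data.csv are assumptions made for this example and are not part of the original operation.<\/p>\n

```python
import csv
import gzip
import os
import shutil
import tempfile


def gzip_file(src_path):
    # Write a gzip copy next to the source file and return its path.
    dst_path = src_path + '.gz'
    with open(src_path, 'rb') as src, gzip.open(dst_path, 'wb') as dst:
        shutil.copyfileobj(src, dst)
    return dst_path


# Build a small throwaway CSV so the sketch runs on its own (hypothetical data).
work_dir = tempfile.mkdtemp()
csv_path = os.path.join(work_dir, 'data.csv')
with open(csv_path, 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['id', 'clientKey', 'itemCode'])
    for i in range(1000):
        writer.writerow([i + 1, 'aaa', '00001'])

gz_path = gzip_file(csv_path)
csv_size = os.path.getsize(csv_path)
gz_size = os.path.getsize(gz_path)
print(csv_size, gz_size)
```

    The original file is left in place here; whether to delete it after compression is an operational decision outside this sketch.<\/p>\n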

    As another format, I also considered converting to Apache Parquet. Amazon Kinesis Data Firehose has supported this conversion since October 2018, so the effort to build it into the operation should be small.
    \nFirehose requires JSON as its input, so other formats such as CSV must be converted beforehand, but a blueprint that includes an AWS Lambda function for that conversion is publicly available. If it fits the use case, the whole thing can be done with no coding.
    \nApache ORC, which is likewise a columnar format, is also supported, but this article considers Parquet conversion only.<\/p>\n
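    To make the CSV-to-JSON step concrete, the following is a hand-written Python 3 sketch of a Firehose data-transformation Lambda handler. It is not the published blueprint. It assumes each incoming record is a single CSV line in the column order of this article's sample data, does not handle a header line, and leaves every value as a string.<\/p>\n

```python
import base64
import csv
import json

# Column order of the sample CSV used in this article.
FIELDS = ['id', 'clientKey', 'itemCode', 'itemCount', 'itemPrice', 'createdAt']


def csv_line_to_json(line):
    # Parse one CSV line and return it as a JSON object string.
    values = next(csv.reader([line]))
    return json.dumps(dict(zip(FIELDS, values)))


def lambda_handler(event, context):
    # Firehose passes base64-encoded records and expects each recordId
    # back with a result status and base64-encoded output data.
    output = []
    for record in event['records']:
        line = base64.b64decode(record['data']).decode('utf-8').strip()
        payload = csv_line_to_json(line) + chr(10)  # newline-delimited JSON
        output.append({
            'recordId': record['recordId'],
            'result': 'Ok',
            'data': base64.b64encode(payload.encode('utf-8')).decode('utf-8'),
        })
    return {'records': output}


# Local check with a hand-built event (hypothetical recordId).
sample_line = '1,aaa,00001,1,1000,2019-12-01T11:00:18.398000'
event = {'records': [{
    'recordId': 'r1',
    'data': base64.b64encode(sample_line.encode('utf-8')).decode('utf-8'),
}]}
result = lambda_handler(event, None)
print(base64.b64decode(result['records'][0]['data']).decode('utf-8'))
```

    A production handler would also need to decide what to do with malformed lines, for example by returning a failure status for them instead of raising an exception.<\/p>\n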

    Measuring size: compressed size<\/h2>\n

    In the current operation there are many CSV files, ranging from about 100 bytes to just under 100 megabytes. Testing only with large files would not reflect that, so I sized the sample data to be close to the production data.
    \nFrom each CSV file I prepared a gzip-compressed version and a Parquet version (Parquet is Snappy-compressed by default) and measured their sizes.<\/p>\n

    Sample data<\/h3>\n

    Each record is short. The file size is increased by increasing the number of records.<\/p>\n\n\n\n\n\n\n\n\n\n
    Field<\/th>\nSample value<\/th>\n<\/tr>\n
    id<\/td>\n1<\/td>\n<\/tr>\n
    clientKey<\/td>\naaa<\/td>\n<\/tr>\n
    itemCode<\/td>\n00001<\/td>\n<\/tr>\n
    itemCount<\/td>\n1<\/td>\n<\/tr>\n
    itemPrice<\/td>\n1000<\/td>\n<\/tr>\n
    createdAt<\/td>\n2019-12-01T11:00:18.398000<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n
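    The article does not show how the sample files were produced. The following Python 3 sketch is one hypothetical way to generate a header plus N records with the fields above. The value patterns (ten client keys in rotation, cycling item codes, counts and prices, one record per second) are assumptions, so the resulting sizes will not match the measurements in this article.<\/p>\n

```python
import csv
import io
from datetime import datetime, timedelta

FIELDS = ['id', 'clientKey', 'itemCode', 'itemCount', 'itemPrice', 'createdAt']
CLIENT_KEYS = ['aaa', 'bbb', 'ccc', 'ddd', 'eee',
               'fff', 'ggg', 'hhh', 'iii', 'jjj']


def build_csv(record_count):
    # Return a header line plus record_count rows as UTF-8 bytes.
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(FIELDS)
    start = datetime(2019, 12, 1, 11, 0, 18, 398000)
    for i in range(record_count):
        writer.writerow([
            i + 1,
            CLIENT_KEYS[i % len(CLIENT_KEYS)],
            '{0:05d}'.format(i % 100 + 1),
            i % 10 + 1,
            1000 + (i % 50) * 100,
            (start + timedelta(seconds=i)).isoformat(),
        ])
    return buf.getvalue().encode('utf-8')


for count in (1, 100, 1000):
    print(count, len(build_csv(count)))
```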

    Measurement results<\/h3>\n\n\n\n\n\n\n\n\n\n
    CSV record count<\/th>\nCSV (bytes)<\/th>\nParquet (bytes, % of CSV)<\/th>\ngzip (bytes, % of CSV)<\/th>\n<\/tr>\n
    Header + 1 record<\/td>\n116<\/td>\n4592<\/td>\n3958.62%<\/td>\n136<\/td>\n117.24%<\/td>\n<\/tr>\n
    Header + 100 records<\/td>\n6059<\/td>\n6178<\/td>\n101.96%<\/td>\n852<\/td>\n14.06%<\/td>\n<\/tr>\n
    Header + 1,000 records<\/td>\n61827<\/td>\n24206<\/td>\n39.15%<\/td>\n9170<\/td>\n14.83%<\/td>\n<\/tr>\n
    Header + 10,000 records<\/td>\n627721<\/td>\n78872<\/td>\n12.56%<\/td>\n89030<\/td>\n14.18%<\/td>\n<\/tr>\n
    Header + 100,000 records<\/td>\n6276697<\/td>\n230500<\/td>\n3.67%<\/td>\n889481<\/td>\n14.17%<\/td>\n<\/tr>\n
    Header + 1,000,000 records<\/td>\n62766465<\/td>\n1741330<\/td>\n2.77%<\/td>\n8893356<\/td>\n14.17%<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n

    The table lists each file size in bytes, together with that size as a percentage of the CSV file size.
    \ngzip levels off early at about 14% of the CSV size, whereas Parquet keeps shrinking relative to the CSV as the record count grows, overtaking gzip at 10,000 records and reaching 2.77% at 1,000,000 records. With even more records it looks as though Parquet could reach about one tenth of the gzip size and one hundredth of the original CSV size. However, the results also show that there is no benefit unless the file is reasonably large: with a single record the Parquet file is about 40 times the size of the CSV.<\/p>\n
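    To put the ratios into monetary terms, here is a small worked calculation in Python 3. It uses the 0.025 USD\/GB price quoted earlier and the measured sizes for header + 1,000,000 records. The 1,000 GB of stored CSV is a hypothetical figure chosen for illustration, and the sketch assumes every file compresses as well as the largest sample, which the table shows is not true for small files.<\/p>\n

```python
# USD per GB-month, S3 Standard in the Tokyo region (figure quoted in the article).
PRICE_PER_GB = 0.025

# Measured sizes in bytes for header + 1,000,000 records (from the table).
SIZES = {'csv': 62766465, 'gzip': 8893356, 'parquet': 1741330}

# Hypothetical amount of uncompressed CSV in GB; not a figure from the article.
CSV_GB = 1000.0


def monthly_cost(fmt):
    # Scale the CSV volume by the measured size ratio, then apply the price.
    ratio = SIZES[fmt] / SIZES['csv']
    return CSV_GB * ratio * PRICE_PER_GB


for fmt in ('csv', 'gzip', 'parquet'):
    print(fmt, round(monthly_cost(fmt), 2))
```

    Under these assumptions the monthly charge falls from 25 USD to roughly 3.5 USD with gzip and roughly 0.7 USD with Parquet, which illustrates why the effort spent on the fix has to stay small for the saving to be worthwhile.<\/p>\n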

    Benchmark: Spark reads and other operations<\/h2>\n

    I measured the elapsed time as the difference between timestamps taken before and after each operation. (The timing code is omitted from the code examples below.)
    \nThe Python and Spark versions of the test environment are as follows.<\/p>\n

    $ python --version\nPython 2.7.5\n\n$ pyspark --version\nWelcome to\n      ____              __\n     \/ __\/__  ___ _____\/ \/__\n    _\\ \\\/ _ \\\/ _ `\/ __\/  '_\/\n   \/___\/ .__\/\\_,_\/_\/ \/_\/\\_\\   version 2.4.4\n      \/_\/\n\nUsing Scala version 2.11.12, Java HotSpot(TM) 64-Bit Server VM, 1.8.0_171\nBranch\nCompiled by user  on 2019-08-27T21:21:38Z\nRevision\nUrl\nType --help for more information.\n<\/code><\/pre>\n
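    Since the timing code is omitted from the examples, here is one hypothetical way to take the before-and-after difference in Python 3. The timed helper and the results dictionary are names invented for this sketch, and the summation merely stands in for a Spark operation.<\/p>\n

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results):
    # Store the wall-clock seconds spent in the enclosed block under label.
    start = time.time()
    try:
        yield
    finally:
        results[label] = time.time() - start


results = {}
with timed('sum', results):
    total = sum(range(1000000))  # stand-in workload
print('{0}: {1:.3f} s'.format('sum', results['sum']))
```

    Spark transformations are evaluated lazily, so when timing Spark code the enclosed block should contain the action that triggers execution, such as show().<\/p>\n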

    Loading data<\/h3>\n

    I loaded the data into a DataFrame with Python code like the following.<\/p>\n

    #!\/usr\/bin\/env python\n# -*- coding: utf-8 -*-\n\nfrom pyspark.context import SparkContext\nfrom pyspark.sql import SQLContext\n\nsc = SparkContext()\nsqlContext = SQLContext(sc)\n\n# For CSV\ndf = sqlContext.read.format(\"com.databricks.spark.csv\").option(\"header\", \"true\").option(\"inferSchema\", \"true\").load(\".\/data\/csv\/data.csv\")\n\n# For Parquet (a single Parquet file under the directory)\ndf = sqlContext.read.parquet(\".\/data\/parquet\/\")\n<\/code><\/pre>\n\n\n\n\n\n
    Record count<\/th>\nCSV<\/th>\nParquet<\/th>\n<\/tr>\n
    Header + 100,000 records<\/td>\n4.958 s<\/td>\n1.570 s<\/td>\n<\/tr>\n
    Header + 1,000,000 records<\/td>\n7.095 s<\/td>\n2.857 s<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n

    For loading into a DataFrame, Parquet won by a wide margin.
    \nI did not test small record counts, because I do not expect realistic operation to load files holding only one or two records. My guess is that CSV and Parquet start out with little difference and that the gap widens as the record count grows.<\/p>\n

    Processing speed is not an important requirement in the current operation, but a severe slowdown would be a problem.
    \nI therefore also measured simple aggregations and a JOIN.
    \nTo state the conclusion first: at the record counts I prepared, there is almost no difference.
    \nTo be safe I also compared the DataFrame API with SQL, and again found almost no difference.<\/p>\n

    GroupBy + Count<\/h3>\n
    # With the DataFrame API\ndf.groupBy(\"clientKey\").count().sort(\"clientKey\").show()\n\n# With SQL\ndf.registerTempTable(\"sample_data\")\nsqlContext.sql(\"SELECT clientKey, COUNT(*) FROM sample_data GROUP BY clientKey ORDER BY clientKey\").show()\n\n# Result (SQL version)\n+----------+--------+\n| clientKey|count(1)|\n+----------+--------+\n|       aaa|  100000|\n|       bbb|  100000|\n|       ccc|  100000|\n|       ddd|  100000|\n|       eee|  100000|\n|       fff|  100000|\n|       ggg|  100000|\n|       hhh|  100000|\n|       iii|  100000|\n|       jjj|  100000|\n+----------+--------+\n<\/code><\/pre>\n\n\n\n\n\n\n
    Record count<\/th>\nCSV<\/th>\nParquet<\/th>\n<\/tr>\n
    DataFrame<\/th>\nSQL<\/th>\nDataFrame<\/th>\nSQL<\/th>\n<\/tr>\n
    Header + 100,000 records<\/td>\n0.463 s<\/td>\n0.587 s<\/td>\n0.452 s<\/td>\n0.562 s<\/td>\n<\/tr>\n
    Header + 1,000,000 records<\/td>\n0.645 s<\/td>\n0.707 s<\/td>\n0.423 s<\/td>\n0.555 s<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n

    Multiplication + GroupBy<\/h3>\n
    # With the DataFrame API\ndf.select(\"clientKey\", df.itemCount * df.itemPrice).groupBy(\"clientKey\").sum().sort(\"clientKey\").show()\n\n# With SQL\nsqlContext.sql(\"SELECT clientKey, SUM(itemCount * itemPrice) FROM sample_data GROUP BY clientKey ORDER BY clientKey\").show()\n\n# Result\n+----------+----------------------------+\n| clientKey|sum((itemCount * itemPrice))|\n+----------+----------------------------+\n|       aaa|                 71923762200|\n|       bbb|                 71923762200|\n|       ccc|                 71923762200|\n|       ddd|                 71923762200|\n|       eee|                 71923762200|\n|       fff|                 71923762200|\n|       ggg|                 71923762200|\n|       hhh|                 71923762200|\n|       iii|                 71923762200|\n|       jjj|                 71923762200|\n+----------+----------------------------+\n<\/code><\/pre>\n\n\n\n\n\n\n
    Record count<\/th>\nCSV<\/th>\nParquet<\/th>\n<\/tr>\n
    DataFrame<\/th>\nSQL<\/th>\nDataFrame<\/th>\nSQL<\/th>\n<\/tr>\n
    Header + 100,000 records<\/td>\n0.656 s<\/td>\n0.611 s<\/td>\n0.504 s<\/td>\n0.477 s<\/td>\n<\/tr>\n
    Header + 1,000,000 records<\/td>\n0.883 s<\/td>\n0.893 s<\/td>\n0.620 s<\/td>\n0.617 s<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n

    JOIN + Multiplication + GroupBy<\/h3>\n
    # With the DataFrame API\ndf_name = sqlContext.read.format(\"com.databricks.spark.csv\").option(\"header\", \"true\").option(\"inferSchema\", \"true\").load(\".\/data\/csv\/name.csv\")\ndf.join(df_name, df.clientKey == df_name.clientKey, \"inner\").select(\"clientName\", df.itemCount * df.itemPrice).groupBy(\"clientName\").sum().sort(\"clientName\").show()\n\n# With SQL\ndf_name.registerTempTable(\"sample_data_name\")\nsqlContext.sql(\"\"\"\n        SELECT b.clientName, \n               SUM(a.itemCount * a.itemPrice) \n          FROM sample_data as a \n          JOIN sample_data_name as b \n            ON a.clientKey = b.clientKey \n      GROUP BY b.clientName\n      ORDER BY b.clientName\n\"\"\").show()\n\n# Result (each clientName value is the Japanese word for \"client\" followed by the key)\n+---------------+----------------------------+\n|     clientName|sum((itemCount * itemPrice))|\n+---------------+----------------------------+\n|\u30af\u30e9\u30a4\u30a2\u30f3\u30c8aaa|                 71923762200|\n|\u30af\u30e9\u30a4\u30a2\u30f3\u30c8bbb|                 71923762200|\n|\u30af\u30e9\u30a4\u30a2\u30f3\u30c8ccc|                 71923762200|\n|\u30af\u30e9\u30a4\u30a2\u30f3\u30c8ddd|                 71923762200|\n|\u30af\u30e9\u30a4\u30a2\u30f3\u30c8eee|                 71923762200|\n|\u30af\u30e9\u30a4\u30a2\u30f3\u30c8fff|                 71923762200|\n|\u30af\u30e9\u30a4\u30a2\u30f3\u30c8ggg|                 71923762200|\n|\u30af\u30e9\u30a4\u30a2\u30f3\u30c8hhh|                 71923762200|\n|\u30af\u30e9\u30a4\u30a2\u30f3\u30c8iii|                 71923762200|\n|\u30af\u30e9\u30a4\u30a2\u30f3\u30c8jjj|                 71923762200|\n+---------------+----------------------------+\n<\/code><\/pre>\n\n\n\n\n\n\n
    Record count<\/th>\nCSV<\/th>\nParquet<\/th>\n<\/tr>\n
    DataFrame<\/th>\nSQL<\/th>\nDataFrame<\/th>\nSQL<\/th>\n<\/tr>\n
    Header + 100,000 records<\/td>\n0.741 s<\/td>\n0.759 s<\/td>\n0.832 s<\/td>\n0.834 s<\/td>\n<\/tr>\n
    Header + 1,000,000 records<\/td>\n1.109 s<\/td>\n1.107 s<\/td>\n1.123 s<\/td>\n1.056 s<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n

    Summary<\/h2>\n

    For the uncompressed CSV files that are driving up S3 costs in our account, I found the following measures to be effective ways of reducing the cost.<\/p>\n