Hadoop Streaming is a utility that ships with Hadoop and lets users write MapReduce programs in any language. This article walks through several Hadoop Streaming examples; the points worth studying are:
(1) For a given language, how to write the Mapper and Reducer, and what conventions they must follow
(2) How to define custom Hadoop Counters in Hadoop Streaming
(3) How to report custom status messages in Hadoop Streaming, so that users can see the current progress of a job
(4) How to print debug logs in Hadoop Streaming, and where to find them
(5) How to use Hadoop Streaming to process binary files, not just text files
This article focuses on the first four questions and provides WordCount examples written in C++ and Shell for reference.
1. WordCount in C++
(1) Mapper implementation (mapper.cpp)
#include <iostream>
#include <string>
using namespace std;

int main() {
    string key;
    while (cin >> key) {
        cout << key << "\t" << "1" << endl;
        // Increment the counter named counter_no in group counter_group
        cerr << "reporter:counter:counter_group,counter_no,1\n";
        // Display status
        cerr << "reporter:status:processing......\n";
        // Print a log line for testing
        cerr << "This is log, will be printed in stderr file\n";
    }
    return 0;
}
(2) Reducer implementation (reducer.cpp)
#include <iostream>
#include <string>
using namespace std;

int main() { // The reducer is wrapped into a standalone process, so it needs a main function
    string cur_key, last_key, value;
    cin >> cur_key >> value;
    last_key = cur_key;
    int n = 1;
    while (cin >> cur_key) { // Read the map tasks' output
        cin >> value;
        if (last_key != cur_key) { // A new key begins
            cout << last_key << "\t" << n << endl;
            last_key = cur_key;
            n = 1;
        } else { // Same key: accumulate the value count
            n++;
        }
    }
    cout << last_key << "\t" << n << endl;
    return 0;
}
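Since Streaming hands the reducer its input already sorted by key, the sequential key comparison above is all that is needed. As a quick sanity check of the key-boundary logic, you can feed the reducer a few pre-sorted pairs by hand once it has been compiled as shown below (the input lines here are made up):
printf 'dong\t1\ndong\t1\nhere\t1\n' | ./reducer
# expected output:
# dong    2
# here    1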
(3) Compile and run
Compile the two programs:
g++ -o mapper mapper.cpp
g++ -o reducer reducer.cpp
Test them locally:
echo "dong xicheng is here now, talk to dong xicheng now" | ./mapper | sort | ./reducer
Note: when tested this way, the following strings are printed over and over. Hadoop recognizes them, so you can comment them out for now (or discard stderr, as shown below):
reporter:counter:counter_group,counter_no,1
reporter:status:processing......
This is log, will be printed in stderr file
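Alternatively, you can keep the reporter lines in place and simply discard stderr during the local test:
echo "dong xicheng is here now, talk to dong xicheng now" | ./mapper 2>/dev/null | sort | ./reducer 2>/dev/null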
Once the local test passes, the job can be submitted to the cluster with the following script (run_cpp_mr.sh):
#!/bin/bash
HADOOP_HOME=/opt/yarn-client
INPUT_PATH=/test/input
OUTPUT_PATH=/test/output

echo "Clearing output path: $OUTPUT_PATH"
$HADOOP_HOME/bin/hadoop fs -rmr $OUTPUT_PATH

${HADOOP_HOME}/bin/hadoop jar \
    ${HADOOP_HOME}/share/hadoop/tools/lib/hadoop-streaming-2.2.0.jar \
    -files mapper,reducer \
    -input $INPUT_PATH \
    -output $OUTPUT_PATH \
    -mapper mapper \
    -reducer reducer
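To run the job end to end, assuming $INPUT_PATH already holds some text files on HDFS (the paths below match the script above):
sh run_cpp_mr.sh
/opt/yarn-client/bin/hadoop fs -cat /test/output/part-*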
2. WordCount in Shell
(1) Mapper implementation (mapper.sh)
#!/bin/bash
while read LINE; do
    for word in $LINE
    do
        echo "$word 1"
        # In Streaming, a counter is defined by writing
        #   reporter:counter:<group>,<counter>,<amount>
        # to stderr. Here we define a counter named counter_no
        # in group counter_group and increase it by 1.
        echo "reporter:counter:counter_group,counter_no,1" >&2
        # Report status
        echo "reporter:status:processing......" >&2
        echo "This is log for testing, will be printed in stderr file" >&2
    done
done
(2) Reducer implementation (reducer.sh)
#!/bin/bash
count=0
started=0
word=""
while read LINE; do
    newword=`echo $LINE | cut -d ' ' -f 1`
    if [ "$word" != "$newword" ]; then
        [ $started -ne 0 ] && echo -e "$word\t$count"
        word=$newword
        count=1
        started=1
    else
        count=$(( $count + 1 ))
    fi
done
echo -e "$word\t$count"
(3) Test and run
Test the two scripts:
echo "dong xicheng is here now, talk to dong xicheng now" | sh mapper.sh | sort | sh reducer.sh
Note: as with the C++ version, this test repeatedly prints the following strings, which Hadoop recognizes; comment them out for now, or redirect stderr to /dev/null as shown earlier:
reporter:counter:counter_group,counter_no,1
reporter:status:processing......
This is log for testing, will be printed in stderr file
Once the test passes, the job can be submitted to the cluster with the following script (run_shell_mr.sh):
#!/bin/bash
HADOOP_HOME=/opt/yarn-client
INPUT_PATH=/test/input
OUTPUT_PATH=/test/output

echo "Clearing output path: $OUTPUT_PATH"
$HADOOP_HOME/bin/hadoop fs -rmr $OUTPUT_PATH

${HADOOP_HOME}/bin/hadoop jar \
    ${HADOOP_HOME}/share/hadoop/tools/lib/hadoop-streaming-2.2.0.jar \
    -files mapper.sh,reducer.sh \
    -input $INPUT_PATH \
    -output $OUTPUT_PATH \
    -mapper "sh mapper.sh" \
    -reducer "sh reducer.sh"
3. Notes on the programs
In Hadoop Streaming, standard input, standard output, and standard error each serve a purpose: stdin and stdout carry the input data and the processing results respectively, while the meaning of anything written to stderr depends on its content:
(1) If a stderr line has the form reporter:counter:group,counter,amount, it increases the value of the Hadoop counter named counter in group group by amount. The first time Hadoop reads a given counter it creates it; afterwards it looks the counter up in its counter table and adds amount to its value.
(2) If a stderr line has the form reporter:status:message, the message is displayed in the web UI or terminal as the task's status; it can carry progress hints for the user.
(3) Any other stderr output is treated as debug logging, and Hadoop redirects it to the task's stderr file. Note: each task has three log files, stdout, stderr, and syslog, all plain text; they can be viewed in the web UI, or by logging into the node where the task ran and looking in the corresponding directory, as sketched below.
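On a YARN cluster with log aggregation enabled, the logs of a finished job can also be pulled from the command line; the application ID below is a made-up placeholder (the real one appears in the job's console output and in the web UI):
yarn logs -applicationId application_1400000000000_0001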
One more thing to note: by default, a map task's output key and value are separated by a tab character (\t). Hadoop splits each line into key and value on the tab and sorts by key in both the Map and Reduce stages, which is very important to keep in mind. You can, of course, specify a different separator with stream.map.output.field.separator.
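As a sketch of how that option might be passed to the C++ job above (the comma separator is just an illustration; -D generic options must come before the streaming-specific options):
${HADOOP_HOME}/bin/hadoop jar \
    ${HADOOP_HOME}/share/hadoop/tools/lib/hadoop-streaming-2.2.0.jar \
    -D stream.map.output.field.separator=, \
    -files mapper,reducer \
    -input $INPUT_PATH \
    -output $OUTPUT_PATH \
    -mapper mapper \
    -reducer reducer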