How to Benchmark a Hadoop Cluster
TestDFSIO tests the I/O performance of HDFS. It does this by using a MapReduce job as a convenient way to read or write files in parallel. The following command writes 10 files of 1,000 MB each:
% hadoop jar $HADOOP_INSTALL/hadoop-*-test.jar TestDFSIO -write -nrFiles 10 -fileSize 1000
At the end of the run, the results are written to the console and also recorded in a local file (which is appended to, so you can rerun the benchmark and not lose old results):
% cat TestDFSIO_results.log
----- TestDFSIO ----- : write
           Date & time: Sun Apr 12 07:14:09 EDT 2009
       Number of files: 10
Total MBytes processed: 10000
     Throughput mb/sec: 7.796340865378244
Average IO rate mb/sec: 7.8862199783325195
 IO rate std deviation: 0.9101254683525547
    Test exec time sec: 163.387
The files are written under the /benchmarks/TestDFSIO directory by default (this can be changed by setting the test.build.data system property), in a directory called io_data.
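You can inspect the generated files with the HDFS shell. The first command below is standard; the second is only a sketch of one way to relocate the output, assuming the client JVM's system properties are where TestDFSIO picks up test.build.data (the exact mechanism can vary between Hadoop versions, so verify against yours):

% hadoop fs -ls /benchmarks/TestDFSIO/io_data
% HADOOP_OPTS="-Dtest.build.data=/benchmarks/MyTestDFSIO" hadoop jar \
  $HADOOP_INSTALL/hadoop-*-test.jar TestDFSIO -write -nrFiles 10 -fileSize 1000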
To run a read benchmark, use the -read argument. Note that these files must already exist (having been written by TestDFSIO -write):
% hadoop jar $HADOOP_INSTALL/hadoop-*-test.jar TestDFSIO -read -nrFiles 10 -fileSize 1000
Here are the results for a real run:
----- TestDFSIO ----- : read
           Date & time: Sun Apr 12 07:24:28 EDT 2009
       Number of files: 10
Total MBytes processed: 10000
     Throughput mb/sec: 80.25553361904304
Average IO rate mb/sec: 98.6801528930664
 IO rate std deviation: 36.63507598174921
    Test exec time sec: 47.624
When you’ve finished benchmarking, you can delete all the generated files from HDFS using the -clean argument:
% hadoop jar $HADOOP_INSTALL/hadoop-*-test.jar TestDFSIO -clean
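To confirm the cleanup, list the benchmark directory; the TestDFSIO subdirectory should be gone:

% hadoop fs -ls /benchmarks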
Hadoop comes with a MapReduce program that does a partial sort of its input. It is very useful for benchmarking the whole MapReduce system, as the full input dataset is transferred through the shuffle. The three steps are: generate some random data, perform the sort, then validate the results.
First we generate some random data using RandomWriter. It runs a MapReduce job with 10 maps per node, and each map generates (approximately) 10 GB of random binary data, with keys and values of various sizes. You can change these values if you like by setting the properties test.randomwriter.maps_per_host and test.randomwrite.bytes_per_map. There are also settings for the size ranges of the keys and values; see RandomWriter for details.
Here’s how to invoke RandomWriter (found in the example JAR file, not the test one) to write its output to a directory called random-data:
% hadoop jar $HADOOP_INSTALL/hadoop-*-examples.jar randomwriter random-data
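If you want a faster, smaller run while iterating, you can shrink the data size by overriding the property mentioned above. This sketch assumes RandomWriter is run via ToolRunner, so the generic -D option is honored (true for the stock example driver, but worth verifying on your version); 1073741824 bytes is 1 GB per map instead of the 10 GB default:

% hadoop jar $HADOOP_INSTALL/hadoop-*-examples.jar randomwriter \
  -Dtest.randomwrite.bytes_per_map=1073741824 random-data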
Next we can run the Sort program:
% hadoop jar $HADOOP_INSTALL/hadoop-*-examples.jar sort random-data sorted-data
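The Sort example also accepts an option for the number of reduce tasks, which is useful for seeing how reduce parallelism affects the overall run time. The -r option here is taken from the example's usage message; treat it as an assumption to verify against your release:

% hadoop jar $HADOOP_INSTALL/hadoop-*-examples.jar sort -r 40 random-data sorted-data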
The overall execution time of the sort is the metric we are interested in, but it’s instructive to watch the job’s progress via the web UI (http://jobtracker-host:50030/), where you can get a feel for how long each phase of the job takes.
As a final sanity check, we validate that the data in sorted-data is, in fact, correctly sorted:
% hadoop jar $HADOOP_INSTALL/hadoop-*-test.jar testmapredsort -sortInput random-data \
  -sortOutput sorted-data
This command runs the SortValidator program, which performs a series of checks on the unsorted and sorted data to check whether the sort is accurate. It reports the outcome to the console at the end of its run:
SUCCESS! Validated the MapReduce framework's 'sort' successfully.
There are many more Hadoop benchmarks, but the following are widely used:
MRBench (invoked with mrbench) runs a small job a number of times. It acts as a good counterpoint to sort, as it checks whether small job runs are responsive (see the sample invocations after this list).
NNBench (invoked with nnbench) is useful for load testing namenode hardware.
Gridmix is a suite of benchmarks designed to model a realistic cluster workload, by mimicking a variety of data-access patterns seen in practice. See src/benchmarks/gridmix2 in the distribution for further details.[63]
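For reference, here are representative invocations of the first two. The option names (-numRuns, -operation, -maps, -numberOfFiles, -baseDir) are taken from the tools' usage messages in the 0.20-era test JAR and may differ in other releases:

% hadoop jar $HADOOP_INSTALL/hadoop-*-test.jar mrbench -numRuns 50
% hadoop jar $HADOOP_INSTALL/hadoop-*-test.jar nnbench -operation create_write \
  -maps 10 -numberOfFiles 1000 -baseDir /benchmarks/NNBench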