Commit a2e6b81c authored by jiongzhu's avatar jiongzhu

Update readme, add numReduceTasks and remove unnecessary files.

parent 3d0db868
# EECS 598-008 WN 2019 Advanced Data Mining - Hadoop Examples
This project will give an example of a simple Hadoop project and will include instructions on how to run it.
## How to run this example code
### Login to Flux Hadoop Server
First, log in to the server:

```shell
ssh <uniqname>
```
If you have deployed Hadoop locally and you want to utilize it, you may skip this step.
### Clone this repository
To clone the repo to your current working directory:

```shell
git clone
```

If you have added your SSH pubkey to EECS GitLab, you can also use

```shell
git clone
```
### Submit WordCount Job to Yarn
Submit the WordCount job to Yarn:

```shell
WordCount/ [queuename]
```
The `[queuename]` argument is optional; it specifies which queue you would like to submit your job to. By default, the job is submitted to the `default` queue. If you want to submit to the special course queue, use `eecs598w19`.
### Check your results
By default, the results are written to HDFS under the path `hadoop_examples/wordcount/output` in your HDFS home folder. To print the output directly (not recommended if your result is large), use

```shell
hdfs dfs -cat hadoop_examples/wordcount/output/*
```

To download the results to the local filesystem, use

```shell
hdfs dfs -get hadoop_examples/wordcount/output/ <path_to_a_local_folder>
```
Now you should be able to access your results locally!
## HDFS Commands Cheatsheet
[Here]( is a comprehensive cheatsheet for transferring files between HDFS and your local filesystem and for managing your files on HDFS.
(Note that the description of the `hdfs dfs -get /hadoop/*.txt /home/ubuntu/` command in that cheatsheet is incorrect: the copy direction is reversed. `-get` copies *from* HDFS *to* the local filesystem.)
**Be cautious if you want to try any administration commands.** Some of them are dangerous: you may break the system or lose all your data in HDFS if you run them without understanding what you are doing.
## Hadoop Streaming Commands Breakdown
To make things more intuitive, I will first break down the command we use here:
```shell
yarn jar "$dirname/hadoop-streaming.jar" \
    -input hadoop_examples/wordcount/input \
    -output hadoop_examples/wordcount/output \
    -mapper \
    -reducer \
    -file \
    -file \
    -numReduceTasks 10 \
```
- `yarn jar`: submits a task (`hadoop-streaming` in our case) written in Java to Yarn, Hadoop's distributed task scheduling system.
- `-input`: the path to the *raw input data* **in HDFS**. The data should be splittable by lines.
- `-output`: the path **in HDFS** to which `hadoop-streaming` writes the *results*. **This path must not exist when you submit the job**: if something were already there, the program could not know whether it is safe to overwrite it.
- `-mapper`: a script (or other executable) that reads (a portion of) the *raw data* from `stdin` and writes intermediate key-value pairs to `stdout`.
- `-reducer`: a script (or other executable) that reads **all intermediate key-value pairs for the same key** from `stdin` and writes the final result for that key to `stdout`.
- `-file`: uploads your script to HDFS temporarily so that Hadoop can run it in a distributed fashion. Hadoop can only access files in HDFS, so if you do not list your scripts here, Hadoop will not be able to run them.
- `-numReduceTasks`: the number of reducers to use. If it is too low, your job may run very slowly on large datasets; if you set it too aggressively, your job may be killed for consuming too many resources. My suggestion is to keep this value between 3 and 10.
- `-jobconf`: modifies a job configuration setting. Here we use it to specify the queue we would like to use. You can omit this option, which submits your job to the `default` queue.
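Because Hadoop Streaming only talks to the mapper and reducer through `stdin`/`stdout`, the whole pipeline can be dry-run locally before submitting to Yarn. The one-liners below are hypothetical stand-ins for the elided mapper and reducer scripts, and `sort` approximates Hadoop's shuffle phase:

```shell
# Local dry run of the map -> shuffle -> reduce pipeline.
# The tr/awk one-liners are stand-ins for the actual mapper/reducer scripts.
printf 'hello world\nhello hadoop\n' |
  tr -s ' ' '\n' |           # split each input line into one token per line
  awk '{print $1 "\t1"}' |   # "mapper": emit one "word<TAB>1" pair per token
  sort |                     # "shuffle": bring identical keys together
  awk -F'\t' '{count[$1] += $2} END {for (w in count) print w "\t" count[w]}'  # "reducer"
```

If a previous run left results behind, remember that `-output` must not already exist; `hdfs dfs -rm -r hadoop_examples/wordcount/output` clears it before resubmitting.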
[Here]( is an official guide for Hadoop Streaming. Note that this example adopts the usage specified in Section 3.2. Also, `hadoop-streaming.jar` is not always accessible at `$HADOOP_HOME/hadoop-streaming.jar`, so I suggest you keep `hadoop-streaming.jar` next to your main script, as in this WordCount example.
## FAQ
### Comparing different filesystems
You may encounter three different filesystems when using Flux Hadoop server:
- **Local filesystem** on **local machine**: this is the local filesystem on the computer you are currently using.
- **Local filesystem** on **Flux Hadoop** server: this is the filesystem you encounter when you log in through `ssh <uniqname>`. However, **this is still a local filesystem**: when you `ssh`, you are just connecting to the terminal of another machine, and all the files there live on the **local filesystem of that machine**.
- **HDFS** on **Flux Hadoop** server: this is **the only filesystem that Hadoop can utilize** for running distributed computing tasks, and it is distinct from the two above.

Therefore, if you are using the Flux Hadoop server and write your code on your own computer (without `ssh`), you may need to first upload your script or data from the *local filesystem on your local machine* to the *local filesystem on Flux Hadoop*, and then from the *local filesystem on Flux Hadoop* to *HDFS*.
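The two-hop path above can be sketched as follows. This is only an illustration: the server address after `@` is a placeholder (like the elided address in the `ssh` command earlier), and the file names are hypothetical.

```shell
# Hop 1: local filesystem on your machine -> local filesystem on Flux Hadoop
# (<uniqname> and <server_address> are placeholders)
scp my_input.txt <uniqname>@<server_address>:~/

# Hop 2 (run after ssh-ing into the server):
# local filesystem on Flux Hadoop -> HDFS
hdfs dfs -put ~/my_input.txt hadoop_examples/wordcount/input/
```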