Tuesday, September 30, 2014

Spark vs Hadoop MapReduce on YARN (Simple Grep Example)

Environment:
-HARDWARE:
Command:
-cat /proc/cpuinfo # show CPU information
-cat /proc/meminfo # show memory information
-sudo smartctl -i /dev/sda # show hard disk model and specs; install with: apt-get install smartmontools


HOSTNAME        IPADDRESS      CPU   CORE   MEM     DISK    OS
----------------------------------------------------------------------------------------------------------------
master          192.168.0.7    4     8      3.5GB   500GB   Ubuntu 14.04.1 LTS
regionserver2   192.168.0.23   2     4      3.5GB   500GB   Ubuntu 14.04.1 LTS

-SOFTWARE:
-Hadoop 2.4.1
-Spark 1.0.2
-Scala 2.10.4
-Java 1.7.0_65

Test Info:
-INPUT
Total Size: 2.8GB
INFO: Red Hat Linux / Fedora, Snort NIDS, and iptables firewall log files (2006-allog.1 ~ 2006-allog.9)
Date Collected: Sep - Dec 2006
DOWNLOAD: http://log-sharing.dreamhosters.com/   (Bundle 5)

* put data into HDFS
$hdfs dfs -put DATA_DIR/DATA_FOLDER /user/hduser/LogFile
$hdfs dfs -du /user/hduser  # Get the size of "LogFile" folder
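
* (Optional) verify the upload before running the jobs; the path below simply mirrors the put command above:
$hdfs dfs -ls /user/hduser/LogFile  # should list the nine files 2006-allog.1 ~ 2006-allog.9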

-Example: GREP (count "Dec" in the log file)
Using Spark:
$spark-shell --master yarn-client
scala> val textFile = sc.textFile("/user/hduser/LogFile/2006-allog.1")
scala> textFile.filter(line => line.contains("Dec")).count()
scala> exit
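
# The 2.8GB row in the results below covers the whole LogFile folder. sc.textFile also accepts a
# directory path, so the same count can be run over every file at once; a minimal sketch of that
# variant (the variable name is just illustrative):
scala> val allLogs = sc.textFile("/user/hduser/LogFile")
scala> allLogs.filter(line => line.contains("Dec")).count()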

Using Hadoop:
$hadoop jar /usr/local/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.4.1.jar grep LogFile/2006-allog.1 OutPut "Dec"

# The grep example actually runs two MapReduce jobs, Grep and Sort; only the running time of the Grep job is measured here.
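# Likewise, the 2.8GB case can be run on the MapReduce side by pointing the same jar at the whole
# LogFile folder; the output directory name below is only an example and must not already exist in HDFS:
$hadoop jar /usr/local/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.4.1.jar grep LogFile OutPutAll "Dec"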

Result

Data Size              Spark      Hadoop
--------------------------------------------------------------
119MB (2006-allog.1)   3.8 sec    21 sec
686MB (2006-allog.9)   5.8 sec    60 sec
2.8GB (LogFile)        25.3 sec   194 sec

Across all three input sizes, Spark finished the grep roughly 5x to 10x faster than the Hadoop MapReduce example.
