Tuesday, September 30, 2014

Hadoop-2.4.1 Example(WordCount) on Eclipse

[Software]
    Hadoop2.4.1
    Eclipse IDE for Java Developers Luna Release (4.4.0)


1. Create a new Map/Reduce Project

2. Add library jars:
- Right-click the project > Build Path > Configure Build Path > Java Build Path > Libraries
 > Add External JARs  (include the jars in the following directories):
  - share/hadoop/common
  - share/hadoop/common/lib
  - share/hadoop/mapreduce
  - share/hadoop/mapreduce/lib
  - share/hadoop/yarn
  - share/hadoop/yarn/lib
  --------additional-----------
  - HDFS lib
  - HBase lib

3. On this project, add new:
- Mapper: Mp.java

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class Mp extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable ikey, Text ivalue, Context context)
            throws IOException, InterruptedException {
        String line = ivalue.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            context.write(word, one);
        }
    }
}


- Reducer: Rd.java

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class Rd extends Reducer<Text, IntWritable, Text, IntWritable> {

    public void reduce(Text _key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // process values
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        context.write(_key, new IntWritable(sum));
    }
}

- MapReduce Driver: WC.java

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WC {

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        @SuppressWarnings("deprecation")
        Job job = new Job(conf, "wordcount");
        //Job job = Job.getInstance(conf, "wordcount");
        job.setJarByClass(WC.class);

        // TODO: specify a mapper
        job.setMapperClass(Mp.class);
        // TODO: specify a reducer
        job.setReducerClass(Rd.class);

        // TODO: specify output types
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // TODO: specify input and output DIRECTORIES (not files)
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        if (!job.waitForCompletion(true))
            return;
    }
}

4. Create jar file
- File > Export > JAR file
- Select the resources and the JAR file location

5. Run application
- Seclect "Run Configurations" >  Check "Java Application", "Name", "Project", "Mainclass"
- Enter "Arguments" > add "file1 Output" in Program arguments
 [備註] 因為main裡面沒有指定 input output, 所以這邊必須設定給app,
  相當於用terminal 執行 $ hadoop jar project.jar file1 Output1  ,
  如果不加路徑,預設input 及output位置在本機的 $ECLIPSE_WORKSPACE/PROJECT_FOLDER

- Click "Run"

[Question] How do I run the application on an existing Hadoop cluster instead of locally?

===> 2014/09/18: In testing so far, exporting the jar file and running it on the master works fine.

[Solution]
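A commonly used approach (untested in these notes) is to point the job's Configuration at the cluster, so that a run from Eclipse submits to HDFS and YARN instead of the local job runner. The sketch below reuses the addresses that appear later in these notes (fs.default.name hdfs://192.168.0.7:9000, ResourceManager at 192.168.0.7:8032/8030); the class name WCRemote and the jar path "project.jar" are placeholders.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WCRemote {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Point the client at the cluster instead of the local job runner
        conf.set("fs.defaultFS", "hdfs://192.168.0.7:9000");
        conf.set("mapreduce.framework.name", "yarn");
        conf.set("yarn.resourcemanager.address", "192.168.0.7:8032");
        conf.set("yarn.resourcemanager.scheduler.address", "192.168.0.7:8030");

        Job job = Job.getInstance(conf, "wordcount");
        // Ship the exported jar with the job so the cluster can find Mp and Rd
        job.setJar("project.jar");
        job.setMapperClass(Mp.class);
        job.setReducerClass(Rd.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.setInputPaths(job, new Path(args[0]));  // input dir on HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output dir on HDFS
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}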

Spark vs Yarn (Simple Grep example)

Environment:
-HARDWARE:
Command:
-cat /proc/cpuinfo # show CPU information
-cat /proc/meminfo # show memory information
-sudo smartctl -i /dev/sda # show disk model and specs; apt-get install smartmontools


HOSTNAME        IPADDRESS       CPU  CORE  MEM    DISK   OS
----------------------------------------------------------------------------------------------------------------
master          192.168.0.7     4    8     3.5GB  500GB  Ubuntu 14.04.1 LTS
regionserver2   192.168.0.23    2    4     3.5GB  500GB  Ubuntu 14.04.1 LTS

-SOFTWARE:
-Hadoop 2.4.1
-Spark 1.0.2
-Scala 2.10.4
-java version: 1.7.0_65

Test Info:
-INPUT
Total Size: 2.8GB
INFO: Linux Redhat / Fedora, Snort NIDS, iptables firewall log file(2006-allog.1 ~ 2006-allog.9)
Date Collected: Sep - Dec 2006
DOWNLOAD: http://log-sharing.dreamhosters.com/   (Bundle 5)

* put data into HDFS
$hdfs dfs -put DATA_DIR/DATA_FOLDER /user/hduser/LogFile
$hdfs dfs -du /user/hduser  # Get the size of "LogFile" folder

-Example : GREP (Count "Dec" in log file)
Using Spark:
$spark-shell --master yarn-client
scala> val textFile = sc.textFile("/user/hduser/LogFile/2006-allog.1")
scala> textFile.filter(line => line.contains("Dec")).count()
scala> exit

Using Hadoop:
$hadoop jar /usr/local/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.4.1.jar grep LogFile/2006-allog.1 OutPut "Dec"

# This example will execute two jobs, Grep and Sort. We only check the running time of Grep job.

Result

Data Size              Spark      Hadoop
--------------------------------------------------------------
119MB (2006-allog.1)   3.8 sec    21 sec
686MB (2006-allog.9)   5.8 sec    60 sec
2.8GB (LogFile)        25.3 sec   194 sec

Scala HelloWorld Example

[Software]
    Hadoop2.4.1
    Scala 2.10.4

[Hello World] by script
1. Create a Scala file "Hello.scala" as follows:

object Hello{
 def main(args: Array[String]) {
   println("Hello")
 }
}

2. Compile the Scala file
- $ scalac Hello.scala
- It will create two files: Hello.class & Hello$.class

3. Execute (in the same directory as Hello.class)
[method 1]
- $ scala Hello.scala
[method 2]
- $ scala -cp . Hello

[Def]
- def : defines a method
- val : defines a fixed value (which cannot be modified)
- var : defines a variable (which can be modified)

OpenTSDB Installation and StartUp

Features
[Reference]
- http://opentsdb.net/index.html

[OpenTSDB Features]
- Scalable, distributed time series database

[Download]
- https://github.com/OpenTSDB/opentsdb/releases
- Version: opentsdb-2.0.0
or use command
- $ git clone git://github.com/OpenTSDB/opentsdb.git

Installation and Start Up
- Reference:
- http://opentsdb.net/docs/build/html/installation.html
1. Requirement Install:
    -A Linux system
    -Java Development Kit 1.6 or later
    -GnuPlot 4.2 or later
    -Autotools
    -Make
    -Python
    -Git
    -An Internet connection

2. Install openTSDB:
- $sh opentsdb-2.0.0/build.sh
(If compilation was successful, there will be a tsdb jar file in ./build along with a tsdb script)
- $cd build
- $make install or $ ./build.sh
[NOTE]
If an error appears such as:
|+ ./bootstrap 
|exec: 17: autoreconf: not found
install dh-autoreconf:
$ sudo apt-get install dh-autoreconf

3. Start openTSDB:
1. In src/opentsdb.conf, modify:
tsd.network.port = 8099
tsd.storage.hbase.zk_quorum = 192.168.0.7:2222
tsd.http.staticroot = /home/hduser/opentsdb/build/staticroot
tsd.http.cachedir = /home/hduser/opentsdb/cachedir/

2. Create the table in HBase: (this command must be executed on the server where HBase is installed)
- $ env COMPRESSION=NONE HBASE_HOME=/usr/lib/hbase/hbase-0.98.5-hadoop2/ src/create_table.sh

3. Start:
- $ build/tsdb tsd --config src/opentsdb.conf
- TSD's web interface:
http://127.0.0.1:8099   (port is set in "tsd.network.port" )

4. Test openTSDB with simple collector
1. Register metrics at tsdb_uid of HBase
- $ tsdb mkmetric <metric_string1> <metric_string2> ... <metric_stringN>
ex: (Create two metrics, "proc.loadavg.1m" and "proc.loadavg.5m")
 $ tsdb mkmetric proc.loadavg.1m proc.loadavg.5m

2. Test collector: get the server's average load and show it in the TSD web UI
[Reference]
- http://www.slideshare.net/thecupoflife/opentsdb-in-a-real-enviroment
- http://zhengbin.blog.51cto.com/2989505/1273330
- http://opentsdb.net/docs/build/html/user_guide/quickstart.html (about mysql)

1. Create a collector file (Collect local machine info "loadavg")
- loadavg-collector.sh
#!/bin/bash
set -e
while true;
do awk -v now=`date +%s` -v host=`hostname` \
'{ print "put proc.loadavg.1m " now " " $1" host=" host;
print "put proc.loadavg.5m " now " " $2 " host=" host }' /proc/loadavg
 sleep 15
done | nc -w 30 192.168.0.7 8079

[NOTE]
- "set -e" --> causes the shell to exit if any subcommand or pipeline returns a non-zero status
- "awk '{print $1 $2}' /proc/loadavg" --> prints columns 1 and 2 of /proc/loadavg (the 1-minute and 5-minute load averages)
- "now=`date +%s` -v host=`hostname`" --> creates awk variables from the current epoch time and the hostname
- "nc -w 30 192.168.0.7 8079" --> connects to the TSD at <tsdb_host> <port>
- "|" --> the pipe streams the "put" lines produced by the while loop into nc's stdin, so they are sent to the TSD
  (a Java alternative to nc is sketched after step 3 below)

2. Run the collector
- $ chmod +x loadavg-collector.sh
- $ sh loadavg-collector.sh   (or $ nohup ./loadavg-collector.sh & --> output goes to nohup.out)

3. Open TSD web UI (192.168.0.7:8079)
- Setup the time line (ex: From 2014/09/24 To now)
- Metric: proc.loadavg.1m

The resulting graph will be displayed, and you can drag the mouse to select a region of the graph to zoom into.
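For reference, the same "put" line protocol that the shell collector pipes into nc can also be spoken from Java. This is only a sketch: the value 0.36 and the tag host=master are made-up sample data, the metric must already be registered with "tsdb mkmetric", and the port has to match tsd.network.port (8099 in the config above; the shell example uses 8079).

import java.io.OutputStreamWriter;
import java.io.Writer;
import java.net.Socket;

public class TsdbPutExample {
    public static void main(String[] args) throws Exception {
        // Open a plain TCP connection to the TSD (the same thing nc does)
        try (Socket sock = new Socket("192.168.0.7", 8099);
             Writer out = new OutputStreamWriter(sock.getOutputStream(), "UTF-8")) {
            long now = System.currentTimeMillis() / 1000L; // seconds since the epoch
            // Line protocol: put <metric> <timestamp> <value> <tagk=tagv>
            out.write("put proc.loadavg.1m " + now + " 0.36 host=master\n");
            out.flush();
        }
    }
}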

[Error]
-  "Request failed: Bad Request: No such name for 'metrics': 'tsd.'"

- [Solution]
- Think
http://grokbase.com/t/cloudfoundry.org/vcap-dev/126b11e3w6/tsdb-configuration-in-vcap-tools-dashboard
Says: There are 2 cases under which the above error happens.
1. collector is not running, so no metrics are pushed to tsdb
2. there is no web application running in cloudfoundry, so no "frameworks" metrics are pushed to tsdb
- Do
Check whether the collectors on the monitored hosts are running.

Tcollector
* Start tcollector on hosts
- http://opentsdb.net/docs/build/html/user_guide/utilities/tcollector.html#installation-of-tcollector
- http://books.google.com.tw/books?id=i5IFvlnfqi8C&pg=PA139&lpg=PA139&dq=opentsdb+monitor+hbase+table&source=bl&ots=kOpk1mpmCx&sig=0LsJOVd22zu2-SAM14CUhgMecMo&hl=zh-TW&sa=X&ei=U14iVPOIOMy48gWuuILQAw&ved=0CEMQ6AEwBQ#v=onepage&q=tcollector&f=false

[!] The collector is set up on the host that needs to be monitored (not on the HBase system).

[!] it may write temporary data (to files in /proc/)

Nagios Monitor

1. Install
$sudo apt-get install apache2 nagios3 nagios-nrpe-plugin
(1. Select "Internet Site" for "General type of mail configuration"
2. Select "OK")
3. Set web loggin Password )
$sudo apt-get install nagios3-doc
$sudo apt-get install nagios-nrpe-server

2. Start Nagios
$sudo nagios3 -v /etc/nagios3/nagios.cfg
(  Check that there are no errors  )

$sudo nano /etc/nagios3/conf.d/hosts.cfg
--------------------------------------------------------
define host{
  use                     generic-host   ; Name of host template to use
  host_name           master
  alias                    master
  address               192.168.0.7
}
define host{
  use                     generic-host     ; Name of host template to use
  host_name          regionserver2 
  alias                   regionserver2
  address               192.168.0.23
}
--------------------------------------------------------

$sudo nano /etc/nagios3/conf.d/hostgroup_nagios2.cfg
-----------------------Add----------------------------
define hostgroup {
        hostgroup_name  Hadoop_Cluster
        alias           Hadoop
        members         master, regionserver2
 }
---------------------------------------------------------
$sudo /etc/init.d/nagios3 restart

Login:   http://192.168.0.7/nagios3   account: nagiosadmin , password:


[REFERENCE]

http://www.cnblogs.com/junrong624/p/3653988.html   (Installation)

HBase Count Table Rows (Using Java Jar File)

[Software]
    Hadoop2.4.1
    Eclipse IDE for Java Developers Luna Release (4.4.0)
    HBase0.98.5

/*
 * Version:
 * v1 : count rows of a specified table; map-only job; output: counter "Rows"
 */
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

import org.apache.hadoop.util.GenericOptionsParser;

public class HbaseGet {
    private static byte[] tablename;
    private static byte[] familyname;
    private static byte[] columnname;

    public static class GetMap
            extends TableMapper<Text, LongWritable> { // in Java terms: Text => String, LongWritable => long

        public static enum Counters { Rows, Times };

        @Override
        public void map(ImmutableBytesWritable rowkey, Result result, Context context)
                throws IOException {
            // NOTE: the column is hard-coded here and should match the <CF> <CN> given on the command line
            byte[] b = result.getColumnLatest(Bytes.toBytes("m0"), Bytes.toBytes("Tj.00")).getValue();
            String msg = Bytes.toString(b);
            if (msg != null && !msg.isEmpty())
                context.getCounter(Counters.Rows).increment(1);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
        if (otherArgs.length != 3) {
            System.err.println("Wrong number of arguments: " + otherArgs.length);
            System.err.println("Usage: hadoop jar HBaseGet.jar HbaseGet <tablename> <CF> <CN>");
            System.exit(-1);
        }
        tablename  = Bytes.toBytes(otherArgs[0]);
        familyname = Bytes.toBytes(otherArgs[1]);
        columnname = Bytes.toBytes(otherArgs[2]);

        Job job = new Job(conf, otherArgs[0]);
        job.setJarByClass(HbaseGet.class);

        Scan scan = new Scan();
        scan.addColumn(familyname, columnname);
        TableMapReduceUtil.initTableMapperJob(
                Bytes.toString(tablename),
                scan,
                GetMap.class,
                Text.class,         // mapper output key class
                LongWritable.class, // mapper output value class
                job);
        job.setOutputFormatClass(NullOutputFormat.class);
        job.setNumReduceTasks(0);
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
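To run it, something like the following should work (the jar and class names come from the usage string above; the table, column family and qualifier are the ones used in the CSV import example later in these notes, and note that the mapper currently hard-codes m0:Tj.00):

$ hadoop jar HBaseGet.jar HbaseGet TEST3 m0 Tj.00

The row count is then reported in the job counters (Counters.Rows) rather than written to an output file.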

Import CSV file to HBASE(Using Jar File)

[Software]
    Hadoop2.4.1
    Eclipse IDE for Java Developers Luna Release (4.4.0)
    HBase0.98.5

Reference:
- http://hbase.apache.org/xref/org/apache/hadoop/hbase/mapreduce/SampleUploader.html

Step:
- Create Table
$hbase shell
hbase> create 'TEST3','m0','m1','m2','m3','m4','m5','m6','m7','m8','m9','m10','m11','m12','m13','m14','m15'
- Input File in HDFS
- log file with:
- 98 columns:
- 1  timestamp
- 1  ??
- 16 monitored servers (6 fields per server)
  (1 + 1 + 16 × 6 = 98)
- Run "HBimporttsv_v2.jar" to insert the log file into the HBase table
$ hadoop jar Downloads/HBimporttsv_v2.jar HBimporttsv.Hbaseimporttsv /user/hduser/test.log2 "TEST3"

[Code]
where Hbaseimporttsv.java is:

/*
 * This program imports a CSV file into HBase
 *
 */
package HBimporttsv;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class Hbaseimporttsv {
private static final String NAME = "SampleUploader";
public static int NUM_OF_SERVER = 16;// number of monitored server
public  static int NUM_OF_VAR = 6;// number of info per server
//public static String[]  VAR = {"var1","var2","var3","var4","var5","var6"};//server info type
public static String[] VAR = {
"Tj.00","Cal Tj.00","Tc.00","DutV00","DutA00","ErrCode00",
"Tj.01","Cal Tj.01","Tc.01","DutV01","DutA01","ErrCode01",
"Tj.02","Cal Tj.02","Tc.02","DutV02","DutA02","ErrCode02",
"Tj.03","Cal Tj.03","Tc.03","DutV03","DutA03","ErrCode03",
"Tj.04","Cal Tj.04","Tc.04","DutV04","DutA04","ErrCode04",
"Tj.05","Cal Tj.05","Tc.05","DutV05","DutA05","ErrCode05",
"Tj.06","Cal Tj.06","Tc.06","DutV06","DutA06","ErrCode06",
"Tj.07","Cal Tj.07","Tc.07","DutV07","DutA07","ErrCode07",
"Tj.08","Cal Tj.08","Tc.08","DutV08","DutA08","ErrCode08",
"Tj.09","Cal Tj.09","Tc.09","DutV09","DutA09","ErrCode08",
"Tj.10","Cal Tj.10","Tc.10","DutV10","DutA10","ErrCode10",
"Tj.11","Cal Tj.11","Tc.11","DutV11","DutA11","ErrCode11",
"Tj.12","Cal Tj.12","Tc.12","DutV12","DutA12","ErrCode12",
"Tj.13","Cal Tj.13","Tc.13","DutV13","DutA13","ErrCode13",
"Tj.14","Cal Tj.14","Tc.14","DutV14","DutA14","ErrCode14",
"Tj.15","Cal Tj.15","Tc.15","DutV15","DutA15","ErrCode15",
};

public static int NUM_OF_TOTAL_COLUMNS = 98;

static class Uploader
    extends Mapper<LongWritable, Text, ImmutableBytesWritable, Put> {

      //private long checkpoint = 100;
      //private long count = 0;

      @Override
      public void map(LongWritable key, Text line, Context context)
      throws IOException {
        // Input is a CSV file
        // Split CSV line
    // Ex (truncated; a real line has NUM_OF_TOTAL_COLUMNS = 98 fields):
    // Input: 14/09/15 18:20:35, Z00 B00 ,50.5,53.26,53.45,1251.06,291.25,FF
    // Output: values[0]="14/09/15 18:20:35", values[1]="Z00 B00",
    //         values[2]="50.5", values[3]="53.26", values[4]="53.45",
    //         values[5]="1251.06", values[6]="291.25", values[7]="FF"
    String [] values = line.toString().split(",");
        if(values.length != NUM_OF_TOTAL_COLUMNS)
          return;

        // Extract values[0] >> timestamp
        byte [] timestamp = Bytes.toBytes(values[0]);
     
        // Extract values[1] >> ??
        //byte [] ?? = Bytes.toBytes(values[1]);
     
        // Create Put
        Put put = new Put(timestamp);//Using first row(timestamp) as ROW_KEY
        //int var_index = 2; // server info star from values[2]
        for (int j = 0; j < NUM_OF_SERVER; j++) {
          for (int i = 0; i < NUM_OF_VAR; i++) {
            put.add(Bytes.toBytes("m" + j),                       // Column family name
                Bytes.toBytes(VAR[(j * NUM_OF_VAR) + i]),         // Column name
                Bytes.toBytes(values[2 + (j * NUM_OF_VAR) + i])); // Value
          }
        }

        // Uncomment below to disable WAL. This will improve performance but means
        // you will experience data loss in the case of a RegionServer crash.
        // put.setWriteToWAL(false);

        try {
          context.write(new ImmutableBytesWritable(timestamp), put);
        } catch (InterruptedException e) {
          e.printStackTrace();
        }
     
        /*
        // Set status every checkpoint lines
        if(++count % checkpoint == 0) {
          context.setStatus("Emitting Put " + count);
        }
        */
      }
    }

public static Job configureJob(Configuration conf, String [] args)
  throws IOException {
    Path inputPath = new Path(args[0]); // input path
    String tableName = args[1]; // Table name which is already in Database
    Job job = new Job(conf, NAME + "_" + tableName);
    job.setJarByClass(Uploader.class);
    FileInputFormat.setInputPaths(job, inputPath);
    job.setInputFormatClass(TextInputFormat.class);
    job.setMapperClass(Uploader.class);
    // No reducers. Just write straight to the table. Call initTableReducerJob
    // because it sets up the TableOutputFormat, so output is written to the table.
    TableMapReduceUtil.initTableReducerJob(tableName, null, job);
    job.setNumReduceTasks(0);
    return job;
  }

public static void main(String[] args) throws Exception{
Configuration conf = HBaseConfiguration.create();
String[] otherArgs = new GenericOptionsParser(conf,args).getRemainingArgs();
if(otherArgs.length !=2){
System.err.println("Wrong number of arguments:"+ otherArgs.length);
System.err.println("Usage:"+ NAME + " <input> <tablename>");
System.exit(-1);
}
Job job = configureJob(conf, otherArgs);
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}


Check Result:
$hbase shell
hbase>  get 'TEST3','14/09/16 06:35:38'
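The same check can also be done from Java with the HBase 0.98 client API used elsewhere in these notes; this is only an untested sketch, with the class name CheckRow made up and the row key / column taken from the shell example above.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HConnection;
import org.apache.hadoop.hbase.client.HConnectionManager;
import org.apache.hadoop.hbase.client.HTableInterface;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class CheckRow {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HConnection conn = HConnectionManager.createConnection(conf);
        HTableInterface table = conn.getTable("TEST3");
        try {
            // The Uploader mapper uses the timestamp column as the row key
            Get get = new Get(Bytes.toBytes("14/09/16 06:35:38"));
            Result result = table.get(get);
            // Read one cell back, e.g. column family "m0", qualifier "Tj.00"
            byte[] value = result.getValue(Bytes.toBytes("m0"), Bytes.toBytes("Tj.00"));
            System.out.println(value == null ? "not found" : Bytes.toString(value));
        } finally {
            table.close();
            conn.close();
        }
    }
}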

Import CSV file to HBASE(Using HBase Shell Command)

[Software]
    Hadoop2.4.1
    HBase 0.98.5
[Reference]
- http://www.openscg.com/2013/08/hadoop-hbase-tutorial/  (Operations)
- http://wiki.apache.org/hadoop/Hbase/Shell  (HBase shell command)


Input type:

14/09/15 18:20:35, Z00 B00 ,0050.50     ,0053.26     ,0053.45     ,1251.06     ,0291.25      ,FF     
14/09/15 18:20:35, Z00 B01 ,0053.50     ,0055.80     ,0056.03     ,1249.79     ,0357.45      ,FF     
.......

Table:

                  | type    | m1                                                              |
-----------------------------------------------------------------------------------------------
HBASE_ROW_KEY     | states  | deg     | high    | heat    | length  | avg     | char |
-----------------------------------------------------------------------------------------------
14/09/15 18:20:35 | Z00 B00 | 0050.50 | 0053.26 | 0053.45 | 1251.06 | 0291.25 | FF   |
14/09/15 18:20:35 | Z00 B01 | 0053.50 | 0055.80 | 0056.03 | 1249.79 | 0357.45 | FF   |
.......

Step:
$hbase shell
> create 'log_data', 'type','m1'  // create "log_data" with two column families, "type" and "m1"
> quit

$hbase org.apache.hadoop.hbase.mapreduce.ImportTsv '-Dimporttsv.separator=,' -Dimporttsv.columns=HBASE_ROW_KEY,type:states,m1:deg,m1:high,m1:heat,m1:length,m1:avg,m1:char log_data /user/hduser/test_log.csv

- org.apache.hadoop.hbase.mapreduce.ImportTsv 
runs the ImportTsv class in hbase-server-${version}-hadoop2.jar, which lets HBase load CSV-formatted data
- '-Dimporttsv.separator=,' 
tells HBase that the field separator of each line is ","
- -Dimporttsv.columns 
maps the columns to the column families created in the hbase shell ('type' and 'm1'); at least one HBASE_ROW_KEY is required to serve as the row key,
and the column format is "columnfamilyname:columnname", e.g. "m1:deg"
- log_data
this argument is the target table name (the "log_data" created in the hbase shell)
- /user/hduser/test_log.csv
this argument is the input file name, located on the HDFS that HBase is connected to

$hbase shell
> scan 'log_data' // view the imported table

[Future Work]
1. The complete log, including the trailing columns that were dropped, has not been imported yet.
2. Do the import through a more convenient interface or program.

HBase create Table (Using Jar File)

[Software]
    Hadoop2.4.1
    HBase0.98.5

[Reference]
 http://diveintodata.org/2009/11/27/how-to-make-a-table-in-hbase-for-beginners/

Run the Java program:
$ hadoop jar hbop.jar HBoperation.HbaseOperation

where the jar file contains:
package HBoperation;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class HbaseOperation {
    public static void main(String[] args) throws Exception {
        Configuration myConf = HBaseConfiguration.create(); // create the HBase conf object
        // The conf is normally picked up from $HBASE_HOME/conf (if it is on $HADOOP_CLASSPATH)

        // myConf.set() isn't necessary if the conf has been set via $HADOOP_CLASSPATH
        myConf.set("hbase.master", "192.168.0.7:60000");

        HBaseAdmin hbase = new HBaseAdmin(myConf); // create an Admin to operate on HBase

        /////////////////////
        //  Create Table   //
        /////////////////////
        //HTableDescriptor desc = new HTableDescriptor("TEST"); // deprecated since 0.98
        HTableDescriptor desc = new HTableDescriptor(TableName.valueOf("TEST"));
        HColumnDescriptor meta = new HColumnDescriptor("personal".getBytes());
        HColumnDescriptor pref = new HColumnDescriptor("account".getBytes());
        desc.addFamily(meta);
        desc.addFamily(pref);
        hbase.createTable(desc);

        ///////////////////////
        //  Connect Table    //
        ///////////////////////
        HConnection hconnect = HConnectionManager.createConnection(myConf);
        HTableInterface testTable = hconnect.getTable("TEST");

        //////////////////////////
        //   Put Data to Table  //
        //////////////////////////
        Put p = new Put(Bytes.toBytes("student1"));
        p.add(Bytes.toBytes("personal"), Bytes.toBytes("name"), Bytes.toBytes("John"));
        p.add(Bytes.toBytes("account"), Bytes.toBytes("id"), Bytes.toBytes("3355454"));
        testTable.put(p);

        testTable.close();
        hconnect.close();
        hbase.close();
    }
}

- Check HBase
$hbase shell
hbase>list
- Result
TABLE
TEST
1 row(s) in 0.0390 seconds

[Problem]
When I ran the jar file for the first time, the following error occurred:
"opening socket connection to server localhost 127.0.0.1:2181 will not attempt to authenticate using SASL"

[Solution]
- THINK:
We set the HBase location (with ZooKeeper) to "192.168.0.7", so "server localhost 127.0.0.1" is strange. The HBase conf is probably not included in HADOOP_CLASSPATH, because we launched with the "hadoop jar" command.
- Method:
1. Modify the ~/.bashrc file and add:
export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:$HBASE_HOME/conf
2. Reload the environment:
$ . ~/.bashrc
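An alternative (untested here) is to set the ZooKeeper connection in code instead of relying on the classpath; the quorum host and client port 2222 below come from the hbase-site.xml values quoted in the OpenTSDB/HareDB sections, and ListTables is just an illustrative class name.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;

public class ListTables {
    public static void main(String[] args) throws Exception {
        Configuration myConf = HBaseConfiguration.create();
        // Point the client at the ZooKeeper ensemble explicitly, instead of
        // relying on hbase-site.xml being on the HADOOP_CLASSPATH
        myConf.set("hbase.zookeeper.quorum", "192.168.0.7");
        myConf.set("hbase.zookeeper.property.clientPort", "2222");

        HBaseAdmin admin = new HBaseAdmin(myConf);
        try {
            for (HTableDescriptor d : admin.listTables()) {
                System.out.println(d.getNameAsString());
            }
        } finally {
            admin.close();
        }
    }
}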

HBase create Table (Using Hive Script)

[Software]
    HBase 0.98.5

[Reference]
 https://cwiki.apache.org/confluence/display/Hive/HBaseIntegration#HBaseIntegration-Introduction    (Hive HBase Integration)
 https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Cli#LanguageManualCli-HiveCommandLineOptions
 (Hive Command and Hive Shell Command)
 https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL    (Hive DDL)

Step:

- Create Hive script called (hive-script.sql)
$nano hive-script.sql :

CREATE TABLE hbase_table_1(key int, value string) 
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf1:val")
TBLPROPERTIES ("hbase.table.name" = "hiveTableonHB");

- Execute the sql script :
$hive -f hive-script.sql    # This only executes the SQL file and does not start the Hive shell

- Check the result :
$hive 
hive> DESCRIBE hbase_table_1;  # which shows the following

OK
key int from deserializer
value string from deserializer
Time taken: 0.948 seconds, Fetched: 2 row(s)

hive> quit;
$hbase shell
hbase> list

hiveTableonHB
1 row(s) in 1.0810 seconds

HBase Table Management Web UI: HareDB Client (on HBase)

 Introduction
[Reference]
  http://www.haredb.com/haredb/file/tutorial/HBaseClient_Web_Version_Manual1.94.03.pdf
  http://www.haredb.com/HareDB/src_ap/Product_HareDBClient_Install.aspx?l=4

[HareDB Client Features]
        - Visualized client tool for HBase  >> better than the command-line mode
- Easily retrieve data from and put data into HBase
- Can transfer data from an RDB to HBase (using the "Data Model Management" function in HBase Client)
- Design your HBase schema just by configuring it through the GUI pages

Install and Start up
[Download]
- http://sourceforge.net/projects/haredbhbaseclie/files/
- Version: HareDBClient_1.98.01s

[Startup]
1. The hostnames of the HBase machines must be set up in /etc/hosts
  (otherwise it will cause an "unknown hostname" error)
ex:  $gedit /etc/hosts
  ...
192.168.0.7 master          -------> HBase master/slave
192.168.0.23 regionserver2  -------> HBase slave
...
2. Execute the sh file
$sh Downloads/HareDBClient_1.98.01s/startup.sh
          Then the HareDB web UI will be available at:
http://localhost:8080/HareDBClient/index.html

3. Set up a new connection to HBase:
1. Click the upper left button on the web page
> "Manage Connections"
> Right Click "Allen"(Default Connection)
> "Clone" and input new name such as "HBase"
> Click "HBase" connection manager
2. Set the connection information:

Connection Name: HBase
ZooKeeper Host/ip: 192.168.0.7 (-> $HBASE_HOME/conf/hbase-site.xml)
ZooKeeper Client Port: 2222   (-> $HBASE_HOME/conf/hbase-site.xml)
fs.default.name: hdfs://192.168.0.7:9000 (-> $HADOOP_HOME/etc/hadoop/core-site.xml)
yarn.resourcemanager.address: 192.168.0.7:8032 (Default)
yarn.resourcemanager.scheduler.address: 192.168.0.7:8030 (Default)
yarn.resourcemanager.resource-tracker.address: 192.168.0.7:8031 (Default)
yarn.resourcemanager.admin.address: 192.168.0.7:8033 (Default)
mapreduce.jobhistory.address: 192.168.0.7:10020 (Default)
coprocessor folder: hdfs://192.168.0.7:9000/tmp (Default)

Hive metastore: "Embedded"

> Click "Apply"

4. Connect to HBase
> Click the upper-left button on the main page
> Select the "HBase" connection we just created
> The left pane will show the tables of the HBase instance we connected to

[Important Note]
- Every table in HBase must have a coprocessor registered before it can be operated on for the first time:
0. A cross mark "X" appears on the table's icon, and you may find that none of the operations work.
1. Right-click the table > select "Coprocessor" > select "Register"

Hadoop Simple WordCount Example

[Software]
    Hadoop2.4.1

[Reference]
http://azure.microsoft.com/zh-tw/documentation/articles/hdinsight-develop-deploy-java-mapreduce/

1. Open a text editor such as Notepad.
2. Copy the following program and save it as WordCount.java.

package org.apache.hadoop.examples;

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class WordCount {

  public static class TokenizerMapper 
       extends Mapper<Object, Text, Text, IntWritable>{

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context
                    ) throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  public static class IntSumReducer 
       extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, 
                       Context context
                       ) throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
    if (otherArgs.length != 2) {
      System.err.println("Usage: wordcount <in> <out>");
      System.exit(2);
    }
    Job job = new Job(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
    FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

3.Make Dir:
$mkdir Word_Count
4.Compile java:
$javac -classpath $HADOOP_INSTALL/share/hadoop/common/hadoop-common-2.4.1.jar:$HADOOP_INSTALL/share/hadoop/mapreduce/hadoop-mapreduce-client-core-2.4.1.jar:$HADOOP_INSTALL/share/hadoop/common/lib/commons-cli-1.2.jar:$HADOOP_INSTALL/share/hadoop/common/lib/hadoop-annotations-2.4.1.jar -d Word_Count WordCount.java

5.Create jar file:
$jar -cvf WordCount.jar -C Word_Count/ . 

Then WordCount.jar will be created in the current directory.

6.Put input file to HDFS:
$hdfs dfs -put WordCount.java /WordCount/Input/file1

7.Execute the jar
$hadoop jar WordCount.jar org.apache.hadoop.examples.WordCount /WordCount/Input /WordCount/Output

8.Check Result:
$hdfs dfs -cat /WordCount/Output/part-r-00000