I submitted the job in Yarn-Client mode using the following script:

export SPARK_JAR=/usr/games/spark/xt/spark-assembly_2.10-0.9.0-cdh5.0.1-hadoop2.3.0-cdh5.0.1.jar

export HADOOP_CLASSPATH=$(hbase classpath)
export CLASSPATH=$CLASSPATH:/usr/games/spark/xt/SparkDemo-0.0.1-SNAPSHOT.jar:/usr/games/spark/xt/spark-assembly_2.10-0.9.0-cdh5.0.1-hadoop2.3.0-cdh5.0.1.jar:/usr/games/spark/xt/hadoop-common-2.3.0-cdh5.0.1.jar:/usr/games/spark/xt/hbase-client-0.96.1.1-cdh5.0.1.jar:/usr/games/spark/xt/hbase-common-0.96.1.1-cdh5.0.1.jar:/usr/games/spark/xt/hbase-server-0.96.1.1-cdh5.0.1.jar:/usr/games/spark/xt/hbase-protocol-0.96.0-hadoop2.jar:/usr/games/spark/xt/htrace-core-2.01.jar:$HADOOP_CLASSPATH

CONFIG_OPTS="-Dspark.master=yarn-client
-Dspark.jars=/usr/games/spark/xt/SparkDemo-0.0.1-SNAPSHOT.jar,/usr/games/spark/xt/spark-assembly_2.10-0.9.0-cdh5.0.1-hadoop2.3.0-cdh5.0.1.jar,/usr/games/spark/xt/hbase-client-0.96.1.1-cdh5.0.1.jar,/usr/games/spark/xt/hbase-common-0.96.1.1-cdh5.0.1.jar,/usr/games/spark/xt/hbase-server-0.96.1.1-cdh5.0.1.jar,/usr/games/spark/xt/hbase-protocol-0.96.0-hadoop2.jar,/usr/games/spark/xt/htrace-core-2.01.jar"

java -cp $CLASSPATH $CONFIG_OPTS com.xt.scala.TestSpark
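
The same jar names appear in both CLASSPATH and spark.jars, so the two lists could perhaps be built from a single list instead of being typed twice. The following is only a minimal sketch, assuming all jars stay under /usr/games/spark/xt; JAR_DIR, JARS, join_jars, SPARK_JARS and JOB_CLASSPATH are hypothetical names and are not part of the script above.

```shell
# Assumption: every jar lives in one directory, as in the script above.
JAR_DIR=/usr/games/spark/xt

# Jar file names copied from the script above (the ones shared by both lists).
JARS="SparkDemo-0.0.1-SNAPSHOT.jar
spark-assembly_2.10-0.9.0-cdh5.0.1-hadoop2.3.0-cdh5.0.1.jar
hbase-client-0.96.1.1-cdh5.0.1.jar
hbase-common-0.96.1.1-cdh5.0.1.jar
hbase-server-0.96.1.1-cdh5.0.1.jar
hbase-protocol-0.96.0-hadoop2.jar
htrace-core-2.01.jar"

# join_jars SEP DIR NAME... : prints DIR/NAME for each NAME, joined by SEP.
join_jars() {
  sep=$1; dir=$2; shift 2
  out=""
  for name in "$@"; do
    # Put the separator in front of every entry except the first.
    out="${out:+$out$sep}$dir/$name"
  done
  printf '%s\n' "$out"
}

# $JARS is left unquoted on purpose so it splits into one argument per jar
# (this relies on no jar name containing whitespace).
SPARK_JARS=$(join_jars , "$JAR_DIR" $JARS)     # comma-separated, for -Dspark.jars
JOB_CLASSPATH=$(join_jars : "$JAR_DIR" $JARS)  # colon-separated, for java -cp
```

With this, CONFIG_OPTS could use -Dspark.jars=$SPARK_JARS, and CLASSPATH could be extended with $JOB_CLASSPATH. The hadoop-common jar and $HADOOP_CLASSPATH appear only on the classpath, so they would still have to be appended separately.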




My job's code is as follows:


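import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import org.apache.spark.{SparkConf, SparkContext}

object TestSpark {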
  def main(args: Array[String]) {
    readHBase("C_CONS")
  }

  def readHBase(tableName: String) {
    val hbaseConf = HBaseConfiguration.create()
    hbaseConf.set(TableInputFormat.INPUT_TABLE, tableName)

    val sparkConf = new SparkConf()
        .setAppName("<<< Reading HBase >>>")
    val sc = new SparkContext(sparkConf)

    val rdd = sc.newAPIHadoopRDD(hbaseConf, classOf[TableInputFormat],
               classOf[ImmutableBytesWritable], classOf[Result])

    println(rdd.count)

  }
}


2014-09-30 10:21 GMT+08:00 Tao Xiao <xiaotao.cs....@gmail.com>:

> I submitted a job in Yarn-Client mode, which simply reads from an HBase
> table containing tens of millions of records and then does a *count* action.
> The job runs much longer than I expected, so I wonder whether the cause is
> the amount of data to read. However, there are 20 nodes in my Hadoop
> cluster, so the HBase table does not seem that big (tens of millions of
> records).
>
> I'm using CDH 5.0.0 (Spark 0.9 and HBase 0.96).
>
> BTW, while the job was running, I could see logs on the console, and
> specifically I'd like to know what the following log lines mean:
>
> 14/09/30 09:45:20 INFO scheduler.TaskSetManager: Starting task 0.0:20 as
> TID 20 on executor 2: b04.jsepc.com (PROCESS_LOCAL)
> 14/09/30 09:45:20 INFO scheduler.TaskSetManager: Serialized task 0.0:20 as
> 13454 bytes in 0 ms
> 14/09/30 09:45:20 INFO scheduler.TaskSetManager: Finished TID 19 in 16426
> ms on b04.jsepc.com (progress: 18/86)
> 14/09/30 09:45:20 INFO scheduler.DAGScheduler: Completed ResultTask(0, 19)
>
>
> Thanks
>
