Re:

Cheng Lian Tue, 25 Feb 2014 03:44:43 -0800

RDD.count() is an action, so it triggers a distributed job whether or not the
RDD is cached.  If the RDD is cached, the HBase scan is not repeated.


How do you want to improve the performance?  Are you trying to reduce
unnecessary distributed jobs, or improve the performance of the second job?
If the former, just use a variable to hold the result of the first count()
call; if the latter, hBaseRDD.cache() can help.  Either way, the extra
HBase scan can be eliminated.
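
A minimal sketch of both options, not a drop-in replacement for your code.
It assumes a local SparkContext and uses sc.parallelize as a stand-in for
hBaseRDD; the names CountOnce, rowsWithField and counts are hypothetical and
only illustrate the pattern.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object CountOnce {
  // Hypothetical stand-in for readFields: keep only rows that carry a value.
  def rowsWithField(rdd: RDD[Option[String]]): RDD[String] =
    rdd.flatMap(row => row.toList)

  def counts(sc: SparkContext): (Long, Long) = {
    // Stand-in for hBaseRDD; in the original this comes from init(args).
    val rows: RDD[Option[String]] =
      sc.parallelize(Seq(Some("a"), None, Some("b")))

    // Mark the RDD for caching. The first action materializes it, and
    // later jobs read the cached partitions instead of rescanning the source.
    rows.cache()

    // Hold each count in a variable so printing or reusing the number
    // does not trigger another job.
    val total = rows.count()
    val interesting = rowsWithField(rows).count()

    rows.unpersist()
    (total, interesting)
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("count-once").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      val (total, interesting) = counts(sc)
      println("**** Initial row count " + total)
      println("**** Rows with interesting fields " + interesting)
    } finally {
      sc.stop()
    }
  }
}
```

Both count() calls still run as separate jobs, but only the first one reads
from the source.  Caching only pays off if the data fits in the memory
available to the executors.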


On Tue, Feb 25, 2014 at 3:14 PM, Soumitra Kumar <kumar.soumi...@gmail.com> wrote:

> I have some code that reads an HBase table and counts the number of rows
> containing a field.
>
>     def readFields(rdd: RDD[(ImmutableBytesWritable, Result)]): RDD[List[Array[Byte]]] = {
>         return rdd.flatMap(kv => {
>             // Set of interesting keys for this use case
>             val keys = List ("src")
>             var data = List[Array[Byte]]()
>             var usefulRow = false
>
>             val cf = Bytes.toBytes ("cf")
>             keys.foreach {key =>
>                 val col = kv._2.getValue(cf, Bytes.toBytes(key))
>                 if (col != null)
>                     usefulRow = true
>                 data = data :+ col
>             }
>
>             if (usefulRow)
>                 Some(data)
>             else
>                 None
>         })
>     }
>
>     def main(args: Array[String]) {
>         val hBaseRDD = init(args)
>         // hBaseRDD.cache()
>
>         println("**** Initial row count " + hBaseRDD.count())
>         println("**** Rows with interesting fields " + readFields(hBaseRDD).count())
>     }
>
>
> I am running on a one-node CDH installation.
>
> As it is, it takes around 2.5 minutes. But if I comment out the line
> 'println("**** Initial row count " + hBaseRDD.count())', it takes around
> 1.5 minutes.
>
> Is it doing the HBase scan twice, once for each 'count' call? How do I
> improve it?
>
> Thanks,
> -Soumitra.
>
>
