RDD.count() is an action, so it triggers a distributed job whether or not the RDD is cached. If the RDD is cached, however, the HBase scan is not repeated.
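To make that concrete, here is a minimal sketch that counts how often the source rows are read when two count() actions run on the same data, with and without cache(). It is an illustration only: the HBase table is replaced by a small in-memory RDD, the names CountTwice, run and scans are made up for this example, and sc.longAccumulator assumes Spark 2.0 or later.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object CountTwice {
  // Returns (total rows, rows with the field, number of source rows read).
  // The in-memory RDD is a hypothetical stand-in for the HBase-backed RDD.
  def run(sc: SparkContext, useCache: Boolean): (Long, Long, Long) = {
    // Incremented once for every row produced by the "scan".
    val scans = sc.longAccumulator("source rows read")

    // (row key, optional value of the "src" column)
    val rows: RDD[(String, Option[String])] = sc
      .parallelize(Seq(("r1", Some("a")), ("r2", None), ("r3", Some("b")), ("r4", None)), 2)
      .map { row => scans.add(1); row }

    if (useCache) rows.cache()

    // First action: always runs a job and reads the source.
    val total = rows.count()
    // Second action: also runs a job, but reads from the cache if one exists.
    val useful = rows.filter(_._2.isDefined).count()

    rows.unpersist()
    (total, useful, scans.sum)
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("count-twice")
    val sc = new SparkContext(conf)
    try {
      println("cached:   " + run(sc, useCache = true))
      println("uncached: " + run(sc, useCache = false))
    } finally {
      sc.stop()
    }
  }
}
```

In a local run, both variants should execute two jobs, but the cached one should report 4 source rows read and the uncached one 8, since the second action recomputes the RDD from its source when nothing is cached.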
How do you want to improve the performance? Are you trying to avoid unnecessary distributed jobs, or to speed up the second job? If the former, just hold the result of the first count() call in a variable; if the latter, hBaseRDD.cache() can help. Either way, the extra HBase scan can be eliminated.

On Tue, Feb 25, 2014 at 3:14 PM, Soumitra Kumar <kumar.soumi...@gmail.com> wrote:

> I have a code which reads an HBase table, and counts number of rows
> containing a field.
>
>   def readFields(rdd : RDD[(ImmutableBytesWritable, Result)]) :
> RDD[List[Array[Byte]]] = {
>     return rdd.flatMap(kv => {
>       // Set of interesting keys for this use case
>       val keys = List ("src")
>       var data = List[Array[Byte]]()
>       var usefulRow = false
>
>       val cf = Bytes.toBytes ("cf")
>       keys.foreach {key =>
>         val col = kv._2.getValue(cf, Bytes.toBytes(key))
>         if (col != null)
>           usefulRow = true
>         data = data :+ col
>       }
>
>       if (usefulRow)
>         Some(data)
>       else
>         None
>     })
>   }
>
>   def main(args: Array[String]) {
>     val hBaseRDD = init(args)
>     // hBaseRDD.cache()
>
>     println("**** Initial row count " + hBaseRDD.count())
>     println("**** Rows with interesting fields " +
> readFields(hBaseRDD).count())
>   }
>
>
> I am running on a one mode CDH installation.
>
> As it is it takes around 2.5 minutes. But if I comment out 'println("****
> Initial row count " + hBaseRDD.count())', it takes around 1.5 minutes.
>
> Is it doing HBase scan twice, for both 'count' calls? How do I improve it?
>
> Thanks,
> -Soumitra.