I have code that reads an HBase table and counts the number of rows
containing a given field.

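    import org.apache.hadoop.hbase.client.Result
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable
    import org.apache.hadoop.hbase.util.Bytes
    import org.apache.spark.rdd.RDD

    def readFields(rdd: RDD[(ImmutableBytesWritable, Result)]): RDD[List[Array[Byte]]] = {
        rdd.flatMap { case (_, result) =>
            // Set of interesting keys (column qualifiers) for this use case
            val keys = List("src")
            val cf = Bytes.toBytes("cf")

            // One entry per key; the entry is null when the row lacks that column
            val data = keys.map(key => result.getValue(cf, Bytes.toBytes(key)))

            // Keep the row only if at least one interesting column is present
            if (data.exists(_ != null)) Some(data) else None
        }
    }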

    def main(args: Array[String]): Unit = {
        val hBaseRDD = init(args)
        // hBaseRDD.cache()

        println("**** Initial row count " + hBaseRDD.count())
        println("**** Rows with interesting fields " + readFields(hBaseRDD).count())
    }


I am running on a single-node CDH installation.

As written, the job takes around 2.5 minutes. If I comment out the first
println (the one that calls hBaseRDD.count()), it takes around 1.5 minutes.

Is it doing the HBase scan twice, once for each 'count' call? How can I
improve this?
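
To make the question concrete, here is a minimal sketch of the variant hinted
at by the commented-out cache() line: mark the RDD as cached before the first
action, so that a second action can reuse the materialized partitions instead
of recomputing them from the source. It is only a sketch under assumptions: a
local SparkContext and a small in-memory collection stand in for the
HBase-backed RDD, and the names CachedCounts and rows are made up for
illustration. Whether caching actually avoids the second scan in the
HBase-backed job is an assumption here, not something the sketch proves.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object CachedCounts {
  def main(args: Array[String]): Unit = {
    // Assumption: local mode, used only to keep the sketch self-contained
    val conf = new SparkConf().setAppName("cached-counts").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // Hypothetical stand-in for the HBase rows: each row maps qualifier -> value
    val rows = sc.parallelize(Seq(
      Map("src" -> "a"),
      Map("dst" -> "b"),
      Map("src" -> "c")
    ))

    // Mark the RDD for caching; it is materialized by the first action below
    rows.cache()

    // First action: computes the RDD and fills the cache
    println("**** Initial row count " + rows.count()) // 3

    // Second action: reads the cached partitions where they fit in memory
    println("**** Rows with interesting fields " + rows.filter(_.contains("src")).count()) // 2

    rows.unpersist()
    sc.stop()
  }
}
```

Without the cache() call, each action would recompute the RDD from its source,
which would match the timing difference described above.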

Thanks,
-Soumitra.
