Re: HBase row count

2014-02-26 Thread Nick Pentreath
Currently no, there is no way to save the web UI details. There was some discussion around adding this on the mailing list, but no change as yet.

Re: HBase row count

2014-02-25 Thread Soumitra Kumar
Found the issue: the splits in HBase were not uniform, so one job was taking 90% of the time. BTW, is there a way to save the details available on port 4040 after the job is finished?
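
A quick way to confirm that kind of skew is to count the rows in each partition of the scan; the sketch below is illustrative only and assumes the hBaseRDD discussed later in the thread:

    // Rows per partition of the HBase scan; a heavily skewed result means one
    // region (and therefore one task) holds most of the data.
    val rowsPerPartition = hBaseRDD
      .mapPartitionsWithIndex((idx, iter) => Iterator((idx, iter.size)))
      .collect()

    rowsPerPartition.foreach { case (idx, n) => println(s"partition $idx: $n rows") }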

Re: HBase row count

2014-02-25 Thread Nick Pentreath
It's tricky, really, since you may not know upfront how much data is in there. You could possibly take a look at how much data is in the HBase tables to get an idea. It may take a bit of trial and error, like running out of memory trying to cache the dataset, and checking the Spark UI on port 4040 to see how much of it actually fits in memory.
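
One way to sidestep the out-of-memory trial and error is to persist with a storage level that spills to disk and then check the Storage tab of the UI on port 4040; a minimal sketch, assuming the hBaseRDD from the thread:

    import org.apache.spark.storage.StorageLevel

    // MEMORY_AND_DISK spills partitions that don't fit in memory instead of failing.
    hBaseRDD.persist(StorageLevel.MEMORY_AND_DISK)
    hBaseRDD.count()  // materialize the blocks, then inspect the Storage tab on port 4040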

Re: HBase row count

2014-02-25 Thread Soumitra Kumar
Thanks Nick. How do I figure out if the RDD fits in memory?

Re: HBase row count

2014-02-25 Thread Koert Kuipers
I find them both somewhat confusing, actually:
* RDD.cache is lazy, and mutates the RDD in place.
* RDD.unpersist has the direct effect of unloading, and also mutates the RDD in place to disable future lazy caching.
I have found that if I need to unload an RDD from memory, but still want it to be cached lazily again later, I have to call cache() on it once more after the unpersist().
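
In code, the behaviour described above looks roughly like this (a sketch, assuming an already-built hBaseRDD):

    hBaseRDD.cache()      // lazy: only flags the RDD, nothing is materialized yet
    hBaseRDD.count()      // first action: scans HBase and populates the cache

    hBaseRDD.unpersist()  // eager: cached blocks are dropped immediately, and the
                          // RDD is no longer marked for caching

    hBaseRDD.cache()      // flag it again so the next action re-caches it
    hBaseRDD.count()      // re-scans HBase and re-populates the cache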

Re: HBase row count

2014-02-25 Thread Cheng Lian
BTW, unlike RDD.cache(), the reverse operation RDD.unpersist() is not lazy, which is somewhat confusing...

Re: HBase row count

2014-02-25 Thread Cheng Lian
RDD.cache() is a lazy operation: the method itself doesn't perform the cache operation, it just asks the Spark runtime to cache the content of the RDD when the first action is invoked. In your case, the first action is the first count() call, which conceptually does 3 things: 1. performs the HBase scan, 2. caches the fetched rows, and 3. counts them.

Re: HBase row count

2014-02-25 Thread Nick Pentreath
cache only caches the data on the first action (count) - the first time it still needs to read the data from the source. So the first time you call count it will take the same amount of time whether cache is enabled or not. The second time you call count on a cached RDD, you should see that it takes far less time, since the data is read from the cache rather than scanned from HBase.
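
To make that concrete, a rough timing sketch (illustrative only; the helper and labels are not from the thread):

    // Small helper to time an action; assumes hBaseRDD already exists.
    def time[A](label: String)(body: => A): A = {
      val start = System.nanoTime()
      val result = body
      println(f"$label: ${(System.nanoTime() - start) / 1e9}%.2f s")
      result
    }

    hBaseRDD.cache()
    time("first count (scan + cache)") { hBaseRDD.count() }
    time("second count (from cache)") { hBaseRDD.count() }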

Re: HBase row count

2014-02-24 Thread Soumitra Kumar
I did try with 'hBaseRDD.cache()', but I don't see any improvement. My expectation is that with cache enabled, there should not be any penalty for the 'hBaseRDD.count' call.

Re: HBase row count

2014-02-24 Thread Nick Pentreath
Yes, you're initiating a scan for each count call. The normal way to improve this would be to use cache(), which is what you have in your commented-out line: // hBaseRDD.cache() If you uncomment that line, you should see an improvement overall. If caching is not an option for some reason (maybe
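
For reference, a minimal sketch of the overall pattern being discussed, using the standard TableInputFormat route; the table name, master URL and app name are placeholders, not taken from the thread:

    import org.apache.hadoop.hbase.HBaseConfiguration
    import org.apache.hadoop.hbase.client.Result
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable
    import org.apache.hadoop.hbase.mapreduce.TableInputFormat
    import org.apache.spark.SparkContext

    val sc = new SparkContext("local[4]", "hbase-row-count")

    // Point TableInputFormat at the table to scan ("mytable" is a placeholder).
    val hbaseConf = HBaseConfiguration.create()
    hbaseConf.set(TableInputFormat.INPUT_TABLE, "mytable")

    val hBaseRDD = sc.newAPIHadoopRDD(hbaseConf, classOf[TableInputFormat],
      classOf[ImmutableBytesWritable], classOf[Result])

    hBaseRDD.cache()           // lazy: only marks the RDD for caching

    println(hBaseRDD.count())  // first action: full HBase scan; result is cached
    println(hBaseRDD.count())  // second action: served from the cache, no scan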