Currently there is no way to save the web UI details. There was some
discussion around adding this on the mailing list, but no change as yet.
On Tue, Feb 25, 2014 at 7:23 PM, Soumitra Kumar
wrote:
> Found the issue, actually the splits in HBase were not uniform, so on
Found the issue, actually the splits in HBase were not uniform, so one job was
taking 90% of the time.
BTW, is there a way to save the details available on port 4040 after the job
is finished?
On Tue, Feb 25, 2014 at 7:26 AM, Nick Pentreath wrote:
> It's tricky really since you may not know upfront how much da
It's tricky really since you may not know upfront how much data is in
there. You could possibly take a look at how much data is in the HBase
tables to get an idea.
It may take a bit of trial and error, like running out of memory trying to
cache the dataset, and checking the Spark UI on port 4040 t
Thanks Nick.
How do I figure out if the RDD fits in memory?
On Tue, Feb 25, 2014 at 1:04 AM, Nick Pentreath wrote:
> cache only caches the data on the first action (count) - the first time it
> still needs to read the data from the source. So the first time you call
> count it will take the sam
i find them both somewhat confusing actually.
* RDD.cache is lazy, and mutates the RDD in place
* RDD.unpersist has a direct effect of unloading, and also mutates the RDD
in place to disable future lazy caching
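A minimal sketch of the two behaviours listed above, assuming a local SparkContext and a small synthetic RDD standing in for the HBase-backed one. The names `CacheSketch` and `observe` and the data are made up for illustration and are not from the thread:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object CacheSketch {
  // Returns (level after cache(), level after unpersist(), first count, second count).
  def observe(sc: SparkContext): (StorageLevel, StorageLevel, Long, Long) = {
    // Hypothetical stand-in for the HBase-backed RDD discussed in the thread.
    val rdd = sc.parallelize(1 to 100000).map(_ * 2)

    rdd.cache()                           // lazy: only records the storage level on the RDD
    val afterCache = rdd.getStorageLevel  // already MEMORY_ONLY, although nothing is computed yet

    val first = rdd.count()   // first action: computes the data and fills the cache
    val second = rdd.count()  // should be served from the cache if the partitions fit in memory

    rdd.unpersist()                           // takes effect right away, no action needed
    val afterUnpersist = rdd.getStorageLevel  // reset to NONE, so later actions recompute

    (afterCache, afterUnpersist, first, second)
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("cache-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try println(observe(sc)) finally sc.stop()
  }
}
```

Timing the two `count()` calls on a real dataset should show the difference described in the thread: the first pays the full read cost, the second does not.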
i have found that if i need to unload an RDD from memory, but still want it
to be cache
BTW, unlike RDD.cache(), the reverse operation RDD.unpersist() is not lazy,
which is somewhat confusing...
On Tue, Feb 25, 2014 at 7:48 PM, Cheng Lian wrote:
> RDD.cache() is a lazy operation, the method itself doesn't perform the
> cache operation, it just asks Spark runtime to cache the conte
RDD.cache() is a lazy operation, the method itself doesn't perform the
cache operation, it just asks Spark runtime to cache the content of the RDD
when the first action is invoked. In your case, the first action is the
first count() call, which conceptually does 3 things:
1. Performs the HBase
cache only caches the data on the first action (count) - the first time it
still needs to read the data from the source. So the first time you call
count it will take the same amount of time whether cache is enabled or not.
The second time you call count on a cached RDD, you should see that it
take
I did try with 'hBaseRDD.cache()', but don't see any improvement.
My expectation is that with cache enabled, there should not be any penalty
of 'hBaseRDD.count' call.
On Mon, Feb 24, 2014 at 11:29 PM, Nick Pentreath
wrote:
> Yes, you're initiating a scan for each count call. The normal way to
Yes, you're initiating a scan for each count call. The normal way to
improve this would be to use cache(), which is what you have in your
commented out line:
// hBaseRDD.cache()
If you uncomment that line, you should see an improvement overall.
If caching is not an option for some reason (maybe