Fwd: Spark RDD cache memory usage

2014-04-30 Thread Han JU
Hi,

As I understand, by default Spark reserves a fraction of the executor memory
(60%) for RDD caching. So if there's no explicit caching in the
code (e.g. rdd.cache() etc.), or if we persist RDDs with
StorageLevel.DISK_ONLY, is this part of memory wasted? Does Spark allocate
the RDD cache memory dynamically? Or does Spark automatically cache RDDs
when it can?
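
(For context, the 60% figure corresponds to the Spark 1.x setting
`spark.storage.memoryFraction`. A minimal `spark-defaults.conf` sketch —
the 0.2 below is only an illustrative value for jobs that persist to disk
only, not a recommendation:)

```properties
# Spark 1.x: fraction of executor heap reserved for the RDD block cache.
# Default is 0.6; 0.2 here is an illustrative value for DISK_ONLY workloads.
spark.storage.memoryFraction   0.2
```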

I've posted this question on the user list but got no response there, so I'm
trying the dev list. Sorry for the spam.

Thanks.

-- 
*JU Han*

Data Engineer @ Botify.com

+33 061960


[EC2] r3 instance type

2014-05-12 Thread Han JU
Hi,

I'm modifying the ec2 script to add support for the new r3 instance types,
but there's a problem with the instance storage.

For example, `r3.large` has a single 32GB SSD. The problem is that it's an
SSD with TRIM support and is not automatically formatted and mounted;
`lsblk` gives me this after the ec2 script's setup:

xvda    202:0    0   8G  0 disk
└─xvda1 202:1    0   8G  0 part /
xvdb    202:16   0  30G  0 disk

I think there are some workarounds for this problem. For example, we could
treat it like an EBS device and check `/dev/xvdb` using `blkid`; however,
this requires modifying the deployment script inside the AMI, and I don't
know if that's the preferred way.
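
As a rough sketch of the blkid-based workaround (the device name `/dev/xvdb`
and the mount point `/mnt2` are assumptions taken from the `lsblk` output
above, not spark-ec2's actual conventions):

```shell
DEV=/dev/xvdb   # instance-store SSD as seen in lsblk above (assumption)
MNT=/mnt2       # illustrative mount point

if [ -b "$DEV" ]; then                      # no-op on machines without the device
  if ! blkid "$DEV" >/dev/null 2>&1; then   # blkid exits non-zero: no filesystem yet
    # -E nodiscard skips the initial TRIM pass, which is slow on large SSDs
    mkfs.ext4 -E nodiscard "$DEV"
  fi
  mkdir -p "$MNT"
  mount -o discard "$DEV" "$MNT"            # discard enables online TRIM on ext4
fi
```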

Any ideas or suggestions?

Thanks.