Re: Spark SQL. Memory consumption

2015-04-02 Thread Vladimir Rodionov
>> Using large memory for executors (*--executor-memory 120g*). Not really good advice. On Thu, Apr 2, 2015 at 9:17 AM, Cheng, Hao wrote: > Spark SQL tries to load the entire partition data and organize it as > in-memory HashMaps; it does eat large memory if there are not many > duplicated gro
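The snippet above describes hash-based aggregation: the whole partition is organized into an in-memory HashMap with one accumulator per distinct group key, so memory grows with the number of distinct keys rather than the number of rows. A minimal plain-Python sketch of that behavior (conceptual only, not Spark SQL's actual code):

```python
# Conceptual sketch (plain Python, not Spark internals): hash-based
# aggregation keeps one accumulator per distinct group key, so memory
# use tracks the number of distinct keys, not the number of rows.
from collections import defaultdict

def hash_aggregate(rows):
    """Sum `value` per `key` for one partition, entirely in memory."""
    acc = defaultdict(int)  # one entry per distinct key lives until the end
    for key, value in rows:
        acc[key] += value
    return dict(acc)

# Few distinct keys -> tiny map, regardless of row count.
many_rows_few_keys = [("a", 1)] * 10_000 + [("b", 2)] * 10_000
print(len(hash_aggregate(many_rows_few_keys)))   # 2 accumulators

# Every row a distinct key -> the map is as large as the partition itself.
many_keys = [(i, 1) for i in range(10_000)]
print(len(hash_aggregate(many_keys)))            # 10000 accumulators
```

This is why a high-cardinality GROUP BY can exhaust even a 120 GB executor heap: the duplicated keys that would let the map stay small simply are not there.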

Re: Running out of space (when there's no shortage)

2015-02-24 Thread Vladimir Rodionov
enough space and you have no idea why the application reports "no space left on device". Just a guess. -Vladimir Rodionov On Tue, Feb 24, 2015 at 8:34 AM, Joe Wass wrote: > I'm running a cluster of 3 Amazon EC2 machines (small number because it's > expensive when exper
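"No space left on device" despite apparently plentiful disk is often one small mount filling up, e.g. a scratch directory such as /tmp holding shuffle spill files while the large data volume stays mostly empty. A hedged stdlib-only diagnostic sketch (the paths are illustrative assumptions, not taken from the thread):

```python
# Diagnostic sketch: check free space per mount, since one small mount
# (e.g. /tmp holding scratch/shuffle files) can fill up while other
# disks remain mostly empty. The path list is an illustrative assumption.
import shutil

def report_free(paths):
    """Return the fraction of space still free on each given mount point."""
    out = {}
    for p in paths:
        usage = shutil.disk_usage(p)
        out[p] = usage.free / usage.total
    return out

for path, frac in report_free(["/", "/tmp"]).items():
    print(f"{path}: {frac:.1%} free")
```

Checking each mount separately, rather than total cluster capacity, is usually what reveals this class of failure.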

Re: Spark v Redshift

2014-11-04 Thread Vladimir Rodionov
>> We service templated queries from the appserver, i.e. user fills out some forms, dropdowns: we translate to a query. and >> The target data size is about a billion records, 20'ish fields, distributed throughout a year (about 50GB on disk as CSV, uncompressed). tells me that proprietary i

Re: GC Issues with randomSplit on large dataset

2014-10-30 Thread Vladimir Rodionov
One of the options here is to try to reduce the JVM heap size and reduce the data size per JVM instance. -Vladimir Rodionov On Thu, Oct 30, 2014 at 5:22 AM, Ilya Ganelin wrote: > The split is something like 30 million into 2 million partitions. The > reason that it becomes tractable is that

Re: Spark And Mapr

2014-10-01 Thread Vladimir Rodionov
There is a doc on MapR: http://doc.mapr.com/display/MapR/Accessing+MapR-FS+in+Java+Applications -Vladimir Rodionov On Wed, Oct 1, 2014 at 3:00 PM, Addanki, Santosh Kumar < santosh.kumar.adda...@sap.com> wrote: > Hi > > > > We were using Horton 2.4.1 as our Hadoop distribu

Re: Reading from HBase is too slow

2014-10-01 Thread Vladimir Rodionov
Yes, it's in 0.98. CDH is free (w/o subscription) and sometimes it's worth upgrading to the latest version (which is 0.98 based). -Vladimir Rodionov On Wed, Oct 1, 2014 at 9:52 AM, Ted Yu wrote: > As far as I know, that feature is not in CDH 5.0.0 > > FYI > > On Wed, Oct 1,

Re: Reading from HBase is too slow

2014-10-01 Thread Vladimir Rodionov
Using TableInputFormat is not the fastest way of reading data from HBase. Do not expect 100s of MB per sec. You should probably take a look at M/R over HBase snapshots. https://issues.apache.org/jira/browse/HBASE-8369 -Vladimir Rodionov On Wed, Oct 1, 2014 at 8:17 AM, Tao Xiao wrote: > I

Re: Reading from HBase is too slow

2014-09-29 Thread Vladimir Rodionov
HBase TableInputFormat creates input splits, one per region. You cannot achieve a high level of parallelism unless you have at least 5-10 regions per RS. What does that mean? You probably have too few regions. You can verify that in the HBase Web UI. -Vladimir Rodionov On Mon, Sep 29, 2014 at 7:21
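The point above is an arithmetic cap: with one input split per region, the job's task-level parallelism can never exceed the table's region count, no matter how many executor cores are available. A tiny sketch of that arithmetic:

```python
# Sketch of the parallelism cap described above: TableInputFormat
# creates exactly one input split per HBase region, so concurrent
# task count is bounded by the table's total region count.
def max_parallel_tasks(regions_per_server, num_region_servers):
    """Upper bound on concurrent read tasks for a one-split-per-region scan."""
    return regions_per_server * num_region_servers

# One region per server on a 5-node cluster: only 5 concurrent tasks,
# regardless of how many cores the cluster has.
print(max_parallel_tasks(1, 5))    # 5
# The 5-10 regions per RS suggested above raises the ceiling:
print(max_parallel_tasks(10, 5))   # 50
```

This is why checking the region count in the HBase Web UI is the first diagnostic step for a slow full-table scan.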

Spark caching questions

2014-09-09 Thread Vladimir Rodionov
Hi, users 1. Disk based cache eviction policy? The same LRU? 2. What is the scope of a cached RDD? Does it survive the application? What happens if I run the Java app next time? Will the RDD be created or read from cache? If the answer is YES, then ... 3. Is there any way to invalidate a cached RDD automat
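Question 1 asks whether eviction follows LRU. For readers unfamiliar with the policy being asked about, here is a minimal plain-Python sketch of LRU eviction (conceptual only — not Spark's BlockManager, and the RDD-style key names are illustrative):

```python
# Conceptual LRU cache sketch (plain Python, not Spark's BlockManager):
# when capacity is exceeded, the least-recently-used entry is evicted,
# which is the policy the first question above asks about.
from collections import OrderedDict

class LRUCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.data = OrderedDict()  # insertion order doubles as recency order

    def get(self, key):
        if key not in self.data:
            return None
        self.data.move_to_end(key)         # mark as most recently used
        return self.data[key]

    def put(self, key, value):
        if key in self.data:
            self.data.move_to_end(key)
        self.data[key] = value
        if len(self.data) > self.capacity:
            self.data.popitem(last=False)  # evict least recently used

cache = LRUCache(2)
cache.put("rdd_1", "partition data")
cache.put("rdd_2", "partition data")
cache.get("rdd_1")                 # touch rdd_1, so rdd_2 becomes LRU
cache.put("rdd_3", "partition data")
print(list(cache.data))            # ['rdd_1', 'rdd_3'] -- rdd_2 evicted
```

Note the recency update on `get`: without it the policy degrades to plain FIFO, which is the distinction the question is probing.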