Re: RDD blocks on Spark Driver

2017-02-28 Thread Prithish
This is the command I am running: spark-submit --deploy-mode cluster --master yarn --class com.myorg.myApp s3://my-bucket/myapp-0.1.jar On Wed, Mar 1, 2017 at 12:22 AM, Jonathan Kelly wrote: > Prithish, > > It would be helpful for you to share the spark-submit command you are

Re: Custom log4j.properties on AWS EMR

2017-02-28 Thread Prithish
Thanks for your response, Jonathan. Yes, this works. I also added another way of achieving this to the Stack Overflow post. Thanks for the help. On Tue, Feb 28, 2017 at 11:58 PM, Jonathan Kelly wrote: > Prithish, > > I saw you posted this on SO, so I responded there just now. S

Re: Custom log4j.properties on AWS EMR

2017-02-26 Thread Prithish
ebugging.properties (maybe also try > without the "/") > > > On 26 Feb 2017, at 16:31, Prithish wrote: > > Hoping someone can answer this. > > I am unable to override and use a Custom log4j.properties on Amazon EMR. I > am running Spark on EMR (Yarn) and have t

Custom log4j.properties on AWS EMR

2017-02-26 Thread Prithish
Hoping someone can answer this. I am unable to override and use a custom log4j.properties on Amazon EMR. I am running Spark on EMR (YARN) and have tried all the combinations below in spark-submit to try and use the custom log4j. In client mode --driver-java-options "-Dlog4j.configuration=hdfs
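For reference, a sketch of the two patterns usually suggested for this (paths, class name, and jar name here are hypothetical, mirroring the thread; not a confirmed EMR recipe):

```shell
# Client mode: the driver JVM starts on the submitting host, so a local
# file: URI (hypothetical path) can be passed straight to the driver.
spark-submit \
  --deploy-mode client \
  --driver-java-options "-Dlog4j.configuration=file:/home/hadoop/log4j.properties" \
  --class com.myorg.myApp myapp-0.1.jar

# Cluster mode: ship the file to every container with --files, then point
# both driver and executors at it by bare name, since YARN places shipped
# files in each container's working directory.
spark-submit \
  --deploy-mode cluster \
  --files /home/hadoop/log4j.properties \
  --conf "spark.driver.extraJavaOptions=-Dlog4j.configuration=log4j.properties" \
  --conf "spark.executor.extraJavaOptions=-Dlog4j.configuration=log4j.properties" \
  --class com.myorg.myApp myapp-0.1.jar
```

Note the asymmetry: a local `file:` URI only works where the JVM actually runs, which is why the client-mode form breaks once the driver moves onto the cluster.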

Re: RDD blocks on Spark Driver

2017-02-26 Thread Prithish
which are local, standalone, yarn > and Mesos. Also, "blocks" is relative to hdfs, "partitions" > is relative to spark. > > liangyihuai > > ---Original--- > *From:* "Jacek Laskowski " > *Date:* 2017/2/25 02:45:20 > *To:* "prithish"; > *

RDD blocks on Spark Driver

2017-02-22 Thread prithish
Hello, I had a question. When I look at the Executors tab in the Spark UI, I notice that some RDD blocks are assigned to the driver as well. Can someone please tell me why? Thanks for the help.

Re: AVRO File size when caching in-memory

2016-11-16 Thread Prithish
reted by spark? > A compression logic of the spark caching depends on column types. > > // maropu > > > On Wed, Nov 16, 2016 at 5:26 PM, Prithish wrote: > >> Thanks for your response. >> >> I did some more tests and I am seeing that when I have a flatter >

Re: AVRO File size when caching in-memory

2016-11-16 Thread Prithish
size > would depend on the type of data you have and how well it was compressible. > > > > The purpose of these formats is to store data to persistent storage in a > way that's faster to read from, not to reduce cache-memory usage. > > > > Maybe others here have more i

Re: AVRO File size when caching in-memory

2016-11-15 Thread Prithish
Anyone? On Tue, Nov 15, 2016 at 10:45 AM, Prithish wrote: > I am using 2.0.1 and databricks avro library 3.0.1. I am running this on > the latest AWS EMR release. > > On Mon, Nov 14, 2016 at 3:06 PM, Jörn Franke wrote: > >> spark version? Are you using tungsten? >>

Re: AVRO File size when caching in-memory

2016-11-14 Thread Prithish
I am using 2.0.1 and databricks avro library 3.0.1. I am running this on the latest AWS EMR release. On Mon, Nov 14, 2016 at 3:06 PM, Jörn Franke wrote: > spark version? Are you using tungsten? > > > On 14 Nov 2016, at 10:05, Prithish wrote: > > > > Can someone please

AVRO File size when caching in-memory

2016-11-14 Thread Prithish
Can someone please explain why this happens? When I read a 600kb AVRO file and cache it in memory (using cacheTable), it shows up as 11mb (Storage tab in the Spark UI). I have tried this with different file sizes, and the in-memory size is always proportional. I thought Spark compresses when using
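The replies in this thread point at the likely cause: Spark's in-memory cache is a deserialized columnar format, so a small deflate-compressed Avro file can expand considerably once decoded, and the per-column compression ratio depends on column types and value distribution. A minimal sketch (Spark 2.0-era API; the S3 path and view name are hypothetical, and this assumes the databricks spark-avro package is on the classpath) of reproducing the observation:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("avro-cache-size")
  // On by default in Spark 2.x: in-memory columnar compression is applied
  // per column, so the achieved ratio depends on column types/cardinality.
  .config("spark.sql.inMemoryColumnarStorage.compressed", "true")
  .getOrCreate()

val df = spark.read.format("com.databricks.spark.avro")
  .load("s3://my-bucket/sample.avro") // hypothetical path
df.createOrReplaceTempView("sample")

spark.catalog.cacheTable("sample")
df.count() // force materialization; the size then shows in the Storage tab
```

Comparing the Storage-tab size against the on-disk file size, as done in the thread, compares a decoded columnar representation against a deflate-compressed row format, so a large ratio is expected rather than anomalous.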

Re: Reading AVRO from S3 - No parallelism

2016-10-27 Thread prithish
> How big are your avro files?We collapse many small files into a single > partition to eliminate scheduler overhead.If you need explicit > parallelism you can also repartition. > > > > On Thu, Oct 27, 2016 at 5:19 AM, Prithish (mailto:prith...@gmail.com)> wrote:

Reading AVRO from S3 - No parallelism

2016-10-27 Thread Prithish
I am trying to read a bunch of AVRO files from an S3 folder using Spark 2.0. No matter how many executors I use or what configuration changes I make, the cluster doesn't seem to use all the executors. I am using the com.databricks.spark.avro library from Databricks to read the AVRO. However, if I t
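The reply in this thread explains the behavior: many small files are collapsed into few input partitions to reduce scheduler overhead, so only a few executors receive tasks. A hedged sketch of the suggested repartition workaround (bucket path and partition count are illustrative, assuming an existing SparkSession named `spark`):

```scala
val df = spark.read
  .format("com.databricks.spark.avro")
  .load("s3://my-bucket/avro-folder/") // hypothetical path

// Small files get packed into few partitions at read time, leaving most
// executors idle. An explicit repartition shuffles the rows out across
// the cluster before any expensive downstream work runs.
val spread = df.repartition(spark.sparkContext.defaultParallelism)
spread.count()
```

The shuffle has a cost, so this pays off only when the downstream computation per row is heavy enough to dwarf it.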

Question about In-Memory size (cache / cacheTable)

2016-10-26 Thread Prithish
Hello, I am trying to understand how the in-memory size changes in these situations. Specifically, why is the in-memory size much higher for AVRO and Parquet? Are there any optimizations necessary to reduce this? Used cacheTable on each of these: AVRO File (600kb) - In-memory size was 12mb Parquet F