Re: Parquet-like partitioning support in spark SQL's in-memory columnar cache

2016-11-28 Thread Nitin Goyal
s what you were referring to originally? Thanks -Nitin On Fri, Nov 25, 2016 at 11:29 AM, Reynold Xin wrote: > It's already there isn't it? The in-memory columnar cache format. > > > On Thu, Nov 24, 2016 at 9:06 PM, Nitin Goyal > wrote: > >> Hi, >>

Parquet-like partitioning support in spark SQL's in-memory columnar cache

2016-11-24 Thread Nitin Goyal
Hi, Is there any plan to support Parquet-like partitioning in Spark SQL's in-memory cache? Something like one RDD[CachedBatch] per in-memory cache partition. -Nitin

Continuous warning while consuming using new kafka-spark010 API

2016-09-19 Thread Nitin Goyal
ew API? Is this the expected behaviour or am I missing something here? -- Regards Nitin Goyal

Re: Ever increasing physical memory for a Spark Application in YARN

2016-05-03 Thread Nitin Goyal
re. Or the default > spark.memory.fraction should be 0.66, so that it works out with the default > JVM flags. > > On Mon, Jul 27, 2015 at 6:08 PM, Nitin Goyal > wrote: > >> I am running a spark application in YARN having 2 executors with Xms/Xmx >> as >> 32 Gigs

Re: Secondary Indexing of RDDs?

2015-12-14 Thread Nitin Goyal
Spark SQL's in-memory cache stores statistics per column, which in turn are used to skip batches (default size 10000) within a partition https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/ColumnStats.scala#L25 Hope this helps Thanks -Nitin On T
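
The batch-skipping idea mentioned in this message can be illustrated with a small sketch. This is plain Python rather than Spark code, and the names `Batch`, `build_batches`, and `scan_equal` are hypothetical, chosen only for illustration. It assumes the per-column statistics are a lower and an upper bound per batch; an equality filter then only needs to scan batches whose bounds can contain the value.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Batch:
    """One cached batch of a single integer column plus its statistics."""
    rows: List[int]
    lower: int  # smallest value in the batch
    upper: int  # largest value in the batch


def build_batches(values: List[int], batch_size: int) -> List[Batch]:
    """Split a column into fixed-size batches and record min/max for each."""
    batches = []
    for start in range(0, len(values), batch_size):
        chunk = values[start:start + batch_size]
        batches.append(Batch(chunk, min(chunk), max(chunk)))
    return batches


def scan_equal(batches: List[Batch], target: int) -> Tuple[List[int], int]:
    """Return matching values and the number of batches actually scanned."""
    hits: List[int] = []
    scanned = 0
    for batch in batches:
        # Skip the batch when the statistics prove it cannot match.
        if target < batch.lower or target > batch.upper:
            continue
        scanned += 1
        hits.extend(v for v in batch.rows if v == target)
    return hits, scanned


if __name__ == "__main__":
    column = list(range(100))
    cached = build_batches(column, batch_size=10)
    matches, scanned = scan_equal(cached, 42)
    print(matches, "found after scanning", scanned, "of", len(cached), "batches")
```

Skipping of this kind helps most when the column is sorted or clustered, because the value ranges of different batches then overlap little.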

Re: Running individual test classes

2015-11-03 Thread Nitin Goyal
t. I've tried to >> look this up in the mailing list archives but haven't had luck so far. >> >> How can I run a single test suite? Thanks in advance! >> >> -- >> BR, >> Stefano Baghino >> > > -- Regards Nitin Goyal

Re: want to contribute

2015-10-29 Thread Nitin Goyal
-base and > how to get started to your organization. > Thanking you, > Aaditya Thakkar. > -- Regards Nitin Goyal

Re: Operations with cached RDD

2015-10-11 Thread Nitin Goyal
does not seem to be > reasonable, because the rdd is cached, and zipWithIndex is already executed > previously. > > > > Could you explain why if I perform an operation followed by an action on a > cached RDD, then the last operation in the lineage of the cached RDD is > shown to be executed in the Spark UI? > > > > > > Best regards, Alexander > -- Regards Nitin Goyal

Re: [ compress in-memory column storage used in sparksql cache table ]

2015-09-02 Thread Nitin Goyal
I think Spark SQL's in-memory columnar cache already does compression. Check out the classes in the following path: https://github.com/apache/spark/tree/master/sql/core/src/main/scala/org/apache/spark/sql/columnar/compression The compression ratio is not as good as Parquet's, though. Thanks -Nitin -- Vi

Ever increasing physical memory for a Spark Application in YARN

2015-07-27 Thread Nitin Goyal
I am running a Spark application in YARN with 2 executors, with Xms/Xmx set to 32 GB and spark.yarn.executor.memoryOverhead set to 6 GB. I am seeing that the app's physical memory keeps increasing until it finally gets killed by the node manager 2015-07-25 15:07:05,354 WARN org.apache.hadoop.yarn.server.nod
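
As a rough worked example of the numbers in this message: the memory YARN allows an executor container is approximately the executor heap plus the configured overhead. The helper below is a hypothetical illustration, not a Spark or YARN API, and it ignores any rounding YARN may apply to container sizes.

```python
def container_limit_gb(heap_gb: int, overhead_gb: int) -> int:
    """Approximate per-executor container size: JVM heap plus off-heap overhead."""
    return heap_gb + overhead_gb


if __name__ == "__main__":
    # Values taken from the message: 32 GB heap, 6 GB memoryOverhead, 2 executors.
    per_executor = container_limit_gb(32, 6)
    print("per executor:", per_executor, "GB")
    print("both executors:", 2 * per_executor, "GB")
```

Under that assumption, the node manager kills an executor once its physical memory grows past roughly 38 GB, even though the heap itself is capped at 32 GB.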

Re: ClosureCleaner slowing down Spark SQL queries

2015-05-30 Thread Nitin Goyal
Thanks Josh and Yin. Created following JIRA for the same :- https://issues.apache.org/jira/browse/SPARK-7970 Thanks -Nitin -- View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/ClosureCleaner-slowing-down-Spark-SQL-queries-tp12466p12515.html Sent from the

Re: ClosureCleaner slowing down Spark SQL queries

2015-05-27 Thread Nitin Goyal
Hi Ted, Thanks a lot for replying. First of all, moving to 1.4.0 RC2 is not easy for us, as the migration cost is big since a lot has changed in Spark SQL since 1.2. Regarding SPARK-7233, I had already looked at it a few hours back, and it solves the problem for concurrent queries, but my problem is just fo

ClosureCleaner slowing down Spark SQL queries

2015-05-27 Thread Nitin Goyal
Hi All, I am running a SQL query (Spark version 1.2) on a table created from a unionAll of 3 schema RDDs, which gets executed in roughly 400ms (200ms at the driver and roughly 200ms at the executors). If I run the same query on a table created from a unionAll of 27 schema RDDs, I see that the executors' time is the same (b