SPARK-8813 - combining small files in spark sql

2016-07-06 Thread Ajay Srivastava
Hi, this JIRA https://issues.apache.org/jira/browse/SPARK-8813 is fixed in Spark 2.0, but the resolution is not mentioned there. In our use case, there are big as well as many small Parquet files being queried using Spark SQL. Can someone please explain what the fix is and how I can use it…
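A minimal sketch of what the Spark 2.0 behavior looks like in practice, assuming Spark 2.0+ and illustrative values and paths: the file-based data sources pack many small files into a single read task, and the two spark.sql.files.* settings below control how aggressively files are combined.

    import org.apache.spark.sql.SparkSession

    // maxPartitionBytes caps how many bytes one read task may cover;
    // openCostInBytes is the estimated cost of opening a file, so raising it
    // discourages packing too many tiny files into a single task.
    val spark = SparkSession.builder()
      .appName("small-files-sketch")
      .config("spark.sql.files.maxPartitionBytes", 128L * 1024 * 1024)
      .config("spark.sql.files.openCostInBytes", 4L * 1024 * 1024)
      .getOrCreate()

    val df = spark.read.parquet("/data/mixed-size-parquet") // hypothetical path
    println(df.rdd.getNumPartitions) // typically far fewer partitions than input files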

Re: Data locality across jobs

2015-04-03 Thread Ajay Srivastava
You can read the same partition from every hour's output, union these RDDs, and then repartition them into a single partition. This is done for all partitions one by one. It may not necessarily improve performance; it will depend on the size of spills in the job when all the data was processed together.
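A minimal sketch of the union-and-repartition idea, with hypothetical output paths; the thread's variant repeats this per partition index rather than over whole hourly outputs.

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("hourly-union-sketch"))

    // Hypothetical hourly output directories.
    val hourlyPaths = (0 until 24).map(h => f"/output/2015-04-02/hour=$h%02d")
    val hourlyRdds = hourlyPaths.map(path => sc.textFile(path))

    // Union the per-hour RDDs and collapse them into a single partition.
    val combined = sc.union(hourlyRdds).repartition(1)
    combined.saveAsTextFile("/output/2015-04-02/combined")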

Re: Some tasks are taking long time

2015-01-15 Thread Ajay Srivastava
spark.apache.org/docs/latest/tuning.html#serialized-rdd-storage Cheers, Nicos. On Jan 15, 2015, at 6:49 AM, Ajay Srivastava wrote: Thanks RK. I can turn on speculative execution, but I am trying to find out the actual reason for the delay, as it happens on any node. Any idea about the stack trace in my pr…
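A minimal sketch of the serialized-storage suggestion behind that tuning-guide link, with illustrative paths; storing cached partitions as serialized bytes (optionally with Kryo) reduces GC pressure at the cost of extra CPU.

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.storage.StorageLevel

    val conf = new SparkConf()
      .setAppName("serialized-storage-sketch")
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    val sc = new SparkContext(conf)

    // Cache the RDD as serialized bytes instead of deserialized Java objects.
    val events = sc.textFile("/data/events") // hypothetical path
    events.persist(StorageLevel.MEMORY_ONLY_SER)
    events.count()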

Re: Some tasks are taking long time

2015-01-15 Thread Ajay Srivastava
…or a particular stage. spark.speculation.multiplier (default 1.5): how many times slower a task is than the median to be considered for speculation. On Thursday, January 15, 2015 5:44 AM, Ajay Srivastava wrote: Hi, my Spark job is taking a long time. I see that some tasks are tak…
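A minimal sketch of turning on speculative execution as discussed above; spark.speculation.multiplier 1.5 is the default quoted in the thread, and the quantile value here is illustrative.

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("speculation-sketch")
      .set("spark.speculation", "true")            // re-launch suspiciously slow tasks
      .set("spark.speculation.multiplier", "1.5")  // slow = 1.5x the median task time
      .set("spark.speculation.quantile", "0.75")   // fraction of tasks done before checking
    val sc = new SparkContext(conf)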

Some tasks are taking long time

2015-01-15 Thread Ajay Srivastava
Hi, my Spark job is taking a long time. I see that some tasks are taking longer for the same amount of data and shuffle read/write. What could be the possible reasons for this? The thread dump sometimes shows that all the tasks in an executor are waiting with the following stack trace: "Executor task…

Re: Creating RDD from only few columns of a Parquet file

2015-01-13 Thread Ajay Srivastava
Setting spark.sql.hive.convertMetastoreParquet to true has fixed this. Regards, Ajay. On Tuesday, January 13, 2015 11:50 AM, Ajay Srivastava wrote: Hi, I am trying to read a Parquet file using - val parquetFile = sqlContext.parquetFile("people.parquet"). There is no way…
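A minimal sketch of applying the setting that resolved this thread, assuming a Hive metastore Parquet table; the table and column names are hypothetical. With the native Parquet reader enabled, only the selected columns are read from disk.

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.hive.HiveContext

    val sc = new SparkContext(new SparkConf().setAppName("convert-metastore-parquet"))
    val hiveContext = new HiveContext(sc)

    // Use Spark SQL's native Parquet support for Hive metastore Parquet tables.
    hiveContext.setConf("spark.sql.hive.convertMetastoreParquet", "true")

    // Only the selected columns are scanned, thanks to Parquet column pruning.
    val df = hiveContext.sql("SELECT name, age, city FROM people_parquet")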

Creating RDD from only few columns of a Parquet file

2015-01-12 Thread Ajay Srivastava
Hi, I am trying to read a Parquet file using - val parquetFile = sqlContext.parquetFile("people.parquet"). There is no way to specify that I am interested in reading only some columns from disk. For example, if the Parquet file has 10 columns and I want to read only 3 columns from disk. We have don…
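A minimal sketch of reading only a few columns with the Spark 1.x API used in the thread; the column names are hypothetical, and Parquet's columnar layout means unselected columns are never read from disk.

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    val sc = new SparkContext(new SparkConf().setAppName("parquet-column-pruning"))
    val sqlContext = new SQLContext(sc)

    // Register the Parquet file as a table and project only the needed columns.
    val parquetFile = sqlContext.parquetFile("people.parquet")
    parquetFile.registerTempTable("people")
    val threeColumns = sqlContext.sql("SELECT name, age, city FROM people")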

Spark summit 2014 videos ?

2014-07-10 Thread Ajay Srivastava
Hi, I did not find any videos on the Apache Spark channel on YouTube yet. Any idea when these will be made available? Regards, Ajay

Re: OFF_HEAP storage level

2014-07-04 Thread Ajay Srivastava
Thanks Jerry. It looks like a good option; I will try it. Regards, Ajay. On Friday, July 4, 2014 2:18 PM, "Shao, Saisai" wrote: Hi Ajay, StorageLevel OFF_HEAP means you can cache your RDD into Tachyon; the prerequisite is that you deploy Tachyon alongside Spark. Yes, it can alleviate…
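A minimal sketch of OFF_HEAP persistence in the Spark 1.x era discussed here, assuming a Tachyon deployment; the Tachyon URL, config name, and paths are illustrative. Blocks stored this way live outside the executor heap, so they are not subject to executor GC.

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.storage.StorageLevel

    val conf = new SparkConf()
      .setAppName("off-heap-sketch")
      .set("spark.tachyonStore.url", "tachyon://tachyon-master:19998") // assumes Tachyon is running
    val sc = new SparkContext(conf)

    val rdd = sc.textFile("/data/input").map(_.toUpperCase)
    rdd.persist(StorageLevel.OFF_HEAP) // blocks are written to Tachyon, not the JVM heap
    rdd.count()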

OFF_HEAP storage level

2014-07-03 Thread Ajay Srivastava
Hi, I was checking the different storage levels of an RDD and found OFF_HEAP. Has anybody used this level? If I use this level, where will the data be stored? If not in the heap, does it mean that we can avoid GC? How can I use this level? I did not find anything in the archive regarding this. Can someone also…

Re: Join : Giving incorrect result

2014-06-06 Thread Ajay Srivastava
…apache/spark/pull/986. Feel free to try that if you'd like; it will also be in 0.9.2 and 1.0.1. Matei. On Jun 5, 2014, at 12:19 AM, Ajay Srivastava wrote: Sorry for replying late. It was night here. Lian/Matei, here is the code…

Re: Join : Giving incorrect result

2014-06-05 Thread Ajay Srivastava
…em, it would be great if you could post the code for the program. Matei. On Jun 4, 2014, at 12:58 PM, Xu (Simon) Chen wrote: Maybe your two workers have different assembly jar files? I just ran into a similar problem where my spark-shell was using a different jar file than my workers…

Join : Giving incorrect result

2014-06-04 Thread Ajay Srivastava
Hi, I am doing a join of two RDDs which gives different results (counting the number of records) each time I run this code on the same input. The input files are large enough to be divided into two splits. When the program runs on two workers with a single core assigned to each, the output is consistent and…
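A minimal sketch of the join-and-count pattern described above, with hypothetical paths and key extraction; in the thread, the final count varied between runs on the same input.

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("join-count-sketch"))

    // Key each record on its first comma-separated field (illustrative).
    val left  = sc.textFile("/data/left").map(line => (line.split(",")(0), line))
    val right = sc.textFile("/data/right").map(line => (line.split(",")(0), line))

    val joined = left.join(right)
    println(joined.count())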