Hi,
This JIRA https://issues.apache.org/jira/browse/SPARK-8813 is fixed in Spark 2.0, but the resolution is not mentioned there.
In our use case, there are big as well as many small parquet files which are being queried using Spark SQL. Can someone please explain what the fix is and how I can use it?
You can read the same partition from every hour's output, union these RDDs and then repartition them into a single partition. This is done for all partitions, one by one. It may not necessarily improve performance; that will depend on the size of the spills in the job when all the data was processed together.
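A minimal sketch of the union-then-repartition approach described above, assuming plain text output and hypothetical HDFS paths:

    // Hypothetical hourly output directories for one logical partition.
    val hourlyDirs = Seq(
      "hdfs:///output/2015-01-15-00/part=0",
      "hdfs:///output/2015-01-15-01/part=0")
    // Read each hour's data, union the RDDs, and collapse them into a single partition.
    val hourlyRdds = hourlyDirs.map(dir => sc.textFile(dir))
    val merged = sc.union(hourlyRdds).repartition(1)
    merged.saveAsTextFile("hdfs:///output/merged/part=0")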
https://spark.apache.org/docs/latest/tuning.html#serialized-rdd-storage
Cheers,
- Nicos
On Jan 15, 2015, at 6:49 AM, Ajay Srivastava
wrote:
Thanks RK. I can turn on speculative execution, but I am trying to find out the actual reason for the delay, as it happens on any node. Any idea about the stack trace in my previous mail?
| spark.speculation.quantile | 0.75 | Percentage of tasks which must be complete before speculation is enabled for a particular stage. |
| spark.speculation.multiplier | 1.5 | How many times slower a task is than the median to be considered for speculation. |
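For reference, a hedged sketch of turning speculation on when building the job; the keys are the standard Spark configuration properties and the multiplier shown is just the documented default:

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .set("spark.speculation", "true")            // re-launch tasks that run much slower than the median
      .set("spark.speculation.multiplier", "1.5")  // how many times slower than the median before speculating
    val sc = new SparkContext(conf)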
On Thursday, January 15, 2015 5:44 AM, Ajay Srivastava
wrote:
Hi,
My spark job is taking a long time. I see that some tasks are taking longer for the same amount of data and shuffle read/write.
Hi,
My spark job is taking a long time. I see that some tasks are taking longer for the same amount of data and shuffle read/write. What could be the possible reasons for it?
The thread dump sometimes shows that all the tasks in an executor are waiting with the following stack trace -
"Executor task
Setting spark.sql.hive.convertMetastoreParquet to true has fixed this.
Regards,
Ajay
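A hedged sketch of setting that flag, assuming a Spark 1.x HiveContext bound to the name sqlContext (it can also be passed with --conf on the command line):

    // Let Spark SQL use its native Parquet support for Hive metastore Parquet tables.
    sqlContext.setConf("spark.sql.hive.convertMetastoreParquet", "true")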
On Tuesday, January 13, 2015 11:50 AM, Ajay Srivastava
wrote:
Hi,
I am trying to read a parquet file using -
val parquetFile = sqlContext.parquetFile("people.parquet")
There is no way
Hi,
I am trying to read a parquet file using -
val parquetFile = sqlContext.parquetFile("people.parquet")
There is no way to specify that I am interested in reading only some columns from disk. For example, if the parquet file has 10 columns and I want to read only 3 columns from disk.
We have don
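A minimal sketch of how this can be expressed through Spark SQL in the 1.x API (the column names here are hypothetical); selecting only the needed columns lets the Parquet support prune the read down to those columns:

    val parquetFile = sqlContext.parquetFile("people.parquet")
    parquetFile.registerTempTable("people")
    // Only the projected columns (hypothetical names) need to be read from the Parquet file.
    val threeCols = sqlContext.sql("SELECT name, age, city FROM people")
    threeCols.collect().foreach(println)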
Hi,
I did not find any videos on the Apache Spark channel on YouTube yet.
Any idea when these will be made available ?
Regards,
Ajay
Thanks Jerry.
It looks like a good option, will try it.
Regards,
Ajay
On Friday, July 4, 2014 2:18 PM, "Shao, Saisai" wrote:
Hi Ajay,
StorageLevel OFF_HEAP means you can cache your RDD in Tachyon; the prerequisite is that you deploy Tachyon alongside Spark.
Yes, it can alleviate
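A hedged sketch of persisting an RDD off-heap with the Spark 1.x API (assumes a Tachyon deployment reachable from the cluster; someRdd is a placeholder):

    import org.apache.spark.storage.StorageLevel

    // Data is serialized and stored in Tachyon instead of the JVM heap,
    // which reduces GC pressure on the executors.
    val cached = someRdd.persist(StorageLevel.OFF_HEAP)
    cached.count()  // materializes the RDD into the off-heap store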
Hi,
I was checking the different storage levels of an RDD and found OFF_HEAP.
Has anybody used this level?
If I use this level, where will the data be stored? If not in the heap, does it mean
that we can avoid GC?
How can I use this level? I did not find anything in the archive regarding this.
Can someone also
https://github.com/apache/spark/pull/986. Feel free to try that if you’d like;
it will also be in 0.9.2 and 1.0.1.
>
> Matei
>
> On Jun 5, 2014, at 12:19 AM, Ajay Srivastava wrote:
>
>> Sorry for replying late. It was night here.
>>
>> Lian/Matei,
>> Here is the code
it would be great if you could post the code for the program.
Matei
On Jun 4, 2014, at 12:58 PM, Xu (Simon) Chen wrote:
Maybe your two workers have different assembly jar files?
>I just ran into a similar problem where my spark-shell was using a different jar
>file than my workers
Hi,
I am doing a join of two RDDs which gives different results (counting the number of
records) each time I run this code on the same input.
The input files are large enough to be divided into two splits. When the program
runs on two workers with a single core assigned to each, the output is consistent
and
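For context, a hedged sketch of the kind of join being described (input paths and key extraction are hypothetical); on the same input, the count of joined records is expected to be identical across runs:

    // Build two pair RDDs keyed on the first comma-separated field (hypothetical layout).
    val left  = sc.textFile("hdfs:///input/left").map(line => (line.split(",")(0), line))
    val right = sc.textFile("hdfs:///input/right").map(line => (line.split(",")(0), line))
    val joined = left.join(right)
    println(joined.count())  // should not change between runs on the same input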