Hey Yiannis,
If you just perform a count on each "name", "date" pair... can it succeed?
If so, can you do a count and then order by to find the largest one?
I'm wondering if there is a single pathologically large group here that is
somehow causing OOM.
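Something along these lines should work as a quick check (a sketch; the temp table name "logs" and the LIMIT are my assumptions, not from this thread):

df.registerTempTable("logs")  // hypothetical table name
val top = sqlContext.sql(
  "SELECT name, date, COUNT(*) AS cnt FROM logs GROUP BY name, date ORDER BY cnt DESC LIMIT 10")
top.collect().foreach(println)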
Also, to be clear, are you getting a "GC overhead limit exceeded" error?
Have you tried to repartition() your original data to make more partitions
before you aggregate?
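For example (a sketch; df stands for your original DataFrame, and 1000 is an arbitrary partition count, not a recommendation):

val repartitioned = df.repartition(1000)  // more, smaller partitions before the shuffle
repartitioned.groupBy("name", "date").count()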
--
Martin Goodson | VP Data Science
(0)20 3397 1240
On Mon, Mar 23, 2015 at 4:12 PM, Yiannis Gkoufas wrote:
Hi Yin,
Yes, I have set spark.executor.memory to 8g and the worker memory to 16g
without any success.
I cannot figure out how to increase the number of mapPartitions tasks.
Thanks a lot
On 20 March 2015 at 18:44, Yin Huai wrote:
spark.sql.shuffle.partitions only controls the number of tasks in the second
stage (the number of reducers). For your case, I'd say that the number of
tasks in the first stage (the number of mappers) will be the number of files
you have.
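One quick way to confirm how many map tasks you will get (my sketch; the path is a placeholder):

val df = sqlContext.parquetFile("/path/to/parquet")  // placeholder path
println(df.rdd.partitions.size)  // roughly the number of map tasks in the first stage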
Actually, have you changed "spark.executor.memory" (it controls the memory used by an executor)?
Actually I realized that the correct way is:
sqlContext.sql("set spark.sql.shuffle.partitions=1000")
but I am still experiencing the same behavior/error.
On 20 March 2015 at 16:04, Yiannis Gkoufas wrote:
> Hi Yin,
>
> the way I set the configuration is:
>
> val sqlContext = new org.apache.spar
Hi Yin,
the way I set the configuration is:
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
sqlContext.setConf("spark.sql.shuffle.partitions","1000");
That is the correct way, right?
In the mapPartitions task (the first task which is launched), I again get
the same number of tasks and again the same error.
Hi Yin,
thanks a lot for that! Will give it a shot and let you know.
On 19 March 2015 at 16:30, Yin Huai wrote:
Was the OOM thrown during the execution of first stage (map) or the second
stage (reduce)? If it was the second stage, can you increase the value
of spark.sql.shuffle.partitions and see if the OOM disappears?
This setting controls the number of reducers Spark SQL will use and the
default is 200. Maybe a larger value will help.
Hi Yin,
Thanks for your feedback. I have 1700 parquet files, sized 100MB each. The
number of tasks launched is equal to the number of parquet files. Do you
have any idea on how to deal with this situation?
Thanks a lot
On 18 Mar 2015 17:35, "Yin Huai" wrote:
Seems there are too many distinct groups processed in a task, which triggers
the problem.
How many files does your dataset have and how large is each file? Seems your
query will be executed with two stages: table scan and map-side aggregation
in the first stage, and the final round of reduce-side aggregation in the
second stage.
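You can see the two stages in the physical plan (a sketch; this assumes the groupBy on "name" and "date" mentioned elsewhere in the thread):

df.groupBy("name", "date").count().explain()
// The plan shows a partial (map-side) Aggregate, an Exchange (the shuffle),
// and then the final (reduce-side) Aggregate.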
Hi there, I set the executor memory to 8g but it didn't help
On 18 March 2015 at 13:59, Cheng Lian wrote:
You should probably increase executor memory by setting
"spark.executor.memory".
Full list of available configurations can be found here
http://spark.apache.org/docs/latest/configuration.html
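Two common ways to set it (a sketch; 8g is only an example value):

// In code, before creating the SparkContext:
val conf = new org.apache.spark.SparkConf().set("spark.executor.memory", "8g")
val sc = new org.apache.spark.SparkContext(conf)
// Or at submit time:
//   spark-submit --conf spark.executor.memory=8g ...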
Cheng
On 3/18/15 9:15 PM, Yiannis Gkoufas wrote:
Hi there,
I was trying the new DataFrame API with a groupBy aggregation over a parquet
dataset, but I keep running into OOM errors.
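Based on the rest of the thread, the job was presumably something like the following (my reconstruction, not the original code; the path and the exact aggregation are assumptions drawn from the other messages):

val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val df = sqlContext.parquetFile("/path/to/data")  // ~1700 parquet files of ~100MB each, per the thread
val counts = df.groupBy("name", "date").count()
counts.show()  // the first (map) stage fails with an OOM / GC overhead limit exceeded error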