hi,
Based on my testing, the memory cost is very different for
1. sql("select * from ...").groupBy(...).agg(...)
2. sql("select ... from ... group by ...")
For a table partition larger than 500 GB, #2 runs fine, while #1 hits an
OutOfMemoryError. I am using the same Spark configuration for both.
Could somebody tell me why?
They should be identical. Can you paste the detailed explain output?
On Thursday, March 10, 2016, FangFang Chen wrote:
> hi,
> Based on my testing, the memory cost is very different for
> 1. sql("select * from ...").groupBy(...).agg(...)
> 2. sql("select ... from ... group by ...")
>
> For table.partition
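For reference, a minimal way to capture both plans for comparison (table name t and columns k, v are made-up placeholders, using the 1.6-era sqlContext API); after optimization the two forms should in principle produce the same physical plan:

  import org.apache.spark.sql.functions.sum

  // 1. Select everything first, then aggregate on the DataFrame side.
  val df1 = sqlContext.sql("SELECT * FROM t")
    .groupBy("k")
    .agg(sum("v").as("total"))

  // 2. Push the aggregation into the SQL statement itself.
  val df2 = sqlContext.sql("SELECT k, sum(v) AS total FROM t GROUP BY k")

  // Print the extended (logical + physical) plans for each form.
  df1.explain(true)
  df2.explain(true)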
Hi,
Can this be considered a lag in the processing of events?
Should we report this as a delay?
On Thu, Mar 10, 2016 at 10:51 AM, Mario Ds Briggs wrote:
> Look at
> org.apache.spark.streaming.scheduler.JobGenerator
>
> it has a RecurringTimer (timer) that will simply post 'JobGenerate'
> events to an EventLoop
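For what it's worth, one way to decide whether to report it as delay is to read the per-batch delays Spark itself already tracks via a StreamingListener (ssc below is assumed to be your existing StreamingContext):

  import org.apache.spark.streaming.scheduler.{StreamingListener, StreamingListenerBatchCompleted}

  ssc.addStreamingListener(new StreamingListener {
    override def onBatchCompleted(batch: StreamingListenerBatchCompleted): Unit = {
      val info = batch.batchInfo
      // schedulingDelay: how long the batch waited before processing started
      // processingDelay: how long the batch actually took to process
      println(s"batch ${info.batchTime}: scheduling delay = ${info.schedulingDelay.getOrElse(-1L)} ms, " +
        s"processing delay = ${info.processingDelay.getOrElse(-1L)} ms")
    }
  })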
Hi TD,
Thanks a lot for offering to look at our PR (if we file one) at the
conference in NYC.
As we briefly discussed, to deal with the issues of unbalanced and
under-distributed Kafka partitions when developing Spark Streaming
applications in Mobius (C# for Spark), we're trying the option of
repartitioning within
The central problem with doing anything like this is that you break
one of the basic guarantees of Kafka, which is in-order processing on
a per-topic-partition basis.
As far as PRs go, because of the new consumer interface for Kafka 0.9
and 0.10, there's a lot of potential change already underway.
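To make the trade-off concrete, a rough sketch (broker address, topic name and partition count are placeholders; ssc is an existing StreamingContext): with the direct stream each Kafka topic-partition maps to exactly one Spark partition and stays in order, while repartition() rebalances the skew but shuffles records arbitrarily, which is exactly the ordering guarantee that gets broken:

  import kafka.serializer.StringDecoder
  import org.apache.spark.streaming.kafka.KafkaUtils

  val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")

  // One Spark partition per Kafka topic-partition; records stay in order within it.
  val ordered = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
    ssc, kafkaParams, Set("events"))

  // Rebalances skewed partitions across more tasks, but loses per-topic-partition ordering.
  val rebalanced = ordered.repartition(64)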
Spark 1.6.1 is a maintenance release containing stability fixes. This
release is based on the branch-1.6 maintenance branch of Spark. We
*strongly recommend* that all 1.6.0 users upgrade to this release.
Notable fixes include:
- Workaround for OOM when writing large partitioned tables SPARK-12546
<
Hi everyone,
I have a question about the shuffle mechanisms in Spark and the fault-tolerance
I should expect. Suppose I have a simple job with two stages – something like
sc.textFile().mapToPair().reduceByKey().saveAsTextFile() (sketched below).
The questions I have are:
1. Suppose I’m not using the external shuffle service
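Roughly the job being described, in Scala (paths and the keying logic are placeholders; mapToPair in the Java API is just a map to a pair here). The map side of reduceByKey closes stage 1, and the shuffle read plus saveAsTextFile form stage 2:

  val counts = sc.textFile("hdfs:///input")
    .map(line => (line.split("\t")(0), 1L)) // key by the first field
    .reduceByKey(_ + _)                     // shuffle boundary between the two stages
  counts.saveAsTextFile("hdfs:///output")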
Could you provide more details about:
1. Data set size (# ratings, # users and # products)
2. Spark cluster setup and version
Thanks
On Fri, 11 Mar 2016 at 05:53 Deepak Gopalakrishnan wrote:
> Hello All,
>
> I've been running Spark's ALS on a dataset of users and rated items. I
> first encode
1. I'm using about 1 million users against a few thousand products. I
have around a million ratings in total.
2. Spark 1.6 on Amazon EMR
On Fri, Mar 11, 2016 at 12:46 PM, Nick Pentreath wrote:
> Could you provide more details about:
> 1. Data set size (# ratings, # users and # products)
> 2. Spark
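For context, a rough sketch of an ALS run at that scale with spark.mllib as it ships in Spark 1.6 (the file path, rank, iteration count and regularization are placeholders, not the poster's actual settings):

  import org.apache.spark.mllib.recommendation.{ALS, Rating}

  val ratings = sc.textFile("s3://bucket/ratings.csv").map { line =>
    val Array(user, product, rating) = line.split(",")
    Rating(user.toInt, product.toInt, rating.toDouble)
  }.cache()

  // ~1M users x a few thousand products, ~1M ratings:
  // train(ratings, rank, iterations, lambda)
  val model = ALS.train(ratings, 10, 10, 0.01)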
Hi,
I would like to help with optimizing Spark memory usage. I have some experience
with off-heap and managed memory; for example, I modified Hazelcast to run with
'-Xmx128M' [1], and XAP from GigaSpaces uses my memory store.
I have already studied the Spark code, read blogs, watched videos, etc., but I have
questions
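As a starting point, the off-heap (Tungsten) knobs that already exist in Spark 1.6 look roughly like this; the sizes are illustrative only:

  import org.apache.spark.SparkConf

  val conf = new SparkConf()
    .set("spark.memory.offHeap.enabled", "true")
    // Off-heap pool used by Tungsten; 2 GB, given in bytes.
    .set("spark.memory.offHeap.size", (2L * 1024 * 1024 * 1024).toString)
    // On-heap executor memory is configured separately.
    .set("spark.executor.memory", "4g")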