hi,
Based on my testing, the memory cost is very different for
1. sql("select * from ...").groupBy(...).agg(...)
2. sql("select ... from ... group by ...")
For a table partition larger than 500 GB, #2 runs fine, while #1 hits an
OutOfMemoryError. I am using the same Spark configuration for both.
Could somebody tell me why?
They should be identical. Can you paste the detailed explain output?
On Thursday, March 10, 2016, FangFang Chen wrote:
> hi,
> Based on my testing, the memory cost is very different for
> 1. sql("select * from ...").groupBy(...).agg(...)
> 2. sql("select ... from ... group by ...")
>
> For table.partition
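For reference, a minimal way to capture both plans for comparison (table name t and columns k, v are made-up placeholders, using the 1.6-era sqlContext API); after optimization the two forms should in principle produce the same physical plan:

  import org.apache.spark.sql.functions.sum

  // 1. Select everything first, then aggregate on the DataFrame side.
  val df1 = sqlContext.sql("SELECT * FROM t")
    .groupBy("k")
    .agg(sum("v").as("total"))

  // 2. Push the aggregation into the SQL statement itself.
  val df2 = sqlContext.sql("SELECT k, sum(v) AS total FROM t GROUP BY k")

  // Print the extended (logical + physical) plans for each form.
  df1.explain(true)
  df2.explain(true)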
Hi,
Can this be considered a lag in the processing of events?
Should we report this as a delay?
On Thu, Mar 10, 2016 at 10:51 AM, Mario Ds Briggs wrote:
> Look at
> org.apache.spark.streaming.scheduler.JobGenerator
>
> it has a RecurringTimer (timer) that will simply post 'JobGenerate'
> events to an EventLoop
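For what it's worth, one way to decide whether to report it as delay is to read the per-batch delays Spark itself already tracks via a StreamingListener (ssc below is assumed to be your existing StreamingContext):

  import org.apache.spark.streaming.scheduler.{StreamingListener, StreamingListenerBatchCompleted}

  ssc.addStreamingListener(new StreamingListener {
    override def onBatchCompleted(batch: StreamingListenerBatchCompleted): Unit = {
      val info = batch.batchInfo
      // schedulingDelay: how long the batch waited before processing started
      // processingDelay: how long the batch actually took to process
      println(s"batch ${info.batchTime}: scheduling delay = ${info.schedulingDelay.getOrElse(-1L)} ms, " +
        s"processing delay = ${info.processingDelay.getOrElse(-1L)} ms")
    }
  })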
Hi TD,
Thanks a lot for offering to look at our PR (if we file one) at the
conference in NYC.
As we briefly discussed, to deal with the issues of unbalanced and
under-distributed Kafka partitions when developing Spark Streaming
applications in Mobius (C# for Spark), we're trying the option of
repartitioning within
The central problem with doing anything like this is that you break
one of the basic guarantees of Kafka, which is in-order processing on
a per-topic-partition basis.
As far as PRs go, because of the new consumer interface for Kafka 0.9
and 0.10, there's a lot of potential change already underway.
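To make the trade-off concrete, a rough sketch (broker address, topic name and partition count are placeholders; ssc is an existing StreamingContext): with the direct stream each Kafka topic-partition maps to exactly one Spark partition and stays in order, while repartition() rebalances the skew but shuffles records arbitrarily, which is exactly the ordering guarantee that gets broken:

  import kafka.serializer.StringDecoder
  import org.apache.spark.streaming.kafka.KafkaUtils

  val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")

  // One Spark partition per Kafka topic-partition; records stay in order within it.
  val ordered = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
    ssc, kafkaParams, Set("events"))

  // Rebalances skewed partitions across more tasks, but loses per-topic-partition ordering.
  val rebalanced = ordered.repartition(64)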
Spark 1.6.1 is a maintenance release containing stability fixes. This
release is based on the branch-1.6 maintenance branch of Spark. We
*strongly recommend* that all 1.6.0 users upgrade to this release.
Notable fixes include:
- Workaround for OOM when writing large partitioned tables SPARK-12546
<
Hi everyone,
I have a question about the shuffle mechanisms in Spark and the fault-tolerance
I should expect. Suppose I have a simple job with two stages – something like
sc.textFile().mapToPair().reduceByKey().saveAsTextFile() (sketched below).
The questions I have are:
1. Suppose I’m not using the external shuffle service
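Roughly the job being described, in Scala (paths and the keying logic are placeholders; mapToPair in the Java API is just a map to a pair here). The map side of reduceByKey closes stage 1, and the shuffle read plus saveAsTextFile form stage 2:

  val counts = sc.textFile("hdfs:///input")
    .map(line => (line.split("\t")(0), 1L)) // key by the first field
    .reduceByKey(_ + _)                     // shuffle boundary between the two stages
  counts.saveAsTextFile("hdfs:///output")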
Could you provide more details about:
1. Data set size (# ratings, # users and # products)
2. Spark cluster setup and version
Thanks
On Fri, 11 Mar 2016 at 05:53 Deepak Gopalakrishnan wrote:
> Hello All,
>
> I've been running Spark's ALS on a dataset of users and rated items. I
> first encode
1. I'm using about 1 million users against a few thousand products. I
have around a million ratings in total.
2. Spark 1.6 on Amazon EMR
On Fri, Mar 11, 2016 at 12:46 PM, Nick Pentreath wrote:
> Could you provide more details about:
> 1. Data set size (# ratings, # users and # products)
> 2. Spark
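For context, a rough sketch of an ALS run at that scale with spark.mllib as it ships in Spark 1.6 (the file path, rank, iteration count and regularization are placeholders, not the poster's actual settings):

  import org.apache.spark.mllib.recommendation.{ALS, Rating}

  val ratings = sc.textFile("s3://bucket/ratings.csv").map { line =>
    val Array(user, product, rating) = line.split(",")
    Rating(user.toInt, product.toInt, rating.toDouble)
  }.cache()

  // ~1M users x a few thousand products, ~1M ratings:
  // train(ratings, rank, iterations, lambda)
  val model = ALS.train(ratings, 10, 10, 0.01)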
Hi,
I would like to help with optimizing Spark memory usage. I have some experience
with off-heap and managed memory; for example, I modified Hazelcast to run with
'-Xmx128M' [1], and XAP from GigaSpaces uses my memory store.
I have already studied the Spark code, read blogs, watched videos, etc., but I have
questions
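As a starting point, the off-heap (Tungsten) knobs that already exist in Spark 1.6 look roughly like this; the sizes are illustrative only:

  import org.apache.spark.SparkConf

  val conf = new SparkConf()
    .set("spark.memory.offHeap.enabled", "true")
    // Off-heap pool used by Tungsten; 2 GB, given in bytes.
    .set("spark.memory.offHeap.size", (2L * 1024 * 1024 * 1024).toString)
    // On-heap executor memory is configured separately.
    .set("spark.executor.memory", "4g")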