Hi Santhosh,
My name is not Bipin, it's Biplob, as is clear from my signature.
Regarding your question, I have no clue what your map operation is doing on
the grouped data, so I can only suggest you do:
dd = hive_context.read.orc(orcfile_dir).rdd.map(lambda x:
(x[0],x)).reduceByKey
but what would you pass to reduceByKey() in this context?
Santhosh
On Mon, Aug 6, 2018 at 1:20 AM, Biplob Biswas
wrote:
> Hi Santhosh,
>
> If you are not performing any aggregation, then I don't think you can
> replace your groupbykey with a reducebykey, and as I see you are only
> grouping and taking 2
Hi Santhosh,
If you are not performing any aggregation, then I don't think you can
replace your groupByKey with a reduceByKey. As I see it, you are only
grouping and taking 2 values of the result, so you can't simply swap
your groupByKey for a reduceByKey.
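For illustration, a minimal sketch in Scala notation (the same shape applies in
PySpark; sc is the usual shell SparkContext) of why a straight swap doesn't help:

val pairs = sc.parallelize(Seq((1, "a"), (1, "b"), (2, "c")))

// groupByKey: all values for a key are collected into one iterable
val grouped = pairs.groupByKey()

// the closest reduceByKey equivalent has to build those per-key collections itself,
// so it shuffles roughly the same amount of data and buys you nothing here
val viaReduce = pairs.mapValues(v => List(v)).reduceByKey(_ ++ _)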
Thanks & Regards
I am trying to replace groupByKey() with reduceByKey(). I am a PySpark and
Python newbie, and I am having a hard time figuring out the lambda function
for the reduceByKey() operation.
Here is the code
dd = hive_context.read.orc(orcfile_dir).rdd.map(lambda x:
(x[0],x)).groupByKey(25).take(2)
Here
Yeah, you are right. I ran the experiments locally not on YARN.
On Fri, Jul 27, 2018 at 11:54 PM, Vadim Semenov wrote:
> `spark.worker.cleanup.enabled=true` doesn't work for YARN.
> On Fri, Jul 27, 2018 at 8:52 AM dineshdharme
> wrote:
> >
> > I am trying to d
`spark.worker.cleanup.enabled=true` doesn't work for YARN.
On Fri, Jul 27, 2018 at 8:52 AM dineshdharme wrote:
>
> I am trying to do few (union + reduceByKey) operations on a hiearchical
> dataset in a iterative fashion in rdd. The first few loops run fine but on
> the subs
I am trying to do a few (union + reduceByKey) operations on a hierarchical
dataset in an iterative fashion in RDDs. The first few loops run fine, but on
the subsequent loops the operations end up using the whole scratch space
provided to them.
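A minimal sketch of the kind of loop I mean (toy data and names, not the actual job):

import org.apache.spark.rdd.RDD

var acc: RDD[(String, Long)] = sc.parallelize(Seq.empty[(String, Long)])
for (i <- 1 to 5) {
  val level = sc.parallelize(Seq(("node" + i, i.toLong)))
  // each pass unions the new level in and re-aggregates; every reduceByKey is a new
  // shuffle, so shuffle files keep accumulating in the scratch space across iterations
  acc = acc.union(level).reduceByKey(_ + _)
}
acc.count()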
I have set the spark scratch directory, i.e
Wed, Oct 25, 2017 at 5:52 PM, Piyush Mukati
>> wrote:
>>
>>> Hi,
>>> we are migrating some jobs from Dstream to Structured Stream.
>>>
>>> Currently to handle aggregations we call map and reducebyKey on each RDD
>>> like
>> >>> rdd.map(event =>
>> Hi,
>> we are migrating some jobs from Dstream to Structured Stream.
>>
>> Currently to handle aggregations we call map and reducebyKey on each RDD
>> like
>> rdd.map(event => (event._1, event)).reduceByKey((a, b) => merge(a, b))
>>
>> The fin
Oct 25, 2017 at 5:52 PM, Piyush Mukati
wrote:
> Hi,
> we are migrating some jobs from Dstream to Structured Stream.
>
> Currently to handle aggregations we call map and reducebyKey on each RDD
> like
> rdd.map(event => (event._1, event)).reduceByKey((a, b) => merge(a, b))
>
Hi,
we are migrating some jobs from DStream to Structured Streaming.
Currently, to handle aggregations, we call map and reduceByKey on each RDD,
like
rdd.map(event => (event._1, event)).reduceByKey((a, b) => merge(a, b))
The final output of each RDD is merged to the sink with support for
aggre
Hi Stephen,
If you use aggregate functions or reduceGroups on a KeyValueGroupedDataset, it
behaves like reduceByKey on an RDD.
Only if you use flatMapGroups or mapGroups does it behave like groupByKey on an
RDD, and if you read the API documentation it warns about using that API.
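A minimal sketch of the distinction (the Event case class and the SparkSession
named spark are placeholders, not from the original job):

import spark.implicits._

case class Event(key: String, value: Long)
val events = Seq(Event("a", 1L), Event("a", 2L), Event("b", 3L)).toDS()

// reduceGroups behaves like reduceByKey: values are merged pairwise, with partial aggregation
val reduced = events.groupByKey(_.key).reduceGroups((a, b) => Event(a.key, a.value + b.value))

// mapGroups behaves like groupByKey on an RDD: the whole group is handed to you as an iterator
val perGroup = events.groupByKey(_.key).mapGroups((k, it) => (k, it.map(_.value).sum))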
Hope this helps.
Thanks
Ankur
On
Are there plans to add reduceByKey to DataFrames? Since switching over to
Spark 2 I find myself increasingly dissatisfied with the idea of converting
DataFrames to RDDs to do procedural programming on grouped data (both from an
ease-of-programming stance and a performance stance). So I've been
The job is split into two stages, flatMap() and count(). When counting
Tuples, flatMap() takes about 6s and count() takes about 2s, while when
counting Longs, flatMap() takes 18s and count() takes 10s.
I haven't looked into Spark's implementation of flatMap/reduceByKey, but I
guess
Hi,
I read somewhere that
groupByKey() on an RDD disables map-side aggregation, as the aggregation
function (appending to a list) does not save any space.
However, from my understanding, using something like reduceByKey or
(combineByKey + a combiner function) we could reduce the data shuffled.
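A toy sketch of what I mean (sc being the shell SparkContext):

val pairs = sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 3)))

// reduceByKey / a reducing combineByKey pre-aggregates on the map side, so at most one
// record per key per partition crosses the shuffle
val sums = pairs.combineByKey(
  (v: Int) => v,                 // createCombiner
  (acc: Int, v: Int) => acc + v, // mergeValue, runs map side
  (a: Int, b: Int) => a + b      // mergeCombiners, runs reduce side
)

// groupByKey skips the map-side combine, because appending every value to a list
// would not shrink what gets shuffled anyway
val grouped = pairs.groupByKey()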
the batch duration
>> is 5
>> >>> seconds. The applications uses Kafka DirectStream and based on the
>> feed
>> >>> source there are three streams. As given in the code snippet I am
>> doing a
>> >>> union of three streams and I am t
eam and based on the feed
> >>> source there are three streams. As given in the code snippet I am
> doing a
> >>> union of three streams and I am trying to remove the duplicate
> campaigns
> >>> received using reduceByKey based on the customer and campaignId.
d the batch duration is 5
>>> seconds. The applications uses Kafka DirectStream and based on the feed
>>> source there are three streams. As given in the code snippet I am doing a
>>> union of three streams and I am trying to remove the duplicate campaigns
>>> recei
to Campaigns based on live stock feeds and the batch duration is 5
>> seconds. The applications uses Kafka DirectStream and based on the feed
>> source there are three streams. As given in the code snippet I am doing a
>> union of three streams and I am trying to remove the duplic
trying to remove the duplicate campaigns
> received using reduceByKey based on the customer and campaignId. I could
> see lot of duplicate email being send out for the same key in the same
> batch.I was expecting reduceByKey to remove the duplicate campaigns in a
> batch based on custom
snippet I am doing a
union of three streams and I am trying to remove the duplicate campaigns
received using reduceByKey based on the customer and campaignId. I could
see a lot of duplicate emails being sent out for the same key in the same
batch. I was expecting reduceByKey to remove the duplicate campaigns in a
batch based on the customer and campaignId.
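A minimal sketch of the de-duplication I am doing, with placeholder types (shown
here on a plain pair RDD; the same map/reduceByKey calls are what I run on the
unioned stream):

case class Campaign(customer: String, campaignId: String, payload: String)

val batch = sc.parallelize(Seq(
  Campaign("c1", "k1", "a"), Campaign("c1", "k1", "b"), Campaign("c2", "k9", "c")))

val deduped = batch
  .map(c => ((c.customer, c.campaignId), c))
  .reduceByKey((a, b) => a)   // keep a single record per (customer, campaignId) within the batch
  .values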
Not really, a grouped DataFrame only provides SQL-like functions like sum and
avg (at least in 1.5).
> On 29.08.2016, at 14:56, ayan guha wrote:
>
> If you are confused because of the name of two APIs. I think DF API name
> groupBy came from SQL, but it works similarly as
If you are confused because of the names of the two APIs: I think the DataFrame
API name groupBy came from SQL, but it works similarly to reduceByKey.
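For example, a minimal sketch assuming a DataFrame df with columns "key" and "value":

import org.apache.spark.sql.functions.sum

val totals = df.groupBy("key").agg(sum("value"))

Catalyst plans this as a partial aggregation on each partition followed by a final
aggregation after the shuffle, which is why DataFrame groupBy behaves like
reduceByKey rather than like RDD groupByKey.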
On 29 Aug 2016 20:57, "Marius Soutier" wrote:
> In DataFrames (and thus in 1.5 in general) this is not possible, correct?
>
> On 11.08.201
whether the DataFrame API optimizes the code doing something
> similar to what RDD.reduceByKey does.
>
> I am using Spark 1.6.2.
>
> Regards,
> Luis
So it was indeed my merge function. I created a new result object for every
merge and it's working now.
Thanks
On Wed, Jun 22, 2016 at 3:46 PM, Nirav Patel wrote:
> PS. In my reduceByKey operation I have two mutable object. What I do is
> merge mutable2 into mutable1 and return mutable1.
PS. In my reduceByKey operation I have two mutable objects. What I do is
merge mutable2 into mutable1 and return mutable1. I read that this works for
aggregateByKey, so I thought it would work for reduceByKey as well. I might be
wrong here. Can someone verify whether this will work or be unpredictable?
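A minimal sketch of the two styles I mean, with a made-up mutable class standing
in for MyObj:

class Acc(var total: Long) extends Serializable

val pairs = sc.parallelize(Seq(("k", new Acc(1L)), ("k", new Acc(2L)), ("k", new Acc(3L))))

// style A: merge the second value into the first and return it
val mutated = pairs.reduceByKey { (a, b) => a.total += b.total; a }

// style B: build a fresh result object for every merge (the resolution reported earlier in this thread)
val fresh = pairs.reduceByKey((a, b) => new Acc(a.total + b.total))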
On
did you observe any error (on
> workers) ?
>
> Cheers
>
> On Tue, Jun 21, 2016 at 10:42 PM, Nirav Patel
> wrote:
>
>> I have an RDD[String, MyObj] which is a result of Join + Map operation.
>> It has no partitioner info. I run reduceByKey without passing any
>> Pa
For the run which returned incorrect result, did you observe any error (on
workers) ?
Cheers
On Tue, Jun 21, 2016 at 10:42 PM, Nirav Patel wrote:
> I have an RDD[String, MyObj] which is a result of Join + Map operation. It
> has no partitioner info. I run reduceByKey without passi
Hi,
Could you check the issue also occurs in v1.6.1 and v2.0?
// maropu
On Wed, Jun 22, 2016 at 2:42 PM, Nirav Patel wrote:
> I have an RDD[String, MyObj] which is a result of Join + Map operation. It
> has no partitioner info. I run reduceByKey without passing any Partitioner
> or
I have an RDD[String, MyObj] which is a result of a Join + Map operation. It
has no partitioner info. I run reduceByKey without passing any Partitioner
or partition counts. I observed that the output aggregation result for a given
key is incorrect sometimes, like 1 out of 5 times. It looks like reduce
oving RDD based operations to Dataset based
> operations. We are calling 'reduceByKey' on some pair RDDs we have. What
> would the equivalent be in the Dataset interface - I do not see a simple
> reduceByKey replacement.
&g
On Tue, Jun 7, 2016 at 2:50 PM, Richard Marscher
wrote:
> There certainly are some gaps between the richness of the RDD API and the
> Dataset API. I'm also migrating from RDD to Dataset and ran into
> reduceByKey and join scenarios.
>
> In the spark-dev list, one person was di
There certainly are some gaps between the richness of the RDD API and the
Dataset API. I'm also migrating from RDD to Dataset and ran into
reduceByKey and join scenarios.
In the spark-dev list, one person was discussing reduceByKey being
sub-optimal at the moment and it spawned this JIRA
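For what it's worth, a minimal sketch of one possible mapping (toy data, assuming
a SparkSession named spark):

import spark.implicits._

val pairsRdd = spark.sparkContext.parallelize(Seq(("a", 1), ("a", 2), ("b", 3)))
val viaRdd = pairsRdd.reduceByKey(_ + _)

val pairsDs = pairsRdd.toDS()
val viaDs = pairsDs
  .groupByKey(_._1)
  .reduceGroups((a, b) => (a._1, a._2 + b._2))
  .map { case (k, (_, v)) => (k, v) }   // reduceGroups keeps the key, so drop the duplicate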
n Jeffrey
> wrote:
>
>> Hello.
>>
>> I am looking at the option of moving RDD based operations to Dataset
>> based operations. We are calling 'reduceByKey' on some pair RDDs we have.
>> What would the equivalent be in the Dataset interface - I do not
at 2:32 PM, Bryan Jeffrey
wrote:
> Hello.
>
> I am looking at the option of moving RDD based operations to Dataset based
> operations. We are calling 'reduceByKey' on some pair RDDs we have. What
> would the equivalent be in the Dataset interface - I do not see a simple
Hello.
I am looking at the option of moving RDD based operations to Dataset based
operations. We are calling 'reduceByKey' on some pair RDDs we have. What
would the equivalent be in the Dataset interface - I do not see a simple
reduceByKey replacement.
Regards,
Bryan Jeffrey
by keys are empty. How do I avoid empty group by keys in DataFrame? Does
DataFrame avoid empty group by key? I have around 8 keys on which I do group
by.
sourceFrame.select("blabla").groupby("col1","col2","col3",..."col8").agg("bla
bla");
Even though it does not sound intuitive, reduceByKey expects all values
for a particular key within a partition to be loaded into memory. So once you
increase the number of partitions you can run the jobs.
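For example (toy stand-in data), either of these raises the partition count:

val pairs = sc.parallelize(1 to 1000000).map(i => ("key" + (i % 100000), 1))

// repartition first, then reduce (two shuffles)
val viaRepartition = pairs.repartition(2000).reduceByKey(_ + _)

// or hand the target partition count straight to reduceByKey (one shuffle)
val viaNumPartitions = pairs.reduceByKey(_ + _, 2000)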
Ted Yu wrote:
>
>> Have you taken a look at SPARK-11293 ?
>>
>> Consider using repartition to increase the number of partitions.
>>
>> FYI
>>
>> On Fri, May 13, 2016 at 12:14 PM, Sung Hwan Chung <
>> coded...@cs.stanford.edu> wrote:
>>
&
g Hwan Chung <
> coded...@cs.stanford.edu> wrote:
>
>> Hello,
>>
>> I'm using Spark version 1.6.0 and have trouble with memory when trying to
>> do reducebykey on a dataset with as many as 75 million keys. I.e. I get the
>> following exception when I run
Have you taken a look at SPARK-11293 ?
Consider using repartition to increase the number of partitions.
FYI
On Fri, May 13, 2016 at 12:14 PM, Sung Hwan Chung
wrote:
> Hello,
>
> I'm using Spark version 1.6.0 and have trouble with memory when trying to
> do reducebykey on
Hello,
I'm using Spark version 1.6.0 and have trouble with memory when trying to
do reduceByKey on a dataset with as many as 75 million keys. I.e. I get the
following exception when I run the task.
There are 20 workers in the cluster. It is running under the standalone
mode with 12 GB ass
On Monday 25 April 2016 11:28 PM,
Weiping Qu wrote:
Dear Ted,
You are right. ReduceByKey is transformation. My fault.
I would rephrase my question using following code snippet.
object ScalaApp {
def main(args: Array
Dear Ted,
You are right. reduceByKey is a transformation. My fault.
I would rephrase my question using the following code snippet.
object ScalaApp {
def main(args: Array[String]): Unit ={
val conf = new SparkConf().setAppName("ScalaApp").setMaster("local")
val sc = n
executed or not.
> As far as I saw from my codes, the reduceByKey will be executed without
> any operations in the Action category.
> Please correct me if I am wrong.
>
> Thanks,
> Regards,
> Weiping
>
> On 25.04.2016 17:20, Chadha Pooja wrote:
>
>&
Thanks.
I read that from the specification.
I thought the way people distinguish actions from transformations depends
on whether they are lazily executed or not.
As far as I saw from my code, the reduceByKey was executed without
any operations from the action category.
Please correct me if I
Hi,
I'd just like to verify whether reduceByKey is a transformation or an
action.
As written in the RDD paper, a Spark flow will not be triggered unless
actions are reached.
I tried it and saw that my flow was executed once there was a
reduceByKey, while it is categorized as a transformation.
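A minimal way to check this in the shell:

val pairs = sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 3)))
val reduced = pairs.reduceByKey(_ + _)   // transformation: by itself this should only build the lineage
reduced.count()                          // action: this call is what should trigger the job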
ver]
> java.lang.OutOfMemoryError: Java heap space. Before I was getting this
> error I was getting errors saying the result size exceed the
> spark.driver.maxResultSize.
> This does not make any sense to me, as there are no actions in my job that
> send data to the driver - just
re
no actions in my job that send data to the driver - just a pull of data from
S3, a map and reduceByKey and then conversion to dataframe and saveAsTable
action that puts the results back on S3.
I've found a few references to reduceByKey and spark.driver.maxResultSize
having some impo
my job that
send data to the driver - just a pull of data from S3, a map and
reduceByKey and then conversion to dataframe and saveAsTable action that
puts the results back on S3.
I've found a few references to reduceByKey and spark.driver.maxResultSize
having some importance, but cannot fatho
(360,68), (380,69), (400,70), (420,71), (440,72), (460,73), (480,74), (500...
>
> scala> val rddNum = sc.parallelize(nums)
> rddNum: org.apache.spark.rdd.RDD[(Int, Int)] = ParallelCollectionRDD[0] at
> parallelize at :23
>
> scala> val reducedNum = rddNum.reduceByKey(_+_)
> reduc
ShuffledRDD[1] at
reduceByKey at :25
scala> reducedNum.mapPartitions(iter => Array(iter.size).iterator,
true).collect.toList
res2: List[Int] = List(50, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0)
To distribute my data more evenly across the partitions I created my own
custom Partitioner.
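For example, a custom partitioner along these lines spreads them out (the keys
are all multiples of 20 and there are 20 partitions, so the default
HashPartitioner, which is key % 20 for Ints, puts every key in partition 0 and
matches the List(50, 0, 0, ...) above):

import org.apache.spark.Partitioner

class StepAwarePartitioner(partitions: Int) extends Partitioner {
  override def numPartitions: Int = partitions
  override def getPartition(key: Any): Int = {
    // keys arrive as multiples of 20, so divide the step out before taking the modulus
    math.abs((key.asInstanceOf[Int] / 20) % partitions)
  }
}

val spread = rddNum.reduceByKey(new StepAwarePartitioner(20), _ + _)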
val sizes: RDD[Int] = rdd.mapPartitions(iter => Array(iter.size).iterator, true)
log.info(s"rdd ==> [${sizes.collect.toList}]")
My question is why does my data end up in one partition after the
reduceByKey? After the filter it can be seen that the data is evenly
distributed, but the re
Deep copying the data solved the issue:
data.map(r => {val t = SpecificData.get().deepCopy(r.getSchema, r); (t.id,
List(t)) }).reduceByKey(_ ++ _)
(noted here:
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/SparkContext.scala#L1003)
Thanks, Igor Berman.
You should clone your data after reading Avro.
On 18 November 2015 at 06:28, tovbinm wrote:
> Howdy,
>
> We've noticed a strange behavior with Avro serialized data and reduceByKey
> RDD functionality. Please see below:
>
> // We're reading a bunch of Avro seria
Howdy,
We've noticed a strange behavior with Avro serialized data and reduceByKey
RDD functionality. Please see below:
// We're reading a bunch of Avro serialized data
val data: RDD[T] = sparkContext.hadoopFile(path,
classOf[AvroInputFormat[T]], classOf[AvroWrapper[T]], classOf[Nu
Yes, the first version takes only 30 minutes,
but with the second method I waited for 5 hours and it only finished 10%.
Hi.
What is slow exactly?
In code base 1:
When you run persist() + count() you store the result in RAM.
Then the map + reduceByKey is done on in-memory data.
In the latter case (all in one line) you are doing both steps at the same
time.
So you are saying that if you sum up the time to
) // just use it to persist a
val b = a.map(s => (s, 1)).reduceByKey(_ + _).count()
it's so fast
but when I use
val b = rdd.pipe("./my_cpp_program").map(s => (s, 1)).reduceByKey(_ + _).count()
it is so slow
and there are many such logs in my executors:
15/10/31 19:53:58 INFO collect
Which Spark release are you using ?
Which OS ?
Thanks
On Sat, Oct 31, 2015 at 5:18 AM, hotdog wrote:
> I meet a situation:
> When I use
> val a = rdd.pipe("./my_cpp_program").persist()
> a.count() // just use it to persist a
> val b = a.map(s => (s, 1)).reduceByK
I meet a situation:
When I use
val a = rdd.pipe("./my_cpp_program").persist()
a.count() // just use it to persist a
val b = a.map(s => (s, 1)).reduceByKey(_ + _).count()
it's so fast
but when I use
val b = rdd.pipe("./my_cpp_program").map(s => (s, 1)).reduceByKey(
s the values. I will need
> to use groupByKey to do a groupBy sessionId and secondary sort by timeStamp
> and then get the list of JsonValues in a sorted order. Is there any
> alternative for that? Please find the code below that I used for the same.
>
>
> Also, does using a cu
.
Also, does using a customPartitioner for a reduceByKey improve performance?
def getGrpdAndSrtdRecs(rdd: RDD[(String, (Long, String))]): RDD[(String, List[(Long, String)])] = {
  val grpdRecs = rdd.groupByKey()
  val srtdRecs = grpdRecs.mapValues[(List[(Long, String)])](iter =>
    iter.toList.sortBy(_
1. Can you do that operation with a reduceByKey?
2. If not, use more partitions. That would mean less data in each
partition, so less spilling.
3. You can control the amount of memory allocated for shuffles by changing the
configuration spark.shuffle.memoryFraction. A higher fraction would cause less
spilling
So, wouldn't using a custom partitioner on the RDD upon which the
groupByKey or reduceByKey is performed avoid shuffles and improve
performance? My code does groupByAndSort and reduceByKey on different
datasets as shown below. Would using a custom partitioner on those datasets
before us
ultimately improves performance.
On Tue, Oct 27, 2015 at 1:20 PM, swetha wrote:
> Hi,
>
> We currently use reduceByKey to reduce by a particular metric name in our
> Streaming/Batch job. It seems to be doing a lot of shuffles and it has
> impact on performance. Does using a cu
Hi,
We currently use reduceByKey to reduce by a particular metric name in our
Streaming/Batch job. It seems to be doing a lot of shuffles and it has an
impact on performance. Does using a custom partitioner before calling
reduceByKey improve performance?
Thanks,
Swetha
Hi Daniel,
Did you solve your problem?
I met the same problem when running reduceByKey on massive data on YARN.
It turns out that Mesos can override the OS ulimit -n setting, so we have
increased the Mesos slave's ulimit -n setting.
r use case?
Sent from my iPhone
> On 24 Sep 2015, at 19:47, swetha wrote:
>
> Hi,
>
> How to use reduceByKey inside updateStateByKey? Suppose I have a bunch of
> keys for which I need to do sum and average inside the updateStateByKey by
> joining with old state
Hi,
How to use reduceByKey inside updateStateByKey? Suppose I have a bunch of
keys for which I need to do sum and average inside the updateStateByKey by
joining with old state. How do I accomplish that?
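Something along these lines is what I have in mind (assuming events is a
DStream[(String, Double)] of (key, value) pairs, with checkpointing enabled
since updateStateByKey needs it):

// reduce each batch down to one (sum, count) per key
val perBatch = events
  .mapValues(v => (v, 1L))
  .reduceByKey { case ((s1, c1), (s2, c2)) => (s1 + s2, c1 + c2) }

// fold the per-batch (sum, count) into the long-lived state
val running = perBatch.updateStateByKey[(Double, Long)] {
  (batch: Seq[(Double, Long)], state: Option[(Double, Long)]) =>
    val (bs, bc) = batch.foldLeft((0.0, 0L)) { case ((s, c), (s2, c2)) => (s + s2, c + c2) }
    val (ss, sc) = state.getOrElse((0.0, 0L))
    Some((ss + bs, sc + bc))
}

// average on demand
val averages = running.mapValues { case (sum, count) => sum / count }

Is something like this the right way to do it?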
Thanks,
Swetha
All the *ByKey aggregations perform an efficient shuffle and preserve
partitioning on the output. If all you need is to call reduceByKey, then don’t
bother with groupBy. You should use groupBy if you really need all the
datapoints from a key for a very custom operation.
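For example, a minimal sketch of that preservation (toy data):

import org.apache.spark.HashPartitioner

val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))

val counts = pairs.reduceByKey(new HashPartitioner(8), _ + _)
counts.partitioner   // Some(...): the reduced output remembers how it is partitioned

// a later *ByKey or join that uses the same keys and partitioner can skip another full shuffle
val joined = counts.join(counts.mapValues(_ * 10))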
From the docs:
Note
Hi,
How to make Group By more efficient? Is it recommended to use a custom
partitioner and then do a Group By? And can we use a custom partitioner and
then use a reduceByKey for optimization?
Thanks,
Swetha
I don't see that you invoke any action in this code. It won't do
anything unless you tell it to perform an action that requires the
transformations.
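For example, a minimal sketch in Scala (the Java API is analogous), with lines being
an existing DStream[String] and ssc its StreamingContext; without the output
operation none of this is ever scheduled:

val counts = lines
  .flatMap(_.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)        // still only a transformation on the DStream

counts.print()               // output operation: now each batch actually runs (and logs)
ssc.start()
ssc.awaitTermination()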
On Wed, Aug 26, 2015 at 7:05 AM, Deepesh Maheshwari
wrote:
> Hi,
> I have applied mapToPair and then a reduceByKey on a DSt
Hi,
I have applied mapToPair and then a reduceByKey on a DStream to obtain a
JavaPairDStream.
I have to apply a flatMapToPair and a reduceByKey on the DStream obtained
above.
But I do not see any logs from the reduceByKey operation.
Can anyone explain why this is happening?
find My Code Be
Hi All,
Please find below the fix info and the respective solution, for users who are
following the mail chain of this issue:
*reduceByKey: Non working snippet*
import org.apache.spark.Context
import org.apache.spark.Context._
import org.apache.spark.SparkConf
val conf = new SparkConf()
val sc = new
akeRDD at :21
> 15/08/21 09:58:51 WARN SizeEstimator: Failed to check whether
> UseCompressedOops is set; assuming yes
> res0: Array[(Int, Int)] = Array((0,3), (1,50), (2,40))
>
> Yong
>
>
> ------
> Date: Fri, 21 Aug 2015 19:24:09 +0530
> Subject:
Transformation not happening for reduceByKey or GroupByKey
From: jsatishchan...@gmail.com
To: abhis...@tetrationanalytics.com
CC: user@spark.apache.org
Hi Abhishek,
I have even tried that, but rdd2 is empty.
Regards, Satish
On Fri, Aug 21, 2015 at 6:47 PM, Abhishek R. Singh
wrote:
You had:
> RDD.reduceB
),(2,40))
> >
> >
> > I am expecting output as Array((0,3),(1,50),(2,40)) just a sum function
> on Values for each key
> >
> > Code:
> > RDD.reduceByKey((x,y) => x+y)
> > RDD.take(3)
> >
> > Result in console:
> > RDD: org.apache.sp
t)] = Array((0,1), (0,2),(1,20),(1,30),(2,40))
>
>
> I am expecting output as Array((0,3),(1,50),(2,40)) just a sum function on
> Values for each key
>
> Code:
> RDD.reduceByKey((x,y) => x+y)
> RDD.take(3)
>
> Result in console:
> RDD: org.apache.spark.rdd.R
ne it is just
>> invoking "spark-shell".
>>
>> I don't know too much about the original problem though.
>>
>> Yong
>>
>> --
>> Date: Fri, 21 Aug 2015 18:19:49 +0800
>> Subject: Re: Transformation not hap
: Fri, 21 Aug 2015 18:19:49 +0800
Subject: Re: Transformation not happening for reduceByKey or GroupByKey
From: zjf...@gmail.com
To: jsatishchan...@gmail.com
CC: robin.e...@xense.co.uk; user@spark.apache.org
Hi Satish,
I don't see where Spark supports "-i", so I suspect it is provided
=> x + y).collect
>>> res43: Array[(Int, Int)] = Array((0,3), (1,50), (2,40))
>>>
>>> On 20 Aug 2015, at 11:05, satish chandra j
>>> wrote:
>>>
>>> HI All,
>>> I have data in RDD as mentioned below:
>>>
>>> RDD : Array[(In
n 20 Aug 2015, at 11:05, satish chandra j
>> wrote:
>>
>> HI All,
>> I have data in RDD as mentioned below:
>>
>> RDD : Array[(Int),(Int)] = Array((0,1), (0,2),(1,20),(1,30),(2,40))
>>
>>
>> I am expecting output as Array((0,3),(1,50),(2,40)) jus
> wrote:
>>
>> HI All,
>> I have data in RDD as mentioned below:
>>
>> RDD : Array[(Int),(Int)] = Array((0,1), (0,2),(1,20),(1,30),(2,40))
>>
>>
>> I am expecting output as Array((0,3),(1,50),(2,40)) just a sum function
>> on Values for each key
ray((0,1), (0,2),(1,20),(1,30),(2,40))
>
>
> I am expecting output as Array((0,3),(1,50),(2,40)) just a sum function on
> Values for each key
>
> Code:
> RDD.reduceByKey((x,y) => x+y)
> RDD.take(3)
>
> Result in console:
> RDD: org.apache.spark.rdd.RDD[(Int,Int)]=
ole:
> RDD: org.apache.spark.rdd.RDD[(Int,Int)]= ShuffledRDD[1] at reduceByKey at
> :73
> res:Array[(Int,Int)] = Array()
>
> Command as mentioned
>
> dse spark --master local --jars postgresql-9.4-1201.jar -i
>
>
> Please let me know what is missing in my code, as my resultant Array is
> empty
>
>
>
> Regards,
> Satish
>
>
RDD: org.apache.spark.rdd.RDD[(Int,Int)]= ShuffledRDD[1] at reduceByKey at
:73
res:Array[(Int,Int)] = Array()
Command as mentioned
dse spark --master local --jars postgresql-9.4-1201.jar -i
Please let me know what is missing in my code, as my resultant Array is
empty
Regards,
Satish
ader.scala:118: error:
> missing arguments for method getMin in class DatasetReader;
>
> [ERROR] follow this method with '_' if you want to treat it as a partially
> applied function
>
> [ERROR] minVector = attribMap.reduceByKey( getMin )
>
> I am not sure what I am d
at I am doing wrong here. My RDD is an RDD of Pairs and as
per my knowledge, I can pass any method to it as long as the function is of
the type f : (V, V) => V.
I am really stuck here. Please help.
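As the error hint itself suggests, the method has to be passed as a function
value; using the names above (signatures assumed):

minVector = attribMap.reduceByKey(getMin _)              // explicit eta-expansion
// or spell out the lambda:
minVector = attribMap.reduceByKey((a, b) => getMin(a, b))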
standalone mode ?
>
> Cheers
>
> On Tue, Jun 30, 2015 at 10:03 AM, hotdog wrote:
> I'm running reduceByKey in spark. My program is the simplest example of
> spark:
>
> val counts = textFile.flatMap(line => line.split(" &qu
Which Spark release are you using ?
Are you running in standalone mode ?
Cheers
On Tue, Jun 30, 2015 at 10:03 AM, hotdog wrote:
> I'm running reduceByKey in spark. My program is the simplest example of
> spark:
>
> val counts = textFile.flatMap(line => line.split(&qu
I'm running reduceByKey in spark. My program is the simplest example of
spark:
val counts = textFile.flatMap(line => line.split(" "))
  .repartition(2)
  .map(word => (word, 1))
  .reduceByKey(_ + _, 1)
counts.saveAsTextFile("hdfs://.