Hi Santhosh,
My name is not Bipin, it's Biplob, as is clear from my signature.
Regarding your question, I have no clue what your map operation is doing on
the grouped data, so I can only suggest you do:
dd = hive_context.read.orc(orcfile_dir).rdd.map(lambda x:
(x[0],x)).reduceByKey
but what would you pass to reduceByKey() in this context?
Santhosh
On Mon, Aug 6, 2018 at 1:20 AM, Biplob Biswas
wrote:
> Hi Santhosh,
>
> If you are not performing any aggregation, then I don't think you can
> replace your groupbykey with a reducebykey, and as I see you are only
> grouping and taking 2
Hi Santhosh,
If you are not performing any aggregation, then I don't think you can
replace your groupByKey with a reduceByKey. As I see it, you are only
grouping and taking 2 values of the result, so you can't simply swap
your groupByKey for a reduceByKey.
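For illustration, a minimal sketch in Scala notation (the same shape applies in
PySpark; sc is the usual shell SparkContext) of why a straight swap doesn't help:

val pairs = sc.parallelize(Seq((1, "a"), (1, "b"), (2, "c")))

// groupByKey: all values for a key are collected into one iterable
val grouped = pairs.groupByKey()

// the closest reduceByKey equivalent has to build those per-key collections itself,
// so it shuffles roughly the same amount of data and buys you nothing here
val viaReduce = pairs.mapValues(v => List(v)).reduceByKey(_ ++ _)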
Thanks & Regards
I am trying to replace groupByKey() with reduceByKey(). I am a PySpark and
Python newbie, and I am having a hard time figuring out the lambda function
for the reduceByKey() operation.
Here is the code
dd = hive_context.read.orc(orcfile_dir).rdd.map(lambda x:
(x[0],x)).groupByKey(25).take(2)
Here
Yeah, you are right. I ran the experiments locally not on YARN.
On Fri, Jul 27, 2018 at 11:54 PM, Vadim Semenov wrote:
> `spark.worker.cleanup.enabled=true` doesn't work for YARN.
> On Fri, Jul 27, 2018 at 8:52 AM dineshdharme
> wrote:
> >
> > I am trying to d
`spark.worker.cleanup.enabled=true` doesn't work for YARN.
On Fri, Jul 27, 2018 at 8:52 AM dineshdharme wrote:
>
> I am trying to do few (union + reduceByKey) operations on a hiearchical
> dataset in a iterative fashion in rdd. The first few loops run fine but on
> the subs
I am trying to do a few (union + reduceByKey) operations on a hierarchical
dataset in an iterative fashion in RDDs. The first few loops run fine, but on
the subsequent loops the operations end up using the whole scratch space
provided to them.
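A minimal sketch of the kind of loop I mean (toy data and names, not the actual job):

import org.apache.spark.rdd.RDD

var acc: RDD[(String, Long)] = sc.parallelize(Seq.empty[(String, Long)])
for (i <- 1 to 5) {
  val level = sc.parallelize(Seq(("node" + i, i.toLong)))
  // each pass unions the new level in and re-aggregates; every reduceByKey is a new
  // shuffle, so shuffle files keep accumulating in the scratch space across iterations
  acc = acc.union(level).reduceByKey(_ + _)
}
acc.count()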
I have set the spark scratch directory, i.e
Wed, Oct 25, 2017 at 5:52 PM, Piyush Mukati
>> wrote:
>>
>>> Hi,
>>> we are migrating some jobs from Dstream to Structured Stream.
>>>
>>> Currently to handle aggregations we call map and reducebyKey on each RDD
>>> like
>> >>> rdd.map(event =>
>> Hi,
>> we are migrating some jobs from Dstream to Structured Stream.
>>
>> Currently to handle aggregations we call map and reducebyKey on each RDD
>> like
>> rdd.map(event => (event._1, event)).reduceByKey((a, b) => merge(a, b))
>>
>> The fin
Oct 25, 2017 at 5:52 PM, Piyush Mukati
wrote:
> Hi,
> we are migrating some jobs from Dstream to Structured Stream.
>
> Currently to handle aggregations we call map and reducebyKey on each RDD
> like
> rdd.map(event => (event._1, event)).reduceByKey((a, b) => merge(a, b))
>
Hi,
we are migrating some jobs from DStream to Structured Streaming.
Currently, to handle aggregations, we call map and reduceByKey on each RDD,
like
rdd.map(event => (event._1, event)).reduceByKey((a, b) => merge(a, b))
The final output of each RDD is merged to the sink with support for
aggre
Hi Stephen,
If you use aggregate functions or reduceGroups on a KeyValueGroupedDataset, it
behaves like reduceByKey on an RDD.
Only if you use flatMapGroups or mapGroups does it behave like groupByKey on an
RDD, and if you read the API documentation it warns about using that API.
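A minimal sketch of the distinction (the Event case class and the SparkSession
named spark are placeholders, not from the original job):

import spark.implicits._

case class Event(key: String, value: Long)
val events = Seq(Event("a", 1L), Event("a", 2L), Event("b", 3L)).toDS()

// reduceGroups behaves like reduceByKey: values are merged pairwise, with partial aggregation
val reduced = events.groupByKey(_.key).reduceGroups((a, b) => Event(a.key, a.value + b.value))

// mapGroups behaves like groupByKey on an RDD: the whole group is handed to you as an iterator
val perGroup = events.groupByKey(_.key).mapGroups((k, it) => (k, it.map(_.value).sum))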
Hope this helps.
Thanks
Ankur
On
Are there plans to add reduceByKey to DataFrames? Since switching over to
Spark 2 I find myself increasingly dissatisfied with the idea of converting
DataFrames to RDDs to do procedural programming on grouped data (both from an
ease-of-programming stance and a performance stance). So I've been
The job is split into two stages, flatMap() and count(). When counting
Tuples, flatMap() takes about 6s and count() takes about 2s, while when
counting Longs, flatMap() takes 18s and count() takes 10s.
I haven't looked into Spark's implementation of flatMap/reduceByKey, but I
guess
Hi,
I read somewhere that
groupByKey() on an RDD disables map-side aggregation, as the aggregation
function (appending to a list) does not save any space.
However, from my understanding, using something like reduceByKey or
(combineByKey + a combiner function) we could reduce the data shuffled.
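A toy sketch of what I mean (sc being the shell SparkContext):

val pairs = sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 3)))

// reduceByKey / a reducing combineByKey pre-aggregates on the map side, so at most one
// record per key per partition crosses the shuffle
val sums = pairs.combineByKey(
  (v: Int) => v,                 // createCombiner
  (acc: Int, v: Int) => acc + v, // mergeValue, runs map side
  (a: Int, b: Int) => a + b      // mergeCombiners, runs reduce side
)

// groupByKey skips the map-side combine, because appending every value to a list
// would not shrink what gets shuffled anyway
val grouped = pairs.groupByKey()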
the batch duration
>> is 5
>> >>> seconds. The applications uses Kafka DirectStream and based on the
>> feed
>> >>> source there are three streams. As given in the code snippet I am
>> doing a
>> >>> union of three streams and I am t
eam and based on the feed
> >>> source there are three streams. As given in the code snippet I am
> doing a
> >>> union of three streams and I am trying to remove the duplicate
> campaigns
> >>> received using reduceByKey based on the customer and campaignId.
d the batch duration is 5
>>> seconds. The applications uses Kafka DirectStream and based on the feed
>>> source there are three streams. As given in the code snippet I am doing a
>>> union of three streams and I am trying to remove the duplicate campaigns
>>> recei
to Campaigns based on live stock feeds and the batch duration is 5
>> seconds. The applications uses Kafka DirectStream and based on the feed
>> source there are three streams. As given in the code snippet I am doing a
>> union of three streams and I am trying to remove the duplic
trying to remove the duplicate campaigns
> received using reduceByKey based on the customer and campaignId. I could
> see lot of duplicate email being send out for the same key in the same
> batch.I was expecting reduceByKey to remove the duplicate campaigns in a
> batch based on custom
snippet I am doing a
union of three streams and I am trying to remove the duplicate campaigns
received using reduceByKey based on the customer and campaignId. I could
see a lot of duplicate emails being sent out for the same key in the same
batch. I was expecting reduceByKey to remove the duplicate campaigns in a
batch based on the customer and campaignId.
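A minimal sketch of the de-duplication I am doing, with placeholder types (shown
here on a plain pair RDD; the same map/reduceByKey calls are what I run on the
unioned stream):

case class Campaign(customer: String, campaignId: String, payload: String)

val batch = sc.parallelize(Seq(
  Campaign("c1", "k1", "a"), Campaign("c1", "k1", "b"), Campaign("c2", "k9", "c")))

val deduped = batch
  .map(c => ((c.customer, c.campaignId), c))
  .reduceByKey((a, b) => a)   // keep a single record per (customer, campaignId) within the batch
  .values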
Not really, a grouped DataFrame only provides SQL-like functions like sum and
avg (at least in 1.5).
> On 29.08.2016, at 14:56, ayan guha wrote:
>
> If you are confused because of the name of two APIs. I think DF API name
> groupBy came from SQL, but it works similarly as
If you are confused because of the names of the two APIs: I think the DataFrame
API name groupBy came from SQL, but it works similarly to reduceByKey.
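For example, a minimal sketch assuming a DataFrame df with columns "key" and "value":

import org.apache.spark.sql.functions.sum

val totals = df.groupBy("key").agg(sum("value"))

Catalyst plans this as a partial aggregation on each partition followed by a final
aggregation after the shuffle, which is why DataFrame groupBy behaves like
reduceByKey rather than like RDD groupByKey.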
On 29 Aug 2016 20:57, "Marius Soutier" wrote:
> In DataFrames (and thus in 1.5 in general) this is not possible, correct?
>
> On 11.08.201
whether the DataFrame API optimizes the code doing something
> similar to what RDD.reduceByKey does.
>
> I am using Spark 1.6.2.
>
> Regards,
> Luis
So it was indeed my merge function. I created a new result object for every
merge and it's working now.
Thanks
On Wed, Jun 22, 2016 at 3:46 PM, Nirav Patel wrote:
> PS. In my reduceByKey operation I have two mutable object. What I do is
> merge mutable2 into mutable1 and return mutable1.
PS. In my reduceByKey operation I have two mutable objects. What I do is
merge mutable2 into mutable1 and return mutable1. I read that this works for
aggregateByKey, so I thought it would work for reduceByKey as well. I might be
wrong here. Can someone verify whether this will work or be unpredictable?
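A minimal sketch of the two styles I mean, with a made-up mutable class standing
in for MyObj:

class Acc(var total: Long) extends Serializable

val pairs = sc.parallelize(Seq(("k", new Acc(1L)), ("k", new Acc(2L)), ("k", new Acc(3L))))

// style A: merge the second value into the first and return it
val mutated = pairs.reduceByKey { (a, b) => a.total += b.total; a }

// style B: build a fresh result object for every merge (the resolution reported earlier in this thread)
val fresh = pairs.reduceByKey((a, b) => new Acc(a.total + b.total))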
On
did you observe any error (on
> workers) ?
>
> Cheers
>
> On Tue, Jun 21, 2016 at 10:42 PM, Nirav Patel
> wrote:
>
>> I have an RDD[String, MyObj] which is a result of Join + Map operation.
>> It has no partitioner info. I run reduceByKey without passing any
>> Pa
For the run which returned incorrect result, did you observe any error (on
workers) ?
Cheers
On Tue, Jun 21, 2016 at 10:42 PM, Nirav Patel wrote:
> I have an RDD[String, MyObj] which is a result of Join + Map operation. It
> has no partitioner info. I run reduceByKey without passi
Hi,
Could you check the issue also occurs in v1.6.1 and v2.0?
// maropu
On Wed, Jun 22, 2016 at 2:42 PM, Nirav Patel wrote:
> I have an RDD[String, MyObj] which is a result of Join + Map operation. It
> has no partitioner info. I run reduceByKey without passing any Partitioner
> or
I have an RDD[String, MyObj] which is a result of a Join + Map operation. It
has no partitioner info. I run reduceByKey without passing any Partitioner
or partition counts. I observed that the output aggregation result for a given
key is incorrect sometimes, like 1 out of 5 times. It looks like reduce
oving RDD based operations to Dataset based
> operations. We are calling 'reduceByKey' on some pair RDDs we have. What
> would the equivalent be in the Dataset interface - I do not see a simple
> reduceByKey replacement.
&g
On Tue, Jun 7, 2016 at 2:50 PM, Richard Marscher
wrote:
> There certainly are some gaps between the richness of the RDD API and the
> Dataset API. I'm also migrating from RDD to Dataset and ran into
> reduceByKey and join scenarios.
>
> In the spark-dev list, one person was di
There certainly are some gaps between the richness of the RDD API and the
Dataset API. I'm also migrating from RDD to Dataset and ran into
reduceByKey and join scenarios.
In the spark-dev list, one person was discussing reduceByKey being
sub-optimal at the moment and it spawned this JIRA
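For what it's worth, a minimal sketch of one possible mapping (toy data, assuming
a SparkSession named spark):

import spark.implicits._

val pairsRdd = spark.sparkContext.parallelize(Seq(("a", 1), ("a", 2), ("b", 3)))
val viaRdd = pairsRdd.reduceByKey(_ + _)

val pairsDs = pairsRdd.toDS()
val viaDs = pairsDs
  .groupByKey(_._1)
  .reduceGroups((a, b) => (a._1, a._2 + b._2))
  .map { case (k, (_, v)) => (k, v) }   // reduceGroups keeps the key, so drop the duplicate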
n Jeffrey
> wrote:
>
>> Hello.
>>
>> I am looking at the option of moving RDD based operations to Dataset
>> based operations. We are calling 'reduceByKey' on some pair RDDs we have.
>> What would the equivalent be in the Dataset interface - I do not
at 2:32 PM, Bryan Jeffrey
wrote:
> Hello.
>
> I am looking at the option of moving RDD based operations to Dataset based
> operations. We are calling 'reduceByKey' on some pair RDDs we have. What
> would the equivalent be in the Dataset interface - I do not see a simple
Hello.
I am looking at the option of moving RDD based operations to Dataset based
operations. We are calling 'reduceByKey' on some pair RDDs we have. What
would the equivalent be in the Dataset interface - I do not see a simple
reduceByKey replacement.
Regards,
Bryan Jeffrey
by keys are empty. How do I avoid empty group by keys in DataFrame? Does
DataFrame avoid empty group by key? I have around 8 keys on which I do group
by.
sourceFrame.select("blabla").groupby("col1","col2","col3",..."col8").agg("bla
bla");
Even though it does not sound intuitive, reduceByKey expects all values
for a particular key within a partition to be loaded into memory. So once you
increase the number of partitions you can run the jobs.
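For example (toy stand-in data), either of these raises the partition count:

val pairs = sc.parallelize(1 to 1000000).map(i => ("key" + (i % 100000), 1))

// repartition first, then reduce (two shuffles)
val viaRepartition = pairs.repartition(2000).reduceByKey(_ + _)

// or hand the target partition count straight to reduceByKey (one shuffle)
val viaNumPartitions = pairs.reduceByKey(_ + _, 2000)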
Ted Yu wrote:
>
>> Have you taken a look at SPARK-11293 ?
>>
>> Consider using repartition to increase the number of partitions.
>>
>> FYI
>>
>> On Fri, May 13, 2016 at 12:14 PM, Sung Hwan Chung <
>> coded...@cs.stanford.edu> wrote:
>>
&
g Hwan Chung <
> coded...@cs.stanford.edu> wrote:
>
>> Hello,
>>
>> I'm using Spark version 1.6.0 and have trouble with memory when trying to
>> do reducebykey on a dataset with as many as 75 million keys. I.e. I get the
>> following exception when I run
Have you taken a look at SPARK-11293 ?
Consider using repartition to increase the number of partitions.
FYI
On Fri, May 13, 2016 at 12:14 PM, Sung Hwan Chung
wrote:
> Hello,
>
> I'm using Spark version 1.6.0 and have trouble with memory when trying to
> do reducebykey on
Hello,
I'm using Spark version 1.6.0 and have trouble with memory when trying to
do reduceByKey on a dataset with as many as 75 million keys. I.e. I get the
following exception when I run the task.
There are 20 workers in the cluster. It is running under the standalone
mode with 12 GB ass
On Monday 25 April 2016 11:28 PM,
Weiping Qu wrote:
Dear Ted,
You are right. ReduceByKey is transformation. My fault.
I would rephrase my question using following code snippet.
object ScalaApp {
def main(args: Array
Dear Ted,
You are right. reduceByKey is a transformation. My fault.
I would rephrase my question using the following code snippet.
object ScalaApp {
def main(args: Array[String]): Unit ={
val conf = new SparkConf().setAppName("ScalaApp").setMaster("local")
val sc = n
executed or not.
> As far as I saw from my codes, the reduceByKey will be executed without
> any operations in the Action category.
> Please correct me if I am wrong.
>
> Thanks,
> Regards,
> Weiping
>
> On 25.04.2016 17:20, Chadha Pooja wrote:
>
>&
Thanks.
I read that from the specification.
I thought the way people distinguish actions from transformations depends
on whether they are lazily executed or not.
As far as I saw from my code, the reduceByKey was executed without
any operations from the action category.
Please correct me if I
Hi,
I'd just like to verify whether reduceByKey is a transformation or an
action.
As written in the RDD paper, a Spark flow will not be triggered unless
actions are reached.
I tried it and saw that my flow was executed once there was a
reduceByKey, while it is categorized as a transformation.
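A minimal way to check this in the shell:

val pairs = sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 3)))
val reduced = pairs.reduceByKey(_ + _)   // transformation: by itself this should only build the lineage
reduced.count()                          // action: this call is what should trigger the job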
ver]
> java.lang.OutOfMemoryError: Java heap space. Before I was getting this
> error I was getting errors saying the result size exceed the
> spark.driver.maxResultSize.
> This does not make any sense to me, as there are no actions in my job that
> send data to the driver - just
re
no actions in my job that send data to the driver - just a pull of data from
S3, a map and reduceByKey and then conversion to dataframe and saveAsTable
action that puts the results back on S3.
I've found a few references to reduceByKey and spark.driver.maxResultSize
having some impo
my job that
send data to the driver - just a pull of data from S3, a map and
reduceByKey and then conversion to dataframe and saveAsTable action that
puts the results back on S3.
I've found a few references to reduceByKey and spark.driver.maxResultSize
having some importance, but cannot fatho
(360,68), (380,69), (400,70), (420,71), (440,72), (460,73), (480,74), (500...
>
> scala> val rddNum = sc.parallelize(nums)
> rddNum: org.apache.spark.rdd.RDD[(Int, Int)] = ParallelCollectionRDD[0] at
> parallelize at :23
>
> scala> val reducedNum = rddNum.reduceByKey(_+_)
> reduc
ShuffledRDD[1] at
reduceByKey at :25
scala> reducedNum.mapPartitions(iter => Array(iter.size).iterator,
true).collect.toList
res2: List[Int] = List(50, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0)
To distribute my data more evenly across the partitions I created my own
custom Partitioner.
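For example, a custom partitioner along these lines spreads them out (the keys
are all multiples of 20 and there are 20 partitions, so the default
HashPartitioner, which is key % 20 for Ints, puts every key in partition 0 and
matches the List(50, 0, 0, ...) above):

import org.apache.spark.Partitioner

class StepAwarePartitioner(partitions: Int) extends Partitioner {
  override def numPartitions: Int = partitions
  override def getPartition(key: Any): Int = {
    // keys arrive as multiples of 20, so divide the step out before taking the modulus
    math.abs((key.asInstanceOf[Int] / 20) % partitions)
  }
}

val spread = rddNum.reduceByKey(new StepAwarePartitioner(20), _ + _)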
val sizes: RDD[Int] = rdd.mapPartitions(iter => Array(iter.size).iterator, true)
log.info(s"rdd ==> [${sizes.collect.toList}]")
My question is why does my data end up in one partition after the
reduceByKey? After the filter it can be seen that the data is evenly
distributed, but the re
Deep copying the data solved the issue:
data.map(r => {val t = SpecificData.get().deepCopy(r.getSchema, r); (t.id,
List(t)) }).reduceByKey(_ ++ _)
(noted here:
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/SparkContext.scala#L1003)
Thanks, Igor Berman.
You should clone your data after reading Avro.
On 18 November 2015 at 06:28, tovbinm wrote:
> Howdy,
>
> We've noticed a strange behavior with Avro serialized data and reduceByKey
> RDD functionality. Please see below:
>
> // We're reading a bunch of Avro seria
Howdy,
We've noticed a strange behavior with Avro serialized data and reduceByKey
RDD functionality. Please see below:
// We're reading a bunch of Avro serialized data
val data: RDD[T] = sparkContext.hadoopFile(path,
classOf[AvroInputFormat[T]], classOf[AvroWrapper[T]], classOf[Nu
Yes, the first version takes only 30 minutes,
but with the second method I waited for 5 hours and it only finished 10%.
Hi.
What is slow exactly?
In code base 1:
When you run persist() + count() you store the result in RAM.
Then the map + reduceByKey is done on in-memory data.
In the latter case (all in one line) you are doing both steps at the same
time.
So you are saying that if you sum up the time to
) // just use it to persist a
val b = a.map(s => (s, 1)).reduceByKey(_ + _).count()
it's so fast
but when I use
val b = rdd.pipe("./my_cpp_program").map(s => (s, 1)).reduceByKey(_ + _).count()
it is so slow
and there are many such logs in my executors:
15/10/31 19:53:58 INFO collect
Which Spark release are you using ?
Which OS ?
Thanks
On Sat, Oct 31, 2015 at 5:18 AM, hotdog wrote:
> I meet a situation:
> When I use
> val a = rdd.pipe("./my_cpp_program").persist()
> a.count() // just use it to persist a
> val b = a.map(s => (s, 1)).reduceByK
I meet a situation:
When I use
val a = rdd.pipe("./my_cpp_program").persist()
a.count() // just use it to persist a
val b = a.map(s => (s, 1)).reduceByKey(_ + _).count()
it's so fast
but when I use
val b = rdd.pipe("./my_cpp_program").map(s => (s, 1)).reduceByKey(
s the values. I will need
> to use groupByKey to do a groupBy sessionId and secondary sort by timeStamp
> and then get the list of JsonValues in a sorted order. Is there any
> alternative for that? Please find the code below that I used for the same.
>
>
> Also, does using a cu
.
Also, does using a customPartitioner for a reduceByKey improve performance?
def getGrpdAndSrtdRecs(rdd: RDD[(String, (Long, String))]): RDD[(String, List[(Long, String)])] = {
  val grpdRecs = rdd.groupByKey()
  val srtdRecs = grpdRecs.mapValues[(List[(Long, String)])](iter =>
    iter.toList.sortBy(_
1. Can you do that operation with a reduceByKey?
2. If not, use more partitions. That would mean less data in each
partition, so less spilling.
3. You can control the amount of memory allocated for shuffles by changing the
configuration spark.shuffle.memoryFraction. A higher fraction would cause less
spilling
So, wouldn't using a custom partitioner on the RDD upon which the
groupByKey or reduceByKey is performed avoid shuffles and improve
performance? My code does groupByAndSort and reduceByKey on different
datasets as shown below. Would using a custom partitioner on those datasets
before us
ultimately improves performance.
On Tue, Oct 27, 2015 at 1:20 PM, swetha wrote:
> Hi,
>
> We currently use reduceByKey to reduce by a particular metric name in our
> Streaming/Batch job. It seems to be doing a lot of shuffles and it has
> impact on performance. Does using a cu
Hi,
We currently use reduceByKey to reduce by a particular metric name in our
Streaming/Batch job. It seems to be doing a lot of shuffles and it has an
impact on performance. Does using a custom partitioner before calling
reduceByKey improve performance?
Thanks,
Swetha
Hi Daniel,
Did you solve your problem?
I met the same problem when running reduceByKey on massive data on YARN.
It turns out that Mesos can override the OS ulimit -n setting, so we have
increased the Mesos slave's ulimit -n setting.
r use case?
Sent from my iPhone
> On 24 Sep 2015, at 19:47, swetha wrote:
>
> Hi,
>
> How to use reduceByKey inside updateStateByKey? Suppose I have a bunch of
> keys for which I need to do sum and average inside the updateStateByKey by
> joining with old state
Hi,
How to use reduceByKey inside updateStateByKey? Suppose I have a bunch of
keys for which I need to do sum and average inside the updateStateByKey by
joining with old state. How do I accomplish that?
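Something along these lines is what I have in mind (assuming events is a
DStream[(String, Double)] of (key, value) pairs, with checkpointing enabled
since updateStateByKey needs it):

// reduce each batch down to one (sum, count) per key
val perBatch = events
  .mapValues(v => (v, 1L))
  .reduceByKey { case ((s1, c1), (s2, c2)) => (s1 + s2, c1 + c2) }

// fold the per-batch (sum, count) into the long-lived state
val running = perBatch.updateStateByKey[(Double, Long)] {
  (batch: Seq[(Double, Long)], state: Option[(Double, Long)]) =>
    val (bs, bc) = batch.foldLeft((0.0, 0L)) { case ((s, c), (s2, c2)) => (s + s2, c + c2) }
    val (ss, sc) = state.getOrElse((0.0, 0L))
    Some((ss + bs, sc + bc))
}

// average on demand
val averages = running.mapValues { case (sum, count) => sum / count }

Is something like this the right way to do it?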
Thanks,
Swetha
All the *ByKey aggregations perform an efficient shuffle and preserve
partitioning on the output. If all you need is to call reduceByKey, then don’t
bother with groupBy. You should use groupBy if you really need all the
datapoints from a key for a very custom operation.
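For example, a minimal sketch of that preservation (toy data):

import org.apache.spark.HashPartitioner

val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))

val counts = pairs.reduceByKey(new HashPartitioner(8), _ + _)
counts.partitioner   // Some(...): the reduced output remembers how it is partitioned

// a later *ByKey or join that uses the same keys and partitioner can skip another full shuffle
val joined = counts.join(counts.mapValues(_ * 10))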
From the docs:
Note
Hi,
How to make Group By more efficient? Is it recommended to use a custom
partitioner and then do a Group By? And can we use a custom partitioner and
then use a reduceByKey for optimization?
Thanks,
Swetha
I don't see that you invoke any action in this code. It won't do
anything unless you tell it to perform an action that requires the
transformations.
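For example, a minimal sketch in Scala (the Java API is analogous), with lines being
an existing DStream[String] and ssc its StreamingContext; without the output
operation none of this is ever scheduled:

val counts = lines
  .flatMap(_.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)        // still only a transformation on the DStream

counts.print()               // output operation: now each batch actually runs (and logs)
ssc.start()
ssc.awaitTermination()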
On Wed, Aug 26, 2015 at 7:05 AM, Deepesh Maheshwari
wrote:
> Hi,
> I have applied mapToPair and then a reduceByKey on a DSt
Hi,
I have applied mapToPair and then a reduceByKey on a DStream to obtain a
JavaPairDStream.
I have to apply a flatMapToPair and a reduceByKey on the DStream obtained
above.
But I do not see any logs from the reduceByKey operation.
Can anyone explain why this is happening?
find My Code Be
Hi All,
Please find below the fix info and the respective solution, for users who are
following the mail chain of this issue:
*reduceByKey: Non working snippet*
import org.apache.spark.Context
import org.apache.spark.Context._
import org.apache.spark.SparkConf
val conf = new SparkConf()
val sc = new
akeRDD at :21
> 15/08/21 09:58:51 WARN SizeEstimator: Failed to check whether
> UseCompressedOops is set; assuming yes
> res0: Array[(Int, Int)] = Array((0,3), (1,50), (2,40))
>
> Yong
>
>
> ------
> Date: Fri, 21 Aug 2015 19:24:09 +0530
> Subject:
Transformation not happening for reduceByKey or GroupByKey
From: jsatishchan...@gmail.com
To: abhis...@tetrationanalytics.com
CC: user@spark.apache.org
Hi Abhishek,
I have even tried that, but rdd2 is empty.
Regards, Satish
On Fri, Aug 21, 2015 at 6:47 PM, Abhishek R. Singh
wrote:
You had:
> RDD.reduceB
),(2,40))
> >
> >
> > I am expecting output as Array((0,3),(1,50),(2,40)) just a sum function
> on Values for each key
> >
> > Code:
> > RDD.reduceByKey((x,y) => x+y)
> > RDD.take(3)
> >
> > Result in console:
> > RDD: org.apache.sp
t)] = Array((0,1), (0,2),(1,20),(1,30),(2,40))
>
>
> I am expecting output as Array((0,3),(1,50),(2,40)) just a sum function on
> Values for each key
>
> Code:
> RDD.reduceByKey((x,y) => x+y)
> RDD.take(3)
>
> Result in console:
> RDD: org.apache.spark.rdd.R
ne it is just
>> invoking "spark-shell".
>>
>> I don't know too much about the original problem though.
>>
>> Yong
>>
>> --
>> Date: Fri, 21 Aug 2015 18:19:49 +0800
>> Subject: Re: Transformation not hap
: Fri, 21 Aug 2015 18:19:49 +0800
Subject: Re: Transformation not happening for reduceByKey or GroupByKey
From: zjf...@gmail.com
To: jsatishchan...@gmail.com
CC: robin.e...@xense.co.uk; user@spark.apache.org
Hi Satish,
I don't see where Spark supports "-i", so I suspect it is provided
=> x + y).collect
>>> res43: Array[(Int, Int)] = Array((0,3), (1,50), (2,40))
>>>
>>> On 20 Aug 2015, at 11:05, satish chandra j
>>> wrote:
>>>
>>> HI All,
>>> I have data in RDD as mentioned below:
>>>
>>> RDD : Array[(In
n 20 Aug 2015, at 11:05, satish chandra j
>> wrote:
>>
>> HI All,
>> I have data in RDD as mentioned below:
>>
>> RDD : Array[(Int),(Int)] = Array((0,1), (0,2),(1,20),(1,30),(2,40))
>>
>>
>> I am expecting output as Array((0,3),(1,50),(2,40)) jus
> wrote:
>>
>> HI All,
>> I have data in RDD as mentioned below:
>>
>> RDD : Array[(Int),(Int)] = Array((0,1), (0,2),(1,20),(1,30),(2,40))
>>
>>
>> I am expecting output as Array((0,3),(1,50),(2,40)) just a sum function
>> on Values for each key
ray((0,1), (0,2),(1,20),(1,30),(2,40))
>
>
> I am expecting output as Array((0,3),(1,50),(2,40)) just a sum function on
> Values for each key
>
> Code:
> RDD.reduceByKey((x,y) => x+y)
> RDD.take(3)
>
> Result in console:
> RDD: org.apache.spark.rdd.RDD[(Int,Int)]=
ole:
> RDD: org.apache.spark.rdd.RDD[(Int,Int)]= ShuffledRDD[1] at reduceByKey at
> :73
> res:Array[(Int,Int)] = Array()
>
> Command as mentioned
>
> dse spark --master local --jars postgresql-9.4-1201.jar -i
>
>
> Please let me know what is missing in my code, as my resultant Array is
> empty
>
>
>
> Regards,
> Satish
>
>
RDD: org.apache.spark.rdd.RDD[(Int,Int)]= ShuffledRDD[1] at reduceByKey at
:73
res:Array[(Int,Int)] = Array()
Command as mentioned
dse spark --master local --jars postgresql-9.4-1201.jar -i
Please let me know what is missing in my code, as my resultant Array is
empty
Regards,
Satish
ader.scala:118: error:
> missing arguments for method getMin in class DatasetReader;
>
> [ERROR] follow this method with '_' if you want to treat it as a partially
> applied function
>
> [ERROR] minVector = attribMap.reduceByKey( getMin )
>
> I am not sure what I am d
at I am doing wrong here. My RDD is an RDD of Pairs and as
per my knowledge, I can pass any method to it as long as the function is of
the type f : (V, V) => V.
I am really stuck here. Please help.
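As the error hint itself suggests, the method has to be passed as a function
value; using the names above (signatures assumed):

minVector = attribMap.reduceByKey(getMin _)              // explicit eta-expansion
// or spell out the lambda:
minVector = attribMap.reduceByKey((a, b) => getMin(a, b))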
standalone mode ?
>
> Cheers
>
> On Tue, Jun 30, 2015 at 10:03 AM, hotdog wrote:
> I'm running reduceByKey in spark. My program is the simplest example of
> spark:
>
> val counts = textFile.flatMap(line => line.split(" &qu
Which Spark release are you using ?
Are you running in standalone mode ?
Cheers
On Tue, Jun 30, 2015 at 10:03 AM, hotdog wrote:
> I'm running reduceByKey in spark. My program is the simplest example of
> spark:
>
> val counts = textFile.flatMap(line => line.split(&qu
I'm running reduceByKey in spark. My program is the simplest example of
spark:
val counts = textFile.flatMap(line => line.split(" "))
  .repartition(2)
  .map(word => (word, 1))
  .reduceByKey(_ + _, 1)
counts.saveAsTextFile("hdfs://.