Hi,
I'm not an authority in the Spark community, but what I would do is add
the project to Spark Packages (http://spark-packages.org/). In fact, I think
this case is similar to IndexedRDD, which is also on Spark Packages:
http://spark-packages.org/package/amplab/spark-indexedrdd
2015-07-19 21:49 GMT+03:00
Hi Juan,
That's exactly what I meant. If we have a high load with many repetitions, it
can significantly reduce the RDD size and improve performance. In real use
cases, applications frequently need to enrich data from a cache or an external
system, so we would save time on each repetition.
I will also do some
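A minimal sketch of that enrichment idea, assuming an RDD[String] named rdd
is in scope; enrich is a hypothetical stand-in for the cache or
external-system call:

// The expensive lookup runs once per distinct value, not once per record.
def enrich(value: String): String = value.toUpperCase // illustrative placeholder

val grouped  = rdd.map((_, 1)).reduceByKey(_ + _)            // (value, count) pairs
val enriched = grouped.map { case (v, n) => (enrich(v), n) } // one lookup per distinct value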
Hi,
My two cents is that this could be interesting if all RDD and pair
RDD operations were lifted to work on the grouped RDD. For example, as
suggested, a map on grouped RDDs would be more efficient if the original RDD
had lots of duplicate entries, but for RDDs with few repetitions I guess
you in
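A rough sketch of what lifting map might look like, assuming the grouped RDD
is represented as (value, count) pairs; mapGrouped is a hypothetical name,
not an existing API:

import scala.reflect.ClassTag
import org.apache.spark.rdd.RDD

// f runs once per distinct value; counts are merged afterwards in case
// f maps two distinct inputs to the same output.
def mapGrouped[T, U: ClassTag](grouped: RDD[(T, Int)])(f: T => U): RDD[(U, Int)] =
  grouped
    .map { case (v, n) => (f(v), n) }
    .reduceByKey(_ + _)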
In the Spark model, constructing an RDD does not mean storing all its
contents in memory. Rather, an RDD is a description of a dataset that
enables iterating over its contents, record by record (in parallel). The
only time the full contents of an RDD are stored in memory is when a user
explicitly caches it.
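A quick illustration of that laziness, with illustrative names:

val pairs   = rdd.map((_, 1))          // nothing computed or stored yet
val grouped = pairs.reduceByKey(_ + _) // still just a description of a dataset

grouped.cache() // marks the result for in-memory storage
grouped.count() // an action finally computes and materializes it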
Sorry, maybe I am saying something completely wrong... we have a stream,
we digitize it to create an RDD. The RDD in this case will be just an
Array[Any]. Then we apply a transformation to create a new grouped RDD, and
GC should remove the original RDD from memory (if we don't persist it). Will
we have a GC step in
The user gets to choose what they want to reside in memory. If they call
rdd.cache() on the original RDD, it will be in memory. If they call
rdd.cache() on the compact RDD, it will be in memory. If cache() is called
on both, they'll both be in memory.
-Sandy
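To make that concrete, a small sketch (sc, records, and the variable names
are illustrative):

val original = sc.parallelize(records)                 // some input RDD
val compact  = original.map((_, 1)).reduceByKey(_ + _) // the compact, grouped form

original.cache() // pins the original RDD in memory once computed
compact.cache()  // pins the compact RDD in memory once computed
// With both calls, both RDDs stay resident.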
On Sun, Jul 19, 2015 at 11:09 AM, Сергей Лихоман wrote:
Thanks for the answer! Could you please answer one more question? Will we
have both the original RDD and the grouped RDD in memory at the same time?
2015-07-19 21:04 GMT+03:00 Sandy Ryza :
> Edit: the first line should read:
>
> val groupedRdd = rdd.map((_, 1)).reduceByKey(_ + _)
>
> On Sun, Jul 19, 2015 at
This functionality already basically exists in Spark. To create the
"grouped RDD", one can run:
val groupedRdd = rdd.reduceByKey(_ + _)
To get it back into the original form:
groupedRdd.flatMap(x => List.fill(x._2)(x._1))
-Sandy
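Putting the corrected pieces together, a self-contained sketch (assuming a
local SparkContext; the data is illustrative):

import org.apache.spark.{SparkConf, SparkContext}

val sc  = new SparkContext(new SparkConf().setAppName("grouped-rdd").setMaster("local[*]"))
val rdd = sc.parallelize(Seq("a", "b", "a", "a", "b"))

// Compact form: one (value, count) pair per distinct value, e.g. ("a", 3), ("b", 2).
val groupedRdd = rdd.map((_, 1)).reduceByKey(_ + _)

// Expand back to the original multiset of values.
val restored = groupedRdd.flatMap { case (value, count) => List.fill(count)(value) }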
On Sun, Jul 19, 2015 at 10:40 AM, Сергей Лихоман wrote:
Edit: the first line should read:
val groupedRdd = rdd.map((_, 1)).reduceByKey(_ + _)
On Sun, Jul 19, 2015 at 11:02 AM, Sandy Ryza
wrote:
> This functionality already basically exists in Spark. To create the
> "grouped RDD", one can run:
>
> val groupedRdd = rdd.reduceByKey(_ + _)
>
> To g