Also, you may want to use .lookup() instead of .filter():

def lookup(key: K): Seq[V]

Return the list of values in the RDD for key key. This operation is done
efficiently if the RDD has a known partitioner by only searching the
partition that the key maps to.
You might want to partition your RDD first so that .lookup() can take
advantage of the known partitioner.
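A minimal sketch of that pattern (the input path, key extraction, and
partition count are just placeholders):

import org.apache.spark.HashPartitioner

val pairs = sc.textFile("events.log")            // hypothetical input
  .map(line => (line.split(",")(0), line))       // key by the first field
  .partitionBy(new HashPartitioner(100))         // gives the RDD a known partitioner
  .persist()

val values: Seq[String] = pairs.lookup("someKey")  // only searches the partition "someKey" maps to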
Hi,
In the talk "A Deeper Understanding of Spark Internals", it was mentioned
that for some operators Spark can spill to disk across keys (in 1.1:
.groupByKey(), .reduceByKey(), .sortByKey()), but that, as a limitation of
the shuffle at that time, each single key-value pair must fit in memory.
I'm not sure about .union(), but at least in the case of .join(), as long as
you have hash partitioned the original RDDs and persisted them, calls to
.join() take advantage of already knowing which partition the keys are on,
and will not repartition rdd1.
import org.apache.spark.HashPartitioner
val rdd1 = log.partitionBy(new HashPartitioner(100)).persist()  // partition count is illustrative
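Continuing that line (the second RDD is an assumption), the other side of the
join should use an equal partitioner so rdd1 is not re-shuffled:

val rdd2 = otherLog.partitionBy(new HashPartitioner(100)).persist()  // hypothetical second pair RDD

// Both RDDs share an equal HashPartitioner, so join() reuses the existing
// partitioning instead of re-shuffling rdd1.
val joined = rdd1.join(rdd2)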
Also available is .sample(), which randomly samples your RDD with or
without replacement and returns an RDD.
.sample() takes a fraction, so it doesn't return an exact number of
elements.
e.g.
rdd.sample(true, .0001, 1)
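If you need an exact number of elements instead, .takeSample() returns
exactly num elements, but as a local Array on the driver rather than an RDD
(this assumes an RDD named rdd):

// fraction-based: returns an RDD, element count varies from run to run
val approx = rdd.sample(true, 0.0001, 1)

// exact count: returns exactly 1000 sampled elements as an Array on the driver
val exact = rdd.takeSample(false, 1000, 1)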
>> nsareen wrote
>>> 1) Does the filter function scan every element saved in the RDD? If my
>>> RDD represents 10 million rows, and I want to work on only 1000 of them,
>>> is there an efficient way of filtering the subset without having to scan
>>> every element?
Using .take(1000) may be a biased selection (it just returns the first 1000
elements it finds), but it does not scan every element; see the sketch below.
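A rough illustration of the difference (rdd and wantedKeys are hypothetical,
and this assumes a pair RDD keyed by something you can select on):

// take(n) returns the first n elements it finds and only reads as many
// partitions as it needs, but the selection is not random or key-based.
val firstThousand = rdd.take(1000)

// filter() evaluates the predicate on every element when an action runs,
// so the whole RDD is scanned (in parallel across partitions).
val subset = rdd.filter { case (k, _) => wantedKeys.contains(k) }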
shahabm wrote
> I noticed that rdd.cache() is not happening immediately rather due to lazy
> feature of Spark, it is happening just at the moment you perform some
> map/reduce actions. Is this true?
Yes, .cache() is lazy: it only marks the RDD for persistence, and the data is
actually cached the first time an action computes it. See the sketch below.
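A small sketch of that laziness (the path and parse function are hypothetical):

val data = sc.textFile("events.log").map(parseLine).cache()  // nothing is computed or cached yet

data.count()   // first action: computes the RDD and materializes it in the cache
data.count()   // second action: reads the already-cached partitions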
shahabm wrote
> If this is the case, how can