Also, you may want to use .lookup() instead of .filter():

def lookup(key: K): Seq[V]

Return the list of values in the RDD for key key. This operation is done
efficiently if the RDD has a known partitioner by only searching the
partition that the key maps to.
You might want to partition your RDD first so that .lookup() can take
advantage of the known partitioner.
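A minimal sketch of that pattern (the input path, key extraction, and
partition count are just placeholders):

import org.apache.spark.HashPartitioner

val pairs = sc.textFile("events.log")            // hypothetical input
  .map(line => (line.split(",")(0), line))       // key by the first field
  .partitionBy(new HashPartitioner(100))         // gives the RDD a known partitioner
  .persist()

val values: Seq[String] = pairs.lookup("someKey")  // only searches the partition "someKey" maps to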
Hi,
In the talk "A Deeper Understanding of Spark Internals", it was mentioned
that for some operators Spark can spill to disk across keys (in 1.1:
.groupByKey(), .reduceByKey(), .sortByKey()), but that, as a limitation of
the shuffle at that time, each single key-value pair must fit in memory.
I'm not sure about .union(), but at least in the case of .join(), as long as
you have hash partitioned the original RDDs and persisted them, calls to
.join() take advantage of already knowing which partition the keys are on,
and will not repartition rdd1.
import org.apache.spark.HashPartitioner
val rdd1 = log.partitionBy(new HashPartitioner(100)).persist()  // partition count is illustrative
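Continuing that line (the second RDD is an assumption), the other side of the
join should use an equal partitioner so rdd1 is not re-shuffled:

val rdd2 = otherLog.partitionBy(new HashPartitioner(100)).persist()  // hypothetical second pair RDD

// Both RDDs share an equal HashPartitioner, so join() reuses the existing
// partitioning instead of re-shuffling rdd1.
val joined = rdd1.join(rdd2)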
Also available is .sample(), which randomly samples your RDD with or
without replacement and returns an RDD.
.sample() takes a fraction, so it doesn't return an exact number of
elements.
e.g.
rdd.sample(true, .0001, 1)
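If you need an exact number of elements instead, .takeSample() returns
exactly num elements, but as a local Array on the driver rather than an RDD
(this assumes an RDD named rdd):

// fraction-based: returns an RDD, element count varies from run to run
val approx = rdd.sample(true, 0.0001, 1)

// exact count: returns exactly 1000 sampled elements as an Array on the driver
val exact = rdd.takeSample(false, 1000, 1)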
>> nsareen wrote
>>> 1) Does the filter function scan every element saved in the RDD? If my
>>> RDD represents 10 million rows, and I want to work on only 1000 of them,
>>> is there an efficient way of filtering the subset without having to scan
>>> every element?
Using .take(1000) may be a biased selection (it just returns the first 1000
elements it finds), but it does not scan every element; see the sketch below.
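A rough illustration of the difference (rdd and wantedKeys are hypothetical,
and this assumes a pair RDD keyed by something you can select on):

// take(n) returns the first n elements it finds and only reads as many
// partitions as it needs, but the selection is not random or key-based.
val firstThousand = rdd.take(1000)

// filter() evaluates the predicate on every element when an action runs,
// so the whole RDD is scanned (in parallel across partitions).
val subset = rdd.filter { case (k, _) => wantedKeys.contains(k) }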
shahabm wrote
> I noticed that rdd.cache() is not happening immediately rather due to lazy
> feature of Spark, it is happening just at the moment you perform some
> map/reduce actions. Is this true?
Yes, .cache() is lazy: it only marks the RDD for persistence, and the data is
actually cached the first time an action computes it. See the sketch below.
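A small sketch of that laziness (the path and parse function are hypothetical):

val data = sc.textFile("events.log").map(parseLine).cache()  // nothing is computed or cached yet

data.count()   // first action: computes the RDD and materializes it in the cache
data.count()   // second action: reads the already-cached partitions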
shahabm wrote
> If this is the case, how can