Re: something about rdd.collect

2014-10-14 Thread randylu
If memory is not enough, an OutOfMemory exception should occur, but there is nothing in the driver's log.
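
One thing worth checking (a minimal sketch using only the plain JVM API; counts is a stand-in for the RDD being collected): a driver stuck in back-to-back full GCs can appear hung without ever writing an OutOfMemoryError to its log. Logging heap usage around the call makes that pressure visible:

    // Minimal heap probe around the suspect call; plain JVM API, no Spark imports.
    // counts is a stand-in for the RDD[(String, Int)] being collected.
    def logHeap(label: String): Unit = {
      val rt = Runtime.getRuntime
      val usedMiB = (rt.totalMemory - rt.freeMemory) / (1024L * 1024L)
      val maxMiB = rt.maxMemory / (1024L * 1024L)
      println(s"[$label] driver heap: $usedMiB MiB used of $maxMiB MiB max")
    }

    logHeap("before collect")
    val result = counts.collect() // the call that appears to hang
    logHeap("after collect")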

Re: something about rdd.collect

2014-10-14 Thread randylu
Thanks rxin. I still have a doubt about collect(). The number of words before reduceByKey() is about 200 million, and after reduceByKey() it decreases to 18 million. The driver's memory is initialized to 15 GB, and when I print "runtime.freeMemory()" before reduceByKey(), it indicates 13 GB of free memory.
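
As a rough back-of-envelope check (every byte figure below is an assumed typical 64-bit JVM overhead, not a measurement from this job), even the reduced result is sizable once materialized as JVM objects on the driver:

    // Estimate of driver heap for 18 million (String, Int) pairs held as
    // an Array[(String, Int)]. All overhead figures are assumptions.
    val pairs = 18000000L
    val avgWordChars = 8L // assumed average word length
    val bytesPerEntry =
      32 +                  // Tuple2 header plus two references (assumed)
      16 +                  // boxed java.lang.Integer (assumed)
      40 + 2 * avgWordChars // String header plus backing chars (assumed)
    val totalGiB = pairs * bytesPerEntry / math.pow(1024, 3)
    println(f"roughly $totalGiB%.1f GiB for the deserialized result alone")
    // Peak demand is higher still: serialized task results are buffered on
    // the driver before deserialization, so transient usage can be a
    // multiple of this figure.

That works out to around 2 GiB for the array itself, and with serialization buffers, fetch-side copies, and GC headroom on top, 13 GB of free heap is less comfortable than it looks.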

Re: something about rdd.collect

2014-10-14 Thread Reynold Xin
Hi Randy, collect essentially transfers all the data to the driver node. You definitely wouldn't want to collect 200 million words. That is a pretty large number, and you can run out of memory on your driver with that much data. -- Reynold Xin
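
To make that concrete, here is a standalone sketch (app name and input path are placeholders, not from this thread) of the same word-count pipeline using bounded alternatives to collect():

    import org.apache.spark.{SparkConf, SparkContext}

    // Placeholder app: same shape of pipeline, but only bounded results
    // ever reach the driver. Paths and names are hypothetical.
    object AvoidBigCollect {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("avoid-big-collect"))
        val documents = sc.textFile("hdfs:///tmp/docs").map(_.split(" ").toSeq)
        val counts = documents
          .flatMap(words => words.map(w => (w, 1)))
          .reduceByKey(_ + _)

        val sample = counts.take(100) // bounded: only 100 pairs reach the driver
        val top20 = counts.takeOrdered(20)(Ordering.by[(String, Int), Int](p => -p._2))
        top20.foreach { case (w, n) => println(s"$w\t$n") }

        sc.stop()
      }
    }

take(n) and takeOrdered(n) both run on the cluster and ship only n elements back, so the driver's heap stays out of the picture.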

something about rdd.collect

2014-10-14 Thread randylu
My code is as follows:

    documents.flatMap(words => words.map(w => (w, 1))).reduceByKey(_ + _).collect()

In the driver's log, reduceByKey() is finished, but collect() seems to run forever and never finishes. In addition, there are about 200,000,000 words to be collected. Is it too much data to collect?
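
A hedged guard for this situation (the one-million threshold and the output path are arbitrary placeholders; counts is the RDD produced by reduceByKey() above): check the result's size on the cluster before deciding whether collect() is safe.

    // Gauge result size before deciding whether collect() is safe.
    // Threshold and output path are arbitrary placeholders.
    val counts = documents
      .flatMap(words => words.map(w => (w, 1)))
      .reduceByKey(_ + _)
    val distinctWords = counts.count() // computed on the cluster; returns one Long
    if (distinctWords < 1000000L) {
      val all = counts.collect() // small enough to materialize on the driver
      println(s"collected ${all.length} pairs")
    } else {
      // Too large for the driver: write it out from the executors instead.
      counts.saveAsTextFile("hdfs:///tmp/word-counts")
    }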