Re: Spark runs out of memory with small file

2017-02-28 Thread Henry Tremblay
Cool! Now I understand how to approach this problem. At my last position, I don't think we did it quite efficiently. Maybe a blog post by me? Henry On 02/28/2017 01:22 AM, 颜发才(Yan Facai) wrote: Google is your friend, Henry. http://stackoverflow.com/questions/21185092/apache-spark-map-vs-mapp
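
The linked question contrasts map and mapPartitions, which is the crux of the thread. A minimal sketch of the difference, with illustrative values only:

    from pyspark import SparkContext

    sc = SparkContext(appName="map_vs_mapPartitions")
    rdd = sc.parallelize(range(10), 2)

    # map: the function runs once per element.
    squares = rdd.map(lambda x: x * x)

    # mapPartitions: the function runs once per partition and receives an
    # iterator over that whole partition, so it can keep state across elements
    # (a running parser, a counter, a reused connection).
    def sum_partition(it):
        total = 0
        for x in it:
            total += x
        yield total

    sums = rdd.mapPartitions(sum_partition)
    print(squares.collect())   # [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
    print(sums.collect())      # one value per partition, e.g. [10, 35]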

Re: Spark runs out of memory with small file

2017-02-27 Thread Henry Tremblay
Thanks! That works: def process_file(my_iter): the_id = "init" final = [] for chunk in my_iter: lines = chunk[1].split("\n") for line in lines: if line[0:15] == 'WARC-Record-ID:': the_id = line[15:] final.append(Row(the_id = the_
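
The preview truncates the function mid-call; a guess at its complete shape, with the final Row(...) line reconstructed rather than quoted, assuming the input RDD comes from sc.wholeTextFiles:

    from pyspark.sql import Row

    def process_file(my_iter):
        # my_iter yields (path, whole_file_text) pairs from wholeTextFiles.
        the_id = "init"
        final = []
        for chunk in my_iter:
            lines = chunk[1].split("\n")
            for line in lines:
                # Each 'WARC-Record-ID:' header starts a new record; remember
                # its id and tag every following line with it.
                if line[0:15] == 'WARC-Record-ID:':
                    the_id = line[15:]
                final.append(Row(the_id=the_id, line=line))
        return final

    # rdd = sc.wholeTextFiles("s3://bucket/file.warc.gz")   # hypothetical path
    # rows = rdd.mapPartitions(process_file)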

Re: Spark runs out of memory with small file

2017-02-27 Thread Henry Tremblay
This won't work: rdd2 = rdd.flatMap(splitf) rdd2.take(1) [u'WARC/1.0\r'] rdd2.count() 508310 If I then try to apply a map to rdd2, the map only works on each individual line. I need to create a state machine as in my second function. That is, I need to apply a key to each line, but the key
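
The point being made: after flatMap the RDD holds bare lines, and a map callback sees one line at a time with no memory of which WARC record it belongs to. A sketch of that dead end, assuming rdd comes from sc.wholeTextFiles and that splitf (only shown truncated in the thread) splits a (path, contents) pair into lines:

    def splitf(pair):
        # (path, whole_file_text) -> one element per line, which is why
        # rdd2.take(1) shows a lone header line like u'WARC/1.0\r'.
        return pair[1].split("\n")

    rdd2 = rdd.flatMap(splitf)

    # A plain map can only do per-line work; it cannot know the current record:
    cleaned = rdd2.map(lambda line: line.rstrip("\r"))

    # Keying each line by the last-seen WARC-Record-ID needs state that carries
    # over from one line to the next, i.e. the mapPartitions state machine
    # shown further up in this digest.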

Re: Spark runs out of memory with small file

2017-02-26 Thread Pavel Plotnikov
Hi, Henry. In the first example the dict d always contains only one value because the_id is always the same; in the second case the dict grows very quickly. So I suggest first applying a map function to split your file string into rows, then repartitioning, and then applying the custom logic. Example: def splitf(
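
The example is cut off at def splitf(; a sketch of the suggested split-then-repartition pipeline, assuming input from sc.wholeTextFiles, a hypothetical path, and an example partition count:

    def splitf(pair):
        # (path, whole_file_text) -> individual lines
        return pair[1].split("\n")

    raw = sc.wholeTextFiles("s3://bucket/crawl.warc.gz")   # hypothetical path
    lines = raw.flatMap(splitf)

    # The single gzipped file arrives in one partition; spread the lines out
    # before running the per-record logic.
    lines = lines.repartition(32)   # 32 is an illustrative value

    # ...then apply the custom parsing logic, e.g. with mapPartitions.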

Re: Spark runs out of memory with small file

2017-02-26 Thread Henry Tremblay
Not sure where you want me to put yield. My first try caused an error in Spark saying it could not pickle generator objects. On 02/26/2017 03:25 PM, ayan guha wrote: Hi We are doing similar stuff, but with a large number of small-ish files. What we do is write a function to parse a complete file
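
One common way to hit "could not pickle generator objects" is to hand a generator function to map, which makes every element of the result a generator object; this is only a guess at what happened, not something confirmed in the thread. Assuming rdd comes from sc.wholeTextFiles:

    def parse_file(pair):
        # Generator function; see the yield-based parser suggested below.
        path, contents = pair
        for line in contents.split("\n"):
            yield (path, line)

    # rdd.map(parse_file) builds an RDD *of generator objects*, which fails as
    # soon as Spark has to serialize them.
    # rdd.flatMap(parse_file) consumes each generator on the executor and emits
    # its items, so only plain tuples ever get pickled.
    rows = rdd.flatMap(parse_file)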

Re: Spark runs out of memory with small file

2017-02-26 Thread ayan guha
Hi We are doing similar stuff, but with a large number of small-ish files. What we do is write a function to parse a complete file, similar to your parse file. But we use yield instead of return, and flatMap on top of it. Can you give it a try and let us know if it works? On Mon, Feb 27, 2017 at 9:
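
A minimal sketch of the yield-plus-flatMap pattern being suggested, reusing the WARC-Record-ID logic from elsewhere in the thread; the exact parser is not shown, so treat the details as assumptions:

    from pyspark.sql import Row

    def parse_file(pair):
        # Generator version of the parser: rows are yielded one at a time, so
        # the fully parsed file never has to be materialized as one big list.
        path, contents = pair
        the_id = "init"
        for line in contents.split("\n"):
            if line.startswith("WARC-Record-ID:"):
                the_id = line[len("WARC-Record-ID:"):]
            yield Row(the_id=the_id, line=line)

    rdd = sc.wholeTextFiles("s3://bucket/crawl.warc.gz")   # hypothetical path
    rows = rdd.flatMap(parse_file)   # flatMap drains the generator lazily

Because flatMap calls the generator function on the executors, only the function itself is serialized, never a generator object.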

Re: Spark runs out of memory with small file

2017-02-26 Thread Koert Kuipers
Using wholeTextFiles to process formats that cannot be split per line is not "old", and there are plenty of problems for which RDD is still better suited than Dataset or DataFrame currently (this might change in the near future when Dataset gets some crucial optimizations fixed). On Sun, Feb 26, 2017 at
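
For reference, sc.wholeTextFiles delivers each file as one (path, contents) pair, so a record format that spans multiple lines never gets cut across input splits. A minimal sketch with a hypothetical path:

    # Each element is (file_path, entire_file_contents); the record layout does
    # not have to line up with Spark's own input splits.
    pairs = sc.wholeTextFiles("s3://bucket/warc-files/")   # hypothetical path
    first_path, first_contents = pairs.first()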

Re: Spark runs out of memory with small file

2017-02-26 Thread Henry Tremblay
I am actually using Spark 2.1 and trying to solve a real-life problem. Unfortunately, some of the discussion of my problem went offline, and then I started a new thread. Here is my problem: I am parsing crawl data which exists in a flat-file format. It looks like this: u'WARC/1.0', u'WARC-

Re: Spark runs out of memory with small file

2017-02-26 Thread Gourav Sengupta
Hi Henry, Those guys in Databricks training are nuts and still use Spark 1.x for their exams. Learning SPARK is a VERY VERY VERY old way of solving problems using SPARK. The core engine of SPARK, which even I understand, has gone through several fundamental changes. Just try reading the file usi
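
The advice is cut off at "reading the file usi"; it presumably points at the Spark 2.x DataFrame reader rather than raw RDDs. A hedged sketch of that route (spark.read.text is a guess at the intended call, and the path is hypothetical):

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import input_file_name

    spark = SparkSession.builder.appName("warc_read").getOrCreate()

    # One DataFrame row per line, with the text in the 'value' column;
    # input_file_name() records which file each line came from.
    df = (spark.read.text("s3://bucket/crawl.warc.gz")   # hypothetical path
              .withColumn("path", input_file_name()))
    df.show(5, truncate=False)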

Re: Spark runs out of memory with small file

2017-02-26 Thread Henry Tremblay
The file is so small that a standalone Python script, independent of Spark, can process the file in under a second. Also, the following fails: 1. Read the whole file in with wholeTextFiles 2. Use flatMap to get 50,000 rows that look like: Row(id="path", line="line") 3. Save the results as CSV
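
To make the failing three-step recipe concrete, a sketch under these assumptions: input via sc.wholeTextFiles, the Row shape quoted above, hypothetical S3 paths, and an existing SparkSession named spark:

    from pyspark.sql import Row

    raw = sc.wholeTextFiles("s3://bucket/crawl.warc.gz")     # hypothetical path

    def to_rows(pair):
        path, contents = pair
        for line in contents.split("\n"):
            yield Row(id=path, line=line)

    rows = raw.flatMap(to_rows)                  # ~50,000 rows for this file
    df = spark.createDataFrame(rows)
    df.write.csv("s3://bucket/out/", mode="overwrite")       # hypothetical path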

Re: Spark runs out of memory with small file

2017-02-26 Thread Yan Facai
Hi, Tremblay. Your file is in .gz format, which is not splittable for Hadoop. Perhaps the file is loaded by only one executor. How many executors do you start? Perhaps the repartition method could solve it, I guess. On Sun, Feb 26, 2017 at 3:33 AM, Henry Tremblay wrote: > I am reading in a single smal
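
A sketch of the diagnosis and the repartition suggestion; the path and partition count are illustrative:

    rdd = sc.textFile("s3://bucket/crawl.warc.gz")   # hypothetical path

    # gzip is not a splittable codec, so the single file lands in one partition
    # and therefore on one executor:
    print(rdd.getNumPartitions())   # typically 1 for a lone .gz file

    # Spread the work across the cluster before the expensive per-line steps:
    rdd = rdd.repartition(32)       # 32 is an example value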