You can use sc.newAPIHadoopFile <http://spark.apache.org/docs/1.2.0/api/scala/index.html#org.apache.spark.SparkContext> with CSVInputFormat <https://github.com/mvallebr/CSVInputFormat>, which should read the csv file properly (one record per logical row, even when fields contain line breaks).
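Something along these lines, as a rough sketch; note that the exact InputFormat class name, its package, and the key/value types below are my assumptions, so please check them against the CSVInputFormat sources:

import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.spark.{SparkConf, SparkContext}
// Assumed class and package name, verify against the CSVInputFormat project.
import org.apache.hadoop.mapreduce.lib.input.CSVNLineInputFormat

val sc = new SparkContext(new SparkConf().setAppName("csv-read"))

// newAPIHadoopFile takes the path plus the InputFormat, key and value classes.
// The key/value types here (LongWritable, java.util.List[Text]) are what I'd
// expect from the record reader, adjust if the library defines them differently.
val rows = sc.newAPIHadoopFile(
  "hdfs:///path/to/input.csv",
  classOf[CSVNLineInputFormat],
  classOf[LongWritable],
  classOf[java.util.List[Text]])

// Each record is one logical CSV row, even if a field contains a line break.
rows.map { case (_, fields) => fields }.take(5).foreach(println)

That way the splitting into records is done by the InputFormat on the executors, so nothing has to be parsed or held on the driver.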
Thanks
Best Regards

On Sat, Mar 21, 2015 at 12:39 AM, Karlson <ksonsp...@siberie.de> wrote:

> Hi all,
>
> where is the data stored that is passed to sc.parallelize? Or put
> differently, where is the data for the base RDD fetched from when the DAG
> is executed, if the base RDD is constructed via sc.parallelize?
>
> I am reading a csv file via the Python csv module and am feeding the
> parsed data chunkwise to sc.parallelize, because the whole file would not
> fit into memory on the driver. Reading the file with sc.textFile first is
> not an option, as there might be line breaks inside the csv fields,
> preventing me from parsing the file line by line.
>
> The problem I am facing right now is that even though I am feeding only
> one chunk at a time to Spark, I will eventually run out of memory on the
> driver.
>
> Thanks in advance!