Wouldn’t this work if you load the files into HDFS and set the number of
partitions equal to the amount of parallelism you want?

From: Saatvik Shah [mailto:saatvikshah1...@gmail.com]
Sent: Friday, June 30, 2017 8:55 AM
To: ayan guha
Cc: user
Subject: Re: PySpark working with Generators

Hey Ayan,

This isn't a typical text file. It's a proprietary data format for which a
native Spark reader is not available.

Thanks and Regards,
Saatvik Shah

On Thu, Jun 29, 2017 at 6:48 PM, ayan guha <guha.a...@gmail.com> wrote:
If your files are in the same location, you can use sc.wholeTextFiles. If
not, sc.textFile accepts a comma-separated list of file paths.

On Fri, 30 Jun 2017 at 5:59 am, saatvikshah1994 <saatvikshah1...@gmail.com>
wrote:
Hi,

I have a file-reading function called foo, which reads a file's contents
either into a list of lists or into a generator of lists of lists
representing the same file.

When reading the file as one complete chunk (1 record array), I do something
like:
rdd = file_paths_rdd.map(lambda x: foo(x, "wholeFile")).flatMap(lambda x: x)

I'd now like to do something similar with the generator, so that I can work
with more cores and a lower memory footprint. I'm not sure how to tackle
this: generators cannot be pickled, so I don't see how to distribute the
work of reading each file path in the RDD.
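
One possible approach, sketched under assumptions rather than tested against
the real format: the generator itself never needs to be pickled. flatMap
ships the function to the executors, and the generator is only created there
when the function is called on each path. In the sketch below, foo is a
hypothetical stand-in for the real reader, the "generator" mode string is
assumed, and read_records, build_rdd and num_slices are illustrative names
that do not come from the thread.

```python
def foo(path, mode):
    """Hypothetical stand-in for the proprietary reader described above."""
    rows = [[path, i] for i in range(4)]
    if mode == "wholeFile":
        return rows  # the whole file as one list of lists
    # Assumed generator mode: yield the file in chunks of two rows each.
    return (rows[i:i + 2] for i in range(0, len(rows), 2))


def read_records(path):
    """Yield one record at a time; runs on the executor, so nothing
    generator-related is ever serialized."""
    for chunk in foo(path, "generator"):
        for row in chunk:
            yield row


def build_rdd(sc, file_paths, num_slices):
    """Distribute the paths (plain strings pickle fine) and read lazily.

    sc is an existing SparkContext; num_slices sets how many partitions,
    and therefore how many parallel read tasks, the paths are split into.
    """
    return sc.parallelize(file_paths, num_slices).flatMap(read_records)
```

With this shape, only the path strings and the function travel to the
workers, and each task holds a single chunk of a file in memory at a time
instead of the whole file.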



--
Best Regards,
Ayan Guha

