Re: Question on mappartitionwithsplit

2014-08-17 Thread Josh Rosen
Has anyone tried using functools.partial (https://docs.python.org/2/library/functools.html#functools.partial) with PySpark? If it works, it might be a nice way to address this use case.

On Sun, Aug 17, 2014 at 7:35 PM, Davies Liu wrote:
> On Sun, Aug 17, 2014 at 11:21 AM, Chengi Liu wrote:
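A minimal sketch of the functools.partial idea follows. It has not been checked against a Spark cluster: `map_partitions_with_index` is a hypothetical local stand-in for `RDD.mapPartitionsWithIndex` so the snippet runs without PySpark, and the value `"extra"` is illustrative only. Note that `partial` fills positional arguments from the left, so `foo` is bound by keyword to leave the first two slots free for the index and the iterator.

```python
from functools import partial
from itertools import chain


def map_partitions_with_index(partitions, f):
    """Hypothetical local stand-in for RDD.mapPartitionsWithIndex (no Spark needed)."""
    return list(chain.from_iterable(f(i, iter(p)) for i, p in enumerate(partitions)))


def f(index, iterator, foo):
    # Emit one (partition index, foo) pair per partition.
    yield (index, foo)


# Bind foo by keyword so the two positional slots stay free for (index, iterator).
bound_f = partial(f, foo="extra")

partial_result = map_partitions_with_index([[1, 2], [3], [4, 5, 6]], bound_f)
print(partial_result)
```

With a real RDD the equivalent call would presumably be `rdd.mapPartitionsWithIndex(partial(f, foo=...))`, provided the partial object can be serialized and shipped to the workers in the PySpark version in use.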

Re: Question on mappartitionwithsplit

2014-08-17 Thread Davies Liu
On Sun, Aug 17, 2014 at 11:21 AM, Chengi Liu wrote:
> Hi,
> Thanks for the response.
> In the second case (f2), foo will have to be declared globally, right?
>
> My function is something like:
>
>     def indexing(splitIndex, iterator):
>         count = 0
>         offset = sum(offset_lists[:splitIndex]) if s

Re: Question on mappartitionwithsplit

2014-08-17 Thread Mohit Singh
Building on what Davies Liu said, how about something like:

    def indexing(splitIndex, iterator, offset_lists):
        count = 0
        offset = sum(offset_lists[:splitIndex]) if splitIndex else 0
        indexed = []
        for i, e in enumerate(iterator):
            index = count + offset + i
            for j, ele in enumerate(e
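As a rough illustration of passing offset_lists explicitly, here is a simplified sketch. It is not the function from the message above, whose inner loop is cut off in this archive; it only assigns a global index to each element. It assumes that offset_lists[k] holds the number of elements in partition k, and `map_partitions_with_index` is a hypothetical local stand-in for `RDD.mapPartitionsWithIndex` so the snippet runs without Spark.

```python
from itertools import chain


def map_partitions_with_index(partitions, f):
    """Hypothetical local stand-in for RDD.mapPartitionsWithIndex (no Spark needed)."""
    return list(chain.from_iterable(f(i, iter(p)) for i, p in enumerate(partitions)))


def indexing(split_index, iterator, offset_lists):
    # Assumption: offset_lists[k] is the element count of partition k, so the
    # sum over earlier partitions gives this partition's starting index.
    offset = sum(offset_lists[:split_index]) if split_index else 0
    for i, element in enumerate(iterator):
        yield (offset + i, element)


partitions = [["a", "b"], ["c"], ["d", "e"]]
offset_lists = [len(p) for p in partitions]

# The lambda adapts the three-argument function to the two-argument callback.
indexed = map_partitions_with_index(
    partitions, lambda idx, it: indexing(idx, it, offset_lists)
)
print(indexed)
```

On a real RDD, the per-partition counts would have to be computed in a separate pass before this step.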

Re: Question on mappartitionwithsplit

2014-08-17 Thread Chengi Liu
Hi,
Thanks for the response.
In the second case (f2), foo will have to be declared globally, right?

My function is something like:

    def indexing(splitIndex, iterator):
        count = 0
        offset = sum(offset_lists[:splitIndex]) if splitIndex else 0
        indexed = []
        for i, e in enumerate(iterator):

Re: Question on mappartitionwithsplit

2014-08-17 Thread Davies Liu
The callback function f only accepts 2 arguments. If you want to pass other objects to it, you need a closure, such as:

    foo = xxx
    def f(index, iterator, foo):
        yield (index, foo)
    rdd.mapPartitionsWithIndex(lambda index, it: f(index, it, foo))

You can also make f itself a closure:

    def f2(index,
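The two closure styles can be sketched as below. This is an illustration under assumptions, not the code from the message: `map_partitions_with_index` is a hypothetical local stand-in for `RDD.mapPartitionsWithIndex` so the snippet runs without Spark, and `make_callback` is a hypothetical factory name. Neither style requires foo to be a module-level global; it only has to be in scope where the closure is created.

```python
from itertools import chain


def map_partitions_with_index(partitions, f):
    """Hypothetical local stand-in for RDD.mapPartitionsWithIndex (no Spark needed)."""
    return list(chain.from_iterable(f(i, iter(p)) for i, p in enumerate(partitions)))


def f(index, iterator, foo):
    # Three-argument function that cannot be passed as the callback directly.
    yield (index, foo)


def make_callback(foo):
    # Hypothetical factory: returns a two-argument callback that captures foo.
    def callback(index, iterator):
        return f(index, iterator, foo)
    return callback


partitions = [["a"], ["b", "c"]]
foo = "shared"

# Style 1: a lambda that closes over foo and forwards it.
via_lambda = map_partitions_with_index(partitions, lambda index, it: f(index, it, foo))

# Style 2: a nested function built by a factory, with foo as a local variable.
via_factory = map_partitions_with_index(partitions, make_callback(foo))

print(via_lambda, via_factory)
```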

Question on mappartitionwithsplit

2014-08-17 Thread Chengi Liu
Hi,
In this example:
http://www.cs.berkeley.edu/~pwendell/strataconf/api/pyspark/pyspark.rdd.RDD-class.html#mapPartitionsWithSplit

Let's say f takes three arguments:

    def f(splitIndex, iterator, foo):
        yield splitIndex

Now, how do I send this foo parameter to this method?

    rdd.mapPartitionsWithSplit(