parallelize uses the default Serializer (PickleSerializer) while textFile
uses UTF8Serializer.
You can get around this with index.zip(input_data._reserialize()) (or
index.zip(input_data.map(lambda x: x)))
(But if you try to just do this, you run into the issue with different
number of partitions
On Mon, Jul 28, 2014 at 12:58 PM, l wrote:
> I have a file in s3 that I want to map each line with an index. Here is my
> code:
>
input_data = sc.textFile('s3n:/myinput',minPartitions=6).cache()
N input_data.count()
index = sc.parallelize(range(N), 6)
index.zip(input_data).