Yeah, I believe repartition() is a lazy transformation, so it’s strange
that adding a count() can affect anything. I wonder if it has anything to
do with rebinding a name to a transformation of the RDD it already refers
to in PySpark. Maybe combining lazy transformations with Python’s
labels-not-variables design
<http://foobarnbaz.com/2012/07/08/understanding-python-variables/>
is somehow problematic? I have no idea.

As in:

# bad?
a = sc.textFile(...)
a = a.repartition(8)

# better?
a = sc.textFile(...)
b = a.repartition(8)
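
A quick way to sanity-check the rebinding theory (a minimal sketch,
assuming an existing SparkContext named sc; the id() checks are just for
illustration):

a = sc.parallelize(range(100))
before = id(a)          # Python object id of the original RDD
a = a.repartition(8)    # repartition() returns a *new* RDD object
assert id(a) != before  # 'a' is rebound; the old RDD isn't mutated
print a.count()         # so both spellings above should be equivalent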

Nick


On Fri, May 9, 2014 at 6:29 PM, Walrus theCat <walrusthe...@gmail.com> wrote:

> Nick,
>
> I have encountered strange things like this before (usually when
> programming with mutable structures and side effects), and for me, the
> answer was that, until .count() (or .first(), or something similar) is
> called, your variable 'a' refers to a set of instructions that only get
> executed to form the object you expect when you ask something of it.
> Back before I was using side-effect-free techniques on immutable data
> structures, I had to call .first() or .count() to get the behavior I
> wanted.  There are still special cases where I have to purposefully
> "collapse" the RDD for some reason or another.  This may not be new
> information to you, but I've encountered similar behavior before and
> highly suspect it is playing a role here.
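>
> For instance, a toy sketch of what I mean (assuming an existing
> SparkContext 'sc'; the data is made up for illustration):
>
> a = sc.parallelize([('x', 1), ('y', 2)])
> a = a.map(lambda kv: (kv[0], kv[1] * 10))  # still just a recipe; nothing runs yet
> print a.count()  # the action is what actually executes the lineage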
>
>
> On Mon, May 5, 2014 at 5:52 PM, Nicholas Chammas
> <nicholas.cham...@gmail.com> wrote:
>
>> I’m running into something very strange today. I’m getting an error on
>> the following innocuous operations.
>>
>> a = sc.textFile('s3n://...')
>> a = a.repartition(8)
>> a = a.map(...)
>> c = a.countByKey() # ERRORs out on this action. See below for traceback. [1]
>>
>> If I add a count() right after the repartition(), this error magically
>> goes away.
>>
>> a = sc.textFile('s3n://...')
>> a = a.repartition(8)
>> print a.count()
>> a = a.map(...)
>> c = a.countByKey() # A-OK! WTF?
>>
>> To top it off, this “fix” is inconsistent. Sometimes, I still get this
>> error.
>>
>> This is strange. How do I get to the bottom of this?
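>>
>> One thing I may try is inspecting the lineage directly (a sketch,
>> assuming RDD.toDebugString() is available in this PySpark version):
>>
>> a = sc.textFile('s3n://...')
>> a = a.repartition(8)
>> print a.toDebugString()  # shows the planned lineage before any action runs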
>>
>> Nick
>>
>> [1] Here’s the traceback:
>>
>> Traceback (most recent call last):
>>   File "<stdin>", line 7, in <module>
>>   File "file.py", line 187, in function_blah
>>     c = a.countByKey()
>>   File "/root/spark/python/pyspark/rdd.py", line 778, in countByKey
>>     return self.map(lambda x: x[0]).countByValue()
>>   File "/root/spark/python/pyspark/rdd.py", line 624, in countByValue
>>     return self.mapPartitions(countPartition).reduce(mergeMaps)
>>   File "/root/spark/python/pyspark/rdd.py", line 505, in reduce
>>     vals = self.mapPartitions(func).collect()
>>   File "/root/spark/python/pyspark/rdd.py", line 469, in collect
>>     bytesInJava = self._jrdd.collect().iterator()
>>   File "/root/spark/python/lib/py4j-0.8.1-src.zip/py4j/java_gateway.py", 
>> line 537, in __call__
>>   File "/root/spark/python/lib/py4j-0.8.1-src.zip/py4j/protocol.py", line 
>> 300, in get_return_value
>> py4j.protocol.Py4JJavaError: An error occurred while calling o46.collect.
>>
>>
>> ------------------------------
>> View this message in context: How can adding a random count() change the
>> behavior of my 
>> program?<http://apache-spark-user-list.1001560.n3.nabble.com/How-can-adding-a-random-count-change-the-behavior-of-my-program-tp5406.html>
>> Sent from the Apache Spark User List mailing list 
>> archive<http://apache-spark-user-list.1001560.n3.nabble.com/>at Nabble.com.
>>
>
>
