Yeah, I believe repartition() is a lazy operation, but it's strange that adding a count() can affect anything. I wonder if it has anything to do with rebinding a name to a transformation of the RDD it previously pointed to in PySpark. Maybe combining lazy transformations with Python's labels-not-variables design <http://foobarnbaz.com/2012/07/08/understanding-python-variables/> is somehow problematic? I have no idea.
As in:

    # bad?
    a = sc.textFile(...)
    a = a.repartition(8)

    # better?
    a = sc.textFile(...)
    b = a.repartition(8)

Nick

On Fri, May 9, 2014 at 6:29 PM, Walrus theCat <walrusthe...@gmail.com> wrote:

> Nick,
>
> I have encountered strange things like this before (usually when
> programming with mutable structures and side-effects), and for me, the
> answer was that until .count() (or .first(), or similar) is called, your
> variable 'a' refers to a set of instructions that only get executed to
> form the object you expect when you ask something of it. Back before I
> was using side-effect-free techniques on immutable data structures, I
> had to call .first or .count or similar to get the behavior I wanted.
> There are still special cases where I have to purposefully "collapse"
> the RDD for some reason or another. This may not be new information to
> you, but I've encountered similar behavior before and highly suspect it
> is playing a role here.
>
> On Mon, May 5, 2014 at 5:52 PM, Nicholas Chammas
> <nicholas.cham...@gmail.com> wrote:
>
>> I'm running into something very strange today. I'm getting an error on
>> the following innocuous operations.
>>
>>     a = sc.textFile('s3n://...')
>>     a = a.repartition(8)
>>     a = a.map(...)
>>     c = a.countByKey()  # ERRORs out on this action. See below for traceback. [1]
>>
>> If I add a count() right after the repartition(), this error magically
>> goes away.
>>
>>     a = sc.textFile('s3n://...')
>>     a = a.repartition(8)
>>     print a.count()
>>     a = a.map(...)
>>     c = a.countByKey()  # A-OK! WTF?
>>
>> To top it off, this "fix" is inconsistent. Sometimes I still get the
>> error.
>>
>> This is strange. How do I get to the bottom of this?
>>
>> Nick
>>
>> [1] Here's the traceback:
>>
>>     Traceback (most recent call last):
>>       File "<stdin>", line 7, in <module>
>>       File "file.py", line 187, in function_blah
>>         c = a.countByKey()
>>       File "/root/spark/python/pyspark/rdd.py", line 778, in countByKey
>>         return self.map(lambda x: x[0]).countByValue()
>>       File "/root/spark/python/pyspark/rdd.py", line 624, in countByValue
>>         return self.mapPartitions(countPartition).reduce(mergeMaps)
>>       File "/root/spark/python/pyspark/rdd.py", line 505, in reduce
>>         vals = self.mapPartitions(func).collect()
>>       File "/root/spark/python/pyspark/rdd.py", line 469, in collect
>>         bytesInJava = self._jrdd.collect().iterator()
>>       File "/root/spark/python/lib/py4j-0.8.1-src.zip/py4j/java_gateway.py", line 537, in __call__
>>       File "/root/spark/python/lib/py4j-0.8.1-src.zip/py4j/protocol.py", line 300, in get_return_value
>>     py4j.protocol.Py4JJavaError: An error occurred while calling o46.collect.
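
For anyone reading along later, here is a minimal, self-contained sketch of the laziness discussed above. It is an illustration, not Nick's actual job: `parallelize(range(100))`, the `lambda`, and the app name are placeholder stand-ins for the original `sc.textFile('s3n://...')` read and `map(...)`.

    from pyspark import SparkContext

    sc = SparkContext(appName="lazy-eval-sketch")  # hypothetical app name

    # Transformations only extend the lineage; nothing runs yet.
    a = sc.parallelize(range(100))      # stand-in for sc.textFile('s3n://...')
    a = a.repartition(8)                # lazy: returns a new RDD, mutates nothing
    a = a.map(lambda x: (x % 10, 1))    # lazy: placeholder for the real map(...)

    # Rebinding the name 'a' never mutated an RDD: each step above produced
    # a new, immutable RDD pointing back to its parent, so the "bad?" and
    # "better?" patterns from Nick's reply build identical lineages.

    # Actions are what actually execute the pipeline:
    print(a.count())       # forces one full evaluation of the lineage
    print(a.countByKey())  # runs the lineage again from the source;
                           # nothing is reused unless cache()/persist() is called

Since the rebinding and non-rebinding versions describe the same lineage, any behavioral difference from inserting a count() would have to come from the execution side rather than from how the RDD is defined, which is consistent with Walrus's suspicion that forcing evaluation is what matters here.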