I am loading a simple text file using pyspark. Repartitioning it seems to
produce garbage data.

I got these results using Spark 2.1 prebuilt for Hadoop 2.7, in the pyspark
shell.

>>> sc.textFile("outc").collect()
[u'a', u'b', u'c', u'd', u'e', u'f', u'g', u'h', u'i', u'j', u'k', u'l']
>>> sc.textFile("outc", use_unicode=False).collect()
['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l']

Repartitioning seems to produce garbage, and only 2 records come back here:
>>> sc.textFile("outc", use_unicode=False).repartition(10).collect()
['\x80\x02]q\x01(U\x01aU\x01bU\x01cU\x01dU\x01eU\x01fU\x01ge.',
'\x80\x02]q\x01(U\x01hU\x01iU\x01jU\x01kU\x01le.']
>>> sc.textFile("outc", use_unicode=False).repartition(10).count()
2
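
One observation on the output above: both strings start with \x80\x02, which is
the header of a Python pickle (protocol 2). The snippet below is only a sketch,
assuming the two "records" are pickled batches of the original lines; the byte
strings are copied from the collect() output, and the helper name
unpickle_batches is made up for illustration. If the assumption holds, it would
suggest that the data itself survives the repartition and only the
deserialization on the Python side is off.

```python
import pickle

# The two "records" returned by repartition(10).collect(), copied from the
# shell output above.
garbage = [
    b'\x80\x02]q\x01(U\x01aU\x01bU\x01cU\x01dU\x01eU\x01fU\x01ge.',
    b'\x80\x02]q\x01(U\x01hU\x01iU\x01jU\x01kU\x01le.',
]

def unpickle_batches(blobs):
    """Unpickle each blob (assumed to be a pickled list) and flatten the result."""
    records = []
    for blob in blobs:
        records.extend(pickle.loads(blob))
    return records

recovered = unpickle_batches(garbage)
print(recovered)
```

Unpickling the two blobs gives back the twelve original lines, seven in the
first batch and five in the second, which matches the count of 2 reported above.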


Without setting use_unicode=False, collecting after a repartition fails outright:
>>> sc.textFile("outc").repartition(19).collect()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/snuderl/scrappstore/thirdparty/spark-2.1.0-bin-hadoop2.7/python/pyspark/rdd.py", line 810, in collect
    return list(_load_from_socket(port, self._jrdd_deserializer))
  File "/home/snuderl/scrappstore/thirdparty/spark-2.1.0-bin-hadoop2.7/python/pyspark/rdd.py", line 140, in _load_from_socket
    for item in serializer.load_stream(rf):
  File "/home/snuderl/scrappstore/thirdparty/spark-2.1.0-bin-hadoop2.7/python/pyspark/serializers.py", line 529, in load_stream
    yield self.loads(stream)
  File "/home/snuderl/scrappstore/thirdparty/spark-2.1.0-bin-hadoop2.7/python/pyspark/serializers.py", line 524, in loads
    return s.decode("utf-8") if self.use_unicode else s
  File "/home/snuderl/scrappstore/virtualenv/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0x80 in position 0: invalid start byte



Input file contents:
a
b
c
d
e
f
g
h
i
j
k
l
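
To recreate the input for a reproduction, something like the following should
do. It is a sketch that assumes "outc" is a single plain-text file with one
letter per line and a trailing newline; the function name write_input is just
for illustration.

```python
import string

def write_input(path="outc"):
    """Write the letters 'a' through 'l', one per line, to the given path."""
    letters = string.ascii_lowercase[:12]
    with open(path, "w") as f:
        f.write("\n".join(letters) + "\n")
    return list(letters)
```

Calling write_input() in the directory where the pyspark shell was started
creates the file that sc.textFile("outc") reads in the examples above.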



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Pyspark-2-1-0-weird-behavior-with-repartition-tp28350.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
