I kinda reproduced that, with PySpark 2.1 built for Hadoop 2.6 as well, and with Python 3.x. I'll look into it a bit more after I've fixed a few other issues regarding the salting of strings on the cluster.
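A hint at what may be going on: the "garbage" records below don't look like random corruption. b'\x80\x02' is the magic prefix of pickle protocol 2, so each returned "record" looks like a pickled batch of the original lines, as if repartition() re-serialized the data with PySpark's pickle serializer while collect() still read it back with the plain UTF-8 text deserializer. A quick check on the bytes from the report (a minimal sketch, runnable in a plain Python shell, no Spark needed):

    import pickle

    # The two "records" returned by repartition(10).collect() in the report.
    raw = [
        b'\x80\x02]q\x01(U\x01aU\x01bU\x01cU\x01dU\x01eU\x01fU\x01ge.',
        b'\x80\x02]q\x01(U\x01hU\x01iU\x01jU\x01kU\x01le.',
    ]

    # Each blob unpickles into a batch of the original input lines.
    for blob in raw:
        print(pickle.loads(blob))
    # ['a', 'b', 'c', 'd', 'e', 'f', 'g']
    # ['h', 'i', 'j', 'k', 'l']

If that serializer mismatch really is the cause, forcing the RDD through the default pickle serializer before repartitioning might sidestep it, e.g. with an identity map() (untested, just a sketch based on that theory):

    >>> sc.textFile("outc").map(lambda line: line).repartition(10).collect()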
2017-01-30 20:19 GMT+01:00 Blaž Šnuderl <snud...@gmail.com>:

> I am loading a simple text file using pyspark. Repartitioning it seems to
> produce garbage data.
>
> I got these results using Spark 2.1 prebuilt for Hadoop 2.7, in the
> pyspark shell.
>
> >>> sc.textFile("outc").collect()
> [u'a', u'b', u'c', u'd', u'e', u'f', u'g', u'h', u'i', u'j', u'k', u'l']
> >>> sc.textFile("outc", use_unicode=False).collect()
> ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l']
>
> Repartitioning seems to produce garbage, and also only 2 records here:
> >>> sc.textFile("outc", use_unicode=False).repartition(10).collect()
> ['\x80\x02]q\x01(U\x01aU\x01bU\x01cU\x01dU\x01eU\x01fU\x01ge.',
> '\x80\x02]q\x01(U\x01hU\x01iU\x01jU\x01kU\x01le.']
> >>> sc.textFile("outc", use_unicode=False).repartition(10).count()
> 2
>
> Without setting use_unicode=False we can't even repartition at all:
> >>> sc.textFile("outc").repartition(19).collect()
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "/home/snuderl/scrappstore/thirdparty/spark-2.1.0-bin-hadoop2.7/python/pyspark/rdd.py", line 810, in collect
>     return list(_load_from_socket(port, self._jrdd_deserializer))
>   File "/home/snuderl/scrappstore/thirdparty/spark-2.1.0-bin-hadoop2.7/python/pyspark/rdd.py", line 140, in _load_from_socket
>     for item in serializer.load_stream(rf):
>   File "/home/snuderl/scrappstore/thirdparty/spark-2.1.0-bin-hadoop2.7/python/pyspark/serializers.py", line 529, in load_stream
>     yield self.loads(stream)
>   File "/home/snuderl/scrappstore/thirdparty/spark-2.1.0-bin-hadoop2.7/python/pyspark/serializers.py", line 524, in loads
>     return s.decode("utf-8") if self.use_unicode else s
>   File "/home/snuderl/scrappstore/virtualenv/lib/python2.7/encodings/utf_8.py", line 16, in decode
>     return codecs.utf_8_decode(input, errors, True)
> UnicodeDecodeError: 'utf8' codec can't decode byte 0x80 in position 0:
> invalid start byte
>
> Input file contents:
> a
> b
> c
> d
> e
> f
> g
> h
> i
> j
> k
> l
>
> --
> View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Pyspark-2-1-0-weird-behavior-with-repartition-tp28350.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> ---------------------------------------------------------------------
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org

--
Olivier Girardot | Associé
o.girar...@lateral-thoughts.com
+33 6 24 09 17 94