I more or less reproduced that, with PySpark 2.1 (built for Hadoop 2.6) and
Python 3.x.
I'll look into it a bit more after I've fixed a few other issues regarding
the salting of strings on the cluster.
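For reference, here is roughly the snippet I used to check it (a minimal
sketch, assuming a local file named "outc" with one letter per line, run in a
local-mode pyspark shell where `sc` is already defined):

    # "outc" contains the letters a..l, one per line, as in the report quoted below
    rdd = sc.textFile("outc", use_unicode=False)
    rdd.collect()                    # the 12 letters, as expected
    rdd.repartition(10).collect()    # pickled-looking byte strings, same symptom as reported below
    rdd.repartition(10).count()      # 2 instead of 12, same symptom as reported below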

2017-01-30 20:19 GMT+01:00 Blaž Šnuderl <snud...@gmail.com>:

> I am loading a simple text file using pyspark. Repartitioning it seems to
> produce garbage data.
>
> I got these results with Spark 2.1 prebuilt for Hadoop 2.7, using the
> pyspark shell.
>
> >>> sc.textFile("outc").collect()
> [u'a', u'b', u'c', u'd', u'e', u'f', u'g', u'h', u'i', u'j', u'k', u'l']
> >>> sc.textFile("outc", use_unicode=False).collect()
> ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l']
>
> Repartitioning seems to produce garbage, and also returns only 2 records here:
> >>> sc.textFile("outc", use_unicode=False).repartition(10).collect()
> ['\x80\x02]q\x01(U\x01aU\x01bU\x01cU\x01dU\x01eU\x01fU\x01ge.',
> '\x80\x02]q\x01(U\x01hU\x01iU\x01jU\x01kU\x01le.']
> >>> sc.textFile("outc", use_unicode=False).repartition(10).count()
> 2
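
Those byte strings look like whole batches of records coming back still
pickled rather than deserialized: the leading \x80\x02 is the pickle
protocol-2 header, and unpickling the first one by hand gives the first seven
letters back. A quick check in a plain Python shell (just an observation, not
a fix):

    >>> import pickle
    >>> pickle.loads(b'\x80\x02]q\x01(U\x01aU\x01bU\x01cU\x01dU\x01eU\x01fU\x01ge.')
    ['a', 'b', 'c', 'd', 'e', 'f', 'g']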
>
>
> Without setting use_unicode=False, we can't even repartition at all:
> >>> sc.textFile("outc").repartition(19).collect()
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "/home/snuderl/scrappstore/thirdparty/spark-2.1.0-bin-hadoop2.7/python/pyspark/rdd.py", line 810, in collect
>     return list(_load_from_socket(port, self._jrdd_deserializer))
>   File "/home/snuderl/scrappstore/thirdparty/spark-2.1.0-bin-hadoop2.7/python/pyspark/rdd.py", line 140, in _load_from_socket
>     for item in serializer.load_stream(rf):
>   File "/home/snuderl/scrappstore/thirdparty/spark-2.1.0-bin-hadoop2.7/python/pyspark/serializers.py", line 529, in load_stream
>     yield self.loads(stream)
>   File "/home/snuderl/scrappstore/thirdparty/spark-2.1.0-bin-hadoop2.7/python/pyspark/serializers.py", line 524, in loads
>     return s.decode("utf-8") if self.use_unicode else s
>   File "/home/snuderl/scrappstore/virtualenv/lib/python2.7/encodings/utf_8.py", line 16, in decode
>     return codecs.utf_8_decode(input, errors, True)
> UnicodeDecodeError: 'utf8' codec can't decode byte 0x80 in position 0: invalid start byte
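
The UnicodeDecodeError fits the same picture: with use_unicode=True the
deserializer apparently tries to utf-8 decode what look like the raw pickled
batches, and 0x80 (the pickle protocol-2 header byte) is not a valid UTF-8
start byte. A minimal illustration of just the failing decode step, in a
plain Python 3 shell (Python 2 prints the codec name as 'utf8'):

    >>> b'\x80\x02]q\x01(U\x01aU\x01bU\x01cU\x01dU\x01eU\x01fU\x01ge.'.decode('utf-8')
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 0: invalid start byte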
>
>
>
> Input file contents:
> a
> b
> c
> d
> e
> f
> g
> h
> i
> j
> k
> l
>
>
>
> --
> View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Pyspark-2-1-0-weird-behavior-with-repartition-tp28350.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>


-- 
*Olivier Girardot* | Partner
o.girar...@lateral-thoughts.com
+33 6 24 09 17 94
