That was indeed the case: using UTF8Deserializer makes everything work
correctly.
Thanks for the tips!
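For the archives, the wrapping that works ends up looking roughly like the
sketch below. The Scala entry point and package name are made up; the
relevant part is passing UTF8Deserializer, which is the same thing
sc.textFile does internally when it wraps a JVM RDD of strings.

    from pyspark import SparkContext
    from pyspark.rdd import RDD
    from pyspark.serializers import UTF8Deserializer

    sc = SparkContext(appName="utf8-wrap")

    # jrdd is a JavaRDD[String] handed back by the custom Scala code over py4j.
    # UTF8Deserializer tells PySpark that the bytes coming across are plain
    # UTF-8 strings rather than pickled Python objects.
    jrdd = sc._jvm.com.example.CustomContext.getCustomRdd(sc._jsc)  # hypothetical entry point
    strings = RDD(jrdd, sc, UTF8Deserializer())
    print(strings.take(5))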
On Thu, Jun 30, 2016 at 3:32 PM, Pedro Rodriguez wrote:
Quick update: I was able to get most of the plumbing working thanks to the
code Holden posted and some more browsing of the source code.
I am running into this error, which makes me think that maybe I shouldn't
leave the default Python RDD serializer/pickler in place and should do
something else instead: https://github.c
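To illustrate what the default path does (a sketch, using the same
hypothetical entry point as above): PySpark's RDD wrapper defaults to the
pickle serializer, which assumes the JVM partitions contain pickled Python
objects, so an RDD[String] built purely on the Scala side fails when
collected.

    from pyspark.rdd import RDD

    jrdd = sc._jvm.com.example.CustomContext.getCustomRdd(sc._jsc)  # hypothetical, as above
    broken = RDD(jrdd, sc)  # default AutoBatchedSerializer(PickleSerializer())
    # broken.take(1)        # fails: the JVM bytes are raw UTF-8, not pickled objects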
Thanks Jeff and Holden,
A little more context here probably helps. I am working on implementing the
idea from this article to make reads from S3 faster:
http://tech.kinja.com/how-not-to-pull-from-s3-using-apache-spark-1704509219
(although my name is Pedro, I am not the author of the article). The
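The gist of the article's approach, roughly sketched below with boto3 (the
article itself uses boto 2; the bucket and prefix names are placeholders):
list the keys yourself, then parallelize the key list and fetch the objects
inside tasks instead of pointing textFile at a wildcard.

    import boto3
    from pyspark import SparkContext

    sc = SparkContext(appName="s3-parallel-read")

    # Enumerate keys on the driver (pagination omitted for brevity).
    s3 = boto3.client("s3")
    keys = [obj["Key"] for obj in
            s3.list_objects_v2(Bucket="my-bucket", Prefix="logs/")["Contents"]]

    def read_key(key):
        # One client per key; a mapPartitions variant would reuse clients.
        body = boto3.client("s3").get_object(Bucket="my-bucket", Key=key)["Body"]
        return body.read().decode("utf-8").splitlines()

    lines = sc.parallelize(keys, len(keys)).flatMap(read_key)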
So I'm a little biased - I think the best bridge between the two is using
DataFrames. I've got some examples in my talk and on the high performance
spark GitHub
https://github.com/high-performance-spark/high-performance-spark-examples/blob/master/high_performance_pyspark/simple_perf_test.py
calls som
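The general shape of that bridge, sketched for Spark 1.6 (the Scala object
and method names here are placeholders; the real examples are in the repo
linked above):

    from pyspark import SparkContext
    from pyspark.sql import DataFrame, SQLContext

    sc = SparkContext(appName="df-bridge")
    sqlContext = SQLContext(sc)
    df = sqlContext.read.json("events.json")

    # Hand the JVM-side DataFrame to Scala and wrap whatever comes back.
    # The data stays in the JVM, so no per-row pickling is needed at the boundary.
    jdf = sc._jvm.com.example.CustomTransforms.enrich(df._jdf)  # hypothetical Scala entry point
    result = DataFrame(jdf, sqlContext)
    result.show()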
Hi Pedro,
Your use case is interesting. I think launching the Java gateway is the same
as for the native SparkContext; the only difference is that you create your
custom SparkContext instead of the native one. You might also need to wrap it
using Java.
https://github.com/apache/spark/blob/v1.6.2/python/pys
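Roughly, that looks like the sketch below (the custom context class is a
placeholder; launch_gateway and the gateway argument to SparkContext are the
same machinery the native context uses):

    from pyspark.java_gateway import launch_gateway
    from pyspark import SparkConf, SparkContext

    # Start the py4j gateway the same way the native SparkContext does,
    # then build everything else through it.
    gateway = launch_gateway()

    conf = SparkConf().setAppName("custom-context")
    sc = SparkContext(conf=conf, gateway=gateway)

    # Instantiate the custom Scala wrapper around the underlying Scala SparkContext.
    custom = gateway.jvm.com.example.CustomSparkContext(sc._jsc.sc())  # hypothetical class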
Hi All,
I have written a Scala package that essentially wraps the SparkContext in a
custom class which adds some functionality specific to our internal use case.
I am trying to figure out the best way to call this from PySpark. I would
like to do this similarly to how Spark itself calls the J
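The shape of what I'd like to end up with on the Python side is roughly the
following (class and method names are made up). Plain strings and numbers
come back over py4j directly; it's RDDs and DataFrames where things get
trickier.

    from pyspark import SparkContext

    sc = SparkContext(appName="wrapper-demo")
    # Reach the Scala wrapper through the running JVM and hand it the Scala SparkContext.
    wrapper = sc._jvm.com.example.CustomSparkContext(sc._jsc.sc())  # hypothetical class
    print(wrapper.version())  # a plain string comes back without any serializer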