According to the documentation, `spark.pyspark.python` configures which
Python executable is run on the workers. It seems to be ignored in my simple
test case. I'm running on a pip-installed PySpark 2.1.1, completely stock.
The only customization at this point is my Hadoop configuration directory.
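For concreteness, a sketch along these lines shows the property being set and a way to check what the executors actually run (the interpreter path and app name are placeholders, not my actual config):

```python
# Sketch: set the documented property, then ask the executors which interpreter
# they actually run. The path and app name are placeholders.
import sys
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("which-python")
    .config("spark.pyspark.python", "/opt/conda/bin/python")  # intended worker interpreter
    .getOrCreate()
)

# Each task reports its own sys.executable, so this reflects the workers, not the driver.
worker_pythons = (
    spark.sparkContext
    .parallelize(range(4), 4)
    .map(lambda _: sys.executable)
    .distinct()
    .collect()
)
print("driver :", sys.executable)
print("workers:", worker_pythons)
```

If the property takes effect, the worker list should show the configured path.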
I'd like to create a Dataset using some classes from Geotools to do some
geospatial analysis. In particular, I'm trying to use Spark to distribute
the work based on ID and label fields that I extract from the polygon data.
My simplified case class looks like this:
implicit val geometryEncoder: Enc
I'm trying to join a constant, large-ish RDD to each RDD in a DStream, and I'm
trying to keep the join as efficient as possible so each batch finishes
within the batch window. I'm using PySpark on 1.6.
I've tried the trick of keying the large RDD into (k, v) pairs and using
.partitionBy(100).persist(
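That pattern, sketched end to end, looks roughly like this (paths, ports, and the key-extraction logic are placeholders, not my actual job):

```python
# Sketch of the keyed-join pattern: pre-partition and persist the static RDD,
# then join it against each micro-batch inside transform().
from pyspark import SparkContext, StorageLevel
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="stream-join-sketch")
ssc = StreamingContext(sc, batchDuration=10)

# Large, static lookup data as (key, value) pairs, partitioned once up front.
big_rdd = (
    sc.textFile("hdfs:///path/to/reference/data")
      .map(lambda line: (line.split(",")[0], line))
      .partitionBy(100)
      .persist(StorageLevel.MEMORY_AND_DISK)
)

# Key each batch the same way, then join per batch.
lines = ssc.socketTextStream("localhost", 9999)
keyed = lines.map(lambda line: (line.split(",")[0], line))
joined = keyed.transform(lambda rdd: rdd.join(big_rdd))
joined.pprint()

ssc.start()
ssc.awaitTermination()
```

Whether that persisted partitioning is actually reused by the per-batch join, rather than being reshuffled every interval, is the efficiency question here.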
> [...]d this, could reply back
> with the trusted conversion that worked for you (for a clear solution)?
>
> TD
>
> On Mon, Oct 19, 2015 at 3:08 PM, Jason White wrote:
>
>> Ah, that makes sense then, thanks TD.
>>
>> The conversion from RDD -> DF involves a [...]
On [...] 2015 at 5:23:59 PM, Tathagata Das (t...@databricks.com) wrote:
RDD and DF are not compatible data types, so you cannot return a DF when you
have to return an RDD. What you can do instead is return the underlying RDD of
the DataFrame via dataframe.rdd().
On Fri, Oct 16, 2015 at 12:07 PM, Jason White wrote: [...]
Hi Ken, thanks for replying.
Unless I'm misunderstanding something, I don't believe that's correct.
DStream.transform() accepts a single argument, func. func should be a
function that accepts a single RDD and returns a single RDD. That's what
transform_to_df does, except the RDD it returns is actually a DataFrame.
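For concreteness, a transform function along the lines TD suggests would build the DataFrame per batch but hand its underlying RDD back to the DStream. The function name and DataFrame step below are illustrative, not the actual transform_to_df:

```python
# Sketch: do the DataFrame work inside transform(), but return df.rdd so the
# function still returns an RDD, as transform() requires.
from pyspark.sql import SQLContext

def to_row_rdd(rdd):
    if rdd.isEmpty():
        return rdd
    # getOrCreate is available from Spark 1.6; older releases need a hand-rolled singleton.
    sql_context = SQLContext.getOrCreate(rdd.context)
    df = sql_context.read.json(rdd)   # RDD of JSON strings -> DataFrame
    cleaned = df.dropna()             # illustrative DataFrame-side step
    return cleaned.rdd                # RDD of Rows, a valid return value for transform()

# rows = json_dstream.transform(to_row_rdd)
```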
I'm trying to create a DStream of DataFrames using PySpark. I receive data
from Kafka as JSON strings, and I'm parsing these RDDs of strings into
DataFrames.
My code is: [...]
I get the following error at pyspark/streaming/util.py, line 64: [...]
I've verified that the sqlContext is properly initialized.
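For context, the general pattern being described looks roughly like the sketch below. The topic, broker, and the singleton SQLContext helper are illustrative stand-ins written against the 1.x streaming APIs, not the code from this thread:

```python
# Sketch: Kafka JSON strings -> per-batch DataFrames.
from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

def get_sql_context(spark_context):
    # Lazily create a single SQLContext instead of capturing a driver-side
    # reference inside the streaming closures.
    if "sql_context_singleton" not in globals():
        globals()["sql_context_singleton"] = SQLContext(spark_context)
    return globals()["sql_context_singleton"]

def handle_batch(time, rdd):
    if rdd.isEmpty():
        return
    sql_context = get_sql_context(rdd.context)
    df = sql_context.read.json(rdd)   # parse the JSON strings
    df.show()

sc = SparkContext(appName="json-dstream-sketch")
ssc = StreamingContext(sc, 10)

kafka_stream = KafkaUtils.createDirectStream(
    ssc, ["events"], {"metadata.broker.list": "localhost:9092"}
)
json_strings = kafka_stream.map(lambda kv: kv[1])  # keep only the message value
json_strings.foreachRDD(handle_batch)

ssc.start()
ssc.awaitTermination()
```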
I'm having trouble loading a streaming job from a checkpoint when a
broadcast variable is defined. I've seen the solution by TD in Scala (
https://issues.apache.org/jira/browse/SPARK-5206) that uses a singleton to
get/create an accumulator, but I can't seem to get it to work in PySpark
with a broadcast variable.
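For reference, one way the singleton idea from SPARK-5206 could be translated to PySpark for a broadcast variable is sketched below. The names, checkpoint path, and lookup data are illustrative; this is the pattern in question, not a confirmed fix:

```python
# Sketch: lazily (re)create the broadcast after a checkpoint restart instead of
# relying on a driver-side reference captured before the failure.
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

def get_or_create_broadcast(spark_context):
    if "broadcast_singleton" not in globals():
        lookup = {"a": 1, "b": 2}  # assumption: the data can be rebuilt on the driver
        globals()["broadcast_singleton"] = spark_context.broadcast(lookup)
    return globals()["broadcast_singleton"]

def process(time, rdd):
    bc = get_or_create_broadcast(rdd.context)
    total = rdd.map(lambda k: bc.value.get(k, 0)).sum()
    print("batch", time, "total:", total)

def create_context():
    sc = SparkContext(appName="checkpointed-broadcast-sketch")
    ssc = StreamingContext(sc, 10)
    ssc.checkpoint("hdfs:///tmp/checkpoints")      # illustrative checkpoint dir
    stream = ssc.socketTextStream("localhost", 9999)
    stream.foreachRDD(process)
    return ssc

ssc = StreamingContext.getOrCreate("hdfs:///tmp/checkpoints", create_context)
ssc.start()
ssc.awaitTermination()
```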