According to the documentation, `spark.pyspark.python` configures which
Python executable is run on the workers. It seems to be ignored in my simple
test case. I'm running on a pip-installed PySpark 2.1.1, completely stock.
The only customization at this point is my Hadoop configuration directory.
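For concreteness, a sketch along these lines shows the property being set and a way to check what the executors actually run (the interpreter path and app name are placeholders, not my actual config):

```python
# Sketch: set the documented property, then ask the executors which interpreter
# they actually run. The path and app name are placeholders.
import sys
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("which-python")
    .config("spark.pyspark.python", "/opt/conda/bin/python")  # intended worker interpreter
    .getOrCreate()
)

# Each task reports its own sys.executable, so this reflects the workers, not the driver.
worker_pythons = (
    spark.sparkContext
    .parallelize(range(4), 4)
    .map(lambda _: sys.executable)
    .distinct()
    .collect()
)
print("driver :", sys.executable)
print("workers:", worker_pythons)
```

If the property takes effect, the worker list should show the configured path.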
I'd like to create a Dataset using some classes from Geotools to do some
geospatial analysis. In particular, I'm trying to use Spark to distribute
the work based on ID and label fields that I extract from the polygon data.
My simplified case class looks like this:
implicit val geometryEncoder: Enc
I'm trying to join a constant, large-ish RDD to each RDD in a DStream, and I'm
trying to keep the join as efficient as possible so each batch finishes
within the batch window. I'm using PySpark on 1.6.
I've tried the trick of keying the large RDD into (k, v) pairs and using
.partitionBy(100).persist(
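That pattern, sketched end to end, looks roughly like this (paths, ports, and the key-extraction logic are placeholders, not my actual job):

```python
# Sketch of the keyed-join pattern: pre-partition and persist the static RDD,
# then join it against each micro-batch inside transform().
from pyspark import SparkContext, StorageLevel
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="stream-join-sketch")
ssc = StreamingContext(sc, batchDuration=10)

# Large, static lookup data as (key, value) pairs, partitioned once up front.
big_rdd = (
    sc.textFile("hdfs:///path/to/reference/data")
      .map(lambda line: (line.split(",")[0], line))
      .partitionBy(100)
      .persist(StorageLevel.MEMORY_AND_DISK)
)

# Key each batch the same way, then join per batch.
lines = ssc.socketTextStream("localhost", 9999)
keyed = lines.map(lambda line: (line.split(",")[0], line))
joined = keyed.transform(lambda rdd: rdd.join(big_rdd))
joined.pprint()

ssc.start()
ssc.awaitTermination()
```

Whether that persisted partitioning is actually reused by the per-batch join, rather than being reshuffled every interval, is the efficiency question here.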
> [...]d this, could reply back
> with the trusted conversion that worked for you (for a clear solution)?
>
> TD
>
> On Mon, Oct 19, 2015 at 3:08 PM, Jason White wrote:
>
>> Ah, that makes sense then, thanks TD.
>>
>> The conversion from RDD -> DF involves a [...]
On [...] 2015 at 5:23:59 PM, Tathagata Das (t...@databricks.com) wrote:
RDD and DF are not compatible data types, so you cannot return a DF when you
have to return an RDD. What you can do instead is return the underlying RDD of
the DataFrame via dataframe.rdd().
On Fri, Oct 16, 2015 at 12:07 PM, Jason White wrote: [...]
Hi Ken, thanks for replying.
Unless I'm misunderstanding something, I don't believe that's correct.
DStream.transform() accepts a single argument, func. func should be a
function that accepts a single RDD and returns a single RDD. That's what
transform_to_df does, except the RDD it returns is actually a DataFrame.
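For concreteness, a transform function along the lines TD suggests would build the DataFrame per batch but hand its underlying RDD back to the DStream. The function name and DataFrame step below are illustrative, not the actual transform_to_df:

```python
# Sketch: do the DataFrame work inside transform(), but return df.rdd so the
# function still returns an RDD, as transform() requires.
from pyspark.sql import SQLContext

def to_row_rdd(rdd):
    if rdd.isEmpty():
        return rdd
    # getOrCreate is available from Spark 1.6; older releases need a hand-rolled singleton.
    sql_context = SQLContext.getOrCreate(rdd.context)
    df = sql_context.read.json(rdd)   # RDD of JSON strings -> DataFrame
    cleaned = df.dropna()             # illustrative DataFrame-side step
    return cleaned.rdd                # RDD of Rows, a valid return value for transform()

# rows = json_dstream.transform(to_row_rdd)
```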
I'm trying to create a DStream of DataFrames using PySpark. I receive data
from Kafka as JSON strings, and I'm parsing these RDDs of strings into
DataFrames.
My code is: [...]
I get the following error at pyspark/streaming/util.py, line 64: [...]
I've verified that the sqlContext is properly initialized.
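For context, the general pattern being described looks roughly like the sketch below. The topic, broker, and the singleton SQLContext helper are illustrative stand-ins written against the 1.x streaming APIs, not the code from this thread:

```python
# Sketch: Kafka JSON strings -> per-batch DataFrames.
from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

def get_sql_context(spark_context):
    # Lazily create a single SQLContext instead of capturing a driver-side
    # reference inside the streaming closures.
    if "sql_context_singleton" not in globals():
        globals()["sql_context_singleton"] = SQLContext(spark_context)
    return globals()["sql_context_singleton"]

def handle_batch(time, rdd):
    if rdd.isEmpty():
        return
    sql_context = get_sql_context(rdd.context)
    df = sql_context.read.json(rdd)   # parse the JSON strings
    df.show()

sc = SparkContext(appName="json-dstream-sketch")
ssc = StreamingContext(sc, 10)

kafka_stream = KafkaUtils.createDirectStream(
    ssc, ["events"], {"metadata.broker.list": "localhost:9092"}
)
json_strings = kafka_stream.map(lambda kv: kv[1])  # keep only the message value
json_strings.foreachRDD(handle_batch)

ssc.start()
ssc.awaitTermination()
```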
I'm having trouble loading a streaming job from a checkpoint when a
broadcast variable is defined. I've seen the solution by TD in Scala (
https://issues.apache.org/jira/browse/SPARK-5206) that uses a singleton to
get/create an accumulator, but I can't seem to get it to work in PySpark
with a broadcast variable.
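For reference, one way the singleton idea from SPARK-5206 could be translated to PySpark for a broadcast variable is sketched below. The names, checkpoint path, and lookup data are illustrative; this is the pattern in question, not a confirmed fix:

```python
# Sketch: lazily (re)create the broadcast after a checkpoint restart instead of
# relying on a driver-side reference captured before the failure.
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

def get_or_create_broadcast(spark_context):
    if "broadcast_singleton" not in globals():
        lookup = {"a": 1, "b": 2}  # assumption: the data can be rebuilt on the driver
        globals()["broadcast_singleton"] = spark_context.broadcast(lookup)
    return globals()["broadcast_singleton"]

def process(time, rdd):
    bc = get_or_create_broadcast(rdd.context)
    total = rdd.map(lambda k: bc.value.get(k, 0)).sum()
    print("batch", time, "total:", total)

def create_context():
    sc = SparkContext(appName="checkpointed-broadcast-sketch")
    ssc = StreamingContext(sc, 10)
    ssc.checkpoint("hdfs:///tmp/checkpoints")      # illustrative checkpoint dir
    stream = ssc.socketTextStream("localhost", 9999)
    stream.foreachRDD(process)
    return ssc

ssc = StreamingContext.getOrCreate("hdfs:///tmp/checkpoints", create_context)
ssc.start()
ssc.awaitTermination()
```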