Quick correction to the code snippet I sent in my previous email:
Line: val enrichedDF = inputDF.withColumn("semantic", udf(col("url")))
Should be replaced by: val enrichedDF = inputDF.withColumn("semantic", enrichUDF(col("url")))
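For anyone following the thread, a minimal sketch of what the corrected line assumes; the enrich function here is a hypothetical stand-in for the real enrichment logic, and inputDF comes from the earlier snippet:

    import org.apache.spark.sql.functions.{col, udf}

    // Hypothetical enrichment function; the real one lives elsewhere in the app.
    def enrich(url: String): String = url  // placeholder body

    // Wrap the function as a UDF once, then apply it to the column;
    // calling udf(...) directly on a column, as in the first line, is wrong.
    val enrichUDF = udf((url: String) => enrich(url))
    val enrichedDF = inputDF.withColumn("semantic", enrichUDF(col("url")))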
On Thu, Jan 21, 2016 at 11:07 AM, Jean-Pierre OCALAN wrote:
Hi Cody,
First of all, thanks a lot for your quick reply, although I removed this
post a couple of hours after posting it because I ended up finding it was
due to the way I was using DataFrame UDFs.
Essentially I didn't know that UDFs were purely lazy, and in the case of
the example below the UDF ge
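A minimal sketch of the laziness point, under my assumption about where the truncated sentence was headed: a UDF attached via withColumn executes nothing until an action evaluates the column.

    import org.apache.spark.sql.functions.{col, udf}

    // The side effect fires only when an action evaluates the column,
    // not when withColumn is declared.
    val noisyUDF = udf((url: String) => { println(s"evaluating $url"); url.length })
    val withLen = inputDF.withColumn("len", noisyUDF(col("url")))  // nothing printed yet
    withLen.count()  // UDF actually runs here (output lands in executor logs)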
If you can share an isolated example I'll take a look. Not something I've
run into before.
On Wed, Jan 20, 2016 at 3:53 PM, jpocalan wrote:
> Hi,
>
> I have an application which creates a Kafka Direct Stream from 1 topic
> having 5 partitions.
> As a result each batch is composed of an RDD having 5 partitions.
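For context, a minimal sketch of the setup being described (Spark 1.x spark-streaming-kafka API for Kafka 0.8; the broker address and topic name are placeholders). The direct stream gives a 1:1 mapping from Kafka partitions to RDD partitions, hence the 5-partition batches:

    import kafka.serializer.StringDecoder
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils

    val ssc = new StreamingContext(sc, Seconds(10))
    val kafkaParams = Map("metadata.broker.list" -> "broker:9092")  // placeholder broker

    // One RDD partition per Kafka partition: a 5-partition topic
    // yields 5-partition batch RDDs.
    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, Set("my-topic"))  // placeholder topic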
Turns out the data was in Python format. The ETL pipeline was overwriting
the original data.
Andy
From: Andrew Davidson
Date: Thursday, November 19, 2015 at 6:58 PM
To: "user @spark"
Subject: spark streaming problem saveAsTextFiles() does not write valid
JSON to HDFS
> I am working on a simple POC. I a
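The follow-up above means the records being saved were Python repr strings rather than JSON. A minimal sketch of the fix on the Scala side (json4s is my assumption for the serializer, and Event/eventStream are hypothetical names): serialize each record to a JSON line before calling saveAsTextFiles.

    import org.json4s.DefaultFormats
    import org.json4s.jackson.Serialization.write

    case class Event(id: String, ts: Long)  // hypothetical record type
    implicit val formats = DefaultFormats

    // Serialize each record to a JSON line so the HDFS output parses as JSON.
    eventStream.map(e => write(e)).saveAsTextFiles("hdfs:///tmp/events")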
The problem lies in the way you are doing the processing.
After the g.foreach(x => {println(x); println("")}), are you
doing ssc.start()? It means that until now all you did is set up the
computation steps, but Spark has not started any real processing. So when
you do g.foreach, what it iterat
You must start the StreamingContext by calling ssc.start()
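A minimal sketch of the shape a streaming job needs (the source and names are placeholders): all DStream transformations are declared first, and nothing executes until ssc.start().

    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val ssc = new StreamingContext(sc, Seconds(5))
    val lines = ssc.socketTextStream("localhost", 9999)  // placeholder source

    // Only declares the computation; nothing runs yet.
    lines.foreachRDD(rdd => rdd.foreach(println))

    ssc.start()             // processing actually begins here
    ssc.awaitTermination()  // block until stopped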
On Thu, May 28, 2015 at 6:57 PM, Animesh Baranawal <animeshbarana...@gmail.com> wrote:
> Hi,
>
> I am trying to extract the filenames from which a DStream is generated by
> parsing the output of the toDebugString method on its RDDs.
> I am implementing the
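A sketch of the approach being described (fragile by design, since toDebugString output is not a stable API; fileStream and the hdfs:// filter are my assumptions about the source and path scheme):

    // For a file-based DStream, each batch RDD's debug string mentions the
    // underlying HadoopRDD input paths; grep them out per batch.
    fileStream.foreachRDD { rdd =>
      val fileNames = rdd.toDebugString
        .split("\n")
        .filter(_.contains("hdfs://"))  // assumed path scheme
        .map(_.trim)
      fileNames.foreach(println)
    }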