Resiliency with SparkStreaming - fileStream

2016-10-26 Thread Scott W
Hello, I'm planning to use the fileStream Spark Streaming API to stream data from HDFS. My Spark job would essentially process these files and post the results to an external endpoint. *How does the fileStream API handle checkpointing of the files it has processed?* In other words, if my Spark job failed whi

Re: Spark Dataframe validating column names

2016-07-05 Thread Scott W
s://medium.com/@jaceklaskowski/
> Mastering Apache Spark http://bit.ly/mastering-apache-spark
> Follow me at https://twitter.com/jaceklaskowski
>
>
> On Tue, Jul 5, 2016 at 7:02 AM, Scott W wrote:
> > Hello,
> >
> > I'm processing events using Dataframes conver

Spark Dataframe validating column names

2016-07-04 Thread Scott W
Hello, I'm processing events using DataFrames converted from a stream of JSON events (Spark Streaming), which eventually get written out in Parquet format. There are different JSON events coming in, so we use the schema inference feature of Spark SQL. The problem is that some of the JSON events contain

Spark Streaming - Kafka - java.nio.BufferUnderflowException

2016-05-25 Thread Scott W
I'm running into the error below while trying to consume messages from Kafka through Spark Streaming (Kafka direct API). This used to work OK with the Spark standalone cluster manager. We're just switching to Cloudera 5.7, with YARN managing the Spark cluster, and have started to see the error below. Fe