Does a data pipeline using Kafka and Structured Streaming work?

2016-11-01 Thread shyla deshpande
I am building a data pipeline using Kafka, Spark Streaming and Cassandra. Wondering if the issues with the Kafka source are fixed in Spark 2.0.1. If not, please give me an update on when they may be fixed. Thanks -Shyla

Re: Does a data pipeline using Kafka and Structured Streaming work?

2016-11-01 Thread shyla deshpande
I'm not aware of any open issues against the kafka source for structured > streaming. > > On Tue, Nov 1, 2016 at 4:45 PM, shyla deshpande > wrote: > >> I am building a data pipeline using Kafka, Spark streaming and Cassandra. >> Wondering if the issues with Kafka

Re: Does a data pipeline using Kafka and Structured Streaming work?

2016-11-01 Thread shyla deshpande
>> Look at the resolved subtasks attached to that ticket you linked. > >> Some of them are unresolved, but basic functionality is there. > >> > >> On Tue, Nov 1, 2016 at 7:37 PM, shyla deshpande > >> wrote: > >> > Hi Michael, > >> >

Error creating SparkSession, in IntelliJ

2016-11-03 Thread shyla deshpande
Hello Everyone, I just installed Spark 2.0.1; the spark shell works fine, and I was able to run some simple programs from it. But I find it hard to make the same program work in IntelliJ. I am getting the following error. Exception in thread "main" java.lang.NoSuchMethodError: scala.Pr

Re: Error creating SparkSession, in IntelliJ

2016-11-04 Thread shyla deshpande
ation for setting up IDE - https://cwiki.apache.org/ > confluence/display/SPARK/Useful+Developer+Tools# > UsefulDeveloperTools-IDESetup > > I hope this is helpful. > > > 2016-11-04 9:10 GMT+09:00 shyla deshpande : > >> Hello Everyone, >> >> I just installed Spa

java.lang.ClassNotFoundException: org.apache.spark.sql.SparkSession$. Please help!

2016-11-04 Thread shyla deshpande
object App { import org.apache.spark.sql.functions._ import org.apache.spark.sql.SparkSession def main(args : Array[String]) { println( "Hello World!" ) val sparkSession = SparkSession.builder. master("local") .appName("spark session example") .getOrCreate() } }
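A likely cause, echoed in the reply below: the Spark jars are not on the runtime classpath (for example, marked "provided") when launching from the IDE instead of spark-submit. A minimal build.sbt sketch, with assumed versions; match them to your installed Spark and Scala:

    // build.sbt: a minimal sketch; the versions here are assumptions.
    scalaVersion := "2.11.8"

    libraryDependencies ++= Seq(
      "org.apache.spark" %% "spark-core" % "2.0.1",  // keep off "provided" for IDE runs
      "org.apache.spark" %% "spark-sql"  % "2.0.1"
    )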

Re: java.lang.ClassNotFoundException: org.apache.spark.sql.SparkSession$. Please help!

2016-11-04 Thread shyla deshpande
would only be provided you run your application with spark-submit or > otherwise have Spark's JARs on your class path. How are you launching your > application? > > On Fri, Nov 4, 2016 at 2:00 PM, shyla deshpande > wrote: > >> object App { >> >> &

Structured Streaming with Kafka source, does it work?

2016-11-06 Thread shyla deshpande
I am trying to do Structured Streaming with Kafka Source. Please let me know where I can find some sample code for this. Thanks

Re: Structured Streaming with Kafka source, does it work?

2016-11-06 Thread shyla deshpande
Hi Jaya! Thanks for the reply. Structured Streaming works fine for me with a socket text stream. I think Structured Streaming with a Kafka source is not yet supported. If anyone has got it working with a Kafka source, please provide some sample code or direction. Thanks On Sun, Nov 6, 2016 a

Structured Streaming with Cassandra, is it supported?

2016-11-07 Thread shyla deshpande
Hi, I am trying to do Structured Streaming with the wonderful SparkSession, but cannot save the streaming data to Cassandra. If anyone has got this working, please help. Thanks

Re: Structured Streaming with Cassandra, is it supported?

2016-11-07 Thread shyla deshpande
I am using spark-cassandra-connector_2.11. On Mon, Nov 7, 2016 at 3:33 PM, shyla deshpande wrote: > Hi , > > I am trying to do structured streaming with the wonderful SparkSession, > but cannot save the streaming data to Cassandra. > > If anyone has got this working, please help > > Thanks > >

Akka Stream as the source for Spark Streaming. Please advise...

2016-11-09 Thread shyla deshpande
I am using Spark 2.0.1. I wanted to build a data pipeline using Kafka, Spark Streaming and Cassandra with Structured Streaming, but the Kafka source support for Structured Streaming is not yet available. So now I am trying to use Akka Stream as the source for Spark Streaming. Want to make sure I a

Anyone using ProtoBuf for Kafka messages with Spark Streaming for processing?

2016-11-10 Thread shyla deshpande
I am using ProtoBuf for Kafka messages with Spark Streaming because ProtoBuf is already being used in the system. Some sample code and reading material for using ProtoBuf for Kafka messages with Spark Streaming would be helpful. Thanks
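One common shape for the consuming side, sketched on the assumption that Person is a protoc/ScalaPB-generated class (so it provides parseFrom) and that the Kafka value deserializer hands over raw bytes:

    // Sketch: kafkaStream assumed to be a DStream[ConsumerRecord[String, Array[Byte]]]
    // from the kafka-0-10 integration, with ByteArrayDeserializer on the value side.
    val people = kafkaStream.map { record =>
      Person.parseFrom(record.value())   // decode the protobuf payload
    }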

Re: Akka Stream as the source for Spark Streaming. Please advise...

2016-11-12 Thread shyla deshpande
.com/jaceklaskowski > > > On Sat, Nov 12, 2016 at 4:07 PM, Luciano Resende > wrote: > > If you are interested in Akka streaming, it is being maintained in Apache > Bahir. For Akka there isn't a structured streaming version yet, but we > would > be interested in coll

Re: Akka Stream as the source for Spark Streaming. Please advise...

2016-11-12 Thread shyla deshpande
Is it OK to use ProtoBuf for sending messages to Kafka? I do not see anyone using it. Please direct me to some code samples of how to use it in Spark Structured Streaming. Thanks again. On Sat, Nov 12, 2016 at 11:44 PM, shyla deshpande wrote: > Thanks everyone. Very good discuss

Spark 2.0.2 with Kafka source, error. Please help!

2016-11-16 Thread shyla deshpande
I am using protobuf to encode. This may not be related to the new release issue. Exception in thread "main" scala.ScalaReflectionException: is not a term at scala.reflect.api.Symbols$SymbolApi$class.asTerm(Symbols.scala:199) at scala.reflect.internal.Symbols$SymbolContextApiImpl.asTerm(Symbols

Re: Spark 2.0.2 with Kafka source, error. Please help!

2016-11-16 Thread shyla deshpande
.org/overviews/reflection/thread- > safety.html) AFAIK, the only way to fix it is upgrading to Scala 2.11. > > On Wed, Nov 16, 2016 at 11:16 AM, shyla deshpande < > deshpandesh...@gmail.com> wrote: > >> I am using protobuf to encode. This may not be related to the new relea

Re: Spark 2.0.2 with Kafka source, error. Please help!

2016-11-16 Thread shyla deshpande
ME_FIELD_NUMBER = 1 final val AGE_FIELD_NUMBER = 2 final val GENDER_FIELD_NUMBER = 3 final val TAGS_FIELD_NUMBER = 4 final val ADDRESSES_FIELD_NUMBER = 5 } On Wed, Nov 16, 2016 at 1:28 PM, Shixiong(Ryan) Zhu wrote: > Could you provide the Person class? > > On Wed, Nov 16, 201

Re: Spark 2.0.2 with Kafka source, error. Please help!

2016-11-16 Thread shyla deshpande
g city = 2; } message Person { optional string name = 1; optional int32 age = 2; optional Gender gender = 3; repeated string tags = 4; repeated Address addresses = 5; } On Wed, Nov 16, 2016 at 3:04 PM, shyla deshpande wrote: > *Thanks for the response. Following is the

Spark 2.0.2, Structured Streaming with kafka source... Unable to parse the value to Object..

2016-11-17 Thread shyla deshpande
val spark = SparkSession.builder
  .master("local")
  .appName("spark session example")
  .getOrCreate()

import spark.implicits._

val dframe1 = spark.readStream.format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "student")
  .load()

How do I deserialize the
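The value column arrives as binary, so the usual first step is to cast it to a string and map it onto a case class. A minimal sketch; the Student shape and the comma-separated payload are assumptions, not from the thread:

    // Define the case class at top level so Spark can derive an Encoder for it.
    case class Student(name: String, age: Int)

    val students = dframe1
      .selectExpr("CAST(value AS STRING)")   // Kafka delivers key/value as bytes
      .as[String]                            // needs spark.implicits._ in scope
      .map { line =>
        val f = line.split(",")              // assumes a "name,age" text payload
        Student(f(0), f(1).toInt)
      }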

Re: Spark 2.0.2, Structured Streaming with kafka source... Unable to parse the value to Object..

2016-11-17 Thread shyla deshpande
t;) .format("console") .start() query.awaitTermination() } On Thu, Nov 17, 2016 at 11:30 AM, shyla deshpande wrote: > val spark = SparkSession.builder. > master("local") > .appName("spark session example") > .getOrCreate() > > im

Re: Spark 2.0.2 with Kafka source, error. Please help!

2016-11-17 Thread shyla deshpande
You can define your class which is supported by SQL Encoder, and > convert this generated class to the new class in `parseLine`. > > On Wed, Nov 16, 2016 at 4:22 PM, shyla deshpande > wrote: > >> Ryan, >> >> I just wanted to provide more info. Here is my .proto

How do I access the nested field in a dataframe, spark Streaming app... Please help.

2016-11-20 Thread shyla deshpande
The following is my dataframe schema:

root
 |-- name: string (nullable = true)
 |-- addresses: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- street: string (nullable = true)
 |    |    |-- city: string (nullable = true)

I want to output
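One way with that schema (a sketch, assuming the DataFrame is df and spark.implicits._ is in scope) is to explode the addresses array and then reach into the struct with dot notation:

    import org.apache.spark.sql.functions.explode

    val flat = df
      .select($"name", explode($"addresses").as("addr"))   // one row per address
      .select($"name", $"addr.street", $"addr.city")       // dot into the struct fields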

Re: How do I access the nested field in a dataframe, spark Streaming app... Please help.

2016-11-20 Thread shyla deshpande
ex.html#04%20SQL,%20DataFrames%20%26%20Datasets/ > 02%20Introduction%20to%20DataFrames%20-%20scala.html > > > On Sun, Nov 20, 2016 at 1:24 PM, pandees waran wrote: > >> have you tried using "." access method? >> >> e.g: >> ds1.select("name",

How do I persist the data after I process the data with Structured streaming...

2016-11-22 Thread shyla deshpande
Hi, Structured Streaming works great with a Kafka source, but I need to persist the data after processing in a database like Cassandra or at least Postgres. Any suggestions or help would be appreciated. Thanks

Re: How do I persist the data after I process the data with Structured streaming...

2016-11-22 Thread shyla deshpande
amming-guide.html#using-foreach > > On Tue, Nov 22, 2016 at 2:40 PM, Michael Armbrust > wrote: > >> We are looking to add a native JDBC sink in Spark 2.2. Until then you >> can write your own connector using df.writeStream.foreach. >> >> On Tue, Nov 22, 2016 at
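Until that sink lands, the df.writeStream.foreach route Michael mentions looks roughly like this. A sketch in which saveRow stands in for whatever Cassandra or JDBC write is used; it is hypothetical:

    import org.apache.spark.sql.{ForeachWriter, Row}

    val writer = new ForeachWriter[Row] {
      def open(partitionId: Long, version: Long): Boolean = true  // open the DB connection here
      def process(row: Row): Unit = saveRow(row)                  // saveRow is a hypothetical helper
      def close(errorOrNull: Throwable): Unit = ()                // release the connection here
    }

    df.writeStream.foreach(writer).start()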

Do I have to wrap akka around spark streaming app?

2016-11-28 Thread shyla deshpande
My data pipeline is Kafka --> Spark Streaming --> Cassandra. Can someone please explain when I would need to wrap Akka around the Spark Streaming app? My knowledge of Akka and the actor system is poor. Please help! Thanks

Re: Do I have to wrap akka around spark streaming app?

2016-11-28 Thread shyla deshpande
Anyone with experience running Spark Streaming in production, I would appreciate your input. Thanks -shyla On Mon, Nov 28, 2016 at 12:11 AM, shyla deshpande wrote: > My data pipeline is Kafka --> Spark Streaming --> Cassandra. > > Can someone please explain me when would I need to wrap

Re: Do I have to wrap akka around spark streaming app?

2016-11-28 Thread shyla deshpande
ur use case a bit? > > In general, there is no single correct answer to your current question as > it's quite broad. > > Daniel > > On Mon, Nov 28, 2016 at 9:11 AM, shyla deshpande > wrote: > >> My data pipeline is Kafka --> Spark Streaming --> Cassandra. >

Re: Do I have to wrap akka around spark streaming app?

2016-11-28 Thread shyla deshpande
job > > 2016-11-28 21:46 GMT+01:00 shyla deshpande : > >> Thanks Daniel for the response. >> >> I am planning to use Spark streaming to do Event Processing. I will have >> akka actors sending messages to kafka. I process them using Spark streaming >> and

Re: Do I have to wrap akka around spark streaming app?

2016-11-28 Thread shyla deshpande
s and then notify the user ? > > 2016-11-28 23:15 GMT+01:00 shyla deshpande : > >> Thanks Vincent for the input. Not sure I understand your suggestion. >> Please clarify. >> >> Few words about my use case : >> When the user watches a video, I get the position

Re: Do I have to wrap akka around spark streaming app?

2016-11-28 Thread shyla deshpande
as complete and that triggers a lot of other events. I need a way to notify the app about the creation of the completion event. Appreciate any suggestions. Thanks On Mon, Nov 28, 2016 at 2:35 PM, shyla deshpande wrote: > In this case, persisting to Cassandra is for future analytics and > Visu

Re: Do I have to wrap akka around spark streaming app?

2016-11-29 Thread shyla deshpande
partitions > http://allegro.tech/2015/08/spark-kafka-integration.html > > > 2016-11-29 2:18 GMT+01:00 shyla deshpande : > >> Hello All, >> >> I just want to make sure this is a right use case for Kafka --> Spark >> Streaming >> >> Few words about

updateStateByKey -- when the key is multi-column (like a composite key)

2016-11-30 Thread shyla deshpande
updateStateByKey - Can this be used when the key is multi-column (like a composite key) and the value is not numeric? All the examples I have come across use a simple String key and a numeric value. Appreciate any help. Thanks
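updateStateByKey is generic in both the key and the state type, so a tuple composite key with a non-numeric value works directly. A minimal sketch, assuming a DStream[String] named lines carrying "userId,courseId,lesson" records:

    val keyed = lines.map { line =>
      val f = line.split(",")
      ((f(0), f(1)), f(2))                       // composite (userId, courseId) key
    }

    val state = keyed.updateStateByKey[List[String]] {
      (newLessons: Seq[String], old: Option[List[String]]) =>
        Some(old.getOrElse(Nil) ++ newLessons)   // accumulate lessons per composite key
    }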

Re: updateStateByKey -- when the key is multi-column (like a composite key)

2016-11-30 Thread shyla deshpande
fields. Another approach would be to create a hash (like create a > json version of the hash and return that.) > > On Wed, Nov 30, 2016 at 12:30 PM, shyla deshpande < > deshpandesh...@gmail.com> wrote: > >> updateStateByKey - Can this be used when the key is multi-column (

Spark 2.0.2, using DStreams in Spark Streaming. How do I create SQLContext? Please help

2016-11-30 Thread shyla deshpande
I am on Spark 2.0.2, using DStreams because I need a Cassandra sink. How do I create SQLContext? I get the error that SQLContext is deprecated. [image: Inline image 1] Thanks

Re: Spark 2.0.2, using DStreams in Spark Streaming. How do I create SQLContext? Please help

2016-12-01 Thread shyla deshpande
pak > > On Thu, Dec 1, 2016 at 12:27 PM, shyla deshpande > wrote: > >> I am Spark 2.0.2 , using DStreams because I need Cassandra Sink. >> >> How do I create SQLContext? I get the error SQLContext deprecated. >> >> >> *[image: Inline image 1]*
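A sketch of the usual Spark 2.x replacement: obtain a SparkSession per batch inside the output operation, instead of the deprecated SQLContext. Assumes the RDD holds case-class records so toDF can derive a schema:

    import org.apache.spark.sql.SparkSession

    stream.foreachRDD { rdd =>
      val spark = SparkSession.builder.config(rdd.sparkContext.getConf).getOrCreate()
      import spark.implicits._
      val df = rdd.toDF()   // rdd assumed to contain case-class records
      // ... DataFrame/SQL work here, then write out (e.g. to Cassandra)
    }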

Writing data from Spark streaming to AWS Redshift?

2016-12-09 Thread shyla deshpande
Hello all, Is it possible to write data from Spark Streaming to AWS Redshift? I came across the following article, so it looks like it works from a Spark batch program. https://databricks.com/blog/2015/10/19/introducing-redshift-data-source-for-spark.html I want to write to AWS Redshift from Spark

Livy VS Spark Job Server

2016-12-12 Thread shyla deshpande
It will be helpful if someone can compare Livy and Spark Job Server. Thanks

How many Spark streaming applications can be run at a time on a Spark cluster?

2016-12-14 Thread shyla deshpande
How many Spark streaming applications can be run at a time on a Spark cluster? Is it better to have one Spark Streaming application consume all the Kafka topics, or multiple streaming applications where possible to keep it simple? Thanks

Re: How many Spark streaming applications can be run at a time on a Spark cluster?

2016-12-24 Thread shyla deshpande
;> >> >> Thanks, >> Divya >> >> On 15 December 2016 at 08:42, shyla deshpande >> wrote: >> >>> How many Spark streaming applications can be run at a time on a Spark >>> cluster? >>> >>> Is it better to have 1 spar

How do I read data in dockerized kafka from a spark streaming application

2017-01-06 Thread shyla deshpande
My Kafka is in a Docker container. How do I read this Kafka data in my Spark Streaming app? Also, I need to write data from Spark Streaming to a Cassandra database which is also in a Docker container. I appreciate any help. Thanks.

Docker image for Spark streaming app

2017-01-08 Thread shyla deshpande
I am looking for a Docker image on Docker Hub for running a Spark Streaming app with Scala and Spark 2.0+. I am new to Docker and unable to find an image on Docker Hub that suits my needs. Please let me know if anyone is using Docker for a Spark Streaming app and share your exper

Re: Docker image for Spark streaming app

2017-01-08 Thread shyla deshpande
started. Thanks On Sun, Jan 8, 2017 at 1:52 PM, shyla deshpande wrote: > Thanks really appreciate. > > On Sun, Jan 8, 2017 at 1:02 PM, vvshvv wrote: > >> Hi, >> >> I am running spark streaming job using spark jobserver via this image: >> >> https

Spark streaming app that processes Kafka DStreams produces no output and no error

2017-01-13 Thread shyla deshpande
Hello, My Spark Streaming app that reads Kafka topics and prints the DStream works fine on my laptop, but on the AWS cluster it produces no output and no errors. Please help me debug. I am using Spark 2.0.2 and kafka-0-10. Thanks The following is the output of the Spark Streaming app... 17/01/14

Re: Spark streaming app that processes Kafka DStreams produces no output and no error

2017-01-14 Thread shyla deshpande
g job with just 2 cores? Appreciate your time and help. Thanks On Fri, Jan 13, 2017 at 10:46 PM, shyla deshpande wrote: > Hello, > > My spark streaming app that reads kafka topics and prints the DStream > works fine on my laptop, but on AWS cluster it produces no output and no >

Re: Spark streaming app that processes Kafka DStreams produces no output and no error

2017-01-16 Thread shyla deshpande
4:01 PM, shyla deshpande wrote: > Hello, > > I want to add that, > I don't even see the streaming tab in the application UI on port 4040 when > I run it on the cluster. > The cluster on EC2 has 1 master node and 1 worker node. > The cores used on the worker node is 2

Re: Spark streaming app that processes Kafka DStreams produces no output and no error

2017-01-19 Thread shyla deshpande
There was an issue connecting to Kafka; once that was fixed, the Spark app works. Hope this helps someone. Thanks On Mon, Jan 16, 2017 at 7:58 AM, shyla deshpande wrote: > Hello, > I checked the log file on the worker node and don't see any error there. > This is the first time I a

Re: Using mapWithState without a checkpoint

2017-01-23 Thread shyla deshpande
Hello spark users, I do have the same question as Daniel. I would like to save the state in Cassandra and on failure recover using the initialState. If someone has already tried this, please share your experience and sample code. Thanks. On Thu, Nov 17, 2016 at 9:45 AM, Daniel Haviv < daniel.h

How to make the state in a streaming application idempotent?

2017-01-23 Thread shyla deshpande
In a word count application that stores the count of all the words input so far using mapWithState, how do I make sure my counts are not messed up if I happen to read a line more than once? Appreciate your response. Thanks

Re: How to make the state in a streaming application idempotent?

2017-01-24 Thread shyla deshpande
My streaming application stores a lot of aggregations using mapWithState. I want to know all the possible ways I can make it idempotent. Please share your views. Thanks On Mon, Jan 23, 2017 at 5:41 PM, shyla deshpande wrote: > In a Wordcount application which stores the count of

Re: How to make the state in a streaming application idempotent?

2017-01-24 Thread shyla deshpande
On Tue, Jan 24, 2017 at 9:44 AM, shyla deshpande wrote: > My streaming application stores lot of aggregations using mapWithState. > > I want to know what are all the possible ways I can make it idempotent. > > Please share your views. > > Thanks > > On Mon, Jan

Re: How to make the state in a streaming application idempotent?

2017-01-24 Thread shyla deshpande
Please share your thoughts. On Tue, Jan 24, 2017 at 4:01 PM, shyla deshpande wrote: > > > On Tue, Jan 24, 2017 at 9:44 AM, shyla deshpande > wrote: > >> My streaming application stores lot of aggregations using mapWithState. >> >> I want to know what are al

Recovering from checkpoint question

2017-01-24 Thread shyla deshpande
If I just want to resubmit the Spark Streaming app with different configuration options, like a different --executor-memory or --total-executor-cores, will the checkpoint directory help me continue from where I left off? Appreciate your response. Thanks

Re: How to make the state in a streaming application idempotent?

2017-01-25 Thread shyla deshpande
ot going to reprocess some data after a > certain time, i.e. there is no way I'm going to get the same data in 2 > hours, it may only happen in the last 2 hours, then you may also keep the > state of uniqueId's as well, and then age them out after a certain time. > > > Best,

Re: How to make the state in a streaming application idempotent?

2017-01-25 Thread shyla deshpande
t; Does that make sense? > > > On Wed, Jan 25, 2017 at 9:28 AM, shyla deshpande > wrote: > >> Hi Burak, >> Thanks for the response. Can you please elaborate on your idea of storing >> the state of the unique ids. >> Do you have any sample code or links I can refer

Re: How to make the state in a streaming application idempotent?

2017-01-25 Thread shyla deshpande
a lot more space than the BloomFilter though depending on > your data volume. > > On Wed, Jan 25, 2017 at 11:24 AM, shyla deshpande < > deshpandesh...@gmail.com> wrote: > >> In the previous email you gave me 2 solutions >> 1. Bloom filter --> problem in repopu

Re: How to make the state in a streaming application idempotent?

2017-01-25 Thread shyla deshpande
ou would be okay with > approximations. Try both out, see what scales/works with your dataset. > Maybe you may handle the second implementation. > > On Wed, Jan 25, 2017 at 12:23 PM, shyla deshpande < > deshpandesh...@gmail.com> wrote: > >> Thanks Burak. But with BloomFi

Re: How to make the state in a streaming application idempotent?

2017-01-25 Thread shyla deshpande
Processing of the same data more than once can happen only when the app recovers after failure or during an upgrade. So how do I apply your 2nd solution only for 1-2 hours after a restart? On Wed, Jan 25, 2017 at 12:51 PM, shyla deshpande wrote: > Thanks Burak. I do want accuracy, that is why I w
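A sketch of the exact-set variant discussed above, assuming each event carries a unique id and the stream is keyed as (word, (eventId, count)): keep the applied ids in the per-key state, so a replayed event becomes a no-op.

    import org.apache.spark.streaming.{State, StateSpec}

    case class WordState(count: Long, seen: Set[String])

    val spec = StateSpec.function {
      (word: String, ev: Option[(String, Int)], state: State[WordState]) =>
        val s = state.getOption.getOrElse(WordState(0, Set.empty))
        val next = ev match {
          case Some((id, n)) if !s.seen(id) => WordState(s.count + n, s.seen + id)
          case _                            => s   // duplicate id: leave the count alone
        }
        state.update(next)
        (word, next.count)
    }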

where is mapWithState executed?

2017-01-25 Thread shyla deshpande
Is it executed on the driver or on the executors? If mapWithState runs on the executors and I need to look up a map in the update function, I need to broadcast it. Thanks

Re: where is mapWithState executed?

2017-01-25 Thread shyla deshpande
After more reading, I know the state is distributed across the cluster. But if I need to look up a map in the update function, I need to broadcast it. Just want to make sure I am on the right path. Appreciate your help. Thanks On Wed, Jan 25, 2017 at 2:33 PM, shyla deshpande wrote: > Is
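That is the right path: the mapping function runs on the executors, so a driver-side lookup map should travel as a broadcast variable. A minimal sketch with illustrative data, where pairs is assumed to be a DStream[(String, Int)]:

    import org.apache.spark.streaming.{State, StateSpec}

    val lookupBC = ssc.sparkContext.broadcast(Map("u1" -> "gold"))   // illustrative lookup data

    val spec = StateSpec.function {
      (user: String, v: Option[Int], state: State[Int]) =>
        val total = state.getOption.getOrElse(0) + v.getOrElse(0)
        state.update(total)
        (user, total, lookupBC.value.getOrElse(user, "none"))        // executor-side lookup
    }

    val out = pairs.mapWithState(spec)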

mapWithState question

2017-01-28 Thread shyla deshpande
Can multiple DStreams update the same state? I have a stream that gives me the total minutes the user spent on a course material. I have another stream that gives me chapters completed and lessons completed by the user. I want to keep track, for each user, of total_minutes, chapters_completed and lessons_compl

Re: mapWithState question

2017-01-28 Thread shyla deshpande
Thats a great idea. I will try that. Thanks. On Sat, Jan 28, 2017 at 2:35 AM, Tathagata Das wrote: > 1 state object for each user. > union both streams into a single DStream, and apply mapWithState on it to > update the user state. > > On Sat, Jan 28, 2017 at 12:30 AM,
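A sketch of TD's suggestion, assuming both source streams are first mapped onto a common (userId, Update) pair type before the union:

    import org.apache.spark.streaming.{State, StateSpec}

    sealed trait Update
    case class Minutes(m: Int) extends Update
    case class Chapter(id: String) extends Update

    // minutesStream and progressStream assumed to be DStream[(String, Update)]
    val unioned = minutesStream.union(progressStream)

    val spec = StateSpec.function {
      (user: String, u: Option[Update], state: State[(Int, Set[String])]) =>
        val (mins, chapters) = state.getOption.getOrElse((0, Set.empty[String]))
        val next = u match {
          case Some(Minutes(m)) => (mins + m, chapters)
          case Some(Chapter(c)) => (mins, chapters + c)
          case None             => (mins, chapters)
        }
        state.update(next)
        (user, next)
    }

    val userState = unioned.mapWithState(spec)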

Re: mapWithState question

2017-01-30 Thread shyla deshpande
Hello TD, your suggestion works great. Thanks. I have one more question: I need to write to Kafka from within the mapWithState function. Just wanted to check if this is a bad pattern in any way. Thank you. On Sat, Jan 28, 2017 at 9:14 AM, shyla deshpande wrote: > Thats a great idea. I w

Re: mapWithState question

2017-01-30 Thread shyla deshpande
ailure, your mapping function > will also be reexecuted, and the writes to kafka can happen multiple times. > So you may only get at least once guarantee about those Kafka writes > > > On Mon, Jan 30, 2017 at 10:02 AM, shyla deshpande < > deshpandesh...@gmail.com> wrote: &

saveToCassandra issue. Please help

2017-02-03 Thread shyla deshpande
Hello All, This is the content of my RDD which I am saving to a Cassandra table. But it looks like the 2nd row is written first and then the first row overwrites it, so I end up with bad output. (494bce4f393b474980290b8d1b6ebef9, 2017-02-01, PT0H9M30S, WEDNESDAY) (494bce4f393b474980290b8d1b6ebef9, 20

Re: saveToCassandra issue. Please help

2017-02-03 Thread shyla deshpande
PT0H9M30S and then PT0H10M0S. Appreciate your input. Thanks On Fri, Feb 3, 2017 at 12:45 AM, shyla deshpande wrote: > Hello All, > > This is the content of my RDD which I am saving to Cassandra table. > > But looks like the 2nd row is written first and then the first row overwrites

Re: saveToCassandra issue. Please help

2017-02-03 Thread shyla deshpande
g before insert to Cassandra. > 2.- Insert to cassandra all the entries and add some logic to your > request to get the most recent. > > Regards, > > 2017-02-03 10:26 GMT-08:00 shyla deshpande : > > Hi All, > > > > I wanted to add more info .. > > The first c

Spark streaming question - SPARK-13758 Need to use an external RDD inside DStream processing...Please help

2017-02-07 Thread shyla deshpande
I have a situation similar to the following and I get SPARK-13758. I understand why I get this error, but I want to know what the approach should be in dealing with these situations. Thanks > var cached = ssc.sparkContext.parallelize(Seq

Re: Spark streaming question - SPARK-13758 Need to use an external RDD inside DStream processing...Please help

2017-02-07 Thread shyla deshpande
and my cached RDD is not small. If it were, maybe I could materialize and broadcast it. Thanks On Tue, Feb 7, 2017 at 10:28 AM, shyla deshpande wrote: > I have a situation similar to the following and I get SPARK-13758 > <https://issues.apache.org/jira/browse/SPARK-13758>. >
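One workaround sketch for this shape of problem: rebuild the lookup RDD inside transform, so it is always bound to the SparkContext that is alive at batch time. Here loadLookupSeq() is a hypothetical loader for the reference data, and stream is assumed to be a pair DStream:

    val joined = stream.transform { rdd =>
      val lookup = rdd.sparkContext.parallelize(loadLookupSeq())   // loadLookupSeq() is hypothetical
      rdd.join(lookup)
    }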

Spark standalone cluster on EC2 error .. Checkpoint..

2017-02-16 Thread shyla deshpande
Caused by: org.apache.hadoop.ipc.RemoteException(java.io.IOException): File /checkpoint/11ea8862-122c-4614-bc7e-f761bb57ba23/rdd-347/.part-1-attempt-3 could only be replicated to 0 nodes instead of minReplication (=1). There are 0 datanode(s) running and no node(s) are excluded in this operati

Re: Spark standalone cluster on EC2 error .. Checkpoint..

2017-02-17 Thread shyla deshpande
: > Seems like an issue with the HDFS you are using for checkpointing. Its not > able to write data properly. > > On Thu, Feb 16, 2017 at 2:40 PM, shyla deshpande > wrote: > >> Caused by: org.apache.hadoop.ipc.RemoteException(java.io.IOException): >> File >>

Spark streaming on AWS EC2 error . Please help

2017-02-20 Thread shyla deshpande
I am running Spark streaming on AWS EC2 in standalone mode. When I do a spark-submit, I get the following message. I am subscribing to 3 kafka topics and it is reading and processing just 2 topics. Works fine in local mode. Appreciate your help. Thanks Exception in thread "pool-26-thread-132" jav

In Spark streaming, will saved kafka offsets become invalid if I change the number of partitions in a kafka topic?

2017-02-25 Thread shyla deshpande
I am committing offsets to Kafka after my output has been stored, using the commitAsync API. My question: if I increase/decrease the number of Kafka partitions, will the saved offsets become invalid? Thanks
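For reference, the commit pattern from the kafka-0-10 integration looks like this, where stream is the DStream returned by KafkaUtils.createDirectStream:

    import org.apache.spark.streaming.kafka010.{CanCommitOffsets, HasOffsetRanges}

    stream.foreachRDD { rdd =>
      val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
      // ... store the output first ...
      stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)   // then commit
    }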

Re: In Spark streaming, will saved kafka offsets become invalid if I change the number of partitions in a kafka topic?

2017-02-26 Thread shyla deshpande
Please help! On Sat, Feb 25, 2017 at 11:10 PM, shyla deshpande wrote: > I am commiting offsets to Kafka after my output has been stored, using the > commitAsync API. > > My question is if I increase/decrease the number of kafka partitions, will > the saved offsets will

error in kafka producer

2017-02-28 Thread shyla deshpande
producer send callback exception: org.apache.kafka.common.errors.TimeoutException: Expiring 1 record(s) for positionevent-6 due to 30003 ms has passed since batch creation plus linger time

Re: error in kafka producer

2017-02-28 Thread shyla deshpande
5 PM, Marco Mistroni wrote: > This exception coming from a Spark program? > could you share few lines of code ? > > kr > marco > > On Tue, Feb 28, 2017 at 10:23 PM, shyla deshpande < > deshpandesh...@gmail.com> wrote: > >> producer send callback exc

spark streaming with kafka source, how many concurrent jobs?

2017-03-10 Thread shyla deshpande
I have a Spark Streaming application which processes 3 Kafka streams and has 5 output operations. I am not sure what the setting for spark.streaming.concurrentJobs should be. 1. If the concurrentJobs setting is 4, does that mean 2 output operations will be run sequentially? 2. If I had 6 cores what wo

Re: spark streaming with kafka source, how many concurrent jobs?

2017-03-14 Thread shyla deshpande
, Tathagata Das wrote: > That config I not safe. Please do not use it. > > On Mar 10, 2017 10:03 AM, "shyla deshpande" > wrote: > >> I have a spark streaming application which processes 3 kafka streams and >> has 5 output operations. >>

Re: spark streaming with kafka source, how many concurrent jobs?

2017-03-21 Thread shyla deshpande
is not safe because it breaks the checkpointing logic in subtle ways. > Note that this was never documented in the spark online docs. > > On Tue, Mar 14, 2017 at 2:29 PM, shyla deshpande > wrote: > >> Thanks TD for the response. Can you please provide more explanation. I am &

Spark Streaming questions, just 2

2017-03-21 Thread shyla deshpande
Hello all, I have a couple of Spark Streaming questions. Thanks. 1. In the case of stateful operations, the data is, by default, persisted in memory. Does in-memory mean MEMORY_ONLY? When is it removed from memory? 2. I do not see any documentation for spark.cleaner.ttl. Is this no l

Converting dataframe to dataset question

2017-03-22 Thread shyla deshpande
Why is userDS Dataset[Any] instead of Dataset[Teamuser]? Appreciate your help. Thanks val spark = SparkSession .builder .config("spark.cassandra.connection.host", cassandrahost) .appName(getClass.getSimpleName) .getOrCreate() import spark.implicits._ val sqlC

Re: Converting dataframe to dataset question

2017-03-23 Thread shyla deshpande
I realized my case class was inside the object; it should be defined outside the scope of the object. Thanks On Wed, Mar 22, 2017 at 6:07 PM, shyla deshpande wrote: > Why userDS is Dataset[Any], instead of Dataset[Teamuser]? Appreciate your > help. Thanks > > val spark =

Re: Converting dataframe to dataset question

2017-03-23 Thread shyla deshpande
userDS:Dataset[Teamuser] = userDF.as[Teamuser] On Thu, Mar 23, 2017 at 12:49 PM, shyla deshpande wrote: > I realized, my case class was inside the object. It should be defined > outside the scope of the object. Thanks > > On Wed, Mar 22, 2017 at 6:07 PM, shyla deshpande > wrote: >

Re: Converting dataframe to dataset question

2017-03-23 Thread shyla deshpande
> +------+------+-----+
> |teamid|userid| role|
> +------+------+-----+
> |    t1|    u1|role1|
> +------+------+-----+
>
> scala> userDS.printSchema
> root
> |-- teamid: string (nullable = true)
> |-- userid: string (nullable = true)
> |-- role: string (nullable = tr

Spark dataframe, UserDefinedAggregateFunction(UDAF) help!!

2017-03-23 Thread shyla deshpande
This is my input data. The UDAF needs to aggregate the goals for a team and return a map that gives the count for every goal in the team. I am getting the following error: java.lang.ClassCastException: scala.collection.mutable.WrappedArray$ofRef cannot be cast to [Ljava.lang.String; at com.whil.co

Re: Spark dataframe, UserDefinedAggregateFunction(UDAF) help!!

2017-03-23 Thread shyla deshpande
Thanks a million Yong. Great help!!! It solved my problem. On Thu, Mar 23, 2017 at 6:00 PM, Yong Zhang wrote: > Change: > > val arrayinput = input.getAs[Array[String]](0) > > to: > > val arrayinput = input.getAs[*Seq*[String]](0) > > > Yong > > > ---
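For anyone landing on the same ClassCastException: Spark materializes array columns as WrappedArray, so read them through the Seq interface. A compact, illustrative UDAF sketch (not the original code) with the fix in place:

    import org.apache.spark.sql.Row
    import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
    import org.apache.spark.sql.types._

    // Counts occurrences of each goal across the rows of a group.
    object GoalCounts extends UserDefinedAggregateFunction {
      def inputSchema: StructType = StructType(StructField("goals", ArrayType(StringType)) :: Nil)
      def bufferSchema: StructType = StructType(StructField("counts", MapType(StringType, IntegerType)) :: Nil)
      def dataType: DataType = MapType(StringType, IntegerType)
      def deterministic: Boolean = true
      def initialize(buffer: MutableAggregationBuffer): Unit = buffer(0) = Map.empty[String, Int]
      def update(buffer: MutableAggregationBuffer, input: Row): Unit = {
        val counts = buffer.getAs[Map[String, Int]](0)
        val goals = input.getAs[Seq[String]](0)   // Seq, not Array[String]
        buffer(0) = goals.foldLeft(counts)((m, g) => m + (g -> (m.getOrElse(g, 0) + 1)))
      }
      def merge(b1: MutableAggregationBuffer, b2: Row): Unit = {
        val m2 = b2.getAs[Map[String, Int]](0)
        b1(0) = m2.foldLeft(b1.getAs[Map[String, Int]](0))((m, kv) =>
          m + (kv._1 -> (m.getOrElse(kv._1, 0) + kv._2)))
      }
      def evaluate(buffer: Row): Any = buffer.getAs[Map[String, Int]](0)
    }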

Re: Converting dataframe to dataset question

2017-03-23 Thread shyla deshpande
> val df = Seq(TeamUser("t1", "u1", "r1")).toDF() > df.printSchema() > spark.close() > } > } > > case class TeamUser(teamId: String, userId: String, role: String) > > > On Fri, Mar 24, 2017 at 5:23 AM, shyla deshpande > w

dataframe join questions?

2017-03-28 Thread shyla deshpande
Following are my questions. Thank you. 1. When joining dataframes, is it a good idea to repartition on the key column used in the join, or is the optimizer smart enough that I shouldn't bother? 2. In an RDD join, wherever possible we do reduceByKey before the join to avoid a big shuffle of data. Do we need
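Partly related to (1): when one side of the join is small, the explicit broadcast hint avoids shuffling the large side at all. A minimal sketch; bigDF and smallDF are placeholders:

    import org.apache.spark.sql.functions.broadcast

    // Ship smallDF to every executor instead of shuffling bigDF on the join key.
    val joined = bigDF.join(broadcast(smallDF), Seq("key"))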

Re: dataframe join questions. Appreciate your input.

2017-03-28 Thread shyla deshpande
On Tue, Mar 28, 2017 at 2:57 PM, shyla deshpande wrote: > Following are my questions. Thank you. > > 1. When joining dataframes is it a good idea to repartition on the key column > that is used in the join or > the optimizer is too smart so forget it. > > 2. In RDD join, wh

Re: Spark SQL, dataframe join questions.

2017-03-29 Thread shyla deshpande
On Tue, Mar 28, 2017 at 2:57 PM, shyla deshpande wrote: > Following are my questions. Thank you. > > 1. When joining dataframes is it a good idea to repartition on the key column > that is used in the join or > the optimizer is too smart so forget it. > > 2. In RDD join, wh

Will the setting for spark.default.parallelism be used for spark.sql.shuffle.output.partitions?

2017-03-30 Thread shyla deshpande
Thanks

Re: Will the setting for spark.default.parallelism be used for spark.sql.shuffle.output.partitions?

2017-03-30 Thread shyla deshpande
The spark version I am using is spark 2.1. On Thu, Mar 30, 2017 at 9:58 AM, shyla deshpande wrote: > Thanks >

dataframe filter, unable to bind variable

2017-03-30 Thread shyla deshpande
The following works: df.filter($"createdate".between("2017-03-20", "2017-03-22")) I would like to pass variables fromdate and todate to the filter instead of constants, but I am unable to get the syntax right. Please help. Thanks

Re: dataframe filter, unable to bind variable

2017-03-30 Thread shyla deshpande
Works. Thanks Hosur. On Thu, Mar 30, 2017 at 8:37 PM, hosur narahari wrote: > Try lit(fromDate) and lit(toDate). You've to import > org.apache.spark.sql.functions.lit > to use it > > On 31 Mar 2017 7:45 a.m., "shyla deshpande" > wrote: > > The fol
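So the working form, for the record:

    import org.apache.spark.sql.functions.lit

    val fromdate = "2017-03-20"
    val todate   = "2017-03-22"
    df.filter($"createdate".between(lit(fromdate), lit(todate)))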

What is the best way to run a scheduled Spark batch job on AWS EC2?

2017-04-06 Thread shyla deshpande
I want to run a Spark batch job, maybe hourly, on AWS EC2. What is the easiest way to do this? Thanks

Re: What is the best way to run a scheduled Spark batch job on AWS EC2?

2017-04-07 Thread shyla deshpande
is just CRON in the UI ) that >>> will do the above step but have scheduling and potential backfilling and >>> error handling(retries,alerts etc) >>> >>> AWS are coming out with glue <https://aws.amazon.com/glue/> soon that >>> does some Spark jobs but

Structured streaming and writing output to Cassandra

2017-04-07 Thread shyla deshpande
Is anyone using Structured Streaming and writing the results to a Cassandra database in a production environment? I do not think I have enough expertise to write a custom sink that can be used in a production environment. Please help!

Re: Structured streaming and writing output to Cassandra

2017-04-08 Thread shyla deshpande
Cheers > Jules > > Sent from my iPhone > Pardon the dumb thumb typos :) > > On Apr 7, 2017, at 11:23 AM, shyla deshpande > wrote: > > Is anyone using structured streaming and writing the results to Cassandra > database in a production environment? > > I do not think
