Re: Dstream HasOffsetRanges equivalent in Structured streaming

2024-05-22 Thread Tathagata Das
If you want to find what offset ranges are present in a microbatch in Structured Streaming, you have to look at StreamingQuery.lastProgress or use the StreamingQueryListener. Both of these
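
A rough sketch of the lastProgress route (assuming `query` is a started StreamingQuery; lastProgress is null before the first batch completes):

    // Sketch: inspect the offset range covered by the most recent micro-batch.
    val progress = query.lastProgress
    if (progress != null) {
      progress.sources.foreach { s =>
        // For the Kafka source these are JSON strings mapping topic -> partition -> offset.
        println(s"${s.description}: start=${s.startOffset} end=${s.endOffset}")
      }
    }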

Re: Dstream HasOffsetRanges equivalent in Structured streaming

2024-05-22 Thread Tathagata Das
Unfortunately, the listeners process is async and can't guarantee happens > before association with microbatch to commit offsets to external storage. > But still they will work. Is there a way to access lastProgress in > foreachBatch? > > > On Wed, May 22, 2024 at 7:35 AM Tathagata D
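
A common alternative, not an answer given in this thread, is to key idempotent writes off the batchId that foreachBatch already supplies; a rough sketch, where saveToExternalStore is a hypothetical helper and `df` is a streaming DataFrame:

    import org.apache.spark.sql.DataFrame

    val query = df.writeStream
      .foreachBatch { (batchDF: DataFrame, batchId: Long) =>
        // batchId is unique and monotonically increasing; recording it alongside the
        // write lets the external store detect replayed batches after a restart.
        saveToExternalStore(batchDF, batchId)   // hypothetical helper
      }
      .start()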

Re: Spark Streaming - Design considerations/Knobs

2015-05-20 Thread Tathagata Das
Correcting the ones that are incorrect or incomplete. BUT this is a good list of things to remember about Spark Streaming. On Wed, May 20, 2015 at 3:40 AM, Hemant Bhanawat wrote: > Hi, > > I have compiled a list (from online sources) of knobs/design > considerations that need to be taken care of

Re: Spark Streaming graceful shutdown in Spark 1.4

2015-05-20 Thread Tathagata Das
If you are talking about handling driver crash failures, then all bets are off anyways! Adding a shutdown hook in the hope of handling driver process failure handles only some cases (Ctrl-C), but does not handle cases like SIGKILL (does not run JVM shutdown hooks) or driver machine crash. So its
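
A minimal sketch of the shutdown-hook approach being discussed, assuming `ssc` is the running StreamingContext (it helps only with Ctrl-C/SIGTERM, not SIGKILL or machine crash):

    // Sketch: stop the StreamingContext gracefully when the JVM shuts down normally.
    sys.addShutdownHook {
      ssc.stop(stopSparkContext = true, stopGracefully = true)
    }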

Re: [Unit Test Failure] Test org.apache.spark.streaming.JavaAPISuite.testCount failed

2015-05-20 Thread Tathagata Das
Has this been fixed for you now? There have been a number of patches since then and it may have been fixed. On Thu, May 14, 2015 at 7:20 AM, Wangfei (X) wrote: > Yes, it fails repeatedly on my local Jenkins. > > Sent from my iPhone > > On 14 May 2015, at 18:30, "Tathagata Das" wrote: >

Re: Storing spark processed output to Database asynchronously.

2015-05-21 Thread Tathagata Das
If you cannot push data as fast as you are generating it, then async isn't going to help either. The "work" is just going to keep piling up as many, many async jobs, even though your batch processing times will be low, as that processing time is not going to reflect how much of the overall work is pending

Re: Storing spark processed output to Database asynchronously.

2015-05-21 Thread Tathagata Das
y 21, 2015 at 4:55 PM, Tathagata Das > wrote: > >> If you cannot push data as fast as you are generating it, then async isnt >> going to help either. The "work" is just going to keep piling up as many >> many async jobs even though your batch processing times wi

Re: [Streaming] Non-blocking recommendation in custom receiver documentation and KinesisReceiver's worker.run blocking call

2015-05-21 Thread Tathagata Das
Thanks for the JIRA. I will look into this issue. TD On Thu, May 21, 2015 at 1:31 AM, Aniket Bhatnagar < aniket.bhatna...@gmail.com> wrote: > I ran into one of the issues that are potentially caused because of this > and have logged a JIRA bug - > https://issues.apache.org/jira/browse/SPARK-7788

Re: Spark Streaming with Tachyon : Data Loss on Receiver Failure due to WAL error

2015-05-21 Thread Tathagata Das
Looks like somehow the file size reported by the FSInputDStream of Tachyon's FileSystem interface, is returning zero. On Mon, May 11, 2015 at 4:38 AM, Dibyendu Bhattacharya < dibyendu.bhattach...@gmail.com> wrote: > Just to follow up this thread further . > > I was doing some fault tolerant testi

Re: Connecting to an inmemory database from Spark

2015-05-21 Thread Tathagata Das
Doesn't seem like a Cassandra-specific issue. Could you give us more information (code, errors, stack traces)? On Thu, May 21, 2015 at 1:33 PM, tshah77 wrote: > TD, > > Do you have any example about reading from cassandra using spark streaming > in java? > > I am trying to connect to cassandra u

Re: Storing spark processed output to Database asynchronously.

2015-05-22 Thread Tathagata Das
probably because Neo4j API takes too long to push the data into >> database. Meanwhile, Spark is unable to receive data probably because the >> process is blocked. >> >> On Thu, May 21, 2015 at 5:28 PM, Tathagata Das >> wrote: >> >>> Can you elab

Re: Trying to connect to many topics with several DirectConnect

2015-05-22 Thread Tathagata Das
Can you show us the rest of the program? When are you starting or stopping the context? Is the exception occurring right after start or stop? What about the log4j logs, what do they say? On Fri, May 22, 2015 at 7:12 AM, Cody Koeninger wrote: > I just verified that the following code works on 1.3.0 :

Re: [Streaming] Non-blocking recommendation in custom receiver documentation and KinesisReceiver's worker.run blocking call

2015-05-22 Thread Tathagata Das
Hey Aniket, I just checked in the fix in Spark master and branch-1.4. Could you download Spark and test it out? On Thu, May 21, 2015 at 1:43 AM, Tathagata Das wrote: > Thanks for the JIRA. I will look into this issue. > > TD > > On Thu, May 21, 2015 at 1:31 AM, A

Re: Spark Streaming - Design considerations/Knobs

2015-05-24 Thread Tathagata Das
it? > > Thanks, > Hemant > > On Thu, May 21, 2015 at 2:21 AM, Tathagata Das > wrote: > >> Correcting the ones that are incorrect or incomplete. BUT this is good >> list for things to remember about Spark Streaming. >> >> >> On Wed, May 20, 2015 at 3

Re: Spark Streaming from Kafka - no receivers and spark.streaming.receiver.maxRate?

2015-05-27 Thread Tathagata Das
You can throttle the no-receiver direct Kafka stream using spark.streaming.kafka.maxRatePerPartition. On Wed, May 27, 2015 at 4:34 PM, Ted Yu wrote: > Have you seen > http://stackoverflow.com/questions/29051579/pausing-thro
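
For example, a sketch of setting that limit (the value and app name are assumptions; the unit is records per partition per second):

    // Sketch: cap the direct Kafka stream at 1000 records per partition per second.
    val conf = new org.apache.spark.SparkConf()
      .setAppName("throttled-direct-stream")
      .set("spark.streaming.kafka.maxRatePerPartition", "1000")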

Re: Recommended Scala version

2015-05-28 Thread Tathagata Das
Would be great if you guys can test out the Spark 1.4.0 RC2 (RC3 coming out soon) with Scala 2.11 and report issues. TD On Tue, May 26, 2015 at 9:15 AM, Koert Kuipers wrote: > we are still running into issues with spark-shell not working on 2.11, but > we are running on somewhat older master so

Re: FW: Websphere MQ as a data source for Apache Spark Streaming

2015-05-29 Thread Tathagata Das
Are you sure that the data can be saved as strings? Another, more controlled approach is to use DStream.foreachRDD, which takes a Function2 parameter of RDD and Time. There you can explicitly do stuff with the RDD, save it to separate files (separated by time), or whatever. Might help you to debug wha
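
A rough Scala sketch of that foreachRDD variant, writing each batch to a time-stamped location (the path is made up, and `dstream` stands for the DStream built on the MQ receiver):

    // Sketch: the (RDD, Time) form of foreachRDD lets you key output by batch time.
    dstream.foreachRDD { (rdd, time) =>
      if (!rdd.isEmpty()) {
        rdd.saveAsTextFile(s"/data/mq-output/batch-${time.milliseconds}")
      }
    }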

Re: Recommended Scala version

2015-05-31 Thread Tathagata Das
sses in the jar I've specified with the --jars argument on the command > line are available in the REPL. > > > Cheers > Alex > > On Thu, May 28, 2015 at 8:38 AM, Tathagata Das > wrote: > >> Would be great if you guys can test out the Spark 1.4.0 RC2 (RC3 coming >

Re: How to monitor Spark Streaming from Kafka?

2015-06-01 Thread Tathagata Das
In the receiver-less "direct" approach, there is no concept of consumer group as we don't use the Kafka High Level consumer (that uses ZK). Instead, Spark Streaming manages offsets on its own, giving tighter guarantees. If you want to monitor the progress of the processing of offsets, you will have t
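
One way to monitor the processed offsets with the direct approach (a sketch against the 1.x spark-streaming-kafka API, assuming `directStream` is the DStream returned by KafkaUtils.createDirectStream, with the cast done before any transformation):

    import org.apache.spark.streaming.kafka.HasOffsetRanges

    // Sketch: log the offset range each batch covers, per topic-partition.
    directStream.foreachRDD { rdd =>
      val ranges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
      ranges.foreach { r =>
        println(s"${r.topic} partition ${r.partition}: ${r.fromOffset} -> ${r.untilOffset}")
      }
    }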

Re: Spark updateStateByKey fails with class leak when using case classes - resend

2015-06-01 Thread Tathagata Das
Interesting, only in local[*]! In the github you pointed to, what is the main that you were running? TD On Mon, May 25, 2015 at 9:23 AM, rsearle wrote: > Further experimentation indicates these problems only occur when master is > local[*]. > > There are no issues if a standalone cluster is us

Re: Spark Streming yarn-cluster Mode Off-heap Memory Is Constantly Growing

2015-06-02 Thread Tathagata Das
using CDH5.1, so the spark1.0 is provided by cdh, and > spark-1.3.1-bin-hadoop2.3 is downloaded from the official website. > > I could use spark1.0 + yarn, but I can't find a way to handle the logs, > level and rolling, so it'll explode the harddrive. > > Currently I

Re: Roadmap for Spark with Kafka on Scala 2.11?

2015-06-04 Thread Tathagata Das
But compile scope is supposed to be added to the assembly. https://maven.apache.org/guides/introduction/introduction-to-dependency-mechanism.html#Dependency_Scope On Thu, Jun 4, 2015 at 1:24 PM, algermissen1971 wrote: > Hi Iulian, > > On 26 May 2015, at 13:04, Iulian Dragoș > wrote: > > > > >

Re: Spark SQL and Streaming Results

2015-06-05 Thread Tathagata Das
You could take a look at RDD *async operations and their source code. Maybe that can help in getting some early results. TD On Fri, Jun 5, 2015 at 8:41 AM, Pietro Gentile < pietro.gentile89.develo...@gmail.com> wrote: > Hi all, > > > what is the best way to perform Spark SQL queries and obtain the result

Re: Spark SQL and Streaming Results

2015-06-05 Thread Tathagata Das
missed it. > > @TD Is this project still active? > > I'm not sure what the status is but it may provide some insights on how to > achieve what your looking to do. > > On Fri, Jun 5, 2015 at 6:34 PM, Tathagata Das wrote: > >> You could take at RDD *async operations, t

Re: Shutdown with streaming driver running in cluster broke master web UI permanently

2015-06-11 Thread Tathagata Das
Do you have the event logging enabled? TD On Thu, Jun 11, 2015 at 11:24 AM, scar0909 wrote: > I have the same problem. i realized that the master spark becomes > unresponsive when we kill the leader zookeeper (of course i assigned the > leader election task to the zookeeper). please let me know

Re: Join between DStream and Periodically-Changing-RDD

2015-06-11 Thread Tathagata Das
Another approach not mentioned is to use a function to get the RDD that is to be joined. Not sure, but you can try something like this: kvDstream.foreachRDD { rdd => val refRdd = getOrUpdateRDD(params...); rdd.join(refRdd) } The getOrUpdat
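
A slightly fuller sketch of that idea using transform, so the join result stays a DStream; getOrUpdateRDD is a hypothetical driver-side helper, and the reference path/shape are assumptions:

    // Sketch: reload and cache the reference RDD only when it is considered stale.
    var refRdd: org.apache.spark.rdd.RDD[(String, String)] = null

    def getOrUpdateRDD(sc: org.apache.spark.SparkContext): org.apache.spark.rdd.RDD[(String, String)] = {
      if (refRdd == null /* || refresh interval elapsed */) {
        if (refRdd != null) refRdd.unpersist()
        refRdd = sc.textFile("/data/reference").map(line => (line, line)).cache()
      }
      refRdd
    }

    // transform re-runs this function on the driver for every batch.
    val joined = kvDstream.transform { rdd =>
      rdd.join(getOrUpdateRDD(rdd.sparkContext))
    }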

Re: FW: Re: Autoscaling Spark cluster based on topic sizes/rate of growth in Kafka or Spark's metrics?

2015-06-11 Thread Tathagata Das
Let me try to add some clarity in the different thought directions that are going on in this thread. 1. HOW TO DETECT THE NEED FOR MORE CLUSTER RESOURCES? If there are no rate limits set up, the most reliable way to detect whether the current Spark cluster is insufficient to handle the data

Re: Spark Streaming reads from stdin or output from command line utility

2015-06-11 Thread Tathagata Das
Are you going to receive data from one stdin from one machine, or many stdins on many machines? On Thu, Jun 11, 2015 at 7:25 PM, foobar wrote: > Hi, I'm new to Spark Streaming, and I want to create a application where > Spark Streaming could create DStream from stdin. Basically I have a command

Re: Spark Streaming reads from stdin or output from command line utility

2015-06-11 Thread Tathagata Das
so I'm working with Scala. > It would be great if you could talk about multi stdin case as well! > Thanks. > > From: Tathagata Das > Date: Thursday, June 11, 2015 at 8:11 PM > To: Heath Guo > Cc: user > Subject: Re: Spark Streaming reads from stdin or output from c

Re: Re: How to keep a SQLContext instance alive in a spark streaming application's life cycle?

2015-06-12 Thread Tathagata Das
BTW, in Spark 1.4 announced today, I added SQLContext.getOrCreate. So you don't need to create the singleton yourself. On Wed, Jun 10, 2015 at 3:21 AM, Sergio Jiménez Barrio < drarse.a...@gmail.com> wrote: > Note: CCing user@spark.apache.org > > > First, you must check if the RDD is empty: > >
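
A minimal sketch of that usage inside a streaming job (Spark 1.4 API; assumes wordsDstream is a DStream[String] and the table/column names are made up):

    import org.apache.spark.sql.SQLContext

    // Sketch: fetch (or lazily create) the singleton SQLContext inside each batch.
    wordsDstream.foreachRDD { rdd =>
      val sqlContext = SQLContext.getOrCreate(rdd.sparkContext)
      import sqlContext.implicits._
      val df = rdd.map(w => (w, 1)).toDF("word", "one")
      df.registerTempTable("words")
      sqlContext.sql("select word, count(*) as total from words group by word").show()
    }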

Re: Spark or Storm

2015-06-17 Thread Tathagata Das
To add more information beyond what Matei said and answer the original question, here are other things to consider when comparing between Spark Streaming and Storm. * Unified programming model and semantics - Most occasions you have to process the same data again in batch jobs. If you have two sep

Re: Serial batching with Spark Streaming

2015-06-17 Thread Tathagata Das
The default behavior should be that batch X + 1 starts processing only after batch X completes. If you are using Spark 1.4.0, could you show us a screenshot of the streaming tab, especially the list of batches? And could you also tell us if you are setting any SparkConf configurations? On Wed, Jun

Re: Latency between the RDD in Streaming

2015-06-18 Thread Tathagata Das
It's not clear what you are asking. Find "what" among RDD? On Thu, Jun 18, 2015 at 11:24 AM, anshu shukla wrote: > Is there any fixed way to find among RDD in stream processing systems , > in the Distributed set-up . > > -- > Thanks & Regards, > Anshu Shukla >

Re: Spark Streming yarn-cluster Mode Off-heap Memory Is Constantly Growing

2015-06-18 Thread Tathagata Das
"spark.shuffle.io.preferDirectBufs" to false to turn off >>> the off-heap allocation of netty? >>> >>> Best Regards, >>> Shixiong Zhu >>> >>> 2015-06-03 11:58 GMT+08:00 Ji ZHANG : >>> >>>> Hi, >>>> >>

Re: [Spark Streaming] Runtime Error in call to max function for JavaPairRDD

2015-06-18 Thread Tathagata Das
I think you may be including a different version of Spark Streaming in your assembly. Please mark spark-core and spark-streaming as provided dependencies. Any installation of Spark will automatically provide Spark in the classpath, so you do not have to bundle it. On Thu, Jun 18, 2015 at 8:44 AM, Ni
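
The thread's build is Maven, but as an illustration the equivalent in an sbt build looks roughly like this (versions are assumptions):

    // build.sbt sketch: Spark itself is supplied by the cluster, so mark it provided;
    // only connector libraries (e.g. spark-streaming-kafka) go into the assembly.
    libraryDependencies ++= Seq(
      "org.apache.spark" %% "spark-core"            % "1.4.0" % "provided",
      "org.apache.spark" %% "spark-streaming"       % "1.4.0" % "provided",
      "org.apache.spark" %% "spark-streaming-kafka" % "1.4.0"
    )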

Re: Latency between the RDD in Streaming

2015-06-18 Thread Tathagata Das
al > D-STREAM to final/last D-STREAM . > Help Please !! > > On Fri, Jun 19, 2015 at 12:40 AM, Tathagata Das > wrote: > >> Its not clear what you are asking. Find "what" among RDD? >> >> On Thu, Jun 18, 2015 at 11:24 AM, anshu shukla >>

Re: understanding on the "waiting batches" and "scheduling delay" in Streaming UI

2015-06-18 Thread Tathagata Das
Also, could you give a screenshot of the streaming UI. Even better, could you run it on Spark 1.4 which has a new streaming UI and then use that for debugging/screenshot? TD On Thu, Jun 18, 2015 at 3:05 AM, Akhil Das wrote: > Which version of spark? and what is your data source? For some reason

Re: createDirectStream and Stats

2015-06-18 Thread Tathagata Das
Are you using Spark 1.3.x ? That explains. This issue has been fixed in Spark 1.4.0. Bonus you get a fancy new streaming UI with more awesome stats. :) On Thu, Jun 18, 2015 at 7:01 PM, Tim Smith wrote: > Hi, > > I just switched from "createStream" to the "createDirectStream" API for > kafka and

Re: Latency between the RDD in Streaming

2015-06-18 Thread Tathagata Das
e is aprox 1500 > tuples/sec). > > On Fri, Jun 19, 2015 at 2:50 AM, Tathagata Das > wrote: > >> Couple of ways. >> >> 1. Easy but approx way: Find scheduling delay and processing time using >> StreamingListener interface, and then calculate "end-to-end dela
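
A sketch of the StreamingListener approach mentioned in the quoted reply (field names per the 1.x scheduler API; the log format is made up):

    import org.apache.spark.streaming.scheduler.{StreamingListener, StreamingListenerBatchCompleted}

    // Sketch: log scheduling delay + processing time per completed batch as an
    // approximate end-to-end delay measure.
    class LatencyListener extends StreamingListener {
      override def onBatchCompleted(batchCompleted: StreamingListenerBatchCompleted): Unit = {
        val info = batchCompleted.batchInfo
        val delayMs = info.schedulingDelay.getOrElse(0L) + info.processingDelay.getOrElse(0L)
        println(s"batch ${info.batchTime}: approx end-to-end delay = $delayMs ms")
      }
    }

    // ssc.addStreamingListener(new LatencyListener)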

Re: RE: Spark or Storm

2015-06-19 Thread Tathagata Das
If the current documentation is confusing, we can definitely improve the documentation. However, I do not understand why the term "transactional" is confusing. If your output operation has to add 5, then the user has to implement the following mechanism: 1. If the unique id of the batch of data i
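
A hedged sketch of the mechanism being described, using the batch time as the unique id; alreadyCommitted and commitAtomically are hypothetical helpers over your storage system:

    // Sketch: exactly-once output by committing the result together with the batch's unique id.
    dstream.foreachRDD { (rdd, time) =>
      val uniqueId = time.milliseconds          // unique per batch, stable across re-runs
      val increment = rdd.count()               // stand-in for the "add 5" in the example above
      if (!alreadyCommitted(uniqueId)) {
        commitAtomically(uniqueId, increment)   // write the update and the id in one transaction
      } // else: this batch was already applied (e.g. after recovery), so skip it
    }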

Re: createDirectStream and Stats

2015-06-19 Thread Tathagata Das
p. Yes, for the record, this is with >> CDH 5.4.1 and Spark 1.3. >> >> >> >> On Thu, Jun 18, 2015 at 7:05 PM, Tim Smith wrote: >> >>> Thanks for the super-fast response, TD :) >>> >>> I will now go bug my hadoop vendor to upgrade fro

Re: Serial batching with Spark Streaming

2015-06-19 Thread Tathagata Das
wordCounts2 = >>> JavaPairDStream.fromJavaDStream( >>> wordCounts.transform( (rdd) -> { >>> sleep.add(1); >>> Thread.sleep(sleep.localValue() < 6 ? 2 : 5000); >>> return JavaRDD.fromRDD(JavaPairRDD

Re: Assigning number of workers in spark streaming

2015-06-19 Thread Tathagata Das
Depends on what cluster manager you are using. It's all pretty well documented in the online documentation. http://spark.apache.org/docs/latest/submitting-applications.html On Fri, Jun 19, 2015 at 2:29 PM, anshu shukla wrote: > Hey , > *[For Client Mode]* > > 1- Is there any way to assign the nu

Re: RE: Spark or Storm

2015-06-19 Thread Tathagata Das
batch or will it honor Kafka's offset reset policy(auto.offset.reset). If >>> it honors the reset policy and it is set as "smallest", then it is the at >>> least once semantics; if it set "largest", then it will be at most once >>> semantics? >>> >

Re: Assigning number of workers in spark streaming

2015-06-19 Thread Tathagata Das
ent deploy mode > ./bin/spark-submit \ > --class org.apache.spark.examples.SparkPi \ > --master spark://207.184.161.138:7077 \ > --executor-memory 20G \ > --total-executor-cores 100 \ > /path/to/examples.jar \ > 1000 > > > > On Sat, Jun 20, 2015 at 3:18

Re: createDirectStream and Stats

2015-06-19 Thread Tathagata Das
rts up and works fine for a while but I >> guess starts to deteriorate after a while. With the existing API >> "createStream", the app does deteriorate but over a much longer period, >> hours vs days. >> >> >> >> >> >> >> On Fr

Re: createDirectStream and Stats

2015-06-19 Thread Tathagata Das
I don't think there were any enhancements that can change this behavior. On Fri, Jun 19, 2015 at 6:16 PM, Tim Smith wrote: > On Fri, Jun 19, 2015 at 5:15 PM, Tathagata Das > wrote: > >> Also, can you find from the spark UI the break up of the stages in each >> batch's

Re: Serial batching with Spark Streaming

2015-06-20 Thread Tathagata Das
> due to failures*? If not, please could you suggest workarounds and point > me to the code? > > One more thing was not 100% clear to me from the documentation: Is there > exactly *1 RDD* published *per a batch interval* in a DStream? > > > > On 19 June 2015 at 16:58,

Re: jars are not loading from 1.3. those set via setJars to the SparkContext

2015-06-22 Thread Tathagata Das
Do you have the Kafka producer in your classpath? If so, how are you adding that library? Are you running on YARN, Mesos, Standalone, or local? These details will be very useful. On Mon, Jun 22, 2015 at 8:34 AM, Murthy Chelankuri wrote: > I am using spark streaming. what i am trying to do is sending

Re: [Spark Streaming] Runtime Error in call to max function for JavaPairRDD

2015-06-22 Thread Tathagata Das
> Hi Tathagata, >> >> When you say please mark spark-core and spark-streaming as dependencies >> how do you mean? >> I have installed the pre-build spark-1.4 for Hadoop 2.6 from spark >> downloads. In my maven pom.xml, I am using version 1.4 as described. >> &g

Re: jars are not loading from 1.3. those set via setJars to the SparkContext

2015-06-22 Thread Tathagata Das
How are you adding that to the classpath? Through spark-submit or otherwise? On Mon, Jun 22, 2015 at 5:02 PM, Murthy Chelankuri wrote: > Yes I have the producer in the class path. And I am using in standalone > mode. > > Sent from my iPhone > > On 23-Jun-2015, at 3:31 am, Ta

Re: jars are not loading from 1.3. those set via setJars to the SparkContext

2015-06-23 Thread Tathagata Das
12:17 PM, Tathagata Das > wrote: > >> How are you adding that to the classpath? Through spark-submit or >> otherwise? >> >> On Mon, Jun 22, 2015 at 5:02 PM, Murthy Chelankuri >> wrote: >> >>> Yes I have the producer in the class path. And I

Re: Shutdown with streaming driver running in cluster broke master web UI permanently

2015-06-23 Thread Tathagata Das
10:44:01 WARN ConnectionStateManager: There are no > ConnectionStateListeners registered. > 15/06/22 10:44:01 INFO ZooKeeperLeaderElectionAgent: We have lost > leadership > 15/06/22 10:44:01 ERROR Master: Leadership has been revoked -- master > shutting down. > > > On Thu,

Re: [Spark Streaming 1.4.0] SPARK-5063, Checkpointing and queuestream

2015-06-23 Thread Tathagata Das
Queue stream does not support driver checkpoint recovery since the RDDs in the queue are arbitrarily generated by the user and it's hard for Spark Streaming to keep track of the data in the RDDs (that's necessary for recovering from checkpoints). Anyway, queue stream is meant for testing and development,

Re: jars are not loading from 1.3. those set via setJars to the SparkContext

2015-06-23 Thread Tathagata Das
se the spark executor > classloader for loading the classes or some thing like that? > > > On Tue, Jun 23, 2015 at 12:28 PM, Tathagata Das > wrote: > >> So you have Kafka in your classpath in you Java application, where you >> are creating the sparkContext with

Re: [Spark Streaming] Null Pointer Exception when accessing broadcast variable to store a hashmap in Java

2015-06-23 Thread Tathagata Das
Yes, this is a known behavior. Some static stuff is not serialized as part of a task. On Tue, Jun 23, 2015 at 10:24 AM, Nipun Arora wrote: > I found the error so just posting on the list. > > It seems broadcast variables cannot be declared static. > If you do you get a null pointer exception. >

Re: spark streaming with kafka jar missing

2015-06-23 Thread Tathagata Das
You must not include spark-core and spark-streaming in the assembly. They are already present in the installation, and the presence of multiple versions of Spark may throw off the classloaders in weird ways. So make the assembly marking those dependencies as scope=provided. On Tue, Jun 23, 20

Re: java.lang.IllegalArgumentException: A metric named ... already exists

2015-06-23 Thread Tathagata Das
Aaah, this could potentially be a major issue, as it may prevent metrics from a restarted streaming context from being published. Can you make it a JIRA? TD On Tue, Jun 23, 2015 at 7:59 AM, Juan Rodríguez Hortalá < juan.rodriguez.hort...@gmail.com> wrote: > Hi, > > I'm running a program in Spark 1.4 where

Re: Calculating tuple count /input rate with time

2015-06-23 Thread Tathagata Das
This should give an accurate count for each batch, though for getting the rate you have to make sure that your streaming app is stable, that is, batches are processed as fast as they are received (scheduling delay in the Spark Streaming UI is approx 0). TD On Tue, Jun 23, 2015 at 2:49 AM, anshu shukl

Re: spark streaming with kafka jar missing

2015-06-23 Thread Tathagata Das
rk > spark-streaming_2.10 > provided > 1.4.0 > > > > > > > maven-assembly-plugin > > > package > > single > > > > > > jar-with-dependencies

Re: Kafka createDirectStream ​issue

2015-06-23 Thread Tathagata Das
Please run it in your own application and not in the spark shell. I see that you are trying to stop the Spark context and create a new StreamingContext. That will lead to the unexpected issue that you are seeing. Please make a standalone SBT/Maven app for Spark Streaming. On Tue, Jun 23, 2015 at 3:43

Re: Time is ugly in Spark Streaming....

2015-06-27 Thread Tathagata Das
Could you print the "time" on the driver (that is, in foreachRDD but before RDD.foreachPartition) and see if it is behaving weird? TD On Fri, Jun 26, 2015 at 3:57 PM, Emrehan Tüzün wrote: > > > > > On Fri, Jun 26, 2015 at 12:30 PM, Sea <261810...@qq.com> wrote: > >> Hi, all >> >> I find a probl

Re: spark streaming - checkpoint

2015-06-27 Thread Tathagata Das
Do you have SPARK_CLASSPATH set in both cases? Before and after checkpoint? If yes, then you should not be using SPARK_CLASSPATH; it has been deprecated since Spark 1.0 because of its ambiguity. Also, where do you have spark.executor.extraClassPath set? I don't see it in the spark-submit command. On

Re: Uncaught exception in thread delete Spark local dirs

2015-06-27 Thread Tathagata Das
How are you trying to execute the code again? From checkpoints, or otherwise? Also cc'ed Hari, who may have a better idea of YARN-related issues. On Sat, Jun 27, 2015 at 12:35 AM, Guillermo Ortiz wrote: > Hi, > > I'm executing a Spark Streaming code with Kafka. The code was working but > today I

Re: spark streaming job fails to restart after checkpointing due to DStream initialization errors

2015-06-27 Thread Tathagata Das
Could you also provide the code where you set up the Kafka dstream? I don't see it in the snippet. On Fri, Jun 26, 2015 at 2:45 PM, Ashish Nigam wrote: > Here's code - > > def createStreamingContext(checkpointDirectory: String) : > StreamingContext = { > > val conf = new SparkConf().setAppNam

Re: How to recover in case user errors in streaming

2015-06-27 Thread Tathagata Das
I use it in map, but the whole point here is to handle > something that is breaking in action ). Please help. :( > > From: amit assudani > Date: Friday, June 26, 2015 at 11:41 AM > > To: Cody Koeninger > Cc: "user@spark.apache.org" , Tathagata Das < > t...@

Re: Uncaught exception in thread delete Spark local dirs

2015-06-27 Thread Tathagata Das
foreachRDD { rdd => > val offsets = rdd.asInstanceOf[HasOffsetRanges].offsetRanges > val documents = rdd.mapPartitionsWithIndex { (i, kafkaEvent) => > . >} > > I understand that I just need a checkpoint if I need to recover the task > it something g

Re: spark streaming with kafka reset offset

2015-06-27 Thread Tathagata Das
In the receiver-based approach, if the receiver crashes for any reason (receiver crashed or executor crashed), the receiver should get restarted on another executor and should start reading data from the offset present in the zookeeper. There is some chance of data loss, which can be alleviated using Wr
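
The Write Ahead Log mentioned here is enabled with a configuration flag plus a fault-tolerant checkpoint directory; a sketch (the batch interval and path are assumptions):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    // Sketch: enable the receiver WAL so received-but-unprocessed blocks survive failures.
    val conf = new SparkConf().set("spark.streaming.receiver.writeAheadLog.enable", "true")
    val ssc = new StreamingContext(conf, Seconds(10))
    ssc.checkpoint("hdfs:///checkpoints/my-app")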

Re: Uncaught exception in thread delete Spark local dirs

2015-06-27 Thread Tathagata Das
hen >> SPARK_CLASSPATH=$lib >> else >> SPARK_CLASSPATH=$SPARK_CLASSPATH,$lib >> fi >> done >> spark-submit --name "Metrics" >> >> I need to add all the jars as you know,, maybe it was a bad name >> SPARK_CLASSPATH >> >> Th

Re: How to recover in case user errors in streaming

2015-06-29 Thread Tathagata Das
Mon, Jun 29, 2015 at 8:44 AM, Amit Assudani wrote: > Also, how do you suggest catching exceptions while using with connector > API like, saveAsNewAPIHadoopFiles ? > > From: amit assudani > Date: Monday, June 29, 2015 at 9:55 AM > To: Tathagata Das > > Cc

Re: Checkpoint FS failure or connectivity issue

2015-06-29 Thread Tathagata Das
Yes, the observation is correct. That connectivity is assumed to be HA. On Mon, Jun 29, 2015 at 2:34 PM, Amit Assudani wrote: > Hi All, > > While using Checkpoints ( using HDFS ), if connectivity to hadoop > cluster is lost for a while and gets restored in some time, what happens to > the run

Re: Spark driver using Spark Streaming shows increasing memory/CPU usage

2015-06-30 Thread Tathagata Das
Could you give more information on the operations that you are using? The code outline? And what do you mean by "Spark Driver receiver events"? If the driver is receiving events, how are they being sent to the executors? BTW, for memory usage, I strongly recommend using jmap -histo:live to see wha

Re: Explanation of the numbers on Spark Streaming UI

2015-06-30 Thread Tathagata Das
Well, the scheduling delay is the time a batch has to wait for getting resources. So even if there is no backlog in processing and scheduling delay is 0, there is one batch that is being processed at any point of time, which explains the difference. On Tue, Jun 30, 2015 at 2:42 AM, bit1...@163.com

Re: Spark streaming on standalone cluster

2015-06-30 Thread Tathagata Das
How many receivers do you have in the streaming program? You have to have more cores reserved by your Spark application than the number of receivers. That would explain receiving output after stopping. TD On Tue, Jun 30, 2015 at 7:59 AM, Borja Garrido Bear wrote: > Hi all, > > I

Re: How to recover in case user errors in streaming

2015-06-30 Thread Tathagata Das
retry by myself .. > dstream.foreachRDD { rdd => > >try { > rdd.foreachPartition{ record => > try { > } catch { > case Exception => > } > } catch { > >} > } > > > > > ---

Re: Serialization Exception

2015-06-30 Thread Tathagata Das
I am guessing one of the two things might work. 1. Either define the pattern SPACE inside process(). 2. Mark the streamingContext field and inputStream field as transient. The problem is that a function like PairFunction needs to be serialized for being sent to the tasks. And the whole closure of th
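
A hedged Scala rendering of those two options (the thread's code is Java, but the closure-serialization issue is the same; class and field names are made up):

    // Sketch: keep the StreamingContext/DStream fields out of the serialized closure,
    // and define the pattern where it is used.
    class WordSplitter extends Serializable {
      @transient private var ssc: org.apache.spark.streaming.StreamingContext = _          // option 2
      @transient private var input: org.apache.spark.streaming.dstream.DStream[String] = _ // option 2

      def process(): Unit = {
        val SPACE = java.util.regex.Pattern.compile(" ")   // option 1: local to process()
        input.flatMap(line => SPACE.split(line)).print()
      }
    }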

Re: java.lang.IllegalArgumentException: A metric named ... already exists

2015-07-06 Thread Tathagata Das
A as > soon as I can > > Greetings, > > Juan > > 2015-06-23 21:57 GMT+02:00 Tathagata Das : > >> Aaah this could be potentially major issue as it may prevent metrics from >> restarted streaming context be not published. Can you make it a JIRA. >> >&

Re: writing to kafka using spark streaming

2015-07-06 Thread Tathagata Das
Yeah, creating a new producer at the granularity of partitions may not be that costly. On Mon, Jul 6, 2015 at 6:40 AM, Cody Koeninger wrote: > Use foreachPartition, and allocate whatever the costly resource is once > per partition. > > On Mon, Jul 6, 2015 at 6:11 AM, Shushant Arora > wrote: > >
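
A hedged sketch of that pattern with the producer created once per partition (broker address, topic, and serializer config are assumptions):

    import java.util.Properties
    import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

    // Sketch: one producer per partition per batch, instead of one per record.
    dstream.foreachRDD { rdd =>
      rdd.foreachPartition { records =>
        val props = new Properties()
        props.put("bootstrap.servers", "broker1:9092")
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
        val producer = new KafkaProducer[String, String](props)
        try {
          records.foreach(r => producer.send(new ProducerRecord[String, String]("out-topic", r.toString)))
        } finally {
          producer.close()
        }
      }
    }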

Re: writing to kafka using spark streaming

2015-07-06 Thread Tathagata Das
> > One is an operation and another is an action, but if I call an operation > afterwards (mapPartitions also), which one is more efficient and > recommended? > > On Tue, Jul 7, 2015 at 12:21 AM, Tathagata Das > wrote: > >> Yeah, creating a new producer at the granularity of p

Re: Are Spark Streaming RDDs always processed in order?

2015-07-06 Thread Tathagata Das
zia wrote: > I had a similar inquiry, copied below. > > I was also looking into making an SQS Receiver reliable: > > http://stackoverflow.com/questions/30809975/reliable-sqs-receiver-for-spark-streaming > > Hope this helps. > > -- Forwarded message -- &

Re: How to recover in case user errors in streaming

2015-07-06 Thread Tathagata Das
, Amit Assudani wrote: > Hi TD, > > Why don’t we have OnBatchError or similar method in StreamingListener ? > > Also, is StreamingListener only for receiver based approach or does it > work for Kafka Direct API / File Based Streaming as well ? > > Regards, > Amit > &

Re: Regarding master node failure

2015-07-07 Thread Tathagata Das
This talk may help - https://spark-summit.org/2015/events/recipes-for-running-spark-streaming-applications-in-production/ On Tue, Jul 7, 2015 at 9:51 AM, swetha wrote: > Hi, > > What happens if the master node fails in the case of Spark Streaming? Would > the data be lost? > > Thanks, > Swetha >

Re: Spark Kafka Direct Streaming

2015-07-07 Thread Tathagata Das
When you enable checkpointing by setting the checkpoint directory, you enable metadata checkpointing. Data checkpointing kicks in only if you are using a DStream operation that requires it, or you are enabling Write Ahead Logs to prevent data loss on driver failure. More discussion - https://spark
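
A minimal sketch of enabling metadata checkpointing with driver-failure recovery (the app name, batch interval, and directory are assumptions):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    // Sketch: recover from checkpointed metadata if present, otherwise build a new context.
    def createContext(): StreamingContext = {
      val conf = new SparkConf().setAppName("kafka-direct-app")
      val ssc = new StreamingContext(conf, Seconds(10))
      ssc.checkpoint("hdfs:///checkpoints/kafka-direct-app")
      // ... create the direct Kafka stream and wire up output operations here ...
      ssc
    }

    val ssc = StreamingContext.getOrCreate("hdfs:///checkpoints/kafka-direct-app", createContext _)
    ssc.start()
    ssc.awaitTermination()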

Re: Streaming checkpoints and logic change

2015-07-08 Thread Tathagata Das
You can use DStream.transform for some stuff. Transform takes an RDD => RDD function that allows arbitrary RDD operations to be done on the RDDs of a DStream. This function gets evaluated on the driver on every batch interval. If you are smart about writing the function, it can do different stuff at diff
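
A small sketch of that, assuming `lines` is a DStream[String] and the blacklist is refreshed elsewhere on the driver (the filter logic is a made-up example):

    // Sketch: the RDD => RDD function runs on the driver every batch interval,
    // so it can pick up state that changes between batches.
    var blacklist: Set[String] = Set.empty   // updated elsewhere by driver-side code

    val cleaned = lines.transform { rdd =>
      val snapshot = blacklist               // captured fresh for this batch
      rdd.filter(word => !snapshot.contains(word))
    }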

Re: Streaming checkpoints and logic change

2015-07-08 Thread Tathagata Das
read from > ZooKeeper when creating a directStream. I've settled down to this approach > for now, but I want to know how makers of Spark Streaming think about this > drawback of checkpointing. > > If anyone had similar experience, suggestions will be appreciated. > &

Re: spark core/streaming doubts

2015-07-08 Thread Tathagata Das
Responses inline. On Wed, Jul 8, 2015 at 10:26 AM, Shushant Arora wrote: > 1.Does creation of read only singleton object in each map function is same > as broadcast object as singleton never gets garbage collected unless > executor gets shutdown ? Aim is to avoid creation of complex object at ea

Re: Problem in Understanding concept of Physical Cores

2015-07-08 Thread Tathagata Das
There are several levels of indirection going on here, let me clarify. In local mode, Spark runs tasks (which include receivers) using the number of threads defined in the master (either local, or local[2], or local[*]). local or local[1] = single thread, so only one task at a time; local[2] =
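
For instance, a receiver-based program run with master "local" would spend its single thread on the receiver and never process data, whereas local[2] leaves one thread for tasks; a sketch:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    // Sketch: at least 2 local threads -- one is permanently occupied by the socket receiver.
    val conf = new SparkConf().setMaster("local[2]").setAppName("local-threads-demo")
    val ssc = new StreamingContext(conf, Seconds(1))
    ssc.socketTextStream("localhost", 9999).print()
    ssc.start()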

Re: foreachRDD vs. forearchPartition ?

2015-07-08 Thread Tathagata Das
This is also discussed in the programming guide. http://spark.apache.org/docs/latest/streaming-programming-guide.html#design-patterns-for-using-foreachrdd On Wed, Jul 8, 2015 at 8:25 AM, Dmitry Goldenberg wrote: > Thanks, Sean. > > "are you asking about foreach vs foreachPartition? that's quite

Re: pause and resume streaming app

2015-07-08 Thread Tathagata Das
Currently the only way to pause it is to stop it. The way I would do this is to use the Direct Kafka API to access the Kafka offsets, and save them to a data store as batches finish. If you see a batch job failing because downstream is down, stop the context. When it comes back up, start a new streami

Re: spark streaming performance

2015-07-09 Thread Tathagata Das
What was the number of cores in the executor? It could be that you had only one core in the executor, which did all 50 tasks serially, so 50 tasks x 15 ms = ~1 second. Could you take a look at the task details in the stage page to see when the tasks were added, and whether that explains the 5 se

Re: Breaking lineage and reducing stages in Spark Streaming

2015-07-09 Thread Tathagata Das
If you are continuously unioning RDDs, then you are accumulating ever-increasing data, and you are processing an ever-increasing amount of data in every batch. Obviously this is not going to last for very long. You fundamentally cannot keep processing an ever-increasing amount of data with finite resourc

Re: Spark Streaming Hangs on Start

2015-07-09 Thread Tathagata Das
Do you have enough cores in the configured number of executors in YARN? On Thu, Jul 9, 2015 at 2:29 AM, Bin Wang wrote: > I'm using spark streaming with Kafka, and submit it to YARN cluster with > mode "yarn-cluster". But it hangs at SparkContext.start(). The Kafka config > is right since it c

Re: Problem in Understanding concept of Physical Cores

2015-07-09 Thread Tathagata Das
> threads are created , they are still to be executed by same physical core > so kindly elaborate what is extra processing in extra thread in this case. > > Thanks and Regards > Aniruddh > > On Thu, Jul 9, 2015 at 4:43 AM, Tathagata Das wrote: > >> There are sever

Re: Spark Streaming Hangs on Start

2015-07-09 Thread Tathagata Das
1. There will be a long-running job with description "start()" as that is the job that is running the receivers. It will never end. 2. You need to set the number of cores given to the Spark executors by the YARN container. That is SparkConf spark.executor.cores, or --executor-cores in spark-submit.

Re: Does spark guarantee that the same task will process the same key over time?

2015-07-09 Thread Tathagata Das
That's a good question. You have to: 1. Ensure that the stats are fault-tolerant, so that if the executor dies or the task gets relaunched you don't lose prior state. This is a fundamental challenge for all streaming systems, and it's pretty hard to do without having some persistence story outside the

Re: Breaking lineage and reducing stages in Spark Streaming

2015-07-09 Thread Tathagata Das
duct/0636920033073.do> (O'Reilly) >>> Typesafe <http://typesafe.com> >>> @deanwampler <http://twitter.com/deanwampler> >>> http://polyglotprogramming.com >>> >>> On Thu, Jul 9, 2015 at 5:56 AM, Anand Nalya >>> wrote: >>>

Re: spark streaming performance

2015-07-09 Thread Tathagata Das
I am not sure why you are getting node_local and not process_local. Also, there is probably no good documentation other than that configuration page - http://spark.apache.org/docs/latest/configuration.html (search for locality). On Thu, Jul 9, 2015 at 5:51 AM, Michel Hubert wrote: > > > > > > >

Re: reduceByKeyAndWindow with initial state

2015-07-10 Thread Tathagata Das
Are you talking about reduceByKeyAndWindow with or without inverse reduce? TD On Fri, Jul 10, 2015 at 2:07 AM, Imran Alam wrote: > We have a streaming job that makes use of reduceByKeyAndWindow >
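
For reference, the two variants look roughly like this, assuming `pairs` is a DStream[(String, Int)] (window and slide durations are assumptions):

    import org.apache.spark.streaming.{Minutes, Seconds}

    // Without inverse reduce: recomputes the full window on every slide.
    val windowed = pairs.reduceByKeyAndWindow((a: Int, b: Int) => a + b, Minutes(10), Seconds(10))

    // With inverse reduce: incrementally adds the new slide and subtracts the old one;
    // this variant requires checkpointing to be enabled on the StreamingContext.
    val windowedInv = pairs.reduceByKeyAndWindow(_ + _, _ - _, Minutes(10), Seconds(10))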

Re: Debug Spark Streaming in PyCharm

2015-07-10 Thread Tathagata Das
spark-submit does a lot of magic configurations (classpaths etc) underneath the covers to enable pyspark to find Spark JARs, etc. I am not sure how you can start running things directly from the PyCharm IDE. Others in the community may be able to answer. For now the main way to run pyspark stuff is

Re: Problems after upgrading to spark 1.4.0

2015-07-13 Thread Tathagata Das
Spark 1.4.0 added shutdown hooks in the driver to cleanly shut down the SparkContext in the driver, which would shut down the executors. I am not sure whether this is related or not, but somehow the executor's shutdown hook is being called. Can you check the driver logs to see if the driver's shutdown ho
