Iterative Programming by keeping data across micro-batches in spark-streaming?

2015-06-17 Thread Nipun Arora
Hi, Is there any way in Spark Streaming to keep data across multiple micro-batches? Like in a HashMap or something? Can anyone make suggestions on how to keep data across iterations, where each iteration is an RDD being processed in JavaDStream? This is especially the case when I am trying to updat
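
The reply later in this thread points to the updateStateByKey API. Below is a minimal sketch of that approach in the Spark 1.x Java API, assuming a JavaPairDStream<String, Integer> named pairs and an existing streaming context jssc; the running-count logic is only an illustration, not the original poster's code.

```java
import com.google.common.base.Optional;   // Spark 1.x Java API uses Guava's Optional
import java.util.List;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.streaming.api.java.JavaPairDStream;

// State kept with updateStateByKey must be checkpointed.
// jssc.checkpoint("/tmp/spark-checkpoint");

JavaPairDStream<String, Integer> counts = pairs.updateStateByKey(
    new Function2<List<Integer>, Optional<Integer>, Optional<Integer>>() {
      @Override
      public Optional<Integer> call(List<Integer> newValues, Optional<Integer> state) {
        int sum = state.or(0);           // previous state, or 0 on the first batch
        for (Integer v : newValues) {
          sum += v;
        }
        return Optional.of(sum);         // becomes the state seen by the next micro-batch
      }
    });
```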

Re: Iterative Programming by keeping data across micro-batches in spark-streaming?

2015-06-17 Thread Nipun Arora
's. Any suggestions? Thanks Nipun On Wed, Jun 17, 2015 at 10:52 PM, Silvio Fiorito < silvio.fior...@granturing.com> wrote: > Hi, just answered in your other thread as well... > > Depending on your requirements, you can look at the updateStateByKey API > > From: Nipun A

[Spark Streaming] Iterative programming on an ordered spark stream using Java?

2015-06-18 Thread Nipun Arora
rtain events etc which will impact operations in future iterations? I would like to keep some accumulated history to make calculations.. not the entire dataset, but persist certain events which can be used in future DStream RDDs? Thanks Nipun On Wed, Jun 17, 2015 at 11:15 PM, Nipun Arora wrot

Re: [Spark Streaming] Iterative programming on an ordered spark stream using Java?

2015-06-18 Thread Nipun Arora
> this, that will be great. > > Regarding not keeping the whole dataset in memory, you can tweak the parameter > of remember, such that it does checkpoint at appropriate time. > > Thanks > Twinkle > > On Thursday, June 18, 2015, Nipun Arora wrote: > >> Hi All, >>

[Spark Streaming] Runtime Error in call to max function for JavaPairRDD

2015-06-18 Thread Nipun Arora
Hi, I have the following piece of code, where I am trying to transform a spark stream and add min and max to it of eachRDD. However, I get an error saying max call does not exist, at run-time (compiles properly). I am using spark-1.4 I have added the question to stackoverflow as well: http://stac

Re: [Spark Streaming] Iterative programming on an ordered spark stream using Java?

2015-06-18 Thread Nipun Arora
@Twinkle - what did you mean by "Regarding not keeping whole dataset in memory, you can tweak the parameter of remember, such that it does checkpoint at appropriate time"? On Thu, Jun 18, 2015 at 11:40 AM, Nipun Arora wrote: > Hi All, > > I appreciate the help :) > > He

Re: [Spark Streaming] Runtime Error in call to max function for JavaPairRDD

2015-06-18 Thread Nipun Arora
th so you do not have to bundle it. > > On Thu, Jun 18, 2015 at 8:44 AM, Nipun Arora > wrote: > >> Hi, >> >> I have the following piece of code, where I am trying to transform a >> spark stream and add min and max to it of eachRDD. However, I get an erro

Re: [Spark Streaming] Runtime Error in call to max function for JavaPairRDD

2015-06-22 Thread Nipun Arora
ache.spark spark-core_2.10 1.4.0 org.apache.spark spark-streaming_2.10 1.4.0 Thanks Nipun On Thu, Jun 18, 2015 at 11:16 PM, Nipun Arora wrote: > Hi Tathagata, > > When you say please mark spark-core and spark-streaming as depend

Re: [Spark Streaming] Runtime Error in call to max function for JavaPairRDD

2015-06-23 Thread Nipun Arora
m.html#Dependency_Scope > > On Mon, Jun 22, 2015 at 10:28 AM, Nipun Arora > wrote: > >> Hi Tathagata, >> >> I am attaching a snapshot of my pom.xml. It would help immensely, if I >> can include max, and min values in my mapper phase. >> >> The question
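
For reference, a sketch of what the suggested pom.xml change might look like, using the 1.4.0 / Scala 2.10 coordinates quoted earlier in the thread; the layout of the original pom is not shown in the archive, so this is an assumption.

```xml
<!-- Mark the Spark artifacts as "provided": they are compiled against but not
     bundled into the application jar, since the cluster supplies them at runtime. -->
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-core_2.10</artifactId>
  <version>1.4.0</version>
  <scope>provided</scope>
</dependency>
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-streaming_2.10</artifactId>
  <version>1.4.0</version>
  <scope>provided</scope>
</dependency>
```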

[Spark Streaming] Null Pointer Exception when accessing broadcast variable to store a hashmap in Java

2015-06-23 Thread Nipun Arora
Hi, I have a spark streaming application where I need to access a model saved in a HashMap. I have *no problems in running the same code with broadcast variables in the local installation.* However I get a *null pointer* *exception* when I deploy it on my spark test cluster. I have stored a mode

Re: [Spark Streaming] Null Pointer Exception when accessing broadcast variable to store a hashmap in Java

2015-06-23 Thread Nipun Arora
btw. just for reference I have added the code in a gist: https://gist.github.com/nipunarora/ed987e45028250248edc and a stackoverflow reference here: http://stackoverflow.com/questions/31006490/broadcast-variable-null-pointer-exception-in-spark-streaming On Tue, Jun 23, 2015 at 11:01 AM, Nipun

Re: [Spark Streaming] Null Pointer Exception when accessing broadcast variable to store a hashmap in Java

2015-06-23 Thread Nipun Arora
> I had a similar issue when recreating a streaming context from checkpoint > as broadcast variables are not checkpointed. > On 23 Jun 2015 5:01 pm, "Nipun Arora" wrote: > >> Hi, >> >> I have a spark streaming application where I need to access a model

Re: [Spark Streaming] Null Pointer Exception when accessing broadcast variable to store a hashmap in Java

2015-06-23 Thread Nipun Arora
I found the error so just posting on the list. It seems broadcast variables cannot be declared static. If you do you get a null pointer exception. Thanks Nipun On Tue, Jun 23, 2015 at 11:08 AM, Nipun Arora wrote: > btw. just for reference I have added the code in a gist: > &
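
A minimal sketch of the working pattern described in this message: the broadcast is created as a local (non-static) reference on the driver and captured by the closure. The model map, loadModel() and extractKey() helpers are hypothetical stand-ins for the original code, jssc and lines are assumed to be the application's streaming context and input stream, and the Spark 1.4-era foreachRDD signature is used.

```java
import java.util.Map;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.api.java.function.VoidFunction;
import org.apache.spark.broadcast.Broadcast;

// Driver side: load the model and broadcast it through a non-static reference.
Map<String, Double> model = loadModel();                          // hypothetical helper
final Broadcast<Map<String, Double>> bModel = jssc.sparkContext().broadcast(model);

lines.foreachRDD(new Function<JavaRDD<String>, Void>() {
  @Override
  public Void call(JavaRDD<String> rdd) {
    rdd.foreach(new VoidFunction<String>() {
      @Override
      public void call(String line) {
        // Executor side: only read the broadcast value; never reassign it here.
        Double score = bModel.value().get(extractKey(line));      // hypothetical helper
      }
    });
    return null;
  }
});
```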

[Spark Streaming] Spark Streaming dropping last lines

2016-02-08 Thread Nipun Arora
I have a spark-streaming service, where I am processing and detecting anomalies on the basis of some offline generated model. I feed data into this service from a log file, which is streamed using the following command tail -f | nc -lk Here the spark streaming service is taking data from por

Re: [Spark Streaming] Spark Streaming dropping last lines

2016-02-10 Thread Nipun Arora
's not getting any data, i.e. the file end has probably been reached by printing the first two lines of every micro-batch. Thanks Nipun On Mon, Feb 8, 2016 at 10:05 PM Nipun Arora wrote: > I have a spark-streaming service, where I am processing and detecting > anomalies on the basis

Kafka + Spark 1.3 Integration

2016-02-10 Thread Nipun Arora
Hi, I am trying some basic integration and was going through the manual. I would like to read from a topic, and get a JavaReceiverInputDStream for messages in that topic. However the example is of JavaPairReceiverInputDStream<>. How do I get a stream for only a single topic in Java? Reference P

Re: Kafka + Spark 1.3 Integration

2016-02-11 Thread Nipun Arora
examples/src/main/java/org/apache/spark/examples/streaming/JavaKafkaWordCount.java > > > On Wed, Feb 10, 2016 at 1:28 PM, Nipun Arora > wrote: > >> Hi, >> >> I am trying some basic integration and was going through the manual. >> >> I would like to read fr
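
The JavaKafkaWordCount example linked above maps the pair stream down to its values. A rough sketch of that pattern against the Spark 1.3 receiver-based API follows; the ZooKeeper address, group id, topic name and thread count are placeholders.

```java
import java.util.HashMap;
import java.util.Map;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaPairReceiverInputDStream;
import org.apache.spark.streaming.kafka.KafkaUtils;
import scala.Tuple2;

Map<String, Integer> topicMap = new HashMap<String, Integer>();
topicMap.put("mytopic", 1);    // topic name and receiver thread count are placeholders

JavaPairReceiverInputDStream<String, String> messages =
    KafkaUtils.createStream(jssc, "zkhost:2181", "my-consumer-group", topicMap);

// Keep only the message payload; the Kafka key is dropped.
JavaDStream<String> lines = messages.map(new Function<Tuple2<String, String>, String>() {
  @Override
  public String call(Tuple2<String, String> tuple) {
    return tuple._2();
  }
});
```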

[Spark Streaming] Design Patterns forEachRDD

2015-10-21 Thread Nipun Arora
Hi All, Can anyone provide a design pattern for the following code shown in the Spark User Manual, in JAVA ? I have the same exact use-case, and for some reason the design pattern for Java is missing. Scala version taken from : http://spark.apache.org/docs/latest/streaming-programming-guide.html

[SPARK STREAMING] polling based operation instead of event based operation

2015-10-22 Thread Nipun Arora
Hi, In general in Spark Streaming one can do transformations (filter, map, etc.) or output operations (collect, forEach, etc.) in an event-driven paradigm... i.e. the action happens only if a message is received. Is it possible to do actions every few seconds in a polling-based fashion, regardless if a

Re: [Spark Streaming] Design Patterns forEachRDD

2015-10-22 Thread Nipun Arora
} > }); > > > On 21-Oct-2015, at 10:55 AM, Nipun Arora wrote: > > Hi All, > > Can anyone provide a design pattern for the following code shown in the > Spark User Manual, in JAVA ? I have the same exact use-case, and for some > reason the design pattern for Java is
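
For readers landing on this thread, a Java rendering of the Scala connection-reuse pattern from the programming guide might look roughly like the sketch below; ConnectionPool and Connection are hypothetical placeholders for whatever sink the application uses, the record type String is assumed, and the Spark 1.x-era foreachRDD signature is used.

```java
import java.util.Iterator;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.api.java.function.VoidFunction;

dstream.foreachRDD(new Function<JavaRDD<String>, Void>() {
  @Override
  public Void call(JavaRDD<String> rdd) {
    rdd.foreachPartition(new VoidFunction<Iterator<String>>() {
      @Override
      public void call(Iterator<String> partitionOfRecords) {
        // Acquire the connection on the executor, once per partition, not per record.
        Connection connection = ConnectionPool.getConnection();
        while (partitionOfRecords.hasNext()) {
          connection.send(partitionOfRecords.next());
        }
        ConnectionPool.returnConnection(connection);   // reuse across batches
      }
    });
    return null;
  }
});
```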

Re: [SPARK STREAMING] polling based operation instead of event based operation

2015-10-23 Thread Nipun Arora
egards, > > Lars Albertsson > > > > On Thu, Oct 22, 2015 at 10:48 PM, Nipun Arora > wrote: > > Hi, > > In general in spark stream one can do transformations ( filter, map > etc.) or > > output operations (collect, forEach) etc. in an event-driven paradigm.

[SPARK STREAMING] Concurrent operations in spark streaming

2015-10-24 Thread Nipun Arora
I wanted to understand something about the internals of spark streaming executions. If I have a stream X, and in my program I send stream X to function A and function B: 1. In function A, I do a few transform/filter operations etc. on X->Y->Z to create stream Z. Now I do a forEach Operation on Z

Re: [SPARK STREAMING] Concurrent operations in spark streaming

2015-10-25 Thread Nipun Arora
ever gets submitted to Spark first gets executed first - you can use a > semaphore if you need to ensure the ordering of execution, though I would > assume that the ordering wouldn't matter. > > --- > Regards, > Andy > > On Sat, Oct 24, 2015 at 10:08 PM, Nipun Arora

[SPARK STREAMING ] Sending data to ElasticSearch

2015-10-29 Thread Nipun Arora
Hi, I am sending data to an elasticsearch deployment. The printing to file seems to work fine, but I keep getting no-node found for ES when I send data to it. I suspect there is some special way to handle the connection object? Can anyone explain what should be changed here? Thanks Nipun The fol

[SPARK STREAMING] Questions regarding foreachPartition

2015-11-16 Thread Nipun Arora
Hi, I wanted to understand forEachPartition logic. In the code below, I am assuming the iterator is executing in a distributed fashion. 1. Assuming I have a stream which has timestamp data which is sorted. Will the stringiterator in foreachPartition process each line in order? 2. Assuming I have

Re: [SPARK STREAMING] Questions regarding foreachPartition

2015-11-17 Thread Nipun Arora
ordering. > > You typically want to acquire resources inside the foreachpartition > closure, just before handling the iterator. > > > http://spark.apache.org/docs/latest/streaming-programming-guide.html#design-patterns-for-using-foreachrdd > > On Mon, Nov 16, 2015 at 4:02 P

SparkStreaming add Max Line of each RDD throwing an exception

2017-01-18 Thread Nipun Arora
Please note: I have asked the following question in stackoverflow as well http://stackoverflow.com/questions/41729451/adding-to-spark-streaming-dstream-rdd-the-max-line-of-each-rdd I am trying to add to each RDD in a JavaDStream the line with the maximum timestamp, with some modification. However,
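
One possible sketch of the max-line idea, assuming the Spark 1.x Java API and that each log line begins with an epoch-millisecond timestamp token; duplicating the maximum record through flatMap avoids creating a new RDD inside transform (the issue raised in the follow-up thread below). This is an illustration, not the original poster's code.

```java
import java.io.Serializable;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.streaming.api.java.JavaDStream;

// The comparator must be serializable because it is shipped to the executors.
class TimestampComparator implements Comparator<String>, Serializable {
  @Override
  public int compare(String a, String b) {
    return Long.compare(Long.parseLong(a.split(" ")[0]),
                        Long.parseLong(b.split(" ")[0]));
  }
}

JavaDStream<String> withMax = lines.transform(
    new Function<JavaRDD<String>, JavaRDD<String>>() {
      @Override
      public JavaRDD<String> call(JavaRDD<String> rdd) {
        if (rdd.isEmpty()) {
          return rdd;                                   // nothing to tag in an empty batch
        }
        final String maxLine = rdd.max(new TimestampComparator());
        // Emit every line unchanged, plus a modified copy of the max-timestamp line.
        return rdd.flatMap(new FlatMapFunction<String, String>() {
          @Override
          public Iterable<String> call(String line) {
            return line.equals(maxLine)
                ? Arrays.asList(line, "MODIFIED " + line)
                : Collections.singletonList(line);
          }
        });
      }
    });
```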

[SparkStreaming] SparkStreaming not allowing to do parallelize within a transform operation to generate a new RDD

2017-01-18 Thread Nipun Arora
Hi, I am trying to transform an RDD in a dstream by adding changing the log with the maximum timestamp, and adding a duplicate copy of it with some modifications. The following is the example code: JavaDStream logMessageWithHB = logMessageMatched.transform(new Function, JavaRDD>() { @Overrid

[SparkStreaming] SparkStreaming not allowing to do parallelize within a transform operation to generate a new RDD

2017-01-19 Thread Nipun Arora
help. Thanks Nipun -- Forwarded message ----- From: Nipun Arora Date: Wed, Jan 18, 2017 at 5:25 PM Subject: [SparkStreaming] SparkStreaming not allowing to do parallelize within a transform operation to generate a new RDD To: user Hi, I am trying to transform an RDD in a dstream by add

Resource Leak in Spark Streaming

2017-01-31 Thread Nipun Arora
Hi, I get a resource leak, where the number of file descriptors in spark streaming keeps increasing. We end up with a "too many open files" error eventually, through an exception caused in: JAVARDDKafkaWriter, which is writing a Spark JavaDStream. The exception is attached inline. Any help will be

Re: Resource Leak in Spark Streaming

2017-01-31 Thread Nipun Arora
It is spark 1.6 Thanks Nipun On Tue, Jan 31, 2017 at 1:45 PM Shixiong(Ryan) Zhu wrote: > Could you provide your Spark version please? > > On Tue, Jan 31, 2017 at 10:37 AM, Nipun Arora > wrote: > > Hi, > > I get a resource leak, where the number of file descriptors in

Re: Resource Leak in Spark Streaming

2017-01-31 Thread Nipun Arora
so include the patch version, such as 1.6.0, 1.6.1. Could you > also post the JAVARDDKafkaWriter codes. It's also possible that it leaks > resources. > > On Tue, Jan 31, 2017 at 2:12 PM, Nipun Arora > wrote: > > It is spark 1.6 > > Thanks > Nipun > > On Tu

Re: Resource Leak in Spark Streaming

2017-01-31 Thread Nipun Arora
ucers. But nobody closes these KafkaProducers. > > On Tue, Jan 31, 2017 at 3:02 PM, Nipun Arora > wrote: > > > Sorry for not writing the patch number, it's spark 1.6.1. > The relevant code is here inline. > > Please have a look and let me know if there is a resource lea

Re: Resource Leak in Spark Streaming

2017-01-31 Thread Nipun Arora
Just to be clear the pool object creation happens in the driver code, and not in any anonymous function which should be executed in the executor. On Tue, Jan 31, 2017 at 10:21 PM Nipun Arora wrote: > Thanks for the suggestion Ryan, I will convert it to singleton and see if > it solv

Re: Resource Leak in Spark Streaming

2017-02-07 Thread Nipun Arora
Ryan, Apologies for coming back so late, I created a github repo to resolve this problem. On trying your solution for making the pool a Singleton, I get a null pointer exception in the worker. Do you have any other suggestions, or a simpler mechanism for handling this? I have put all the current

Re: Resource Leak in Spark Streaming

2017-02-07 Thread Nipun Arora
go around this problem? Thanks, Nipun On Tue, Feb 7, 2017 at 6:35 PM, Nipun Arora wrote: > Ryan, > > Apologies for coming back so late, I created a github repo to resolve > this problem. On trying your solution for making the pool a Singleton, > I get a null pointer exception i
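
For anyone following along, a sketch of the lazily initialized per-JVM producer being discussed: one KafkaProducer per executor, created on first use inside the executor JVM rather than on the driver, so it never needs to be serialized and is no longer created (and leaked) per write. The broker address and serializer settings are placeholders, and the original JavaRDDKafkaWriter code is not reproduced here.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;

public class ProducerHolder {
  private static KafkaProducer<String, String> producer = null;

  // Lazily create a single producer per executor JVM and close it exactly once,
  // when that JVM shuts down.
  public static synchronized KafkaProducer<String, String> get() {
    if (producer == null) {
      Properties props = new Properties();
      props.put("bootstrap.servers", "kafka-host:9092");   // placeholder broker address
      props.put("key.serializer",
          "org.apache.kafka.common.serialization.StringSerializer");
      props.put("value.serializer",
          "org.apache.kafka.common.serialization.StringSerializer");
      producer = new KafkaProducer<String, String>(props);
      Runtime.getRuntime().addShutdownHook(new Thread() {
        @Override
        public void run() {
          producer.close();
        }
      });
    }
    return producer;
  }
}
```

Executor code would then call something like ProducerHolder.get().send(new ProducerRecord<String, String>("mytopic", message)) inside foreachPartition, reusing the single producer across batches.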

[Spark Streaming] Dynamic Broadcast Variable Update

2017-05-02 Thread Nipun Arora
Hi All, To support our Spark Streaming based anomaly detection tool, we have made a patch in Spark 1.6.2 to dynamically update broadcast variables. I'll first explain our use-case, which I believe should be common to several people using Spark Streaming applications. Broadcast variables are often

Restful API Spark Application

2017-05-12 Thread Nipun Arora
Hi, We have written a Java Spark application (it primarily uses Spark SQL). We want to expand this to provide our application "as a service". For this, we are trying to write a REST API. A simple REST API can easily be made, and I can get Spark to run through the launcher; I wonder, though, how the spa

Re: Restful API Spark Application

2017-05-15 Thread Nipun Arora
> > > > https://github.com/spark-jobserver/spark-jobserver > > > > Regards > > Sam > > On Fri, 12 May 2017 at 21:00, Nipun Arora > wrote: > >> > >> Hi, > >> > >> We have written a java spark application (primarily uses spar

Spark Launch programatically - Basics!

2017-05-17 Thread Nipun Arora
Hi, I am trying to get a simple Spark application to run programmatically. I looked at http://spark.apache.org/docs/2.1.0/api/java/index.html?org/apache/spark/launcher/package-summary.html, at the following code. public class MyLauncher { public static void main(String[] args) throws Excep
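
For reference, the example on the linked launcher page is roughly the following; the jar path, main class and master are of course placeholders.

```java
import org.apache.spark.launcher.SparkLauncher;

public class MyLauncher {
  public static void main(String[] args) throws Exception {
    // Launch spark-submit as a child process and wait for it to finish.
    Process spark = new SparkLauncher()
        .setAppResource("/my/app.jar")                      // placeholder application jar
        .setMainClass("my.spark.app.Main")                  // placeholder main class
        .setMaster("local")
        .setConf(SparkLauncher.DRIVER_MEMORY, "2g")
        .launch();
    spark.waitFor();
  }
}
```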

SparkAppHandle - get Input and output streams

2017-05-18 Thread Nipun Arora
Hi, I wanted to know how to get the input and output streams from SparkAppHandle? I start the application like the following: SparkAppHandle sparkAppHandle = sparkLauncher.startApplication(); I have used the following previously to capture the InputStream from the error and output streams, but I would l
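
As far as I can tell, SparkAppHandle does not itself expose the child process's stdout/stderr, so a sketch of two options is below (assuming the Spark 1.6+ launcher API; paths and class names are placeholders).

```java
import java.io.File;
import org.apache.spark.launcher.SparkAppHandle;
import org.apache.spark.launcher.SparkLauncher;

// Option 1: redirect the child process's output before starting the application.
SparkLauncher launcher = new SparkLauncher()
    .setAppResource("/my/app.jar")                   // placeholder application jar
    .setMainClass("my.spark.app.Main")               // placeholder main class
    .setMaster("yarn")
    .redirectOutput(new File("/tmp/spark-app.out"))  // stdout goes to a file
    .redirectError(new File("/tmp/spark-app.err"));  // stderr goes to a file

SparkAppHandle handle = launcher.startApplication();

// Option 2: use launcher.launch() instead, which returns a plain java.lang.Process
// whose getInputStream()/getErrorStream() can be read directly, as with ProcessBuilder.
```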

[Spark Streaming] DAG Execution Model Clarification

2017-05-26 Thread Nipun Arora
Hi, I would like some clarification on the execution model for spark streaming. Broadly, I am trying to understand if output operations in a DAG are only processed after all intermediate operations are finished for all parts of the DAG. Let me give an example: I have a DStream A, and I do map ope

[Spark Streaming] DAG Output Processing mechanism

2017-05-28 Thread Nipun Arora
I would like some clarification on the execution model for spark streaming. Broadly, I am trying to understand if output operations in a DAG are only processed after all intermediate operations are finished for all parts of the DAG. Let me give an example: I have a d

Re: [Spark Streaming] DAG Output Processing mechanism

2017-05-28 Thread Nipun Arora
ill kafka output 1 and kafka output 2 wait for all operations to finish on B and C before sending an output, or will they simply send an output as soon as the ops in B and C are done. What kind of synchronization guarantees are there? On Sun, May 28, 2017 at 9:59 AM, Nipun Arora wrote: > up

Re: [Spark Streaming] DAG Output Processing mechanism

2017-05-29 Thread Nipun Arora
Sending out the message again... Hopefully someone can clarify :) I would like some clarification on the execution model for spark streaming. Broadly, I am trying to understand if output operations in a DAG are only processed after all intermediate operations are finished for all parts of the DAG

Kafka + Spark Streaming consumer API offsets

2017-06-05 Thread Nipun Arora
I need some clarification for Kafka consumers in Spark or otherwise. I have the following Kafka Consumer. The consumer is reading from a topic, and I have a mechanism which blocks the consumer from time to time. The producer is a separate thread which is continuously sending data. I want to ensure
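
A sketch of one way to get that guarantee with the plain Kafka consumer API (not Spark-specific): disable auto-commit and commit offsets only after records are actually processed, so a consumer that is paused or blocked resumes from the last processed position rather than from offsets auto-committed while it was idle. Broker address, group id, topic name and the process() hook are placeholders.

```java
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

Properties props = new Properties();
props.put("bootstrap.servers", "kafka-host:9092");        // placeholder broker address
props.put("group.id", "my-consumer-group");               // placeholder group id
props.put("enable.auto.commit", "false");                 // commit offsets manually
props.put("key.deserializer",
    "org.apache.kafka.common.serialization.StringDeserializer");
props.put("value.deserializer",
    "org.apache.kafka.common.serialization.StringDeserializer");

KafkaConsumer<String, String> consumer = new KafkaConsumer<String, String>(props);
consumer.subscribe(Collections.singletonList("mytopic")); // placeholder topic

while (true) {
  ConsumerRecords<String, String> records = consumer.poll(100);
  for (ConsumerRecord<String, String> record : records) {
    process(record.value());        // hypothetical processing hook
  }
  consumer.commitSync();            // commit only after processing succeeds
}
```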

[Spark Streaming] - ERROR Error cleaning broadcast Exception

2017-07-11 Thread Nipun Arora
Hi All, I get the following error while running my spark streaming application, we have a large application running multiple stateful (with mapWithState) and stateless operations. It's getting difficult to isolate the error since spark itself hangs and the only error we see is in the spark log and