Re: Spark Streaming + Kafka failure recovery

2015-05-21 Thread Bill Jay
/latest/api/scala/org/apache/spark/streaming/StreamingContext.html>. > The information on consumed offset can be recovered from the checkpoint. > > On Tue, May 19, 2015 at 2:38 PM, Bill Jay > wrote: > >> If a Spark streaming job stops at 12:01 and I resume the job at 12:02. >
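
A minimal Scala sketch of recovering a streaming context (and, with the direct approach, the consumed offsets) from a checkpoint via StreamingContext.getOrCreate; the checkpoint path, app name, and batch interval are placeholders, and the Kafka stream plus the rest of the DStream graph must be defined inside the factory function so they can be rebuilt from the checkpoint:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    def createContext(): StreamingContext = {
      val conf = new SparkConf().setAppName("kafka-recovery")
      val ssc = new StreamingContext(conf, Seconds(60))
      ssc.checkpoint("hdfs:///checkpoints/kafka-recovery")
      // build the Kafka direct stream and the remaining DStream operations here
      ssc
    }

    // returns a fresh context on first run, or one restored from the checkpoint
    val ssc = StreamingContext.getOrCreate("hdfs:///checkpoints/kafka-recovery", createContext _)
    ssc.start()
    ssc.awaitTermination()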

Re: Spark Streaming + Kafka failure recovery

2015-05-19 Thread Bill Jay
ade code. > > On Tue, May 19, 2015 at 12:42 PM, Bill Jay > wrote: > >> Hi all, >> >> I am currently using Spark streaming to consume and save logs every hour >> in our production pipeline. The current setting is to run a crontab job to >> check every minute

Spark Streaming + Kafka failure recovery

2015-05-19 Thread Bill Jay
Hi all, I am currently using Spark streaming to consume and save logs every hour in our production pipeline. The current setting is to run a crontab job to check every minute whether the job is still there and if not resubmit a Spark streaming job. I am currently using the direct approach for Kafk

Partition number of Spark Streaming Kafka receiver-based approach

2015-05-18 Thread Bill Jay
Hi all, I am reading the docs of receiver-based Kafka consumer. The last parameters of KafkaUtils.createStream is per topic number of Kafka partitions to consume. My question is, does the number of partitions for topic in this parameter need to match the number of partitions in Kafka. For example
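
For reference, a minimal sketch of the receiver-based call in question; the ZooKeeper quorum, group id, topic name, and thread count are placeholders, and ssc is an existing StreamingContext. The Int in the topic map is the number of consumer threads used inside the receiver, not the number of Kafka partitions of the topic:

    import org.apache.spark.streaming.kafka.KafkaUtils

    val kafkaStream = KafkaUtils.createStream(
      ssc,                      // existing StreamingContext
      "zk1:2181,zk2:2181",      // ZooKeeper quorum
      "my-consumer-group",      // consumer group id
      Map("my-topic" -> 2))     // topic -> number of receiver threads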

Re: Too many open files when using Spark to consume messages from Kafka

2015-04-30 Thread Bill Jay
each batch. On Thu, Apr 30, 2015 at 11:15 AM, Cody Koeninger wrote: > Did you use lsof to see what files were opened during the job? > > On Thu, Apr 30, 2015 at 1:05 PM, Bill Jay > wrote: > >> The data ingestion is in the outermost portion of the foreachRDD block. Although >>

Re: Too many open files when using Spark to consume messages from Kafka

2015-04-30 Thread Bill Jay
t's > impossible, but I'd think we need some evidence before speculating this has > anything to do with it. > > > On Wed, Apr 29, 2015 at 6:50 PM, Bill Jay > wrote: > >> This function is called in foreachRDD. I think it should be running in >> the executors. I

Re: Too many open files when using Spark to consume messages from Kafka

2015-04-29 Thread Bill Jay
> On Wed, Apr 29, 2015 at 2:30 PM, Ted Yu wrote: > >> Maybe add statement.close() in finally block ? >> >> Streaming / Kafka experts may have better insight. >> >> On Wed, Apr 29, 2015 at 2:25 PM, Bill Jay >> wrote: >> >>> Thanks for the suggestion.
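
A hedged sketch of the suggestion, assuming the results are written to MySQL over plain JDBC inside foreachRDD; the stream name, connection URL, credentials, and SQL are placeholders. Opening one connection per partition and always releasing it in a finally block avoids leaking file descriptors:

    import java.sql.DriverManager

    resultStream.foreachRDD { rdd =>
      rdd.foreachPartition { records =>
        val conn = DriverManager.getConnection("jdbc:mysql://db-host/mydb", "user", "password")
        val stmt = conn.prepareStatement("INSERT INTO results (value) VALUES (?)")
        try {
          records.foreach { r =>
            stmt.setString(1, r.toString)
            stmt.executeUpdate()
          }
        } finally {
          // close both even if an insert throws, so descriptors are not leaked
          stmt.close()
          conn.close()
        }
      }
    }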

Re: Too many open files when using Spark to consume messages from Kafka

2015-04-29 Thread Bill Jay
rity/limits.conf* > Cheers > > On Wed, Apr 29, 2015 at 2:07 PM, Bill Jay > wrote: > >> Hi all, >> >> I am using the direct approach to receive real-time data from Kafka in >> the following link: >> >> https://spark.apache.org/docs/1.3.0/stream

Too many open files when using Spark to consume messages from Kafka

2015-04-29 Thread Bill Jay
Hi all, I am using the direct approach to receive real-time data from Kafka in the following link: https://spark.apache.org/docs/1.3.0/streaming-kafka-integration.html My code follows the word count direct example: https://github.com/apache/spark/blob/master/examples/scala-2.10/src/main/scala/
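
A minimal sketch of the direct (receiver-less) approach from the linked guide, using the Spark 1.3 Scala API; the broker list and topic name are placeholders, and ssc is an existing StreamingContext:

    import kafka.serializer.StringDecoder
    import org.apache.spark.streaming.kafka.KafkaUtils

    val kafkaParams = Map[String, String]("metadata.broker.list" -> "broker1:9092,broker2:9092")
    val topics = Set("my-topic")
    val directStream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, topics)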

Re: Lifecycle of RDD in spark-streaming

2014-11-27 Thread Bill Jay
hout issues. > > Regularly checking 'scheduling delay' and 'total delay' on the Streaming > tab in the UI is a must. (And soon we will have that on the metrics report > as well!! :-) ) > > -kr, Gerard. > > > > On Thu, Nov 27, 2014 at 8:14 AM, Bill Ja

Re: Lifecycle of RDD in spark-streaming

2014-11-26 Thread Bill Jay
your use case, the cleanest way to solve this is by > asking Spark Streaming to "remember" stuff for longer, by using > streamingContext.remember(). This will ensure that Spark > Streaming will keep around all the stuff for at least that duration. > Hope this helps. > > TD >
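
A minimal sketch of the remember() call, assuming a 60-second batch interval; the app name and retention duration are placeholders:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Minutes, Seconds, StreamingContext}

    val conf = new SparkConf().setAppName("remember-example")
    val ssc = new StreamingContext(conf, Seconds(60))
    ssc.remember(Minutes(5))   // keep generated RDDs around for at least 5 minutes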

Re: Lifecycle of RDD in spark-streaming

2014-11-26 Thread Bill Jay
Just to add one more point: if Spark streaming knows when an RDD will not be used any more, I believe Spark will not try to retrieve data it will not use any more. However, in practice, I often encounter the "cannot compute split" error. Based on my understanding, this is because Spark cleared ou

Re: Error when Spark streaming consumes from Kafka

2014-11-23 Thread Bill Jay
> https://github.com/dibbhatt/kafka-spark-consumer > > Regards, > Dibyendu > > On Sun, Nov 23, 2014 at 2:13 AM, Bill Jay > wrote: > >> Hi all, >> >> I am using Spark to consume from Kafka. However, after the job has run >> for several hours, I saw t

Error when Spark streaming consumes from Kafka

2014-11-22 Thread Bill Jay
Hi all, I am using Spark to consume from Kafka. However, after the job has run for several hours, I saw the following failure of an executor: kafka.common.ConsumerRebalanceFailedException: group-1416624735998_ip-172-31-5-242.ec2.internal-1416648124230-547d2c31 can't rebalance after 4 retries
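
One commonly suggested mitigation (not taken from this thread) is to raise the rebalance retry settings of the Kafka 0.8 high-level consumer through the kafkaParams overload of the receiver-based API; the quorum, group id, topic, and values below are placeholders:

    import kafka.serializer.StringDecoder
    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.streaming.kafka.KafkaUtils

    val kafkaParams = Map(
      "zookeeper.connect"     -> "zk1:2181,zk2:2181",
      "group.id"              -> "my-consumer-group",
      "rebalance.max.retries" -> "10",     // default is 4, as in the error above
      "rebalance.backoff.ms"  -> "5000")
    val stream = KafkaUtils.createStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, Map("my-topic" -> 1), StorageLevel.MEMORY_AND_DISK_SER)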

Spark streaming: java.io.IOException: Version Mismatch (Expected: 28, Received: 18245 )

2014-11-18 Thread Bill Jay
Hi all, I am running a Spark Streaming job. It was able to produce the correct results up to some time. Later on, the job was still running but producing no result. I checked the Spark streaming UI and found that 4 tasks of a stage failed. The error messages showed that "Job aborted due to stage

Re: Spark streaming cannot receive any message from Kafka

2014-11-18 Thread Bill Jay
sure there’s a parameter in KafkaUtils.createStream where you can specify > the Spark parallelism; also, what are the exception stacks? > > Thanks > > Jerry > > > > *From:* Bill Jay [mailto:bill.jaypeter...@gmail.com] > *Sent:* Tuesday, November 18, 2

Re: Spark streaming cannot receive any message from Kafka

2014-11-17 Thread Bill Jay
>> >> Did you configure the Spark master as local? It should be local[n], n > 1, >> for local mode. Besides, there’s a Kafka wordcount example in the Spark Streaming >> examples; you can try that. I’ve tested with the latest master, it’s OK. >> >> >

Re: Spark streaming cannot receive any message from Kafka

2014-11-12 Thread Bill Jay
Streaming > example, you can try that. I’ve tested with latest master, it’s OK. > > > > Thanks > > Jerry > > > > *From:* Tobias Pfeiffer [mailto:t...@preferred.jp] > *Sent:* Thursday, November 13, 2014 8:45 AM > *To:* Bill Jay > *Cc:* u...@spark.incubator.apache.

Spark streaming cannot receive any message from Kafka

2014-11-12 Thread Bill Jay
Hi all, I have a Spark streaming job which constantly receives messages from Kafka. I was using Spark 1.0.2 and the job had been running for a month. However, now that I am using Spark 1.1.0, the Spark streaming job cannot receive any messages from Kafka. I have not made any change to the co

Spark streaming job failed due to "java.util.concurrent.TimeoutException"

2014-11-03 Thread Bill Jay
Hi all, I have a Spark streaming job that consumes data from Kafka and performs some simple operations on the data. This job is run in an EMR cluster with 10 nodes. The batch size I use is 1 minute and it takes around 10 seconds to generate the results that are inserted into a MySQL database. Howeve

Re: saveAsTextFiles file not found exception

2014-07-25 Thread Bill Jay
, Tathagata Das wrote: > Can you give a stack trace and logs of the exception? Its hard to say > anything without any associated stack trace and logs. > > TD > > > On Fri, Jul 25, 2014 at 1:32 PM, Bill Jay > wrote: > >> Hi, >> >> I am running a Spa

saveAsTextFiles file not found exception

2014-07-25 Thread Bill Jay
Hi, I am running a Spark Streaming job that uses saveAsTextFiles to save results into HDFS files. However, it hits an exception after 20 batches: result-140631234/_temporary/0/task_201407251119__m_03 does not exist. When the job is running, I do not change any file in the folder. Does

Re: Get Spark Streaming timestamp

2014-07-23 Thread Bill Jay
Tobias > > > > On Thu, Jul 24, 2014 at 6:39 AM, Bill Jay > wrote: > >> Hi all, >> >> I have a question regarding Spark streaming. When we use the >> saveAsTextFiles function and my batch is 60 seconds, Spark will generate a >> series of files such

Get Spark Streaming timestamp

2014-07-23 Thread Bill Jay
Hi all, I have a question regarding Spark streaming. When we use the saveAsTextFiles function and my batch is 60 seconds, Spark will generate a series of files such as: result-140614896, result-140614802, result-140614808, etc. I think this is the timestamp for the beginning of each
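
A minimal sketch of reading the batch timestamp directly instead of parsing it out of the file names; stream is assumed to be an existing DStream[String]:

    import org.apache.spark.rdd.RDD
    import org.apache.spark.streaming.Time

    stream.foreachRDD { (rdd: RDD[String], time: Time) =>
      // time.milliseconds is the same epoch timestamp that saveAsTextFiles
      // encodes in the "result-<millis>" directory names
      println(s"Batch at ${time.milliseconds} ms contains ${rdd.count()} records")
    }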

Re: Spark Streaming: no job has started yet

2014-07-23 Thread Bill Jay
wrote: > Can you paste the piece of code? > > Thanks > Best Regards > > > On Wed, Jul 23, 2014 at 1:22 AM, Bill Jay > wrote: > >> Hi all, >> >> I am running a spark streaming job. The job hangs on one stage, which >> shows as follows: >>

Re: combineByKey at ShuffledDStream.scala

2014-07-23 Thread Bill Jay
, 2014 at 10:05 PM, Tathagata Das wrote: > Can you give an idea of the streaming program? Rest of the transformation > you are doing on the input streams? > > > On Tue, Jul 22, 2014 at 11:05 AM, Bill Jay > wrote: > >> Hi all, >> >> I am currently running a

Spark Streaming: no job has started yet

2014-07-22 Thread Bill Jay
Hi all, I am running a Spark streaming job. The job hangs on one stage, which shows as follows: Details for Stage 4: Summary Metrics: No tasks have started yet; Tasks: No tasks have started yet. Does anyone have an idea on this? Thanks! Bill

combineByKey at ShuffledDStream.scala

2014-07-22 Thread Bill Jay
Hi all, I am currently running a Spark Streaming program, which consumes data from Kafka and performs a groupBy operation on the data. I am trying to optimize the running time of the program because it looks slow to me. It seems the stage named *combineByKey at ShuffledDStream.scala:42* always takes
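
If the aggregation behind the groupBy can be expressed as a reduce (for example a count per key), a map-side combine usually shrinks the combineByKey stage that groupByKey produces; a hedged sketch, where inputStream and extractKey are placeholders:

    // count per key with map-side combining and an explicit shuffle parallelism
    val counts = inputStream
      .map(record => (extractKey(record), 1L))
      .reduceByKey(_ + _, 300)   // 300 = number of partitions for the shuffle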

Re: spark streaming rate limiting from kafka

2014-07-22 Thread Bill Jay
3 > > Tobias > > > On Tue, Jul 22, 2014 at 3:40 AM, Bill Jay > wrote: > >> Hi Tathagata, >> >> I am currently creating multiple DStreams to consume from different topics. >> How can I let each consumer consume from different partitions? I find the >>

Re: spark streaming rate limiting from kafka

2014-07-21 Thread Bill Jay
>> >> are you saying, after repartition(400), you have 400 partitions on one >> host and the other hosts receive nothing of the data? >> >> Tobias >> >> >> On Fri, Jul 18, 2014 at 8:11 AM, Bill Jay >> wrote: >> >>> I also have an iss

Re: spark streaming rate limiting from kafka

2014-07-19 Thread Bill Jay
have 400 partitions on one > host and the other hosts receive nothing of the data? > > Tobias > > > On Fri, Jul 18, 2014 at 8:11 AM, Bill Jay > wrote: > >> I also have an issue consuming from Kafka. When I consume from Kafka, >> there are always a single executor w

Re: Spark Streaming timestamps

2014-07-18 Thread Bill Jay
ed RDDs and process them. > > TD > > > On Thu, Jul 17, 2014 at 2:05 PM, Bill Jay > wrote: > >> Hi Tathagata, >> >> Thanks for your answer. Please see my further question below: >> >> >> On Wed, Jul 16, 2014 at 6:57 PM, Tathagata Das < >

Re: spark streaming rate limiting from kafka

2014-07-17 Thread Bill Jay
I also have an issue consuming from Kafka. When I consume from Kafka, there is always a single executor working on this job. Even if I use repartition, it seems that there is still a single executor. Does anyone have an idea how to add parallelism to this job? On Thu, Jul 17, 2014 at 2:06 PM, Chen
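
A sketch of the usual way to add receiving parallelism with the receiver-based API: create several input streams, union them, and repartition before the heavy processing. The receiver count, ZooKeeper quorum, topic, and partition numbers are placeholders, and ssc is an existing StreamingContext:

    import org.apache.spark.streaming.kafka.KafkaUtils

    val numReceivers = 4
    val kafkaStreams = (1 to numReceivers).map { _ =>
      KafkaUtils.createStream(ssc, "zk1:2181", "my-consumer-group", Map("my-topic" -> 1))
    }
    val unified = ssc.union(kafkaStreams)          // one DStream backed by several receivers
    val repartitioned = unified.repartition(120)   // spread processing across more tasks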

Re: Spark Streaming timestamps

2014-07-17 Thread Bill Jay
Hi Tathagata, Thanks for your answer. Please see my further question below: On Wed, Jul 16, 2014 at 6:57 PM, Tathagata Das wrote: > Answers inline. > > > On Wed, Jul 16, 2014 at 5:39 PM, Bill Jay > wrote: > >> Hi all, >> >> I am currently using Spark S

Re: Error: No space left on device

2014-07-17 Thread Bill Jay
Hi, I also have some issues with repartition. In my program, I consume data from Kafka. After I consume the data, I use repartition(N). However, although I set N to be 120, there are around 18 executors allocated for my reduce stage. I am not sure how the repartition command works to ensure the paral

Spark Streaming timestamps

2014-07-16 Thread Bill Jay
Hi all, I am currently using Spark Streaming to conduct real-time data analytics. We receive data from Kafka. We want to generate output files that contain results based on the data we receive from a specific time interval. I have several questions on Spark Streaming's timestamp: 1) I

Re: Number of executors change during job running

2014-07-16 Thread Bill Jay
wrote: > Can you give me a screen shot of the stages page in the web ui, the spark > logs, and the code that is causing this behavior. This seems quite weird to > me. > > TD > > > On Mon, Jul 14, 2014 at 2:11 PM, Bill Jay > wrote: > >> Hi Tathagata, >> &g

Re: Number of executors change during job running

2014-07-14 Thread Bill Jay
process the stage / time to process the whole batch. > > TD > > > On Fri, Jul 11, 2014 at 8:32 PM, Bill Jay > wrote: > >> Hi Tathagata, >> >> Do you mean that the data is not shuffled until the reduce stage? That >> means groupBy still only uses 2 ma

Re: Number of executors change during job running

2014-07-11 Thread Bill Jay
-receiving> > . > > TD > > > > On Fri, Jul 11, 2014 at 4:11 PM, Bill Jay > wrote: > >> Hi folks, >> >> I just ran another job that only received data from Kafka, did some >> filtering, and then save as text files in HDFS. There was no reducing work

Re: Number of executors change during job running

2014-07-11 Thread Bill Jay
submission. As a result, the simple save file action took more than 2 minutes. Do you have any idea how Spark determined the number of executors for different stages? Thanks! Bill On Fri, Jul 11, 2014 at 2:01 PM, Bill Jay wrote: > Hi Tathagata, > > Below is my main function. I omit some filt

Re: Number of executors change during job running

2014-07-11 Thread Bill Jay
> number of partitions in the XYZ-ByKey operation as 300, then there should > be 300 tasks for that stage, distributed over the 50 executors allocated > to your context. However, the data distribution may be skewed, in which case > you can use a repartition operation to redistribute the

Spark groupBy operation is only assigned 2 executors

2014-07-11 Thread Bill Jay
Hi all, I am running a simple analysis using Spark streaming. I set executor number and default parallelism both as 300. The program consumes data from Kafka and do a simple groupBy operation with 300 as the parameter. The batch size is one minute. In the first two batches, there are around 50 exe

Re: Spark Streaming with Kafka NoClassDefFoundError

2014-07-11 Thread Bill Jay
> is not working. Is there any other way to include particular jars with the > assembly command? > > Regards, > Dilip > > On Friday 11 July 2014 12:45 PM, Bill Jay wrote: > > I have met similar issues. The reason is probably because in the Spark > assembly, spark-streaming-kafka is

Re: Number of executors change during job running

2014-07-11 Thread Bill Jay
It will show the breakdown of time for each task, > including GC times, etc. That might give some indication. > > TD > > > On Thu, Jul 10, 2014 at 5:13 PM, Bill Jay > wrote: > >> Hi Tathagata, >> >> I set default parallelism to 300 in my configuration file. Somet

Re: Spark Streaming with Kafka NoClassDefFoundError

2014-07-11 Thread Bill Jay
I have met similar issues. The reason is probably because in Spark assembly, spark-streaming-kafka is not included. Currently, I am using Maven to generate a shaded package with all the dependencies. You may try to use sbt assembly to include the dependencies in your jar file. Bill On Thu, Jul 1
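
A minimal build.sbt sketch of that approach, assuming sbt-assembly builds the application jar; the versions are placeholders. spark-core and spark-streaming are marked "provided" because the cluster supplies them, while spark-streaming-kafka is not, so KafkaUtils ends up inside the assembled jar:

    libraryDependencies ++= Seq(
      "org.apache.spark" %% "spark-core"            % "1.0.0" % "provided",
      "org.apache.spark" %% "spark-streaming"       % "1.0.0" % "provided",
      "org.apache.spark" %% "spark-streaming-kafka" % "1.0.0"
    )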

Re: Join two Spark Streaming

2014-07-11 Thread Bill Jay
aRDD => { >val newUniqueValuesRDD = newDataRDD.union(distinctValues).distinct >uniqueValuesRDD = newUniqueValuesRDD > >// periodically call uniqueValuesRDD.checkpoint() > >val uniqueCount = uniqueValuesRDD.count() >newDataRDD.map(x => x / count) > }) >

Re: Number of executors change during job running

2014-07-10 Thread Bill Jay
al delay, I would look at the task details of that >> stage in the Spark web UI. It will show the breakdown of time for each task, >> including GC times, etc. That might give some indication. >> >> TD >> >> >> On Thu, Jul 10, 2014 at 5:13 PM, Bill Jay >> wrote: >

Re: Number of executors change during job running

2014-07-10 Thread Bill Jay
ot set, then the number of reducers > used in the stages can keep changing across batches. > > TD > > > On Wed, Jul 9, 2014 at 4:05 PM, Bill Jay > wrote: > >> Hi all, >> >> I have a Spark streaming job running on yarn. It consumes data from Kafka >> a

Re: Use Spark Streaming to update result whenever data come

2014-07-10 Thread Bill Jay
d Spark only with embarrassingly parallel > operations such as map or filter. I hope someone else might provide more > insight here. > > Tobias > > > On Thu, Jul 10, 2014 at 9:57 AM, Bill Jay > wrote: > >> Hi Tobias, >> >> Now I did the repartition and ran the

Re: Use Spark Streaming to update result whenever data come

2014-07-09 Thread Bill Jay
fer wrote: > Bill, > > I haven't worked with Yarn, but I would try adding a repartition() call > after you receive your data from Kafka. I would be surprised if that didn't > help. > > > On Thu, Jul 10, 2014 at 6:23 AM, Bill Jay > wrote: > >> Hi Tob

Number of executors change during job running

2014-07-09 Thread Bill Jay
Hi all, I have a Spark streaming job running on yarn. It consumes data from Kafka and groups the data by a certain field. The data size is 480k lines per minute and the batch size is 1 minute. For some batches, the program sometimes takes more than 3 minutes to finish the groupBy operation, which s

Re: Use Spark Streaming to update result whenever data come

2014-07-09 Thread Bill Jay
o try something like repartition(N) or > repartition(N*2) (with N the number of your nodes) after you receive your > data. > > Tobias > > > On Wed, Jul 9, 2014 at 3:09 AM, Bill Jay > wrote: > >> Hi Tobias, >> >> Thanks for the suggestion. I have tried to add

Re: Spark-streaming-kafka error

2014-07-08 Thread Bill Jay
"1.0.0" into your application jar? If I remember correctly, it's not > bundled with the downloadable compiled version of Spark. > > Tobias > > > On Wed, Jul 9, 2014 at 8:18 AM, Bill Jay > wrote: > >> Hi all, >> >> I used sbt to package a code

Spark-streaming-kafka error

2014-07-08 Thread Bill Jay
Hi all, I used sbt to package a code that uses spark-streaming-kafka. The packaging succeeded. However, when I submitted to yarn, the job ran for 10 seconds and there was an error in the log file as follows: Caused by: java.lang.NoClassDefFoundError: org/apache/spark/streaming/kafka/KafkaUtils$

Re: Use Spark Streaming to update result whenever data come

2014-07-08 Thread Bill Jay
Tobias > > > On Thu, Jul 3, 2014 at 7:09 AM, Bill Jay > wrote: > >> Hi all, >> >> I have a problem of using Spark Streaming to accept input data and update >> a result. >> >> The input of the data is from Kafka and the output is to report a map >

Join two Spark Streaming

2014-07-08 Thread Bill Jay
Hi all, I am working on a pipeline that needs to join two Spark streams. The input is a stream of integers, and the output is each integer's number of appearances divided by the total number of unique integers. Suppose the input is: 1 2 3 1 2 2. There are 3 unique integers and 1 appears twice. Ther

Use Spark Streaming to update result whenever data come

2014-07-02 Thread Bill Jay
Hi all, I have a problem of using Spark Streaming to accept input data and update a result. The input of the data is from Kafka and the output is to report a map which is updated by historical data in every minute. My current method is to set batch size as 1 minute and use foreachRDD to update th
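
One built-in alternative to mutating a map inside foreachRDD is updateStateByKey, which carries per-key state across batches; a hedged sketch in which keyedStream (assumed to be a DStream[(String, Long)]) and the checkpoint path are placeholders:

    ssc.checkpoint("hdfs:///checkpoints/update-state-example")  // required for stateful ops

    def updateCount(newValues: Seq[Long], state: Option[Long]): Option[Long] =
      Some(state.getOrElse(0L) + newValues.sum)

    val runningCounts = keyedStream.updateStateByKey(updateCount _)
    runningCounts.print()   // the running result, one (key, total) pair per key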

slf4j multiple bindings

2014-07-01 Thread Bill Jay
Hi all, I have an issue with multiple slf4j bindings. My program was running correctly until I added a new dependency, kryo. When I submitted a job, the job was killed because of the following error message: *SLF4J: Class path contains multiple SLF4J bindings.* The log said there were thr
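
A hedged build.sbt sketch of one common fix: exclude the duplicate binding from whichever dependency dragged it in. The coordinates below are placeholders; the SLF4J warning in the log lists the jars that actually carry the competing bindings:

    libraryDependencies += ("com.esotericsoftware.kryo" % "kryo" % "2.21")
      .exclude("org.slf4j", "slf4j-log4j12")   // keep only one SLF4J binding on the classpath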

Re: Could not compute split, block not found

2014-07-01 Thread Bill Jay
chronous >> solution where your tasks return immediately and deliver their result via a >> callback later? >> >> Tobias >> >> >> >> On Tue, Jul 1, 2014 at 2:26 AM, Bill Jay >> wrote: >> >>> Tobias, >>> >>> Your s

Re: Could not compute split, block not found

2014-07-01 Thread Bill Jay
wever, if you are only waiting > (e.g., for network I/O), then maybe you can employ some asynchronous > solution where your tasks return immediately and deliver their result via a > callback later? > > Tobias > > > > On Tue, Jul 1, 2014 at 2:26 AM, Bill Jay > wrote

Re: Could not compute split, block not found

2014-06-30 Thread Bill Jay
ion of what I think > may happen. > > Can you maybe do something hacky like throwing away a part of the data so > that processing time gets below one minute, then check whether you still > get that error? > > Tobias > > > > > > On Mon, Jun 30, 2014 at 1:56 P

Re: Could not compute split, block not found

2014-06-29 Thread Bill Jay
> while waiting for processing, so old data was deleted. When it was > time to process that data, it didn't exist any more. Is that a > possible reason in your case? > > Tobias > > On Sat, Jun 28, 2014 at 5:59 AM, Bill Jay > wrote: > > Hi, > > > > I a

Could not compute split, block not found

2014-06-27 Thread Bill Jay
Hi, I am running a spark streaming job with 1 minute as the batch size. It ran around 84 minutes and was killed because of the exception with the following information: *java.lang.Exception: Could not compute split, block input-0-1403893740400 not found* Before it was killed, it was able to cor

Re: Spark Streaming RDD transformation

2014-06-26 Thread Bill Jay
allelize() can > be used to make an RDD from a List or Array of objects. But that's not > really related to streaming or updating a Map. > > On Thu, Jun 26, 2014 at 1:40 PM, Bill Jay > wrote: > > Hi all, > > > > I am current working on a project that requires to t

Spark Streaming RDD transformation

2014-06-26 Thread Bill Jay
Hi all, I am currently working on a project that requires transforming each RDD in a DStream into a Map. Basically, when we get a list of data in each batch, we would like to update the global map. I would like to return the map as a single RDD. I am currently trying to use the function *transform*.
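
A hedged sketch of one way to do this, assuming each batch is a DStream[(String, Long)] (pairStream below) small enough to collect to the driver; the stream name and merge logic are placeholders:

    var globalMap = Map.empty[String, Long]

    pairStream.foreachRDD { rdd =>
      val batchMap = rdd.collect().toMap     // materialize this batch as a Map
      globalMap = globalMap ++ batchMap      // update the driver-side global map
      // turn the accumulated map back into a single RDD when it is needed
      val mapAsRdd = rdd.sparkContext.parallelize(globalMap.toSeq)
    }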