/latest/api/scala/org/apache/spark/streaming/StreamingContext.html>.
> The information on consumed offset can be recovered from the checkpoint.
>
> On Tue, May 19, 2015 at 2:38 PM, Bill Jay
> wrote:
>
>> If a Spark streaming job stops at 12:01 and I resume the job at 12:02.
ade code.
>
> On Tue, May 19, 2015 at 12:42 PM, Bill Jay
> wrote:
>
>> Hi all,
>>
>> I am currently using Spark streaming to consume and save logs every hour
>> in our production pipeline. The current setting is to run a crontab job to
>> check every minute
Hi all,
I am currently using Spark streaming to consume and save logs every hour in
our production pipeline. The current setting is to run a crontab job to
check every minute whether the job is still there and, if not, resubmit a
Spark streaming job. I am currently using the direct approach for Kafka
Hi all,
I am reading the docs of the receiver-based Kafka consumer. The last parameter
of KafkaUtils.createStream is the per-topic number of Kafka partitions to
consume. My question is: does the number of partitions for a topic in this
parameter need to match the number of partitions in Kafka?
For example
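For reference, a minimal sketch of the call in question, assuming an existing StreamingContext named ssc; the ZooKeeper quorum, consumer group id, topic name, and the count of 2 are placeholder values:

    import org.apache.spark.streaming.kafka.KafkaUtils

    // Receiver-based Kafka stream; the last argument is the per-topic count
    // being asked about. All concrete values below are hypothetical.
    val stream = KafkaUtils.createStream(
      ssc,
      "zk-host:2181",       // ZooKeeper quorum (placeholder)
      "my-consumer-group",  // consumer group id (placeholder)
      Map("my-topic" -> 2)  // topic -> number of partitions/threads to consume
    )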
each batch.
On Thu, Apr 30, 2015 at 11:15 AM, Cody Koeninger wrote:
> Did you use lsof to see what files were opened during the job?
>
> On Thu, Apr 30, 2015 at 1:05 PM, Bill Jay
> wrote:
>
>> The data ingestion is in the outermost portion of the foreachRDD block. Although
>>
t's
> impossible, but I'd think we need some evidence before speculating this has
> anything to do with it.
>
>
> On Wed, Apr 29, 2015 at 6:50 PM, Bill Jay
> wrote:
>
>> This function is called in foreachRDD. I think it should be running in
>> the executors. I
> On Wed, Apr 29, 2015 at 2:30 PM, Ted Yu wrote:
>
>> Maybe add statement.close() in a finally block?
>>
>> Streaming / Kafka experts may have better insight.
>>
>> On Wed, Apr 29, 2015 at 2:25 PM, Bill Jay
>> wrote:
>>
>>> Thanks for the suggestion.
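A rough sketch of the suggested pattern, assuming a DStream named dstream and hypothetical JDBC connection details; the only point it illustrates is closing the statement (and connection) in a finally block:

    // Write each partition's records over JDBC, releasing resources even on failure.
    dstream.foreachRDD { rdd =>
      rdd.foreachPartition { records =>
        val conn = java.sql.DriverManager.getConnection(
          "jdbc:mysql://db-host/mydb", "user", "pass") // placeholder connection details
        val statement = conn.createStatement()
        try {
          records.foreach { r =>
            statement.executeUpdate(s"INSERT INTO logs (line) VALUES ('$r')") // hypothetical table
          }
        } finally {
          statement.close() // close the statement in a finally block, as suggested
          conn.close()      // and the connection
        }
      }
    }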
rity/limits.conf*
> Cheers
>
> On Wed, Apr 29, 2015 at 2:07 PM, Bill Jay
> wrote:
>
>> Hi all,
>>
>> I am using the direct approach to receive real-time data from Kafka in
>> the following link:
>>
>> https://spark.apache.org/docs/1.3.0/stream
Hi all,
I am using the direct approach to receive real-time data from Kafka, as
described at the following link:
https://spark.apache.org/docs/1.3.0/streaming-kafka-integration.html
My code follows the word count direct example:
https://github.com/apache/spark/blob/master/examples/scala-2.10/src/main/scala/
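For context, a minimal sketch of that direct approach (Spark 1.3+), assuming an existing StreamingContext named ssc; the broker list and topic name are placeholders:

    import kafka.serializer.StringDecoder
    import org.apache.spark.streaming.kafka.KafkaUtils

    // Direct (receiver-less) Kafka stream; offsets are tracked by Spark itself.
    val kafkaParams = Map("metadata.broker.list" -> "broker1:9092,broker2:9092")
    val messages = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, Set("my-topic"))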
hout issues.
>
> Regularly checking 'scheduling delay' and 'total delay' on the Streaming
> tab in the UI is a must. (And soon we will have that on the metrics report
> as well!! :-) )
>
> -kr, Gerard.
>
>
>
> On Thu, Nov 27, 2014 at 8:14 AM, Bill Ja
your use case, the cleanest way to solve this is by
> asking Spark Streaming to "remember" stuff for longer, using
> streamingContext.remember(). This will ensure that Spark
> Streaming keeps everything around for at least that duration.
> Hope this helps.
>
> TD
Just to add one more point: if Spark Streaming knows when an RDD will no
longer be used, I believe Spark will not try to retrieve data it will not
use any more. However, in practice I often encounter the "cannot
compute split" error. Based on my understanding, this is because Spark cleared
ou
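A minimal sketch of the remember() suggestion above, assuming an existing StreamingContext named ssc; the 10-minute duration is a placeholder:

    import org.apache.spark.streaming.Minutes

    // Keep generated RDDs around for at least this long, which can help avoid
    // "Could not compute split" errors when data is accessed later than usual.
    ssc.remember(Minutes(10))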
> https://github.com/dibbhatt/kafka-spark-consumer
>
> Regards,
> Dibyendu
>
> On Sun, Nov 23, 2014 at 2:13 AM, Bill Jay
> wrote:
>
>> Hi all,
>>
>> I am using Spark to consume from Kafka. However, after the job has run
>> for several hours, I saw t
Hi all,
I am using Spark to consume from Kafka. However, after the job has run for
several hours, I saw the following failure of an executor:
kafka.common.ConsumerRebalanceFailedException:
group-1416624735998_ip-172-31-5-242.ec2.internal-1416648124230-547d2c31
can't rebalance after 4 retries
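A hypothetical mitigation sketch, not a confirmed fix for this thread: passing higher rebalance retry/backoff settings of the Kafka 0.8 high-level consumer through kafkaParams. The hosts, group id, topic, and numeric values are placeholders, and ssc is an assumed StreamingContext:

    import kafka.serializer.StringDecoder
    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.streaming.kafka.KafkaUtils

    // Give the consumer more attempts and a longer backoff before the
    // ConsumerRebalanceFailedException is thrown.
    val kafkaParams = Map(
      "zookeeper.connect"     -> "zk-host:2181",
      "group.id"              -> "group-1",
      "rebalance.max.retries" -> "10",
      "rebalance.backoff.ms"  -> "5000"
    )
    val stream = KafkaUtils.createStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, Map("my-topic" -> 1), StorageLevel.MEMORY_AND_DISK_SER)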
Hi all,
I am running a Spark Streaming job. It was able to produce correct
results for some time. Later on, the job was still running but producing
no results. I checked the Spark Streaming UI and found that 4 tasks of a
stage had failed.
The error messages showed that "Job aborted due to stage
sure there’s a parameter in KafkaUtils.createStream where you can specify
> the Spark parallelism; also, what is the exception stack?
>
>
>
> Thanks
>
> Jerry
>
>
>
> *From:* Bill Jay [mailto:bill.jaypeter...@gmail.com]
> *Sent:* Tuesday, November 18, 2
>>
>> Did you configure the Spark master as local? It should be local[n], with n > 1,
>> for local mode. Besides, there’s a Kafka word count example in the Spark Streaming
>> examples; you can try that. I’ve tested with the latest master, and it’s OK.
>>
>>
Streaming
> examples; you can try that. I’ve tested with the latest master, and it’s OK.
>
>
>
> Thanks
>
> Jerry
>
>
>
> *From:* Tobias Pfeiffer [mailto:t...@preferred.jp]
> *Sent:* Thursday, November 13, 2014 8:45 AM
> *To:* Bill Jay
> *Cc:* u...@spark.incubator.apache.
Hi all,
I have a Spark streaming job which constantly receives messages from Kafka.
I was using Spark 1.0.2 and the job had been running for a month. However,
now that I am using Spark 1.1.0, the Spark streaming job cannot
receive any messages from Kafka. I have not made any change to the co
Hi all,
I have a Spark streaming job that consumes data from Kafka and performs
some simple operations on the data. This job runs in an EMR cluster with
10 nodes. The batch size I use is 1 minute, and it takes around 10 seconds
to generate the results, which are inserted into a MySQL database. Howeve
, Tathagata Das
wrote:
> Can you give a stack trace and logs of the exception? Its hard to say
> anything without any associated stack trace and logs.
>
> TD
>
>
> On Fri, Jul 25, 2014 at 1:32 PM, Bill Jay
> wrote:
>
>> Hi,
>>
>> I am running a Spa
Hi,
I am running a Spark Streaming job that uses saveAsTextFiles to save
results into HDFS files. However, it hits an exception after 20 batches:
result-140631234/_temporary/0/task_201407251119__m_03 does
not exist.
When the job is running, I do not change any file in the folder. Does
Tobias
>
>
>
> On Thu, Jul 24, 2014 at 6:39 AM, Bill Jay
> wrote:
>
>> Hi all,
>>
>> I have a question regarding Spark streaming. When we use the
>> saveAsTextFiles function and my batch is 60 seconds, Spark will generate a
>> series of files such
Hi all,
I have a question regarding Spark streaming. When we use the
saveAsTextFiles function and the batch interval is 60 seconds, Spark will generate a
series of files such as:
result-140614896, result-140614802, result-140614808, etc.
I think this is the timestamp for the beginning of each
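For context, saveAsTextFiles(prefix, suffix) writes one output directory per batch, named "prefix-&lt;batch time in milliseconds&gt;[.suffix]". A minimal sketch, assuming a DStream named results and placeholder paths:

    // Each batch is written to a directory derived from the prefix, suffix,
    // and the batch time in milliseconds.
    results.saveAsTextFiles("hdfs:///output/result", "txt")
    // e.g. one batch ends up under a path like hdfs:///output/result-1406148960000.txt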
wrote:
> Can you paste the piece of code?
>
> Thanks
> Best Regards
>
>
> On Wed, Jul 23, 2014 at 1:22 AM, Bill Jay
> wrote:
>
>> Hi all,
>>
>> I am running a spark streaming job. The job hangs on one stage, which
>> shows as follows:
>>
, 2014 at 10:05 PM, Tathagata Das wrote:
> Can you give an idea of the streaming program? What are the rest of the transformations
> you are doing on the input streams?
>
>
> On Tue, Jul 22, 2014 at 11:05 AM, Bill Jay
> wrote:
>
>> Hi all,
>>
>> I am currently running a
Hi all,
I am running a spark streaming job. The job hangs on one stage, which shows
as follows:
Details for Stage 4
Summary Metrics: No tasks have started yet. Tasks: No tasks have started yet.
Does anyone have an idea on this?
Thanks!
Bill
Hi all,
I am currently running a Spark Streaming program, which consumes data from
Kafka and does a group-by operation on the data. I am trying to optimize the
running time of the program because it looks slow to me. It seems the stage
named:
* combineByKey at ShuffledDStream.scala:42 *
always takes
3
>
> Tobias
>
>
> On Tue, Jul 22, 2014 at 3:40 AM, Bill Jay
> wrote:
>
>> Hi Tathagata,
>>
>> I am currently creating multiple DStreams to consume from different topics.
>> How can I let each consumer consume from different partitions? I find the
>>
>>
>> are you saying, after repartition(400), you have 400 partitions on one
>> host and the other hosts receive nothing of the data?
>>
>> Tobias
>>
>>
>> On Fri, Jul 18, 2014 at 8:11 AM, Bill Jay
>> wrote:
>>
>>> I also have an iss
have 400 partitions on one
> host and the other hosts receive nothing of the data?
>
> Tobias
>
>
> On Fri, Jul 18, 2014 at 8:11 AM, Bill Jay
> wrote:
>
>> I also have an issue consuming from Kafka. When I consume from Kafka,
>> there is always a single executor w
ed RDDs and process them.
>
> TD
>
>
> On Thu, Jul 17, 2014 at 2:05 PM, Bill Jay
> wrote:
>
>> Hi Tathagata,
>>
>> Thanks for your answer. Please see my further question below:
>>
>>
>> On Wed, Jul 16, 2014 at 6:57 PM, Tathagata Das <
I also have an issue consuming from Kafka. When I consume from Kafka, there
is always a single executor working on this job. Even if I use repartition,
it seems that there is still a single executor. Does anyone have an idea how
to add parallelism to this job?
On Thu, Jul 17, 2014 at 2:06 PM, Chen
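One commonly suggested way to get more read parallelism with the receiver-based API (a sketch under assumptions, not necessarily what was used in this thread): create several receivers, union their streams, and repartition. Here ssc is an assumed StreamingContext and all hosts, names, and counts are placeholders:

    import org.apache.spark.streaming.kafka.KafkaUtils

    // Several receivers pull from Kafka in parallel; the union is then
    // repartitioned so downstream work spreads across executors.
    val numReceivers = 4
    val kafkaStreams = (1 to numReceivers).map { _ =>
      KafkaUtils.createStream(ssc, "zk-host:2181", "my-group", Map("my-topic" -> 1))
    }
    val unified = ssc.union(kafkaStreams).repartition(16)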
Hi Tathagata,
Thanks for your answer. Please see my further question below:
On Wed, Jul 16, 2014 at 6:57 PM, Tathagata Das
wrote:
> Answers inline.
>
>
> On Wed, Jul 16, 2014 at 5:39 PM, Bill Jay
> wrote:
>
>> Hi all,
>>
>> I am currently using Spark S
Hi,
I also have some issues with repartition. In my program, I consume data
from Kafka. After I consume data, I use repartition(N). However, although I
set N to be 120, there are around 18 executors allocated for my reduce
stage. I am not sure how the repartition command works to ensure the
paral
Hi all,
I am currently using Spark Streaming to conduct real-time data analytics.
We receive data from Kafka. We want to generate output files that contain
results that are based on the data we receive from a specific time
interval.
I have several questions on Spark Streaming's timestamp:
1) I
wrote:
> Can you give me a screen shot of the stages page in the web ui, the spark
> logs, and the code that is causing this behavior. This seems quite weird to
> me.
>
> TD
>
>
> On Mon, Jul 14, 2014 at 2:11 PM, Bill Jay
> wrote:
>
>> Hi Tathagata,
>>
process the stage / time to process the whole batch.
>
> TD
>
>
> On Fri, Jul 11, 2014 at 8:32 PM, Bill Jay
> wrote:
>
>> Hi Tathagata,
>>
>> Do you mean that the data is not shuffled until the reduce stage? That
>> means groupBy still only uses 2 ma
-receiving>
> .
>
> TD
>
>
>
> On Fri, Jul 11, 2014 at 4:11 PM, Bill Jay
> wrote:
>
>> Hi folks,
>>
>> I just ran another job that only received data from Kafka, did some
>> filtering, and then save as text files in HDFS. There was no reducing work
submission.
As a result, the simple save file action took more than 2 minutes. Do you
have any idea how Spark determined the number of executors
for different stages?
Thanks!
Bill
On Fri, Jul 11, 2014 at 2:01 PM, Bill Jay
wrote:
> Hi Tathagata,
>
> Below is my main function. I omit some filt
; number of partitions in the XYZ-ByKey operation as 300, then there should
> be 300 tasks for that stage, distributed over the 50 executors allocated
> to your context. However, the data distribution may be skewed, in which case
> you can use a repartition operation to redistribute the
Hi all,
I am running a simple analysis using Spark streaming. I set both the executor
number and the default parallelism to 300. The program consumes data from Kafka
and does a simple groupBy operation with 300 as the parameter. The batch size
is one minute. In the first two batches, there are around 50 exe
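A sketch of the two knobs discussed here, assuming a pair DStream named pairs; the value 300 mirrors the setting described above and the rest is illustrative:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.StreamingContext._

    // Set the default shuffle parallelism for the application ...
    val conf = new SparkConf().set("spark.default.parallelism", "300")
    // ... and/or request an explicit number of partitions for the ByKey stage.
    val grouped = pairs.groupByKey(300) // 300 tasks for the shuffle stage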
t; is not working. Is there any other way to include particular jars with
> assembly command?
>
> Regards,
> Dilip
>
> On Friday 11 July 2014 12:45 PM, Bill Jay wrote:
>
> I have met similar issues. The reason is probably that in the Spark
> assembly, spark-streaming-kafka is
It will show a breakdown of time for each task,
> including GC times, etc. That might give some indication.
>
> TD
>
>
> On Thu, Jul 10, 2014 at 5:13 PM, Bill Jay
> wrote:
>
>> Hi Tathagata,
>>
>> I set default parallelism as 300 in my configuration file. Somet
I have met similar issues. The reason is probably that spark-streaming-kafka
is not included in the Spark assembly. Currently, I am using
Maven to generate a shaded package with all the dependencies. You may try
to use sbt assembly to include the dependencies in your jar file.
Bill
On Thu, Jul 1
aRDD => {
>    val newUniqueValuesRDD = newDataRDD.union(uniqueValuesRDD).distinct
>    uniqueValuesRDD = newUniqueValuesRDD
>
>    // periodically call uniqueValuesRDD.checkpoint()
>
>    val uniqueCount = uniqueValuesRDD.count()
>    newDataRDD.map(x => x / uniqueCount)
> })
>
al delay, I would look at the task details of that
>> stage in the Spark web UI. It will show a breakdown of time for each task,
>> including GC times, etc. That might give some indication.
>>
>> TD
>>
>>
>> On Thu, Jul 10, 2014 at 5:13 PM, Bill Jay
>> wrote:
ot set, then the number of reducers
> used in the stages can keep changing across batches.
>
> TD
>
>
> On Wed, Jul 9, 2014 at 4:05 PM, Bill Jay
> wrote:
>
>> Hi all,
>>
>> I have a Spark streaming job running on yarn. It consumes data from Kafka
>> a
d Spark only with embarrassingly parallel
> operations such as map or filter. I hope someone else might provide more
> insight here.
>
> Tobias
>
>
> On Thu, Jul 10, 2014 at 9:57 AM, Bill Jay
> wrote:
>
>> Hi Tobias,
>>
>> Now I did the re-partition and ran the
fer wrote:
> Bill,
>
> I haven't worked with Yarn, but I would try adding a repartition() call
> after you receive your data from Kafka. I would be surprised if that didn't
> help.
>
>
> On Thu, Jul 10, 2014 at 6:23 AM, Bill Jay
> wrote:
>
>> Hi Tob
Hi all,
I have a Spark streaming job running on yarn. It consumes data from Kafka
and groups the data by a certain field. The data size is 480k lines per
minute, and the batch size is 1 minute.
For some batches, the program sometimes takes more than 3 minutes to finish
the groupBy operation, which s
o try something like repartition(N) or
> repartition(N*2) (with N the number of your nodes) after you receive your
> data.
>
> Tobias
>
>
> On Wed, Jul 9, 2014 at 3:09 AM, Bill Jay
> wrote:
>
>> Hi Tobias,
>>
>> Thanks for the suggestion. I have tried to add
"1.0.0" into your application jar? If I remember correctly, it's not
> bundled with the downloadable compiled version of Spark.
>
> Tobias
>
>
> On Wed, Jul 9, 2014 at 8:18 AM, Bill Jay
> wrote:
>
>> Hi all,
>>
>> I used sbt to package a code
Hi all,
I used sbt to package code that uses spark-streaming-kafka. The packaging
succeeded. However, when I submitted to yarn, the job ran for 10 seconds
and there was an error in the log file as follows:
Caused by: java.lang.NoClassDefFoundError:
org/apache/spark/streaming/kafka/KafkaUtils$
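A hypothetical build.sbt fragment illustrating the usual cause: spark-streaming-kafka is not part of the Spark assembly, so it has to be bundled into the application jar (for example with the sbt-assembly plugin). The versions below are placeholders:

    // spark-streaming itself is provided by the cluster; the Kafka integration
    // module must be shipped with the application jar.
    libraryDependencies ++= Seq(
      "org.apache.spark" %% "spark-streaming"       % "1.0.0" % "provided",
      "org.apache.spark" %% "spark-streaming-kafka" % "1.0.0"
    )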
Tobias
>
>
> On Thu, Jul 3, 2014 at 7:09 AM, Bill Jay
> wrote:
>
>> Hi all,
>>
>> I have a problem of using Spark Streaming to accept input data and update
>> a result.
>>
>> The input of the data is from Kafka and the output is to report a map
Hi all,
I am working on a pipeline that needs to join two Spark streams. The input
is a stream of integers, and the output is the number of times each integer
appears divided by the total number of unique integers. Suppose the
input is:
1
2
3
1
2
2
There are 3 unique integers and 1 appears twice. Ther
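A per-batch sketch of that computation, assuming a DStream[Int] named ints; all names are illustrative:

    import org.apache.spark.SparkContext._

    // For each batch: count occurrences per integer, then divide by the
    // number of distinct integers seen in that batch.
    val ratios = ints.transform { rdd =>
      val uniqueCount = rdd.distinct().count()
      rdd.map(x => (x, 1L))
        .reduceByKey(_ + _)
        .map { case (x, c) =>
          (x, if (uniqueCount == 0) 0.0 else c.toDouble / uniqueCount)
        }
    }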
Hi all,
I have a problem using Spark Streaming to accept input data and update a
result.
The input data comes from Kafka, and the output is a map that is updated with
historical data every minute. My current method is to set the
batch size to 1 minute and use foreachRDD to update th
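One commonly suggested alternative to rebuilding the map in foreachRDD (a sketch, not necessarily the approach settled on in this thread): keep running counts with updateStateByKey, which requires a checkpoint directory. The path and the keyed DStream named keyed (of (String, Long) pairs) are assumptions:

    import org.apache.spark.streaming.StreamingContext._

    // State is carried across batches by Spark Streaming itself.
    ssc.checkpoint("hdfs:///checkpoints/my-app") // placeholder path
    val runningTotals = keyed.updateStateByKey[Long] { (values: Seq[Long], state: Option[Long]) =>
      Some(state.getOrElse(0L) + values.sum)
    }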
Hi all,
I have an issue with multiple SLF4J bindings. My program was running
correctly. I just added a new dependency, kryo, and when I submitted a
job, it was killed because of the following error messages:
*SLF4J: Class path contains multiple SLF4J bindings.*
The log said there were thr
chronous
>> solution where your tasks return immediately and deliver their result via a
>> callback later?
>>
>> Tobias
>>
>>
>>
>> On Tue, Jul 1, 2014 at 2:26 AM, Bill Jay
>> wrote:
>>
>>> Tobias,
>>>
>>> Your s
wever, if you are only waiting
> (e.g., for network I/O), then maybe you can employ some asynchronous
> solution where your tasks return immediately and deliver their result via a
> callback later?
>
> Tobias
>
>
>
> On Tue, Jul 1, 2014 at 2:26 AM, Bill Jay
> wrote
ion of what I think
> may happen.
>
> Can you maybe do something hacky like throwing away a part of the data so
> that processing time gets below one minute, then check whether you still
> get that error?
>
> Tobias
>
>
>
>
>
> On Mon, Jun 30, 2014 at 1:56 P
> while waiting for processing, so old data was deleted. When it was
> time to process that data, it didn't exist any more. Is that a
> possible reason in your case?
>
> Tobias
>
> On Sat, Jun 28, 2014 at 5:59 AM, Bill Jay
> wrote:
> > Hi,
> >
> > I a
Hi,
I am running a Spark streaming job with 1 minute as the batch size. It ran for
around 84 minutes and was killed because of an exception with the
following information:
*java.lang.Exception: Could not compute split, block input-0-1403893740400
not found*
Before it was killed, it was able to cor
allelize() can
> be used to make an RDD from a List or Array of objects. But that's not
> really related to streaming or updating a Map.
>
> On Thu, Jun 26, 2014 at 1:40 PM, Bill Jay
> wrote:
> > Hi all,
> >
> > I am current working on a project that requires to t
Hi all,
I am currently working on a project that requires transforming each RDD in a
DStream into a Map. Basically, when we get a list of data in each batch, we
would like to update a global map. I would like to return the map as a
single RDD.
I am currently trying to use the function *transform*.
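A rough sketch of that idea, assuming a DStream of (String, Long) pairs named keyed: fold each batch into a driver-side map and republish it as a single-element RDD. Everything here is illustrative rather than the thread's final solution, and keeping the whole map on the driver only suits small maps:

    import org.apache.spark.SparkContext._

    // Global map lives on the driver and is updated once per batch.
    var globalMap = Map.empty[String, Long]
    val mapStream = keyed.transform { rdd =>
      val batchCounts = rdd.reduceByKey(_ + _).collectAsMap()
      globalMap = batchCounts.foldLeft(globalMap) { case (acc, (k, v)) =>
        acc.updated(k, acc.getOrElse(k, 0L) + v)
      }
      // Return the updated map as a single-element RDD for downstream use.
      rdd.sparkContext.parallelize(Seq(globalMap))
    }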