To be more explicit, the easiest thing to do in the short term is to
use your own instance of KafkaConsumer to get the offsets for the
timestamps you're interested in, using offsetsForTimes, and use those
as the start / end offsets. See
https://kafka.apache.org/10/javadoc/?org/apache/kafka/clients/c
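A minimal sketch of that lookup, assuming a broker at localhost:9092 and a
topic named "events" (both placeholders), using a plain KafkaConsumer:

import java.{util => ju}
import scala.collection.JavaConverters._
import org.apache.kafka.clients.consumer.KafkaConsumer
import org.apache.kafka.common.TopicPartition

val props = new ju.Properties()
props.put("bootstrap.servers", "localhost:9092")
props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
props.put("group.id", "offset-lookup")   // separate group, just for the lookup

val consumer = new KafkaConsumer[String, String](props)
val partitions = consumer.partitionsFor("events").asScala
  .map(pi => new TopicPartition(pi.topic, pi.partition))

// ask kafka for the earliest offset at or after the given epoch-millis timestamp
val ts = java.lang.Long.valueOf(1546300800000L)
val startOffsets = consumer.offsetsForTimes(partitions.map(_ -> ts).toMap.asJava).asScala
  .collect { case (tp, oat) if oat != null => tp -> oat.offset }
consumer.close()
// startOffsets (TopicPartition -> Long) can then be used as the start offsets for the stream
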
That article is pretty old. If you click through to the jira
mentioned in it, https://issues.apache.org/jira/browse/SPARK-18580 ,
you'll see it's been resolved.
On Wed, Jan 2, 2019 at 12:42 AM JF Chen wrote:
>
> yes, 10 is a very low value for testing initial rate.
> And from this article
> https:
the group id, as well
> as simply moving to use only a single one of our clusters. Neither of these
> were successful. I am not able to run a test against master now.
>
> Regards,
>
> Bryan Jeffrey
>
>
>
>
> On Thu, Aug 30, 2018 at 2:56 PM Cody Koeninger wro
0 PM, Guillermo Ortiz Fernández
wrote:
> I can't... do you think that it's a possible bug of this version?? from
> Spark or Kafka?
>
> On Wed, Aug 29, 2018 at 22:28, Cody Koeninger ()
> wrote:
>>
>> Are you able to try a recent version of spark?
>>
>
I doubt that fix will get backported to 2.3.x
Are you able to test against master? 2.4 with the fix you linked to
is likely to hit code freeze soon.
From a quick look at your code, I'm not sure why you're mapping over
an array of brokers. It seems like that would result in different
streams wi
Are you able to try a recent version of spark?
On Wed, Aug 29, 2018 at 2:10 AM, Guillermo Ortiz Fernández
wrote:
> I'm using Spark Streaming 2.0.1 with Kafka 0.10, and sometimes I get this
> exception and Spark dies.
>
> I couldn't see any error or problem among the machines, anybody has the
> reason
, Jun 14, 2018 at 3:33 PM, Bryan Jeffrey wrote:
> Cody,
>
> Where is that called in the driver? The only call I see from Subscribe is to
> load the offset from checkpoint.
>
> Get Outlook for Android
>
>
> From: Cody Koeninger
> Sent: Thur
s from checkpoint.
>
> Thank you!
>
> Bryan
>
> Get Outlook for Android
>
> ____
> From: Cody Koeninger
> Sent: Thursday, June 14, 2018 4:00:31 PM
> To: Bryan Jeffrey
> Cc: user
> Subject: Re: Kafka Offset Storage: Fetching O
The expectation is that you shouldn't have to manually load offsets
from kafka, because the underlying kafka consumer on the driver will
start at the offsets associated with the given group id.
That's the behavior I see with this example:
https://github.com/koeninger/kafka-exactly-once/blob/maste
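For what it's worth, a minimal Subscribe-based setup along those lines might
look like the sketch below (this is not the linked example itself; topic
name, group id and broker address are placeholders, and ssc is assumed to be
an existing StreamingContext):

import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.streaming.kafka010._
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> "localhost:9092",
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id" -> "my-group",   // the driver consumer resumes from this group's committed offsets
  "auto.offset.reset" -> "latest",
  "enable.auto.commit" -> (false: java.lang.Boolean)
)

val stream = KafkaUtils.createDirectStream[String, String](
  ssc, PreferConsistent, Subscribe[String, String](Seq("events"), kafkaParams))
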
As long as you aren't doing any spark operations that involve a
shuffle, the order you see in spark should be the same as the order in
the partition.
Can you link to a minimal code example that reproduces the issue?
On Wed, May 9, 2018 at 7:05 PM, karthikjay wrote:
> On the producer side, I make
Is this possibly related to the recent post on
https://issues.apache.org/jira/browse/SPARK-18057 ?
On Mon, Apr 16, 2018 at 11:57 AM, ARAVIND SETHURATHNAM <
asethurath...@homeaway.com.invalid> wrote:
> Hi,
>
> We have several structured streaming jobs (spark version 2.2.0) consuming
> from kafka a
Should be able to use the 0.8 kafka dstreams with a kafka 0.9 broker
On Fri, Mar 16, 2018 at 7:52 AM, kant kodali wrote:
> Hi All,
>
> is it possible to use Spark 2.3.0 along with Kafka 0.9.0.1?
>
> Thanks,
> kant
I can't speak for committers, but my guess is it's more likely for
DStreams in general to stop being supported before that particular
integration is removed.
On Sun, Feb 18, 2018 at 9:34 PM, naresh Goud wrote:
> Thanks Ted.
>
> I see createDirectStream is experimental as annotated with
> "org.ap
Have you tried passing in a Map that happens to have
strings for all the values? I haven't tested this, but the underlying
kafka consumer constructor is documented to take either strings or
objects as values, despite the static type.
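Untested, but a sketch of what strings-for-all-the-values might look like
(broker address, group id and the particular settings here are placeholders):

val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> "localhost:9092",
  "key.deserializer" -> "org.apache.kafka.common.serialization.StringDeserializer",
  "value.deserializer" -> "org.apache.kafka.common.serialization.StringDeserializer",
  "group.id" -> "example-group",
  "enable.auto.commit" -> "false",   // boolean / numeric configs passed in string form
  "auto.offset.reset" -> "latest"
)
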
On Wed, Jan 24, 2018 at 2:48 PM, Tecno Brain
wrote:
> Basically
ems this patch is not suitable for our problem.
>
> https://github.com/apache/spark/compare/master...koeninger:SPARK-17147
>
> wood.super
>
> Original Message
> From: namesuperwood
> To: Justin Miller
> Cc: user; Cody Koeninger
> Sent: Wednesday, January 24, 2018, 14:45
> Subject: Re: unconti
y to give that a shot too.
>
> I’ll try consuming from the failed offset if/when the problem manifests
> itself again.
>
> Thanks!
> Justin
>
>
> On Wednesday, January 17, 2018, Cody Koeninger wrote:
>>
>> That means the consumer on the executor tried to seek
That means the consumer on the executor tried to seek to the specified
offset, but the message that was returned did not have a matching
offset. If the executor can't get the messages the driver told it to
get, something's generally wrong.
What happens when you try to consume the particular faili
Do not add a dependency on kafka-clients, the spark-streaming-kafka
library has appropriate transitive dependencies.
Either version of the spark-streaming-kafka library should work with
1.0 brokers; what problems were you having?
On Mon, Dec 25, 2017 at 7:58 PM, Diogo Munaro Vieira
wrote:
> He
You can't create a network connection to kafka on the driver and then
serialize it to send to the executor. That's likely why you're getting
serialization errors.
Kafka producers are thread safe and designed for use as a singleton.
Use a lazy singleton instance of the producer on the executor, d
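A minimal sketch of that pattern, assuming String keys and values, a broker
at localhost:9092, and an existing DStream of ConsumerRecord named stream
(all placeholders):

import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

object KafkaSink {
  // created lazily, once per executor JVM, never serialized from the driver
  lazy val producer: KafkaProducer[String, String] = {
    val props = new Properties()
    props.put("bootstrap.servers", "localhost:9092")
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    new KafkaProducer[String, String](props)
  }
}

stream.foreachRDD { rdd =>
  rdd.foreachPartition { records =>
    val producer = KafkaSink.producer   // resolved on the executor
    records.foreach { r =>
      producer.send(new ProducerRecord[String, String]("output-topic", r.key, r.value))
    }
  }
}
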
Modern versions of postgres have upsert, i.e. insert into ... on
conflict ... do update
On Thu, Dec 14, 2017 at 11:26 AM, salemi wrote:
> Thank you for your respond.
> The approach loads just the data into the DB. I am looking for an approach
> that allows me to update existing entries in the DB a
Use foreachPartition(), get a connection from a JDBC connection pool,
and insert the data the same way you would in a non-spark program.
If you're only doing inserts, postgres COPY will be faster (e.g.
https://discuss.pivotal.io/hc/en-us/articles/204237003), but if you're
doing updates that's not
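Roughly, combining the two suggestions (the table, the column names, and the
ConnectionPool helper below are placeholders, and rdd is assumed to be an RDD
of (String, Long) pairs):

rdd.foreachPartition { rows =>
  val conn = ConnectionPool.get()   // hypothetical pool (e.g. HikariCP) set up once per executor
  val stmt = conn.prepareStatement(
    """insert into counts (key, cnt) values (?, ?)
      |on conflict (key) do update set cnt = excluded.cnt""".stripMargin)
  try {
    rows.foreach { case (key, cnt) =>
      stmt.setString(1, key)
      stmt.setLong(2, cnt)
      stmt.executeUpdate()
    }
  } finally {
    stmt.close()
    conn.close()   // returns the connection to the pool
  }
}
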
it useful for our needs”, do you
> mean to say the code like below?
>
> kafkaDStream.asInstanceOf[CanCommitOffsets].commitAsync(ranges)
>
>
>
>
>
> Best Regards
>
> Richard
>
>
>
>
>
> From: venkat
> Date: Thursday, November 30, 2017 at 8:16 PM
&
Are you talking about the broker version, or the kafka-clients artifact version?
On Thu, Nov 30, 2017 at 12:17 AM, Raghavendra Pandey
wrote:
> Just wondering if anyone has tried spark structured streaming kafka
> connector (2.2) with Kafka 0.11 or Kafka 1.0 version
>
> Thanks
> Raghav
--
You mentioned the 0.11 version; the latest version of the org.apache.kafka
kafka-clients artifact supported by DStreams is 0.10.0.1, for which it
has an appropriate dependency.
Are you manually depending on a different version of the kafka-clients artifact?
On Fri, Nov 24, 2017 at 7:39 PM, venks61176 wr
t Stream ?
>
> Thanks
> Jagadish
>
> On Wed, Nov 15, 2017 at 11:23 AM, Cody Koeninger wrote:
>>
>> spark.streaming.kafka.consumer.poll.ms is a spark configuration, not
>> a kafka parameter.
>>
>> see http://spark.apache.org/docs/latest/configuration.ht
spark.streaming.kafka.consumer.poll.ms is a spark configuration, not
a kafka parameter.
see http://spark.apache.org/docs/latest/configuration.html
On Tue, Nov 14, 2017 at 8:56 PM, jkagitala wrote:
> Hi,
>
> I'm trying to add spark-streaming to our kafka topic. But, I keep getting
> this error
>
As it says in SPARK-10320 and in the docs at
http://spark.apache.org/docs/latest/streaming-kafka-0-10-integration.html#consumerstrategies
, you can use SubscribePattern
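For reference, a sketch of SubscribePattern (the regex is a placeholder, and
kafkaParams / ssc are assumed to already exist):

import java.util.regex.Pattern
import org.apache.spark.streaming.kafka010._
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.SubscribePattern

val stream = KafkaUtils.createDirectStream[String, String](
  ssc,
  PreferConsistent,
  SubscribePattern[String, String](Pattern.compile("events_.*"), kafkaParams))
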
On Sun, Oct 29, 2017 at 3:56 PM, Ramanan, Buvana (Nokia - US/Murray
Hill) wrote:
> Hello Cody,
>
>
>
> As the stake holders of J
You should be able to pass a comma-separated string of topics to
subscribe; subscribePattern isn't necessary.
On Tue, Sep 19, 2017 at 2:54 PM, kant kodali wrote:
> got it! Sorry.
>
> On Tue, Sep 19, 2017 at 12:52 PM, Jacek Laskowski wrote:
>>
>> Hi,
>>
>> Use subscribepattern
>>
>> You haven't
Have you searched in jira, e.g.
https://issues.apache.org/jira/browse/SPARK-19185
On Mon, Sep 18, 2017 at 1:56 AM, HARSH TAKKAR wrote:
> Hi
>
> Changing spark version if my last resort, is there any other workaround for
> this problem.
>
>
> On Mon, Sep 18, 2017 at 11:43 AM pandees waran wrote:
If you want an "easy" but not particularly performant way to do it,
each org.apache.kafka.clients.consumer.ConsumerRecord
has a topic.
The topic is going to be the same for the entire partition as long as you
haven't shuffled, hence the examples on how to deal with it at a partition
level.
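Something like this sketch, assuming a direct stream of ConsumerRecord named
stream (the names are placeholders):

stream.foreachRDD { rdd =>
  rdd.foreachPartition { records =>
    if (records.hasNext) {
      val first = records.next()
      val topic = first.topic()   // same for every record in this spark partition, pre-shuffle
      // ... process `first` and the remaining `records` according to `topic`
    }
  }
}
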
On Fri
>
> val kafkaStreamRdd = kafkaStream.transform { rdd =>
> rdd.map(consumerRecord => (consumerRecord.key(), consumerRecord.value()))
> }
>
> On Mon, Aug 28, 2017 at 11:56 AM, swetha kasireddy <
> swethakasire...@gmail.com> wrote:
>
>> There is no difference i
Capacity" -> Integer.valueOf(96),
> "spark.streaming.kafka.consumer.cache.maxCapacity" -> Integer.valueOf(96),
>
> "spark.streaming.kafka.consumer.poll.ms" -> Integer.valueOf(1024),
>
>
> http://markmail.org/message/n4cdxwurlhf44q5x
>
> https://issues.apache.org/jira/browse/S
1. No, prefetched message offsets aren't exposed.
2. No, I'm not aware of any plans for sync commit, and I'm not sure
that makes sense. You have to be able to deal with repeat messages in
the event of failure in any case, so the only difference sync commit
would make would be (possibly) slower ru
Why are you setting consumer.cache.enabled to false?
On Fri, Aug 25, 2017 at 2:19 PM, SRK wrote:
> Hi,
>
> What would be the appropriate settings to run Spark with Kafka 10? My job
> works fine with Spark with Kafka 8 and with Kafka 8 cluster. But its very
> slow with Kafka 10 by using Kafka Dire
;latest",.
>
>
> Thanks!
>
> On Mon, Aug 21, 2017 at 9:06 AM, Cody Koeninger wrote:
>>
>> Yes, you can start from specified offsets. See ConsumerStrategy,
>> specifically Assign
>>
>>
>> http://spark.apache.org/docs/latest/streaming-kafka-0
Yes, you can start from specified offsets. See ConsumerStrategy,
specifically Assign
http://spark.apache.org/docs/latest/streaming-kafka-0-10-integration.html#your-own-data-store
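For example, a sketch of Assign with explicit starting offsets (the topic
name, partition numbers and offsets are placeholders loaded from your own
store, and kafkaParams / ssc are assumed to exist):

import org.apache.kafka.common.TopicPartition
import org.apache.spark.streaming.kafka010._
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Assign

val fromOffsets = Map(
  new TopicPartition("events", 0) -> 4200L,
  new TopicPartition("events", 1) -> 3800L)

val stream = KafkaUtils.createDirectStream[String, String](
  ssc,
  PreferConsistent,
  Assign[String, String](fromOffsets.keys.toList, kafkaParams, fromOffsets))
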
On Tue, Aug 15, 2017 at 1:18 PM, SRK wrote:
> Hi,
>
> How to force Spark Kafka Direct to start from the latest offse
org.apache.spark.streaming.kafka.KafkaCluster has methods
getLatestLeaderOffsets and getEarliestLeaderOffsets
On Mon, Aug 7, 2017 at 11:37 PM, shyla deshpande
wrote:
> Thanks TD.
>
> On Mon, Aug 7, 2017 at 8:59 PM, Tathagata Das
> wrote:
>>
>> I dont think there is any easier way.
>>
>> On Mon,
cted 99% of the time except that
> when there is an exception, I did not want the offsets to be committed.
>
> By Filtering for unsuccessful attempts, do you mean filtering the bad
> records...
>
>
>
>
>
>
> On Mon, Aug 7, 2017 at 9:59 AM, Cody Koeninger wrote:
&
but the offsets are committed all the time independent of whether
>> an exception was raised or not.
>>
>> It will be helpful if you can explain this behavior.
>>
>>
>> On Sun, Aug 6, 2017 at 5:19 PM, Cody Koeninger wrote:
>>>
>>> I mean that the ka
ing, that's why it's
> being overridden. It is the executors that do the mapping and saving to
> cassandra. The status of success or failure of this operation is known only
> on the executor and thats where I want to commit the kafka offsets. If this
> is not what I sould be doin
If your complaint is about offsets being committed that you didn't
expect... auto commit being false on executors shouldn't have anything
to do with that. Executors shouldn't be auto-committing, that's why
it's being overridden.
What you've said and the code you posted aren't really enough to
expl
The warnings regarding configuration on the executor are for the
executor kafka consumer, not the driver kafka consumer.
In general, the executor kafka consumers should consume exactly the
offsets the driver told them to, and not be involved in committing
offsets / part of the same group as t
Given the emphasis on structured streaming, I don't personally expect
a lot more work to be put into DStreams-based projects, outside of
bugfixes. The stable designation is kind of arbitrary at that point.
That 010 version wasn't developed until spark 2.0 timeframe, but you
can always try backporting
iate packages in LibraryDependencies ?
> which ones would have helped compile this ?
>
>
>
> On Sat, Jun 17, 2017 at 2:53 PM, karan alang wrote:
>>
>> Thanks, Cody .. yes, was able to fix that.
>>
>> On Sat, Jun 17, 2017 at 1:18 PM, Cody Koeninger
>> wrote
There are different projects for different versions of kafka,
spark-streaming-kafka-0-8 and spark-streaming-kafka-0-10
See
http://spark.apache.org/docs/latest/streaming-kafka-integration.html
On Fri, Jun 16, 2017 at 6:51 PM, karan alang wrote:
> I'm trying to compile kafka & Spark Streaming int
First thing I noticed: you should be using a singleton kafka producer,
not recreating one for every partition. It's thread safe.
On Tue, May 30, 2017 at 7:59 AM, Vikash Pareek
wrote:
> I am facing an issue related to spark streaming with kafka, my use case is as
> follow:
> 1. Spark streaming(DirectS
You don't need write ahead logs for direct stream.
On Tue, May 2, 2017 at 11:32 AM, kant kodali wrote:
> Hi All,
>
> I need some fault tolerance for my stateful computations and I am wondering
> why we need to enable writeAheadLogs for DirectStream like Kafka (for
> Indirect stream it makes sense
> an optional commit.
>
> Is that in fact the semantics here - i.e., calls to commitAsync are not
> actually guaranteed to succeed? If that's the case, the docs could really
> be a *lot* clearer about that.
>
> Thanks,
>
> DR
>
> On Fri, Apr 28, 2017
From that doc:
" However, Kafka is not transactional, so your outputs must still be
idempotent. "
On Fri, Apr 28, 2017 at 10:29 AM, David Rosenstrauch wrote:
> I'm doing a POC to test recovery with spark streaming from Kafka. I'm using
> the technique for storing the offsets in Kafka, as des
> executing the last consumed offset of the same partition was 200.000 - and so
> forth. This is the information I seek to get.
>
>> On 27 Apr 2017, at 20:11, Cody Koeninger wrote:
>>
>> Are you asking for commits for every message? Because that will kill
>&
es - and hence the questions.
>
>> On 26 Apr 2017, at 21:42, Cody Koeninger wrote:
>>
>> have you read
>>
>> http://spark.apache.org/docs/latest/streaming-kafka-0-10-integration.html#kafka-itself
>>
>> On Wed, Apr 26, 2017 at 1:17 PM, Dominik Safaric
>&
cluster in
>> standalone mode. Now in addition to streaming, I want to be able to run
>> spark batch job hourly and adhoc queries using Zeppelin.
>>
>> Can you please confirm that a standalone cluster is OK for this. Please
>> provide me some links to help me get start
The standalone cluster manager is fine for production. Don't use Yarn
or Mesos unless you already have another need for it.
On Wed, Apr 26, 2017 at 4:53 PM, anna stax wrote:
> Hi Sam,
>
> Thank you for the reply.
>
> What do you mean by
> I doubt people run spark in a single EC2 instance, certa
idea onto how to use the KafkaConsumer’s auto offset commits? Keep in
> mind that I do not care about exactly-once, hence having messages replayed is
> perfectly fine.
>
>> On 26 Apr 2017, at 19:26, Cody Koeninger wrote:
>>
>> What is it you're actually trying to ac
What is it you're actually trying to accomplish?
You can get topic, partition, and offset bounds from an offset range like
http://spark.apache.org/docs/latest/streaming-kafka-0-10-integration.html#obtaining-offsets
Timestamp isn't really a meaningful idea for a range of offsets.
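For reference, a minimal sketch of pulling those bounds out of each batch
(stream is assumed to be an existing direct stream):

import org.apache.spark.streaming.kafka010.{HasOffsetRanges, OffsetRange}

stream.foreachRDD { rdd =>
  // only valid on the RDD produced directly by the stream, before any transformation
  val ranges: Array[OffsetRange] = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  ranges.foreach { o =>
    println(s"${o.topic} ${o.partition} ${o.fromOffset} ${o.untilOffset}")
  }
}
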
On Tue, Apr 25
If you're talking about reading the same message multiple times in a
failure situation, see
https://github.com/koeninger/kafka-exactly-once
If you're talking about producing the same message multiple times in a
failure situation, keep an eye on
https://cwiki.apache.org/confluence/display/KAFKA/K
Glad you got it worked out. That's cool as long as your use case doesn't
actually require e.g. partition 0 to always be scheduled to the same
executor across different batches.
On Tue, Mar 21, 2017 at 7:35 PM, OUASSAIDI, Sami
wrote:
> So it worked quite well with a coalesce, I was able to find
You want spark.streaming.kafka.maxRatePerPartition for the direct stream.
On Sat, Mar 18, 2017 at 3:37 PM, Mal Edwin wrote:
>
> Hi,
> You can enable backpressure to handle this.
>
> spark.streaming.backpressure.enabled
> spark.streaming.receiver.maxRate
>
> Thanks,
> Edwin
>
> On Mar 18, 2017, 12
Probably easier if you show some more code, but if you just call
dstream.window(Seconds(60))
you didn't specify a slide duration, so it's going to default to your
batch duration of 1 second.
So yeah, if you're just using e.g. foreachRDD to output every message
in the window, every second it's going
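In other words, something like this sketch (dstream is assumed to already
exist, and the 60-second figures are just illustrative):

import org.apache.spark.streaming.Seconds

// 60 second window that also slides every 60 seconds, instead of every 1 second batch
val windowed = dstream.window(Seconds(60), Seconds(60))
windowed.foreachRDD { rdd =>
  // each RDD now covers one non-overlapping minute of data
  println(s"window count: ${rdd.count()}")
}
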
Spark just really isn't a good fit for trying to pin particular computation
to a particular executor, especially if you're relying on that for
correctness.
On Thu, Mar 16, 2017 at 7:16 AM, OUASSAIDI, Sami
wrote:
>
> Hi all,
>
> So I need to specify how an executor should consume data from a kafk
The kafka-0-8 and kafka-0-10 integrations have conflicting
dependencies. Last time I checked, Spark's doc publication puts
everything in one classpath, so publishing them both together
won't work. I thought there was already a Jira ticket related to
this, but a quick look didn't turn it up.
eeb Javed
<11besemja...@seecs.edu.pk> wrote:
> I just noticed that Spark version that I am using (2.0.2) is built with
> Scala 2.11. However I am using Kafka 0.8.2 built with Scala 2.10. Could this
> be the reason why we are getting this error?
>
> On Mon, Feb 20, 2017 at 5:50 PM, C
vent: " + inTime + ", " + outTime)
>
> }
>
> }
>
> }))
>
> }
>
> }
>
>
> On Mon, Feb 20, 2017 at 3:10 PM, Cody Koeninger wrote:
>>
>> That's an indication that the beginning offset for a given batch is
That's an indication that the beginning offset for a given batch is
higher than the ending offset, i.e. something is seriously wrong.
Are you doing anything at all odd with topics, e.g. deleting and
recreating them, using compacted topics, etc.?
Start off with a very basic stream over the same kaf
Not sure what to tell you at that point - maybe compare what is
present in ZK to a known working group.
On Tue, Feb 14, 2017 at 9:06 PM, Mohammad Kargar wrote:
> Yes offset nodes are in zk and I can get the values.
>
> On Feb 14, 2017 6:54 PM, "Cody Koeninger" wrote:
>
your spark job.
On Tue, Feb 14, 2017 at 8:40 PM, Mohammad Kargar wrote:
> I'm running 0.10 version and
>
> ./bin/kafka-consumer-groups.sh --zookeeper localhost:2181 --list
>
> lists the group.
>
> On Feb 14, 2017 6:34 PM, "Cody Koeninger" wrote:
>>
>&
It looks like you're creating a kafka producer on the driver, and
attempting to write the string representation of
stringIntegerJavaPairRDD. Instead, you probably want to be calling
stringIntegerJavaPairRDD.foreachPartition, so that producing to kafka
is being done on the executor.
Read
https://
Can you explain what wasn't successful and/or show code?
On Tue, Feb 14, 2017 at 6:03 PM, Mohammad Kargar wrote:
> As explained here, direct approach of integration between spark streaming
> and kafka does not update offsets in Zookeeper, hence Zookeeper-based Kafka
> monitoring tools will not sh
Pretty sure there was no 0.10.0.2 release of apache kafka. If that's
a Hortonworks-modified version, you may get better results asking in a
Hortonworks-specific forum. The Scala version of kafka shouldn't be
relevant either way, though.
On Wed, Feb 8, 2017 at 5:30 PM, u...@moosheimer.com wrote:
> Dea
You should not need to include jars for Kafka; the spark connectors
have the appropriate transitive dependency on the correct version.
On Sat, Feb 4, 2017 at 3:25 PM, Marco Mistroni wrote:
> Hi
> not sure if this will help at all, and pls take it with a pinch of salt as
> i dont have your setup
spark-streaming-kafka-0-10 has a transitive dependency on the kafka
library; you shouldn't need to include kafka explicitly.
What's your actual list of dependencies?
On Tue, Jan 31, 2017 at 3:49 PM, Marco Mistroni wrote:
> HI all
> i am trying to run a sample spark code which reads streaming d
Keep an eye on
https://cwiki.apache.org/confluence/display/KAFKA/KIP-98+-+Exactly+Once+Delivery+and+Transactional+Messaging
although it'll likely be a while
On Mon, Jan 30, 2017 at 3:41 PM, Tathagata Das
wrote:
> If you care about the semantics of those writes to Kafka, then you should be
> awa
s coverage and lag status.
On Tue, Jan 24, 2017 at 10:45 PM, Cody Koeninger wrote:
> When you said " I check the offset ranges from Kafka Manager and don't
> see any significant deltas.", what were you comparing it against? The
> offset ranges printed in spark logs?
>
&
her legacy app also writes the same results to a database. There are
> huge difference between DB and ES. I know how many records we process daily.
>
> Everything works fine if I run a job instance for each topic.
>
> On Tue, Jan 24, 2017 at 5:26 PM, Cody Koeninger wrote:
>>
>
e stream for all topics. I check the offset
> ranges from Kafka Manager and don't see any significant deltas.
>
> On Tue, Jan 24, 2017 at 4:42 AM, Cody Koeninger wrote:
>>
>> Are you using receiver-based or direct stream?
>>
>> Are you doing 1 stream per topic,
Can you identify the error case and call System.exit ? It'll get
retried on another executor, but as long as that one fails the same
way...
If you can identify the error case at the time you're doing database
interaction and just prevent data being written then, that's what I
typically do.
On Tu
Are you using receiver-based or direct stream?
Are you doing 1 stream per topic, or 1 stream for all topics?
If you're using the direct stream, the actual topics and offset ranges
should be visible in the logs, so you should be able to see more
detail about what's happening (e.g. all topics are s
Spark 2.2 hasn't been released yet, has it?
Python support in kafka dstreams for 0.10 is probably never coming;
there's a jira ticket about this.
As for stable, that's hard to say. It was quite a few releases before
0.8 was marked stable, even though it underwent little change.
On Wed, Jan 18, 2017 at 2:21 AM, Kara
Kafka is designed to only allow reads from leaders. You need to fix
this at the kafka level, not the spark level.
On Fri, Jan 6, 2017 at 7:33 AM, Raghu Vadapalli wrote:
>
> My spark 2.0 + kafka 0.8 streaming job fails with error partition leaderset
> exception. When I check the kafka topic the p
You can't change the batch time, but you can limit the number of items
in the batch
http://spark.apache.org/docs/latest/configuration.html
spark.streaming.backpressure.enabled
spark.streaming.kafka.maxRatePerPartition
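For instance, a sketch of setting both in the SparkConf (the rate of 1000
records per second per partition is just an illustrative number):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("kafka-rate-limited")   // placeholder app name
  .set("spark.streaming.backpressure.enabled", "true")
  .set("spark.streaming.kafka.maxRatePerPartition", "1000")
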
On Tue, Jan 3, 2017 at 4:00 AM, 周家帅 wrote:
> Hi,
>
> I am an intermediate sp
This doesn't sound like a question about Kafka streaming; it
sounds like confusion about the scope of variables in spark generally.
Is that right? If so, I'd suggest reading the documentation, starting
with a simple rdd (e.g. using sparkContext.parallelize), and
experimenting to confirm your u
Please post a minimal complete code example of what you are talking about
On Thu, Dec 15, 2016 at 6:00 PM, Michael Nguyen
wrote:
> I have the following sequence of Spark Java API calls (Spark 2.0.2):
>
> Kafka stream that is processed via a map function, which returns the string
> value from tupl
You certainly can use stable versions of Kafka brokers with spark
2.0.2; why would you think otherwise?
On Mon, Dec 12, 2016 at 8:53 AM, Amir Rahnama wrote:
> Hi,
>
> You need to describe more.
>
> For example, in Spark 2.0.2, you can't use stable versions of Apache Kafka.
>
> In general, I would
http://spark.apache.org/docs/latest/streaming-kafka-0-10-integration.html#creating-a-direct-stream
Use a separate group id for each stream, like the docs say.
If you're doing multiple output operations, and aren't caching, spark
is going to read from kafka again each time, and if some of those
re
"spark-core" % spark %
>> "provided",
>> "org.apache.spark" %% "spark-streaming"% spark %
>> "provided",
>> "org.apache.spark" %% "spark-mllib"% spark
When you say 0.10.1, do you mean the broker version only, or does your
assembly contain classes from the 0.10.1 kafka consumer?
On Fri, Dec 9, 2016 at 10:19 AM, debasishg wrote:
> Hello -
>
> I am facing some issues with the following snippet of code that reads from
> Kafka and creates DStream. I am u
eduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
> at org.apache.spark.scheduler.Task.run(Task.scala:85)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:11
Personally I think forcing the stream to fail (e.g. check offsets in
the downstream store and throw an exception if they aren't as expected) is
the safest thing to do.
If you proceed after a failure, you need a place to reliably record
the batches that failed for later processing.
On Wed, Dec 7, 2016 at
You do not need recent versions of spark, kafka, or structured
streaming in order to do this. Normal DStreams are sufficient.
You can parallelize your static data from the database to an RDD, and
there's a join method available on RDDs. Transforming a single given
timestamp line into multiple li
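A rough sketch of the RDD join idea (lookupRows, sc, and stream are
placeholders for the static rows already read from the database, the
SparkContext, and an existing DStream of ConsumerRecord[String, String]):

// static lookup data, e.g. Seq(("id1", "attrA"), ("id2", "attrB")), parallelized once
val lookupRdd = sc.parallelize(lookupRows)

stream
  .map(record => (record.key, record.value))         // key each batch by the same id
  .transform(batchRdd => batchRdd.join(lookupRdd))   // (id, (payload, attribute))
  .foreachRDD(joined => joined.take(10).foreach(println))
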
Have you read / watched the materials linked from
https://github.com/koeninger/kafka-exactly-once
On Mon, Dec 5, 2016 at 4:17 AM, Jörn Franke wrote:
> You need to do the book keeping of what has been processed yourself. This
> may mean roughly the following (of course the devil is in the details)
If you want a finer-grained max rate setting, SPARK-17510 got merged a
while ago. There's also SPARK-18580, which might help address the
issue of the starting backpressure rate for the first batch.
On Mon, Dec 5, 2016 at 4:18 PM, Liren Ding wrote:
> Hey all,
>
> Does backressure actually work on spark
uster or spark standalone) we get this issue.
>
> Heji
>
>
> On Sat, Nov 19, 2016 at 8:53 AM, Cody Koeninger wrote:
>>
>> I ran your example using the versions of kafka and spark you are
>> using, against a standalone cluster. This is what I observed:
>
So I haven't played around with streaming k-means at all, but given
that no one responded to your message a couple of days ago, I'll say
what I can.
1. Can you not sample out some % of the stream for training?
2. Can you run multiple streams at the same time with different values
for k and compare
There have definitely been issues with UI reporting for the direct
stream in the past, but I'm not able to reproduce this with 2.0.2 and
0.8. See below:
https://i.imgsafe.org/086019ae57.png
On Fri, Nov 18, 2016 at 4:38 AM, Julian Keppel
wrote:
> Hello,
>
> I use Spark 2.0.2 with Kafka integra
I ran your example using the versions of kafka and spark you are
using, against a standalone cluster. This is what I observed:
(in kafka working directory)
bash-3.2$ ./bin/kafka-run-class.sh kafka.tools.GetOffsetShell
--broker-list 'localhost:9092' --topic simple_logtest --time -2
simple_logtest
t;
> On Fri, Nov 18, 2016 at 1:50 AM, Cody Koeninger wrote:
>>
>> Ok, I don't think I'm clear on the issue then. Can you say what the
>> expected behavior is, and what the observed behavior is?
>>
>> On Thu, Nov 17, 2016 at 10:48 AM, Hoang Bao Thien
&g
t; batches, it could be any greater size as it does.
>
> Thien
>
> On Fri, Nov 18, 2016 at 1:17 AM, Cody Koeninger wrote:
>>
>> If you want a consistent limit on the size of batches, use
>> spark.streaming.kafka.maxRatePerPartition (assuming you're using
&g
gest, but I
> don't know what I can do in this case.
>
> Do you and other ones could suggest me some solutions please as this seems
> the normal situation with Kafka+SpartStreaming.
>
> Thanks.
> Alex
>
>
>
> On Thu, Nov 17, 2016 at 2:32 AM, Cody Koeninger wr
Streaming?
>
> On Wed, Nov 16, 2016 at 4:22 PM, Cody Koeninger wrote:
>>
>> Moved to user list.
>>
>> I'm not really clear on what you're trying to accomplish (why put the
>> csv file through Kafka instead of reading it directly with spark?)
>&g
Moved to user list.
I'm not really clear on what you're trying to accomplish (why put the
csv file through Kafka instead of reading it directly with spark?)
auto.offset.reset=largest just means that when starting the job
without any defined offsets, it will start at the highest (most
recent) avai