Re: Structured streaming from Kafka by timestamp

2019-02-05 Thread Cody Koeninger
To be more explicit, the easiest thing to do in the short term is use your own instance of KafkaConsumer to get the offsets for the timestamps you're interested in, using offsetsForTimes, and use those for the start / end offsets. See https://kafka.apache.org/10/javadoc/?org/apache/kafka/clients/c
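
A minimal sketch of that lookup, assuming a broker at localhost:9092 and a topic named events (both hypothetical); the resulting per-partition offsets can then be rendered into the start / end offset options:

    import java.{lang => jl, util => ju}
    import scala.collection.JavaConverters._
    import org.apache.kafka.clients.consumer.KafkaConsumer
    import org.apache.kafka.common.TopicPartition

    val props = new ju.Properties()
    props.put("bootstrap.servers", "localhost:9092") // hypothetical broker
    props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
    props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
    val consumer = new KafkaConsumer[String, String](props)

    // one lookup timestamp (epoch millis) per partition of the topic
    val ts = jl.Long.valueOf(1549324800000L) // hypothetical start time
    val timestamps = consumer.partitionsFor("events").asScala
      .map(pi => new TopicPartition(pi.topic, pi.partition) -> ts)
      .toMap.asJava

    // offsetsForTimes returns the earliest offset whose timestamp is >= ts,
    // or null for partitions with no such message
    val offsets = consumer.offsetsForTimes(timestamps).asScala
      .collect { case (tp, oat) if oat != null => tp -> oat.offset }
    consumer.close()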

Re: Back pressure not working on streaming

2019-02-05 Thread Cody Koeninger
That article is pretty old. If you click through the link to the jira mentioned in it, https://issues.apache.org/jira/browse/SPARK-18580 , it's been resolved. On Wed, Jan 2, 2019 at 12:42 AM JF Chen wrote: > > yes, 10 is a very low value for testing initial rate. > And from this article > https:

Re: ConcurrentModificationExceptions with CachedKafkaConsumers

2018-08-31 Thread Cody Koeninger
the group id, as well > as simply moving to use only a single one of our clusters. Neither of these > were successful. I am not able to run a test against master now. > > Regards, > > Bryan Jeffrey > > > > > On Thu, Aug 30, 2018 at 2:56 PM Cody Koeninger wro

Re: Spark Streaming - Kafka. java.lang.IllegalStateException: This consumer has already been closed.

2018-08-30 Thread Cody Koeninger
0 PM, Guillermo Ortiz Fernández wrote: > I can't... do you think that it's a possible bug of this version?? from > Spark or Kafka? > > On Wed, Aug 29, 2018 at 22:28, Cody Koeninger () > wrote: >> >> Are you able to try a recent version of spark? >> >

Re: ConcurrentModificationExceptions with CachedKafkaConsumers

2018-08-30 Thread Cody Koeninger
I doubt that fix will get backported to 2.3.x. Are you able to test against master? 2.4 with the fix you linked to is likely to hit code freeze soon. From a quick look at your code, I'm not sure why you're mapping over an array of brokers. It seems like that would result in different streams wi

Re: Spark Streaming - Kafka. java.lang.IllegalStateException: This consumer has already been closed.

2018-08-29 Thread Cody Koeninger
Are you able to try a recent version of spark? On Wed, Aug 29, 2018 at 2:10 AM, Guillermo Ortiz Fernández wrote: > I'm using Spark Streaming 2.0.1 with Kafka 0.10, sometimes I get this > exception and Spark dies. > > I couldn't see any error or problem among the machines, anybody has the > reason

Re: Kafka Offset Storage: Fetching Offsets

2018-06-14 Thread Cody Koeninger
, Jun 14, 2018 at 3:33 PM, Bryan Jeffrey wrote: > Cody, > > Where is that called in the driver? The only call I see from Subscribe is to > load the offset from checkpoint. > > Get Outlook for Android > > > From: Cody Koeninger > Sent: Thur

Re: Kafka Offset Storage: Fetching Offsets

2018-06-14 Thread Cody Koeninger
s from checkpoint. > > Thank you! > > Bryan > > Get Outlook for Android > > ____ > From: Cody Koeninger > Sent: Thursday, June 14, 2018 4:00:31 PM > To: Bryan Jeffrey > Cc: user > Subject: Re: Kafka Offset Storage: Fetching O

Re: Kafka Offset Storage: Fetching Offsets

2018-06-14 Thread Cody Koeninger
The expectation is that you shouldn't have to manually load offsets from kafka, because the underlying kafka consumer on the driver will start at the offsets associated with the given group id. That's the behavior I see with this example: https://github.com/koeninger/kafka-exactly-once/blob/maste
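
A sketch of that setup (broker, topic, and group names are hypothetical, and streamingContext is assumed to exist); on restart, the driver consumer resumes from the offsets committed for the group id:

    import org.apache.kafka.common.serialization.StringDeserializer
    import org.apache.spark.streaming.kafka010._

    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "localhost:9092",
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "example-group", // offsets are tracked per group id
      "enable.auto.commit" -> (false: java.lang.Boolean)
    )

    val stream = KafkaUtils.createDirectStream[String, String](
      streamingContext,
      LocationStrategies.PreferConsistent,
      ConsumerStrategies.Subscribe[String, String](Seq("events"), kafkaParams)
    )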

Re: [Structured-Streaming][Beginner] Out of order messages with Spark kafka readstream from a specific partition

2018-05-10 Thread Cody Koeninger
As long as you aren't doing any spark operations that involve a shuffle, the order you see in spark should be the same as the order in the partition. Can you link to a minimal code example that reproduces the issue? On Wed, May 9, 2018 at 7:05 PM, karthikjay wrote: > On the producer side, I make

Re: Structured streaming: Tried to fetch $offset but the returned record offset was ${record.offset}"

2018-04-17 Thread Cody Koeninger
Is this possibly related to the recent post on https://issues.apache.org/jira/browse/SPARK-18057 ? On Mon, Apr 16, 2018 at 11:57 AM, ARAVIND SETHURATHNAM < asethurath...@homeaway.com.invalid> wrote: > Hi, > > We have several structured streaming jobs (spark version 2.2.0) consuming > from kafka a

Re: is it possible to use Spark 2.3.0 along with Kafka 0.9.0.1?

2018-03-16 Thread Cody Koeninger
Should be able to use the 0.8 kafka dstreams with a kafka 0.9 broker On Fri, Mar 16, 2018 at 7:52 AM, kant kodali wrote: > Hi All, > > is it possible to use Spark 2.3.0 along with Kafka 0.9.0.1? > > Thanks, > kant - To unsubscri

Re: KafkaUtils.createStream(..) is removed for API

2018-02-19 Thread Cody Koeninger
I can't speak for committers, but my guess is it's more likely for DStreams in general to stop being supported before that particular integration is removed. On Sun, Feb 18, 2018 at 9:34 PM, naresh Goud wrote: > Thanks Ted. > > I see createDirectStream is experimental as annotated with > "org.ap

Re: Providing Kafka configuration as Map of Strings

2018-01-24 Thread Cody Koeninger
Have you tried passing in a Map that happens to have string for all the values? I haven't tested this, but the underlying kafka consumer constructor is documented to take either strings or objects as values, despite the static type. On Wed, Jan 24, 2018 at 2:48 PM, Tecno Brain wrote: > Basically
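
For illustration, an all-string params map of the kind being suggested (untested, per the caveat above; values hypothetical):

    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "localhost:9092",
      "key.deserializer" -> "org.apache.kafka.common.serialization.StringDeserializer",
      "value.deserializer" -> "org.apache.kafka.common.serialization.StringDeserializer",
      "group.id" -> "example-group",
      "enable.auto.commit" -> "false" // a plain string rather than a java.lang.Boolean
    )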

Re: uncontinuous offset in kafka will cause the spark streaming failure

2018-01-24 Thread Cody Koeninger
ems this patch is not suitable for our problem. > > https://github.com/apache/spark/compare/master...koeninger:SPARK-17147 > > wood.super > > Original message > From: namesuperwood > To: Justin Miller > Cc: user; Cody Koeninger > Sent: Wednesday, January 24, 2018 14:45 > Subject: Re: unconti

Re: "Got wrong record after seeking to offset" issue

2018-01-18 Thread Cody Koeninger
y to give that a shot too. > > I’ll try consuming from the failed offset if/when the problem manifests > itself again. > > Thanks! > Justin > > > On Wednesday, January 17, 2018, Cody Koeninger wrote: >> >> That means the consumer on the executor tried to seek

Re: "Got wrong record after seeking to offset" issue

2018-01-17 Thread Cody Koeninger
That means the consumer on the executor tried to seek to the specified offset, but the message that was returned did not have a matching offset. If the executor can't get the messages the driver told it to get, something's generally wrong. What happens when you try to consume the particular faili

Re: Which kafka client to use with spark streaming

2017-12-26 Thread Cody Koeninger
Do not add a dependency on kafka-clients, the spark-streaming-kafka library has appropriate transitive dependencies. Either version of the spark-streaming-kafka library should work with 1.0 brokers; what problems were you having? On Mon, Dec 25, 2017 at 7:58 PM, Diogo Munaro Vieira wrote: > He

Re: How to properly execute `foreachPartition` in Spark 2.2

2017-12-18 Thread Cody Koeninger
You can't create a network connection to kafka on the driver and then serialize it to send it to the executor. That's likely why you're getting serialization errors. Kafka producers are thread safe and designed for use as a singleton. Use a lazy singleton instance of the producer on the executor, d
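
A sketch of the lazy-singleton pattern (broker, output topic, and the dstream's element type are all hypothetical):

    import java.util.Properties
    import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

    object ProducerSingleton {
      // created lazily, once per executor JVM, never serialized from the driver
      lazy val producer: KafkaProducer[String, String] = {
        val props = new Properties()
        props.put("bootstrap.servers", "localhost:9092") // hypothetical
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
        new KafkaProducer[String, String](props)
      }
    }

    dstream.foreachRDD { rdd =>
      rdd.foreachPartition { records =>
        // this closure runs on the executor, where the singleton is initialized
        records.foreach { r =>
          ProducerSingleton.producer.send(
            new ProducerRecord[String, String]("output-topic", r.toString))
        }
      }
    }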

Re: bulk upsert data batch from Kafka dstream into Postgres db

2017-12-14 Thread Cody Koeninger
Modern versions of postgres have upsert, ie insert into ... on conflict ... do update On Thu, Dec 14, 2017 at 11:26 AM, salemi wrote: > Thank you for your respond. > The approach loads just the data into the DB. I am looking for an approach > that allows me to update existing entries in the DB a

Re: bulk upsert data batch from Kafka dstream into Postgres db

2017-12-14 Thread Cody Koeninger
use foreachPartition(), get a connection from a jdbc connection pool, and insert the data the same way you would in a non-spark program. If you're only doing inserts, postgres COPY will be faster (e.g. https://discuss.pivotal.io/hc/en-us/articles/204237003), but if you're doing updates that's not
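
A sketch combining the two replies above: foreachPartition with a pooled jdbc connection, and postgres ON CONFLICT for the upsert. The table, the pool, and a dstream of (id, payload) string pairs are all hypothetical:

    dstream.foreachRDD { rdd =>
      rdd.foreachPartition { records =>
        val conn = ConnectionPool.getConnection() // hypothetical jdbc pool
        val stmt = conn.prepareStatement(
          "INSERT INTO events (id, payload) VALUES (?, ?) " +
          "ON CONFLICT (id) DO UPDATE SET payload = EXCLUDED.payload")
        try {
          records.foreach { case (id, payload) =>
            stmt.setString(1, id)
            stmt.setString(2, payload)
            stmt.addBatch()
          }
          stmt.executeBatch()
        } finally {
          stmt.close()
          conn.close() // returns the connection to the pool
        }
      }
    }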

Re: [Spark streaming] No assigned partition error during seek

2017-12-01 Thread Cody Koeninger
it useful for our needs”, do you > mean to say the code like below? > > kafkaDStream.asInstanceOf[CanCommitOffsets].commitAsync(ranges) > > > > > > Best Regards > > Richard > > > > > > From: venkat > Date: Thursday, November 30, 2017 at 8:16 PM

Re: Kafka version support

2017-11-30 Thread Cody Koeninger
Are you talking about the broker version, or the kafka-clients artifact version? On Thu, Nov 30, 2017 at 12:17 AM, Raghavendra Pandey wrote: > Just wondering if anyone has tried spark structured streaming kafka > connector (2.2) with Kafka 0.11 or Kafka 1.0 version > > Thanks > Raghav --

Re: [Spark streaming] No assigned partition error during seek

2017-11-30 Thread Cody Koeninger
You mentioned 0.11 version; the latest version of org.apache.kafka kafka-clients artifact supported by DStreams is 0.10.0.1, for which it has an appropriate dependency. Are you manually depending on a different version of the kafka-clients artifact? On Fri, Nov 24, 2017 at 7:39 PM, venks61176 wr

Re: Spark Streaming fails with unable to get records after polling for 512 ms

2017-11-17 Thread Cody Koeninger
t Stream ? > > Thanks > Jagadish > > On Wed, Nov 15, 2017 at 11:23 AM, Cody Koeninger wrote: >> >> spark.streaming.kafka.consumer.poll.ms is a spark configuration, not >> a kafka parameter. >> >> see http://spark.apache.org/docs/latest/configuration.ht

Re: Spark Streaming fails with unable to get records after polling for 512 ms

2017-11-15 Thread Cody Koeninger
spark.streaming.kafka.consumer.poll.ms is a spark configuration, not a kafka parameter. see http://spark.apache.org/docs/latest/configuration.html On Tue, Nov 14, 2017 at 8:56 PM, jkagitala wrote: > Hi, > > I'm trying to add spark-streaming to our kafka topic. But, I keep getting > this error >

Re: FW: Kafka Direct Stream - dynamic topic subscription

2017-10-29 Thread Cody Koeninger
As it says in SPARK-10320 and in the docs at http://spark.apache.org/docs/latest/streaming-kafka-0-10-integration.html#consumerstrategies , you can use SubscribePattern On Sun, Oct 29, 2017 at 3:56 PM, Ramanan, Buvana (Nokia - US/Murray Hill) wrote: > Hello Cody, > > > > As the stake holders of J
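
A sketch (the pattern is hypothetical; kafkaParams and streamingContext as elsewhere); topics created later that match the regex are picked up as the consumer refreshes its metadata:

    import java.util.regex.Pattern
    import org.apache.spark.streaming.kafka010._

    val stream = KafkaUtils.createDirectStream[String, String](
      streamingContext,
      LocationStrategies.PreferConsistent,
      ConsumerStrategies.SubscribePattern[String, String](
        Pattern.compile("events-.*"), kafkaParams)
    )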

Re: How to read from multiple kafka topics using structured streaming (spark 2.2.0)?

2017-09-19 Thread Cody Koeninger
You should be able to pass a comma separated string of topics to subscribe. subscribePattern isn't necessary On Tue, Sep 19, 2017 at 2:54 PM, kant kodali wrote: > got it! Sorry. > > On Tue, Sep 19, 2017 at 12:52 PM, Jacek Laskowski wrote: >> >> Hi, >> >> Use subscribepattern >> >> You haven't
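
For example (topic names hypothetical):

    val df = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092") // hypothetical
      .option("subscribe", "topic1,topic2") // comma separated, one source for both
      .load()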

Re: ConcurrentModificationException using Kafka Direct Stream

2017-09-18 Thread Cody Koeninger
Have you searched in jira, e.g. https://issues.apache.org/jira/browse/SPARK-19185 On Mon, Sep 18, 2017 at 1:56 AM, HARSH TAKKAR wrote: > Hi > > Changing spark version if my last resort, is there any other workaround for > this problem. > > > On Mon, Sep 18, 2017 at 11:43 AM pandees waran wrote:

Re: Multiple Kafka topics processing in Spark 2.2

2017-09-11 Thread Cody Koeninger
If you want an "easy" but not particularly performant way to do it, each org.apache.kafka.clients.consumer.ConsumerRecord has a topic. The topic is going to be the same for the entire partition as long as you haven't shuffled, hence the examples on how to deal with it at a partition level. On Fri
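
A sketch of the partition-level approach, assuming nothing upstream has shuffled:

    stream.foreachRDD { rdd =>
      rdd.foreachPartition { records =>
        val buffered = records.buffered
        if (buffered.hasNext) {
          // constant for the whole partition before any shuffle
          val topic = buffered.head.topic
          buffered.foreach { record =>
            // ... handle record according to topic ...
          }
        }
      }
    }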

Re: Slower performance while running Spark Kafka Direct Streaming with Kafka 10 cluster

2017-08-29 Thread Cody Koeninger
> > val kafkaStreamRdd = kafkaStream.transform { rdd => > rdd.map(consumerRecord => (consumerRecord.key(), consumerRecord.value())) > } > > On Mon, Aug 28, 2017 at 11:56 AM, swetha kasireddy < > swethakasire...@gmail.com> wrote: > >> There is no difference i

Re: Slower performance while running Spark Kafka Direct Streaming with Kafka 10 cluster

2017-08-28 Thread Cody Koeninger
Capacity" -> Integer.valueOf(96), > "spark.streaming.kafka.consumer.cache.maxCapacity" -> Integer.valueOf(96), > > "spark.streaming.kafka.consumer.poll.ms" -> Integer.valueOf(1024), > > > http://markmail.org/message/n4cdxwurlhf44q5x > > https://issues.apache.org/jira/browse/S

Re: Kafka Consumer Pre Fetch Messages + Async commits

2017-08-28 Thread Cody Koeninger
1. No, prefetched message offsets aren't exposed. 2. No, I'm not aware of any plans for sync commit, and I'm not sure that makes sense. You have to be able to deal with repeat messages in the event of failure in any case, so the only difference sync commit would make would be (possibly) slower ru

Re: Slower performance while running Spark Kafka Direct Streaming with Kafka 10 cluster

2017-08-25 Thread Cody Koeninger
Why are you setting consumer.cache.enabled to false? On Fri, Aug 25, 2017 at 2:19 PM, SRK wrote: > Hi, > > What would be the appropriate settings to run Spark with Kafka 10? My job > works fine with Spark with Kafka 8 and with Kafka 8 cluster. But its very > slow with Kafka 10 by using Kafka Dire

Re: How to force Spark Kafka Direct to start from the latest offset when the lag is huge in kafka 10?

2017-08-22 Thread Cody Koeninger
;latest",. > > > Thanks! > > On Mon, Aug 21, 2017 at 9:06 AM, Cody Koeninger wrote: >> >> Yes, you can start from specified offsets. See ConsumerStrategy, >> specifically Assign >> >> >> http://spark.apache.org/docs/latest/streaming-kafka-0

Re: How to force Spark Kafka Direct to start from the latest offset when the lag is huge in kafka 10?

2017-08-21 Thread Cody Koeninger
Yes, you can start from specified offsets. See ConsumerStrategy, specifically Assign http://spark.apache.org/docs/latest/streaming-kafka-0-10-integration.html#your-own-data-store On Tue, Aug 15, 2017 at 1:18 PM, SRK wrote: > Hi, > > How to force Spark Kafka Direct to start from the latest offse
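
A sketch with explicit starting offsets (topic, partitions, and offset values hypothetical; kafkaParams and streamingContext as elsewhere):

    import org.apache.kafka.common.TopicPartition
    import org.apache.spark.streaming.kafka010._

    val fromOffsets = Map(
      new TopicPartition("events", 0) -> 1000000L,
      new TopicPartition("events", 1) -> 1000000L
    )

    val stream = KafkaUtils.createDirectStream[String, String](
      streamingContext,
      LocationStrategies.PreferConsistent,
      ConsumerStrategies.Assign[String, String](
        fromOffsets.keys.toSeq, kafkaParams, fromOffsets)
    )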

Re: KafkaUtils.createRDD , How do I read all the data from kafka in a batch program for a given topic?

2017-08-09 Thread Cody Koeninger
org.apache.spark.streaming.kafka.KafkaCluster has methods getLatestLeaderOffsets and getEarliestLeaderOffsets On Mon, Aug 7, 2017 at 11:37 PM, shyla deshpande wrote: > Thanks TD. > > On Mon, Aug 7, 2017 at 8:59 PM, Tathagata Das > wrote: >> >> I dont think there is any easier way. >> >> On Mon,
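
With the earliest/latest bounds in hand, the batch read itself is roughly this (0.8 integration; topic, partition, and offsets hypothetical, kafkaParams assumed to be a Map[String, String]):

    import kafka.serializer.StringDecoder
    import org.apache.spark.streaming.kafka.{KafkaUtils, OffsetRange}

    // one range per topic/partition, read as an ordinary batch RDD
    val ranges = Array(OffsetRange("events", 0, 0L, 500000L))
    val rdd = KafkaUtils.createRDD[String, String, StringDecoder, StringDecoder](
      sc, kafkaParams, ranges)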

Re: kafka settting, enable.auto.commit to false is being overridden and I lose data. please help!

2017-08-07 Thread Cody Koeninger
cted 99% of the time except that > when there is an exception, I did not want the offsets to be committed. > > By Filtering for unsuccessful attempts, do you mean filtering the bad > records... > > > > > > > On Mon, Aug 7, 2017 at 9:59 AM, Cody Koeninger wrote:

Re: kafka settting, enable.auto.commit to false is being overridden and I lose data. please help!

2017-08-07 Thread Cody Koeninger
but the offsets are committed all the time independent of whether >> an exception was raised or not. >> >> It will be helpful if you can explain this behavior. >> >> >> On Sun, Aug 6, 2017 at 5:19 PM, Cody Koeninger wrote: >>> >>> I mean that the ka

Re: kafka settting, enable.auto.commit to false is being overridden and I lose data. please help!

2017-08-06 Thread Cody Koeninger
ing, that's why it's > being overridden. It is the executors that do the mapping and saving to > cassandra. The status of success or failure of this operation is known only > on the executor and thats where I want to commit the kafka offsets. If this > is not what I sould be doin

Re: kafka settting, enable.auto.commit to false is being overridden and I lose data. please help!

2017-08-06 Thread Cody Koeninger
If your complaint is about offsets being committed that you didn't expect... auto commit being false on executors shouldn't have anything to do with that. Executors shouldn't be auto-committing, that's why it's being overridden. What you've said and the code you posted isn't really enough to expl
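
The idiomatic shape, committing only after the batch's work has succeeded (a sketch; the save step is a placeholder):

    import org.apache.spark.streaming.kafka010.{CanCommitOffsets, HasOffsetRanges}

    stream.foreachRDD { rdd =>
      // capture this batch's offset ranges on the driver
      val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
      // ... map and save to cassandra; an exception here aborts the batch ...
      // reached only if the work above completed
      stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
    }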

Re: Spark streaming giving me a bunch of WARNINGS, please help me understand them

2017-07-10 Thread Cody Koeninger
The warnings regarding configuration on the executor are for the executor kafka consumer, not the driver kafka consumer. In general, the executor kafka consumers should consume only exactly the offsets the driver told them to, and not be involved in committing offsets / part of the same group as t

Re: The stability of Spark Stream Kafka 010

2017-06-29 Thread Cody Koeninger
Given the emphasis on structured streaming, I don't personally expect a lot more work being put into DStreams-based projects, outside of bugfixes. Stable designation is kind of arbitrary at that point. That 010 version wasn't developed until spark 2.0 timeframe, but you can always try backporting

Re: Spark-Kafka integration - build failing with sbt

2017-06-19 Thread Cody Koeninger
iate packages in LibraryDependencies ? > which ones would have helped compile this ? > > > > On Sat, Jun 17, 2017 at 2:53 PM, karan alang wrote: >> >> Thanks, Cody .. yes, was able to fix that. >> >> On Sat, Jun 17, 2017 at 1:18 PM, Cody Koeninger >> wrote

Re: Spark-Kafka integration - build failing with sbt

2017-06-17 Thread Cody Koeninger
There are different projects for different versions of kafka, spark-streaming-kafka-0-8 and spark-streaming-kafka-0-10 See http://spark.apache.org/docs/latest/streaming-kafka-integration.html On Fri, Jun 16, 2017 at 6:51 PM, karan alang wrote: > I'm trying to compile kafka & Spark Streaming int

Re: Message getting lost in Kafka + Spark Streaming

2017-05-30 Thread Cody Koeninger
First thing I noticed, you should be using a singleton kafka producer, not recreating one for every partition. It's threadsafe. On Tue, May 30, 2017 at 7:59 AM, Vikash Pareek wrote: > I am facing an issue related to spark streaming with kafka, my use case is as > follow: > 1. Spark streaming(DirectS

Re: do we need to enable writeAheadLogs for DirectStream as well or is it only for indirect stream?

2017-05-02 Thread Cody Koeninger
You don't need write ahead logs for direct stream. On Tue, May 2, 2017 at 11:32 AM, kant kodali wrote: > Hi All, > > I need some fault tolerance for my stateful computations and I am wondering > why we need to enable writeAheadLogs for DirectStream like Kafka (for > Indirect stream it makes sense

Re: Exactly-once semantics with kakfa CanCommitOffsets.commitAsync?

2017-04-28 Thread Cody Koeninger
> an optional commit. > > Is that in fact the semantics here - i.e., calls to commitAsync are not > actually guaranteed to succeed? If that's the case, the docs could really > be a *lot* clearer about that. > > Thanks, > > DR > > On Fri, Apr 28, 2017

Re: Exactly-once semantics with kakfa CanCommitOffsets.commitAsync?

2017-04-28 Thread Cody Koeninger
From that doc: " However, Kafka is not transactional, so your outputs must still be idempotent. " On Fri, Apr 28, 2017 at 10:29 AM, David Rosenstrauch wrote: > I'm doing a POC to test recovery with spark streaming from Kafka. I'm using > the technique for storing the offsets in Kafka, as des

Re: Spark Streaming 2.1 Kafka consumer - retrieving offset commits for each poll

2017-04-27 Thread Cody Koeninger
> executing the last consumed offset of the same partition was 200.000 - and so > forth. This is the information I seek to get. > >> On 27 Apr 2017, at 20:11, Cody Koeninger wrote: >> >> Are you asking for commits for every message? Because that will kill >

Re: Spark Streaming 2.1 Kafka consumer - retrieving offset commits for each poll

2017-04-27 Thread Cody Koeninger
es - and hence the questions. > >> On 26 Apr 2017, at 21:42, Cody Koeninger wrote: >> >> have you read >> >> http://spark.apache.org/docs/latest/streaming-kafka-0-10-integration.html#kafka-itself >> >> On Wed, Apr 26, 2017 at 1:17 PM, Dominik Safaric >

Re: help/suggestions to setup spark cluster

2017-04-27 Thread Cody Koeninger
cluster in >> standalone mode. Now in addition to streaming, I want to be able to run >> spark batch job hourly and adhoc queries using Zeppelin. >> >> Can you please confirm that a standalone cluster is OK for this. Please >> provide me some links to help me get start

Re: help/suggestions to setup spark cluster

2017-04-26 Thread Cody Koeninger
The standalone cluster manager is fine for production. Don't use Yarn or Mesos unless you already have another need for it. On Wed, Apr 26, 2017 at 4:53 PM, anna stax wrote: > Hi Sam, > > Thank you for the reply. > > What do you mean by > I doubt people run spark in a. Single EC2 instance, certa

Re: Spark Streaming 2.1 Kafka consumer - retrieving offset commits for each poll

2017-04-26 Thread Cody Koeninger
idea onto how to use the KafkaConsumer’s auto offset commits? Keep in > mind that I do not care about exactly-once, hence having messages replayed is > perfectly fine. > >> On 26 Apr 2017, at 19:26, Cody Koeninger wrote: >> >> What is it you're actually trying to ac

Re: Spark Streaming 2.1 Kafka consumer - retrieving offset commits for each poll

2017-04-26 Thread Cody Koeninger
What is it you're actually trying to accomplish? You can get topic, partition, and offset bounds from an offset range like http://spark.apache.org/docs/latest/streaming-kafka-0-10-integration.html#obtaining-offsets Timestamp isn't really a meaningful idea for a range of offsets. On Tue, Apr 25
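
Per those docs, the bounds come off each batch's rdd, before any other transformation:

    import org.apache.spark.streaming.kafka010.{HasOffsetRanges, OffsetRange}

    stream.foreachRDD { rdd =>
      val ranges: Array[OffsetRange] = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
      ranges.foreach { r =>
        println(s"${r.topic} ${r.partition} ${r.fromOffset} ${r.untilOffset}")
      }
    }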

Re: Spark streaming to kafka exactly once

2017-03-22 Thread Cody Koeninger
If you're talking about reading the same message multiple times in a failure situation, see https://github.com/koeninger/kafka-exactly-once If you're talking about producing the same message multiple times in a failure situation, keep an eye on https://cwiki.apache.org/confluence/display/KAFKA/K

Re: [Spark Streaming+Kafka][How-to]

2017-03-22 Thread Cody Koeninger
Glad you got it worked out. That's cool as long as your use case doesn't actually require e.g. partition 0 to always be scheduled to the same executor across different batches. On Tue, Mar 21, 2017 at 7:35 PM, OUASSAIDI, Sami wrote: > So it worked quite well with a coalesce, I was able to find

Re: Spark Streaming from Kafka, deal with initial heavy load.

2017-03-20 Thread Cody Koeninger
You want spark.streaming.kafka.maxRatePerPartition for the direct stream. On Sat, Mar 18, 2017 at 3:37 PM, Mal Edwin wrote: > > Hi, > You can enable backpressure to handle this. > > spark.streaming.backpressure.enabled > spark.streaming.receiver.maxRate > > Thanks, > Edwin > > On Mar 18, 2017, 12

Re: Streaming 2.1.0 - window vs. batch duration

2017-03-17 Thread Cody Koeninger
Probably easier if you show some more code, but if you just call dstream.window(Seconds(60)) you didn't specify a slide duration, so it's going to default to your batch duration of 1 second. So yeah, if you're just using e.g. foreachRDD to output every message in the window, every second it's going
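
In other words, with a 1 second batch duration:

    import org.apache.spark.streaming.Seconds

    // slide defaults to the batch duration, so this emits every second,
    // each time covering the previous 60 seconds (overlapping output)
    val sliding = dstream.window(Seconds(60))

    // non-overlapping 60 second windows: make the slide match the window
    val tumbling = dstream.window(Seconds(60), Seconds(60))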

Re: [Spark Streaming+Kafka][How-to]

2017-03-16 Thread Cody Koeninger
Spark just really isn't a good fit for trying to pin particular computation to a particular executor, especially if you're relying on that for correctness. On Thu, Mar 16, 2017 at 7:16 AM, OUASSAIDI, Sami wrote: > > Hi all, > > So I need to specify how an executor should consume data from a kafk

Re: [Spark Kafka] API Doc pages for Kafka 0.10 not current

2017-02-28 Thread Cody Koeninger
The kafka-0-8 and kafka-0-10 integrations have conflicting dependencies. Last time I checked, Spark's doc publication puts everything all in one classpath, so publishing them both together won't work. I thought there was already a Jira ticket related to this, but a quick look didn't turn it up.

Re: Why does Spark Streaming application with Kafka fail with “requirement failed: numRecords must not be negative”?

2017-02-22 Thread Cody Koeninger
eeb Javed <11besemja...@seecs.edu.pk> wrote: > I just noticed that Spark version that I am using (2.0.2) is built with > Scala 2.11. However I am using Kafka 0.8.2 built with Scala 2.10. Could this > be the reason why we are getting this error? > > On Mon, Feb 20, 2017 at 5:50 PM, C

Re: Why does Spark Streaming application with Kafka fail with “requirement failed: numRecords must not be negative”?

2017-02-20 Thread Cody Koeninger
vent: " + inTime + ", " + outTime) > > } > > } > > })) > > } > > } > > > On Mon, Feb 20, 2017 at 3:10 PM, Cody Koeninger wrote: >> >> That's an indication that the beginning offset for a given batch is

Re: Why does Spark Streaming application with Kafka fail with “requirement failed: numRecords must not be negative”?

2017-02-20 Thread Cody Koeninger
That's an indication that the beginning offset for a given batch is higher than the ending offset, i.e. something is seriously wrong. Are you doing anything at all odd with topics, i.e. deleting and recreating them, using compacted topics, etc? Start off with a very basic stream over the same kaf

Re: streaming-kafka-0-8-integration (direct approach) and monitoring

2017-02-14 Thread Cody Koeninger
Not sure what to tell you at that point - maybe compare what is present in ZK to a known working group. On Tue, Feb 14, 2017 at 9:06 PM, Mohammad Kargar wrote: > Yes offset nodes are in zk and I can get the values. > > On Feb 14, 2017 6:54 PM, "Cody Koeninger" wrote: >

Re: streaming-kafka-0-8-integration (direct approach) and monitoring

2017-02-14 Thread Cody Koeninger
your spark job. On Tue, Feb 14, 2017 at 8:40 PM, Mohammad Kargar wrote: > I'm running 0.10 version and > > ./bin/kafka-consumer-groups.sh --zookeeper localhost:2181 --list > > lists the group. > > On Feb 14, 2017 6:34 PM, "Cody Koeninger" wrote: >> >

Re: Write JavaDStream to Kafka (how?)

2017-02-14 Thread Cody Koeninger
It looks like you're creating a kafka producer on the driver, and attempting to write the string representation of stringIntegerJavaPairRDD. Instead, you probably want to be calling stringIntegerJavaPairRDD.foreachPartition, so that producing to kafka is being done on the executor. Read https://

Re: streaming-kafka-0-8-integration (direct approach) and monitoring

2017-02-14 Thread Cody Koeninger
Can you explain what wasn't successful and/or show code? On Tue, Feb 14, 2017 at 6:03 PM, Mohammad Kargar wrote: > As explained here, direct approach of integration between spark streaming > and kafka does not update offsets in Zookeeper, hence Zookeeper-based Kafka > monitoring tools will not sh

Re: Spark 2.0 Scala 2.11 and Kafka 0.10 Scala 2.10

2017-02-08 Thread Cody Koeninger
Pretty sure there was no 0.10.0.2 release of apache kafka. If that's a hortonworks modified version you may get better results asking in a hortonworks specific forum. Scala version of kafka shouldn't be relevant either way though. On Wed, Feb 8, 2017 at 5:30 PM, u...@moosheimer.com wrote: > Dea

Re: Spark streaming: Could not initialize class kafka.consumer.FetchRequestAndResponseStatsRegistry$

2017-02-06 Thread Cody Koeninger
You should not need to include jars for Kafka, the spark connectors have the appropriate transitive dependency on the correct version. On Sat, Feb 4, 2017 at 3:25 PM, Marco Mistroni wrote: > Hi > not sure if this will help at all, and pls take it with a pinch of salt as > i dont have your setup

Re: Kafka dependencies in Eclipse project /Pls assist

2017-01-31 Thread Cody Koeninger
spark-streaming-kafka-0-10 has a transitive dependency on the kafka library, you shouldn't need to include kafka explicitly. What's your actual list of dependencies? On Tue, Jan 31, 2017 at 3:49 PM, Marco Mistroni wrote: > HI all > i am trying to run a sample spark code which reads streaming d
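
For reference, a single sbt line is enough, since the kafka library comes in transitively (version illustrative):

    libraryDependencies += "org.apache.spark" %% "spark-streaming-kafka-0-10" % "2.1.0"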

Re: mapWithState question

2017-01-30 Thread Cody Koeninger
Keep an eye on https://cwiki.apache.org/confluence/display/KAFKA/KIP-98+-+Exactly+Once+Delivery+and+Transactional+Messaging although it'll likely be a while On Mon, Jan 30, 2017 at 3:41 PM, Tathagata Das wrote: > If you care about the semantics of those writes to Kafka, then you should be > awa

Re: Spark streaming multiple kafka topic doesn't work at-least-once

2017-01-24 Thread Cody Koeninger
s coverage and lag status. On Tue, Jan 24, 2017 at 10:45 PM, Cody Koeninger wrote: > When you said " I check the offset ranges from Kafka Manager and don't > see any significant deltas.", what were you comparing it against? The > offset ranges printed in spark logs? >

Re: Spark streaming multiple kafka topic doesn't work at-least-once

2017-01-24 Thread Cody Koeninger
her legacy app also writes the same results to a database. There are > huge difference between DB and ES. I know how many records we process daily. > > Everything works fine if I run a job instance for each topic. > > On Tue, Jan 24, 2017 at 5:26 PM, Cody Koeninger wrote: >> >

Re: Spark streaming multiple kafka topic doesn't work at-least-once

2017-01-24 Thread Cody Koeninger
e stream for all topics. I check the offset > ranges from Kafka Manager and don't see any significant deltas. > > On Tue, Jan 24, 2017 at 4:42 AM, Cody Koeninger wrote: >> >> Are you using receiver-based or direct stream? >> >> Are you doing 1 stream per topic,

Re: Failure handling

2017-01-24 Thread Cody Koeninger
Can you identify the error case and call System.exit ? It'll get retried on another executor, but as long as that one fails the same way... If you can identify the error case at the time you're doing database interaction and just prevent data being written then, that's what I typically do. On Tu

Re: Spark streaming multiple kafka topic doesn't work at-least-once

2017-01-23 Thread Cody Koeninger
Are you using receiver-based or direct stream? Are you doing 1 stream per topic, or 1 stream for all topics? If you're using the direct stream, the actual topics and offset ranges should be visible in the logs, so you should be able to see more detail about what's happening (e.g. all topics are s

Re: Assembly for Kafka >= 0.10.0, Spark 2.2.0, Scala 2.11

2017-01-18 Thread Cody Koeninger
Spark 2.2 hasn't been released yet, has it? Python support in kafka dstreams for 0.10 is probably never, there's a jira ticket about this. Stable, hard to say. It was quite a few releases before 0.8 was marked stable, even though it underwent little change. On Wed, Jan 18, 2017 at 2:21 AM, Kara

Re: Kafka 0.8 + Spark 2.0 Partition Issue

2017-01-06 Thread Cody Koeninger
Kafka is designed to only allow reads from leaders. You need to fix this at the kafka level not the spark level. On Fri, Jan 6, 2017 at 7:33 AM, Raghu Vadapalli wrote: > > My spark 2.0 + kafka 0.8 streaming job fails with error partition leaderset > exception. When I check the kafka topic the p

Re: [Spark Kafka] How to update batch size of input dynamically for spark kafka consumer?

2017-01-03 Thread Cody Koeninger
You can't change the batch time, but you can limit the number of items in the batch http://spark.apache.org/docs/latest/configuration.html spark.streaming.backpressure.enabled spark.streaming.kafka.maxRatePerPartition On Tue, Jan 3, 2017 at 4:00 AM, 周家帅 wrote: > Hi, > > I am an intermediate sp
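
A sketch of those settings (the rate value is hypothetical and workload dependent):

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      // let spark adapt the ingestion rate to processing speed
      .set("spark.streaming.backpressure.enabled", "true")
      // hard cap: a batch holds at most
      // maxRatePerPartition * batch seconds * number of partitions records
      .set("spark.streaming.kafka.maxRatePerPartition", "1000")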

Re: Can't access the data in Kafka Spark Streaming globally

2016-12-23 Thread Cody Koeninger
This doesn't sound like a question regarding Kafka streaming, it sounds like confusion about the scope of variables in spark generally. Is that right? If so, I'd suggest reading the documentation, starting with a simple rdd (e.g. using sparkContext.parallelize), and experimenting to confirm your u

Re: Why foreachPartition function make duplicate invocation to map function for every message ? (Spark 2.0.2)

2016-12-16 Thread Cody Koeninger
Please post a minimal complete code example of what you are talking about On Thu, Dec 15, 2016 at 6:00 PM, Michael Nguyen wrote: > I have the following sequence of Spark Java API calls (Spark 2.0.2): > > Kafka stream that is processed via a map function, which returns the string > value from tupl

Re: Spark 2 or Spark 1.6.x?

2016-12-12 Thread Cody Koeninger
You certainly can use stable version of Kafka brokers with spark 2.0.2, why would you think otherwise? On Mon, Dec 12, 2016 at 8:53 AM, Amir Rahnama wrote: > Hi, > > You need to describe more. > > For example, in Spark 2.0.2, you can't use stable versions of Apache Kafka. > > In general, I would

Re: Spark Streaming with Kafka

2016-12-12 Thread Cody Koeninger
http://spark.apache.org/docs/latest/streaming-kafka-0-10-integration.html#creating-a-direct-stream Use a separate group id for each stream, like the docs say. If you're doing multiple output operations, and aren't caching, spark is going to read from kafka again each time, and if some of those re
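
A sketch of the caching point (the mapping and outputs are illustrative):

    // without this, each output operation below re-reads the batch from kafka
    val cached = stream.map(r => (r.key, r.value)).cache()
    cached.foreachRDD { rdd => println(s"output one: ${rdd.count} records") }
    cached.foreachRDD { rdd => println(s"output two: ${rdd.count} records") }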

Re: problem with kafka createDirectStream ..

2016-12-09 Thread Cody Koeninger
"spark-core" % spark % >> "provided", >> "org.apache.spark" %% "spark-streaming" % spark % >> "provided", >> "org.apache.spark" %% "spark-mllib" % spark

Re: problem with kafka createDirectStream ..

2016-12-09 Thread Cody Koeninger
When you say 0.10.1 do you mean broker version only, or does your assembly contain classes from the 0.10.1 kafka consumer? On Fri, Dec 9, 2016 at 10:19 AM, debasishg wrote: > Hello - > > I am facing some issues with the following snippet of code that reads from > Kafka and creates DStream. I am u

Re: Reprocessing failed jobs in Streaming job

2016-12-07 Thread Cody Koeninger
eduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47) > at org.apache.spark.scheduler.Task.run(Task.scala:85) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:11

Re: Reprocessing failed jobs in Streaming job

2016-12-07 Thread Cody Koeninger
Personally I think forcing the stream to fail (e.g. check offsets in downstream store and throw exception if they aren't as expected) is the safest thing to do. If you proceed after a failure, you need a place to reliably record the batches that failed for later processing. On Wed, Dec 7, 2016 at

Re: Spark Streaming - join streaming and static data

2016-12-06 Thread Cody Koeninger
You do not need recent versions of spark, kafka, or structured streaming in order to do this. Normal DStreams are sufficient. You can parallelize your static data from the database to an RDD, and there's a join method available on RDDs. Transforming a single given timestamp line into multiple li
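
A sketch of that join (keys and values hypothetical); both sides must be keyed the same way:

    // static lookup data, e.g. loaded once from the database on the driver
    val staticRdd = sc.parallelize(Seq(("id1", "meta1"), ("id2", "meta2")))

    val joined = keyedStream.transform { rdd =>
      // rdd is RDD[(String, String)], keyed like staticRdd
      rdd.join(staticRdd)
    }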

Re: Can spark support exactly once based kafka ? Due to these following question?

2016-12-05 Thread Cody Koeninger
Have you read / watched the materials linked from https://github.com/koeninger/kafka-exactly-once On Mon, Dec 5, 2016 at 4:17 AM, Jörn Franke wrote: > You need to do the book keeping of what has been processed yourself. This > may mean roughly the following (of course the devil is in the details)

Re: Back-pressure to Spark Kafka Streaming?

2016-12-05 Thread Cody Koeninger
If you want finer-grained max rate setting, SPARK-17510 got merged a while ago. There's also SPARK-18580 which might help address the issue of starting backpressure rate for the first batch. On Mon, Dec 5, 2016 at 4:18 PM, Liren Ding wrote: > Hey all, > > Does backressure actually work on spark

Re: Mac vs cluster Re: kafka 0.10 with Spark 2.02 auto.offset.reset=earliest will only read from a single partition on a multi partition topic

2016-11-19 Thread Cody Koeninger
uster or spark standalone) we get this issue. > > Heji > > > On Sat, Nov 19, 2016 at 8:53 AM, Cody Koeninger wrote: >> >> I ran your example using the versions of kafka and spark you are >> using, against a standalone cluster. This is what I observed: >

Re: using StreamingKMeans

2016-11-19 Thread Cody Koeninger
So I haven't played around with streaming k means at all, but given that no one responded to your message a couple of days ago, I'll say what I can. 1. Can you not sample out some % of the stream for training? 2. Can you run multiple streams at the same time with different values for k and compare

Re: Kafka direct approach,App UI shows wrong input rate

2016-11-19 Thread Cody Koeninger
There have definitely been issues with UI reporting for the direct stream in the past, but I'm not able to reproduce this with 2.0.2 and 0.8. See below: https://i.imgsafe.org/086019ae57.png On Fri, Nov 18, 2016 at 4:38 AM, Julian Keppel wrote: > Hello, > > I use Spark 2.0.2 with Kafka integra

Re: kafka 0.10 with Spark 2.02 auto.offset.reset=earliest will only read from a single partition on a multi partition topic

2016-11-19 Thread Cody Koeninger
I ran your example using the versions of kafka and spark you are using, against a standalone cluster. This is what I observed: (in kafka working directory) bash-3.2$ ./bin/kafka-run-class.sh kafka.tools.GetOffsetShell --broker-list 'localhost:9092' --topic simple_logtest --time -2 simple_logtest

Re: Kafka segmentation

2016-11-19 Thread Cody Koeninger
> > On Fri, Nov 18, 2016 at 1:50 AM, Cody Koeninger wrote: >> >> Ok, I don't think I'm clear on the issue then. Can you say what the >> expected behavior is, and what the observed behavior is? >> >> On Thu, Nov 17, 2016 at 10:48 AM, Hoang Bao Thien

Re: Kafka segmentation

2016-11-17 Thread Cody Koeninger
> batches, it could be any greater size as it does. > > Thien > > On Fri, Nov 18, 2016 at 1:17 AM, Cody Koeninger wrote: >> >> If you want a consistent limit on the size of batches, use >> spark.streaming.kafka.maxRatePerPartition (assuming you're using

Re: Kafka segmentation

2016-11-17 Thread Cody Koeninger
gest, but I > don't know what I can do in this case. > > Do you and other ones could suggest me some solutions please as this seems > the normal situation with Kafka+SpartStreaming. > > Thanks. > Alex > > > > On Thu, Nov 17, 2016 at 2:32 AM, Cody Koeninger wr

Re: Kafka segmentation

2016-11-16 Thread Cody Koeninger
Streaming? > > On Wed, Nov 16, 2016 at 4:22 PM, Cody Koeninger wrote: >> >> Moved to user list. >> >> I'm not really clear on what you're trying to accomplish (why put the >> csv file through Kafka instead of reading it directly with spark?) >

Re: Kafka segmentation

2016-11-16 Thread Cody Koeninger
Moved to user list. I'm not really clear on what you're trying to accomplish (why put the csv file through Kafka instead of reading it directly with spark?) auto.offset.reset=largest just means that when starting the job without any defined offsets, it will start at the highest (most recent) avai
