Glad you got it worked out. That's cool as long as your use case doesn't
actually require e.g. partition 0 to always be scheduled to the same
executor across different batches.
On Tue, Mar 21, 2017 at 7:35 PM, OUASSAIDI, Sami wrote:
So it worked quite well with a coalesce; I was able to find a solution to
my problem. Although it does not directly control the executor, a good
workaround was to assign the desired partition to a specific stream through
the assign strategy and coalesce to a single partition, then repeat the
same process for the other stream.
Another option that would avoid a shuffle would be to use assign and
coalesce, running two separate streams.
spark.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "...")
.option("assign", """{"t0":[0],"t1":[0]}""")
.load()
.coalesce(1)
.writeStream
.foreach(... code to write to cassandra ...)
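A fuller sketch of the two-stream idea (the object and helper names are assumptions for illustration, not from this thread; the `assign` JSON format shown is the one the Kafka source documents, `{"topic":[partitions]}`):

```scala
// Sketch only: one independent streaming query per (topic, partition),
// each coalesced to a single Spark partition. The Spark wiring is left
// in comments since it needs a live cluster; the runnable part builds
// the "assign" option value (documented format: {"topic":[partitions]}).
object AssignSketch {
  // Hypothetical helper: "assign" JSON for a single topic-partition.
  def assignJson(topic: String, partition: Int): String =
    s"""{"$topic":[$partition]}"""

  // For each (topic, partition) pair, start its own query:
  //
  //   spark.readStream
  //     .format("kafka")
  //     .option("kafka.bootstrap.servers", "...")
  //     .option("assign", assignJson("t0", 0))
  //     .load()
  //     .coalesce(1)                 // everything in one Spark partition
  //     .writeStream
  //     .foreach(/* e.g. Cassandra writer */)
  //     .start()
}
```

Each query consumes exactly one Kafka partition, so ordering within that partition is preserved without a shuffle.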
@Cody : Duly noted.
@Michael Armbrust : A repartition is out of the question for our project as
it would be a fairly expensive operation. We tried looking into targeting a
specific executor so as to avoid this extra cost and directly have well
partitioned data after consuming the kafka topics. Also
Sorry, typo. Should be a repartition not a groupBy.
> spark.readStream
> .format("kafka")
> .option("kafka.bootstrap.servers", "...")
> .option("subscribe", "t0,t1")
> .load()
> .repartition($"partition")
> .writeStream
> .foreach(... code to write to cassandra ...)
>
I think it should be straightforward to express this using structured
streaming. You could ensure that data from a given partition ID is
processed serially by performing a group by on the partition column.
spark.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "...")
.option("subscribe", "t0,t1")
.load()
.groupBy($"partition")
.writeStream
.foreach(... code to write to cassandra ...)
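To make the per-partition grouping concrete, here is a small plain-Scala analogue (the record shape is an assumption): bucketing on the Kafka partition id collects each partition's records together, which is what repartitioning on the `partition` column achieves at cluster scale.

```scala
// Plain-Scala analogue of repartitioning on the Kafka "partition" column:
// every record for a given partition id lands in the same bucket, so a
// per-partition writer can process them serially. Record shape assumed.
case class Record(topic: String, partition: Int, value: String)

val records = Seq(
  Record("t0", 0, "a"), Record("t0", 1, "b"),
  Record("t0", 0, "c"), Record("t1", 0, "d")
)

// Seq#groupBy preserves input order within each bucket.
val byPartition: Map[Int, Seq[Record]] = records.groupBy(_.partition)
```

Note the caveat from later in the thread: this groups data by partition id within a batch, but does not pin a given partition to the same executor across batches.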
Spark just really isn't a good fit for trying to pin particular computation
to a particular executor, especially if you're relying on that for
correctness.
On Thu, Mar 16, 2017 at 7:16 AM, OUASSAIDI, Sami wrote:
Hi all,
So I need to specify how an executor should consume data from a kafka topic.
Let's say I have 2 topics: t0 and t1, with two partitions each, and two
executors e0 and e1 (both can be on the same node, so the assign strategy
does not work, since in the case of a multi-executor node it works based