Another option that would avoid a shuffle would be to use assign and
coalesce, running two separate streams.
spark.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "...")
.option("assign", """{t0: {"0": xxxx}, t1:{"0": xxxxx}}""")
.load()
.coalesce(1)
.writeStream
.foreach(... code to write to cassandra ...)
spark.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "...")
.option("assign", """{t0: {"1": xxxx}, t1:{"1": xxxxx}}""")
.load()
.coalesce(1)
.writeStream
.foreach(... code to write to cassandra ...)
On Fri, Mar 17, 2017 at 7:35 AM, OUASSAIDI, Sami <[email protected]>
wrote:
> @Cody : Duly noted.
> @Michael Ambrust : A repartition is out of the question for our project as
> it would be a fairly expensive operation. We tried looking into targeting a
> specific executor so as to avoid this extra cost and directly have well
> partitioned data after consuming the kafka topics. Also we are using Spark
> streaming to save to the cassandra DB and try to keep shuffle operations to
> a strict minimum (at best none). As of now we are not entirely pleased with
> our current performances, that's why I'm doing a kafka topic sharding POC
> and getting the executor to handle the specificied partitions is central.
> ᐧ
>
> 2017-03-17 9:14 GMT+01:00 Michael Armbrust <[email protected]>:
>
>> Sorry, typo. Should be a repartition not a groupBy.
>>
>>
>>> spark.readStream
>>> .format("kafka")
>>> .option("kafka.bootstrap.servers", "...")
>>> .option("subscribe", "t0,t1")
>>> .load()
>>> .repartition($"partition")
>>> .writeStream
>>> .foreach(... code to write to cassandra ...)
>>>
>>
>
>
> --
> *Mind7 Consulting*
>
> Sami Ouassaid | Consultant Big Data | [email protected]
> __
>
> 64 Rue Taitbout, 75009 Paris
>