I think this should be straightforward to express using Structured Streaming. You could ensure that data for a given partition ID is processed serially by grouping on the partition column:
spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "...")
  .option("subscribe", "t0,t1")
  .load()
  .groupBy($"partition")
  .writeStream
  .foreach(... code to write to cassandra ...)

(A fuller sketch of the foreach writer follows the quoted thread below.)

On Thu, Mar 16, 2017 at 8:10 AM, Cody Koeninger <c...@koeninger.org> wrote:

> Spark just really isn't a good fit for trying to pin particular
> computation to a particular executor, especially if you're relying on
> that for correctness.
>
> On Thu, Mar 16, 2017 at 7:16 AM, OUASSAIDI, Sami <sami.ouassa...@mind7.fr>
> wrote:
>
>> Hi all,
>>
>> So I need to specify how an executor should consume data from a Kafka
>> topic.
>>
>> Let's say I have 2 topics, t0 and t1, with two partitions each, and two
>> executors, e0 and e1 (both can be on the same node, so the assign
>> strategy does not work: with multiple executors on one node it comes
>> down to round-robin scheduling, and whichever executor is available
>> first consumes the topic partition).
>>
>> What I would like to do is make e0 consume partition 0 from both t0 and
>> t1, while e1 consumes partition 1 from both t0 and t1. Is there no way
>> around it except messing with the scheduling? If so, what's the best
>> approach?
>>
>> The reason for doing so is that the executors will write to a Cassandra
>> database, and since we will be in a parallelized context one executor
>> might "collide" with another and data would be lost. By assigning
>> partitions I want to force each executor to process its data
>> sequentially.
>>
>> Thanks,
>> Sami
>> --
>> *Mind7 Consulting*
>>
>> Sami Ouassaid | Consultant Big Data | sami.ouassa...@mind7.com
>> __
>>
>> 64 Rue Taitbout, 75009 Paris
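For what it's worth, here is a rough sketch of what the foreach sink could look like. It assumes the DataStax Java driver; the contact point, keyspace, table, and column names are placeholders. I've also used repartition on the partition column rather than groupBy, since groupBy needs an aggregation before you can call writeStream; repartition just co-locates each Kafka partition's rows in a single task:

import org.apache.spark.sql.{ForeachWriter, Row, SparkSession}
import com.datastax.driver.core.{Cluster, Session}

val spark = SparkSession.builder.appName("kafka-to-cassandra").getOrCreate()
import spark.implicits._

val query = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "...")   // placeholder
  .option("subscribe", "t0,t1")
  .load()
  .selectExpr("CAST(key AS STRING) AS key",
              "CAST(value AS STRING) AS value",
              "partition")
  // co-locate each Kafka partition's rows in a single task so they
  // are written serially by one executor
  .repartition($"partition")
  .writeStream
  .foreach(new ForeachWriter[Row] {
    // the writer is serialized to the executors, so open the
    // connection per partition/epoch rather than on the driver
    @transient private var cluster: Cluster = _
    @transient private var session: Session = _

    def open(partitionId: Long, version: Long): Boolean = {
      cluster = Cluster.builder().addContactPoint("...").build() // placeholder host
      session = cluster.connect()
      true
    }

    def process(row: Row): Unit = {
      // placeholder keyspace/table/columns -- adjust to your schema
      session.execute(
        "INSERT INTO my_keyspace.my_table (key, value) VALUES (?, ?)",
        row.getAs[String]("key"), row.getAs[String]("value"))
    }

    def close(errorOrNull: Throwable): Unit = {
      if (session != null) session.close()
      if (cluster != null) cluster.close()
    }
  })
  .start()

query.awaitTermination()

Because all rows for a given Kafka partition land in one task, open/process/close run serially for that partition within each trigger, which should avoid the write collisions described above without pinning work to a particular executor.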