Glad you got it worked out. That's cool as long as your use case doesn't
actually require e.g. partition 0 to always be scheduled to the same
executor across different batches.
On Tue, Mar 21, 2017 at 7:35 PM, OUASSAIDI, Sami wrote:
So it worked quite well with a coalesce; I was able to find a solution to
my problem. Although it does not directly control the executor, a good
workaround was to assign the desired partition to a specific stream through
the assign strategy and coalesce to a single partition, then repeat the
same process for the other stream.
Another option that would avoid a shuffle would be to use assign and
coalesce, running two separate streams.
spark.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "...")
.option("assign", """{"t0":[0],"t1":[0]}""")
.load()
.coalesce(1)
.writeStream
.foreach(... code to write to cassandra ...)
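A fuller sketch of the two-stream idea (the object and helper names are assumptions for illustration, not from this thread; the `assign` JSON format shown is the one the Kafka source documents, `{"topic":[partitions]}`):

```scala
// Sketch only: one independent streaming query per (topic, partition),
// each coalesced to a single Spark partition. The Spark wiring is left
// in comments since it needs a live cluster; the runnable part builds
// the "assign" option value (documented format: {"topic":[partitions]}).
object AssignSketch {
  // Hypothetical helper: "assign" JSON for a single topic-partition.
  def assignJson(topic: String, partition: Int): String =
    s"""{"$topic":[$partition]}"""

  // For each (topic, partition) pair, start its own query:
  //
  //   spark.readStream
  //     .format("kafka")
  //     .option("kafka.bootstrap.servers", "...")
  //     .option("assign", assignJson("t0", 0))
  //     .load()
  //     .coalesce(1)                 // everything in one Spark partition
  //     .writeStream
  //     .foreach(/* e.g. Cassandra writer */)
  //     .start()
}
```

Each query consumes exactly one Kafka partition, so ordering within that partition is preserved without a shuffle.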
@Cody : Duly noted.
@Michael Armbrust : A repartition is out of the question for our project as
it would be a fairly expensive operation. We tried looking into targeting a
specific executor so as to avoid this extra cost and directly have well
partitioned data after consuming the kafka topics. Also
Sorry, typo. Should be a repartition not a groupBy.
> spark.readStream
> .format("kafka")
> .option("kafka.bootstrap.servers", "...")
> .option("subscribe", "t0,t1")
> .load()
> .repartition($"partition")
> .writeStream
> .foreach(... code to write to cassandra ...)
>
I think it should be straightforward to express this using structured
streaming. You could ensure that data from a given partition ID is
processed serially by performing a group by on the partition column.
spark.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "...")
.option("subscribe", "t0,t1")
.load()
.groupBy($"partition")
.writeStream
.foreach(... code to write to cassandra ...)
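To make the per-partition grouping concrete, here is a small plain-Scala analogue (the record shape is an assumption): bucketing on the Kafka partition id collects each partition's records together, which is what repartitioning on the `partition` column achieves at cluster scale.

```scala
// Plain-Scala analogue of repartitioning on the Kafka "partition" column:
// every record for a given partition id lands in the same bucket, so a
// per-partition writer can process them serially. Record shape assumed.
case class Record(topic: String, partition: Int, value: String)

val records = Seq(
  Record("t0", 0, "a"), Record("t0", 1, "b"),
  Record("t0", 0, "c"), Record("t1", 0, "d")
)

// Seq#groupBy preserves input order within each bucket.
val byPartition: Map[Int, Seq[Record]] = records.groupBy(_.partition)
```

Note the caveat from later in the thread: this groups data by partition id within a batch, but does not pin a given partition to the same executor across batches.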
Spark just really isn't a good fit for trying to pin particular computation
to a particular executor, especially if you're relying on that for
correctness.
On Thu, Mar 16, 2017 at 7:16 AM, OUASSAIDI, Sami wrote:
Hi all,
So I need to specify how an executor should consume data from a kafka topic.
Let's say I have 2 topics: t0 and t1, with two partitions each, and two
executors e0 and e1 (both can be on the same node, so the assign strategy
does not work, since in the case of a multi-executor node it works based