I'm not sure about the accumulator approach; one possible approach which
might work (DISCLAIMER: a random thought) would be employing an RPC
endpoint on the driver side which receives such information from executors
and plays as a coordinator.
Beware that Spark's RPC implementation is package priv
I tried to create a data source, however our use case is bit hard as
we do only know the available offsets within the tasks, not on the
driver. I therefore planned to use accumulators in the
InputPartitionReader but they seem not to work.
Example accumulation is done here
https://github.com/kortem