Re: [I] Support directly reading and writing Druid data from Spark (druid)

via GitHub Thu, 23 Apr 2026 09:08:25 -0700


JulianJaffe commented on issue #9780:
URL: https://github.com/apache/druid/issues/9780#issuecomment-4305971861


   > @JulianJaffe Could spark.executor.memoryOverhead work here 
([docs](https://spark.apache.org/docs/latest/configuration.html))?
   
   It can reserve the space, but you'll eventually run into someone setting 
`spark.memory.offHeap.enabled` to true, or pulling very large responses 
through Netty (hopefully we won't encounter anyone using this from a streaming 
job, which would require off-heap storage for state). We can tell users to keep 
increasing their overhead allocation, but since the overhead can't be scaled 
up or down elastically over the course of the task's lifetime, this becomes 
more and more wasteful.
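
To make the waste concrete, here is a rough sketch of how the per-executor container request adds up. It assumes the Spark 3.x behavior described in the configuration docs (default overhead of 10% of executor memory with a 384 MiB floor, and `spark.memory.offHeap.size` counted on top of the overhead) and ignores PySpark memory. The function name and the example sizes are made up for illustration:

```python
# Documented Spark defaults for JVM executors (assumed here, Spark 3.x).
MIN_OVERHEAD_MIB = 384        # floor for spark.executor.memoryOverhead
OVERHEAD_DIVISOR = 10         # default overhead factor of 0.10


def container_request_mib(executor_memory_mib, overhead_mib=None, offheap_mib=0):
    """Estimate the memory (MiB) requested per executor container.

    Hypothetical helper: sums heap, overhead, and explicit off-heap size.
    The result is fixed when the executor launches and does not change
    for the lifetime of the task.
    """
    if overhead_mib is None:
        # Default: 10% of the heap, but never less than 384 MiB.
        overhead_mib = max(MIN_OVERHEAD_MIB, executor_memory_mib // OVERHEAD_DIVISOR)
    return executor_memory_mib + overhead_mib + offheap_mib


# An 8 GiB executor with default overhead.
default_request = container_request_mib(8192)

# The same executor sized for a worst-case 4 GiB of native buffers:
# every executor pays for the peak, even when the buffers sit idle.
peak_request = container_request_mib(8192, overhead_mib=4096)

print(default_request, peak_request, peak_request - default_request)
```

The difference between the two requests is held by every executor for its whole lifetime, whether or not the large responses ever arrive.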
   
   The other side of this is to ask what the Spark connector gets from Druid 
code. There's obviously a lot of value in being able to use the same code 
paths, but that value is primarily on the _read_ side, not the write side. 
Calling out to a separate JVM to write v9/v10 files using the Druid indexing 
code is much less efficient than just taking the columns or rows we already 
have in memory and writing them directly. The existing Druid indexers build 
intermediate representations to support querying, which are unnecessary in 
this case, and apply row/byte maxes that we can already target via Spark's 
AQE and the partitioning and ordering we request. To me, the separate JVM 
doesn't seem to help anywhere we use core Druid code in the Spark connector 
other than segment indexing on write (i.e. a place where we have strong 
incentives _not_ to use any of the mainline Druid code).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


