Re: [I] Support directly reading and writing Druid data from Spark (druid)

via GitHub Sun, 19 Apr 2026 23:02:38 -0700


JulianJaffe commented on issue #9780:
URL: https://github.com/apache/druid/issues/9780#issuecomment-4278238709


   Hey @jtuglu1, I finally got a moment to push up the [3.x 
connector](https://github.com/JulianJaffe/druid/tree/spark_3_connector). There 
are a couple gotchas:
   
   - OSS Druid still doesn't return complete column info (in particular, whether 
a column is multi-valued), so I had to add a new catalog provider with slightly 
stubbed-out functionality
   - I hadn't touched these for a few years, and the latest versions I had were 
Spark 3.4 and Druid 0.30. I slightly upgraded to Spark 3.5, but there are a number 
of changes between Druid 0.30 and latest that need someone familiar with the 
changes to plumb through. This also had the unfortunate side effect of breaking 
a lot of tests, so for now I just didn't upload the broken tests
   - I don't have a Druid cluster handy to test these on either, but from what 
you've said the bones are the important piece so hopefully this is useful
   - The main piece that could use some work is `DruidDataWriter`, which still 
has some incredibly inefficient uses of `IncrementalIndex` and does the entirely 
unnecessary, wasteful, and garbage-generating dance of taking a complete and 
sorted set of rows from Spark, round-tripping them through incremental indices, 
flushing the incremental indices to disk, and then loading them back up to merge 
and finally push to deep storage. The Druid file format is well understood and 
there's no reason not to just write the v9/v10 files directly without the 
overhead and ceremony.
   
   Re the conversation you and Gian had above:
   
   > With that structure, integrating as a Unix tool rather than a Java 
library, we'd run in a different JVM, presumably a newer version.
   
   Spark really, _really_ does not do well with this approach since there's no 
good way to tell a Spark executor "reserve this amount of off-heap memory but 
not for Spark tasks".
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


