JulianJaffe commented on issue #9780: URL: https://github.com/apache/druid/issues/9780#issuecomment-4278238709
Hey @jtuglu1, I finally got a moment to push up the [3.x connector](https://github.com/JulianJaffe/druid/tree/spark_3_connector). There are a couple of gotchas:

- OSS Druid still doesn't return complete column info (in particular, whether a column is multi-valued), so I had to add a new catalog provider with slightly stubbed-out functionality.
- I hadn't touched these for a few years, and the latest versions I had were Spark 3.4 and Druid 0.30. I slightly upgraded to Spark 3.5, but there are a number of changes between Druid 0.30 and latest that need someone familiar with them to plumb through. This also had the unfortunate side effect of breaking a lot of tests, so for now I just didn't upload the broken tests.
- I don't have a Druid cluster handy to test these on either, but from what you've said the bones are the important piece, so hopefully this is useful.
- The main piece that could use some work is `DruidDataWriter`, which still has some incredibly inefficient uses of `IncrementalIndex`. It does the entirely unnecessary, wasteful, and garbage-generating dance of taking a complete and sorted set of rows from Spark, round-tripping them through incremental indices, flushing the incremental indices to disk, and then loading them back up to merge and finally push to deep storage. The Druid file format is well understood, and there's no reason not to just write the v9/v10 files directly without the overhead and ceremony.

Re the conversation you and Gian had above:

> With that structure, integrating as a Unix tool rather than a Java library, we'd run in a different JVM, presumably a newer version.

Spark really, _really_ does not do well with this approach, since there's no good way to tell a Spark executor "reserve this amount of off-heap memory but not for Spark tasks".

-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at: [email protected]
