Hi, this is Lee Rhodes (lee...@apache.org) from the Apache DataSketches team.
I am pleased that there is interest in the Spark community in integrating our library more tightly into Spark! I would like to help if I can. Unfortunately, I am not fluent in Spark, so I am not going to be very useful for the Spark portion of any integration coding. My expertise is in the DataSketches Java library and the sketching algorithms, and I can certainly answer questions you might have about how the sketches work and how best to use the library, and relate some of the experiences we have had integrating our library with other database systems.

Please note that our library has been implemented in multiple languages: Java, C++, Python, and now Go (in development), and we have binary compatibility across all of these languages. This means that a back-end system can build sketches in C++ (for example), serialize them, and transport them to another system running a different language, where they can be interpreted and merged with sketches originating in the receiving system.

Database integrations that we are aware of include: PostgreSQL, Hive, Druid, Presto, Spark (HLL), GCHQ/Gaffer, Netflix/Atlas DB, Pinot, Iceberg, Vertica, Greenplum, ClickHouse, Impala, and recently GCP/BigQuery.

Current major users include: Microsoft Gray Labs, GCHQ, Canadian Communications Security Establishment (CSE), GameAnalytics, Visa, Nielsen, DataDog, Criteo, Imply, Permutive, ... (Because, as you are aware, we don't require any form of registration, we don't know about use cases unless someone tells us!)

We have a wide range of sketching algorithms for obtaining near-real-time results for queries that would otherwise require considerable compute resources and time. Feel free to peruse our website: datasketches.apache.org

I do want to mention that our little team is extremely limited in resources, and if anyone in the Spark community would like to learn more about the science and engineering of sketches, we could use the help!
Please consider contributing to the DataSketches project!

Cheers,
Lee.