Re: [DISCUSS] Proposal to Add Theta and Tuple Sketches to Spark SQL

Lee Rhodes Wed, 04 Jun 2025 15:57:04 -0700

Hi,
This is Lee Rhodes (lee...@apache.org) from the Apache DataSketches team.


I am pleased that there is interest in the Spark community in integrating our 
library more tightly into Spark!  I would like to help if I can. Unfortunately, 
I am not fluent in Spark, so I will not be much help with the Spark portion of 
any integration coding.

My expertise is in the DataSketches Java library and the sketching algorithms. 
I can certainly answer questions you might have about how the sketches work and 
how best to use the library, and I can relate some of the experiences we have 
had in integrating our library with other database systems.

Please note that our library has been implemented in multiple languages: Java, 
C++, Python and now Go (in development), and we have binary compatibility 
across all of these languages. This means that a back-end system can build 
sketches in C++ (for example), serialize them, and transport them to another 
system running a different language, where they can be interpreted and merged 
with sketches originating in the receiving system.
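
To make the build / serialize / transport / merge round trip concrete, here is 
a toy illustration in Python. It is only a sketch of the idea: it uses a simple 
"k minimum values" scheme, and the byte layout and every name in it (build, 
serialize, deserialize, merge, estimate, K) are assumptions made up for this 
example. It is not the DataSketches API, nor its actual binary format.

```python
import hashlib
import struct

K = 64                  # hashes retained per sketch (hypothetical toy parameter)
HASH_SPACE = 2.0 ** 64  # size of the 64-bit hash range


def hash64(item: str) -> int:
    """Map an item to a 64-bit integer using the first 8 bytes of SHA-256."""
    digest = hashlib.sha256(item.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def build(items, k=K):
    """Keep the k smallest distinct hash values seen (a KMV-style sketch)."""
    return sorted({hash64(x) for x in items})[:k]


def serialize(sketch) -> bytes:
    """Toy wire format: big-endian 4-byte count, then 8 bytes per hash."""
    return struct.pack(f">I{len(sketch)}Q", len(sketch), *sketch)


def deserialize(blob: bytes):
    """Inverse of serialize(); any language that can read bytes could do this."""
    (count,) = struct.unpack_from(">I", blob, 0)
    return list(struct.unpack_from(f">{count}Q", blob, 4))


def merge(a, b, k=K):
    """Union of two sketches: the k smallest hashes across both."""
    return sorted(set(a) | set(b))[:k]


def estimate(sketch, k=K):
    """Distinct-count estimate; exact while fewer than k hashes are held."""
    if len(sketch) < k:
        return float(len(sketch))
    kth_fraction = (sketch[k - 1] + 1) / HASH_SPACE
    return (k - 1) / kth_fraction


# "Back-end" system builds a sketch and ships it as bytes.
wire = serialize(build(f"user-{i}" for i in range(10_000)))

# "Receiving" system merges it with a sketch it built locally.
local = build(f"user-{i}" for i in range(5_000, 15_000))
union = merge(local, deserialize(wire))
print(round(estimate(union)))  # approximates the 15,000 distinct users
```

The property that matters here is that merging the deserialized sketch with the 
local one gives the same result as sketching the combined data directly, so 
the receiving system never needs the raw items.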

Database integrations that we are aware of include: PostgreSQL, Hive, Druid, 
Presto, Spark (HLL), GCHQ/Gaffer, Netflix/Atlas DB, Pinot, Iceberg, Vertica, 
Greenplum, ClickHouse, Impala, and recently GCP/BigQuery.

Current major users include: Microsoft Gray Labs, GCHQ, Canadian Communications 
Security Establishment (CSE), GameAnalytics, Visa, Nielsen, DataDog, Criteo, 
Imply, Permutive,...

(Because, as you are aware, we don't require any form of registration, we don't 
know about use cases unless someone tells us!) 

We have a wide range of sketching algorithms for obtaining near-real-time 
results for queries that would otherwise require considerable compute 
resources and time.  Feel free to peruse our website: datasketches.apache.org

I do want to mention that our little team is extremely limited in resources, 
and if anyone in the Spark community would like to learn more about the science 
and engineering of sketches, we could use the help!  Please consider 
contributing to the DataSketches project!

Cheers,
Lee.

