[I] [Discussion] Efficient Row Selection for Multi-Engine Support [datafusion]

via GitHub Fri, 21 Feb 2025 09:18:05 -0800


Arpit-Bandejiya opened a new issue, #14816:
URL: https://github.com/apache/datafusion/issues/14816


   <h2>Background</h2>We have a use case where data is stored in multiple 
engines/formats, with Parquet as the primary format containing all the data. 
Text queries are handled by an inverted index format, while numeric queries 
and aggregations are processed via Parquet files. Although the file formats 
differ, the data is sorted and stored in the same order across 
them.<br><br>We are using DataFusion to query Parquet files and are wondering 
whether the result of a query can be represented as a bit set of document 
positions (example below). Bit sets from the different engines can then be 
intersected to identify the documents that meet all the criteria. The resulting 
bit set can then be used to fetch the relevant documents from 
Parquet.<br><br>Example:<br><br>Assume we have the following data stored in 
a Parquet file:<br>
   colA | colB
   -- | --
   200 | Autumn leaves
   200 | Salty breeze
   100 | Misty mountains
   100 | Misty mountains
   200 | Velvet curtains
   
   Assume we have a query like <code>SELECT colB WHERE colA = 
100</code>.<br><br>The matching documents can be represented as the 
bit set 00110 (row positions are counted from the left). We want to use the 
matching-document information collected from any underlying engine to fetch the 
relevant documents from the Parquet file using DataFusion.<br><h2>What we explored</h2>One 
way we found to fetch specific rows in DataFusion is to 
create an access plan and pass it to ParquetExec. Because the plan must be 
complete before execution starts, we cannot begin collecting data from Parquet 
while matches are still being produced, nor run the two steps in parallel. 
This reduces overall query performance and is also memory-inefficient, because 
we must iterate over the complete stream of matches and convert it to the AccessPlan. 
<br><h2>Possible Solution</h2>If there were a way to either:<br><ol><li 
style="list-style-type:decimal">pass the iterator directly to DataFusion, 
or</li><li style="list-style-type:decimal">process the matching rows in 
batches,</li></ol>then it would 
enable on-demand conversion from the matching-rows iterator to a RowSelection 
in DataFusion, improving efficiency by reducing memory 
overhead.<br><h2>Questions</h2><ol><li style="list-style-type:decimal">Are 
there existing mechanisms in DataFusion to handle external iterators or row 
sources?</li><li style="list-style-type:decimal">What are the best practices 
for integrating DataFusion with external data sources in a streaming or batched 
manner?</li><li style="list-style-type:decimal">Are there any plans or ongoing 
work in the DataFusion project that might address this use case?</li><li 
style="list-style-type:decimal">Are there any alternative approaches or design patterns 
that might help us achieve efficient row selection in our multi-engine 
implementation?</li></ol>
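
To make the batched idea concrete, here is a minimal, language-neutral sketch in Python. It is not DataFusion code and does not call any DataFusion or parquet API. All names (`intersect_bitsets`, `matching_rows`, `runs_in_batches`) and the second bit set `01100` are hypothetical, chosen only for illustration. The sketch intersects two bit sets, then lazily converts the matching row positions into skip/select runs for one fixed-size window of rows at a time. These runs are the kind of information a RowSelection carries, assuming it is expressed as alternating "skip N rows" and "select N rows" entries.

```python
from typing import Iterable, Iterator, List, Optional, Tuple

# A run is ("skip" | "select", number_of_rows). Hypothetical representation.
Run = Tuple[str, int]


def intersect_bitsets(a: str, b: str) -> str:
    """AND two equal-length bit strings; the leftmost character is row 0."""
    if len(a) != len(b):
        raise ValueError("bit sets must cover the same number of rows")
    return "".join("1" if x == "1" and y == "1" else "0" for x, y in zip(a, b))


def matching_rows(bits: str) -> Iterator[int]:
    """Lazily yield the positions of the set bits."""
    return (i for i, bit in enumerate(bits) if bit == "1")


def runs_in_batches(
    rows: Iterable[int], total_rows: int, batch_rows: int
) -> Iterator[List[Run]]:
    """Turn strictly increasing row positions into skip/select runs.

    One list of runs is yielded per window of `batch_rows` rows, so only the
    matches that fall inside the current window are pulled from `rows`.
    """
    if batch_rows <= 0:
        raise ValueError("batch_rows must be positive")
    it = iter(rows)
    pending: Optional[int] = next(it, None)  # next match not yet emitted
    for start in range(0, total_rows, batch_rows):
        end = min(start + batch_rows, total_rows)
        runs: List[Run] = []
        cursor = start  # first row of this window not yet covered by a run
        while pending is not None and pending < end:
            if pending < cursor:
                raise ValueError("row positions must be strictly increasing")
            if pending > cursor:
                runs.append(("skip", pending - cursor))
                runs.append(("select", 1))
            elif runs and runs[-1][0] == "select":
                # Adjacent match: extend the current select run.
                runs[-1] = ("select", runs[-1][1] + 1)
            else:
                runs.append(("select", 1))
            cursor = pending + 1
            pending = next(it, None)
        if cursor < end:
            runs.append(("skip", end - cursor))  # pad to the window end
        yield runs


# The bit set from the example above, ANDed with a made-up second bit set.
combined = intersect_bitsets("00110", "01100")

# Runs for the original bit set 00110, produced two rows at a time.
batched = list(runs_in_batches(matching_rows("00110"), total_rows=5, batch_rows=2))
```

Each yielded list covers exactly one window of rows, so a consumer could start reading from Parquet as soon as the first window is available instead of waiting for the whole stream. Whether DataFusion can accept such windows incrementally is the open question of this issue.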


