alihan-synnada opened a new issue, #13411:
URL: https://github.com/apache/datafusion/issues/13411

   ### Is your feature request related to a problem or challenge?
   
   Stream deserialization is implemented inconsistently across file formats.
   
   For example, when reading a payload from `ObjectStore` and deserializing 
`GetResultPayload::Stream`:
   - `CsvOpener` and `JsonOpener` use the `Decoder`s from `arrow-csv` and 
`arrow-json`, respectively, and the logic around them is duplicated. 
   - `ArrowOpener` and `AvroOpener` read the entire stream before 
deserializing. Deserializing as the stream is read would overlap I/O with 
computation, which should improve performance.
   - `ParquetOpener` has a completely different implementation from the others.
   
   This issue is likely to take several PRs to resolve.
   
   ### Describe the solution you'd like
   
   - Add a `Decoder` trait similar to the `Decoder` structs of `arrow-csv` and 
`arrow-json`, and implement it for every format.
   - Add a `BatchDeserializer` trait (similar to `BatchSerializer`) with the 
following API:
     - `digest`: consume bytes
     - `next`: try to deserialize a batch, and report whether more data is 
needed or the stream is exhausted
     - `finish`: signal the end of the stream
   - Add a `DecoderDeserializer` struct that implements `BatchDeserializer` and 
uses the provided `Decoder` implementation to deserialize the stream.
   - Formats that cannot implement a `Decoder` for some reason can instead 
implement `BatchDeserializer` directly.
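
   The proposed layering could be sketched roughly as follows. This is a std-only illustration, not the DataFusion API: `Batch` stands in for an Arrow `RecordBatch`, and `Output`, `LineDecoder`, `demo`, and all method signatures are assumptions made for this example.

```rust
/// Stand-in for an Arrow `RecordBatch` (assumption, keeps the sketch std-only).
type Batch = Vec<String>;

/// Hypothetical result of `BatchDeserializer::next`.
#[derive(Debug, PartialEq)]
enum Output {
    Batch(Batch),
    RequiresMoreData,
    InputExhausted,
}

/// Push-based decoder, loosely modeled on the `arrow-csv`/`arrow-json` decoders.
trait Decoder {
    /// Consume bytes from `buf` and return how many were read.
    fn decode(&mut self, buf: &[u8]) -> Result<usize, String>;
    /// Emit the rows decoded so far as a batch, if there are any.
    fn flush(&mut self) -> Result<Option<Batch>, String>;
}

trait BatchDeserializer {
    fn digest(&mut self, bytes: &[u8]);
    fn next(&mut self) -> Result<Output, String>;
    fn finish(&mut self);
}

/// Adapts any `Decoder` to the `BatchDeserializer` API.
struct DecoderDeserializer<D: Decoder> {
    decoder: D,
    buffered: Vec<u8>,
    finished: bool,
}

impl<D: Decoder> DecoderDeserializer<D> {
    fn new(decoder: D) -> Self {
        Self { decoder, buffered: Vec::new(), finished: false }
    }
}

impl<D: Decoder> BatchDeserializer for DecoderDeserializer<D> {
    fn digest(&mut self, bytes: &[u8]) {
        self.buffered.extend_from_slice(bytes);
    }

    fn next(&mut self) -> Result<Output, String> {
        let read = self.decoder.decode(&self.buffered)?;
        self.buffered.drain(..read);
        // Unread bytes mean the decoder's batch is full; at end of stream,
        // flush whatever is left.
        if !self.buffered.is_empty() || self.finished {
            return Ok(match self.decoder.flush()? {
                Some(batch) => Output::Batch(batch),
                None if self.finished && self.buffered.is_empty() => Output::InputExhausted,
                None => Output::RequiresMoreData,
            });
        }
        Ok(Output::RequiresMoreData)
    }

    fn finish(&mut self) {
        self.finished = true;
    }
}

/// Toy decoder: one row per newline-terminated line, `batch_size` rows per batch.
struct LineDecoder {
    batch_size: usize,
    partial: Vec<u8>,
    rows: Vec<String>,
}

impl LineDecoder {
    fn new(batch_size: usize) -> Self {
        Self { batch_size, partial: Vec::new(), rows: Vec::new() }
    }
}

impl Decoder for LineDecoder {
    fn decode(&mut self, buf: &[u8]) -> Result<usize, String> {
        let mut read = 0;
        for &b in buf {
            if self.rows.len() == self.batch_size {
                break; // batch is full, leave the remaining bytes unread
            }
            read += 1;
            if b == b'\n' {
                let line = String::from_utf8(std::mem::take(&mut self.partial))
                    .map_err(|e| e.to_string())?;
                self.rows.push(line);
            } else {
                self.partial.push(b);
            }
        }
        Ok(read)
    }

    fn flush(&mut self) -> Result<Option<Batch>, String> {
        Ok(if self.rows.is_empty() { None } else { Some(std::mem::take(&mut self.rows)) })
    }
}

/// Drives the deserializer the way an opener might: digest, poll, finish, poll.
fn demo() -> Vec<Output> {
    let mut de = DecoderDeserializer::new(LineDecoder::new(2));
    let mut seen = Vec::new();
    de.digest(b"a\nb\nc\n");
    seen.push(de.next().unwrap());
    seen.push(de.next().unwrap());
    de.finish();
    seen.push(de.next().unwrap());
    seen.push(de.next().unwrap());
    seen
}
```

   In this sketch, only `LineDecoder` is format-specific. The buffering and end-of-stream handling live once in `DecoderDeserializer`, which is the duplication the proposal aims to remove.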
   
   ### Describe alternatives you've considered
   
   All formats could implement `BatchDeserializer` directly instead of relying 
on `Decoder`, which would remove a layer of abstraction. However, this can lead 
to some duplication, as seen in the CSV and JSON deserialization implementations.
   
   ### Additional context
   
   - Relevant parts of `CsvOpener` and `JsonOpener` (duplicated logic)
   
https://github.com/apache/datafusion/blob/a5d0563f53d05f5589df83d163c91910f51020ba/datafusion/core/src/datasource/physical_plan/csv.rs#L653-L683
   
https://github.com/apache/datafusion/blob/a5d0563f53d05f5589df83d163c91910f51020ba/datafusion/core/src/datasource/physical_plan/json.rs#L307-L340
   - Relevant parts of `ArrowOpener` and `AvroOpener` (`.bytes().await` reads 
the entire stream)
   
https://github.com/apache/datafusion/blob/a5d0563f53d05f5589df83d163c91910f51020ba/datafusion/core/src/datasource/physical_plan/arrow_file.rs#L239-L245
   
https://github.com/apache/datafusion/blob/a5d0563f53d05f5589df83d163c91910f51020ba/datafusion/core/src/datasource/physical_plan/avro.rs#L229-L233



