Hi Remi;

I see. I am not sure yet how much needs to change on our side, since I
haven't estimated the adaptation/refactoring required for it.
If possible, could you share the S3 implementation that you've worked on?
It would guide our estimate, and if feasible we would like to adopt the
same approach. After our talk and seeing the use case you've already
implemented, I am pretty much convinced by your approach; I just need to
understand the surface impact for the team.

Best,
Mahmut

On Wed, Nov 11, 2020 at 19:06, Rémi Dettai <rdet...@gmail.com> wrote:

> Hi Mahmut,
>
> The way of implementing sources for Parquet has changed. The new way is to
> implement the ChunkReader trait. This is simpler (fewer methods to
> implement) and more efficient (you have more information about the upcoming
> bytes that will be read). The ParquetReader trait has been made private
> because it is mostly relevant in combination with FileSource, which is also
> private (
> https://github.com/apache/arrow/pull/8300#issuecomment-707712589). I guess
> we could even have removed it and made FileSource specific to File.
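>
> For a custom source, the change would look roughly like the sketch below
> (untested and written from memory, so the import paths and exact trait
> bounds may differ slightly from the PR; MyS3Source, read_range and size are
> just placeholders for whatever your source provides):
>
> use std::io::Cursor;
>
> use parquet::errors::Result;
> use parquet::file::reader::{ChunkReader, Length};
>
> // Placeholder for a source that can serve arbitrary byte ranges,
> // e.g. an object store client pointing at a single object.
> struct MyS3Source { /* client, bucket, key, ... */ }
>
> impl MyS3Source {
>     // Placeholder: fetch `length` bytes starting at offset `start`.
>     fn read_range(&self, start: u64, length: usize) -> Result<Vec<u8>> {
>         unimplemented!()
>     }
>
>     // Placeholder: total size of the underlying object.
>     fn size(&self) -> u64 {
>         unimplemented!()
>     }
> }
>
> impl Length for MyS3Source {
>     fn len(&self) -> u64 {
>         self.size()
>     }
> }
>
> impl ChunkReader for MyS3Source {
>     type T = Cursor<Vec<u8>>;
>
>     // Unlike the old Read + Seek based ParquetReader, the source is told up
>     // front how many bytes will be read, so a single range request suffices.
>     fn get_read(&self, start: u64, length: usize) -> Result<Self::T> {
>         Ok(Cursor::new(self.read_range(start, length)?))
>     }
> }
>
> With that in place you can pass the source straight to
> SerializedFileReader::new, as before.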
>
> Are you sure that just making the ParquetReader trait public would be
> sufficient to keep your current code compatible? SerializedFileReader does
> not work with that trait any more, so I doubt it would solve your problem.
> You would also need to expose FileSource and then implement ChunkReader for
> it (similar to the implementation of ChunkReader for File), or make the
> implementation of ChunkReader for File generic on the ParquetReader trait
> instead (impl<T: ParquetReader> ChunkReader for T). I find this brings in
> quite a bit of complexity! Is there a use case where you are not reading
> from the file system and really benefit from going through FileSource?
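>
> To be concrete, the generic variant would mean replacing the current
> File-specific impl with something like this (sketch only; the exact
> signature of FileSource::new is from memory):
>
> // Would replace the existing `impl ChunkReader for File`: anything that
> // satisfies the old ParquetReader bounds (Read + Seek + Length + TryClone)
> // gets ChunkReader by wrapping itself in a FileSource.
> impl<R: ParquetReader> ChunkReader for R {
>     type T = FileSource<R>;
>
>     fn get_read(&self, start: u64, length: usize) -> Result<Self::T> {
>         Ok(FileSource::new(self, start, length))
>     }
> }
>
> But that only helps you if ParquetReader stays public (and FileSource leaks
> into the public API through the associated type), which is exactly the kind
> of extra surface I would rather avoid.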
>
> Before opening this PR, can you quickly look at how complex it would be to
> change your custom sources to implement ChunkReader? I think it might be a
> lot easier than you expect! :-)
>
> Remi
>
> On Wed, Nov 11, 2020 at 14:14, vertexclique vertexclique <
> vertexcli...@gmail.com> wrote:
>
> > Hi All;
> >
> > I have implemented different data sources for the ParquetReader before
> > (privately), but with the latest changes (especially
> > https://github.com/apache/arrow/pull/8300/files#diff-0b220b2d327afc583fd75b2d3c52901e628026a11cfa694ffc252ffd45fb6db0L20
> > ), the ParquetReader trait appears to have been orphaned. Is this
> > intentional, or is it temporary? It prevents layering traits on top of it
> > to implement different data sources.
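> >
> > To illustrate the kind of layering I mean (open_source is a placeholder;
> > the real code is private), we keep generic helpers bounded on the
> > ParquetReader trait, roughly:
> >
> > use parquet::errors::Result;
> > use parquet::file::reader::{ParquetReader, SerializedFileReader};
> >
> > // Our layer on top of ParquetReader: any source satisfying the trait
> > // (local file, in-house object store wrapper, ...) is opened the same way.
> > fn open_source<R: ParquetReader>(source: R) -> Result<SerializedFileReader<R>> {
> >     SerializedFileReader::new(source)
> > }
> >
> > With the trait no longer public, this pattern does not compile against the
> > latest nightly.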
> >
> > This change effectively blocks us at Signavio from moving to the latest
> > Arrow nightly. It would be nice to resolve this together so that we can
> > adapt to the latest parquet and arrow.
> >
> > Best,
> > Mahmut Bulut (vertexclique)
> >
>
