Hi all,

The datasets API has improved a lot recently, and I have found some of the new 
features very useful for my own work. One that is important to me is the fix 
for ARROW-6952[1]. Since I currently work on Java/Scala projects such as Spark, 
I am investigating a way to call some of the datasets APIs from Java so that I 
can benefit from the performance of the native dataset filters/projectors. I am 
also interested in the dataset API's ability to scan different data sources.


Regarding using datasets in Java, my initial idea is to port some of the 
high-level concepts, such as DataSourceDiscovery/DataSet/Scanner/FileFormat, by 
writing Java implementations of them, and then to create and call the 
lower-level record batch iterators via JNI. This way we should retain the 
performance advantages of the C++ dataset code.
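
To make the idea a bit more concrete, here is a rough sketch of the layering I 
have in mind. Everything in it is hypothetical: NativeBridge, BatchScanner and 
FakeBridge are names I made up for illustration and are not existing Arrow 
classes. In a real port the NativeBridge methods would be JNI `native` methods 
backed by the C++ dataset code; here an in-memory stand-in replaces them so the 
sketch runs without any native library, and record batches are represented as 
plain byte arrays.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

public class DatasetJniSketch {

    /** Hypothetical low-level bridge. In a real port these would be JNI native methods. */
    interface NativeBridge {
        long createScanner(String uri, String[] columns); // opaque handle to a native scanner
        byte[] nextBatch(long scannerHandle);             // serialized batch, or null when exhausted
        void closeScanner(long scannerHandle);            // releases the native scanner
    }

    /** Java-side high-level concept: pulls record batches through the bridge one at a time. */
    static final class BatchScanner implements Iterator<byte[]>, AutoCloseable {
        private final NativeBridge bridge;
        private final long handle;
        private byte[] pending; // batch fetched by hasNext() but not yet returned by next()

        BatchScanner(NativeBridge bridge, String uri, String... columns) {
            this.bridge = bridge;
            this.handle = bridge.createScanner(uri, columns);
        }

        @Override
        public boolean hasNext() {
            if (pending == null) {
                pending = bridge.nextBatch(handle);
            }
            return pending != null;
        }

        @Override
        public byte[] next() {
            if (!hasNext()) {
                throw new NoSuchElementException();
            }
            byte[] batch = pending;
            pending = null;
            return batch;
        }

        @Override
        public void close() {
            bridge.closeScanner(handle);
        }
    }

    /** In-memory stand-in for the native side (an assumption, used only to make the sketch runnable). */
    static final class FakeBridge implements NativeBridge {
        private final int totalBatches;
        private int produced = 0;
        private String projection = "";
        boolean closed = false;

        FakeBridge(int totalBatches) {
            this.totalBatches = totalBatches;
        }

        @Override
        public long createScanner(String uri, String[] columns) {
            projection = String.join(",", columns);
            return 1L; // a real bridge would return a pointer-like handle
        }

        @Override
        public byte[] nextBatch(long scannerHandle) {
            if (produced >= totalBatches) {
                return null;
            }
            String payload = "batch-" + (produced++) + "[" + projection + "]";
            return payload.getBytes(StandardCharsets.UTF_8);
        }

        @Override
        public void closeScanner(long scannerHandle) {
            closed = true;
        }
    }

    /** Drains a scanner and decodes each fake batch to a string. */
    static List<String> scanAll(NativeBridge bridge, String uri, String... columns) {
        List<String> out = new ArrayList<>();
        try (BatchScanner scanner = new BatchScanner(bridge, uri, columns)) {
            while (scanner.hasNext()) {
                out.add(new String(scanner.next(), StandardCharsets.UTF_8));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // The URI and column names are placeholders.
        for (String batch : scanAll(new FakeBridge(2), "file:///tmp/example.parquet", "id", "name")) {
            System.out.println(batch);
        }
    }
}
```

The point of the split is that only the batch iteration crosses the JNI 
boundary, while discovery, projection and lifecycle handling stay in ordinary 
Java code that Spark or other JVM projects can call directly.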


Is anyone else interested in this topic? Or is this something already on the 
development plan? Any feedback or thoughts would be much appreciated.


Best,
Hongze


[1] https://issues.apache.org/jira/browse/ARROW-6952
