Hi Wes and Micah,
Thanks for your kind replies.

Micah: We don't use the Spark (vectorized) Parquet reader because it is a pure-Java implementation; its performance could be worse than doing similar work natively. Another reason is that we may need to integrate some other specific data sources with Arrow Datasets, and to limit the workload we would like to maintain a common read pipeline for both those and widely used data sources like Parquet and CSV.

Wes: Yes, the Datasets framework along with the Parquet/CSV/... reader implementations is entirely native, so a JNI bridge will be needed and we won't actually read files in Java. Another concern of mine is how many C++ Datasets components should be bridged via JNI. For example, bridge the ScanTask only? Or bridge more components, including Scanner, Table, even the DataSource discovery system? Or just bridge the C++ Arrow Parquet and ORC readers (as Micah said, orc-jni is already there) and reimplement everything needed by Datasets in Java? This might not be easy to decide, but based on my limited perspective so far I would prefer to start from the ScanTask layer; that way we could leverage some of the valuable work already finished in C++ Datasets and would not have to maintain too much tedious JNI code. The real IO would still take place inside the C++ readers when we do a scan operation.

So Wes, Micah, is this similar to what you had in mind?

Thanks,
Hongze

At 2019-11-27 12:39:52, "Micah Kornfield" <emkornfi...@gmail.com> wrote:
>Hi Hongze,
>To add to Wes's point, there are already some efforts to do JNI for ORC
>(which needs to be integrated with CI) and some open PRs for Parquet in the
>project. However, given that you are using Spark I would expect there is
>already dataset functionality that is equivalent to the dataset API to do
>rowgroup/partition level filtering. Can you elaborate on what problems you
>are seeing with those and what additional use cases you have?
>
>Thanks,
>Micah
>
>
>On Tue, Nov 26, 2019 at 1:10 PM Wes McKinney <wesmck...@gmail.com> wrote:
>
>> hi Hongze,
>>
>> The Datasets functionality is indeed extremely useful, and it may make
>> sense to have it available in many languages eventually. With Java, I
>> would raise the issue that things are comparatively weaker there when
>> it comes to actually reading the files themselves. Whereas we have
>> reasonably fast Arrow-based interfaces to CSV, JSON, ORC, and Parquet
>> in C++ the same is not true in Java. Not a deal breaker but worth
>> taking into consideration.
>>
>> I wonder aloud whether it might be worth investing in a JNI-based
>> interface to the C++ libraries as one potential approach to save on
>> development time.
>>
>> - Wes
>>
>>
>>
>> On Tue, Nov 26, 2019 at 5:54 AM Hongze Zhang <notify...@126.com> wrote:
>> >
>> > Hi all,
>> >
>> >
>> > Recently the datasets API has been improved a lot and I found some of
>> > the new features are very useful to my own work. For example to me a
>> > important one is the fix of ARROW-6952[1]. And as I currently work on
>> > Java/Scala projects like Spark, I am now investigating a way to call some
>> > of the datasets APIs in Java so that I could gain performance improvement
>> > from native dataset filters/projectors. Meantime I am also interested in
>> > the ability of scanning different data sources provided by dataset API.
>> >
>> >
>> > Regarding using datasets in Java, my initial idea is to port (by writing
>> > Java-version implementations) some of the high-level concepts in Java such
>> > as DataSourceDiscovery/DataSet/Scanner/FileFormat, then create and call
>> > lower level record batch iterators via JNI. This way we seem to retain
>> > performance advantages from c++ dataset code.
>> >
>> >
>> > Is anyone interested in this topic also? Or is this something already on
>> > the development plan? Any feedback or thoughts would be much appreciated.
>> >
>> >
>> > Best,
>> > Hongze
>> >
>> >
>> > [1] https://issues.apache.org/jira/browse/ARROW-6952
>>
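
For illustration, here is a rough sketch of what a ScanTask-level JNI wrapper in Java could look like. All class and method names below are hypothetical assumptions, not existing Arrow APIs; the native calls are simulated in pure Java so the shape of the batch-iteration pattern is visible.

```java
// Hypothetical sketch of a Java wrapper over a native (C++) Datasets ScanTask.
// In a real bridge, nextBatchAddress/close would be `native` methods backed by
// the C++ Datasets library via JNI; here they are simulated with an in-memory
// array of fake batch addresses.
class NativeScanTask implements AutoCloseable, java.util.Iterator<Long> {
    private final long[] fakeBatchAddresses; // stand-in for C++ RecordBatch pointers
    private int cursor = 0;

    NativeScanTask(long[] fakeBatchAddresses) {
        this.fakeBatchAddresses = fakeBatchAddresses;
    }

    // Real version: private native long nextBatchAddress(long scanTaskHandle);
    @Override
    public boolean hasNext() {
        return cursor < fakeBatchAddresses.length;
    }

    @Override
    public Long next() {
        // A real bridge would wrap the returned native address into a
        // Java-side record batch structure rather than exposing a raw long.
        return fakeBatchAddresses[cursor++];
    }

    @Override
    public void close() {
        // Real version: release the underlying C++ ScanTask through JNI.
        cursor = fakeBatchAddresses.length;
    }
}

public class ScanTaskSketch {
    public static void main(String[] args) {
        int batches = 0;
        try (NativeScanTask task = new NativeScanTask(new long[] {0xA1L, 0xB2L})) {
            while (task.hasNext()) {
                task.next();
                batches++;
            }
        }
        System.out.println("batches read: " + batches);
    }
}
```

The point of this shape is that only the batch iterator crosses the JNI boundary, while higher-level concepts (Scanner, FileFormat, discovery) could stay on either side depending on where the bridge is drawn.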