Re: Re: Re: [DISCUSS][Java] Adding GC-Based reference management strategy for buffers

2021-10-12 Thread Hongze Zhang
> two > > > possible idioms in the same library.  It means code written against the > > > library becomes less portable (you need to know how the memory allocator > > is > > > using GC or not). > > > > > > I understand manual memory management i

Re: Re: Re: [DISCUSS][Java] Adding GC-Based reference management strategy for buffers

2021-10-12 Thread Hongze Zhang
> addressing the problem? > > > -Micah > > > > On Thu, Oct 7, 2021 at 3:48 PM Hongze Zhang wrote: > > > We don't have to concern about that since no difference will be made on > > current manual release path unless "MemoryChunkCleaner" is explicit

Re: Re: Re: [DISCUSS][Java] Adding GC-Based reference management strategy for buffers

2021-10-12 Thread Hongze Zhang
s less portable (you need to know how the memory allocator is > > using GC or not). > > > > I understand manual memory management in Java is tedious but is there a > > specific problem this is addressing other than making Arrow have more > > expected semantics to Java users?

Re:Re: Re: [DISCUSS][Java] Adding GC-Based reference management strategy for buffers

2021-10-07 Thread Hongze Zhang
rely on refcounting for keeping things in check, I'm not >sure why changing the default is such a good idea... > >On Tue, Oct 5, 2021 at 2:20 AM Hongze Zhang wrote: > >> Hi Laurent, >> >> >> >> >> Sorry I might describe it unclearly and yes

Re:Re: [DISCUSS][Java] Adding GC-Based reference management strategy for buffers

2021-10-05 Thread Hongze Zhang
the GC itself to collect and free buffers? > >On Wed, Sep 29, 2021 at 11:58 PM Hongze Zhang wrote: > >> Hi, >> >> I would like to discuss on the potential of introducing a GC-based >> reference management strategy to Arrow Java, and we >> have already been wor

[DISCUSS][Java] Adding GC-Based reference management strategy for buffers

2021-09-29 Thread Hongze Zhang
Hi, I would like to discuss the potential of introducing a GC-based reference management strategy to Arrow Java, and we have already been working on an implementation in our own project. I have put the related code in the following branch, and if it makes sense to upstream it to Apache Arrow I can open

Re: [Java] C Data Interface and dictionaries

2021-08-25 Thread Hongze Zhang
On Wed, 2021-08-25 at 21:02 +0300, roee shlomo wrote: > This means that an API to import an ArrowSchema (in C) into a > Field/Schema > (in Java) is not suitable for dictionary encoded arrays because there > is an > information loss. Specifically, there is nothing in Field/Schema to > indicate the

Re: Review request for Dataset Java API PRs

2021-08-22 Thread Hongze Zhang
b.com/apache/arrow/pull/10883 [3] https://github.com/apache/arrow/pull/10333 [4] https://github.com/apache/arrow/pull/10114 [5] https://github.com/apache/arrow/pull/10652 On Thu, 2021-08-05 at 18:27 +0800, Hongze Zhang wrote: > Thanks everyone for the quick response! By the way I might raise this

Re: Review request for Dataset Java API PRs

2021-08-05 Thread Hongze Zhang
Thanks everyone for the quick response! By the way, I may be raising this review request a little late because I was also working on some other projects over the last few months. Now I finally have some time to push this forward. :) About ARROW-11776: On Wed, 2021-08-04 at 08:45 -0700, Micah Kornfi

Review request for Dataset Java API PRs

2021-08-03 Thread Hongze Zhang
Hi, I have some PRs that improve the Dataset API's Java implementation and have not been reviewed for months. Could someone help me review them? Thanks in advance. 1. https://github.com/apache/arrow/pull/10201 ARROW-11776: [Java][Dataset] Support writing to files within dataset scanner via JN

Re: [Format] Timestamp timezone semantics?

2021-06-03 Thread Hongze Zhang
On Wed, 2021-06-02 at 13:56 -0700, Micah Kornfield wrote: > > > > Any SQL interface to Arrow should follow the SQL standard. So, for > > instance, if a column has TIMESTAMP type, it should behave as a > > date-time without a time-zone. > > > At least in bigquery we do the following mapping: > SQ

Review request for ARROW-7808's PR (Dataset Java API)

2021-01-28 Thread Hongze Zhang
Hi All, Sorry to send a request to everyone, but I would just like to ask whether anyone would be able to help finish the review for PR#7030[1]. As of now the PR contains the following parts: 1. Base dataset API for Java language (which follows the shape of C++ API) 2. A JNI-based implementation of FileSyste

Re [DISCUSS] Using direct memory size as a limit of populated off-heap buffers in Java

2020-07-21 Thread Hongze Zhang
ty dependencies. > >On Mon, Jul 20, 2020 at 3:52 AM Hongze Zhang wrote: > >> Hi, >> >> I want to discuss a bit about the discussion[1] in the pending PR[2] for >> Java Dataset(it's no longer "Datasets" I guess?) API. >> >> >> - Backgr

[DISCUSS] Execute dataset scan tasks in distributed system

2020-07-21 Thread Hongze Zhang
Hi all, Has anyone ever tried using the Arrow Dataset API in a distributed system? E.g. create scan tasks on machine 1, then send and execute these tasks on machines 2, 3, 4. So far I think a possible workaround is to: 1. Create Dataset on machine 1; 2. Call Scan(), collect all scan tasks from sca

[DISCUSS] Using direct memory size as a limit of populated off-heap buffers in Java

2020-07-20 Thread Hongze Zhang
Hi, I want to discuss a bit about the discussion[1] in the pending PR[2] for Java Dataset(it's no longer "Datasets" I guess?) API. - Background: We are transferring C++ Arrow buffers to Java side BufferAllocators. We should decide whether to use -XX:MaxDirectMemorySize as a limit of these buf

[jira] [Created] (ARROW-8596) [C++][Dataset] Add test case to check if all essential properties are reserved once ScannerBuilder::Project is called

2020-04-26 Thread Hongze Zhang (Jira)
Hongze Zhang created ARROW-8596: --- Summary: [C++][Dataset] Add test case to check if all essential properties are reserved once ScannerBuilder::Project is called Key: ARROW-8596 URL: https://issues.apache.org/jira

[jira] [Created] (ARROW-8499) [C++][Dataset] In ScannerBuilder, batch_size will not work if projecter is not empty

2020-04-17 Thread Hongze Zhang (Jira)
Hongze Zhang created ARROW-8499: --- Summary: [C++][Dataset] In ScannerBuilder, batch_size will not work if projecter is not empty Key: ARROW-8499 URL: https://issues.apache.org/jira/browse/ARROW-8499

[jira] [Created] (ARROW-7808) [Java][Dataset] Implement Datasets Java API

2020-02-09 Thread Hongze Zhang (Jira)
Hongze Zhang created ARROW-7808: --- Summary: [Java][Dataset] Implement Datasets Java API Key: ARROW-7808 URL: https://issues.apache.org/jira/browse/ARROW-7808 Project: Apache Arrow Issue Type

[jira] [Created] (ARROW-7329) AllocationManager: Allow managing different types of memory other than those are allocated using Netty

2019-12-05 Thread Hongze Zhang (Jira)
Hongze Zhang created ARROW-7329: --- Summary: AllocationManager: Allow managing different types of memory other than those are allocated using Netty Key: ARROW-7329 URL: https://issues.apache.org/jira/browse/ARROW

Re: Datasets and Java

2019-11-28 Thread Hongze Zhang
he future we might have > OdbcDataSource and FlightDataSource > > Basically, dataset::FileFormat is meant to be a unified interface to > interact with file formats. Here's an example of such usage without > all the dataset machinery [3]. > > François > > [1] https://issues.a

Re: Datasets and Java

2019-11-27 Thread Hongze Zhang
ub.com/apache/arrow/pull/5608 > > Regards > > Antoine. > > > > Le 27/11/2019 à 11:16, Hongze Zhang a écrit : > > Hi Micah, > > > > > > Regarding our use cases, we'd use the API on Parquet files with some pushed > > filters and >

Re: Datasets and Java

2019-11-27 Thread Hongze Zhang
ew >and relationships between components and how it will co-exist with existing >Java code). If I understand correctly, one goal is to use this as a basis >for a new Spark DataSet API with better performance than the vectorized >spark parquet reader? Are there others? > >Wes, wha

Re: Datasets and Java

2019-11-26 Thread Hongze Zhang
I-based >> interface to the C++ libraries as one potential approach to save on >> development time. >> >> - Wes >> >> >> >> On Tue, Nov 26, 2019 at 5:54 AM Hongze Zhang wrote: >> > >> > Hi all, >> > >> > >> >

Datasets and Java

2019-11-26 Thread Hongze Zhang
Hi all, Recently the datasets API has been improved a lot and I found some of the new features very useful to my own work. For example, to me an important one is the fix for ARROW-6952[1]. And as I currently work on Java/Scala projects like Spark, I am now investigating a way to call some of