Re: [Java] UTF-16 support for VarCharVectors

2022-09-30 Thread David Li
Also, the driver shouldn't assume UTF-8 (or any encoding) when constructing String from a Binary vector, since that defeats the point of a binary vector! Perhaps this should somehow be configurable (though having a lot of little configuration options is also not ideal). A parameterized extension

Re: [Java] UTF-16 support for VarCharVectors

2022-09-30 Thread Antoine Pitrou
On 30/09/2022 at 18:57, Kevin Bambrick wrote: > The issue I am facing is sending a UTF-16 string over the wire. Ok, then you can just transcode the strings before sending them as String, *or* you can send them as Binary (not String). Where do these UTF-16 strings come from? > What would t
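
The "transcode before sending" option mentioned above can be sketched with the JDK alone. This is a minimal sketch, not code from the thread: the class and method names are hypothetical, and it assumes the source bytes are little-endian UTF-16 without a byte order mark. No Arrow classes are used; the resulting UTF-8 bytes are what a string column would be expected to hold.

```java
import java.nio.charset.StandardCharsets;

public class Utf16ToUtf8 {
    // Decode UTF-16LE bytes into a Java String, then re-encode as UTF-8.
    // Assumption: the input is little-endian UTF-16 with no byte order mark.
    public static byte[] transcode(byte[] utf16le) {
        String decoded = new String(utf16le, StandardCharsets.UTF_16LE);
        return decoded.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // "h", e-acute, "llo", written with an escape so the source stays ASCII-only.
        byte[] source = "h\u00e9llo".getBytes(StandardCharsets.UTF_16LE);
        byte[] utf8 = transcode(source);

        System.out.println(source.length); // 10 (five 2-byte code units)
        System.out.println(utf8.length);   // 6 (e-acute takes two bytes in UTF-8)
    }
}
```

If the data's byte order is not known in advance, `StandardCharsets.UTF_16` honours a leading byte order mark when decoding and may be the safer choice.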

Re: [Java] UTF-16 support for VarCharVectors

2022-09-30 Thread Kevin Bambrick
The issue I am facing is sending a UTF-16 string over the wire. The application I am working on needs to support UTF-16 strings. The specific issue I am stuck on is integrating with the Flight SQL driver (experimentally working on adopting it for when it's released). Right now in my implementation o

Re: [Java] UTF-16 support for VarCharVectors

2022-09-30 Thread Antoine Pitrou
On Thu, 29 Sep 2022 15:19:59 -0400 Larry White wrote: > Interesting. This doesn't seem to be a Java issue, per se then. I've seen > admonitions in various Arrow Java threads to always specify the Charset for > the conversion - and so assumed more than one Charset was legal - and have > written Arr

Re: [Java] UTF-16 support for VarCharVectors

2022-09-29 Thread Micah Kornfield
> > I've never attempted to transport that data over the wire or export it > using the C-Data Interface, however. It seems like that's where it would > fall down. Yeah, there would be funny characters or validation failures someplace down the line when trying to transfer the data. On Thu, Sep 29,

Re: [Java] UTF-16 support for VarCharVectors

2022-09-29 Thread Larry White
Interesting. This doesn't seem to be a Java issue, per se then. I've seen admonitions in various Arrow Java threads to always specify the Charset for the conversion - and so assumed more than one Charset was legal - and have written Arrow Java test code that uses other charsets without ill effect.

Re: [Java] UTF-16 support for VarCharVectors

2022-09-29 Thread Micah Kornfield
> > Was just wondering was support for UTF-16 Strings considered? As far as I > am aware VarChar vectors only support UTF-8. Are they something that may be > supported in the future? This hasn't really been discussed and is a pretty large change because it would require specification updates and other imp

Re: [Java] UTF-16 support for VarCharVectors

2022-09-29 Thread James Henderson
FWIW we'd made a similar assumption. In Schema.fbs [1] the type is called Utf8, as well as the Java `ArrowType.Utf8` class - is this a required assumption to work with other language Arrow libs, maybe? James [1] https://github.com/apache/arrow/blob/master/format/Schema.fbs On Thu, 29 Sept 2022 a

Re: [Java] UTF-16 support for VarCharVectors

2022-09-29 Thread David Li
The specification mandates UTF-8 encoding [1]. UTF-16 may make sense as a canonical extension type, but otherwise could just go into a binary array. [1]: https://github.com/apache/arrow/blob/902781d1f3a41563a23d6755433a8e40ce82de7b/format/Schema.fbs#L155-L157 On Thu, Sep 29, 2022, at 13:57, La

Re: [Java] UTF-16 support for VarCharVectors

2022-09-29 Thread Larry White
Hi Kevin, I don't know of any particular restriction regarding string encoding. VarCharVector stores data as a byte array, and the encoding can be set using the Charset class when you convert Strings to and from bytes. Since Java strings use UTF-16 internally, I would expect this to 'just work'.
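
To make the Charset point concrete, here is a minimal JDK-only sketch (hypothetical class and method names, no Arrow classes involved). It shows that the chosen Charset determines the bytes that get stored, and that text only round-trips when writer and reader agree on the encoding. How Arrow readers interpret the bytes of a string column is a separate question that is not shown here.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class CharsetMismatchDemo {
    // Encode with one charset, then decode with another, the way a reader
    // would if it assumed a different encoding than the writer used.
    public static String reinterpret(String text, Charset written, Charset read) {
        return new String(text.getBytes(written), read);
    }

    public static void main(String[] args) {
        // "h", e-acute, "llo", written with an escape so the source stays ASCII-only.
        String original = "h\u00e9llo";

        System.out.println(original.getBytes(StandardCharsets.UTF_8).length);    // 6
        System.out.println(original.getBytes(StandardCharsets.UTF_16LE).length); // 10

        // Same charset on both sides: the text survives.
        System.out.println(original.equals(
            reinterpret(original, StandardCharsets.UTF_8, StandardCharsets.UTF_8))); // true

        // Written as UTF-16LE but read as UTF-8: the text does not survive.
        System.out.println(original.equals(
            reinterpret(original, StandardCharsets.UTF_16LE, StandardCharsets.UTF_8))); // false
    }
}
```

A test that writes and reads with the same Charset inside one process will pass for any encoding, which may explain why a mismatch only shows up once another reader consumes the bytes.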

[Java] UTF-16 support for VarCharVectors

2022-09-29 Thread Kevin Bambrick
Hi. Was just wondering was support for UTF-16 Strings considered? As far as I am aware VarChar vectors only support UTF-8. Are they something that may be supported in the future? Regards. Kevin.