I opened https://issues.apache.org/jira/browse/ARROW-2068, which may help. This is an accessible issue for someone in the community to work on; I'm not sure when I'll be able to get to it.
Thanks
Wes

On Thu, Feb 1, 2018 at 8:27 AM, Eli <h5r...@protonmail.ch> wrote:
> Hey Wes,
>
> I understand there's another pointer, a definition level pointer, which is
> basically a null location marker column. Exposing it as well to pick out the
> nulls would be awesome.
>
> The types of interest (to me) are varchars/strings, bools and numbers, just
> basic primitive types that also exist in standard SQL, so having these two
> columns available via Python would be sweet.
>
>
> Thanks,
> Eli
>
>
> Sent with ProtonMail Secure Email.
>
>
> -------- Original Message --------
> On January 31, 2018 4:06 PM, Wes McKinney wrote:
>
>>hi Eli,
>>
>> This isn't available at the moment, but one could make the internal
>> buffers in an array accessible in Python. How would you handle nulls
>> in this scenario (the bytes for a null value in a primitive array can
>> be any value)? How would one handle things other than numbers?
>>
>> - Wes
>>
>> On Wed, Jan 31, 2018 at 5:14 AM, Eli h5r...@protonmail.ch wrote:
>>
>>>Hey Wes,
>>>What I meant by "standard" is the binary representation of a specific type
>>>aggregated together.
>>>The int32 column [1,2,3] would make
>>>'\x01\x00\x00\x00\x02\x00\x00\x00\x03\x00\x00\x00' for example.
>>>This is already available via Python's struct.pack(),
>>>array.array().tostring() or np.array().astype().tobytes()
>>>What I was wondering is whether that specific representation is already
>>>there in Arrow's C++ mechanics somewhere, and whether one can get hold of it
>>>from Pyarrow.
>>>I don't know C++ very well, but I think what I'm looking for is in buffer.h,
>>>there are pointers to types under Buffer which I think point to just that.
>>>I saw that Buffer is actually accessible via pa.lib.Buffer, and that it even
>>>has a to_pybytes() method.
>>>However:
>>> - I'm not sure those are the bytes that I speak of
>>>
>>> - I'm not sure how to use Buffer to find out, keep getting core dumps when
>>> trying
>>>Sent with ProtonMail Secure Email.
>>>-------- Original Message --------
>>> On January 10, 2018 7:34 PM, Wes McKinney wrote:
>>>>hi Eli,
>>>>I am not aware of any standards for binary columns (or at least, I
>>>> don't know what "regular" means in this context) -- part of the
>>>> purpose of the Apache Arrow project is to define a columnar standard
>>>> in the absence of any existing one. Most database systems define their
>>>> own custom wire protocols.
>>>>Do you have a link to the specification for the binary protocol for
>>>> the database you are using (or some other documentation)?
>>>>Thanks,
>>>> Wes
>>>>On Wed, Jan 10, 2018 at 12:47 AM, Eli h5r...@protonmail.ch wrote:
>>>>>Hey Wes,
>>>>> The database in question accepts columnar chunks of "regular" binary data
>>>>> over the network, one of the sources of which is parquet.
>>>>> Thus, data only comes out of parquet on my side, and I was wondering how
>>>>> to get it out as "regular" binary columns. Something like tobytes() for
>>>>> an Arrow Column, or maybe read_asbytes() for pa itself. The purpose is to
>>>>> get to standard binary columns as fast as possible.
>>>>> Thanks,
>>>>> Eli
>>>>> Sent with ProtonMail Secure Email.
>>>>>>-------- Original Message --------
>>>>>> Subject: Re: How to get "standard" binary columns out of a pyarrow table
>>>>>> Local Time: January 10, 2018 5:32 AM
>>>>>> UTC Time: January 10, 2018 3:32 AM
>>>>>> From: wesmck...@gmail.com
>>>>>> To: dev@arrow.apache.org, Eli h5r...@protonmail.ch
>>>>>> hi Eli,
>>>>>> I'm wondering what kind of API you would want, if the perfect one
>>>>>> existed. If I understand correctly, you are embedding objects in a
>>>>>> BYTE_ARRAY column in Parquet, and need to do some post-processing as
>>>>>> the data goes in / comes out of Parquet?
>>>>>> Thanks,
>>>>>> Wes
>>>>>> On Sat, Jan 6, 2018 at 8:37 AM, Eli h5r...@protonmail.ch wrote:
>>>>>>>Hi,
>>>>>>> I'm looking to send "regular" columnar binary data to a database, the
>>>>>>> kind that gets created by struct.pack, array.array, numpy.tobytes or
>>>>>>> str.encode.
>>>>>>> The origin is parquet files, which I'm reading ever so comfortably via
>>>>>>> PyArrow.
>>>>>>> I do however need to deserialize to Python objects, currently via
>>>>>>> to_pandas(), then re-serialize the columns with one of the above.
>>>>>>> I was wondering whether there was a better way to go about it, one
>>>>>>> which would be most fast and effective.
>>>>>>> Ideally I'd like to go through Python, but I can do C or even some C++
>>>>>>> if necessary.
>>>>>>> I posted the question on stackoverflow, and was asked to post here.
>>>>>>> Appreciate any feedback!
>>>>>>> Thanks,
>>>>>>> Eli
>>>>>>> Sent with ProtonMail Secure Email.
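
For reference, the layout discussed in this thread (an int32 column as fixed-width values packed back to back, plus a separate marker for nulls) can be sketched with the Python standard library alone. This is only an illustration and not PyArrow's API: the helper names pack_int32_column and pack_nullable_int32_column are hypothetical, little-endian byte order is assumed from the example bytes quoted above, and the validity bitmap (one bit per value, least-significant bit first, 1 meaning "not null") is modeled on how Arrow marks nulls. The zero written into null slots is an arbitrary choice, since the bytes for a null value in a primitive array can be anything.

```python
import struct


def pack_int32_column(values):
    """Pack a list of ints as little-endian int32 values, back to back."""
    return struct.pack("<%di" % len(values), *values)


def pack_nullable_int32_column(values, fill=0):
    """Return (validity_bitmap, data_bytes) for a list that may contain None.

    Bit i of the bitmap (least-significant bit first within each byte) is 1
    when values[i] is not None. Null slots are written as `fill`, which is an
    arbitrary placeholder (an assumption of this sketch).
    """
    bitmap = bytearray((len(values) + 7) // 8)
    filled = []
    for i, value in enumerate(values):
        if value is None:
            filled.append(fill)
        else:
            bitmap[i // 8] |= 1 << (i % 8)
            filled.append(value)
    return bytes(bitmap), pack_int32_column(filled)


# The int32 column [1, 2, 3] from the thread.
data = pack_int32_column([1, 2, 3])

# The same column with the middle value missing.
validity, nullable_data = pack_nullable_int32_column([1, None, 3])
```

A consumer would read the bitmap first and ignore the data bytes of any slot whose bit is 0, which is why the placeholder value does not matter.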