hi Eli,

This isn't available at the moment, but one could make the internal buffers in an array accessible in Python. How would you handle nulls in this scenario (the bytes for a null value in a primitive array can be any value)? How would one handle things other than numbers?
- Wes

On Wed, Jan 31, 2018 at 5:14 AM, Eli <h5r...@protonmail.ch> wrote:
> Hey Wes,
>
> What I meant by "standard" is the binary representation of a specific type
> aggregated together.
>
> The int32 column [1,2,3] would make
> '\x01\x00\x00\x00\x02\x00\x00\x00\x03\x00\x00\x00' for example.
>
> This is already available via Python's struct.pack(),
> array.array().tostring() or np.array().astype().tobytes()
>
> What I was wondering is whether that specific representation is already
> there in Arrow's C++ mechanics somewhere, and whether one can get hold of it
> from PyArrow.
>
> I don't know C++ very well, but I think what I'm looking for is in buffer.h;
> there are pointers to types under Buffer which I think point to just that.
>
> I saw that Buffer is actually accessible via pa.lib.Buffer, and that it even
> has a to_pybytes() method.
>
> However:
>
> - I'm not sure those are the bytes that I speak of
>
> - I'm not sure how to use Buffer to find out; I keep getting core dumps when
>   trying
>
> Sent with ProtonMail Secure Email.
>
> -------- Original Message --------
> On January 10, 2018 7:34 PM, Wes McKinney wrote:
>
>> hi Eli,
>>
>> I am not aware of any standards for binary columns (or at least, I
>> don't know what "regular" means in this context) -- part of the
>> purpose of the Apache Arrow project is to define a columnar standard
>> in the absence of any existing one. Most database systems define their
>> own custom wire protocols.
>>
>> Do you have a link to the specification for the binary protocol for
>> the database you are using (or some other documentation)?
>>
>> Thanks,
>> Wes
>>
>> On Wed, Jan 10, 2018 at 12:47 AM, Eli h5r...@protonmail.ch wrote:
>>> Hey Wes,
>>>
>>> The database in question accepts columnar chunks of "regular" binary data
>>> over the network, one of the sources of which is parquet.
>>>
>>> Thus, data only comes out of parquet on my side, and I was wondering how to
>>> get it out as "regular" binary columns.
>>>
>>> Something like tobytes() for an Arrow Column, or maybe read_asbytes() for
>>> pa itself. The purpose is to get to standard binary columns as fast as
>>> possible.
>>>
>>> Thanks,
>>> Eli
>>>
>>>> -------- Original Message --------
>>>> Subject: Re: How to get "standard" binary columns out of a pyarrow table
>>>> Local Time: January 10, 2018 5:32 AM
>>>> UTC Time: January 10, 2018 3:32 AM
>>>> From: wesmck...@gmail.com
>>>> To: dev@arrow.apache.org, Eli h5r...@protonmail.ch
>>>>
>>>> hi Eli,
>>>>
>>>> I'm wondering what kind of API you would want, if the perfect one
>>>> existed. If I understand correctly, you are embedding objects in a
>>>> BYTE_ARRAY column in Parquet, and need to do some post-processing as
>>>> the data goes in / comes out of Parquet?
>>>>
>>>> Thanks,
>>>> Wes
>>>>
>>>> On Sat, Jan 6, 2018 at 8:37 AM, Eli h5r...@protonmail.ch wrote:
>>>>> Hi,
>>>>>
>>>>> I'm looking to send "regular" columnar binary data to a database, the
>>>>> kind that gets created by struct.pack, array.array, numpy.tobytes or
>>>>> str.encode.
>>>>>
>>>>> The origin is parquet files, which I'm reading ever so comfortably via
>>>>> PyArrow.
>>>>>
>>>>> I do however need to deserialize to Python objects, currently via
>>>>> to_pandas(), then re-serialize the columns with one of the above.
>>>>>
>>>>> I was wondering whether there was a better way to go about it, one which
>>>>> would be as fast and effective as possible.
>>>>>
>>>>> Ideally I'd like to go through Python, but I can do C or even some C++ if
>>>>> necessary.
>>>>>
>>>>> I posted the question on stackoverflow, and was asked to post here.
>>>>>
>>>>> Appreciate any feedback!
>>>>>
>>>>> Thanks,
>>>>> Eli
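
For reference, the "standard" representation discussed in this thread can be reproduced with the Python standard library alone. The sketch below is only an illustration, not a PyArrow API: it packs the int32 column [1,2,3] two ways and checks that both give the bytes quoted above. Note that array.array().tostring() was removed in Python 3.9; tobytes() is the equivalent call. The helper pack_int32_with_nulls() is hypothetical. It shows one possible answer to the null question, assuming the receiver accepts a separate validity bitmap in the style of the Arrow columnar format (least-significant-bit numbering, bit set means valid) and that null slots may carry an arbitrary fill value. It does not pad buffers the way Arrow does.

```python
import struct
import sys
from array import array

values = [1, 2, 3]

# Explicit little-endian int32 ("<i"), independent of the host platform.
packed = struct.pack("<3i", *values)

# array.array uses native byte order; typecode "i" is a C int, which is
# assumed to be 4 bytes here (true on common platforms, checked below).
native = array("i", values)
assert native.itemsize == 4
if sys.byteorder == "big":
    native.byteswap()  # match the little-endian layout quoted in the thread
native_bytes = native.tobytes()


def pack_int32_with_nulls(column, fill=0):
    """Hypothetical helper: return (validity_bitmap, data) for an int32 column.

    None marks a null. Bit i of the bitmap (LSB-first within each byte) is
    set when column[i] is valid; null slots are written as `fill`.
    """
    bitmap = bytearray((len(column) + 7) // 8)
    data = bytearray()
    for i, value in enumerate(column):
        if value is None:
            data += struct.pack("<i", fill)
        else:
            bitmap[i // 8] |= 1 << (i % 8)
            data += struct.pack("<i", value)
    return bytes(bitmap), bytes(data)


bitmap, data = pack_int32_with_nulls([1, None, 3])
print(packed)
print(bitmap, data)
```

Whether a fill value plus bitmap is acceptable depends entirely on the target database's wire protocol, which the thread does not specify.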