Hey Wes,
What I meant by "standard" is the binary representation of the values of a specific type, concatenated together. For example, the int32 column [1,2,3] would make '\x01\x00\x00\x00\x02\x00\x00\x00\x03\x00\x00\x00'. This is already available via Python's struct.pack(), array.array().tobytes() or np.array().astype().tobytes().

What I was wondering is whether that specific representation is already there in Arrow's C++ mechanics somewhere, and whether one can get hold of it from PyArrow.

I don't know C++ very well, but I think what I'm looking for is in buffer.h: there are pointers to types under Buffer which I think point to just that. I saw that Buffer is actually accessible via pa.lib.Buffer, and that it even has a to_pybytes() method. However:
- I'm not sure those are the bytes that I speak of
- I'm not sure how to use Buffer to find out; I keep getting core dumps when trying

Sent with ProtonMail Secure Email.

-------- Original Message --------
On January 10, 2018 7:34 PM, Wes McKinney wrote:

>hi Eli,
>
> I am not aware of any standards for binary columns (or at least, I
> don't know what "regular" means in this context) -- part of the
> purpose of the Apache Arrow project is to define a columnar standard
> in the absence of any existing one. Most database systems define their
> own custom wire protocols.
>
> Do you have a link to the specification for the binary protocol for
> the database you are using (or some other documentation)?
>
> Thanks,
> Wes
>
> On Wed, Jan 10, 2018 at 12:47 AM, Eli h5r...@protonmail.ch wrote:
>>Hey Wes,
>>The database in question accepts columnar chunks of "regular" binary data
>>over the network, one of the sources of which is parquet.
>>Thus, data only comes out of parquet on my side, and I was wondering how to
>>get it out as "regular" binary columns. Something like tobytes() for an Arrow
>>Column, or maybe read_asbytes() for pa itself. The purpose is to get to
>>standard binary columns as fast as possible.
>>Thanks,
>> Eli
>>Sent with ProtonMail Secure Email.
>>>-------- Original Message --------
>>> Subject: Re: How to get "standard" binary columns out of a pyarrow table
>>> Local Time: January 10, 2018 5:32 AM
>>> UTC Time: January 10, 2018 3:32 AM
>>> From: wesmck...@gmail.com
>>> To: dev@arrow.apache.org, Eli h5r...@protonmail.ch
>>>hi Eli,
>>>I'm wondering what kind of API you would want, if the perfect one
>>> existed. If I understand correctly, you are embedding objects in a
>>> BYTE_ARRAY column in Parquet, and need to do some post-processing as
>>> the data goes in / comes out of Parquet?
>>>Thanks,
>>> Wes
>>>On Sat, Jan 6, 2018 at 8:37 AM, Eli h5r...@protonmail.ch wrote:
>>>>Hi,
>>>> I'm looking to send "regular" columnar binary data to a database, the kind
>>>> that gets created by struct.pack, array.array, numpy.tobytes or str.encode.
>>>> The origin is parquet files, which I'm reading ever so comfortably via
>>>> PyArrow.
>>>> I do however need to deserialize to Python objects, currently via
>>>> to_pandas(), then re-serialize the columns with one of the above.
>>>> I was wondering whether there was a better way to go about it, one which
>>>> would be the fastest and most effective.
>>>> Ideally I'd like to go through Python, but I can do C or even some C++ if
>>>> necessary.
>>>> I posted the question on Stack Overflow, and was asked to post here.
>>>> Appreciate any feedback!
>>>> Thanks,
>>>> Eli
>>>> Sent with ProtonMail Secure Email.
>>>>
>>>
>>
>
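
For reference, here is a minimal sketch of the "standard" layout described at the top of this thread, using only the Python standard library. The helper names int32_column_bytes and int32_column_bytes_via_array are made up for illustration, and the sketch assumes the database wants little-endian 4-byte signed integers with no null handling, which the thread does not confirm.

```python
import array
import struct
import sys


def int32_column_bytes(values):
    """Hypothetical helper: pack values as little-endian 4-byte signed ints."""
    # "<" forces little-endian with no padding; "i" is a 4-byte signed int.
    return struct.pack("<{}i".format(len(values)), *values)


def int32_column_bytes_via_array(values):
    """Hypothetical helper: same bytes via array.array (native C int)."""
    arr = array.array("i", values)
    # array.array uses the platform's C int, so check the width explicitly.
    if arr.itemsize != 4:
        raise RuntimeError("C int is not 4 bytes on this platform")
    # array.array uses native byte order; swap on big-endian machines.
    if sys.byteorder == "big":
        arr.byteswap()
    return arr.tobytes()


column = [1, 2, 3]
print(int32_column_bytes(column))
# b'\x01\x00\x00\x00\x02\x00\x00\x00\x03\x00\x00\x00'
print(int32_column_bytes(column) == int32_column_bytes_via_array(column))
# True
```

These bytes are what one would compare against whatever an Arrow data buffer returns from to_pybytes(). Whether the two match for a given column (for example when nulls are present or the array is a slice) is an open question here, not something this sketch establishes.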