Re: Pandas Block Manager

2020-11-11 Thread Weston Pace
Nick, it appears converting the ndarray to a dataframe clears the contiguous flag even though it doesn't actually change the underlying array. At least, this is what I'm seeing with my testing. My guess is this is what is causing arrow to do a copy (arrow is indeed doing a new allocation here, th

Support for reading arbitrary nested objects

2020-11-11 Thread Renato Marroquín Mogrovejo
Hi Arrow experts, I am trying to find out if Arrow supports reading/writing arbitrary nested objects similarly to what Parquet supports with its FSM. I came across this PR https://github.com/apache/arrow/pull/4066 which aimed to implement the Parquet specific approach (the FSM) but it was declined

Re: Pandas Block Manager

2020-11-11 Thread Joris Van den Bossche
On Wed, 11 Nov 2020 at 00:52, Micah Kornfield wrote:
>
> Sorry, I should clarify, I'm not familiar with zero copy from Pandas to
> Arrow, so there might be something else going on here. But once an arrow
> file is written out, buffers will be padded/aligned to 8 bytes.
>
> In general, I think rel

Re: Pandas Block Manager

2020-11-11 Thread Joris Van den Bossche
Hi Weston, When starting with a 2D ndarray, the conversion from numpy to pandas DataFrame (`pd.DataFrame(arr)`) is actually zero copy. But pandas takes a transposed view on the original array (that's the reason the C contiguous flag changes), to ensure the columns are the first dimension of the st
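The effect described in this message can be reproduced with NumPy alone. The following is only a minimal sketch: it uses a plain `arr.T` as a stand-in for the transposed view pandas is said to take, and does not touch pandas internals. The array shape and the variable names `arr` and `view` are illustrative assumptions.

```python
import numpy as np

# A C-contiguous (row-major) 2D array, as a user would typically create it.
arr = np.zeros((3, 4), dtype=np.float64)

# Stand-in for the transposed view described above: transposing creates
# a new array object over the same buffer, without copying any data.
view = arr.T

print(arr.flags.c_contiguous)        # flag on the original array
print(view.flags.c_contiguous)       # flag on the transposed view
print(view.flags.f_contiguous)       # the view is Fortran-contiguous instead
print(np.shares_memory(arr, view))   # both objects use one buffer

# Writing through the view is visible in the original array,
# which confirms that no copy was made.
view[1, 0] = 42.0
```

The contiguity flag differs between the two objects even though the underlying memory is identical, which matches the observation that the flag changes without the array itself changing.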

Rust ParquetReader trait

2020-11-11 Thread vertexclique vertexclique
Hi All; I have implemented different data sources before for the ParquetReader (privately), but with the latest changes (esp. https://github.com/apache/arrow/pull/8300/files#diff-0b220b2d327afc583fd75b2d3c52901e628026a11cfa694ffc252ffd45fb6db0L20) there is an orphanage of the ParquetReader trait. I

RE: [Discuss] Provide pluggable APIs to support user customized compression codec

2020-11-11 Thread Xie, Qi
Hi Antoine,
Regarding the API you mentioned, I would like to know what scope this API will cover. Is it a configuration API to override the built-in gzip?
Thanks, XieQi
-Original Message-
From: Antoine Pitrou
Sent: Tuesday, October 27, 2020 11:39 PM
To: Xie, Qi ; dev@arrow.apache.org
Cc: X

Arrow sync call November 11 at 12:00 US/Eastern, 17:00 UTC

2020-11-11 Thread Neal Richardson
Hi all, Reminder that our biweekly call is coming up at https://meet.google.com/vtm-teks-phx. Note that the US has gone back to standard time so we're back at 17:00 UTC again. All are welcome to join. Notes will be sent out to the mailing list afterward. Neal

Re: Arrow sync call November 11 at 12:00 US/Eastern, 17:00 UTC

2020-11-11 Thread Neal Richardson
Attendees:
Mahmut Bulut
Projjal Chanda
Rémi Dettai
Neville Dipale
Micah Kornfield
Jorge Cardoso Leitão
Neal Richardson
Charlene Solonynka

Discussion:
* Patch release? Some parquet issues that may result in data loss, plus some Python and R issues (and others?) so it may be a good idea
* Cost of d

Re: Rust ParquetReader trait

2020-11-11 Thread Rémi Dettai
Hi Mahmut, The way of implementing sources for Parquet has changed. The new way is to implement the ChunkReader trait. This is simpler (fewer methods to implement) and more efficient (you have more information about the upcoming bytes that will be read). The ParquetReader has been made private as i

Patch Release 2.0.1?

2020-11-11 Thread Micah Kornfield
There are a couple of Parquet bugs that I think might warrant a patch release. The most pressing, I think, is ARROW-10493, which can potentially lose data silently depending on how batch parameters are used with nullable structs. I think this is serious enough that we should consider a patch releas

Re: Support for reading arbitrary nested objects

2020-11-11 Thread Micah Kornfield
Hi Renato, I'm not clear if you are asking if the Arrow/Feather file format supports this or if Arrow's parquet binding supports it. Regardless, both formats as of 2.0.0 now support arbitrarily nested data (there were some bugs discovered after the 2.0.0 release, and I just started a discussion on d

Re: Pandas Block Manager

2020-11-11 Thread Nicholas White
Thanks all, this has been interesting. I've made a patch that sort-of does what I want[1] - I hope the test case is clear! I made the batch writer use the `alignment` field that was already in the `IpcWriteOptions` to align the buffers, instead of fixing their alignment at 8. Arrow then writes out
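As a rough illustration of what an alignment setting means for buffer sizes, the arithmetic can be sketched as below. `padded_length` is a hypothetical helper written for this example and is not part of Arrow's API. The default of 8 comes from the thread; the value 64 is only an illustrative alternative.

```python
def padded_length(nbytes: int, alignment: int = 8) -> int:
    """Round nbytes up to the next multiple of alignment.

    Hypothetical helper for illustration only; not an Arrow function.
    """
    if nbytes < 0:
        raise ValueError("nbytes must be non-negative")
    if alignment <= 0:
        raise ValueError("alignment must be positive")
    # Integer ceiling division, then scale back up to a byte count.
    return (nbytes + alignment - 1) // alignment * alignment


# A 13-byte buffer under the default 8-byte alignment from the thread,
# and under an illustrative 64-byte alignment.
size_default = padded_length(13)
size_wide = padded_length(13, alignment=64)
print(size_default, size_wide)
```

A larger alignment costs more padding per buffer, so the trade-off depends on how many small buffers a file contains.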

[DISCUSS] Alternative design for KMS interaction in parquet-cpp

2020-11-11 Thread Benjamin Kietzman
In the context of https://issues.apache.org/jira/browse/ARROW-9318 / https://github.com/apache/arrow/pull/8023 which port the parquet-mr design to c++: there's some question whether that design is consistent with the style and conventions of the c++ implementation of parquet. Here is a Gist with a