Nick, it appears converting the ndarray to a dataframe clears the
contiguous flag even though it doesn't actually change the underlying
array. At least, this is what I'm seeing in my testing. My guess
is that this is what is causing Arrow to do a copy (Arrow is indeed doing a
new allocation here, th
Hi Arrow experts,
I am trying to find out if Arrow supports reading/writing arbitrary
nested objects, similar to what Parquet supports with its FSM.
I came across this PR https://github.com/apache/arrow/pull/4066 which
aimed to implement the Parquet specific approach (the FSM) but it was
declined
On Wed, 11 Nov 2020 at 00:52, Micah Kornfield wrote:
>
> Sorry, I should clarify, I'm not familiar with zero copy from Pandas to
> Arrow, so there might be something else going on here. But once an arrow
> file is written out, buffers will be padded/aligned to 8 bytes.
>
> In general, I think rel
Hi Weston,
When starting with a 2D ndarray, the conversion from numpy to pandas
DataFrame (`pd.DataFrame(arr)`) is actually zero copy. But, pandas
takes a transposed view of the original array (that's the reason the C
contiguous flag changes), to ensure the columns are the first dimension
of the st
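The zero-copy and transposed-view behaviour described above can be observed directly. This is a sketch; the details depend on pandas version and block-manager internals:

```python
import numpy as np
import pandas as pd

arr = np.arange(12, dtype=np.float64).reshape(3, 4)
assert arr.flags["C_CONTIGUOUS"]

# Constructing a DataFrame from a homogeneous 2D ndarray is zero copy:
# the DataFrame's values share memory with the original array.
df = pd.DataFrame(arr)
assert np.shares_memory(arr, df.values)

# Internally pandas holds a transposed view (columns-first), and a
# transposed view of a C-contiguous array is no longer C-contiguous:
assert not arr.T.flags["C_CONTIGUOUS"]
```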
Hi All;
I have implemented different data sources before for the
ParquetReader(privately) but with the latest changes (esp.
https://github.com/apache/arrow/pull/8300/files#diff-0b220b2d327afc583fd75b2d3c52901e628026a11cfa694ffc252ffd45fb6db0L20
(
the ParquetReader trait now appears orphaned. I
Hi, Antoine
About the API you mentioned, I want to know what scope this API will
cover. Will the configuration API allow overriding the built-in gzip?
Thanks,
XieQi
-----Original Message-----
From: Antoine Pitrou
Sent: Tuesday, October 27, 2020 11:39 PM
To: Xie, Qi ; dev@arrow.apache.org
Cc: X
Hi all,
Reminder that our biweekly call is coming up at
https://meet.google.com/vtm-teks-phx. Note that the US has gone back to
standard time so we're back at 17:00 UTC again.
All are welcome to join. Notes will be sent out to the mailing list
afterward.
Neal
Attendees:
Mahmut Bulut
Projjal Chanda
Rémi Dettai
Neville Dipale
Micah Kornfield
Jorge Cardoso Leitão
Neal Richardson
Charlene Solonynka
Discussion:
* Patch release? Some parquet issues that may result in data loss, plus
some Python and R issues (and others?) so it may be a good idea
* Cost of d
Hi Mahmut,
The way of implementing sources for Parquet has changed. The new way is to
implement the ChunkReader trait. This is simpler (fewer methods to
implement) and more efficient (you have more information about the upcoming
bytes that will be read). The ParquetReader has been made private as i
There are a couple of Parquet bugs that I think might warrant a patch
release. The most pressing, I think, is ARROW-10493, which can
potentially lose data silently depending on how batch parameters are
used with nullable structs.
I think this is serious enough that we should consider a patch releas
Hi Renato,
I'm not clear if you are asking whether the Arrow/Feather file format
supports this or whether Arrow's Parquet binding supports it.
Regardless, both formats as of 2.0.0 now support arbitrarily nested data
(there were some bugs discovered after the 2.0.0 release, and I just
started a discussion on d
Thanks all, this has been interesting. I've made a patch that sort-of does
what I want[1] - I hope the test case is clear! I made the batch writer use
the `alignment` field that was already in the `IpcWriteOptions` to align
the buffers, instead of fixing their alignment at 8. Arrow then writes out
In the context of https://issues.apache.org/jira/browse/ARROW-9318 /
https://github.com/apache/arrow/pull/8023 which port the parquet-mr
design to c++: there's some question whether that design is consistent
with the style and conventions of the c++ implementation of parquet.
Here is a Gist with a