Re: [Rust] DataFusion performance

2020-12-11 Thread Jorge Cardoso Leitão
Hi Matthew, SchemaRef is just an alias for Arc<Schema>. Thus, you need to wrap it in an Arc. We do this because the plans are often passed across thread boundaries, and wrapping them in an Arc allows that. Best, Jorge On Fri, Dec 11, 2020 at 8:14 PM Matthew Turner wrote: > Thanks! Converting the

Re: Question the nature of the "Zero Copy" advantages of Apache Arrow

2021-01-26 Thread Jorge Cardoso Leitão
Hi Thomas, The canonical interface that the arrow format offers to share data within the same process is the C data interface. It offers a stable ABI to share memory via foreign interfaces. C++, Python, R and Rust support it

Re: why that take so many times to read parquets file with 300 000 columns

2021-03-01 Thread Jorge Cardoso Leitão
I understand that this does not answer the question, but it may be worth pointing out regardless: if you control the writing, it may be more suitable to encode the columns and use a link list for the problem: encode each column by a number x and store the data as two columns. For example: id, x0,
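
The message is cut off before its example, so the following is only one plausible reading of the suggestion: a minimal sketch that assigns each column name an integer code and stores one (id, column code, value) row per non-missing cell. The names `wide_rows` and `wide_to_long` and the sample data are hypothetical.

```python
# Hypothetical wide table: one row per id, one field per column name.
wide_rows = [
    {"id": 1, "x0": 0.5, "x1": None, "x2": 2.0},
    {"id": 2, "x0": None, "x1": 7.0, "x2": None},
]


def wide_to_long(rows, id_key="id"):
    """Return (codes, long_rows): a name -> integer map and (id, code, value) triples."""
    # Give every column name a stable integer code.
    names = sorted({key for row in rows for key in row if key != id_key})
    codes = {name: i for i, name in enumerate(names)}
    # Emit one triple per cell that actually holds a value.
    long_rows = [
        (row[id_key], codes[name], value)
        for row in rows
        for name, value in row.items()
        if name != id_key and value is not None
    ]
    return codes, long_rows


codes, long_rows = wide_to_long(wide_rows)
for triple in long_rows:
    print(triple)
```

With this layout the file has a fixed, small number of columns regardless of how many logical columns exist; the cost is that reading one logical column becomes a filter on the column code.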

Re: Hashing and equivalence of datasets

2021-12-03 Thread Jorge Cardoso Leitão
AFAIK hashing in this context needs to be done on a slot by slot basis, just like array equality, as any item on a null slot has a value on the buffer that is undetermined. E.g. the layout of a primitive array [1, 2, None, 4] is two buffer regions: * [1, 2, ?, 4] and * [true, true, false, true] (i
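
A minimal sketch of why the raw buffers cannot simply be hashed, using plain Python lists in place of real Arrow buffers. This is an assumption-laden model: the validity bitmap is a list of booleans rather than packed bits, values are 32-bit integers, and `hash_raw` and `hash_slots` are hypothetical helpers, not part of any Arrow library.

```python
import hashlib
import struct


def hash_raw(values, validity):
    """Hash both buffers as-is, including whatever sits under null slots."""
    h = hashlib.sha256()
    h.update(struct.pack(f"<{len(values)}i", *values))
    h.update(bytes(validity))
    return h.hexdigest()


def hash_slots(values, validity):
    """Hash slot by slot, never reading the value of a null slot."""
    h = hashlib.sha256()
    for value, is_valid in zip(values, validity):
        if is_valid:
            h.update(b"\x01" + struct.pack("<i", value))
        else:
            h.update(b"\x00")
    return h.hexdigest()


# Two encodings of the same logical array [1, 2, None, 4]:
# they differ only in the undetermined value under the null slot.
validity = [True, True, False, True]
a_values = [1, 2, 0, 4]
b_values = [1, 2, 99, 4]

print(hash_raw(a_values, validity) == hash_raw(b_values, validity))
print(hash_slots(a_values, validity) == hash_slots(b_values, validity))
```

The first comparison is false and the second is true: only the slot-wise hash treats the two logically equal arrays as equal.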

Re: Hashing and equivalence of datasets

2021-12-04 Thread Jorge Cardoso Leitão
Hi, I think that unfortunately Parquet is underdetermined. For example, in RLE-hybrid encoding, whether to use an RLE or a bitpacked run is left for implementations to decide: one implementation may use only bitpacked runs, while another may use a combination. This leads to different
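
To illustrate the point, here is a simplified sketch of the two run types of the RLE/bit-packed hybrid encoding, written from the layout described in the Parquet format documentation. It omits the length prefix and everything else a real writer does, and all function names are hypothetical. The same eight values are encoded once as an RLE run and once as a bit-packed run; the bytes differ but decode to the same values.

```python
def varint(n):
    """Encode a non-negative integer as an unsigned LEB128 varint."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)


def encode_rle_run(value, count, bit_width):
    """RLE run: header is (count << 1), followed by the value in whole bytes."""
    width_bytes = (bit_width + 7) // 8
    return varint(count << 1) + value.to_bytes(width_bytes, "little")


def encode_bitpacked_run(values, bit_width):
    """Bit-packed run: header is (groups_of_8 << 1) | 1, values packed LSB first."""
    assert len(values) % 8 == 0, "bit-packed runs hold multiples of 8 values"
    bits = 0
    for i, v in enumerate(values):
        bits |= v << (i * bit_width)
    payload = bits.to_bytes(len(values) * bit_width // 8, "little")
    return varint(((len(values) // 8) << 1) | 1) + payload


def decode(data, bit_width):
    """Decode a sequence of RLE and bit-packed runs back into a list of values."""
    values, pos = [], 0
    mask = (1 << bit_width) - 1
    while pos < len(data):
        # Read the varint run header.
        header, shift = 0, 0
        while True:
            byte = data[pos]
            pos += 1
            header |= (byte & 0x7F) << shift
            shift += 7
            if not byte & 0x80:
                break
        if header & 1:  # bit-packed run
            count = (header >> 1) * 8
            nbytes = count * bit_width // 8
            bits = int.from_bytes(data[pos:pos + nbytes], "little")
            pos += nbytes
            values.extend((bits >> (i * bit_width)) & mask for i in range(count))
        else:  # RLE run
            width_bytes = (bit_width + 7) // 8
            value = int.from_bytes(data[pos:pos + width_bytes], "little")
            pos += width_bytes
            values.extend([value] * (header >> 1))
    return values


values = [3] * 8
as_rle = encode_rle_run(3, 8, bit_width=2)
as_bitpacked = encode_bitpacked_run(values, bit_width=2)
print(as_rle.hex(), as_bitpacked.hex())
print(decode(as_rle, 2) == decode(as_bitpacked, 2) == values)
```

Because both byte strings are valid encodings of the same data, a hash of the encoded bytes depends on the writer's choices and not only on the values.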

Re: [Question][Python] Columns with Limited Value Set

2022-01-05 Thread Jorge Cardoso Leitão
We could use an extension type here: wrap the dictionary type in an extension type whose metadata contains the expected keys. This way the keys are stored in the schema. On Wed, Jan 5, 2022 at 11:32 PM Neal Richardson wrote: > For what it's worth, I encountered a similar issue in working on the

Re: [DISCUSS][C++] Store C++ shared_ptr in arrow table

2024-10-11 Thread Jorge Cardoso Leitão
ter to an integer then doing math with it, then casting it back. On Thu, Oct 10, 2024 at 08:35, Jorge Cardoso Leitão wrote: > Hi, This use-c

Re: [DISCUSS][C++] Store C++ shared_ptr in arrow table

2024-10-10 Thread Jorge Cardoso Leitão
Hi, This use-case seems semantically equivalent to storing Python objects in arrow for the purpose of putting them in an arrow table. This can be achieved by some form of pickling or indirection (I recall Polars and others doing one of these). IMO there are different approaches with different t
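
The two approaches named above can be sketched in plain Python, with lists standing in for Arrow columns. This is an illustration under stated assumptions, not how Polars or any Arrow library implements it; `registry` and `register` are hypothetical names.

```python
import pickle

# Arbitrary objects that have no native columnar representation.
objects = [{"name": "a"}, {"name": "b", "tags": [1, 2]}]

# Approach 1, pickling: serialize each object into a binary column.
binary_column = [pickle.dumps(obj) for obj in objects]
restored = [pickle.loads(blob) for blob in binary_column]

# Approach 2, indirection: keep the objects in a side registry and
# store only integer keys in the column.
registry = {}


def register(obj):
    """Store obj in the registry and return the integer key for the column."""
    key = len(registry)
    registry[key] = obj
    return key


index_column = [register(obj) for obj in objects]

print(restored == objects, restored[0] is objects[0])
print(registry[index_column[0]] is objects[0])
```

In this sketch pickling yields equal copies that no longer share identity with the originals, while indirection returns the very same objects but only makes sense while the registry is alive in the same process.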