Re: [C++] std::vector to Datum

2023-05-04 Thread Felipe Oliveira Carvalho
std::vector<std::string>::data() returns a buffer containing pointers to the individual string buffers, while Arrow needs a buffer with contiguous variable-length character data. And that is buffers[2]. buffers[1] contains the offsets for the beginning and end of each string in buffers[2]. So yes, use the StringBuil

Re: [C++] std::vector to Datum

2023-05-04 Thread Felipe Oliveira Carvalho
at 3:09 PM Felipe Oliveira Carvalho wrote: > std::vector<std::string>::data() returns a buffer containing pointers to > the individual string buffers and Arrow needs a buffer with contiguous > variable-length character data. > > And that is buffers[2]. buffers[1] contains the offsets for beginning an

Re: [Python] "OverflowError: int too big to convert" with target_type float64 - allow loss of precision?

2023-05-11 Thread Felipe Oliveira Carvalho
Does creating a decimal128 array, then casting that array to float64 work? On Mon, May 8, 2023 at 3:08 PM Chris Comeau wrote: > Is there any way to have pa.compute.cast handle int -> float64 with > accepted loss of precision? > > Source value is a python int that's too long for int64, like > 123

Re: holes in arrays

2023-06-08 Thread Felipe Oliveira Carvalho
Hi Arkadiy, Every array can potentially have nulls, meaning that the logical type of the values of every array is nullable, but it’s common for compute kernels to specialize their loops based on the presence or absence of nulls in an array by calling Array::MayHaveLogicalNulls() before starting the

Re: How to adjust pyarrow timestamps using pyarrow.compute

2023-07-21 Thread Felipe Oliveira Carvalho
You can add `duration` arrays to `timestamp` arrays to get new `timestamp` arrays [1][2]. import pyarrow as pa import pyarrow.compute as pc _ts = ["9/03/2023 00:35", "9/03/2023 12:35", "9/03/2023 6:35", "9/03/2023 18:35"] _format = "%d/%m/%Y %H:%M" timestamps = pc.strptime(_ts, format= _format,

Re: [Python] How to cast JSON String Array to STRUCT in arrow?

2023-09-12 Thread Felipe Oliveira Carvalho
Try to give Arrow the JSON text containing all the records. Working one record at a time goes against the philosophy of vectorized array processing. https://arrow.apache.org/docs/python/generated/pyarrow.json.read_json.html Instead of getting an array of structs, you will get a table where each k

Re: [Java][Format] Support for Run End Encoded Vectors

2023-10-23 Thread Felipe Oliveira Carvalho
Hi Elliott, Not that I know of. But do you have concrete numbers and a practical case that could motivate someone to tackle the project? -- Felipe On Sun, Oct 22, 2023 at 10:05 AM Elliott Bradshaw wrote: > Hi Arrow Team, > > We love your work. Wondering if support for Run End Encoded Vectors

Re: [C++] Recommended way to extract values from scalars

2024-02-20 Thread Felipe Oliveira Carvalho
In a vectorized querying system, scalars and conditionals should be avoided at all costs. That's why it's called "vectorized" — it's about the vectors and not the scalars. Arrow Arrays (AKA "vectors" in other systems) are the unit of data you mainly deal with. Data abstraction (in the OOP sense) i

Re: [C++] Recommended way to extract values from scalars

2024-02-22 Thread Felipe Oliveira Carvalho
data_ + N) > } > > Now I just need to figure out the best way to do this over multiple columns > (row-wise). > > Thanks again! > > > On Tue, 20 Feb 2024 at 19:51, Felipe Oliveira Carvalho > wrote: >> >> In a Vectorized querying system, scalars and conditionals

Re: C/C++ structs to Arrow types

2024-03-07 Thread Felipe Oliveira Carvalho
What are you trying to achieve in converting these structs to arrays partitioned by columns? Are you transferring batches of them from/to somewhere? The Arrow format is not good if you intend to process one at a time. On Wed, Mar 6, 2024 at 12:33 PM kekronbekron wrote: > > Also considering derive

Re: Fine tunning pyarrow.dataset.dataset with adlfs

2024-03-07 Thread Felipe Oliveira Carvalho
1. the first read is always 65536, then it is followed by a read of the size of the parquet. This might be a constant inside adlfs or the Azure SDK itself (?). I don't know off the top of my head if Parquet always reads 64k or if that's an Azure SDK thing. 2. looks like the parquet footer is read on almost e

Re: C/C++ structs to Arrow types

2024-03-07 Thread Felipe Oliveira Carvalho
mini DB (ex: a .duckdb file) of each record type+subtype, so > that exploring within a type is fast, and joining stuff is equally fast & > easy. > > Once converted, it's just a matter of accessing them via S3 or whatever. > > > On Thursday, March 7th, 2024 at 20:04

Re: pyarrow: pa.compute.scalar vs pa.scalar

2024-05-27 Thread Felipe Oliveira Carvalho
I couldn't find the docs for compute.scalar, but by checking the source code I can say this: pyarrow.scalar [1] creates an instance of a pyarrow.*Scalar class from a Python object. pyarrow.compute.scalar [2] creates an Arrow compute Expression wrapping a scalar object. You rarely need pyarrow.com

Re: [C++] Building a ChunkedArray with allocation size control

2024-07-04 Thread Felipe Oliveira Carvalho
Hi, The builders can't really know the size of the buffers when nested types are involved. The general solution would be an expensive traversal of the entire tree of builders (e.g. struct builder of nested column types like strings) on every append. I suggest you leverage your domain knowledge of

Re: [C++] Building a ChunkedArray with allocation size control

2024-07-05 Thread Felipe Oliveira Carvalho
che-arrow-15-composable-data-management/ [2] https://arrow.apache.org/docs/format/Columnar.html#variable-size-binary-view-layout On Fri, Jul 5, 2024 at 1:35 PM Eric Jacobs wrote: > Felipe Oliveira Carvalho wrote: > > Hi, > > The builders can't really know the size of the

Re: [C++] How to add user defined functions to arrow compute

2024-07-08 Thread Felipe Oliveira Carvalho
Hi, ArrayKernelExec must be a pointer to a C function. using ArrayKernelExec = Status (*)(KernelContext*, const ExecSpan&, ExecResult*); Status EncryptFloat64(KernelContext* ctx, const ExecSpan& batch, ExecResult* out) { auto& arg0 = batch[0]; auto out_data = PreallocateBinaryArrayForMyEncryp

Re: [C++] Building a ChunkedArray with allocation size control

2024-07-09 Thread Felipe Oliveira Carvalho
addition [1]) allows a more flexible > > chunking of the data buffers [2]. > > Thanks! I'll check it out. > > -Eric > > > Felipe Oliveira Carvalho wrote: > > > However, I'm not seeing how it would be necessary on every append > > since the topology would

Re: [C++] How to add user defined functions to arrow compute

2024-07-10 Thread Felipe Oliveira Carvalho
>> buffer? I, for example, do not know where the validity buffer is in the >> ExecSpan. >> >> Few additional questions. In the example code in >> "example/arrow/udf_example.cc", it dereferences the array with index 1 in >> the batch. >> *|> batch[

Re: Using the new Azure filesystem object (C++)

2024-07-11 Thread Felipe Oliveira Carvalho
Is Hierarchical Namespace [1] Enabled on the Storage Account? When HNS is not enabled or when operations using ADLFS fail, the Azure file system implementation falls back to Azure Blobs operations. I have a draft on my machine of a change that would add a configuration option to *force* the use o

Re: [C++][JSON] json (string) to Table.

2024-08-12 Thread Felipe Oliveira Carvalho
Don't create a memory pool locally (and destroy it when the function returns); use the global singleton pool from `arrow::default_memory_pool()` instead. -- Felipe On Mon, Aug 12, 2024 at 12:44 PM Surya Kiran Gullapalli < suryakiran.gullapa...@gmail.com> wrote: > Hello all, > I'm trying to conve

Re: How to reconstruct an arrow::Table from an arrow::Buffer object in C++?

2024-08-19 Thread Felipe Oliveira Carvalho
Extra tip: avoid calling ValueOrDie() as that will kill your program in case of errors. Replace auto x = F().ValueOrDie(); with ARROW_ASSIGN_OR_RAISE(auto x, F()) and declare the function to either return an arrow::Status or an arrow::Result. -- Felipe On Mon, Aug 19, 2024 at 10:41 AM Hung Dang

Re: [C++]Create derived data (using formulae)

2024-08-28 Thread Felipe Oliveira Carvalho
You can build `compute::Expression` instances [1] and use them in different contexts like scanning datasets [2] and producing Substrait plans [3] that you can execute. But you have to write your own parser and define the scope and semantics of the operations you would support. [1] https://github.

Re: [DISCUSS][C++] Store C++ shared_ptr in arrow table

2024-10-09 Thread Felipe Oliveira Carvalho
You would have to use a std::shared_ptr as a buffer in one of the array layouts in a manner that’s compatible with the type. On Wed, 9 Oct 2024 at 12:41 Yi Cao wrote: > Hi, > I want to store pointers to avoid copy of large amount of data. And then I > can pass such table and extract pointers fro

Re: Extract objects from CompressedOutputStream

2024-10-11 Thread Felipe Oliveira Carvalho
Hi Robert, I hit the same problem recently but there’s a Python-only workaround you can use. https://github.com/apache/arrow-experiments/pull/35/files#r1797397257 — Felipe On Fri, 11 Oct 2024 at 05:13 Antoine Pitrou wrote: > > Hi Robert, > > On Thu, 10 Oct 2024 08:33:28 -0700 > Robert McLeod

Re: [DISCUSS][C++] Store C++ shared_ptr in arrow table

2024-10-10 Thread Felipe Oliveira Carvalho
Hi, Yi Cao's request comes from a misunderstanding of where the performance of Arrow comes from. Arrow arrays follow the SoA paradigm [1]. The moment you start thinking about individual objects with an associated ref-count (std::shared_ptr) is the moment you've given up the SoA approach and you a

Re: C++ building question

2024-11-22 Thread Felipe Oliveira Carvalho
You can create two different build directories: release and debug. Then you run `cmake $ARROW_ROOT` in each of the two folders. On Fri, 22 Nov 2024 at 15:53 Carl Godkin wrote: > Hi, > > I'm using the arrow library with parquet version 18.0.0 on Windows and > Linux from C++. > > For development
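
A sketch of the two-tree setup (assuming `$ARROW_ROOT` points at the Arrow C++ sources; generator and flags are illustrative):

```shell
# Two out-of-source build trees sharing one source checkout
mkdir -p build-release build-debug

cmake -S "$ARROW_ROOT" -B build-release -DCMAKE_BUILD_TYPE=Release
cmake -S "$ARROW_ROOT" -B build-debug   -DCMAKE_BUILD_TYPE=Debug

cmake --build build-release
cmake --build build-debug
```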

Re: C++ building question

2024-11-22 Thread Felipe Oliveira Carvalho
files AFTER I build them (e.g., using this > <https://github.com/cmberryau/rename_dll/blob/master/rename_dll.py>Python > script) but that doesn't quite work in this case since parquet.dll depends on > arrow.dll. What ends up happening is that my "parquetD.dll

Re: Demand-loading Arrow files

2025-01-22 Thread Felipe Oliveira Carvalho
I don't have very specific advice, but mmap() and programmer control don't come together. The point of mmap is deferring all the logic to the OS and trusting that it knows better. If you're calling read_all(), it will do what the name says: read all the batches. Have you tried looping and getting

Re: [QUESTION][Parquet][Encryption] Checksum Flow for Parquet Modular Encryption

2025-02-27 Thread Felipe Oliveira Carvalho
Further reading: https://en.wikipedia.org/wiki/Authenticated_encryption AES-GCM is a form of Authenticated Encryption. On Thu, Feb 27, 2025 at 3:33 AM Antoine Pitrou wrote: > > Hello, > > Parquet encryption ensures integrity if you use the default encryption > algorithm AES_GCM (not AES_CTR). Y

Re: api gateway with arrow flight grpc

2025-03-13 Thread Felipe Oliveira Carvalho
No, but if these are gRPC proxies they should work. On Wed, 12 Mar 2025 at 18:13 Z A wrote: > Hi, > I just subscribed to this mailing list, and apologize if this is a silly > question. > Has anyone ever done any integration of API Gateway (i.e. Kong, Tyk, > KrakenD, etc.) with your own Arrow Fli