std::vector::data() returns a buffer containing pointers to
the individual string buffers and Arrow needs a buffer with contiguous
variable-length character data.
And that is buffers[2]. buffers[1] contains the offsets for beginning and
end of the strings in buffers[2].
So yes, use the StringBuilder.
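A minimal sketch of that (the function name is just for illustration):

#include <arrow/api.h>
#include <string>
#include <vector>

arrow::Result<std::shared_ptr<arrow::Array>> ToArrowStrings(
    const std::vector<std::string>& values) {
  arrow::StringBuilder builder;
  // Each Append copies the string's bytes into one contiguous data buffer
  // (buffers[2]) and records its end offset in the offsets buffer (buffers[1]).
  for (const auto& v : values) {
    ARROW_RETURN_NOT_OK(builder.Append(v));
  }
  return builder.Finish();
}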
at 3:09 PM Felipe Oliveira Carvalho
wrote:
> std::vector::data() returns a buffer containing pointers to
> the individual string buffers and Arrow needs a buffer with contiguous
> variable-length character data.
>
> And that is buffers[2]. buffers[1] contains the offsets for beginning an
Does creating a decimal128 array, then casting that array to float64 work?
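Something along these lines (a hedged sketch; precision/scale and the value are illustrative):

import decimal
import pyarrow as pa
import pyarrow.compute as pc

# value too wide for int64, stored as decimal128 first
big = pa.array([decimal.Decimal("12345678901234567890123")],
               type=pa.decimal128(38, 0))
as_float = pc.cast(big, pa.float64())  # lossy, but the cast is accepted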
On Mon, May 8, 2023 at 3:08 PM Chris Comeau wrote:
> Is there any way to have pa.compute.cast handle int -> float64 with
> accepted loss of precision?
>
> Source value is a python int that's too long for int64, like
> 123
Hi Arkadiy,
Every array can potentially have nulls, meaning that the logical type of
the values of every array is a nullable type, but it's common for
compute kernels to specialize their loops based on the presence or absence
of nulls in an array by calling Array::MayHaveLogicalNulls() before
starting the loop.
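A rough sketch of that specialization pattern (the summing function is only for illustration):

#include <arrow/api.h>

double SumNonNull(const arrow::DoubleArray& array) {
  double sum = 0;
  if (!array.MayHaveLogicalNulls()) {
    // fast path: every slot is valid, no per-slot checks
    for (int64_t i = 0; i < array.length(); ++i) sum += array.Value(i);
  } else {
    // slower path: consult validity for each slot
    for (int64_t i = 0; i < array.length(); ++i) {
      if (!array.IsNull(i)) sum += array.Value(i);
    }
  }
  return sum;
}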
You can add `duration` arrays to `timestamp` arrays to get new `timestamp`
arrays [1][2].
import pyarrow as pa
import pyarrow.compute as pc
_ts = ["9/03/2023 00:35", "9/03/2023 12:35", "9/03/2023 6:35", "9/03/2023
18:35"]
_format = "%d/%m/%Y %H:%M"
timestamps = pc.strptime(_ts, format= _format,
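From there, adding a duration yields shifted timestamps, e.g. (the one-hour value is just for illustration):

one_hour = pa.scalar(3600, type=pa.duration("s"))
shifted = pc.add(timestamps, one_hour)  # timestamp + duration -> timestamp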
Try to give Arrow the JSON text containing all the records. Working one
record at a time goes against the philosophy of vectorized array processing.
https://arrow.apache.org/docs/python/generated/pyarrow.json.read_json.html
Instead of getting an array of structs, you will get a table where each
key becomes a column.
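A minimal sketch (the record contents here are made up):

import io
import pyarrow.json as pj

# all the records at once, as newline-delimited JSON
data = b'{"id": 1, "name": "a"}\n{"id": 2, "name": "b"}\n'
table = pj.read_json(io.BytesIO(data))
# one column per key: table.column_names == ['id', 'name']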
Hi Elliott,
Not that I know of. But do you have concrete numbers and a practical case
that could motivate someone to tackle the project?
--
Felipe
On Sun, Oct 22, 2023 at 10:05 AM Elliott Bradshaw
wrote:
> Hi Arrow Team,
>
> We love your work. Wondering if support for Run End Encoded Vectors
In a Vectorized querying system, scalars and conditionals should be
avoided at all costs. That's why it's called "vectorized" — it's about
the vectors and not the scalars.
Arrow Arrays (AKA "vectors" in other systems) are the unit of data you
mainly deal with. Data abstraction (in the OOP sense) i
data_ + N)
> }
>
> Now I just need to figure out the best way to do this over multiple columns
> (row-wise).
>
> Thanks again!
>
>
> On Tue, 20 Feb 2024 at 19:51, Felipe Oliveira Carvalho
> wrote:
>>
>> In a Vectorized querying system, scalars and conditionals
What are you trying to achieve in converting these structs to arrays
partitioned by columns?
Are you transferring batches of them from/to somewhere?
The Arrow format is not good if you intend to process one at a time.
On Wed, Mar 6, 2024 at 12:33 PM kekronbekron
wrote:
>
> Also considering derive
1. the first read is always 65536, then it is followed by read of the
size of parquet.
This might be a constant inside adlfs or the Azure SDK itself (?). I
don't know off the top of my head if Parquet always reads 64k or if
that's an Azure SDK thing.
2. looks like parquet footer is read on almost e
mini DB (ex: a .duckdb file) of each record type+subtype, so
> that exploring within a type is fast, and joining stuff is equally fast &
> easy.
>
> Once converted, it's just a matter of accessing them via S3 or whatever.
>
>
> On Thursday, March 7th, 2024 at 20:04
I couldn't find the docs for compute.scalar, but by checking the
source code I can say this:
pyarrow.scalar [1] creates an instance of a pyarrow.*Scalar class from
a Python object.
pyarrow.compute.scalar [2] creates an Arrow compute Expression
wrapping a scalar object.
You rarely need pyarrow.compute.scalar unless you are building compute
expressions.
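A small sketch of the difference:

import pyarrow as pa
import pyarrow.compute as pc

x = pa.scalar(3)                  # a pyarrow.Int64Scalar holding the value 3
e = pc.scalar(3)                  # a compute Expression wrapping a scalar
filter_expr = pc.field("a") > e   # only useful when composing expressions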
Hi,
The builders can't really know the size of the buffers when nested types
are involved. The general solution would be an expensive traversal of the
entire tree of builders (e.g. struct builder of nested column types like
strings) on every append.
I suggest you leverage your domain knowledge of the data you are appending.
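If that knowledge lets you estimate sizes, one option (a sketch, with placeholder counts) is to reserve capacity on a builder up front instead of asking the builder tree on every append:

#include <arrow/api.h>

arrow::Status ReserveUpFront(arrow::StringBuilder* builder,
                             int64_t expected_values,
                             int64_t expected_bytes) {
  ARROW_RETURN_NOT_OK(builder->Reserve(expected_values));      // value slots
  ARROW_RETURN_NOT_OK(builder->ReserveData(expected_bytes));   // character data
  return arrow::Status::OK();
}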
che-arrow-15-composable-data-management/
[2]
https://arrow.apache.org/docs/format/Columnar.html#variable-size-binary-view-layout
On Fri, Jul 5, 2024 at 1:35 PM Eric Jacobs wrote:
> Felipe Oliveira Carvalho wrote:
> > Hi,
> > The builders can't really know the size of the
Hi,
ArrayKernelExec must be a pointer to a C function.
using ArrayKernelExec = Status (*)(KernelContext*, const ExecSpan&,
ExecResult*);
Status EncryptFloat64(KernelContext* ctx, const ExecSpan& batch,
ExecResult* out) {
auto& arg0 = batch[0];
auto out_data = PrealocateBinaryArrayForMyEncryp
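Any free function with that exact signature works and decays to the required pointer; a minimal skeleton (the body is intentionally left trivial):

#include <arrow/compute/exec.h>
#include <arrow/compute/kernel.h>

arrow::Status MyExec(arrow::compute::KernelContext* ctx,
                     const arrow::compute::ExecSpan& batch,
                     arrow::compute::ExecResult* out) {
  // ... fill *out from batch[0] here ...
  return arrow::Status::OK();
}

// A free function (or captureless lambda) converts to the function pointer;
// a capturing lambda or std::function does not.
arrow::compute::ArrayKernelExec exec = &MyExec;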
addition [1]) allows a more flexible
> > chunking of the data buffers [2].
>
> Thanks! I'll check it out.
>
> -Eric
>
>
> Felipe Oliveira Carvalho wrote:
> > > However, I'm not seeing how it would be necessary on every append
> > since the topology would
>> buffer? I, for example, do not know where the validity buffer is in the
>> ExecSpan.
>>
>> Few additional questions. In the example code in
>> "example/arrow/udf_example.cc", it dereferences the array with index 1 in
>> the batch.
>> *|> batch[
Is Hierarchical Namespace [1] Enabled on the Storage Account?
When HNS is not enabled or when operations using ADLFS fail, the Azure file
system implementation falls back to Azure Blobs operations.
I have a draft on my machine of a change that would add a configuration
option to *force* the use o
Don't create a memory pool locally (and destroy it when the function
returns); use the global singleton pool from `arrow::default_memory_pool()`
instead.
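For example (the Int64Builder is just for illustration):

#include <arrow/api.h>

arrow::Result<std::shared_ptr<arrow::Array>> BuildInts() {
  // The global default pool outlives any arrays allocated from it.
  arrow::Int64Builder builder(arrow::default_memory_pool());
  ARROW_RETURN_NOT_OK(builder.AppendValues({1, 2, 3}));
  return builder.Finish();
}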
--
Felipe
On Mon, Aug 12, 2024 at 12:44 PM Surya Kiran Gullapalli <
suryakiran.gullapa...@gmail.com> wrote:
> Hello all,
> I'm trying to conve
Extra tip: avoid calling ValueOrDie() as that will kill your program in
case of errors.
Replace auto x = F().ValueOrDie(); with ARROW_ASSIGN_OR_RAISE(auto x, F())
and declare the function to either return an arrow::Status or an
arrow::Result.
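For example (ReadTable here is a hypothetical stand-in for any function returning an arrow::Result):

#include <arrow/api.h>

arrow::Result<std::shared_ptr<arrow::Table>> ReadTable();  // hypothetical

arrow::Status UseTable() {
  // instead of: auto table = ReadTable().ValueOrDie();
  ARROW_ASSIGN_OR_RAISE(auto table, ReadTable());
  // ... use table ...
  return arrow::Status::OK();
}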
--
Felipe
On Mon, Aug 19, 2024 at 10:41 AM Hung Dang
You can build `compute::Expression` instances [1] and use them in different
contexts like scanning datasets [2] and producing Substrait plans [3] that
you can execute.
But you have to write your own parser and define the scope and semantics of
the operations you would support.
[1]
https://github.
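As a sketch, a filter like "a > 3" can be built programmatically:

#include <arrow/compute/expression.h>

namespace cp = arrow::compute;

cp::Expression BuildFilter() {
  // field_ref/literal/greater compose an unbound Expression; it gets bound
  // to a schema when used, e.g. as a dataset scan filter.
  return cp::greater(cp::field_ref("a"), cp::literal(3));
}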
You would have to use a std::shared_ptr as a buffer in one of the
array layouts in a manner that’s compatible with the type.
On Wed, 9 Oct 2024 at 12:41 Yi Cao wrote:
> Hi,
> I want to store pointers to avoid copy of large amount of data. And then I
> can pass such table and extract pointers fro
Hi Robert,
I hit the same problem recently but there’s a Python-only workaround you
can use.
https://github.com/apache/arrow-experiments/pull/35/files#r1797397257
—
Felipe
On Fri, 11 Oct 2024 at 05:13 Antoine Pitrou wrote:
>
> Hi Robert,
>
> On Thu, 10 Oct 2024 08:33:28 -0700
> Robert McLeod
Hi,
Yi Cao's request comes from a misunderstanding of where the performance of
Arrow comes from.
Arrow arrays follow the SoA paradigm [1]. The moment you start thinking
about individual objects with an associated ref-count (std::shared_ptr)
is the moment you've given up the SoA approach and you a
You can create two different build directories: release and debug.
Then you run cmake $ARROW_ROOT in each of the two folders, e.g. with
-DCMAKE_BUILD_TYPE=Debug in one and -DCMAKE_BUILD_TYPE=Release in the other.
On Fri, 22 Nov 2024 at 15:53 Carl Godkin wrote:
> Hi,
>
> I'm using the arrow library with parquet version 18.0.0 on Windows and
> Linux from C++.
>
> For development
files AFTER I build them (e.g., using this
> <https://github.com/cmberryau/rename_dll/blob/master/rename_dll.py>Python
> script) but that doesn't quite work in this case since parquet.dll depends on
> arrow.dll. What ends up happening is that my "parquetD.dll
I don't have very specific advice, but mmap() and programmer control don't
come together. The point of mmap is deferring all the logic to the OS and
trusting that it knows better.
If you're calling read_all(), it will do what the name says: read all the
batches. Have you tried looping and getting one batch at a time?
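In Python with the IPC file format, for instance, something like this reads one batch at a time (the file name is a placeholder):

import pyarrow as pa
import pyarrow.ipc as ipc

with pa.memory_map("data.arrow", "r") as source:
    reader = ipc.open_file(source)
    for i in range(reader.num_record_batches):
        batch = reader.get_batch(i)
        # process one batch, letting the OS page data in and out as needed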
Further reading: https://en.wikipedia.org/wiki/Authenticated_encryption
AES-GCM is a form of Authenticated Encryption.
On Thu, Feb 27, 2025 at 3:33 AM Antoine Pitrou wrote:
>
> Hello,
>
> Parquet encryption ensures integrity if you use the default encryption
> algorithm AES_GCM (not AES_CTR). Y
No, but if these are gRPC proxies they should work.
On Wed, 12 Mar 2025 at 18:13 Z A wrote:
> Hi,
> I just subscribed to this mailing list, and apologize if this is a silly
> question.
> Has anyone ever done any integration of API Gateway (i.e. Kong, Tyk,
> KrakenD, etc.) with your own Arrow Fli