Re: [DISCUSS][Format] Starting the draft implementation of the ArrayView array format

2023-05-20 Thread Sasha Krassovsky
points, and performing a conversion from the non-view to view format seems like it would be very cheap (though I understand not necessarily the other way around, but you’d need to do that anyway if you’re serializing). Sasha Krassovsky > On May 20, 2023, at 13:00, Will Jones wrote

Re: Probably an unnecessary copy when outputting join result?

2023-04-13 Thread Sasha Krassovsky
trivial fix? > > Rossi > > > On Fri, Apr 14, 2023 at 01:44, Sasha Krassovsky <krassovskysa...@gmail.com> wrote: >> Hi Rossi, >> That’s a good catch! I _think_ the compiler will automatically emit the move >> because it sees we’re copying from an object that’ll

Re: Probably an unnecessary copy when outputting join result?

2023-04-13 Thread Sasha Krassovsky
Hi Rossi, That’s a good catch! I _think_ the compiler will automatically emit the move because it sees we’re copying from an object that’ll never be used again [1], but adding the std::move would be good just to remove any ambiguity. Go ahead and make the PR! Sasha Move, simply (herbsutter.com) Apr 13, 2
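To make the copy-versus-move question above concrete, here is a minimal sketch using a hypothetical `Counted` type (not an Arrow class) that counts its own copies. It relies on two language facts: returning a local by value is elided or implicitly moved, while passing a named local as a by-value argument copies unless `std::move` is written. Which of these cases the join code falls into cannot be determined from the excerpt, so all names here are illustrative assumptions.

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Global counter for illustration only (hypothetical, not part of Arrow).
int g_copies = 0;

// A toy payload type that records every copy construction.
struct Counted {
  std::vector<int> data;
  Counted() = default;
  Counted(const Counted& other) : data(other.data) { ++g_copies; }
  Counted(Counted&& other) noexcept : data(std::move(other.data)) {}
};

// Takes its argument by value, like an API that consumes a batch.
void Sink(Counted batch) { (void)batch; }

// Returning a local: the compiler elides or implicitly moves, never copies.
Counted MakeLocal() {
  Counted local;
  local.data = {1, 2, 3};
  return local;
}

// Passing a named local by name: this is an lvalue, so it is copied.
int CopiesWhenPassingByName() {
  g_copies = 0;
  Counted local;
  Sink(local);
  return g_copies;
}

// Passing with std::move: the move constructor is selected instead.
int CopiesWhenPassingWithMove() {
  g_copies = 0;
  Counted local;
  Sink(std::move(local));
  return g_copies;
}

// Receiving a returned local: no copy is made.
int CopiesWhenReturningLocal() {
  g_copies = 0;
  Counted result = MakeLocal();
  (void)result;
  return g_copies;
}
```

A counter like this is one cheap way to confirm whether an explicit `std::move` changes anything at a given call site.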

Re: Question about thread local data in `QueryContext`

2023-03-09 Thread Sasha Krassovsky
ow/compute/exec/swiss_join.cc#L2505> > within SwissJoin already, does it make sense to put the thread local vector > inside SwissJoin rather than QueryContext? Or is the thread local data in > QueryContext designed to be used inter-node? > > Thanks. > > *Rossi Sun* >

Re: Question about thread local data in `QueryContext`

2023-03-09 Thread Sasha Krassovsky
Hi Rossi, When profiling Acero we noticed that there was a lot of overhead regarding memory allocation, specifically in the creation/destruction of std::vector. This thread local data in QueryContext was put there as a preparation to refactor other nodes to use TempVectorStack when they need a t

Re: Parser for ExecPlans

2022-11-03 Thread Sasha Krassovsky
; in that case, we >> would just need to ask indentation. >> >> Percy >> >> >>> On Thu, Nov 3, 2022 at 12:47 PM Sasha Krassovsky >>> wrote: >>> >>> Hi Percy, >>> Thanks for the input! New lines would be no problem at all

Re: Parser for ExecPlans

2022-11-03 Thread Sasha Krassovsky
…) > …) > …) > > Percy > > >> On Tue, Oct 18, 2022 at 5:54 PM Sasha Krassovsky >> wrote: >> >> Hi everyone, >> We recently had some discussions about parsing expressions. I currently >> have a PR [1] up for that taking into account the

Parser for ExecPlans

2022-10-18 Thread Sasha Krassovsky
clutter. Thanks! Sasha Krassovsky [1] https://github.com/apache/arrow/pull/14287

Re: Parser for expressions

2022-10-13 Thread Sasha Krassovsky
then provide support for nodes / relations I think we >> will need to deviate from SQL as it is simply not expressive enough. >> >>> On Mon, Oct 10, 2022 at 12:17 PM Antoine Pitrou >>> wrote: >>> >>> >>> I don't see the point of h

Re: Parser for expressions

2022-10-10 Thread Sasha Krassovsky
its own grammar different from compute/Acero/Substrait etc. > > Best, > Jin > >> On Oct 8, 2022, at 03:01, Sasha Krassovsky wrote: >> >> Hi Jin, >> I agree it would be good to standardize on a syntax. To me, the advantages >> of the lisp-style syntax are: >> -

Re: Parser for expressions

2022-10-07 Thread Sasha Krassovsky
nd Sasha's is with Lisp functional style ((foo x y z), (+ a > b)…). I feel like it'll be better for us to settle on one of the styles > before we start implementing the parsers. > > Best, > Jin > >> On Friday, October 7, 2022, Sasha Krassovsky >> wrote: >

Re: Parser for expressions

2022-10-06 Thread Sasha Krassovsky
However, at the moment we have zero. > > [1] https://lists.apache.org/thread/0oyns380hgzvl0y8kwgqoo4fp7ntt3bn > >> On Wed, Oct 5, 2022 at 1:55 PM Sasha Krassovsky >> wrote: >> >> Hi David, >> Could you elaborate on which part of my proposal overlaps with S

Re: Parser for expressions

2022-10-05 Thread Sasha Krassovsky
> <https://www.youtube.com/watch?v=5JjaB7p3Sjk> > > -Original Message- > From: Sasha Krassovsky <krassovskysa...@gmail.com> > Sent: Wednesday, October 5, 2022 11:29 AM > To: dev@arrow.apache.org > Subject: Parser for expression

Re: [WEBSITE] Blog posts on representing Structured Data with Parquet and Arrow

2022-10-05 Thread Sasha Krassovsky
Hi, we aren’t able to connect to your localhost 😀 > On Oct 5, 2022, at 12:44 PM, Andrew Lamb wrote: > > We have published the first post: > http://localhost:4000/blog/2022/10/05/arrow-parquet-encoding-part-1/ > > On Sun, Oct 2, 2022 at 7:00 AM Andrew Lamb wrote: > >> We are working on a seri

Parser for expressions

2022-10-05 Thread Sasha Krassovsky
this one. Looking forward to hearing everyone’s thoughts! Thanks, Sasha Krassovsky [0] https://github.com/apache/arrow/pull/14287 [1] https://github.com/apache/arrow/blob/master/cpp/src/arrow/type.h#L1726

Re: Transactional semantics in Acero

2022-09-09 Thread Sasha Krassovsky
Hi Jayjeet, Transactions are currently out of scope for Acero - Acero is only meant to be a query execution engine. That said, it can definitely be used as a component for building a full database engine, which could implement its own locking of rows while Acero executes on them. You could also

Re: [ANNOUNCE] New Arrow PMC member: Weston Pace

2022-09-05 Thread Sasha Krassovsky
Congratulations Weston!! Very well deserved! > On Sep 5, 2022, at 11:04 AM, Ian Joiner wrote: > > Congrats Weston! > > On Mon, Sep 5, 2022 at 1:56 AM Sutou Kouhei wrote: > >> The Project Management Committee (PMC) for Apache Arrow has invited >> Weston Pace to become a PMC member and we are

Re: [RESULT][VOTE] C++: switch to C++17

2022-08-30 Thread Sasha Krassovsky
Hi, What kind of timeline did we decide on? Is this something that can be worked on/merged immediately or should we wait until after 10.0 is out? Sasha Krassovsky > On Aug 29, 2022, at 2:20 AM, Antoine Pitrou wrote: > > > Hello, > > With 5 binding +1 votes, 7 non-bindin

Re: [VOTE] C++: switch to C++17

2022-08-24 Thread Sasha Krassovsky
++1 (non-binding) > On Aug 24, 2022, at 08:53, Jacob Wujciak > wrote: > + 1 (non-binding) > > Benjamin Kietzman wrote on Wed, Aug 24, 2022, > 17:43: > >> +1 (binding) >> >> On Wed, Aug 24, 2022, 11:32 Antoine Pitrou wrote: >> >>> Hello, >>> I would like to propose that the Arrow C+

[C++] Disable anonymous namespaces in debug mode

2022-08-10 Thread Sasha Krassovsky
ut potentially gating our anonymous namespaces behind a `#ifndef NDEBUG` check? This way we can still disable exporting all of those symbols in release builds but make life much easier when debugging. Thanks, Sasha Krassovsky

Re: [DISCUSS][Format] Dynamic data encodings in the IPC format and C ABI

2022-07-29 Thread Sasha Krassovsky
type. Let me know if I need to clarify anything, that was a lot of text :) Sasha Krassovsky > On Jul 29, 2022, at 4:18 PM, Wes McKinney wrote: > > of the implementation when it comes to the IPC format and the C > interface. >

Re: [RUST][Go][proposal] Arrow Intermediate Representation to facilitate the transformation of row-oriented data sources into Arrow columnar representation

2022-07-28 Thread Sasha Krassovsky
rows would need a little bit of massaging (though not too much) to be put into this representation. Sasha Krassovsky > On Jul 28, 2022, at 1:10 PM, Laurent Quérel wrote: > > Thank you Micah for a very clear summary of the intent behind this > proposal. Indeed, I think that clarify

Re: [C++] ResumeProducing Future Causing Blocking

2022-07-20 Thread Sasha Krassovsky
Hi, Futures run callbacks on the thread that marks them as finished. It seems that the Source node’s generator loop does add a callback (https://github.com/iChauster/arrow/blob/asof_join2/cpp/src/arrow/compute/exec/source_node.cc#L130
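The behaviour described above, where callbacks run on whichever thread marks the future finished, can be illustrated with a single-threaded sketch. `ToyFuture` below is a hypothetical stand-in written for this example, not `arrow::Future`, and it is deliberately not thread-safe. It only shows that `MarkFinished` does not return until every registered callback has run, which is how a slow callback ends up blocking the caller.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical toy future: callbacks are invoked inline by MarkFinished(),
// i.e. on the thread of whoever marks it finished.
class ToyFuture {
 public:
  void AddCallback(std::function<void()> callback) {
    if (finished_) {
      callback();  // Already finished: run immediately on the adding caller.
      return;
    }
    callbacks_.push_back(std::move(callback));
  }

  void MarkFinished() {
    finished_ = true;
    for (auto& callback : callbacks_) callback();  // Runs inside this call.
    callbacks_.clear();
  }

 private:
  bool finished_ = false;
  std::vector<std::function<void()>> callbacks_;
};

// Records the order of events around MarkFinished().
std::vector<std::string> TraceMarkFinished() {
  std::vector<std::string> trace;
  ToyFuture future;
  future.AddCallback([&trace] { trace.push_back("callback runs"); });
  trace.push_back("before MarkFinished");
  future.MarkFinished();
  trace.push_back("after MarkFinished");
  return trace;
}
```

In this toy model the callback entry always lands between the two `MarkFinished` entries, so any long-running work inside a callback delays the code that finished the future.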

Re: [C++] Moving from -O3 to -O2 optimization level in release builds

2022-07-20 Thread Sasha Krassovsky
I’d +1 on this - in my past experience I’ve mostly seen -O2. It would make sense to default to -O2 and only enable -O3 on source files selectively that can be demonstrated to benefit from it (if anyone actually spends the time to look into it). Sasha > On Jul 20, 2022, at 2:10 PM, Wes McKinney

Re: cpp: Debugging 'plan destruction before finishing'

2022-07-15 Thread Sasha Krassovsky
Hi everyone, Yes the current code is a bit difficult to reason about due to the fact that we rely on nodes marking their finished_ future as finished at the proper times. The various ExecNodes were further inconsistent about when they created their futures and where, and sometimes they'd overwrite

Re: Adding cpp memory profiling to Arrow

2022-07-06 Thread Sasha Krassovsky
Hi Ivan, Inside of Acero, we can think of allocations as coming in two classes: - “Big” allocations, which go through `MemoryPool`, using `Buffer`. These are used for representing columns of input data and hash tables. - “Small” allocations, which are usually small, local STL containers like st

Re: [C++] Adding Run-Length Encoding to Arrow

2022-06-09 Thread Sasha Krassovsky
writing. Sasha Krassovsky > On Jun 8, 2022, at 3:59 AM, Alessandro Molina > wrote: > > RLE would probably have some benefits that it makes sense to evaluate, I > would personally go in the direction of having a minimal benchmarking suite > for some of the cases where we e

Re: [C++] Kernel function registry evolution

2022-06-06 Thread Sasha Krassovsky
for review with the C++ tests > > passing. I'm expecting assorted workarounds for the various kernels > > that do zero-copy optimizations (setting output buffers with input > > buffers — such optimizations should likely be carried out elsewhere), > > etc., but I will keep

Re: [C++] Kernel function registry evolution

2022-06-03 Thread Sasha Krassovsky
Hi all, I’ve been thinking about some sort of refactoring of this registry for a while now, and I’ve written down some thoughts, please leave your comments. https://docs.google.com/document/d/1LAN9I_Y9cZaG2a84j1wLY8jSlK3gDXYMle-VtyFCAE8/edit?usp=sharing

Re: [C++] Control flow and scheduling in C++ Engine operators / exec nodes

2022-05-20 Thread Sasha Krassovsky
filling up memory faster than we can process. Sasha Krassovsky > On May 20, 2022, at 20:09, Supun Kamburugamuve wrote: >

Re: Clarifying how MapNode works

2022-05-19 Thread Sasha Krassovsky
Hello! Yes, your understanding is correct. However, it is worth noting that when MapNode was originally implemented it was meant to remove some boilerplate from Project and Filter, as previously these nodes spawned separate tasks. As we move towards more efficiently using pipeline execution, this

Re: Question: What should the offsets buffer be for an empty (list, binary, string) array?

2022-05-09 Thread Sasha Krassovsky
Hello, I think an empty string array will have an offsets buffer of length 1 with the value 0. Sasha Krassovsky > On May 9, 2022, at 05:23, Yang hao <1371656737...@gmail.com> wrote: > > For an empty (list, binary, string) array, what should the offsets buffer > be?
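The answer above follows from the usual rule that a variable-size array of n values carries n + 1 offsets, starting at 0. A minimal sketch of that rule, using only the standard library: `BuildOffsets` is a hypothetical helper written for illustration and is not part of the Arrow API.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Computes 32-bit offsets for a list of strings: n values yield n + 1
// offsets, and value i spans [offsets[i], offsets[i + 1]) in the data buffer.
std::vector<std::int32_t> BuildOffsets(const std::vector<std::string>& values) {
  std::vector<std::int32_t> offsets;
  offsets.reserve(values.size() + 1);
  std::int32_t position = 0;
  offsets.push_back(position);  // The leading 0 is present even when n == 0.
  for (const std::string& value : values) {
    position += static_cast<std::int32_t>(value.size());
    offsets.push_back(position);
  }
  return offsets;
}
```

With no input values the loop never runs, leaving a single offset equal to 0, which matches the length-1 buffer described in the reply.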

Re: mmap only, read data later?

2022-05-09 Thread Sasha Krassovsky
each file) for each file and then call read_feather later when you actually need it? Sasha Krassovsky > On May 9, 2022, at 09:38, Andrew Piskorski wrote: > > Hello, I'm using R package arrow_7.0.0.tar.gz, in R 4.1.1, on Linux > (Ubuntu 18.04.4 LTS). > > In R, I

Re: [DISCUSS][C++][Python]Switch default mmap behaviour to off

2022-05-06 Thread Sasha Krassovsky
else. Sasha Krassovsky > On May 5, 2022, at 23:03, Alvin Chunga Mamani > wrote: > > Hi all, > I start this discussion to comment on the change to disable the use of mmap > by default, which represents a risk in non-local/pseudo file systems that > can affect perfo

Re: [Compute][C++] Question on compute scheduler

2022-04-26 Thread Sasha Krassovsky
An ExecPlan is composed of a bunch of implicit “pipelines”. Each node in a pipeline (starting with a source node) implements `InputReceived` and `InputFinished`. On `InputReceived`, it performs its computation and calls `InputReceived` on its output. On `InputFinished`, it performs any cleanup a
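The push-based flow described above can be sketched with a toy pipeline. The classes below (`ToyNode`, `ToyMapNode`, `ToySinkNode`) are hypothetical and use simplified signatures; the real Acero `ExecNode` interface takes additional arguments and is not reproduced here. The sketch only shows each node doing its work in `InputReceived` and then calling `InputReceived` on its output, with `InputFinished` propagated the same way.

```cpp
#include <cassert>
#include <functional>
#include <utility>
#include <vector>

using Batch = std::vector<int>;  // Stand-in for a batch of rows.

// Hypothetical simplified node interface (not Acero's ExecNode).
class ToyNode {
 public:
  virtual ~ToyNode() = default;
  virtual void InputReceived(Batch batch) = 0;
  virtual void InputFinished() = 0;
  void SetOutput(ToyNode* output) { output_ = output; }

 protected:
  ToyNode* output_ = nullptr;
};

// Applies a function to each value, then pushes the batch downstream.
class ToyMapNode : public ToyNode {
 public:
  explicit ToyMapNode(std::function<int(int)> fn) : fn_(std::move(fn)) {}
  void InputReceived(Batch batch) override {
    for (int& value : batch) value = fn_(value);
    output_->InputReceived(std::move(batch));
  }
  void InputFinished() override { output_->InputFinished(); }

 private:
  std::function<int(int)> fn_;
};

// End of the pipeline: collects everything it receives.
class ToySinkNode : public ToyNode {
 public:
  void InputReceived(Batch batch) override {
    collected.insert(collected.end(), batch.begin(), batch.end());
  }
  void InputFinished() override { finished = true; }
  Batch collected;
  bool finished = false;
};

// Plays the role of a source node: pushes batches, then signals the end.
Batch RunToyPipeline(const std::vector<Batch>& source_batches) {
  ToyMapNode doubler([](int value) { return value * 2; });
  ToySinkNode sink;
  doubler.SetOutput(&sink);
  for (const Batch& batch : source_batches) doubler.InputReceived(batch);
  doubler.InputFinished();
  assert(sink.finished);
  return sink.collected;
}
```

Because each call pushes straight into the next node, a batch travels the whole pipeline on the thread that delivered it, which is the property the scheduling questions in this thread revolve around.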

Re: [Compute][C++] Question on compute scheduler

2022-04-26 Thread Sasha Krassovsky
get a baseline implementation that has one >> thread doing the join so if InputReceived is called by multiple threads I >> might end up blocking other threads unnecessarily. >> >> If I have a dedicated thread/executor that does the join and InputReceived >> just queue t

Re: Designing standards for "sandboxed" Arrow user-defined functions [was Re: User defined "Arrow Compute Function"]

2022-04-26 Thread Sasha Krassovsky
I think I can help answer these: 1) LLVM IR is an intermediate representation for compilers, WASM is an open standard for sandboxed computation. They fulfill different but complementary roles. If the query engine were handed LLVM IR, it would still have to JIT the IR to wasm in order to maintain

Re: [Compute][C++] Question on compute scheduler

2022-04-25 Thread Sasha Krassovsky
tus::OK() > else: > # Is this right? > # Exit and try later > return Status::OK(); > > If I register this function with TaskScheduler, I think it only gets run > once, so I think I might need to schedule the next task when inputs are not > ready but I am not sure of the best

Re: [Compute][C++] Question on compute scheduler

2022-04-25 Thread Sasha Krassovsky
Hi Li, I’ll answer the questions in order: 1. Your guess is correct! The Hash Join may be used standalone (mostly in testing or benchmarking for now) or as part of the ExecNode. The ExecNode will pass the task to the Executor to be scheduled, or will run it immediately if it’s in sync mode (i.e

Re: [Discuss][Format] Add 32-bit and 64-bit Decimals

2022-04-25 Thread Sasha Krassovsky
Regarding TPC-H and widening, we can (and do currently for the one query we have implemented) cast the decimal back down to the correct precision after each multiplication, so I don’t think this is an issue. On the other hand, there are definitely things we can do to dynamically detect if decima

Re: [C++] Replacing xsimd with compiler autovectorization

2022-04-03 Thread Sasha Krassovsky
> It would be a very significant contributor, as the inconsistency can manifest > under the form of up to 8-fold differences in performance (or perhaps more). This is on a micro benchmark. For a user workload, the kernel will account for maybe 20% of the runtime, so even if the kernel gets 10x f

Re: [C++] Replacing xsimd with compiler autovectorization

2022-03-31 Thread Sasha Krassovsky
eeping xsimd around to give us opportunities to further tune performance. At the very least, for an initial PR, I would like to keep everything simpler. We can then evaluate xsimd-fying the kernels separately. Sasha On Thu, Mar 31, 2022 at 12:36 AM Antoine Pitrou wrote: > > Le 31/03/20

Re: [C++] Replacing xsimd with compiler autovectorization

2022-03-31 Thread Sasha Krassovsky
> As I showed, those auto-vectorized kernels may be vectorized only in some > situations, depending on the compiler version, the input datatypes... I would more than anything interpret the fact that that code was vectorized at all as an amazing win for compiler technology, as it’s a very abstrac

Re: [C++] Replacing xsimd with compiler autovectorization

2022-03-30 Thread Sasha Krassovsky
instructions sets would it > make sense to continue that for consistency. > > `scalar_arithmetic_simd.cc`? > >> On Wed, Mar 30, 2022 at 4:58 PM Sasha Krassovsky >> wrote: >> >> Yep, that's basically what I'm suggesting. If someone contributes an xsimd

Re: [C++] Replacing xsimd with compiler autovectorization

2022-03-30 Thread Sasha Krassovsky
ming convention for such files so they can be easily > > distinguished. We'd need to tackle this complexity at some point, but > > trying to keep the mechanism understandable by people outside the project > > is something that we should evaluate as this is implemented. > > &

Re: [C++] Replacing xsimd with compiler autovectorization

2022-03-30 Thread Sasha Krassovsky
r > instance https://godbolt.org/z/KTcTe1zPn. Different versions of gcc > generate different vectorized code, and clang and gcc do not auto-vectorize > at the same optimization level (O2 for clang and O3 or O2 -ftree-vectorize > for gcc) > > Regards, > > Joh

Re: [C++] Replacing xsimd with compiler autovectorization

2022-03-29 Thread Sasha Krassovsky
e to using a pattern that was auto-vectorization friendly and maybe there > are better ones. However, this raises the issue about how fragile > auto-vectorization can be at times (and also based on compilers). > > For the record, I don't have a strong opinion on removing or keeping XSIMD,

Re: [C++] Replacing xsimd with compiler autovectorization

2022-03-29 Thread Sasha Krassovsky
ry is the best way. > > To me there's no "replacing" between xsimd and auto-vectorization, they > just do their own jobs. > > Yibo > > -Original Message- > From: Sasha Krassovsky > Sent: Wednesday, March 30, 2022 6:58 AM > To: dev@arrow.apache.org; emkornfi.

Re: [C++] Replacing xsimd with compiler autovectorization

2022-03-29 Thread Sasha Krassovsky
> I have to occasionally build Arrow with an external build system and it > sounds like this type of logic could add complexity there. > > Thanks, > Micah > > On Tue, Mar 29, 2022 at 3:14 PM Sasha Krassovsky < > krassovskysa...@gmail.com> > wrote: > > > Hi

[C++] Replacing xsimd with compiler autovectorization

2022-03-29 Thread Sasha Krassovsky
I believe this would let us remove xsimd as a dependency while also giving us lots of vectorized kernels at the cost of some extra cmake magic. After that, it would just be a matter of making the function registry point to these new functions. Please let me know your thoughts! Thanks, Sasha Krassovsky

Re: [DISCUSS] "Naming" the Arrow C++ execution engine subproject?

2022-03-29 Thread Sasha Krassovsky
r "Arrow Bow" would be sufficient for me. Sasha Krassovsky On Tue, Mar 29, 2022 at 9:25 AM Gavin Ray wrote: > "Arrow Compute Engine" sounds quite nice to me, tbh > Agreeing with the points made above about ACE being difficult to google, > and AQE being a loaded term

Re: [Discuss][Format] Add 32-bit and 64-bit Decimals

2022-03-08 Thread Sasha Krassovsky
. You can technically use a float too, but I expect 64-bit decimal to be faster. Sasha Krassovsky > On Mar 8, 2022, at 09:01, Micah Kornfield wrote: > > >> >> >> Do we want to keep the historical "C++ and Java" requirement or >> do we want