Re: [C++] Control flow and scheduling in C++ Engine operators / exec nodes

2022-07-26 Thread Li Jin
Thanks Weston! Appreciate the information. On Mon, Jul 25, 2022 at 10:04 PM Weston Pace wrote: > I'll hijack this thread for a bit of road mapping. There are a number > of significant infrastructure changes that are on my mind regarding > Acero. I'll list them here in no particular order. > >

Re: [C++] Control flow and scheduling in C++ Engine operators / exec nodes

2022-07-25 Thread Weston Pace
I'll hijack this thread for a bit of road mapping. There are a number of significant infrastructure changes that are on my mind regarding Acero. I'll list them here in no particular order. * [1] The scanner needs to be updated to properly support cancellation I mainly mention this here as it is a

Re: [C++] Control flow and scheduling in C++ Engine operators / exec nodes

2022-07-22 Thread Li Jin
Hi! Since the scheduler improvement work came up in some recent discussions about how backpressure is handled in Acero, I am curious if there has been any more progress on this since May or any future plans? Thanks, Li On Mon, May 23, 2022 at 10:37 PM Weston Pace wrote: > > About point 2. I h

Re: [C++] Control flow and scheduling in C++ Engine operators / exec nodes

2022-05-23 Thread Weston Pace
> About point 2. I have previously seen the pipeline prioritization you've > described with both sides running simultaneously. My experience was not > good with that approach, and the one-side-first approach was much better. This is good insight, I appreciate it. I hope we can run this kind of exper

Re: [C++] Control flow and scheduling in C++ Engine operators / exec nodes

2022-05-21 Thread Supun Kamburugamuve
Thanks, Weston. Now I understand a little deeper. About point 2. I have previously seen the pipeline prioritization you've described with both sides running simultaneously. My experience was not good with that approach, and the one-side-first approach was much better. About these points, if you are

Re: [C++] Control flow and scheduling in C++ Engine operators / exec nodes

2022-05-20 Thread Weston Pace
I think I understand what you are saying now. There are a few different things going on that lead to a need for backpressure. 1. Write backpressure We support asynchronous writes. Filesystems like AWS want many parallel I/O operations. On a two core system you might have 8 different parallel w

Re: [C++] Control flow and scheduling in C++ Engine operators / exec nodes

2022-05-20 Thread Supun Kamburugamuve
Hi Sasha, For case 2, I don't see why we need a back-pressure mechanism. Let's say there is an IO thread. All we need is a queue with a defined capacity that feeds data from the IO thread to the Read task. Supun.. On Fri, May 20, 2022 at 8:25 PM Sasha Krassovsky wrote: > Hi Supun, > Roughly what ha
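
[Editor's sketch] The "queue with a defined capacity" idea in the message above can be illustrated with a small bounded blocking queue. This is only a minimal sketch under stated assumptions, not Acero's or Arrow's implementation: BoundedQueue, PipelineResult, and RunPipeline are hypothetical names introduced for illustration, and plain ints stand in for record batches. Once the queue is full the producer blocks, which is itself a simple form of backpressure. The sketch assumes C++17 and a threading-enabled build.

```cpp
#include <algorithm>
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <optional>
#include <queue>
#include <thread>
#include <utility>

// Hypothetical bounded queue for illustration; not part of Arrow or Acero.
template <typename T>
class BoundedQueue {
 public:
  explicit BoundedQueue(std::size_t capacity) : capacity_(capacity) {}

  // Blocks the producer while the queue is full; this blocking is the
  // backpressure applied to the upstream (IO) side.
  void Push(T item) {
    std::unique_lock<std::mutex> lock(mutex_);
    not_full_.wait(lock, [this] { return items_.size() < capacity_; });
    items_.push(std::move(item));
    max_depth_ = std::max(max_depth_, items_.size());
    not_empty_.notify_one();
  }

  // Blocks until an item is available; returns std::nullopt once the
  // queue has been closed and fully drained.
  std::optional<T> Pop() {
    std::unique_lock<std::mutex> lock(mutex_);
    not_empty_.wait(lock, [this] { return !items_.empty() || closed_; });
    if (items_.empty()) return std::nullopt;
    T item = std::move(items_.front());
    items_.pop();
    not_full_.notify_one();
    return item;
  }

  // Signals that no more items will be pushed.
  void Close() {
    std::lock_guard<std::mutex> lock(mutex_);
    closed_ = true;
    not_empty_.notify_all();
  }

  // Largest number of items ever held at once (never exceeds capacity).
  std::size_t max_depth() const {
    std::lock_guard<std::mutex> lock(mutex_);
    return max_depth_;
  }

 private:
  const std::size_t capacity_;
  mutable std::mutex mutex_;
  std::condition_variable not_full_;
  std::condition_variable not_empty_;
  std::queue<T> items_;
  std::size_t max_depth_ = 0;
  bool closed_ = false;
};

struct PipelineResult {
  long long consumed_sum;
  std::size_t max_depth;
};

// Hypothetical driver: an "IO thread" pushes batch ids, the caller consumes.
PipelineResult RunPipeline(int num_batches, std::size_t capacity) {
  BoundedQueue<int> queue(capacity);
  std::thread io_thread([&] {
    for (int i = 0; i < num_batches; ++i) queue.Push(i);
    queue.Close();
  });
  long long sum = 0;
  while (auto batch = queue.Pop()) sum += *batch;
  io_thread.join();
  return {sum, queue.max_depth()};
}
```

In this sketch the memory held between reader and consumer is bounded by the queue capacity regardless of how fast the IO thread produces. The trade-off is that the producing thread blocks while the queue is full, which may matter when that thread is shared with other work.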

Re: [C++] Control flow and scheduling in C++ Engine operators / exec nodes

2022-05-20 Thread Sasha Krassovsky
Hi Supun, Roughly what happens now is #2. However, in your example, it may be the case that we are reading CSV data from disk faster than we are transcoding it into Parquet and writing it. Note that we attempt to use the full disk bandwidth and assign batches to cores once the reads are done, s

Re: [C++] Control flow and scheduling in C++ Engine operators / exec nodes

2022-05-20 Thread Supun Kamburugamuve
Thanks, Weston. From your description, I can think about how the current engine works. Let me try to map your example into execution. Then we can explore a little bit more in detail. We have 300GB of data and we have a read CSV operator (that creates an Arrow Table) and a Write Parquet operator.

Re: [C++] Control flow and scheduling in C++ Engine operators / exec nodes

2022-05-20 Thread Weston Pace
If the amount of batch data you are processing is larger than the RAM on the system then back pressure is needed. A common use case is dataset repartitioning. If you are repartitioning a large (e.g. 300GB) dataset from CSV to parquet then the bottleneck will typically be the "write" stage. Backp

Re: [C++] Control flow and scheduling in C++ Engine operators / exec nodes

2022-05-20 Thread Supun Kamburugamuve
Looking at the proposal I couldn't understand why there is a need for back-pressure handling. My understanding of the Arrow C++ engine is that it is meant to process batch data. So I couldn't think of why we need to handle back-pressure as it is normally needed in streaming engines. Best, Supun.;

Re: [C++] Control flow and scheduling in C++ Engine operators / exec nodes

2022-05-12 Thread Andrew Lamb
Thank you for sharing this document. Raphael Taylor-Davies is working on a similar exercise scheduling execution for DataFusion plans. The design doc[1] and initial PR [2] may be an interesting reference. In the DataFusion case we were trying to improve performance in a few ways: 1. Within a pip

Re: [C++] Control flow and scheduling in C++ Engine operators / exec nodes

2022-05-12 Thread Li Jin
Thanks Wes and Michal. We have similar concerns about the current eager-push control flow with time series / ordered data processing and are glad that we are not the only ones thinking about this. I have read the doc and so far just left some questions to make sure I understand the proposal (admitte

Re: [C++] Control flow and scheduling in C++ Engine operators / exec nodes

2022-05-11 Thread Wes McKinney
I talked about these problems with my colleague Michal Nowakiewicz who has been developing some of the C++ engine implementation over the last year and a half, and he wrote up this document with some ideas about task scheduling and control flow in the query engine for everyone to look at and commen

Re: [C++] Control flow and scheduling in C++ Engine operators / exec nodes

2022-05-02 Thread Weston Pace
Thanks for investigating and looking through this. Your understanding of how things work is pretty much spot on. In addition, I think the points you are making are valid. Our ExecNode/ExecPlan interfaces are extremely bare bones and similar nodes have had to reimplement the same solutions (e.g.

[C++] Control flow and scheduling in C++ Engine operators / exec nodes

2022-05-02 Thread Wes McKinney
hi all, I've been catching up on the C++ execution engine codebase after a fairly long development hiatus. I have several questions / comments about the current design of the ExecNode and their implementations (currently: source / scan, filter, project, union, aggregate, sink, hash join). My cur