Thanks Weston! Appreciate the information.
On Mon, Jul 25, 2022 at 10:04 PM Weston Pace wrote:
> I'll hijack this thread for a bit of road mapping. There are a number
> of significant infrastructure changes that are on my mind regarding
> Acero. I'll list them here in no particular order.
>
>
I'll hijack this thread for a bit of road mapping. There are a number
of significant infrastructure changes that are on my mind regarding
Acero. I'll list them here in no particular order.
* [1] The scanner needs to be updated to properly support cancellation
I mainly mention this here as it is a
Hi!
Since the scheduler improvement work came up in some recent discussions
about how backpressure is handled in Acero, I am curious whether there has
been any more progress on this since May, or any future plans?
Thanks,
Li
On Mon, May 23, 2022 at 10:37 PM Weston Pace wrote:
> About point 2. I have previously seen the pipeline prioritization you've
> described with both sides running simultaneously. My experience was not
> good with that approach, and the one-side-first approach was much better.
This is good insight, I appreciate it. I hope we can run this kind of
exper
Thanks, Weston. Now I understand a little deeper.
About point 2. I have previously seen the pipeline prioritization you've
described with both sides running simultaneously. My experience was not
good with that approach, and the one-side-first approach was much better.
About these points, if you are
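For concreteness, the "one side first" style mentioned above can be sketched with a toy hash join in Python: the build side is consumed completely before any probe batch is processed. The function name, row layout, and keys here are invented for illustration; this is not Acero's implementation or API.

```python
from collections import defaultdict

def hash_join(build_rows, probe_rows, build_key, probe_key):
    """Toy one-side-first hash join: finish the build side entirely,
    then stream the probe side against the completed hash table."""
    # Phase 1: consume the whole build side into a hash table.
    table = defaultdict(list)
    for row in build_rows:
        table[row[build_key]].append(row)
    # Phase 2: only now process probe batches, one at a time.
    for row in probe_rows:
        for match in table.get(row[probe_key], []):
            yield {**match, **row}

# Hypothetical data: join users (build side) with orders (probe side).
users = [{"uid": 1, "name": "x"}]
orders = [{"id": 1, "item": "a"}, {"id": 2, "item": "b"}]
result = list(hash_join(users, orders, "uid", "id"))
# result -> [{"uid": 1, "name": "x", "id": 1, "item": "a"}]
```

The point of the prioritization debate is which phase gets the scheduler's threads: with one-side-first, all resources go to phase 1 until it finishes, rather than interleaving both phases.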
I think I understand what you are saying now. There are a few
different things going on that lead to a need for backpressure.
1. Write backpressure
We support asynchronous writes. Filesystems like AWS want many
parallel I/O operations. On a two core system you might have 8
different parallel w
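The write-backpressure case described above can be sketched with a small Python example: a semaphore caps the number of in-flight "writes", so the producer stalls once the cap is reached. The cap of 8, the sleep-based filesystem stand-in, and all names here are made up for illustration; this is not Acero's actual mechanism.

```python
import threading
import time

MAX_INFLIGHT = 8              # hypothetical cap on parallel write I/O
inflight = threading.Semaphore(MAX_INFLIGHT)
done = []
lock = threading.Lock()
workers = []

def async_write(batch):
    """Start an asynchronous write, but block the caller
    (i.e. apply backpressure) once MAX_INFLIGHT writes are running."""
    inflight.acquire()        # producer stalls here when saturated
    def work():
        time.sleep(0.005)     # simulate filesystem latency
        with lock:
            done.append(batch)
        inflight.release()    # a slot frees up; producer may continue
    t = threading.Thread(target=work)
    workers.append(t)
    t.start()

for batch in range(32):
    async_write(batch)
for t in workers:
    t.join()
print(len(done))  # 32 -- all writes complete, never more than 8 at once
```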
Hi Sasha,
For case 2, I don't see why we need a back-pressure mechanism. Let's say
there is an IO thread. All we need is a queue with a defined capacity that
feeds data from the IO thread to the read task.
Supun.
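The bounded queue Supun describes is the classic producer/consumer pattern; a blocking `put()` on a full queue is itself the backpressure signal. A minimal Python sketch, with a made-up capacity and a doubling stand-in for the real work:

```python
import queue
import threading

# Hypothetical capacity: the IO thread blocks on put() when 4 batches
# are already buffered, which throttles reads automatically.
buf = queue.Queue(maxsize=4)
SENTINEL = None
results = []

def io_thread():
    for batch in range(10):        # pretend these batches come off disk
        buf.put(batch)             # blocks while the queue is full
    buf.put(SENTINEL)              # signal end of stream

def read_task():
    while True:
        batch = buf.get()
        if batch is SENTINEL:
            break
        results.append(batch * 2)  # pretend to transcode the batch

producer = threading.Thread(target=io_thread)
consumer = threading.Thread(target=read_task)
producer.start(); consumer.start()
producer.join(); consumer.join()
print(results)  # [0, 2, 4, 6, 8, 10, 12, 14, 16, 18]
```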
On Fri, May 20, 2022 at 8:25 PM Sasha Krassovsky wrote:
> Hi Supun,
> Roughly what ha
Hi Supun,
Roughly what happens now is #2. However, in your example, it may be the case
that we are reading CSV data from disk faster than we are transcoding it into
Parquet and writing it. Note that we attempt to use the full disk bandwidth and
assign batches to cores once the reads are done, s
Thanks, Weston. From your description, I can think about how the current
engine works. Let me try to map your example into execution. Then we can
explore a little bit more in detail.
We have 300 GB of data, with a Read CSV operator (that creates an
Arrow Table) and a Write Parquet operator.
If the amount of batch data you are processing is larger than the RAM
on the system, then backpressure is needed. A common use case is
dataset repartitioning. If you are repartitioning a large (e.g.
300GB) dataset from CSV to parquet then the bottleneck will typically
be the "write" stage. Backp
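Some back-of-envelope arithmetic (with made-up throughput numbers, purely illustrative) shows why the read side must be throttled when the write stage is the bottleneck: the difference between read and write rates accumulates in memory.

```python
# Illustrative numbers only -- not measured Acero throughput.
ram_gb = 16
read_gbps = 1.0     # hypothetical CSV read throughput (GB/s)
write_gbps = 0.25   # hypothetical Parquet encode+write throughput (GB/s)

# Without backpressure, buffered data grows at (read - write) GB/s.
accumulate_gbps = read_gbps - write_gbps
seconds_to_fill_ram = ram_gb / accumulate_gbps
print(round(seconds_to_fill_ram, 1))  # 21.3 -- RAM exhausted in seconds
```

With these numbers the process would exhaust 16 GB of RAM in roughly 21 seconds, long before a 300 GB repartition finishes, so the reader has to be paused whenever the writer falls behind.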
Looking at the proposal, I couldn't understand why there is a need for
back-pressure handling. My understanding of the Arrow C++ engine is that it
is meant to process batch data, so I couldn't see why we need to handle
back-pressure the way it is normally needed in streaming engines.
Best,
Supun.
Thank you for sharing this document.
Raphael Taylor-Davies is working on a similar exercise, scheduling
execution for DataFusion plans. The design doc [1] and initial PR [2] may be
an interesting reference.
In the DataFusion case we were trying to improve performance in a few ways:
1. Within a pip
Thanks Wes and Michal.
We have a similar concern about the current eager-push control flow with
time series / ordered data processing, and I am glad that we are not the
only ones thinking about this.
I have read the doc and so far just left some questions to make sure I
understand the proposal (admitte
I talked about these problems with my colleague Michal Nowakiewicz who
has been developing some of the C++ engine implementation over the
last year and a half, and he wrote up this document with some ideas
about task scheduling and control flow in the query engine for
everyone to look at and commen
Thanks for investigating and looking through this. Your understanding
of how things work is pretty much spot on. In addition, I think the
points you are making are valid. Our ExecNode/ExecPlan interfaces are
extremely bare bones and similar nodes have had to reimplement the
same solutions (e.g.
hi all,
I've been catching up on the C++ execution engine codebase after a
fairly long development hiatus.
I have several questions / comments about the current design of
ExecNode and its implementations (currently: source / scan, filter,
project, union, aggregate, sink, hash join).
My cur