Re: [C++][Acero] can Acero support distributed computation?

2023-07-06 Thread Sasha Krassovsky
Yes, what you’ve said is correct for Mean. But my point earlier is that there should only be a few of such special cases. A simple case would be e.g. Max, where Aggregate outputs Max and then merge outputs Max(Max). Sasha > 6 июля 2023 г., в 23:13, Jiangtao Peng написал(а): > >  > Sorry fo

Re: [C++][Acero] can Acero support distributed computation?

2023-07-06 Thread Jiangtao Peng
Sorry for my unclear expression. Take mean aggregation as example, does Aggregate "output" sum and count value, and Accumulate will "input" sum and count value, then "merge" sum(sum)/sum(count) as "output"? My point is how to implement Pre-Aggregation and Post-Aggregation using Acero. Best, Jia

Re: [C++][Acero] can Acero support distributed computation?

2023-07-06 Thread Sasha Krassovsky
Can you clarify what you mean by “data flow”? Each machine will be executing the same query plan. The query plan will contain an operator called Shuffle, and above the Shuffle will be an Aggregate, and above that will be an Accumulate node. The SourceNode will read data from disk on each machine

Re: [C++][Acero] can Acero support distributed computation?

2023-07-06 Thread Jiangtao Peng
Sure, distributed aggregations can be split into “compute” and “merge” two phase. But how about data flow of “compute” and “merge” on different nodes? Best, Jiangtao From: Sasha Krassovsky Date: Friday, July 7, 2023 at 11:07 AM To: user@arrow.apache.org Subject: Re: [C++][Acero] can Acero sup

Re: [C++][Acero] can Acero support distributed computation?

2023-07-06 Thread Sasha Krassovsky
Distributed aggregations have two phases: “compute” (each node does its own aggregation) and “merge” (the master merged the partial results). For most aggregates (e.g. Sum, Min, Max), merge is just the same as compute, except over the partial results. However, for Mean in particular, the compute

Re: [C++][Acero] can Acero support distributed computation?

2023-07-06 Thread Jiangtao Peng
Hi Sasha, Thanks for your reply! Maybe Shuffle node is enough for data distribution. How about aggregation node? For example, mean kernel with group will maintain sum and count value for each group. Mater node merging mean result needs sum and count value on each partition. But mean kernel see

Re: [C++][Acero] can Acero support distributed computation?

2023-07-06 Thread Sasha Krassovsky
Hi Jiangtao, Acero doesn’t support any distributed computation on its own. However, to get some simple distributed computation going it would be sufficient to add a Shuffle node. For example for Aggregation, the Shuffle would assign a range of hashes to each node, and then each node would hash-p

[C++][Acero] can Acero support distributed computation?

2023-07-06 Thread Jiangtao Peng
Hi there, I'm learning Acero streaming execution engine recently. And I’m wondering if Acero support distributed computing. I have read code about aggregation node and kernel; Aggregation kernel seems to hide the details of aggregation middle state. If use multiple nodes with Acero execution e

Re: [Go] Are builders and CSV readers goroutine safe

2023-07-06 Thread Bryce Mecum
Hi Gus, did you ever get an answer to your questions? >From a look at the source code, neither the CSV reader or builders look goroutine safe. However, your usage of the CSV reader above looks safe to me because 'record' gets copied into each goroutine invocation. Importantly, the builder would ne