Yes, when you use a Sink there is an assumption that there is a Node running that is consuming from the deque. Currently the Interpreter only runs one Node at a time, which means that the full output of that Node sits in a deque for a while.
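To illustrate the buffering described above, here is a minimal sketch in plain Java. It is not Calcite code: the `Sink` interface below is a hypothetical one-method stand-in, and `runPushStyle` and `runPullStyle` are names made up for this example. The sketch assumes a producer that pushes every row into an `ArrayDeque` before any consumer runs, and compares it with a pull-style iterator that hands over one row at a time.

```java
import java.util.ArrayDeque;
import java.util.Iterator;
import java.util.NoSuchElementException;
import java.util.Queue;

class SinkBufferingSketch {
    /** Hypothetical stand-in for a push-style sink; not the Calcite interface. */
    interface Sink {
        void send(int row);
    }

    /**
     * Push style: the producer node runs to completion before the consumer
     * node starts, so every row sits in the deque at the same time.
     * Returns the peak number of buffered rows.
     */
    static int runPushStyle(int rowCount) {
        Queue<Integer> deque = new ArrayDeque<>();
        Sink sink = deque::add;
        int peak = 0;
        for (int row = 0; row < rowCount; row++) {
            sink.send(row);                       // nothing is draining the deque yet
            peak = Math.max(peak, deque.size());
        }
        while (!deque.isEmpty()) {                // the consumer only runs now
            deque.remove();
        }
        return peak;
    }

    /**
     * Pull style: the consumer asks for one row at a time, so at most one
     * row is in flight. Returns the peak number of rows held at once.
     */
    static int runPullStyle(int rowCount) {
        Iterator<Integer> source = new Iterator<Integer>() {
            private int cursor = 0;

            @Override
            public boolean hasNext() {
                return cursor < rowCount;
            }

            @Override
            public Integer next() {
                if (!hasNext()) {
                    throw new NoSuchElementException();
                }
                return cursor++;
            }
        };
        int peak = 0;
        while (source.hasNext()) {
            int inFlight = 1;                     // the single row just pulled
            source.next();                        // consumed immediately, never queued
            peak = Math.max(peak, inFlight);
        }
        return peak;
    }

    public static void main(String[] args) {
        System.out.println("push-style peak buffer: " + runPushStyle(1000));
        System.out.println("pull-style peak buffer: " + runPullStyle(1000));
    }
}
```

In the push-style run the peak grows with the row count, which matches the memory growth and delayed first row reported in this thread; in the pull-style run it stays constant.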
Clearly the Interpreter has much room for improvement.

> On May 29, 2018, at 3:22 PM, Muhammad Gelbana <[email protected]> wrote:
>
> I found out what was consuming the memory and delaying the results at the
> same time. I was pushing all obtained rows from the datasource into a sink
> created by this method
> <https://github.com/apache/calcite/blob/27a190ff303700b4329384e05c39bc40c893048e/core/src/main/java/org/apache/calcite/interpreter/Compiler.java#L50>.
> Pushing rows into the sink halts execution of further nodes until all rows
> are fully loaded. I thought that, since the sink is backed by an
> "ArrayDeque", the rows would be consumed while being pushed to the sink.
>
> The other approach I applied was to use the "enumerable" method instead.
> This way, returned rows from my nodes are available for successive nodes
> without delay.
>
> Thank you all and thank you Julian for the Arrow adapter code.
>
> Thanks,
> Gelbana
>
> On Tue, May 29, 2018 at 5:50 PM, Julian Hyde <[email protected]> wrote:
>
>> I believe that scan, filter, project do not buffer; aggregate, join and
>> sort do buffer; join perhaps buffers a little more than it should.
>>
>> Read methods in EnumerableDefaults, for example EnumerableDefaults.join,
>> to see where a blocking collection is created and from which input.
>>
>> Ideally the operators would exploit sorted input (e.g. we could have an
>> aggregate that assumes input is sorted by the GROUP BY key and only buffers
>> records that have the same key) but Enumerable does not aim to be a
>> high-performance, scalable engine, so this never got prioritized.
>>
>> On a related note, I was pleased to see progress on an Arrow adapter and
>> convention in https://issues.apache.org/jira/browse/CALCITE-2173. If we
>> were to write a high-performance engine that scales across many threads,
>> it would be based on Arrow. So anyone with complaints about the
>> performance of Enumerable convention should start contributing to Arrow
>> convention!
>>
>> Julian
>>
>>
>>> On May 29, 2018, at 7:20 AM, Michael Mior <[email protected]> wrote:
>>>
>>> In theory it certainly should be possible to stream the results. This
>>> isn't guaranteed however. You would have to look at the entire query
>>> pipeline to see where things are being materialized. A full stack trace
>>> without elements removed would be a good start.
>>>
>>> --
>>> Michael Mior
>>> [email protected]
>>>
>>>
>>>
>>> On Mon, May 28, 2018 at 7:05 PM, Muhammad Gelbana <[email protected]>
>>> wrote:
>>>
>>>> I'm not sure if I phrased my question correctly so let me explain more.
>>>>
>>>> I'm running a (SELECT * FROM TABLE) query against a 50 million records
>>>> table (following the BINDABLE convention, so it sends its rows through
>>>> a "sink"). Since the extracted rows aren't processed in any way, I was
>>>> expecting that the output JDBC resultset would be able to enumerate
>>>> through all the results in a matter of seconds, but instead, my machine
>>>> didn't print anything. What exactly happens is that
>>>> (PreparedStatement.executeQuery) doesn't return a resultset promptly,
>>>> even after a few minutes have passed.
>>>>
>>>> I tried a table with hundreds of rows and my testing code printed those
>>>> results right away, so it's not something I missed there, but probably a
>>>> configuration I didn't set? Or maybe that's just how it is? Does anyone
>>>> else believe that the behaviour I expected is reasonable? It would also
>>>> lower the amount of memory consumed to hold the complete results before
>>>> bursting them to their final destination, if that's the case in the
>>>> first place.
>>>>
>>>>
>>>> Thanks,
>>>> Gelbana
>>>>
