1) Yes, that sounds correct. The file readers will read from files in parallel (even with a single file, they can read from row groups in parallel). There is no guarantee these reads will finish in order.
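If you need ordered output while reads complete out of order, one option is to tag each batch with an index when the read is issued and re-sequence on the consuming side. Below is a minimal sketch of that idea. `Resequencer` is a hypothetical helper, not an Arrow API, and it assumes you can attach a 0-based index to each batch yourself.

```cpp
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper (not part of Arrow): buffers items that arrive out of
// order and releases them strictly by sequence index, starting at 0.
template <typename T>
class Resequencer {
 public:
  // Accept an item tagged with the index assigned when its read was issued.
  // Returns every item that can now be emitted, in index order.
  std::vector<T> Push(std::int64_t index, T item) {
    pending_.emplace(index, std::move(item));
    std::vector<T> ready;
    auto it = pending_.find(next_);
    while (it != pending_.end()) {
      ready.push_back(std::move(it->second));
      pending_.erase(it);
      ++next_;
      it = pending_.find(next_);
    }
    return ready;
  }

  // Number of items still held back waiting for an earlier index.
  std::size_t pending() const { return pending_.size(); }

 private:
  std::int64_t next_ = 0;              // next index allowed to leave
  std::map<std::int64_t, T> pending_;  // items that arrived early
};
```

For example, pushing index 2, then 0, then 1 returns nothing, then the item for 0, then the items for 1 and 2 together. Note that the buffer in this sketch is unbounded, so it does not by itself address the memory footprint concern; a bounded variant would need to apply backpressure to the readers.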
2) Hmm, this one will work for now, because the executor==nullptr behavior is to borrow the I/O thread. So if each reader has its own I/O thread pool you will be OK. In the future, this should still work for executor==nullptr but you won't need to do this. Engine work will never be done on the I/O thread pool, so as long as the I/O thread pool is not the same as the serial execution thread you can block the execution thread to wait for input from I/O.

2a) Is your input in-memory? Or are you able to read from it quickly? If you never have to wait on I/O then the thread will never be relinquished. If your reads are slow I would expect to see the exec plan thread shift amongst different I/O threads (or even all share a single thread) as different sources complete.

2b) Answered in 2, I think.

> is this expected, or does it change w.r.t the
> build type? We are currently on release
> settings.

DCHECK (debug check) will be compiled out if you are compiling in release mode, so a failing DCHECK_GE does nothing in a release build. It lets you write liberal error checking without worrying too much about performance.

On Mon, Jul 25, 2022 at 5:21 PM Ivan Chau <ivan.m.c...@gmail.com> wrote:
>
> Hey all,
>
> While investigating the in-order behavior of the SourceNode, we found some
> interesting observations:
>
> 1) The ExecContext should use nullptr for its executor to guarantee any
> sequential behavior (as discussed previously). We found cases where our
> File BatchReader was reading out of order with a multi-threaded ExecContext.
> 2) Ideally, to manage our memory footprint (via bounded queues), we would
> like each of our inputs to belong to a single thread. This way, if
> something blocks, it does not impact reading the input needed to unblock it
> from another source.
> We found that using MakeReaderGenerator for our
> in-memory table sources (the basis for our file reader source node) allows
> us to do that by specifying an executor (separate thread pools) as a
> parameter, and also suggests the following conditions:
> 2a) Even when initialized with arrow::internal::GetCPUThreadPool(), it
> seems each source node is dedicated to its own thread. We are not sure why
> this is the case because of the shared nature of the pool, or if it is just
> a coincidence.
> 2b) Our initial implementation was creating separate memory pools with a
> capacity of one thread for each of our sources with MakeEternal, which has
> the same behaviors as 2a.
>
> As an additional question, we added an assertion to check for ordering with
> DCHECK_GE. I expected it to create some sort of Fatal exception when the
> condition was false, but this doesn't seem to happen -- is this expected,
> or does it change w.r.t the build type? We are currently on release
> settings.
>
> Ivan
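
To make the DCHECK answer above concrete, here is a minimal sketch of how a debug-only check typically behaves. `SKETCH_DCHECK_GE`, `kDebugChecksEnabled`, and `IsNonDecreasing` are hypothetical names for illustration; this is not Arrow's actual macro definition, only the general pattern keyed on `NDEBUG`.

```cpp
#include <cstddef>
#include <cstdio>
#include <cstdlib>
#include <vector>

// Hypothetical stand-in for a debug-only check; NOT Arrow's real macro.
#ifdef NDEBUG
// Release build: the check expands to nothing, so its operands are never
// evaluated and a violated condition goes unnoticed.
#define SKETCH_DCHECK_GE(a, b) ((void)0)
constexpr bool kDebugChecksEnabled = false;
#else
// Debug build: a violated condition prints a message and aborts.
#define SKETCH_DCHECK_GE(a, b)                                  \
  do {                                                          \
    if (!((a) >= (b))) {                                        \
      std::fprintf(stderr, "Check failed: %s >= %s\n", #a, #b); \
      std::abort();                                             \
    }                                                           \
  } while (0)
constexpr bool kDebugChecksEnabled = true;
#endif

// Ordering check that still reports a violation in release builds: the debug
// check aborts early in debug mode, and the explicit comparison returns false
// in every build type.
inline bool IsNonDecreasing(const std::vector<long>& timestamps) {
  for (std::size_t i = 1; i < timestamps.size(); ++i) {
    SKETCH_DCHECK_GE(timestamps[i], timestamps[i - 1]);
    if (timestamps[i] < timestamps[i - 1]) return false;
  }
  return true;
}
```

Compiled with `-DNDEBUG` (as release configurations typically are), the macro disappears and only the explicit comparison remains. If the ordering condition must be enforced in release builds, it needs an always-on check like that comparison, not a debug-only one.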