Thank you for your previous reply. I still have some questions I want to ask.
I found that the RecordBatchReader reads fewer rows at a time than each
row_group contains, meaning that a row_group needs to be read twice by
RecordBatchReader. So what is the default batch size for RecordBatchReader?
Also, do you have any good advice if I have to follow the row_group? I have a lot of
Following up here:
> N.B. The Voltron Data folks have a scheduling conflict on 4/27 and will not
> be able to host the fortnightly sync call. Is anyone available to run the
> meeting that day?
Is anyone available to run the sync call this Wednesday?
On Wed, Apr 13, 2022, at 12:59, David Li wrote:
Hello!
I am reading the use of TaskScheduler inside C++ compute code (reading hash
join) and have some questions about it, in particular:
(1) What is the purpose of the SchedulerTaskCallback defined here:
https://github.com/apache/arrow/blob/5a5d92928ccd438edf7ced8eae449fad05a7e71f/cpp/src/arrow/compute
Regarding TPC-H and widening, we can (and do currently for the one query we
have implemented) cast the decimal back down to the correct precision after
each multiplication, so I don’t think this is an issue. On the other hand,
there are definitely things we can do to dynamically detect if decima
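To make the cast-back-down idea concrete, here is a minimal C++ sketch (not taken from the TPC-H code itself; the column roles, precisions, and scales are illustrative) using the compute Cast API with allow_decimal_truncate:

#include <arrow/api.h>
#include <arrow/compute/api.h>

// Multiply two decimal columns and cast the widened result back down.
arrow::Result<arrow::Datum> MultiplyAndNarrow(const arrow::Datum& price,
                                              const arrow::Datum& discount) {
  // Decimal multiplication widens the type: decimal128(12, 2) * decimal128(12, 2)
  // comes back as decimal128(25, 4) under the usual SQL-style promotion rules.
  ARROW_ASSIGN_OR_RAISE(
      arrow::Datum wide,
      arrow::compute::CallFunction("multiply", {price, discount}));

  // Cast back to the precision/scale the query expects, allowing truncation.
  arrow::compute::CastOptions options;
  options.to_type = arrow::decimal128(12, 2);
  options.allow_decimal_truncate = true;
  return arrow::compute::Cast(wide, options);
}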
I think there are a couple of embedded / entangled questions here:
* Should Arrow be usable to *transport* narrow decimals, for the (now very
abundant) use cases where Arrow is being used as an internal wire protocol
or client/server interface?
* Should *compute engines* th
+1 (binding)
I agree with the comments on the PR that it would be good to better
explain what the "type name" is or give an example or reference in the
code comments
On Thu, Apr 21, 2022 at 11:49 AM José Almeida
wrote:
>
> +1 (non binding)
>
> On Thu, Apr 21, 2022 at 1:49 PM Rafael Telles wrote
I was going to reply to this e-mail thread on user@ but thought I
would start a new thread on dev@.
Executing user-defined functions in memory, especially untrusted
functions, in general is unsafe. For "trusted" functions, having an
in-memory API for writing them in user languages is very useful.
Sounds like a fantastic idea, and WASM seems a natural choice.
You get the ability to opt into IO if you want/need to, with WASI, but by
default
you can rest assured about worst-case consequences being contained.
On Mon, Apr 25, 2022 at 4:20 PM Wes McKinney wrote:
> I was going to reply to this
My vote: +1 (binding)
The vote passes with 4 binding +1 votes and 3 non-binding +1 votes. Thanks to
all who contributed.
I'll circle back on the PR comments before merging.
On Mon, Apr 25, 2022, at 15:43, Wes McKinney wrote:
> +1 (binding)
>
> I agree with the comments on the PR that it would b
On 25/04/2022 at 22:19, Wes McKinney wrote:
I was going to reply to this e-mail thread on user@ but thought I
would start a new thread on dev@.
Executing user-defined functions in memory, especially untrusted
functions, in general is unsafe. For "trusted" functions, having an
in-memory API f
The WebAssembly documentation has a rundown of the techniques used:
https://webassembly.org/docs/security/
I think usually you would run WASM in-process, though we could indeed also put
it in a subprocess to further isolate things.
It would be interesting to define the Flight "harness" protocol
I'm guessing that the default batch size is 65536 rows (64 * 1024) [1].
I don't have any advice on this at the moment, I haven't looked through the
dataset interface very much.
If you're using Scanner::ToTable, then there's a note that ToTable "fully
materializes the Scan result in memory" first
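For reference, here is a minimal C++ sketch of two ways to keep batches aligned with row groups (this assumes the parquet::arrow reader API; the file path and batch size are illustrative): either raise the Arrow batch size above the 64 * 1024 default via ArrowReaderProperties, or read row group by row group with ReadRowGroup.

#include <arrow/api.h>
#include <arrow/io/api.h>
#include <parquet/arrow/reader.h>

arrow::Status ReadAlignedToRowGroups(const std::string& path) {
  ARROW_ASSIGN_OR_RAISE(auto input, arrow::io::ReadableFile::Open(path));

  // Raise the batch size above the 64 * 1024 default so a single batch can
  // cover a whole row group (pick a value >= your largest row group).
  parquet::ArrowReaderProperties properties = parquet::default_arrow_reader_properties();
  properties.set_batch_size(1024 * 1024);

  parquet::arrow::FileReaderBuilder builder;
  ARROW_RETURN_NOT_OK(builder.Open(input));
  std::unique_ptr<parquet::arrow::FileReader> reader;
  ARROW_RETURN_NOT_OK(builder.properties(properties)->Build(&reader));

  // Alternatively, follow row group boundaries explicitly.
  for (int i = 0; i < reader->num_row_groups(); ++i) {
    std::shared_ptr<arrow::Table> row_group_table;
    ARROW_RETURN_NOT_OK(reader->ReadRowGroup(i, &row_group_table));
    // ... process one row group's worth of data ...
  }
  return arrow::Status::OK();
}

Note that ReadRowGroup materializes a whole row group as a Table, so memory use scales with your row group size.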
> I found that the RecordBatchReader reads fewer
> rows at a time than each row_group contains, meaning
> that a row_group needs to be read twice by
> RecordBatchReader. So what is the default batch size
> for RecordBatchReader?
There are a few different places a row group could get fragmented. I
Hi Li,
I’ll answer the questions in order:
1. Your guess is correct! The Hash Join may be used standalone (mostly in
testing or benchmarking for now) or as part of the ExecNode. The ExecNode will
pass the task to the Executor to be scheduled, or will run it immediately if
it’s in sync mode (i.e
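As a rough illustration of that "hand it to the Executor, or run it immediately in sync mode" behaviour (a hedged sketch, not the actual ExecNode code; the RunTask helper is made up for illustration):

#include <functional>
#include <arrow/status.h>
#include <arrow/util/thread_pool.h>

// Run a task either on the provided executor (async mode) or inline on the
// calling thread (sync mode, i.e. no executor available).
arrow::Status RunTask(arrow::internal::Executor* executor,
                      std::function<arrow::Status()> task) {
  if (executor == nullptr) {
    return task();  // sync mode: just run it here
  }
  // async mode: hand the work off to the executor's thread pool
  return executor->Spawn([task]() {
    arrow::Status st = task();
    if (!st.ok()) {
      // real code would propagate this through the scheduler's abort path
    }
  });
}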
Thanks! That's super helpful.
A follow-up question on TaskScheduler: what's the correct way to define a
task that "does work if input batches are ready, otherwise tries later"?
Something like:

Status try_process():
  if enough_inputs_to_produce_next_output:
    compute_and_produce_next_output();
Thanks Sasha, your intuition on the SerialExecutor is correct. One of
the changes I am working on [1] will make it so that an executor is
always present. The behavior when you do not have an executor is
rather strange (sometimes I/O threads are used and sometimes the
calling thread is used) and th
If I understand correctly, on InputReceived you’ll be accumulating batches
until you have enough to compute the next output? In that case, you have two
options: you can either just immediately compute it using the same thread, or
call the schedule_callback directly (not using the scheduler). I t
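A hedged sketch of that accumulate-then-produce pattern (self-contained, not Arrow's actual TaskScheduler/ExecNode API; the Accumulator class and the batches_needed threshold are illustrative):

#include <mutex>
#include <utility>
#include <vector>
#include <arrow/compute/exec.h>
#include <arrow/status.h>

class Accumulator {
 public:
  explicit Accumulator(size_t batches_needed) : batches_needed_(batches_needed) {}

  // Called from InputReceived.  Returns true when the caller should produce the
  // next output, either inline on the same thread or via a scheduled task.
  bool InputReceived(arrow::compute::ExecBatch batch) {
    std::lock_guard<std::mutex> lock(mutex_);
    pending_.push_back(std::move(batch));
    return pending_.size() >= batches_needed_;
  }

  arrow::Status ProduceNextOutput() {
    std::vector<arrow::compute::ExecBatch> ready;
    {
      std::lock_guard<std::mutex> lock(mutex_);
      ready.swap(pending_);
    }
    // ... combine `ready` into the next output batch and hand it downstream ...
    return arrow::Status::OK();
  }

 private:
  std::mutex mutex_;
  std::vector<arrow::compute::ExecBatch> pending_;
  size_t batches_needed_;
};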
I think there is a certain amount of tricky "package management"
involved with such a harness. For example, if I want to build my UDF
on top of tensorflow then I would need a version of the tensorflow C
libs that has been compiled to WASM and (potentially) language
runtimes for whatever language u
Hi everyone,
Sorry if some of this is out of place or not in the right dev email
structure. I've only recently started getting into the arrow dev stuff.
*Summary*: I'm interested in improving the API and functional
documentation, especially for the pyarrow compute functions as I've been
doing som