construct dataset for s3 by ParquetDatasetFactory failed

2022-04-25 Thread 1057445597
Thank you for your previous reply. I still have a question: I found that the RecordBatchReader reads fewer rows at a time than each row_group contains, meaning that a row_group needs to be read twice by RecordBatchReader. So what is the default batch size for RecordBatchReader?

what is the default batch size of the RecordBatchReader?

2022-04-25 Thread 1057445597
I found that the RecordBatchReader reads fewer rows at a time than each row_group contains, meaning that a row_group needs to be read twice by RecordBatchReader. So what is the default batch size for RecordBatchReader?  Also, any good advice if I have to follow the row_group? I have a lot of

Re: Arrow sync call April 13 at 12:00 US/Eastern, 16:00 UTC

2022-04-25 Thread David Li
Following up here: > N.B. The Voltron Data folks have a scheduling conflict on 4/27 and will not > be able to host the fortnightly sync call. Is anyone available to run the > meeting that day? Is anyone available to run the sync call this Wednesday? On Wed, Apr 13, 2022, at 12:59, David Li wro

[Compute][C++] Question on compute scheduler

2022-04-25 Thread Li Jin
Hello! I am reading about how TaskScheduler is used inside the C++ compute code (specifically hash join) and have some questions about it, in particular: (1) What is the purpose of SchedulerTaskCallback defined here: https://github.com/apache/arrow/blob/5a5d92928ccd438edf7ced8eae449fad05a7e71f/cpp/src/arrow/compute

Re: [Discuss][Format] Add 32-bit and 64-bit Decimals

2022-04-25 Thread Sasha Krassovsky
Regarding TPC-H and widening, we can (and do currently for the one query we have implemented) cast the decimal back down to the correct precision after each multiplication, so I don’t think this is an issue. On the other hand, there are definitely things we can do to dynamically detect if decima
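
The widen-then-cast-down pattern mentioned in this message can be illustrated with a minimal sketch. It uses Python's standard `decimal` module as a stand-in, not Arrow's decimal types or the TPC-H query code; the values and the target scale of 2 are assumptions chosen only for illustration.

```python
from decimal import Decimal

# Two fixed-point values, each with scale 2 (illustrative values).
price = Decimal("12.34")
discount = Decimal("0.05")

# Multiplication adds the scales, so the exact product carries scale 4.
product = price * discount

# Cast back down to the intended scale of 2 after the multiplication,
# so repeated multiplications do not keep widening the result.
narrowed = product.quantize(Decimal("0.01"))

print(product, narrowed)
```

Note that the cast rounds the value, so whether narrowing after every multiplication is acceptable depends on the precision the query requires.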

Re: [Discuss][Format] Add 32-bit and 64-bit Decimals

2022-04-25 Thread Wes McKinney
I think there are a couple of embedded / entangled questions here: * Should Arrow be able to be used to *transport* narrow decimals, for the (now very abundant) use cases where Arrow is being used as an internal wire protocol or client/server interface * Should *compute engines* th

Re: [VOTE] Extend Arrow Flight SQL with more SQL type info in schemas

2022-04-25 Thread Wes McKinney
+1 (binding) I agree with the comments on the PR that it would be good to better explain what the "type name" is or give an example or reference in the code comments On Thu, Apr 21, 2022 at 11:49 AM José Almeida wrote: > > +1 (non binding) > > On Thu, Apr 21, 2022 at 1:49 PM Rafael Telles wrote

Designing standards for "sandboxed" Arrow user-defined functions [was Re: User defined "Arrow Compute Function"]

2022-04-25 Thread Wes McKinney
I was going to reply to this e-mail thread on user@ but thought I would start a new thread on dev@. Executing user-defined functions in memory, especially untrusted functions, in general is unsafe. For "trusted" functions, having an in-memory API for writing them in user languages is very useful.

Re: Designing standards for "sandboxed" Arrow user-defined functions [was Re: User defined "Arrow Compute Function"]

2022-04-25 Thread Gavin Ray
Sounds like a fantastic idea, and WASM seems a natural choice. You get the ability to opt into IO if you want/need to, with WASI, but by default you can rest assured about worst-case consequences being contained. On Mon, Apr 25, 2022 at 4:20 PM Wes McKinney wrote: > I was going to reply to this

Re: [VOTE] Extend Arrow Flight SQL with more SQL type info in schemas

2022-04-25 Thread David Li
My vote: +1 (binding) The vote passes with 4 binding +1 votes and 3 non-binding +1 votes. Thanks to all who contributed. I'll circle back on the PR comments before merging. On Mon, Apr 25, 2022, at 15:43, Wes McKinney wrote: > +1 (binding) > > I agree with the comments on the PR that it would b

Re: Designing standards for "sandboxed" Arrow user-defined functions [was Re: User defined "Arrow Compute Function"]

2022-04-25 Thread Antoine Pitrou
On 25/04/2022 at 22:19, Wes McKinney wrote: I was going to reply to this e-mail thread on user@ but thought I would start a new thread on dev@. Executing user-defined functions in memory, especially untrusted functions, in general is unsafe. For "trusted" functions, having an in-memory API f

Re: Designing standards for "sandboxed" Arrow user-defined functions [was Re: User defined "Arrow Compute Function"]

2022-04-25 Thread David Li
The WebAssembly documentation has a rundown of the techniques used: https://webassembly.org/docs/security/ I think usually you would run WASM in-process, though we could indeed also put it in a subprocess to further isolate things. It would be interesting to define the Flight "harness" protocol

Re: what is the default batch size of the RecordBatchReader?

2022-04-25 Thread Aldrin
I'm guessing that the default batch size is 65536 rows (64 * 1024) [1]. I don't have any advice on this at the moment, I haven't looked through the dataset interface very much. If you're using Scanner::ToTable, then there's a note that ToTable "fully materializes the Scan result in memory" first
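
To see why a row group can come back as more than one batch, here is a small arithmetic sketch. It takes the 65536-row batch size guessed in this reply as an assumption; `split_row_group` is a hypothetical helper written for illustration, not part of the Arrow API, and the 100,000-row row group is an invented example.

```python
def split_row_group(num_rows, batch_size=64 * 1024):
    """Return the sizes of the batches that one row group is cut into."""
    sizes = []
    remaining = num_rows
    while remaining > 0:
        take = min(batch_size, remaining)  # never read past the row group
        sizes.append(take)
        remaining -= take
    return sizes


# A row group larger than the batch size is delivered in two pieces.
print(split_row_group(100_000))  # [65536, 34464]
```

Under this assumption, any row group with more than 65536 rows is split, while a smaller row group fits in a single batch.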

Re: what is the default batch size of the RecordBatchReader?

2022-04-25 Thread Weston Pace
> I found that the RecordBatchReader reads fewer > rows at a time than each row_group contains, meaning > that a row_group needs to be read twice by > RecordBatchReader. So what is the default batch size > for RecordBatchReader? There are a few different places a row group could get fragmented. I

Re: [Compute][C++] Question on compute scheduler

2022-04-25 Thread Sasha Krassovsky
Hi Li, I’ll answer the questions in order: 1. Your guess is correct! The Hash Join may be used standalone (mostly in testing or benchmarking for now) or as part of the ExecNode. The ExecNode will pass the task to the Executor to be scheduled, or will run it immediately if it’s in sync mode (i.e

Re: [Compute][C++] Question on compute scheduler

2022-04-25 Thread Li Jin
Thanks! That's super helpful. A follow-up question on TaskScheduler: what's the correct way to define a task that "does work if input batches are ready, otherwise tries later"? Something like Status try_process(): if enough_inputs_to_produce_next_output: compute_and_produce_next_output();

Re: [Compute][C++] Question on compute scheduler

2022-04-25 Thread Weston Pace
Thanks Sasha, your intuition on the SerialExecutor is correct. One of the changes I am working on[1] will make it so that an executor is always present. The behavior when you do not have an executor is rather strange (sometimes I/O threads are used and sometimes the calling thread is used) and th

Re: [Compute][C++] Question on compute scheduler

2022-04-25 Thread Sasha Krassovsky
If I understand correctly, on InputReceived you’ll be accumulating batches until you have enough to compute the next output? In that case, you have two options: you can either just immediately compute it using the same thread, or call the schedule_callback directly (not using the scheduler). I t
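
The first option described here (accumulate on input, then compute immediately on the same thread) can be sketched roughly as follows. This is an illustration in Python rather than the C++ compute code, and every name in it (`BatchAccumulator`, `input_received`, `rows_needed`, `on_output`) is hypothetical, not Arrow's ExecNode or TaskScheduler API.

```python
class BatchAccumulator:
    """Buffers incoming rows and emits an output once enough have arrived."""

    def __init__(self, rows_needed, on_output):
        self.rows_needed = rows_needed  # rows required per output
        self.on_output = on_output      # callback invoked on the caller's thread
        self.pending = []               # rows received but not yet consumed

    def input_received(self, batch):
        # Accumulate the new batch, then produce as many outputs as possible
        # right away instead of scheduling a "try again later" task.
        self.pending.extend(batch)
        while len(self.pending) >= self.rows_needed:
            chunk = self.pending[:self.rows_needed]
            self.pending = self.pending[self.rows_needed:]
            self.on_output(chunk)


outputs = []
acc = BatchAccumulator(rows_needed=4, on_output=outputs.append)
acc.input_received([1, 2, 3])     # not enough rows yet, nothing emitted
acc.input_received([4, 5])        # emits [1, 2, 3, 4]
acc.input_received([6, 7, 8, 9])  # emits [5, 6, 7, 8]; 9 stays pending
print(outputs)
```

Because the work is triggered by the arrival of input, no polling task is needed; a real multi-threaded node would additionally need to guard the buffer with a lock.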

Re: Designing standards for "sandboxed" Arrow user-defined functions [was Re: User defined "Arrow Compute Function"]

2022-04-25 Thread Weston Pace
I think there is a certain amount of tricky "package management" involved with such a harness. For example, if I want to build my UDF on top of tensorflow then I would need a version of the tensorflow C libs that has been compiled to WASM and (potentially) language runtimes for whatever language u

[Python] [Docs] Framework to override docs for pyarrow.compute functions using native reStructured Text (?)

2022-04-25 Thread Kevin Crouse
Hi everyone, Sorry if some of this is out of place or not in the right dev email structure. I've only recently started getting into the arrow dev stuff. *Summary*: I'm interested in improving the API and functional documentation, especially for the pyarrow compute functions as I've been doing som