Re: RecordBatchFileWriter with DictionaryType: Making sure the dictionary stays the same

2022-05-31 Thread Weston Pace
I don't think you are missing anything. The parquet encoding is baked into the data on the disk so re-encoding at some stage is inevitable. Re-encoding in python like you are doing is going to be inefficient. I think you will want to do the re-encoding in C++. Unfortunately, I don't think we have

Re: [C++] Adding Run-Length Encoding to Arrow

2022-05-31 Thread Weston Pace
> I don't think replacing Scalar compute paths with dedicated paths for > RLE-encoded data would ever be a simplification. Also, when a kernel > hasn't been upgraded with a native path for RLE data, former Scalar > Datums would now be expanded to the full RLE-decoded version before > running the ke

[C++] Kernel function registry evolution

2022-06-01 Thread Weston Pace
We've had some evidence for a while now that the kernel functions suffer from an overhead problem that prevents us from effectively utilizing cache. The latest and greatest evidence of this might be [1]. A number of people have made some very interesting suggestions that I think could really cut

Re: [C++] Kernel function registry evolution

2022-06-03 Thread Weston Pace
h, I can try to have a first draft > PR ready to go maybe by Monday (I was going to work on this over the > weekend when I can have some uninterrupted time to do the > refactoring). I'm not sure that a new registry is going to be needed > > On Thu, Jun 2, 2022 at 2:50 AM Anto

Re: data-source UDFs

2022-06-03 Thread Weston Pace
Efficiently reading from a data source is something that has a bit of complexity (parsing files, connecting to remote data sources, managing parallel reads, etc.) Ideally we don't want users to have to reinvent these things as they go. The datasets module in Arrow-C++ has a lot of code here alrea

Re: data-source UDFs

2022-06-04 Thread Weston Pace
't quite well-defined enough to be > meaningfully integrated (except perhaps via a generic "stream of batches" > entrypoint), and even if we wanted to feed JDBC/ODBC into an ExecPlan, we'd > have to do some work that would look roughly like writing an ADBC driver, so >

Re: data-source UDFs

2022-06-06 Thread Weston Pace
then the design would need some other registry or mechanism for > passing the deserialized data source-UDF to the execution plan. > 5. The data-source UDF is specific to an execution plan, so definitely > specific to the user who created the Substrait plan in which it is embedded. > U

Re: Apache Arrow development using CLion

2022-06-07 Thread Weston Pace
I tried to use CLion for a little while with mixed results. CLion integrates well with cmake. However, CLion seems to rely heavily on clang-tidy and I was unable to configure clang-tidy in such a way that it ran reasonably quickly. I think part of the problem is that CLion wanted to use all of m

Re: Custom default C++ memory pool on Linux, and/or interception/auditing of system pool

2022-06-14 Thread Weston Pace
I can try and give a more detailed answer later in the week but the gist of it is that Arrow manages all "buffer allocations" with a memory pool. These are the allocations for the actual data in the arrays. These are the allocations that use the memory pool configured by ARROW_DEFAULT_MEMORY_POOL

Re: Custom default C++ memory pool on Linux, and/or interception/auditing of system pool

2022-06-14 Thread Weston Pace
"--with-private-namespace=je_arrow_private_" "--without-export" "--disable-shared" # Don't override operator new() "--disable-cxx" "--disable-libdl" # See https://github.com/jemalloc/jemalloc/issues/1237 "--disable-initial-exec-tls" ${EP_

Re: [ANNOUNCE] New Arrow committers: Dewey Dunnington, Alenka Frim, and Rok Mihevc

2022-06-22 Thread Weston Pace
Congratulations all! On Wed, Jun 22, 2022, 10:27 AM Dragoș Moldovan-Grünfeld < dragos.m...@gmail.com> wrote: > Congratulations! > > Sent from my iPhone > > > On 22 Jun 2022, at 18:13, Neal Richardson > wrote: > > > > On behalf of the Arrow PMC, I'm happy to announce that > > > > Dewey Dunningto

Re: user-defined Python-based data-sources in Arrow

2022-06-23 Thread Weston Pace
This seems reasonable to me. A very similar interface is the RecordBatchReader[1] which is roughly (glossing over details)...

```
class RecordBatchReader {
  virtual std::shared_ptr<Schema> schema() const = 0;
  virtual Result<std::shared_ptr<RecordBatch>> Next() = 0;
  virtual Status Close() = 0;
};
```

This seems pretty close to

Re: user-defined Python-based data-sources in Arrow

2022-06-24 Thread Weston Pace
, and then the reentrancy problem is moot since > no parallel-access occurs. OTOH, if the Python-based data-source can be > accessed in parallel, the above sorting-queue solution is better suited and > would avoid the reentrancy problem of a ReadNext function. > > > Yaron. >

Re: [C++] Kernel function registry evolution

2022-06-29 Thread Weston Pace
This is only for the situation where ALL inputs and outputs are scalar. Scalars, at the kernel level, do not have length. So in this case there is nothing to repeat. It does build a buffer, but just with a single value, so it is all O(1). On Wed, Jun 29, 2022 at 9:49 AM Antoine Pitrou wrote: >

Re: accessing Substrait protobuf Python classes from PyArrow

2022-07-01 Thread Weston Pace
Given that Acero does not do any planner / optimizer type tasks I'm not sure you will find anything like this in arrow-cpp or pyarrow. What you are describing I sometimes refer to as "plan slicing and dicing". I have wondered if we will someday need this in Acero but I fear it is a slippery slope

Re: accessing Substrait protobuf Python classes from PyArrow

2022-07-05 Thread Weston Pace
he source > > > code > > > >> for "substrait" from ` > > > >> > > > > > https://github.com/substrait-io/substrait/archive/${ARROW_SUBSTRAIT_BUILD_VERSION}.tar.gz > > > >> ` where `ARROW_SUBSTRAIT_BUILD_VERSION` is set in > >

Re: [ARROW C++] ARROW_LOG

2022-07-05 Thread Weston Pace
At the moment that log is used primarily for Arrow developers and is not likely to be terribly useful beyond that. It is not, as far as I know, very extensible. I think you can only configure it to log to stderr or to a single file. However, it could be made extensible if someone were motivated

Re: accessing Substrait protobuf Python classes from PyArrow

2022-07-06 Thread Weston Pace
ython object, which is more convenient to manipulate from > Python, after unpickling in from a field in the Substrait plan. It's just > read-only access to the field from Python, but still needs access to the > Substrait protobuf Python classes. This case was mentioned in my previous

Re: Adding cpp memory profiling to Arrow

2022-07-06 Thread Weston Pace
Memory profiling would be very helpful. Thanks for looking into this. A few thoughts: * Peak allocation is an important number for many users. One major goal for Acero is to get to a point where it can constrain peak allocation to a preconfigured amount for a single query. We are close but not

Re: Proposal: Unassign idle issues

2022-07-08 Thread Weston Pace
+1 (I'm assuming, as Neal described, I can just reassign the issue to myself and it won't confuse the assignment bot) On Fri, Jul 8, 2022 at 8:29 AM Jacob Wujciak wrote: > > I support this idea and a 90 days threshold seems good to me! > > On Fri, Jul 8, 2022 at 8:02 PM Neal Richardson > wrote:

Re: cpp Memory Pool Clarification

2022-07-11 Thread Weston Pace
Are you changing the default memory pool to a LoggingMemoryPool? Where are you doing this? For a benchmark I think you would need to change the implementation in the benchmark file itself. Similarly, is AsofJoinNode using the default memory pool or the memory pool of the exec plan? It should be

Re: cpp Memory Pool Clarification

2022-07-11 Thread Weston Pace
also expect to see some allocations from > TableSourceNode going through the logging memory pool, even if AsOfJoinNode > was using the default memory pool instead of the Exec Plan's pool, but I am > not seeing anything come through... > > -Original Message- >

Re: cpp Memory Pool Clarification

2022-07-11 Thread Weston Pace
end to end benchmark of "scan - join - write" I think would make sense to > include all arrow memory allocation (if that makes sense) > > On Mon, Jul 11, 2022 at 4:37 PM Weston Pace wrote: > > > > Is there anything else I'd need to change?

Re: Substrait vs GraphQL

2022-07-11 Thread Weston Pace
This might be an interesting topic for the Substrait community. You can find ways to contact them at [1]. I don't know GraphQL well enough but from what I do know it seems like a GraphQL -> Substrait converter would be useful, at the very least. [1] https://substrait.io/community/ On Mon, Jul 1

Re: cpp Memory Pool Clarification

2022-07-12 Thread Weston Pace
er and > MakeReaderGenerator) to generate for a regular source node. > > -Original Message- > From: Weston Pace > Sent: Monday, July 11, 2022 4:37 PM > To: dev@arrow.apache.org > Subject: Re: cpp Memory Pool Clarification > > > Is there anything else I

Re: cpp: Debugging 'plan destruction before finishing'

2022-07-14 Thread Weston Pace
> After some quick debugging, I found that the asof node's StopProducing (a condition necessary to finish the plan) is called shortly after the error output. StopProducing should probably more accurately be named "Abort" or "StopRightNow". If you run the plan to completion normally I do not be

Re: ExecutionContext, batch ordering clarification

2022-07-19 Thread Weston Pace
If you are using a source node (which it appears you are) then it will be creating new thread tasks for each batch. So, in theory, these could get out of order. My guess is that the file reader is slow enough that by the time you load batch N from disk and decode it, you have a pretty good chance
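
A minimal sketch of how out-of-order batches could be put back in sequence, assuming each batch carries the index at which it was read. This is not Acero's implementation; `SequencingQueue`, `push`, and the batch labels are hypothetical names used only for illustration.

```python
import heapq


class SequencingQueue:
    """Re-emits items in index order even if they arrive out of order."""

    def __init__(self):
        self._heap = []        # (index, item) pairs waiting for their turn
        self._next_index = 0   # index of the next item allowed out

    def push(self, index, item):
        """Accept one item and return the items that are now in order."""
        heapq.heappush(self._heap, (index, item))
        ready = []
        # Drain the heap for as long as its smallest index is the expected one.
        while self._heap and self._heap[0][0] == self._next_index:
            ready.append(heapq.heappop(self._heap)[1])
            self._next_index += 1
        return ready


queue = SequencingQueue()
emitted = []
# Batches 1 and 3 finish before batches 0 and 2, as parallel tasks might.
for index, batch in [(1, "batch-1"), (0, "batch-0"), (3, "batch-3"), (2, "batch-2")]:
    emitted.extend(queue.push(index, batch))
```

The trade-off is that early arrivals are held in memory until the gap before them is filled, so a slow batch delays everything behind it.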

Re: [C++] ResumeProducing Future Causing Blocking

2022-07-20 Thread Weston Pace
> 4) control is not returned to the processing thread Yes, it looks like the current implementation does not return control to the processing thread, but I think this is correct, or at least "as designed". The thread will be used to continue iterating the source. > control is not returned to the

Re: [C++] ResumeProducing Future Causing Blocking

2022-07-21 Thread Weston Pace
> > Would the new first class support in the scheduler be something similar to > what's available currently in BackpressureMonitor? We are looking to > implement some more custom backpressure schemes that depend on batch > ordering/completion rather than memory size. >

Re: [C++] ResumeProducing Future Causing Blocking

2022-07-22 Thread Weston Pace
22 at 12:32 PM Ivan Chau wrote: > > Hi Weston, > > Not sure if the diagrams came through here -- is there some other place I > need to view them? > > Ivan > > -----Original Message- > From: Weston Pace > Sent: Thursday, July 21, 2022 10:59 PM > T

Re: [C++] ResumeProducing Future Causing Blocking

2022-07-22 Thread Weston Pace
aring > > On Fri, Jul 22, 2022 at 12:32 PM Ivan Chau wrote: > > > > Hi Weston, > > > > Not sure if the diagrams came through here -- is there some other place I > > need to view them? > > > > Ivan > > > > -Original Message- > >

Re: [C++] Control flow and scheduling in C++ Engine operators / exec nodes

2022-07-25 Thread Weston Pace
presures are handled in Acero, I am curious if there has been > any more progress on this since May or any future plans? > > Thanks, > Li > > On Mon, May 23, 2022 at 10:37 PM Weston Pace wrote: > > > > About point 2. I have previously seen the pipeline prioritizatio

Re: [C++] MakeReaderGenerator Behavior using GetCPUThreadPool

2022-07-25 Thread Weston Pace
1) Yes, that sounds correct. The file readers will read from files in parallel (even if there is one file it can read from row groups in parallel). There is no guarantee these reads will finish sequentially. 2) Hmm, this one will work for now, because the executor==nullptr behavior is to borrow

Re: [ARROW-17255] Logical JSON type in Arrow

2022-08-03 Thread Weston Pace
I think, from a compute perspective, one would just cast before doing anything. So you wouldn't need much beyond parse and unparse. For example, if you have a JSON document and you want to know the largest value of $.weather.temperature then you could do... MAX(STRUCT_FIELD(PARSE_JSON("json_col"
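
The "parse, pull out a field, then aggregate" idea can be sketched in plain Python with the standard `json` module. This only illustrates the shape of the computation; it is not an Arrow compute function, and `max_json_field` and the sample documents are hypothetical.

```python
import json


def max_json_field(documents, path):
    """Return the largest value found at `path` (a list of keys), or None."""
    values = []
    for text in documents:
        node = json.loads(text)              # the "parse" step
        for key in path:                     # the "struct field" step, one key per level
            if not isinstance(node, dict) or key not in node:
                node = None                  # treat a missing field as null
                break
            node = node[key]
        if node is not None:
            values.append(node)
    return max(values) if values else None   # the "max" step, ignoring nulls


json_col = [
    '{"weather": {"temperature": 21.5}}',
    '{"weather": {"temperature": 30.0}}',
    '{"weather": {}}',
]
hottest = max_json_field(json_col, ["weather", "temperature"])
```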

Re: Fatal Python error for process exit after opening Pyarrow batch iterator

2022-08-10 Thread Weston Pace
I'm not sure of the exact error you are getting but I suspect this may be related to something I am currently working on[1]. I can reproduce it fairly easily without GCS: ``` import pyarrow as pa import pyarrow.dataset as ds my_dataset = ds.dataset(['/some/big/file.csv'], format='csv') batch_ite

Re: Parquet reader memory usage

2022-08-11 Thread Weston Pace
Just a few additional thoughts: > at least as measured by > the memory pools max_memory() method. The parquet reader does a fair amount of allocation on the global system allocator (i.e. not using a memory pool). Typically this should be small in comparison with the data buffers themselves (whic

Re: dealing with tester timeout in a CI job

2022-08-17 Thread Weston Pace
My first suspicion on a test timeout is usually a deadlock. That being said, I haven't looked at this test / change in any real detail so I don't know if that's the case here. How long does the test take to run locally? Second, I would try and remove sleeps, and make sure to use the utilities Sl

Re: DISCUSS: [C++] Switch to C++17

2022-08-17 Thread Weston Pace
+1. I'm very much in favor of upgrading to C++17. I am lucky to often get to work with people that are new to the Arrow C++ code base and a common feedback is that the code is quite complex. While I do not think moving to C++17 will solve this problem by itself I'm pretty confident that being ab

Re: dealing with tester timeout in a CI job

2022-08-17 Thread Weston Pace
I agree can be reduced by sampling. Could you > > explain how to use SCOPED_TEST, or refer to documentation about it? I > > understand your idea, just looking for an example use of SCOPED_TEST. > > > > > > Yaron. > > &g


Re: DISCUSS: [C++] Switch to C++17

2022-08-17 Thread Weston Pace
or if they wanted to use newer features (which could be an incentive to upgrade their R version). On Wed, Aug 17, 2022 at 4:30 AM Weston Pace wrote: > > +1. I'm very much in favor of upgrading to C++17. I am lucky to > often get to work with people that are new to the Arrow C++

Re: DISCUSS: [C++] Switch to C++17

2022-08-17 Thread Weston Pace
> Any particular reason why this should be 10.0 and not 9.0 for example? (is due to an incoming feature of note?) No. I only said 10.0 because Neal's tactical suggestion earlier in this thread would mean that 10.0 would be the last build that had C++11 support. If we choose not to follow that sug

Re: [VOTE] C++: switch to C++17

2022-08-24 Thread Weston Pace
+1 (non-binding) On Wed, Aug 24, 2022 at 9:24 AM Keith Kraus wrote: > > +1 (non-binding) > > On Wed, Aug 24, 2022 at 12:12 PM David Li wrote: > > > +1 (binding) > > > > On Wed, Aug 24, 2022, at 12:06, Ivan Ogasawara wrote: > > > +1 (non-binding) > > > > > > On Wed, Aug 24, 2022 at 12:00 PM Sasha

Re: Using Acero in a distributed environment

2022-08-24 Thread Weston Pace
I don't know of any work being done to turn Acero into a distributed query engine. However, I would hope that Acero can be used in a distributed query engine, and would be a useful component. If there are features that Acero would need in this environment (e.g. some kind of exec node for speciali

Re: [VOTE] Format: Rules and procedures for Canonical extension types

2022-08-24 Thread Weston Pace
+1 (non-binding). This is maybe implied but I would add that modification of extension types must also require a vote and should be backwards compatible. Furthermore, extension types (particularly those with extensive parameterization/serialization) should discuss how future additions would be mad

Re: Usage of the name Feather?

2022-08-29 Thread Weston Pace
I agree as well. I think most lingering uses of the term "feather" are in pyarrow and R however, so it might be good to hear from some of those maintainers. On Mon, Aug 29, 2022 at 9:35 AM Antoine Pitrou wrote: > > > I agree with this as well. > > Regards > > Antoine. > > > On Mon, 29 Aug 2022

Re: [ANNOUNCE] New Arrow PMC member: L. C. Hsieh

2022-09-04 Thread Weston Pace
Congratulations! On Sun, Sep 4, 2022 at 5:04 AM Andy Grove wrote: > > Congratulations, L. C.! > > On Sun, Sep 4, 2022 at 8:09 AM Wang Xudong wrote: > > > Congrats! > > > > David Li 于2022年9月4日周日 19:54写道: > > > > > Congrats & welcome Liang-Chi! > > > > > > On Sun, Sep 4, 2022, at 06:22, Andrew La

Re: [ANNOUNCE] New Arrow PMC member: Weston Pace

2022-09-05 Thread Weston Pace
On Mon, Sep 5, 2022 at 1:56 AM Sutou Kouhei > > wrote: > > > > > > > >> The Project Management Committee (PMC) for Apache Arrow has invited > > > >> Weston Pace to become a PMC member and we are pleased to announce > > > >> that Weston Pace has accepted. > > > >> > > > >> Congratulations and welcome! > > > >> > > > > > > > >

Re: design for ordered aggregation

2022-09-06 Thread Weston Pace
It seems like a reasonable approach. I think my initial gut feeling would be that initializing and finalizing state for each change of key might be a bit heavyweight in cases where there are only a few values per key. I think these cases are fairly common as a data simplification / cleaning pass.
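
A rough sketch of the per-key-change pattern being discussed, assuming the input is already ordered by key. It uses `itertools.groupby` rather than any Acero node; `ordered_sum` and the sample rows are hypothetical.

```python
from itertools import groupby
from operator import itemgetter


def ordered_sum(rows):
    """Sum values per run of equal keys; rows must already be ordered by key."""
    results = []
    for key, segment in groupby(rows, key=itemgetter(0)):
        total = 0                       # state is initialized at each key change
        for _, value in segment:
            total += value
        results.append((key, total))    # and finalized when the run ends
    return results


rows = [("a", 1), ("a", 2), ("b", 5), ("c", 7), ("c", 1)]
sums = ordered_sum(rows)
```

With only one or two rows per key, the initialize and finalize steps run almost as often as the accumulation itself, which is the overhead concern raised above.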

Re: [ANNOUNCE] New Arrow committer: Yanghong Zhong

2022-09-08 Thread Weston Pace
Congratulations! On Thu, Sep 8, 2022 at 8:32 AM David Li wrote: > > Congrats & welcome, Yanghong! > > On Thu, Sep 8, 2022, at 11:04, Daniël Heres wrote: > > Congratulations! > > > > On Thu, Sep 8, 2022, 17:02 Andy Grove wrote: > > > >> Congratulations, Yanghong! > >> > >> On Thu, Sep 8, 2022 at

Re: Transactional semantics in Acero

2022-09-09 Thread Weston Pace
I'd agree with Micah. I'm also not aware of anyone working on this. The docs clarify a bit more on the details[1]. I think we'd need a bit more thinking around an "update/append" workflow too. That being said, updates, transactions, and appends are something that the Iceberg project has thought

Re: Question on handling API changes when upgrading Pyarrow

2022-09-09 Thread Weston Pace
Breaking changes should be documented in the release notes which are announced on the Arrow blog[1][2]. In addition, in pyarrow, changes to non-experimental APIs (and often also those made to experimental APIs) should go through a deprecation cycle where a warning is emitted for at least one relea
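
A generic sketch of what a deprecation cycle looks like from the caller's side, using only the standard `warnings` module. The functions `old_name` and `new_name` are hypothetical and are not pyarrow APIs; the choice of `FutureWarning` here is an assumption.

```python
import warnings


def new_name(x):
    return x * 2


def old_name(x):
    """Deprecated alias kept working for a transition period."""
    warnings.warn(
        "old_name is deprecated; use new_name instead",
        FutureWarning,
        stacklevel=2,   # point the warning at the caller, not at this wrapper
    )
    return new_name(x)


# Record warnings to show that the old name still works but complains.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    result = old_name(21)
```

Running a test suite with `python -W error::FutureWarning` is one way to find such calls before upgrading.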

Re: [ANNOUNCE] New Arrow committer: Remzi Yang

2022-09-12 Thread Weston Pace
Congrats Remzi! On Mon, Sep 12, 2022 at 5:42 PM Rok Mihevc wrote: > > Congrats! > > Rok > > On Sun, Sep 11, 2022 at 4:27 AM Ian Joiner wrote: > > > Congrats Remzi! > > > > On Sat, Sep 10, 2022 at 8:12 AM Andrew Lamb wrote: > > > > > On behalf of the Arrow PMC, I'm happy to announce that Remzi Y

Re: Integration between Flight and Acero

2022-09-13 Thread Weston Pace
> The alternative path of subclassing SourceNode and having ExecNode::Init or > ExecNode::StartProducing seems quite a bit of change (also I don't think > SourceNode is exposed via public header). But let me know if you think I am > missing something. Agreed that we don't want to go this route. D

Re: Integration between Flight and Acero

2022-09-13 Thread Weston Pace
> the result set can be read.) > - Those partitions then each become a Fragment, and then they can be read in > parallel by Dataset. > > It sounds like the service in question here isn't quite that complex, though, > so no need to necessarily go that far. > > On Tue, Sep 13,

Re: PRs for RLE support

2022-09-14 Thread Weston Pace
I'm going to bump this because it would be good to get feedback. In particular it would be nice to get feedback on the suggested format change[1]. We are currently moving forward on coming up with an IPC format proposal which we will share when ready. The two interesting points that jump out to

Re: PRs for RLE support

2022-09-14 Thread Weston Pace
>> difficult to calculate offsets. Translating an array offset to a > > >> buffer offset takes O(log(N)) time. If the run ends are encoded as > > >> a > > >> child array (so the RLE array has no buffers and two child arrays) > > >> then this

Re: PRs for RLE support

2022-09-14 Thread Weston Pace
he the run ends buffer is > physical size of the array (or larger) which cannot be easily > determined without iterating over the whole buffer. > > But we need a valid buffer size, so we can resolve logical to physical > offsets using binary search. Also I'm not sure if it is
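
The logical-to-physical lookup mentioned in this thread can be sketched with a binary search. This assumes the run ends are stored as cumulative, exclusive end positions, which may not match the layout under discussion at the time; `physical_index` and the sample arrays are hypothetical.

```python
from bisect import bisect_right


def physical_index(run_ends, logical_index):
    """Map a logical position to the index of the run that contains it."""
    logical_length = run_ends[-1] if run_ends else 0
    if not 0 <= logical_index < logical_length:
        raise IndexError("logical index out of range")
    # The first run whose end is strictly greater than the index contains it.
    return bisect_right(run_ends, logical_index)


# Assumed encoding of a a a b b c c c c: three runs ending at 3, 5 and 9.
run_ends = [3, 5, 9]
values = ["a", "b", "c"]
decoded = [values[physical_index(run_ends, i)] for i in range(run_ends[-1])]
```

Each lookup costs O(log N) in the number of runs, and the search needs to know how many run ends are valid, which is why the size of the run ends buffer matters.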

Re: RLE array slicing

2022-09-15 Thread Weston Pace
Thank you everyone, I think I was pretty far off base in representing the work Tobias had done and both Tobias and Matt have clarified well. * There are two child arrays not necessarily for slicing but more to help distinguish between the logical length (there are no buffers with the logical leng

Re: [C++][Gandiva] Proposal to Add A Parser Frontend for Gandiva

2022-09-19 Thread Weston Pace
First, I think you are correct that there is a lot of value to users here. I'd love for a capability like this to someday be in pyarrow too for Arrow compute functions. I think there is a distinct enough difference between "a query language" and "a programming language". However, both of them are

Re: [ANNOUNCE] New Arrow PMC member: Raphael Taylor-Davies

2022-09-19 Thread Weston Pace
Congratulations! On Mon, Sep 19, 2022 at 6:17 PM Yijie Shen wrote: > > Congratulations, Raphael! > > On Tue, Sep 20, 2022 at 11:44 AM L. C. Hsieh wrote: > > > Congratulations! > > > > On Mon, Sep 19, 2022 at 7:40 PM Andy Grove wrote: > > > > > > Congratulations, Raphael! > > > > > > On Mon, Sep

Re: [ANNOUNCE] New Arrow committer: Dan Harris

2022-09-20 Thread Weston Pace
Congratulations Dan On Tue, Sep 20, 2022 at 10:52 AM David Li wrote: > > Congrats, Dan! > > On Tue, Sep 20, 2022, at 13:43, L. C. Hsieh wrote: > > Congratulations! > > > > On Tue, Sep 20, 2022 at 10:38 AM Chao Sun wrote: > >> > >> Congrats Dan! > >> > >> On Tue, Sep 20, 2022 at 10:17 AM Ian Join

Re: Register custom ExecNode factories

2022-09-20 Thread Weston Pace
I'm not great at this build stuff but I think the basic idea is that you will need to package your custom nodes into a shared object. You'll need to then somehow trigger that shared object to load from python. This seems like a good place to invoke the initialize method. Currently pyarrow has to

Re: Correct way to collect results from an Acero query

2022-09-21 Thread Weston Pace
Funny you should mention this, I just ran into the same problem :). We use StartAndCollect so much in our unit tests that there must be some usefulness there. You are correct that it is not an API that can be used outside of tests. I added utility methods DeclarationToTable, DeclarationToBatches,

Re: Substrait consumer for custom data sources

2022-09-27 Thread Weston Pace
In pyarrow it is "string(s) -> arrow Table". However, in the actual C++ (e.g. relation_internal.cc) it is already "string(s) -> compute::Declaration" which should be sufficiently general for your needs. A "compute::Declaration" is a combination of node factory name and node options so you should

Re: Parser for expressions

2022-10-06 Thread Weston Pace
Currently Substrait only has a binary (protobuf) serialization (and a protobuf JSON one but that's not really human writable and barely human readable). Substrait does not have a text serialization. I believe there is some desire for one (maybe Sasha wants to give it a try?). A text format for S

Re: Parser for expressions

2022-10-11 Thread Weston Pace
t;>>> I was thinking of proving out a design here before going there. However > >>>>> we > >>>>> could also just go straight there :) > >>>>> > >>>>> Regarding infix operators and such the edge case I was thinking of is > >>>

Re: Question about pyarrow.substrait.run_query

2022-10-12 Thread Weston Pace
1. Yes. 2. I was going to say yes but...on closer examination...it appears that it is not applying backpressure. The SinkNode accumulates batches in a queue and applies backpressure. I thought we were using a sink node since it is the normal "accumulate batches into a queue" sink. However, the Su
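
A toy sketch of "accumulate batches in a queue and apply backpressure", using high and low watermarks on the queue length. It is not the SinkNode code; `BackpressureQueue`, `pause_above`, and `resume_below` are hypothetical names, and a real implementation might count bytes instead of batches.

```python
from collections import deque


class BackpressureQueue:
    """Queue that asks its producer to pause when it grows too long."""

    def __init__(self, pause_above, resume_below):
        self._items = deque()
        self._pause_above = pause_above
        self._resume_below = resume_below
        self.paused = False

    def push(self, batch):
        self._items.append(batch)
        if len(self._items) > self._pause_above:
            self.paused = True       # signal the source to stop producing

    def pop(self):
        batch = self._items.popleft()
        if len(self._items) < self._resume_below:
            self.paused = False      # signal the source to resume
        return batch


sink = BackpressureQueue(pause_above=3, resume_below=2)
for i in range(4):
    sink.push(i)
paused_after_fill = sink.paused      # four queued items exceed the high mark
sink.pop()
sink.pop()
paused_at_two = sink.paused          # two items left, not yet below the low mark
sink.pop()
paused_at_one = sink.paused          # one item left, below the low mark
```

Keeping the two thresholds apart avoids toggling the source on and off for every batch.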

Re: Substrait consumer for custom data sources

2022-10-13 Thread Weston Pace
> Does that sound like a reasonable way to do this? It's not ideal. I may be assuming here but I think your problem is more that there is no way to more flexibly describe a source in python and less that you need to change the default. For example, if you could do something like this (in pyarrow

Re: Substrait consumer for custom data sources

2022-10-14 Thread Weston Pace
initialization seems cleaner to me because there are many > >>> other > >>> extension points that we initialize (add registering in the > >>> default_exec_factory_registry > >>> similar to > >>> https://github.com/apache/arrow/blob/m

Re: [Acero] Error handling in ExecNode

2022-10-18 Thread Weston Pace
Yes. Something like: if (ErrorIfNotOk(flight_writer->WriteRecordBatch(...))) return; Today this method calls `output->ErrorReceived(...)`. The original idea (I think) is that, possibly, a downstream node could "handle" the error. However, in practice, nothing does that, and all errors propagat

Re: [DISCUSS] Integrate existing Spark connector for Flight

2022-10-21 Thread Weston Pace
> Maybe to take a step back - why do we want this in the Arrow > repositories/under Arrow governance? I think this is the important question. What is the goal here? If the goal is to help spread awareness then we can link to a repo somewhere (e.g. a "projects that use Arrow" section or somethin

Re: [DISCUSS] Move issue tracking to

2022-10-24 Thread Weston Pace
+1 for GH issues mainly because it lowers the barrier to entry and JIRA won't be an acceptable solution any longer with infra's proposed changes. I suspect I'd be +1 even without the infra change though providing everyone else was willing to make the switch. On Mon, Oct 24, 2022 at 8:19 AM Jacob

Re: [ANNOUNCE] New Arrow committer: Bogumił Kamiński

2022-10-26 Thread Weston Pace
Congratulations Bogumił. On Wed, Oct 26, 2022 at 6:10 AM Jacob Wujciak wrote: > > Congrats! > > On Wed, Oct 26, 2022 at 8:31 AM Alenka Frim > wrote: > > > Congratulations! > > > > On Wed, Oct 26, 2022 at 7:55 AM Daniël Heres > > wrote: > > > > > Congratulations! > > > > > > On Wed, Oct 26, 2022

Re: [ANNOUNCE] New Arrow PMC member: Jacob Quinn

2022-10-26 Thread Weston Pace
Congrats Jacob! On Wed, Oct 26, 2022 at 6:10 AM Jacob Wujciak wrote: > > Congrats! > > On Wed, Oct 26, 2022 at 8:31 AM Alenka Frim > wrote: > > > Congratulations! > > > > On Wed, Oct 26, 2022 at 7:54 AM Daniël Heres > > wrote: > > > > > Congratulations! > > > > > > On Wed, Oct 26, 2022, 07:50 B

Re: [ANNOUNCE] New Arrow PMC member: Nicola Crane

2022-10-26 Thread Weston Pace
Thanks Nic and congratulations! On Wed, Oct 26, 2022 at 8:28 AM Raúl Cumplido wrote: > > Thanks Nic for your contributions! > > El mié, 26 oct 2022 a las 17:17, Antoine Pitrou () > escribió: > > > > > Welcome, Nic! > > > > > > Le 26/10/2022 à 16:37, Dewey Dunnington a écrit : > > > Congrats, Nic!

Re: [ANNOUNCE] New Arrow committer: Ben Baumgold

2022-10-26 Thread Weston Pace
Congratulations Ben! On Wed, Oct 26, 2022 at 2:05 PM David Li wrote: > > Welcome Ben! > > On Wed, Oct 26, 2022, at 17:57, Ian Joiner wrote: > > Congrats Ben! > > > > Ian > > > > On Wednesday, October 26, 2022, Sutou Kouhei wrote: > > > >> On behalf of the Arrow PMC, I'm happy to announce that Be

Re: [ANNOUNCE] New Arrow committer: Eric Patrick Hanson

2022-10-26 Thread Weston Pace
Congrats Eric! On Wed, Oct 26, 2022 at 2:05 PM David Li wrote: > > Welcome Eric! > > On Wed, Oct 26, 2022, at 17:57, Ian Joiner wrote: > > Congrats Eric! > > > > Ian > > > > On Wednesday, October 26, 2022, Sutou Kouhei wrote: > > > >> On behalf of the Arrow PMC, I'm happy to announce that Eric P

Re: pyarrow dataset API

2022-11-02 Thread Weston Pace
FileSystemDataset is part of the public API (and in a pxd file[1]). I would agree it's fair to say that pyarrow datasets are no longer experimental. > Instead we subclass Dataset and return a custom scanner we created. And our > Dataset subclass *should* be a FileSystemDataset subclass, but > F

Re: [ANNOUNCE] New Arrow committer: Yang Jiang

2022-11-03 Thread Weston Pace
Congratulations On Thu, Nov 3, 2022, 6:25 AM Patrick Horan wrote: > Congrats Jiang! > > On Thu, Nov 3, 2022, at 1:52 AM, Wang Xudong wrote: > > Congratulations! > > > > Yijie Shen 于2022年11月3日周四 11:08写道: > > > > > Congratulations Jiang! > > > > > > On Thu, Nov 3, 2022 at 9:54 AM vin jake wrote:

Re: pyarrow dataset API

2022-11-03 Thread Weston Pace
it > supports appends, but we're working on full schema evolution / support. We > had to do this outside of iceberg because we're not using parquet). Do you > have documentation for how you're envisioning schema evolution to work in > Arrow? Would you be open to chatting w

Re: Parser for ExecPlans

2022-11-03 Thread Weston Pace
Indentation works well when you omit the other arguments (e.g. ...) but once you mix in the arguments for the nodes (especially if those arguments have their own indentation / structure) then it ends up becoming unreadable I think. I prefer the idea of each node having its own block, with no inde

Re: [ANNOUNCE] New Arrow committer: Jarrett Revels

2022-11-03 Thread Weston Pace
Congrats Jarrett! On Thu, Nov 3, 2022 at 11:25 AM Jacob Wujciak wrote: > > Congratulations! > > On Thu, Nov 3, 2022 at 2:40 PM Rok Mihevc wrote: > > > Congratulations! > > > > On Thu, Nov 3, 2022 at 12:31 AM David Li wrote: > > > > > Welcome Jarrett! > > > > > > On Tue, Nov 1, 2022, at 17:15, S

Re: [ANNOUNCE] New Arrow committer: Curtis Vogt

2022-11-04 Thread Weston Pace
Congrats! On Thu, Nov 3, 2022, 11:06 PM Benson Muite wrote: > Congratulations > On 11/4/22 01:29, Vibhatha Abeykoon wrote: > > Congratulations > > > > On Thu, Nov 3, 2022 at 7:09 PM Rok Mihevc wrote: > > > >> Congratulations! > >> > >> On Thu, Nov 3, 2022 at 12:31 AM David Li wrote: > >> > >>>

Re: Struct evolution

2022-11-09 Thread Weston Pace
From a datasets / Acero perspective I have been thinking about this in the back of my mind for a while and decided to write my thoughts down in a document. I will send it in a separate email. On Tue, Nov 8, 2022 at 9:53 AM Micah Kornfield wrote: > > Hi Matthew, > Could you give some more specif

[RFC] Schema Evolution

2022-11-09 Thread Weston Pace
I've created a document[1] that both describes the general idea of schema evolution as well as my best guess at how it should work. This is written from an Acero / datasets perspective but the information should be generally applicable / accessible. I am doing some work in the scanner to enable a

Re: Parser for ExecPlans

2022-11-09 Thread Weston Pace
tions to it as they think of the next step. > While here you need to think backward. Obviously you can append to the top > as you write your pipeline ,but that's still a bit counterintuitive. > > Just my two cents. > > > > On Thu, Nov 3, 2022 at 8:08 PM Weston Pace wrot

Re: [RFC] Schema Evolution

2022-11-10 Thread Weston Pace
Sorry about that. I've enabled it now. On Wed, Nov 9, 2022, 9:34 PM Micah Kornfield wrote: > It doesn't look like comment access is enabled? > > On Wed, Nov 9, 2022 at 5:16 PM Weston Pace wrote: > > > I've created a document[1] that both describes the genera

Re: Struct evolution

2022-11-10 Thread Weston Pace
nvolves for each file figuring out how to convert to > the desired. I found it easiest to do this per column of the desired > schema. Then it can be (1) reference a column (2) reference a column and > cast or (3) create a column of nulls of a given type. > > Is something like that
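
The per-column approach quoted above (reference a column, reference and cast, or fill with nulls) can be sketched as a small planning function. Schemas are modeled here as plain dicts of column name to type name, which is a simplification; `plan_evolution` and the step labels are hypothetical.

```python
def plan_evolution(file_schema, desired_schema):
    """Return one step per desired column describing how to produce it."""
    plan = []
    for name, wanted_type in desired_schema.items():
        if name not in file_schema:
            plan.append(("nulls", name, wanted_type))   # (3) column of nulls
        elif file_schema[name] == wanted_type:
            plan.append(("ref", name))                  # (1) reference as-is
        else:
            plan.append(("cast", name, wanted_type))    # (2) reference and cast
    return plan


file_schema = {"id": "int32", "name": "string"}
desired_schema = {"id": "int64", "name": "string", "score": "float64"}
plan = plan_evolution(file_schema, desired_schema)
```

Columns present in the file but absent from the desired schema are simply never referenced, so they drop out without a dedicated step.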

Re: [ANNOUNCE] New Arrow PMC member: Kun Liu

2022-11-14 Thread Weston Pace
Congrats! On Mon, Nov 14, 2022 at 8:14 AM L. C. Hsieh wrote: > > Congratulations! > > On Mon, Nov 14, 2022 at 7:21 AM Andy Grove wrote: > > > > Congratulations! > > > > On Mon, Nov 14, 2022 at 4:58 AM Andrew Lamb wrote: > > > > > Congratulations! > > > > > > On Sun, Nov 13, 2022 at 10:15 PM Yij

Re: Arrow sync call November 23 at 12:00 US/Eastern, 17:00 UTC

2022-11-28 Thread Weston Pace
One thing to note is that you need to have something like "closes #123" in the PR description or a comment in order for GitHub to close the relevant issue when the PR is merged. This isn't too much of a burden to check I think but took a bit of getting used to for me in Substrait where we use the

Re: [ANNOUNCE] New Arrow committer: Raúl Cumplido

2022-12-06 Thread Weston Pace
Congratulations! On Tue, Dec 6, 2022 at 7:57 AM Nic wrote: > > Congratulations! > > On Tue, 6 Dec 2022 at 15:49, Ian Cook wrote: > > > Congratulations Raúl! > > > > On Tue, Dec 6, 2022 at 10:43 AM Matt Topol wrote: > > > > > > Congrats Raúl!! > > > > > > On Tue, Dec 6, 2022 at 9:53 AM Dewey Dun

Re: [ANNOUNCE] New Arrow committer: Jacob Wujciak

2022-12-15 Thread Weston Pace
Congratulations Jacob! On Thu, Dec 15, 2022 at 3:27 PM David Li wrote: > > Congrats & welcome Jacob! > > On Thu, Dec 15, 2022, at 18:14, Nic Crane wrote: > > On behalf of the Arrow PMC, I'm happy to announce that Jacob Wujciak has > > accepted an invitation to become a committer on Apache Arrow.

Re: [VOTE] Add RLE Arrays to Arrow Format

2022-12-16 Thread Weston Pace
+1 I agree that run-end encoding makes more sense but also don't see it as a deal breaker. The most compelling counter-argument I've seen for new types is to avoid a schism where some implementations do not support the newer types. However, for the type proposed here I think the risk is low beca

Re: [ANNOUNCE] New Arrow PMC chair: Andrew Lamb

2022-12-25 Thread Weston Pace
Congratulations! On Sun, Dec 25, 2022, 9:44 PM Remzi Yang <1371656737...@gmail.com> wrote: > Congratulation Andrew! > > On Mon, 26 Dec 2022 at 13:40, David Li wrote: > > > Congrats Andrew! > > > > On Mon, Dec 26, 2022, at 00:26, vin jake wrote: > > > congratulation! > > > > > > Sutou Kouhei 于 2

Re: modeling column group

2023-01-01 Thread Weston Pace
There was a discussion a while back about representing complex numbers that seems similar[1]. If both fields were the same type you could use a fixed size list array. However, since you want two different types you'd want some kind of "packed struct" which does not exist (to my knowledge) today.

Re: [DISCUSS] Updating what are considered reference implementations?

2023-01-06 Thread Weston Pace
I think it would be reasonable to state that a reference implementation must be a complete implementation (i.e. supports all existing types) that is not derived from another implementation (e.g. you can't pick pyarrow and arrow-c++). If an implementation does not plan on ever supporting a new arra

Re: [ANNOUNCE] New Arrow committer: Jie Wen

2023-01-08 Thread Weston Pace
Congratulations Jie! On Sun, Jan 8, 2023 at 10:28 AM Rok Mihevc wrote: > > Congrats Jie! > > Rok > > On Sun, Jan 8, 2023 at 7:00 PM Raúl Cumplido wrote: > > > Congratulations Jie! > > > > El dom, 8 ene 2023, 18:45, David Li escribió: > > > > > Congrats Jie & welcome! > > > > > > On Sun, Jan 8,

Re: [DISCUSS] State of the Arrow Project 2022

2023-01-08 Thread Weston Pace
Start: There have been a few calls in the past for an improved workflow for reviewing PRs. I think a bot that highlights pull requests that need attention (e.g. has no reviews in the "changes requested" state, also some way of knowing how long it's been waiting) would be helpful. There has been

Re: [Monorepo] Add labels breaking-change and critical-fix

2023-01-14 Thread Weston Pace
On further thought it seems a little odd to me that crashes are not critical. However, many of our crashes are from a failure to properly validate user input, which I agree isn't as critical. Would it be too nuanced to say that: * A crash, given valid input, is critical * A crash, given invali

Re: [VOTE] Release Apache Arrow 11.0.0 - RC0

2023-01-19 Thread Weston Pace
I've got a fix[1] in for the verification script for C#. There are more details in the issue and the PR but IMO we are compatible with C#7 and C#6, we simply were not testing it correctly. I have run the tests locally with both 6.0 and 7.0 sdks and they passed. [1] https://github.com/apache/arro
