Re: [Call For Volunteer] Apache Arrow Summit and Selection Committee

2025-05-19 Thread Li Jin
Sorry for missing this email. I volunteer as well. (I have been working with / building Arrow-based data processing systems since 2017. Perhaps I can provide some perspectives from use cases in addition to traditional SQL systems, e.g., streaming, time series, ML, numerical computation etc) On Su

[Arrow Compute] Question on function chaining / math formula

2025-02-21 Thread Li Jin
Dear Arrow Devs, I wonder if there is a nice way to do function chaining / math formula with Arrow compute? (Either Python or C++?) To give an example, let's say I have three arrays a, x and y and want to compute: x * (1 - a) + y * a Right now I can do this in pyarrow but pretty hard to read: f

Re: [ANNOUNCE] New Arrow PMC member: Bryce Mecum

2025-02-06 Thread Li Jin
Congrats! On Thu, Feb 6, 2025 at 2:52 AM wish maple wrote: > Congrats! > > Best, > Xuwei Fu > > Raúl Cumplido 于2025年2月6日周四 15:47写道: > > > Congrats Bryce! > > > > El jue, 6 feb 2025, 6:22, Weston Pace escribió: > > > > > Congrats Bryce! > > > > > > On Wed, Feb 5, 2025 at 8:35 PM Saurabh Singh

Re: [C++] Thread deadlock in ObjectOutputStream

2024-05-29 Thread Li Jin
s to take > the lock. > > Can you open a GH issue and we can follow up there? > > Regards > > Antoine. > > > Le 23/05/2024 à 21:23, Li Jin a écrit : > > Hello, > > > > I am seeing a deadlock when destructing an ObjectOutputStream. I have > > attached

[C++] Thread deadlock in ObjectOutputStream

2024-05-23 Thread Li Jin
Hello, I am seeing a deadlock when destructing an ObjectOutputStream. I have attached the stack trace. I did some debugging and found that the issue seems to be that the mutex in question is already held by this thread (I checked the __owner field in the pthread_mutex_t which points to the hangin

Re: Arrow 15 parquet nanosecond change

2024-02-21 Thread Li Jin
> 2.6, which contains nanosecond support. > It was released in Arrow v13. > > [1] > > https://github.com/apache/arrow/blob/e198f309c577de9a265c04af2bc4644c33f54375/python/pyarrow/parquet/core.py#L953 > > [2]https://github.com/apache/arrow/pull/36137 > > On Wed, Feb 21, 20

Re: Arrow 15 parquet nanosecond change

2024-02-21 Thread Li Jin
“Exponentially exposed” -> “potentially exposed” On Wed, Feb 21, 2024 at 4:13 PM Li Jin wrote: > Thanks - since we don’t control all the invocation of pq.write_table, I > wonder if there is some configuration for the “default” behavior? > > Also I wonder if there are other API

Re: Arrow 15 parquet nanosecond change

2024-02-21 Thread Li Jin
gt; BR > > J > > > śr., 21 lut 2024 o 21:44 Li Jin napisał(a): > > > Hi, > > > > My colleague has informed me that during the Arrow 12->15 upgrade, he > found > > that writing a pandas Dataframe with datetime64[ns] to parquet will > result > >

Arrow 15 parquet nanosecond change

2024-02-21 Thread Li Jin
Hi, My colleague has informed me that during the Arrow 12->15 upgrade, he found that writing a pandas DataFrame with datetime64[ns] to parquet will result in nanosecond metadata and nanosecond values. I wonder if this is configurable back to the old behavior so we can enable “nanosecond in p

Re: [ANNOUNCE] New Arrow PMC chair: Andy Grove

2023-11-28 Thread Li Jin
Congrats Andy! On Tue, Nov 28, 2023 at 3:25 PM Weston Pace wrote: > Congrats Andy! > > On Mon, Nov 27, 2023, 7:31 PM wish maple wrote: > > > Congrats Andy! > > > > Best, > > Xuwei Fu > > > > Andrew Lamb 于2023年11月27日周一 20:47写道: > > > > > I am pleased to announce that the Arrow Project has a ne

Re: C++: Code that reads parquet into Arrow Arrays?

2023-11-19 Thread Li Jin
> > Best, > Xuwei Fu > > [1] https://github.com/apache/arrow/blob/main/cpp/src/parquet/encoding.cc > [2] > https://github.com/apache/arrow/blob/main/cpp/src/parquet/column_reader.cc > > Li Jin 于2023年11月18日周六 05:27写道: > > > Hi, > > > > I am recentl

Re: C++: Code that reads parquet into Arrow Arrays?

2023-11-17 Thread Li Jin
> https://github.com/apache/arrow/blob/main/cpp/src/parquet/arrow/reader.cc#L107 > [2] > https://github.com/apache/arrow/blob/main/cpp/src/parquet/arrow/reader_internal.cc#L345 > > On Fri, Nov 17, 2023 at 12:27 PM Li Jin wrote: > > > > Hi, > > > > I am recentl

C++: Code that reads parquet into Arrow Arrays?

2023-11-17 Thread Li Jin
Hi, I am recently investigating a null/nan issue with Parquet and Arrow and wonder if someone can give me a pointer to the code that decodes Parquet row group into Arrow float/double arrays? Thanks, Li

Re: [C++] Potential cache/memory leak when reading parquet

2023-09-08 Thread Li Jin
at 10:07 AM Li Jin wrote: > Update: > > I have done some memory profiling and the result seems to suggest a memory > leak. I > have opened an issue to further discuss this: > https://github.com/apache/arrow/issues/37630 > > > On Fri, Sep 8, 2023 at 10:04 AM Li Jin wrote: >

Re: [C++] Potential cache/memory leak when reading parquet

2023-09-08 Thread Li Jin
Update: I have done some memory profiling and the result seems to suggest a memory leak. I have opened an issue to further discuss this: https://github.com/apache/arrow/issues/37630 On Fri, Sep 8, 2023 at 10:04 AM Li Jin wrote: > Update: > > I have done some memory profiling and the result

Re: [C++] Potential cache/memory leak when reading parquet

2023-09-06 Thread Li Jin
On Wed, Sep 6, 2023 at 4:35 PM Li Jin wrote: > Also attaching my experiment code just in case: > https://gist.github.com/icexelloss/88195de046962e1d043c99d96e1b8b43 > > On Wed, Sep 6, 2023 at 4:29 PM Li Jin wrote: > >> Reporting back with some new findings. >> >>

Re: [C++] Potential cache/memory leak when reading parquet

2023-09-06 Thread Li Jin
Also attaching my experiment code just in case: https://gist.github.com/icexelloss/88195de046962e1d043c99d96e1b8b43 On Wed, Sep 6, 2023 at 4:29 PM Li Jin wrote: > Reporting back with some new findings. > > Re Felipe and Antione: > I tried with both Antione's suggestions (swa

Re: [C++] Potential cache/memory leak when reading parquet

2023-09-06 Thread Li Jin
issues. Re Xuwei: Thanks for the tips. I am gonna try to memorize this profile next and see what I can find. I am gonna keep looking into this but again, any ideas / suggestions are appreciated (and thanks for all the help so far!) Li On Wed, Sep 6, 2023 at 1:59 PM Li Jin wrote: > T

Re: [C++] Potential cache/memory leak when reading parquet

2023-09-06 Thread Li Jin
Another sign this isn't a leak, just the allocator reaching a level of > > memory commitment that it doesn't feel like undoing. > > > > -- > > Felipe > > > > On Wed, Sep 6, 2023 at 12:56 PM Li Jin wrote: > > > > > Hello, > > > > > > I have

Re: [C++] Potential cache/memory leak when reading parquet

2023-09-06 Thread Li Jin
In Parquet, if non-buffered read is enabled, when read a column, the > whole ColumChunk would be read. > Otherwise, it will "buffered" read it decided by buffer-size > > Maybe I forgot someplaces. You can try to check that. > > Best > Xuwei Fu > > Li Jin

Re: [C++] Potential cache/memory leak when reading parquet

2023-09-06 Thread Li Jin
37139 > [3] https://github.com/apache/arrow/issues/36587 > [4] https://github.com/apache/arrow/issues/37136 > > Li Jin 于2023年9月6日周三 23:56写道: > > > Hello, > > > > I have been testing "What is the max rss needed to scan through ~100G of > > data in a parquet

[C++] Potential cache/memory leak when reading parquet

2023-09-06 Thread Li Jin
Hello, I have been testing "What is the max rss needed to scan through ~100G of data in a parquet stored in gcs using Arrow C++". The current answer is about ~6G of memory which seems a bit high so I looked into it. What I observed during the process led me to think that there are some potential

Re: Optimized way of converting list of pa.Array to pd.DataFrame with index

2023-08-31 Thread Li Jin
Although - I am curious if there are any downsides using `self_destruct`? On Thu, Aug 31, 2023 at 1:05 PM Li Jin wrote: > Ah I see - thanks for the explanation. self_destruct probably won't > benefit in my case then. (The pa.Array here is a slice from another batch > so there

Re: Optimized way of converting list of pa.Array to pd.DataFrame with index

2023-08-31 Thread Li Jin
> each array is actually backed by its own memory allocations (which right > would generally mean copying data up front!). > > On Thu, Aug 31, 2023, at 11:11, Li Jin wrote: > > Hi, > > > > I am working on some code where I have a list of pa.Arrays and I am > > cr

Optimized way of converting list of pa.Array to pd.DataFrame with index

2023-08-31 Thread Li Jin
Hi, I am working on some code where I have a list of pa.Arrays and I am creating a pandas.DataFrame from it. I also want to set the index of the pd.DataFrame to be the first Array in the list. Currently I am doing sth like: " df = pa.Table.from_arrays(arrs, names=input_names).to_pandas() df.set_i

Re: Sort a Table In C++?

2023-08-17 Thread Li Jin
23 à 23:20, Ian Cook a écrit : > > Li, > > > > Here's a standalone C++ example that constructs a Table and executes > > an Acero ExecPlan to sort it: > > https://gist.github.com/ianmcook/2aa9aa82e61c3ea4405450b93cf80fbc > > > > Ian > > > > O

Sort a Table In C++?

2023-08-17 Thread Li Jin
Hi, I am writing some C++ tests and found myself in need of a C++ function to sort an arrow Table. Before I go around implementing one myself, I wonder if there is already a function that does that? (I searched the doc but didn’t find one). There is a function in Acero that can do it but I didn’t find

Re: Acero and Substrait: How to select struct field from a struct column?

2023-08-08 Thread Li Jin
>>> schema = pa.schema([pa.field("points", pa.struct([pa.field("x", > pa.float64()), pa.field("y", pa.float64())]))]) > >>> expr = pc.field(("points", "x")) > >>> expr.to_substrait(schema) > is_mutable=False

Acero and Substrait: How to select struct field from a struct column?

2023-08-01 Thread Li Jin
Hi, I am recently trying to (1) assign a struct type column s (2) flatten the struct columns (by assigning v1=s[v1], v2=s[v2] and dropping the s column) via Substrait and Acero. However, I ran into the problem where I don't know the proper substrait message to encode this (for (2)). Normally, if I s

Re: scheduler() and async_scheduler() on QueryContext

2023-07-26 Thread Li Jin
I/O call (which under the hood is usually implemented by > submitting something to the I/O executor). > > On Tue, Jul 25, 2023 at 2:56 PM Li Jin wrote: > > > Hi, > > > > I am reading Acero and got confused about the use of > > QueryContext::scheduler() and Q

scheduler() and async_scheduler() on QueryContext

2023-07-25 Thread Li Jin
Hi, I am reading Acero and got confused about the use of QueryContext::scheduler() and QueryContext::async_scheduler(). So I have a couple of questions: (1) What are the different purposes of these two? (2) Does scheduler/async_scheduler own any threads inside their respective classes or do they

Re: C++: State of parquet 2.x / nanosecond support

2023-07-15 Thread Li Jin
ever, I don't know whether nanoarrow > supports it. > > Best, > Xuwei Fu > > [1] https://lists.apache.org/thread/027g366yr3m03hwtpst6sr58b3trwhsm > [2] https://github.com/apache/arrow/pull/36137 > > On 2023/07/14 13:25:22 Li Jin wrote: > > Hi, > > >

C++: State of parquet 2.x / nanosecond support

2023-07-14 Thread Li Jin
Hi, Recently I found myself in need of nanosecond-granularity timestamps. IIUC this is something supported in newer versions of parquet (2.6 perhaps)? I wonder what the state of that is in Arrow and parquet cpp? Thanks, Li

Re: Confusion on substrait AggregateRel::groupings and Arrow consumer

2023-07-10 Thread Li Jin
> > Acero does not currently handle more than one grouping set. > > > [1] https://docs.snowflake.com/en/sql-reference/constructs/group-by-rollup > > On Mon, Jul 10, 2023 at 2:22 PM Li Jin wrote: > > > Hi, > > > > I am looking at the substrait

Confusion on substrait AggregateRel::groupings and Arrow consumer

2023-07-10 Thread Li Jin
Hi, I am looking at the substrait protobuf for AggregateRel as well the Acero substrait consumer code: https://github.com/apache/arrow/blob/main/cpp/src/arrow/engine/substrait/relation_internal.cc#L851 https://github.com/substrait-io/substrait/blob/main/proto/substrait/algebra.proto#L209 Looks l

Re: [C++] Dealing with third party method that raises exception

2023-06-29 Thread Li Jin
ception` in the codebase, you'll find that there > a couple of places where we turn it into a Status already. > > Regards > > Antoine. > > > Le 29/06/2023 à 16:20, Li Jin a écrit : > > Hi, > > > > IIUC, most of the Arrow C++ code doesn't not use ex

[C++] Dealing with third party method that raises exception

2023-06-29 Thread Li Jin
Hi, IIUC, most of the Arrow C++ code doesn't use exceptions. My question is: is there an Arrow utility / macro that wraps function/code that might raise an exception and turns it into code that returns an arrow error Status? Thanks! Li

Re: Turn a vector of Scalar to an Array/ArrayData of the same datatype

2023-06-16 Thread Li Jin
lars = {MakeScalar(1), > MakeScalar(2)}; > > ARROW_ASSIGN_OR_RAISE(std::unique_ptr<ArrayBuilder> builder, > MakeBuilder(type)); > ARROW_RETURN_NOT_OK(builder->AppendScalars(scalars)); > ARROW_ASSIGN_OR_RAISE(auto arr, builder->Finish()); > ``` > > Best, > Jin > > > On Fri, Jun 16, 2023 at 5:23

Turn a vector of Scalar to an Array/ArrayData of the same datatype

2023-06-15 Thread Li Jin
Hi, I find myself in need of a function to turn a vector of Scalar to an Array of the same datatype. The data type is known at runtime. e.g. shared_ptr<Array> concat_scalars(vector<shared_ptr<Scalar>> values, shared_ptr<DataType> type); I wonder if I need to use sth like Scalar::Accept(ScalarVisitor*) or is there an easier/bett

Re: Group rows in a stream of record batches by group id?

2023-06-13 Thread Li Jin
(Admittedly, PR title of [1] doesn't reflect that only the scalar aggregate UDF is implemented and not the hash one - that is an oversight on my part - sorry) On Tue, Jun 13, 2023 at 3:51 PM Li Jin wrote: > Thanks Weston. > > I think I found what you pointed out to me before whi

Re: Group rows in a stream of record batches by group id?

2023-06-13 Thread Li Jin
and I'm maybe a little uncertain what > the difference is between this ask and the capabilities added in [1]. > > [1] https://github.com/apache/arrow/pull/35514 > > On Tue, Jun 13, 2023 at 8:23 AM Li Jin wrote: > > > Hi, > > > > I am trying to write a funct

Group rows in a stream of record batches by group id?

2023-06-13 Thread Li Jin
Hi, I am trying to write a function that takes a stream of record batches (where the last column is group id), and produces k record batches, where record batches k_i contain all the rows with group id == i. Pseudocode is sth like: def group_rows(batches, k) -> array[RecordBatch] { builder

Re: Converting Pandas DataFrame <-> Struct Array?

2023-06-13 Thread Li Jin
dtype(df.dtypes[col])) for col in > > df.columns] > > pa_type = pa.struct(fields) > > pa.array(df.itertuples(index=False, type=pa_type) > > > > But this seems like a classic XY problem. What is the root issue you're > > trying to solve? Why avoid RecordBatch?

Re: Converting Pandas DataFrame <-> Struct Array?

2023-06-12 Thread Li Jin
Gentle bump. Not a big deal if I need to use the API above to do so, but bump in case someone has a better way. On Fri, Jun 9, 2023 at 4:34 PM Li Jin wrote: > Hello, > > I am looking for the best ways for converting Pandas DataFrame <-> Struct > Array. >

Converting Pandas DataFrame <-> Struct Array?

2023-06-09 Thread Li Jin
Hello, I am looking for the best ways for converting Pandas DataFrame <-> Struct Array. Currently I have: pa.RecordBatch.from_pandas(df).to_struct_array() and pa.RecordBatch.from_struct_array(s_array).to_pandas() - I wonder if there is a direct way to go from DataFrame <-> Struct Array withou

Re: Github command to rerun CI checks?

2023-04-18 Thread Li Jin
r doing that, so you > should be able to give that a try. > > We don't have a way of running PR checks as we do with the crossbow > command. We could investigate if there is a way to do it via API. > > Thanks, > Raúl > > El mar, 18 abr 2023 a las 14:47, Li Jin () >

Re: Github command to rerun CI checks?

2023-04-18 Thread Li Jin
> > > The UI was recently updated: > > > > > https://docs.github.com/en/actions/managing-workflow-runs/re-running-workflows-and-jobs#re-running-failed-jobs-in-a-workflow > > > > On Mon, Apr 17, 2023 at 7:57 PM Li Jin wrote: > > > >> Thanks!

Re: Github command to rerun CI checks?

2023-04-17 Thread Li Jin
UI. If you want to avoid having > to add small changes to be able to commit you can use empty commits via > '--allow-empty'. > > On Mon, Apr 17, 2023 at 5:25 PM Li Jin wrote: > > > Hi, > > > > Is there a github command to rerun CI checks? (instead of pushing a new > > commit?) > > > > Thanks, > > Li > > >

Github command to rerun CI checks?

2023-04-17 Thread Li Jin
Hi, Is there a github command to rerun CI checks? (instead of pushing a new commit?) Thanks, Li

Re: Stacktrace from Arrow status?

2023-04-04 Thread Li Jin
Thanks David! On Tue, Apr 4, 2023 at 4:58 PM David Li wrote: > Yes, that's what the ARROW_EXTRA_ERROR_CONTEXT option does. > > On Tue, Apr 4, 2023, at 11:13, Li Jin wrote: > > Picking up this conversation again, I noticed when I hit an error in > > test I >

Re: Stacktrace from Arrow status?

2023-04-04 Thread Li Jin
his, std::move(batch)) /home/icexelloss/workspace/arrow/cpp/src/arrow/acero/hash_aggregate_test.cc:271 start_and_collect.MoveResult() ``` Is this because of the ARROW_EXTRA_ERROR_CONTEXT option? On Fri, Mar 24, 2023 at 12:04 PM Li Jin wrote: > Thanks David! > > On Tue, Mar 21, 2023 at 6:32

Re: Zero copy cast kernels

2023-03-28 Thread Li Jin
Thanks Rok! The original question was asking for a way to "verify if a cast is zero copy by reading source code / documentation", not "verify a cast is zero copy programmatically", but I noticed by reading the test file that int64 to micro is indeed zero copy and I expect nanos to be the same https:

Zero copy cast kernels

2023-03-24 Thread Li Jin
Hello, I recently found myself casting an int64 (nanos from epoch) into a nano timestamp column with the C++ cast kernel (via Acero). I expect this to be zero copy but I wonder if there is a way to check which casts are zero copy and which are not? Li

Re: Stacktrace from Arrow status?

2023-03-24 Thread Li Jin
a rough > stack trace (IIRC, if a function returns the status without using one of > the macros, it won't add a line to the trace). > > [1]: > https://github.com/apache/arrow/blob/1ba4425fab35d572132cb30eee6087a7dca89853/cpp/cmake_modules/DefineOptions.cmake#L608-L609 > > On

Stacktrace from Arrow status?

2023-03-21 Thread Li Jin
Hi, This might be a dumb question but when Arrow code raises an invalid status, I observe that it usually pops up to the user without stack information. I wonder if there are any tricks to show where the invalid status is coming from? Thanks, Li

Re: [DISCUSS] Acero roadmap / philosophy

2023-03-14 Thread Li Jin
Late to the party. Thanks Weston for sharing the thoughts around Acero. We are actually a pretty heavy Acero user right now and are trying to take part in Acero maintenance and development. Internally we are using Acero for a time series streaming data processing system. I would +1 on many of Wes

Re: Timestamp unit in Substrait and Arrow

2023-03-14 Thread Li Jin
rk > here will be pretty easy. The trickier part might be adapting your > producer (Ibis?) > > On Thu, Mar 9, 2023 at 9:43 AM Li Jin wrote: > > > Hi, > > > > I recently came across some limitations in expressing timestamp type with > > Substrait in the Ace

Re: [ANNOUNCE] New Arrow PMC member: Will Jones

2023-03-13 Thread Li Jin
Congratulations Will! On Mon, Mar 13, 2023 at 3:27 PM Bryce Mecum wrote: > Congratulations, Will! >

Timestamp unit in Substrait and Arrow

2023-03-09 Thread Li Jin
Hi, I recently came across some limitations in expressing timestamp type with Substrait in the Acero substrait consumer and am curious to hear what people's thoughts are. The particular issue that I have is when specifying timestamp type in substrait, the unit is "microseconds" and there is no wa

Re: testing of back-pressure

2023-02-16 Thread Li Jin
Thanks Weston for the information. On Thu, Feb 16, 2023 at 1:32 PM Weston Pace wrote: > There is a little bit at the end-to-end level. One goal is to be able to > repartition a very large dataset. This means we read from something bigger > than memory and then write to it. This workflow is te

Re: Question about memory usage and type casting using pyarrow Table

2023-02-15 Thread Li Jin
he array is timezone aware. > > On Wed, Feb 15, 2023 at 10:37 PM Li Jin wrote: > > > Oh found this comment: > > > > > https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/kernels/scalar_cast_temporal.cc#L156 > > > > > > > > On Wed, Feb

Re: Question about memory usage and type casting using pyarrow Table

2023-02-15 Thread Li Jin
Oh found this comment: https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/kernels/scalar_cast_temporal.cc#L156 On Wed, Feb 15, 2023 at 4:23 PM Li Jin wrote: > Not sure if this is actually a bug or expected behavior - I filed > https://github.com/apache/arrow/issues/34210

Re: Question about memory usage and type casting using pyarrow Table

2023-02-15 Thread Li Jin
Not sure if this is actually a bug or expected behavior - I filed https://github.com/apache/arrow/issues/34210 On Wed, Feb 15, 2023 at 4:15 PM Li Jin wrote: > Hmm..something feels off here - I did the following experiment on Arrow 11 > and casting timestamp-naive to int64 is much faste

Re: Question about memory usage and type casting using pyarrow Table

2023-02-15 Thread Li Jin
00:00:00.09998,1970-01-01 00:00:00.0]] On Wed, Feb 15, 2023 at 2:52 PM Rok Mihevc wrote: > I'm not sure about (1) but I'm pretty sure for (2) doing a cast of tz-aware > timestamp to tz-naive should be a metadata-only change. > > On Wed, Feb 15, 2023 at

Re: Question about memory usage and type casting using pyarrow Table

2023-02-15 Thread Li Jin
Asking (2) because IIUC this is a metadata operation that could be zero copy but I am not sure if this is actually the case. On Wed, Feb 15, 2023 at 10:17 AM Li Jin wrote: > Hello! > > I have some questions about type casting memory usage with pyarrow Table. > Let's say I hav

Question about memory usage and type casting using pyarrow Table

2023-02-15 Thread Li Jin
Hello! I have some questions about type casting memory usage with pyarrow Table. Let's say I have a pyarrow Table with 100 columns. (1) if I want to cast n columns to a different type (e.g., float to int). What is the smallest memory overhead that I can do? (memory overhead of 1 column, n columns

Re: Build issues (Protobuf internal symbols)

2023-02-13 Thread Li Jin
" In this case though, it's just that we purposely hide symbols by default. If there's a use case, we could unhide this specific symbol (we did it for one other Protobuf symbol) which would let you externally generate and use the headers (as long as you take care not to actually include the generat

Creating dictionary encoded string in C++

2022-11-03 Thread Li Jin
Hello, I am working on converting some internal data sources to Arrow data. One particular set of data we have contains many string columns that can be dictionary-encoded (basically string enums). The current internal C++ API I am using gives me an iterator of "row" objects, for each string col

Re: [ANNOUNCE] New Arrow committer: Will Jones

2022-10-27 Thread Li Jin
congrats! On Thu, Oct 27, 2022 at 9:03 PM Matt Topol wrote: > Congrats Will! > > On Thu, Oct 27, 2022 at 9:02 PM Ian Cook wrote: > > > Congratulations Will! > > > > On Thu, Oct 27, 2022 at 19:56 Sutou Kouhei wrote: > > > > > On behalf of the Arrow PMC, I'm happy to announce that Will Jones > >

[Acero] Error handling in ExecNode

2022-10-18 Thread Li Jin
Hello! I am trying to implement an ExecNode in Acero that receives the input batch, writes the batch to the FlightStreamWriter and then passes the batch to the downstream node. Looking at the API, I am thinking of doing sth like : void InputReceived(ExecNode* input, ExecBatch batch) { # turn

Re: Substrait consumer for custom data sources

2022-10-13 Thread Li Jin
't sound like the correct way, I am happy to do this correctly but someone let me know the correct way :) Li On Thu, Oct 13, 2022 at 2:01 PM Li Jin wrote: > Going back to the default_exec_factory_registry idea, I think ultimately > maybe we want registration API that

Re: Substrait consumer for custom data sources

2022-10-13 Thread Li Jin
dFactory("my_custom_node", MakeMyCustomNode) ... """ On Thu, Oct 13, 2022 at 1:32 PM Li Jin wrote: > Weston - was trying the pyarrow approach you suggested: > > >def custom_source(endpoint): > return pc.Declaration("my_custom_source", create_my_custom_o

Re: Substrait consumer for custom data sources

2022-10-13 Thread Li Jin
object should I return with create_my_custom_options()? Currently I only have a C++ class for my custom option. On Thu, Oct 13, 2022 at 12:58 PM Li Jin wrote: > > I may be assuming here but I think your problem is more that there is > no way to more flexibly describe a source in python and less

Re: Substrait consumer for custom data sources

2022-10-13 Thread Li Jin
ate_my_custom_options()) > > def table_provider(names): > return custom_sources[names[0]] > > pa.substrait.run_query(my_plan, table_provider=table_provider) > ``` > > On Thu, Oct 13, 2022 at 8:24 AM Li Jin wrote: > > > > We did some work around this recently and

Re: Substrait consumer for custom data sources

2022-10-13 Thread Li Jin
r; } """ And then calling `pa.substrat.run_query" should pick up the custom name table provider. Does that sound like a reasonable way to do this? On Tue, Sep 27, 2022 at 1:59 PM Li Jin wrote: > Thanks both. I think NamedTableProvider is close to what I want, and like >

Re: Question about pyarrow.substrait.run_query

2022-10-13 Thread Li Jin
te batches in a queue (just like the sink node) but it is > not handling backpressure. I've created [1] to track this. > > [1] https://issues.apache.org/jira/browse/ARROW-18025 > > On Wed, Oct 12, 2022 at 9:02 AM Li Jin wrote: > > > > Hello! > > > > I have

Question about pyarrow.substrait.run_query

2022-10-12 Thread Li Jin
Hello! I have some questions about how "pyarrow.substrait.run_query" works. Currently run_query returns a record batch reader. Since Acero is a push-based model and the reader is pull-based, I'd assume the reader object somehow accumulates the batches that are pushed to it. And I wonder (1) Does

Re: Pandas backend for Substrait

2022-10-06 Thread Li Jin
Disclaimer: Not ibis-substrait dev here ibis-substrait has a "decompiler"; https://github.com/ibis-project/ibis-substrait/blob/main/ibis_substrait/tests/compiler/test_decompiler.py that takes substrait and returns ibis expression, then you can run ibis expression with ibis's pandas backend: https:

Re: Integration between ibis-substrait and Acero

2022-10-05 Thread Li Jin
name {names}") reader = pa.substrait.run_query(pa.py_buffer(result.SerializeToString()), table_provider) result_table = reader.read_all() self.assertTrue(result_table == test_table_0) First successful run with ibis/substrait/acero - Hooray On Wed, Oct 5, 2

Re: Integration between ibis-substrait and Acero

2022-10-05 Thread Li Jin
PM Will Jones wrote: > Hi Li Jin, > > The original segfault seems to occur because you are passing a Python bytes > object and not a PyArrow Buffer object. You can wrap the bytes object using > pa.py_buffer(): > > pa.substrait.run_query(pa.py_buffer(result_bytes), table_provide

Re: Integration between ibis-substrait and Acero

2022-10-04 Thread Li Jin
For reference, this is the "relations" entry that I was referring to: https://github.com/apache/arrow/blob/master/python/pyarrow/tests/test_substrait.py#L186 On Tue, Oct 4, 2022 at 3:28 PM Li Jin wrote: > So I made some progress with updated code: > > t = ibis.table([

Re: Integration between ibis-substrait and Acero

2022-10-04 Thread Li Jin
ssed" Looking at the plan produced by ibis-substrait, it looks like it doesn't match the expected format of the Acero consumer. In particular, it looks like the plan produced by ibis-substrait doesn't have a "relations" entry - any thoughts on how this can be fixed? (I don't kno

Integration between ibis-substrait and Acero

2022-10-04 Thread Li Jin
Hi, I am testing integration between ibis-substrait and Acero but hit a segmentation fault. I think this might be because the way I am integrating these two libraries is wrong, here is my code: class BasicTests(unittest.TestCase): """Test

Re: Register custom ExecNode factories

2022-09-28 Thread Li Jin
own version of these files to build your Python module separately. > This is where you would add a build flag for pulling in C++ header files > for your Python module, under "python/pyarrow/include", and for making it. > > > Yaron. > &

Re: Substrait consumer for custom data sources

2022-09-27 Thread Li Jin
provide user configurable > dispatching for named tables; > if it doesn't address your use case then we might want to create a JIRA to > extend it. > > On Tue, Sep 27, 2022 at 10:41 AM Li Jin wrote: > > > I did some more digging into this and have some ideas - > > >

Re: Substrait consumer for custom data sources

2022-09-27 Thread Li Jin
is later in favor of a more generic solution. Thoughts? Li On Mon, Sep 26, 2022 at 10:58 AM Li Jin wrote: > Hello! > > I am working on adding a custom data source node in Acero. I have a few > previous threads related to this topic. > > Currently, I am able to register my cu

Substrait consumer for custom data sources

2022-09-26 Thread Li Jin
Hello! I am working on adding a custom data source node in Acero. I have a few previous threads related to this topic. Currently, I am able to register my custom factory method with Acero and create a Custom source node, i.e., I can register and execute this with Acero: MySourceNodeOptions sourc

Re: Register custom ExecNode factories

2022-09-21 Thread Li Jin
.pyx when the python module is loaded. > I don't know cython well enough to know how exactly it triggers the > datasets shared object to load. > > On Tue, Sep 20, 2022 at 11:01 AM Li Jin wrote: > > > > Hi, > > > > Recently I am working on adding a custom da

Re: Correct way to collect results from an Acero query

2022-09-21 Thread Li Jin
> > We could probably also add a DeclarationToReader method in the future. > > [1] https://github.com/apache/arrow/pull/13782 > > On Wed, Sep 21, 2022 at 8:26 AM Li Jin wrote: > > > Hello! > > > > I am testing a custom data source node I added to A

Correct way to collect results from an Acero query

2022-09-21 Thread Li Jin
Hello! I am testing a custom data source node I added to Acero and found myself in need of collecting the results from an Acero query into memory. Searching the codebase, I found "StartAndCollect" is what many of the tests and benchmarks are using, but I am not sure if that is the public API to d

Register custom ExecNode factories

2022-09-20 Thread Li Jin
Hi, Recently I am working on adding a custom data source node to Acero and was pointed to a few examples in the dataset code. If I understand this correctly, the registering of dataset exec node is currently happening when this is loaded: https://github.com/apache/arrow/blob/master/python/pyarrow

Re: Integration between Flight and Acero

2022-09-14 Thread Li Jin
> > > and convert this into a record batch reader. Then it would create one > > > of the nodes that Yaron has contributed and return that. > > > > > > However, it might be nice if "open a connection to the flight > > > endpoint" happened

Re: Integration between Flight and Acero

2022-09-13 Thread Li Jin
cept it. You would need to know the schema when configuring the > SourceNode, but you won't need to derived from SourceNode. > > > Yaron. > ________ > From: Li Jin > Sent: Tuesday, September 13, 2022 3:58 PM > To: dev@arrow.apache.org > Subje

Re: Integration between Flight and Acero

2022-09-13 Thread Li Jin
twork > to get the schema on its own. > > Given the above, I agree with you that when the Acero node is created its > schema would already be known. > > > Yaron. > > From: Li Jin > Sent: Thursday, September 1, 2022 2:49 PM > To: dev

Re: Question on handling API changes when upgrading Pyarrow

2022-09-10 Thread Li Jin
.0-release/ > [3] > https://github.com/apache/arrow/blame/3eb5673597bf67246271b6c9a98e6f812d4e01a7/python/pyarrow/table.pxi#L1991 > [4] > https://github.com/apache/arrow/blob/apache-arrow-7.0.0/python/pyarrow/__init__.py#L368 > > On Fri, Sep 9, 2022 at 10:15 AM Li Jin wro

Re: Question on handling API changes when upgrading Pyarrow

2022-09-09 Thread Li Jin
but just wondering in general where do I look first if I hit this sort of issue in the future. On Fri, Sep 9, 2022 at 12:20 PM Li Jin wrote: > Hi, > > I am trying to update Pyarrow from 7.0 to 9.0 and hit a couple of issues > that I believe are because of some API changes. In par

Question on handling API changes when upgrading Pyarrow

2022-09-09 Thread Li Jin
Hi, I am trying to update Pyarrow from 7.0 to 9.0 and hit a couple of issues that I believe are because of some API changes. In particular, two issues I saw seem to be (1) pyarrow.read_schema is removed (2) pa.Table.to_batches no longer takes a keyword argument (chunksize) What's the best way t

Re: Integration between Flight and Acero

2022-09-01 Thread Li Jin
g with various ways of getting the actual schema, depending on what > exactly your service supports.) Once you have a Dataset, you can create an > ExecPlan and proceed like normal. > > Of course, if you then want to get things into Python, R, Substrait, > etc... that requires s

Integration between Flight and Acero

2022-08-31 Thread Li Jin
Hello! I have recently started to look into integrating Flight RPC with Acero source/sink node. In Flight, the life cycle of a "read" request looks sth like: - User specifies a URL (e.g. my_storage://my_path) and parameter (e.g., begin = "20220101", end = "20220201") - Client issue GetF

Re: [C++] Read Flight data source into Acero

2022-08-18 Thread Li Jin
but just wanted to mention that I am going > to > > > try and figure this out quite a bit in the next week. I can try to > create > > > some relevant cookbook recipes as I plod along. > > > > > > Aldrin Montana > > > Computer Science PhD Student > &
