Re: data-source UDFs

2022-06-03 Thread Li Jin
> At the moment as we are not exposing the execution engine primitives to Python users, are you expecting to expose them by this approach? From our side, these APIs are not directly exposed to the end user, but rather, primitives that allow us to build on top of. The end user would just do sth li

Re: data-source UDFs

2022-06-03 Thread Li Jin
What Yaron is going for is really something similar to a custom data source in Spark ( https://levelup.gitconnected.com/easy-guide-to-create-a-custom-read-data-source-in-apache-spark-3-194afdc9627a) that allows utilizing existing Python APIs that know how to read a data source as a stream of record ba

Re: data-source UDFs

2022-06-03 Thread Li Jin
Actually, "UDF" might be the wrong terminology here - This is more of a "custom Python data source" than "Python user defined functions". (Although under the hood it can probably reuse lots of the UDF logic to execute the custom data source) On Fri, Jun 3, 2022 at

Re: user-defined Python-based data-sources in Arrow

2022-06-22 Thread Li Jin
Yaron, Do you mind also linking the previous mailing list discussion here? On Wed, Jun 22, 2022 at 11:40 AM Yaron Gvili wrote: > Hi, > > I'd like to get the community's feedback about a design proposal > (discussed below) for integrating user-defined Python-based data-sources in > Arrow. This i

What is the lib file for Acero/Arrow compute?

2022-06-23 Thread Li Jin
Hi, I just noticed there is no specific lib file for Acero/Arrow compute when I have BUILD_COMPUTE=ON - is it included in the libarrow.so? Thanks! Li

Re: What is the lib file for Acero/Arrow compute?

2022-06-23 Thread Li Jin
obably want to keep casts > (and perhaps a few other kernels) in the main library. AIUI, we may also > want to split out acero/the "engine" as well (or at least give it its own > CMake flag eventually). > > https://issues.apache.org/jira/browse/ARROW-8891 > > -David > &

Re: accessing Substrait protobuf Python classes from PyArrow

2022-07-05 Thread Li Jin
Yaron, do we need to parse the substrait protobuf in Python so that we can get the UDFs and register them with Pyarrow? On Mon, Jul 4, 2022 at 1:24 PM Yaron Gvili wrote: > This rewriting of the package is basically what I had in mind; the `_ep` > was just to signal a private package, which cannot

Do we have nightly source tar ball

2022-07-07 Thread Li Jin
Hello, I wonder if we have nightly source tarball published somewhere? Li

Re: Do we have nightly source tar ball

2022-07-07 Thread Li Jin
thub.com/apache/arrow/archive/refs/heads/master.tar.gz > [2]: https://github.com/ursacomputing/crossbow/releases > > On Thu, Jul 7, 2022, at 10:39, Li Jin wrote: > > Hello, > > > > I wonder if we have nightly source tarball published somewhere? > > > > Li >

Re: Do we have nightly source tar ball

2022-07-07 Thread Li Jin
and it seems we actually > also upload an sdist there. > > (it could still be more reliable to use HEAD, though, if you want to be > sure to always have the latest. If our nightly release CI is failing, the > index might be outdated for some days) > > Joris > > On Thu, 7 J

Re: Do we have nightly source tar ball

2022-07-07 Thread Li Jin
Thanks! This is very helpful. On Thu, Jul 7, 2022 at 11:33 AM Jacob Wujciak wrote: > The crossbow tarballs do not contain the arrow source, they only contain > the crossbow source (aka a few yaml files). > > On Thu, Jul 7, 2022 at 5:29 PM Li Jin wrote: > > > Thanks bot

Undefined symbol error using pyarrow

2022-07-07 Thread Li Jin
Hello, I am trying to build Arrow/Pyarrow with our internal build system (cmake based) and encounter an error when running pyarrow test: ImportError while importing test module '/home/ljin/vats/add-arrowpython-master/ext/public/python/pyarrow/master/dist/lib/python3.9/pyarrow/tests/test_table.py

Re: Undefined symbol error using pyarrow

2022-07-07 Thread Li Jin
TE enabled, is that the case? > > > > > On 07/07/2022 at 22:16, Li Jin wrote: > > Hello, > > > > I am trying to build Arrow/Pyarrow with our internal build system (cmake > > based) and encounter an error when running pyarrow test: > > > > I

Re: cpp Memory Pool Clarification

2022-07-11 Thread Li Jin
> TableSourceNode wouldn't need to allocate since it runs against memory that's already been allocated. Is the memory "that is already allocated" tracked in any allocators? For an end to end benchmark of "scan - join - write" I think it would make sense to include all arrow memory allocation (if that

Re: cpp Memory Pool Clarification

2022-07-12 Thread Li Jin
your table. This might give you > something to compare/contrast allocation of an individual node with. > > On Mon, Jul 11, 2022 at 2:04 PM Li Jin wrote: > > > > > TableSourceNode wouldn't need to allocate since it runs against memory > > that's already been alloc

[C++] Resources for implementing flight client

2022-07-13 Thread Li Jin
Hello! I am new to flight and want to look into implementing a C++ client for our existing flight-based data service. I don't really know where to start so wonder if some resources/pointers can be shared? Thanks, Li

Re: [C++] Resources for implementing flight client

2022-07-13 Thread Li Jin
apache.org/docs/cpp/flight.html > > -David > > On Wed, Jul 13, 2022, at 15:32, Li Jin wrote: > > Hello! > > > > I am new to flight and want to look into implementing a C++ client for > our > > existing flight-based data service. I don't really know where to s

[C++] Question about substrait dependency in C++

2022-07-18 Thread Li Jin
Hello! I am working on integrating the latest Arrow C++ into our internal build system. Currently I am planning to build the substrait C++ classes independently and provide header locations and .so files to the Arrow CMake file - I wonder if that is a good approach? (We cannot download the substrait tar

Re: [C++] Question about substrait dependency in C++

2022-07-18 Thread Li Jin
t > ninja-debug`, I'm getting a libsubstrait.a in build/debug. I'm not familiar > enough with Arrow's build system to provide more help there. > > Regards, > Jeroen > > On Mon, 18 Jul 2022 at 18:00, Li Jin wrote: > > > Hello! > > > > I am working on

Re: [C++] Question about substrait dependency in C++

2022-07-18 Thread Li Jin
SUBSTRAIT_URL works for both a *.tar.gz file and a repository > > directory. In my experience, there is no need to also set any > > sha256-related setting. > > > > > > Yaron. > > > > From: Li Jin > > Sent: Monda

Re: ExecutionContext, batch ordering clarification

2022-07-19 Thread Li Jin
Thanks Weston, two follow up questions: (1) What is the threading model when passing "executor=nullptr" to "ExecContext"? (Does it only use one thread?) (2) For the file reader, if we want to ensure batches coming out of the reader are ordered but also have parallelism, I'd imagine doing sth like

Re: [C++] Control flow and scheduling in C++ Engine operators / exec nodes

2022-07-22 Thread Li Jin
Hi! Since the scheduler improvement work came up in some recent discussions about how backpressure is handled in Acero, I am curious if there has been any more progress on this since May or any future plans? Thanks, Li On Mon, May 23, 2022 at 10:37 PM Weston Pace wrote: > > About point 2. I h

[C++] Clarifying the behavior of source node and executor

2022-07-25 Thread Li Jin
Hi, Ivan and I are debugging some behavior of the source node this morning and I was hoping to clarify that our understanding is correct. We observed that when using source node with a generator: https://github.com/apache/arrow/blob/66c66d040bbf81a4819b276aee306625dc02837c/cpp/src/arrow/compute/e

Re: [C++] Clarifying the behavior of source node and executor

2022-07-25 Thread Li Jin
Sorry the link to the generator above is wrong - We traced into the code and found it uses BackgroundGenerator: https://github.com/apache/arrow/blob/78fb2edd30b602bd54702896fa78d36ec6fefc8c/cpp/src/arrow/util/async_generator.h#L1581 On Mon, Jul 25, 2022 at 11:07 AM Li Jin wrote: > Hi, >

Re: [C++] Clarifying the behavior of source node and executor

2022-07-25 Thread Li Jin
n how to > obtain such a guarantee. > > > Yaron. > > From: Li Jin > Sent: Monday, July 25, 2022 11:10 AM > To: dev@arrow.apache.org > Subject: Re: [C++] Clarifying the behavior of source node and executor > > Sorry the link to the gen

Re: [C++] Clarifying the behavior of source node and executor

2022-07-25 Thread Li Jin
'll > look into adding this sequential-option to source-node and report back. > > > Yaron. > > From: Li Jin > Sent: Monday, July 25, 2022 11:39 AM > To: dev@arrow.apache.org > Subject: Re: [C++] Clarifying the behavior of source node and ex

Re: [C++] Control flow and scheduling in C++ Engine operators / exec nodes

2022-07-26 Thread Li Jin
't think I can throw out any specific dates but I think it is > safe to say that these issues are important to Voltron Data as well. > > [1] https://issues.apache.org/jira/browse/ARROW-16072 > [2] https://issues.apache.org/jira/browse/ARROW-15732 > [3] https://issues.apache.or

CMake dependencies for arrow flight

2022-07-29 Thread Li Jin
Hi! I saw this error when linking my code against arrow flight and suspect I didn't write my cmake correctly: "error: undefined reference to arrow::flight::Location::Location()" I followed https://arrow.apache.org/docs/cpp/build_system.html#cmake and linked my executable with arrow_shared. Is th

Re: CMake dependencies for arrow flight

2022-07-29 Thread Li Jin
(This is with Arrow 7.0.0) On Fri, Jul 29, 2022 at 3:52 PM Li Jin wrote: > Hi! > > I saw this error when linking my code against arrow flight and suspect I > didn't write my cmake correctly: > > "error: undefined reference to arrow::flight::Location::Loca

Re: CMake dependencies for arrow flight

2022-07-29 Thread Li Jin
1]. You can see a small workaround at [2]. > > [1]: https://issues.apache.org/jira/browse/ARROW-12175 > [2]: > https://github.com/apache/arrow-adbc/blob/41daacca08db041b52b458503e713a80528ba65a/c/drivers/flight_sql/CMakeLists.txt#L28-L31 > > -David > > On Fri, Jul 29, 2022, a

Re: CMake dependencies for arrow flight

2022-07-29 Thread Li Jin
Also, if it is the google re2, is there a minimum version required? Currently my system has re2 from 20201101. On Fri, Jul 29, 2022 at 4:45 PM Li Jin wrote: > Thanks David! > > I used the code in the Flight SQL CMakeLists. Unfortunately I hit another > error, I wonder if you happ

Re: CMake dependencies for arrow flight

2022-07-29 Thread Li Jin
(Nvm the libre2 error, It was my mistake) On Fri, Jul 29, 2022 at 4:49 PM Li Jin wrote: > Also, if it is the google re2, is there a minimum version required? > Currently my system has re2 from 20201101. > > On Fri, Jul 29, 2022 at 4:45 PM Li Jin wrote: > >> Thanks David!

Help with writing/reading from s3

2022-08-01 Thread Li Jin
Hello! We recently updated Arrow to 7.0.0 and hit an error with our old code (details below). I wonder if there is a new way to do this with the current version? import pyarrow import pyarrow.parquet as pq df = pd.DataFrame({"aa": [1, 2, 3], "bb": [1, 2, 3]}) uri = "gs://amp_bucket_liao/tr

Re: Help with writing/reading from s3

2022-08-03 Thread Li Jin
Thanks! Removing the "gs://" prefix indeed fixes it. On Tue, Aug 2, 2022 at 4:01 PM Will Jones wrote: > Hi Li Jin, > > I'm not sure yet what changed, but I believe you can fix that error simply > by omitting the scheme prefix from the URI and just use the path when

Re: Fatal Python error for process exit after opening Pyarrow batch iterator

2022-08-10 Thread Li Jin
Hi - Gently bump this. I suspect this is an upstream issue and wonder if this is a known issue. Is there any other information we can provide? (I think the repro is pretty straightforward but let us know otherwise) On Mon, Aug 8, 2022 at 8:16 PM Alex Libman wrote: > Hi, > > I've hit an issue in

Re: Fatal Python error for process exit after opening Pyarrow batch iterator

2022-08-11 Thread Li Jin
> [1] https://issues.apache.org/jira/browse/ARROW-16072 > [2] https://issues.apache.org/jira/browse/ARROW-15732 > > On Wed, Aug 10, 2022 at 1:15 PM Li Jin wrote: > > > Hi - Gently bump this. I suspect this is an upstream issue and wonder if > > this is a known i

Re: dealing with tester timeout in a CI job

2022-08-17 Thread Li Jin
Yaron, how long do the asof join tests normally take? On Wed, Aug 17, 2022 at 6:13 AM Yaron Gvili wrote: > Sorry, yes, C++. The failed job is > https://github.com/apache/arrow/runs/7839062613?check_suite_focus=true > and it timed out on code I wrote (in a PR, not merged). I'd like to avoid a > time

[C++] Read Flight data source into Acero

2022-08-17 Thread Li Jin
Hi, I have a Flight data source (effectively a Flight::StreamReader) and I'd like to create an Acero source node from it. I wonder if something already exists to do that or if not, perhaps some pointers for me to take a look at? Thanks, Li

Re: [C++] Read Flight data source into Acero

2022-08-17 Thread Li Jin
Correction: I have a flight::FlightStreamReader (not Flight::StreamReader) On Wed, Aug 17, 2022 at 12:12 PM Li Jin wrote: > Hi, > > I have a Flight data source (effectively a Flight::StreamReader) and I'd > like to create an Acero source node from it. I wonder if something al

Re: [C++] Read Flight data source into Acero

2022-08-18 Thread Li Jin
but just wanted to mention that I am going > to > > > try and figure this out quite a bit in the next week. I can try to > create > > > some relevant cookbook recipes as I plod along. > > > > > > Aldrin Montana > > > Computer Science PhD Student > &

Integration between Flight and Acero

2022-08-31 Thread Li Jin
Hello! I have recently started to look into integrating Flight RPC with Acero source/sink node. In Flight, the life cycle of a "read" request looks sth like: - User specifies a URL (e.g. my_storage://my_path) and parameter (e.g., begin = "20220101", end = "20220201") - Client issue GetF

Re: Integration between Flight and Acero

2022-09-01 Thread Li Jin
g with various ways of getting the actual schema, depending on what > exactly your service supports.) Once you have a Dataset, you can create an > ExecPlan and proceed like normal. > > Of course, if you then want to get things into Python, R, Substrait, > etc... that requires s

Question on handling API changes when upgrading Pyarrow

2022-09-09 Thread Li Jin
Hi, I am trying to update Pyarrow from 7.0 to 9.0 and hit a couple of issues that I believe are because of some API changes. In particular, the two issues I saw seem to be: (1) pyarrow.read_schema is removed (2) pa.Table.to_batches no longer takes a keyword argument (chunksize) What's the best way t

Re: Question on handling API changes when upgrading Pyarrow

2022-09-09 Thread Li Jin
but just wondering in general where do I look first if I hit this sort of issue in the future. On Fri, Sep 9, 2022 at 12:20 PM Li Jin wrote: > Hi, > > I am trying to update Pyarrow from 7.0 to 9.0 and hit a couple of issues > that I believe are because of some API changes. In par

Re: Question on handling API changes when upgrading Pyarrow

2022-09-10 Thread Li Jin
.0-release/ > [3] > https://github.com/apache/arrow/blame/3eb5673597bf67246271b6c9a98e6f812d4e01a7/python/pyarrow/table.pxi#L1991 > [4] > https://github.com/apache/arrow/blob/apache-arrow-7.0.0/python/pyarrow/__init__.py#L368 > > On Fri, Sep 9, 2022 at 10:15 AM Li Jin wro

Re: Integration between Flight and Acero

2022-09-13 Thread Li Jin
twork > to get the schema on its own. > > Given the above, I agree with you that when the Acero node is created its > schema would already be known. > > > Yaron. > > From: Li Jin > Sent: Thursday, September 1, 2022 2:49 PM > To: dev

Re: Integration between Flight and Acero

2022-09-13 Thread Li Jin
cept it. You would need to know the schema when configuring the > SourceNode, but you won't need to derive from SourceNode. > > > Yaron. > > From: Li Jin > Sent: Tuesday, September 13, 2022 3:58 PM > To: dev@arrow.apache.org > Subje

Re: Integration between Flight and Acero

2022-09-14 Thread Li Jin
; > and convert this into a record batch reader. Then it would create one > > > of the node's that Yaron has contributed and return that. > > > > > > However, it might be nice if "open a connection to the flight > > > endpoint" happened

Register custom ExecNode factories

2022-09-20 Thread Li Jin
Hi, Recently I am working on adding a custom data source node to Acero and was pointed to a few examples in the dataset code. If I understand this correctly, the registration of the dataset exec node currently happens when this is loaded: https://github.com/apache/arrow/blob/master/python/pyarrow

Correct way to collect results from an Acero query

2022-09-21 Thread Li Jin
Hello! I am testing a custom data source node I added to Acero and found myself in need of collecting the results from an Acero query into memory. Searching the codebase, I found "StartAndCollect" is what many of the tests and benchmarks are using, but I am not sure if that is the public API to d

Re: Correct way to collect results from an Acero query

2022-09-21 Thread Li Jin
> > We could probably also add a DeclarationToReader method in the future. > > [1] https://github.com/apache/arrow/pull/13782 > > On Wed, Sep 21, 2022 at 8:26 AM Li Jin wrote: > > > Hello! > > > > I am testing a custom data source node I added to A

Re: Register custom ExecNode factories

2022-09-21 Thread Li Jin
.pyx when the python module is loaded. > I don't know cython well enough to know how exactly it triggers the > datasets shared object to load. > > On Tue, Sep 20, 2022 at 11:01 AM Li Jin wrote: > > > > Hi, > > > > Recently I am working on adding a custom da

Substrait consumer for custom data sources

2022-09-26 Thread Li Jin
Hello! I am working on adding a custom data source node in Acero. I have a few previous threads related to this topic. Currently, I am able to register my custom factory method with Acero and create a Custom source node, i.e., I can register and execute this with Acero: MySourceNodeOptions sourc

Re: Substrait consumer for custom data sources

2022-09-27 Thread Li Jin
is later in favor of a more generic solution. Thoughts? Li On Mon, Sep 26, 2022 at 10:58 AM Li Jin wrote: > Hello! > > I am working on adding a custom data source node in Acero. I have a few > previous threads related to this topic. > > Currently, I am able to register my cu

Re: Substrait consumer for custom data sources

2022-09-27 Thread Li Jin
provide user configurable > dispatching for named tables; > if it doesn't address your use case then we might want to create a JIRA to > extend it. > > On Tue, Sep 27, 2022 at 10:41 AM Li Jin wrote: > > > I did some more digging into this and have some ideas - > > >

Re: Register custom ExecNode factories

2022-09-28 Thread Li Jin
own version of these files to build your Python module separately. > This is where you would add a build flag for pulling in C++ header files > for your Python module, under "python/pyarrow/include", and for making it. > > > Yaron. > &

Integration between ibis-substrait and Acero

2022-10-04 Thread Li Jin
Hi, I am testing integration between ibis-substrait and Acero but hit a segmentation fault. I think this might be because the way I am integrating these two libraries is wrong. Here is my code: class BasicTests(unittest.TestCase): """Test

Re: Integration between ibis-substrait and Acero

2022-10-04 Thread Li Jin
ssed" Looking at the plan produced by ibis-substrait, it looks like it doesn't match the expected format of the Acero consumer. In particular, it looks like the plan produced by ibis-substrait doesn't have a "relations" entry - any thoughts on how this can be fixed? (I don't kno

Re: Integration between ibis-substrait and Acero

2022-10-04 Thread Li Jin
For reference, this is the "relations" entry that I was referring to: https://github.com/apache/arrow/blob/master/python/pyarrow/tests/test_substrait.py#L186 On Tue, Oct 4, 2022 at 3:28 PM Li Jin wrote: > So I made some progress with updated code: > > t = ibis.table([

Re: Integration between ibis-substrait and Acero

2022-10-05 Thread Li Jin
PM Will Jones wrote: > Hi Li Jin, > > The original segfault seems to occur because you are passing a Python bytes > object and not a PyArrow Buffer object. You can wrap the bytes object using > pa.py_buffer(): > > pa.substrait.run_query(pa.py_buffer(result_bytes), table_provide

Re: Integration between ibis-substrait and Acero

2022-10-05 Thread Li Jin
name {names}") reader = pa.substrait.run_query(pa.py_buffer(result.SerializeToString()), table_provider) result_table = reader.read_all() self.assertTrue(result_table == test_table_0) First successful run with ibis/substrait/acero - Hooray On Wed, Oct 5, 2

Re: Pandas backend for Substrait

2022-10-06 Thread Li Jin
Disclaimer: Not ibis-substrait dev here ibis-substrait has a "decompiler"; https://github.com/ibis-project/ibis-substrait/blob/main/ibis_substrait/tests/compiler/test_decompiler.py that takes substrait and returns ibis expression, then you can run ibis expression with ibis's pandas backend: https:

Question about pyarrow.substrait.run_query

2022-10-12 Thread Li Jin
Hello! I have some questions about how "pyarrow.substrait.run_query" works. Currently run_query returns a record batch reader. Since Acero is a push-based model and the reader is pull-based, I'd assume the reader object somehow accumulates the batches that are pushed to it. And I wonder (1) Does

Re: Question about pyarrow.substrait.run_query

2022-10-13 Thread Li Jin
te batches in a queue (just like the sink node) but it is > not handling backpressure. I've created [1] to track this. > > [1] https://issues.apache.org/jira/browse/ARROW-18025 > > On Wed, Oct 12, 2022 at 9:02 AM Li Jin wrote: > > > > Hello! > > > > I have

Re: Substrait consumer for custom data sources

2022-10-13 Thread Li Jin
r; } """ And then calling `pa.substrait.run_query` should pick up the custom named table provider. Does that sound like a reasonable way to do this? On Tue, Sep 27, 2022 at 1:59 PM Li Jin wrote: > Thanks both. I think NamedTableProvider is close to what I want, and like >

Re: Substrait consumer for custom data sources

2022-10-13 Thread Li Jin
ate_my_custom_options()) > > def table_provider(names): > return custom_sources[names[0]] > > pa.substrait.run_query(my_plan, table_provider=table_provider) > ``` > > On Thu, Oct 13, 2022 at 8:24 AM Li Jin wrote: > > > > We did some work around this recently and

Re: Substrait consumer for custom data sources

2022-10-13 Thread Li Jin
object should I return with create_my_custom_options()? Currently I only have a C++ class for my custom option. On Thu, Oct 13, 2022 at 12:58 PM Li Jin wrote: > > I may be assuming here but I think your problem is more that there is > no way to more flexibly describe a source in python and less

Re: Substrait consumer for custom data sources

2022-10-13 Thread Li Jin
dFactory("my_custom_node", MakeMyCustomNode) ... """ On Thu, Oct 13, 2022 at 1:32 PM Li Jin wrote: > Weston - was trying the pyarrow approach you suggested: > > >def custom_source(endpoint): > return pc.Declaration("my_custom_source", create_my_custom_o

Re: Substrait consumer for custom data sources

2022-10-13 Thread Li Jin
't sound like the correct way, I am happy to do this correctly but someone let me know the correct way :) Li On Thu, Oct 13, 2022 at 2:01 PM Li Jin wrote: > Going back to the default_exec_factory_registry idea, I think ultimately > maybe we want registration API that

[Acero] Error handling in ExecNode

2022-10-18 Thread Li Jin
Hello! I am trying to implement an ExecNode in Acero that receives the input batch, writes the batch to the FlightStreamWriter and then passes the batch to the downstream node. Looking at the API, I am thinking of doing sth like : void InputReceived(ExecNode* input, ExecBatch batch) { # turn

Re: [ANNOUNCE] New Arrow committer: Will Jones

2022-10-27 Thread Li Jin
congrats! On Thu, Oct 27, 2022 at 9:03 PM Matt Topol wrote: > Congrats Will! > > On Thu, Oct 27, 2022 at 9:02 PM Ian Cook wrote: > > > Congratulations Will! > > > > On Thu, Oct 27, 2022 at 19:56 Sutou Kouhei wrote: > > > > > On behalf of the Arrow PMC, I'm happy to announce that Will Jones > >

Creating dictionary encoded string in C++

2022-11-03 Thread Li Jin
Hello, I am working on converting some internal data sources to Arrow data. One particular set of data we have contains many string columns that can be dictionary-encoded (basically string enums). The current internal C++ API I am using gives me an iterator of "row" objects; for each string col

Re: Build issues (Protobuf internal symbols)

2023-02-13 Thread Li Jin
" In this case though, it's just that we purposely hide symbols by default. If there's a use case, we could unhide this specific symbol (we did it for one other Protobuf symbol) which would let you externally generate and use the headers (as long as you take care not to actually include the generat

Question about memory usage and type casting using pyarrow Table

2023-02-15 Thread Li Jin
Hello! I have some questions about type casting memory usage with pyarrow Table. Let's say I have a pyarrow Table with 100 columns. (1) if I want to cast n columns to a different type (e.g., float to int). What is the smallest memory overhead that I can do? (memory overhead of 1 column, n columns

Re: Question about memory usage and type casting using pyarrow Table

2023-02-15 Thread Li Jin
Asking (2) because IIUC this is a metadata operation that could be zero copy but I am not sure if this is actually the case. On Wed, Feb 15, 2023 at 10:17 AM Li Jin wrote: > Hello! > > I have some questions about type casting memory usage with pyarrow Table. > Let's say I hav

Re: Question about memory usage and type casting using pyarrow Table

2023-02-15 Thread Li Jin
00:00:00.09998,1970-01-01 00:00:00.0]] On Wed, Feb 15, 2023 at 2:52 PM Rok Mihevc wrote: > I'm not sure about (1) but I'm pretty sure for (2) doing a cast of tz-aware > timestamp to tz-naive should be a metadata-only change. > > On Wed, Feb 15, 2023 at

Re: Question about memory usage and type casting using pyarrow Table

2023-02-15 Thread Li Jin
Not sure if this is actually a bug or expected behavior - I filed https://github.com/apache/arrow/issues/34210 On Wed, Feb 15, 2023 at 4:15 PM Li Jin wrote: > Hmm..something feels off here - I did the following experiment on Arrow 11 > and casting timestamp-naive to int64 is much faste

Re: Question about memory usage and type casting using pyarrow Table

2023-02-15 Thread Li Jin
Oh found this comment: https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/kernels/scalar_cast_temporal.cc#L156 On Wed, Feb 15, 2023 at 4:23 PM Li Jin wrote: > Not sure if this is actually a bug or expected behavior - I filed > https://github.com/apache/arrow/issues/34210

Re: Question about memory usage and type casting using pyarrow Table

2023-02-15 Thread Li Jin
he array is timezone aware. > > On Wed, Feb 15, 2023 at 10:37 PM Li Jin wrote: > > > Oh found this comment: > > > > > https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/kernels/scalar_cast_temporal.cc#L156 > > > > > > > > On Wed, Feb

Re: testing of back-pressure

2023-02-16 Thread Li Jin
Thanks Weston for the information. On Thu, Feb 16, 2023 at 1:32 PM Weston Pace wrote: > There is a little bit at the end-to-end level. One goal is to be able to > repartition a very large dataset. This means we read from something bigger > than memory and then write to it. This workflow is te

Timestamp unit in Substrait and Arrow

2023-03-09 Thread Li Jin
Hi, I recently came across some limitations in expressing timestamp type with Substrait in the Acero substrait consumer and am curious to hear what people's thoughts are. The particular issue that I have is when specifying timestamp type in substrait, the unit is "microseconds" and there is no wa

Re: [ANNOUNCE] New Arrow PMC member: Will Jones

2023-03-13 Thread Li Jin
Congratulations Will! On Mon, Mar 13, 2023 at 3:27 PM Bryce Mecum wrote: > Congratulations, Will! >

Re: Timestamp unit in Substrait and Arrow

2023-03-14 Thread Li Jin
rk > here will be pretty easy. The trickier part might be adapting your > producer (Ibis?) > > On Thu, Mar 9, 2023 at 9:43 AM Li Jin wrote: > > > Hi, > > > > I recently came across some limitations in expressing timestamp type with > > Substrait in the Ace

Re: [DISCUSS] Acero roadmap / philosophy

2023-03-14 Thread Li Jin
Late to the party. Thanks Weston for sharing the thoughts around Acero. We are actually a pretty heavy Acero user right now and are trying to take part in Acero maintenance and development. Internally we are using Acero for a time series streaming data processing system. I would +1 on many of Wes

Stacktrace from Arrow status?

2023-03-21 Thread Li Jin
Hi, This might be a dumb question but when Arrow code raises an invalid status, I observe that it usually pops up to the user without stack information. I wonder if there are any tricks to show where the invalid status is coming from? Thanks, Li

Re: Stacktrace from Arrow status?

2023-03-24 Thread Li Jin
a rough > stack trace (IIRC, if a function returns the status without using one of > the macros, it won't add a line to the trace). > > [1]: > https://github.com/apache/arrow/blob/1ba4425fab35d572132cb30eee6087a7dca89853/cpp/cmake_modules/DefineOptions.cmake#L608-L609 > > On

Zero copy cast kernels

2023-03-24 Thread Li Jin
Hello, I recently found myself casting an int64 (nanos from epoch) into a nano timestamp column with the C++ cast kernel (via Acero). I expect this to be zero copy but I wonder if there is a way to check which casts are zero copy and which are not? Li

Re: Zero copy cast kernels

2023-03-28 Thread Li Jin
Thanks Rok! The original question was asking for a way to "verify if a cast is zero copy by reading source code / documentation", not "verify if a cast is zero copy programmatically", but I noticed by reading the test file that int64 to micro is indeed zero copy and I expect nanos to be the same https:

Re: Stacktrace from Arrow status?

2023-04-04 Thread Li Jin
his, std::move(batch)) /home/icexelloss/workspace/arrow/cpp/src/arrow/acero/hash_aggregate_test.cc:271 start_and_collect.MoveResult() ``` Is this because of the ARROW_EXTRA_ERROR_CONTEXT option? On Fri, Mar 24, 2023 at 12:04 PM Li Jin wrote: > Thanks David! > > On Tue, Mar 21, 2023 at 6:32

Re: Stacktrace from Arrow status?

2023-04-04 Thread Li Jin
Thanks David! On Tue, Apr 4, 2023 at 4:58 PM David Li wrote: > Yes, that's what the ARROW_EXTRA_ERROR_CONTEXT option does. > > On Tue, Apr 4, 2023, at 11:13, Li Jin wrote: > > Picking up this conversation again, I noticed when I hit an error in > > test I >

Github command to rerun CI checks?

2023-04-17 Thread Li Jin
Hi, Is there a github command to rerun CI checks? (instead of pushing a new commit?) Thanks, Li

Re: Github command to rerun CI checks?

2023-04-17 Thread Li Jin
UI. If you want to avoid having > to add small changes to be able to commit you can use empty commits via > '--allow-empty'. > > On Mon, Apr 17, 2023 at 5:25 PM Li Jin wrote: > > > Hi, > > > > Is there a github command to rerun CI checks? (instead of pushing a new > > commit?) > > > > Thanks, > > Li > > >

Re: Github command to rerun CI checks?

2023-04-18 Thread Li Jin
> > The UI was recently updated: > > > > > https://docs.github.com/en/actions/managing-workflow-runs/re-running-workflows-and-jobs#re-running-failed-jobs-in-a-workflow > > > > On Mon, Apr 17, 2023 at 7:57 PM Li Jin wrote: > > >> Thanks!

Re: Github command to rerun CI checks?

2023-04-18 Thread Li Jin
r doing that, so you > should be able to give that a try. > > We don't have a way of running PR checks as we do with the crossbow > command. We could investigate if there is a way to do it via API. > > Thanks, > Raúl > > On Tue, Apr 18, 2023 at 14:47, Li Jin () >

Converting Pandas DataFrame <-> Struct Array?

2023-06-09 Thread Li Jin
Hello, I am looking for the best ways for converting Pandas DataFrame <-> Struct Array. Currently I have: pa.RecordBatch.from_pandas(df).to_struct_array() and pa.RecordBatch.from_struct_array(s_array).to_pandas() - I wonder if there is a direct way to go from DataFrame <-> Struct Array withou

Re: Converting Pandas DataFrame <-> Struct Array?

2023-06-12 Thread Li Jin
Gentle bump. Not a big deal if I need to use the API above to do so, but bump in case someone has a better way. On Fri, Jun 9, 2023 at 4:34 PM Li Jin wrote: > Hello, > > I am looking for the best ways for converting Pandas DataFrame <-> Struct > Array.


Re: Converting Pandas DataFrame <-> Struct Array?

2023-06-13 Thread Li Jin
dtype(df.dtypes[col])) for col in > > df.columns] > > pa_type = pa.struct(fields) > > pa.array(df.itertuples(index=False, type=pa_type) > > > > But this seems like a classic XY problem. What is the root issue you're > > trying to solve? Why avoid RecordBatch?

Group rows in a stream of record batches by group id?

2023-06-13 Thread Li Jin
Hi, I am trying to write a function that takes a stream of record batches (where the last column is group id), and produces k record batches, where record batches k_i contain all the rows with group id == i. Pseudocode is sth like: def group_rows(batches, k) -> array[RecordBatch] { builder

Re: Group rows in a stream of record batches by group id?

2023-06-13 Thread Li Jin
and I'm maybe a little uncertain what > the difference is between this ask and the capabilities added in [1]. > > [1] https://github.com/apache/arrow/pull/35514 > > On Tue, Jun 13, 2023 at 8:23 AM Li Jin wrote: > > > Hi, > > > > I am trying to write a funct

Re: Group rows in a stream of record batches by group id?

2023-06-13 Thread Li Jin
(Admittedly, PR title of [1] doesn't reflect that only the scalar aggregate UDF is implemented and not the hash one - that is an oversight on my part - sorry) On Tue, Jun 13, 2023 at 3:51 PM Li Jin wrote: > Thanks Weston. > > I think I found what you pointed out to me before whi
