Sorry for missing this email. I volunteer as well.
(I have been working with and building Arrow-based data processing
systems since 2017. Perhaps I can offer some perspective from use
cases beyond traditional SQL systems, e.g., streaming, time
series, ML, numerical computation, etc.)
On Su
Dear Arrow Devs,
I wonder if there is a nice way to do function chaining / math formulas
with Arrow compute (either Python or C++)?
To give an example, let's say I have three arrays a, x and y and want to
compute:
x * (1 - a) + y * a
Right now I can do this in pyarrow but it is pretty hard to read:
f
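For reference, a minimal pyarrow.compute sketch of the nested-call version
being described (assuming a, x and y are pyarrow arrays; nothing here is
from the truncated snippet above):

```python
import pyarrow as pa
import pyarrow.compute as pc

a = pa.array([0.25, 0.5, 0.75])
x = pa.array([1.0, 2.0, 3.0])
y = pa.array([10.0, 20.0, 30.0])

# x * (1 - a) + y * a, spelled out with nested compute calls
result = pc.add(pc.multiply(x, pc.subtract(1.0, a)), pc.multiply(y, a))
```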
Congrats!
On Thu, Feb 6, 2025 at 2:52 AM wish maple wrote:
> Congrats!
>
> Best,
> Xuwei Fu
>
> Raúl Cumplido wrote on Thu, Feb 6, 2025 at 15:47:
>
> > Congrats Bryce!
> >
> > On Thu, Feb 6, 2025 at 6:22, Weston Pace wrote:
> >
> > > Congrats Bryce!
> > >
> > > On Wed, Feb 5, 2025 at 8:35 PM Saurabh Singh
s to take
> the lock.
>
> Can you open a GH issue and we can follow up there?
>
> Regards
>
> Antoine.
>
>
> On 23/05/2024 at 21:23, Li Jin wrote:
> > Hello,
> >
> > I am seeing a deadlock when destructing an ObjectOutputStream. I have
> > attached
Hello,
I am seeing a deadlock when destructing an ObjectOutputStream. I have
attached the stack trace.
I did some debugging and found that the issue seems to be that the mutex in
question is already held by this thread (I checked the __owner field in the
pthread_mutex_t which points to the hangin
> 2.6, which contains nanosecond support.
> It was released in Arrow v13.
>
> [1]
>
> https://github.com/apache/arrow/blob/e198f309c577de9a265c04af2bc4644c33f54375/python/pyarrow/parquet/core.py#L953
>
> [2] https://github.com/apache/arrow/pull/36137
>
> On Wed, Feb 21, 20
“Exponentially exposed” -> “potentially exposed”
On Wed, Feb 21, 2024 at 4:13 PM Li Jin wrote:
> Thanks - since we don’t control all the invocation of pq.write_table, I
> wonder if there is some configuration for the “default” behavior?
>
> Also I wonder if there are other API
> BR
>
> J
>
>
> On Wed, Feb 21, 2024 at 21:44, Li Jin wrote:
>
> > Hi,
> >
> > My colleague has informed me that during the Arrow 12->15 upgrade, he
> found
> > that writing a pandas Dataframe with datetime64[ns] to parquet will
> result
> >
Hi,
My colleague has informed me that during the Arrow 12->15 upgrade, he found
that writing a pandas DataFrame with datetime64[ns] to parquet will result
in nanosecond metadata and nanosecond values.
I wonder if this is configurable back to the old behavior so we can
enable “nanosecond in p
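A hedged sketch of keeping the old microsecond behavior at the write site;
coerce_timestamps and allow_truncated_timestamps are existing
pyarrow.parquet options, though whether they exactly reproduce the
pre-Arrow-13 defaults is an assumption:

```python
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"ts": pa.array([0], type=pa.timestamp("ns"))})
# Coerce nanosecond values back to microseconds on write,
# as older Arrow releases did by default
pq.write_table(table, "out.parquet",
               coerce_timestamps="us",
               allow_truncated_timestamps=True)
```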
Congrats Andy!
On Tue, Nov 28, 2023 at 3:25 PM Weston Pace wrote:
> Congrats Andy!
>
> On Mon, Nov 27, 2023, 7:31 PM wish maple wrote:
>
> > Congrats Andy!
> >
> > Best,
> > Xuwei Fu
> >
> > Andrew Lamb wrote on Mon, Nov 27, 2023 at 20:47:
> >
> > > I am pleased to announce that the Arrow Project has a ne
>
> Best,
> Xuwei Fu
>
> [1] https://github.com/apache/arrow/blob/main/cpp/src/parquet/encoding.cc
> [2]
> https://github.com/apache/arrow/blob/main/cpp/src/parquet/column_reader.cc
>
> Li Jin wrote on Sat, Nov 18, 2023 at 05:27:
>
> > Hi,
> >
> > I am recentl
> https://github.com/apache/arrow/blob/main/cpp/src/parquet/arrow/reader.cc#L107
> [2]
> https://github.com/apache/arrow/blob/main/cpp/src/parquet/arrow/reader_internal.cc#L345
>
> On Fri, Nov 17, 2023 at 12:27 PM Li Jin wrote:
> >
> > Hi,
> >
> > I am recentl
Hi,
I am recently investigating a null/nan issue with Parquet and Arrow and
wonder if someone can give me a pointer to the code that decodes a Parquet
row group into Arrow float/double arrays?
Thanks,
Li
at 10:07 AM Li Jin wrote:
> Update:
>
> I have done a memory profiling and the result seems to suggest a memory
> leak. I have opened an issue to further discuss this:
> https://github.com/apache/arrow/issues/37630
>
>
> On Fri, Sep 8, 2023 at 10:04 AM Li Jin wrote:
>
Update:
I have done a memory profiling and the result seems to suggest a memory
leak. I have opened an issue to further discuss this:
https://github.com/apache/arrow/issues/37630
On Fri, Sep 8, 2023 at 10:04 AM Li Jin wrote:
> Update:
>
> I have done a memory profiling and the result
On Wed, Sep 6, 2023 at 4:35 PM Li Jin wrote:
> Also attaching my experiment code just in case:
> https://gist.github.com/icexelloss/88195de046962e1d043c99d96e1b8b43
>
> On Wed, Sep 6, 2023 at 4:29 PM Li Jin wrote:
>
>> Reporting back with some new findings.
>>
>>
Also attaching my experiment code just in case:
https://gist.github.com/icexelloss/88195de046962e1d043c99d96e1b8b43
On Wed, Sep 6, 2023 at 4:29 PM Li Jin wrote:
> Reporting back with some new findings.
>
> Re Felipe and Antoine:
> I tried with both of Antoine's suggestions (swa
issues.
Re Xuwei:
Thanks for the tips. I am gonna try to memorize this profile next and see
what I can find.
I am gonna keep looking into this but again, any ideas / suggestions are
appreciated (and thanks for all the help so far!)
Li
On Wed, Sep 6, 2023 at 1:59 PM Li Jin wrote:
> T
Another sign this isn't a leak, just the allocator reaching a level of
> > memory commitment that it doesn't feel like undoing.
> >
> > --
> > Felipe
> >
> > On Wed, Sep 6, 2023 at 12:56 PM Li Jin wrote:
> >
> > > Hello,
> > >
> > > I have
In Parquet, if non-buffered read is enabled, then when reading a column the
> whole ColumnChunk is read.
> Otherwise it does a "buffered" read, sized by the buffer size.
>
> I may have forgotten some places. You can try to check that.
>
> Best
> Xuwei Fu
>
> Li Jin
37139
> [3] https://github.com/apache/arrow/issues/36587
> [4] https://github.com/apache/arrow/issues/37136
>
> Li Jin wrote on Wed, Sep 6, 2023 at 23:56:
>
> > Hello,
> >
> > I have been testing "What is the max rss needed to scan through ~100G of
> > data in a parquet
Hello,
I have been testing "What is the max rss needed to scan through ~100G of
data in a parquet stored in gcs using Arrow C++".
The current answer is about ~6G of memory which seems a bit high so I
looked into it. What I observed during the process led me to think that
there are some potential
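For context, a minimal sketch of a bounded-memory scan with pyarrow.dataset;
the path is a placeholder and the batch size is illustrative, not a
recommendation from this thread:

```python
import pyarrow.dataset as ds

dataset = ds.dataset("path/to/parquet_dir", format="parquet")  # placeholder
# Stream batches instead of materializing the whole table, so only a
# bounded window of data is resident at any time
for batch in dataset.scanner(batch_size=64 * 1024).to_batches():
    pass  # process each batch here
```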
Although - I am curious if there are any downsides to using `self_destruct`?
On Thu, Aug 31, 2023 at 1:05 PM Li Jin wrote:
> Ah I see - thanks for the explanation. self_destruct probably won't
> benefit in my case then. (The pa.Array here is a slice from another batch
> so there
> each array is actually backed by its own memory allocations (which right
> now would generally mean copying data up front!).
>
> On Thu, Aug 31, 2023, at 11:11, Li Jin wrote:
> > Hi,
> >
> > I am working on some code where I have a list of pa.Arrays and I am
> > cr
Hi,
I am working on some code where I have a list of pa.Arrays and I am
creating a pandas.DataFrame from it. I also want to set the index of the
pd.DataFrame to be the first Array in the list.
Currently I am doing something like:
"
df = pa.Table.from_arrays(arrs, names=input_names).to_pandas()
df.set_i
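A minimal sketch of the two steps together, with the self_destruct option
discussed above; self_destruct is a real (best-effort) to_pandas option,
but as noted it only helps when each column owns its own allocation:

```python
import pyarrow as pa

arrs = [pa.array([1, 2, 3]), pa.array([4.0, 5.0, 6.0])]  # example inputs
input_names = ["idx", "val"]

table = pa.Table.from_arrays(arrs, names=input_names)
# split_blocks + self_destruct can reduce peak memory during conversion
df = table.to_pandas(split_blocks=True, self_destruct=True)
df = df.set_index(input_names[0])
```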
23 at 23:20, Ian Cook wrote:
> > Li,
> >
> > Here's a standalone C++ example that constructs a Table and executes
> > an Acero ExecPlan to sort it:
> > https://gist.github.com/ianmcook/2aa9aa82e61c3ea4405450b93cf80fbc
> >
> > Ian
> >
> > O
Hi,
I am writing some C++ tests and found myself in need of a C++ function to
sort an arrow Table. Before I go and implement one myself, I wonder if
there is already a function that does that? (I searched the docs but didn't
find one.)
There is a function in Acero that can do it but I didn't find
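For reference, at the pyarrow level this is a one-liner; a C++ version
would go through the sort_indices + take kernels or the Acero plan in Ian's
gist above. A sketch assuming a table with a column "a":

```python
import pyarrow as pa
import pyarrow.compute as pc

table = pa.table({"a": [3, 1, 2]})
# The convenience method...
sorted_table = table.sort_by([("a", "ascending")])
# ...or the explicit kernels, which mirror what the C++ API exposes
indices = pc.sort_indices(table, sort_keys=[("a", "ascending")])
sorted_table = pc.take(table, indices)
```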
>>> schema = pa.schema([pa.field("points", pa.struct([pa.field("x",
> pa.float64()), pa.field("y", pa.float64())]))])
> >>> expr = pc.field(("points", "x"))
> >>> expr.to_substrait(schema)
> is_mutable=False
Hi,
I am recently trying to do
(1) assign a struct-type column s
(2) flatten the struct column (by assigning v1=s[v1], v2=s[v2] and dropping
the s column)
via Substrait and Acero.
However, I ran into the problem where I don't know the proper substrait
message to encode this (for (2))
Normally, if I s
I/O call (which under the hood is usually implemented by
> submitting something to the I/O executor).
>
> On Tue, Jul 25, 2023 at 2:56 PM Li Jin wrote:
>
> > Hi,
> >
> > I am reading Acero and got confused about the use of
> > QueryContext::scheduler() and Q
Hi,
I am reading Acero and got confused about the use of
QueryContext::scheduler() and QueryContext::async_scheduler(). So I have a
couple of questions:
(1) What are the different purposes of these two?
(2) Does scheduler/async_scheduler own any threads inside their respective
classes or do they
ever, I don't know whether nanoarrow
> supports it.
>
> Best,
> Xuwei Fu
>
> [1] https://lists.apache.org/thread/027g366yr3m03hwtpst6sr58b3trwhsm
> [2] https://github.com/apache/arrow/pull/36137
>
> On 2023/07/14 13:25:22 Li Jin wrote:
> > Hi,
> >
>
Hi,
Recently I found myself in need of nanosecond-granularity timestamps.
IIUC this is something supported in newer versions of parquet (2.6
perhaps)? I wonder what the state of that is in Arrow and parquet cpp?
Thanks,
Li
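A hedged sketch of the nanosecond round trip from pyarrow once support
landed; version="2.6" is the relevant pq.write_table option per the reply
above:

```python
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"ts": pa.array([1], type=pa.timestamp("ns"))})
# Parquet format version 2.6 adds NANOS, so values round-trip losslessly
pq.write_table(table, "nanos.parquet", version="2.6")
assert pq.read_table("nanos.parquet")["ts"].type == pa.timestamp("ns")
```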
>
> Acero does not currently handle more than one grouping set.
>
>
> [1] https://docs.snowflake.com/en/sql-reference/constructs/group-by-rollup
>
> On Mon, Jul 10, 2023 at 2:22 PM Li Jin wrote:
>
> > Hi,
> >
> > I am looking at the substrait
Hi,
I am looking at the substrait protobuf for AggregateRel as well the Acero
substrait consumer code:
https://github.com/apache/arrow/blob/main/cpp/src/arrow/engine/substrait/relation_internal.cc#L851
https://github.com/substrait-io/substrait/blob/main/proto/substrait/algebra.proto#L209
Looks l
ception` in the codebase, you'll find that there
> are a couple of places where we turn it into a Status already.
>
> Regards
>
> Antoine.
>
>
> On 29/06/2023 at 16:20, Li Jin wrote:
> > Hi,
> >
> > IIUC, most of the Arrow C++ code doesn't use ex
Hi,
IIUC, most of the Arrow C++ code doesn't use exceptions. My question is:
are there some Arrow utilities / macros that wrap function/code that might
throw an exception and turn it into code that returns an arrow error
Status?
Thanks!
Li
std::vector<std::shared_ptr<Scalar>> scalars = {MakeScalar(1),
> MakeScalar(2)};
>
> ARROW_ASSIGN_OR_RAISE(std::unique_ptr<ArrayBuilder> builder,
> MakeBuilder(type));
> ARROW_RETURN_NOT_OK(builder->AppendScalars(scalars));
> ARROW_ASSIGN_OR_RAISE(auto arr, builder->Finish());
> ```
>
> Best,
> Jin
>
>
> On Fri, Jun 16, 2023 at 5:23
Hi,
I find myself in need of a function to turn a vector of Scalars into an
Array of the same datatype. The data type is known at runtime, e.g.
shared_ptr<Array> concat_scalars(vector<shared_ptr<Scalar>> values,
shared_ptr<DataType> type);
I wonder if I need to use something like Scalar::Accept(ScalarVisitor*) or
is there an easier/bett
(Admittedly, the PR title of [1] doesn't reflect that only the scalar aggregate
UDF is implemented and not the hash one - that is an oversight on my part -
sorry)
On Tue, Jun 13, 2023 at 3:51 PM Li Jin wrote:
> Thanks Weston.
>
> I think I found what you pointed out to me before whi
and I'm maybe a little uncertain what
> the difference is between this ask and the capabilities added in [1].
>
> [1] https://github.com/apache/arrow/pull/35514
>
> On Tue, Jun 13, 2023 at 8:23 AM Li Jin wrote:
>
> > Hi,
> >
> > I am trying to write a funct
Hi,
I am trying to write a function that takes a stream of record batches
(where the last column is the group id) and produces k record batches,
where record batch k_i contains all the rows with group id == i.
Pseudocode is something like:
def group_rows(batches, k) -> array[RecordBatch] {
builder
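A runnable pyarrow sketch of that pseudocode; the function and variable
names are mine, and it assumes the group-id column is the last column and
holds integers in [0, k):

```python
import pyarrow as pa
import pyarrow.compute as pc

def group_rows(batches, k):
    # Split a stream of record batches into k tables, keyed by the
    # last column (the group id)
    parts = [[] for _ in range(k)]
    for batch in batches:
        gid = batch.column(batch.num_columns - 1)
        for i in range(k):
            parts[i].append(batch.filter(pc.equal(gid, i)))
    return [pa.Table.from_batches(p) for p in parts]
```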
dtype(df.dtypes[col])) for col in
> > df.columns]
> > pa_type = pa.struct(fields)
> > pa.array(df.itertuples(index=False), type=pa_type)
> >
> > But this seems like a classic XY problem. What is the root issue you're
> > trying to solve? Why avoid RecordBatch?
Gentle bump.
Not a big deal if I need to use the API above to do so, but bump in case
someone has a better way.
On Fri, Jun 9, 2023 at 4:34 PM Li Jin wrote:
> Hello,
>
> I am looking for the best way to convert Pandas DataFrame <-> Struct
> Array.
>
Hello,
I am looking for the best way to convert Pandas DataFrame <-> Struct
Array.
Currently I have:
pa.RecordBatch.from_pandas(df).to_struct_array()
and
pa.RecordBatch.from_struct_array(s_array).to_pandas()
- I wonder if there is a direct way to go from DataFrame <-> Struct Array
withou
r doing that, so you
> should be able to give that a try.
>
> We don't have a way of running PR checks as we do with the crossbow
> command. We could investigate if there is a way to do it via API.
>
> Thanks,
> Raúl
>
> On Tue, Apr 18, 2023 at 14:47, Li Jin ()
>
>
> > The UI was recently updated:
> >
> >
> https://docs.github.com/en/actions/managing-workflow-runs/re-running-workflows-and-jobs#re-running-failed-jobs-in-a-workflow
> >
> > On Mon, Apr 17, 2023 at 7:57 PM Li Jin wrote:
> >
> >> Thanks!
UI. If you want to avoid having
> to add small changes to be able to commit you can use empty commits via
> '--allow-empty'.
>
> On Mon, Apr 17, 2023 at 5:25 PM Li Jin wrote:
>
> > Hi,
> >
> > Is there a github command to rerun CI checks? (instead of pushing a new
> > commit?)
> >
> > Thanks,
> > Li
> >
>
Hi,
Is there a github command to rerun CI checks? (instead of pushing a new
commit?)
Thanks,
Li
Thanks David!
On Tue, Apr 4, 2023 at 4:58 PM David Li wrote:
> Yes, that's what the ARROW_EXTRA_ERROR_CONTEXT option does.
>
> On Tue, Apr 4, 2023, at 11:13, Li Jin wrote:
> > Picking up this conversation again, I noticed when I hit an error in
> > test I
>
his, std::move(batch))
/home/icexelloss/workspace/arrow/cpp/src/arrow/acero/hash_aggregate_test.cc:271
start_and_collect.MoveResult()
```
Is this because of the ARROW_EXTRA_ERROR_CONTEXT option?
On Fri, Mar 24, 2023 at 12:04 PM Li Jin wrote:
> Thanks David!
>
> On Tue, Mar 21, 2023 at 6:32
Thanks Rok!
The original question was asking for a way to "verify whether a cast is
zero copy by reading source code / documentation", not "verify whether a
cast is zero copy programmatically", but I noticed by reading the test file
that int64 to micros is indeed zero copy and I expect nanos to be the same
https:
Hello,
I recently found myself casting an int64 (nanos from epoch) into a nano
timestamp column with the C++ cast kernel (via Acero).
I expect this to be zero copy but I wonder if there is a way to check which
casts are zero copy and which are not?
Li
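One programmatic way to check, as a sketch: a zero-copy cast reuses the
source's data buffer, so the buffer addresses match:

```python
import pyarrow as pa

arr = pa.array([1_600_000_000_000_000_000], type=pa.int64())
cast = arr.cast(pa.timestamp("ns"))
# buffers()[1] is the values buffer; equal addresses imply no copy was made
print(arr.buffers()[1].address == cast.buffers()[1].address)
```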
a rough
> stack trace (IIRC, if a function returns the status without using one of
> the macros, it won't add a line to the trace).
>
> [1]:
> https://github.com/apache/arrow/blob/1ba4425fab35d572132cb30eee6087a7dca89853/cpp/cmake_modules/DefineOptions.cmake#L608-L609
>
> On
Hi,
This might be a dumb question but when Arrow code raises an invalid status,
I observe that it usually pops up to the user without stack information. I
wonder if there are any tricks to show where the invalid status is coming
from?
Thanks,
Li
Late to the party.
Thanks Weston for sharing the thoughts around Acero. We are actually a
pretty heavy Acero user right now and are trying to take part in Acero
maintenance and development. Internally we are using Acero for a time
series streaming data processing system.
I would +1 on many of Wes
rk
> here will be pretty easy. The trickier part might be adapting your
> producer (Ibis?)
>
> On Thu, Mar 9, 2023 at 9:43 AM Li Jin wrote:
>
> > Hi,
> >
> > I recently came across some limitations in expressing timestamp type with
> > Substrait in the Ace
Congratulations Will!
On Mon, Mar 13, 2023 at 3:27 PM Bryce Mecum wrote:
> Congratulations, Will!
>
Hi,
I recently came across some limitations in expressing timestamp type with
Substrait in the Acero substrait consumer and am curious to hear what
people's thoughts are.
The particular issue that I have is when specifying timestamp type in
substrait, the unit is "microseconds" and there is no wa
Thanks Weston for the information.
On Thu, Feb 16, 2023 at 1:32 PM Weston Pace wrote:
> There is a little bit at the end-to-end level. One goal is to be able to
> repartition a very large dataset. This means we read from something bigger
> than memory and then write to it. This workflow is te
he array is timezone aware.
>
> On Wed, Feb 15, 2023 at 10:37 PM Li Jin wrote:
>
> > Oh found this comment:
> >
> >
> https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/kernels/scalar_cast_temporal.cc#L156
> >
> >
> >
> > On Wed, Feb
Oh found this comment:
https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/kernels/scalar_cast_temporal.cc#L156
On Wed, Feb 15, 2023 at 4:23 PM Li Jin wrote:
> Not sure if this is actually a bug or expected behavior - I filed
> https://github.com/apache/arrow/issues/34210
Not sure if this is actually a bug or expected behavior - I filed
https://github.com/apache/arrow/issues/34210
On Wed, Feb 15, 2023 at 4:15 PM Li Jin wrote:
> Hmm..something feels off here - I did the following experiment on Arrow 11
> and casting timestamp-naive to int64 is much faste
00:00:00.09998,1970-01-01
00:00:00.0]]
On Wed, Feb 15, 2023 at 2:52 PM Rok Mihevc wrote:
> I'm not sure about (1) but I'm pretty sure for (2) doing a cast of tz-aware
> timestamp to tz-naive should be a metadata-only change.
>
> On Wed, Feb 15, 2023 at
Asking (2) because IIUC this is a metadata operation that could be zero
copy but I am not sure if this is actually the case.
On Wed, Feb 15, 2023 at 10:17 AM Li Jin wrote:
> Hello!
>
> I have some questions about type casting memory usage with pyarrow Table.
> Let's say I hav
Hello!
I have some questions about type casting memory usage with pyarrow Table.
Let's say I have a pyarrow Table with 100 columns.
(1) if I want to cast n columns to a different type (e.g., float to int).
What is the smallest memory overhead that I can do? (memory overhead of 1
column, n columns
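A sketch of bounding the overhead to roughly one column at a time by
casting and swapping columns individually rather than casting the whole
table at once (assuming the replaced chunks are not referenced elsewhere,
so they can be freed as you go):

```python
import pyarrow as pa

table = pa.table({"a": [1.0, 2.0], "b": [1, 2], "c": ["x", "y"]})
for name in ["a"]:  # the n columns to cast
    i = table.schema.get_field_index(name)
    casted = table.column(i).cast(pa.int64())
    table = table.set_column(i, pa.field(name, pa.int64()), casted)
```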
"
In this case though, it's just that we purposely hide symbols by default.
If there's a use case, we could unhide this specific symbol (we did it for
one other Protobuf symbol) which would let you externally generate and use
the headers (as long as you take care not to actually include the generat
Hello,
I am working on converting some internal data sources to Arrow data. One
particular set of data we have contains many string columns that can be
dictionary-encoded (basically string enums)
The current internal C++ API I am using gives me an iterator of "row"
objects, for each string col
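At the pyarrow level, the simplest route is to build the string array and
dictionary-encode it afterwards; a sketch with the row-iterator plumbing
elided (in C++, a DictionaryBuilder could avoid the intermediate array):

```python
import pyarrow as pa

values = ["red", "green", "red", "blue", "red"]  # gathered from row objects
arr = pa.array(values).dictionary_encode()
print(arr.type)  # dictionary<values=string, indices=int32, ordered=0>
```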
congrats!
On Thu, Oct 27, 2022 at 9:03 PM Matt Topol wrote:
> Congrats Will!
>
> On Thu, Oct 27, 2022 at 9:02 PM Ian Cook wrote:
>
> > Congratulations Will!
> >
> > On Thu, Oct 27, 2022 at 19:56 Sutou Kouhei wrote:
> >
> > > On behalf of the Arrow PMC, I'm happy to announce that Will Jones
> >
Hello!
I am trying to implement an ExecNode in Acero that receives the input
batch, writes the batch to the FlightStreamWriter and then passes the batch
to the downstream node.
Looking at the API, I am thinking of doing something like:
void InputReceived(ExecNode* input, ExecBatch batch) {
# turn
't sound like the correct way, I am happy to do this
correctly - someone please let me know the correct way :)
Li
On Thu, Oct 13, 2022 at 2:01 PM Li Jin wrote:
> Going back to the default_exec_factory_registry idea, I think ultimately
> maybe we want registration API that
dFactory("my_custom_node",
MakeMyCustomNode)
...
"""
On Thu, Oct 13, 2022 at 1:32 PM Li Jin wrote:
> Weston - was trying the pyarrow approach you suggested:
>
> > def custom_source(endpoint):
> return pc.Declaration("my_custom_source", create_my_custom_o
object should I return with create_my_custom_options()?
Currently I only have a C++ class for my custom option.
On Thu, Oct 13, 2022 at 12:58 PM Li Jin wrote:
> > I may be assuming here but I think your problem is more that there is
> no way to more flexibly describe a source in python and less
ate_my_custom_options())
>
> def table_provider(names):
> return custom_sources[names[0]]
>
> pa.substrait.run_query(my_plan, table_provider=table_provider)
> ```
>
> On Thu, Oct 13, 2022 at 8:24 AM Li Jin wrote:
> >
> > We did some work around this recently and
r;
}
"""
And then calling `pa.substrait.run_query` should pick up the custom named
table provider.
Does that sound like a reasonable way to do this?
On Tue, Sep 27, 2022 at 1:59 PM Li Jin wrote:
> Thanks both. I think NamedTableProvider is close to what I want, and like
>
te batches in a queue (just like the sink node) but it is
> not handling backpressure. I've created [1] to track this.
>
> [1] https://issues.apache.org/jira/browse/ARROW-18025
>
> On Wed, Oct 12, 2022 at 9:02 AM Li Jin wrote:
> >
> > Hello!
> >
> > I have
Hello!
I have some questions about how "pyarrow.substrait.run_query" works.
Currently run_query returns a record batch reader. Since Acero is a
push-based model and the reader is pull-based, I'd assume the reader object
somehow accumulates the batches that are pushed to it. And I wonder
(1) Does
Disclaimer: not an ibis-substrait dev here.
ibis-substrait has a "decompiler":
https://github.com/ibis-project/ibis-substrait/blob/main/ibis_substrait/tests/compiler/test_decompiler.py
that takes Substrait and returns an ibis expression; you can then run the
ibis expression with ibis's pandas backend:
https:
name {names}")
reader =
pa.substrait.run_query(pa.py_buffer(result.SerializeToString()),
table_provider)
result_table = reader.read_all()
self.assertTrue(result_table == test_table_0)
First successful run with ibis/substrait/acero - Hooray
On Wed, Oct 5, 2
PM Will Jones wrote:
> Hi Li Jin,
>
> The original segfault seems to occur because you are passing a Python bytes
> object and not a PyArrow Buffer object. You can wrap the bytes object using
> pa.py_buffer():
>
> pa.substrait.run_query(pa.py_buffer(result_bytes), table_provide
For reference, this is the "relations" entry that I was referring to:
https://github.com/apache/arrow/blob/master/python/pyarrow/tests/test_substrait.py#L186
On Tue, Oct 4, 2022 at 3:28 PM Li Jin wrote:
> So I made some progress with updated code:
>
> t = ibis.table([
ssed"
Looking at the plan produced by ibis-substrait, it looks like it doesn't
match the expected format of the Acero consumer. In particular, the plan
produced by ibis-substrait doesn't seem to have a "relations" entry - any
thoughts on how this can be fixed? (I don't kno
Hi,
I am testing integration between ibis-substrait and Acero but hit a
segmentation fault. I think this might be because the way I am
integrating these two libraries is wrong; here is my code:
class BasicTests(unittest.TestCase):
"""Test
own version of these files to build your Python module separately.
> This is where you would add a build flag for pulling in C++ header files
> for your Python module, under "python/pyarrow/include", and for making it.
>
>
> Yaron.
>
>
provide user configurable
> dispatching for named tables;
> if it doesn't address your use case then we might want to create a JIRA to
> extend it.
>
> On Tue, Sep 27, 2022 at 10:41 AM Li Jin wrote:
>
> > I did some more digging into this and have some ideas -
> >
>
is
later in favor of a more generic solution.
Thoughts?
Li
On Mon, Sep 26, 2022 at 10:58 AM Li Jin wrote:
> Hello!
>
> I am working on adding a custom data source node in Acero. I have a few
> previous threads related to this topic.
>
> Currently, I am able to register my cu
Hello!
I am working on adding a custom data source node in Acero. I have a few
previous threads related to this topic.
Currently, I am able to register my custom factory method with Acero and
create a Custom source node, i.e., I can register and execute this with
Acero:
MySourceNodeOptions sourc
.pyx when the python module is loaded.
> I don't know cython well enough to know how exactly it triggers the
> datasets shared object to load.
>
> On Tue, Sep 20, 2022 at 11:01 AM Li Jin wrote:
> >
> > Hi,
> >
> > Recently I am working on adding a custom da
>
> We could probably also add a DeclarationToReader method in the future.
>
> [1] https://github.com/apache/arrow/pull/13782
>
> On Wed, Sep 21, 2022 at 8:26 AM Li Jin wrote:
> >
> > Hello!
> >
> > I am testing a custom data source node I added to A
Hello!
I am testing a custom data source node I added to Acero and found myself in
need of collecting the results from an Acero query into memory.
Searching the codebase, I found "StartAndCollect" is what many of the tests
and benchmarks are using, but I am not sure if that is the public API to d
Hi,
Recently I am working on adding a custom data source node to Acero and was
pointed to a few examples in the dataset code.
If I understand this correctly, the registering of dataset exec node is
currently happening when this is loaded:
https://github.com/apache/arrow/blob/master/python/pyarrow
; > and convert this into a record batch reader. Then it would create one
> > > of the node's that Yaron has contributed and return that.
> > >
> > > However, it might be nice if "open a connection to the flight
> > > endpoint" happened
cept it. You would need to know the schema when configuring the
> SourceNode, but you won't need to derived from SourceNode.
>
>
> Yaron.
> ________
> From: Li Jin
> Sent: Tuesday, September 13, 2022 3:58 PM
> To: dev@arrow.apache.org
> Subje
twork
> to get the schema on its own.
>
> Given the above, I agree with you that when the Acero node is created its
> schema would already be known.
>
>
> Yaron.
>
> From: Li Jin
> Sent: Thursday, September 1, 2022 2:49 PM
> To: dev
.0-release/
> [3]
> https://github.com/apache/arrow/blame/3eb5673597bf67246271b6c9a98e6f812d4e01a7/python/pyarrow/table.pxi#L1991
> [4]
> https://github.com/apache/arrow/blob/apache-arrow-7.0.0/python/pyarrow/__init__.py#L368
>
> On Fri, Sep 9, 2022 at 10:15 AM Li Jin wro
but just wondering in general where I should look first if I hit
this sort of issue in the future.
On Fri, Sep 9, 2022 at 12:20 PM Li Jin wrote:
> Hi,
>
> I am trying to update Pyarrow from 7.0 to 9.0 and hit a couple of issues
> that I believe are because of some API changes. In par
Hi,
I am trying to update Pyarrow from 7.0 to 9.0 and hit a couple of issues
that I believe are because of some API changes. In particular, two issues I
saw seems to be
(1) pyarrow.read_schema is removed
(2) pa.Table.to_batches no longer takes a keyword argument (chunksize)
What's the best way t
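A hedged sketch of the replacements; pa.ipc.read_schema and the
max_chunksize keyword are the current spellings, but the 9.0 changelog is
the authority:

```python
import pyarrow as pa

table = pa.table({"x": list(range(10))})

# (1) pyarrow.read_schema was removed; pyarrow.ipc.read_schema replaces it
sink = pa.BufferOutputStream()
with pa.ipc.new_stream(sink, table.schema) as writer:
    writer.write_table(table)
schema = pa.ipc.read_schema(sink.getvalue())

# (2) Table.to_batches: the keyword is now max_chunksize
batches = table.to_batches(max_chunksize=4)
```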
g with various ways of getting the actual schema, depending on what
> exactly your service supports.) Once you have a Dataset, you can create an
> ExecPlan and proceed like normal.
>
> Of course, if you then want to get things into Python, R, Substrait,
> etc... that requires s
Hello!
I have recently started to look into integrating Flight RPC with Acero
source/sink node.
In Flight, the life cycle of a "read" request looks something like:
- User specifies a URL (e.g. my_storage://my_path) and parameter (e.g.,
begin = "20220101", end = "20220201")
- Client issue GetF
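The standard client-side shape of that lifecycle in pyarrow.flight, as a
sketch; the endpoint URL and command payload are placeholders for the
internal service described here:

```python
import pyarrow.flight as flight

client = flight.connect("grpc://localhost:8815")  # placeholder endpoint
descriptor = flight.FlightDescriptor.for_command(
    b'{"path": "my_path", "begin": "20220101", "end": "20220201"}')
info = client.get_flight_info(descriptor)  # server returns endpoints/tickets
for endpoint in info.endpoints:
    reader = client.do_get(endpoint.ticket)
    table = reader.read_all()
```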
but just wanted to mention that I am going
> to
> > > try and figure this out quite a bit in the next week. I can try to
> create
> > > some relevant cookbook recipes as I plod along.
> > >
> > > Aldrin Montana
> > > Computer Science PhD Student
>