I think your conclusion is correct that the Rust API doesn't support using
mmap'd files as a way to read/write arrow files.
In general, I suspect using mmap in Rust is a bit dicey (aka unsafe) as
the normal Rust rules of ownership are hard to apply to chunks of memory
that can be (potentially) modified by other processes outside the
program's control.
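For anyone who wants to experiment anyway, here is a rough sketch (my own
illustration, not an officially supported path) of mapping a file with the
memmap2 crate and handing the bytes to the IPC `FileReader`; the file name is
invented and decoding still copies values into Arrow buffers, so this is not
true zero-copy over the mapping:

```rust
use std::fs::File;
use std::io::Cursor;

use arrow::ipc::reader::FileReader;
use memmap2::Mmap;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let file = File::open("data.arrow")?;

    // `unsafe` because the mapped bytes can change underneath us if the file
    // is modified concurrently -- exactly the ownership problem above.
    let mmap = unsafe { Mmap::map(&file)? };

    // Feed the mapped bytes to the IPC reader through a Cursor (Read + Seek).
    let reader = FileReader::try_new(Cursor::new(&mmap[..]), None)?;
    for batch in reader {
        println!("read a batch with {} rows", batch?.num_rows());
    }
    Ok(())
}
```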
> ... and Rust, how can it handle big data effectively if the Rust API does
> not support memory mapped read/write?
For what it is worth, when we were testing with timeseries data (which also
has many sequential values that are very close in absolute value), the parquet
BYTE_STREAM_SPLIT[1] encoding was also quite effective (20% better
compression). However, at the time this wasn't supported in C++ (and thus not
supported in Pandas).
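For anyone curious, here is a sketch of how one might opt a float column into
BYTE_STREAM_SPLIT with the Rust parquet crate. This is my own illustration
(the column name and data are invented) and it assumes a crate version that
supports writing that encoding:

```rust
use std::fs::File;
use std::sync::Arc;

use arrow::array::{ArrayRef, Float64Array};
use arrow::datatypes::{DataType, Field, Schema};
use arrow::record_batch::RecordBatch;
use parquet::arrow::ArrowWriter;
use parquet::basic::Encoding;
use parquet::file::properties::WriterProperties;
use parquet::schema::types::ColumnPath;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let schema = Arc::new(Schema::new(vec![Field::new(
        "temperature",
        DataType::Float64,
        false,
    )]));
    let values = Float64Array::from(vec![20.1, 20.2, 20.15, 20.3]);
    let batch = RecordBatch::try_new(schema.clone(), vec![Arc::new(values) as ArrayRef])?;

    // Turn off dictionary encoding for the column and ask for
    // BYTE_STREAM_SPLIT instead (often compresses well for timeseries).
    let column = ColumnPath::from("temperature");
    let props = WriterProperties::builder()
        .set_column_dictionary_enabled(column.clone(), false)
        .set_column_encoding(column, Encoding::BYTE_STREAM_SPLIT)
        .build();

    let mut writer = ArrowWriter::try_new(File::create("ts.parquet")?, schema, Some(props))?;
    writer.write(&batch)?;
    writer.close()?;
    Ok(())
}
```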
Hi Fernando,
Keeping the data in memory as `RecordBatch`es sounds like the way to go if
you want it all to be in memory.
Another way to work in Rust with data from parquet files is to use the
`DataFusion` library. Depending on your needs it might save you some time
building up your analytics.
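As a rough illustration of that in-memory route (not from the original thread;
table name, column names, and SQL are invented, and it uses a recent
DataFusion API plus tokio), here is a sketch of registering `RecordBatch`es
via a `MemTable` and querying them with SQL:

```rust
use std::sync::Arc;

use datafusion::arrow::array::{ArrayRef, Int64Array, StringArray};
use datafusion::arrow::datatypes::{DataType, Field, Schema};
use datafusion::arrow::record_batch::RecordBatch;
use datafusion::datasource::MemTable;
use datafusion::prelude::SessionContext;

#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
    let schema = Arc::new(Schema::new(vec![
        Field::new("name", DataType::Utf8, false),
        Field::new("amount", DataType::Int64, false),
    ]));
    let batch = RecordBatch::try_new(
        schema.clone(),
        vec![
            Arc::new(StringArray::from(vec!["a", "b", "a"])) as ArrayRef,
            Arc::new(Int64Array::from(vec![1, 2, 3])) as ArrayRef,
        ],
    )?;

    // Keep the batches in memory and expose them as a SQL table.
    let table = MemTable::try_new(schema, vec![vec![batch]])?;
    let ctx = SessionContext::new();
    ctx.register_table("t", Arc::new(table))?;

    let df = ctx.sql("SELECT name, SUM(amount) FROM t GROUP BY name").await?;
    df.show().await?;
    Ok(())
}
```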
> ... to other types of Arrays. I'm doing it now using as_any and then
> downcasting to the type I want. But I have to write the type in the code
> and I want to find a way for it to be done automatically.
>
> Thanks,
> Fernando
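For reference, a minimal sketch (my own illustration, not from the original
thread) of the as_any downcasting pattern described above, combined with a
match on `DataType` so the caller does not have to hard-code a single type:

```rust
use std::sync::Arc;

use arrow::array::{Array, ArrayRef, Float64Array, StringArray};
use arrow::datatypes::DataType;

// Pick the concrete array type to downcast to based on the DataType.
fn print_column(col: &ArrayRef) {
    match col.data_type() {
        DataType::Utf8 => {
            let arr = col.as_any().downcast_ref::<StringArray>().unwrap();
            (0..arr.len()).for_each(|i| println!("{}", arr.value(i)));
        }
        DataType::Float64 => {
            let arr = col.as_any().downcast_ref::<Float64Array>().unwrap();
            (0..arr.len()).for_each(|i| println!("{}", arr.value(i)));
        }
        other => println!("unhandled type: {other:?}"),
    }
}

fn main() {
    let col: ArrayRef = Arc::new(StringArray::from(vec!["foo", "bar"]));
    print_column(&col);
}
```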
Hi Anil,
I don't know of any specific plans to make DataFrame Send+Sync, but I gave
it a quick shot: https://github.com/apache/arrow/pull/9406
And it seems to work -- if this is the change you are looking for? If so I
can polish up that PR and get it ready for review and hopefully merged soon.
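For context, the usual way to verify this kind of change is a compile-time
assertion; a minimal sketch of the pattern, using a stand-in type rather than
DataFusion's actual `DataFrame` or test code:

```rust
// Compiles only if T is Send + Sync; the body never runs.
fn assert_send_sync<T: Send + Sync>() {}

// Stand-in type so the snippet is self-contained; a real check would name
// datafusion::dataframe::DataFrame instead.
struct DataFrameLike;

fn main() {
    assert_send_sync::<DataFrameLike>();
    println!("DataFrameLike is Send + Sync");
}
```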
Andrew
Thanks for your help and waiting for the merge.
The Buzz project is one example I know of that reads parquet files from S3
using the Rust implementation
https://github.com/cloudfuse-io/buzz-rust/blob/13175a7c5cdd298415889da710a254a218be0a01/code/src/execution_plan/parquet.rs
The SerializedFileReader[1] from the Rust parquet crate, despite its ...
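To make that concrete, here is a small sketch (my own, with an invented file
name) showing that `SerializedFileReader` can work directly on in-memory
bytes, such as an object fetched from S3, rather than only a local file
handle:

```rust
use bytes::Bytes;
use parquet::file::reader::{FileReader, SerializedFileReader};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // In Buzz the bytes would come from an S3 GET; reading a local file into
    // memory here just stands in for the downloaded object.
    let data = Bytes::from(std::fs::read("example.parquet")?);

    // SerializedFileReader accepts anything implementing ChunkReader,
    // including in-memory Bytes -- no local file required.
    let reader = SerializedFileReader::new(data)?;
    let metadata = reader.metadata();
    println!(
        "row groups: {}, rows: {}",
        metadata.num_row_groups(),
        metadata.file_metadata().num_rows()
    );
    Ok(())
}
```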
>
> [1]
> https://github.com/cloudfuse-io/buzz-rust/blob/13175a7c5cdd298415889da710a254a218be0a01/code/src/datasource/hbee/s3_parquet.rs
> [2]
> https://github.com/cloudfuse-io/buzz-rust/blob/13175a7c5cdd298415889da710a254a218be0a01/code/src/execution_plan/parquet.rs
In case anyone is interested in the topic in general or DataFusion in
particular, I am planning a tech talk [1] next week about "Query Engine
Design and the Rust based DataFusion in Apache Arrow."
If you are curious how (SQL) query engines in general are structured, I
plan to describe the typical high level structure.
[1] ...-engine-design-and-the-rustbased-datafusion-in-apache-arrow-244161934
I am proposing a backwards incompatible change to the Rust DataFusion
LogicalPlanBuilder API and I would like to give all stakeholders a chance
to comment on the proposal
https://github.com/apache/arrow/pull/9703
Thank you,
Andrew
I recently gave a talk about Arrow which I thought might be of interest to
the community.
You can find the slides [1] and video [2] online
Title: Apache Arrow and its Impact on the Database industry
Abstract: The talk motivates why Apache Arrow and related projects (e.g.
DataFusion) are a good choice ...
I have migrated over all JIRA issues that were marked as "Rust" or
"Rust-DataFusion" to new issues in the https://github.com/apache/arrow-rs
and https://github.com/apache/arrow-datafusion repos respectively.
There are now no open JIRA issues [1] for the Rust implementation.
My script moved the ...
I propose regularly making minor and patch releases of the arrow-rs crate
every 2 weeks, following the semver versioning scheme used by the rest
of the Rust ecosystem. I have written a proposal[1] describing how this
might work.
Feedback and comments most welcome.
Andrew
[1]
https://docs.g...
Hi John,
> Is DataFusion a good solution for validating and converting large csv
> files (+20M, ~400 columns) existing in S3 buckets to parquet?
In my opinion yes -- that is well within the use case of DataFusion. There
is an example [1] of how to convert csv files into parquet files that might
be helpful.
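In the meantime, here is a rough sketch of the shape of such a conversion
(my illustration only: file paths, the column name `c1`, and the SQL are
placeholders, and for 20M-row files you would stream batches rather than
collect them all into memory):

```rust
use std::fs::File;

use datafusion::parquet::arrow::ArrowWriter;
use datafusion::prelude::{CsvReadOptions, SessionContext};

#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
    let ctx = SessionContext::new();
    ctx.register_csv("input", "data.csv", CsvReadOptions::new()).await?;

    // Put the validation / cleanup logic in SQL (or the DataFrame API).
    let df = ctx.sql("SELECT * FROM input WHERE c1 IS NOT NULL").await?;

    // collect() pulls everything into memory; for very large files you
    // would stream the batches instead.
    let batches = df.collect().await?;

    // Write the result out as a parquet file (assumes at least one batch).
    let file = File::create("out.parquet")?;
    let mut writer = ArrowWriter::try_new(file, batches[0].schema(), None)?;
    for batch in &batches {
        writer.write(batch)?;
    }
    writer.close()?;
    Ok(())
}
```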
Hi Joshua,
The TLDR is that DataFusion or possibly Ballista (Rust query engines) may do
what you want, but you would have to test it. I can't recall offhand if a
sort + limit will be applied per partition and then a final sort + limit
(which would avoid reading everything into memory at once).
Sorry Kyle, I totally missed this email
Initially I would say the symptoms sound like "not calling finish() on the
writer" but I skimmed some of your linked code and saw at least one call to
finish, so maybe this is not the root cause
In terms of reading from a parquet file and returning arrow ...
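For anyone following along, the "finish" detail refers to the Arrow IPC
writers needing an explicit finish/close to write the footer. A minimal
sketch of the intended usage (my own example; the file name and schema are
invented):

```rust
use std::fs::File;
use std::sync::Arc;

use arrow::array::{ArrayRef, Int32Array};
use arrow::datatypes::{DataType, Field, Schema};
use arrow::ipc::writer::FileWriter;
use arrow::record_batch::RecordBatch;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let schema = Arc::new(Schema::new(vec![Field::new("x", DataType::Int32, false)]));
    let batch = RecordBatch::try_new(
        schema.clone(),
        vec![Arc::new(Int32Array::from(vec![1, 2, 3])) as ArrayRef],
    )?;

    let mut writer = FileWriter::try_new(File::create("data.arrow")?, &schema)?;
    writer.write(&batch)?;
    // Without finish() the footer is never written and readers see a
    // truncated / invalid file.
    writer.finish()?;
    Ok(())
}
```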
> ... it first looked at the footer, while Arrow JS likely tries to parse
> stream and file IPC data in the same way.
>
> Kyle
Greetings to DataFusion users and developers. There is a proposed change to
rename some of the core apis [1] to clarify and prepare for better
multi-tenant support that I wanted to call attention to for anyone
interested.
Please leave comments on the PR if you would like to join in the discussion.
Hi Ahmed,
It is valid to concatenate batches and the process you describe seems fine.
Your description certainly sounds as if there is something wrong with
`concat` that is producing incorrect RecordBatches -- would it be possible
to provide more information and file a ticket in
https://github.com/apache/arrow-rs
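For reference, a minimal sketch of the concatenation step as I understand it
is being used (my own example; the schema and values are invented):

```rust
use std::sync::Arc;

use arrow::array::{ArrayRef, Int32Array};
use arrow::compute::concat_batches;
use arrow::datatypes::{DataType, Field, Schema};
use arrow::record_batch::RecordBatch;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let schema = Arc::new(Schema::new(vec![Field::new("x", DataType::Int32, false)]));
    let batch = |values: Vec<i32>| {
        RecordBatch::try_new(
            schema.clone(),
            vec![Arc::new(Int32Array::from(values)) as ArrayRef],
        )
        .unwrap()
    };

    let a = batch(vec![1, 2]);
    let b = batch(vec![3, 4, 5]);

    // All input batches must share the same schema.
    let combined = concat_batches(&schema, &[a, b])?;
    assert_eq!(combined.num_rows(), 5);
    Ok(())
}
```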
> ... implementation. I think parquet2 can do this, but I had trouble with
> parquet2 as it couldn't handle the deeply nested Parquet we have. Will
> check further as to where parquet2 is falling over and raise it on parquet2.
>
> Thanks,
> Ahmed.
Raphael has a proposed PR[1] to improve this situation.
Ahmed, I wonder if you have a chance to add your opinion
[1] https://github.com/apache/arrow-rs/pull/1719
On Sat, May 21, 2022 at 6:42 AM Andrew Lamb wrote:
> Thanks Ahmed, yes I can see if you tried to write multiple RecordBatches ...
> Also need to create a minimal Parquet file to demonstrate the issues we've
> seen.
>
> Thanks
> Ahmed.
This advisory is related to the Rust implementation of Arrow. I do not
think there are any exploitable vulnerabilities in arrow due to the
underlying flatbuffers dependency.
The TLDR is that if an application accepts data that claims to be in the
Arrow in-memory format from an untrusted source, it ...
> ... metadata for IPC that is always going to come from a trusted source
> (from Arrow itself I guess) so no security risks here.
>
> Thank you very much!
>
> Roberto.
Sending record batches with different schemas in the same request is
something we wanted in the IOx project as well.
The way we handled it was to resend the existing schema definition messages
after data had already arrived. Our implementation [1] is in Rust and we
controlled both sender and receiver, so ...
And yes, window functions are supported, with variously sized windows.
On Sun, Dec 25, 2022 at 9:20 AM Daniël Heres wrote:
> Hi Olo,
>
> To my knowledge, we support all the TPC-H queries in parsing, planning and
> executing, which is tested in CI.
> Window functions are supported.
>
> The answers ...
Hello Arrow Community,
One of the (possibly the only) responsibilities of the PMC chair is to
collect information on the project and submit quarterly updates to the ASF
board. The next one is due on January 11, 2023
Historically[1], Arrow has crowd-sourced the content and I plan to continue
the tradition.
Thank you Jacob and Matthew -- the level of detail in your suggestions
looks just about perfect. 🙇♂️
On Wed, Jan 4, 2023 at 12:20 PM Jacob Quinn wrote:
> I added a few notes on the Julia implementation.
>
> -Jacob
>
> On Tue, Dec 27, 2022 at 2:45 PM Andrew Lamb wrote:
> Best Regards,
>
> Kevin Gurney
Hi Olo,
DataFusion is based on the Arrow model and thus all its data is read only.
In order to update data, a new copy must be made.
I guess you could call this a version of MVCC.
People often use DataFusion to operate on data in files stored in formats
such as Parquet, which are similarly read optimized ...
... improvements implemented in the downstream data frame library.
On Sun, Jan 8, 2023 at 10:09 PM Andrew Lamb wrote:
> Thank you Kevin.
>
> As a reminder to anyone else who may be interested in contributing I plan
> to submit this report in 2 days or so on Jan 11
>
> Andrew
>
>
Following up, it appears your email is already in the ASF workspace, so I
assume someone has sent you an invite. Welcome!
On Tue, Jan 17, 2023 at 3:14 PM Philip Carinhas <
philip.carin...@zapatacomputing.com> wrote:
> Would it be possible to get a guest invitation to the Apache Slack
> channels that ...
>> Hi guys. Please can you send me an invite as well. Or if you have a
>> link, that'd be great.
>>
>> Thx
>> ... users to sign up on their own?
>>
>> On Thu, Jan 19, 2023 at 8:41 AM Andrew Lamb wrote:
>>
>>> I have sent invites to the ASF slack.
---------- Forwarded message ---------
From: Gavin McDonald
Date: Fri, Mar 24, 2023 at 5:57 AM
Subject: TAC supporting Berlin Buzzwords
To:
PMCs,
Please forward to your dev and user lists.
Hi All,
The ASF Travel Assistance Committee is supporting taking up to six (6)
people
to attend Berlin Buzzwords ...
Hello Arrow Community,
One of the responsibilities of being part of the Apache Software Foundation
(ASF) is to regularly summarize the state of the project in a quarterly
update to the ASF board. The next report is due on April 12, 2023
Historically[1], Arrow has crowd-sourced the content, which has ...
[1]:
https://docs.google.com/document/d/13FSDydEVXT2UUFdy4XKjVKNJW-WR8ylvG3aI6lD-dNI/edit#
Hi,
> Is col1 just converted to a StringArray before anything is evaluated?
No
> Is 'foo' converted to a key using the Dictionary before the filter is
> performed?
DataFusion uses the arrow-rs kernels, so I believe in this case the filter
is applied on the dictionary values to find the matching entries.
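To make that concrete, here is a sketch of the idea using the arrow-rs API
directly (my own illustration, not DataFusion's actual code path): the
comparison runs once over the small set of distinct dictionary values, and
the result is mapped through the keys to build the filter mask:

```rust
use arrow::array::{Array, BooleanArray, DictionaryArray, StringArray};
use arrow::compute::filter;
use arrow::datatypes::Int32Type;

fn main() {
    // col1: a low-cardinality string column stored as a dictionary array.
    let col1: DictionaryArray<Int32Type> =
        vec!["foo", "bar", "foo", "baz"].into_iter().collect();

    // Compare against the (small) set of distinct dictionary values once...
    let values = col1
        .values()
        .as_any()
        .downcast_ref::<StringArray>()
        .expect("utf8 dictionary values");
    let matches: Vec<bool> = (0..values.len()).map(|i| values.value(i) == "foo").collect();

    // ...then map the result through the keys to build the filter mask.
    let mask: BooleanArray = col1
        .keys()
        .iter()
        .map(|key| key.map(|k| matches[k as usize]))
        .collect();

    let filtered = filter(&col1, &mask).unwrap();
    assert_eq!(filtered.len(), 2);
}
```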
Hi,
Does anyone know a location that collects / syndicates Apache Arrow related
content?
Some examples of such a thing are [1] for Python and [2] for Rust.
Andrew
[1]: https://planetpython.org/
[2] https://this-week-in-rust.org/
While we have added support in Arrow for Utf8View Arrays[1], along with the
required implementation, I don't think we have written a blog post about
it.
I think blog posts announcing and describing new features at a higher
technical level, with diagrams, are critical to quick and widespread
adoption.
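For anyone who has not seen the feature yet, a tiny sketch of constructing a
Utf8View (`StringViewArray`) with arrow-rs, assuming a release that includes
the view types (the example strings are invented):

```rust
use arrow::array::{Array, StringViewArray};

fn main() {
    // Strings of 12 bytes or fewer are inlined in the 16-byte views; longer
    // strings live once in a shared buffer and are referenced by offset.
    let arr = StringViewArray::from_iter_values([
        "short",
        "a considerably longer string that will not be inlined",
    ]);
    assert_eq!(arr.value(0), "short");
    println!("{} values, data type {:?}", arr.len(), arr.data_type());
}
```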
Severity: moderate
Affected versions:
- Apache Arrow Rust Object Store 0.5.0 through 0.10.1
Description:
Exposure of temporary credentials in logs in Apache Arrow Rust Object Store,
version 0.10.1 and earlier on all platforms using AWS WebIdentityTokens.
On certain error conditions, the logs ...
Hello Arrow Rust Community,
We are discussing the scintillating topic of deprecation policy for APIs
(i.e., for how long to keep deprecated APIs before removing them).
Please join the conversation[1] if you are interested
Thanks,
Andrew
[1]: https://github.com/apache/arrow-rs/pull/6852
Hello Arrow Developers / Users,
We are discussing the policy for minimum supported Rust version[1]. Please
comment on the ticket if you would like to participate in the discussion
Thank you,
Andrew
[1]: https://github.com/apache/arrow-rs/issues/181
I think it is a great idea.
We have also been discussing a smaller DataFusion style meetup in
Europe[1]. We have had good luck with both shorter (2 hours) and longer (1
day) events. My experience was that space availability was the key driver,
so I suggest organizing the event around the space you have available.
This is great.
In case it is helpful, here is more information about what we have done for
DataFusion:
We held most of them in spaces that we didn't pay for. We rented an
office from WeWork for the first one in Austin, but otherwise the space was
donated.
We have used:
1. Companies' offices (e.g. ...)