Re: does rust API support memory mapped read/write ?

2020-11-15 Thread Andrew Lamb
I think your conclusion is correct: the Rust API doesn't support using mmap'd files as a way to read/write arrow files. In general, I suspect using mmap in Rust is a bit dicey (aka unsafe) as the normal Rust rules of ownership are hard to apply to chunks of memory that can be (potentially) modified by d

Re: does rust API support memory mapped read/write ?

2020-11-16 Thread Andrew Lamb
nd rust, how > can it handle big data effectively if rust API not support memory > mapped read/write ? > > On Sun, Nov 15, 2020 at 8:22 PM Andrew Lamb wrote: > > > > I think your conclusion that the Rust API doesn't support using mmap'd > files as a way to read/wri

Re: Delta encoding in Apache Arrow or Parquet

2020-11-16 Thread Andrew Lamb
For what it is worth, when we were testing with timeseries data (that also has many sequential values that are very close in absolute value), the parquet BYTE_STREAM_SPLIT[1] encoding was also quite effective (20% better compression). However, this wasn't supported in C++ (and thus not supported in Pandas)
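
For readers unfamiliar with the encoding mentioned above, here is a minimal sketch of the idea behind BYTE_STREAM_SPLIT for 4-byte floats: byte k of every value goes into stream k, so the mostly identical sign/exponent bytes of nearby values sit next to each other and compress better. The function names are made up for illustration and this is not the parquet crate's implementation; it uses only the Rust standard library.

```rust
/// Scatter the 4 little-endian bytes of each f32 into 4 consecutive streams.
/// The output layout is: all byte-0s, then all byte-1s, and so on.
fn byte_stream_split_encode(values: &[f32]) -> Vec<u8> {
    let n = values.len();
    let mut out = vec![0u8; n * 4];
    for (i, v) in values.iter().enumerate() {
        for (k, b) in v.to_le_bytes().iter().enumerate() {
            out[k * n + i] = *b;
        }
    }
    out
}

/// Inverse of `byte_stream_split_encode`: gather byte k of value i
/// from position `k * n + i` and rebuild the f32.
fn byte_stream_split_decode(encoded: &[u8]) -> Vec<f32> {
    assert!(encoded.len() % 4 == 0, "length must be a multiple of 4");
    let n = encoded.len() / 4;
    (0..n)
        .map(|i| {
            let mut bytes = [0u8; 4];
            for (k, byte) in bytes.iter_mut().enumerate() {
                *byte = encoded[k * n + i];
            }
            f32::from_le_bytes(bytes)
        })
        .collect()
}
```

The split itself does not shrink the data; any gain comes from running a general-purpose compressor over the reordered bytes afterwards.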

Re: [RUST] Reading parquet

2021-01-24 Thread Andrew Lamb
Hi Fernando, Keeping the data in memory as `RecordBatch`es sounds like the way to go if you want it all to be in memory. Another way to work in Rust with data from parquet files is to use the `DataFusion` library; Depending on your needs it might save you some time building up your analytics (e.g

Re: [RUST] Reading parquet

2021-01-24 Thread Andrew Lamb
ted to > other types of Arrays. I'm doing it now using as_any and then down ref to > the type I want. But I have to write the type in the code and I want to > find a way for it to be done automatically. > > Thanks, > Fernando > > On Sun, 24 Jan 2021, 12:01 Andrew Lamb,
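
The quoted question describes downcasting through `as_any` while having to spell out the concrete type by hand. One common way to centralize that is to match on a data-type tag once and downcast inside each arm. The sketch below shows the pattern with stand-in types (`Array`, `DataType`, `Int64Array`, `StringArray` here are simplified hypothetical definitions, not the arrow crate's types), using only the standard library.

```rust
use std::any::Any;

/// Hypothetical type tag, standing in for a real data-type enum.
#[derive(Debug, Clone, Copy, PartialEq)]
enum DataType {
    Int64,
    Utf8,
}

/// Hypothetical dynamically typed array trait.
trait Array {
    fn data_type(&self) -> DataType;
    fn as_any(&self) -> &dyn Any;
}

struct Int64Array(Vec<i64>);
struct StringArray(Vec<String>);

impl Array for Int64Array {
    fn data_type(&self) -> DataType { DataType::Int64 }
    fn as_any(&self) -> &dyn Any { self }
}

impl Array for StringArray {
    fn data_type(&self) -> DataType { DataType::Utf8 }
    fn as_any(&self) -> &dyn Any { self }
}

/// Dispatch on the type tag, then downcast to the matching concrete type.
/// The concrete type names appear only here, not at every call site.
fn describe(array: &dyn Array) -> String {
    match array.data_type() {
        DataType::Int64 => {
            let a = array
                .as_any()
                .downcast_ref::<Int64Array>()
                .expect("tag said Int64");
            format!("Int64 sum={}", a.0.iter().sum::<i64>())
        }
        DataType::Utf8 => {
            let a = array
                .as_any()
                .downcast_ref::<StringArray>()
                .expect("tag said Utf8");
            format!("Utf8 total_len={}", a.0.iter().map(|s| s.len()).sum::<usize>())
        }
    }
}
```

The type names still have to be written somewhere; the match only keeps them in one place, and a macro could generate the arms if the list of types grows.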

Re: [Rust][Datafusion] Dataframe api sync send

2021-02-03 Thread Andrew Lamb
Hi Anil, I don't know of any specific plans to make DataFrame Send+Sync, but I gave it a quick shot: https://github.com/apache/arrow/pull/9406 And it seems to work -- if this is the change you are looking for? If so I can polish up that PR and get it ready for review and hopefully merged soon A
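
As background on why `Send + Sync` matters here: a trait object can only be moved to or shared with another thread if the trait carries those bounds. The sketch below uses a hypothetical stand-in trait named `DataFrame` with one made-up method; it is not DataFusion's API and does not reproduce the linked PR, it only shows what the bounds allow.

```rust
use std::sync::Arc;
use std::thread;

/// Hypothetical stand-in trait. The `Send + Sync` supertraits are what
/// allow `Arc<dyn DataFrame>` to cross thread boundaries.
trait DataFrame: Send + Sync {
    fn row_count(&self) -> usize;
}

/// Hypothetical implementation backed by an in-memory vector.
struct InMemoryFrame {
    rows: Vec<i64>,
}

impl DataFrame for InMemoryFrame {
    fn row_count(&self) -> usize {
        self.rows.len()
    }
}

/// Move the shared frame into a worker thread and use it there.
/// Without the supertrait bounds this function would not compile.
fn count_on_other_thread(df: Arc<dyn DataFrame>) -> usize {
    thread::spawn(move || df.row_count())
        .join()
        .expect("worker thread panicked")
}
```

Removing `Send + Sync` from the trait definition turns the `thread::spawn` call into a compile error, which is the kind of failure that usually prompts this request.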

Re: [Rust][Datafusion] Dataframe api sync send

2021-02-03 Thread Andrew Lamb
your help and waiting for merge. > > On Wed, Feb 3, 2021, 8:50 PM Andrew Lamb wrote: > >> Hi Anil, >> >> I don't know of any specific plans to make DataFrame Send+Sync, but I >> gave it a quick shot: https://github.com/apache/arrow/pull/9406 >> >> A

Re: [Rust] [DataFusion] Reading remote parquet files in S3?

2021-02-14 Thread Andrew Lamb
The Buzz project is one example I know of that reads parquet files from S3 using the Rust implementation https://github.com/cloudfuse-io/buzz-rust/blob/13175a7c5cdd298415889da710a254a218be0a01/code/src/execution_plan/parquet.rs The SerializedFileReader[1] from the Rust parquet crate, despite its

Re: [Rust] [DataFusion] Reading remote parquet files in S3?

2021-02-15 Thread Andrew Lamb
s lots of duplication. > > [1] > https://github.com/cloudfuse-io/buzz-rust/blob/13175a7c5cdd298415889da710a254a218be0a01/code/src/datasource/hbee/s3_parquet.rs > [2] > https://github.com/cloudfuse-io/buzz-rust/blob/13175a7c5cdd298415889da710a254a218be0a01/code/src/execution_plan/parque

[Rust][DataFusion] Query Engine Design / DataFusion Implementation talk

2021-03-04 Thread Andrew Lamb
In case anyone is interested in the topic in general or DataFusion in particular, I plan a tech talk [1] next week about "Query Engine Design and the Rust based DataFusion in Apache Arrow." If you are curious how (SQL) query engines in general are structured, I plan to describe the typical high le

Re: [Rust][DataFusion] Query Engine Design / DataFusion Implementation talk

2021-03-12 Thread Andrew Lamb
-engine-design-and-the-rustbased-datafusion-in-apache-arrow-244161934 On Thu, Mar 4, 2021 at 4:05 PM Andrew Lamb wrote: > In case anyone is interested in the topic in general or DataFusion in > particular, I plan a tech talk [1] next week about "Query Engine Design and > the Rust ba

[Rust][DataFusion] Proposed (backwards incompatible) change to LogicalPlanBuilder API

2021-03-14 Thread Andrew Lamb
I am proposing a backwards incompatible change to the Rust DataFusion LogicalPlanBuilder API and I would like to give all stakeholders a chance to comment on the proposal https://github.com/apache/arrow/pull/9703 Thank you, Andrew

[Announce] Apache Arrow and its Impact on the Database industry

2021-04-21 Thread Andrew Lamb
I recently gave a talk about Arrow which I thought might be of interest to the community. You can find the slides [1] and video [2] online Title: Apache Arrow and its Impact on the Database industry Abstract: The talk motivates why Apache Arrow and related projects (e.g. DataFusion) are a good c

[Announce][Rust] JIRA Issues migrated to github issues

2021-04-26 Thread Andrew Lamb
I have migrated over all JIRA issues that were marked as "Rust" or "Rust-DataFusion" to new issues in the https://github.com/apache/arrow-rs and https://github.com/apache/arrow-datafusion repos respectively. There are now no open JIRA issues [1] for the Rust implementation. My script moved the ti

[RUST] Proposal for more frequent Rust Arrow release process

2021-05-01 Thread Andrew Lamb
I propose regularly releasing, every 2 weeks, minor and patch releases of the arrow-rs crate, following the semver versioning scheme used by the rest of the Rust ecosystem. I have written a proposal[1] describing how this might work. Feedback and comments most welcome. Andrew [1] https://docs.g

Re: [DataFusion] Arrow, CSV, S3 and parquet

2021-08-10 Thread Andrew Lamb
Hi John, > Is DataFusion a good solution for validating and converting large csv files (+20M, ~400 columns) existing in S3 buckets to parquet? In my opinion yes -- that is well within the usecase of DataFusion. There is an example [1] of how to convert csv files into parquet files that might be

Re: [Rust] Support for Columnar Sorting and Merging of Parquet Files on Disk

2021-09-28 Thread Andrew Lamb
Hi Joshua, TLDR is that the datafusion or possibly ballista (Rust query engine) may do what you want, but you would have to test it. I can't recall offhand if a sort + limit will be applied per partition and then a final sort + limit (which would avoid reading everything into memory at once). In
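
To make the strategy mentioned above concrete (sort + limit per partition, then a final sort + limit), here is a minimal standard-library sketch over plain integers. It illustrates the general technique only; whether DataFusion or Ballista plans the query this way is exactly what the message says would need testing, and the function names are invented for the example.

```rust
/// Sort one partition ascending and keep only its first `k` rows.
fn top_k_sorted(mut rows: Vec<i64>, k: usize) -> Vec<i64> {
    rows.sort_unstable();
    rows.truncate(k);
    rows
}

/// Apply sort + limit to every partition independently, then apply the
/// same sort + limit to the combined survivors. At most `k` rows per
/// partition reach the final step, instead of the full input.
fn sort_limit_partitioned(partitions: Vec<Vec<i64>>, k: usize) -> Vec<i64> {
    let candidates: Vec<i64> = partitions
        .into_iter()
        .flat_map(|p| top_k_sorted(p, k))
        .collect();
    top_k_sorted(candidates, k)
}
```

This is correct because any row in the global top `k` must also be in the top `k` of its own partition. Each partition is still sorted in full here; a bounded heap per partition would avoid that at the cost of more code.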

Re: [Rust] Unable to read in Python or JS Arrow Stream IPC files written in Rust

2022-03-10 Thread Andrew Lamb
Sorry Kyle, I totally missed this email Initially I would say the symptoms sound like "not calling finish() on the writer" but I skimmed some of your linked code and saw at least one call to finish, so maybe this is not the root cause In terms of reading from a parquet file and returning arrow, I

Re: [Rust] Unable to read in Python or JS Arrow Stream IPC files written in Rust

2022-03-10 Thread Andrew Lamb
it first > looked at the footer, while Arrow JS likely tries to parse stream and file > IPC data in the same way. > > Kyle > > On Thu, Mar 10, 2022 at 4:24 AM Andrew Lamb wrote: > >> Sorry Kyle, I totally missed this email >> >> Initially I would say the sympto

[RUST][DataFusion] Renaming and refactoring towards multi-tenancy

2022-03-12 Thread Andrew Lamb
Greetings to DataFusion users and developers. There is a proposed change to rename some of the core APIs [1] to clarify and prepare for better multi-tenant support that I wanted to call attention to for anyone interested. Please leave comments on the PR if you would like to join in the discussion.

Re: [Rust] Concatenating Record Batches

2022-05-19 Thread Andrew Lamb
Hi Ahmed, It is valid to concatenate batches and the process you describe seems fine. Your description certainly sounds as if there is something wrong with `concat` that is producing incorrect RecordBatches -- would it be possible to provide more information and file a ticket in https://github.co

Re: [Rust] Concatenating Record Batches

2022-05-21 Thread Andrew Lamb
plementation. I think parquet2 can do this, but I had trouble with > parquet2 as it couldn't handle the deeply nested Parquet we have. Will > check further as to where parquet2 is falling over and raise it on parquet2. > > Thanks, > Ahmed. > > On Thu, May 19, 2022 at 12:21

Re: [Rust] Concatenating Record Batches

2022-05-23 Thread Andrew Lamb
Raphael has a proposed PR[1] to improve this situation. Ahmed, I wonder if you have a chance to add your opinion [1] https://github.com/apache/arrow-rs/pull/1719 On Sat, May 21, 2022 at 6:42 AM Andrew Lamb wrote: > Thanks Ahmed, yes I can see if you tried to write multiple RecordBatches &

Re: [Rust] Concatenating Record Batches

2022-05-23 Thread Andrew Lamb
Also need to create a minimal Parquet file to demonstrate the issues we've > seen. > > Thanks > Ahmed. > > On Mon, 23 May 2022, 11:28 Andrew Lamb, wrote: > >> Raphael has a proposed PR[1] to improve this situation. >> >> Ahmed, I wonder if you have a chance

Re: Flatbuffers vulnerability and arrow

2022-08-31 Thread Andrew Lamb
This advisory is related to the Rust implementation of Arrow. I do not think there are any exploitable vulnerabilities in arrow due to the underlying flatbuffers dependency. The TLDR is that if an application accepts data that claims to be in the Arrow in memory format from an untrusted source, it

Re: Flatbuffers vulnerability and arrow

2022-08-31 Thread Andrew Lamb
tadata for IPC that is going to come always from a > trusted source (from Arrow itself I guess) so no security risks here. > > Thank you very much! > > Roberto. > > El mié., 31 ago. 2022 16:04, Andrew Lamb escribió: > >> This advisory is related to the Rust implem

Re: [FLIGHT] Sending multiple record batches with different schemas

2022-09-08 Thread Andrew Lamb
Sending record batches with different schemas in the same request is something we wanted in the IOx project as well. The way we handled it was to resend the existing schema definition messages after data had already arrived. Our implementation [1] is Rust and we controlled both sender and receiver, so y

Re: [Datafusion] Support for TPC-H

2022-12-25 Thread Andrew Lamb
And yes, window functions are supported, with various sized windows On Sun, Dec 25, 2022 at 9:20 AM Daniël Heres wrote: > Hi Olo, > > To my knowledge, we support all the TPC-H queries in parsing, planning and > executing, which is tested in CI. > Window functions are supported. > > The answers

Apache Arrow Board Report, by Jan 11 2023

2022-12-27 Thread Andrew Lamb
Hello Arrow Community, One of the (possibly the only) responsibilities of the PMC chair is to collect information on the project and submit quarterly updates to the ASF board. The next one is due on January 11, 2023 Historically[1], Arrow has crowd sourced the content and I plan to continue the t

Re: Apache Arrow Board Report, by Jan 11 2023

2023-01-04 Thread Andrew Lamb
Thank you Jacob and Matthew -- the level of detail in your suggestions looks just about perfect. 🙇‍♂️ On Wed, Jan 4, 2023 at 12:20 PM Jacob Quinn wrote: > I added a few notes on the Julia implementation. > > -Jacob > > On Tue, Dec 27, 2022 at 2:45 PM Andrew Lamb wrote: &

Re: Apache Arrow Board Report, by Jan 11 2023

2023-01-08 Thread Andrew Lamb
t; > Best Regards, > > Kevin Gurney > ------ > *From:* Andrew Lamb > *Sent:* Wednesday, January 4, 2023 7:24 PM > *To:* user@arrow.apache.org ; dev < > d...@arrow.apache.org> > *Subject:* Re: Apache Arrow Board Report, by Jan 11 2023 >

Re: [DataFusion] MVCC

2023-01-11 Thread Andrew Lamb
Hi Olo, DataFusion is based on the Arrow model and thus all its data is read only; In order to update data a new copy must be made. I guess you could call this a version of MVCC. People often use DataFusion to operate on data in files stored in formats such as Parquet which are similarly read op

Re: Apache Arrow Board Report, by Jan 11 2023

2023-01-11 Thread Andrew Lamb
ovements implemented in the downstream data frame library. On Sun, Jan 8, 2023 at 10:09 PM Andrew Lamb wrote: > Thank you Kevin. > > As a reminder to anyone else who may be interested in contributing I plan > to submit this report in 2 days or so on Jan 11 > > Andrew > >

Re: Slack

2023-01-19 Thread Andrew Lamb
Following up it appears your email is already in the ASF workspace so I assume someone has sent you an invite. Welcome! On Tue, Jan 17, 2023 at 3:14 PM Philip Carinhas < philip.carin...@zapatacomputing.com> wrote: > Would it be possible to get a guest invitation to the Apache Slack > channels tha

Re: Slack

2023-01-19 Thread Andrew Lamb
uys. Please can you send me an invite as well. Or if you have a link, >> that’s be great. >> >> Thx >> >> >> Sent from Outlook for iOS <https://aka.ms/o0ukef> >> -- >> *From:* Andrew Lamb >> *Sent:* Thu

Re: Slack

2023-01-23 Thread Andrew Lamb
sers to sign up on their own? >> >> On Thu, Jan 19, 2023 at 8:41 AM Andrew Lamb wrote: >> >>> I have sent invites to the ASF slack >>> >>> On Thu, Jan 19, 2023 at 11:01 AM Pranav Yogi Lodha < >>> pranav.lo...@cloudera.com> wrote: >>

[Notice] Apache Software Foundation grant to attend Berlin Buzzwords conference

2023-03-25 Thread Andrew Lamb
-- Forwarded message - From: Gavin McDonald Date: Fri, Mar 24, 2023 at 5:57 AM Subject: TAC supporting Berlin Buzzwords To: PMCs, Please forward to your dev and user lists. Hi All, The ASF Travel Assistance Committee is supporting taking up to six (6) people to attend Berlin

[CROWDSOURCING] Apache Arrow Board Report - April 12, 2023

2023-03-29 Thread Andrew Lamb
Hello Arrow Community, One of the responsibilities of being part of the Apache Software Foundation (ASF) is to regularly summarize the state of the project in a quarterly update to the ASF board. The next report is due on April 12, 2023 Historically[1], Arrow has crowd sourced the content which h

Re: [CROWDSOURCING] Apache Arrow Board Report - April 12, 2023

2023-04-11 Thread Andrew Lamb
[1]: https://docs.google.com/document/d/13FSDydEVXT2UUFdy4XKjVKNJW-WR8ylvG3aI6lD-dNI/edit# On Wed, Mar 29, 2023 at 6:58 AM Andrew Lamb wrote: > Hello Arrow Community, > > One of the responsibilities of being part of the Apache Software > Foundation (ASF) is to regularly summarize the

Re: [Rust] [Datafusion] DictionaryArray querying

2023-04-15 Thread Andrew Lamb
Hi > Is col1 just converted to a StringArray before anything is evaluated? No > Is 'foo' converted to a key using the Dictionary before the filter is performed? DataFusion uses the arrow-rs kernels, so I believe in this case the filter is applied on the dictionary values to find the matching en
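
One way to picture filtering a dictionary-encoded column without first expanding it to strings: evaluate the predicate once per distinct dictionary value, then translate each row's key into a yes/no through that small lookup table. The sketch below uses a simplified hypothetical `DictColumn` type and function name; it is an illustration of the idea, not the arrow-rs kernel.

```rust
/// Hypothetical, simplified dictionary-encoded string column:
/// `keys[i]` is the index into `values` for row `i`.
struct DictColumn {
    values: Vec<String>,
    keys: Vec<usize>,
}

/// Build a row-level mask for `column == literal`.
/// The string comparison runs once per distinct value, not once per row.
fn filter_eq(col: &DictColumn, literal: &str) -> Vec<bool> {
    let value_matches: Vec<bool> = col
        .values
        .iter()
        .map(|v| v.as_str() == literal)
        .collect();
    col.keys.iter().map(|&k| value_matches[k]).collect()
}
```

With few distinct values and many rows, the per-row work is an index lookup instead of a string comparison.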

[QUESTION] Syndication site(s) for Apache Arrow related content?

2023-07-21 Thread Andrew Lamb
Hi, Does anyone know a location that collects / syndicates Apache Arrow related content? Some examples of such a thing are [1] for Python and [2] for Rust. Andrew [1]: https://planetpython.org/ [2]: https://this-week-in-rust.org/

Blog post about new UTF8View

2023-09-28 Thread Andrew Lamb
While we have added support in Arrow for Utf8View Arrays[1], along with the required implementation, I don't think we have written a blog post about it. I think blog posts announcing and describing new features at a higher technical level, with diagrams, are critical to quick and widespread adopti

CVE-2024-41178: Apache Arrow Rust Object Store: AWS WebIdentityToken exposure in log files

2024-07-23 Thread Andrew Lamb
Severity: moderate Affected versions: - Apache Arrow Rust Object Store 0.5.0 through 0.10.1 Description: Exposure of temporary credentials in logs in Apache Arrow Rust Object Store, version 0.10.1 and earlier on all platforms using AWS WebIdentityTokens. On certain error conditions, the logs

[RUST] API deprecation policy

2024-12-07 Thread Andrew Lamb
Hello Arrow Rust Community, We are discussing the scintillating topic of deprecation policy for APIs (aka for how long to keep deprecated APIs before removing them). Please join the conversation[1] if you are interested Thanks, Andrew [1]: https://github.com/apache/arrow-rs/pull/6852

[DISCUSS] [RUST] Minimum Supported Rust Version Policy

2025-03-15 Thread Andrew Lamb
Hello Arrow Developers / Users, We are discussing the policy for minimum supported Rust version[1]. Please comment on the ticket if you would like to participate in the discussion Thank you, Andrew [1]: https://github.com/apache/arrow-rs/issues/181

Re: [DISCUSS] Apache Arrow Meetup in Europe

2025-03-08 Thread Andrew Lamb
I think it is a great idea. We have also been discussing a smaller DataFusion style meetup in Europe[1]. We have had good luck with both shorter (2 hours) and longer (1 day) events. My experience was that space availability was the key driver, so I suggest organizing the event around the space you

Re: [DISCUSS] Apache Arrow Meetup in Europe

2025-03-10 Thread Andrew Lamb
This is great. In case it is helpful, here is more information about what we have done for DataFusion: We held most of them in spaces that we didn't pay for. We rented an office from WeWork for the first one in Austin, but otherwise the space was donated. We have used: 1. company's offices (e.g.