Re: mmap only, read data later?

2022-05-10 Thread Antoine Pitrou
On 10/05/2022 at 04:36, Andrew Piskorski wrote: On Mon, May 09, 2022 at 07:00:47PM +0200, Antoine Pitrou wrote: Generally, the Arrow IPC file/stream formats are designed for large data. If you have many very small files you might try to rethink how you store your data on disk. Ah. Is thi

Re: [DISCUSS] "Naming" the Arrow C++ execution engine subproject?

2022-05-10 Thread Antoine Pitrou
Do we have to give it a particular name at all? Most of the C++ subcomponents simply have a description ("the datasets layer", etc.). There are probably more important topics to spend our time on. Regards Antoine. On 09/05/2022 at 21:44, Ian Cook wrote: Reflecting on this discussion si

Re: [DISC][Release] More control on Release Candidates commits

2022-05-10 Thread Raul Cumplido
Thanks for the feedback, Krisztián! Lots of good insights on the current release process. I can see that you were already taking actions towards the process I was describing. I will write some notes to update the current process to reflect that on the Release documentation [1] and will share. I s

Re: [DISC][Release] More control on Release Candidates commits

2022-05-10 Thread Antoine Pitrou
On 10/05/2022 at 13:27, Raul Cumplido wrote: I still think there is some value in standardising the "feature freeze" on new release candidates once a first release candidate has been created and only add required fixes for the follow up RCs. What I would like to avoid with that is rushing bi

Re: [DISCUSS] "Naming" the Arrow C++ execution engine subproject?

2022-05-10 Thread Will Jones
I think it is important to give the C++ execution engine a separate name, as has been said by Wes and Jacques. Two reasons for that IMO: 1. The more things we lend the Arrow brand outside of the format, the harder it becomes for outside users to grasp what "Arrow" is. 2. Giving the C++ engine a n

PyArrow builds but fails to load pyarrow._dataset

2022-05-10 Thread Yaron Gvili
Hello, I ran into a problem with running PyArrow that I locally built. The build worked fine (or so it seems) but then the testing procedure had a failure due to not being able to load pyarrow._dataset, which I manually confirmed. I'd appreciate any guidance on how to fix this error. Below are

Re: PyArrow builds but fails to load pyarrow._dataset

2022-05-10 Thread Weston Pace
I think you need to add: export PYARROW_WITH_DATASET=1 On Tue, May 10, 2022 at 7:07 AM Yaron Gvili wrote: > > Hello, > > I ran into a problem with running PyArrow that I locally built. The build > worked fine (or so it seems) but then the testing procedure had a failure due > to not being

Re: PyArrow builds but fails to load pyarrow._dataset

2022-05-10 Thread Niranda Perera
Hi Yaron, Does `import pyarrow` work? On Tue, May 10, 2022 at 1:07 PM Yaron Gvili wrote: > Hello, > > I ran into a problem with running PyArrow that I locally built. The build > worked fine (or so it seems) but then the testing procedure had a failure > due to not being able to load pyarrow._da

Re: PyArrow builds but fails to load pyarrow._dataset

2022-05-10 Thread Antoine Pitrou
That said, tests which require should be skipped gracefully instead of failing. On 10/05/2022 at 19:13, Weston Pace wrote: I think you need to add: export PYARROW_WITH_DATASET=1 On Tue, May 10, 2022 at 7:07 AM Yaron Gvili wrote: Hello, I ran into a problem with running PyArrow

Re: [Rust] DataFusion 8.0.0 release

2022-05-10 Thread Andy Grove
I am planning on cutting a release candidate later this week. I have 2 PRs related to release prep work that I would like to get merged prior to that: - https://github.com/apache/arrow-datafusion/pull/2479 - https://github.com/apache/arrow-datafusion/pull/2495 I also have these PRs for new featu

Re: PyArrow builds but fails to load pyarrow._dataset

2022-05-10 Thread Antoine Pitrou
On 10/05/2022 at 19:16, Antoine Pitrou wrote: That said, tests which require should be skipped gracefully instead of failing. Oops... some words got swallowed: tests which require *the dataset module* should be skipped gracefully instead of failing. On 10/05/2022 at 19:13, Weston P

Re: PyArrow builds but fails to load pyarrow._dataset

2022-05-10 Thread Yaron Gvili
> Does `import pyarrow` work? Yes. Also, all but one unit test succeeded: = short test summary info == FAILED pyarrow/t

Re: Arrow C-Data and DuckDB

2022-05-10 Thread David Li
For discussion I've put up https://github.com/apache/arrow/pull/13115 to add this for the C data/stream interfaces. On Mon, May 9, 2022, at 15:42, Antoine Pitrou wrote: > On 09/05/2022 at 20:28, Tomek Drabas wrote: >> I am new to this board so please, let me know if any of this doesn't make >>

Re: PyArrow builds but fails to load pyarrow._dataset

2022-05-10 Thread Yaron Gvili
> I think you need to add: > > export PYARROW_WITH_DATASET=1 This worked, thanks. I think the documentation [1] may need to be fixed to clarify that DATASET is also an optional component. [1] https://arrow.apache.org/docs/developers/python.html#build-and-test Yaron. _
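A quick way to confirm the outcome described in this thread is to check, after building, whether the optional extension module can actually be found. The sketch below uses only the standard library; `has_module` and `dataset_flag_set` are hypothetical helper names, and checking the environment variable at run time is merely a convenience for illustration (the build reads `PYARROW_WITH_DATASET` at build time).

```python
import importlib.util
import os


def has_module(name: str) -> bool:
    """Return True if the module `name` can be located on this install."""
    try:
        return importlib.util.find_spec(name) is not None
    except (ImportError, ValueError):
        # Raised when the parent package itself cannot be imported.
        return False


def dataset_flag_set(env=os.environ) -> bool:
    """Return True if PYARROW_WITH_DATASET=1 is present in the given environment."""
    return env.get("PYARROW_WITH_DATASET") == "1"


if __name__ == "__main__":
    print("PYARROW_WITH_DATASET=1 exported:", dataset_flag_set())
    print("pyarrow._dataset importable:  ", has_module("pyarrow._dataset"))
```

If the second line prints `False` on a local build, the dataset component was most likely not enabled when PyArrow was compiled.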

Re: [DISCUSS] "Naming" the Arrow C++ execution engine subproject?

2022-05-10 Thread Wes McKinney
A couple of other names derived from the Ace- vibe: Acero ("steel" or sometimes "sword" in Spanish but apparently also "maple" in Italian). Also rhymes with Arrow, but not sure if this is good or bad. Acera ("pavement" or "sidewalk" in Spanish) On Tue, May 10, 2022 at 9:53 AM Will Jones wrote:

[C++] Code style and lint question

2022-05-10 Thread Li Jin
Hello! I am trying to fix C++ code style & lint for my PR. Currently I am running "archery lint --cpplint --clang-format --clang-tidy --fix" and encountered 2 issues: 1. File /home/icexelloss/workspace/arrow/cpp/src/arrow/compute/exec/concurrent_bounded_queue.h failed C++/CLI lint check: Uses

Re: [DISCUSS] "Naming" the Arrow C++ execution engine subproject?

2022-05-10 Thread Will Jones
"Acero" has a nice ring to it. Almost as if you said "ACE Arrow" really fast. And maybe the steel / iron meaning gives a sort of close-to-metal vibe (similar to what Rust's name evokes), though I'm not a Spanish speaker with a meaningful understanding of the words' connotations. On Tue, May 10,

Re: [DISCUSS] "Naming" the Arrow C++ execution engine subproject?

2022-05-10 Thread Eduardo Ponce
As a Spanish-speaking person, I cannot think of a misleading or bad connotation for the word "acero". The word is generally used to refer to either steel materials (actual definition) or as a simile/metaphor comparing to something very strong. We can view this as a self-laud on the robust and power

Re: [DISCUSS] "Naming" the Arrow C++ execution engine subproject?

2022-05-10 Thread Andy Grove
I like Acero too. I like it because (as a non-Spanish speaker, at least) it has no obvious meaning or connotation and once the community starts to use this name for the project, that is the meaning that it will come to have. Just like Gandiva (a word I was not familiar with when I learned about the

Re: [C++] Code style and lint question

2022-05-10 Thread Weston Pace
1. You are not allowed to include <mutex> in any public header file. It has something to do with Windows (I forget the details). If you can move all use of mutex into the implementation that works. Sometimes we have to use the pimpl pattern to make this happen. Another alternative is to include "arrow/ut

Re: mmap only, read data later?

2022-05-10 Thread Weston Pace
If you are reading this as a dataset, and you are not partitioning on your disk, then it is going to read the entire content of every file, because there is no statistics-based partitioning currently enabled with IPC files. If you have some kind of filter, and you can partition your data on the sa
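To illustrate why on-disk partitioning helps here, the following sketch shows how a reader can discard whole files from their paths alone, before opening any of them. It is a simplified standard-library model, not Arrow's implementation; the hive-style `key=value` directory naming and the example file paths are assumptions made for the illustration.

```python
from pathlib import PurePosixPath


def partition_keys(path: str) -> dict:
    """Parse `key=value` directory segments from a relative file path."""
    keys = {}
    for part in PurePosixPath(path).parts[:-1]:  # skip the file name itself
        if "=" in part:
            key, value = part.split("=", 1)
            keys[key] = value
    return keys


def prune(paths, **wanted):
    """Keep only the files whose partition keys match every requested value."""
    return [
        p for p in paths
        if all(partition_keys(p).get(k) == v for k, v in wanted.items())
    ]


# Hypothetical layout: one directory level per partition column.
paths = [
    "year=2021/month=12/part-0.arrow",
    "year=2022/month=04/part-0.arrow",
    "year=2022/month=05/part-0.arrow",
]

# A filter on the partition columns selects files without reading their contents.
selected = prune(paths, year="2022", month="05")
```

Without such a layout, every file has to be read in full and filtered afterwards, which is the situation the message describes for unpartitioned IPC files.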

Re: [DISC][Release] More control on Release Candidates commits

2022-05-10 Thread Sutou Kouhei
Hi, In "Re: [DISC][Release] More control on Release Candidates commits" on Tue, 10 May 2022 13:27:09 +0200, Raul Cumplido wrote: > I still think there is some value in standardising the "feature freeze" on > new release candidates once a first release candidate has been created and > only

Re: RFC: Out of Process Python UDFs in Arrow Compute

2022-05-10 Thread Vibhatha Abeykoon
@Li appreciate your thoughts on these important pieces. Let me walk through one by one. > Numeric code written in numpy/pandas + some relational logic (e.g., > np.where to select rows). People like this type of UDFs because they are > very familiar with pandas/numpy and can be immediately product