Re: pyarrow write_dataset Illegal Instruction

2022-01-24 Thread Weston Pace
Your problem is probably old hardware, specifically an older CPU. Pip builds rely on popcnt (which I think arrived with SSE4.2?). I'm pretty sure you are right that you can compile from source and be ok. It's a performance / portability tradeoff that has to be made when packaging prebuilt binaries. On Mon,
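
One way to check the "older CPU" theory on a Linux machine is to look at the feature flags the kernel reports. The sketch below is only an illustration: the helper names `cpu_flags` and `missing_features` are hypothetical, and the default list of required features (`popcnt`, `sse4_2`) is an assumption based on the reply above, not a confirmed list of what the pip wheels need.

```python
def cpu_flags(cpuinfo_text):
    """Return the set of feature flags from /proc/cpuinfo-style text."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags":
            return set(value.split())
    return set()


def missing_features(flags, required=("popcnt", "sse4_2")):
    """List the required features (an assumed set) absent from flags."""
    return [name for name in required if name not in flags]


def check_this_machine(path="/proc/cpuinfo"):
    """Read the local CPU flags on Linux; return None if unavailable."""
    try:
        with open(path) as handle:
            return missing_features(cpu_flags(handle.read()))
    except OSError:
        return None  # not Linux, or the file cannot be read


if __name__ == "__main__":
    print("Missing CPU features:", check_this_machine())
```

An empty list would suggest the CPU is not the cause; a non-empty list would be consistent with the Illegal Instruction crash and point toward building from source.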

pyarrow write_dataset Illegal Instruction

2022-01-24 Thread Chris Nyland
Hello, I was just taking a look at pyarrow in my off hours. I was trying to write a partitioned data set based on the birthdays example in the pyarrow cookbook. However, when I run the script, no data is written and an "Illegal Instruction" message prints to the screen; no exception is raised. I inst

Re: PyArrow + GCSFS not loading data when using filters... and also performance

2022-01-24 Thread Kelton Halbert
Thank you very much for the helpful response, Alenka. This provides much more clarity to the partitioning system and how I should be interacting with it. I’m in the process of re-processing my dataset to use integers for the date partitioning, but still use strings for the site identifiers. I do

Re: [Python] add a new column to a table during dataset consolidation

2022-01-24 Thread Weston Pace
> You are looking for a row-wise mean, isn't it! I don't think there's an API for that pyarrow.compute.
Right, I don't think this is in there today either. The C++ compute infrastructure itself can create functions that run on record batches (instead of just arrays). An example of this is dro

Re: [Python] add a new column to a table during dataset consolidation

2022-01-24 Thread Niranda Perera
Hi Antonio, Sorry, I think I misunderstood your question. You are looking for a row-wise mean, aren't you? I don't think there's an API for that in pyarrow.compute. Sorry, my bad. You could call `add` for each column and manually create the mean (this would be a vectorized operation column-wise. But this

Re: [Python] add a new column to a table during dataset consolidation

2022-01-24 Thread Antonino Ingargiola
Hi Niranda, On Mon, Jan 24, 2022 at 2:41 PM Niranda Perera wrote:
> Did you try using `pyarrow.compute` options? Inside that batch iterator loop you can call the compute mean function and then call the add_column method for record batches.
I cannot find how to pass multiple columns to be

Re: [Python] add a new column to a table during dataset consolidation

2022-01-24 Thread Niranda Perera
Hi Antonio, Did you try using `pyarrow.compute` options? Inside that batch iterator loop you can call the compute mean function and then call the add_column method for record batches. The latest Arrow code base might have support for 'projection', which could do this without having to iterate th

[Python] add a new column to a table during dataset consolidation

2022-01-24 Thread Antonino Ingargiola
Hi list, I am looking for a way to add a new column to an existing table that is computed as the sum/mean of other columns. From the docs, I understand that pyarrow compute functions operate on arrays (i.e. columns) but I cannot find if it is possible to aggregate through columns in some way. In