Re: Question on R arrow package

2024-11-06 Thread Weston Pace
I wonder why your workaround is also slow:

```r
DS <- arrow::open_dataset(sources = sourcePath)
listFiles <- DS$files[!grepl("$folder$", DS$files, fixed = TRUE)]
DS2 <- arrow::open_dataset(sources = listFiles)
```

That was going to be my suggestion. Do you know which of the three statements takes a l
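One way to answer that question is to time each statement separately. A minimal sketch, assuming `sourcePath` is the `s3://` prefix from the original question (not shown in full in this thread):

```r
library(arrow)

# Time each step independently to locate the bottleneck.
# `sourcePath` is assumed to be the S3 prefix from the original question.
system.time(DS <- open_dataset(sources = sourcePath))
system.time(listFiles <- DS$files[!grepl("$folder$", DS$files, fixed = TRUE)])
system.time(DS2 <- open_dataset(sources = listFiles))
```

Each `system.time()` call prints the elapsed seconds for that statement alone, which separates listing cost from the cost of re-opening the dataset.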

Arrow-r equivalent to dplyr's separate() or separate_wider_delim()?

2024-11-06 Thread Schwing, Adam via user
Hello! I would like to take a comma-separated string and put each element in its own row. This is easy to do in the tidyverse using tidyr's separate() or separate_wider_delim() plus pivot_longer() functions. However, my dataset is very large because each string has thousands of elements and the dataset con

Re: Question on R arrow package

2024-11-06 Thread Neal Richardson
Looking at https://arrow.apache.org/docs/r/reference/open_dataset.html#arg-factory-options, it seems that `exclude_invalid_files` is slow on remote file systems because of the cost of accessing each file up front to determine if it is valid. And there is `selector_ignore_prefixes`, but it looks lik
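The two options mentioned are passed through `factory_options`; a hedged sketch of what that looks like (the path is a placeholder, and as noted, `exclude_invalid_files` is slow on S3 precisely because it opens every file up front):

```r
library(arrow)

# `exclude_invalid_files` validates every file before building the dataset,
# which is the expensive part on remote filesystems; shown here only to
# illustrate the option under discussion.
DS <- open_dataset(
  sources = "s3://some_path/",   # placeholder path
  factory_options = list(
    exclude_invalid_files = TRUE,
    # `selector_ignore_prefixes` skips files by basename *prefix*,
    # so it may not match S3 console "_$folder$" marker objects,
    # whose distinguishing text is a suffix.
    selector_ignore_prefixes = c(".")
  )
)
```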

Question on R arrow package

2024-11-06 Thread Huschto, Tony
Dear all, I'm using the arrow package to access partitioned parquet data on an AWS S3 bucket. The structure is the typical s3://some_path/entity=ABC/syncDate=mm-dd-/country=US/part***.snappy.parquet Reading the files works very well using DS <- arrow::open_dataset(sources = "s3://some_path/

Re: Question on R arrow package

2024-11-06 Thread Huschto, Tony
It's the last step that takes a lot of time.

```r
DS <- arrow::open_dataset(sources = sourcePath)
listFiles <- DS$files[!grepl("$folder$", DS$files, fixed = TRUE)]
```

runs very fast, but as DS$files does not contain the "s3://" prefix, I have to add it to listFiles in order to make the following work, and th
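The prefix fix being described might look like the following sketch (`sourcePath` and the bucket layout are assumptions carried over from earlier in the thread):

```r
library(arrow)

DS <- open_dataset(sources = sourcePath)

# Drop the S3 console's "$folder$" marker objects from the file list.
listFiles <- DS$files[!grepl("$folder$", DS$files, fixed = TRUE)]

# DS$files returns paths without the scheme, so re-add "s3://"
# before reopening the dataset from the explicit file list.
DS2 <- open_dataset(sources = paste0("s3://", listFiles))
```

Opening a dataset from an explicit vector of file paths skips directory discovery, but each file's metadata is still read, which is the suspected slow step on S3.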