I wonder why your workaround is also slow:
```
# open the dataset once to enumerate the files in the bucket
DS <- arrow::open_dataset(sources = sourcePath)
# drop the "$folder$" marker objects (fixed = TRUE treats "$" literally)
listFiles <- DS$files[!grepl("$folder$", DS$files, fixed = TRUE)]
# reopen the dataset from the cleaned file list
DS2 <- arrow::open_dataset(sources = listFiles)
```
That was going to be my suggestion. Do you know which of the three statements takes a long time?
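A quick way to check, assuming sourcePath is defined as in your snippet, is to wrap each statement in system.time(), something like:
```
t1 <- system.time(DS <- arrow::open_dataset(sources = sourcePath))
t2 <- system.time(listFiles <- DS$files[!grepl("$folder$", DS$files, fixed = TRUE)])

# re-add the "s3://" prefix before reopening, as in your workaround
listFiles <- paste0("s3://", listFiles)

t3 <- system.time(DS2 <- arrow::open_dataset(sources = listFiles))
rbind(open_dataset = t1, filter_files = t2, reopen = t3)
```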
Looking at
https://arrow.apache.org/docs/r/reference/open_dataset.html#arg-factory-options,
it seems that `exclude_invalid_files` is slow on remote file systems
because of the cost of accessing each file up front to determine if it is
valid. And there is `selector_ignore_prefixes`, but it looks like it only matches path prefixes, so it would not catch objects whose names end in "$folder$".
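For reference, this is roughly how those two options are passed; the bucket path is a placeholder, and whether either option actually helps here is exactly the question:
```
# exclude_invalid_files opens every file up front, which is the slow part on S3
DS <- arrow::open_dataset(
  sources = "s3://some_path/",
  factory_options = list(exclude_invalid_files = TRUE)
)

# selector_ignore_prefixes filters on the path alone (no per-file reads),
# but only by prefix, not by a suffix such as "$folder$"
DS <- arrow::open_dataset(
  sources = "s3://some_path/",
  factory_options = list(selector_ignore_prefixes = c(".", "_"))
)
```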
Dear all,
I'm using the arrow package to access partitioned parquet data on an AWS S3
bucket. The structure is the typical
s3://some_path/entity=ABC/syncDate=mm-dd-/country=US/part***.snappy.parquet
Reading the files works very well using
```
DS <- arrow::open_dataset(sources = "s3://some_path/")
```
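For example, a typical query on the opened dataset looks like this (the dplyr calls are only illustrative; the column names are just the partition keys from the path above):
```
library(arrow)
library(dplyr)

DS <- open_dataset(sources = "s3://some_path/")  # hive-style partitions are detected automatically

# the partition keys (entity, syncDate, country) become ordinary columns
DS %>%
  filter(country == "US") %>%
  head(10) %>%
  collect()
```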
It's the last step that takes a lot of time.
```
DS <- arrow::open_dataset(sources = sourcePath)
listFiles <- DS$files[!grepl("$folder$", DS$files, fixed = TRUE)]
```
These two statements run very fast, but as DS$files does not contain the "s3://" prefix, I have to add it to listFiles to make the final call work, and that last open_dataset() call is the step that takes so long.
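Spelled out, the workaround looks roughly like this (paste0() is just one way to re-add the prefix):
```
DS <- arrow::open_dataset(sources = sourcePath)

# drop the "$folder$" marker objects left behind in the bucket
listFiles <- DS$files[!grepl("$folder$", DS$files, fixed = TRUE)]

# DS$files returns bucket-relative paths, so re-attach the "s3://" scheme
listFiles <- paste0("s3://", listFiles)

# it is this final call that is slow
DS2 <- arrow::open_dataset(sources = listFiles)
```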