Dear all,

I'm using the arrow package to access partitioned Parquet data on an AWS S3 bucket. The structure is the typical
    s3://some_path/entity=ABC/syncDate=mm-dd-yyyy/country=US/part***.snappy.parquet

Reading the files works very well using

    DS <- arrow::open_dataset(sources = "s3://some_path/entity=ABC")
    AT <- DS$NewScan()$Finish()$ToTable()
    DF <- as.data.frame(AT)

But this works only if the structure contains nothing but the Parquet files. In some instances there are additional artifacts, e.g.

    s3://some_path/entity=ABC/syncDate=mm-dd-yyyy_$folder$

which are files of size 0.

Is there any way to set up the open_dataset() command to ignore these files? I tried the exclude_invalid_files option, but this takes forever. Furthermore, I tried to eliminate the irrelevant files from DS$files, but wasn't able to modify this field. Setting up something like

    DS <- arrow::open_dataset(sources = sourcePath)
    listFiles <- DS$files[!grepl("$folder$", DS$files, fixed = TRUE)]
    DS2 <- arrow::open_dataset(sources = listFiles)

also takes an enormous amount of time.

Any help is greatly appreciated!

Thanks,
Tony

Dr. Tony Huschto
Data Scientist
Roche Diabetes Care GmbH