Dear all,

I'm using the arrow package in R to access partitioned parquet data in an
AWS S3 bucket. The layout is the usual Hive-style key=value structure:

s3://some_path/entity=ABC/syncDate=mm-dd-yyyy/country=US/part***.snappy.parquet

Reading the files works very well using

DS <- arrow::open_dataset(sources = "s3://some_path/entity=ABC")
AT <- DS$NewScan()$Finish()$ToTable()
DF <- as.data.frame(AT)

But this works only if the directory tree contains nothing but the parquet
files. In some instances there are additional artifacts, e.g.

s3://some_path/entity=ABC/syncDate=mm-dd-yyyy_$folder$

which are zero-byte files. Is there any way to set up the open_dataset()
call to ignore these files? I tried the exclude_invalid_files option,
but this takes forever. I also tried to remove the irrelevant files from
DS$files, but wasn't able to modify that field. Setting up something like

DS <- arrow::open_dataset(sources = sourcePath)
listFiles <- DS$files[!grepl("$folder$", DS$files, fixed = TRUE)]
DS2 <- arrow::open_dataset(sources = listFiles)

also takes an enormous amount of time.
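
To make the filtering step concrete, here is a minimal base-R sketch of it on a plain character vector. The paths are hypothetical stand-ins for DS$files (the dates and part file names are invented for illustration), and no S3 access or arrow call is involved, so it only shows that the filter itself selects the right entries:

```r
# Hypothetical paths standing in for DS$files; names are made up for illustration.
files <- c(
  "some_path/entity=ABC/syncDate=01-01-2020/country=US/part-0000.snappy.parquet",
  "some_path/entity=ABC/syncDate=01-01-2020_$folder$",
  "some_path/entity=ABC/syncDate=01-02-2020/country=US/part-0000.snappy.parquet"
)

# fixed = TRUE matches "$" literally instead of as a regex end-of-string anchor.
is_marker <- grepl("_$folder$", files, fixed = TRUE)

# Equivalent check for this layout: keep only paths ending in ".parquet".
is_parquet <- endsWith(files, ".parquet")

# Drop the zero-byte marker entries.
parquet_files <- files[!is_marker]
print(parquet_files)
```

So the filter itself does what I want; the slow part seems to be the two open_dataset() calls around it, not the grepl() step.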

Any help is greatly appreciated!

Thanks,
Tony

*Dr. Tony Huschto*
Data Scientist

Roche Diabetes Care GmbH
DSRIBA
Sandhofer Strasse 116
68305 Mannheim/Germany

Phone: +4962175969845
Mobile: +4915236987520
mailto:tony.husc...@roche.com <tony.husc...@roche.com>