[ https://issues.apache.org/jira/browse/ARROW-4723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17661745#comment-17661745 ]
Rok Mihevc commented on ARROW-4723:
-----------------------------------

This issue has been migrated to [issue #16000|https://github.com/apache/arrow/issues/16000] on GitHub. Please see the [migration documentation|https://github.com/apache/arrow/issues/14542] for further details.

> [Python] Skip _files when reading a directory containing parquet files
> ----------------------------------------------------------------------
>
> Key: ARROW-4723
> URL: https://issues.apache.org/jira/browse/ARROW-4723
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Reporter: Hossein Falaki
> Assignee: Hyukjin Kwon
> Priority: Major
> Labels: parquet, pull-request-available
> Fix For: 0.14.0
>
> Time Spent: 1.5h
> Remaining Estimate: 0h
>
> It is common for Apache Spark or other big data platforms to save additional
> meta-data files denoted with _ when saving parquet data.
> When using {{make_batch_reader}} to load a directory saved by parquet
> containing such files we encounter the following error:
> {code:java}
> PetastormMetadataError Traceback (most recent call last)
> /databricks/python/lib/python3.6/site-packages/petastorm/etl/dataset_metadata.py
> in infer_or_load_unischema(dataset)
> 388 try:
> --> 389 return get_schema(dataset)
> 390 except PetastormMetadataError:
> /databricks/python/lib/python3.6/site-packages/petastorm/etl/dataset_metadata.py
> in get_schema(dataset)
> 342 raise PetastormMetadataError(
> --> 343 'Could not find _common_metadata file. Use materialize_dataset(..)
> in'
> 344 ' petastorm.etl.dataset_metadata.py to generate this file in your ETL
> code.'
> PetastormMetadataError: Could not find _common_metadata file. Use
> materialize_dataset(..) in petastorm.etl.dataset_metadata.py to generate this
> file in your ETL code. You can generate it on an existing dataset using
> petastorm-generate-metadata.py{code}
>
> This is because our Runtime stores the following two files at the end of the
> job:
> {code:java}
> dbfs:/tmp/petastorm/_committed_4686077819843716563
> _committed_4686077819843716563 1965
> dbfs:/tmp/petastorm/_started_4686077819843716563{code}
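
For illustration only, the behaviour requested in the issue title (skipping `_`-prefixed files when reading a directory of parquet files) might be sketched as below. This is not the change that was made in Arrow for this issue. The helper name `list_data_files` and the `ignored_prefixes` parameter are hypothetical, and the sketch uses only the Python standard library, so it lists candidate files but does not read any parquet data.

```python
import os
import tempfile


def list_data_files(directory, ignored_prefixes=("_",)):
    """Return sorted paths of regular files in `directory`.

    Hypothetical helper: files whose names start with any of
    `ignored_prefixes` (for example `_committed_...` or `_started_...`
    marker files) are skipped. Subdirectories are not descended into.
    """
    paths = []
    for name in sorted(os.listdir(directory)):
        full_path = os.path.join(directory, name)
        if not os.path.isfile(full_path):
            continue  # skip subdirectories and other non-files
        if name.startswith(tuple(ignored_prefixes)):
            continue  # skip marker/metadata files such as _committed_*
        paths.append(full_path)
    return paths


if __name__ == "__main__":
    # Build a throwaway directory that mimics the layout from the report.
    # The file names here are made up for the demonstration.
    with tempfile.TemporaryDirectory() as tmp:
        for name in ("part-00000.parquet", "_committed_123", "_started_123"):
            with open(os.path.join(tmp, name), "w"):
                pass
        print([os.path.basename(p) for p in list_data_files(tmp)])
```

Only the underscore prefix is filtered by default because that is the convention the report describes; other prefixes could be passed through `ignored_prefixes` if a platform uses them.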