[
https://issues.apache.org/jira/browse/SPARK-57135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Akshat Shenoi updated SPARK-57135:
----------------------------------
Description:
Spark cannot currently read CSV files packaged inside tar archives (.tar,
.tar.gz, .tgz); users must unpack them externally first.
This adds opt-in support (spark.sql.files.archive.reader.enabled, default
false) for reading such archives through the CSV data source by streaming each
entry through the CSV parser, without materializing entries to local disk:
* A streaming ArchiveReader opens the tar once and yields one bounded
InputStream per entry, advancing lazily so memory stays bounded regardless of
archive size. Directories are skipped, along with any entry Spark's own file
listing would filter out: dot- and underscore-prefixed names (e.g. ._x,
.DS_Store, _SUCCESS, _committed_*) and anything under a dot-/underscore-
prefixed directory (e.g. a leftover _temporary/). The filter is applied per
path component,
so an archive parses like a directory of the same files. .tar.gz is
decompressed via Hadoop's codec factory; .tgz is gunzipped explicitly.
ArchiveReader is an abstract base (TarArchiveReader is the only implementation
today), so other archive formats can be added as additive subclasses.
* CSVFileFormat treats archives as non-splittable (one split per archive) and
streams each entry through UnivocityParser, handling each entry as a standalone
CSV file (headers, multiLine, delimiters, column pruning).
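The per-path-component filtering described in the first bullet can be sketched as follows. This is a minimal illustration in plain Python, not the Scala code from the patch; the function name is_hidden_entry is hypothetical, and Spark's real listing filter may apply further exceptions that are not modeled here.

```python
def is_hidden_entry(entry_name: str) -> bool:
    """Return True if any component of a tar entry path starts with '.' or '_'.

    Hypothetical helper: it mirrors the rule stated in the description
    (checked per path component), not Spark's exact implementation.
    """
    # Drop empty segments and the "." / ".." navigation segments, so a
    # leading "./" (common in tar entry names) is not mistaken for a dot-file.
    components = [c for c in entry_name.split("/") if c not in ("", ".", "..")]
    return any(c.startswith((".", "_")) for c in components)


if __name__ == "__main__":
    for name in ["data/part-0.csv", "_SUCCESS", "_temporary/0/part-0.csv", "data/._x"]:
        print(name, "->", "skip" if is_hidden_entry(name) else "read")
```

Checking every component, rather than only the file name, is what makes a file under a leftover _temporary/ directory get skipped even though its own name looks ordinary.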
Scope: CSV reads over tar only. Schema inference from archives, and other file
formats (e.g. JSON, text, XML), are left to follow-ups. Streaming supports
formats parseable sequentially; formats needing random access (Parquet/ORC
footers) cannot stream from a tar and are out of scope.
was:
Spark cannot currently read CSV files packaged inside tar archives (.tar,
.tar.gz, .tgz); users must unpack them externally first.
This adds opt-in support (spark.sql.files.archive.reader.enabled, default
false) for reading such archives through the CSV data source by streaming each
entry through the CSV parser, without materializing entries to local disk:
* A streaming ArchiveReader opens the tar once and yields one bounded
InputStream per entry, advancing lazily so memory
stays bounded regardless of archive size. Directories are skipped, along with
any entry Spark's own file listing would filter out — dot- and
underscore-prefixed names (e.g. ._x, .DS_Store, _SUCCESS, _committed_*) and
anything under a dot-/underscore-prefixed directory (e.g. a leftover
_temporary/) — applied per path component, so an archive parses like a
directory of the same files. .tar.gz is decompressed via Hadoop's codec
factory; .tgz is gunzipped explicitly. ArchiveReader is an abstract base
(TarArchiveReader is the only implementation today), so other archive formats
can be added as additive subclasses.
* CSVFileFormat treats archives as non-splittable (one split per archive) and
streams each entry through UnivocityParser,
handling each entry as a standalone CSV file (headers, multiLine, delimiters,
column pruning).
Scope: CSV reads over tar only. Schema inference from archives, and other file
formats (e.g. JSON, text, XML), are left to follow-ups. Streaming supports
formats parseable sequentially; formats needing random access (Parquet/ORC
footers) cannot stream from a tar and are out of scope.
> [SQL] Support reading CSV files inside tar archives
> ---------------------------------------------------
>
> Key: SPARK-57135
> URL: https://issues.apache.org/jira/browse/SPARK-57135
> Project: Spark
> Issue Type: New Feature
> Components: SQL
> Affects Versions: 4.3.0
> Reporter: Akshat Shenoi
> Priority: Major
> Labels: pull-request-available
>
> Spark cannot currently read CSV files packaged inside tar archives (.tar,
> .tar.gz, .tgz); users must unpack them externally first.
> This adds opt-in support (spark.sql.files.archive.reader.enabled, default
> false) for reading such archives through the CSV data source by streaming
> each entry through the CSV parser, without materializing entries to local
> disk:
> * A streaming ArchiveReader opens the tar once and yields one bounded
> InputStream per entry, advancing lazily so memory stays bounded regardless of
> archive size. Directories are skipped, along with any entry Spark's own file
> listing would filter out: dot- and underscore-prefixed names (e.g. ._x,
> .DS_Store, _SUCCESS, _committed_*) and anything under a dot-/underscore-
> prefixed directory (e.g. a leftover _temporary/). The filter is applied per
> path component, so an archive parses like a directory of the same files. .tar.gz
> is decompressed via Hadoop's codec factory; .tgz is gunzipped explicitly.
> ArchiveReader is an abstract base (TarArchiveReader is the only
> implementation today), so other archive formats can be added as additive
> subclasses.
> * CSVFileFormat treats archives as non-splittable (one split per archive)
> and streams each entry through UnivocityParser, handling each entry as a
> standalone CSV file (headers, multiLine, delimiters, column pruning).
> Scope: CSV reads over tar only. Schema inference from archives, and other
> file formats (e.g. JSON, text, XML), are left to follow-ups. Streaming
> supports formats parseable sequentially; formats needing random access
> (Parquet/ORC footers) cannot stream from a tar and are out of scope.