[jira] [Updated] (SPARK-57135) [SQL] Support reading CSV files inside tar archives

Akshat Shenoi (Jira) Mon, 08 Jun 2026 11:32:14 -0700


     [ 
https://issues.apache.org/jira/browse/SPARK-57135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Akshat Shenoi updated SPARK-57135:
----------------------------------
    Description: 
Spark cannot currently read CSV files packaged inside tar archives (.tar, 
.tar.gz, .tgz); users must unpack them externally first.

This adds opt-in support (spark.sql.files.archive.reader.enabled, default 
false) for reading such archives through the CSV data source by streaming each 
entry through the CSV parser, without materializing entries to local disk:
 * A streaming ArchiveReader opens the tar once and yields one bounded 
InputStream per entry, advancing lazily so memory stays bounded regardless of 
archive size. Directories are skipped, along with any entry Spark's own file 
listing would filter out: dot- and underscore-prefixed names (e.g. ._x, 
.DS_Store, _SUCCESS, _committed_*) and anything under a dot-/underscore-prefixed 
directory (e.g. a leftover _temporary/). The filter is applied per path component, 
so an archive parses like a directory of the same files. .tar.gz is 
decompressed via Hadoop's codec factory; .tgz is gunzipped explicitly. 
ArchiveReader is an abstract base (TarArchiveReader is the only implementation 
today), so other archive formats can be added later as new subclasses.
 * CSVFileFormat treats archives as non-splittable (one split per archive) and 
streams each entry through UnivocityParser, handling each entry as a standalone 
CSV file (headers, multiLine, delimiters, column pruning).

Scope: CSV reads over tar only. Schema inference from archives, and other file 
formats (e.g. JSON, text, XML), are left to follow-ups. Streaming works only for 
formats that can be parsed sequentially; formats that need random access 
(Parquet/ORC footers) cannot be streamed from a tar and are out of scope.
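
For orientation, a read with the feature enabled might look like the PySpark 
sketch below. Only the configuration key comes from this issue. The helper 
name read_csv_archive is hypothetical, and the sketch assumes the flag can be 
set on a running session and that the archive path is passed to the ordinary 
CSV reader. An explicit schema is supplied because schema inference from 
archives is out of scope.

```python
ARCHIVE_READER_CONF = "spark.sql.files.archive.reader.enabled"


def read_csv_archive(spark, path, schema):
    """Read the CSV entries of a tar archive using an existing SparkSession.

    Assumption: the flag is settable at runtime; the issue only states its
    name and its default (false).
    """
    spark.conf.set(ARCHIVE_READER_CONF, "true")
    return (
        spark.read
        .schema(schema)            # explicit schema, since inference is a follow-up
        .option("header", "true")  # each entry is treated as a standalone CSV file
        .csv(path)
    )
```

A call such as read_csv_archive(spark, "/data/events.tar.gz", "id INT, name 
STRING") would then return a DataFrame, assuming a Spark build that includes 
this change; the path and schema here are placeholders.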


> [SQL] Support reading CSV files inside tar archives
> ---------------------------------------------------
>
>                 Key: SPARK-57135
>                 URL: https://issues.apache.org/jira/browse/SPARK-57135
>             Project: Spark
>          Issue Type: New Feature
>          Components: SQL
>    Affects Versions: 4.3.0
>            Reporter: Akshat Shenoi
>            Priority: Major
>              Labels: pull-request-available
>


