[ 
https://issues.apache.org/jira/browse/SPARK-57135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Akshat Shenoi updated SPARK-57135:
----------------------------------
    Description: 
Spark cannot currently read CSV files packaged inside tar archives (.tar, 
.tar.gz, .tgz); users must unpack them externally first.

This adds opt-in support (spark.sql.files.archive.reader.enabled, default 
false) for reading such archives through the CSV data source by streaming each 
entry through the CSV parser, without materializing entries to local disk:
 * A streaming ArchiveReader opens the tar once and yields one bounded
   InputStream per entry, advancing lazily so memory stays bounded regardless
   of archive size. Directories are skipped, along with any entry Spark's own
   file listing would filter out: dot- and underscore-prefixed names (e.g.
   ._x, .DS_Store, _SUCCESS, _committed_*) and anything under a
   dot-/underscore-prefixed directory (e.g. a leftover _temporary/). The
   filter is applied per path component, so an archive parses like a
   directory of the same files. .tar.gz is decompressed via Hadoop's codec
   factory; .tgz is gunzipped explicitly. ArchiveReader is an abstract base
   (TarArchiveReader is the only implementation today), so other archive
   formats can be added as new subclasses.
 * CSVFileFormat treats archives as non-splittable (one split per archive)
   and streams each entry through UnivocityParser, handling each entry as a
   standalone CSV file (headers, multiLine, delimiters, column pruning).
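
The streaming approach described in the list above can be approximated outside Spark with Python's standard library. The following is a minimal sketch, not the ArchiveReader implementation from this change: is_hidden, iter_csv_rows, and build_demo_archive are hypothetical names, the filter is a simplified per-component prefix check (Spark's actual listing filter may carry exceptions that are omitted here), and tarfile's "r|*" stream mode stands in for the lazily advancing reader.

```python
import csv
import io
import tarfile


def is_hidden(entry_name: str) -> bool:
    """Return True if any path component starts with '.' or '_'.

    Simplified assumption: a plain prefix check on every component.
    """
    parts = [p for p in entry_name.split("/") if p not in ("", ".")]
    return any(p.startswith((".", "_")) for p in parts)


def iter_csv_rows(archive_stream):
    """Yield (entry_name, row_dict) for each CSV row in a tar stream.

    Mode "r|*" makes tarfile treat the input as a forward-only stream
    with transparent decompression, so entries are visited in order and
    nothing is extracted to disk.
    """
    with tarfile.open(fileobj=archive_stream, mode="r|*") as tar:
        for member in tar:
            # Skip directories (and other non-regular entries) and hidden paths.
            if not member.isfile() or is_hidden(member.name):
                continue
            entry = tar.extractfile(member)  # bounded to this entry's bytes
            lines = (raw.decode("utf-8") for raw in entry)
            # Each entry is parsed as a standalone CSV file with its own header.
            for row in csv.DictReader(lines):
                yield member.name, row


def build_demo_archive() -> io.BytesIO:
    """Build a small in-memory .tar.gz purely for demonstration."""
    files = {
        "data/a.csv": "id,name\n1,x\n2,y\n",
        "data/._a.csv": "junk\n",
        "_temporary/b.csv": "id,name\n9,z\n",
        "data/c.csv": "id,name\n3,w\n",
    }
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        for name, text in files.items():
            payload = text.encode("utf-8")
            info = tarfile.TarInfo(name)
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))
    buf.seek(0)
    return buf


if __name__ == "__main__":
    for name, row in iter_csv_rows(build_demo_archive()):
        print(name, row)
```

Yielding rows rather than open entry handles keeps the consumer from holding on to an entry after the stream has moved past it, which a forward-only reader cannot rewind to.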

Scope: CSV reads over tar only. Schema inference from archives, and other file
formats (e.g. JSON, text, XML), are left to follow-ups. Streaming works only
for formats that can be parsed sequentially; formats that need random access
(e.g. Parquet/ORC footers) cannot be streamed from a tar and are out of scope.
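
To illustrate why footer-based formats cannot be streamed, here is a small sketch. It assumes a mock file that borrows only the trailer convention of Parquet (metadata, then a 4-byte little-endian metadata length, then the PAR1 magic); ForwardOnlyStream, read_footer, and mock_footer_file are hypothetical names, and no real Parquet parsing is involved.

```python
import io
import struct

MAGIC = b"PAR1"  # Parquet files begin and end with this 4-byte magic


class ForwardOnlyStream(io.RawIOBase):
    """Hypothetical stand-in for a tar entry stream: readable, not seekable."""

    def __init__(self, data: bytes):
        super().__init__()
        self._inner = io.BytesIO(data)

    def readable(self) -> bool:
        return True

    def seekable(self) -> bool:
        return False

    def readinto(self, buffer) -> int:
        chunk = self._inner.read(len(buffer))
        buffer[: len(chunk)] = chunk
        return len(chunk)


def read_footer(stream) -> bytes:
    """Footer-first read: jump to the trailer, then back to the metadata."""
    stream.seek(-8, io.SEEK_END)
    footer_len, magic = struct.unpack("<I4s", stream.read(8))
    if magic != MAGIC:
        raise ValueError("trailer magic not found")
    stream.seek(-(8 + footer_len), io.SEEK_END)
    return stream.read(footer_len)


def mock_footer_file(metadata: bytes) -> bytes:
    """Mock layout only: magic, body, metadata, metadata length, magic."""
    length = struct.pack("<I", len(metadata))
    return MAGIC + b"row-group-bytes" + metadata + length + MAGIC


if __name__ == "__main__":
    data = mock_footer_file(b"schema")
    print(read_footer(io.BytesIO(data)))  # seekable input works
    try:
        read_footer(ForwardOnlyStream(data))
    except io.UnsupportedOperation as exc:
        print("forward-only stream cannot seek:", exc)
```

A CSV parser, by contrast, only ever asks the stream for the next bytes, which is why it fits the per-entry streaming model.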

  was:
Spark cannot currently read CSV files packaged inside tar archives (.tar, 
.tar.gz, .tgz); users must unpack them externally first.

This adds opt-in support (spark.sql.files.archive.reader.enabled, default 
false) for reading such archives through the CSV data source by streaming each 
entry through the CSV parser, without materializing entries to local disk:
 * A streaming ArchiveReader opens the tar once and yields one bounded
   InputStream per entry, advancing lazily so memory stays bounded regardless
   of archive size. Directories and dot-prefixed entries are skipped. .tar.gz
   is decompressed via Hadoop's codec factory; .tgz is gunzipped explicitly.
   ArchiveReader is an abstract base (TarArchiveReader is the only
   implementation today), so other archive formats can be added as additive
   subclasses.
 * CSVFileFormat treats archives as non-splittable (one split per archive)
   and streams each entry through UnivocityParser, handling each entry as a
   standalone CSV file (headers, multiLine, delimiters, column pruning).

Scope: CSV reads over tar only. Schema inference from archives, and other file 
formats (e.g. JSON, text, XML), are left to follow-ups. Streaming supports 
formats parseable sequentially; formats needing random access (Parquet/ORC 
footers) cannot stream from a tar and are out of scope.


> [SQL] Support reading CSV files inside tar archives
> ---------------------------------------------------
>
>                 Key: SPARK-57135
>                 URL: https://issues.apache.org/jira/browse/SPARK-57135
>             Project: Spark
>          Issue Type: New Feature
>          Components: SQL
>    Affects Versions: 4.3.0
>            Reporter: Akshat Shenoi
>            Priority: Major
>              Labels: pull-request-available


