Ángel Álvarez Pascua created SPARK-51579:
--------------------------------------------

             Summary: Spark CSV Read Low Performance: EOFExceptions in Univocity Parser
                 Key: SPARK-51579
                 URL: https://issues.apache.org/jira/browse/SPARK-51579
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 4.0.0
            Reporter: Ángel Álvarez Pascua


When Spark reads a CSV file, it strips the newline characters from each line 
before the content is parsed in the 
{{org.apache.spark.sql.catalyst.csv.UnivocityParser.parseIterator}} method.

During parsing, {{UnivocityParser}} expects each field to end with a delimiter 
or a newline character. Because the newlines have been removed, the parser 
reaches the end of its input instead and internally throws (and later ignores) 
an {{EOFException}} for every line.
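
The mechanism can be illustrated with a small self-contained sketch. This is not the Univocity or Spark code: {{EofPerLineSketch}}, {{nextChar}} and {{countFields}} are hypothetical names, and {{java.io.EOFException}} is only a stand-in for whatever exception type the parser uses internally. The sketch assumes a toy parser that ends a record at a newline and otherwise swallows the end-of-input exception.

```java
import java.io.EOFException;

public class EofPerLineSketch {
    // Number of EOFExceptions swallowed so far (for illustration only).
    static int eofCount = 0;

    // Hypothetical character source: signals exhausted input with an exception.
    static char nextChar(String input, int pos) throws EOFException {
        if (pos >= input.length()) {
            throw new EOFException();
        }
        return input.charAt(pos);
    }

    // Toy parser: counts the fields of one record. The record ends normally at
    // '\n'; if the input runs out first, the EOFException is caught and ignored.
    static int countFields(String input) {
        int fields = 1;
        int pos = 0;
        try {
            while (true) {
                char c = nextChar(input, pos++);
                if (c == ',') {
                    fields++;
                } else if (c == '\n') {
                    return fields;
                }
            }
        } catch (EOFException e) {
            eofCount++;
            return fields;
        }
    }

    public static void main(String[] args) {
        String[] lines = {"a,b,c", "1,2,3", "x,y,z"};

        for (String line : lines) {
            countFields(line); // newline already stripped
        }
        System.out.println("EOFExceptions without newline: " + eofCount); // 3

        eofCount = 0;
        for (String line : lines) {
            countFields(line + "\n"); // newline kept
        }
        System.out.println("EOFExceptions with newline: " + eofCount); // 0
    }
}
```

In this toy model, every line without a trailing newline costs one thrown and swallowed exception, while the same lines with the newline kept terminate normally.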

h4. *Impact:*
 * The repeated generation of {{EOFException}} instances is an expensive 
operation in the JVM.
 * This leads to significant performance degradation during CSV file loading.
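
As a rough illustration of why exception creation can be costly on the JVM in general, the sketch below compares an ordinary {{Exception}}, which captures a stack trace when constructed, with a hypothetical {{StacklessEof}} type that disables stack-trace capture through the protected four-argument {{Exception}} constructor. How much of this applies to the {{EOFException}} in this report depends on how that class is implemented, which is not stated here. The timings are indicative only; this is not a rigorous benchmark.

```java
public class ExceptionCostSketch {
    // Hypothetical exception type that skips stack-trace capture
    // (enableSuppression = false, writableStackTrace = false).
    static final class StacklessEof extends Exception {
        StacklessEof() {
            super(null, null, false, false);
        }
    }

    // Throws and swallows one ordinary exception per iteration.
    static long timeWithStackTrace(int iterations) {
        long start = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            try {
                throw new Exception();
            } catch (Exception ignored) {
                // swallowed, as in the scenario described above
            }
        }
        return System.nanoTime() - start;
    }

    // Same loop, but with the stackless exception type.
    static long timeStackless(int iterations) {
        long start = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            try {
                throw new StacklessEof();
            } catch (StacklessEof ignored) {
                // swallowed
            }
        }
        return System.nanoTime() - start;
    }

    public static void main(String[] args) {
        int n = 1_000_000;
        // Warm-up runs so the JIT has a chance to compile both loops.
        timeWithStackTrace(n);
        timeStackless(n);
        System.out.println("with stack trace: " + timeWithStackTrace(n) / 1_000_000 + " ms");
        System.out.println("stackless:        " + timeStackless(n) / 1_000_000 + " ms");
    }
}
```

Even when stack-trace capture is avoided, using an exception per line as control flow still adds throw and catch overhead that a normal return at a newline does not have.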

h4. *Expected Behavior:*
 * Spark should handle new line characters appropriately to prevent excessive 
exception generation.
 * Optimizing this behavior would improve overall CSV parsing performance.


