Ángel Álvarez Pascua created SPARK-51579:
--------------------------------------------

             Summary: Spark CSV Read Low Performance: EOFExceptions in Univocity Parser
                 Key: SPARK-51579
                 URL: https://issues.apache.org/jira/browse/SPARK-51579
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 4.0.0
            Reporter: Ángel Álvarez Pascua


When Spark reads a CSV file, it drops the newline characters before parsing the content in the {{org.apache.spark.sql.catalyst.csv.UnivocityParser.parseIterator}} method. During parsing, {{UnivocityParser}} expects either a delimiter or a newline character. However, with the newlines removed, it internally throws (and later ignores) an {{EOFException}} for each line.

h4. *Impact:*
 * The repeated generation of {{EOFException}} instances is an expensive operation in the JVM.
 * This leads to significant performance degradation during CSV file loading.

h4. *Expected Behavior:*
 * Spark should handle newline characters appropriately to prevent excessive exception generation.
 * Optimizing this behavior would improve overall CSV parsing performance.
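
The per-line exception pattern described in this report can be illustrated with a small, self-contained sketch. It does not use Spark or univocity-parsers code: {{CharInput}}, {{parseRecord}} and {{countEofExceptions}} are hypothetical names invented for this example, and the reader is only assumed to signal end of input by throwing, as the report says the real parser does. The sketch counts how many {{EOFException}} instances are raised when each line is parsed with and without its trailing newline.

```java
import java.io.EOFException;
import java.util.List;

public class EofPerLineDemo {

    /** Hypothetical reader that signals end of input by throwing (not a real univocity class). */
    static final class CharInput {
        private final String data;
        private int pos = 0;

        CharInput(String data) {
            this.data = data;
        }

        char next() throws EOFException {
            if (pos >= data.length()) {
                throw new EOFException(); // allocates an exception and captures a stack trace
            }
            return data.charAt(pos++);
        }
    }

    /** Parses one record and returns its field count; a swallowed EOF is tallied in eofCounter[0]. */
    static int parseRecord(CharInput in, int[] eofCounter) {
        int fields = 1;
        try {
            while (true) {
                char c = in.next();
                if (c == '\n') {
                    return fields; // normal end of record, no exception involved
                }
                if (c == ',') {
                    fields++;
                }
            }
        } catch (EOFException e) {
            eofCounter[0]++; // input ended before a newline was seen
            return fields;
        }
    }

    /** Parses each line as its own input, optionally restoring the trailing newline. */
    static int countEofExceptions(List<String> lines, boolean keepNewline) {
        int[] eof = {0};
        for (String line : lines) {
            parseRecord(new CharInput(keepNewline ? line + "\n" : line), eof);
        }
        return eof[0];
    }

    public static void main(String[] args) {
        List<String> lines = List.of("a,b,c", "1,2,3", "4,5,6");
        System.out.println(countEofExceptions(lines, false)); // prints 3: one exception per line
        System.out.println(countEofExceptions(lines, true));  // prints 0: the newline ends each record
    }
}
```

In this toy model the number of exceptions grows linearly with the number of lines when newlines are stripped, and drops to zero when each record keeps its terminator. Whether re-appending the newline is the right fix inside Spark is not something this sketch establishes; it only shows why the exception count scales with the line count.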