lam1051999 opened a new pull request, #50354: URL: https://github.com/apache/spark/pull/50354
### What changes were proposed in this pull request?

Spark's JSON reader uses `DefaultTimestampFormatter` to infer timestamps from strings when the user does not specify a timestamp pattern, which can cause regular strings to be confused with timestamps when a string contains only a Year or a Year + Month segment. This change removes the string-to-timestamp conversion when the JSON property value is in one of these formats:

- `[+-]yyyy*`
- `[+-]yyyy*-[m]m`

### Why are the changes needed?

To avoid confusion between regular strings and strings that contain only Year or Year + Month segments, as reported in this issue: https://issues.apache.org/jira/browse/SPARK-49858

### Does this PR introduce _any_ user-facing change?

Yes.

- Previous behavior: the string "23456" is inferred as a Timestamp; below is captured from Spark Scala
  <img width="783" alt="image" src="https://github.com/user-attachments/assets/9f9b82ef-1004-4761-bc22-2dcfd15affcc" />
- The issue is even worse in PySpark when pulling the result from the JVM into a Python datetime, because Python's datetime cannot handle a Year part greater than 9999
  <img width="870" alt="image" src="https://github.com/user-attachments/assets/e4f1b513-4edc-4db6-b9f7-2a01ad95b668" />

### How was this patch tested?

Unit tests are provided for the above cases.

### Was this patch authored or co-authored using generative AI tooling?

No

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For queries about this service, please contact Infrastructure at: us...@infra.apache.org
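The two formats named above can be approximated with a short sketch. Note this is a hypothetical illustration (helper name and regexes are assumptions, not the actual patterns inside Spark's JSON reader):

```python
import re

# Rough approximations of the two formats named above (assumptions):
#   [+-]yyyy*      -> optional sign, four or more year digits
#   [+-]yyyy*-[m]m -> the same year segment, then a 1- or 2-digit month
YEAR_ONLY = re.compile(r"^[+-]?\d{4,}$")
YEAR_MONTH = re.compile(r"^[+-]?\d{4,}-\d{1,2}$")

def is_ambiguous_timestamp_string(value: str) -> bool:
    """Return True for strings that, under the proposed change, would
    no longer be converted to timestamps during JSON schema inference."""
    return bool(YEAR_ONLY.match(value) or YEAR_MONTH.match(value))

print(is_ambiguous_timestamp_string("23456"))       # year-only string from the example above -> True
print(is_ambiguous_timestamp_string("2024-07"))     # year + month -> True
print(is_ambiguous_timestamp_string("2024-07-15"))  # full date, not affected -> False
```

Strings matching either pattern would stay typed as plain strings, while fuller timestamp-like values remain eligible for inference.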