[ 
https://issues.apache.org/jira/browse/SPARK-50616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17906907#comment-17906907
 ] 

Yang Jie commented on SPARK-50616:
----------------------------------

This is an unresolved issue, so there is no need to fill in the 'Fix Version/s' 
for now. Therefore, I have removed its content.

> Add File Extension Option to CSV DataSource Writer
> --------------------------------------------------
>
>                 Key: SPARK-50616
>                 URL: https://issues.apache.org/jira/browse/SPARK-50616
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>    Affects Versions: 3.5.3
>            Reporter: James Baugh
>            Priority: Minor
>
> h3. What changes were proposed in this pull request?
> The existing CSV DataSource allows one to set the delimiter/separator but 
> does not allow changing the file extension. This means that a file can 
> have values separated by tabs but still be marked as a ".csv" file. This 
> change allows one to set the file extension to match the 
> delimiter/separator (e.g. ".tsv" for a tab-separated value file).
> PR: [https://github.com/apache/spark/pull/49233]
> h3. Why are the changes needed?
> This PR adds an option to set the fileExtension. As a result, when the 
> separator is not a comma, the output file can be given a file extension 
> that matches the separator (e.g. file.tsv, file.psv, etc.).
> Notes on Previous Pull Request 
> [#17973|https://github.com/apache/spark/pull/17973]
> A pull request adding this option was discussed 7 years ago. One reason it 
> wasn't added was:
> "I would like to suggest to leave this out if there is no better reason for 
> now. Downside of this is, it looks this allows arbitrary name and it does not 
> gurantee the extention is, say, tsv when the delmiter is a tab. It is purely 
> up to the user."
> I don't believe this is a good reason not to let the user set the 
> extension. If we let them set the delimiter/separator to an arbitrary 
> string/char, then why not also let them set the file extension to indicate 
> the separator that the file uses (e.g. tsv, psv, etc.)? This addition keeps 
> "csv" as the default file extension and allows the extension to match 
> other separators.
> h3. Does this PR introduce _any_ user-facing change?
> Yes. This PR adds one row to the options table for the CSV DataSource 
> documentation to include the "fileExtension" option.
> h3. How was this patch tested?
> One unit test was added to validate a file is written with the new extension.
> h3. Was this patch authored or co-authored using generative AI tooling?
> No
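
To make the proposal above concrete outside Spark, here is a minimal sketch in plain Python. It is not Spark code and not the implementation in the linked PR. The helper name `write_delimited` and its parameters are hypothetical and chosen only for illustration; `file_extension` mirrors the "fileExtension" option named in the issue, and "csv" stays the default, as the description says.

```python
import csv
from pathlib import Path


def write_delimited(rows, out_dir, name="part-00000", sep=",", file_extension="csv"):
    """Write rows using `sep`; the caller picks the extension (default "csv").

    Hypothetical helper for illustration only. As the quoted objection notes,
    nothing forces the extension to agree with the separator; it is purely
    up to the caller.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / f"{name}.{file_extension}"
    # newline="" lets the csv module control line endings itself.
    with path.open("w", newline="", encoding="utf-8") as handle:
        csv.writer(handle, delimiter=sep).writerows(rows)
    return path
```

Calling `write_delimited(rows, "out", sep="\t", file_extension="tsv")` would write `out/part-00000.tsv`, while omitting both arguments keeps the comma separator and the ".csv" name.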



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
