mistercrunch commented on PR #34235: URL: https://github.com/apache/superset/pull/34235#issuecomment-3104642667
My take here is that we should just make things work properly with UTF-8 out of the box, but I get why we need the BOM now.

From my understanding, the issue is Excel - without the BOM signature bytes, Excel can't figure out that our UTF-8 CSV files are actually UTF-8, so it mangles any non-English characters. That's why `utf-8-sig` matters - it adds those 3 magic bytes (`\xEF\xBB\xBF`) that tell Excel "hey, this is UTF-8!"

But I still think we should avoid config complexity. A few thoughts:

- Let's just default to `utf-8-sig` for all CSV exports - it's backward compatible and fixes the Excel issue
- The BOM is valid UTF-8 and most modern tools handle it fine (unlike the bad old days)
- No need for a `CSV_EXPORT` config - this should "just work" without users having to know about BOMs

Can we simplify this to:

1. Always use `utf-8-sig` for CSV exports (handles Excel + international chars)
2. Remove the config option entirely
3. Add tests for various UTF-8 scenarios (Arabic, Chinese, etc.)

What do you think about just making this the default behavior? (while noting the change in UPDATING.md)
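To make the BOM behavior concrete, here is a minimal sketch using only the Python standard library. `export_csv_bytes` is a hypothetical helper written for illustration, not Superset's actual export code, and the sample rows are made up. It shows that `utf-8-sig` differs from plain `utf-8` only by the three leading signature bytes, which is roughly the shape a test for the Arabic/Chinese scenarios could take:

```python
import codecs
import csv
import io


def export_csv_bytes(rows, encoding="utf-8-sig"):
    """Serialize rows to CSV bytes.

    Hypothetical helper for illustration only; it is not part of Superset.
    """
    # newline="" keeps the csv module's own "\r\n" line endings untouched.
    buffer = io.StringIO(newline="")
    csv.writer(buffer).writerows(rows)
    return buffer.getvalue().encode(encoding)


# Made-up sample data with Chinese and Arabic text.
rows = [["city", "greeting"], ["北京", "مرحبا"]]

with_bom = export_csv_bytes(rows)
without_bom = export_csv_bytes(rows, encoding="utf-8")

# utf-8-sig prepends exactly the three BOM bytes and changes nothing else.
assert with_bom.startswith(codecs.BOM_UTF8)
assert with_bom[len(codecs.BOM_UTF8):] == without_bom

# A BOM-aware reader strips the signature; a plain utf-8 reader sees U+FEFF.
assert with_bom.decode("utf-8-sig") == without_bom.decode("utf-8")
assert with_bom.decode("utf-8").startswith("\ufeff")
```

The last assertion is the one caveat worth keeping in mind: a consumer that decodes the file as plain `utf-8` will see a leading `U+FEFF` character in the first header cell unless it strips it.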
