Re: [racket-users] Racket APIs doesn't strip Byte Order Mark (BOM) when reading strings from UTF-8 files

Matthew Flatt Thu, 22 Jun 2017 15:34:49 -0700

At Thu, 22 Jun 2017 14:59:35 -0700 (PDT), kay wrote:
> There are files that start with a BOM. When reading from those files, it seems
> that none of Racket's I/O APIs know to strip the BOM, making them extremely
> difficult to work with.


I think that's the normal choice for UTF-8 readers. (Just to make sure,
I typed "UTF-8 BOM <language>" for a few <language>s and got the same
answer each time.) Apparently, there's some question of what the
standard recommends for readers, although it clearly recommends against
a useless BOM for UTF-8 writers.

Some languages/libraries provide an encoding to strip a BOM from UTF-8,
and selecting that encoding is analogous to using `reencode-input-port`
in Racket. There's not a convenient encoding for that purpose in
`iconv`, though, which is what `reencode-input-port` uses.

So, instead of changing the port's encoding, I recommend just
discarding a BOM match at the start of the port:

  ;; Consume a leading U+FEFF if present; otherwise leave the port untouched.
  (define (discard-bom p)
    (void (regexp-try-match #rx"^\uFEFF" p)))

Used like this:

  (define port (open-input-file #:mode 'text "test.txt"))
  (discard-bom port)
  (define line (read-line port))
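
To check the helper without creating a file, something like the following sketch should work. It feeds `discard-bom` two in-memory byte ports, one starting with the UTF-8 encoding of the BOM (bytes EF BB BF) and one without. The port contents and the names `with-bom`, `without-bom`, `line-a`, and `line-b` are made up for illustration:

```racket
#lang racket/base

;; Same helper as above: consume a leading U+FEFF if present.
;; On a failed match, regexp-try-match reads nothing from the port.
(define (discard-bom p)
  (void (regexp-try-match #rx"^\uFEFF" p)))

;; Hypothetical inputs standing in for files on disk.
;; \357\273\277 is the UTF-8 encoding of U+FEFF, written in octal.
(define with-bom (open-input-bytes #"\357\273\277hello\n"))
(define without-bom (open-input-bytes #"hello\n"))

(discard-bom with-bom)
(discard-bom without-bom)

;; Read the first line from each port after the BOM check.
(define line-a (read-line with-bom))
(define line-b (read-line without-bom))
```

Both `line-a` and `line-b` should come out as "hello", since the match is anchored at the start of the port and a port with no BOM is left as it was.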
