On Monday, 23 December 2013 at 20:48:08 UTC, John Carter wrote:
This frustrated me with Unicode in Ruby too....
Typically I/O is the ultimate in "untrusted and untrustworthy" sources,
coming usually from systems beyond my control.
Likely to be corrupted, or maliciously crafted, or defective...
Unfortunately not all sequences of bytes are valid UTF-8.
Thus inevitably in every collection of inputs there are always going to
be around 1 in a million codepoints that result in a UTFException being
thrown.
Alas, I always have to do regex matches on the other 999,999 valid
codepoints.....
Is there a standard recipe in stdio for squashing bad codepoints to some
default?
These days memory is very much larger than most files I want to scan.
So if I were doing this in C I would typically mmap the file with
PROT_READ | PROT_WRITE and MAP_PRIVATE, then run down the file squashing
bad codepoints, and then run down it again matching patterns.
In Ruby I have a horridly inefficient utility....
def IO.read_utf_8(file)
  read(file, :external_encoding => 'ASCII-8BIT').encode('UTF-8', :undef => :replace)
end
What is the idiomatic D solution to this conundrum?
The encoding schemes in std.encoding support cleaning up input
using the sanitize function.
http://dlang.org/phobos/std_encoding.html#.EncodingScheme.sanitize
It'd be nicer if the API were range-based, but it seems to do the
trick in my experience.
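
Something like the following is a minimal sketch of that approach,
assuming the whole file fits in memory and is supposed to be UTF-8. It
uses the free function std.encoding.sanitize rather than the
EncodingScheme method linked above; sanitizeBytes and readSanitized are
hypothetical helper names of my own, not part of Phobos.

```d
import std.encoding : sanitize;
import std.file : read;

/// Hypothetical helper: treat raw bytes as UTF-8 and replace any
/// malformed sequences so the result is a valid string.
string sanitizeBytes(immutable(ubyte)[] raw)
{
    // The cast only reinterprets the bytes; it performs no validation.
    // sanitize does the actual checking and replacement.
    return sanitize(cast(string) raw);
}

/// Hypothetical helper: read a whole file without UTF validation,
/// then sanitize it.
string readSanitized(string path)
{
    // std.file.read returns void[] and never throws a UTFException.
    return sanitizeBytes(cast(immutable(ubyte)[]) read(path));
}
```

The returned string should be safe to hand to std.regex, since malformed
sequences have been replaced (with U+FFFD for UTF-8, as far as I know).
It is still an "read everything, then clean it" approach, so it trades
memory for simplicity, much like the mmap idea in the question.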