On Monday, 23 December 2013 at 20:48:08 UTC, John Carter wrote:
This frustrated me in Ruby unicode too....

Typically i/o is the ultimate in "untrusted and untrustworthy" sources,
coming usually from systems beyond my control.

Likely to be corrupted, or maliciously crafted, or defective...

Unfortunately, not all byte sequences are valid UTF-8.

Thus, in any collection of inputs, inevitably around 1 in a million codepoints will result in a UTFException being thrown.

Alas, I still have to run regex matches on the other 999,999 valid
codepoints.

Is there a standard recipe in std.stdio for squashing bad codepoints to some
default?

These days memory is very much larger than most files I want to scan.

So if I were doing this in C, I would typically mmap the file with PROT_READ | PROT_WRITE and MAP_PRIVATE, run down the file squashing bad codepoints,
and then run down it again matching patterns.

In Ruby I have a horridly inefficient utility....
      def IO.read_utf_8(file)
        read(file, :external_encoding=>'ASCII-8BIT').encode('UTF-8', :undef=>:replace)
      end

What is the idiomatic D solution to this conundrum?

The encoding schemes in std.encoding support cleaning up input using the sanitize function.

http://dlang.org/phobos/std_encoding.html#.EncodingScheme.sanitize

It'd be nicer if the API were range-based, but it seems to do the trick in my experience.
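
For illustration, here is a minimal sketch of that approach using the free function `sanitize` from std.encoding rather than the `EncodingScheme` class. The helper names `readSanitized` and `sanitizeBytes` are made up for this example and are not part of Phobos. It assumes the whole file fits in memory, as the original question does, and that replacing each malformed sequence with the encoding's replacement character is acceptable.

```d
import std.encoding : isValid, sanitize;
import std.file : read;
import std.regex : matchAll, regex;
import std.stdio : writeln;

/// Hypothetical helper (not in Phobos): read a whole file and replace
/// malformed UTF-8 sequences so that later decoding does not throw.
string readSanitized(string path)
{
    // std.file.read returns a freshly allocated void[]; nothing else
    // references it, so viewing it as immutable chars is safe here.
    auto raw = cast(string) read(path);
    return sanitize(raw);
}

/// Hypothetical helper: the same cleanup for bytes already in memory.
string sanitizeBytes(const(ubyte)[] bytes)
{
    // idup makes an immutable copy, which sanitize requires.
    return sanitize(cast(string) bytes.idup);
}

void main()
{
    // "foo", one byte that can never appear in UTF-8 (0xFF), then "bar".
    immutable(ubyte)[] input = [0x66, 0x6F, 0x6F, 0xFF, 0x62, 0x61, 0x72];

    auto clean = sanitizeBytes(input);
    assert(isValid(clean)); // safe to decode from here on

    // Regex matching now runs over the cleaned text without a UTFException.
    foreach (m; matchAll(clean, regex(`[a-z]+`)))
        writeln(m.hit);
}
```

According to the documentation, `sanitize` returns the original string untouched when the input is already valid, so the extra copy should only be paid for inputs that actually contain bad sequences.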
