Re: Suppressing UTFException / Squashing Bad Codepoints?

Brad Anderson Mon, 23 Dec 2013 13:57:16 -0800

On Monday, 23 December 2013 at 20:48:08 UTC, John Carter wrote:

This frustrated me in Ruby unicode too....
Typically i/o is the ultimate in "untrusted and untrustworthy" sources,
coming usually from systems beyond my control.

Likely to be corrupted, or maliciously crafted, or defective...

Unfortunately not all sequences of bytes are valid UTF8.
Thus inevitably in every collection of inputs there are always going to be
around 1 in a million codepoints resulting in a UTFException being thrown.
Alas, I always have to do Regex matches on the other 999999 valid
codepoints.....
Is there a standard recipe in stdio for squashing bad codepoints to some
default?
These days memory is very much larger than most files I want to scan.
So if I was doing this in C I would typically mmap the file
PROT_READ | PROT_WRITE and MAP_PRIVATE, then run down the file squashing
bad codepoints and then run down it again matching patterns.

In Ruby I have a horridly inefficient utility....
      def IO.read_utf_8(file)
        read(file,:external_encoding=>'ASCII-8BIT').encode('UTF-8',:undef=>:replace)
      end

What is the idiomatic D solution to this conundrum?

The encoding schemes in std.encoding support cleaning up input using the sanitize function.


http://dlang.org/phobos/std_encoding.html#.EncodingScheme.sanitize

It'd be nicer if the API were range based but it seems to do the trick in my experience.
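
For illustration, here is a minimal sketch of how that might look using the free function sanitize from std.encoding. The helper name readSanitized is made up for this example and is not part of Phobos; it assumes the file is meant to be UTF-8 and is small enough to read into memory in one go.

```d
import std.encoding : isValid, sanitize;
import std.file : read;

/// Hypothetical helper (not part of Phobos): read a whole file and
/// replace every malformed UTF-8 sequence with a replacement character.
string readSanitized(string path)
{
    // std.file.read returns void[]; reinterpret the raw bytes as UTF-8
    // code units without validating them.
    auto raw = cast(string) read(path);

    // sanitize returns the input unchanged if it is already valid,
    // otherwise a new string with the bad sequences replaced.
    return sanitize(raw);
}
```

The result should then be safe to hand to std.regex or any other code that decodes the string, since the bad sequences are gone before anything tries to decode them. Note that this reads and possibly copies the whole file, so it is closer in spirit to the Ruby helper above than to the mmap approach.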
