Re: Reading a file with unknown encoding

akira yamada Wed, 05 Sep 2018 17:45:37 -0700

How about adding "-Ebinary" to the shebang line?

--- read_ignore_bugs.rb.orig 2018-09-06 09:33:26.000000000 +0900
+++ read_ignore_bugs.rb 2018-09-06 09:29:51.000000000 +0900
@@ -1,4 +1,4 @@
-#!/usr/bin/ruby
+#!/usr/bin/ruby -Ebinary


 p ["Default external encoding:", Encoding.default_external]
 puts "========="


On Thu, 6 Sep 2018 at 7:15, Francesco Poli <invernom...@paranoici.org> wrote:

> Hello Debian Ruby experts,
> I have a question related to encodings in Ruby.
> Maybe the question is more fit for Ruby language mailing lists, but,
> since the issue arises in apt-listbugs (which is a Debian native
> package) and you are all nice, helpful and knowledgeable, I thought I
> could ask here...
>
>
> Description of the issue
> ========================
>
> apt-listbugs reads a file ("ignore_bugs") where some bug numbers
> and/or package names are written, along with comments beginning
> with the '#' character.
>
> A generic file in the same format could look like:
>
>   $ cat ignore_bugs
>   # first bug
>   123456
>   # secönd bug
>   234567
>   # a package
>   my-package0+
>
> This file is usually encoded in the same encoding used by the environment
> where apt-listbugs runs, so there's no special encoding issue.
>
>   $ file ignore_bugs
>   ignore_bugs: UTF-8 Unicode text
>
> The code that reads this file is similar to the following
> minimal example script (except for the "p" debug statements,
> of course):
>
>   $ cat read_ignore_bugs.rb
>   #!/usr/bin/ruby
>
>   p ["Default external encoding:", Encoding.default_external]
>   puts "========="
>
>   noncomments = []
>
>   open("ignore_bugs").each { |line|
>     p [line.encoding, line]
>     if /^\s*#/ =~ line
>       next
>     end
>     if /^\s*(\S+)/ =~ line
>       noncomments << $1
>     end
>   }
>
>   puts "========="
>   noncomments.each { |elem|
>     p [elem.encoding, elem]
>   }
>
> Running this script in a UTF-8 locale does not pose any issues:
>
>   $ ./read_ignore_bugs.rb
>   ["Default external encoding:", #<Encoding:UTF-8>]
>   =========
>   [#<Encoding:UTF-8>, "# first bug\n"]
>   [#<Encoding:UTF-8>, "123456\n"]
>   [#<Encoding:UTF-8>, "# secönd bug\n"]
>   [#<Encoding:UTF-8>, "234567\n"]
>   [#<Encoding:UTF-8>, "# a package\n"]
>   [#<Encoding:UTF-8>, "my-package0+\n"]
>   =========
>   [#<Encoding:UTF-8>, "123456"]
>   [#<Encoding:UTF-8>, "234567"]
>   [#<Encoding:UTF-8>, "my-package0+"]
>
> However, there may be unusual cases where the file is written in one
> encoding, but then read by apt-listbugs in an environment with
> different locale settings, implying a different default external
> encoding.
> For instance, the file may be encoded in UTF-8 (either because it was
> written by hand with an editor running in a UTF-8 locale, or because it
> was written by apt-listbugs, when running in a UTF-8 locale), but then
> read by a subsequent execution of apt-listbugs in a US-ASCII locale
> (maybe because LC_ALL=C was set).
> This encoding mismatch may cause an ArgumentError to be raised, if some
> character is found in the file that is an invalid byte sequence in the
> current default external encoding.
>
>   $ LC_ALL=C ./read_ignore_bugs.rb
>   ["Default external encoding:", #<Encoding:US-ASCII>]
>   =========
>   [#<Encoding:US-ASCII>, "# first bug\n"]
>   [#<Encoding:US-ASCII>, "123456\n"]
>   [#<Encoding:US-ASCII>, "# sec\xC3\xB6nd bug\n"]
>   Traceback (most recent call last):
>           2: from ./read_ignore_bugs.rb:8:in `<main>'
>           1: from ./read_ignore_bugs.rb:8:in `each'
>   ./read_ignore_bugs.rb:10:in `block in <main>': invalid byte sequence in US-ASCII (ArgumentError)
>
>
> The problem is that the actual encoding of the file is unknown and
> unpredictable...
>
>
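
The failure can be reproduced without any file or locale change. A
minimal sketch, assuming only that the bytes of the "secönd" line are
tagged as US-ASCII (which is what reading under LC_ALL=C does): the
string is not valid in its tagged encoding, so the regexp match raises,
while the same bytes retagged as binary match fine.

```ruby
# Bytes of the UTF-8 line "# secönd bug\n", tagged as US-ASCII the way a
# read under LC_ALL=C would tag them.
line = "# sec\xC3\xB6nd bug\n".b.force_encoding(Encoding::US_ASCII)

p line.valid_encoding?

begin
  /^\s*#/ =~ line
rescue ArgumentError => e
  # Same error class and message as in the traceback quoted above.
  puts "#{e.class}: #{e.message}"
end

# The same bytes, retagged as binary, are always valid and can be matched.
p(/^\s*#/ =~ line.b)
```
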
> Proposed strategy
> =================
>
> I've been thinking about a way to prevent apt-listbugs from
> barfing in those unusual cases.
>
> Since the non-US-ASCII characters, if present at all, will be in
> the comment lines (assuming the format of the file is valid!),
> it does not really matter much whether apt-listbugs is able to
> correctly represent those non-US-ASCII characters.
> The comment lines will be skipped, as soon as detected as such.
>
> Hence I thought I could do the following:
>
>   $ cat read_ignore_bugs_encode.rb
>   #!/usr/bin/ruby
>
>   p ["Default external encoding:", Encoding.default_external]
>   puts "========="
>
>   noncomments = []
>
>   open("ignore_bugs").each { |line|
>     enc = line.encode(Encoding.default_external, undef: :replace, invalid: :replace)
>     p [line.encoding, line, enc.encoding, enc]
>     if /^\s*#/ =~ enc
>       next
>     end
>     if /^\s*(\S+)/ =~ enc
>       noncomments << $1
>     end
>   }
>
>   puts "========="
>   noncomments.each { |elem|
>     p [elem.encoding, elem]
>   }
>
>
> This seems to work normally, when run in the same locale where the
> "ignore_bugs" file was created:
>
>   $ ./read_ignore_bugs_encode.rb
>   ["Default external encoding:", #<Encoding:UTF-8>]
>   =========
>   [#<Encoding:UTF-8>, "# first bug\n", #<Encoding:UTF-8>, "# first bug\n"]
>   [#<Encoding:UTF-8>, "123456\n", #<Encoding:UTF-8>, "123456\n"]
>   [#<Encoding:UTF-8>, "# secönd bug\n", #<Encoding:UTF-8>, "# secönd bug\n"]
>   [#<Encoding:UTF-8>, "234567\n", #<Encoding:UTF-8>, "234567\n"]
>   [#<Encoding:UTF-8>, "# a package\n", #<Encoding:UTF-8>, "# a package\n"]
>   [#<Encoding:UTF-8>, "my-package0+\n", #<Encoding:UTF-8>, "my-package0+\n"]
>   =========
>   [#<Encoding:UTF-8>, "123456"]
>   [#<Encoding:UTF-8>, "234567"]
>   [#<Encoding:UTF-8>, "my-package0+"]
>
> but also when run in a more limited locale:
>
>   $ LC_ALL=C ./read_ignore_bugs_encode.rb
>   ["Default external encoding:", #<Encoding:US-ASCII>]
>   =========
>   [#<Encoding:US-ASCII>, "# first bug\n", #<Encoding:US-ASCII>, "# first bug\n"]
>   [#<Encoding:US-ASCII>, "123456\n", #<Encoding:US-ASCII>, "123456\n"]
>   [#<Encoding:US-ASCII>, "# sec\xC3\xB6nd bug\n", #<Encoding:US-ASCII>, "# sec??nd bug\n"]
>   [#<Encoding:US-ASCII>, "234567\n", #<Encoding:US-ASCII>, "234567\n"]
>   [#<Encoding:US-ASCII>, "# a package\n", #<Encoding:US-ASCII>, "# a package\n"]
>   [#<Encoding:US-ASCII>, "my-package0+\n", #<Encoding:US-ASCII>, "my-package0+\n"]
>   =========
>   [#<Encoding:US-ASCII>, "123456"]
>   [#<Encoding:US-ASCII>, "234567"]
>   [#<Encoding:US-ASCII>, "my-package0+"]
>
>
> What do you think?
> Is the above described strategy reasonable?
> Or do you see a flaw which will backfire in the future?
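
For comparison, the replace-invalid-bytes idea can also be written with
String#scrub (available since Ruby 2.1), which exists for this purpose.
This is a swapped-in alternative to the encode call above, not the
code from the message; noncomments_from is a hypothetical helper name,
and the sample lines are retagged as US-ASCII by hand to imitate a
read under LC_ALL=C.

```ruby
# Hypothetical helper: first token of every non-comment line, given lines
# whose bytes may be invalid in their tagged encoding.
def noncomments_from(lines)
  lines.each_with_object([]) do |line, acc|
    clean = line.scrub("?")             # replace invalid bytes, keep the encoding
    next if /^\s*#/ =~ clean            # skip comment lines
    acc << $1 if /^\s*(\S+)/ =~ clean   # keep the first token
  end
end

raw = "# first bug\n123456\n# sec\xC3\xB6nd bug\n234567\n"
# Imitate LC_ALL=C: same bytes, but tagged as US-ASCII.
lines = raw.lines.map { |l| l.b.force_encoding(Encoding::US_ASCII) }

noncomments_from(lines).each { |elem| p [elem.encoding, elem] }
```

Either way the replaced bytes only ever appear in comment lines, which
are discarded, provided the file format is valid.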
>
> Thanks for reading so far and for any help you may provide!
>
>
> P.S.: Please Cc me on replies, as I am not subscribed to the list.
>       Thanks for your understanding!
>
> --
>  http://www.inventati.org/frx/
>  There's not a second to spare! To the laboratory!
> ..................................................... Francesco Poli .
>  GnuPG key fpr == CA01 1147 9CD2 EFDF FB82  3925 3E1C 27E1 1F69 BFFE
>
