How about adding "-Ebinary" to the shebang line?

--- read_ignore_bugs.rb.orig  2018-09-06 09:33:26.000000000 +0900
+++ read_ignore_bugs.rb  2018-09-06 09:29:51.000000000 +0900
@@ -1,4 +1,4 @@
-#!/usr/bin/ruby
+#!/usr/bin/ruby -Ebinary
 
 p ["Default external encoding:", Encoding.default_external]
 puts "========="

On Thu, 6 Sep 2018 at 7:15, Francesco Poli <invernom...@paranoici.org> wrote:
> Hello Debian Ruby experts,
> I have a question related to encodings in Ruby.
> Maybe the question is a better fit for Ruby language mailing lists, but,
> since the issue arises in apt-listbugs (which is a Debian native
> package) and you are all nice, helpful and knowledgeable, I thought I
> could ask here...
>
>
> Description of the issue
> ========================
>
> apt-listbugs reads a file ("ignore_bugs") where some bug numbers
> and/or package names are written, along with comments beginning
> with the '#' character.
>
> A generic file in the same format could look like:
>
>   $ cat ignore_bugs
>   # first bug
>   123456
>   # secönd bug
>   234567
>   # a package
>   my-package0+
>
> This file is usually encoded in the same encoding used by the
> environment where apt-listbugs runs, so there's no special encoding
> issue.
>
>   $ file ignore_bugs
>   ignore_bugs: UTF-8 Unicode text
>
> The code that reads this file is similar to the following
> minimal example script (except for the "p" debug statements,
> of course):
>
>   $ cat read_ignore_bugs.rb
>   #!/usr/bin/ruby
>
>   p ["Default external encoding:", Encoding.default_external]
>   puts "========="
>
>   noncomments = []
>
>   open("ignore_bugs").each { |line|
>     p [line.encoding, line]
>     if /^\s*#/ =~ line
>       next
>     end
>     if /^\s*(\S+)/ =~ line
>       noncomments << $1
>     end
>   }
>
>   puts "========="
>   noncomments.each { |elem|
>     p [elem.encoding, elem]
>   }
>
> Running this script in a UTF-8 locale does not pose any issues:
>
>   $ ./read_ignore_bugs.rb
>   ["Default external encoding:", #<Encoding:UTF-8>]
>   =========
>   [#<Encoding:UTF-8>, "# first bug\n"]
>   [#<Encoding:UTF-8>, "123456\n"]
>   [#<Encoding:UTF-8>, "# secönd bug\n"]
>   [#<Encoding:UTF-8>, "234567\n"]
>   [#<Encoding:UTF-8>, "# a package\n"]
>   [#<Encoding:UTF-8>, "my-package0+\n"]
>   =========
>   [#<Encoding:UTF-8>, "123456"]
>   [#<Encoding:UTF-8>, "234567"]
>   [#<Encoding:UTF-8>, "my-package0+"]
>
> However, there may be unusual cases where the file is written in one
> encoding, but then read by apt-listbugs in an environment with
> different locale settings, implying a different default external
> encoding.
> For instance, the file may be encoded in UTF-8 (either because it was
> written by hand with an editor running in a UTF-8 locale, or because it
> was written by apt-listbugs, when running in a UTF-8 locale), but then
> read by a subsequent execution of apt-listbugs in a US-ASCII locale
> (maybe because LC_ALL=C was set).
> This encoding mismatch may cause an ArgumentError to be raised, if the
> file contains a character that is an invalid byte sequence in the
> current default external encoding.
>
>   $ LC_ALL=C ./read_ignore_bugs.rb
>   ["Default external encoding:", #<Encoding:US-ASCII>]
>   =========
>   [#<Encoding:US-ASCII>, "# first bug\n"]
>   [#<Encoding:US-ASCII>, "123456\n"]
>   [#<Encoding:US-ASCII>, "# sec\xC3\xB6nd bug\n"]
>   Traceback (most recent call last):
>           2: from ./read_ignore_bugs.rb:8:in `<main>'
>           1: from ./read_ignore_bugs.rb:8:in `each'
>   ./read_ignore_bugs.rb:10:in `block in <main>': invalid byte sequence in US-ASCII (ArgumentError)
>
>
> The problem is that the actual encoding of the file is unknown and
> unpredictable...
>
>
> Proposed strategy
> =================
>
> I've been thinking about a way to prevent apt-listbugs from
> barfing in those unusual cases.
>
> Since the non-US-ASCII characters, if present at all, will be in
> the comment lines (assuming the format of the file is valid!),
> it does not really matter much whether apt-listbugs is able to
> correctly represent those non-US-ASCII characters.
> The comment lines will be skipped as soon as they are detected as such.
>
> Hence I thought I could do the following:
>
>   $ cat read_ignore_bugs_encode.rb
>   #!/usr/bin/ruby
>
>   p ["Default external encoding:", Encoding.default_external]
>   puts "========="
>
>   noncomments = []
>
>   open("ignore_bugs").each { |line|
>     enc = line.encode(Encoding.default_external, undef: :replace, invalid: :replace)
>     p [line.encoding, line, enc.encoding, enc]
>     if /^\s*#/ =~ enc
>       next
>     end
>     if /^\s*(\S+)/ =~ enc
>       noncomments << $1
>     end
>   }
>
>   puts "========="
>   noncomments.each { |elem|
>     p [elem.encoding, elem]
>   }
>
>
> This seems to work normally, when run in the same locale where the
> "ignore_bugs" file was created:
>
>   $ ./read_ignore_bugs_encode.rb
>   ["Default external encoding:", #<Encoding:UTF-8>]
>   =========
>   [#<Encoding:UTF-8>, "# first bug\n", #<Encoding:UTF-8>, "# first bug\n"]
>   [#<Encoding:UTF-8>, "123456\n", #<Encoding:UTF-8>, "123456\n"]
>   [#<Encoding:UTF-8>, "# secönd bug\n", #<Encoding:UTF-8>, "# secönd bug\n"]
>   [#<Encoding:UTF-8>, "234567\n", #<Encoding:UTF-8>, "234567\n"]
>   [#<Encoding:UTF-8>, "# a package\n", #<Encoding:UTF-8>, "# a package\n"]
>   [#<Encoding:UTF-8>, "my-package0+\n", #<Encoding:UTF-8>, "my-package0+\n"]
>   =========
>   [#<Encoding:UTF-8>, "123456"]
>   [#<Encoding:UTF-8>, "234567"]
>   [#<Encoding:UTF-8>, "my-package0+"]
>
> but also when run in a more limited locale:
>
>   $ LC_ALL=C ./read_ignore_bugs_encode.rb
>   ["Default external encoding:", #<Encoding:US-ASCII>]
>   =========
>   [#<Encoding:US-ASCII>, "# first bug\n", #<Encoding:US-ASCII>, "# first bug\n"]
>   [#<Encoding:US-ASCII>, "123456\n", #<Encoding:US-ASCII>, "123456\n"]
>   [#<Encoding:US-ASCII>, "# sec\xC3\xB6nd bug\n", #<Encoding:US-ASCII>, "# sec??nd bug\n"]
>   [#<Encoding:US-ASCII>, "234567\n", #<Encoding:US-ASCII>, "234567\n"]
>   [#<Encoding:US-ASCII>, "# a package\n", #<Encoding:US-ASCII>, "# a package\n"]
>   [#<Encoding:US-ASCII>, "my-package0+\n", #<Encoding:US-ASCII>, "my-package0+\n"]
>   =========
>   [#<Encoding:US-ASCII>, "123456"]
>   [#<Encoding:US-ASCII>, "234567"]
>   [#<Encoding:US-ASCII>, "my-package0+"]
>
>
> What do you think?
> Is the strategy described above reasonable?
> Or do you see a flaw which will backfire in the future?
>
> Thanks for reading so far and for any help you may provide!
>
>
> P.S.: Please Cc me on replies, as I am not subscribed to the list.
> Thanks for your understanding!
>
> --
>  http://www.inventati.org/frx/
>  There's not a second to spare! To the laboratory!
> ..................................................... Francesco Poli .
>  GnuPG key fpr == CA01 1147 9CD2 EFDF FB82 3925 3E1C 27E1 1F69 BFFE
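
For what it's worth, here is a minimal sketch of why the "-Ebinary" idea at the top of this message should avoid the ArgumentError. It is an illustration under stated assumptions, not the apt-listbugs code: the helper name noncomment_token and the constant RAW_LINE are hypothetical, and the line is built in memory with force_encoding instead of being read from "ignore_bugs". The point is that a string tagged as binary (ASCII-8BIT) is valid whatever bytes it holds, so the two regexps from the script can be applied to it safely.

```ruby
# Bytes of the UTF-8 comment line from the example file, tagged as binary.
RAW_LINE = "# sec\xC3\xB6nd bug\n".b.freeze

# Hypothetical helper mirroring the filtering logic of the script:
# returns nil for comment or empty lines, the first token otherwise.
def noncomment_token(line)
  return nil if /^\s*#/ =~ line
  line[/^\s*(\S+)/, 1]
end

# Tagged as US-ASCII (what LC_ALL=C yields), the bytes are not a valid
# string, which is the situation that produced the traceback.
ascii_line = RAW_LINE.dup.force_encoding(Encoding::US_ASCII)
p ascii_line.valid_encoding?    # false

begin
  noncomment_token(ascii_line)
rescue ArgumentError => e
  puts "US-ASCII: #{e.message}"
end

# Tagged as binary, the same bytes are always valid, so the comment line
# is simply recognised as a comment and skipped.
binary_line = RAW_LINE.dup.force_encoding(Encoding::BINARY)
p binary_line.valid_encoding?   # true
p noncomment_token(binary_line) # nil

# Non-comment lines still yield their token; it is tagged ASCII-8BIT.
token = noncomment_token("123456\n".b)
p [token.encoding, token]
```

One thing to keep in mind with this approach: the extracted tokens carry the ASCII-8BIT encoding. As long as bug numbers and package names are plain ASCII, they should still compare equal to ordinary strings with the same content, but anything that prints or concatenates them with non-ASCII text may need an explicit force_encoding.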