On Thu, Sep 06, 2018 at 12:12:57AM +0200, Francesco Poli wrote:
> Proposed strategy
> =================
>
> I've been thinking about a way to prevent apt-listbugs from
> barfing in those unusual cases.
>
> Since the non US-ASCII characters, if present at all, will be in
> the comment lines (assuming the format of the file is valid!),
> it does not really matter much whether apt-listbugs is able to
> correctly represent those non US-ASCII characters.
> The comment lines will be skipped, as soon as detected as such.
>
> Hence I thought I could do the following:
>
>   $ cat read_ignore_bugs_encode.rb
>   #!/usr/bin/ruby
>
>   p ["Default external encoding:", Encoding.default_external]
>   puts "========="
>
>   noncomments = []
>
>   open("ignore_bugs").each { |line|
>     enc = line.encode(Encoding.default_external,
>                       undef: :replace, invalid: :replace)
>     p [line.encoding, line, enc.encoding, enc]
>     if /^\s*#/ =~ enc
>       next
>     end
>     if /^\s*(\S+)/ =~ enc
>       noncomments << $1
>     end
>   }
>
>   puts "========="
>   noncomments.each { |elem|
>     p [elem.encoding, elem]
>   }
>
>
> This seems to work normally, when run in the same locale where the
> "ignore_bugs" file was created:
>
>   $ ./read_ignore_bugs_encode.rb
>   ["Default external encoding:", #<Encoding:UTF-8>]
>   =========
>   [#<Encoding:UTF-8>, "# first bug\n", #<Encoding:UTF-8>, "# first bug\n"]
>   [#<Encoding:UTF-8>, "123456\n", #<Encoding:UTF-8>, "123456\n"]
>   [#<Encoding:UTF-8>, "# secönd bug\n", #<Encoding:UTF-8>, "# secönd bug\n"]
>   [#<Encoding:UTF-8>, "234567\n", #<Encoding:UTF-8>, "234567\n"]
>   [#<Encoding:UTF-8>, "# a package\n", #<Encoding:UTF-8>, "# a package\n"]
>   [#<Encoding:UTF-8>, "my-package0+\n", #<Encoding:UTF-8>, "my-package0+\n"]
>   =========
>   [#<Encoding:UTF-8>, "123456"]
>   [#<Encoding:UTF-8>, "234567"]
>   [#<Encoding:UTF-8>, "my-package0+"]
>
> but also when run in a more limited locale:
>
>   $ LC_ALL=C ./read_ignore_bugs_encode.rb
>   ["Default external encoding:", #<Encoding:US-ASCII>]
>   =========
>   [#<Encoding:US-ASCII>, "# first bug\n", #<Encoding:US-ASCII>, "# first bug\n"]
>   [#<Encoding:US-ASCII>, "123456\n", #<Encoding:US-ASCII>, "123456\n"]
>   [#<Encoding:US-ASCII>, "# sec\xC3\xB6nd bug\n", #<Encoding:US-ASCII>, "# sec??nd bug\n"]
>   [#<Encoding:US-ASCII>, "234567\n", #<Encoding:US-ASCII>, "234567\n"]
>   [#<Encoding:US-ASCII>, "# a package\n", #<Encoding:US-ASCII>, "# a package\n"]
>   [#<Encoding:US-ASCII>, "my-package0+\n", #<Encoding:US-ASCII>, "my-package0+\n"]
>   =========
>   [#<Encoding:US-ASCII>, "123456"]
>   [#<Encoding:US-ASCII>, "234567"]
>   [#<Encoding:US-ASCII>, "my-package0+"]
>
>
> What do you think? Is the above described strategy reasonable? Or do
> you see a flaw which will backfire in the future?

Looks OK to me, but it also looks a little bit too cautious, and
complex. In this case you only care about the lines that are uncommented
and only contain ASCII, so you can just ignore everything else:

----------------8<----------------8<----------------8<-----------------
$ cat /tmp/ignore_bugs
123456
# secönd bug
234567
# a package
my-package0+
$ cat /tmp/read_bugs.rb
ARGV.each do |f|
  File.readlines(f, encoding: Encoding::BINARY).each do |line|
    puts line if line !~ /^\s*#/
  end
end
$ ruby /tmp/read_bugs.rb /tmp/ignore_bugs
123456
234567
my-package0+
$ LANG=C ruby /tmp/read_bugs.rb /tmp/ignore_bugs
123456
234567
my-package0+
----------------8<----------------8<----------------8<-----------------
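
If you want the collected entries rather than printed lines, the same
binary-read idea could be wrapped roughly as follows. This is only a
sketch: read_ignore_entries is a made-up helper name, not anything that
exists in apt-listbugs, and the temporary file merely stands in for the
real ignore_bugs file. It keeps the first whitespace-delimited token of
each non-comment line, like the /^\s*(\S+)/ match in your script.

```ruby
require "tempfile"

# Hypothetical helper (not part of apt-listbugs): returns the first
# whitespace-delimited token of every non-comment, non-blank line.
def read_ignore_entries(path)
  entries = []
  # Read as BINARY (ASCII-8BIT): no byte sequence counts as invalid, so
  # the regexp matches below behave the same way in any locale.
  File.readlines(path, encoding: Encoding::BINARY).each do |line|
    next if line =~ /^\s*#/              # skip comment lines
    entries << $1 if line =~ /^\s*(\S+)/ # keep the bug number or package name
  end
  entries
end

# Demonstration with a temporary file whose comment contains UTF-8 bytes.
Tempfile.create("ignore_bugs") do |f|
  File.binwrite(f.path, "123456\n# sec\xC3\xB6nd bug\n234567\n# a package\nmy-package0+\n")
  p read_ignore_entries(f.path)
  # prints ["123456", "234567", "my-package0+"]
end
```

The returned strings are tagged as binary, which should be harmless as
long as the entries themselves are pure ASCII, as assumed above.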