Re: Reading a file with unknown encoding

Francesco Poli Sun, 09 Sep 2018 10:05:28 -0700

On Sat, 8 Sep 2018 11:11:08 -0300 Antonio Terceiro wrote:

> On Thu, Sep 06, 2018 at 12:12:57AM +0200, Francesco Poli wrote:
> > Proposed strategy
> > =================
[...]
> > 
> > What do you think?  Is the above described strategy reasonable?  Or do
> > you see a flaw which will backfire in the future?
> 
> Looks OK to me, but it also looks a little bit too cautious, and
> complex.


Hello Antonio!   :-)
Thanks a lot for your kind reply.

I am glad my code doesn't look too crazy...   ;-)

I acknowledge that my proposed strategy is not the simplest possible
one.

> In this case you only care about the lines that are uncommented
> and only contain ASCII, so you can just ignore everything else:
> 
> ----------------8<----------------8<----------------8<-----------------
> $ cat /tmp/ignore_bugs 
> 123456
> # secönd bug
> 234567
> # a package
> my-package0+
> $ cat /tmp/read_bugs.rb 
> ARGV.each do |f|
>   File.readlines(f, encoding: Encoding::BINARY).each do |line|
>     puts line if line !~ /^\s*#/
>   end
> end
> $ ruby /tmp/read_bugs.rb /tmp/ignore_bugs 
> 123456
> 234567
> my-package0+
> $ LANG=C ruby /tmp/read_bugs.rb /tmp/ignore_bugs 
> 123456
> 234567
> my-package0+
> ----------------8<----------------8<----------------8<-----------------


I must confess that I was skeptical about this simple strategy.

The reason was that I am not comfortable with the idea that the array
of ignored bugs and packages would contain strings tagged as ASCII-8BIT
encoded, that is to say, effectively tagged as binary data.


Actually, I tried to read the file with BINARY encoding in
apt-listbugs, and it seems to work (even in cases where the
"ignore_bugs" file includes non-comment lines with non-US-ASCII
characters, thus violating its format specification...).
As in:

  $ cat ignore_bugs 
  # first bug
  123456
  # secönd bug
  234567
  # a package
  my-package0+
  # an invalid line
  tëxtø

If I understand correctly, the reason why it works is that the array is
only tested through its include?() method, which basically tests for
equality. The equality operator for strings compares length and
content, and disregards the encoding tag as long as the two strings are
comparable, which is always the case for US-ASCII-only content. Hence,
for valid entries it doesn't matter if the array contains binary
strings or actual text strings: the test works anyway.
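
The behaviour described above can be checked with a small sketch. The
values below are made up for illustration and are not taken from
apt-listbugs. It assumes a Ruby version where String#b is available
(2.0 or later). Note that for strings containing non-ASCII bytes the
comparison does take the encoding tag into account, so a binary string
and a UTF-8 string with the same bytes are not equal in that case.

```ruby
# Illustrative entries, tagged as binary (ASCII-8BIT) the way they would
# be after reading the file with Encoding::BINARY.
binary_entries = ["123456", "my-package0+"].map(&:b)

# A query string tagged as UTF-8, as an ordinary text string would be.
utf8_query = "123456".encode(Encoding::UTF_8)

# Array#include? relies on ==, which succeeds for ASCII-only content
# even though the encoding tags differ.
puts binary_entries.include?(utf8_query)

# With non-ASCII bytes the two strings are no longer comparable,
# so == returns false although the bytes are identical.
non_ascii_text   = "t\u00EBxt\u00F8"
non_ascii_binary = non_ascii_text.b
puts non_ascii_binary == non_ascii_text
puts non_ascii_binary.bytes == non_ascii_text.bytes
```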

Nonetheless, I still feel uneasy with the idea of carrying an array of
binary data objects around, when the array is instead supposed to
contain strings...
Maybe I am not being crystal clear, so I don't know whether you get
what I mean.
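
To make the concern more concrete, here is a minimal sketch of one
possible way to avoid keeping binary-tagged strings around: read the
lines as binary, drop comments and anything that is not pure US-ASCII,
and re-tag what is left. The helper name retag_ascii_entries is
hypothetical and is not part of apt-listbugs; dropping invalid lines
silently is also just an assumption made for this example.

```ruby
# Hypothetical helper, for illustration only.
# Takes lines read with Encoding::BINARY and returns the valid entries
# re-tagged as US-ASCII text strings.
def retag_ascii_entries(lines)
  lines.map(&:chomp)
       .reject { |line| line.empty? || line =~ /\A\s*#/ } # skip blanks and comments
       .select(&:ascii_only?)                              # skip invalid non-ASCII lines
       .map { |line| line.force_encoding(Encoding::US_ASCII) }
end

# Sample input, tagged as binary as if read with Encoding::BINARY.
raw = ["# first bug\n", "123456\n", "t\xC3\xABxt\xC3\xB8\n", "my-package0+\n"].map(&:b)

entries = retag_ascii_entries(raw)
p entries
p entries.map(&:encoding).uniq
```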

Any other comments on this?


P.S.: I am probably annoying everyone too much, hence you are
definitely authorized to tell me "come on! stop worrying and love the
BINARY encoding!"   ;-)


P.P.S.: One last question: why "encoding: Encoding::BINARY", instead
of "external_encoding: Encoding::BINARY" ?
I thought that "encoding" was meant to set both external_encoding and
internal_encoding, as explained in
https://ruby-doc.org/core-2.5.1/IO.html#method-c-new-label-IO+Encoding
Am I misunderstanding something?
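
One way to look into this empirically is to open the same file with
each option and print what the IO object reports. This is only a
sketch: the helper name encodings_for and the sample file content are
invented for the example, and the result may depend on the Ruby version
and on Encoding.default_internal.

```ruby
require "tempfile"

# Hypothetical helper: opens the file with the given options and returns
# the external and internal encodings reported by the IO object.
def encodings_for(path, **opts)
  File.open(path, "r", **opts) { |io| [io.external_encoding, io.internal_encoding] }
end

# Throwaway sample file with one non-ASCII comment line.
sample = Tempfile.new("ignore_bugs")
sample.write("123456\n# sec\u00F6nd bug\n")
sample.close

with_encoding = encodings_for(sample.path, encoding: Encoding::BINARY)
with_external = encodings_for(sample.path, external_encoding: Encoding::BINARY)

puts "encoding:          #{with_encoding.inspect}"
puts "external_encoding: #{with_external.inspect}"

sample.unlink
```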



-- 
 http://www.inventati.org/frx/
 There's not a second to spare! To the laboratory!
..................................................... Francesco Poli .
 GnuPG key fpr == CA01 1147 9CD2 EFDF FB82  3925 3E1C 27E1 1F69 BFFE
