Working with files of different character encodings

Doug Cacialli Sat, 03 Apr 2010 12:49:23 -0700

Thanks to a lot of help from the list and hours of reading up on Unicode,
the Encode module, and many posts on PerlMonks, I've come up with a
hideous solution for processing text files with different character
encodings.


Can someone please explain why this first block of code works when
decoding .txt files of different character encoding types:

#!/usr/bin/perl
use strict;
use warnings;
use Encode::Guess;

print "\nPlease specify the file path: ";
my $datapath = <STDIN>;
$datapath =~ s/^\s+//;
$datapath =~ s/\s+$//;
open (my $filehndl , "<", "$datapath") ||
        die ("Can't open .txt file $datapath. Exiting program.\n\n");
binmode($filehndl);
if (read($filehndl, my $filestrt, 500))
{
        my $enc = guess_encoding($filestrt);
        if (ref($enc))
        {
                my $enc_name = $enc->name;
                #my $encoding = find_encoding("$enc_name");
                open (my $filehdl2 , "<:encoding($enc_name)" , "$datapath");
                while (my $line = <$filehdl2>)
                {
                        #my $line = $encoding->decode($string);
                        #my $line = decode("$enc_name", $string);
                        chomp $line;
                        my @words = split / /, $line;
                        my $nr_words = @words;
                        print "\n$line\n";
                        print "The line above has " . scalar @words . " occurrences of something.\n";
                }
                close ($filehdl2);
        }
}
close ($filehndl);

But this second block generates the following error:
UTF16: Unrecognised BOM 6100 at /usr/lib/perl/5.10//Encode.pm line 162, <$filehndl> line 1.

#!/usr/bin/perl
use strict;
use warnings;

use Encode;
use Encode::Guess;

print "\nPlease specify the file path: ";
my $datapath = <STDIN>;
$datapath =~ s/^\s+//;
$datapath =~ s/\s+$//;
open (my $filehndl , "<", "$datapath") ||
        die ("Can't open .txt file $datapath. Exiting program.\n\n");
binmode($filehndl);
if (read($filehndl, my $filestrt, 500))
{
        my $enc = guess_encoding($filestrt);
        if (ref($enc))
        {
                my $enc_name = $enc->name;
                while (my $line = decode("$enc_name", <$filehndl>))
                {
                        chomp $line;
                        my @words = split / /, $line;
                        my $nr_words = @words;
                        print "\n$line\n";
                        print "The line above has " . scalar @words . " occurrences of something.\n";
                }
        }
}
close ($filehndl);

Otherwise, can someone suggest a more elegant way of accomplishing
this?  It doesn't seem like I should have to open the file twice, as
I'm doing in the first block.  I can't figure out any way around that,
though.

Thanks for any help!

-Doug.

===
Douglas Cacialli, M.A. - Doctoral candidate
Clinical Psychology Training Program
University of Nebraska-Lincoln
Lincoln, Nebraska 68588-0308
===
