Re: extracting data from table embedded in Word document

R. Joseph Newton Sat, 20 Mar 2004 18:43:54 -0800

Andrew Gaffney wrote:

> I'm writing a web-based client information system for a lawyer.


[OT}
Eek!!  Be very, very careful about security, please.  I trust your employer has taken 
"duty of
care" isuues into account.
[/OT]

> His current client list is
> in a 137 page Word document with an embedded table. I can get it into a somewhat 
> usable
> format by copying the entire table, pasting into Excel, and saving as tab delimeted 
> text,
> but this has its problems.
>
> Some of the cells in the table have newlines in them. Because of this, when it's 
> exported
> from Excel, the 2nd line will appear in the correct field, but on a line by itself:
>
> Row 1,   Firstname,       Lastname,        Address, City,    State,   Zip,     Phone,
>                                                                         AnotherPhone

Don't export as space,-delimited.  Use CSV, which is much more portable.

>
> Row 2,   First middle,          Last,           addy,    City,    State,   Zip,     
> 555-5555
>
> So it looks like 3 records instead of 2. Does anyone have any ideas on how to pick 
> apart
> the data to get it into the DB?
>
> --
> Andrew Gaffney

Hi Andrew,

The problem here is more one of datqa design, specifically data normalization, than
formatting.  Neither documents nor spreadsheets are really good storage methods for 
large
scale data.  For long-term robustness, this data should probably be in a normalized 
database.
Multi-valued fields generally indicate data that should be broken out into a table of 
its own:

Assuming some unique ID for each client [*not* the name; names do not guarantee 
uniquesness],
such a table could be very simple.

client_id auto_number unique primary key
client_phone character[20]
primary_number boolean
I'm a little rusty on my SQL today, so the types I list above are logical types.  You 
will
have to check your SQL reference to find the appropriate available types.  These can 
easily be
linked with simple SQL for presentation purposes.  This structure also better supports
working with the data of the fields broken out this way.

I know this isn't a very Perl-ish response, but if you are going to rake on the task of
working with this data anyway, you might as well do it right, and build something that 
will
hold up over time.  Then you can use querying tools or speadheets to prepare any needed
presentation format.

Now as far as the particular mehtod to use for picking the data apart, that depends. 
Is the
phone number the only  multivalued field?  If so it should be fairly straightforward.  
  It
would help if we could actually see some of the data you have to work with.  Might be 
a bit of
work, since you will have to sustitute aliases--but realistic please, for any 
identifying
information.


Greetings! C:\Documents and Settings\rjnewton>perl -w
my @rows;
push @rows, $_ while <STDIN>;
chomp @rows;
my %phones;
my $current_row_tag;
my %data_rows;
foreach $row (@rows) {
   last unless $row =~ /\S/;
   if ($row =~ /^\w/) {
      my ($row_tag, $given_name, $last_name, $streetaddress, $city, $state,
      $zip, $phone) = split /,\s*/, $row;
     $phones{$row_tag} = [];
     push @{$phones{$row_tag}}, $phone;
     my $customer_data = {};
     $customer_data->{'given name'} = $given_name;
     $customer_data->{'last name'} = $last_name;
     $customer_data->{'street address'} = $streetaddress;
     $customer_data->{'City'} = $city;
     $customer_data->{'State'} = $state;
     $customer_data->{'Zip'} = $zip;
     $data_rows{$row_tag} = $customer_data;
     $current_row_tag = $row_tag;
   } else {
      $row =~ s/^\s*//;
      $row =~ s/\s*$//;
      push @{$phones{$current_row_tag}}, $row;
   }
}

foreach $client_id (sort keys %data_rows) {
   my $client = $data_rows{$client_id};
   print "$client->{'given name'} $client->{'last name'}\n";
   print "    $_\n" foreach @{$phones{$client_id}};
}

^Z
Row 1,   Firstname,       Lastname,        Address, City,    State,   Zip,     Phone,
                                                                        AnotherPhone
Row 2,   First middle,          Last,           addy,    City,    State,   Zip,
    555-5555

^Z
Firstname Lastname
    Phone
    AnotherPhone
First middle Last
    555-5555

HTH,

Joseph


-- 
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
<http://learn.perl.org/> <http://learn.perl.org/first-response>

Re: extracting data from table embedded in Word document

Reply via email to