Re: extracting data from table embedded in Word document

Andrew Gaffney Sat, 20 Mar 2004 21:02:10 -0800

R. Joseph Newton wrote:

Andrew Gaffney wrote:

I'm writing a web-based client information system for a lawyer.
[OT}
Eek!!  Be very, very careful about security, please.  I trust your employer has taken 
"duty of
care" isuues into account.
[/OT]
His current client list is
in a 137 page Word document with an embedded table. I can get it into a somewhat usable
format by copying the entire table, pasting into Excel, and saving as tab delimeted 
text,
but this has its problems.
Some of the cells in the table have newlines in them. Because of this, when it's 
exported
from Excel, the 2nd line will appear in the correct field, but on a line by itself:
Row 1, Firstname, Lastname, Address, City, State, Zip, Phone,
Don't export as space,-delimited. Use CSV, which is much more portable.

That was what I wanted to do in the first place, but Word doesn't support exporting a document with a table as CSV.

Row 2, First middle, Last, addy, City, State, Zip, 555-5555
So it looks like 3 records instead of 2. Does anyone have any ideas on how to pick 
apart
the data to get it into the DB?


The problem here is more one of datqa design, specifically data normalization, than
formatting.  Neither documents nor spreadsheets are really good storage methods for 
large
scale data.  For long-term robustness, this data should probably be in a normalized 
database.
Multi-valued fields generally indicate data that should be broken out into a table of 
its own:

Believe me, I cringed when I saw this Word doc. I'm writing this script to import the data in the Word doc into a MySQL DB in order to do away with the doc. I will personally be overwriting the Word docs with /dev/zero multiple times once I'm done ;)

Now as far as the particular mehtod to use for picking the data apart, that depends. 
Is the
phone number the only  multivalued field?  If so it should be fairly straightforward.  
  It
would help if we could actually see some of the data you have to work with.  Might be 
a bit of
work, since you will have to sustitute aliases--but realistic please, for any 
identifying
information.

The phone number isn't the only field like that. Just scanning through the doc, I see this problem in most of the fields at some point in the doc.

Greetings! C:\Documents and Settings\rjnewton>perl -w
my @rows;
push @rows, $_ while <STDIN>;
chomp @rows;
my %phones;
my $current_row_tag;
my %data_rows;
foreach $row (@rows) {
   last unless $row =~ /\S/;
   if ($row =~ /^\w/) {
      my ($row_tag, $given_name, $last_name, $streetaddress, $city, $state,
      $zip, $phone) = split /,\s*/, $row;
     $phones{$row_tag} = [];
     push @{$phones{$row_tag}}, $phone;
     my $customer_data = {};
     $customer_data->{'given name'} = $given_name;
     $customer_data->{'last name'} = $last_name;
     $customer_data->{'street address'} = $streetaddress;
     $customer_data->{'City'} = $city;
     $customer_data->{'State'} = $state;
     $customer_data->{'Zip'} = $zip;
     $data_rows{$row_tag} = $customer_data;
     $current_row_tag = $row_tag;
   } else {
      $row =~ s/^\s*//;
      $row =~ s/\s*$//;
      push @{$phones{$current_row_tag}}, $row;
   }
}

foreach $client_id (sort keys %data_rows) {
   my $client = $data_rows{$client_id};
   print "$client->{'given name'} $client->{'last name'}\n";
   print "    $_\n" foreach @{$phones{$client_id}};
}

The above code won't work because it splits on a comma. A lot of the fields contain a comma somewhere within the actual data. If it was easy as that, I would have had this done long ago ;)

--
Andrew Gaffney
Network Administrator
Skyline Aeronautics, LLC.
636-357-1548


--
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
<http://learn.perl.org/> <http://learn.perl.org/first-response>

Re: extracting data from table embedded in Word document

Reply via email to