Andrew Gaffney wrote:
I'm writing a web-based client information system for a lawyer.
[OT} Eek!! Be very, very careful about security, please. I trust your employer has taken "duty of care" isuues into account. [/OT]
His current client list is in a 137 page Word document with an embedded table. I can get it into a somewhat usable format by copying the entire table, pasting into Excel, and saving as tab delimeted text, but this has its problems.
Some of the cells in the table have newlines in them. Because of this, when it's exported from Excel, the 2nd line will appear in the correct field, but on a line by itself:
Row 1, Firstname, Lastname, Address, City, State, Zip, Phone,
Don't export as space,-delimited. Use CSV, which is much more portable.
That was what I wanted to do in the first place, but Word doesn't support exporting a document with a table as CSV.
Row 2, First middle, Last, addy, City, State, Zip, 555-5555
So it looks like 3 records instead of 2. Does anyone have any ideas on how to pick apart the data to get it into the DB?
The problem here is more one of datqa design, specifically data normalization, than formatting. Neither documents nor spreadsheets are really good storage methods for large scale data. For long-term robustness, this data should probably be in a normalized database. Multi-valued fields generally indicate data that should be broken out into a table of its own:
Believe me, I cringed when I saw this Word doc. I'm writing this script to import the data in the Word doc into a MySQL DB in order to do away with the doc. I will personally be overwriting the Word docs with /dev/zero multiple times once I'm done ;)
Now as far as the particular mehtod to use for picking the data apart, that depends. Is the phone number the only multivalued field? If so it should be fairly straightforward. It would help if we could actually see some of the data you have to work with. Might be a bit of work, since you will have to sustitute aliases--but realistic please, for any identifying information.
The phone number isn't the only field like that. Just scanning through the doc, I see this problem in most of the fields at some point in the doc.
Greetings! C:\Documents and Settings\rjnewton>perl -w my @rows; push @rows, $_ while <STDIN>; chomp @rows; my %phones; my $current_row_tag; my %data_rows; foreach $row (@rows) { last unless $row =~ /\S/; if ($row =~ /^\w/) { my ($row_tag, $given_name, $last_name, $streetaddress, $city, $state, $zip, $phone) = split /,\s*/, $row; $phones{$row_tag} = []; push @{$phones{$row_tag}}, $phone; my $customer_data = {}; $customer_data->{'given name'} = $given_name; $customer_data->{'last name'} = $last_name; $customer_data->{'street address'} = $streetaddress; $customer_data->{'City'} = $city; $customer_data->{'State'} = $state; $customer_data->{'Zip'} = $zip; $data_rows{$row_tag} = $customer_data; $current_row_tag = $row_tag; } else { $row =~ s/^\s*//; $row =~ s/\s*$//; push @{$phones{$current_row_tag}}, $row; } }
foreach $client_id (sort keys %data_rows) { my $client = $data_rows{$client_id}; print "$client->{'given name'} $client->{'last name'}\n"; print " $_\n" foreach @{$phones{$client_id}}; }
The above code won't work because it splits on a comma. A lot of the fields contain a comma somewhere within the actual data. If it was easy as that, I would have had this done long ago ;)
-- Andrew Gaffney Network Administrator Skyline Aeronautics, LLC. 636-357-1548
-- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] <http://learn.perl.org/> <http://learn.perl.org/first-response>