On 5/27/09 Wed May 27, 2009 3:27 PM, "Stephen Reese" <rsre...@gmail.com> scribbled:
> List,
>
> I've been working on a method to parse a PDF or TXT document and
> output the results to XML over at Experts Exchange.
> http://www.experts-exchange.com/Programming/Languages/Scripting/Perl/Q_24439630.html
>
> You may view the attached document or if the mailing list doesn't
> allow here is a copy of the document I would like to parse:
> http://filedb.experts-exchange.com/incoming/2009/05_w22/143310/XenApp-Secure-Gateway-Server-VL0.txt
>
> Basically I would like to take the following code and modify it to
> parse a TXT instead of a PDF document:
>
> #!/usr/bin/perl
> use strict;
> use warnings;
> use Data::Dumper;
> use CAM::PDF;
>
> my $pdf = CAM::PDF->new('XenApp_WebInterface_Server_VL04.pdf');
> my $text;
> foreach (1..$pdf->numPages) {
>     $text .= $pdf->getPageText($_);
> }
>
> while($text =~ /Vulnerability Key:\s*
> (\S+)\s+STIG ID:\s*
> (\S+)\s+Release Number:\s*
> (\S+)\s+Status:\s*
> (\S+)\s+Short Name:\s*
> (\S+)\s+Long Name:\s*
> (\S+)\s+IA Controls:\s*
> (\S+)\s+Categories:\s*
> (\S+)\s+Effective Date:\s*
> (\S+)\s+Condition:\s*
> (\S+)\s+Policy:\s*
> (\S+)/g) {
>
> print "<Vuln>
> <Vulnerability_Key_>$1</Vulnerability_Key_>
> <STIG_ID>$2</STIG_ID_>
> <Release_Number_>$3</Release_Number_>
> <Status_>$4</Status_>
> <Short_Name_>$5</Short_Name_>
> <Long_Name_>$6</Long_Name_>
> <IA_Controls_><IA_Control><ID>$7<ID></IA_Control></IA_Controls_>
> <Categories_>$8</Categories_>
> <Effective_Date_>$9</Effective_Date_>
> <Condition_><subitem><title>$10</title><data></data></subitem></Condition_>
> <Policy_>$11</Policy_>
> </Vuln>\n";
> }

You have two basic choices:

1. Read the whole file into a variable and use the regular expression
   as above to match multiple lines, extract the information, and
   print it.

2. Read the file line-by-line, save the relevant data, and print the
   data when you have a complete set or at the end of the program.
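
For comparison, approach 1 might look roughly like the sketch below. It is not tested against your file: the sample data, the three-field record layout, and the names slurp() and extract_records() are all assumptions made up for illustration, and an in-memory filehandle stands in for the real open() on your TXT file.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sample data; the real file has more fields per record.
my $sample = <<'END';
Vulnerability Key: V0018219
STIG ID: CTX0700
Status: Working

Vulnerability Key: V0018220
STIG ID: CTX0710
Status: Working
END

# Slurp an entire filehandle into one string by temporarily
# undefining the input record separator.
sub slurp {
    my ($fh) = @_;
    local $/;
    return scalar <$fh>;
}

# Return one hash ref per record found in $text.
sub extract_records {
    my ($text) = @_;
    my @records;
    while ( $text =~ m{
            Vulnerability\ Key: \s* (\S+) \s+
            STIG\ ID:           \s* (\S+) \s+
            Status:             \s* (\S+)
        }gx )
    {
        push @records, { key => $1, stig => $2, status => $3 };
    }
    return @records;
}

# An in-memory filehandle stands in for open(my $fh, '<', $file).
open( my $fh, '<', \$sample ) or die "Can't open sample: $!";
my @records = extract_records( slurp($fh) );
close $fh;

print "<Vuln>\n<Vulnerability_Key_>$_->{key}</Vulnerability_Key_>\n",
      "<STIG_ID_>$_->{stig}</STIG_ID_>\n<Status_>$_->{status}</Status_>\n</Vuln>\n"
    for @records;
```

The /x modifier lets the pattern span several lines safely; without it, the line breaks inside the pattern would have to match literally in the text.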

A third choice, if your data permits, would be to set the input record
separator ($/) to the value that separates your records and read
multiple lines as a record. I don't think this will work in your case.

Here is an example of approach 2:

    #!/usr/local/bin/perl
    use strict;
    use warnings;

    my @keys = (
        'Vulnerability Key', 'STIG ID', 'Release Number', 'Status',
        'Short Name', 'Long Name', 'IA Controls', 'Categories',
        'Effective Date', 'Condition', 'Policy'
    );

    # %keys flags the field names we care about; %tags maps each field
    # name to its XML tag (spaces become underscores, plus a trailing '_').
    my( %keys, %tags );
    $keys{$_} = 1 for @keys;
    $tags{$_} = $_ . '_' for @keys;
    $tags{$_} =~ s/ /_/g for @keys;

    my $file = 'XenApp Secure_Gateway_Server_VL04.txt';
    open( my $fh, '<', $file) or die("Can't open $file: $!");

    my %record = map { $_, '' } @keys;
    while( my $line = <$fh> ) {
        chomp($line);
        if( $line =~ m{ \A (.+?) : \s* (\S+) }x ) {
            $record{$1} = $2 if $keys{$1};
            # The last key in @keys marks the end of a record.
            if( $1 eq $keys[$#keys] ) {
                print "<Vuln>\n";
                print "<$tags{$_}>$record{$_}</$tags{$_}>\n" for @keys;
                print "</Vuln>\n";
                %record = map { $_, '' } @keys;
            }
        }
    }
    close($fh);

... which produces for your input:

    <Vuln>
    <Vulnerability_Key_>V0018219</Vulnerability_Key_>
    <STIG_ID_>CTX0700</STIG_ID_>
    <Release_Number_>1</Release_Number_>
    <Status_>Working</Status_>
    <Short_Name_>Secure</Short_Name_>
    <Long_Name_>Secure</Long_Name_>
    <IA_Controls_>ECSC-1</IA_Controls_>
    <Categories_>4.4</Categories_>
    <Effective_Date_></Effective_Date_>
    <Condition_></Condition_>
    <Policy_>All</Policy_>
    </Vuln>
    ...

You may want to add error checking for the case where some keys are
missing.

Note that, as in your regular expression, (\S+) will only extract the
first "word" of the value, and your data in some cases has more than
that on a line. You can change this by changing the RE to:

    if( $line =~ m{ \A (.+?) : \s* (.*?) \s* \z }x ) {
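
To make the last two points concrete, here is a small sketch of the whole-value RE together with one possible check for missing keys. The helper names parse_line() and missing_keys(), the shortened key list, and the input lines are invented for illustration; they are not taken from your file.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Split a "Key: value" line into its two parts, keeping the whole
# value (trailing whitespace stripped) rather than the first word.
sub parse_line {
    my ($line) = @_;
    return $line =~ m{ \A (.+?) : \s* (.*?) \s* \z }x ? ( $1, $2 ) : ();
}

# Return the names of the expected keys that have no value in %$record.
sub missing_keys {
    my ( $record, @expected ) = @_;
    return grep { !defined $record->{$_} || $record->{$_} eq '' } @expected;
}

# Hypothetical key list and input lines, for illustration only.
my @keys  = ( 'Short Name', 'Long Name', 'Policy' );
my @lines = (
    'Short Name: Secure Gateway check  ',
    'Policy: All',
);

my %record;
for my $line (@lines) {
    # Skip lines that do not look like "Key: value".
    my ( $key, $value ) = parse_line($line) or next;
    $record{$key} = $value;
}

print "$_ => '$record{$_}'\n" for sort keys %record;
warn "Missing: $_\n" for missing_keys( \%record, @keys );
```

Whether a missing key should only produce a warning, or should cause the record to be skipped, depends on how strict you need the XML output to be.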