This email follows this format: 1. Suggestions and items you need to read in order to understand the code are listed first. 2. Suggestions about properly posting a question are next. 3. Finally, the code that does all that you ask is at the bottom.
1. Please read or do the following: (a) ALWAYS use warnings; use strict; (b) perldoc LWP::UserAgent (c) perldoc HTTP::Request (d) perldoc HTTP::Request::Common (e) perldoc -f map (f) perldoc -f sort (g) perldoc -f next (h) perldoc perlref (i) perldoc -q hash (j) perldoc perlreftut (k) perldoc perlsyn (especially loop control) (l) perldoc perldata (m) perldoc perlre (n) perldoc perlreref (o) perldoc perlop That is more than enough! 2. Your question was quite complicated but you did not provide enough information in any one email. It took quite a few emails just for me to understand what was going on. Your question should convey the following information. ------BEGIN QUESTION------ I have data in the following format (see __DATA__ in code below). There are 12 space-delimited fields: 1 2 3 4 5 6 7 8 9 10 11 12 gi|37182815|gb|AY358849.1| gi|28592069|gb|U63637.2|BTU63637 100.00 17 0 0 552 568 3218 3234 1.1 34.19 I need to record the following 6 fields: 2. subject id 3. identity % 4. alignment length 5. mismatches 7. q.start 8. q.end The last portion of subject id (BTU63637) is optional. I then need to submit each unique subject id to the web page http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Nucleotide. The first link of the search results needs to be retrieved. I then need to record two pieces of information from within the page of the second web submission. These two items are chromosome and gene name. If either is missing, I wish to record NA instead. An example page that contains chromosome and gene name is http://.... Then include the code you already wrote. -------END QUESTION------- Since you need to record unique subject ids, this means to me definitely use a hash. I wound up using a hash of a hash. Also, your emails contained a lot of extraneous data going into gene sequences, clone, complete sequence, etc. This is fine for supporting information, but does nothing to succinctly explain what you want and how you want it. 3. You might execute this code with prompt> perl code.pl >results.txt or prompt> perl code.pl | more -------BEGIN CODE------- #!/usr/bin/perl use warnings; use strict; use LWP::UserAgent; use HTTP::Request::Common qw(POST); my %data; my $ua = LWP::UserAgent->new() or die "Could not create UserAgent: $!\n"; my $base_url = 'http://www.ncbi.nlm.nih.gov'; my $temp = 'file.htm'; my @fields = qw(identity alignment mismatch start end); my $length = ( sort {$b <=> $a} map { length } @fields )[0] + 1; while (<DATA>) { next if /^#/ or /^\s*$/; my @values = /^ \w+\|\w+\|\w+\|\S+\|\s+ # 1. query id (\w+\|\w+\|\w+\|\S+\|\w*) \s+ # 2. subject id (\S+) \s+ (\d+) \s+ # 3. identity %, 4. alignment length (\d+) \s+ \d+ \s+ # 5. mismatches, 6. gap openings (\d+) \s+ (\d+) # 7. q.start, 8. q.end /x or next; my $subject = shift @values; if ( not $data{$subject} ){ @{ $data{$subject} }{ @fields } = @values; my $request = POST $base_url . '/entrez/query.fcgi?CMD=search&DB=nucleotide', [ orig_db => 'nucleotide', term => $subject, ]; my $response = $ua->request($request, $temp); if ( $response->is_success ) { local $/ = undef; open HTM, $temp or die "Cannot open $temp for reading: $!\n"; my ($link) = <HTM> =~ m|<a href="/(entrez/viewer\.fcgi\?db=nucleotide&val=\d+)">|; close HTM; my $response = $ua->get("$base_url/$link", ':content_file' => $temp); if ( $response->is_success ){ my $htm; open HTM, $temp or die "Cannot open $temp for reading: $!\n"; my ($chromo) = ($htm = <HTM>) =~ /chromosome=(\S+)/; my ($gene) = $htm =~ /gene=(\S+)/; $data{$subject}->{chromosome} = $chromo || 'NA'; $data{$subject}->{gene} = $gene || 'NA'; } else { print "Could not retrieve link:\n", $response->as_string; next; } } else { print "Subject search error:\n", $response->as_string; next; } } } foreach my $subject( sort keys %data ) { print "$subject\n"; foreach my $field( sort keys %{$data{$subject}} ){ printf "\t%-${length}s = %s\n", $field, $data{$subject}->{$field}; } } __DATA__ # BLASTN 2.2.9 [May-01-2004] # Query: gi|37182815|gb|AY358849.1| Homo sapiens clone DNA180287 ALTE (UNQ6508) mRNA, complete cds # Database: nr # Fields: Query id, Subject id, % identity, alignment length, mismatches, gap openings, q. start, q. end, s. start, s. end, e-value, bit score gi|37182815|gb|AY358849.1| gi|28592069|gb|U63637.2|BTU63637 100.00 17 0 0 552 568 3218 3234 1.1 34.19 gi|37182815|gb|AY358849.1| gi|14318385|gb|AC089993.2| 95.24 21 1 0 435 455 56604 56624 1.1 34.19 gi|37182815|gb|AY358849.1| gi|14318385|gb|AC089993.2| 100.00 16 0 0 260 275 89982 89967 4.2 32.21 gi|37182815|gb|AY358849.1| gi|7385112|gb|AF222766.1|AF222766 100.00 17 0 0 345 361 242 226 1.1 34.19 --------END CODE-------- Good luck, ZO -- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] <http://learn.perl.org/> <http://learn.perl.org/first-response>