Re: getting online information....

Zeus Odin Fri, 18 Jun 2004 02:54:24 -0700

This email follows this format:
1. Suggestions and items you need to read in order to understand the code
are listed first.
2. Suggestions about properly posting a question are next.
3. Finally, the code that does all that you ask is at the bottom.


1. Please read or do the following:
   (a) ALWAYS use warnings; use strict;
   (b) perldoc LWP::UserAgent
   (c) perldoc HTTP::Request
   (d) perldoc HTTP::Request::Common
   (e) perldoc -f map
   (f) perldoc -f sort
   (g) perldoc -f next
   (h) perldoc perlref
   (i) perldoc -q hash
   (j) perldoc perlreftut
   (k) perldoc perlsyn (especially loop control)
   (l) perldoc perldata
   (m) perldoc perlre
   (n) perldoc perlreref
   (o) perldoc perlop
That is more than enough!

2. Your question was quite complicated but you did not provide enough
information in any one email. It took quite a few emails just for me to
understand what was going on. Your question should convey the following
information.

------BEGIN QUESTION------
I have data in the following format (see __DATA__ in code below). There are
12 space-delimited fields:

1                          2                                3      4  5 6 7   8   9    10     11  12
gi|37182815|gb|AY358849.1| gi|28592069|gb|U63637.2|BTU63637 100.00 17 0 0 552 568 3218 3234   1.1 34.19

I need to record the following 6 fields:
   2. subject id
   3. identity %
   4. alignment length
   5. mismatches
   7. q.start
   8. q.end

The last portion of subject id (BTU63637) is optional. I then need to submit
each unique subject id to the web page
http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Nucleotide. The first link
of the search results needs to be retrieved. I then need to record two
pieces of information from within the page of the second web submission.
These two items are chromosome and gene name. If either is missing, I wish
to record NA instead. An example page that contains chromosome and gene name
is http://.... Then include the code you already wrote.
-------END QUESTION-------

Since you need to record unique subject ids, a hash is the natural choice. I
wound up using a hash of hashes. Also, your emails contained a lot of
extraneous data about gene sequences, clones, complete sequences, etc. That
is fine as supporting information, but it does nothing to explain succinctly
what you want and how you want it.
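
To illustrate the hash-of-hashes idea on its own, here is a minimal sketch
that uses split instead of a regular expression and keeps only the first hit
for each subject id. The sub name record_hit is made up for this example,
and the field names are just labels for fields 3, 4, 5, 7 and 8.

```perl
#!/usr/bin/perl
use warnings;
use strict;

# Illustrative labels for fields 3, 4, 5, 7 and 8 of each line.
my @fields = qw(identity alignment mismatch start end);

# Record one line in %$data unless its subject id was seen before.
# Returns 1 if the line was recorded, 0 otherwise.
sub record_hit {
    my ($data, $line) = @_;
    my @col = split ' ', $line;            # split on runs of whitespace
    return 0 unless @col == 12;            # expect 12 fields
    my $subject = $col[1];                 # field 2: subject id
    return 0 if exists $data->{$subject};  # keep the first hit only
    # Hash slice: fill five keys of the inner hash in one assignment.
    @{ $data->{$subject} }{@fields} = @col[2, 3, 4, 6, 7];
    return 1;
}

my %data;
record_hit(\%data, 'gi|37182815|gb|AY358849.1| gi|14318385|gb|AC089993.2| 95.24 21 1 0 435 455 56604 56624 1.1 34.19');
record_hit(\%data, 'gi|37182815|gb|AY358849.1| gi|14318385|gb|AC089993.2| 100.00 16 0 0 260 275 89982 89967 4.2 32.21');

for my $subject (sort keys %data) {
    print "$subject => identity $data{$subject}{identity}, q.start $data{$subject}{start}\n";
}
```

Both sample lines share a subject id, so only the first one ends up in the
hash. Whether you want the first hit, the best hit, or all hits per subject
is your decision; this sketch assumes the first.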


3. You might execute this code with
prompt> perl code.pl >results.txt
or
prompt> perl code.pl | more

-------BEGIN CODE-------
#!/usr/bin/perl
use warnings;
use strict;
use LWP::UserAgent;
use HTTP::Request::Common qw(POST);

my %data;
my $ua       = LWP::UserAgent->new()
  or die "Could not create UserAgent: $!\n";
my $base_url = 'http://www.ncbi.nlm.nih.gov';
my $temp     = 'file.htm';
my @fields   = qw(identity alignment mismatch start end);
my $length   = ( sort {$b <=> $a} map { length } @fields )[0] + 1;


while (<DATA>) {
  next if /^#/ or /^\s*$/;
  my @values =
    /^ \w+\|\w+\|\w+\|\S+\|\s+       # 1. query id
      (\w+\|\w+\|\w+\|\S+\|\w*) \s+  # 2. subject id
      (\S+) \s+ (\d+) \s+            # 3. identity %, 4. alignment length
      (\d+) \s+  \d+  \s+            # 5. mismatches, 6. gap openings
      (\d+) \s+ (\d+)                # 7. q.start,    8. q.end
    /x or next;

  my $subject = shift @values;
  if ( not $data{$subject} ){
    @{ $data{$subject} }{ @fields } = @values;

    my $request = POST
      $base_url . '/entrez/query.fcgi?CMD=search&DB=nucleotide',
      [ orig_db => 'nucleotide',
        term    => $subject,
      ];
    my $response = $ua->request($request, $temp);
    if ( $response->is_success ) {
      local $/ = undef;    # slurp whole files while parsing the results
      open HTM, $temp      or die "Cannot open $temp for reading: $!\n";
      my ($link) = <HTM> =~
        m|<a href="/(entrez/viewer\.fcgi\?db=nucleotide&val=\d+)">|;
      close HTM;
      if ( not defined $link ) {
        # No result link on the search page, so skip the second request.
        print "No result link found for $subject\n";
        next;
      }

      my $response = $ua->get("$base_url/$link", ':content_file' => $temp);
      if ( $response->is_success ){
          my $htm;
          open HTM, $temp  or die "Cannot open $temp for reading: $!\n";
          my ($chromo) = ($htm = <HTM>) =~ /chromosome=(\S+)/;
          my ($gene)   =  $htm          =~ /gene=(\S+)/;
          close HTM;
          $data{$subject}->{chromosome} = $chromo || 'NA';
          $data{$subject}->{gene}       = $gene   || 'NA';
      } else {
        print "Could not retrieve link:\n", $response->as_string;
        next;
      }
    } else {
      print "Subject search error:\n", $response->as_string;
      next;
    }

  }
}

foreach my $subject( sort keys %data ) {
  print "$subject\n";
  foreach my $field( sort keys %{$data{$subject}} ){
    printf "\t%-${length}s = %s\n", $field, $data{$subject}->{$field};
  }
}

__DATA__
# BLASTN 2.2.9 [May-01-2004]
# Query: gi|37182815|gb|AY358849.1| Homo sapiens clone DNA180287 ALTE (UNQ6508) mRNA, complete cds
# Database: nr
# Fields: Query id, Subject id, % identity, alignment length, mismatches, gap openings, q. start, q. end, s. start, s. end, e-value, bit score
gi|37182815|gb|AY358849.1| gi|28592069|gb|U63637.2|BTU63637 100.00 17 0 0 552 568 3218 3234   1.1 34.19
gi|37182815|gb|AY358849.1| gi|14318385|gb|AC089993.2| 95.24 21 1 0 435 455 56604 56624   1.1 34.19
gi|37182815|gb|AY358849.1| gi|14318385|gb|AC089993.2| 100.00 16 0 0 260 275 89982 89967   4.2 32.21
gi|37182815|gb|AY358849.1| gi|7385112|gb|AF222766.1|AF222766 100.00 17 0 0 345 361 242 226   1.1 34.19
--------END CODE--------
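
If you want to test the "record NA when missing" step without hitting the
web site, something like the following may help. It is only a sketch: the
sub name extract_annotations and the sample strings are invented for
illustration and are not actual NCBI output, and the patterns assume the
values may or may not be wrapped in double quotes.

```perl
#!/usr/bin/perl
use warnings;
use strict;

# Return (chromosome, gene) found in $text, with 'NA' for anything missing.
sub extract_annotations {
    my ($text) = @_;
    # A failed match in list context returns an empty list,
    # which leaves the variable undefined.
    my ($chromo) = $text =~ /chromosome="?([^"\s]+)/;
    my ($gene)   = $text =~ /gene="?([^"\s]+)/;
    return ( defined $chromo ? $chromo : 'NA',
             defined $gene   ? $gene   : 'NA' );
}

# Hypothetical page fragments, made up for this example.
my @samples = (
    '/chromosome="17" /gene="ABC1"',
    '/organism="Bos taurus"',
);

for my $text (@samples) {
    my ($chromo, $gene) = extract_annotations($text);
    print "chromosome=$chromo gene=$gene\n";
}
```

Testing with defined rather than || means a value of 0 would be kept
instead of being replaced by NA.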

Good luck,
ZO




