help parsing file

Pedro Antonio Reche Tue, 17 Dec 2002 16:12:24 -0800

Hi, I am interested in parsing the file at the bottom of this e-mail in
order to extract the strings between the double quotes following /product=,
/protein_id=, /db_xref= and /translation=, and to do that for each of the
segments separated by the string "CDS". The output for the example
below should look like this:


>V001|AAM13451.1|GI:20152990
MESLKYFYSLSLSLFNGLTKILNLFLMESLKYFYSLSLSLFNGL
TKILNLFLMVSIKRSIFLTL
>V002|AAA60951.1|GI:333518
MKQIVLACICLAAVAIPTSLQQSFSSSSSCTEEENKHHMGIDVI
IKVTKQDQTPTNDKICQSVTEVTESEDESEEVVKGDPTTYYTVVGGGLTMDFGFTKCP
KISSISEYSDGNTVNARLSSVSPGQGKDSPAITREEALSMIKDCEMSINIKCSEEEKD
SNIKTHPVLGSNISHKKVSYEDIIGSTIVDTKCVKNLEISVRIGDMCKESSELEVKDG
FKYVDGSASEDAADDTSLINSAKLIACV

So far I have used the code below, which actually works. However, I am not
pleased with it, because it generates an empty element in the hash from the
header of the file and because there might be a better way to do
this. I would therefore be very grateful for any input or alternative way
to improve the code.
Regards,
pedro

#!/usr/sbin/perl -w
$/ = "\n     CDS";
while(<>){
        $_ =~ /product=\"(.+)\"/;
        $gname = $1;
        $gname =~ s/\s+//g;
        push @ID, $gname;
        $_ =~ /protein_id="([\w\.]+)\"/;
        $ref = $1;
        $_ =~ /db_xref=\"GI:(\w+)\"/;
        $gid = $1;
        $_ =~ /translation=\"([A-Z\s]+)/;
        $seq = $1;
        $seq =~ s/\s+//g;
        $hash{$gname} = ["$ref", "$gid", "$seq"];
}
open(F, ">test");
foreach $key (@ID){
        print F ">gi|$hash{$key}[1]|$hash{$key}[0] $key\n$hash{$key}[2]\n";
}
close(F);
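
One possible way around the empty element is sketched below. It is only a sketch under assumptions, not a definitive solution: it assumes every CDS feature line starts with exactly five spaces followed by "CDS" (as in the sample), and it skips any chunk that lacks one of the four qualifiers, which is what discards the file header. The names parse_cds, to_fasta and $sample are made up for illustration, and the embedded sample is a trimmed copy of the first CDS from the file in this e-mail. The header line follows the ">product|protein_id|GI:number" layout of the desired output shown above, with each sequence printed on a single line.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# parse_cds (hypothetical helper): split GenBank text into CDS features and
# return one hash ref per feature that carries all four qualifiers.
sub parse_cds {
    my ($text) = @_;
    my @records;
    # Everything before the first CDS line is the file header; drop it.
    my (undef, @chunks) = split /^ {5}CDS\s/m, $text;
    for my $chunk (@chunks) {
        # A failed match returns an empty list, so "or next" skips the chunk.
        my ($product) = $chunk =~ /\/product="([^"]+)"/        or next;
        my ($protein) = $chunk =~ /\/protein_id="([\w.]+)"/    or next;
        my ($gi)      = $chunk =~ /\/db_xref="GI:(\d+)"/       or next;
        my ($seq)     = $chunk =~ /\/translation="([A-Z\s]+)"/ or next;
        s/\s+//g for $product, $seq;    # remove line wraps and padding
        push @records, {
            product    => $product,
            protein_id => $protein,
            gi         => $gi,
            seq        => $seq,
        };
    }
    return @records;
}

# to_fasta (hypothetical helper): one header line plus one sequence line
# per record, in the order the features appeared in the input.
sub to_fasta {
    my (@records) = @_;
    return join '',
        map { ">$_->{product}|$_->{protein_id}|GI:$_->{gi}\n$_->{seq}\n" }
        @records;
}

# Trimmed sample taken from the file in this e-mail (assumption: the real
# input has the same layout).
my $sample = <<'END_GENBANK';
FEATURES             Location/Qualifiers
     source          1..224501
                     /organism="Cowpox virus"
                     /db_xref="taxon:10243"
     CDS             complement(156..350)
                     /product="V001"
                     /protein_id="AAM13451.1"
                     /db_xref="GI:20152990"
                     /translation="MESLKYFYSLSLSLFNGLTKILNLFLMESLKYFYSLSLSLFNGL
                     TKILNLFLMVSIKRSIFLTL"
ORIGIN
END_GENBANK

print to_fasta(parse_cds($sample));
```

To run it on a real file instead of the embedded sample, the text could be read in one go with something like my $text = do { local $/; <> }; and passed to parse_cds. Keeping the records in an array rather than a hash keyed on the product name also preserves the input order without a separate @ID list, and avoids one entry silently overwriting another if two features share a product name.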

REFERENCE   6  (bases 1 to 224501)
  AUTHORS   Dietrich,F.S., Ray,C.A., Sharma,A.D., Allen,A. and Pickup,D.J.
  TITLE     Direct Submission
  JOURNAL   Submitted (11-FEB-2002) Molecular Genetics and Microbiology, Duke
            University Medical Center, Box 3020 DUMC, 421 Jones Building,
            Durham, NC 27710, USA
COMMENT     On Apr 16, 2002 this sequence version replaced gi:333516.
FEATURES             Location/Qualifiers
     source          1..224501
                     /organism="Cowpox virus"
                     /strain="Brighton Red"
                     /db_xref="taxon:10243"
     CDS             complement(156..350)
                     /codon_start=1
                     /evidence=not_experimental
                     /product="V001"
                     /protein_id="AAM13451.1"
                     /db_xref="GI:20152990"
                     /translation="MESLKYFYSLSLSLFNGLTKILNLFLMESLKYFYSLSLSLFNGL
                     TKILNLFLMVSIKRSIFLTL"
     CDS             complement(2743..3483)
                     /codon_start=1
                     /evidence=not_experimental
                     /product="V002"
                     /protein_id="AAA60951.1"
                     /db_xref="GI:333518"
                     /translation="MKQIVLACICLAAVAIPTSLQQSFSSSSSCTEEENKHHMGIDVI
                     IKVTKQDQTPTNDKICQSVTEVTESEDESEEVVKGDPTTYYTVVGGGLTMDFGFTKCP
                     KISSISEYSDGNTVNARLSSVSPGQGKDSPAITREEALSMIKDCEMSINIKCSEEEKD
                     SNIKTHPVLGSNISHKKVSYEDIIGSTIVDTKCVKNLEISVRIGDMCKESSELEVKDG
                     FKYVDGSASEDAADDTSLINSAKLIACV"
BASE COUNT    74832 a  37730 c  37261 g  74678 t
ORIGIN
        1 tagtaaaatt aaattaatta taaaattata tatataattt actaacttta gttagataaa
       61 ttaataatat ataagtttta gtacattaat attatatttt aaatatttta tttagtgtct
//



*******************************************************************
PEDRO A. RECHE , pHD            TL: 617 632 3824
Dana-Farber Cancer Institute,   FX: 617 632 4569
Harvard Medical School,         EM: [EMAIL PROTECTED]
44 Binney Street, D1510A,       EM: [EMAIL PROTECTED]              
Boston, MA 02115                URL: http://www.reche.org                              
                 
*******************************************************************
