Hi,

I would like to parse the file at the bottom of this e-mail to extract the quoted strings following /product=, /protein_id=, /db_xref= and /translation=, for each of the segments separated by the string "CDS". The output for the example below should look like this:

>V001|AAM13451.1|GI:20152990
MESLKYFYSLSLSLFNGLTKILNLFLMESLKYFYSLSLSLFNGL
TKILNLFLMVSIKRSIFLTL
>V002|AAA60951.1|GI:333518
MKQIVLACICLAAVAIPTSLQQSFSSSSSCTEEENKHHMGIDVI
IKVTKQDQTPTNDKICQSVTEVTESEDESEEVVKGDPTTYYTVVGGGLTMDFGFTKCP
KISSISEYSDGNTVNARLSSVSPGQGKDSPAITREEALSMIKDCEMSINIKCSEEEKD
SNIKTHPVLGSNISHKKVSYEDIIGSTIVDTKCVKNLEISVRIGDMCKESSELEVKDG
FKYVDGSASEDAADDTSLINSAKLIACV

So far I have used the code below, which actually works. However, I am not pleased with it, because it generates an empty element in the hash from the header of the file and because there might be a better way to do this. I would therefore be very grateful for any input or alternative ways to improve the code.

Regards,

pedro

#!/usr/sbin/perl -w

$/ = "\n     CDS";
while(<>){
    $_ =~ /product=\"(.+)\"/;
    $gname = $1;
    $gname =~ s/\s+//g;
    push @ID, $gname;
    $_ =~ /protein_id="([\w\.]+)\"/;
    $ref = $1;
    $_ =~ /db_xref=\"GI:(\w+)\"/;
    $gid = $1;
    $_ =~ /translation=\"([A-Z\s]+)/;
    $seq = $1;
    $seq =~ s/\s+//g;
    $hash{$gname} = ["$ref", "$gid", "$seq"];
}
open(F, ">test");
foreach $key (@ID){
    print F ">gi|$hash{$key}[1]|$hash{$key}[0] $key\n$hash{$key}[2]\n";
}
close(F);

REFERENCE   6  (bases 1 to 224501)
  AUTHORS   Dietrich,F.S., Ray,C.A., Sharma,A.D., Allen,A. and Pickup,D.J.
  TITLE     Direct Submission
  JOURNAL   Submitted (11-FEB-2002) Molecular Genetics and Microbiology,
            Duke University Medical Center, Box 3020 DUMC, 421 Jones
            Building, Durham, NC 27710, USA
COMMENT     On Apr 16, 2002 this sequence version replaced gi:333516.
FEATURES             Location/Qualifiers
     source          1..224501
                     /organism="Cowpox virus"
                     /strain="Brighton Red"
                     /db_xref="taxon:10243"
     CDS             complement(156..350)
                     /codon_start=1
                     /evidence=not_experimental
                     /product="V001"
                     /protein_id="AAM13451.1"
                     /db_xref="GI:20152990"
                     /translation="MESLKYFYSLSLSLFNGLTKILNLFLMESLKYFYSLSLSLFNGL
                     TKILNLFLMVSIKRSIFLTL"
     CDS             complement(2743..3483)
                     /codon_start=1
                     /evidence=not_experimental
                     /product="V002"
                     /protein_id="AAA60951.1"
                     /db_xref="GI:333518"
                     /translation="MKQIVLACICLAAVAIPTSLQQSFSSSSSCTEEENKHHMGIDVI
                     IKVTKQDQTPTNDKICQSVTEVTESEDESEEVVKGDPTTYYTVVGGGLTMDFGFTKCP
                     KISSISEYSDGNTVNARLSSVSPGQGKDSPAITREEALSMIKDCEMSINIKCSEEEKD
                     SNIKTHPVLGSNISHKKVSYEDIIGSTIVDTKCVKNLEISVRIGDMCKESSELEVKDG
                     FKYVDGSASEDAADDTSLINSAKLIACV"
BASE COUNT    74832 a  37730 c  37261 g  74678 t
ORIGIN
        1 tagtaaaatt aaattaatta taaaattata tatataattt actaacttta gttagataaa
       61 ttaataatat ataagtttta gtacattaat attatatttt aaatatttta tttagtgtct
//

*******************************************************************
PEDRO A. RECHE, PhD                 TL: 617 632 3824
Dana-Farber Cancer Institute,       FX: 617 632 4569
Harvard Medical School,             EM: [EMAIL PROTECTED]
44 Binney Street, D1510A,           EM: [EMAIL PROTECTED]
Boston, MA 02115                    URL: http://www.reche.org
*******************************************************************
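
For comparison, here is a rough sketch of one possible alternative, not a tested replacement for the script above. It assumes the whole file fits in memory and that every CDS feature starts on its own line with leading whitespace, as in the record shown. The helper name cds_to_fasta is hypothetical. It splits the text on the CDS lines, discards the header chunk (which is what produces the empty hash element), and skips any block that lacks one of the four qualifiers.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: turn the text of a GenBank record into a list of
# FASTA-style entries, one per CDS feature.
sub cds_to_fasta {
    my ($text) = @_;

    # Split on lines that open a CDS feature. The first chunk is the
    # file header (everything before the first CDS), so it is dropped.
    my @chunks = split /^\s+CDS\s+/m, $text;
    shift @chunks;

    my @entries;
    for my $chunk (@chunks) {
        # Each match is checked; a block missing a qualifier is skipped
        # instead of reusing a stale $1 from a previous match.
        my ($product) = $chunk =~ m{/product="([^"]+)"}        or next;
        my ($protein) = $chunk =~ m{/protein_id="([^"]+)"}     or next;
        my ($gi)      = $chunk =~ m{/db_xref="(GI:\d+)"}       or next;
        my ($seq)     = $chunk =~ m{/translation="([A-Z\s]+)"} or next;

        $product =~ s/\s+//g;    # same whitespace stripping as the original
        $seq     =~ s/\s+//g;    # join the wrapped sequence lines

        push @entries, ">$product|$protein|$gi\n$seq\n";
    }
    return @entries;
}

# When a file name is given on the command line, slurp it and print
# the entries to standard output.
if (@ARGV) {
    local $/;                    # undefined separator: read everything at once
    my $text = <>;
    print cds_to_fasta($text);
}
```

Unlike the example output above, this sketch writes each sequence on a single line. Keeping the entries in an array also preserves the order of the CDS features without a separate @ID list, and two features with the same product name no longer overwrite each other.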