On Tue, Oct 02, 2012 at 11:19:51PM +0100, Florian Huber wrote:
> The string is:
>
> >ENSG00000112365|ENST00000230122|109783797|109787053TGTTTCACAATTCATTTTCTACTAAATGTGTACCATTTTTTAAATTGTTTTAACAGAAAGCTGAGGAATGAAAAAACTTCAAGCATTATTTTCAAG
>
> So I'm trying to retrieve 'ENSG00000112365', 'ENST00000230122' and
> the sequence bit, starting with a 'T' and get rid of the junk in
> between.
>
> code:
>
> #!/usr/bin/perl
>
> use strict;
> use warnings;
>
> my $gene;
> my @elements = <>;
>
> foreach $gene (@elements) {
>     $gene =~ />(ENSG\d*) \| (ENST\d*) .*? ([AGCT]*)/x;
>     print "$1 $2 $3\n";
> }
>
> This will print "ENSG00000112365 ENST00000230122"
>
> without the sequence.

You might want to simplify the problem first. It looks like the
first two fields that you're interested in are delimited with the
vertical bar (|) character. You can split the string on that
first to get the 4 fields. Then you can just match the simple
character class that you want against the last field.

#!/usr/bin/perl

use strict;
use warnings;

my $sequence = '>ENSG00000112365|ENST00000230122|109783797|109787053TGTTTCACAATTCATTTTCTACTAAATGTGTACCATTTTTTAAATTGTTTTAACAGAAAGCTGAGGAATGAAAAAACTTCAAGCATTATTTTCAAG';

my ($foo, $bar, $baz) = parse_sequence($sequence);

print "foo=$foo\n";
print "bar=$bar\n";
print "baz=$baz\n";

sub parse_sequence {
    my ($sequence) = @_;

    # We don't want the leading > character.
    $sequence =~ s/^>//;

    my @parts = split /\|/, $sequence;

    # We're interested in the first two fields.
    my ($foo, $bar) = @parts;

    # And the matching part of the last one.
    my ($baz) = $parts[3] =~ /([ACGT]+)/;

    # You may prefer to return a hash reference or something with
    # meaningful field names. Those details are entirely up to
    # you.
    return ($foo, $bar, $baz);
}

__END__

Output:

foo=ENSG00000112365
bar=ENST00000230122
baz=TGTTTCACAATTCATTTTCTACTAAATGTGTACCATTTTTTAAATTGTTTTAACAGAAAGCTGAGGAATGAAAAAACTTCAAGCATTATTTTCAAG

Obviously you should assign appropriate names to the variables.
:)

> Originally I had .* before the ([ACGT]) so I figured it's
> greedy and will eat the sequence away. ? makes it nongreedy,
> doesn't it? Still doesn't work.
>
> Other results:
>
> with ([AGCT])* it says that $3 is uninitialised - so here it
> didn't match at all???
>
> with ([AGCT]{5}) it works fine - it returns TGTTT.
>
>
> This I found kinda strange - looks like I've got something with
> the greediness/precedence wrong?

It takes some practice (and often a clear head) to get regular
expressions right. :) A * is allowed to match zero characters,
and it will happily do so whenever that lets the overall match
succeed. In your pattern the non-greedy .*? also prefers to match
nothing, so immediately after the ENST field [AGCT]* succeeds
with an empty match and the regex is finished. With ([AGCT])* the
group itself is repeated zero times, which is why $3 ends up
uninitialised. With {5} an empty match is not an option, so .*?
has to expand until the sequence starts. If you require at least
one match then be sure you use +.
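
To make the difference concrete, here is a minimal sketch that runs
three variants of the original pattern against a shortened copy of the
record. The variable names ($record, $prefix, @star, @outer, @plus) and
the truncated sequence are my own, purely for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Shortened copy of the record from the question (sequence truncated
# for illustration only).
my $record = '>ENSG00000112365|ENST00000230122|109783797|109787053TGTTTCACAATT';

# Everything up to and including the ENST field is shared by all variants.
my $prefix = qr/>(ENSG\d*) \| (ENST\d*)/x;

# Variant 1: .*? and [AGCT]* are both satisfied by zero characters,
# so the third group is captured, but empty.
my @star = $record =~ /$prefix .*? ([AGCT]*)/x;

# Variant 2: the * sits outside the group, which then takes part in
# the match zero times and is left undefined.
my @outer = $record =~ /$prefix .*? ([AGCT])*/x;

# Variant 3: + demands at least one base, so .*? has to expand past
# the two numeric fields before the group can match.
my @plus = $record =~ /$prefix .*? ([AGCT]+)/x;

printf "[AGCT]*   : '%s'\n", $star[2];
printf "([AGCT])* : %s\n", defined $outer[2] ? "'$outer[2]'" : 'undef';
printf "[AGCT]+   : '%s'\n", $plus[2];
```

The first line of output shows an empty string, the second shows
undef, and only the third shows the sequence (TGTTTCACAATT here). In
all three cases the first two captures are the ENSG and ENST
identifiers, so the only thing that changes is how much the last group
is required to consume.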
:)

And as above, always try to make things as simple as possible.
It's much easier to get simple correct. :)

Regards,


--
Brandon McCaig <bamcc...@gmail.com> <bamcc...@castopulence.org>
Castopulence Software <https://www.castopulence.org/>
Blog <http://www.bamccaig.com/>
perl -E '$_=q{V zrna gur orfg jvgu jung V fnl. }.
q{Vg qbrfa'\''g nyjnlf fbhaq gung jnl.};
tr/A-Ma-mN-Zn-z/N-Zn-zA-Ma-m/;say'