Alright, thanks for your answers! I think I know what my mistake was: I realised that * means 0 or more, but I thought it would always prefer to match as often as possible. It seems that matching stops as soon as the regex as a whole is successful, and since both .*? and [AGCT]* are allowed to match nothing, the empty match right after the ENST id already counts as success.
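For the archive, here is a minimal sketch of what I think was going on. It assumes the header and the sequence sit on a single line, as in my sample record. The anchored pattern in step 4 is only my own guess at a fix (nobody on the list suggested it), and the variable names are made up for the example.

```perl
#!/usr/bin/perl

use strict;
use warnings;

# Sample record from this thread; assumes header and sequence share one line.
my $gene = '>ENSG00000112365|ENST00000230122|109783797|109787053'
         . 'TGTTTCACAATTCATTTTCTACTAAATGTGTACCATTTTTTAAATTGTTTTAACAGAAAGC'
         . 'TGAGGAATGAAAAAACTTCAAGCATTATTTTCAAG';

# 1) Original attempt. Under /x the spaces are ignored. The lazy .*? is
#    allowed to match nothing, and [AGCT]* is then tried at the '|' after
#    the ENST id, where zero repetitions is all it can do. The match
#    succeeds, so the third capture is defined but empty.
my (undef, undef, $star) = $gene =~ />(ENSG\d*) \| (ENST\d*) .*? ([AGCT]*)/x;

# 2) Quantifier outside the group: the group never takes part in the
#    match, so the third capture is undef (hence the warning I saw).
my @outside = $gene =~ />(ENSG\d*) \| (ENST\d*) .*? ([AGCT])*/x;

# 3) {5} cannot match zero times, so .*? has to expand until five bases fit.
my (undef, undef, $five) = $gene =~ />(ENSG\d*) \| (ENST\d*) .*? ([AGCT]{5})/x;

# 4) Hypothetical fix: require at least one base and anchor at the end of
#    the line, so the empty match is no longer acceptable.
my ($ensg, $enst, $seq) =
    $gene =~ />(ENSG\d+) \| (ENST\d+) \| .*? ([ACGT]+) \s* \z/x;

print "star:    '$star'\n";
print "outside: ", (defined $outside[2] ? "'$outside[2]'" : 'undef'), "\n";
print "five:    '$five'\n";
print "$ensg $enst $seq\n";
```

With the sample record, step 1 gives an empty string, step 2 gives undef, step 3 gives TGTTT and step 4 gives both ids plus the whole sequence.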
Thanks again.

Flo

-------- Original Message --------
> Date: Tue, 2 Oct 2012 19:17:58 -0400
> From: William Muriithi <william.murii...@gmail.com>
> To: Florian Huber <florian_hu...@gmx.at>
> CC: beginners@perl.org
> Subject: Re: Multiple matching of a group of characters

> Florian,
> >
> > The string is:
> >
> > >ENSG00000112365|ENST00000230122|109783797|109787053TGTTTCACAATTCATTTTCTACTAAATGTGTACCATTTTTTAAATTGTTTTAACAGAAAGCTGAGGAATGAAAAAACTTCAAGCATTATTTTCAAG
>
> It may actually have helped if you posted two or three samples. This
> could help us identify patterns in your data and hence advise on the
> necessary regular expression to process your data.
> >
> > So I'm trying to retrieve 'ENSG00000112365', 'ENST00000230122' and the
> > sequence bit, starting with a 'T' and get rid of the junk in between.
>
> There are a lot of Ts in the above gene sequence; I'm not sure which one
> you refer to when you say "starting with 'T'".
>
> >
> > code:
> >
> > #!/usr/bin/perl
> >
> > use strict;
> > use warnings;
> >
> > my $gene;
> > my @elements = <>;
> >
> > foreach $gene (@elements) {
> >     $gene =~ />(ENSG\d*) \| (ENST\d*) .*? ([AGCT]*)/x;
>
> Try
>
> $gene =~ />(ENSG\d*) \| (ENST\d*) .*? (AGCT\z)/x;
>
> Assume you need everything starting from AGCT to the end of the sequence
>
> >     print "$1 $2 $3\n";
> > }
> >
> >
> > This will print "ENSG00000112365 ENST00000230122"
> >
> > without the sequence. Originally I had .* before the ([ACGT]) so I
> > figured it's greedy and will eat the sequence away. ? makes it
> > nongreedy, doesn't it? Still doesn't work.
> >
>
> Greed here doesn't mean eating; it's how wide it tries to match. Try
> Google, as there are better explanations out there.
>
> > Other results:
> >
> > with ([AGCT])* it says that $3 is uninitialised - so here it didn't
> > match at all???
> >
> > with ([AGCT]{5}) it works fine - it returns TGTTT.
> >
> >
> > This I found kinda strange - looks like I've got something with the
> > greediness/precedence wrong?
> >
> >
> > Thank you for your help!
> >
> > Flo
> >
> >
> > On 02/10/2012 01:36, Brandon McCaig wrote:
> >>
> >> On Mon, Oct 01, 2012 at 11:15:53PM +0100, Florian Huber wrote:
> >>>
> >>> Dear all,
> >>
> >> Hello,
> >>
> >>> $string = "/NOTNEEDED/*ACGACGGGTTCAAGGCAG*/NOTNEEDED/"
> >>
> >> I would suggest that you show us the real data. I'm assuming that
> >> 'NOTNEEDED' is a placeholder for some data that you're not
> >> interested in. Without knowing what that is we can't really say
> >> for sure what is going on (though we can speculate; see below).
> >>
> >> Note that you should be using the strict and warnings pragmas
> >> (see below). The lack of 'my' here suggests that you probably
> >> aren't.
> >>
> >>> But when I do
> >>>
> >>> $string =~ /[ACGT]/;
> >>>
> >>> it matches only the last letter, i.e. "G". Why doesn't it start
> >>> at the beginning?
> >>
> >> It isn't matching the last letter. You are probably making the
> >> wrong assumption. This is common when you're having trouble with
> >> code. Again, show us the 'NOTNEEDED' part. :)
> >>
> >>> But it gets even better, I figured that adding the greedy *
> >>> should help:
> >>>
> >>> $string =~ /[ACGT]*/;
> >>>
> >>> and now it doesn't match anything. Shouldn't it try to match as
> >>> many times as possible?
> >>
> >> It should match at least the once that you saw earlier (assuming
> >> the same data).
> >>
> >>> My confusion was complete when I tried
> >>>
> >>> $string =~ /[ACGT]{5}/;
> >>>
> >>> now it matches 5 letters, but this time from the beginning,
> >>> i.e.: ACGAC.
> >>
> >> I'm guessing that the first 'NOTNEEDED' contains a 'G'. That
> >> would explain the first match. The second result is nonsense
> >> with the data we've seen.
> >> :-/ If 'NOTNEEDED' doesn't contain a
> >> string at least 5 characters in length composed only of 'A', 'C',
> >> 'G', or 'T' then that would explain this last result.
> >>
> >>> I fail to understand that behaviour. I checked the Perl
> >>> documentation a bit and I sort of understand why /[ACGT]/ only
> >>> matches one letter (but not why it starts at the end).
> >>> However, I'm simply puzzled at the other things.
> >>
> >> As said, provide us with a full (minimal) program to demonstrate
> >> the problems you're having if your problems persist.
> >>
> >> Assuming 'NOTNEEDED' cannot contain '/' characters then you may
> >> need to include those in your pattern to make sure you match the
> >> parts you want. You will probably want to use captures for that
> >> (see perldoc perlre). To understand the below program you will
> >> also need to understand the /x modifier (again see perldoc
> >> perlre).
> >>
> >> #!/usr/bin/perl
> >>
> >> use strict;   # <---Make sure you have these.
> >> use warnings; # <--/
> >>
> >> my $string = '/NOTNEEDED/*ACGACGGGTTCAAGGCAG*/NOTNEEDED/';
> >>
> >> my ($match) = $string =~ m,
> >>     ^          # Beginning of string.
> >>     /          # Skip over the first '/'.
> >>     [^/]*      # Skip over anything that's not a '/'.
> >>     /          # Until the next '/'. Skip over that too.
> >>     \*         # Skip over the literal '*' character.
> >>     ([ACGT]+)  # Now capture the sequence we want.
> >>     ,x;
> >>
> >> print $match, "\n";
> >>
> >> __END__
> >>
> >> Output:
> >>
> >> ACGACGGGTTCAAGGCAG
> >>
> >> IF the '*' characters literally delimit the parts that you want
> >> (AND not the parts that you don't want) then that's even easier:
> >>
> >> #!/usr/bin/perl
> >>
> >> use strict;
> >> use warnings;
> >>
> >> my $string = '/NOTNEEDED/*ACGACGGGTTCAAGGCAG*/NOTNEEDED/';
> >>
> >> my ($match) = $string =~ /\*([ACGT]+)/;
> >>
> >> print $match, "\n";
> >>
> >> __END__
> >>
> >> This produces the same output with this sample string. Without
> >> seeing the real data it's hard to speculate.
> >> There might be a
> >> better way. You need to know the specifications of the data
> >> you're processing if you want to reliably process it
> >> automatically. We need to know this to help you do it too.
> >>
> >> o o o o
> >>
> >> A lot of people seem to post about this same type of data. I'd be
> >> surprised if nobody has written CPAN modules for parsing the data
> >> yet (and if not then perhaps it would be economical to do so).
> >> Just saying...
> >>
> >> Regards,
> >>
> >>
> >
>
> --
> To unsubscribe, e-mail: beginners-unsubscr...@perl.org
> For additional commands, e-mail: beginners-h...@perl.org
> http://learn.perl.org/