Re: Multiple matching of a group of characters

Florian Huber Wed, 03 Oct 2012 03:06:26 -0700

Alright, thanks for your answers! I think I know what my mistake was: although 
I realised that * means 0 or more, I thought it would always prefer to match as 
often as possible. But it seems that matching stops as soon as the whole regex 
succeeds, which is why the sequence capture came back empty.
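
For anyone finding this thread later, here is a minimal sketch of the three cases discussed in the quoted messages. It assumes a shortened copy of the sample line (same layout, fewer bases); the variable names are only for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Shortened stand-in for the sample line quoted in this thread (assumption:
# the real line has the same layout, only a longer sequence).
my $line = ">ENSG00000112365|ENST00000230122|109783797|109787053TGTTTCACAATT\n";

# .*? matches as little as possible and [AGCT]* is allowed to match nothing,
# so the overall match succeeds right after the ENST id with an empty capture.
my ($lazy) = $line =~ />ENSG\d* \| ENST\d* .*? ([AGCT]*)/x;

# Demanding exactly five bases forces .*? to stretch up to the sequence.
my ($five) = $line =~ />ENSG\d* \| ENST\d* .*? ([AGCT]{5})/x;

# With the * outside the parentheses the group never takes part in the match,
# so the capture stays undefined (hence the "uninitialised" warning).
my ($outer) = $line =~ />ENSG\d* \| ENST\d* .*? ([AGCT])*/x;

printf "lazy:  '%s'\n", $lazy;
printf "five:  '%s'\n", $five;
printf "outer: %s\n", defined $outer ? "'$outer'" : 'undef';
```

This should print an empty string for the first case, 'TGTTT' for the second and undef for the third, which lines up with what I saw.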


Thanks again.

Flo
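
PS: One possible way to pull all three pieces out of a line like the sample quoted below is to spell out the fields in between instead of skipping them with .*?. This is only a sketch: parse_line is a made-up helper name, and it assumes every line has the layout >GENE|TRANSCRIPT|number|number followed directly by the bases.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: split one header+sequence line into its parts.
# Assumes the layout seen in this thread: >GENE|TRANSCRIPT|start|end
# followed directly by the bases.
sub parse_line {
    my ($line) = @_;
    my ($gene, $transcript, $seq) = $line =~ m{
        ^> (ENSG\d+)     # gene id
        \| (ENST\d+)     # transcript id
        \| \d+ \| \d+    # two numeric fields we do not need
        ([ACGT]+)        # the sequence itself, at least one base
        \s* \z           # only optional trailing whitespace after it
    }x or return;        # empty list when the line does not fit the layout
    return ($gene, $transcript, $seq);
}

my @parts = parse_line(">ENSG00000112365|ENST00000230122|109783797|109787053TGTTTCACAATT\n");
print "@parts\n" if @parts;
```

Using + instead of * for the sequence means an empty match can no longer count as success, and the \z anchor makes sure the capture runs to the end of the line.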

-------- Original Message --------
> Date: Tue, 2 Oct 2012 19:17:58 -0400
> From: William Muriithi <william.murii...@gmail.com>
> To: Florian Huber <florian_hu...@gmx.at>
> CC: beginners@perl.org
> Subject: Re: Multiple matching of a group of characters

> Florian,
> >
> > The string is:
> >
> >>ENSG00000112365|ENST00000230122|109783797|109787053TGTTTCACAATTCATTTTCTACTAAATGTGTACCATTTTTTAAATTGTTTTAACAGAAAGCTGAGGAATGAAAAAACTTCAAGCATTATTTTCAAG
> 
> It would have helped if you had posted two or three samples.  That
> would let us identify patterns in your data and advise on the
> regular expression needed to process it.
> >
> > So I'm trying to retrieve 'ENSG00000112365', 'ENST00000230122' and the
> > sequence bit, starting with a 'T', and get rid of the junk in between.
> 
> There are a lot of T's in the above gene sequence; I'm not sure which one
> you are referring to when you say "starting with 'T'".
> 
> >
> > code:
> >
> > #!/usr/bin/perl
> >
> > use strict;
> > use warnings;
> >
> > my $gene;
> > my @elements = <>;
> >
> > foreach $gene (@elements) {
> >     $gene =~ />(ENSG\d*) \| (ENST\d*) .*? ([AGCT]*)/x;
> Try
> 
> $gene =~ />(ENSG\d*) \| (ENST\d*) .*? ([AGCT]+) \s* \z/x;
> 
> assuming you need everything from the first base to the end of the sequence.
> >     print "$1 $2 $3\n";
> > }
> >
> >
> > This will print "ENSG00000112365 ENST00000230122"
> >
> > without the sequence. Originally I had .* before the ([ACGT]) so I figured
> > it's greedy and will eat the sequence away. ? makes it nongreedy, doesn't
> > it? Still doesn't work.
> >
> Greediness here doesn't mean eating; it is about how much text a
> quantifier tries to match.  Try Google, as there are better
> explanations out there.
> > Other results:
> >
> > with ([AGCT])* it says that $3 is uninitialised - so here it didn't match
> > at all???
> >
> > with ([AGCT]{5}) it works fine - it returns TGTTT.
> >
> >
> > This I found kinda strange - looks like I've got something with the
> > greediness/precedence wrong?
> >
> >
> > Thank you for your help!
> >
> > Flo
> >
> >
> > On 02/10/2012 01:36, Brandon McCaig wrote:
> >>
> >> On Mon, Oct 01, 2012 at 11:15:53PM +0100, Florian Huber wrote:
> >>>
> >>> Dear all,
> >>
> >> Hello,
> >>
> >>> $string = "/NOTNEEDED/*ACGACGGGTTCAAGGCAG*/NOTNEEDED/"
> >>
> >> I would suggest that you show us the real data. I'm assuming that
> >> 'NOTNEEDED' is a placeholder for some data that you're not
> >> interested in. Without knowing what that is we can't really say
> >> for sure what is going on (though we can speculate; see below).
> >>
> >> Note that you should be using the strict and warnings pragmas
> >> (see below). The lack of 'my' here suggests that you probably
> >> aren't.
> >>
> >>> But when I do
> >>>
> >>> $string =~ /[ACGT]/;
> >>>
> >>> it matches only the last letter, i.e. "G". Why doesn't it start
> >>> at the beginning?
> >>
> >> It isn't matching the last letter. You are probably making the
> >> wrong assumption. This is common when you're having trouble with
> >> code. Again, show us the 'NOTNEEDED' part. :)
> >>
> >>> But it gets even better, I figured that adding the greedy *
> >>> should help:
> >>>
> >>> $string =~ /[ACGT]*/;
> >>>
> >>> and now it doesn't match anything. Shouldn't it try to match as
> >>> many times as possible?
> >>
> >> It should match at least the once that you saw earlier (assuming
> >> the same data).
> >>
> >>> My confusion was complete when I tried
> >>>
> >>> $string =~ /[ACGT]{5}/;
> >>>
> >>> now it matches 5 letters, but this time from the beginning,
> >>> i.e.: ACGAC.
> >>
> >> I'm guessing that the first 'NOTNEEDED' contains a 'G'. That
> >> would explain the first match. The second result is nonsense
> >> with the data we've seen. :-/ If 'NOTNEEDED' doesn't contain a
> >> string at least 5 characters in length composed only of 'A', 'C',
> >> 'G', or 'T' then that would explain this last result.
> >>
> >>> I fail to understand that behaviour. I checked the Perl
> >>> documentation a bit and I sort of understand why /[ACGT]/
> >>> matches only one letter (but not why it starts at the end).
> >>> However, I'm simply puzzled at the other things.
> >>
> >> As said, provide us with a full (minimal) program to demonstrate
> >> the problems you're having if your problems persist.
> >>
> >> Assuming 'NOTNEEDED' cannot contain '/' characters then you may
> >> need to include those in your pattern to make sure you match the
> >> parts you want. You will probably want to use captures for that
> >> (see perldoc perlre). To understand the below program you will
> >> also need to understand the /x modifier (again see perldoc
> >> perlre).
> >>
> >> #!/usr/bin/perl
> >>
> >> use strict;   # <---Make sure you have these.
> >> use warnings; # <--/
> >>
> >> my $string = '/NOTNEEDED/*ACGACGGGTTCAAGGCAG*/NOTNEEDED/';
> >>
> >> my ($match) = $string =~ m,
> >>          ^         # Beginning of string.
> >>          /         # Skip over the first '/'.
> >>          [^/]*     # Skip over anything that's not a '/'.
> >>          /         # Until the next '/'. Skip over that too.
> >>          \*        # Skip over the literal '*' character.
> >>          ([ACGT]+) # Now capture the sequence we want.
> >>          ,x;
> >>
> >> print $match, "\n";
> >>
> >> __END__
> >>
> >> Output:
> >>
> >> ACGACGGGTTCAAGGCAG
> >>
> >> IF the '*' characters literally delimit the parts that you want
> >> (AND not the parts that you don't want) then that's even easier:
> >>
> >> #!/usr/bin/perl
> >>
> >> use strict;
> >> use warnings;
> >>
> >> my $string = '/NOTNEEDED/*ACGACGGGTTCAAGGCAG*/NOTNEEDED/';
> >>
> >> my ($match) = $string =~ /\*([ACGT]+)/;
> >>
> >> print $match, "\n";
> >>
> >> __END__
> >>
> >> This produces the same output with this sample string. Without
> >> seeing the real data it's hard to speculate. There might be a
> >> better way. You need to know the specifications of the data
> >> you're processing if you want to reliably process it
> >> automatically. We need to know this to help you do it too.
> >>
> >>                                o o o o
> >>
> >> A lot of people seem to post about this same type of data. I'd be
> >> surprised if nobody has written CPAN modules for parsing the data
> >> yet (and if not then perhaps it would be economical to do so).
> >> Just saying...
> >>
> >> Regards,
> >>
> >>
> >
> 
> -- 
> To unsubscribe, e-mail: beginners-unsubscr...@perl.org
> For additional commands, e-mail: beginners-h...@perl.org
> http://learn.perl.org/
> 
> 
