Re: Multiple matching of a group of characters

Florian Huber Tue, 02 Oct 2012 15:20:35 -0700

Thanks guys, for the answers. :-)

I'm sorry I posted a shortened version of the code as I thought it'd make it easier to read while still getting the message across. So here's the actual example and the corresponding output:


The string is:

>ENSG00000112365|ENST00000230122|109783797|109787053TGTTTCACAATTCATTTTCTACTAAATGTGTACCATTTTTTAAATTGTTTTAACAGAAAGCTGAGGAATGAAAAAACTTCAAGCATTATTTTCAAG

So I'm trying to retrieve 'ENSG00000112365', 'ENST00000230122' and the sequence bit, starting with a 'T', and get rid of the junk in between.


code:

#!/usr/bin/perl

use strict;
use warnings;

my $gene;
my @elements = <>;

foreach $gene (@elements) {
    $gene =~ />(ENSG\d*) \| (ENST\d*) .*? ([AGCT]*)/x;
    print "$1 $2 $3\n";
}


This will print "ENSG00000112365 ENST00000230122"

without the sequence. Originally I had .* before the ([ACGT]) so I figured it's greedy and will eat the sequence away. ? makes it nongreedy, doesn't it? Still doesn't work.


Other results:

with ([AGCT])* it says that $3 is uninitialised - so here it didn't match at all???


with ([AGCT]{5}) it works fine - it returns TGTTT.

This I found kinda strange - looks like I've got something with the greediness/precedence wrong?
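
The three results above can be reproduced in isolation. The following is only a sketch, not part of the original script: the helper name third_capture and the shortened sample line are assumptions made for illustration, and it assumes header and sequence sit on one line as shown above. It suggests that the non-greedy .*? is content to match nothing, after which [AGCT]* is also allowed to match nothing, so the pattern succeeds right after the ENST field. The last case shows one possible way around that, requiring at least one base with +.

```perl
#!/usr/bin/perl

use strict;
use warnings;

# Shortened copy of the sample line (assumption: everything is on one line).
my $line = '>ENSG00000112365|ENST00000230122|109783797|109787053TGTTTCACAATTC';

# Hypothetical helper: apply a pattern and describe what ended up in $3.
sub third_capture {
    my ($text, $re) = @_;
    my @captures = $text =~ $re or return 'no match';
    return defined $captures[2] ? "'$captures[2]'" : 'undef';
}

my @cases = (
    # Both .*? and [AGCT]* may match zero characters: $3 is the empty string.
    [ '([AGCT]*)'   => qr/>(ENSG\d*) \| (ENST\d*) .*? ([AGCT]*)/x   ],
    # The group itself may repeat zero times, so it never captures: $3 is undef.
    [ '([AGCT])*'   => qr/>(ENSG\d*) \| (ENST\d*) .*? ([AGCT])*/x   ],
    # Five bases are mandatory, so .*? has to stretch up to the sequence.
    [ '([AGCT]{5})' => qr/>(ENSG\d*) \| (ENST\d*) .*? ([AGCT]{5})/x ],
    # At least one base is mandatory; the greedy + then takes the whole run.
    [ '([AGCT]+)'   => qr/>(ENSG\d*) \| (ENST\d*) .*? ([AGCT]+)/x   ],
);

printf "%-12s -> %s\n", $_->[0], third_capture($line, $_->[1]) for @cases;
```

With the shortened line, the four cases should report '', undef, 'TGTTT' and 'TGTTTCACAATTC' respectively, which matches the empty output, the "uninitialised" warning and the TGTTT result described above.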


Thank you for your help!

Flo


On 02/10/2012 01:36, Brandon McCaig wrote:

On Mon, Oct 01, 2012 at 11:15:53PM +0100, Florian Huber wrote:

Dear all,

Hello,

$string = "/NOTNEEDED/*ACGACGGGTTCAAGGCAG*/NOTNEEDED/"

I would suggest that you show us the real data. I'm assuming that
'NOTNEEDED' is a placeholder for some data that you're not
interested in. Without knowing what that is we can't really say
for sure what is going on (though we can speculate; see below).

Note that you should be using the strict and warnings pragmas
(see below). The lack of 'my' here suggests that you probably
aren't.

But when I do

$string =~ /[ACGT]/;

it matches only the last letter, i.e. "G". Why doesn't it start
at the beginning?

It isn't matching the last letter. You are probably making the
wrong assumption. This is common when you're having trouble with
code. Again, show us the 'NOTNEEDED' part. :)

But it gets even better, I figured that adding the greedy *
should help:

$string =~ /[ACGT]*/;

and now it doesn't match anything. Shouldn't it try to match as
many times as possible?

It should match at least the once that you saw earlier (assuming
the same data).

My confusion was complete when I tried

$string =~ /[ACGT]{5}/;

now it matches 5 letters, but this time from the beginning,
i.e.: ACGAC.

I'm guessing that the first 'NOTNEEDED' contains a 'G'. That
would explain the first match. The second result is nonsense
with the data we've seen. :-/ If 'NOTNEEDED' doesn't contain a
string at least 5 characters in length composed only of 'A', 'C',
'G', or 'T' then that would explain this last result.
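
To make that guess concrete, here is a small sketch with made-up data. The contents of the junk fields ('xxGxx' and 'yyyy') and the helper name first_match are assumptions for illustration only, since the real 'NOTNEEDED' text is unknown. It relies on the regex engine always reporting the leftmost match, and it also shows that [ACGT]* can succeed with an empty match at the very start of the string, which could look like matching nothing.

```perl
#!/usr/bin/perl

use strict;
use warnings;

# Hypothetical data: the first junk field contains a 'G' but no run of
# five A/C/G/T characters. The real 'NOTNEEDED' content is not known.
my $string = '/xxGxx/*ACGACGGGTTCAAGGCAG*/yyyy/';

# Hypothetical helper: return the matched text and the offset it starts at,
# or an empty list if the pattern does not match.
sub first_match {
    my ($text, $re) = @_;
    return unless $text =~ $re;
    return (substr($text, $-[0], $+[0] - $-[0]), $-[0]);
}

for my $re (qr/[ACGT]/, qr/[ACGT]*/, qr/[ACGT]{5}/) {
    my ($matched, $offset) = first_match($string, $re);
    print "$re matched '$matched' at offset $offset\n";
}
```

With this sample string the single-character pattern finds the 'G' in the junk at offset 3, the * pattern matches the empty string at offset 0, and only the {5} pattern is forced forward to 'ACGAC' at offset 8.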

I fail to understand that behaviour. I checked the Perl
documentation a bit and I sort of understand why /[ACGT]/
matches one letter only (but not why it starts at the end).
However, I'm simply puzzled at the other things.

As said, provide us with a full (minimal) program to demonstrate
the problems you're having if your problems persist.

Assuming 'NOTNEEDED' cannot contain '/' characters then you may
need to include those in your pattern to make sure you match the
parts you want. You will probably want to use captures for that
(see perldoc perlre). To understand the below program you will
also need to understand the /x modifier (again see perldoc
perlre).

#!/usr/bin/perl

use strict;   # <---Make sure you have these.
use warnings; # <--/

my $string = '/NOTNEEDED/*ACGACGGGTTCAAGGCAG*/NOTNEEDED/';

my ($match) = $string =~ m,
         ^         # Beginning of string.
         /         # Skip over the first '/'.
         [^/]*     # Skip over anything that's not a '/'.
         /         # Until the next '/'. Skip over that too.
         \*        # Skip over the literal '*' character.
         ([ACGT]+) # Now capture the sequence we want.
         ,x;

print $match, "\n";

__END__

Output:

ACGACGGGTTCAAGGCAG

IF the '*' characters literally delimit the parts that you want
(AND not the parts that you don't want) then that's even easier:

#!/usr/bin/perl

use strict;
use warnings;

my $string = '/NOTNEEDED/*ACGACGGGTTCAAGGCAG*/NOTNEEDED/';

my ($match) = $string =~ /\*([ACGT]+)/;

print $match, "\n";

__END__

This produces the same output with this sample string. Without
seeing the real data it's hard to speculate. There might be a
better way. You need to know the specifications of the data
you're processing if you want to reliably process it
automatically. We need to know this to help you do it too.

                               o o o o

A lot of people seem to post about this same type of data. I'd be
surprised if nobody has written CPAN modules for parsing the data
yet (and if not then perhaps it would be economical to do so).
Just saying...

Regards,
