On Oct 2, 2012, at 3:19 PM, Florian Huber wrote:

> Thanks guys, for the answers. :-)
> 
> I'm sorry I posted a shortened version of the code as I thought it'd make it 
> easier to read while still getting the message across. So here's the actual 
> example and the corresponding output:
> 
> The string is:
> 
> >ENSG00000112365|ENST00000230122|109783797|109787053TGTTTCACAATTCATTTTCTACTAAATGTGTACCATTTTTTAAATTGTTTTAACAGAAAGCTGAGGAATGAAAAAACTTCAAGCATTATTTTCAAG
> 
> So I'm trying to retrieve 'ENSG00000112365', 'ENST00000230122' and the 
> sequence bit, starting with a 'T' and get rid of the junk in between.
> 
> code:
> 
> /#!/usr/bin/perl//
> //
> //use strict;//
> //use warnings;//
> //
> //my $gene;//
> //my @elements = <>;//
> //
> //foreach $gene (@elements) {//
> //    $gene =~ />(ENSG\d*) \| (ENST\d*) .*? ([AGCT]*)/x;//
> //    print "$1 $2 $3\n";//
> //}/

Why all the / characters? Did you put those there, or is it some artifact of 
your email client or mine?

In the future, try posting a complete program that people can run without 
having to generate a data file. In this case, just assign a scalar variable 
with your data line and modify your program to parse that:

my $element = 
q(>ENSG00000112365|ENST00000230122|109783797|109787053TGTTTCACAA...CTTCAAGCATTATTTTCAAG);

etc.

> This will print "ENSG00000112365 ENST00000230122"
> 
> without the sequence. Originally I had .* before the ([ACGT]) so I figured 
> it's greedy and will eat the sequence away. ? makes it nongreedy, doesn't it? 
> Still doesn't work.

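What you are missing is that [AGCT]* means "zero or more characters from the 
set A, G, C, and T". The non-greedy .*? first tries to match nothing, which 
leaves [AGCT]* sitting at the '|' after the ENST id, where a zero-character 
match succeeds. You get an empty $3 because that is what you asked for. Try 
([AGCT]+) instead: it insists on at least one matching character, so .*? has 
to stretch up to the first A, G, C, or T, and the group then takes the whole 
run of AGCT characters that starts there.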
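
To see the difference side by side, here is a minimal sketch that runs both 
patterns against a shortened copy of your data line. The helper name 
captures() and the truncated sequence are my own choices for illustration, 
not part of your program:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Shortened stand-in for the real data line (sequence truncated for brevity).
my $line = '>ENSG00000112365|ENST00000230122|109783797|109787053TGTTTCACAATTC';

# Hypothetical helper: return the three captures, or an empty list on failure.
sub captures {
    my ($re, $text) = @_;
    return $text =~ $re ? ($1, $2, $3) : ();
}

# Same pattern twice; only the quantifier on the last group differs.
my $star = qr/>(ENSG\d*) \| (ENST\d*) .*? ([AGCT]*)/x;
my $plus = qr/>(ENSG\d*) \| (ENST\d*) .*? ([AGCT]+)/x;

my @with_star = captures($star, $line);
my @with_plus = captures($plus, $line);

print "star: [$with_star[2]]\n";   # empty: zero characters satisfied [AGCT]*
print "plus: [$with_plus[2]]\n";   # the run of AGCT characters after the junk
```

With the star, .*? and [AGCT]* both settle for nothing right after the ENST 
id. With the plus, .*? is forced to consume the junk in between.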

> 
> Other results:
> 
> with ([AGCT])* it says that $3 is uninitialised - so here it didn't match at 
> all???
> 

You are telling it "zero or more", so no match is fine. With the * outside 
the parentheses, the group itself may run zero times. At the '|' it never 
runs at all, so $3 stays undefined instead of being set to the empty string.
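
A small sketch of that distinction, assuming the same shortened data line as 
a stand-in for your real one (the variable names are mine):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Shortened stand-in for the real data line (sequence truncated for brevity).
my $line = '>ENSG00000112365|ENST00000230122|109783797|109787053TGTTTCACAATTC';

# Quantifier outside the parentheses: the group may run zero times.
my @outside = $line =~ />(ENSG\d*) \| (ENST\d*) .*? ([AGCT])*/x;

# Quantifier inside the parentheses: the group always runs exactly once.
my @inside  = $line =~ />(ENSG\d*) \| (ENST\d*) .*? ([AGCT]*)/x;

printf "outside: %s\n", defined($outside[2]) ? "'$outside[2]'" : 'undef';
printf "inside:  %s\n", defined($inside[2])  ? "'$inside[2]'"  : 'undef';
```

The first pattern leaves the third capture undefined, which is what triggers 
the "uninitialized" warning when you print it. The second sets it to an 
empty string, so you get no warning and no sequence either.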


> with ([AGCT]{5}) it works fine - it returns TGTTT.
> 
> 
> This I found kinda strange - looks like I've got something with the 
> greediness/precedence wrong?

What you have wrong is telling the RE engine that you don't care about matching 
any characters!



