On Oct 2, 2012, at 3:19 PM, Florian Huber wrote:

> Thanks guys, for the answers. :-)
>
> I'm sorry I posted a shortened version of the code as I thought it'd make it
> easier to read while still getting the message across. So here's the actual
> example and the corresponding output:
>
> The string is:
>
> >ENSG00000112365|ENST00000230122|109783797|109787053TGTTTCACAATTCATTTTCTACTAAATGTGTACCATTTTTTAAATTGTTTTAACAGAAAGCTGAGGAATGAAAAAACTTCAAGCATTATTTTCAAG
>
> So I'm trying to retrieve 'ENSG00000112365', 'ENST00000230122' and the
> sequence bit, starting with a 'T' and get rid of the junk in between.
>
> code:
>
> /#!/usr/bin/perl//
> //
> //use strict;//
> //use warnings;//
> //
> //my $gene;//
> //my @elements = <>;//
> //
> //foreach $gene (@elements) {//
> // $gene =~ />(ENSG\d*) \| (ENST\d*) .*? ([AGCT]*)/x;//
> // print "$1 $2 $3\n";//
> //}/

Why all the / characters? Did you put those there, or is it some artifact of
your email client or mine?

In the future, try posting a complete program that people can run without
having to generate a data file. In this case, just assign a scalar variable
with your data line and modify your program to parse that:

    my $element = q(>ENSG00000112365|ENST00000230122|109783797|109787053TGTTTCACAA...CTTCAAGCATTATTTTCAAG);

etc.

> This will print "ENSG00000112365 ENST00000230122"
>
> without the sequence. Originally I had .* before the ([ACGT]) so I figured
> it's greedy and will eat the sequence away. ? makes it nongreedy, doesn't it?
> Still doesn't work.

What you are missing is that [AGCT]* means "zero or more characters from the
set A, G, C, and T". You are getting a zero-character match because that is
what you are asking for: the non-greedy .*? first tries to match nothing, and
right there [AGCT]* is satisfied with zero characters, so the match ends
before it ever reaches the sequence. Try ([AGCT]+) instead. That insists on at
least one matching character, which forces .*? to skip ahead to the first
A, G, C, or T, and it will then match the longest successive run of AGCT
characters.

> Other results:
>
> with ([AGCT])* it says that $3 is uninitialised - so here it didn't match at
> all???

You are telling it "zero or more", so matching nothing is fine. As above, .*?
tries the empty match first, the group succeeds with zero repetitions at that
point, and that is what you get. Since the parentheses sit inside the
repetition and it repeats zero times, nothing is ever captured, so $3 is left
undefined rather than set to an empty string.

> with ([AGCT]{5}) it works fine - it returns TGTTT.
>
> This I found kinda strange - looks like I've got something with the
> greediness/precedence wrong?

That works for the same reason + does: {5} requires actual characters, so .*?
has to keep expanding until five of them are available. What you have wrong is
not greediness or precedence; it is telling the RE engine that you don't care
about matching any characters!
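
To make the difference concrete, here is a minimal, self-contained sketch that runs the three variants discussed above against the sample line from the post. The helper subs captures() and show() are hypothetical names introduced only for this illustration; they are not part of the original program. It also assumes the leading '>' belongs to the data, since the pattern requires it.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Sample record from the post, split in two only for readability.
# Assumption: the leading '>' is part of the data line.
my $element = '>ENSG00000112365|ENST00000230122|109783797|109787053'
            . 'TGTTTCACAATTCATTTTCTACTAAATGTGTACCATTTTTTAAATTGTTTTAACAGAAAGCTGAGGAATGAAAAAACTTCAAGCATTATTTTCAAG';

# Hypothetical helper: apply a pattern and return the three captures,
# or an empty list if the pattern does not match at all.
sub captures {
    my ($re, $line) = @_;
    return $line =~ $re ? ($1, $2, $3) : ();
}

# Hypothetical helper: print captures, making undef visible instead of
# triggering an "uninitialized value" warning.
sub show {
    my ($label, @caps) = @_;
    my @shown = map { defined $_ ? "'$_'" : 'undef' } @caps;
    print "$label: @shown\n";
}

my $star       = qr/>(ENSG\d*) \| (ENST\d*) .*? ([AGCT]*)/x;   # pattern from the post
my $star_outer = qr/>(ENSG\d*) \| (ENST\d*) .*? ([AGCT])*/x;   # quantifier outside the group
my $plus       = qr/>(ENSG\d*) \| (ENST\d*) .*? ([AGCT]+)/x;   # at least one base required

my @with_star       = captures($star,       $element);
my @with_star_outer = captures($star_outer, $element);
my @with_plus       = captures($plus,       $element);

show('([AGCT]*)', @with_star);        # third capture is an empty string
show('([AGCT])*', @with_star_outer);  # third capture is undef
show('([AGCT]+)', @with_plus);        # third capture is the whole sequence
```

With + the third capture is the full run of bases starting at TGTTT; with either * form the match stops right after the ENST identifier. To process a real file, the same $plus pattern can go back into the original foreach loop over <>.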