On Tue, Oct 02, 2012 at 11:19:51PM +0100, Florian Huber wrote:
> The string is:
> 
> >ENSG00000112365|ENST00000230122|109783797|109787053TGTTTCACAATTCATTTTCTACTAAATGTGTACCATTTTTTAAATTGTTTTAACAGAAAGCTGAGGAATGAAAAAACTTCAAGCATTATTTTCAAG
> 
> So I'm trying to retrieve 'ENSG00000112365', 'ENST00000230122' and
> the sequence bit, starting with a 'T' and get rid of the junk in
> between.
> 
> code:
> 
> #!/usr/bin/perl
> 
> use strict;
> use warnings;
> 
> my $gene;
> my @elements = <>;
> 
> foreach $gene (@elements) {
>     $gene =~ />(ENSG\d*) \| (ENST\d*) .*? ([AGCT]*)/x;
>     print "$1 $2 $3\n";
> }
> 
> 
> This will print "ENSG00000112365 ENST00000230122"
> 
> without the sequence.

You might want to simplify the problem first. It looks like the
first two fields that you're interested in are delimited with the
vertical bar (|) character. You can split the string on that
first to get the 4 parts in between. Then you can just match the
simple character class that you want against the last part.

#!/usr/bin/perl

use strict;
use warnings;

my $sequence = 
'>ENSG00000112365|ENST00000230122|109783797|109787053TGTTTCACAATTCATTTTCTACTAAATGTGTACCATTTTTTAAATTGTTTTAACAGAAAGCTGAGGAATGAAAAAACTTCAAGCATTATTTTCAAG';

my ($foo, $bar, $baz) = parse_sequence($sequence);

print "foo=$foo\n";
print "bar=$bar\n";
print "baz=$baz\n";

sub parse_sequence {
    my ($sequence) = @_;

    # We don't want the leading > character.
    $sequence =~ s/^>//;

    my @parts = split /\|/, $sequence;

    # We're interested in the first two fields.
    my ($foo, $bar) = @parts;

    # And the matching part of the last one.
    my ($baz) = $parts[3] =~ /([ACGT]+)/;

    # You may prefer to return a hash reference or something with
    # meaningful field names. Those details are entirely up to
    # you.
    return ($foo, $bar, $baz);
}

__END__

Output:

foo=ENSG00000112365
bar=ENST00000230122
baz=TGTTTCACAATTCATTTTCTACTAAATGTGTACCATTTTTTAAATTGTTTTAACAGAAAGCTGAGGAATGAAAAAACTTCAAGCATTATTTTCAAG

Obviously you should assign appropriate names to the variables.
:)
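
As one possible way to give the fields meaningful names, here is a
minimal sketch that returns a hash reference instead of a list. The
sub name parse_header and the keys gene_id, transcript_id and
sequence are placeholders of my own, not anything dictated by your
data, and the sample line is shortened for readability.

```perl
use strict;
use warnings;

# Hypothetical helper: the sub name and the hash keys are illustrative only.
sub parse_header {
    my ($line) = @_;

    # Drop the leading > character.
    $line =~ s/^>//;

    # Split on the literal vertical bar into at most 4 fields.
    my @parts = split /\|/, $line, 4;
    return unless @parts == 4;

    # The last field is the end coordinate followed directly by the bases.
    my ($sequence) = $parts[3] =~ /([ACGT]+)/;
    return unless defined $sequence;

    return {
        gene_id       => $parts[0],
        transcript_id => $parts[1],
        sequence      => $sequence,
    };
}

my $record = parse_header(
    '>ENSG00000112365|ENST00000230122|109783797|109787053TGTTTCAC');
print "$record->{gene_id} $record->{transcript_id} $record->{sequence}\n";
```

Returning nothing for a line that doesn't have the expected shape
lets the caller skip it with a simple "next unless $record".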

> Originally I had .* before the ([ACGT]) so I figured it's
> greedy and will eat the sequence away. ? makes it nongreedy,
> doesn't it? Still doesn't work.
> 
> Other results:
> 
> with ([AGCT])* it says that $3 is uninitialised - so here it
> didn't match at all???
> 
> with ([AGCT]{5}) it works fine - it returns TGTTT.
> 
> 
> This I found kinda strange - looks like I've got something with
> the greediness/precedence wrong?

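It takes some practice (and often a clear head) to get regular
expressions right. :) A * is satisfied by zero matches, so it only
consumes characters when the rest of the regular expression lets
it. In your pattern the non-greedy .*? prefers to match nothing,
and ([AGCT]*) is then happy to match zero characters right after
the ENST digits, which leaves $3 empty. With ([AGCT])* the group
itself never runs, so $3 is undefined. {5} worked because it
demands five bases, which forces .*? to skip ahead. If you require
at least one match then be sure you use +. :) And as above, always
try to make things as simple as possible. It's much easier to get
simple correct. :)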
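
If you do want to stay with a single regular expression, the
difference between * and + can be seen side by side. This is only a
sketch against a shortened copy of your sample line. The variable
names are mine, and the $ anchor is an extra assumption that the
bases run to the end of the line.

```perl
use strict;
use warnings;

my $line = '>ENSG00000112365|ENST00000230122|109783797|109787053TGTTTCAC';

# Original pattern: .*? prefers to match nothing and [AGCT]* accepts
# zero characters, so the match ends right after the ENST digits.
my @star = $line =~ />(ENSG\d*) \| (ENST\d*) .*? ([AGCT]*)/x;

# With + the last group needs at least one base, which forces .*? to
# expand. The $ anchor ties the group to the run of bases at the end
# of the line.
my @plus = $line =~ />(ENSG\d+) \| (ENST\d+) .*? ([ACGT]+) $/x;

print "star: [$star[2]]\n";
print "plus: [$plus[2]]\n";
```

The first print shows an empty pair of brackets and the second
shows TGTTTCAC.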

Regards,


-- 
Brandon McCaig <bamcc...@gmail.com> <bamcc...@castopulence.org>
Castopulence Software <https://www.castopulence.org/>
Blog <http://www.bamccaig.com/>
perl -E '$_=q{V zrna gur orfg jvgu jung V fnl. }.
q{Vg qbrfa'\''g nyjnlf fbhaq gung jnl.};
tr/A-Ma-mN-Zn-z/N-Zn-zA-Ma-m/;say'
