On Mon, Jun 6, 2011 at 12:29 PM, Nathalie Conte <n...@sanger.ac.uk> wrote:

> Hi,
>
> I need to remove the first 52 bp of each sequence read in a FASTQ file; the
> sequence is on line 2.
> FASTQ format, from Wikipedia: A FASTQ file normally uses four lines per
> sequence. Line 1 begins with a '@' character and is followed by a sequence
> identifier and an /optional/ description. Line 2 is the raw sequence
> letters. Line 3 begins with a '+' character and is /optionally/ followed by
> the same sequence identifier (and any description) again. Line 4 encodes the
> quality values for the sequence in Line 2, and must contain the same number
> of symbols as letters in the sequence.
>
> A minimal FASTQ file might look like this:
>
> @SEQ_ID
> GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
> +
> !''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65
>
>
> I have written this script to remove the first 52 bp of each sequence and
> write the new line to newfile.txt. It seems to do the job, but
> what I need is to change my original bed file with the trimmed sequence
> lines and keep the other lines the same. I am not sure where to start to
> modify the original FASTQ.
> This is my script to trim my sequence:
>
> #!/software/bin/perl
> use warnings;
> use strict;
>
>
> open (IN, "/file.fastq") or die "can't open in:$!";
> open (OUT, ">>newfile.txt") or die "can't open out: $!";
>
>   while (<IN>) {
> next unless (/^[A-Z]/);
>   my $new_line=substr($_,52);
>   print OUT $new_line;
>
> }
>
>
> thanks for any suggestions
> Nat
>
Hi Nathalie,

I am not 100% sure on this, as I suspect that when modifying the original
file you also want to deal with the 4th line. Going by the description you
quoted, it holds one quality symbol per letter of the sequence, so it would
have to lose the same 52 characters; beyond that I do not know the format.
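
In case it helps, here is a rough sketch of how the whole file could be
rewritten four lines at a time, trimming lines 2 and 4 and passing lines 1
and 3 through untouched. It assumes every record is exactly four lines (no
wrapped sequences), and the names trim_record and trim_fastq are made up for
this example. The demo reads the sample record from your mail out of a string
rather than a real file, so treat it as a starting point, not a tested tool.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $TRIM = 52;    # number of leading bases to drop, as in the question

# Trim one four-line FASTQ record given as four chomped lines.
# The @id and + lines are returned unchanged; the sequence and quality
# lines both lose their first $n characters so they stay the same length.
sub trim_record {
    my ($n, $id, $seq, $plus, $qual) = @_;
    for my $line ($seq, $qual) {    # foreach aliases, so this edits $seq and $qual
        # substr with an offset past the end warns and returns undef, so guard
        $line = length($line) > $n ? substr($line, $n) : '';
    }
    return ($id, $seq, $plus, $qual);
}

# Copy every record from $in to $out, trimmed.
sub trim_fastq {
    my ($in, $out, $n) = @_;
    while (defined(my $id = <$in>)) {
        my @rec = ($id, scalar(<$in>), scalar(<$in>), scalar(<$in>));
        die "truncated FASTQ record\n" if grep { !defined $_ } @rec;
        chomp @rec;
        print {$out} "$_\n" for trim_record($n, @rec);
    }
    return;
}

# Demo: the sample record from the question, read from an in-memory string.
my $input = <<'END';
@SEQ_ID
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65
END

open my $in, '<', \$input or die "can't open in: $!";
my $output = '';
open my $out, '>', \$output or die "can't open out: $!";
trim_fastq($in, $out, $TRIM);
close $out;
print $output;
```

For a real file you would open the FASTQ for reading and a new file for
writing, then rename the new file over the original once the loop has
finished without errors. That is safer than editing the original in place.
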

Anyway, assuming that you are dealing with the 2nd line only, life is a
whole lot simpler.
You know the number of characters you are removing is always 52, so we
don't have to work that out. From there we can take various routes: we
could take all characters from offset 52 to the end of the string (substr puts
the first character at position 0, not position 1 ;-), or we could
cut out the first 52 characters. We could do the cutting with a
regular expression or with substr (I have no idea
which one is faster; please benchmark that if you are going to run a large
number of such operations, as it could save you a lot of time ;-)

Using 4-argument substr to cut in place (it returns the removed part, so work
on a copy): my $new_line2 = $_; substr $new_line2, 0, 52, "";
Using a regular expression to do the work: my $new_line3 = $_; $new_line3 =~
s/^[A-Z]{52}//;
Taking everything from offset 52 onwards: my $new_line4 = substr $_, 52;

All 3 will give you the result you are looking for, provided the line holds
at least 52 letters. I suspect that the first one will be the fastest option,
based on what little experience I have with these types of operations, but
please do prove this before you start working on thousands of files...
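
A rough way to check that is the core Benchmark module. The sketch below
compares the three approaches on a made-up 100-base read; the sub names
(by_cut, by_regex, by_offset) and the test read are invented for this
example, and the numbers will vary with your perl version and machine.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Hypothetical 100-base read with its trailing newline, as read from a file.
my $line = ('ACGT' x 25) . "\n";

# 4-argument substr cuts in place and returns the removed part,
# so work on a copy and return the copy.
sub by_cut    { my $s = $_[0]; substr $s, 0, 52, ''; return $s }

# Anchored regular expression removing exactly 52 leading capital letters.
sub by_regex  { my $s = $_[0]; $s =~ s/^[A-Z]{52}//; return $s }

# 2-argument substr returns everything from offset 52 to the end.
sub by_offset { return substr $_[0], 52 }

# Make sure all three agree before timing them.
die "results differ\n"
    unless by_cut($line) eq by_regex($line)
        && by_regex($line) eq by_offset($line);

# Run each candidate for at least 1 CPU second and print a comparison table.
cmpthese(-1, {
    cut    => sub { my $r = by_cut($line) },
    regex  => sub { my $r = by_regex($line) },
    offset => sub { my $r = by_offset($line) },
});
```
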

Regards,

Rob
