Hi Nathalie,
On Mon, Jun 6, 2011 at 19:29, Nathalie Conte <n...@sanger.ac.uk> wrote:
> I need to remove the first 52 bp sequences reads in a fastq file,sequence is
> on line 2.
> fastq file from wikipedia:A FASTQ file normally uses four lines per
> sequence. Line 1 begins with a '@' character and is followed by a sequence
> identifier and an /optional/ description. Line 2 is the raw sequence ...
> #!/software/bin/perl
> use warnings;
> use strict;
>
>
> open (IN, "/file.fastq") or die "can't open in:$!";
> open (OUT, ">>newfile.txt") or die "can't open out: $!";
>
> while (<IN>) {
> next unless (/^[A-Z]/);
> my $new_line=substr($_,52);
> print OUT $new_line;
>
> }

I frequently play with FASTQ data, but basically Rob has said it all.
Since FASTQ has a fixed format, you should use that to your advantage by
reading in 4 lines at a time and then processing them as needed.

However, what you have above does not work, because the fourth line (the
quality scores) can also start with A-Z, so your regex will match it as
well as the sequence line. (Of course, if you are trimming 52 bases from
the sequences, you probably want to trim 52 from the quality scores, too.
But that's a separate issue...)

I would suggest you write a loop that goes through the file and takes in
four lines at a time. Then check that line 1 starts with a @ and line 3
starts with a +. Then compare the lengths of lines 2 and 4 to make sure
they're equal. If it all checks out, then do the trimming that Rob
suggests.

The FASTQ standard technically allows lines 2 and 4 to span multiple
lines, so the sanity check above is a good idea if you want to make your
script flexible. But sometimes you may know for certain that this does
not occur in your data; if so, you can skip the sanity check.

Good luck!

Ray
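
P.S. In case it helps, here is a rough sketch of the four-lines-at-a-time loop described above. It is only one way to do it, and it rests on a few assumptions of mine: every record is exactly four lines (no wrapped sequence or quality lines), reads with 52 or fewer bases are simply dropped, and the sub names read_record and trim_fastq are made up for this example. The demo at the bottom uses an invented record held in memory and trims 2 bases rather than 52, so that it runs without any input file.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Read one 4-line FASTQ record from a filehandle.
# Returns an array ref of the four chomped lines, or nothing at end of file.
# Assumption: records are never wrapped over more than four lines.
sub read_record {
    my ($fh) = @_;
    my @rec;
    for (1 .. 4) {
        my $line = <$fh>;
        if (!defined $line) {
            return if !@rec;    # clean end of file
            die "Truncated record at end of file\n";
        }
        chomp $line;
        push @rec, $line;
    }
    return \@rec;
}

# Trim the first $n bases, and the matching $n quality scores, from each read.
sub trim_fastq {
    my ($in, $out, $n) = @_;
    while (my $rec = read_record($in)) {
        my ($id, $seq, $plus, $qual) = @$rec;

        # Sanity checks on the record layout
        die "Line 1 does not start with '\@': $id\n"  unless $id   =~ /^\@/;
        die "Line 3 does not start with '+': $plus\n" unless $plus =~ /^\+/;
        die "Sequence/quality length mismatch: $id\n"
            unless length($seq) == length($qual);

        # Assumption: a read with nothing left after trimming is dropped
        next if length($seq) <= $n;

        print {$out} join("\n", $id, substr($seq, $n),
                                $plus, substr($qual, $n)), "\n";
    }
    return;
}

# Demo on an in-memory record (made-up data), trimming 2 bases.
my $fastq = "\@read1\nACGTAC\n+\nIIIIHH\n";
open my $in, '<', \$fastq or die "can't open in: $!";
my $result = '';
open my $out, '>', \$result or die "can't open out: $!";
trim_fastq($in, $out, 2);
close $out;
print $result;    # the read is now GTAC with qualities IIHH
```

For real data you would open the input and output files with the three-argument form of open (and '>' rather than '>>', unless you really want to append) and call trim_fastq($in, $out, 52). Note that a blank line at the end of the file will trigger the "Truncated record" error in this sketch.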