Hello everyone,
I'm so glad that I could finally be of some help to a group that helped me
before.


>  Hi Changrong
>
> The problem doesn't seem difficult, but I'm afraid we don't have much
> knowledge of bioinformatics between us. If you post a sample of input
> data and the corresponding output you desire then I am sure we can help.
>
> Regards,
>
> Rob
>
>
It would take me a while to write the script myself, because I'm a molecular
biologist and only started writing Perl less than 3 months ago.
But I can give you some pointers:
Normally FASTA files will have the following structure (I'm using a real
example)

>gi|61499|emb|CAA24495.1| src [Avian sarcoma virus]
MGSSKSKPKDPSQRRRSLEPPDSTHHGGFPASQTPNKTAAPDTHRTPSRSFGTVATEPKLFGGFNTSDTV
TSPQRAGALAGGVTTFVALYDYESWIETDLSFKKGERLQIVNNTEGNWWLAHSLTTGQTGYIPSNYVAPS
DSIQAEEWYFGKITRRESERLLLNPENPRGTFLVRESETTKGAYCLSVSDFDNAKGLNVKHYKIRKLDSG
GFYITSRTQFSSLQQLVAYYSKHADGLCHRLTNVCPTSKPQTQGLAKDAWEIPRESLRLEVKLGQGCFGE
VWMGTWNGTTRVAIKTLKPGTMSPEAFLQEAQVMKKLRHEKLVQLYAVVSEEPIYIVIEYMSKGSLLDFL
KGEMGKYLRLPQLVDMAAQIASGMAYVERMNYVHRDLRAANILVGENLVCKVADFGLARLIEDNEYTARQ
GAKFPIKWTAPEAALYGRFTIKSDVWSFGILLTELTTKGRVPYPGMGNGEVLDRVERGYRMPCPPECPES
LHDLMCQCWRRDPEERPTFEYLQAQLLPACVLEVAE
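
To make that structure concrete, here is a minimal sketch of how a single
record like the one above could be split into its header line and its
sequence. The helper name parse_fasta_record is made up for illustration, the
sequence is shortened to the first 20 residues, and I'm assuming one record
per string, so take it as a starting point rather than a finished parser.

```perl
use strict;
use warnings;

# Hypothetical helper (not from any library): split one FASTA record
# into its header and its sequence.
# Assumes the first line is the '>' header and every following line
# belongs to the sequence.
sub parse_fasta_record {
    my ($record) = @_;
    my ($header, @lines) = split /\n/, $record;
    $header =~ s/^>//;                 # drop the leading '>'
    my $sequence = join '', @lines;    # glue the wrapped lines back together
    $sequence =~ s/\s+//g;             # remove any stray whitespace
    return ($header, $sequence);
}

# Shortened version of the record above (first 20 residues only).
my $record = ">gi|61499|emb|CAA24495.1| src [Avian sarcoma virus]\n"
           . "MGSSKSKPKD\n"
           . "PSQRRRSLEP\n";

my ($header, $sequence) = parse_fasta_record($record);
print "Header:   $header\n";
print "Sequence: $sequence\n";
```

A file with several records would need an outer loop that starts a new
record at each line beginning with '>'.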

So it looks like a problem to tackle by treating the amino acid sequence as a
string and going through it, counting how many times each letter shows up. Each
of those letters represents one amino acid, and the code is
unequivocal, meaning that a letter will == ONE amino acid, and no amino acid
will be represented by more than ONE letter. So it boils down to a script
that will go through the string and store (in a hash, say) how many times M
shows up, how many times G does, and so forth.
Calculating the percentages should be easy afterwards.
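
A minimal sketch of the counting and percentage steps, assuming the sequence
is already in a single string with the header line removed. The helper name
residue_counts and the 20-residue test string (the start of the sequence
above) are only there for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: count how often each one-letter code occurs.
# The regex in list context returns every letter, so whitespace and
# line breaks are skipped automatically.
sub residue_counts {
    my ($sequence) = @_;
    my %count;
    $count{$_}++ for uc($sequence) =~ /[A-Z]/g;
    return %count;
}

my $sequence = 'MGSSKSKPKDPSQRRRSLEP';   # first 20 residues of the example
my %count    = residue_counts($sequence);

# Total number of residues counted, used as the denominator.
my $total = 0;
$total += $_ for values %count;

# One line per amino acid: letter, count, percentage of the total.
for my $aa (sort keys %count) {
    printf "%s %4d %6.2f%%\n", $aa, $count{$aa}, 100 * $count{$aa} / $total;
}
```

A hash keyed by the letter avoids needing a separate variable for each of the
amino acids, and sorting the keys keeps the report in a stable order.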

So Changrong, if you can wait and don't mind using poorly, poorly written
code, I'm up for the challenge.
Alternatively, a good Samaritan out there can help you with a one- or
two-liner.

Cheers,
Mariano
