Hello everyone, I'm so glad that I could finally be of some help to a group that helped me before.

> Hi Changrong
>
> The problem doesn't seem difficult, but I'm afraid we don't have much
> knowledge of bioinformatics between us. If you post a sample of input
> data and the corresponding output you desire then I am sure we can help.
>
> Regards,
>
> Rob

It would take me a while to write the script myself, because I'm a molecular biologist and only started writing Perl less than 3 months ago. But I can give you some pointers.

Normally FASTA files will have the following structure (I'm using a real example):

>gi|61499|emb|CAA24495.1| src [Avian sarcoma virus]
MGSSKSKPKDPSQRRRSLEPPDSTHHGGFPASQTPNKTAAPDTHRTPSRSFGTVATEPKLFGGFNTSDTV
TSPQRAGALAGGVTTFVALYDYESWIETDLSFKKGERLQIVNNTEGNWWLAHSLTTGQTGYIPSNYVAPS
DSIQAEEWYFGKITRRESERLLLNPENPRGTFLVRESETTKGAYCLSVSDFDNAKGLNVKHYKIRKLDSG
GFYITSRTQFSSLQQLVAYYSKHADGLCHRLTNVCPTSKPQTQGLAKDAWEIPRESLRLEVKLGQGCFGE
VWMGTWNGTTRVAIKTLKPGTMSPEAFLQEAQVMKKLRHEKLVQLYAVVSEEPIYIVIEYMSKGSLLDFL
KGEMGKYLRLPQLVDMAAQIASGMAYVERMNYVHRDLRAANILVGENLVCKVADFGLARLIEDNEYTARQ
GAKFPIKWTAPEAALYGRFTIKSDVWSFGILLTELTTKGRVPYPGMGNGEVLDRVERGYRMPCPPECPES
LHDLMCQCWRRDPEERPTFEYLQAQLLPACVLEVAE

The first line, starting with ">", is the header; everything after it is the amino acid sequence, wrapped over several lines.

So it looks like a problem to tackle by treating the amino acid sequence as a plain string and parsing it (with a regex, for instance), counting how many times each letter shows up in it. Each of those letters represents one amino acid, and the code is unambiguous: each letter stands for exactly ONE amino acid, and no amino acid is represented by more than ONE letter.

So it boils down to a script that will go through the string and store in a variable how many times M shows up, how many times G does, and so forth. Calculating the percentages should be easy afterwards: divide each count by the total length of the sequence and multiply by 100.

So Changrong, if you can wait and don't mind using poorly, poorly written code, I'm up for the challenge. Alternatively, a good Samaritan out there can help you with a one- or two-liner.

Cheers,

Mariano
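
P.S. For what it's worth, here is a rough sketch of the counting idea described above. It is only an illustration under a few assumptions: the input holds a single FASTA record, the header is the line starting with ">", and every other non-empty line is sequence. The sub names (read_fasta_record, count_residues, residue_percentages) are made up for this example, and the toy record is just the first 20 letters of the src sequence.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: split the text of ONE FASTA record into
# its header and its sequence (all sequence lines joined together).
sub read_fasta_record {
    my ($text) = @_;
    my ($header, $sequence) = ('', '');
    for my $line (split /\n/, $text) {
        $line =~ s/\s+$//;            # drop trailing whitespace and CRs
        next if $line eq '';          # skip blank lines
        if ($line =~ /^>(.*)/) {
            $header = $1;             # header line, without the ">"
        } else {
            $sequence .= uc $line;    # sequence line, forced to upper case
        }
    }
    return ($header, $sequence);
}

# Hypothetical helper: count how many times each letter occurs.
sub count_residues {
    my ($sequence) = @_;
    my %count;
    $count{$_}++ for $sequence =~ /[A-Z]/g;   # one match per letter
    return %count;
}

# Hypothetical helper: turn the counts into percentages of the total.
sub residue_percentages {
    my (%count) = @_;
    my $total = 0;
    $total += $_ for values %count;
    return () unless $total;                  # avoid dividing by zero
    return map { $_ => 100 * $count{$_} / $total } keys %count;
}

# Toy input: the first 20 residues of the example, wrapped over two lines.
my $fasta = <<'END';
>toy_example
MGSSKSKPKD
PSQRRRSLEP
END

my ($header, $sequence) = read_fasta_record($fasta);
my %count   = count_residues($sequence);
my %percent = residue_percentages(%count);

print "$header (", length($sequence), " residues)\n";
for my $aa (sort keys %count) {
    printf "%s %3d %6.2f%%\n", $aa, $count{$aa}, $percent{$aa};
}
```

For a real file you would read the record from a filehandle instead of the here-document, and a file holding several records would need the counts reset at each new ">" line. In the toy record S occurs 5 times out of 20 letters, so it should come out at 25%.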