On Monday, 26 March 2012, Francois Berenger wrote:
> Dear list,
>
> If I take all the fasta files for proteins in the PDB,
> are the sequences complete?
>
> I mean, do they have holes sometimes (missing amino acids)?

In theory the SEQRES records describe the sequence of the entity that was crystallized, whether or not it is all visible in the electron density or present in the deposited model. So normally there should not be any "missing" internal residues. But if the expression construct was not the full gene sequence, e.g. an N-terminal truncation, then the omitted N- or C-terminal residues (or whole domains) will not be listed.

So goes the theory. There are always corner cases. I remember having a dispute with the PDB long ago about whether a peptide chain that was known to have undergone loop cleavage was properly described with a single chain identifier or with two chain identifiers. And if the cleavage involved excision of one or more residues, would they appear in the SEQRES records anyhow?

> Sorry for the maybe stupid question but I know that sometimes
> the PDB files have missing residues, I am hoping that
> it is not the case with the FASTA files.

I was assuming that the FASTA files you refer to are just conversions of the SEQRES records. If not, then all bets are off. If the FASTA files are retrieved by gene ID from UniProt or some other sequence database, then they will be complete in one sense but may not perfectly match what was in the deposited crystal structure due to cloning artifacts, strain variation, allelic non-uniformity, etc.

	Ethan

> Regards,
> Francois.
>
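
For anyone who wants to check a particular entry, here is a minimal sketch of the SEQRES-versus-model comparison discussed above. It assumes a legacy fixed-column PDB-format file; the function names are made up for illustration and are not part of any library. It reads the per-chain sequence from the SEQRES records, counts the distinct residues that have ATOM records in the first model, and reports the difference as residues that were in the crystallized entity but not modeled.

```python
from collections import defaultdict

# Standard amino acids only; anything else is written as "X".
THREE_TO_ONE = {
    "ALA": "A", "ARG": "R", "ASN": "N", "ASP": "D", "CYS": "C",
    "GLN": "Q", "GLU": "E", "GLY": "G", "HIS": "H", "ILE": "I",
    "LEU": "L", "LYS": "K", "MET": "M", "PHE": "F", "PRO": "P",
    "SER": "S", "THR": "T", "TRP": "W", "TYR": "Y", "VAL": "V",
}


def seqres_sequences(pdb_text):
    """Return {chain_id: one-letter sequence} built from SEQRES records."""
    chains = defaultdict(list)
    for line in pdb_text.splitlines():
        if line.startswith("SEQRES"):
            chain_id = line[11]                 # column 12
            for name in line[19:70].split():    # residue names, columns 20-70
                chains[chain_id].append(THREE_TO_ONE.get(name, "X"))
    return {chain: "".join(res) for chain, res in chains.items()}


def modeled_residue_counts(pdb_text):
    """Return {chain_id: number of distinct residues with ATOM records}."""
    seen = defaultdict(set)
    for line in pdb_text.splitlines():
        if line.startswith("ENDMDL"):
            break                               # only look at the first model
        if line.startswith("ATOM"):
            chain_id = line[21]                 # column 22
            seen[chain_id].add(line[22:27])     # resSeq + insertion code
    return {chain: len(res) for chain, res in seen.items()}


def unmodeled_counts(pdb_text):
    """Return {chain_id: SEQRES length minus number of modeled residues}."""
    modeled = modeled_residue_counts(pdb_text)
    return {chain: len(seq) - modeled.get(chain, 0)
            for chain, seq in seqres_sequences(pdb_text).items()}
```

This is only a rough check. Modified residues stored as HETATM (selenomethionine, for example) are not counted as modeled here, and the count does not say where the gaps are. In PDB-format files the REMARK 465 records list the missing residues explicitly and are the better source for that.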