The only way to really do "creative" stats on the PDB is to just
download the whole thing. It is a sobering thought to realize just how
tiny it is! Less than 17 GB. Once you've got it all on your hard disk
you can start writing little programs to look for different things. I
have posted mine for doing the "method count" here:
http://bl831.als.lbl.gov/~jamesh/pickup/pdb_method_count_notes.txt
For those of you who don't speak awk, basically what I do is first look
at the "method used to determine the structure" entry, but when that is
NULL or otherwise uninformative you can do a number of things. If the
entry lists another PDB entry in "methods", it is probably a molecular
replacement solution. Also, if the "software" used to solve it was
PHASER or MOLREP, or ARP/wARP, or COMO etc., then I'm willing to bet it
is an MR solution too. On the other hand, if the "software" was SOLVE
or AUTOSHARP, then I imagine they probably were doing MAD/SAD. Even if
they didn't know it.
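For those who read Python more easily than awk, the fallback logic described above can be sketched roughly as follows. This is not the posted script, only an illustration. The function names, the keyword sets, and the pattern used to spot a PDB ID are all assumptions made for the example. The field labels are those that appear in REMARK 200 of PDB-format files.

```python
import re

# Assumed keyword sets for illustration; the real script handles many more cases.
KNOWN = {"MR", "MAD", "SAD", "MIR", "SIR", "MIRAS", "SIRAS"}
UNINFORMATIVE = {"", "NULL", "N/A", "OTHER"}
MR_SOFTWARE = {"PHASER", "MOLREP", "ARP/WARP", "COMO"}
EXPERIMENTAL_SOFTWARE = {"SOLVE", "AUTOSHARP"}

# Assumed shape of a PDB ID: a digit 1-9 followed by three alphanumerics,
# at least one of which is a letter (so a year like 2013 is not matched).
PDB_ID = re.compile(r"\b[1-9](?=[A-Z0-9]{0,2}[A-Z])[A-Z0-9]{3}\b")


def remark200_field(pdb_text, label):
    """Return the value of the first 'REMARK 200 <label>:' line, or None."""
    prefix = "REMARK 200 " + label + ":"
    for line in pdb_text.splitlines():
        if line.startswith(prefix):
            return line[len(prefix):].strip()
    return None


def guess_method(method, software):
    """Guess the phasing method from the 'method' and 'software' fields."""
    m = (method or "").strip().upper()
    # 1. Trust the method field when it says something recognizable.
    if m in KNOWN:
        return m
    if m == "MOLECULAR REPLACEMENT":
        return "MR"
    # 2. Another PDB entry named in the method field suggests MR.
    if PDB_ID.search(m):
        return "MR"
    # 3. Otherwise fall back on the software that was used.
    programs = set(re.split(r"[,;\s]+", (software or "").upper()))
    if programs & MR_SOFTWARE:
        return "MR"
    if programs & EXPERIMENTAL_SOFTWARE:
        return "MAD/SAD"
    # 4. Give up: keep whatever was written, or NULL if nothing useful.
    return m if m not in UNINFORMATIVE else "NULL"


if __name__ == "__main__":
    header = (
        "REMARK 200 METHOD USED TO DETERMINE THE STRUCTURE: NULL\n"
        "REMARK 200 SOFTWARE USED: PHASER\n"
    )
    method = remark200_field(header, "METHOD USED TO DETERMINE THE STRUCTURE")
    software = remark200_field(header, "SOFTWARE USED")
    print(guess_method(method, software))
```

The software field is split into whole program names rather than searched as a substring, so that RESOLVE is not mistaken for SOLVE.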
Currently, the script produces this list:
1 COCRYSTAL
1 FIBER-DIFFRACTION
1 MOLPREP
1 N?
1 P
1 SEE CITATION
1 UNCONVENTIANAL
1 UNNECESSARY
2 UNCONVENTIONAL
7 RIP
54 N/A
92 SIR
260 AI
355 MIRAS
434 SIRAS
689 OTHER
1058 MIR
5012 MAD
5335 SAD
8021 NULL
58293 MR
Clearly, there are a few new ones I need to "clean up", but some of them
are funny. What is the method of "COCRYSTAL" anyway? Co-crystallized
with a ligand? Or a heavy atom? There are plenty of entries that I
can't figure out automatically (NULL + OTHER), but I imagine most of
them are MR.
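To turn the list above into the percentages Raji asked about, a few lines of Python will do. The counts are copied verbatim from the list (misspelled labels included). The `percent` helper is just a name made up for this sketch.

```python
# Counts copied from the method list above, labels left exactly as found.
counts = {
    "MR": 58293, "NULL": 8021, "SAD": 5335, "MAD": 5012, "MIR": 1058,
    "OTHER": 689, "SIRAS": 434, "MIRAS": 355, "AI": 260, "SIR": 92,
    "N/A": 54, "RIP": 7, "UNCONVENTIONAL": 2,
    "COCRYSTAL": 1, "FIBER-DIFFRACTION": 1, "MOLPREP": 1, "N?": 1, "P": 1,
    "SEE CITATION": 1, "UNCONVENTIANAL": 1, "UNNECESSARY": 1,
}

total = sum(counts.values())


def percent(*labels):
    """Percentage of all tallied entries that carry any of the given labels."""
    return 100.0 * sum(counts[label] for label in labels) / total


if __name__ == "__main__":
    print(f"total entries tallied: {total}")
    print(f"MR:           {percent('MR'):.1f}%")
    print(f"NULL + OTHER: {percent('NULL', 'OTHER'):.1f}%")
```

With these numbers MR comes out at about 73% of the tallied entries, and the NULL + OTHER pile that could not be classified is about 11%.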
But yes, Tim and Thierry are right, there is definitely confusion and
disagreement about what exactly constitutes "molecular replacement". I
have been somewhat draconian here and lumped "MR" and "direct
refinement" together because frankly it is just too hard to tell them
apart from the PDB alone. In fact, there are 260 PDB entries that claim
they were solved by "AB INITIO" methods, but did they really use direct
methods with no prior phase information at all? Or did they do MAD/SAD
and think that meant "ab initio"? There are probably examples of each. The
resolution of the structure might be helpful in this case, but sometimes
even reading the paper doesn't help.
I do agree though that the use of "molecular replacement" should be true
to the term as Rossmann coined it, where the 6D search of a model
against the new dataset was actually required to "solve" the structure.
Lots of people run molrep (or molprep!?) for each and every dataset,
even when they are doing ligand soaks. I have never really understood why.
However, I'm sure the day is not far off when phenix.refine or the like
will check if the starting R factor is too high and just "automatically"
invoke a run of MR to see if something clicks. Maybe even trying
alternative space groups, just in case you screwed that up too. There
may also soon be "automatic" runs of BALBES based on the sequence
information of the input file. Eventually, the difference between
different "methods" gets muddied. Then your "average" depositor will be
even more unsure about what to put into REMARK 200, and people like me
will be endlessly asking for command-line options that turn these
"features" off.
-James Holton
MAD Scientist
On 4/15/2013 6:48 AM, Raji Edayathumangalam wrote:
Hi Folks,
Does anyone know of an accurate way to mine the PDB for what percentage
of the total X-ray structures deposited to date were solved using
molecular replacement? I got hold of a pie chart for the same from my
Google search for 2006 but I'd like to get hold of the most current
statistics, if possible. The PDB has all kinds of statistics but not
one with numbers or percentages of X-ray structures deposited, sorted by
various phasing types or X-ray structure determination methods.
For example, an "Advanced Search" on the PDB site pulls up the following:
Total current structures by X-ray: 78960
48666 by MR
5139 by MAD
5672 by SAD
1172 by MIR
94 by MIR (when the word is completely spelled out)
75 by SIR
5 by SIR (when the word is completely spelled out)
That leaves about 18,000 X-ray structures either solved by other
phasing methods (seems unlikely) or somehow unaccounted for in the way
I am searching. Maybe the way I am doing the searches is no good. Does
someone have a better way to do this?
Thanks much.
Raji
--
Raji Edayathumangalam
Instructor in Neurology, Harvard Medical School
Research Associate, Brigham and Women's Hospital
Visiting Research Scholar, Brandeis University