Re: [ccp4bb] writing scripts-off topic

Mark Brooks Wed, 25 Jan 2012 07:22:38 -0800

Hi Yuri,
          If you don't like Python, like myself (and I'm not alone, it
would seem), you could try Ruby (http://www.ruby-lang.org/en/). Some
examples of PDB file manipulation are below (taken from [1]).

The language is a great improvement in Perl and Python in my opinion, but
the downside is that there aren't as many crystallographic features as
CCTBX et al, I don't think.

Yours,

Mark

[1] http://bioruby.open-bio.org/wiki/SampleCodes
 Structure I/O and Manipulation How do I read a structure file?

The current implementation only reads PDB format files, there will
hopefully be code for parsing mmCIF and PDB XML files in the future. Adding
these new formats will probably change the API for reading and writing
structure files.

Given the above caveat, the current way to read a PDB file is to slurp the
file into a string which is fed to the constructor of the Bio::PDB class:

require 'bio/db/pdb'string = IO.read('pdb1tim.ent')
structure = Bio::PDB.new(string)

The new Bio::PDB object now contains all the annotations and coordinate data.

How do I write a structure file?

As with reading, the current implementation can only write in PDB format.

All the coordinate classes in the Bio::PDB heirarchy have .to_s methods
which return the object in PDB format. This makes output as simple as:

require 'bio/db/pdb'
file = File.new('pdb1tim.ent').gets(nil)
structure = Bio::PDB.new(file)puts structure.to_s # Writes whole
structureputs structure[nil]['A'].to_s # Writes chain A only

How do I access annotations?

The annotations from the PDB file are stored in Bio::PDB::Record objects.
You can retrieve a specific annotation by calling the '.record' method of a
Bio::PDB object with the name of the annotation as the argument:

structure.record("HEADER")

In fact '.record' returns an array of all Bio::PDB::Records of the given
type, so you'll need to call '.first' to get to the actual
Bio::PDB::Record. Bio::PDB::Record provides methods to get to the
individual data in each record. E.g. The 'HEADER' record provides
classification, depDate and idCode methods. You can look at the PDB format
to see what fields are available in each record, or just look in the pdb.rb
file which has a hash of all the definitions.

So, to get the title of a PDB file you would do it like this:

structure.record('TITLE').first.title

Some records have their own special methods:

structure.remark(num)  #Returns hash of the REMARK records based on 'num'
structure.jrnl         #Returns a hash of the JRNL records
structure.jrnl('TITL') #Returns an array of all the TITL subfields from
                       #the JRNL records
structure.helix(id)    #Returns the HELIX records based on 'id'
                       #Returns an array if no 'id' is given
structure.turn         #Same interface as '.helix'
structure.sheet        #Same interface as '.sheet'
structure.seqres(id)   #Returns the sequence given in the SEQRES record
                       #as a Bio::Sequence::AA object
structure.keywords     #returns a flattened lsit of keywords

And some methods are included for Bio::DB compatability:

structure.entry_id     #Returns idCode
structure.accession    #Same as entry_id
structure.definition   #Title
structure.version      #REVDAT

How do I access the coordinate data?

Coordinate data is stroed in a heirarchy of classes with Bio::PDB at the
top. A Bio::PDB object contains one or more Bio::PDB::Model objects which
contain one or more Bio::PDB::Chain objects which contain one or more
Bio::PDB::Residue objects which contain Bio::PDB::Atom objects.

There are two ways to access the coordinate data in a Bio::PDB object:
keyed access and iterators.

Keyed access provides the simplest way to get to data if you know which
residues or atoms you're interested in. For instance if you wanted to get
hold of chain 'A' you can do it like this:

chain = structure[nil]['A']

The nil refers to the model, which will be nil in crystal structures but
may have a number in NMR structures. Every level in the heirarchy can be
accessed this way. E.g. to get hold of the CA atom of residue 100 in chain
'A':

structure[nil]['A']['100']['CA']

or if you still have your chain object:

chain['100']['CA']

The residue identifier is not just the residue number. PDB residues can
have insertion codes and even negative numbers.

chain['100A']['CA']

could be valid.

Iterators are also provided so you can easily go through all objects in the
heirarchy. E.g:

structure.each{ |model|
  model.each{ |chain|
    chain.each{ |residue|
      residue.each{ |atom|
        puts atom
      }
    }
  }}

Goes through every atom in the structure and outputs it. All the classes in
the heirarchy implement the standard Enumerable and Comparable mixins as
well.

Each class has a number of accessors to allow access to the data, most is
in the Bio::PDB::Atom class:

   - Bio::PDB::Model has a 'model_serial' accessor only.
   - Bio::PDB::Chain has an 'id' accessor for getting and setting the chain
   id.
   - Bio::PDB::Residue has accessors for resName, resSeq and iCode.
   - Bio::PDB::Atom has accessors for serial, element, alt_loc, x, y, z,
   occ, and bfac

To find the x coordinate of the CA atom of residue 100 is then:

structure[nil]['A']['100']['CA'].x

How do I make a deep copy of a structure object?

You can't! There seem to be problems implementing a deep copy function in
BioRuby. This is to be fixed in future versions.

Note to developers: have we tried the time-old hack of
Marshal.load(Marshal.dump(obj)) to do the trick?
How do I measure distances and angles?

Methods for measuring distance and dihedral angles are provided in the
Utils module. To measure distance, use the Bio::PDB::Utils.distance method:

res1 = structure[nil]['A']['100']
res2 = structure[nil]['A']['101']

distance = Bio::PDB::Utils.distance(res1['CA'].xyz,
                                    res2['CA'].xyz)

Bio::PDB::Utils.distance requires two Vectors as its arguments.
Bio::PDB::Coordinate objects, which are produced by the .xyz call to the
Bio::PDB::Atom object, are Vectors. You can also provide a Vector directly:

distance = Bio::PDB::Utils.distance(res1['CA'].xyz,
                                    [10,10,10])

And use the .centreOfGravity and .geometricCentre methods, both of which
provide a Bio::PDB::Coordinate:

distance = Bio::PDB::Utils.distance(res1.centreOfGravity,
                                    res1.geometricCentre)

All objects in the Bio::PDB heirarchy implement the centreOfGravity and
geometricCentre methods.

Dihedral angles are calculated from four Vectors / Bio::PDB::Coordinates:

phi = Bio::PDB::Utils.dihedral_angle(res1['C'].xyz,
                                     res2['N'].xyz,
                                     res2['CA'].xyz,
                                     res2['C'].xyz)
psi = Bio::PDB::Utils.dihedral_angle(res2['N'].xyz,
                                     res2['CA'].xyz,
                                     res2['C'].xyz,
                                     res3['N'].xyz)

How do I find specific atoms, residues, chains or models?

If the iterators and keyed access aren't powerful enough for you, the
finder method is also provided. All the objects in the Bio::PDB hierarchy
implement finder. It requires the class of object you wish to find as the
first argument, and then a block which is evaluated for each object. If the
block returns true then that object is added to an array of 'found' things.

For example, to find all the atoms within a box:

atoms_in_box = structure.finder(Bio::PDB::Atom){ |atom|
  atom.x.between?(10,20) &&
  atom.y.between?(20,30) &&
  atom.z.between?(30,40)}

Or, to find all the HIS residues, in chain 'A' can be written like this:

his_residues = structure[nil]['A'].finder(Bio::PDB::Residue){ |res|
  res.resName == 'HIS'}

or like this:

his_residues = structure.finder(Bio::PDB::Residue){ |res|
  res.resName == 'HIS' && res.chain.id == 'A'}

Since the results are returned in arrays, it is very simple to do set
operations, such as finding the overlap or difference between two
selections:

sel1 = structure.finder(Bio::PDB::Atom){ |atom|
  atom.residue.resName == 'HIS'}
sel2 =  structure.finder(Bio::PDB::Atom){ |atom|
  atom.x.between?(10,20) &&
  atom.y.between?(20,30) &&
  atom.z.between?(30,40)}
sel3 = sel1 & sel2
sel1 = sel1 - sel3

Selection 3 now contains all the HIS atoms within the box and selection 1
contains all the HIS atoms outside the box.
How do I work with ligand data?

Because some PDB files reuse residue numbers that they already used in the
ATOM records in the HETATM records, we have to add an extra flag when we
want to access HETATM data. The extra flag is simply adding 'LIGAND' to the
residue id. E.g Given the PDB file

ATOM      1  N   GLY A   2      -5.308  22.647  33.657  1.00 59.93
ATOM      2  CA  GLY A   2      -5.090  23.965  33.005  1.00 53.36
ATOM      3  C   GLY A   2      -6.209  24.197  32.021  1.00 52.56
ATOM      4  O   GLY A   2      -7.000  25.134  32.176  1.00 55.02
....
HETATM 2097  O8  PHT A   2     -18.122  40.603  18.146  1.00 16.14
HETATM 2098  O9  PHT A   2     -17.348  39.940  20.109  1.00 16.06
HETATM 2099  C10 PHT A   2     -15.622  41.753  16.662  1.00 13.34
HETATM 2100  O11 PHT A   2     -16.077  42.457  15.721  1.00 16.27

Asking for

structure[nil]['A']['2']

is ambiguous, so if you want the ligand data you must use

structure[nil]['A']['LIGAND2']

There should also be an .each_ligand method in the Bio::PDB::Chain class
for iterating through the ligands, but this is not implemented in the
current cvs version.
How do I work with solvent?

Solvent is defined in the current parser as any HETATM record with HOH as
the residue type. This may not be ideal for some files, but deals with 95%
of cases which is good enough for me at the moment!

Solvent residues are stored in a separate Bio::PDB::Chain attached to each
Bio::PDB::Model. Currently the only methods available are 'addSolvent'
which adds a residue to the solvent chain and 'removeSolvent', which sets
the whole solvent chain to nil.

An each_solvent method for Bio::PDB::Model has been implemented in my copy
but is not in cvs at the moment.
How do I get the sequence of a chain?

There are two types of sequence in a PDB file: the real sequence of
whatever protein has been crystalised, and the sequence observable in the
structure.

The first type of sequence is available through the SEQRES record (if this
is provided in the PDB file). You can access this through the top-level
Bio::PDB object:

structure.seqres['A']

provides the sequence given in the SEQRES record for chain A. The observed
sequence is obtained from the Bio::PDB::Chain object. E.g.

structure[nil]['A'].atom_seq

Both methods return a Bio::Sequence::AA object.

On 24 January 2012 04:46, Yuri Pompeu <yuri.pom...@ufl.edu> wrote:

> Hello Everyone,
> I want to play around with some coding/programming. Just simple
> calculations from an input PDB file, B factors averages, occupancies,
> molecular weight, so forth...
> What should I use python,C++, visual basic?
> thanks
>

Re: [ccp4bb] writing scripts-off topic

Reply via email to