Hi Rudy,
> On Feb 27, 2022, at 20:55, Rudy Richardson <[email protected]> wrote:
>
> I have a library of ~1000 compounds as SMILES strings with an appended name
> code and a property. For example:
>
> c1ccc(c2ccccc2)cc1 0001 -2.52
>
> Where "0001" is the name code and "-2.52" is a physicochemical property of
> the molecule.
>
> I would like to convert these strings to a concatenated SDF file,
If you're comfortable working with Python, here's an example using the pybel
interface.
First, here's how to get the name code and property:
>>> from openbabel import pybel
>>> mol = pybel.readstring("smi", "c1ccc(c2ccccc2)cc1\t0001\t-2.52")
>>> mol
<openbabel.pybel.Molecule object at 0x1101dece0>
>>> mol.title
'0001\t-2.52'
In that case I used tabs (represented as "\t"), because I believe that's what's
in your file. That would explain the extra space between the fields.
I'll use Python's str.split() method, with no arguments, to split on any run
of whitespace (which includes both spaces and tabs)
>>> mol.title.split()
['0001', '-2.52']
and assign them to the variables "name_code" and "value".
>>> name_code, value = mol.title.split()
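If any title in your file does not have exactly two fields, the two-variable assignment above fails with a ValueError that doesn't say which record was bad. Here is a minimal sketch of a more defensive version. The helper name "split_title" is my own invention for illustration, not part of pybel, and the check that the value is a number is an assumption about your data.

```python
def split_title(title):
    """Split a title like '0001\\t-2.52' into (name_code, value).

    Hypothetical helper, not part of pybel. The value is returned as a
    string, since that is what gets stored in the SDF data item.
    """
    fields = title.split()  # splits on any run of spaces and/or tabs
    if len(fields) != 2:
        raise ValueError(f"expected 'name_code value', got {title!r}")
    name_code, value = fields
    # Assumption: the property is numeric. float() raises ValueError if not.
    float(value)
    return name_code, value


print(split_title("0001\t-2.52"))  # ('0001', '-2.52')
```

A title with a missing or extra field then gives an error message that shows the offending text.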
The pybel API has a "write()" method on molecules which returns the molecule
as a string in the given format. Here's what it looks like in "sdf".
>>> print(mol.write("sdf"))
0001 -2.52
OpenBabel02272221522D
12 13 0 0 0 0 0 0 0 0999 V2000
...
I need to change the title and add a "logX" data item, which I can do with:
>>> mol.title = name_code
>>> mol.data["logX"] = value
giving most of what you wanted.
>>> print(mol.write("sdf"))
0001
OpenBabel02272221552D
12 13 0 0 0 0 0 0 0 0999 V2000
0.0000 0.0000 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
0.0000 0.0000 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
0.0000 0.0000 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
0.0000 0.0000 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
0.0000 0.0000 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
0.0000 0.0000 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
0.0000 0.0000 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
0.0000 0.0000 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
0.0000 0.0000 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
0.0000 0.0000 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
0.0000 0.0000 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
0.0000 0.0000 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
1 12 2 0 0 0 0
1 2 1 0 0 0 0
2 3 2 0 0 0 0
3 4 1 0 0 0 0
4 5 1 0 0 0 0
4 11 2 0 0 0 0
5 10 2 0 0 0 0
5 6 1 0 0 0 0
6 7 2 0 0 0 0
7 8 1 0 0 0 0
8 9 2 0 0 0 0
9 10 1 0 0 0 0
11 12 1 0 0 0 0
M END
> <logX>
-2.52
$$$$
You also had a "No." field. I don't know if that is the index of the input
record, or the integer value of the name_code, like:
>>> int(name_code)
1
I'll assume it's the input index.
I'll use Python's built-in "enumerate()" function, which adds an index to
each element of an iterator. For example, I can iterate through the
characters of "ABCD" like this:
>>> for c in "ABCD":
...     print(c)
...
A
B
C
D
For each X in the input iterator, enumerate() yields the pair (index, X):
>>> for i, c in enumerate("ABCD"):
...     print(i, c)
...
0 A
1 B
2 C
3 D
I can also specify the initial index, for example, to start at 1:
>>> for i, c in enumerate("ABCD", 1):
...     print(i, c)
...
1 A
2 B
3 C
4 D
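To make the two possible meanings of "No." concrete, here is a small sketch that shows the input index next to the integer value of the name code. The function name "number_records" and the name codes "0005" and "0003" are made up for illustration; only "0001" comes from your example.

```python
def number_records(name_codes, start=1):
    """Return (input_index, name_code, int(name_code)) for each name code.

    Hypothetical helper for illustration only.
    """
    return [(index, code, int(code))
            for index, code in enumerate(name_codes, start)]


# Made-up name codes, deliberately out of order
for index, code, as_int in number_records(["0001", "0005", "0003"]):
    print(index, code, as_int)
# 1 0001 1
# 2 0005 5
# 3 0003 3
```

The two numbers only agree when the name codes are consecutive and the file is in name-code order.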
The last bit to know is that pybel's "readfile" gives a way to iterate over all
molecules in a file.
>>> from openbabel import pybel
>>> for i, mol in enumerate(pybel.readfile("smi", "wikipedia2.smi"), 1):
...     print("Entry#:", i, repr(mol.title))
...     if i == 10:
...         break
...
Entry#: 1 'Ammonia'
Entry#: 2 'Aspirin'
Entry#: 3 'Acetylene'
Entry#: 4 'Adenosine triphosphate'
Entry#: 5 'Ampicillin'
Entry#: 6 'Ascorbic acid'
Entry#: 7 'Ascorbic acid'
Entry#: 8 'Amphetamine'
Entry#: 9 'Aspartame'
Entry#: 10 'Amoxicillin'
Finally, all the coordinates were 0.0. To make things a bit nicer, use the
"make2D()" or "make3D()" methods to add 2D or 3D coordinates, respectively.
Your example uses 3D, so I'll do that.
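If you would rather choose between 2D and 3D, here is a minimal sketch of how that choice might look. The function name "add_coordinates" and its "dim" parameter are hypothetical; the only pybel calls it relies on are the "make2D()" and "make3D()" methods mentioned above, called with their default arguments.

```python
def add_coordinates(mol, dim=3):
    """Add 2D or 3D coordinates, in place, to a pybel Molecule.

    Hypothetical helper: 'mol' is assumed to be a pybel.Molecule, or any
    object that provides make2D() and make3D() methods.
    """
    if dim == 2:
        mol.make2D()
    elif dim == 3:
        mol.make3D()
    else:
        raise ValueError(f"dim must be 2 or 3, not {dim!r}")
    return mol
```

With Open Babel installed, this could be used as add_coordinates(pybel.readstring("smi", "c1ccc(c2ccccc2)cc1"), dim=2).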
Putting it all together, along with some use of Python's "argparse" module to
handle command-line processing (which I won't discuss here) gives the
"rjrich.py" program, attached.
It's used like this:
% python rjrich.py test.smi
You can also change the output tag, and the output file name, like this:
% python rjrich.py test.smi --tag Cacao2 -o cacao2.sdf
(I believe if you're using Open Babel under Windows, with Python installed,
then you should use "py" instead of "python" to run the program.)
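In case it helps to see how those command-line options reach the program, here is a small sketch. It builds the same parser as the attached program and passes it the arguments from the second example as a list, instead of reading them from the real command line. It needs only the standard library, not Open Babel.

```python
import argparse

# Same options as the attached program
parser = argparse.ArgumentParser(
    description="convert SMILES with name and data value to SDF")
parser.add_argument("--tag", default="logX",
                    help="SDF data tag to store the value")
parser.add_argument("--output", "-o",
                    help="output SDF filename (default: stdout)")
parser.add_argument("filename")

# Equivalent to: python rjrich.py test.smi --tag Cacao2 -o cacao2.sdf
args = parser.parse_args(["test.smi", "--tag", "Cacao2", "-o", "cacao2.sdf"])
print(args.filename, args.tag, args.output)  # test.smi Cacao2 cacao2.sdf

# Equivalent to: python rjrich.py test.smi
defaults = parser.parse_args(["test.smi"])
print(defaults.tag, defaults.output)  # logX None
```

When "-o" is not given, args.output is None, which is how the program knows to write to stdout.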
Cheers,
Andrew
[email protected]
import sys
import argparse
from openbabel import pybel

# Use the "argparse" module to handle command-line argument processing
parser = argparse.ArgumentParser(
    description = "convert SMILES with name and data value to SDF"
    )
parser.add_argument("--tag",
                    default = "logX",
                    help = "SDF data tag to store the value")
parser.add_argument("--output", "-o",
                    help = "output SDF filename (default: stdout)")
parser.add_argument("filename")

def main():
    args = parser.parse_args()

    # Open the SMILES file for reading
    mol_reader = pybel.readfile("smi", args.filename)

    # Figure out where to write the output
    if args.output is None:
        output_file = sys.stdout
    else:
        output_file = open(args.output, "w")

    # Process each molecule, numbering the records from 1
    for mol_no, mol in enumerate(mol_reader, 1):
        # The title looks like "0001 -2.52" with the name_code
        # followed by whitespace followed by the value
        title = mol.title
        name_code, value = title.split()

        # Update the title to have just the name_code
        mol.title = name_code

        # Add new data items
        mol.data["No."] = mol_no
        mol.data[args.tag] = value

        # Generate 3D coordinates
        mol.make3D()

        # Write the result
        output_file.write(mol.write("sdf"))

    # Close the output file, unless it is stdout
    if output_file is not sys.stdout:
        output_file.close()

# The standard Python way to recognize this is being run as a
# command-line program.
if __name__ == "__main__":
    main()