On Fri, Jun 7, 2013 at 10:25 AM, Patrick Fuller <patrickful...@gmail.com> wrote:
> A SMILES contains exactly the same information as the atom/bond lists in a
> much more compact form. If you want to avoid the aromaticity problem, just
> use Kekule form, which makes it virtually identical to any other connection
> table format, but in about 10x to 20x fewer bytes. SMILES are very easy to
> parse, and there are dozens of parsers around.
>
> What I truly like about smiles is that it's human readable + hashable,
> which I see as the real goal. The shorter length is just a corollary of
> that. Prove me wrong, but I think people make too big a deal about size of
> molecule formats. I just bought a 2 TB hard disk drive for $70. With
> MongoDB + their JSON serialization, I estimated that I can put 200 million
> verbose JSON MOF structures on that drive. I only have a few thousand, so I
> have some room to spare.
>
I have a database of 10 million compounds. The SDF version, even
compressed, is difficult to transfer over the internet. It's not about
disks, it's about file transfers and database performance. It's not a
matter of a few
bytes here or there (I agree that people worry about file size too much).
It's about a factor of ten or twenty. Connection-table lists of atoms and
bonds are just a dumb way to represent atoms and bonds.
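
To put a rough number on that ratio, here is a small Python sketch that
encodes benzene both as a Kekule SMILES and as an explicit atom/bond list
in JSON. The JSON key names are made up for this illustration, there are no
coordinates or hydrogens in either form, and the ratio will vary with the
schema and the molecule, so treat it as an illustration rather than a
benchmark.

```python
import json

# Benzene in Kekule form (alternating double/single bonds, no aromatic flags).
smiles = "C1=CC=CC=C1"

# The same heavy-atom graph as an explicit atom/bond list. The key names
# ("atoms", "bonds", "element", "order") are assumptions for this sketch,
# not part of any published format.
atoms = [{"element": "C"} for _ in range(6)]
bonds = [
    {"atoms": [i, (i + 1) % 6], "order": 2 if i % 2 == 0 else 1}
    for i in range(6)
]
connection_table = json.dumps({"atoms": atoms, "bonds": bonds})

smiles_bytes = len(smiles.encode("utf-8"))
table_bytes = len(connection_table.encode("utf-8"))

# Print both sizes and the size ratio of the JSON table to the SMILES.
print(smiles_bytes, table_bytes, round(table_bytes / smiles_bytes, 1))
```

Both strings describe the same six atoms and six bonds; only the encoding
differs.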

> This discussion has focussed on the syntax of JSON, but completely
> overlooks the real problem with ALL chemical file formats: how do you
> handle all of the cases where a simple connection-table ("ball and stick")
> doesn't capture reality? Things like aromaticity, tautomers,
> organo-metallic bonds, boron-hydrogen cages, distributed bonds (ferrocenes
> and the like) ... these are the problems.
>
> The point of json (and xml) is that they are *extensible*- that's why
> json has exploded in the developer community.
>
This isn't necessarily a good thing. One of the biggest problems in
cheminformatics and molecular modeling is that people have altered existing
formats to suit their own needs ... and that has led to disaster. There is
no such thing as the "PDB format" -- rather, you mostly have to know the
origin of a particular PDB file in order to interpret it. Each project
effectively has its own "PDB format."

JSON may be extensible, but that is useless unless there is a widely
recognized authority on the meaning of each extension, along with
open-source software that illustrates a practical application of the
standard.

Never forget the old joke, "The great thing about standards is that there
are so many to choose from!" JSON essentially gives you a stronger rope
when you are in the process of hanging yourself.

> If you need handles for aromaticity and metallic bonding, just add new
> properties to the json/xml. Because of the extensibility, adding new
> properties will not break any existing code.
>
Then why have a standard at all? What is the use of new properties if
nobody knows what they mean? What happens when five projects all introduce
their own syntax and semantics for representing aromaticity and metallic
bonding? Chaos.
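
As a concrete illustration, here is a minimal Python sketch of two
hypothetical projects that each "just add a property" to mark benzene as
aromatic. Both conventions and all key names are invented for this example.
A reader written against one convention does not crash on the other file;
it silently gets the wrong answer.

```python
def ring_bonds(order):
    """Six ring bonds for a six-membered ring, all with the given order."""
    return [{"atoms": [i, (i + 1) % 6], "order": order} for i in range(6)]

# Hypothetical project A: an extra boolean on each atom, bonds left single.
project_a = {
    "atoms": [{"element": "C", "aromatic": True} for _ in range(6)],
    "bonds": ring_bonds(1),
}

# Hypothetical project B: atoms untouched, aromaticity as bond order 1.5.
project_b = {
    "atoms": [{"element": "C"} for _ in range(6)],
    "bonds": ring_bonds(1.5),
}

def count_aromatic_atoms(mol):
    """A reader that only knows project A's convention."""
    return sum(1 for atom in mol["atoms"] if atom.get("aromatic", False))

print(count_aromatic_atoms(project_a))  # 6
print(count_aromatic_atoms(project_b))  # 0
```

Nothing broke, in the sense that no exception was raised, but the second
file's aromaticity was lost without any warning.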

> That's the advantage over all of the older table formats, which weren't
> built to be extensible. And you see the repercussions in scientific code
> all the time.
>
The real problem had nothing to do with being "built to be extensible," but
rather that the table format definitions were controlled by commercial
companies that had no interest in data exchange or in participation by the
chemistry community.

When I created the OpenSMILES.org web page, I more-or-less did it by
stealing the leadership from Daylight, the company that invented SMILES. I
invited their participation but, while they didn't object to our project,
they also elected to stay out of it. SMILES now has a future that's in the
hands of the community. If the community decides to add features, we can
... and we'll all be able to agree on those features.

It might seem as if I'm trying to discourage JSON, but nothing could be
farther from the truth. A modern, object-oriented, extensible and well
documented format is long overdue. The CML project is one such (you might
want to look at it for ideas), but it never got traction. Maybe JSON, with
its widespread use and readily-available software, is just the thing.

If you really want to make JSON a standard, the JSON syntax itself is a
trivial part of the problem. The real problem is agreeing on how each
datatype is to be interpreted, and then publishing clear standards for
each datatype. If you let people just add their own
datatypes on an as-you-please basis, you'll just have another Tower of
Babel ... and that's where the name OpenBabel came from in the first place.
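
One way this could look in practice, as a rough Python sketch: a single
published registry gives each property exactly one meaning, and tools flag
anything the registry does not define. The registry contents, the property
names, and the "hapticity" key are all hypothetical, chosen only to
illustrate the idea.

```python
# Hypothetical registry: each allowed atom property has one published meaning.
ATOM_PROPERTIES = {
    "element": "element symbol, e.g. 'C' or 'Fe'",
    "charge": "formal charge as an integer",
    "aromatic": "True if the atom belongs to an aromatic system",
}

def unknown_atom_properties(mol):
    """Return the atom property names in mol that the registry does not define."""
    used = set()
    for atom in mol["atoms"]:
        used.update(atom)  # iterating a dict yields its keys
    return sorted(used - set(ATOM_PROPERTIES))

mol = {
    "atoms": [
        {"element": "C", "aromatic": True},
        {"element": "Fe", "hapticity": 5},  # made-up, unregistered extension
    ]
}
print(unknown_atom_properties(mol))  # ['hapticity']
```

The code is the easy part. The registry only has value if the community
agrees on who maintains it and what each entry means.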

Craig
_______________________________________________
OpenBabel-discuss mailing list
OpenBabel-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/openbabel-discuss