Here's a patch to add PDB (Protein Data Bank) detection. The format has
a ten-character magic sequence at the start ("HEADER    "), and the
patch checks that the rest of the line at least looks like a PDB header
before concluding that it is one. There are regex tests, but they don't
run unless the initial string match succeeds, so I don't think the
performance hit is particularly severe.
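
To illustrate the idea outside of the magic syntax, here is a rough
Python sketch of the same two-stage check: a cheap prefix comparison
first, then the stricter pattern tests only if the prefix matched. It
follows the column layout quoted from the spec in the patch comments
rather than reproducing the magic rules line for line, and the function
name looks_like_pdb_header is made up for this example.

```python
import re

# "HEADER" followed by four spaces: columns 1-10 of the record.
MAGIC = "HEADER    "

# Date in DD-MMM-YY form (columns 51-59) and a 4-character ID code
# of digits and uppercase letters (columns 63-66).
DATE_RE = re.compile(r"[0-9]{2}-[A-Z]{3}-[0-9]{2}")
ID_RE = re.compile(r"[A-Z0-9]{4}")


def looks_like_pdb_header(line):
    """Return the ID code if line looks like a PDB HEADER record, else None.

    Hypothetical helper: a sketch of the approach, not the patch itself.
    """
    # Stage 1: cheap string comparison. Nothing else runs if this fails.
    if not line.startswith(MAGIC):
        return None

    # Stage 2: stricter checks on the fixed columns.
    line = line.rstrip("\r\n")
    if len(line) < 66:
        return None
    if not DATE_RE.fullmatch(line[50:59]):
        return None
    if line[59:62] != "   ":
        return None
    id_code = line[62:66]
    if not ID_RE.fullmatch(id_code):
        return None
    return id_code
```

Like the patch, this sketch rejects a header that lacks the date
string, so files of that kind would be false negatives.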

I don't have a good source of PDB files, so if this magic fails
(either false positive *or* false negative), please attach one or more
samples to this bug report and I'll try to adapt the patch.
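
Until real samples turn up, a synthetic HEADER record can at least
exercise the rule. The Python sketch below builds one 80-column line
from the layout in the spec. The helper name make_header_record and the
sample values "HYDROLASE" and "1ABC" are placeholders invented for this
example, not taken from a real PDB entry; the date format is the one
the spec describes.

```python
def make_header_record(classification, date, id_code):
    """Build a synthetic 80-column PDB HEADER line (hypothetical helper).

    Columns (1-based): 1-6 "HEADER", 7-10 spaces, 11-50 classification,
    51-59 date (DD-MMM-YY), 60-62 spaces, 63-66 ID code, 67-80 spaces.
    """
    if len(classification) > 40:
        raise ValueError("classification must fit in 40 columns")
    if len(date) != 9:
        raise ValueError("date must be 9 characters, e.g. 01-JAN-70")
    if len(id_code) != 4:
        raise ValueError("ID code must be 4 characters")

    line = (
        "HEADER"
        + " " * 4                     # columns 7-10
        + classification.ljust(40)    # columns 11-50
        + date                        # columns 51-59
        + " " * 3                     # columns 60-62
        + id_code                     # columns 63-66
    )
    return line.ljust(80)             # pad columns 67-80 with spaces


# Placeholder values for illustration only.
record = make_header_record("HYDROLASE", "01-JAN-70", "1ABC")
```

Writing record plus a newline to a file gives a minimal input to run
file(1) against; it only tests the HEADER line, not a full PDB file.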

Adam Buchbinder
--- file/magic/Magdir/scientific	2009-02-16 10:59:52.000000000 -0500
+++ file/magic/Magdir/scientific	2009-02-18 16:34:11.000000000 -0500
@@ -69,3 +69,32 @@
 0	string	\060\000\040\000\110\000\105\000\101\000\104\000		GEDCOM data
 0	string	\376\377\000\060\000\040\000\110\000\105\000\101\000\104	GEDCOM data
 0	string	\377\376\060\000\040\000\110\000\105\000\101\000\104\000	GEDCOM data
+
+# PDB: Protein Data Bank files
+#
+# Adam Buchbinder <adam.buchbin...@gmail.com>
+#
+# http://www.wwpdb.org/documentation/format32/sect2.html
+# http://www.ch.ic.ac.uk/chemime/
+#
+# The PDB file format is fixed-field, 80 columns. From the spec:
+#
+# COLS        DATA
+#  1 -  6      "HEADER"
+#  11 - 50     String(40)
+#  51 - 59     Date
+#  63 - 66     IDcode
+#
+# Thus, positions 7-10, 60-62 and 67-80 are spaces. The Date must be in the
+# format DD-MMM-YY, e.g., 01-JAN-70, and the IDcode consists of numbers and
+# uppercase letters. However, examples have been seen without the date string,
+# e.g., the example on the chemime site.
+
+0	string	HEADER\ \ \ \ 
+>&0	regex/1	\^.{40}
+>>&0	regex/1	[0-9]{2}-[A-Z]{3}-[0-9]{2}\ {3}
+>>>&0	regex/1s	[A-Z0-9]{4}.{14}$
+>>>>&0	regex/1	[A-Z0-9]{4}	Protein Data Bank data, ID Code %s
+!:mime	chemical/x-pdb
+>>>>0	regex/1	[0-9]{2}-[A-Z]{3}-[0-9]{2}	\b, %s
+
