[EMAIL PROTECTED] wrote:
> I have tried their online Text-symbol Pattern Discovery
> with these input values:
>
> cpkg-30000
> cpkg-31008
> cpkg-3000A
> cpkg-30006
> nsug-300AB
> nsug-300A2
> cpdg-30001
> nsug-300A3

Well, in the realm of sequence analysis, it is trivial to devise a regex for these values because they are already aligned and of fixed length. This is a couple of lines longer than it needs to be, only so that it's easier to follow the logic. It uses throw-away (non-capturing) groups to avoid bracketed sets, because you have dashes in your items. You might need to tweak the following code if you have characters special to regex in your sequences or if you want to use bracketed sets. The place to do this is in the _joiner() function.

    values = """
    cpkg-30000
    cpkg-31008
    cpkg-3000A
    cpkg-30006
    nsug-300AB
    nsug-300A2
    cpdg-30001
    nsug-300A3
    """.split()

    # python 2.5 has conditional expressions that could shorten this
    # test, but the resulting list comp can begin to look ugly if the
    # alternatives are complicated
    def _joiner(c):
      if len(c) == 1:
        # only one character in this column: emit it as a literal
        return c.pop()
      else:
        # several characters: emit a non-capturing alternation
        return "(?:%s)" % '|'.join(c)

    # one set of observed characters per column position
    columns = [set(c) for c in zip(*values)]
    col_strs = [_joiner(c) for c in columns]
    rgx_str = "".join(col_strs)
    # anchored version, for matching whole strings only
    exact_rgx_str = "^%s$" % rgx_str

    # prints something like the following (the order of the
    # alternatives within each group depends on set ordering):
    # '(?:c|n)(?:p|s)(?:k|u|d)g-3(?:1|0)0(?:A|0)(?:A|B|1|0|3|2|6|8)'
    print(rgx_str)

But, if you get much more complicated than this, you will definitely want to check out hmmer, especially if you can align your sequences.

James

--
James Stroud
UCLA-DOE Institute for Genomics and Proteomics
Box 951570
Los Angeles, CA 90095

http://www.jamesstroud.com/
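
For what it's worth, here is a rough sketch of the _joiner() tweak mentioned above, for the case where the sequences may contain regex-special characters and bracketed sets are wanted. It is only one way to do it, not a definitive version. The helper names _escape_in_class() and build_pattern() are made up for this illustration, and the equal-length check is an added assumption, since zip() would otherwise silently truncate to the shortest value.

```python
import re

values = """
cpkg-30000 cpkg-31008 cpkg-3000A cpkg-30006
nsug-300AB nsug-300A2 cpdg-30001 nsug-300A3
""".split()

def _escape_in_class(ch):
    # Hypothetical helper: inside [...] only backslash, ']', '^'
    # and '-' need a backslash to be taken literally.
    if ch in "\\]^-":
        return "\\" + ch
    return ch

def _joiner(chars):
    # chars is the set of characters observed in one column.
    ordered = sorted(chars)  # sorted so the output is reproducible
    if len(ordered) == 1:
        # Single character: escape it in case it is special to regex.
        return re.escape(ordered[0])
    # Several characters: emit a bracketed set.
    return "[%s]" % "".join(_escape_in_class(ch) for ch in ordered)

def build_pattern(values):
    # Assumption: the approach only makes sense for aligned,
    # fixed-length values, so refuse anything else.
    lengths = set(len(v) for v in values)
    if len(lengths) != 1:
        raise ValueError("need at least one value, all of the same length")
    columns = [set(col) for col in zip(*values)]
    return "^%s$" % "".join(_joiner(col) for col in columns)

pattern = build_pattern(values)
print(pattern)
```

Note that, like the original, this treats every column independently, so the pattern also accepts combinations that were never in the input (for example npdg-310AB). If that is too permissive, a per-column regex is probably the wrong tool.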