Steven D'Aprano writes:

> I get something like this:
>
> r"(^[A-Z]+$)|(^([A-Z]+[ ]*\&[ ]*[A-Z]+)+$)"
>
> but it fails on strings like "AA & A & A". What am I doing wrong?

It cannot split the string as (LETTERS & LETTERS)(LETTERS & LETTERS)
when the middle part is just one LETTER. The repeated group wants
letters on both sides of every ampersand, so a middle part has to be
divided between two repetitions, and a single letter cannot be divided.

That's something of a misanalysis anyway. I notice that the correct
pattern has already been posted at least thrice and you have
acknowledged one of them.

But I think you are also trying to do too much with a single regex. A
more promising start is to think of the whole string as "parts" joined
with "glue", then split with a glue pattern and test the parts:

    import re

    glue = re.compile(" *& *| +")

    # data: any iterable of the strings to classify
    keep, drop = [], []
    for datum in data:
        items = glue.split(datum)
        if all(map(str.isupper, items)):
            keep.append(datum)
        else:
            drop.append(datum)

That will cope with Greek, by the way.

It's annoying that the order of the branches of the glue pattern above
matters. With " +" first, the space before an ampersand would be taken
as glue on its own, and the split would produce an empty part. One
_does_ have problems when one uses the usual regex engines.

Capturing groups in the glue pattern would produce glue items in the
split output. Either avoid them or deal with them: one could split with
the underspecific "([ &]+)" and then check that each glue item contains
at most one ampersand. One could also allow other punctuation, and then
check afterwards.

One can use _another_ regex to test individual parts. Code above used
str.isupper to test a part. The improved regex package (from PyPI, to
cope with Greek) can do the same:

    import regex

    part = regex.compile("[[:upper:]]+")
    # The second branch is " +", not " *": a branch that can match the
    # empty string could make split cut between letters as well.
    glue = regex.compile(" *& *| +")

    keep, drop = [], []
    for datum in data:
        items = glue.split(datum)
        if all(map(part.fullmatch, items)):
            keep.append(datum)
        else:
            drop.append(datum)

Just "[A-Z]+" suffices for ASCII letters, and "[A-ZÄÖ]+" copes with
most of Finnish; the [:upper:] class is nicer and there's much more
that is nicer in the newer regex package.

The point of using a regex for this is that the part pattern can then
be generalized to allow some punctuation or digits in a part, for
example. Anything that the glue pattern doesn't consume. (Nothing wrong
with using other techniques for this, either; str.isupper worked nicely
above.)

It's also possible to swap the roles of the patterns. Split with a part
pattern. Then check that the text between such parts is glue. The split
leaves an empty string at each end when the datum begins and ends with
a part, so only the items in between are tested as glue:

    keep, drop = [], []
    for datum in data:
        items = part.split(datum)
        ends_ok = len(items) > 1 and items[0] == items[-1] == ""
        if ends_ok and all(map(glue.fullmatch, items[1:-1])):
            keep.append(datum)
        else:
            drop.append(datum)

The point is to keep the patterns simple by making them more local, or
more relaxed, followed by a further test. This way they can be made to
do more, but not more than they reasonably can.

Note also the use of fullmatch instead of match (let alone search) when
a full match is required! This gets rid of all anchors in the pattern,
which may in turn allow fewer parentheses inside the pattern.

The usual regex engines are not perfect, but parts of them are
fantastic.
-- 
https://mail.python.org/mailman/listinfo/python-list
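
The capturing-group variant mentioned above (split with the underspecific "([ &]+)", then check each glue item) might look something like the following. This is only a sketch: the names GLUE and is_wanted and the sample strings are made up for illustration, and str.isupper stands in for whatever part test one prefers.

```python
import re

# Hypothetical names and sample data, for illustration only.
GLUE = re.compile("([ &]+)")  # underspecific: any run of spaces and ampersands


def is_wanted(datum):
    # Because of the capturing group, split returns parts and glue
    # items alternately: [part, glue, part, glue, ..., part].
    items = GLUE.split(datum)
    parts, glues = items[0::2], items[1::2]
    # Every part must pass the part test, and every glue item may
    # contain at most one ampersand.
    return (all(map(str.isupper, parts))
            and all(g.count("&") <= 1 for g in glues))


data = ["AA", "AA & A & A", "AA BB", "AA && B", "aa & B", "AA &", ""]
keep = [d for d in data if is_wanted(d)]
drop = [d for d in data if not is_wanted(d)]
```

A trailing ampersand or an empty string leaves an empty part in the split output, and str.isupper rejects the empty string, so those land in drop without any extra test.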