On Feb 16, 8:25 am, Jonathan Lukens <[EMAIL PROTECTED]> wrote:
> > What would you like to see instead?
>
> I had mostly just expected that there was some method that would
> return each entire match as an item on a list. I have this pattern:
>
> >>> import re
> >>> corporate_names = re.compile(u'(?u)\\b([á-ñ]{2,}\\s+)([<<"][Á-Ñá-ñ]+)(\\s*-?[Á-Ñá-ñ]+)*([>>"])')
> >>> terms = corporate_names.findall(sourcetext)
>
> Which matches a specific way that Russian company names are
> formatted. I was expecting a method that would return this:
>
> >>> terms
> [u'string one', u'string two', u'string three']

What is the point of having parenthesised groups in the regex if you
are interested only in the whole match?

Other comments:

(1) raw string for improved legibility
ur'(?u)\b([á-ñ]{2,}\s+)([<<"][Á-Ñá-ñ]+)(\s*-?[Á-Ñá-ñ]+)*([>>"])'

(2) consider not including space at the end of a group
ur'(?u)\b([á-ñ]{2,})\s+([<<"][Á-Ñá-ñ]+)\s*(-?[Á-Ñá-ñ]+)*([>>"])'

(3) what appears between [] is a set of characters, so [<<"] is the
same as [<"] and probably isn't doing what you expect; have you tested
this regex for correctness?

> ...mostly because I was working it this way in Java and haven't
> learned to do things the Python way yet. At the suggestion from
> someone on the list, I just used list() on all the tuples like so:
>
> >>> detupled_terms = [list(term_tuple) for term_tuple in terms]
> >>> delisted_terms = [''.join(term_list) for term_list in detupled_terms]
>
> which achieves the desired result, but I am not a programmer and so I
> would still be interested to know if there is a more elegant way of
> doing this.

I can't imagine how "not a programmer" implies "interested to know if
there is a more elegant way". In any case, explore the correctness axis
first.

Cheers,
John
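
A minimal sketch of the points above. It uses a simplified ASCII stand-in
for the pattern, not the poster's actual regex; the sample text and the
names `grouped` and `ungrouped` are made up for illustration. It shows
what findall() returns with and without capturing groups, why joining the
group tuples can silently lose text when a group is repeated, and that a
duplicated character inside [] changes nothing.

```python
import re

# Hypothetical sample text; the pattern mirrors the shape of the one in the
# post but uses plain ASCII letters and a double quote as the delimiter.
text = 'ZAO "Alpha-Beta-Gamma" and OOO "Delta" signed.'

# With capturing groups, findall() returns one tuple of groups per match.
grouped = re.compile(r'\b([A-Z]{2,}\s+)("[A-Za-z]+)(\s*-?[A-Za-z]+)*(")')
tuples = grouped.findall(text)

# Joining the groups is not the same as the whole match: a repeated group
# such as (...)* keeps only its LAST repetition, so '-Beta' is dropped.
joined = [''.join(t) for t in tuples]

# Option 1: keep the groups, but ask each match object for group(0),
# which is always the entire matched text.
whole_a = [m.group(0) for m in grouped.finditer(text)]

# Option 2: make the groups non-capturing with (?:...); findall() then
# returns each whole match as a plain string.
ungrouped = re.compile(r'\b(?:[A-Z]{2,}\s+)(?:"[A-Za-z]+)(?:\s*-?[A-Za-z]+)*(?:")')
whole_b = ungrouped.findall(text)

# A character class is a set, so repeating a character inside it is a no-op.
same_class = re.findall(r'[<<"]', 'a<b"c') == re.findall(r'[<"]', 'a<b"c')

print(tuples)
print(joined)
print(whole_a)
print(whole_b)
print(same_class)
```

If only the whole match is wanted, either option avoids the tuple-joining
step entirely; finditer() with group(0) has the advantage that the groups
remain available should they be needed later.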