Dennis Lee Bieber wrote:

> On Fri, 23 Jun 2017 09:49:06 +0300, Jussi Piitulainen
> <jussi.piitulai...@helsinki.fi> declaimed the following:
>
>>I just like those character translation methods, and I didn't like it
>>when you first took the time to call a simple regex "line noise" and
>>then proceeded to post something that looked much more noisy yourself.
>>
>
> Tediously long (and likely slow running), but I'd think each .replace()
> would have been self-explanatory.
>
>>I'm not sure I like the splitting of look-alike (I'm not sure that I
>>like not splitting it either) but note that the regex does that for
>>free.
>>
>>The \b in the original regex matches the empty string at a position
>>where there is a "word character" on only one side. It recognizes a
>>boundary at the beginning of a line and at whitespace, but also at all
>>the punctuation marks.
>>
>>You guess right about the length limits. I wouldn't use them, and then
>>there's no need for the boundary markers any more: my \w+ matches
>>maximal sequences of word characters (even in foreign languages like
>>Finnish or French, and even in upper case, also digits).
>>
>>To also match "people's" and "didn't", use \w+'\w+, and to match with
>>and without the ' make the trailing part optional \w+('\w+)? except the
>>notation really does start to become noisy because one must prevent the
>>parentheses from "capturing" the group:
>>
>>import re
>>wordy = re.compile(r''' \w+ (?: ' \w+ )? ''', re.VERBOSE)
>>text = '''
>>Oliver N'Goma, dit Noli, né le 23 mars 1959 à Mayumba et mort le 7 juin
>>2010, est un chanteur et guitariste gabonais d'Afro-zouk.
>>'''
>>
>>print(wordy.findall(text))
>>
>># ['Oliver', "N'Goma", 'dit', 'Noli', 'né', 'le', '23', 'mars', '1959',
>># 'à', 'Mayumba', 'et', 'mort', 'le', '7', 'juin', '2010', 'est', 'un',
>># 'chanteur', 'et', 'guitariste', 'gabonais', "d'Afro", 'zouk']
>>
>>Not too bad?
>
> Above content saved (in a write-only file? I don't recall the times
> I've searched my post archives) for potential future use. I should plug it
> into my demo and see how much speed improvement I get.

Most of the potential speedup can be gained from using collections.Counter()
instead of the database. If necessary, write the counter's contents into the
database in a second step.
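
Roughly what I have in mind, as a minimal sketch only. I don't know what
your demo's database looks like, so sqlite3, the table name word_counts and
its two columns are assumptions made up for illustration, as are the helper
names count_words() and save_counts(). The regex is the one from Jussi's
post.

```python
import re
import sqlite3
from collections import Counter

# Pattern from the quoted post: a run of word characters, optionally
# followed by an apostrophe and another run of word characters.
wordy = re.compile(r''' \w+ (?: ' \w+ )? ''', re.VERBOSE)


def count_words(lines):
    """Step 1: count every word in memory, without touching the database."""
    counts = Counter()
    for line in lines:
        counts.update(wordy.findall(line))
    return counts


def save_counts(counts, connection):
    """Step 2: write the finished counts to the database in one batch.

    The table name and schema are hypothetical; adapt them to the real one.
    """
    connection.execute(
        "CREATE TABLE IF NOT EXISTS word_counts "
        "(word TEXT PRIMARY KEY, n INTEGER NOT NULL)"
    )
    connection.executemany(
        "INSERT OR REPLACE INTO word_counts (word, n) VALUES (?, ?)",
        counts.items(),
    )
    connection.commit()


text = """
Oliver N'Goma, dit Noli, né le 23 mars 1959 à Mayumba et mort le 7 juin
2010, est un chanteur et guitariste gabonais d'Afro-zouk.
"""

counts = count_words(text.splitlines())

# An in-memory database keeps the sketch self-contained.
db = sqlite3.connect(":memory:")
save_counts(counts, db)

print(counts.most_common(2))
```

The point of the split is that the loop over the text only updates a dict
in memory; the database sees a single executemany() call at the end instead
of one query per word.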