On Friday, February 24, 2017 at 11:48:22 AM UTC-8, MRAB wrote:
> On 2017-02-24 18:54, kar6...@gmail.com wrote:
> > I have a task to search for multiple patterns in an incoming string
> > and replace them with matched patterns. I'm storing all the patterns
> > as keys in a dict and the replacements as values, and I'm using regex
> > to compile all the patterns and the sub method on the pattern object
> > for replacement. But the problem is I have tens of millions of rows
> > that I need to check against about 1000 patterns, and this turns out
> > to be a very expensive operation.
> >
> > What can be done to optimize it? Also I have special characters for
> > matching; where can I specify raw string combinations?
> >
> > For example, if the search string is not a variable we can say
> >
> > re.search(r"\$%^search_text", "replace_text", "some_text")
> >
> > but when I read from the dict, where should I place the "r" prefix?
> > Unfortunately, putting it inside the key, "r key" like this, doesn't
> > work....
> >
> > Pseudo code:
> >
> > for string in genobj_of_million_strings:
> >     pattern = re.compile('|'.join(regex_map.keys()))
> >     return pattern.sub(lambda x: regex_map[x], string)
> >
> Here's an example:
>
> import re
>
> # A dict of the replacements.
> mapping = {'one': 'unu', 'two': 'du', 'three': 'tri', 'four': 'kvar',
>            'five': 'kvin'}
>
> # The text that we're searching.
> text = 'one two three four five six seven eight nine ten'
>
> # It's best to put the strings we're looking for into reverse order in
> # case one of the keys is a prefix of another.
> ordered_keys = sorted(mapping.keys(), reverse=True)
> ordered_values = [mapping[key] for key in ordered_keys]
>
> # Build the pattern, putting each key in its own group.
> # I'm assuming that the keys are all pure literals, that they don't
> # contain anything that's treated specially by regex. You could escape
> # the keys (using re.escape(...)) if that's not the case.
> pattern = re.compile('|'.join('(%s)' % key for key in ordered_keys))
>
> # When we find a match, the match object's .lastindex attribute will
> # tell us which group (i.e. key) matched. We can then look up the
> # replacement. We also need to take into account that the groups are
> # numbered from 1, whereas list items are numbered from 0.
> new_text = pattern.sub(lambda m: ordered_values[m.lastindex - 1], text)
>
> It might be faster (timing it would be a good idea) if you could put
> all of the rows into a single string (or a number of rows into a
> single string), process that string, and then split up the result. If
> none of the rows contains '\n', then you could join them together with
> that; otherwise just pick some other character.
Thanks. What is the idea behind storing the keys and values in lists? I
would have assumed that looking up a value in a map is faster than
getting the value from a list. Also, I like the idea of combining
multiple rows into one string and passing that in; I'll batch up the
million rows into strings and give it a shot.
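
If I'm reading MRAB's code right, the lists are there because
.lastindex hands you an integer group number, not the matched text, so
indexing a list with lastindex - 1 is the most direct lookup. Here's a
minimal sketch of an alternative (same mapping and text as in the
quoted example) that skips the parallel lists and keys the dict by the
matched text itself via m.group(0):

import re

# Same mapping and text as in the quoted example.
mapping = {'one': 'unu', 'two': 'du', 'three': 'tri', 'four': 'kvar',
           'five': 'kvin'}
text = 'one two three four five six seven eight nine ten'

# Longest keys first, so a key can't shadow a longer key it prefixes.
ordered_keys = sorted(mapping, key=len, reverse=True)

# re.escape() makes this safe even if a key contains regex specials.
pattern = re.compile('|'.join(re.escape(key) for key in ordered_keys))

# m.group(0) is the exact substring that matched; since every
# alternative in the pattern is a literal dict key, it's always a
# valid key, so no parallel lists are needed.
new_text = pattern.sub(lambda m: mapping[m.group(0)], text)
print(new_text)  # -> unu du tri kvar kvin six seven eight nine ten

Both an integer list index and a dict lookup are cheap in CPython; the
difference is unlikely to matter next to the regex scan itself, so
timing both variants is the only honest answer.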
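And a minimal sketch of the batching idea, assuming no row (and no
replacement value) contains the '\n' joiner; regex_map, rows, and BATCH
here are hypothetical stand-ins for the real data:

import re

# Hypothetical stand-ins: in the real code, regex_map is the
# pattern -> replacement dict and rows holds the incoming rows.
regex_map = {'one': 'unu', 'two': 'du', 'three': 'tri'}
rows = ['one two', 'three four', 'two three'] * 3

# Compile once, outside the loop, not once per row.
ordered_keys = sorted(regex_map, key=len, reverse=True)
pattern = re.compile('|'.join(re.escape(key) for key in ordered_keys))

BATCH = 100_000  # rows per call to sub(); worth timing other sizes

out_rows = []
for start in range(0, len(rows), BATCH):
    # Join a batch into one string, run a single sub() over it, and
    # split the result back into rows. This only round-trips correctly
    # if neither the rows nor the replacement values contain '\n'.
    joined = '\n'.join(rows[start:start + BATCH])
    replaced = pattern.sub(lambda m: regex_map[m.group(0)], joined)
    out_rows.extend(replaced.split('\n'))

print(out_rows[:3])  # -> ['unu du', 'tri four', 'du tri']

Picking the batch size is itself a timing question: too small and the
per-call overhead of sub() dominates; too large and the joined string
may not sit comfortably in memory.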