On Friday, February 24, 2017 at 11:48:22 AM UTC-8, MRAB wrote:
> On 2017-02-24 18:54, kar6...@gmail.com wrote:
> > I have a task to search for multiple patterns in incoming strings and 
> > replace each match with its replacement. I'm storing all the patterns 
> > as keys in a dict and the replacements as values, and I'm using re to 
> > compile all the patterns and calling the sub method on the pattern 
> > object to do the replacement. But the problem is that I have tens of 
> > millions of rows to check against about 1000 patterns, and this turns 
> > out to be a very expensive operation.
> >
> > What can be done to optimize it? Also I have special characters to 
> > match; where can I specify raw string combinations?
> >
> > For example, if the search string is not a variable we can say
> >
> > re.sub(r"\$%^search_text", "replace_text", "some_text"), but when I read 
> > the pattern from the dict, where should I place the "r" prefix? 
> > Unfortunately putting it inside the key, like "r key", doesn't work.
> >
> > Pseudocode:
> >
> > pattern = re.compile('|'.join(regex_map.keys()))  # compile once, outside the loop
> > for string in genobj_of_million_strings:
> >     yield pattern.sub(lambda m: regex_map[m.group(0)], string)
> >
> Here's an example:
> 
> import re
> 
> # A dict of the replacements.
> mapping = {'one': 'unu', 'two': 'du', 'three': 'tri',
>            'four': 'kvar', 'five': 'kvin'}
> 
> # The text that we're searching.
> text = 'one two three four five six seven eight nine ten'
> 
> # It's best to sort the keys we're looking for into reverse order so
> # that if one key is a prefix of another (e.g. 'one' and 'ones'), the
> # longer key comes first in the alternation and is tried first.
> ordered_keys = sorted(mapping.keys(), reverse=True)
> ordered_values = [mapping[key] for key in ordered_keys]
> 
> # Build the pattern, putting each key in its own group.
> # I'm assuming that the keys are all pure literals, that they don't
> # contain anything that's treated specially by regex. You could escape
> # the key (using re.escape(...)) if that's not the case.
> pattern = re.compile('|'.join('(%s)' % key for key in ordered_keys))
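> 
> # Note: the "r" prefix only affects how a string literal is written in
> # source code; patterns read from a dict at runtime never need it. If
> # the keys can contain regex metacharacters (e.g. '$', '%', '^'), a
> # variant that escapes them would be:
> # pattern = re.compile('|'.join('(%s)' % re.escape(key)
> #                               for key in ordered_keys))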
> 
> # When we find a match, the match object's .lastindex attribute will
> # tell us which group (i.e. key) matched. We can then look up the
> # replacement. We also need to take into account that the groups are
> # numbered from 1, whereas list items are numbered from 0.
> new_text = pattern.sub(lambda m: ordered_values[m.lastindex - 1], text)
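> 
> # For the example above, this produces:
> # print(new_text)  ->  'unu du tri kvar kvin six seven eight nine ten'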
> 
> 
> It might be faster (timing it would be a good idea) if you could put all 
> of the rows into a single string (or a number of rows into a single 
> string), process that string, and then split up the result. If none of 
> the rows contain '\n', then you could join them together with that, 
> otherwise just pick some other character.

Thanks. What is the idea behind storing the keys and values in lists? I 
would have assumed that looking up a value in a map is faster than 
getting the value from a list.

Also I like the idea of combining multiple rows into one string and 
processing that. I'll batch the million rows up into strings and give it 
a shot.
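
Something like this rough sketch is what I have in mind (the function 
name and batch size are placeholders I'm making up, and it assumes that 
neither the rows nor the replacement strings contain '\n'):

def sub_in_batches(rows, pattern, replace, batch_size=10000):
    # Join up to batch_size rows with '\n', run a single sub() over the
    # combined string, then split the result back into individual rows.
    batch = []
    for row in rows:
        batch.append(row)
        if len(batch) == batch_size:
            yield from pattern.sub(replace, '\n'.join(batch)).split('\n')
            batch = []
    if batch:
        yield from pattern.sub(replace, '\n'.join(batch)).split('\n')

# usage, with the pattern and ordered_values from your example:
# new_rows = sub_in_batches(genobj_of_million_strings, pattern,
#                           lambda m: ordered_values[m.lastindex - 1])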