Re: Regarding Regex timeout behavior to minimize CPU consumption

Dan Stromberg Sun, 06 Dec 2020 20:30:53 -0800

On Sun, Dec 6, 2020 at 2:37 PM Barry <ba...@barrys-emacs.org> wrote:


> > On 5 Dec 2020, at 23:44, Peter J. Holzer <hjp-pyt...@hjp.at> wrote:
> >
> > On 2020-12-05 23:42:11 +0100, sjeik_ap...@hotmail.com wrote:
> >>   Timeout: no idea. But check out re.compile and re.finditer as they
> >>   might speed things up.
> >
> > I doubt that compiling regular expressions helps the OP much. Compiled
> > regular expressions are cached, but more importantly, if a match takes
> > long enough that specifying a timeout is useful, the time is almost
> > certainly not spent compiling, but matching - most likely backtracking
> > from lots of promising but ultimately unsuccessful partial matches.
> >
> >>     regex = r'data-stid="section-room-list"[\s\S]*?>\s*([\s\S]*?)\s*' \
> >>             r'(?:class\s*=\s*"\s*sticky-book-now\s*"|</ul>\s*</section>|id\s*=\s*"Location")'
> >>     rooms_blocks_to_be_replace = re.findall(regex, html_template)
> >
> > This part:
> >
> >    \s*([\s\S]*?)\s*'
> >
> > looks dangerous from a performance point of view. If that can be
> > rewritten with less potential for backtracking, it might help.
> >
> > Generally, it should be possible to implement a timeout for any
> > operation by either scheduling an alarm with signal.alarm or by
> > executing the operation in a separate thread and killing the thread if
> > it takes too long.
>
> I think that Python ignores signals until the ceval loop is entered,
> and since the re.match will block, that is not going to happen.
>
> Killing threads is not safe, and if your OS allows it then you end up with
> the internal state of Python messed up.
>
> I think implementing this requires the re code itself to support a timeout.
>
> Better is for the OP to fix the re to not backtrack so much, or to work on
> the input string in chunks.
>
If the regex is expensive enough to warrant it, you could use a subprocess
- they are killable.
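
A minimal sketch of the subprocess idea, under some assumptions: the helper
name findall_with_timeout and the JSON-over-stdin hand-off are made up for
illustration and are not from the OP's code. It starts a second interpreter,
lets it run re.findall, and relies on subprocess.run killing the child when
the timeout expires.

```python
import json
import subprocess
import sys

# Child program: reads {"pattern": ..., "text": ...} as JSON on stdin and
# writes the re.findall() result as JSON on stdout.
_CHILD_SOURCE = """
import json, re, sys
job = json.load(sys.stdin)
json.dump(re.findall(job["pattern"], job["text"]), sys.stdout)
"""


def findall_with_timeout(pattern, text, timeout_seconds):
    """Run re.findall(pattern, text) in a child interpreter.

    Hypothetical helper: returns the list of matches, or None if the child
    did not finish within timeout_seconds. On timeout, subprocess.run kills
    the child before raising TimeoutExpired.
    """
    try:
        completed = subprocess.run(
            [sys.executable, "-c", _CHILD_SOURCE],
            input=json.dumps({"pattern": pattern, "text": text}),
            capture_output=True,
            text=True,
            timeout=timeout_seconds,
            check=True,  # raise CalledProcessError if the child fails
        )
    except subprocess.TimeoutExpired:
        return None
    return json.loads(completed.stdout)


if __name__ == "__main__":
    print(findall_with_timeout(r"\d+", "a1b22c333", 10))
    # A pattern with heavy backtracking; this call is expected to time out.
    print(findall_with_timeout(r"(a+)+$", "a" * 40 + "b", 1))
```

Each call pays for starting an interpreter, so this only makes sense when the
match itself is the expensive part. Note that a pattern with several groups
comes back as a list of lists rather than a list of tuples, because the
result travels through JSON.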
-- 
https://mail.python.org/mailman/listinfo/python-list
