On Thu, Jul 26, 2012 at 10:45 PM, Joe <ad...@gamebee.de> wrote:
> Hey, I have a url regex like this which is keeping django extremely busy
> (20secs to 1min to handle a request). On some urls it even crashes.
>
> my regex:
>
> url(r'^(?P<item_url>(\w+-?)*)/$', 'detail'),
>
>
> view:
>
> def detail(request, item_url):
>     i = get_object_or_404(Page, url=item_url, published=True)
>     return render_to_response('item/detail.html', {'item': i},
>         context_instance=RequestContext(request))
>
> replaced with:
>
> url(r'^(?P<item_url>[\w-]+)/$', 'detail'),
>
>
> The replacement works like a charm. What is wrong with the first regex?

Hi Joe,

There's nothing strictly *wrong* with the first regex -- it just describes a very complex lookup strategy, and as a result, it takes extra time to compute.

In the second regex, you're asking for "a string of 1 or more characters that are either word-like or '-'". That's a very easy thing to check - if you think about how you would manually implement code that checks that policy, it could be done with a simple if inside a while loop; as soon as you find a character that doesn't match, you can bail out.

However, the first regex is asking for "0 or more groups of word-like characters, each of which might be followed by a '-'". Consider a trivial case, matching against the string abcde. It can match the first regex in an incredible number of ways:

(a)(b)(c)(d)(e)
(ab)(c)(d)(e)
(abc)(d)(e)
(abcd)(e)
(abcde)
(a)(bc)(d)(e)
(a)(bcd)(e)
(a)(bcde)
(a)(b)(cde)

… and so on. Because you're asking the regex to preserve groups, the algorithm needs to essentially work out every single one of these groupings, and then determine which set will be reported as the actual match. The number of groupings roughly doubles with every extra character, and when the URL ultimately *doesn't* match, a backtracking engine like Python's has to try all of them before it can give up. As you can guess, this can take some time, which you're observing as a 1 minute delay in serving a URL.

This is one of the gotchas that comes from using regular expressions. They're a very powerful language for expressing constraints, but you need to be careful that you don't accidentally fall into a trap where you're asking for something very complex.

And don't worry - you're in good company being bitten by this problem. There was a Django security release caused *specifically* by a regular expression like yours. Django uses regular expressions to validate URLs and email form inputs, and at one point, the regex that was used to validate email addresses was constructed in such a way that it was possible to provide a very simple string that would cause the validator to take 30 seconds to confirm that it wasn't valid.
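
To make the cost of the first pattern concrete, here is a minimal sketch that compares the two regexes from this thread on an input that cannot match. The helper names (count_splits, time_failed_match) and the test input (a run of 'a' characters followed by '!') are my own choices for illustration, not anything Django does internally.

```python
import re
import time

NESTED = re.compile(r'^(?P<item_url>(\w+-?)*)/$')  # original pattern from the thread
FLAT = re.compile(r'^(?P<item_url>[\w-]+)/$')      # replacement pattern from the thread


def count_splits(s):
    """Count the ways to cut s into consecutive non-empty groups."""
    # Each of the len(s) - 1 gaps between characters is either a cut or not.
    return 2 ** (len(s) - 1) if s else 1


def time_failed_match(pattern, n):
    """Time one failed match against n word characters followed by '!'."""
    path = 'a' * n + '!'  # hypothetical input: no trailing '/', so it cannot match
    start = time.perf_counter()
    result = pattern.match(path)
    return result, time.perf_counter() - start


if __name__ == '__main__':
    print(count_splits('abcde'))  # 16
    for n in (10, 14, 18):
        _, nested = time_failed_match(NESTED, n)
        _, flat = time_failed_match(FLAT, n)
        print(f'n={n:2d}  nested={nested:.4f}s  flat={flat:.6f}s')
```

Both patterns accept the same ordinary slugs, so the difference only shows up on input that fails to match. I'd expect the nested time to roughly double with each extra character while the flat time stays negligible, but the exact numbers will depend on your machine and Python version.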

Write a tool that hits the same URL and validates the same string 100 times, and you've got yourself a denial-of-service attack.

So - when you're building your URL patterns, you should be trying to keep your regular expressions as simple as possible -- i.e., simple linear probes. If you really do need to match a complex pattern, you'd be better served using a simple regex in the URL pattern, and then doing more specific validation in the view (and raising 404 if the pattern doesn't match what you need it to).

Yours,
Russ Magee %-)
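
As a rough sketch of the "simple regex in the URL pattern, stricter check in the view" approach: the URL pattern stays as [\w-]+, and the view applies the extra rules with plain string operations. The rules below (no leading hyphen, no doubled hyphens) are my guess at what the original pattern was trying to enforce, and NotFound is a hypothetical stand-in for django.http.Http404 so that the example runs without Django installed.

```python
class NotFound(Exception):
    """Hypothetical stand-in for django.http.Http404."""


def is_valid_item_url(item_url):
    """Linear-time check, assuming the URL pattern already limited
    item_url to word characters and hyphens."""
    if not item_url or item_url.startswith('-'):
        return False
    # A doubled hyphen would mean an empty word group between the two hyphens.
    return '--' not in item_url


def detail(item_url):
    """Outline of the view: validate first, then do the real work."""
    if not is_valid_item_url(item_url):
        raise NotFound(item_url)  # in Django: raise Http404
    # The real view would look up the Page and render the template here.
    return item_url
```

Each check is a single pass over the string, so a hostile URL costs no more than a friendly one.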