Re: Remap/regex_remap consolidation

James Peach Tue, 25 Mar 2014 15:09:11 -0700

On Mar 25, 2014, at 1:50 PM, Thomas Jackson <jacksontj...@gmail.com> wrote:


> Here at LinkedIn we've been using regular remap.config for a while (with
> all our map options). One thing we've been looking into recently is path
> based regexes (which regex_remap supports). While looking into it we found
> a few shortcomings of the plugin-- and decided it would probably be better
> for everyone involved if we could come up with a way to consolidate this
> into regular remap. This raises a few questions about how to consolidate,
> so we figured we'd solicit some feedback from the community before we get
> started. Since there are quite a few changes we're considering I've tried
> to assemble examples for all the scenarios, but to start out I'll put some
> examples of how remaps look today.

Hi Thomas,

I think there is a real need for something like this. I think that it is pretty 
common to follow a regex_map with a secondary set of regex_remap rules. One 
observation that I have is that the current system conflates matching and 
rewriting. I think that is why you end up with a large set of alternative
regex-based syntaxes below. Have you considered a system of simple, composable 
match and rewrite operators?

A made-up example, to match https://*.example.com and rewrite it ...

Old version:
        regex_map https://.*.example.com/foo http://dest.example.com

New version:
        map @scheme @value=https @replace=http \
                @host @match=*.example.com @replace=dest.example.com \
                @path @match=/foo(.*) @replace=$0

The new version is verbose, but extensible and more flexible than the fixed 
syntax. I haven't really thought this through very much ...
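
A rough sketch of how such composable operators could behave, written in Python only to make the idea concrete. Everything here is hypothetical: the operator names (exact, glob, regex), the RULE table, and the choice to carry the captured remainder of the path over are assumptions for illustration, not an existing ATS interface. The port is ignored to keep it short.

```python
import re
from fnmatch import fnmatchcase
from urllib.parse import urlsplit, urlunsplit

# Each operator takes one URL component and returns its replacement,
# or None when the component does not match.

def exact(value, replace):
    """Match a component exactly and substitute a fixed value."""
    return lambda s: replace if s == value else None

def glob(pattern, replace):
    """Match a component against a shell-style glob."""
    return lambda s: replace if fnmatchcase(s, pattern) else None

def regex(pattern, template):
    """Match the whole component against a regex and expand a template."""
    compiled = re.compile(pattern)
    def op(s):
        m = compiled.fullmatch(s)
        return m.expand(template) if m else None
    return op

# Hypothetical equivalent of the "new version" rule above.
RULE = {
    "scheme": exact("https", "http"),
    "host": glob("*.example.com", "dest.example.com"),
    "path": regex(r"/foo(.*)", r"\1"),
}

def remap(url, rule=RULE):
    parts = urlsplit(url)
    fields = {
        "scheme": parts.scheme,
        "host": parts.hostname or "",
        "path": parts.path,
    }
    out = {}
    for name, op in rule.items():
        result = op(fields[name])
        if result is None:
            return None  # one component failed, so the rule does not apply
        out[name] = result
    # The query string is passed through untouched in this sketch.
    return urlunsplit((out["scheme"], out["host"], out["path"], parts.query, ""))
```

Because each component has its own operator, adding a new kind of match only means adding another small function, and no operator can accidentally match text belonging to a different component.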

> 
> *Standard remap.config example*
> 
> # ExampleA: match all domains coming in on a specific port with a path
> regex_map_with_recv_port http://.*:8080/bar    http://dest.com:12345/bar
> # ExampleB:  match all domains /foo regardless of port
> regex_map http://.*/foo    http://dest.com:12345/foo
> # ExampleC: match everything else on a specific domain
> map http://foo.com/    http://127.0.0.1:12345/catchall
> 
> In regular remap.config some features are missing-- such as regex matching
> based on the path. If I were to use regex_remap on a regular mapping rule
> like so:
> 
> regex_map_with_recv_port http://.*:8080 http://dest.com/ @plugin=/usr/libexec/regex_remap.so @pparam=mapfile.map
> 
> I could then have a file (mapfile.map) which would look something like:
> 
> ### map file contents
> # regular remap
> ^/foo(.*)                     http://dest.com:12345/foo/$1
> # strip the query string
> ^/foo(.*)(\?.*)?               http://dest.com:12345/foo/$1?$q
> # a redirect
> /oldpath(.*)                   http://newdomain.com:8080/newpath$1 @status=302
> 
> 
> This regex_remap markup gives us a few nice things. It gives you regex
> matching on the path, which we found to be extremely useful. Specifically
> we have a use case for /foo and /foo/ to go to an app, but not /foobar. In
> addition regex_remap map files give you a cleaner markup for redirects
> (@status=xxx), as well as per-remap-line config overrides. As we started
> looking into it we realized that regex_remap (the plugin) is a bit limited
> since you cannot use remap plugins within the map file. We then started
> looking into adding that, but figured it might be less work (and more
> helpful) to merge this into ATS proper. So when merging these features,
> there are a few config questions to be answered.
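
The /foo and /foo/ but not /foobar case can be checked with a small sketch. Python's re module is used here purely as an illustration, and the pattern and the function name goes_to_app are assumptions, not taken from an actual map file.

```python
import re

# Matches "/foo" itself or anything under "/foo/", but not "/foobar".
FOO_APP = re.compile(r"^/foo(/.*)?$")

# A plain prefix pattern, for comparison: it also accepts "/foobar".
FOO_PREFIX = re.compile(r"^/foo(.*)")

def goes_to_app(path):
    """Return True when the path should be routed to the /foo app."""
    return FOO_APP.match(path) is not None
```

The only difference between the two patterns is that the first requires a "/" (or the end of the path) right after "foo".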
> 
> 
> *Question #1 how many regexes?*
> Today there is one regex (in all the different regex_* types) which is only
> on the domain name. In regex_remap there is one regex, but it matches the
> path and the query string. So the question is how many different regexes
> should there be?
> 
> To get some background of what these look like I'll implement a rule where
> we match http/https on all domains, ports 8081 and 8082, and all paths.
> 
> The main ones we've thought of so far are:
> 
>    *1 regex*- so the entire string you are matching on is one big regex
>        ^https?://.*:808[12]/(.*)$
> 
>        pros: fairly simple, dense (4 lines of today's configs can be
> merged into 1)
>        cons: easy to mess up (lots to match at once). Fairly difficult to
> tell what it's doing
> 
>    *2 regex*- one for domain, and one for path
>        http://.*:8081/(.*)$
>        http://.*:8082/(.*)$
>        https://.*:8081/(.*)$
>        https://.*:8082/(.*)$
> 
>        pros: More closely matches what we do today
>        cons: more verbose (can't regex the scheme or port)
> 
>    *4 regex*- one for scheme, domain, port, and path
>        ^https?$://^.*$:^808[12]$/^(.*)$
> 
>        pros: separation of the various regexes, dense, impossible to
> capture more than the field you are in
>        cons: 4 regexes instead of 1 (might be more confusing?)
> 
> *Note*: any number of regexes >1 will require named capture groups (
> http://www.regular-expressions.info/named.html). Which means that you
> cannot use $1, $2, etc. In a lot of ways this is nicer (more explicit) but
> it is a change.
> 
> *What do I think?*
> I prefer #2 or #4 as they help separate the regex matching into smaller
> regexes (which should make finding non-matching regexes faster) as well as
> make the regexes more scoped-- and hopefully harder to mess up (especially
> by matching too much).
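
To make the "matching too much" concern concrete, here is a small sketch comparing the 1-regex and 4-regex options. It uses Python's re and urllib.parse as stand-ins for the remap engine. The function names are made up, and the path regex includes the leading "/" because that is how urlsplit reports the path.

```python
import re
from urllib.parse import urlsplit

# Option "1 regex": one pattern over the whole URL.
ONE_REGEX = re.compile(r"^https?://.*:808[12]/(.*)$")

# Option "4 regex": one pattern per field.
FOUR_REGEXES = {
    "scheme": re.compile(r"^https?$"),
    "host": re.compile(r"^.*$"),
    "port": re.compile(r"^808[12]$"),
    "path": re.compile(r"^/(.*)$"),
}

def match_one(url):
    return ONE_REGEX.match(url) is not None

def match_four(url):
    parts = urlsplit(url)
    fields = {
        "scheme": parts.scheme,
        "host": parts.hostname or "",
        "port": str(parts.port or ""),  # empty string when no port is given
        "path": parts.path,
    }
    # Every field must match its own regex.
    return all(rx.match(fields[name]) for name, rx in FOUR_REGEXES.items())
```

Both accept https://www.example.com:8082/x. For http://example.com/a:8081/b, which has no port at all, the single regex still matches because its `.*` runs past the host into the path, while the per-field version rejects it.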
> 
> 
> *Question #2 Explicit vs implicit regexes*
> If we decide to have more than one regex (from #1), do we want all of them
> to be implicitly handled as regex strings? Or do we want to rely on some
> anchoring syntax to flag to the remap engine that the string is a regex
> (such as requiring regexes to start with a '^' and end with a '$').
> 
>    Explicit:
>        pros: clearer that the field is a regex, easier to optimize the
> remap engine
>        cons: requires more markup, and would mean that if you just put in
> .* it would be a string match
>    Implicit:
>        pros: more closely matches the markup we have today, simpler configs
>        cons: wasteful if most fields are strings
> 
> *What do I think?*
> In general (and in this specific case) I like explicit over implicit since
> it's clearer what you are doing. Especially if we pick 4 regexes (from
> #1) this would allow you to effectively "disable" regex matching within
> specific fields if you don't need it.
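
A minimal sketch of the explicit option, assuming the rule is exactly as described: a field that starts with '^' and ends with '$' is compiled as a regex, and anything else is compared as a plain string. The helper name compile_field is hypothetical.

```python
import re

def compile_field(spec):
    """Return a predicate for one rule field.

    Assumption: '^...$' flags a regex; anything else is a literal string.
    """
    if len(spec) >= 2 and spec.startswith("^") and spec.endswith("$"):
        rx = re.compile(spec)
        return lambda value: rx.fullmatch(value) is not None
    # Literal fields never reach the regex engine at all.
    return lambda value: value == spec
```

This also shows the con listed above: compile_field(".*") only accepts the literal text ".*", so a wildcard has to be written as "^.*$".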
> 
> 
> *Question #3 How to handle query strings in the match?*
> Today the query string is not part of remap.config. In regex_remap it is
> optional based on a @pparam=no-query-string. The advantage of not having it
> in the path is you don't have to worry about matching it accidentally or
> reconstructing it. The downside is that you can never match on it. This
> could be controlled by some @ parameter, but that could make remap.config a
> bit confusing since it wouldn't be consistent.
> 
> *What do I think?*
> I would like to leave the query parameters out, or at least have some @
> parameter to disable them on a per-line basis (since I don't want to have
> rules matching on query params).
> 
> 
> *Question #4 How do you *drop* the query string?*
> If for #3 we do decide to put the query string in the match, how would we
> drop it on a specific path? In the regex_remap plugin I'd simply create a
> rule such as:
> 
> # remove query params in regex_remap
> ^/(.*)(\?.*)?  /$1
> 
> And by not adding $2 (or something similar) I'd be removing the query
> parameters. Conversely, if I want to keep the query parameters that means I
> effectively have to re-add them every remap line like so:
> 
> # keep the query params
> ^/(.*)(\?.*)?  /$1$2
> # or with the shorthand
> ^/(.*)(\?.*)?  /$1?$q
> 
> *What do I think?*
> No real opinion on this one, mostly because I don't really want the query
> params in the matching string to begin with :)
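
One thing worth checking in the rules above: in `^/(.*)(\?.*)?` the first group is greedy, so it consumes the query string and the optional second group is left empty. The sketch below uses Python's re as a stand-in for the PCRE engine that regex_remap uses, and assumes the string being matched is the path plus the query. The SCOPED pattern and the helper names are suggestions for illustration, not taken from the plugin.

```python
import re

# Pattern as written in the rules above: (.*) is greedy.
GREEDY = re.compile(r"^/(.*)(\?.*)?")

# Alternative that stops the first group at the first "?".
SCOPED = re.compile(r"^/([^?]*)(\?.*)?$")

def drop_query(pattern, path_and_query):
    """Rewrite to /$1, which is meant to discard the query string."""
    m = pattern.match(path_and_query)
    if m is None:
        return None
    return "/" + m.group(1)

def keep_query(pattern, path_and_query):
    """Rewrite to /$1$2, carrying the query string over explicitly."""
    m = pattern.match(path_and_query)
    if m is None:
        return None
    return "/" + m.group(1) + (m.group(2) or "")
```

With the greedy pattern, drop_query leaves "/bar?x=1" unchanged, because the query ends up inside group 1. With the scoped pattern the same call gives "/bar", and keep_query gives back "/bar?x=1".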
> 
> 
> *Question #5 How to remap paths that are regexed*
> Today we have remap lines that look something like:
> regex_map http://.*/foo http://dest.com:12345/
> 
> What this means is that any request with path starting with "/foo" will be
> remapped and the "/foo" will be replaced with "/". Once we allow regexes in
> the path we have more variables to take into account. If we take this same
> simple case with regex_remap it would look something like:
> /foo(.*)  http://dest.com:12345/$1
> 
> If we were to mimic this markup in remap.config it could look something
> like:
> regex_map http://.*/foo(.*) http://dest.com:12345/$1
> 
> This is a bit more explicit in what it is doing, but in this simple case it
> means quite a few more characters to get the same meaning across. What this
> gets you is the flexibility to do more complex regex matches if needed. If
> we do go with markup like this and we pick anything more than 1 regex from
> #1 we'd be forced to use named groups instead of $1, $2, etc. since the
> strings matched would come from different regexes.
> 
> 
> *What do I think?*
> If we used 4 regexes (from #1) and explicit (from #2) we could keep the
> markup pretty similar to what we have now and still have the ability to do
> some cool things.
> 
> So, for the base case of regex_map today (regex domain only) it would look
> like:
> regex_map http://^.*$/foo http://dest.com:12345/foo
> 
> This would only regex the domain name (since it starts with ^ and ends with
> $) and then the path would be treated like a regular map (same as it is
> today). If I wanted to do some regexing based on the path I could write
> something like:
> 
> regex_map http://^.*$^/foo(?P<num>\d+)$ http://dest.com:12345/$num/foo
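
A sketch of how named groups from separate per-field regexes could feed one target string. This is Python for illustration only. The `host` named group is a hypothetical addition to show captures from two regexes being merged; `num` and the target come from the example above, and string.Template stands in for whatever substitution syntax the remap engine would use.

```python
import re
from string import Template

# One regex per field, as in the "4 regex" proposal (scheme and port omitted).
HOST = re.compile(r"^(?P<host>.*)$")      # "host" is a hypothetical named group
PATH = re.compile(r"^/foo(?P<num>\d+)$")

# $num is filled in from the named group captured by the path regex.
TARGET = Template("http://dest.com:12345/$num/foo")

def remap(host, path):
    captures = {}
    for rx, value in ((HOST, host), (PATH, path)):
        m = rx.match(value)
        if m is None:
            return None  # one field failed to match, so the rule is skipped
        # Named groups merge cleanly; $1 and $2 from two regexes would collide.
        captures.update(m.groupdict())
    return TARGET.substitute(captures)
```

For example, remap("www.example.com", "/foo42") yields "http://dest.com:12345/42/foo", while a path such as "/foobar" does not match and returns None.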
> 
> 
> 
> 
> If you have any opinions/feedback about these specifics please let me
> know-- we're hoping to nail down the markup fairly quickly and get this
> taken care of in the next few weeks. If you have questions about what
> markup would look like (and don't want to spam the mailing list) feel free
> to mail me individually or PM me on IRC (jacksontj).
> 
> 
> Thomas Jackson
> Traffic SRE @ LinkedIn
