On Mar 25, 2014, at 1:50 PM, Thomas Jackson <jacksontj...@gmail.com> wrote:
> Here at LinkedIn we've been using regular remap.config for a while (with
> all our map options). One thing we've been looking into recently is
> path-based regexes (which regex_remap supports). While looking into it we
> found a few shortcomings of the plugin -- and decided it would probably be
> better for everyone involved if we could come up with a way to consolidate
> this into regular remap. This raises a few questions about how to
> consolidate, so we figured we'd solicit some feedback from the community
> before we get started. Since there are quite a few changes we're considering
> I've tried to assemble examples for all the scenarios, but to start out I'll
> put some examples of how remaps look today.

Hi Thomas,

I think there is a real need for something like this. I think that it is
pretty common to follow a regex_map with a secondary set of regex_remap
rules.

One observation that I have is that the current system conflates matching
and rewriting. I think that is why you end up with a large set of
alternative regex-based syntaxes below. Have you considered a system of
simple, composable match and rewrite operators?

A made-up example, to match https://*.example.com and rewrite it ...

Old version:
    regex_map https://.*.example.com/foo http://dest.example.com

New version:
    map @scheme @value=https @replace=http \
        @host @match=*.example.com @replace=dest.example.com \
        @path @match=/foo(.*) @replace=$0

The new version is verbose, but extensible and more flexible than the fixed
syntax. I haven't really thought this through very much ...
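
To make the "composable match and rewrite operators" idea a little more
concrete, here is a minimal sketch in Python. It is not ATS code and not a
proposed syntax. The names Operator, remap and RULES are hypothetical, and
it assumes glob matching for scheme and host, a full-string regex for the
path, and $N references numbered the way Python numbers groups ($0 is the
whole match, $1 the first capture).

```python
import fnmatch
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class Operator:
    """Hypothetical operator: match one URL part and optionally rewrite it."""
    part: str                      # "scheme", "host" or "path"
    match: Optional[str] = None    # glob (scheme/host) or regex (path); None matches anything
    replace: Optional[str] = None  # replacement text; None leaves the part unchanged

    def apply(self, value: str) -> Optional[str]:
        """Return the rewritten part, or None if the part does not match."""
        if self.part == "path":
            m = re.fullmatch(self.match or ".*", value)
            if m is None:
                return None
            if self.replace is None:
                return value
            # Expand $0..$9; a group that did not participate becomes "".
            return re.sub(r"\$(\d)",
                          lambda ref: m.group(int(ref.group(1))) or "",
                          self.replace)
        if self.match is not None and not fnmatch.fnmatchcase(value, self.match):
            return None
        return value if self.replace is None else self.replace


def remap(rules, scheme, host, path):
    """Apply each operator to its own part; every operator must match."""
    parts = {"scheme": scheme, "host": host, "path": path}
    out = dict(parts)
    for rule in rules:
        rewritten = rule.apply(parts[rule.part])
        if rewritten is None:
            return None
        out[rule.part] = rewritten
    return f'{out["scheme"]}://{out["host"]}{out["path"]}'


# Roughly the made-up rule above, with $1 standing for the path suffix.
RULES = [
    Operator("scheme", match="https", replace="http"),
    Operator("host", match="*.example.com", replace="dest.example.com"),
    Operator("path", match=r"/foo(.*)", replace="$1"),
]

if __name__ == "__main__":
    print(remap(RULES, "https", "www.example.com", "/foo/bar"))
    # prints http://dest.example.com/bar
```

Because each operator only ever sees its own part of the URL, a new matcher
(port, query string, method) could be added in this sketch without touching
the others.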
>
> *Standard remap.config example*
>
>     # ExampleA: match all domains coming in on a specific port with a path
>     regex_map_with_recv_port http://.*:8080/bar http://dest.com:12345/bar
>     # ExampleB: match all domains /foo regardless of port
>     regex_map http://.*/foo http://dest.com:12345/foo
>     # ExampleC: match everything else on a specific domain
>     map http://foo.com/ http://127.0.0.1:12345/catchall
>
> In regular remap.config some features are missing -- such as regex matching
> based on the path. If I were to use regex_remap on a regular mapping rule
> like so:
>
>     regex_map_with_recv_port http://.*:8080 http://dest.com/ @plugin=/usr/libexec/regex_remap.so @pparam=mapfile.map
>
> I could then have a file (mapfile.map) which would look something like:
>
>     ### map file contents
>     # regular remap
>     ^/foo(.*) http://dest.com:12345/foo/$1
>     # strip the query string
>     ^/foo(.*)(\?.*)? http://dest.com:12345/foo/$1?$q
>     # a redirect
>     /oldpath(.*) http://newdomain.com:8080/newpath$1 @status=302
>
>
> This regex_remap markup gives us a few nice things. It gives you regex
> matching on the path, which we found to be extremely useful. Specifically
> we have a use case for /foo and /foo/ to go to an app, but not /foobar. In
> addition regex_remap map files give you a cleaner markup for redirects
> (@status=xxx), as well as per-remap-line config overrides. As we started
> looking into it we realized that regex_remap (the plugin) is a bit limited
> since you cannot use remap plugins within the map file. We then started
> looking into adding that, but figured it might be less work (and more
> helpful) to merge this into ATS proper. So when merging these features,
> there are a few config questions to be answered.
>
>
> *Question #1 how many regexes?*
> Today there is one regex (in all the different regex_* types) which is only
> on the domain name. In regex_remap there is one regex, but it matches the
> path and the query string.
> So the question is how many different regexes should there be?
>
> To get some background of what these look like I'll implement a rule where
> we match http/https on all domains, ports 8081 and 8082, and all paths.
>
> The main ones we've thought of so far are:
>
> *1 regex*- so the entire string you are matching on is one big regex
>     ^https?://.*:808[12]/(.*)$
>
>     pros: fairly simple, dense (4 lines of today's configs can be
>           merged into 1)
>     cons: easy to mess up (lots to match at once). Fairly difficult to
>           tell what it's doing
>
> *2 regex*- one for domain, and one for path
>     http://.*:8081/(.*)$
>     http://.*:8082/(.*)$
>     https://.*:8081/(.*)$
>     https://.*:8082/(.*)$
>
>     pros: more closely matches what we do today
>     cons: more verbose (can't regex the scheme or port)
>
> *4 regex*- one for scheme, domain, port, and path
>     ^https?$://^.*$:^808[12]$/^(.*)$
>
>     pros: separation of the various regexes, dense, impossible to
>           capture more than the field you are in
>     cons: 4 regexes instead of 1 (might be more confusing?)
>
> *Note*: any number of regexes >1 will require named capture groups
> (http://www.regular-expressions.info/named.html), which means that you
> cannot use $1, $2, etc. In a lot of ways this is nicer (more explicit) but
> it is a change.
>
> *What do I think?*
> I prefer #2 or #4 as they help separate the regex matching into smaller
> regexes (which should make finding non-matching regexes faster) as well as
> make the regexes more scoped -- and hopefully harder to mess up (especially
> by matching too much).
>
>
> *Question #2 Explicit vs implicit regexes*
> If we decide to have more than one regex (from #1), do we want all of them
> to be implicitly handled as regex strings? Or do we want to rely on some
> anchoring syntax to flag to the remap engine that the string is a regex
> (such as requiring regexes to start with a '^' and end with a '$')?
>
> Explicit:
>     pros: clearer that the field is a regex, easier to optimize the
>           remap engine
>     cons: requires more markup, and would mean that if you just put in
>           .* it would be a string match
> Implicit:
>     pros: more closely matches the markup we have today, simpler configs
>     cons: wasteful if most fields are strings
>
> *What do I think?*
> In general (and in this specific case) I like explicit over implicit since
> it's clearer what you are doing. Especially if we pick 4 regexes (from
> #1) this would allow you to effectively "disable" regex matching within
> specific fields if you don't need it.
>
>
> *Question #3 How to handle query strings in the match?*
> Today the query string is not part of remap.config. In regex_remap it is
> optional based on a @pparam=no-query-string. The advantage of not having it
> in the path is you don't have to worry about matching it accidentally or
> reconstructing it. The downside is that you can never match on it. This
> could be controlled by some @ parameter, but that could make remap.config a
> bit confusing since it wouldn't be consistent.
>
> *What do I think?*
> I would like to leave the query parameters out, or at least have some @
> parameter to disable them on a per-line basis (since I don't want to have
> rules matching on query params).
>
>
> *Question #4 How do you *drop* the query string?*
> If for #3 we do decide to put the query string in the match, how would we
> drop it on a specific path? In the regex_remap plugin I'd simply create a
> rule such as:
>
>     # remove query params in regex_remap
>     ^/(.*)(\?.*)? /$1
>
> And by not adding ?$2 (or something similar) I'd be removing the query
> parameters. Conversely, if I want to keep the query parameters that means I
> effectively have to re-add them on every remap line like so:
>
>     # keep the query params
>     ^/(.*)(\?.*)? /$1/$2
>     # or with the shorthand
>     ^/(.*)(\?.*)? /$1/$q
>
> *What do I think?*
> No real opinion on this one, mostly because I don't really want the query
> params in the matching string to begin with :)
>
>
> *Question #5 How to remap paths that are regexed*
> Today we have remap lines that look something like:
>
>     regex_map http://.*/foo http://dest.com:12345/
>
> What this means is that any request with a path starting with "/foo" will be
> remapped and the "/foo" will be replaced with "/". Once we allow regexes in
> the path we have more variables to take into account. If we take this same
> simple case with regex_remap it would look something like:
>
>     /foo(.*) http://dest.com:12345/$1
>
> If we were to mimic this markup in remap.config it could look something
> like:
>
>     regex_map http://.*/foo(.*) http://dest.com:12345/$1
>
> This is a bit more explicit in what it is doing, but in this simple case it
> means quite a few more characters to get the same meaning across. What this
> gets you is the flexibility to do more complex regex matches if needed. If
> we do go with markup like this and we pick anything more than 1 regex from
> #1 we'd be forced to use named groups instead of $1, $2, etc. since the
> strings matched would come from different regexes.
>
>
> *What do I think?*
> If we used 4 regexes (from #1) and explicit (from #2) we could keep the
> markup pretty similar to what we have now and still have the ability to do
> some cool things.
>
> So, for the base case of regex_map today (regex domain only) it would look
> like:
>
>     regex_map http://^.*$/foo http://dest.com:12345/foo
>
> This would only regex the domain name (since it starts with ^ and ends with
> $) and then the path would be treated like a regular map (same as it is
> today).
> If I wanted to do some regexing based on the path I could write
> something like:
>
>     regex_map http://^.*$^/foo(?P<num>\d+)$ http://dest.com:12345/$num/foo
>
>
> If you have any opinions/feedback about these specifics please let me
> know -- we're hoping to nail down the markup fairly quickly and get this
> taken care of in the next few weeks. If you have questions about what
> markup would look like (and don't want to spam the mailing list) feel free
> to mail me individually or PM me on IRC (jacksontj).
>
>
> Thomas Jackson
> Traffic SRE @ LinkedIn
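
For reference, the "4 regexes, explicit anchoring, named groups" combination
discussed in the quoted message can be sketched in a few lines of Python.
This is only an illustration under stated assumptions, not how ATS does or
would implement it: match_field, remap, RULE and TARGET are hypothetical
names, a field counts as a regex only when it starts with '^' and ends with
'$', and any other field is compared for plain string equality (the real
remap engine treats a non-regex path as a prefix, which this sketch does not
attempt).

```python
import re

FIELDS = ("scheme", "host", "port", "path")


def match_field(pattern, value):
    """Return the named captures for one field, or None on a mismatch.

    Assumption: '^...$' marks a regex; anything else is a literal string.
    """
    if pattern.startswith("^") and pattern.endswith("$"):
        m = re.fullmatch(pattern, value)
        return m.groupdict() if m else None
    return {} if pattern == value else None


def remap(rule, target, request):
    """Match every field separately, then fill $name placeholders in target."""
    captures = {}
    for name in FIELDS:
        found = match_field(rule[name], request[name])
        if found is None:
            return None
        captures.update(found)
    # Unknown or non-participating names expand to "" in this sketch.
    return re.sub(r"\$(\w+)", lambda ref: captures.get(ref.group(1)) or "", target)


# Hypothetical rule combining the scheme/port regexes from question #1 with
# the named-group path regex from question #5.
RULE = {
    "scheme": "^https?$",
    "host": "^.*$",
    "port": "^808[12]$",
    "path": r"^/foo(?P<num>\d+)$",
}
TARGET = "http://dest.com:12345/$num/foo"

if __name__ == "__main__":
    request = {"scheme": "https", "host": "a.example.com",
               "port": "8081", "path": "/foo42"}
    print(remap(RULE, TARGET, request))
    # prints http://dest.com:12345/42/foo
```

Since each regex only sees its own field, a capture can never spill from the
host into the path, which is the "impossible to capture more than the field
you are in" property listed under the 4-regex option.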