Here at LinkedIn we've been using regular remap.config for a while (with
all our map options). One thing we've been looking into recently is path
based regexes (which regex_remap supports). While looking into it we found
a few shortcomings of the plugin-- and decided it would probably be better
for everyone involved if we could come up with a way to consolidate this
into regular remap. This raises a few questions about how to consolidate,
so we figured we'd solicit some feedback from the community before we get
started. Since there are quite a few changes we're considering I've tried
to assemble examples for all the scenarios, but to start out I'll put some
examples of how remaps look today.

*Standard remap.config example*

# ExampleA: match all domains coming in on a specific port with a path
regex_map_with_recv_port http://.*:8080/bar    http://dest.com:12345/bar
# ExampleB:  match all domains /foo regardless of port
regex_map http://.*/foo    http://dest.com:12345/foo
# ExampleC: match everything else on a specific domain
map http://foo.com/    http://127.0.0.1:12345/catchall

In regular remap.config some features are missing-- such as regex matching
based on the path. If I were to use regex_remap on a regular mapping rule
like so:

regex_map_with_recv_port http://.*:8080 http://dest.com/ @plugin=/usr/libexec/regex_remap.so @pparam=mapfile.map

I could then have a file (mapfile.map) which would look something like:

### map file contents
# regular remap
^/foo(.*)                     http://dest.com:12345/foo/$1
# strip the query string
^/foo(.*)(\?.*)?               http://dest.com:12345/foo/$1?$q
# a redirect
/oldpath(.*)                   http://newdomain.com:8080/newpath$1 @status=302


This regex_remap markup gives us a few nice things. First, it gives you regex
matching on the path, which we found to be extremely useful. Specifically,
we have a use case for /foo and /foo/ to go to an app, but not /foobar. In
addition regex_remap map files give you a cleaner markup for redirects
(@status=xxx), as well as per-remap-line config overrides. As we started
looking into it we realized that regex_remap (the plugin) is a bit limited
since you cannot use remap plugins within the map file. We then started
looking into adding that, but figured it might be less work (and more
helpful) to merge this into ATS proper. So when merging these features,
there are a few config questions to be answered.
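
To make the /foo use case above concrete, here is a minimal Python sketch of
the kind of path regex we have in mind. The pattern and the function name are
assumptions for illustration only (this is not ATS or regex_remap code), and
Python's re module stands in for PCRE here.

```python
import re

# Hypothetical pattern: "/foo" on its own, or "/foo/" followed by anything.
# "/foobar" is rejected because the character after "/foo" must be "/" or
# the end of the path.
FOO_APP = re.compile(r"^/foo(/.*)?$")


def routes_to_foo_app(path: str) -> bool:
    """Return True if the path should go to the (hypothetical) foo app."""
    return FOO_APP.match(path) is not None


if __name__ == "__main__":
    for path in ("/foo", "/foo/", "/foo/bar", "/foobar"):
        print(path, routes_to_foo_app(path))
```

A plain prefix match on "/foo" cannot express this, which is the main reason
path regexes are interesting to us.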


*Question #1 how many regexes?*
Today there is one regex (in all the different regex_* types) which is only
on the domain name. In regex_remap there is one regex, but it matches the
path and the query string. So the question is how many different regexes
should there be?

To get some background of what these look like I'll implement a rule where
we match http/https on all domains, ports 8081 and 8082, and all paths.

The main ones we've thought of so far are:

    *1 regex*- so the entire string you are matching on is one big regex
        ^https?://.*:808[12]/(.*)$

        pros: fairly simple, dense (4 lines of today's configs can be
              merged into 1)
        cons: easy to mess up (lots to match at once), and fairly difficult
              to tell what it's doing

    *2 regex*- one for domain, and one for path
        http://.*:8081/(.*)$
        http://.*:8082/(.*)$
        https://.*:8081/(.*)$
        https://.*:8082/(.*)$

        pros: more closely matches what we do today
        cons: more verbose (can't regex the scheme or port)

    *4 regex*- one for scheme, domain, port, and path
        ^https?$://^.*$:^808[12]$/^(.*)$

        pros: separation of the various regexes, dense, impossible to
              capture more than the field you are in
        cons: 4 regexes instead of 1 (might be more confusing?)

*Note*: any number of regexes >1 will require named capture groups (
http://www.regular-expressions.info/named.html). Which means that you
cannot use $1, $2, etc. In a lot of ways this is nicer (more explicit) but
it is a change.
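
As a rough illustration of the 4-regex option and of why named groups are
needed, here is a Python sketch. Everything in it (the RULE layout, the
match_rule name, the "rest" group) is hypothetical and only models the idea.
It is not how the remap engine is or would be implemented.

```python
import re
from urllib.parse import urlsplit

# Hypothetical rule: one anchored regex per URL component, mirroring
#   ^https?$://^.*$:^808[12]$/^(.*)$
# but with a named group, since numbered groups from separate regexes
# would collide.
RULE = {
    "scheme": re.compile(r"^https?$"),
    "host": re.compile(r"^.*$"),
    "port": re.compile(r"^808[12]$"),
    "path": re.compile(r"^/(?P<rest>.*)$"),
}


def match_rule(url, rule=RULE):
    """Return the merged named captures if every component matches, else None."""
    parts = urlsplit(url)
    fields = {
        "scheme": parts.scheme,
        "host": parts.hostname or "",
        "port": str(parts.port) if parts.port is not None else "",
        "path": parts.path,
    }
    captures = {}
    for name, regex in rule.items():
        m = regex.match(fields[name])
        if m is None:
            # Any single component failing rejects the whole rule early.
            return None
        captures.update(m.groupdict())
    return captures
```

Because each regex only ever sees its own field, a pattern like .* on the
domain cannot accidentally swallow the port or the path.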

*What do I think?*
I prefer #2 or #4 as they help separate the regex matching into smaller
regexes (which should make finding non-matching regexes faster) as well as
make the regexes more scoped-- and hopefully harder to mess up (especially
by matching too much).


*Question #2 Explicit vs implicit regexes*
If we decide to have more than one regex (from #1), do we want all of them
to be implicitly handled as regex strings? Or do we want to rely on some
anchoring syntax to flag to the remap engine that the string is a regex
(such as requiring regexes to start with a '^' and end with a '$').

    Explicit:
        pros: clearer that the field is a regex, easier to optimize the
              remap engine
        cons: requires more markup, and would mean that if you just put in
              .* it would be a string match
    Implicit:
        pros: more closely matches the markup we have today, simpler configs
        cons: wasteful if most fields are strings

*What do I think?*
In general (and in this specific case) I like explicit over implicit since
it's clearer what you are doing. Especially if we pick 4 regexes (from
#1) this would allow you to effectively "disable" regex matching within
specific fields if you don't need it.
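
Here is a small Python sketch of what "explicit" could mean in practice. It
assumes the '^'...'$' anchoring convention mentioned above, and assumes that
non-regex fields are compared as exact strings (a simplification, since paths
in remap.config today are prefix matches). The compile_field name is made up
for this example.

```python
import re


def compile_field(field):
    """Build a matcher for a single rule field.

    Assumption: a field is treated as a regex only if it starts with '^'
    and ends with '$'. Anything else is compared as a literal string.
    """
    if len(field) >= 2 and field.startswith("^") and field.endswith("$"):
        regex = re.compile(field)
        return lambda value: regex.match(value) is not None
    # Literal fields never touch the regex engine at all.
    return lambda value: value == field


if __name__ == "__main__":
    print(compile_field("^.*$")("foo.com"))     # regex field
    print(compile_field(".*")("foo.com"))       # literal field, not a regex
    print(compile_field("foo.com")("foo.com"))  # literal field
```

This also shows the con listed above: an unanchored .* silently becomes a
string comparison against the two characters ".*".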


*Question #3 How to handle query strings in the match?*
Today the query string is not part of remap.config. In regex_remap it is
optional based on a @pparam=no-query-string. The advantage of not having it
in the path is you don't have to worry about matching it accidentally or
reconstructing it. The downside is that you can never match on it. This
could be controlled by some @ parameter, but that could make remap.config a
bit confusing since it wouldn't be consistent.

*What do I think?*
I would like to leave the query parameters out, or at least have some @
parameter to disable them on a per-line basis (since I don't want to have
rules matching on query params).


*Question #4 How do you *drop* the query string?*
If for #3 we do decide to put the query string in the match, how would we
drop it on a specific path? In the regex_remap plugin I'd simply create a
rule such as:

# remove query params in regex_remap
^/(.*)(\?.*)?  /$1

And by not adding ?$2 (or something similar) I'd be removing the query
parameters. Conversely, if I want to keep the query parameters that means I
effectively have to re-add them every remap line like so:

# keep the query params
^/(.*)(\?.*)?  /$1/$2
# or with the shorthand
^/(.*)(\?.*)?  /$1/$q
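
For anyone who wants to play with the keep/drop behaviour, here is a Python
sketch. It uses Python's re module rather than PCRE, and the RULE pattern and
remap function are assumptions for illustration. One thing it highlights: with
a greedy (.*) as the first group, the query string ends up inside the first
capture, so the sketch uses [^?]* for the path part instead.

```python
import re

# Hypothetical pattern: the path stops at the first '?', the query is optional.
RULE = re.compile(r"^/([^?]*)(\?.*)?$")

# Same shape as the rules above, with a greedy first group, for comparison.
GREEDY = re.compile(r"^/(.*)(\?.*)?")


def remap(path_and_query, keep_query):
    """Rewrite the request, optionally re-adding the captured query string."""
    m = RULE.match(path_and_query)
    if m is None:
        return None
    path = m.group(1)
    query = m.group(2) or ""  # includes the leading '?', or empty if absent
    return "/" + path + (query if keep_query else "")


if __name__ == "__main__":
    print(remap("/a/b?x=1", keep_query=True))
    print(remap("/a/b?x=1", keep_query=False))
    # With the greedy pattern the second group never gets to match.
    print(GREEDY.match("/a/b?x=1").groups())
```

Either way, keeping the query string means every rule has to carry it through
explicitly, which is the extra per-line burden described above.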

*What do I think?*
No real opinion on this one, mostly because I don't really want the query
params in the matching string to begin with :)


*Question #5 How to remap paths that are regexed*
Today we have remap lines that look something like:
regex_map http://.*/foo http://dest.com:12345/

What this means is that any request with path starting with "/foo" will be
remapped and the "/foo" will be replaced with "/". Once we allow regexes in
the path we have more variables to take into account. If we take this same
simple case with regex_remap it would look something like:
/foo(.*)  http://dest.com:12345/$1

If we were to mimic this markup in remap.config it could look something
like:
regex_map http://.*/foo(.*) http://dest.com:12345/$1

This is a bit more explicit in what it is doing, but in this simple case it
means quite a few more characters to get the same meaning across. What this
gets you is the flexibility to do more complex regex matches if needed. If
we do go with markup like this and we pick anything more than 1 regex from
#1 we'd be forced to use named groups instead of $1, $2, etc. since the
strings matched would come from different regexes.


*What do I think?*
If we used 4 regexes (from #1) and explicit (from #2) we could keep the
markup pretty similar to what we have now and still have the ability to do
some cool things.

So, for the base case of regex_map today (regex domain only) it would look
like:
regex_map http://^.*$/foo http://dest.com:12345/foo

This would only regex the domain name (since it starts with ^ and ends with
$) and then the path would be treated like a regular map (same as it is
today). If I wanted to do some regexing based on the path I could write
something like:

regex_map http://^.*$^/foo(?P<num>\d+)$ http://dest.com:12345/$num/foo
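
To show how the named group in that last rule would flow into the
destination, here is a Python sketch of just the path half. PATH_RULE,
TEMPLATE and remap_path are hypothetical names, and string.Template is only
used because its $name syntax happens to look like the proposed markup.

```python
import re
from string import Template

# Path regex and destination taken from the example rule above.
PATH_RULE = re.compile(r"^/foo(?P<num>\d+)$")
TEMPLATE = Template("http://dest.com:12345/$num/foo")


def remap_path(path):
    """Return the rewritten URL, or None if the path regex does not match."""
    m = PATH_RULE.match(path)
    if m is None:
        return None
    # Named captures are looked up by name, so it does not matter which
    # of the per-field regexes they came from.
    return TEMPLATE.substitute(m.groupdict())


if __name__ == "__main__":
    for path in ("/foo42", "/foo", "/foobar"):
        print(path, remap_path(path))
```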




If you have any opinions/feedback about these specifics please let me
know-- we're hoping to nail down the markup fairly quickly and get this
taken care of in the next few weeks. If you have questions about what
markup would look like (and don't want to spam the mailing list) feel free
to mail me individually or PM me on IRC (jacksontj).


Thomas Jackson
Traffic SRE @ LinkedIn
