We have ~60,000 subs on FTTH, DSL and cable modem, behind several
Juniper MX routers: MX960s with MS-MPC-128G (FTTH and CM) and MX104s
with MS-MIC-16G (DSL), and they're doing well. We had some growing
pains, but they were resolved with APP, EIM, EIF, and source-IP load
balancing on the AMS interface. Also, since all my subs are in L3VPNs,
I had to share the inet.0 metric with inet.3 to get MP-iBGP to see the
other MXs as the least-cost route and get nice load balancing. We did
about 3000 ports per sub: port blocks of 100 ports, at a max of 30
blocks per sub (100*30=3000).
We usually do a /24 or /23 at each MX960, and I recall a /25 at the
DSL MX104s. I've actually seen peak usage of an MS-MPC-128G
flat-line during peak time at approx. 65 Gbps, and even more recently I
recall seeing about 70 Gbps. That's on a single MS-MPC-128G. I hope I
don't have to upgrade to SPC (or dual MS-MPC-128G). I'd rather do
dual-stack IPv6 and bypass the CGNAT boundary. That's what my current
focus is.
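
For anyone sizing pools, the port-block arithmetic above can be
sketched roughly as follows. This is only a back-of-the-envelope
helper, not anything Junos provides: the function names are made up
for illustration, and the default of 64512 usable ports per public IP
(ports 1024-65535) is an assumption that depends on how the pool's
port range is configured. It also computes the worst case, where every
sub holds the maximum number of blocks at once.

```python
def ports_per_subscriber(block_size: int, max_blocks: int) -> int:
    """Maximum ports one subscriber can hold (block size x block limit)."""
    return block_size * max_blocks


def subscribers_per_public_ip(block_size: int, max_blocks: int,
                              usable_ports: int = 64512) -> int:
    """Worst-case subscribers per public IP if every sub uses all blocks.

    usable_ports=64512 assumes ports 1024-65535 are available for NAT;
    that default is an assumption, not a value from this thread.
    """
    return usable_ports // ports_per_subscriber(block_size, max_blocks)


def pool_capacity(prefix_len: int, block_size: int, max_blocks: int,
                  usable_ports: int = 64512) -> int:
    """Worst-case subscribers an IPv4 pool of the given prefix length serves."""
    addresses = 2 ** (32 - prefix_len)
    return addresses * subscribers_per_public_ip(
        block_size, max_blocks, usable_ports)


if __name__ == "__main__":
    # 100-port blocks, max 30 blocks per sub, as described above.
    print(ports_per_subscriber(100, 30))
    print(subscribers_per_public_ip(100, 30))
    print(pool_capacity(24, 100, 30))
```

In practice most subs never use their full block allowance, so the
real ratio you can run is usually higher than this worst-case figure.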
-Aaron
On 10/8/2024 2:19 PM, Jon Lewis wrote:
We started rolling out CGNAT about 6 months ago. It was smooth
sailing for the first few months, but we eventually did run into a
number of issues.
Our customer base is primarily FTTH with "dynamic" IP assignment via
DHCP. Since connections are always-on, customer ONTs/routers get an IP
assigned, and then when the lease is renewed, they request a new lease
for the existing IP, and, in general, that request is granted. This
gives customers the mistaken impression they have a static IP. My
impression, from working with some customers who've needed to be moved
from CGNAT back to public IP, is that customers who are doing
port-forwarding don't even bother with dynamic DNS. They just know
they can connect to their IP, as they've never seen it change. We do
offer/sell static IP, but pre-CGNAT, it was strictly for business
customers, i.e. a residential customer could only get static IP
service by converting their account to a business account. That may
change in the near future.
One thing we didn't foresee was IP Geo issues. We all knew
that streaming services like Netflix use IP Geo to determine what
content should be made available, but that's, AFAIK, done at the
country or region level. What we didn't anticipate is services like Hulu
Live TV doing IP Geo down to the city level to determine which local
channels are a subscriber's local channels. We're using Juniper MX
gear and SPC3 cards for our CGNAT routers, each one having a single
large external pool. Since we serve most of FL, one external pool
can't IP Geo correctly for customers as far apart as Miami and
Jacksonville hitting the same CGNAT router. We don't currently have
an acceptable solution to this other than moving impacted customers
off CGNAT.
One of the great unknowns (at least for us) with CGNAT was what our
PBA settings should be. i.e. How large each port-block should be,
and how many port-blocks to allow per customer. We started with
256x4. It seemed to work. We eventually noticed that we were logging
port-block exceeded errors. This is one aspect where Juniper's CGNAT
support is lacking. There's a counter for these errors, and it's
available via SNMP, but there's no way to attribute the errors to
subscriber IPs. We're polling the MIB and graphing it, so we know
it's a continuing issue and can see when it's incrementing
faster/slower, but Junos provides no means for determining if "PBEs"
are all being caused by a single customer, a handful of customers,
etc. We have a JTAC case open on this. As a quick & hopeful fix, we
increased both the port-block size and the block limit. That helped,
but didn't stop the errors. It also cut our CGNAT ratio by more than
half (64:1 -> 28:1); if we stay at this ratio, we'll need much larger
external pools than originally anticipated. Tuning these settings is
kind of painful as JTAC strongly recommends bouncing the CGNAT service
anytime CGNAT related config changes are made. This means briefly
breaking Internet access for all CGNAT'd customers. For the PBEs,
JTAC's suggestions so far have been to shorten some of the timeouts in
the config and to keep doing what we're doing, which is a cron job
that essentially does a "show services nat source port-block", parses
the output looking for subscriber IPs that have used up the ports in
several of their port-blocks, then does a "show services sessions
source-prefix ..." and logs all of this. This at least gives us
snapshots of "who's a heavy user right now" and lets us look at how
they were using all their ports, i.e. was it BitTorrent, are they
compromised and scanning the Internet for more systems to compromise,
is it legit-looking traffic (just lots of it), etc.
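
The "who's a heavy user right now" step of a cron job like this could
look roughly like the sketch below. It deliberately skips the parsing
of the Junos output and assumes the port-block data has already been
turned into simple records; the PortBlock type, the heavy_users
function, and the threshold values are all hypothetical names and
numbers chosen for illustration, not part of any Juniper tooling.

```python
from collections import defaultdict
from typing import Iterable, List, NamedTuple


class PortBlock(NamedTuple):
    """One allocated port block (hypothetical, already-parsed record)."""
    subscriber_ip: str
    ports_used: int
    block_size: int


def heavy_users(blocks: Iterable[PortBlock],
                full_threshold: float = 0.9,
                min_full_blocks: int = 3) -> List[str]:
    """Return subscriber IPs with at least min_full_blocks nearly full blocks.

    A block counts as nearly full when ports_used / block_size is at or
    above full_threshold. Both thresholds are illustrative defaults.
    """
    full_count = defaultdict(int)
    for blk in blocks:
        if blk.block_size > 0 and blk.ports_used / blk.block_size >= full_threshold:
            full_count[blk.subscriber_ip] += 1
    return sorted(ip for ip, n in full_count.items() if n >= min_full_blocks)


if __name__ == "__main__":
    sample = [
        PortBlock("100.64.0.10", 256, 256),
        PortBlock("100.64.0.10", 250, 256),
        PortBlock("100.64.0.10", 240, 256),
        PortBlock("100.64.0.11", 256, 256),
        PortBlock("100.64.0.11", 12, 256),
    ]
    for ip in heavy_users(sample):
        print(ip)
```

Each flagged IP would then be the input to the per-subscriber session
lookup, so the log captures both who was heavy and what they were
doing at that moment.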
The latest CGNAT issue is with a customer that has a Palo Alto
Networks firewall connected to our network; several of their employees
are our FTTH customers. On their PANW firewall, they're doing IP Geo
based filtering, limiting access to internal servers to "US IPs".
Since we only CGNAT traffic to the external Internet, their on-net
employees hit the firewall from their 100.64/10 IPs and get blocked.
I suggested they whitelist 100.64/10, since we block traffic from
100.64/10 from entering our network via peering and transit, so they
can be assured anything from 100.64/10 came from inside our network,
i.e. from our customers. They say the firewall won't let them whitelist
100.64.0.0/10, giving an error that it's invalid IP space.
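
For reference, 100.64.0.0/10 is the Shared Address Space defined in
RFC 6598, and it is an ordinary, well-formed IPv4 prefix. A minimal
sketch using Python's standard ipaddress module shows the range and a
membership check of the kind an allow-list would do; the constant and
function names are made up for illustration, and this says nothing
about how the PANW firewall validates its own address objects.

```python
import ipaddress

# RFC 6598 Shared Address Space, used for subscriber-side CGNAT addressing.
CGNAT_SHARED = ipaddress.ip_network("100.64.0.0/10")


def is_cgnat_shared(addr: str) -> bool:
    """True if addr falls inside 100.64.0.0/10."""
    return ipaddress.ip_address(addr) in CGNAT_SHARED


if __name__ == "__main__":
    # First and last addresses covered by the prefix.
    print(CGNAT_SHARED[0], CGNAT_SHARED[-1])
    for candidate in ("100.64.0.1", "100.127.255.255", "100.128.0.0"):
        print(candidate, is_cgnat_shared(candidate))
```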
I know we're not the first to implement CGNAT, so I'm curious if
others have run into these sorts of issues, or others we haven't run
into yet, and if so, how you solved them.
----------------------------------------------------------------------
Jon Lewis, MCP :) | I route
Blue Stream Fiber, Sr. Neteng | therefore you are
_________ http://www.lewis.org/~jlewis/pgp for PGP public key_________