Hello VPP experts, I think there is a thread safety issue in the NAT plugin regarding the counter for busy ports.
Looking at this for the master branch now, there has been some refactoring lately but the issue has anyway been there for a long time, at least several VPP versions back, although filenames and function names have changed. Here I will take the endpoint-independent code in nat44-ei/nat44_ei.c code because that is the part I am using, but it looks like a similar issue is there for nat44-ed as well. In the nat44_ei_alloc_default_cb() function in nat44_ei.c there is a part that looks like this: --a->busy_##n##_port_refcounts[portnum]; \ a->busy_##n##_ports_per_thread[thread_index]++; \ a->busy_##n##_ports++; \ where the variable "a" is an address (nat44_ei_address_t) that belongs to the "addresses" in the global nat44_ei_main, so not thread-specific. As I understand it, different threads may be using the same "a" at the same time. At first sight it might seem like all those three lines are risky because different threads can execute this code at the same time for the same "a". However, the _port_refcounts[portnum] and _ports_per_thread[thread_index] parts are actually okay to access because the [portnum] and [thread_index] ensures that those lines only access parts of those arrays that belong to thecurrent thread, that is how the port number is selected. So the first two lines there are fine, I think, but the third line, incrementing a->busy_##n##_ports, can give a race condition when different threads execute it at the same time. The same issue is also there in other places where the busy_##n##_ports values are updated. I think this is not critical because the busy_##n##_ports information (that can be wrong because of this thread safety issue) is not used very much. However those values are used in nat44_ei_del_address() where it looks like this: /* Delete sessions using address */ if (a->busy_tcp_ports || a->busy_udp_ports || a->busy_icmp_ports) { and then inside that if-statement there is some code to delete those sessions. If the busy_##n##_ports values are wrong it could in principle happen that the session deletion is skipped when there were actually some sessions that needed deleting. Perhaps rare and perhaps resulting in nothing worse than a small memory leak, but anyway. One effect of this is that there can be an inconsistency, if we were to sum up the busy_##n##_ports_per_thread values for all threads, that should be equal to busy_##n##_ports but due to this issue there could be a difference, because while the busy_##n##_ports_per_thread values are correct the busy_##n##_ports values may have been corupted due to the race condition mentioned above. Not sure if the above is a problem in practice, my main motivation for reporting this is that it confuses me when I am trying to understand how te code works in order to do some modifications. Either the code is not thread safe there, or I have misunderstood things. What do you think, is it an issue? If not, what have I missed? (This is not an April fools' joke, I really am this pedantic) Best regards, Elias
-=-=-=-=-=-=-=-=-=-=-=- Links: You receive all messages sent to this group. View/Reply Online (#19087): https://lists.fd.io/g/vpp-dev/message/19087 Mute This Topic: https://lists.fd.io/mt/81773552/21656 Group Owner: vpp-dev+ow...@lists.fd.io Unsubscribe: https://lists.fd.io/g/vpp-dev/unsub [arch...@mail-archive.com] -=-=-=-=-=-=-=-=-=-=-=-