I recently switched to a gigabit Internet plan. It was working great at first, but after a day or two it started slowing down to ~150 Mbit/s. I started with my normal troubleshooting, rebooting the modem, etc., but it soon became clear that the issue was with the server I use as a gateway/router. I couldn't get decent network throughput even for data originating on the server itself. I have a test script which generates a 64 KiB block of random data, then continually pushes that one block over a TCP connection (eliminating the possibility that compression is making it look like more data is transmitted than there actually is); over a gigabit link it usually gets a very consistent 111-112 MiB/s. This time, the script pegged one CPU core at 99%, even though it was spending all its time in the send() syscall, and it was only getting 10 MiB/s. CPU benchmarks showed everything running normally, with no CPU throttling, and both network interfaces behaved the same way (one built into the motherboard, the other on a PCIe expansion card). The only thing I could think to do was to reboot the server - which worked.
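For reference, here is a minimal sketch of that kind of throughput test (not my exact script; the host, port, and one-second reporting interval are placeholders, and any listener such as `nc -l 5001 > /dev/null` on the far end will do):

```python
#!/usr/bin/env python3
"""Minimal TCP throughput test: push one 64 KiB block of random data in a loop."""
import os
import socket
import time

HOST = "192.0.2.10"   # placeholder: address of the receiving host
PORT = 5001           # placeholder: any sink works, e.g. `nc -l 5001 > /dev/null`

# One random block, reused forever, so compression can't inflate the apparent rate.
block = os.urandom(64 * 1024)

with socket.create_connection((HOST, PORT)) as sock:
    sent = 0
    start = time.monotonic()
    while True:
        sock.sendall(block)
        sent += len(block)
        elapsed = time.monotonic() - start
        if elapsed >= 1.0:
            print(f"{sent / elapsed / (1024 * 1024):.1f} MiB/s")
            sent = 0
            start = time.monotonic()
```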
The next day, it started to slow down again. An internet speed test run directly on the server showed slightly decreased speeds, but the same speed test from another device (with packets being forwarded through the server) was extremely slow. Once again, the CPU seemed to be the bottleneck, with ksoftirqd using 100% of one core. This indicated that outgoing packets were taking a long time to process, and that the work was being done in the context of the thread sending the packet. I checked my iptables rules - nothing stood out. The route tables all looked normal. Then it hit me - I have an LTE backup using an old wireless hotspot, and to keep normal traffic off of it, I use an alternate route table, selected by a source address rule. A service sets up an SSH tunnel through that interface to an external server, and for simplicity, that service configures the routing rules on startup. There was a bug in the script which caused it to add another copy of the rule every time the service started. The remote SSH server had gotten stuck with an old connection holding the ports, so the service was restarting every few seconds - adding a new rule each time.
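A quick check along these lines makes that kind of pile-up obvious (a rough sketch that just shells out to `ip rule show`; the parsing assumes the usual "priority: selector action" output format):

```python
#!/usr/bin/env python3
"""Count entries in the routing policy database; a runaway service shows up as thousands of duplicates."""
import subprocess
from collections import Counter

out = subprocess.run(["ip", "rule", "show"], capture_output=True, text=True, check=True).stdout
rules = out.splitlines()
print(f"{len(rules)} policy rules total")

# Group identical rule bodies (everything after the "priority:" prefix) to spot duplicates.
dupes = Counter(line.split(":", 1)[1].strip() for line in rules if ":" in line)
for rule, count in dupes.most_common(3):
    print(f"{count:6d}x  {rule}")
```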
In the end, I had a total of 15,059 rules that had to be checked for every packet passing through (seen in the image above). The solution is simple - remove all matching source address rules before adding a new one. I also unjammed the remote SSH server, so the tunnel service is no longer constantly restarting.
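The fix boils down to making the rule setup idempotent. A sketch of the idea (the source address and table name are placeholders for my LTE setup, and this is not the actual service script; it needs root, like any `ip rule` change):

```python
#!/usr/bin/env python3
"""Idempotent policy-rule setup: delete any existing rules for the LTE source before adding one."""
import subprocess

LTE_SRC = "192.168.8.100"   # placeholder: address assigned by the hotspot
LTE_TABLE = "lte"           # placeholder: name of the alternate route table

def ip(*args, check=True):
    return subprocess.run(["ip", *args], capture_output=True, text=True, check=check)

# `ip rule del` removes one matching rule per call, so loop until none are left.
while ip("rule", "del", "from", LTE_SRC, "lookup", LTE_TABLE, check=False).returncode == 0:
    pass

# Now add exactly one rule, no matter how many times the service restarts.
ip("rule", "add", "from", LTE_SRC, "lookup", LTE_TABLE)
```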
I did learn some things in the process. I knew that the processing for iptables, routing, and eBPF takes place in the kernel, but I had always assumed it was kicked off in the background and the sending process returned immediately. Apparently it's all synchronous: send() doesn't return until the packets have actually been placed in the outbound hardware queue. That matters for performance when there is a lot of per-packet processing to be done.
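You can get a rough sense of this from userspace by timing individual send() calls: if per-packet processing (policy rules, netfilter, etc.) is expensive, the cost shows up directly in the caller. This is only an illustrative sketch, reusing the same placeholder listener as the throughput test above, and the numbers mix in socket-buffer effects, so treat it as a hint rather than a measurement:

```python
#!/usr/bin/env python3
"""Rough illustration: time individual sendall() calls; heavy per-packet work in the
kernel transmit path tends to show up as higher per-call latency for the sender."""
import os
import socket
import time

HOST, PORT = "192.0.2.10", 5001   # placeholders, same listener as before
block = os.urandom(64 * 1024)

with socket.create_connection((HOST, PORT)) as sock:
    timings = []
    for _ in range(1000):
        t0 = time.perf_counter()
        sock.sendall(block)
        timings.append(time.perf_counter() - t0)

timings.sort()
print(f"median send: {timings[len(timings) // 2] * 1e6:.1f} us")
print(f"p99 send:    {timings[int(len(timings) * 0.99)] * 1e6:.1f} us")
```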