path: root/sys/net
Age | Commit message | Author
2023-07-07 | Keep mbuf header field ph_mss during loopback TCP with LRO/TSO. | Alexander Bluhm
When M_TCP_TSO is preserved, also keep ph_mss. In lo(4) this logic was missing. This may be relevant only for weird pf configs that forward from loopback. OK mvs@ jan@
2023-07-07 | Fix path MTU discovery for TCP LRO/TSO when forwarding. | Alexander Bluhm
When doing LRO (Large Receive Offload), the drivers, currently ix(4) and lo(4) only, record an upper bound of the size of the original packets in ph_mss. When sending, either stack or hardware must chop the packets with TSO (TCP Segmentation Offload) to that size. That means we have to call tcp_if_output_tso() before ifp->if_output(). Put that logic into if_output_tso() to avoid code duplication. As TCP packets on the wire do not get larger that way, path MTU discovery should still work. tested by and OK jan@
2023-07-06 | use refcnt API for multicast addresses, add tracepoint:refcnt:ethmulti probe | Klemens Nanni
Replace hand-rolled reference counting with refcnt_init(9) and hook it up with a new dt(4) probe. OK mvs Feedback OK bluhm
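A minimal userland sketch of the reference counting pattern this commit refers to. It is not the kernel's refcnt(9) implementation: the `obj_ref_*` names are hypothetical and C11 atomics stand in for the kernel primitives. It only illustrates the init/take/release shape, where the release call reports whether the last reference was dropped.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical userland analog of a reference counter. */
struct obj_ref {
	atomic_uint refs;
};

/* A freshly initialised object starts with one reference (the creator's). */
void
obj_ref_init(struct obj_ref *r)
{
	atomic_init(&r->refs, 1);
}

/* Take an additional reference; the object must still be alive. */
void
obj_ref_take(struct obj_ref *r)
{
	unsigned int old = atomic_fetch_add(&r->refs, 1);

	assert(old > 0);
	(void)old;
}

/* Drop a reference; returns true when the caller released the last one. */
bool
obj_ref_rele(struct obj_ref *r)
{
	unsigned int old = atomic_fetch_sub(&r->refs, 1);

	assert(old > 0);
	return old == 1;
}
```

A caller would free the object only when `obj_ref_rele()` returns true. Funnelling every take and release through one small API is also what makes it practical to attach a tracing probe in a single place.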
2023-07-06 | big update to pfsync to try and clean up locking in particular. | David Gwynne
moving pf forward has been a real struggle, and pfsync has been a constant source of pain. we have been papering over the problems for a while now, but it reached the point that it needed a fundamental restructure, which is what this diff is.
the big headliner changes in this diff are:
- pfsync specific locks
  this is the whole reason for this diff. rather than rely on NET_LOCK or KERNEL_LOCK or whatever, pfsync now has its own locks to protect its internal data structures. this is important because pfsync runs a bunch of timeouts and tasks to push pfsync packets out on the wire, or when it's handling requests generated by incoming pfsync packets, both of which happen outside pf itself running. having pfsync specific locks around pfsync data structures makes the mutations of these data structures a lot more explicit and auditable.
- partitioning
  to enable future parallelisation of the network stack, this rewrite includes support for pfsync to partition states into different "slices". these slices run independently, ie, the states collected by one slice are serialised into a separate packet to the states collected and serialised by another slice. states are mapped to pfsync slices based on the pf state hash, which is the same hash that the rest of the network stack and multiq hardware uses.
- no more pfsync called from netisr
  pfsync used to be called from netisr to try and bundle packets, but now that there are multiple pfsync slices this doesn't make sense. instead it uses tasks in softnet tqs.
- improved bulk transfer handling
  there are shiny new state machines around both the bulk transmit and receive handling. pfsync used to do horrible things to carp demotion counters, but now it is very predictable and returns the counters back where they started.
- better tdb handling
  the tdb handling was pretty hairy, but hrvoje has kicked this around a lot with ipsec and sasyncd and we've found and fixed a bunch of issues as a result of that testing.
- mpsafe pf state purges
  this was committed previously, but because the locks pfsync relied on weren't clear this just caused a ton of bugs. as part of this diff it's now reliable, and moves a big chunk of work out from under KERNEL_LOCK, which in turn improves the responsiveness and throughput of a firewall even if you're not using pfsync.
there's a bunch of other little changes along the way, but the above are the big ones.
hrvoje has done performance testing with this diff and notes a big improvement when pfsync is not in use. performance when pfsync is enabled is about the same, but i'm hoping the slices mean we can scale along with pf as it improves.
lots (months) of testing by me and hrvoje on pfsync boxes
tests and ok sashan@
deraadt@ says this is a good time to put it in
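The partitioning idea above can be sketched as a pure function from a state hash to a slice index. This is an illustration only: the function name and the use of a plain modulo are assumptions, and the commit message does not say how pfsync actually derives the slice from the hash.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical mapping of a pf state hash to a slice index.
 * The modulo is an assumption for illustration, not pfsync's code.
 */
unsigned int
slice_for_hash(uint32_t hash, unsigned int nslices)
{
	assert(nslices > 0);
	return hash % nslices;
}
```

Because the mapping depends only on the hash, every update for a given state lands in the same slice, so each slice can collect and serialise its states without coordinating with the others.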
2023-07-04 | This diff limits the number of transactions/tickets | Alexandr Nedvedicky
pf_open_trans() can issue for each clone of /dev/pf to 512. pf_open_trans() is currently used by the DIOCGETRULES ioctl(2). The limit prevents processes from consuming all kernel memory by asking DIOCGETRULES for more tickets. If DIOCGETRULES hits the limit, the application will see an EBUSY error. This diff was fine-tuned with feedback from claudio@, deraadt@ and kn@. OK kn@
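The limit described above can be sketched as a per-clone counter that refuses new transactions once 512 are open. The struct and function names below are hypothetical and are not the pf_ioctl.c code; only the limit of 512 and the EBUSY result come from the commit message.

```c
#include <assert.h>
#include <errno.h>

#define TRANS_LIMIT_SKETCH	512	/* limit stated in the commit message */

/* Hypothetical per-open state for one clone of the device. */
struct dev_clone_sketch {
	unsigned int open_trans;	/* transactions currently open */
};

/* Open a transaction; returns 0 on success or EBUSY at the limit. */
int
trans_open(struct dev_clone_sketch *c)
{
	if (c->open_trans >= TRANS_LIMIT_SKETCH)
		return EBUSY;
	c->open_trans++;
	return 0;
}

/* Close a transaction, making room for a new one. */
void
trans_close(struct dev_clone_sketch *c)
{
	assert(c->open_trans > 0);
	c->open_trans--;
}
```

With a cap like this, a process that keeps requesting tickets without closing them gets an error instead of exhausting kernel memory.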
2023-07-04 | Check for interface type ethernet before calling ether_brport_isset() | Jan Klemkow
Pointed out by bluhm. ok bluhm@
2023-07-04 | The recent change to DIOCGETRULE allows applications which | Alexandr Nedvedicky
periodically read rules from pf(4) to consume all kernel memory. The bug was discovered and root-caused by florian@. In this particular case it was snmpd(8) that ate all kernel memory. This commit introduces DIOCXEND to pf(4) so that applications such as snmpd(8) and systat(1) can close the ticket/transaction when they are done fetching the rules. This change also updates snmpd(8) and systat(1) to use the newly introduced DIOCXEND ioctl(2). OK claudio@, deraadt@, kn@
2023-07-04 | remove unused global var | Jonathan Gray
ok sashan@
2023-07-03 | use consistent queue(9) example for LIST removal; OK bluhm mvs | Klemens Nanni
2023-07-02 | Use TSO and LRO on the loopback interface to transfer TCP faster. | Alexander Bluhm
If tcplro is activated on lo(4), ignore the MTU with TCP packets. They are passed along with the information that they have to be chopped in case they are forwarded later. New netstat(1) counter shows that software LRO is in effect. The feature is currently turned off by default. tested by jan@; OK claudio@ jan@
2023-06-30 | Introduce M_PF type for pf(4) related memory allocations. Currently used | Vitaliy Makkoveev
M_TEMP and M_IFADDR types are unreasonable for that purpose. The dedicated statistics simplify future pf(4) unlocking work by narrowing the search area for possible memory leaks. ok bluhm sashan
2023-06-28 | pfioctl() must make sure pfioctl_rw() gets unlocked before function returns. | Alexandr Nedvedicky
OK bluhm@
2023-06-28 | Revert r1.406 "Close all pf transactions before opening a new one in DIOCGETRULES." | Klemens Nanni
regress/sbin/pfctl panics with "rw_enter: pfioctl_rw locking against myself" as reported by bluhm on bugs@.
2023-06-28 | use refcnt API for multicast addresses, add tracepoint:refcnt:ifmaddr probe | Klemens Nanni
Replace hand-rolled reference counting with refcnt_init(9) and hook it up with a new dt(4) probe. OK bluhm mvs
2023-06-27 | Introduce M_IFGROUP type of memory allocation. M_TEMP is unreasonable | Vitaliy Makkoveev
for interface groups data allocations. ok kn claudio bluhm
2023-06-27 | Use shared net lock for DIOCGETIFACES | Klemens Nanni
snmpd(8) and 'pfctl -s Interfaces' dump pf's internal list of interfaces. pf's internal interface list is completely protected by the pf lock; pf lock assertions since pf_if.c r1.110 from over a week ago support this. pfi_*() iterate over net lock protected if_groups lists, but only to read, so downgrade from the exclusive write net lock to a shared read-only one. Feedback mvs OK sashan
2023-06-27 | Remove net lock from DIOC{SET,CLR}IFFLAG | Klemens Nanni
pf.conf's 'set skip on ifN' and 'pfctl -F all|Reset' set and clear flags, PFI_IFLAG_SKIP being the only flag. Nothing else in base uses these ioctls and internal state is protected by the pf lock already. OK sashan
2023-06-26 | Revert unrelated change that sneaked into the pf_ioctl.c commit. | Claudio Jeker
2023-06-26 | Close all pf transactions before opening a new one in DIOCGETRULES. | Claudio Jeker
Processes like snmpd or systat open pf(4) once and then issue many DIOCGETRULES calls over their runtime. This accumulates many pf_trans structs over their lifetime. At some point the kernel runs out of memory because of that. By closing all transactions before creating a new one, long-lived processes no longer leak transactions. This probably needs further refinement once more transaction types are added, but for now this solves the problem. Problem found by florian@ OK sashan@ kn@
2023-06-12 | Move nd6_ifdetach() out of netlock. At this point, the interface is | Vitaliy Makkoveev
disconnected from everywhere. No need to hold netlock for dummy 'nd_ifinfo' release. Netlock is also not needed for TAILQ_EMPTY(&ifp->if_*hooks) assertions. ok kn bluhm
2023-06-05 | Do not calculate IP, TCP, UDP checksums on loopback interface. | Alexander Bluhm
Packets sent over loopback got their checksums calculated twice. In the output path they were filled in and during TCP/IP input all checksums were calculated again to be compared with the previous result. Avoid this by claiming that lo(4) supports hardware checksum offloading. For each packet convert the flag that the checksum should be calculated to the flag that it has been checked successfully. Keep the flag that it should be calculated for the case that it may be bridged or forwarded later. A drawback is that "tcpdump -ni lo0 -v" reports invalid checksum. But that is the same with physical interfaces and hardware offloading. OK dlg@
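The flag conversion described above can be sketched as follows. The flag names and bit values are hypothetical placeholders chosen for illustration, not the kernel's mbuf checksum flags; the point is only that each "should be calculated" flag gains a matching "checked successfully" flag while the original flag is kept for later bridging or forwarding.

```c
#include <stdint.h>

/* Hypothetical flag bits; the real mbuf flags have their own names and values. */
#define X_IPV4_CSUM_OUT		0x0001
#define X_TCP_CSUM_OUT		0x0002
#define X_UDP_CSUM_OUT		0x0004
#define X_IPV4_CSUM_IN_OK	0x0008
#define X_TCP_CSUM_IN_OK	0x0010
#define X_UDP_CSUM_IN_OK	0x0020

/*
 * For a packet looped back locally: mark every checksum that was only
 * requested on output as already verified on input.  The output flags
 * stay set in case the packet is bridged or forwarded later.
 */
uint16_t
lo_csum_flags(uint16_t flags)
{
	if (flags & X_IPV4_CSUM_OUT)
		flags |= X_IPV4_CSUM_IN_OK;
	if (flags & X_TCP_CSUM_OUT)
		flags |= X_TCP_CSUM_IN_OK;
	if (flags & X_UDP_CSUM_OUT)
		flags |= X_UDP_CSUM_IN_OK;
	return flags;
}
```

Since no checksum is ever written into the packet in this scheme, a capture on the loopback interface sees stale checksum fields, which matches the tcpdump drawback mentioned in the commit message.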
2023-06-05 | pfsync_update_state() is too paranoid about pf_state::pfsync_state. | Alexandr Nedvedicky
For example, it should not be surprised if the caller asks to remove a state from the pfsync queue that has already been removed. That kind of race is sorted out later when pfsync_update_state() calls pfsync_q_ins()/pfsync_q_del(). The change relaxes pfsync_update_state() so that it panics only on an unknown sync_state value. OK dlg@
2023-06-05 | pf_remove_state() should not attempt to remove state which | Alexandr Nedvedicky
is already removed. OK dlg@
2023-06-01 | Add support for wireguard peer descriptions | Klemens Nanni
"wgdescr[iption] foo" to label one peer (amongst many) on a wg(4) interface, "-wgdescr[iption]" or "wgdescr ''" to remove the label, completely analogous to existing interface descriptions. Idea/initial diff from Mikolaj Kucharski (OK sthen) Tests/prodded by Hrvoje Popovski Tweaks/manual bits from me Feedback deraadt sthen mvs claudio OK claudio
2023-05-30 | add net_tq_barriers | David Gwynne
this waits once for something to end in all the net tqs. ok claudio@
2023-05-30 | spelling | Jonathan Gray
ok jmc@ guenther@ tb@
2023-05-26 | Remove net lock from DIOC{S,G}ETLIMIT | Klemens Nanni
Grab the pf lock for pf_pool_limits[] in pfsync such that all access is covered by the pf lock; document accordingly. Hard memory pool limits don't need the net lock for protection, pool(9)s have their own internal lock and the pf lock fully covers limit values. (pf_pool_limits[] access in DIOCXCOMMIT remains under pf *and net* lock until the rest in there gets pulled out of the net lock.) OK sashan
2023-05-18 | Assert pf lock on interface handling | Klemens Nanni
Make sure that all hooks into pf's internal list of interfaces do happen with the pf lock held, i.e. nothing relies on the net lock alone, so that later unlocking can then rely on it. Full i386 regress (thanks bluhm) and daily usage are fine OK sashan
2023-05-18 | sc_st_mtx is not sufficient protection to move state around | Alexandr Nedvedicky
pfsync(4) queues. We also need to grab pf_state::mtx to put/remove state instance safely from pfsync(4) queue. The issue has been pointed out by bluhm@. Patch survived testing done by hrvoje@ OK dlg@
2023-05-17 | fix stoeplitz_hash_h32. | David Gwynne
discussed with and ok tb@
2023-05-16 | Use separate IFCAPs for LRO and TSO. | Jan Klemkow
This diff introduces separate capabilities for TCP offloading. We split this into LRO (large receive offloading) and TSO (TCP segmentation offloading). LRO can be turned on/off via the tcprecvoffload option of ifconfig and is not inherited by sub interfaces. TSO is inherited by sub interfaces to signal this hardware offloading capability to the network stack. With tweaks from bluhm, claudio and dlg ok bluhm, claudio
2023-05-15 | Implement the TCP/IP layer for hardware TCP segmentation offload. | Alexander Bluhm
If the driver of a network interface claims to support TSO, do not chop the packet in software, but pass it down to the interface layer. Precalculate parts of the pseudo header checksum, but without the packet length. The length of all generated smaller packets is not known yet. Driver and hardware will use the mbuf packet header field ph_mss to calculate it and update the checksum. Introduce separate flags IFCAP_TSOv4 and IFCAP_TSOv6 as hardware might support only one protocol family. The old flag IFXF_TSO is only relevant for large receive offload. It is misnamed, but keep that for now. Note that drivers do not set TSO capabilities yet. Also the ifconfig flags and pseudo interfaces capabilities will be done separately. So this commit should not change behavior. heavily based on the work from jan@; OK sashan@
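A small sketch of why the pseudo header checksum can be precalculated without the length. The function names are hypothetical and the values are in host byte order for readability; this is not the kernel's code. Ones' complement addition lets the length be folded in later, once each segment's size is known, with the same result as summing everything at once.

```c
#include <stdint.h>

/* Fold carries back into the low 16 bits (end-around carry). */
static uint32_t
csum_fold(uint32_t sum)
{
	while (sum >> 16)
		sum = (sum & 0xffff) + (sum >> 16);
	return sum;
}

/* Partial IPv4 pseudo header sum: addresses and protocol, no length. */
uint16_t
pseudo_partial(uint32_t src, uint32_t dst, uint8_t proto)
{
	uint32_t sum = (src >> 16) + (src & 0xffff) +
	    (dst >> 16) + (dst & 0xffff) + proto;

	return (uint16_t)csum_fold(sum);
}

/* Later step: add the per-segment TCP length to the partial sum. */
uint16_t
pseudo_add_len(uint16_t partial, uint16_t tcplen)
{
	return (uint16_t)csum_fold((uint32_t)partial + tcplen);
}

/* Reference: the full pseudo header sum computed in one pass. */
uint16_t
pseudo_full(uint32_t src, uint32_t dst, uint8_t proto, uint16_t tcplen)
{
	uint32_t sum = (src >> 16) + (src & 0xffff) +
	    (dst >> 16) + (dst & 0xffff) + proto + tcplen;

	return (uint16_t)csum_fold(sum);
}
```

For any segment length, `pseudo_add_len(pseudo_partial(...), len)` equals `pseudo_full(..., len)`, which is what allows the stack to do the address and protocol part once and leave the length to whoever creates the individual segments.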
2023-05-14 | give softnet threads unique names by suffixing softnet with their index. | David Gwynne
ie, you'll see softnet0, softnet1, etc in top/ps/etc now instead of just softnet on these threads. this is done by wrapping the taskq and name up in a softnet struct. ok patrick@ bluhm@ mvs@ kn@ sashan@
2023-05-13 | Instead of implementing IPv4 header checksum creation everywhere, | Alexander Bluhm
introduce in_hdr_cksum_out(). It is used like in_proto_cksum_out(). OK claudio@
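For reference, the IPv4 header checksum that such a helper produces is the standard Internet checksum over the header with the checksum field treated as zero. The function below is a standalone sketch with a hypothetical name, operating on a raw byte buffer. It is not in_hdr_cksum_out(), whose signature is not shown in this log.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Compute the IPv4 header checksum over `hlen` header bytes.
 * Bytes 10 and 11 (the checksum field) are skipped, i.e. treated as zero.
 * The result is returned in host byte order.
 */
uint16_t
ipv4_hdr_cksum(const uint8_t *hdr, size_t hlen)
{
	uint32_t sum = 0;
	size_t i;

	assert(hlen >= 20 && (hlen % 2) == 0);

	for (i = 0; i < hlen; i += 2) {
		if (i == 10)
			continue;	/* checksum field counts as zero */
		sum += ((uint32_t)hdr[i] << 8) | hdr[i + 1];
	}
	/* Fold carries into the low 16 bits, then take the complement. */
	while (sum >> 16)
		sum = (sum & 0xffff) + (sum >> 16);
	return (uint16_t)~sum;
}
```

For the 20-byte header 45 00 00 73 00 00 40 00 40 11 00 00 c0 a8 00 01 c0 a8 00 c7 the function returns 0xb861.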
2023-05-11 | pools are always initialised, zap overcautious NULL check | Klemens Nanni
All pools are init'd after pfattach(), none is ever destroyed, so struct pf_pool_limit's .pp always points to valid pools. Drop a check for the impossible from twenty years ago. OK sashan dlg
2023-05-10 | nat-to may fail to insert state due to conflict on chosen source | Alexandr Nedvedicky
port number. This is typically indicated by a 'wire key attach failed on...' message when pf(4) debugging is enabled. The problem is caused by a glitch in pf_get_sport(), which fails to discover the conflict in advance. To fix it we must also calculate the toeplitz hash in pf_get_sport() to initialize the lookup key properly. The bug was kindly reported by joosepm _von_ gmail _dot_ com OK dlg@
2023-05-10 | Implement TCP send offloading, for now in software only. This is | Alexander Bluhm
meant as a fallback if network hardware does not support TSO. Driver support is still work in progress. TCP output generates large packets. In IP output the packet is chopped to TCP maximum segment size. This reduces the CPU cycles used by pf. The regular output could be assisted by hardware later, but pf route-to and IPsec need the software fallback in general. For performance comparison or to work around possible bugs, sysctl net.inet.tcp.tso=0 disables the feature. netstat -s -p tcp shows TSO counter with chopped and generated packets. based on work from jan@ tested by jmc@ jan@ Hrvoje Popovski OK jan@ claudio@
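The chopping step can be sketched as plain arithmetic on payload lengths. The helpers below are hypothetical and only compute the segment sizes; the real code also has to split mbufs and fix up the TCP and IP headers of every generated packet.

```c
#include <assert.h>
#include <stddef.h>

/* Number of segments needed for `len` payload bytes at `mss` bytes each. */
size_t
tso_segment_count(size_t len, size_t mss)
{
	assert(mss > 0);
	return len / mss + (len % mss != 0);
}

/*
 * Write the payload length of each segment into `out`.
 * Returns the number of segments, or 0 if `out` is too small
 * (or if there is no payload at all).
 */
size_t
tso_chop(size_t len, size_t mss, size_t *out, size_t outcap)
{
	size_t n = tso_segment_count(len, mss);
	size_t i;

	if (n > outcap)
		return 0;
	for (i = 0; i < n; i++) {
		size_t remaining = len - i * mss;

		out[i] = remaining < mss ? remaining : mss;
	}
	return n;
}
```

A 4000 byte payload with an MSS of 1460 yields three segments of 1460, 1460 and 1080 bytes. pf sees the single large packet rather than the three small ones, which is where the CPU saving mentioned above comes from.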
2023-05-08 | fix up some formatting in the pf_state_list comment. | David Gwynne
2023-05-08 | The call to in_proto_cksum_out() is only needed before the packet | Alexander Bluhm
is passed to ifp->if_output(). The fragment code has its own checksum calculation and the other paths end in goto bad. OK claudio@
2023-05-07 | In preparation for TSO in software, clean up the fragment code. Use | Alexander Bluhm
if_output_ml() to send mbuf lists to interfaces. This can be used for TSO, fragments, ARP and ND6. Rename variable fml to ml. In pf_route6() split the if else block. Put the safety check (hlen + firstlen < tlen) into ip_fragment(). It makes the code correct in case the packet is too short to be fragmented. This should not happen, but other functions also have this logic. No functional change. OK sashan@
2023-05-07 | Remove net lock from DIOCOSFP{FLUSH,ADD,GET} aka. OS fingerprinting | Klemens Nanni
pf_osfp.c contains all the locking for these three ioctls, everything is protected by the pf lock; assert/document it and inline access to the global list to eliminate useless function variables. OK bluhm sashan
2023-05-03 | Remove net lock from DIOCGETRULESET and DIOCGETRULESETS | Klemens Nanni
Both walk the list of rulesets aka. anchors, to yield a total count and specific anchor name, respectively. Same access, different copy out. pf_anchor_global are contained within pf_ioctl.c and pf_ruleset.c and fully protected by the pf lock, as is pf_main_ruleset and its pf.c usage. Rely on and assert for pf lock alone. 'pfctl -sr' on 60k unique rules gets noticeably faster, around 2.1s instead of 3.5s. OK sashan
2023-04-29 | Remove net lock from DIOCGETQUEUE | Klemens Nanni
Same logic and argument as for the parent *S ioctl unlocked in r1.400, might as well have committed them together: Both ticket and number of queues stem from the pf_queues_active list which is effectively static to pf_ioctl.c and fully protected by the pf lock. OK sashan
2023-04-28 | Add rtentry refcnt type to dt(4). | Vitaliy Makkoveev
ok bluhm@
2023-04-28 | remove superfluous/invalid KASSERT() in pfsync_q_del(). | Alexandr Nedvedicky
pointed and OK bluhm@
2023-04-28 | This change speeds up DIOCGETRULE ioctl(2) which pfctl(8) uses to | Alexandr Nedvedicky
retrieve rules from kernel. The current implementation requires about O((n^2)/2) operations to read the complete rule set, because each DIOCGETRULE operation must iterate over the previous n rules to find the (n + 1)-th rule to read. To address the issue, the diff introduces a pf_trans structure to keep a pointer to the next rule to read, so the reading process does not need to iterate from the beginning of the rule set to reach the next rule. All transactions opened by a process get closed either when the process is done (reads all rules) or when the /dev/pf device is closed. The diff also comes with lots of improvements from dlg@ and kn@ OK dlg@, kn@
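The difference between the two access patterns can be sketched with a singly linked list. The types and functions are hypothetical stand-ins, not pf's rule queue or the pf_trans structure; a step counter is included only to make the cost visible.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical rule list node. */
struct rule_sketch {
	int nr;
	struct rule_sketch *next;
};

/* Old pattern: every lookup walks from the head to the n-th rule. */
const struct rule_sketch *
rule_by_index(const struct rule_sketch *head, unsigned int n,
    unsigned long *steps)
{
	const struct rule_sketch *r = head;

	assert(steps != NULL);
	while (r != NULL && n-- > 0) {
		r = r->next;
		(*steps)++;
	}
	return r;
}

/* New pattern: a cursor remembers the next rule to hand out. */
struct rule_cursor {
	const struct rule_sketch *next;
};

const struct rule_sketch *
cursor_next(struct rule_cursor *c, unsigned long *steps)
{
	const struct rule_sketch *r = c->next;

	assert(steps != NULL);
	if (r != NULL) {
		c->next = r->next;
		(*steps)++;
	}
	return r;
}
```

Reading all n rules by index costs 0 + 1 + ... + (n - 1) = n(n - 1)/2 link traversals, while the cursor needs n. For four rules that is 6 versus 4; for tens of thousands of rules the gap dominates the run time.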
2023-04-28 | Relax the "pass all" rule so all forms of neighbor advertisements are allowed | Peter Hessler
in either direction. This more closely matches the IPv4 ARP behaviour. From sashan@ discussed with kn@ deraadt@
2023-04-28 | Remove net lock from DIOCGETQUEUES | Klemens Nanni
Both ticket and number of queues stem from the pf_queues_active list which is effectively static to pf_ioctl.c and fully protected by the pf lock. OK sashan
2023-04-27 | Remove kernel lock from rtfree(9). | Vitaliy Makkoveev
Route timers and route labels are protected by corresponding mutexes. `ifa' uses reference counting for protection. rt_mpls_clear() could be called lockless because this is the last reference of `rt'. ok bluhm@ kn@
2023-04-27 | Remove net lock from DIOCGETTIMEOUT | Klemens Nanni
'pfctl -s timeouts' values are only used inside of pf, entirely protected by the pf lock through the ioctl interface; the net lock is useless. Previous attempts to remove net lock usage showed that the pf lock cannot yet entirely replace it, so start with small pieces like this one. Contrary to IPv4/6 read-only ioctls, some pf ioctls without FWRITE flag do modify internal pf state, which is not entirely obvious when approached from the ioctl layer. OK sashan dlg