path: root/sys/net/if.c
Age  Commit message  Author
2023-11-11  Pass constant struct sockaddr to interface lookup functions.  (Alexander Bluhm)
OK mvs@
2023-11-10  Make ifq and ifiq interface MP safe.  (Alexander Bluhm)
Rename ifq_set_maxlen() to ifq_init_maxlen(). This function neither uses WRITE_ONCE() nor a mutex and is called before the ifq mutex is initialized. The new name expresses that it should be used only during interface attach when there is no concurrency. Protect ifq_len(), ifq_empty(), ifiq_len(), and ifiq_empty() with READ_ONCE(). They can be used without lock as they only read a single integer. OK dlg@
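A minimal sketch of the lockless read pattern described above (OpenBSD-ish; the real accessors live in if_var.h and may differ in detail):

    /* Reading a single aligned integer needs no mutex; READ_ONCE()
     * keeps the compiler from tearing or caching the load. */
    static inline unsigned int
    ifq_len(struct ifqueue *ifq)
    {
        return (READ_ONCE(ifq->ifq_len));
    }

    static inline int
    ifq_empty(struct ifqueue *ifq)
    {
        return (ifq_len(ifq) == 0);
    }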
2023-10-27  Forward TCP LRO disabling to parent devices.  (Jan Klemkow)
Also disable TCP LRO on bridged vlan(4) and by default for bpe(4), nvgre(4) and vxlan(4). ok bluhm@
2023-09-16  Allow counters_read(9) to take an optional scratch buffer.  (Martin Pieuchot)
Using a scratch buffer makes it possible to take a consistent snapshot of per-CPU counters without having to allocate memory. Makes the ddb(4) "show uvmexp" command work in OOM situations. ok kn@, mvs@, cheloha@
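How a caller might use the optional scratch buffer (a sketch; the 4-argument form and the ifc_ncounters bound are assumptions based on this commit, not copied from the tree):

    /* Snapshot per-CPU interface counters without allocating:
     * both buffers are preallocated, so this works even when
     * malloc(9) would fail, e.g. from ddb(4). */
    uint64_t output[ifc_ncounters];
    static uint64_t scratch[ifc_ncounters];    /* static: no allocation */

    counters_read(ifp->if_counters, output, ifc_ncounters, scratch);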
2023-08-18  maximium -> maximum  (Jonathan Gray)
2023-07-07  Keep mbuf header field ph_mss during loopback TCP with LRO/TSO.  (Alexander Bluhm)
When M_TCP_TSO is preserved, also keep ph_mss. In lo(4) this logic was missing. This may be relevant only for weird pf configs that forward from loopback. OK mvs@ jan@
2023-07-07  Fix path MTU discovery for TCP LRO/TSO when forwarding.  (Alexander Bluhm)
When doing LRO (Large Receive Offload), the drivers, currently ix(4) and lo(4) only, record an upper bound of the size of the original packets in ph_mss. When sending, either stack or hardware must chop the packets with TSO (TCP Segmentation Offload) to that size. That means we have to call tcp_if_output_tso() before ifp->if_output(). Put that logic into if_output_tso() to avoid code duplication. As TCP packets on the wire do not get larger that way, path MTU discovery should still work. tested by and OK jan@
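A sketch of the call ordering this describes (the tcp_if_output_tso() argument list here is a guess, not the committed prototype):

    int
    if_output_tso(struct ifnet *ifp, struct mbuf **mp, struct sockaddr *dst,
        struct rtentry *rt, u_int mtu)
    {
        int error;

        /* LRO may have built packets larger than the MTU; chop
         * them back to ph_mss before ifp->if_output() sees them. */
        error = tcp_if_output_tso(ifp, mp, dst, rt, IFCAP_TSOv4, mtu);
        if (error != 0 || *mp == NULL)
            return (error);

        /* small enough, or the hardware will segment it */
        return ((*ifp->if_output)(ifp, *mp, dst, rt));
    }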
2023-07-06  big update to pfsync to try and clean up locking in particular.  (David Gwynne)
moving pf forward has been a real struggle, and pfsync has been a constant source of pain. we have been papering over the problems for a while now, but it reached the point that it needed a fundamental restructure, which is what this diff is.

the big headliner changes in this diff are:

- pfsync specific locks: this is the whole reason for this diff. rather than rely on NET_LOCK or KERNEL_LOCK or whatever, pfsync now has its own locks to protect its internal data structures. this is important because pfsync runs a bunch of timeouts and tasks to push pfsync packets out on the wire, or when it's handling requests generated by incoming pfsync packets, both of which happen outside pf itself running. having pfsync specific locks around pfsync data structures makes the mutations of these data structures a lot more explicit and auditable.

- partitioning: to enable future parallelisation of the network stack, this rewrite includes support for pfsync to partition states into different "slices". these slices run independently, ie, the states collected by one slice are serialised into a separate packet to the states collected and serialised by another slice. states are mapped to pfsync slices based on the pf state hash, which is the same hash that the rest of the network stack and multiq hardware uses.

- no more pfsync called from netisr: pfsync used to be called from netisr to try and bundle packets, but now that there's multiple pfsync slices this doesn't make sense. instead it uses tasks in softnet tqs.

- improved bulk transfer handling: there's shiny new state machines around both the bulk transmit and receive handling. pfsync used to do horrible things to carp demotion counters, but now it is very predictable and returns the counters back where they started.

- better tdb handling: the tdb handling was pretty hairy, but hrvoje has kicked this around a lot with ipsec and sasyncd and we've found and fixed a bunch of issues as a result of that testing.

- mpsafe pf state purges: this was committed previously, but because the locks pfsync relied on weren't clear this just caused a ton of bugs. as part of this diff it's now reliable, and moves a big chunk of work out from under KERNEL_LOCK, which in turn improves the responsiveness and throughput of a firewall even if you're not using pfsync.

there's a bunch of other little changes along the way, but the above are the big ones.

hrvoje has done performance testing with this diff and notes a big improvement when pfsync is not in use. performance when pfsync is enabled is about the same, but im hoping the slices means we can scale along with pf as it improves.

lots (months) of testing by me and hrvoje on pfsync boxes
tests and ok sashan@
deraadt@ says this is a good time to put it in
2023-07-04  Check for interface type ethernet before calling ether_brport_isset()  (Jan Klemkow)
Pointed out by bluhm. ok bluhm@
2023-07-02  Use TSO and LRO on the loopback interface to transfer TCP faster.  (Alexander Bluhm)
If tcplro is activated on lo(4), ignore the MTU with TCP packets. They are passed along with the information that they have to be chopped in case they are forwarded later. New netstat(1) counter shows that software LRO is in effect. The feature is currently turned off by default. tested by jan@; OK claudio@ jan@
2023-06-27  Introduce M_IFGROUP type of memory allocation.  (Vitaliy Makkoveev)
M_TEMP is unreasonable for interface group data allocations. ok kn claudio bluhm
2023-06-12  Move nd6_ifdetach() out of netlock.  (Vitaliy Makkoveev)
At this point, the interface is disconnected from everywhere. No need to hold the netlock for the dummy 'nd_ifinfo' release. Netlock is also not needed for the TAILQ_EMPTY(&ifp->if_*hooks) assertions. ok kn bluhm
2023-06-05  Do not calculate IP, TCP, UDP checksums on loopback interface.  (Alexander Bluhm)
Packets sent over loopback got their checksums calculated twice. In the output path they were filled in and during TCP/IP input all checksums were calculated again to be compared with the previous result. Avoid this by claiming that lo(4) supports hardware checksum offloading. For each packet convert the flag that the checksum should be calculated to the flag that it has been checked successfully. Keep the flag that it should be calculated for the case that it may be bridged or forwarded later. A drawback is that "tcpdump -ni lo0 -v" reports invalid checksum. But that is the same with physical interfaces and hardware offloading. OK dlg@
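The flag conversion boils down to something like this sketch (using the stock mbuf checksum flags; the exact spot in the loopback input path may differ):

    /* Claim the checksums are already verified instead of
     * computing them twice.  The *_CSUM_OUT bits stay set in
     * case the packet is later bridged or forwarded to real
     * hardware. */
    if (m->m_pkthdr.csum_flags & M_IPV4_CSUM_OUT)
        m->m_pkthdr.csum_flags |= M_IPV4_CSUM_IN_OK;
    if (m->m_pkthdr.csum_flags & M_TCP_CSUM_OUT)
        m->m_pkthdr.csum_flags |= M_TCP_CSUM_IN_OK;
    if (m->m_pkthdr.csum_flags & M_UDP_CSUM_OUT)
        m->m_pkthdr.csum_flags |= M_UDP_CSUM_IN_OK;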
2023-05-30  add net_tq_barriers  (David Gwynne)
this waits once for something to end in all the net tqs. ok claudio@
2023-05-16  Use separate IFCAPs for LRO and TSO.  (Jan Klemkow)
This diff introduces separate capabilities for TCP offloading. We split this into LRO (large receive offloading) and TSO (TCP segmentation offloading). LRO can be turned on/off via the tcprecvoffload option of ifconfig and is not inherited by sub-interfaces. TSO is inherited by sub-interfaces to signal this hardware offloading capability to the network stack. With tweaks from bluhm, claudio and dlg. ok bluhm, claudio
2023-05-14  give softnet threads unique names by suffixing softnet with their index.  (David Gwynne)
ie, you'll see softnet0, softnet1, etc in top/ps/etc now instead of just softnet on these threads. this is done by wrapping the taskq and name up in a softnet struct. ok patrick@ bluhm@ mvs@ kn@ sashan@
2023-05-07  In preparation for TSO in software, clean up the fragment code.  (Alexander Bluhm)
Use if_output_ml() to send mbuf lists to interfaces. This can be used for TSO, fragments, ARP and ND6. Rename variable fml to ml. In pf_route6() split the if-else block. Put the safety check (hlen + firstlen < tlen) into ip_fragment(). It makes the code correct in case the packet is too short to be fragmented. This should not happen, but other functions also have this logic. No functional change. OK sashan@
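A sketch of what an if_output_ml() style helper looks like (mbuf_list API as in mbuf(9); error handling simplified):

    int
    if_output_ml(struct ifnet *ifp, struct mbuf_list *ml,
        struct sockaddr *dst, struct rtentry *rt)
    {
        struct mbuf *m;
        int error = 0;

        while ((m = ml_dequeue(ml)) != NULL) {
            error = (*ifp->if_output)(ifp, m, dst, rt);
            if (error)
                break;
        }
        ml_purge(ml);    /* drop whatever is left after an error */

        return (error);
    }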
2023-04-26  Introduce `rtlabel_mtx' mutex(9) to protect route labels storage.  (Vitaliy Makkoveev)
Until now, kernel and net locks were held in various combinations to protect it. We don't want to put the kernel lock in all those places. Netlock also can't be used because rtfree(9), which calls rtlabel_unref(), has unknown netlock state within. This new `rtlabel_mtx' mutex(9) protects the `rt_labels' list and `label' entry dereference. Since we don't export the 'rt_label' structure, keep this lock private to net/route.c. For this reason rtlabel_id2name() now copies the label string to an externally passed buffer instead of returning the address of `rt_labels' list data. This is the way rtlabel_id2sa() already works. ok bluhm@
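The copy-out pattern for rtlabel_id2name() then looks roughly like this (list and field names are illustrative, since the `rt_label' structure is not exported):

    char *
    rtlabel_id2name(u_int16_t id, char *label, size_t len)
    {
        struct rt_label *l;
        char *name = NULL;

        mtx_enter(&rtlabel_mtx);
        TAILQ_FOREACH(l, &rt_labels, rtl_entry) {
            if (l->rtl_id == id) {
                /* copy out: no pointer into the
                 * mutex-protected list escapes */
                strlcpy(label, l->rtl_name, len);
                name = label;
                break;
            }
        }
        mtx_leave(&rtlabel_mtx);

        return (name);
    }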
2023-04-26  Also set TSO flag on vlan interfaces.  (Jan Klemkow)
with tweaks from bluhm, claudio and dlg
"I'm fine with it" from claudio
"looks good to me" from dlg
ok bluhm
2023-04-22  revert vlan(4) inherits TSO flags  (David Gwynne)
tb reports amd64 RAMDISK doesn't build with it. also, vlan_flags_from_parent doesn't look right. it iterates over ifnetlist, which is all interfaces in the system, but appears to assume they're all vlan interfaces and so uses a vlan_softc * to inspect their if_softc pointers.
2023-04-21  vlan(4) inherits TSO flags  (Jan Klemkow)
tested by Hrvoje Popovski
with tweaks from bluhm and claudio
encouraged by deraadt
ok bluhm
2023-04-18  Remove kernel lock from ifa_ifwithaddr() and ifa_ifwithdstaddr().  (Vitaliy Makkoveev)
Netlock protects `if_list', `ifa_list' and the returned `ifa' dereference, so put a netlock assertion within. Please note, rtable_setsource() doesn't destroy the data pointed to by `ar_source'. This is `ifa_addr' data belonging to `ifa', and the exclusive netlock is required to destroy it. So the kernel lock is not required within rt_setsource(). Take the netlock in the rt_setsource() caller to make the `ifa' dereference safe. Suggestions and ok by bluhm@
2023-04-18  Document `ifnetlist' locking.  (Vitaliy Makkoveev)
We use both the kernel and net lock to protect `ifnetlist'. This means we do modifications with both locks held, but for read-only access only one lock is required. Some places doing an `ifnetlist' foreach loop are protected by the kernel lock alone and no context switch can be introduced there. These are the exception, so an "XXXSMP:" comment is added. Proposed and ok by bluhm@
2023-04-08  Move rtm_ifannounce(IFAN_DEPARTURE) outside netlock within if_detach().  (Vitaliy Makkoveev)
This is the mbuf(9) allocation and broadcast transmission for PF_ROUTE sockets; the netlock is not required here. ok bluhm@
2023-04-07  Remove kernel locks from the ARP input path.  (Alexander Bluhm)
Caller if_netisr() grabs the exclusive netlock and that is sufficient for in_arpinput() and arpcache(). with kn@; OK mvs@; tested by Hrvoje Popovski
2023-04-05  ARP has a queue of packets that should be sent after name resolution.  (Alexander Bluhm)
ND6 only held a single packet. Unify the logic and add an mbuf hold queue to struct llinfo_nd6. This is MP safe and queue limits are tracked with atomic operations. New function if_mqoutput() has common code for ARP and ND6. ln_saddr6 holds the source address of the requesting packet. That is easier than fiddling with the mbuf queue in nd6_ns_output(). OK kn@
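A sketch of the hold-queue drain such a common if_mqoutput() performs (signature assumed; mq_delist() and ml_dequeue() are the stock mbuf(9) primitives):

    void
    if_mqoutput(struct ifnet *ifp, struct mbuf_queue *mq,
        struct sockaddr *dst, struct rtentry *rt)
    {
        struct mbuf_list ml;
        struct mbuf *m;

        /* atomically take everything held during resolution;
         * mbuf_queue enqueue/dequeue are MP safe and enforce
         * the queue limit */
        mq_delist(mq, &ml);
        while ((m = ml_dequeue(&ml)) != NULL)
            (*ifp->if_output)(ifp, m, dst, rt);
    }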
2023-03-07  Avoid enabling TSO on interfaces which are already attached to a bridge.  (Jan Klemkow)
with tweaks from claudio and deraadt
ok claudio, bluhm
2023-02-27  Turn off TSO if interface is added to layer 2 devices.  (Jan Klemkow)
ok bluhm@, claudio@
2022-11-23  Let nd6_if{at,de}tach() be void and take an ifp argument  (Klemens Nanni)
Do it like the rest of at/detach routines which modify a struct ifnet pointer without returning anything. OK mvs
2022-11-23  Remove unused struct ifnet's *if_afdata[] and struct domain's dom_if{at,de}tach()  (Klemens Nanni)
Both were made obsolete through struct ifnet's previous *if_nd addition. IPv6 Neighbour Discovery handles per-interface data directly; nothing else uses this generic domain API anymore. Both are visible outside of _KERNEL, but nothing in base uses them, either. OK bluhm mvs claudio
2022-11-23  Add *if_nd to struct ifnet, call nd6_if{at,de}tach() directly  (Klemens Nanni)
*if_afdata[] and struct domain's dom_if{at,de}tach() are only used with IPv6 Neighbour Discovery in6_dom{at,de}tach(), which allocate/init and free single struct nd_ifinfo. Set up a new ND-specific *if_nd member directly to avoid yet another layer of indirection and thus make the generic domain API obsolete. The per-interface data is only accessed in nd6.c and nd6_nbr.c through the ND_IFINFO() macro; it is allocated and freed exactly once during interface at/detach, so document it as [I]mmutable. OK bluhm mvs claudio
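In outline, the indirection removal amounts to this (sketched from the commit text, not the committed header):

    struct ifnet {
        /* ... */
        void    *if_nd;        /* [I] neighbour discovery data */
    };

    /* accessed only in nd6.c and nd6_nbr.c */
    #define ND_IFINFO(ifp)    ((struct nd_ifinfo *)(ifp)->if_nd)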
2022-11-14  Unlock SIOCGIFG{MEMB,ATTR,LIST}  (Klemens Nanni)
The global interface group list is also protected by the net lock and all access to it (all within if.c) takes it accordingly. Getting all
- members of a group (SIOCGIFGMEMB),
- attributes of a group (SIOCGIFGATTR),
- groups (SIOCGIFGLIST)
are each read-only operations on the global interface group `ifg_head'. The global interface list `ifnetlist' or its per-interface group lists are not used in these ioctls. OK mvs
2022-11-14  Unlock SIOCGIFCONF  (Klemens Nanni)
As netintro(4) explains, this copies a bunch of data from the global interface list as well as its per-interface address lists. All of this is never written to by ifconf(), protected by the net lock and documented as such in the struct comments already. OK mvs
2022-11-14  Document global interface group list locking  (Klemens Nanni)
The per-interface group list is protected by the net lock and already documented as such. The global interface group list `ifg_head' is also protected by the net lock and all access to it (all within if.c) takes it accordingly. Feedback OK mvs
2022-11-10  bring back r1.673: replace SRP with SMR in the if_idxmap.  (David Gwynne)
when i first wrote if_idxmap i didn't realise (and no one thought to tell me) that index 0 was special and means "no interface", so while here use the 0th slot in the interface map to store the length of the map instead of prepending the map with a length field. if_get() now special cases index 0 and returns NULL directly. this also means the size of the map is now always a power of 2, which is a nicer fit with what the kernel malloc provides. the problem with r1.673 that hrvoje popovski found was that attaching a lot of interfaces during autoconf would lock up, because growing the map called smr_barrier. the fix in this diff is to (ab)use the usedidx bitmap to store an smr_entry and defer the freeing of the interface pointer map with it. tested by hrvoje popovski. tweaks and ok visa@
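A sketch of the reader side this enables (SMR read section as in smr_read(9); the map layout follows the commit text and the reference-count helper is an assumption):

    struct ifnet *
    if_get(unsigned int index)
    {
        struct ifnet **map, *ifp = NULL;

        if (index == 0)        /* 0 means "no interface" */
            return (NULL);

        smr_read_enter();
        map = SMR_PTR_GET(&if_idxmap.map);
        if (index < (uintptr_t)map[0])    /* slot 0 stores the length */
            ifp = SMR_PTR_GET(&map[index]);
        if (ifp != NULL)
            if_ref(ifp);    /* assumed refcount helper */
        smr_read_leave();

        return (ifp);
    }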
2022-11-09  revert r1.673: replace SRP with SMR in the if_idxmap.  (David Gwynne)
if the map has to be reallocated during boot, there's an smr_barrier waiting for the old map to become unused. that barrier ends up waiting for cpus that aren't running yet because we haven't finished booting yet, so boot gets stuck. found by hrvoje popovski
2022-11-09  Recommit r1.669 "Unlock SIOCIFGCLONERS"  (Klemens Nanni)
OK mvs
2022-11-09  Push kernel lock from ifioctl() into ifioctl_get()  (Klemens Nanni)
Recommit these two together:
- r1.667 "Push kernel lock into ifioctl_get()" locked before the switch() without unlocking in its cases
- r1.668 "Push kernel lock inside ifioctl_get()" locked cases individually, as intended
I messed up splitting commits, but of course, Hrvoje managed to test a CVS checkout right in between those two. OK mpi mvs
2022-11-09  replace SRP with SMR in the if_idxmap.  (David Gwynne)
when i first wrote if_idxmap i didn't realise (and no one thought to tell me) that index 0 was special and means "no interface", so while here use the 0th slot in the interface map to store the length of the map instead of prepending the map with a length field. if_get() now special cases index 0 and returns NULL directly. this also means the size of the map is now always a power of 2, which is a nicer fit with what the kernel malloc provides. tweaks and ok visa@
2022-11-08  Revert lock changes inside ifioctl_get()  (Klemens Nanni)
WITNESS isn't happy with r1.667 "Push kernel lock into ifioctl_get()", so revert it (including r1.668 and r1.669 depending on it):

witness: userret: returning with the following locks held:
exclusive kernel_lock &kernel_lock r = 0 (0xffffffff82455f58)
  #0  witness_lock+0x311
  #1  ifioctl_get+0x2e
  #2  sys_ioctl+0x2c4
  #3  syscall+0x384
  #4  Xsyscall+0x128
panic: witness_warn
Stopped at db_enter+0x10: popq %rbp
    TID    PID  UID  PRFLAGS  PFLAGS  CPU  COMMAND
 * 70588  52613    0      0x3       0   4K  pfctl

So back to the drawing board while leaving documentation bits (r1.670). Thanks Hrvoje.
2022-11-08  Use four spaces not tabs on line break  (Klemens Nanni)
2022-11-08  Document ifc_list immutability  (Klemens Nanni)
Move it up to the comment explaining the different locks so it accounts for all structs. OK millert mvs
2022-11-08  Unlock SIOCIFGCLONERS  (Klemens Nanni)
ifconfig(8) -C is the only user in base and the if_clone_attach() comment explains how this list is being built during autoconf(9). After that it is only ever read. Multiple threads may traverse the list in parallel and reading the `int' count is atomic. OK mvs
2022-11-08  Push kernel lock inside ifioctl_get()  (Klemens Nanni)
After this mechanical move, I can unlock the individual SIOCG* in there. OK mvs
2022-11-08  Push kernel lock into ifioctl_get()  (Klemens Nanni)
Another mechanical diff without semantic changes to avoid churn in actual unlocking diffs. OK mpi
2022-11-08  Push kernel lock down into ifioctl()  (Klemens Nanni)
This is a mechanical diff without semantic changes, locking ioctls individually inside ifioctl() rather than all of them around it. This allows us to unlock ioctls one by one. OK mpi
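The shape of the change, sketched (the helper names here are hypothetical; the ioctls are real):

    int
    ifioctl(struct socket *so, u_long cmd, caddr_t data, struct proc *p)
    {
        int error;

        switch (cmd) {
        case SIOCSIFFLAGS:
            /* not yet unlocked: take the kernel lock
             * around just this case */
            KERNEL_LOCK();
            error = ifioctl_set_flags(data, p);    /* hypothetical */
            KERNEL_UNLOCK();
            break;
        case SIOCGIFGLIST:
            /* later commits can drop the lock here
             * one case at a time */
            error = ifioctl_get_glist(data);       /* hypothetical */
            break;
        default:
            error = ENOTTY;
            break;
        }

        return (error);
    }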
2022-09-08  Rename global ifnet TAILQ  (Klemens Nanni)
Naming the list like the struct itself makes for awful grepping. Call the global variable "ifnetlist" from now on. There used to be kvm(3) consumers in base picking up this symbol, but those have long been converted to other interfaces. A few potential ports users remain, same deal as sys/net/if_var.h r1.116 "Remove struct ifnet's unused if_switchport member": they get bumped. Previous users pointed out by deraadt OK bluhm
2022-09-02  Move PRU_CONTROL request to (*pru_control)().  (Vitaliy Makkoveev)
The 'proc *' arg is not used for the PRU_CONTROL request, so remove it from the pru_control() wrapper. Split out {tcp,udp}6_usrreqs from {tcp,udp}_usrreqs and use them for the inet6 case. ok guenther@ bluhm@
2022-08-13  Introduce the pru_*() wrappers for corresponding (*pr_usrreq)() calls.  (Vitaliy Makkoveev)
This is helpful for the upcoming split of (*pr_usrreq)() into multiple handlers, and right now it already makes the code more readable. Also add '#ifndef _SYS_SOCKETVAR_H_' guards to sys/socketvar.h. This prevents collisions when both sys/protosw.h and sys/socketvar.h are included together. Both the 'socket' and 'protosw' structures are required to be defined before the pru_*() wrappers, so we need to include sys/socketvar.h in sys/protosw.h. ok bluhm@
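A sketch of what such a wrapper looks like for PRU_CONTROL (following the classic 6-argument (*pr_usrreq)() shape; the details are assumptions, not the committed code):

    static inline int
    pru_control(struct socket *so, u_long cmd, caddr_t data,
        struct ifnet *ifp)
    {
        /* the old catch-all handler is still behind the wrapper;
         * cmd/data/ifp ride in the historic mbuf-pointer slots,
         * and the unused proc arg is dropped from the wrapper */
        return ((*so->so_proto->pr_usrreq)(so, PRU_CONTROL,
            (struct mbuf *)cmd, (struct mbuf *)data,
            (struct mbuf *)ifp, NULL));
    }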
2022-08-06  Clean up the netlock macros.  (Alexander Bluhm)
Merge NET_RLOCK_IN_SOFTNET and NET_RLOCK_IN_IOCTL, which have the same implementation. The R and W are hard to see; call the new macro NET_LOCK_SHARED. Rename the opposite assertion from NET_ASSERT_WLOCKED to NET_ASSERT_LOCKED_EXCLUSIVE. Update some outdated comments about net locking. OK mpi@ mvs@