path: root/sys/net/if_var.h
Age  Commit message  [Author]
2024-10-12  remove unneeded time.h include  [Jonathan Gray]
2023-12-23  Backout always allocate per-CPU statistics counters for network interface descriptor.  [Alexander Bluhm]
It panics during attach of em(4) device at boot.
2023-12-22  Always allocate per-CPU statistics counters for network interface descriptor.  [Vitaliy Makkoveev]
We have a mess in network interface statistics. Only pseudo drivers do per-CPU counters allocation, all other network devices use the old `if_data'. The network stack partially uses per-CPU counters and partially uses `if_data', but the protection is inconsistent: sometimes counters are accessed with exclusive netlock, sometimes with shared netlock, sometimes with kernel lock but without netlock, sometimes with other locks. To make network interface statistics more consistent, always allocate per-CPU counters at interface attachment time and use them instead of `if_data'. At this step only move counters allocation to the if_attach() internals. The `if_data' removal will be performed with the following diffs to make review and tests easier. ok bluhm
2023-11-11  Pass constant struct sockaddr to interface lookup functions.  [Alexander Bluhm]
OK mvs@
2023-07-07  Fix path MTU discovery for TCP LRO/TSO when forwarding.  [Alexander Bluhm]
When doing LRO (Large Receive Offload), the drivers, currently ix(4) and lo(4) only, record an upper bound of the size of the original packets in ph_mss. When sending, either stack or hardware must chop the packets with TSO (TCP Segmentation Offload) to that size. That means we have to call tcp_if_output_tso() before ifp->if_output(). Put that logic into if_output_tso() to avoid code duplication. As TCP packets on the wire do not get larger that way, path MTU discovery should still work. tested by and OK jan@
2023-06-28  use refcnt API for multicast addresses, add tracepoint:refcnt:ifmaddr probe  [Klemens Nanni]
Replace hand-rolled reference counting with refcnt_init(9) and hook it up with a new dt(4) probe. OK bluhm mvs
2023-05-30  spelling  [Jonathan Gray]
ok jmc@ guenther@ tb@
2023-05-07  In preparation for TSO in software, cleanup the fragment code.  [Alexander Bluhm]
Use if_output_ml() to send mbuf lists to interfaces. This can be used for TSO, fragments, ARP and ND6. Rename variable fml to ml. In pf_route6() split the if else block. Put the safety check (hlen + firstlen < tlen) into ip_fragment(). It makes the code correct in case the packet is too short to be fragmented. This should not happen, but other functions also have this logic. No functional change. OK sashan@
2023-04-18  Remove kernel lock from ifa_ifwithaddr() and ifa_ifwithdstaddr().  [Vitaliy Makkoveev]
Netlock protects `if_list', `ifa_list' and returned `ifa' dereference, so put netlock assertion within. Please note, rtable_setsource() doesn't destroy data pointed to by `ar_source'. This is the `ifa_addr' data that belongs to `ifa', and exclusive netlock is required to destroy it. So the kernel lock is not required within rt_setsource(). Take netlock in the rt_setsource() caller to make `ifa' dereference safe. Suggestions and ok by bluhm@
2023-04-18  Document `ifnetlist' locking.  [Vitaliy Makkoveev]
We use both kernel and net lock to protect `ifnetlist'. This means we do modification with both locks held, but for read-only access only one lock is required. Some places doing an `ifnetlist' foreach loop are protected by kernel lock and a context switch can't be introduced there. This is the exception, so an "XXXSMP:" comment was added. Proposed and ok by bluhm@
2023-04-05  ARP has a queue of packets that should be sent after name resolution.  [Alexander Bluhm]
ND6 did only hold a single packet. Unify the logic and add a mbuf hold queue to struct llinfo_nd6. This is MP safe and queue limits are tracked with atomic operations. New function if_mqoutput() has common code for ARP and ND6. ln_saddr6 holds the source address of the requesting packet. That is easier than fiddling with mbuf queue in nd6_ns_output(). OK kn@
2022-11-23  Remove unused struct ifnet's *if_afdata[] and struct domain's dom_if{at,de}tach()  [Klemens Nanni]
Both made obsolete through struct ifnet's previous *if_nd addition. IPv6 Neighbour Discovery handles per-interface data directly, nothing else uses this generic domain API anymore. Outside of _KERNEL, but nothing in base uses them, either. OK bluhm mvs claudio
2022-11-23  Add *if_nd to struct ifnet, call nd6_if{at,de}tach() directly  [Klemens Nanni]
*if_afdata[] and struct domain's dom_if{at,de}tach() are only used with IPv6 Neighbour Discovery in6_dom{at,de}tach(), which allocate/init and free single struct nd_ifinfo. Set up a new ND-specific *if_nd member directly to avoid yet another layer of indirection and thus make the generic domain API obsolete. The per-interface data is only accessed in nd6.c and nd6_nbr.c through the ND_IFINFO() macro; it is allocated and freed exactly once during interface at/detach, so document it as [I]mmutable. OK bluhm mvs claudio
2022-11-14  Document global interface group list locking  [Klemens Nanni]
The per-interface group list is protected by the net lock and already documented as such. The global interface group list `ifg_head' is also protected by the net lock and all access to it (all within if.c) take it accordingly. Feedback OK mvs
2022-11-10  typofix; ok dlg  [Klemens Nanni]
2022-11-08  Document ifc_list immutability  [Klemens Nanni]
Move up to comment explaining different locks to account for all structs. OK millert mvs
2022-09-08  Rename global ifnet TAILQ  [Klemens Nanni]
Naming the list like the struct itself makes for awful grepping. Call the global variable "ifnetlist" from now on. There used to be kvm(3) consumers in base picking up this symbol, but those have long been converted to other interfaces. A few potential ports users remain, same deal as sys/net/if_var.h r1.116 "Remove struct ifnet's unused if_switchport member": they get bumped. Previous users pointed out by deraadt OK bluhm
2022-08-30  Remove struct ifnet's unused if_switchport member  [Klemens Nanni]
This is a switch(4) left-over. Even though it is defined under _KERNEL, a few ports do define it and include <net/if_var.h>, so this removal warrants a REVISION bump for all potential ports consumers (once ports bulk machines run on a snapshot containing this commit). OK mvs
2022-08-29  Use struct refcnt for interface address reference counting.  [Alexander Bluhm]
There was a crash due to use after free of the ifa although it is ref counted. As ifa_refcnt was a simple integer increment, there may be a path where multiple CPUs access it concurrently. So change to struct refcnt which is MP safe and provides dt(4) leak debugging. Link level address for IPsec enc(4) and various MPLS interfaces is special. There ifa is part of struct sc. Use refcount anyway and add a panic to detect use after free. bug report stsp@; OK mvs@
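The difference between a plain integer increment and an MP-safe counter can be illustrated in userland with C11 atomics. This is only a minimal sketch under stated assumptions: `struct xref`, `xref_init()`, `xref_take()` and `xref_rele()` are hypothetical names invented for this example, and the code is not the kernel's refcnt implementation. The assertions stand in for the "panic to detect use after free" mentioned in the commit message.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical userland stand-in for an MP-safe reference counter. */
struct xref {
	atomic_uint refs;
};

/* The creator of the object holds the first reference. */
static void
xref_init(struct xref *r)
{
	atomic_init(&r->refs, 1);
}

/* Take another reference; the atomic add cannot lose a concurrent update. */
static void
xref_take(struct xref *r)
{
	unsigned int old = atomic_fetch_add(&r->refs, 1);

	assert(old != 0);	/* taking a reference on a dead object is a bug */
}

/*
 * Drop a reference. Returns true when the caller released the last
 * reference and is therefore the one allowed to free the object.
 */
static bool
xref_rele(struct xref *r)
{
	unsigned int old = atomic_fetch_sub(&r->refs, 1);

	assert(old != 0);	/* underflow would indicate a use after free */
	return (old == 1);
}
```

With a plain `int`, two CPUs incrementing at the same time may both read the same old value and one update is lost, which is the kind of path the commit message suspects behind the crash.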
2021-02-20  add p2p_input, like ether_input but for l3 tunnel interfaces.  [David Gwynne]
the l3 protocol input to push the packet is based on a value in m->m_pkthdr.ph_family, which tunnel drivers should set before calling if_vinput. add p2p_bpf_mtap to call bpf_mtap_af also using m->m_pkthdr.ph_family.
2021-02-20  give interfaces an if_bpf_mtap handler.  [David Gwynne]
the network stack is now responsible for calling bpf for packets that the interface receives, and we so far got away with using bpf_mtap_ether for everything. this doesn't work if layer 3 input goes through the same functions, so letting drivers specify the appropriate bpf mtap function means they will be able to cope.
2020-07-29  Interface index is unsigned integer.  [mvs]
Fix the places where it is referenced as signed. u_int used within pipex(4) for consistency with other code. ok dlg@ mpi@
2020-07-24  Use interface index instead of pointer to `ifnet' in carp(4).  [mvs]
ok sashan@
2020-07-22  deprecate interface input handler lists, just use one input function.  [David Gwynne]
the interface input handler lists were originally set up to help us during the initial mpsafe network stack work. at the time not all the virtual ethernet interfaces (vlan, svlan, bridge, trunk, etc) were mpsafe, so we wanted a way to avoid them by default, and only take the kernel lock hit when they were specifically enabled on the interface. since then, they have been fixed up to be mpsafe. i could leave the list in place, but it has some semantic problems. because virtual interfaces filter packets based on the order they were attached to the parent interface, you can get packets taken away in surprising ways, especially when you reboot and netstart does something different to what you did by hand. by hardcoding the order that things like vlan and bridge get to look at packets, we can document the behaviour and get consistency. it also means we can get rid of a use of SRPs which were difficult to replace with SMRs. the interface input handler list is an SRPL, which we would like to deprecate. it turns out that you can sleep during stack processing, which you're not supposed to do with SRPs or SMRs, but SRPs are a lot more forgiving and it worked. lastly, it turns out that this code is faster than the input list handling, so lots of winning all around. special thanks to hrvoje popovski and aaron bieber for testing. this has been in snaps as part of a larger diff for over a week.
2020-07-10  Change users of IFQ_SET_MAXLEN() and IFQ_IS_EMPTY() to use the "new" API.  [Patrick Wildt]
ok dlg@ tobhe@
2020-07-10  Change users of IFQ_PURGE() to use the "new" API.  [Patrick Wildt]
ok dlg@ tobhe@
2020-07-10  Change users of IFQ_DEQUEUE(), IFQ_ENQUEUE() and IFQ_LEN() to use the "new" API.  [Patrick Wildt]
ok dlg@ tobhe@
2020-07-04  It's been agreed upon that global locks should be expressed using capital letters in locking annotations.  [anton]
Therefore harmonize the existing annotations. Also, if multiple locks are required they should be delimited using commas. ok mpi@
2020-05-12  Set timeout(9) to refill the receive ring descriptors if the amount of descriptors runs below the low watermark.  [jan]
The em(4) firmware seems not to work properly with just a few descriptors in the receive ring. Thus, we use the low water mark as an indicator instead of zero descriptors, which causes deadlocks. ok kettenis@
2020-04-12  say if_pcount needs NET_LOCK instead of the kernel lock.  [David Gwynne]
if_pcount is only touched in ifpromisc(), and ifpromisc() needs NET_LOCK anyway because it also modifies if_flags. suggested by mpi@ ok visa@
2019-11-08  convert interface address change hooks to tasks and a task_list.  [David Gwynne]
this follows what's been done for detach and link state hooks, and makes handling of hooks generally more robust. address hooks are a bit different to detach/link state hooks in that there's only a few things that register hooks (carp, pf, vxlan), but a lot of places to run the hooks (lots of ipv4 and ipv6 address configuration). an address hook cookie was in struct pfi_kif, which is part of the pf abi. rather than break pfctl -sI, this maintains the void * used for the cookie and uses it to store a task, which is then used as intended with the new api.
2019-11-07  turn the linkstate hooks into a task list, like the detach hooks.  [David Gwynne]
this is largely mechanical, except for carp. this moves the addition of the carp link state hook after we're committed to using the new interface as a carpdev. because the add can't fail, we avoid a complicated unwind dance. also, this tweaks the carp linkstate hook so it only updates the relevant carp interface, not all of the carpdevs on the parent. hrvoje popovski has tested an early version of this diff and it's generally ok, but there's some splasserts that this diff fires that i'll fix in an upcoming diff. ok claudio@
2019-11-06  replace the hooks used with if_detachhooks with a task list.  [David Gwynne]
the main semantic change is that things registering detach hooks have to allocate and set a task structure that then gets added to the list. this means if the task is allocated up front (eg, as part of carp's softc or bridge's port structure), it avoids the possibility that adding a hook can fail. a lot of drivers weren't checking for failure, and unwinding state in the event of failure in other parts was error prone. while doing this i discovered that the list operations have to be in a particular order, but drivers weren't doing that consistently either. this diff wraps the list ops up so you have to seriously go out of your way to screw them up. i've also sprinkled some NET_ASSERT_LOCKED around the list operations so we can make sure there's no potential for the list to be corrupted, especially while it's being run. hrvoje popovski has tested this a bit, and some issues he discovered have been fixed. ok sashan@
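The point about up-front allocation can be sketched in plain C: when the hook's list node is embedded in the driver's own state, adding the hook only links existing storage and so has no failure path to unwind. Everything below (`struct xtask`, `xtask_set()`, `xtask_add()`, `xtask_del()`, `xtask_run()`, `struct xsoftc`) is a hypothetical illustration written for this note, not the kernel's task list API, and it omits the locking assertions the commit message describes.

```c
#include <stddef.h>

/* Hypothetical hook node, owned by whoever registers the hook. */
struct xtask {
	void		(*fn)(void *);
	void		*arg;
	struct xtask	*next;
};

struct xtask_list {
	struct xtask	*head;
};

/* Fill in a caller-allocated task before it is added to a list. */
static void
xtask_set(struct xtask *t, void (*fn)(void *), void *arg)
{
	t->fn = fn;
	t->arg = arg;
	t->next = NULL;
}

/* Link caller-owned storage into the list: no allocation, cannot fail. */
static void
xtask_add(struct xtask_list *l, struct xtask *t)
{
	t->next = l->head;
	l->head = t;
}

/* Unlink a task if it is on the list. */
static void
xtask_del(struct xtask_list *l, struct xtask *t)
{
	struct xtask **pp;

	for (pp = &l->head; *pp != NULL; pp = &(*pp)->next) {
		if (*pp == t) {
			*pp = t->next;
			t->next = NULL;
			return;
		}
	}
}

/* Run every hook; fetch the next pointer first in case a hook unlinks itself. */
static void
xtask_run(struct xtask_list *l)
{
	struct xtask *t, *next;

	for (t = l->head; t != NULL; t = next) {
		next = t->next;
		(*t->fn)(t->arg);
	}
}

/* Hypothetical driver state with the detach task embedded in it. */
struct xsoftc {
	struct xtask	detach_task;
	int		detached;
};

static void
xsoftc_detach(void *arg)
{
	struct xsoftc *sc = arg;

	sc->detached++;
}
```

A driver in this sketch would call `xtask_set(&sc->detach_task, xsoftc_detach, sc)` once at creation time and `xtask_add()` when it attaches to a parent; neither call returns an error.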
2019-06-26  Create IF_WWAN_DEFAULT_PRIORITY which is lower than IF_WIRELESS_DEFAULT_PRIORITY and use it in umb(4) as default prio.  [Claudio Jeker]
OK kettenis@, sthen@
2019-04-28  Removes the KERNEL_LOCK() from bridge(4)'s output fast-path.  [Martin Pieuchot]
This redefines the ifp <-> bridge relationship. No lock can currently be used across the multiple contexts where the bridge has tentacles to protect a pointer, so use an interface index. Tested by various, ok dlg@, visa@
2019-04-22  add if_vinput so pseudo (ethernet) interfaces can bypass ifiqs  [David Gwynne]
if_vinput assumes that the interface that it's called against uses per cpu counters so it can count input packets, but basically does all the things that if_input and ifiq_input do. the main difference is it assumes the network stack is already running and runs the interface input handlers directly. this is instead of queuing the packets for a nettq to run. ifiqs aren't free, especially when they only run per packet like they do on pseudo interfaces. this allows that overhead to be bypassed.
2019-04-19  provide factored out txhprio and rxhprio checks  [David Gwynne]
l2 and l3 drivers do the same thing all the time, so reduce the chance of error by doing the checks once and making it available for drivers to call instead of rolling their own again.
2019-04-16  have another go at tx mitigation  [David Gwynne]
the idea is to call the hardware transmit routine less since in a lot of cases posting a producer ring update to the chip is (very) expensive. it's better to do it for several packets instead of each packet, hence calling this tx mitigation. this diff defers the call to the transmit routine to a network taskq, or until a backlog of packets has built up. dragonflybsd uses 16 as the size of its backlog, so i'm copying them for now. i've tried this before, but previous versions caused deadlocks. i discovered that the deadlocks in the previous version were from ifq_barrier calling taskq_barrier against the nettq. interfaces generally hold NET_LOCK while calling ifq_barrier, but the tq might already be waiting for the lock we hold. this version just doesn't have ifq_barrier call taskq_barrier. it instead relies on the IFF_RUNNING flag and normal ifq serialiser barrier to guarantee the start routine won't be called when an interface is going down. the taskq_barrier is only used during interface destruction to make sure the task struct won't get used in the future, which is already done without the NET_LOCK being held. tx mitigation provides a nice performance bump in some setups. up to 25% in some cases. tested by tb@ and hrvoje popovski (who's running this in production). ok visa@
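The batching idea can be sketched as a small userland model: packets accumulate, and the expensive "start" step runs either when the backlog reaches the threshold or when a deferred task fires. The names `struct txq`, `txq_enqueue()`, `txq_start()` and `txq_task()` are assumptions made up for this example; only the threshold of 16 comes from the commit message. Barriers, locking and the real taskq are deliberately left out.

```c
#include <stddef.h>

#define TXQ_BACKLOG	16	/* backlog size quoted in the commit message */

/* Hypothetical transmit queue model. */
struct txq {
	size_t	pending;	/* packets queued but not yet posted */
	size_t	sent;		/* packets handed to the "hardware" */
	size_t	doorbells;	/* number of (expensive) ring updates */
};

/* Stand-in for the driver start routine: one ring update per batch. */
static void
txq_start(struct txq *q)
{
	if (q->pending == 0)
		return;
	q->sent += q->pending;
	q->pending = 0;
	q->doorbells++;
}

/* Queue one packet; start immediately only once a backlog has built up. */
static void
txq_enqueue(struct txq *q)
{
	q->pending++;
	if (q->pending >= TXQ_BACKLOG)
		txq_start(q);
	/* otherwise the deferred task below is expected to run soon */
}

/* Stand-in for the task a network taskq would run later. */
static void
txq_task(struct txq *q)
{
	txq_start(q);
}
```

In this model, 40 enqueued packets followed by one run of the deferred task cost three ring updates instead of 40.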
2019-03-31  Document that it is safe to dereference `if_softc' when the caller has a valid reference to the corresponding `ifp'.  [Martin Pieuchot]
ok visa@
2019-01-09  split if_enqueue up so drivers can replace ifq handling if needed  [David Gwynne]
if_enqueue() still makes sure packets get handled by pf on the way out, and seen by bridge if needed. however instead of falling through to ifq mapping and output, it now calls a function pointer in the ifnet struct. that pointer defaults to the ifq handling, but drivers can override it to bypass ifq processing. the most obvious users of the function pointer will be virtual interfaces, eg, vlan(4). ifqs are good if you need to serialise access to the thing that transmits packets (like hardware rings on nics), or mitigate the number of times you do ring processing, but neither of those things are desirable on vlan interfaces. ideally vlan could transmit on any cpu without having packets serialised by its own ifq before being pushed down to an arbitrary number of rings on the parent interface. bypassing ifqs means the driver can push the vlan tag on concurrently and push down to the parent from any cpu. ok mpi@ no objection from claudio@
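The dispatch pattern described here, a common entry point that ends in a per-interface function pointer with a default, can be sketched as follows. All identifiers (`struct xifnet`, `xif_enqueue_ifq()`, `xvlan_enqueue()`, `xif_attach()`, `xif_output()`) are hypothetical and chosen for this example; the real struct ifnet member and its signature are not reproduced here.

```c
#include <stddef.h>

struct pkt;		/* opaque packet type, hypothetical */
struct xifnet;

typedef int (*xif_enqueue_fn)(struct xifnet *, struct pkt *);

/* Hypothetical interface descriptor with an overridable enqueue handler. */
struct xifnet {
	xif_enqueue_fn	xif_enqueue;
	unsigned int	queued;		/* packets sent via the default queue */
	unsigned int	direct;		/* packets that bypassed the queue */
};

/* Default handler: serialise through the interface's own queue. */
static int
xif_enqueue_ifq(struct xifnet *ifp, struct pkt *p)
{
	(void)p;
	ifp->queued++;
	return (0);
}

/* Override a virtual interface might install to skip the queue. */
static int
xvlan_enqueue(struct xifnet *ifp, struct pkt *p)
{
	(void)p;
	ifp->direct++;
	return (0);
}

/* Attach installs the default unless the driver already set a handler. */
static void
xif_attach(struct xifnet *ifp)
{
	if (ifp->xif_enqueue == NULL)
		ifp->xif_enqueue = xif_enqueue_ifq;
}

/* Common entry point: shared work would happen here, then dispatch. */
static int
xif_output(struct xifnet *ifp, struct pkt *p)
{
	return (ifp->xif_enqueue(ifp, p));
}
```

Keeping the shared steps in the common entry point and only the final hand-off behind the pointer means an override cannot accidentally skip them.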
2018-12-20  Make this not hz dependent by using timeout_add_sec(); also rename the define to IFNET_SLOWTIMO since it is no longer a hz divisor.  [Claudio Jeker]
OK visa@ bluhm@ kn@
2018-12-19  get rid of a prototype for if_enqueue_try()  [David Gwynne]
it isn't implemented, and is never called.
2018-12-11  add optional per-cpu counters for interface stats.  [David Gwynne]
these exist so interfaces that want to do mpsafe work outside the ifq machinery have a place to allocate and update stats in. the generic ioctl handling for getting stats to userland knows how to roll the new per cpu stats into the rest before export. ok visa@
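The "roll the per cpu stats into the rest before export" step can be sketched as one counter row per CPU that is summed on read. This is a simplified model under stated assumptions: `struct xif_counters`, `xifc_add()`, `xifc_read()` and the fixed `XNCPU` are hypothetical, and the sketch ignores the memory-ordering and consistency machinery a real kernel needs when a reader sums rows that other CPUs are updating.

```c
#include <stdint.h>

#define XNCPU	4	/* hypothetical fixed CPU count */

enum xifc { XIFC_IPACKETS, XIFC_IBYTES, XIFC_NCOUNTERS };

/* One row of counters per CPU. */
struct xif_counters {
	uint64_t c[XNCPU][XIFC_NCOUNTERS];
};

/* Each CPU touches only its own row, so updates need no shared lock. */
static void
xifc_add(struct xif_counters *ic, unsigned int cpu, enum xifc which,
    uint64_t v)
{
	ic->c[cpu % XNCPU][which] += v;
}

/* Export path: sum all rows into one total per counter. */
static void
xifc_read(const struct xif_counters *ic, uint64_t out[XIFC_NCOUNTERS])
{
	for (int i = 0; i < XIFC_NCOUNTERS; i++) {
		out[i] = 0;
		for (int cpu = 0; cpu < XNCPU; cpu++)
			out[i] += ic->c[cpu][i];
	}
}
```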
2018-09-10  if_cloners list is populated at boot time only and then becomes immutable, so we can let go of if_cloners_lock.  [Alexandr Nedvedicky]
OK tb@, claudio@, bluhm@, kn@, henning@
2018-01-10  get rid of struct carp_if by moving the srpl into struct ifnet if_carp.  [David Gwynne]
currently carp uses a struct carp_if to hold an srp list head, which is accessed by both if_carp in struct ifnet, and via the if input handlers list. this gets rid of some indirection by making if_carp itself the list head, rather than a pointer to the list head via a struct carp_if. it also makes accessing the list consistent by only using if_carp to get to it. ok mpi@
2018-01-08  Convert IF_CLONE_INITIALIZER() into C99 initializer.  [Alexander Bluhm]
OK mpi@
2018-01-04  Include timeout & tasks in 'struct ifnet' instead of always allocating them as M_TEMP.  [Martin Pieuchot]
ok visa@
2018-01-02  Move the NET_LOCK() inside the switch and start documenting which field is protected by which lock.  [Martin Pieuchot]
ok bluhm@, visa@
2017-12-15  add ifiqueues for mp safety and nics with multiple rx rings.  [David Gwynne]
currently there is a single mbuf_queue per interface, which all rings on a nic shove packets onto. while the list inside this queue is protected by a mutex, the counters around it (ie, ipackets, ibytes, idrops) are not. this means updates can be lost, and reading the statistics is also inconsistent. having a single queue means that busy rx rings can dominate and then starve the others. ifiqueue structs are like ifqueue structs. they provide per ring queues, and independent counters for each ring. when ifdata is read for userland, these counters are aggregated. having a queue per ring now allows for per ring backpressure to be applied. MCLGETI will have its day again. right now we assume every interface wants an input queue and unconditionally provide one. individual interfaces can opt into more. i'm not completely happy about the shape of this atm, but shuffling it around more makes the diff bigger. ok visa@
2017-11-17  add if_rxr_livelocked so rxr users can request backpressure themselves.  [David Gwynne]
right now the rx ring moderation code makes a decision globally that a machine is livelocked, and uses that to apply backpressure on all the rx rings. we're moving toward having the network stack run on multiple cpus, and fed from multiple rx rings. if_rxr_livelocked lets a driver apply backpressure explicitly if something tells it that whatever is consuming previous packets cannot keep up. while here expose the current ring watermark with if_rxr_cwm. tweaks and ok visa@