summaryrefslogtreecommitdiff
path: root/sys/net
AgeCommit message (Collapse)Author
2024-03-26Avoid NULL pointer dereference in routing table an_match().Alexander Bluhm
OK mvs@
2024-03-19count if_enqueue/ifq_enqueue errors as oqdrops.David Gwynne
this helps narrow down where some "output failures" on sec interfaces occur. based on discussion with jason tubnor
2024-03-18expose per port information via kstats.David Gwynne
the most interesting information exposed here is the number of times a port changes state according to the lacp state machine. if a port is flapping, it's hard to see if you only look at the current state. getting a count of changes over time makes problems a lot more visible and therefore fixable. this also exposes counters around how the lacp protocol packets. all of these can be useful when trying to line up behaviors with another system (eg, a switch). ok jmatthew@
2024-03-18use high bits from the mbuf flowid to pick a port to transmit on.David Gwynne
a port here is a physical interface used by an aggr. this leaves the low bits for a physical interface to use to pick a tx ring. without this, aggr used low bits for port selection, which takes bits away from the ring selection, which can lead to uneven distribution of packets over tx rings. ive been running this in production for well over a year now.
2024-03-05Convert `t_lock', `r_keypair_lock' and `c_lock' rwlock(9)s toVitaliy Makkoveev
corresponding mutex(9)es. ifq_start() and following wg_qstart() could be called from software interrupt context if bandwidth control is enabled in pf.conf(5). Remove sleep points provided by rwlock(9)s from wg(4) output start routine. looks ok claudio
2024-03-04white space fixes. no functional changeDavid Gwynne
2024-02-29revert "Combine route_cache() and rtalloc_mpath() in new route_mpath()"Christian Weisgerber
It breaks NFS. ok claudio@
2024-02-28Enable IPv6 AF for ppp(4)Denis Fondras
OK claudio@
2024-02-27Combine route_cache() and rtalloc_mpath() in new route_mpath().Alexander Bluhm
Fill and check the cache and call rtalloc_mpath() together. Then the caller of route_mpath() does not have to care about the uint32_t *src pointer and just pass struct in_addr. All the conversions are done inside the functions. ro->ro_rt is either valid or NULL. Note that some places have a stricter rtisvalid() now compared to the previous NULL check. OK claudio@
2024-02-22Make the route cache aware of multipath routing.Alexander Bluhm
Pass source address to route_cache() and store it in struct route. Cached multipath routes are only valid if source address matches. If sysctl multipath changes, increase route generation number. OK claudio@
2024-02-14Check IP length in ether_extract_headers().Alexander Bluhm
For LRO with ix(4) it is necessary to detect ethernet padding. Extract ip_len and ip6_plen from the mbuf and provide it to the drivers. Add extended sanitity checks, like IP packet is shorter than TCP header. This prevents offloading to network hardware with bougus packets. Also iphlen of extracted headers contains header length for IPv4 and IPv6, to make code in drivers simpler. OK mglocker@
2024-02-13Analyse header layout in ether_extract_headers().Alexander Bluhm
Several drivers need IPv4 header length and TCP offset for checksum offload, TSO and LRO. Accessing these fields directly caused crashes on sparc64 due to misaligned access. It cannot be guaranteed that IP and TCP header is 4 byte aligned in driver level. Also gcc 4.2.1 assumes that bit fields can be accessed with 32 bit load instructions. Use memcpy() in ether_extract_headers() to get the bits from IPv4 and TCP header and store the header length in struct ether_extracted. From there network drivers can esily use it without caring about alignment and bit shift. Do some sanity checks with the length values to prevent that invalid values from evil packets get stored into hardware registers. If check fails, clear the pointer to the header to hide it from the driver. Add debug prints that help to figure out the reason for bad packets and provide information when debugging drivers. OK mglocker@
2024-02-13Merge struct route and struct route_in6.Alexander Bluhm
Use a common struct route for both inet and inet6. Unfortunately struct sockaddr is shorter than sockaddr_in6, so netinet/in.h has to be exposed from net/route.h. Struct route has to be bsd visible for userland as netstat kvm code inspects inp_route. Internet PCB and TCP SYN cache can use a plain struct route now. All specific sockaddr types for inet and inet6 are embeded there. OK claudio@
2024-02-09Route cache function returns hit or miss.Alexander Bluhm
The route_cache() function can easily return whether it was a cache hit or miss. Then the logic to perform a route lookup gets a bit simpler. Some more complicated if (ro->ro_rt == NULL) checks still exist elsewhere. Also use route cache in in_pcbselsrc() instead of filling struct route manually. OK claudio@
2024-02-07Add missing #ifdef INET6 to fix ramdisk build.Alexander Bluhm
2024-02-07Use the route generation number also for IPv6.Alexander Bluhm
Implement route6_cache() to check whether the cached route is still valid and otherwise fill caching parameter of struct route_in6. Also count cache hits and misses in netstat. in_pcbrtentry() uses route cache now. OK claudio@
2024-02-06Invert broken check of panic string in if_linkstate().Alexander Bluhm
original bug report from syzkaller Reported-by: syzbot+d19060a65721eb432a72@syzkaller.appspotmail.com broken fix found by Hrvoje Popovski hint to the problem and OK deraadt@
2024-02-05Add netstat counter for route cache.Alexander Bluhm
To optimize route caching, count cache hits and misses. This is shown in netstat -s for both inet and inet6. Reuse the old IPv6 forward cache counter. Sort ip6s_wrongif consistently. For now only IPv4 cache counter has been implemented. OK mvs@
2024-02-05Don't send route messages while rebooting after panic. Syskaller exposedVitaliy Makkoveev
[1] that if_downall() tries to send route messages and triggers panic again but in knote(9) layer. 1. https://syzkaller.appspot.com/bug?extid=d19060a65721eb432a72 ok bluhm
2024-02-05Move route_cache() declaration from net/route.h to netinet/in.h.Kenji Aoyama
This prevents gcc3's 'parameter has incomplete type' warning that causes kernel build failure. Suggested by claudio@, ok bluhm@
2024-01-31Add route generation number to route cache.Alexander Bluhm
The outgoing route is cached at the inpcb. This cache was only invalidated when the socket closes or if the route gets invalid. More specific routes were not detected. Especially with dynamic routing protocols, sockets must be closed and reopened to use the correct route. Running ping during a route change shows the problem. To solve this, add a route generation number that is updated whenever the routing table changes. The lookup in struct route is put into the route_cache() function. If the generation number is too old, the cached route gets discarded. Implement route_cache() for ip_output() and ip_forward() first. IPv6 and more places will follow. OK claudio@
2024-01-26Put checksum flags in bpf_hdr to use them in userland dhcpleased.Jan Klemkow
Thus, dhcpleased accept non-calculated checksums which were verified by hardware/hypervisor. With tweaks from dlg@ ok bluhm@ mkay tobhe@
2024-01-24tag packets going out a sec interface to prevent route/encap loops.David Gwynne
sec(4) was already looking for this mbuf tag so it could drop packets that had already been sent out on the same interface, but i forgot the code that adds the tag. this was reported by jason tubnor who experienced spins/lockups when using sec and a physical interface was disconnected. rather than being a locking problem like we initially assumed, it turned out that unplugging a physical interface caused a route for ipsec encapsulated traffic to go out over sec(4), causing the packet to loop in the stack. the fix was also tested and verified by jason. sorry for taking so long to look at it.
2024-01-23Introduce pipex_iterator(), the special thing to performVitaliy Makkoveev
`pipex_session_list' foreach walkthrough with `pipex_list_mtx' mutex(9) relocking. It inserts special item after acquired `session' and keeps it linked until `session' release. Only owner can unlink it's own item, so the LIST_NEXT(session) is always valid even the `session' was unlinked. The iterator skips special items at the `session' acquisition time, as all other foreach loops where `pipex_list_mtx' mutex(9) is not relocked. ok yasuoka
2024-01-23Remove `pipex_rd_head6' and `ps6_rn[2]'. They are not used.Vitaliy Makkoveev
ok yasuoka
2024-01-18Use `nowake' as tsleep_nsec(9) ident. It has no corresponding wakeup(9).Vitaliy Makkoveev
ok bluhm
2024-01-11Use domain name for socket lock.Alexander Bluhm
Syzkaller with witness complains about lock ordering of pf lock with socket lock. Socket lock for inet is taken before pf lock. Pf lock is taken before socket lock for route. This is a false positive as route and inet socket locks are distinct. Witness does not know this. Name the socket lock like the domain of the socket, then rwlock name is used in witness lo_name subtype. Make domain names more consistent for locking, they were not used anyway. Regardless of witness problem, unique lock name for each socket type make sense. Reported-by: syzbot+34d22dcbf20d76629c5a@syzkaller.appspotmail.com Reported-by: syzbot+fde8d07ba74b69d0adfe@syzkaller.appspotmail.com OK mvs@
2024-01-10Split UDP PCB table into IPv4 and IPv6.Alexander Bluhm
Having two hash tables instead of a common one, reduces table size and contention on the per table lock. The address family is always known in advance. The lookups and loops are more specific. OK sashan@
2024-01-06Do not count packets though multicast loopback and simplex interfaces.Alexander Bluhm
Counting multicast packets sent to local stack or packets that are reflected by simplex interfaces does not make much sense. They are neither received nor output by any ethernet device. Counting these packets at lo0 or the loopback interface of the routing domain would be possible, but is not worth the effort. Make if_input_local() MP safe by deleting the if_opackets++ code. OK mvs@
2024-01-06Take net lock before kernel lock.Alexander Bluhm
Doing KERNEL_LOCK() just before NET_LOCK() does not make sense. Net lock is a rwlock that releases kernel lock during sleep. To avoid an unnecessary release and take kernel lock cycle, move KERNEL_LOCK() after NET_LOCK(). There is no lock order reversal deadlock issue. Both locks are used in any order thoughout the kernel. As NET_LOCK() releases the kernel lock when it cannot take the lock immediately and has to sleep, we always end in the order kernel lock before net lock after sleeping. OK sashan@
2024-01-01Protect link between pf and inp with mutex.Alexander Bluhm
Introduce global mutex to protect the pointers between pf state key and internet PCB. Then in_pcbdisconnect() and in_pcbdetach() do not need exclusive netlock anymore. Use a bunch of read once unlocked access to reduce performance impact. OK sashan@
2024-01-01Call if_counters_alloc() before if_attach().Vitaliy Makkoveev
ok bluhm sashan
2024-01-01Fix white space in pf.c.Alexander Bluhm
2023-12-29Make loopback interface counters MP safe.Alexander Bluhm
Create and use the MP safe version of the interface counters for lo(4). Input packets were counted twice. As interface input queue is already counting, remove input count in if_input_local(). Multicast and siplex packets are counted at the ethernet interface. Add a comment that this not MP safe. OK mvs@
2023-12-28use RB_FOREACH_SAFE for pf_purge_expired_src_nodesaisha
OK bluhm@
2023-12-23Backout always allocate per-CPU statistics counters for networkAlexander Bluhm
interface descriptor. It panics during attach of em(4) device at boot.
2023-12-22Always allocate per-CPU statistics counters for network interfaceVitaliy Makkoveev
descriptor. We have the mess in network interface statistics. Only pseudo drivers do per-CPU counters allocation, all other network devices use the old `if_data'. The network stack partially uses per-CPU counters and partially use `if_data', but the protection is inconsistent: some times counters accessed with exclusive netlock, some times with shared netlock, some times with kernel lock, but without netlock, some times with another locks. To make network interfaces statistics more consistent, always allocate per-CPU counters at interface attachment time and use it instead of `if_data'. At this step only move counters allocation to the if_attach() internals. The `if_data' removal will be performed with the following diffs to make review and tests easier. ok bluhm
2023-12-19Initialize `sc_outputtask' before interface attachment. if_alloc_sadl()Vitaliy Makkoveev
has sleep point, so the uninitialized `sc_outputtask` could be accessed through ioctl(2) interface. ok sashan bluhm
2023-12-16Rework pflowioctl() lock dances.Vitaliy Makkoveev
Release netlock and take `sc_lock' rwlock(9) just in the beginning of pflowioctl() and do corresponding operations in the end. Use `sc_lock' to protect `sc_dying'. We need to release netlock not only to keep locks order with `sc_lock' rwlock(9), but also because pflowioctl() calls some operations like socreate() or soclose() on udp(4) socket. Current implementation has many relocking places which breaks atomicy, so merge them into one. The `sc_lock' rwlock(9) is taken during all pflowioctl() call, so `sc_dying' atomicy is not broken. Not the ideal solution, but better then we have now. Tested by Hrvoje Popovski. Discussed with and ok from sashan
2023-12-12slyle(9) fix. No functional changes.Vitaliy Makkoveev
2023-12-12Turn `pflowstats' statistics counters into per-CPU counters to make themVitaliy Makkoveev
mpsafe. The weird interactions around `pflow_flows' and `sc_gcounter' replaced by simple `pflow_flows' increment. Since the flow sequence is the 32 bits integer, the `sc_gcounter' type replaced by the type of uint32_t. ok bluhm sashan
2023-12-11Turn `pflow_softc' list into SMR list.Vitaliy Makkoveev
Since the revision 1.1182 of net/pf.c netlock is not taken while export_pflow() called from pf_purge_states(). Current locks order requires netlock to be taken before PF_LOCK(), so there is no reason to turn it back into this path only for optional export_pflow() call. The `pflowif_list' foreach loop has no context switch within, so SMR list is better than mutex(9). Tested by Hrvoje Popovski. ok sashan bluhm
2023-12-08Add spaces around '='. style(9) fix, no functional changes.Vitaliy Makkoveev
2023-12-08Introduce `sc_mtx' mutex(9) to protect the most of pflow_softcVitaliy Makkoveev
structure. Protect the `send_nam', `sc_flowsrc' and `sc_flowdst' pflow_softc members by existing `sc_lock' rwlock(9). This partially fixes locking inconsistency of pflow_softc. The following work will be done with separate diffs. Also, pass `sc' instead of NULL to pflow_get_mbuf() while calling from pflow_sendout_ipfix_tmpl(). This fixes the NULL dereference. ok bluhm@
2023-12-03Make rtm_senddesync_timer() timeout(9) handler mpsafe. solock() protectsVitaliy Makkoveev
the socket and the socket's PCB data. ok bluhm
2023-12-01pipex(4) layer is completely mp-safe, move the pipex_timer() timeout(9)Vitaliy Makkoveev
handler out of kernel lock. ok bluhm
2023-12-01Prevent race between pf_test() and pf_purge_expired_states().Alexandr Nedvedicky
Packets (callers to pf_test()) must alter pf_state::timeout under protection of pf_state::mtx. We also have to make sure the packet does not update pf_state::timeout when ::timeout reaches PFTM_UNLINKED. The first report came from Johan Huldtgren, but he is not the single user who has noticed "st->timeout == PFTM_UNLINKED" assert violation. OK bluhm@
2023-11-29remove unused VXLANMTU definitionDenis Fondras
OK claudio, miod
2023-11-28Remove struct inpcb from in6_embedscope() parameters.Alexander Bluhm
rip6_output() did modify inp_outputopts6 temporarily to provide different ip6_pktopts to in6_embedscope(). Better pass inp_outputopts6 and inp_moptions6 as separate arguments to in6_embedscope(). Simplify the code that deals with these options in in6_embedscope(). Doucument inp_moptions and inp_moptions6 as protected by net lock. OK kn@
2023-11-23avoid passing weird mbuf chains to pf when pushing out a veb.David Gwynne
pf expects the ip header to be in the first mbuf of the chain we pass to pf_test, but in some situations the ethernet header is the only data in the first mbuf. after we remove the ethernet header, the first mbuf had no data in it which confused pf. fix this by passing all packets to ip_check on output as well as input. ip input handlers do all the necessary m_pullups. found by Mark Patruck.