summaryrefslogtreecommitdiff
path: root/sys/net
AgeCommit message (Collapse)Author
2024-02-05Add netstat counter for route cache.Alexander Bluhm
To optimize route caching, count cache hits and misses. This is shown in netstat -s for both inet and inet6. Reuse the old IPv6 forward cache counter. Sort ip6s_wrongif consistently. For now only IPv4 cache counter has been implemented. OK mvs@
2024-02-05Don't send route messages while rebooting after panic. Syskaller exposedVitaliy Makkoveev
[1] that if_downall() tries to send route messages and triggers panic again but in knote(9) layer. 1. https://syzkaller.appspot.com/bug?extid=d19060a65721eb432a72 ok bluhm
2024-02-05Move route_cache() declaration from net/route.h to netinet/in.h.Kenji Aoyama
This prevents gcc3's 'parameter has incomplete type' warning that causes kernel build failure. Suggested by claudio@, ok bluhm@
2024-01-31Add route generation number to route cache.Alexander Bluhm
The outgoing route is cached at the inpcb. This cache was only invalidated when the socket closes or if the route gets invalid. More specific routes were not detected. Especially with dynamic routing protocols, sockets must be closed and reopened to use the correct route. Running ping during a route change shows the problem. To solve this, add a route generation number that is updated whenever the routing table changes. The lookup in struct route is put into the route_cache() function. If the generation number is too old, the cached route gets discarded. Implement route_cache() for ip_output() and ip_forward() first. IPv6 and more places will follow. OK claudio@
2024-01-26Put checksum flags in bpf_hdr to use them in userland dhcpleased.Jan Klemkow
Thus, dhcpleased accept non-calculated checksums which were verified by hardware/hypervisor. With tweaks from dlg@ ok bluhm@ mkay tobhe@
2024-01-24tag packets going out a sec interface to prevent route/encap loops.David Gwynne
sec(4) was already looking for this mbuf tag so it could drop packets that had already been sent out on the same interface, but i forgot the code that adds the tag. this was reported by jason tubnor who experienced spins/lockups when using sec and a physical interface was disconnected. rather than being a locking problem like we initially assumed, it turned out that unplugging a physical interface caused a route for ipsec encapsulated traffic to go out over sec(4), causing the packet to loop in the stack. the fix was also tested and verified by jason. sorry for taking so long to look at it.
2024-01-23Introduce pipex_iterator(), the special thing to performVitaliy Makkoveev
`pipex_session_list' foreach walkthrough with `pipex_list_mtx' mutex(9) relocking. It inserts special item after acquired `session' and keeps it linked until `session' release. Only owner can unlink it's own item, so the LIST_NEXT(session) is always valid even the `session' was unlinked. The iterator skips special items at the `session' acquisition time, as all other foreach loops where `pipex_list_mtx' mutex(9) is not relocked. ok yasuoka
2024-01-23Remove `pipex_rd_head6' and `ps6_rn[2]'. They are not used.Vitaliy Makkoveev
ok yasuoka
2024-01-18Use `nowake' as tsleep_nsec(9) ident. It has no corresponding wakeup(9).Vitaliy Makkoveev
ok bluhm
2024-01-11Use domain name for socket lock.Alexander Bluhm
Syzkaller with witness complains about lock ordering of pf lock with socket lock. Socket lock for inet is taken before pf lock. Pf lock is taken before socket lock for route. This is a false positive as route and inet socket locks are distinct. Witness does not know this. Name the socket lock like the domain of the socket, then rwlock name is used in witness lo_name subtype. Make domain names more consistent for locking, they were not used anyway. Regardless of witness problem, unique lock name for each socket type make sense. Reported-by: syzbot+34d22dcbf20d76629c5a@syzkaller.appspotmail.com Reported-by: syzbot+fde8d07ba74b69d0adfe@syzkaller.appspotmail.com OK mvs@
2024-01-10Split UDP PCB table into IPv4 and IPv6.Alexander Bluhm
Having two hash tables instead of a common one, reduces table size and contention on the per table lock. The address family is always known in advance. The lookups and loops are more specific. OK sashan@
2024-01-06Do not count packets though multicast loopback and simplex interfaces.Alexander Bluhm
Counting multicast packets sent to local stack or packets that are reflected by simplex interfaces does not make much sense. They are neither received nor output by any ethernet device. Counting these packets at lo0 or the loopback interface of the routing domain would be possible, but is not worth the effort. Make if_input_local() MP safe by deleting the if_opackets++ code. OK mvs@
2024-01-06Take net lock before kernel lock.Alexander Bluhm
Doing KERNEL_LOCK() just before NET_LOCK() does not make sense. Net lock is a rwlock that releases kernel lock during sleep. To avoid an unnecessary release and take kernel lock cycle, move KERNEL_LOCK() after NET_LOCK(). There is no lock order reversal deadlock issue. Both locks are used in any order thoughout the kernel. As NET_LOCK() releases the kernel lock when it cannot take the lock immediately and has to sleep, we always end in the order kernel lock before net lock after sleeping. OK sashan@
2024-01-01Protect link between pf and inp with mutex.Alexander Bluhm
Introduce global mutex to protect the pointers between pf state key and internet PCB. Then in_pcbdisconnect() and in_pcbdetach() do not need exclusive netlock anymore. Use a bunch of read once unlocked access to reduce performance impact. OK sashan@
2024-01-01Call if_counters_alloc() before if_attach().Vitaliy Makkoveev
ok bluhm sashan
2024-01-01Fix white space in pf.c.Alexander Bluhm
2023-12-29Make loopback interface counters MP safe.Alexander Bluhm
Create and use the MP safe version of the interface counters for lo(4). Input packets were counted twice. As interface input queue is already counting, remove input count in if_input_local(). Multicast and siplex packets are counted at the ethernet interface. Add a comment that this not MP safe. OK mvs@
2023-12-28use RB_FOREACH_SAFE for pf_purge_expired_src_nodesaisha
OK bluhm@
2023-12-23Backout always allocate per-CPU statistics counters for networkAlexander Bluhm
interface descriptor. It panics during attach of em(4) device at boot.
2023-12-22Always allocate per-CPU statistics counters for network interfaceVitaliy Makkoveev
descriptor. We have the mess in network interface statistics. Only pseudo drivers do per-CPU counters allocation, all other network devices use the old `if_data'. The network stack partially uses per-CPU counters and partially use `if_data', but the protection is inconsistent: some times counters accessed with exclusive netlock, some times with shared netlock, some times with kernel lock, but without netlock, some times with another locks. To make network interfaces statistics more consistent, always allocate per-CPU counters at interface attachment time and use it instead of `if_data'. At this step only move counters allocation to the if_attach() internals. The `if_data' removal will be performed with the following diffs to make review and tests easier. ok bluhm
2023-12-19Initialize `sc_outputtask' before interface attachment. if_alloc_sadl()Vitaliy Makkoveev
has sleep point, so the uninitialized `sc_outputtask` could be accessed through ioctl(2) interface. ok sashan bluhm
2023-12-16Rework pflowioctl() lock dances.Vitaliy Makkoveev
Release netlock and take `sc_lock' rwlock(9) just in the beginning of pflowioctl() and do corresponding operations in the end. Use `sc_lock' to protect `sc_dying'. We need to release netlock not only to keep locks order with `sc_lock' rwlock(9), but also because pflowioctl() calls some operations like socreate() or soclose() on udp(4) socket. Current implementation has many relocking places which breaks atomicy, so merge them into one. The `sc_lock' rwlock(9) is taken during all pflowioctl() call, so `sc_dying' atomicy is not broken. Not the ideal solution, but better then we have now. Tested by Hrvoje Popovski. Discussed with and ok from sashan
2023-12-12slyle(9) fix. No functional changes.Vitaliy Makkoveev
2023-12-12Turn `pflowstats' statistics counters into per-CPU counters to make themVitaliy Makkoveev
mpsafe. The weird interactions around `pflow_flows' and `sc_gcounter' replaced by simple `pflow_flows' increment. Since the flow sequence is the 32 bits integer, the `sc_gcounter' type replaced by the type of uint32_t. ok bluhm sashan
2023-12-11Turn `pflow_softc' list into SMR list.Vitaliy Makkoveev
Since the revision 1.1182 of net/pf.c netlock is not taken while export_pflow() called from pf_purge_states(). Current locks order requires netlock to be taken before PF_LOCK(), so there is no reason to turn it back into this path only for optional export_pflow() call. The `pflowif_list' foreach loop has no context switch within, so SMR list is better than mutex(9). Tested by Hrvoje Popovski. ok sashan bluhm
2023-12-08Add spaces around '='. style(9) fix, no functional changes.Vitaliy Makkoveev
2023-12-08Introduce `sc_mtx' mutex(9) to protect the most of pflow_softcVitaliy Makkoveev
structure. Protect the `send_nam', `sc_flowsrc' and `sc_flowdst' pflow_softc members by existing `sc_lock' rwlock(9). This partially fixes locking inconsistency of pflow_softc. The following work will be done with separate diffs. Also, pass `sc' instead of NULL to pflow_get_mbuf() while calling from pflow_sendout_ipfix_tmpl(). This fixes the NULL dereference. ok bluhm@
2023-12-03Make rtm_senddesync_timer() timeout(9) handler mpsafe. solock() protectsVitaliy Makkoveev
the socket and the socket's PCB data. ok bluhm
2023-12-01pipex(4) layer is completely mp-safe, move the pipex_timer() timeout(9)Vitaliy Makkoveev
handler out of kernel lock. ok bluhm
2023-12-01Prevent race between pf_test() and pf_purge_expired_states().Alexandr Nedvedicky
Packets (callers to pf_test()) must alter pf_state::timeout under protection of pf_state::mtx. We also have to make sure the packet does not update pf_state::timeout when ::timeout reaches PFTM_UNLINKED. The first report came from Johan Huldtgren, but he is not the single user who has noticed "st->timeout == PFTM_UNLINKED" assert violation. OK bluhm@
2023-11-29remove unused VXLANMTU definitionDenis Fondras
OK claudio, miod
2023-11-28Remove struct inpcb from in6_embedscope() parameters.Alexander Bluhm
rip6_output() did modify inp_outputopts6 temporarily to provide different ip6_pktopts to in6_embedscope(). Better pass inp_outputopts6 and inp_moptions6 as separate arguments to in6_embedscope(). Simplify the code that deals with these options in in6_embedscope(). Doucument inp_moptions and inp_moptions6 as protected by net lock. OK kn@
2023-11-23avoid passing weird mbuf chains to pf when pushing out a veb.David Gwynne
pf expects the ip header to be in the first mbuf of the chain we pass to pf_test, but in some situations the ethernet header is the only data in the first mbuf. after we remove the ethernet header, the first mbuf had no data in it which confused pf. fix this by passing all packets to ip_check on output as well as input. ip input handlers do all the necessary m_pullups. found by Mark Patruck.
2023-11-18ports specified on tunnel destination addresses are invalid.David Gwynne
2023-11-13Fix rt_setgate() error handling.Alexander Bluhm
In revision 1.424 the logic in rt_setgate() has changed. The old code entered a value into rt_gateway also if rt_setgwroute() returned an error. Now if rt_setgwroute() fails, rt_gateway is NULL and ROUNDUP(rt->rt_gateway->sa_len) crashes. Put back the old logic in rt_setgate(). Setting rt_gateway and rt_gwroute are actually independent. If malloc(9) in rt_setgate() fails, rt_gateway can still be NULL. The subsequent crash in free(rt->rt_gateway, M_RTABLE, ROUNDUP(rt->rt_gateway->sa_len)) was just never observed. Add a NULL check around these free(9). Reported-by: syzbot+2e79dd9db712d3c5ade9@syzkaller.appspotmail.com OK mvs@
2023-11-12Use constant sockaddr in route lookup.Alexander Bluhm
In rtalloc() and rtalloc_mpath() declare the parameter dst as const sockaddr. This makes MP safe route lookup easier as the destination address is definitely not modified during the operation. Array rti_info, the central data structure with addresses for route matching, contains constant sockaddr now. OK mvs@ dlg@
2023-11-12Do not modify route info sockaddr in rtm_xaddrs().Alexander Bluhm
The rti_info array is used to describe routes that should be found by lookup. Modifying the addreses in it is not a good idea. There were places where rtm_xaddrs() tried to fix the address family instead of validating it. Replace the modification with a check and error out with EAFNOSUPPORT on failure. Route labels always have AF_UNSPEC and the other types are not used anyway. OK kn@
2023-11-12rt_gateway and rt_gwroute use X protections now.David Gwynne
pointed out by bluhm@
2023-11-12rt_setgate performs a series of tweaks to an rtable and the routes inDavid Gwynne
the rtable which should be serialised to ensure they're consistent. unfortunately, rt_setgate is called from the network stack while it's only holding shared NET_LOCK. this uses the [X] protections as described in route.h to serialise the changes, and reworks the code to try and keep enough stuff linked up properly during the changes that it will still work if another cpu is still using the rtentry structs while they still have shared net lock. tested by and ok bluhm@
2023-11-11Pass constant struct sockaddr to interface lookup functions.Alexander Bluhm
OK mvs@
2023-11-11start documenting the protections or locks needed for struct rtentry fields.David Gwynne
this is the result of a bunch of discussion at h2k23. ok claudio@ mvs@ bluhm@
2023-11-11Remove unused parameter dst from art_get().Alexander Bluhm
OK mvs@
2023-11-10rtable_match() takes constant destination.Alexander Bluhm
For implementing MP safe route lookup, it helps to know which function parameters are constant. Add some const declarations, so that the compiler guarantees that sockaddr dst parameter of rtable_match() does not change. OK dlg@
2023-11-10Make ifq and ifiq interface MP safe.Alexander Bluhm
Rename ifq_set_maxlen() to ifq_init_maxlen(). This function neither uses WRITE_ONCE() nor a mutex and is called before the ifq mutex is initialized. The new name expresses that it should be used only during interface attach when there is no concurrency. Protect ifq_len(), ifq_empty(), ifiq_len(), and ifiq_empty() with READ_ONCE(). They can be used without lock as they only read a single integer. OK dlg@
2023-11-09Remove delayed timeout(9) initialization. timeout_set*() only assignVitaliy Makkoveev
members of passed timeout structure, this delayed initialization provides nothing but makes code weird. ok kn
2023-10-27Forward TCP LRO disabling to parent devices.Jan Klemkow
Also disable TCP LRO on bridged vlan(4) and default for bpe(4), nvgre(4) and vxlan(4). ok bluhm@
2023-10-23Prevent wg(4) stuck on peer destruction.Vitaliy Makkoveev
While interface going down and output stopped, packets could rest in `if_snd' queue. So the (!ifq_empty(&sc->sc_if.if_snd)) condition will always be true and wg_peer_destroy() will sleep until interface became up and stuck packets transmitted. Check IFF_RUNNING flag within (!ifq_empty(&sc->sc_if.if_snd)) loop in wg_peer_destroy(). If the flag is not set that means interface is down, so drain the `if_snd' queue manually to prevent wg_peer_destroy() stuck. Problem reported and fix tested by Kirill Miazine. ok bluhm@
2023-10-12pflog(4) logs packet dropped by default rule with block.Alexander Bluhm
If a packet is malformed, it is dropped by pf(4). The rule referenced in pflog(4) is the default rule. As the default rule is a pass rule, tcpdump printed "pass" although the packet was actually dropped. To avoid confusion, change the action to drop. Then tcpdump prints "block". OK sashan@ kn@
2023-10-11Prevent deref-after-free when tdb_timeout() fires on invalid new tdb.Tobias Heider
When receiving a pfkeyv2 SADB_ADD message, a newly created tdb can fail in tdb_init(), which causes the tdb to not get added to the global tdb list and an immediate dereference. If a lifetime timeout triggers on this tdb, it will unconditionally try to remove it from the list and in the process deref once more than allowed, causing a one bit corruption in the already freed up slot in the tdb pool. We resolve this issue by moving timeout_add() after tdb_init() just before puttdb(). This means tdbs failing initialization get discarded immediately as they only hold a single reference. Valid tdbs get their timeouts activated just before we add them to the tdb list, meaning the timeout can safely assume they are linked. Feedback from mvs@ and millert@ ok mvs@ mbuhl@
2023-10-10pf(4) must not pass packet if state cannot be created.Alexander Bluhm
The behavior of the PFRULE_SRCTRACK and max_states check was unintentionally changed by commit revision 1.964. If the state was not created due to some limit had been reached, pf still passed the packet. Restore the old logic by setting action to pass later, after the checks. In pf_test_rule() action is initialized to drop. OK sashan@