src - OpenBSD base system

Age	Commit message (Collapse)	Author
2024-11-08	Use PCB iterator for raw IPv6 input loop.	Alexander Bluhm
	Implement inpcb iterator in rip6_input(). Factor out the real work to rip6_sbappend(). Now UDP broadcast and multicast, raw IPv4 and IPv6 input work similar. While there, make rip_input() look more like rip6_input(). OK mvs@
2024-11-05	Replace rwlock with iterator in UDP input multicast loop.	Alexander Bluhm
	The broadcast and multicast loop in udp_input() is protected by the table mutex. The relevant PCBs were collected in a separate list, which was processed while the table notify rwlock was held. When sending UDP multicast packets over vxlan(4) configured over UDP with multicast groups, this lock was taken recursively causing a kernel crash. By using an iterator, traversing the PCB list of the table does not require to hold the mutex all the time. Only while accessing the next element after the iterator, the mutex is taken for a short time. udp_sbappend() and the upcall to vxlan_input() is done with neither mutex nor rwlock. The PCB is reference counted while traversing the list. crash reported by Holger Glaess; iterator implemented by mvs@; tested and fixed by bluhm@; OK mvs@
2024-10-17	remove unneeded task.h include, missed in rev 1.67	Jonathan Gray

2024-09-04	Fix some spelling.	Marcus Glocker
	Input and ok jmc@, jsg@
2024-09-01	spelling; checked by jmc@, ok miod@ mglocker@ krw@	Jonathan Gray

2024-08-20	Unlock igmp_sysctl(), pfsync_sysctl() and rip6_sysctl(). Each of them is	Vitaliy Makkoveev
	the only IGMPCTL_STATS, PFSYNCCTL_STATS and RIPV6CTL_STATS per-CPU counters. sysctl_rdstruct() has "newp != NULL" check within and also returns EPERM, no need for redundant check in igmp_sysctl(). ok bluhm
2024-08-19	avoid uninitialised var use introduced in rev 1.63	Jonathan Gray
	found by smatch, ok bluhm@
2024-08-16	Introduce PR_MPSYSCTL flag to mark mp-safe (*pr_sysctl)() handlers and	Vitaliy Makkoveev
	unlock both divert_sysctl() and divert6_sysctl(). Unlock them together, because they are identical and pretty simple: - DIVERTCTL_RECVSPACE and DIVERTCTL_SENDSPACE - atomically accessed integers; - DIVERTCTL_STATS - per-CPU counters; ok bluhm
2024-08-12	Run network protocol timer without kernel lock.	Alexander Bluhm
	Mark slow and fast protocol timeouts as MP safe. This means they run on a spearate thread without holding the kernel lock. IGMP and MLD6 cannot run in parallel, they use exclusive net lock to protect themselves. As a performance optimization global variables are used to skip igmp_fasttimo() and mld6_fasttimeo() if no multicast is active. These global variables use atomic operations and memory barriers to work lockless. IPv6 fragment timeout protects itself with a mutex. TCP timers also run without kernel lock now. The whole TCP stack holds exclusive net lock, so additional kernel lock is useless. OK mvs@
2024-07-29	Use shared net lock instead of exclusive when frag6 calls icmp6_error().	Alexander Bluhm
	OK mvs@ a while ago as part of a larger diff
2024-07-26	Run UDP input on multiple CPU in parallel.	Alexander Bluhm
	The socket layer of UDP has been made fully MP safe. UDP output is MP safe for a while. mvs@ has fixed the missing pieces in socket splicing recently. This means that complete UDP stack can be processed by multiple threads now. Activate multi processing for udp_input() when called with IPv4 or IPv6 packets. Usually IP processing runs on multiple softnet threads with shared net lock. From there local packets are queued and processed by one thread with exclusive net lock. If the PR_MPINPUT flag is set, protocol input is called directly from IP input on multiple threads, with shared net lock and no additional queueing. tested by Hrvoje Popovski; OK mvs@
2024-07-19	Unlock sysctl net.inet.ip.redirect and net.inet6.ip6.redirect.	Alexander Bluhm
	Variable ip and ip6 sendredirects is only read once during packet processing. Use atomic_load_int() to access the value in exactly one read instruction. No memory barriers needed as there is no correlation with other values. Sort the ip and ip6 checks, so the difference is easier to see. Move access to global variable to the end. OK mvs@
2024-07-14	Unlock IPv6 sysctl net.inet6.ip6.forwarding from net lock.	Alexander Bluhm
	Use atomic operations to read ip6_forwarding while processing packets in the network stack. To make clear where actually the router property is needed, use the i_am_router variable based on ip6_forwarding. It already existed in nd6_nbr. Move i_am_router setting up the call stack until all users are independent. The forwarding decisions in pf_test, pf_refragment6, ip6_input do also not interfere. Use a new array ipv6ctl_vars_unlocked to make transition of all the integer sysctls easier. Adapt IPv4 to the new style. OK mvs@
2024-07-13	Previous commit broke RAMDISK_CD kernel build. Always include udp.h	Alexander Bluhm
	in ip6_forward.c.
2024-07-13	Do not store full IPv6 packet in common forwarding case.	Alexander Bluhm
	Forwarding IPv6 packets is slower than IPv4. Reason is that m_copym() is done for every packet. Just in case we may have to send an ICMP6 packet, ip6_forward() creates a mbuf copy. After that mbuf cluster is read only, so for the ethernet header another mbuf is allocated. pf NAT and RDR ignores readonly clusters, so it also modifies the potential ICMP6 packet. IPv4 ip_forward() avoids all these problems by copying the leading 68 bytes of the original packets onto the stack. More is not need for ICMP. IPv6 RFC 4443 2.4. (c) requires up to 1232 bytes in the ICMP6 packet. This cannot be copied to the stack. The reason for the difference in the standard seems to be that the ICMP6 packet has to contain the full header chain. If we have a simple TCP, UDP or ESP packet without chain, do a shortcut and just preserve the header for the ICMP6 packet. Small packets already use stack memory, large packets need extra mbuf allocation. Now truncate ICMP6 packet to a reasonable length if the original packets has a final protocol header directly after the IPv6 header. List of suitable protocols contains TCP, UDP, ESP as they cover the common cases and anything behind the header should not be needed for path MTU discovery. OK deraadt@ florian@ mvs@
2024-07-12	Remove internet PCB mutex.	Alexander Bluhm
	All incpb locking has been converted to socket receive buffer mutex. Per PCB mutex inp_mtx is not needed anymore. Also delete PRU related locking functions. A flag PR_MPSOCKET indicates whether protocol functions support parallel access with per socket rw-lock. TCP is the only protocol that is not MP capable from the socket layer and needs exclusive netlock. OK mvs@
2024-07-09	IPv6 forward copies small packet content on the stack.	Alexander Bluhm
	Unfortunately RFC 4443 demands that the ICMP6 error packet containing the orignal packet is up to 1280 bytes long. That means for every forwarded packet forward() creates a mbuf copy, just in case delivery fails. For small packets we can copy the content on the stack like IPv4 forward does. This saves us some mbuf allocations if the content is shorter than the mbuf data size. OK mvs@
2024-07-04	Implement IPv6 forwarding IPsec only.	Alexander Bluhm
	IPsec gateways set the forwarding sysctl to 2. While this worked for IPv4 since a long time, adapt this feature for IPv6 now. Set sysctl net.inet6.ip6.forwarding=2 to forward only packets that have been processed by IPsec. Set IPV6_FORWARDING_IPSEC in ip6_input() and pass the flag down to the call stack. This provides consistent view on global variable ip6_forwarding. In ip6_output() or ip6_forward() drop packets that do not match the policy. OK denis@
2024-06-20	Read IPv6 forwarding value only once while processing a packet.	Alexander Bluhm
	IPv4 uses IP_FORWARDING to pass down a consistent value of net.inet.ip.forwarding down the stack. This is needed for unlocking sysctl. Do the same for IPv6. Read ip6_forwarding once in ip6_input_if() and pass down IPV6_FORWARDING as flags to ip6_ours(), ip6_hbhchcheck(), ip6_forward(). Replace the srcrt value with IPV6_REDIRECT flag for consistency with IPv4. To have common syntax with IPv4, use ip6_forwarding == 0 checks instead of !ip6_forwarding. This will also make it easier to implement net.inet6.ip6.forwarding=2 for IPsec only forwarding later. In nd6_ns_input() and nd6_na_input() read ip6_forwarding once and store it in i_am_router. The variable name has been chosen to avoid confusion with is_router, which indicates router flag of the packet. Reading of ip6_forwarding is done independently from ip6_input_if(), consistency does not really matter. One is for ND router behavior the other for forwarding. Again use the ip6_forwarding != 0 check, so when ip6_forwarding IPsec only value 2 gets implemented, it will behave like a router. OK deraadt@ sashan@ florian@ claudio@
2024-06-07	Read IP forwarding variables only once.	Alexander Bluhm
	Do not assume that ip_forwarding and ip_directedbcast cannot change while processing one packet. Read it once and pass down its value with a flag. This is necessary for unlocking the sysctl path. There are a few places where a consistent value does not really matter, they are unchanged. Use a proper ip_ prefix for the global variable. OK claudio@
2024-06-07	Fix slaac on P2P interfaces	Florian Obser
	slaacd(8) can work on P2P interfaces, it will just never configure the destination address. But this works fine on at least pppoe(4) and tun(4). To make this less confusing pull ifra_dstaddr into dst6 or gw6 depending on if we are doing autoconf or not. I accidentally broke this when implementing rule 5.5 of RFC 6724. reported by & testing naddy OK bluhm
2024-06-07	remove unused defines	Jonathan Gray

2024-05-21	Inform user land when vltime / pltime changes.	Florian Obser
	Do not send a RTM_CHGADDRATTR route message when the address is tentative because we will send one when DAD finishes. To be used by rad(8) shortly. OK bluhm
2024-05-13	remove prototypes with no matching function	Jonathan Gray
	ok mpi@
2024-05-08	Fix route leak in ip input.	Alexander Bluhm
	In previous commit when refactoring the route cache, a rtfree() has been forgotten. For each forwarded packet the reference counter of the route entry was increased. This eventually leads to an integer overflow and triggers kassert. reported by and OK jan@
2024-04-21	Implement rule 5.5 of RFC 6724 (Default Address Selection for IPv6)	Florian Obser
	Rule 5.5: Prefer addresses in a prefix advertised by the next-hop. For this we have to track the (link-local) address of the advertising router per interface address and compare it with the selected route. Rule 5.5 is useful in multi-homing setups where we have more than one prefix and default router. We have to use the source address with the correct default gateway otherwise traffic is likely going to be dropped because of BCP 38. While here refactor in6_update_ifa() a bit to make the code clearer and consistently use (var & flag) instead of (var & flag) != 0. Patiently reviewed by & OK bluhm.
2024-04-17	Use struct ipsec_level within inpcb.	Alexander Bluhm
	Instead of passing around u_char[4], introduce struct ipsec_level that contains 4 ipsec levels. This provides better type safety. The embedding struct inpcb is globally visible for netstat(1), so put struct ipsec_level outside of #ifdef _KERNEL. OK deraadt@ mvs@
2024-04-17	Revert previous, it breaks IPv6 on loopback interfaces.	Florian Obser
	Reported by bket & anton
2024-04-16	Destination addresses make no sense on loopback interfaces.	Florian Obser
	While here use (variable & FLAG) or !(variable & FLAG) consistently in in6_update_ifa(). Discussed with claudio OK denis
2024-04-16	Use route cache function in IP input.	Alexander Bluhm
	Instaed of passing a struct rtentry from ip_input() to ip_forward() and then embed it into a struct route for ip_output(), start with struct route and pass it along. Then the route cache is used consistently. Also the route cache hit and missed counters should reflect reality after this commit. There is a small difference in the code. in_ouraddr() checks for NULL and not rtisvalid(). Previous discussion showed that the route RTF_UP flag should only be considered for multipath routing. Otherwise it does not mean anything. Especially the local and broadcast check in in_ouraddr() should not be affected by interface link status. When doing cache lookups, route must be valid, but after rtalloc_mpath() lookup, use any route that route_mpath() returns. OK claudio@
2024-04-16	Run raw IPv6 input in parallel.	Alexander Bluhm
	Get rip6_input() in the same shape as rip_input(). Call soisdisconnected() from rip6_disconnect(). This means that the raw IP socket cannot be reconnected later. Now raw IPv6 behaves like IPv4 in this regard, KAME code is quite inconsistent here. Also make sure that there is no race between disconnect, input and wakeup. The inpcb fileds inp_icmp6filt and inp_cksum6 are protected by exclusive net lock in icmp6_ctloutput(). With all that, mark raw IPv6 sockets to handle input in parallel. OK mvs@
2024-04-14	Run raw IP input in parallel.	Alexander Bluhm
	Running raw IPv4 input with shared net lock in parallel is less complex than UDP. Especially there is no socket splicing. New ip_deliver() may run with shared or exclusive net lock. The last parameter indicates the mode. If is is running with shared netlock and encounters a protocol that needs exclusive lock, the packet is queued. Old ip_ours() always queued the packet. Now it calls ip_deliver() with shared net lock, and if that cannot handle the packet completely, the packet is queued and later processed with exclusive net lock. In case of an IPv6 header chain, that switches from shared to exclusive processing, the next protocol and mbuf offset are stored in a mbuf tag. OK mvs@
2024-04-12	Split single TCP inpcb table into IPv4 and IPv6 parts.	Alexander Bluhm
	With two separate TCP hash tables, each one becomes smaller. When we remove the exclusive net lock from TCP, contention on internet PCB table mutex will be reduced. UDP has been split earlier into IPv4 and IPv6. Replace branch conditions based on INP_IPV6 with assertions. OK mvs@
2024-04-09	Plug route leak in IP output.	Alexander Bluhm
	If no struct route is passed to ip_output() or ip6_output(), it uses its own iproute on the stack. In that case any route entry in the local route cache has to be freed. After pf decides to reroute, struct route is reset to NULL. Then the route reference counter has to be released. Call rtfree() without needless NULL check. OK mvs@
2024-04-06	IP multicast sysctl mrtmfc must not write outside of allocation.	Alexander Bluhm
	Reading sysctl mrt_sysctl_mfc() allocates memory to be copied back to user. Chunks of struct mfcinfo are copied from routing table to linear heap memory. If the allocated memory was not a multiple the struct size, a struct mfcinfo could be copied to a partially unallocated destination. Check that the end of the struct is within the allocation. From Alfredo Ortega; OK claudio@
2024-03-31	Combine route_cache() and rtalloc_mpath() in new route_mpath().	Alexander Bluhm
	Fill and check the cache and call rtalloc_mpath() together. Then the caller of route_mpath() does not have to care about the uint32_t *src pointer and just pass struct in_addr. All the conversions are done inside the functions. A previous version of this diff was backed out. There was an additional rtisvalid() in rtalloc_mpath() that prevented packet output via interfaces that were not up. Now the route in the cache has to be valid, but after new lookup, rtalloc_mpath() may return invalid routes. This generates less errors in userland an preserves existing behavior. OK sashan@
2024-03-26	Additional length check for IPv6 reassembled fragments.	Alexander Bluhm
	FreeBSD-SA-23:06.ipv6 security advisory has added an additional overflow check in frag6_input(). OpenBSD is not affected by that as the bug was introduced by another change in 2019. The existing code is complicated and NetBSD has taken the FreeBSD fix, although they were also not affected. The additional check makes the complicated code more robust. Length calculation taken from NetBSD. Discussed with FreeBSD. OK sashan@ mvs@
2024-03-22	Make local port which is bound during connect(2) unique per laddr.	Alexander Bluhm
	in_pcbconnect() did not pass down the address it got from in_pcbselsrc() to in_pcbpickport(). As a consequence local port numbers selected during connect(2) were globally unique although they belong to different addresses. This strict uniqueness is not necessary and wastes usable ports for outgoing connections. To solve this, pass ina from in_pcbconnect() to in_pcbbind_locked(). This does not interfere how wildcard sockets are matched with specific sockets during bind(2). It only allows non-wildcard sockets to share a local port during connect(2). OK mvs@ deraadt@
2024-02-29	revert "Combine route_cache() and rtalloc_mpath() in new route_mpath()"	Christian Weisgerber
	It breaks NFS. ok claudio@
2024-02-28	Cleanup IP input, forward, output.	Alexander Bluhm
	Before changing the routing code, get IPv4 and IPv6 input, forward, and output in a similar shape. Remove inconsistencies. OK claudio@
2024-02-27	Combine route_cache() and rtalloc_mpath() in new route_mpath().	Alexander Bluhm
	Fill and check the cache and call rtalloc_mpath() together. Then the caller of route_mpath() does not have to care about the uint32_t *src pointer and just pass struct in_addr. All the conversions are done inside the functions. ro->ro_rt is either valid or NULL. Note that some places have a stricter rtisvalid() now compared to the previous NULL check. OK claudio@
2024-02-22	Make the route cache aware of multipath routing.	Alexander Bluhm
	Pass source address to route_cache() and store it in struct route. Cached multipath routes are only valid if source address matches. If sysctl multipath changes, increase route generation number. OK claudio@
2024-02-14	Hide struct ip6q, struct ip6asfrag, struct ip6_moptions,	Claudio Jeker
	struct ip6po_rhinfo and struct ip6_pktopts behind _KERNEL. The only bit userland may want from netinet6/ip6_var.h is struct ip6stat. The recent change to struct ip6po_rhinfo to use struct route resulted in various build failures in ports because code included netinet6/ip6_var.h without net/route.h. OK tb@ sthen@
2024-02-13	Merge struct route and struct route_in6.	Alexander Bluhm
	Use a common struct route for both inet and inet6. Unfortunately struct sockaddr is shorter than sockaddr_in6, so netinet/in.h has to be exposed from net/route.h. Struct route has to be bsd visible for userland as netstat kvm code inspects inp_route. Internet PCB and TCP SYN cache can use a plain struct route now. All specific sockaddr types for inet and inet6 are embeded there. OK claudio@
2024-02-11	Use `sb_mtx' instead of `inp_mtx' in receive path for inet sockets.	Vitaliy Makkoveev
	In soreceve(), we only touch `so_rcv' socket buffer, which has it's own `sb_mtx' mutex(9) for protection. So, we can avoid solock() in this path - it's enough to hold `sb_mtx' in soreceive() and around corresponding sbappend*(). But not right now :) This time we use shared netlock for some inet sockets in the soreceive() path. To protect `so_rcv' buffer we use `inp_mtx' mutex(9) and the pru_lock() to acquire this mutex(9) in socket layer. But the `inp_mtx' mutex belongs to the PCB. We initialize socket before PCB, tcp(4) sockets could exist without PCB, so use `sb_mtx' mutex(9) to protect sockbuf stuff. This diff mechanically replaces `inp_mtx' by `sb_mtx' in the receive path. Only for sockets which already use `inp_mtx'. All other sockets left as is. They will be converted later. Since the `sb_mtx' is optional, the new SB_MTXLOCK flag introduced. If this flag is set on `sb_flags', the `sb_mtx' mutex(9) should be taken. New sb_mtx_lock() and sb_mtx_unlock() was introduced to hide this check. They are temporary and will be replaced by mtx_enter() when all this area will be converted to `sb_mtx' mutex(9). Also, the new sbmtxassertlocked() function introduced to throw corresponding assertion for SB_MTXLOCK marked buffers. This time only sbappendaddr() calls it. This function is also temporary and will be replaced by MTX_ASSERT_LOCKED() later. ok bluhm
2024-02-11	Remove include netinet6/ip6_var.h from netinet/in_pcb.h.	Alexander Bluhm
	OK mvs@
2024-02-09	Route cache function returns hit or miss.	Alexander Bluhm
	The route_cache() function can easily return whether it was a cache hit or miss. Then the logic to perform a route lookup gets a bit simpler. Some more complicated if (ro->ro_rt == NULL) checks still exist elsewhere. Also use route cache in in_pcbselsrc() instead of filling struct route manually. OK claudio@
2024-02-07	Use the route generation number also for IPv6.	Alexander Bluhm
	Implement route6_cache() to check whether the cached route is still valid and otherwise fill caching parameter of struct route_in6. Also count cache hits and misses in netstat. in_pcbrtentry() uses route cache now. OK claudio@
2024-02-05	Add netstat counter for route cache.	Alexander Bluhm
	To optimize route caching, count cache hits and misses. This is shown in netstat -s for both inet and inet6. Reuse the old IPv6 forward cache counter. Sort ip6s_wrongif consistently. For now only IPv4 cache counter has been implemented. OK mvs@
2024-02-03	Rework socket buffers locking for shared netlock.	Vitaliy Makkoveev
	Shared netlock is not sufficient to call so{r,w}wakeup(). The following sowakeup() modifies `sb_flags' and knote(9) stuff. Unfortunately, we can't call so{r,w}wakeup() with `inp_mtx' mutex(9) because sowakeup() also calls pgsigio() which grabs kernel lock. However, `so_filtops' callbacks only perform read-only access to the socket stuff, so it is enough to hold shared netlock only, but the klist stuff needs to be protected. This diff introduces `sb_mtx' mutex(9) to protect sockbuf. This time `sb_mtx' used to protect only `sb_flags' and `sb_klist'. Now we have soassertlocked_readonly() and soassertlocked(). The first one is happy if only shared netlock is held, meanwhile the second wants `so_lock' or pru_lock() be held together with shared netlock. To keep soassertlocked() assertions soft, we need to know mutex(9) state, so new mtx_owned() macro was introduces. Also, the new optional (*pru_locked)() handler brings the state of pru_lock(). Tests and ok from bluhm.