src - OpenBSD base system

Age	Commit message	Author
2024-02-09	Route cache function returns hit or miss.	Alexander Bluhm
	The route_cache() function can easily return whether it was a cache hit or miss. Then the logic to perform a route lookup gets a bit simpler. Some more complicated if (ro->ro_rt == NULL) checks still exist elsewhere. Also use route cache in in_pcbselsrc() instead of filling struct route manually. OK claudio@
2024-02-07	Use the route generation number also for IPv6.	Alexander Bluhm
	Implement route6_cache() to check whether the cached route is still valid and otherwise fill caching parameter of struct route_in6. Also count cache hits and misses in netstat. in_pcbrtentry() uses route cache now. OK claudio@
2024-02-05	Add netstat counter for route cache.	Alexander Bluhm
	To optimize route caching, count cache hits and misses. This is shown in netstat -s for both inet and inet6. Reuse the old IPv6 forward cache counter. Sort ip6s_wrongif consistently. For now only IPv4 cache counter has been implemented. OK mvs@
2024-02-03	Rework socket buffers locking for shared netlock.	Vitaliy Makkoveev
	Shared netlock is not sufficient to call so{r,w}wakeup(). The following sowakeup() modifies `sb_flags' and knote(9) stuff. Unfortunately, we can't call so{r,w}wakeup() with `inp_mtx' mutex(9) because sowakeup() also calls pgsigio() which grabs kernel lock. However, `so_filtops' callbacks only perform read-only access to the socket stuff, so it is enough to hold shared netlock only, but the klist stuff needs to be protected. This diff introduces `sb_mtx' mutex(9) to protect sockbuf. This time `sb_mtx' is used to protect only `sb_flags' and `sb_klist'. Now we have soassertlocked_readonly() and soassertlocked(). The first one is happy if only shared netlock is held, meanwhile the second wants `so_lock' or pru_lock() to be held together with shared netlock. To keep soassertlocked() assertions soft, we need to know mutex(9) state, so new mtx_owned() macro was introduced. Also, the new optional (*pru_locked)() handler brings the state of pru_lock(). Tests and ok from bluhm.
2024-01-31	Add route generation number to route cache.	Alexander Bluhm
	The outgoing route is cached at the inpcb. This cache was only invalidated when the socket closes or if the route gets invalid. More specific routes were not detected. Especially with dynamic routing protocols, sockets must be closed and reopened to use the correct route. Running ping during a route change shows the problem. To solve this, add a route generation number that is updated whenever the routing table changes. The lookup in struct route is put into the route_cache() function. If the generation number is too old, the cached route gets discarded. Implement route_cache() for ip_output() and ip_forward() first. IPv6 and more places will follow. OK claudio@
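	The generation check described above can be sketched in userland C. This is a minimal illustration, not the kernel code: struct toy_route, toy_route_generation, toy_table_changed() and toy_route_cache() are hypothetical names, the destination is reduced to an integer, and the hit/miss return value is an assumption of the sketch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in; not the kernel's struct route. */
struct toy_route {
	unsigned long	 tr_generation;	/* table generation when cached */
	unsigned int	 tr_dst;	/* destination the entry was cached for */
	const char	*tr_rt;		/* cached lookup result, NULL if none */
};

/* Bumped on every routing table change. */
static unsigned long toy_route_generation = 1;

static void
toy_table_changed(void)
{
	toy_route_generation++;
}

/*
 * Return 0 on a cache hit and 1 on a miss.  On a miss the stale entry
 * is discarded and the cache parameters are refilled; the caller is
 * expected to perform the lookup and store the result in tr_rt.
 */
static int
toy_route_cache(struct toy_route *ro, unsigned int dst)
{
	assert(ro != NULL);

	if (ro->tr_rt != NULL &&
	    ro->tr_generation == toy_route_generation &&
	    ro->tr_dst == dst)
		return 0;

	ro->tr_rt = NULL;
	ro->tr_generation = toy_route_generation;
	ro->tr_dst = dst;
	return 1;
}
```

	In this sketch a cached entry survives repeated calls for the same destination, and a single toy_table_changed() call is enough to turn the next call into a miss, without the socket being closed.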
2024-01-31	Split in_pcbrtentry() and in6_pcbrtentry() based on INP_IPV6.	Alexander Bluhm
	Splitting the IPv6 code into a separate function results in less #ifdef INET6. Also struct route_in6 *ro in in6_pcbrtentry() is of the correct type and in_pcbrtentry() does not rely on the fact that inp_route and inp_route6 are pointers to the same union. OK kn@ claudio@
2024-01-28	Use more specific sockaddr type for inpcb notify.	Alexander Bluhm
	in_pcbnotifyall() is an IPv4 only function. All callers check that sockaddr dst is in fact a sockaddr_in. Pass the more specific type and remove the runtime check at beginning of in_pcbnotifyall(). Use const sockaddr_in in in_pcbnotifyall() and const sockaddr_in6 in in6_pcbnotify() as dst parameter. OK millert@
2024-01-27	Declare address parameter in TCP SYN cache const.	Alexander Bluhm
	tcp6_ctlinput() cast a constant sockaddr_in6 to non-const sockaddr. sa6_src may be &sa6_any which lives in read-only data section. Better pass down the const addresses to syn_cache_lookup(). They are needed for hash lookup and are not modified. OK mvs@
2024-01-21	Assert that inpcb table has correct address family.	Alexander Bluhm
	Since inpcb tables for UDP and Raw IP have been split into IPv4 and IPv6, assert that INP_IPV6 flag is correct instead of checking it. While there, give the table variable a nicer name. OK sashan@ mvs@
2024-01-18	Move the rtable_exists() check into in_pcbset_rtableid().	Claudio Jeker
	OK bluhm@ mvs@
2024-01-11	Use domain name for socket lock.	Alexander Bluhm
	Syzkaller with witness complains about lock ordering of pf lock with socket lock. Socket lock for inet is taken before pf lock. Pf lock is taken before socket lock for route. This is a false positive as route and inet socket locks are distinct. Witness does not know this. Name the socket lock like the domain of the socket, then rwlock name is used in witness lo_name subtype. Make domain names more consistent for locking, they were not used anyway. Regardless of witness problem, unique lock name for each socket type makes sense. Reported-by: syzbot+34d22dcbf20d76629c5a@syzkaller.appspotmail.com Reported-by: syzbot+fde8d07ba74b69d0adfe@syzkaller.appspotmail.com OK mvs@
2024-01-09	Convert some struct inpcb parameter to const pointer.	Alexander Bluhm
	OK millert@
2024-01-01	Reduce code duplication in ip6 divert.	Alexander Bluhm
	Protocols like UDP or TCP keep only functions in netinet6 that are essentially different. Remove divert6_detach(), divert6_lock(), divert6_unlock(), divert6_bind(), and divert6_shutdown(). Replace them with identical IPv4 functions. INP_HDRINCL is an IPv4 only option, remove it from divert6_attach(). OK mvs@ sashan@ kn@
2023-12-15	Use inpcb table mutex to set addresses.	Alexander Bluhm
	Protect all remaining write access to inp_faddr and inp_laddr with inpcb table mutex. Document inpcb locking for foreign and local address and port and routing table id. Reading will be made MP safe by adding per socket rw-locks in a next step. OK sashan@ mvs@
2023-12-07	Inpcb table mutex protects addr and port during bind(2) and connect(2).	Alexander Bluhm
	in_pcbbind(), in_pcbconnect(), and in6_pcbconnect() have to set addresses and ports within the same critical section as the inpcb hash table calculation. Also lookup and address selection have to be protected to avoid bindings and connections that are not unique. For that in_pcbpickport() and in_pcbbind_locked() expect that the table mutex is already taken. The functions in_pcblookup_lock(), in_pcblookup_local_lock(), and in_pcbaddrisavail_lock() grab the mutex iff the lock parameter is IN_PCBLOCK_GRAB. Otherwise the parameter is IN_PCBLOCK_HOLD and the lock has to be taken already. Note that in_pcblookup_lock() and in_pcblookup_local() return an inp with increased reference iff they take and release the lock. Otherwise the caller protects the life time of the inp. This gives enough flexibility that in_pcbbind() and in_pcbconnect() can hold the table mutex when they need it. The public inpcb API does not change. OK sashan@ mvs@
2023-12-06	Protect socket receive buffer in IP multicast routing.	Alexander Bluhm
	Since soreceive() runs in parallel for raw sockets, sbappendaddr() has to be protected by inpcb mutex. This was missing in multicast forwarding which is running with a combination of shared net lock and kernel lock. soreceive() uses shared net lock and mutex per inpcb. Grab mutex before sbappendaddr() in socket_send() and socket6_send(). panic receive 1 reported by Jo Geraerts OK mvs@ claudio@
2023-12-03	Rename all in6p local variables to inp.	Alexander Bluhm
	There exists no struct in6pcb in OpenBSD, this was an old kame idea. Calling the local variable in6p does not make sense, it is actually a struct inpcb. Also in6p is not used consistently in inet6 code. Having the same convention for IPv4 and IPv6 is less confusing. OK sashan@ mvs@
2023-12-03	Use INP_IPV6 flag instead of sotopf().	Alexander Bluhm
	During initialization in_pcballoc() sets INP_IPV6 once to avoid reaching through inp_socket->so_proto->pr_domain->dom_family. Use this flag consistently. OK sashan@ mvs@
2023-12-01	Set inp address, port and rtable together with inpcb hash.	Alexander Bluhm
	The inpcb hash table is protected by table->inpt_mtx. The hash is based on addresses, ports, and routing table. These fields were not synchronized with the hash. Put writes and hash update into the same critical section. Move the updates from ip_ctloutput(), ip6_ctloutput(), syn_cache_get(), tcp_connect(), udp_disconnect() to dedicated inpcb set functions. There they use the same table mutex as in_pcbrehash(). in_pcbbind(), in_pcbconnect(), and in6_pcbconnect() need more work and are not included yet. OK sashan@ mvs@
2023-12-01	Make internet PCB connect more consistent.	Alexander Bluhm
	The public interface is in_pcbconnect(). It dispatches to in6_pcbconnect() if necessary. Call the former from tcp_connect() and udp_connect(). In in6_pcbconnect() initialization in6a = NULL is not necessary. in6_pcbselsrc() sets the pointer, but does not read the value. Pass a constant in6_addr pointer to in6_pcbselsrc() and in6_selectsrc(). It returns a reference to the address of some internal data structure. We want to be sure that in6_addr is not modified this way. IPv4 in_pcbselsrc() solves this by passing a copy of the address. OK kn@ sashan@ mvs@
2023-11-29	Document inp_socket as immutable and remove NULL checks.	Alexander Bluhm
	Struct inpcb field inp_socket is initialized in in_pcballoc(). It is not NULL and never changed. OK mvs@
2023-11-28	Remove struct inpcb from in6_embedscope() parameters.	Alexander Bluhm
	rip6_output() did modify inp_outputopts6 temporarily to provide different ip6_pktopts to in6_embedscope(). Better pass inp_outputopts6 and inp_moptions6 as separate arguments to in6_embedscope(). Simplify the code that deals with these options in in6_embedscope(). Document inp_moptions and inp_moptions6 as protected by net lock. OK kn@
2023-11-26	Remove inp parameter from ip_output().	Alexander Bluhm
	ip_output() received inp as parameter. This is only used to lookup the IPsec level of the socket. Reasoning about MP locking is much easier if only relevant data is passed around. Convert ip_output() to receive constant inp_seclevel as argument and mark it as protected by net lock. OK mvs@
2023-11-10	rtable_match() takes constant destination.	Alexander Bluhm
	For implementing MP safe route lookup, it helps to know which function parameters are constant. Add some const declarations, so that the compiler guarantees that sockaddr dst parameter of rtable_match() does not change. OK dlg@
2023-09-16	Allow counters_read(9) to take an optional scratch buffer.	Martin Pieuchot
	Using a scratch buffer makes it possible to take a consistent snapshot of per-CPU counters without having to allocate memory. Makes ddb(4) show uvmexp command work in OOM situations. ok kn@, mvs@, cheloha@
2023-09-06	Use shared net lock for ip_send() and ip6_send().	Alexander Bluhm
	When called with NULL options, ip_output() and ip6_output() are MP safe. Convert exclusive to shared net lock in send dispatch. OK mpi@
2023-07-30	Check for NULL before de-referencing a pointer, not after.	Kenneth R Westerback
	More complete solution after tb@ pointed out what Coverity missed. ok tb@
2023-07-29	Check for NULL before de-referencing a pointer, not after.	Kenneth R Westerback
	Coverity CID #1566406 ok phessler@
2023-07-09	Fix route entry leak.	Alexander Bluhm
	In in6_ifdetach() two struct rtentry were leaked. This was triggered by regress/sbin/route and detected with btrace(8) refcnt. The reference returned by rtalloc() must be freed with rtfree() in all cases. OK phessler@ mvs@
2023-07-07	Fix path MTU discovery for TCP LRO/TSO when forwarding.	Alexander Bluhm
	When doing LRO (Large Receive Offload), the drivers, currently ix(4) and lo(4) only, record an upper bound of the size of the original packets in ph_mss. When sending, either stack or hardware must chop the packets with TSO (TCP Segmentation Offload) to that size. That means we have to call tcp_if_output_tso() before ifp->if_output(). Put that logic into if_output_tso() to avoid code duplication. As TCP packets on the wire do not get larger that way, path MTU discovery should still work. tested by and OK jan@
2023-06-28	use refcnt API for multicast addresses, add tracepoint:refcnt:ifmaddr probe	Klemens Nanni
	Replace hand-rolled reference counting with refcnt_init(9) and hook it up with a new dt(4) probe. OK bluhm mvs
2023-06-24	Calculate inet PCB SIP hash without table mutex.	Alexander Bluhm
	Goal is to run UDP input in parallel. Btrace kstack analysis shows that SIP hash for PCB lookup is quite expensive. When running in parallel, there is also lock contention on the PCB table mutex. It results in better performance to calculate the hash value before taking the mutex. The hash secret has to be constant as hash calculation must not depend on values protected by the table mutex. Do not reseed anymore when hash table gets resized. Analysis also shows that asserting a rw_lock while holding a mutex is a bit expensive. Just remove the netlock assert. OK dlg@ mvs@
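	The ordering described above, hash first and mutex second, can be sketched as follows. All names are hypothetical, a keyed FNV-1a hash stands in for SipHash, and a plain flag stands in for the table mutex, so this only illustrates the ordering and makes no claim about the kernel implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_MAXBUCKETS	64

/* Hypothetical PCB table; tt_locked stands in for the table mutex. */
struct toy_table {
	uint64_t	tt_secret;	/* set once, never reseeded */
	int		tt_locked;	/* 1 while the "mutex" is held */
	size_t		tt_nbuckets;	/* protected by the "mutex" */
	unsigned int	tt_count[TOY_MAXBUCKETS];
};

/*
 * Keyed FNV-1a over address and port.  It reads only its arguments,
 * so it can run before the table lock is taken.
 */
static uint64_t
toy_hash(uint64_t secret, uint32_t addr, uint16_t port)
{
	uint8_t key[6];
	uint64_t h = 14695981039346656037ULL ^ secret;
	size_t i;

	key[0] = (uint8_t)(addr >> 24);
	key[1] = (uint8_t)(addr >> 16);
	key[2] = (uint8_t)(addr >> 8);
	key[3] = (uint8_t)addr;
	key[4] = (uint8_t)(port >> 8);
	key[5] = (uint8_t)port;
	for (i = 0; i < sizeof(key); i++) {
		h ^= key[i];
		h *= 1099511628211ULL;
	}
	return h;
}

/* Insert an entry and return the bucket index that was used. */
static size_t
toy_insert(struct toy_table *t, uint32_t addr, uint16_t port)
{
	/* Expensive part first, without holding the lock. */
	uint64_t h = toy_hash(t->tt_secret, addr, port);
	size_t bucket;

	assert(!t->tt_locked);
	t->tt_locked = 1;		/* enter critical section */
	assert(t->tt_nbuckets > 0 && t->tt_nbuckets <= TOY_MAXBUCKETS);
	/* Only the cheap reduction depends on protected state. */
	bucket = (size_t)(h % t->tt_nbuckets);
	t->tt_count[bucket]++;
	t->tt_locked = 0;		/* leave critical section */
	return bucket;
}
```

	Because the secret stays constant, a hash value computed before the lock is still valid after a resize; only the reduction to a bucket index has to be repeated with the new table size.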
2023-06-16	If TSO is enabled, fix the IPv6 forward counters and icmp6 redirect.	Alexander Bluhm
	First try to send with TSO. The goto senderr handles icmp6 redirect and other errors. If TSO is not necessary and the interface MTU fits, just send the packet. Again goto senderr handles icmp6. Finally care about icmp6 packet too big. tested and OK jan@
2023-06-14	Add missing kernel lock around (*if_ioctl)().	Vitaliy Makkoveev
	ok bluhm
2023-06-13	Fix a typo with TSO logic in ip6_output(). Of course compare ph_mss	Alexander Bluhm
	with if_mtu and not the packet checksum flags. ph_mss contains the size of the chopped packets. OK jan@
2023-06-01	Enable forwarding of ix(4) LRO packets via TSO	Jan Klemkow
	Also fix ip6_forwarding of TSO packets with tcp_if_output_tso(). With a lot of testing from Hrvoje Popovski and a lot of tweaks from bluhm@ ok bluhm@
2023-05-22	Fix TSO for traffic to a local address on a physical interface.	Alexander Bluhm
	When sending TCP packets with software TSO to the local address of a physical interface, the TCP checksum was miscalculated. As the small MSS is taken from the physical interface, but the large MTU of the loopback interface is used, large TSO packets are generated, but sent directly to the loopback interface. There we need the regular pseudo header checksum and not the modified one without packet length. To avoid this confusion, use the same decision for checksum generation in in_proto_cksum_out() as for using hardware TSO in tcp_if_output_tso(). bug reported and tested by robert@ bket@ Hrvoje Popovski OK claudio@ jan@
2023-05-15	Implement the TCP/IP layer for hardware TCP segmentation offload.	Alexander Bluhm
	If the driver of a network interface claims to support TSO, do not chop the packet in software, but pass it down to the interface layer. Precalculate parts of the pseudo header checksum, but without the packet length. The length of all generated smaller packets is not known yet. Driver and hardware will use the mbuf packet header field ph_mss to calculate it and update checksum. Introduce separate flags IFCAP_TSOv4 and IFCAP_TSOv6 as hardware might support only one protocol family. The old flag IFXF_TSO is only relevant for large receive offload. It is misnamed, but keep that for now. Note that drivers do not set TSO capabilities yet. Also the ifconfig flags and pseudo interfaces capabilities will be done separately. So this commit should not change behavior. heavily based on the work from jan@; OK sashan@
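	Why the pseudo header sum can be precalculated without the length follows from the Internet checksum being a one's complement sum, where the order of additions does not matter. The sketch below shows this for the IPv4 pseudo header with addresses and length given in host byte order. The toy_ functions are hypothetical helpers for illustration and do not correspond to kernel functions.

```c
#include <assert.h>
#include <stdint.h>

/* Fold a 32-bit accumulator into a 16-bit one's complement sum. */
static uint16_t
toy_fold(uint32_t sum)
{
	while (sum >> 16)
		sum = (sum & 0xffff) + (sum >> 16);
	return (uint16_t)sum;
}

/* Pseudo header sum over addresses and protocol, length left out. */
static uint16_t
toy_pseudo_partial(uint32_t src, uint32_t dst, uint8_t proto)
{
	uint32_t sum = 0;

	sum += src >> 16;
	sum += src & 0xffff;
	sum += dst >> 16;
	sum += dst & 0xffff;
	sum += proto;
	return toy_fold(sum);
}

/* Later step: add the length of one generated segment. */
static uint16_t
toy_pseudo_finish(uint16_t partial, uint16_t len)
{
	return toy_fold((uint32_t)partial + len);
}

/* Reference: pseudo header sum computed in one go, length included. */
static uint16_t
toy_pseudo_full(uint32_t src, uint32_t dst, uint8_t proto, uint16_t len)
{
	uint32_t sum = 0;

	sum += src >> 16;
	sum += src & 0xffff;
	sum += dst >> 16;
	sum += dst & 0xffff;
	sum += proto;
	sum += len;
	return toy_fold(sum);
}
```

	For any segment length, toy_pseudo_finish() applied to the partial sum gives the same value as toy_pseudo_full(), which is what allows the length to be filled in after the packet has been split.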
2023-05-13	Finally remove the kernel lock from IPv6 neighbor discovery. ND6	Alexander Bluhm
	entries in rt_llinfo are protected either by exclusive netlock or the ND6 mutex. The performance critical lookup path in nd6_resolve() uses shared netlock, but is not lockless. In contrast to ARP it grabs the mutex also in the common case. tested by Hrvoje Popovski; with and OK kn@
2023-05-12	Make access to rt_llinfo consistent and remove needless initialisation.	Alexander Bluhm
	OK mvs@
2023-05-10	Implement TCP send offloading, for now in software only. This is	Alexander Bluhm
	meant as a fallback if network hardware does not support TSO. Driver support is still work in progress. TCP output generates large packets. In IP output the packet is chopped to TCP maximum segment size. This reduces the CPU cycles used by pf. The regular output could be assisted by hardware later, but pf route-to and IPsec need the software fallback in general. For performance comparison or to work around possible bugs, sysctl net.inet.tcp.tso=0 disables the feature. netstat -s -p tcp shows TSO counter with chopped and generated packets. based on work from jan@ tested by jmc@ jan@ Hrvoje Popovski OK jan@ claudio@
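	The chopping step can be pictured as splitting one large TCP payload into pieces of at most one maximum segment size. The sketch below only computes the segment lengths. toy_tso_chop() is a hypothetical helper, and header handling, sequence numbers and checksums are left out.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Split a payload of paylen bytes into segments of at most mss bytes.
 * Store the segment lengths in seglen[] and return their number.
 * Return 0 if mss is zero, seglen is NULL, or more than maxsegs
 * segments would be needed.
 */
static size_t
toy_tso_chop(size_t paylen, size_t mss, size_t *seglen, size_t maxsegs)
{
	size_t n = 0, off = 0;

	if (mss == 0 || seglen == NULL)
		return 0;

	while (off < paylen) {
		size_t len = paylen - off;

		if (len > mss)
			len = mss;	/* all but the last are full sized */
		if (n == maxsegs)
			return 0;	/* caller's array is too small */
		seglen[n++] = len;
		off += len;
	}
	return n;
}
```

	With a payload of 3000 bytes and an MSS of 1460, this yields two full segments of 1460 bytes and a final one of 80 bytes.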
2023-05-08	The call to in_proto_cksum_out() is only needed before the packet	Alexander Bluhm
	is passed to ifp->if_output(). The fragment code has its own checksum calculation and the other paths end in goto bad. OK claudio@
2023-05-08	To make ND6 mp-safe, the life time of struct llinfo_nd6 *ln =	Alexander Bluhm
	rt->rt_llinfo has to be guaranteed. Replace the complicated logic in nd6_rtrequest() case RTM_ADD with what we have in ARP. This avoids accessing ln here. Digging through history shows a lot of refactoring that makes rt_expire handling in RTM_ADD obsolete. Just initialize it to 0. Cloning and local routes should never expire. If RTF_LLINFO is set, ln should not be NULL. So nd6_llinfo_settimer() was not reached in this case. While there, remove obsolete comments and #if 0 code that never worked. OK kn@ claudio@
2023-05-08	As the nd6 mutex protects the lifetime of struct llinfo_nd6 ln,	Alexander Bluhm
	nd6_mtx must be held longer in nd6_rtrequest() case RTM_RESOLVE. OK kn@
2023-05-07	In preparation for TSO in software, clean up the fragment code. Use	Alexander Bluhm
	if_output_ml() to send mbuf lists to interfaces. This can be used for TSO, fragments, ARP and ND6. Rename variable fml to ml. In pf_route6() split the if else block. Put the safety check (hlen + firstlen < tlen) into ip_fragment(). It makes the code correct in case the packet is too short to be fragmented. This should not happen, but other functions also have this logic. No functional change. OK sashan@
2023-05-04	Introduce a neighbor discovery mutex like ARP uses it. For now it	Alexander Bluhm
	only protects nd6_list. It does not unlock ND6 from kernel lock yet. OK kn@
2023-05-03	Some checks in nd6_resolve() do not require kernel lock. The analog	Alexander Bluhm
	code for ARP has been unlocked a while ago. OK kn@
2023-05-02	Call nd6_ns_output() without kernel lock from nd6_resolve().	Alexander Bluhm
	OK kn@
2023-04-28	Inbound portion of RFC9131. Routers can create new neighbor cache entries	Peter Hessler
	when receiving a valid Neighbor Advertisement. OK florian@ kn@
2023-04-25	When configuring a new address on an interface, an upstream router	Peter Hessler
	doesn't know where to send traffic. This will send an unsolicited neighbor advertisement, as described in RFC9131, to the all-routers multicast address so all routers on the same link will learn the path back to the address. This is intended to speed up the first return packet on an IPv6 interface. OK florian@