Age | Commit message | Author |
|
in_pcbbind(), in_pcbconnect(), and in6_pcbconnect() have to set
addresses and ports within the same critical section as the inpcb
hash table calculation. Also lookup and address selection have to
be protected to avoid bindings and connections that are not unique.
For that in_pcbpickport() and in_pcbbind_locked() expect that the
table mutex is already taken. The functions in_pcblookup_lock(),
in_pcblookup_local_lock(), and in_pcbaddrisavail_lock() grab the
mutex iff the lock parameter is IN_PCBLOCK_GRAB. Otherwise the
parameter is IN_PCBLOCK_HOLD and the lock has to be taken already.
Note that in_pcblookup_lock() and in_pcblookup_local() return an
inp with increased reference iff they take and release the lock.
Otherwise the caller protects the lifetime of the inp.
This gives enough flexibility that in_pcbbind() and in_pcbconnect()
can hold the table mutex when they need it. The public inpcb API
does not change.
OK sashan@ mvs@
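A minimal sketch of the lock parameter pattern, with illustrative names
and a simplified argument list (not the committed diff):

    struct inpcb *
    in_pcblookup_local_lock(struct inpcbtable *table, void *laddr,
        u_int lport, int flags, u_int rtable, int lock)
    {
        struct inpcb *inp;

        if (lock == IN_PCBLOCK_GRAB)
            mtx_enter(&table->inpt_mtx);
        else
            MUTEX_ASSERT_LOCKED(&table->inpt_mtx);

        /* hash lookup runs under the table mutex */
        inp = in_pcbhash_lookup(table, laddr, lport, flags, rtable);
        /* in_pcbhash_lookup() is a hypothetical locked helper */

        if (lock == IN_PCBLOCK_GRAB) {
            if (inp != NULL)
                in_pcbref(inp);    /* caller gets a reference */
            mtx_leave(&table->inpt_mtx);
        }
        return (inp);
    }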
|
|
Since soreceive() runs in parallel for raw sockets, sbappendaddr()
has to be protected by inpcb mutex. This was missing in multicast
forwarding which is running with a combination of shared net lock
and kernel lock. soreceive() uses shared net lock and mutex per
inpcb. Grab mutex before sbappendaddr() in socket_send() and
socket6_send().
panic "receive 1" reported by Jo Geraerts
OK mvs@ claudio@
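Roughly, the forwarding path now appends under the mutex, as in this
sketch (simplified, not the exact diff):

    /* soreceive() runs with shared net lock plus inpcb mutex, so
       sbappendaddr() must be serialized with the same mutex */
    mtx_enter(&inp->inp_mtx);
    ret = sbappendaddr(so, &so->so_rcv, sintosa(&src), m, NULL);
    mtx_leave(&inp->inp_mtx);

    if (ret == 0) {
        m_freem(m);        /* no socket buffer space */
        return;
    }
    sorwakeup(so);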
|
|
During initialization in_pcballoc() sets INP_IPV6 once to avoid
reaching through inp_socket->so_proto->pr_domain->dom_family. Use
this flag consistently.
OK sashan@ mvs@
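For illustration, the difference at a use site (sketch):

    /* before: reach through socket, protocol and domain */
    af = inp->inp_socket->so_proto->pr_domain->dom_family;

    /* after: INP_IPV6 is set once in in_pcballoc() and never changes */
    af = ISSET(inp->inp_flags, INP_IPV6) ? AF_INET6 : AF_INET;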
|
|
protects related data.
ok bluhm
|
|
The inpcb hash table is protected by table->inpt_mtx. The hash is
based on addresses, ports, and routing table. These fields were
not synchronized with the hash. Put writes and hash update into the
same critical section.
Move the updates from ip_ctloutput(), ip6_ctloutput(), syn_cache_get(),
tcp_connect(), udp_disconnect() to dedicated inpcb set functions.
There they use the same table mutex as in_pcbrehash().
in_pcbbind(), in_pcbconnect(), and in6_pcbconnect() need more work
and are not included yet.
OK sashan@ mvs@
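The shape of such a set function, as a sketch with a hypothetical name
(the locking contract of in_pcbrehash() is simplified here):

    void
    in_pcbset_laddr(struct inpcb *inp, struct in_addr laddr, u_int rtableid)
    {
        struct inpcbtable *table = inp->inp_table;

        /* hash input fields and rehash change in one critical section */
        mtx_enter(&table->inpt_mtx);
        inp->inp_laddr = laddr;
        inp->inp_rtableid = rtableid;
        in_pcbrehash(inp);
        mtx_leave(&table->inpt_mtx);
    }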
|
|
The public interface is in_pcbconnect(). It dispatches to
in6_pcbconnect() if necessary. Call the former from tcp_connect()
and udp_connect().
In in6_pcbconnect() the initialization in6a = NULL is not necessary.
in6_pcbselsrc() sets the pointer, but does not read the value.
Pass a constant in6_addr pointer to in6_pcbselsrc() and in6_selectsrc().
It returns a reference to the address of some internal data structure.
We want to be sure that in6_addr is not modified this way. IPv4
in_pcbselsrc() solves this by passing a copy of the address.
OK kn@ sashan@ mvs@
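The dispatch itself is straightforward (sketch; the v4 helper name is
hypothetical):

    int
    in_pcbconnect(struct inpcb *inp, struct mbuf *nam)
    {
    #ifdef INET6
        if (ISSET(inp->inp_flags, INP_IPV6))
            return (in6_pcbconnect(inp, nam));
    #endif
        return (in_pcbconnect_v4(inp, nam));    /* hypothetical */
    }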
|
|
TCP syn_cache_respond() uses inp_seclevel from listening socket as
ip_output() parameter. This was missing for ip6_output().
OK mvs@
|
|
As syn_cache_timer() uses syn cache mutex and exclusive net lock,
it does not need kernel lock.
OK mvs@
|
|
Struct inpcb field inp_socket is initialized in in_pcballoc(). It
is not NULL and never changed.
OK mvs@
|
|
rip6_output() did modify inp_outputopts6 temporarily to provide
different ip6_pktopts to in6_embedscope(). Better pass inp_outputopts6
and inp_moptions6 as separate arguments to in6_embedscope().
Simplify the code that deals with these options in in6_embedscope().
Document inp_moptions and inp_moptions6 as protected by net lock.
OK kn@
|
|
In some cases inp may be NULL, so check that before passing
inp->inp_seclevel to ipsp_spd_lookup() or ip_output().
Missed in previous commit.
|
|
ip_output() received inp as parameter. This is only used to lookup
the IPsec level of the socket. Reasoning about MP locking is much
easier if only relevant data is passed around. Convert ip_output()
to receive constant inp_seclevel as argument and mark it as protected
by net lock.
OK mvs@
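At a call site this reduces to passing the security level, or NULL when
there is no socket (sketch; the argument list is abbreviated and the
exact prototype may differ):

    /* before: the whole inpcb travels into ip_output() */
    error = ip_output(m, opts, &ro, flags, imo, inp);

    /* after: only the net-lock-protected seclevel is passed */
    error = ip_output(m, opts, &ro, flags, imo,
        inp ? &inp->inp_seclevel : NULL);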
|
|
Introduce global TCP SYN cache mutex. Divide the timer function into
parts protected by mutex and sending with netlock. Split the flags
field in dynamic flags protected by mutex and fixed flags set during
initialization. Document whether fields of struct syn_cache are
protected by net lock or mutex.
input and OK sashan@
|
|
OK mvs@ jca@
|
|
For implementing MP safe route lookup, it helps to know which
function parameters are constant. Add some const declarations, so
that the compiler guarantees that sockaddr dst parameter of
rtable_match() does not change.
OK dlg@
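E.g. the prototype gains a const qualifier, a sketch of the intent
(other parameters abbreviated from memory):

    /* the compiler now guarantees the lookup key stays read-only */
    struct rtentry  *rtable_match(unsigned int rtableid,
                        const struct sockaddr *dst, uint32_t *src);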
|
|
Since cheloha@ has implemented timeout processes that do not grab
the kernel lock, start using TIMEOUT_MPSAFE for arptimer().
OK kn@ mvs@
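With the timeout(9) API that is a flag at initialization time (sketch;
the timeout variable name is illustrative):

    #include <sys/timeout.h>

    struct timeout arptimeout;

    timeout_set_flags(&arptimeout, arptimer, &arptimeout,
        KCLOCK_NONE, TIMEOUT_PROC | TIMEOUT_MPSAFE);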
|
|
When receiving a pfkeyv2 SADB_ADD message, a newly created tdb can
fail in tdb_init(), which causes the tdb to not get added to the
global tdb list and to be dereferenced immediately. If a lifetime timeout
triggers on this tdb, it will unconditionally try to remove it from
the list and in the process deref once more than allowed,
causing a one bit corruption in the already freed up slot in the
tdb pool.
We resolve this issue by moving timeout_add() after tdb_init()
just before puttdb(). This means tdbs failing initialization
get discarded immediately as they only hold a single reference.
Valid tdbs get their timeouts activated just before we add them
to the tdb list, meaning the timeout can safely assume they are
linked.
Feedback from mvs@ and millert@
ok mvs@ mbuhl@
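Sketch of the corrected order (error handling simplified, field and
helper names illustrative):

    if (tdb_init(tdb, satype, &ii) != 0) {
        tdb_unref(tdb);        /* single reference, tdb is discarded */
        return (EINVAL);
    }
    /* arm lifetime timeouts only for tdbs that will be on the list */
    timeout_add_sec(&tdb->tdb_timer_tmo, lifetime);
    puttdb(tdb);               /* links the tdb into the global list */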
|
|
Using a scratch buffer makes it possible to take a consistent snapshot of
per-CPU counters without having to allocate memory.
Makes the ddb(4) "show uvmexp" command work in OOM situations.
ok kn@, mvs@, cheloha@
|
|
When called with NULL options, ip_output() and ip6_output() are MP
safe. Convert exclusive to shared net lock in send dispatch.
OK mpi@
|
|
TCP syn cache variable scs_use is basically counting packet insertions
into syn cache. Prefer type long to exclude overflow on fast
machines. Due to counting downwards from a limit, it can become
negative. Copy it out as tcps_sc_uses_left via sysctl, and print
it as signed long long integer.
OK mvs@
|
|
by unexpanding the SYN_CACHE_TIMER_ARM() macro in the timer callback.
OK mvs@
|
|
The syn_cache_reaper() is a hack to serialize timeouts. Unfortunately
it has a race and panics sometimes with pool_do_get: syncache free
list modified. Add a reference counter for timeout and list of syn
cache entries. Currently the list refcount is not strictly necessary
due to exclusive netlock, but will be needed when we continue
unlocking.
Checking timeout_initialized() is not MP friendly, better do proper
initialization during object allocation. Refcount in btrace helps
to find leaks.
bug reported and fix tested by Peter J. Philipp
OK claudio@
|
|
if TDBF_IFACE is set on a tdb, the ipsec stack will pass it to the
sec(4) driver to keep track of instead of wiring it up for security
associations to use.
when sec(4) transmits a packet, it will look up its list of tdbs
to find the right SA to encrypt and send the packet out with.
if an incoming ipsec packet arrives with TDBF_IFACE set, it's passed
to sec(4) to be injected back into the network stack as if it was
received on the sec interface, instead of being reinjected into the
IP stack like normal SA/SPD processing does.
note that this means you do not have to configure tunnel endpoints
on sec(4) interfaces, instead you line the interface unit number
in the ipsec config up with the minor number of the sec(4) interfaces.
the peer IPs used on the SAs are what's used as the traffic endpoints.
support from many including markus@ tobhe@ claudio@ sthen@ patrick@
now is a good time deraadt@
|
|
rather than use ipsec flows (aka, entries in the ipsec security
policy database) to decide which traffic should be encapsulated in
ipsec and sent to a peer, this tweaks security associations (SAs)
so they can refer to a tunnel interface. when traffic is routed
over that tunnel interface, an ipsec SA is looked up and used to
encapsulate traffic before being sent to the peer on the SA. When
traffic is received from a peer using an interface SA, the specified
interface is looked up and the packet is handed to it so it looks
like packets come out of the tunnel.
to support this, SAs get a TDBF_IFACE flag and iface and iface_dir
fields. When TDBF_IFACE is set the iface and dir fields are
considered valid, and the tdb/SA should be used with the tunnel
interface instead of the SPD.
support from many including markus@ tobhe@ claudio@ sthen@ patrick@
now is a good time deraadt@
|
|
Implement vlan-tag parsing in ether_extract_header() and use this
information to adjust the MSS calculation of LRO packets.
pointed out by mbuhl and bluhm
with tweaks from bluhm
ok bluhm@
|
|
When doing LRO (Large Receive Offload), the drivers, currently ix(4)
and lo(4) only, record an upper bound of the size of the original
packets in ph_mss. When sending, either stack or hardware must
chop the packets with TSO (TCP Segmentation Offload) to that size.
That means we have to call tcp_if_output_tso() before ifp->if_output().
Put that logic into if_output_tso() to avoid code duplication. As
TCP packets on the wire do not get larger that way, path MTU discovery
should still work.
tested by and OK jan@
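Conceptually the helper looks like this (sketch, not the committed
function body):

    int
    if_output_tso(struct ifnet *ifp, struct mbuf **mp, struct sockaddr *dst,
        struct rtentry *rt, u_int mtu)
    {
        /* LRO aggregates carry M_TCP_TSO and an upper bound in ph_mss;
           chop them in software or hand them to TSO hardware */
        if (ISSET((*mp)->m_pkthdr.csum_flags, M_TCP_TSO))
            return (tcp_if_output_tso(ifp, mp, dst, rt,
                IFCAP_TSOv4 | IFCAP_TSOv6, mtu));

        return (ifp->if_output(ifp, *mp, dst, rt));
    }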
|
|
Replace hand-rolled reference counting with refcnt_init(9) and hook it up
with a new dt(4) probe.
OK mvs
Feedback OK bluhm
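The refcnt(9) pattern being switched to, for reference (sketch; the
object and its field are illustrative):

    #include <sys/refcnt.h>

    refcnt_init(&obj->obj_refcnt);      /* creator holds one reference */

    refcnt_take(&obj->obj_refcnt);      /* transitions visible to dt(4) */
    if (refcnt_rele(&obj->obj_refcnt))  /* true once the last ref is gone */
        obj_free(obj);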
|
|
After changing tcp now tick to milliseconds, 32 bits will wrap
around after 49 days of uptime. That may be a problem in some
places of our stack. Better use a 64 bit counter.
As timestamp option is 32 bit in TCP protocol, use the lower 32 bit
there. There are casts to 32 bits that should behave correctly.
Start with random 63 bit offset to avoid uptime leakage. 2^63
milliseconds result in 2.9*10^8 years of possible uptime.
OK yasuoka@
|
|
moving pf forward has been a real struggle, and pfsync has been a
constant source of pain. we have been papering over the problems
for a while now, but it reached the point that it needed a fundamental
restructure, which is what this diff is.
the big headliner changes in this diff are:
- pfsync specific locks
this is the whole reason for this diff.
rather than rely on NET_LOCK or KERNEL_LOCK or whatever, pfsync now
has its own locks to protect its internal data structures. this
is important because pfsync runs a bunch of timeouts and tasks to
push pfsync packets out on the wire, or when it's handling requests
generated by incoming pfsync packets, both of which happen outside
pf itself running. having pfsync specific locks around pfsync data
structures makes the mutations of these data structures a lot more
explicit and auditable.
- partitioning
to enable future parallelisation of the network stack, this rewrite
includes support for pfsync to partition states into different "slices".
these slices run independently, ie, the states collected by one slice
are serialised into a separate packet to the states collected and
serialised by another slice.
states are mapped to pfsync slices based on the pf state hash, which
is the same hash that the rest of the network stack and multiq
hardware uses.
- no more pfsync called from netisr
pfsync used to be called from netisr to try and bundle packets, but now
that there's multiple pfsync slices this doesn't make sense. instead it
uses tasks in softnet tqs.
- improved bulk transfer handling
there's shiny new state machines around both the bulk transmit and
receive handling. pfsync used to do horrible things to carp demotion
counters, but now it is very predictable and returns the counters back
where they started.
- better tdb handling
the tdb handling was pretty hairy, but hrvoje has kicked this around
a lot with ipsec and sasyncd and we've found and fixed a bunch of
issues as a result of that testing.
- mpsafe pf state purges
this was committed previously, but because the locks pfsync relied on
weren't clear this just caused a ton of bugs. as part of this diff it's
now reliable, and moves a big chunk of work out from under KERNEL_LOCK,
which in turn improves the responsiveness and throughput of a firewall
even if you're not using pfsync.
there's a bunch of other little changes along the way, but the above are
the big ones.
hrvoje has done performance testing with this diff and notes a big
improvement when pfsync is not in use. performance when pfsync is
enabled is about the same, but i'm hoping the slices mean we can scale
along with pf as it improves.
lots (months) of testing by me and hrvoje on pfsync boxes
tests and ok sashan@
deraadt@ says this is a good time to put it in
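the slice mapping is conceptually just this (sketch, names hypothetical):

    /* states map to slices by the pf state hash, so a given state
       always lands on the same slice */
    unsigned int
    pfsync_slice_idx(uint32_t state_hash, unsigned int nslices)
    {
        return (state_hash % nslices);
    }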
|
|
OK jmatthew@
|
|
If tcplro is activated on lo(4), ignore the MTU with TCP packets.
They are passed along with the information that they have to be
chopped in case they are forwarded later. New netstat(1) counter
shows that software LRO is in effect. The feature is currently
turned off by default.
tested by jan@; OK claudio@ jan@
|
|
Replace hand-rolled reference counting with refcnt_init(9) and hook it up
with a new dt(4) probe.
OK bluhm mvs
|
|
Goal is to run UDP input in parallel. Btrace kstack analysis shows
that SIP hash for PCB lookup is quite expensive. When running in
parallel, there is also lock contention on the PCB table mutex.
It results in better performance to calculate the hash value before
taking the mutex. The hash secret has to be constant as hash
calculation must not depend on values protected by the table mutex.
Do not reseed anymore when hash table gets resized.
Analysis also shows that asserting a rw_lock while holding a mutex
is a bit expensive. Just remove the netlock assert.
OK dlg@ mvs@
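In code the reordering looks roughly like this (sketch; the lookup
helper name is hypothetical):

    /* the hash depends only on the constant secret and the lookup keys,
       so it can be computed before the mutex is taken */
    hash = in_pcbhash(table, rdomain, &faddr, fport, &laddr, lport);

    mtx_enter(&table->inpt_mtx);
    inp = in_pcbhash_lookup(table, hash, rdomain,
        &faddr, fport, &laddr, lport);
    mtx_leave(&table->inpt_mtx);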
|
|
ok bluhm
|
|
Our syn cache did checksum calculation by hand, instead of the
established mechanism in ip output. The software-checksummed counter
increased once per incoming TCP connection.
Just set the flag M_TCP_CSUM_OUT in syn_cache_respond() and let
in_proto_cksum_out() do the work later. Then hardware checksumming
is used where available. Also remove redundant code. The unhandled
af case is handled in the first switch statement of the function.
tested by Hrvoje Popovski; OK mvs@
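The core of the change is one flag instead of a hand-rolled checksum
(sketch):

    /* in syn_cache_respond(): defer the checksum to the output path */
    m->m_pkthdr.csum_flags |= M_TCP_CSUM_OUT;
    /* in_proto_cksum_out() later fills in th_sum, or leaves it to
       the hardware if the interface can checksum TCP */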
|
|
With tweaks from patrick@ and bluhm@.
OK bluhm@
|
|
When sending TCP packets with software TSO to the local address of
a physical interface, the TCP checksum was miscalculated. As the
small MSS is taken from the physical interface, but the large MTU
of the loopback interface is used, large TSO packets are generated,
but sent directly to the loopback interface. There we need the
regular pseudo header checksum and not the modified one without the
packet length.
To avoid this confusion, use the same decision for checksum generation
in in_proto_cksum_out() as for using hardware TSO in tcp_if_output_tso().
bug reported and tested by robert@ bket@ Hrvoje Popovski
OK claudio@ jan@
|
|
compliance with POSIX/SUS restrictions on <netinet/tcp.h>
ok bluhm@
ports testing and ok sthen@
|
|
layer.
|
|
With a lot of tweaks, improvements and testing from bluhm.
Thanks to Hrvoje Popovski from the University of Zagreb for
his great testing effort to make this happen.
ok bluhm
|
|
safe. We have many of them, so use a flag instead of pushing the kernel lock
within.
Unlock ip_sysctl(). Still take kernel lock within IPCTL_MRTSTATS case.
It looks like `mrtstat' protection is inconsistent, so keep locking as
it was. Since `mrtstat' are counters, it makes sense to rework them into
per-CPU counters in separate diffs.
Feedback and ok from bluhm@
|
|
This diff introduces separate capabilities for TCP offloading. We split this
into LRO (large receive offloading) and TSO (TCP segmentation offloading).
LRO can be turned on/off via tcprecvoffload option of ifconfig and is not
inherited to sub interfaces.
TSO is inherited by sub interfaces to signal this hardware offloading capability
to the network stack.
With tweaks from bluhm, claudio and dlg
ok bluhm, claudio
|
|
If the driver of a network interface claims to support TSO, do not
chop the packet in software, but pass it down to the interface
layer.
Precalculate parts of the pseudo header checksum, but without the
packet length. The length of all generated smaller packets is not
known yet. Driver and hardware will use the mbuf packet header
field ph_mss to calculate it and update checksum.
Introduce separate flags IFCAP_TSOv4 and IFCAP_TSOv6 as hardware
might support only one protocol family. The old flag IFXF_TSO is
only relevant for large receive offload. It is misnamed, but keep
that for now.
Note that drivers do not set TSO capabilities yet. Also the ifconfig
flags and pseudo interfaces capabilities will be done separately.
So this commit should not change behavior.
heavily based on the work from jan@; OK sashan@
|
|
Introduce in_hdr_cksum_out(). It is used like in_proto_cksum_out().
OK claudio@
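Such a helper plausibly looks like this (sketch under assumptions, not
necessarily the committed body):

    void
    in_hdr_cksum_out(struct mbuf *m, struct ifnet *ifp)
    {
        struct ip *ip = mtod(m, struct ip *);

        ip->ip_sum = 0;
        if (ifp != NULL && ISSET(ifp->if_capabilities, IFCAP_CSUM_IPv4))
            SET(m->m_pkthdr.csum_flags, M_IPV4_CSUM_OUT);
        else
            ip->ip_sum = in_cksum(m, ip->ip_hl << 2);
    }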
|
|
always set together with ARP mutex.
OK mvs@
|
|
meant as a fallback if network hardware does not support TSO. Driver
support is still work in progress. TCP output generates large
packets. In IP output the packet is chopped to TCP maximum segment
size. This reduces the CPU cycles used by pf. The regular output
could be assisted by hardware later, but pf route-to and IPsec need
the software fallback in general.
For performance comparison or to workaround possible bugs, sysctl
net.inet.tcp.tso=0 disables the feature. netstat -s -p tcp shows
TSO counter with chopped and generated packets.
based on work from jan@
tested by jmc@ jan@ Hrvoje Popovski
OK jan@ claudio@
|
|
is passed to ifp->if_output(). The fragment code has its own
checksum calculation and the other paths end in goto bad.
OK claudio@
|
|
if_output_ml() to send mbuf lists to interfaces. This can be used
for TSO, fragments, ARP and ND6. Rename variable fml to ml. In
pf_route6() split the if else block. Put the safety check (hlen +
firstlen < tlen) into ip_fragment(). It makes the code correct in
case the packet is too short to be fragmented. This should not
happen, but other functions also have this logic.
No functional change. OK sashan@
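A sketch of an if_output_ml()-style helper (error handling simplified):

    int
    if_output_ml(struct ifnet *ifp, struct mbuf_list *ml,
        struct sockaddr *dst, struct rtentry *rt)
    {
        struct mbuf *m;
        int error = 0;

        while ((m = ml_dequeue(ml)) != NULL) {
            error = ifp->if_output(ifp, m, dst, rt);
            if (error)
                break;
        }
        if (error)
            ml_purge(ml);    /* drop what could not be sent */

        return (error);
    }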
|
|
|
|
So kernel lock is only needed for changing the route rt_flags. In
arpresolve() protect rt_llinfo lookup and llinfo_arp modification
with arp_mtx. Grab kernel lock for rt_flags reject modification
only when needed.
Tested by Hrvoje Popovski; OK patrick@ kn@
|