Goal is to run UDP input in parallel. Btrace kstack analysis shows
that SIP hash for PCB lookup is quite expensive. When running in
parallel, there is also lock contention on the PCB table mutex.
Calculating the hash value before taking the mutex results in
better performance. The hash secret has to be constant, as the
hash calculation must not depend on values protected by the
table mutex. Do not reseed anymore when the hash table gets
resized.
Analysis also shows that asserting a rw_lock while holding a mutex
is a bit expensive. Just remove the netlock assert.
OK dlg@ mvs@
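A minimal sketch of the ordering, with hypothetical helper names
(pcb_hash(), pcb_hash_lookup()) and an assumed mutex field
inpt_mtx; the point is that the hash input involves no state
protected by the mutex:

struct inpcb *
pcb_lookup_sketch(struct inpcbtable *table, u_int rdomain,
    struct in_addr faddr, u_short fport,
    struct in_addr laddr, u_short lport)
{
	struct inpcb *inp;
	uint64_t hash;

	/* constant secret, no table state: safe outside the mutex */
	hash = pcb_hash(table, rdomain, faddr, fport, laddr, lport);

	mtx_enter(&table->inpt_mtx);
	inp = pcb_hash_lookup(table, hash, rdomain, faddr, fport,
	    laddr, lport);
	mtx_leave(&table->inpt_mtx);

	return (inp);
}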
|
|
ok bluhm
|
|
Our syn cache did checksum calculation by hand, instead of using
the established mechanism in IP output. The software-checksummed
increased once per incoming TCP connection.
Just set the flag M_TCP_CSUM_OUT in syn_cache_respond() and let
in_proto_cksum_out() do the work later. Then hardware checksumming
is used where available. Also remove redundant code. The unhandled
af case is handled in the first switch statement of the function.
tested by Hrvoje Popovski; OK mvs@
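A minimal sketch of the mechanism; M_TCP_CSUM_OUT is the regular
mbuf checksum flag:

	/* mark the SYN,ACK instead of checksumming by hand */
	m->m_pkthdr.csum_flags |= M_TCP_CSUM_OUT;
	/*
	 * later, in_proto_cksum_out() computes th_sum in
	 * software or leaves it to hardware checksum offload
	 */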
|
|
With tweaks from patrick@ and bluhm@.
OK bluhm@
|
|
When sending TCP packets with software TSO to the local address of
a physical interface, the TCP checksum was miscalculated. As the
small MSS is taken from the physical interface, but the large MTU
of the loopback interface is used, large TSO packets are generated,
but sent directly to the loopback interface. There we need the
regular pseudo header checksum and not the modified one without
the packet length.
To avoid this confusion, use the same decision for checksum generation
in in_proto_cksum_out() as for using hardware TSO in tcp_if_output_tso().
bug reported and tested by robert@ bket@ Hrvoje Popovski
OK claudio@ jan@
|
|
compliance with POSIX/SUS restrictions on <netinet/tcp.h>
ok bluhm@
ports testing and ok sthen@
|
|
layer.
|
|
With a lot of tweaks, improvements and testing from bluhm.
Thanks to Hrvoje Popovski from the University of Zagreb for
his great testing effort to make this happen.
ok bluhm
|
|
safe. We have many of them, so use a flag instead of pushing the
kernel lock within.
Unlock ip_sysctl(). Still take the kernel lock within the
IPCTL_MRTSTATS case. It looks like `mrtstat' protection is
inconsistent, so keep locking as it was. Since `mrtstat' are
counters, it makes sense to rework them into per-CPU counters in
separate diffs.
Feedback and ok from bluhm@
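A sketch of the suggested rework under percpu(9), with a
hypothetical counter enum for mrtstat:

#include <sys/percpu.h>

enum mrtstat_counters { mrts_mfc_lookups, mrts_wrong_if,
    mrts_ncounters };

struct cpumem *mrtcounters;

void
mrt_counters_init(void)
{
	mrtcounters = counters_alloc(mrts_ncounters);
}

void
mrt_count_wrong_if(void)
{
	/* per-CPU increment, no mutex or kernel lock needed */
	counters_inc(mrtcounters, mrts_wrong_if);
}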
|
|
This diff introduces separate capabilities for TCP offloading. We split this
into LRO (large receive offloading) and TSO (TCP segmentation offloading).
LRO can be turned on/off via the tcprecvoffload option of
ifconfig(8) and is not inherited by sub-interfaces.
TSO is inherited by sub-interfaces to signal this hardware
offloading capability to the network stack.
With tweaks from bluhm, claudio and dlg
ok bluhm, claudio
|
|
If the driver of a network interface claims to support TSO, do not
chop the packet in software, but pass it down to the interface
layer.
Precalculate parts of the pseudo header checksum, but without the
packet length. The length of all generated smaller packets is not
known yet. Driver and hardware will use the mbuf packet header
field ph_mss to calculate it and update the checksum.
Introduce separate flags IFCAP_TSOv4 and IFCAP_TSOv6 as hardware
might support only one protocol family. The old flag IFXF_TSO is
only relevant for large receive offload. It is misnamed, but keep
that for now.
Note that drivers do not set TSO capabilities yet. Also the
ifconfig flags and pseudo-interface capabilities will be done
separately.
So this commit should not change behavior.
heavily based on the work from jan@; OK sashan@
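A sketch of the handoff, assuming the flag name M_TCP_TSO;
ph_mss is the packet header field named above:

	/* mss computed earlier from route and interface */
	SET(m->m_pkthdr.csum_flags, M_TCP_TSO);
	m->m_pkthdr.ph_mss = mss;
	/*
	 * the pseudo header checksum is preset without the
	 * length; driver and hardware fold in each generated
	 * segment's length and finish the checksum
	 */
	error = ifp->if_output(ifp, m, dst, rt);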
|
|
introduce in_hdr_cksum_out(). It is used like in_proto_cksum_out().
OK claudio@
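A sketch of what such a function might look like, by analogy
with in_proto_cksum_out(); the capability check details are
assumed:

void
in_hdr_cksum_out(struct mbuf *m, struct ifnet *ifp)
{
	struct ip *ip = mtod(m, struct ip *);

	ip->ip_sum = 0;
	if (ifp != NULL &&
	    ISSET(ifp->if_capabilities, IFCAP_CSUM_IPv4))
		SET(m->m_pkthdr.csum_flags, M_IPV4_CSUM_OUT);
	else
		ip->ip_sum = in_cksum(m, ip->ip_hl << 2);
}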
|
|
always set together with ARP mutex.
OK mvs@
|
|
meant as a fallback if network hardware does not support TSO. Driver
support is still work in progress. TCP output generates large
packets. In IP output the packet is chopped to TCP maximum segment
size. This reduces the CPU cycles used by pf. The regular output
could be assisted by hardware later, but pf route-to and IPsec
need the software fallback in general.
For performance comparison or to workaround possible bugs, sysctl
net.inet.tcp.tso=0 disables the feature. netstat -s -p tcp shows
TSO counters with chopped and generated packets.
based on work from jan@
tested by jmc@ jan@ Hrvoje Popovski
OK jan@ claudio@
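A sketch of the fallback path in IP output, with assumed
signatures for tcp_chopper() and if_output_ml():

	if (ISSET(m->m_pkthdr.csum_flags, M_TCP_TSO) &&
	    m->m_pkthdr.len > ifp->if_mtu) {
		struct mbuf_list ml;
		int error;

		/* chop to ph_mss sized packets in software */
		error = tcp_chopper(m, &ml, ifp, m->m_pkthdr.ph_mss);
		if (error)
			return (error);
		return (if_output_ml(ifp, &ml, dst, rt));
	}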
|
|
is passed to ifp->if_output(). The fragment code has its own
checksum calculation and the other paths end in goto bad.
OK claudio@
|
|
if_output_ml() to send mbuf lists to interfaces. This can be used
for TSO, fragments, ARP and ND6. Rename variable fml to ml. In
pf_route6() split the if else block. Put the safety check (hlen +
firstlen < tlen) into ip_fragment(). It makes the code correct in
case the packet is too short to be fragmented. This should not
happen, but other functions also have this logic.
No functional change. OK sashan@
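A minimal sketch of such a function; the real error handling may
differ:

int
if_output_ml(struct ifnet *ifp, struct mbuf_list *ml,
    struct sockaddr *dst, struct rtentry *rt)
{
	struct mbuf *m;
	int error = 0;

	while ((m = ml_dequeue(ml)) != NULL) {
		error = ifp->if_output(ifp, m, dst, rt);
		if (error) {
			/* drop the rest of the list */
			ml_purge(ml);
			break;
		}
	}

	return (error);
}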
|
|
|
|
So kernel lock is only needed for changing the route rt_flags. In
arpresolve() protect rt_llinfo lookup and llinfo_arp modification
with arp_mtx. Grab kernel lock for rt_flags reject modification
only when needed.
Tested by Hrvoje Popovski; OK patrick@ kn@
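A sketch of the resulting locking pattern (assumed shape):

	struct llinfo_arp *la;

	mtx_enter(&arp_mtx);
	la = (struct llinfo_arp *)rt->rt_llinfo;
	if (la != NULL) {
		/* lookup and modification under arp_mtx */
	}
	mtx_leave(&arp_mtx);
	/*
	 * the kernel lock is grabbed separately, only to change
	 * the reject bit in rt_flags
	 */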
|
|
in6.c already has the privilege check as early as possible;
make in.c match.
For unprivileged IPv4 ioctl calls with invalid args, this changes errno from
E* to EPERM.
OK bluhm
|
|
read-only access to netlock protected data, so the radix tree
will not be modified during a spd_table_walk() run.
Also change the netlock assertion within spd_table_add() and
ipsec_delete_policy() to exclusive. These are the correlated
functions which modify the radix tree, so this makes sure that
running spd_table_walk() with the shared netlock is safe.
Feedback and ok by bluhm@
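The assertions boil down to this pattern:

	/*
	 * in spd_table_add() and ipsec_delete_policy(): writers
	 * to the SPD radix tree hold the exclusive netlock, so
	 * spd_table_walk() under the shared netlock cannot race
	 */
	NET_ASSERT_LOCKED_EXCLUSIVE();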
|
|
|
|
|
|
All cases do the same check as a first step, so merge it before
the switch and before grabbing exclusive locks.
OK mvs
|
|
Just like in6_ioctl_get(), read ioctls are safe with the shared net lock to
protect interface addresses and flags.
OK mvs
|
|
Defer sending until after the unlock, reusing `refresh' from a
similar construct.
OK bluhm
|
|
feedback and ok jmc@ miod, ok millert@
|
|
immutable, we don't need to reload it again.
ok bluhm@
|
|
grabs the exclusive netlock and that is sufficient for
in_arpinput() and arpcache().
with kn@; OK mvs@; tested by Hrvoje Popovski
|
|
response. Implement the analogous sysctl
net.inet6.icmp6.nd6_queued for ND6 to reduce places where mbufs
can hide within the kernel.
Atomic operations operate on unsigned int. Make the type of the
total hold queue length consistent.
Use atomic load to read the value for the sysctl. This clarifies
why no lock around sysctl_rdint() is needed.
OK mvs@ kn@
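A sketch of the read side, with assumed names for the sysctl
constant and the counter variable:

	case ICMPV6CTL_ND6_QUEUED:
		/* lock-free: a single atomic load of the counter */
		return (sysctl_rdint(oldp, oldlenp, newp,
		    atomic_load_int(&ln_hold_total)));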
|
|
ND6 held only a single packet. Unify the logic and add an mbuf
hold queue to struct llinfo_nd6. This is MP safe and queue limits
are tracked with atomic operations. New function if_mqoutput()
has common code for ARP and ND6. ln_saddr6 holds the source
address of the requesting packet. That is easier than fiddling
with the mbuf queue in nd6_ns_output().
OK kn@
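A minimal sketch of if_mqoutput(), with an assumed signature:

void
if_mqoutput(struct ifnet *ifp, struct mbuf_queue *mq,
    struct sockaddr *dst, struct rtentry *rt)
{
	struct mbuf_list ml;
	struct mbuf *m;

	/* grab the whole hold queue at once, then send it */
	mq_delist(mq, &ml);
	while ((m = ml_dequeue(&ml)) != NULL)
		ifp->if_output(ifp, m, dst, rt);
}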
|
|
checksum may be wrong. Locally generated packets diverted by pf
out rules may have no checksum due to hardware offloading.
Calculate the checksum in that case.
OK mvs@ sashan@
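A sketch of the fix: resolve the pending checksum flags in
software, since no hardware will see the diverted packet; a NULL
interface forces the software path:

	in_proto_cksum_out(m, NULL);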
|
|
milliseconds, which is the same unit as tcp_now(). However, keep the
unit of sysctl variables in seconds and convert their unit in
tcp_sysctl(). Additionally revert TCPTV_SRTTDFLT back to 3 seconds,
which was mistakenly changed to 1.5 seconds by tcp_timer.h 1.19.
ok claudio
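A sketch of the conversion in tcp_sysctl(), using tcp_keepidle
as an example; the exact variable handling is assumed:

	int secs, error;

	secs = tcp_keepidle / 1000;		/* ms -> s for export */
	error = sysctl_int(oldp, oldlenp, newp, newlen, &secs);
	if (error == 0)
		tcp_keepidle = secs * 1000;	/* s -> ms internally */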
|
|
by using a bad option length. This bug is only reachable if both
the pf IP option check is disabled and IP source routing is
enabled.
reported by @fuzzingrf Erg Noor
OK claudio@ deraadt@
|
|
ok miod@ millert@
|
|
This worked because the global head variable is zero-initialised,
but one must not rely on that.
OK mvs claudio
|
|
public header file. Makes debugging with special kernels easier.
|
|
with tweaks from mvs@, mpi@, dlg@, naddy@ and bluhm@
"go for it" deraadt@
ok naddy@, mvs@
|
|
No functional change.
|
|
No functional changes.
|
|
rwlock(9) acquisition.
Reported-by: syzbot+fbe3acb4886adeef31e0@syzkaller.appspotmail.com
|
|
easily repeatable ASSERT happens seconds after starting compiles over nfs.
|
|
with tweaks from mvs@, mpi@ and dlg@
ok mvs@, dlg@
|
|
receive buffer. As was done for the SS_CANTSENDMORE bit, the
definition is kept as is, but now these bits belong to the
`sb_state' of the receive buffer. `sb_state' is ORed with
`so_state' when socket data is exported to userland.
ok bluhm@
|
|
serialize arpcache() and arpresolve(). In fact, the net stack
already has sleep points, so the rwlock(9) is better here because
we avoid intersection with the rest of the kernel locked paths.
Also this new lock is assumed to be used for route layer
protection instead of the netlock.
Hrvoje Popovski had tested this diff and found no visible performance
impact.
ok bluhm@
|
|
This time, the socket's buffer lock requires solock() to be
held. As a part of the socket buffers standalone locking work,
move the socket state bits which represent the buffers' state to
per-buffer state.
Unlike the previously reverted diff, the SS_CANTSENDMORE
definition is left as is, but it is used only with `sb_state'.
`sb_state' is ORed with the original `so_state' when socket data
is exported to userland, so the ABI is kept as it was.
Inputs from deraadt@.
ok bluhm@
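A sketch of the export path that keeps the ABI, using the field
names from this diff:

	/* userland still sees SS_CANTSENDMORE in so_state */
	short state = so->so_state | so->so_snd.sb_state;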
|
|
listen port is not bound to port 0. With a matching pf divert-to
rule this assumption is no longer true and could crash the
kernel with a kassert. In both pf and the stack drop TCP packets
with destination
port 0 before they can do harm.
OK sashan@ claudio@
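The check itself is roughly a one-liner in both places:

	/*
	 * destination port 0 packets cannot belong to a socket;
	 * drop them before the lookup can trip the assertion
	 */
	if (th->th_dport == 0)
		goto drop;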
|
|
New warning -Warray-parameter is a bit overzealous.
ok millert@ tb@
|
|
The TCP timer is not supposed to run during suspend, but
getnsecuptime() keeps counting, and because of this sessions
with TCP_KEEPALIVE enabled reset after a few hours of sleep.
Problem noticed by mlarkin@, investigation by yasuoka@ additional testing jca@
OK yasuoka@ jca@ cheloha@
|
|
|
|
socket buffers standalone locking work, move the socket state
bits which represent the buffers' state to per-buffer state.
Introduce `sb_state' and turn SS_CANTSENDMORE into
SBS_CANTSENDMORE. This bit will be processed on the `so_snd'
buffer only.
Move the SS_CANTRCVMORE and SS_RCVATMARK bits in a separate diff
to make review easier and exclude possible so_rcv/so_snd
mistypes.
Also, don't adjust the remaining SS_* bits right now.
ok millert@
|