src - OpenBSD base system

Age	Commit message	Author
2023-07-07	Keep mbuf header field ph_mss during loopback TCP with LRO/TSO.	Alexander Bluhm
	When M_TCP_TSO is preserved, also keep ph_mss. In lo(4) this logic was missing. This may be relevant only for weird pf configs that forward from loopback. OK mvs@ jan@
2023-07-07	Fix path MTU discovery for TCP LRO/TSO when forwarding.	Alexander Bluhm
	When doing LRO (Large Receive Offload), the drivers, currently ix(4) and lo(4) only, record an upper bound of the size of the original packets in ph_mss. When sending, either stack or hardware must chop the packets with TSO (TCP Segmentation Offload) to that size. That means we have to call tcp_if_output_tso() before ifp->if_output(). Put that logic into if_output_tso() to avoid code duplication. As TCP packets on the wire do not get larger that way, path MTU discovery should still work. tested by and OK jan@
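The chopping step described above can be illustrated with a small userland sketch. It is not the kernel's tcp_if_output_tso(); the struct seg type and the chop_payload() helper are hypothetical names, and the sketch only shows how a large payload is cut into segments no bigger than the recorded upper bound (the role ph_mss plays in the commit message).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical description of one chopped segment. */
struct seg {
	size_t off;	/* offset of the segment inside the large packet */
	size_t len;	/* payload bytes carried by the segment */
};

/*
 * Chop a payload of `len' bytes into segments of at most `mss' bytes.
 * Up to `max' segments are written to `out'; the return value is the
 * number of segments required, which may exceed `max'.
 */
size_t
chop_payload(size_t len, size_t mss, struct seg *out, size_t max)
{
	size_t n = 0, off = 0;

	assert(mss > 0);
	while (off < len) {
		size_t chunk = (len - off < mss) ? len - off : mss;

		if (n < max) {
			out[n].off = off;
			out[n].len = chunk;
		}
		off += chunk;
		n++;
	}
	return n;
}
```

Because no segment exceeds the recorded bound, the packets on the wire are never larger than the originals, which is the property the commit relies on for path MTU discovery.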
2023-07-06	use refcnt API for multicast addresses, add tracepoint:refcnt:ethmulti probe	Klemens Nanni
	Replace hand-rolled reference counting with refcnt_init(9) and hook it up with a new dt(4) probe. OK mvs Feedback OK bluhm
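For readers unfamiliar with the pattern, the following is a minimal userland sketch of the kind of reference counting such an API encapsulates. It uses C11 atomics and hypothetical sketch_ref_*() names; it is not the refcnt_init(9) implementation and omits the dt(4) tracing hook.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical counter type, standing in for a kernel refcnt. */
struct sketch_ref {
	atomic_uint refs;
};

void
sketch_ref_init(struct sketch_ref *r)
{
	atomic_init(&r->refs, 1);	/* the creator holds the first reference */
}

void
sketch_ref_take(struct sketch_ref *r)
{
	unsigned int old = atomic_fetch_add(&r->refs, 1);

	assert(old > 0);	/* taking a reference on a dead object is a bug */
}

/* Returns 1 when the caller dropped the last reference, otherwise 0. */
int
sketch_ref_rele(struct sketch_ref *r)
{
	unsigned int old = atomic_fetch_sub(&r->refs, 1);

	assert(old > 0);
	return old == 1;
}
```

Centralising the counter in one small type is what makes it practical to attach a single probe to every take and release, instead of instrumenting each hand-rolled counter separately.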
2023-07-06	big update to pfsync to try and clean up locking in particular.	David Gwynne
	moving pf forward has been a real struggle, and pfsync has been a constant source of pain. we have been papering over the problems for a while now, but it reached the point that it needed a fundamental restructure, which is what this diff is.
	the big headliner changes in this diff are:
	- pfsync specific locks
	  this is the whole reason for this diff. rather than rely on NET_LOCK or KERNEL_LOCK or whatever, pfsync now has its own locks to protect its internal data structures. this is important because pfsync runs a bunch of timeouts and tasks to push pfsync packets out on the wire, or when it's handling requests generated by incoming pfsync packets, both of which happen outside pf itself running. having pfsync specific locks around pfsync data structures makes the mutations of these data structures a lot more explicit and auditable.
	- partitioning
	  to enable future parallelisation of the network stack, this rewrite includes support for pfsync to partition states into different "slices". these slices run independently, ie, the states collected by one slice are serialised into a separate packet to the states collected and serialised by another slice. states are mapped to pfsync slices based on the pf state hash, which is the same hash that the rest of the network stack and multiq hardware uses.
	- no more pfsync called from netisr
	  pfsync used to be called from netisr to try and bundle packets, but now that there's multiple pfsync slices this doesn't make sense. instead it uses tasks in softnet tqs.
	- improved bulk transfer handling
	  there's shiny new state machines around both the bulk transmit and receive handling. pfsync used to do horrible things to carp demotion counters, but now it is very predictable and returns the counters back where they started.
	- better tdb handling
	  the tdb handling was pretty hairy, but hrvoje has kicked this around a lot with ipsec and sasyncd and we've found and fixed a bunch of issues as a result of that testing.
	- mpsafe pf state purges
	  this was committed previously, but because the locks pfsync relied on weren't clear this just caused a ton of bugs. as part of this diff it's now reliable, and moves a big chunk of work out from under KERNEL_LOCK, which in turn improves the responsiveness and throughput of a firewall even if you're not using pfsync.
	there's a bunch of other little changes along the way, but the above are the big ones.
	hrvoje has done performance testing with this diff and notes a big improvement when pfsync is not in use. performance when pfsync is enabled is about the same, but i'm hoping the slices means we can scale along with pf as it improves.
	lots (months) of testing by me and hrvoje on pfsync boxes
	tests and ok sashan@
	deraadt@ says this is a good time to put it in
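The partitioning idea can be sketched as follows. This is only an illustration under stated assumptions: the slice count, the modulo mapping, and the sketch_slice_*() names are hypothetical, and the real pfsync code may select slices from the state hash differently.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_NSLICES	4	/* hypothetical number of slices */

/* Each slice collects states for its own outgoing packet. */
struct sketch_slice {
	unsigned int nstates;	/* states queued for this slice's next packet */
};

/* Map a state to a slice using its state hash (assumed: simple modulo). */
unsigned int
sketch_slice_index(uint32_t state_hash)
{
	return state_hash % SKETCH_NSLICES;
}

/* Queue a state on the slice that owns its hash; other slices are untouched. */
void
sketch_slice_queue(struct sketch_slice slices[SKETCH_NSLICES],
    uint32_t state_hash)
{
	unsigned int i = sketch_slice_index(state_hash);

	assert(i < SKETCH_NSLICES);
	slices[i].nstates++;
}
```

Since a given hash always lands on the same slice, each slice can be protected by its own lock and serialised independently of the others.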
2023-07-04	This diff limits the number of transactions/tickets	Alexandr Nedvedicky
	pf_open_trans() can issue for each clone of /dev/pf to 512. pf_open_trans() is currently used by the DIOCGETRULES ioctl(2). The limit prevents processes from consuming all kernel memory by asking DIOCGETRULES for more tickets. If DIOCGETRULES hits the limit, the application sees an EBUSY error. This diff was fine tuned with feedback from claudio@, deraadt@ and kn@. OK kn@
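The limit can be pictured with a small sketch. The per-clone counter and the sketch_*() function names are hypothetical; only the value 512 and the EBUSY result come from the commit message, and the real pf_open_trans() tracks transaction structures rather than a bare count.

```c
#include <assert.h>
#include <errno.h>

#define SKETCH_TRANS_MAX	512	/* limit named in the commit message */

/* Hypothetical per-clone state for one open of /dev/pf. */
struct sketch_pfdev {
	unsigned int ntrans;	/* transactions currently open on this clone */
};

/* Returns 0 on success, or EBUSY once the per-clone limit is reached. */
int
sketch_open_trans(struct sketch_pfdev *d)
{
	assert(d->ntrans <= SKETCH_TRANS_MAX);
	if (d->ntrans == SKETCH_TRANS_MAX)
		return EBUSY;
	d->ntrans++;
	return 0;
}

/* Closing a transaction frees a slot for a later open. */
void
sketch_close_trans(struct sketch_pfdev *d)
{
	assert(d->ntrans > 0);
	d->ntrans--;
}
```

A caller that sees EBUSY has to close tickets it no longer needs before asking for new ones, which bounds the memory a single descriptor can pin.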
2023-07-04	Check for interface type ethernet before calling ether_brport_isset()	Jan Klemkow
	Pointed out by bluhm. ok bluhm@
2023-07-04	The recent change to DIOCGETRULE allows applications which	Alexandr Nedvedicky
	periodically read rules from pf(4) to consume all kernel memory. The bug was discovered and root caused by florian@. In this particular case it was snmpd(8) that ate all kernel memory. This commit introduces DIOCXEND to pf(4) so applications such as snmpd(8) and systat(1) can close the ticket/transaction when they are done fetching the rules. This change also updates snmpd(8) and systat(1) to use the newly introduced DIOCXEND ioctl(2). OK claudio@, deraadt@, kn@
2023-07-04	remove unused global var	Jonathan Gray
	ok sashan@
2023-07-03	use consistent queue(9) example for LIST removal; OK bluhm mvs	Klemens Nanni

2023-07-02	Use TSO and LRO on the loopback interface to transfer TCP faster.	Alexander Bluhm
	If tcplro is activated on lo(4), ignore the MTU with TCP packets. They are passed along with the information that they have to be chopped in case they are forwarded later. New netstat(1) counter shows that software LRO is in effect. The feature is currently turned off by default. tested by jan@; OK claudio@ jan@
2023-06-30	Introduce M_PF type for pf(4) related memory allocations. Currently used	Vitaliy Makkoveev
	M_TEMP and M_IFADDR types are unreasonable for that purpose. These dedicated statistics simplify the future pf(4) unlocking work by decreasing the search area for possible memory leaks. ok bluhm sashan
2023-06-28	pfioctl() must make sure pfioctl_rw() gets unlocked before function returns.	Alexandr Nedvedicky
	OK bluhm@
2023-06-28	Revert r1.406 "Close all pf transactions before opening a new one in	Klemens Nanni
	DIOCGETRULES." regress/sbin/pfctl panics with "rw_enter: pfioctl_rw locking against myself" as reported by bluhm on bugs@.
2023-06-28	use refcnt API for multicast addresses, add tracepoint:refcnt:ifmaddr probe	Klemens Nanni
	Replace hand-rolled reference counting with refcnt_init(9) and hook it up with a new dt(4) probe. OK bluhm mvs
2023-06-27	Introduce M_IFGROUP type of memory allocation. M_TEMP is unreasonable	Vitaliy Makkoveev
	for interface groups data allocations. ok kn claudio bluhm
2023-06-27	Use shared net lock for DIOCGETIFACES	Klemens Nanni
	snmpd(8) and 'pfctl -s Interfaces' dump pf's internal list of interfaces. pf's internal interface list is completely protected by the pf lock, pf lock assertions since pf_if.c r1.110 from over a week ago support this. pfi_*() iterate over net lock protected if_groups lists, but only to read, so downgrade from exclusive write net lock to a shared read-only one. Feedback mvs OK sashan
2023-06-27	Remove net lock from DIOC{SET,CLR}IFFLAG	Klemens Nanni
	pf.conf's 'set skip on ifN' and 'pfctl -F all|Reset' set and clear flags, PFI_IFLAG_SKIP being the only flag. Nothing else in base uses these ioctls and internal state is protected by the pf lock already. OK sashan
2023-06-26	Revert unrelated change that sneaked into the pf_ioctl.c commit.	Claudio Jeker

2023-06-26	Close all pf transactions before opening a new one in DIOCGETRULES.	Claudio Jeker
	Processes like snmpd or systat open pf(4) once and then issue many DIOCGETRULES calls over their runtime. This accumulates many pf_trans structs over their lifetime. At some point the kernel runs out of memory because of that. By closing all transactions before creating a new one, long living processes no longer leak transactions. This probably needs further refinement once more transaction types are added but for now this solves the problem. Problem found by florian@ OK sashan@ kn@
2023-06-12	Move nd6_ifdetach() out of netlock. At this point, the interface is	Vitaliy Makkoveev
	disconnected from everywhere. No need to hold netlock for dummy 'nd_ifinfo' release. Netlock is also not needed for TAILQ_EMPTY(&ifp->if_*hooks) assertions. ok kn bluhm
2023-06-05	Do not calculate IP, TCP, UDP checksums on loopback interface.	Alexander Bluhm
	Packets sent over loopback got their checksums calculated twice. In the output path they were filled in and during TCP/IP input all checksums were calculated again to be compared with the previous result. Avoid this by claiming that lo(4) supports hardware checksum offloading. For each packet convert the flag that the checksum should be calculated to the flag that it has been checked successfully. Keep the flag that it should be calculated for the case that it may be bridged or forwarded later. A drawback is that "tcpdump -ni lo0 -v" reports invalid checksum. But that is the same with physical interfaces and hardware offloading. OK dlg@
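The flag conversion described above can be sketched like this. The SK_* bit values and the function name are hypothetical stand-ins for the kernel's mbuf checksum flags; the sketch shows only the rule stated in the message: every "should be calculated" flag gains the matching "checked successfully" flag, and the original flag is kept.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical flag bits, not the kernel's actual values. */
#define SK_IPV4_CSUM_OUT	0x0001	/* IPv4 header checksum still to be computed */
#define SK_TCP_CSUM_OUT		0x0002	/* TCP checksum still to be computed */
#define SK_UDP_CSUM_OUT		0x0004	/* UDP checksum still to be computed */
#define SK_IPV4_CSUM_IN_OK	0x0010	/* IPv4 header checksum verified */
#define SK_TCP_CSUM_IN_OK	0x0020	/* TCP checksum verified */
#define SK_UDP_CSUM_IN_OK	0x0040	/* UDP checksum verified */

#define SK_CSUM_OUT_MASK \
	(SK_IPV4_CSUM_OUT | SK_TCP_CSUM_OUT | SK_UDP_CSUM_OUT)

/* Convert output flags to input flags for a packet looped back locally. */
uint16_t
sketch_lo_csum_flags(uint16_t flags)
{
	uint16_t in = flags;

	if (in & SK_IPV4_CSUM_OUT)
		flags |= SK_IPV4_CSUM_IN_OK;
	if (in & SK_TCP_CSUM_OUT)
		flags |= SK_TCP_CSUM_IN_OK;
	if (in & SK_UDP_CSUM_OUT)
		flags |= SK_UDP_CSUM_IN_OK;

	/* The OUT flags survive so a later bridge or forward can still act on them. */
	assert((flags & SK_CSUM_OUT_MASK) == (in & SK_CSUM_OUT_MASK));
	return flags;
}
```

No checksum is computed at any point in this path, which is also why a capture on the loopback interface sees unfilled checksum fields.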
2023-06-05	pfsync_update_state() is too paranoid about pf_state::pfsync_state.	Alexandr Nedvedicky
	For example, it should not be surprised if the caller asks to remove a state from the pfsync queue which has been removed already. That kind of race is sorted out later when pfsync_update_state() calls pfsync_q_ins()/pfsync_q_del(). The change relaxes pfsync_update_state() to panic only on an unknown sync_state value. OK dlg@
2023-06-05	pf_remove_state() should not attempt to remove state which	Alexandr Nedvedicky
	is already removed. OK dlg@
2023-06-01	Add support for wireguard peer descriptions	Klemens Nanni
	"wgdescr[iption] foo" to label one peer (amongst many) on a wg(4) interface, "-wgdescr[iption]" or "wgdescr ''" to remove the label, completely analogous to existing interface descriptions. Idea/initial diff from Mikolaj Kucharski (OK sthen) Tests/prodded by Hrvoje Popovski Tweaks/manual bits from me Feedback deraadt sthen mvs claudio OK claudio
2023-05-30	add net_tq_barriers	David Gwynne
	this waits once for something to end in all the net tqs. ok claudio@
2023-05-30	spelling	Jonathan Gray
	ok jmc@ guenther@ tb@
2023-05-26	Remove net lock from DIOC{S,G}ETLIMIT	Klemens Nanni
	Grab the pf lock for pf_pool_limits[] in pfsync such that all access is covered by the pf lock; document accordingly. Hard memory pool limits don't need the net lock for protection, pool(9)s have their own internal lock and the pf lock fully covers limit values. (pf_pool_limits[] access in DIOCXCOMMIT remains under pf and net lock until the rest in there gets pulled out of the net lock.) OK sashan
2023-05-18	Assert pf lock on interface handling	Klemens Nanni
	Make sure that all hooks into pf's internal list of interfaces do happen with the pf lock held, i.e. nothing relies on the net lock alone, so that later unlocking can then rely on it. Full i386 regress (thanks bluhm) and daily usage are fine OK sashan
2023-05-18	sc_st_mtx is not sufficient protection to move state around	Alexandr Nedvedicky
	pfsync(4) queues. We also need to grab pf_state::mtx to put/remove state instance safely from pfsync(4) queue. The issue has been pointed out by bluhm@. Patch survived testing done by hrvoje@ OK dlg@
2023-05-17	fix stoeplitz_hash_h32.	David Gwynne
	discussed with and ok tb@
2023-05-16	Use separate IFCAPs for LRO and TSO.	Jan Klemkow
	This diff introduces separate capabilities for TCP offloading. We split this into LRO (large receive offloading) and TSO (TCP segmentation offloading). LRO can be turned on/off via the tcprecvoffload option of ifconfig and is not inherited by sub interfaces. TSO is inherited by sub interfaces to signal this hardware offloading capability to the network stack. With tweaks from bluhm, claudio and dlg ok bluhm, claudio
2023-05-15	Implement the TCP/IP layer for hardware TCP segmentation offload.	Alexander Bluhm
	If the driver of a network interface claims to support TSO, do not chop the packet in software, but pass it down to the interface layer. Precalculate parts of the pseudo header checksum, but without the packet length. The length of all generated smaller packets is not known yet. Driver and hardware will use the mbuf packet header field ph_mss to calculate it and update checksum. Introduce separate flags IFCAP_TSOv4 and IFCAP_TSOv6 as hardware might support only one protocol family. The old flag IFXF_TSO is only relevant for large receive offload. It is misnamed, but keep that for now. Note that drivers do not set TSO capabilities yet. Also the ifconfig flags and pseudo interfaces capabilities will be done separately. So this commit should not change behavior. heavily based on the work from jan@; OK sashan@
2023-05-14	give softnet threads unique names by suffixing softnet with their index.	David Gwynne
	ie, you'll see softnet0, softnet1, etc in top/ps/etc now instead of just softnet on these threads. this is done by wrapping the taskq and name up in a softnet struct. ok patrick@ bluhm@ mvs@ kn@ sashan@
2023-05-13	Instead of implementing IPv4 header checksum creation everywhere,	Alexander Bluhm
	introduce in_hdr_cksum_out(). It is used like in_proto_cksum_out(). OK claudio@
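As background, the IPv4 header checksum itself is the standard Internet checksum: a one's complement sum over the header's 16-bit words with the checksum field taken as zero. The sketch below is a plain userland version of that calculation with a hypothetical function name; it does not reproduce in_hdr_cksum_out(), which operates on mbufs and may defer the work to hardware.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Compute the IPv4 header checksum over `hlen' bytes of header.
 * The checksum field (bytes 10 and 11) is skipped, i.e. treated as zero.
 * The result is returned in host byte order.
 */
uint16_t
sketch_ip_hdr_cksum(const uint8_t *hdr, size_t hlen)
{
	uint32_t sum = 0;
	size_t i;

	assert(hlen >= 20 && hlen % 4 == 0);
	for (i = 0; i < hlen; i += 2) {
		if (i == 10)
			continue;	/* skip the checksum field itself */
		sum += ((uint32_t)hdr[i] << 8) | hdr[i + 1];
	}
	/* Fold the carries back in (one's complement addition). */
	while (sum >> 16)
		sum = (sum & 0xffff) + (sum >> 16);
	return (uint16_t)~sum;
}
```

Having one helper for this keeps every output path consistent about when the field is filled in and when it is left for the interface to handle.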
2023-05-11	pools are always initialised, zap overcautious NULL check	Klemens Nanni
	All pools are init'd after pfattach(), none is ever destroyed, so struct pf_pool_limit's .pp always points to valid pools. Drop a check for the impossible from twenty years ago. OK sashan dlg
2023-05-10	nat-to may fail to insert state due to conflict on chosen source	Alexandr Nedvedicky
	port number. This is typically indicated by 'wire key attach failed on...' message when pf(4) debugging is enabled. The problem is caused by glitch in pf_get_sport() which fails to discover conflict in advance. In order to fix it we must also calculate toeplitz hash in pf_get_sport() to initialize look up key properly. the bug has been kindly reported by joosepm _von_ gmail _dot_ com OK dlg@
2023-05-10	Implement TCP send offloading, for now in software only. This is	Alexander Bluhm
	meant as a fallback if network hardware does not support TSO. Driver support is still work in progress. TCP output generates large packets. In IP output the packet is chopped to TCP maximum segment size. This reduces the CPU cycles used by pf. The regular output could be assisted by hardware later, but pf route-to and IPsec need the software fallback in general. For performance comparison or to work around possible bugs, sysctl net.inet.tcp.tso=0 disables the feature. netstat -s -p tcp shows TSO counter with chopped and generated packets. based on work from jan@ tested by jmc@ jan@ Hrvoje Popovski OK jan@ claudio@
2023-05-08	fix up some formatting in the pf_state_list comment.	David Gwynne

2023-05-08	The call to in_proto_cksum_out() is only needed before the packet	Alexander Bluhm
	is passed to ifp->if_output(). The fragment code has its own checksum calculation and the other paths end in goto bad. OK claudio@
2023-05-07	In preparation for TSO in software, clean up the fragment code. Use	Alexander Bluhm
	if_output_ml() to send mbuf lists to interfaces. This can be used for TSO, fragments, ARP and ND6. Rename variable fml to ml. In pf_route6() split the if else block. Put the safety check (hlen + firstlen < tlen) into ip_fragment(). It makes the code correct in case the packet is too short to be fragmented. This should not happen, but other functions also have this logic. No functional change. OK sashan@
2023-05-07	Remove net lock from DIOCOSFP{FLUSH,ADD,GET} aka. OS fingerprinting	Klemens Nanni
	pf_osfp.c contains all the locking for these three ioctls, everything is protected by the pf lock; assert/document it and inline access to the global list to eliminate useless function variables. OK bluhm sashan
2023-05-03	Remove net lock from DIOCGETRULESET and DIOCGETRULESETS	Klemens Nanni
	Both walk the list of rulesets aka. anchors, to yield a total count and specific anchor name, respectively. Same access, different copy out. pf_anchor_global are contained within pf_ioctl.c and pf_ruleset.c and fully protected by the pf lock, as is pf_main_ruleset and its pf.c usage. Rely on and assert for pf lock alone. 'pfctl -sr' on 60k unique rules gets noticeably faster, around 2.1s instead of 3.5s. OK sashan
2023-04-29	Remove net lock from DIOCGETQUEUE	Klemens Nanni
	Same logic and argument as for the parent *S ioctl unlocked in r1.400, might as well have committed them together: Both ticket and number of queues stem from the pf_queues_active list which is effectively static to pf_ioctl.c and fully protected by the pf lock. OK sashan
2023-04-28	Add rtentry refcnt type to dt(4).	Vitaliy Makkoveev
	ok bluhm@
2023-04-28	remove superfluous/invalid KASSERT() in pfsync_q_del().	Alexandr Nedvedicky
	pointed out by and OK bluhm@
2023-04-28	This change speeds up DIOCGETRULE ioctl(2) which pfctl(8) uses to	Alexandr Nedvedicky
	retrieve rules from kernel. The current implementation requires roughly O(n^2/2) operations to read the complete rule set, because each DIOCGETRULE operation must iterate over the previous n rules to find the (n + 1)-th rule to read. To address the issue the diff introduces a pf_trans structure to keep a pointer to the next rule to read, thus the reading process does not need to iterate from the beginning of the rule set to reach the next rule. All transactions opened by a process get closed either when the process is done (reads all rules) or when the /dev/pf device is closed. The diff also comes with lots of improvements from dlg@ and kn@ OK dlg@, kn@
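The difference between the two access patterns can be sketched with a singly linked list. All names here (sk_rule, sk_get_nth, sk_trans, sk_trans_next) are hypothetical; the sketch only contrasts walking from the head for every request with keeping a cursor to the next rule, which is the role the commit assigns to pf_trans.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical rule list node. */
struct sk_rule {
	int nr;
	struct sk_rule *next;
};

/* Old pattern: walk from the head for every request, counting the steps. */
const struct sk_rule *
sk_get_nth(const struct sk_rule *head, size_t n, size_t *steps)
{
	const struct sk_rule *r = head;

	assert(steps != NULL);
	while (r != NULL && n-- > 0) {
		r = r->next;
		(*steps)++;
	}
	return r;
}

/* New pattern: a transaction remembers where the previous read stopped. */
struct sk_trans {
	const struct sk_rule *next;	/* next rule to hand out, or NULL */
};

/* Return the next rule in constant time and advance the cursor. */
const struct sk_rule *
sk_trans_next(struct sk_trans *t)
{
	const struct sk_rule *r = t->next;

	if (r != NULL)
		t->next = r->next;
	return r;
}
```

Reading n rules with the first function takes 0 + 1 + ... + (n - 1) list steps in total, while the cursor takes one step per rule. The cost of the cursor is that it is state the kernel must hold until the reader is done or the descriptor is closed.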
2023-04-28	Relax the "pass all" rule so all forms of neighbor advertisements are allowed	Peter Hessler
	in either direction. This more closely matches the IPv4 ARP behaviour. From sashan@ discussed with kn@ deraadt@
2023-04-28	Remove net lock from DIOCGETQUEUES	Klemens Nanni
	Both ticket and number of queues stem from the pf_queues_active list which is effectively static to pf_ioctl.c and fully protected by the pf lock. OK sashan
2023-04-27	Remove kernel lock from rtfree(9).	Vitaliy Makkoveev
	Route timers and route labels are protected by corresponding mutexes. `ifa' uses reference counting for protection. rt_mpls_clear() could be called lockless because this is the last reference of `rt'. ok bluhm@ kn@
2023-04-27	Remove net lock from DIOCGETTIMEOUT	Klemens Nanni
	'pfctl -s timeouts' values are only used inside of pf, entirely protected by the pf lock through the ioctl interface; the net lock is useless. Previous attempts to remove net lock usage showed that the pf lock cannot yet entirely replace it, so start with small pieces like this one. Contrary to IPv4/6 read-only ioctls, some pf ioctls without FWRITE flag do modify internal pf state, which is not entirely obvious when approached from the ioctl layer. OK sashan dlg