If the driver of a network interface claims to support TSO, do not
chop the packet in software, but pass it down to the interface
layer.
Precalculate parts of the pseudo header checksum, but without the
packet length, as the length of the generated smaller packets is
not known yet. Driver and hardware will use the mbuf packet header
field ph_mss to calculate it and update the checksum.
Introduce separate flags IFCAP_TSOv4 and IFCAP_TSOv6 as hardware
might support only one protocol family. The old flag IFXF_TSO is
only relevant for large receive offload. It is misnamed, but keep
that for now.
Note that drivers do not set TSO capabilities yet. Also the ifconfig
flags and pseudo-interface capabilities will be done separately.
So this commit should not change behavior.
heavily based on the work from jan@; OK sashan@
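A minimal sketch of the decision this enables; ISSET(), IFCAP_TSOv4,
IFCAP_TSOv6 and ph_mss are the real kernel names, the helper itself
is hypothetical:

#include <sys/param.h>
#include <sys/socket.h>
#include <sys/mbuf.h>
#include <net/if.h>
#include <net/if_var.h>

/*
 * Hypothetical helper: if the interface advertises TSO for the
 * packet's address family, record the MSS in the packet header so
 * driver and hardware can chop and fix the checksum; otherwise the
 * caller must chop in software.
 */
static int
tso_offload_ok(struct ifnet *ifp, struct mbuf *m, sa_family_t af,
    uint16_t mss)
{
	if ((af == AF_INET && ISSET(ifp->if_capabilities, IFCAP_TSOv4)) ||
	    (af == AF_INET6 && ISSET(ifp->if_capabilities, IFCAP_TSOv6))) {
		m->m_pkthdr.ph_mss = mss;
		return (1);
	}
	return (0);	/* fall back to software chopping */
}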
|
|
introduce in_hdr_cksum_out(). It is used like in_proto_cksum_out().
OK claudio@
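A sketch of what the helper plausibly does, mirroring the
in_proto_cksum_out() calling convention; the body is an assumption,
not the committed code:

#include <sys/param.h>
#include <sys/mbuf.h>
#include <net/if.h>
#include <net/if_var.h>
#include <netinet/in.h>
#include <netinet/ip.h>

/* assumed behaviour: clear ip_sum, compute it in software unless
 * the outgoing interface checksums IPv4 headers in hardware */
static void
hdr_cksum_out_sketch(struct mbuf *m, struct ifnet *ifp)
{
	struct ip *ip = mtod(m, struct ip *);

	ip->ip_sum = 0;
	if (ifp == NULL || !ISSET(ifp->if_capabilities, IFCAP_CSUM_IPv4))
		ip->ip_sum = in_cksum(m, ip->ip_hl << 2);
}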
|
|
meant as a fallback if network hardware does not support TSO. Driver
support is still work in progress. TCP output generates large
packets. In IP output the packet is chopped to TCP maximum segment
size. This reduces the CPU cycles used by pf. The regular output
could be assisted by hardware later, but pf route-to and IPsec need
the software fallback in general.
For performance comparison or to workaround possible bugs, sysctl
net.inet.tcp.tso=0 disables the feature. netstat -s -p tcp shows
TSO counters for chopped and generated packets.
based on work from jan@
tested by jmc@ jan@ Hrvoje Popovski
OK jan@ claudio@
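Roughly, the fallback decision looks like this sketch; tcp_do_tso
stands in for the sysctl-backed knob and tcp_chop_to_mss() is a
hypothetical chopper:

#include <sys/mbuf.h>

extern int tcp_do_tso;	/* backs net.inet.tcp.tso; name assumed */
int tcp_chop_to_mss(struct mbuf_list *, struct mbuf *, u_int); /* hypothetical */

/* small packets pass through; large TSO packets get chopped */
static int
tcp_maybe_chop(struct mbuf_list *ml, struct mbuf *m, u_int mss)
{
	if (m->m_pkthdr.len <= mss) {
		ml_enqueue(ml, m);
		return (0);
	}
	return (tcp_chop_to_mss(ml, m, mss));
}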
|
|
is passed to ifp->if_output(). The fragment code has its own
checksum calculation and the other paths end in goto bad.
OK claudio@
|
|
if_output_ml() to send mbuf lists to interfaces. This can be used
for TSO, fragments, ARP and ND6. Rename variable fml to ml. In
pf_route6() split the if else block. Put the safety check (hlen +
firstlen < tlen) into ip_fragment(). It makes the code correct in
case the packet is too short to be fragmented. This should not
happen, but other functions also have this logic.
No functional change. OK sashan@
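A sketch of such a helper; ml_dequeue() and ml_purge() are the real
mbuf list primitives, the exact signature of if_output_ml() may
differ:

#include <sys/socket.h>
#include <sys/mbuf.h>
#include <net/if.h>
#include <net/if_var.h>
#include <net/route.h>

static int
output_ml_sketch(struct ifnet *ifp, struct mbuf_list *ml,
    struct sockaddr *dst, struct rtentry *rt)
{
	struct mbuf *m;
	int error = 0;

	while ((m = ml_dequeue(ml)) != NULL) {
		error = ifp->if_output(ifp, m, dst, rt);
		if (error)
			break;
	}
	if (error)
		ml_purge(ml);	/* free whatever could not be sent */

	return (error);
}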
|
|
Both walk the list of rulesets, aka anchors, to yield a total count
and a specific anchor name, respectively. Same access, different copy out.
pf_anchor_global are contained within pf_ioctl.c and pf_ruleset.c and
fully protected by the pf lock, as is pf_main_ruleset and its pf.c usage.
Rely on and assert for the pf lock alone. 'pfctl -sr' on 60k unique
rules gets noticeably faster, around 2.1s instead of 3.5s.
OK sashan
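The shared walk is roughly this sketch; PF_ASSERT_LOCKED(),
pf_anchor_global and pf_anchors are the real names, the counting
body is illustrative:

#include <net/pfvar.h>
#include <net/pfvar_priv.h>

static u_int32_t
anchor_count_sketch(void)
{
	struct pf_anchor *anchor;
	u_int32_t n = 0;

	PF_ASSERT_LOCKED();	/* the pf lock alone protects the tree */
	RB_FOREACH(anchor, pf_anchor_global, &pf_anchors)
		n++;
	return (n);
}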
|
|
in either direction.
This more closely matches the IPv4 ARP behaviour.
From sashan@
discussed with kn@ deraadt@
|
|
such a value would have triggered a KASSERT()
ok sashan@ deraadt@
|
|
Issue found and kindly reported by Luca Di Gregorio <lucdig _at_ gmail>
OK bluhm@
|
|
previous behavior that stops when any rule matches within quick
anchors.
ok sasha kn
|
|
listen port is not bound to port 0. With a matching pf divert-to
rule this assumption is no longer true and could crash the kernel
with a KASSERT(). In both pf and the stack, drop TCP packets with
destination port 0 before they can do harm.
OK sashan@ claudio@
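The check amounts to something like this sketch in pf's packet
descriptor handling; the placement and the helper are assumptions:

#include <netinet/in.h>
#include <net/pfvar.h>

/* drop TCP packets with port 0 in either direction */
static int
drop_port0_sketch(struct pf_pdesc *pd)
{
	if (pd->proto == IPPROTO_TCP &&
	    (pd->hdr.tcp.th_sport == 0 || pd->hdr.tcp.th_dport == 0))
		return (PF_DROP);
	return (PF_PASS);
}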
|
|
On amd64 the stack overflows for an anchor rule with depth ~30. The
tricky thing is that the 'safe' depth varies depending on the kind
of packet processed by pf_match_rule(). For example, for a local
outbound TCP packet the stack overflows when the recursion in
pf_match_rule() reaches depth 24.
Instead of lowering PF_ANCHOR_STACK_MAX to 20 and hoping it will
be enough on all platforms and for all packets, I'd like to stop
calling pf_match_rule() recursively. This commit brings back the
pf_anchor_stackframe array we used to have back in 2017. It also
revives patrick@'s idea to pre-allocate stack frame arrays from
per-cpu memory.
OK kn@
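The iterative shape, sketched for simple named anchors only; the
rule evaluation and the per-cpu allocation are elided, and the frame
layout is an assumption based on the commit text:

#include <sys/queue.h>
#include <net/pfvar.h>

#define STACK_MAX	64	/* stands in for PF_ANCHOR_STACK_MAX */

struct stackframe_sketch {
	struct pf_ruleset	*sf_rs;	/* ruleset being walked */
	struct pf_rule		*sf_r;	/* anchor rule we descended at */
};

static void
match_rules_iter_sketch(void)
{
	struct stackframe_sketch stack[STACK_MAX];
	struct pf_ruleset *rs = &pf_main_ruleset;
	struct pf_rule *r = TAILQ_FIRST(rs->rules.active.ptr);
	int depth = 0;

	while (r != NULL || depth > 0) {
		if (r == NULL) {
			/* ruleset exhausted: pop, resume after the anchor */
			depth--;
			rs = stack[depth].sf_rs;
			r = TAILQ_NEXT(stack[depth].sf_r, entries);
			continue;
		}
		if (r->anchor != NULL && depth < STACK_MAX) {
			/* push the current position, descend into anchor */
			stack[depth].sf_rs = rs;
			stack[depth].sf_r = r;
			depth++;
			rs = &r->anchor->ruleset;
			r = TAILQ_FIRST(rs->rules.active.ptr);
			continue;
		}
		/* evaluate r against the packet here (elided) */
		r = TAILQ_NEXT(r, entries);
	}
}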
|
|
pf_state ** are generally called "stp" now too.
discussed with and ok sashan@
|
|
the pf_state_tree_id type is private to the kernel.
while here, move it from being an RB tree to an RBT tree. this saves
about 12k in pf.o on amd64.
ok sashan@
|
|
the pf_state_tree types are kernel private, and are not used by
userland. make build agrees with me.
while here, move the pf_state_tree from the RB macros to the RBT
functions. this shaves about 13k off pf.o on amd64.
ok sashan@
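The difference in a nutshell: the RB macros expand a full red-black
implementation per tree type, while RBT generates thin wrappers
around one shared implementation. A toy declaration with the RBT
API:

#include <sys/tree.h>

struct node {
	RBT_ENTRY(node)	entry;
	int		key;
};

static inline int
node_cmp(const struct node *a, const struct node *b)
{
	return (a->key < b->key ? -1 : a->key > b->key);
}

RBT_HEAD(node_tree, node);
RBT_PROTOTYPE(node_tree, node, entry, node_cmp);
RBT_GENERATE(node_tree, node, entry, node_cmp);
/* usage: RBT_INSERT(node_tree, &head, elm), RBT_FIND(node_tree, &head, &key_elm) */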
|
|
before this it would use the pf state id, which is just an increasing
number. the toeplitz hash is generated/used by the rest of the
stack, so this encourages consistent flow of traffic through the
system.
|
|
New warning -Warray-parameter is a bit overzealous.
ok millert@ tb@
|
|
the hash generated when setting up the pf pdesc struct uses outer
addresses, while the addresses used in the state table go through
pf_state_key_addr_setup(), which does interesting things with some
ipv6 icmp values. state lookups used pf_state_key_addr_setup(), but
pf_state_key_setup() copied the pdesc value, causing an inconsistency.
pf_state_key_setup() now calls pf_state_key_addr_setup().
found by anton@
tested by anton@ florian@
|
|
anton@ says the previous commit breaks ipv6 related regress tests.
disabling the use of the hash in the state key compare gets it going
again until i can figure out what's going on.
|
|
the hash will be used to partition work in pf and pfsync in the
future, and right now it is used as the first comparison in the rb
tree state lookup.
using stoeplitz means that pf will hash traffic the same way that
hardware using a stoeplitz key will hash incoming traffic on rings.
stoeplitz is also used by the tcp stack to generate a flow id, which
is used to pick which transmit ring is used on nics with multiple
queues too. using the same algorithm throughout the stack encourages
affinity of packets to rings and softnet threads the whole way
through.
using the hash as the first comparison in the state rb tree comparison
should encourage faster traversal of the state tree by having all
the address/port bits summarised into the single hash value. however,
tests by hrvoje popovski don't show performance changing. on the
plus side, if this change is free from a performance point of view
then it makes the future steps more straightforward.
discussed at length at h2k22
tested by sashan@ and hrvoje popovski
ok tb@ sashan@ claudio@ jmatthew@
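The hash-first comparison sketched below shows the intent; the field
names are assumptions based on this series:

#include <sys/systm.h>
#include <net/pfvar.h>
#include <net/pfvar_priv.h>

static inline int
state_key_cmp_sketch(const struct pf_state_key *a,
    const struct pf_state_key *b)
{
	/* most comparisons are decided by this one integer */
	if (a->sk_hash < b->sk_hash)
		return (-1);
	if (a->sk_hash > b->sk_hash)
		return (1);
	/* equal hashes: fall back to the full address comparison */
	return (memcmp(a->sk_addr, b->sk_addr, sizeof(a->sk_addr)));
}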
|
|
this makes searching for the struct members easier, which in turn
makes tweaking code around them a lot easier too. sk_refcnt in
particular would have been a lot nicer to fiddle with than just
refcnt because pf_state structs also have a refcnt, which is annoying.
tweaks and ok sashan@
reads ok kn@
|
|
pf_state structures don't contain ip addresses, protocols, ports,
etc. that information is stored in a pf_state_key struct, which is
used to wire a state into the state table. when things like pfsync
or the pf state ioctls want to export information about a state,
particularly the addresses on it, they need the pf_state_key struct
to read from.
before this diff the code assumed that when a state was removed
from the state tables it could throw the pf_state_key structs away
as part of that removal. this diff changes it so that once
pf_state_insert() succeeds, a pf_state will keep its references to the pf_state_key
structs until the pf_state struct itself is being destroyed.
this allows anything that holds a reference to a pf_state to also
look at the pf_state_key structs because they're now effectively
an immutable part of the pf_state struct.
this is by far the simplest and most straightforward fix for pfsync
crashing on pf_state_key dereferences we've come up with so far.
it has been made possible by the addition of reference counts to
pf_state and pf_state_key structs, which allows us to properly
account for this adjusted lifecycle for pf_state_keys on pf_state
structs.
sashan@ and i have been kicking this diff around for a couple of
weeks now.
ok sashan@ jmatthew@
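The adjusted lifecycle leans on the kernel refcnt API; a sketch of
taking and releasing a key reference, where refcnt_take() and
refcnt_rele() are real and the field and pool names follow this
series:

#include <sys/pool.h>
#include <sys/refcnt.h>
#include <net/pfvar.h>
#include <net/pfvar_priv.h>

extern struct pool pf_state_key_pl;	/* lives in pf.c */

static void
state_key_ref_sketch(struct pf_state_key *sk)
{
	refcnt_take(&sk->sk_refcnt);	/* taken once pf_state_insert succeeds */
}

static void
state_key_unref_sketch(struct pf_state_key *sk)
{
	/* dropped only when the pf_state itself is destroyed */
	if (refcnt_rele(&sk->sk_refcnt))
		pool_put(&pf_state_key_pl, sk);	/* last reference */
}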
|
|
Using systqmp for pf_purge creates a deadlock between pf_purge()
and ixgbe_stop(), and possibly other drivers. On systqmp, pf(4)
needs the netlock, which the interface ioctl(2) is holding. ix(4)
waits in sched_barrier(), which is also scheduled on the systqmp
task queue.
Removing the netlock from pf_purge() as a quick fix caused other
problems.
backout suggested by deraadt@
|
|
warnings. Rushing to pile more stuff on top of it isn't the answer. This
needs a rethink.
ok deraadt@
|
|
pf purge was moved to systqmp (to get it away from KERNEL_LOCK)
which is also used as the backend for things like intr_barrier and
sched_barrier. it is common for network cards to call intr_barrier
while holding NET_LOCK, and if pf is trying to get the NET_LOCK in
the purge tasks that are now running in systqmp, it's a deadlock.
bluhm@ hit this exact issue.
sashan@ has been working to get rid of the need for NET_LOCK in pf,
so now we can remove the NET_LOCKs here rather than create a pf
specific taskq to run these tasks in.
ok sashan@ bluhm@
|
|
this also avoids holding NET_LOCK too long.
the main change is done by running the purge tasks in systqmp instead
of systq. the pf state list was recently reworked so iteration over
the state can be done without blocking insertions.
however, scanning a lot of states can still take a lot of time, so
this also makes the state list scanner yield if it has spent too
much time running.
the other purge tasks for source nodes, rules, and fragments have
been moved to their own timeout/task pair to simplify the time
accounting.
in my environment, before this change pf purges often took 10 to
50ms. the softclock thread that runs next to it often took a similar
amount of time, presumably because they ended up spinning waiting
for each other. after this change the pf purges are more like 6 to
12ms, and don't block softclock. most of the variability in the runs
now seems to come from contention on the net lock.
tested by me sthen@ chris@
ok sashan@ kn@ claudio@
the diff was backed out because it made things a bit more racy,
but sashan@ has squashed those races this week. let's try it again.
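The yielding scan amounts to a pattern like this sketch; getuptime(),
task_add() and systqmp are real, the slice function and the
one-second budget are assumptions:

#include <sys/systm.h>
#include <sys/time.h>
#include <sys/task.h>

extern struct task purge_task_sketch;	/* hypothetical task handle */
int purge_slice_sketch(int);	/* hypothetical: 1 while work remains */

static void
purge_states_yielding_sketch(void *arg)
{
	time_t start = getuptime();

	(void)arg;
	/* scan in slices; when we run too long, reschedule ourselves */
	while (purge_slice_sketch(1024)) {
		if (getuptime() - start >= 1) {
			task_add(systqmp, &purge_task_sketch);
			return;
		}
	}
}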
|
|
nothing is protected by it yet but it will allow us to provide
consistent updates to individual states without relying on a global
lock. getting that right between the packet processing in pf itself,
pfsync, the pf purge code, the ioctl paths, etc. is not worth the
required contortions.
while pf_state does grow, it doesn't use more cachelines on machines
where we will want to run in parallel with a lot of states.
stolen from and ok sashan@
|
|
foo_up() where foo is a network driver is usually a function that
configures and brings an interface up into a running state. this
small tweak just makes the code a bit easier for me to read.
|
|
i can read this code as functions, but it takes too much effort as macros.
|
|
pfctl does not build
OK dlg@
|
|
of structure members without using a global state lock.
The first member protected by the mutex is the key[] array.
More will follow.
OK dlg@
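A sketch of a mutex-protected accessor for the key[] array;
mtx_enter()/mtx_leave() and PF_SK_WIRE/PF_SK_STACK are real names,
the mutex field follows the commit text:

#include <sys/mutex.h>
#include <net/pfvar.h>
#include <net/pfvar_priv.h>

static struct pf_state_key *
state_key_get_sketch(struct pf_state *st, int idx)
{
	struct pf_state_key *sk;

	mtx_enter(&st->mtx);	/* per-state mutex, no global state lock */
	sk = st->key[idx];	/* idx is PF_SK_WIRE or PF_SK_STACK */
	mtx_leave(&st->mtx);

	return (sk);
}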
|
|
let the packet mark the 'once' rule as expired. The rule
will be removed by pfctl(8) when the rules are updated.
OK kn@
|
|
found in pfsync_insert_state(). It is caused by two packets which happen
to belong to the same session. Think of a UDP stream or two TCP SYN
packets transmitted almost simultaneously. The first such packet
wins the state lock and inserts the state into the table. The second
packet waits for the state lock as a reader. As soon as the first
packet is done with state creation it drops the lock and is about
to send an S_INS message to its peer via pfsync. The second packet
meanwhile obtains the state lock as a reader. It finds the state
created by the first packet. Later the second packet also finds out
the state needs to be updated, because sync_state is still set to
PFSYNC_S_NONE. The second packet puts the state on the snapshot
list, marking it as S_UPD. All this happens before the first packet
has a chance to make progress; think of the first packet losing the
cpu after dropping the write lock. Once the first packet gets running
again it trips the KASSERT() because sync_state is set to S_UPD.
tested by hrvoje@
OK dlg@
|
|
hrvoje popovski showed me pfsync blowing up with this. i'm backing
it out quickly in case something else at the hackathon makes it
harder to do later.
kn@ agrees
|
|
this also avoids holding NET_LOCK too long.
the main change is done by running the purge tasks in systqmp instead
of systq. the pf state list was recently reworked so iteration over
the state can be done without blocking insertions.
however, scanning a lot of states can still take a lot of time, so
this also makes the state list scanner yield if it has spent too
much time running.
the other purge tasks for source nodes, rules, and fragments have
been moved to their own timeout/task pair to simplify the time
accounting.
in my environment, before this change pf purges often took 10 to
50ms. the softclock thread that runs next to it often took a similar
amount of time, presumably because they ended up spinning waiting
for each other. after this change the pf purges are more like 6 to
12ms, and don't block softclock. most of the variability in the runs
now seems to come from contention on the net lock.
tested by me sthen@ chris@
ok sashan@ kn@ claudio@
|
|
this is straightening the deck chairs. the state import and export
code are used by both the pf ioctls and pfsync, but the export code
is in pf.c and the import code is in if_pfsync. if pfsync was
disabled then the ioctl stuff wouldn't link.
moving the import code to pf.c makes it more symmetrical(?) and
robust.
tweaks and ok from kn@ sashan@
|
|
In 2011, henning@ removed fiddling with the ip checksum of normalised
packets in r1.131 of sys/net/pf_norm.c. Rationale was that the checksum
is always recalculated in all output paths anyway. In 2016, procter@
reintroduced checksum modification to preserve end-to-end checksums in
r1.189 of sys/net/pf_norm.c. Likely somewhere in that time span,
checksum recalculation of normalised packets was broken.
With input from bluhm@.
OK sashan@, bluhm@
|
|
removes pressure from the exclusive netlock in tcp_slowtimo().
Reading is done atomically. Ensure that the tcp_now value is read
only once per function to provide consistent time.
OK yasuoka@
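The read-once discipline looks like this sketch; READ_ONCE() is the
real macro, treating tcp_now as a plain shared counter is a
simplification:

#include <sys/atomic.h>
#include <netinet/tcp_var.h>

extern uint32_t tcp_now;	/* type and linkage assumed for the sketch */

static void
timestamp_once_sketch(struct tcpcb *tp)
{
	uint32_t now = READ_ONCE(tcp_now);	/* read exactly once */

	/* every use in this function sees the same time value */
	tp->ts_recent_age = now;
	tp->t_rcvtime = now;
}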
|
|
its local address is translated, to prevent its source port from being
reused. regress test by bluhm.
ok bluhm
|
|
so the public API is in_pcblookup() and in_pcblookup_listen(). For
internal use introduce in_pcbhash_insert() and in_pcbhash_lookup()
to avoid code duplication. Routing domain is unsigned, change the
type to u_int.
OK mvs@
|
|
Use their reference counter in more places.
The in_pcb lookup functions hold the PCBs in hash tables protected
by table->inpt_mtx mutex. Whenever a result is returned, increment
the ref count before releasing the mutex. Then the inp can be used
as long as necessary. Unref it at the end of all functions that
call in_pcb lookup.
As a shortcut, pf may also hold a reference to the PCB. When
pf_inp_lookup() returns it, it also increments the ref count and the
caller can handle it like the inp from table lookup.
OK sashan@
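The lookup-and-ref pattern, sketched; in_pcbref() and in_pcbunref()
are the functions named here, the hash lookup stand-in is
hypothetical:

#include <sys/mutex.h>
#include <netinet/in_pcb.h>

struct inpcb *hash_lookup_sketch(struct inpcbtable *);	/* hypothetical */

static void
use_pcb_sketch(struct inpcbtable *table)
{
	struct inpcb *inp;

	mtx_enter(&table->inpt_mtx);
	inp = hash_lookup_sketch(table);
	if (inp != NULL)
		in_pcbref(inp);	/* ref taken before the mutex is released */
	mtx_leave(&table->inpt_mtx);

	if (inp != NULL) {
		/* ... use inp as long as necessary ... */
		in_pcbunref(inp);
	}
}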
|
|
It was possible to exhaust kernel memory by repeatedly calling
pfioctl DIOCXBEGIN with different anchor names.
OK bluhm@
Reported-by: syzbot+9dd98cbce69e26f0fc11@syzkaller.appspotmail.com
|
|
This really pointed out that the place syncookies were hooked in was
almost, but not completely, right. The way it was, the special case
for tcp fast port reuse in pf_test_state() wasn't hit, because the
first packet hitting it was the ACK from the peer finishing the 3WHS,
and the reconstructed SYN came after. We're now doing pf_find_state()
(and *only* that) first, then syncookies, then going on, so that the
old state is thrown away properly and we get a new one with the
sequence number modulator set up correctly.
Bonus: -11 lines of code
tracked down (that took a while) + fixed under contract with Hush
Communications Canada; special thanks to Lyndon
ok sashan
|
|
OK bluhm
Reported-by: syzbot+50ea4f33ed5dd9264918@syzkaller.appspotmail.com
Reported-by: syzbot+df65f8b7ee8c0089e885@syzkaller.appspotmail.com
|
|
a state in PFTM_PURGE could potentially hide another state on the same state
key that is active and we'd incorrectly block the packet
I believe that cannot happen as things are now.
ok sashan
|
|
were truncated. Drop such packets instead.
Reported-by: syzbot+91abd3aa2fdfe900f9ce@syzkaller.appspotmail.com
OK sashan@ claudio@
|
|
|