path: root/sys/net/pf.c
Age    Commit message    Author
2023-05-15Implement the TCP/IP layer for hardware TCP segmentation offload.Alexander Bluhm
If the driver of a network interface claims to support TSO, do not chop the packet in software, but pass it down to the interface layer. Precalculate parts of the pseudo header checksum, but without the packet length, as the length of the generated smaller packets is not known yet. Driver and hardware will use the mbuf packet header field ph_mss to calculate it and update the checksum. Introduce separate flags IFCAP_TSOv4 and IFCAP_TSOv6 as hardware might support only one protocol family. The old flag IFXF_TSO is only relevant for large receive offload. It is misnamed, but keep that for now. Note that drivers do not set TSO capabilities yet. The ifconfig flags and pseudo-interface capabilities will also be done separately, so this commit should not change behavior. heavily based on the work from jan@; OK sashan@
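As a rough illustration of the idea (not the actual kernel code, and with simplified byte-order handling), the pseudo header sum can be computed over the addresses and protocol only, leaving the length word out so that driver or hardware can later fold in each generated segment's length derived from ph_mss:

    #include <stdint.h>

    /*
     * Sketch: partial IPv4 pseudo-header checksum without the length.
     * Addresses are taken as plain 32-bit values here; the real code
     * works on network byte order and mbuf checksum offload fields.
     */
    static uint16_t
    pseudo_cksum_nolen(uint32_t src, uint32_t dst, uint8_t proto)
    {
        uint32_t sum;

        sum = (src >> 16) + (src & 0xffff) +
            (dst >> 16) + (dst & 0xffff) + proto;
        while (sum > 0xffff)            /* fold carries back into 16 bits */
            sum = (sum & 0xffff) + (sum >> 16);
        return ((uint16_t)sum);
    }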
2023-05-13Instead of implementing IPv4 header checksum creation everywhere,Alexander Bluhm
introduce in_hdr_cksum_out(). It is used like in_proto_cksum_out(). OK claudio@
2023-05-10Implement TCP send offloading, for now in software only. This isAlexander Bluhm
meant as a fallback if network hardware does not support TSO. Driver support is still work in progress. TCP output generates large packets. In IP output the packet is chopped to the TCP maximum segment size. This reduces the CPU cycles used by pf. The regular output could be assisted by hardware later, but pf route-to and IPsec need the software fallback in general. For performance comparison or to work around possible bugs, sysctl net.inet.tcp.tso=0 disables the feature. netstat -s -p tcp shows TSO counters with chopped and generated packets. based on work from jan@ tested by jmc@ jan@ Hrvoje Popovski OK jan@ claudio@
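A minimal userland sketch of the chopping step (hypothetical names, not the mbuf-based code in ip_output()): split a payload into MSS-sized pieces; the real code also fixes up sequence numbers, TCP flags and checksums for every generated segment:

    #include <stddef.h>
    #include <stdio.h>

    static void
    chop_payload(size_t payload_len, size_t mss)
    {
        size_t off, seglen;

        for (off = 0; off < payload_len; off += seglen) {
            seglen = payload_len - off;
            if (seglen > mss)
                seglen = mss;       /* one TCP segment per iteration */
            printf("segment at offset %zu, length %zu\n", off, seglen);
        }
    }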
2023-05-08The call to in_proto_cksum_out() is only needed before the packetAlexander Bluhm
is passed to ifp->if_output(). The fragment code has its own checksum calculation and the other paths end in goto bad. OK claudio@
2023-05-07In preparation for TSO in software, clean up the fragment code. UseAlexander Bluhm
if_output_ml() to send mbuf lists to interfaces. This can be used for TSO, fragments, ARP and ND6. Rename variable fml to ml. In pf_route6() split the if else block. Put the safety check (hlen + firstlen < tlen) into ip_fragment(). It makes the code correct in case the packet is too short to be fragmented. This should not happen, but other functions also have this logic. No functional change. OK sashan@
2023-05-03Remove net lock from DIOCGETRULESET and DIOCGETRULESETSKlemens Nanni
Both walk the list of rulesets, a.k.a. anchors, to yield a total count and a specific anchor name, respectively. Same access, different copy out. pf_anchor_global are contained within pf_ioctl.c and pf_ruleset.c and fully protected by the pf lock, as is pf_main_ruleset and its pf.c usage. Rely on and assert for the pf lock alone. 'pfctl -sr' on 60k unique rules gets noticeably faster, around 2.1s instead of 3.5s. OK sashan
2023-04-28Relax the "pass all" rule so all forms of neighbor advertisements are allowedPeter Hessler
in either direction. This more closely matches the IPv4 ARP behaviour. From sashan@ discussed with kn@ deraadt@
2023-03-23fix off-by-one in pf_state_expires() bounds testJonathan Gray
such a value would have triggered a KASSERT() ok sashan@ deraadt@
2023-03-04pf(4) should enforce TTL=1 only on packets sent to 224.0.0.1.Alexandr Nedvedicky
Issue found and kindly reported by Luca Di Gregorio <lucdig _at_ gmail> OK bluhm@
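A sketch of the intended behaviour (illustrative helper, not the pf code): clamp the TTL only when the destination is the all-hosts group 224.0.0.1, using INADDR_ALLHOSTS_GROUP from <netinet/in.h>:

    #include <sys/types.h>
    #include <netinet/in.h>
    #include <netinet/ip.h>

    static void
    maybe_clamp_ttl(struct ip *ip)
    {
        /* only 224.0.0.1 gets TTL=1 forced; everything else is untouched */
        if (ntohl(ip->ip_dst.s_addr) == INADDR_ALLHOSTS_GROUP)
            ip->ip_ttl = 1;
    }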
2023-01-22Fix pf_anchor_stackframe commit to revert pf rule matching to theYASUOKA Masahiko
previous behavior that stops when any rule matches within quick anchors. ok sasha kn
2023-01-12Binding the accept socket in TCP input relies on the fact that theAlexander Bluhm
listen port is not bound to port 0. With a matching pf divert-to rule this assumption is no longer true and could crash the kernel with a KASSERT. In both pf and the stack, drop TCP packets with destination port 0 before they can do harm. OK sashan@ claudio@
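A sketch of the early drop (hypothetical helper, not the pf_test()/tcp_input() code itself): segments whose TCP destination port is 0 are refused before they can reach the socket layer:

    #include <sys/types.h>
    #include <netinet/in.h>
    #include <netinet/tcp.h>

    /* Return nonzero if the segment must be dropped. */
    static int
    tcp_dport_is_zero(const struct tcphdr *th)
    {
        /* port 0 is the same value in either byte order */
        return (th->th_dport == 0);
    }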
2023-01-06PF_ANCHOR_STACK_MAX is insufficient protection against stack overflow.Alexandr Nedvedicky
On amd64 the stack overflows for an anchor rule with depth ~30. The tricky thing is that the 'safe' depth varies depending on the kind of packet processed by pf_match_rule(). For example, for a local outbound TCP packet the stack overflows when the recursion of pf_match_rule() reaches depth 24. Instead of lowering PF_ANCHOR_STACK_MAX to 20 and hoping it will be enough on all platforms and for all packets, I'd like to stop calling pf_match_rule() recursively. This commit brings back the pf_anchor_stackframe array we used to have back in 2017. It also revives patrick@'s idea to pre-allocate the stack frame arrays per cpu. OK kn@
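A sketch of the general technique (invented structures and limits, not the actual pf_anchor_stackframe layout): walk nested anchors iteratively with an explicit, bounded frame array so the kernel stack depth no longer depends on the ruleset:

    #define MAXDEPTH    64      /* illustrative bound, checked explicitly */

    struct anchor {
        struct anchor   *child;     /* first nested anchor */
        struct anchor   *next;      /* next sibling */
    };

    struct frame {
        struct anchor   *a;
    };

    static int
    walk_anchors(struct anchor *root)
    {
        struct frame stack[MAXDEPTH];
        int depth = 0;

        stack[depth].a = root;
        while (depth >= 0) {
            struct anchor *a = stack[depth].a;

            if (a == NULL) {
                depth--;                    /* this level is done, pop */
                continue;
            }
            /* ... evaluate the rules of 'a' here ... */
            stack[depth].a = a->next;       /* resume with the sibling later */
            if (a->child != NULL) {
                if (depth + 1 >= MAXDEPTH)
                    return (-1);            /* refuse instead of overflowing */
                stack[++depth].a = a->child;
            }
        }
        return (0);
    }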
2023-01-05more consistently name pf_state * variables "st".David Gwynne
pf_state ** are generally called "stp" now too. discussed with and ok sashan@
2023-01-04move the pf_state_tree_id type from pfvar.h to pfvar_priv.h.David Gwynne
the pf_state_tree_id type is private to the kernel. while here, move it from being an RB tree to an RBT tree. this saves about 12k in pf.o on amd64. ok sashan@
2023-01-04move the pf_state_tree rb tree type from pfvar.h to pfvar_priv.hDavid Gwynne
the pf_state_tree types are kernel private, and are not used by userland. make build agrees with me. while here, move the pf_state_tree from the RB macros to the RBT functions. this shaves about 13k off pf.o on amd64. ok sashan@
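For illustration, a minimal RBT usage sketch with invented type and field names; the point is that RBT_GENERATE() only emits thin type glue around one shared red-black implementation, while the RB macros expand the whole algorithm per tree type:

    #include <sys/tree.h>

    struct node {
        RBT_ENTRY(node) entry;
        int             key;
    };

    static inline int
    node_cmp(const struct node *a, const struct node *b)
    {
        return (a->key < b->key ? -1 : a->key > b->key);
    }

    RBT_HEAD(node_tree, node);
    RBT_PROTOTYPE(node_tree, node, entry, node_cmp);
    RBT_GENERATE(node_tree, node, entry, node_cmp);

    struct node_tree nodes;

    void
    nodes_init(void)
    {
        RBT_INIT(node_tree, &nodes);
    }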
2023-01-02use the pf generated toeplitz hash when setting the mbuf flow id.David Gwynne
before this it would use the pf state id, which is just an increasing number. the toeplitz hash is generated/used by the rest of the stack, so this encourages consistent flow of traffic through the system.
2022-12-27Fix array bounds mismatch with clang 15Patrick Wildt
New warning -Warray-parameter is a bit overzealous. ok millert@ tb@
2022-12-24fix and enable toeplitz hashing of pf_state_keys again.David Gwynne
the hash generated when setting up the pf pdesc struct uses outer addresses, while the addresses used in the state table go through pf_state_key_addr_setup(), which does interesting things with some ipv6 icmp values. state lookups used pf_state_key_addr_setup(), but pf_state_key_setup copied the pdesc value, causing an inconsistency. pf_state_key_setup now calls pf_state_key_addr_setup(). found by anton@ tested by anton@ florian@
2022-12-23disable the use of the hash in the pf state key lookup (for now).David Gwynne
anton@ says the previous commit breaks ipv6 related regress tests. disabling the use of the hash in the state key compare gets it going again until i can figure out what's going on.
2022-12-22use stoeplitz to generate a hash/flowid for state keys.David Gwynne
the hash will be used to partition work in pf and pfsync in the future, and right now it is used as the first comparison in the rb tree state lookup. using stoeplitz means that pf will hash traffic the same way that hardware using a stoeplitz key will hash incoming traffic on rings. stoeplitz is also used by the tcp stack to generate a flow id, which is used to pick which transmit ring is used on nics with multiple queues too. using the same algorithm throughout the stack encourages affinity of packets to rings and softnet threads the whole way through. using the hash as the first comparison in the state rb tree comparison should encourage faster traversal of the state tree by having all the address/port bits summarised into the single hash value. however, tests by hrvoje popovski don't show performance changing. on the plus side, if this change is free from a performance point of view then it makes the future steps more straightforward. discussed at length at h2k22 tested by sashan@ and hrvoje popovski ok tb@ sashan@ claudio@ jmatthew@
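As a rough sketch of the lookup idea (invented structure, not pf_state_key): a compare function can use the precomputed hash as the first discriminator and only touch the full key material on a hash collision:

    #include <stdint.h>
    #include <string.h>

    struct flow_key {
        uint32_t    hash;       /* stoeplitz-style hash of the key */
        uint8_t     addrs[32];  /* addresses, ports, af, proto ... */
    };

    static int
    flow_key_cmp(const struct flow_key *a, const struct flow_key *b)
    {
        if (a->hash < b->hash)
            return (-1);
        if (a->hash > b->hash)
            return (1);
        /* rare hash collision: fall back to the real key material */
        return (memcmp(a->addrs, b->addrs, sizeof(a->addrs)));
    }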
2022-12-21tiny whitespace tweak.David Gwynne
2022-12-21consistently use the PF_REF wrappers around refcnts.David Gwynne
2022-12-21prefix pf_state_key and pf_state_item struct bits to make them more unique.David Gwynne
this makes searching for the struct members easier, which in turn makes tweaking code around them a lot easier too. sk_refcnt in particular would have been a lot nicer to fiddle with than just refcnt because pf_state structs also have a refcnt, which is annoying. tweaks and ok sashan@ reads ok kn@
2022-12-16always keep pf_state_keys attached to pf_states.David Gwynne
pf_state structures don't contain ip addresses, protocols, ports, etc. that information is stored in a pf_state_key struct, which is used to wire a state into the state table. when things like pfsync or the pf state ioctls want to export information about a state, particularly the addresses on it, they need the pf_state_key struct to read from. before this diff the code assumed that when a state was removed from the state tables it could throw the pf_state_key structs away as part of that removal. this diff changes it so that once pf_state_insert succeeds, a pf_state will keep its references to the pf_state_key structs until the pf_state struct itself is destroyed. this allows anything that holds a reference to a pf_state to also look at the pf_state_key structs because they're now effectively an immutable part of the pf_state struct. this is by far the simplest and most straightforward fix for pfsync crashing on pf_state_key dereferences we've come up with so far. it has been made possible by the addition of reference counts to pf_state and pf_state_key structs, which allows us to properly account for this adjusted lifecycle for pf_state_keys on pf_state structs. sashan@ and i have been kicking this diff around for a couple of weeks now. ok sashan@ jmatthew@
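A sketch of the adjusted lifecycle with refcnt(9) (structures and helpers are illustrative; the real state holds two keys plus tree linkage): the state takes a reference on its key when it is inserted and only gives it back when the state itself is destroyed:

    #include <sys/types.h>
    #include <sys/refcnt.h>

    struct demo_key {
        struct refcnt   refs;
        /* addresses, ports, ... */
    };

    struct demo_state {
        struct demo_key *key;
    };

    void
    demo_state_attach(struct demo_state *st, struct demo_key *sk)
    {
        refcnt_take(&sk->refs);     /* the state now owns a reference */
        st->key = sk;
    }

    void
    demo_state_destroy(struct demo_state *st)
    {
        /* the key survives table removal; it is released with the state */
        if (refcnt_rele(&st->key->refs)) {
            /* last reference: free the key here */
        }
        st->key = NULL;
    }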
2022-11-25revert pf.c r1.1152 again: move pf_purge out from under the kernel lockAlexander Bluhm
Using systqmp for pf_purge creates a deadlock between pf_purge() and ixgbe_stop() and possibly other drivers. On systqmp pf(4) needs netlock which the interface ioctl(2) is holding. ix(4) waits in sched_barrier() which is also scheduled on the systqmp task queue. Removing the netlock from pf_purge() as a quick fix caused other problems. backout suggested by deraadt@
2022-11-25Revert previous commit. It was not properly tested and produces splassertMark Kettenis
warnings. Rushing to pile more stuff on top of it isn't the answer. This needs a rethink. ok deraadt@
2022-11-25get rid of NET_LOCK in the pf purge workDavid Gwynne
pf purge was moved to systqmp (to get it away from KERNEL_LOCK) which is also used as the backend for things like intr_barrier and sched_barrier. it is common for network cards to call intr_barrier while holding NET_LOCK, and if pf is trying to get the NET_LOCK in the purge tasks that are now running in systqmp, it's a deadlock. bluhm@ hit this exact issue. sashan@ has been working to get rid of the need for NET_LOCK in pf, so now we can remove the NET_LOCKs here rather than create a pf specific taskq to run these tasks in. ok sashan@ bluhm@
2022-11-12Put pf_state_import() under NPFSYNC>0 to fix build without pfsyncKlemens Nanni
2022-11-11try pf.c r1.1143 again: move pf_purge out from under the kernel lockDavid Gwynne
this also avoids holding NET_LOCK too long. the main change is done by running the purge tasks in systqmp instead of systq. the pf state list was recently reworked so iteration over the states can be done without blocking insertions. however, scanning a lot of states can still take a lot of time, so this also makes the state list scanner yield if it has spent too much time running. the other purge tasks for source nodes, rules, and fragments have been moved to their own timeout/task pair to simplify the time accounting. in my environment, before this change pf purges often took 10 to 50ms. the softclock thread that runs next to it often took a similar amount of time, presumably because they ended up spinning waiting for each other. after this change the pf purges are more like 6 to 12ms, and don't block softclock. most of the variability in the runs now seems to come from contention on the net lock. tested by me sthen@ chris@ ok sashan@ kn@ claudio@ the diff was backed out because it made things a bit more racy, but sashan@ has squashed those races this week. let's try it again.
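A userland sketch of the yielding scanner (illustrative only; the real purge task walks the pf state list, remembers its position, and reschedules itself on the task queue):

    #include <stddef.h>
    #include <time.h>

    #define SCAN_BUDGET_NSEC    (2 * 1000 * 1000)   /* ~2ms per run */

    static size_t resume_at;    /* where the next run continues */

    void
    scan_some(size_t nitems)
    {
        struct timespec start, now;
        size_t i;

        clock_gettime(CLOCK_MONOTONIC, &start);
        for (i = resume_at; i < nitems; i++) {
            /* ... expire item i if its timeout has passed ... */
            clock_gettime(CLOCK_MONOTONIC, &now);
            if ((now.tv_sec - start.tv_sec) * 1000000000L +
                (now.tv_nsec - start.tv_nsec) > SCAN_BUDGET_NSEC) {
                i++;
                break;          /* out of budget: yield, resume later */
            }
        }
        resume_at = (i < nitems) ? i : 0;
    }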
2022-11-11add a mutex to struct pf_state and init it.David Gwynne
nothing is protected by it yet but it will allow us to provide consistent updates to individual states without relying on a global lock. getting that right between the packet processing in pf itself, pfsync, the pf purge code, the ioctl paths, etc is not worth the required contortions. while pf_state does grow, it doesn't use more cachelines on machines where we will want to run in parallel with a lot of states. stolen from and ok sashan@
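A sketch of the pattern with mutex(9) (field names are illustrative; in pf the mutex will later guard individual members of struct pf_state): initialize the per-object mutex once and take it around updates of the members it protects:

    #include <sys/types.h>
    #include <sys/mutex.h>

    struct demo_state {
        struct mutex    mtx;
        u_int32_t       expire;     /* example member guarded by mtx */
    };

    void
    demo_state_init(struct demo_state *st)
    {
        mtx_init(&st->mtx, IPL_SOFTNET);
    }

    void
    demo_state_touch(struct demo_state *st, u_int32_t when)
    {
        mtx_enter(&st->mtx);
        st->expire = when;          /* consistent update without a global lock */
        mtx_leave(&st->mtx);
    }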
2022-11-11rename pfsync_up() to pfsync_is_up()David Gwynne
foo_up() where foo is a network driver is usually a function that configures and brings an interface up into a running state. this small tweak just makes the code a bit easier for me to read.
2022-11-11rewrite the pf_state_peer_ntoh and pf_state_peer_hton macros as functions.David Gwynne
i can read this code as functions, but it takes too much effort as macros.
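An illustration of the kind of conversion (the peer structure here is simplified; the real pf_state_peer carries more members such as seqdiff, max_win and wscale):

    #include <sys/types.h>
    #include <netinet/in.h>

    struct demo_peer {
        u_int32_t   seqlo;
        u_int32_t   seqhi;
    };

    /* old style: a multi-statement macro */
    #define DEMO_PEER_HTON(s, d)                        \
        do {                                            \
            (d)->seqlo = htonl((s)->seqlo);             \
            (d)->seqhi = htonl((s)->seqhi);             \
        } while (0)

    /* new style: the same conversion as a plain, type-checked function */
    static void
    demo_peer_hton(const struct demo_peer *s, struct demo_peer *d)
    {
        d->seqlo = htonl(s->seqlo);
        d->seqhi = htonl(s->seqhi);
    }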
2022-11-10revert pf_state mtx commit, because it breaks tree.Alexandr Nedvedicky
pfctl does not build OK dlg@
2022-11-10Add a mutex to the pf_state structure. The mutex retains consistencyAlexandr Nedvedicky
of structure members without using a global state lock. The first member which uses protection by the mutex is the key[] array. More will follow. OK dlg@
2022-11-09simplify expiration of 'once' rules.Alexandr Nedvedicky
let the packet mark the 'once' rule as expired. The rule will be removed by pfctl(8) when rules are updated. OK kn@
2022-11-08This diff fixes panic tripped by KASSERT(st->sync_state == PFSYNC_S_NONE)Alexandr Nedvedicky
found in pfsync_insert_state(). It is caused by two packets which happen to belong to the same session. Think of a UDP stream or two TCP SYN packets transmitted almost simultaneously. The first such packet wins the state lock and inserts the state into the table. The second packet waits for the state lock as a reader. As soon as the first packet is done with state creation it drops the lock and is going to send an S_INS message to its peer via pfsync. The second packet meanwhile obtains the state lock as a reader. It finds the state created by the first packet. Later the second packet also finds out the state needs to be updated, because sync_state is still set to PFSYNC_S_NONE. The second packet puts the state on the snapshot list, marking it as S_UPD. All this happens before the first packet has a chance to make progress. Think of the first packet losing the cpu after dropping the write lock. Once the first packet gets running again it trips the KASSERT() because sync_state is set to S_UPD. tested by hrvoje@ OK dlg@
2022-11-07revert "move pf_purge out from under the kernel lock".David Gwynne
hrvoje popovski showed me pfsync blowing up with this. im backing it out quickly in case something else at the hackathon makes it harder to do later. kn@ agrees
2022-11-07move pf_purge out from under the kernel lock and avoid the hogging cpuDavid Gwynne
this also avoids holding NET_LOCK too long. the main change is done by running the purge tasks in systqmp instead of systq. the pf state list was recently reworked so iteration over the states can be done without blocking insertions. however, scanning a lot of states can still take a lot of time, so this also makes the state list scanner yield if it has spent too much time running. the other purge tasks for source nodes, rules, and fragments have been moved to their own timeout/task pair to simplify the time accounting. in my environment, before this change pf purges often took 10 to 50ms. the softclock thread that runs next to it often took a similar amount of time, presumably because they ended up spinning waiting for each other. after this change the pf purges are more like 6 to 12ms, and don't block softclock. most of the variability in the runs now seems to come from contention on the net lock. tested by me sthen@ chris@ ok sashan@ kn@ claudio@
2022-11-06move pfsync_state_import in if_pfsync.c to pf_state_import in pf.cDavid Gwynne
this is straightening the deck chairs. the state import and export code is used by both the pf ioctls and pfsync, but the export code is in pf.c and the import code is in if_pfsync.c. if pfsync was disabled then the ioctl stuff wouldn't link. moving the import code to pf.c makes it more symmetrical(?) and robust. tweaks and ok from kn@ sashan@
2022-10-10Recalculate checksum of normalised packetBjorn Ketelaars
In 2011, henning@ removed fiddling with the ip checksum of normalised packets in r1.131 of sys/net/pf_norm.c. The rationale was that the checksum is always recalculated in all output paths anyway. In 2016, procter@ reintroduced checksum modification to preserve end-to-end checksums in r1.189 of sys/net/pf_norm.c. Likely somewhere in that timeslot checksum recalculation of normalised packets was broken. With input from bluhm@. OK sashan@, bluhm@
2022-09-03Use a mutex to update tcp_maxidle, tcp_iss, and tcp_now. ThisAlexander Bluhm
removes pressure from the exclusive netlock in tcp_slowtimo(). Reading is done atomically. Ensure that the tcp_now value is read only once per function to provide consistent time. OK yasuoka@
2022-09-03When divert-reply is used, keep some pf states after pcb is dropped ifYASUOKA Masahiko
its local address is translated, to prevent its source port from being reused. regress test by bluhm. ok bluhm
2022-08-30Refactor internet PCB lookup function. Rename in_pcbhashlookup()Alexander Bluhm
so the public API is in_pcblookup() and in_pcblookup_listen(). For internal use introduce in_pcbhash_insert() and in_pcbhash_lookup() to avoid code duplication. Routing domain is unsigned, change the type to u_int. OK mvs@
2022-08-08To make protocol input functions MP safe, internet PCB need protection.Alexander Bluhm
Use their reference counter in more places. The in_pcb lookup functions hold the PCBs in hash tables protected by the table->inpt_mtx mutex. Whenever a result is returned, increment the ref count before releasing the mutex. Then the inp can be used as long as necessary. Unref it at the end of all functions that call in_pcb lookup. As a shortcut, pf may also hold a reference to the PCB. When pf_inp_lookup() returns it, it also increments the ref count and the caller can handle it like the inp from the table lookup. OK sashan@
2022-07-20Add a pool for the allocation of the pf_anchor struct.Moritz Buhl
It was possible to exhaust kernel memory by repeatedly calling pfioctl DIOCXBEGIN with different anchor names. OK bluhm@ Reported-by: syzbot+9dd98cbce69e26f0fc11@syzkaller.appspotmail.com
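A sketch of pool(9) usage for a fixed-size structure (names, sizes and flags are illustrative; the real pool is sized for struct pf_anchor and its wait behaviour fits the ioctl path):

    #include <sys/types.h>
    #include <sys/pool.h>

    struct demo_anchor {
        char    path[1024];
        /* ... */
    };

    struct pool demo_anchor_pl;

    void
    demo_anchor_pool_init(void)
    {
        pool_init(&demo_anchor_pl, sizeof(struct demo_anchor), 0,
            IPL_SOFTNET, 0, "dmanchor", NULL);
    }

    struct demo_anchor *
    demo_anchor_alloc(void)
    {
        /* PR_NOWAIT makes memory exhaustion visible as a NULL return */
        return (pool_get(&demo_anchor_pl, PR_NOWAIT | PR_ZERO));
    }

    void
    demo_anchor_free(struct demo_anchor *a)
    {
        pool_put(&demo_anchor_pl, a);
    }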
2022-06-28fix syncookies in conjunction with tcp fast port reuse.Henning Brauer
This really pointed out that the place syncookies were hooked in was almost, but not completely, right. The way it was, the special case for tcp fast port reuse in pf_test_state wasn't hit, because the first packet hitting that was the ACK from the peer finishing the 3WHS, and the reconstructed SYN came after. We're now doing pf_find_state (and *only* that) first, then syncookies, then going on, so that the old state is thrown away properly and we get a new one with the sequence number modulator set up correctly. Bonus: -11 lines of code. tracked down (that took a while) + fixed under contract with Hush Communications Canada; special thanks to Lyndon ok sashan
2022-06-26Allow waiting during ktable allocation in pf_ioctl.mbuhl
OK bluhm Reported-by: syzbot+50ea4f33ed5dd9264918@syzkaller.appspotmail.com Reported-by: syzbot+df65f8b7ee8c0089e885@syzkaller.appspotmail.com
2022-06-13fix logic bug in pf_find_state()Henning Brauer
a state in PFTM_PURGE could potentially hide another state on the same state key that is active, and we'd incorrectly block the packet. I believe that cannot happen as things are now. ok sashan
2022-05-23In pf the kernel panicked if IP options in a packet within the ICMP payloadAlexander Bluhm
were truncated. Drop such packets instead. Reported-by: syzbot+91abd3aa2fdfe900f9ce@syzkaller.appspotmail.com OK sashan@ claudio@
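A sketch of the defensive check (simplified; the real code parses the quoted header inside the ICMP payload): verify that the data actually present covers the advertised IP header length before walking its options, and drop the packet otherwise:

    #include <sys/types.h>
    #include <netinet/in.h>
    #include <netinet/ip.h>

    /*
     * 'inner' is the IP header quoted in an ICMP error, 'avail' the
     * number of bytes of it that really arrived (>= sizeof(struct ip)).
     * Return nonzero if the packet must be dropped.
     */
    static int
    inner_ip_truncated(const struct ip *inner, size_t avail)
    {
        size_t hlen = inner->ip_hl << 2;

        if (hlen < sizeof(struct ip))
            return (1);     /* bogus header length */
        if (avail < hlen)
            return (1);     /* options advertised but truncated */
        return (0);
    }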
2022-05-23Fix white space.Alexander Bluhm