If the driver of a network interface claims to support TSO, do not
chop the packet in software, but pass it down to the interface
layer.
Precalculate parts of the pseudo header checksum, but without the
packet length, as the length of the generated smaller packets is
not known yet. Driver and hardware will use the mbuf packet header
field ph_mss to calculate it and update the checksum.
Introduce separate flags IFCAP_TSOv4 and IFCAP_TSOv6 as hardware
might support only one protocol family. The old flag IFXF_TSO is
only relevant for large receive offload. It is misnamed, but keep
that for now.
Note that drivers do not set TSO capabilities yet. Also the ifconfig
flags and pseudo-interface capabilities will be done separately.
So this commit should not change behavior.
heavily based on the work from jan@; OK sashan@
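A minimal sketch of the decision this enables; ISSET(), IFCAP_TSOv4,
IFCAP_TSOv6 and ph_mss are the real kernel names, the helper itself
is hypothetical:

#include <sys/param.h>
#include <sys/socket.h>
#include <sys/mbuf.h>
#include <net/if.h>
#include <net/if_var.h>

/*
 * Hypothetical helper: if the interface advertises TSO for the
 * packet's address family, record the MSS in the packet header so
 * driver and hardware can chop and fix the checksum; otherwise the
 * caller must chop in software.
 */
static int
tso_offload_ok(struct ifnet *ifp, struct mbuf *m, sa_family_t af,
    uint16_t mss)
{
	if ((af == AF_INET && ISSET(ifp->if_capabilities, IFCAP_TSOv4)) ||
	    (af == AF_INET6 && ISSET(ifp->if_capabilities, IFCAP_TSOv6))) {
		m->m_pkthdr.ph_mss = mss;
		return (1);
	}
	return (0);	/* fall back to software chopping */
}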
|
|
introduce in_hdr_cksum_out(). It is used like in_proto_cksum_out().
OK claudio@
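A sketch of what the helper plausibly does, mirroring the
in_proto_cksum_out() calling convention; the body is an assumption,
not the committed code:

#include <sys/param.h>
#include <sys/mbuf.h>
#include <net/if.h>
#include <net/if_var.h>
#include <netinet/in.h>
#include <netinet/ip.h>

/* assumed behaviour: clear ip_sum, compute it in software unless
 * the outgoing interface checksums IPv4 headers in hardware */
static void
hdr_cksum_out_sketch(struct mbuf *m, struct ifnet *ifp)
{
	struct ip *ip = mtod(m, struct ip *);

	ip->ip_sum = 0;
	if (ifp == NULL || !ISSET(ifp->if_capabilities, IFCAP_CSUM_IPv4))
		ip->ip_sum = in_cksum(m, ip->ip_hl << 2);
}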
|
|
meant as a fallback if network hardware does not support TSO. Driver
support is still work in progress. TCP output generates large
packets. In IP output the packet is chopped to TCP maximum segment
size. This reduces the CPU cycles used by pf. The regular output
could be assisted by hardware later, but pf route-to and IPsec need
the software fallback in general.
For performance comparison or to workaround possible bugs, sysctl
net.inet.tcp.tso=0 disables the feature. netstat -s -p tcp shows
TSO counters for chopped and generated packets.
based on work from jan@
tested by jmc@ jan@ Hrvoje Popovski
OK jan@ claudio@
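Roughly, the fallback decision looks like this sketch; tcp_do_tso
stands in for the sysctl-backed knob and tcp_chop_to_mss() is a
hypothetical chopper:

#include <sys/mbuf.h>

extern int tcp_do_tso;	/* backs net.inet.tcp.tso; name assumed */
int tcp_chop_to_mss(struct mbuf_list *, struct mbuf *, u_int); /* hypothetical */

/* small packets pass through; large TSO packets get chopped */
static int
tcp_maybe_chop(struct mbuf_list *ml, struct mbuf *m, u_int mss)
{
	if (m->m_pkthdr.len <= mss) {
		ml_enqueue(ml, m);
		return (0);
	}
	return (tcp_chop_to_mss(ml, m, mss));
}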
|
|
is passed to ifp->if_output(). The fragment code has its own
checksum calculation and the other paths end in goto bad.
OK claudio@
|
|
if_output_ml() to send mbuf lists to interfaces. This can be used
for TSO, fragments, ARP and ND6. Rename variable fml to ml. In
pf_route6() split the if else block. Put the safety check (hlen +
firstlen < tlen) into ip_fragment(). It makes the code correct in
case the packet is too short to be fragmented. This should not
happen, but other functions also have this logic.
No functional change. OK sashan@
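A sketch of such a helper; ml_dequeue() and ml_purge() are the real
mbuf list primitives, the exact signature of if_output_ml() may
differ:

#include <sys/socket.h>
#include <sys/mbuf.h>
#include <net/if.h>
#include <net/if_var.h>
#include <net/route.h>

static int
output_ml_sketch(struct ifnet *ifp, struct mbuf_list *ml,
    struct sockaddr *dst, struct rtentry *rt)
{
	struct mbuf *m;
	int error = 0;

	while ((m = ml_dequeue(ml)) != NULL) {
		error = ifp->if_output(ifp, m, dst, rt);
		if (error)
			break;
	}
	if (error)
		ml_purge(ml);	/* free whatever could not be sent */

	return (error);
}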
|
|
Both walk the list of rulesets, aka anchors, to yield a total count
and a specific anchor name, respectively. Same access, different copy out.
pf_anchor_global are contained within pf_ioctl.c and pf_ruleset.c and
fully protected by the pf lock, as is pf_main_ruleset and its pf.c usage.
Rely on and assert for the pf lock alone. 'pfctl -sr' on 60k unique
rules gets noticeably faster, around 2.1s instead of 3.5s.
OK sashan
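The shared walk is roughly this sketch; PF_ASSERT_LOCKED(),
pf_anchor_global and pf_anchors are the real names, the counting
body is illustrative:

#include <net/pfvar.h>
#include <net/pfvar_priv.h>

static u_int32_t
anchor_count_sketch(void)
{
	struct pf_anchor *anchor;
	u_int32_t n = 0;

	PF_ASSERT_LOCKED();	/* the pf lock alone protects the tree */
	RB_FOREACH(anchor, pf_anchor_global, &pf_anchors)
		n++;
	return (n);
}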
|
|
in either direction.
This more closely matches the IPv4 ARP behaviour.
From sashan@
discussed with kn@ deraadt@
|
|
such a value would have triggered a KASSERT()
ok sashan@ deraadt@
|
|
Issue found and kindly reported by Luca Di Gregorio <lucdig _at_ gmail>
OK bluhm@
|
|
previous behavior that stops when any rule matches within quick
anchors.
ok sasha kn
|
|
listen port is not bound to port 0. With a matching pf divert-to
rule this assumption is no longer true and could crash the kernel
with a KASSERT(). In both pf and the stack, drop TCP packets with
destination port 0 before they can do harm.
OK sashan@ claudio@
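The check amounts to something like this sketch in pf's packet
descriptor handling; the placement and the helper are assumptions:

#include <netinet/in.h>
#include <net/pfvar.h>

/* drop TCP packets with port 0 in either direction */
static int
drop_port0_sketch(struct pf_pdesc *pd)
{
	if (pd->proto == IPPROTO_TCP &&
	    (pd->hdr.tcp.th_sport == 0 || pd->hdr.tcp.th_dport == 0))
		return (PF_DROP);
	return (PF_PASS);
}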
|
|
On amd64 the stack overflows for an anchor rule with depth ~30. The
tricky thing is that the 'safe' depth varies depending on the kind
of packet processed by pf_match_rule(). For example, for a local
outbound TCP packet the stack overflows when the recursion in
pf_match_rule() reaches depth 24.
Instead of lowering PF_ANCHOR_STACK_MAX to 20 and hoping it will
be enough on all platforms and for all packets, I'd like to stop
calling pf_match_rule() recursively. This commit brings back the
pf_anchor_stackframe array we used to have back in 2017. It also
revives patrick@'s idea to pre-allocate stack frame arrays from
per-cpu memory.
OK kn@
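The iterative shape, sketched for simple named anchors only; the
rule evaluation and the per-cpu allocation are elided, and the frame
layout is an assumption based on the commit text:

#include <sys/queue.h>
#include <net/pfvar.h>

#define STACK_MAX	64	/* stands in for PF_ANCHOR_STACK_MAX */

struct stackframe_sketch {
	struct pf_ruleset	*sf_rs;	/* ruleset being walked */
	struct pf_rule		*sf_r;	/* anchor rule we descended at */
};

static void
match_rules_iter_sketch(void)
{
	struct stackframe_sketch stack[STACK_MAX];
	struct pf_ruleset *rs = &pf_main_ruleset;
	struct pf_rule *r = TAILQ_FIRST(rs->rules.active.ptr);
	int depth = 0;

	while (r != NULL || depth > 0) {
		if (r == NULL) {
			/* ruleset exhausted: pop, resume after the anchor */
			depth--;
			rs = stack[depth].sf_rs;
			r = TAILQ_NEXT(stack[depth].sf_r, entries);
			continue;
		}
		if (r->anchor != NULL && depth < STACK_MAX) {
			/* push the current position, descend into anchor */
			stack[depth].sf_rs = rs;
			stack[depth].sf_r = r;
			depth++;
			rs = &r->anchor->ruleset;
			r = TAILQ_FIRST(rs->rules.active.ptr);
			continue;
		}
		/* evaluate r against the packet here (elided) */
		r = TAILQ_NEXT(r, entries);
	}
}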
|
|
pf_state ** are generally called "stp" now too.
discussed with and ok sashan@
|
|
the pf_state_tree_id type is private to the kernel.
while here, move it from being an RB tree to an RBT tree. this saves
about 12k in pf.o on amd64.
ok sashan@
|
|
the pf_state_tree types are kernel private, and are not used by
userland. make build agrees with me.
while here, move the pf_state_tree from the RB macros to the RBT
functions. this shaves about 13k off pf.o on amd64.
ok sashan@
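The difference in a nutshell: the RB macros expand a full red-black
implementation per tree type, while RBT generates thin wrappers
around one shared implementation. A toy declaration with the RBT
API:

#include <sys/tree.h>

struct node {
	RBT_ENTRY(node)	entry;
	int		key;
};

static inline int
node_cmp(const struct node *a, const struct node *b)
{
	return (a->key < b->key ? -1 : a->key > b->key);
}

RBT_HEAD(node_tree, node);
RBT_PROTOTYPE(node_tree, node, entry, node_cmp);
RBT_GENERATE(node_tree, node, entry, node_cmp);
/* usage: RBT_INSERT(node_tree, &head, elm), RBT_FIND(node_tree, &head, &key_elm) */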
|
|
before this it would use the pf state id, which is just an increasing
number. the toeplitz hash is generated/used by the rest of the
stack, so this encourages consistent flow of traffic through the
system.
|
|
New warning -Warray-parameter is a bit overzealous.
ok millert@ tb@
|
|
the hash generated when setting up the pf pdesc struct uses outer
addresses, while the addresses used in the state table go through
pf_state_key_addr_setup(), which does interesting things with some
ipv6 icmp values. state lookups used pf_state_key_addr_setup(), but
pf_state_key_setup() copied the pdesc value, causing an inconsistency.
pf_state_key_setup() now calls pf_state_key_addr_setup().
found by anton@
tested by anton@ florian@
|
|
anton@ says the previous commit breaks ipv6 related regress tests.
disabling the use of the hash in the state key compare gets it going
again until i can figure out what's going on.
|
|
the hash will be used to partition work in pf and pfsync in the
future, and right now it is used as the first comparison in the rb
tree state lookup.
using stoeplitz means that pf will hash traffic the same way that
hardware using a stoeplitz key will hash incoming traffic on rings.
stoeplitz is also used by the tcp stack to generate a flow id, which
is used to pick which transmit ring is used on nics with multiple
queues too. using the same algorithm throughout the stack encourages
affinity of packets to rings and softnet threads the whole way
through.
using the hash as the first comparison in the state rb tree comparison
should encourage faster traversal of the state tree by having all
the address/port bits summarised into the single hash value. however,
tests by hrvoje popovski don't show performance changing. on the
plus side, if this change is free from a performance point of view
then it makes the future steps more straightforward.
discussed at length at h2k22
tested by sashan@ and hrvoje popovski
ok tb@ sashan@ claudio@ jmatthew@
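The hash-first comparison sketched below shows the intent; the field
names are assumptions based on this series:

#include <sys/systm.h>
#include <net/pfvar.h>
#include <net/pfvar_priv.h>

static inline int
state_key_cmp_sketch(const struct pf_state_key *a,
    const struct pf_state_key *b)
{
	/* most comparisons are decided by this one integer */
	if (a->sk_hash < b->sk_hash)
		return (-1);
	if (a->sk_hash > b->sk_hash)
		return (1);
	/* equal hashes: fall back to the full address comparison */
	return (memcmp(a->sk_addr, b->sk_addr, sizeof(a->sk_addr)));
}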
|
|
this makes searching for the struct members easier, which in turn
makes tweaking code around them a lot easier too. sk_refcnt in
particular would have been a lot nicer to fiddle with than just
refcnt because pf_state structs also have a refcnt, which is annoying.
tweaks and ok sashan@
reads ok kn@
|
|
pf_state structures don't contain ip addresses, protocols, ports,
etc. that information is stored in a pf_state_key struct, which is
used to wire a state into the state table. when things like pfsync
or the pf state ioctls want to export information about a state,
particularly the addresses on it, they need the pf_state_key struct
to read from.
before this diff the code assumed that when a state was removed
from the state tables it could throw the pf_state_key structs away
as part of that removal. this diff changes it so that once
pf_state_insert() succeeds, a pf_state will keep its references to the pf_state_key
structs until the pf_state struct itself is being destroyed.
this allows anything that holds a reference to a pf_state to also
look at the pf_state_key structs because they're now effectively
an immutable part of the pf_state struct.
this is by far the simplest and most straightforward fix for pfsync
crashing on pf_state_key dereferences we've come up with so far.
it has been made possible by the addition of reference counts to
pf_state and pf_state_key structs, which allows us to properly
account for this adjusted lifecycle for pf_state_keys on pf_state
structs.
sashan@ and i have been kicking this diff around for a couple of
weeks now.
ok sashan@ jmatthew@
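The adjusted lifecycle leans on the kernel refcnt API; a sketch of
taking and releasing a key reference, where refcnt_take() and
refcnt_rele() are real and the field and pool names follow this
series:

#include <sys/pool.h>
#include <sys/refcnt.h>
#include <net/pfvar.h>
#include <net/pfvar_priv.h>

extern struct pool pf_state_key_pl;	/* lives in pf.c */

static void
state_key_ref_sketch(struct pf_state_key *sk)
{
	refcnt_take(&sk->sk_refcnt);	/* taken once pf_state_insert succeeds */
}

static void
state_key_unref_sketch(struct pf_state_key *sk)
{
	/* dropped only when the pf_state itself is destroyed */
	if (refcnt_rele(&sk->sk_refcnt))
		pool_put(&pf_state_key_pl, sk);	/* last reference */
}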
|
|
Using systqmp for pf_purge creates a deadlock between pf_purge()
and ixgbe_stop(), and possibly other drivers. On systqmp, pf(4)
needs the netlock, which the interface ioctl(2) is holding. ix(4)
waits in sched_barrier(), which is also scheduled on the systqmp
task queue.
Removing the netlock from pf_purge() as a quick fix caused other
problems.
backout suggested by deraadt@
|
|
warnings. Rushing to pile more stuff on top of it isn't the answer. This
needs a rethink.
ok deraadt@
|
|
pf purge was moved to systqmp (to get it away from KERNEL_LOCK)
which is also used as the backend for things like intr_barrier and
sched_barrier. it is common for network cards to call intr_barrier
while holding NET_LOCK, and if pf is trying to get the NET_LOCK in
the purge tasks that are now running in systqmp, it's a deadlock.
bluhm@ hit this exact issue.
sashan@ has been working to get rid of the need for NET_LOCK in pf,
so now we can remove the NET_LOCKs here rather than create a pf
specific taskq to run these tasks in.
ok sashan@ bluhm@
|
|
this also avoids holding NET_LOCK too long.
the main change is done by running the purge tasks in systqmp instead
of systq. the pf state list was recently reworked so iteration over
the state can be done without blocking insertions.
however, scanning a lot of states can still take a lot of time, so
this also makes the state list scanner yield if it has spent too
much time running.
the other purge tasks for source nodes, rules, and fragments have
been moved to their own timeout/task pair to simplify the time
accounting.
in my environment, before this change pf purges often took 10 to
50ms. the softclock thread that runs next to it often took a similar
amount of time, presumably because they ended up spinning waiting
for each other. after this change the pf purges are more like 6 to
12ms, and don't block softclock. most of the variability in the runs
now seems to come from contention on the net lock.
tested by me sthen@ chris@
ok sashan@ kn@ claudio@
the diff was backed out because it made things a bit more racy,
but sashan@ has squashed those races this week. let's try it again.
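The yielding scan amounts to a pattern like this sketch; getuptime(),
task_add() and systqmp are real, the slice function and the
one-second budget are assumptions:

#include <sys/systm.h>
#include <sys/time.h>
#include <sys/task.h>

extern struct task purge_task_sketch;	/* hypothetical task handle */
int purge_slice_sketch(int);	/* hypothetical: 1 while work remains */

static void
purge_states_yielding_sketch(void *arg)
{
	time_t start = getuptime();

	(void)arg;
	/* scan in slices; when we run too long, reschedule ourselves */
	while (purge_slice_sketch(1024)) {
		if (getuptime() - start >= 1) {
			task_add(systqmp, &purge_task_sketch);
			return;
		}
	}
}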
|
|
nothing is protected by it yet but it will allow us to provide
consistent updates to individual states without relying on a global
lock. getting that right between the packet processing in pf itself,
pfsync, the pf purge code, the ioctl paths, etc. is not worth the
required contortions.
while pf_state does grow, it doesn't use more cachelines on machines
where we will want to run in parallel with a lot of states.
stolen from and ok sashan@
|
|
foo_up() where foo is a network driver is usually a function that
configures and brings an interface up into a running state. this
small tweak just makes the code a bit easier for me to read.
|
|
i can read this code as functions, but it takes too much effort as macros.
|
|
pfctl does not build
OK dlg@
|
|
of structure members without using a global state lock.
The first member protected by the mutex is the key[] array.
More will follow.
OK dlg@
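A sketch of a mutex-protected accessor for the key[] array;
mtx_enter()/mtx_leave() and PF_SK_WIRE/PF_SK_STACK are real names,
the mutex field follows the commit text:

#include <sys/mutex.h>
#include <net/pfvar.h>
#include <net/pfvar_priv.h>

static struct pf_state_key *
state_key_get_sketch(struct pf_state *st, int idx)
{
	struct pf_state_key *sk;

	mtx_enter(&st->mtx);	/* per-state mutex, no global state lock */
	sk = st->key[idx];	/* idx is PF_SK_WIRE or PF_SK_STACK */
	mtx_leave(&st->mtx);

	return (sk);
}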
|
|
let the packet mark the 'once' rule as expired. The rule
will be removed by pfctl(8) when the rules are updated.
OK kn@
|
|
found in pfsync_insert_state(). It is caused by two packets which happen
to belong to the same session. Think of a UDP stream or two TCP SYN
packets transmitted almost simultaneously. The first such packet
wins the state lock and inserts the state into the table. The second
packet waits for the state lock as a reader. As soon as the first
packet is done with state creation it drops the lock and is about
to send an S_INS message to its peer via pfsync. The second packet
meanwhile obtains the state lock as a reader. It finds the state
created by the first packet. Later the second packet also finds out
the state needs to be updated, because sync_state is still set to
PFSYNC_S_NONE. The second packet puts the state on the snapshot
list, marking it as S_UPD. All this happens before the first packet
has a chance to make progress; think of the first packet losing the
cpu after dropping the write lock. Once the first packet gets running
again it trips the KASSERT() because sync_state is set to S_UPD.
tested by hrvoje@
OK dlg@
|
|
hrvoje popovski showed me pfsync blowing up with this. i'm backing
it out quickly in case something else at the hackathon makes it
harder to do later.
kn@ agrees
|
|
this also avoids holding NET_LOCK too long.
the main change is done by running the purge tasks in systqmp instead
of systq. the pf state list was recently reworked so iteration over
the state can be done without blocking insertions.
however, scanning a lot of states can still take a lot of time, so
this also makes the state list scanner yield if it has spent too
much time running.
the other purge tasks for source nodes, rules, and fragments have
been moved to their own timeout/task pair to simplify the time
accounting.
in my environment, before this change pf purges often took 10 to
50ms. the softclock thread that runs next to it often took a similar
amount of time, presumably because they ended up spinning waiting
for each other. after this change the pf purges are more like 6 to
12ms, and don't block softclock. most of the variability in the runs
now seems to come from contention on the net lock.
tested by me sthen@ chris@
ok sashan@ kn@ claudio@
|
|
this is straightening the deck chairs. the state import and export
code are used by both the pf ioctls and pfsync, but the export code
is in pf.c and the import code is in if_pfsync. if pfsync was
disabled then the ioctl stuff wouldn't link.
moving the import code to pf.c makes it more symmetrical(?) and
robust.
tweaks and ok from kn@ sashan@
|
|
In 2011, henning@ removed fiddling with the ip checksum of normalised
packets in r1.131 of sys/net/pf_norm.c. Rationale was that the checksum
is always recalculated in all output paths anyway. In 2016, procter@
reintroduced checksum modification to preserve end-to-end checksums in
r1.189 of sys/net/pf_norm.c. Likely somewhere in that time span,
checksum recalculation of normalised packets was broken.
With input from bluhm@.
OK sashan@, bluhm@
|
|
removes pressure from the exclusive netlock in tcp_slowtimo().
Reading is done atomically. Ensure that the tcp_now value is read
only once per function to provide consistent time.
OK yasuoka@
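The read-once discipline looks like this sketch; READ_ONCE() is the
real macro, treating tcp_now as a plain shared counter is a
simplification:

#include <sys/atomic.h>
#include <netinet/tcp_var.h>

extern uint32_t tcp_now;	/* type and linkage assumed for the sketch */

static void
timestamp_once_sketch(struct tcpcb *tp)
{
	uint32_t now = READ_ONCE(tcp_now);	/* read exactly once */

	/* every use in this function sees the same time value */
	tp->ts_recent_age = now;
	tp->t_rcvtime = now;
}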
|
|
its local address is translated, to prevent its source port from being
reused. regress test by bluhm.
ok bluhm
|
|
so the public API is in_pcblookup() and in_pcblookup_listen(). For
internal use introduce in_pcbhash_insert() and in_pcbhash_lookup()
to avoid code duplication. Routing domain is unsigned, change the
type to u_int.
OK mvs@
|
|
Use their reference counter in more places.
The in_pcb lookup functions hold the PCBs in hash tables protected
by table->inpt_mtx mutex. Whenever a result is returned, increment
the ref count before releasing the mutex. Then the inp can be used
as long as necessary. Unref it at the end of all functions that
call in_pcb lookup.
As a shortcut, pf may also hold a reference to the PCB. When
pf_inp_lookup() returns it, it also increments the ref count and the
caller can handle it like the inp from table lookup.
OK sashan@
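The lookup-and-ref pattern, sketched; in_pcbref() and in_pcbunref()
are the functions named here, the hash lookup stand-in is
hypothetical:

#include <sys/mutex.h>
#include <netinet/in_pcb.h>

struct inpcb *hash_lookup_sketch(struct inpcbtable *);	/* hypothetical */

static void
use_pcb_sketch(struct inpcbtable *table)
{
	struct inpcb *inp;

	mtx_enter(&table->inpt_mtx);
	inp = hash_lookup_sketch(table);
	if (inp != NULL)
		in_pcbref(inp);	/* ref taken before the mutex is released */
	mtx_leave(&table->inpt_mtx);

	if (inp != NULL) {
		/* ... use inp as long as necessary ... */
		in_pcbunref(inp);
	}
}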
|
|
It was possible to exhaust kernel memory by repeatedly calling
pfioctl DIOCXBEGIN with different anchor names.
OK bluhm@
Reported-by: syzbot+9dd98cbce69e26f0fc11@syzkaller.appspotmail.com
|
|
This really pointed out that the place syncookies were hooked in was
almost, but not completely, right. The way it was, the special case
for tcp fast port reuse in pf_test_state() wasn't hit, because the
first packet hitting it was the ACK from the peer finishing the 3WHS,
and the reconstructed SYN came after. We're now doing pf_find_state()
(and *only* that) first, then syncookies, then going on, so that the
old state is thrown away properly and we get a new one with the
sequence number modulator set up correctly.
Bonus: -11 lines of code
tracked down (that took a while) + fixed under contract with Hush
Communications Canada; special thanks to Lyndon
ok sashan
|
|
OK bluhm
Reported-by: syzbot+50ea4f33ed5dd9264918@syzkaller.appspotmail.com
Reported-by: syzbot+df65f8b7ee8c0089e885@syzkaller.appspotmail.com
|
|
a state in PFTM_PURGE could potentially hide another state on the same state
key that is active and we'd incorrectly block the packet
I believe that cannot happen as things are now.
ok sashan
|
|
were truncated. Drop such packets instead.
Reported-by: syzbot+91abd3aa2fdfe900f9ce@syzkaller.appspotmail.com
OK sashan@ claudio@
|
|
|