Age | Commit message (Collapse) | Author |
|
When M_TCP_TSO is preserved, also keep ph_mss. In lo(4) this logic
was missing. This may be relevant only for weird pf configs that
forward from loopback.
OK mvs@ jan@
|
|
When doing LRO (Large Receive Offload), the drivers, currently ix(4)
and lo(4) only, record an upper bound of the size of the original
packets in ph_mss. When sending, either stack or hardware must
chop the packets with TSO (TCP Segmentation Offload) to that size.
That means we have to call tcp_if_output_tso() before ifp->if_output().
Put that logic into if_output_tso() to avoid code duplication. As
TCP packets on the wire do not get larger that way, path MTU discovery
should still work.
tested by and OK jan@
|
|
Replace hand-rolled reference counting with refcnt_init(9) and hook it up
with a new dt(4) probe.
OK mvs
Feedback OK bluhm
|
|
moving pf forward has been a real struggle, and pfsync has been a
constant source of pain. we have been papering over the problems
for a while now, but it reached the point that it needed a fundamental
restructure, which is what this diff is.
the big headliner changes in this diff are:
- pfsync specific locks
this is the whole reason for this diff.
rather than rely on NET_LOCK or KERNEL_LOCK or whatever, pfsync now
has it's own locks to protect it's internal data structures. this
is important because pfsync runs a bunch of timeouts and tasks to
push pfsync packets out on the wire, or when it's handling requests
generated by incoming pfsync packets, both of which happen outside
pf itself running. having pfsync specific locks around pfsync data
structures makes the mutations of these data structures a lot more
explicit and auditable.
- partitioning
to enable future parallelisation of the network stack, this rewrite
includes support for pfsync to partition states into different "slices".
these slices run independently, ie, the states collected by one slice
are serialised into a separate packet to the states collected and
serialised by another slice.
states are mapped to pfsync slices based on the pf state hash, which
is the same hash that the rest of the network stack and multiq
hardware uses.
- no more pfsync called from netisr
pfsync used to be called from netisr to try and bundle packets, but now
that there's multiple pfsync slices this doesnt make sense. instead it
uses tasks in softnet tqs.
- improved bulk transfer handling
there's shiny new state machines around both the bulk transmit and
receive handling. pfsync used to do horrible things to carp demotion
counters, but now it is very predictable and returns the counters back
where they started.
- better tdb handling
the tdb handling was pretty hairy, but hrvoje has kicked this around
a lot with ipsec and sasyncd and we've found and fixed a bunch of
issues as a result of that testing.
- mpsafe pf state purges
this was committed previously, but because the locks pfsync relied on
weren't clear this just caused a ton of bugs. as part of this diff it's
now reliable, and moves a big chunk of work out from under KERNEL_LOCK,
which in turn improves the responsiveness and throughput of a firewall
even if you're not using pfsync.
there's a bunch of other little changes along the way, but the above are
the big ones.
hrvoje has done performance testing with this diff and notes a big
improvement when pfsync is not in use. performance when pfsync is
enabled is about the same, but im hoping the slices means we can scale
along with pf as it improves.
lots (months) of testing by me and hrvoje on pfsync boxes
tests and ok sashan@
deraadt@ says this is a good time to put it in
|
|
pf_open_trans() can issue for each clone of /dev/pf
to 512. The pf_open_trans() is currently being used
by DIOCGETRULES ioctl(2). The limit avoids processes
to consume all kernel memory by asking DIOCGETRULES
for more tickets. If DIOCGETRULES hits the limit, then
the application will see EBUSY error.
This diff was fine tuned with feedback from cluadio@,
deraadt@ and kn@.
OK kn@
|
|
Pointed out by bluhm.
ok bluhm@
|
|
periodically read rules from pf(4) to consume all kernel
memory. The bug has been discovered and root caused by florian@.
In this particular case it was snmpd(8) what ate all kernel
memory.
This commit introduces DIOCXEND to pf(4) so applications such
as snmpd(8) and systat(1) to close ticket/transaction when
they are done with fetching the rules. This change also
updates snmpd(8) and systat(1) to use newly introduced
DIOCXEND ioctl(2).
OK claudio@, deraadt@, kn@
|
|
ok sashan@
|
|
|
|
If tcplro is activated on lo(4), ignore the MTU with TCP packets.
They are passed along with the information that they have to be
chopped in case they are forwarded later. New netstat(1) counter
shows that software LRO is in effect. The feature is currently
turned off by default.
tested by jan@; OK claudio@ jan@
|
|
M_TEMP and M_IFADDR types are unreasonable for that purpose. This
dedicated statistics simplify the future pf(4) unlocking work by
decreasing search area of possible memory leaks.
ok bluhm sashan
|
|
OK bluhm@
|
|
DIOCGETRULES."
regress/sbin/pfctl panics with "rw_enter: pfioctl_rw locking against myself"
as reported by bluhm on bugs@.
|
|
Replace hand-rolled reference counting with refcnt_init(9) and hook it up
with a new dt(4) probe.
OK bluhm mvs
|
|
for interface groups data allocations.
ok kn claudio bluhm
|
|
snmpd(8) and 'pfctl -s Interfaces' dump pf's internal list of interfaces.
pf's internal interface list is completely protected by the pf lock,
pf lock assertions since pf_if.c r1.110 from over a week ago support this.
pfi_*() iterate over net lock protected if_groups lists, but only to read,
so downgrade from exclusive write net lock to a shared read-only one.
Feedback mvs
OK sashan
|
|
pf.conf's 'set skip on ifN' and 'pfctl -F all|Reset' set and clear flags,
PFI_IFLAG_SKIP being the only flag. Nothing else in base uses these ioctls
and internal state is protected by the pf lock already.
OK sashan
|
|
|
|
Processes like snmpd or systat open pf(4) once and then issue many
DIOCGETRULES calls over their runtime. This accumulates many pf_trans
structs over their lifetime. At some point the kernel runs out of
memory because of that. By closing all transactions before creating
a new one, long living processes do no longer leak transactions.
This probably needs further refinement once more transactions types are
added but for now this solves the problem.
Problem found by florian@
OK sashan@ kn@
|
|
disconnected from everywhere. No need to hold netlock for dummy
'nd_ifinfo' release. Netlock is also not needed for
TAILQ_EMPTY(&ifp->if_*hooks) assertions.
ok kn bluhm
|
|
Packets sent over loopback got their checksums calculated twice.
In the output path they were filled in and during TCP/IP input all
checksums were calculated again to be compared with the previous
result.
Avoid this by claiming that lo(4) supports hardware checksum
offloading. For each packet convert the flag that the checksum
should be calculated to the flag that it has been checked successfully.
Keep the flag that it should be calculated for the case that it may
be bridged or forwarded later.
A drawback is that "tcpdump -ni lo0 -v" reports invalid checksum.
But that is the same with physical interfaces and hardware offloading.
OK dlg@
|
|
For example it should not be surprised if caller asks to remove
state from pfsync queue which has been removed already. That kind
of race is sorted out later when pfsync_update_state() calls to
pfsync_q_ins()/pfsync_q_del(). Change relaxes pfsync_update_state()
to panic on sync_state value which is unknown.
OK dlg@
|
|
is already removed.
OK dlg@
|
|
"wgdescr[iption] foo" to label one peer (amongst many) on a wg(4) interface,
"-wgdescr[iption]" or "wgdescr ''" to remove the label, completely analogous
to existing interface discriptions.
Idea/initial diff from Mikolaj Kucharski (OK sthen)
Tests/prodded by Hrvoje Popovski
Tweaks/manual bits from me
Feedback deraadt sthen mvs claudio
OK claudio
|
|
this waits once for something to end in all the net tqs.
ok claudio@
|
|
ok jmc@ guenther@ tb@
|
|
Grab the pf lock for pf_pool_limits[] in pfsync such that all access is
covered by the pf lock; document accordingly.
Hard memory pool limits don't need the net lock for protection, pool(9)s
have their own internal lock and the pf lock fully covers limit values.
(pf_pool_limits[] access in DIOCXCOMMIT remains under pf *and net* lock
until the rest in there gets pulled out of the net lock.)
OK sashan
|
|
Make sure that all hooks into pf's internal list of interfaces do happen
with the pf lock held, i.e. nothing relies on the net lock alone, so that
later unlocking can then rely on it.
Full i386 regress (thanks bluhm) and daily usage are fine
OK sashan
|
|
pfsync(4) queues. We also need to grab pf_state::mtx to put/remove
state instance safely from pfsync(4) queue. The issue has been
pointed out by bluhm@. Patch survived testing done by hrvoje@
OK dlg@
|
|
discussed with and ok tb@
|
|
This diff introduces separate capabilities for TCP offloading. We split this
into LRO (large receive offloading) and TSO (TCP segmentation offloading).
LRO can be turned on/off via tcprecvoffload option of ifconfig and is not
inherited to sub interfaces.
TSO is inherited by sub interfaces to signal this hardware offloading capability
to the network stack.
With tweaks from bluhm, claudio and dlg
ok bluhm, claudio
|
|
If the driver of a network interface claims to support TSO, do not
chop the packet in software, but pass it down to the interface
layer.
Precalculate parts of the pseudo header checksum, but without the
packet length. The length of all generated smaller packets is not
known yet. Driver and hardware will use the mbuf packet header
field ph_mss to calculate it and update checksum.
Introduce separate flags IFCAP_TSOv4 and IFCAP_TSOv6 as hardware
might support ony one protocol family. The old flag IFXF_TSO is
only relevant for large receive offload. It is missnamed, but keep
that for now.
Note that drivers do not set TSO capabilites yet. Also the ifconfig
flags and pseudo interfaces capabilities will be done separately.
So this commit should not change behavior.
heavily based on the work from jan@; OK sashan@
|
|
ie, you'll see softnet0, softnet1, etc in top/ps/etc now instead
of just softnet on these threads.
this is done by wrapping the taskq and name up in a softnet struct.
ok patrick@ bluhm@ mvs@ kn@ sashan@
|
|
introduce in_hdr_cksum_out(). It is used like in_proto_cksum_out().
OK claudio@
|
|
All pools are init'd after pfattach(), none is ever destroyed,
so struct pf_pool_limit's .pp always points to valid pools.
Drop a check for the impossible from twenty years ago.
OK sashan dlg
|
|
port number. This is typically indicated by 'wire key attach failed on...'
message when pf(4) debugging is enabled. The problem is caused by
glitch in pf_get_sport() which fails to discover conflict in advance.
In order to fix it we must also calculate toeplitz hash in
pf_get_sport() to initialize look up key properly.
the bug has been kindly reported by joosepm _von_ gmail _dot_ com
OK dlg@
|
|
meant as a fallback if network hardware does not support TSO. Driver
support is still work in progress. TCP output generates large
packets. In IP output the packet is chopped to TCP maximum segment
size. This reduces the CPU cycles used by pf. The regular output
could be assisted by hardware later, but pf route-to and IPsec needs
the software fallback in general.
For performance comparison or to workaround possible bugs, sysctl
net.inet.tcp.tso=0 disables the feature. netstat -s -p tcp shows
TSO counter with chopped and generated packets.
based on work from jan@
tested by jmc@ jan@ Hrvoje Popovski
OK jan@ claudio@
|
|
|
|
is passed to ifp->if_output(). The fragment code has its own
checksum calculation and the other paths end in goto bad.
OK claudio@
|
|
if_output_ml() to send mbuf lists to interfaces. This can be used
for TSO, fragments, ARP and ND6. Rename variable fml to ml. In
pf_route6() split the if else block. Put the safety check (hlen +
firstlen < tlen) into ip_fragment(). It makes the code correct in
case the packet is too short to be fragmented. This should not
happen, but other functions also have this logic.
No functional change. OK sashan@
|
|
pf_osfp.c contains all the locking for these three ioctls, everything is
protected by the pf lock; assert/document it and inline acess to the global
list to eliminate useless function variables.
OK bluhm sashan
|
|
Both walk the list of rulesets aka. anchors, to yield a total count and
specific anchor name, respectively. Same access, different copy out.
pf_anchor_global are contained within pf_ioctl.c and pf_ruleset.c and
fully protected by the pf lock, as is pf_main_ruleset and its pf.c usage.
Rely on and assert for pf lock alone. 'pfctl -sr' on 60k unique rules gets
noticably faster, around 2.1s instead of 3.5s.
OK sashan
|
|
Same logic and argument as for the parent *S ioctl unlocked in r1.400,
might as well have committed them together:
Both ticket and number of queues stem from the pf_queues_active list which
is effectively static to pf_ioctl.c and fully protected by the pf lock.
OK sashan
|
|
ok bluhm@
|
|
pointed and OK bluhm@
|
|
retrieve rules from kernel. The current implementation requires
like O((n^2)/2) operation to read the complete rule set, because
each DIOCGETRULE operation must iterate over previous n
rules to find (n + 1)-th rule to read.
To address the issue diff introduces a pf_trans structure to keep
pointer to next rule to read, thus reading process does not need
to iterate from beginning of rule set to reach the next rule.
All transactions opened by process get closed either when process
is done (reads all rules) or when /dev/pf device is closed.
the diff also comes with lots of improvements from dlg@ and kn@
OK dlg@, kn@
|
|
in either direction.
This more closely matches the IPv4 ARP behaviour.
From sashan@
discussed with kn@ deraadt@
|
|
Both ticket and number of queues stem from the pf_queues_active list which
is effectively static to pf_ioctl.c and fully protected by the pf lock.
OK sashan
|
|
Route timers and route labels protected by corresponding mutexes. `ifa'
uses references counting for protection. rt_mpls_clear() could be called
lockless because this is the last reference of `rt'.
ok bluhm@ kn@
|
|
'pfctl -s timeouts' values are only used inside of pf, entirely protected
by the pf lock through the ioctl interface; the net lock is useless.
Previous attempts to remove net lock usage showed that the pf lock cannot
yet entirely replace it, so start with small pieces like this one.
Contrary to IPv4/6 read-only ioctls, some pf ioctls without FWRITE flag do
modify internal pf state, which is not entirely obvious when approached
from the ioctl layer.
OK sashan dlg
|