Age | Commit message (Collapse) | Author |
|
Use atomic operations to read ip6_forwarding while processing packets
in the network stack.
To make clear where actually the router property is needed, use the
i_am_router variable based on ip6_forwarding. It already existed
in nd6_nbr. Move i_am_router setting up the call stack until all
users are independent.
The forwarding decisions in pf_test, pf_refragment6, ip6_input do
also not interfere.
Use a new array ipv6ctl_vars_unlocked to make transition of all the
integer sysctls easier. Adapt IPv4 to the new style.
OK mvs@
|
|
IPsec gateways set the forwarding sysctl to 2. While this worked
for IPv4 since a long time, adapt this feature for IPv6 now. Set
sysctl net.inet6.ip6.forwarding=2 to forward only packets that have
been processed by IPsec.
Set IPV6_FORWARDING_IPSEC in ip6_input() and pass the flag down to
the call stack. This provides consistent view on global variable
ip6_forwarding. In ip6_output() or ip6_forward() drop packets that
do not match the policy.
OK denis@
|
|
IPv4 uses IP_FORWARDING to pass down a consistent value of
net.inet.ip.forwarding down the stack. This is needed for unlocking
sysctl. Do the same for IPv6.
Read ip6_forwarding once in ip6_input_if() and pass down IPV6_FORWARDING
as flags to ip6_ours(), ip6_hbhchcheck(), ip6_forward(). Replace
the srcrt value with IPV6_REDIRECT flag for consistency with IPv4.
To have common syntax with IPv4, use ip6_forwarding == 0 checks
instead of !ip6_forwarding. This will also make it easier to
implement net.inet6.ip6.forwarding=2 for IPsec only forwarding
later.
In nd6_ns_input() and nd6_na_input() read ip6_forwarding once and
store it in i_am_router. The variable name has been chosen to avoid
confusion with is_router, which indicates router flag of the packet.
Reading of ip6_forwarding is done independently from ip6_input_if(),
consistency does not really matter. One is for ND router behavior
the other for forwarding. Again use the ip6_forwarding != 0 check,
so when ip6_forwarding IPsec only value 2 gets implemented, it will
behave like a router.
OK deraadt@ sashan@ florian@ claudio@
|
|
Framgent count and statistics are stored in struct pf_status. From
there pfctl(8) and systat(1) collect and show them. Note that pfctl
-s info needs the -v switch to show fragments. As fragment reassembly
has its own mutex, also grab this in pf ipctl(2) and sysctl(2) code.
input claudio@; OK henning@
|
|
pf_pull_hdr() allows to pass an action pointer parameter as output
value. This is never used, all callers pass a NULL argument. Remove
ACTION_SET() entirely.
The logic (fragoff >= len) in pf_pull_hdr() does not work since
revision 1.4. Before it was used to drop short TCP or UDP fragments
that contained only part of the header. Current code in pf_pull_hdr()
drops the packets anyway, so always set reason PFRES_FRAG.
OK kn@ sashan@
|
|
moving pf forward has been a real struggle, and pfsync has been a
constant source of pain. we have been papering over the problems
for a while now, but it reached the point that it needed a fundamental
restructure, which is what this diff is.
the big headliner changes in this diff are:
- pfsync specific locks
this is the whole reason for this diff.
rather than rely on NET_LOCK or KERNEL_LOCK or whatever, pfsync now
has it's own locks to protect it's internal data structures. this
is important because pfsync runs a bunch of timeouts and tasks to
push pfsync packets out on the wire, or when it's handling requests
generated by incoming pfsync packets, both of which happen outside
pf itself running. having pfsync specific locks around pfsync data
structures makes the mutations of these data structures a lot more
explicit and auditable.
- partitioning
to enable future parallelisation of the network stack, this rewrite
includes support for pfsync to partition states into different "slices".
these slices run independently, ie, the states collected by one slice
are serialised into a separate packet to the states collected and
serialised by another slice.
states are mapped to pfsync slices based on the pf state hash, which
is the same hash that the rest of the network stack and multiq
hardware uses.
- no more pfsync called from netisr
pfsync used to be called from netisr to try and bundle packets, but now
that there's multiple pfsync slices this doesnt make sense. instead it
uses tasks in softnet tqs.
- improved bulk transfer handling
there's shiny new state machines around both the bulk transmit and
receive handling. pfsync used to do horrible things to carp demotion
counters, but now it is very predictable and returns the counters back
where they started.
- better tdb handling
the tdb handling was pretty hairy, but hrvoje has kicked this around
a lot with ipsec and sasyncd and we've found and fixed a bunch of
issues as a result of that testing.
- mpsafe pf state purges
this was committed previously, but because the locks pfsync relied on
weren't clear this just caused a ton of bugs. as part of this diff it's
now reliable, and moves a big chunk of work out from under KERNEL_LOCK,
which in turn improves the responsiveness and throughput of a firewall
even if you're not using pfsync.
there's a bunch of other little changes along the way, but the above are
the big ones.
hrvoje has done performance testing with this diff and notes a big
improvement when pfsync is not in use. performance when pfsync is
enabled is about the same, but im hoping the slices means we can scale
along with pf as it improves.
lots (months) of testing by me and hrvoje on pfsync boxes
tests and ok sashan@
deraadt@ says this is a good time to put it in
|
|
if_output_ml() to send mbuf lists to interfaces. This can be used
for TSO, fragments, ARP and ND6. Rename variable fml to ml. In
pf_route6() split the if else block. Put the safety check (hlen +
firstlen < tlen) into ip_fragment(). It makes the code correct in
case the packet is too short to be fragmented. This should not
happen, but other functions also have this logic.
No functional change. OK sashan@
|
|
this is straightening the deck chairs. the state import and export
code are used by both the pf ioctls and pfsync, but the export code
is in pf.c and the import code is in if_pfsync. if pfsync was
disabled then the ioctl stuff wouldnt link.
moving the import code to pf.c makes it more symmetrical(?) and
robust.
tweaks and ok from kn@ sashan@
|
|
In 2011, henning@ removed fiddling with the ip checksum of normalised
packets in r1.131 of sys/net/pf_norm.c. Rationale was that the checksum
is always recalculated in all output paths anyway. In 2016, procter@
reintroduced checksum modification to preserve end-to-end checksums in
r1.189 of sys/net/pf_norm.c. Likely soomewhere in that timeslot checksum
recalculation of normalised packets was broken.
With input from bluhm@.
OK sashan@, bluhm@
|
|
for fragment entries was reached, pf_create_fragment() called
pf_flush_fragments() without lock. This could result in a crash.
Let PF_FRAG_LOCK() cover the whole pf_reassemble() function as
pf_nfrents++ was also missing the lock.
crash found and fix tested by Hrvoje Popovski; OK sashan@
|
|
ok gnezdo@ semarie@ mpi@
|
|
simplify the handling of the fragment list. Now the functions
ip_fragment() and ip6_fragment() always consume the mbuf. They
free the mbuf and mbuf list in case of an error and take care about
the counter. Adjust the code a bit to make v4 and v6 look similar.
Fixes a potential mbuf leak when pf_route6() called pf_refragment6()
and it failed. Now the mbuf is always freed by ip6_fragment().
OK dlg@ mvs@
|
|
reassembly, reinsert the fragment into the lookup table with correct
index.
Reported-by: syzbot+d043455a5346f726f1c4@syzkaller.appspotmail.com
OK claudio@
|
|
Silence from the network group
ok sashan@
|
|
time_second(9) and time_uptime(9) are widely used in the kernel to
quickly get the system UTC or system uptime as a time_t. However,
time_t is 64-bit everywhere, so it is not generally safe to use them
on 32-bit platforms: you have a split-read problem if your hardware
cannot perform atomic 64-bit reads.
This patch replaces time_second(9) with gettime(9), a safer successor
interface, throughout the kernel. Similarly, time_uptime(9) is replaced
with getuptime(9).
There is a performance cost on 32-bit platforms in exchange for
eliminating the split-read problem: instead of two register reads you
now have a lockless read loop to pull the values from the timehands.
This is really not *too* bad in the grand scheme of things, but
compared to what we were doing before it is several times slower.
There is no performance cost on 64-bit (__LP64__) platforms.
With input from visa@, dlg@, and tedu@.
Several bugs squashed by visa@.
ok kettenis@
|
|
passed by pf or cause a panic in pf.
fix from sashan@; OK bluhm@ claudio@
bug found by Corentin Bayet, Nicolas Collignon, Luca Moro at Synacktiv
|
|
OK bluhm@ kn@
|
|
put the algorithm into a new function m_calchdrlen(). Also set an
uninitialized m_len to 0 in NFS code.
OK claudio@
|
|
created. Add a new function m_removehdr() do convert packet header
mbufs within the chain to regular mbufs. Assert that the mbuf at
the beginning of the chain has a packet header.
found by Maxime Villard in NetBSD; from markus@; OK claudio@
|
|
a global limit of 1024 fragments, but it is fine grained to the
region of the packet. Smaller packets may have less fragments.
This costs another 16 bytes of memory per reassembly and devides
the worst case for searching by 8.
requestd by claudio@; OK sashan@ claudio@
|
|
Remember 16 entry points based on the fragment offset. Instead of
a worst case of 8196 list traversals we now check a maximum of 512
list entries or 16 array elements.
discussed with claudio@ and sashan@; OK sashan@
|
|
|
|
pf(4) reassembly is complete. Instead count the holes that are
created when inserting a fragment. If there are no holes left, the
fragments are continuous.
idea from claudio@; OK claudio@ sashan@
|
|
- MSS and WSCALE option candidates must now meet their min type length.
- 'max-mss' is now more tolerant of malformed option lists.
These changes were immaterial to the live traffic I've examined.
OK sashan@ mpi@
|
|
bzero -> memset and (very few) bcopy -> memcpy/memmove
|
|
may easily reuse the fragment id as it is only 16 bit for IPv4. To
avoid that pf reassembles them into the wrong packet, throw away
stale fragments. With the default timeout this happens after 12,000
newer fragements have been seen.
from markus@; OK sashan@
|
|
we need suitable data structures. Organize the pf fragments with
two red-black trees. One is holding the address and protocol
information and the other has only the fragment id. This will allow
to drop fragemts for specific connections more aggressively. `
from markus@; OK sashan@
|
|
bugs could easily result in use-after-free or double free. Introduce
m_freemp() which automatically resets the pointer before freeing
it. So we have less dangling pointers in the kernel.
OK krw@ mpi@ claudio@
|
|
to enable PF_LOCK(), you must add 'option WITH_PF_LOCK' to your kernel
configuration. The code does not do much currently it's just the very
small step towards MP.
O.K. henning@, mikeb@, mpi@
|
|
Recursions are still marked as XXXSMP.
ok deraadt@, bluhm@
|
|
with certain rulesets and excessively noisy; move them to LOG_INFO (which was
previously unused). ok benno@
|
|
For the moment the NET_LOCK() is always taken by threads running under
KERNEL_LOCK(). That means it doesn't buy us anything except a possible
deadlock that we did not spot. So make sure this doesn't happen, we'll
have plenty of time in the next release cycle to stress test it.
ok visa@
|
|
NET_LOCK(). pfioctl() will need the NET_LOCK() anyway. So better keep
things simple until we're going to redesign PF for a MP world.
fixes the crash reported by Kaya Saman.
ok mpi@, bluhm@
|
|
of calling rtalloc() again.
OK mpi@
|
|
|
|
|
|
Prevent pf_socket_lookup() reading uninitialised header buffers on fragments.
OK blum@ sashan@
|
|
in pf. Drop the whole fragment state if IPv6 fragments appear which
have invalid length or fragment-offset or more-fragment-bit. In
IPv4 they are considered invalid and just dropped like before.
Found by Antonios Atlasis; OK sashan@ sthen@
|
|
pfvar_priv.h. The pf_headers had to be defined in multiple .c files
before. In pfvar.h it would have unknown storage size, this file
is included in too many places. The idea is to have a private pf
header that is only included in the pf part of the kernel. For now
it contains pf_pdesc and pf_headers, it may be extended later.
discussion, input and OK henning@ procter@ sashan@
|
|
|
|
|
|
the ioff argument to pool_init() is unused and has been for many
years, so this replaces it with an ipl argument. because the ipl
will be set on init we no longer need pool_setipl.
most of these changes have been done with coccinelle using the spatch
below. cocci sucks at formatting code though, so i fixed that by hand.
the manpage and subr_pool.c bits i did myself.
ok tedu@ jmatthew@
@ipl@
expression pp;
expression ipl;
expression s, a, o, f, m, p;
@@
-pool_init(pp, s, a, o, f, m, p);
-pool_setipl(pp, ipl);
+pool_init(pp, s, a, ipl, f, m, p);
|
|
ok phessler@ henning@
|
|
A single forwarding cache is not the answer. The answer is 42... err PF!
ok bluhm@
|
|
when fiddling with packets but without the mess that motivated Henning to
remove it. Affects only this one aspect of Henning's checksum work. Also tweak
the basic algorithm and supply a correctness argument.
OK dlg@ deraadt@ sthen@; no objection henning@
|
|
has been moved to nd6_resolve().
ok visa@, millert@, florian@, sthen@
|
|
byte order. Spotted by Gleb Smirnoff (glebius@FreeBSD.org), thanks!
ok tedu
|
|
ok sthen@, bluhm@
|
|
pf_test calls pf_refragment6 with dst=NULL, which is passed down to
rtable_match which attempts to dereference it.
|
|
ok bluhm@
|