Age | Commit message (Collapse) | Author |
|
Instead of passing around u_char[4], introduce struct ipsec_level
that contains 4 ipsec levels. This provides better type safety.
The embedding struct inpcb is globally visible for netstat(1), so
put struct ipsec_level outside of #ifdef _KERNEL.
OK deraadt@ mvs@
|
|
ip_output() received inp as parameter. This is only used to lookup
the IPsec level of the socket. Reasoning about MP locking is much
easier if only relevant data is passed around. Convert ip_output()
to receive constant inp_seclevel as argument and mark it as protected
by net lock.
OK mvs@
|
|
When receiving a pfkeyv2 SADB_ADD message, a newly created tdb can
fail in tdb_init(), which causes the tdb to not get added to the
global tdb list and an immediate dereference. If a lifetime timeout
triggers on this tdb, it will unconditionally try to remove it from
the list and in the process deref once more than allowed,
causing a one bit corruption in the already freed up slot in the
tdb pool.
We resolve this issue by moving timeout_add() after tdb_init()
just before puttdb(). This means tdbs failing initialization
get discarded immediately as they only hold a single reference.
Valid tdbs get their timeouts activated just before we add them
to the tdb list, meaning the timeout can safely assume they are
linked.
Feedback from mvs@ and millert@
ok mvs@ mbuhl@
|
|
rather than use ipsec flows (aka, entries in the ipsec security
policy database) to decide which traffic should be encapsulated in
ipsec and sent to a peer, this tweaks security associations (SAs)
so they can refer to a tunnel interface. when traffic is routed
over that tunnel interface, an ipsec SA is looked up and used to
encapsulate traffic before being sent to the peer on the SA. When
traffic is received from a peer using an interface SA, the specified
interface is looked up and the packet is handed to it so it looks
like packets come out of the tunnel.
to support this, SAs get a TDBF_IFACE flag and iface and iface_dir
fields. When TDBF_IFACE is set the iface and dir fields are
considered valid, and the tdb/SA should be used with the tunnel
interface instead of the SPD.
support from many including markus@ tobhe@ claudio@ sthen@ patrick@
now is a good time deraadt@
|
|
moving pf forward has been a real struggle, and pfsync has been a
constant source of pain. we have been papering over the problems
for a while now, but it reached the point that it needed a fundamental
restructure, which is what this diff is.
the big headliner changes in this diff are:
- pfsync specific locks
this is the whole reason for this diff.
rather than rely on NET_LOCK or KERNEL_LOCK or whatever, pfsync now
has it's own locks to protect it's internal data structures. this
is important because pfsync runs a bunch of timeouts and tasks to
push pfsync packets out on the wire, or when it's handling requests
generated by incoming pfsync packets, both of which happen outside
pf itself running. having pfsync specific locks around pfsync data
structures makes the mutations of these data structures a lot more
explicit and auditable.
- partitioning
to enable future parallelisation of the network stack, this rewrite
includes support for pfsync to partition states into different "slices".
these slices run independently, ie, the states collected by one slice
are serialised into a separate packet to the states collected and
serialised by another slice.
states are mapped to pfsync slices based on the pf state hash, which
is the same hash that the rest of the network stack and multiq
hardware uses.
- no more pfsync called from netisr
pfsync used to be called from netisr to try and bundle packets, but now
that there's multiple pfsync slices this doesnt make sense. instead it
uses tasks in softnet tqs.
- improved bulk transfer handling
there's shiny new state machines around both the bulk transmit and
receive handling. pfsync used to do horrible things to carp demotion
counters, but now it is very predictable and returns the counters back
where they started.
- better tdb handling
the tdb handling was pretty hairy, but hrvoje has kicked this around
a lot with ipsec and sasyncd and we've found and fixed a bunch of
issues as a result of that testing.
- mpsafe pf state purges
this was committed previously, but because the locks pfsync relied on
weren't clear this just caused a ton of bugs. as part of this diff it's
now reliable, and moves a big chunk of work out from under KERNEL_LOCK,
which in turn improves the responsiveness and throughput of a firewall
even if you're not using pfsync.
there's a bunch of other little changes along the way, but the above are
the big ones.
hrvoje has done performance testing with this diff and notes a big
improvement when pfsync is not in use. performance when pfsync is
enabled is about the same, but im hoping the slices means we can scale
along with pf as it improves.
lots (months) of testing by me and hrvoje on pfsync boxes
tests and ok sashan@
deraadt@ says this is a good time to put it in
|
|
instead of 's' for `tdb_sadb_mtx' mutex(9) because this is 'D'atabase.
No functional changes.
ok bluhm@
|
|
`id_refcount' decrement. This should be consistent with `ipsp_ids_gc_list'
list modifications, otherwise concurrent ipsp_ids_insert() could remove
this dying `ids' from the list before if was placed there by
ipsp_ids_free(). This makes atomic operations with `id_refcount' useless.
Also prevent ipsp_ids_lookup() to return dying `ids'.
ok bluhm@
|
|
of snapshots is to allow pfsync(4) to move items from global lists
to local lists (a.k.a. snapshots) under a mutex protection. Snapshots
are then processed without holding any mutexes. Such idea does not fly
well if link entry is currently used for global lists as well as snapshots.
Feedback by bluhm@ Credits also goes to hrvoje@ for extensive testing.
OK bluhm@
|
|
IP forwarding diff. Add mutex and refcount to make memory management
of struct ipsec_acquire MP safe.
testing Hrvoje Popovski; input sashan@; OK mvs@
|
|
OK tobhe@ mvs@
|
|
|
|
trees. ipsp_ids_lookup() returns `ids' with bumped reference
counter. original diff from mvs
ok mvs
|
|
'tdb_data' struct became unused and was removed.
Tested by Hrvoje Popovski.
ok bluhm@
|
|
sleep. So holding the tdb_sadb_mtx() when calling walker() is not
allowed. Move the TDB from the TDB-Hash to a temporary list that
is protected by netlock. Then unlock tdb_sadb_mtx and traverse the
list to call the walker.
OK mvs@
|
|
is also a list of SAs that belong to a policy. To make it MP safe,
protect these pointers with a mutex.
tested by Hrvoje Popovski; OK mvs@
|
|
TDB. Clearing the timeout flags just before pool put in tdb_free()
does not make sense. Move this to tdb_delete(). While there make
the parentheses in the flag check consistent.
tested by Hrvoje Popovski; OK tobhe@
|
|
that gettdb_dir() is MP safe now. Add the tdb_sadb_mtx mutex in
udpencap_ctlinput() to protect the access to tdb_snext. Make the
braces consistently for all these TDB loops. Move NET_ASSERT_LOCKED()
into the functions where the read access happens.
OK mvs@
|
|
may prevent that tdb_free() is called. It is not a real leak as
ipsecctl -F or termination of iked flush this cache when they remove
the IPsec policy. Move the code from tdb_free() to tdb_delete(),
then the kernel does the cleanup itself.
OK mvs@ tobhe@
|
|
pfkey_flush().
ok bluhm@ mvs@
|
|
out whether the TDB is linked to the hash bucket does not work.
This fixes removal of SAs that could not be flushed with ipsecctl -F.
OK tobhe@
|
|
is not always needed, but the error value is necessary for the
caller. As TDB should be refcounted, it makes not sense to always
return it. Pass an output pointer for the TDB which can be NULL.
OK mvs@ tobhe@
|
|
OK mvs@ yasuoka@
|
|
Protect tdb_unlink() and puttdb() for SADB_UPDATE with tdb_sadb_mutex.
Tested by Hrvoje Popovski
ok bluhm@ mvs@
|
|
covered yet, more ref counts to come. The timeouts are protected,
so the racy tdb_reaper() gets retired. The tdb_policy_head, onext
and inext lists are protected. All gettdb...() functions return a
tdb that is ref counted and has to be unrefed later. A flag ensures
that tdb_delete() is called only once.
Tested by Hrvoje Popovski; OK sthen@ mvs@ tobhe@
|
|
userland the TDBs which exceeded hard limit.
Also the `ipsec_notdb' counter description in header doesn't math to
netstat(1) description. We never count `ipsec_notdb' and the netstat(1)
description looks more appropriate so it's used to avoid confusion with
the new counter.
ok bluhm@
|
|
and "show all tdbs" in ddb.
tested by Hrvoje Popovski; OK mvs@
|
|
mutex locking against myself panic introduced by my previous commit.
OK beck@ patrick@
|
|
ok bluhm@
|
|
for ah, esp, and ipcomp. Move common code into ipsec_protoff()
which finds the offset of the next protocol field in the previous
header.
OK tobhe@
|
|
ok bluhm@
|
|
to old crypto API.
ok bluhm@
|
|
to the mbuf to update it globally. At the end it will reach
ip_deliver() which expects a pointer to an mbuf.
OK sashan@
|
|
This was needed to pass arguments to the callback function, but is no longer
necessary after the API makeover.
ok bluhm@
|
|
the mbuf, the callers must be careful. Although there is no bug,
use the common pattern to handle this. Pass down an mbuf pointer
mp and let m_pullup() update the pointer in all callers.
It looks like the tcp signature functions should not be called.
Avoid an mbuf leak and return an error.
OK mvs@
|
|
adds unnecessary complexity. Dedicated crypto offloading devices are not common
anymore. Modern CPU crypto acceleration works synchronously, eliminating the need
for callbacks.
Replace all occurrences of crypto_dispatch() with crypto_invoke(), which is
blocking and only returns after the operation has completed or an error occured.
Invoke callback functions directly from the consumer (e.g. IPsec, softraid)
instead of relying on the crypto driver to call crypto_done().
ok bluhm@ mvs@ patrick@
|
|
function. But was is never called via this pointer. It would have
immediatley crashed as mp is always NULL when called via .xf_output().
Do not set .xf_output to ipip_output. This allows to pass only the
parameters which are actually needed and the control flow is clearer.
OK mpi@
|
|
goto drop instead of return. An ENOBUFS should be EINVAL in IPv6
case. Also use combined packet and byte counter.
OK sthen@ dlg@
|
|
in ipsec_common_ctlinput() is not necessary, the loop in ipsec_set_mtu()
does that anyway. udpencap_ctlinput() did not work for bundled SA,
this also needs the loop in ipsec_set_mtu().
OK sthen@
|
|
Move the tdb pool init into an init function.
OK mvs@
|
|
ok gnezdo@
|
|
Panic reported by Hrvoje Popovski.
|
|
'tdb_data' struct became unused and was removed.
ok bluhm@
|
|
destruction instead of using per-entity timeout. This fixes the races
between ipsp_ids_insert(), ipsp_ids_free() and ipsp_ids_timeout().
ipsp_ids_insert() can't stop ipsp_ids_timeout() timeout handler which is
already running and awaiting netlock to be released, so reused `ids' will
be silently removed in this case.
ipsp_ids_free() can't determine is ipsp_ids_timeout() timeout handler
running because timeout_del(9) called by ipsp_ids_insert() clears it's
triggered state. So ipsp_ids_timeout() could be scheduled to run twice in
this case.
Also hrvoje@ reported about ipsec(4) throughput increased with this diff
so it seems we caught significant count of ipsp_ids_insert() races.
tests and feedback by hrvoje@
ok bluhm@
|
|
counter than after decryption. This could result in "esp_input_cb:
authentication failed for packet in SA" errors. As we run crypto
operations async, thousands of packets are stored in the crypto
task. During the queueing the replay counter of the tdb can change.
Then the higher 32 bits may increment although the lower 32 bits
did not wrap.
checkreplaywindow() must be called twice per packet with the same
replay counter. Store the value in struct tdb_crypto while dangling
in the task queue and doing crypto operations.
tested by Hrvoje Popovski; joint work with tobhe@
|
|
ok tobhe@
|
|
and map data read only.
OK deraadt@ mvs@ mpi@
|
|
constant. Then they are mapped as read only.
OK deraadt@ dlg@
|
|
|
|
in runtime within pfkeyv2_send(). Also set it's interrupt protection
level to IPL_SOFTNET.
ok bluhm@ mpi@
|
|
Tested with multiple Window 10 Pro (ver 2004) clients, and OpenBSD+iked
as the server.
OK tobhe@ sthen@ kn@
|