Age | Commit message | Author |
|
of snapshots is to allow pfsync(4) to move items from global lists
to local lists (a.k.a. snapshots) under mutex protection. Snapshots
are then processed without holding any mutexes. Such an idea does not
fly well if the link entry is used for global lists as well as snapshots.
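As an illustration, a minimal sketch of the snapshot pattern; the
sc_qs/sc_mtx names and the queue type are assumptions, not taken from
this commit:

	struct pf_state_queue snapshot;		/* the local list */

	TAILQ_INIT(&snapshot);
	mtx_enter(&sc->sc_mtx[q]);
	TAILQ_CONCAT(&snapshot, &sc->sc_qs[q], sync_list);
	mtx_leave(&sc->sc_mtx[q]);
	/* walk the snapshot without holding any mutex */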
Feedback by bluhm@. Credit also goes to hrvoje@ for extensive testing.
OK bluhm@
|
|
Assertions should be active and rely on #ifdef DIAGNOSTIC. Retire
PFSYNC_DEBUG.
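For illustration, the shape of the change (a sketch, not the literal
diff):

/* before: only compiled with the custom PFSYNC_DEBUG option */
#if defined(PFSYNC_DEBUG)
	if (sc->sc_len < PFSYNC_MINPKT)
		panic("pfsync pkt len is too low %zd", sc->sc_len);
#endif

/* after: KASSERT() is active on any DIAGNOSTIC kernel */
KASSERT(sc->sc_len >= PFSYNC_MINPKT);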
OK sashan@ dlg@
|
|
have at least mpfloor spl protection. Fix a witness issue found by
Hrvoje where pfsync holds a mutex and an interrupt grabs the kernel
lock.
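A one-line sketch of the idea (the mutex name is an assumption):

	mtx_init(&sc->sc_mtx[q], IPL_MPFLOOR);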
OK sashan@
|
|
one place. Use atomic_add_long() there. Call the local variable
sclen; that makes it easier to grep for the struct field sc_len.
OK sashan@
|
|
mutex st_mtx. This simplifies pf(4) state handling in pfsync(4). It also
makes it more reliable.
OK bluhm@
|
|
forwarding, protect tdb flags and lists in pfsync with a mutex.
help and OK sashan@
|
|
covered yet; more ref counts to come. The timeouts are protected,
so the racy tdb_reaper() gets retired. The tdb_policy_head, onext
and inext lists are protected. All gettdb...() functions return a
tdb that is ref-counted and has to be unrefed later. A flag ensures
that tdb_delete() is called only once.
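A sketch of the resulting calling convention (the arguments here are
made up):

	tdb = gettdb(rdomain, spi, &dst, IPPROTO_ESP);
	if (tdb != NULL) {
		/* ... use the tdb ... */
		tdb_unref(tdb);	/* drop the lookup's reference */
	}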
Tested by Hrvoje Popovski; OK sthen@ mvs@ tobhe@
|
|
object exists already, then the caller must free the pre-allocated
buffer. If the caller does not pre-allocate a buffer, pfi_kif_get()
will get memory from the pool using the M_NOWAIT flag.
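A plausible calling pattern (a sketch; pfi_kif_free() here stands in
for whatever releases the unused buffer, and the consume-and-clear
behaviour of pfi_kif_get() is an assumption):

	struct pfi_kif *kif, *buf;

	buf = pfi_kif_alloc(ifname, M_WAITOK);	/* may sleep here ... */
	kif = pfi_kif_get(ifname, &buf);	/* ... but not here */
	if (buf != NULL)			/* object existed already */
		pfi_kif_free(buf);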
The commit also polishes pfi_initialize() a bit so it uses a
M_WAITOK allocation for pfi_all.
There is no change in current behaviour.
Feedback by bluhm@.
OK bluhm@
|
|
OK bluhm@
|
|
a continue in the middle of a do { } while (0) loop is effectively
a break; it doesn't restart the loop.
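in other words:

	do {
		if (cond)
			continue;	/* jumps to the while (0) test,
					 * which fails, so the loop exits */
		/* ... */
	} while (0);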
without the retry, the code leaked update messages, which in turn
made the pool_destroy in pfsync's destroy path trip over a KASSERT
because items were still out.
found by and fix tested by hrvoje popovski
ok sashan@
|
|
before this, things that iterated over the global list of pf states
had to take the net, pf, or pf state locks. in particular, the
ioctls that dump the state table took the net and pf state locks
before iterating over the states and using copyout to export them
to userland. when we tried replacing the rwlocks with mutexes
under the pf locks, this blew up because you can't sleep when holding
a mutex and there's a sleeping lock used inside copyout.
this diff introduces two locks around the global state list: a mutex
that protects the head and tail of the list, and an rwlock that
protects the links between elements in the list. inserts on the
state list only occur during packet handling and can be done by
taking the mutex and putting the state on the tail before releasing
the mutex. iterating over states is only done from thread/process
contexts, so we can take a read lock, then the mutex to get a
snapshot of the head and tail pointers, and then keep the read lock
to iterate between the head and tail pointers. because it's a read
lock we can then take other sleeping locks (eg, the one inside
copyout) without (further) gymnastics. the pf state purge code takes
the rwlock exclusively and the mutex to remove elements from the
list.
this allows the ioctls and purge code to loop over the list
concurrently and largely without blocking the creation of states
when pf is processing packets.
pfsync also iterates over the state list when doing bulk sends,
which the state purge code needs to be careful around.
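a sketch of the scheme (struct and field names are assumptions, not
the literal code):

	/* insert, from packet processing: mutex only */
	mtx_enter(&state_list.mtx);
	TAILQ_INSERT_TAIL(&state_list.list, st, entry_list);
	mtx_leave(&state_list.mtx);

	/* iterate, from thread/process context */
	rw_enter_read(&state_list.rwl);
	mtx_enter(&state_list.mtx);
	head = TAILQ_FIRST(&state_list.list);	/* snapshot the ends */
	tail = TAILQ_LAST(&state_list.list, pf_state_queue);
	mtx_leave(&state_list.mtx);
	/* walk head..tail; copyout() may sleep under the read lock */
	rw_exit_read(&state_list.rwl);

	/* remove, from the purge code: both locks */
	rw_enter_write(&state_list.rwl);
	mtx_enter(&state_list.mtx);
	TAILQ_REMOVE(&state_list.list, st, entry_list);
	mtx_leave(&state_list.mtx);
	rw_exit_write(&state_list.rwl);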
ok sashan@
|
|
pfsync_undefer_notify uses the state keys to look up the address
family, which is used to figure out if it should call ipv4 or ipv6
functions. however, the pf state purge code can unlink a state from
the trees (ie, the state keys get removed) while the pfsync defer
code is holding a reference to it and expects to be able to send
the deferred packet in the future. we can test if the state keys
are set by checking if the timeout state is PFTM_UNLINKED or not.
this currently relies on both pf_remove_state and pfsync_undefer_notify
being called with the NET_LOCK held. this probably needs to be
rethought later but is good enough for now.
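roughly:

	/* the purge code clears the keys once it unlinks the state */
	if (st->timeout == PFTM_UNLINKED)
		return;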
found the hard way on a production firewall at work.
|
|
pfsync_free_deferral doesn't need to check pd_m for NULL before
calling m_freem because m_freem does that anyway.
if pf_setup_pdesc in pfsync_undefer_notify failed, the mbuf was
freed but the pd_m pointer was not cleared, which would have led
to a double free when pfsync_free_deferral tried to do the same
thing for it.
if pfsync_undefer is supposed to drop the mbuf, let pfsync_free_deferral
do it for us.
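a sketch of the resulting shape (the pool name is an assumption):

	void
	pfsync_free_deferral(struct pfsync_deferral *pd)
	{
		struct pfsync_softc *sc = pfsyncif;

		m_freem(pd->pd_m);	/* m_freem(NULL) is a no-op */
		pool_put(&sc->sc_pool, pd);
	}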
ok jmatthew@
|
|
working on a uint64_t is easier than remembering how timercmp and
timersub work.
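for example (pd_deadline as a uint64_t of nanoseconds is an
assumption):

	struct timespec ts;
	uint64_t now;

	getnanouptime(&ts);
	now = TIMESPEC_TO_NSEC(&ts);	/* one integer to compare */
	if (pd->pd_deadline <= now) {
		/* the defer has expired */
	}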
ok jmatthew@
|
|
"that seems kind of important" jmatthew@
|
|
instead of having a timeout per deferred packet structure, use a
single timeout in pfsync that pulls items off the list of deferred
packets.
this avoids confusion about whether a timeout is handling the defer
or another context owns it. this way round, the context that removes
a defer from the list owns it and is responsible for completing it.
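a sketch of the ownership rule (names are assumptions; now is the
current uptime as above):

	mtx_enter(&sc->sc_deferrals_mtx);
	while ((pd = TAILQ_FIRST(&sc->sc_deferrals)) != NULL) {
		if (pd->pd_deadline > now)
			break;
		TAILQ_REMOVE(&sc->sc_deferrals, pd, pd_entry);
		/* off the list: this context owns pd now */
		mtx_leave(&sc->sc_deferrals_mtx);
		pfsync_undefer(pd, 0);
		mtx_enter(&sc->sc_deferrals_mtx);
	}
	mtx_leave(&sc->sc_deferrals_mtx);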
this should fix a panic we hit on the firewalls at work. there's
still another one that needs a fix, but sashan@ has been looking
at it. this might make it simpler to deal with though.
ok sashan@ jmatthew@
|
|
state in pfsync(4) queue. pfsync_q_ins() takes that race into account with one
exception: the KASSERT() at line 2352. That KASSERT() needs to be removed.
2346 void
2347 pfsync_q_ins(struct pf_state *st, int q)
2348 {
2349 	struct pfsync_softc *sc = pfsyncif;
2350 	size_t nlen, sc_len;
2351 
2352 	KASSERT(st->sync_state == PFSYNC_S_NONE);
2353 
2354 #if defined(PFSYNC_DEBUG)
2355 	if (sc->sc_len < PFSYNC_MINPKT)
2356 		panic("pfsync pkt len is too low %zd", sc->sc_len);
2357 #endif
2358 	do {
2359 		mtx_enter(&sc->sc_mtx[q]);
2360 
2361 		/*
2362 		 * If two threads are competing to insert the same state, then
2363 		 * there must be just single winner.
2364 		 */
2365 		if (st->sync_state != PFSYNC_S_NONE) {
2366 			mtx_leave(&sc->sc_mtx[q]);
2367 			break;
2368 		}
OK bluhm@
|
|
ok gnezdo@ semarie@ mpi@
|
|
the first cut of this diff was made with coccinelle using this spatch:
@rule@
type caddr_t;
expression m, off, len, cp;
@@
-m_copydata(m, off, len, (caddr_t)cp)
+m_copydata(m, off, len, cp)
i had to fix its opinionated idea of formatting by hand though, so
i'm not sure it was worth it.
ok deraadt@ bluhm@
|
|
Silence from the network group
ok sashan@
|
|
The code delivered in this change is currently disabled. Brave souls
may enable the code by adding -DWITH_PF_LOCK when building a
customized kernel. Big thanks go to Hrvoje@ for providing test
equipment and testing.
As soon as we enter the next release cycle, WITH_PF_LOCK will be
defined as a default option for MP kernels.
OK dlg@
|
|
this is a significant (and breaking) reworking of the policy based
routing that pf can do. the intention is to make it as easy as
nat/rdr to use, and more robust when it's operating.
the main reasons for this change are:
- route-to, reply-to, and dup-to do not work with pfsync
this is because the information about where to route-to is stored in
rules, and it is hard to have a ruleset synced between firewalls,
and impossible to have them synced 100% of the time.
- i can make my boxes panic in certain situations using route-to
yeah...
- the configuration and syntax for route-to rules are confusing.
the argument to route-to and co is an interface name with an optional
ip address. there are several problems with this. one is that people
tend to think about routing as sending packets to peers by their
address, not by the interface they're reachable on. another is that
we currently have no way to synchronise interface topology information
between firewalls, so using an interface to say where packets go
means we can't do failover of these states with pfsync. another
is that a change in routing topology means a host may become
reachable over a different interface. tying routing policy to
interfaces gets in the way of failover and load balancing.
this change does the following:
- stores the route info in the state instead of the pf rule
this allows route-to to keep working when the ruleset changes, and
allows route-to info to be sent over pfsync. there's enough spare bits
in pfsync messages that the protocol doesn't break.
the caveat is that route-to becomes tied to pass rules that create
state, like rdr-to and nat-to.
- the argument to route-to etc is a destination ip address
it's not limited to a next-hop address (though a next-hop can be a
destination address). this allows for the failover and load balancing
referred to above.
- deprecates the address@interface host syntax in pfctl
because routing is done entirely by IPs, the interface is derived from
the route lookup, not pf. any attempt to use the @interface syntax
will fail now in all contexts.
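for example, in pf.conf (addresses made up):

	# old, interface-centric syntax (no longer valid):
	#   pass out on em0 route-to (em1 192.0.2.1)
	# new, address-centric; the interface comes from the route lookup:
	pass out on em0 route-to 192.0.2.1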
there's enthusiasm from proctor@ jmatthew@ and others
ok sashan@ bluhm@
|
|
ok sashan@
|
|
longer memcopied but assigned. Alignment should not be an issue
as it is __packed.
Part of a larger diff from dlg@; OK dlg@ sashan@
|
|
OK dlg@, bluhm@
No Opinion mpi@
Not against it claudio@
|
|
Each ruleset's rules are stored in a TAILQ called "ptr" with "rcount"
representing the number of rules in the ruleset; "ptr_array" points to an
array of the same length.
"ptr" is backed by pool_get(9) and may change in size as "expired" rules
get removed from the ruleset - see "once" in pf.conf(5).
"ptr_array" is allocated momentarily through mallocarray(9) and gets filled
with the TAILQ entries, so that the sole user pfsync(4) can access the list
of rules by index to pick the n-th rule during state insertion.
Remove "ptr_array" and make pfsync iterate over the TAILQ instead to get the
matching rule's index. This simplifies both code and data structures and
avoids duplicate memory management.
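A sketch of the replacement lookup (field names follow pf's ruleset
structures as described above):

	u_int32_t n = 0;
	struct pf_rule *r;

	TAILQ_FOREACH(r, ruleset->rules.active.ptr, entries) {
		if (r == st->rule.ptr)
			break;		/* n is the matching rule's index */
		n++;
	}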
OK sashan
|
|
Most clonable interface drivers (except bridge, enc, loop, pppx,
switch, trunk and vlan) initialise the send queue's length to IFQ_MAXLEN
during *_clone_create() even though ifq_init(), which is eventually called
through if_attach(), does the same.
Remove all early "ifq_set_maxlen(&ifp->if_snd, IFQ_MAXLEN);" lines to leave
it to ifq_init() and have clonable drivers a tad more in sync.
OK mvs
|
|
pfsyncstart() does not require the big lock, make it use the ifq API.
OK mvs
|
|
pointer obtained by ifunit() and its reference counter is not bumped.
This can cause a use-after-free issue. Replace this pointer by the
interface index.
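The pattern, roughly (the softc field name is an assumption):

	struct ifnet *ifp = if_get(sc->sc_sync_ifidx);

	if (ifp == NULL)
		return;		/* the interface went away */
	/* ... use ifp ... */
	if_put(ifp);		/* drop the reference if_get() took */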
ok dlg@ sashan@
|
|
ok dlg@ tobhe@
|
|
ok dlg@ tobhe@
|
|
Reported-by: syzbot+6fef0091252d57113bfb@syzkaller.appspotmail.com
ok kn@
|
|
time_second(9) and time_uptime(9) are widely used in the kernel to
quickly get the system UTC or system uptime as a time_t. However,
time_t is 64-bit everywhere, so it is not generally safe to use them
on 32-bit platforms: you have a split-read problem if your hardware
cannot perform atomic 64-bit reads.
This patch replaces time_second(9) with gettime(9), a safer successor
interface, throughout the kernel. Similarly, time_uptime(9) is replaced
with getuptime(9).
There is a performance cost on 32-bit platforms in exchange for
eliminating the split-read problem: instead of two register reads you
now have a lockless read loop to pull the values from the timehands.
This is really not *too* bad in the grand scheme of things, but
compared to what we were doing before it is several times slower.
There is no performance cost on 64-bit (__LP64__) platforms.
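The mechanical part of the change looks like this:

	/* before: a bare 64-bit read, unsafe on 32-bit platforms */
	t = time_uptime;

	/* after: a lockless read loop against the timehands */
	t = getuptime();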
With input from visa@, dlg@, and tedu@.
Several bugs squashed by visa@.
ok kettenis@
|
|
Prevent concurrency in the socket layer which is not ready for that.
Two recent data corruptions in pfsync(4) and the socket layer pointed
out that, at least, tun(4) was incorrectly using NET_RUNLOCK(). Until
we find a way in software to avoid future mistakes and to make sure that
only the softnet thread and some ioctls are safe to use a read version
of the lock, put everything back to the exclusive version.
ok stsp@, visa@
|
|
ok mpi@
|
|
i think this is a fix for a real bug. pfsync leaked the hooks it
had on a parent^Wsyncdev when the parent went away. now there's
KASSERTs to make sure all hooks are removed before an interface
goes away, the leak caused the KASSERTs to fire and made the bug
obvious.
found by hrvoje popovski
|
|
this is largely mechanical, except for carp. this moves the addition
of the carp link state hook after we're committed to using the new
interface as a carpdev. because the add can't fail, we avoid a
complicated unwind dance. also, this tweaks the carp linkstate hook
so it only updates the relevant carp interface, not all of the
carpdevs on the parent.
hrvoje popovski has tested an early version of this diff and it's
generally ok, but there's some splasserts that this diff fires that
i'll fix in an upcoming diff.
ok claudio@
|
|
the main semantic change is that things registering detach hooks
have to allocate and set a task structure that then gets added to
the list. this means if the task is allocated up front (eg, as part
of carps softc or bridges port structure), it avoids the possibility
that adding a hook can fail. a lot of drivers weren't checking for
failure, and unwinding state in the event of failure in other parts
was error prone.
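a sketch of the new pattern (the softc field and callback names are
made up):

	/* task storage lives in the softc, so adding cannot fail */
	task_set(&sc->sc_dtask, mydrv_ifdetach, sc);
	if_detachhook_add(ifp, &sc->sc_dtask);

	/* later, on detach or unwind */
	if_detachhook_del(ifp, &sc->sc_dtask);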
while doing this i discovered that the list operations have to be
in a particular order, but drivers weren't doing that consistently
either. this diff wraps the list ops up so you have to seriously
go out of your way to screw them up.
i've also sprinkled some NET_ASSERT_LOCKED around the list operations
so we can make sure there's no potential for the list to be corrupted,
especially while it's being run.
hrvoje popovski has tested this a bit, and some issues he discovered
have been fixed.
ok sashan@
|
|
ok semarie@, visa@
|
|
OK mpi@
|
|
instead
From Pamela Mosiejczuk, many thanks!
OK phessler@ deraadt@
|
|
When a pfsync interface is being deleted, all its timeout handlers and
pfsync_send_dispatch() have to stop accessing the software context
before the context is freed. Ensure sufficient synchronization by
acquiring NET_LOCK() and clearing `pfsyncif' inside the critical
section in pfsync_clone_destroy(). When a timeout handler has entered
the critical section, it has to check `pfsyncif' and bail out if the
value is NULL. pfsync_send_dispatch() already does this check.
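The handlers now follow this shape (a sketch; the handler name is
illustrative):

	void
	pfsync_tmo(void *arg)
	{
		struct pfsync_softc *sc;

		NET_LOCK();
		sc = pfsyncif;
		if (sc == NULL) {	/* interface is being destroyed */
			NET_UNLOCK();
			return;
		}
		/* ... safe to use sc here ... */
		NET_UNLOCK();
	}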
Issue reported and fix tested by Hrvoje Popovski.
OK mpi@ bluhm@
|
|
OK bluhm@
|
|
this change adds a pf_state_lock rw-lock, which protects the
consistency of the state table in PF. The code delivered in this
change is guarded by 'WITH_PF_LOCK', which is still undefined. People
who are willing to experiment and want to run it must do two things:
- compile kernel with -DWITH_PF_LOCK
- bump NET_TASKQ from 1 to ... sky is the limit,
(just select some sensible value for the number of tasks your
system is able to handle)
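A sketch of the guarded lock (the macro names are assumptions):

#ifdef WITH_PF_LOCK
struct rwlock pf_state_lock = RWLOCK_INITIALIZER("pf_state_lock");
#define PF_STATE_ENTER_READ()	rw_enter_read(&pf_state_lock)
#define PF_STATE_EXIT_READ()	rw_exit_read(&pf_state_lock)
#else
#define PF_STATE_ENTER_READ()	do { } while (0)
#define PF_STATE_EXIT_READ()	do { } while (0)
#endif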
OK bluhm@
|
|
OK bluhm@, OK mpi@, henning@, jca@
|
|
The account flag `ASU' will no longer be set but that makes suser()
mpsafe since it no longer messes with a per-process field.
No objection from millert@, ok tedu@, bluhm@
|
|
memory shortage. As it is invoked from a system call, it should
not fail but wait instead.
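That is, a sketch:

	/* the ioctl path may sleep, so waiting beats returning ENOMEM */
	ioe = malloc(sizeof(*ioe), M_TEMP, M_WAITOK);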
OK visa@ mpi@
|