Age | Commit message (Collapse) | Author |
|
before this, things that iterated over the global list of pf states
had to take the net, pf, or pf state locks. in particular, the
ioctls that dump the state table took the net and pf state locks
before iterating over the states and using copyout to export them
to userland. when we tried replacing the use rwlocks with mutexes
under the pf locks, this blew up because you can't sleep when holding
a mutex and there's a sleeping lock used inside copyout.
this diff introduces two locks around the global state list: a mutex
that protects the head and tail of the list, and an rwlock that
protects the links between elements in the list. inserts on the
state list only occur during packet handling and can be done by
taking the mutex and putting the state on the tail before releasing
the mutex. iterating over states is only done from thread/process
contexts, so we can take a read lock, then the mutex to get a
snapshot of the head and tail pointers, and then keep the read lock
to iterate between the head and tail points. because it's a read
lock we can then take other sleeping locks (eg, the one inside
copyout) without (further) gymnastics. the pf state purge code takes
the rwlock exclusively and the mutex to remove elements from the
list.
this allows the ioctls and purge code to loop over the list
concurrently and largely without blocking the creation of states
when pf is processing packets.
pfsync also iterates over the state list when doing bulk sends,
which the state purge code needs to be careful around.
ok sashan@
|
|
|
|
since the actual modification of the state table is done by a call to
pf_state_insert(), which takes the pf state lock itself. Other calls
to pfsync_state_import() also only have the pf lock.
Reported-by: syzbot+d6ea8620b43dc69ecbc6@syzkaller.appspotmail.com
ok bluhm@
|
|
Silence from the network group
ok sashan@
|
|
Ranges where the left boundary is bigger than the right one are always bogus
as they work like `port any' (`port 34<>12' means "all ports") or in way
that inverts the rule's action (`pass ... port 34:12' means "pass no port at
all").
Add checks for all ranges and invalidate those that yield no or all ports.
For this to work on redirections, make pfctl(8) pass the range's type,
otherwise boundary including ranges are not detected as such; that is to
say, `struct pf_pool's `port_op' member was unused in the kernel so far.
`rdr-to' rules with invalid ranges could panic the kernel when hit.
Reported-by: syzbot+9c309db201f06e39a8ba@syzkaller.appspotmail.com
OK sashan
|
|
Reported-by: syzbot+b9af9c29ed1a6dabda25@syzkaller.appspotmail.com
OK anton@
|
|
outside of NET_LOCK()/PF_LOCK() scope in easy spots.
OK kn@
|
|
unused by the rule. So skip the rest of the check in that case.
Fixes rulest loading for semarie@
OK semarie@
|
|
Unlike "... rtable N", pf.conf(5)'s "on rdomain N" does not alter packet
state and will always work no matter if rdomain N currently exists or not,
i.e. the rule "pass on rdomain 42" will simply match (and pass) packets if
rdomain 42 exists, and it will simply not match (neither pass nor block)
packets if 42 does not exist.
There's no need to reload the ruleset whenever routing domains are created
or deleted, which can already be observed now by creating an rdomain,
loading rules referencing it and deleting the same rdomain immediately
afterwards: pf will continue to work as expected.
Relax both pfctl(8)'s parser check as well as pf(4)'s copyin routine to
accept any valid routing domain ID without expecting it to exist at the time
of ruleset creation - this lifts the requirement to create rdomains before
referencing them in pf.conf while keeping pf behaviour unchanged.
Prompted by yasuoka's recent pfctl parse.y r1.702 commit requiring an rtable
to exist upon ruleset creation.
Discussed with claudio and bluhm at k2k20.
Feedback sashan
OK sashan yasouka claudio
|
|
When "once" rules expire, they are removed from the active ruleset,
hence the main ruleset needs to be rehashed iff itself contains once rules.
After the previous commit, pf_setup_pfsync_matching() became much simpler
but its name even less accurate; simplify it further and rename it to
pf_calc_chksum() while here.
Admins using "once" rules in combination with pfsync(4) are hopefully aware
of this caveat (self-changing rulesets) already, but now the checksum in
"pfctl -v -s info" actually indicates out-of-sync rulesets and is no longer
misleading.
OK sashan
|
|
Each ruleset's rules are stored in a TAILQ called "ptr" with "rcount"
representing the number of rules in the ruleset; "ptr_array" points to an
array of the same length.
"ptr" is backed by pool_get(9) and may change in size as "expired" rules
get removed from the ruleset - see "once" in pf.conf(5).
"ptr_array" is allocated momentarily through mallocarray(9) and gets filled
with the TAILQ entries, so that the sole user pfsync(4) can access the list
of rules by index to pick the n-th rule during state insertion.
Remove "ptr_array" and make pfsync iterate over the TAILQ instead to get the
matching rule's index. This simplifies both code and data structures and
avoids duplicate memory management.
OK sashan
|
|
the checksum is exclusively used for pfsync to verify rulesets are identical
on all nodes. the automatic table names are random and have a near zero
chance to match. found at a customer in zurich
ok sashan kn
|
|
time_second(9) and time_uptime(9) are widely used in the kernel to
quickly get the system UTC or system uptime as a time_t. However,
time_t is 64-bit everywhere, so it is not generally safe to use them
on 32-bit platforms: you have a split-read problem if your hardware
cannot perform atomic 64-bit reads.
This patch replaces time_second(9) with gettime(9), a safer successor
interface, throughout the kernel. Similarly, time_uptime(9) is replaced
with getuptime(9).
There is a performance cost on 32-bit platforms in exchange for
eliminating the split-read problem: instead of two register reads you
now have a lockless read loop to pull the values from the timehands.
This is really not *too* bad in the grand scheme of things, but
compared to what we were doing before it is several times slower.
There is no performance cost on 64-bit (__LP64__) platforms.
With input from visa@, dlg@, and tedu@.
Several bugs squashed by visa@.
ok kettenis@
|
|
Since our last concurrency mistake only ioctl(2) ans sysctl(2) code path
take the reader lock. This is mostly for documentation purpose as long as
the softnet thread is converted back to use a read lock.
dlg@ said that comments should be good enough.
ok sashan@
|
|
Reported-by: syzbot+d0639632a0affe0a690e@syzkaller.appspotmail.com
Reported-by: syzbot+ae5e359d7f82688edd6a@syzkaller.appspotmail.com
OK anton@
|
|
Prevent concurrency in the socket layer which is not ready for that.
Two recent data corruptions in pfsync(4) and the socket layer pointed
out that, at least, tun(4) was incorrectly using NET_RUNLOCK(). Until
we find a way in software to avoid future mistakes and to make sure that
only the softnet thread and some ioctls are safe to use a read version
of the lock, put everything back to the exclusive version.
ok stsp@, visa@
|
|
Do not include <sys/kthread.h> where it is not needed and stop including
<sys/proc.h> in it.
ok visa@, anton@
|
|
entry instead of calling panic() due to unhandled af.
Reported-by: syzbot+92be143c2dd1746cf2af@syzkaller.appspotmail.com
from Benjamin Baier
|
|
No object change
OK sashan
|
|
the problem was introduced with a "mechanical" patch, which
replaced all "breaks;" with "PF_UNLOCK(); break;" This is wrong
for case of DIOCGETRULESETS.
issue analyzed and patch created by Joerg Goltermann <jg@osn.de>
OK tb@
|
|
in and the pf_pktdelay struct ws not declared and initialzed properly.
ok rob@ kn@
|
|
current status and statistics and can be exported without super-user
rights via sysctl to make it easier for tools like systat to access those.
OK deraadt@, sashan@
|
|
pfioc_src_nodes to size_t. This avoids integer truncation by casts
to unsigned. As the types of DIOCGETSTATES and DIOCGETSRCNODES
ioctl(2) arguments change, pfctl(8) and systat(1) should be updated
together with the kernel. Calculate number of pf(4) states as
size_t in userland.
OK sashan@ deraadt@
|
|
All rulesets reference their parent anchor, except for the special cased
main anchor containing the main ruleset, which's reference is always NULL
since initialization and never changes.
Replacing nullity tests with clearer equality checks makes the code less
ambigious and easier to understand.
OK sashan
|
|
For semantic consistency with pf_{create,find,remove}_{anchor,ruleset}().
Simplify logic by squashing the if/else block while here.
No functional change.
Feedback jca and mikeb, OK mikeb
|
|
OK kn@, florian@, visa@, cheloha@
|
|
These are just unhelpful case conversion.
OK sashan henning
|
|
This fixes certain operations such as `pfctl -t foo -T show' when the
system is in "Highly secure mode". `pfctl -t foo -T show -v' would already
work due to a different ioctl (DIOCRGETASTATS) being used.
Reported by Zbyszek ŻóÅkiewski, thanks!
OK sthen sashan
|
|
this change adds a pf_state_lock rw-lock, which protects consistency
of state table in PF. The code delivered in this change is guarded
by 'WITH_PF_LOCK', which is still undefined. People, who are willing
to experiment and want to run it must do two things:
- compile kernel with -DWITH_PF_LOCK
- bump NET_TASKQ from 1 to ... sky is the limit,
(just select some sensible value for number of tasks your
system is able to handle)
OK bluhm@
|
|
Due to the missing "void", this
extern void pf_purge_expired_src_nodes();
is no prototype but a declaration. It is enough to suppress the
'implicit declaration' warning but it does not allow the compiler to
check the arguments passed to the calls of the function.
Fix the prototypes and don't pass the waslocked argument anymore. It has
been removed a year ago.
ok sashan henning
|
|
by pf in the packet header. pf_delay_pkt reads the delay value from the packet
header, schedules a timeout and re-queues the packet when the timeout fires.
ok benno sashan
|
|
the rule created in pf_rule_copyin(). Plugs a potential kif memory
leak in pf(4) ioctl.
OK sashan@
|
|
No objections from henning, OK visa
|
|
ok bluhm@ sashan@ visa@
|
|
ok claudio benno procter
|
|
|
|
when syncookies are on, pf will blindly answer each and every SYN with a
syncookie-SYNACK. Upon reception of the ACK completing the 3WHS, pf will
reconstruct the original SYN, shove it through pf_test, where state will
be created if the ruleset permits it. Then massage the freshly created state
(we won't see the SYNACK), set up the sequence number modulator, and call
into the existing synproxy code to start the 3WHS with the backend host.
Add an - somewhat basic for now - adaptive mode where syncookies get enabled
if a certain percentage of the state table is filled up with half-open tcp
connections. This makes pf firewalls resilient against large synflood
attacks.
syncookies are off by default until we gained more experience, considered
experimental for now.
see http://bulabula.org/papers/2017/bsdcan/ for more details.
joint work with sashan@, widely discussed and with lots of input by many
|
|
bzero -> memset and (very few) bcopy -> memcpy/memmove
|
|
OK lteo@ sashan@
|
|
divert-to or divert-reply was active. If the address was also set,
it meant divert-to. Divert packet used a separate structure. This
is confusing and makes it hard to add new features. It is better
to have a divert type that explicitly says what is configured.
Adapt the pf rule struct in kernel and pfctl, no functional change.
Note that kernel and pfctl have to be updated together.
OK sashan@
|
|
pass in proto icmp max-pkt-rate 100/10
all packets matching the rule in the direction the state was created are
taken into consideration (typically: requests, but not replies).
Just like with the other max-*, the rule stops matching if the maximum is
reached, so in typical scenarios the default block rule would kick in then.
with input from Holger Mikolon
ok mikeb
|
|
NOTE: code still runs with single softnet task. change definition of
SOFTNET_TASKS in net/if.c, if you want to have more than one softnet task
OK mpi@, OK phessler@
|
|
(extra thanks to Hrvoje for testing)
OK mpi@
|
|
Tested by Hrvoje Popovski, ok bluhm@
|
|
thread to a task running on the `softnettq`.
Tested and inputs from Hrvoje Popovski.
ok visa@, sashan@
|
|
is correctly NUL terminated.
Reported by Ilja Van Sprundel
With and OK bluhm@
|
|
The selection mechanism introduced in pf_ioctl.c -r1.316 suffers
from being too ambiguous and lacks robustness. Instead of relying
on composition of multiple flags in the queue specification, it's
easier to identify the root class (if it exists) and derive all
further checks from it.
|
|
While there use TAILQ_FOREACH macro for traversing tags.
OK mpi@
|
|
Discussed with and OK henning@ at d2k17 as a part of a larger diff.
|
|
PFQS_FLOWQUEUE is about to become a flag that HFSC classes may specify
as well; thus a better way of telling FQ-CoDel and HFSC apart needs to
be found. At the moment its derived from the queue specification.
Discussed with and OK henning@ at d2k17 as a part of a larger diff.
|