|
OK mvs@
|
|
Rename ifq_set_maxlen() to ifq_init_maxlen(). This function neither
uses WRITE_ONCE() nor a mutex and is called before the ifq mutex
is initialized. The new name expresses that it should be used only
during interface attach when there is no concurrency.
Protect ifq_len(), ifq_empty(), ifiq_len(), and ifiq_empty() with
READ_ONCE(). They can be used without a lock as they only read a
single integer.
OK dlg@
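A minimal sketch of the lockless read pattern described above, modelled
on sys/net/ifq.h; the exact field name is an assumption.

static inline unsigned int
ifq_len(struct ifqueue *ifq)
{
        /*
         * A single aligned integer read is atomic; READ_ONCE()
         * keeps the compiler from tearing or caching the load.
         */
        return (READ_ONCE(ifq->ifq_len));
}

static inline int
ifq_empty(struct ifqueue *ifq)
{
        return (ifq_len(ifq) == 0);
}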
|
|
Also disable TCP LRO on bridged vlan(4) and by default for bpe(4),
nvgre(4) and vxlan(4).
ok bluhm@
|
|
Using a scratch buffer makes it possible to take a consistent snapshot of
per-CPU counters without having to allocate memory.
Makes the ddb(4) "show uvmexp" command work in OOM situations.
ok kn@, mvs@, cheloha@
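A hedged usage sketch: with a preallocated scratch buffer,
counters_read(9) can take a consistent snapshot without calling
malloc(9), which is what lets "show uvmexp" run from ddb(4) when the
system is out of memory. The fourth scratch argument reflects this
change; the uvmexp counter names are illustrative.

void
uvmexp_snapshot(uint64_t *counts)
{
        /* static buffer: no allocation at read time, safe in OOM/ddb */
        static uint64_t scratch[exp_ncounters];

        counters_read(uvmexp_counters, counts, exp_ncounters, scratch);
}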
|
|
|
|
When M_TCP_TSO is preserved, also keep ph_mss. In lo(4) this logic
was missing. This may be relevant only for weird pf configs that
forward from loopback.
OK mvs@ jan@
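The shape of the fix as a sketch, not the literal diff: wherever the
M_TCP_TSO flag survives a packet header copy, ph_mss must be carried
along, because later TSO chopping needs that size. m and m0 are
illustrative names.

m->m_pkthdr.csum_flags |= (m0->m_pkthdr.csum_flags & M_TCP_TSO);
if (ISSET(m->m_pkthdr.csum_flags, M_TCP_TSO))
        m->m_pkthdr.ph_mss = m0->m_pkthdr.ph_mss;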
|
|
When doing LRO (Large Receive Offload), the drivers, currently ix(4)
and lo(4) only, record an upper bound of the size of the original
packets in ph_mss. When sending, either stack or hardware must
chop the packets with TSO (TCP Segmentation Offload) to that size.
That means we have to call tcp_if_output_tso() before ifp->if_output().
Put that logic into if_output_tso() to avoid code duplication. As
TCP packets on the wire do not get larger that way, path MTU discovery
should still work.
tested by and OK jan@
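A condensed sketch of the common path described above; the real
if_output_tso() in net/if.c has more checks, and the signatures here
are assumptions.

int
if_output_tso(struct ifnet *ifp, struct mbuf **mp, struct sockaddr *dst,
    struct rtentry *rt, u_int mtu)
{
        struct mbuf *m = *mp;

        /*
         * Packets marked by LRO may exceed the MTU and must be
         * chopped back to ph_mss before they reach the wire.
         */
        if (ISSET(m->m_pkthdr.csum_flags, M_TCP_TSO) &&
            m->m_pkthdr.len > mtu)
                return (tcp_if_output_tso(ifp, mp, dst, rt,
                    IFCAP_TSOv4 | IFCAP_TSOv6, mtu));

        return ((*ifp->if_output)(ifp, m, dst, rt));
}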
|
|
moving pf forward has been a real struggle, and pfsync has been a
constant source of pain. we have been papering over the problems
for a while now, but it reached the point that it needed a fundamental
restructure, which is what this diff is.
the big headliner changes in this diff are:
- pfsync specific locks
this is the whole reason for this diff.
rather than rely on NET_LOCK or KERNEL_LOCK or whatever, pfsync now
has its own locks to protect its internal data structures. this
is important because pfsync runs a bunch of timeouts and tasks to
push pfsync packets out on the wire, or when it's handling requests
generated by incoming pfsync packets, both of which happen outside
of pf itself. having pfsync specific locks around pfsync data
structures makes the mutations of these data structures a lot more
explicit and auditable.
- partitioning
to enable future parallelisation of the network stack, this rewrite
includes support for pfsync to partition states into different "slices".
these slices run independently, ie, the states collected by one slice
are serialised into a separate packet from the states collected and
serialised by another slice.
states are mapped to pfsync slices based on the pf state hash, which
is the same hash that the rest of the network stack and multiq
hardware uses (a sketch of this mapping follows after the list).
- no more pfsync called from netisr
pfsync used to be called from netisr to try and bundle packets, but now
that there are multiple pfsync slices this doesn't make sense. instead it
uses tasks in softnet tqs.
- improved bulk transfer handling
there are shiny new state machines around both the bulk transmit and
receive handling. pfsync used to do horrible things to carp demotion
counters, but now it is very predictable and returns the counters back
where they started.
- better tdb handling
the tdb handling was pretty hairy, but hrvoje has kicked this around
a lot with ipsec and sasyncd and we've found and fixed a bunch of
issues as a result of that testing.
- mpsafe pf state purges
this was committed previously, but because the locks pfsync relied on
weren't clear this just caused a ton of bugs. as part of this diff it's
now reliable, and moves a big chunk of work out from under KERNEL_LOCK,
which in turn improves the responsiveness and throughput of a firewall
even if you're not using pfsync.
there's a bunch of other little changes along the way, but the above are
the big ones.
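as promised above, a hedged sketch of the state-to-slice mapping; the
struct and field names are assumptions, not the literal pfsync code.

static struct pfsync_slice *
pfsync_slice(struct pfsync_softc *sc, const struct pf_state *st)
{
        unsigned int idx;

        /*
         * the pf state hash is the same hash the rest of the stack
         * and multiq hardware use; with a power-of-two slice count,
         * masking picks the slice.
         */
        idx = st->key[PF_SK_WIRE]->hash & (PFSYNC_NSLICES - 1);
        return (&sc->sc_slices[idx]);
}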
hrvoje has done performance testing with this diff and notes a big
improvement when pfsync is not in use. performance when pfsync is
enabled is about the same, but i'm hoping the slices mean we can scale
along with pf as it improves.
lots (months) of testing by me and hrvoje on pfsync boxes
tests and ok sashan@
deraadt@ says this is a good time to put it in
|
|
Pointed out by bluhm.
ok bluhm@
|
|
If tcplro is activated on lo(4), ignore the MTU for TCP packets.
They are passed along with the information that they have to be
chopped in case they are forwarded later. New netstat(1) counter
shows that software LRO is in effect. The feature is currently
turned off by default.
tested by jan@; OK claudio@ jan@
|
|
for interface groups data allocations.
ok kn claudio bluhm
|
|
disconnected from everywhere. No need to hold netlock for dummy
'nd_ifinfo' release. Netlock is also not needed for
TAILQ_EMPTY(&ifp->if_*hooks) assertions.
ok kn bluhm
|
|
Packets sent over loopback got their checksums calculated twice.
In the output path they were filled in and during TCP/IP input all
checksums were calculated again to be compared with the previous
result.
Avoid this by claiming that lo(4) supports hardware checksum
offloading. For each packet convert the flag that the checksum
should be calculated to the flag that it has been checked successfully.
Keep the flag that it should be calculated for the case that it may
be bridged or forwarded later.
A drawback is that "tcpdump -ni lo0 -v" reports invalid checksum.
But that is the same with physical interfaces and hardware offloading.
OK dlg@
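A sketch of the flag conversion described above, as it could look in
the local input path; the exact location is an assumption. The _OUT
flag is kept in case the packet is bridged or forwarded later.

if (ISSET(m->m_pkthdr.csum_flags, M_IPV4_CSUM_OUT))
        SET(m->m_pkthdr.csum_flags, M_IPV4_CSUM_IN_OK);
if (ISSET(m->m_pkthdr.csum_flags, M_TCP_CSUM_OUT))
        SET(m->m_pkthdr.csum_flags, M_TCP_CSUM_IN_OK);
if (ISSET(m->m_pkthdr.csum_flags, M_UDP_CSUM_OUT))
        SET(m->m_pkthdr.csum_flags, M_UDP_CSUM_IN_OK);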
|
|
this waits once for something to end in all the net tqs.
ok claudio@
|
|
This diff introduces separate capabilities for TCP offloading. We split this
into LRO (large receive offloading) and TSO (TCP segmentation offloading).
LRO can be turned on/off via the tcprecvoffload option of ifconfig(8) and
is not inherited by sub-interfaces.
TSO is inherited by sub-interfaces to signal this hardware offloading
capability to the network stack.
With tweaks from bluhm, claudio and dlg
ok bluhm, claudio
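A hedged sketch of how the split looks from a driver's point of view;
the capability flag names are assumptions modelled on sys/net/if.h.

/* in a driver's attach routine: announce what the hardware can do */
ifp->if_capabilities |= IFCAP_TSOv4 | IFCAP_TSOv6;      /* inherited */
ifp->if_capabilities |= IFCAP_LRO;      /* per interface, ifconfig tcprecvoffload */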
|
|
ie, you'll see softnet0, softnet1, etc in top/ps/etc now instead
of just softnet on these threads.
this is done by wrapping the taskq and name up in a softnet struct.
ok patrick@ bluhm@ mvs@ kn@ sashan@
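a sketch of the wrapper described above; the real struct lives in
net/if.c, the member names here are assumptions.

struct softnet {
        char             sn_name[16];   /* "softnet0", "softnet1", ... */
        struct taskq    *sn_taskq;
};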
|
|
if_output_ml() to send mbuf lists to interfaces. This can be used
for TSO, fragments, ARP and ND6. Rename variable fml to ml. In
pf_route6() split the if else block. Put the safety check (hlen +
firstlen < tlen) into ip_fragment(). It makes the code correct in
case the packet is too short to be fragmented. This should not
happen, but other functions also have this logic.
No functional change. OK sashan@
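A hedged sketch of if_output_ml() as described, sending each mbuf of a
list to the interface and purging the rest on the first error; the
exact signature is an assumption.

int
if_output_ml(struct ifnet *ifp, struct mbuf_list *ml,
    struct sockaddr *dst, struct rtentry *rt)
{
        struct mbuf *m;
        int error = 0;

        while ((m = ml_dequeue(ml)) != NULL) {
                error = (*ifp->if_output)(ifp, m, dst, rt);
                if (error)
                        break;
        }
        if (error)
                ml_purge(ml);   /* drop what could not be sent */

        return (error);
}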
|
|
time kernel and net locks are held in various combinations to protect it.
We don't want to put the kernel lock in all the places. Netlock also can't
be used because rtfree(9), which calls rtlabel_unref(), has unknown
netlock state within.
This new `rtlabel_mtx' mutex(9) protects the `rt_labels' list and `label'
entry dereference. Since we don't export the 'rt_label' structure, keep this
lock private to net/route.c. For this reason rtlabel_id2name() now
copies the label string to an externally passed buffer instead of returning
the address of `rt_labels' list data. This is the way rtlabel_id2sa()
already works.
ok bluhm@
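A usage sketch of the new interface: the caller supplies the buffer,
so no pointer into the mutex-protected list escapes. The signature is
an assumption based on the description above.

char label[RTLABEL_LEN];

if (rtlabel_id2name(rt->rt_labelid, label, sizeof(label)) != NULL)
        printf("route label: %s\n", label);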
|
|
with tweaks from bluhm, claudio and dlg
"I'm fine with it" from claudio
"looks good to me" from dlg
ok bluhm
|
|
tb reports amd64 RAMDISK doesn't build with it.
also, vlan_flags_from_parent doesn't look right. it iterates
over ifnetlist, which is all interfaces in the system, but appears
to assume they're all vlan interfaces and so uses a vlan_softc *
to inspect their if_softc pointers.
|
|
tested by Hrvoje Popovski
with tweaks from bluhm and claudio
encouraged from deraadt
ok bluhm
|
|
Netlock protects `if_list', `ifa_list' and returned `ifa' dereference,
so put netlock assertion within.
Please note, rtable_setsource() doesn't destroy the data pointed to by
`ar_source'. This is the `ifa_addr' data belonging to `ifa', and the
exclusive netlock is required to destroy it. So the kernel lock is not
required within rt_setsource(). Take the netlock in the rt_setsource()
caller to make the `ifa' dereference safe.
Suggestions and ok by bluhm@
|
|
We use both the kernel and net lock to protect `ifnetlist'. This means we
do modifications with both locks held, but for read-only access only one
lock is required. Some places doing an `ifnetlist' foreach loop are protected
by the kernel lock and a context switch can't be introduced there. This is the
exception, so an "XXXSMP:" comment was added.
Proposed and ok by bluhm@
|
|
This is the mbuf(9) allocation and broadcast transmission for PF_ROUTE
sockets; the netlock is not required here.
ok bluhm@
|
|
grabs the exclusive netlock and that is sufficient for in_arpinput()
and arpcache().
with kn@; OK mvs@; tested by Hrvoje Popovski
|
|
ND6 held only a single packet. Unify the logic and add a mbuf
hold queue to struct llinfo_nd6. This is MP safe and queue limits
are tracked with atomic operations. New function if_mqoutput() has
common code for ARP and ND6. ln_saddr6 holds the source address
of the requesting packet. That is easier than fiddling with mbuf
queue in nd6_ns_output().
OK kn@
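A hedged sketch of the common output path; the names follow the
description (if_mqoutput, the hold queue), but the signature and
internals are assumptions.

int
if_mqoutput(struct ifnet *ifp, struct mbuf_queue *mq, unsigned int *total,
    struct sockaddr *dst, struct rtentry *rt)
{
        struct mbuf_list ml;
        struct mbuf *m;
        int error = 0;

        mq_delist(mq, &ml);
        /* queue limits are tracked with atomic operations */
        atomic_sub_int(total, ml_len(&ml));

        while ((m = ml_dequeue(&ml)) != NULL)
                error = (*ifp->if_output)(ifp, m, dst, rt);

        return (error);
}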
|
|
with tweaks from claudio and deraadt
ok claudio, bluhm
|
|
ok bluhm@, claudio@
|
|
Do it like the rest of at/detach routines which modify a struct ifnet
pointer without returning anything.
OK mvs
|
|
dom_if{at,de}tach()
Both made obsolete through struct ifnet's previous *if_nd addition.
IPv6 Neighbour Discovery handles per-interface data directly, nothing
else uses this generic domain API anymore.
They are also visible outside of _KERNEL, but nothing in base uses them
there, either.
OK bluhm mvs claudio
|
|
*if_afdata[] and struct domain's dom_if{at,de}tach() are only used with
IPv6 Neighbour Discovery in6_dom{at,de}tach(), which allocate/init and
free single struct nd_ifinfo.
Set up a new ND-specific *if_nd member directly to avoid yet another
layer of indirection and thus make the generic domain API obsolete.
The per-interface data is only accessed in nd6.c and nd6_nbr.c through
the ND_IFINFO() macro; it is allocated and freed exactly once during
interface at/detach, so document it as [I]mmutable.
OK bluhm mvs claudio
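A sketch of the direct access this enables; ND_IFINFO() is named in
the message above, but its expansion and the attach side here are
assumptions.

#define ND_IFINFO(ifp)  ((struct nd_ifinfo *)(ifp)->if_nd)

/* allocated exactly once at interface attach, freed at detach: [I] */
void
nd6_ifattach(struct ifnet *ifp)
{
        ifp->if_nd = malloc(sizeof(struct nd_ifinfo), M_IP6NDP,
            M_WAITOK | M_ZERO);
}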
|
|
The global interface group list is also protected by the net lock and all
accesses to it (all within if.c) take it accordingly.
Getting all
- members of a group (SIOCGIFGMEMB),
- attributes of a group (SIOCGIFGATTR),
- groups (SIOCGIFGLIST)
are each read-only operations on the global interface group list `ifg_head'.
The global interface list `ifnetlist' or its per-interface group lists are
not used in these ioctls.
OK mvs
|
|
As netintro(4) explains, this copies a bunch of data from the global
interface list as well as its per-interface address lists.
All of this is never written to by ifconf(), protected by the net lock
and documented as such in the struct comments already.
OK mvs
|
|
The per-interface group list is protected by the net lock and already
documented as such.
The global interface group list `ifg_head' is also protected by the net
lock and all accesses to it (all within if.c) take it accordingly.
Feedback OK mvs
|
|
when i first wrote if_idxmap i didn't realise (and no one thought
to tell me) that index 0 was special and means "no interface", so
while here use the 0th slot in the interface map to store the length
of the map instead of prepending the map with a length field.
if_get() now special cases index 0 and returns NULL directly. this
also means the size of the map is now always a power of 2, which
is a nicer fit with what the kernel malloc provides.
the problem with r1.673 that hrvoje popovski found was that attaching
a lot of interfaces during autoconf would lock up because growing the
map called smr_barrier. the fix in this diff is to (ab)use the
usedidx bitmap to store an smr_entry and defer the freeing of the
interface pointer map with it.
tested by hrvoje popovski
tweaks and ok visa@
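a hedged sketch of the index 0 special case described above; the
helpers around the map lookup are illustrative, and the real code
wraps the lookup in an smr read section.

struct ifnet *
if_get(unsigned int index)
{
        struct ifnet *ifp = NULL;

        if (index == 0)         /* 0 means "no interface" */
                return (NULL);

        smr_read_enter();
        /* slot 0 of the map stores the map's length */
        if (index < if_idxmap_limit()) {        /* illustrative helper */
                ifp = if_idxmap_entry(index);   /* illustrative helper */
                if (ifp != NULL)
                        if_ref(ifp);
        }
        smr_read_leave();

        return (ifp);
}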
|
|
if the map has to be reallocated during boot, there's an smr_barrier
waiting for the old map to become unused. that barrier ends up
waiting for cpus that aren't running yet because we haven't finished
booting yet, so boot gets stuck.
found by hrvoje popovski
|
|
OK mvs
|
|
Recommit these two together:
- r1.667 "Push kernel lock into ifioctl_get()"
locked before the switch() without unlocking in its cases
- r1.668 "Push kernel lock inside ifioctl_get()"
locked cases individually, as intended
I messed up splitting commits, but of course, Hrvoje managed to test a
CVS checkout right in between those two.
OK mpi mvs
|
|
when i first wrote if_idxmap i didn't realise (and no one thought
to tell me) that index 0 was special and means "no interface", so
while here use the 0th slot in the interface map to store the length
of the map instead of prepending the map with a length field.
if_get() now special cases index 0 and returns NULL directly. this
also means the size of the map is now always a power of 2, which
is a nicer fit with what the kernel malloc provides.
tweaks and ok visa@
|
|
WITNESS isn't happy with r1.667 "Push kernel lock into ifioctl_get()", so
revert it (including r1.668 and r1.669 depending on it):
witness: userret: returning with the following locks held:
exclusive kernel_lock &kernel_lock r = 0 (0xffffffff82455f58)
#0 witness_lock+0x311
#1 ifioctl_get+0x2e
#2 sys_ioctl+0x2c4
#3 syscall+0x384
#4 Xsyscall+0x128
panic: witness_warn
Stopped at db_enter+0x10: popq %rbp
TID PID UID PRFLAGS PFLAGS CPU COMMAND
* 70588 52613 0 0x3 0 4K pfctl
So back to the drawing board while leaving documentation bits (r1.670).
Thanks Hrvoje.
|
|
|
|
Move up the comment explaining different locks to account for all structs.
OK millert mvs
|
|
ifconfig(8) -C is the only user in base and the if_clone_attach() comment
explains how this list is being built during autoconf(9).
After that it is only ever read. Multiple threads may traverse the list in
parallel and reading the `int' count is atomic.
OK mvs
|
|
After this mechanical move, I can unlock the individual SIOCG* in there.
OK mvs
|
|
Another mechanical diff without semantic changes to avoid churn in actual
unlocking diffs.
OK mpi
|
|
This is a mechanical diff without semantic changes, locking ioctls
individually inside ifioctl() rather than all of them around it.
This allows us to unlock ioctls one by one.
OK mpi
|
|
Naming the list like the struct itself makes for awful grepping.
Call the global variable "ifnetlist" from now on.
There used to be kvm(3) consumers in base picking up this symbol, but those
have long been converted to other interfaces.
A few potential ports users remain, same deal as sys/net/if_var.h r1.116
"Remove struct ifnet's unused if_switchport member": they get bumped.
Previous users pointed out by deraadt
OK bluhm
|
|
The 'proc *' arg is not used for the PRU_CONTROL request, so remove it
from the pru_control() wrapper.
Split out {tcp,udp}6_usrreqs from {tcp,udp}_usrreqs and use them for
inet6 case.
ok guenther@ bluhm@
|
|
This is helpful for the following (*pr_usrreq)() split into multiple
handlers. But right now this makes the code more readable.
Also add '#ifndef _SYS_SOCKETVAR_H_' to sys/socketvar.h. This prevents
collisions when both sys/protosw.h and sys/socketvar.h are included
together. Both 'socket' and 'protosw' structures are required to be
defined before the pru_*() wrappers, so we need to include sys/socketvar.h
in sys/protosw.h.
ok bluhm@
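The guard mentioned above is a standard include guard, sketched here:

#ifndef _SYS_SOCKETVAR_H_
#define _SYS_SOCKETVAR_H_
/* ... contents of sys/socketvar.h ... */
#endif /* _SYS_SOCKETVAR_H_ */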
|
|
NET_RLOCK_IN_IOCTL, which have the same implementation. The R and
W are hard to see; call the new macro NET_LOCK_SHARED. Rename the
opposite assertion from NET_ASSERT_WLOCKED to NET_ASSERT_LOCKED_EXCLUSIVE.
Update some outdated comments about net locking.
OK mpi@ mvs@
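A short usage sketch of the renamed macros, assuming the conventional
unlock pairing.

NET_LOCK_SHARED();
/* ... read-only access to network stack data structures ... */
NET_UNLOCK_SHARED();

NET_ASSERT_LOCKED_EXCLUSIVE();  /* in paths that need the exclusive lock */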
|