Age | Commit message | Author |
|
|
|
interface descriptor. It panics during attach of the em(4) device at
boot.
|
|
descriptor.
We have a mess in network interface statistics. Only pseudo drivers
allocate per-CPU counters; all other network devices use the old
`if_data'. The network stack partially uses per-CPU counters and
partially uses `if_data', but the protection is inconsistent: sometimes
counters are accessed with the exclusive netlock, sometimes with the
shared netlock, sometimes with the kernel lock but without the netlock,
and sometimes with other locks.
To make network interface statistics more consistent, always allocate
per-CPU counters at interface attachment time and use them instead of
`if_data'. At this step only the counters allocation is moved into the
if_attach() internals. The `if_data' removal will be performed in
following diffs to make review and tests easier.
ok bluhm
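For illustration, a hedged sketch of how an input path updates the per-CPU
counters that if_attach() now allocates for every interface, using the
counters(9) API and the ifc_* indices from if_var.h; ifp and m stand for the
usual interface and mbuf variables:

    /* count one received packet and its bytes on this CPU */
    counters_pkt(ifp->if_counters, ifc_ipackets, ifc_ibytes,
        m->m_pkthdr.len);

    /* count an input error */
    counters_inc(ifp->if_counters, ifc_ierrors);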
|
|
OK mvs@
|
|
When doing LRO (Large Receive Offload), the drivers, currently ix(4)
and lo(4) only, record an upper bound of the size of the original
packets in ph_mss. When sending, either stack or hardware must
chop the packets with TSO (TCP Segmentation Offload) to that size.
That means we have to call tcp_if_output_tso() before ifp->if_output().
Put that logic into if_output_tso() to avoid code duplication. As
TCP packets on the wire do not get larger that way, path MTU discovery
should still work.
tested by and OK jan@
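For illustration, a minimal sketch of the intended call pattern in an output
path, assuming the five-argument if_output_tso() form; m, dst and rt are the
usual local names:

    int error;

    error = if_output_tso(ifp, &m, dst, rt, ifp->if_mtu);
    if (error || m == NULL)
        return (error);    /* consumed: chopped by the stack or sent with TSO */

    /* m was not a large LRO packet, send it the normal way */
    return (ifp->if_output(ifp, m, dst, rt));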
|
|
Replace hand-rolled reference counting with refcnt_init(9) and hook it up
with a new dt(4) probe.
OK bluhm mvs
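For reference, a minimal sketch of the refcnt(9) pattern this converts to;
the object and wait channel names are hypothetical, and the dt(4) visibility
comes from the refcnt probes firing on these calls:

    struct refcnt obj_refs;

    refcnt_init(&obj_refs);              /* initial reference */
    refcnt_take(&obj_refs);              /* one per additional user */
    refcnt_rele_wake(&obj_refs);         /* drop a reference, wake a finalizer */
    refcnt_finalize(&obj_refs, "objrm"); /* sleep until the last user is gone */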
|
|
ok jmc@ guenther@ tb@
|
|
if_output_ml() to send mbuf lists to interfaces. This can be used
for TSO, fragments, ARP and ND6. Rename variable fml to ml. In
pf_route6() split the if else block. Put the safety check (hlen +
firstlen < tlen) into ip_fragment(). It makes the code correct in
case the packet is too short to be fragmented. This should not
happen, but other functions also have this logic.
No functional change. OK sashan@
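A hedged sketch of the resulting IPv4 pattern, assuming ip_fragment()'s
mbuf-list form and a four-argument if_output_ml(); dst and rt are
placeholders for the destination and route:

    struct mbuf_list ml;
    int error;

    error = ip_fragment(m, &ml, ifp, ifp->if_mtu);  /* chop m onto ml */
    if (error)
        return (error);
    return (if_output_ml(ifp, &ml, dst, rt));       /* send every fragment */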
|
|
Netlock protects `if_list', `ifa_list' and the returned `ifa' dereference,
so put a netlock assertion within.
Please note, rtable_setsource() doesn't destroy the data pointed to by
`ar_source'. This is the `ifa_addr' data belonging to `ifa', and the
exclusive netlock is required to destroy it. So the kernel lock is not
required within rt_setsource(). Take the netlock in the rt_setsource()
caller to make the `ifa' dereference safe.
Suggestions and ok by bluhm@
|
|
We use both the kernel and the net lock to protect `ifnetlist'. This
means modifications are done with both locks held, but for read-only
access only one lock is required. Some places doing an `ifnetlist'
foreach loop are protected by the kernel lock and a context switch can't
be introduced there. This is the exception, so an "XXXSMP:" comment is
added.
Proposed and ok by bluhm@
|
|
ND6 held only a single packet. Unify the logic and add an mbuf
hold queue to struct llinfo_nd6. This is MP safe and queue limits
are tracked with atomic operations. The new function if_mqoutput() has
common code for ARP and ND6. ln_saddr6 holds the source address
of the requesting packet. That is easier than fiddling with the mbuf
queue in nd6_ns_output().
OK kn@
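A rough sketch of the hold queue pattern with the MP-safe mbuf_queue API;
the field name and the queue limit below are assumptions for illustration:

    /* in struct llinfo_nd6 */
    struct mbuf_queue ln_mq;            /* packets waiting for resolution */

    /* at neighbour cache entry creation */
    mq_init(&ln->ln_mq, 16, IPL_SOFTNET);

    /* while the address is still unresolved */
    if (mq_enqueue(&ln->ln_mq, m) != 0)
        return;                         /* queue full: m was freed, drop counted */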
|
|
dom_if{at,de}tach()
Both were made obsolete by struct ifnet's previous *if_nd addition.
IPv6 Neighbour Discovery handles per-interface data directly, nothing
else uses this generic domain API anymore.
They are also visible outside of _KERNEL, but nothing in base uses them, either.
OK bluhm mvs claudio
|
|
*if_afdata[] and struct domain's dom_if{at,de}tach() are only used with
IPv6 Neighbour Discovery in6_dom{at,de}tach(), which allocate/init and
free a single struct nd_ifinfo.
Set up a new ND-specific *if_nd member directly to avoid yet another
layer of indirection and thus make the generic domain API obsolete.
The per-interface data is only accessed in nd6.c and nd6_nbr.c through
the ND_IFINFO() macro; it is allocated and freed exactly once during
interface at/detach, so document it as [I]mmutable.
OK bluhm mvs claudio
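A sketch of the resulting accessor; the exact macro spelling is an
assumption, but it matches the description above:

    #define ND_IFINFO(ifp)  ((struct nd_ifinfo *)(ifp)->if_nd)

    struct nd_ifinfo *ndi = ND_IFINFO(ifp);  /* [I]: stable for the interface's lifetime */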
|
|
The per-interface group list is protected by the net lock and already
documented as such.
The global interface group list `ifg_head' is also protected by the net
lock and all access to it (all within if.c) takes it accordingly.
Feedback OK mvs
|
|
|
|
Move up the comment explaining the different locks to account for all structs.
OK millert mvs
|
|
Naming the list like the struct itself makes for awful grepping.
Call the global variable "ifnetlist" from now on.
There used to be kvm(3) consumers in base picking up this symbol, but those
have long been converted to other interfaces.
A few potential ports users remain, same deal as sys/net/if_var.h r1.116
"Remove struct ifnet's unused if_switchport member": they get bumped.
Previous users pointed out by deraadt
OK bluhm
|
|
This is a switch(4) left-over.
Even though it is defined under _KERNEL, a few ports do define it and
include <net/if_var.h>, so this removal warrants a REVISION bump for all
potential ports consumers (once ports bulk machines run on a snapshot
containing this commit).
OK mvs
|
|
There was a crash due to use after free of the ifa although it is
ref counted. As ifa_refcnt was a simple integer increment, there
may be a path where multiple CPUs access it concurrently. So change
to struct refcnt which is MP safe and provides dt(4) leak debugging.
The link level address for IPsec enc(4) and various MPLS interfaces is
special. There the ifa is part of the driver's softc. Use the refcount
anyway and add a panic to detect use after free.
bug report stsp@; OK mvs@
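Sketched against the existing ifaref()/ifafree() helpers, with struct refcnt
replacing the plain integer:

    refcnt_init(&ifa->ifa_refcnt);  /* when the address is created */

    ifaref(ifa);                    /* now a refcnt_take(&ifa->ifa_refcnt) */
    ifafree(ifa);                   /* refcnt_rele(); frees only on the last release */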
|
|
the l3 protocol input to push the packet is based on a value in
m->m_pkthdr.ph_family, which tunnel drivers should set before calling
if_vinput.
add p2p_bpf_mtap to call bpf_mtap_af also using m->m_pkthdr.ph_family.
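for illustration, a tunnel driver's input path then looks roughly like this
(assuming the two argument if_vinput form):

    m->m_pkthdr.ph_family = AF_INET;    /* or AF_INET6, AF_MPLS, ... */
    if_vinput(ifp, m);                  /* stack picks the l3 input from ph_family */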
|
|
the network stack is now responsible for calling bpf for packets
that the interface receives, and we so far got away with using
bpf_mtap_ether for everything. this doesn't work if layer 3 input
goes through the same functions, so letting drivers specify the
appropriate bpf mtap function means they will be able to cope.
|
|
as signed. u_int used within pipex(4) for consistency with other code.
ok dlg@ mpi@
|
|
ok sashan@
|
|
the interface input handler lists were originally set up to help
us during the initial mpsafe network stack work. at the time not all
the virtual ethernet interfaces (vlan, svlan, bridge, trunk, etc)
were mpsafe, so we wanted a way to avoid them by default, and only
take the kernel lock hit when they were specifically enabled on the
interface. since then, they have been fixed up to be mpsafe.
i could leave the list in place, but it has some semantic problems.
because virtual interfaces filter packets based on the order they
were attached to the parent interface, you can get packets taken
away in surprising ways, especially when you reboot and netstart
does something different to what you did by hand. by hardcoding the
order that things like vlan and bridge get to look at packets, we
can document the behaviour and get consistency.
it also means we can get rid of a use of SRPs which were difficult
to replace with SMRs. the interface input handler list is an SRPL,
which we would like to deprecate. it turns out that you can sleep
during stack processing, which you're not supposed to do with SRPs
or SMRs, but SRPs are a lot more forgiving and it worked.
lastly, it turns out that this code is faster than the input list
handling, so lots of winning all around.
special thanks to hrvoje popovski and aaron bieber for testing.
this has been in snaps as part of a larger diff for over a week.
|
|
ok dlg@ tobhe@
|
|
ok dlg@ tobhe@
|
|
"new" API.
ok dlg@ tobhe@
|
|
capital letters in locking annotations. Therefore harmonize the existing
annotations.
Also, if multiple locks are required they should be delimited using
commas.
ok mpi@
|
|
descriptors runs below the low watermark.
The em(4) firmware seems not to work properly with just a few descriptors in
the receive ring. Thus, we use the low water mark as the indicator instead
of waiting for zero descriptors, which would cause deadlocks.
ok kettenis@
|
|
if_pcount is only touched in ifpromisc(), and ifpromisc() needs
NET_LOCK anyway because it also modifies if_flags.
suggested by mpi@
ok visa@
|
|
this follows what's been done for detach and link state hooks, and
makes handling of hooks generally more robust.
address hooks are a bit different to detach/link state hooks in
that there's only a few things that register hooks (carp, pf, vxlan),
but a lot of places to run the hooks (lots of ipv4 and ipv6 address
configuration).
an address hook cookie was in struct pfi_kif, which is part of the
pf abi. rather than break pfctl -sI, this maintains the void * used
for the cookie and uses it to store a task, which is then used as
intended with the new api.
|
|
this is largely mechanical, except for carp. this moves the addition
of the carp link state hook after we're committed to using the new
interface as a carpdev. because the add can't fail, we avoid a
complicated unwind dance. also, this tweaks the carp linkstate hook
so it only updates the relevant carp interface, not all of the
carpdevs on the parent.
hrvoje popovski has tested an early version of this diff and it's
generally ok, but there's some splasserts that this diff fires that
i'll fix in an upcoming diff.
ok claudio@
|
|
the main semantic change is that things registering detach hooks
have to allocate and set a task structure that then gets added to
the list. this means if the task is allocated up front (eg, as part
of carps softc or bridges port structure), it avoids the possibility
that adding a hook can fail. a lot of drivers weren't checking for
failure, and unwinding state in the event of failure in other parts
was error prone.
while doing this i discovered that the list operations have to be
in a particular order, but drivers weren't doing that consistently
either. this diff wraps the list ops up so you have to seriously
go out of your way to screw them up.
ive also sprinkled some NET_ASSERT_LOCKED around the list operations
so we can make sure there's no potential for the list to be corrupted,
especially while it's being run.
hrvoje popovski has tested this a bit, and some issues he discovered
have been fixed.
ok sashan@
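roughly what registration looks like now, with the task embedded in a
hypothetical softc so adding the hook cannot fail:

    struct example_softc {
        struct task     sc_dtask;               /* detach hook, allocated up front */
    };

    task_set(&sc->sc_dtask, example_parent_detach, sc);
    if_detachhook_add(ifp0, &sc->sc_dtask);     /* cannot fail */

    if_detachhook_del(ifp0, &sc->sc_dtask);     /* when we let go of the parent */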
|
|
IF_WIRELESS_DEFAULT_PRIORITY and use it in umb(4) as default prio.
OK kettenis@, sthen@
|
|
This redefines the ifp <-> bridge relationship. No lock can currently
be used across the multiple contexts where the bridge has tentacles to
protect a pointer, so use an interface index instead.
Tested by various, ok dlg@, visa@
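The pattern, sketched with hypothetical member names: remember the if_index
and translate it back with if_get(9) whenever the pointer is needed:

    /* when joining: remember the bridge by index, not by pointer */
    sc->sc_bridgeidx = ifp0->if_index;

    /* later, wherever the bridge is needed */
    struct ifnet *brifp = if_get(sc->sc_bridgeidx);
    if (brifp != NULL) {
        /* brifp stays valid until the matching if_put() */
        if_put(brifp);
    }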
|
|
if_vinput assumes that the interface that it's called against uses
per cpu counters so it can count input packets, but basically does
all the things that if_input and ifiq_input do. the main difference
is it assumes the network stack is already running and runs the
interface input handlers directly. this is instead of queuing the
packets for a nettq to run.
ifiqs aren't free, especially when they only run per packet like
they do on pseudo interfaces. this allows that overhead to be
bypassed.
|
|
l2 and l3 drivers do the same thing all the time, so reduce the
chance of error by doing the checks once and making it available
for drivers to call instead of rolling their own again.
|
|
the idea is to call the hardware transmit routine less since in a
lot of cases posting a producer ring update to the chip is (very)
expensive. it's better to do it for several packets instead of each
packet, hence calling this tx mitigation.
this diff defers the call to the transmit routine to a network
taskq, or until a backlog of packets has built up. dragonflybsd
uses 16 as the size of its backlog, so i'm copying them for now.
i've tried this before, but previous versions caused deadlocks. i
discovered that the deadlocks in the previous version were from
ifq_barrier calling taskq_barrier against the nettq. interfaces
generally hold NET_LOCK while calling ifq_barrier, but the tq might
already be waiting for the lock we hold.
this version just doesn't have ifq_barrier call taskq_barrier. it
instead relies on the IFF_RUNNING flag and normal ifq serialiser
barrier to guarantee the start routine won't be called when an
interface is going down. the taskq_barrier is only used during
interface destruction to make sure the task struct won't get used
in the future, which is already done without the NET_LOCK being
held.
tx mitigation provides a nice performance bump in some setups. up
to 25% in some cases.
tested by tb@ and hrvoje popovski (who's running this in production).
ok visa@
|
|
a valid reference to the corresponding `ifp'.
ok visa@
|
|
if_enqueue() still makes sure packets get handled by pf on the way
out, and seen by bridge if needed. however instead of falling through
to ifq mapping and output, it now calls a function pointer in the
ifnet struct. that pointer defaults to the ifq handling, but drivers
can override it to bypass ifq processing.
the most obvious users of the function pointer will be virtual
interfaces, eg, vlan(4). ifqs are good if you need to serialise
access to the thing that transmits packets (like hardware rings on
nics), or mitigate the number of times you do ring processing, but
neither of those things are desirable on vlan interfaces. ideally
vlan could transmit on any cpu without having packets serialised
by its own ifq before being pushed down to an arbitrary number of
rings on the parent interface. bypassing ifqs means the driver can
push the vlan tag on concurrently and push down to the parent from
any cpu.
ok mpi@
no objection from claudio@
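for illustration, a virtual driver bypassing its ifq would look roughly like
this; the driver names are hypothetical and the parent is looked up by index:

    int
    example_enqueue(struct ifnet *ifp, struct mbuf *m)
    {
        struct example_softc *sc = ifp->if_softc;
        struct ifnet *ifp0;
        int error;

        ifp0 = if_get(sc->sc_ifidx0);       /* parent interface */
        if (ifp0 == NULL) {
            m_freem(m);
            return (ENXIO);
        }
        error = if_enqueue(ifp0, m);        /* hand straight to the parent */
        if_put(ifp0);
        return (error);
    }

    /* in the attach path, after if_attach() */
    ifp->if_enqueue = example_enqueue;      /* default stays the ifq path */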
|
|
define to IFNET_SLOWTIMO since it is no longer a hz divisor.
OK visa@ bluhm@ kn@
|
|
it isn't implemented, and is never called.
|
|
these exist so interfaces that want to do mpsafe work outside the
ifq machinery have a place to allocate and update stats in. the
generic ioctl handling for getting stats to userland knows how to
roll the new per cpu stats into the rest before export.
ok visa@
|
|
so we can let go of if_cloners_lock.
OK tb@, claudio@, bluhm@, kn@, henning@
|
|
currently carp uses a struct carp_if to hold an srp list head, which
is accessed by both if_carp in struct ifnet, and via the if input
handlers list.
this gets rid of some indirection by making if_carp itself the list
head, rather than a pointer to the list head via a struct carp_if.
it also makes accessing the list consistent by only using if_carp
to get to it.
ok mpi@
|
|
OK mpi@
|
|
them as M_TEMP.
ok visa@
|
|
is protected by which lock.
ok bluhm@, visa@
|
|
currently there is a single mbuf_queue per interface, which all
rings on a nic shove packets onto. while the list inside this queue
is protected by a mutex, the counters around it (ie, ipackets,
ibytes, idrops) are not. this means updates can be lost, and reading
the statistics is also inconsistent. having a single queue means
that busy rx rings can dominate and then starve the others.
ifiqueue structs are like ifqueue structs. they provide per ring
queues, and independent counters for each ring. when ifdata is read
for userland, these counters are aggregated. having a queue per
ring now allows for per ring backpressure to be applied. MCLGETI
will have its day again.
right now we assume every interface wants an input queue and
unconditionally provide one. individual interfaces can opt into
more.
im not completely happy about the shape of this atm, but shuffling
it around more makes the diff bigger.
ok visa@
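a sketch of a driver with multiple rx rings opting in and feeding them,
assuming if_attach_iqueues() and the two argument ifiq_input():

    if_attach_iqueues(ifp, sc->sc_nqueues);     /* one ifiqueue per rx ring */

    /* in the rx interrupt handler for ring i */
    struct mbuf_list ml = MBUF_LIST_INITIALIZER();

    ml_enqueue(&ml, m);                         /* for each packet off the ring */
    ifiq_input(ifp->if_iqs[i], &ml);            /* per ring counters and queueing */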
|
|
right now the rx ring moderation code makes a decision globally
that a machine is livelocked, and uses that to apply backpressure
on all the rx rings. we're moving toward having the network stack
run on multiple cpus, and fed from multiple rx rings. if_rxr_livelocked
lets a driver apply backpressure explicitly if something tells it
that whatever is consuming previous packets cannot keep up.
while here expose the current ring watermark with if_rxr_cwm.
tweaks and ok visa@
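roughly how a driver uses it; the signal that the consumer is behind is
driver specific and only a placeholder here:

    /* rx refill path, bounded by the ring's current watermark */
    slots = if_rxr_get(&sc->sc_rxring, nslots);

    /* when the driver learns its consumer cannot keep up (placeholder check) */
    if (rx_consumer_is_behind)
        if_rxr_livelocked(&sc->sc_rxring);      /* apply backpressure to this ring */

    printf("%s: rx ring watermark %u\n", ifp->if_xname,
        if_rxr_cwm(&sc->sc_rxring));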
|