Age | Commit message (Collapse) | Author |
|
the l3 protocol input to push the packet is based on a value in
m->m_pkthdr.ph_family, which tunnel drivers should set before calling
if_vinput.
add p2p_bpf_mtap to call bpf_mtap_af also using m->m_pkthdr.ph_family.
|
|
the network stack is now responsible for calling bpf for packets
that the interface receives, and we so far got away with using
bpf_mtap_ether for everything. this doesn't work if layer 3 input
goes through the same functions, so letting drivers specify the
appropriate bpf mtap function means they will be able to cope.
|
|
as signed. u_int used within pipex(4) for consistency with other code.
ok dlg@ mpi@
|
|
ok sashan@
|
|
the interface input handler lists were originally set up to help
us during the intial mpsafe network stack work. at the time not all
the virtual ethernet interfaces (vlan, svlan, bridge, trunk, etc)
were mpsafe, so we wanted a way to avoid them by default, and only
take the kernel lock hit when they were specifically enabled on the
interface. since then, they have been fixed up to be mpsafe.
i could leave the list in place, but it has some semantic problems.
because virtual interfaces filter packets based on the order they
were attached to the parent interface, you can get packets taken
away in surprising ways, especially when you reboot and netstart
does something different to what you did by hand. by hardcoding the
order that things like vlan and bridge get to look at packets, we
can document the behaviour and get consistency.
it also means we can get rid of a use of SRPs which were difficult
to replace with SMRs. the interface input handler list is an SRPL,
which we would like to deprecate. it turns out that you can sleep
during stack processing, which you're not supposed to do with SRPs
or SMRs, but SRPs are a lot more forgiving and it worked.
lastly, it turns out that this code is faster than the input list
handling, so lots of winning all around.
special thanks to hrvoje popovski and aaron bieber for testing.
this has been in snaps as part of a larger diff for over a week.
|
|
ok dlg@ tobhe@
|
|
ok dlg@ tobhe@
|
|
"new" API.
ok dlg@ tobhe@
|
|
capital letters in locking annotations. Therefore harmonize the existing
annotations.
Also, if multiple locks are required they should be delimited using
commas.
ok mpi@
|
|
descriptors runs below the low watermark.
The em(4) firmware seems not to work properly with just a few descriptors in
the receive ring. Thus, we use the low water mark as an indicator instead of
zero descriptors, which causes deadlocks.
ok kettenis@
|
|
if_pcount is only touched in ifpromisc(), and ifpromisc() needs
NET_LOCK anyway because it also modifies if_flags.
suggested by mpi@
ok visa@
|
|
this follows what's been done for detach and link state hooks, and
makes handling of hooks generally more robust.
address hooks are a bit different to detach/link state hooks in
that there's only a few things that register hooks (carp, pf, vxlan),
but a lot of places to run the hooks (lots of ipv4 and ipv6 address
configuration).
an address hook cookie was in struct pfi_kif, which is part of the
pf abi. rather than break pfctl -sI, this maintains the void * used
for the cookie and uses it to store a task, which is then used as
intended with the new api.
|
|
this is largely mechanical, except for carp. this moves the addition
of the carp link state hook after we're committed to using the new
interface as a carpdev. because the add can't fail, we avoid a
complicated unwind dance. also, this tweaks the carp linkstate hook
so it only updates the relevant carp interface, not all of the
carpdevs on the parent.
hrvoje popovski has tested an early version of this diff and it's
generally ok, but there's some splasserts that this diff fires that
i'll fix in an upcoming diff.
ok claudio@
|
|
the main semantic change is that things registering detach hooks
have to allocate and set a task structure that then gets added to
the list. this means if the task is allocated up front (eg, as part
of carps softc or bridges port structure), it avoids the possibility
that adding a hook can fail. a lot of drivers weren't checking for
failure, and unwinding state in the event of failure in other parts
was error prone.
while doing this i discovered that the list operations have to be
in a particular order, but drivers weren't doing that consistently
either. this diff wraps the list ops up so you have to seriously
go out of your way to screw them up.
ive also sprinkled some NET_ASSERT_LOCKED around the list operations
so we can make sure there's no potential for the list to be corrupted,
especially while it's being run.
hrvoje popovski has tested this a bit, and some issues he discovered
have been fixed.
ok sashan@
|
|
IF_WIRELESS_DEFAULT_PRIORITY and use it in umb(4) as default prio.
OK kettenis@, sthen@
|
|
This redefines the ifp <-> bridge relationship. No lock can be
currently used across the multiples contexts where the bridge has
tentacles to protect a pointer, use an interface index.
Tested by various, ok dlg@, visa@
|
|
if_vinput assumes that the interface that its called against uses
per cpu counters so it can count input packets, but basically does
all the things that if_input and ifiq_input do. the main difference
is it assumes the network stack is already running and runs the
interface input handlers directly. this is instead of queuing the
packets for a nettq to run.
ifiqs arent free, especially when they only run per packet like
they do on psuedo interfaces. this allows that overhead to be
bypassed.
|
|
l2 and l3 drivers do the same thing all the time, so reduce the
chance of error by doing the checks once and making it available
for drivers to call instead of rolling on their own again.
|
|
the idea is to call the hardware transmit routine less since in a
lot of cases posting a producer ring update to the chip is (very)
expensive. it's better to do it for several packets instead of each
packet, hence calling this tx mitigation.
this diff defers the call to the transmit routine to a network
taskq, or until a backlog of packets has built up. dragonflybsd
uses 16 as the size of it's backlog, so i'm copying them for now.
i've tried this before, but previous versions caused deadlocks. i
discovered that the deadlocks in the previous version was from
ifq_barrier calling taskq_barrier against the nettq. interfaces
generally hold NET_LOCK while calling ifq_barrier, but the tq might
already be waiting for the lock we hold.
this version just doesnt have ifq_barrier call taskq_barrier. it
instead relies on the IFF_RUNNING flag and normal ifq serialiser
barrier to guarantee the start routine wont be called when an
interface is going down. the taskq_barrier is only used during
interface destruction to make sure the task struct wont get used
in the future, which is already done without the NET_LOCK being
held.
tx mitigation provides a nice performanace bump in some setups. up
to 25% in some cases.
tested by tb@ and hrvoje popovski (who's running this in production).
ok visa@
|
|
a valid reference to the corresponding `ifp'.
ok visa@
|
|
if_enqueue() still makes sure packets get handled by pf on the way
out, and seen by bridge if needed. however instead of falling through
to ifq mapping and output, it now calls a function pointer in the
ifnet struct. that pointer defaults to the ifq handling, but drivers
can override it to bypass ifq processing.
the most obvious users of the function pointer will be virtual
interfaces, eg, vlan(4). ifqs are good if you need to serialise
access to the thing that transmits packets (like hardware rings on
nics), or mitigate the number of times you do ring processing, but
neither of those things are desirable on vlan interfaces. ideally
vlan could transmit on any cpu without having packets serialised
by it's own ifq before being pushed down to an arbitrary number of
rings on the parent interface. bypassing ifqs means the driver can
push the vlan tag on concurrently and push down to the parent frmo
any cpu.
ok mpi@
no objection from claudio@
|
|
define to IFNET_SLOWTIMO since it is no longer a hz divisor.
OK visa@ bluhm@ kn@
|
|
it isn't implemented, and is never called.
|
|
these exist so interfaces that want to do mpsafe work outside the
ifq machinery have a place to allocate and update stats in. the
generic ioctl handling for getting stats to userland knows how to
roll the new per cpu stats into the rest before export.
ok visa@
|
|
so we can let go if_cloners_lock.
OK tb@, claudio@, bluhm@, kn@, henning@
|
|
currently carp uses a struct carp_if to hold an srp list head, which
is accessed by both if_carp in struct ifnet, and via the if input
handlers list.
this gets rid of some indirection by making if_carp itself the list
head, rather than a pointer to the list head via a struct carp_if.
it also makes accessing the list consistent by only using if_carp
to get to it.
ok mpi@
|
|
OK mpi@
|
|
them as M_TEMP.
ok visa@
|
|
is protected by which lock.
ok bluhm@, visa@
|
|
currently there is a single mbuf_queue per interface, which all
rings on a nic shove packets onto. while the list inside this queue
is protected by a mutex, the counters around it (ie, ipackets,
ibytes, idrops) are not. this means updates can be lost, and reading
the statistics is also inconsistent. having a single queue means
that busy rx rings can dominate and then starve the others.
ifiqueue structs are like ifqueue structs. they provide per ring
queues, and independent counters for each ring. when ifdata is read
for userland, these counters are aggregated. having a queue per
ring now allows for per ring backpressure to be applied. MCLGETI
will have it's day again.
right now we assume every interface wants an input queue and
unconditionally provide one. individual interfaces can opt into
more.
im not completely happy about the shape of this atm, but shuffling
it around more makes the diff bigger.
ok visa@
|
|
right now the rx ring moderation code makes a decision globally
that a machine is livelocked, and uses that to apply backpressure
on all the rx rings. we're moving toward having the network stack
run on multiple cpus, and fed from multiple rx rings. if_rxr_livelocked
lets a driver apply backpressure explicitely if something tells it
that whatever is consuming previous packets cannot keep up.
while here expose the current ring watermark with if_rxr_cwm.
tweaks and ok visa@
|
|
NOTE: code still runs with single softnet task. change definition of
SOFTNET_TASKS in net/if.c, if you want to have more than one softnet task
OK mpi@, OK phessler@
|
|
ok visa@, bluhm@, deraadt@
|
|
* don't share mifs (multicast interface) between rdomains
* allow multiple routing sockets connected at the same time if they are
in different rdomains.
ok bluhm@
|
|
an ifq to transmit a packet is picked by the current traffic
conditioner (ie, priq or hfsc) by providing an index into an array
of ifqs. by default interfaces get a single ifq but can ask for
more using if_attach_queues().
the vast majority of our drivers still think there's a 1:1 mapping
between interfaces and transmit queues, so their if_start routines
take an ifnet pointer instead of a pointer to the ifqueue struct.
instead of changing all the drivers in the tree, drivers can opt
into using an if_qstart routine and setting the IFXF_MPSAFE flag.
the stack provides a compatability wrapper from the new if_qstart
handler to the previous if_start handlers if IFXF_MPSAFE isnt set.
enabling hfsc on an interface configures it to transmit everything
through the first ifq. any other ifqs are left configured as priq,
but unused, when hfsc is enabled.
getting this in now so everyone can kick the tyres.
ok mpi@ visa@ (who provided some tweaks for cnmac).
|
|
complaining that assigning the MULTICAST flag, which sets the
uppermost bit, would invert the meaning of MULTICAST flag's
numeric value.
ok claudio@ deraadt@ tom@ visa@
|
|
configuration and instead use ifnet to store the configuration and
counters. With this we can safely use multicast routing daemons on
multiple domains without vif id colisions.
ok mpi@
|
|
In order to stop abusing lo0 for all rdomains, a new loopback interface
will be created every time a rdomain is created. The unit number will
be the same as the rdomain, i.e. lo1 will be attached to rdomain 1.
If this loopback interface is already in use it wont be possible to create
the corresponding rdomain.
In order to know which lo(4) interface is attached to a rdomain, its index
is stored in the rtable/rdomain map.
This is a long overdue since the introduction of rtable/rdomain. It also
fixes a recent regression due to resetting the rdomain of an incoming
packet reported by semarie@, Andreas Bartelt and Nils Frohberg.
ok claudio@
|
|
ok vgross@
|
|
device, inherit the rdomain from the calling process. This adds an
rdomain argument to if_clone_create().
OK mpi@ henning@
|
|
Reduce the number of if_get/if_put from one per packet to one per ring
since we now know that all the packets are coming from the same interface.
Improve forwarding performances by 10Kpps in Hrvoje Popovski's test setup.
ok bluhm@, henning@, dlg@
|
|
switch(4) currently supports OpenFlow 1.3.5.
Currently, it's disabled by the kernel config.
With help from yasuoka@ reyk@ jsg@.
ok deraadt@ yasuoka@ reyk@ henning@
|
|
to ifconfig.
"llprio" allows one to set the priority of packets that do not go through
pf(4), as the case is for arp(4) or bpf(4).
ok sthen@ mikeb@
|
|
theyre currently unused, so no functional change.
|
|
|
|
ok mpi@
|
|
|
|
work is represented by struct task.
the start routine is now wrapped by a task which is serialised by the
infrastructure. if_start_barrier has been renamed to ifq_barrier and
is now implemented as a task that gets serialised with the start
routine.
this also adds an ifq_restart() function. it serialises a call to
ifq_clr_oactive and calls the start routine again. it exists to
avoid a race that kettenis@ identified in between when a start
routine discovers theres no space left on a ring, and when it calls
ifq_set_oactive. if the txeof side of the driver empties the ring
and calls ifq_clr_oactive in between the above calls in start, the
queue will be marked oactive and the stack will never call the start
routine again.
by serialising the ifq_set_oactive call in the start routine and
ifq_clr_oactive calls we avoid that race.
tested on various nics
ok mpi@
|
|
ok mpi@
|
|
the intention is to make it more clear what belongs to a transmit
queue and what belongs to an interface.
suggested by and ok mpi@
|