Age | Commit message (Collapse) | Author |
|
OK mvs@
|
|
this helps narrow down where some "output failures" on sec interfaces
occur.
based on discussion with jason tubnor
|
|
the most interesting information exposed here is the number of times
a port changes state according to the lacp state machine. if a port
is flapping, it's hard to see if you only look at the current state.
getting a count of changes over time makes problems a lot more
visible and therefore fixable.
this also exposes counters around how the lacp protocol packets.
all of these can be useful when trying to line up behaviors with
another system (eg, a switch).
ok jmatthew@
|
|
a port here is a physical interface used by an aggr.
this leaves the low bits for a physical interface to use to pick a
tx ring. without this, aggr used low bits for port selection, which
takes bits away from the ring selection, which can lead to uneven
distribution of packets over tx rings.
ive been running this in production for well over a year now.
|
|
corresponding mutex(9)es.
ifq_start() and following wg_qstart() could be called from software
interrupt context if bandwidth control is enabled in pf.conf(5). Remove
sleep points provided by rwlock(9)s from wg(4) output start routine.
looks ok claudio
|
|
|
|
It breaks NFS.
ok claudio@
|
|
OK claudio@
|
|
Fill and check the cache and call rtalloc_mpath() together. Then
the caller of route_mpath() does not have to care about the uint32_t
*src pointer and just pass struct in_addr. All the conversions are
done inside the functions. ro->ro_rt is either valid or NULL. Note
that some places have a stricter rtisvalid() now compared to the
previous NULL check.
OK claudio@
|
|
Pass source address to route_cache() and store it in struct route.
Cached multipath routes are only valid if source address matches.
If sysctl multipath changes, increase route generation number.
OK claudio@
|
|
For LRO with ix(4) it is necessary to detect ethernet padding.
Extract ip_len and ip6_plen from the mbuf and provide it to the
drivers.
Add extended sanitity checks, like IP packet is shorter than TCP
header. This prevents offloading to network hardware with bougus
packets.
Also iphlen of extracted headers contains header length for IPv4
and IPv6, to make code in drivers simpler.
OK mglocker@
|
|
Several drivers need IPv4 header length and TCP offset for checksum
offload, TSO and LRO. Accessing these fields directly caused crashes
on sparc64 due to misaligned access. It cannot be guaranteed that
IP and TCP header is 4 byte aligned in driver level. Also gcc 4.2.1
assumes that bit fields can be accessed with 32 bit load instructions.
Use memcpy() in ether_extract_headers() to get the bits from IPv4
and TCP header and store the header length in struct ether_extracted.
From there network drivers can esily use it without caring about
alignment and bit shift. Do some sanity checks with the length
values to prevent that invalid values from evil packets get stored
into hardware registers. If check fails, clear the pointer to the
header to hide it from the driver. Add debug prints that help to
figure out the reason for bad packets and provide information when
debugging drivers.
OK mglocker@
|
|
Use a common struct route for both inet and inet6. Unfortunately
struct sockaddr is shorter than sockaddr_in6, so netinet/in.h has
to be exposed from net/route.h. Struct route has to be bsd visible
for userland as netstat kvm code inspects inp_route. Internet PCB
and TCP SYN cache can use a plain struct route now. All specific
sockaddr types for inet and inet6 are embeded there.
OK claudio@
|
|
The route_cache() function can easily return whether it was a cache
hit or miss. Then the logic to perform a route lookup gets a bit
simpler. Some more complicated if (ro->ro_rt == NULL) checks still
exist elsewhere.
Also use route cache in in_pcbselsrc() instead of filling struct
route manually.
OK claudio@
|
|
|
|
Implement route6_cache() to check whether the cached route is still
valid and otherwise fill caching parameter of struct route_in6.
Also count cache hits and misses in netstat. in_pcbrtentry() uses
route cache now.
OK claudio@
|
|
original bug report from syzkaller
Reported-by: syzbot+d19060a65721eb432a72@syzkaller.appspotmail.com
broken fix found by Hrvoje Popovski
hint to the problem and OK deraadt@
|
|
To optimize route caching, count cache hits and misses. This is
shown in netstat -s for both inet and inet6. Reuse the old IPv6
forward cache counter. Sort ip6s_wrongif consistently. For now
only IPv4 cache counter has been implemented.
OK mvs@
|
|
[1] that if_downall() tries to send route messages and triggers panic
again but in knote(9) layer.
1. https://syzkaller.appspot.com/bug?extid=d19060a65721eb432a72
ok bluhm
|
|
This prevents gcc3's 'parameter has incomplete type' warning that
causes kernel build failure.
Suggested by claudio@, ok bluhm@
|
|
The outgoing route is cached at the inpcb. This cache was only
invalidated when the socket closes or if the route gets invalid.
More specific routes were not detected. Especially with dynamic
routing protocols, sockets must be closed and reopened to use the
correct route. Running ping during a route change shows the problem.
To solve this, add a route generation number that is updated whenever
the routing table changes. The lookup in struct route is put into
the route_cache() function. If the generation number is too old,
the cached route gets discarded.
Implement route_cache() for ip_output() and ip_forward() first.
IPv6 and more places will follow.
OK claudio@
|
|
Thus, dhcpleased accept non-calculated checksums which were verified by
hardware/hypervisor.
With tweaks from dlg@
ok bluhm@
mkay tobhe@
|
|
sec(4) was already looking for this mbuf tag so it could drop packets
that had already been sent out on the same interface, but i forgot
the code that adds the tag.
this was reported by jason tubnor who experienced spins/lockups
when using sec and a physical interface was disconnected. rather
than being a locking problem like we initially assumed, it turned
out that unplugging a physical interface caused a route for ipsec
encapsulated traffic to go out over sec(4), causing the packet to
loop in the stack.
the fix was also tested and verified by jason. sorry for taking so
long to look at it.
|
|
`pipex_session_list' foreach walkthrough with `pipex_list_mtx' mutex(9)
relocking. It inserts special item after acquired `session' and keeps it
linked until `session' release. Only owner can unlink it's own item, so
the LIST_NEXT(session) is always valid even the `session' was unlinked.
The iterator skips special items at the `session' acquisition time, as
all other foreach loops where `pipex_list_mtx' mutex(9) is not relocked.
ok yasuoka
|
|
ok yasuoka
|
|
ok bluhm
|
|
Syzkaller with witness complains about lock ordering of pf lock
with socket lock. Socket lock for inet is taken before pf lock.
Pf lock is taken before socket lock for route. This is a false
positive as route and inet socket locks are distinct. Witness does
not know this. Name the socket lock like the domain of the socket,
then rwlock name is used in witness lo_name subtype. Make domain
names more consistent for locking, they were not used anyway.
Regardless of witness problem, unique lock name for each socket
type make sense.
Reported-by: syzbot+34d22dcbf20d76629c5a@syzkaller.appspotmail.com
Reported-by: syzbot+fde8d07ba74b69d0adfe@syzkaller.appspotmail.com
OK mvs@
|
|
Having two hash tables instead of a common one, reduces table size
and contention on the per table lock. The address family is always
known in advance. The lookups and loops are more specific.
OK sashan@
|
|
Counting multicast packets sent to local stack or packets that are
reflected by simplex interfaces does not make much sense. They are
neither received nor output by any ethernet device. Counting these
packets at lo0 or the loopback interface of the routing domain would
be possible, but is not worth the effort. Make if_input_local()
MP safe by deleting the if_opackets++ code.
OK mvs@
|
|
Doing KERNEL_LOCK() just before NET_LOCK() does not make sense.
Net lock is a rwlock that releases kernel lock during sleep. To
avoid an unnecessary release and take kernel lock cycle, move
KERNEL_LOCK() after NET_LOCK().
There is no lock order reversal deadlock issue. Both locks are
used in any order thoughout the kernel. As NET_LOCK() releases the
kernel lock when it cannot take the lock immediately and has to
sleep, we always end in the order kernel lock before net lock after
sleeping.
OK sashan@
|
|
Introduce global mutex to protect the pointers between pf state key
and internet PCB. Then in_pcbdisconnect() and in_pcbdetach() do
not need exclusive netlock anymore. Use a bunch of read once
unlocked access to reduce performance impact.
OK sashan@
|
|
ok bluhm sashan
|
|
|
|
Create and use the MP safe version of the interface counters for
lo(4). Input packets were counted twice. As interface input queue
is already counting, remove input count in if_input_local().
Multicast and siplex packets are counted at the ethernet interface.
Add a comment that this not MP safe.
OK mvs@
|
|
OK bluhm@
|
|
interface descriptor. It panics during attach of em(4) device at
boot.
|
|
descriptor.
We have the mess in network interface statistics. Only pseudo drivers
do per-CPU counters allocation, all other network devices use the old
`if_data'. The network stack partially uses per-CPU counters and
partially use `if_data', but the protection is inconsistent: some times
counters accessed with exclusive netlock, some times with shared
netlock, some times with kernel lock, but without netlock, some times
with another locks.
To make network interfaces statistics more consistent, always allocate
per-CPU counters at interface attachment time and use it instead of
`if_data'. At this step only move counters allocation to the if_attach()
internals. The `if_data' removal will be performed with the following
diffs to make review and tests easier.
ok bluhm
|
|
has sleep point, so the uninitialized `sc_outputtask` could be accessed
through ioctl(2) interface.
ok sashan bluhm
|
|
Release netlock and take `sc_lock' rwlock(9) just in the beginning of
pflowioctl() and do corresponding operations in the end. Use `sc_lock'
to protect `sc_dying'.
We need to release netlock not only to keep locks order with `sc_lock'
rwlock(9), but also because pflowioctl() calls some operations like
socreate() or soclose() on udp(4) socket. Current implementation has
many relocking places which breaks atomicy, so merge them into one.
The `sc_lock' rwlock(9) is taken during all pflowioctl() call, so
`sc_dying' atomicy is not broken.
Not the ideal solution, but better then we have now.
Tested by Hrvoje Popovski.
Discussed with and ok from sashan
|
|
|
|
mpsafe.
The weird interactions around `pflow_flows' and `sc_gcounter' replaced
by simple `pflow_flows' increment. Since the flow sequence is the 32
bits integer, the `sc_gcounter' type replaced by the type of uint32_t.
ok bluhm sashan
|
|
Since the revision 1.1182 of net/pf.c netlock is not taken while
export_pflow() called from pf_purge_states(). Current locks order
requires netlock to be taken before PF_LOCK(), so there is no reason
to turn it back into this path only for optional export_pflow() call.
The `pflowif_list' foreach loop has no context switch within, so SMR
list is better than mutex(9).
Tested by Hrvoje Popovski.
ok sashan bluhm
|
|
|
|
structure. Protect the `send_nam', `sc_flowsrc' and `sc_flowdst'
pflow_softc members by existing `sc_lock' rwlock(9).
This partially fixes locking inconsistency of pflow_softc. The following
work will be done with separate diffs.
Also, pass `sc' instead of NULL to pflow_get_mbuf() while calling from
pflow_sendout_ipfix_tmpl(). This fixes the NULL dereference.
ok bluhm@
|
|
the socket and the socket's PCB data.
ok bluhm
|
|
handler out of kernel lock.
ok bluhm
|
|
Packets (callers to pf_test()) must alter pf_state::timeout
under protection of pf_state::mtx. We also have to make sure
the packet does not update pf_state::timeout when ::timeout
reaches PFTM_UNLINKED.
The first report came from Johan Huldtgren, but he is not
the single user who has noticed "st->timeout == PFTM_UNLINKED"
assert violation.
OK bluhm@
|
|
OK claudio, miod
|
|
rip6_output() did modify inp_outputopts6 temporarily to provide
different ip6_pktopts to in6_embedscope(). Better pass inp_outputopts6
and inp_moptions6 as separate arguments to in6_embedscope().
Simplify the code that deals with these options in in6_embedscope().
Doucument inp_moptions and inp_moptions6 as protected by net lock.
OK kn@
|
|
pf expects the ip header to be in the first mbuf of the chain we
pass to pf_test, but in some situations the ethernet header is the
only data in the first mbuf. after we remove the ethernet header,
the first mbuf had no data in it which confused pf. fix this by
passing all packets to ip_check on output as well as input. ip input
handlers do all the necessary m_pullups.
found by Mark Patruck.
|