Age | Commit message (Collapse) | Author |
|
To optimize route caching, count cache hits and misses. This is
shown in netstat -s for both inet and inet6. Reuse the old IPv6
forward cache counter. Sort ip6s_wrongif consistently. For now
only IPv4 cache counter has been implemented.
OK mvs@
|
|
[1] that if_downall() tries to send route messages and triggers panic
again but in knote(9) layer.
1. https://syzkaller.appspot.com/bug?extid=d19060a65721eb432a72
ok bluhm
|
|
This prevents gcc3's 'parameter has incomplete type' warning that
causes kernel build failure.
Suggested by claudio@, ok bluhm@
|
|
The outgoing route is cached at the inpcb. This cache was only
invalidated when the socket closes or if the route gets invalid.
More specific routes were not detected. Especially with dynamic
routing protocols, sockets must be closed and reopened to use the
correct route. Running ping during a route change shows the problem.
To solve this, add a route generation number that is updated whenever
the routing table changes. The lookup in struct route is put into
the route_cache() function. If the generation number is too old,
the cached route gets discarded.
Implement route_cache() for ip_output() and ip_forward() first.
IPv6 and more places will follow.
OK claudio@
|
|
Thus, dhcpleased accept non-calculated checksums which were verified by
hardware/hypervisor.
With tweaks from dlg@
ok bluhm@
mkay tobhe@
|
|
sec(4) was already looking for this mbuf tag so it could drop packets
that had already been sent out on the same interface, but i forgot
the code that adds the tag.
this was reported by jason tubnor who experienced spins/lockups
when using sec and a physical interface was disconnected. rather
than being a locking problem like we initially assumed, it turned
out that unplugging a physical interface caused a route for ipsec
encapsulated traffic to go out over sec(4), causing the packet to
loop in the stack.
the fix was also tested and verified by jason. sorry for taking so
long to look at it.
|
|
`pipex_session_list' foreach walkthrough with `pipex_list_mtx' mutex(9)
relocking. It inserts special item after acquired `session' and keeps it
linked until `session' release. Only owner can unlink it's own item, so
the LIST_NEXT(session) is always valid even the `session' was unlinked.
The iterator skips special items at the `session' acquisition time, as
all other foreach loops where `pipex_list_mtx' mutex(9) is not relocked.
ok yasuoka
|
|
ok yasuoka
|
|
ok bluhm
|
|
Syzkaller with witness complains about lock ordering of pf lock
with socket lock. Socket lock for inet is taken before pf lock.
Pf lock is taken before socket lock for route. This is a false
positive as route and inet socket locks are distinct. Witness does
not know this. Name the socket lock like the domain of the socket,
then rwlock name is used in witness lo_name subtype. Make domain
names more consistent for locking, they were not used anyway.
Regardless of witness problem, unique lock name for each socket
type make sense.
Reported-by: syzbot+34d22dcbf20d76629c5a@syzkaller.appspotmail.com
Reported-by: syzbot+fde8d07ba74b69d0adfe@syzkaller.appspotmail.com
OK mvs@
|
|
Having two hash tables instead of a common one, reduces table size
and contention on the per table lock. The address family is always
known in advance. The lookups and loops are more specific.
OK sashan@
|
|
Counting multicast packets sent to local stack or packets that are
reflected by simplex interfaces does not make much sense. They are
neither received nor output by any ethernet device. Counting these
packets at lo0 or the loopback interface of the routing domain would
be possible, but is not worth the effort. Make if_input_local()
MP safe by deleting the if_opackets++ code.
OK mvs@
|
|
Doing KERNEL_LOCK() just before NET_LOCK() does not make sense.
Net lock is a rwlock that releases kernel lock during sleep. To
avoid an unnecessary release and take kernel lock cycle, move
KERNEL_LOCK() after NET_LOCK().
There is no lock order reversal deadlock issue. Both locks are
used in any order thoughout the kernel. As NET_LOCK() releases the
kernel lock when it cannot take the lock immediately and has to
sleep, we always end in the order kernel lock before net lock after
sleeping.
OK sashan@
|
|
Introduce global mutex to protect the pointers between pf state key
and internet PCB. Then in_pcbdisconnect() and in_pcbdetach() do
not need exclusive netlock anymore. Use a bunch of read once
unlocked access to reduce performance impact.
OK sashan@
|
|
ok bluhm sashan
|
|
|
|
Create and use the MP safe version of the interface counters for
lo(4). Input packets were counted twice. As interface input queue
is already counting, remove input count in if_input_local().
Multicast and siplex packets are counted at the ethernet interface.
Add a comment that this not MP safe.
OK mvs@
|
|
OK bluhm@
|
|
interface descriptor. It panics during attach of em(4) device at
boot.
|
|
descriptor.
We have the mess in network interface statistics. Only pseudo drivers
do per-CPU counters allocation, all other network devices use the old
`if_data'. The network stack partially uses per-CPU counters and
partially use `if_data', but the protection is inconsistent: some times
counters accessed with exclusive netlock, some times with shared
netlock, some times with kernel lock, but without netlock, some times
with another locks.
To make network interfaces statistics more consistent, always allocate
per-CPU counters at interface attachment time and use it instead of
`if_data'. At this step only move counters allocation to the if_attach()
internals. The `if_data' removal will be performed with the following
diffs to make review and tests easier.
ok bluhm
|
|
has sleep point, so the uninitialized `sc_outputtask` could be accessed
through ioctl(2) interface.
ok sashan bluhm
|
|
Release netlock and take `sc_lock' rwlock(9) just in the beginning of
pflowioctl() and do corresponding operations in the end. Use `sc_lock'
to protect `sc_dying'.
We need to release netlock not only to keep locks order with `sc_lock'
rwlock(9), but also because pflowioctl() calls some operations like
socreate() or soclose() on udp(4) socket. Current implementation has
many relocking places which breaks atomicy, so merge them into one.
The `sc_lock' rwlock(9) is taken during all pflowioctl() call, so
`sc_dying' atomicy is not broken.
Not the ideal solution, but better then we have now.
Tested by Hrvoje Popovski.
Discussed with and ok from sashan
|
|
|
|
mpsafe.
The weird interactions around `pflow_flows' and `sc_gcounter' replaced
by simple `pflow_flows' increment. Since the flow sequence is the 32
bits integer, the `sc_gcounter' type replaced by the type of uint32_t.
ok bluhm sashan
|
|
Since the revision 1.1182 of net/pf.c netlock is not taken while
export_pflow() called from pf_purge_states(). Current locks order
requires netlock to be taken before PF_LOCK(), so there is no reason
to turn it back into this path only for optional export_pflow() call.
The `pflowif_list' foreach loop has no context switch within, so SMR
list is better than mutex(9).
Tested by Hrvoje Popovski.
ok sashan bluhm
|
|
|
|
structure. Protect the `send_nam', `sc_flowsrc' and `sc_flowdst'
pflow_softc members by existing `sc_lock' rwlock(9).
This partially fixes locking inconsistency of pflow_softc. The following
work will be done with separate diffs.
Also, pass `sc' instead of NULL to pflow_get_mbuf() while calling from
pflow_sendout_ipfix_tmpl(). This fixes the NULL dereference.
ok bluhm@
|
|
the socket and the socket's PCB data.
ok bluhm
|
|
handler out of kernel lock.
ok bluhm
|
|
Packets (callers to pf_test()) must alter pf_state::timeout
under protection of pf_state::mtx. We also have to make sure
the packet does not update pf_state::timeout when ::timeout
reaches PFTM_UNLINKED.
The first report came from Johan Huldtgren, but he is not
the single user who has noticed "st->timeout == PFTM_UNLINKED"
assert violation.
OK bluhm@
|
|
OK claudio, miod
|
|
rip6_output() did modify inp_outputopts6 temporarily to provide
different ip6_pktopts to in6_embedscope(). Better pass inp_outputopts6
and inp_moptions6 as separate arguments to in6_embedscope().
Simplify the code that deals with these options in in6_embedscope().
Doucument inp_moptions and inp_moptions6 as protected by net lock.
OK kn@
|
|
pf expects the ip header to be in the first mbuf of the chain we
pass to pf_test, but in some situations the ethernet header is the
only data in the first mbuf. after we remove the ethernet header,
the first mbuf had no data in it which confused pf. fix this by
passing all packets to ip_check on output as well as input. ip input
handlers do all the necessary m_pullups.
found by Mark Patruck.
|
|
|
|
In revision 1.424 the logic in rt_setgate() has changed. The old
code entered a value into rt_gateway also if rt_setgwroute() returned
an error. Now if rt_setgwroute() fails, rt_gateway is NULL and
ROUNDUP(rt->rt_gateway->sa_len) crashes.
Put back the old logic in rt_setgate(). Setting rt_gateway and
rt_gwroute are actually independent.
If malloc(9) in rt_setgate() fails, rt_gateway can still be NULL.
The subsequent crash in free(rt->rt_gateway, M_RTABLE,
ROUNDUP(rt->rt_gateway->sa_len)) was just never observed. Add a
NULL check around these free(9).
Reported-by: syzbot+2e79dd9db712d3c5ade9@syzkaller.appspotmail.com
OK mvs@
|
|
In rtalloc() and rtalloc_mpath() declare the parameter dst as const
sockaddr. This makes MP safe route lookup easier as the destination
address is definitely not modified during the operation. Array
rti_info, the central data structure with addresses for route
matching, contains constant sockaddr now.
OK mvs@ dlg@
|
|
The rti_info array is used to describe routes that should be found
by lookup. Modifying the addreses in it is not a good idea. There
were places where rtm_xaddrs() tried to fix the address family
instead of validating it. Replace the modification with a check
and error out with EAFNOSUPPORT on failure. Route labels always
have AF_UNSPEC and the other types are not used anyway.
OK kn@
|
|
pointed out by bluhm@
|
|
the rtable which should be serialised to ensure they're consistent.
unfortunately, rt_setgate is called from the network stack while it's
only holding shared NET_LOCK.
this uses the [X] protections as described in route.h to serialise the
changes, and reworks the code to try and keep enough stuff linked up
properly during the changes that it will still work if another cpu is
still using the rtentry structs while they still have shared net lock.
tested by and ok bluhm@
|
|
OK mvs@
|
|
this is the result of a bunch of discussion at h2k23.
ok claudio@ mvs@ bluhm@
|
|
OK mvs@
|
|
For implementing MP safe route lookup, it helps to know which
function parameters are constant. Add some const declarations, so
that the compiler guarantees that sockaddr dst parameter of
rtable_match() does not change.
OK dlg@
|
|
Rename ifq_set_maxlen() to ifq_init_maxlen(). This function neither
uses WRITE_ONCE() nor a mutex and is called before the ifq mutex
is initialized. The new name expresses that it should be used only
during interface attach when there is no concurrency.
Protect ifq_len(), ifq_empty(), ifiq_len(), and ifiq_empty() with
READ_ONCE(). They can be used without lock as they only read a
single integer.
OK dlg@
|
|
members of passed timeout structure, this delayed initialization
provides nothing but makes code weird.
ok kn
|
|
Also disable TCP LRO on bridged vlan(4) and default for bpe(4), nvgre(4) and
vxlan(4).
ok bluhm@
|
|
While interface going down and output stopped, packets could rest in
`if_snd' queue. So the (!ifq_empty(&sc->sc_if.if_snd)) condition will
always be true and wg_peer_destroy() will sleep until interface became
up and stuck packets transmitted.
Check IFF_RUNNING flag within (!ifq_empty(&sc->sc_if.if_snd)) loop in
wg_peer_destroy(). If the flag is not set that means interface is down,
so drain the `if_snd' queue manually to prevent wg_peer_destroy() stuck.
Problem reported and fix tested by Kirill Miazine.
ok bluhm@
|
|
If a packet is malformed, it is dropped by pf(4). The rule referenced
in pflog(4) is the default rule. As the default rule is a pass
rule, tcpdump printed "pass" although the packet was actually
dropped. To avoid confusion, change the action to drop. Then
tcpdump prints "block".
OK sashan@ kn@
|
|
When receiving a pfkeyv2 SADB_ADD message, a newly created tdb can
fail in tdb_init(), which causes the tdb to not get added to the
global tdb list and an immediate dereference. If a lifetime timeout
triggers on this tdb, it will unconditionally try to remove it from
the list and in the process deref once more than allowed,
causing a one bit corruption in the already freed up slot in the
tdb pool.
We resolve this issue by moving timeout_add() after tdb_init()
just before puttdb(). This means tdbs failing initialization
get discarded immediately as they only hold a single reference.
Valid tdbs get their timeouts activated just before we add them
to the tdb list, meaning the timeout can safely assume they are
linked.
Feedback from mvs@ and millert@
ok mvs@ mbuhl@
|
|
The behavior of the PFRULE_SRCTRACK and max_states check was
unintentionally changed by commit revision 1.964. If the state was
not created due to some limit had been reached, pf still passed the
packet. Restore the old logic by setting action to pass later,
after the checks. In pf_test_rule() action is initialized to drop.
OK sashan@
|