Age | Commit message (Collapse) | Author |
|
Variable ip and ip6 sendredirects is only read once during packet
processing. Use atomic_load_int() to access the value in exactly
one read instruction. No memory barriers needed as there is no
correlation with other values.
Sort the ip and ip6 checks, so the difference is easier to see.
Move access to global variable to the end.
OK mvs@
|
|
Use atomic operations to read ip6_forwarding while processing packets
in the network stack.
To make clear where actually the router property is needed, use the
i_am_router variable based on ip6_forwarding. It already existed
in nd6_nbr. Move i_am_router setting up the call stack until all
users are independent.
The forwarding decisions in pf_test, pf_refragment6, ip6_input do
also not interfere.
Use a new array ipv6ctl_vars_unlocked to make transition of all the
integer sysctls easier. Adapt IPv4 to the new style.
OK mvs@
|
|
in ip6_forward.c.
|
|
Forwarding IPv6 packets is slower than IPv4. Reason is that m_copym()
is done for every packet. Just in case we may have to send an ICMP6
packet, ip6_forward() creates a mbuf copy. After that mbuf cluster
is read only, so for the ethernet header another mbuf is allocated.
pf NAT and RDR ignores readonly clusters, so it also modifies the
potential ICMP6 packet.
IPv4 ip_forward() avoids all these problems by copying the leading
68 bytes of the original packets onto the stack. More is not need
for ICMP. IPv6 RFC 4443 2.4. (c) requires up to 1232 bytes in the
ICMP6 packet. This cannot be copied to the stack.
The reason for the difference in the standard seems to be that the
ICMP6 packet has to contain the full header chain. If we have a
simple TCP, UDP or ESP packet without chain, do a shortcut and just
preserve the header for the ICMP6 packet.
Small packets already use stack memory, large packets need extra
mbuf allocation. Now truncate ICMP6 packet to a reasonable length
if the original packets has a final protocol header directly after
the IPv6 header. List of suitable protocols contains TCP, UDP, ESP
as they cover the common cases and anything behind the header should
not be needed for path MTU discovery.
OK deraadt@ florian@ mvs@
|
|
All incpb locking has been converted to socket receive buffer mutex.
Per PCB mutex inp_mtx is not needed anymore. Also delete PRU related
locking functions. A flag PR_MPSOCKET indicates whether protocol
functions support parallel access with per socket rw-lock.
TCP is the only protocol that is not MP capable from the socket
layer and needs exclusive netlock.
OK mvs@
|
|
Unfortunately RFC 4443 demands that the ICMP6 error packet containing
the orignal packet is up to 1280 bytes long. That means for every
forwarded packet forward() creates a mbuf copy, just in case delivery
fails.
For small packets we can copy the content on the stack like IPv4
forward does. This saves us some mbuf allocations if the content
is shorter than the mbuf data size.
OK mvs@
|
|
IPsec gateways set the forwarding sysctl to 2. While this worked
for IPv4 since a long time, adapt this feature for IPv6 now. Set
sysctl net.inet6.ip6.forwarding=2 to forward only packets that have
been processed by IPsec.
Set IPV6_FORWARDING_IPSEC in ip6_input() and pass the flag down to
the call stack. This provides consistent view on global variable
ip6_forwarding. In ip6_output() or ip6_forward() drop packets that
do not match the policy.
OK denis@
|
|
IPv4 uses IP_FORWARDING to pass down a consistent value of
net.inet.ip.forwarding down the stack. This is needed for unlocking
sysctl. Do the same for IPv6.
Read ip6_forwarding once in ip6_input_if() and pass down IPV6_FORWARDING
as flags to ip6_ours(), ip6_hbhchcheck(), ip6_forward(). Replace
the srcrt value with IPV6_REDIRECT flag for consistency with IPv4.
To have common syntax with IPv4, use ip6_forwarding == 0 checks
instead of !ip6_forwarding. This will also make it easier to
implement net.inet6.ip6.forwarding=2 for IPsec only forwarding
later.
In nd6_ns_input() and nd6_na_input() read ip6_forwarding once and
store it in i_am_router. The variable name has been chosen to avoid
confusion with is_router, which indicates router flag of the packet.
Reading of ip6_forwarding is done independently from ip6_input_if(),
consistency does not really matter. One is for ND router behavior
the other for forwarding. Again use the ip6_forwarding != 0 check,
so when ip6_forwarding IPsec only value 2 gets implemented, it will
behave like a router.
OK deraadt@ sashan@ florian@ claudio@
|
|
Do not assume that ip_forwarding and ip_directedbcast cannot change
while processing one packet. Read it once and pass down its value
with a flag. This is necessary for unlocking the sysctl path.
There are a few places where a consistent value does not really
matter, they are unchanged. Use a proper ip_ prefix for the global
variable.
OK claudio@
|
|
slaacd(8) can work on P2P interfaces, it will just never configure the
destination address. But this works fine on at least pppoe(4) and
tun(4).
To make this less confusing pull ifra_dstaddr into dst6 or gw6
depending on if we are doing autoconf or not.
I accidentally broke this when implementing rule 5.5 of RFC 6724.
reported by & testing naddy
OK bluhm
|
|
|
|
Do not send a RTM_CHGADDRATTR route message when the address is
tentative because we will send one when DAD finishes.
To be used by rad(8) shortly.
OK bluhm
|
|
ok mpi@
|
|
In previous commit when refactoring the route cache, a rtfree() has
been forgotten. For each forwarded packet the reference counter
of the route entry was increased. This eventually leads to an
integer overflow and triggers kassert.
reported by and OK jan@
|
|
Rule 5.5: Prefer addresses in a prefix advertised by the next-hop.
For this we have to track the (link-local) address of the advertising
router per interface address and compare it with the selected route.
Rule 5.5 is useful in multi-homing setups where we have more than one
prefix and default router. We have to use the source address with the
correct default gateway otherwise traffic is likely going to be
dropped because of BCP 38.
While here refactor in6_update_ifa() a bit to make the code clearer
and consistently use (var & flag) instead of (var & flag) != 0.
Patiently reviewed by & OK bluhm.
|
|
Instead of passing around u_char[4], introduce struct ipsec_level
that contains 4 ipsec levels. This provides better type safety.
The embedding struct inpcb is globally visible for netstat(1), so
put struct ipsec_level outside of #ifdef _KERNEL.
OK deraadt@ mvs@
|
|
Reported by bket & anton
|
|
While here use (variable & FLAG) or !(variable & FLAG) consistently in
in6_update_ifa().
Discussed with claudio
OK denis
|
|
Instaed of passing a struct rtentry from ip_input() to ip_forward()
and then embed it into a struct route for ip_output(), start with
struct route and pass it along. Then the route cache is used
consistently. Also the route cache hit and missed counters should
reflect reality after this commit.
There is a small difference in the code. in_ouraddr() checks for
NULL and not rtisvalid(). Previous discussion showed that the route
RTF_UP flag should only be considered for multipath routing.
Otherwise it does not mean anything. Especially the local and
broadcast check in in_ouraddr() should not be affected by interface
link status.
When doing cache lookups, route must be valid, but after rtalloc_mpath()
lookup, use any route that route_mpath() returns.
OK claudio@
|
|
Get rip6_input() in the same shape as rip_input(). Call
soisdisconnected() from rip6_disconnect(). This means that the raw
IP socket cannot be reconnected later. Now raw IPv6 behaves like
IPv4 in this regard, KAME code is quite inconsistent here. Also
make sure that there is no race between disconnect, input and wakeup.
The inpcb fileds inp_icmp6filt and inp_cksum6 are protected by
exclusive net lock in icmp6_ctloutput(). With all that, mark raw
IPv6 sockets to handle input in parallel.
OK mvs@
|
|
Running raw IPv4 input with shared net lock in parallel is less
complex than UDP. Especially there is no socket splicing.
New ip_deliver() may run with shared or exclusive net lock. The
last parameter indicates the mode. If is is running with shared
netlock and encounters a protocol that needs exclusive lock, the
packet is queued. Old ip_ours() always queued the packet. Now it
calls ip_deliver() with shared net lock, and if that cannot handle
the packet completely, the packet is queued and later processed
with exclusive net lock.
In case of an IPv6 header chain, that switches from shared to
exclusive processing, the next protocol and mbuf offset are stored
in a mbuf tag.
OK mvs@
|
|
With two separate TCP hash tables, each one becomes smaller. When
we remove the exclusive net lock from TCP, contention on internet
PCB table mutex will be reduced. UDP has been split earlier into
IPv4 and IPv6. Replace branch conditions based on INP_IPV6 with
assertions.
OK mvs@
|
|
If no struct route is passed to ip_output() or ip6_output(), it
uses its own iproute on the stack. In that case any route entry
in the local route cache has to be freed. After pf decides to
reroute, struct route is reset to NULL. Then the route reference
counter has to be released. Call rtfree() without needless NULL
check.
OK mvs@
|
|
Reading sysctl mrt_sysctl_mfc() allocates memory to be copied back
to user. Chunks of struct mfcinfo are copied from routing table
to linear heap memory. If the allocated memory was not a multiple
the struct size, a struct mfcinfo could be copied to a partially
unallocated destination. Check that the end of the struct is within
the allocation.
From Alfredo Ortega; OK claudio@
|
|
Fill and check the cache and call rtalloc_mpath() together. Then
the caller of route_mpath() does not have to care about the uint32_t
*src pointer and just pass struct in_addr. All the conversions are
done inside the functions.
A previous version of this diff was backed out. There was an
additional rtisvalid() in rtalloc_mpath() that prevented packet
output via interfaces that were not up. Now the route in the cache
has to be valid, but after new lookup, rtalloc_mpath() may return
invalid routes. This generates less errors in userland an preserves
existing behavior.
OK sashan@
|
|
FreeBSD-SA-23:06.ipv6 security advisory has added an additional
overflow check in frag6_input(). OpenBSD is not affected by that
as the bug was introduced by another change in 2019. The existing
code is complicated and NetBSD has taken the FreeBSD fix, although
they were also not affected.
The additional check makes the complicated code more robust. Length
calculation taken from NetBSD. Discussed with FreeBSD.
OK sashan@ mvs@
|
|
in_pcbconnect() did not pass down the address it got from in_pcbselsrc()
to in_pcbpickport(). As a consequence local port numbers selected
during connect(2) were globally unique although they belong to
different addresses. This strict uniqueness is not necessary and
wastes usable ports for outgoing connections.
To solve this, pass ina from in_pcbconnect() to in_pcbbind_locked().
This does not interfere how wildcard sockets are matched with
specific sockets during bind(2). It only allows non-wildcard sockets
to share a local port during connect(2).
OK mvs@ deraadt@
|
|
It breaks NFS.
ok claudio@
|
|
Before changing the routing code, get IPv4 and IPv6 input, forward,
and output in a similar shape. Remove inconsistencies.
OK claudio@
|
|
Fill and check the cache and call rtalloc_mpath() together. Then
the caller of route_mpath() does not have to care about the uint32_t
*src pointer and just pass struct in_addr. All the conversions are
done inside the functions. ro->ro_rt is either valid or NULL. Note
that some places have a stricter rtisvalid() now compared to the
previous NULL check.
OK claudio@
|
|
Pass source address to route_cache() and store it in struct route.
Cached multipath routes are only valid if source address matches.
If sysctl multipath changes, increase route generation number.
OK claudio@
|
|
struct ip6po_rhinfo and struct ip6_pktopts behind _KERNEL.
The only bit userland may want from netinet6/ip6_var.h is
struct ip6stat.
The recent change to struct ip6po_rhinfo to use struct route
resulted in various build failures in ports because code
included netinet6/ip6_var.h without net/route.h.
OK tb@ sthen@
|
|
Use a common struct route for both inet and inet6. Unfortunately
struct sockaddr is shorter than sockaddr_in6, so netinet/in.h has
to be exposed from net/route.h. Struct route has to be bsd visible
for userland as netstat kvm code inspects inp_route. Internet PCB
and TCP SYN cache can use a plain struct route now. All specific
sockaddr types for inet and inet6 are embeded there.
OK claudio@
|
|
In soreceve(), we only touch `so_rcv' socket buffer, which has it's own
`sb_mtx' mutex(9) for protection. So, we can avoid solock() in this
path - it's enough to hold `sb_mtx' in soreceive() and around
corresponding sbappend*(). But not right now :)
This time we use shared netlock for some inet sockets in the soreceive()
path. To protect `so_rcv' buffer we use `inp_mtx' mutex(9) and the
pru_lock() to acquire this mutex(9) in socket layer. But the `inp_mtx'
mutex belongs to the PCB. We initialize socket before PCB, tcp(4)
sockets could exist without PCB, so use `sb_mtx' mutex(9) to protect
sockbuf stuff.
This diff mechanically replaces `inp_mtx' by `sb_mtx' in the receive
path. Only for sockets which already use `inp_mtx'. All other sockets
left as is. They will be converted later.
Since the `sb_mtx' is optional, the new SB_MTXLOCK flag introduced. If
this flag is set on `sb_flags', the `sb_mtx' mutex(9) should be taken.
New sb_mtx_lock() and sb_mtx_unlock() was introduced to hide this check.
They are temporary and will be replaced by mtx_enter() when all this
area will be converted to `sb_mtx' mutex(9).
Also, the new sbmtxassertlocked() function introduced to throw
corresponding assertion for SB_MTXLOCK marked buffers. This time only
sbappendaddr() calls it. This function is also temporary and will be
replaced by MTX_ASSERT_LOCKED() later.
ok bluhm
|
|
OK mvs@
|
|
The route_cache() function can easily return whether it was a cache
hit or miss. Then the logic to perform a route lookup gets a bit
simpler. Some more complicated if (ro->ro_rt == NULL) checks still
exist elsewhere.
Also use route cache in in_pcbselsrc() instead of filling struct
route manually.
OK claudio@
|
|
Implement route6_cache() to check whether the cached route is still
valid and otherwise fill caching parameter of struct route_in6.
Also count cache hits and misses in netstat. in_pcbrtentry() uses
route cache now.
OK claudio@
|
|
To optimize route caching, count cache hits and misses. This is
shown in netstat -s for both inet and inet6. Reuse the old IPv6
forward cache counter. Sort ip6s_wrongif consistently. For now
only IPv4 cache counter has been implemented.
OK mvs@
|
|
Shared netlock is not sufficient to call so{r,w}wakeup(). The following
sowakeup() modifies `sb_flags' and knote(9) stuff. Unfortunately, we
can't call so{r,w}wakeup() with `inp_mtx' mutex(9) because sowakeup()
also calls pgsigio() which grabs kernel lock.
However, `so*_filtops' callbacks only perform read-only access to the
socket stuff, so it is enough to hold shared netlock only, but the klist
stuff needs to be protected.
This diff introduces `sb_mtx' mutex(9) to protect sockbuf. This time
`sb_mtx' used to protect only `sb_flags' and `sb_klist'.
Now we have soassertlocked_readonly() and soassertlocked(). The first
one is happy if only shared netlock is held, meanwhile the second wants
`so_lock' or pru_lock() be held together with shared netlock.
To keep soassertlocked*() assertions soft, we need to know mutex(9)
state, so new mtx_owned() macro was introduces. Also, the new optional
(*pru_locked)() handler brings the state of pru_lock().
Tests and ok from bluhm.
|
|
The outgoing route is cached at the inpcb. This cache was only
invalidated when the socket closes or if the route gets invalid.
More specific routes were not detected. Especially with dynamic
routing protocols, sockets must be closed and reopened to use the
correct route. Running ping during a route change shows the problem.
To solve this, add a route generation number that is updated whenever
the routing table changes. The lookup in struct route is put into
the route_cache() function. If the generation number is too old,
the cached route gets discarded.
Implement route_cache() for ip_output() and ip_forward() first.
IPv6 and more places will follow.
OK claudio@
|
|
Splitting the IPv6 code into a separate function results in less
#ifdef INET6. Also struct route_in6 *ro in in6_pcbrtentry() is of
the correct type and in_pcbrtentry() does not rely on the fact that
inp_route and inp_route6 are pointers to the same union.
OK kn@ claudio@
|
|
in_pcbnotifyall() is an IPv4 only function. All callers check that
sockaddr dst is in fact a sockaddr_in. Pass the more spcific type
and remove the runtime check at beginning of in_pcbnotifyall().
Use const sockaddr_in in in_pcbnotifyall() and const sockaddr_in6
in6_pcbnotify() as dst parameter.
OK millert@
|
|
tcp6_ctlinput() casted a constant sockaddr_sin6 to non-const sockaddr.
sa6_src may be &sa6_any which lives in read-only data section.
Better pass down the const addresses to syn_cache_lookup(). They
are needed for hash lookup and are not modified.
OK mvs@
|
|
Since inpcb tables for UDP and Raw IP have been split into IPv4 and
IPv6, assert that INP_IPV6 flag is correct instead of checking it.
While there, give the table variable a nicer name.
OK sashan@ mvs@
|
|
OK bluhm@ mvs@
|
|
Syzkaller with witness complains about lock ordering of pf lock
with socket lock. Socket lock for inet is taken before pf lock.
Pf lock is taken before socket lock for route. This is a false
positive as route and inet socket locks are distinct. Witness does
not know this. Name the socket lock like the domain of the socket,
then rwlock name is used in witness lo_name subtype. Make domain
names more consistent for locking, they were not used anyway.
Regardless of witness problem, unique lock name for each socket
type make sense.
Reported-by: syzbot+34d22dcbf20d76629c5a@syzkaller.appspotmail.com
Reported-by: syzbot+fde8d07ba74b69d0adfe@syzkaller.appspotmail.com
OK mvs@
|
|
OK millert@
|
|
Protocols like UDP or TCP keep only functions in netinet6 that are
essentially different. Remove divert6_detach(), divert6_lock(),
divert6_unlock(), divert6_bind(), and divert6_shutdown(). Replace
them with identical IPv4 functions. INP_HDRINCL is an IPv4 only
option, remove it from divert6_attach().
OK mvs@ sashan@ kn@
|
|
Protect all remaining write access to inp_faddr and inp_laddr with
inpcb table mutex. Document inpcb locking for foreign and local
address and port and routing table id. Reading will be made MP
safe by adding per socket rw-locks in a next step.
OK sashan@ mvs@
|
|
in_pcbbind(), in_pcbconnect(), and in6_pcbconnect() have to set
addresses and ports within the same critical section as the inpcb
hash table calculation. Also lookup and address selection have to
be protected to avoid bindings and connections that are not unique.
For that in_pcbpickport() and in_pcbbind_locked() expect that the
table mutex is already taken. The functions in_pcblookup_lock(),
in_pcblookup_local_lock(), and in_pcbaddrisavail_lock() grab the
mutex iff the lock parameter is IN_PCBLOCK_GRAB. Otherwise the
parameter is IN_PCBLOCK_HOLD has the lock has to be taken already.
Note that in_pcblookup_lock() and in_pcblookup_local() return an
inp with increased reference iff they take and release the lock.
Otherwise the caller protects the life time of the inp.
This gives enough flexibility that in_pcbbind() and in_pcbconnect()
can hold the table mutex when they need it. The public inpcb API
does not change.
OK sashan@ mvs@
|