Age | Commit message | Author |
|
It breaks NFS.
ok claudio@
|
|
Before changing the routing code, get IPv4 and IPv6 input, forward,
and output in a similar shape. Remove inconsistencies.
OK claudio@
|
|
Fill and check the cache and call rtalloc_mpath() together. Then
the caller of route_mpath() does not have to care about the uint32_t
*src pointer and just pass struct in_addr. All the conversions are
done inside the functions. ro->ro_rt is either valid or NULL. Note
that some places have a stricter rtisvalid() now compared to the
previous NULL check.
OK claudio@
|
|
Pass source address to route_cache() and store it in struct route.
Cached multipath routes are only valid if source address matches.
If sysctl multipath changes, increase route generation number.
OK claudio@
|
|
struct ip6po_rhinfo and struct ip6_pktopts behind _KERNEL.
The only bit userland may want from netinet6/ip6_var.h is
struct ip6stat.
The recent change to struct ip6po_rhinfo to use struct route
resulted in various build failures in ports because code
included netinet6/ip6_var.h without net/route.h.
OK tb@ sthen@
|
|
Use a common struct route for both inet and inet6. Unfortunately
struct sockaddr is shorter than sockaddr_in6, so netinet/in.h has
to be exposed from net/route.h. Struct route has to be bsd visible
for userland as netstat kvm code inspects inp_route. Internet PCB
and TCP SYN cache can use a plain struct route now. All specific
sockaddr types for inet and inet6 are embedded there.
OK claudio@
|
|
In soreceive(), we only touch the `so_rcv' socket buffer, which has its own
`sb_mtx' mutex(9) for protection. So, we can avoid solock() in this
path - it's enough to hold `sb_mtx' in soreceive() and around
corresponding sbappend*(). But not right now :)
This time we use shared netlock for some inet sockets in the soreceive()
path. To protect `so_rcv' buffer we use `inp_mtx' mutex(9) and the
pru_lock() to acquire this mutex(9) in socket layer. But the `inp_mtx'
mutex belongs to the PCB. We initialize the socket before the PCB, and
tcp(4) sockets can exist without a PCB, so use `sb_mtx' mutex(9) to
protect sockbuf stuff.
This diff mechanically replaces `inp_mtx' with `sb_mtx' in the receive
path, but only for sockets which already use `inp_mtx'. All other
sockets are left as is. They will be converted later.
Since `sb_mtx' is optional, the new SB_MTXLOCK flag is introduced. If
this flag is set in `sb_flags', the `sb_mtx' mutex(9) should be taken.
New sb_mtx_lock() and sb_mtx_unlock() functions are introduced to hide
this check. They are temporary and will be replaced by mtx_enter() once
this whole area has been converted to `sb_mtx' mutex(9).
Also, the new sbmtxassertlocked() function is introduced to throw the
corresponding assertion for SB_MTXLOCK marked buffers. This time only
sbappendaddr() calls it. This function is also temporary and will be
replaced by MTX_ASSERT_LOCKED() later.
ok bluhm
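
The optional mutex described above can be illustrated with a small userland sketch. Everything here is an assumption for illustration: the `sketch_` names, the flag value, and the toy single-threaded "mutex" are hypothetical and are not the kernel definitions. The sketch only shows the idea of wrappers that hide the SB_MTXLOCK check.

```c
#include <assert.h>

/* Hypothetical flag value and toy types; the kernel definitions differ. */
#define SKETCH_SB_MTXLOCK	0x01

struct sketch_mutex {
	int	locked;		/* toy single-threaded stand-in for mutex(9) */
};

struct sketch_sockbuf {
	struct sketch_mutex	sb_mtx;
	int			sb_flags;
};

static void
sketch_mtx_enter(struct sketch_mutex *m)
{
	assert(!m->locked);
	m->locked = 1;
}

static void
sketch_mtx_leave(struct sketch_mutex *m)
{
	assert(m->locked);
	m->locked = 0;
}

/* Take the buffer mutex only for buffers that opted in via the flag. */
static void
sketch_sb_mtx_lock(struct sketch_sockbuf *sb)
{
	if (sb->sb_flags & SKETCH_SB_MTXLOCK)
		sketch_mtx_enter(&sb->sb_mtx);
}

static void
sketch_sb_mtx_unlock(struct sketch_sockbuf *sb)
{
	if (sb->sb_flags & SKETCH_SB_MTXLOCK)
		sketch_mtx_leave(&sb->sb_mtx);
}

/* Assert ownership only where the mutex is actually in use. */
static void
sketch_sbmtxassertlocked(struct sketch_sockbuf *sb)
{
	if (sb->sb_flags & SKETCH_SB_MTXLOCK)
		assert(sb->sb_mtx.locked);
}
```

Callers bracket the append with the lock and unlock wrappers regardless of socket type. For buffers without the flag the wrappers do nothing, which is what lets converted and unconverted sockets share one code path during the transition.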
|
|
OK mvs@
|
|
The route_cache() function can easily return whether it was a cache
hit or miss. Then the logic to perform a route lookup gets a bit
simpler. Some more complicated if (ro->ro_rt == NULL) checks still
exist elsewhere.
Also use route cache in in_pcbselsrc() instead of filling struct
route manually.
OK claudio@
|
|
Implement route6_cache() to check whether the cached route is still
valid and otherwise fill caching parameter of struct route_in6.
Also count cache hits and misses in netstat. in_pcbrtentry() uses
route cache now.
OK claudio@
|
|
To optimize route caching, count cache hits and misses. This is
shown in netstat -s for both inet and inet6. Reuse the old IPv6
forward cache counter. Sort ip6s_wrongif consistently. For now
only IPv4 cache counter has been implemented.
OK mvs@
|
|
Shared netlock is not sufficient to call so{r,w}wakeup(). The following
sowakeup() modifies `sb_flags' and knote(9) stuff. Unfortunately, we
can't call so{r,w}wakeup() with `inp_mtx' mutex(9) because sowakeup()
also calls pgsigio() which grabs kernel lock.
However, `so*_filtops' callbacks only perform read-only access to the
socket stuff, so it is enough to hold shared netlock only, but the klist
stuff needs to be protected.
This diff introduces `sb_mtx' mutex(9) to protect sockbuf. This time
`sb_mtx' is used to protect only `sb_flags' and `sb_klist'.
Now we have soassertlocked_readonly() and soassertlocked(). The first
one is happy if only shared netlock is held, meanwhile the second wants
`so_lock' or pru_lock() to be held together with shared netlock.
To keep soassertlocked*() assertions soft, we need to know mutex(9)
state, so the new mtx_owned() macro was introduced. Also, the new
optional (*pru_locked)() handler brings the state of pru_lock().
Tests and ok from bluhm.
|
|
The outgoing route is cached at the inpcb. This cache was only
invalidated when the socket closed or the route became invalid.
More specific routes were not detected. Especially with dynamic
routing protocols, sockets must be closed and reopened to use the
correct route. Running ping during a route change shows the problem.
To solve this, add a route generation number that is updated whenever
the routing table changes. The lookup in struct route is put into
the route_cache() function. If the generation number is too old,
the cached route gets discarded.
Implement route_cache() for ip_output() and ip_forward() first.
IPv6 and more places will follow.
OK claudio@
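
The generation check described above might look roughly like the following userland sketch. The structures, the `sketch_` names, and the return convention (0 for a hit, 1 for a miss) are assumptions made for illustration and are not taken from the kernel source.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins; not the kernel's real structures. */
struct sketch_rtentry {
	int	valid;		/* nonzero while the route is usable */
};

struct sketch_route {
	struct sketch_rtentry	*ro_rt;		/* cached route or NULL */
	unsigned long		 ro_generation;	/* generation at fill time */
	uint32_t		 ro_dst;	/* destination of the cache */
};

/* Bumped whenever the (imaginary) routing table changes. */
static unsigned long sketch_route_generation = 1;

static void
sketch_table_changed(void)
{
	sketch_route_generation++;
}

/*
 * Return 0 on a cache hit, 1 on a miss.  On a miss the stale route
 * is dropped and the cache parameters are reset, so that the caller
 * can do a fresh lookup and store the result in ro->ro_rt.
 */
static int
sketch_route_cache(struct sketch_route *ro, uint32_t dst)
{
	if (ro->ro_rt != NULL && ro->ro_rt->valid &&
	    ro->ro_generation == sketch_route_generation &&
	    ro->ro_dst == dst)
		return 0;

	ro->ro_rt = NULL;
	ro->ro_generation = sketch_route_generation;
	ro->ro_dst = dst;
	return 1;
}
```

A single global counter is coarse: any table change invalidates every cached route, including unaffected ones. In exchange, the check on the fast path is a single integer comparison.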
|
|
Splitting the IPv6 code into a separate function results in less
#ifdef INET6. Also struct route_in6 *ro in in6_pcbrtentry() is of
the correct type and in_pcbrtentry() does not rely on the fact that
inp_route and inp_route6 are pointers to the same union.
OK kn@ claudio@
|
|
in_pcbnotifyall() is an IPv4 only function. All callers check that
sockaddr dst is in fact a sockaddr_in. Pass the more specific type
and remove the runtime check at beginning of in_pcbnotifyall().
Use const sockaddr_in in in_pcbnotifyall() and const sockaddr_in6
in in6_pcbnotify() as dst parameter.
OK millert@
|
|
tcp6_ctlinput() cast a constant sockaddr_in6 to non-const sockaddr.
sa6_src may be &sa6_any which lives in read-only data section.
Better pass down the const addresses to syn_cache_lookup(). They
are needed for hash lookup and are not modified.
OK mvs@
|
|
Since inpcb tables for UDP and Raw IP have been split into IPv4 and
IPv6, assert that INP_IPV6 flag is correct instead of checking it.
While there, give the table variable a nicer name.
OK sashan@ mvs@
|
|
OK bluhm@ mvs@
|
|
Syzkaller with witness complains about lock ordering of pf lock
with socket lock. Socket lock for inet is taken before pf lock.
Pf lock is taken before socket lock for route. This is a false
positive as route and inet socket locks are distinct. Witness does
not know this. Name the socket lock like the domain of the socket,
then rwlock name is used in witness lo_name subtype. Make domain
names more consistent for locking, they were not used anyway.
Regardless of the witness problem, a unique lock name for each socket
type makes sense.
Reported-by: syzbot+34d22dcbf20d76629c5a@syzkaller.appspotmail.com
Reported-by: syzbot+fde8d07ba74b69d0adfe@syzkaller.appspotmail.com
OK mvs@
|
|
OK millert@
|
|
Protocols like UDP or TCP keep only functions in netinet6 that are
essentially different. Remove divert6_detach(), divert6_lock(),
divert6_unlock(), divert6_bind(), and divert6_shutdown(). Replace
them with identical IPv4 functions. INP_HDRINCL is an IPv4 only
option, remove it from divert6_attach().
OK mvs@ sashan@ kn@
|
|
Protect all remaining write access to inp_faddr and inp_laddr with
inpcb table mutex. Document inpcb locking for foreign and local
address and port and routing table id. Reading will be made MP
safe by adding per socket rw-locks in a next step.
OK sashan@ mvs@
|
|
in_pcbbind(), in_pcbconnect(), and in6_pcbconnect() have to set
addresses and ports within the same critical section as the inpcb
hash table calculation. Also lookup and address selection have to
be protected to avoid bindings and connections that are not unique.
For that in_pcbpickport() and in_pcbbind_locked() expect that the
table mutex is already taken. The functions in_pcblookup_lock(),
in_pcblookup_local_lock(), and in_pcbaddrisavail_lock() grab the
mutex iff the lock parameter is IN_PCBLOCK_GRAB. Otherwise the
parameter is IN_PCBLOCK_HOLD and the lock has to be taken already.
Note that in_pcblookup_lock() and in_pcblookup_local() return an
inp with increased reference iff they take and release the lock.
Otherwise the caller protects the life time of the inp.
This gives enough flexibility that in_pcbbind() and in_pcbconnect()
can hold the table mutex when they need it. The public inpcb API
does not change.
OK sashan@ mvs@
|
|
Since soreceive() runs in parallel for raw sockets, sbappendaddr()
has to be protected by inpcb mutex. This was missing in multicast
forwarding which is running with a combination of shared net lock
and kernel lock. soreceive() uses shared net lock and mutex per
inpcb. Grab mutex before sbappendaddr() in socket_send() and
socket6_send().
panic receive 1 reported by Jo Geraerts
OK mvs@ claudio@
|
|
There exists no struct in6pcb in OpenBSD, this was an old kame idea.
Calling the local variable in6p does not make sense, it is actually
a struct inpcb. Also in6p is not used consistently in inet6 code.
Having the same convention for IPv4 and IPv6 is less confusing.
OK sashan@ mvs@
|
|
During initialization in_pcballoc() sets INP_IPV6 once to avoid
reaching through inp_socket->so_proto->pr_domain->dom_family. Use
this flag consistently.
OK sashan@ mvs@
|
|
The inpcb hash table is protected by table->inpt_mtx. The hash is
based on addresses, ports, and routing table. These fields were
not synchronized with the hash. Put writes and hash update into the
same critical section.
Move the updates from ip_ctloutput(), ip6_ctloutput(), syn_cache_get(),
tcp_connect(), udp_disconnect() to dedicated inpcb set functions.
There they use the same table mutex as in_pcbrehash().
in_pcbbind(), in_pcbconnect(), and in6_pcbconnect() need more work
and are not included yet.
OK sashan@ mvs@
|
|
The public interface is in_pcbconnect(). It dispatches to
in6_pcbconnect() if necessary. Call the former from tcp_connect()
and udp_connect().
In in6_pcbconnect() initialization in6a = NULL is not necessary.
in6_pcbselsrc() sets the pointer, but does not read the value.
Pass a constant in6_addr pointer to in6_pcbselsrc() and in6_selectsrc().
It returns a reference to the address of some internal data structure.
We want to be sure that in6_addr is not modified this way. IPv4
in_pcbselsrc() solves this by passing a copy of the address.
OK kn@ sashan@ mvs@
|
|
Struct inpcb field inp_socket is initialized in in_pcballoc(). It
is not NULL and never changed.
OK mvs@
|
|
rip6_output() did modify inp_outputopts6 temporarily to provide
different ip6_pktopts to in6_embedscope(). Better pass inp_outputopts6
and inp_moptions6 as separate arguments to in6_embedscope().
Simplify the code that deals with these options in in6_embedscope().
Document inp_moptions and inp_moptions6 as protected by net lock.
OK kn@
|
|
ip_output() received inp as parameter. This is only used to lookup
the IPsec level of the socket. Reasoning about MP locking is much
easier if only relevant data is passed around. Convert ip_output()
to receive constant inp_seclevel as argument and mark it as protected
by net lock.
OK mvs@
|
|
For implementing MP safe route lookup, it helps to know which
function parameters are constant. Add some const declarations, so
that the compiler guarantees that sockaddr dst parameter of
rtable_match() does not change.
OK dlg@
|
|
Using a scratch buffer makes it possible to take a consistent snapshot of
per-CPU counters without having to allocate memory.
Makes ddb(4) show uvmexp command work in OOM situations.
ok kn@, mvs@, cheloha@
|
|
When called with NULL options, ip_output() and ip6_output() are MP
safe. Convert exclusive to shared net lock in send dispatch.
OK mpi@
|
|
More complete solution after tb@ pointed out what Coverity missed.
ok tb@
|
|
Coverity CID #1566406
ok phessler@
|
|
In in6_ifdetach() two struct rtentry were leaked. This was triggered
by regress/sbin/route and detected with btrace(8) refcnt. The
reference returned by rtalloc() must be freed with rtfree() in all
cases.
OK phessler@ mvs@
|
|
When doing LRO (Large Receive Offload), the drivers, currently ix(4)
and lo(4) only, record an upper bound of the size of the original
packets in ph_mss. When sending, either stack or hardware must
chop the packets with TSO (TCP Segmentation Offload) to that size.
That means we have to call tcp_if_output_tso() before ifp->if_output().
Put that logic into if_output_tso() to avoid code duplication. As
TCP packets on the wire do not get larger that way, path MTU discovery
should still work.
tested by and OK jan@
|
|
Replace hand-rolled reference counting with refcnt_init(9) and hook it up
with a new dt(4) probe.
OK bluhm mvs
|
|
Goal is to run UDP input in parallel. Btrace kstack analysis shows
that SIP hash for PCB lookup is quite expensive. When running in
parallel, there is also lock contention on the PCB table mutex.
It results in better performance to calculate the hash value before
taking the mutex. The hash secret has to be constant as hash
calculation must not depend on values protected by the table mutex.
Do not reseed anymore when hash table gets resized.
Analysis also shows that asserting a rw_lock while holding a mutex
is a bit expensive. Just remove the netlock assert.
OK dlg@ mvs@
|
|
First try to send with TSO. The goto senderr handles icmp6 redirect
and other errors. If TSO is not necessary and the interface MTU
fits, just send the packet. Again goto senderr handles icmp6.
Finally care about icmp6 packet too big.
tested and OK jan@
|
|
ok bluhm
|
|
with if_mtu and not the packet checksum flags. ph_mss contains the
size of the chopped packets.
OK jan@
|
|
Also fix ip6_forwarding of TSO packets with tcp_if_output_tso().
With a lot of testing from Hrvoje Popovski
and a lot of tweaks from bluhm@
ok bluhm@
|
|
When sending TCP packets with software TSO to the local address of
a physical interface, the TCP checksum was miscalculated. As the
small MSS is taken from the physical interface, but the large MTU
of the loopback interface is used, large TSO packets are generated,
but sent directly to the loopback interface. There we need the
regular pseudo header checksum and not the modified one without packet
length.
To avoid this confusion, use the same decision for checksum generation
in in_proto_cksum_out() as for using hardware TSO in tcp_if_output_tso().
bug reported and tested by robert@ bket@ Hrvoje Popovski
OK claudio@ jan@
|
|
If the driver of a network interface claims to support TSO, do not
chop the packet in software, but pass it down to the interface
layer.
Precalculate parts of the pseudo header checksum, but without the
packet length. The length of all generated smaller packets is not
known yet. Driver and hardware will use the mbuf packet header
field ph_mss to calculate it and update checksum.
Introduce separate flags IFCAP_TSOv4 and IFCAP_TSOv6 as hardware
might support only one protocol family. The old flag IFXF_TSO is
only relevant for large receive offload. It is misnamed, but keep
that for now.
Note that drivers do not set TSO capabilities yet. Also the ifconfig
flags and pseudo interfaces capabilities will be done separately.
So this commit should not change behavior.
heavily based on the work from jan@; OK sashan@
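
The split pseudo header checksum can be illustrated with plain ones' complement arithmetic. This is a sketch under stated assumptions, not the kernel code: the `sketch_` helpers are hypothetical, and addresses are kept in host byte order for readability. It shows that summing source, destination, and protocol first and adding the per-segment length later gives the same result as the regular pseudo header sum.

```c
#include <stdint.h>

/* Fold a 32-bit accumulator into a 16-bit ones' complement sum. */
static uint16_t
sketch_fold(uint32_t sum)
{
	while (sum >> 16)
		sum = (sum & 0xffff) + (sum >> 16);
	return (uint16_t)sum;
}

/*
 * IPv4 TCP pseudo header sum over source, destination and protocol,
 * deliberately leaving out the TCP length.
 */
static uint16_t
sketch_pseudo_nolen(uint32_t src, uint32_t dst, uint8_t proto)
{
	uint32_t sum = 0;

	sum += (src >> 16) + (src & 0xffff);
	sum += (dst >> 16) + (dst & 0xffff);
	sum += proto;
	return sketch_fold(sum);
}

/* What driver or hardware would do per segment: add its length. */
static uint16_t
sketch_pseudo_addlen(uint16_t partial, uint16_t seglen)
{
	return sketch_fold((uint32_t)partial + seglen);
}

/* Reference: the regular pseudo header sum computed in one go. */
static uint16_t
sketch_pseudo_full(uint32_t src, uint32_t dst, uint8_t proto,
    uint16_t seglen)
{
	uint32_t sum = 0;

	sum += (src >> 16) + (src & 0xffff);
	sum += (dst >> 16) + (dst & 0xffff);
	sum += proto;
	sum += seglen;
	return sketch_fold(sum);
}
```

Because ones' complement addition is associative, the length can be added at any later point. That is why the stack can hand a large packet down before the sizes of the final segments are known.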
|
|
entries in rt_llinfo are protected either by exclusive netlock or
the ND6 mutex. The performance critical lookup path in nd6_resolve()
uses shared netlock, but is not lockless. In contrast to ARP it
grabs the mutex also in the common case.
tested by Hrvoje Popovski; with and OK kn@
|
|
OK mvs@
|
|
meant as a fallback if network hardware does not support TSO. Driver
support is still work in progress. TCP output generates large
packets. In IP output the packet is chopped to TCP maximum segment
size. This reduces the CPU cycles used by pf. The regular output
could be assisted by hardware later, but pf route-to and IPsec need
the software fallback in general.
For performance comparison or to work around possible bugs, sysctl
net.inet.tcp.tso=0 disables the feature. netstat -s -p tcp shows
TSO counter with chopped and generated packets.
based on work from jan@
tested by jmc@ jan@ Hrvoje Popovski
OK jan@ claudio@
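
The arithmetic behind chopping a large packet to the maximum segment size can be sketched as follows. The helper names are hypothetical and the code only computes sizes; the real implementation also has to copy headers, adjust sequence numbers, and fix checksums, none of which is shown here.

```c
#include <stddef.h>

/* Number of packets a payload of 'len' bytes is chopped into. */
static size_t
sketch_tso_count(size_t len, size_t mss)
{
	if (mss == 0)
		return 0;
	return (len + mss - 1) / mss;
}

/* Payload length of segment 'i' (0-based); 0 if out of range. */
static size_t
sketch_tso_seglen(size_t len, size_t mss, size_t i)
{
	size_t n = sketch_tso_count(len, mss);

	if (i >= n)
		return 0;
	if (i == n - 1)
		return len - i * mss;	/* last segment takes the rest */
	return mss;
}
```

For a 4000 byte payload and an MSS of 1460, this yields three packets carrying 1460, 1460, and 1080 bytes. Segment `i` starts at payload offset `i * mss`.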
|
|
is passed to ifp->if_output(). The fragment code has its own
checksum calculation and the other paths end in goto bad.
OK claudio@
|