Age | Commit message | Author |
|
Instead of passing around u_char[4], introduce struct ipsec_level
that contains 4 ipsec levels. This provides better type safety.
The embedding struct inpcb is globally visible for netstat(1), so
put struct ipsec_level outside of #ifdef _KERNEL.
OK deraadt@ mvs@
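A minimal sketch of such a struct, with field names chosen here for
illustration after the four existing IPsec levels (auth, ESP
transport, ESP network, IPcomp):

    struct ipsec_level {
            u_char sl_auth;         /* authentication level */
            u_char sl_esp_trans;    /* ESP transport level */
            u_char sl_esp_network;  /* ESP network level */
            u_char sl_ipcomp;       /* IP compression level */
    };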
|
|
With two separate TCP hash tables, each one becomes smaller. When
we remove the exclusive net lock from TCP, contention on internet
PCB table mutex will be reduced. UDP has been split earlier into
IPv4 and IPv6. Replace branch conditions based on INP_IPV6 with
assertions.
OK mvs@
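Illustrative shape of such a replacement in an IPv6-only code path
(a sketch, not the exact diff):

    /* before: both families shared one table, branch at runtime */
    if (inp->inp_flags & INP_IPV6) { /* IPv6 handling */ }
    /* after: the IPv6 table only ever holds IPv6 PCBs */
    KASSERT(inp->inp_flags & INP_IPV6);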
|
|
Use a common struct route for both inet and inet6. Unfortunately
struct sockaddr is shorter than sockaddr_in6, so netinet/in.h has
to be exposed from net/route.h. Struct route has to be bsd visible
for userland as netstat kvm code inspects inp_route. Internet PCB
and TCP SYN cache can use a plain struct route now. All specific
sockaddr types for inet and inet6 are embedded there.
OK claudio@
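A rough sketch of the shape described above (member names
illustrative):

    struct route {
            struct rtentry *ro_rt;          /* cached route entry */
            union {                         /* fits both families */
                    struct sockaddr         ro_dstsa;
                    struct sockaddr_in      ro_dstsin;
                    struct sockaddr_in6     ro_dstsin6;
            };
    };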
|
|
OK mvs@
|
|
in_pcbnotifyall() is an IPv4 only function. All callers check that
sockaddr dst is in fact a sockaddr_in. Pass the more specific type
and remove the runtime check at the beginning of in_pcbnotifyall().
Use const sockaddr_in in in_pcbnotifyall() and const sockaddr_in6
in in6_pcbnotify() as dst parameter.
OK millert@
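The resulting prototype would then look roughly like this (a sketch,
not the exact declaration):

    void in_pcbnotifyall(struct inpcbtable *, const struct sockaddr_in *,
        u_int, int, void (*)(struct inpcb *, int));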
|
|
tcp6_ctlinput() cast a constant sockaddr_in6 to non-const sockaddr.
sa6_src may be &sa6_any which lives in read-only data section.
Better pass down the const addresses to syn_cache_lookup(). They
are needed for hash lookup and are not modified.
OK mvs@
|
|
|
|
Struct inpcb field inp_socket is initialized in in_pcballoc(). It
is not NULL and never changed.
OK mvs@
|
|
ip_output() received inp as parameter. This is only used to lookup
the IPsec level of the socket. Reasoning about MP locking is much
easier if only relevant data is passed around. Convert ip_output()
to receive constant inp_seclevel as argument and mark it as protected
by net lock.
OK mvs@
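An illustrative before/after of a call site (argument list
abbreviated, not the exact diff):

    /* before: whole PCB passed down just for the IPsec level */
    error = ip_output(m, opts, &ro, flags, mopts, inp);
    /* after: only the relevant, net-lock protected data */
    error = ip_output(m, opts, &ro, flags, mopts,
        inp ? inp->inp_seclevel : NULL);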
|
|
After changing the tcp_now tick to milliseconds, 32 bits will wrap
around after 49 days of uptime. That may be a problem in some
places of our stack. Better use a 64 bit counter.
As the timestamp option is 32 bits in the TCP protocol, use the
lower 32 bits there. The existing casts to 32 bits should behave
correctly.
Start with a random 63 bit offset to avoid leaking uptime. 2^63
milliseconds result in 2.9*10^8 years of possible uptime.
OK yasuoka@
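A sketch of the idea (variable and helper names illustrative):

    /* initialization: a random 63 bit base hides the real uptime */
    arc4random_buf(&tcp_now_base, sizeof(tcp_now_base));
    tcp_now_base &= ~(1ULL << 63);

    /* the 32 bit timestamp option takes the low bits of the counter */
    uint32_t ts_val = (uint32_t)tcp_now();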
|
|
meant as a fallback if network hardware does not support TSO. Driver
support is still work in progress. TCP output generates large
packets. In IP output the packet is chopped to TCP maximum segment
size. This reduces the CPU cycles used by pf. The regular output
could be assisted by hardware later, but pf route-to and IPsec need
the software fallback in general.
For performance comparison or to workaround possible bugs, sysctl
net.inet.tcp.tso=0 disables the feature. netstat -s -p tcp shows
TSO counter with chopped and generated packets.
based on work from jan@
tested by jmc@ jan@ Hrvoje Popovski
OK jan@ claudio@
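For example, to disable the feature and inspect the counters:

    # sysctl net.inet.tcp.tso=0
    # netstat -s -p tcp | grep -i tso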
|
|
(SRTT) instead of the timestamp option. The timestamp option is
disabled on some OSs (e.g. Windows) or dropped by some
firewalls/routers; in such cases the window size was fixed at 16KB,
which limited throughput severely on high latency networks.
Also change "tcp_now" from a 2HZ tick counter to binuptime in
milliseconds to calculate the SRTT more precisely.
tested by krw matthieu jmatthew dlg djm stu stsp
ok claudio
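A sketch of the millisecond conversion from binuptime(9) (helper
name illustrative):

    static inline uint64_t
    msuptime(void)
    {
            struct bintime bt;

            binuptime(&bt);
            /* bt.frac is a 64 bit binary fraction of a second */
            return bt.sec * 1000 +
                (((uint64_t)1000 * (uint32_t)(bt.frac >> 32)) >> 32);
    }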
|
|
malloc(9) or pool_get(9).
Pass down a wait flag to pru_attach(). During the socket(2) syscall
it is OK to wait; this logic was missing for the internet PCB.
Pfkey and route sockets were already waiting.
sonewconn() must not wait when called during TCP 3-way handshake.
This logic has been preserved. Unix domain stream socket connect(2)
can wait until the other side has created the socket to accept.
OK mvs@
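The pattern, sketched with illustrative flag values and call shape:

    /* socket(2) syscall: sleeping for memory is fine */
    error = pru_attach(so, protocol, M_WAIT);
    /* sonewconn() during the 3-way handshake: must not sleep */
    error = pru_attach(so, protocol, M_NOWAIT);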
|
|
found by Hrvoje Popovski with witness; OK mvs@
|
|
removes pressure from the exclusive netlock in tcp_slowtimo().
Reading is done atomically. Ensure that the tcp_now value is read
only once per function to provide consistent time.
OK yasuoka@
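The per-function pattern, sketched:

    uint64_t now = tcp_now();   /* read once for a consistent view */
    /* use `now' for every time comparison in this function */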
|
|
so the public API is in_pcblookup() and in_pcblookup_listen(). For
internal use introduce in_pcbhash_insert() and in_pcbhash_lookup()
to avoid code duplication. Routing domain is unsigned, change the
type to u_int.
OK mvs@
|
|
Use their reference counter in more places.
The in_pcb lookup functions hold the PCBs in hash tables protected
by table->inpt_mtx mutex. Whenever a result is returned, increment
the ref count before releasing the mutex. Then the inp can be used
as long as necessary. Unref it at the end of all functions that
call in_pcb lookup.
As a shortcut, pf may also hold a reference to the PCB. When
pf_inp_lookup() returns it, it also increments the ref count and the
caller can handle it like the inp from table lookup.
OK sashan@
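The lookup pattern, sketched (function names illustrative):

    mtx_enter(&table->inpt_mtx);
    inp = in_pcbhashlookup(table, faddr, fport, laddr, lport, rdomain);
    if (inp != NULL)
            in_pcbref(inp);     /* take the ref while the mutex is held */
    mtx_leave(&table->inpt_mtx);
    if (inp != NULL) {
            /* inp remains valid here, even without the mutex */
            in_pcbunref(inp);
    }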
|
|
function.
OK gnezdo@ mvs@ florian@ sashan@
|
|
ok jmc@ reads ok tb@
|
|
crypto task anymore, it is possible to return the next protocol.
Then ip_deliver() will walk the header chain in its loop.
IPsec bridge(4) tested by jan@
OK mvs@ tobhe@ jan@
|
|
the mbuf, the callers must be careful. Although there is no bug,
use the common pattern to handle this. Pass down an mbuf pointer
mp and let m_pullup() update the pointer in all callers.
It looks like the tcp signature functions should not be called.
Avoid an mbuf leak and return an error.
OK mvs@
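The common pattern referred to above, sketched (error code
illustrative):

    /* m_pullup() may replace the head mbuf; keep the caller's
     * pointer up to date; the chain is already freed on failure */
    if ((*mp = m_pullup(*mp, sizeof(struct tcphdr))) == NULL)
            return (ENOMEM);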
|
|
function. But it was never called via this pointer. It would have
crashed immediately, as mp is always NULL when called via .xf_output().
Do not set .xf_output to ipip_output. This allows passing only the
parameters that are actually needed, and the control flow is clearer.
OK mpi@
|
|
route and was not there before. This should prevent a recursion
in path MTU discovery with TCP over IPsec.
reported and tested by Matthias Schmidt; tested and OK tobhe@
|
|
and map data read only.
OK deraadt@ mvs@ mpi@
|
|
calling tcp_output() if the TCP maximum segment size changes. But
that did not work, as the new value was compared before tcp_mss()
had a chance to modify it. Move the comparison and change it from
not equal to greater than. It only makes sense to resend a packet
immediately if it becomes smaller and is more likely to fit.
OK sashan@ tobhe@
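The shape of the fix (an illustrative diff, variable name invented):

    -	if (tp->t_maxseg != oldmss)
    +	if (oldmss > tp->t_maxseg)
    		tcp_output(tp);	/* a smaller segment may fit now */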
|
|
the first cut of this diff was made with coccinelle using this spatch:
@rule@
type caddr_t;
expression m, off, len, cp;
@@
-m_copydata(m, off, len, (caddr_t)cp)
+m_copydata(m, off, len, cp)
i had to fix its opinionated idea of formatting by hand though, so
i'm not sure it was worth it.
ok deraadt@ bluhm@
|
|
Zero-tick timeouts rely on implicit behavior in the timeout layer that
inhibits optimizations in softclock().
bluhm@ says waiting a tick for the reaper shouldn't break anything.
ok bluhm@
|
|
in raw IP delivery and UDP broadcast loops. There inpcbtable_mtx
is held and sorwakeup() is called within the loop. As sowakeup()
grabs the kernel lock, we have a lock ordering problem.
found by Hrvoje Popovski; OK deraadt@ mpi@
|
|
for netstat -a. Introduce a global mutex that protects the tables
and hashes for the internet PCBs. To detect detached PCBs, set the
inp_socket field to NULL. This has to be protected by a per-PCB
mutex. The protocol pointer has to be protected by the mutex as
netstat uses it.
Always take the kernel lock in in_pcbnotifyall() and in6_pcbnotify()
before the table mutex to avoid lock ordering problems in the notify
functions.
OK visa@
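Sketch of the detach check inside a notify loop (mutex name
illustrative):

    mtx_enter(&inp->inp_mtx);
    if (inp->inp_socket == NULL) {
            /* PCB has been detached, skip it */
            mtx_leave(&inp->inp_mtx);
            continue;
    }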
|
|
ok bluhm
|
|
the delack timer had a different implementation. Use the same
mechanism for all TCP timers.
OK mpi@ visa@
|
|
OK millert@
|
|
if the tcpcb exits.
OK mpi@
|
|
it could run immediately and was not synchronized with the TCP
timeouts, although that was the intention when it was introduced
in revision 1.85. Convert the reaper to an ordinary TCP timeout
so it is scheduled on the same timeout thread after all timeouts
have finished. A net lock is not necessary as the process calling
tcp_close() will not access the tcpcb after arming the reaper
timeout.
OK mikeb@
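Arming the reaper then looks like any other TCP timer (a sketch in
the tcp_timer.h macro style):

    TCP_TIMER_ARM(tp, TCPT_REAPER, 0);  /* runs after other timeouts */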
|
|
The initialization of a secret SHA256 context for generating TCP
initial sequence numbers is moved out of tcp_set_iss_tsm used to
set up ISN for new connections and into tcp_init, sparing the
need for a global flag.
OK deraadt, visa, mpi
|
|
OK deraadt, mpi, visa, job
|
|
buffers.
This is one step towards unlocking the TCP input path. Note that
the functions asserting for the socket lock are not necessarily
MP-safe. Not all the fields of 'struct socket' are protected.
Introduce a new kernel-only kqueue hint, NOTE_SUBMIT, to be able to
tell when a filter needs to lock the underlying data structures. Logic
and name taken from NetBSD.
Tested by Hrvoje Popovski.
ok claudio@, bluhm@, mikeb@
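Sketch of how a filter can use the hint (locking helpers
illustrative):

    int
    filt_soread(struct knote *kn, long hint)
    {
            struct socket *so = kn->kn_fp->f_data;
            int rv;

            if ((hint & NOTE_SUBMIT) == 0)
                    solock(so);     /* not called from the socket layer */
            rv = (kn->kn_data = so->so_rcv.sb_cc) > 0;
            if ((hint & NOTE_SUBMIT) == 0)
                    sounlock(so);
            return (rv);
    }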
|
|
<netinet/tcp_debug.h>.
The IPv6 variant was always included and the IPv4 version is not
present on all systems.
Most of the offending ports are already fixed, thanks to sthen@!
|
|
No binary change.
OK mpi@
|
|
inline function instead of casting it to sockaddr. While there,
use inline instead of __inline for all these conversions. Some
struct sockaddr casts can be avoided completely.
OK dhill@ mpi@
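For example, the sockaddr_in6 conversion becomes an inline function
instead of a bare cast (a sketch of the pattern):

    static inline struct sockaddr *
    sin6tosa(struct sockaddr_in6 *sin6)
    {
            return ((struct sockaddr *)(sin6));
    }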
|
|
No binary change.
OK mpi@
|
|
ok mpi@ bluhm@
|
|
IPv4 pr_ctlinput functions returned a void pointer that was always
NULL and never used. Make all functions void like in the IPv6 case.
OK mpi@
|
|
ok bluhm@, kettenis@
|
|
has IPL_SOFTNET as ipl.
ok mikeb@, kettenis@
|
|
|
|
the ioff argument to pool_init() is unused and has been for many
years, so this replaces it with an ipl argument. because the ipl
will be set on init we no longer need pool_setipl.
most of these changes have been done with coccinelle using the spatch
below. cocci sucks at formatting code though, so i fixed that by hand.
the manpage and subr_pool.c bits i did myself.
ok tedu@ jmatthew@
@ipl@
expression pp;
expression ipl;
expression s, a, o, f, m, p;
@@
-pool_init(pp, s, a, o, f, m, p);
-pool_setipl(pp, ipl);
+pool_init(pp, s, a, ipl, f, m, p);
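A call site after the conversion then reads, for example (pool and
wchan names illustrative):

    pool_init(&tcpcb_pool, sizeof(struct tcpcb), 0, IPL_SOFTNET, 0,
        "tcpcb", NULL);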
|
|
thank you to everyone who helped review these diffs
ok mpi@
|
|
the additional clusters in the socket buffer and not elsewhere.
OK claudio@
|
|
This is another little step towards deprecating 'struct route{,_in6}'.
ok florian@
|