Age | Commit message | Author |
|
Socket field so_oobmark belongs to the receive path, so use the
so_rcv mutex to protect it. Although tcp_input() is still exclusively
locked, take the mutex there to prepare for further unlocking.
OK mvs@
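A minimal sketch of the locking pattern, assuming the receive buffer
mutex is the sb_mtx member of struct sockbuf; the urgent pointer
calculation around it is illustrative:

    /* protect so_oobmark with the receive buffer mutex */
    mtx_enter(&so->so_rcv.sb_mtx);
    so->so_oobmark = so->so_rcv.sb_cc +
        (tp->rcv_up - tp->rcv_nxt) - 1;
    mtx_leave(&so->so_rcv.sb_mtx);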
|
|
Instead of passing around u_char[4], introduce struct ipsec_level
that contains 4 ipsec levels. This provides better type safety.
The embedding struct inpcb is globally visible for netstat(1), so
put struct ipsec_level outside of #ifdef _KERNEL.
OK deraadt@ mvs@
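A sketch of the wrapper type, assuming the four levels are those of
the IPsec socket options; member names are illustrative:

    /* one typed struct instead of a raw u_char[4] */
    struct ipsec_level {
            u_char sl_auth;         /* IP_AUTH_LEVEL */
            u_char sl_esp_trans;    /* IP_ESP_TRANS_LEVEL */
            u_char sl_esp_network;  /* IP_ESP_NETWORK_LEVEL */
            u_char sl_ipcomp;       /* IP_IPCOMP_LEVEL */
    };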
|
|
no functional change, found by smatch warnings
ok miod@ bluhm@
|
|
With two separate TCP hash tables, each one becomes smaller. When
we remove the exclusive net lock from TCP, contention on the internet
PCB table mutex will be reduced. UDP was split earlier into IPv4
and IPv6. Replace branch conditions based on INP_IPV6 with
assertions.
OK mvs@
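A sketch of the resulting pattern, assuming separate tcbtable and
tcb6table globals as was done for UDP; names are illustrative:

    struct inpcbtable tcbtable;     /* IPv4 TCP PCBs */
    #ifdef INET6
    struct inpcbtable tcb6table;    /* IPv6 TCP PCBs */
    #endif

    /* before: if (ISSET(inp->inp_flags, INP_IPV6)) ... else ...
     * after, in a function reached only via tcb6table: */
    KASSERT(ISSET(inp->inp_flags, INP_IPV6));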
|
|
OK mvs@
|
|
Use a common struct route for both inet and inet6. Unfortunately
struct sockaddr is shorter than sockaddr_in6, so netinet/in.h has
to be exposed from net/route.h. Struct route has to be BSD visible
for userland, as netstat kvm code inspects inp_route. Internet PCB
and TCP SYN cache can use a plain struct route now. All specific
sockaddr types for inet and inet6 are embedded there.
OK claudio@
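A sketch of the layout described; member names beyond the message
are illustrative. The union makes the destination storage large
enough for sockaddr_in6:

    struct route {
            struct rtentry  *ro_rt;
            u_long           ro_tableid;
            union {
                    struct sockaddr         ro_dstsa;
                    struct sockaddr_in      ro_dstsin;
                    struct sockaddr_in6     ro_dstsin6;
            } ro_dst;   /* a plain sockaddr would truncate sin6 */
    };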
|
|
OK mvs@
|
|
tcp6_ctlinput() cast a constant sockaddr_sin6 to a non-const sockaddr.
sa6_src may be &sa6_any, which lives in the read-only data section.
Better pass down the const addresses to syn_cache_lookup(). They
are needed for hash lookup and are not modified.
OK mvs@
|
|
The inpcb hash table is protected by table->inpt_mtx. The hash is
based on addresses, ports, and the routing table. These fields were
not synchronized with the hash. Put writes and hash update into the
same critical section.
Move the updates from ip_ctloutput(), ip6_ctloutput(), syn_cache_get(),
tcp_connect(), udp_disconnect() to dedicated inpcb set functions.
There they use the same table mutex as in_pcbrehash().
in_pcbbind(), in_pcbconnect(), and in6_pcbconnect() need more work
and are not included yet.
OK sashan@ mvs@
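A sketch of one dedicated set function, assuming in_pcbrehash()
expects the table mutex to be held; the function name and parameter
list are illustrative:

    void
    in_pcbset_laddr(struct inpcb *inp, const struct in_addr *laddr,
        u_int rtableid)
    {
            struct inpcbtable *table = inp->inp_table;

            /* write hash input fields and rehash atomically */
            mtx_enter(&table->inpt_mtx);
            inp->inp_laddr = *laddr;
            inp->inp_rtableid = rtableid;
            in_pcbrehash(inp);
            mtx_leave(&table->inpt_mtx);
    }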
|
|
TCP syn_cache_respond() uses inp_seclevel from the listening socket
as an ip_output() parameter. This was missing for ip6_output().
OK mvs@
|
|
As syn_cache_timer() uses the syn cache mutex and the exclusive
net lock, it does not need the kernel lock.
OK mvs@
|
|
In some cases inp may be NULL, so check that before passing
inp->inp_seclevel to ipsp_spd_lookup() or ip_output().
Missed in previous commit.
|
|
ip_output() received inp as a parameter. This is only used to look
up the IPsec level of the socket. Reasoning about MP locking is much
easier if only relevant data is passed around. Convert ip_output()
to receive the constant inp_seclevel as an argument and mark it as
protected by the net lock.
OK mvs@
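Roughly, the prototype change looks like this; the surrounding
parameters follow netinet/ip_var.h but should be treated as a
sketch:

    /* before: the whole PCB passed, only to find IPsec levels */
    int ip_output(struct mbuf *, struct mbuf *, struct route *,
        int, struct ip_moptions *, struct inpcb *, u_int32_t);

    /* after: only the relevant, net-lock-protected data */
    int ip_output(struct mbuf *, struct mbuf *, struct route *,
        int, struct ip_moptions *, const struct ipsec_level *,
        u_int32_t);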
|
|
Introduce a global TCP SYN cache mutex. Divide the timer function
into parts protected by the mutex and sending protected by the net
lock. Split the flags field into dynamic flags protected by the
mutex and fixed flags set during initialization. Document whether
fields of struct syn_cache are protected by net lock or mutex.
input and OK sashan@
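A sketch of the split, with illustrative field names and the
locking annotations the message mentions:

    struct mutex syn_cache_mtx;     /* global SYN cache mutex */

    struct syn_cache {
            /* ... other fields elided ... */
            u_int   sc_dynflags;    /* [S] protected by syn_cache_mtx */
            u_int   sc_fixflags;    /* [I] set during initialization */
    };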
|
|
by unexpanding the SYN_CACHE_TIMER_ARM() macro in the timer callback.
OK mvs@
|
|
The syn_cache_reaper() is a hack to serialize timeouts. Unfortunately
it has a race and sometimes panics with "pool_do_get: syncache free
list modified". Add a reference counter for the timeout and the
list of syn cache entries. Currently the list refcount is not
strictly necessary due to the exclusive netlock, but it will be
needed when we continue unlocking.
Checking timeout_initialized() is not MP friendly; better do proper
initialization during object allocation. The refcount can be traced
with btrace, which helps to find leaks.
bug reported and fix tested by Peter J. Philipp
OK claudio@
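A sketch of the refcount pattern with the refcnt(9) API; field and
pool names are illustrative, NULL checks elided:

    /* at allocation: start the counter, own one reference */
    sc = pool_get(&syn_cache_pool, PR_NOWAIT | PR_ZERO);
    refcnt_init(&sc->sc_refcnt);
    timeout_set(&sc->sc_timer, syn_cache_timer, sc);

    /* arming the timer takes a reference for the timeout */
    refcnt_take(&sc->sc_refcnt);
    if (timeout_add_msec(&sc->sc_timer, msecs) == 0)
            refcnt_rele(&sc->sc_refcnt);    /* was already pending */

    /* last one out frees the entry */
    if (refcnt_rele(&sc->sc_refcnt))
            pool_put(&syn_cache_pool, sc);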
|
|
After changing the TCP now tick to milliseconds, 32 bits will wrap
around after 49 days of uptime. That may be a problem in some
places of our stack. Better use a 64 bit counter.
As the timestamp option is 32 bits in the TCP protocol, use the
lower 32 bits there. There are casts to 32 bits that should behave
correctly.
Start with a random 63 bit offset to avoid uptime leakage. 2^63
milliseconds result in 2.9*10^8 years of possible uptime.
OK yasuoka@
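The arithmetic: 2^32 ms is about 49.7 days, while 2^63 ms is about
2.9*10^8 years. A sketch of the counter, assuming a nanouptime(9)
style time source; the real implementation may differ in detail:

    uint64_t tcp_now_offset;        /* random 63 bit boot offset */

    /* at init: */
    arc4random_buf(&tcp_now_offset, sizeof(tcp_now_offset));
    tcp_now_offset >>= 1;           /* keep 63 bits */

    uint64_t
    tcp_now(void)
    {
            struct timespec ts;

            nanouptime(&ts);
            return tcp_now_offset + (uint64_t)ts.tv_sec * 1000 +
                ts.tv_nsec / 1000000;
    }

    /* the protocol's timestamp option stays 32 bit */
    uint32_t tsval = (uint32_t)tcp_now();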
|
|
Our syn cache did checksum calculation by hand instead of using the
established mechanism in IP output. The software-checksummed counter
increased once per incoming TCP connection.
Just set the flag M_TCP_CSUM_OUT in syn_cache_respond() and let
in_proto_cksum_out() do the work later. Then hardware checksumming
is used where available. Also remove redundant code: the unhandled
af case is already handled in the first switch statement of the
function.
tested by Hrvoje Popovski; OK mvs@
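Roughly, the change replaces the hand-rolled sum with the output
path's checksum machinery:

    /* mark the mbuf; in_proto_cksum_out() computes the checksum
     * later, or leaves it to the interface if it can offload */
    m->m_pkthdr.csum_flags |= M_TCP_CSUM_OUT;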
|
|
milliseconds, which is the same unit as tcp_now(). However, keep
the unit of the sysctl variables in seconds and convert their unit
in tcp_sysctl(). Additionally revert TCPTV_SRTTDFLT back to 3 seconds,
which was mistakenly changed to 1.5 seconds by tcp_timer.h 1.19.
ok claudio
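A sketch of the conversion in tcp_sysctl(); the variable handled
here is illustrative:

    /* internal timer in ms, sysctl interface stays in seconds */
    int secs = tcp_keepidle / 1000;                 /* ms -> s */
    error = sysctl_int(oldp, oldlenp, newp, newlen, &secs);
    if (error == 0)
            tcp_keepidle = secs * 1000;             /* s -> ms */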
|
|
receive buffer. As was done for the SS_CANTSENDMORE bit, the
definition is kept as is, but now these bits belong to the `sb_state'
of the receive buffer. `sb_state' is ORed with `so_state' when socket
data is exported to userland.
ok bluhm@
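A sketch of the split and the export; the bit placement is as in
the message, the export line is illustrative:

    /* read side shut down: flag the receive buffer, not the
     * socket */
    so->so_rcv.sb_state |= SS_CANTRCVMORE;

    /* exporting socket state to userland: combine both words */
    u_short state = so->so_state | so->so_rcv.sb_state;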
|
|
listen port is not bound to port 0. With a matching pf divert-to
rule this assumption is no longer true and could crash the kernel
with a kassert. In both pf and the stack, drop TCP packets with
destination port 0 before they can do harm.
OK sashan@ claudio@
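The stack side can be sketched as an early check in tcp_input();
the drop label is illustrative:

    /* destination port 0 is never valid; toss the packet before
     * any PCB or syn cache lookup can trip over it */
    if (th->th_dport == 0)
            goto drop;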
|
|
type from short to int. Also switch local variables holding temporary
timer values from short to int.
OK yasuoka
|
|
of tcp time. This fixes the retransmit timer of the syn cache which
was broken.
reported by naddy; input dlg; tested by jca
ok jca
|
|
(SRTT) instead of the timestamp option. Since the timestamp option
is disabled on some OSes (e.g. Windows) or dropped by some
firewalls/routers, the window size had been fixed at 16KB in such
cases, which limits throughput to very low values on high latency
networks.
Also convert "tcp_now" from a 2HZ tick counter to binuptime in
milliseconds to calculate the SRTT more precisely.
tested by krw matthieu jmatthew dlg djm stu stsp
ok claudio
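For context, the standard SRTT smoothing (RFC 6298) that the
millisecond tcp_now feeds, written without the fixed-point shifts
the kernel actually uses:

    /* srtt <- (1 - 1/8) * srtt + 1/8 * rtt_sample */
    if (tp->t_srtt == 0)
            tp->t_srtt = rtt;               /* first measurement */
    else
            tp->t_srtt += (rtt - tp->t_srtt) / 8;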
|
|
malloc(9) or pool_get(9).
Pass down a wait flag to pru_attach(). During the socket(2) syscall
it is OK to wait; this logic was missing for internet PCBs. Pfkey
and route sockets were already waiting.
sonewconn() must not wait when called during TCP 3-way handshake.
This logic has been preserved. Unix domain stream socket connect(2)
can wait until the other side has created the socket to accept.
OK mvs@
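A sketch of the plumbing, assuming malloc(9)-style wait flags; the
call form and the constants used by the actual callers may differ:

    /* socket(2) path: sleeping for memory is fine */
    error = pru_attach(so, proto, M_WAITOK);

    /* sonewconn() during the TCP 3-way handshake: must not sleep */
    error = pru_attach(so, 0, M_NOWAIT);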
|
|
removes pressure from the exclusive netlock in tcp_slowtimo().
Reading is done atomically. Ensure that the tcp_now value is read
only once per function to provide consistent time.
OK yasuoka@
|
|
so the public API is in_pcblookup() and in_pcblookup_listen(). For
internal use introduce in_pcbhash_insert() and in_pcbhash_lookup()
to avoid code duplication. The routing domain is unsigned; change
the type to u_int.
OK mvs@
|
|
ok bluhm@
|
|
TCP_INFO provides a lot of information about the TCP session of
this socket. Many processes like to peek at the rtt of a connection,
but this also provides a lot of more specialized info for use by
e.g. tcpbench(1).
While the basic minimal info is available all the time, the more
specific data is only populated for privileged processes. This is
done so as not to share data back to userland that may make it
easier to attack a session.
TCP_INFO is available to pledge "inet" since pledged processes like chrome
tend to use TCP_INFO when available.
OK bluhm@
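A userland sketch of the consumer side; field units are as defined
in netinet/tcp.h:

    #include <sys/socket.h>
    #include <netinet/in.h>
    #include <netinet/tcp.h>
    #include <stdio.h>

    void
    print_rtt(int s)
    {
            struct tcp_info ti;
            socklen_t len = sizeof(ti);

            if (getsockopt(s, IPPROTO_TCP, TCP_INFO, &ti, &len) == -1)
                    return;
            /* basic info like the smoothed rtt needs no privilege */
            printf("rtt: %u\n", ti.tcpi_rtt);
    }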
|
|
Use their reference counter in more places.
The in_pcb lookup functions hold the PCBs in hash tables protected
by the table->inpt_mtx mutex. Whenever a result is returned, increment
the ref count before releasing the mutex. Then the inp can be used
as long as necessary. Unref it at the end of all functions that
call in_pcb lookup.
As a shortcut, pf may also hold a reference to the PCB. When
pf_inp_lookup() returns it, it also increments the ref count and the
caller can handle it like the inp from a table lookup.
OK sashan@
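A sketch of the resulting lookup pattern; the argument list of the
internal hash lookup is illustrative:

    mtx_enter(&table->inpt_mtx);
    inp = in_pcbhash_lookup(table, rdomain, &faddr, fport,
        &laddr, lport);
    if (inp != NULL)
            in_pcbref(inp);     /* keep inp alive past the mutex */
    mtx_leave(&table->inpt_mtx);

    /* ... caller uses inp, may sleep or take other locks ... */

    if (inp != NULL)
            in_pcbunref(inp);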
|
|
trees. ipsp_ids_lookup() returns `ids' with a bumped reference
counter.
original diff from mvs
ok mvs
|
|
ok jmc@ reads ok tb@
|
|
is not always needed, but the error value is necessary for the
caller. As the TDB should be refcounted, it makes no sense to always
return it. Pass an output pointer for the TDB, which can be NULL.
OK mvs@ tobhe@
|
|
covered yet, more ref counts to come. The timeouts are protected,
so the racy tdb_reaper() gets retired. The tdb_policy_head, onext
and inext lists are protected. All gettdb...() functions return a
tdb that is ref counted and has to be unrefed later. A flag ensures
that tdb_delete() is called only once.
Tested by Hrvoje Popovski; OK sthen@ mvs@ tobhe@
|
|
The sending machine is doing zero window probes, but is not sending
any more data although the other machine announced that it has space
again. The header prediction code did not update snd_wl2. If there
was a sequence number wrap, the send window update block is not
reached.
Update snd_wl2 when receiving predicted ACKs, and update snd_wl1
and rcv_up for predicted pure data.
from FreeBSD; OK sashan@ claudio@
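A sketch of the fix in the header prediction fast path, following
the FreeBSD change the message refers to:

    /* predicted pure ACK: remember what this ACK acked */
    tp->snd_wl2 = th->th_ack;

    /* predicted pure in-sequence data: keep the window update
     * reference points and the urgent pointer current */
    tp->snd_wl1 = th->th_seq;
    tp->rcv_up = tp->rcv_nxt;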
|
|
This is a backout of rev 1.366 which turned this feature off.
Although sending fewer ACKs makes TCP faster if the CPU is busy with
processing packets, there are corner cases where TCP gets slower.
Especially OpenBSD 6.8 and older have a maxburst limitation that
scales badly if the other side sends too few ACKs. Also the regress
test relayd run-args-http-slow-consumer.pl uses strange socket
buffer sizes that trigger slow performance with the new algorithm.
For the OpenBSD 6.9 release, switch back to the 6.8 delayed ACK
behavior.
discussed with deraadt@ benno@ claudio@ jan@
|
|
ok gnezdo@ semarie@ mpi@
|
|
The kernel uses a huge amount of processing time for sending ACKs
to the sender on the receiving interface. After receiving a data
segment, we send out two ACKs. The first one is sent in tcp_input()
directly after receiving. The second ACK is sent out after userland
or the sosplice task has read some data out of the socket buffer.
Thus, we save some processing time and improve network performance.
Longer tested by sthen@
OK claudio@
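A sketch of the mechanism described, reduced to the flag handling;
the actual code also arms and cancels a delayed-ACK timeout:

    /* in tcp_input(): note that an ACK is owed, but let the
     * delayed-ACK timeout send it */
    tp->t_flags |= TF_DELACK;

    /* after userland or sosplice reads from so_rcv, the window
     * update path forces the ACK out */
    tp->t_flags |= TF_ACKNOW;
    (void)tcp_output(tp);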
|
|
avoidance. The problem and fix are noted in RFC 5681 section 3.1,
page 7.
Report, diff and testing from Brian Brombacher, thanks!
Testing and a cosmetic tweak by myself.
ok claudio
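For context, the RFC 5681 section 3.1 congestion avoidance step,
applied per incoming ACK; page 7 discusses the increment bounds the
fix is about:

    /* grow cwnd by about SMSS*SMSS/cwnd per ACK, i.e. at most
     * one SMSS per RTT; never round the increment down to zero */
    u_long incr = tp->t_maxseg * tp->t_maxseg / tp->snd_cwnd;
    tp->snd_cwnd = ulmin(tp->snd_cwnd + ulmax(incr, 1),
        TCP_MAXWIN << tp->snd_scale);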
|
|
ok bluhm@
|
|
isakmpd and iked to REQUIRE. Filter policy violations earlier.
ok sashan@ bluhm@
|
|
tp->snd_wnd. This can happen, for example, when the remote side
responds to a window probe by ACKing the one byte it contains.
from FreeBSD; via markus@; OK sashan@ tobhe@
|
|
sack hole list length or pool limit.
OK claudio@
|
|
There is a global tunable limit net.inet.tcp.sackholelimit, default
is 32768. If an attacker manages to attach all these sack holes
to a few TCP connections, the lists may grow long. Traversing them
might cause higher CPU consumption on the victim machine. In
practice such a situation is hard to create, as the TCP retransmit
and 2*msl timers flush the list periodically. For additional
protection, enforce a per-connection limit of 128 SACK holes in the
list.
reported by Reuven Plevinsky and Tal Vainshtein
discussed with claudio@ and procter@; OK deraadt@
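A sketch of the per-connection limit; the constant name is
illustrative, the value is from the message:

    #define TCP_SACKHOLE_LIMIT      128

    /* refuse to grow this connection's hole list any further */
    if (tp->snd_numholes >= TCP_SACKHOLE_LIMIT)
            return (NULL);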
|
|
PAWS. Otherwise we could trigger a retransmit of the opposite party
with another wrong timestamp and produce a loop. I have seen this
with a buggy server which messed up TCP timestamps.
Suggested by Prof. Jacobson for FreeBSD.
ok krw, bluhm, henning, mpi
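For context, a sketch of the PAWS test (RFC 7323) whose failure
path the fix changes; names follow tcp_input() style but are
illustrative:

    /* a timestamp older than the last recorded one, recorded
     * within the 24 day idle window, marks an old duplicate */
    if (opti.ts_present && TSTMP_LT(opti.ts_val, tp->ts_recent) &&
        now - tp->ts_recent_age <= TCP_PAWS_IDLE) {
            /* drop without echoing our own timestamp, so the
             * peer is not pushed into a retransmit loop */
            goto drop;
    }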
|