path: root/sys/netinet/tcp_input.c
Age  Commit message  Author
2024-11-08  Use mutex of receive socket buffer to protect so_oobmark.  Alexander Bluhm
Socket field so_oobmark belongs to receive path, so use so_rcv mutex to protect it. Although tcp_input() is still exclusively locked, put mutex there to prepare further unlocking. OK mvs@
2024-08-26  Rearrange #ifdef TCP_SIGNATURE to keep braces balanced.  Alexander Bluhm
2024-06-07  remove unused packet header length defines  Jonathan Gray
2024-04-17  Use struct ipsec_level within inpcb.  Alexander Bluhm
Instead of passing around u_char[4], introduce struct ipsec_level that contains 4 ipsec levels. This provides better type safety. The embedding struct inpcb is globally visible for netstat(1), so put struct ipsec_level outside of #ifdef _KERNEL. OK deraadt@ mvs@
2024-04-13  correct indentation  Jonathan Gray
no functional change, found by smatch warnings ok miod@ bluhm@
2024-04-12  Split single TCP inpcb table into IPv4 and IPv6 parts.  Alexander Bluhm
With two separate TCP hash tables, each one becomes smaller. When we remove the exclusive net lock from TCP, contention on internet PCB table mutex will be reduced. UDP has been split earlier into IPv4 and IPv6. Replace branch conditions based on INP_IPV6 with assertions. OK mvs@
2024-04-10  Move global variables for TCP debug onto the tcp_input() stack.  Alexander Bluhm
OK mvs@
2024-02-13  Merge struct route and struct route_in6.  Alexander Bluhm
Use a common struct route for both inet and inet6. Unfortunately struct sockaddr is shorter than sockaddr_in6, so netinet/in.h has to be exposed from net/route.h. Struct route has to be bsd visible for userland as netstat kvm code inspects inp_route. Internet PCB and TCP SYN cache can use a plain struct route now. All specific sockaddr types for inet and inet6 are embedded there. OK claudio@
2024-02-11  Remove include netinet6/ip6_var.h from netinet/in_pcb.h.  Alexander Bluhm
OK mvs@
2024-01-27  Declare address parameter in TCP SYN cache const.  Alexander Bluhm
tcp6_ctlinput() cast a constant sockaddr_sin6 to non-const sockaddr. sa6_src may be &sa6_any which lives in read-only data section. Better pass down the const addresses to syn_cache_lookup(). They are needed for hash lookup and are not modified. OK mvs@
2024-01-11  Fix white spaces in TCP.  Alexander Bluhm
2023-12-01  Set inp address, port and rtable together with inpcb hash.  Alexander Bluhm
The inpcb hash table is protected by table->inpt_mtx. The hash is based on addresses, ports, and routing table. These fields were not synchronized with the hash. Put writes and hash update into the same critical section. Move the updates from ip_ctloutput(), ip6_ctloutput(), syn_cache_get(), tcp_connect(), udp_disconnect() to dedicated inpcb set functions. There they use the same table mutex as in_pcbrehash(). in_pcbbind(), in_pcbconnect(), and in6_pcbconnect() need more work and are not included yet. OK sashan@ mvs@
2023-11-30  Pass inp_seclevel to ip6_output() in TCP syn cache.  Alexander Bluhm
TCP syn_cache_respond() uses inp_seclevel from listening socket as ip_output() parameter. This was missing for ip6_output(). OK mvs@
2023-11-29  Run TCP syn cache timer without kernel lock.  Alexander Bluhm
As syn_cache_timer() uses syn cache mutex and exclusive net lock, it does not need kernel lock. OK mvs@
2023-11-27  Add NULL check before dereferencing inp_seclevel.  Alexander Bluhm
In some cases inp may be NULL, so check that before passing inp->inp_seclevel to ipsp_spd_lookup() or ip_output(). Missed in previous commit.
2023-11-26  Remove inp parameter from ip_output().  Alexander Bluhm
ip_output() received inp as parameter. This is only used to lookup the IPsec level of the socket. Reasoning about MP locking is much easier if only relevant data is passed around. Convert ip_output() to receive constant inp_seclevel as argument and mark it as protected by net lock. OK mvs@
2023-11-16  Run TCP SYN cache timer logic without net lock.  Alexander Bluhm
Introduce global TCP SYN cache mutex. Divide timer function into parts protected by mutex and sending with netlock. Split the flags field into dynamic flags protected by mutex and fixed flags set during initialization. Document whether fields of struct syn_cache are protected by net lock or mutex. input and OK sashan@
2023-09-03  Avoid a useless increment and decrement of the tcp syn cache refcount  Alexander Bluhm
by unexpanding the SYN_CACHE_TIMER_ARM() macro in the timer callback. OK mvs@
2023-08-28  Introduce reference counting for TCP syn cache entries.  Alexander Bluhm
The syn_cache_reaper() is a hack to serialize timeouts. Unfortunately it has a race and panics sometimes with pool_do_get: syncache free list modified. Add a reference counter for timeout and list of syn cache entries. Currently list refcount is not strictly necessary due to exclusive netlock, but will be needed when we continue unlocking. Checking timeout_initialized() is not MP friendly, better do proper initialization during object allocation. Refcount in btrace helps to find leaks. bug reported and fix tested by Peter J. Philipp OK claudio@
2023-07-06  Convert tcp_now() time counter to 64 bit.  Alexander Bluhm
After changing tcp now tick to milliseconds, 32 bits will wrap around after 49 days of uptime. That may be a problem in some places of our stack. Better use a 64 bit counter. As timestamp option is 32 bit in TCP protocol, use the lower 32 bit there. There are casts to 32 bits that should behave correctly. Start with random 63 bit offset to avoid uptime leakage. 2^63 milliseconds result in 2.9*10^8 years of possible uptime. OK yasuoka@
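The 32-bit handling described above can be illustrated with a minimal sketch. It is not the kernel code: the helper names `ts_from_now()` and `ts_elapsed()` are hypothetical and only show why truncating a 64-bit millisecond counter to its lower 32 bits still yields correct timestamp differences across a wrap.

```c
#include <stdint.h>

/*
 * Hypothetical helper: take the lower 32 bits of a 64-bit millisecond
 * clock, as needed for the 32-bit TCP timestamp option.
 */
static uint32_t
ts_from_now(uint64_t now_ms)
{
	return (uint32_t)now_ms;
}

/*
 * Hypothetical helper: elapsed milliseconds between two 32-bit
 * timestamps. Unsigned subtraction is modulo 2^32, so the result stays
 * correct when the 32-bit value wrapped between the two samples, as
 * long as the real distance is below 2^32 ms (about 49 days).
 */
static uint32_t
ts_elapsed(uint32_t later, uint32_t earlier)
{
	return later - earlier;
}
```

For example, two samples taken 250 ms apart on either side of the 2^32 boundary still give `ts_elapsed()` equal to 250.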
2023-05-30  Use generic checksum calculation for TCP SYN+ACK packets.  Alexander Bluhm
Our syn cache did checksum calculation by hand, instead of the established mechanism in ip output. The software-checksummed counter increased once per incoming TCP connection. Just set the flag M_TCP_CSUM_OUT in syn_cache_respond() and let in_proto_cksum_out() do the work later. Then hardware checksumming is used where available. Also remove redundant code. The unhandled af case is handled in the first switch statement of the function. tested by Hrvoje Popovski; OK mvs@
2023-03-14  To avoid misunderstanding, keep variables for tcp keepalive in  YASUOKA Masahiko
milliseconds, which is the same unit as tcp_now(). However, keep the unit of sysctl variables in seconds and convert their unit in tcp_sysctl(). Additionally revert TCPTV_SRTTDFLT back to 3 seconds, which was mistakenly changed to 1.5 seconds by tcp_timer.h 1.19. ok claudio
2023-01-22  Move SS_CANTRCVMORE and SS_RCVATMARK bits from `so_state' to `sb_state' of  Vitaliy Makkoveev
receive buffer. As was done for the SS_CANTSENDMORE bit, the definitions are kept as is, but these bits now belong to the `sb_state' of the receive buffer. `sb_state' is ORed with `so_state' when socket data is exported to userland. ok bluhm@
2023-01-12  Binding the accept socket in TCP input relies on the fact that the  Alexander Bluhm
listen port is not bound to port 0. With a matching pf divert-to rule this assumption is no longer true and could crash the kernel with kassert. In both pf and stack drop TCP packets with destination port 0 before they can do harm. OK sashan@ claudio@
2022-12-09  Some TCP timer units have changed from slowhz to msec and their  Alexander Bluhm
type from short to int. Also switch local variables holding temporary timer values from short to int. OK yasuoka
2022-12-08  Convert tcptv_keep_init to milliseconds before comparing other values  YASUOKA Masahiko
of tcp time. This fixes the retransmit timer of syn_cache which was broken. reported by naddy, input dlg, test jca ok jca
2022-11-07  Modify TCP receive buffer size auto scaling to use the smoothed RTT  YASUOKA Masahiko
(SRTT) instead of the timestamp option. The timestamp option is disabled on some OSs (e.g. Windows) or dropped by some firewalls/routers. In such cases the window size had been fixed at 16KB, which limits throughput to a very low value on high latency networks. Also replace "tcp_now" from 2HZ tick counter to binuptime in milliseconds to calculate the SRTT better. tested by krw matthieu jmatthew dlg djm stu stsp ok claudio
2022-10-03  System calls should not fail due to temporary memory shortage in  Alexander Bluhm
malloc(9) or pool_get(9). Pass down a wait flag to pru_attach(). During syscall socket(2) it is ok to wait, this logic was missing for internet pcb. Pfkey and route sockets were already waiting. sonewconn() must not wait when called during TCP 3-way handshake. This logic has been preserved. Unix domain stream socket connect(2) can wait until the other side has created the socket to accept. OK mvs@
2022-09-03  Use a mutex to update tcp_maxidle, tcp_iss, and tcp_now. This  Alexander Bluhm
removes pressure from the exclusive netlock in tcp_slowtimo(). Reading is done atomically. Ensure that the tcp_now value is read only once per function to provide consistent time. OK yasuoka@
2022-08-30  Refactor internet PCB lookup function. Rename in_pcbhashlookup()  Alexander Bluhm
so the public API is in_pcblookup() and in_pcblookup_listen(). For internal use introduce in_pcbhash_insert() and in_pcbhash_lookup() to avoid code duplication. Routing domain is unsigned, change the type to u_int. OK mvs@
2022-08-21  Change soabort() return value to void. We are never interested in it.  Vitaliy Makkoveev
ok bluhm@
2022-08-11  Add TCP_INFO support to getsockopt for tcp sessions.  Claudio Jeker
TCP_INFO provides a lot of information about the TCP session of this socket. Many processes like to peek at the rtt of a connection, but it also provides a lot more specialized info for use by e.g. tcpbench(1). While the basic minimal info is available all the time, the more specific data is only populated for privileged processes. This is done to not share data back to userland that may allow attacking a session. TCP_INFO is available to pledge "inet" since pledged processes like chrome tend to use TCP_INFO when available. OK bluhm@
2022-08-08  To make protocol input functions MP safe, internet PCBs need protection.  Alexander Bluhm
Use their reference counter in more places. The in_pcb lookup functions hold the PCBs in hash tables protected by table->inpt_mtx mutex. Whenever a result is returned, increment the ref count before releasing the mutex. Then the inp can be used as long as necessary. Unref it at the end of all functions that call in_pcb lookup. As a shortcut, pf may also hold a reference to the PCB. When pf_inp_lookup() returns it, it also increments the ref count and the caller can handle it like the inp from table lookup. OK sashan@
2022-01-04  Add `ipsec_flows_mtx' mutex(9) to protect `ipsp_ids_*' list and  YASUOKA Masahiko
trees. ipsp_ids_lookup() returns `ids' with bumped reference counter. original diff from mvs ok mvs
2022-01-02  spelling  Jonathan Gray
ok jmc@ reads ok tb@
2021-12-01  Let ipsp_spd_lookup() return an error instead of a TDB. The TDB  Alexander Bluhm
is not always needed, but the error value is necessary for the caller. As TDB should be refcounted, it makes no sense to always return it. Pass an output pointer for the TDB which can be NULL. OK mvs@ tobhe@
2021-11-25  move label to fix RAMDISK  Theo de Raadt
2021-11-25  Implement reference counting for IPsec tdbs. Not all cases are  Alexander Bluhm
covered yet, more ref counts to come. The timeouts are protected, so the racy tdb_reaper() gets retired. The tdb_policy_head, onext and inext lists are protected. All gettdb...() functions return a tdb that is ref counted and has to be unrefed later. A flag ensures that tdb_delete() is called only once. Tested by Hrvoje Popovski; OK sthen@ mvs@ tobhe@
2021-08-09  During unidirectional data transmission, a TCP connection may stall.  Alexander Bluhm
The sending machine is doing zero window probes, but is not sending any more data although the other machine announced that it has space again. The header prediction code did not update snd_wl2. If there was a sequence number wrap, the send window update block is not reached. Update snd_wl2 when receiving predicted ACKs and update snd_wl1 and rcv_up for predicted pure data. from FreeBSD; OK sashan@ claudio@
2021-08-09  Fix white spaces.  Alexander Bluhm
2021-04-16  Turn on the direct ACK on every other segment.  Alexander Bluhm
This is a backout of rev 1.366 which turned this feature off. Although sending fewer ACKs makes TCP faster if the CPU is busy with processing packets, there are corner cases where TCP gets slower. Especially OpenBSD 6.8 and older has a maxburst limitation that scales badly if the other side sends too few ACKs. Also regress test relayd run-args-http-slow-consumer.pl uses strange socket buffer sizes that trigger slow performance with the new algorithm. For OpenBSD 6.9 release switch back to 6.8 delayed ACK behavior. discussed with deraadt@ benno@ claudio@ jan@
2021-03-10  spelling  Jonathan Gray
ok gnezdo@ semarie@ mpi@
2021-02-03  Turns off the direct ACK on every other segment  jan
The kernel uses a huge amount of processing time for sending ACKs to the sender on the receiving interface. After receiving a data segment, we send out two ACKs. The first one in tcp_input() directly after receiving. The second ACK is sent out after the userland or the sosplice task has read some data out of the socket buffer. Thus, we save some processing time and improve network performance. Longer tested by sthen@ OK claudio@
2020-06-19  Break a glass ceiling on cwnd due to integer division during congestion  Richard Procter
avoidance. The problem and fix is noted in RFC5681 section 3.1, page 7. Report, diff and testing from Brian Brombacher, thanks! Testing and a cosmetic tweak by myself. ok claudio
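The integer division issue can be sketched as follows. This is an illustration of the note in RFC 5681 section 3.1, not the committed diff: the function name `cwnd_ca_increment()` is hypothetical. During congestion avoidance the window grows by SMSS*SMSS/cwnd per ACK; with integer arithmetic this quotient becomes 0 once cwnd exceeds SMSS*SMSS, so the RFC recommends rounding it up to 1 byte.

```c
#include <stdint.h>

/*
 * Hypothetical helper: per-ACK congestion avoidance increment in bytes.
 * cwnd must be nonzero. A 64-bit intermediate avoids overflow of
 * smss * smss; a zero quotient is rounded up to 1 byte so that cwnd
 * keeps growing above SMSS*SMSS.
 */
static uint32_t
cwnd_ca_increment(uint32_t cwnd, uint32_t smss)
{
	uint64_t incr = (uint64_t)smss * smss / cwnd;

	if (incr == 0)
		incr = 1;
	return (uint32_t)incr;
}
```

With an SMSS of 1460 bytes the ceiling without the rounding would sit at 1460*1460 = 2131600 bytes of cwnd.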
2019-12-06  Checking the IPsec policy is expensive. Check only when IPsec is used.  tobhe
ok bluhm@
2019-11-29  Change the default security level for incoming IPsec flows from  tobhe
isakmpd and iked to REQUIRE. Filter policy violations earlier. ok sashan@ bluhm@
2019-11-11  Prevent underflows in tp->snd_wnd if the remote side ACKs more than  Alexander Bluhm
tp->snd_wnd. This can happen, for example, when the remote side responds to a window probe by ACKing the one byte it contains. from FreeBSD; via markus@; OK sashan@ tobhe@
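The underflow can be pictured with a small sketch. It assumes an unsigned 32-bit send window and uses a hypothetical helper `snd_wnd_after_ack()`; the actual change in tcp_input.c may be structured differently.

```c
#include <stdint.h>

/*
 * Hypothetical helper: shrink the send window by the number of newly
 * acknowledged bytes. A plain "snd_wnd - acked" on unsigned values
 * would wrap to a huge window when acked > snd_wnd, for instance when
 * a one-byte window probe is ACKed while snd_wnd is 0. Clamp to 0.
 */
static uint32_t
snd_wnd_after_ack(uint32_t snd_wnd, uint32_t acked)
{
	return (acked > snd_wnd) ? 0 : snd_wnd - acked;
}
```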
2019-07-12  Count the number of TCP SACK options that were dropped due to the  Alexander Bluhm
sack hole list length or pool limit. OK claudio@
2019-07-10  Received SACK options are managed by a linked list at the TCP socket.  Alexander Bluhm
There is a global tunable limit net.inet.tcp.sackholelimit, default is 32768. If an attacker manages to attach all these sack holes to a few TCP connections, the lists may grow long. Traversing them might cause higher CPU consumption on the victim machine. In practice such a situation is hard to create as the TCP retransmit and 2*msl timer flush the list periodically. For additional protection, enforce a per connection limit of 128 SACK holes in the list. reported by Reuven Plevinsky and Tal Vainshtein discussed with claudio@ and procter@; OK deraadt@
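A two-level limit of this kind can be sketched as below. The struct and function names are hypothetical and the kernel uses a pool limit rather than a plain counter; only the per-connection value of 128 is taken from the message above, while the global limit is left as a parameter.

```c
#include <stdbool.h>

#define SACKHOLE_PER_CONN_MAX	128	/* per connection limit from the commit message */

/* Hypothetical global accounting, standing in for the pool limit. */
struct sack_limits {
	unsigned int global_used;
	unsigned int global_limit;
};

/* Hypothetical per connection state: number of holes in its list. */
struct conn_sack {
	unsigned int holes;
};

/*
 * Reserve room for one more SACK hole. The request is refused when
 * either the connection or the whole system is at its limit, so a
 * single connection cannot consume the entire global budget.
 */
static bool
sack_hole_reserve(struct sack_limits *g, struct conn_sack *c)
{
	if (c->holes >= SACKHOLE_PER_CONN_MAX)
		return false;
	if (g->global_used >= g->global_limit)
		return false;
	c->holes++;
	g->global_used++;
	return true;
}
```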
2018-09-17  Do not acknowledge a received ack-only tcp packet that we would drop due to  friehm
PAWS. Otherwise we could trigger a retransmit of the opposite party with another wrong timestamp and produce a loop. I have seen this with a buggy server which messed up tcp timestamps. Suggested by Prof. Jacobson for FreeBSD. ok krw, bluhm, henning, mpi