src - OpenBSD base system

Age	Commit message	Author
2024-11-08	Use mutex of receive socket buffer to protect so_oobmark.	Alexander Bluhm
	Socket field so_oobmark belongs to receive path, so use so_rcv mutex to protect it. Although tcp_input() is still exclusively locked, put mutex there to prepare further unlocking. OK mvs@
2024-08-26	Rearrange #ifdef TCP_SIGNATURE to keep braces balanced.	Alexander Bluhm

2024-06-07	remove unused packet header length defines	Jonathan Gray

2024-04-17	Use struct ipsec_level within inpcb.	Alexander Bluhm
	Instead of passing around u_char[4], introduce struct ipsec_level that contains 4 ipsec levels. This provides better type safety. The embedding struct inpcb is globally visible for netstat(1), so put struct ipsec_level outside of #ifdef _KERNEL. OK deraadt@ mvs@
2024-04-13	correct indentation	Jonathan Gray
	no functional change, found by smatch warnings ok miod@ bluhm@
2024-04-12	Split single TCP inpcb table into IPv4 and IPv6 parts.	Alexander Bluhm
	With two separate TCP hash tables, each one becomes smaller. When we remove the exclusive net lock from TCP, contention on internet PCB table mutex will be reduced. UDP has been split earlier into IPv4 and IPv6. Replace branch conditions based on INP_IPV6 with assertions. OK mvs@
2024-04-10	Move global variables for TCP debug onto the tcp_input() stack.	Alexander Bluhm
	OK mvs@
2024-02-13	Merge struct route and struct route_in6.	Alexander Bluhm
	Use a common struct route for both inet and inet6. Unfortunately struct sockaddr is shorter than sockaddr_in6, so netinet/in.h has to be exposed from net/route.h. Struct route has to be bsd visible for userland as netstat kvm code inspects inp_route. Internet PCB and TCP SYN cache can use a plain struct route now. All specific sockaddr types for inet and inet6 are embedded there. OK claudio@
2024-02-11	Remove include netinet6/ip6_var.h from netinet/in_pcb.h.	Alexander Bluhm
	OK mvs@
2024-01-27	Declare address parameter in TCP SYN cache const.	Alexander Bluhm
	tcp6_ctlinput() cast a constant sockaddr_sin6 to non-const sockaddr. sa6_src may be &sa6_any which lives in read-only data section. Better pass down the const addresses to syn_cache_lookup(). They are needed for hash lookup and are not modified. OK mvs@
2024-01-11	Fix white spaces in TCP.	Alexander Bluhm

2023-12-01	Set inp address, port and rtable together with inpcb hash.	Alexander Bluhm
	The inpcb hash table is protected by table->inpt_mtx. The hash is based on addresses, ports, and routing table. These fields were not synchronized with the hash. Put writes and hash update into the same critical section. Move the updates from ip_ctloutput(), ip6_ctloutput(), syn_cache_get(), tcp_connect(), udp_disconnect() to dedicated inpcb set functions. There they use the same table mutex as in_pcbrehash(). in_pcbbind(), in_pcbconnect(), and in6_pcbconnect() need more work and are not included yet. OK sashan@ mvs@
2023-11-30	Pass inp_seclevel to ip6_output() in TCP syn cache.	Alexander Bluhm
	TCP syn_cache_respond() uses inp_seclevel from listening socket as ip_output() parameter. This was missing for ip6_output(). OK mvs@
2023-11-29	Run TCP syn cache timer without kernel lock.	Alexander Bluhm
	As syn_cache_timer() uses syn cache mutex and exclusive net lock, it does not need kernel lock. OK mvs@
2023-11-27	Add NULL check before dereferencing inp_seclevel.	Alexander Bluhm
	In some cases inp may be NULL, so check that before passing inp->inp_seclevel to ipsp_spd_lookup() or ip_output(). Missed in previous commit.
2023-11-26	Remove inp parameter from ip_output().	Alexander Bluhm
	ip_output() received inp as parameter. This is only used to lookup the IPsec level of the socket. Reasoning about MP locking is much easier if only relevant data is passed around. Convert ip_output() to receive constant inp_seclevel as argument and mark it as protected by net lock. OK mvs@
2023-11-16	Run TCP SYN cache timer logic without net lock.	Alexander Bluhm
	Introduce global TCP SYN cache mutex. Divide timer function in parts protected by mutex and sending with netlock. Split the flags field in dynamic flags protected by mutex and fixed flags set during initialization. Document whether fields of struct syn_cache are protected by net lock or mutex. input and OK sashan@
2023-09-03	Avoid a useless increment and decrement of the tcp syn cache refcount	Alexander Bluhm
	by unexpanding the SYN_CACHE_TIMER_ARM() macro in the timer callback. OK mvs@
2023-08-28	Introduce reference counting for TCP syn cache entries.	Alexander Bluhm
	The syn_cache_reaper() is a hack to serialize timeouts. Unfortunately it has a race and panics sometimes with pool_do_get: syncache free list modified. Add a reference counter for timeout and list of syn cache entries. Currently list refcount is not strictly necessary due to exclusive netlock, but will be needed when we continue unlocking. Checking timeout_initialized() is not MP friendly, better do proper initialization during object allocation. Refcount in btrace helps to find leaks. bug reported and fix tested by Peter J. Philipp OK claudio@
2023-07-06	Convert tcp_now() time counter to 64 bit.	Alexander Bluhm
	After changing tcp now tick to milliseconds, 32 bits will wrap around after 49 days of uptime. That may be a problem in some places of our stack. Better use a 64 bit counter. As timestamp option is 32 bit in TCP protocol, use the lower 32 bit there. There are casts to 32 bits that should behave correctly. Start with random 63 bit offset to avoid uptime leakage. 2^63 milliseconds result in 2.9*10^8 years of possible uptime. OK yasuoka@
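The wrap-around figures quoted above can be checked with a little integer arithmetic. The following is only a sketch of that calculation, not code from the tree: the helper names `wrap_days_32bit()`, `wrap_years_63bit()`, and `ts_low32()` are hypothetical, and a year is taken as 365 days.

```c
#include <stdint.h>

/* Milliseconds in one day, used to turn counter ranges into uptime. */
#define MS_PER_DAY	(1000ULL * 60 * 60 * 24)

/* Whole days until a 32 bit millisecond counter wraps (2^32 ms). */
static inline uint64_t
wrap_days_32bit(void)
{
	return (1ULL << 32) / MS_PER_DAY;
}

/* Whole 365-day years covered by 2^63 milliseconds. */
static inline uint64_t
wrap_years_63bit(void)
{
	return (1ULL << 63) / (MS_PER_DAY * 365);
}

/*
 * Hypothetical helper: the timestamp option is 32 bit, so only the
 * lower 32 bits of the 64 bit counter are used there.
 */
static inline uint32_t
ts_low32(uint64_t now)
{
	return (uint32_t)now;
}
```

With these definitions, `wrap_days_32bit()` yields 49 and `wrap_years_63bit()` yields roughly 2.9*10^8, matching the numbers in the commit message.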
2023-05-30	Use generic checksum calculation for TCP SYN+ACK packets.	Alexander Bluhm
	Our syn cache did checksum calculation by hand, instead of the established mechanism in ip output. The software-checksummed counter increased once per incoming TCP connection. Just set the flag M_TCP_CSUM_OUT in syn_cache_respond() and let in_proto_cksum_out() do the work later. Then hardware checksumming is used where available. Also remove redundant code. The unhandled af case is handled in the first switch statement of the function. tested by Hrvoje Popovski; OK mvs@
2023-03-14	To avoid misunderstanding, keep variables for tcp keepalive in	YASUOKA Masahiko
	milliseconds, which is the same unit of tcp_now(). However, keep the unit of sysctl variables in seconds and convert their unit in tcp_sysctl(). Additionally revert TCPTV_SRTTDFLT back to 3 seconds, which was mistakenly changed to 1.5 seconds by tcp_timer.h 1.19. ok claudio
2023-01-22	Move SS_CANTRCVMORE and SS_RCVATMARK bits from `so_state' to `sb_state' of	Vitaliy Makkoveev
	receive buffer. As was done for the SS_CANTSENDMORE bit, the definitions are kept as is, but now these bits belong to the `sb_state' of the receive buffer. `sb_state' is ORed with `so_state' when socket data is exported to userland. ok bluhm@
2023-01-12	Binding the accept socket in TCP input relies on the fact that the	Alexander Bluhm
	listen port is not bound to port 0. With a matching pf divert-to rule this assumption is no longer true and could crash the kernel with kassert. In both pf and stack drop TCP packets with destination port 0 before they can do harm. OK sashan@ claudio@
2022-12-09	Some TCP timer units have changed from slowhz to msec and their	Alexander Bluhm
	type from short to int. Also switch local variables holding temporary timer values from short to int. OK yasuoka
2022-12-08	Convert tcptv_keep_init to milliseconds before comparing other values	YASUOKA Masahiko
	of tcp time. This fixes the retransmit timer of syn_cache which was broken. reported by naddy, input dlg, test jca ok jca
2022-11-07	Modify TCP receive buffer size auto scaling to use the smoothed RTT	YASUOKA Masahiko
	(SRTT) instead of the timestamp option. The timestamp option is disabled on some OSs (e.g. Windows) or dropped by some firewalls/routers; in such a case the window size had been fixed at 16KB, which limits throughput to a very low level on high latency networks. Also replace "tcp_now" from 2HZ tick counter to binuptime in milliseconds to calculate the SRTT better. tested by krw matthieu jmatthew dlg djm stu stsp ok claudio
2022-10-03	System calls should not fail due to temporary memory shortage in	Alexander Bluhm
	malloc(9) or pool_get(9). Pass down a wait flag to pru_attach(). During syscall socket(2) it is ok to wait, this logic was missing for internet pcb. Pfkey and route sockets were already waiting. sonewconn() must not wait when called during TCP 3-way handshake. This logic has been preserved. Unix domain stream socket connect(2) can wait until the other side has created the socket to accept. OK mvs@
2022-09-03	Use a mutex to update tcp_maxidle, tcp_iss, and tcp_now. This	Alexander Bluhm
	removes pressure from the exclusive netlock in tcp_slowtimo(). Reading is done atomically. Ensure that the tcp_now value is read only once per function to provide consistent time. OK yasuoka@
2022-08-30	Refactor internet PCB lookup function. Rename in_pcbhashlookup()	Alexander Bluhm
	so the public API is in_pcblookup() and in_pcblookup_listen(). For internal use introduce in_pcbhash_insert() and in_pcbhash_lookup() to avoid code duplication. Routing domain is unsigned, change the type to u_int. OK mvs@
2022-08-21	Change soabort() return value to void. We are never interested in it.	Vitaliy Makkoveev
	ok bluhm@
2022-08-11	Add TCP_INFO support to getsockopt for tcp sessions.	Claudio Jeker
	TCP_INFO provides a lot of information about the TCP session of this socket. Many processes like to peek at the rtt of a connection but this also provides a lot of more special info for use by e.g. tcpbench(1). While the basic minimal info is available all the time the more specific data is only populated for privileged processes. This is done to not share data back to userland that may allow to attack a session. TCP_INFO is available to pledge "inet" since pledged processes like chrome tend to use TCP_INFO when available. OK bluhm@
2022-08-08	To make protocol input functions MP safe, internet PCBs need protection.	Alexander Bluhm
	Use their reference counter in more places. The in_pcb lookup functions hold the PCBs in hash tables protected by table->inpt_mtx mutex. Whenever a result is returned, increment the ref count before releasing the mutex. Then the inp can be used as long as necessary. Unref it at the end of all functions that call in_pcb lookup. As a shortcut, pf may also hold a reference to the PCB. When pf_inp_lookup() returns it, it also increments the ref count and the caller can handle it like the inp from table lookup. OK sashan@
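The lookup pattern described above (take the reference while the table mutex is still held, drop it when the caller is done) can be sketched in userland C. This is an illustration under assumptions, not the kernel code: `struct pcb`, `struct pcb_table`, `pcb_lookup()`, and `pcb_unref()` are hypothetical names, a pthread mutex stands in for table->inpt_mtx, and the counter is simply guarded by that mutex rather than by the kernel's own reference counting API.

```c
#include <pthread.h>
#include <stddef.h>

/* Hypothetical, simplified PCB: a list entry keyed by port. */
struct pcb {
	struct pcb	*next;
	unsigned int	 port;
	unsigned int	 refcnt;	/* guarded by the table mutex here */
};

struct pcb_table {
	pthread_mutex_t	 mtx;		/* stands in for table->inpt_mtx */
	struct pcb	*head;
};

/*
 * Look up a PCB and take a reference before the mutex is released,
 * so the result stays valid after the table is unlocked.
 */
struct pcb *
pcb_lookup(struct pcb_table *t, unsigned int port)
{
	struct pcb *p;

	pthread_mutex_lock(&t->mtx);
	for (p = t->head; p != NULL; p = p->next) {
		if (p->port == port) {
			p->refcnt++;
			break;
		}
	}
	pthread_mutex_unlock(&t->mtx);
	return p;
}

/* Drop the reference taken by pcb_lookup(); NULL is tolerated. */
void
pcb_unref(struct pcb_table *t, struct pcb *p)
{
	if (p == NULL)
		return;
	pthread_mutex_lock(&t->mtx);
	p->refcnt--;
	pthread_mutex_unlock(&t->mtx);
}
```

A caller would pair every successful `pcb_lookup()` with one `pcb_unref()` at the end of the function, which mirrors the "unref it at the end" rule in the commit message.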
2022-01-04	Add `ipsec_flows_mtx' mutex(9) to protect `ipsp_ids_*' list and	YASUOKA Masahiko
	trees. ipsp_ids_lookup() returns `ids' with bumped reference counter. original diff from mvs ok mvs
2022-01-02	spelling	Jonathan Gray
	ok jmc@ reads ok tb@
2021-12-01	Let ipsp_spd_lookup() return an error instead of a TDB. The TDB	Alexander Bluhm
	is not always needed, but the error value is necessary for the caller. As TDB should be refcounted, it makes no sense to always return it. Pass an output pointer for the TDB which can be NULL. OK mvs@ tobhe@
2021-11-25	move label to fix RAMDISK	Theo de Raadt

2021-11-25	Implement reference counting for IPsec tdbs. Not all cases are	Alexander Bluhm
	covered yet, more ref counts to come. The timeouts are protected, so the racy tdb_reaper() gets retired. The tdb_policy_head, onext and inext lists are protected. All gettdb...() functions return a tdb that is ref counted and has to be unrefed later. A flag ensures that tdb_delete() is called only once. Tested by Hrvoje Popovski; OK sthen@ mvs@ tobhe@
2021-08-09	During unidirectional data transmission, a TCP connection may stall.	Alexander Bluhm
	The sending machine is doing zero window probes, but is not sending any more data although the other machine announced that it has space again. The header prediction code did not update snd_wl2. If there was a sequence number wrap, the send window update block is not reached. Update snd_wl2 when receiving predicted ACKs and update snd_wl1 and rcv_up for predicted pure data. from FreeBSD; OK sashan@ claudio@
2021-08-09	Fix white spaces.	Alexander Bluhm

2021-04-16	Turn on the direct ACK on every other segment.	Alexander Bluhm
	This is a backout of rev 1.366 which turned this feature off. Although sending fewer ACKs makes TCP faster if the CPU is busy with processing packets, there are corner cases where TCP gets slower. Especially OpenBSD 6.8 and older has a maxburst limitation that scales badly if the other side sends too few ACKs. Also regress test relayd run-args-http-slow-consumer.pl uses strange socket buffer sizes that trigger slow performance with the new algorithm. For OpenBSD 6.9 release switch back to 6.8 delayed ACK behavior. discussed with deraadt@ benno@ claudio@ jan@
2021-03-10	spelling	Jonathan Gray
	ok gnezdo@ semarie@ mpi@
2021-02-03	Turns off the direct ACK on every other segment	jan
	The kernel uses a huge amount of processing time for sending ACKs to the sender on the receiving interface. After receiving a data segment, we send out two ACKs. The first one in tcp_input() directly after receiving. The second ACK is sent out after the userland or the sosplice task read some data out of the socket buffer. Thus, we save some processing time and improve network performance. Longer tested by sthen@ OK claudio@
2020-06-19	Break a glass ceiling on cwnd due to integer division during congestion	Richard Procter
	avoidance. The problem and fix is noted in RFC5681 section 3.1, page 7. Report, diff and testing from Brian Brombacher, thanks! Testing and a cosmetic tweak by myself. ok claudio
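The integer division issue can be illustrated with the congestion avoidance increment from RFC 5681 section 3.1, cwnd += SMSS*SMSS/cwnd, which truncates to zero once cwnd exceeds SMSS*SMSS; the RFC says a zero result should be rounded up to 1 byte. The sketch below shows that rule only. The function name `cwnd_ca_increment()` is hypothetical, cwnd is assumed to be non-zero, and this is not the code that was committed.

```c
#include <stdint.h>

/*
 * Per-ACK congestion avoidance increment in bytes.
 * Assumes cwnd > 0.  Without the round-up, the integer division
 * returns 0 for cwnd > smss * smss and cwnd stops growing.
 */
static uint32_t
cwnd_ca_increment(uint32_t cwnd, uint32_t smss)
{
	uint64_t incr = (uint64_t)smss * smss / cwnd;

	if (incr == 0)
		incr = 1;	/* round up to 1 byte, as RFC 5681 notes */
	return (uint32_t)incr;
}
```

With an SMSS of 1460 bytes the ceiling sits at 1460*1460 = 2131600 bytes; above that, the unguarded formula would add nothing per ACK.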
2019-12-06	Checking the IPsec policy is expensive. Check only when IPsec is used.	tobhe
	ok bluhm@
2019-11-29	Change the default security level for incoming IPsec flows from	tobhe
	isakmpd and iked to REQUIRE. Filter policy violations earlier. ok sashan@ bluhm@
2019-11-11	Prevent underflows in tp->snd_wnd if the remote side ACKs more than	Alexander Bluhm
	tp->snd_wnd. This can happen, for example, when the remote side responds to a window probe by ACKing the one byte it contains. from FreeBSD; via markus@; OK sashan@ tobhe@
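Because the send window is an unsigned quantity, subtracting more acknowledged bytes than the window holds wraps around to a huge value instead of going negative. A minimal sketch of the guard, using the hypothetical helper `snd_wnd_after_ack()` rather than the actual tcp_input() code:

```c
#include <stdint.h>

/*
 * Shrink the send window by the number of newly ACKed bytes,
 * clamping at zero instead of letting the unsigned value wrap.
 */
static uint32_t
snd_wnd_after_ack(uint32_t snd_wnd, uint32_t acked)
{
	if (acked > snd_wnd)
		return 0;
	return snd_wnd - acked;
}
```

In the window probe case from the commit message, `snd_wnd_after_ack(0, 1)` gives 0, whereas a plain `0 - 1` on a 32 bit unsigned value would give 4294967295.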
2019-07-12	Count the number of TCP SACK options that were dropped due to the	Alexander Bluhm
	sack hole list length or pool limit. OK claudio@
2019-07-10	Received SACK options are managed by a linked list at the TCP socket.	Alexander Bluhm
	There is a global tunable limit net.inet.tcp.sackholelimit, default is 32768. If an attacker manages to attach all these sack holes to a few TCP connections, the lists may grow long. Traversing them might cause higher CPU consumption on the victim machine. In practice such a situation is hard to create as the TCP retransmit and 2*msl timer flush the list periodically. For additional protection, enforce a per connection limit of 128 SACK holes in the list. reported by Reuven Plevinsky and Tal Vainshtein discussed with claudio@ and procter@; OK deraadt@
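The two limits described above, together with the drop counter from the 2019-07-12 commit, can be sketched as a simple admission check. All names here (`struct sack_conn`, `sack_hole_admit()`, `sack_total`, `sack_dropped`) are hypothetical, the constants are the values quoted in the commit message, and the real code manages pool allocations and list entries rather than bare counters.

```c
#include <stdbool.h>

#define SACKHOLE_GLOBAL_LIMIT	32768	/* net.inet.tcp.sackholelimit default */
#define SACKHOLE_CONN_LIMIT	128	/* per connection limit */

/* Hypothetical per-connection state: only the hole count matters here. */
struct sack_conn {
	unsigned int	nholes;
};

static unsigned int	sack_total;	/* holes across all connections */
static unsigned long	sack_dropped;	/* holes refused by either limit */

/* Return true if one more SACK hole may be added to this connection. */
static bool
sack_hole_admit(struct sack_conn *c)
{
	if (c->nholes >= SACKHOLE_CONN_LIMIT ||
	    sack_total >= SACKHOLE_GLOBAL_LIMIT) {
		sack_dropped++;
		return false;
	}
	c->nholes++;
	sack_total++;
	return true;
}
```

A single connection can then hold at most 128 holes, so list traversal cost per connection stays bounded even if the global pool is far from exhausted.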
2018-09-17	Do not acknowledge a received ack-only tcp packet that we would drop due to	friehm
	PAWS. Otherwise we could trigger a retransmit of the opposite party with another wrong timestamp and produce a loop. I have seen this with a buggy server which messed up tcp timestamps. Suggested by Prof. Jacobson for FreeBSD. ok krw, bluhm, henning, mpi
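The decision described above can be sketched as a small classifier: a segment whose timestamp is older than the last one recorded fails PAWS, and only segments that carry data are answered with an ACK. This is a simplified illustration under assumptions, not the tcp_input() logic: the names `paws_check()`, `ts_before()`, and `enum paws_action` are hypothetical, and the real check has further conditions (such as the age of the recorded timestamp) that are left out.

```c
#include <stdint.h>

enum paws_action {
	PAWS_ACCEPT,		/* timestamp is not old, process segment */
	PAWS_DROP_AND_ACK,	/* old timestamp, segment carries data */
	PAWS_DROP_SILENTLY	/* old timestamp, ack-only segment */
};

/* Wrap-aware "a is before b" comparison of 32 bit timestamps. */
static int
ts_before(uint32_t a, uint32_t b)
{
	return (int32_t)(a - b) < 0;
}

/*
 * Classify a segment by its timestamp value, the most recent
 * timestamp seen from the peer, and its payload length.
 */
static enum paws_action
paws_check(uint32_t ts_val, uint32_t ts_recent, uint32_t payload_len)
{
	if (!ts_before(ts_val, ts_recent))
		return PAWS_ACCEPT;
	return payload_len > 0 ? PAWS_DROP_AND_ACK : PAWS_DROP_SILENTLY;
}
```

Dropping the ack-only case silently is what breaks the loop: the peer receives no ACK that could provoke another segment with a bad timestamp.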