|
Inspired by mvs@ idea of the iterator in the UDP multicast loop,
implement the same for raw IP input delivery. This removes an
unnecessary rwlock and only uses the table mutex.
When comparing the inp routing table, address and port, the table
lock must be held. So assume that in_pcb_iterator() already has
the table mutex and hold it while traversing the list and doing the
checks. Release the mutex during mbuf copy, socket buffer append
and the upcalls. Adapt the logic for both rip_input() and udp_input().
In rip_input() move the actual work to rip_sbappend(). This can
be called without mutex during list traversal and for the final
element.
OK mvs@
|
|
The broadcast and multicast loop in udp_input() is protected by the
table mutex. The relevant PCBs were collected in a separate list,
which was processed while the table notify rwlock was held. When
sending UDP multicast packets over vxlan(4) configured over UDP
with multicast groups, this lock was taken recursively causing a
kernel crash.
By using an iterator, traversing the PCB list of the table does not
require holding the mutex all the time. The mutex is only taken for
a short time while stepping to the next element after the iterator.
udp_sbappend() and the upcall to vxlan_input() are done with neither
mutex nor rwlock. The PCB is reference counted while traversing the
list.
crash reported by Holger Glaess; iterator implemented by mvs@;
tested and fixed by bluhm@; OK mvs@
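A minimal standalone sketch of that iterator pattern follows, using a
pthread mutex and a plain reference count as stand-ins for the kernel's
table mutex and in_pcbref(); all names and types here are illustrative,
not the actual in_pcb_iterator() code.

    #include <pthread.h>
    #include <stdio.h>

    /* Illustrative stand-ins for the PCB table and its mutex. */
    struct pcb {
        struct pcb *next;
        int refs;
        int id;
    };

    static struct pcb *table_head;
    static pthread_mutex_t table_mtx = PTHREAD_MUTEX_INITIALIZER;

    /*
     * Walk the PCB list.  The mutex is held only while stepping to the
     * next element; the current element is pinned by a reference so the
     * delivery work (mbuf copy, sbappend, upcall in the real code) runs
     * without the mutex.  The sketch assumes a pinned element is never
     * unlinked, which the reference count guarantees in the real kernel.
     */
    static void
    pcb_iterate(void (*deliver)(struct pcb *))
    {
        struct pcb *p, *next;

        pthread_mutex_lock(&table_mtx);
        for (p = table_head; p != NULL; p = next) {
            p->refs++;                      /* pin the current element */
            pthread_mutex_unlock(&table_mtx);

            deliver(p);                     /* work done without the mutex */

            pthread_mutex_lock(&table_mtx);
            next = p->next;                 /* step under the mutex */
            p->refs--;                      /* unpin */
        }
        pthread_mutex_unlock(&table_mtx);
    }

    static void
    print_pcb(struct pcb *p)
    {
        printf("deliver to pcb %d\n", p->id);
    }

    int
    main(void)
    {
        struct pcb a = { NULL, 0, 2 };
        struct pcb b = { &a, 0, 1 };

        table_head = &b;
        pcb_iterate(print_pcb);
        return 0;
    }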
|
|
Some network interfaces, like lo(4) or vio(4), set the M_UDP_CSUM_OUT
flag on incoming packets. As an optimization they produce packets
with M_UDP_CSUM_IN_OK, but the actual checksum field in the packet
is wrong. If such a packet is forwarded, the checksum must be
calculated, so they also set M_UDP_CSUM_OUT.
For protocols tunneled in UDP, udp_input() removes the header, but
the mbuf flags stay. This means later processing of the packet may
insert a UDP checksum, although it is not UDP anymore. This has
been observed when forwarding ping packets between two vxlan(4)
interfaces. Then a UDP checksum was inserted into the ICMP packet.
Clearing the M_UDP_CSUM_OUT flag when the UDP header is stripped
fixes the problem.
OK mvs@
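A minimal sketch of the idea of the fix, with a simplified packet
structure standing in for the mbuf packet header; the flag value and
field names are illustrative, not the kernel definitions.

    /* Illustrative stand-in for the mbuf checksum offload flag. */
    #define M_UDP_CSUM_OUT  0x0001      /* UDP checksum wanted on output */

    struct pkt {
        int csum_flags;     /* offload flags survive decapsulation */
        int len;
        int udphdrlen;      /* length of the outer UDP header */
    };

    /*
     * Decapsulate a tunneled packet: strip the outer UDP header (m_adj()
     * in the real code) and clear the UDP output checksum flag, so later
     * output processing does not insert a UDP checksum into a packet
     * that is no longer UDP.
     */
    void
    strip_udp_header(struct pkt *p)
    {
        p->len -= p->udphdrlen;
        p->csum_flags &= ~M_UDP_CSUM_OUT;
    }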
|
|
|
|
|
|
- IPIPCTL_ALLOW - atomically accessed integer;
- IPIPCTL_STATS - per-CPU counters;
In ipip_input() load the `ipip_allow' value into `ipip_allow_local'
and pass it down to ipip_input_if() as the `allow' arg.
ok bluhm
|
|
|
|
- IPIPCTL_ALLOW - atomically accessed integer;
- IPIPCTL_STATS - per-CPU counters;
ok bluhm
|
|
- ETHERIPCTL_ALLOW - atomically accessed integer;
- ETHERIPCTL_STATS - per-CPU counters
ok bluhm
|
|
the only IGMPCTL_STATS, PFSYNCCTL_STATS and RIPV6CTL_STATS per-CPU
counters.
sysctl_rdstruct() has a "newp != NULL" check within and also returns
EPERM, so there is no need for a redundant check in igmp_sysctl().
ok bluhm
|
|
unlock both divert_sysctl() and divert6_sysctl(). Unlock them together,
because they are identical and pretty simple:
- DIVERTCTL_RECVSPACE and DIVERTCTL_SENDSPACE - atomically accessed
integers;
- DIVERTCTL_STATS - per-CPU counters;
ok bluhm
|
|
Mark slow and fast protocol timeouts as MP safe. This means they
run on a separate thread without holding the kernel lock.
IGMP and MLD6 cannot run in parallel; they use the exclusive net lock
to protect themselves. As a performance optimization global variables
are used to skip igmp_fasttimo() and mld6_fasttimeo() if no multicast
is active. These global variables use atomic operations and memory
barriers to work lockless.
IPv6 fragment timeout protects itself with a mutex.
TCP timers also run without the kernel lock now. The whole TCP stack
holds the exclusive net lock, so an additional kernel lock is useless.
OK mvs@
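A small sketch of the lockless skip described above, using C11 atomics
as a stand-in for the kernel's atomic operations and memory barriers;
the variable and function names are illustrative, not the igmp/mld6
code.

    #include <stdatomic.h>

    /* Count of active multicast memberships (illustrative). */
    static atomic_int mcast_active;

    void
    mcast_join(void)
    {
        atomic_fetch_add(&mcast_active, 1);
    }

    void
    mcast_leave(void)
    {
        atomic_fetch_sub(&mcast_active, 1);
    }

    /*
     * Periodic timeout running without the kernel lock.  If no multicast
     * is active, skip the expensive walk over all interfaces.  A single
     * atomic load is enough; seeing a concurrent join one tick late is
     * harmless.
     */
    void
    fasttimo(void)
    {
        if (atomic_load(&mcast_active) == 0)
            return;
        /* ... walk interfaces and send pending membership reports ... */
    }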
|
|
`udp_sendspace' and `udp_recvspace' are integers which are only read
in udp_attach(). `udpcksum' is only read in udp_output().
No netlock is required to modify them through sysctl(2).
ok bluhm
|
|
ip_directedbcast is read once in either ip_input() or pf_test()
during packet processing. So writing the variable does not need
net lock.
OK mvs@
|
|
The socket layer of UDP has been made fully MP safe. UDP output
has been MP safe for a while. mvs@ has fixed the missing pieces in
socket splicing recently. This means that the complete UDP stack can
be processed by multiple threads now. Activate multi processing for
udp_input() when called with IPv4 or IPv6 packets.
Usually IP processing runs on multiple softnet threads with shared
net lock. From there local packets are queued and processed by one
thread with exclusive net lock. If the PR_MPINPUT flag is set,
protocol input is called directly from IP input on multiple threads,
with shared net lock and no additional queueing.
tested by Hrvoje Popovski; OK mvs@
|
|
Socket splicing belongs to socket buffers. udp(4) sockets are fully
switched to fine-grained buffer locks, so use them instead of the
exclusive solock().
Always schedule the somove() thread to run, as we do in the tcp(4)
case. This adds delay to packet processing, but it is comparable with
the non-splicing case where soreceive() threads are always scheduled.
Spliced udp(4) sockets now rely on sb_lock() of the `so_rcv' buffer
together with the `sb_mtx' mutexes of both buffers. The shared solock()
is only required around the pru_send() call, so most of the somove()
thread runs simultaneously with the network stack.
Also document 'sosplice' structure locking.
Feedback, tests and OK from bluhm.
|
|
The ip and ip6 sendredirects variables are only read once during
packet processing. Use atomic_load_int() to access the value in
exactly one read instruction. No memory barriers are needed as there
is no correlation with other values.
Sort the ip and ip6 checks, so the difference is easier to see.
Move the access to the global variable to the end.
OK mvs@
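A minimal sketch of the read-once pattern, with C11
atomic_load_explicit() standing in for the kernel's atomic_load_int();
the variable and function names are illustrative.

    #include <stdatomic.h>

    /* Sysctl tunable; the sysctl handler stores it atomically. */
    static atomic_int ip_sendredirects = 1;

    /*
     * Packet processing: load the tunable exactly once into a local and
     * base every decision for this packet on that single value.  No
     * memory barrier is needed because the value is not ordered against
     * other shared data.
     */
    int
    may_send_redirect(int redirect_candidate)
    {
        int sendredirects = atomic_load_explicit(&ip_sendredirects,
            memory_order_relaxed);

        return redirect_candidate && sendredirects;
    }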
|
|
OK mvs@
|
|
Use atomic operations to read ip6_forwarding while processing packets
in the network stack.
To make clear where the router property is actually needed, use the
i_am_router variable based on ip6_forwarding. It already existed
in nd6_nbr. Move the i_am_router setting up the call stack until all
users are independent.
The forwarding decisions in pf_test, pf_refragment6 and ip6_input
also do not interfere.
Use a new array ipv6ctl_vars_unlocked to make the transition of all
the integer sysctls easier. Adapt IPv4 to the new style.
OK mvs@
|
|
Use the gre condition in conf/files to compile netinet/ip_gre.c only
if needed. Remove the #if NGRE > 0 from ip_gre.c that caused the
ramdisk build to compile an empty C file.
OK kn@ deraadt@; input jsg@
|
|
The pipex code in gre_send() matches more or less what udp_send()
does. This has been MP safe for a long time. rip_send() is already
called with PR_MPSOCKET.
OK mvs@
|
|
All inpcb locking has been converted to the socket receive buffer
mutex. The per-PCB mutex inp_mtx is not needed anymore. Also delete
the PRU related locking functions. A flag PR_MPSOCKET indicates
whether protocol functions support parallel access with a per-socket
rw-lock.
TCP is the only protocol that is not MP capable from the socket
layer and needs the exclusive netlock.
OK mvs@
|
|
The places in packet processing where ip_forwarding is evaluated
have been consolidated. The remaining pieces in pf test, ip input,
and icmp input do not need consistent information. If the integer
value is changed by another CPU, it is harmless.
The sysctl syscall sets the value atomically, so add an atomic read
in network processing and remove the net lock from sysctl IPCTL_FORWARDING.
OK claudio@ mvs@
|
|
Fix MP race between reading ip_forwarding in ip_input() and checking
ip_forwarding == 2 in ip_output(). In theory ip_forwarding could
be 2 during ip_input() and later 0 in ip_output(). Then a packet
would be forwarded that was never allowed. Currently exclusive
netlock in sysctl(2) prevents all races.
Introduce IP_FORWARDING_IPSEC and pass it with the flags parameter
that was introduced for IP_FORWARDING.
Instead of calling m_tag_find(), traversing the list, and comparing
with NULL, just check the PACKET_TAG_IPSEC_IN_DONE bit. Reading
ipsec_in_use in ip_output() is a performance hack that is not
necessary. New code only checks three bits.
OK mvs@
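A sketch of how a tag bit check can replace the m_tag_find() list walk.
It assumes a per-packet bitmask that mirrors which tag types are
attached, which is an illustrative simplification, not the actual mbuf
tag implementation.

    #include <stdbool.h>
    #include <stdint.h>

    #define TAG_IPSEC_IN_DONE   0x0001      /* illustrative bit value */

    /* Simplified packet header: a bitmask mirrors the attached tag types. */
    struct pkthdr {
        uint32_t tags_set;
        /* ... the actual tag list is not needed for a presence check ... */
    };

    void
    tag_attach(struct pkthdr *ph, uint32_t tag_bit)
    {
        ph->tags_set |= tag_bit;
        /* ... also link the tag data into the list ... */
    }

    /*
     * Presence test: one AND on the packet header instead of calling a
     * list-walking lookup and comparing its result with NULL.
     */
    bool
    was_ipsec_decrypted(const struct pkthdr *ph)
    {
        return (ph->tags_set & TAG_IPSEC_IN_DONE) != 0;
    }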
|
|
Old ip_forward() allocated a fake mbuf copy on the stack to send
an ICMP packet after ip_output() had failed. It seems easier to
just copy the data onto the stack that icmp_error() may use. Create
the mbuf only if the ICMP error packet is actually sent.
m_dup_pkthdr() uses an atomic operation to link the inpcb to the mbuf.
pf_pkt_addr_changed() was immediately called afterwards to remove
the linkage again. m_tag_delete_chain() was also overhead. New
code uses less CPU locking in the hot path.
OK deraadt@ claudio@
|
|
IPv4 uses IP_FORWARDING to pass a consistent value of
net.inet.ip.forwarding down the stack. This is needed for unlocking
sysctl. Do the same for IPv6.
Read ip6_forwarding once in ip6_input_if() and pass down IPV6_FORWARDING
as flags to ip6_ours(), ip6_hbhchcheck(), ip6_forward(). Replace
the srcrt value with the IPV6_REDIRECT flag for consistency with IPv4.
To have common syntax with IPv4, use ip6_forwarding == 0 checks
instead of !ip6_forwarding. This will also make it easier to
implement net.inet6.ip6.forwarding=2 for IPsec-only forwarding
later.
In nd6_ns_input() and nd6_na_input() read ip6_forwarding once and
store it in i_am_router. The variable name has been chosen to avoid
confusion with is_router, which indicates the router flag of the packet.
Reading of ip6_forwarding is done independently from ip6_input_if();
consistency does not really matter. One is for ND router behavior,
the other for forwarding. Again use the ip6_forwarding != 0 check,
so when the IPsec-only value 2 of ip6_forwarding gets implemented, it
will behave like a router.
OK deraadt@ sashan@ florian@ claudio@
|
|
If sysctl net.inet.ip.forwarding is set to 2, only packets processed
by IPsec are forwarded. In this case behave more like a router than
a host and do not accept ICMP redirect packets.
OK deraadt@ sashan@ florian@ claudio@
|
|
Do not assume that ip_forwarding and ip_directedbcast cannot change
while processing one packet. Read it once and pass down its value
with a flag. This is necessary for unlocking the sysctl path.
There are a few places where a consistent value does not really
matter; they are unchanged. Use a proper ip_ prefix for the global
variable.
OK claudio@
|
|
|
|
|
|
OK claudio@
|
|
If sysctl net.inet.ip.forwarding is 2, only packets processed by
IPsec are forwarded. Variable ipsec_in_use is a shortcut to avoid
IPsec processing if no policy has been configured. With ipsec_in_use
unset and ipforwarding set to IPsec only, the packet must be dropped.
OK claudio@
|
|
Although it should not happen, check that ph_mss is not 0 in
tcp_chopper(). This could catch errors in the LRO path of network
drivers. Better to count the bad packet and drop it rather than
ending up in an endless loop. The new logic is analogous to a recent
change in the hardware TSO path in the drivers.
OK jan@
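A sketch of why the ph_mss check matters: a chopping loop with a zero
segment size never makes progress. The function below is an
illustrative simplification, not tcp_chopper() itself.

    #include <stddef.h>

    /*
     * Split a large payload into mss-sized chunks.  With mss == 0 the
     * loop below would never advance, so reject the packet up front;
     * the caller counts it as bad and drops it.
     */
    int
    chop_payload(size_t len, size_t mss)
    {
        int segments = 0;

        if (mss == 0)
            return -1;

        while (len > 0) {
            size_t chunk = len < mss ? len : mss;

            /* ... emit one segment carrying `chunk' bytes ... */
            len -= chunk;
            segments++;
        }
        return segments;
    }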
|
|
ok mpi@
|
|
In a previous commit, when refactoring the route cache, an rtfree()
was forgotten. For each forwarded packet the reference counter
of the route entry was increased. This eventually leads to an
integer overflow and triggers a kassert.
reported by and OK jan@
|
|
An internet PCB has either inp_options or inp_outputopts6. Put them
into a common anonymous union.
OK mvs@ kn@
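A minimal sketch of the resulting layout, with the member types
simplified; this is not the real struct inpcb.

    /*
     * A PCB is either IPv4 or IPv6, so only one of the two pointers is
     * ever used and they can share storage in an anonymous union.
     * Existing code keeps using inp->inp_options and
     * inp->inp_outputopts6 unchanged.
     */
    struct inpcb_sketch {
        int inp_flags;
        union {
            void *inp_options;          /* IPv4 options */
            void *inp_outputopts6;      /* IPv6 output options */
        };
    };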
|
|
Instead of passing around u_char[4], introduce struct ipsec_level
that contains 4 ipsec levels. This provides better type safety.
The embedding struct inpcb is globally visible for netstat(1), so
put struct ipsec_level outside of #ifdef _KERNEL.
OK deraadt@ mvs@
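A short sketch of why a wrapping struct is safer than passing around a
raw array; the names below are illustrative, not the actual struct
ipsec_level definition.

    /* Four per-socket IPsec levels wrapped in one type. */
    struct ipsec_level_sketch {
        unsigned char sl[4];
    };

    /*
     * The compiler checks the argument type, a caller cannot hand in an
     * unrelated u_char array, and struct assignment replaces
     * element-wise copies.
     */
    void
    copy_ipsec_levels(struct ipsec_level_sketch *dst,
        const struct ipsec_level_sketch *src)
    {
        *dst = *src;
    }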
|
|
Instead of passing a struct rtentry from ip_input() to ip_forward()
and then embed it into a struct route for ip_output(), start with
struct route and pass it along. Then the route cache is used
consistently. Also the route cache hit and missed counters should
reflect reality after this commit.
There is a small difference in the code. in_ouraddr() checks for
NULL and not rtisvalid(). Previous discussion showed that the route
RTF_UP flag should only be considered for multipath routing.
Otherwise it does not mean anything. Especially the local and
broadcast check in in_ouraddr() should not be affected by interface
link status.
When doing cache lookups, the route must be valid, but after an
rtalloc_mpath() lookup, use any route that route_mpath() returns.
OK claudio@
|
|
OK mvs@
|
|
Running raw IPv4 input with shared net lock in parallel is less
complex than UDP. In particular, there is no socket splicing.
New ip_deliver() may run with shared or exclusive net lock. The
last parameter indicates the mode. If it is running with shared
netlock and encounters a protocol that needs exclusive lock, the
packet is queued. Old ip_ours() always queued the packet. Now it
calls ip_deliver() with shared net lock, and if that cannot handle
the packet completely, the packet is queued and later processed
with exclusive net lock.
In case of an IPv6 header chain that switches from shared to
exclusive processing, the next protocol and mbuf offset are stored
in an mbuf tag.
OK mvs@
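A hedged sketch of the dispatch described above; the names are
illustrative simplifications of ip_deliver() and the protocol switch,
not the actual kernel code.

    #include <stdbool.h>

    struct pkt;

    /* Illustrative protocol descriptor. */
    struct proto_sketch {
        bool mpsafe_input;              /* input may run with shared net lock */
        void (*input)(struct pkt *);
    };

    static void
    queue_for_exclusive(struct pkt *p)
    {
        (void)p;    /* placeholder: real code queues the packet for the
                       later pass that holds the exclusive net lock */
    }

    /*
     * Deliver a local packet.  `shared' says which net lock the caller
     * holds.  Under the shared lock, a protocol that needs the exclusive
     * lock gets its packet queued instead of processed in place.
     */
    void
    deliver(const struct proto_sketch *pr, struct pkt *p, bool shared)
    {
        if (shared && !pr->mpsafe_input) {
            queue_for_exclusive(p);
            return;
        }
        (*pr->input)(p);
    }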
|
|
no functional change, found by smatch warnings
ok miod@ bluhm@
|
|
With two separate TCP hash tables, each one becomes smaller. When
we remove the exclusive net lock from TCP, contention on internet
PCB table mutex will be reduced. UDP has been split earlier into
IPv4 and IPv6. Replace branch conditions based on INP_IPV6 with
assertions.
OK mvs@
|
|
Setting SS_CANTRCVMORE is protected by the mutex of the receive
socket buffer. The raw inpcb loop in rip_input() does a lockless access.
Protect it with READ_ONCE(), although it is not perfect. Check the
socket buffer state again when the mutex is held. Drop and count
the packet that is processed between the checks.
Currently soisdisconnected() is called with exclusive net lock.
The new code also works without net lock.
OK mvs@
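A standalone sketch of the check/re-check pattern, with a C11 atomic
load standing in for READ_ONCE() and a pthread mutex for the socket
buffer mutex; the structure and names are illustrative.

    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdbool.h>

    #define CANTRCVMORE 0x1

    /* Simplified receive buffer. */
    struct rcvbuf {
        pthread_mutex_t mtx;
        atomic_int state;
        unsigned long dropped;
    };

    /*
     * Cheap unlocked peek first: if the socket already looks closed,
     * skip it without taking the mutex.  The state can still change in
     * between, so check again with the mutex held and drop (and count)
     * the packet in that case.
     */
    bool
    try_append(struct rcvbuf *sb)
    {
        if (atomic_load(&sb->state) & CANTRCVMORE)      /* READ_ONCE() */
            return false;

        pthread_mutex_lock(&sb->mtx);
        if (sb->state & CANTRCVMORE) {                  /* re-check, locked */
            sb->dropped++;
            pthread_mutex_unlock(&sb->mtx);
            return false;
        }
        /* ... append the packet data to the buffer ... */
        pthread_mutex_unlock(&sb->mtx);
        return true;
    }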
|
|
Protect the global variables in the TCP debug code with a global mutex.
Add a missing include and also fix the -Wunused-but-set-variable
warning.
OK mvs@
|
|
OK mvs@
|
|
If no struct route is passed to ip_output() or ip6_output(), it
uses its own iproute on the stack. In that case any route entry
in the local route cache has to be freed. After pf decides to
reroute, struct route is reset to NULL. Then the route reference
counter has to be released. Call rtfree() without a needless NULL
check.
OK mvs@
|
|
Reading the sysctl in mrt_sysctl_mfc() allocates memory to be copied
back to userland. Chunks of struct mfcinfo are copied from the routing
table to linear heap memory. If the allocated memory was not a multiple
of the struct size, a struct mfcinfo could be copied to a partially
unallocated destination. Check that the end of the struct is within
the allocation.
From Alfredo Ortega; OK claudio@
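A sketch of the bounds check, with an illustrative record type standing
in for struct mfcinfo; this is not the actual mrt_sysctl_mfc() code.

    #include <stddef.h>
    #include <string.h>

    /* Illustrative record type. */
    struct record {
        unsigned int origin;
        unsigned int group;
        unsigned int pkt_cnt;
    };

    /*
     * Copy fixed-size records into a caller-sized buffer.  The buffer
     * length need not be a multiple of the record size, so make sure the
     * end of each record still lies within the allocation before writing.
     */
    size_t
    fill_records(char *buf, size_t buflen, const struct record *recs,
        size_t nrecs)
    {
        size_t off = 0, i;

        for (i = 0; i < nrecs; i++) {
            if (off + sizeof(recs[i]) > buflen)
                break;          /* a partial record would not fit */
            memcpy(buf + off, &recs[i], sizeof(recs[i]));
            off += sizeof(recs[i]);
        }
        return off;             /* bytes actually written */
    }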
|
|
Fill and check the cache and call rtalloc_mpath() together. Then
the caller of route_mpath() does not have to care about the uint32_t
*src pointer and can just pass a struct in_addr. All the conversions are
done inside the functions.
A previous version of this diff was backed out. There was an
additional rtisvalid() in rtalloc_mpath() that prevented packet
output via interfaces that were not up. Now the route in the cache
has to be valid, but after new lookup, rtalloc_mpath() may return
invalid routes. This generates fewer errors in userland and preserves
existing behavior.
OK sashan@
|
|
Aligning the IPv4 address with the lower part of the IPv6 address
looks like a leftover from the times when IPv6-compatible addresses
were supposed to contain IPv4 addresses. Better use a simple union
for both IPv4 and IPv6 addresses like everywhere else. Use this type
also for the common zero address.
OK mvs@
|
|
in_pcbconnect() did not pass down the address it got from in_pcbselsrc()
to in_pcbpickport(). As a consequence local port numbers selected
during connect(2) were globally unique although they belonged to
different addresses. This strict uniqueness is not necessary and
wastes usable ports for outgoing connections.
To solve this, pass ina from in_pcbconnect() to in_pcbbind_locked().
This does not interfere with how wildcard sockets are matched with
specific sockets during bind(2). It only allows non-wildcard sockets
to share a local port during connect(2).
OK mvs@ deraadt@
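A small sketch of the difference in port selection; the lookup table
and names are illustrative, not the in_pcbpickport() or
in_pcbbind_locked() code.

    #include <stdbool.h>
    #include <stdint.h>

    /* Illustrative table of bound sockets. */
    struct bound {
        uint32_t laddr;     /* local IPv4 address */
        uint16_t lport;     /* local port */
    };

    /*
     * A candidate port is usable for `laddr' if no existing socket uses
     * the same (laddr, lport) pair.  Checking the port alone would make
     * every local port globally unique and waste ports for outgoing
     * connections from different local addresses.
     */
    bool
    port_is_free(const struct bound *tbl, int n, uint32_t laddr,
        uint16_t lport)
    {
        int i;

        for (i = 0; i < n; i++)
            if (tbl[i].lport == lport && tbl[i].laddr == laddr)
                return false;
        return true;
    }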
|