path: root/sys/netinet
Age  Commit message  Author
2023-06-24  Calculate inet PCB SIP hash without table mutex.  Alexander Bluhm
Goal is to run UDP input in parallel. Btrace kstack analysis shows that SIP hash for PCB lookup is quite expensive. When running in parallel, there is also lock contention on the PCB table mutex. It results in better performance to calculate the hash value before taking the mutex. The hash secret has to be constant as hash calculation must not depend on values protected by the table mutex. Do not reseed anymore when hash table gets resized. Analysis also shows that asserting a rw_lock while holding a mutex is a bit expensive. Just remove the netlock assert. OK dlg@ mvs@
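The hash-before-lock pattern in the commit above can be sketched outside the kernel. Below is a minimal user-space analogue, assuming a pthread mutex and a trivial multiplicative hash in place of the kernel's PCB table mutex and SipHash; the struct and function names are illustrative, not the actual PCB API.

/*
 * Minimal user-space sketch of the "hash before lock" pattern: the
 * expensive hash is computed unlocked, the critical section only does
 * the table walk.  All names are stand-ins, not kernel API.
 */
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>

#define TABLE_SIZE 128

struct pcb {
	struct pcb	*next;
	uint32_t	 key;
};

static struct pcb *table[TABLE_SIZE];
static pthread_mutex_t table_mtx = PTHREAD_MUTEX_INITIALIZER;

static uint32_t
pcb_hash(uint32_t key)
{
	/* the hash input must not depend on state protected by table_mtx */
	return (key * 2654435761u) % TABLE_SIZE;
}

struct pcb *
pcb_lookup(uint32_t key)
{
	struct pcb *p;
	uint32_t h;

	h = pcb_hash(key);		/* expensive part, done unlocked */
	pthread_mutex_lock(&table_mtx);	/* short critical section */
	for (p = table[h]; p != NULL; p = p->next)
		if (p->key == key)
			break;
	pthread_mutex_unlock(&table_mtx);
	return p;
}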
2023-06-14  Add missing kernel lock around (*if_ioctl)().  Vitaliy Makkoveev
ok bluhm
2023-05-30  Use generic checksum calculation for TCP SYN+ACK packets.  Alexander Bluhm
Our syn cache did checksum calculation by hand, instead of the established mechanism in ip output. The software-checksummed counter increased once per incoming TCP connection. Just set the flag M_TCP_CSUM_OUT in syn_cache_respond() and let in_proto_cksum_out() do the work later. Then hardware checksumming is used where available. Also remove redundant code. The unhandled af case is handled in the first switch statement of the function. tested by Hrvoje Popovski; OK mvs@
2023-05-23  New counters for LRO packets from hardware TCP offloading.  Jan Klemkow
With tweaks from patrick@ and bluhm@. OK bluhm@
2023-05-22  Fix TSO for traffic to a local address on a physical interface.  Alexander Bluhm
When sending TCP packets with software TSO to the local address of a physical interface, the TCP checksum was miscalculated. As the small MSS is taken from the physical interface, but the large MTU of the loopback interface is used, large TSO packets are generated, but sent directly to the loopback interface. There we need the regular pseudo header checksum and not the modified without packet length. To avoid this confusion, use the same decision for checksum generation in in_proto_cksum_out() as for using hardware TSO in tcp_if_output_tso(). bug reported and tested by robert@ bket@ Hrvoje Popovski OK claudio@ jan@
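A hedged sketch of the design choice above: both the checksum path and the TSO output path consult one shared predicate, so they cannot disagree about whether a packet will be hardware-segmented. All types and function names below are invented for illustration; this is not the kernel's in_proto_cksum_out() or tcp_if_output_tso().

/*
 * Illustrative only: one shared predicate decides both "which pseudo
 * header checksum to prepare" and "whether to hand the large packet to
 * the hardware", so the two paths cannot drift apart.
 */
#include <stdbool.h>
#include <stdint.h>

struct fake_ifnet {
	bool		tso_capable;	/* interface claims TSO support */
};

struct fake_pkt {
	unsigned int	len;		/* total packet length */
	unsigned int	mss;		/* maximum segment size */
	uint16_t	cksum;		/* checksum field to prepare */
};

/* single source of truth: will this packet be segmented in hardware? */
bool
use_hw_tso(const struct fake_ifnet *ifp, const struct fake_pkt *p)
{
	return ifp->tso_capable && p->len > p->mss;
}

void
cksum_out(const struct fake_ifnet *ifp, struct fake_pkt *p)
{
	if (use_hw_tso(ifp, p))
		p->cksum = 0x1111;	/* placeholder: pseudo sum without length */
	else
		p->cksum = 0x2222;	/* placeholder: regular pseudo sum */
}

void
output_tso(const struct fake_ifnet *ifp, struct fake_pkt *p)
{
	cksum_out(ifp, p);		/* same predicate used in both places */
	if (use_hw_tso(ifp, p)) {
		/* pass the large packet down to the driver */
	} else {
		/* chop in software and send regular segments */
	}
}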
2023-05-19  Move tcp_info structure to be under '#if __BSD_VISIBLE' to repair  Philip Guenther
compliance with POSIX/SUS restrictions on <netinet/tcp.h> ok bluhm@ ports testing and ok sthen@
2023-05-18  Revert ip_sysctl() unlocking. Lock order issue was triggered in UVM  Vitaliy Makkoveev
layer.
2023-05-18  Use TSO offloading in ix(4).  Jan Klemkow
With a lot of tweaks, improvements and testing from bluhm. Thanks to Hrvoje Popovski from the University of Zagreb for his great testing effort to make this happen. ok bluhm
2023-05-16  Introduce temporary PR_MPSYSCTL flag to mark (*pr_sysctl)() handler MP  Vitaliy Makkoveev
safe. We have many of them, so use a flag instead of pushing the kernel lock within each handler. Unlock ip_sysctl(). Still take the kernel lock within the IPCTL_MRTSTATS case. It looks like `mrtstat' protection is inconsistent, so keep locking as it was. Since `mrtstat' are counters, it makes sense to rework them into per-CPU counters in separate diffs. Feedback and ok from bluhm@
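The dispatch described above can be sketched as follows, with a pthread mutex standing in for the kernel lock and invented struct and flag names; only handlers not marked MP safe are wrapped in the big lock.

/*
 * Sketch: a per-protocol flag marks the sysctl handler as MP safe; only
 * unflagged handlers run under the big lock.  Not the kernel's protosw.
 */
#include <pthread.h>

#define FLAG_MPSYSCTL	0x01	/* handler may run without the big lock */

struct proto {
	int	flags;
	int	(*sysctl_handler)(int name);
};

static pthread_mutex_t big_lock = PTHREAD_MUTEX_INITIALIZER;

int
proto_sysctl(struct proto *pr, int name)
{
	int error;

	if (pr->flags & FLAG_MPSYSCTL)
		return pr->sysctl_handler(name);	/* MP safe: no big lock */

	pthread_mutex_lock(&big_lock);
	error = pr->sysctl_handler(name);
	pthread_mutex_unlock(&big_lock);
	return error;
}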
2023-05-16  Use separate IFCAPs for LRO and TSO.  Jan Klemkow
This diff introduces separate capabilities for TCP offloading. We split this into LRO (large receive offloading) and TSO (TCP segmentation offloading). LRO can be turned on/off via the tcprecvoffload option of ifconfig and is not inherited by sub interfaces. TSO is inherited by sub interfaces to signal this hardware offloading capability to the network stack. With tweaks from bluhm, claudio and dlg ok bluhm, claudio
2023-05-15  Implement the TCP/IP layer for hardware TCP segmentation offload.  Alexander Bluhm
If the driver of a network interface claims to support TSO, do not chop the packet in software, but pass it down to the interface layer. Precalculate parts of the pseudo header checksum, but without the packet length. The length of all generated smaller packets is not known yet. Driver and hardware will use the mbuf packet header field ph_mss to calculate it and update the checksum. Introduce separate flags IFCAP_TSOv4 and IFCAP_TSOv6 as hardware might support only one protocol family. The old flag IFXF_TSO is only relevant for large receive offload. It is misnamed, but keep that for now. Note that drivers do not set TSO capabilities yet. Also the ifconfig flags and pseudo interface capabilities will be done separately. So this commit should not change behavior. heavily based on the work from jan@; OK sashan@
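A rough sketch of the hand-off described above, assuming invented types and capability bits rather than the kernel's ifnet and mbuf definitions: check the per-family TSO capability and record the MSS in the packet header so the driver can derive segment lengths and checksums.

/*
 * Sketch of handing a large packet down for hardware segmentation.
 * The structures and capability bits are illustrative only.
 */
#include <stdbool.h>
#include <stdint.h>

#define CAP_TSO_V4	0x01
#define CAP_TSO_V6	0x02

struct fake_ifnet {
	uint32_t	capabilities;
};

struct fake_pkthdr {
	uint16_t	mss;		/* analogous in spirit to ph_mss */
	bool		tso;		/* "segment me in hardware" */
};

bool
tso_handoff(struct fake_ifnet *ifp, struct fake_pkthdr *ph,
    int af /* 4 or 6 */, uint16_t mss)
{
	uint32_t need = (af == 4) ? CAP_TSO_V4 : CAP_TSO_V6;

	if ((ifp->capabilities & need) == 0)
		return false;		/* fall back to software chopping */

	ph->mss = mss;			/* driver derives segment lengths */
	ph->tso = true;			/* pass the large packet down as-is */
	return true;
}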
2023-05-13  Instead of implementing IPv4 header checksum creation everywhere,  Alexander Bluhm
introduce in_hdr_cksum_out(). It is used like in_proto_cksum_out(). OK claudio@
2023-05-12  Access rt_llinfo without checking RTF_LLINFO flag before. They are  Alexander Bluhm
always set together with ARP mutex. OK mvs@
2023-05-10  Implement TCP send offloading, for now in software only. This is  Alexander Bluhm
meant as a fallback if network hardware does not support TSO. Driver support is still work in progress. TCP output generates large packets. In IP output the packet is chopped to TCP maximum segment size. This reduces the CPU cycles used by pf. The regular output could be assisted by hardware later, but pf route-to and IPsec needs the software fallback in general. For performance comparison or to workaround possible bugs, sysctl net.inet.tcp.tso=0 disables the feature. netstat -s -p tcp shows TSO counter with chopped and generated packets. based on work from jan@ tested by jmc@ jan@ Hrvoje Popovski OK jan@ claudio@
2023-05-08  The call to in_proto_cksum_out() is only needed before the packet  Alexander Bluhm
is passed to ifp->if_output(). The fragment code has its own checksum calculation and the other paths end in goto bad. OK claudio@
2023-05-07  In preparation for TSO in software, clean up the fragment code. Use  Alexander Bluhm
if_output_ml() to send mbuf lists to interfaces. This can be used for TSO, fragments, ARP and ND6. Rename variable fml to ml. In pf_route6() split the if else block. Put the safety check (hlen + firstlen < tlen) into ip_fragment(). It makes the code correct in case the packet is too short to be fragmented. This should not happen, but other functions also have this logic. No functional change. OK sashan@
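The safety check mentioned above can be illustrated with stand-alone arithmetic: if the header plus the first fragment's payload already covers the total length, the packet is too short to produce more than one fragment and is refused. This is a simplified sketch, not the kernel's ip_fragment().

/*
 * Illustrative fragment-length arithmetic with the safety check
 * (hlen + firstlen < tlen).  Simplified stand-alone sketch only.
 */
#include <stdio.h>

/* returns 0 on success, -1 if the packet cannot or need not be fragmented */
int
fragment_lengths(unsigned int tlen, unsigned int hlen, unsigned int mtu)
{
	unsigned int firstlen;

	if (mtu <= hlen + 8)
		return -1;		/* no room for even 8 bytes of payload */

	/* payload per fragment must be a multiple of 8 bytes */
	firstlen = (mtu - hlen) & ~7U;

	if (hlen + firstlen >= tlen)
		return -1;		/* too short: would not yield more than one fragment */

	printf("first fragment carries %u of %u payload bytes\n",
	    firstlen, tlen - hlen);
	return 0;
}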
2023-04-25  Fix white space.  Alexander Bluhm
2023-04-25  Exclusive net lock or mutex arp_mtx protect the llinfo_arp fields.  Alexander Bluhm
So kernel lock is only needed for changing the route rt_flags. In arpresolve() protect rt_llinfo lookup and llinfo_arp modification with arp_mtx. Grab kernel lock for rt_flags reject modification only when needed. Tested by Hrvoje Popovski; OK patrick@ kn@
2023-04-24  Hoist privilege checks further  Klemens Nanni
in6.c already has the privilege check as early as possible, make in.c match. For unprivileged IPv4 ioctl calls with invalid args, this changes errno from E* to EPERM. OK bluhm
2023-04-22  Call pfkeyv2_sysctl_policydumper() with shared netlock. It performs  Vitaliy Makkoveev
read-only access to netlock protected data, so the radix tree will not be modified during the spd_table_walk() run. Also change the netlock assertion within spd_table_add() and ipsec_delete_policy() to exclusive. These are the correlating functions which modify the radix tree, so this makes sure that running spd_table_walk() with the shared netlock is safe. Feedback and ok by bluhm@
2023-04-21  Drop error variable and return directly; OK mvs tb  Klemens Nanni
2023-04-19  move kernel lock into multicast ioctl handlers; OK mvs  Klemens Nanni
2023-04-18  Hoist identical privilege checks in in_ioctl*()  Klemens Nanni
All cases do the same check as first step, so merge it before the switch and before grabbing exclusive locks. OK mvs
2023-04-15  Unlock in_ioctl_get(), push kernel lock into in_ioctl_{set,change}_ifaddr()  Klemens Nanni
Just like in6_ioctl_get(), read ioctls are safe with the shared net lock to protect interface addresses and flags. OK mvs
2023-04-12  Pull MP-safe arprequest() out of kernel lock  Klemens Nanni
Defer sending after unlock, reuse `refresh' from similar construct. OK bluhm
2023-04-11  fix double words in comments  Jonathan Gray
feedback and ok jmc@ miod, ok millert@
2023-04-08  Do not reload `inp' in gre_send(). The pointer to PCB of raw socket is  Vitaliy Makkoveev
immutable, we don't need to reload it again. ok bluhm@
2023-04-07  Remove kernel locks from the ARP input path. Caller if_netisr()  Alexander Bluhm
grabs the exclusive netlock and that is sufficient for in_arpinput() and arpcache(). with kn@; OK mvs@; tested by Hrvoje Popovski
2023-04-05  ARP has a sysctl to show the number of packets waiting for an arp  Alexander Bluhm
response. Implement analog sysctl net.inet6.icmp6.nd6_queued for ND6 to reduce places where mbufs can hide within the kernel. Atomic operations operate on unsigned int. Make the type of total hold queue length consistent. Use atomic load to read the value for the sysctl. This clarifies why no lock around sysctl_rdint() is needed. OK mvs@ kn@
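A user-space sketch of the idea above, with C11 atomics standing in for the kernel's atomic operations: writers use atomic increment and decrement, and the read-only sysctl path only needs an atomic load, so no lock is required around the read.

/*
 * Sketch: the hold-queue length is an atomic unsigned int; the read-only
 * sysctl path just does an atomic load.  The counter name mirrors
 * net.inet6.icmp6.nd6_queued but this is not the kernel code.
 */
#include <stdatomic.h>

static atomic_uint nd6_queued;		/* packets waiting for resolution */

void
holdq_account_enqueue(void)
{
	atomic_fetch_add(&nd6_queued, 1);
}

void
holdq_account_dequeue(void)
{
	atomic_fetch_sub(&nd6_queued, 1);
}

/* what a read-only sysctl handler would report, lock-free */
unsigned int
holdq_len(void)
{
	return atomic_load(&nd6_queued);
}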
2023-04-05  ARP has a queue of packets that should be sent after name resolution.  Alexander Bluhm
ND6 did only hold a single packet. Unify the logic and add a mbuf hold queue to struct llinfo_nd6. This is MP safe and queue limits are tracked with atomic operations. New function if_mqoutput() has common code for ARP and ND6. ln_saddr6 holds the source address of the requesting packet. That is easier than fiddling with mbuf queue in nd6_ns_output(). OK kn@
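A hedged sketch of such a bounded hold queue, using a pthread mutex and invented packet and queue types instead of the kernel's mbuf machinery; the per-queue limit is enforced under the queue lock while the global total is tracked atomically for reporting (see the sysctl sketch above).

/*
 * Sketch of a bounded per-neighbour hold queue.  All types are
 * stand-ins; the kernel uses mbuf queues and its own atomics.
 */
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define HOLDQ_MAXLEN	64		/* per-queue packet limit */

struct pkt {
	struct pkt	*next;
};

struct holdq {
	pthread_mutex_t	 mtx;
	struct pkt	*head, **tailp;
	unsigned int	 len;
};

static atomic_uint total_held;		/* global count, read lock-free */

void
holdq_init(struct holdq *q)
{
	pthread_mutex_init(&q->mtx, NULL);
	q->head = NULL;
	q->tailp = &q->head;
	q->len = 0;
}

bool
holdq_put(struct holdq *q, struct pkt *p)
{
	bool queued = false;

	pthread_mutex_lock(&q->mtx);
	if (q->len < HOLDQ_MAXLEN) {
		p->next = NULL;
		*q->tailp = p;		/* append at tail, keep order */
		q->tailp = &p->next;
		q->len++;
		queued = true;
	}
	pthread_mutex_unlock(&q->mtx);

	if (queued)
		atomic_fetch_add(&total_held, 1);
	return queued;			/* caller drops the packet on false */
}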
2023-04-04  When sending IP packets to userland with divert-packet rules, the  Alexander Bluhm
checksum may be wrong. Locally generated packets diverted by pf out rules may have no checksum due to hardware offloading. Calculate the checksum in that case. OK mvs@ sashan@
2023-03-14  To avoid misunderstanding, keep variables for tcp keepalive in  YASUOKA Masahiko
milliseconds, which is the same unit as tcp_now(). However, keep the unit of the sysctl variables in seconds and convert their unit in tcp_sysctl(). Additionally revert TCPTV_SRTTDFLT back to 3 seconds, which was mistakenly changed to 1.5 seconds by tcp_timer.h 1.19. ok claudio
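The unit convention above can be sketched like this: the timer variable is kept in milliseconds to match the internal clock, and only the sysctl boundary converts to and from seconds. Names and the default value are illustrative, not the kernel's.

/*
 * Sketch: internal value in milliseconds, seconds only at the
 * sysctl boundary.  Illustrative names and default.
 */

/* internal: milliseconds, same unit as the internal TCP clock */
static int tcp_keepidle_ms = 2 * 60 * 60 * 1000;	/* 2 hours */

/* what a sysctl read handler would return: seconds */
int
keepidle_get_secs(void)
{
	return tcp_keepidle_ms / 1000;
}

/* what a sysctl write handler would store: seconds in, milliseconds kept */
void
keepidle_set_secs(int secs)
{
	tcp_keepidle_ms = secs * 1000;
}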
2023-03-08  An invalid source routing IP option could overwrite kernel memory  Alexander Bluhm
by using a bad option length. This bug is only reachable if both pf IP option check is disabled and IP source routing is enabled. reported by @fuzzingrf Erg Noor OK claudio@ deraadt@
2023-03-08  Delete obsolete /* ARGSUSED */ lint comments.  Philip Guenther
ok miod@ millert@
2023-03-04  properly initialise LIST head  Klemens Nanni
This worked because the global head variable is zero-initialised, but one must not rely on that. OK mvs claudio
2023-02-07  Remove needless #ifdef INET6 from struct ether_extracted field in  Alexander Bluhm
public header file. Makes debugging with special kernels easier.
2023-02-06  consolidate mbuf header parsing on device driver layer  Jan Klemkow
with tweaks from mvs@, mpi@, dlg@, naddy@ and bluhm@ "go for it" deraadt@ ok naddy@, mvs@
2023-01-31  Remove the last route lock references from comments.  Vitaliy Makkoveev
No functional change.
2023-01-31  Route lock was reverted, adjust forgotten commentary.  Vitaliy Makkoveev
No functional changes.
2023-01-28  Revert the `rt_lock' rwlock(9) diff to fix the recursive  Vitaliy Makkoveev
rwlock(9) acquisition. Reported-by: syzbot+fbe3acb4886adeef31e0@syzkaller.appspotmail.com
2023-01-26  Back out "consolidate mbuf header parsing on device driver layer"  Theo de Raadt
easily repeatable ASSERT happens seconds after starting compiles over nfs.
2023-01-24  consolidate mbuf header parsing on device driver layer  Jan Klemkow
with tweaks from mvs@, mpi@ and dlg@ ok mvs@, dlg@
2023-01-22  Move SS_CANTRCVMORE and SS_RCVATMARK bits from `so_state' to `sb_state' of  Vitaliy Makkoveev
receive buffer. As it was done for the SS_CANTSENDMORE bit, the definitions are kept as is, but now these bits belong to the `sb_state' of the receive buffer. `sb_state' is ORed with `so_state' when socket data is exported to userland. ok bluhm@
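A minimal sketch of the export described above, with invented structures and bit values: the shutdown bits live in the per-buffer `sb_state', and only when socket state is reported to userland are they ORed into the classic `so_state' view, so the ABI stays the same.

/*
 * Sketch: per-buffer state bits composed into the old-style so_state
 * only at export time.  Structures and bit values are illustrative.
 */
#include <stdint.h>

#define SBS_CANTSENDMORE	0x0010
#define SBS_CANTRCVMORE		0x0020

struct fake_sockbuf {
	uint16_t	sb_state;	/* per-buffer state bits */
};

struct fake_socket {
	uint16_t		so_state;	/* socket-wide state bits */
	struct fake_sockbuf	so_snd, so_rcv;
};

/* the value exported to userland looks exactly like the old so_state */
uint16_t
export_so_state(const struct fake_socket *so)
{
	return so->so_state | so->so_snd.sb_state | so->so_rcv.sb_state;
}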
2023-01-21  Introduce `rt_lock' rwlock(9) and use it instead of kernel lock to  Vitaliy Makkoveev
serialize arpcache() and arpresolve(). In fact, the net stack already has sleep points, so the rwlock(9) is better here because we avoid intersection with the rest of the kernel locked paths. Also this new lock is intended to be used for route layer protection instead of the netlock. Hrvoje Popovski has tested this diff and found no visible performance impact. ok bluhm@
2023-01-21  Introduce per-sockbuf `sb_state' to use it with SS_CANTSENDMORE.  Vitaliy Makkoveev
This time, the socket's buffer lock requires solock() to be held. As a part of the socket buffers standalone locking work, move the socket state bits which represent its buffers' state to per-buffer state. Unlike the previously reverted diff, the SS_CANTSENDMORE definition is left as is, but it is used only with `sb_state'. `sb_state' is ORed with the original `so_state' when the socket's data is exported to userland, so the ABI is kept as it was. Inputs from deraadt@. ok bluhm@
2023-01-12  Binding the accept socket in TCP input relies on the fact that the  Alexander Bluhm
listen port is not bound to port 0. With a matching pf divert-to rule this assumption is no longer true and could crash the kernel with kassert. In both pf and stack drop TCP packets with destination port 0 before they can do harm. OK sashan@ claudio@
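A small sketch of the guard described above, using the BSD struct tcphdr from <netinet/tcp.h>; the function name is illustrative and this is not the actual pf or tcp_input() code.

/*
 * Sketch: reject TCP segments with destination port 0 before any PCB
 * lookup or divert processing can act on them.
 */
#include <sys/types.h>
#include <netinet/tcp.h>
#include <stdbool.h>

/* returns true if the segment must be dropped */
bool
tcp_dport_zero_drop(const struct tcphdr *th)
{
	/* th_dport is in network byte order, but 0 is 0 either way */
	return th->th_dport == 0;
}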
2022-12-27  Fix array bounds mismatch with clang 15  Patrick Wildt
New warning -Warray-parameter is a bit overzealous. ok millert@ tb@
2022-12-13  In tcp_now() switch from getnsecuptime() to getnsecruntime()  Claudio Jeker
The tcp timer is not supposed to run during suspend but getnsecuptime() does, and because of this, sessions with TCP_KEEPALIVE enabled reset after a few hours of sleep. Problem noticed by mlarkin@, investigation by yasuoka@ additional testing jca@ OK yasuoka@ jca@ cheloha@
2022-12-12  Revert sb_state changes to unbreak tree.  Theo Buehler
2022-12-11  This time, socket's buffer lock requires solock() to be held. As a part of  Vitaliy Makkoveev
socket buffers standalone locking work, move socket state bits which represent its buffers state to per buffer state. Introduce `sb_state' and turn SS_CANTSENDMORE to SBS_CANTSENDMORE. This bit will be processed on `so_snd' buffer only. Move SS_CANTRCVMORE and SS_RCVATMARK bits with separate diff to make review easier and exclude possible so_rcv/so_snd mistypes. Also, don't adjust the remaining SS_* bits right now. ok millert@