path: root/sys/netinet
2023-12-07  Alexander Bluhm
Inpcb table mutex protects addr and port during bind(2) and connect(2).
in_pcbbind(), in_pcbconnect(), and in6_pcbconnect() have to set addresses and ports within the same critical section as the inpcb hash table calculation. Also lookup and address selection have to be protected to avoid bindings and connections that are not unique. For that in_pcbpickport() and in_pcbbind_locked() expect that the table mutex is already taken. The functions in_pcblookup_lock(), in_pcblookup_local_lock(), and in_pcbaddrisavail_lock() grab the mutex iff the lock parameter is IN_PCBLOCK_GRAB. Otherwise the parameter is IN_PCBLOCK_HOLD and the lock has to be taken already. Note that in_pcblookup_lock() and in_pcblookup_local() return an inp with increased reference iff they take and release the lock. Otherwise the caller protects the lifetime of the inp. This gives enough flexibility that in_pcbbind() and in_pcbconnect() can hold the table mutex when they need it. The public inpcb API does not change. OK sashan@ mvs@
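For illustration, a minimal sketch of the lock-parameter pattern described above. The address/port tuple arguments are elided and in_pcbhash_lookup() is a hypothetical stand-in for the hash chain walk; only mtx_enter(9)/mtx_leave(9), in_pcbref() and the IN_PCBLOCK_* values come from the commit or existing kernel API.

    /* Sketch only, not the actual lookup code. */
    struct inpcb *
    in_pcblookup_lock(struct inpcbtable *table, int lock /* , tuple */)
    {
        struct inpcb *inp;

        if (lock == IN_PCBLOCK_GRAB)
            mtx_enter(&table->inpt_mtx);
        else
            MUTEX_ASSERT_LOCKED(&table->inpt_mtx);
        inp = in_pcbhash_lookup(table /* , tuple */);   /* hypothetical */
        if (lock == IN_PCBLOCK_GRAB) {
            if (inp != NULL)
                in_pcbref(inp); /* ref returned iff we took the lock */
            mtx_leave(&table->inpt_mtx);
        }
        return (inp);
    }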
2023-12-06  Alexander Bluhm
Protect socket receive buffer in IP multicast routing.
Since soreceive() runs in parallel for raw sockets, sbappendaddr() has to be protected by the inpcb mutex. This was missing in multicast forwarding, which is running with a combination of shared net lock and kernel lock. soreceive() uses shared net lock and mutex per inpcb. Grab the mutex before sbappendaddr() in socket_send() and socket6_send(). panic "receive 1" reported by Jo Geraerts OK mvs@ claudio@
2023-12-03  Alexander Bluhm
Use INP_IPV6 flag instead of sotopf().
During initialization in_pcballoc() sets INP_IPV6 once to avoid reaching through inp_socket->so_proto->pr_domain->dom_family. Use this flag consistently. OK sashan@ mvs@
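As a one-line illustration, the flag test replaces the pointer chain (helper name hypothetical):

    /* Hypothetical helper: the flag is set once in in_pcballoc(). */
    static inline int
    inp_is_ipv6(const struct inpcb *inp)
    {
        /* old: inp->inp_socket->so_proto->pr_domain->dom_family == AF_INET6 */
        return (ISSET(inp->inp_flags, INP_IPV6));
    }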
2023-12-03  Vitaliy Makkoveev
Make ipsp_ids_gc() timeout(9) handler mpsafe. `ipsec_flows_mtx' mutex(9) protects related data. ok bluhm
2023-12-01  Alexander Bluhm
Set inp address, port and rtable together with inpcb hash.
The inpcb hash table is protected by table->inpt_mtx. The hash is based on addresses, ports, and routing table. These fields were not synchronized with the hash. Put writes and hash update into the same critical section. Move the updates from ip_ctloutput(), ip6_ctloutput(), syn_cache_get(), tcp_connect(), udp_disconnect() to dedicated inpcb set functions. There they use the same table mutex as in_pcbrehash(). in_pcbbind(), in_pcbconnect(), and in6_pcbconnect() need more work and are not included yet. OK sashan@ mvs@
2023-12-01  Alexander Bluhm
Make internet PCB connect more consistent.
The public interface is in_pcbconnect(). It dispatches to in6_pcbconnect() if necessary. Call the former from tcp_connect() and udp_connect(). In in6_pcbconnect() initialization in6a = NULL is not necessary. in6_pcbselsrc() sets the pointer, but does not read the value. Pass a constant in6_addr pointer to in6_pcbselsrc() and in6_selectsrc(). It returns a reference to the address of some internal data structure. We want to be sure that in6_addr is not modified this way. IPv4 in_pcbselsrc() solves this by passing a copy of the address. OK kn@ sashan@ mvs@
2023-11-30  Alexander Bluhm
Pass inp_seclevel to ip6_output() in TCP syn cache.
TCP syn_cache_respond() uses inp_seclevel from listening socket as ip_output() parameter. This was missing for ip6_output(). OK mvs@
2023-11-29  Alexander Bluhm
Run TCP syn cache timer without kernel lock.
As syn_cache_timer() uses syn cache mutex and exclusive net lock, it does not need kernel lock. OK mvs@
2023-11-29  Alexander Bluhm
Document inp_socket as immutable and remove NULL checks.
Struct inpcb field inp_socket is initialized in in_pcballoc(). It is not NULL and never changed. OK mvs@
2023-11-28  Alexander Bluhm
Remove struct inpcb from in6_embedscope() parameters.
rip6_output() did modify inp_outputopts6 temporarily to provide different ip6_pktopts to in6_embedscope(). Better pass inp_outputopts6 and inp_moptions6 as separate arguments to in6_embedscope(). Simplify the code that deals with these options in in6_embedscope(). Document inp_moptions and inp_moptions6 as protected by net lock. OK kn@
2023-11-27  Alexander Bluhm
Add NULL check before dereferencing inp_seclevel.
In some cases inp may be NULL, so check that before passing inp->inp_seclevel to ipsp_spd_lookup() or ip_output(). Missed in previous commit.
2023-11-26  Alexander Bluhm
Remove inp parameter from ip_output().
ip_output() received inp as parameter. This is only used to lookup the IPsec level of the socket. Reasoning about MP locking is much easier if only relevant data is passed around. Convert ip_output() to receive constant inp_seclevel as argument and mark it as protected by net lock. OK mvs@
2023-11-16  Alexander Bluhm
Run TCP SYN cache timer logic without net lock.
Introduce a global TCP SYN cache mutex. Divide the timer function into parts protected by the mutex and sending with net lock. Split the flags field into dynamic flags protected by the mutex and fixed flags set during initialization. Document whether fields of struct syn_cache are protected by net lock or mutex. input and OK sashan@
2023-11-12  Alexander Bluhm
Declare global variable zeroin46_addr as const.
OK mvs@ jca@
2023-11-10  Alexander Bluhm
rtable_match() takes constant destination.
For implementing MP safe route lookup, it helps to know which function parameters are constant. Add some const declarations, so that the compiler guarantees that sockaddr dst parameter of rtable_match() does not change. OK dlg@
2023-11-09  Alexander Bluhm
Run arp timeout without kernel lock.
Since cheloha@ has implemented timeout processes that do not grab the kernel lock, start using TIMEOUT_MPSAFE for arptimer(). OK kn@ mvs@
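A sketch of such a setup with timeout_set_flags(9); the init function name is hypothetical and the exact arptimer() wiring in the tree may differ.

    #include <sys/timeout.h>

    void    arptimer(void *);
    struct  timeout arptimer_to;

    void
    arp_start_timer(void)   /* hypothetical */
    {
        /* TIMEOUT_MPSAFE: handler runs without the kernel lock. */
        timeout_set_flags(&arptimer_to, arptimer, &arptimer_to,
            KCLOCK_NONE, TIMEOUT_PROC | TIMEOUT_MPSAFE);
        timeout_add_sec(&arptimer_to, 1);
    }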
2023-10-11  Tobias Heider
Prevent deref-after-free when tdb_timeout() fires on invalid new tdb.
When receiving a pfkeyv2 SADB_ADD message, a newly created tdb can fail in tdb_init(), which causes the tdb to not get added to the global tdb list and to be dereferenced immediately. If a lifetime timeout triggers on this tdb, it will unconditionally try to remove it from the list and in the process deref once more than allowed, causing a one-bit corruption in the already freed up slot in the tdb pool. We resolve this issue by moving timeout_add() after tdb_init() just before puttdb(). This means tdbs failing initialization get discarded immediately as they only hold a single reference. Valid tdbs get their timeouts activated just before we add them to the tdb list, meaning the timeout can safely assume they are linked. Feedback from mvs@ and millert@ ok mvs@ mbuhl@
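The reordered flow, sketched as a fragment with approximate names and elided arguments; tdb_alloc(), tdb_init(), tdb_unref() and puttdb() exist, the surrounding logic is illustrative only.

    /* Illustrative ordering: arm the timeout only after tdb_init(). */
    tdb = tdb_alloc(rdomain);
    if (tdb_init(tdb, xsp, &ii) != 0) {
        tdb_unref(tdb);         /* single ref: discarded immediately */
        return (EINVAL);
    }
    if (tdb->tdb_flags & TDBF_TIMER)        /* field names approximate */
        timeout_add_sec(&tdb->tdb_timer_tmo, tdb->tdb_exp_timeout);
    puttdb(tdb);                /* timeout may assume tdb is linked */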
2023-09-16  Martin Pieuchot
Allow counters_read(9) to take an optional scratch buffer.
Using a scratch buffer makes it possible to take a consistent snapshot of per-CPU counters without having to allocate memory. Makes ddb(4) show uvmexp command work in OOM situations. ok kn@, mvs@, cheloha@
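A usage sketch, assuming the scratch buffer is the final counters_read() argument and ips_ncounters bounds the ipstat counter array; the wrapper name is hypothetical.

    #include <sys/percpu.h>

    void
    ipstat_read_safe(struct cpumem *ipcounters, uint64_t *out)  /* hypothetical */
    {
        /* Static scratch: consistent snapshot without allocating. */
        static uint64_t scratch[ips_ncounters];

        counters_read(ipcounters, out, ips_ncounters, scratch);
    }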
2023-09-06  Alexander Bluhm
Use shared net lock for ip_send() and ip6_send().
When called with NULL options, ip_output() and ip6_output() are MP safe. Convert exclusive to shared net lock in send dispatch. OK mpi@
2023-09-04  Alexander Bluhm
Fix netstat output of "uses of current SYN cache left".
TCP syn cache variable scs_use is basically counting packet insertions into syn cache. Prefer type long to exclude overflow on fast machines. Due to counting downwards from a limit, it can become negative. Copy it out as tcps_sc_uses_left via sysctl, and print it as signed long long integer. OK mvs@
2023-09-03  Alexander Bluhm
Avoid a useless increment and decrement of the tcp syn cache refcount by unexpanding the SYN_CACHE_TIMER_ARM() macro in the timer callback. OK mvs@
2023-08-28  Alexander Bluhm
Introduce reference counting for TCP syn cache entries.
The syn_cache_reaper() is a hack to serialize timeouts. Unfortunately it has a race and panics sometimes with "pool_do_get: syncache free list modified". Add a reference counter for timeout and list of syn cache entries. Currently the list refcount is not strictly necessary due to exclusive netlock, but will be needed when we continue unlocking. Checking timeout_initialized() is not MP friendly, better do proper initialization during object allocation. Refcount in btrace helps to find leaks. bug reported and fix tested by Peter J. Philipp OK claudio@
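The pattern, sketched with the refcnt(9) API (struct field names approximate): take a reference when arming the timeout, drop it when the handler finishes, and free on the last release.

    /* Sketch: one reference per armed timeout. */
    void
    syn_cache_timer_arm(struct syn_cache *sc, int msec)
    {
        refcnt_take(&sc->sc_refcnt);
        if (timeout_add_msec(&sc->sc_timer, msec) == 0)
            refcnt_rele_wake(&sc->sc_refcnt);   /* was already pending */
    }

    void
    syn_cache_timer(void *arg)
    {
        struct syn_cache *sc = arg;

        /* ... retransmit or expire the entry ... */
        if (refcnt_rele(&sc->sc_refcnt))
            pool_put(&syn_cache_pool, sc);      /* last reference gone */
    }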
2023-08-07  David Gwynne
add the glue between ipsec security associations and sec(4) interfaces.
if TDBF_IFACE is set on a tdb, the ipsec stack will pass it to the sec(4) driver to keep track of instead of wiring it up for security associations to use. when sec(4) transmits a packet, it will look up its list of tdbs to find the right SA to encrypt and send the packet out with. if an incoming ipsec packet arrives with TDBF_IFACE set, it's passed to sec(4) to be injected back into the network stack as if it was received on the sec interface, instead of being reinjected into the IP stack like normal SA/SPD processing does. note that this means you do not have to configure tunnel endpoints on sec(4) interfaces, instead you line the interface unit number in the ipsec config up with the minor number of the sec(4) interfaces. the peer IPs used on the SAs are what's used as the traffic endpoints. support from many including markus@ tobhe@ claudio@ sthen@ patrick@ now is a good time deraadt@
2023-08-07  David Gwynne
start adding support for route-based ipsec vpns.
rather than use ipsec flows (aka, entries in the ipsec security policy database) to decide which traffic should be encapsulated in ipsec and sent to a peer, this tweaks security associations (SAs) so they can refer to a tunnel interface. when traffic is routed over that tunnel interface, an ipsec SA is looked up and used to encapsulate traffic before being sent to the peer on the SA. When traffic is received from a peer using an interface SA, the specified interface is looked up and the packet is handed to it so it looks like packets come out of the tunnel. to support this, SAs get a TDBF_IFACE flag and iface and iface_dir fields. When TDBF_IFACE is set the iface and dir fields are considered valid, and the tdb/SA should be used with the tunnel interface instead of the SPD. support from many including markus@ tobhe@ claudio@ sthen@ patrick@ now is a good time deraadt@
2023-07-27  Jan Klemkow
Fix inline vlan-tag handling of forwarded LRO packets from ix(4).
Implement vlan-tag parsing in ether_extract_header() and use this information to adjust the MSS calculation of LRO packets. pointed out by mbuhl and bluhm with tweaks from bluhm ok bluhm@
2023-07-07  Alexander Bluhm
Fix path MTU discovery for TCP LRO/TSO when forwarding.
When doing LRO (Large Receive Offload), the drivers, currently ix(4) and lo(4) only, record an upper bound of the size of the original packets in ph_mss. When sending, either stack or hardware must chop the packets with TSO (TCP Segmentation Offload) to that size. That means we have to call tcp_if_output_tso() before ifp->if_output(). Put that logic into if_output_tso() to avoid code duplication. As TCP packets on the wire do not get larger that way, path MTU discovery should still work. tested by and OK jan@
2023-07-06  Klemens Nanni
use refcnt API for multicast addresses, add tracepoint:refcnt:ethmulti probe
Replace hand-rolled reference counting with refcnt_init(9) and hook it up with a new dt(4) probe. OK mvs Feedback OK bluhm
2023-07-06  Alexander Bluhm
Convert tcp_now() time counter to 64 bit.
After changing tcp now tick to milliseconds, 32 bits will wrap around after 49 days of uptime. That may be a problem in some places of our stack. Better use a 64 bit counter. As timestamp option is 32 bit in TCP protocol, use the lower 32 bit there. There are casts to 32 bits that should behave correctly. Start with random 63 bit offset to avoid uptime leakage. 2^63 milliseconds result in 2.9*10^8 years of possible uptime. OK yasuoka@
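The idea in miniature (names hypothetical; the kernel's actual time source may differ): a random 63-bit boot offset plus milliseconds of uptime, truncated to 32 bits only where the wire format requires it.

    uint64_t tcp_now_base;      /* hypothetical: set once at boot */

    void
    tcp_now_init(void)
    {
        /* random 63-bit offset avoids leaking uptime */
        tcp_now_base = ((uint64_t)arc4random() << 32 | arc4random()) >> 1;
    }

    uint64_t
    tcp_now(void)
    {
        return (tcp_now_base + getnsecuptime() / 1000000);
    }

    /* The TCP timestamp option keeps only the low 32 bits. */
    uint32_t
    tcp_now_ts(void)
    {
        return ((uint32_t)tcp_now());
    }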
2023-07-06  David Gwynne
big update to pfsync to try and clean up locking in particular.
moving pf forward has been a real struggle, and pfsync has been a constant source of pain. we have been papering over the problems for a while now, but it reached the point that it needed a fundamental restructure, which is what this diff is. the big headliner changes in this diff are:
- pfsync specific locks: this is the whole reason for this diff. rather than rely on NET_LOCK or KERNEL_LOCK or whatever, pfsync now has its own locks to protect its internal data structures. this is important because pfsync runs a bunch of timeouts and tasks to push pfsync packets out on the wire, or when it's handling requests generated by incoming pfsync packets, both of which happen outside pf itself running. having pfsync specific locks around pfsync data structures makes the mutations of these data structures a lot more explicit and auditable.
- partitioning: to enable future parallelisation of the network stack, this rewrite includes support for pfsync to partition states into different "slices". these slices run independently, ie, the states collected by one slice are serialised into a separate packet from the states collected and serialised by another slice. states are mapped to pfsync slices based on the pf state hash, which is the same hash that the rest of the network stack and multiq hardware uses.
- no more pfsync called from netisr: pfsync used to be called from netisr to try and bundle packets, but now that there's multiple pfsync slices this doesn't make sense. instead it uses tasks in softnet tqs.
- improved bulk transfer handling: there's shiny new state machines around both the bulk transmit and receive handling. pfsync used to do horrible things to carp demotion counters, but now it is very predictable and returns the counters back where they started.
- better tdb handling: the tdb handling was pretty hairy, but hrvoje has kicked this around a lot with ipsec and sasyncd and we've found and fixed a bunch of issues as a result of that testing.
- mpsafe pf state purges: this was committed previously, but because the locks pfsync relied on weren't clear this just caused a ton of bugs. as part of this diff it's now reliable, and moves a big chunk of work out from under KERNEL_LOCK, which in turn improves the responsiveness and throughput of a firewall even if you're not using pfsync.
there's a bunch of other little changes along the way, but the above are the big ones. hrvoje has done performance testing with this diff and notes a big improvement when pfsync is not in use. performance when pfsync is enabled is about the same, but i'm hoping the slices means we can scale along with pf as it improves. lots (months) of testing by me and hrvoje on pfsync boxes. tests and ok sashan@. deraadt@ says this is a good time to put it in.
2023-07-04  Alexander Bluhm
Remove redundant code when calculating checksum.
OK jmatthew@
2023-07-02  Alexander Bluhm
Use TSO and LRO on the loopback interface to transfer TCP faster.
If tcplro is activated on lo(4), ignore the MTU with TCP packets. They are passed along with the information that they have to be chopped in case they are forwarded later. New netstat(1) counter shows that software LRO is in effect. The feature is currently turned off by default. tested by jan@; OK claudio@ jan@
2023-06-28  Klemens Nanni
use refcnt API for multicast addresses, add tracepoint:refcnt:ifmaddr probe
Replace hand-rolled reference counting with refcnt_init(9) and hook it up with a new dt(4) probe. OK bluhm mvs
2023-06-24  Alexander Bluhm
Calculate inet PCB SIP hash without table mutex.
Goal is to run UDP input in parallel. Btrace kstack analysis shows that SIP hash for PCB lookup is quite expensive. When running in parallel, there is also lock contention on the PCB table mutex. It results in better performance to calculate the hash value before taking the mutex. The hash secret has to be constant as hash calculation must not depend on values protected by the table mutex. Do not reseed anymore when hash table gets resized. Analysis also shows that asserting a rw_lock while holding a mutex is a bit expensive. Just remove the netlock assert. OK dlg@ mvs@
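Conceptually (caller and key names hypothetical; the real code hashes the full address/port tuple and rdomain), the hash comes from a constant SipHash key before the mutex is taken:

    #include <sys/siphash.h>

    SIPHASH_KEY in_pcbhashkey;  /* constant: no reseed on table resize */

    void
    lookup_sketch(struct inpcbtable *table, const void *tuple, size_t len)
    {
        uint64_t hash;

        hash = SipHash24(&in_pcbhashkey, tuple, len);   /* outside mutex */
        mtx_enter(&table->inpt_mtx);
        /* walk the hash bucket selected by hash & table mask */
        mtx_leave(&table->inpt_mtx);
        (void)hash;
    }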
2023-06-14  Vitaliy Makkoveev
Add missing kernel lock around (*if_ioctl)().
ok bluhm
2023-05-30  Alexander Bluhm
Use generic checksum calculation for TCP SYN+ACK packets.
Our syn cache did checksum calculation by hand, instead of the established mechanism in ip output. The software-checksummed counter increased once per incoming TCP connection. Just set the flag M_TCP_CSUM_OUT in syn_cache_respond() and let in_proto_cksum_out() do the work later. Then hardware checksumming is used where available. Also remove redundant code. The unhandled af case is handled in the first switch statement of the function. tested by Hrvoje Popovski; OK mvs@
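The gist as a sketch (helper name hypothetical): flag the mbuf and let the output path compute or offload the checksum.

    /* Sketch: request TCP checksumming from the output path. */
    void
    syn_cache_mark_cksum(struct mbuf *m)    /* hypothetical */
    {
        m->m_pkthdr.csum_flags |= M_TCP_CSUM_OUT;
        /* in_proto_cksum_out() or the hardware fills it in later. */
    }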
2023-05-23  Jan Klemkow
New counters for LRO packets from hardware TCP offloading.
With tweaks from patrick@ and bluhm@. OK bluhm@
2023-05-22  Alexander Bluhm
Fix TSO for traffic to a local address on a physical interface.
When sending TCP packets with software TSO to the local address of a physical interface, the TCP checksum was miscalculated. As the small MSS is taken from the physical interface, but the large MTU of the loopback interface is used, large TSO packets are generated, but sent directly to the loopback interface. There we need the regular pseudo header checksum and not the modified without packet length. To avoid this confusion, use the same decision for checksum generation in in_proto_cksum_out() as for using hardware TSO in tcp_if_output_tso(). bug reported and tested by robert@ bket@ Hrvoje Popovski OK claudio@ jan@
2023-05-19  Philip Guenther
Move tcp_info structure to be under '#if __BSD_VISIBLE' to repair compliance with POSIX/SUS restrictions on <netinet/tcp.h>. ok bluhm@ ports testing and ok sthen@
2023-05-18  Vitaliy Makkoveev
Revert ip_sysctl() unlocking. Lock order issue was triggered in UVM layer.
2023-05-18  Jan Klemkow
Use TSO offloading in ix(4).
With a lot of tweaks, improvements and testing from bluhm. Thanks to Hrvoje Popovski from the University of Zagreb for his great testing effort to make this happen. ok bluhm
2023-05-16  Vitaliy Makkoveev
Introduce temporary PR_MPSYSCTL flag to mark (*pr_sysctl)() handler MP safe. We have many of them, so use the flag instead of pushing the kernel lock within. Unlock ip_sysctl(). Still take the kernel lock in the IPCTL_MRTSTATS case. It looks like `mrtstat' protection is inconsistent, so keep locking as it was. Since `mrtstat' are counters, it makes sense to rework them into per-CPU counters with separate diffs. Feedback and ok from bluhm@
2023-05-16  Jan Klemkow
Use separate IFCAPs for LRO and TSO.
This diff introduces separate capabilities for TCP offloading. We split this into LRO (large receive offloading) and TSO (TCP segmentation offloading). LRO can be turned on/off via the tcprecvoffload option of ifconfig and is not inherited by sub interfaces. TSO is inherited by sub interfaces to signal this hardware offloading capability to the network stack. With tweaks from bluhm, claudio and dlg. ok bluhm, claudio
2023-05-15  Alexander Bluhm
Implement the TCP/IP layer for hardware TCP segmentation offload.
If the driver of a network interface claims to support TSO, do not chop the packet in software, but pass it down to the interface layer. Precalculate parts of the pseudo header checksum, but without the packet length. The length of all generated smaller packets is not known yet. Driver and hardware will use the mbuf packet header field ph_mss to calculate it and update the checksum. Introduce separate flags IFCAP_TSOv4 and IFCAP_TSOv6 as hardware might support only one protocol family. The old flag IFXF_TSO is only relevant for large receive offload. It is misnamed, but keep that for now. Note that drivers do not set TSO capabilities yet. Also the ifconfig flags and pseudo interfaces capabilities will be done separately. So this commit should not change behavior. heavily based on the work from jan@; OK sashan@
2023-05-13  Alexander Bluhm
Instead of implementing IPv4 header checksum creation everywhere, introduce in_hdr_cksum_out(). It is used like in_proto_cksum_out(). OK claudio@
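Usage sketch (wrapper hypothetical; signatures assumed to match in_proto_cksum_out()): both helpers run on the mbuf just before it reaches the driver.

    int
    ip_send_frame(struct ifnet *ifp, struct mbuf *m, struct sockaddr *dst,
        struct rtentry *rt)     /* hypothetical wrapper */
    {
        in_hdr_cksum_out(m, ifp);       /* IPv4 header checksum */
        in_proto_cksum_out(m, ifp);     /* transport checksum, if flagged */
        return ((*ifp->if_output)(ifp, m, dst, rt));
    }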
2023-05-12  Alexander Bluhm
Access rt_llinfo without checking RTF_LLINFO flag before. They are always set together under the ARP mutex. OK mvs@
2023-05-10  Alexander Bluhm
Implement TCP send offloading, for now in software only. This is meant as a fallback if network hardware does not support TSO. Driver support is still work in progress. TCP output generates large packets. In IP output the packet is chopped to TCP maximum segment size. This reduces the CPU cycles used by pf. The regular output could be assisted by hardware later, but pf route-to and IPsec need the software fallback in general. For performance comparison or to work around possible bugs, sysctl net.inet.tcp.tso=0 disables the feature. netstat -s -p tcp shows TSO counter with chopped and generated packets. based on work from jan@ tested by jmc@ jan@ Hrvoje Popovski OK jan@ claudio@
2023-05-08  Alexander Bluhm
The call to in_proto_cksum_out() is only needed before the packet is passed to ifp->if_output(). The fragment code has its own checksum calculation and the other paths end in goto bad. OK claudio@
2023-05-07  Alexander Bluhm
In preparation for TSO in software, clean up the fragment code. Use if_output_ml() to send mbuf lists to interfaces. This can be used for TSO, fragments, ARP and ND6. Rename variable fml to ml. In pf_route6() split the if else block. Put the safety check (hlen + firstlen < tlen) into ip_fragment(). It makes the code correct in case the packet is too short to be fragmented. This should not happen, but other functions also have this logic. No functional change. OK sashan@
2023-04-25  Alexander Bluhm
Fix white space.
2023-04-25  Alexander Bluhm
Exclusive net lock or mutex arp_mtx protect the llinfo_arp fields.
So kernel lock is only needed for changing the route rt_flags. In arpresolve() protect rt_llinfo lookup and llinfo_arp modification with arp_mtx. Grab kernel lock for rt_flags reject modification only when needed. Tested by Hrvoje Popovski; OK patrick@ kn@
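The locking pattern from the commit, sketched (function name hypothetical, field access illustrative):

    /* rt_llinfo lookup and llinfo_arp changes happen under arp_mtx. */
    void
    arp_llinfo_touch(struct rtentry *rt)    /* hypothetical */
    {
        struct llinfo_arp *la;

        mtx_enter(&arp_mtx);
        la = (struct llinfo_arp *)rt->rt_llinfo;
        if (la != NULL)
            la->la_asked = 0;   /* illustrative field access */
        mtx_leave(&arp_mtx);
    }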