path: root/sys/net
Age | Commit message | Author
2024-07-02 | Read IPsec forwarding information once. | Alexander Bluhm
Fix MP race between reading ip_forwarding in ip_input() and checking ip_forwarding == 2 in ip_output(). In theory ip_forwarding could be 2 during ip_input() and later 0 in ip_output(). Then a packet would be forwarded that was never allowed. Currently exclusive netlock in sysctl(2) prevents all races. Introduce IP_FORWARDING_IPSEC and pass it with the flags parameter that was introduced for IP_FORWARDING. Instead of calling m_tag_find(), traversing the list, and comparing with NULL, just check the PACKET_TAG_IPSEC_IN_DONE bit. Reading ipsec_in_use in ip_output() is a performance hack that is not necessary. New code only checks three bits. OK mvs@
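The pattern described here, reading the sysctl-controlled variable once and carrying the decision in a flags argument, might be sketched in userland C as follows. This is an illustration only, not the kernel code: the flag values, `demo_ip_forwarding`, `demo_input_flags()` and `demo_output()` are hypothetical; only the names IP_FORWARDING and IP_FORWARDING_IPSEC come from the commit message.

```c
#include <assert.h>

/* Hypothetical flag values; the kernel's actual values may differ. */
#define IP_FORWARDING		0x1	/* forwarding was enabled at input time */
#define IP_FORWARDING_IPSEC	0x2	/* only IPsec-processed packets may pass */

/* Stand-in for the sysctl variable: 0 = off, 1 = forward, 2 = IPsec only. */
volatile int demo_ip_forwarding = 0;

/* Input side: snapshot the global exactly once and derive the flags. */
int
demo_input_flags(void)
{
	int fwd = demo_ip_forwarding;	/* the single read */
	int flags = 0;

	if (fwd != 0)
		flags |= IP_FORWARDING;
	if (fwd == 2)
		flags |= IP_FORWARDING_IPSEC;
	return flags;
}

/*
 * Output side: decide from the flags alone and never re-read the global.
 * ipsec_in_done stands in for the PACKET_TAG_IPSEC_IN_DONE bit.
 * Returns 1 if the packet may be forwarded, 0 otherwise.
 */
int
demo_output(int flags, int ipsec_in_done)
{
	assert((flags & ~(IP_FORWARDING | IP_FORWARDING_IPSEC)) == 0);

	if ((flags & IP_FORWARDING) == 0)
		return 0;		/* forwarding was disabled */
	if ((flags & IP_FORWARDING_IPSEC) && !ipsec_in_done)
		return 0;		/* IPsec-only mode, plain packet */
	return 1;
}
```

Because the output side only looks at the flags, a concurrent change of the variable between input and output cannot produce a decision that was never valid for the packet.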
2024-06-26 | return type on a dedicated line when declaring functions | Jonathan Gray
ok mglocker@
2024-06-22 | remove space between function names and argument list | Jonathan Gray
2024-06-21 | My earlier commit [1.1169 of pf.c (2023/01/05)] makes pf(4) to report wrong | Alexandr Nedvedicky
rule and anchor number when a packet matches a rule found at anchor depth 2 or more. The issue was noticed and reported by Giannis Kapetanakis (billias _at_ edu.physics.uoc.gr), who also co-developed and tested the final fix presented in this commit. To fix the issue, pf(4) must also remember the anchor to which the matching rule belongs while rules are traversed to find a match for a given packet. The anchor information is now kept in the anchor stack frame. OK sthen@
2024-06-20 | Read IPv6 forwarding value only once while processing a packet. | Alexander Bluhm
IPv4 uses IP_FORWARDING to pass a consistent value of net.inet.ip.forwarding down the stack. This is needed for unlocking sysctl. Do the same for IPv6. Read ip6_forwarding once in ip6_input_if() and pass down IPV6_FORWARDING as flags to ip6_ours(), ip6_hbhchcheck(), ip6_forward(). Replace the srcrt value with IPV6_REDIRECT flag for consistency with IPv4. To have common syntax with IPv4, use ip6_forwarding == 0 checks instead of !ip6_forwarding. This will also make it easier to implement net.inet6.ip6.forwarding=2 for IPsec only forwarding later. In nd6_ns_input() and nd6_na_input() read ip6_forwarding once and store it in i_am_router. The variable name has been chosen to avoid confusion with is_router, which indicates the router flag of the packet. Reading of ip6_forwarding is done independently from ip6_input_if(); consistency does not really matter. One is for ND router behavior, the other for forwarding. Again use the ip6_forwarding != 0 check, so when the ip6_forwarding IPsec only value 2 gets implemented, it will behave like a router. OK deraadt@ sashan@ florian@ claudio@
2024-06-14 | Switch AF_ROUTE sockets to the new locking scheme. | Vitaliy Makkoveev
At sockets layer only mark buffers as SB_MTXLOCK. At PCB layer only protect `so_rcv' with corresponding `sb_mtx' mutex(9). SS_ISCONNECTED and SS_CANTRCVMORE bits are redundant for AF_ROUTE sockets. Since SS_CANTRCVMORE modifications are performed with both solock() and `sb_mtx' held, the 'unlocked' SS_CANTRCVMORE check in rtm_senddesync() is safe. ok bluhm
2024-06-09 | Introduce IFCAP_VLAN_HWOFFLOAD for vio(4). | Jan Klemkow
Add IFCAP_VLAN_HWOFFLOAD to signal hardware like vio(4) can handle checksum or TSO offloading with inline VLAN tags. tested by Mark Patruck, sf@ and bluhm@ ok sf@ and bluhm@
2024-06-07 | Read IP forwarding variables only once. | Alexander Bluhm
Do not assume that ip_forwarding and ip_directedbcast cannot change while processing one packet. Read it once and pass down its value with a flag. This is necessary for unlocking the sysctl path. There are a few places where a consistent value does not really matter, they are unchanged. Use a proper ip_ prefix for the global variable. OK claudio@
2024-06-07 | remove ph_ppp_proto define, unused since rev 1.123 | Jonathan Gray
2024-05-29 | remove prototypes with no matching function | Jonathan Gray
2024-05-24 | pfsync must let to progress state for destination peer | Alexandr Nedvedicky
The issue was noticed by matthieu@ while he was chasing the cause of excessive pfsync traffic between firewall boxes. When comparing the contents of the state tables on the primary and backup firewalls, the backup firewall showed many states as follows:
    ESTABLISHED:SYN_SENT
    FIN_WAIT_2:SYN_SENT
    * :SYN_SENT
This is caused by pfsync_upd_tcp(), which fails to update the TCP state for the destination connection peer, so it remains stuck in SYN_SENT. matthieu@ confirms the diff helps with the 'stuck-state'. It also seems to help with the excessive pfsync traffic. ok dlg@
2024-05-17 | Switch AF_KEY sockets to the new locking scheme. | Vitaliy Makkoveev
The simplest case. Nothing to change in sockets layer, only set SB_MTXLOCK on socket buffers. ok bluhm
2024-05-17 | Fix uninitialized memory access in pfkeyv2_sysctl(). | Vitaliy Makkoveev
pfkeyv2_sysctl() reads the SA type from uninitialized memory if it is not provided by the caller of sysctl(2) because of a missing length check. From Carsten Beckmann. ok bluhm
2024-05-14 | remove prototypes with no matching function | Jonathan Gray
2024-05-13 | remove prototypes with no matching function | Jonathan Gray
ok mpi@
2024-05-12 | sync_ifp and ticket_pabuf don't exist, remove externs | Jonathan Gray
2024-05-10 | make pf_match_rule() prototype match the function | Jonathan Gray
2024-04-22 | Show pf fragment reassembly counters. | Alexander Bluhm
Fragment count and statistics are stored in struct pf_status. From there pfctl(8) and systat(1) collect and show them. Note that pfctl -s info needs the -v switch to show fragments. As fragment reassembly has its own mutex, also grab this in pf ioctl(2) and sysctl(2) code. input claudio@; OK henning@
2024-04-14 | Run raw IP input in parallel. | Alexander Bluhm
Running raw IPv4 input with shared net lock in parallel is less complex than UDP. Especially there is no socket splicing. New ip_deliver() may run with shared or exclusive net lock. The last parameter indicates the mode. If it is running with shared netlock and encounters a protocol that needs exclusive lock, the packet is queued. Old ip_ours() always queued the packet. Now it calls ip_deliver() with shared net lock, and if that cannot handle the packet completely, the packet is queued and later processed with exclusive net lock. In case of an IPv6 header chain that switches from shared to exclusive processing, the next protocol and mbuf offset are stored in a mbuf tag. OK mvs@
2024-04-13 | correct indentation | Jonathan Gray
no functional change, found by smatch warnings ok miod@ bluhm@
2024-04-12 | Split single TCP inpcb table into IPv4 and IPv6 parts. | Alexander Bluhm
With two separate TCP hash tables, each one becomes smaller. When we remove the exclusive net lock from TCP, contention on internet PCB table mutex will be reduced. UDP has been split earlier into IPv4 and IPv6. Replace branch conditions based on INP_IPV6 with assertions. OK mvs@
2024-04-11 | Prevent changing interface loopback flag from userland. | Alexander Bluhm
IFF_LOOPBACK is telling userland the behaviour of a specific driver, it is supposed to be static and permanent. Clearing the loopback flag on lo0 could lead to a kernel crash due to inconsistent multicast igmp group. Reported-by: syzbot+2f24ed6c8ddb2d6bb22c@syzkaller.appspotmail.com OK claudio@ deraadt@
2024-04-09 | Don't include net/art.h in net/rtable.h instead let the two users | Claudio Jeker
include the file themselves. OK bluhm@ mpi@
2024-03-31 | Combine route_cache() and rtalloc_mpath() in new route_mpath(). | Alexander Bluhm
Fill and check the cache and call rtalloc_mpath() together. Then the caller of route_mpath() does not have to care about the uint32_t *src pointer and just pass struct in_addr. All the conversions are done inside the functions. A previous version of this diff was backed out. There was an additional rtisvalid() in rtalloc_mpath() that prevented packet output via interfaces that were not up. Now the route in the cache has to be valid, but after a new lookup, rtalloc_mpath() may return invalid routes. This generates fewer errors in userland and preserves existing behavior. OK sashan@
2024-03-26 | Avoid NULL pointer dereference in routing table an_match(). | Alexander Bluhm
OK mvs@
2024-03-19 | count if_enqueue/ifq_enqueue errors as oqdrops. | David Gwynne
this helps narrow down where some "output failures" on sec interfaces occur. based on discussion with jason tubnor
2024-03-18 | expose per port information via kstats. | David Gwynne
the most interesting information exposed here is the number of times a port changes state according to the lacp state machine. if a port is flapping, it's hard to see if you only look at the current state. getting a count of changes over time makes problems a lot more visible and therefore fixable. this also exposes counters around how the lacp protocol packets. all of these can be useful when trying to line up behaviors with another system (eg, a switch). ok jmatthew@
2024-03-18 | use high bits from the mbuf flowid to pick a port to transmit on. | David Gwynne
a port here is a physical interface used by an aggr. this leaves the low bits for a physical interface to use to pick a tx ring. without this, aggr used low bits for port selection, which takes bits away from the ring selection, which can lead to uneven distribution of packets over tx rings. ive been running this in production for well over a year now.
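The bit split described above could look roughly like the following sketch. The 8-bit boundary, `DEMO_FLOWID_RING_BITS`, `demo_pick_port()` and `demo_pick_ring()` are assumptions made for illustration; the commit message does not say how many bits aggr(4) actually reserves.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical split: the low 8 bits are left for tx ring selection. */
#define DEMO_FLOWID_RING_BITS	8

/* Pick an aggr port (a physical interface) from the high flowid bits. */
unsigned int
demo_pick_port(uint32_t flowid, unsigned int nports)
{
	assert(nports > 0);
	return (flowid >> DEMO_FLOWID_RING_BITS) % nports;
}

/* The physical interface can still use the untouched low bits for rings. */
unsigned int
demo_pick_ring(uint32_t flowid, unsigned int nrings)
{
	assert(nrings > 0);
	return (flowid & ((1U << DEMO_FLOWID_RING_BITS) - 1)) % nrings;
}
```

With this split, two flows that differ only in their high bits land on different ports but keep the same ring index, so port selection no longer eats into the bits used to spread packets over tx rings.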
2024-03-05 | Convert `t_lock', `r_keypair_lock' and `c_lock' rwlock(9)s to | Vitaliy Makkoveev
corresponding mutex(9)es. ifq_start() and following wg_qstart() could be called from software interrupt context if bandwidth control is enabled in pf.conf(5). Remove sleep points provided by rwlock(9)s from wg(4) output start routine. looks ok claudio
2024-03-04 | white space fixes. no functional change | David Gwynne
2024-02-29 | revert "Combine route_cache() and rtalloc_mpath() in new route_mpath()" | Christian Weisgerber
It breaks NFS. ok claudio@
2024-02-28 | Enable IPv6 AF for ppp(4) | Denis Fondras
OK claudio@
2024-02-27 | Combine route_cache() and rtalloc_mpath() in new route_mpath(). | Alexander Bluhm
Fill and check the cache and call rtalloc_mpath() together. Then the caller of route_mpath() does not have to care about the uint32_t *src pointer and just pass struct in_addr. All the conversions are done inside the functions. ro->ro_rt is either valid or NULL. Note that some places have a stricter rtisvalid() now compared to the previous NULL check. OK claudio@
2024-02-22 | Make the route cache aware of multipath routing. | Alexander Bluhm
Pass source address to route_cache() and store it in struct route. Cached multipath routes are only valid if source address matches. If sysctl multipath changes, increase route generation number. OK claudio@
2024-02-14 | Check IP length in ether_extract_headers(). | Alexander Bluhm
For LRO with ix(4) it is necessary to detect ethernet padding. Extract ip_len and ip6_plen from the mbuf and provide it to the drivers. Add extended sanity checks, such as an IP packet that is shorter than the TCP header. This prevents offloading to network hardware with bogus packets. Also iphlen of extracted headers contains header length for IPv4 and IPv6, to make code in drivers simpler. OK mglocker@
2024-02-13 | Analyse header layout in ether_extract_headers(). | Alexander Bluhm
Several drivers need IPv4 header length and TCP offset for checksum offload, TSO and LRO. Accessing these fields directly caused crashes on sparc64 due to misaligned access. It cannot be guaranteed that the IP and TCP headers are 4 byte aligned at driver level. Also gcc 4.2.1 assumes that bit fields can be accessed with 32 bit load instructions. Use memcpy() in ether_extract_headers() to get the bits from the IPv4 and TCP headers and store the header length in struct ether_extracted. From there network drivers can easily use it without caring about alignment and bit shift. Do some sanity checks with the length values to prevent invalid values from evil packets from being stored in hardware registers. If a check fails, clear the pointer to the header to hide it from the driver. Add debug prints that help to figure out the reason for bad packets and provide information when debugging drivers. OK mglocker@
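The memcpy()-based extraction with sanity checks can be illustrated with a small userland sketch. This is not the kernel's ether_extract_headers(): `struct demo_extracted` and `demo_extract_ipv4()` are hypothetical names, and only the IPv4 header length and total length are handled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <arpa/inet.h>	/* ntohs() */

/* Hypothetical, reduced counterpart of struct ether_extracted. */
struct demo_extracted {
	size_t		iphlen;	/* IPv4 header length in bytes */
	uint16_t	iplen;	/* total length from the IPv4 header */
};

/*
 * Copy the fields out of a possibly unaligned buffer with memcpy()
 * instead of dereferencing a struct pointer or using bit fields.
 * Returns 0 on success, -1 (with *ext cleared) for bogus packets.
 */
int
demo_extract_ipv4(const unsigned char *buf, size_t len,
    struct demo_extracted *ext)
{
	uint8_t vhl;
	uint16_t iplen_be;

	assert(buf != NULL && ext != NULL);
	memset(ext, 0, sizeof(*ext));

	if (len < 20)
		return -1;	/* shorter than a minimal IPv4 header */

	memcpy(&vhl, buf, sizeof(vhl));		/* version and IHL nibbles */
	memcpy(&iplen_be, buf + 2, sizeof(iplen_be));	/* network order */

	if ((vhl >> 4) != 4)
		return -1;	/* not IPv4 */

	ext->iphlen = (size_t)(vhl & 0x0f) << 2;	/* IHL is in 32-bit words */
	ext->iplen = ntohs(iplen_be);

	/* Hide implausible lengths from the caller. */
	if (ext->iphlen < 20 || ext->iphlen > len ||
	    ext->iplen < ext->iphlen) {
		memset(ext, 0, sizeof(*ext));
		return -1;
	}
	return 0;
}
```

The caller only ever sees validated values, which mirrors the idea of keeping invalid lengths away from hardware registers.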
2024-02-13 | Merge struct route and struct route_in6. | Alexander Bluhm
Use a common struct route for both inet and inet6. Unfortunately struct sockaddr is shorter than sockaddr_in6, so netinet/in.h has to be exposed from net/route.h. Struct route has to be bsd visible for userland as netstat kvm code inspects inp_route. Internet PCB and TCP SYN cache can use a plain struct route now. All specific sockaddr types for inet and inet6 are embedded there. OK claudio@
2024-02-09 | Route cache function returns hit or miss. | Alexander Bluhm
The route_cache() function can easily return whether it was a cache hit or miss. Then the logic to perform a route lookup gets a bit simpler. Some more complicated if (ro->ro_rt == NULL) checks still exist elsewhere. Also use route cache in in_pcbselsrc() instead of filling struct route manually. OK claudio@
2024-02-07 | Add missing #ifdef INET6 to fix ramdisk build. | Alexander Bluhm
2024-02-07 | Use the route generation number also for IPv6. | Alexander Bluhm
Implement route6_cache() to check whether the cached route is still valid and otherwise fill caching parameter of struct route_in6. Also count cache hits and misses in netstat. in_pcbrtentry() uses route cache now. OK claudio@
2024-02-06 | Invert broken check of panic string in if_linkstate(). | Alexander Bluhm
original bug report from syzkaller
Reported-by: syzbot+d19060a65721eb432a72@syzkaller.appspotmail.com
broken fix found by Hrvoje Popovski
hint to the problem and OK deraadt@
2024-02-05 | Add netstat counter for route cache. | Alexander Bluhm
To optimize route caching, count cache hits and misses. This is shown in netstat -s for both inet and inet6. Reuse the old IPv6 forward cache counter. Sort ip6s_wrongif consistently. For now only IPv4 cache counter has been implemented. OK mvs@
2024-02-05 | Don't send route messages while rebooting after panic. Syzkaller exposed | Vitaliy Makkoveev
[1] that if_downall() tries to send route messages and triggers panic again but in knote(9) layer.
1. https://syzkaller.appspot.com/bug?extid=d19060a65721eb432a72
ok bluhm
2024-02-05 | Move route_cache() declaration from net/route.h to netinet/in.h. | Kenji Aoyama
This prevents gcc3's 'parameter has incomplete type' warning that causes kernel build failure. Suggested by claudio@, ok bluhm@
2024-01-31 | Add route generation number to route cache. | Alexander Bluhm
The outgoing route is cached at the inpcb. This cache was only invalidated when the socket closes or if the route gets invalid. More specific routes were not detected. Especially with dynamic routing protocols, sockets must be closed and reopened to use the correct route. Running ping during a route change shows the problem. To solve this, add a route generation number that is updated whenever the routing table changes. The lookup in struct route is put into the route_cache() function. If the generation number is too old, the cached route gets discarded. Implement route_cache() for ip_output() and ip_forward() first. IPv6 and more places will follow. OK claudio@
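A generation-number cache of this kind can be sketched as below. Everything here is a simplified assumption for illustration: `struct demo_route`, `demo_route_generation` and `demo_route_cache()` are hypothetical stand-ins, the destination is reduced to a 32-bit value, and the hit/miss return convention is chosen for the example rather than taken from the kernel.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, heavily simplified stand-ins for the kernel structures. */
struct demo_rtentry {
	uint32_t	dst;
};

struct demo_route {
	struct demo_rtentry	*ro_rt;		/* cached route or NULL */
	unsigned long		 ro_generation;	/* generation when cached */
	uint32_t		 ro_dst;	/* destination of the cache */
};

/* Incremented whenever the routing table changes. */
unsigned long demo_route_generation = 1;

/*
 * Returns 1 on a cache hit, 0 on a miss. After a miss the cache is
 * reset for dst and the caller is expected to do a fresh lookup and
 * store the result in ro->ro_rt.
 */
int
demo_route_cache(struct demo_route *ro, uint32_t dst)
{
	assert(ro != NULL);

	if (ro->ro_rt != NULL &&
	    ro->ro_generation == demo_route_generation &&
	    ro->ro_dst == dst)
		return 1;

	/* Stale generation or other destination: discard the old route. */
	ro->ro_rt = NULL;
	ro->ro_generation = demo_route_generation;
	ro->ro_dst = dst;
	return 0;
}
```

A long-lived socket then picks up a more specific route on its next packet after the table changes, without having to be closed and reopened.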
2024-01-26 | Put checksum flags in bpf_hdr to use them in userland dhcpleased. | Jan Klemkow
Thus, dhcpleased accepts non-calculated checksums which were verified by hardware/hypervisor. With tweaks from dlg@ ok bluhm@ mkay tobhe@
2024-01-24 | tag packets going out a sec interface to prevent route/encap loops. | David Gwynne
sec(4) was already looking for this mbuf tag so it could drop packets that had already been sent out on the same interface, but i forgot the code that adds the tag. this was reported by jason tubnor who experienced spins/lockups when using sec and a physical interface was disconnected. rather than being a locking problem like we initially assumed, it turned out that unplugging a physical interface caused a route for ipsec encapsulated traffic to go out over sec(4), causing the packet to loop in the stack. the fix was also tested and verified by jason. sorry for taking so long to look at it.
2024-01-23 | Introduce pipex_iterator(), the special thing to perform | Vitaliy Makkoveev
`pipex_session_list' foreach walkthrough with `pipex_list_mtx' mutex(9) relocking. It inserts a special item after the acquired `session' and keeps it linked until `session' release. Only the owner can unlink its own item, so the LIST_NEXT(session) is always valid even if the `session' was unlinked. The iterator skips special items at the `session' acquisition time, as do all other foreach loops where `pipex_list_mtx' mutex(9) is not relocked. ok yasuoka
2024-01-23 | Remove `pipex_rd_head6' and `ps6_rn[2]'. They are not used. | Vitaliy Makkoveev
ok yasuoka
2024-01-18 | Use `nowake' as tsleep_nsec(9) ident. It has no corresponding wakeup(9). | Vitaliy Makkoveev
ok bluhm