path: root/sys/netinet/raw_ip.c
Age | Commit message | Author
2024-11-08 | Use PCB iterator for raw IPv6 input loop. | Alexander Bluhm
Implement inpcb iterator in rip6_input(). Factor out the real work to rip6_sbappend(). Now UDP broadcast and multicast, raw IPv4 and IPv6 input work similarly. While there, make rip_input() look more like rip6_input(). OK mvs@
2024-11-05 | Use PCB iterator for raw IP input deliver loop. | Alexander Bluhm
Inspired by mvs@'s idea of the iterator in the UDP multicast loop, implement the same for raw IP input delivery. This removes an unnecessary rwlock and only uses the table mutex. When comparing the inp routing table, address and port, the table lock must be held. So assume that in_pcb_iterator() already has the table mutex and hold it while traversing the list and doing the checks. Release the mutex during mbuf copy, socket buffer append and the upcalls. Adapt the logic for both rip_input() and udp_input(). In rip_input() move the actual work to rip_sbappend(). This can be called without mutex during list traversal and for the final element. OK mvs@
2024-07-12 | Remove internet PCB mutex. | Alexander Bluhm
All inpcb locking has been converted to socket receive buffer mutex. Per PCB mutex inp_mtx is not needed anymore. Also delete PRU related locking functions. A flag PR_MPSOCKET indicates whether protocol functions support parallel access with per socket rw-lock. TCP is the only protocol that is not MP capable from the socket layer and needs exclusive netlock. OK mvs@
2024-04-17 | Use struct ipsec_level within inpcb. | Alexander Bluhm
Instead of passing around u_char[4], introduce struct ipsec_level that contains 4 ipsec levels. This provides better type safety. The embedding struct inpcb is globally visible for netstat(1), so put struct ipsec_level outside of #ifdef _KERNEL. OK deraadt@ mvs@
2024-04-12 | Fix race between rip_input() and soisdisconnected(). | Alexander Bluhm
Setting SS_CANTRCVMORE is protected by mutex of receive socket buffer. The raw inpcb loop in rip_input() does a lockless access. Protect it with READ_ONCE(), although it is not perfect. Check the socket buffer state again when the mutex is held. Drop and count the packet that is processed between the checks. Currently soisdisconnected() is called with exclusive net lock. The new code also works without net lock. OK mvs@
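The check, re-check, and drop sequence described in this commit can be sketched in userland C. This is only an illustration under stated assumptions: `sockbuf_sketch`, `deliver_sketch()`, `drops_sketch` and the flag value are hypothetical names, a pthread mutex stands in for the receive buffer mutex, and a volatile read stands in for READ_ONCE(). It is not the code from raw_ip.c.

```c
#include <pthread.h>

#define SB_CANTRCVMORE	0x01	/* hypothetical flag value for this sketch */

struct sockbuf_sketch {
	pthread_mutex_t	mtx;	/* stands in for the receive buffer mutex */
	int		state;	/* written only with mtx held */
	int		queued;	/* packets appended, protected by mtx */
};

/* Stands in for a drop counter; only updated with a buffer mutex held. */
unsigned long drops_sketch;

/* Returns 1 if the packet was appended, 0 if it was not delivered. */
int
deliver_sketch(struct sockbuf_sketch *sb)
{
	/* Lockless pre-check: a single volatile read, like READ_ONCE(). */
	if (*(volatile const int *)&sb->state & SB_CANTRCVMORE)
		return 0;

	/* ... expensive work such as copying the packet would go here ... */

	pthread_mutex_lock(&sb->mtx);
	if (sb->state & SB_CANTRCVMORE) {
		/* The state changed between the two checks: drop and count. */
		drops_sketch++;
		pthread_mutex_unlock(&sb->mtx);
		return 0;
	}
	sb->queued++;
	pthread_mutex_unlock(&sb->mtx);
	return 1;
}
```

The first check only avoids wasted work; correctness rests on the second check, which runs with the mutex held.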
2024-03-05 | Validate IPv4 packet options in divert output. | Alexander Bluhm
When sending raw packets over divert socket, IP options were not validated. Fragment code tries to copy them and crashes. Raw IP output has a similar feature, but uses rip_chkhdr() to prevent invalid packets from userland. Call this function also from divert_output() for strict user input validation. Reported-by: syzbot+b1ba3a2a8ef13e5b4698@syzkaller.appspotmail.com OK dlg@ deraadt@ mvs@
2024-02-11 | Use `sb_mtx' instead of `inp_mtx' in receive path for inet sockets. | Vitaliy Makkoveev
In soreceive(), we only touch the `so_rcv' socket buffer, which has its own `sb_mtx' mutex(9) for protection. So, we can avoid solock() in this path - it's enough to hold `sb_mtx' in soreceive() and around the corresponding sbappend*(). But not right now :) This time we use shared netlock for some inet sockets in the soreceive() path. To protect the `so_rcv' buffer we use the `inp_mtx' mutex(9) and pru_lock() to acquire this mutex(9) in the socket layer. But the `inp_mtx' mutex belongs to the PCB. We initialize the socket before the PCB, and tcp(4) sockets could exist without a PCB, so use the `sb_mtx' mutex(9) to protect sockbuf stuff. This diff mechanically replaces `inp_mtx' by `sb_mtx' in the receive path. Only for sockets which already use `inp_mtx'. All other sockets are left as is. They will be converted later. Since the `sb_mtx' is optional, the new SB_MTXLOCK flag is introduced. If this flag is set on `sb_flags', the `sb_mtx' mutex(9) should be taken. New sb_mtx_lock() and sb_mtx_unlock() were introduced to hide this check. They are temporary and will be replaced by mtx_enter() when all this area is converted to `sb_mtx' mutex(9). Also, the new sbmtxassertlocked() function is introduced to throw the corresponding assertion for SB_MTXLOCK marked buffers. This time only sbappendaddr() calls it. This function is also temporary and will be replaced by MTX_ASSERT_LOCKED() later. ok bluhm
2024-02-03 | Rework socket buffers locking for shared netlock. | Vitaliy Makkoveev
Shared netlock is not sufficient to call so{r,w}wakeup(). The following sowakeup() modifies `sb_flags' and knote(9) stuff. Unfortunately, we can't call so{r,w}wakeup() with `inp_mtx' mutex(9) because sowakeup() also calls pgsigio() which grabs kernel lock. However, `so*_filtops' callbacks only perform read-only access to the socket stuff, so it is enough to hold shared netlock only, but the klist stuff needs to be protected. This diff introduces `sb_mtx' mutex(9) to protect sockbuf. This time `sb_mtx' is used to protect only `sb_flags' and `sb_klist'. Now we have soassertlocked_readonly() and soassertlocked(). The first one is happy if only shared netlock is held, while the second wants `so_lock' or pru_lock() to be held together with shared netlock. To keep soassertlocked*() assertions soft, we need to know the mutex(9) state, so a new mtx_owned() macro was introduced. Also, the new optional (*pru_locked)() handler brings the state of pru_lock(). Tests and ok from bluhm.
2024-01-21 | Assert that inpcb table has correct address family. | Alexander Bluhm
Since inpcb tables for UDP and Raw IP have been split into IPv4 and IPv6, assert that INP_IPV6 flag is correct instead of checking it. While there, give the table variable a nicer name. OK sashan@ mvs@
2023-12-15 | Use inpcb table mutex to set addresses. | Alexander Bluhm
Protect all remaining write access to inp_faddr and inp_laddr with inpcb table mutex. Document inpcb locking for foreign and local address and port and routing table id. Reading will be made MP safe by adding per socket rw-locks in a next step. OK sashan@ mvs@
2023-11-26 | Remove inp parameter from ip_output(). | Alexander Bluhm
ip_output() received inp as parameter. This is only used to lookup the IPsec level of the socket. Reasoning about MP locking is much easier if only relevant data is passed around. Convert ip_output() to receive constant inp_seclevel as argument and mark it as protected by net lock. OK mvs@
2023-01-22 | Move SS_CANTRCVMORE and SS_RCVATMARK bits from `so_state' to `sb_state' of | Vitaliy Makkoveev
receive buffer. As was done for the SS_CANTSENDMORE bit, the definition is kept as is, but now these bits belong to the `sb_state' of the receive buffer. `sb_state' is ORed with `so_state' when socket data is exported to userland. ok bluhm@
2022-10-17 | Change pru_abort() return type to the type of void and make pru_abort() | Vitaliy Makkoveev
optional. We have no interest in the pru_abort() return value. We call it only from soabort(), which is a dummy pru_abort() wrapper and has no return value. Only the connection oriented sockets need to implement the (*pru_abort)() handler. Such sockets are tcp(4) and unix(4) sockets, so remove the existing code for all others; it is never called. ok guenther@
2022-10-03 | System calls should not fail due to temporary memory shortage in | Alexander Bluhm
malloc(9) or pool_get(9). Pass down a wait flag to pru_attach(). During syscall socket(2) it is ok to wait, this logic was missing for internet pcb. Pfkey and route sockets were already waiting. sonewconn() must not wait when called during TCP 3-way handshake. This logic has been preserved. Unix domain stream socket connect(2) can wait until the other side has created the socket to accept. OK mvs@
2022-09-13 | Do soreceive() with shared netlock for raw sockets. | Vitaliy Makkoveev
ok bluhm@
2022-09-03 | Move PRU_PEERADDR request to (*pru_peeraddr)(). | Vitaliy Makkoveev
Introduce in{,6}_peeraddr() and use them for inet and inet6 sockets, except tcp(4) case. Also remove *_usrreq() handlers. ok bluhm@
2022-09-03 | Move PRU_SOCKADDR request to (*pru_sockaddr)() | Vitaliy Makkoveev
Introduce in{,6}_sockaddr() functions, and use them for all except tcp(4) inet sockets. For tcp(4) sockets use tcp_sockaddr() to keep debug ability. The key management and route domain sockets return EINVAL error for PRU_SOCKADDR request, so keep this behaviour for a while instead of making the pru_sockaddr handler optional and returning EOPNOTSUPP. ok bluhm@
2022-09-02 | Move PRU_CONTROL request to (*pru_control)(). | Vitaliy Makkoveev
The 'proc *' arg is not used for PRU_CONTROL request, so remove it from pru_control() wrapper. Split out {tcp,udp}6_usrreqs from {tcp,udp}_usrreqs and use them for inet6 case. ok guenther@ bluhm@
2022-09-01 | Move PRU_CONNECT2 request to (*pru_connect2)(). | Vitaliy Makkoveev
ok bluhm@
2022-08-31 | Move PRU_SENDOOB request to (*pru_sendoob)(). | Vitaliy Makkoveev
PRU_SENDOOB request always consumes passed `top' and `control' mbufs. To avoid dummy m_freem(9) handlers for all protocols release passed mbufs in the pru_sendoob() EOPNOTSUPP error path. Also fix `control' mbuf(9) leak in the tcp(4) PRU_SENDOOB error path. ok bluhm@
2022-08-29 | Move PRU_RCVOOB request to (*pru_rcvoob)(). | Vitaliy Makkoveev
ok bluhm@
2022-08-28 | Move PRU_SENSE request to (*pru_sense)(). | Vitaliy Makkoveev
ok bluhm@
2022-08-28 | Move PRU_ABORT request to (*pru_abort)(). | Vitaliy Makkoveev
We abort only the sockets which are linked to `so_q' or `so_q0' queues of listening socket. Such sockets have no corresponding file descriptor and are not accessed from userland, so PRU_ABORT is used to destroy them on listening socket destruction. Currently all our sockets support the PRU_ABORT request, but actually it is required only for tcp(4) and unix(4) sockets, so it should be optional. However, they will be removed with a separate diff, and this time PRU_ABORT requests were converted as is. Also, the socket should be destroyed on PRU_ABORT request, but route and key management sockets leave it alive. This was also converted as is, because this wrong code is never called. ok bluhm@
2022-08-27 | Move PRU_SEND request to (*pru_send)(). | Vitaliy Makkoveev
The former PRU_SEND error path of gre_usrreq() had `control' mbuf(9) leak. It was fixed in new gre_send(). The former pfkeyv2_send() was renamed to pfkeyv2_dosend(). ok bluhm@
2022-08-26 | Move PRU_RCVD request to (*pru_rcvd)(). | Vitaliy Makkoveev
ok bluhm@
2022-08-22 | Move PRU_SHUTDOWN request to (*pru_shutdown)(). | Vitaliy Makkoveev
ok bluhm@
2022-08-22 | Move PRU_DISCONNECT request to (*pru_disconnect)(). | Vitaliy Makkoveev
ok bluhm@
2022-08-22 | Use rwlock per inpcb table to protect notify list. The notify | Alexander Bluhm
function may sleep, so holding a mutex is not possible. The same list entry and rwlock is used for UDP multicast and raw IP delivery. By adding a write lock, exclusive netlock is no longer necessary for PCB notify and UDP and raw IP input. OK mvs@
2022-08-22 | Move PRU_ACCEPT request to (*pru_accept)(). | Vitaliy Makkoveev
ok bluhm@
2022-08-21 | Move PRU_CONNECT request to (*pru_connect)() handler. | Vitaliy Makkoveev
ok bluhm@
2022-08-21 | Move PRU_LISTEN request to (*pru_listen)() handler. | Vitaliy Makkoveev
ok bluhm@
2022-08-20 | Move PRU_BIND request to (*pru_bind)() handler. | Vitaliy Makkoveev
For the protocols which don't support request, leave handler NULL. Do the NULL check within corresponding pru_() wrapper and return EOPNOTSUPP in such case. This will be done for all upcoming user request handlers. ok bluhm@ guenther@
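The NULL check in the wrapper might look roughly like the following. This is a simplified sketch, not the kernel code: the structure name `pr_usrreqs_sketch`, the two-argument handler signature, and the helper names are assumptions made for illustration.

```c
#include <errno.h>
#include <stddef.h>

struct socket;	/* opaque in this sketch */
struct mbuf;	/* opaque in this sketch */

/* Hypothetical, reduced table of per-request handlers. */
struct pr_usrreqs_sketch {
	int	(*pru_bind)(struct socket *, struct mbuf *);
};

/* Example handler for a protocol that does support bind. */
int
rip_bind_sketch(struct socket *so, struct mbuf *nam)
{
	(void)so;
	(void)nam;
	return 0;
}

/* Wrapper: a NULL handler means the protocol lacks the request. */
int
pru_bind_sketch(const struct pr_usrreqs_sketch *ur, struct socket *so,
    struct mbuf *nam)
{
	if (ur->pru_bind == NULL)
		return EOPNOTSUPP;
	return (*ur->pru_bind)(so, nam);
}
```

Keeping the check in the wrapper means a protocol without the request needs no stub function, only a NULL slot in its table.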
2022-08-15 | Introduce 'pr_usrreqs' structure and move existing user-protocol | Vitaliy Makkoveev
handlers into it. We want to split existing (*pr_usrreq)() to multiple short handlers for each PRU_ request as it was already done for PRU_ATTACH and PRU_DETACH. This is the preparation step, (*pr_usrreq)() split will be done with the following diffs. Based on reverted diff from guenther@. ok bluhm@
2022-08-06 | Clean up the netlock macros. Merge NET_RLOCK_IN_SOFTNET and | Alexander Bluhm
NET_RLOCK_IN_IOCTL, which have the same implementation. The R and W are hard to see, call the new macro NET_LOCK_SHARED. Rename the opposite assertion from NET_ASSERT_WLOCKED to NET_ASSERT_LOCKED_EXCLUSIVE. Update some outdated comments about net locking. OK mpi@ mvs@
2022-05-15 | have in_pcbselsrc copy the selected address to memory provided by the caller. | David Gwynne
having it return a pointer to something that has a lifetime managed by a lock without accounting for it or taking a reference count or anything like that is asking for trouble. copying the address to caller provided memory while still inside the lock is a lot safer. discussed with visa@ ok bluhm@ claudio@
2022-03-23 | Move global variable ripsrc onto stack, it is only used once within | Alexander Bluhm
rip_input(). from dhill@
2022-03-22 | For raw IP packets rip_input() traverses the loop of all PCBs. From | Alexander Bluhm
there it calls sbappendaddr() while holding the raw table mutex. This ends in sorwakeup() where we finally grab the kernel lock while holding a mutex. Witness detects this misuse. Use the same solution as for PCB notify. Collect the affected PCBs in a temporary list. The list is protected by exclusive net lock. syzbot+ebe3f03a472fecf5e42e@syzkaller.appspotmail.com OK claudio@
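The "collect under the mutex, deliver afterwards" approach can be illustrated with a small userland sketch. All names here (`pcb_sketch`, `table_sketch`, `input_sketch()`, `wakeup_sketch()`) are hypothetical, and a pthread mutex stands in for the table mutex. The sketch assumes callers of `input_sketch()` are serialized by some outer lock, which is the role the exclusive net lock plays for the temporary list in the commit.

```c
#include <pthread.h>
#include <stddef.h>

struct pcb_sketch {
	struct pcb_sketch	*next;		/* table list, protected by table mutex */
	struct pcb_sketch	*next_tmp;	/* temporary delivery list */
	int			 proto;
	int			 delivered;
};

struct table_sketch {
	pthread_mutex_t		 mtx;
	struct pcb_sketch	*head;
};

/* Stands in for the append plus wakeup path that may take other locks. */
void
wakeup_sketch(struct pcb_sketch *p)
{
	p->delivered++;
}

/* Returns the number of PCBs the packet was delivered to. */
int
input_sketch(struct table_sketch *t, int proto)
{
	struct pcb_sketch *p, *tmp = NULL;
	int n = 0;

	/* Phase 1: only match and link while the table mutex is held. */
	pthread_mutex_lock(&t->mtx);
	for (p = t->head; p != NULL; p = p->next) {
		if (p->proto != proto)
			continue;
		p->next_tmp = tmp;
		tmp = p;
	}
	pthread_mutex_unlock(&t->mtx);

	/* Phase 2: deliver without the mutex, so wakeups may take locks. */
	for (p = tmp; p != NULL; p = p->next_tmp) {
		wakeup_sketch(p);
		n++;
	}
	return n;
}
```

The cost of this design is the extra link field per PCB and the need for an outer lock that keeps PCBs alive between the two phases.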
2022-03-21 | Header netinet/in_pcb.h includes sys/mutex.h now. Recommit mutex | Alexander Bluhm
for PCB tables. It does not break userland build anymore. pf_socket_lookup() calls in_pcbhashlookup() in the PCB layer. To run pf in parallel, make parts of the stack MP safe. Protect the list and hashes in the PCB tables with a mutex. Note that the protocol notify functions may call pf via tcp_output(). As the pf lock is a sleeping rw_lock, we must not hold a mutex. To solve this for now, collect these PCBs in inp_notify list and protect it with exclusive netlock. OK sashan@
2022-03-21 | call in_pcbselsrc from rip_output so route sourceaddr can take effect. | David Gwynne
previously things that used sendto or similar with raw sockets would ignore any configured sourceaddr. this made it inconsistent with other traffic, which in turn makes things confusing to debug if you're using ping or traceroute (which use raw sockets) to figure out what's happening to other packets. the ipv6 equiv already does this too. ok sthen@ claudio@
2022-03-14 | Unbreak the tree, revert commitid aZ8fm4iaUnTCc0ul | Theo Buehler
This reverts the commit protecting the list and hashes in the PCB tables with a mutex since the build of sysctl(8) breaks, as found by kettenis. ok sthen
2022-03-14 | pf_socket_lookup() calls in_pcbhashlookup() in the PCB layer. To | Alexander Bluhm
run pf in parallel, make parts of the stack MP safe. Protect the list and hashes in the PCB tables with a mutex. Note that the protocol notify functions may call pf via tcp_output(). As the pf lock is a sleeping rw_lock, we must not hold a mutex. To solve this for now, collect these PCBs in inp_notify list and protect it with exclusive netlock. OK sashan@
2022-02-25 | Reported-by: syzbot+1b5b209ce506db4d411d@syzkaller.appspotmail.com | Philip Guenther
Revert the pr_usrreqs move: syzkaller found a NULL pointer deref and I won't be available to monitor for followup issues for a bit
2022-02-25 | Move pr_attach and pr_detach to a new structure pr_usrreqs that can | Philip Guenther
then be shared among protosw structures, following the same basic direction as NetBSD and FreeBSD for this. Split PRU_CONTROL out of pr_usrreq into pru_control, giving it the proper prototype to eliminate the previously necessary casts. ok mvs@ bluhm@
2019-02-04 | Avoid an mbuf double free in the oob soreceive() path. In the | Alexander Bluhm
usrreq functions move the mbuf m_freem() logic to the release block instead of distributing it over the switch statement. Then the goto release in the initial check, whether the pcb still exists, will not free the mbuf for the PRU_RCVD, PRU_RCVOOB, PRU_SENSE command. OK claudio@ mpi@ visa@ Reported-by: syzbot+8e7997d4036ae523c79c@syzkaller.appspotmail.com
2019-01-08 | Botched up an if conditional in the last commit. The IP length needs to | Claudio Jeker
be bigger than the IP header len to be valid. With this I can traceroute again.
2019-01-07 | Validate the version, and all length fields of IP packets passed to a raw socket | Claudio Jeker
with INP_HDRINCL. There is no reason to allow badly constructed packets through our network stack. Especially since they may trigger diagnostic checks further down the stack. Now EINVAL is returned instead which was already used for some checks that happened before. OK florian@ Reported-by: syzbot+0361ed02deed123667cb@syzkaller.appspotmail.com
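A header check of this kind can be sketched over a raw byte buffer. This is an illustration only, not the kernel routine: `chk_raw_hdr_sketch()` is a hypothetical name, the exact set of checks is an assumption based on the description above, and the sketch rejects a total length smaller than the header length.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define IPV4_MIN_HDRLEN	20	/* IPv4 header without options, in bytes */

/*
 * Validate version and length fields of a user-supplied IPv4 packet.
 * Returns 0 if the header looks sane, EINVAL otherwise.
 */
int
chk_raw_hdr_sketch(const uint8_t *pkt, size_t len)
{
	size_t hlen, totlen;

	if (len < IPV4_MIN_HDRLEN)
		return EINVAL;

	/* Version lives in the high nibble of the first byte. */
	if ((pkt[0] >> 4) != 4)
		return EINVAL;

	/* Header length is the low nibble, counted in 32-bit words. */
	hlen = (size_t)(pkt[0] & 0x0f) << 2;
	if (hlen < IPV4_MIN_HDRLEN || hlen > len)
		return EINVAL;

	/* Total length is a 16-bit big-endian field at offset 2. */
	totlen = ((size_t)pkt[2] << 8) | pkt[3];
	if (totlen < hlen || totlen != len)
		return EINVAL;

	return 0;
}
```

Rejecting such packets at the socket boundary keeps malformed lengths away from code further down the stack that assumes a consistent header.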
2018-12-03 | In PRU_DISCONNECT don't fall through into PRU_ABORT since the latter frees | Claudio Jeker
the inpcb apart from the disconnect. Just call soisdisconnected() and clear the inp->inp_faddr since the socket is still valid after a disconnect. Problem found by syzkaller via Greg Steuck OK visa@ Fixes: Reported-by: syzbot+2cd350dfe5c96f6469f2@syzkaller.appspotmail.com Reported-by: syzbot+139ac2d7d3d60162334b@syzkaller.appspotmail.com Reported-by: syzbot+02168317bd0156c13b69@syzkaller.appspotmail.com Reported-by: syzbot+de8d2459ecf4cdc576a1@syzkaller.appspotmail.com
2018-11-10 | Do not translate the EACCES error from pf(4) to EHOSTUNREACH anymore. | Alexander Bluhm
It also translated a documented send(2) EACCES case erroneously. This was too much magic and always prone to errors. from Jan Klemkow; man page jmc@; OK claudio@
2018-10-04 | Revert the inpcb table mutex commit. It triggers a witness panic | Alexander Bluhm
in raw IP delivery and UDP broadcast loops. There inpcbtable_mtx is held and sorwakeup() is called within the loop. As sowakeup() grabs the kernel lock, we have a lock ordering problem. found by Hrvoje Popovski; OK deraadt@ mpi@
2018-09-20 | As a step towards per inpcb or socket locks, remove the net lock | Alexander Bluhm
for netstat -a. Introduce a global mutex that protects the tables and hashes for the internet PCBs. To detect detached PCB, set its inp_socket field to NULL. This has to be protected by a per PCB mutex. The protocol pointer has to be protected by the mutex as netstat uses it. Always take the kernel lock in in_pcbnotifyall() and in6_pcbnotify() before the table mutex to avoid lock ordering problems in the notify functions. OK visa@