path: root/sys/kern/uipc_socket2.c
Age  Commit message  Author
2022-10-03  System calls should not fail due to temporary memory shortage in  (Alexander Bluhm)
malloc(9) or pool_get(9). Pass down a wait flag to pru_attach(). During the socket(2) syscall it is OK to wait; this logic was missing for the internet PCB. PF_KEY and route sockets were already waiting. sonewconn() must not wait when called during the TCP 3-way handshake; this logic has been preserved. A Unix domain stream socket connect(2) can wait until the other side has created the socket to accept. OK mvs@
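A hedged sketch of the idea, not the committed diff (the handler name and inpcb_pool are illustrative):

    int
    example_pru_attach(struct socket *so, int proto, int wait)
    {
            struct inpcb *inp;

            /*
             * Sleep for memory only when the caller allows it:
             * socket(2) passes "may wait", sonewconn() must not wait.
             */
            inp = pool_get(&inpcb_pool, PR_ZERO |
                (wait == M_WAIT ? PR_WAITOK : PR_NOWAIT));
            if (inp == NULL)
                    return (ENOBUFS);
            /* ... bind the PCB to the socket ... */
            return (0);
    }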
2022-09-05  Use shared netlock in soreceive(). The UDP and IP divert layers  (Alexander Bluhm)
provide locking of the PCB. Where that is possible, use shared instead of exclusive netlock in soreceive(). The PCB mutex provides a per-socket lock against multiple soreceive() running in parallel. Release and regrab both locks in sosleep_nsec(). OK mvs@
2022-08-13  Introduce the pru_*() wrappers for corresponding (*pr_usrreq)() calls.  (Vitaliy Makkoveev)
This is helpful for the upcoming split of (*pr_usrreq)() into multiple handlers, and right now it makes the code more readable. Also add '#ifndef _SYS_SOCKETVAR_H_' guards to sys/socketvar.h. This prevents collisions when both sys/protosw.h and sys/socketvar.h are included together. Both the 'socket' and 'protosw' structures must be defined before the pru_*() wrappers, so we need to include sys/socketvar.h in sys/protosw.h. ok bluhm@
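The shape of such a wrapper, sketched against the historical six-argument (*pr_usrreq)() signature (illustrative, not the committed code):

    static inline int
    pru_rcvd(struct socket *so)
    {
            /* dispatch through the protocol's single usrreq handler */
            return ((*so->so_proto->pr_usrreq)(so, PRU_RCVD,
                NULL, NULL, NULL, curproc));
    }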
2022-07-25  Replace selwakeup() with KNOTE() in socket event activation  (Visa Hankala)
Let's try this again now that the kernel locking issue in nfsrv_rcv() has been fixed. The previous attempt at the conversion triggered hangs on NFS servers. This was probably caused by the removal of the kernel-locked section just prior to the socket upcall; the section had masked a locking error in the NFS code.
2022-07-01  Make unix(4) domain socket locking fine-grained. Use the per-socket  (Vitaliy Makkoveev)
`so_lock' rwlock(9) instead of the global `unp_lock' which locks the whole layer. The PCBs of unix(4) sockets are linked to each other, and we need to lock both. This introduces a lock ordering problem: while thread (1) holds the lock on `so1' and tries to lock `so2', thread (2) could hold the lock on `so2' and try to lock `so1'. To solve this we always lock sockets in a strict order. For sockets which are already accessible from userland, we lock the socket with the smallest memory address first. Sometimes we need to unlock a socket before locking its peer and then lock it again; we use reference counters to prevent destruction of the connected peer during the relock. We also handle the case where the peer socket was replaced by another socket. For newly connected sockets, which are not yet exported to userland by accept(2), we always lock the listening socket `head' first. This allows us to avoid an unwanted relock within the accept(2) syscall. ok claudio@
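A condensed sketch of the address-ordered locking rule described above (function name illustrative; the real code also handles the refcounted relock and replaced-peer cases):

    void
    example_lock_pair(struct socket *so1, struct socket *so2)
    {
            /* always take the socket with the lower address first */
            if (so1 < so2) {
                    rw_enter_write(&so1->so_lock);
                    rw_enter_write(&so2->so_lock);
            } else {
                    rw_enter_write(&so2->so_lock);
                    rw_enter_write(&so1->so_lock);
            }
    }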
2022-06-26  Remove unused VOP_POLL().  (Visa Hankala)
OK mpi@
2022-06-06  Simplify solock() and sounlock(). There is no reason to return a value  (Claudio Jeker)
for the lock operation and to pass a value to the unlock operation. sofree() still needs an extra flag to know if sounlock() should be called or not. But sofree() is called less often and mostly without keeping the lock. OK mpi@ mvs@
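In caller terms the simplification looks roughly like this:

    /* before: the lock operation returned a value for the unlock */
    s = solock(so);
    /* ... */
    sounlock(so, s);

    /* after: lock and unlock are symmetric */
    solock(so);
    /* ... */
    sounlock(so);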
2022-05-09  Revert "Replace selwakeup() with KNOTE() in pipe and socket event activation."  (Visa Hankala)
The commit caused hangs with NFS. Reported by ajacoutot@ and naddy@
2022-05-06  Replace selwakeup() with KNOTE() in pipe and socket event activation.  (Visa Hankala)
OK mpi@
2022-02-25  Reported-by: syzbot+1b5b209ce506db4d411d@syzkaller.appspotmail.com  (Philip Guenther)
Revert the pr_usrreqs move: syzkaller found a NULL pointer deref and I won't be available to monitor for followup issues for a bit
2022-02-25  Move pr_attach and pr_detach to a new structure pr_usrreqs that can  (Philip Guenther)
then be shared among protosw structures, following the same basic direction as NetBSD and FreeBSD for this. Split PRU_CONTROL out of pr_usrreq into pru_control, giving it the proper prototype to eliminate the previously necessary casts. ok mvs@ bluhm@
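The shape of the new structure, sketched from the description above (the member list and prototypes are assumptions, apart from pru_control getting a proper prototype):

    struct pr_usrreqs {
            int     (*pru_attach)(struct socket *, int);
            int     (*pru_detach)(struct socket *);
            /* PRU_CONTROL split out, so no more argument casts */
            int     (*pru_control)(struct socket *, u_long, caddr_t,
                        struct ifnet *);
            int     (*pru_usrreq)(struct socket *, int, struct mbuf *,
                        struct mbuf *, struct mbuf *, struct proc *);
    };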
2022-02-21  expliclitly -> explicitly  (Jonathan Gray)
2022-02-14  update sbchecklowmem() to better detect actual mbuf memory usage.  (David Gwynne)
previously sbchecklowmem() (and sonewconn()) would look at the mbuf and mbuf cluster pools to see if they were approaching their hard limits. based on how many mbufs/clusters were allocated against the limits, socket operations would start to fail with ENOBUFS until utilisation went down. mbufs and clusters have changed a lot since then though. there are now many mbuf cluster pools, not just one for 2k clusters. because of this the mbuf layer now limits the amount of memory all the mbuf pools can allocate backend pages from rather than limiting the individual pools. this means sbchecklowmem() ends up looking at the default pool hard limit, which is UINT_MAX, which in turn means sbchecklowmem() probably never applies backpressure. this is made worse on multiprocessor systems where per cpu caches of mbuf and cluster pool items are enabled, because the number of in use pool items is distorted by the cpu caches. this switches sbchecklowmem to looking at the page allocations made by all the pools instead. the big benefit of this is that the page allocations are much more representative of the overall mbuf memory usage in the system. the downside is that the backend page allocation accounting does not see idle memory held by pools. pools cannot release partially free pages to the page backend (obviously), and pools cache idle items to avoid thrashing on the backend page allocator. this means the page allocation level is higher than the memory used by actual in-flight mbufs. however, this can also be a benefit. the backend page allocation is a kind of smoothed out "trend" line. mbuf utilisation over short periods can be extremely bursty because of things like rx ring dequeue and fill cycles, or large socket sends. if you're trying to grow socket buffers while these things are happening, luck becomes an important factor in whether it will work or not. because pools cache idle items, the backend page utilisation better represents the overall trend of activity in the system and will give more consistent behaviour here. this diff is deliberately simple. we're basically going from "no limits" to "some sort of limit" for sockets again, so keeping the code simple means it should be easy to understand and tweak in the future. ok djm@ visa@ claudio@
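The resulting check is small; a sketch close to what the message describes, assuming an accessor (m_pool_used()) that reports the percentage of the backend page limit currently allocated:

    int
    sbchecklowmem(void)
    {
            static int sblowmem;
            unsigned int used = m_pool_used();

            /* hysteresis: trip above 80% of the page limit, clear below 60% */
            if (used < 60)
                    sblowmem = 0;
            else if (used > 80)
                    sblowmem = 1;

            return (sblowmem);
    }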
2021-11-06  Allocate socket and initialize so_lock in one place  (Visa Hankala)
This makes witness(4) use a single lock type for tracking so_lock. Previously, so_lock was covered by two distinct lock types because there were separate rw_init() initializers in socreate() and sonewconn(). OK kettenis@
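A sketch of the consolidated allocation path (soalloc(), the pool name and the lock name are illustrative):

    struct socket *
    soalloc(int wait)
    {
            struct socket *so;

            so = pool_get(&socket_pool, PR_ZERO |
                (wait == M_WAIT ? PR_WAITOK : PR_NOWAIT));
            if (so == NULL)
                    return (NULL);
            /* a single rw_init() site gives witness(4) one lock type */
            rw_init(&so->so_lock, "solock");
            return (so);
    }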
2021-10-27  Replace 'DIAGNOSTIC' block within soqinsque() by KASSERT(9).  (Vitaliy Makkoveev)
ok sashan@
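The pattern of the change, sketched with an assumed invariant:

    /* before: checked only on DIAGNOSTIC kernels */
    #ifdef DIAGNOSTIC
            if (so->so_onq != NULL)
                    panic("soqinsque");
    #endif

    /* after: the same invariant as a one-line KASSERT(9) */
            KASSERT(so->so_onq == NULL);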
2021-10-24  Set klist lock for sockets to make socket event filters MP-safe  (Visa Hankala)
The filterops instances already provide f_modify and f_process callbacks with proper internal locking. Locking of socket klists has been the missing detail for MP-safety. OK mpi@
2021-07-26  Pass a socket pointer to various socket buffer routines in preparation for  (Martin Pieuchot)
per-socket locking. No functional change.
2021-07-25  Kill unused sbinsertoob().  (Martin Pieuchot)
ok mvs@
2021-06-07  Kill SS_ASYNC and only check SB_ASYNC when async signals are wanted.  (Martin Pieuchot)
This socket flag was redundant with the socket buffer one. ok mvs@
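After the change only the buffer flag gates the async signal; a sketch of the wakeup path:

    /* sketch: SS_ASYNC on the socket itself is gone */
    if (sb->sb_flags & SB_ASYNC)
            pgsigio(&so->so_sigio, SIGIO, 0);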
2021-05-26  Use `so_lock' to protect key management (PF_KEY) sockets. This can be  (mvs)
done because we have no cases where one thread should lock two sockets simultaneously. tested by yasuoka@ ok bluhm@ markus@
2021-05-01  Implement per-socket `so_lock' rwlock(9) and use it to protect routing  (mvs)
(PF_ROUTE) sockets. This can be done because we have no cases where one thread should lock two sockets simultaneously. Compared to the previous version, rtm_senddesync_timer() execution was moved to process context. Also, this time `so_lock' is used for routing sockets only, but in the future it will be used for other socket types too. tested by claudio@ ok claudio@ bluhm@
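A simplified sketch of a solock() with a per-socket backend for routing sockets only (case list abridged; other families keep their existing global locks):

    void
    solock(struct socket *so)
    {
            switch (so->so_proto->pr_domain->dom_family) {
            case PF_ROUTE:
                    rw_enter_write(&so->so_lock);   /* per-socket lock */
                    break;
            default:
                    NET_LOCK();                     /* shared backend */
                    break;
            }
    }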
2021-04-26  Revert per-socket `so_lock' rwlock(9) and use it to protect routing  (Claudio Jeker)
(PF_ROUTE) sockets. There is a locking issue with timeouts that needs to be fixed. Requested by deraadt@
2021-04-25  Implement per-socket `so_lock' rwlock(9) and use it to protect routing  (mvs)
(PF_ROUTE) sockets. This can be done because we have no cases where one thread should lock two sockets simultaneously. Also, this time `so_lock' is used for routing sockets only, but in the future it will be used for other socket types too. ok bluhm@
2021-02-11  sbdrop(): use NULL instead of 0 in pointer assignment  (mvs)
ok bluhm@
2021-02-10  Move UNIX domain sockets out of kernel lock. The new `unp_lock' rwlock(9)  (mvs)
is used as solock()'s backend to protect the whole layer. With feedback from mpi@. ok bluhm@ claudio@
2020-04-11  Add soassertlocked() checks to sbappend() and sbappendaddr(). This brings  (Claudio Jeker)
them in line with sbappendstream() and sbappendrecord(). Agreed by mpi@
2020-02-14  Push the KERNEL_LOCK() inside pgsigio() and selwakeup().  (Martin Pieuchot)
The three subsystems (signal, poll/select and kqueue) can now be addressed separately. Note that bpf(4) and audio(4) currently delay the wakeups to a separate context in order to respect the KERNEL_LOCK() requirement. Sockets (UDP, TCP) and pipes spin to grab the lock for the same reasons. ok anton@, visa@
2020-01-15  Keep socket timeout intervals in nsecs and use them with tsleep_nsec(9).  (Martin Pieuchot)
Introduce and use TIMEVAL_TO_NSEC() to convert SO_RCVTIMEO/SO_SNDTIMEO specified values into nanoseconds. As a side effect it is now possible to specify a timeout larger than (USHRT_MAX / 100) seconds. To keep the code simple `so_linger' now represents a number of seconds with 0 meaning no timeout or 'infinity'. Yes, the 0 -> INFSLP API change makes conversions complicated, as many timeout holders are still memset()'d. Inputs from cheloha@ and bluhm@, ok bluhm@
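An illustrative version of the conversion; clamping to UINT64_MAX on overflow is an assumption here:

    uint64_t
    TIMEVAL_TO_NSEC(const struct timeval *tv)
    {
            /* refuse to wrap: tv_usec contributes at most 999999000 nsec */
            if ((uint64_t)tv->tv_sec >
                (UINT64_MAX - 999999000ULL) / 1000000000ULL)
                    return (UINT64_MAX);
            return ((uint64_t)tv->tv_sec * 1000000000ULL +
                (uint64_t)tv->tv_usec * 1000ULL);
    }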
2019-04-16  Use the actual cluster size instead of fixed MCLBYTES for the  (YASUOKA Masahiko)
condition in sbcompress(). Currently the actual cluster size might be 9KB even if the MTU is 1500; in this case a lot of memory is wasted, since sbcompress() doesn't compress because of the previous condition. ok dlg claudio
2019-02-15  let sbcreatecontrol take a const void * instead of a caddr_t.  (David Gwynne)
this makes it easier to call since you don't have to cast to caddr_t if it's a void *. this also changes a size argument from int to size_t. ok claudio@
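The resulting prototype, per the description (parameter names assumed):

    struct mbuf *sbcreatecontrol(const void *p, size_t size, int type,
        int level);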
2018-11-19  Utilize sigio with sockets.  (Visa Hankala)
OK mpi@
2018-11-09  M_LEADINGSPACE() and M_TRAILINGSPACE() are just wrappers for  (Claudio Jeker)
m_leadingspace() and m_trailingspace(). Convert all callers to call the functions directly and remove the defines. OK krw@, mpi@
2018-10-29  Now that most archs have better NMBCLUSTERS defaults it is possible to bring  (Claudio Jeker)
back rev 1.90. ---- mbufs and mbuf clusters are now backed by large pools. Because of this we can relax the oversubscribe limit of socket buffers a fair bit. Instead of maxing out at sb_max * 1.125 or 2 * sb_hiwat, the maximum is increased to 8 * sb_hiwat -- which seems to be a good compromise between memory waste and better socket buffer usage. OK deraadt@ ---- ok benno@
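A sketch of the policy change in the socket buffer reservation path (exact expressions reconstructed from the message, not taken from the diff):

    /* before: mbuf memory capped near sb_max * 1.125 or 2 * sb_hiwat */
    sb->sb_mbmax = ulmin(cc * 2, sb_max + (sb_max / 8));

    /* after: allow 8x oversubscription of the high-water mark */
    sb->sb_mbmax = cc * 8;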
2018-07-10  After removing raw_usrreq() from route and pfkey, the global sockaddr  (Alexander Bluhm)
variables can be declared constant. OK claudio@ mpi@
2018-06-11  Do not unlock the KERNEL_LOCK() unconditionally in sounlock().  (Martin Pieuchot)
Instead introduce two flags to deal with global lock recursion. This is necessary until we get per-socket locks. Req. by and ok visa@
2018-06-06  Pass the socket to sounlock(); this prepares the terrain for per-socket  (Martin Pieuchot)
locking. ok visa@, bluhm@
2018-05-07  Grab the KERNEL_LOCK() for unix/routing/pfkey sockets in solock()...  (Martin Pieuchot)
...and release it in sounlock(). This will allow us to progressively remove the KERNEL_LOCK() in syscalls. ok visa@ some time ago
2018-04-08  AF_LOCAL was a failed attempt (by POSIX?) to seem less UNIX-specific, but  (Philip Guenther)
AF_UNIX is both the historical _and_ standard name, so prefer and recommend it in the headers, manpages, and kernel. ok miller@ deraadt@ schwarze@
2018-02-18  Revert previous. It triggers mbuf pool exhaustion on arm64.  (Mark Kettenis)
Requested by claudio@
2018-02-10  mbufs and mbuf clusters are now backed by large pools. Because of this  (Claudio Jeker)
we can relax the oversubscribe limit of socket buffers a fair bit. Instead of maxing out at sb_max * 1.125 or 2 * sb_hiwat, the maximum is increased to 8 * sb_hiwat -- which seems to be a good compromise between memory waste and better socket buffer usage. OK deraadt@
2017-12-30  Delete unnecessary <sys/file.h> includes  (Philip Guenther)
ok millert@ krw@
2017-12-10  Move SB_SPLICE, SB_WAIT and SB_SEL to `sb_flags', serialized by solock().  (Martin Pieuchot)
SB_KNOTE remains the only bit set on `sb_flagsintr' as it is set/unset in contexts related to kqueue(2) where we'd like to avoid grabbing solock(). While here add some KERNEL_LOCK()/UNLOCK() dances around selwakeup() and csignal() to mark which remaining functions need to be addressed in the socket layer. ok visa@, bluhm@
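The resulting split, sketched:

    /* serialized by solock(): */
    sb->sb_flags |= SB_WAIT;                /* also SB_SPLICE, SB_SEL */

    /* kqueue(2) contexts that want to avoid solock(): */
    sb->sb_flagsintr |= SB_KNOTE;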
2017-11-23  We want `sb_flags' to be protected by the socket lock rather than the  (Martin Pieuchot)
KERNEL_LOCK(), so change asserts accordingly. This is now possible since sblock()/sbunlock() are always called with the socket lock held. ok bluhm@, visa@
2017-08-11  Remove NET_LOCK()'s argument.  (Martin Pieuchot)
Tested by Hrvoje Popovski, ok bluhm@
2017-07-27  Assert that the KERNEL_LOCK() is held prior to calling csignal() and  (Martin Pieuchot)
selwakeup(). ok bluhm@
2017-07-18  soreserve() modifies `so_snd' and `so_rcv', so assert that it is called  (Martin Pieuchot)
with the socket lock held. This change is safe because sbreserve() already asserts that the lock is held, but it acts as implicit documentation and indicates that I looked at the function.
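The added assertion, sketched at the top of the function:

    int
    soreserve(struct socket *so, u_long sndcc, u_long rcvcc)
    {
            /* caller must hold the socket lock */
            soassertlocked(so);
            /* ... reserve so_snd and so_rcv via sbreserve() ... */
            return (0);
    }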
2017-07-04  Always hold the socket lock when calling sblock().  (Martin Pieuchot)
Implicitly protects `so_state' with the socket lock in sosend(). ok visa@, bluhm@
2017-07-04  Assert that the socket lock is held when `so_state' is modified.  (Martin Pieuchot)
ok bluhm@, visa@
2017-07-04  Assert that the socket lock is held when `so_qlen' is modified.  (Martin Pieuchot)
ok bluhm@, visa@
2017-06-27  Add missing solock()/sounlock() dances around sbreserve().  (Martin Pieuchot)
While here, document an abuse of the parent socket's lock. Problem reported by krw@, analysis and ok bluhm@