path: root/sys/kern/uipc_socket.c

2019-12-31  Visa Hankala
    Use C99 designated initializers with struct filterops. In addition, make the structs const so that the data are put in .rodata. OK mpi@, deraadt@, anton@, bluhm@

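    The shape of that change, sketched minimally; this assumes the struct filterops field names of this era (f_isfd, f_attach, f_detach, f_event in <sys/event.h>), which may differ slightly:

        const struct filterops soread_filtops = {
            .f_isfd   = 1,              /* filter operates on fds */
            .f_attach = NULL,
            .f_detach = filt_sordetach,
            .f_event  = filt_soread,
        };
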
2019-12-12  Visa Hankala
    Reintroduce socket locking inside socket event filters. Tested by anton@, sashan@. OK mpi@, anton@, sashan@

2019-12-02  cheloha
    Revert "timeout(9): switch to tickless backend". It appears to have caused major performance regressions all over the network stack. Reported by bluhm@, ok deraadt@

2019-11-26  cheloha
    timeout(9): switch to tickless backend

    Rebase the timeout wheel on the system uptime clock. Timeouts are now set to run at or after an absolute time as returned by nanouptime(9). Timeouts are thus "tickless": they expire at a real time on that clock instead of at a particular value of the global "ticks" variable. To facilitate this change the timeout struct's .to_time member becomes a timespec.

    Hashing timeouts into a bucket on the wheel changes slightly: we build a 32-bit hash with 25 bits of seconds (.tv_sec) and 7 bits of subseconds (.tv_nsec). 7 bits of subseconds means the width of the lowest wheel level is now 2 seconds on all platforms and each bucket in that lowest level corresponds to 1/128 seconds on the uptime clock. These values were chosen to closely align with the current 100hz hardclock(9) typical on almost all of our platforms. At 100hz a bucket is currently ~1/100 seconds wide on the lowest level and the lowest level itself is ~2.56 seconds wide. Not a huge change, but a change nonetheless.

    Because a bucket no longer corresponds to a single tick, more than one bucket may be dumped during an average timeout_hardclock_update() call. On 100hz platforms you now dump ~2 buckets. On 64hz machines (sh) you dump ~4 buckets. On 1024hz machines (alpha) you dump only 1 bucket, but you are doing extra work in softclock() to reschedule timeouts that aren't due yet.

    To avoid changing current behavior, all timeout_add*(9) interfaces convert their timeout interval into ticks, compute an equivalent timespec interval, and then add that interval to the timestamp of the most recent timeout_hardclock_update() call to determine an absolute deadline. So all current timeouts still "use" ticks, but the ticks are faked in the timeout layer.

    A new interface, timeout_at_ts(9), is introduced here to bypass this backward-compatible behavior. It will be used in subsequent diffs to add absolute timeout support for userland and to clean up some of the messier parts of kernel timekeeping, especially at the syscall layer.

    Because timeouts are based against the uptime clock they are subject to NTP adjustment via adjtime(2) and adjfreq(2). Unless you have a crazy adjfreq(2) adjustment set this will not change the expiration behavior of your timeouts.

    Tons of design feedback from mpi@, visa@, guenther@, and kettenis@. Additional amd64 testing from anton@ and visa@. Octeon testing from visa@. macppc testing from me. Positive feedback from deraadt@, ok visa@

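    A userland sketch of the bucket-hash arithmetic described above; timeout_bucket_hash() is a hypothetical name for illustration, not the kernel's actual helper:

        #include <stdint.h>
        #include <time.h>

        /* 32-bit hash: 25 low bits of seconds plus 7 bits of subseconds,
         * giving 1/128 s resolution; per the commit text, the lowest
         * wheel level then spans 2 seconds. */
        static uint32_t
        timeout_bucket_hash(const struct timespec *ts)
        {
            uint32_t sec = (uint32_t)ts->tv_sec & ((1U << 25) - 1);
            uint32_t frac = (uint32_t)(((uint64_t)ts->tv_nsec << 7) /
                1000000000ULL);                 /* 0..127 */

            return ((sec << 7) | frac);
        }
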
2019-07-22  Robert Nagy
    Implement SO_DOMAIN and SO_PROTOCOL so that the domain and the protocol can also be retrieved with getsockopt(3). It looks like these will also be in the next issue of POSIX: http://austingroupbugs.net/view.php?id=840#c2263
    ok claudio@, sthen@

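    A small userland example of the new options, assuming a freshly created TCP socket:

        #include <sys/socket.h>
        #include <err.h>
        #include <stdio.h>

        int
        main(void)
        {
            int s = socket(AF_INET, SOCK_STREAM, 0);
            int domain, protocol;
            socklen_t len;

            if (s == -1)
                err(1, "socket");
            len = sizeof(domain);
            if (getsockopt(s, SOL_SOCKET, SO_DOMAIN, &domain, &len) == -1)
                err(1, "SO_DOMAIN");
            len = sizeof(protocol);
            if (getsockopt(s, SOL_SOCKET, SO_PROTOCOL, &protocol, &len) == -1)
                err(1, "SO_PROTOCOL");
            /* prints the AF_INET and IPPROTO_TCP values */
            printf("domain %d protocol %d\n", domain, protocol);
            return (0);
        }
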
2019-07-11  Alexander Bluhm
    listen(2) should return EINVAL if the socket is connected. This behavior matches NetBSD, POSIX, and our own man page. Fix whitespace while here. from Moritz Buhl; OK millert@

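    A hedged demonstration of the behavior; the loopback port is hypothetical and assumes a local listener so that connect(2) succeeds:

        #include <sys/socket.h>
        #include <netinet/in.h>
        #include <arpa/inet.h>
        #include <err.h>
        #include <errno.h>
        #include <string.h>

        int
        main(void)
        {
            struct sockaddr_in sin;
            int s = socket(AF_INET, SOCK_STREAM, 0);

            if (s == -1)
                err(1, "socket");
            memset(&sin, 0, sizeof(sin));
            sin.sin_family = AF_INET;
            sin.sin_port = htons(12345);        /* hypothetical listener */
            sin.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
            if (connect(s, (struct sockaddr *)&sin, sizeof(sin)) == -1)
                err(1, "connect");
            /* The socket is now connected, so listen(2) must fail. */
            if (listen(s, 5) == -1 && errno == EINVAL)
                warnx("listen on connected socket: EINVAL as expected");
            return (0);
        }
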
2019-07-04  Alexander Bluhm
    Remove a useless kernel lock from the TCP socket splicing path. When send buffer space in the drain socket becomes available, a task is added to move data, and the userland was also informed. The latter is not useful, as this would mix a kernel and a user stream, so programs do not wait for this event. Avoid calling sowakeup() from sowwakeup(); this also reduces grabbing the kernel lock. Instead, inform the userland about the write event when the splicing is dissolved in sounsplice(). OK claudio@

2018-12-17  Alexander Bluhm
    When using MSG_WAITALL, soreceive() can sleep while processing the receive buffer of a stream socket. Then a new pair of control and data mbufs can be appended to the mbuf queue. In this case, terminate the loop with a short read to prevent a panic. Userland should read the control message with the next system call. OK claudio@ deraadt@

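    The userland consequence, sketched: even with MSG_WAITALL a read may come back short, so robust callers loop. recv_exact() is a hypothetical helper name:

        #include <sys/types.h>
        #include <sys/socket.h>

        /* Read exactly len bytes unless an error or EOF intervenes. */
        ssize_t
        recv_exact(int s, void *buf, size_t len)
        {
            size_t off = 0;

            while (off < len) {
                ssize_t n = recv(s, (char *)buf + off, len - off,
                    MSG_WAITALL);
                if (n == -1)
                    return (-1);        /* error */
                if (n == 0)
                    break;              /* EOF */
                off += (size_t)n;
            }
            return ((ssize_t)off);
        }
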
2018-11-30  Claudio Jeker
    Trivial MH_ALIGN/M_ALIGN to m_align conversions. OK bluhm@

2018-11-21  Claudio Jeker
    When using MSG_PEEK to peek into packets, skip control messages holding SCM_RIGHTS instead of sending them to the userland, since they hold kernel-internal data and it does not make sense to externalize it. OK deraadt@, guenther@, visa@

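    An illustrative userland sketch (peek_dgram() is a hypothetical helper): when peeking, a queued SCM_RIGHTS control message is now skipped rather than copied out:

        #include <sys/types.h>
        #include <sys/socket.h>
        #include <string.h>

        /* Peek at a datagram on an AF_UNIX socket. After this commit,
         * any SCM_RIGHTS control message stays queued and is not copied
         * out here; the descriptors arrive only on the real read. */
        ssize_t
        peek_dgram(int s, void *buf, size_t len)
        {
            struct iovec iov = { .iov_base = buf, .iov_len = len };
            char cbuf[CMSG_SPACE(sizeof(int) * 4)];
            struct msghdr msg;

            memset(&msg, 0, sizeof(msg));
            msg.msg_iov = &iov;
            msg.msg_iovlen = 1;
            msg.msg_control = cbuf;
            msg.msg_controllen = sizeof(cbuf);
            return (recvmsg(s, &msg, MSG_PEEK));
        }
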
2018-11-19  Visa Hankala
    Utilize sigio with sockets. OK mpi@

2018-08-21  Alexander Bluhm
    If the control message of IP_SENDSRCADDR did not fit into the socket buffer together with a UDP packet, sosend(9) returned EWOULDBLOCK. As it is a persistent problem, EMSGSIZE is the correct error code. Split the AF_UNIX case into a separate condition and do not change its logic. For atomic protocols, check that both data and control message length fit into the socket buffer. Original bug report from Alexander Markert. Discussed with jca@; OK vgross@

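    A sketch of the userland pattern this affects (send_from() is a hypothetical helper): a UDP send carrying an IP_SENDSRCADDR control message, which now fails with EMSGSIZE rather than EWOULDBLOCK when it cannot fit:

        #include <sys/types.h>
        #include <sys/socket.h>
        #include <netinet/in.h>
        #include <string.h>

        /* Send a UDP datagram with an explicit source address. */
        ssize_t
        send_from(int s, const void *data, size_t len,
            const struct sockaddr_in *dst, struct in_addr src)
        {
            struct iovec iov = { .iov_base = (void *)data, .iov_len = len };
            char cbuf[CMSG_SPACE(sizeof(struct in_addr))];
            struct msghdr msg;
            struct cmsghdr *cmsg;

            memset(&msg, 0, sizeof(msg));
            msg.msg_name = (void *)dst;
            msg.msg_namelen = sizeof(*dst);
            msg.msg_iov = &iov;
            msg.msg_iovlen = 1;
            msg.msg_control = cbuf;
            msg.msg_controllen = sizeof(cbuf);

            cmsg = CMSG_FIRSTHDR(&msg);
            cmsg->cmsg_level = IPPROTO_IP;
            cmsg->cmsg_type = IP_SENDSRCADDR;
            cmsg->cmsg_len = CMSG_LEN(sizeof(struct in_addr));
            memcpy(CMSG_DATA(cmsg), &src, sizeof(src));

            return (sendmsg(s, &msg, 0));
        }
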
2018-07-30  Martin Pieuchot
    Use FNONBLOCK instead of SS_NBIO to check/indicate that the I/O mode for sockets is non-blocking. This allows us to G/C SS_NBIO. Having to keep the two flags in sync in an mp-safe way is complicated. This change introduces a behavior change in sosplice(): it can now always block. However this should not matter much due to the socket lock being taken beforehand. ok bluhm@, benno@, visa@

2018-07-05  Visa Hankala
    Serialize the sosplice taskq allocation. This prevents an unlikely duplicate allocation that could happen in the future when each socket has a dedicated lock. Right now, the code path is serialized also by the NET_LOCK() (and the KERNEL_LOCK()). OK mpi@

2018-06-14  Alexander Bluhm
    In soclose() and soaccept() convert the KASSERT(SS_NOFDREF) back to a panic message. The latter prints socket pointer and type to help debugging. OK mpi@

2018-06-06  Martin Pieuchot
    Pass the socket to sounlock(); this prepares the terrain for per-socket locking. ok visa@, bluhm@

2018-06-06  Martin Pieuchot
    Assert that a pfkey or routing socket is referenced by a `fp' instead of calling sofree() when its PCB is detached. This is different from TCP, which does not always detach `inpcb's from sockets. In the pfkey & routing case, calling sofree() there is a no-op, whereas for TCP it is needed to free closed connections. Having fewer sofree() calls makes it easier to understand the code and move the locks down. ok visa@

2018-05-08  Alexander Bluhm
    Socket splicing can delay operations by task or timeout. Introduce soreaper(), which is scheduled onto the timer thread. soput() is scheduled from there onto the sosplice task thread. After that it is safe to pool_put() the socket and splicing data structures. OK mpi@ visa@

2018-04-08  Philip Guenther
    AF_LOCAL was a failed attempt (by POSIX?) to seem less UNIX-specific, but AF_UNIX is both the historical _and_ standard name, so prefer and recommend it in the headers, manpages, and kernel. ok miller@ deraadt@ schwarze@

2018-03-27  Martin Pieuchot
    Use a goto to merge multiple error blocks in sosplice(). ok bluhm@

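    The error-path style referred to above, sketched generically rather than as sosplice()'s actual code; acquire_a()/release_a() and friends are hypothetical:

        #include <errno.h>
        #include <stddef.h>

        struct resource;                        /* hypothetical */
        struct resource *acquire_a(void), *acquire_b(void);
        void release_a(struct resource *), release_b(struct resource *);

        /* Each failure jumps to one unwind label instead of
         * duplicating the release steps in every error block. */
        int
        setup_pair(void)
        {
            struct resource *a = NULL, *b = NULL;
            int error;

            if ((a = acquire_a()) == NULL) {
                error = ENOMEM;
                goto out;
            }
            if ((b = acquire_b()) == NULL) {
                error = ENOMEM;
                goto out;
            }
            return (0);
        out:
            if (b != NULL)
                release_b(b);
            if (a != NULL)
                release_a(a);
            return (error);
        }
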
2018-03-01  Alexander Bluhm
    When socket splicing is involved, delay the pool_put() until after the splicing thread has finished sotask() with the socket to be freed. Use-after-free reported and fix successfully tested by Rivo Nurges. Discussed with mpi@

2018-02-19  Martin Pieuchot
    Grab solock() inside soconnect2() instead of asserting for it to be held. ok millert@

2018-02-19  Martin Pieuchot
    Remove the almost unused `flags' argument of suser(). The account flag `ASU' will no longer be set, but that makes suser() mp-safe since it no longer messes with a per-process field. No objection from millert@; ok tedu@, bluhm@

2018-01-10  Alexander Bluhm
    Mark the sosplice task mp-safe; do not grab the kernel lock for TCP output. OK mpi@

2018-01-09  Martin Pieuchot
    Change `so_state' and `so_error' to unsigned int such that they can be atomically read from any context. ok bluhm@, visa@

2018-01-02  Martin Pieuchot
    Do not memset() the whole structure in sorflush() to keep `sb_flagsintr' untouched. ok bluhm@, visa@

2017-12-19  Martin Pieuchot
    Remove unnecessary unlock/lock dance when following a goto. ok bluhm@

2017-12-18  Martin Pieuchot
    Revert grabbing the socket lock in kqueue(2) filters. This change exposed or created a situation where a CPU became unresponsive while holding the KERNEL_LOCK(). This led to lockups, and even with MP_LOCKDEBUG it was not clear what happened to this CPU. These situations have been experienced by dhill@ with dcrwallet and jcs@ with syncthing. Both applications are written in Go and do kevent(2) & networking across multiple threads.

2017-12-10  Martin Pieuchot
    Move SB_SPLICE, SB_WAIT and SB_SEL to `sb_flags', serialized by solock(). SB_KNOTE remains the only bit set on `sb_flagsintr' as it is set/unset in contexts related to kqueue(2) where we'd like to avoid grabbing solock(). While here add some KERNEL_LOCK()/UNLOCK() dances around selwakeup() and csignal() to mark which remaining functions need to be addressed in the socket layer. ok visa@, bluhm@

2017-11-23  Martin Pieuchot
    Constify protocol tables and remove an assert now that ip_deliver() is mp-safe. ok bluhm@, visa@

2017-11-23  Martin Pieuchot
    We want `sb_flags' to be protected by the socket lock rather than the KERNEL_LOCK(), so change asserts accordingly. This is now possible since sblock()/sbunlock() are always called with the socket lock held. ok bluhm@, visa@

2017-11-04  Martin Pieuchot
    Make it possible for multiple threads to enter kqueue_scan() in parallel. This is a requirement to use a sleeping lock inside kqueue filters. It is now possible, but not recommended, to sleep inside ``f_event''. Threads iterating over the list of pending events now recognize and skip other threads' markers. knote_acquire() and knote_release() must be used to "own" a knote to make sure no other thread is sleeping with a reference on it. Acquire and marker logic taken from DragonFly, but the KERNEL_LOCK() is still serializing the execution of the kqueue code. This also enables the NET_LOCK() in socket filters. Tested by abieber@ & juanfra@, run by naddy@ in a bulk build, ok visa@, bluhm@

2017-11-02  Florian Obser
    Move PRU_DETACH out of pr_usrreq into per-proto pr_detach functions to pave the way for more fine-grained locking. Suggested by, comments & OK mpi

2017-09-15  Alexander Bluhm
    Coverity complains that top == NULL was checked and further down top->m_pkthdr.len was accessed without a check. See CID 1452933. In fact top cannot be NULL there and the condition was always false. m_getuio() never reserved space for the header. The correct check is m == top to find the first mbuf. OK visa@

2017-09-11  Alexander Bluhm
    Coverity complains that the return value of sblock() is not checked in sorflush(), although it is in other places. See CID 1453099. The flags SB_NOINTR and M_WAITOK should avoid failure. Put an assert there to be sure. OK visa@ mpi@

2017-09-01  Martin Pieuchot
    Change sosetopt() to no longer free the mbuf it receives and change all the callers to call m_freem(9). Support from deraadt@ and tedu@, ok visa@, bluhm@

2017-08-22  Martin Pieuchot
    Make the sogetopt(9) caller responsible for allocating an MT_SOOPTS mbuf. Move a blocking memory allocation out of the socket lock and create a simpler alloc/free pattern to review. Now both m_get() and m_free() are in the same place. Discussed with bluhm@. Encouragements from deraadt@ and tedu@, ok kettenis@, florian@, visa@

2017-08-10  Martin Pieuchot
    Move the solock()/sounlock() dance outside of sobind(). ok phessler@, visa@, bluhm@

2017-08-10  Alexander Bluhm
    The socket field so_proto can never be NULL. Remove the checks. OK mpi@ visa@

2017-08-09  Martin Pieuchot
    Move the socket lock "above" sosetopt(), sogetopt() and sosplice(). Protect the fields modified by sosetopt() and simplify the dance with the stars. ok bluhm@

2017-07-27  Martin Pieuchot
    Assert that the KERNEL_LOCK() is held prior to calling csignal() and selwakeup(). ok bluhm@

2017-07-24  Martin Pieuchot
    Extend the scope of the socket lock to protect `so_state' in connect(2). As a side effect, soconnect() and soconnect2() now expect a locked socket, so update all the callers. ok bluhm@

2017-07-20  Alexander Bluhm
    If pool_get() sleeps while allocating additional memory for socket splicing, another process may allocate it in the meantime. Then one of the splicing structures leaked in sosplice(). Recheck that no struct sosplice exists after a potential sleep. Reported by Ilja Van Sprundel; OK mpi@

2017-07-20  Martin Pieuchot
    Prepare filt_soread() to be locked. No functional change. ok bluhm@, claudio@, visa@

2017-07-13  Alexander Bluhm
    Do not unlock the netlock in the goto out error path before it has been acquired in sosend(). Fixes a kernel lock assertion panic. OK visa@ mpi@

2017-07-08  Martin Pieuchot
    Revert grabbing the socket lock in kqueue filters. It is unsafe to sleep while iterating the list of pending events in kqueue_scan(). Reported by abieber@ and juanfra@

2017-07-04  Martin Pieuchot
    Always hold the socket lock when calling sblock(). Implicitly protects `so_state' with the socket lock in sosend(). ok visa@, bluhm@

2017-07-03  Martin Pieuchot
    Protect `so_state', `so_error' and `so_qlen' with the socket lock in kqueue filters. ok millert@, bluhm@, visa@

2017-06-27  Martin Pieuchot
    Add missing solock()/sounlock() dances around sbreserve(). While here document an abuse of parent socket's lock. Problem reported by krw@, analysis and ok bluhm@

2017-06-26  Martin Pieuchot
    Assert that the corresponding socket is locked when manipulating socket buffers. This is one step towards unlocking the TCP input path. Note that all the functions asserting for the socket lock are not necessarily MP-safe; not all the fields of 'struct socket' are protected. Introduce a new kernel-only kqueue hint, NOTE_SUBMIT, to be able to tell when a filter needs to lock the underlying data structures. Logic and name taken from NetBSD. Tested by Hrvoje Popovski. ok claudio@, bluhm@, mikeb@
