|
malloc(9) or pool_get(9).
Pass down a wait flag to pru_attach().  During the socket(2) syscall
it is ok to wait; this logic was missing for internet PCBs.  Pfkey
and route sockets were already waiting.
sonewconn() must not wait when called during TCP 3-way handshake.
This logic has been preserved. Unix domain stream socket connect(2)
can wait until the other side has created the socket to accept.
OK mvs@
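A minimal sketch of the idea, assuming a hypothetical attach handler
and PCB pool; only the pool_get(9) flags are the real identifiers,
everything named "example" is illustrative:

extern struct pool example_pcb_pool;    /* hypothetical PCB pool */

int
example_attach(struct socket *so, int proto, int wait)
{
        struct example_pcb *pcb;

        /* socket(2) may sleep; sonewconn() in the TCP handshake must not */
        pcb = pool_get(&example_pcb_pool, PR_ZERO |
            (wait ? PR_WAITOK : PR_NOWAIT));
        if (pcb == NULL)
                return (ENOBUFS);
        so->so_pcb = pcb;       /* attach the freshly allocated PCB */
        return (0);
}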
|
|
provide locking of the PCB. If that is possible, use shared instead
of exclusive netlock in soreceive().  The PCB mutex provides a
per-socket lock against multiple soreceive() calls running in parallel.
Release and regrab both locks in sosleep_nsec().
OK mvs@
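Roughly, the locking pattern described above looks like the sketch
below; the rwlock(9)/mutex(9) primitives are real, while the function
and the sleep placeholder only illustrate what the message describes
sosleep_nsec() doing around the sleep:

void
example_receive_block(struct rwlock *netlock, struct mutex *pcb_mtx)
{
        rw_enter_read(netlock);         /* shared netlock is sufficient */
        mtx_enter(pcb_mtx);             /* serializes parallel soreceive() */

        /* ... nothing to receive yet, so we have to sleep ... */

        mtx_leave(pcb_mtx);             /* release both locks before sleeping */
        rw_exit_read(netlock);
        /* sleep here, e.g. via tsleep_nsec(9) */
        rw_enter_read(netlock);         /* regrab both afterwards */
        mtx_enter(pcb_mtx);

        /* ... data may be available now ... */

        mtx_leave(pcb_mtx);
        rw_exit_read(netlock);
}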
|
|
This is helpful for the upcoming split of (*pr_usrreq)() into multiple
handlers, and right now it already makes the code more readable.
Also add an '#ifndef _SYS_SOCKETVAR_H_' guard to sys/socketvar.h.  This
prevents collisions when both sys/protosw.h and sys/socketvar.h are
included together.  Both the 'socket' and 'protosw' structures must be
defined before the pru_*() wrappers, so sys/socketvar.h has to be
included from sys/protosw.h.
ok bluhm@
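The guard itself is the standard pattern, sketched here for clarity:

/* sys/socketvar.h */
#ifndef _SYS_SOCKETVAR_H_
#define _SYS_SOCKETVAR_H_

/* struct socket, struct sockbuf, pru_*() wrappers, ... */

#endif /* _SYS_SOCKETVAR_H_ */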
|
|
Let's try this again now that the kernel locking issue in nfsrv_rcv()
has been fixed.
The previous attempt at this conversion triggered hangs on NFS servers.
This was probably caused by the removal of the kernel-locked section
just prior to the socket upcall. The section had masked a locking error
in NFS code.
|
|
`so_lock' rwlock(9) instead of global `unp_lock' which locks the whole
layer.
The PCBs of unix(4) sockets are linked to each other, and we need to
lock them both.  This introduces a lock ordering problem: while
thread (1) holds the lock on `so1' and tries to lock `so2', thread (2)
could hold the lock on `so2' and be trying to lock `so1'.  To solve
this we always lock sockets in a strict order.
For sockets that are already accessible from userland, we always lock
the socket with the smallest memory address first.  Sometimes we need
to unlock a socket before locking its peer and then lock it again.
We use reference counters to prevent destruction of the connected peer
during the relock.  We also handle the case where the peer socket was
replaced by another socket.
For newly connected sockets, which are not yet exported to userland
by accept(2), we always lock the listening socket `head' first.  This
allows us to avoid an unwanted relock within the accept(2) syscall.
ok claudio@
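A simplified sketch of the ordering rule for sockets already visible to
userland; solock()/sounlock() are shown in a simplified single-argument
form and the reference helpers are hypothetical names, not the actual
unp code:

void
example_lock_peer(struct socket *so, struct socket *so2)
{
        if (so < so2) {
                /* already in address order, just take the peer's lock */
                solock(so2);
                return;
        }
        /*
         * Wrong order: take a reference so the peer cannot be
         * destroyed, drop our lock, lock both in address order, then
         * the caller re-checks that so2 is still the connected peer
         * (it may have been replaced while we were unlocked).
         */
        example_ref(so2);
        sounlock(so);
        solock(so2);
        solock(so);
        example_unref(so2);
}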
|
|
OK mpi@
|
|
for the lock operation and to pass a value to the unlock operation.
sofree() still needs an extra flag to know whether sounlock() should be
called or not.  But sofree() is called less often and mostly without
holding the lock.
OK mpi@ mvs@
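The resulting calling convention reads roughly like this sketch; the
int return value is an assumption for illustration only:

void
example_touch_socket(struct socket *so)
{
        int s;

        s = solock(so);         /* the lock operation returns a value... */
        /* ... inspect or modify the socket ... */
        sounlock(so, s);        /* ...that is passed back to unlock */
}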
|
|
The commit caused hangs with NFS.
Reported by ajacoutot@ and naddy@
|
|
OK mpi@
|
|
Revert the pr_usrreqs move: syzkaller found a NULL pointer deref
and I won't be available to monitor for follow-up issues for a bit.
|
|
then be shared among protosw structures, following the same basic
direction as NetBSD and FreeBSD for this.
Split PRU_CONTROL out of pr_usrreq into pru_control, giving it the
proper prototype to eliminate the previously necessary casts.
ok mvs@ bluhm@
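A hedged sketch of what such a dedicated handler slot looks like; the
exact argument list here is an assumption based on the usual
ioctl-style control path, not quoted from the tree:

struct pr_usrreqs_example {
        /* typed control handler instead of the generic mbuf-pointer API */
        int     (*pru_control)(struct socket *, u_long /* cmd */,
                    caddr_t /* data */, struct ifnet *);
        /* ... further pru_*() handlers ... */
};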
|
|
|
|
previously sbchecklowmem() (and sonewconn()) would look at the mbuf
and mbuf cluster pools to see if they were approaching their hard
limits. based on how many mbufs/clusters were allocated against the
limits, socket operations would start to fail with ENOBUFS until
utilisation went down.
mbufs and clusters have changed a lot since then though. there are
now many mbuf cluster pools, not just one for 2k clusters. because
of this the mbuf layer now limits the amount of memory all the mbuf
pools can allocate backend pages from rather than limit the individual
pools. this means sbchecklowmem() ends up looking at the default
pool hard limit, which is UINT_MAX, which in turn means
sbchecklowmem() probably never applies backpressure. this is made
worse on multiprocessor systems where per cpu caches of mbuf and
cluster pool items are enabled because the number of in use pool
items is distorted by the cpu caches.
this switches sbchecklowmem to looking at the page allocations made
by all the pools instead. the big benefit of this is that the page
allocations are much more representative of the overall mbuf memory
usage in the system. the downside is that the backend page
allocation accounting does not see idle memory held by pools. pools
cannot release partially free pages to the page backend (obviously),
and pools cache idle items to avoid thrashing on the backend page
allocator. this means the page allocation level is higher than the
memory used by actual in-flight mbufs.
however, this can also be a benefit. the backend page allocation is a
kind of smoothed out "trend" line. mbuf utilisation over short periods
can be extremely bursty because of things like rx ring dequeue and fill
cycles, or large socket sends. if you're trying to grow socket
buffers while these things are happening, luck becomes an important
factor in whether it will work or not. because pools cache idle items,
the backend page utilisation better represents the overall trend
of activity in the system and will give more consistent behaviour here.
this diff is deliberately simple. we're basically going from "no
limits" to "some sort of limit" for sockets again, so keeping the
code simple means it should be easy to understand and tweak in the
future.
ok djm@ visa@ claudio@
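a minimal sketch of the new style of check, with hypothetical counters
standing in for the real page accounting and an illustrative 3/4
threshold:

extern u_long example_mbuf_pages_used;         /* pages handed to mbuf pools */
extern u_long example_mbuf_pages_limit;        /* cap on those pages */

int
example_sbchecklowmem(void)
{
        /* start refusing socket buffer growth before the cap is hit */
        return (example_mbuf_pages_used >=
            example_mbuf_pages_limit * 3 / 4);
}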
|
|
This makes witness(4) use a single lock type for tracking so_lock.
Previously, so_lock was covered by two distinct lock types because there
were separate rw_init() initializers in socreate() and sonewconn().
OK kettenis@
|
|
ok sashan@
|
|
The filterops instances already provide f_modify and f_process
callbacks with proper internal locking. Locking of socket klists
has been the missing detail for MP-safety.
OK mpi@
|
|
per-socket locking.
No functional change.
|
|
ok mvs@
|
|
This socket flag was redundant with the socket buffer one.
ok mvs@
|
|
done because we have no cases where one thread should lock two sockets
simultaneously.
tested by yasuoka@
ok bluhm@ markus@
|
|
(PF_ROUTE) sockets. This can be done because we have no cases where one
thread should lock two sockets simultaneously.
Compared to the previous version, rtm_senddesync_timer() execution was
moved to process context.
Also, this time `so_lock' is used for routing sockets only, but in the
future it will be used for other socket types too.
tested by claudio@
ok claudio@ bluhm@
|
|
(PF_ROUTE) sockets. There is a locking issue with timeouts that needs
to be fixed.
Requested by deraadt@
|
|
(PF_ROUTE) sockets. This can be done because we have no cases where one
thread should lock two sockets simultaneously.
Also, this time `so_lock' is used for routing sockets only, but in the
future it will be used for other socket types too.
ok bluhm@
|
|
ok bluhm@
|
|
used as solock()'s backend to protect the whole layer.
With feedback from mpi@.
ok bluhm@ claudio@
|
|
them in line with sbappendstream() and sbappendrecord().
Agreed by mpi@
|
|
The 3 subsystems: signal, poll/select and kqueue can now be addressed
separately.
Note that bpf(4) and audio(4) currently delay the wakeups to a separate
context in order to respect the KERNEL_LOCK() requirement. Sockets (UDP,
TCP) and pipes spin to grab the lock for the same reasons.
ok anton@, visa@
|
|
Introduce and use TIMEVAL_TO_NSEC() to convert SO_RCVTIMEO/SO_SNDTIMEO
specified values into nanoseconds. As a side effect it is now possible
to specify a timeout larger than (USHRT_MAX / 100) seconds.
To keep code simple `so_linger' now represents a number of seconds with
0 meaning no timeout or 'infinity'.
Yes, the 0 -> INFSLP API change makes conversions complicated as many
timeout holders are still memset()'d.
Inputs from cheloha@ and bluhm@, ok bluhm@
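For illustration, the conversion amounts to the following arithmetic;
this is a standalone helper, not the actual TIMEVAL_TO_NSEC()
definition, and UINT64_MAX stands in for the kernel's INFSLP:

#include <sys/time.h>
#include <stdint.h>

uint64_t
example_timeval_to_nsec(const struct timeval *tv)
{
        /* 0 keeps its old meaning of "no timeout", i.e. sleep forever */
        if (tv->tv_sec == 0 && tv->tv_usec == 0)
                return (UINT64_MAX);
        return ((uint64_t)tv->tv_sec * 1000000000ULL +
            (uint64_t)tv->tv_usec * 1000ULL);
}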
|
|
condition in sbcompress().  Currently the actual cluster size might
be 9KB even if the MTU is 1500; in this case a lot of memory space is
wasted, since sbcompress() doesn't compress because of the previous
condition.
ok dlg claudio
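A quick worked example of the waste in question, using the numbers
cited above (constants are illustrative):

#include <stdio.h>

int
main(void)
{
        const int cluster = 9 * 1024;   /* large mbuf cluster */
        const int mtu = 1500;           /* typical ethernet payload */

        /* without compression each small packet pins a whole cluster */
        printf("wasted per packet: %d of %d bytes (%.0f%%)\n",
            cluster - mtu, cluster, 100.0 * (cluster - mtu) / cluster);
        return (0);
}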
|
|
this makes it easier to call since you don't have to cast to caddr_t
if it's a void *. this also changes a size argument from int to
size_t.
ok claudio@
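the shape of the change, sketched with a hypothetical function (the
names are illustrative):

#include <sys/types.h>

/* before: callers of a caddr_t/int interface needed casts */
int     example_copy_old(caddr_t buf, int len);

/* after: void * accepts any object pointer without a cast, and
 * size_t rules out negative lengths */
int     example_copy(void *buf, size_t len);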
|
|
OK mpi@
|
|
m_leadingspace() and m_trailingspace().  Convert all callers to call
the functions directly and remove the defines.
OK krw@, mpi@
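The caller-side conversion looks like this sketch; m_trailingspace()
is the real function, the surrounding caller is hypothetical:

int
example_append(struct mbuf *m, const void *data, int len)
{
        /* previously: if (M_TRAILINGSPACE(m) < len) ... */
        if (m_trailingspace(m) < len)
                return (0);
        memcpy(mtod(m, caddr_t) + m->m_len, data, len);
        m->m_len += len;
        return (len);
}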
|
|
back rev 1.90.
----
mbufs and mbuf clusters are now backed by large pools. Because of this
we can relax the oversubscribe limit of socketbuffers a fair bit.
Instead of maxing out at sb_max * 1.125 or 2 * sb_hiwat, the maximum is
increased to 8 * sb_hiwat -- which seems to be a good compromise between
memory waste and better socket buffer usage.
OK deraadt@
----
ok benno@
|
|
variables can be declared constant.
OK claudio@ mpi@
|
|
Instead introduce two flags to deal with global lock recursion.  This
is necessary until we get a per-socket lock.
Req. by and ok visa@
|
|
locking.
ok visa@, bluhm@
|
|
...and release it in sounlock().  This will allow us to progressively
remove the KERNEL_LOCK() in syscalls.
ok visa@ some time ago
|
|
AF_UNIX is both the historical _and_ standard name, so prefer and recommend
it in the headers, manpages, and kernel.
ok miller@ deraadt@ schwarze@
|
|
Requested by claudio@
|
|
we can relax the oversubscribe limit of socketbuffers a fair bit.
Instead of maxing out at sb_max * 1.125 or 2 * sb_hiwat, the maximum is
increased to 8 * sb_hiwat -- which seems to be a good compromise between
memory waste and better socket buffer usage.
OK deraadt@
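In effect the cap on mbuf memory charged to a socket buffer moves
along these lines; an illustrative assignment, not the exact
sbreserve() code:

void
example_reserve(struct sockbuf *sb, u_long hiwat)
{
        sb->sb_hiwat = hiwat;
        /* old cap: roughly min(sb_max * 1.125, 2 * hiwat) */
        sb->sb_mbmax = 8 * hiwat;
}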
|
|
ok millert@ krw@
|
|
SB_KNOTE remains the only bit set on `sb_flagsintr' as it is set/unset in
contexts related to kqueue(2) where we'd like to avoid grabbing solock().
While here add some KERNEL_LOCK()/UNLOCK() dances around selwakeup() and
csignal() to mark which remaining functions need to be addressed in the
socket layer.
ok visa@, bluhm@
|
|
KERNEL_LOCK(), so change asserts accordingly.
This is now possible since sblock()/sbunlock() are always called with
the socket lock held.
ok bluhm@, visa@
|
|
Tested by Hrvoje Popovski, ok bluhm@
|
|
selwakeup().
ok bluhm@
|
|
with the socket lock.
This change is safe because sbreserve() already asserts that the lock is
held, but it acts as implicit documentation and indicates that I looked
at the function.
|
|
Implicitly protects `so_state' with the socket lock in sosend().
ok visa@, bluhm@
|
|
ok bluhm@, visa@
|
|
ok bluhm@, visa@
|
|
While here, document an abuse of the parent socket's lock.
Problem reported by krw@, analysis and ok bluhm@
|