|
(PF_ROUTE) sockets. This can be done because we have no cases where one
thread should lock two sockets simultaneously.
Compared to the previous version, rtm_senddesync_timer() execution was moved
to process context.
Also, this time `so_lock' is used for routing sockets only, but in the future
it will be used for other socket types too.
tested by claudio@
ok claudio@ bluhm@
|
|
(PF_ROUTE) sockets. There is a locking issue with timeouts that needs
to be fixed.
Requested by deraadt@
|
|
(PF_ROUTE) sockets. This can be done because we have no cases where one
thread should lock two sockets simultaneously.
Also, this time `so_lock' is used for routing sockets only, but in the future
it will be used for other socket types too.
ok bluhm@
|
|
no objections mvs@
|
|
Passing a local copy of the socket to sbrelease() is too complicated just to
free the receive buffer. We don't allocate a large object on the stack, and
we don't pass an unlocked socket to soassertlocked() within sbdrop(). This
was not triggered because we lock the whole layer with one lock.
Also, sorflush() is now private to kern/uipc_socket.c, so its definition was
adjusted accordingly.
ok claudio@ mpi@
|
|
OK mpi@ as part of a larger diff
|
|
error, a broadcast mbuf will stay in the socket buffer forever.
This is bad as multiple mbufs can use up all the space. Better
report ELOOP, dissolve splicing, and let userland handle it.
OK anton@
|
|
Rename klist_{insert,remove}() to klist_{insert,remove}_locked().
These functions assume that the caller has locked the klist. The current
state of locking remains intact because the kernel lock is still used
with all klists.
Add new functions klist_insert() and klist_remove() that lock the klist
internally. This allows some code simplification.
OK mpi@
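Based on the description above, the new wrappers presumably look something
like this hedged sketch (the KERNEL_LOCK() use reflects the message's note
that all klists are still kernel-locked):

    void
    klist_insert(struct klist *klist, struct knote *kn)
    {
            /* Sketch: take the klist's lock internally, then defer to
             * the variant that assumes the caller already holds it. */
            KERNEL_LOCK();
            klist_insert_locked(klist, kn);
            KERNEL_UNLOCK();
    }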
|
|
OK dlg@, bluhm@
No Opinion mpi@
Not against it claudio@
|
|
mbuf is encountered in a seqpacket socket.
This diff uses the fact that setting orig_resid to 0 causes soreceive()
to return instead of looping back with the intent to sleep for more data.
orig_resid is now always set to 0 in the control message case (instead of
only if controlp is defined). This is the same behaviour as for the PR_NAME
case. Additionally orig_resid is set to 0 in the data reader when MSG_PEEK
is used.
Tested in snaps for a while and by anton@
Reported-by: syzbot+4b0e9698b344b0028b14@syzkaller.appspotmail.com
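A hedged sketch of the idea inside soreceive(); the surrounding control flow
is simplified and the variable names follow the message:

    /* Control-message case: always clear orig_resid so soreceive()
     * returns a short read instead of sleeping for more data. */
    if (m != NULL && m->m_type == MT_CONTROL) {
            /* ... deliver the control mbuf ... */
            orig_resid = 0;         /* previously only when controlp != NULL */
    }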
|
|
so_state and splice checks were done without the proper lock, which is
incorrect. This is similar to sobind() and soconnect(), which also require
the caller to hold the socket lock.
Found by, with and OK mvs@, OK mpi@
|
|
The socket splice idle timeout is a timeval, so we need to check that
tv_usec is both non-negative and less than one million. Otherwise it
isn't in canonical form.
We can check for this with timerisvalid(3).
benno@ says this shouldn't break anything in base.
ok benno@, bluhm@
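A minimal sketch of such a check when the timeout is set, assuming EINVAL is
the error returned for a non-canonical timeval:

    #include <sys/time.h>
    #include <errno.h>

    int
    check_idle_timeout(const struct timeval *tv)
    {
            /* Canonical form: 0 <= tv_usec < 1000000. */
            if (!timerisvalid(tv))
                    return (EINVAL);
            return (0);
    }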
|
|
This filter, already implemented in macOS and Dragonfly BSD, returns
exceptional conditions like the reception of out-of-band data.
The functionality is similar to poll(2)'s POLLPRI & POLLRDBAND and
it can be used by the kqfilter-based poll & select implementation.
ok millert@ on a previous version, ok visa@
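A userland usage sketch, assuming `fd' is a connected TCP socket being
watched for out-of-band data and `kq' is an existing kqueue descriptor:

    #include <sys/types.h>
    #include <sys/event.h>
    #include <err.h>

    void
    watch_oob(int kq, int fd)
    {
            struct kevent kev;

            /* NOTE_OOB asks EVFILT_EXCEPT to report out-of-band data. */
            EV_SET(&kev, fd, EVFILT_EXCEPT, EV_ADD, NOTE_OOB, 0, NULL);
            if (kevent(kq, &kev, 1, NULL, 0, NULL) == -1)
                    err(1, "kevent");
    }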
|
|
sockets from different domains, so there is no reason to have locking and
memory allocation in this error path. Also, in this case only `so' will be
locked by solock(), so we should avoid modifying `sosp'.
ok mpi@
|
|
This is only done in poll-compatibility mode, when __EV_POLL is set.
ok visa@, millert@
|
|
FRELE() as the last reference could be dropped, which in turn will cause
soclose() to be called where the socket lock is unconditionally
acquired. Note that this is only a problem for sockets protected by the
non-recursive NET_LOCK() right now.
ok mpi@ visa@
Reported-by: syzbot+7c805a09545d997b924d@syzkaller.appspotmail.com
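One hedged way to picture the fix: release the socket lock before the final
FRELE(), so a possible soclose() does not try to take it again (signatures
simplified):

    /* Sketch only: FRELE() may drop the last reference and enter
     * soclose(), which unconditionally takes the socket lock, so
     * do not hold that lock across the release. */
    sounlock(so);
    FRELE(fp, p);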
|
|
for example, with locking assertions.
OK mpi@, anton@
|
|
sent via spliced socket.
Reported-by: syzbot+2f9616f39d3f3b281cfb@syzkaller.appspotmail.com
OK bluhm@
|
|
adding more filter properties without cluttering the struct.
OK mpi@, anton@
|
|
The 3 subsystems: signal, poll/select and kqueue can now be addressed
separately.
Note that bpf(4) and audio(4) currently delay the wakeups to a separate
context in order to respect the KERNEL_LOCK() requirement. Sockets (UDP,
TCP) and pipes spin to grab the lock for the same reasons.
ok anton@, visa@
|
|
Introduce and use TIMEVAL_TO_NSEC() to convert SO_RCVTIMEO/SO_SNDTIMEO
specified values into nanoseconds. As a side effect it is now possible
to specify a timeout larger than (USHRT_MAX / 100) seconds.
To keep the code simple, `so_linger' now represents a number of seconds, with
0 meaning no timeout or 'infinity'.
Yes, the 0 -> INFSLP API change makes conversions complicated as many
timeout holders are still memset()'d.
Inputs from cheloha@ and bluhm@, ok bluhm@
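A hedged, self-contained analogue of such a conversion (the real
TIMEVAL_TO_NSEC() may clamp or name things differently; a canonical,
non-negative timeval is assumed):

    #include <sys/time.h>
    #include <stdint.h>

    uint64_t
    timeval_to_nsec(const struct timeval *tv)
    {
            /* Clamp instead of overflowing 64 bits. */
            if ((uint64_t)tv->tv_sec > UINT64_MAX / 1000000000ULL - 1)
                    return (UINT64_MAX);
            return (tv->tv_sec * 1000000000ULL + tv->tv_usec * 1000ULL);
    }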
|
|
make the structs const so that the data are put in .rodata.
OK mpi@, deraadt@, anton@, bluhm@
|
|
Tested by anton@, sashan@
OK mpi@, anton@, sashan@
|
|
It appears to have caused major performance regressions all over the
network stack.
Reported by bluhm@
ok deraadt@
|
|
Rebase the timeout wheel on the system uptime clock. Timeouts are now
set to run at or after an absolute time as returned by nanouptime(9).
Timeouts are thus "tickless": they expire at a real time on that clock
instead of at a particular value of the global "ticks" variable.
To facilitate this change the timeout struct's .to_time member becomes a
timespec. Hashing timeouts into a bucket on the wheel changes slightly:
we build a 32-bit hash with 25 bits of seconds (.tv_sec) and 7 bits of
subseconds (.tv_nsec). 7 bits of subseconds means the width of the
lowest wheel level is now 2 seconds on all platforms and each bucket in
that lowest level corresponds to 1/128 seconds on the uptime clock.
These values were chosen to closely align with the current 100hz
hardclock(9) typical on almost all of our platforms. At 100hz a bucket
is currently ~1/100 seconds wide on the lowest level and the lowest
level itself is ~2.56 seconds wide. Not a huge change, but a change
nonetheless.
Because a bucket no longer corresponds to a single tick more than one
bucket may be dumped during an average timeout_hardclock_update() call.
On 100hz platforms you now dump ~2 buckets. On 64hz machines (sh) you
dump ~4 buckets. On 1024hz machines (alpha) you dump only 1 bucket,
but you are doing extra work in softclock() to reschedule timeouts
that aren't due yet.
To avoid changing current behavior all timeout_add*(9) interfaces
convert their timeout interval into ticks, compute an equivalent
timespec interval, and then add that interval to the timestamp of
the most recent timeout_hardclock_update() call to determine an
absolute deadline. So all current timeouts still "use" ticks,
but the ticks are faked in the timeout layer.
A new interface, timeout_at_ts(9), is introduced here to bypass this
backward-compatible behavior. It will be used in subsequent diffs
to add absolute timeout support for userland and to clean up some of
the messier parts of kernel timekeeping, especially at the syscall
layer.
Because timeouts are based on the uptime clock they are subject to
NTP adjustment via adjtime(2) and adjfreq(2). Unless you have a crazy
adjfreq(2) adjustment set this will not change the expiration behavior
of your timeouts.
Tons of design feedback from mpi@, visa@, guenther@, and kettenis@.
Additional amd64 testing from anton@ and visa@. Octeon testing from visa@.
macppc testing from me.
Positive feedback from deraadt@, ok visa@
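A hedged sketch of the hashing arithmetic described above; the real code's
layout and shifts may differ:

    #include <sys/time.h>
    #include <stdint.h>

    /* 32-bit hash: 25 bits of seconds, 7 bits of subseconds.
     * 1000000000 / 128 == 7812500, so each subsecond step is
     * 1/128 of a second and the lowest wheel level spans 2s. */
    uint32_t
    timeout_hash(const struct timespec *ts)
    {
            return ((uint32_t)ts->tv_sec << 7) |
                (uint32_t)(ts->tv_nsec / 7812500);
    }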
|
|
can also be retrieved with getsockopt(3).
It looks like these will also be in the next issue of POSIX:
http://austingroupbugs.net/view.php?id=840#c2263
ok claudio@, sthen@
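Assuming the options in question are SO_DOMAIN and SO_PROTOCOL (per the
linked Austin Group issue), retrieval on an existing socket `s' looks like:

    #include <sys/socket.h>
    #include <err.h>

    int
    socket_domain(int s)
    {
            int domain;
            socklen_t len = sizeof(domain);

            if (getsockopt(s, SOL_SOCKET, SO_DOMAIN, &domain, &len) == -1)
                    err(1, "getsockopt");
            return (domain);        /* e.g. AF_INET */
    }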
|
|
This behavior matches NetBSD, POSIX, and our own man page.
Fix whitespace while here.
from Moritz Buhl; OK millert@
|
|
When send buffer space in the drain socket becomes available, a
task is added to move data, and the userland was also informed.
The latter is not useful as this would mix a kernel and a user stream,
so programs do not wait for this event. Avoid calling sowakeup()
from sowwakeup(); this also reduces grabbing the kernel lock. Instead,
inform the userland about the write event when the splicing is
dissolved in sounsplice().
OK claudio@
|
|
receive buffer of a stream socket. Then a new pair of control and
data mbufs can be appended to the mbuf queue. In this case, terminate
the loop with a short read to prevent a panic. Userland should
read the control message with the next system call.
OK claudio@ deraadt@
|
|
OK bluhm@
|
|
SCM_RIGHTS from being sent to the userland since they hold kernel internal
data and it does not make sense to externalize it.
OK deraadt@, guenther@, visa@
|
|
OK mpi@
|
|
buffer together with a UDP packet, sosend(9) returned EWOULDBLOCK.
As it is a persistent problem, EMSGSIZE is the correct error code.
Split the AF_UNIX case into a separate condition and do not change
its logic. For atomic protocols, check that both data and control
message length fit into the socket buffer.
original bug report from Alexander Markert
discussed with jca@; OK vgross@
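A hedged sketch of the check in sosend() for atomic protocols (simplified;
the variable names follow the message, the buffer field is an assumption):

    /* If data plus control can never fit into the send buffer,
     * fail permanently with EMSGSIZE instead of blocking. */
    if (atomic && resid + clen > so->so_snd.sb_hiwat)
            return (EMSGSIZE);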
|
|
for sockets is non-blocking.
This allows us to G/C SS_NBIO. Having to keep the two flags in sync
in an mp-safe way is complicated.
This change introduces a behavior change in sosplice(): it can now
always block. However, this should not matter much due to the socket
lock being taken beforehand.
ok bluhm@, benno@, visa@
|
|
duplicate allocation that could happen in the future when each socket
has a dedicated lock. Right now, the code path is also serialized by
the NET_LOCK() (and the KERNEL_LOCK()).
OK mpi@
|
|
to a panic message. The latter prints socket pointer and type to
help debugging.
OK mpi@
|
|
locking.
ok visa@, bluhm@
|
|
of calling sofree(), when its PCB is detached.
This is different from TCP which does not always detach `inpcb's from
sockets. In the pfkey & routing case, calling sofree() there is a noop,
whereas for TCP it's needed to free closed connections.
Having fewer sofree() makes it easier to understand the code and move
the locks down.
ok visa@
|
|
soreaper() that is scheduled onto the timer thread. soput() is
scheduled from there onto the sosplice task thread. After that it
is safe to pool_put() the socket and splicing data structures.
OK mpi@ visa@
|
|
AF_UNIX is both the historical _and_ standard name, so prefer and recommend
it in the headers, manpages, and kernel.
ok miller@ deraadt@ schwarze@
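In practice this simply means spelling new code with the standard constant,
presumably instead of the AF_LOCAL alias:

    #include <sys/socket.h>

    /* AF_UNIX is preferred over the historical alias AF_LOCAL. */
    int s = socket(AF_UNIX, SOCK_STREAM, 0);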
|
|
ok bluhm@
|
|
splicing thread has finished sotask() with the socket to be freed.
Use after free reported and fix successfully tested by Rivo Nurges.
discussed with mpi@
|
|
ok millert@
|
|
The account flag `ASU' will no longer be set, but that makes suser()
mpsafe since it no longer messes with a per-process field.
No objection from millert@, ok tedu@, bluhm@
|
|
OK mpi@
|
|
be atomically read from any context.
ok bluhm@, visa@
|
|
untouched.
ok bluhm@, visa@
|
|
ok bluhm@
|
|
This change exposed or created a situation where a CPU started to be
unresponsive while holding the KERNEL_LOCK(). This led to lockups, and
even with MP_LOCKDEBUG it was not clear what happened to this CPU.
These situations have been experienced by dhill@ with dcrwallet and jcs@
with syncthing. Both applications are written in Go and do kevent(2)
& networking across multiple threads.
|
|
SB_KNOTE remains the only bit set on `sb_flagsintr' as it is set/unset in
contexts related to kqueue(2) where we'd like to avoid grabbing solock().
While here add some KERNEL_LOCK()/UNLOCK() dances around selwakeup() and
csignal() to mark which remaining functions need to be addressed in the
socket layer.
ok visa@, bluhm@
|