Age | Commit message | Author |
|
OK mpi@
|
|
These sockets are not connection-oriented, they don't call pru_rcvd(),
but they have splicing ability and they set `so_error'.
Splicing ability is the biggest problem. However, we can hold `sb_mtx'
around `ssp_socket' modifications together with solock(). So holding
`sb_mtx' is enough for the isspliced() check in soreceive(). The
unlocked `so_sp' dereference is fine, because we set it only once for
the whole socket lifetime and we do this before the `ssp_socket'
assignment.
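A minimal sketch of this ordering (simplified; error paths omitted):

	/* splice side: `so_sp' was set once, before any `ssp_socket'
	 * assignment, so readers may dereference it unlocked */
	solock(so);
	mtx_enter(&so->so_rcv.sb_mtx);
	so->so_sp->ssp_socket = sosp;
	mtx_leave(&so->so_rcv.sb_mtx);
	sounlock(so);

	/* receive side: `sb_mtx' alone covers the check */
	mtx_enter(&so->so_rcv.sb_mtx);
	if (isspliced(so)) {
		/* leave the data to somove() */
	}
	mtx_leave(&so->so_rcv.sb_mtx);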
We also need to take sblock() before splicing sockets, so sosplice()
and soreceive() are serialized with each other. Since `sb_mtx' is also
required to unsplice sockets, it serializes somove() with soreceive()
regardless of the somove() caller.
sosplice() was reworked to accept a standalone sblock() for udp(4)
sockets.
soreceive() performs the `so_error' check and modification unlocked.
Previously, we had no way to predict which of the concurrent
soreceive() or sosend() threads would fail and clear `so_error'. With
this unlocked access, a sosend() and a soreceive() thread could fail
together.
`so_error' is stored in the local `error2' variable because `so_error'
could be overwritten by a concurrent sosend() thread.
Tested and ok bluhm
|
|
|
|
dosigsuspend() no longer needs it.
OK mvs@ mpi@
|
|
no functional change, found by smatch warnings
ok miod@ bluhm@
|
|
With two separate TCP hash tables, each one becomes smaller. When
we remove the exclusive net lock from TCP, contention on the internet
PCB table mutex will be reduced. UDP was split earlier into
IPv4 and IPv6. Replace branch conditions based on INP_IPV6 with
assertions.
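For example, in an IPv6-only path the branch becomes an assertion,
roughly like this (an illustrative fragment, not the literal diff):

	/* table split: this function now only sees IPv6 PCBs */
	KASSERT(ISSET(inp->inp_flags, INP_IPV6));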
OK mvs@
|
|
For inet sockets solock() is the netlock wrapper, so soreceive() can
now run simultaneously with exclusively locked code paths.
These sockets are not connection-oriented, they don't call pru_rcvd(),
they can't be spliced and they don't set `so_error'. There is nothing
for solock() to protect in the soreceive() path.
The `so_rcv' buffer is protected by the `sb_mtx' mutex(9), but since
that is released while sleeping, sblock() is required to serialize
concurrent soreceive() and sorflush() threads. The current sblock() is
a kind of rwlock(9) implementation, so introduce an `sb_lock' rwlock(9)
and use it directly for that purpose.
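The resulting locking in the receive path, as a rough sketch:

	rw_enter_write(&sb->sb_lock);	/* serialize receivers */
	mtx_enter(&sb->sb_mtx);
	/* consume data from `so_rcv' */
	mtx_leave(&sb->sb_mtx);
	rw_exit_write(&sb->sb_lock);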
sorflush() and its callers were refactored to avoid solock() for raw
inet sockets. This avoids stopping packet processing.
Tested and ok bluhm.
|
|
Only unix(4) and tcp(4) sockets set the (*pru_sense)() handler. The
rest of soo_stat() is read-only access.
ok bluhm
|
|
uipc_attach() releases solock() because it should be taken after the
`unp_gc_lock' rwlock(9) which protects the `unp_link' list. For this
reason, the listening `head' socket should be unlocked too while
sonewconn() calls uipc_attach(). This could be reworked because the
`so_rcv' sockbuf now relies on the `sb_mtx' mutex(9).
The last `unp_link' foreach loop within unp_gc() discards sockets
previously marked as UNP_GCDEAD. These sockets are not accessed from
userland. The only exception is the sosend() threads of connected
sending peers, but they only sbappend*() mbuf(9)s to `so_rcv'. So it
is enough to unlink the mbuf(9) chain with `sb_mtx' held and discard
it locklessly.
Note that the existing SS_NEWCONN_WAIT logic was never used because
the listening unix(4) socket is protected from concurrent unp_detach()
by the vnode(9) lock; however, `head' was re-locked every time.
ok bluhm
|
|
Change p_sigmask from atomic back to non-atomic updates. All changes to
p_sigmask are only allowed by curproc (the owner). There is no need for
atomic instructions here.
p_sigmask is mostly accessed by curproc with the exception of ptsignal().
In ptsignal() p_sigmask is now only read once unless an SSLEEP proc
gets the signal. In that case p_sigmask is rechecked before the wakeup
to ensure that no unnecessary wakeup happens.
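A sketch of the SSLEEP case (simplified; `mask' is the signal's bit):

	/* ptsignal(): p_sigmask was read once for the common checks */
	if (p->p_stat == SSLEEP) {
		/* recheck before wakeup: curproc may have changed it */
		if ((p->p_sigmask & mask) != 0)
			return;		/* masked, no wakeup needed */
		setrunnable(p);
	}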
Add some KASSERT(p == curproc) to ensure this precondition.
sigabort() is special since it is also called by ddb but apart from that
only works for curproc.
With and OK mvs@ OK mpi@
|
|
|
|
|
|
|
|
Requested by robert@
OK mvs@ millert@ deraadt@
|
|
only udp(4) sockets set and check `so_error'.
No functional changes.
ok bluhm
|
|
replaced it with a more strict mechanism, which happens to be lockless O(1)
rather than micro-lock O(1)+O(log N). Also nop-out the sys_msyscall(2) guts,
but leave the syscall around for a bit longer so that people can build through
it, since ld.so(1) still wants to call it.
|
|
|
|
listen(2) man(1) page clearly prohibits sockets of other types.
Reported-by: syzbot+00450333592fcd38c6fe@syzkaller.appspotmail.com
ok bluhm
|
|
sbappend*() and soreceive() on SB_MTXLOCK-marked sockets use the
`sb_mtx' mutex(9) for protection, while the buffer usage check and the
corresponding sbwait() sleep are still serialized by solock(). Mark
udp(4) as SB_OWNLOCK to avoid solock() serialization and rely on the
`sb_mtx' mutex(9) instead. The `sb_state' and `sb_flags' modifications
must be protected by `sb_mtx' too.
ok bluhm
|
|
Tracepoints like "sched:enqueue" and "sched:unsleep" were called from
inside the loop iterating over sleeping threads as part of
wakeup_proc(). When such tracepoints were enabled, they could trigger
another wakeup(9), possibly corrupting the sleepqueue.
Rewrite wakeup(9) in two stages: first dequeue threads from the
sleepqueue, then call setrunnable() and any tracepoints for each of
them.
This requires moving unsleep() outside of setrunnable() because it messes with
the sleepqueue.
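As a rough sketch of the resulting shape (assuming the usual TAILQ
linkage and a hypothetical lookup helper; locking details elided):

	struct proc *p;
	TAILQ_HEAD(, proc) wakeq = TAILQ_HEAD_INITIALIZER(wakeq);

	/* stage 1: unlink every matching thread from the sleepqueue */
	while ((p = lookup_sleeper(ident)) != NULL) {	/* hypothetical */
		unsleep(p);
		TAILQ_INSERT_TAIL(&wakeq, p, p_runq);
	}
	/* stage 2: tracepoints may fire, the sleepqueue is consistent */
	while ((p = TAILQ_FIRST(&wakeq)) != NULL) {
		TAILQ_REMOVE(&wakeq, p, p_runq);
		setrunnable(p);
	}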
ok claudio@
|
|
ok guenther@ deraadt@
|
|
|
|
that it has been replaced with pinsyscalls(2) [which tells the kernel
the location of all system calls in libc.so]
floated to various people before release, but it was prudent to wait.
|
|
outside the socket lock.
The `sb_mtx' mutex(9) is used for this case, and it should not be
released between the `so_rcv' usage check and the corresponding
sbwait() sleep. Otherwise a wakeup() could sometimes be lost.
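As a sketch of the pattern (assuming sbwait() releases and retakes
`sb_mtx' atomically around the sleep):

	mtx_enter(&sb->sb_mtx);
	while (sb->sb_cc == 0) {	/* usage check under sb_mtx */
		/* no window between the check and the sleep, so a
		 * concurrent wakeup() cannot be lost */
		if ((error = sbwait(so, sb)) != 0)
			break;
	}
	mtx_leave(&sb->sb_mtx);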
ok bluhm
|
|
Instead of calling mtx_enter_try() in each spinning loop, do it
only if the result of a lockless read indicates that the mutex has
been released. This avoids some expensive atomic compare-and-swap
operations. Up to 5% reduction of spinning time during kernel build
can be seen on an 8-core amd64 machine. On other machines there was
no visible effect.
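The spin loop now looks roughly like this (a sketch of the idea, not
the verbatim diff):

	while (mtx_enter_try(mtx) == 0) {
		/* spin on a plain load; retry the atomic op only
		 * once the owner field reads as released */
		do {
			CPU_BUSY_CYCLE();
		} while (mtx->mtx_owner != NULL);
	}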
Testing on powerpc64 revealed a bug in the mtx_owner declaration: not
the variable was volatile, but the object it points to. Move the
volatile qualifier in struct mutex to avoid a hang when going to
multiuser.
from Mateusz Guzik; input kettenis@ jca@; OK mpi@
|
|
This makes re-locking unnecessary in the uipc_*send() paths, because
it's enough to lock one socket to prevent the peer from concurrent
disconnection. As a little bonus, one unix(4) socket can perform
simultaneous transmission and reception with one exception for
uipc_rcvd(), which still requires the re-lock for connection oriented
sockets.
The socket lock is not held while filt_soread() and filt_soexcept()
are called from uipc_*send() through sorwakeup(). However, the
unlocked access to `so_options', `so_state' and `so_error' is fine.
The receiving socket can't be, or become, a listening socket. It also
can't be disconnected concurrently. This makes the SO_ACCEPTCONN,
SS_ISDISCONNECTED and SS_ISCONNECTED bits immutable; they are clear,
clear and set respectively.
`so_error' is set on the peer sockets only by unp_detach(), which also
can't be called concurrently on the sending socket.
This is also true for filt_fiforead() and filt_fifoexcept(). For other
callers like kevent(2) or doaccept() the socket lock is still held.
ok bluhm
|
|
checks from all the filesystems that support hardlinks at all into
the VFS layer. Simplify the EPERM description in link(2).
ok miod@ mpi@
|
|
|
|
ok bluhm
|
|
dead unix(4) sockets.
The difference between direct unp_scan() and sorflush() is the mbuf(9)
chain: in the first case it is still linked to `so_rcv', in the second
it is not. This is required to make the `sb_mtx' mutex(9) the only
`so_rcv' sockbuf protection and to remove socket re-locking from most
of the uipc_*send() paths. The unlinked mbuf(9) chain doesn't require
any protection, which allows the sleeping unp_discard() to run
locklessly.
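A sketch of the unlink step (simplified; the real code also resets the
rest of the sockbuf bookkeeping):

	struct mbuf *m;

	mtx_enter(&so->so_rcv.sb_mtx);
	m = so->so_rcv.sb_mb;		/* grab the chain */
	so->so_rcv.sb_mb = NULL;	/* unlink it from `so_rcv' */
	mtx_leave(&so->so_rcv.sb_mtx);
	/* `m' is private now: scan it and FRELE() the contained
	 * file descriptors without holding any sockbuf lock */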
Also, the mbuf(9) chain of the discarded socket still contains the
addresses of file descriptors, and it is much safer to unlink it
before FRELE()ing them. This is the reason to commit this diff
standalone.
ok bluhm
|
|
ok deraadt, kn, phessler
|
|
return EINVAL if set. This prevents a concurrent solisten() thread
from making this socket a listening socket while it is unlocked.
Reported-by: syzbot+4acfcd73d15382a3e7cf@syzkaller.appspotmail.com
ok mpi
|
|
m_defrag() is intended as a last resort to make DMA transfers to the
hardware. Therefore page alignment is more important than IP header
alignment. The reason why the mbuf returned by m_defrag() was
switched to IP header alignment was that ether_extract_headers()
failed in the em(4) driver with TSO on sparc64. This has been fixed
by using memcpy().
The alignment change in m_defrag() is too late in the 7.5 release
process. It may affect several drivers on different architectures.
The bus dmamap for ixl(4) on sun4v expects page alignment. Such
alignment issues and TSO mbuf mapping for IOMMU need more thought.
OK deraadt@
|
|
Pool namei_pool is initialized with IPL_NONE as the filesystem always
runs with the kernel lock. So pool_get() also needs the kernel lock
in sys_ypconnect().
OK kn@ deraadt@
|
|
From Christian Ludwig, ok claudio@
|
|
The code has outgrown the original name for this struct. Both the
external and internal APIs have used the "clockqueue" namespace for
some time when operating on it, and that name is eyeball-consistent
with "clockintr" and "clockrequest", so "clockqueue" it is.
|
|
|
|
has occurred in the process.
ok various people
|
|
The function should be in the clockqueue_intrclock namespace. Also,
"reprogram" is a better word for what the function actually does.
|
|
OpenBSD starts the system uptime clock at 1.0 instead of 0.0. We
inherited this behavior from FreeBSD when we imported kern_tc.c.
patrick@ reports that this causes a problem in sdmmc(4) during boot:
the sdmmc_delay() call in sdmmc_init() doesn't block for the full
250ms. This happens because the system hardclock() starts at 0.0 and
executes about hz times, rapidly, to "catch up" to 1.0. This
instantly expires the first hz timeout ticks, hence the short sleep.
Starting the system uptime at 0.0 fixes the problem.
Prompted by patrick@. Tested by patrick@. In snaps since Feb 19 2023.
Thread: https://marc.info/?l=openbsd-tech&m=170830229732396&w=2
ok patrick@ deraadt@
|
|
In kern_timeout.c, the to_kclock checks are not strict enough to catch
all plausible programmer mistakes. Tighten them up:
- timeout_set_flags: KASSERT that kclock is valid
- timeout_abs_ts: KASSERT that to_kclock is KCLOCK_UPTIME
We can also add to_kclock validation to softclock() and
db_show_timeout(), which may help to debug memory corruption:
- softclock: panic if to_kclock is not KCLOCK_NONE or KCLOCK_UPTIME
- db_show_timeout: print warning if to_kclock is invalid
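A rough sketch of two of these checks (constants as in kern_timeout.c):

	/* timeout_set_flags(): only valid kclocks may be configured */
	KASSERT(kclock >= KCLOCK_NONE && kclock < KCLOCK_MAX);

	/* softclock(): a running timeout must have a sane to_kclock */
	if (to->to_kclock != KCLOCK_NONE && to->to_kclock != KCLOCK_UPTIME)
		panic("%s: invalid to_kclock: %d", __func__, to->to_kclock);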
Prompted by bluhm@ in response to a syzbot panic. Hopefully these
changes help to narrow down the root cause.
Link: https://syzkaller.appspot.com/bug?extid=49d3f7118413963f651a
Reported-by: syzbot+49d3f7118413963f651a@syzkaller.appspotmail.com
ok bluhm@
|
|
The recent TSO support in em(4) triggered an alignment error on the
TCP header. In em(4), m_defrag() is called before setting up the TSO
DMA bits, and with that the TCP header was suddenly no longer aligned.
Like other mbuf functions, preserve the data alignment in m_defrag()
to prevent such unaligned packets.
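As a hypothetical sketch of the idea (not the literal diff; the copy
keeps the payload offset the original data had within a word):

	/* in m_defrag(), after allocating the replacement mbuf m0:
	 * give the copy the same payload alignment as the original */
	m0->m_data += mtod(m, unsigned long) & (sizeof(long) - 1);
	m_copydata(m, 0, m->m_pkthdr.len, mtod(m0, caddr_t));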
With help and OK bluhm@ mglocker@
|
|
pmap_unmap_direct() has been fixed; also tested by aoyama@
|
|
`pr_type'. The corresponding domain is referenced as `pr_domain'.
Otherwise dp->dom_protosw->pr_type of inet sockets always points
to inetsw[0].
ok bluhm
|
|
There is no useful work left for secondary CPUs to do in hardclock().
Disable cq_hardclock on secondary CPUs and remove the now-unnecessary
early-return from hardclock().
This change reduces every system's normal clock interrupt rate by
(HZ - HZ/10) per secondary CPU. For example, an 8-core machine
with a HZ=100 kernel should see its clock interrupt rate drop from
~1600 to ~960.
Thread: https://marc.info/?l=openbsd-tech&m=170750140915898&w=2
ok kettenis@
|
|
ok bluhm
|
|
In soreceive(), we only touch the `so_rcv' socket buffer, which has
its own `sb_mtx' mutex(9) for protection. So, we can avoid solock() in
this path - it's enough to hold `sb_mtx' in soreceive() and around the
corresponding sbappend*(). But not right now :)
For now we use the shared netlock for some inet sockets in the
soreceive() path. To protect the `so_rcv' buffer we use the `inp_mtx'
mutex(9) and pru_lock() to acquire this mutex(9) in the socket layer.
But the `inp_mtx' mutex belongs to the PCB, the socket is initialized
before the PCB, and tcp(4) sockets can exist without a PCB, so use the
`sb_mtx' mutex(9) to protect the sockbuf data instead.
This diff mechanically replaces `inp_mtx' with `sb_mtx' in the receive
path, but only for sockets which already use `inp_mtx'. All other
sockets are left as is; they will be converted later.
Since `sb_mtx' is optional, a new SB_MTXLOCK flag is introduced. If
this flag is set in `sb_flags', the `sb_mtx' mutex(9) must be taken.
New sb_mtx_lock() and sb_mtx_unlock() functions were introduced to
hide this check. They are temporary and will be replaced by
mtx_enter() once this whole area has been converted to the `sb_mtx'
mutex(9).
Also, a new sbmtxassertlocked() function is introduced to provide the
corresponding assertion for SB_MTXLOCK-marked buffers. For now only
sbappendaddr() calls it. This function is also temporary and will be
replaced by MTX_ASSERT_LOCKED() later.
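The helpers are as thin as they sound; a sketch:

	void
	sb_mtx_lock(struct sockbuf *sb)
	{
		if (sb->sb_flags & SB_MTXLOCK)
			mtx_enter(&sb->sb_mtx);
	}

	void
	sb_mtx_unlock(struct sockbuf *sb)
	{
		if (sb->sb_flags & SB_MTXLOCK)
			mtx_leave(&sb->sb_mtx);
	}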
ok bluhm
|
|
the empty string, rather than an error.
ok krw
|
|
To improve the utility of dt(4)'s interval and profile probes we need
to move the probe entry points from the fixed-frequency hardclock() to
a dedicated clock interrupt callback so that the probes can fire at
arbitrary frequencies.
- Remove entry points for interval/profile probes from hardclock().
- Merge dt_prov_profile_enter(), dt_prov_interval_enter(), and
dt_prov_profile_fire() into one function, dt_clock(). This is
the now-unified callback for interval/profile probes. dt_clock()
will consume multiple events during a single execution if it is
delayed, but on platforms with high-quality interrupt clocks this
should be rare (see the sketch after this list).
- Each struct dt_pcb gets its own clockintr handle, dp_clockintr.
- In struct dt_pcb, replace dp_maxtick/dp_nticks with dp_nsecs,
the PCB's sampling period. Asynchronous probes must initialize
dp_nsecs to a non-zero value during dtpv_alloc().
- In struct dt_pcb, replace dp_cpuid with dp_cpu so that
dt_ioctl_record_start() knows where to bind the PCB's
dp_clockintr.
- dt_ioctl_record_start() binds, staggers, and starts all
interval/profile PCBs on the given dt_softc. Each dp_clockintr
is given a reference to its enclosing PCB so that dt_clock()
doesn't need to search for it. The staggering sort-of simulates
the current behavior under hardclock().
- dt_ioctl_record_stop() unbinds all interval/profile PCBs. The
CL_BARRIER ensures that dp_clockintr's PCB reference is not in
use by dt_clock() so that the PCB may be safely freed upon
return from dt_ioctl_record_stop(). Blocking while holding
dt_lock is not ideal, but in practice blocking in this spot is
rare and dt_clock() completes quickly on all but the oldest
hardware. An extremely unlucky thread could block for every
interval/profile PCB on the softc, but this is implausible.
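A sketch of the unified callback (clockrequest API as in the tree;
ring-buffer details elided):

	void
	dt_clock(struct clockrequest *cr, void *cf, void *arg)
	{
		struct dt_pcb *dp = arg;
		uint64_t count;

		count = clockrequest_advance(cr, dp->dp_nsecs);
		while (count-- > 0) {
			/* record one sample per elapsed period */
		}
	}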
DT_FA_PROFILE values are up-to-date for amd64, i386, and macppc.
Somebody with the right hardware needs to check-and-maybe-fix the
values on octeon, powerpc64, and sparc64.
Joint effort with mpi@.
Thread: https://marc.info/?l=openbsd-tech&m=170629371821879&w=2
ok mpi@
|
|
The clockintr_unbind() function cancels any pending execution of the
given clock interrupt object's callback and severs the binding between
the object and its host CPU. Upon return from clockintr_unbind(), the
clock interrupt object may be rebound with a call to clockintr_bind().
The optional CL_BARRIER flag tells clockintr_unbind() to block if the
clockintr's callback function is executing at the moment of the call.
This is useful when the clockintr's arg is a shared reference and the
caller needs to be certain the reference is inactive.
Now that clockintrs can be bound and unbound repeatedly, there is more
room for error. To help catch programmer errors, clockintr_unbind()
sets cl_queue to NULL. Calls to other API functions after a clockintr
is unbound will then fault on a NULL dereference. clockintr_bind()
also KASSERTs that cl_queue is NULL to ensure the clockintr is not
already bound. These checks are not perfect, but they do catch some
common errors.
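The intended lifecycle, as a sketch (argument lists abbreviated):

	/* bind a callback to a CPU and arm it */
	clockintr_bind(&cl, ci, dt_clock, dp);
	clockintr_schedule(&cl, deadline);

	/* cancel; CL_BARRIER waits for a running callback to return,
	 * so `dp' may be freed safely afterwards */
	clockintr_unbind(&cl, CL_BARRIER);

	/* the object may now be rebound, possibly to another CPU */
	clockintr_bind(&cl, other_ci, dt_clock, dp);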
With input from mpi@.
Thread: https://marc.info/?l=openbsd-tech&m=170629367121800&w=2
ok mpi@
|