|
it is a dangerous alternative entry point for all system calls, and thus
incompatible with the precision system call entry point scheme we are
heading towards. This has been a 3-year mission:
First perl needed a code-generated wrapper to fake syscall(2) as a giant
switch table, then all the ports were cleaned up with relatively minor fixes,
except for "go". "go" required two fixes -- 1) a framework issue with
old library versions, and 2) like perl, a fake syscall(2) wrapper to
handle ioctl(2) and sysctl(2), since "syscall(SYS_ioctl" occurs all over
the place in the "go" ecosystem: the "go developers" are plan9-loving,
unix-hating folk who tried to build an ecosystem without allowing "ioctl".
ok kettenis, jsing, afresh1, sthen
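A hypothetical sketch of such a switch-table wrapper in C; the function
name and the handled cases are illustrative, not perl's actual generated
code:

    #include <stdarg.h>
    #include <errno.h>
    #include <unistd.h>
    #include <sys/ioctl.h>
    #include <sys/syscall.h>

    /* Hypothetical switch-table stand-in for the removed syscall(2). */
    long
    fake_syscall(long n, ...)
    {
        va_list ap;
        long ret = -1;

        va_start(ap, n);
        switch (n) {
        case SYS_getpid:
            ret = getpid();
            break;
        case SYS_ioctl: {
            int fd = va_arg(ap, int);
            unsigned long req = va_arg(ap, unsigned long);
            void *data = va_arg(ap, void *);
            ret = ioctl(fd, req, data);
            break;
        }
        default:
            errno = ENOSYS;
            break;
        }
        va_end(ap);
        return ret;
    }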
|
|
descriptor (pted) pool in the arm64 pmap implementation. This
significantly reduces the side-effects of lock contention on the kernel
map lock that is (incorrectly) translated into excessive page daemon
wakeups. This is not a perfect solution but it does lead to significant
speedups on machines with many CPU cores.
This requires adding a new pmap_init_percpu() function that gets called
at the point where the kernel is ready to set up the per-CPU pool caches.
Dummy implementations of this function are added for all non-arm64
architectures. Some other architectures can probably benefit from
providing an actual implementation that sets up per-CPU caches for
pmap pools as well.
ok phessler@, claudio@, miod@, patrick@
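A minimal sketch of the new hook, assuming pool_cache_init(9) is the
mechanism and that the arm64 pool names below are correct:

    /* Non-arm64 architectures: dummy implementation. */
    void
    pmap_init_percpu(void)
    {
    }

    /* arm64 sketch: enable per-CPU caches on the pmap pools. */
    void
    pmap_init_percpu(void)
    {
        pool_cache_init(&pmap_pted_pool);
        pool_cache_init(&pmap_vp_pool);
    }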
|
|
|
|
ok kettenis
|
|
|
|
before future work where ld.so(1) will need this new system call.
Putting this in the kernel ahead of time will save some grief.
ok kettenis
|
|
|
|
For internet sockets sobind() runs with exclusive net lock due to
solock(). For unix domain sockets uipc_bind() grabs the kernel
lock itself. So sys_bind() is MP safe. Add NOLOCK flag to avoid
kernel lock.
OK mvs@
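Illustration of what the annotation looks like in syscalls.master; the
exact layout of the entry is assumed:

    104    STD NOLOCK    { int sys_bind(int s, const struct sockaddr *name, \
                   socklen_t namelen); }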
|
|
NKMEMPAGES_MIN was removed a long time ago on all archs so there is no
need to keep it.
Also initialize nkmempages_max at compile time, since sparc (with variable
page size) is long gone as well.
No objection from miod@
|
|
|
|
ok miod@
|
|
Since revision 1.26 dt_ioctl_get_auxbase() is calling process_domem().
Build the latter function into the kernel if the pseudo device dt is enabled.
from Matthias Pitzl; OK claudio@
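A sketch of the implied compile guard; the exact condition and prototype
are assumptions:

    #include "dt.h"            /* config(8) generates NDT */

    #if defined(PTRACE) || NDT > 0
    int    process_domem(struct proc *, struct process *, struct uio *, int);
    #endif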
|
|
The disklabel UID passed in is not modified; reflect that and allow
callers to use 'const char *'.
OK miod
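In diff form the change amounts to constifying one parameter; the
function shown is illustrative:

    -int    disk_map(char *path, char *mappath, int size, int flags);
    +int    disk_map(const char *path, char *mappath, int size, int flags);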
|
|
The soreceive() code depends on the fact that MSG_EOR is set on the
last mbuf of the chain. In sbappendcontrol() move MSG_EOR to the
end, like sbcompress() does. This fixes MSG_EOR handling for
SOCK_SEQPACKET sockets with control messages.
bug reported by Eric Wong
analysed, tested and OK claudio@
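A minimal sketch of the fix's shape, assuming the mbuf flag M_EOR and a
chain starting at m:

    struct mbuf *n;
    int eor = 0;

    /* Move M_EOR to the end of the chain, as sbcompress() does. */
    for (n = m; n != NULL; n = n->m_next) {
        eor |= n->m_flags & M_EOR;
        n->m_flags &= ~M_EOR;
    }
    if (eor) {
        for (n = m; n->m_next != NULL; n = n->m_next)
            continue;
        n->m_flags |= M_EOR;
    }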
|
|
If single thread mode is already held by another thread, just unwind to
userret(), wait there, and retry the system call later (if at all).
OK mpi@
|
|
where a switch happens outside. Clean up these code paths and make them
machine independent.
- when a process forks (fork, tfork, kthread), the new proc needs to
somehow be scheduled for the first time. This is done by proc_trampoline.
Since proc_trampoline is machine dependent assembler code change
the MP specific proc_trampoline_mp() to proc_trampoline_mi() and make
sure it is now always called.
- cpu_hatch: when booting APs the code needs to jump to the first proc
running on that CPU. This should be the idle thread for that CPU.
- sched_exit: when a proc exits it needs to switch away from itself and
then instruct the reaper to clean up the rest. This is done by switching
to the idle loop.
Since the last two cases require a context switch to the idle proc, factor
out the common code into sched_toidle() and use it in those places.
Tested by many on all archs.
OK miod@ mpi@ cheloha@
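A rough sketch of the factored-out helper; locking is elided and the
field names are assumptions:

    void
    sched_toidle(void)
    {
        struct schedstate_percpu *spc = &curcpu()->ci_schedstate;
        struct proc *idle = spc->spc_idleproc;

        /* Hand control to this CPU's idle thread; never returns. */
        idle->p_stat = SRUN;
        cpu_switchto(NULL, idle);
        panic("cpu_switchto returned");
    }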
|
|
m_split() calls m_align() to initialize the data pointer of the newly
allocated mbuf. If the new mbuf will be converted to a cluster,
this is not necessary. If additionally the new mbuf is larger than
MLEN, this can lead to a panic.
Only call m_align() when a valid m_data is needed. This is the
case if we do not reference the existing cluster, but memcpy() the
data into the new mbuf.
Reported-by: syzbot+0e6817f5877926f0e96a@syzkaller.appspotmail.com
OK claudio@ deraadt@
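Sketch of the distinction inside m_split(); variable names follow the
surrounding code and are assumptions:

    if (m->m_flags & M_EXT) {
        /* New mbuf references the existing cluster; no m_align(). */
        n->m_data = m->m_data + len;
        /* ...take a reference on the cluster (elided)... */
    } else {
        /* Fresh storage: align first, then copy. */
        m_align(n, remain);
        memcpy(mtod(n, caddr_t), mtod(m, caddr_t) + len, remain);
    }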
|
|
The API's behavior when invoked from a callback function is impossible
to document. Move the special behavior into a distinct namespace,
"clockrequest".
- Add a 'struct clockrequest'. Basically a stripped-down 'struct clockintr'
for exclusive use during clockintr_dispatch().
- In clockintr_queue, replace the "cq_shadow" clockintr with a "cq_request"
clockrequest. They serve the same purpose.
- CLST_SHADOW_PENDING -> CR_RESCHEDULE; different namespace, same meaning.
- CLST_IGNORE_SHADOW -> CLST_IGNORE_REQUEST; same meaning.
- Move shadow branch in clockintr_advance() to clockrequest_advance().
- clockintr_request_random() becomes clockrequest_advance_random().
- Delete dead shadow branches in clockintr_cancel(), clockintr_schedule().
- Callback functions now get a clockrequest pointer instead of a special
clockintr pointer: update all prototypes, callers.
No functional change intended.
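The resulting callback shape, sketched; parameter names and the helpers
called (period_nsec, do_work) are assumptions:

    void
    example_clock(struct clockrequest *cr, void *frame, void *arg)
    {
        uint64_t count;

        /* Reschedule via the request, not a clockintr pointer. */
        count = clockrequest_advance(cr, period_nsec);
        if (count > 0)
            do_work(count, frame, arg);    /* hypothetical */
    }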
|
|
Add a TIMEOUT_MPSAFE flag to signal that a timeout is safe to run
without the kernel lock. Currently, TIMEOUT_MPSAFE requires
TIMEOUT_PROC. When the softclock() is unlocked in the future this
dependency will be removed.
On MULTIPROCESSOR kernels, softclock() now shunts TIMEOUT_MPSAFE
timeouts to a dedicated "timeout_proc_mp" bucket for processing by the
dedicated softclock_thread_mp() kthread. Unlike softclock_thread(),
softclock_thread_mp() is not pinned to any CPU and runs at IPL_NONE.
Prompted by bluhm@. Lots of input from bluhm@. Joint work with mvs@.
Prompt: https://marc.info/?l=openbsd-tech&m=169646019109736&w=2
Thread: https://marc.info/?l=openbsd-tech&m=169652212131109&w=2
ok mvs@
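Hedged usage sketch, assuming the timeout_set_flags(9) signature with a
kclock argument; the driver names are made up:

    struct timeout sc_tick;

    /* MP-safe timeout; per the commit it must also be TIMEOUT_PROC. */
    timeout_set_flags(&sc_tick, mydrv_tick, sc, KCLOCK_NONE,
        TIMEOUT_PROC | TIMEOUT_MPSAFE);
    timeout_add_sec(&sc_tick, 1);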
|
|
Technically, all the current fixed clock interrupt periods fit within
an unsigned 32-bit value. But 32-bit multiplication is an accident
waiting to happen. So, expand the fixed periods for hardclock,
statclock, profclock, and roundrobin to 64-bit values.
One exception: statclock_mask remains 32-bit because random(9) yields
32-bit values. Update the initclocks() comment to make it clear that
this is not an accident.
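The hazard, illustrated with made-up values:

    uint32_t period = 10000000;                /* 10 ms in nsec */
    uint64_t bad = period * 500;               /* 32-bit multiply: wraps */
    uint64_t good = (uint64_t)period * 500;    /* 5000000000: correct */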
|
|
Prototype clockintr_schedule() in <sys/clockintr.h>.
|
|
Rename these parameters to align the code with the forthcoming
manpage. No functional change.
|
|
intrclock_rearm() and intrclock_trigger() are not part of the public
API, so there's no reason to implement them in sys/clockintr.h. Move
them to kern_clockintr.c.
|
|
channel instead of inventing its own.
OK kettenis@ mvs@
|
|
ports.
Suggested by deraadt@, USB route idea from kettenis@. Feedback
from anton@, man page improvements from deraadt@, jmc@,
schwarze@.
ok deraadt@ kettenis@
|
|
The mode can now be or-ed with SINGLE_DEEP or SINGLE_NOWAIT to alter
the behaviour of single_thread_set(). This allows explicit control
of the SINGLE_DEEP behaviour.
If SINGLE_DEEP is set, the deep flag is passed to the initial check call,
so the check will error out instead of suspending (SINGLE_UNWIND)
or exiting (SINGLE_EXIT). The SINGLE_DEEP flag is required in calls to
single_thread_set() outside of userret. E.g. at the start of sys_execve
because the proc is not allowed to call exit1() in that location.
SINGLE_NOWAIT skips the wait at the end of single_thread_set() and therefore
returns BEFORE all threads have been parked. Currently this is only used by
the ptrace code and should not be used anywhere else. Not waiting for all
threads to settle is asking for trouble.
This solves an issue by using SINGLE_UNWIND in the coredump case where
the code should actually exit in case another thread crashed moments earlier.
Also the SINGLE_UNWIND in pledge_fail() is now marked SINGLE_DEEP since
the call to pledge_fail() is for sure not at the kernel boundary.
OK mpi@
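Sketch of a SINGLE_DEEP call site as described for sys_execve; error
handling abbreviated:

    /*
     * Deep in the kernel, not at the boundary: the check call must
     * error out instead of suspending or exiting.
     */
    if ((error = single_thread_set(p, SINGLE_UNWIND | SINGLE_DEEP)))
        return (error);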
|
|
|
|
|
|
Also remove useless NULL check.
ok bluhm@
|
|
protecting message buffer.
ok bluhm
|
|
- PS_PROFIL bit is moved into the process cleanup block where it belongs
- The proc read-only limit cache cleanup is moved up right after clearing
p->p_fd cache. lim_free() can potentially sleep and so needs to be
above the line where p_stat is set to SDEAD.
With and OK jca@
|
|
Include missing fields -- like the sleep channel and message -- and
show both the PID and TID of the proc.
Also add '/t' as an argument that can be used to specify a proc by TID
instead of by address.
OK mpi@
|
|
There is the same check in sched_chooseproc() but that is too late
to know where the bad insertion into the runqueue was done.
OK mpi@
|
|
SINGLE_UNWIND unwinds to the kernel boundary. On the other hand
SINGLE_SUSPEND will sleep inside tsleep(9) and other sleep functions.
Since the code will call exit1() very soon afterwards, it is better to
unwind right away.
Now one could argue that for coredumps all threads should stop asap to
get a clean dump. With SINGLE_UNWIND the sleep will fail with ERESTART
and no copyout should happen in that case.
This is a bit of a workaround since SINGLE_SUSPEND has a small race
where single_thread_wait() returns before all threads are really stopped.
If SINGLE_EXIT is called quickly afterwards, this can blow up inside
sleep_finish().
Reported-by: syzbot+3ef066fcfaf991f2ac2c@syzkaller.appspotmail.com
OK mpi@ kettenis@
|
|
With input from claudio@ and deraadt@.
|
|
"cq_all" is a more obvious name than "cq_est". It's the list of all
established clockintrs. Duh.
|
|
All the state initialization once done in clockintr_init() has been
moved to other parts of the kernel. It's a dead function. Remove it.
Likewise, the clockintr_flags variable no longer sports any meaningful
flags. Remove it. This frees up the CL_* flag namespace, which might
be useful to the clockintr frontend if we ever need to add behavior
flags to any of those functions.
|
|
Move the schedcpu() and update_loadavg() timeout structs from
scheduler_start() into their respective callback functions and
statically initialize them with TIMEOUT_INITIALIZER(9).
The structs are already hidden from the global namespace and the
timeouts are already self-managing, so we may as well fully
consolidate things.
Thread: https://marc.info/?l=openbsd-tech&m=169488184019047&w=2
"Sure." claudio@
|
|
Using a scratch buffer makes it possible to take a consistent snapshot of
per-CPU counters without having to allocate memory.
Makes the ddb(4) "show uvmexp" command work in OOM situations.
ok kn@, mvs@, cheloha@
|
|
|
|
To separate the hardclock from the clock interrupt subsystem we'll
need to move all related state out first.
hz(9) is set when we return from cpu_initclocks(), so it's safe to
move hardclock_period and roundrobin_period initialization out into
initclocks(). Move hardclock_period itself out into kern_clock.c
alongside the statclock variables.
|
|
Move the statclock handle from clockintr_queue.cq_statclock to
schedstate_percpu.spc_statclock. Establish spc_statclock during
sched_init_cpu() alongside the other scheduler clock interrupts.
Thread: https://marc.info/?l=openbsd-tech&m=169428749720476&w=2
|
|
- Move remaining statclock variables from kern_clockintr.c to
kern_clock.c. Move statclock variable initialization from
clockintr_init() into initclocks().
- Change statclock() prototype to make it a legal clockintr
callback function and establish the handle with statclock()
instead of clockintr_statclock().
- Merge the contents of clockintr_statclock() into statclock().
statclock() can now reschedule itself and handles multiple
expirations transparently.
- Make statclock_avg visible from sys/systm.h so that clockintr_cpu_init()
can use it to advance the statclock across suspend/hibernate.
Thread: https://marc.info/?l=openbsd-tech&m=169428749720476&w=2
|
|
statclock() is going to need this. Move the prototype into the public API.
Thread: https://marc.info/?l=openbsd-tech&m=169428749720476&w=2
|
|
In order to separate the statclock from the clock interrupt subsystem
we need to move all statclock state out into the broader kernel.
Start by replacing the CL_RNDSTAT flag with a new global variable,
"statclock_is_randomized", in kern_clock.c. Update all clockintr_init()
callers to set the boolean instead of passing the flag.
Thread: https://marc.info/?l=openbsd-tech&m=169428749720476&w=2
|
|
The change to the single thread API results in crashes inside exit1()
as found by Syzkaller. There seems to be a race in the exit codepath.
What exactly fails is not really clear, therefore revert for now.
This should fix the following Syzkaller reports:
Reported-by: syzbot+38efb425eada701ca8bb@syzkaller.appspotmail.com
Reported-by: syzbot+ecc0e8628b3db39b5b17@syzkaller.appspotmail.com
and maybe more.
Reverted commits:
----------------------------
Protect ps_single, ps_singlecnt and ps_threadcnt by the process mutex.
The single thread API needs to lock the process to enter single thread
mode and does not need to stop the scheduler.
This code changes ps_singlecount from a count down to zero to ps_singlecnt
which counts up until equal to ps_threadcnt (in which case all threads
are properly asleep).
Tested by phessler@, OK mpi@ cheloha@
----------------------------
Change how ps_threads and p_thr_link are locked away from using SCHED_LOCK.
The per process thread list can be traversed (read) by holding either
the KERNEL_LOCK or the per process ps_mtx (instead of SCHED_LOCK).
Abusing the SCHED_LOCK for this makes it impossible to split up the
scheduler lock into something more fine grained.
Tested by phessler@, ok mpi@
----------------------------
Fix SCHED_LOCK() leak in single_thread_set()
In the (q->p_flag & P_WEXIT) branch is a continue that did not release
the SCHED_LOCK. Refactor the code a bit to simplify the places SCHED_LOCK
is grabbed and released.
Reported-by: syzbot+ea26d351acfad3bb3f15@syzkaller.appspotmail.com
OK kettenis@
|
|
Callers can now provide an argument pointer to clockintr_establish().
The pointer is kept in a new struct clockintr member, cl_arg. The
pointer is passed as the third parameter to clockintr.cl_func when it
is executed during clockintr_dispatch(). Like the callback function,
the callback argument is immutable after the clockintr is established.
At present, nothing uses this. All current clockintr_establish()
callers pass a NULL arg pointer. However, I am confident that dt(4)'s
profile provider will need this in the near future.
Requested by dlg@ back in March.
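Sketch of the extended API; the clockintr_establish() signature and the
dt(4) names are assumptions:

    /* Callback now receives cl_arg as its third parameter. */
    void
    profile_intr(struct clockintr *cl, void *frame, void *arg)
    {
        struct dt_pcb *dp = arg;    /* hypothetical consumer */

        /* ...take a profiling sample for dp (elided)... */
    }

    cl = clockintr_establish(cq, profile_intr, dp);    /* arg, not NULL */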
|
|
Adding an intermediate pointer lets me shorten "cq->cq_shadow" to
just "shadow". I think it makes the dispatch loop logic a little
easier to read.
While here, add a clarifying comment.
|
|
Now that alpha no longer sets schedhz, schedhz is a dead variable.
Remove it.
For now, leave the schedclock() call in place in statclock(). It
still runs at its default rate of (stathz / 4).
Part of mpi@'s WIP scheduler patch. Suggested by mpi@.
Thread: https://marc.info/?l=openbsd-tech&m=169419781317781&w=2
ok mpi@
|
|
With the switch to clockintr_schedule_locked(), clockintr_advance() is
now much shorter and the early-return from the non-mutex path doesn't
make the function any easier to read. Move the mutex path into the else
branch and always return 'count' at the end of the function.
|