This refactoring will help future scheduler locking, in particular to
shrink the SCHED_LOCK().
No intended behavior change.
ok visa@
----------------------------
It currently creates a lock ordering problem because SCHED_LOCK() is taken
by hardclock(). That means the "priorities" of a thread should be moved
out of the SCHED_LOCK() first in order to make progress.
Reported-by: syzbot+8e4863b3dde88eb706dc@syzkaller.appspotmail.com
via anton@ as well as by kettenis@
----------------------------
Note that hardclock(9) still increments p_{u,s,i}ticks without holding a
lock.
ok visa@, cheloha@
----------------------------
Otherwise, the system can crash in smr_call_impl() if SMT is
enabled later.
Crash reported by jcs@
----------------------------
objects that readers can access without locking. This provides a basis
for read-copy-update operations.
Readers access SMR-protected shared objects inside an SMR read-side
critical section, where sleeping is not allowed. To reclaim
an SMR-protected object, the writer has to ensure mutual exclusion of
other writers, remove the object's shared reference and wait until
read-side references cannot exist any longer. As an alternative to
waiting, the writer can schedule a callback that gets invoked when
reclamation is safe.
The mechanism relies on CPU quiescent states to determine when an
SMR-protected object is ready for reclamation.
The <sys/smr.h> header additionally provides an implementation of
singly- and doubly-linked lists that can be used together with SMR.
These lists allow lockless read access with a concurrent writer.
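A minimal usage sketch (hedged: the entry type, callback, and writer
locking are illustrative; the read-side and reclamation calls are the
<sys/smr.h> interfaces described above):

    struct entry {
        SMR_LIST_ENTRY(entry) e_link;
        struct smr_entry e_smr;
        int e_key;
    };

    /* Reader: lockless, but must not sleep inside the critical section. */
    smr_read_enter();
    SMR_LIST_FOREACH(e, &head, e_link) {
        if (e->e_key == key)
            break;          /* e may be used until smr_read_leave() */
    }
    smr_read_leave();

    /* Writer: unlink under the writer lock, then defer reclamation. */
    SMR_LIST_REMOVE_LOCKED(e, e_link);
    smr_init(&e->e_smr);
    smr_call(&e->e_smr, entry_free_cb, e);  /* entry_free_cb: hypothetical */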
Discussed with many
OK mpi@ sashan@
----------------------------
Because of hw.smt we need a way to determine whether a given CPU is "online"
or "offline" from userspace. KERN_CPTIME2 is an array, and so cannot be
cleanly extended for this purpose, so add a new sysctl(2) KERN_CPUSTATS
with an extensible struct. At the moment it's just KERN_CPTIME2 with a
flags member, but it can grow as needed.
KERN_CPUSTATS appears to have been defined by BSDi long ago, but there are
few (if any) packages in the wild still using the symbol, so breakage in ports
should be near zero. No other system inherited the symbol from BSDi, either.
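A consumer might read the per-CPU stats roughly as follows (a sketch;
error handling abbreviated, `cpu` is the CPU number being queried):

    struct cpustats cs;
    int mib[3] = { CTL_KERN, KERN_CPUSTATS, cpu };
    size_t size = sizeof(cs);

    if (sysctl(mib, 3, &cs, &size, NULL, 0) == -1)
        err(1, "sysctl");
    if (cs.cs_flags & CPUSTATS_ONLINE) {
        /* cs.cs_time[] holds this CPU's CP_* tick counters. */
    } else {
        /* CPU is offline: draw placeholder marks instead. */
    }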
Then, use the new sysctl(2) in systat(1) and top(1):
- systat(1) draws placeholder marks ('-') instead of percentages for
offline CPUs in the cpu view.
- systat(1) omits offline CPU ticks when drawing the "big bar" in
the vmstat view. The upshot is that the bar isn't half idle when
half your logical CPUs are disabled.
- top(1) does not draw lines for offline CPUs; if CPUs toggle on or
offline in interactive mode we redraw the display to expand/reduce
space for the new/missing CPUs. This is consistent with what some
top(1) implementations do on Linux.
- top(1) omits offline CPUs from the totals when CPU totals are
combined into a single line (the '-1' flag).
Originally prompted by deraadt@. Discussed endlessly with deraadt@,
kettenis@, and sthen@. Tested by jmc@ and jca@. Earlier versions also
discussed with jca@. Earlier versions tested by jmc@, tb@, and many
others.
docs ok jmc@, kernel bits ok kettenis@, everything ok sthen@,
"Is your stuff in yet?" deraadt@
----------------------------
ok kettenis deraadt
----------------------------
This lets userspace distinguish between idle CPUs and those that are
not schedulable because hw.smt=0.
A subsequent commit probably needs to add documentation for this
to sysctl.2 (and perhaps elsewhere) after the dust settles.
Also included here are changes to systat(1) and top(1) that account
for the ENODEV case and adjust behavior accordingly:
- systat(1)'s cpu view prints placeholder marks ('-') instead of
percentages for each state if the given CPU is offline.
- systat(1)'s vmstat view checks for offline CPUs when computing the
machine state total and excludes them, so the CPU usage graph
only represents the states for online CPUs.
- top(1) does not draw CPU rows for offline CPUs when the view is
redrawn. If CPUs "go offline", percentages for each state are
replaced by placeholder marks ('-'); the view will need to be
redrawn to remove these rows. If CPUs "go online" the view will
need to be redrawn to show these new CPUs. In "combined CPU" mode,
the count and the state totals only represent online CPUs.
Ports using KERN_CPTIME2 will need to be updated. The changes
described above to make systat(1) and top(1) aware of the ENODEV
case *and* gracefully handle a changing HW_NCPUONLINE while the
application is running are not necessarily appropriate for each
and every port.
The changes described above are so extensive in part to demonstrate
one way a program *might* be made robust to changing CPU availability.
In particular, changing hw.smt after boot is an extremely rare event,
and this needs to be weighed when updating ports.
The logic needed to account for the KERN_CPTIME2 ENODEV case is
very roughly:
    if (sysctl(...) == -1) {
        if (errno != ENODEV) {
            /* Actual error occurred. */
        } else {
            /* CPU is offline. */
        }
    } else {
        /* CPU is online and CPU states were set by sysctl(2). */
    }
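Fleshed out slightly (still a sketch; assumes KERN_CPTIME2's
u_int64_t[CPUSTATES] payload and a CPU number `cpu`):

    u_int64_t cp_time2[CPUSTATES];
    int mib[3] = { CTL_KERN, KERN_CPTIME2, cpu };
    size_t size = sizeof(cp_time2);

    if (sysctl(mib, 3, cp_time2, &size, NULL, 0) == -1) {
        if (errno != ENODEV)
            err(1, "sysctl");   /* actual error */
        /* CPU is offline: show placeholders, omit it from totals. */
    } else {
        /* CPU is online: cp_time2[] holds its state ticks. */
    }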
Prompted by deraadt@. Basic idea for ENODEV from kettenis@. Discussed at
length with kettenis@. Additional testing by tb@.
No complaints from hackers@ after a week.
ok kettenis@, "I think you should commit [now]" deraadt@
----------------------------
The introduction of hw.smt means that logical CPUs can be disabled
after boot and prior to suspend/resume. If hw.smt=0 (the default),
there needs to be a way to count the number of hardware threads
available on the system at any given time.
So, import HW_NCPUONLINE/hw.ncpuonline from NetBSD and document it.
hw.ncpu becomes equal to the number of CPUs given to sched_init_cpu()
during boot, while hw.ncpuonline is equal to the number of CPUs available
to the scheduler in the cpuset "sched_all_cpus". Set _SC_NPROCESSORS_ONLN
equal to this new sysctl and keep _SC_NPROCESSORS_CONF equal to hw.ncpu.
This is preferable to adding a new sysctl to count the number of
configured CPUs and keeping hw.ncpu equal to the number of online
CPUs because such a change would break software in the ecosystem
that relies on HW_NCPU/hw.ncpu to measure CPU usage and the like.
Such software in base includes top(1), systat(1), and snmpd(8),
and perhaps others.
We don't need additional locking to count the cardinality of a cpuset
in this case because the only interfaces that can modify said cardinality
are sysctl(2) and ioctl(2), both of which are under the KERNEL_LOCK.
Software using HW_NCPU/hw.ncpu to determine optimal parallelism will need
to be updated to use HW_NCPUONLINE/hw.ncpuonline. Until then, such software
may perform suboptimally. However, most changes will be similar to the
change included here for libcxx's std::thread::hardware_concurrency():
using HW_NCPUONLINE in lieu of HW_NCPU should be sufficient for determining
optimal parallelism for most software if the change to _SC_NPROCESSORS_ONLN
is insufficient.
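For instance, code sizing a worker pool might use something like the
following (a sketch; either interface should return the online count
per the description above):

    int
    ncpu_online(void)
    {
        int mib[2] = { CTL_HW, HW_NCPUONLINE };
        int n;
        size_t size = sizeof(n);

        if (sysctl(mib, 2, &n, &size, NULL, 0) == -1)
            n = (int)sysconf(_SC_NPROCESSORS_ONLN); /* same count */
        return n;
    }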
Prompted by deraadt. Discussed at length with kettenis, deraadt, and sthen.
Lots of patch tweaks from kettenis.
ok kettenis, "proceed" deraadt
----------------------------
error that would happen otherwise when a traced and stopped
multithreaded process is forced to exit. The error shows up as a kernel
panic when WITNESS is enabled. Without WITNESS, the error causes
a system hang.
sched_exit() expected that a single KERNEL_UNLOCK() would release
the lock completely. That assumption is wrong when an exit happens
through the signal tracing logic:
    sched_exit
    exit1
    single_thread_check
    single_thread_set
    issignal        <-- KERNEL_LOCK()
    userret         <-- KERNEL_LOCK()
    syscall
The error is a regression of r1.216 of kern_sig.c.
Panic reported and fix tested by Laurence Tratt
OK mpi@
----------------------------
a CPU.
ok deraadt@
----------------------------
TLBs and L1 caches between threads. This can make cache timing
attacks a lot easier and we strongly suspect that this will make
several spectre-class bugs exploitable. Especially on Intel's SMT
implementation, better known as Hyper-threading. We really
should not run different security domains on different processor
threads of the same core. Unfortunately changing our scheduler to
take this into account is far from trivial. Since many modern
machines no longer provide the ability to disable Hyper-threading in
the BIOS setup, provide a way to disable the use of additional
processor threads in our scheduler. And since we suspect there are
serious risks, we disable them by default. This can be controlled
through a new hw.smt sysctl. For now this only works on Intel CPUs
when running OpenBSD/amd64. But we're planning to extend this feature
to CPUs from other vendors and other hardware architectures.
Note that SMT doesn't necessarily have a positive effect on performance;
it highly depends on the workload. In all likelihood it will actually
slow down most workloads if you have a CPU with more than two cores.
ok deraadt@
----------------------------
previously the code was using a percpu flag to manage the sleeps/wakeups,
which means multiple threads waiting for a barrier on a cpu could
race. moving to a cond struct on the stack fixes this.
while here, get rid of the sbar taskq and just use systqmp instead.
the barrier tasks are short, so there's no real downside.
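roughly the pattern, as a sketch (assumes the kernel's
cond_init/cond_wait/cond_signal and task_set/task_add interfaces;
names and details illustrative):

    static void
    barrier_task(void *arg)
    {
        struct cond *c = arg;

        /* runs from systqmp on the target cpu */
        cond_signal(c);
    }

    void
    barrier_example(void)
    {
        struct task t;
        struct cond c;  /* on the stack: no shared percpu flag to race on */

        cond_init(&c);
        task_set(&t, barrier_task, &c);
        task_add(systqmp, &t);
        cond_wait(&c, "sbar");
    }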
ok mpi@
----------------------------
with the kernel lock.
Fixes a deadlock seen by Hrvoje Popovski and dhill@.
OK mpi@, dhill@
----------------------------
- FORK_THREAD handling is a totally separate function, thread_fork(),
that is only used by sys___tfork() and which loses the flags, func,
arg, and newprocp parameters and gains a tcb parameter to guarantee
the new thread's TCB is set before the creating thread returns
- fork1() loses its stack and tidptr parameters
Common bits factor out:
- struct proc allocation and initialization moves to thread_new()
- maxthread handling moves to fork_check_maxthread()
- setting the new thread running moves to fork_thread_start()
The MD cpu_fork() function swaps its unused stacksize parameter for
a tcb parameter.
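A hedged sketch of the resulting prototypes, inferred from the
description above (exact parameter order illustrative):

    int  fork1(struct proc *curp, int flags, void (*func)(void *),
             void *arg, register_t *retval, struct proc **rnewprocp);
    int  thread_fork(struct proc *curp, void *stack, void *tcb,
             pid_t *tidptr, register_t *retval);
    void cpu_fork(struct proc *curp, struct proc *p, void *stack,
             void *tcb, void (*func)(void *), void *arg);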
luna88k testing by aoyama@, alpha testing by dlg@
ok mpi@
----------------------------
struct proc to struct process.
ok deraadt@ kettenis@
----------------------------
halting a CPU. Necessary for intr_barrier(9) to work when interrupts are
targeted at secondary CPUs.
ok mpi@, mikeb@ (a while back)
----------------------------
our own.
From Michal Mazurek, ok mmcc@
----------------------------
ok visa@
----------------------------
the CPU that handles most hardware interrupts but we don't account for that
in any way in the scheduler. So processes (and kernel threads) that are
unlucky enough to end up on this CPU will get less CPU cycles than those
running on other CPUs. This is especially true for the softnet taskq.
There, network interrupts will prevent the softnet taskq from running. This
means that the more packets we receive, the fewer packets we can actually
process and/or forward. This is why "unlocking" network drivers actually
decreases the forwarding performance. This diff restores most of the lost
performance by making it less likely that the softnet taskq ends up on the
same CPU that handles network interrupts.
Tested by Hrvoje Popovski
ok mpi@, deraadt@
----------------------------
Prevent a deadlock from occurring when intr_barrier() is called from
a non-primary CPU in the watchdog task, also enqueued on ``systq''.
ok kettenis@
----------------------------
suspend on machines with em(4) now that it uses intr_barrier(9).
ok krw@
----------------------------
the sense that it guarantees that the specified CPU went through the
scheduler. This also guarantees that interrupt handlers running on that CPU
will have finished when sched_barrier() returns.
ok miod@, guenther@
----------------------------
have any direct symbols used. Tested for indirect use by compiling
amd64/i386/sparc64 kernels.
ok tedu@ deraadt@
----------------------------
and SPCF_HALTED; these flags only make sense on secondary CPUs, which are
unlikely to be present on an SP kernel.
ok kettenis@
----------------------------
and return the current CPU, otherwise sched_stop_secondary_cpus()
will spin forever trying to empty its run queues. Fixes hangs during suspend
that many people reported over the last couple of days.
ok bcook@, guenther@
----------------------------
TAILQ_FOREACH() isn't safe to use in sched_chooseproc() to iterate
over the run queues because within the loop body we remove the threads
from their run queues and reinsert them elsewhere. As a result, we
end up only draining the first thread of each run queue rather than
all of them.
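One safe pattern is to drain via TAILQ_FIRST() instead (a sketch; the
queue and field names are illustrative):

    struct proc *p;

    while ((p = TAILQ_FIRST(&spc->spc_qs[queue])) != NULL) {
        TAILQ_REMOVE(&spc->spc_qs[queue], p, p_runq);
        /* ... reinsert p on another run queue ... */
    }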
ok kettenis
----------------------------
and FORK_SYSTEM as a flag to set them. This eliminates needing to
peek into other processes' threads in various places. Inspired by NetBSD
ok miod@ matthew@
----------------------------
Linux-compat clone() syscall when *not* using CLONE_THREAD. pirofti@
confirms Opera runs in compat without this, so out it goes; one less hair
to choke on in kern_exit.c
ok tedu@ pirofti@
----------------------------
There is a race condition which might trigger a case where two cpus try
to run the same idle thread.
The problem arises when one cpu steals the idle proc of another cpu and
this other cpu ends up running the idle thread via spc->spc_idleproc,
resulting in two cpus trying to cpu_switchto(idleX).
On startup, idle procs are scattered around different runqueues; the
decision for scheduling is:
1. look at my runqueue.
2. if empty, look at other dudes' runqueues.
3. if empty, select idle proc via spc->spc_idleproc.
The problem is that cpu0's idle0 might be running on cpu1 due to step 1
or 2 and cpu0 hits step 3.
So cpu0 will select idle0, while cpu1 is in fact running it already.
The solution is to never place idle on a runqueue, therefore being
only selectable through spc->spc_idleproc.
This race can be more easily triggered on an HT cpu in virtualized
environments, where the guest more often than not doesn't have the cpu
for itself, so timing gets shuffled.
ok tedu@ guenther@
go ahead after t2k13 deraadt@
----------------------------
ok matthew@ deraadt@
----------------------------
ok deraadt
----------------------------
the scheduler.
ok haesbaert@, deraadt@
----------------------------
of per-rthread. Handling of per-thread tick and runtime counters
inspired by how FreeBSD does it.
ok kettenis@
----------------------------
ok tedu@
----------------------------
they apply.
ok oga@ deraadt@
----------------------------
KERNEL_PROC_LOCK -> KERNEL_LOCK
KERNEL_PROC_UNLOCK -> KERNEL_UNLOCK
oga@ ok
----------------------------
and incorrect. Kills an XXX comment.
ok syuu, thib, art, kettenis, millert, deraadt
----------------------------
Also make sure not to take the scheduler lock once we have stopped a CPU such
that we can safely take it away without having to worry about deadlock
because it happened to own the scheduler lock.
Fixes issues with suspend on SMP machines.
ok mlarkin@, marco@, art@, deraadt@
----------------------------
ok miod@, thib@, oga@, jsing@
----------------------------
since it is time to start transitioning away from the no-op behaviour.
ok oga kettenis
----------------------------
give them back again, effectively stopping and starting these CPUs. Use
the stop function in sys_reboot().
ok marco@, deraadt@
----------------------------
Use this to do a shutdown with only the boot processor running. This should
avoid nasty races during shutdown.
help from art@, ok deraadt@, miod@
----------------------------
for sys_reboot() to hang forever.
----------------------------
particular CPU such that it just sits and spins in the idle loop, effectively
halting that CPU.
ok deraadt@, miod@
----------------------------
sched_exit(). This means that cpu_exit() and whatever it does (for instance
calling free()), as well as the deadproc p_hash handling, are now locked too.
This may have been one of the causes of the reaper panics, especially with
rthread patches... which were terminating a lot of threads very quickly onto
the deadproc p_hash list.
ok kurt kettenis miod
----------------------------
idle proc. p_cpu might be necessary in the future and pegging is just
to be extra safe (although we'll be horribly broken if the idle proc
ever ends up where that flag is checked).
----------------------------
something to do. Walk the highest priority queue looking for a proc
to steal and skip those that are pegged.
We could consider walking the other queues in the future too, but this
should do for now.
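The scan looks roughly like this (a sketch; the pegging flag and queue
fields are assumptions, not verified against the diff):

    struct proc *p;

    TAILQ_FOREACH(p, &spc->spc_qs[queue], p_runq) {
        if (p->p_flag & P_CPUPEG)
            continue;   /* pegged to its cpu: not stealable */
        return p;       /* steal this one */
    }
    return NULL;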
kettenis@ guenther@ ok
----------------------------
- Split up choosing of cpu between fork and "normal" cases. Fork is
very different and should be treated as such.
- Instead of implicitly choosing a cpu in setrunqueue, do it outside
where it actually makes sense.
- Just because a cpu is marked as idle doesn't mean it will still be idle soon.
There could be a thundering herd effect if we call wakeup from an
interrupt handler, so subtract cpus with queued processes when
deciding which cpu is actually idle.
- some simplifications allowed by the above.
kettenis@ ok (except one bugfix that was not in the initial diff)
----------------------------
forever.