Age | Commit message | Author |
|
comment 'As above...' makes sense again. Improve comments for
sysctl_int_bounded() and sysctl_bounded_arr().
OK gnezdo@ mvs@
|
|
remove it. This also fixes a defective check of the dynamic boundary
in sysctl_sysvshm().
OK mvs@ gnezdo@
|
|
OK mvs@
|
|
functions are sysctl_int() and sysctl_rdint(). This brings us back to
the 4.4BSD implementation. Then sysctl_int_bounded() builds the
magic for range checks on top. sysctl_bounded_arr() is a wrapper
around it to support multiple variables.
Introduce macros that describe the meaning of the magic boundary
values. Use these macros in obvious places.
input and OK gnezdo@ mvs@
|
|
configured. This will result in a "value is not available" error
from sysctl when trying to enable dt on a kernel without support.
The variable allowdt should be in the device, not in sysctl source.
We don't need #ifdef for extern and prototypes.
OK mpi@
|
|
We did not reach a consensus about using SMR to unlock single_thread_set()
so there's no point in keeping this change.
|
|
This allows us to unlock getppid(2).
ok mpi@
|
|
Removed a rash of +/-1 and made both functions shorter and more focused.
OK millert@
|
|
This changes amd64 GENERIC.MP .text size of kern_sysctl.o from 6440 to 6400.
Surprisingly, RAMDISK grows from 1645 to 1678.
OK millert@, mglocker@
|
|
devices, introduce kern.video.record for video(4) devices. By default
kern.video.record will be set to zero, blanking all data delivered
by device drivers which attach to video(4).
The idea was initially proposed by
Laurence Tratt <laurie AT tratt DOT net>.
ok mpi@
|
|
Currently all iterations are done under KERNEL_LOCK() and therefore use
the *_LOCKED() variant.
From and ok claudio@
|
|
This one is surprisingly a minor loss if one were to simply add bytes
on amd64:
        .text +.data+.bss +.rodata
before  0x64b0+0x40 +0x14 +0x338   = 0x683c
after   0x6440+0x48 +0x14 +0x3b8   = 0x6854
|
|
objdump -h changes in Size of kern_sysctl.o on amd64
         before  after
.text    7140    64b0
.data    24      40
.bss     10      14
.rodata  50      338
|
|
Requires sysctl_bounded_arr branch to support sysctl_rdint.
The read-only variables are marked by an empty range of [1, 0].
OK millert@
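
The empty-range convention above can be sketched in user space. This is only an illustration, not the kernel's sysctl_bounded_arr() or sysctl_int_bounded(): the struct, the function name and the EINVAL code for a range violation are assumptions made for the example. Only the idea that min > max (such as [1, 0]) marks a read-only variable comes from the message.

```c
#include <errno.h>
#include <stddef.h>

/*
 * Hypothetical user-space model of a bounded integer knob.
 * An empty range (min > max), e.g. [1, 0], marks it read-only.
 */
struct bounded_int {
	int	*var;
	int	 min;
	int	 max;
};

/*
 * Copy the current value to *oldp if requested, then apply *newp if
 * given.  newp == NULL means the caller only wants to read.
 */
int
bounded_int_access(const struct bounded_int *b, int *oldp, const int *newp)
{
	if (oldp != NULL)
		*oldp = *b->var;
	if (newp == NULL)
		return 0;
	if (b->min > b->max)
		return EPERM;	/* empty range: writes are refused */
	if (*newp < b->min || *newp > b->max)
		return EINVAL;	/* assumed code for a range violation */
	*b->var = *newp;
	return 0;
}
```

With this layout a table entry such as { &var, 1, 0 } needs no separate read-only flag; the range itself carries that information.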
|
|
The underlying vm_space lock is used as a substitute for the KERNEL_LOCK()
in uvm_grow() to make sure `vm_ssize' is not corrupted.
ok anton@, kettenis@
|
|
|
|
"syncprt" is unused since kern/vfs_syscalls.c r1.147 from 2008.
Adding new debug sysctls is a bit opaque and looking at kern/kern_sysctl.c
the only visible difference between used and stub ctldebug structs in the
debugvars[] array is their extern keyword, indicating that it is defined
elsewhere.
sys/sysctl.h declares all debugN members as extern upfront, but these
declarations are not needed.
Remove the unused debug sysctl, rename the only remaining one to something
meaningful and remove forward declarations from /sys/sysctl.h; this way,
adding new debug sysctls is a matter of adding extern and coming up with a
name, which is nicer to read on its own and better to grep for.
OK mpi
|
|
Adding "debug.my-knob" sysctls is really helpful to select different
code paths and/or log on demand during runtime without recompile,
but as this code is under DEBUG, lots of other noise comes with it
which is often undesired, at least when looking at specific subsystems
only.
Adding globals to the kernel and breaking into DDB to change them helps,
but that does not work over SSH, hence the need for debug sysctls.
Introduces DEBUG_SYSCTL to make use of the "debug" MIB without the rest of
DEBUG; it's DEBUG_SYSCTL and not SYSCTL_DEBUG because it's not a general
option for all of sysctl(2).
OK gnezdo
|
|
Thanks kettenis@ for pointing out.
ok kettenis@
|
|
Design by deraadt@
ok deraadt@
|
|
Range violations are now consistently reported as EOPNOTSUPP.
Previously they were mixed with ENOPROTOOPT.
OK kn@
|
|
|
|
|
|
Prevents a panic due to a NULL dereference; Coverity CID 1452899.
Based on a diff from mpi@, OK deraadt@ kettenis@
|
|
dt(4) exposes kernel internals, addresses and content of states to
userland. As such its interface shouldn't be available without
enabling it consciously.
ok millert@, deraadt@
|
|
idle time is reported in tools like vmstat.
OK visa@ benno@ krw@
|
|
Convert those to a consolidated status when needed in wait4(), kevent(),
and sysctl()
Pass exit code and signal separately to exit1()
(This also serves as prep for adding waitid(2))
ok mpi@
|
|
Allows us to determine how long a process has been running, even if the
UTC clock jumps.
With help from bluhm@ and millert@, who squashed several bugs.
ok bluhm@ millert@
|
|
The DST and TIMEZONE options(4) are incompatible with KARL, so we need
some other way to compensate for an RTC running with a known offset.
Enter kern.utc_offset, an offset in minutes East of UTC. TIMEZONE has
always been minutes West, but this is inconsistent with how everyone
else talks about timezones, hence the flip.
TIMEZONE has the advantage of being compiled into the binary. Our new
sysctl(2) has no such luck, so it needs to be set as early as possible
in boot, from sysctl.conf(5), so we can correct the kernel clock from
the RTC's local time to UTC before daemons like ntpd(8) and cron(8)
start. To encourage this, kern.utc_offset is made immutable after the
securelevel(7) is raised to 1.
Prompted by yasuoka@. Discussed with deraadt@, kettenis@, yasuoka@.
Additional testing by yasuoka@.
ok deraadt@, yasuoka@
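
The sign flip between TIMEZONE (minutes west) and kern.utc_offset (minutes east) can be sketched as below. Both helper names are hypothetical and none of this is taken from the kernel source. It only shows the arithmetic under the assumption that the RTC reading is local time expressed in seconds.

```c
/* Hypothetical helpers for illustration only. */

/* TIMEZONE counts minutes west of UTC; kern.utc_offset counts east. */
int
utc_offset_from_minuteswest(int minuteswest)
{
	return -minuteswest;
}

/*
 * Convert an RTC reading in local time (seconds) to UTC, given the
 * offset in minutes east of UTC.  A clock running east of UTC is
 * ahead of it, so the offset is subtracted.
 */
long long
rtc_local_to_utc(long long rtc_seconds, int utc_offset_east_min)
{
	return rtc_seconds - (long long)utc_offset_east_min * 60;
}
```

For an RTC kept at UTC+1 the offset is 60, so a local reading of 3600 seconds corresponds to 0 seconds UTC.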
|
|
serializing both read/write operations using the existing file mutex.
The vnode lock still grants exclusive write access to the offset; the
mutex is only used to make the actual write atomic and prevent any
concurrent reader from observing intermediate values.
ok mpi@ visa@
|
|
limits. Convert kernel variables and calculations for mbuf memory
into long to allow larger values on 64 bit machines. Put a range
check into the kernel sysctl. For the interface itself int is still
sufficient. In netstat -m cast all multiplications to unsigned
long to hold the product of two unsigned int.
input and OK visa@
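
The cast mentioned for netstat -m matters because the product of two unsigned int values is computed in unsigned int unless one operand is widened first. A minimal sketch follows, with hypothetical function names and assuming a platform where unsigned long is wider than unsigned int (true on 64-bit OpenBSD architectures):

```c
/* Hypothetical helpers; the names are not from netstat(1). */

/* Widen one operand first so the multiply is done in unsigned long. */
unsigned long
total_bytes(unsigned int count, unsigned int size)
{
	return (unsigned long)count * size;
}

/*
 * For contrast: the multiply is done in unsigned int and wraps
 * before the result is converted to the return type.
 */
unsigned long
total_bytes_narrow(unsigned int count, unsigned int size)
{
	return count * size;
}
```

On a machine where unsigned long is no wider than unsigned int, both versions behave the same and a wider type would be needed.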
|
|
https://marc.info/?l=openbsd-cvs&m=156277704122293&w=2
ok anton@
|
|
With these totals one can track the throughput of the timeout(9) layer
from userspace.
With input from mpi@.
ok mpi@
|
|
as part of the effort to unlock the kernel. Instead of relying on the
vnode lock, introduce a dedicated lock per file. Exclusive write access
is granted using the new foffset_enter and foffset_leave API. A
convenience function foffset_get is also available for threads that only
need to read the current offset.
The lock acquisition order in vn_write has been changed to match the one
in vn_read in order to avoid a potential deadlock. This change also gets
rid of a documented race in vn_read().
Inspired by the FreeBSD implementation.
With help and ok mpi@ visa@
|
|
was already gone.
OK mpi@
|
|
could crash due to missing inp_ppcb. This happened when fstat(1)
was called often and TCP was aborted with reset. Protect the sysctl
path with the net lock.
OK mpi@
|
|
It currently creates a lock ordering problem because SCHED_LOCK() is taken
by hardclock(). That means the "priorities" of a thread should be moved
out of the SCHED_LOCK() first in order to make progress.
Reported-by: syzbot+8e4863b3dde88eb706dc@syzkaller.appspotmail.com
via anton@ as well as by kettenis@
|
|
Note that hardclock(9) still increments p_{u,s,i}ticks without holding a
lock.
ok visa@, cheloha@
|
|
do word loads and stores and so partial updates should no longer be observed.
With this, accessing global variables set by sysctl_int() should be mostly MP
safe.
OK dlg@ mpi@
|
|
current status and statistics and can be exported without super-user
rights via sysctl to make it easier for tools like systat to access those.
OK deraadt@, sashan@
|
|
The new node contains the subsystem's main control variable,
kern.witness.watch. It is aliased by the old name, kern.witnesswatch.
The alias will be removed in the future.
OK anton@ mpi@
|
|
To protect the timehands we first need to protect the basis for all UTC
time in the kernel: the boottime.
Because the boottime can be changed at any time it needs to be versioned
along with the other members of the timehands to enable safe lockless reads
when using it for anything. So the global boottime timespec goes away and
the static boottimebin becomes a member of the timehands. Instead of reading
the global boottime you use one of two interfaces: binboottime(9) or
microboottime(9). nanoboottime(9) can trivially be added later, though there
are no consumers for it at the moment.
This introduces one small change in behavior. We used to advance the
reported boottime just before launching kernel threads from main().
This makes it look to userland like we "booted" moments before those
threads were launched. Because there is no longer a boottime global we
can no longer trivially do this from main(), so the boottime we report
to userspace via e.g. kern.boottime will now reflect whatever the time
was when we bootstrapped the timehands via inittodr(9). This is usually
no more than a minute before the kernel threads are launched from main().
The prior behavior can be restored by adding a new interface to the
timecounter layer in a future commit.
Based on FreeBSD r303387.
Discussed with mpi@ and visa@.
ok visa@
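
The idea of versioning a value so readers need no lock can be sketched with a generation counter. This is a simplified single-slot model written with C11 atomics, not the timehands code: the struct, the function names and the single-writer assumption are all made up for the example. A generation of 0 means an update is in progress.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical single-slot model; assumes exactly one writer. */
struct versioned_time {
	atomic_uint	 gen;		/* 0 while being updated */
	_Atomic int64_t	 boot_sec;
};

void
vt_init(struct versioned_time *vt, int64_t sec)
{
	atomic_init(&vt->boot_sec, sec);
	atomic_init(&vt->gen, 1);
}

/* Writer: invalidate, update the value, publish a new generation. */
void
vt_write(struct versioned_time *vt, int64_t sec)
{
	unsigned int gen = atomic_load_explicit(&vt->gen,
	    memory_order_relaxed);

	atomic_store_explicit(&vt->gen, 0, memory_order_relaxed);
	atomic_thread_fence(memory_order_release);
	atomic_store_explicit(&vt->boot_sec, sec, memory_order_relaxed);
	if (++gen == 0)
		gen = 1;		/* never publish generation 0 */
	atomic_store_explicit(&vt->gen, gen, memory_order_release);
}

/* Reader: retry until the generation is nonzero and unchanged. */
int64_t
vt_read(struct versioned_time *vt)
{
	unsigned int gen;
	int64_t sec;

	do {
		gen = atomic_load_explicit(&vt->gen, memory_order_acquire);
		sec = atomic_load_explicit(&vt->boot_sec,
		    memory_order_relaxed);
		atomic_thread_fence(memory_order_acquire);
	} while (gen == 0 ||
	    gen != atomic_load_explicit(&vt->gen, memory_order_relaxed));
	return sec;
}
```

Readers never block the writer. They simply repeat the read if an update overlapped it, which is cheap when updates are rare.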
|
|
|
|
Because of hw.smt we need a way to determine whether a given CPU is "online"
or "offline" from userspace. KERN_CPTIME2 is an array, and so cannot be
cleanly extended for this purpose, so add a new sysctl(2) KERN_CPUSTATS
with an extensible struct. At the moment it's just KERN_CPTIME2 with a
flags member, but it can grow as needed.
KERN_CPUSTATS appears to have been defined by BSDi long ago, but there are
few (if any) packages in the wild still using the symbol so breakage in ports
should be near zero. No other system inherited the symbol from BSDi, either.
Then, use the new sysctl(2) in systat(1) and top(1):
- systat(1) draws placeholder marks ('-') instead of percentages for
offline CPUs in the cpu view.
- systat(1) omits offline CPU ticks when drawing the "big bar" in
the vmstat view. The upshot is that the bar isn't half idle when
half your logical CPUs are disabled.
- top(1) does not draw lines for offline CPUs; if CPUs toggle on or
offline in interactive mode we redraw the display to expand/reduce
space for the new/missing CPUs. This is consistent with what some
top(1) implementations do on Linux.
- top(1) omits offline CPUs from the totals when CPU totals are
combined into a single line (the '-1' flag).
Originally prompted by deraadt@. Discussed endlessly with deraadt@,
kettenis@, and sthen@. Tested by jmc@ and jca@. Earlier versions also
discussed with jca@. Earlier versions tested by jmc@, tb@, and many
others.
docs ok jmc@, kernel bits ok kettenis@, everything ok sthen@,
"Is your stuff in yet?" deraadt@
|
|
ok kettenis deraadt
|
|
in raw IP delivery and UDP broadcast loops. There inpcbtable_mtx
is held and sorwakeup() is called within the loop. As sowakeup()
grabs the kernel lock, we have a lock ordering problem.
found by Hrvoje Popovski; OK deraadt@ mpi@
|
|
This lets userspace distinguish between idle CPUs and those that are
not schedulable because hw.smt=0.
A subsequent commit probably needs to add documentation for this
to sysctl.2 (and perhaps elsewhere) after the dust settles.
Also included here are changes to systat(1) and top(1) that account
for the ENODEV case and adjust behavior accordingly:
- systat(1)'s cpu view prints placeholder marks ('-') instead of
percentages for each state if the given CPU is offline.
- systat(1)'s vmstat view checks for offline CPUs when computing the
machine state total and excludes them, so the CPU usage graph
only represents the states for online CPUs.
- top(1) does not draw CPU rows for offline CPUs when the view is
redrawn. If CPUs "go offline", percentages for each state are
replaced by placeholder marks ('-'); the view will need to be
redrawn to remove these rows. If CPUs "go online" the view will
need to be redrawn to show these new CPUs. In "combined CPU" mode,
the count and the state totals only represent online CPUs.
Ports using KERN_CPTIME2 will need to be updated. The changes
described above to make systat(1) and top(1) aware of the ENODEV
case *and* gracefully handle a changing HW_NCPUONLINE while the
application is running are not necessarily appropriate for each
and every port.
The changes described above are so extensive in part to demonstrate
one way a program *might* be made robust to changing CPU availability.
In particular, changing hw.smt after boot is an extremely rare event,
and this needs to be weighed when updating ports.
The logic needed to account for the KERN_CPTIME2 ENODEV case is
very roughly:
	if (sysctl(...) == -1) {
		if (errno != ENODEV) {
			/* Actual error occurred. */
		} else {
			/* CPU is offline. */
		}
	} else {
		/* CPU is online and CPU states were set by sysctl(2). */
	}
Prompted by deraadt@. Basic idea for ENODEV from kettenis@. Discussed at
length with kettenis@. Additional testing by tb@.
No complaints from hackers@ after a week.
ok kettenis@, "I think you should commit [now]" deraadt@
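
The rough logic in the message can be wrapped in a small helper so callers get a three-way answer. This is only a sketch. The enum and the function name are hypothetical, and the helper takes the sysctl(2) return value and the errno seen right after the call as arguments, so it makes no system call itself.

```c
#include <errno.h>

/* Hypothetical result type for a per-CPU KERN_CPTIME2 query. */
enum cpu_query_result {
	CPUQ_ONLINE,	/* call succeeded, CPU states were filled in */
	CPUQ_OFFLINE,	/* call failed with ENODEV */
	CPUQ_ERROR	/* call failed for any other reason */
};

/* rc is sysctl(2)'s return value, err the errno observed after it. */
enum cpu_query_result
classify_cptime2(int rc, int err)
{
	if (rc == -1)
		return (err == ENODEV) ? CPUQ_OFFLINE : CPUQ_ERROR;
	return CPUQ_ONLINE;
}
```

A caller would save errno immediately after sysctl(2) returns and pass both values in, then draw placeholder marks for CPUQ_OFFLINE and report CPUQ_ERROR as usual.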
|
|
for netstat -a. Introduce a global mutex that protects the tables
and hashes for the internet PCBs. To detect detached PCB, set its
inp_socket field to NULL. This has to be protected by a per PCB
mutex. The protocol pointer has to be protected by the mutex as
netstat uses it.
Always take the kernel lock in in_pcbnotifyall() and in6_pcbnotify()
before the table mutex to avoid lock ordering problems in the notify
functions.
OK visa@
|
|
The introduction of hw.smt means that logical CPUs can be disabled
after boot and prior to suspend/resume. If hw.smt=0 (the default),
there needs to be a way to count the number of hardware threads
available on the system at any given time.
So, import HW_NCPUONLINE/hw.ncpuonline from NetBSD and document it.
hw.ncpu becomes equal to the number of CPUs given to sched_init_cpu()
during boot, while hw.ncpuonline is equal to the number of CPUs available
to the scheduler in the cpuset "sched_all_cpus". Set _SC_NPROCESSORS_ONLN
equal to this new sysctl and keep _SC_NPROCESSORS_CONF equal to hw.ncpu.
This is preferable to adding a new sysctl to count the number of
configured CPUs and keeping hw.ncpu equal to the number of online
CPUs because such a change would break software in the ecosystem
that relies on HW_NCPU/hw.ncpu to measure CPU usage and the like.
Such software in base includes top(1), systat(1), and snmpd(8),
and perhaps others.
We don't need additional locking to count the cardinality of a cpuset
in this case because the only interfaces that can modify said cardinality
are sysctl(2) and ioctl(2), both of which are under the KERNEL_LOCK.
Software using HW_NCPU/hw.ncpu to determine optimal parallelism will need
to be updated to use HW_NCPUONLINE/hw.ncpuonline. Until then, such software
may perform suboptimally. However, most changes will be similar to the
change included here for libcxx's std::thread::hardware_concurrency():
using HW_NCPUONLINE in lieu of HW_NCPU should be sufficient for determining
optimal parallelism for most software if the change to _SC_NPROCESSORS_ONLN
is insufficient.
Prompted by deraadt. Discussed at length with kettenis, deraadt, and sthen.
Lots of patch tweaks from kettenis.
ok kettenis, "proceed" deraadt
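
For software that sizes a worker pool, the change described above means asking for the online count rather than the configured count. A minimal sketch using sysconf(3) follows. The helper name and the fallback to one worker are assumptions for the example; _SC_NPROCESSORS_ONLN is a widely available extension rather than a POSIX requirement.

```c
#include <unistd.h>

/* Hypothetical helper: choose a worker count from the online CPUs. */
long
worker_count(void)
{
	long n = sysconf(_SC_NPROCESSORS_ONLN);

	/* sysconf() returns -1 on error; fall back to one worker. */
	return (n < 1) ? 1 : n;
}
```

Since the online count can change while a program runs, long-lived programs may want to call such a helper again instead of caching the result forever.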
|
|
instead of using a mutex for update serialization. Use a per-fdp mutex
to manage updating of file instance pointers in the `fd_ofiles' array
to let fd_getfile() acquire file references safely with concurrent file
reference releases.
OK mpi@
|