Age | Commit message (Collapse) | Author |
|
Implement inpcb iterator in rip6_input(). Factor out the real work
to rip6_sbappend(). Now UDP broadcast and multicast, raw IPv4 and
IPv6 input work similar. While there, make rip_input() look more
like rip6_input().
OK mvs@
|
|
Inspired by mvs@ idea of the iterator in the UDP multicast loop,
implement the same for raw IP input delivery. This removes an
unneccesary rwlock and only uses table mutex.
When comparing the inp routing table, address and port, the table
lock must be held. So assume that in_pcb_iterator() already has
the table mutex and hold it while traversing the list and doing the
checks. Release the mutex during mbuf copy, socket buffer append
and the upcalls. Adapt the logic for both rip_input() and udp_input().
In rip_input() move the actual work to rip_sbappend(). This can
be called without mutex during list traversal and for the final
element.
OK mvs@
|
|
The broadcast and multicast loop in udp_input() is protected by the
table mutex. The relevant PCBs were collected in a separate list,
which was processed while the table notify rwlock was held. When
sending UDP multicast packets over vxlan(4) configured over UDP
with multicast groups, this lock was taken recursively causing a
kernel crash.
By using an iterator, traversing the PCB list of the table does not
require to hold the mutex all the time. Only while accessing the
next element after the iterator, the mutex is taken for a short
time. udp_sbappend() and the upcall to vxlan_input() is done with
neither mutex nor rwlock. The PCB is reference counted while
traversing the list.
crash reported by Holger Glaess; iterator implemented by mvs@;
tested and fixed by bluhm@; OK mvs@
|
|
accessed integer.
ok bluhm
|
|
Also use atomic_load_int(9) to load `securelevel'. sysctl_securelevel()
is mp-safe, but will be under kernel lock until all existing
`securelevel' loading became mp-safe too.
ok mpi
|
|
protected by `timeout_mtx' mutex(9).
ok kettenis
|
|
OK mpi@
|
|
When mallocarray(9) sleeps, disk_count can change, and diskstatslen
gets inconsistent. This caused free(9) to panic.
Reported-by: syzbot+36e1f3b306f721f90c72@syzkaller.appspotmail.com
OK deraadt@ mpi@
|
|
If the memory layout is not optimal, m_defrag(), m_prepend(),
m_pullup(), and m_pulldown() will allocate mbufs or copy memory.
Count these operations to find possible optimizations.
input dhill@; OK mvs@
|
|
|
|
|
|
KERN_SECURELVL locked until existing `securelevel' checks became moved
out of kernel lock.
Make sysctl_securelevel_int() mp-safe by using atomic_load_int(9) to
unlocked read-only access for `securelevel'.
Unlock KERN_ALLOWDT. `allowdt' is the atomically accessed integer used
only once in dtopen().
ok mpi
|
|
`maxfiles' is atomically accessed integer which is lockless and
read-only accessed in file descriptors layer.
lim_startup() called during kernel bootstrap, no need to
atomic_load_int() within.
ok mpi
|
|
`maxprocess' and `maxthread' are atomically accessed integers.
ok mpi
|
|
It is the only KERN_AUDIO_RECORD. `audio_record_enable' is atomically
accessed integer.
Reasonable from deraadt
|
|
All except PF_MPLS paths are mp-safe:
- net_link_sysctl() and following net_ifiq_sysctl() only return
EOPNOTSUPP;
- uipc_sysctl() - mp-safe atomic access to integers;
- bpf_sysctl() - mp-safe atomic access to integers;
- pflow_sysctl() - returns statistics from per-CPU counters;
- pipex_sysctl() - mp-safe atomic access to integer;
Push kernel lock down to mpls_sysctl(). sysctl_int_bounded() do copying
with local variable, so context switch is safe. No need to wire memory
or take `sysctl_lock' rwlock(9).
Keep protocols locked as they was include pages wiring. Copying will not
sleep - no network slowdown while doing it with net lock held.
ok bluhm
|
|
The only difference between sysctl_int() and sysctl_int_bounded()
is the range check, so sysctl_int() is just sysctl_int_bounded(...,
INT_MIN, INT_MAX). sysctl_int() is not the fast path, so this useless
check is not significant.
Mp-safe sysctl_int() is meaningless for sysctl_int_lower(), so rework it
in the sysctl_int_bounded() style. This time all affected paths are
kernel locked, but this doesn't make sysctl_int_lower() worse.
Change `hostid' type to the type of int. It only stored but never used
within kernel, userland accesses it through sysctl_int(). Nothing
changes, but variable becomes consistent with sysctl_int().
ok bluhm
|
|
Regardless on wired userland memory, KERN_FILE_BYPID and KERN_FILE_BYUID
`allprocess' loops have netlock provided sleep points, so concurrent
process exit(1) could crash kernel.
The main exit1() problem is that process teardown begins while process
is still linked to `allprocess' list, and current code doesn't allow to
unlink it first. Wait for concurrent sysctl(2) `allprocess' loops
between PS_EXITING bit setting and list unlinking. Both KERN_FILE_BYPID
and KERN_FILE_BYUID loops do PS_EXITING check and won't deal with dying
process. Concurrent exit1() thread will wait loops keeping process
linked to `allprocess' list.
Tested with i386 dpb(1) run.
Stress tests and ok bluhm.
|
|
When searching for a specific process, there is no need to traverse
the list of all processes to the end. Break after pid has been
found and the file structure has been filled. Also check for arg
>= 0 as this is consistent with the arg < -1 check before. This
makes no functional difference as process 0 has PS_SYSTEM set and
is skipped anyway.
OK millert@ mvs@
|
|
`msgbufp' and `consbufp' are immutable, such as `msg_magic' and
`msg_bufs'. initmsgbuf() and initconsbuf() which initialize this buffers
are called during kernel bootstrap, when concurrent sysctl(2) is
impossible, so they don't need to be reordered or use barriers.
ok bluhm
|
|
Read-only access to local `clkinfo' filled with immutable data.
ok bluhm
|
|
microboottime() and following binboottime() are mp-safe and `mb' is
local data.
ok bluhm
|
|
Add corresponding cases to the kern_sysctl() switch and unlock read-only
variables from `kern_vars'. Unlock KERN_SOMAXCONN and KERN_SOMINCONN
which are atomically read-only accessed only from solisten().
ok kettenis
|
|
ok bluhm
|
|
Unlock few obvious immutable or read-only variables from "kern.*" and
"hw.*" paths. Keep the rest variables locked as before, include pages
wiring. Use new sysctl_vs{,un}lock() functions introduced for thar
purpose.
In kern.* path:
- KERN_OSTYPE, KERN_OSRELEASE, KERN_OSVERSION, KERN_VERSION -
immutable;
- KERN_NUMVNODES - read-only access to integer;
- KERN_MBSTAT - read-only access to per-CPU counters;
In hw.* path:
- HW_MACHINE, HW_MODEL, HW_NCPUONLINE, HW_PHYSMEM, HW_VENDOR,
HW_PRODUCT, HW_VERSION, HW_SERIALNO, HW_UUID, HW_PHYSMEM64 -
immutable;
- HW_USERMEM and HW_USERMEM64 - `physmem' is immutable, uvmexp.wired
is mutable but integer; read-only access to localy stored difference
between `physmem' and uvmexp.wired;
- `hw_vars' - read-only access to integers; some of them like
HW_BYTEORDER and HW_PAGESIZE are immutable;
ok bluhm kettenis
|
|
In sysctl_int_bounded() use atomic operations to load, store, or
swap integer values. By using volatile pointers this will result
in a single assembly instruction, no matter how over optimizing
compilers will become. Note that this does not solve data dependency
problems, nor MP problems in the kernel code using these integers.
For full MP safety additional considerations, memory barriers, or
locks will be needed where the values are used. But for simple
integer in- and output volatile is enough. If new and old value
pointers are given to sysctl, atomic swapping guarantees that
userlands sees the same old value only once. There are more
sysctl_int() functions that have to be adapted.
OK deraadt@ kettenis@
|
|
For procs (threads) the accounting happens now lockless by curproc using
a generation counter. Callers need to use tu_enter() and tu_leave() for this.
To read the proc p_tu struct tuagg_get_proc() should be used. It ensures
that the values read is consistent.
For processes only the time of exited threads is accumulated in ps_tu and
to get the proper process time usage tuagg_get_process() needs to be called.
tuagg_get_process() will sum up all procs p_tu plus the ps_tu.
This removes another SCHED_LOCK() dependency. Adjust the code in
exit1() and exit2() to correctly account for the full run time.
For this adjust sched_exit() to do the runtime accounting like it is done
in mi_switch().
OK jca@ dlg@
|
|
With two separate TCP hash tables, each one becomes smaller. When
we remove the exclusive net lock from TCP, contention on internet
PCB table mutex will be reduced. UDP has been split earlier into
IPv4 and IPv6. Replace branch conditions based on INP_IPV6 with
assertions.
OK mvs@
|
|
ok guenther@ deraadt@
|
|
the empty string, rather than error.
ok krw
|
|
This approach does not work as LIST_NEXT() of a removed element
does not return NULL. I causes a crash in syzcaller and triggers
kernel diagnostic assertion "vp->v_uvcount == 0" in sys/kern/kern_unveil.c
line 845 during reboot. Unfortunately the backout brings back the
race in fill_file() and fstat(1) may crash the kernel.
Reported-by: syzbot+54fba1c004d7383d5e85@syzkaller.appspotmail.com
|
|
socket types protected. The netlock is still used while fill_file()
called through *table.inpt_queue walkthroughs, but this is the inet
sockets case.
ok bluhm
|
|
list walkthroughs have context switch within, so make exit1() wait
until the last reference released.
Reported-by: syzbot+0e9dda76c42c82c626d7@syzkaller.appspotmail.com
ok bluhm claudio
|
|
Having two hash tables instead of a common one, reduces table size
and contention on the per table lock. The address family is always
known in advance. The lookups and loops are more specific.
OK sashan@
|
|
ports.
Suggested by deraadt@, USB route idea from kettenis@. Feedback
from anton@, man page improvements from deraadt@, jmc@,
schwarze@.
ok deraadt@ kettenis@
|
|
Using a scratch buffer makes it possible to take a consistent snapshot of
per-CPU counters without having to allocate memory.
Makes ddb(4) show uvmexp command work in OOM situations.
ok kn@, mvs@, cheloha@
|
|
revert the previous that the mbstat is located on the stack.
ok claudio
|
|
malloc(9) memory instead of kernel stack for sysctl kern.mbstat.
from yasuoka@; chunk missed in previous commit; OK claudio@ tb@
|
|
Every platform made the clockintr switch at least six months ago.
The __HAVE_CLOCKINTR symbol is now redundant. Remove it.
Prompted by claudio@.
Link: https://marc.info/?l=openbsd-tech&m=168826181015032&w=2
"makes sense" mlarkin@
|
|
and not hw_battery_setchargestart.
OK kettenis@
|
|
layer.
|
|
to control the charging of laptop batteries:
* hw.battery.chargemode (int)
-1: force discharge
0: inhibit charge
1: auto
In auto mode charging may be controlled by:
* hw.battery.chargestop (int)
Percentage (0-100) of last full capacity at which the battery should
stop charging.
* hw.battery.chargestart (int)
Percentage (0-100) of last full capacity at which the battery should
start charging.
The idea is that with
hw.battery.chargemode=1
hw.battery.chargestop=80
hw.battery.chargestart=75
the battery would be kept charged within the range between 75% and 80%.
Allowable settings and some details of the behavior may differ between
hardware implementations.
Committing this early to easy testing of further diffs that implement this
functionality in acpithinkpad(4) and aplsmc(4).
ok kn@
|
|
sysctl(8) MIBs relies on netlock or another locks and doesn't require
kernel lock, so unlock it. The protocols layer *_sysctl()s are left
under kernel lock and will be sequentially unlocked later.
ok bluhm@
|
|
receive buffer. As it was done for SS_CANTSENDMORE bit, the definition
kept as is, but now these bits belongs to the `sb_state' of receive
buffer. `sb_state' ored with `so_state' when socket data exporting to the
userland.
ok bluhm@
|
|
This time, socket's buffer lock requires solock() to be held. As a part of
socket buffers standalone locking work, move socket state bits which
represent its buffers state to per buffer state.
Opposing the previous reverted diff, the SS_CANTSENDMORE definition left
as is, but it used only with `sb_state'. `sb_state' ored with original
`so_state' when socket's data exported to the userland, so the ABI kept as
it was.
Inputs from deraadt@.
ok bluhm@
|
|
|
|
to monitor state changes of the kernel device tree
input from dnd ok dlg@, deraadt@
|
|
clockintr(9) is a machine-independent clock interrupt scheduler. It
emulates most of what the machine-dependent clock interrupt code is
doing on every platform. Every CPU has a work schedule based on the
system uptime clock. For now, every CPU has a hardclock(9) and a
statclock(). If schedhz is set, every CPU has a schedclock(), too.
This commit only contains the MI pieces. All code is conditionally
compiled with __HAVE_CLOCKINTR. This commit changes no behavior yet.
At a high level, clockintr(9) is configured and used as follows:
1. During boot, the primary CPU calls clockintr_init(9). Global state
is initialized.
2. Primary CPU calls clockintr_cpu_init(9). Local, per-CPU state is
initialized. An "intrclock" struct may be installed, too.
3. Secondary CPUs call clockintr_cpu_init(9) to initialize their
local state.
4. All CPUs repeatedly call clockintr_dispatch(9) from the MD clock
interrupt handler. The CPUs complete work and rearm their local
interrupt clock, if any, during the dispatch.
5. Repeat step (4) until the system shuts down, suspends, or hibernates.
6. During resume, the primary CPU calls inittodr(9) and advances the
system uptime.
7. Go to step (2). This time around, clockintr_cpu_init(9) also
advances the work schedule on the calling CPU to skip events that
expired during suspend. This prevents a "thundering herd" of
useless work during the first clock interrupt.
In the long term, we need an MI clock interrupt scheduler in order to
(1) provide control over the clock interrupt to MI subsystems like
timeout(9) and dt(4) to improve their accuracy, (2) provide drivers
like acpicpu(4) a means for slowing or stopping the clock interrupt on
idle CPUs to conserve power, and (3) reduce the amount of duplicated
code in the MD clock interrupt code.
Before we can do any of that, though, we need to switch every platform
over to using clockintr(9) and do some cleanup.
Prompted by "the vmm(4) time bug," among other problems, and a
discussion at a2k19 on the subject. Lots of design input from
kettenis@. Early versions reviewed by kettenis@ and mlarkin@.
Platform-specific help and testing from kettenis@, gkoehler@,
mlarkin@, miod@, aoyama@, visa@, and dv@. Babysitting and spiritual
guidance from mlarkin@ and kettenis@.
Link: https://marc.info/?l=openbsd-tech&m=166697497302283&w=2
ok kettenis@ mlarkin@
|
|
OK millert@ deraadt@
|
|
ok mpi@ miod@
|