Age | Commit message | Author |
|
Having two hash tables instead of a common one reduces table size
and contention on the per-table lock. The address family is always
known in advance, so the lookups and loops are more specific.
OK sashan@
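As a rough illustration of the idea (the names here are made up, this is
not the committed code), a lookup can pick the per-family table up front:

#include <sys/socket.h>         /* AF_INET, AF_INET6 */

struct conn_table {
        void    *buckets;       /* hash buckets, elided */
        int      lock;          /* stand-in for the per-table lock */
};

static struct conn_table table_inet, table_inet6;

static struct conn_table *
conn_table_for_af(int af)
{
        /* The family is known up front, so the lookup goes straight to
         * the smaller per-family table and its own lock. */
        return (af == AF_INET) ? &table_inet : &table_inet6;
}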
|
|
This is not necessarily wrong per se, but would need special consideration,
as not all platforms are currently able to process more than six syscall
arguments (and upcoming diffs will rely upon reasonably-sized argument
lists), so it is better to break now and reconsider later if need be.
ok deraadt@
|
|
This unlocks soconnect() for UDP, rip, rip6 and divert. It takes the
shared net lock in combination with the per-socket lock. TCP and GRE
still use the exclusive net lock when connecting.
OK mvs@
|
|
|
|
anymore.
ok tb@ deraadt@, no need to regen anything
|
|
As sbwait() may sleep, holding any mutex is not allowed. Call
pru_unlock() before sbwait() in soreceive().
Bug spotted by sashan@; OK sashan@ mvs@
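A minimal sketch of that ordering, assuming the pru_lock()/pru_unlock()
wrappers and the two-argument sbwait() of that time; the helper name
soreceive_wait() is invented and the rest of soreceive() is elided:

#include <sys/param.h>
#include <sys/socketvar.h>
#include <sys/protosw.h>

/* Sketch only: never sleep in sbwait() with the inpcb mutex held. */
static int
soreceive_wait(struct socket *so)
{
        int error;

        pru_unlock(so);                         /* sbwait() may sleep */
        error = sbwait(so, &so->so_rcv);
        pru_lock(so);                           /* retake the mutex */
        return (error);
}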
|
|
|
|
not "size of the storage of the pin elements"
|
|
In my recent commit I missed that sblock() may sleep while soreceive()
holds the inpcb mutex. Call pru_lock() after sblock().
Reported-by: syzbot+f79c896ec019553655a0@syzkaller.appspotmail.com
Reported-by: syzbot+08b6f1102e429b2d4f84@syzkaller.appspotmail.com
OK mvs@
|
|
For protocols that care about locking, use the shared net lock to
call sobind(). Use the per-socket rwlock together with the shared net
lock. This affects the UDP, raw IP, and divert protocols. Move the
inpcb mutex locking into soreceive(), it is only used there. Add
a comment to describe the current implementation of inpcb locking.
OK mvs@ sashan@
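The locking hierarchy can be pictured with a userland analogue, where a
process-wide rwlock stands in for the net lock and per-object locks stand
in for the socket rwlock and inpcb mutex (this is not kernel code):

#include <pthread.h>

static pthread_rwlock_t net_lock = PTHREAD_RWLOCK_INITIALIZER;

struct fake_socket {
        pthread_rwlock_t so_lock;       /* plays the per-socket rwlock */
        pthread_mutex_t  inp_mtx;       /* plays the inpcb mutex */
        int              bound;
};

static void
fake_sobind(struct fake_socket *so)     /* locks initialized elsewhere */
{
        /* The shared (read) side of the global lock is enough; changes
         * to this one socket are serialized by its own lock. */
        pthread_rwlock_rdlock(&net_lock);
        pthread_rwlock_wrlock(&so->so_lock);
        so->bound = 1;
        pthread_rwlock_unlock(&so->so_lock);
        pthread_rwlock_unlock(&net_lock);
}

In this picture the inpcb mutex only has to be taken in the receive path,
which is why it can live entirely inside soreceive().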
|
|
(not used yet, because the pinsyscall changes are still being worked on)
ok kettenis
|
|
implementation.
Set nkmempages to -1 by default instead of 0 so that the value ends up in
the data section. This way config(8) is able to alter the value as promised.
See also: https://github.com/llvm/llvm-project/issues/74632
OK miod@
|
|
The new logic is:
up to 1G of physmem use physical memory / 4,
above 1G add an extra 16MB per 1G of memory.
Clamp it down depending on the available kernel virtual address space:
- up to and including 512M -> 64MB (macppc, arm, sh)
- between 512M and 1024M -> 128MB (hppa, i386, mips, luna88k)
- over 1024M, clamp to VM_KERNEL_SPACE_SIZE / 4
The result is much more malloc(9) space on 64-bit archs with lots of memory
and large kva space.
Note: amd64 only has 4G of kva, therefore nkmempages is limited to 262144.
As a side effect, NKMEMPAGES_MAX and nkmempages_max are no longer used.
Tested and OK miod@
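The sizing rule can be restated as plain arithmetic (an illustrative
recast in bytes, not the kernel code; "kva" stands in for
VM_KERNEL_SPACE_SIZE):

#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

#define MB      (1024ULL * 1024)
#define GB      (1024 * MB)

/* Rough recast of the rule: physmem/4 up to 1G, plus 16MB per extra
 * 1G, then clamped by the available kernel virtual address space. */
static uint64_t
kmem_bytes(uint64_t physmem, uint64_t kva)
{
        uint64_t want;

        if (physmem <= GB)
                want = physmem / 4;
        else
                want = GB / 4 + (physmem - GB) / GB * 16 * MB;

        if (kva <= 512 * MB)
                return (want < 64 * MB) ? want : 64 * MB;
        if (kva <= 1024 * MB)
                return (want < 128 * MB) ? want : 128 * MB;
        return (want < kva / 4) ? want : kva / 4;
}

int
main(void)
{
        /* e.g. 64G of physmem with 4G of kva, roughly the amd64 case */
        printf("%" PRIu64 " MB\n", kmem_bytes(64 * GB, 4 * GB) / MB);
        return 0;
}

With 64G of physmem and 4G of kva this prints 1024 MB, which at a 4K page
size is exactly the 262144-page amd64 limit noted above.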
|
|
|
|
|
|
it is a dangerous alternative entry point for all system calls, and thus
incompatible with the precision system call entry point scheme we are
heading towards. This has been a 3-year mission:
First perl needed a code-generated wrapper to fake syscall(2) as a giant
switch table, then all the ports were cleaned with relatively minor fixes,
except for "go". "go" required two fixes -- 1) a framework issue with
old library versions, and 2) like perl, a fake syscall(2) wrapper to
handle ioctl(2) and sysctl(2) because "syscall(SYS_ioctl" occurs all over
the place in the "go" ecosystem because the "go developers" are plan9-loving
unix-hating folk who tried to build an ecosystem without allowing "ioctl".
ok kettenis, jsing, afresh1, sthen
|
|
descriptor (pted) pool in the arm64 pmap implementation. This
significantly reduces the side-effects of lock contention on the kernel
map lock that is (incorrectly) translated into excessive page daemon
wakeups. This is not a perfect solution but it does lead to significant
speedups on machines with many CPU cores.
This requires adding a new pmap_init_percpu() function that gets called
at the point where the kernel is ready to set up the per-CPU pool caches.
Dummy implementations of this function are added for all non-arm64
architectures. Some other architectures can probably benefit from
providing an actual implementation that sets up per-CPU caches for
pmap pools as well.
ok phessler@, claudio@, miod@, patrick@
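A minimal sketch of a non-dummy implementation, assuming the pted pool is
a global named pmap_pted_pool (the dummy on other architectures is just
an empty function):

#include <sys/param.h>
#include <sys/pool.h>

extern struct pool pmap_pted_pool;      /* name assumed for the sketch */

void
pmap_init_percpu(void)
{
        /* With per-CPU caches most pted allocations and frees no longer
         * take the global pool mutex. */
        pool_cache_init(&pmap_pted_pool);
}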
|
|
|
|
ok kettenis
|
|
|
|
before future work where ld.so(1) will need this new system call.
Putting this in the kernel ahead of time will save some grief.
ok kettenis
|
|
|
|
For internet sockets, sobind() runs with the exclusive net lock due to
solock(). For unix domain sockets, uipc_bind() grabs the kernel
lock itself. So sys_bind() is MP safe. Add the NOLOCK flag to avoid
the kernel lock.
OK mvs@
|
|
NKMEMPAGES_MIN was removed a long time ago on all archs, so there is no
need to keep it.
Also initialize nkmempages_max at compile time, since sparc (with variable
page size) is long gone as well.
No objection from miod@
|
|
|
|
ok miod@
|
|
Since revision 1.26, dt_ioctl_get_auxbase() has been calling process_domem().
Build the latter function into the kernel if the pseudo-device dt is enabled.
from Matthias Pitzl; OK claudio@
|
|
The disklabel UID passed in is not modified; reflect that and allow callers
to use 'const char *'.
OK miod
|
|
The soreceive() code depends on the fact that MSG_EOR is set on the
last mbuf of the chain. In sbappendcontrol() move MSG_EOR to the
end, as sbcompress() does. This fixes MSG_EOR handling for
SOCK_SEQPACKET sockets with control messages.
bug reported by Eric Wong
analysed, tested and OK claudio@
|
|
If single-thread mode is already held by another thread, just unwind to
userret(), wait there, and retry the system call later (if at all).
OK mpi@
|
|
where a switch happens outside. Clean up these code paths and make them
machine independent.
- when a process forks (fork, tfork, kthread), the new proc needs to
somehow be scheduled for the first time. This is done by proc_trampoline.
Since proc_trampoline is machine dependent assembler code, change
the MP-specific proc_trampoline_mp() to proc_trampoline_mi() and make
sure it is now always called.
- cpu_hatch: when booting APs the code needs to jump to the first proc
running on that CPU. This should be the idle thread for that CPU.
- sched_exit: when a proc exits it needs to switch away from itself and
then instruct the reaper to clean up the rest. This is done by switching
to the idle loop.
Since the last two cases require a context switch to the idle proc, factor
out the common code to sched_toidle() and use it in those places.
Tested by many on all archs.
OK miod@ mpi@ cheloha@
|
|
m_split() calls m_align() to initialize the data pointer of newly
allocated mbuf. If the new mbuf will be converted to a cluster,
this is not necessary. If additionally the new mbuf is larger than
MLEN, this can lead to a panic.
Only call m_align() when a valid m_data is needed. This is the
case if we do not reference the existing cluster but memcpy() the
data into the new mbuf.
Reported-by: syzbot+0e6817f5877926f0e96a@syzkaller.appspotmail.com
OK claudio@ deraadt@
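The failure mode can be pictured outside the kernel: aligning places the
data so it ends at the end of the mbuf's internal buffer, which only makes
sense when the data actually fits in that buffer (a toy model, not mbuf
code):

#include <stdio.h>

#define FAKE_MLEN 256   /* stand-in for the mbuf's internal data area */

int
main(void)
{
        long small = 100, big = 4000;   /* big > FAKE_MLEN */

        /* "aligning" = start offset so the data ends at the buffer end */
        printf("offset for %ld bytes: %ld\n", small, FAKE_MLEN - small);
        printf("offset for %ld bytes: %ld\n", big, FAKE_MLEN - big);
        /* the second offset is negative, i.e. before the buffer; in the
         * kernel that bogus m_data is what eventually panics */
        return 0;
}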
|
|
The API's behavior when invoked from a callback function is impossible
to document. Move the special behavior into a distinct namespace,
"clockrequest".
- Add a 'struct clockrequest'. Basically a stripped-down 'struct clockintr'
for exclusive use during clockintr_dispatch().
- In clockintr_queue, replace the "cq_shadow" clockintr with a "cq_request"
clockrequest. They serve the same purpose.
- CLST_SHADOW_PENDING -> CR_RESCHEDULE; different namespace, same meaning.
- CLST_IGNORE_SHADOW -> CR_IGNORE_REQUEST; same meaning.
- Move shadow branch in clockintr_advance() to clockrequest_advance().
- clockintr_request_random() becomes clockrequest_advance_random().
- Delete dead shadow branches in clockintr_cancel(), clockintr_schedule().
- Callback functions now get a clockrequest pointer instead of a special
clockintr pointer: update all prototypes, callers.
No functional change intended.
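A hedged sketch of a callback after the rename; the helper names and
period variable are invented, and the exact prototypes are assumed from
the description above:

#include <sys/types.h>
#include <sys/clockintr.h>

extern uint64_t example_period;                 /* invented */
extern void example_work(void *, void *);       /* invented */

void
example_clock(struct clockrequest *cr, void *frame, void *arg)
{
        uint64_t count;

        /* reschedule ourselves one period out; "count" says how many
         * periods have elapsed since the callback last ran */
        count = clockrequest_advance(cr, example_period);
        while (count-- > 0)
                example_work(frame, arg);
}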
|
|
Add a TIMEOUT_MPSAFE flag to signal that a timeout is safe to run
without the kernel lock. Currently, TIMEOUT_MPSAFE requires
TIMEOUT_PROC. When the softclock() is unlocked in the future this
dependency will be removed.
On MULTIPROCESSOR kernels, softclock() now shunts TIMEOUT_MPSAFE
timeouts to a dedicated "timeout_proc_mp" bucket for processing by the
dedicated softclock_thread_mp() kthread. Unlike softclock_thread(),
softclock_thread_mp() is not pinned to any CPU and runs at IPL_NONE.
Prompted by bluhm@. Lots of input from bluhm@. Joint work with mvs@.
Prompt: https://marc.info/?l=openbsd-tech&m=169646019109736&w=2
Thread: https://marc.info/?l=openbsd-tech&m=169652212131109&w=2
ok mvs@
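Usage might look like the following sketch, assuming the five-argument
timeout_set_flags(9); the handler and timeout are made up, and
TIMEOUT_PROC is still required next to TIMEOUT_MPSAFE:

#include <sys/types.h>
#include <sys/timeout.h>

static struct timeout example_to;

static void
example_handler(void *arg)
{
        /* runs in the softclock thread without the kernel lock */
}

static void
example_start(void)
{
        timeout_set_flags(&example_to, example_handler, NULL,
            KCLOCK_NONE, TIMEOUT_PROC | TIMEOUT_MPSAFE);
        timeout_add_sec(&example_to, 1);
}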
|
|
Technically, all the current fixed clock interrupt periods fit within
an unsigned 32-bit value. But 32-bit multiplication is an accident
waiting to happen. So, expand the fixed periods for hardclock,
statclock, profclock, and roundrobin to 64-bit values.
One exception: statclock_mask remains 32-bit because random(9) yields
32-bit values. Update the initclocks() comment to make it clear that
this is not an accident.
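The hazard is easy to demonstrate: a 100 Hz hardclock period is
10,000,000 ns, and a 32-bit product of period and tick count wraps after
only about 4.3 seconds worth of nanoseconds:

#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

int
main(void)
{
        uint32_t period32 = 10000000;   /* 100 Hz period in ns */
        uint64_t period64 = 10000000;
        uint32_t count = 500;           /* 5 seconds worth of ticks */

        printf("32-bit: %" PRIu32 "\n", period32 * count);      /* wraps */
        printf("64-bit: %" PRIu64 "\n", period64 * count);      /* 5e9 */
        return 0;
}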
|
|
Prototype clockintr_schedule() in <sys/clockintr.h>.
|
|
Rename these parameters to align the code with the forthcoming
manpage. No functional change.
|
|
intrclock_rearm() and intrclock_trigger() are not part of the public
API, so there's no reason to implement them in sys/clockintr.h. Move
them to kern_clockintr.c.
|
|
channel instead of inventing its own.
OK kettenis@ mvs@
|
|
ports.
Suggested by deraadt@, USB route idea from kettenis@. Feedback
from anton@, man page improvements from deraadt@, jmc@,
schwarze@.
ok deraadt@ kettenis@
|
|
The mode can now be or-ed with SINGLE_DEEP or SINGLE_NOWAIT to alter
the behaviour of single_thread_set(). This allows explicit control
of the SINGLE_DEEP behaviour.
If SINGLE_DEEP is set, the deep flag is passed to the initial check call,
so the check will error out instead of suspending (SINGLE_UNWIND)
or exiting (SINGLE_EXIT). The SINGLE_DEEP flag is required in calls to
single_thread_set() outside of userret. E.g. at the start of sys_execve
because the proc is not allowed to call exit1() in that location.
SINGLE_NOWAIT skips the wait at the end of single_thread_set() and therefore
returns BEFORE all threads have been parked. Currently this is only used by
the ptrace code and should not be used anywhere else. Not waiting for all
threads to settle is asking for trouble.
This solves an issue by using SINGLE_UNWIND in the coredump case where
the code should actually exit in case another thread crashed moments earlier.
Also the SINGLE_UNWIND in pledge_fail() is now marked SINGLE_DEEP since
the call to pledge_fail() is for sure not at the kernel boundary.
OK mpi@
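A hedged example of the new convention (the exact argument list of
single_thread_set() is assumed here): a caller that is not at the kernel
boundary ors SINGLE_DEEP into the mode:

        /* Sketch only: deep caller, e.g. pledge_fail(), asking the other
         * threads to unwind; with SINGLE_DEEP the initial check errors
         * out instead of suspending or exiting. */
        error = single_thread_set(p, SINGLE_UNWIND | SINGLE_DEEP);
        if (error)
                return (error);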
|
|
|
|
|
|
Also remove useless NULL check.
ok bluhm@
|
|
protecting message buffer.
ok bluhm
|
|
- PS_PROFIL bit is moved into the process cleanup block where it belongs
- The proc read-only limit cache cleanup is moved up right after clearing
p->p_fd cache. lim_free() can potentially sleep and so needs to be
above the line where p_stat is set to SDEAD.
With and OK jca@
|
|
Include missing fields -- like the sleep channel and message -- and
show both the PID and TID of the proc.
Also add '/t' as an argument that can be used to specify a proc by TID
instead of by address.
OK mpi@
|
|
There is the same check in sched_chooseproc() but that is too late
to know where the bad insertion into the runqueue was done.
OK mpi@
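If the check in sched_chooseproc() is the usual KASSERT on the proc's
state, the new one at the insertion point presumably has the same shape
(a guess at the form, not the actual diff):

        /* Sketch: catch a non-runnable proc right where it is put on the
         * runqueue, rather than much later when it gets chosen. */
        KASSERT(p->p_stat == SRUN);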
|
|
SINGLE_UNWIND unwinds to the kernel boundary. On the other hand,
SINGLE_SUSPEND will sleep inside tsleep(9) and other sleep functions.
Since the code will exit1() very soon after, it is better to unwind already.
Now one could argue that for coredumps all threads should stop asap to
get a clean dump. Using SINGLE_UNWIND, the sleep will fail with ERESTART
and no copyout should happen in that case.
This is a bit of a workaround since SINGLE_SUSPEND has a small race
where single_thread_wait() returns before all threads are really stopped.
When SINGLE_EXIT is called quickly afterwards, this can blow up inside
sleep_finish().
Reported-by: syzbot+3ef066fcfaf991f2ac2c@syzkaller.appspotmail.com
OK mpi@ kettenis@
|
|
With input from claudio@ and deraadt@.
|