Age | Commit message | Author |
|
ok bluhm@
|
|
ok bluhm@
|
|
ok bluhm@
|
|
ok bluhm@
|
|
ok bluhm@
|
|
For protocols which don't support a given request, leave the handler NULL.
Do the NULL check within the corresponding pru_*() wrapper and return
EOPNOTSUPP in that case. This will be done for all upcoming user request
handlers.
ok bluhm@ guenther@
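As a hedged sketch (the handler and structure member names are assumptions
for illustration, not the exact kernel definitions), such a wrapper could
look like:
	/*
	 * Illustrative only: a pru_*()-style wrapper returning EOPNOTSUPP
	 * when the protocol leaves this handler NULL.
	 */
	static inline int
	pru_shutdown(struct socket *so)
	{
		if (so->so_proto->pr_usrreqs->pru_shutdown == NULL)
			return (EOPNOTSUPP);
		return ((*so->so_proto->pr_usrreqs->pru_shutdown)(so));
	}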
|
|
MBR partitions previously enjoyed.
Found and fix tested by matthieu@
|
|
These ktrace points do not seem useful any longer because the new
implementation of poll(2) and select(2) appears to work well.
OK deraadt@ mpi@
|
|
OK millert@ deraadt@
|
|
|
|
handlers into it. We want to split the existing (*pr_usrreq)() into
multiple short handlers, one for each PRU_ request, as was already done for
PRU_ATTACH and PRU_DETACH. This is the preparation step; the (*pr_usrreq)()
split will be done in the following diffs.
Based on reverted diff from guenther@.
ok bluhm@
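A rough sketch of the shape this takes (the member signatures here are
illustrative placeholders, not the final kernel ones):
	/*
	 * Sketch: a per-protocol table of user request handlers.  Only
	 * pru_attach/pru_detach exist as separate handlers so far; the
	 * rest are filled in by the later (*pr_usrreq)() split.
	 */
	struct pr_usrreqs {
		int	(*pru_attach)(struct socket *, int);
		int	(*pru_detach)(struct socket *);
		/* ... one short handler per PRU_ request ... */
	};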
|
|
This doesn't work on MP systems. We do a locked `t_flags' check just after
the lockless check, so just remove it.
ok dlg@
|
|
ok mpi@ miod@
|
|
This is helpful for the upcoming (*pr_usrreq)() split into multiple
handlers. But right now it already makes the code more readable.
Also add '#ifndef _SYS_SOCKETVAR_H_' guards to sys/socketvar.h. This prevents
collisions when both sys/protosw.h and sys/socketvar.h are included
together. Both the 'socket' and 'protosw' structures are required to be
defined before the pru_*() wrappers, so sys/socketvar.h needs to be included
from sys/protosw.h.
ok bluhm@
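The guard follows the usual pattern:
	/* sys/socketvar.h */
	#ifndef _SYS_SOCKETVAR_H_
	#define _SYS_SOCKETVAR_H_
	/* ... struct socket and the other definitions the wrappers need ... */
	#endif /* _SYS_SOCKETVAR_H_ */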
|
|
swblk_t on OpenBSD.
reorder the if condition in blst_meta_alloc() to check whether the node is
the 'Terminator' node first (and leave the loop).
DragonFlyBSD is unaffected by it as swblk_t is signed (and the first
condition isn't taken).
add a regress test for it.
while here, move the KASSERT() to KDASSERT(). it is useful but only with DEBUG.
ok miod@ todd@
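A standalone illustration (not the blist code itself) of why the signedness
of the sentinel matters:
	#include <stdio.h>
	/*
	 * With a signed block type the (blk_t)-1 terminator is negative, so
	 * ordinary range comparisons reject it; with an unsigned type it is
	 * the largest value, so the terminator must be checked for
	 * explicitly before any such comparison.
	 */
	int
	main(void)
	{
		int		s_term = -1;			/* signed, as on DragonFlyBSD */
		unsigned int	u_term = (unsigned int)-1;	/* unsigned, as on OpenBSD */
		printf("signed terminator < 10:   %d\n", s_term < 10);		/* 1 */
		printf("unsigned terminator < 10: %d\n", u_term < 10U);	/* 0 */
		return 0;
	}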
|
|
partitions.
miod@ (re)discovered an off-by-one in some device size
calculations. Whether it stems from the ancient misbehaviour of some
devices of confusing the number of sectors with the highest valid sector
address, or from something newer, is unclear.
Should fix miod@'s octeon boot disk.
|
|
Buffer cache related struct vnode fields can be accessed in interrupt
context. Be more consistent with the use of splbio().
OK mpi@
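The idiom in question, sketched:
	int s;
	s = splbio();
	/* ... read or modify the vnode's buffer cache fields ... */
	splx(s);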
|
|
Computing a per-CPU TSC skew value is error-prone, especially on
multisocket machines and VMs. My best guess is that larger latencies
appear to the current skew measurement test as TSC desync, and so the
TSC is demoted to a kernel timecounter on these machines or marked
non-monotonic.
This patch eliminates per-CPU TSC skew values. Instead of trying to
measure and correct for TSC desync we only try to detect desync, which
is less error-prone. This approach should allow a wider variety of
machines to use the TSC as a timecounter when running OpenBSD.
In the new sync test, both CPUs repeatedly try to detect whether their
TSC is trailing the other CPU's TSC. The upside to this approach is
that it yields no false positives. The downside to this approach is
that it takes more time than the current skew measurement test. Each
test round takes 1ms, and we run up to two rounds per CPU, so this
patch slows boot down by 2ms per AP.
If any CPU fails the sync test, the TSC is marked non-monotonic and a
different timecounter is activated. The TC_USER flag remains intact.
There is no middle ground where we fall back to only using the TSC in
the kernel.
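Very roughly, the detect-only check each CPU runs against its peer looks
like this (names are illustrative and memory-ordering details are omitted;
this is not the kernel code):
	/*
	 * Called repeatedly for ~1ms.  The peer publishes its latest TSC
	 * reading; since our own read happens afterwards, a synchronized
	 * TSC can never appear smaller, so "mine < theirs" means our TSC
	 * trails the peer's.
	 */
	extern volatile uint64_t tsc_peer_val;
	extern volatile int tsc_trails;
	void
	tsc_test_step(void)
	{
		uint64_t theirs, mine;
		theirs = tsc_peer_val;
		mine = rdtsc();
		if (mine < theirs)
			tsc_trails = 1;
		tsc_peer_val = mine;	/* publish for the peer's next check */
	}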
Before running the test, we check for the IA32_TSC_ADJUST register and
reset it if necessary. This is a trivial way to work around firmware
bugs that desync the TSC before we reach the kernel. Unfortunately,
at the moment this register appears to only be available on Intel
processors. I cannot find an equivalent but differently-named MSR for
AMD processors.
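Sketched, assuming the usual MSR accessors and a hypothetical feature flag
(IA32_TSC_ADJUST is MSR 0x3b per the Intel SDM):
	#define MSR_TSC_ADJUST	0x03b
	/*
	 * A non-zero adjustment left behind by firmware offsets this CPU's
	 * TSC from the others; zeroing it realigns the counter before the
	 * sync test runs.
	 */
	if (cpu_has_tsc_adjust && rdmsr(MSR_TSC_ADJUST) != 0)
		wrmsr(MSR_TSC_ADJUST, 0);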
Because there is no per-CPU skew value, there is also no concept of
TSC drift anymore.
Miscellaneous notes:
- This patch adds a new timecounter utility function, tc_reset_quality().
Used after sync test failure to mark the TSC non-monotonic.
- I have left TSC_DEBUG enabled for now. Unsure if we should leave it
enabled for release or not. If we disable it we no longer run the
sync test after failing it once. Running the test even after failure
provides information about the desync on every CPU.
- Taking 1ms per test round is fairly conservative. We can experiment
with and discuss shorter test rounds. My main goal with a relatively
long test round is ensuring VMs actually run the test. It would be
bad if a hypervisor interrupted the test for so long that it concealed
desync.
- The use of two test rounds is mostly a diagnostic tool: it would be
very strange if a CPU passed the first round but failed the second.
If we ever saw this in the wild it would indicate something odd.
- Most of the desync seen in test reports is on Ryzen CPUs. I
believe, but cannot prove, that this is due to a widespread
firmware bug on AMD motherboards. Hopefully AMD and/or the
downstream vendors fix it.
- Fixing TSC desync by writing the TSC directly with WRMSR is very
difficult. The TSC is a moving target incrementing very quickly and
compensating for WRMSR overhead is non-trivial. We can experiment
with this, but my confidence is low that we can make it work reliably.
Prompted by deraadt@ and kettenis@ in 2021. Shepherded along by
deraadt@ throughout. Reprompted by Yuichiro Naito several times.
With input from Yuichiro Naito, naddy@, sthen@, dv@, and deraadt@.
Tested by florian@, gnezdo@, sthen@, Josh Rickmar, dv@, Mohamed Aslan,
Hrvoje Popovski, Yuichiro Naito, semarie@, mlarkin@, asou@, jmatthew@,
Renato Aguiar, and Timo Myyra.
Patch v1: https://marc.info/?l=openbsd-tech&m=164330092208035&w=2
Patch v2: https://marc.info/?l=openbsd-tech&m=164558519712957&w=2
Patch v3: https://marc.info/?l=openbsd-tech&m=165698681018991&w=2
Patch v4: https://marc.info/?l=openbsd-tech&m=165835507113680&w=2
Patch v5: https://marc.info/?l=openbsd-tech&m=165923705118770&w=2
"just commit it" deraadt@
|
|
a uint64_t may not produce the (humanly) obvious result.
Cast one of them to a (uint64_t) in the hope of invoking the
appropriate int promotion god.
CID 1519495
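The exact expression behind the CID isn't shown in this message; generically,
the promotion pitfall the cast addresses looks like this:
	#include <stdint.h>
	#include <stdio.h>
	int
	main(void)
	{
		uint32_t a = 1U << 20, b = 1U << 20;	/* true product needs 40 bits */
		uint64_t wrong = a * b;			/* multiplied in 32 bits, wraps to 0 */
		uint64_t right = (uint64_t)a * b;	/* widened first, then multiplied */
		printf("%llu vs %llu\n", (unsigned long long)wrong,
		    (unsigned long long)right);
		return 0;
	}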
|
|
validity is checked.
Found the hard way by kn@
Cluebats from millert@ and deraadt@.
Fix tested by and ok kn@
|
|
TCP_INFO provides a lot of information about the TCP session of this socket.
Many processes like to peek at the rtt of a connection, but this also
provides a lot more specialized info for use by e.g. tcpbench(1).
While the basic minimal info is available all the time, the more specific
data is only populated for privileged processes. This is done to not share
data back to userland that may allow attacking a session.
TCP_INFO is available to pledge "inet" since pledged processes like chrome
tend to use TCP_INFO when available.
OK bluhm@
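For instance, an unprivileged process can read a connection's rtt roughly
like this (a sketch; the exact struct tcp_info fields and units are defined
in <netinet/tcp.h>):
	#include <sys/socket.h>
	#include <netinet/in.h>
	#include <netinet/tcp.h>
	#include <stdio.h>
	int
	print_rtt(int fd)
	{
		struct tcp_info ti;
		socklen_t len = sizeof(ti);
		if (getsockopt(fd, IPPROTO_TCP, TCP_INFO, &ti, &len) == -1)
			return (-1);
		printf("rtt: %u\n", ti.tcpi_rtt);
		return (0);
	}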
|
|
on libc trying to open /var/run/ypbind.lock, so pledge had to BYPASSUNVEIL
accesses to this file. We accepted the opening of that file for a small
period for build cross-over, but that waiting period ends now.
|
|
helper functions.
The refactored code ensures disklabels are read from/written to
disk only from/to unused space or an OpenBSD partition. This
prevents accidental damage to filesystems that start immediately
following an MBR or GPT.
The refactored code also finds the disklabel present on the
i386/amd64 floppyXX.img, rather than spoofing the media as a
single MSDOS partition.
Tweak and positive comments from jmatthew@
|
|
reduce the diff with DragonFlyBSD by using swblk_t and u_swblk_t types.
while here, move the bitmap type (u_swblk_t) to u_int64_t on all archs. it
makes the regress test the same on 64-bit and 32-bit archs (and it succeeds
on both).
ok mpi@
|
|
to the padded syscalls going away.
|
|
|
|
to the padded syscalls going away.
|
|
It makes uvm_swap_free() faster: extents have a cost of O(n*n) which doesn't
really scale with gigabytes of swap.
Based on initial work from mpi@
The blist implementation comes from DragonFlyBSD.
The diff also adds a ddb(4) 'show swap' command to show the blist and help
debugging, and fixes some off-by-ones in the sizes printed during hibernate.
ok mpi@
|
|
libc YP support has a couple of places where the domainname is cached, and
this results in wildly incoherent behaviour which could even be risky.
If you want to change the domainname, you will have to reboot.
ok beck miod
|
|
Let's try this again now that the kernel locking issue in nfsrv_rcv()
has been fixed.
The previous attempt of the conversion triggered hangs on NFS servers.
This was probably caused by the removal of the kernel-locked section
just prior to the socket upcall. The section had masked a locking error
in NFS code.
|
|
The timecounting subsystem computes elapsed time by scaling (64 bits)
the difference between two counter values (32 bits at most) up into a
struct bintime (128 bits).
Under normal circumstances it is sufficient to do this with 64-bit
multiplication, like this:
struct bintime bt;
bt.sec = 0;
bt.frac = th->tc_scale * tc_delta(th);
However, if tc_delta() exceeds 1 second's worth of counter ticks, that
multiplication overflows. The result is that the monotonic clock appears
to jump backwards.
When can this happen? In practice, I have seen it when trying to
compile LLVM on an EdgeRouter Lite when using an SD card as the
backing disk. The box gets stuck in swap, the hardclock(9) is
delayed, and we appear to "lose time".
To avoid this overflow we need to compute the full 96-bit product of
the delta and the scale.
This commit adds TIMECOUNT_TO_BINTIME(), a function for computing that
full product, to sys/time.h. The patch puts the new function to use
in lib/libc/sys/microtime.c and sys/kern/kern_tc.c.
(The commit also reorganizes some of our high resolution bintime code
so that we always read the timecounter first.)
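The idea, sketched with illustrative names (this is not the actual
TIMECOUNT_TO_BINTIME() definition), is to split the 64-bit scale into halves
and sum the partial products so nothing is lost:
	#include <stdint.h>
	struct bintime_sketch {
		int64_t		sec;	/* whole seconds */
		uint64_t	frac;	/* 1/2^64 fractions of a second */
	};
	static void
	scale_delta(uint64_t scale, uint32_t delta, struct bintime_sketch *bt)
	{
		uint64_t hi = (scale >> 32) * delta;		/* bits 32..95 */
		uint64_t lo = (scale & 0xffffffffULL) * delta;	/* bits 0..63 */
		uint64_t mid = (hi & 0xffffffffULL) + (lo >> 32);
		bt->sec = (hi >> 32) + (mid >> 32);		/* carry into seconds */
		bt->frac = (mid << 32) | (lo & 0xffffffffULL);
	}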
Doing the full 96-bit multiplication is between 0% and 15% slower than
doing the cheaper 64-bit multiplication on amd64. Measuring a precise
difference is extremely difficult because the computation is already
quite fast.
I would guess that the cost is slightly higher than that on 32-bit
platforms. Nobody ever volunteered to test, so this remains a guess.
Thread: https://marc.info/?l=openbsd-tech&m=163424607918042&w=2
6 month bump: https://marc.info/?l=openbsd-tech&m=165124251401342&w=2
Committed after 9 months without review.
|
|
Apparently, we used to create several kthreads before the kernel
random number generator was up and running. A toggle, "randompid",
was needed to tell allocpid() whether it made sense to attempt to
allocate random PIDs.
However, these days we get e.g. arc4random(9) into a working state
before any kthreads are spawned, so the toggle is no longer needed.
Thread: https://marc.info/?l=openbsd-tech&m=165541052614453&w=2
Very nice historical context provided by miod@.
probably ok miod@ deraadt@
|
|
ok guenther
|
|
|
|
ok guenther
|
|
Those are the read-only operations allowed for non-root users:
SWAP_NSWAP and SWAP_STATS. Users of pledge("vminfo") in base which also
call swapctl(2) with said commands: top(1) and pstat(8).
No regression spotted with top(1) and pstat(8) -s/-T.
ok deraadt@
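For reference, the pattern those tools use looks roughly like this (a
sketch; see swapctl(2) for the exact interface):
	#include <sys/types.h>
	#include <sys/swap.h>
	#include <stdio.h>
	#include <stdlib.h>
	#include <unistd.h>
	int
	main(void)
	{
		struct swapent *sep;
		int i, n;
		if ((n = swapctl(SWAP_NSWAP, NULL, 0)) <= 0)	/* number of devices */
			return (0);
		if ((sep = calloc(n, sizeof(*sep))) == NULL)
			return (1);
		n = swapctl(SWAP_STATS, sep, n);		/* per-device stats */
		for (i = 0; i < n; i++)
			printf("%s: %d/%d blocks in use\n",
			    sep[i].se_path, sep[i].se_inuse, sep[i].se_nblks);
		free(sep);
		return (0);
	}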
|
|
/var/run/ypbind.lock. "getpw" now only allows ypconnect(2) and the minimum
unveil bypasses.
Still allow open/access of the file for a little while, because
getpwent/getgrent/etc were opening it unconditionally to hint for YPACTIVE.
That code should be deleted before 7.2
|
|
inside ypconnect(), it is best if we prevent "../" problems. so reject
domainnames containing '/'.
discussed with jca
|
|
if chrooted
issue pointed out by semarie
|
|
new libc..
|
|
no longer accesses /var/run/ypbind.lock to trigger extra permissions
for userland-opening of files & sockets to engage with ypserver for YP/LDAP
lookups. libc now uses the super secret special ypconnect() system call
to perform socket-setup.
Delete some other things which are no longer reached via libc/rpc
ok jmatthew, miod
|
|
Annotate two blocks relating to ypbind.lock that will be deleted once libc
switches over to the new mechanism.
|
|
|
|
rights, so that libc/yp could access YP services via a fairly complex 'protocol'
including file access, sockets, etc. This YP protocol is also used by ypldap --
this is our way of bringing 'NIS' services into libc without monster sub-libraries.
I have managed to remove this "inet" right by creating a new ypconnect() system
call, which performs parts of the yp_bind.c dance inside the kernel.. It checks if
domainname is set, looks for a binding file with advisory lock, reads it to
get the IP and udp/tcp port numbers, and then establishes a connected socket
direct to that ypserv. This socket has a SS_YP flag set, and non-required system
calls are prohibited. libc maintains lifetime on this socket so a process
should never see it, but it seems safer to block udp re-connect and other calls
even in non-pledge mode.
Userland changes to use this will follow in a few days.
Lots of help from claudio and jmatthew, also ok miod
|
|
|
|
OK jsg@
|
|
Also remove unneeded seltrue() and selfalse().
OK mpi@ jsg@
|
|
Reported-by: syzbot+a648408d6a58fd40b59a@syzkaller.appspotmail.com
by anton@
|
|
Also remove unneeded includes of <sys/poll.h> and <sys/select.h>.
Some addenda from jsg@.
OK miod@ mpi@
|
|
`so_lock' rwlock(9) instead of the global `unp_lock' which locks the whole
layer.
The PCBs of unix(4) sockets are linked to each other and we need to lock
them both. This introduces a lock ordering problem: while thread (1)
holds the lock on `so1' and is trying to lock `so2', thread (2) could
hold the lock on `so2' and be trying to lock `so1'. To solve this we
always lock sockets in a strict order.
For sockets which are already accessible from userland, we always lock
the socket with the smallest memory address first. Sometimes we need to
unlock a socket before locking its peer, and then lock it again.
We use reference counters to prevent destruction of the connected peer
during the relock. We also handle the case where the peer socket was
replaced by another socket.
For newly connected sockets, which are not yet exported to userland by
accept(2), we always lock the listening socket `head' first. This allows
us to avoid an unwanted relock within the accept(2) syscall.
ok claudio@
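The ordering rule for the already-visible case can be sketched like this
(an illustrative helper, not the kernel's actual code; reference counting
and the relock handling described above are omitted):
	void
	solock_pair_sketch(struct socket *so1, struct socket *so2)
	{
		if (so1 == so2) {
			solock(so1);
			return;
		}
		if (so1 < so2) {	/* lowest address first */
			solock(so1);
			solock(so2);
		} else {
			solock(so2);
			solock(so1);
		}
	}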
|