|
use the proper way to read tu_runtime.
OK mpi@
|
|
ok deraadt@ claudio@
|
|
For procs (threads) the accounting now happens locklessly: curproc updates
its own p_tu using a generation counter. Callers need to use tu_enter() and
tu_leave() for this. To read a proc's p_tu struct, tuagg_get_proc() should
be used; it ensures that the values read are consistent.
For processes, only the time of exited threads is accumulated in ps_tu; to
get the proper process time usage, tuagg_get_process() needs to be called.
tuagg_get_process() sums up all procs' p_tu plus the ps_tu.
This removes another SCHED_LOCK() dependency. Adjust the code in
exit1() and exit2() to correctly account for the full run time.
For this, adjust sched_exit() to do the runtime accounting the same way
mi_switch() does.
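As an illustration, the read/write protocol is seqlock-like. A minimal
sketch, assuming a tu_gen member and these barrier placements (not the
committed code):

void
tu_enter(struct tusage *tu)
{
        ++tu->tu_gen;           /* odd: update in progress */
        membar_producer();
}

void
tu_leave(struct tusage *tu)
{
        membar_producer();
        ++tu->tu_gen;           /* even again: snapshot is consistent */
}

/* reader, tuagg_get_proc()-style: retry until a stable snapshot is seen */
struct tusage copy;
u_int gen;

do {
        gen = tu->tu_gen;
        membar_consumer();
        copy = *tu;
        membar_consumer();
} while ((gen & 1) || gen != tu->tu_gen);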
OK jca@ dlg@
|
|
The API's behavior when invoked from a callback function is impossible
to document. Move the special behavior into a distinct namespace,
"clockrequest".
- Add a 'struct clockrequest'. Basically a stripped-down 'struct clockintr'
for exclusive use during clockintr_dispatch().
- In clockintr_queue, replace the "cq_shadow" clockintr with a "cq_request"
clockrequest. They serve the same purpose.
- CLST_SHADOW_PENDING -> CR_RESCHEDULE; different namespace, same meaning.
- CLST_IGNORE_SHADOW -> CLST_IGNORE_REQUEST; same meaning.
- Move shadow branch in clockintr_advance() to clockrequest_advance().
- clockintr_request_random() becomes clockrequest_advance_random().
- Delete dead shadow branches in clockintr_cancel(), clockintr_schedule().
- Callback functions now get a clockrequest pointer instead of a special
clockintr pointer: update all prototypes, callers.
No functional change intended.
|
|
Callers can now provide an argument pointer to clockintr_establish().
The pointer is kept in a new struct clockintr member, cl_arg. The
pointer is passed as the third parameter to clockintr.cl_func when it
is executed during clockintr_dispatch(). Like the callback function,
the callback argument is immutable after the clockintr is established.
At present, nothing uses this. All current clockintr_establish()
callers pass a NULL arg pointer. However, I am confident that dt(4)'s
profile provider will need this in the near future.
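A minimal sketch of the flow (the dt(4)-style caller names and the
cpu_info first argument are assumptions):

struct clockintr *cl;

/* cl_arg is stored once at establish time and is immutable thereafter */
cl = clockintr_establish(ci, dt_prof_fire, dp);

/* later, in clockintr_dispatch(), the stored pointer is the third
 * parameter to the callback: */
cl->cl_func(cl, frame, cl->cl_arg);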
Requested by dlg@ back in March.
|
|
as argument to the tuagg_locked function.
- Remove incorrect use of p_rtime in other parts of the tree. p_rtime was
almost always 0 so including it in any sum did not alter the result.
- In main() the update of time can be further simplified since at that time
only the primary cpu is running.
- Add missing nanouptime() call in cpu_hatch() for hppa
- Rename tuagg_unlocked to tuagg_locked, as is done in the rest of
the tree.
OK cheloha@ dlg@
|
|
- Move the setitimer(2) code responsible for updating the ITIMER_VIRTUAL
and ITIMER_PROF timers from hardclock(9) into a new clock interrupt
routine, itimer_update(). itimer_update() is periodic and runs at the
same frequency as the hardclock.
- Revise itimerdecr() to run within itimer_mtx instead of entering
and leaving it.
- Each schedstate_percpu has its own itimer_update() handle, spc_itimer.
A new scheduler flag, SPCF_ITIMER, indicates whether spc_itimer was
started during the last mi_switch() and needs to be stopped during the
next mi_switch() or sched_exit().
- A new per-process flag, PS_ITIMER, indicates whether ITIMER_VIRTUAL
and/or ITIMER_PROF are running. Checking the flag is easier than
entering itimer_mtx to check process.ps_timer[] (see the sketch after
this list). The flag is set and cleared in a new helper function,
process_reset_itimer_flag().
- In setitimer(), call need_resched() when the state of ITIMER_VIRTUAL
or ITIMER_PROF is changed to force an mi_switch() and update
spc_itimer.
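A sketch of the fast path this enables (simplified; the exact prototype
and locking details may differ):

void
itimer_update(struct clockintr *cl, void *frame)
{
        struct process *pr = curproc->p_p;

        /* Cheap flag test: skip itimer_mtx when no itimer is armed. */
        if (!ISSET(pr->ps_flags, PS_ITIMER))
                return;

        mtx_enter(&itimer_mtx);
        /* decrement ITIMER_VIRTUAL and ITIMER_PROF as appropriate */
        mtx_leave(&itimer_mtx);
}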
claudio@ notes that ITIMER_PROF could be implemented as a high-res
timer using the thread's execution time as a guide for when to
interrupt the process and assert SIGPROF. This would probably work
really well in single-threaded processes. ITIMER_VIRTUAL would be
more difficult to make high-res, though, as you need to exclude time
spent in the kernel.
Tested on powerpc64 by gkoehler@. With input from claudio@.
Thread: https://marc.info/?l=openbsd-tech&m=169038818517101&w=2
ok claudio@
|
|
OK cheloha@ jca@ mvs@
|
|
Now that the clockintr switch is complete, cpu_initclocks() always
initializes stathz to a non-zero value. We don't call statclock()
from hardclock(9) anymore and, more broadly, we don't need to test
whether stathz is non-zero before using it.
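The enabled cleanup is of this shape (illustrative variable name):

/* before: fall back to hz when there is no separate statclock */
freq = stathz ? stathz : hz;

/* after: stathz is always non-zero once cpu_initclocks() has run */
freq = stathz;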
With input from kettenis@.
Link: https://marc.info/?l=openbsd-tech&m=167434223309668&w=2
ok kettenis@ miod@
|
|
the process* that it should be part of. Use that in clock_get{time,res}(),
thrkill(), and ptrace().
ok jca@ miod@ mpi@ mvs@
|
|
I think "abs" ("absolute timeout") is a better mnemonic than
"at" ("at the given time").
The interface is undocumented and there are only two callers, so
renaming it is not a big deal.
probably ok kn@
|
|
|
|
to assign a quality to RTC implementations and pick the "best" RTC if a
system has multiple RTCs (or multiple interfaces to an RTC). This allows
us to prefer a battery-backed I2C RTC over an RTC that is part of the SoC
and is only running if the SoC is powered. It also allows us to
work around issues with firmware RTC interfaces that may lie to us or
even crash the system.
This change makes sure the todr_quality member of the struct is always
initialized. In most cases the quality will be set to zero; further
adjustments of the quality for specific subsystems/architectures will follow.
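A sketch of a "best RTC wins" attachment using the new member
(todr_attach() as the helper name is an assumption):

void
todr_attach(struct todr_chip_handle *handle)
{
        /* Keep the highest-quality handle; ties keep the incumbent. */
        if (todr_handle == NULL ||
            handle->todr_quality > todr_handle->todr_quality)
                todr_handle = handle;
}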
ok cheloha@, patrick@
|
|
ok mpi@ miod@
|
|
witness. Make ratecheck mutex global.
Reported-by: syzbot+9864ba1338526d0e8aca@syzkaller.appspotmail.com
|
|
mutex with spl high for all function calls is used for now. It
protects the lasttime and curpps parameter. This solution is MP
safe for the usual use case, allows progress, and can be optimized
later. Remove a useless #if 1 while there.
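A sketch of the locking, following ratecheck(9)'s documented logic (the
mutex name is illustrative):

struct mutex ratecheck_mtx = MUTEX_INITIALIZER(IPL_HIGH);

int
ratecheck(struct timeval *lasttime, const struct timeval *mininterval)
{
        struct timeval tv, delta;
        int rv = 0;

        getmicrouptime(&tv);
        mtx_enter(&ratecheck_mtx);      /* protects *lasttime */
        timersub(&tv, lasttime, &delta);
        /* allow the first event, even when the interval is huge */
        if (timercmp(&delta, mininterval, >=) ||
            (lasttime->tv_sec == 0 && lasttime->tv_usec == 0)) {
                *lasttime = tv;
                rv = 1;
        }
        mtx_leave(&ratecheck_mtx);
        return (rv);
}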
OK claudio@
|
|
Currently setitimer(2) rejects timers larger than 100 million seconds
and sets EINVAL.
With the change to kclock timeouts there is no longer any reason to
use this arbitrary value. Kclock timeouts support the full range of a
timespec, so we can increase the upper bound without practical risk of
arithmetic overflow.
If we push the limit to UINT_MAX we can support the full input range
of alarm(3). We can then simplify the alarm.3 manpage in a separate
patch.
We can push the limit even higher in the future if we find software
that doesn't like the UINT_MAX limit. Until then, UINT_MAX seconds
(over 68 years) is plenty for all practical timers.
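The remaining validation is then a pair of range checks, roughly
(variable name assumed):

if (aitv.it_value.tv_sec > UINT_MAX || aitv.it_interval.tv_sec > UINT_MAX)
        return (EINVAL);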
ok claudio@
|
|
We can reduce latency for the first expiration of a timer if we don't
round it_value up to the minimum interval (1 tick).
While we're at it, we may as well consolidate all input validation and
adjustment into a single itimerfix() call. There are no other callers
in the kernel (nor should there be), so remove the prototype from
sys/time.h.
Discussion: https://marc.info/?l=openbsd-tech&m=162084338005502&w=2
Tested by weerd@ and claudio@.
probably ok claudio@
|
|
Change the definition of ADJFREQ_MIN so that it does not shift
a negative value. Such shifting is undefined in standard C.
This came up when cross-compiling the kernel using ports clang.
The shifting becomes defined when compiling with option -fwrapv.
Base clang enables this option by default.
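A sketch of the undefined-behavior-free shape (the value is the
adjfreq(2) bound; the exact macro layout is an assumption):

/* before: shifting a negative value is undefined in standard C */
#define ADJFREQ_MIN     (-500000000LL << 32)

/* after: shift the positive bound, then negate the result */
#define ADJFREQ_MAX     (500000000LL << 32)
#define ADJFREQ_MIN     (-ADJFREQ_MAX)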
OK naddy@ cheloha@
|
|
We only see 8 characters of wmesg in e.g. top(1), so shorten the
string to fit.
Indirectly prompted by kettenis@.
|
|
To unlock getitimer(2) and setitimer(2) we need to protect the
per-process ITIMER_REAL state with something other than the kernel
lock. As the ITIMER_REAL timeout callback realitexpire() runs at
IPL_SOFTCLOCK the per-process mutex ps_mtx is appropriate.
In setitimer() we need to use ps_mtx instead of the global itimer_mtx
if the given timer is ITIMER_REAL. Easy.
The ITIMER_REAL timeout callback routine realitexpire() is trickier.
When we enter ps_mtx during the callback we need to check if the timer
was cancelled or rescheduled. A thread from the process can call
setitimer(2) at the exact moment the callback is about to run from
timeout_run() (see kern_timeout.c).
Update the locking annotation in sys/proc.h accordingly.
ok anton@
|
|
Reimplement the ITIMER_REAL interval timer with a kclock timeout.
Couple things of note:
- We need to use the high-res nanouptime(9) call, not the low-res
getnanouptime(9).
- The code is simpler now that we aren't working with ticks.
Misc. thoughts:
- Still unsure if "kclock" is the right name for these things.
- MP-safely cancelling a periodic timeout is very difficult.
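A sketch of the arming path per the first note above (simplified; the
ps_realit_to member is from the tree, surrounding details assumed):

struct timespec now, expiry;

nanouptime(&now);       /* high-res read, not getnanouptime(9) */
timespecadd(&now, &aitv.it_value, &expiry);
timeout_abs_ts(&pr->ps_realit_to, &expiry);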
|
|
If we fold the for-loop iterating over each interval timer into the
helper function, the result is slightly tidier than what we have now.
Rename the helper function "cancel_all_itimers".
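The folded helper then looks roughly like this (a sketch; the in-kernel
setitimer() helper signature is assumed):

void
cancel_all_itimers(void)
{
        struct itimerval itv;
        int i;

        timerclear(&itv.it_value);
        timerclear(&itv.it_interval);
        for (i = 0; i < nitems(curproc->p_p->ps_timer); i++)
                setitimer(i, &itv, NULL);
}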
Based on input from millert@ and kettenis@.
|
|
During _exit(2) and sometimes during execve(2) we need to cancel any
active per-process interval timers. We don't currently do this in an
MP-safe way. Both syscalls ignore the locking assumptions documented
in proc.h.
The easiest way to make them MP-safe is to use setitimer(), just like
the getitimer(2) and setitimer(2) syscalls do. To make things a bit
cleaner I have added a helper function, cancelitimer(), so the callers
don't need to fuss with an itimerval struct.
While we're here we can remove the splclock/splx dance from execve(2).
It is no longer necessary.
ok deraadt@
|
|
If itv.it_value is zero we cancel the timer. When we cancel the timer
we don't care about itv.it_interval because the timer is not running:
we don't use it, we don't look at it, etc.
To be on the paranoid side, I think we should zero itv.it_interval
when itv.it_value is zero. No need to write arbitrary values into the
process struct if we aren't required to. The standard is ambiguous
about what should happen in this case, i.e. the value of olditv after
the following code executes is unspecified:
struct itimerval newitv, olditv;
newitv.it_value.tv_sec = newitv.it_value.tv_usec = 0;
newitv.it_interval.tv_sec = newitv.it_interval.tv_usec = 1;
setitimer(ITIMER_REAL, &newitv, NULL);
getitimer(ITIMER_REAL, &olditv);
This change should not break any real code.
|
|
timespecadd(3) is fast. There is no need to call getnanouptime(9)
repeatedly when searching for the next expiration point. Given that
it_interval is at least 1/hz, we expect to run through the loop maybe
hz times at most. Even at HZ=10000 that's pretty brief.
While we're here, pull *all* of the other logic out of the loop.
The only thing we need to do in the loop is timespecadd(3).
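The loop then reduces to a single timespecadd(3) per iteration (sketch,
variable names loosely after realitexpire()):

getnanouptime(&now);    /* read the clock once, before the loop */
while (timespeccmp(&tp->it_value, &now, <=))
        timespecadd(&tp->it_value, &tp->it_interval, &tp->it_value);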
|
|
- Consolidate variable declarations.
- Remove superfluous parentheses from return statements.
- Prefer sizeof(variable) to sizeof(type) for copyin(9)/copyout(9).
- Remove some intermediate pointers from sys_setitimer(). Using SCARG()
directly here makes it more obvious to the reader what you're copying.
|
|
Now that the critical sections are merged we should call
getnanouptime(9) once. This makes an ITIMER_REAL timer swap atomic
with respect to the clock: the time remaining on the old timer is
computed with the same timestamp used to schedule the new timer.
|
|
Merge the common code from sys_getitimer() and sys_setitimer() into a
new kernel subroutine, setitimer(). setitimer() performs all of the
error-free work for both system calls within a single critical
section. We need a single critical section to make the setitimer(2)
timer swap operation atomic relative to realitexpire() and hardclock(9).
The downside of the new atomicity is that the behavior of setitimer(2)
must change. With a single critical section we can no longer copyout(9)
the old timer before installing the new timer. So if SCARG(uap, oitv)
points to invalid memory, setitimer(2) now fails with EFAULT but the
new timer will be left running. You can see this in action with code
like the following:
struct itimerval itv, olditv;
itv.it_value.tv_sec = 1;
itv.it_value.tv_usec = 0;
itv.it_interval = itv.it_value;
/* This should EFAULT. 0x1 is probably an invalid address. */
if (setitimer(ITIMER_REAL, &itv, (void *)0x1) == -1)
warn("setitimer");
/* The timer will be running anyway. */
getitimer(ITIMER_REAL, &olditv);
printf("time left: %lld.%06ld\n",
olditv.it_value.tv_sec, olditv.it_value.tv_usec);
There is no easy way to work around this. Both FreeBSD's and Linux's
setitimer(2) implementations have a single critical section and they
too fail with EFAULT in this case and leave the new timer running.
I imagine their developers decided that fixing this error case was
a waste of effort. Without permitting copyout(9) from within a mutex
I'm not sure it is even possible to avoid it on OpenBSD without
sacrificing atomicity during a setitimer(2) timer swap.
Given the rarity of this error case I would rather have an atomic swap.
Behavior change discussed with deraadt@.
|
|
if they are out of range, making it easier to isolate the reason for EINVAL
ok cheloha
|
|
setitimer(2) works with timespecs in its critical section. It will be
easier to merge the two critical sections if getitimer(2) also works
with timespecs.
In particular, we currently read the uptime clock *twice* during a
setitimer(2) swap: we call getmicrouptime(9) in sys_getitimer() and
then call getnanouptime(9) in sys_setitimer(). This means that
swapping one timer in for another is not atomic with respect to the
uptime clock. It also means the two operations are working with
different time structures and resolutions, which is potentially
confusing.
If both critical sections work with timespecs we can combine the two
getnanouptime(9) calls into a single call at the start of the combined
critical section in a future patch, making the swap atomic with
respect to the clock.
So, in preparation, move the TIMESPEC_TO_TIMEVAL conversions in
getitimer(2) after the ITIMER_REAL conversion from absolute to
relative time, just before copyout(9). The ITIMER_REAL conversion
must then be done with timespec macros and getnanouptime(9), just like
in setitimer(2).
|
|
If we're replacing the current ITIMER_REAL timer with a new one we
don't need to call timeout_del(9) before calling timeout_add(9).
timeout_add(9) does the work of timeout_del(9) implicitly if the
timeout in question is already pending.
This saves us an extra trip through the timeout_mutex.
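That is, the two-step dance collapses to one call (illustrative):

/* before */
timeout_del(&pr->ps_realit_to);
timeout_add(&pr->ps_realit_to, to_ticks);

/* after: timeout_add(9) implicitly reschedules a pending timeout */
timeout_add(&pr->ps_realit_to, to_ticks);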
|
|
Rearrange the critical section in setitimer(2) to match that of
getitimer(2). This will make it easier to merge the two critical
sections in a subsequent diff.
In particular, we want to write the new timer value in *one* place in
the code, regardless of which timer we're setting.
ok millert@
|
|
For what are probably historical reasons, setitimer(2) does not
validate its input (itv) immediately after copyin(9). Instead, it
waits until after (possibly) performing a getitimer(2) to copy out the
state of the timer.
Consolidating copyin(9), input validation, and input conversion into a
single block before the getitimer(2) operation makes setitimer(2)
itself easier to read. It will also simplify merging the critical
sections of setitimer(2) and getitimer(2) in a subsequent patch.
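The consolidated block has roughly this shape (a sketch; it follows the
usual SCARG/copyin pattern, details assumed):

if (SCARG(uap, itv) != NULL) {
        error = copyin(SCARG(uap, itv), &aitv, sizeof(aitv));
        if (error)
                return (error);
        if (itimerfix(&aitv.it_value) || itimerfix(&aitv.it_interval))
                return (EINVAL);
}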
This changes setitimer(2)'s behavior in the EINVAL case. Currently,
if your input (itv) is invalid, we return EINVAL *after* modifying the
output (olditv). With the patch we will now return EINVAL *before*
modifying the output. However, any code dependent upon this behavior
is broken: the contents of olditv are undefined in all setitimer(2)
error cases.
ok millert@
|
|
The ITIMER_REAL per-process interval timer is protected by the kernel
lock. The ITIMER_REAL timeout (ps_realit_to), setitimer(2), and
getitimer(2) all run under the kernel lock. Entering itimer_mtx
during getitimer(2) when reading the ITIMER_REAL ps_timer state is
superfluous and misleading.
|
|
The ITIMER_VIRTUAL and ITIMER_PROF per-process interval timers are
updated from hardclock(9). If a timer for the parent process is
enabled, the hardclock(9) thread calls itimerdecr() to update and
reload it as needed.
However, in itimerdecr(), after entering itimer_mtx, the thread needs
to double-check that the timer in question is still enabled. While
the hardclock(9) thread is entering itimer_mtx a thread in
setitimer(2) can take the mutex and disable the timer.
If the timer is disabled, itimerdecr() should return 1 to indicate
that the timer has not expired and that no action needs to be taken.
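In sketch form (prototype and the interval-reload case simplified):

int
itimerdecr(struct itimerspec *itp, long nsec)
{
        struct timespec decrement = { 0, nsec };

        mtx_enter(&itimer_mtx);
        if (!timespecisset(&itp->it_value)) {
                /* setitimer(2) disabled the timer while we waited. */
                mtx_leave(&itimer_mtx);
                return (1);     /* not expired: take no action */
        }
        if (timespeccmp(&itp->it_value, &decrement, >)) {
                timespecsub(&itp->it_value, &decrement, &itp->it_value);
                mtx_leave(&itimer_mtx);
                return (1);     /* still counting down */
        }
        timespecclear(&itp->it_value);  /* one-shot: timer stops */
        mtx_leave(&itimer_mtx);
        return (0);     /* expired: caller takes action */
}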
ok kettenis@
|
|
The current input validation for overflow is more complex than
it needs to be. We can flatten the conditional hierarchy into
a string of checks just one level deep. The result is easier
to read.
|
|
At securelevel 2 we prevent root from rewinding the kernel UTC clock.
The rationale given in the comment is that this prevents a compromised
root from setting arbitrary timestamps on files.
I can't really speak to the efficacy of this mitigation, or to the
efficacy of the securelevel concept in general, but the implementation
of this mitigation is wrong. We need to check:
timespeccmp(ts, &now, <=)
instead of
timespeccmp(ts, &now, <)
like we do now.
Time is a continuous value that is always advancing. We must prevent
root from setting the kernel UTC clock to its current value in
addition to prior values. Setting the UTC clock to its current value
amounts to rewinding it even if we cannot actually measure the
difference with a timespec.
With this change, at securelevel 2, root can no longer completely
freeze the UTC clock.
|
|
When we recompute the scaling factor during tc_windup() there is an
opportunity for arithmetic overflow if the active timecounter's
adjfreq(2) adjustment is too large. If we limit the adjustment to
[-500000, +500000] ppm the statement in question cannot overflow.
In particular, we are concerned with the following bit of code:
scale = (u_int64_t)1 << 63;
scale += ((th->th_adjustment + th->th_counter->tc_freq_adj) / 1024) * 2199;
scale /= th->th_counter->tc_frequency;
th->th_scale = scale * 2;
where scale is a u_int64_t. Overflow can occur when we do:
scale += (...) / 1024 * 2199;
as th->th_counter->tc_freq_adj is currently unbounded.
th->th_adjustment is limited to [-5000ppm, 5000ppm].
To see that overflow is prevented with the new bounds, consider the
new edge case where th->th_counter->tc_freq_adj is 500000ppm and
th->th_adjustment is 5000ppm. Both are of type int64_t. We have:
int64_t th_adjustment = (5000 * 1000LL) << 32; /* 21474836480000000 */
int64_t tc_freq_adj = 500000000LL << 32; /* 2147483648000000000 */
scale = (u_int64_t)1 << 63; /* 9223372036854775808 */
scale += (th_adjustment + tc_freq_adj) / 1024 * 2199;
/* scale += 2168958484480000000 / 1024 * 2199; */
/* scale += 4657753620480000000; */
9223372036854775808 + 4657753620480000000 = 13881125657334775808,
which is less than 2^64 = 18446744073709551616, so we don't have overflow.
On the opposite end, if th->th_counter->tc_freq_adj is -500000ppm and
th->th_adjustment is -5000ppm we would have -4657753620480000000.
9223372036854775808 - 4657753620480000000 = 4565618416374775808.
Again, no overflow.
500000ppm and -500000ppm are extreme adjustments. otto@ says ntpd(8)
would never arrive at them naturally, so we are not at risk of breaking
a working setup by imposing these restrictions.
Documentation input from kettenis@.
No complaints from otto@.
|
|
We don't want resettodr(9) to write the RTC until inittodr(9) has
actually run. Until inittodr(9) calls tc_setclock() the system UTC
clock will contain a meaningless value and there's no sense in
overwriting a good value with a value we know is nonsense.
This is not an uncommon problem if you're debugging a problem in early
boot, e.g. a panic that occurs prior to inittodr(9).
Currently we use the following logic in resettodr(9) to inhibit writes:
if (time_second == 1)
return;
... this is too magical.
A better way to accomplish the same thing is to introduce a dedicated
flag set from inittodr(9). Hence, "inittodr_done".
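The flag-based inhibit looks like this (a sketch):

int inittodr_done;

void
inittodr(time_t base)
{
        /* ... read the RTC and call tc_setclock() ... */
        inittodr_done = 1;
}

void
resettodr(void)
{
        /* Don't overwrite a good RTC value with a meaningless clock. */
        if (!inittodr_done)
                return;
        /* ... write the system UTC time back to the RTC ... */
}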
Suggested by visa@.
ok kettenis@
|
|
|
|
of todr_handle.
OK kettenis@
|
|
ok deraadt@, mpi@, visa@
ok cheloha@ as well (would have preferred in new file for this code)
|
|
While here, rename the wait channel so the tsleep_nsec(9) call will fit
onto a single line. It isn't a global channel so the name is arbitrary
anyway.
With input from visa@.
ok visa@
|
|
I broke adjfreq(2)'s atomic swap in kern_time.c,v1.112. By using the
"f" variable to store both the new and old frequency adjustments, the
new adjustment gets clobbered by the old adjustment if the caller asked
for a swap.
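The fixed shape keeps separate storage for the two directions of the
swap (a sketch; the tc_adjfreq() signature and surrounding syscall
plumbing are assumed):

int64_t oldf = 0, newf = 0;
int error;

if (freq != NULL) {
        error = copyin(freq, &newf, sizeof(newf));
        if (error)
                return (error);
}
error = tc_adjfreq(&oldf, freq != NULL ? &newf : NULL);
if (error)
        return (error);
if (oldfreq != NULL)
        error = copyout(&oldf, oldfreq, sizeof(oldf));
return (error);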
ok visa@ mpi@
|
|
Currently we return (1000000000 / hz) from clock_getres(2) as the
resolution for every clock. This is often untrue.
For CPUTIME clocks, if we have a separate statclock interrupt the
resolution is (1000000000 / stathz). Otherwise it is as we currently
claim: (1000000000 / hz).
For the REALTIME/MONOTONIC/UPTIME/BOOTTIME clocks the resolution is
that of the active timecounter. During tc_init() we can compute the
precision of a timecounter by examining its tc_counter_mask and store
it for lookup later in a new member, tc_precision. The resolution of
a clock backed by a timecounter "tc" is then
tc.tc_precision * (2^64 / tc.tc_frequency)
fractional seconds.
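For example, a hypothetical 10 MHz counter with tc_precision = 1 would
report a resolution of 2^64 / 10000000 fractional seconds, which works
out to 100 nanoseconds: exactly one counter tick.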
While here we can clean up sys_clock_getres() a bit.
Standards input from guenther@. Lots of input, feedback from
kettenis@.
ok kettenis@
|
|
For gettimeofday(2), always copy out an empty timezone struct. For
settimeofday(2), still copyin(9) the struct but ignore the contents.
In gettimeofday(2)'s case we have not changed the original BSD semantics:
the kernel only tracks UTC time without an offset for DST, so a zeroed
timezone struct is the correct thing to return to the caller.
Future work could move these out into libc as stubs for clock_gettime(2)
and clock_settime(2). But that is definitely a "later" thing, given that
we are in
beta.
Update the manpage to de-emphasize the timezone parameters for these
syscalls.
Discussed with tedu@, deraadt@, millert@, kettenis@, yasuoka@, jca@, and
guenther@. Tested by job@. Ports input from jca@ and sthen@. Manpage
input from jca@.
ok jca@ deraadt@
|
|
|
|
Loongson runs at 128hz. 128 doesn't divide evenly into a million,
but it does divide evenly into a billion. So if we do the per-process
itimer bookkeeping with itimerspec structs we can have error-free
virtual itimers on loongson just as we do on most other platforms.
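(Concretely: 1000000 / 128 = 7812.5, so per-tick bookkeeping in
microseconds accumulates error, while 1000000000 / 128 = 7812500
exactly, so nanosecond bookkeeping is error-free.)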
This change doesn't fix the virtual itimer error on alpha, as 1024 does not
divide evenly into a billion. But this doesn't make the situation any
worse, either.
ok deraadt@
|
|
|