Age | Commit message (Collapse) | Author |
|
pool_get() and pool_put(). This makes it clearer that the knote
allocation cannot fail and that the error check is unnecessary.
OK anton@, kettenis@, mpi@
|
|
turn them into proper sentences. Gets rid of 27 lines in total.
|
|
to simplify the locking pattern, revert back to using a hand-rolled I/O
lock just like FreeBSD and NetBSD does. The state of pipes is quite
different compared to when I made use of a rwlock for the I/O lock in
revision 1.96. Most notably, the pipe_lock can now be used while
sleeping. This does imply that witness(4) tracking of the I/O lock is
lost but the implementation ends up being simpler.
ok visa@
|
|
This flag is set whenever a timeout is put on the wheel and cleared upon
(a) running, (b) deletion, and (c) readdition. It serves two purposes:
1. Facilitate distinguishing scheduled and rescheduled timeouts. When a
timeout is put on the wheel it is "scheduled" for a later softclock().
If this happens two or more times it is also said to be "rescheduled".
The tos_rescheduled value thus indicates how many distant timeouts
have been cascaded into a lower wheel level.
2. Eliminate false late timeouts. A timeout is not late if it is due
before softclock() has had a chance to schedule it. To track this we
need additional state, hence a new flag.
rprocter@ raises some interesting questions. Some answers:
- This interface is not stable and name changes are possible at a
later date.
- Although rescheduling timeouts is a side effect of the underlying
implementation, I don't forsee us using anything but a timeout wheel
in the future. Other data structures are too slow in practice, so
I doubt that the concept of a rescheduled timeout will be irrelevant
any time soon.
- I think the development utility of gathering these sorts of statistics
is high. Watching the distribution of timeouts under a given workflow
is informative.
ok visa@
|
|
only called from main(). There allocation must not fail, so better
use M_WAITOK and remove error handling. As it is not a temporary
buffer, M_TTYS is more appropriate.
OK deraadt@ mpi@
|
|
|
|
OK cheloha@, anton@, mpi@
|
|
pipe_lock. This add a potential sleeping point in the kqueue filter
routines which should be fine by now thanks to changes made to the
kqueue subsystem by visa.
ok visa@
|
|
ok visa@
|
|
Missing piece of tickless timeout revert.
|
|
structures"
Backed out during revert of "timeout(9): switch to tickless backend".
Original commit message:
- CIRCQ_APPEND -> CIRCQ_CONCAT
- Flip argument order of CIRCQ_INSERT to match e.g. TAILQ_INSERT_TAIL
- CIRCQ_INSERT -> CIRCQ_INSERT_TAIL
- Add CIRCQ_FOREACH, use it in ddb(4) when printing buckets
- While here, use tabs for indentation like we do with other macros
ok visa@ mpi@
|
|
timehands.th_adjustment"
Reverted with backout of tickless timeouts.
Original commit message:
We currently mix timecounter.tc_freq_adj and timehands.th_adjtimedelta
in ntp_update_second() to produce timehands.th_adjustment, our net skew.
But if you set a low enough adjfreq(2) adjustment you can freeze time.
This prevents ntp_update_second() from running again. So even if you
then set a sane adjfreq(2) you cannot unfreeze time without rebooting.
If we just reread timecounter.tc_freq_adj every time we recompute
timehands.th_scale we avoid this trap. visa@ notes that this is
more costly than what we currently do but that the cost itself is
negligible.
Intuitively, timecounter.tc_freq_adj is a constant skew and should be
handled separately from timehands.th_adjtimedelta, an adjustment that
we chip away at very slowly.
tedu@ notes that this problem is sort-of an argument for imposing range
limits on adjfreq(2) inputs. He's right, but I think we should still
separate the counter adjustment from the adjtime(2) adjustment, with
or without range limits.
ok visa@
|
|
Tested by anton@, sashan@
OK mpi@, anton@, sashan@
|
|
In kqueue_scan(), threads have to get an exclusive access to a knote
before processing by calling knote_acquire(). This prevents the knote
from being destroyed while it is still in use. knote_acquire() also
blocks other threads from processing the knote. Once knote processing
has finished, the thread has to call knote_release().
The kqueue subsystem is still serialized by the kernel lock. If an event
filter sleeps, the kernel lock is released and another thread might
enter kqueue_scan(). kqueue_scan() uses start and end markers to keep
track of the scan's progress and it has to be aware of other threads'
markers.
This patch is a revised version of mpi@'s work derived
from DragonFly BSD. kqueue_check() has been adapted from NetBSD.
Tested by anton@, sashan@
OK mpi@, anton@, sashan@
|
|
Convert those to a consolidated status when needed in wait4(), kevent(),
and sysctl()
Pass exit code and signal separately to exit1()
(This also serves as prep for adding waitid(2))
ok mpi@
|
|
|
|
it's own checks.
|
|
ok visa@, jca@
|
|
to make the code clearer.
OK claudio@ mpi@
|
|
It appears to have caused major performance regressions all over the
network stack.
Reported by bluhm@
ok deraadt@
|
|
Prompted by and OK cheloha@
OK mpi@ anton@
|
|
OK mpi@ anton@
|
|
We currently mix timecounter.tc_freq_adj and timehands.th_adjtimedelta
in ntp_update_second() to produce timehands.th_adjustment, our net skew.
But if you set a low enough adjfreq(2) adjustment you can freeze time.
This prevents ntp_update_second() from running again. So even if you
then set a sane adjfreq(2) you cannot unfreeze time without rebooting.
If we just reread timecounter.tc_freq_adj every time we recompute
timehands.th_scale we avoid this trap. visa@ notes that this is
more costly than what we currently do but that the cost itself is
negligible.
Intuitively, timecounter.tc_freq_adj is a constant skew and should be
handled separately from timehands.th_adjtimedelta, an adjustment that
we chip away at very slowly.
tedu@ notes that this problem is sort-of an argument for imposing range
limits on adjfreq(2) inputs. He's right, but I think we should still
separate the counter adjustment from the adjtime(2) adjustment, with
or without range limits.
ok visa@
|
|
ok millert@ deraadt@
|
|
rwsleep(9) with PCATCH and rw_enter(9) with RW_INTR without the kernel
lock. In addition, now tsleep(9) with PCATCH should be safe to use
without the kernel lock if the sleep is purely time-based.
Tested by anton@, cheloha@, chris@
OK anton@, cheloha@
|
|
Use this in the buffer cache to free all the pages from a buffer,
resulting in a considerable speedup when throwing away pages from
the buffer cache.
Lots of work done with mlarkin and kettenis
ok kettinis@ deraadt@
|
|
proc, so they don't need to be explicitly initialized in thread_new()
suggested by anton@
ok kettenis@
|
|
take a struct proc* instead of a struct process*, and vice versa making
unveil_lookup() take a process* instead of a proc*.
ok beck@
|
|
ok anton@
|
|
|
|
rwlock. This lock is shared among all pipes for simplicity. In the
future, the lock will probably be replaced with one lock per pipe pair,
just like FreeBSD and NetBSD does.
While here, extract the common rundown wakeup logic into a dedicated
function.
Thanks to cheloha@ for testing and feedback.
ok mpi@ visa@
|
|
- CIRCQ_APPEND -> CIRCQ_CONCAT
- Flip argument order of CIRCQ_INSERT to match e.g. TAILQ_INSERT_TAIL
- CIRCQ_INSERT -> CIRCQ_INSERT_TAIL
- Add CIRCQ_FOREACH, use it in ddb(4) when printing buckets
- While here, use tabs for indentation like we do with other macros
ok visa@
|
|
Match FreeBSD and NetBSD.
ok bluhm@, deraadt@, kettenis@
|
|
Suggested by and ok sashan@
|
|
enforce a new policy: system calls must be in pre-registered regions.
We have discussed more strict checks than this, but none satisfy the
cost/benefit based upon our understanding of attack methods, anyways
let's see what the next iteration looks like.
This is intended to harden (translation: attackers must put extra
effort into attacking) against a mixture of W^X failures and JIT bugs
which allow syscall misinterpretation, especially in environments with
polymorphic-instruction/variable-sized instructions. It fits in a bit
with libc/libcrypto/ld.so random relink on boot and no-restart-at-crash
behaviour, particularily for remote problems. Less effective once on-host
since someone the libraries can be read.
For static-executables the kernel registers the main program's
PIE-mapped exec section valid, as well as the randomly-placed sigtramp
page. For dynamic executables ELF ld.so's exec segment is also
labelled valid; ld.so then has enough information to register libc's
exec section as valid via call-once msyscall(2)
For dynamic binaries, we continue to to permit the main program exec
segment because "go" (and potentially a few other applications) have
embedded system calls in the main program. Hopefully at least go gets
fixed soon.
We declare the concept of embedded syscalls a bad idea for numerous
reasons, as we notice the ecosystem has many of
static-syscall-in-base-binary which are dynamically linked against
libraries which in turn use libc, which contains another set of
syscall stubs. We've been concerned about adding even one additional
syscall entry point... but go's approach tends to double the entry-point
attack surface.
This was started at a nano-hackathon in Bob Beck's basement 2 weeks
ago during a long discussion with mortimer trying to hide from the SSL
scream-conversations, and finished in more comfortable circumstances
next to a wood-stove at Elk Lakes cabin with UVM scream-conversations.
ok guenther kettenis mortimer, lots of feedback from others
conversations about go with jsing tb sthen
|
|
ensure we handle the uvm_objects of bread_cluster buffers correctly.
Original commit message:
Fix the buffer cache code to not use a giant uvm obj of all pages
when a small one on each buf is all that is needed. reduces the
cost of large frees by about 25%.
Again, lots of assistence from kettenis and mlarkin
still ok kettenis@
|
|
the bread_cluster code has confused even me and mark,
we need to handle the buffer slice and dice case better
for bread_cluster.
|
|
"show malloc" output
ok deraadt@ mpi@
|
|
to determine if the buffer has pages to free. we have to
set this pointer only after we could sleep allocating pages.
setting it before creates the potential for a race to free
us while we are sleeping
ok kettenis@
|
|
ok deraadt@
|
|
when a small one on each buf is all that is needed. reduces the
cost of large frees by about 25%.
ok kettenis@
|
|
|
|
be used by kernel and ld.so in the near future. Adding the system call
earlier will reduce the number of people who try to build through and
encounter agony.
ok kettenis guenther
|
|
Rebase the timeout wheel on the system uptime clock. Timeouts are now
set to run at or after an absolute time as returned by nanouptime(9).
Timeouts are thus "tickless": they expire at a real time on that clock
instead of at a particular value of the global "ticks" variable.
To facilitate this change the timeout struct's .to_time member becomes a
timespec. Hashing timeouts into a bucket on the wheel changes slightly:
we build a 32-bit hash with 25 bits of seconds (.tv_sec) and 7 bits of
subseconds (.tv_nsec). 7 bits of subseconds means the width of the
lowest wheel level is now 2 seconds on all platforms and each bucket in
that lowest level corresponds to 1/128 seconds on the uptime clock.
These values were chosen to closely align with the current 100hz
hardclock(9) typical on almost all of our platforms. At 100hz a bucket
is currently ~1/100 seconds wide on the lowest level and the lowest
level itself is ~2.56 seconds wide. Not a huge change, but a change
nonetheless.
Because a bucket no longer corresponds to a single tick more than one
bucket may be dumped during an average timeout_hardclock_update() call.
On 100hz platforms you now dump ~2 buckets. On 64hz machines (sh) you
dump ~4 buckets. On 1024hz machines (alpha) you dump only 1 bucket,
but you are doing extra work in softclock() to reschedule timeouts
that aren't due yet.
To avoid changing current behavior all timeout_add*(9) interfaces
convert their timeout interval into ticks, compute an equivalent
timespec interval, and then add that interval to the timestamp of
the most recent timeout_hardclock_update() call to determine an
absolute deadline. So all current timeouts still "use" ticks,
but the ticks are faked in the timeout layer.
A new interface, timeout_at_ts(9), is introduced here to bypass this
backwardly compatible behavior. It will be used in subsequent diffs
to add absolute timeout support for userland and to clean up some of
the messier parts of kernel timekeeping, especially at the syscall
layer.
Because timeouts are based against the uptime clock they are subject to
NTP adjustment via adjtime(2) and adjfreq(2). Unless you have a crazy
adjfreq(2) adjustment set this will not change the expiration behavior
of your timeouts.
Tons of design feedback from mpi@, visa@, guenther@, and kettenis@.
Additional amd64 testing from anton@ and visa@. Octeon testing from visa@.
macppc testing from me.
Positive feedback from deraadt@, ok visa@
|
|
require this anymore since we now behave like posix.
Fixes a problem where a symlink to / would return ENOTDIR because
the parent could not be locked - noticed by Raimo Niskanen <raimo@erlang.org>
ok guenther@ deraadt@
|
|
rwsleep(). All made possible by the recent switch to using a rwlock as
the exclusive pipe lock.
ok visa@
|
|
The checker keeps track of all held rwlocks, so it is able to tell
if a given thread holds a specific lock even when the lock is shared.
OK anton@ mpi@
|
|
|
|
in a few places. No code change.
|
|
With this semantic change it is now possible to use a similar assert for
both mutexes and rwlocks as required by the vm_map_assert_lock() diff.
ok sashan@
|