path: root/sys/kern
Age  Commit message  Author
2019-12-19  Start protecting the pipe_peer member of `struct pipe' using the  (anton)
pipe_lock. This adds a potential sleeping point in the kqueue filter routines which should be fine by now thanks to changes made to the kqueue subsystem by visa. ok visa@
2019-12-19  Convert infinite sleeps to {m,t}sleep_nsec(9).  (Martin Pieuchot)
ok visa@
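A minimal sketch of the conversion pattern, assuming a caller that previously slept forever by passing a timeout of 0 (the identifiers are illustrative):

    /* before: a timo argument of 0 meant sleep until woken */
    error = tsleep(ident, PWAIT, "example", 0);

    /* after: INFSLP makes the unbounded sleep explicit */
    error = tsleep_nsec(ident, PWAIT, "example", INFSLP);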
2019-12-12  tc_setclock: reintroduce timeout_adjust_ticks() call  (cheloha)
Missing piece of tickless timeout revert.
2019-12-12  Recommit "timeout(9): make CIRCQ look more like other sys/queue.h data structures"  (cheloha)
Backed out during revert of "timeout(9): switch to tickless backend". Original commit message:
- CIRCQ_APPEND -> CIRCQ_CONCAT
- Flip argument order of CIRCQ_INSERT to match e.g. TAILQ_INSERT_TAIL
- CIRCQ_INSERT -> CIRCQ_INSERT_TAIL
- Add CIRCQ_FOREACH, use it in ddb(4) when printing buckets
- While here, use tabs for indentation like we do with other macros
ok visa@ mpi@
2019-12-12  Recommit "tc_windup: separate timecounter.tc_freq_adj from timehands.th_adjustment"  (cheloha)
Reverted with backout of tickless timeouts. Original commit message:
We currently mix timecounter.tc_freq_adj and timehands.th_adjtimedelta in ntp_update_second() to produce timehands.th_adjustment, our net skew. But if you set a low enough adjfreq(2) adjustment you can freeze time. This prevents ntp_update_second() from running again. So even if you then set a sane adjfreq(2) you cannot unfreeze time without rebooting.
If we just reread timecounter.tc_freq_adj every time we recompute timehands.th_scale we avoid this trap. visa@ notes that this is more costly than what we currently do but that the cost itself is negligible.
Intuitively, timecounter.tc_freq_adj is a constant skew and should be handled separately from timehands.th_adjtimedelta, an adjustment that we chip away at very slowly.
tedu@ notes that this problem is sort-of an argument for imposing range limits on adjfreq(2) inputs. He's right, but I think we should still separate the counter adjustment from the adjtime(2) adjustment, with or without range limits.
ok visa@
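A hedged sketch of the idea: reread the constant frequency skew on every scale recomputation instead of baking it into th_adjustment once. The field names follow the message; the exact arithmetic is illustrative, not the committed code:

    /*
     * Recompute th_scale each windup. th_adjustment now carries only
     * the adjtime(2) delta; tc_freq_adj is reread here, so an old
     * pathological value can no longer wedge future updates.
     */
    scale = (u_int64_t)1 << 63;
    scale += (th->th_adjustment + tc_freq_adj) / 1024 * 2199;
    scale /= th->th_counter->tc_frequency;
    th->th_scale = scale * 2;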
2019-12-12  Reintroduce socket locking inside socket event filters.  (Visa Hankala)
Tested by anton@, sashan@ OK mpi@, anton@, sashan@
2019-12-12  Allow sleeping inside kqueue event filters.  (Visa Hankala)
In kqueue_scan(), threads have to get exclusive access to a knote before processing by calling knote_acquire(). This prevents the knote from being destroyed while it is still in use. knote_acquire() also blocks other threads from processing the knote. Once knote processing has finished, the thread has to call knote_release().
The kqueue subsystem is still serialized by the kernel lock. If an event filter sleeps, the kernel lock is released and another thread might enter kqueue_scan(). kqueue_scan() uses start and end markers to keep track of the scan's progress and it has to be aware of other threads' markers.
This patch is a revised version of mpi@'s work derived from DragonFly BSD. kqueue_check() has been adapted from NetBSD.
Tested by anton@, sashan@
OK mpi@, anton@, sashan@
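A minimal sketch of the acquire/release pattern described above, as it might appear in a scan loop (the control flow is illustrative, not the exact kqueue_scan() code):

    /* claim the knote before running a filter that may sleep */
    if (knote_acquire(kn) == 0)
        continue;    /* lost a race; the knote may be stale */

    active = kn->kn_fop->f_event(kn, 0);    /* filter may sleep here */

    knote_release(kn);    /* allow other threads to process the knote */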
2019-12-11  Replace p_xstat with ps_xexit and ps_xsig  (Philip Guenther)
Convert those to a consolidated status when needed in wait4(), kevent(), and sysctl(). Pass exit code and signal separately to exit1(). (This also serves as prep for adding waitid(2).) ok mpi@
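A sketch of the consolidation, assuming the usual wait(2) status encoding from <sys/wait.h> (the surrounding code is illustrative):

    /* build a classic wait4(2) status from the separated fields */
    if (pr->ps_flags & PS_ZOMBIE)
        status = W_EXITCODE(pr->ps_xexit, pr->ps_xsig);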
2019-12-09  typo  (Theo de Raadt)
2019-12-08  msyscall(2) is like kbind(2), and should always be permitted. It does  (Theo de Raadt)
its own checks.
2019-12-08  Convert infinite sleeps to tsleep_nsec(9).  (Martin Pieuchot)
ok visa@, jca@
2019-12-07  Combine macro KNOTE_ACTIVATE() with function knote_activate()  (Visa Hankala)
to make the code clearer. OK claudio@ mpi@
2019-12-02  Revert "timeout(9): switch to tickless backend"  (cheloha)
It appears to have caused major performance regressions all over the network stack. Reported by bluhm@ ok deraadt@
2019-12-02  Replace rwsleep(9) with rwsleep_nsec(9) in vfs_lockf.c.  (Visa Hankala)
Prompted by and OK cheloha@ OK mpi@ anton@
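A sketch of this conversion, analogous to the tsleep_nsec(9) change above (the lock, channel, and wait-message names are illustrative):

    /* before: tick-based, 0 meaning no timeout */
    error = rwsleep(block, &lockf_lock, priority, "lockf", 0);

    /* after: nanosecond interface, INFSLP for an unbounded sleep */
    error = rwsleep_nsec(block, &lockf_lock, priority, "lockf", INFSLP);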
2019-12-02  Remove now unneeded kernel locking from vfs_lockf.c.  (Visa Hankala)
OK mpi@ anton@
2019-12-02  tc_windup: separate timecounter.tc_freq_adj from timehands.th_adjustment  (cheloha)
We currently mix timecounter.tc_freq_adj and timehands.th_adjtimedelta in ntp_update_second() to produce timehands.th_adjustment, our net skew. But if you set a low enough adjfreq(2) adjustment you can freeze time. This prevents ntp_update_second() from running again. So even if you then set a sane adjfreq(2) you cannot unfreeze time without rebooting.
If we just reread timecounter.tc_freq_adj every time we recompute timehands.th_scale we avoid this trap. visa@ notes that this is more costly than what we currently do but that the cost itself is negligible.
Intuitively, timecounter.tc_freq_adj is a constant skew and should be handled separately from timehands.th_adjtimedelta, an adjustment that we chip away at very slowly.
tedu@ notes that this problem is sort-of an argument for imposing range limits on adjfreq(2) inputs. He's right, but I think we should still separate the counter adjustment from the adjtime(2) adjustment, with or without range limits.
ok visa@
2019-12-01  comply with POSIX and make execve() return EACCES for directories  (Christian Weisgerber)
ok millert@ deraadt@
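A minimal sketch of the check this implies in the exec path's vnode validation (placement and labels are illustrative):

    /* POSIX: execve(2) on a directory fails with EACCES */
    if (vp->v_type == VDIR) {
        error = EACCES;
        goto bad;
    }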
2019-11-30  Move kernel locking inside the sleep machinery. This enables calling  (Visa Hankala)
rwsleep(9) with PCATCH and rw_enter(9) with RW_INTR without the kernel lock. In addition, now tsleep(9) with PCATCH should be safe to use without the kernel lock if the sleep is purely time-based. Tested by anton@, cheloha@, chris@ OK anton@, cheloha@
2019-11-29  Add uvm_objfree function to free all pages in a uvm_obj in one go.  (Bob Beck)
Use this in the buffer cache to free all the pages from a buffer, resulting in a considerable speedup when throwing away pages from the buffer cache. Lots of work done with mlarkin and kettenis. ok kettenis@ deraadt@
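A sketch of the buffer-cache call site this enables; the b_pobj member is the per-buffer uvm_object mentioned in the 2019-11-28 entries below (usage is illustrative):

    /* release every page backing this buffer in a single call */
    if (bp->b_pobj != NULL)
        uvm_objfree(bp->b_pobj);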
2019-11-29  Move p_sleeplocks and p_limit into the "zero on create" section of struct  (Philip Guenther)
proc, so they don't need to be explicitly initialized in thread_new(). Suggested by anton@, ok kettenis@
2019-11-29  Eliminate the sketchy use of ps_mainproc here by making unveil_add_vnode()  (Philip Guenther)
take a struct proc* instead of a struct process*, and vice versa making unveil_lookup() take a process* instead of a proc*. ok beck@
2019-11-29  Move kcov(4)'s p_kd into the "zero on create" section to simplify fork code  (Philip Guenther)
ok anton@
2019-11-29  add missing parens around return expression and zap empty line  (anton)
2019-11-29  Start protecting the pipe_busy field of struct pipe using a global  (anton)
rwlock. This lock is shared among all pipes for simplicity. In the future, the lock will probably be replaced with one lock per pipe pair, just like FreeBSD and NetBSD do. While here, extract the common rundown wakeup logic into a dedicated function. Thanks to cheloha@ for testing and feedback. ok mpi@ visa@
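A sketch of the shape of such a shared lock; the variable name here is hypothetical, not necessarily the one committed:

    /* one rwlock serializing pipe_busy updates across all pipes */
    struct rwlock pipe_busy_rwlock = RWLOCK_INITIALIZER("pipebusy");

    rw_enter_write(&pipe_busy_rwlock);
    cpipe->pipe_busy++;
    rw_exit_write(&pipe_busy_rwlock);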
2019-11-29  timeout(9): make CIRCQ look more like other sys/queue.h data structures  (cheloha)
- CIRCQ_APPEND -> CIRCQ_CONCAT
- Flip argument order of CIRCQ_INSERT to match e.g. TAILQ_INSERT_TAIL
- CIRCQ_INSERT -> CIRCQ_INSERT_TAIL
- Add CIRCQ_FOREACH, use it in ddb(4) when printing buckets
- While here, use tabs for indentation like we do with other macros
ok visa@
2019-11-29  Return EBUSY for successive PT_TRACE_ME calls.  (Martin Pieuchot)
Match FreeBSD and NetBSD. ok bluhm@, deraadt@, kettenis@
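A minimal sketch of the check, assuming the process carries a PS_TRACED flag once tracing is enabled (the exact placement inside sys_ptrace() is illustrative):

    /* a second PT_TRACE_ME on an already-traced process fails */
    if (req == PT_TRACE_ME && (pr->ps_flags & PS_TRACED))
        return (EBUSY);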
2019-11-29  Use RW_PROC() consistently.  (Martin Pieuchot)
Suggested by and ok sashan@
2019-11-29  Repurpose the "syscalls must be on a writeable page" mechanism to  (Theo de Raadt)
enforce a new policy: system calls must be in pre-registered regions. We have discussed more strict checks than this, but none satisfy the cost/benefit based upon our understanding of attack methods. Anyway, let's see what the next iteration looks like.
This is intended to harden (translation: attackers must put extra effort into attacking) against a mixture of W^X failures and JIT bugs which allow syscall misinterpretation, especially in environments with polymorphic-instruction/variable-sized instructions. It fits in a bit with libc/libcrypto/ld.so random relink on boot and no-restart-at-crash behaviour, particularly for remote problems. Less effective once on-host, since the libraries can be read.
For static executables the kernel registers the main program's PIE-mapped exec section valid, as well as the randomly-placed sigtramp page. For dynamic executables ELF ld.so's exec segment is also labelled valid; ld.so then has enough information to register libc's exec section as valid via call-once msyscall(2).
For dynamic binaries, we continue to permit the main program exec segment because "go" (and potentially a few other applications) have embedded system calls in the main program. Hopefully at least go gets fixed soon.
We declare the concept of embedded syscalls a bad idea for numerous reasons, as we notice the ecosystem has many static-syscall-in-base-binary programs which are dynamically linked against libraries which in turn use libc, which contains another set of syscall stubs. We've been concerned about adding even one additional syscall entry point... but go's approach tends to double the entry-point attack surface.
This was started at a nano-hackathon in Bob Beck's basement 2 weeks ago during a long discussion with mortimer trying to hide from the SSL scream-conversations, and finished in more comfortable circumstances next to a wood-stove at Elk Lakes cabin with UVM scream-conversations.
ok guenther kettenis mortimer, lots of feedback from others, conversations about go with jsing tb sthen
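A sketch of the ld.so registration described above, using the msyscall(2) signature from the 2019-11-27 entry below (the variable names and the error handling are illustrative):

    /* ld.so: register libc's exec section as a valid syscall region, once */
    if (msyscall(libc_text_base, libc_text_len) == -1)
        _dl_die("msyscall");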
2019-11-29  Recommit what was committed in version 1.43 with a fix added to  (Bob Beck)
ensure we handle the uvm_objects of bread_cluster buffers correctly. Original commit message: Fix the buffer cache code to not use a giant uvm obj of all pages when a small one on each buf is all that is needed. Reduces the cost of large frees by about 25%. Again, lots of assistance from kettenis and mlarkin. Still ok kettenis@
2019-11-28  back out the buffer cache uvm_obj change for now.  (Bob Beck)
The bread_cluster code has confused even me and mark; we need to handle the buffer slice-and-dice case better for bread_cluster.
2019-11-28  Delete km_mapblocks from kmemstats and its always-zero column from the ddb  (Philip Guenther)
"show malloc" output. ok deraadt@ mpi@
2019-11-28  Fix panic noticed by bluhm@ and florian@. bp->b_pobj is used  (Bob Beck)
to determine if the buffer has pages to free. We have to set this pointer only after any page allocation that could sleep; setting it before creates the potential for a race that frees us while we are sleeping. ok kettenis@
2019-11-28  struct execsw's es_emul is no longer used, so delete it  (Philip Guenther)
ok deraadt@
2019-11-28  Fix the buffer cache code to not use a giant uvm obj of all pages  (Bob Beck)
when a small one on each buf is all that is needed. Reduces the cost of large frees by about 25%. ok kettenis@
2019-11-27  sync  (Theo de Raadt)
2019-11-27  Add dummy msyscall(2) system call which is currently a noop. This will  (Theo de Raadt)
be used by kernel and ld.so in the near future. Adding the system call earlier will reduce the number of people who try to build through and encounter agony. ok kettenis guenther
2019-11-26  timeout(9): switch to tickless backend  (cheloha)
Rebase the timeout wheel on the system uptime clock. Timeouts are now set to run at or after an absolute time as returned by nanouptime(9). Timeouts are thus "tickless": they expire at a real time on that clock instead of at a particular value of the global "ticks" variable.
To facilitate this change the timeout struct's .to_time member becomes a timespec. Hashing timeouts into a bucket on the wheel changes slightly: we build a 32-bit hash with 25 bits of seconds (.tv_sec) and 7 bits of subseconds (.tv_nsec). 7 bits of subseconds means the width of the lowest wheel level is now 2 seconds on all platforms and each bucket in that lowest level corresponds to 1/128 seconds on the uptime clock. These values were chosen to closely align with the current 100hz hardclock(9) typical on almost all of our platforms. At 100hz a bucket is currently ~1/100 seconds wide on the lowest level and the lowest level itself is ~2.56 seconds wide. Not a huge change, but a change nonetheless.
Because a bucket no longer corresponds to a single tick more than one bucket may be dumped during an average timeout_hardclock_update() call. On 100hz platforms you now dump ~2 buckets. On 64hz machines (sh) you dump ~4 buckets. On 1024hz machines (alpha) you dump only 1 bucket, but you are doing extra work in softclock() to reschedule timeouts that aren't due yet.
To avoid changing current behavior all timeout_add*(9) interfaces convert their timeout interval into ticks, compute an equivalent timespec interval, and then add that interval to the timestamp of the most recent timeout_hardclock_update() call to determine an absolute deadline. So all current timeouts still "use" ticks, but the ticks are faked in the timeout layer.
A new interface, timeout_at_ts(9), is introduced here to bypass this backwardly compatible behavior. It will be used in subsequent diffs to add absolute timeout support for userland and to clean up some of the messier parts of kernel timekeeping, especially at the syscall layer.
Because timeouts are based against the uptime clock they are subject to NTP adjustment via adjtime(2) and adjfreq(2). Unless you have a crazy adjfreq(2) adjustment set this will not change the expiration behavior of your timeouts.
Tons of design feedback from mpi@, visa@, guenther@, and kettenis@. Additional amd64 testing from anton@ and visa@. Octeon testing from visa@. macppc testing from me.
Positive feedback from deraadt@, ok visa@
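A worked sketch of the 25+7-bit bucket hash described above (an illustration of the scheme, not necessarily the committed code):

    /*
     * 32-bit hash: the low 25 bits of the seconds, then 7 bits of
     * subseconds, so each low-level bucket spans 1/128 s and the
     * lowest wheel level spans 2 s.
     */
    uint32_t
    timeout_bucket_hash(const struct timespec *ts)
    {
        uint32_t hash;

        hash = (ts->tv_sec & 0x1ffffff) << 7;
        hash |= ts->tv_nsec / (1000000000 / 128);
        return (hash);
    }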
2019-11-26  Don't use LOCKPARENT on namei calls for realpath(). We don't  (Bob Beck)
require this anymore since we now behave like POSIX. Fixes a problem where a symlink to / would return ENOTDIR because the parent could not be locked. Noticed by Raimo Niskanen <raimo@erlang.org>. ok guenther@ deraadt@
2019-11-19  When waiting on pipe I/O, simplify the unlock/relock logic using  (anton)
rwsleep(). All made possible by the recent switch to using a rwlock as the exclusive pipe lock. ok visa@
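A sketch of the simplification, assuming the pipe's exclusive rwlock from the 2019-11-09 entry below (the names are illustrative):

    /* before: drop the lock, sleep, retake the lock by hand */
    rw_exit_write(cpipe->pipe_lock);
    error = tsleep(cpipe, PRIBIO | PCATCH, "pipe", 0);
    rw_enter_write(cpipe->pipe_lock);

    /* after: rwsleep(9) releases and reacquires the rwlock itself */
    error = rwsleep(cpipe, cpipe->pipe_lock, PRIBIO | PCATCH, "pipe", 0);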
2019-11-16  Provide exact lock assertions for rwlocks when witness(4) is enabled.  (Visa Hankala)
The checker keeps track of all held rwlocks, so it is able to tell if a given thread holds a specific lock even when the lock is shared. OK anton@ mpi@
2019-11-15  Remove gratuitous #ifdef.  (Visa Hankala)
2019-11-15  Fix a spelling error in a comment and remove some extra whitespace  (Mike Larkin)
in a few places. No code change.
2019-11-12  Only check if the current thread has the lock in rw_assert_unlocked(9).  (Martin Pieuchot)
With this semantic change it is now possible to use a similar assert for both mutexes and rwlocks as required by the vm_map_assert_lock() diff. ok sashan@
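A sketch of the kind of assertion this semantic enables; vm_map_assert_lock() follows the name in the message, but the body here is illustrative:

    void
    vm_map_assert_unlocked(struct vm_map *map)
    {
        /* passes even if another thread holds the lock */
        rw_assert_unlocked(&map->lock);
    }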
2019-11-12  Check sleep timeout state only if the sleep has a timeout. Otherwise,  (Visa Hankala)
the timeout cancellation in sleep_finish_timeout() would acquire the kernel lock every time in the no-timeout case, as noticed by mpi@. This also reduces the contention of timeout_mutex. OK mpi@, feedback guenther@
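A minimal sketch of the shape of the fix; sleep_finish_timeout() is named in the message, but the sls_timeout flag used here is hypothetical:

    /* touch the timeout state, and timeout_mutex, only if one was armed */
    if (sls->sls_timeout)
        error = sleep_finish_timeout(sls);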
2019-11-11  Extend the scope of pipelock() in pipe_write(), making the locking  (anton)
pattern more similar to pipe_read(). This also eliminates two races caused by relocking. ok visa@
2019-11-10  Invert a conditional in pipe_write() for reduced indent and in  (anton)
preparation for further refactoring. ok cheloha@ mpi@ visa@
2019-11-10  Change the EINVAL return code to a KASSERT if the namei structure is  (Bob Beck)
initialized incorrectly for vn_open(). ok visa@ anton@
2019-11-09  Replace the hand-rolled pipe lock with a rwlock. A necessary first step  (anton)
towards unlocking pipes. ok cheloha@ mpi@ visa@
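A sketch of the before/after shape of this replacement (the member names and the old flag mechanism are illustrative):

    /* before: a hand-rolled busy flag plus tsleep()/wakeup() */

    /* after: a real rwlock, which also makes rwsleep(9) usable for
     * the unlock/relock simplification in the 2019-11-19 entry above */
    rw_enter_write(cpipe->pipe_lock);
    /* ... modify pipe buffer state ... */
    rw_exit_write(cpipe->pipe_lock);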
2019-11-07  adjfreq(2): fix atomic swap  (cheloha)
I broke adjfreq(2)'s atomic swap in kern_time.c,v1.112. By using the "f" variable to store both the new and old frequency adjustments, the new adjustment gets clobbered by the old adjustment if the caller asked for a swap. ok visa@ mpi@
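A minimal sketch of the bug class and the fix (the variable names are illustrative; the message names the real file and revision):

    /* buggy: one variable holds both the new and the old value */
    if (freq != NULL)
        f = *freq;           /* the caller's new adjustment... */
    if (oldfreq != NULL)
        f = tc_freq_adj;     /* ...clobbered when a swap was requested */

    /* fixed: capture the old value separately before installing */
    oldf = tc_freq_adj;
    if (freq != NULL)
        tc_freq_adj = *freq;
    if (oldfreq != NULL)
        *oldfreq = oldf;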
2019-11-07  db_addr_t -> vaddr_t, missed in previous.  (Martin Pieuchot)
ok deraadt@