src - OpenBSD base system

Age	Commit message (Collapse)	Author
2019-12-19	Start protecting the pipe_peer member of `struct pipe' using the	anton
	pipe_lock. This add a potential sleeping point in the kqueue filter routines which should be fine by now thanks to changes made to the kqueue subsystem by visa. ok visa@
2019-12-19	Convert infinite sleeps to {m,t}sleep_nsec(9).	Martin Pieuchot
	ok visa@
2019-12-12	tc_setclock: reintroduce timeout_adjust_ticks() call	cheloha
	Missing piece of tickless timeout revert.
2019-12-12	Recommit "timeout(9): make CIRCQ look more like other sys/queue.h data ↵	cheloha
	structures" Backed out during revert of "timeout(9): switch to tickless backend". Original commit message: - CIRCQ_APPEND -> CIRCQ_CONCAT - Flip argument order of CIRCQ_INSERT to match e.g. TAILQ_INSERT_TAIL - CIRCQ_INSERT -> CIRCQ_INSERT_TAIL - Add CIRCQ_FOREACH, use it in ddb(4) when printing buckets - While here, use tabs for indentation like we do with other macros ok visa@ mpi@
2019-12-12	Recommit "tc_windup: separate timecounter.tc_freq_adj from ↵	cheloha
	timehands.th_adjustment" Reverted with backout of tickless timeouts. Original commit message: We currently mix timecounter.tc_freq_adj and timehands.th_adjtimedelta in ntp_update_second() to produce timehands.th_adjustment, our net skew. But if you set a low enough adjfreq(2) adjustment you can freeze time. This prevents ntp_update_second() from running again. So even if you then set a sane adjfreq(2) you cannot unfreeze time without rebooting. If we just reread timecounter.tc_freq_adj every time we recompute timehands.th_scale we avoid this trap. visa@ notes that this is more costly than what we currently do but that the cost itself is negligible. Intuitively, timecounter.tc_freq_adj is a constant skew and should be handled separately from timehands.th_adjtimedelta, an adjustment that we chip away at very slowly. tedu@ notes that this problem is sort-of an argument for imposing range limits on adjfreq(2) inputs. He's right, but I think we should still separate the counter adjustment from the adjtime(2) adjustment, with or without range limits. ok visa@
2019-12-12	Reintroduce socket locking inside socket event filters.	Visa Hankala
	Tested by anton@, sashan@ OK mpi@, anton@, sashan@
2019-12-12	Allow sleeping inside kqueue event filters.	Visa Hankala
	In kqueue_scan(), threads have to get an exclusive access to a knote before processing by calling knote_acquire(). This prevents the knote from being destroyed while it is still in use. knote_acquire() also blocks other threads from processing the knote. Once knote processing has finished, the thread has to call knote_release(). The kqueue subsystem is still serialized by the kernel lock. If an event filter sleeps, the kernel lock is released and another thread might enter kqueue_scan(). kqueue_scan() uses start and end markers to keep track of the scan's progress and it has to be aware of other threads' markers. This patch is a revised version of mpi@'s work derived from DragonFly BSD. kqueue_check() has been adapted from NetBSD. Tested by anton@, sashan@ OK mpi@, anton@, sashan@
2019-12-11	Replace p_xstat with ps_xexit and ps_xsig	Philip Guenther
	Convert those to a consolidated status when needed in wait4(), kevent(), and sysctl() Pass exit code and signal separately to exit1() (This also serves as prep for adding waitid(2)) ok mpi@
2019-12-09	typo	Theo de Raadt

2019-12-08	msyscall(2) is like kbind(2), and should be always permitted. it does	Theo de Raadt
	it's own checks.
2019-12-08	Convert infinite sleeps to tsleep_nsec(9).	Martin Pieuchot
	ok visa@, jca@
2019-12-07	Combine macro KNOTE_ACTIVATE() with function knote_activate()	Visa Hankala
	to make the code clearer. OK claudio@ mpi@
2019-12-02	Revert "timeout(9): switch to tickless backend"	cheloha
	It appears to have caused major performance regressions all over the network stack. Reported by bluhm@ ok deraadt@
2019-12-02	Replace rwsleep(9) with rwsleep_nsec(9) in vfs_lockf.c.	Visa Hankala
	Prompted by and OK cheloha@ OK mpi@ anton@
2019-12-02	Remove now unneeded kernel locking from vfs_lockf.c.	Visa Hankala
	OK mpi@ anton@
2019-12-02	tc_windup: separate timecounter.tc_freq_adj from timehands.th_adjustment	cheloha
	We currently mix timecounter.tc_freq_adj and timehands.th_adjtimedelta in ntp_update_second() to produce timehands.th_adjustment, our net skew. But if you set a low enough adjfreq(2) adjustment you can freeze time. This prevents ntp_update_second() from running again. So even if you then set a sane adjfreq(2) you cannot unfreeze time without rebooting. If we just reread timecounter.tc_freq_adj every time we recompute timehands.th_scale we avoid this trap. visa@ notes that this is more costly than what we currently do but that the cost itself is negligible. Intuitively, timecounter.tc_freq_adj is a constant skew and should be handled separately from timehands.th_adjtimedelta, an adjustment that we chip away at very slowly. tedu@ notes that this problem is sort-of an argument for imposing range limits on adjfreq(2) inputs. He's right, but I think we should still separate the counter adjustment from the adjtime(2) adjustment, with or without range limits. ok visa@
2019-12-01	comply with POSIX and make execve() return EACCES for directories	Christian Weisgerber
	ok millert@ deraadt@
2019-11-30	Move kernel locking inside the sleep machinery. This enables calling	Visa Hankala
	rwsleep(9) with PCATCH and rw_enter(9) with RW_INTR without the kernel lock. In addition, now tsleep(9) with PCATCH should be safe to use without the kernel lock if the sleep is purely time-based. Tested by anton@, cheloha@, chris@ OK anton@, cheloha@
2019-11-29	Add uvm_objfree function to free all pages in a uvm_obj in one go.	Bob Beck
	Use this in the buffer cache to free all the pages from a buffer, resulting in a considerable speedup when throwing away pages from the buffer cache. Lots of work done with mlarkin and kettenis ok kettinis@ deraadt@
2019-11-29	Move p_sleeplocks and p_limit into the "zero on create" section of struct	Philip Guenther
	proc, so they don't need to be explicitly initialized in thread_new() suggested by anton@ ok kettenis@
2019-11-29	Eliminate the sketchy use of ps_mainproc here by making unveil_add_vnode()	Philip Guenther
	take a struct proc* instead of a struct process, and vice versa making unveil_lookup() take a process instead of a proc*. ok beck@
2019-11-29	Move kcov(4)'s p_kd into the "zero on create" section to simplify fork code	Philip Guenther
	ok anton@
2019-11-29	add missing parens around return expression and zap empty line	anton

2019-11-29	Start protecting the pipe_busy field of struct pipe using a global	anton
	rwlock. This lock is shared among all pipes for simplicity. In the future, the lock will probably be replaced with one lock per pipe pair, just like FreeBSD and NetBSD does. While here, extract the common rundown wakeup logic into a dedicated function. Thanks to cheloha@ for testing and feedback. ok mpi@ visa@
2019-11-29	timeout(9): make CIRCQ look more like other sys/queue.h data structures	cheloha
	- CIRCQ_APPEND -> CIRCQ_CONCAT - Flip argument order of CIRCQ_INSERT to match e.g. TAILQ_INSERT_TAIL - CIRCQ_INSERT -> CIRCQ_INSERT_TAIL - Add CIRCQ_FOREACH, use it in ddb(4) when printing buckets - While here, use tabs for indentation like we do with other macros ok visa@
2019-11-29	Return EBUSY for successive PT_TRACE_ME calls.	Martin Pieuchot
	Match FreeBSD and NetBSD. ok bluhm@, deraadt@, kettenis@
2019-11-29	Use RW_PROC() consistently.	Martin Pieuchot
	Suggested by and ok sashan@
2019-11-29	Repurpose the "syscalls must be on a writeable page" mechanism to	Theo de Raadt
	enforce a new policy: system calls must be in pre-registered regions. We have discussed more strict checks than this, but none satisfy the cost/benefit based upon our understanding of attack methods, anyways let's see what the next iteration looks like. This is intended to harden (translation: attackers must put extra effort into attacking) against a mixture of W^X failures and JIT bugs which allow syscall misinterpretation, especially in environments with polymorphic-instruction/variable-sized instructions. It fits in a bit with libc/libcrypto/ld.so random relink on boot and no-restart-at-crash behaviour, particularily for remote problems. Less effective once on-host since someone the libraries can be read. For static-executables the kernel registers the main program's PIE-mapped exec section valid, as well as the randomly-placed sigtramp page. For dynamic executables ELF ld.so's exec segment is also labelled valid; ld.so then has enough information to register libc's exec section as valid via call-once msyscall(2) For dynamic binaries, we continue to to permit the main program exec segment because "go" (and potentially a few other applications) have embedded system calls in the main program. Hopefully at least go gets fixed soon. We declare the concept of embedded syscalls a bad idea for numerous reasons, as we notice the ecosystem has many of static-syscall-in-base-binary which are dynamically linked against libraries which in turn use libc, which contains another set of syscall stubs. We've been concerned about adding even one additional syscall entry point... but go's approach tends to double the entry-point attack surface. This was started at a nano-hackathon in Bob Beck's basement 2 weeks ago during a long discussion with mortimer trying to hide from the SSL scream-conversations, and finished in more comfortable circumstances next to a wood-stove at Elk Lakes cabin with UVM scream-conversations. ok guenther kettenis mortimer, lots of feedback from others conversations about go with jsing tb sthen
2019-11-29	Re commit what was committed in version 1.43 with a fix added to	Bob Beck
	ensure we handle the uvm_objects of bread_cluster buffers correctly. Original commit message: Fix the buffer cache code to not use a giant uvm obj of all pages when a small one on each buf is all that is needed. reduces the cost of large frees by about 25%. Again, lots of assistence from kettenis and mlarkin still ok kettenis@
2019-11-28	back out the buffer cache uvm_obj change for now.	Bob Beck
	the bread_cluster code has confused even me and mark, we need to handle the buffer slice and dice case better for bread_cluster.
2019-11-28	Delete km_mapblocks from kmemstats and its always-zero column from the ddb	Philip Guenther
	"show malloc" output ok deraadt@ mpi@
2019-11-28	Fix panic noticed by bluhm@ and florian@. bp->b_pobj is used	Bob Beck
	to determine if the buffer has pages to free. we have to set this pointer only after we could sleep allocating pages. setting it before creates the potential for a race to free us while we are sleeping ok kettenis@
2019-11-28	struct execsw's es_emul is no longer used, so delete it	Philip Guenther
	ok deraadt@
2019-11-28	Fix the buffer cache code to not use a giant uvm obj of all pages	Bob Beck
	when a small one on each buf is all that is needed. reduces the cost of large frees by about 25%. ok kettenis@
2019-11-27	sync	Theo de Raadt

2019-11-27	Add dummy msyscall(2) system call which is currently a noop. This will	Theo de Raadt
	be used by kernel and ld.so in the near future. Adding the system call earlier will reduce the number of people who try to build through and encounter agony. ok kettenis guenther
2019-11-26	timeout(9): switch to tickless backend	cheloha
	Rebase the timeout wheel on the system uptime clock. Timeouts are now set to run at or after an absolute time as returned by nanouptime(9). Timeouts are thus "tickless": they expire at a real time on that clock instead of at a particular value of the global "ticks" variable. To facilitate this change the timeout struct's .to_time member becomes a timespec. Hashing timeouts into a bucket on the wheel changes slightly: we build a 32-bit hash with 25 bits of seconds (.tv_sec) and 7 bits of subseconds (.tv_nsec). 7 bits of subseconds means the width of the lowest wheel level is now 2 seconds on all platforms and each bucket in that lowest level corresponds to 1/128 seconds on the uptime clock. These values were chosen to closely align with the current 100hz hardclock(9) typical on almost all of our platforms. At 100hz a bucket is currently ~1/100 seconds wide on the lowest level and the lowest level itself is ~2.56 seconds wide. Not a huge change, but a change nonetheless. Because a bucket no longer corresponds to a single tick more than one bucket may be dumped during an average timeout_hardclock_update() call. On 100hz platforms you now dump ~2 buckets. On 64hz machines (sh) you dump ~4 buckets. On 1024hz machines (alpha) you dump only 1 bucket, but you are doing extra work in softclock() to reschedule timeouts that aren't due yet. To avoid changing current behavior all timeout_add*(9) interfaces convert their timeout interval into ticks, compute an equivalent timespec interval, and then add that interval to the timestamp of the most recent timeout_hardclock_update() call to determine an absolute deadline. So all current timeouts still "use" ticks, but the ticks are faked in the timeout layer. A new interface, timeout_at_ts(9), is introduced here to bypass this backwardly compatible behavior. It will be used in subsequent diffs to add absolute timeout support for userland and to clean up some of the messier parts of kernel timekeeping, especially at the syscall layer. Because timeouts are based against the uptime clock they are subject to NTP adjustment via adjtime(2) and adjfreq(2). Unless you have a crazy adjfreq(2) adjustment set this will not change the expiration behavior of your timeouts. Tons of design feedback from mpi@, visa@, guenther@, and kettenis@. Additional amd64 testing from anton@ and visa@. Octeon testing from visa@. macppc testing from me. Positive feedback from deraadt@, ok visa@
2019-11-26	Don't use LOCKPARENT on namei calls for realpath(). We don't	Bob Beck
	require this anymore since we now behave like posix. Fixes a problem where a symlink to / would return ENOTDIR because the parent could not be locked - noticed by Raimo Niskanen <raimo@erlang.org> ok guenther@ deraadt@
2019-11-19	When waiting on pipe I/O, simplify the unlock/relock logic using	anton
	rwsleep(). All made possible by the recent switch to using a rwlock as the exclusive pipe lock. ok visa@
2019-11-16	Provide exact lock assertions for rwlocks when witness(4) is enabled.	Visa Hankala
	The checker keeps track of all held rwlocks, so it is able to tell if a given thread holds a specific lock even when the lock is shared. OK anton@ mpi@
2019-11-15	Remove gratuitous #ifdef.	Visa Hankala

2019-11-15	Fix a spelling error in a comment and remove some extra whitespace	Mike Larkin
	in a few places. No code change.
2019-11-12	Only check if the current thread has the lock in rw_assert_unlocked(9).	Martin Pieuchot
	With this semantic change it is now possible to use a similar assert for both mutexes and rwlocks as required by the vm_map_assert_lock() diff. ok sashan@
2019-11-12	Check sleep timeout state only if the sleep has a timeout. Otherwise,	Visa Hankala
	the timeout cancellation in sleep_finish_timeout() would acquire the kernel lock every time in the no-timeout case, as noticed by mpi@. This also reduces the contention of timeout_mutex. OK mpi@, feedback guenther@
2019-11-11	Extended the scope of the pipelock() in pipe_write() making the locking	anton
	pattern more similar to pipe_read(). This also eliminates two races caused by relocking. ok visa@
2019-11-10	Invert a conditional in pipe_write() for reduced indent and in	anton
	preparation for further refactoring. ok cheloha@ mpi@ visa@
2019-11-10	Change the EINVAL return code to a KASSERT if the namei structure is	Bob Beck
	initialized incorrectly for vn_open ok visa@ anton@
2019-11-09	Replace the hand-rolled pipe lock with a rwlock. A necessary first step	anton
	towards unlocking pipes. ok cheloha@ mpi@ visa@
2019-11-07	adjfreq(2): fix atomic swap	cheloha
	I broke adjfreq(2)'s atomic swap in kern_time.c,v1.112. By using the "f" variable to store both the new and old frequency adjustments, the new adjustment gets clobbered by the old adjustment if the caller asked for a swap. ok visa@ mpi@
2019-11-07	db_addr_t -> vaddr_t, missed in previous.	Martin Pieuchot
	ok deraadt@