|
Instead of the KERNEL_LOCK, use the ps_mtx for most operations.
If the ps_klist is modified, an additional global rwlock
(kqueue_ps_list_lock) is required. This includes the knotes with
NOTE_FORK and NOTE_EXIT since in either case a ps_klist is changed.
In the NOTE_FORK | NOTE_TRACK case the call to kqueue_register() can
sleep, which is why a global rwlock is used.
Adjust the reaper() to call knote_processexit() without KERNEL_LOCK.
Double lock idea from visa@
OK mvs@
|
|
Since proc and signal filters share the same klist, it makes sense
to keep them together.
OK mvs@
|
|
kqueue1() takes a flags argument. This lets the kqueue file descriptor
be opened with O_CLOEXEC. Adapted from NetBSD.
OK guenther@
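A minimal userspace sketch of the new interface (assumes the
declaration in <sys/event.h>):

    #include <sys/types.h>
    #include <sys/event.h>
    #include <err.h>
    #include <fcntl.h>

    int
    main(void)
    {
            int kq;

            /* the descriptor is closed automatically across exec */
            kq = kqueue1(O_CLOEXEC);
            if (kq == -1)
                    err(1, "kqueue1");
            return 0;
    }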
|
|
Add timer precision flags NOTE_SECONDS, NOTE_MSECONDS, NOTE_USECONDS
and NOTE_NSECONDS for EVFILT_TIMER. Also, add an initial implementation
of NOTE_ABSTIME timers.
Similar kevent(2) flags exist on FreeBSD, NetBSD and XNU.
Initial diff by and OK aisha@
OK mpi@
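For illustration, a sketch arming a millisecond-precision one-shot
timer (assumes an already open kqueue descriptor kq):

    struct kevent kev;

    /*
     * Fire once, 250 ms from now. With NOTE_ABSTIME the data field
     * would instead be interpreted as an absolute deadline.
     */
    EV_SET(&kev, 1, EVFILT_TIMER, EV_ADD | EV_ONESHOT, NOTE_MSECONDS,
        250, NULL);
    if (kevent(kq, &kev, 1, NULL, 0, NULL) == -1)
            err(1, "kevent");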
|
|
feedback and ok jmc@ miod, ok millert@
|
|
Make knote(9) lock the knote list internally, and add knote_locked(9)
for the typical situation where the list is already locked.
Remove the KNOTE(9) macro to simplify the API.
Manual page OK jmc@
OK mpi@ mvs@
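The resulting call pattern, sketched with a hypothetical driver
softc sc:

    /* without the klist lock held; knote(9) now locks internally */
    knote(&sc->sc_klist, 0);

    /* in a code path that already holds the list's lock */
    knote_locked(&sc->sc_klist, 0);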
|
|
OK mpi@
|
|
ok mpi@ miod@
|
|
OK jsg@
|
|
When closing a kqueue, block until any pending wakeup task has finished.
Otherwise, if a pending task progressed slowly, the kqueue could stay
alive longer than the associated file descriptor table, causing
a use-after-free in KQRELE().
This also fixes a failed assertion "p->p_kq->kq_refcnt.r_refs == 1" in
kqpoll_exit().
The use-after-free bug had existed since the introduction of
kqueue_task() (the bug could occur if fdplock() blocked in KQRELE()).
However, the issue became worse when the task was allowed to run without
the kernel lock in sys/kern/kern_event.c r1.187.
Prompted by a report from Mikhail on bugs@.
OK mpi@
Reported-by: syzbot+fca7e4fa773c90886819@syzkaller.appspotmail.com
|
|
OK mpi@
|
|
While one thread is running kqueue_scan(), another thread can begin
scanning the same kqueue, observe that the event queue is empty, and
go to sleep. If the first thread re-inserts a knote for re-processing,
the second thread can miss the newly pending event. Wake up the kqueue
after a re-insert to correct this.
This fixes a Go test hang that jsing@ tracked down to kqueue.
Tested in snaps for a week.
OK jsing@ mpi@
|
|
Always fetch the knlist array pointer at the start of every iteration
in knote_remove(). This prevents the use of a stale pointer after
another thread has simultaneously reallocated the kq_knlist array.
Reported and tested by and OK jsing@
|
|
The deferred activation can now run in an MP-safe task queue.
|
|
This fixes a problem where NOTE_EXIT could be received before
the process was officially a zombie and thus not immediately
waitable. OK deraadt@ visa@
|
|
OK dlg@ bluhm@
|
|
net/if_pppx.c pointed out by jsg@
ok gnezdo@ deraadt@ jsg@ mpi@ millert@
|
|
This avoids verb overlap with f_modify.
|
|
OK mpi@
|
|
Implement the poll(2) system call on top of the kqueue subsystem.
This obsoletes the old, non-MP-safe poll backend.
On entering poll(2), the new code translates each pollfd array entry
into a set of knotes. When these knotes receive events through kqueue,
the events are translated back to pollfd format.
Entries in the pollfd array can refer to the same file descriptor with
overlapping event masks. To allow such overlap with knotes, use an extra
kn_pollid key that separates knotes of different pollfd entries.
Adapted from DragonFly BSD, initial implementation by mpi@.
Tested in snaps for three weeks.
OK mpi@
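In rough outline (a hypothetical sketch, not the committed code), each
pollfd entry is registered along these lines, with the array index i
serving as the kn_pollid key:

    /* hypothetical: translate one pollfd entry into a read knote */
    struct kevent kev;

    EV_SET(&kev, pfd->fd, EVFILT_READ, EV_ADD | __EV_POLL, 0, 0, NULL);
    error = kqueue_register(p->p_kq, &kev, i, p);   /* i -> kn_pollid */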
|
|
When a tty device is revoked, the associated knotes should be
invalidated. Otherwise the user processes can keep on receiving
events from the device.
It appears tricky to do the invalidation as part of revocation
in a way that does not allow unwanted event registration or clutter
the tty code. For now, make the knotes invalid lazily before delivery.
OK mpi@
|
|
When a knote uses the dead event filter, the knote's file descriptor is
not supposed to point to an object with pending out-of-band data. Make
the knote inactive so that userspace will not receive a spurious event.
However, kqueue-based poll(2) should still receive HUP notifications.
This lets the system use dead_filtops with fewer strings attached
relative to the filter type.
|
|
Do not assume event status in the modify and process callbacks. Instead,
always run the event filter so that it has a chance to set knote flags.
The filter can also indicate event inactivity.
|
|
Make __EV_POLL specific to kqueue-based poll(2), to remove overlap
with __EV_SELECT that only select(2) uses.
OK millert@ mpi@
|
|
When closing a file descriptor and converting the poll/select knotes
into badfd knotes, keep the knotes attached to the by-fd table. This
should prevent kqueue_purge() from returning before the kqueue has
become quiescent. This in turn should fix a
KASSERT(TAILQ_EMPTY(&kq->kq_head)) panic in KQRELE() that bluhm@ has
reported.
The badfd conversion is only needed when a poll/select scan is ongoing.
The system can skip the conversion if the knote is not part of the
active event set.
The code of this commit skips the conversion when the fd is closed by
the same thread that has done the fd polling. This can be improved but
should already cover typical fd usage patterns.
As badfd knotes now hold slots in the by-fd table, kqueue_register()
clears them. poll/select use kqueue_register() to set up a new scan;
any found fd close notification is a leftover from the previous scan.
The new badfd handling should be free of accidental knote accumulation.
This obsoletes kqpoll_dequeue() and lowers kqpoll_init() overhead.
Re-enable lazy removal of poll/select knotes because the panic should
no longer happen.
OK mpi@
|
|
This should prevent a panic that bluhm@ has reported.
|
|
This makes it possible to attach pipe, socket and kqueue event filters
without acquiring the kernel lock. Event filters behind vn_kqfilter()
are not MP-safe yet, so vn_kqfilter() has to take KERNEL_LOCK().
dmabuf_kqfilter() can skip locking because it has no side effects.
OK anton@, mpi@
|
|
Reduce the time overhead of kqueue-based poll(2) and select(2) by
keeping knotes registered between the system calls. It is expected that
the set of monitored file descriptors is relatively unchanged between
consecutive iterations of these system calls. By keeping the knotes,
the system saves the effort of repeated knote unregistering and
re-registering.
To avoid receiving events from file descriptors that are no longer in
the monitored set, each poll/select knote is assigned an increasing
serial number. Every iteration of poll/select uses a previously unused
range of serials for its knotes. In the setup stage, kqueue_register()
updates the serials of any existing knotes in the currently monitored
set. Function kqueue_scan() delivers only the events whose serials are
recent enough; expired knotes are dropped. When the serial range is
about to wrap around, all the knotes in the kqueue backend are dropped.
This change is a space-time tradeoff. Memory usage is increased somewhat
because of the retained knotes. The increase is limited by the number
of open file descriptors and active threads.
Idea from DragonFly BSD, initial patch by mpi@, kqueue_scan()-based
approach by me.
Tested by anton@ and mpi@
OK mpi@
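A hypothetical sketch of the expiry check during a scan (the field and
variable names are invented for illustration):

    /* deliver only knotes registered in the current serial range */
    if (kn->kn_serial < scan_start_serial) {
            /* expired knote from an earlier poll/select call */
            knote_drop(kn, p);
            continue;
    }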
|
|
Use the monitored kqueue's kq_lock to serialize kqueue and knote access.
Typically, the "object lock" would also cover the klist, but that is not
possible with kqueues. knote_activate() needs kq_lock of the monitoring
kqueue, which would create lock order troubles if kq_lock was held when
calling KNOTE(&kq->kq_sel.si_note). Avoid this by using a separate klist
lock for kqueues.
The new klist lock is system-wide. Each kqueue instance could have
a dedicated klist lock. However, the efficacy of a dedicated lock over
a system-wide one is somewhat limited because the current implementation
activates kqueue knotes through a single thread.
OK mpi@
|
|
that the KERNEL_LOCK() is held unless the filter is marked as MPSAFE.
Should help find missing locks when unlocking various filters.
ok visa@
|
|
Adjust kqpoll_dequeue() so that it will clear only badfd knotes when
called from kqpoll_init(). This is needed by kqpoll's lazy removal
of knotes. Eager removal in kqpoll_dequeue() would defeat kqpoll's
attempt to reuse previously established knotes under workloads where
knote activation tends to occur before the next kqpoll scan.
Prompted by mpi@
|
|
The kn_status field of struct knote is part of kqueue's internal state.
When kn_status is being updated, kq_lock has to be locked. This is true
even with MP-unsafe event filters.
OK mpi@
|
|
Reported-by: syzbot+c2aba7645a218ce03027@syzkaller.appspotmail.com
|
|
Extend struct kqueue with a mutex and use it to serialize the internals
of each kqueue instance. This should make it possible to call kqueue's
system call interface without the kernel lock. The event-source-facing
side of kqueue should now be MP-safe, too, as long as the event source
itself is MP-safe.
msleep() with PCATCH still requires the kernel lock. To manage this,
kqueue_scan() locks the kernel temporarily for the section that
may sleep.
As a consequence of the kqueue mutex, knote_acquire() can lose a wakeup
when klist_invalidate() calls it. To preserve proper nesting of mutexes,
knote_acquire() has to release the kqueue mutex before it unlocks the klist.
This early unlocking of the mutex lets badly timed wakeups go unnoticed.
However, the system should not hang because the sleep has a timeout.
Tested by gnezdo@ and mpi@
OK mpi@
|
|
Use the pool cache to reduce the overhead of memory management in
function kqueue_register().
When EV_ADD is given, kqueue_register() pre-allocates a knote to avoid
potential sleeping in the middle of the critical section that spans
from knote lookup to insertion. However, the pre-allocation is useless
if the lookup finds a matching knote.
The cost of knote allocation will become significant with kqueue-based
poll(2) and select(2) because the frequency of allocation will increase.
Most of the cost appears to come from the locking inside the pool.
The pool cache amortizes it by using CPU-local caches of free knotes
as buffers.
OK dlg@ mpi@
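In spirit, the change amounts to the following sketch (the exact IPL
and pool flags are assumptions):

    /* enable per-CPU caching of free knotes on top of the knote pool */
    pool_init(&knote_pool, sizeof(struct knote), 0, IPL_MPFLOOR,
        PR_WAITOK, "knotepl", NULL);
    pool_cache_init(&knote_pool);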
|
|
When an existing EVFILT_TIMER filter is re-added, cancel the existing
timer and any pending event, and restart the timer using the new timeout
period. This makes the new timeout period take effect immediately and
matches the behaviour of FreeBSD. Previously, the new setting was
applied only after the existing timer expired.
The timer rescheduling is done by using an f_modify callback. The
reading of timer events is moved from f_event to f_process. f_event of
timer_filtops becomes redundant. Unlike most other event sources, timers
activate knotes directly without using a klist and knote(9).
OK mpi@
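Illustrated from userspace (assumes an open kqueue descriptor kq):

    struct kevent kev;

    /* arm a periodic timer with a 1000 ms period */
    EV_SET(&kev, 1, EVFILT_TIMER, EV_ADD, 0, 1000, NULL);
    kevent(kq, &kev, 1, NULL, 0, NULL);

    /*
     * Re-adding the same ident now restarts the timer with the 10 ms
     * period immediately, instead of after the old period expires.
     */
    EV_SET(&kev, 1, EVFILT_TIMER, EV_ADD, 0, 10, NULL);
    kevent(kq, &kev, 1, NULL, 0, NULL);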
|
|
This does not change the current behaviour, but filterops should be
invoked through filter_*() for consistency.
|
|
Extend kqueue's filterops interface with new callbacks so that it
becomes easier to use with fine-grained locking. The new interface
delegates the serialization of kn_event access to event sources. Now
kqueue uses filterops callbacks to read or write kn_event. This hides
event sources' locking patterns from kqueue, and allows a clean
implementation of atomic read-and-clear for EV_CLEAR, for instance.
There are so many existing filterops instances that converting all of
them in one go is tricky. This patch adds a wrapper mechanism that
kqueue uses when the new callbacks are missing.
The new filterops interface has been influenced by XNU's kqueue.
OK mpi@ semarie@
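A sketch of a filter adopting the new callbacks; foo_lock and the
filt_foo* names are hypothetical, and knote_modify() is assumed to be
the helper that updates kn_event from the incoming kevent:

    int
    filt_foomodify(struct kevent *kev, struct knote *kn)
    {
            int active;

            /* serialize kn_event access with the event source's lock */
            mtx_enter(&foo_lock);
            active = knote_modify(kev, kn);
            mtx_leave(&foo_lock);
            return (active);
    }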
|
|
When a kqueue file is closed, the kqueue can still have threads
scanning it. Consequently, kqueue_terminate() can see scan markers
in the event queue. These markers are removed when the scanning threads
leave the kqueue. Take this into account when checking the queue's
state, to avoid a panic when kqueue is closed from under a thread.
OK anton@
Reported-by: syzbot+757c60a2aa1125137cce@syzkaller.appspotmail.com
|
|
Deliver file descriptor close notification for __EV_POLL knotes through
struct kevent that kqueue_scan() returns. This replaces the previous way
of returning EBADF from kqueue_scan(), making it easier to determine
what exactly has changed.
When a file descriptor is closed, its __EV_POLL knotes are turned into
one-shot events and queued for delivery. These knotes are "unregistered"
as they are reachable only through the queue of active events. This
reduces interference with the normal workings of kqueue. However, more
care is needed to avoid leaking knotes. In addition, the unregistering
removes a limit on the number of issued knotes. To prevent accumulation
of pending fd close notifications, kqpoll_init() flushes the active
queue at the start of a kqpoll scan.
OK mpi@
|
|
This prevents unwanted spinning with interrupts disabled.
At the moment, this code is only invoked through klist_invalidate()
and the callers should already hold the kernel lock. Also, one could
argue that in MP-unsafe contexts klist_lock() should only assert for
the kernel lock.
|
|
Rename klist_{insert,remove}() to klist_{insert,remove}_locked().
These functions assume that the caller has locked the klist. The current
state of locking remains intact because the kernel lock is still used
with all klists.
Add new functions klist_insert() and klist_remove() that lock the klist
internally. This allows some code simplification.
OK mpi@
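The resulting pairing, with a hypothetical softc sc:

    /* the caller already holds the lock that protects the klist */
    klist_insert_locked(&sc->sc_klist, kn);

    /* or let the klist take its lock internally */
    klist_insert(&sc->sc_klist, kn);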
|
|
Invoke dead_filtops' f_event callback in klist_invalidate() to ensure
that filt_dead() modifies every invalidated knote. If a knote has
EV_ONESHOT set in its event flags, kqueue_scan() will not call f_event.
OK mpi@
|
|
This fixes a regression where kqueue_scan() may incorrectly return
EWOULDBLOCK after a timeout.
OK mpi@
|
|
This patch extends struct klist with a callback descriptor and
an argument. The main purpose of this is to let the kqueue subsystem
assert when a klist should be locked, and operate the klist lock
in klist_invalidate().
Access to a knote list of a kqueue-monitored object has to be
serialized somehow. Because the object often has a lock for protecting
its state, and because the object often acquires this lock at the latest
in its f_event callback function, it makes sense to use this lock also
for the knote lists. The existing uses of NOTE_SUBMIT already show
a pattern that is likely to become more prevalent.
There could be an embedded lock in klist. However, such a lock would be
redundant in many cases. The code cannot rely on a single lock type
(mutex, rwlock, something else) because the needs of monitored objects
vary. In addition, an embedded lock would introduce new lock order
constraints. Note that the patch does not rule out use of dedicated
klist locks.
The patch introduces a way to associate lock operations with a klist.
The caller can provide a custom implementation, or use a ready-made
interface with a mutex or rwlock.
For compatibility with old code, the new code falls back to using the
kernel lock if no specific klist initialization has been done. The
existing code already relies on implicit initialization of klist.
Sadly, this change increases the size of struct klist. dlg@ thinks this
is not fatal, though.
OK mpi@
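A sketch of the ready-made mutex interface (the foo_* names and the
IPL are hypothetical; klist_init_mutex() is assumed as the mutex-based
initializer):

    struct mutex foo_mtx = MUTEX_INITIALIZER(IPL_NET);
    struct klist foo_klist;

    /*
     * Associate the mutex with the klist so that the kqueue subsystem
     * can assert on it and operate it in klist_invalidate().
     */
    klist_init_mutex(&foo_klist, &foo_mtx);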
|
|
When the file descriptor of an __EV_POLL-flagged knote is closed,
post EBADF through the kqueue instance to the caller of kqueue_scan().
This lets kqueue-based poll() and select() preserve their current
behaviour of returning EBADF when a polled file descriptor is closed
concurrently.
OK mpi@
|
|
OK mpi@
|