Age | Commit message (Collapse) | Author |
|
When a tty device is revoked, the associated knotes should be
invalidated. Otherwise the user processes can keep on receiving
events from the device.
It appears tricky to do the invalidation as part of revocation
in a way that does not allow unwanted event registration or clutter
the tty code. For now, make the knotes invalid lazily before delivery.
OK mpi@
|
|
When a knote uses the dead event filter, the knote's file descriptor is
not supposed to point to an object with pending out-of-band data. Make
the knote inactive so that userspace will not receive a spurious event.
However, kqueue-based poll(2) should still receive HUP notifications.
This lets the system use dead_filtops with less strings attached
relative to the filter type.
|
|
Do not assume event status in the modify and process callbacks. Instead
always run the event filter so that it has a chance to set knote flags.
The filter can also indicate event inactivity.
|
|
Make __EV_POLL specific to kqueue-based poll(2), to remove overlap
with __EV_SELECT that only select(2) uses.
OK millert@ mpi@
|
|
When closing a file descriptor and converting the poll/select knotes
into badfd knotes, keep the knotes attached to the by-fd table. This
should prevent kqueue_purge() from returning before the kqueue has
become quiescent. This in turn should fix a
KASSERT(TAILQ_EMPTY(&kq->kq_head)) panic in KQRELE() that bluhm@ has
reported.
The badfd conversion is only needed when a poll/select scan is ongoing.
The system can skip the conversion if the knote is not part of the
active event set.
The code of this commit skips the conversion when the fd is closed by
the same thread that has done the fd polling. This can be improved but
should already cover typical fd usage patterns.
As badfd knotes now hold slots in the by-fd table, kqueue_register()
clears them. poll/select use kqueue_register() to set up a new scan;
any found fd close notification is a leftover from the previous scan.
The new badfd handling should be free of accidental knote accumulation.
This obsoletes kqpoll_dequeue() and lowers kqpoll_init() overhead.
Re-enable lazy removal of poll/select knotes because the panic should
no longer happen.
OK mpi@
|
|
This should prevent a panic that bluhm@ has reported.
|
|
This makes it possible to attach pipe, socket and kqueue event filters
without acquiring the kernel lock. Event filters behind vn_kqfilter()
are not MP-safe yet, so vn_kqfilter() has to take KERNEL_LOCK().
dmabuf_kqfilter() can skip locking because it has no side effects.
OK anton@, mpi@
|
|
Reduce the time overhead of kqueue-based poll(2) and select(2) by
keeping knotes registered between the system calls. It is expected that
the set of monitored file descriptors is relatively unchanged between
consecutive iterations of these system calls. By keeping the knotes,
the system saves the effort of repeated knote unregistering and
re-registering.
To avoid receiving events from file descriptors that are no longer in
the monitored set, each poll/select knote is assigned an increasing
serial number. Every iteration of poll/select uses a previously unused
range of serials for its knotes. In the setup stage, kqueue_register()
updates the serials of any existing knotes in the currently monitored
set. Function kqueue_scan() delivers only the events whose serials are
recent enough; expired knotes are dropped. When the serial range is
about to wrap around, all the knotes in the kqueue backend are dropped.
This change is a space-time tradeoff. Memory usage is increased somewhat
because of the retained knotes. The increase is limited by the number
of open file descriptors and active threads.
Idea from DragonFly BSD, initial patch by mpi@, kqueue_scan()-based
approach by me.
Tested by anton@ and mpi@
OK mpi@
|
|
Use the monitored kqueue's kq_lock to serialize kqueue and knote access.
Typically, the "object lock" would cover also the klist, but that is not
possible with kqueues. knote_activate() needs kq_lock of the monitoring
kqueue, which would create lock order troubles if kq_lock was held when
calling KNOTE(&kq->kq_sel.si_note). Avoid this by using a separate klist
lock for kqueues.
The new klist lock is system-wide. Each kqueue instance could have
a dedicated klist lock. However, the efficacy of dedicated versus
system-wide lock is somewhat limited because the current implementation
activates kqueue knotes through a single thread.
OK mpi@
|
|
that the KERNEL_LOCK() is held unless the filter is marked as MPSAFE.
Should help finding missing locks when unlocking various filters.
ok visa@
|
|
Adjust kqpoll_dequeue() so that it will clear only badfd knotes when
called from kqpoll_init(). This is needed by kqpoll's lazy removal
of knotes. Eager removal in kqpoll_dequeue() would defeat kqpoll's
attempt to reuse previously established knotes under workloads where
knote activation tends to occur already before next kqpoll scan.
Prompted by mpi@
|
|
The kn_status field of struct knote is part of kqueue's internal state.
When kn_status is being updated, kq_lock has to be locked. This is true
even with MP-unsafe event filters.
OK mpi@
|
|
Reported-by: syzbot+c2aba7645a218ce03027@syzkaller.appspotmail.com
|
|
Extend struct kqueue with a mutex and use it to serializes the internals
of each kqueue instance. This should make possible to call kqueue's
system call interface without the kernel lock. The event source facing
side of kqueue should now be MP-safe, too, as long as the event source
itself is MP-safe.
msleep() with PCATCH still requires the kernel lock. To manage with
this, kqueue_scan() locks the kernel temporarily for the section that
may sleep.
As a consequence of the kqueue mutex, knote_acquire() can lose a wakeup
when klist_invalidate() calls it. To preserve proper nesting of mutexes,
knote_acquire() has to release the kqueue mutex before it unlocks klist.
This early unlocking of the mutex lets badly timed wakeups go unnoticed.
However, the system should not hang because the sleep has a timeout.
Tested by gnezdo@ and mpi@
OK mpi@
|
|
Use the pool cache to reduce the overhead of memory management in
function kqueue_register().
When EV_ADD is given, kqueue_register() pre-allocates a knote to avoid
potential sleeping in the middle of the critical section that spans
from knote lookup to insertion. However, the pre-allocation is useless
if the lookup finds a matching knote.
The cost of knote allocation will become significant with kqueue-based
poll(2) and select(2) because the frequency of allocation will increase.
Most of the cost appears to come from the locking inside the pool.
The pool cache amortizes it by using CPU-local caches of free knotes
as buffers.
OK dlg@ mpi@
|
|
When an existing EVFILT_TIMER filter is re-added, cancel the existing
timer and any pending event, and restart the timer using the new timeout
period. This makes the new timeout period take effect immediately and
matches the behaviour of FreeBSD. Previously, the new setting was
applied only after the existing timer expired.
The timer rescheduling is done by using an f_modify callback. The
reading of timer events is moved from f_event to f_process. f_event of
timer_filtops becomes redundant. Unlike most other event sources, timers
activate knotes directly without using a klist and knote(9).
OK mpi@
|
|
This does not change the current behaviour, but filterops should be
invoked through filter_*() for consistency.
|
|
Extend kqueue's filterops interface with new callbacks so that it
becomes easier to use with fine-grained locking. The new interface
delegates the serialization of kn_event access to event sources. Now
kqueue uses filterops callbacks to read or write kn_event. This hides
event sources' locking patterns from kqueue, and allows clean
implementation of atomic read-and-clear for EV_CLEAR, for instance.
There are so many existing filterops instances that converting all of
them in one go is tricky. This patch adds a wrapper mechanism that
kqueue uses when the new callbacks are missing.
The new filterops interface has been influenced by XNU's kqueue.
OK mpi@ semarie@
|
|
When a kqueue file is closed, the kqueue can still have threads
scanning it. Consequently, kqueue_terminate() can see scan markers
in the event queue. These markers are removed when the scanning threads
leave the kqueue. Take this into account when checking the queue's
state, to avoid a panic when kqueue is closed from under a thread.
OK anton@
Reported-by: syzbot+757c60a2aa1125137cce@syzkaller.appspotmail.com
|
|
Deliver file descriptor close notification for __EV_POLL knotes through
struct kevent that kqueue_scan() returns. This replaces the previous way
of returning EBADF from kqueue_scan(), making it easier to determine
what exactly has changed.
When a file descriptor is closed, its __EV_POLL knotes are turned into
one-shot events and queued for delivery. These knotes are "unregistered"
as they are reachable only through the queue of active events. This
reduces interference with the normal workings of kqueue. However, more
care is needed to avoid leaking knotes. In addition, the unregistering
removes a limit on the number of issued knotes. To prevent accumulation
of pending fd close notifications, kqpoll_init() flushes the active
queue at the start of a kqpoll scan.
OK mpi@
|
|
This prevents unwanted spinning with interrupts disabled.
At the moment, this code is only invoked through klist_invalidate()
and the callers should already hold the kernel lock. Also, one could
argue that in MP-unsafe contexts klist_lock() should only assert for
the kernel lock.
|
|
|
|
Rename klist_{insert,remove}() to klist_{insert,remove}_locked().
These functions assume that the caller has locked the klist. The current
state of locking remains intact because the kernel lock is still used
with all klists.
Add new functions klist_insert() and klist_remove() that lock the klist
internally. This allows some code simplification.
OK mpi@
|
|
Invoke dead_filtops' f_event callback in klist_invalidate() to ensure
that filt_dead() modifies every invalidated knote. If a knote has
EV_ONESHOT set in its event flags, kqueue_scan() will not call f_event.
OK mpi@
|
|
This fixes a regression where kqueue_scan() may incorrectly return
EWOULDBLOCK after a timeout.
OK mpi@
|
|
This patch extends struct klist with a callback descriptor and
an argument. The main purpose of this is to let the kqueue subsystem
assert when a klist should be locked, and operate the klist lock
in klist_invalidate().
Access to a knote list of a kqueue-monitored object has to be
serialized somehow. Because the object often has a lock for protecting
its state, and because the object often acquires this lock at the latest
in its f_event callback function, it makes sense to use this lock also
for the knote lists. The existing uses of NOTE_SUBMIT already show
a pattern that is likely to become more prevalent.
There could be an embedded lock in klist. However, such a lock would be
redundant in many cases. The code cannot rely on a single lock type
(mutex, rwlock, something else) because the needs of monitored objects
vary. In addition, an embedded lock would introduce new lock order
constraints. Note that the patch does not rule out use of dedicated
klist locks.
The patch introduces a way to associate lock operations with a klist.
The caller can provide a custom implementation, or use a ready-made
interface with a mutex or rwlock.
For compatibility with old code, the new code falls back to using the
kernel lock if no specific klist initialization has been done. The
existing code already relies on implicit initialization of klist.
Sadly, this change increases the size of struct klist. dlg@ thinks this
is not fatal, though.
OK mpi@
|
|
When the file descriptor of an __EV_POLL-flagged knote is closed,
post EBADF through the kqueue instance to the caller of kqueue_scan().
This lets kqueue-based poll() and select() preserve their current
behaviour of returning EBADF when a polled file descriptor is closed
concurrently.
OK mpi@
|
|
OK mpi@
|
|
Because kqpoll instances are now linked to the file descriptor table,
the freeing of kqpoll and ordinary kqueues is similar.
Suggested by mpi@
|
|
This lets the system remove kqpoll-related event registrations when
a file descriptor is closed.
OK mpi@
|
|
OK cheloha@, mpi@, mvs@
|
|
This will soon be used by select(2) and poll(2).
ok anton@, visa@
|
|
Stop iterating in the function and instead copy the returned events to
userland after every call.
ok visa@
|
|
It is now possible to call the function multiple times to collect events.
For that, the end marker has to be preserved between calls because otherwise
the scan might collect an event more than once. If a collected event gets
reactivated during scanning, it will be added at the tail of the queue,
out of reach because of the end marker.
This is required to implement select(2) and poll(2) on top of kqueue_scan().
Done & originally committed by visa@ in r1.143, in snap for more than 2 weeks.
ok visa@, anton@
|
|
As deraadt@ has pointed out, tracing timevals and timespecs before
validating them makes debugging easier.
|
|
The struct keeps track of the end point of an event queue scan by
persisting the end marker. This will be needed when kqueue_scan() is
called repeatedly to complete a scan in a piecewise fashion.
Extracted from a previous diff from visa@.
ok visa@, anton@
|
|
Reuse the kev[] array of sys_kevent() in kqueue_scan() to lower
stack usage.
The code has reset kevp, but not nkev, whenever the retry branch is
taken. However, the resetting is unnecessary because retry should be
taken only if no events have been collected. Make this clearer by
adding KASSERTs.
OK mpi@
|
|
This leaves knote_remove() for kqueue's internal use. As a result,
knote_remove() is used to drop knotes from the knlist of a single
kqueue instance. klist_invalidate() clears knotes from a klist that
can contain entries from different kqueue instances.
Use FILTEROP_ISFD to control how klist_invalidate() treats knotes,
to preserve the current behaviour of knote_processexit(). All the
existing callers of klist_invalidate() are fd-based. The existing code
rewires and activates knotes to give userspace a clear indication that
the state of the fd has changed. In knote_processexit(), any remaining
knotes in ps_klist are non-fd-based (EVFILT_SIGNAL). Those are dropped
without notifying userspace.
OK mpi@
|
|
This filter, already implemented in macOS and Dragonfly BSD, returns
exceptional conditions like the reception of out-of-band data.
The functionnality is similar to poll(2)'s POLLPRI & POLLRDBAND and
it can be used by the kqfilter-based poll & select implementation.
ok millert@ on a previous version, ok visa@
|
|
ok visa@, millert@
|
|
The list can be accessed from interrupt context if a signal is sent
from an interrupt handler.
OK anton@ cheloha@ mpi@
|
|
subsystem and ps_klist handling still run under the kernel lock.
|
|
Port breakages reported by naddy@
|
|
While here prefix kernel-only EV flags with two underbars.
Suggested by kettenis@, ok visa@
|
|
These functions will be used to managed per-thread kqueues that are not
associated to a file descriptor.
ok visa@
|
|
sthen@ has reported that the patch might be causing hangs with X.
|
|
The struct keeps track of the end point of an event queue scan by
persisting the end marker. This will be needed when kqueue_scan() is
called repeatedly to complete a scan in a piecewise fashion. The end
marker has to be preserved between calls because otherwise the scan
might collect an event more than once. If a collected event gets
reactivated during scanning, it will be added at the tail of the queue,
out of reach because of the end marker.
OK mpi@
|
|
for example, with locking assertions.
OK mpi@, anton@
|
|
deep recursion. This also helps making kqueue_wakeup() free of the
kernel lock because the current implementation of selwakeup()
requires the lock.
OK mpi@
|
|
While here check for the validity of the timeout at the begining of the
syscall.
ok kettenis@, cheloha@
|