|
OK sashan@ bluhm@
|
|
OK dlg@
|
|
and then rounded before checking. Put the same check before the
calculations to avoid overflow.
Reported-by: syzbot+6f29d23eca959c5a9705@syzkaller.appspotmail.com
OK claudio@
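The shape of the fix is generic: validate the user-supplied size before rounding it, so the round-up itself cannot overflow. A minimal sketch of the idea (names like bpf_maxbufsize are used for illustration, not taken from the diff):

    /* sketch: check the size before and after rounding so the
     * round-up cannot overflow the integer type */
    u_int size = *(u_int *)addr;

    if (size > bpf_maxbufsize)
        return (EINVAL);            /* reject before any arithmetic */
    size = roundup(size, sizeof(long));
    if (size > bpf_maxbufsize)      /* re-check the rounded value */
        return (EINVAL);
    d->bd_bufsize = size;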
|
|
This avoids verb overlap with f_modify.
|
|
OK mpi@
|
|
bluhm@ hit a problem while running a regress test where a packet
generated and injected via bpf ends up being consumed by the network
stack. the stack assumes that packets are aligned properly, but bpf
was lazy and put whatever was written to it at the start of an mbuf.
ethernet has a 14 byte header, so if you put that at the start the
payload will be misaligned by 2 bytes.
bpf already has handling for different link header types, so this
handling is extended a bit to align the payload after the link
header.
while here we're fixing up a few error codes. short packets produce
EINVAL instead of EPERM, and packets larger than the biggest mbuf
the kernel supports generate EMSGSIZE.
with tweaks and ok bluhm@
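a rough sketch of the idea (not the committed diff): offset the
copied-in packet so the payload behind the link header lands on a
word boundary.

    /* sketch: shift the start of the mbuf data so the bytes after
     * the link header are word aligned */
    hlen = 14;                                  /* DLT_EN10MB header */
    align = roundup(hlen, sizeof(uint32_t)) - hlen;   /* 2 for ethernet */
    m->m_data += align;
    error = uiomove(mtod(m, caddr_t), len, uio);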
|
|
this builds on the mpsafe kq/kevent work visa has been doing.
normally kevents are notified by calling selwakeup, but selwakeup
needs the KERNEL_LOCK. because bpf runs from all sorts of contexts
that may or may not have the kernel lock, the call to selwakeup is
deferred to the systq which already has the kernel lock. while this
avoids spinning in bpf for the kernel lock, it still adds latency
between when the buffer is ready for a program and when that program
gets notified about it. now that bpf kevents are mpsafe and bpf_wakeup
is already holding the necessary locks, we can avoid that latency.
bpf_wakeup now checks if there are waiting kevents and notifies
them immediately. if there are no other things to wake up, bpf_wakeup
avoids the task_add (and associated reference counting) to defer
the selwakeup call.
selwakeup can still try to notify waiting kevents, so this uses the
hint passed to knote() to differentiate between the notifications
from bpf_wakeup and selwakeup, returning early from the latter.
ok visa@
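a sketch of the shape of this change (assuming selwakeup passes
NOTE_SUBMIT as its hint; treat that constant as an assumption here):

    /* in bpf_wakeup(), bd_mtx held: notify waiting kevents now
     * instead of waiting for the deferred selwakeup */
    if (!klist_empty(&d->bd_sel.si_note))
        KNOTE(&d->bd_sel.si_note, 0);

    /* in filt_bpfread(): the later selwakeup notification is
     * redundant, so return early from it */
    int
    filt_bpfread(struct knote *kn, long hint)
    {
        struct bpf_d *d = kn->kn_hook;

        if (hint == NOTE_SUBMIT)    /* came via selwakeup(), */
            return (0);             /* already notified above */

        kn->kn_data = d->bd_hlen;
        return (kn->kn_data > 0);
    }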
|
|
Use bd_mtx to serialize bpf knote handling. This allows calling the
event filter without the kernel lock.
OK mpi@
|
|
The lookup should not fail because the kernel lock should prevent
simultaneous detaching on the vnode layer. However, most other device
kqfilter routines check the lookup's outcome anyway, which is maybe
a bit more forgiving.
OK mpi@
|
|
|
|
Reported by Peter J. Philipp.
OK mpi@
|
|
these functions were implemented in a bunch of places with comments
saying they should be moved to kern_tc.c when more pop up, and i was
about to add another one. i think it's time to move them to kern_tc.c.
ok cheloha@ jmatthew@
|
|
without this, something using a kevent to monitor a bpf fd on an
idle interface never has the event fire, which means it never
realises the interface goes away. with this, the read event goes
off and the next read fails with EIO, like pretty much every other
driver when the underlying device is removed.
ok claudio@ visa@ jmatthew@
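a sketch of what the read filter can do once the interface is gone
(field checks assumed, not the committed diff):

    /* sketch: fire the read event on a detached descriptor so a
     * blocked kevent consumer notices the interface went away */
    if (d->bd_bif == NULL) {
        kn->kn_flags |= EV_EOF;
        return (1);         /* event fires; the next read returns EIO */
    }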
|
|
ok claudio@ mvs@
|
|
the vlan tag we're injecting into the mbuf chain is either straight
off the wire and therefore already has the vlan priority encoded,
or is straight after it's been set up by vlan(4), which also has
the prio already encoded.
ok kn@ visa@ mvs@
|
|
bpf_catchpacket had a chunk to deal with reader timeouts, but that
has largely been moved to bpfread. the vestigial code that was left
still tried to wake up a reader when a buffer got full, but there
already is a chunk of code that wakes up readers when the buffer
gets full.
bpf_wakeup now checks for readers before calling wakeup directly,
rather than pushing the wakeup to a task and calling it unconditionally.
the task_add is now only done when the bpfdesc actually has something
that needs it.
ok visa@
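the resulting shape of bpf_wakeup is roughly this (a condensed
sketch with field names assumed from bpfdesc.h; the real function
also handles poll and kevent consumers):

    void
    bpf_wakeup(struct bpf_d *d)
    {
        MUTEX_ASSERT_LOCKED(&d->bd_mtx);

        /* wake sleeping readers directly, without the task detour */
        if (d->bd_nreaders)
            wakeup(d);

        /* defer the selwakeup work to the systq only when the
         * descriptor actually has consumers that need it */
        if (d->bd_async && d->bd_sig) {
            bpf_get(d);             /* the task holds a reference */
            if (!task_add(systq, &d->bd_wake_task))
                bpf_put(d);
        }
    }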
|
|
Change bd_rtout to a uint64_t of nanoseconds. Update the code in
bpfioctl() and bpfread() accordingly.
Add a local copy of nsecuptime() to make the diff smaller. This will
need to move to kern_tc.c if/when we have another user elsewhere in
the kernel.
Prompted by mpi@. With input from dlg@.
ok dlg@ mpi@ visa@
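The local helper is tiny; a sketch of its likely shape:

    /* uptime as a 64-bit nanosecond count */
    uint64_t
    nsecuptime(void)
    {
        struct timespec now;

        nanouptime(&now);
        return (TIMESPEC_TO_NSEC(&now));
    }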
|
|
bd_rdStart is strange. It nominally represents the start of a read(2)
on a given bpf(4) descriptor, but there are several problems with it:
1. If there are multiple readers, the bd_rdStart is not set by subsequent
readers, so their timeout is screwed up. The read timeout should really
be tracked on a per-thread basis in bpfread().
2. We set bd_rdStart for poll(2), select(2), and kevent(2), even though
that makes no sense. We should not be setting bd_rdStart in bpfpoll()
or bpfkqfilter().
3. bd_rdStart is buggy. If ticks is 0 when the read starts then
bpf_catchpacket() won't wake up the reader. This is a problem
inherent to the design of bd_rdStart: it serves as both a boolean
and a scalar value, even though 0 is a valid value in the scalar
range.
So let's replace it with a better struct member. "bd_nreaders" is a
count of threads sleeping in bpfread(). It is incremented before a
thread goes to sleep in bpfread() and decremented when a thread wakes
up. If bd_nreaders is greater than zero when we reach bpf_catchpacket()
and fbuf is non-NULL we wake up all readers.
The read timeout, if any, is now tracked locally by the thread in
bpfread().
Unlike bd_rdStart, bpfpoll() and bpfkqfilter() don't touch
bd_nreaders.
Prompted by mpi@. Basic idea from dlg@. Lots of input from dlg@.
Tested by dlg@ with tcpdump(8) (blocking read) and flow-collector
(https://github.com/eait-itig/flow-collector, non-blocking read).
ok dlg@
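Sketched against the description above, the read side and the capture
side pair up like this (surrounding code assumed; shown with a
nanosecond sleep for clarity):

    /* in bpfread(): count this thread as a sleeping reader */
    d->bd_nreaders++;
    error = msleep_nsec(d, &d->bd_mtx, PRINET | PCATCH, "bpf",
        d->bd_rtout ? d->bd_rtout : INFSLP);
    d->bd_nreaders--;

    /* in bpf_catchpacket(): wake all readers when a free buffer
     * is available to rotate in */
    if (fbuf != NULL && d->bd_nreaders > 0)
        bpf_wakeup(d);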
|
|
Rename klist_{insert,remove}() to klist_{insert,remove}_locked().
These functions assume that the caller has locked the klist. The current
state of locking remains intact because the kernel lock is still used
with all klists.
Add new functions klist_insert() and klist_remove() that lock the klist
internally. This allows some code simplification.
OK mpi@
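The new wrappers are thin; a plausible sketch, given the commit's note
that the kernel lock still covers all klists at this point:

    void
    klist_insert(struct klist *klist, struct knote *kn)
    {
        KERNEL_LOCK();
        klist_insert_locked(klist, kn);
        KERNEL_UNLOCK();
    }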
|
|
OK dlg@, bluhm@
No Opinion mpi@
Not against it claudio@
|
|
this is so _bpf_mtap can look at the mbuf with packet headers on
it so it can fill in more stuff in the bpf_hdr struct.
ive been running this in production for most of a month now and
it's working well.
|
|
|
|
Reading and writing bd_rtout is not an atomic operation, so it needs
to be done under the per-descriptor mutex.
While here, start annotating locking in bpfdesc.h. There's lots more
to do on this front, but you have to start somewhere.
Tweaked by mpi@.
ok mpi@
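The pattern is the usual one for per-descriptor state; a sketch of the
ioctl side (case label real, surrounding details assumed):

    case BIOCSRTIMEOUT:
        /* ... validate the user-supplied timeout ... */
        mtx_enter(&d->bd_mtx);
        d->bd_rtout = rtout;        /* shared with bpfread() */
        mtx_leave(&d->bd_mtx);
        break;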
|
|
Unlike the other cases of sysctl_bounded_arr this one uses a dynamic limit.
OK millert@
|
|
this lets things calling bpf_mtap_hdr and related functions also
populate the extended bpf_hdr with the rcvif and prio and stuff.
|
|
the metadata is set if the mbuf is passed with an m_pkthdr, and
copies the mbuf's rcvif, priority, and flowid. it also carries the
direction of the packet.
this makes bpf_hdr a multiple of 4 bytes, which simplifies some
calculations a bit. it requires no changes in userland because
libpcap just treats the extra bytes in the header as padding and
skips over them to the payload.
this helps me verify things like whether the stack and a network
card agree about toeplitz hashes, and paves the way for doing more
interesting packet captures. being able to see where a packet came
from as it is leaving a machine is very useful.
ok mpi@
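the extended header comes out roughly like this (a sketch from the
description above; exact field names and packing belong to the commit):

    /* bpf_hdr extended with mbuf metadata, padded to a multiple
     * of 4 bytes so offset calculations stay simple */
    struct bpf_hdr {
        struct bpf_timeval bh_tstamp;   /* capture time */
        uint32_t        bh_caplen;      /* captured length */
        uint32_t        bh_datalen;     /* original packet length */
        uint16_t        bh_hdrlen;      /* length of this header */
        uint16_t        bh_ifidx;       /* rcvif interface index */
        uint16_t        bh_flowid;      /* mbuf flow id */
        uint8_t         bh_flags;       /* direction and priority */
        uint8_t         bh_drops;
    };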
|
|
If you set FIONBIO on a bpf(4) descriptor you enable non-blocking mode
and also clobber any read timeout set for the descriptor. The reverse
is also true: do BIOCSRTIMEOUT and you'll set a timeout and
simultaneously disable non-blocking status. The two are mutually
exclusive.
This relationship is undocumented and might cause a bug. At the
very least it makes reasoning about the code difficult.
This patch adds a new member to bpf_d, bd_rnonblock, to store the
non-blocking status of the descriptor. The read timeout is still
kept in bd_rtout.
With this in place, non-blocking status and the read timeout can
coexist. Setting one state does not clear the other, and vice versa.
Separating the two states also clears the way for changing the bpf(4)
read timeout to use the system clock instead of ticks. More on that
in a later patch.
With insight from dlg@ regarding the purpose of the read timeout.
ok dlg@
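A sketch of the separated states (field name from the patch,
surrounding code assumed):

    /* FIONBIO only toggles the flag and leaves bd_rtout alone */
    case FIONBIO:
        d->bd_rnonblock = *(int *)addr ? 1 : 0;
        break;

    /* in bpfread(), when no buffer is ready */
    if (d->bd_rnonblock)
        error = EWOULDBLOCK;        /* don't sleep; timeout untouched */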
|
|
for example, with locking assertions.
OK mpi@, anton@
|
|
adding more filter properties without cluttering the struct.
OK mpi@, anton@
|
|
The 3 subsystems: signal, poll/select and kqueue can now be addressed
separately.
Note that bpf(4) and audio(4) currently delay the wakeups to a separate
context in order to respect the KERNEL_LOCK() requirement. Sockets (UDP,
TCP) and pipes spin to grab the lock for the same reasons.
ok anton@, visa@
|
|
like USB only use the former and bpf_iflist was otherwise retaining
references to a freed bpf_if.
ok sashan
|
|
FIOGETOWN/SIOCGPGRP/TIOCGPGRP. Do this by determining the meaning of
the ID parameter inside the sigio code. Also add cases for FIOSETOWN
and FIOGETOWN where there have been TIOCSPGRP and TIOCGPGRP before.
These changes allow removing the ID translation from sys_fcntl() and
sys_ioctl().
Idea from NetBSD
OK mpi@, claudio@
|
|
something with csignal().
OK visa@
|
|
make the structs const so that the data are put in .rodata.
OK mpi@, deraadt@, anton@, bluhm@
|
|
BPF: remove redundant reference counting of filedescriptors
anton@ made the problem crystal clear:
I've been looking into a similar bpf panic reported by syzkaller,
which looks somewhat related. The one reported by syzkaller is caused
by issuing ioctl(SIOCIFDESTROY) on the interface which the packet filter
is attached to. This will in turn invoke the following functions
expressed as an inverted stacktrace:
1. bpfsdetach()
2. vdevgone()
3. VOP_REVOKE()
4. vop_generic_revoke()
5. vgonel()
6. vclean(DOCLOSE)
7. VOP_CLOSE()
8. bpfclose()
Note that bpfclose() is called before changing the vnode type. In
bpfclose(), the `struct bpf_d` is immediately removed from the global
bpf_d_list, and the thread might end up sleeping inside taskq_barrier(systq).
Since the bpf file descriptor (fd) is still present and valid, another
thread could perform an ioctl() on the fd only to fault since
bpfilter_lookup() will return NULL. The vnode is not locked in this path
either so it won't end up waiting on the ongoing vclean().
Steps to trigger this type of panic are straightforward; let two
processes run concurrently:
process A:
while true ; do ifconfig tun0 up ; ifconfig tun0 destroy ; done
process B:
while true ; do tcpdump -i tun0 ; done
the panic happens within a few seconds (Dell PowerEdge 710)
OK visa@, OK anton@
|
|
this is not needed now that the "public" api does not provide a way
to pass a custom copy function in for the internals to pass around.
ok claudio@ visa@
|
|
it was previously (ab)used by pflog, which has since been fixed.
apart from that nothing else used it, so we can trim the cruft.
ok kn@ claudio@ visa@
visa@ also made sure i fixed ipw(4) so i386 won't break.
|
|
pointed out by naddy@
|
|
this makes it easier to call at least, and makes it consistent with
bpf_tap_hdr.
ok stsp@ sashan@
|
|
ok anton@, sashan@
|
|
|
|
OK visa@, OK mpi@
|
|
prevent passing negative values to timeout_add().
While here, protect against unsigned wrap around during addition of
bd_rdStart and bd_rtout since it could also cause passing negative
values to timeout_add().
ok bluhm@
Reported-by: syzbot+6771e3d6d9567b3983aa@syzkaller.appspotmail.com
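The guard is simple in shape: check that the addition cannot wrap
before the result ever reaches timeout_add(). An illustrative sketch
of the idea, not the committed diff:

    /* bd_rdStart + bd_rtout feeds a tick deadline; if the sum
     * wraps it can reach timeout_add() as a negative value */
    if (d->bd_rtout > ULONG_MAX - d->bd_rdStart)
        return (EINVAL);            /* reject instead of wrapping */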
|
|
OK visa@
|
|
the timeout converted to ticks is later passed to timeout_add(); it
could cause a panic if the timeout is negative.
ok deraadt@ millert@
Reported-by: syzbot+82cb4dfe6a1fc3d8b490@syzkaller.appspotmail.com
|
|
BIOCSFILDROP could already be used as a quick and dirty
firewall, which is especially useful when you want to filter
non-ip things. however, capturing the packets you're dropping is a
lot of overhead when you just want to drop stuff. this extends
fildrop so you can tell bpf not to capture the packets it drops.
ok sthen@ mikeb@ claudio@ visa@
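the knob turns into a three-way setting; sketched from the
description (constant names and arguments shown for illustration):

    #define BPF_FILDROP_PASS    0   /* capture, don't drop */
    #define BPF_FILDROP_CAPTURE 1   /* capture and drop */
    #define BPF_FILDROP_DROP    2   /* drop without capturing */

    /* in the filter-match path */
    if (d->bd_fildrop != BPF_FILDROP_DROP)
        bpf_catchpacket(d, pkt, pktlen, slen, tv);  /* args assumed */
    if (d->bd_fildrop != BPF_FILDROP_PASS)
        drop = 1;                   /* tell the caller to drop it */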
|
|
ioctl function after the device has been pulled out. Also accept
this error code in bpf_detachd() to prevent a kernel panic. tcpdump(8)
may run while the interface is detached.
from Moritz Buhl; OK stsp@
|
|
affects the bpfioctl() and bpfclose() path.
lock assertion reported and fix tested by Pierre Emeriaud; OK visa@
|
|
The account flag `ASU' will no longer be set but that makes suser()
mpsafe since it no longer messes with a per-process field.
No objection from millert@, ok tedu@, bluhm@
|
|
internally it uses mbufs to handle the chain of buffers, but the
caller doesn't have to deal with that or allocate a temporary buffer
with the header attached.
ok mpi@
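a hedged usage sketch (the post-change signature is assumed here):

    /* capture a packet with a driver-built header prepended,
     * without staging it in a temporary flat buffer */
    struct myhdr h;                 /* hypothetical driver header */

    if (ifp->if_bpf != NULL)
        bpf_mtap_hdr(ifp->if_bpf, &h, sizeof(h), m,
            BPF_DIRECTION_OUT);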
|