path: root/sys/net/bpf.c
Age  Author  Commit message
2022-03-17  Visa Hankala: Use the refcnt API in bpf.
OK sashan@ bluhm@
2022-02-15  Visa Hankala: Use knote_modify_fn() and knote_process_fn() in bpf.
OK dlg@
2022-02-13  Alexander Bluhm: The length value in bpf_movein() is cast from size_t to u_int
and then rounded before checking. Put the same check before the calculations to avoid overflow. Reported-by: syzbot+6f29d23eca959c5a9705@syzkaller.appspotmail.com OK claudio@
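The overflow described above can be sketched in userspace. BPF_WORDALIGN matches the classic bpf.h macro; movein_len_ok() and its limits are illustrative, not the actual bpf_movein() code:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Illustrative sketch of the fix: validate the write length while it
 * is still a size_t.  Casting a huge size_t to u_int first can wrap
 * to a small value and slip past the limit check.
 */
#define BPF_ALIGNMENT	4
#define BPF_WORDALIGN(x) (((x) + (BPF_ALIGNMENT - 1)) & ~(BPF_ALIGNMENT - 1))

static int
movein_len_ok(size_t resid, size_t maxlen)
{
	/* check before any narrowing or rounding can overflow */
	if (resid > maxlen)
		return 0;
	/* now the cast and the word alignment are safe */
	return BPF_WORDALIGN((unsigned int)resid) <=
	    BPF_WORDALIGN((unsigned int)maxlen);
}
```

A length that would wrap after the cast is now rejected up front instead of sneaking through as a small rounded value.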
2022-02-13  Visa Hankala: Rename knote_modify() to knote_assign()
This avoids verb overlap with f_modify.
2022-02-11  Visa Hankala: Replace manual !klist_empty()+knote() with KNOTE().
OK mpi@
2022-02-05  David Gwynne: make bpf_movein align the packet payload.
bluhm@ hit a problem while running a regress test where a packet generated and injected via bpf ends up being consumed by the network stack. the stack assumes that packets are aligned properly, but bpf was lazy and put whatever was written to it at the start of an mbuf. ethernet has a 14 byte header, so if you put that at the start the payload will be misaligned by 2 bytes.

bpf already has handling for different link header types, so this handling is extended a bit to align the payload after the link header.

while here we're fixing up a few error codes. short packets produce EINVAL instead of EPERM, and packets larger than the biggest mbuf the kernel supports generate EMSGSIZE.

with tweaks and ok bluhm@
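The 2-byte offset trick above can be sketched in a few lines; payload_offset() is an illustrative helper, not the kernel's actual code:

```c
#include <assert.h>

/*
 * Sketch of the alignment idea: place the link header at a small
 * offset so the payload that follows it starts on a 4-byte boundary.
 * For a 14-byte ethernet header the required offset is 2, putting
 * the IP payload at byte 16.  payload_offset() is illustrative only.
 */
static unsigned int
payload_offset(unsigned int hdrlen)
{
	return (4 - (hdrlen % 4)) % 4;
}
```

With the header shifted by this offset, `offset + hdrlen` is always a multiple of 4, so the stack sees an aligned payload.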
2022-01-16  David Gwynne: activate/notify waiting kq kevents from bpf_wakeup directly.
this builds on the mpsafe kq/kevent work visa has been doing.

normally kevents are notified by calling selwakeup, but selwakeup needs the KERNEL_LOCK. because bpf runs from all sorts of contexts that may or may not have the kernel lock, the call to selwakeup is deferred to the systq which already has the kernel lock. while this avoids spinning in bpf for the kernel lock, it still adds latency between when the buffer is ready for a program and when that program gets notified about it.

now that bpf kevents are mpsafe and bpf_wakeup is already holding the necessary locks, we can avoid that latency. bpf_wakeup now checks if there are waiting kevents and notifies them immediately. if there are no other things to wake up, bpf_wakeup avoids the task_add (and associated reference counting) to defer the selwakeup call.

selwakeup can still try to notify waiting kevents, so this uses the hint passed to knote() to differentiate between the notification from bpf_wakeup and selwakeup and returns early from the latter.

ok visa@
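The hint trick in the last paragraph can be modelled in a few lines; SKETCH_HINT_DIRECT and the filter function below are assumed names for illustration, not the kernel's actual identifiers:

```c
#include <assert.h>

/*
 * Model of using the knote() hint to tell the two notification paths
 * apart: the direct bpf_wakeup call passes a distinguishing hint, the
 * deferred selwakeup path does not, so the filter returns early for
 * the latter and each buffer-ready event is counted once.
 * SKETCH_HINT_DIRECT and all names here are illustrative only.
 */
#define SKETCH_HINT_DIRECT	1

static int sketch_notified;

static void
sketch_filt_bpfread(long hint)
{
	if (hint != SKETCH_HINT_DIRECT)
		return;		/* deferred duplicate path: ignore */
	sketch_notified++;
}
```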
2022-01-13  Visa Hankala: Make bpf event filter MP-safe
Use bd_mtx to serialize bpf knote handling. This allows calling the event filter without the kernel lock. OK mpi@
2022-01-13  Visa Hankala: Return an error if bpfilter_lookup() fails in bpfkqfilter()
The lookup should not fail because the kernel lock should prevent simultaneous detaching on the vnode layer. However, most other device kqfilter routines check the lookup's outcome anyway, which is maybe a bit more forgiving. OK mpi@
2021-11-10  David Gwynne: whitespace tweaks, no functional change.
2021-10-23  Visa Hankala: Fix double free after allocation failure in bpf(4).
Reported by Peter J. Philipp. OK mpi@
2021-06-15  David Gwynne: factor out nsecuptime and getnsecuptime.
these functions were implemented in a bunch of places with comments saying it should be moved to kern_tc.c when more pop up, and i was about to add another one. i think it's time to move them to kern_tc.c. ok cheloha@ jmatthew@
2021-04-23  David Gwynne: call klist_invalidate from bpfsdetach to tell kq listeners what happened.
without this, something using a kevent to monitor a bpf fd on an idle interface never has the event fire, which means it never realises the interface goes away. with this, the read event goes off and the next read fails with EIO, like pretty much every other driver when the underlying device is removed. ok claudio@ visa@ jmatthew@
2021-01-21  David Gwynne: let vfs keep track of nonblocking state for us.
ok claudio@ mvs@
2021-01-17  David Gwynne: don't encode the mbuf prio as part of the vlan tag in bpf_mtap_ether.
the vlan tag we're injecting into the mbuf chain is either straight off the wire and therefore already has the vlan priority encoded, or is straight after it's been set up by vlan(4), which also has the prio already encoded. ok kn@ visa@ mvs@
2021-01-02  David Gwynne: optimise bpf_catchpacket and bpf_wakeup.
bpf_catchpacket had a chunk to deal with reader timeouts, but that has largely been moved to bpfread. the vestigial code that was left still tried to wake up a reader when a buffer got full, but there already is a chunk of code that wakes up readers when the buffer gets full. bpf_wakeup now checks for readers before calling wakeup directly, rather than pushing the wakeup to a task and calling it unconditionally. the task_add is now only done when the bpfdesc actually has something that needs it. ok visa@
2021-01-02  cheloha: bpf(4): remove ticks
Change bd_rtout to a uint64_t of nanoseconds. Update the code in bpfioctl() and bpfread() accordingly. Add a local copy of nsecuptime() to make the diff smaller. This will need to move to kern_tc.c if/when we have another user elsewhere in the kernel. Prompted by mpi@. With input from dlg@. ok dlg@ mpi@ visa@
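The timeval-to-nanoseconds direction of this change can be sketched as a plain conversion; the struct and helper names below are illustrative, not the kernel's:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Sketch of keeping the read timeout in nanoseconds instead of
 * ticks: a validated, non-negative timeval converts exactly into a
 * uint64_t nanosecond count.  All names here are illustrative.
 */
struct sketch_timeval {
	long	tv_sec;
	long	tv_usec;
};

static uint64_t
sketch_timeval_to_nsec(const struct sketch_timeval *tv)
{
	return (uint64_t)tv->tv_sec * 1000000000ULL +
	    (uint64_t)tv->tv_usec * 1000ULL;
}
```

Storing nanoseconds directly avoids the tick-conversion bounds problems the earlier BIOCSRTIMEOUT commits below had to guard against.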
2020-12-26  cheloha: bpf(4): bpf_d struct: replace bd_rdStart member with bd_nreaders member
bd_rdStart is strange. It nominally represents the start of a read(2) on a given bpf(4) descriptor, but there are several problems with it:

1. If there are multiple readers, the bd_rdStart is not set by subsequent readers, so their timeout is screwed up. The read timeout should really be tracked on a per-thread basis in bpfread().

2. We set bd_rdStart for poll(2), select(2), and kevent(2), even though that makes no sense. We should not be setting bd_rdStart in bpfpoll() or bpfkqfilter().

3. bd_rdStart is buggy. If ticks is 0 when the read starts then bpf_catchpacket() won't wake up the reader. This is a problem inherent to the design of bd_rdStart: it serves as both a boolean and a scalar value, even though 0 is a valid value in the scalar range.

So let's replace it with a better struct member. "bd_nreaders" is a count of threads sleeping in bpfread(). It is incremented before a thread goes to sleep in bpfread() and decremented when a thread wakes up. If bd_nreaders is greater than zero when we reach bpf_catchpacket() and fbuf is non-NULL we wake up all readers. The read timeout, if any, is now tracked locally by the thread in bpfread(). Unlike bd_rdStart, bpfpoll() and bpfkqfilter() don't touch bd_nreaders.

Prompted by mpi@. Basic idea from dlg@. Lots of input from dlg@. Tested by dlg@ with tcpdump(8) (blocking read) and flow-collector (https://github.com/eait-itig/flow-collector, non-blocking read). ok dlg@
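The bd_nreaders bookkeeping can be modelled in userspace; the struct and functions below are an illustrative model, not the kernel code:

```c
#include <assert.h>

/*
 * Model of the bd_nreaders idea: a plain count of sleeping readers.
 * It is incremented before a thread sleeps in bpfread() and
 * decremented when it wakes; bpf_catchpacket() only notifies when
 * someone is actually sleeping.  Names here are illustrative.
 */
struct model_bpf_d {
	int bd_nreaders;	/* threads sleeping in bpfread() */
	int wakeups;		/* how many notifications were sent */
};

static void
model_read_enter(struct model_bpf_d *d)
{
	d->bd_nreaders++;	/* before sleeping in bpfread() */
}

static void
model_read_exit(struct model_bpf_d *d)
{
	d->bd_nreaders--;	/* after waking up */
}

static void
model_catchpacket(struct model_bpf_d *d, int fbuf_ready)
{
	/* wake all readers only if a free buffer is ready and
	   at least one thread is sleeping */
	if (d->bd_nreaders > 0 && fbuf_ready)
		d->wakeups++;
}
```

Unlike bd_rdStart, zero is unambiguous here: it simply means no one is waiting, so the boolean-vs-scalar confusion described above cannot arise.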
2020-12-25  Visa Hankala: Refactor klist insertion and removal
Rename klist_{insert,remove}() to klist_{insert,remove}_locked(). These functions assume that the caller has locked the klist. The current state of locking remains intact because the kernel lock is still used with all klists. Add new functions klist_insert() and klist_remove() that lock the klist internally. This allows some code simplification. OK mpi@
2020-12-12  jan: Rename the macro MCLGETI to MCLGETL and remove the dead parameter ifp.
OK dlg@, bluhm@. No Opinion mpi@. Not against it claudio@.
2020-12-12  David Gwynne: get bpf_mtap_ether to call _bpf_mtap directly instead of via bpf_mtap.
this is so _bpf_mtap can look at the mbuf with packet headers on it so it can fill in more stuff in the bpf_hdr struct. i've been running this in production for most of a month now and it's working well.
2020-12-12  David Gwynne: try to read the mbuf timestamp from the mbuf with the pkthdrs in it.
2020-12-11  cheloha: bpf(4): BIOCGRTIMEOUT, BIOCSRTIMEOUT: protect bd_rtout with bd_mtx
Reading and writing bd_rtout is not an atomic operation, so it needs to be done under the per-descriptor mutex. While here, start annotating locking in bpfdesc.h. There's lots more to do on this front, but you have to start somewhere. Tweaked by mpi@. ok mpi@
2020-11-04  gnezdo: Use sysctl_int_bounded in bpf_sysctl
Unlike the other cases of sysctl_bounded_arr this one uses a dynamic limit. OK millert@
2020-06-18  David Gwynne: pass the mbuf with the data separately to the one with the pkthdr to mtap.
this lets things calling bpf_mtap_hdr and related functions also populate the extended bpf_hdr with the rcvif and prio and stuff.
2020-06-18  David Gwynne: extend the bpf_hdr struct to include some metadata if available.
the metadata is set if the mbuf is passed with an m_pkthdr, and copies the mbuf's rcvif, priority, flowid. it also carries the direction of the packet. it also makes bpf_hdr a multiple of 4 bytes, which simplifies some calculations a bit. it also requires no changes in userland because libpcap just thinks the extra bytes in the header are padding and skips over them to the payload. this helps me verify things like whether the stack and a network card agree about toeplitz hashes, and paves the way for doing more interesting packet captures. being able to see where a packet came from as it is leaving a machine is very useful. ok mpi@
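The layout idea can be sketched with a stand-in struct; the field names and widths below are illustrative, not the exact bpf_hdr:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Sketch of the extended capture header: keeping the whole struct a
 * multiple of 4 bytes simplifies word-aligned offset arithmetic, and
 * old userland (libpcap) treats the extra trailing fields as padding.
 * Field names here are illustrative, not the actual bpf_hdr.
 */
struct sketch_bpf_hdr {
	uint32_t bh_tstamp_sec;
	uint32_t bh_tstamp_usec;
	uint32_t bh_caplen;	/* bytes captured */
	uint32_t bh_datalen;	/* original packet length */
	uint16_t bh_hdrlen;	/* length of this header */
	uint16_t bh_ifidx;	/* new metadata: receiving interface */
	uint16_t bh_flowid;	/* new metadata: mbuf flowid */
	uint8_t  bh_flags;	/* includes packet direction */
	uint8_t  bh_prio;	/* new metadata: mbuf priority */
};
```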
2020-05-13  cheloha: bpf(4): separate descriptor non-blocking status from read timeout
If you set FIONBIO on a bpf(4) descriptor you enable non-blocking mode and also clobber any read timeout set for the descriptor. The reverse is also true: do BIOCSRTIMEOUT and you'll set a timeout and simultaneously disable non-blocking status. The two are mutually exclusive.

This relationship is undocumented and might cause a bug. At the very least it makes reasoning about the code difficult.

This patch adds a new member to bpf_d, bd_rnonblock, to store the non-blocking status of the descriptor. The read timeout is still kept in bd_rtout. With this in place, non-blocking status and the read timeout can coexist. Setting one state does not clear the other, and vice versa.

Separating the two states also clears the way for changing the bpf(4) read timeout to use the system clock instead of ticks. More on that in a later patch.

With insight from dlg@ regarding the purpose of the read timeout. ok dlg@
2020-04-07  Visa Hankala: Abstract the head of knote lists. This allows extending the lists,
for example, with locking assertions. OK mpi@, anton@
2020-02-20  Visa Hankala: Replace field f_isfd with field f_flags in struct filterops to allow
adding more filter properties without cluttering the struct. OK mpi@, anton@
2020-02-14  Martin Pieuchot: Push the KERNEL_LOCK() inside pgsigio() and selwakeup().
The 3 subsystems: signal, poll/select and kqueue can now be addressed separately. Note that bpf(4) and audio(4) currently delay the wakeups to a separate context in order to respect the KERNEL_LOCK() requirement. Sockets (UDP, TCP) and pipes spin to grab the lock for the same reasons. ok anton@, visa@
2020-01-27  Joshua Stein: update bpf_iflist in bpfsdetach instead of bpfdetach as some drivers
like USB only use the former and bpf_iflist was otherwise retaining references to a freed bpf_if. ok sashan
2020-01-08  Visa Hankala: Unify handling of ioctls FIOSETOWN/SIOCSPGRP/TIOCSPGRP and
FIOGETOWN/SIOCGPGRP/TIOCGPGRP. Do this by determining the meaning of the ID parameter inside the sigio code. Also add cases for FIOSETOWN and FIOGETOWN where there have been TIOCSPGRP and TIOCGPGRP before. These changes allow removing the ID translation from sys_fcntl() and sys_ioctl(). Idea from NetBSD. OK mpi@, claudio@
2020-01-02  Claudio Jeker: Switch bpf to use pgsigio(9) and sigio_init(9) instead of handrolling
something with csignal(). OK visa@
2019-12-31  Visa Hankala: Use C99 designated initializers with struct filterops. In addition,
make the structs const so that the data are put in .rodata. OK mpi@, deraadt@, anton@, bluhm@
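A minimal illustration of this style change follows; the struct here is a stand-in, not the kernel's actual struct filterops:

```c
#include <assert.h>
#include <stddef.h>

/*
 * Sketch of the pattern: C99 designated initializers name each
 * member explicitly, and const lets the compiler place the table in
 * .rodata.  sketch_filterops and these functions are illustrative,
 * not the kernel's definitions.
 */
struct sketch_filterops {
	int	(*f_attach)(void *);
	void	(*f_detach)(void *);
	int	(*f_event)(void *, long);
};

static int  sketch_attach(void *kn) { (void)kn; return 0; }
static void sketch_detach(void *kn) { (void)kn; }
static int  sketch_event(void *kn, long hint) { (void)kn; return hint != 0; }

static const struct sketch_filterops sketch_read_filtops = {
	.f_attach = sketch_attach,
	.f_detach = sketch_detach,
	.f_event  = sketch_event,
};
```

Designated initializers also make the table robust against member reordering in the struct definition, which positional initializers are not.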
2019-10-21  Alexandr Nedvedicky: put bpfdesc reference counting back, revert change introduced in 1.175 as:
BPF: remove redundant reference counting of filedescriptors

anton@ made the problem crystal clear:

    I've been looking into a similar bpf panic reported by syzkaller, which looks somewhat related. The one reported by syzkaller is caused by issuing ioctl(SIOCIFDESTROY) on the interface which the packet filter is attached to. This will in turn invoke the following functions expressed as an inverted stacktrace:

    1. bpfsdetach()
    2. vdevgone()
    3. VOP_REVOKE()
    4. vop_generic_revoke()
    5. vgonel()
    6. vclean(DOCLOSE)
    7. VOP_CLOSE()
    8. bpfclose()

    Note that bpfclose() is called before changing the vnode type. In bpfclose(), the `struct bpf_d` is immediately removed from the global bpf_d_list list and might end up sleeping inside taskq_barrier(systq). Since the bpf file descriptor (fd) is still present and valid, another thread could perform an ioctl() on the fd only to fault since bpfilter_lookup() will return NULL. The vnode is not locked in this path either so it won't end up waiting on the ongoing vclean().

Steps to trigger the similar type of panic are straightforward, let there be two processes running concurrently:

    process A: while true ; do ifconfig tun0 up ; ifconfig tun0 destroy ; done
    process B: while true ; do tcpdump -i tun0 ; done

panic happens within a few secs (Dell PowerEdge 710)

OK visa@, OK anton@
2019-10-01  David Gwynne: remove the internal plumbing that supported a custom mbuf copy function.
this is not needed now that the "public" api does not provide a way to pass a custom copy function in for the internals to pass around. ok claudio@ visa@
2019-09-30  David Gwynne: remove the "copy function" argument to bpf_mtap_hdr.
it was previously (ab)used by pflog, which has since been fixed. apart from that nothing else used it, so we can trim the cruft. ok kn@ claudio@ visa@. visa@ also made sure i fixed ipw(4) so i386 won't break.
2019-09-12  David Gwynne: we don't need to cast hdr arguments to caddr_t for bpf_mtap_hdr anymore.
pointed out by naddy@
2019-09-12  David Gwynne: make bpf_mtap_hdr take a const void *, not a caddr_t.
this makes it easier to call at least, and makes it consistent with bpf_tap_hdr. ok stsp@ sashan@
2019-06-13  Martin Pieuchot: free(9) sizes for buffers.
ok anton@, sashan@
2019-06-10  David Gwynne: use m_microtime to get the packet rx time if it's available.
2019-05-18  Alexandr Nedvedicky: BPF: remove redundant reference counting of filedescriptors
OK visa@, OK mpi@
2019-04-25  anton: Lower the accepted upper bound for bd_rtout to INT_MAX in order to
prevent passing negative values to timeout_add(). While here, protect against unsigned wrap around during addition of bd_rdStart and bd_rtout since it could also cause passing negative values to timeout_add(). ok bluhm@ Reported-by: syzbot+6771e3d6d9567b3983aa@syzkaller.appspotmail.com
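The bounds idea above can be sketched as a simple validity check; rtout_valid() is an illustrative name, not the kernel's:

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>

/*
 * Sketch of the constraint: a timeout expressed in ticks must land
 * in [0, INT_MAX] before being handed to timeout_add(), which takes
 * a positive int.  Doing the check in a 64-bit type also catches
 * wraparound from adding a start time to the timeout.  rtout_valid()
 * is illustrative only.
 */
static int
rtout_valid(int64_t ticks)
{
	return ticks >= 0 && ticks <= INT_MAX;
}
```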
2019-04-15  Alexandr Nedvedicky: moving BPF to RCU
OK visa@
2019-04-03  anton: Reject negative and too large timeouts passed to BIOCSRTIMEOUT. Since
the timeout converted to ticks is later passed to timeout_add(), it could cause a panic if the timeout is negative. ok deraadt@ millert@ Reported-by: syzbot+82cb4dfe6a1fc3d8b490@syzkaller.appspotmail.com
2019-03-18  David Gwynne: extend BIOCSFILDROP so it can be configured to not capture packets.
BIOCSFILDROP was already able to be used as a quick and dirty firewall, which is especially useful when you want to filter non-ip things. however, capturing the packets you're dropping is a lot of overhead when you just want to drop stuff. this extends fildrop so you can tell bpf not to capture the packets it drops. ok sthen@ mikeb@ claudio@ visa@
2018-07-13  Alexander Bluhm: Some USB network interfaces like rum(4) report ENXIO from their
ioctl function after the device has been pulled out. Also accept this error code in bpf_detachd() to prevent a kernel panic. tcpdump(8) may run while the interface is detached. from Moritz Buhl; OK stsp@
2018-03-02  Alexander Bluhm: Protect the calls to ifpromisc() in bpf(4) with net lock. This
affects the bpfioctl() and bpfclose() path. lock assertion reported and fix tested by Pierre Emeriaud; OK visa@
2018-02-19  Martin Pieuchot: Remove almost unused `flags' argument of suser().
The account flag `ASU' will no longer be set but that makes suser() mpsafe since it no longer messes with a per-process field. No objection from millert@, ok tedu@, bluhm@
2018-02-01  David Gwynne: add bpf_tap_hdr(), for handling a buffer (not an mbuf) with a header.
internally it uses mbufs to handle the chain of buffers, but the caller doesn't have to deal with that or allocate a temporary buffer with the header attached. ok mpi@