|
Use atomic operations to make explicit where access from multiple
CPUs happens. Add a comment explaining why sbchecklowmem() is
sufficiently MP safe without locks.
OK mvs@ claudio@
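
A minimal sketch of the pattern, with an illustrative counter name
(not the committed code); a plain atomic load is enough when a
slightly stale value only delays the backpressure decision:

    #include <sys/atomic.h>

    extern unsigned int mbuf_lowmem_count;  /* hypothetical counter */

    int
    lowmem_check_example(unsigned int limit)
    {
        /* one explicit atomic load; other CPUs may update the
         * counter concurrently without any lock */
        return (atomic_load_int(&mbuf_lowmem_count) > limit);
    }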
|
|
Use %zu to print mbuf MHLEN and MLEN in ddb, otherwise gcc complains.
found by claudio@
|
|
Command "ddb> show /c mbuf" always prints mbuf data size. In
uipc_mbuf.c include db_interface.h as it contains the prototype
for m_print_chain().
OK mvs@
|
|
For debugging hardware offloading, DMA requirements, bounce buffers,
and performance optimizations, knowing the memory layout of mbuf
content helps.
Implement /c and /p modifiers in ddb "show mbuf". They traverse
the m_next pointer for an mbuf chain or m_nextpkt for a packet
list, and show mbuf type, data offset, mbuf length, packet length,
cluster size, and the total number of elements, length, and size.
OK claudio@ mvs@
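
A simplified sketch of such a chain walk, not the committed
m_print_chain(), just the traversal idea:

    #include <sys/param.h>
    #include <sys/systm.h>
    #include <sys/mbuf.h>

    void
    chain_walk_example(struct mbuf *m)
    {
        int n = 0;
        u_long len = 0;

        /* /c style: follow m_next through one packet's chain */
        for (; m != NULL; m = m->m_next) {
            n++;
            len += m->m_len;
        }
        printf("%d mbufs, %lu bytes total\n", n, len);
    }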
|
|
If the memory layout is not optimal, m_defrag(), m_prepend(),
m_pullup(), and m_pulldown() will allocate mbufs or copy memory.
Count these operations to find possible optimizations.
input dhill@; OK mvs@
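
The counting itself can be as simple as this sketch with a made-up
counter name; the committed code hooks into the mbuf statistics:

    extern unsigned long m_defrag_count;    /* hypothetical counter */

    /* at the top of m_defrag()-like code: */
    atomic_inc_long(&m_defrag_count);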
|
|
m_defrag() is intended as a last resort to make DMA transfers to
the hardware. Therefore page alignment is more important than IP
header alignment. The reason the mbuf returned by m_defrag() was
switched to IP header alignment was that ether_extract_headers()
failed in the em(4) driver with TSO on sparc64. This has been
fixed by using memcpy().
The alignment change in m_defrag() came too late in the 7.5 release
process. It may affect several drivers on different architectures.
The bus dmamap for ixl(4) on sun4v expects page alignment. Such
alignment issues and TSO mbuf mapping for IOMMU need more thought.
OK deraadt@
|
|
The recent TSO support in em(4) triggered an alignment error on the TCP
header. In em(4) m_defrag() is called before setting up the TSO DMA bits,
and with that the TCP header was suddenly no longer aligned. Like other
mbuf functions, preserve the data alignment in m_defrag() to prevent such
unaligned packets.
With help and OK bluhm@ mglocker@
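
An illustrative fragment of the idea, assuming m0 is the freshly
allocated mbuf and m the original; the exact diff differs:

    /* keep the same low-order address bits as the original data
     * pointer so the TCP/IP headers stay aligned after the copy */
    m0->m_data += mtod(m, unsigned long) & (sizeof(long) - 1);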
|
|
m_split() calls m_align() to initialize the data pointer of the
newly allocated mbuf. If the new mbuf will be converted to a
cluster, this is not necessary. If additionally the new mbuf is
larger than MLEN, this can lead to a panic.
Only call m_align() when a valid m_data is needed. This is the
case if we do not reference the existing cluster, but memcpy() the
data into the new mbuf.
Reported-by: syzbot+0e6817f5877926f0e96a@syzkaller.appspotmail.com
OK claudio@ deraadt@
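
A sketch of the fixed control flow, under assumed names (n is the
new mbuf, m the original, len the split point, remain the tail
length):

    if (remain >= MINCLSIZE) {
        /* the new mbuf references the existing cluster, so its
         * m_data comes from there; no m_align() needed */
        /* ... cluster reference bookkeeping elided ... */
    } else {
        /* data is copied, so a valid m_data is required */
        m_align(n, remain);
        memcpy(mtod(n, caddr_t), mtod(m, caddr_t) + len, remain);
    }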
|
|
OK dlg@
Reported-by: syzbot+a377d5cd833c2343429a@syzkaller.appspotmail.com
|
|
This is not the fast path, so dropping mq->mq_maxlen check doesn't
introduce any performance impact, but makes code MP consistent.
Discussed with and ok from bluhm@
|
|
another CPU may change simultaneously. To prevent mis-optimisation
by the compiler, they need the READ_ONCE() macro. Otherwise there
could be two read operations with inconsistent values. Writing to
the integer in mq_set_maxlen() needs mutex protection. Otherwise
the value could change within critical sections. Again the compiler
could optimize to multiple read operations within the critical
section. With inconsistent values, the behavior is undefined.
OK dlg@
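
A sketch of both halves, close to what the struct mbuf_queue code
looks like (mq_maxlen and mq_mtx are the real member names):

    /* reader: one explicit load, no torn or repeated reads */
    u_int
    maxlen_read_example(struct mbuf_queue *mq)
    {
        return (READ_ONCE(mq->mq_maxlen));
    }

    /* writer: the mutex keeps the value stable within critical
     * sections that also hold mq_mtx */
    void
    maxlen_write_example(struct mbuf_queue *mq, u_int maxlen)
    {
        mtx_enter(&mq->mq_mtx);
        mq->mq_maxlen = maxlen;
        mtx_leave(&mq->mq_mtx);
    }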
|
|
ok mpi@ miod@
|
|
net/if_pppx.c pointed out by jsg@
ok gnezdo@ deraadt@ jsg@ mpi@ millert@
|
|
previously sbchecklowmem() (and sonewconn()) would look at the mbuf
and mbuf cluster pools to see if they were approaching their hard
limits. based on how many mbufs/clusters were allocated against the
limits, socket operations would start to fail with ENOBUFS until
utilisation went down.
mbufs and clusters have changed a lot since then though. there are
now many mbuf cluster pools, not just one for 2k clusters. because
of this the mbuf layer now limits the amount of memory all the mbuf
pools can allocate backend pages from rather than limit the individual
pools. this means sbchecklowmem() ends up looking at the default
pool hard limit, which is UINT_MAX, which in turn means
sbchecklowmem() probably never applies backpressure. this is made
worse on multiprocessor systems where per cpu caches of mbuf and
cluster pool items are enabled because the number of in use pool
items is distorted by the cpu caches.
this switches sbchecklowmem to looking at the page allocations made
by all the pools instead. the big benefit of this is that the page
allocations are much more representative of the overall mbuf memory
usage in the system. the downside is that the backend page
allocation accounting does not see idle memory held by pools. pools
cannot release partially free pages to the page backend (obviously),
and pools cache idle items to avoid thrashing on the backend page
allocator. this means the page allocation level is higher than the
memory used by actual in-flight mbufs.
however, this can also be a benefit. the backend page allocation is a
kind of smoothed out "trend" line. mbuf utilisation over short periods
can be extremely bursty because of things like rx ring dequeue and fill
cycles, or large socket sends. if you're trying to grow socket
buffers while these things are happening, luck becomes an important
factor in whether it will work or not. because pools cache idle items,
the backend page utilisation better represents the overall trend
of activity in the system and will give more consistent behaviour here.
this diff is deliberately simple. we're basically going from "no
limits" to "some sort of limit" for sockets again, so keeping the
code simple means it should be easy to understand and tweak in the
future.
ok djm@ visa@ claudio@
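
a deliberately simple sketch of the new check, with illustrative
names for the page accounting (the real code reads the mbuf pool
page limits):

    int
    sbchecklowmem_example(void)
    {
        extern unsigned long mbuf_pages_used;   /* assumed */
        extern unsigned long mbuf_pages_limit;  /* assumed */

        /* backpressure once page usage crosses a fraction of
         * the configured limit */
        return (mbuf_pages_used >= mbuf_pages_limit / 4 * 3);
    }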
|
|
this makes it consistent with the rest of the network stack when
determining alignment.
ok bluhm@
|
|
If the first mbuf of a chain in m_pullup is a cluster, check if the
cluster is read-only (shared or an external buffer). If so, don't
touch it and create a new mbuf for the pullup data.
This restores the original 4.4BSD m_pullup(), which not only
returned contiguous mbuf data of the specified length, but also
converted read-only clusters into writable memory. The latter
feature was lost during some refactoring.
from ehrhardt@; tested by weerd@; OK stsp@ bluhm@ claudio@
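
A sketch of the test using the real M_READONLY() macro, with
simplified control flow (assumes len fits in a single mbuf):

    struct mbuf *
    pullup_readonly_example(struct mbuf *m, int len)
    {
        struct mbuf *n;

        if ((m->m_flags & M_EXT) && M_READONLY(m)) {
            /* shared cluster or external buffer: do not write
             * into it, copy into a fresh writable mbuf instead */
            n = m_get(M_DONTWAIT, m->m_type);
            if (n == NULL)
                return (NULL);
            n->m_len = len;
            m_copydata(m, 0, len, mtod(n, caddr_t));
            return (n);
        }
        return (m);
    }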
|
|
|
|
i'm not a fan of having to cast to caddr_t when we have modern
inventions like void *s we can take advantage of.
ok claudio@ mvs@ bluhm@
|
|
Should prevent using an uninitialized value as a bogus counter index.
OK mvs@ claudio@ anton@
|
|
OK dlg@, bluhm@
No Opinion mpi@
Not against it claudio@
|
|
from Matt Dunwoodie and Jason A. Donenfeld
|
|
this is so pppx(4) and the upcoming pppac(4) can give kq read data
and FIONREAD values that make sense, like the ones tun(4) and tap(4)
provide with ifq_hdatalen.
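
a sketch of the driver side, assuming a softc with an ifnet and
the usual ioctl switch, as tun(4) and tap(4) do:

    case FIONREAD:
        /* byte count of the packet at the head of the queue */
        *(int *)data = ifq_hdatalen(&sc->sc_if.if_snd);
        break;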
|
|
atomic operation.
OK visa@ cheloha@
|
|
operations get stuck while holding the net lock. Increasing the
limit did not help as there was no wakeup of the waiting pools. So
introduce pool_wakeup() and run through the mbuf pool request list
when the limit changes.
OK dlg@ visa@
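
A sketch of the update path, assuming the mbuf pool arrays in
uipc_mbuf.c:

    int i;

    /* after raising the limit, rerun the request list of each
     * mbuf pool so sleeping allocators retry */
    pool_wakeup(&mbpool);
    for (i = 0; i < nitems(mclsizes); i++)
        pool_wakeup(&mclpools[i]);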
|
|
|
|
limits. Convert kernel variables and calculations for mbuf memory
into long to allow larger values on 64 bit machines. Put a range
check into the kernel sysctl. For the interface itself int is still
sufficient. In netstat -m cast all multiplications to unsigned
long to hold the product of two unsigned ints.
input and OK visa@
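
The overflow the casts avoid, in miniature (illustrative values):

    unsigned int nclusters = 3000000, size = 2048;
    unsigned long bytes;

    /* without the cast the product wraps in 32 bits; with it,
     * the multiplication is done in unsigned long */
    bytes = (unsigned long)nclusters * size;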
|
|
if the packet has the M_TIMESTAMP csum_flag, ph_timestamp is added
to the boottime clock, otherwise it just uses microtime().
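
A sketch of that logic; M_TIMESTAMP and ph_timestamp are real mbuf
fields, the function name is illustrative:

    void
    pkt_time_example(const struct mbuf *m, struct timeval *tv)
    {
        if (m->m_pkthdr.csum_flags & M_TIMESTAMP) {
            struct timeval btv;

            /* ph_timestamp holds nanoseconds since boot */
            microboottime(&btv);
            NSEC_TO_TIMEVAL(m->m_pkthdr.ph_timestamp, tv);
            timeradd(tv, &btv, tv);
        } else
            microtime(tv);
    }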
|
|
|
|
accommodating allocator. an interrupt safe pool may also be used in process
context, as indicated by waitok flags. thanks to the garbage collector, we
can always free pages in process context. the only complication is where
to put the pages. solve this by saving the allocation flags in the pool
page header so the free function can examine them.
not actually used in this diff. (coming soon.)
arm testing and compile fixes from phessler
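
the idea in miniature, with made-up names: the page header carries
the flags the page was allocated with, and the free path picks the
matching backend from them:

    struct ph_example {
        int ph_waitok;      /* saved allocation flag */
    };

    void
    ph_free_example(struct ph_example *ph)
    {
        if (ph->ph_waitok) {
            /* may sleep: hand the page straight back */
        } else {
            /* defer to the garbage collector */
        }
    }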
|
|
this fixes an issue found by a regress test on sparc64 by claudio,
and between us took about half a day of work to understand and fix
at a2k19.
ok claudio@
|
|
OK millert@ bluhm@
|
|
flag to the other references. Then the final m_free() will clear
the memory.
OK claudio@
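
A sketch of the final free; M_ZEROIZE and explicit_bzero() are
real, the surrounding logic is simplified:

    /* in the last m_free() of a zeroize-flagged mbuf: */
    if (m->m_flags & M_ZEROIZE) {
        if (m->m_flags & M_EXT)
            explicit_bzero(m->m_ext.ext_buf, m->m_ext.ext_size);
        else
            explicit_bzero(m->m_dat, MLEN);
    }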
|
|
return. Hopefully the other reference holder has the M_ZEROIZE flag set as
well. Triggered by syzkaller. OK deraadt@ visa@
Reported-by: syzbot+c578107d70008715d41f@syzkaller.appspotmail.com
|
|
OK bluhm@
|
|
all types of mbufs. Also introduce some KASSERTs in the m_*space() functions
to ensure that no negative number is returned. This also introduces two
internal macros M_SIZE() & M_DATABUF() which return the right size and start
pointer of the mbuf data area. Use it in a few obvious places to simplify code.
OK bluhm@
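
The shape of the two macros as described, reimplemented here for
illustration (not copied from the tree):

    /* size and start of the data area, valid for all mbuf types */
    #define M_SIZE_EX(m)                                    \
        ((m)->m_flags & M_EXT ? (m)->m_ext.ext_size :       \
        (m)->m_flags & M_PKTHDR ? MHLEN : MLEN)

    #define M_DATABUF_EX(m)                                 \
        ((m)->m_flags & M_EXT ? (m)->m_ext.ext_buf :        \
        (m)->m_flags & M_PKTHDR ? (m)->m_pktdat : (m)->m_dat)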
|
|
m_leadingspace() and m_trailingspace(). Convert all callers to call
directly the functions and remove the defines.
OK krw@, mpi@
|
|
start locking the socket. An inp can be referenced by the PCB queue
and hashes, by a pf mbuf header, or by a pf state key.
OK visa@
|
|
put the algorithm into a new function m_calchdrlen(). Also set an
uninitialized m_len to 0 in NFS code.
OK claudio@
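
A sketch of such a recalculation (the committed m_calchdrlen() may
differ in detail):

    void
    calchdrlen_example(struct mbuf *m)
    {
        struct mbuf *n;
        int plen = 0;

        KASSERT(m->m_flags & M_PKTHDR);
        for (n = m; n != NULL; n = n->m_next)
            plen += n->m_len;
        m->m_pkthdr.len = plen;
    }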
|
|
created. Add a new function m_removehdr() to convert packet header
mbufs within the chain to regular mbufs. Assert that the mbuf at
the beginning of the chain has a packet header.
found by Maxime Villard in NetBSD; from markus@; OK claudio@
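
A sketch following that description; flag handling is simplified:

    void
    removehdr_example(struct mbuf *m)
    {
        KASSERT(m->m_flags & M_PKTHDR);
        /* drop mbuf tags attached to the packet header */
        m_tag_delete_chain(m);
        m->m_flags &= ~M_PKTHDR;
        memset(&m->m_pkthdr, 0, sizeof(m->m_pkthdr));
    }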
|
|
Previous commit has no OK's or discussion about testing.
|
|
|
|
previously it took a shortcut when emptying an mbuf by only setting
m_len to 0, but leaving m_data alone. this interacts badly with
m_pullup, which tries to maintain the alignment of the data
payload. if there was a 14 byte ethernet header on its own that was
m_adjed off, and then the stack wants an ip header, m_pullup
would put the ip header on the ethernet header alignment, which is
off by 2 bytes.
found by stsp@ with pair(4) on sparc64.
ok stsp@ too
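
a sketch of the fix in m_adj()-like code, for a plain mbuf without
a cluster or packet header (illustrative):

    if (m->m_len == 0 && !(m->m_flags & (M_EXT|M_PKTHDR))) {
        /* emptied mbuf: also reset m_data, so a later
         * m_pullup() realigns from a clean buffer instead of
         * inheriting a stale 14 byte offset */
        m->m_data = m->m_dat;
    }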
|
|
ok dlg@
|
|
existing statekey in the mbuf header. Reset the statekey in
m_dup_pkthdr().
suggested by and OK sashan@
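
A sketch of the reset; pf.statekey is the real packet header member:

    /* in m_dup_pkthdr()-like code, after copying the header: the
     * duplicate must not share the pf state key reference */
    to->m_pkthdr.pf.statekey = NULL;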
|
|
or other states more consistent.
OK visa@ sashan@ on a previous version
|
|
ok visa@, bluhm@, deraadt@
|
|
dereference m if it is NULL. See CID 501458.
- Remove the m NULL check from the final for loop, it is not
necessary. This cannot happen due to the length calculation.
The inconsistent code caused the coverity issue.
- Move the m = mp assignment close to all the loops where the mbuf
chain is traversed.
- Use mp to access the m_pkthdr consistently.
- Move the next assignment from for (;;m = m->m_next) to the
end of the loop to make it consistent to the previous for (;;)
where the total length is calculated.
OK visa@ mpi@
|
|
mbuf functions.
OK claudio@
|
|
Still quite complicated but more legible in the end and it will do fewer
M_GET calls for huge packets.
OK bluhm@
|
|
ok kettenis mpi tom
|