author     David Gwynne <dlg@cvs.openbsd.org>   2022-02-14 04:33:19 +0000
committer  David Gwynne <dlg@cvs.openbsd.org>   2022-02-14 04:33:19 +0000
commit     2e544c1fd5a89f53d8899294fe5a4918548e9c20 (patch)
tree       cea1aa83df29d91a54c02136357bd2c30993c4d9 /sys/kern
parent     620a4d6be8c39008d2e2ab99e9684acca55f37b8 (diff)
update sbchecklowmem() to better detect actual mbuf memory usage.
previously sbchecklowmem() (and sonewconn()) would look at the mbuf and mbuf cluster pools to see if they were approaching their hard limits. based on how many mbufs/clusters were allocated against the limits, socket operations would start to fail with ENOBUFS until utilisation went down.

mbufs and clusters have changed a lot since then though. there are now many mbuf cluster pools, not just one for 2k clusters. because of this the mbuf layer now limits the amount of memory all the mbuf pools can allocate backend pages from, rather than limiting the individual pools. this means sbchecklowmem() ends up looking at the default pool hard limit, which is UINT_MAX, which in turn means sbchecklowmem() probably never applies backpressure. this is made worse on multiprocessor systems where per cpu caches of mbuf and cluster pool items are enabled, because the number of in use pool items is distorted by the cpu caches.

this switches sbchecklowmem() to looking at the page allocations made by all the pools instead. the big benefit of this is that the page allocations are much more representative of the overall mbuf memory usage in the system. the downside is that the backend page allocation accounting does not see idle memory held by pools. pools cannot release partially free pages to the page backend (obviously), and pools cache idle items to avoid thrashing on the backend page allocator. this means the page allocation level is higher than the memory used by actual in-flight mbufs.

however, this can also be a benefit. the backend page allocation is a kind of smoothed out "trend" line. mbuf utilisation over short periods can be extremely bursty because of things like rx ring dequeue and fill cycles, or large socket sends. if you're trying to grow socket buffers while these things are happening, luck becomes an important factor in whether it will work or not. because pools cache idle items, the backend page utilisation better represents the overall trend of activity in the system and will give more consistent behaviour here.

this diff is deliberately simple. we're basically going from "no limits" to "some sort of limit" for sockets again, so keeping the code simple means it should be easy to understand and tweak in the future.

ok djm@ visa@ claudio@
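As a rough illustration (not part of the commit): the check now boils down to a percentage of the mbuf layer's backend page allocation with a hysteresis band, set above 80% and cleared only once usage drops back below 60%. The standalone userland sketch below mirrors the logic in the diff; mem_alloc, mem_limit, pool_used() and checklowmem() are hypothetical stand-ins for the kernel's mbuf_mem_alloc, mbuf_mem_limit, m_pool_used() and sbchecklowmem(), not kernel API.

#include <stdio.h>

/* stand-ins for the kernel's mbuf_mem_alloc/mbuf_mem_limit page counters */
static unsigned int mem_alloc;
static unsigned int mem_limit = 4096;

static unsigned int
pool_used(void)
{
	/* same shape as m_pool_used(): percentage of the page limit in use */
	return ((mem_alloc * 100) / mem_limit);
}

static int
checklowmem(void)
{
	/*
	 * sticky flag with hysteresis, as in sbchecklowmem(): set above
	 * 80% usage, cleared only once usage falls back below 60%.
	 */
	static int lowmem;
	unsigned int used = pool_used();

	if (used < 60)
		lowmem = 0;
	else if (used > 80)
		lowmem = 1;

	return (lowmem);
}

int
main(void)
{
	/* ramp usage up past 80% and back down below 60% to show hysteresis */
	unsigned int steps[] = { 50, 70, 85, 70, 65, 55 };
	size_t i;

	for (i = 0; i < sizeof(steps) / sizeof(steps[0]); i++) {
		mem_alloc = (mem_limit * steps[i]) / 100;
		printf("used=%3u%% lowmem=%d\n", pool_used(), checklowmem());
	}
	return (0);
}

Running the sketch shows the sticky behaviour: once usage crosses 80% the flag stays set through the 60-80% band and only clears below 60%, which is what smooths out the bursty mbuf usage described above for the socket layer.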
Diffstat (limited to 'sys/kern')
-rw-r--r--  sys/kern/uipc_mbuf.c      8
-rw-r--r--  sys/kern/uipc_socket2.c  12
2 files changed, 13 insertions, 7 deletions
diff --git a/sys/kern/uipc_mbuf.c b/sys/kern/uipc_mbuf.c
index acac2c0dbc8..2f11e8c43d5 100644
--- a/sys/kern/uipc_mbuf.c
+++ b/sys/kern/uipc_mbuf.c
@@ -1,4 +1,4 @@
-/* $OpenBSD: uipc_mbuf.c,v 1.281 2022/02/08 11:28:19 dlg Exp $ */
+/* $OpenBSD: uipc_mbuf.c,v 1.282 2022/02/14 04:33:18 dlg Exp $ */
/* $NetBSD: uipc_mbuf.c,v 1.15.4.1 1996/06/13 17:11:44 cgd Exp $ */
/*
@@ -1502,6 +1502,12 @@ m_pool_init(struct pool *pp, u_int size, u_int align, const char *wmesg)
pool_set_constraints(pp, &kp_dma_contig);
}
+u_int
+m_pool_used(void)
+{
+ return ((mbuf_mem_alloc * 100) / mbuf_mem_limit);
+}
+
#ifdef DDB
void
m_print(void *v,
diff --git a/sys/kern/uipc_socket2.c b/sys/kern/uipc_socket2.c
index 42a61e60bd2..6da3c74ac6e 100644
--- a/sys/kern/uipc_socket2.c
+++ b/sys/kern/uipc_socket2.c
@@ -1,4 +1,4 @@
-/* $OpenBSD: uipc_socket2.c,v 1.116 2021/11/06 05:26:33 visa Exp $ */
+/* $OpenBSD: uipc_socket2.c,v 1.117 2022/02/14 04:33:18 dlg Exp $ */
/* $NetBSD: uipc_socket2.c,v 1.11 1996/02/04 02:17:55 christos Exp $ */
/*
@@ -155,7 +155,7 @@ sonewconn(struct socket *head, int connstatus)
*/
soassertlocked(head);
- if (mclpools[0].pr_nout > mclpools[0].pr_hardlimit * 95 / 100)
+ if (m_pool_used() > 95)
return (NULL);
if (head->so_qlen + head->so_q0len > head->so_qlimit * 3)
return (NULL);
@@ -517,13 +517,13 @@ int
sbchecklowmem(void)
{
static int sblowmem;
+ unsigned int used = m_pool_used();
- if (mclpools[0].pr_nout < mclpools[0].pr_hardlimit * 60 / 100 ||
- mbpool.pr_nout < mbpool.pr_hardlimit * 60 / 100)
+ if (used < 60)
sblowmem = 0;
- if (mclpools[0].pr_nout > mclpools[0].pr_hardlimit * 80 / 100 ||
- mbpool.pr_nout > mbpool.pr_hardlimit * 80 / 100)
+ else if (used > 80)
sblowmem = 1;
+
return (sblowmem);
}