Age | Commit message (Collapse) | Author |
|
operations get stuck while holding the net lock. Increasing the
limit did not help as there was no wakeup of the waiting pools. So
introduce pool_wakeup() and run through the mbuf pool request list
when the limit changes.
OK dlg@ visa@
|
|
Reduce code clutter by removing the file name and line number output
from witness(4). Typically it is easy enough to locate offending locks
using the stack traces that are shown in lock order conflict reports.
Tricky cases can be tracked using sysctl kern.witness.locktrace=1 .
This patch additionally removes the witness(4) wrapper for mutexes.
Now each mutex implementation has to invoke the WITNESS_*() macros
in order to utilize the checker.
Discussed with and OK dlg@, OK mpi@
|
|
|
|
to the not interrupt allocator.
|
|
accomodating allocator. an interrupt safe pool may also be used in process
context, as indicated by waitok flags. thanks to the garbage collector, we
can always free pages in process context. the only complication is where
to put the pages. solve this by saving the allocation flags in the pool
page header so the free function can examine them.
not actually used in this diff. (coming soon.)
arm testing and compile fixes from phessler
|
|
ok visa@
|
|
ok tedu@ deraadt@
|
|
no other process which could free it. Better panic in malloc(9)
or pool_get(9) instead of sleeping forever.
tested by visa@ patrick@ Jan Klemkow
suggested by kettenis@; OK deraadt@
|
|
of mutexes. Use this immediately for the pool_cache futex pools.
Mostly worked out with dlg@ during e2k17
ok mpi@ tedu@
|
|
Suggested by and OK dlg@
|
|
of items that a cache list is allowed to hold. This lets the cache
release resources back to the common pool after pressure on the cache
has decreased.
OK dlg@
|
|
hardcoding 64 is too optimistic.
|
|
previously it would figure out if there's enough items overall for
all the cpus to have full active an inactive free lists. this
included currently allocated items, which pools wont actually hold
on a free list and cannot predict when they will come back.
instead, see if there's enough items in the idle lists in the depot
that could instead go on all the free lists on the cpus. if there's
enough idle items, then we can grow.
tested by hrvoje popovski and amit kulkarni
ok visa@
|
|
if the lock around the global depot of extra cache lists is contented
a lot in between the gc task runs, consider growing the number of
entries a free list can hold.
the size of the list is bounded by the number of pool items the
current set of pages can represent to avoid having cpus starve each
other. im not sure this semantic is right (or the least worst) but
we're putting it in now to see what happens.
this also means reality matches the documentation i just committed
in pool_cache_init.9.
tested by hrvoje popovski and amit kulkarni
ok visa@
|
|
the cpu caches in pools amortise the cost of accessing global
structures by moving lists of items around instead of individual
items. excess lists of items are stored in the global pool struct,
but these idle lists never get returned back to the system for use
elsewhere.
this adds a timestamp to the global idle list, which is updated
when the idle list stops being empty. if the idle list hasn't been
empty for a while, it means the per cpu caches arent using the idle
entries and they can be recovered. timestamping the pages prevents
recovery of a lot of items that may be used again shortly. eg, rx
ring processing and replenishing from rate limited interrupts tends
to allocate and free items in large chunks, which the timestamping
smooths out.
gc'ed lists are returned to the pool pages, which in turn get gc'ed
back to uvm.
ok visa@
|
|
this lets pool_cache_list_put return items to the pages. currently,
if pool_cache_list_put is called while the per cpu caches are
enabled, the items on the list will put put straight back onto
another list in the cpu cache. this also avoids counting puts for
these items twice. a put for the items have already been coutned
when the items went to a cpu cache, it doesnt need to be counted
again when it goes back to the pool pages.
another side effect of this is that pool_cache_list_put can take
the pool mutex once when returning all the items in the list with
pool_do_put, rather than once per item.
ok visa@
|
|
|
|
|
|
KERN_POOL_CACHE reports info about the global cache info, like how long
the lists of cache items the cpus build should be and how many of these
lists are idle on the pool struct.
KERN_POOL_CACHE_CPUS reports counters from each each. the counters
are for how many item and list operations the cache has handled on
a cpu. the sysctl provides an array of ncpusfound * struct
kinfo_pool_cache_cpu, not a single struct kinfo_pool_cache_cpu.
tested by hrvoje popovski
ok mikeb@ millert@
----------------------------------------------------------------------
|
|
lists of free items on the per cpu caches are built out the pool items
as struct pool_cache_items, not struct pool_cache. make the KASSERT
in pool_cache_init check that properly.
|
|
on amd64 and i386.
|
|
if the gc task is running on a cpu that handles interrupts it is
possible to allow a deadlock. the gc task my be cleaning up a pool
and holding its mutex when an non-MPSAFE interrupt arrives and tries
to take the kernel lock. another cpu may already be holding the
kernel lock when it then tries use the same pool thats the pool GC
is currently processing.
thanks to sthen@ and mpi@ for chasing this down.
|
|
all pools set their ipls unconditionally now, so there isn't a need
to second guess them.
pointed out by and ok jmatthew@
|
|
if pool_debug is equal to 2, just like we do for malloc(9).
ok dlg@
|
|
to keep things concise i let the multi page allocators provide
multiple sizes of pages, but this feature was implicit inside
pool_init and only usable if the caller of pool_init did not specify
a page allocator.
callers of pool_init can now suplly a page allocator that provides
multiple page sizes. pool_init will try to fit 8 items onto a page
still, but will scale its page size down until it fits into what
the allocator provides.
supported page sizes are specified as a bit field in the pa_pagesz
member of a pool_allocator. setting the low bit in that word indicates
that the pages can be aligned to their size.
|
|
pool_item_header is now pool_page_header. the more useful change
is pool_list is now pool_cache_item. that's what items going into
the per cpu pool caches are cast to, and they get linked together
to make a list.
the functions operating on what is now pool_cache_items have been
renamed to make it more obvious what they manipulate.
|
|
|
|
it copies the existing pool code, except it works on pool_list
structures instead of pool_item structures.
after this id like to poison the words used by the TAILQ_ENTRY in
the pool_list struct that arent used until a list of items is moved
into the global depot.
|
|
it makes it more readable, and fixes a bug in pool_list_put where it
was returning the next item in the current list rather than the next
list to be freed.
|
|
this is modelled on whats described in the "Magazines and Vmem:
Extending the Slab Allocator to Many CPUs and Arbitrary Resources"
paper by Jeff Bonwick and Jonathan Adams.
the main semantic borrowed from the paper is the use of two lists
of free pool items on each cpu, and only moving one of the lists
in and out of a global depot of free lists to mitigate against a
cpu thrashing against that global depot.
unlike slabs, pools do not maintain or cache constructed items,
which allows us to use the items themselves to build the free list
rather than having to allocate arrays to point at constructed pool
items.
the per cpu caches are build on top of the cpumem api.
this has been kicked a bit by hrvoje popovski and simon mages (thank you).
im putting it in now so it is easier to work on and test.
ok jmatthew@
|
|
the ioff argument to pool_init() is unused and has been for many
years, so this replaces it with an ipl argument. because the ipl
will be set on init we no longer need pool_setipl.
most of these changes have been done with coccinelle using the spatch
below. cocci sucks at formatting code though, so i fixed that by hand.
the manpage and subr_pool.c bits i did myself.
ok tedu@ jmatthew@
@ipl@
expression pp;
expression ipl;
expression s, a, o, f, m, p;
@@
-pool_init(pp, s, a, o, f, m, p);
-pool_setipl(pp, ipl);
+pool_init(pp, s, a, ipl, f, m, p);
|
|
this is half way to recovering the space used by the subr_tree code.
|
|
itll go in again when i dont break userland.
|
|
ok tedu@
|
|
should help inspecting socket issues in the future.
enthusiasm from mpi@ bluhm@ deraadt@
|
|
multi page backend allocator implementation no longer needs to grab the
kernel lock.
ok mlarkin@, dlg@
|
|
* pool_allocator_single: single page allocator, always interrupt safe
* pool_allocator_multi: multi-page allocator, interrupt safe
* pool_allocator_multi_ni: multi-page allocator, not interrupt-safe
ok deraadt@, dlg@
|
|
isn't specified) the default backend allocator implementation no longer
needs to grab the kernel lock.
ok visa@, guenther@
|
|
in the (default) single page pool backend allocator. This means it is now
safe to call pool_get(9) and pool_put(9) for "small" items while holding
a mutex without holding the kernel lock as well as these functions will
no longer acquire the kernel lock under any circumstances. For "large" items
(where large is larger than 1/8th of a page) this still isn't safe though.
ok dlg@
|
|
functions. Note that these calls are deliberately not added to the
special-purpose back-end allocators in the various pmaps. Those allocators
either don't need to grab the kernel lock, are always called with the kernel
lock already held, or are only used on non-MULTIPROCESSOR platforms.
pk tedu@, deraadt@, dlg@
|
|
if we're allowed to try and use large pages, we try and fit at least
8 of the items. this amortises the per page cost of an item a bit.
"be careful" deraadt@
|
|
from martin natano
|
|
|
|
|
|
|
|
now that idle pool pages are timestamped we can tell how long theyve
been idle. this adds a task that runs every second that iterates
over all the pools looking for pages that have been idle for 8
seconds so it can free them.
this idea probably came from a conversation with tedu@ months ago.
ok tedu@ kettenis@
|
|
> if we're able to use large page allocators, try and place at least
> 8 items on a page. this reduces the number of allocator operations
> we have to do per item on large items.
this was backed out because of fallout on landisk which has since
been fixed. putting this in again early in the cycle so we can look
for more fallout. hopefully it will stick.
ok deraadt@
|
|
have any direct symbols used. Tested for indirect use by compiling
amd64/i386/sparc64 kernels.
ok tedu@ deraadt@
|
|
if you're having trouble understanding what this helps, imagine
your cpus caches are a hash table. by moving the base address of
items around (colouring them), you give it more bits to hash with.
in turn that makes it less likely that you will overflow buckets
in your hash. i mean cache.
it was inadvertantly removed in my churn of this subsystem, but as
tedu has said on this issue:
> The history of pool is filled with features getting trimmed because they
> seemed unnecessary or in the way, only to later discover how important they
> were. Having slowly learned that lesson, I think our default should be "if
> bonwick says do it, we do it" until proven otherwise.
until proven otherwise we can keep the functionality, especially
as the code cost is minimal.
ok many including tedu@ guenther@ deraadt@ millert@
|
|
the items address is within the page. it does that by masking the
item address with the page mask and comparing that to the page
address.
however, if we're using large pages with external page headers, we
dont request that the large page be aligned to its size. eg, on an
arch with 4k pages, an 8k large page could be aligned to 4k, so
masking bits to get the page address wont work.
these incorrect checks were distracting while i was debugging large
pages on landisk.
this changes it to do range checks to see if the item is within the
page. it also checks if the item is on the page before checking if
its magic values or poison is right.
ok miod@
|