/* Copyright (c) 2016 Google Inc
* Barret Rhoden <brho@cs.berkeley.edu>
* See LICENSE for details.
*
* Arena resource allocator, based on Bonwick and Adams's "Magazines and Vmem:
* Extending the Slab Allocator to Many CPUs and Arbitrary Resources".
*
* There are two major arenas (or arena types; see the NUMA discussion below):
* base_arena and kpages_arena. The base_arena consists of all the virtual
* addresses of the KERNBASE mapping, and is entirely self-sufficient. Some
* slab caches pull directly from this arena. The kpages_arena pulls from the
* base_arena and adds a level of quantum/slab caching. Most users will pull
* from kpages_arena.
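*
* As a quick sketch of the API (everything below other than kpages_arena and
* PGSIZE is hypothetical, not code in this file), a subsystem could build its
* own arena on top of kpages and allocate from it:
*
*	struct arena *my_arena;
*	void *buf;
*
*	my_arena = arena_create("my_arena", NULL, 0, PGSIZE, arena_alloc,
*	                        arena_free, kpages_arena, 0, MEM_WAIT);
*	buf = arena_alloc(my_arena, 4 * PGSIZE, MEM_WAIT);
*	...
*	arena_free(my_arena, buf, 4 * PGSIZE);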
*
* For jumbo pages, you'd think we'd want larger page sizes to be the source
* for the smaller page size arenas.  E.g. 'base' would be a PML3 allocator.
* The problem with that is that a base allocator needs to be self-sufficient,
* which means it needs to allocate its own boundary tags.  We'd prefer to use a
* small page for that.  So instead, we can flip the hierarchy around: a base
* allocator uses a PGSIZE quantum, and the jumbo allocators are sourced from
* the base arena, using an aligned allocation helper for their afunc.  I think,
* without a lot of thought, that the fragmentation would be equivalent.
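*
* As a sketch of that flipped hierarchy (jumbo_afunc and JUMBO_PGSIZE below are
* hypothetical, not code in this file), the jumbo arena's afunc would just push
* an aligned xalloc down to its source:
*
*	static void *jumbo_afunc(struct arena *source, size_t amt, int flags)
*	{
*		return arena_xalloc(source, amt, JUMBO_PGSIZE, 0, 0, NULL,
*		                    NULL, flags);
*	}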
*
* In the future, we can set up N base_arenas, one for each NUMA domain, each of
* which is a source for other NUMA allocators, e.g. kpages_i_arena. Higher
* level allocators (kmalloc()) will need to choose a NUMA domain and call into
* the correct allocator. Each NUMA base arena is self-sufficient: they have no
* qcaches and their BTs come from their own free page list. This just
* replicates the default memory allocator across each NUMA node, and at some
* point, some generic allocator software needs to pick which node to pull from.
* I tried to keep assumptions about a single base_arena to a minimum, but
* you'll see some places where the arena code needs to find some base arena for
* its BT allocations. Also note that the base setup happens before we know
* about NUMA domains. The plan is to do a small part of domain 0 during
* pmem_init(), then once we know the full memory layout, add in the rest of
* domain 0's memory and bootstrap the other domains.
*
* When it comes to importing spans, it's not clear whether or not we should
* import exactly the current allocation request or to bring in more. If we
* don't bring in more, then a child arena will have a span for every allocation
* and will return that span to the source whenever the segment is freed.  We'll
* never get the situation shown in Figure 4.4 of the Vmem paper.  Instead, we
* could either allow partial frees of segments or we could hang on to
* completely free spans
* for a *while*, and possibly require a reclaim callback. In the meantime, I
* added a per-arena scaling factor where we can adjust how much we import.
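* For example, with import_scale == 2, an allocation that needs one more page
* will import 1 << 2 == 4 pages' worth from the source (get_more_resources()
* computes size << import_scale).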
*
* TODO:
* - Blocking. We'll probably want to reserve some memory for emergencies to
* help us get out of OOM. So we might block when we're at low-mem, not at 0.
* We probably should have a sorted list of desired amounts, and unblockers
* poke the CV if the first waiter is likely to succeed.
* - Reclaim: have a ktask that sleeps on a rendez. We poke it, even from IRQ
* context. It qlocks arenas_and_slabs_lock, then does the reclaim.
* - There's an issue with when slab objects get deconstructed, and how that
* interacts with what I wanted to do with kstacks and TLB shootdowns. I
* think right now (2019-09) there is a problem with it.
*
* FAQ:
* - Does allocating memory from an arena require it to take a btag? Yes -
* unless the allocation is for the exact size of an existing btag/segment.
* - Why does arena_free() need size? Isn't it just for sanity checks? No - it
* is also used to determine which slab/qcache to return the segment to.
* - Why does a jumbo page arena use its own import function, instead of just
* xallocing from kpages with alignment? Because of fragmentation. kpages
* pulls directly from base, using a normal alloc for its import function
* (afunc). Because of this, its xalloc needs to request size + align, which
* will fragment base. It's better for jumbo to call xalloc directly on base,
* in essence pushing the aligned alloc as far down the stack as possible.
* - Does the stuff in a qcache (allocated or free/available) count against the
* arena's total/free amounts? No, at least not the way I did it. That's why
* it's called amt_total_segs: segments, not free memory. Those slab/qcaches
* will have their own stats, and it'd be a minor pain to sync up with them
* all the time. Also, the important stat is when the base arena starts to
* run out of memory, and base arenas don't have qcaches, so it's moot.
* - What's the difference between a kmem_cache (slab / KC) that is used as a
* qcache and one that just sources from the arena?  Two things.  First, qcaches
* are set up to be KMC_NOTOUCH, meaning that the objects we allocate cannot be
* treated as memory at any point.  When a KC is "pro-touch", it can use the
* memory of unallocated objects (memory allocated from the arena to the KC but
* not yet handed to a user) for bookkeeping.  NOTOUCH means we cannot do that;
* e.g. for an integer allocator, don't just write to random addresses!  In
* general, all arenas are NOTOUCH by default, to avoid these sorts of
* disasters.  The only time you'd want 'pro-touch' is for small-size KCs that
* are actually backed by memory.  In those cases, you just make KCs that source
* from an arena but are not qcaches.  These are the kmalloc slabs.  The other
* difference between a qcache and a KC sourcing from an arena is that qcaches
* have a different import_amt.  It's just a performance decision recommended in
* the vmem paper. import_amt is only used with large or no-touch objects,
* which is always the case with qcaches. Small, pro-touch KCs just grab a
* page at a time and use that for the slab struct and linked list. Finally,
* all qcaches, being NOTOUCH, therefore use bufctls in the slab. These are
* basically the same as the arena btags. So ultimately, all objects
* allocated from an arena, even those from qcaches, have some sort of btag.
* Each slab has an *additional* BT in the arena, representing the source's
* alloc. (Actually, that alloc can be from another, larger qcache of the
* arena!).
*/
#include <arena.h>
#include <kmalloc.h>
#include <string.h>
#include <stdio.h>
#include <hash.h>
#include <slab.h>
#include <kthread.h>
struct arena_tailq all_arenas = TAILQ_HEAD_INITIALIZER(all_arenas);
qlock_t arenas_and_slabs_lock = QLOCK_INITIALIZER(arenas_and_slabs_lock);
struct arena *base_arena;
struct arena *kpages_arena;
/* Misc helpers and forward declarations */
static struct btag *__get_from_freelists(struct arena *arena, int list_idx);
static bool __account_alloc(struct arena *arena, struct btag *bt, size_t size,
struct btag *new);
static void *__xalloc_nextfit(struct arena *arena, size_t size, size_t np2sb,
size_t phase, size_t nocross);
static void __try_hash_resize(struct arena *arena, int flags,
void **to_free_addr, size_t *to_free_sz);
static void __arena_asserter(struct arena *arena);
void print_arena_stats(struct arena *arena, bool verbose);
/* For NUMA situations, where there are multiple base arenas, we'll need a way
* to find *some* base arena. Ideally, it'll be in the same NUMA domain as
* arena. */
static struct arena *find_my_base(struct arena *arena)
{
/* TODO: could walk down sources until is_base is set. But barring
* that, we'd still need a way to find a base arena for some other
* allocator that just wants a page. arena may be NULL, so handle that.
* */
return base_arena;
}
static void setup_qcaches(struct arena *arena, size_t quantum,
size_t qcache_max)
{
int nr_qcaches = qcache_max / quantum;
char kc_name[KMC_NAME_SZ];
size_t qc_size;
if (!nr_qcaches)
return;
/* TODO: same as with hash tables, here and in slab.c, we ideally want
* the nearest kpages arena, but bootstrappers need to use base_alloc.
* */
arena->qcaches = base_alloc(arena,
nr_qcaches * sizeof(struct kmem_cache),
MEM_WAIT);
for (int i = 0; i < nr_qcaches; i++) {
qc_size = (i + 1) * quantum;
snprintf(kc_name, KMC_NAME_SZ, "%s_%d", arena->name, qc_size);
/* Alignment == 1. These QCs will give out quantum-aligned
* (actually multiples) objects, even without an alignment
* request. The reason is that the QCs pull their slabs from
* us, and we give out quantum-aligned objects (i.e. the slabs).
* Those slabs are made of up objects that are multiples
* of quantum. */
__kmem_cache_create(&arena->qcaches[i], kc_name, qc_size, 1,
KMC_NOTOUCH | KMC_QCACHE, arena,
NULL, NULL, NULL);
}
}
static void destroy_qcaches(struct arena *arena)
{
int nr_qcaches = arena->qcache_max / arena->quantum;
struct kmem_cache *kc;
if (!nr_qcaches)
return;
for (int i = 0; i < nr_qcaches; i++) {
kc = &arena->qcaches[i];
__kmem_cache_destroy(kc);
}
base_free(arena, arena->qcaches,
nr_qcaches * sizeof(struct kmem_cache));
arena->qcaches = NULL;
}
static void arena_init(struct arena *arena, const char *name, size_t quantum,
void *(*afunc)(struct arena *, size_t, int),
void (*ffunc)(struct arena *, void *, size_t),
struct arena *source, size_t qcache_max)
{
static_assert((ARENA_ALLOC_STYLES & MEM_FLAGS) == 0);
spinlock_init_irqsave(&arena->lock);
arena->import_scale = 0;
arena->is_base = FALSE;
if (qcache_max % quantum)
panic("Arena %s, qcache_max %p must be a multiple of quantum %p",
name, qcache_max, quantum);
arena->quantum = quantum;
arena->qcache_max = qcache_max;
arena->afunc = afunc;
arena->ffunc = ffunc;
if (source == ARENA_SELF_SOURCE)
arena->source = arena;
else
arena->source = source;
if (arena->source) {
assert(afunc && ffunc);
/* When we import, there may be a quantum mismatch such that our
* source hands us spans that are not sufficient for our
* quantum.  For instance, with s->q == 1 and a->q == 4096, the
* source could hand us 4096 bytes starting at 4095.  If rounding
* the start of a source alloc up to our quantum could cost us
* part of the span, we need to import 2x the size to be sure a
* span will work.
*
* All source allocs are multiples of s->q.  We want to know if
* all source allocs (our spans) are also multiples of a->q, in
* which case we don't need to import extra.  This is true when
* a->q divides s->q (s->q is a multiple of a->q). */
if (arena->source->quantum % arena->quantum)
arena->import_scale = 1;
}
arena->amt_total_segs = 0;
arena->amt_alloc_segs = 0;
arena->nr_allocs_ever = 0;
arena->all_segs = RB_ROOT;
BSD_LIST_INIT(&arena->unused_btags);
for (int i = 0; i < ARENA_NR_FREE_LISTS; i++)
BSD_LIST_INIT(&arena->free_segs[i]);
arena->alloc_hash = arena->static_hash;
hash_init_hh(&arena->hh);
for (int i = 0; i < arena->hh.nr_hash_lists; i++)
BSD_LIST_INIT(&arena->static_hash[i]);
TAILQ_INIT(&arena->__importing_arenas);
TAILQ_INIT(&arena->__importing_slabs);
strlcpy(arena->name, name, ARENA_NAME_SZ);
setup_qcaches(arena, quantum, qcache_max);
if (arena->source)
add_importing_arena(arena->source, arena);
qlock(&arenas_and_slabs_lock);
TAILQ_INSERT_TAIL(&all_arenas, arena, next);
qunlock(&arenas_and_slabs_lock);
}
struct arena *arena_create(const char *name, void *base, size_t size,
size_t quantum,
void *(*afunc)(struct arena *, size_t, int),
void (*ffunc)(struct arena *, void *, size_t),
struct arena *source, size_t qcache_max, int flags)
{
struct arena *arena;
/* See note in arena_add() */
if (source && base)
panic("Arena can't have both a source and an initial span");
if (!base && size)
panic("Arena can't have a base starting at 0");
arena = kmalloc(sizeof(struct arena), flags);
if (!arena)
return 0;
arena_init(arena, name, quantum, afunc, ffunc, source, qcache_max);
if (base) {
if (!arena_add(arena, base, size, flags)) {
warn("Failed to add base to arena %s, aborting!",
arena->name);
arena_destroy(arena);
return 0;
}
}
return arena;
}
void arena_destroy(struct arena *arena)
{
struct btag *bt_i, *temp;
qlock(&arenas_and_slabs_lock);
TAILQ_REMOVE(&all_arenas, arena, next);
qunlock(&arenas_and_slabs_lock);
if (arena->source)
del_importing_arena(arena->source, arena);
destroy_qcaches(arena);
for (int i = 0; i < arena->hh.nr_hash_lists; i++) {
/* Marginal at best. The qcaches are destroyed already; if
* someone tries to free this later, we're in trouble. */
if (!BSD_LIST_EMPTY(&arena->alloc_hash[i])) {
warn("Arena %s has unfreed items! Will not destroy.",
arena->name);
return;
}
}
if (arena->alloc_hash != arena->static_hash)
kfree(arena->alloc_hash);
/* We shouldn't have any spans left. We can tell we messed up if we had
* a source and still have some free segments. Otherwise, just collect
* the free tags on the unused btag list. */
for (int i = 0; i < ARENA_NR_FREE_LISTS; i++) {
if (arena->source)
assert(BSD_LIST_EMPTY(&arena->free_segs[i]));
BSD_LIST_FOREACH_SAFE(bt_i, &arena->free_segs[i], misc_link,
temp) {
BSD_LIST_REMOVE(bt_i, misc_link);
BSD_LIST_INSERT_HEAD(&arena->unused_btags, bt_i,
misc_link);
}
}
/* To free our BTs, we need to give the page back to the base arena.
* The BTs that are page aligned are the ones we want. We can just
* ignore the others (unlink from the list). */
BSD_LIST_FOREACH_SAFE(bt_i, &arena->unused_btags, misc_link, temp) {
if (PGOFF(bt_i))
BSD_LIST_REMOVE(bt_i, misc_link);
}
/* Now the remaining BTs are the first on their page. */
BSD_LIST_FOREACH_SAFE(bt_i, &arena->unused_btags, misc_link, temp)
arena_free(find_my_base(arena), bt_i, PGSIZE);
kfree(arena);
}
static void __insert_btag(struct rb_root *root, struct btag *bt)
{
struct rb_node **new = &root->rb_node, *parent = NULL;
struct btag *node;
while (*new) {
node = container_of(*new, struct btag, all_link);
parent = *new;
/* Span (BTAG_SPAN) nodes sort ahead of (are 'less than')
* regular segment nodes (BTAG_FREE or BTAG_ALLOC) that have the
* same start. */
if (bt->start < node->start)
new = &parent->rb_left;
else if (bt->start > node->start)
new = &parent->rb_right;
else if (node->status == BTAG_SPAN)
new = &parent->rb_right;
else
panic("BT %p already in tree %p!", bt, root);
}
rb_link_node(&bt->all_link, parent, new);
rb_insert_color(&bt->all_link, root);
}
/* Helper: tracks a seg pointed to by @bt as being allocated, assuming it is
* already off the free list (or was never on). This doesn't do anything with
* all_segs; that's someone else's job (usually bt is already on it). */
static void __track_alloc_seg(struct arena *arena, struct btag *bt)
{
size_t hash_idx;
bt->status = BTAG_ALLOC;
hash_idx = hash_long(bt->start, arena->hh.nr_hash_bits);
BSD_LIST_INSERT_HEAD(&arena->alloc_hash[hash_idx], bt, misc_link);
arena->hh.nr_items++;
}
/* Helper: untracks a seg pointed to by @bt as being allocated. Basically,
* removes it from the alloc_hash. */
static struct btag *__untrack_alloc_seg(struct arena *arena, uintptr_t start)
{
size_t hash_idx;
struct btag *bt_i;
hash_idx = hash_long(start, arena->hh.nr_hash_bits);
BSD_LIST_FOREACH(bt_i, &arena->alloc_hash[hash_idx], misc_link) {
if (bt_i->start == start) {
BSD_LIST_REMOVE(bt_i, misc_link);
assert(bt_i->status == BTAG_ALLOC);
arena->hh.nr_items--;
return bt_i;
}
}
return NULL;
}
/* Helper, tries to resize our hash table, if necessary. Call with the lock
* held, but beware that this will unlock and relock - meaning, don't rely on
* the lock to protect invariants, but hold it as an optimization.
*
* A lot of the nastiness here is related to the allocation. Critically, we
* might be the base arena, so it'd be nice to unlock before calling into
* ourselves (o/w deadlock). Likewise, we'd like to be able to do a blocking
* allocation, if flags has MEM_WAIT. We'd need to unlock for that too. */
static void __try_hash_resize(struct arena *arena, int flags,
void **to_free_addr, size_t *to_free_sz)
{
struct btag_list *new_tbl, *old_tbl;
struct btag *bt_i;
unsigned int new_tbl_nr_lists, old_tbl_nr_lists;
size_t new_tbl_sz, old_tbl_sz;
size_t hash_idx;
if (!hash_needs_more(&arena->hh))
return;
new_tbl_nr_lists = hash_next_nr_lists(&arena->hh);
/* We want future checkers to not think we need an increase, to avoid
* excessive hash resizes as well as base arena deadlocks (base_alloc
* must not call into base_alloc infinitely). */
hash_set_load_limit(&arena->hh, SIZE_MAX);
spin_unlock_irqsave(&arena->lock);
new_tbl_sz = new_tbl_nr_lists * sizeof(struct btag_list);
/* Regardless of the caller's style, we'll try and be quick with
* INSTANT. */
flags &= ~ARENA_ALLOC_STYLES;
flags |= ARENA_INSTANTFIT;
new_tbl = base_zalloc(arena, new_tbl_sz, flags);
spin_lock_irqsave(&arena->lock);
if (!new_tbl) {
/* Need to reset so future callers will try to grow. */
hash_reset_load_limit(&arena->hh);
spin_unlock_irqsave(&arena->lock);
return;
}
/* After relocking, we need to re-verify that we want to go ahead. It's
* possible that another thread resized the hash already, which we can
* detect because our alloc size is wrong. */
if (new_tbl_nr_lists != hash_next_nr_lists(&arena->hh)) {
spin_unlock_irqsave(&arena->lock);
base_free(arena, new_tbl, new_tbl_sz);
return;
}
old_tbl = arena->alloc_hash;
old_tbl_nr_lists = arena->hh.nr_hash_lists;
old_tbl_sz = old_tbl_nr_lists * sizeof(struct btag_list);
arena->alloc_hash = new_tbl;
hash_incr_nr_lists(&arena->hh);
for (int i = 0; i < old_tbl_nr_lists; i++) {
while ((bt_i = BSD_LIST_FIRST(&old_tbl[i]))) {
BSD_LIST_REMOVE(bt_i, misc_link);
hash_idx = hash_long(bt_i->start,
arena->hh.nr_hash_bits);
BSD_LIST_INSERT_HEAD(&arena->alloc_hash[hash_idx], bt_i,
misc_link);
}
}
hash_reset_load_limit(&arena->hh);
if (old_tbl != arena->static_hash) {
*to_free_addr = old_tbl;
*to_free_sz = old_tbl_sz;
}
}
/* Typically this will be just checking for one or two BTs on the free list */
static bool __has_enough_btags(struct arena *arena, size_t nr_needed)
{
struct btag *bt_i;
size_t so_far = 0;
BSD_LIST_FOREACH(bt_i, &arena->unused_btags, misc_link) {
so_far++;
if (so_far == nr_needed)
return TRUE;
}
return FALSE;
}
/* Allocs new boundary tags and puts them on the arena's free list.  Returns 0
* on failure, which can happen if MEM_ATOMIC is set.  Hold the lock when you
* call this, but note it will unlock and relock.
*
* The base arena is special in that it must be self-sufficient: it gets its
* free page from itself.  Other arenas just pull from base in the normal
* fashion.  We could pull from kpages_arena, but that would require a little
* more special casing.  Maybe in the future.
*
* Note that BTs are only freed when the arena is destroyed.  We use the fact
* that the first BT is at an aligned address to track the specific page it came
* from. */
static struct btag *__add_more_btags(struct arena *arena, int mem_flags)
{
struct btag *bt, *tags;
size_t nr_bts = PGSIZE / sizeof(struct btag);
if (arena->is_base) {
bt = __get_from_freelists(arena, LOG2_UP(PGSIZE));
if (!bt) {
/* TODO: block / reclaim if not MEM_ATOMIC. Remember,
* we hold the lock! We might need to rework this or
* get a reserved page. */
if (!(mem_flags & MEM_ATOMIC))
panic("Base failed to alloc its own btag, OOM");
return 0;
}
/* __account_alloc() will often need a new BT; specifically when
* we only need part of the segment tracked by the BT. Since we
* don't have any extra BTs, we'll use the first one on the page
* we just allocated. */
tags = (struct btag*)bt->start;
if (__account_alloc(arena, bt, PGSIZE, &tags[0])) {
/* We used the tag[0]; we'll have to skip over it now.*/
tags++;
nr_bts--;
}
} else {
/* Here's where we unlock and relock around a blocking call */
spin_unlock_irqsave(&arena->lock);
tags = arena_alloc(find_my_base(arena), PGSIZE,
mem_flags | ARENA_INSTANTFIT);
spin_lock_irqsave(&arena->lock);
if (!tags)
return 0;
}
for (int i = 0; i < nr_bts; i++)
BSD_LIST_INSERT_HEAD(&arena->unused_btags, &tags[i], misc_link);
return tags;
}
/* Helper, returns TRUE when we have enough BTs. Hold the lock, but note this
* will unlock and relock, and will attempt to acquire more BTs. Returns FALSE
* if an alloc failed (MEM_ATOMIC).
*
* This complexity is so that we never fail an arena operation due to lack of
* memory unless the caller has MEM_ATOMIC set. Further, __get_btag() never
* fails, which makes other code easier. Otherwise, functions that currently
* call __get_btag will need one or two BTs passed in from their callers, who
* allocate/remove from the list at a place where they can easily fail. */
static bool __get_enough_btags(struct arena *arena, size_t nr_needed,
int mem_flags)
{
if (__has_enough_btags(arena, nr_needed))
return TRUE;
/* This will unlock and relock, and maybe block. */
if (!__add_more_btags(arena, mem_flags)) {
/* This is the only failure scenario */
assert(mem_flags & MEM_ATOMIC);
return FALSE;
}
/* Since the lock was held in __add_more_btags, no one should have been
* able to drain them. If someone asked for more than a page worth of
* BTs, there's a problem somewhere else. */
assert(__has_enough_btags(arena, nr_needed));
return TRUE;
}
/* Helper: gets a btag. All callpaths must have made sure the arena has enough
* tags before starting its operation, holding the lock throughout. Thus, this
* cannot fail. */
static struct btag *__get_btag(struct arena *arena)
{
struct btag *ret;
ret = BSD_LIST_FIRST(&arena->unused_btags);
/* All code paths should have made sure that there were enough BTs
* before diving in. */
assert(ret);
BSD_LIST_REMOVE(ret, misc_link);
return ret;
}
static void __free_btag(struct arena *arena, struct btag *bt)
{
BSD_LIST_INSERT_HEAD(&arena->unused_btags, bt, misc_link);
}
/* Helper: adds seg pointed to by @bt to the appropriate free list of @arena. */
static void __track_free_seg(struct arena *arena, struct btag *bt)
{
int list_idx = LOG2_DOWN(bt->size);
bt->status = BTAG_FREE;
BSD_LIST_INSERT_HEAD(&arena->free_segs[list_idx], bt, misc_link);
}
/* Helper: removes seg pointed to by @bt from the appropriate free list of
* @arena. */
static void __untrack_free_seg(struct arena *arena, struct btag *bt)
{
BSD_LIST_REMOVE(bt, misc_link);
}
/* Helper: we decided we want to alloc part of @bt, which has been removed from
* its old list. We need @size units. The rest can go back to the arena.
*
* Takes @new, which we'll use if we need a new btag. If @new is NULL, we'll
* allocate one. If we used the caller's btag, we'll return TRUE. This
* complexity is for a base arena's manual btag allocation. */
static bool __account_alloc(struct arena *arena, struct btag *bt,
size_t size, struct btag *new)
{
bool ret = FALSE;
assert(bt->status == BTAG_FREE);
if (bt->size != size) {
assert(bt->size > size);
if (new)
ret = TRUE;
else
new = __get_btag(arena);
new->start = bt->start + size;
new->size = bt->size - size;
bt->size = size;
__track_free_seg(arena, new);
__insert_btag(&arena->all_segs, new);
}
__track_alloc_seg(arena, bt);
arena->amt_alloc_segs += size;
arena->nr_allocs_ever++;
return ret;
}
/* Helper: gets the first segment from the smallest, populated list. */
static struct btag *__get_from_freelists(struct arena *arena, int list_idx)
{
struct btag *ret = NULL;
for (int i = list_idx; i < ARENA_NR_FREE_LISTS; i++) {
ret = BSD_LIST_FIRST(&arena->free_segs[i]);
if (ret) {
BSD_LIST_REMOVE(ret, misc_link);
break;
}
}
return ret;
}
/* Allocates using the 'best fit' policy. Recall that each free_segs list holds
* segments of size [2^n, 2^(n+1)).  We try to find the smallest segment on
* that list that can satisfy the request. Otherwise, any segment from a larger
* list will suffice. */
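/* Worked example (illustrative numbers): for size == 3000, LOG2_DOWN(3000) is
* 11, so bestfit scans free_segs[11] (segments in [2048, 4096)) for the
* smallest one that is >= 3000, falling back to the higher lists otherwise.
* Instantfit (below) starts at LOG2_UP(3000) == 12, where any segment is at
* least 4096 and thus big enough. */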
static void *__alloc_bestfit(struct arena *arena, size_t size)
{
int list_idx = LOG2_DOWN(size);
struct btag *bt_i, *best = NULL;
BSD_LIST_FOREACH(bt_i, &arena->free_segs[list_idx], misc_link) {
if (bt_i->size >= size) {
if (!best || (best->size > bt_i->size))
best = bt_i;
}
}
if (best)
BSD_LIST_REMOVE(best, misc_link);
else
best = __get_from_freelists(arena, list_idx + 1);
if (!best)
return NULL;
__account_alloc(arena, best, size, NULL);
return (void*)best->start;
}
static void *__alloc_nextfit(struct arena *arena, size_t size)
{
return __xalloc_nextfit(arena, size, arena->quantum, 0, 0);
}
/* Instant fit grabs the first segment guaranteed to be big enough. Note that
* we round list_idx up, compared to bestfit's initial list. That way, you're
* always sure you have a big enough segment. */
static void *__alloc_instantfit(struct arena *arena, size_t size)
{
struct btag *ret;
ret = __get_from_freelists(arena, LOG2_UP(size));
if (!ret)
return NULL;
__account_alloc(arena, ret, size, NULL);
return (void*)ret->start;
}
/* Non-qcache allocation. Hold the arena's lock. Note that all allocations are
* done in multiples of the quantum. */
static void *alloc_from_arena(struct arena *arena, size_t size, int flags)
{
void *ret;
void *to_free_addr = 0;
size_t to_free_sz = 0;
spin_lock_irqsave(&arena->lock);
if (!__get_enough_btags(arena, 1, flags & MEM_FLAGS)) {
spin_unlock_irqsave(&arena->lock);
return NULL;
}
if (flags & ARENA_BESTFIT)
ret = __alloc_bestfit(arena, size);
else if (flags & ARENA_NEXTFIT)
ret = __alloc_nextfit(arena, size);
else
ret = __alloc_instantfit(arena, size);
/* Careful, this will unlock and relock. It's OK right before an
* unlock. */
__try_hash_resize(arena, flags, &to_free_addr, &to_free_sz);
spin_unlock_irqsave(&arena->lock);
if (to_free_addr)
base_free(arena, to_free_addr, to_free_sz);
return ret;
}
/* Adds segment [@base, @base + @size) to @arena. We'll add a span tag if the
* arena has a source. */
static void *__arena_add(struct arena *arena, void *base, size_t size,
int flags)
{
struct btag *bt, *span_bt;
uintptr_t limit;
assert(base < base + size);
spin_lock_irqsave(&arena->lock);
/* Make sure there are two, bt and span. */
if (!__get_enough_btags(arena, 2, flags & MEM_FLAGS)) {
spin_unlock_irqsave(&arena->lock);
return NULL;
}
bt = __get_btag(arena);
/* Our source may have a different (and smaller) quantum than us.
* span_bt will track the source's allocation, and our bt will track a
* subset of those bytes that are multiples of our quantum. */
limit = (uintptr_t)base + size;
bt->start = ROUNDUP((uintptr_t)base, arena->quantum);
bt->size = ROUNDDOWN(limit - bt->start, arena->quantum);
/* Caller should have been careful about this. get_more_resources()
* should have a large enough import_amt / import_scale. */
if (bt->start >= limit || !bt->size) {
warn("Added segment too small! (a: %s, b:%p, s:%p, q:%p)",
arena->name, base, size, arena->quantum);
spin_unlock_irqsave(&arena->lock);
return NULL;
}
if (arena->source) {
span_bt = __get_btag(arena);
span_bt->start = (uintptr_t)base;
span_bt->size = size;
span_bt->status = BTAG_SPAN;
/* This is dirty, but it saves 8 bytes in every BT that would
* only be used by span BTs. We're not on any list, so
* misc-link is available. We also need to keep track of the
* size of this arena's BT so we can detect when it is free. */
*(uintptr_t *)&span_bt->misc_link = bt->size;
/* Note the btag span is not on any list, but it is in all_segs
*/
__insert_btag(&arena->all_segs, span_bt);
}
arena->amt_total_segs += bt->size;
__track_free_seg(arena, bt);
__insert_btag(&arena->all_segs, bt);
spin_unlock_irqsave(&arena->lock);
return base;
}
/* Adds segment [@base, @base + @size) to @arena. */
void *arena_add(struct arena *arena, void *base, size_t size, int flags)
{
/* This wasn't clear from the paper, but mixing source spans and
* manually added spans seems like a pain when coalescing BTs and
* freeing. */
if (arena->source)
panic("Arenas with sources must not manually add resources.");
if (!base && size)
panic("Arena can't have a base starting at 0");
return __arena_add(arena, base, size, flags);
}
/* Attempt to get more resources, either from a source or by blocking. Returns
* TRUE if we got something. FALSE on failure (e.g. MEM_ATOMIC). */
static bool get_more_resources(struct arena *arena, size_t size, int flags)
{
void *span;
size_t import_size;
if (arena->source) {
/* MAX check, in case size << scale overflows */
import_size = MAX(size, size << arena->import_scale);
/* The source will roundup to the nearest quantum. We might as
* well do it now so that we know about the extra space.
* Otherwise we'd just waste the excess. */
import_size = MAX(import_size,
ROUNDUP(import_size, arena->source->quantum));
span = arena->afunc(arena->source, import_size, flags);
if (!span)
return FALSE;
if (!__arena_add(arena, span, import_size, flags)) {
/* We could fail if MEM_ATOMIC and we couldn't get a BT
*/
warn("Excessively rare failure, tell brho");
arena->ffunc(arena->source, span, import_size);
return FALSE;
}
} else {
/* TODO: allow blocking */
if (!(flags & MEM_ATOMIC))
panic("OOM!");
return FALSE;
}
return TRUE;
}
/* Helper. For a given size, return the applicable qcache. */
static struct kmem_cache *size_to_qcache(struct arena *arena, size_t size)
{
/* If we ever get grumpy about the costs of dividing (both here and in
* the ROUND ops), we could either insist that quantum is a power of two,
* or track whether or not it is and use other shifting ops. */
return &arena->qcaches[(size / arena->quantum) - 1];
}
void *arena_alloc(struct arena *arena, size_t size, int flags)
{
void *ret;
size = ROUNDUP(size, arena->quantum);
if (!size)
panic("Arena %s, request for zero", arena->name);
if (size <= arena->qcache_max) {
/* NEXTFIT is an error, since free won't know to skip the qcache
* and then we'd be handing an item to the qcache that it didn't
* alloc. */
if (flags & ARENA_NEXTFIT)
panic("Arena %s, NEXTFIT, but has qcaches. Use xalloc.",
arena->name);
return kmem_cache_alloc(size_to_qcache(arena, size), flags);
}
while (1) {
ret = alloc_from_arena(arena, size, flags);
if (ret)
return ret;
/* This is a little nasty. We asked our source for enough, but
* it may be a bestfit sized chunk, not an instant fit. Since
* we already failed once, we can just downgrade to BESTFIT,
* which will likely find our recently-allocated span. Even
* worse, the source might only give us segments that are
* BESTFIT, and if we only look at the INSTANTFIT, we'll keep
* looping. The invariant here is that if we
* get_more_resources, then our allocation can succeed if no one
* else grabs that memory first.
*
* We actually have two options here. Either we downgrade to
* BESTFIT or we round up our request to our source to the
* nearest power of two. Doing so brings some of the
* fragmentation into our arena, but an instant fit is likely to
* succeed. Considering how the first item on the BESTFIT list
* is likely ours, downgrading makes sense. */
flags &= ~ARENA_ALLOC_STYLES;
flags |= ARENA_BESTFIT;
if (!get_more_resources(arena, size, flags))
return NULL;
}
}
/* Helper: given a BT's start and size, return a starting address within the BT
* that satisfies the constraints.  Returns 0 on failure.
*
* The rough idea is to try the two potential starting alloc locations:
* - from bt_start, round *down* to np2sb, which may be below the BT, then add
* the phase, which may bring us back into the BT (or overflow).
* - add one more np2sb.  This should be > bt_start (modulo overflow), since
* ROUNDDOWN subtracted less than np2sb.
*
* The 'nocross' boundary (also an alignment) complicates things a little. */
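/* Worked example (illustrative numbers): bt = [0x5300, 0x6300), size = 0x200,
* np2sb = 0x1000, phase = 0x100, nocross = 0.  ROUNDDOWN(0x5300, 0x1000) is
* 0x5000; adding phase gives 0x5100, which is below bt_start, so we add one
* more np2sb to get 0x6100.  0x6100 + 0x200 == 0x6300 still fits within the
* BT, so 0x6100 is returned. */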
static uintptr_t __find_sufficient(uintptr_t bt_start, size_t bt_size,
size_t size, size_t np2sb,
size_t phase, size_t nocross)
{
uintptr_t try;
do {
try = bt_start;
try = ROUNDDOWN(try, np2sb);
try += phase;
if (try < bt_start)
try += np2sb;
/* Overflow sanity check. Ultimately, don't look outside bt. */
if (try < bt_start)
return 0;
/* Check wraparound */
if (try + size < try)
return 0;
/* Too big for BT, no chance. */
if (try + size > bt_start + bt_size)
return 0;
if (nocross == 0)
return try;
/* Got to deal with nocross boundaries. If we round up from our
* potential start and that is beyond our potential finish,
* we're OK. */
if (ALIGN(try, nocross) >= try + size)
return try;
/* The segment still might have a chance. Perhaps we started
* right before a nocross.
*
* All we're doing here is artificially limiting bt to a subset,
* starting at the next nocross, if it is within BT. And we're
* checking *all* nocross-aligned chunks in this BT */
try = ALIGN(bt_start + 1, nocross);
if (try - bt_start >= bt_size)
return 0;
bt_size -= try - bt_start;
bt_start = try;
} while (1);
}
/* Helper: splits bt, which is not on any freelist, at @at, and puts the front
* part back on a free list. */
static void __split_bt_at(struct arena *arena, struct btag *bt, uintptr_t at)
{
struct btag *front = __get_btag(arena);
/* We're changing bt's start, which is its key for its position in the
* all_segs tree. However, we don't need to remove and reinsert it,
* since although we increased its start, we know that no BT should be
* between its old start and its new start. That's actually where the
* front BT will get inserted (so long as we insert after changing bt's
* start). */
front->status = BTAG_FREE;
front->start = bt->start;
front->size = at - bt->start;
bt->start += front->size;
bt->size -= front->size;
__track_free_seg(arena, front);
__insert_btag(&arena->all_segs, front);
/* At this point, bt's old space in all_segs is broken into: front:
* [old_start, try), bt: [try, old_end). front is on the free list. bt
* is not. */
}
/* Helper. We want either the BT containing minaddr, or the closest to it
* (above). There might be no BTs containing it, above it, or below it. */
static bool __found_least_upper_btag(struct btag *bt, uintptr_t minaddr)
{
struct rb_node *prev;
struct btag *btp;
if (bt->start + bt->size <= minaddr)
return FALSE; /* We are strictly below */
if (bt->start <= minaddr)
return TRUE; /* We contain it */
prev = rb_prev(&bt->all_link);
if (!prev)
return TRUE; /* We are above, but no one else below */
/* We are above and not containing. If our prev is below min, then
* we're it. */
btp = container_of(prev, struct btag, all_link);
if (btp->start + btp->size <= minaddr)
return TRUE;
return FALSE;
}
/* Does a search within min/max for a segment. */
static void *__xalloc_min_max(struct arena *arena, size_t size,
size_t np2sb, size_t phase, size_t nocross,
uintptr_t minaddr, uintptr_t maxaddr)
{
struct rb_node *node = arena->all_segs.rb_node;
struct btag *bt;
uintptr_t try, try_start, try_size;
/* Find the first BT containing, or >= minaddr */
while (node) {
bt = container_of(node, struct btag, all_link);
if (__found_least_upper_btag(bt, minaddr))
break;
if (minaddr < bt->start)
node = node->rb_left;
else
node = node->rb_right;
}
/* Now we're probably at the first start point (or there's no node).
* Just scan from here. The first node could contain minaddr, so we
* need to round up. Also note that this is *all* segs, including
* non-free ones. */
for (/* node set */; node; node = rb_next(node)) {
bt = container_of(node, struct btag, all_link);
if (bt->status != BTAG_FREE)
continue;
try_start = bt->start;
try_size = bt->size;
if (bt->start < minaddr) {
try_start = minaddr;
try_size = bt->size - (minaddr - bt->start);
}
try = __find_sufficient(try_start, try_size, size, np2sb, phase,
nocross);
if (!try)
continue;
if (maxaddr && (try + size > maxaddr))
return NULL;
__untrack_free_seg(arena, bt);
if (try != bt->start)
__split_bt_at(arena, bt, try);
__account_alloc(arena, bt, size, NULL);
return (void*)bt->start;
}
return NULL;
}
/* For xalloc, there isn't any real instant fit, due to the nocross issues. We
* can still try to get a quicker fit by starting on a higher order list. */
static void *__xalloc_from_freelists(struct arena *arena, size_t size,
size_t np2sb, size_t phase, size_t nocross,
bool try_instant_fit)
{
int list_idx;
struct btag *bt_i;
uintptr_t try = 0;
/* This starting list_idx is an optimization. We could scan all the
* lists. You can't round-up size and add phase, because that could
* cross a power-of-two boundary and skip the one list that works. */
list_idx = LOG2_DOWN(size);
list_idx += try_instant_fit ? 1 : 0;
for (int i = list_idx; i < ARENA_NR_FREE_LISTS; i++) {
BSD_LIST_FOREACH(bt_i, &arena->free_segs[i], misc_link) {
try = __find_sufficient(bt_i->start, bt_i->size, size,
np2sb, phase, nocross);
if (try) {
BSD_LIST_REMOVE(bt_i, misc_link);
break;
}
}
if (try)
break;
}
if (!try)
return NULL;
if (try != bt_i->start)
__split_bt_at(arena, bt_i, try);
__account_alloc(arena, bt_i, size, NULL);
return (void*)bt_i->start;
}
static void *__xalloc_nextfit(struct arena *arena, size_t size, size_t np2sb,
size_t phase, size_t nocross)
{
void *ret;
/* NEXTFIT is a lot like a minaddr. We can start from the old addr + 1,
* since the implementation of that helper starts a search from minaddr.
* If it fails, we can try again from 1 (quantum, really), skipping 0.
* */
ret = __xalloc_min_max(arena, size, np2sb, phase, nocross,
arena->last_nextfit_alloc + arena->quantum, 0);
if (!ret) {
ret = __xalloc_min_max(arena, size, np2sb, phase, nocross,
arena->quantum, 0);
}
if (!ret)
return NULL;
arena->last_nextfit_alloc = (uintptr_t)ret;
return ret;
}
static void *xalloc_from_arena(struct arena *arena, size_t size,
size_t np2sb, size_t phase, size_t nocross,
void *minaddr, void *maxaddr, int flags)
{
void *ret;
void *to_free_addr = 0;
size_t to_free_sz = 0;
spin_lock_irqsave(&arena->lock);
/* Need two, since we might split a BT into 3 BTs. */
if (!__get_enough_btags(arena, 2, flags & MEM_FLAGS)) {
spin_unlock_irqsave(&arena->lock);
return NULL;
}
if (minaddr || maxaddr) {
ret = __xalloc_min_max(arena, size, np2sb, phase, nocross,
(uintptr_t)minaddr, (uintptr_t)maxaddr);
} else {
if (flags & ARENA_BESTFIT) {
ret = __xalloc_from_freelists(arena, size, np2sb, phase,
nocross, FALSE);
} else if (flags & ARENA_NEXTFIT) {
ret = __xalloc_nextfit(arena, size, np2sb, phase,
nocross);
} else {
ret = __xalloc_from_freelists(arena, size, np2sb, phase,
nocross, TRUE);
}
}
/* Careful, this will unlock and relock. It's OK right before an
* unlock. */
__try_hash_resize(arena, flags, &to_free_addr, &to_free_sz);
spin_unlock_irqsave(&arena->lock);
if (to_free_addr)
base_free(arena, to_free_addr, to_free_sz);
return ret;
}
void *arena_xalloc(struct arena *arena, size_t size, size_t align, size_t phase,
size_t nocross, void *minaddr, void *maxaddr, int flags)
{
void *ret;
size_t req_size;
bool ovf = false;
size_t np2sb; /* non-power-of 2 start boundary */
size = ROUNDUP(size, arena->quantum);
if (!size)
panic("Arena %s, request for zero", arena->name);
/* align == 0 is basically align == 1: "don't care" */
if (!align)
align = 1;
/* If we make quantum a power of two, we can just take the larger, which
* is a multiple of the smaller */
np2sb = LCM_PWR2(arena->quantum, align);
if (!np2sb)
panic("Arena %s, could not find np2sb %p %p",
arena->name, arena->quantum, align);
if (phase >= align)
panic("Arena %s, phase %d >= align %d",
arena->name, phase, align);
if (phase % arena->quantum)
panic("Arena %s, non-multiple phase %d %d",
arena->name, phase, arena->quantum);
if (nocross) {
if (!IS_PWR2(nocross))
panic("Arena %s, non-power of two nocross %p",
arena->name, nocross);
/* See the discussion on nocross below. Paranoia for overflow
* is checked below (our callers are kernel users). */
if (size + phase > nocross)
panic("Arena %s, unsat size and phase: %p + %p > %p",
arena->name, size, phase, nocross);
/* This is a little aggressive. This arena or its source might
* very well give us something that works. This check covers
* cases where we might be unable to ask our source via a
* regular alloc for a segment that could satisfy this
* allocation request, and we could lock up. */
if (arena->source && !ALIGNED(np2sb, nocross) &&
!(2 * size - 2 < nocross))
panic("Arena %s, unsat size: 2 * %p - 2 > %p",
arena->name, size, nocross);
}
/* Ok, it's a pain to import resources from a source such that we'll be
* able to guarantee we make progress without stranding resources if we
* have min/maxaddr. We don't have a way to ask the source for a
* particular range: i.e. an xalloc.
*
* If we get a span from the source and never use it, then we run a risk
* of fragmenting and stranding a bunch of spans in our current arena.
* Imagine the loop where we keep asking for spans (e.g. 8 pgs) and
* getting something that doesn't work. Those 8 pgs are fragmented, and
* we won't give them back to the source until we allocate and then free
* them (barring some sort of reclaim callback).
*
* If we want to support this, we'll need to require an xafunc that gets
* called when xalloc needs to get_more_resources(). This means all
* arenas in the chain need an xafunc, all the way back to the base.
* Otherwise, when we ask a source for more, we don't know
* whether what we get back is usable.
*
* Note that if we import a segment with xalloc, we need to free it with
* xfree. That means this arena needs to track its segment types.
* Also, every xalloc call skips the qcache. That might be a
* performance issue, so it's better to not do those if you can. */
if (arena->source && (minaddr || maxaddr))
panic("Arena %s, has source, can't xalloc with minaddr %p or maxaddr %p",
arena->name, minaddr, maxaddr);
while (1) {
ret = xalloc_from_arena(arena, size, np2sb, phase, nocross,
minaddr, maxaddr, flags);
if (ret)
return ret;
/* Ah, nocross. We need to make sure the segment we pull from
* the source is sufficient, but we are doing a regular alloc
* (not xalloc). One conservative way to do this is to request
* space for two of whatever we need, and abort if that block
* could contain more than one nocross boundary.
*
* For starters, if size > nocross, we're completely
* unsatisfiable. So there are some requests we just can't do.
* Similarly, and slightly stricter: size + phase > nocross is
* unsat too. Here's why: phase is a shift off from an
* alignment, and nocross is an alignment. The best case
* scenario for a potential allocation is if it starts right at
* a nocross boundary. (Starting later makes our potential
* space even smaller).
*
* Let's consider nocross >= align.  The closest the start of the
* object we want can get to a nocross boundary, without the
* boundary cutting into it, is phase bytes above that boundary.
* That leaves us with nocross - phase bytes until the next
* boundary.
*
* Now consider align > nocross. Any potential object that
* starts at an unaligned nocross will need to get rounded up to
* align, and then add phase, and then have that last bit of
* space for the object. That's even less space, though it
* varies based on which object we look at - some of the nocross
* boundaries will be align-aligned.
*
* Next, any allocation >= 2 bytes could have a boundary
* (subject to align and phase).  So we're going to have at
* least one boundary in almost all allocations.  (nocross with sz
* == 1 is meaningless). If we have a boundary, we have limited
* control over where it is in the object - it could be right in
* the middle. The safest thing to do is grab 2x the space we
* need, and then the one boundary can ruin at most one of the
* two objects.
*
* How many boundaries are there in X bytes? (where X will be
* 2*size)
* FLOOR((x - 2) / nocross) + 1; (x >= 2)
* To have at most one boundary:
* x - 2 < nocross
* size >= 1, so x >=2. Thus to be sure our alloc will work, we
* check 2*size - 2 < nocross. That's for a request of 2*size
* from the source arena. If the original request to our arena
* was greater than that, then there's no guarantee we can use a
* regular alloc from our source and get a result that will be
* nocross-sat.
*
* Oh, and if we have align / phase, we will need to request
* more to make sure our 2x block is in the right place, though
* we don't need to worry about a boundary falling in the
* alignment/phase space.
*
* The minimum we need for that is (align - 1) - phase. Though
* the xalloc algorithms might be simple/lazy, so previously I
* just added align + phase. It's actually safe to ask for
* more; it might be a waste or might block, but the main
* concern is that we get something that is guaranteed to work.
*
* And since quantum can be a non-power-of-two, we're aligning
* to "np2sb" (the LCM of quantum and align), so it's really
* np2sb + phase.
*
* At this point, we're requesting 2*size + np2sb + phase.
*
* Now, if we have an align (and/or phase), the align can help
* us with the nocross too! If np2sb is nocross-aligned, and
* size + phase < nocross, which always must be true, then we
* know the start of the object is on a boundary (minus phase),
* and if it can fit at all, it will certainly fit. So other
* than the sanity check, we just ignore nocross. It's somewhat
* meaningless to ask for align >= nocross.
*
* I'm certainly not 100% on any of this. */
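/* Worked example (illustrative numbers): size = 0x1000, quantum = 0x1000,
* align = 1 (so np2sb = 0x1000), phase = 0, nocross = 0x4000.  np2sb is not
* nocross-aligned, so we double req_size to 0x2000, then add np2sb and phase
* for a 0x3000 request to the source.  A 0x3000 chunk contains at most one
* 0x4000 boundary, so at least one properly aligned 0x1000 subchunk within it
* will be nocross-sat. */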
req_size = size;
/* Distilled: need 2x when nocross, and align doesn't help */
if (nocross && !ALIGNED(np2sb, nocross))
ovf |= check_add_overflow(req_size, size, &req_size);
/* TODO: check xalloc internals: could be align - 1 - phase */
ovf |= check_add_overflow(req_size, np2sb, &req_size);
ovf |= check_add_overflow(req_size, phase, &req_size);
if (ovf)
panic("Arena %s, size %p + np2sb %p + phase %p overflw",
arena->name, size, np2sb, phase);
if (!get_more_resources(arena, req_size, flags))
return NULL;
/* Our source may have given us a segment that is on the BESTFIT
* list, same as with arena_alloc. */
flags &= ~ARENA_ALLOC_STYLES;
flags |= ARENA_BESTFIT;
/* TODO: could put a check in here to make sure we don't loop
* forever, in case we trip some other bug. */
}
}
/* Helper: if possible, merges the right BT to the left. Returns TRUE if we
* merged. */
static bool __merge_right_to_left(struct arena *arena, struct btag *left,
struct btag *right)
{
/* These checks will also make sure we never merge SPAN boundary tags.*/
if (left->status != BTAG_FREE)
return FALSE;
if (right->status != BTAG_FREE)
return FALSE;
if (left->start + left->size == right->start) {
/* Need to yank left off its list before changing its size. */
__untrack_free_seg(arena, left);
__untrack_free_seg(arena, right);
left->size += right->size;
__track_free_seg(arena, left);
rb_erase(&right->all_link, &arena->all_segs);
__free_btag(arena, right);
return TRUE;
}
return FALSE;
}
/* Merges @bt's segments with its adjacent neighbors. If we end up having an
* entire span free, we'll stop tracking it in this arena and return it for our
* caller to free. */
static void __coalesce_free_seg(struct arena *arena, struct btag *bt,
void **to_free_addr, size_t *to_free_sz)
{
struct rb_node *rb_p, *rb_n;
struct btag *bt_p, *bt_n;
rb_n = rb_next(&bt->all_link);
if (rb_n) {
bt_n = container_of(rb_n, struct btag, all_link);
__merge_right_to_left(arena, bt, bt_n);
}
rb_p = rb_prev(&bt->all_link);
if (rb_p) {
bt_p = container_of(rb_p, struct btag, all_link);
if (__merge_right_to_left(arena, bt_p, bt))
bt = bt_p;
}
/* Check for a span */
rb_p = rb_prev(&bt->all_link);
if (rb_p) {
bt_p = container_of(rb_p, struct btag, all_link);
/* If the prev is a span tag, we know it is ours. We just need
* to know if our bt covers the entire import span size. */
if ((bt_p->status == BTAG_SPAN) &&
(*(uintptr_t *)&bt_p->misc_link == bt->size)) {
*to_free_addr = (void*)bt_p->start;
*to_free_sz = bt_p->size;
/* Note the span was not on a free list */
__untrack_free_seg(arena, bt);
rb_erase(&bt_p->all_link, &arena->all_segs);
__free_btag(arena, bt_p);
rb_erase(&bt->all_link, &arena->all_segs);
__free_btag(arena, bt);
}
}
}
static void free_from_arena(struct arena *arena, void *addr, size_t size)
{
struct btag *bt;
void *to_free_addr = 0;
size_t to_free_sz = 0;
spin_lock_irqsave(&arena->lock);
bt = __untrack_alloc_seg(arena, (uintptr_t)addr);
if (!bt) {
warn("Free of unallocated addr %p size %p from arena %s", addr,
size, arena->name);
spin_unlock_irqsave(&arena->lock);
return;
}
if (bt->size != size) {
warn("Free of %p with wrong size %p (%p) from arena %s", addr,
size, bt->size, arena->name);
/* Put the BT back in the alloc hash so the arena stays consistent. */
__track_alloc_seg(arena, bt);
spin_unlock_irqsave(&arena->lock);
return;
}
arena->amt_alloc_segs -= size;
__track_free_seg(arena, bt);
__coalesce_free_seg(arena, bt, &to_free_addr, &to_free_sz);
arena->amt_total_segs -= to_free_sz;
spin_unlock_irqsave(&arena->lock);
if (to_free_addr)
arena->ffunc(arena->source, to_free_addr, to_free_sz);
}
void arena_free(struct arena *arena, void *addr, size_t size)
{
if (!addr)
return;
size = ROUNDUP(size, arena->quantum);
if (size <= arena->qcache_max)
return kmem_cache_free(size_to_qcache(arena, size), addr);
free_from_arena(arena, addr, size);
}
void arena_xfree(struct arena *arena, void *addr, size_t size)
{
if (!addr)
return;
size = ROUNDUP(size, arena->quantum);
free_from_arena(arena, addr, size);
}
/* Low-level arena builder. Pass in a page address, and this will build an
* arena in that memory.
*
* This will be used for each NUMA domain's base arena, kpages_arena, and
* kmalloc_arena, since the normal arena_create() won't work yet (no kmalloc).
*/
struct arena *arena_builder(void *pgaddr, const char *name, size_t quantum,
void *(*afunc)(struct arena *, size_t, int),
void (*ffunc)(struct arena *, void *, size_t),
struct arena *source, size_t qcache_max)
{
struct arena *a = (struct arena*)pgaddr;
struct btag *two_tags = (struct btag*)(pgaddr + sizeof(struct arena));
static_assert(sizeof(struct arena) + 2 * sizeof(struct btag) <= PGSIZE);
arena_init(a, name, quantum, afunc, ffunc, source, qcache_max);
if (!source)
a->is_base = TRUE;
BSD_LIST_INSERT_HEAD(&a->unused_btags, &two_tags[0], misc_link);
BSD_LIST_INSERT_HEAD(&a->unused_btags, &two_tags[1], misc_link);
return a;
}
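/* Hypothetical bootstrap sketch (the real early init lives in the pmem code;
* names below are illustrative only):
*
*	static uint8_t bootstrap_pg[PGSIZE] __attribute__((aligned(PGSIZE)));
*
*	base_arena = arena_builder(bootstrap_pg, "base", PGSIZE, NULL, NULL,
*	                           NULL, 0);
*	arena_add(base_arena, first_free_kaddr, free_range_sz, MEM_WAIT);
*/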
/* Sanity checker for an arena's structures. Hold the lock. */
static void __arena_asserter(struct arena *arena)
{
struct btag *bt_i;
struct rb_node *rb_i;
size_t amt_free = 0, amt_alloc = 0, nr_allocs = 0;
for (int i = 0; i < ARENA_NR_FREE_LISTS; i++) {
BSD_LIST_FOREACH(bt_i, &arena->free_segs[i], misc_link) {
assert(bt_i->status == BTAG_FREE);
assert(bt_i->size >= (1ULL << i));
assert(bt_i->size < (1ULL << (i + 1)));
}
}
for (int i = 0; i < arena->hh.nr_hash_lists; i++) {
BSD_LIST_FOREACH(bt_i, &arena->alloc_hash[i], misc_link)
assert(bt_i->status == BTAG_ALLOC);
}
for (rb_i = rb_first(&arena->all_segs); rb_i; rb_i = rb_next(rb_i)) {
bt_i = container_of(rb_i, struct btag, all_link);
if (bt_i->status == BTAG_FREE)
amt_free += bt_i->size;
if (bt_i->status == BTAG_ALLOC) {
amt_alloc += bt_i->size;
nr_allocs++;
}
}
assert(arena->amt_total_segs == amt_free + amt_alloc);
assert(arena->amt_alloc_segs == amt_alloc);
assert(arena->hh.nr_items == nr_allocs);
}
size_t arena_amt_free(struct arena *arena)
{
return arena->amt_total_segs - arena->amt_alloc_segs;
}
size_t arena_amt_total(struct arena *arena)
{
return arena->amt_total_segs;
}
void add_importing_arena(struct arena *source, struct arena *importer)
{
qlock(&arenas_and_slabs_lock);
TAILQ_INSERT_TAIL(&source->__importing_arenas, importer, import_link);
qunlock(&arenas_and_slabs_lock);
}
void del_importing_arena(struct arena *source, struct arena *importer)
{
qlock(&arenas_and_slabs_lock);
TAILQ_REMOVE(&source->__importing_arenas, importer, import_link);
qunlock(&arenas_and_slabs_lock);
}
void add_importing_slab(struct arena *source, struct kmem_cache *importer)
{
qlock(&arenas_and_slabs_lock);
TAILQ_INSERT_TAIL(&source->__importing_slabs, importer, import_link);
qunlock(&arenas_and_slabs_lock);
}
void del_importing_slab(struct arena *source, struct kmem_cache *importer)
{
qlock(&arenas_and_slabs_lock);
TAILQ_REMOVE(&source->__importing_slabs, importer, import_link);
qunlock(&arenas_and_slabs_lock);
}
void *base_alloc(struct arena *guess, size_t size, int flags)
{
return arena_alloc(find_my_base(guess), size, flags);
}
void *base_zalloc(struct arena *guess, size_t size, int flags)
{
void *ret = base_alloc(guess, size, flags);
if (!ret)
return NULL;
memset(ret, 0, size);
return ret;
}
void base_free(struct arena *guess, void *addr, size_t size)
{
return arena_free(find_my_base(guess), addr, size);
}