|
Use the open hashing functions for global contexts instead of List in
var.c.
All the preliminary work to trim down local contexts means that we don't
suffer from the heavy initialization work that a hash table entails.
There is some make kludgery to:
- build the hashing functions as a library,
- recreate hashconsts.h, even if make depend was not invoked.
One point of the hashing scheme written was to separate the computation
of the hash function, and the hash lookup itself. This is very convenient
for make, because of those pesky special variables. hashconsts.h is there
to pre-hash the correct values, which replaces a few expensive string
comparisons with quick hash value comparisons, followed by one expensive
string comparison. The modulus MAGICSLOTS chosen in the Makefile is
ad-hoc: it is small enough to write a small switch without collision,
and will need changing if the hash function changes...
The function quick_lookup is the most important:
it either returns an index, for a local variable, or it does compute a
hashing value, and returns -1.
Another somewhat controversial decision is the use of string intervals.
This avoids either copying a string, or twiddling with a byte for cases
such as ${VAR}.
Finally, the variable name is stored within the variable itself. Since
a given variable name never changes, this makes sense. All that was needed
was a hash library with support for this. Note that the hashing table
holds only a variable pointer AND the corresponding hashing value, WITHOUT
a modulo hashtablesize. Two reasons:
- hash resizes can be done faster, without having to recompute hashing values.
- locality of access. The hash table fits into memory without problem. Once
a candidate slot is found, we check the complete hashing value. Probability
of a collision is very small (32 bits...). So bringing up the whole
variable in memory at once is good: the name will almost always match, in
which case we want the variable value as well, so it makes sense to put
them together.
The ohash functions implement open hashing, as described in Knuth, but with
a variable table size. Choosing powers of 2 sizes does not yield more
collisions, but it makes the hashing scheme much simpler. The thresholds at
which to expand/shrink the tables seem to work well in practice. The
default sizes were chosen such that the tables hardly ever shrink or expand
anyways (though I've tried with smaller/larger sizes to verify that the
shrinking/expanding worked correctly): larger Makefiles hold roughly
500/600 variables, which fits without trouble into a 1024-sized variable.
Disregard #ifdef STATS_HASH, this is some internal scaffolding I'm using
to measure make performance.
The only known issue with open-hashing is that deletions cannot create
empty slots, but do leave slots marked as `occupied once' so that lookup
works. We use a well-known optimization which records those pseudo-empty
slots while looking up values. If the value is not found, the pseudo-empty
slot is returned to be filled. If the value is found, it is swapped with
the pseudo-empty slot. This is an improvement in both cases, since this
shortens the length of lookup chains, eventually pushing the pseudo-empty
slots to the end.
Reviewed by millert@ and miod@
|