    
        
        
        
    
Hash-Based Containers
    This section describes hash-based containers. It is organized
as follows.
	
		
#overviewOverview  is an overview.
	
	
		
#hash_policiesHash Policies  discusses
	hash policies.
	
	
		
#resize_policiesResize Policies  discusses
	resize policies.
	
	
		
#policy_interactionPolicy Interaction  discusses
	interaction between policies.
	
Overview 
	Figure
#hash_cdHash-based containers 	shows the container-hierarchy; the hash-based containers are circled.
cc_hash_assoc_cntnr.htmlcc_hash_assoc_cntnr is a collision-chaining hash-based container;
gp_hash_assoc_cntnr.htmlgp_hash_assoc_cntnr is a (general) probing hash-based container.
no image
Hash-based containers.
	The collision-chaining hash-based container has the following declaration.
template<
	
typename Key,
	
typename Data,
	
class Hash_Fn = std::hash<Key>,
	
class Eq_Fn = std::equal_to<Key>,
	
class Comb_Hash_Fn =
		
direct_mask_range_hashing.htmldirect_mask_range_hashing <>
	
class Resize_Policy = default explained below.	
bool Store_Hash = false,
	
class Allocator =
		std::allocator<
char> >
class cc_hash_assoc_cntnr.htmlcc_hash_assoc_cntnr ;
	The parameters have the following meaning:
	
 Key is the key type.
	
	
 Data is the data-policy, and is explained in
ms_gen.html#ds_policyMapping-Semantics Genericity::Data Types as a Policy .
	
	
 Hash_Fn is a key hashing functor.	
 Eq_Fn is a key equivalence functor.	
 Comb_Hash_Fn is a range-hashing_functor; it
describes how to translate hash values into positions within the table.
This is described in
Hash Policies .	
	
 Resize_Policy describes how a container object should
change its internal size. This is described in
Resize Policies .	
 Store_Hash indicates whether the hash value should
be stored with each entry. This is described in
Policy Interaction .	
 Allocator is (surprisingly) an allocator type.
	
	The probing hash-based container has the following declaration.
template<
	
typename Key,
	
typename Data,
	
class Hash_Fn =
		std::hash<
			Key>,
	
class Eq_Fn =
		std::equal_to<
			Key>,
	
class Comb_Probe_Fn =
		
direct_mask_range_hashing.htmldirect_mask_range_hashing <>
	
class Probe_Fn = default explained below.	
class Resize_Policy = default explained below.	
bool Store_Hash = false,
	
class Allocator =
		std::allocator<
char> >
class gp_hash_assoc_cntnr.htmlgp_hash_assoc_cntnr ;
	The parameters are identical to those of the collision-chaining container, except
for the following.
	
 Comb_Probe_Fn describes how to transform a probe sequence into
a sequence of positions within the table.
	
	
 Probe_Fn describes a probe sequence policy.	Some of the default template values depend on the values of other parameters,
and are explained in
Policy Interaction .
Hash Policies
    This subsection describes hash policies. It is organized as follows:
    
 #general_termsGeneral Terms   describes
            some general terms.
    
    
 #range_hashing_fnsRange-Hashing Functions         describes range-hasing functions.
    
 #hash_policies_ranged_hash_policiesRanged-Hash Functions         describes ranged-hash functions. 
    
 #pb_assoc_impImplementation in pb_assoc             describes the implementation of these concepts in 
pb_assoc.
    
General Terms 
    There
are actually three functions involved in transforming a key into a hash-table's
position (see Figure
#hash_ranged_hash_range_hashing_fnsHash runctions, ranged-hash functions, and range-hashing functions
):
    
        A 
ranged-hash function, which maps keys into an interval of the
        non-negative integrals. This is the function actually required by the
        hash-table algorithm.
    
    
        A hash function, which maps keys into non-negative integral types. This is
        typically specified by the writer of the key class.
    
    
        A 
range-hashing function, which maps non-negative integral types into an
        interval of non-negative integral types.
    
no image
Hash runctions, ranged-hash functions, and range-hashing functions.
    Let 
U be a domain (e.g., the integers, or the strings of 3
    characters). A hash-table algorithm needs to map elements of 
U "uniformly"
    into the range 
[0,..., m - 1] (where m is a non-negative integral
    value, and is, in general, time varying). 
I.e., the algorithm needs a ranged-hash    function
    
f : U × Z+ → Z+ ,
    such that for any 
u in U,
    
0 ≤ f(u, m) ≤ m - 1 ,
    and which has "good uniformity" properties [
references.html#knuth98sortingknuth98sorting ].
    One common solution is to use the composition of the hash function
    
h : U → Z+ ,
    which maps elements of 
U into the non-negative integrals, and
    
g : Z+ × Z+ → Z+, 
    which maps a non-negative hash value, and a non-negative range upper-bound into
    a non-negative integral in the range between 0 (inclusive) and the range upper
    bound (exclusive), 
i.e., for any r in Z+,
    
0 ≤ g(r, m) ≤ m - 1 .
    The resulting ranged-hash function, is
    
f(u , m) = g(h(u), m)     
(1) .
    From the above, it is obvious that given 
g and h, f can
    always be composed (however the converse is not true). The STL's hash-based
    containers allow specifying a hash function, and use a hard-wired range-hashing function; the ranged-hash function is implicitly composed.
    The above describes the case where a key is to be mapped into a 
single
position
 within a hash table, e.g., in a collision-chaining table.
In other cases, a key is to be mapped into a 
sequence of poisitionswithin a table, 
e.g., in a probing table.
    Similar terms apply in this case: the table requires a 
ranged probefunction, mapping a key into a sequence of positions withing the table. This is
typically acheived by composing a 
hash function mapping the key
into a non-negative integral type, a 
probe function transforming the
hash value into a sequence of hash values, and a 
range-hashing function
transforming the sequence of hash values into a sequence of positions.
Range-Hashing Functions 
    Some common choices for range-hashing functions are the division,
    multiplication, and middle-square methods [
references.html#knuth98sortingknuth98sorting ],
    defined as
    
g(r, m) = r mod m (2) ,
    
g(r, m) = ⌈ u/v ( a r mod v ) ⌉ ,
    and
    
g(r, m) = ⌈ u/v ( r2 mod v ) ⌉ ,
respectively, for some positive integrals 
u and v (typically
powers of 2), and some 
a. Each of these range-hashing functions works
best for some different setting.
    The division method 
#division_method(2)  is a very common
    choice. However, even this single method can be implemented in two very
    different ways. It is possible to implement 
#division_method(2)     using the low level 
% (modulo) operation (for any m), or the low
    level 
& (bit-mask) operation (for the case where m is a power of
    2), 
i.e.,
    
g(r, m) = r % m (3) ,
    and
    
    
g(r, m) = r & m - 1, ( m = 2k    
        for some
 k) (4) ,
    respectively.
    The 
% (modulo) implementation #division_method_prime_mod(3)     has the advantage that for 
m a prime far from a power of 2, g(r, m)    is affected by all the bits of 
r (minimizing the chance of collision).
    It has the disadvantage of using the costly modulo operation. This method is
    hard-wired into SGI's implementation [
references.html#sgi_stlsgi_stl ].
    The 
& (bit-mask) implementation #division_method_bit_mask(4)     has the advantage of relying on the fast bitwise and operation. It has the
    disadvantage that for 
g(r, m) is affected only by the low order bits of r.
    This method is hard-wired into Dinkumware's implementation [
references.html#dinkumware_stldinkumware_stl ].
Ranged-Hash Functions 
    In some less frequent cases it is beneficial to allow the client to
directly specify a ranged-hash hash function. It is true, that the writer of
the ranged-hash function cannot rely on the values of 
m having specific
numerical properties suitable for hashing (in the sense used in [
references.html#knuth98sortingknuth98sorting ]),
since the values of 
m are determined by a resize policy with possibly
orthogonal considerations.
	There are two cases where a ranged-hash function can be superior. The firs is when using perfect hashing
[
references.html#knuth98sortingknuth98sorting ]; the second
is when the values of 
m can be used to estimate the
"general" number of distinct values required. This is described in the following.
    Let
    
s = [ s0,..., st - 1] 
    be a string of 
t characters, each of which is from domain S.
Consider the following ranged-hash function:
    
        
            f
1(s, m) =
            ∑ 
i =
            0
t   - 1 si ai mod m     
 (5) ,
    where 
a is some non-negative integral value. This is the standard
string-hashing function used in SGI's implementation (with 
a = 5) [ references.html#sgi_stlsgi_stl ].
Its advantage is that it takes into account all of the characters of the
string.
    Now assume that 
s is the string representation of a of a long DNA
sequence (and so 
S = {'A', 'C', 'G', 'T'}). In this case, scanning the
entire string might be prohibitively expensive. A possible alternative might be
to use only the first 
k characters of the string, where
    k 
|S| ≥ m ,
    
i.e., using the hash function
    
f2(s, m) = ∑ i = 0k
                - 1
 si ai mod m , (6)
    requiring scanning over only
    
k = log4( m ) 
    characters.
    Other more elaborate hash-functions might scan 
k characters starting at
    a random position (determined at each resize), or scanning 
k random
    positions (determined at each resize), 
i.e., using
    
f3(s, m) = ∑ i = r0r0 + k - 1        s
i ai mod m ,
    or
    
f4(s, m) = ∑ i = 0k - 1 sri        a
ri mod m ,
    respectively, for 
r0,..., rk-1 each in the
    (inclusive) range 
[0,...,t-1].
Implementation in pb_assoc 
cc_hash_assoc_cntnr.htmlcc_hash_assoc_cntnr  is
parameterized by 
Hash_Fn and Comb_Hash_Fn, a hash functor
and a combining hash functor, respectively.
	For any hash functor except 
null_hash_fn.htmlnull_hash_fn ,
one of the
concepts.html#concepts_null_policiesConcepts::Null Policy Classes ,
then 
Comb_Hash_Fn is considered a range-hashing functor.
The container will synthesize a ranged-hash functor from both. For example, Figure
#hash_range_hashing_seq_diagramInsert hash sequence diagram
shows an 
insert sequence diagram. The user inserts an element (point A),
the container transforms the key into a non-negative integral using the hash
functor (points B and C), and transforms the result into a position
using the combining functor (points D and E).
no image 
Insert hash sequence diagram.
    
pb_assoc contains the following range-hashing policies:
    
direct_mask_range_hashing.htmldirect_mask_range_hashing and
direct_mod_range_hashing.htmldirect_mod_range_hashing are range-hashing functions based on a bit-mask and a modulo operation, respectively.
    
    If 
Comb_Hash_Fn is instantiated by
null_hash_fn.htmlnull_hash_fn ,
and a combining-hash functor, the container treats
the combining hash functor as a ranged-hash function. For example, Figure
#hash_range_hashing_seq_diagram2Insert hash sequence diagram with a null combination policy
shows an 
insert sequence diagram. The user inserts an element (point A),
the container transforms the key into a position
using the combining functor (points B and C).
no image 
Insert hash sequence diagram with a null combination policy.
	Similarly,
gp_hash_assoc_cntnr.html gp_hash_assoc_cntnris parameterized by 
Hash_Fn, Probe_Fn, and
Comb_Probe_Fn. As before, if Probe_Fnand 
Comb_Probe_Fn are, respectively,
null_hash_fn.htmlnull_hash_fn  and
null_probe_fn.htmlnull_probe_fn ,
then 
Comb_Probe_Fn is a ranged-probe functor. Otherwise, Hash_Fnis a hash functor, 
Probe_Fn is a functor for offsets from a hash value,
and 
Comb_Probe_Fn transforms a probe sequence into a sequence of positions
within the table.
	
pb_assoc contains the following probe policies:
    
linear_probe_fn.htmllinear_probe_fn  is a linear probe
function.
    
	
quadratic_probe_fn.htmlquadratic_probe_fn  is
a quadratic probe function.
    
Resize Policies 
    This subsection describes resize policies. It is organized as follows:
    
 #generalGeneral Terms  describes general
        terms.
    
    
 #size_policiesSize Policies   describes size
    policies.
    
    
 #trigger_policiesTrigger Policies   describes trigger
    policies.
    
    #imp_pb_assocImplementation in pb_assoc          describes the implementation of these concepts in 
pb_assoc.
    
General Terms 
    Hash-tables, as opposed to trees, do not naturally grow or shrink. It
is necessary to specify policies to determine how and when a hash table should change
its size.
    In general, resize policies can be decomposed into (probably orthogonal)
policies:
    
 A size policy indicating how a hash table should
grow (
e.g., it should multiply by powers of 2).
    
    
 A trigger policy indicating when a hash table should
grow (
e.g., a load factor is exceeded).
    
Size Policies 
    Size policies determine how a hash table
changes size. These policies are simple, and there are relatively
few sensible options. An exponential-size policy (with the initial
size and growth factors both powers of 2) works well with a
mask-based range-hashing function (see the 
hash_policies.html#hash_policies_range_hashing_policiesRange-Hashing Policies  subsection),
and is the
hard-wired policy used by Dinkumware
[
references.html#dinkumware_stldinkumware_stl ]. A
prime-list based policy works well with a modulo-prime range
hashing function (see the 
hash_policies.html#hash_policies_range_hashing_policiesRange-Hashing Policies  subsection),
and is the
hard-wired policy used by SGI's implementation
[
references.html#sgi_stlsgi_stl ].
Trigger Policies 
    Trigger policies determine when a hash table changes size.
Following is a description of two polcies: 
load-checkpolicies, and a collision-check policies.
    Load-check policies are straightforward. The user
specifies two factors, 
αmin and αmax, and
the hash table maintains the invariant that
    
        
        α
min        ≤
        (number of stored elements) / (hash-table size)
        ≤
        α
max        
    
    (1)
    .
    Collision-check policies work in the opposite direction of
load-check policies. They focus on keeping the number of
collisions moderate and hoping
that the size of the table will not grow very large,
instead of keeping a moderate load-factor and
hoping that the number of collisions will be small.
A
maximal collision-check policy resizes when the shortest
probe-sequence grows too large.
    Consider Figure
#balls_and_binsBalls and bins .
    Let the size of the hash table be denoted by 
m, the
length of a probe sequence be denoted by 
k, and some load
factor be denoted by α. We would like to calculate the
minimal length of 
k, such that if there were α m elements
in the hash table, a probe sequence of length 
k would be found
with probability at most 
1/m.
no image 
Balls and bins.
    Denote the probability that a probe sequence of length 
kappears in bin 
i by pi, the length of the probe sequence
of bin 
i by li, and assume uniform distribution.
Then
    
    
        
p1        = 
(3)
    
    
    
    
    
P(l1 ≥ k)
    =
    
    
    
    
P(l1 ≥ α ( 1 + k / α - 1 )
    ≤ 
(a)
    
    
    
    e
    ^
    (
        -
        (
            α ( k / α - 1 )
2        )
        /2
    )
    
    ,
    where (a) follows from the Chernoff bound
[
references.html#motwani95randommotwani95random ].
To
calculate the probability that 
some bin contains a probe
sequence greater than 
k, we note that the li are
negatively-dependent
[
references.html#dubhashi98negdubhashi98neg ].
Let 
I(.)denote the indicator function. Then
    
    
        
P( existsi li ≥ k )
        = 
(3)
    
    
    
    
   
P    (
        ∑ 
i = 1m        
I(li ≥ k) ≥ 1
    )
    =
    
    
    
    
    
P    (
        ∑ 
i = 1m        
I        (
            l
i ≥ k
        )
        ≥
        m  p
1 ( 1 + 1 / (m p1) - 1 )
    )
    ≤ 
(a)
    
    
    
    e
    ^
    (
        (
            -
            m p
1            (
                1 / (m p
1) - 1
            )
            
2        )
        /
        2
    )
    ,
    
    
where (a) follows from the fact that the Chernoff bound can be
applied to negatively-dependent variables
[
references.html#dubhashi98negdubhashi98neg ].
Inserting 
#prob_of_p1(2)  into
#at_least_k_i_n_some_bin(3) , and equating with 1/m,
we obtain
    
    k
    ~
    √
    (
        2 α 
ln 2 m ln(m) )
    )
    
    .
Implementation in pb_assoc 
    The resize policies in the previous subsection are conceptually straightforward. The design
of hash-based containers' size-related interface is complicated by some factors.
    
 Most containers, i.e. lists, trees, and vectors, have a single "size" concept. There is no
distinction between the number of entries the container holds and the number of entries it is using. This,
of course, is not the case for hash-based containers. Moreover, even describing the
"actual" size of a hash-based container (as opposed to its logical size) is difficult - a probing-based container
holds entries to elements, even those it does not use, while a chaining-based container holds pointers to entries.
    
    
    The policies mentioned above operate in terms of invariants. 
E.g. a load-check trigger policy
maintains an invariant concerning the load factor of a container object. This is sometimes too rigid:
    
        
In some cases it is desirable to allow controlled override of an entire policy, e.g. by externally resizing a container object (or giving it an initial size, which is a special case of externally resizing the container).
        
        
            In other cases it is desirable to allow changing the specifics of a policy in runtime, 
e.g., changing the load factors of a load-check policy.
        
    
    
    
    Resize policies interact strongly with hash policies. Performance-wise, for example, it is undesirable to use an exponential size policy of powers of two with a modulo range-hashing function, and it is undesirable to use a prime size policy with a mask range-hashing function. In other cases, the effects are more dramatic. For example, using a quadratic probe function with an exponential size policy will probably cause cases where the container object has available entries which are never reached by the probe function. (
hash_policies.htmlHash Policies discusses the previous concepts.)
    
    Clearly, the more of these points an interface addresses, the greater its flexibility but the lower its encapsulation and uniformity between associative containers.
        This library attempts to address these types of problems by delegating all size-related functionality to
policy classes. Hash-based containers
are parameterized by a resize-policy class (among others), and derive publicly from
the resize-policy class
[
references.html#alexandrescu01modernalexandrescu01modern ]
 
E.g., a collision-chaining
hash table is defined as follows:
cc_ht_map<
  
class Key,
  
class Data,
  ...
  
class Resize_Policy
  ...> :
    
public Resize_Policy
    The containers themselves lack any functionality or public interface for manipulating sizes. A container
object merely forwards events to its resize policy object and queries it for needed actions.
    Figure
#insert_resize_sequence_diagram1Insert resize sequence diagram
shows a (possible) sequence diagram of an insert operation.
The user inserts an element; the hash table
notifies its resize policy that a search has started (point A);
in this case, a single collision is encountered - the table
notifies its resize policy of this (point B); the container
finally notifies its resize policy that the search has ended (point C);
it then queries its resize policy whether a resize is needed, and if so,
what is the new size (points D to G); following the resize, it notifies
the policy that a resize has completed (point H); finally, the element
is inserted, and the policy notified (point I).
no image 
Insert resize sequence diagram.
    This addresses, to some extent, the problems mentioned above:
    
        Different instantiations of range-hashing policies can be met with different instantiations of
    resize policies.
    
    
        Questions on size-related interface are avoided, since the containers have no size-related methods. Thus
    a container has no method for querying its actual size. It merely continuously forwards enough information to
    its resize policy to answer such queries; the designer of the resize policy can decide whether, or how, to design     the appropriate method. Also, a container has no methods for setting its size. It merely queries its
resize policy for an initial size, queries it on a new size (if the resize policy indicates a resize is needed), and
supports a 
protected virtual function for external resize.
    
    The library contains a single class for instantiating a resize policy,
pb_assoc contains
a standard resize policy,
hash_standard_resize_policy.htmlhash_standard_resize_policy  (the name is explained shortly).
In terms of interface, it is parameterized by a boolean constant indicating whether its public interface supports
queries of actual size and external resize operations (the inclusion and exclusion of these methods in the interface have obvious tradeoffs in terms of encapsulation and flexibility).
([
references.html#alexandrescu01modernalexandrescu01modern ] shows many techniques for
changing between alternative interfaces at compile time.)
As noted before,
    size and trigger policies are usually orthogonal.
hash_standard_resize_policy.htmlhash_standard_resize_policy is parameterized by size and trigger policies. For example,
a collision-chaining hash table
is typically be defined as follows:
cc_ht_map<
  key,
  data,
  ...
  hash_standard_resize_policy<
    some_trigger_policy,
    some_size_policy,
    ...> >
 The sole function of
hash_standard_resize_policy.htmlhash_standard_resize_policy  is to
act as a standard delegator
[
references.html#gamma95designpatternsgamma95designpatterns ] for these
policies.
    Figures
#insert_resize_sequence_diagram2Standard resize policy trigger sequence diagram     and
#insert_resize_sequence_diagram3Standard resize policy size sequence diagram     show sequence diagrams illustrating the interaction between
    the standard resize policy and its trigger and size policies, respectively.
no image 
Standard resize policy trigger sequence diagram.
no image 
Standard resize policy size sequence diagram.
    The library (currently) supports the following instantiations of size
and trigger policies:
    
        
hash_load_check_resize_trigger.htmlhash_load_check_resize_trigger  implements
    a load check trigger policy.
    
    
        
cc_hash_max_collision_check_resize_trigger.htmlcc_hash_max_collision_check_resize_trigger     implements a collision check trigger policy.
    
    
hash_exponential_size_policy.htmlhash_exponential_size_policy  implemens
an exponential-size policy (which should be used with mask range hashing).
    
    
hash_prime_size_policy.htmlhash_prime_size_policy  implementing
a size policy based on a sequence of primes
[
references.html#sgi_stlsgi_stl ] (which should be used with mod range hashing
    
    The trigger policies also support interfaces for changing their specifics which depend on compile time constants.
    Figure
#resize_policy_cdResize policy class diagram  gives an overall picture
of the resize-related classes.
Container (which stands for any of the hash-based containers) is parameterized
by 
Resize_Policy, from which it subclasses publicly
[
references.html#alexandrescu01modernalexandrescu01modern ].
This class is currently instantiated only by
hash_standard_resize_policy.htmlhash_standard_resize_policy .
hash_standard_resize_policy.htmlhash_standard_resize_policy  itself
is parameterized by 
Trigger_Policy and Size_Policy.
Currently, 
Trigger_Policy is instantiated by
hash_load_check_resize_trigger.htmlhash_load_check_resize_trigger ,
or
cc_hash_max_collision_check_resize_trigger.htmlcc_hash_max_collision_check_resize_trigger ; Size_Policy is instantiated by
hash_exponential_size_policy.htmlhash_exponential_size_policy ,
or
hash_prime_size_policy.htmlhash_prime_size_policy .
no image 
Resize policy class diagram.
Policy Interaction 
	Hash-tables are unfortunately susceptible to choice of policies. One
of the more complicated aspects of this is that poor combinations of good policies
can alter performance drastically. Following are some considerations.
Range-Hashing Policies and Resize Policies
Equivalence Functors, Storing Hash Values, and Hash Functions
cc_hash_assoc_cntnr.htmlcc_hash_assoc_cntnr and
gp_hash_assoc_cntnr.htmlgp_hash_assoc_cntnr are parameterized by an equivalenc functor and by a 
Store_Hashparameter. If the latter parameter is 
true, then
the container stores with each entry a hash value, and uses
this value in case of collisions to determine whether to apply a hash value.
This can lower the cost of collision for some types, but increase the cost of collisions for other types.
	If a ranged-hash function or ranged probe function is directly supplied, however,
then it makes no sense to store the hash value with each entry. 
pb_assoc's container will fail at compilation, by design, if this is attempted.
