Higher Order Blog

home

Assoc and Clojure's PersistentHashMap: part ii

16 Aug 2010

Some time ago I wrote introductory posts that gave high-level overviews of how Clojure's PersistentVector and PersistentHashMap work. In the PersistentHashMap post I promised that "In part 2 we look at how assoc works…" - it seems I never got around to that! A lot of interesting things have happened with data structures in JVM-land since then: Erjang uses Clojure's data structures, it looks like Scala is porting PersistentVector, upcoming Seph is using them too. My clj-ds project should help in this regard: I've extracted the data structures of Clojure from its compiler for use with JVM-based languages (providing some extra stuff like reverse and "positioned" iterators). There are already people interested in using this in Java land. Some people have asked for the "part ii" post, and my son just fell asleep, so ... :) Most of what is described in the PersistentHashMap post is still true, however there have been optimizations and simplifications which I explain here. Note: before reading this post, you should read the previous post on PersistentHashMap. First, some changes. Previously there were five implementations of the INode interface; this has changed and there are now only three implementations: ArrayNode, BitmapIndexedNode and HashCollisionNode. This means that: EmptyNode, LeafNode, FullNode are out, with ArrayNode replacing FullNode. An array node is an array where the entries are null or instances of INode, i.e., it stores other nodes but not any key-value pairs. An empty persistent hash map is simply a persistent hash map where the root node is null -- this removes the need for EmptyNode. Finally, leaf nodes used to store the actual entries stored in the map. Leaf nodes are out, and instead BitmapIndexedNodes directly embed the map entries in their arrays. The idea is the following: if, in the old implementation, a BitmapIndexedNode would store a leaf node at an index, then instead, it now embeds the key and value directly in its array. This array used to be of type INode[], storing only nodes, but is now a mixed object array for which the value can be one of: a map key, a map entry, null or an INode object. Old persistent hashmap structure The above drawing corresponds to the old structure. In the new structure all the white circles (the leaf nodes) are embedded directly in their parents. There is an invariant: for all the indices that are in the bitmap, the even indices store keys or null; the odd indices store values or INode objects. Key-value pairs are layed out in sequence, so that if index 2*i is a key then index 2*i + 1 is a value. If an index 2*i is null then 2*i+1 must be a non-null INode object. Note: if you don't remember how bitpos, bitmap and index work see the part i post. The strategy used in BitmapIndexedNode is that it can store up to 16 entries in the bit map: if the size grows above 16, it is converted to a FullNode. Assoc. The assoc method creates a persistent hash map which is like the current one, except that it additionally stores another map entry. As before assoc works using path copying: all that is changed in the new map is the path from the root node to the newly added map entry. Beware: the following drawing looks confusing: take the time and read the explanation. It is showing two persistent hash maps, where one is obtained by assoc'ing to the first. Again this is a modified version of one of Rich's slides. Path copying in the old structure The first map is rooted on the left (where the left-most box points to). The second is rooted where the right-most box points to. The nodes are grouped in three, each indicated by a colored circle that surrounds it. The new node is in the right-most, lower corner. Purple is the path in the old tree to where the new map entry would be; green is the new nodes that are created in the new hash map. The dashed lines indicate that the nodes in the new tree share children with nodes in the old tree. The red circles show all the nodes that are shared in the tree: note that this is most of the nodes. So how much work needs to be done to create the new tree? Suppose we are storing key k with value v. Using recursion, descend down the original tree as if looking for the key k. The key k isn't found. In the worst case this takes time: O(log32 n), in practice it we could stop at any level in the tree so 2-3 steps would be common (remember from the old post that the work done at each level is constant and fast). At the bottom we create a new BitmapIndexedNode and store the new map entry in it. If the bottom node in the old tree was a BitmapIndexedNode with less than 16 elements the new BitmapIndexedNode is a copy of the old one, except that the new map entry is added. This step takes constant time since the array to copy at the bitmap indexed node always has less than 32 elements (because we store at most 16 map entries as: key, value, key, value, ...). If the bottom node was an ArrayNode we simply copy the array node, and make the new BitmapIndexedNode a size-one child of this array node (still constant time). On the drawing above, we have now recursively descended the purple path in the left/old tree, and we have created the new node which is a bitmap indexed node. What remains is to establish the path to this new node, in the new tree: this path is almost identical to the "purple" path in the old tree: the only difference is that we have created a new bitmap indexed node. The parent of the new node must of course reference it, so that is copied and modified to get a reference to the new node. This means that we must also copy the grandparent of the new node, modifying it to reference the parent. And so on... This copying takes place on our way "up" through the recursion, i.e., after the recursive calls complete at each node level. Let's decompose the code. We only look at assoc for BitmapIndexedNode as it is the most interesting.
public INode assoc(int shift, int hash, Object key, Object val, Box addedLeaf){
		int bit = bitpos(hash, shift);
		int idx = index(bit);
		if((bitmap & bit) != 0) {
			//..
		} else {
			//
		}
}
This should be familiar from the previous post: we check if the index of the hash (at the current level) is in the bitmap. The else branch is the case where the index is not in the bitmap, corresponding to reaching the bottom of the path: here we must create a new node and path-copy as described above. In the "if"-branch we simply either call recursively if the index references an INode: if it references a key-value pair this is a "replace" and we do path copying here too. Again we look only at the else-branch as it is the most interesting.
public INode assoc(int shift, int hash, Object key, Object val, Box addedLeaf){
		int bit = bitpos(hash, shift);
		int idx = index(bit);
		if((bitmap & bit) != 0) {
			//..
		} else {
			int n = Integer.bitCount(bitmap);
			if(n >= 16) {
				//convert to ArrayNode
			} else {
				Object[] newArray = new Object[2*(n+1)];
				System.arraycopy(array, 0, newArray, 0, 2*idx);
				newArray[2*idx] = key;
				addedLeaf.val = addedLeaf; 
				newArray[2*idx+1] = val;
				System.arraycopy(array, 2*idx, newArray, 2*(idx+1), 2*(n-idx));
				return new BitmapIndexedNode(null, bitmap | bit, newArray);
			}
		}
}
Remember (from the previous post) that Integer.bitCount(bitmap) counts the number of children of this node. If we are storing less than 16 elements, we have room for one more: To create the new bitmap indexed node simply copy the object array, and modify it to store the new key-value pair (using the invariant mentioned above). (Ignore the "box" part, it is used to communicate to higher-levels in the recursion what happened.) Finally, if we have 16 elements stored already, convert to an ArrayNode:
public INode assoc(int shift, int hash, Object key, Object val, Box addedLeaf){
		int bit = bitpos(hash, shift);
		int idx = index(bit);
		if((bitmap & bit) != 0) {
			//..
		} else {
			int n = Integer.bitCount(bitmap);
			if(n >= 16) {
				INode[] nodes = new INode[32];
				int jdx = mask(hash, shift);
				nodes[jdx] = EMPTY.assoc(shift + 5, hash, key, val, addedLeaf);  
				int j = 0;
				for(int i = 0; i < 32; i++)
					if(((bitmap >>> i) & 1) != 0) {
						if (array[j] == null)
							nodes[i] = (INode) array[j+1];
						else
							nodes[i] = EMPTY.assoc(shift + 5,  Util.hash(array[j]), array[j], array[j+1], addedLeaf);
						j += 2;
					}
				return new ArrayNode(null, n + 1, nodes);
			} else {
				//we covered this...
			}
		}
}
The first block up to the 'for' loop prepares the array of INodes that will be the children array of the new ArrayNode. We create a new BitmapIndexedNode easily by calling assoc on an empty persistent hash map (which is an easy way of creating an one-entry BitmapIndexedNode). Note that the indexing strategy for ArrayNodes is different from BitmapIndexedNode: we don't need the bitmap :) We simply use the 5-bit block corresponding to the level (again see the older post). Finally, the for-loop copies all the other nodes that are stored in this bitmap indexed node into the new array node. This entails mapping between the bitmap index scheme and the "5-bit hash-block" scheme. For each possible index in the bitmap node, only do work if the index is present in the bit map: if(((bitmap >>> i) & 1) != 0). The variable j runs through the even indexes, and according to the invariant: null means that j+1 is an INode, and non-null means it is a value in a map entry. I am wondering if there is an optimization possible here? We are looping through all possible indices of the new ArrayNode, but we know that we only have to do something on the indices that correspond to an non-zero index in the bit map; and there can be only 15 of those... Would it be possible to iterate only those using some strategy? If I figure something out, I'll let you know :)