Understanding Clojure's PersistentVector implementation

Update: See also the description of PersistentHashMap. One of the unique features of Clojure is that the core data structures are persistent (immutable with efficient structural sharing). This includes data structures Vector and Map that are mutable in most other languages. To be useful, operations on persistent data structures need to have performance characteristics that are similar to their mutating counterparts; e.g., the cost of random access on a persistent vector (put/get) needs to be comparable to random access to a mutable vector. Clojure manages to achieve this. Here we focus on understanding the Clojure implementation of Vector: PersistentVector. The basics. PersistentVector stores its elements in arrays, each array having at most size 32. One can think of the arrays as forming a wide balanced tree with each node having at most 32 children. Here is an example of what a size 32² = 1024 PersistentVector tree might look like. PersistentVector example

The root of the tree is a pointer to a size-32 array ("blue") which has 32 children ("red"). The 32 red arrays each have 32 references to actual 1024 objects stored in the PersistentVector. Fetching elements (get). The key to understanding PersistentVector is to look at the binary representation of the numbers that index the vector. Java uses 32 bits for ints with signed two's-complement representation. Examples: Binary representation examples

One can think of this binary representation as divided into chunks of 5 bits, represented in colors here. These "chunks" are mapped to levels in the tree: for example, when fetching the element at 16, one looks at the the "blue" bits (00000) -- this determines the index at the root child array (0); following that pointer, the "red" bits (10000) determines the index at the first level (16). When fetching index 49 the blue bits index to 1 and the red bits index to 17. In this case the depth of the tree is 2: the cost of lookup is following two pointers and some bit computation. In general the cost of lookup is proportional to the height of the tree, i.e., O(log₃₂ n) where n is the number of elements in the vector. In practice the height is no more than 7. Fetching the n'th element in PersistentVector is done with the nth method.


public Object nth(int i){
 if(i >= 0 && i < cnt)
 {
  if(i >= tailoff())
   return tail[i & 0x01f];
 Object[] arr = root;
 for(int level = shift; level > 0; level -= 5)
   arr = (Object[]) arr[(i >>> level) & 0x01f];
 return arr[i & 0x01f];
 }
 throw new IndexOutOfBoundsException();
}

Ignoring the 'tail'-part, the code uses an instance variable: shift which is determined by the height of the tree: shift = 5*(h+1) where h is the height of the tree (defined here as 0 for an empty tree). Another way to think of the shift variable is that it determines how many bits to right-shift the bit representation of an integer to move for a certain block (color) of interest to be the rightmost block. For example shifting 0000110001 (49) with 5 gives 00000000001 which is exactly the "blue" index. In general shift determines the right-shift needed to access the root index, i.e., for the block of bits that index into the root array to become the right-most block. The tree is searched from the most significant blocks (root, and it's children) towards the least significant blocks (the "high levels" or leaves). Notice that for any integer i, if we mask i's bit representation with 00...011111 we get the five right-most bits of i. The expression i & 0x1f does this as 0x1f is 011111. Insertion (put). Insertion is more tricky. Here we consider only the 'cons' method which appends an element to the end of the vector. The method which replaces the element at an index is easy once 'cons' and 'nth' is understood. Of course, since PersistentVector is immutable 'cons' really returns a new PersistentVector object which is equal to the old one except that it has the new element at the end. With cons, the tree grows only in a balanced way: elements are added at the deepest level at the rightmost array. If the PersistentVector contains a number of elements which is a power of 32 (e.g., 1024 = 32²), the height of the tree is increased by one and there is now "room" for 32 times more elements. In our example from before, suppose we add an element at the end of the vector. We obtain this tree: Consing persistent vector

Notice that the entire old PersistentVector is shared with the new PersistentVector! No copying is involved in this particular case. The tail... Actually (to be precise) this happens not exactly at powers of 32 but slightly delayed: at the insertion of the 33rd element after a power of 32. E.g., 1024 + 33 = 1057. As you can see from the figure above, an entire size-32 array is inserted at the lower right-most position (instead of a size 1 array with the new element). This is an optimization. There is a special instance variable tail of size at most 32. The tail array is a sort of buffer where elements are appended until a full size-32 array can be placed in the tree. This makes "batch" insertion faster: suppose you creating a new PersistentVector from another collection type, e.g., a LinkedList. Since PersistentVector is immutable this is done by starting with the empty vector and then repeatedly 'cons'ing elements from the LinkedList to obtain a new vector. E.g.,


static public PersistentVector create(List items){
	PersistentVector ret = EMPTY;
	for(Object item : items)
		ret = ret.cons(item);
	return ret;
}

This also explains the 'tail' code in the nth method above: If we a getting an element which happens to be located in the tail buffer at present (there can be from 0 to 32 elements with this property), we simply access it directly. This is done by looking at the 5 least significant bits of the index. While the code for 'cons' is intricate, the idea is reasonably simple as illustrated by the above figure. I will end this post with the actual code. If anyone is interested, I can go into the details of the implementation.


public PersistentVector cons(Object val){
	if(tail.length < 32)
		{
		Object[] newTail = new Object[tail.length + 1];
		System.arraycopy(tail, 0, newTail, 0, tail.length);
		newTail[tail.length] = val;
		return new PersistentVector(meta(), cnt + 1, shift, root, newTail);
		}
	Box expansion = new Box(null);
	Object[] newroot = pushTail(shift - 5, root, tail, expansion);
	int newshift = shift;
	if(expansion.val != null)
		{
		newroot = new Object[]{newroot, expansion.val};
		newshift += 5;
		}
	return new PersistentVector(meta(), cnt + 1, newshift, newroot, new Object[]{val});
}


private Object[] pushTail(int level, Object[] arr, Object[] tailNode, Box expansion){
	Object newchild;
	if(level == 0)
		{
		newchild = tailNode;
		}
	else
		{
		newchild = pushTail(level - 5, (Object[]) arr[arr.length - 1], tailNode, expansion);
		if(expansion.val == null)
			{
			Object[] ret = arr.clone();
			ret[arr.length - 1] = newchild;
			return ret;
			}
		else
			newchild = expansion.val;
		}
	//expansion
	if(arr.length == 32)
		{
		expansion.val = new Object[]{newchild};
		return arr;
		}
	Object[] ret = new Object[arr.length + 1];
	System.arraycopy(arr, 0, ret, 0, arr.length);
	ret[arr.length] = newchild;
	expansion.val = null;
	return ret;
}

Higher Order Blog

Understanding Clojure's PersistentVector implementation