Understanding Clojure’s PersistentVector implementation

February 1, 2009 – 10:17 pm

Update: See also the description of PersistentHashMap.

One of the unique features of Clojure is that the core data structures are persistent (immutable with efficient structural sharing). This includes data structures Vector and Map that are mutable in most other languages. To be useful, operations on persistent data structures need to have performance characteristics that are similar to their mutating counterparts; e.g., the cost of random access on a persistent vector (put/get) needs to be comparable to random access to a mutable vector. Clojure manages to achieve this. Here we focus on understanding the Clojure implementation of Vector: PersistentVector.

The basics. PersistentVector stores its elements in arrays, each array having at most size 32. One can think of the arrays as forming a wide balanced tree with each node having at most 32 children. Here is an example of what a size 322 = 1024 PersistentVector tree might look like.

PersistentVector example

The root of the tree is a pointer to a size-32 array (“blue”) which has 32 children (“red”). The 32 red arrays each have 32 references to actual 1024 objects stored in the PersistentVector.

Fetching elements (get). The key to understanding PersistentVector is to look at the binary representation of the numbers that index the vector. Java uses 32 bits for ints with signed two’s-complement representation. Examples:

Binary representation examples

One can think of this binary representation as divided into chunks of 5 bits, represented in colors here. These “chunks” are mapped to levels in the tree: for example, when fetching the element at 16, one looks at the the “blue” bits (00000) — this determines the index at the root child array (0); following that pointer, the “red” bits (10000) determines the index at the first level (16). When fetching index 49 the blue bits index to 1 and the red bits index to 17.

In this case the depth of the tree is 2: the cost of lookup is following two pointers and some bit computation. In general the cost of lookup is proportional to the height of the tree, i.e., O(log32 n) where n is the number of elements in the vector. In practice the height is no more than 7.

Fetching the n’th element in PersistentVector is done with the nth method.


public Object nth(int i){
 if(i >= 0 && i < cnt)
 {
  if(i >= tailoff())
   return tail[i & 0x01f];
 Object[] arr = root;
 for(int level = shift; level > 0; level -= 5)
   arr = (Object[]) arr[(i >>> level) & 0x01f];
 return arr[i & 0x01f];
 }
 throw new IndexOutOfBoundsException();
}

Ignoring the ‘tail’-part, the code uses an instance variable: shift which is determined by the height of the tree: shift = 5*(h+1) where h is the height of the tree (defined here as 0 for an empty tree). Another way to think of the shift variable is that it determines how many bits to right-shift the bit representation of an integer to move for a certain block (color) of interest to be the rightmost block. For example shifting 0000110001 (49) with 5 gives 00000000001 which is exactly the “blue” index.

In general shift determines the right-shift needed to access the root index, i.e., for the block of bits that index into the root array to become the right-most block. The tree is searched from the most significant blocks (root, and it’s children) towards the least significant blocks (the “high levels” or leaves). Notice that for any integer i, if we mask i‘s bit representation with 00...011111 we get the five right-most bits of i. The expression i & 0x1f does this as 0x1f is 011111.

Insertion (put).
Insertion is more tricky. Here we consider only the ‘cons’ method which appends an element to the end of the vector. The method which replaces the element at an index is easy once ‘cons’ and ‘nth’ is understood. Of course, since PersistentVector is immutable ‘cons’ really returns a new PersistentVector object which is equal to the old one except that it has the new element at the end. With cons, the tree grows only in a balanced way: elements are added at the deepest level at the rightmost array. If the PersistentVector contains a number of elements which is a power of 32 (e.g., 1024 = 322), the height of the tree is increased by one and there is now “room” for 32 times more elements. In our example from before, suppose we add an element at the end of the vector. We obtain this tree:

Consing persistent vector

Notice that the entire old PersistentVector is shared with the new PersistentVector! No copying is involved in this particular case.

The tail… Actually (to be precise) this happens not exactly at powers of 32 but slightly delayed: at the insertion of the 33rd element after a power of 32. E.g., 1024 + 33 = 1057. As you can see from the figure above, an entire size-32 array is inserted at the lower right-most position (instead of a size 1 array with the new element). This is an optimization. There is a special instance variable tail of size at most 32. The tail array is a sort of buffer where elements are appended until a full size-32 array can be placed in the tree. This makes “batch” insertion faster: suppose you creating a new PersistentVector from another collection type, e.g., a LinkedList. Since PersistentVector is immutable this is done by starting with the empty vector and then repeatedly ‘cons’ing elements from the LinkedList to obtain a new vector. E.g.,


static public PersistentVector create(List items){
	PersistentVector ret = EMPTY;
	for(Object item : items)
		ret = ret.cons(item);
	return ret;
}

This also explains the ‘tail’ code in the nth method above: If we a getting an element which happens to be located in the tail buffer at present (there can be from 0 to 32 elements with this property), we simply access it directly. This is done by looking at the 5 least significant bits of the index.

While the code for ‘cons’ is intricate, the idea is reasonably simple as illustrated by the above figure. I will end this post with the actual code. If anyone is interested, I can go into the details of the implementation.


public PersistentVector cons(Object val){
	if(tail.length < 32)
		{
		Object[] newTail = new Object[tail.length + 1];
		System.arraycopy(tail, 0, newTail, 0, tail.length);
		newTail[tail.length] = val;
		return new PersistentVector(meta(), cnt + 1, shift, root, newTail);
		}
	Box expansion = new Box(null);
	Object[] newroot = pushTail(shift - 5, root, tail, expansion);
	int newshift = shift;
	if(expansion.val != null)
		{
		newroot = new Object[]{newroot, expansion.val};
		newshift += 5;
		}
	return new PersistentVector(meta(), cnt + 1, newshift, newroot, new Object[]{val});
}

private Object[] pushTail(int level, Object[] arr, Object[] tailNode, Box expansion){
	Object newchild;
	if(level == 0)
		{
		newchild = tailNode;
		}
	else
		{
		newchild = pushTail(level - 5, (Object[]) arr[arr.length - 1], tailNode, expansion);
		if(expansion.val == null)
			{
			Object[] ret = arr.clone();
			ret[arr.length - 1] = newchild;
			return ret;
			}
		else
			newchild = expansion.val;
		}
	//expansion
	if(arr.length == 32)
		{
		expansion.val = new Object[]{newchild};
		return arr;
		}
	Object[] ret = new Object[arr.length + 1];
	System.arraycopy(arr, 0, ret, 0, arr.length);
	ret[arr.length] = newchild;
	expansion.val = null;
	return ret;
}

Happy New Year!

December 31, 2008 – 2:31 pm

Can’t remember the last time I laughed so much.

http://www.yankeepotroast.org/archives/2008/09/11_words_that_s.html

Got the Link from Ryan Tomayko’s (great) blog.

IE doesn’t understand HTML or HTTP

December 20, 2008 – 2:46 pm

The so-called “web”-browser Internet Explorer version 7 (and probably all versions below) doesn’t get HTML or HTTP. In a recent project for a client we’ve been building an advanced, 100% JavaScript client application that had to run in IE7 (as a minimal requirement). The application is rather complex and has to live as a component in every (darn) webpage/webapp on the client’s intranet, including all legacy apps; essentially, the applications include this component by inserting a script tag in their pages. The component’s most basic job is to help the user navigate to the existing multitude of intranet applications, supplying these applications with appropriate input (e.g., navigate to the customer mangagement app with a customer reference as input).

Now these legacy applications are of different ages and have been developed in various technologies, with varying quality and standards-awareness. Now these requirements brings a bunch of problems, and I won’t start complaining too much here. Instead I’ll just focus on one issue: character encodings! One of the aspects of this scenarios is that the applications are using different character-encodings, e.g., ISO-8859-1 and UTF-8. This gives the following requirements.

  • Cross page (cross encoding) navigation. For example, our client app often has to navigate from a UTF-8 encoded page to an ISO-8859-1 encoded page, transferring properly encoded parameters from the first page to the second.
  • Localization. The client app is available in various languages; together with the requirement that it must exist in UTF-8 and ISO-8859-1 encoded pages, this means that the actual JavaScript being sent on the wire must be available in different languages and encodings, based on the browsers settings.

You’d expect these things to be pretty basic stuff for an HTTP based app; something that would be easy in a version 7 browser.

Cross page (cross encoding) navigation

For various reasons irrelevant here, our client app navigates to an application by dynamically creating a form with method “GET”, and plugging in the URL and the parameters for the receiving application, i.e.,


<form method="GET" action="URL">
<input type="hidden" name="pname" value="pvalue" >
…
</form>

There is an implicit attribute called ‘accept-charset’ which according to HTML 4.01 satisfies:

“This attribute specifies the list of character encodings for input data that is accepted by the server processing this form. The value is a space- and/or comma-delimited list of charset values. The client must interpret this list as an exclusive-or list, i.e., the server is able to accept any single character encoding per entity received. The default value for this attribute is the reserved string “UNKNOWN”. User agents may interpret this value as the character encoding that was used to transmit the document containing this FORM element.”

The problem with IE7 is that it ignores the accept-charset property entirely on forms. In fact, it instead does the default behaviour for “UNKNOWN”, i.e., “the character encoding that was used to transmit the document containing this form element.” Actually, to be precise it doesn’t even do that either: It turns out the IE7 encodes the form according to the current value of document.charset. Fortunately it is possible to set document.charset using JavaScript, this means that it is possible to control the encoding of form input data by setting document.charset just before the form submits. And this works. Problem solved. Kind of …

Unfortunately, this hack turns out to trigger a somewhat strange and nasty bug: After form submission, IE navigates to the desired page and input data is encoded according to document.charset, which is just what we want. However, if you click the ‘back’-button on the next page, IE navigates back to the original page, but the document.charset property somehow persists which means that the page is not interpreted with the correct encoding. OK, so we should clean up after ourselves: One would think that restoring the document.charset to it’s original value after calling form.submit() would work: it doesn’t.

After some experimentation I found that the following works. To submit a form with input data encoded in a particular encoding:

  1. on document load, store the original page encoding
  2. on form submit change the value of document.charset to the desired encoding, e.g., “ISO-8859-1″
  3. on ‘beforeunload’ restore document.charset to it’s original value.

Script character encodings

Our JavaScript client app sometimes lives in a UTF-8 encoded page, sometimes in an ISO-8859-1 encoded page. The client app is loaded by including a <script> tag that points to a servlet that delivers the JavaScript code (customized for the requesting user and his preferred locale). By default when IE loads script, interprets the bytes it receives as characters in the encoding of the current page. However, it doesn’t tell the server which encoding that is. More specifically, our servlet receives a request for the JavaScript client with these HTTP headers:


Accept : */*
Referer : http://localhost:10045/wps/portal/sn
Accept-Language : da
UA-CPU : x86
Accept-Encoding : gzip, deflate
User-Agent : Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; .NET CLR 1.1.4322; .NET CLR 2.0.50727; InfoPath.1; MS-RTC LM 8; .NET CLR 3.0.4506.2152; .NET CLR 3.5.30729)
Connection : Keep-Alive
Cache-Control : no-cache

Furthermore, even if we explicitly tell IE which encoding we have chosen, it will still ignore that and use the page encoding. More specifically, even if we return this HTTP:


Content-Type: application/javascript; charset=UTF-8

IE ignores the Content-Type charset. In other words: if we choose as a default (when no Accept-Charset header is present) to deliver the script in UTF-8 encoded, it breaks for ISO-8859-1 pages, and vice versa. So what to do? IE apparently doesn’t speak HTTP.

It turns out that one should use a script tag with a charset property to tell IE how to interpret the script. Ironically this means that on UTF-8 encoded pages one would need to say <script charset=”ISO-8859-1″>. In other words, it is not the HTTP headers describing the resource that describes its encoding. This must be know in advance by all documents that link to the resource by embedding a charset property. Yet another reason that these are tough times to be RESTful…

More Clojure news and links

December 7, 2008 – 12:29 pm

Clojure Pipe Clojure blogger and user Bill Clementson has used Yahoo Pipes to create a Clojure mashup feed. This should turn out a good source of Clojure info. Bill’s intro is here, and the pipe is here.

Improved Emacs setup. Bill also posted his Emacs setup. Looking forward to implementing something like that as my setup.

Atoms: uncoordinated and synchronous Rich Hickey added a new reference type with concurrency semantics called atoms. Clojure now has four reference types, each with a unique concurrency semantics: vars (changes isolated within thread), refs (synchronous & coordinated), atoms (synchronous and uncoordinated) and agents (asynchronous & uncoordinated).

Mutual recursion. Using ‘trampolines’, Rich Hickey added support for mutual recursion.

Misc Clojure Podcast w. Stuart Halloway.

Annotated Clojure links – part 1

November 17, 2008 – 6:08 pm

RESTafarians will tell you that hypermedia is one of the essential ingredients of REST (and the web in particular), and that most people aren’t taking this seriously enough. In my last posting I claimed that

I will write a number of blog postings about my experiences with Clojure. [...] However, to my knowledge, there is less information about how to actually program Clojure.

Well, my knowledge was wrong. Surprise, surprise — I’m not the first person to discover Clojure, be excited and decide to blog about it…

So instead of showing you Clojure fibonacci, (prn “hello world”) and writing imprecisely about functions and macros, I’ll hyper-point to to high-quality information on ‘getting started with Clojure’. If nothing else, my links will increase the PageRank of those blogs ;-) So here is my RESTful description of one path to Clojure.

Readers please add your high-quality Clojure links and provide a reasonable description in the blog comments.

Part I – getting started

Clojure theory


Get excited: Clojure for Java Programmers
Get excited by watching the Clojure videos. For the basics, see Clojure for Java Programmers 1 and 2, with accompanying slides. This is great quality information targeted directly at Java programmers. You should have a strong background in Java and be knowledgeable of Java concurrency (I personally recommend Java Concurrency in Practice by Brian Goetz et al.).

Re-iterate your understanding, and go deeper: comparison with other Lisps
If you are ready to dig deeper into this: Clojure for Lisp programmers 1 and 2, with accompanying slides. There is even a transcript of the audience questions for part 1.

The Clojure website
There is a lot of information on the website. To me, this information made the most sense after having seen the videos. I particularly like the essay: Values and Change which explains Clojure’s functional approach to mutability and state. Furthermore, sometimes when watching the videos you’ll want to clarify something. A good way to do this is reading the API docs on the site (which is quite precise) and then open up a REPL (see next step) and try it out.

A Clojure environment

In order to get Clojure up and running you need to setup a working environment containing Java 5 or 6, Clojure and some kind of IDE or text editor. I’ve had success with Aquamacs (which is an Emacs for Mac).


Emacs & Unix: Clojure, Emacs and Slime/Swank on Ubuntu 8.10
Slime/Swank is a development environment for Emacs. Adapting this guide to Mac and Aquamacs was easy. A couple of useful Emacs commands:

  • C-x C-e (Ctrl+x then Ctrl+e or “hold Ctrl then press x then e the release”). Send the current expression (to the left of cursor) to Slime.
  • C-c C-d C-d. Show docs for the current symbol.
  • C-c RET. Macro-expand once.
  • M-.. Goto (edit) definition of current symbol; and.
  • M-,. return from Goto (edit) definition of current symbol.

Others: Eclipse IDE, Netbeans plugin, Vim anyone?
I only have experience with the Emacs setup – it works fine so far.

If you are annoyed with Aquamacs keeping opening windows on you:
(custom-set-variables
'(one-buffer-one-frame-mode nil nil (aquamacs-frame-setup)))

Bill Clementson is currently blogging Clojure. He has a setup posting where he is able to run JSwat Java debugger with Clojure. More detailed information about debugging with JSwat here.

I plan to go into depth on various parts of Clojure in the next couple of postings. It will be the same style of annotated links, possibly with discussion if I feel I have anything interesting to say.

The need for Clojure

October 18, 2008 – 7:14 pm

It is a need usually provoked after experiencing an emotional conclusion to a difficult life event, such as the breakdown of a close interpersonal relationship [or an unreliable concurrent program, editor] or the death of loved one [or an old familiar programming language, editor]… A person with a high need for closure prefers order and predictability and is decisive [...]

[selected, edited parts from Wikipedia on the need for closure ;-) ]

I’ve just decided to put all my pet projects and programming languages aside to dedicate myself to learning one single language, with the ambition to eventually master it. That language is Clojure.

To force myself to reflect on my learning process and to help the community to the best of my ability, I will write a number of blog postings about my experiences with Clojure. There is already a lot of truly outstanding quality information about the theory and ideas behind Clojure around the net (mostly this excellent quality is due to fact the the creator of Clojure, Rich Hickey, is a great speaker). However, to my knowledge, there is less information about how to actually program Clojure. So this series of postings will explore programming Clojure from a beginners perspective — hopefully others may learn from my mistakes and insights.

In other words: This series is about Beginning Clojure in Practice.

Update: Sat. Oct. 18, 2008: A few points about Clojure, and links to better information.

What is Clojure?
Clojure is a relatively young language (recently, it’s 1-year birthday was celebrated). As I am beginning clojure, I am certainly not the best source of information; however, I would like to share my personal top 5 reasons why Clojure stands out in the ocean of programming languages that seems to be growing only faster by the day. If you want to know more about clojure, I strongly encourage reading the entire homepage and watching the videos on blip.tv.

Why Clojure?

  • Clojure is a LISP (perhaps it is more precise to say that it is strongly inspired by LISP, in some sense it is more general). Most importantly this means the full power of macros.
  • It is compiled to the JVM, and designed to embrace it. This means making use of a powerful infrastructure, e.g., garbage collection and a powerful, mature, optimizing JIT. Rich Hickey makes a good point on distinguishing languages that ‘live on’ the JVM (like JRuby or Jython) and languages that are designed for the JVM like (Groovy and Scala). Since JRuby and Jython are ‘ports’ of a language specified elsewhere, they have to fit on the JVM for better or worse (for example no call/cc in JRuby). In contrast, Clojure is designed for the JVM: Rich calls this ‘embracing it.’
  • Java interoperability. Clojure is designed to inter-operate with Java; there is language-level support [syntax, even ;-) ] for interacting with Java (no wrappers). This means that the enormous number of already existing Java libraries can be used from Clojure. This solves the ‘library problem’ (see below).
  • Clojure has an opinion about how to do in-process concurrency (my favorite point): First, immutability is good; it is the default, and all Clojure data structures are immutable. Yet, Clojure accepts that sometimes mutability is needed and even desired, and support for a mutability is provided in the form of Refs and Agents. These are programming language constructs that allow for mutability, but with a concurrency semantics. Refs come with a software transactional memory system, and agents allow an in-process asynchronous programming model (similar to, but not equal Erlang’s actors). Go and read Rich’s short essay: On state and identity; once you get a feel for the language, read it again ;-) . Persistent data structures, are an amazing and important part of Clojure’s concurrency model. This is what makes immutability viable in Clojure. An ‘update’ operation on a persistent data structure doesn’t actually change the data structure (it is immutable), but returns a new data structure (also immutable) representing the updated structure. Now, one might think this entails copying the entire structure; amazingly via structural sharing this is not the case. Clojures data structures (e.g., a hash map) are immutable and persistent yet retain the performance guarantees of their mutable counter-parts (Rich claims performance from faster-than-Java to up-to at most 4x Java speed). More on this in a later posting.
  • Dynamic and mostly functional; this speaks for itself.

I am really looking forward to exploring this language in depth and in practice, and to sharing what I find with anyone actually reading this blog ;-)

Finally, just as a thought: Eric Kidd wrote about choosing between Ruby and LISP.

So if LISP is still more powerful than Ruby, why not use LISP? The typical objections to programming in LISP are:

  • There aren’t enough libraries.
  • We can’t hire LISP programmers.
  • LISP has gone nowhere in the past 20 years.

My reply to this would be: Well, with Clojure, LISP is definitely going somewhere; only the second objection applies (and perhaps with time this will no longer be the case, who knows).

Jeene update: Performance, features and backlog

September 21, 2008 – 6:19 pm

Don’t know what Jeene is? It will make your JavaScript code run faster ;-) read the intro first,
then come back to this!

Performance examples

I was curious about the performance gains of functions specialized via Jeene, so I ran a simple performance example in Firefox 2 and 3; Safari 3.1.2; Opera 9.52; SquirrelFish Extreme (the new JIT); IE 6, 7 ; Google’s Chrome and finally, TraceMonkey. All with great results! You can run the simple example here.

The test executes the string concatenation example from the previous post. The actual example is not that interesting; what is interesting is that Jeene works in all current popular JavaScript implementations and that it gives good performance boosts also in the new JITs: V8, TraceMonkey and SquirrelFish Extreme. The figures include the time it takes to specialize the function, so they are actual performance gains. (Note that the Windows and Mac tests were on different hardware).

  • Firefox 2: (Windows Vista)
    1. Without specialization: 1430ms;
    2. With specialization: 635ms;
    3. Specialization reduces time by approx. 56%.
  • Firefox 3: (Mac)
    1. Without specialization: 386ms;
    2. With specialization: 230ms;
    3. Specialization reduces time by approx. 40%.
  • Opera 9.52: (Mac)
    1. Without specialization: 502ms;
    2. With specialization: 251ms;
    3. Specialization reduces time by approx. 50%.
  • Firefox 3.1 – TraceMonkey (Mac)
    1. Without specialization: 302;
    2. With specialization: 152ms;
    3. Specialization reduces time by approx. 50%.
  • Chrome – V8: (Windows Vista)
    1. Without specialization: 227ms;
    2. With specialization: 83ms;
    3. Specialization reduces time by approx. 63%.
  • IE6/7 (approx. the same numbers) (Windows Vista)
    1. Without specialization: 2672ms;
    2. With specialization: 1485ms;
    3. Specialization reduces time by approx. 44%.
  • Safari 3.1.2 (Mac)
    1. Without specialization: 378ms;
    2. With specialization: 157ms;
    3. Specialization reduces time by approx. 58%.
  • WebKit nightly — SquirrelFish Extreme (Mac)
    1. Without specialization: 247ms;
    2. With specialization: 94ms;
    3. Specialization reduces time by approx. 61%.

What can Jeene do right now?

Specialize JavaScript functions written in simplified JavaScript which contain
the following constructs (note as of now you can only specialize with primitive values,
e.g., strings and numbers; this will be fixed soon):

  • variables which occur in the function’s formal parameter
  • literals and constants (“abc”, 42, true, null)
  • simple assignment ‘=’
  • and, or, not logical operators ‘&&’, ‘||’, ‘!’
  • plus and minus binary operators (‘+’ and ‘-’); by default it assumes ‘+’
    to be associative, which is not sound in general but is for common case uses (ask me for details)
  • return statements

What Jeene can’t do right now (i.e., what I am working on).

Some of these are easy, others are harder and need careful analysis.

  • global variables (in general variables in the function’s closure)
  • ‘var’ statements (i.e., local variable declaration and initialization)
  • incrementing/decrementing assignments (‘+=’ and ‘-=’)
  • ternary if, i.e., “exp ? v1 : v2″ operator
  • comparison “===”, “!==”, “<”, “<=”, “>”, “>=”
  • multiplication, division (‘*’,'/’)
  • static/dynamic property lookup (‘.’ and ‘[‘)
  • function calls (“fn()”)
  • unary minus (‘-1′)
  • typeof operator (‘typeof x’)
  • function operator (‘function(){}’)
  • expression statements, array and object literals
  • expression statements, array and object literals
  • ‘if’, ‘break’ statements
  • ‘while’ and ‘for’ statements

New in latest svn commit

Jeene can now handle simple assignments of the form (x = exp) where x is a variable and exp is an expression. Further, formal parameters which are specialized disappear from the formal parameters in the specialized function, but now appear as local variables of that function (so that one can assign to them).

Consider this example:

Function.prototype.specialize = net.higherorder.jeene.Jeene.make();
source_fun = function(a,b,c) {
    c = a;
    return b+c;
};
var env = {a:3, b:2};
res = source_fun.specialize(env);

Result: 

(function (c) {var a, b;return 5;})

Note that the assignment c=a; can be eliminated and the static value of a is propagated to c so that the expression returned is static. Neat ;-)

Check back for more later…

Jeene: Status

September 16, 2008 – 3:29 pm

I am happy that Jeene was mentioned on Ajaxian today. This is good ;-) it means that the project is more likely to attract potential contributors which is nice since there is quite some work yet to be done.

I thought, I would provide a status update on Jeene and a “backlog” of work that needs to be done.

What is working:

  • Partial evaluation of simple operations: ‘+’, ‘-’, ‘&&’ and ‘||’
  • code generation for ‘function’ expressions and ‘return’ statements

This means that we can run a number of simple examples. For a quick and dirty performance test checkout: performance_simple1

This is all fine, the basic design is in place. Now it is just a matter of “getting it done” to implement other operators. However, there are some constructs that are more difficult:
What I will be working on next

  • Assignment. Since an assignment changes the static/dynamic status of variables, this is more advanced than e.g. ‘+’ which doesn’t have side effects. For example if ‘x’ is a static variable and ‘y’ is dynamic, then ‘y = x;’ should cause ‘y’ to be static from that program point.
  • Handling of function calls. For now, I will simply replace a function call with a function call where we replace static expressions in the arguments with their values. Later I will consider ‘unrolling’/'inlining’ of function calls.

As always, I welcome anyone interested in contributing. I will run a rule: 1-patch approved = commit rights.

Jeene: An automatic partial evaluator for JavaScript

September 14, 2008 – 9:42 pm

The purpose of this posting is to show that is is possible to create an online partial evaluator for JavaScript, written also in JavaScript. As far as I know, this has been not been done before. This post is the first in a series describing the inner workings of Jeene.

A what? A partial evaluator (or program specializer) is a program which takes two inputs: another program and an environment mapping variables to values; it outputs a specialized (i.e., more efficient) version of the input program with respect to the environment. One can think of a partial evaluator as a mix between an interpreter and a compiler: it interprets the static parts of the program and emits code for the dynamic parts.

For a simple example of code specialization, consider the following function for creating a string corresponding to an HTML tag:

var mk_tag = function(tag_name,clazz,contents) {
   return '<'+tag_name+' class="'+clazz+'">'+contents+'</'+tag_name+'>';
};

If we are only interested in making ‘div’ tags with a class of ‘green’ then a more efficient version would be:

var mk_div_green = function(contents) {
   return '<div class="green">'+contents+'</div>';
};

since it would require fewer string concatenations per call. We can think of ‘mk_div_green’ as a version of ‘mk_tag’ which is specialized for writing ‘div’ tags with a class of ‘green’. A partial evaluator for JavaScript could automatically derive the ‘mk_div_green’ function from the ‘mk_tag’ function.

This is exactly what we can do with the evaluator in this posting.

//This code actually works ;-) 
Function.prototype.specialize = net.higherorder.jeene.Jeene.make();

var mk_tag = function(tag,clz,cont) {
   return "<"+tag+" class='"+clz+"'>"+cont+"</"+tag+">";
};

var mk_div_green = mk_tag.specialize({tag:'div', clz: 'green'});

mk_div_green("Pratt rocks!");
//result: <div class='green'>Pratt rocks!</div>

mk_div_green.toSource ? mk_div_green.toSource() : mk_div_green.toString();
//result:
//(function (cont) {return ("<div class='green'>" + cont) + "</div>";})

This last line, shows that the output function is much more efficient than what is created by general JavaScript curriers which have been seen before: these functions merely wait evaluating the function until all parameters are supplied; instead, a partial evaluator will create specialized function taking advantage of the information given.

The first goal is for the partial evaluator to process any function written in an extension of “simplified JavaScript” (a subset of full JavaScript corresponding to what Crockford calls the “good parts“). This partial evaluator will have a number of useful features: it

  • can easily be extended to full JavaScript;
  • is written in simplified JavaScript (think Futamura projections);
  • is a fast extension of Douglas Crockford’s Pratt parser;
  • because functions implement the toString method, it can be run on dynamically generated functions (e.g., a specialized function can be further specialized);
  • can be embedded in any full JavaScript program as long as it is only used to specialize functions which are syntactically in simplified JavaScript;

I have started a new open source project, Jeene, which aims to create an efficient partial evaluator for full JavaScript that works in any ECMAScript 3 compliant implementation (e.g., all major browsers, Rhino, TraceMonkey, V8 etc). Right now the project is at a very early stage; a proof of concept, for example it cannot specialize itself. Let me know if you are interested in contributing: I will use a 1 patch threshold like Rubinius: if you submit one patch which is accepted, you get commit rights.

Stay tuned here for more information about how Jeene is designed and implemented.

Probability Theory: The Logic of Science

August 18, 2008 – 10:09 pm

This is a bit of an unusual posting. It was triggered because I am frustrated, having just written my third review for the journal TAAS, which I accidentally agreed to do reviews for at some point during my PhD studies. I still review papers on computational trust models which was the topic of my PhD dissertation. I have recommended ‘reject’ for all of these papers, not because these papers are any worse than most of what is being and has been published in that field (or what I’ve published myself); but because I accidentally started reading the book: Probability Theory: The Logic of Science by E.T. Jaynes, Cambridge University Press (June 9, 2003). This book radically changed the way I understand probability theory and its applications (which includes computational trust models). Once you read just the basic parts of his book (which is all I’ve read yet), you realize that much of the work being done in this area is waste; I will claim that it could all be done much simpler and with superior results if based on Jaynes formulation of probability theory (which according to Jaynes goes back to Jeffreys and Laplace).

During my PhD studies I was working on something called experience-based trust management. Fundamentally, this topic is about programs that reason about the behaviour of agents (other dynamic programs) in large open distributed systems (think Internet). Such reasoning is based on information, usually in the form of past interactions with agents or in the form of statements made by other agents about such interaction (i.e., reputation information).

After the first two years we had been working hard on creating a formal model for “computational trust” encompassing uncertainty, based on somewhat hardcore mathematical theory of complete lattices and monotonic functions, complete partial orders and continuous functions (domain theory) and even category theory. It was abstract, it was fun, it was warm, nice and cuddly; it turned out, however, to be essentially useless… Fortunately, after approximately two years I somehow realized this and started working on the same problems, but with a less abstract approach. At some point later I somehow came by the book of Jaynes. Now, I only wish I had read that book in 2004…

Anyway, I don’t know how you were taught probability theory (or worse, statistics) but the courses I took had abstract definitions (corresponds to what is on wikipedia) that seemed magical to a first year computer science student, and abstract but at least general later when I encountered measure theory. While this is all very interesting if one is interested in abstract mathematics, when one reads Jaynes account of probability one cannot avoid to think that Jaynes approach is overwhelmingly appealing: at first sight it intuitive and much is simpler; and once one gets into the later chapters, one learns that it is also more powerful, and in fact, the rules of probability theory is proven to be the unique set of rules that satisfy an absolutely reasonable set of qualitative desiderata (this is known as Cox’s theorem which is on wikipedia, but Jaynes’ exposition is much better in my opinion, a version is here (from page 13)).

I won’t even try to give an account of the book here, but only recommend it to anyone even remotely interested in scientific reasoning and logic, but also applied mathematics and computer science. There are some places that are mathematically challenging for your typical CS grad, but it still has value even if the advanced techniques and proofs are skipped. Read it before you submit any paper on trust ;-)