Probability Theory: The Logic of Science
August 18, 2008 – 10:09 pm
This is a bit of an unusual posting. It was triggered because I am frustrated, having just written my third review for the journal TAAS, which I accidentally agreed to do reviews for at some point during my PhD studies. I still review papers on computational trust models, which was the topic of my PhD dissertation. I have recommended ‘reject’ for all of these papers, not because they are any worse than most of what is being and has been published in that field (or what I’ve published myself), but because I accidentally started reading the book Probability Theory: The Logic of Science by E.T. Jaynes, Cambridge University Press (June 9, 2003). This book radically changed the way I understand probability theory and its applications (which include computational trust models). Once you read just the basic parts of the book (which is all I’ve read so far), you realize that much of the work being done in this area is wasted effort; I claim that it could all be done much more simply and with superior results if based on Jaynes’ formulation of probability theory (which according to Jaynes goes back to Jeffreys and Laplace).
During my PhD studies I was working on something called experience-based trust management. Fundamentally, this topic is about programs that reason about the behaviour of agents (other dynamic programs) in large open distributed systems (think Internet). Such reasoning is based on information, usually in the form of past interactions with agents or in the form of statements made by other agents about such interaction (i.e., reputation information).
For the first two years we worked hard on creating a formal model for “computational trust” encompassing uncertainty, based on the somewhat hardcore mathematical theory of complete lattices and monotonic functions, complete partial orders and continuous functions (domain theory) and even category theory. It was abstract, it was fun, it was warm, nice and cuddly; it turned out, however, to be essentially useless… Fortunately, after approximately two years I somehow realized this and started working on the same problems, but with a less abstract approach. At some point later I came across Jaynes’ book. Now, I only wish I had read that book in 2004…
Anyway, I don’t know how you were taught probability theory (or worse, statistics), but the courses I took had abstract definitions (corresponding to what is on Wikipedia) that seemed magical to a first-year computer science student, and that seemed abstract but at least general later, when I encountered measure theory. While this is all very interesting if one is interested in abstract mathematics, when one reads Jaynes’ account of probability one cannot help thinking that his approach is overwhelmingly appealing: at first sight it is intuitive and much simpler; and once one gets into the later chapters, one learns that it is also more powerful. In fact, the rules of probability theory are proven to be the unique set of rules that satisfy an absolutely reasonable set of qualitative desiderata (this is known as Cox’s theorem, which is on Wikipedia, but Jaynes’ exposition is much better in my opinion; a version is here (from page 13)).
I won’t even try to give an account of the book here, but only recommend it to anyone even remotely interested in scientific reasoning and logic, as well as applied mathematics and computer science. There are some places that are mathematically challenging for your typical CS grad, but it still has value even if the advanced techniques and proofs are skipped. Read it before you submit any paper on trust.
4 Responses to “Probability Theory: The Logic of Science”
I agree wholeheartedly on all points! However, Jaynes’ book is dangerous before you get your PhD because you start recognizing bs all around you, like in every second CS paper where the word “probability” comes up!
If you haven’t read Cox’s original exposition, it is worth a trip to the library. His little “algebra of probable inference” book is short and neat, maybe even easier to follow than Jaynes.
Anyway, I see one problem with the explanatory approach taken by Cox/Jaynes (which I started transcribing for the average Joe non-mathematician in my blog). Their discussion of the probability rules and their uniqueness in the context of logical propositions is nice and dandy, but it leaves a lingering question of where these proposition sets are supposed to come from in practice. When Jaynes explains his maximum entropy principle, it looks very much like he is doing a “frequentist” thing after all to arrive at the atomic probabilities. Basically he’s counting possibilities: he weighs each complex proposition by the number of atomic propositions that imply it, and compares that count against those of other complex propositions.
By silkop on Aug 20, 2008
Hello silkop.
Good to hear from you; an interesting response! I wasn’t really expecting much activity on this thread, since this blog is centered around computer science, and if you are a computer scientist, it is quite unlikely that you have come by Jaynes’ book… I myself came by it by pure coincidence.
But you have! Great; and thanks for the reference to Cox’s original work, I will definitely read that when I get the time and opportunity.
Regarding your problem with MaxEnt, I think Jaynes gives a really satisfactory explanation of this. One can think of the “frequentist thing” as a degenerate special case of proper Bayesian reasoning in the case where the prior information says nothing, i.e., when we use the principle of insufficient reason. Now, in the case of MaxEnt we have actual prior information, say in the form of average values that the solution must satisfy. Frequentist theory (conventional statistics) cannot make use of this prior information, but MaxEnt can: intuitively, one gets a prior which is as uniform as possible while respecting the given constraints. So the frequency correspondence is not a bad thing; it is good: it is what makes MaxEnt as noncommittal as possible.
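A minimal sketch of what “as uniform as possible while respecting the given constraints” can look like when the only prior information is an average value. The function name maxent_given_mean, the six-valued variable and the target mean of 4.5 are assumptions chosen for illustration, not taken from the book. The sketch relies on the standard result that the maximum-entropy distribution under a mean constraint is exponential in the value, and it finds the multiplier by simple bisection.

```python
import math


def maxent_given_mean(values, target_mean, tol=1e-12):
    """Maximum-entropy distribution on `values` with a prescribed mean.

    The solution has the form p_i proportional to exp(-lam * values[i]).
    The mean is monotonically decreasing in lam, so lam is found by bisection.
    """
    if not min(values) < target_mean < max(values):
        raise ValueError("target_mean must lie strictly between min and max")

    def dist(lam):
        # Shift exponents by their maximum for numerical stability.
        exps = [-lam * v for v in values]
        top = max(exps)
        weights = [math.exp(e - top) for e in exps]
        total = sum(weights)
        return [w / total for w in weights]

    def mean(lam):
        return sum(p * v for p, v in zip(dist(lam), values))

    lo, hi = -50.0, 50.0  # assumed bracket for lam; wide enough for this example
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if mean(mid) > target_mean:
            lo = mid  # mean still too high, so lam must be larger
        else:
            hi = mid
    return dist((lo + hi) / 2)


# Hypothetical example: six possible values, known average of 4.5.
faces = [1, 2, 3, 4, 5, 6]
p = maxent_given_mean(faces, 4.5)
print([round(x, 4) for x in p])
```

If the prescribed mean equals the plain average of the values (3.5 here), the multiplier is zero and the result is the uniform distribution, which is the “prior information says nothing” case described above.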
If you read section 11.8, page 365 in the 2003 edition, there is an interesting exposition on ‘frequency correspondence’: “…the probability distribution which maximises entropy is numerically identical with the frequency distribution which can be realized in the greatest number of ways (which is vastly greater than its competitors).”
By admin on Aug 21, 2008
I suppose that this “realized in the greatest number of ways” remark is what I am nit-picky about. IIRC, Jaynes elsewhere chides the “orthodoxians” for considering not the data at hand, but rather “what could have been, but is not”. However, in order to justify maximum entropy, he seems to implicitly rely on a very similar approach:
First, consider all the possible “worlds” that agree with the constraints but are equally likely based on indifference. In each such world a particular frequency distribution is “realized”. Then, examine which distributions are going to come up most often if you keep drawing randomly from the bag of worlds; this of course is a basic problem solved by the multinomial distribution.
If my remarks are not clear, think about his broken windows example. N windows have been broken into an integer number of pieces and all that we know is the average number of pieces (seems like a rather strange situation to me, but who am I to criticize textbook examples). If we assume some upper integer limit on the number of pieces per window, we can easily imagine a concrete world in which the “first” window was broken into p_1 pieces, the “second” window into p_2 pieces and so on until p_N. Now, if we enumerate all the possible worlds (and there’s a finite number of them, based on our assumptions about the number of windows and pieces), some of them will agree with the average number of pieces constraint, most will not. Then, we conceptually put these matching worlds into a “bag”, sample from this bag and examine the relative frequency of each number of pieces in each drawn world. What the maxent principle says is that an overwhelming number of draws from the bag will have relative frequencies very close to those of most other draws, and that the most frequent frequency distribution can be calculated by maximizing entropy (why this correspondence holds is not explained very well in the book, I find).
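The “bag of worlds” picture can be checked by brute force on a toy instance. The following is only a sketch under made-up assumptions: 8 windows, at most 4 pieces per window, and 16 pieces in total (an average of 2); the helper names frequency_multiplicities and entropy are hypothetical. It enumerates every world, keeps those matching the average, and counts how many worlds realize each frequency distribution.

```python
import math
from collections import Counter
from itertools import product


def frequency_multiplicities(n_windows, max_pieces, total_pieces):
    """Count how many matching worlds realize each frequency distribution.

    A world assigns a piece count in 1..max_pieces to every window. Only
    worlds whose piece counts sum to total_pieces (i.e. agree with the
    known average) are kept.
    """
    tally = Counter()
    for world in product(range(1, max_pieces + 1), repeat=n_windows):
        if sum(world) != total_pieces:
            continue
        # counts[k - 1] = number of windows broken into k pieces
        counts = tuple(world.count(k) for k in range(1, max_pieces + 1))
        tally[counts] += 1
    return tally


def entropy(counts):
    """Shannon entropy (in nats) of the relative frequencies in `counts`."""
    n = sum(counts)
    return -sum(c / n * math.log(c / n) for c in counts if c > 0)


# Toy numbers, assumed for illustration only.
tally = frequency_multiplicities(n_windows=8, max_pieces=4, total_pieces=16)
matching = sum(tally.values())
for counts, ways in tally.most_common(3):
    # count vector, number of worlds, share of the bag, entropy
    print(counts, ways, round(ways / matching, 3), round(entropy(counts), 3))
```

Each count vector (n_1, ..., n_M) is realized by N!/(n_1! ... n_M!) worlds, and by Stirling’s approximation the logarithm of that number is roughly N times the entropy of the relative frequencies. For a toy size like this one the ranking by number of worlds and the ranking by entropy need not agree exactly; the correspondence tightens as N grows.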
Why are we willing to accept the maxent frequency distribution, which is after all based on a thought-up generative sampling model? So far, the only good answer I understand is that other distributions would also have to be based on thought-up generative sampling models, ones that are even more ridiculous (arbitrary) than the maxent one. Sometimes I wonder if it is the only answer.
As for CS people not knowing about Jaynes: I think it is “Jaynes’s fault” – he assumes that his reader has a working knowledge of calculus (and often also “orthodox” statistics and history) to follow his reasoning. This may be true for physicists, but certainly isn’t true for CS students. The funny thing is Jaynes has inspired me to improve my maths education. There’s something magnetic in the way he explains stuff and deals with critics.
By silkop on Aug 30, 2008