## Monday, January 19, 2009

### Co-occurrence and Correlation

In one of our projects, we encountered this dilemma where we had to nitpick on (the probability of) co-occurrence of a pair of events and correlation between the pair of events.

Here is my attempt at disambiguating between the two. Looking forward to any pokes at loopholes in my argument.

Consider two events e1 and e2 that have a temporal signature. For instance, they could be login events of two users on a computer system across time. Let us also assume that time is organized as discrete units of constant duration each (say one hour).

We want to now compare the login behaviour of e1 and e2 over time. We need to find out whether e1 and e2 are taking place independently or are they correlated. Do they tend to occur together (i.e. co-occur) or do they take place independent of one another?

This is where terminologies are freely used and things start getting a bit confusing. So to clear the confusion, we need to define our terms more precisely.

Co-occurrence is simply the probability that the events e1 and e2 occur together. In other words, this is the joint probability of e1 and e2. Let's represent this as p(e1,e2). The joint probability of a pair of random processes A and B is defined as:

$p(A,B)=\frac{|A\cap B|}{|A \cup B|}$When we are talking about temporally distributed processes like e1 and e2, the intersection is simply the number of times e1 and e2 have occurred in the same time bucket and the union is the total number of times e1 or e2 have occurred (counting the co-occurrences only once).

Another form of measuring relatedness between the events e1 and e2 is to compute their correlation coefficient. Correlation measures the linear relationship between pairs of random variables.

Intuitively, suppose our time units were divided into discrete buckets t1 .. tn. In each time bucket, we have counted the number of times e1 and e2 have occurred. We now take a 2-dimensional plot and for each time bucket, we place a point on the plot, whose x and y values are the number of times e1 and e2 have occurred respectively.

(Here there is an implicit assumption that the resolution of our measurement is the width of the time bucket. That is, if the time bucket is 1 hour, it does not matter when exactly an event occurred in that hour.)

Given this scatter plot, if we can now draw a line passing through all the points such that the error between the points and the line is minimal, we say that the events are (linearly) correlated. A popular way of computing this line is to use the Pearson coefficient.

The correlation coefficient takes into consideration similarity in occurrences across each bucket. The scatter plot can also reveal specific patterns of correlations that cannot be discerned by the probability of co-occurrence. For instance, suppose that if the events e1 and e2, co-occur, they co-occur not more than 5 times within a time bucket or not co-occur at all. This peculiarity is lost in the co-occurrence computation.

On the other hand, the co-occurrence probability gives an easy way of summarizing pairwise relationships in a large set of events. It is also easier to build generative models and synthetic data sets that reflect co-occurrence probabilities than those that reflect co-occurrences (I think).

There are many other ways to compute relatedness. One other metric worth mentioning here is the mutual information metric between pairs of events. This is a small but significant addition over computing the joint probability. Intuitively, the mutual information between a pair of random variables A and B is the amount of bits that are required to change from a description of A to a description of B. Formally:
$I(A,B)=\displaystyle\sum_{a \in A}\sum_{b \in B}p(a,b)\log\left(\frac{p(a,b)}{p(a)p(b)}\right)$Here p(a,b) is the joint probability of events in A and B and p(a) and p(b) are the marginal (unconditional) probabilities of events in A and B respectively.

## Monday, October 06, 2008

### Meta-modeling heuristics

(Acknowledgment: This post has inputs from Sanket of First principles fame.)

When teaching about formal axiomatic systems, I'm usually asked a question something like, "But, how do you find the axioms in the first place?"

To take a few steps back, reasoning processes based on logic and deduction are set in an "axiomatic" context. Axioms are the ground truths, based on which we set out to prove or refute theorems within the system. For example, "Given any two points, there can be only one line that passes through both of them" is an axiom of Eucledian geometry.

When proving theorems within an axiomatic system, we don't question the truth of the axioms themselves. As long as the set of axioms are consistent among themselves (i.e. don't contradict one another) it is fine. Axioms are either self-evident or well-known truths, or somethings that are assumed to be true for the context.

But then, the main question becomes very relevant. Can we assume just about anything as an axiom, just as long as it does not contradict with any other axioms? Or should there be some element of objective truth in the axioms?

Technically, we can go with the belief that truth is subjective and we can assume whatever we want as axioms. But then, this is a self-contradicting statement when used in general. Perhaps one can define a wholly abstract system based on an alternate but consistent set of physical laws and prove that (say) pigs can fly. But other than being seen as "intellectual -er- stimulation", such results are generally not accorded a great deal of importance.

Good axiomatic systems are based on axioms that are grounded in reality and encapsulate some element of objective truth. So the question of "how do we find axioms" reduces to "how do we recognize objective truth when we see it?"

The succinct answer is: "wish we knew". This is the question that has perhaps defined humankind's quest throughout history. And I certainly won't claim to have an answer for that!

But even then, it helps to have some kind of thumb rules or heuristics using which, we can be reasonably confident that our assumptions are grounded in reality. So what follows here are some thumb rules that I have found useful at some time or the other. They are by no means exhaustive and by themselves they don't guarantee that they will lead you to true statements. Nevertheless, it is good to have some thumb rules rather than nothing at all. So here goes:

Principle of invariance
Any characteristic of a system that is invariant and displays a property of constancy, is a good candidate for a ground truth. Remember how the north star was used as the basis for designing navigation rules by sailors, long ago?

Principle of least bias (maximum entropy)
If you have insufficient information about an external entity that may impact your system in one or more ways, choose the least biased explanation about how the impact will be. For instance, suppose you are constructing a tall building and need to protect it from heavy winds. If you know the direction in which strong winds typically blow from, then you can construct the building in such a way that the building only has sharp edges in that direction so that the wind can cut through it. However, if strong winds are erratic in that region and it is impossible to predict the direction in which they can blow from, design your building on the assumption that strong winds can blow from any possible direction.

The least-bias principle is also known as the principle of maximum entropy, because in information theoretic terms, this also represents the statement with the maximum information content possible about the unknown variable.

Principle of conservation (Symmetry)
When you have mapped out the inputs and outputs of your system, typically everything else should be conserved. Else there is something like a "memory leak" happening somewhere that may make your system appear to behave strange eventually.

Every debit should have a corresponding credit (if money was not dispensed out of the system) and every buy should have a corresponding sell. So, don't believe an axiom like, "if lot of people are buying shares from others, then the market is bullish." If lot of people are buying shares from others, it also means that a lot of people are selling shares.

Principle of minimum description length (Occam's razor) If the same phenomenon can be described in more than one ways, choose the simplest possible description. An earlier debate some of us had on elevator design is a motivating example. Occam's razor can also be seen as the principle of irreducibility of the axioms. If the axioms can be reduced to something smaller without changing their meaning, then we probably should be using the latter set of axioms.

Extensibility: This is the dual of the above characteristic. The more facts you can explain using a concept, the better it is as an axiom. Extensability can also be termed the principle of generality. If there are two or more theories for a phenomena, and one of them explains far more than the other (correctly of course), then that is the better one. Quantum mechanics is seen as something that supercedes Newtonian mechanics because Quantum mechanics has a theory general enough to explain mechanics in the sub-atomic level as well as (in principle) at macro levels; unlike classical mechanics that described phenomena only at macro levels.

Universality: Many a time, similar phenomena occur in (seemingly) unrelated contexts. Such phenomena are very likely to point to some deeper underlying principle. The power-law distribution is an example of a characteristic that is found in several disparate systems like degree distribution on the web, population distribution across cities, distribution of the sizes of our blood vessels, etc. It is no wonder that the power law has aroused a lot of curiosity and there are various theories (preferential attachment, utility maximization under saturation) about the truth underlying this phenomenon.

## Tuesday, August 19, 2008

One of the most celebrated paradoxes in set theory is the Russel's paradox. The story behind the paradox and subsequent developments is rather interesting.

Consider a set of the kind S = {a, b, S}. It seems somewhat unusual because S is a set in which S itself is a member. If we expand it, we get S = {a, b, {a, b, {a, b ....}}} leading to an infinite membership chain.

Suppose we want to express the class of sets that don't have this "foundationless" property. Let us call this set as the set of all "proper" sets, that is, sets that don't contain themselves. We can express this set of sets as: X = {x | x is not a member of x}

Now this begs the question whether X is a member of itself. If X is a member of itself, then by the definition of X (set of all sets that don't contain themselves), X should not be a member of itself. If X is not a member of itself, then by the definition of X (set of all sets that don't contain themselves), X should be a member of itself.

This is the famous Russel's paradox. Bertrand Russel was so flustered by this paradox, he went to great lengths to create an axiomatic basis for set theory that prevents the formation of such paradoxical situations. Incidentally, Russel's paradox also dealt a major blow to David Hilbert's grand idea of representing all of mathematics using set theory. Russel showed that it is not possible to represent some parts of set theory itself using set theory, let alone all of mathematics.

Russel and Whitehead then embarked on a major project that introduced the theory of "types" that was finely intertwined with set theory. This they published as a three volume series called "Principia Mathematica." Set theory could not be described without theory of types and vice versa.

Essentially, the gist of this is as follows. A type is basically a property (a predicate logic statement for instance) and an instance is an element having that property. A set, now becomes a collection of instances sharing a given property. Hence if we have a property isPrimeNumber(x), then the set Primes = {x | x \in Z, isPrimeNumber(x)} is the set of all integers for which isPrimeNumber(x) holds.

In most purposes, we can refer to a set and its type interchangeably. But the subtle difference is that, while we can define a property P(x): x does not contain x without any problem, we cannot define the corresponding set X = {x | P(x)} without getting into a paradox.

In the axiomatic system of Principia Mathematica, a set X can only contain elements belonging to a type lower than that of X itself. This leads to a strict hierarchy among sets, where every set is made up of lower level elements and so on. This brings forth the question, what is the lowest of them all? In order to make sets exist, we will then need to have "atomic" types, below which there are no more sets defined.

This axiom is called the "Axiom of Foundation" in set theory, which prohibits an infinite membership chain and decrees that sets ought to end somewhere.

However, the Foundation Axiom is a little over zealous in its restrictions in order to prevent Russel's paradox like situations. Self reference is not necessarily a paradox.

Suppose we mathematically represent the semantics of each sentence with the set of all nouns in them. Then the first sentence of the previous paragraph would be something like: {Foundation Axiom, restrictions, Russel's paradox}.

Now suppose we want to represent the following sentence in set-theoretic form: This sentence is written in English. We see that there are two nouns here, English and "this sentence". Unfortunately, "this sentence" cannot be an element of "this sentence" according to the Foundation Axiom. Similarly, we can't represent a sentence talking about the universe in general.

There are several situations where self reference is not only consistent, but also necessary. A container class in object oriented programming for example, can contain objects of any class including the container class or even its super classes.

Much later, mathematicians saw the need to relax the Axiom of Foundation and so was born the theory of non-wellfounded sets or "Hypersets." Hypersets are based on the Anti-foundation Axiom that allows self references as long as the cycle does not contain a contradiction. A paradox occurs only when there is a cyclic reference evaluating to a contradiction.

Hence the set X = {x | x contains x} is not paradoxical, while X = {x | x does not contain x} is paradoxical.

It is only that the former set is "non-wellfounded". It is like the story where, a boy who is asked where Leela Palace is, replies that it is in front of Manipal Hospital, and gives directions to Manipal Hospital as "opposite Leela Palace." The sentences are mutually consistent, but by themselves, they don't have any information about how to get there -- either to Leela Palace or to Manipal Hospital.

Such mathematical structures, even though they appear as if they are useless, have some important applications and are very powerful constructs to model some classes of natural phenomena. But more on them in some other post, later.

## Friday, February 29, 2008

### The Power Law of Monkeys..

The paper talks about the new found interest in the power-law and log-normal distributions especially among web researchers. Michael provides ample evidence to show that these debates are nothing new. Power-law versus log-normal processes have been debated in several areas like biology, chemistry and astronomy for more than a century now.

A power-law distribution over a random variable X is one where the probability that X takes a value at least k is proportional to k-g. Here g > 0 is the power-law "exponent". Intuitively a power-law distribution represents a system where X has a very large number of instances of very small values and a very small number of instances of extremely large values.

Power-law distributions are also called scale-free distributions as their overall shape appears the same, no matter at what scale they are viewed from. A characteristic feature of the power-law is that it appears as a straight line with a negative slope, on a log-log scale.

Log-normal distributions on the other hand appear as normal (Gaussian) distributions in the log-log scale. Power-law is log-linear.

Both power-law and log-normal distributions represent very skewed systems with a small number of very resourceful entities and a large number of entities with very limited resources. Such skewed distributions are very pervasive. Some examples are: the sizes of blood vessels to the number of such vessels in our body; income distributions and the number of people with such incomes; populations of cities versus the number of such cities; the rank of a word in a document (in terms of its frequency) versus the number of times it appears; etc.

There are several generative models that speculate on the underlying random processes that generate the power-law and log-normal distributions. And the sheer simplicity of some of them make them most appealing.

One of the most common generative models for the power-law is preferential attachment. This is one of the most popular models to explain the in-degree distribution of web pages. Simply put, the preferential attachment model for the web says that the probability of a page obtaining a new incoming hyperlink is proportional to its existing in-degree. Pages that have more in-degree are more popular and tend to be more visible -- making their popularity go up further.

There are several other models that generate the power-law based on models like information-theoretic optimizations and martingale processes.

Similarly, there are several models for generating log-normal distributions. One of the simplest generative models is of multiplicative processes. Suppose that on a social networking site like Orkut, each person brings in about 'r' new friends into the system. Since each person is assumed to be acting independently, let r be modeled as an i.i.d. (independent identically distributed) random variable with a finite variance (i.e. there are limits up to which any person is likely to have friends). Suppose the site starts with an initial population of X(0). The population after one time step would be X(1) = r X(0).

As we can see, the population at any time t is given as X(t) = r X(t-1) = rt X(0).

This is a multiplicative process, where each iteration has a multiplicative, rather than an additive effect on the population. Now taking log on both sides, we get: log X(t) = log X(0) + t log r.

Logarithms have this effect of converting multiplications into additions. So what was a multiplicative process in the linear domain is now an additive process in log scale. Now we had assumed that r is an i.i.d. random variable with finite variance (and so would be log r). There is this nice little theorem called the central-limit theorem, which says that the distribution of any i.i.d. random variable with finite variance, approaches the normal distribution asymptotically. So, there we have a multiplicative process generating a normal distribution in a log-log scale.

Anyway, one of the best generative models for the power-law is as follows: A monkey randomly typing on a typewriter (having n letters and one space) is just as likely to generate words with a power-law frequency distribution as do any other models!

Assume that the monkey types a space with a probability q and any of the n letters with a probability of (1-q)/n. Spaces are used to delimit words. A word with c letters would then have a probability of: ((1-q)/n)c q.

There are nc words of length c. The probability of a word with frequency rank r(j) (having nj words) occurs with probability q((1-q)/n)log_n(r(j)), which in turn reduces to q(r(j))log_n(1-q)-1

.. giving us a power-law distribution. Reading this serves us with a good reminder that there could be nothing but pure noise that determines how the web (for example) grows! Maybe there is a giant monkey sitting out there weaving the web...