## Probability – 2

In a comment on my previous post arguing that probability is arbitrary, Stephen Bourque wrote

Probability is an empirical measurement of an ensemble of events. It means: Given a set of N independent events, the probability of a specific event is, to a degree of certainty, the number of times the specific event occurred divided by the total N, as N becomes large. By “a degree of certainty,” it is meant simply that the uncertainty in the measurement can be made smaller and smaller by increasing N. (Since this is an inductive process, it has the characteristics of induction, including the requirement of objectively determining when N is large enough to achieve certainty of the probability measure.)

Let me work out the math to calculate the degree of certainty. Consider a coin tossed N times. Suppose that M tosses resulted in a ‘heads’ (H) outcome. To simplify the math (by keeping it in the discrete domain), suppose I know that the coin has been designed so that its “true” heads probability r for a single toss is either p or q. Let H_{M,N} denote the event of obtaining M heads from N tosses, and let P(A/B) denote the conditional probability of A given B.

Using Bayes’ theorem,
P(r = p / H_{M,N}) = P(H_{M,N} / r = p) P(r = p) / [P(H_{M,N} / r = p) P(r = p) + P(H_{M,N} / r = q) P(r = q)]

with

P(H_{M,N} / r = p) = C(N, M) p^M (1 - p)^(N - M)

and

P(H_{M,N} / r = q) = C(N, M) q^M (1 - q)^(N - M)

If one knows P(r = p), the probability of the true probability being p, one can calculate P(r = p / H_{M,N}), the degree of certainty for the probability estimate r = p given the empirical data. The problem is that to calculate the degree of certainty of a probability estimate based on empirical data, one needs another probability number. To take a concrete example, suppose I know that my coin has a ‘true’ probability of either 0.3 or 0.4 for a single toss. I toss the coin 100 times and get 33 heads, so that N = 100, M = 33, p = 0.3, q = 0.4. If I take P(r = 0.3) to be 0.5, then the degree of certainty works out to be 69.7%. The problem is that the value of 0.5 for P(r = 0.3) is still arbitrary. It has no basis in empirical data.
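For the curious, the 69.7% figure can be reproduced with a short script. This is just a sketch of the Bayes calculation above; the function name is my own:

```python
from math import comb

def degree_of_certainty(N, M, p, q, prior_p=0.5):
    # P(r = p / H_{M,N}) via Bayes' theorem with a two-point prior on
    # {p, q}; prior_p is the (arbitrary) prior probability P(r = p).
    like_p = comb(N, M) * p**M * (1 - p)**(N - M)
    like_q = comb(N, M) * q**M * (1 - q)**(N - M)
    return like_p * prior_p / (like_p * prior_p + like_q * (1 - prior_p))

print(round(degree_of_certainty(100, 33, 0.3, 0.4) * 100, 1))  # 69.7
```

Note that the binomial coefficient C(N, M) appears in both numerator and denominator and cancels; it is kept here only to match the formulas above.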

One can extend this to the continuous domain, where r may take any value between 0 and 1. To get a degree of certainty measure, one will need a prior probability distribution for the “true” probability and this distribution will have to be arbitrary. Just as I used a value of 0.5 in my concrete example, one may take this distribution to be the uniform distribution. I have not worked out the math for this case, but it should be easy to do so.
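The math for the continuous case is indeed easy: with a uniform prior on [0, 1], the posterior for r after M heads in N tosses is the Beta(M + 1, N − M + 1) distribution, and a degree-of-certainty measure becomes the posterior mass assigned to an interval around the estimate. A sketch, with the function name and the choice of interval being my own:

```python
import math

def beta_posterior_mass(M, N, lo, hi, steps=10000):
    # Under a uniform prior on [0, 1], the posterior for r is
    # Beta(M + 1, N - M + 1); integrate its density over [lo, hi]
    # with the trapezoid rule.
    log_norm = math.lgamma(N + 2) - math.lgamma(M + 1) - math.lgamma(N - M + 1)
    def density(r):
        if r <= 0 or r >= 1:
            return 0.0
        return math.exp(log_norm + M * math.log(r) + (N - M) * math.log(1 - r))
    h = (hi - lo) / steps
    total = 0.5 * (density(lo) + density(hi))
    for i in range(1, steps):
        total += density(lo + i * h)
    return total * h

# Posterior probability that r lies in [0.28, 0.38] after 33 heads in 100 tosses
print(beta_posterior_mass(33, 100, 0.28, 0.38))
```

The uniform prior plays exactly the role P(r = 0.3) = 0.5 played in the discrete example: it is the arbitrary ingredient.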

Anyway, it turns out that as one increases the values of N and M proportionately, the degree of certainty for the probability estimate r = M/N rises to 100% very fast, irrespective of the arbitrarily chosen prior probabilities. Practically, this is a very useful feature, and it is what Stephen refers to when he writes that the uncertainty in the measurement can be made smaller and smaller by increasing N. But does it change the epistemological status of probability calculations? I don’t think so. As long as N is finite – that is, always – the degree of certainty is arbitrary. At some level, probability calculations always depend on an arbitrary choice of equal likelihood. To see this, just consider Bayes’ theorem above. It uses a weighted average where the weights are prior (or unconditional) probabilities. These unconditional probabilities are usually themselves estimated from other empirical data. Regardless, the calculation of an average assumes an equality of significance of the numbers being averaged. My position is that this assumption of equality is an arbitrary assumption. By using more and more empirical data, one can drive this assumption deeper and deeper, but unless one develops a physical theory – a cause and effect relationship – one cannot get rid of it.
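The prior-insensitivity claim is easy to check numerically. Working in the log domain (the binomial coefficient cancels in the ratio, and logs avoid overflow for large N), the degree of certainty can be tracked as N and M scale up, for several choices of the prior. A sketch with names of my own choosing:

```python
import math

def certainty_log(N, M, p, q, prior_p):
    # P(r = p / H_{M,N}) for the two-point prior {p, q}, computed via
    # the log of the likelihood ratio L_q / L_p for numerical stability.
    log_ratio = M * math.log(q / p) + (N - M) * math.log((1 - q) / (1 - p))
    return prior_p / (prior_p + (1 - prior_p) * math.exp(log_ratio))

# Scale N and M proportionately and vary the prior P(r = p).
for scale in (1, 10, 100):
    N, M = 100 * scale, 33 * scale
    row = [round(certainty_log(N, M, 0.3, 0.4, w), 4) for w in (0.1, 0.5, 0.9)]
    print(N, row)
```

Even a heavily skewed prior such as P(r = 0.3) = 0.1 is washed out once N reaches a few thousand: every row converges toward 1.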

### 8 Responses

1. Rather than relying on an assumption of equal likelihood, I think a more general way of describing it is that we rely on an assumption that the future pattern of occurrences of a set of events will be similar to a pattern observed for that class in the past.

This assumption of expecting the pattern to remain the same is our best fall-back epistemological position. Of course, we cannot be certain until we understand the actual causes of the variability, and no longer have to rely on probability.

2. Hi Koustubh,

Sorry for no response on the previous posts, though this ain’t a proper one either. Just a quick comment.

I think you are confusing probability with statistics. All the issues you mention are about statistics. Probability in itself is just a branch of mathematics, with its set of axioms and the theorems that follow from them. Statistics, on the other hand, rests on the more heuristic claim that various physical quantities of interest obey those axioms (and more). As you mention, there is no justification to assume that these hold.

Statistics itself divides into two forms: frequentist and Bayesian. The one you mention in the post is Bayesian statistics, where at least the initial prior assumption is made explicit. How one reaches the prior is almost always an unanswered question, and often an unasked one. A better way to interpret Bayesian statistics would be that if one started with the assumed prior, then after the event one would have the calculated posterior. As such, it is a statement about how to update one’s beliefs, and is therefore mainly subjective (epistemological, as per your previous post).

Frequentist statistics adopts the more objective approach, in that the probabilities are taken to describe something metaphysical. I do not agree with this, and I think neither do you.

But all this criticism is about statistics, not probability as such.

3. “Anyway, it turns out that as one increases the values of N and M proportionately, the degree of certainty for the probability estimate r = M/N rises to 100% very fast, irrespective of the arbitrarily chosen prior probabilities”

This is not true in general; for this to be true it is necessary that the initial prior puts some weight on the “true” value of r. This condition is usually referred to as the “grain of truth” assumption. Without this, the posterior will always put 0 weight on the true value irrespective of the (finite) number of observations. This again shows how arbitrary and crucial the choice of the prior is.

4. K.M, I think I must have inadvertently implied something I did not intend. The point I am arguing is just this: It is perfectly valid and objective to increase one’s knowledge of some physical process by recording and measuring the results of many iterations of the process, and subsequently using these measurements to build the concept of “expected value.” I did not intend to direct the focus of my argument toward the degree of certainty of such measurements, which I had mentioned only to give a qualitative indication of what I meant in my definition. You did a nice job pursuing a quantitative estimate of this degree of certainty, but I think in using this as your starting point, you had already skipped over the essence of my point.

The critical issue emerges at the end of your analysis, when you state, “If I use P(r=0.3) to be 0.5 . . . the value of 0.5 . . . is still arbitrary.” Stated in this way, the probability of 0.5 does indeed seem to be arbitrary, pulled from out of the blue. Or, as you indicated in a comment from your previous post, perhaps the value of 0.5 is not quite “out of the blue,” but it “contains an irreducible arbitrary assumption of equal likelihood.”

There are at least two senses in which I think it is decidedly NOT arbitrary – indeed, it is not an assumption at all – to say that a probability of heads is 0.5. One applies to probability measure in general, and one applies to the fair coin (and other such symmetrical processes) in particular.

First, the general sense: Probability measurements do NOT start with, “If I assume the probability of heads is one half . . .” That’s not a measurement! I chose a fair coin for my example because it is familiar to everyone, but in this case, its very familiarity seems to be obscuring my point. Let’s take a physical process in which one is not immediately able to estimate probabilities by mere inspection – a radio link. (We could substitute innumerable physical processes here: a rock rolling down a hill, a chemical process, biological inheritance, etc.) Suppose I send 10^8 bits of data at a certain rate, from a transmitter to a receiver separated by a fixed distance, through a channel of particular characteristics. Since I know the data that I transmitted, I can empirically measure (count) how many bits were received in error. If 10^2 bits were wrong, for instance, I can say that the probability of an erroneous bit is 10^2 / 10^8 = 10^-6 (i.e. one in a million). If 10^3 bits were wrong, my bit error rate is 10^-5, and so on.

So, what is non-objective about that? Nothing. Armed with a measurement (not an arbitrary assumption – a measurement!) of bit error rate, I can now use that to send data across my radio link with a quantitative confidence in its accuracy. The concept does not (and cannot) tell me WHICH bits will arrive in error, but it does provide a measure of approximately HOW MANY errors I can expect over large samples of data. Of course, if the context changes (e.g. an interferer is introduced) the probabilities will in general change also, but that is the nature of all knowledge. Knowledge is contextual.

Everything I just described about a radio link can be repeated with a coin flip. Pretend you know nothing about the process of flipping a coin. If you perform the operation 1000 times and count the number of heads, you will arrive at a number such as 483 or 508 or 539. To one significant figure, 483 / 1000 = 0.5. So too, 508 / 1000 = 0.5 and 539 / 1000 = 0.5. There’s nothing arbitrary about this; there is nothing assumed. It depends only on the law of identity; a coin behaves in a certain way when it is flipped, and one may gain knowledge of this process inductively through observations. (If you flip not a coin but a maple leaf, you might arrive at very different probabilities – perhaps 0.7 and 0.3 for the two sides.)
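Stephen’s 1000-flip experiment is easy to simulate. In the sketch below, a “true” heads probability is built into the simulation only so that the observed frequency can be compared against it; the function name is my own:

```python
import random

random.seed(1)  # fixed seed so the run is repeatable

def observed_frequency(n_flips, true_p=0.5):
    # Flip a simulated coin n_flips times and return the measured
    # fraction of heads -- the purely empirical estimate of r.
    heads = sum(random.random() < true_p for _ in range(n_flips))
    return heads / n_flips

for n in (1000, 100000):
    print(n, observed_frequency(n))
```

As Stephen notes, with 1000 flips the estimate is reliable only to about one significant figure; increasing n tightens it.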

Now that I just made my case from the purely empirical perspective (for general complex deterministic processes with no obvious symmetry), which is the main thrust of my argument, let’s look at the fair coin in particular again for a moment.

Even if one had never flipped a coin before, why couldn’t one predict a priori that the probabilities of heads and tails would be equal, simply by observing the symmetry of the process? Why should nature favor one side of the coin over another? Why shouldn’t the probability of obtaining a three on a die roll be 1 in 6? Why shouldn’t the probability of selecting a nine of hearts from a deck of cards be 1 in 52? It strikes me as unjustifiably agnostic and noncommittal to REFRAIN from holding that the probabilities would distribute evenly across the possible outcomes. Far from holding equal likelihood as an “irreducible arbitrary assumption,” I regard it as arbitrary to hold out the possibility that the probability of heads is something OTHER than one half. (Of course, one can modify coins, dice, and cards to skew the odds, but that simply returns us to the issue of context.)

I apologize for the length of this comment, but I think it was necessary to get the idea across.

5. Stephen,
You make two important points in your comment. Let me consider them in turn.

First, your radio link example. I agree (with a qualification) that there is nothing arbitrary about measuring data loss, interference, etc. and using that to predict future outcomes. The qualification is that the cause of the data errors must be identified. For example, if one determines that the errors were a result of interference from some other source, identifies that source, determines that the source will continue to cause interference with the same characteristics, and determines that there are no other sources that may cause interference, then one may objectively use the measured error rate to estimate future errors. This is fully objective. But where is the use of probability in such a process? It is not necessary. I know that such analyses commonly make use of probability theory, but I think of such use as a convenience, not a necessity. For example, normal distributions are commonly used to characterize measurement errors. But in fact, measurement error is not normally distributed. A better analysis would use truncated distributions, but the math soon becomes intractable. The use of probability theory allows an easy calculation of error bounds, and so we use it. In principle, one could calculate the error bounds (based on the error-bounded measurement of various factors) without the use of probability.

Continuing with your radio link example, what if one has not in fact done all the work to justify the estimates? I think you provide an answer to that in your second point, when you ask “Why should nature favor one side of the coin over another?” More generally, one may ask “What reason is there to attach a higher likelihood to a particular possibility?” If the answer to that is “none”, one uses the equal likelihood assumption. Within the context of one’s knowledge, this equal likelihood assumption is irreducible. And because it is based on a lack of reasons, I choose to call it arbitrary. You seem to dislike my choice of words, but I think we are basically in agreement here.

To sum up, I think there are two possible uses of probability:
1) where it is merely used as a shortcut to simplify the math involved in analyzing a complex, deterministic and fully understood (in principle) process;
2) where it is used to obtain “best-effort” estimates without complete knowledge.
It is the latter that I am primarily interested in, and it is there that I am claiming an irreducible arbitrary assumption of equal likelihood.

6. softwarenerd,
> I think a more general way of describing it is that we rely on an assumption that the future pattern of occurrences of a set of events will be similar to a pattern observed for that class in the past.

Agreed. But when one translates that to math to quantify things, I think that necessarily turns into an assumption of equal likelihood at some level. As support for this, consider that all probability distributions conceptually depend on the uniform distribution. Therefore even if one is not fitting a uniform distribution to empirical data directly, it is there at some level.
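One concrete illustration of this dependence is inverse-transform sampling: any continuous distribution can be generated by pushing a uniform draw through the inverse of its CDF, so the uniform distribution really is “there at some level.” A sketch using the exponential distribution (names and parameters are my own choices):

```python
import math
import random

random.seed(0)  # fixed seed so the run is repeatable

def sample_exponential(rate):
    # Inverse-transform sampling: a uniform draw u on [0, 1) is mapped
    # through the inverse CDF of Exp(rate): F^{-1}(u) = -ln(1 - u) / rate.
    u = random.random()
    return -math.log(1 - u) / rate

samples = [sample_exponential(2.0) for _ in range(100000)]
print(sum(samples) / len(samples))  # sample mean should be near 1/rate = 0.5
```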

7. Krishnamurthy,
> I think you are confusing probability with statistics.

I will have to think some more about that. But as of now, I don’t like the kind of distinction you are making, here and in your comment on the previous post. That is partly because your view of probability (and mathematics in general?) seems too platonic – in a different plane from reality. Will write about this once I think this through fully.

> it is necessary that the initial prior puts some weight on the “true” value of r. This condition is usually referred to as the “grain of truth” assumption. Without this, the posterior will always put 0 weight on the true value irrespective of the (finite) number of observations.

Can you provide an example?

8. >I will have to think some more about that. But as of now, I don’t like the kind of distinction you are making, here and in your comment on the previous post. That is partly because your view of probability (and mathematics in general?) seems too platonic – in a different plane from reality. Will write about this once I think this through fully.

I don’t understand how my view of maths comes into this, nor how it is platonic. One could just as well take a formalist view and make the same comment.

>Can you provide an example?

Sure. Assume the initial prior says the true value of r lies uniformly in the set (0, 1/3) ∪ (2/3, 1), and suppose the true value of r is 0.5. The initial prior then puts 0 weight on the true value and does not satisfy the grain of truth assumption. Irrespective of any finite number of observations, the posterior still puts 0 weight on the true value of r. Only an infinite number of observations could ever identify the error made in choosing the initial prior.

Also, the posterior should converge to 50% on 1/3 and 50% on 2/3. Thus there is never a 100% degree of certainty on any value after any finite number of observations.
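This failure mode can be demonstrated numerically with a discretized version of the prior above (the grid size and variable names are my own choices). With data from a fair coin, all posterior mass piles onto the support points nearest the true value of 0.5, i.e. near 1/3 and 2/3, split evenly between them:

```python
import math

def posterior_on_grid(heads, n, grid):
    # Equal prior weight on each grid point; Bayes update done in the
    # log domain and normalized at the end for numerical stability.
    logs = [heads * math.log(r) + (n - heads) * math.log(1 - r) for r in grid]
    m = max(logs)
    weights = [math.exp(l - m) for l in logs]
    total = sum(weights)
    return [w / total for w in weights]

# Discretized prior support approximating (0, 1/3) U (2/3, 1);
# the true value r = 0.5 gets zero prior weight.
grid = [i / 300 for i in range(1, 100)] + [i / 300 for i in range(201, 300)]
heads, n = 5000, 10000  # data as produced by a fair coin
post = posterior_on_grid(heads, n, grid)
for r, w in sorted(zip(grid, post), key=lambda t: -t[1])[:2]:
    print(round(r, 3), round(w, 3))
```

The two printed grid points are the ones closest to 0.5 from either side, each carrying about half the posterior mass, while r = 0.5 itself can never receive any.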