Lovegrove Mathematicals

"Probabilities are likelinesses over singleton sets"

- A coin is of degree 2;
- A die is of degree 6;
- A pack of cards is of degree 52;
- The days of the week are of degree 7.

*The degree* is the number of possibilities that
something (tossing a coin; rolling a die; drawing a card) may produce. To enable a coherent theory to be
developed, the possibilities/classes are labelled 1,2,...,N, where N is the degree.

The set {1,...,N} is denoted by X_{N}.
This is the domain of definition for distributions and histograms of degree N.

A *distribution of degree N* is a function f:

- which is defined for i=1,...,N;
- for which f(i)>0 for all i;
- for which f(1)+...+f(N)=1.

It is usual to represent the distribution f by the ordered N-tuple ( f(1), ... ,f(N) ).

The set of all distributions of degree N is denoted by S(N). (The symbol S_{N} is more usual, but using it here would result in an inelegant subscripted subscript, which is difficult to typeset in a clear and legible way.)
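As an illustrative sketch (the function name is my own, not from the text), membership of S(N) can be checked directly from the three defining conditions:

```python
import math

def is_distribution(f):
    """Return True if the tuple f is a distribution of degree len(f):
    every value strictly positive and the values summing to 1."""
    return all(x > 0 for x in f) and math.isclose(sum(f), 1.0)

print(is_distribution((0.5, 0.25, 0.25)))  # a distribution of degree 3
print(is_distribution((0.5, 0.5, 0.1)))    # values sum to 1.1, so not a distribution
```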

If f∈S(N) then f is *injective* if [i≠j] ⇒ [f(i)≠f(j)].
The set of non-injective elements of S(N) has measure zero and so may be safely
ignored: the description of various algorithms as producing injective
distributions is for the sake of technical precision only, and has no practical
or theoretical consequences.

An *histogram of degree N* is a function h:-

- which is defined for i=1,...,N
- for which h(i)≥0 for all i.

The set of all histograms of degree N is denoted by H(N).

The *sample size* of h is ω(h)=h(1)+...+h(N).

When writing out the histogram h we normally just write the values h(i) as an ordered N-tuple. For example, (1.25, 2.13, 4.87, 8.92).
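As a quick sketch (the function name `omega` is my own), the sample size is simply the sum of the histogram's values:

```python
def omega(h):
    """Sample size ω(h): the sum of the histogram's values."""
    return sum(h)

h = (1.25, 2.13, 4.87, 8.92)  # the example histogram from the text, degree 4
print(omega(h))  # 17.17 (up to floating point)
```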

An *integram of degree N* is an histogram of degree N which takes only integer
values. The most important integram is the zero integram 0, for which all values are zero.

The set of all integrams of degree N is denoted by G(N).

We denote the integram with g(i)=1 and g(j)=0 for j≠i by "i"_{N}
(the quote marks are part of the notation). Because the degree is normally
obvious from the context, this is more usually written as "i". (For those familiar with vector notation, this is the same as the bold **i** used to denote a unit vector in the direction of the i-axis. That notation is impossible to write by hand and so must be the worst piece of notation ever devised.)

For example, if the degree is 6 then "2"="2"_{6}=(0,1,0,0,0,0).
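A minimal sketch of this notation (the function name is my own choice):

```python
def unit_integram(i, N):
    """The integram "i"_N: value 1 in position i (1-based), 0 elsewhere."""
    return tuple(1 if j == i else 0 for j in range(1, N + 1))

print(unit_integram(2, 6))  # (0, 1, 0, 0, 0, 0), i.e. "2"_6 from the text
```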

If g∈G(N) then the *Multinomial coefficient associated with g*
is given by

M(g) = ω(g)! / ( g(1)! g(2)! ... g(N)! ).

(Because the denominator contains the term g(i)!, each g(i) must be an integer. Since this is the case for all i, g does have to be an integram, not just an histogram.)
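The multinomial coefficient is straightforward to compute; this sketch uses integer division, which is exact because the denominator always divides the numerator:

```python
import math

def M(g):
    """Multinomial coefficient M(g) = ω(g)! / (g(1)!·...·g(N)!) of an integram g."""
    return math.factorial(sum(g)) // math.prod(math.factorial(x) for x in g)

print(M((2, 1, 1)))  # 4!/(2!·1!·1!) = 12
print(M((0, 0, 0)))  # the zero integram gives 0!/1 = 1
```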

Let h∈H(N), h≠0 . Then
we define RF(h) to be the relative frequencies of h, that is RF(h)(i)= h(i)/ω(h).
An alternative notation for RF(h)(i) is RF(i|h), the relative frequency of i
*given* h.
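As a sketch (the function name `RF` follows the text's notation), the relative frequencies simply divide each value by the sample size:

```python
def RF(h):
    """Relative frequencies of a non-zero histogram h: RF(h)(i) = h(i)/ω(h)."""
    w = sum(h)
    return tuple(x / w for x in h)

print(RF((2, 3, 5)))  # (0.2, 0.3, 0.5); note the result is a distribution
```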

If we have a set of histograms of degree N, say {h_{1}, ...,h_{K}}, then:-

- Provided at least one h_{k} is not 0, we define the
*Mean Relative Frequency of i* to be

( h_{1}(i)+...+h_{K}(i) ) / ( ω(h_{1})+...+ω(h_{K}) ).

An alternative name for this is the *Global Relative Frequency of i*.

- Provided no h_{k} is 0, we define the
*Mean of the Relative Frequencies of i* to be

( RF(i|h_{1})+...+RF(i|h_{K}) ) / K .
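These two means generally differ when the sample sizes differ, since the first pools the histograms while the second weights each histogram equally. A sketch (function names are mine) with two degree-2 histograms of very different sample sizes:

```python
def mean_relative_frequency(hists, i):
    """Pooled mean: sum of the h_k(i) divided by the sum of the sample sizes."""
    return sum(h[i - 1] for h in hists) / sum(sum(h) for h in hists)

def mean_of_relative_frequencies(hists, i):
    """Equal-weight mean: average of the individual relative frequencies RF(i|h_k)."""
    return sum(h[i - 1] / sum(h) for h in hists) / len(hists)

hists = [(9, 1), (1, 1)]  # degree 2; sample sizes 10 and 2
print(mean_relative_frequency(hists, 1))       # (9+1)/(10+2) ≈ 0.833
print(mean_of_relative_frequencies(hists, 1))  # (0.9+0.5)/2 = 0.7
```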
A set, P, is *convex* if, no matter which two points we select in P,
the straight line segment joining them is wholly in P. A
set which is not convex is called *concave*, but many authors prefer the term
*non-convex*.

The *core*
(some prefer the term *convex hull*) of a set is its smallest convex
superset. It follows from the definition that a convex set is its own core.

Core(P) is important because the mean of any number (not necessarily finite) of elements of P always lies in Core(P). In particular, if P is convex then the mean lies in P, but if P is concave then that mean might not be in P.

This is especially important in some of the applied sciences, where it
can be essential that the result of any analysis be describable in the same
way as P. For example, R(N) -the set of ranked distributions of degree N- is
convex, so a mean of ranked distributions will also be a ranked distribution of
the same degree. On the other hand, U(N) -the set of unimodal distributions
of degree N- is concave, so a mean of unimodal distributions might not be unimodal. When this matters there is a tendency to use
best-fitting rather than best-estimating, since best-fitting forces a
solution which has the required description even though the fit might be extremely poor. (Being the "best" fit does not imply being a "good" fit.)
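The U(N) example can be seen numerically. In this sketch (the particular distributions are my own choice), two unimodal distributions of degree 3, peaked at opposite ends, average to a distribution that dips in the middle:

```python
# Two unimodal distributions of degree 3, with peaks at positions 1 and 3.
f = (0.6, 0.3, 0.1)
g = (0.1, 0.3, 0.6)

# Their mean is a convex combination, so it lies in Core(U(3))...
mean = tuple((a + b) / 2 for a, b in zip(f, g))
print(mean)  # ≈ (0.35, 0.3, 0.35): the middle value dips, so the mean is not unimodal
```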

If P⊂S(N) then P is said to be *(i,j)-symmetric* if P contains f_{(i,j)} whenever P contains f, where f_{(i,j)} is that distribution which is obtained from f by interchanging f(i) and f(j).

The histogram h is *(i,j)-symmetric* if h(i)=h(j).

P is *symmetric* if it is (i,j)-symmetric for all i,j∈X_{N}.

h is *symmetric* if it is (i,j)-symmetric for all i,j∈X_{N}.
That is, if it is a constant histogram.

Let P be a non-empty subset of S(N), g∈G(N) and h∈H(N). Then we define *the Likeliness, over P, of g given h* by

L_{P}(g|h) = Σ_{P} M(g) f^{g+h} / Σ_{P} f^{h} ,

where f^{g} denotes f(1)^{g(1)}...f(N)^{g(N)}.
h is called the *given histogram* (often an integram, but it doesn't have to be), and g
is the *required integram*. P is the *underlying set*, and (g+h) is the
*objective histogram*. It is often the objective histogram that is specified; the required integram is then found by subtracting the given histogram from it.

(Σ is the Daniell integral, which can be thought of as summation on a finite set but as the Riemann integral when that is required.)
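When the underlying set P is finite, the Daniell integral reduces to an ordinary sum and the likeliness can be computed directly. This is my own illustrative sketch, assuming the likeliness takes the form L_P(g|h) = Σ_P M(g)f^{g+h} / Σ_P f^{h}, which reduces to M(g)f^{g} on singleton sets:

```python
import math

def power(f, g):
    """f^g = f(1)^g(1) · ... · f(N)^g(N)."""
    return math.prod(fi ** gi for fi, gi in zip(f, g))

def M(g):
    """Multinomial coefficient ω(g)!/(g(1)!·...·g(N)!)."""
    return math.factorial(sum(g)) // math.prod(math.factorial(x) for x in g)

def likeliness(P, g, h):
    """L_P(g|h) for a finite underlying set P (a list of distributions)."""
    gh = tuple(a + b for a, b in zip(g, h))
    return M(g) * sum(power(f, gh) for f in P) / sum(power(f, h) for f in P)

P = [(0.5, 0.25, 0.25)]             # a singleton underlying set {f}
g = (1, 0, 0)                       # the required integram "1"
print(likeliness(P, g, (0, 0, 0)))  # reduces to Pr("1"|f) = f(1) = 0.5
```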

Provided no confusion results, we

- write L_{P}(i|h) rather than L_{P}("i"|h);
- write L_{P}(g) rather than L_{P}(g|0).

The *L-point* is the point (distribution) in S(N) with co-ordinates
( L_{P}(1|h), ..., L_{P}(N|h) ). Associated with this is
the function

L_{P}: h ↦ ( L_{P}(1|h), ..., L_{P}(N|h) )

which will often be used on this site to plot graphs. When doing this, we
shall sometimes use the alternative notation Average(P) rather than L_{P}.

If P is an underlying set in (ie. a non-empty subset of) S(N), V⊂S(N) and h∈H(N), then we define *the Likeliness, over P, of V given h* by

L_{P}(V|h) = Σ_{P∩V} f^{h} / Σ_{P} f^{h} .

When h=__0__, we write L_{P}(V) rather than L_{P}(V|__0__).

A fundamental difference between the likeliness of an integram and the likeliness of a set of distributions is that the former cannot be 0 but the latter can (if V and P do not intersect). Similarly, the likeliness of an integram cannot be 1 (except in the degenerate case N=1), but that of a set of distributions can (if P⊂V).

If P is a singleton set, P={f}, then L_{P}(V|h)
can be only 0 (if f∉V) or 1 (if f∈V) and is equivalent to the characteristic function of V.

If the underlying set is singleton, P={f}, then the expression for L_{P}(g|h)
takes on an especially simple and significant form, namely L_{P}(g|h)=
M(g)f^{g}.

This is the everyday Multinomial Theorem form of the probability of g
when the generating distribution is f. Since M(g)f^{g} contains no reference to h (if it did, we would have needed to write Pr(g|f,h) rather than Pr(g|f)), we
may define
Pr(g|f)=L_{{f}}(g|h) and call this *the probability of the integram g given the
distribution f*.

This is a highly significant step, because it is defining probability in terms of likeliness, rather than the other way round.

It follows from this that Pr("i"|f) = f(i).
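A sketch of this singleton form (the function name `Pr` follows the text's notation), verifying that Pr("i"|f) recovers f(i):

```python
import math

def Pr(g, f):
    """Pr(g|f) = M(g) f^g, the multinomial probability of the integram g
    when the generating distribution is f."""
    coeff = math.factorial(sum(g)) // math.prod(math.factorial(x) for x in g)
    return coeff * math.prod(fi ** gi for fi, gi in zip(f, g))

f = (0.5, 0.25, 0.25)
print(Pr((0, 1, 0), f))  # Pr("2"|f) = f(2) = 0.25
print(Pr((2, 1, 0), f))  # 3!/(2!·1!) · 0.5² · 0.25 = 0.1875
```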

It is important to note that, since the expression M(g)f^{g} makes no reference to h,
Pr(g|f) is independent of h, that is of experimental/observational data.

So probabilities are likelinesses over singleton sets, and are independent of the given histogram.