imported>Boris Tsirelson |
imported>Boris Tsirelson |
| WP
| | Space (mathematics) |
| Conditioning (probability)
| |
| 16:17, 18 November 2008
| |
|
| |
|
| Beliefs depend on the available information. This idea is formalized in [[probability theory]] by '''conditioning'''. Conditional [[probability|probabilities]], conditional [[Expected value|expectations]] and conditional [[Probability distribution|distributions]] are treated on three levels: [[Discrete probability distribution|discrete probabilities]], [[probability density function]]s, and [[measure theory]]. Conditioning leads to a non-random result if the condition is completely specified; otherwise, if the condition is left random, the result of conditioning is also random.
| | In ancient mathematics, "space" was a geometric abstraction of the |
| | three-dimensional space observed in everyday life. Axiomatization |
| | of this space, started by Euclid, was finished in the 19th |
| | century. Non-equivalent axiomatic systems appeared in the same 19th |
| | century: hyperbolic geometry (Nikolai Lobachevskii, Janos Bolyai, |
| | Carl Gauss) and elliptic geometry (Georg Riemann). Thus, different |
| | three-dimensional spaces appeared: Euclidean, hyperbolic and |
| | elliptic. These are symmetric spaces; a symmetric space looks the same |
| | around every point. |
|
| |
|
| This article concentrates on interrelations between various kinds of conditioning, shown mostly by examples. For systematic treatment (and corresponding literature) see more specialized articles mentioned below.
| | Much more general, not necessarily symmetric spaces were introduced in |
| | 1854 by Riemann, to be used by Albert Einstein in 1916 as a foundation |
| | of his general theory of relativity. An Einstein space looks |
| | different around different points, because its geometry is |
| | influenced by matter. |
|
| |
|
| ==Conditioning on the discrete level==
| | In 1872 the Erlangen program by Felix Klein proclaimed various kinds |
| | of geometry corresponding to various transformation groups. Thus, new |
| | kinds of symmetric spaces appeared: metric, affine, projective (and |
| | some others). |
|
| |
|
| '''Example.''' A fair coin is tossed 10 times; the [[random variable]] <math>X</math> is the number of heads in these 10 tosses, and <math>Y</math> is the number of heads in the first 3 tosses. Although <math>Y</math> is determined before <math>X</math>, it may happen that someone knows <math>X</math> but not <math>Y</math>.
| | The distinction between Euclidean, hyperbolic and elliptic spaces is |
| | not similar to the distinction between metric, affine and projective |
| | spaces. In the latter case one asks which questions apply; in the |
| | former, which answers hold. For example, take the question about the sum |
| | of the three angles of a triangle: is it equal to 180 degrees, or |
| | less, or more? In Euclidean space the answer is "equal"; in hyperbolic |
| | space, "less"; in elliptic space, "more". However, this question |
| | does not apply to an affine or projective space, since the notion of |
| | angle is not defined in such spaces. |
|
| |
|
| ===Conditional probability===
| | Euclidean axioms leave no freedom: they uniquely determine all |
| | geometric properties of the space. More exactly: all three-dimensional |
| | Euclidean spaces are mutually isomorphic. In this sense we have "the" |
| | three-dimensional Euclidean space. Three-dimensional symmetric |
| | hyperbolic (or elliptic) spaces differ by a single parameter, the |
| | curvature. The definition of a Riemann space leaves huge freedom, |
| | more than a finite number of numeric parameters. On the other hand, |
| | all affine (or projective) spaces are mutually isomorphic, provided |
| | that they are three-dimensional (or n-dimensional for a given n) and |
| | over the reals (or another given field of scalars). |
|
| |
|
| Given that <math>X=1,</math> the conditional probability of the event <math>Y=0</math> is <math>P(Y=0|X=1)=P(Y=0,X=1)/P(X=1)=0.7.</math> More generally,
| | Nowadays mathematics uses a wide assortment of spaces. Many of them |
| : <math> \mathbb{P} (Y=0|X=x) = \frac{ \binom 7 x }{ \binom{10} x } = \frac{ 7! (10-x)! }{ (7-x)! 10! } </math>
| | are quite far from the ancient geometry. Here is a rough and |
| for <math>x=0,1,2,3,4,5,6,7;</math> otherwise (for <math>x=8,9,10</math>), <math>P(Y=0|X=x)=0.</math> One may also treat the conditional probability as a random variable, that is, a function of the random variable <math>X</math>, namely,
| | incomplete classification according to the applicable questions |
| : <math> \mathbb{P} (Y=0|X) = \begin{cases}
| | (rather than answers). First, some basic classes. |
| \binom 7 X / \binom{10}X &\text{for } X \le 7,\\
| |
| 0 &\text{for } X > 7.
| |
| \end{cases} </math>
| |
| The [[expected value|expectation]] of this random variable is equal to the (unconditional) probability,
| |
| : <math> \mathbb{E} ( \mathbb{P} (Y=0|X) ) = \sum_x \mathbb{P} (Y=0|X=x) \mathbb{P} (X=x) = \mathbb{P} (Y=0), </math>
| |
| namely,
| |
| : <math>\sum_{x=0}^7 \frac{ \binom 7 x }{ \binom{10}x } \cdot \frac1{2^{10}} \binom{10}x = \frac 1 8 , </math>
| |
| which is an instance of the [[law of total probability]] <math>E(P(A|X))=P(A).</math>
| |
|
| |
|
| Thus, <math>P(Y=0|X=1)</math> may be treated as the value of the random variable <math>P(Y=0|X)</math> corresponding to <math>X=1.</math> On the other hand, <math>P(Y=0|X=1)</math> is well-defined irrespective of other possible values of <math>X</math>.
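The discrete formulas above can be checked by brute-force enumeration. The following is a minimal sketch, not part of the original treatment; the helper names (<code>prob</code>, <code>cond_prob_y0</code>) are illustrative assumptions. It lists all <math>2^{10}</math> equally likely outcomes and uses exact rational arithmetic.

```python
from fractions import Fraction
from itertools import product

# All 2**10 equally likely outcomes of ten fair coin tosses (1 = heads).
OUTCOMES = list(product((0, 1), repeat=10))

def heads_total(w):
    """X: number of heads in all 10 tosses."""
    return sum(w)

def heads_first3(w):
    """Y: number of heads in the first 3 tosses."""
    return sum(w[:3])

def prob(event):
    """Probability of an event given as a predicate on outcomes."""
    return Fraction(sum(1 for w in OUTCOMES if event(w)), len(OUTCOMES))

def cond_prob_y0(x):
    """P(Y=0 | X=x) computed as the ratio P(Y=0, X=x) / P(X=x)."""
    joint = prob(lambda w: heads_first3(w) == 0 and heads_total(w) == x)
    return joint / prob(lambda w: heads_total(w) == x)

# Law of total probability: E(P(Y=0|X)) = sum over x of P(Y=0|X=x) P(X=x).
total = sum(cond_prob_y0(x) * prob(lambda w, x=x: heads_total(w) == x)
            for x in range(11))

print(cond_prob_y0(1), total)
```

Here <code>cond_prob_y0(1)</code> corresponds to <math>P(Y=0|X=1)</math> and <code>total</code> to <math>E(P(Y=0|X)).</math>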
| | Space Stipulates |
|
| |
|
| ===Conditional expectation===
| | Projective Straight lines. |
| | Topological spaces. Convergence, continuity. |
| | Open sets, closed sets. |
|
| |
|
| Given that <math>X=1,</math> the conditional expectation of the random variable <math>Y</math> is <math>E(Y|X=1)=0.3.</math> More generally,
| | Distances between points are defined in metric spaces. In addition, |
| : <math> \mathbb{E} (Y|X=x) = \frac3{10} x </math>
| | all questions applicable to topological spaces apply also to metric |
| for <math>x=0,\dots,10.</math> (In this example it happens to be a linear function, but in general it is nonlinear.) One may also treat the conditional expectation as a random variable, that is, a function of the random variable <math>X</math>, namely,
| | spaces, since each metric space "downgrades" to the corresponding |
| : <math> \mathbb{E} (Y|X) = \frac3{10} X. </math>
| | topological space. Such relations between classes of spaces are shown |
| The expectation of this random variable is equal to the (unconditional) expectation of <math>Y</math>,
| | below. |
| : <math> \mathbb{E} ( \mathbb{E} (Y|X) ) = \sum_x \mathbb{E} (Y|X=x) \mathbb{P} (X=x) = \mathbb{E} (Y), </math>
| |
| namely,
| |
| : <math>\sum_{x=0}^{10} \frac{3}{10} x \cdot \frac1{2^{10}} \binom{10}x = \frac 3 2 \, , </math> or simply <math> \mathbb{E} \Big( \frac3{10} X \Big) = \frac3{10} \mathbb{E} (X) = \frac3{10} \cdot 5 = \frac32 \, , </math>
| |
| which is an instance of the [[law of total expectation]] <math>E(E(Y|X))=E(Y).</math>
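The conditional expectation and the law of total expectation in this example can be verified by enumeration. This is a small sketch under the same coin-tossing setup; the names <code>cond_exp_y</code> and <code>total_expectation</code> are illustrative, not taken from the article.

```python
from fractions import Fraction
from itertools import product

OUTCOMES = list(product((0, 1), repeat=10))  # ten fair tosses, 1 = heads

def cond_exp_y(x):
    """E(Y | X=x): average of Y over the outcomes with exactly x heads."""
    matching = [w for w in OUTCOMES if sum(w) == x]
    return Fraction(sum(sum(w[:3]) for w in matching), len(matching))

# P(X=x) for each x; every outcome has probability 1/1024.
p_x = {x: Fraction(sum(1 for w in OUTCOMES if sum(w) == x), len(OUTCOMES))
       for x in range(11)}

# E(E(Y|X)) = sum over x of E(Y|X=x) P(X=x).
total_expectation = sum(cond_exp_y(x) * p_x[x] for x in range(11))

print([cond_exp_y(x) for x in range(11)], total_expectation)
```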
| |
|
| |
|
| The random variable <math>E(Y|X)</math> is the best predictor of <math>Y</math> given <math>X</math>. That is, it minimizes the mean square error <math>E(Y-f(X))^2</math> on the class of all random variables of the form <math>f(X).</math> This class of random variables remains intact if <math>X</math> is replaced, say, with <math>2X.</math> Thus, <math>E(Y|2X)=E(Y|X).</math> It does not mean that <math>E(Y|2X)=0.3\cdot2X;</math> rather, <math>E(Y|2X)=0.15\cdot2X=0.3X.</math> In particular, <math>E(Y|2X=2)=0.3.</math> More generally, <math>E(Y|g(X))=E(Y|X)</math> for every function <math>g</math> that is one-to-one on the set of all possible values of <math>X</math>. The values of <math>X</math> are irrelevant; what matters is the partition (denote it <math>\alpha_X</math>)
| | Space Is richer than Stipulates |
| : <math> \Omega = \{ X=x_1 \} \uplus \{ X=x_2 \} \uplus \dots </math>
| |
| of the sample space <math>\Omega</math> into disjoint sets <math> \{ X=x_n \}. </math> (Here <math> x_1, x_2, \dots </math> are all possible values of <math>X</math>.) Given an arbitrary partition <math>\alpha</math> of <math>\Omega</math>, one may define the random variable <math>E(Y|\alpha).</math> Still, <math>E(E(Y|\alpha))=E(Y).</math>
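The dependence on the partition rather than on the labels can be illustrated as follows. This is a sketch only; <code>cond_exp_given</code> is a hypothetical helper that computes <math>E(Y|\alpha)</math> for the partition whose parts are the level sets of a labeling function.

```python
from collections import defaultdict
from fractions import Fraction
from itertools import product

OUTCOMES = list(product((0, 1), repeat=10))  # ten fair tosses, 1 = heads

def cond_exp_given(label):
    """E(Y | alpha), where the parts of alpha are the level sets of `label`.

    Returns a dict mapping each outcome to the average of Y over its part.
    """
    parts = defaultdict(list)
    for w in OUTCOMES:
        parts[label(w)].append(w)
    result = {}
    for part in parts.values():
        mean_y = Fraction(sum(sum(w[:3]) for w in part), len(part))
        for w in part:
            result[w] = mean_y
    return result

by_x = cond_exp_given(lambda w: sum(w))           # partition generated by X
by_2x = cond_exp_given(lambda w: 2 * sum(w))      # same partition, relabeled
by_parity = cond_exp_given(lambda w: sum(w) % 2)  # a coarser partition

# E(E(Y|alpha)) for the coarser partition.
mean_by_parity = sum(by_parity.values()) / len(OUTCOMES)

print(by_x == by_2x, by_x == by_parity, mean_by_parity)
```

Relabeling the values of <math>X</math> by a one-to-one function leaves the result unchanged, while a coarser partition gives a different random variable with the same expectation.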
| |
|
| |
|
| Conditional probability may be treated as a special case of conditional expectation. Namely, <math>P(A|X)=E(Y|X)</math> if <math>Y</math> is the [[indicator function|indicator]] of <math>A</math>. Therefore the conditional probability also depends on the partition <math>\alpha_X</math> generated by <math>X</math> rather than on <math>X</math> itself; <math>P(A|g(X))=P(A|X)=P(A|\alpha),</math> <math>\alpha=\alpha_X=\alpha_{g(X)}.</math>
| | Affine Projective Parallel lines |
| | Linear Affine Origin. Vectors. |
| | Linear topological Linear space. |
| | Topological space. |
| | Metric Topological space. Distances. |
| | Normed Linear topological space. |
| | Metric space. |
| | Inner product Normed space. Angles. |
| | Euclidean Affine. Angles. |
| | Metric. |
| | A finer classification uses answers to some applicable questions. |
|
| |
|
| On the other hand, conditioning on an event <math>B</math> is well-defined, provided that <math>P(B) \ne 0,</math> irrespective of any partition that may contain <math>B</math> as one of several parts.
| | Space Special cases Properties |
|
| |
|
| ===Conditional distribution===
| | Linear three-dimensional Basis of 3 vectors |
| | | finite-dimensional A finite basis |
| Given <math>X=x,</math> the conditional distribution of <math>Y</math> is
| | Metric complete All Cauchy sequences converge |
| : <math> \mathbb{P} ( Y=y | X=x ) = \frac{ \binom 3 y \binom 7 {x-y} }{ \binom{10}x } = \frac{ \binom x y \binom{10-x}{3-y} }{ \binom{10}3 } </math>
| | Topological compact Every open covering has a finite subcovering |
| for <math>\max(0,x-7)\le y\le \min(3,x).</math> It is the [[hypergeometric distribution]] <math>\mathrm{H}(x;3,7),</math> or equivalently, <math>\mathrm{H}(3;x,10-x).</math> The corresponding expectation <math>0.3x,</math> obtained from the general formula <math> n \frac{R}{R+W} </math> for <math>\mathrm{H}(n;R,W),</math> is nothing but the conditional expectation <math>E(Y|X=x)=0.3x.</math>
| | connected Only trivial open-and-closed sets |
| | | Normed Banach Complete |
| Treating <math>\mathrm{H}(X;3,7)</math> as a random distribution (a random vector in the four-dimensional space of all measures on <math>\{0,1,2,3\}</math>), one may take its expectation, getting the unconditional distribution of <math>Y</math>, namely the [[binomial distribution]] <math>\mathrm{Bin}(3,0.5).</math> This fact amounts to the equality
| | Inner product Hilbert Complete |
| : <math> \sum_{x=0}^{10} \mathbb{P} ( Y=y | X=x ) \mathbb{P} (X=x) = \mathbb{P} (Y=y) = \frac1{2^3} \binom 3 y </math>
| |
| for <math>y=0,1,2,3;</math> just the law of total probability.
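The mixture identity can be confirmed with exact arithmetic. The sketch below assumes only the two binomial-coefficient formulas stated above; the function names are illustrative.

```python
from fractions import Fraction
from math import comb

def cond_pmf(y, x):
    """P(Y=y | X=x): hypergeometric, y heads among the first 3 of 10 tosses."""
    if not 0 <= x - y <= 7:
        return Fraction(0)
    return Fraction(comb(3, y) * comb(7, x - y), comb(10, x))

def p_x(x):
    """P(X=x) for X distributed as Bin(10, 0.5)."""
    return Fraction(comb(10, x), 2 ** 10)

# Mixing the conditional distributions over x gives the distribution of Y.
mixture = [sum(cond_pmf(y, x) * p_x(x) for x in range(11)) for y in range(4)]

# Expectation of the conditional distribution given X=4.
cond_mean_4 = sum(y * cond_pmf(y, 4) for y in range(4))

print(mixture, cond_mean_4)
```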
| |
| | |
| ==Conditioning on the level of densities==
| |
| | |
| '''Example.''' A point of the sphere <math>x^2+y^2+z^2=1</math> is chosen at random according to the uniform distribution on the sphere. The random variables <math>X</math>, <math>Y</math>, <math>Z</math> are the coordinates of the random point. The joint density of <math>X</math>, <math>Y</math>, <math>Z</math> does not exist (since the sphere is of zero volume), but the joint density <math>f_{X,Y}</math> of <math>X</math>, <math>Y</math> exists,
| |
| : <math> f_{X,Y} (x,y) = \begin{cases}
| |
| \frac1{2\pi\sqrt{1-x^2-y^2}} &\text{if } x^2+y^2<1,\\
| |
| 0 &\text{otherwise}.
| |
| \end{cases} </math>
| |
| (The density is non-constant because of a non-constant angle between the sphere and the plane.) The density of <math>X</math> may be calculated by integration,
| |
| : <math> f_X(x) = \int_{-\infty}^{+\infty} f_{X,Y}(x,y) \, \mathrm{d}y = \int_{-\sqrt{1-x^2}}^{+\sqrt{1-x^2}} \frac{ \mathrm{d}y }{ 2\pi\sqrt{1-x^2-y^2} } \, ; </math>
| |
| surprisingly, the result does not depend on <math>x</math> in <math>(-1,1),</math>
| |
| : <math> f_X(x) = \begin{cases}
| |
| 0.5 &\text{for } -1<x<1,\\
| |
| 0 &\text{otherwise},
| |
| \end{cases} </math>
| |
| which means that <math>X</math> is distributed uniformly on <math>(-1,1).</math> The same holds for <math>Y</math> and <math>Z</math> (and in fact, for <math>aX+bY+cZ</math> whenever <math>a^2+b^2+c^2=1</math>).
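The constancy of <math>f_X</math> can be checked numerically. The following sketch integrates the joint density over <math>y</math> with the midpoint rule; the substitution <math>y=\sqrt{1-x^2}\,\sin t</math> is an assumption of this sketch (not used in the text above), chosen to remove the endpoint singularity.

```python
import math

def joint_density(x, y):
    """f_{X,Y}(x, y) for a uniform random point of the unit sphere."""
    s = 1.0 - x * x - y * y
    return 1.0 / (2.0 * math.pi * math.sqrt(s)) if s > 0.0 else 0.0

def marginal_density(x, n=2000):
    """f_X(x) by the midpoint rule, after substituting y = r sin(t)."""
    r = math.sqrt(1.0 - x * x)
    h = math.pi / n
    total = 0.0
    for i in range(n):
        t = -math.pi / 2 + (i + 0.5) * h
        # dy = r cos(t) dt; midpoints never touch the singular endpoints.
        total += joint_density(x, r * math.sin(t)) * r * math.cos(t) * h
    return total

for x in (-0.9, 0.0, 0.3, 0.9):
    print(x, marginal_density(x))
```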
| |
| | |
| ===Conditional probability===
| |
| | |
| ====Calculation====
| |
| Given that <math>X=0.5,</math> the conditional probability of the event <math>Y\le0.75</math> is the integral of the conditional density,
| |
| : <math> \begin{align}
| |
| & f_{Y|X=0.5}(y) = \frac{ f_{X,Y}(0.5,y) }{ f_X(0.5) } = \begin{cases}
| |
| \frac1{ \pi \sqrt{0.75-y^2} } &\text{for } -\sqrt{0.75}<y<\sqrt{0.75},\\
| |
| 0 &\text{otherwise};
| |
| \end{cases} \\
| |
| & \mathbb{P} (Y \le 0.75|X=0.5) = \int_{-\infty}^{0.75} f_{Y|X=0.5}(y) \, \mathrm{d}y = \\
| |
| & = \int_{-\sqrt{0.75}}^{0.75} \frac{ \mathrm{d}y }{ \pi \sqrt{0.75-y^2} } = \frac12 + \frac1{\pi} \arcsin \sqrt{0.75} = \frac56 \, .
| |
| \end{align} </math>
| |
| More generally,
| |
| : <math> \mathbb{P} (Y \le y|X=x) = \frac12 + \frac1{\pi} \arcsin \frac{ y }{ \sqrt{1-x^2} } </math>
| |
| for all <math>x</math> and <math>y</math> such that <math>-1<x<1</math> (otherwise the denominator <math>f_X(x)</math> vanishes) and <math>\textstyle -\sqrt{1-x^2} < y < \sqrt{1-x^2} </math> (otherwise the conditional probability degenerates to 0 or 1). One may also treat the conditional probability as a random variable, that is, a function of the random variable <math>X</math>, namely,
| |
| : <math> \mathbb{P} (Y \le y|X) = \begin{cases}
| |
| 0 &\text{for } X^2 \ge 1-y^2 \text{ and } y<0,\\
| |
| \frac12 + \frac1{\pi} \arcsin \frac{ y }{ \sqrt{1-X^2} } &\text{for } X^2 < 1-y^2,\\
| |
| 1 &\text{for } X^2 \ge 1-y^2 \text{ and } y>0.
| |
| \end{cases} </math>
| |
| The expectation of this random variable is equal to the (unconditional) probability,
| |
| : <cite id="DPC8"> <math> \mathbb{E} ( \mathbb{P} (Y\le y|X) ) = \int_{-\infty}^{+\infty} \mathbb{P} (Y\le y|X=x) f_X(x) \, \mathrm{d}x = \mathbb{P} (Y\le y), </math> </cite>
| |
| which is an instance of the [[law of total probability]] <math>E(P(A|X))=P(A).</math>
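A numerical check of the closed form and of the law of total probability may be sketched as follows. It assumes the piecewise formula above and uses the fact, stated earlier, that <math>Y</math> is uniform on <math>(-1,1),</math> so that <math>P(Y\le y)=(1+y)/2.</math> The function names are illustrative.

```python
import math

def cond_cdf(y, x):
    """P(Y <= y | X = x) for the uniform distribution on the unit sphere, -1 < x < 1."""
    r = math.sqrt(1.0 - x * x)
    if y <= -r:
        return 0.0
    if y >= r:
        return 1.0
    return 0.5 + math.asin(y / r) / math.pi

def total_probability(y, n=100_000):
    """Integral of P(Y <= y | X = x) f_X(x) dx over (-1, 1), midpoint rule, f_X = 1/2."""
    h = 2.0 / n
    return sum(cond_cdf(y, -1.0 + (i + 0.5) * h) * 0.5 * h for i in range(n))

print(cond_cdf(0.75, 0.5), total_probability(0.75))
```

The first value should be close to 5/6, and the second close to <math>(1+0.75)/2=0.875.</math>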
| |
| | |
| ====Interpretation====
| |
| The conditional probability <math>P(Y\le0.75|X=0.5)</math> cannot be interpreted as <math>P(Y\le0.75,X=0.5)/P(X=0.5),</math> since the latter gives 0/0. Accordingly, <math>P(Y\le0.75|X=0.5)</math> cannot be interpreted via empirical frequencies, since the exact value <math>X=0.5</math> has no chance to appear at random, not even once during an infinite sequence of independent trials.
| |
| | |
| The conditional probability can be interpreted as a limit,
| |
| : <cite id="DPI5"> <math> \begin{align}
| |
| & \mathbb{P} (Y\le0.75 | X=0.5) = \lim_{\varepsilon\to0+} \mathbb{P} (Y\le0.75 | 0.5-\varepsilon<X<0.5+\varepsilon) = \\
| |
| & = \lim_{\varepsilon\to0+} \frac{ \mathbb{P} (Y\le0.75, 0.5-\varepsilon<X<0.5+\varepsilon) }{ \mathbb{P} (0.5-\varepsilon<X<0.5+\varepsilon) } = \\
| |
| & = \lim_{\varepsilon\to0+} \frac{ \int_{0.5-\varepsilon}^{0.5+\varepsilon} \mathrm{d}x \int_{-\infty}^{0.75} \mathrm{d}y \, f_{X,Y}(x,y) }{ \int_{0.5-\varepsilon}^{0.5+\varepsilon} \mathrm{d}x \, f_X(x) } \, .
| |
| \end{align} </math> </cite>
| |
| | |
| ===Conditional expectation===
| |
| The conditional expectation <math>E(Y|X=0.5)</math> is of little interest; it vanishes just by symmetry. It is more interesting to calculate <math>E(|Z||X=0.5)</math> treating |<math>Z</math>| as a function of <math>X</math>, <math>Y</math>:
| |
| : <math> \begin{align}
| |
| & |Z| = h(X,Y) = \sqrt{1-X^2-Y^2} \, ; \\
| |
| & \mathrm{E} ( |Z| | X=0.5 ) = \int_{-\infty}^{+\infty} h(0.5,y) f_{Y|X=0.5} (y) \, \mathrm{d} y = \\
| |
| & = \int_{-\sqrt{0.75}}^{+\sqrt{0.75}} \sqrt{0.75-y^2} \cdot \frac{ \mathrm{d}y }{ \pi \sqrt{0.75-y^2} } = \frac2\pi \sqrt{0.75} \, .
| |
| \end{align} </math>
| |
| More generally,
| |
| : <math> \mathbb{E} ( |Z| | X=x ) = \frac2\pi \sqrt{1-x^2} </math>
| |
| for <math>-1<x<1.</math> One may also treat the conditional expectation as a random variable, that is, a function of the random variable <math>X</math>, namely,
| |
| : <math> \mathbb{E} ( |Z| | X ) = \frac2\pi \sqrt{1-X^2} \, . </math>
| |
| The expectation of this random variable is equal to the (unconditional) expectation of <math>|Z|,</math>
| |
| : <math> \mathbb{E} ( \mathbb{E} ( |Z| | X ) ) = \int_{-\infty}^{+\infty} \mathbb{E} ( |Z| | X=x ) f_X(x) \, \mathrm{d}x = \mathbb{E} (|Z|) \, , </math>
| |
| namely,
| |
| : <math> \int_{-1}^{+1} \frac2\pi \sqrt{1-x^2} \cdot \frac{ \mathrm{d}x }2 = \frac12 \, , </math>
| |
| which is an instance of the [[law of total expectation]] <math>E(E(Y|X))=E(Y).</math>
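Both integrals of this subsection can be approximated numerically. The sketch below uses the midpoint rule and the conditional density <math>f_{Y|X=x}(y)=1/(\pi\sqrt{1-x^2-y^2});</math> the function names are illustrative assumptions.

```python
import math

def cond_exp_abs_z(x, n=2000):
    """E(|Z| | X=x): integral of h(x, y) f_{Y|X=x}(y) dy, by the midpoint rule."""
    r = math.sqrt(1.0 - x * x)
    h = 2.0 * r / n
    total = 0.0
    for i in range(n):
        y = -r + (i + 0.5) * h
        abs_z = math.sqrt(r * r - y * y)        # h(x, y) = sqrt(1 - x^2 - y^2)
        cond_density = 1.0 / (math.pi * abs_z)  # f_{Y|X=x}(y)
        total += abs_z * cond_density * h
    return total

def total_expectation(n=100_000):
    """Integral of (2/pi) sqrt(1 - x^2) f_X(x) dx over (-1, 1), with f_X = 1/2."""
    h = 2.0 / n
    xs = (-1.0 + (i + 0.5) * h for i in range(n))
    return sum((2.0 / math.pi) * math.sqrt(1.0 - x * x) * 0.5 * h for x in xs)

print(cond_exp_abs_z(0.5), total_expectation())
```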
| |
| | |
| The random variable <math>E(|Z||X)</math> is the best predictor of <math>|Z|</math> given <math>X</math>. That is, it minimizes the mean square error <math>E(|Z|-f(X))^2</math> on the class of all random variables of the form <math>f(X).</math> Similarly to the discrete case, <math>E(|Z||g(X))=E(|Z||X)</math> for every measurable function <math>g</math> that is one-to-one on <math>(-1,1).</math>
| |
| | |
| ===Conditional distribution===
| |
| Given <math>X=x,</math> the conditional distribution of <math>Y</math>, given by the density <math>f_{Y|X=x}(y),</math> is the (rescaled) arcsin distribution; its cumulative distribution function is
| |
| : <math> F_{Y|X=x} (y) = \mathbb{P} ( Y \le y | X = x ) = \frac12 + \frac1\pi \arcsin \frac{y}{\sqrt{1-x^2}} </math>
| |
| for all <math>x</math> and <math>y</math> such that <math>x^2+y^2<1.</math> The corresponding expectation of <math>h(x,Y)</math> is nothing but the conditional expectation <math>E(h(X,Y)|X=x).</math> The [[Mixture density|mixture]] of these conditional distributions, taken for all <math>x</math> (according to the distribution of <math>X</math>) is the unconditional distribution of <math>Y</math>. This fact amounts to the equalities
| |
| : <math> \begin{align}
| |
| & \int_{-\infty}^{+\infty} f_{Y|X=x} (y) f_X(x) \, \mathrm{d}x = f_Y(y) \, , \\
| |
| & \int_{-\infty}^{+\infty} F_{Y|X=x} (y) f_X(x) \, \mathrm{d}x = F_Y(y) \, ,
| |
| \end{align} </math>
| |
| the latter being the instance of the law of total probability [[#DPC8|mentioned above]].
| |
| | |
| ==What conditioning is not==
| |
| On the discrete level conditioning is possible only if the condition is of nonzero probability (one cannot divide by zero). On the level of densities, conditioning on <math>X=x</math> is possible even though <math>P(X=x)=0.</math> This success may create the illusion that conditioning is ''always'' possible. Regrettably, it is not, for several reasons presented below.
| |
| | |
| ===Geometric intuition: caution===
| |
| The result <math>P(Y\le0.75|X=0.5)=5/6,</math> mentioned above, is geometrically evident in the following sense. The points <math>(x,y,z)</math> of the sphere <math>x^2+y^2+z^2=1,</math> satisfying the condition <math>x=0.5,</math> are a circle <math>y^2+z^2=0.75</math> of radius <math> \sqrt{0.75} </math> on the plane <math>x=0.5.</math> The inequality <math>y\le0.75</math> holds on an arc. The length of the arc is 5/6 of the length of the circle, which is why the conditional probability is equal to 5/6.
| |
| | |
| This successful geometric explanation may create the illusion that the following question is trivial.
| |
| | |
| : A point of a given sphere is chosen at random (uniformly). Given that the point lies on a given plane, what is its conditional distribution?
| |
| | |
| It may seem evident that the conditional distribution must be uniform on the given circle (the intersection of the given sphere and the given plane). Sometimes it really is, but in general it is not. In particular, <math>Z</math> is distributed uniformly on <math>(-1,+1)</math> and independent of the ratio <math>Y/X,</math> thus, <math>P(Z\le0.5|Y/X)=0.75.</math> On the other hand, the inequality <math>z\le0.5</math> holds on an arc of the circle <math>x^2+y^2+z^2=1,</math> <math>y=cx</math> (for any given <math>c</math>). The length of the arc is 2/3 of the length of the circle. However, the conditional probability is 3/4, not 2/3. This is a manifestation of the classical Borel paradox.<ref>{{harvnb|Pollard|2002|loc=Sect. 5.5, Example 17 on page 122}}</ref><ref>{{harvnb|Durrett|1996|loc=Sect. 4.1(a), Example 1.6 on page 224}}</ref>
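The gap between 2/3 and 3/4 can be observed in a simulation. The sketch below is illustrative only: it conditions on a small window <math>c-\varepsilon<Y/X<c+\varepsilon</math> (with assumed values <math>c=1,</math> <math>\varepsilon=0.1</math>) rather than on the zero-probability event itself, and it samples uniform points of the sphere by normalizing Gaussian vectors.

```python
import math
import random

def uniform_sphere_point(rng):
    """Uniform point of the unit sphere: a normalized standard Gaussian vector."""
    while True:
        x, y, z = rng.gauss(0, 1), rng.gauss(0, 1), rng.gauss(0, 1)
        norm = math.sqrt(x * x + y * y + z * z)
        if norm > 0.0:
            return x / norm, y / norm, z / norm

def estimate(c=1.0, eps=0.1, n=200_000, seed=0):
    """Estimate P(Z <= 0.5 | c - eps < Y/X < c + eps) by simulation."""
    rng = random.Random(seed)
    hits = kept = 0
    for _ in range(n):
        x, y, z = uniform_sphere_point(rng)
        if x != 0.0 and c - eps < y / x < c + eps:
            kept += 1
            hits += z <= 0.5
    return hits / kept

# Fraction of the great circle {y = c x} on which z <= 0.5 (by arc length).
arc_fraction = 1.0 - math.acos(0.5) / math.pi

print(arc_fraction, estimate())
```

The simulated conditional frequency should be near 0.75, while the arc-length fraction is 2/3.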
| |
| | |
| "Appeals to symmetry can be misleading if not formalized as invariance arguments." Pollard<ref name="Pollard-5.5-122">{{harvnb|Pollard|2002|loc=Sect. 5.5, page 122}}</ref>
| |
| | |
| Another example. A [[Rotation matrix#Uniform random rotation matrices|random rotation]] of the three-dimensional space is a rotation by a random angle around a random axis. Geometric intuition suggests that the angle is independent of the axis and distributed uniformly. However, the latter is wrong; small values of the angle are less probable.
| |
| | |
| ===The limiting procedure===
| |
| Given an event <math>B</math> of zero probability, the formula <math>\textstyle \mathbb{P} (A|B) = \mathbb{P} ( A \cap B ) / \mathbb{P} (B) </math> is useless. However, one can try <math>\textstyle \mathbb{P} (A|B) = \lim_{n\to\infty} \mathbb{P} ( A \cap B_n ) / \mathbb{P} (B_n) </math> for an appropriate sequence of events <math>B_n</math> of nonzero probability such that <math>B_n\downarrow B</math> (that is, <math>\textstyle B_1 \supset B_2 \supset \dots </math> and <math>\textstyle B_1 \cap B_2 \cap \dots = B </math>). One example is given [[#DPI5|above]]. Two more examples are [[Wiener process#Related processes|Brownian bridge and Brownian excursion]].
| |
| | |
| In the latter two examples the law of total probability is irrelevant, since only a single event (the condition) is given. In contrast, in the example [[#DPI5|above]] the law of total probability [[#DPC8|applies]], since the event <math>X=0.5</math> is included in a family of events <math>X=x</math> where <math>x</math> runs over <math>(-1,1),</math> and these events form a partition of the probability space.
| |
| | |
| In order to avoid paradoxes (such as [[Borel's paradox]]), the following important distinction should be taken into account. If a given event is of nonzero probability then conditioning on it is well-defined (irrespective of any other events), as was noted [[#EP8|above]]. In contrast, if the given event is of zero probability then conditioning on it is ill-defined unless some additional input is provided. A wrong choice of this additional input leads to wrong conditional probabilities (expectations, distributions). In this sense, "''the concept of a conditional probability with regard to an isolated hypothesis whose probability equals 0 is inadmissible.''" ([[Andrey Kolmogorov|Kolmogorov]]; quoted in <ref name="Pollard-5.5-122">{{harvnb|Pollard|2002|loc=Sect. 5.5, page 122}}</ref>).
| |
| | |
| The additional input may be (a) a symmetry (invariance group); (b) a sequence of events <math>B_n</math> such that <math>B_n\downarrow B,</math> <math>P(B_n)>0;</math> (c) a partition containing the given event. Measure-theoretic conditioning (below) investigates Case (c), discloses its relation to (b) in general and to (a) when applicable.
| |
| | |
| Some events of zero probability are beyond the reach of conditioning. An example: let <math>X_n</math> be independent random variables distributed uniformly on <math>(0,1),</math> and <math>B</math> the event "<math>X_n\to0</math> as <math>n\to\infty</math>"; what about <math>P(X_n<0.5|B)?</math> Does it tend to 1, or not? Another example: let <math>X</math> be a random variable distributed uniformly on <math>(0,1),</math> and <math>B</math> the event "<math>X</math> is a rational number"; what about <math>P(X=1/n|B)?</math>
| |
| The only answer is that, once again, "the concept of a conditional probability with regard to an isolated hypothesis whose probability equals 0 is inadmissible." Kolmogorov, quoted in <ref name="Pollard-5.5-122">{{harvnb|Pollard|2002|loc=Sect. 5.5, page 122}}.</ref>
| |
| | |
| ==Conditioning on the level of measure theory==
| |
| | |
| '''Example.''' Let <math>Y</math> be a random variable distributed uniformly on <math>(0,1),</math> and <math>X=f(Y)</math> where <math>f</math> is a given function. Two cases are treated below: <math>f=f_1</math> and <math>f=f_2,</math> where <math>f_1</math> is the continuous piecewise-linear function
| |
| : <math> f_1(y) = \begin{cases}
| |
| 3y &\text{for } 0 \le y \le 1/3,\\
| |
| 1.5(1-y) &\text{for } 1/3 \le y \le 2/3,\\
| |
| 0.5 &\text{for } 2/3 \le y \le 1,
| |
| \end{cases} </math>
| |
| and <math>f_2</math> is the everywhere continuous but nowhere differentiable [[Weierstrass function]].
| |
| | |
| ===Geometric intuition: caution===
| |
| Given <math>X=0.75,</math> two values of <math>Y</math> are possible, 0.25 and 0.5. It may seem evident that both values are of conditional probability 0.5 just because one point is congruent to another point. However, this is an illusion; see below.
| |
| | |
| ===Conditional probability===
| |
| The conditional probability <math>P(Y\le1/3|X)</math> may be defined as the best predictor of the indicator
| |
| : <math> I = \begin{cases}
| |
| 1 &\text{if } Y \le 1/3,\\
| |
| 0 &\text{otherwise},
| |
| \end{cases} </math>
| |
| given <math>X</math>. That is, it minimizes the mean square error <math>E(I-g(X))^2</math> on the class of all random variables of the form <math>g(X).</math>
| |
| | |
| In the case <math>f=f_1</math> the corresponding function <math>g=g_1</math> may be calculated explicitly,<ref>
| |
| Proof:
| |
| : <math> \begin{align}
| |
| & \mathbb{E} ( I - g(X) )^2 = \\
| |
| & = \int_0^{1/3} (1-g(3y))^2 \, \mathrm{d}y + \int_{1/3}^{2/3} g^2 (1.5(1-y)) \, \mathrm{d}y + \int_{2/3}^1 g^2 (0.5) \, \mathrm{d}y = \\
| |
| & = \int_0^1 (1-g(x))^2 \frac{ \mathrm{d}x }{ 3 } + \int_{0.5}^1 g^2(x) \frac{ \mathrm{d} x }{ 1.5 } + \frac13 g^2(0.5) = \\
| |
| & = \frac13 \int_0^{0.5} (1-g(x))^2 \, \mathrm{d}x + \frac13 g^2(0.5) + \frac13 \int_{0.5}^1 ( (1-g(x))^2 + 2g^2(x) ) \, \mathrm{d}x \, ;
| |
| \end{align} </math>
| |
| it remains to note that <math>(1-a)^2+2a^2</math> is minimal at <math>a=1/3.</math>
| |
| </ref>
| |
| : <math> g_1(x) = \begin{cases}
| |
| 1 &\text{for } 0 < x < 0.5,\\
| |
| 0 &\text{for } x = 0.5,\\
| |
| 1/3 &\text{for } 0.5 < x < 1.
| |
| \end{cases} </math>
| |
| | |
| Alternatively, the limiting procedure may be used,
| |
| : <math> g_1(x) = \lim_{\varepsilon\to0+} \mathbb{P} ( Y \le 1/3 | x-\varepsilon \le X \le x+\varepsilon ) \, , </math>
| |
| giving the same result.
| |
| | |
| Thus, <math>P(Y\le1/3|X)=g_1(X).</math> The expectation of this random variable is equal to the (unconditional) probability, <math>E(P(Y\le1/3|X))=P(Y\le1/3),</math> namely,
| |
| : <math> 1 \cdot \mathbb{P} (X<0.5) + 0 \cdot \mathbb{P} (X=0.5) + \frac13 \cdot \mathbb{P} (X>0.5) = 1 \cdot \frac16 + 0 \cdot \frac13 + \frac13 \cdot \Big( \frac16 + \frac13 \Big) = \frac13 \, , </math>
| |
| which is an instance of the [[law of total probability]] <math>E(P(A|X))=P(A).</math>
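The limiting procedure and the total-probability check for <math>g_1</math> can be approximated on a fine grid. This is a sketch under the assumption that equally spaced midpoints in <math>(0,1)</math> stand in for the uniform variable <math>Y;</math> the names <code>limit_approx</code>, <code>GRID</code> and <code>N</code> are illustrative.

```python
def f1(y):
    """The piecewise-linear function f_1 on [0, 1]."""
    if y <= 1 / 3:
        return 3 * y
    if y <= 2 / 3:
        return 1.5 * (1 - y)
    return 0.5

def g1(x):
    """The explicit conditional probability P(Y <= 1/3 | X = x) for 0 < x < 1."""
    if x < 0.5:
        return 1.0
    if x == 0.5:
        return 0.0
    return 1 / 3

N = 300_000
GRID = [(i + 0.5) / N for i in range(N)]  # midpoints standing in for Y ~ U(0, 1)

def limit_approx(x, eps):
    """P(Y <= 1/3 | x - eps <= X <= x + eps), estimated on the grid."""
    window = [y for y in GRID if x - eps <= f1(y) <= x + eps]
    return sum(1 for y in window if y <= 1 / 3) / len(window)

# Law of total probability: the average of g1(X) approximates P(Y <= 1/3).
total = sum(g1(f1(y)) for y in GRID) / N

print(limit_approx(0.25, 0.01), limit_approx(0.75, 0.01),
      limit_approx(0.5, 0.001), total)
```

For small <math>\varepsilon</math> the three window estimates approach <math>g_1(0.25)=1,</math> <math>g_1(0.75)=1/3</math> and <math>g_1(0.5)=0.</math>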
| |
| | |
| In the case <math>f=f_2</math> the corresponding function <math>g=g_2</math> probably cannot be calculated explicitly. Nevertheless it exists, and can be computed numerically. Indeed, the [[Banach space|space]] <math>L_2(\Omega)</math> of all square integrable random variables is a [[Hilbert space]]; the indicator <math>I</math> is a vector of this space; and random variables of the form <math>g(X)</math> are a (closed, linear) subspace. The orthogonal projection of this vector to this subspace is well-defined. It can be computed numerically, using finite-dimensional approximations to the infinite-dimensional Hilbert space.
| |
| | |
| Once again, the expectation of the random variable <math>P(Y\le1/3|X)=g_2(X)</math> is equal to the (unconditional) probability, <math>E(P(Y\le1/3|X))=P(Y\le1/3),</math> namely,
| |
| : <math> \int_0^1 g_2 (f_2(y)) \, \mathrm{d}y = \frac13 \, . </math>
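A finite-dimensional approximation of the projection may be sketched as follows. Everything specific here is an assumption of the sketch, not of the article: <math>f_2</math> is replaced by a truncated sum <math>\textstyle\sum_k a^k\cos(b^k\pi y)</math> with illustrative parameters <math>a=0.5,</math> <math>b=13</math> (a truncated sum is smooth, so it only imitates the Weierstrass function), <math>Y</math> is replaced by a grid, and the subspace of all functions of <math>X</math> is replaced by functions constant on bins of <math>X.</math>

```python
import math

def f2(y, a=0.5, b=13, terms=10):
    """Truncated Weierstrass-type sum (illustrative parameters, hypothetical choice)."""
    return sum(a ** k * math.cos(b ** k * math.pi * y) for k in range(terms))

def binned_projection(n=60_000, bins=50):
    """Project the indicator of {Y <= 1/3} onto functions of X constant on bins."""
    ys = [(i + 0.5) / n for i in range(n)]  # grid standing in for Y ~ U(0, 1)
    xs = [f2(y) for y in ys]
    lo, hi = min(xs), max(xs)
    width = (hi - lo) / bins
    index = [min(int((x - lo) / width), bins - 1) for x in xs]
    count = [0] * bins
    hits = [0] * bins
    for y, j in zip(ys, index):
        count[j] += 1
        hits[j] += y <= 1 / 3
    # Bin averages of the indicator: a finite-dimensional stand-in for g_2.
    g = [hits[j] / count[j] if count[j] else 0.0 for j in range(bins)]
    # Weighted average of g over the bins approximates E(g_2(X)).
    total = sum(g[j] * count[j] for j in range(bins)) / n
    return g, total

g, total = binned_projection()
print(total)
```

Because the constant function 1 belongs to the binned subspace, the weighted average of the bin values reproduces the grid frequency of <math>Y\le1/3</math> regardless of the number of bins.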
| |
| | |
| However, the Hilbert space approach treats <math>g_2</math> as an equivalence class of functions rather than an individual function. Measurability of <math>g_2</math> is ensured, but continuity (or even [[Riemann integrability]]) is not. The value <math>g_2(0.5)</math> is determined uniquely, since the point 0.5 is an atom of the distribution of <math>X</math>. Other values <math>x</math> are not atoms, thus, corresponding values <math>g_2(x)</math> are not determined uniquely. Once again, "''the concept of a conditional probability with regard to an isolated hypothesis whose probability equals 0 is inadmissible.''" ([[Andrey Kolmogorov|Kolmogorov]]; quoted in <ref name="Pollard-5.5-122">{{harvnb|Pollard|2002|loc=Sect. 5.5, page 122}}</ref>).
| |
| | |
| Alternatively, the same function <math>g</math> (be it <math>g_1</math> or <math>g_2</math>) may be defined as the [[Radon-Nikodym derivative]]
| |
| : <math> g = \frac{ \mathrm{d}\nu }{ \mathrm{d}\mu } \, , </math>
| |
| where measures μ, ν are defined by
| |
| : <math> \begin{align}
| |
| \mu (B) &= \mathbb{P} ( X \in B ) \, , \\
| |
| \nu (B) &= \mathbb{P} ( X \in B, \, Y \le 1/3 )
| |
| \end{align} </math>
| |
| for all Borel sets <math> B \subset \mathbb R. </math> That is, μ is the (unconditional) distribution of <math>X</math>, while ν is one third of its conditional distribution,
| |
| : <math> \nu (B) = \mathbb{P} ( X \in B | Y \le 1/3 ) \mathbb{P} ( Y \le 1/3 ) = \frac13 \mathbb{P} ( X \in B | Y \le 1/3 ) \, . </math>
| |
| | |
| Both approaches (via the Hilbert space, and via the Radon-Nikodym derivative) treat <math>g</math> as an equivalence class of functions; two functions <math>g</math> and <math>g'</math> are treated as equivalent, if <math>g(X)=g'(X)</math> almost surely. Accordingly, the conditional probability <math>P(Y\le1/3|X)</math> is treated as an equivalence class of random variables; as usual, two random variables are treated as equivalent if they are equal almost surely.
| |
| | |
| ===Conditional expectation===
| |
| The conditional expectation <math>E(Y|X)</math> may be defined as the best predictor of <math>Y</math> given <math>X</math>. That is, it minimizes the mean square error <math>E(Y-h(X))^2</math> on the class of all random variables of the form <math>h(X).</math>
| |
| | |
| In the case <math>f=f_1</math> the corresponding function <math>h=h_1</math> may be calculated explicitly,<ref>
| |
| Proof:
| |
| <math> \begin{align}
| |
| & \mathbb{E} ( Y - h_1(X) )^2 = \int_0^1 \Big( y - h_1 ( f_1(y) ) \Big)^2 \, \mathrm{d}y = \\
| |
| & \int_0^{1/3} (y-h_1(3y))^2 \, \mathrm{d}y + \int_{1/3}^{2/3} \Big( y - h_1( 1.5(1-y) ) \Big)^2 \, \mathrm{d}y + \int_{2/3}^1 \Big( y - h_1(0.5) \Big)^2 \, \mathrm{d}y = \\
| |
| & \int_0^1 \Big( \frac x 3 - h_1(x) \Big)^2 \frac{ \mathrm{d}x }{ 3 } + \int_{0.5}^1 \Big( 1 - \frac{x}{1.5} - h_1(x) \Big)^2 \frac{ \mathrm{d} x }{ 1.5 } + \frac13 h_1^2(0.5) - \frac 5 9 h_1(0.5) + \frac{19}{81} = \\
| |
| & \frac13 \int_0^{0.5} \Big( h_1(x) - \frac x 3 \Big)^2 \, \mathrm{d}x + \frac13 h_1^2(0.5) - \frac 5 9 h_1(0.5) + \frac{19}{81} + \\
| |
| & \quad \frac13 \int_{0.5}^1 \bigg( \Big( h_1(x) - \frac x 3 \Big)^2 + 2 \Big( h_1(x) - 1 + \frac{2x}3 \Big)^2 \bigg) \, \mathrm{d}x \, ;
| |
| \end{align} </math>
| |
| it remains to note that <math>\textstyle (a-\frac x 3)^2 + 2(a-1+\frac{2x}3)^2 </math> is minimal at <math>\textstyle a = \frac{2-x}3, </math> and <math>\textstyle \frac13 a^2 - \frac 5 9 a </math> is minimal at <math>\textstyle a = \frac 5 6. </math>
| |
| </ref>
| |
| : <math> h_1(x) = \begin{cases}
| |
| x/3 &\text{for } 0 < x < 0.5,\\
| |
| 5/6 &\text{for } x = 0.5,\\
| |
| (2-x)/3 &\text{for } 0.5 < x < 1.
| |
| \end{cases} </math>
| |
| | |
| Alternatively, the limiting procedure may be used,
| |
| : <math> h_1(x) = \lim_{\varepsilon\to0+} \mathbb{E} ( Y | x-\varepsilon \le X \le x+\varepsilon ) \, , </math>
| |
| giving the same result.
| |
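The limiting procedure can be checked with exact rational arithmetic. The following is a minimal sketch, assuming (as read off from the proof above) that <math>Y</math> is uniform on <math>(0,1)</math> and that <math>f_1(y)</math> equals <math>3y</math> on <math>(0,1/3),</math> <math>1.5(1-y)</math> on <math>(1/3,2/3)</math> and <math>0.5</math> on <math>(2/3,1).</math> The names <code>clip</code>, <code>window_mean</code> and <code>h1</code> are illustrative only.

```python
from fractions import Fraction as F

def clip(a, b, lo, hi):
    """Intersect [a, b] with [lo, hi]; return None if nothing of positive length is left."""
    a, b = max(a, lo), min(b, hi)
    return (a, b) if a < b else None

def window_mean(x, eps):
    """E(Y | x-eps <= X <= x+eps) for Y uniform on (0,1) and X = f_1(Y) (assumed form of f_1)."""
    pieces = [
        # branch X = 3Y on 0 <= Y <= 1/3, i.e. Y = X/3
        clip((x - eps) / 3, (x + eps) / 3, F(0), F(1, 3)),
        # branch X = 1.5(1-Y) on 1/3 <= Y <= 2/3, i.e. Y = 1 - 2X/3
        clip(1 - 2 * (x + eps) / 3, 1 - 2 * (x - eps) / 3, F(1, 3), F(2, 3)),
        # flat branch X = 0.5 on 2/3 <= Y <= 1
        (F(2, 3), F(1)) if x - eps <= F(1, 2) <= x + eps else None,
    ]
    pieces = [p for p in pieces if p is not None]
    mass = sum(b - a for a, b in pieces)                   # P(window)
    moment = sum((b * b - a * a) / 2 for a, b in pieces)   # E(Y; window)
    return moment / mass

def h1(x):
    """The explicit formula for h_1 stated in the text."""
    if x < F(1, 2):
        return x / 3
    if x == F(1, 2):
        return F(5, 6)
    return (2 - x) / 3

if __name__ == "__main__":
    for x in (F(1, 4), F(1, 2), F(3, 4)):
        for eps in (F(1, 10), F(1, 1000), F(1, 10**6)):
            print(x, eps, float(window_mean(x, eps)), float(h1(x)))
```

Away from <math>x=0.5</math> the window average coincides with <math>h_1(x)</math> once the window is small enough, because every branch is linear; at the atom <math>x=0.5</math> it only converges to <math>5/6</math> as <math>\varepsilon\to0+.</math>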
| | |
| Thus, <math>E(Y|X)=h_1(X).</math> The expectation of this random variable is equal to the (unconditional) expectation, <math>E(E(Y|X))=E(Y),</math> namely,
| |
| : <math> \begin{align}
| |
| & \int_0^1 h_1(f_1(y)) \, \mathrm{d}y = \int_0^{1/6} \frac{3y}3 \, \mathrm{d}y + \\
| |
| & \quad + \int_{1/6}^{1/3} \frac{2-3y}3 \, \mathrm{d}y + \int_{1/3}^{2/3} \frac{ 2 - 1.5(1-y) }{ 3 } \, \mathrm{d}y + \int_{2/3}^1 \frac56 \, \mathrm{d}y = \frac12 \, ,
| |
| \end{align} </math>
| |
| which is an instance of the [[law of total expectation]] <math>E(E(Y|X))=E(Y).</math>
| |
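The four integrals above can be verified exactly. The sketch below copies the integrands from the displayed formula; since each integrand is linear in <math>y</math>, the trapezoid rule is exact on each piece. The helper name <code>trapezoid</code> is illustrative only.

```python
from fractions import Fraction as F

def trapezoid(g, a, b):
    """Integral of g over [a, b]; exact when g is linear on [a, b]."""
    return (b - a) * (g(a) + g(b)) / 2

# (lower limit, upper limit, integrand h_1(f_1(y))) for each piece of the formula
pieces = [
    (F(0),    F(1, 6), lambda y: 3 * y / 3),
    (F(1, 6), F(1, 3), lambda y: (2 - 3 * y) / 3),
    (F(1, 3), F(2, 3), lambda y: (2 - F(3, 2) * (1 - y)) / 3),
    (F(2, 3), F(1),    lambda y: F(5, 6)),
]

total = sum(trapezoid(g, a, b) for a, b, g in pieces)   # E(E(Y|X))
expectation_of_y = trapezoid(lambda y: y, F(0), F(1))   # E(Y) for Y uniform on (0,1)

if __name__ == "__main__":
    print(total, expectation_of_y)
```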
| | |
| In the case <math>f=f_2</math> the corresponding function <math>h=h_2</math> probably cannot be calculated explicitly. Nevertheless it exists, and can be computed numerically in the same way as <math>g_2</math> above, as the orthogonal projection in the Hilbert space. The law of total expectation holds, since the projection cannot change the scalar product with the constant function 1, which belongs to the subspace.
| |
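A finite-dimensional analogue of this projection is easy to sketch. Since <math>f_2</math> is not restated in this section, the code below takes the function as a parameter and is run on <math>f_1</math> (in the form read off from the proof above, which is an assumption of the sketch). It replaces the subspace of all functions of <math>X</math> by functions that are constant on bins of <math>X,</math> so it is only an approximation; the names <code>project_on_bins</code>, <code>n</code> and <code>m</code> are illustrative.

```python
def f1(y):
    """f_1 as read off from the proof above (an assumption of this sketch)."""
    if y <= 1 / 3:
        return 3 * y
    if y <= 2 / 3:
        return 1.5 * (1 - y)
    return 0.5

def project_on_bins(f, n=60000, m=200):
    """Project Y onto functions of X = f(Y) that are constant on m bins of [0, 1]."""
    ys = [(i + 0.5) / n for i in range(n)]          # midpoint grid standing in for Y ~ U(0,1)
    ks = [min(int(f(y) * m), m - 1) for y in ys]    # bin index of X; assumes 0 <= f(y) <= 1
    sums, counts = [0.0] * m, [0] * m
    for y, k in zip(ys, ks):
        sums[k] += y
        counts[k] += 1
    # the projection onto bin indicators is the per-bin average of Y
    means = [s / c if c else 0.0 for s, c in zip(sums, counts)]
    return ys, [means[k] for k in ks]

ys, pred = project_on_bins(f1)
n = len(ys)
mean_y = sum(ys) / n
mean_pred = sum(pred) / n                                          # discrete E(E(Y|X))
residual_dot_pred = sum((y - p) * p for y, p in zip(ys, pred)) / n  # orthogonality check
mse_pred = sum((y - p) ** 2 for y, p in zip(ys, pred)) / n
mse_const = sum((y - 0.5) ** 2 for y in ys) / n                    # best constant predictor

if __name__ == "__main__":
    print(mean_y, mean_pred, residual_dot_pred, mse_pred, mse_const)
```

The constant function 1 is constant on every bin, hence lies in the discretized subspace, and the two means agree up to rounding, which mirrors the argument for the law of total expectation.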
| | |
| Alternatively, the same function <math>h</math> (be it <math>h_1</math> or <math>h_2</math>) may be defined as the [[Radon-Nikodym derivative]]
| |
| : <math> h = \frac{ \mathrm{d}\nu }{ \mathrm{d}\mu } \, , </math>
| |
| where measures <math>\mu,\nu</math> are defined by
| |
| : <math> \begin{align}
| |
| \mu (B) &= \mathbb{P} ( X \in B ) \, , \\
| |
| \nu (B) &= \mathbb{E} ( Y ; \, X \in B )
| |
| \end{align} </math>
| |
| for all Borel sets <math> B \subset \mathbb R. </math> Here <math>E(Y;A)</math> is the restricted expectation, not to be confused with the conditional expectation <math>E(Y|A)=E(Y;A)/P(A).</math>
| |
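As an illustration in the case <math>f=f_1</math> (assuming, as in the proof above, that <math>Y</math> is uniform on <math>(0,1)</math> and <math>f_1(y)=0.5</math> for <math>2/3<y<1</math>), the one-point set <math>B=\{0.5\}</math> gives

: <math> \mu(B) = \mathbb{P} ( X = 0.5 ) = \mathbb{P} ( 2/3 \le Y \le 1 ) = \frac13 \, , \quad \nu(B) = \mathbb{E} ( Y ; X = 0.5 ) = \int_{2/3}^1 y \, \mathrm{d}y = \frac5{18} \, , \quad \frac{ \nu(B) }{ \mu(B) } = \frac56 \, , </math>

in agreement with <math>h_1(0.5)=5/6.</math>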
| | |
| ===Conditional distribution===
| |
| | |
| In the case <math>f=f_1</math> the conditional [[cumulative distribution function]] may be calculated explicitly, similarly to <math>g_1.</math> The limiting procedure gives
| |
| : <math> \begin{align}
| |
| & F_{Y|X=0.75} (y) = \mathbb{P} ( Y \le y | X = 0.75 ) = \\
| |
| & = \lim_{\varepsilon\to0+} \mathbb{P} ( Y \le y | 0.75-\varepsilon \le X \le 0.75+\varepsilon ) = \\
| |
| & = \begin{cases}
| |
| 0 &\text{for } -\infty < y < 1/4,\\
| |
| 1/6 &\text{for } y = 1/4,\\
| |
| 1/3 &\text{for } 1/4 < y < 1/2,\\
| |
| 2/3 &\text{for } y = 1/2,\\
| |
| 1 &\text{for } 1/2 < y < \infty,
| |
| \end{cases} \end{align} </math>
| |
| which cannot be correct, since a cumulative distribution function must be [[right-continuous]], and this function is not right-continuous at <math>y=1/4</math> and <math>y=1/2.</math>
| |
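The values of the limiting procedure at and between the two jump points can be reproduced exactly. This is a minimal sketch, assuming <math>Y</math> uniform on <math>(0,1)</math> and the two linear branches of <math>f_1</math> read off from the proof above; it is valid only while the window <math>[x-\varepsilon,x+\varepsilon]</math> stays inside <math>(0.5,1).</math> The name <code>window_cdf</code> is illustrative.

```python
from fractions import Fraction as F

def window_cdf(y, x, eps):
    """P(Y <= y | x-eps <= X <= x+eps), assuming 0.5 < x-eps and x+eps < 1."""
    # preimages of the window under the two branches of f_1
    branch1 = ((x - eps) / 3, (x + eps) / 3)                   # X = 3Y
    branch2 = (1 - 2 * (x + eps) / 3, 1 - 2 * (x - eps) / 3)   # X = 1.5(1-Y)

    def length_below(interval):
        """Length of the part of the interval lying at or below y."""
        a, b = interval
        return max(F(0), min(b, y) - a)

    total = sum(b - a for a, b in (branch1, branch2))
    return (length_below(branch1) + length_below(branch2)) / total

x, eps = F(3, 4), F(1, 1000)
at_quarter = window_cdf(F(1, 4), x, eps)   # value at the first jump point
between = window_cdf(F(3, 8), x, eps)      # value strictly between the jump points
at_half = window_cdf(F(1, 2), x, eps)      # value at the second jump point

if __name__ == "__main__":
    print(at_quarter, between, at_half)
```

At each jump point the symmetric window puts exactly half of the corresponding atom's mass at or below <math>y,</math> which is where the intermediate values come from.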
| | |
| This paradoxical result is explained by measure theory as follows. For a given <math>y</math> the corresponding <math>F_{Y|X=x}(y)=P(Y\le y|X=x)</math> is well-defined (via the Hilbert space or the Radon-Nikodym derivative) as an equivalence class of functions (of <math>x</math>). Treated as a function of <math>y</math> for a given <math>x,</math> it is ill-defined unless some additional input is provided. Namely, a function (of <math>x</math>) must be chosen within every (or at least almost every) equivalence class. A wrong choice leads to wrong conditional cumulative distribution functions.
| |
| | |
| A right choice can be made as follows. First, <math>F_{Y|X=x}(y)=P(Y\le y|X=x)</math> is considered for rational numbers <math>y</math> only. (Any other dense countable set may be used equally well.) Thus, only a countable set of equivalence classes is used; all choices of functions within these classes are mutually equivalent, and the corresponding function of rational <math>y</math> is well-defined (for almost every <math>x</math>). Second, the function is extended from rational numbers to real numbers by right continuity.
| |
| | |
| In general the conditional distribution is defined for almost all <math>x</math> (according to the distribution of <math>X</math>), but sometimes the result is continuous in <math>x</math>, in which case individual values are acceptable. In the considered example this is the case; the correct result for <math>x=0.75,</math>
| |
| : <math> \begin{align}
| |
| & F_{Y|X=0.75} (y) = \mathbb{P} ( Y \le y | X = 0.75 ) = \\
| |
| & = \begin{cases}
| |
| 0 &\text{for } -\infty < y < 1/4,\\
| |
| 1/3 &\text{for } 1/4 \le y < 1/2,\\
| |
| 1 &\text{for } 1/2 \le y < \infty
| |
| \end{cases} \end{align} </math>
| |
| shows that the conditional distribution of <math>Y</math> given <math>X=0.75</math> consists of two atoms, at 0.25 and 0.5, with probabilities 1/3 and 2/3 respectively.
| |
| | |
| Similarly, the conditional distribution may be calculated for all <math>x</math> in <math>(0,0.5)</math> or <math>(0.5,1).</math>
| |
| | |
| The value <math>x=0.5</math> is an atom of the distribution of <math>X</math>, thus, the corresponding conditional distribution is well-defined and may be calculated by elementary means (the denominator does not vanish); the conditional distribution of <math>Y</math> given <math>X=0.5</math> is uniform on <math>(2/3,1).</math> Measure theory leads to the same result.
| |
| | |
| The mixture of all conditional distributions is the (unconditional) distribution of <math>Y</math>.
| |
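The mixture property can be checked numerically in the case <math>f=f_1.</math> The sketch below rests on several assumptions derived from the formulas above rather than stated in the text: for <math>0<x<0.5</math> the conditional distribution is a single atom at <math>x/3;</math> for <math>0.5<x<1</math> it has atoms at <math>x/3</math> and <math>1-2x/3</math> with probabilities 1/3 and 2/3; and <math>X</math> has density 1/3 on <math>(0,0.5),</math> density 1 on <math>(0.5,1)</math> and an atom of probability 1/3 at 0.5. The names <code>cond_cdf</code> and <code>mixture_cdf</code> are illustrative.

```python
def cond_cdf(t, x):
    """Conditional CDF of Y given X = x for f_1 (derived form, an assumption of this sketch)."""
    if x < 0.5:
        return 1.0 if x / 3 <= t else 0.0              # single atom at x/3
    if x == 0.5:
        return min(1.0, max(0.0, 3 * (t - 2 / 3)))     # uniform on (2/3, 1)
    # atoms at x/3 and 1 - 2x/3 with probabilities 1/3 and 2/3
    return (1 / 3) * (x / 3 <= t) + (2 / 3) * (1 - 2 * x / 3 <= t)

def mixture_cdf(t, n=20000):
    """Integrate cond_cdf(t, x) against the distribution of X using the midpoint rule."""
    step = 0.5 / n
    left = sum(cond_cdf(t, (i + 0.5) * step) for i in range(n)) * step / 3      # density 1/3
    right = sum(cond_cdf(t, 0.5 + (i + 0.5) * step) for i in range(n)) * step   # density 1
    atom = cond_cdf(t, 0.5) / 3                                                 # P(X = 0.5) = 1/3
    return left + atom + right

if __name__ == "__main__":
    for t in (0.1, 0.25, 0.4, 0.6, 0.8, 0.95):
        print(t, mixture_cdf(t))
```

Up to discretization error the mixture reproduces the uniform distribution function <math>P(Y\le t)=t</math> on <math>(0,1).</math>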
| | |
| The conditional expectation <math>E(Y|X=x)</math> is nothing but the expectation with respect to the conditional distribution.
| |
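For example, in the case <math>f=f_1</math> the two atoms found above for <math>x=0.75</math> give

: <math> \mathbb{E} ( Y | X = 0.75 ) = \frac13 \cdot \frac14 + \frac23 \cdot \frac12 = \frac5{12} = \frac{ 2 - 0.75 }{ 3 } = h_1(0.75) \, , </math>

and the uniform distribution on <math>(2/3,1)</math> found for <math>x=0.5</math> has the mean <math>5/6=h_1(0.5).</math>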
| | |
| In the case <math>f=f_2</math> the corresponding <math>F_{Y|X=x}(y)=P(Y\le y|X=x)</math> probably cannot be calculated explicitly. For a given <math>y</math> it is well-defined (via the Hilbert space or the Radon-Nikodym derivative) as an equivalence class of functions (of <math>x</math>). The right choice of functions within these equivalence classes may be made as above; it leads to correct conditional cumulative distribution functions, and thus to conditional distributions. In general, conditional distributions need not be [[discrete probability distribution|atomic]] or [[Absolutely continuous random variable|absolutely continuous]] (nor mixtures of both types). Probably, in the considered example they are [[Singular distribution|singular]] (like the [[Cantor distribution]]).
| |
| | |
| Once again, the mixture of all conditional distributions is the (unconditional) distribution, and the conditional expectation is the expectation with respect to the conditional distribution.
| |
| | |
| ==See also==
| |
| * [[Conditional probability]]
| |
| * [[Conditional expectation]]
| |
| * [[Conditional probability distribution]]
| |
| * [[Borel's paradox]]
| |
| | |
| ==Notes==
| |
| <references />
| |
| | |
| ==References==
| |
| *{{citation|last=Durrett|first=Richard|title=Probability: theory and examples|edition=Second|year=1996}}
| |
| *{{citation|last=Pollard|first=David|title=A user's guide to measure theoretic probability|year=2002|publisher=Cambridge University Press}}
| |
| | |
| [[:Category:Probability theory]]
| |