Linear Algebra and Related Introductory Topics

Barry Kurt Moser , in Linear Models, 1996

1.3 RANDOM VECTORS

Let Y = (Y1, Y2,…, Yn)′ be an n × 1 random vector, where Yi is a random variable for i = 1,…, n. The vector Y is a random entity. Therefore, Y has an expectation; each element of Y has a variance; and any two elements of Y have a covariance (assuming the expectations, variances, and covariances exist). The following definitions and theorems describe the structure of random vectors.

Definition 1.3.1

Joint Probability Distribution: The probability distribution of the n × 1 random vector Y = (Y1,…, Yn)′ equals the joint probability distribution of Y1,…, Yn. Denote the distribution of Y by fY(y) = fY(y1,…, yn).

Definition 1.3.2

Expectation of a Random Vector: The expected value of the n × 1 random vector Y = (Y 1,…, Yn )′ is given by E(Y) = [E(Y 1),…, E(Yn )]′.

Definition 1.3.3

Covariance Matrix of a Random Vector Y: The n × 1 random vector Y = (Y 1,…, Yn )′ has n × n covariance matrix given by

cov(Y) = E{[Y − E(Y)][Y − E(Y)]′}

The ij th element of cov(Y) equals E{[Yi − E(Yi )][Yj − E(Yj )]} for i, j = 1,…, n.
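Definition 1.3.3 can be checked numerically. The following sketch (the mean vector, covariance matrix, and sample size are arbitrary illustrative choices, not from the text) estimates E{[Y − E(Y)][Y − E(Y)]′} by a sample average:

```python
import numpy as np

rng = np.random.default_rng(0)

# An assumed 3x1 random vector Y with a known covariance matrix.
Sigma_true = np.array([[2.0, 0.5, 0.0],
                       [0.5, 1.0, 0.3],
                       [0.0, 0.3, 1.5]])
mu_true = np.array([1.0, -2.0, 0.5])
Y = rng.multivariate_normal(mu_true, Sigma_true, size=200_000)  # rows are draws

# cov(Y) = E{[Y - E(Y)][Y - E(Y)]'} estimated by averaging outer products
D = Y - Y.mean(axis=0)
Sigma_hat = (D.T @ D) / len(Y)

print(np.round(Sigma_hat, 2))  # close to Sigma_true
```

The ij th entry of `Sigma_hat` approximates E{[Yi − E(Yi)][Yj − E(Yj)]}, matching the element-wise description above.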

Definition 1.3.4

Linear Transformations of a Random Vector Y: If B is an m × n matrix of constants and Y is an n × 1 random vector, then the m × 1 random vector BY represents m linear transformations of Y.

The following theorem provides the covariance matrix of linear transformations of a random vector.

Theorem 1.3.1

If B is an m × n matrix of constants, Y is an n × 1 random vector, and cov(Y) is the n × n covariance matrix of Y, then the m × 1 random vector BY has an m × m covariance matrix given by B [cov(Y)]B′.

Proof:

cov(BY) = E{[BY − E(BY)][BY − E(BY)]′} = E{B[Y − E(Y)][Y − E(Y)]′B′} = B(E{[Y − E(Y)][Y − E(Y)]′})B′ = B[cov(Y)]B′
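Theorem 1.3.1 can be illustrated by simulation; the matrix B and covariance Σ below are arbitrary choices, not from the text:

```python
import numpy as np

# Check cov(BY) = B cov(Y) B' on simulated data.
B = np.array([[1.0, 2.0, 0.0],
              [0.0, -1.0, 3.0]])          # 2 x 3 matrix of constants
Sigma = np.array([[2.0, 0.5, 0.0],
                  [0.5, 1.0, 0.3],
                  [0.0, 0.3, 1.5]])        # cov(Y)

rng = np.random.default_rng(1)
Y = rng.multivariate_normal(np.zeros(3), Sigma, size=200_000)
Z = Y @ B.T                                # each row is one realization of BY

cov_BY = np.cov(Z, rowvar=False)           # empirical 2 x 2 covariance of BY
print(np.round(cov_BY, 2))                 # close to B Sigma B'
```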

The next theorem provides the expected value of a quadratic form.

Theorem 1.3.2

Let Y be an n × 1 random vector with mean vector μ = E(Y) and n × n covariance matrix Σ = cov(Y). Then E(Y′AY) = tr(AΣ) + μ′Aμ, where A is any n × n symmetric matrix of constants.

Proof:

Since (Y − μ)′A (Y − μ) is a scalar and using Result 1.5,

(Y − μ)′A(Y − μ) = tr[(Y − μ)′A(Y − μ)] = tr[A(Y − μ)(Y − μ)′]

Therefore,

E[Y′AY] = E[(Y − μ)′A(Y − μ) + 2Y′Aμ − μ′Aμ] = E{tr[A(Y − μ)(Y − μ)′]} + 2E(Y′)Aμ − μ′Aμ = tr[A E{(Y − μ)(Y − μ)′}] + μ′Aμ = tr[AΣ] + μ′Aμ
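A Monte Carlo check of Theorem 1.3.2 is straightforward; μ, Σ, A, and the sample size below are illustrative assumptions:

```python
import numpy as np

# Verify E(Y'AY) = tr(A Sigma) + mu'A mu by simulation.
rng = np.random.default_rng(2)
mu = np.array([1.0, -1.0, 2.0])
Sigma = np.array([[1.0, 0.2, 0.0],
                  [0.2, 2.0, 0.5],
                  [0.0, 0.5, 1.0]])
A = np.array([[2.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 1.0]])            # symmetric matrix of constants

Y = rng.multivariate_normal(mu, Sigma, size=500_000)
quad = np.einsum('ni,ij,nj->n', Y, A, Y)   # Y'AY for each draw

theory = np.trace(A @ Sigma) + mu @ A @ mu
print(quad.mean(), theory)                 # the two values nearly agree
```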

The moment generating function of a random vector is used extensively in the next chapter. The following definitions and theorems provide some general moment generating function results.

Definition 1.3.5

Moment Generating Function (MGF) of a Random Vector Y: The MGF of an n × 1 random vector Y is given by

mY(t) = E(e^(t′Y))

where t = (t1,…, tn)′ is an n × 1 vector of constants, provided the expectation exists for −h < ti < h, where h > 0 and i = 1,…, n.

There is a one-to-one correspondence between the probability distribution of Y and the MGF of Y, if the MGF exists. Therefore, the probability distribution of Y can be identified if the MGF of Y can be found. The following two theorems and corollary are used to derive the MGF of a random vector Y.

Theorem 1.3.3

Let the n × 1 random vector Y = (Y′1, Y′2,…, Y′m)′, where Yi is an ni × 1 random vector for i = 1,…, m and n = Σ_{i=1}^{m} ni. Let mY(·), mY1(·),…, mYm(·) represent the MGFs of Y, Y1,…, Ym, respectively. The vectors Y1,…, Ym are mutually independent if and only if

m Y ( t ) = m Y 1 ( t 1 ) m Y 2 ( t 2 ) m Y m ( t m )

for all t = (t′ 1,…, t′ m )′ on the open rectangle around 0.

Theorem 1.3.4

If Y is an n × 1 random vector, g is an n × 1 vector of constants, and c is a scalar constant, then

mg′Y(c) = mY(cg)

Proof:

mg′Y(c) = E[e^(cg′Y)] = E[e^((cg)′Y)] = mY(cg).

Corollary 1.3.4

Let mY1(·),…, mYm(·) represent the MGFs of the independent random variables Y1,…, Ym, respectively. If Z = Σ_{i=1}^{m} Yi, then the MGF of Z is given by

m Z ( s ) = Π i = 1 m m Y i ( s )

Proof:

mZ(s) = m1′mY(s) = mY(s1m) by Theorem 1.3.4 = Π_{i=1}^{m} mYi(s) by Theorem 1.3.3, where 1m is the m × 1 vector of ones.

Moment generating functions are used in the next example to derive the distribution of the sum of independent chi-square random variables.

Example 1.3.1

Let Y1,…, Ym be m independent central chi-square random variables where Yi has ni degrees of freedom for i = 1,…, m. For any i,

mYi(t) = E(e^(tYi)) = ∫0^∞ e^(tyi) [Γ(ni/2) 2^(ni/2)]^(−1) yi^(ni/2 − 1) e^(−yi/2) dyi = [Γ(ni/2) 2^(ni/2)]^(−1) ∫0^∞ yi^(ni/2 − 1) e^(−yi(1/2 − t)) dyi = [Γ(ni/2) 2^(ni/2)]^(−1) [Γ(ni/2) (1/2 − t)^(−ni/2)] = (1 − 2t)^(−ni/2)   for t < 1/2

Let Z = Σ i = 1 m Y i . By Corollary 1.3.4,

mZ(t) = Π_{i=1}^{m} mYi(t) = (1 − 2t)^(−Σ_{i=1}^{m} ni/2)

Therefore, Σ i = 1 m Y i is distributed as a central chi-square random variable with Σ i = 1 m n i degrees of freedom.
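Example 1.3.1 can be verified by sampling: the degrees of freedom and the evaluation point t below are arbitrary choices, and the empirical MGF of the sum is compared against (1 − 2t)^(−Σ ni/2).

```python
import numpy as np

# Sum of independent chi-squares with n1,...,nm df is chi-square
# with n1 + ... + nm df; check via the MGF (1 - 2t)^(-sum(df)/2).
rng = np.random.default_rng(3)
dfs = [2, 3, 5]                                  # arbitrary degrees of freedom
Z = sum(rng.chisquare(df, size=1_000_000) for df in dfs)

t = 0.1                                          # any t < 1/2 works
mgf_empirical = np.exp(t * Z).mean()
mgf_theory = (1 - 2 * t) ** (-sum(dfs) / 2)
print(mgf_empirical, mgf_theory)                 # nearly equal
```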

The next theorem is useful when dealing with functions of independent random vectors.

Theorem 1.3.5

Let g1(Y1),…, gm(Ym) be m functions of the random vectors Y1,…, Ym, respectively. If Y1,…, Ym are mutually independent, then g1(Y1),…, gm(Ym) are mutually independent.

The next example demonstrates that the sum of squares of n independent N1(0, 1) random variables has a central chi-square distribution with n degrees of freedom.

Example 1.3.2

Let Z1,…, Zn be a random sample of normally distributed random variables with mean 0 and variance 1. Let Yi = Zi² for i = 1,…, n. The moment generating function of Yi is

mYi(t) = mZi²(t) = E(e^(tZi²)) = (2π)^(−1/2) ∫−∞^∞ e^(tzi² − zi²/2) dzi = (2π)^(−1/2) ∫−∞^∞ e^(−(1 − 2t)zi²/2) dzi = (1 − 2t)^(−1/2)   for t < 1/2

That is, each Yi has a central chi-square distribution with one degree of freedom. Furthermore, by Theorem 1.3.5, the Yi 's are independent random variables. Therefore, by Example 1.3.1, Σ_{i=1}^{n} Yi = Σ_{i=1}^{n} Zi² is a central chi-square random variable with n degrees of freedom.


URL:

https://www.sciencedirect.com/science/article/pii/B9780125084659500016

Random Variables

Sheldon Ross , in Introduction to Probability Models (Eleventh Edition), 2014

2.5.1 Joint Distribution Functions

Thus far, we have concerned ourselves with the probability distribution of a single random variable. However, we are often interested in probability statements concerning two or more random variables. To deal with such probabilities, we define, for any two random variables X and Y , the joint cumulative probability distribution function of X and Y by

F(a, b) = P{X ≤ a, Y ≤ b},   −∞ < a, b < ∞

The distribution of X can be obtained from the joint distribution of X and Y as follows:

FX(a) = P{X ≤ a} = P{X ≤ a, Y < ∞} = F(a, ∞)

Similarly, the cumulative distribution function of Y is given by

FY(b) = P{Y ≤ b} = F(∞, b)

In the case where X and Y are both discrete random variables, it is convenient to define the joint probability mass function of X and Y by

p ( x , y ) = P { X = x , Y = y }

The probability mass function of X may be obtained from p ( x , y ) by

pX(x) = Σ_{y: p(x,y) > 0} p(x, y)

Similarly,

pY(y) = Σ_{x: p(x,y) > 0} p(x, y)
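The marginal mass functions can be computed directly from a joint pmf. A minimal sketch, using two fair dice as an assumed joint distribution (uniform over the 36 outcomes):

```python
from fractions import Fraction

# Joint pmf p(x, y) of two fair dice: each of the 36 pairs has mass 1/36.
p = {(x, y): Fraction(1, 36) for x in range(1, 7) for y in range(1, 7)}

def p_X(x):
    # p_X(x) = sum over y with p(x, y) > 0 of p(x, y)
    return sum(v for (a, b), v in p.items() if a == x and v > 0)

def p_Y(y):
    # p_Y(y) = sum over x with p(x, y) > 0 of p(x, y)
    return sum(v for (a, b), v in p.items() if b == y and v > 0)

print(p_X(3), p_Y(5))  # each marginal is 1/6 for a fair die
```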

We say that X and Y are jointly continuous if there exists a function f ( x , y ) , defined for all real x and y , having the property that for all sets A and B of real numbers

P{X ∈ A, Y ∈ B} = ∫B ∫A f(x, y) dx dy

The function f ( x , y ) is called the joint probability density function of X and Y . The probability density of X can be obtained from a knowledge of f ( x , y ) by the following reasoning:

P{X ∈ A} = P{X ∈ A, Y ∈ (−∞, ∞)} = ∫A ∫−∞^∞ f(x, y) dy dx = ∫A fX(x) dx

where

fX(x) = ∫−∞^∞ f(x, y) dy

is thus the probability density function of X . Similarly, the probability density function of Y is given by

fY(y) = ∫−∞^∞ f(x, y) dx

Because

F(a, b) = P{X ≤ a, Y ≤ b} = ∫−∞^a ∫−∞^b f(x, y) dy dx

differentiation yields

∂²F(a, b)/∂a∂b = f(a, b)

Thus, as in the single variable case, differentiating the probability distribution function gives the probability density function.

A variation of Proposition 2.1 states that if X and Y are random variables and g is a function of two variables, then

E[g(X, Y)] = Σy Σx g(x, y) p(x, y)   in the discrete case, and
E[g(X, Y)] = ∫−∞^∞ ∫−∞^∞ g(x, y) f(x, y) dx dy   in the continuous case

For example, if g ( X , Y ) = X + Y , then, in the continuous case,

E[X + Y] = ∫−∞^∞ ∫−∞^∞ (x + y) f(x, y) dx dy = ∫−∞^∞ ∫−∞^∞ x f(x, y) dx dy + ∫−∞^∞ ∫−∞^∞ y f(x, y) dx dy = E[X] + E[Y]

where the first integral is evaluated by using the variation of Proposition 2.1 with g ( x , y ) = x , and the second with g ( x , y ) = y .

The same result holds in the discrete case and, combined with the corollary in Section 2.4.3, yields that for any constants a , b

(2.10) E [ aX + bY ] = aE [ X ] + bE [ Y ]

Joint probability distributions may also be defined for n random variables. The details are exactly the same as when n = 2 and are left as an exercise. The corresponding result to Equation (2.10) states that if X 1 , X 2 , , X n are n random variables, then for any n constants a 1 , a 2 , , a n ,

(2.11) E [ a 1 X 1 + a 2 X 2 + + a n X n ] = a 1 E [ X 1 ] + a 2 E [ X 2 ] + + a n E [ X n ]

Example 2.29

Calculate the expected sum obtained when three fair dice are rolled.

Solution:  Let X denote the sum obtained. Then X = X 1 + X 2 + X 3 where X i represents the value of the i th die. Thus,

E[X] = E[X1] + E[X2] + E[X3] = 3(7/2) = 21/2
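A quick simulation confirms Example 2.29; the seed and trial count are arbitrary:

```python
import random

# Expected sum of three fair dice is 21/2 = 10.5.
random.seed(2)
trials = 100_000
total = sum(random.randint(1, 6) + random.randint(1, 6) + random.randint(1, 6)
            for _ in range(trials))
print(total / trials)  # close to 10.5
```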

Example 2.30

As another example of the usefulness of Equation (2.11), let us use it to obtain the expectation of a binomial random variable having parameters n and p . Recalling that such a random variable X represents the number of successes in n trials when each trial has probability p of being a success, we have

X = X 1 + X 2 + + X n

where

X i = 1 , if the i th trial is a success 0 , if the i th trial is a failure

Hence, X i is a Bernoulli random variable having expectation E [ X i ] = 1 ( p ) + 0 ( 1 - p ) = p . Thus,

E [ X ] = E [ X 1 ] + E [ X 2 ] + + E [ X n ] = np

This derivation should be compared with the one presented in Example 2.17.  

Example 2.31

At a party N men throw their hats into the center of a room. The hats are mixed up and each man randomly selects one. Find the expected number of men who select their own hats.

Solution:  Letting X denote the number of men that select their own hats, we can best compute E [ X ] by noting that

X = X 1 + X 2 + + X N

where

X i = 1 , if the i th man selects his own hat 0 , otherwise

Now, because the i th man is equally likely to select any of the N hats, it follows that

P{Xi = 1} = P{i th man selects his own hat} = 1/N

and so

E[Xi] = 1 · P{Xi = 1} + 0 · P{Xi = 0} = 1/N

Hence, from Equation (2.11) we obtain

E[X] = E[X1] + ⋯ + E[XN] = N(1/N) = 1

Hence, no matter how many people are at the party, on the average exactly one of the men will select his own hat.  
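The conclusion of Example 2.31 is easy to check by simulation; N = 50 and the trial count are arbitrary choices:

```python
import random

# Expected number of own-hat matches is 1, regardless of N.
random.seed(0)
N, trials = 50, 20_000
total = 0
for _ in range(trials):
    hats = list(range(N))
    random.shuffle(hats)                     # man i receives hat hats[i]
    total += sum(i == h for i, h in enumerate(hats))

print(total / trials)  # close to 1
```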

Example 2.32

Suppose there are 25 different types of coupons and suppose that each time one obtains a coupon, it is equally likely to be any one of the 25 types. Compute the expected number of different types that are contained in a set of 10 coupons.

Solution:  Let X denote the number of different types in the set of 10 coupons. We compute E [ X ] by using the representation

X = X 1 + + X 25

where

X i = 1 , if at least one type i coupon is in the set of 10 0 , otherwise

Now,

E[Xi] = P{Xi = 1} = P{at least one type i coupon is in the set of 10} = 1 − P{no type i coupons are in the set of 10} = 1 − (24/25)^10

where the last equality follows since each of the 10 coupons will (independently) not be a type i with probability 24/25. Hence,

E[X] = E[X1] + ⋯ + E[X25] = 25[1 − (24/25)^10]
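The answer to Example 2.32 can be evaluated exactly and checked by simulation (seed and trial count are arbitrary):

```python
import random

# Expected number of distinct types among 10 coupons drawn uniformly
# from 25 types: 25 * (1 - (24/25)**10).
expected = 25 * (1 - (24 / 25) ** 10)

random.seed(1)
trials = 50_000
total = sum(len({random.randrange(25) for _ in range(10)}) for _ in range(trials))
print(total / trials, expected)  # both near 8.38
```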


URL:

https://www.sciencedirect.com/science/article/pii/B9780124079489000025

RANDOM VARIABLES AND EXPECTATION

Sheldon M. Ross , in Introduction to Probability and Statistics for Engineers and Scientists (Fourth Edition), 2009

SOLUTION

We start by determining the distribution function of X/Y. For a > 0

FX/Y(a) = P{X/Y ≤ a} = ∫∫_{x/y ≤ a} f(x, y) dx dy = ∫∫_{x/y ≤ a} e^(−x) e^(−y) dx dy = ∫0^∞ ∫0^(ay) e^(−x) e^(−y) dx dy = ∫0^∞ (1 − e^(−ay)) e^(−y) dy = [−e^(−y) + e^(−(a+1)y)/(a + 1)] |0^∞ = 1 − 1/(a + 1)

Differentiation yields that the density function of X/Y is given by fX/Y(a) = 1/(a + 1)², 0 < a < ∞.

We can also define joint probability distributions for n random variables in exactly the same manner as we did for n = 2. For instance, the joint cumulative probability distribution function F (a 1, a 2,…, an ) of the n random variables X 1, X 2,…, Xn is defined by

F ( a 1 , a 2 , , a n ) = P { X 1 a 1 , X 2 a 2 , , X n a n }

If these random variables are discrete, we define their joint probability mass function p(x1, x2,…, xn) by

p ( x 1 , x 2 , , x n ) = P { X 1 = x 1 , X 2 = x 2 , , X n = x n }

Further, the n random variables are said to be jointly continuous if there exists a function f (x 1, x 2,…, xn ), called the joint probability density function, such that for any set C in n -space

P { ( X 1 , X 2 , , X n ) C } = ( x 1 , , x n ) C f ( x 1 , , x n ) d x 1 d x 2 d x n

In particular, for any n sets of real numbers A 1, A 2,…, A n

P{X1 ∈ A1, X2 ∈ A2,…, Xn ∈ An} = ∫An ∫An−1 ⋯ ∫A1 f(x1,…, xn) dx1 dx2 ⋯ dxn

The concept of independence may, of course, also be defined for more than two random variables. In general, the n random variables X 1, X 2,…, Xn are said to be independent if, for all sets of real numbers A 1, A 2,…, An ,

P { X 1 A 1 , X 2 A 2 , , X n A n } = i = 1 n P { X i A i }

As before, it can be shown that this condition is equivalent to

P{X1 ≤ a1, X2 ≤ a2,…, Xn ≤ an} = Π_{i=1}^{n} P{Xi ≤ ai}   for all a1, a2,…, an

Finally, we say that an infinite collection of random variables is independent if every finite subcollection of them is independent.


URL:

https://www.sciencedirect.com/science/article/pii/B9780123704832000096

Graphical Models

R.G. Almond , in International Encyclopedia of Education (Third Edition), 2010

Factorization and Conditional Independence

A graphical model is a joint probability distribution over a collection of variables that can be factored according to the cliques of an undirected graph. Let G = (ν, ε) be a graph whose nodes correspond to the variables in the model, and let 𝒞 be the set of cliques in the graph. Let v be an instantiation of the values in ν and let vC be the corresponding set of values for the variables in C ⊆ ν. Then we can write the joint probability distribution as:

[1] p(v) = Π_{C ∈ 𝒞} ϕC(vC)

In this equation, p(v) should be interpreted as a probability mass function if all of the variables in ν are discrete, and a density function if any of the variables are continuous. The factors, ϕ C(·) are called potentials (sometimes Gibbs potentials; the term comes from statistical physics). The potentials can be probability distributions, conditional probability distributions, or products of any of the above. Generally, they are not interpreted directly and often they are left unnormalized.

The original use of the term graphical model was to distinguish a subset of the possible log-linear models for contingency tables that could be represented as a graph. Consider the set of possible models over three discrete variables A, B, and C. The models [A][B][C], [A][BC], [AB][BC], and [ABC] (and ones that are equivalent after relabeling the variables) are all graphical – the factors correspond to the cliques. The no-three-way-interaction model [AB][BC][CA] is not graphical because the corresponding clique would be [ABC] (the three factors do not correspond to cliques). The difference in the substantive interpretation of the no-three-way-interaction model and the three-way interaction model is quite subtle; thus, restricting consideration to graphical models does not greatly limit the interpretability. Although the no-three-way-interaction model has fewer parameters, it also is more difficult to work with, requiring iterative solutions to fit the model to data.

Restricting the variables to have a multivariate normal distribution creates a set of models corresponding to covariance selection models. (Note that for normally distributed data, the pairwise correlations define the joint distribution, so all models can be expressed with a graph, although it might be the saturated graph in which all nodes are connected.) In fact, there are three classes of graphical models for which most computations can be done exactly: (1) models in which all variables are discrete (graphical log-linear models), (2) multivariate normal distributions, and (3) finite mixtures of multivariate normal distributions. The graphical notation can be used to express other kinds of models, but operations on those kinds of models frequently involves integrals that must be solved numerically. (For example, the standard unidimensional item response theory model can be drawn as a graphical model with the observable outcome variables conditionally independent given the latent ability variable. However, the model is not a conditional Gaussian distribution; therefore estimating examinee ability using this model requires numeric integration.)

The Markov property of a graphical model states that if X and Y are variables in the graph, and S is a set that separates X from Y in the graph, then X is conditionally independent of Y given S, sometimes written XY|S. This Markov property drives a number of efficient algorithms for computation, as well as aiding in the interpretation of the model. Probability distributions with this property are sometimes called Markov random fields.
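The Markov property can be observed concretely on the smallest interesting case: a path graph A – B – C, whose cliques are {A, B} and {B, C}. In this sketch (the potential tables are arbitrary positive values, not from the text), a joint distribution built as ϕAB·ϕBC satisfies A ⊥ C | B, since B separates A from C:

```python
import numpy as np

# Factor a joint over three binary variables along the cliques of A - B - C.
rng = np.random.default_rng(4)
phi_AB = rng.uniform(0.1, 1.0, size=(2, 2))   # unnormalized potential on {A,B}
phi_BC = rng.uniform(0.1, 1.0, size=(2, 2))   # unnormalized potential on {B,C}

p = np.einsum('ab,bc->abc', phi_AB, phi_BC)   # p[a,b,c] = phi_AB[a,b]*phi_BC[b,c]
p /= p.sum()                                  # normalize the joint

# For each value b, P(A, C | B=b) should equal P(A | B=b) P(C | B=b).
for b in range(2):
    joint_b = p[:, b, :] / p[:, b, :].sum()
    pa = joint_b.sum(axis=1, keepdims=True)
    pc = joint_b.sum(axis=0, keepdims=True)
    print(np.allclose(joint_b, pa * pc))      # True: A and C independent given B
```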

Consider the well-known example of Simpson's paradox in which a university has several departments, each of which admits a different proportion of their applicants, and each of which has a different proportion of women in their applicant pool. Further assume that each department is completely gender neutral in their admissions decisions. The graph shown in Figure 2 illustrates these assumptions. The nodes gender and acceptance are connected in the graph, so they are marginally dependent. Depending on whether women prefer more- or less-selective departments, the university will appear biased against or toward women. The graph tells the story visually: it is the preference of women for more (or less) selective departments that causes the marginal dependency.

Figure 2. A graphical model illustrating Simpson's paradox.

A fundamental result in the theory of graphical models is that factorization (according to a graph) implies the Markov property (with respect to that graph) and vice versa. This is sometimes known as the Gibbs–Markov equivalence theorem. Going from the factorization to the Markov property requires no special assumptions, but going from the conditional independence statements to the factorization requires that the probability for all configurations of the variables be positive. This can present a problem if there is a deterministic constraint between two or more variables; however, in practical circumstances, it can usually be made to work.

The Gibbs–Markov equivalence property has particular salience when building graphical models from expert opinion. The usual practice is to first elicit the conditional independence properties, then specify the probabilities for the factors defined by the cliques of the graph.


URL:

https://www.sciencedirect.com/science/article/pii/B9780080448947013348

Crystal Structure Determination

R. Spagna , in Encyclopedia of Condensed Matter Physics, 2005

Probability Methods

Probability methods were introduced by Hauptman and Karle, and led to the joint probability distributions of a set of normalized structure factors on the basis that the atomic coordinates were the primitive random variables, uniformly and independently distributed, while the reciprocal vectors were assumed to be fixed. It may also be assumed that the reciprocal vectors are the primitive random variables while the crystal structure is fixed. The probabilistic approach yielded formulas to calculate phases or combinations of phases which are s.i.'s from the |E h| alone. Considering the most important class of s.i.'s, the three-phase s.i. (triplets), the associated distribution derived by Cochran for a non-centrosymmetric structure is given by:

P ( Φ hk ) = 1 L exp ( G hk cos Φ hk )

where L is a normalization factor and for equal atoms

Ghk = (2/√N) |Eh Ek Eh−k|

where N is the number of atoms in the unit cell.

In Figure 13, the probability distributions for different values of the parameter G hk are shown to have a maximum at Φ hk equal to zero, and the variance decreases as G hk increases.

Figure 13. Trends of the probability distributions for different values of the parameter G hk .

In centrosymmetric crystals, the relation becomes:

S(h) S(k) S(h − k) ≅ +

where S( h ) stands for the sign of the reflection h and the symbol ≅ stands for "probably equal." In this case, the basic conditional formula for sign determination is given by Cochran and Woolfson:

P+ = 1/2 + (1/2) tanh(N^(−1/2) |Eh Ek Eh−k|)

The larger the absolute value of the argument of tanh, the more reliable is the sign indication.

For fixed h, let the vector k range over the set of known values Ek Eh−k; the total probability distribution of φh is then given by the product of the single distributions, increasing the reliability of the estimate.

Calculation of the conditional joint probability distributions for φ h , under the condition that several phases φ k and φ h k are known, led to the tangent formula:

tan ϕ h = k | E k E h k | sin ( ϕ k + ϕ h k ) k | E k E h k | cos ( ϕ k + ϕ h k )

in which the summations are taken over the same sample of reciprocal vectors k .
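The tangent-formula update can be sketched numerically. In this example the |Ek Eh−k| magnitudes and phase sums are synthetic placeholder values, not from the text; the new phase φh is recovered with atan2 so that the quadrant is preserved:

```python
import math

# Each contribution is a pair (|E_k E_{h-k}|, phi_k + phi_{h-k}).
contribs = [
    (2.1, 0.30),
    (1.7, 0.55),
    (0.9, -0.10),
]

# tan(phi_h) = sum w sin(.) / sum w cos(.); use atan2 for the quadrant.
num = sum(w * math.sin(p) for w, p in contribs)
den = sum(w * math.cos(p) for w, p in contribs)
phi_h = math.atan2(num, den)
print(phi_h)   # weighted-average phase estimate in (-pi, pi]
```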

Formulations of the "nested neighborhood principle" and of the "representation theory" gave the basis to obtain joint probability distributions which improve the estimates for s.i.'s. Furthermore, the latter formulated precise general rules for identifying the phasing magnitudes using the space-group symmetry. More complex formulas were also derived that can give indications of Φhk away from 0°.

Modified tangent techniques are used in almost all computer programs for the phase determination process.


URL:

https://www.sciencedirect.com/science/article/pii/B0123694019004253

Stochastic Dynamics

Don Kulasiri , Wynand Verwoerd , in North-Holland Series in Applied Mathematics and Mechanics, 2002

2.8 Relationship between White Noise and Brownian Motion

Consider a stochastic process X(t, ω) having a stationary joint probability distribution and E(X(t, ω)) = 0, i.e., the mean value of the process is zero. The Fourier transform of Var(X(τ, ω)) can be written as

(2.23) S(λ, ω) = (1/2π) ∫−∞^∞ Var(X(τ, ω)) e^(−iλτ) dτ

S(λ,ω) is called the spectral density of the process X(t,ω) and is also a function of angular frequency λ. The inverse of the Fourier transform is given by

(2.24) Var(X(τ, ω)) = ∫−∞^∞ S(λ, ω) e^(iλτ) dλ

and when τ = 0,

(2.25) Var(X(0, ω)) = ∫−∞^∞ S(λ, ω) dλ.

Therefore, the variance of X(0, ω) is the area under the graph of the spectral density S(λ, ω) against λ:

(2.26) Var(X(0, ω)) = E(X²(0, ω)),

because E(X(t, ω)) = 0.

The spectral density S(λ, ω) is considered as the "average power" per unit frequency at λ, which gives rise to the variance of X(t, ω) at τ = 0. If the average power is a constant, which means that the power is distributed uniformly across the frequency spectrum, as is the case for white light, then X(t, ω) is called white noise. White noise is often used to model independent random disturbances in engineering systems, and the increments of Brownian motion have the same characteristics as white noise. Therefore white noise ζ(t) is defined as

ζ(t) = dB(t)/dt,

or, equivalently,

(2.27) dB(t) = ζ(t) dt.

We will use this relationship to formulate stochastic differential equations.
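The relation dB(t) = ζ(t) dt can be illustrated by simulating Brownian increments and checking that they behave like white noise: zero mean, variance dt, and no correlation across non-overlapping steps. The step size dt and sample size are arbitrary choices:

```python
import numpy as np

# Brownian increments over steps of length dt are N(0, dt) and independent.
rng = np.random.default_rng(5)
dt, n = 0.01, 200_000
dB = rng.normal(0.0, np.sqrt(dt), size=n)

print(dB.mean())                            # near 0
print(dB.var())                             # near dt
print(np.corrcoef(dB[:-1], dB[1:])[0, 1])   # near 0: uncorrelated steps
```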


URL:

https://www.sciencedirect.com/science/article/pii/S0167593102800031

Financial Markets with Continuous Time

Yuliya Mishura , in Finance Mathematics, 2016

2.5.3 Pricing look-back options in the Black-Scholes model

In order to evaluate the look-back option in some market model, we need to know the joint probability distribution of the final value ST of the underlying asset's price and the minimal value STmin and/or maximal value STmax of the underlying asset's price on the interval [0, T]. Unfortunately, the joint distribution of these three random variables has a simple form only in the Bachelier model: if W is a Wiener process, and WTmin and WTmax are its minimal and maximal values on the interval [0, T], then the joint distribution, for a ≤ x ≤ b, is given by the following formula:

P(a ≤ WTmin ≤ WTmax ≤ b, WT ∈ dx) = (2πT)^(−1/2) Σ_{k=−∞}^{∞} [exp(−yk²(x)/(2T)) − exp(−(yk(x) − 2b)²/(2T))] dx,

where yk(x) = x + 2k(b − a). For the Black-Scholes model, there is no such explicit formula. However, there is a fairly simple formula for the joint distribution of the final value and maximal value (and for the joint distribution of the final value and minimal value), which is sufficient for look-back options and barrier options. Let Xt = Wt + μt + x be a Wiener process with drift coefficient μ and initial value x, let XTmax = max_{t ∈ [0,T]} Xt be its maximal value on the interval [0, T], and let tmax be the random point where this maximal value is achieved. Then, for z ∨ x ≤ y, the joint distribution is given by the formula

P(XT ∈ dz, XTmax ∈ dy, tmax ∈ dt) = [(y − x)(y − z)/(π √(t³(T − t)³))] × exp(−(y − x)²/(2t) − (y − z)²/(2(T − t)) − μ(x − z) − μ²T/2) dz dy dt

=: pT,x,μ(z, y, t)dz dy dt.

Therefore, the joint probability density of XT and XTmax equals ∫0^T pT,x,μ(z, y, t) dt. The joint probability density of XT and XTmin is readily determined from this formula if we consider the process −Xt, which is also a Wiener process with a constant drift coefficient. Consider the Black-Scholes model of the financial market. Recall that w.r.t. the martingale measure this model has the form Bt = e^(rt), St = S0 e^((r − σ²/2)t + σWt). Let the look-back derivative have the payoff function C = g(ST, STmax). Taking into account that log St is a Wiener process with the drift coefficient μ = r − σ²/2 and initial value x = log S0, we can determine the price of the corresponding discounted option by the formula

πD = E(e^(−rT)C) = e^(−rT) ∫∫ g(e^z, e^y) [∫0^T pT,x,μ(z, y, t) dt] dz dy = e^(−rT) ∫∫ g(z, y)(zy)^(−1) [∫0^T pT,x,μ(log z, log y, t) dt] dz dy.

For specific options, this formula can be converted and we can obtain rather simple expressions. In particular, the look-back call option price with floating strike equals

S0 Φ(a1(S0, S0)) − S0 e^(−rT) Φ(a2(S0, S0)) + S0 (σ²/(2r)) [Φ(a1(S0, S0)) − e^(−rT) Φ(a3(S0, S0))],

where for x > 0 and y > 0

a1(x, y) = [log(x/y) + (r + σ²/2)T] / (σ√T),
a2(x, y) = [log(x/y) + (r − σ²/2)T] / (σ√T) = a1(x, y) − σ√T,
a3(x, y) = [log(x/y) − (r − σ²/2)T] / (σ√T) = a1(x, y) − 2r√T/σ,

and Φ is, as usual, a standard Gaussian distribution function. The look-back call option price with fixed strike equals

S0 Φ(a1(S0, K)) − K e^(−rT) Φ(a2(S0, K)) + S0 (σ²/(2r)) [Φ(a1(S0, K)) − e^(−rT) (K/S0)^(2r/σ²) Φ(a3(S0, K))]

for S0 ≤ K, and

(S0 − K) e^(−rT) + S0 Φ(a1(S0, S0)) − S0 e^(−rT) Φ(a2(S0, S0)) + S0 (σ²/(2r)) [Φ(a1(S0, S0)) − e^(−rT) Φ(a3(S0, S0))]

for S0 > K. It is easy to understand the dependence of the option value on K and S0: for S0 > K, the option is "in the money". Other formulas for look-back options, including the pricing formula at any time, are discussed in [MUS 02].
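The arguments a1, a2, a3 above are simple to implement; the following sketch assumes the reconstructed definitions given here, and the parameter values r, σ, T and the sample arguments are illustrative, not from the text:

```python
import math

def a1(x, y, r, sigma, T):
    return (math.log(x / y) + (r + sigma**2 / 2) * T) / (sigma * math.sqrt(T))

def a2(x, y, r, sigma, T):
    # a2 = a1 - sigma * sqrt(T)
    return a1(x, y, r, sigma, T) - sigma * math.sqrt(T)

def a3(x, y, r, sigma, T):
    # a3 = a1 - 2 r sqrt(T) / sigma
    return a1(x, y, r, sigma, T) - 2 * r * math.sqrt(T) / sigma

r, sigma, T = 0.05, 0.2, 1.0
print(a1(100, 100, r, sigma, T))   # the log term vanishes when x = y: ~0.35
```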


URL:

https://www.sciencedirect.com/science/article/pii/B9781785480461500028

OCKHAM'S RAZOR, TRUTH, AND INFORMATION

Kevin T. Kelly , in Philosophy of Information, 2008

5 SOME EXAMPLES

For some guidance in the general developments that follow, consider some familiar examples.

Polynomial structures. Let S be a finite set of natural numbers and suppose that the truth is some unknown polynomial law:

y = f(x) = Σ_{i ∈ S} ai x^i,

where for each i ∈ S, ai ≠ 0. Say that S is the structure of the law, as it determines the form of the law as it would be written in a textbook. Suppose that the problem is to infer the true structure S of the law. It is implausible to suppose that for a given value of the independent variable x one could observe the exact value of the dependent variable y, so suppose that for each queried value of x at stage k of inquiry, the scientist receives an arbitrarily small, open interval around the corresponding value of y and that repeated queries of x result in an infinite sequence of open intervals converging to {y}.

It is impossible to be sure that one has selected S correctly by any finite time, since there may be some i ∈ S such that ai is set to a very small value in f, making it appear that the monomial ai x^i is missing from f. Ockham's razor urges the conclusion that i ∉ S until the corresponding monomial is noticed in the data.

There is a connection between the complexity of the true polynomial structure and what scientists and engineers call effects. Suppose that S0 = {0}, so for some a0 > 0, f0(x) = a0. Let experience e0 present a finite sequence of interval observations of the sort just described for f0. Then there is a bit of wiggle room in each such interval, so that for some suitably small a1 > 0, the curve f1(x) = a1x + a0 of form S1 = {0, 1} is compatible with e0. Eventually, some open interval around y = a0 is presented that excludes f0. Call such information a first-order effect. If e1 extends that information and presents an arbitrary, finite number of shrinking, open intervals around f1, then, again, there exists a suitably small a2 > 0 such that f2(x) = a2x² + a1x + a0 of form S2 = {0, 1, 2} passes through each of the intervals presented in e1. Eventually, the intervals tighten so that no linear curve passes between them. Call such information a second-order effect, and so forth. The number of effects presented by a world corresponds to the cardinality of S, so there is a correspondence between empirical effects and empirical complexity. A general account of empirical effects is provided in Section 16 below.

Linear dependence. Suppose that the truth is a multivariate linear law

y = f(x) = Σ_{i ∈ S} ai xi,

where for each i ∈ S, ai ≠ 0. Again, the problem is to infer the structure S of f. Let the data be presented as in the preceding example. As before, it seems that complexity corresponds with the cardinality of S, which is connected, in turn, to the number of effects presented by nature if f is true.

Conservation laws. Consider an idealized version of explaining reactions with conservation laws, as in the theory of elementary particles [Schulte, 2001; Valdez-Perez, 1996]. Suppose that there are n observable types of particles, and it is assumed that they interact so as to conserve n distinct quantities. In other words, each particle of type Pi carries a specific amount of each of the conserved quantities and for each of the conserved quantities, the total amount of that quantity going into an arbitrary reaction must be the total amount that emerges. Usually, one thinks of a reaction in terms of inputs and outputs; e.g.,

r = (p1, p1, p1, p2, p2, p3 → p1, p1, p2, p3, p3).

One can represent the inputs by a vector in which entry i is the number of input particles of type pi in r, and similarly for the output:

a = (3, 2, 1);   b = (2, 1, 2);   r = (a, b).

A quantity q (e.g., mass or spin) is an assignment of real numbers to particle types, as in q = (1, 0, 1), which says that particles of types p1 and p3 each carry a unit of q and p2 carries none. Quantity q is conserved in r just in case the total q in is the total q out. That condition is just:

Σ_{i=1}^{3} qi ai = Σ_{i=1}^{3} qi bi,

or, in vector notation,

q · a = q · b,

which is equivalent to:

q · (a − b) = 0.

Since reaction r enters the condition for conservation solely as the vector difference a – b, there is no harm, so far as conservation is concerned, in identifying reaction r with the difference vector:

r = a − b = (1, 1, −1).

Then the condition for r conserving q can be rewritten succinctly as:

q · r = 0,

which is the familiar condition for geometrical orthogonality of q with r. Thus, the reactions that preserve quantity q are precisely the integer-valued vectors orthogonal to q. In this example, r does conserve q, for:

(1, 0, 1) · (1, 1, −1) = 1 + 0 − 1 = 0.

But so do reactions u = (1, 0, −1) and v = (0, 1, 0), which are linearly independent. Since the subspace of vectors orthogonal to q is two-dimensional, every reaction that conserves q is a linear combination of u and v (e.g., r = u + v). If the only conserved quantity were q, then it would be strange to observe only scalar multiples of r. In that case, one would expect that the possible reactions are constrained by some other conserved quantity linearly independent of q, say q' = (0, 1, 1). Now the possible reactions lie along the intersection of the planes respectively orthogonal to q and q', which are precisely the scalar multiples of r. Notice that any two linearly independent quantities orthogonal to r would suffice — the quantities, themselves, are not uniquely determined.
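
The orthogonality claims above are easy to verify numerically. The following sketch (using numpy, with the vectors taken from the text) checks that r, u, and v all conserve q, that r = u + v, and that the reactions conserving both q and q′ form a one-dimensional space spanned by r.

```python
import numpy as np

q  = np.array([1, 0, 1])     # conserved quantity q from the text
qp = np.array([0, 1, 1])     # second quantity q', linearly independent of q
r  = np.array([1, 1, -1])    # reaction r = a - b
u  = np.array([1, 0, -1])    # linearly independent reactions conserving q
v  = np.array([0, 1, 0])

# A reaction conserves a quantity exactly when their dot product is zero.
assert q @ r == 0 and q @ u == 0 and q @ v == 0
assert np.array_equal(u + v, r)          # r is a linear combination of u, v

# Reactions conserving both q and q' form the null space of the matrix with
# rows q and q'; its dimension is 3 - rank = 1, and r lies in it, so every
# such reaction is a scalar multiple of r.
M = np.vstack([q, qp])
assert np.linalg.matrix_rank(M) == 2
assert qp @ r == 0
```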

Now suppose that the problem is to determine how many quantities are conserved, assuming that some conservation theory is true and that every possible reaction is observed, eventually. Let an "effect" be the observation of a reaction linearly independent of the reactions seen so far. As in the preceding applications, effects may appear at any time but cannot be taken back after they occur and the correct answer is uniquely determined by the (finite) number of effects that occur.

In this example, favoring the answer that corresponds to the fewest effects corresponds to positing the greatest possible number of conserved quantities, which corresponds to physical practice (cf. [Ford, 1963]). In this case, simplicity intuitions are consonant with testability and explanation, but run counter to minimization of free parameters (posited conserved quantities).
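
Under these assumptions, counting effects amounts to computing the rank of the matrix of observed reactions, and positing the greatest possible number of conserved quantities means positing n minus that rank. A minimal sketch (the function name and interface are illustrative, not from the text):

```python
import numpy as np

def count_effects(observed_reactions, n_types):
    """Number of 'effects' seen so far = rank of the matrix of observed
    reaction vectors; Ockham then posits n_types - rank conserved
    quantities, the greatest number still compatible with the data."""
    R = np.array(observed_reactions).reshape(-1, n_types)
    effects = int(np.linalg.matrix_rank(R)) if len(R) else 0
    return effects, n_types - effects

# With only r = (1, 1, -1) observed, one effect has appeared, so Ockham
# posits 3 - 1 = 2 conserved quantities (e.g., q and q' from the text).
assert count_effects([[1, 1, -1]], 3) == (1, 2)

# Observing u = (1, 0, -1) is a new effect; only one quantity remains.
assert count_effects([[1, 1, -1], [1, 0, -1]], 3) == (2, 1)
```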

Discovering causal structure. If one does not have access to experimental data, due to cost, feasibility, or ethical considerations, one must base one's policy recommendations on purely observational data. In spite of the usual advice that correlation does not imply causation, sometimes it does. The following setup is based upon [Spirtes et al., 2000]. Let V be a finite set of empirical variables. A causal structure associates with each unordered pair of variables {X, Y} one of the following statements:

X → Y;   X ← Y;   X ∥ Y;

interpreted, respectively, as X is a direct cause of Y, Y is a direct cause of X, and X and Y have no direct causal connection. The first two cases are direct causal connections and the third denies such a connection. A causal structure can, therefore, be presented as a directed, acyclic graph (DAG) in which variables are vertices and arrows are direct causal connections. The notation X — Y means that there is a direct connection in either direction between X and Y, without specifying which. A partially oriented graph with such ambiguous edges is understood, for present purposes, to represent the disjunction of the structures that result from specifying them in each possible way.

At the core of the approach is a rule for associating causal structures with probability distributions. Let p be a joint probability distribution on variables V. If S is a subset of V, let (X ⊥ Y)|S abbreviate that X is statistically independent of Y conditional on S in p. A sequence of variables is a path if each successive pair is directly causally connected. A collision on a path is a variable with arrows coming in from adjacent variables on the path (e.g., variable Y in path X → Y ← Z). A path is activated by variable set S just in case the only variables in S that occur on the path are collisions and every collision on the path has a descendant in S. Then the key assumption relating probabilities to causal structures is simply:

(X ⊥ Y)|S if and only if no path between X and Y is activated by S.

Let Tp denote the set of all causal structures satisfying this relation to probability measure p.

To see why it is intuitive to associate Tp with p, suppose that X → Y → Z and that none of these variables is in conditioning set S. Then knowing something about Z tells one something about X, and knowing something about the value of X tells one something about Z. But the ultimate cause X yields no further information about Z when the intermediate cause Y is known (unless there is some other activated path between X and Z). On the other hand, suppose that the path is X → Y ← Z, with collision Y. If there is no further path connecting X with Z, knowing about X says nothing about Z (X and Z are independent causes of Y), but since X and Z may cooperate or compete in a systematic way to produce Y, knowing the value of Y together with the value of X yields some information about the corresponding setting of Z. The dependency among causes given the state of the common effect turns out to be an important clue to causal orientation.
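
The activation rule just stated can be sketched directly. The representation below, a path as an explicit node sequence with edge directions plus a precomputed descendants map, is an assumption made for illustration; it is not the algorithmic machinery of Spirtes et al.

```python
def is_collider(dirs, i):
    # interior node i is a collision when both adjacent edges point into it
    return dirs[i - 1] == '>' and dirs[i] == '<'

def path_activated(nodes, dirs, S, descendants):
    """nodes: the variables on the path; dirs[i] in {'>', '<'} is the
    direction of the edge between nodes[i] and nodes[i+1]; descendants
    maps each variable to the set of its descendants (itself included)."""
    for i in range(1, len(nodes) - 1):
        if is_collider(dirs, i):
            # every collision must have a descendant in the conditioning set
            if not (descendants[nodes[i]] & S):
                return False
        elif nodes[i] in S:
            # a non-collision in the conditioning set blocks the path
            return False
    return True

# Chain X -> Y -> Z: active given the empty set, blocked by conditioning on Y.
desc_chain = {'X': {'X', 'Y', 'Z'}, 'Y': {'Y', 'Z'}, 'Z': {'Z'}}
assert path_activated(['X', 'Y', 'Z'], ['>', '>'], set(), desc_chain)
assert not path_activated(['X', 'Y', 'Z'], ['>', '>'], {'Y'}, desc_chain)

# Collision X -> Y <- Z: blocked given the empty set, activated by {Y}.
desc_coll = {'X': {'X', 'Y'}, 'Y': {'Y'}, 'Z': {'Z', 'Y'}}
assert not path_activated(['X', 'Y', 'Z'], ['>', '<'], set(), desc_coll)
assert path_activated(['X', 'Y', 'Z'], ['>', '<'], {'Y'}, desc_coll)
```

The two test cases mirror the chain and collision intuitions in the paragraph above.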

It follows from the preceding assumption that there is a direct connection X — Y just in case X and Y are dependent conditional on each set of variables not including X, Y. There is a collision (X → Y ← Z) if (X — Y — Z) holds (by the preceding rule), (X — Z) does not hold (by the preceding rule), and, furthermore, X, Z are dependent given every set of variables including Y but not X, Z [Spirtes et al., 2000, theorem 3.4]. Further causal orientations may be entailed in light of background assumptions. The preceding rules (actually, more computationally efficient heuristic versions thereof) have been implemented in "data-mining" software packages that search for causal structures governing large sets of observational variables. The key points to remember are that (1) a direct causal connection is implied by the appearance of some set of statistical dependencies and (2) edge orientations depend both on the appearance of some statistical dependencies and on the non-appearance in the future of further statistical dependencies.

The above considerations are taken to be general. However, much of the literature on causal discovery focuses on two special cases. In the discrete multinomial case, say that G ∈ Dp if and only if G ∈ Tp and p is a discrete, joint distribution over a finite range of possible values for each variable in G. In the linear Gaussian case, say that G ∈ Lp if and only if G ∈ Tp and p is generated from G as follows: each variable in G is assumed to be a linear function of its parents, together with an extra, normally distributed, unobserved variable called an error term, and the error terms are assumed to be uncorrelated. For brevity, say that G is standard for p if and only if G ∈ Dp or G ∈ Lp. The following discussion is restricted to the standard cases because that is where matters are best understood at present.

In practice, not all variables are measured, but assume, optimistically, that all causally relevant variables are measured. Even then, in the standard cases, the DAGs in Tp cannot possibly be distinguished from one another from samples drawn from p, so one may as well require only convergence to Tp in each p compatible with background assumptions. 4

Statistical dependencies among variables must be inferred from finite samples, which can result in spurious causal conclusions because finite samples cannot reliably distinguish statistical independence from weak statistical dependence. Idealizing, as in the preceding examples, suppose that one receives the outputs of a data-processing laboratory that merely informs one of the dependencies 5 that have been verified so far (at the current, growing sample size) by a standard statistical dependency test, where the null hypothesis is independence. 6 Think of an effect as data verifying that a partial correlation is non-zero. Absence of an effect is compatible with noticing it later (the correlation could be arbitrarily small). If it is required only that one infer the true indistinguishability class Tp for arbitrary p representable by a DAG, then effects determine the right answer.

What does Ockham say? In the light of the preceding examples, something like: assume no more dependencies than one has seen so far, unless background knowledge and other dependencies entail them. It follows, straightforwardly, that direct causal connections add complexity, and that seems intuitively right. Causal orientation of causal connections is more interesting. It may seem that causal orientation does affect complexity because, with binary variables, a common effect depends in some manner that must be specified upon four states of the joint causes, whereas a common cause affects each effect with just two states. Usually, free parameters contribute to complexity, as in the curve-fitting example above. But given the overall assumptions of causal discovery, a result due to Chickering [2003] implies that these extra parameters do not correspond to potential empirical effects and, hence, do not really contribute to empirical complexity. In other words, given that no further edges are coming, one can afford to wait for data that decide all the discernible facts about orientation [Schulte, 2007]. Standard MDL procedures that tax free parameters can favor non-collisions over collisions before the data resolve the issue, risking extra surprises. 7

For example, when there are three variables X, Y, Z and (X — Y — Z) is known, then, excluding unobserved causes, there are two equivalence classes of graphs: the collision orientation (X → Y ← Z) in one class C, and all the other orientations in the complementary class ¬C. Looking at the total sets of implied dependencies for C and ¬C, it turns out that the only differences are that C entails ¬((X ⊥ Z)|Y) but not ¬(X ⊥ Z), whereas ¬C entails ¬(X ⊥ Z) but not ¬((X ⊥ Z)|Y), so there is no inclusion relationship between the dependencies characterizing C and the dependencies characterizing ¬C. Therefore, both hypotheses are among the simplest compatible with the data, so Ockham's razor does not choose among them. Moreover, given that the truth is (X — Y — Z), nature must present either ¬(X ⊥ Z) or ¬((X ⊥ Z)|Y) eventually (given that the causal truth can be represented by some graph over the observable variables), so it seems that science can and should wait for nature to resolve the matter instead of racing ahead, and that is just how Ockham's razor is interpreted in the following discussion. Regardless of which effect nature elects to present, it remains possible, thereafter, to present the other effect as well, in which case each variable is connected directly to every other and one can infer nothing about causal directionality. This situation involves more effects than either of the two preceding cases, but another direct causal connection is also added, reflecting the increase in complexity.
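
The dependency signature that distinguishes the collision from the other orientations can be seen in simulated data. The linear Gaussian model, seed, and thresholds below are illustrative assumptions: in a collision X → Y ← Z, the causes are marginally uncorrelated, but conditioning on the common effect (here, crudely, by restricting to a narrow band of Y) induces a strong dependence between them.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Collision X -> Y <- Z: X and Z are independent causes of the effect Y.
X = rng.normal(size=n)
Z = rng.normal(size=n)
Y = X + Z + 0.1 * rng.normal(size=n)

# Marginally, the causes are (near-)uncorrelated ...
assert abs(np.corrcoef(X, Z)[0, 1]) < 0.02

# ... but conditioning on the effect induces dependence: within a narrow
# band of Y, X and Z must trade off against each other almost exactly.
band = np.abs(Y) < 0.1
assert abs(np.corrcoef(X[band], Z[band])[0, 1]) > 0.5
```

This is exactly the pattern ¬((X ⊥ Z)|Y) without ¬(X ⊥ Z) that characterizes class C above.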

The preceding evolution can result in spectacular reversals of causal conclusions as experience increases, not just in terms of truth, but in terms of practical consequences as well. Suppose that it is known that (X — Y — Z) and that none of these variables has yet exhibited any dependence with W. Then discovery of ¬((X ⊥ Z)|Y), background knowledge, and Ockham's razor unambiguously imply (X → Y ← Z), a golden invitation to exploit Z to control Y. Indeed, the connections may be obvious and strong, inviting one to invest serious resources in exploiting Z. But the conclusion rests entirely on Ockham's razor, for the further discovery of ¬(X ⊥ Z) is incompatible with (X → Y ← Z), and the new Ockham answer is (X — Y — Z) with edge (X — Z) added. Further discovery that ¬((Z ⊥ W)|X, Y) and that ¬((Y ⊥ W)|Z) results in the conclusion Y → Z ← W, reversing the original conclusion that Y can be controlled by Z. 8 The orientation of the direct causal connection Y — Z can be flipped n times in sequence by assuming causes X0, …, Xn of Y in the role of X and potential collisions W0, …, Wn in the role of W. There is no way that a convergent strategy can avoid such discrete flips of Y — Z; they are an ineluctable feature of the problem of determining the efficacy of Z on Y from non-experimental data, no matter how strong the estimate of the strength of the cause Z → Y is prior to the reversal. Indeed, standard causal discovery algorithms exhibit the diachronic retractions just discussed in computer simulations. The practical consequences of getting the edge orientation wrong are momentous, for if Z does not cause Y, the policy of manipulating Z to achieve results for Y will have no benefits at all to justify its cost. Indeed, in the case just described, sample size imposes no non-trivial bound on arbitrarily large mis-estimates of the effectiveness of Z in controlling Y (cf. [Robins et al., 2003; Zhang and Spirtes, 2003]). Therefore a skeptical stance toward causal inference is tempting:

We could try to learn the correct causal graph from data but this is dangerous. In fact it is impossible with two variables. With more than two variables there are methods that can find the causal graph under certain assumptions but they are large sample methods and, furthermore, there is no way to ever know if the sample size you have is large enough to make the methods reliable [Wasserman, 2003, p. 275].

This skepticism is one more symptom of the unrealizable demand that simplicity should reliably point toward or inform one of the true theoretical structure, a popular — if infeasible — view both in statistics and philosophy [Goldman, 1986; Mayo, 1996; Dretske, 1981]. The approach developed below is quite different: insofar as finding the truth makes reversals of opinion unavoidable, they are not only justified but laudable — whereas, insofar as they are avoidable, they should be avoided. So the best possible strategies are those that converge to the truth with as few course-reversals as possible. That is what standard causal inference algorithms tend to do, and it is the best they could possibly do in the standard cases.

To summarize, an adequate explanation of Ockham's razor should isolate what is common to the simplicity intuitions in examples like the preceding ones and should also explain how favoring the simplest theory compatible with experience helps one find the truth more directly or efficiently than competing strategies when infallibility or even probable infallibility is hopeless. Such an explanation, along the lines of the freeway metaphor, will now be presented. First, simplicity and efficient convergence to the truth must be defined with mathematical rigor and then a proper proof must be provided that Ockham's razor is the most efficient possible strategy for converging to the truth.

https://www.sciencedirect.com/science/article/pii/B9780444517265500145

Clinical Decisions Based on Models

R.H. Riffenburgh , in Statistics in Medicine (Third Edition), 2012

The Gibbs Sampler

We know that each of the probability estimates in the transition matrix has a probability distribution, so together they have a joint probability distribution. The Gibbs sampler, named for physicist J. W. Gibbs but developed by the Geman brothers,38 is an algorithm that samples randomly from each parameter conditional upon (often said "given") all the others, finds the conditional distribution of each, and uses the collection of all the conditionals to find the joint distribution of the set of parameters.

The method will be easier to follow couched in terms of a simple clinical example. Let us return to the flu example. We have a vector (a list) of estimates on the number of patients falling into each state (well, slightly ill, severely ill, recovered, and dead) and a probability distribution for each state. The vector will also have a joint probability distribution. A random frequency is drawn from each distribution in turn. However, each of these frequencies will depend on all the other influences. The number slightly ill will depend on (be conditional on) the number who have become severely ill, the number who have died, and so forth. We can develop a joint distribution from the conditionals, but each time we find a new probability, it changes all the others. The Gibbs sampler adjusts the joint distribution for each conditional in turn, generating a second vector of frequencies. But this second vector is again subject to dependencies on subsequent conditionals, so must "loop" through the same path to a third, and so forth. This is a form of simulation. In most cases, the sequence of computations exhibits smaller and smaller changes with successive simulations so that at some point, perhaps after thousands of "loops", the probabilities become stable, that is, fixed or stationary. Then we have dependable estimates of what the rates for the various flu states really are.

Computer-based sampling and processing is necessary because the simulation is very computationally intensive. The Gibbs sampler provides a mechanism to pass quickly through thousands of loops.
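
For readers who want to see the looping concretely, here is a toy Gibbs sampler for a bivariate normal distribution with correlation ρ, where both full conditionals are themselves normal. This generic sketch is not Riffenburgh's flu example; the target distribution, seed, and loop counts are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
rho, n_loops, burn_in = 0.8, 20_000, 2_000

x, y = 0.0, 0.0
draws = []
for i in range(n_loops):
    # draw each parameter from its full conditional given the other:
    # x | y ~ N(rho*y, 1 - rho**2), and symmetrically for y | x
    x = rng.normal(rho * y, np.sqrt(1 - rho ** 2))
    y = rng.normal(rho * x, np.sqrt(1 - rho ** 2))
    if i >= burn_in:               # discard pre-stationary "loops"
        draws.append((x, y))

draws = np.array(draws)
# after burn-in, the retained draws approximate the joint distribution
est_rho = np.corrcoef(draws[:, 0], draws[:, 1])[0, 1]
assert abs(est_rho - rho) < 0.05
```

After the early loops are discarded, the empirical correlation of the draws stabilizes near ρ, mirroring the stationarity described above.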

https://www.sciencedirect.com/science/article/pii/B9780123848642000202

Computational Analysis and Understanding of Natural Languages: Principles, Methods and Applications

Indranil Ghosh , in Handbook of Statistics, 2018

3 Conceptual Exercises

1.

Let us have an arbitrary set of (conditional) independence relationships among N variables that is associated with a joint probability distribution.

(a)

Can we always find a DAG that perfectly maps this set (perfectly maps = preserves all the (conditional) independence relationships, it neither removes nor adds any)?

(b)

Can we always find an undirected graph that perfectly maps this set?

(c)

Can directed acyclic models represent the conditional independence relationships of all possible undirected models?

(d)

Can undirected models represent the conditional independence relationships of all possible directed acyclic models?

(e)

Can we always find a directed acyclic model or an undirected model? (Scutari and Denis, 2015, collected by Jiri Klema)

2.

Suppose our goal is inference for a parameter δ based on data that would ideally consist of n independent pairs (X, Y), but some values of Y are missing, as shown by an indicator variable I. Thus the data on an individual have the form (x, y, 1) or (x, ?, 0). We suppose that although the missing-data mechanism Pr(I = 0|x, y) may depend on x and y, it does not involve δ. There are now three possibilities (Davison, 2003, Exercise 6.2.10):

data are missing completely at random, that is Pr(I = 0|x, y) = Pr(I = 0) is independent of both x and y;

data are missing at random, that is Pr(I = 0|x, y) = Pr(I = 0|x) depends on x but not on y;

there is nonignorable nonresponse, meaning that Pr(I = 0|x, y) depends on y and possibly also on x.

Write down the DAG and its moral graph for X, Y, and I under the missing data models described above. Use them to give an equation-free explanation of the differences among the models and of their consequences.

3.

Show that for a DAG, the local directed Markov property (DL) implies the ordered directed Markov property (DO) (course by Yu, 2010).

4.

A language L has a four-character vocabulary W = {δ, A, B, C}, where δ is the empty symbol. The probability that character wi will be followed by wj is given by the following matrix:

In transmitting messages from L, some characters may be corrupted by noise and confused with others. The probability that the transmitted character wj will be interpreted as wk is given by the following confusion matrix:

The string δ, δ, B, C, A, δ, δ is received, and it is known that the transmitted string begins and ends with δ. Then,
(a)

Find the probability that the i-th transmitted symbol is B, for i = 1, 2, …, 7.

(b)

Find the probability that the string transmitted is the one received.

(c)

Find the probability that no message (a string of δs) was transmitted.

(d)

Find the message most likely to have been transmitted.

(e)

Find the most likely seven-symbol string in L that starts and ends with δ (course by Yu, 2010).

5.

Let there be three tram lines, 7, 17, and 21, regularly coming to the stop in front of the faculty building. Line 17 operates more frequently than line 21, and line 21 goes more often than line 7 (the ratio is 5:3:2 and is kept during all hours of operation). Line 7 uses a single-car setting in 9 out of 10 cases during the daytime; in the evening it always has only one car. Line 17 has one car rarely, and only in the evening (1 out of 10 tram cars). Line 21 can be short at any time; however, it takes a long setting with 2 cars in 8 out of 10 cases. Albertov is reached by line 21; lines 7 and 17 are headed in the direction of IP Pavlova. Line changes appear only when a tram goes to the depot (let 21 have its depot in the direction of IP Pavlova; 7 and 17 have their depots in the direction of Albertov). Every 10th tram goes to the depot, evenly throughout the operation. The evening regime is from 5 pm to 22 pm; the daytime regime is from 5.30 am to 6.30 pm. Then,

(a)

Draw a correct, efficient, and causal BN.

(b)

It is evening. A short tram is approaching the stop. What is the probability that it will go to Pavlova?

(c)

There is a tram 17 standing at the stop. How many cars does it have?

https://www.sciencedirect.com/science/article/pii/S0169716118300166