# Cumulants

If instead of evaluating the derivatives of the characteristic function, we evaluate the derivative of its logarithm, the results are the cumulants of the distribution.

$\displaystyle c_n=(-i)^n \left. \frac{\mathrm{d}^n}{\mathrm{d}z^n}\ln\hat{F}(z)\right|_{z=0}$

Though it is perhaps not immediately obvious, the cumulants of a pdf are often more convenient to work with than the moments. As we’ll see, when we add random variables, the pdf of the sum is the convolution of the pdf’s of the terms. In Fourier space, (i.e. in terms of the characteristic function), this convolution corresponds to multiplication, and since the log of a product is the sum of the logarithms of the multipliers, convolution of pdf’s corresponds to summation of the cumulants.

It’s clear that the cumulants are closely related to the moments. For example, since:

$\displaystyle \frac{\mathrm{d}}{\mathrm{d}x}\ln f(x)=\frac{1}{f(x)}\frac{\mathrm{d}}{\mathrm{d}x}f(x)$

the first cumulant is:

$\displaystyle c_1=\frac{-i}{\hat{P}(0)} \frac{\mathrm{d}\hat{P}}{\mathrm{d}z}(0)=m$

the second is:

\begin{aligned} c_2 & = - \left. \frac{\mathrm{d}}{\mathrm{d}z}\frac{\hat{P}'(z)}{\hat{P}(z)}\right|_{z=0} \\ & = -\left[\frac{\hat{P}''(z)}{\hat{P}(z)}- \left( \frac{\hat{P}'(z)}{\hat{P}(z)} \right)^2 \right ]_{z=0} \\ & =m_2-m^2 \end{aligned}

and the third:

\begin{aligned} c_3 & = i \frac{\mathrm{d}}{\mathrm{d}z}\left[\frac{\hat{P}''(z)}{\hat{P}(z)}- \left( \frac{\hat{P}'(z)}{\hat{P}(z)} \right)^2 \right ]_{z=0} \\ & = i\left[\frac{\hat{P}'''(z)}{\hat{P}(z)}-\frac{\hat{P}''(z)\hat{P}'(z)}{\hat{P}(z)^2}- 2\left( \frac{\hat{P}'(z)}{\hat{P}(z)} \right)\left( \frac{\hat{P}''(z)}{\hat{P}(z)}- \left( \frac{\hat{P}'(z)}{\hat{P}(z)} \right)^2\right ) \right ]_{z=0} \\ & = i \left[ \frac{\hat{P}'''(z)} {\hat{P}(z)}-3\frac{\hat{P}''(z)\hat{P}'(z)}{\hat{P}(z)} + 2 \left( \frac{\hat{P}'(z)}{\hat{P}(z)} \right)^3 \right]_{z=0} \\ &=m_3-3m_2m+2m^3 \end{aligned}

Continuing in the same manner, one can demonstrate that:

$\displaystyle c_4=m_4-4m_3m_1-3m_2^2+12m_2m^2-6m_1^4$

Scaling the cumulants by the respective power of the standard deviation results in the normalized cumulants:

$\displaystyle \lambda_n\equiv \frac{c_n}{\sigma^n}$

Usually we want to consider the behaviour of a distribution about it’s mean value, rather than about zero. In this case $m=0$ which means that $c_1$ and $\lambda_1$ are zero as well. Since the normalization is with respect to the variance,

$\displaystyle \lambda_2=\frac{c_2}{\sigma^2} = \frac{m_2}{m_2}=1$

And as can be readily seen from the above relations, the third and fourth normalized cumulants are thus:

$\displaystyle \lambda_3=\frac{m_3}{\sigma^3}$     and     $\displaystyle \lambda_4=\frac{m_4}{\sigma^4}-3$

# Characteristic function and moments

Having defined the mean of a distribution as the expectation value of $x$,

$m\equiv\langle x\rangle =\displaystyle\int xP(x)\, \mathrm{d}x$

we might now pose the more general question: what is the expectation value for some arbitrary power of $x$?

$\displaystyle m_n\equiv\langle x^n\rangle =\int x^nP(x)\, \mathrm{d}x$

These values, $m_n$ are called the moments of $P(x)$. To obtain the moments for a given distribution one could simply evaluate the integral for the desired moment. But more practically, the moments can be calculated using the characteristic function, defined as the Fourier transform of the pdf:

$\displaystyle \hat{P}(z) \equiv \mathcal{F} \{P(x)\}= \int \mathrm{e}^{izx}P(x) \, \mathrm{d}x$

Once we have the characteristic function, we can recover the pdf with an inverse Fourier transform:

$\displaystyle P(x) = \mathcal{F}^{-1} \{\hat{P}(z)\}= \frac{1}{2\pi} \int \mathrm{e}^{-izx}\hat{P}(z) \, \mathrm{d}z$

The moments can be found by differentiating the characteristic function:

$\displaystyle m_n=(-i)^n \left. \frac{\mathrm{d}^n}{\mathrm{d}z^n}\hat{F}(z)\right|_{z=0}$

To see why this is so, consider the expansion of $\mathrm{e}^{izx}$:

$\displaystyle \mathrm{e}^{izx}=1+(izx)+\frac{(izx)^2}{2!}+\frac{(izx)^3}{3!} +\dotsb =\sum_{n=0}^\infty \frac{(izx)^n}{n!}$

Therefore:

\begin{aligned} \hat{P}(z)&=\int P(x)\, \mathrm{d}x+(iz)\int xP(x)\, \mathrm{d}x+\frac{(iz)^2}{2!}\int x^2 P(x)\, \mathrm{d}x+\dotsb \\ &=\sum_{n=0}^\infty \frac{(iz)^n}{n!}\int x^nP(x)\, \mathrm{d}x \end{aligned}

From which it should be clear that successive differentiation followed by evaluation at $z=0$ picks off the desired moment.

Alternatively, instead of using the characteristic function, we could accomplish the same by dropping the i from the exponent in the transform (and replacing z with t to minimize confusion), giving us something known as the moment-generating function:

$\displaystyle \mathrm{M}_X(t) \equiv \int \mathrm{e}^{tx}P(x) \, \mathrm{d}x$

Which, if we make the substitution $t=-s$, is the two sided Laplace transform of the pdf. As with the characteristic function, the moments can be obtained by differentiation:

$\displaystyle m_n=\left.\frac{\mathrm{d}^n}{\mathrm{d}t^n}\mathrm{M}_X(t) \right|_{t=0}$

More often in statistics, what we’re interested in is the central moment, the moment about the mean $\langle(x-m)^n\rangle$, rather than the absolute moment about zero. For example, recall that the variance is the second moment about the mean:

\begin{aligned} \sigma^2=\langle(x-m)^2\rangle &=\int (x-m)^2 P(x) \, \mathrm{d}x \\ &=\int (x^2-2xm+m^2) P(x) \, \mathrm{d}x \\ &= \int x^2 P(x) \, \mathrm{d}x -2m\int x P(x) \, \mathrm{d}x + m^2\int P(x) \, \mathrm{d}x \\ &= m_2 - m^2 \end{aligned}

Beautiful, no?

# Deviations

Once we’ve determined the typical value of a random variable, the next thing we want to know is how broad the distribution is. If we observe a phenomenon multiple times or make the same measurement repeatedly, do we observe values that are closely clustered together, or do they vary widely?

One way to quantify the spread in the data is to consider the mean absolute distance between the possible values of the random variable and its typical value, $x_0$.

$\displaystyle \mathrm{MAD} = \langle|x-x_0|\rangle = \int |x-x_0|P(x)\, \mathrm{d}x$

Alternatively we could consider the root mean squared distance between X and the centre of the distribution.

$\displaystyle \mathrm{RMS} = \sqrt{\langle(x-x_0)^2\rangle} = \sqrt{\int (x-x_0)^2 P(x)\, \mathrm{d}x}$

Now the question is what value to use for $x_0$. Here we’ll choose the value that minimizes the measured deviation. We need to find the values of $x_0$ for which:

$\displaystyle \frac{\mathrm{d}}{\mathrm{d}x_0} \mathrm{MAD}(x_0) =0 \qquad \mathrm{and} \qquad \frac{\mathrm{d}}{\mathrm{d}x_0} \mathrm{RMS}(x_0) =0$

Taking the derivative of the mean absolute distance:

\begin{aligned} \frac{\mathrm{d}}{\mathrm{d} x_0} \int |x-x_0|P(x)\, \mathrm{d}x &= \frac{\mathrm{d}}{\mathrm{d}x_0}\left[ \int_{-\infty}^{x_0} (x-x_0)P(x)\, \mathrm{d}x + \int_{x_0}^{\infty} (x_0 -x)P(x)\, \mathrm{d}x \right] \\ &=\frac{\mathrm{d}}{\mathrm{d}x_0}\left[ \int_{-\infty}^{x_0} xP(x)\, \mathrm{d}x - x_0\int_{-\infty}^{x_0} P(x)\, \mathrm{d}x +x_0\int_{x_0}^{\infty}P(x)\, \mathrm{d}x -\int_{x_0}^{\infty}xP(x)\, \mathrm{d}x\right] \\ &=\frac{\mathrm{d}}{\mathrm{d}x_0}\left[ \int_{-\infty}^{x_0} xP(x)\, \mathrm{d}x - x_0\mathcal{P}_<(x_0) +x_0\mathcal{P}_>(x_0) -\int_{x_0}^{\infty}xP(x)\, \mathrm{d}x\right] \\ &=\frac{\mathrm{d}}{\mathrm{d}x_0}\left[ \int_{-\infty}^{x_0} xP(x)\, \mathrm{d}x-\int_{x_0}^{\infty}xP(x)\, \mathrm{d}x + x_0\left(1-2\mathcal{P}_<(x_0)\right) \right] \\ &=2x_0P(x_0)+\left(1-2\mathcal{P}_<(x_0)\right)-2x_0P(x_0) = 1-2\mathcal{P}_<(x_0)=0 \end{aligned}

Therefore, $\mathcal{P}_<(x_0) = 0.5$, which means $x_0=x_{med}$. Which means the mean absolute distance is minimized when calculated with respect to the median value of the distribution.

$\displaystyle E_{abs}= \int |x-x_{med}|P(x)\, \mathrm{d}x$

Doing the same for the mean squared distance:

\begin{aligned} \frac{\mathrm{d}}{\mathrm{d} x_0} \int (x-x_0)^2P(x)\, \mathrm{d}x &= \frac{\mathrm{d}}{\mathrm{d}x_0}\int(x^2-2x_0x+x_0^2)P(x)\, \mathrm{d}x \\ &=\frac{\mathrm{d}}{\mathrm{d}x_0}\left[ \int x^2P(x)\, \mathrm{d}x - 2x_0\int xP(x)\, \mathrm{d}x +x_0^2\int P(x)\, \mathrm{d}x \right] \\ &=\frac{\mathrm{d}}{\mathrm{d}x_0}\left[ \int x^2P(x)\, \mathrm{d}x - 2x_0m +x_0^2 \right] \\ &=0-2m+2x_0 = 0 \end{aligned}

So the mean squared distance is minimized when calculated with respect to the mean. This gives us the definition of the variance:

$\displaystyle \sigma^2=\langle(x-m)^2\rangle=\int (x-m)^2 P(x)\, \mathrm{d}x$

The standard deviation, $\sigma$, is the square root of the variance.

# Typical values

The most obvious question to ask about any set of data is what is a typical value, by which we usually mean “what is the average value”. If we have set of numbers, we can sum them up and divide by the total number and get the average, or mean value of the sample.

$\bar{x}=\frac{1}{N}\displaystyle\sum_{i=1}^{N}x_i$

Easy. On the other hand, if we want to characterize a theoretical distribution, we can find the mean of the distribution, or the expectation value by averaging all possible values of x, weighted by the probability density.

$m\equiv\langle x\rangle =\displaystyle\int xP(x)\, \mathrm{d}x$

Alternatively, we could ask “for what value is there a 50/50 chance that any given sample will fall either above or below it?” Here we’re asking for the median value. This is the point where $\mathcal{P}_{<}(x_{med})=\mathcal{P}_{>}(x_{med})=0.5$.

We’ve already seen how the cumulative distribution function lets us calculate the probability of getting a result less than (or greater than) a particular value using pnorm(). Now we want to turn the question around and ask what value gives us a 0.5 probability than any particular sample will be smaller that than value. And conveniently, R has a set of functions to do just that: qnorm, qpois, qchisq, etc,. (That’s ‘q’ for quantile.) This is a rather boring question for the normal distribution, since for a Gaussian, $x_{med}=m$.

> qnorm(0.5, mean=3.46894)
[1] 3.46894

Yawn. Whatever value you put in for the mean, qnorm(0.5,…) is just going to spit it right back out again. For a normal distribution, qnorm() is more useful if you want to know, for example, what value will a random sample have only a 5% probability of exceeding? This is the same as asking for what x does $\mathcal{P}_<(x)=0.95$? For a standard normal (mean=0, sd=1) distribution:

> qnorm(0.95)
[1] 1.644854

But, returning to the question of the median, let’s condsider a different distribution: the Χ² (chi-squared) distribution. The Χ² is determined by a single parameter k, which is also the mean. The pdf of a Χ² with k=5 looks like this:

> x <- seq(0,15,by=0.05)
> plot(x,dchisq(x,5))

Now if we try to find the median:

> qchisq(0.5, 5)
[1] 4.35146

Finally we could ask “what is the most probable value?”, or in statistical jargon, the mode,denoted by $x^*$. Where is the peak of the distribution function? Assuming a uni-modal distribution:

$\displaystyle \left.\frac{\mathrm{d}}{\mathrm{d}x}P(x)\right|_{x=x^*}=0$

Returning to the Χ² distribution from earlier, we can find the max of the pdf using which.max(), which returns the location of the maximum value in a vector. (This is a bit of a crude hack, since it only considers the points where the pdf has been explicitly evaluated, but it serves the purpose for a demonstration.)

> x <- seq(0,15,by=0.05)
> y <- dchisq(x,5)
> x[which.max(y)]
[1] 3

And in fact, if you look up the mode of a Χ² distribution, it should be k-2 (unless k=1, in which case it’s 0). So 3 is the right answer.

# Cumulative distribution function

Often we want to know what is the probability that a random variable is less than some particular value, x. In this case, we can simply integrate the pdf from -∞ to x to obtain the cumulative distribution function (cdf):

$\displaystyle\mathcal{P}_<(x)\equiv \mathcal{P}(X

Since $P(x)$ is positive everywhere, with a total area under the curve of 1, $\mathcal{P}_>(x)$ is a monotonically increasing function that goes from 0 to 1.

In R, the cdf can be evaluated by the p* family of functions. So for a standard normal distribution, to find the probability that the random variable is <-1:

> pnorm(-1)
[1] 0.1586553

Alternatively, we might want to know the probability that a random variable is greater than some value. In this case: $\mathcal{P}_>(x)=1-\mathcal{P}_<(x)$

There are two ways to calculate the upper tail of the pdf in R. The obvious is to use the above formula:

> 1-pnorm(-1)
[1] 0.8413447

But the same thing can be achieved by passing ‘lower.tail=FALSE’ to pnorm():

> pnorm(-1,lower.tail=FALSE)
[1] 0.8413447

The probability that X falls in the range (a,b) can be found using the cdf:

\begin{aligned} \mathcal{P}(a

And, of course:

$\displaystyle P(x) = \frac{\mathrm{d}}{\mathrm{d}x}\mathcal{P}_<(x)$

To place multiple plots in the same image, we can used par() with the ‘mfrow’ or ‘mfcol’ parameter.

x <- seq(-5,5,by=0.05)
pdf <- dnorm(x)
cdf <- pnorm(x)

# the par command sets parameters related to the graphical output
# - in this case we set mfcol to draw 2 rows of one plot each
# - each subsequent call to plot uses the next plot in the array
par(mfcol=c(2,1), mai=c(.5,1,0.1,0.1))
plot(x,pdf,type="l")
plot(x,cdf,type="l") 

Which produces:

# R scripting

Typing commands at the command line gets tedious, so for longer programs we can save our commands in a script and then execute them in R. Enter a list of commands in a text editor and save it as a ‘.R’ file. (R has a simple built-in text editor but there are better options. Notepad++ is nice. It supports syntax highlighting for .R files and auto-complete .) Enter and save the following in your working directory as “gauss_example.R”.

# number of random samples to generate
N <- 5000

# explicitly define the bins for the histogram
bin_width <- 0.5
bins <- seq(-5.5,5.5,by=bin_width)

# generate N random variables from
# standard normal distribution (mean=0, sd=1)
xs <- rnorm(N)

# calculate the probability distribution function
# scaled by the number of samples and the bin width
x <- seq(-5,5,by=0.05)
pdf_g <- N * bin_width * dnorm(x)

# create the histogram
# the bins are specified explicity, the upper ylimit is
# specified to be larger than the max of the probability
# distribution; lines() won't rescale the plot
xs_hist <- hist(xs, breaks=bins, ylim=c(0,1.1 * max(pdf_g)))
# overlay the scaled pdf
lines(x,pdf_g,type="l",lwd=2)


Here, we use R to generate Gaussian random numbers with rnorm(), the same way we used runif() to generate uniformly distributed random numbers. For each probability distribution built into R, there is a function with the name r*, where * is the name of the distribution, to generate random numbers from that distribution. For example rpois() generates Poisson random numbers, rbinom() for a binomial distribution, rchisq() for a chi-squared, and so on.

Likewise, the pdf a distribution can be calculated with a function called d*: dnorm(), dpois(), dbinom(), dchisq(), etc. Of course most of these usually require additional parameters specific to the particular distribution.

To run the script in R, use source().

> source("gauss_example.R")

This assumes the file is in the current working directory. To see the current working directory, use getwd().

> getwd()
[1] "C:/Users/Jared/Documents"


To change the working directory, use setwd().

> setwd("examples")
> getwd()
[1] "C:/Users/Jared/Documents/examples"


The above scripts should produce something like the following.

As expected, our simulated data aligns beautifully with the theoretical distribution. Play with the total number of events and bin widths and observe how they affect the agreement.

Notice that in our script, when we called hist(), the output was assigned to a variable called xs_hist. In addition to plotting the histogram, hist() also returns a histogram object. In fact, it would be more accurate to say that hist() creates a histogram object and the default behaviour is to plot it as well. Plotting can be turned off by passing ‘plot=FALSE’ as one of the arguments to hist(). To see the histogram object, enter the name of the object on the command line:

> xs_hist
$breaks [1] -5.0 -4.5 -4.0 -3.5 -3.0 -2.5 -2.0 -1.5 -1.0 -0.5 0.0 0.5 1.0 1.5 2.0 [16] 2.5 3.0 3.5 4.0 4.5 5.0$counts
[1]    0    0    0    7   48  146  458  908 1539 1881 1926 1492  943  413  176
[16]   47   14    2    0    0

$density [1] 0.0000 0.0000 0.0000 0.0014 0.0096 0.0292 0.0916 0.1816 0.3078 0.3762 [11] 0.3852 0.2984 0.1886 0.0826 0.0352 0.0094 0.0028 0.0004 0.0000 0.0000$mids
[1] -4.75 -4.25 -3.75 -3.25 -2.75 -2.25 -1.75 -1.25 -0.75 -0.25  0.25  0.75
[13]  1.25  1.75  2.25  2.75  3.25  3.75  4.25  4.75

$xname [1] "xs"$equidist
[1] TRUE

attr(,"class")
[1] "histogram"

This object returned is a “list” object, which means its a collection of components (other objects) grouped under a single object. Here the components include ‘breaks’ (which we passed in when we created the histogram), ‘counts’, ‘density’, ‘mids’, ‘xname’, and ‘equidist’. To access the components individually, enter the object and the component separated by a $. For example, to see the midpoints of each bin: > xs_hist$mids
[1] -5.25 -4.75 -4.25 -3.75 -3.25 -2.75 -2.25 -1.75 -1.25 -0.75 -0.25  0.25
[13]  0.75  1.25  1.75  2.25  2.75  3.25  3.75  4.25  4.75  5.25

The type of an object can be determined with typeof().

> typeof(xs_hist)
[1] "list"
> typeof(xs_hist$breaks) [1] "double" > typeof(xs_hist$counts)
[1] "integer"
> typeof(xs_hist\$xname)
[1] "character"

# Gaussian distribution

The pdf of the Gaussian, or normal distribution with mean m and variance $\sigma^2$ is defined by:

$\displaystyle P_G(x)=\frac{1}{\sigma \sqrt{2\pi}} exp(-\frac{(x-m)^2}{2\sigma^2})$

To evaluate the normal distribution with R, use dnorm(). First we have to create a vector, x, containing the range over which we want to plot the function.

> x <- seq(-5,5,by=0.05)

This creates a vector of values from -5 to 5, in steps of 0.05. If we pass that to dnorm() with no other parameters, we get the pdf for a Gaussian with a mean of 0 and a standard deviation of 1.

> PG1 <- dnorm(x)

To specify a different standard deviations, use the ‘sd’ parameter.

> PG2 <- dnorm(x, sd=2)

Next, display the curve with the plot() function.

> plot(x,PG1)

Not bad, but by default plot() displays the curve as a collection of points. This would look better as a continuous curve. To do that, we have to specify type=”l”. (“l” as in ‘line’, not a one.)

> plot(x,PG1,type="l", ylab="P(x)")

The ‘ylab’ parameter sets the y-axis label.
To add the second curve to an already existing plot, use the lines() function.

> lines(x,PG2,lty=2,lwd=2)

By default, lines() uses a solid line 1 pixel wide; the ‘lty’ and ‘lwd’ parameters can be used to set a different line type and thickness, respectively.

> legend(2,0.4,expression(sigma^2==1,sigma^2==2),lty=c(1,5),lwd=c(1,2))

The first two parameters are the x and y coordinates of the upper left corner of the legend. The next parameter is a vector containing the labels. The ‘lty’ and ‘lwd’ parameters indicate the line types and widths. Since there are two lines on the plot, the label, line type, and line thickness parameters are passed as vectors of two values, created using the c() function which combine its arguments into a vector. To create greek letter labels with R, we use expression(). expression() returns a vector, so there’s no need to use c() to create a vector of expressions.

The final result:

Unlike the uniform distribution which shows up almost nowhere in nature, the normal distribution shows up all over the place. Test scores, heights of men or women taken separately (but not the heights of the entire population, which combines two groups with different means), repeated measurements of some value, variations in manufactured products, etc. But there are also plenty of cases where it doesn’t apply, which we will get to eventually.