# Just a toy function to show how functions are implemented in R.
<- function(x){
myCubicFunction = x^3
y return(y)
}myCubicFunction(3)
[1] 27
Home assignment - Part 1
Mattias Villani
April 8, 2025
The exponential distribution, \(X\sim \mathrm{Expon}(\beta)\) has (probability) density function
\[ f(x) = \frac{1}{\beta}e^{-\frac{x}{\beta}}\text{ for }x\geq0 \]
In this parameterization, the parameter \(\beta\) is called a scale parameter, and here \(\mathbb{E}(X)=\beta\). This is the parameterization used in the course book.
R (and Wikipedia) instead uses the alternative parameterization with a rate parameter \(\lambda\) and the density function
\[ f(x)=\lambda e^{-\lambda x} \text{ for }x\geq0. \]
In this parameterization \(\mathbb{E}(X) = \frac{1}{\lambda}\). So, the connection between the two parameterization is that \(\lambda = \frac{1}{\beta}\).
We will use the parameterization in the course book with the scale parameter \(\beta.\) If you want to simulate 10 random numbers from the \(X\sim \mathrm{Expon}(\beta)\) with \(\beta=2\) you have to use the command rexp(n = 10, rate = 1/2)
, since \(\beta=2\) implies \(\lambda = 1/2\). The names of the arguments can be left out so rexp(10, 1/2)
also works (but then you have to write the arguments in that exact order).
A good way to check which parameterization is actually used in a given programming language is to simulate a large number of random numbers (also called draws) from the distribution and then compute the usual sample mean
\[\bar{x}=\frac{1}{n }\sum_{i=1}^n x_i\]
of those random numbers. According to the law of large numbers, this sample mean should be close to the “theoretical”/population mean of \(\mathbb{E}(X)\) in the given parameterization.
Simulate \(n=10000\) random numbers from the exponential distribution with rate \(\lambda = 2\) to verify that R is indeed using the rate parameterization.
Simulate 200 draws (random numbers) from the \(X\sim \mathrm{Expon}(\beta = 2)\) distribution. Plot a histogram of the draws (use 30 histogram bins/cells) and overlay the theoretical probability density function (pdf) for the \(\mathrm{Expon}(\beta = 2)\) distribution as a curve. Note that you have to use the argument freq=FALSE
in the hist
function, otherwise the vertical scale will be counts within each bin in the histogram, and you want really want the height of the histogram bars to represent the density.
[Hint: evaluate the dexp
density function over a fine grid of \(x\)-values to plot the pdf.]
Overlay two more pdf curves: one for \(\mathrm{Expon}(\beta = 1)\) and the other for \(\mathrm{Expon}(\beta = 3)\). Use different colors. Which of the three pdf curves fit the data (histogram) best? Why do you think that is?
The empirical cumulative density function (cdf) from a sample with \(n\) observations is given by
\[ \hat{F}_n(x) = \frac{\text{number of elements in the sample }\leq x}{n} \]
Plot the empirical cdf for the \(n=200\) observations that you simulated in Problem 1b); see https://en.wikipedia.org/wiki/Empirical_distribution_function for a little information about the empirical cdf, if you are curious. Overlay the cdf from the three distributions above: \(\mathrm{Expon}(\beta = 1)\) , \(\mathrm{Expon}(\beta = 2)\) and \(\mathrm{Expon}(\beta = 3)\). Which distribution seems to fit best? Does it match with your conclusion from Problem 1c)?
[Hint: the sort
function might be handy for the empirical cdf, and don’t forget about the so called p
-functions in R.]
Compare the sample median from the \(n=200\) observations to the theoretical medians for each of the above three distributions. Explain both how:
a sample median is defined and
how a median of a statistical distribution is defined.
[Hint: recall the so called q
-functions in R].
Verify by numerical integration that the \(\mathrm{Expon}(\beta = 2)\) density in R really fulfills the required property of any density \(\int_{-\infty}^\infty f(x)dx=1\). This entails doing a rectangle sum approximation as in the definition of the integral in Lecture 2 (do not use a built-in function or a package for numerical integration). Start with a rectangle width of \(\Delta x = 0.5\) and then lower it until the integral seems to have converged.
Compute the expected value of the exponential distribution with \(\beta=2\) using numerical integration, i.e. using similar technique as in Problem 1f). Verify your result from the rectangle sum by using R’s built in numerical integration routine integrate
(see ?integrate
for the documentation).
(Here we actually know the result, the expected value of \(\mathrm{Expon}(\beta)\) is \(\beta\), but numerical integration technique can be used for the expectation of any function, for example if you are interested in \(\mathbb{E}(\log (X))\), when \(X \sim \mathrm{Expon}(\beta)\)).
[Hint1: note that integrate
requires the function f(x) to be integrated as input argument, so you have to define such a function before calling the integrate function. To remind you of how functions are written in R, a toy function in R is given below.]
[Hint2: don’t forget that I asking you to compute the expected value, not just to integrate the density function]
The file bugs.csv
contains a dataset with the number of bugs and some other explanatory variables for \(n=91\) releases of several software projects. Here we will only analyze the variable nBugs
, which we will store in a vector y
, for simplicity. Load the data like this:
You can ignore that some of the observations actually comes from the same project at different releases, and assume that the observations are independent and identically distributed. Consider first the model
\[Y_1,Y_2,\ldots,Y_n \overset{iid}{\sim}\mathrm{Poisson}(\lambda)\]
where \(n=91\) here and \(\overset{iid}{\sim}\) means that the observations are assumed independent and identically distributed (that is, each observation is assumed to come from the same Poisson distribution).
Since \(\lambda\) is the mean in the \(\mathrm{Poisson}(\lambda)\) distribution, a reasonable estimator of \(\lambda\) is the sample mean \(\bar y\). Plot a histogram of the data and overlay the density of Poisson distribution with \(\lambda = \bar y\). Does this Poisson model fit the data well. If not, why?
[Hint: either use a histogram when plotting the data, or use proportions(table(y))
to compute a table of proportions and then use barplot
to plot a bar chart, which is suitable for discrete data. A histogram is easier, however.]
Let us now try to with a negative binomial model for the data. We will use the variant that counts the number of failures until \(r\) successes has been observed, and we will use the alternative parameterization with an explicit parameter \(\mu\) for the mean. So the model is
\[ Y_1,Y_2,\ldots,Y_n \overset{iid}{\sim}\mathrm{NegBin}(r,\mu), \]
where each random variable \(Y_i\) can take on values in the set \(\{0,1,2,\ldots\}\). Since \(\mu\) is the mean, we will estimate it with the sample mean \(\bar y\). Add the probability function from the negative binomial model for three different \(r\) values: \(r=1\), \(r=3\) and \(r=100\) (one curve for each) to the plot you did in Problem 2a). Which of these models do you prefer? Why? Which of the negative binomial models is closest to the Poisson model? Why?
[hint: note that R has the dnbinom
function that can be called with the mean parameterization. For example, dnbinom(1, size = 3, mu = 2)
give the probability \(\mathrm{Pr}(Y=1)\) when \(Y\sim \mathrm{NegBin}(r = 3,\mu = 2)\), so that the argument size
is the parameter \(r\).]
This problem is to be done after Lecture 6.
Let \(X \sim \mathrm{Normal}(\mu = 0, \sigma^2 = 1)\). We are now interested in the distribution of \(Y=\exp(X)\). Obtain the distribution for \(Y\) by simulating 10000 draws. Plot a histogram with 100 bins.
Use the method of transformation (Section 6.4 in the course book) to show that the probability density for \(Y\) is given by
\[ f(x)=\frac{1}{\sqrt{2\pi}x}\exp\Big(-\frac{1}{2}(\log(x)-\mu)^2\Big) \]
Overlay a plot of this density in the histogram from Problem 3a).
[hint: you can use LaTeX to write math in Quarto file (Google it), but it is also OK to just do the math on paper, take a photo and include the photo]
Use (Monte Carlo) simulation with \(m=10000\) random draws to estimate \(\mathrm{E}(Y)\), where, as before, \(Y=\exp(X)\) and \(X \sim \mathrm{Normal}(\mu = 0, \sigma^2 = 1)\). Check the convergence of the estimate by plotting the sequential Monte Carlo estimates for increasing Monte Carlo sample sizes of \(10,20,30,\ldots,9900, 10000\). Does the estimate seem to converge (settle down) to the true expectation, which happens to be \(\mathrm{E}(Y)=\exp(\frac{1}{2})\)? [How do I know that this is the true expected value? See this: https://en.wikipedia.org/wiki/Log-normal_distribution]