## A quick GNU R tutorial to statistical models and graphics

# 1. Introduction

In this quick GNU R tutorial to statistical models and graphics we will provide a simple linear regression example and learn how to perform such basic statistical analysis of data. This analysis will be accompanied by graphical examples, which will take us closer to producing plots and charts with GNU R. If you are not familiar with using R at all please have a look at the prerequisite tutorial: A quick GNU R tutorial to basic operations, functions and data structures.

# 2. Models and Formulas in R

We understand a **model** in statistics as a concise description of data. Such a description is usually given as a **mathematical formula**. R has its own notation for representing relationships between variables. For instance, the relationship y = c_0 + c_1x_1 + c_2x_2 + ... + c_nx_n + r is written in R as

y~x1+x2+...+xn,

which is a formula object.
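Formulas are first-class objects in R: they can be created and inspected before being handed to a fitting function. A minimal sketch:

```r
# Create a formula object and inspect it
f <- y ~ x1 + x2        # "y is modeled by x1 and x2"
class(f)                # "formula"
all.vars(f)             # variables referenced: "y" "x1" "x2"
```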

# 3. Linear regression example

Let us now provide a linear regression example for GNU R, which consists of two parts. In the first part of this example we will study the relationship between the financial index returns denominated in US dollars and the same returns denominated in Canadian dollars. In the second part of the example we add one more variable to the analysis: the returns of the index denominated in euros.

## 3.1. Simple linear regression

Download the example data file to your working directory: regression-example-gnu-r.csv

Let us now run R in Linux from the location of the working directory simply by

$ R

and read the data from our example data file:

> returns<-read.csv("regression-example-gnu-r.csv",header=TRUE)

You can see the names of the variables by typing

>names(returns)

[1] "USA" "CANADA" "GERMANY"
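If you want to experiment without downloading anything, a tiny stand-in file can be written and read back the same way (the file name and values below are made up, not the tutorial's data set):

```r
# Self-contained illustration of read.csv on a made-up three-column file
writeLines(c("USA,CANADA,GERMANY",
             "0.001,0.002,0.003",
             "-0.004,-0.003,-0.002"), "tiny-returns.csv")
tiny <- read.csv("tiny-returns.csv", header = TRUE)
names(tiny)   # "USA" "CANADA" "GERMANY"
nrow(tiny)    # 2
```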

It is time to define our statistical model and run linear regression. This can be done in the following few lines of code:

> y<-returns[,1]

> x1<-returns[,2]

> returns.lm<-lm(formula=y~x1)

To display the summary of the regression analysis we execute the **summary()** function on the returned object **returns.lm**. That is,

> summary(returns.lm)

Call:

lm(formula = y ~ x1)

Residuals:

Min 1Q Median 3Q Max

-0.038044 -0.001622 0.000001 0.001631 0.050251

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) 3.174e-05 3.862e-05 0.822 0.411

x1 9.275e-01 4.880e-03 190.062 <2e-16 ***

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.003921 on 10332 degrees of freedom

Multiple R-squared: 0.7776, Adjusted R-squared: 0.7776

F-statistic: 3.612e+04 on 1 and 10332 DF, p-value: < 2.2e-16

This function outputs the above result. The estimated coefficients are c_0 ≈ 3.174e-05 and c_1 ≈ 9.275e-01. The p-values suggest that the estimated intercept c_0 is not significantly different from zero, so it can be neglected. The second coefficient is significantly different from zero, since its p-value < 2e-16. Therefore, our estimated model is y = 0.93x_1. Moreover, R-squared is 0.78, meaning about 78% of the variance in y is explained by the model.
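The numbers quoted above can also be pulled out of the fitted object programmatically rather than read off the printed summary. A sketch on simulated data (the true slope 0.93 is chosen only to mimic the estimate above):

```r
# Simulated stand-in for the index returns: y = 0.93*x1 + noise
set.seed(1)
x1  <- rnorm(1000)
y   <- 0.93 * x1 + rnorm(1000, sd = 0.05)
fit <- lm(y ~ x1)
coef(fit)["x1"]                           # slope estimate, close to 0.93
summary(fit)$coefficients[, "Pr(>|t|)"]   # p-value for each coefficient
summary(fit)$r.squared                    # share of variance explained
```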

## 3.2. Multiple linear regression

Let us now add one more variable into our model and perform a multiple regression analysis. The question now is whether adding one more variable to our model produces a more reliable model.

> x2<-returns[,3]

> returns.lm<-lm(formula=y~x1+x2)

> summary(returns.lm)

Call:

lm(formula = y ~ x1 + x2)

Residuals:

Min 1Q Median 3Q Max

-0.0244426 -0.0016599 0.0000053 0.0016889 0.0259443

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) 2.385e-05 3.035e-05 0.786 0.432

x1 6.736e-01 4.978e-03 135.307 <2e-16 ***

x2 3.026e-01 3.783e-03 80.001 <2e-16 ***

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.003081 on 10331 degrees of freedom

Multiple R-squared: 0.8627, Adjusted R-squared: 0.8626

F-statistic: 3.245e+04 on 2 and 10331 DF, p-value: < 2.2e-16

Above, we can see the result of the multiple regression analysis after adding the variable x_2, which represents the returns of the financial index in euros. We now obtain a more reliable model, since the adjusted R-squared is 0.86, greater than the value of 0.78 obtained before. Note that we compare the adjusted R-squared because it takes the number of explanatory variables and the sample size into account. Again the intercept coefficient is not significant; therefore, the estimated model can be represented as y = 0.67x_1 + 0.30x_2.
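The adjusted R-squared penalizes additional predictors, so it only rises when a new variable genuinely adds explanatory power. A sketch of the comparison on simulated data (coefficients here are arbitrary, not the tutorial's estimates):

```r
# Two-predictor simulated data; both predictors genuinely matter
set.seed(2)
x1 <- rnorm(500)
x2 <- rnorm(500)
y  <- 0.7 * x1 + 0.3 * x2 + rnorm(500, sd = 0.5)
fit1 <- lm(y ~ x1)             # omits x2
fit2 <- lm(y ~ x1 + x2)        # includes x2
summary(fit1)$adj.r.squared
summary(fit2)$adj.r.squared    # larger: x2 carries real information
```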

Note also that we could have referred to our data vectors by their names, for instance

> lm(returns$USA~returns$CANADA)

Call:

lm(formula = returns$USA ~ returns$CANADA)

Coefficients:

(Intercept) returns$CANADA

3.174e-05 9.275e-01
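A tidier variant of the same call uses the 'data' argument of **lm()**, so the column names need not be prefixed with the data-frame name. A sketch (the data frame is simulated here as a stand-in for the one read from the CSV):

```r
# 'returns' is simulated here; with the real data it is the frame read earlier
set.seed(7)
returns <- data.frame(USA = rnorm(100), CANADA = rnorm(100))
fit <- lm(USA ~ CANADA, data = returns)
names(coef(fit))   # "(Intercept)" "CANADA"
```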

# 4. Graphics

In this section we will demonstrate how to use R to visualize properties of the data. We will illustrate figures obtained by functions such as **plot()**, **boxplot()**, **hist()** and **qqnorm()**.

## 4.1. Scatter plot

Probably the simplest of all graphs you can obtain with R is the scatter plot. To illustrate the relationship between the US dollar denomination of the financial index returns and the Canadian dollar denomination we use the function **plot()** as follows:

> plot(returns$USA, returns$CANADA)

As a result of the execution of this function we obtain a scatter diagram as exhibited below

One of the most important arguments you can pass to the function **plot()** is ‘type’. It determines what type of plot should be drawn. Possible types are:

• ‘"**p**"’ for *p*oints

• ‘"**l**"’ for *l*ines

• ‘"**b**"’ for *b*oth

• ‘"**c**"’ for the lines part alone of ‘"b"’

• ‘"**o**"’ for both ‘*o*verplotted’

• ‘"**h**"’ for ‘*h*istogram’ like (or ‘high-density’) vertical lines

• ‘"**s**"’ for stair *s*teps

• ‘"**S**"’ for other type of *s*teps

• ‘"**n**"’ for no plotting
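The effect of 'type' is easy to see by drawing one series several ways:

```r
# One sine wave rendered with three different 'type' settings
x <- seq(0, 2 * pi, length.out = 50)
y <- sin(x)
plot(x, y, type = "p")   # points
plot(x, y, type = "l")   # a connected line
plot(x, y, type = "h")   # vertical high-density lines
```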

To overlay a regression line on the scatter diagram above we use the **curve()** function with the arguments 'add' and 'col', which determine, respectively, that the line should be added to the existing plot and the color of the plotted line.

> curve(0.93*x,-0.1,0.1,add=TRUE,col=2)

Consequently, we obtain the following changes in our graph:

For more information on the function plot() or lines() use function **help()**, for instance

>help(plot)
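An alternative to **curve()** with a hand-typed slope is **abline()**, which reads the intercept and slope directly from a fitted lm object. Sketched here on simulated data rather than the tutorial's returns:

```r
# abline() draws the fitted line without retyping the coefficients
set.seed(3)
x1 <- rnorm(200)
y  <- 0.93 * x1 + rnorm(200, sd = 0.1)
fit <- lm(y ~ x1)
plot(x1, y)
abline(fit, col = 2)   # red fitted regression line
```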

## 4.2. Box plot

Let us now see how to use the **boxplot() **function to illustrate the data descriptive statistics. First, produce a summary of descriptive statistics for our data by the **summary()** function and then execute the **boxplot()** function for our returns:

> summary(returns)

USA CANADA GERMANY

Min. :-0.0928805 Min. :-0.0792810 Min. :-0.0901134

1st Qu.:-0.0036463 1st Qu.:-0.0038282 1st Qu.:-0.0046976

Median : 0.0005977 Median : 0.0005318 Median : 0.0005021

Mean : 0.0003897 Mean : 0.0003859 Mean : 0.0003499

3rd Qu.: 0.0046566 3rd Qu.: 0.0047591 3rd Qu.: 0.0056872

Max. : 0.0852364 Max. : 0.0752731 Max. : 0.0927688

Note that the descriptive statistics are similar for all three vectors; therefore, we can expect similar boxplots for all of the sets of financial returns. Now, execute the boxplot() function as follows

> boxplot(returns)

As a result we obtain the following three boxplots.
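The five values each box is drawn from can be inspected numerically with **boxplot.stats()**. A sketch on simulated return series standing in for the real data:

```r
# Simulated stand-ins for the three return series
set.seed(4)
fake <- data.frame(USA     = rnorm(300, sd = 0.01),
                   CANADA  = rnorm(300, sd = 0.01),
                   GERMANY = rnorm(300, sd = 0.01))
boxplot(fake)
# whisker end, lower hinge, median, upper hinge, whisker end:
boxplot.stats(fake$USA)$stats
```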

## 4.3. Histogram

In this section we will take a look at histograms. The frequency histogram was already introduced in Introduction to GNU R on Linux Operating System. We will now produce the density histogram for normalized returns and compare it with the normal density curve.

Let us first normalize the returns of the index denominated in US dollars, so that they have zero mean and variance equal to one; this lets us compare the real data with the theoretical standard normal density function.

> retUS.norm<-(returns$USA-mean(returns$USA))/sqrt(var(returns$USA))

> mean(retUS.norm)

[1] -1.053152e-17

> var(retUS.norm)

[1] 1
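The same normalization is available through the built-in **scale()** function, which subtracts the mean and divides by the sample standard deviation in one call:

```r
# scale() standardizes a vector in one call (shown on simulated data)
set.seed(5)
x <- rnorm(1000, mean = 2, sd = 3)
z <- as.vector(scale(x))
round(mean(z), 10)   # 0
var(z)               # 1
```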

Now, we produce the density histogram for the normalized returns and plot a standard normal density curve over it. This can be achieved by the following R expressions:

> hist(retUS.norm,breaks=50,freq=FALSE)

> curve(dnorm(x),-10,10,add=TRUE,col=2)

Visually, the normal curve does not fit the data well. A different distribution may be more suitable for financial returns. We will learn how to fit a distribution to data in later articles. At the moment we can conclude that a more suitable distribution would be more peaked in the middle and would have heavier tails.

## 4.4. QQ-plot

Another useful graph in statistical analysis is the QQ-plot. The QQ-plot is a quantile-quantile plot, which compares the quantiles of the empirical distribution to the quantiles of the theoretical distribution. If these match well we should see a straight line. Let us now examine the distribution of the residuals obtained from our regression analysis above. First, we will obtain a QQ-plot for the simple linear regression and then for the multiple linear regression. We will use the normal QQ-plot, which means that the theoretical quantiles in the graph correspond to quantiles of the normal distribution.

The first plot corresponding to the simple linear regression residuals is obtained by the function **qqnorm()** in the following way:

> returns.lm<-lm(returns$USA~returns$CANADA)

> qqnorm(returns.lm$residuals)

The corresponding graph is displayed below:

The second plot corresponds to the multiple linear regression residuals and is obtained as:

> returns.lm<-lm(returns$USA~returns$CANADA+returns$GERMANY)

> qqnorm(returns.lm$residuals)

This plot is displayed below:

Note that the second plot is closer to the straight line. This suggests that the residuals produced by the multiple regression analysis are closer to normally distributed, which further supports the second model as more useful than the first.
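Judging straightness by eye is easier with a reference line; **qqline()** adds a line through the first and third quartiles of the data. A sketch, with rnorm() standing in for the regression residuals:

```r
# Normal QQ-plot with a quartile reference line
set.seed(6)
res <- rnorm(500)      # stand-in for returns.lm$residuals
qqnorm(res)
qqline(res, col = 2)   # red reference line through the quartiles
```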

# 5. Conclusion

In this article we have introduced statistical modeling with GNU R using the example of linear regression. We have also discussed some graphs frequently used in statistics. I hope this has opened a door to statistical analysis with GNU R for you. In later articles we will discuss more complex applications of R for statistical modeling as well as programming, so keep reading.

**GNU R tutorial series:**

**Part I: GNU R Introductory Tutorials:**

- Introduction to GNU R on Linux Operating System
- Running GNU R on Linux Operating System
- A quick GNU R tutorial to basic operations, functions and data structures
- A quick GNU R tutorial to statistical models and graphics
- How to install and use packages in GNU R
- Building basic packages in GNU R

**Part II: GNU R Language:**

- An overview of GNU R programming language