Simple Linear Regression Model using R
UNIFE_ STATISTICA MULTIVARIATA (DEP. MATEMATICA) 

Mini V. OTTOBRE-2019

RESEARCH QUESTION:
does exist a linear causal relationship between the number of cakes sold in a week (by a firm) and the unit’s price (the price applied per cake)?


Let’s observe a given dataset and perform a simple linear regression analysis

#Analysis: step by step
0. LET'S PREPARE THE DATASET
1. Visualize the relationship: the scatter plot
2. Identify the estimated model
3. The model on a graph
4. Prediction: the expected Y values given a X value
5. The model’s goodness of fit
6. Graphical analysis of Linear Regression Model’s assumptions
7. what about the inference? #

_________________________________________

#0.LET'S PREPARE THE DATASET
#we upload an external dataset

#A) CHECK THE DIRECTORY PROCESS
getwd()
#B) CHANGE THE ACTUAL DIRECTORY (IF NECESSARY) FROM THE FILE BAR
#C) CHECK AGAIN THE DIRECTORY PROCESS
getwd()
#D) WE IMPORT THE DATABASE OF INTERES
cake<-read.csv2("cake_reg lin.csv")
#E) CHECK THE UPLOADED DATASET
View(cake)
#F) CHECK THE DATABASE STRUCTURE
str(cake) #this command shows the structure and characteristics of the data
head(cake) #this command shows the first six rows of our dataset
# G) ...TO BE SURE THE DATABASE IS AVAILABLE WITHIN THE R SOFTWARE FOR NEXT ANALYSIS
attach(cake)

#BECAUSE WE ARE INTERESTED IN TWO VARIABLES (UNITS AND PRICE), WE EXCLUDE THE FIRST ONE
cake=cake[,-1]




#1. Graphical observation of the data#

plot(x,y)


#What we can say about the relationship between this couple of data?#

#2. We may identify the model using two different strategies: 
a)	Following all the steps seen in theory
b)	Using the lm function in R#

#2A: Let’s follow the steps we’ve seen in theory#

x.difference=x-mean(x)                       #xi - x average #
x.difference
y.difference=y-mean(y)                       #yi - y average #
y.scarti

dev.x=sum(x.difference^2)          #total sum of (xi - x average)#
dev.x
dev.y=sum(y.difference^2)          #total sum of (yi - y average)#
dev.y

# let’s compute the total sum of the product between x and y differences#

codev.xy=sum(x.difference*y.difference)
codev.xy

#now we have all the elements to compute the coefficients of our model#

b1=codev.xy/dev.x           #SSYX/SSX #
b1

b0=mean(y)-mean(x)*b1                  # average y -b1*average x#
b0

#using those information we may transcript the equation of our estimated model #
#y= b0+b1 * xi -->  #

#we may predict the value of weekly SOLD_KAKES for a given unit price #
#before to make any prediction, It’s important to individuate the X range, given by the minimum and maximum value  that X takes in our database: we have two different possibilities: 
>max(x)
>min(x)
>range(x)

#let’s now make prediction    ?  Using the model :   prediction=b0+b1*x
#How many cakes we estimate to sell in a week in which the unit’s price is 5.3$ ?#
prediction5.3=b0+b1*5.3
prediction5.3

#please, interpret the obtained result ? when the unit’s price is 5.3$, in that week we’ll expect to sell …... cakes#
#How many cakes we estimate to sell in a week in which the unit’s price is 7.2$ ?#

Prediction7.2=b0+b1*7.2
Prediction7.2
#please, interpret the obtained result#
#--------------#
#B. let’s compute the Simple Linear Regression Model  using the R function lm()#

#the function is  lm(dependent variable (Y)~explanatory variable (X))#
#how to write “tilde” using your keyboard? alt+126 (from the numerical small keyboard on the right side)#

reg.lin=lm(y~x)

#the result is an object in R: we may visualize the performed linear regression simply by re-calling the object’s name#

reg.lin

#when we want to visualize some specified contents of our analysis 
we need to use the dollar symbol between the model’s name and 
the specified contents we are interested in $#
# i.e.  regression$specification
#for instance we may want to visualize the coefficients of our model #
reg.lin$coefficients

# definitely we have individuate the equation of our estimated model #
#__________________#

#3. Plot of our linear model#

plot(x,y)    #pairs of coordinates 

lines(x,y)   #line which link all the coordinates

abline(reg.lin)        #graphical representation of the regression line

#__________________#

# 4. Prediction: the expected Y values given a X value
?	Already seen in the 2A step
#If in a given week, the company we are working for decides to apply a unit cake’s price equals to 6.8$, how many cakes we’ll expect to sell (in that week)?#

prev6.8=b0+b1*6.8
prev6.8
#comment the results: how many cakes the company should prepare for that week?#
#_________________#

#5. The model’s goodness of fit or the coefficient of determination (R2) 
# how much of the total variation in Y is explained by our simple regression model?
#three ways to identify R2:
a)	Computing SSR/SST
b)	Checking the regression model’s output
c)	Checking the ANOVA table

#5A. let’s compute R2=SSR/SST#

dev.tot=sum((y-mean(y))^2)                      #total residuals SST
dev.disp=sum(reg.lin$residuals^2)           #residuals SSE
dev.reg=dev.tot-dev.disp                            #regression’s residuals SSR

RQ=dev.reg/dev.tot
RQ

#how we can interpret the result?
#does the model we’ve performed explain a lot of the variation in Y?
#Is it a good model or not? 
#how much of the variation in Y is not explained by the model? So, how much of unexplained variation in Y still exists? (part of variation dues to different factors or not caught by the linear relationship)

#------------#
#5B. we may obtain the value of the coefficient of determination (R2) observing the summary of our regression model ? we use the command “summary”

summary(reg.lin)
#on the penultimate row of the obtained output we’ll see the R2 value
#_________________#
#5C. . we may obtain the value of the coefficient of determination (R2) observing the ANOVA output ? we use the anova command (analysis of variance)
anova(reg.lin)
 
SSR=77991
SST=(77991+91998)=169989
R2=77991/169989=0.4588
#__________________#

#6. CHECKING THE LINEAR REGRESSION ASSUMPTIONS #
We observe one plot for each assumption: 
a)	linearity between Y and X
plot(x,y)
abline(reg.lin)
b)	independence of the error terms from the explanatory variable
e=reg.lin$residuals
plot(x,e)
c)	constant variance for all levels of X
plot(x,e)
d)	normal distribution of the error terms
hist(e)
#please, comment each plot considering the basic assumptions#
#_______________#

 7. CHEKING THE INFERENCE ON THE RELATIONSHIP BETWEEN X AN Y 
#7. Making inference: test of the linear relationship between the variables within the reality (and not only in our sample)#
#We mail follow two ways:
#a.we compute the t-statistic and the critical value, thus we compare them
#b.we observe the p-value associated with the b1 coefficient within the summary of our regression analysis

#the system of  hypothesis: 
# H0 : B1= 0 (we exclude linear relation between Y and X within the population)
# H1 : B1 ? 0 (we don’t exclude the linear relationship between Y and X within the reality)

#if Tstat > v.c. --> we refuse H0 --> linear relationship between Y and X within the rality

#Tstat = (b1 - B1)/Sb1 --> where Sb1 = SSE^2/dev X

# let’s compute the elements using R: 
dev.disp=sum(reg.lin$residuals^2)           #residuals SSE
dev.disp    

x.difference=x-mean(x)                       #xi - x average #
dev.x=sum(x.difference^2)          #total sum of (xi - x average)#
dev.x         

b1=47.577
b1            #already computed to individuate the SLR coefficients

#thus, the T-statistics is: 

tstat= b1/sqrt(dev.disp/(32*dev.x))
tstat           # equals 5.208#

#let’s now calculate the t a/2  critical value, assuming a level of significance a equals to 0.05 (so a level of confidence = 1-0.05= 95% )#
#vc = ta/2 --> alfa = 0.01--> alfa/2 = 0.025; n-2 = 34-2=32#
#the command for the critical value  t sub a over 2 : qt(a/2, degree of freedom)

qt(0.025,32)      #il v.c. -2.037
#let’s compare the obtained results (in absolute value) for making a final decision: 
#t-stat > ta/2   5.208>2.037 --> we refuse H0 thus
#it exists an empirical evidence that at the 95% of confidence there is a linear relationship between number of sold cakes and unit’s cake prices#
#Please observe the summary to individuate the confidence level of our inference: 
summary(reg.lin)


## 7. BIS : INFERENCE ABOUT RHO

# #TOPIC:  the strength of the correlation between x and y within the population 
#we test the system of hypothesis based on the correlation coefficient RHO (?) #

#H0: RHO=0 (no correlation)
#H1: RHO dif. from 0 (correlazione –with given intensity)

#if t-statr > v.c. --> we reject H0 --> thus, it exists a correlation –with a given intensity – within the reality between x and y

#tstat.r=(r-RHO)/root.sq of ((1-r^2)/n-2))    c.v.:  a/2  and d.f = n-2

# we find the R^2 in summary

RQ= valore

r=sqrt(RQ)
r #there is a "moderate/strong" positive correlation

num=r #numerator of tstat.r

den=sqrt((1-(RQ))/32) # denominator of tstat.r

tstat.r=num/den

tstat.r 

vc.r=qt(0.025,32)

vc.r #-2.037

# (ABS VALUE!): tstat.r>C.V.r --> we reject H0 --> it exists a correlation –whit that given strenght- within the reality 
# significant correlation between  x and  y #