Simple Linear Regression Model using R UNIFE_ STATISTICA MULTIVARIATA (DEP. MATEMATICA) Mini V. OTTOBRE-2019 RESEARCH QUESTION: does exist a linear causal relationship between the number of cakes sold in a week (by a firm) and the unit’s price (the price applied per cake)? Let’s observe a given dataset and perform a simple linear regression analysis #Analysis: step by step 0. LET'S PREPARE THE DATASET 1. Visualize the relationship: the scatter plot 2. Identify the estimated model 3. The model on a graph 4. Prediction: the expected Y values given a X value 5. The model’s goodness of fit 6. Graphical analysis of Linear Regression Model’s assumptions 7. what about the inference? # _________________________________________ #0.LET'S PREPARE THE DATASET #we upload an external dataset #A) CHECK THE DIRECTORY PROCESS getwd() #B) CHANGE THE ACTUAL DIRECTORY (IF NECESSARY) FROM THE FILE BAR #C) CHECK AGAIN THE DIRECTORY PROCESS getwd() #D) WE IMPORT THE DATABASE OF INTERES cake<-read.csv2("cake_reg lin.csv") #E) CHECK THE UPLOADED DATASET View(cake) #F) CHECK THE DATABASE STRUCTURE str(cake) #this command shows the structure and characteristics of the data head(cake) #this command shows the first six rows of our dataset # G) ...TO BE SURE THE DATABASE IS AVAILABLE WITHIN THE R SOFTWARE FOR NEXT ANALYSIS attach(cake) #BECAUSE WE ARE INTERESTED IN TWO VARIABLES (UNITS AND PRICE), WE EXCLUDE THE FIRST ONE cake=cake[,-1] #1. Graphical observation of the data# plot(x,y) #What we can say about the relationship between this couple of data?# #2. We may identify the model using two different strategies: a) Following all the steps seen in theory b) Using the lm function in R# #2A: Let’s follow the steps we’ve seen in theory# x.difference=x-mean(x) #xi - x average # x.difference y.difference=y-mean(y) #yi - y average # y.scarti dev.x=sum(x.difference^2) #total sum of (xi - x average)# dev.x dev.y=sum(y.difference^2) #total sum of (yi - y average)# dev.y # let’s compute the total sum of the product between x and y differences# codev.xy=sum(x.difference*y.difference) codev.xy #now we have all the elements to compute the coefficients of our model# b1=codev.xy/dev.x #SSYX/SSX # b1 b0=mean(y)-mean(x)*b1 # average y -b1*average x# b0 #using those information we may transcript the equation of our estimated model # #y= b0+b1 * xi --> # #we may predict the value of weekly SOLD_KAKES for a given unit price # #before to make any prediction, It’s important to individuate the X range, given by the minimum and maximum value that X takes in our database: we have two different possibilities: >max(x) >min(x) >range(x) #let’s now make prediction ? Using the model : prediction=b0+b1*x #How many cakes we estimate to sell in a week in which the unit’s price is 5.3$ ?# prediction5.3=b0+b1*5.3 prediction5.3 #please, interpret the obtained result ? when the unit’s price is 5.3$, in that week we’ll expect to sell …... cakes# #How many cakes we estimate to sell in a week in which the unit’s price is 7.2$ ?# Prediction7.2=b0+b1*7.2 Prediction7.2 #please, interpret the obtained result# #--------------# #B. let’s compute the Simple Linear Regression Model using the R function lm()# #the function is lm(dependent variable (Y)~explanatory variable (X))# #how to write “tilde” using your keyboard? alt+126 (from the numerical small keyboard on the right side)# reg.lin=lm(y~x) #the result is an object in R: we may visualize the performed linear regression simply by re-calling the object’s name# reg.lin #when we want to visualize some specified contents of our analysis we need to use the dollar symbol between the model’s name and the specified contents we are interested in $# # i.e. regression$specification #for instance we may want to visualize the coefficients of our model # reg.lin$coefficients # definitely we have individuate the equation of our estimated model # #__________________# #3. Plot of our linear model# plot(x,y) #pairs of coordinates lines(x,y) #line which link all the coordinates abline(reg.lin) #graphical representation of the regression line #__________________# # 4. Prediction: the expected Y values given a X value ? Already seen in the 2A step #If in a given week, the company we are working for decides to apply a unit cake’s price equals to 6.8$, how many cakes we’ll expect to sell (in that week)?# prev6.8=b0+b1*6.8 prev6.8 #comment the results: how many cakes the company should prepare for that week?# #_________________# #5. The model’s goodness of fit or the coefficient of determination (R2) # how much of the total variation in Y is explained by our simple regression model? #three ways to identify R2: a) Computing SSR/SST b) Checking the regression model’s output c) Checking the ANOVA table #5A. let’s compute R2=SSR/SST# dev.tot=sum((y-mean(y))^2) #total residuals SST dev.disp=sum(reg.lin$residuals^2) #residuals SSE dev.reg=dev.tot-dev.disp #regression’s residuals SSR RQ=dev.reg/dev.tot RQ #how we can interpret the result? #does the model we’ve performed explain a lot of the variation in Y? #Is it a good model or not? #how much of the variation in Y is not explained by the model? So, how much of unexplained variation in Y still exists? (part of variation dues to different factors or not caught by the linear relationship) #------------# #5B. we may obtain the value of the coefficient of determination (R2) observing the summary of our regression model ? we use the command “summary” summary(reg.lin) #on the penultimate row of the obtained output we’ll see the R2 value #_________________# #5C. . we may obtain the value of the coefficient of determination (R2) observing the ANOVA output ? we use the anova command (analysis of variance) anova(reg.lin) SSR=77991 SST=(77991+91998)=169989 R2=77991/169989=0.4588 #__________________# #6. CHECKING THE LINEAR REGRESSION ASSUMPTIONS # We observe one plot for each assumption: a) linearity between Y and X plot(x,y) abline(reg.lin) b) independence of the error terms from the explanatory variable e=reg.lin$residuals plot(x,e) c) constant variance for all levels of X plot(x,e) d) normal distribution of the error terms hist(e) #please, comment each plot considering the basic assumptions# #_______________# 7. CHEKING THE INFERENCE ON THE RELATIONSHIP BETWEEN X AN Y #7. Making inference: test of the linear relationship between the variables within the reality (and not only in our sample)# #We mail follow two ways: #a.we compute the t-statistic and the critical value, thus we compare them #b.we observe the p-value associated with the b1 coefficient within the summary of our regression analysis #the system of hypothesis: # H0 : B1= 0 (we exclude linear relation between Y and X within the population) # H1 : B1 ? 0 (we don’t exclude the linear relationship between Y and X within the reality) #if Tstat > v.c. --> we refuse H0 --> linear relationship between Y and X within the rality #Tstat = (b1 - B1)/Sb1 --> where Sb1 = SSE^2/dev X # let’s compute the elements using R: dev.disp=sum(reg.lin$residuals^2) #residuals SSE dev.disp x.difference=x-mean(x) #xi - x average # dev.x=sum(x.difference^2) #total sum of (xi - x average)# dev.x b1=47.577 b1 #already computed to individuate the SLR coefficients #thus, the T-statistics is: tstat= b1/sqrt(dev.disp/(32*dev.x)) tstat # equals 5.208# #let’s now calculate the t a/2 critical value, assuming a level of significance a equals to 0.05 (so a level of confidence = 1-0.05= 95% )# #vc = ta/2 --> alfa = 0.01--> alfa/2 = 0.025; n-2 = 34-2=32# #the command for the critical value t sub a over 2 : qt(a/2, degree of freedom) qt(0.025,32) #il v.c. -2.037 #let’s compare the obtained results (in absolute value) for making a final decision: #t-stat > ta/2 5.208>2.037 --> we refuse H0 thus #it exists an empirical evidence that at the 95% of confidence there is a linear relationship between number of sold cakes and unit’s cake prices# #Please observe the summary to individuate the confidence level of our inference: summary(reg.lin) ## 7. BIS : INFERENCE ABOUT RHO # #TOPIC: the strength of the correlation between x and y within the population #we test the system of hypothesis based on the correlation coefficient RHO (?) # #H0: RHO=0 (no correlation) #H1: RHO dif. from 0 (correlazione –with given intensity) #if t-statr > v.c. --> we reject H0 --> thus, it exists a correlation –with a given intensity – within the reality between x and y #tstat.r=(r-RHO)/root.sq of ((1-r^2)/n-2)) c.v.: a/2 and d.f = n-2 # we find the R^2 in summary RQ= valore r=sqrt(RQ) r #there is a "moderate/strong" positive correlation num=r #numerator of tstat.r den=sqrt((1-(RQ))/32) # denominator of tstat.r tstat.r=num/den tstat.r vc.r=qt(0.025,32) vc.r #-2.037 # (ABS VALUE!): tstat.r>C.V.r --> we reject H0 --> it exists a correlation –whit that given strenght- within the reality # significant correlation between x and y #