We use function c()
my_vec <- c(28, 29, 35, 75, 40, 52, 23, 25, 10, 50)
mean(my_vec)
## [1] 36.7
min(my_vec)
## [1] 10
max(my_vec)
## [1] 75
We store the mean in an object so that we can use this object in the formula
my_mean <- mean(my_vec)
1 / length(my_vec) * sum((my_vec - my_mean) ^ 2)
## [1] 306.41
var(my_vec)
## [1] 340.4556
Remark: the function var() \(n-1\) at the denominator (which gives an unbiased estimator of the variance for i.i.d. observations).
my_vec_st <- (my_vec - mean(my_vec)) / sd(my_vec)
getwd()
## [1] "/home/thibault/Documents/opio/R_advanced/chapter_1/slides"
save(my_vec, my_vec_st, file = "exo1.RData")
If we print the code of the function require() we notice that it actually uses the function library() but this last one is included in the tryCatch() function.
When we apply the function library() on a package which is not installed on the machine, it causes an error and if the instruction is included in a file that we source (with function source()), the codes which appear after library() will not be executed.
When we use tryCatch(), we know that there could be an error but it will be ignored. In other term, even if a package is not installed, there will be a warning message (which is different than an error message) and the codes after require() will not be interrupted.
Moreover, require() returns a logical equal to TRUE if the package is loaded and FALSE otherwise
(require("this_package_does_not_exist"))
## Le chargement a nécessité le package : this_package_does_not_exist
## Warning in library(package, lib.loc = lib.loc, character.only = TRUE,
## logical.return = TRUE, : aucun package nommé 'this_package_does_not_exist' n'est
## trouvé
## [1] FALSE
cat("the codes coming after is still executed")
## the codes coming after is still executed
require("stringr")
require(stringr)
Actually, if we look at the code of the function require(), we can see:
if (!character.only)
package <- as.character(substitute(package))
In other terms, when we do not use the ““, it does not consider package argument as an object which explains why the following syntax is not working
nom_package <- "stringr"
require(nom_package)
Indeed, nom_package is not evaluated; instead of the function substitute() permits to keep nom_package as a name object which is then converted into strings with as.character().
It is not necessary to load a package for using one particular function inside it. Indeed, when we load a package, it imports lots of functions and new classes of object which is not necessaraly desired. That is why we can do:
stringr::str_to_title("data management with r (session 1)")
## [1] "Data Management With R (Session 1)"
c(21, 180, "F", "DU", "FR", TRUE)
## [1] "21" "180" "F" "DU" "FR" "TRUE"
Character dominates numeric/logical. All the elements of the vectors are converted in strings.
TRUE | this_object_does_not_exist
TRUE || this_object_does_not_exist
The first line gives an error message whereas it is not the case for the second line. Indeed || is a shortcut: the first verification is TRUE which means that we know that the result will be TRUE whatever the result of the second verification.
c(1, 1, 1, 1) ^ c(0, 1) + c(0, 1, 2)
The size of the vectors are all different. The bigger vector is the first one, so the other vectors are repeated until their sizes equal 4.
c(1, 1, 1, 1) ^ c(0, 1, 0, 1) + c(0, 1, 2, 0)
## [1] 1 2 3 1
R includes a lot of base functions which can be seen here. Choose 5 of them, describe and illustrate them.
Sys.time() returns the current date with time of the machine
Sys.time()
## [1] "2022-09-20 18:55:33 CEST"
is.na() returns a vector of logical for identifying the elements with NA (Non Available) values
x <- c(NA, 1, 2, 3)
is.na(x)
## [1] TRUE FALSE FALSE FALSE
for is a reserved word (like if, else, repeat, while, function, for, in, next, break, TRUE, FALSE, NULL, Inf, NaN, NA, NA_integer_, NA_real_ NA_complex_, NA_character_, which means that it can not be used as a name object. It is used for doing a for loop
for (i in 1:3)
print(i)
## [1] 1
## [1] 2
## [1] 3
ceiling() takes a single numeric argument x and returns a numeric vector containing the smallest integers not less than the corresponding elements of x.
ceiling(0.0000000000001)
## [1] 1
floor() takes a single numeric argument x and returns a numeric vector containing the largest integers not greater than the corresponding elements of x.
floor(0.999999999999)
## [1] 0
round() rounds the values in its first argument to the specified number of decimal places (default 0). See ‘Details’ about “round to even” when rounding off a 5.
round(c(0.4999999999, 0.50000000001))
## [1] 0 1
Plot the function \(f(x)=\frac{1}{\sqrt{2\pi}}\exp(-\frac{1}{2}x^2)\) for \(x\in[-4,4]\)
x <- seq(from = -4, to = 4, by = 0.01)
plot(x, 1 /sqrt(2 * pi) * exp(-0.5 * x ^ 2), type = "l",
ylab = "Probability density function")
A bernoulli distribution has two possible values : 0 and 1. This is the first argument of the function sample(). We draw 100 values randomly (which is equivalent to choose \(p=0.5\)) and use argument replace = TRUE.
What are the differences between sort(), order() and rank():
age <- c(25, 28, 30, NA, 21, 26, 29, 31, NA, 22, 27)
sort(age)
## [1] 21 22 25 26 27 28 29 30 31
rank(age)
## [1] 3 6 8 10 1 4 7 9 11 2 5
order(age)
## [1] 5 10 1 6 11 2 7 3 8 4 9
Let consider the vector of strings my_word:
my_word <- c("we went 2 times to warwick",
"moi 1 fois 1 w-e")
gregexpr(pattern = "we", my_word)
## [[1]]
## [1] 1 4
## attr(,"match.length")
## [1] 2 2
## attr(,"index.type")
## [1] "chars"
## attr(,"useBytes")
## [1] TRUE
##
## [[2]]
## [1] -1
## attr(,"match.length")
## [1] -1
## attr(,"index.type")
## [1] "chars"
## attr(,"useBytes")
## [1] TRUE
gregexpr(pattern = "[we]", my_word)
## [[1]]
## [1] 1 2 4 5 14 20 23
## attr(,"match.length")
## [1] 1 1 1 1 1 1 1
## attr(,"index.type")
## [1] "chars"
## attr(,"useBytes")
## [1] TRUE
##
## [[2]]
## [1] 14 16
## attr(,"match.length")
## [1] 1 1
## attr(,"index.type")
## [1] "chars"
## attr(,"useBytes")
## [1] TRUE
gregexpr(pattern = " we", "we went to warwick")
## [[1]]
## [1] 3
## attr(,"match.length")
## [1] 3
## attr(,"index.type")
## [1] "chars"
## attr(,"useBytes")
## [1] TRUE
gregexpr(pattern = "[0-9]", my_word)
## [[1]]
## [1] 9
## attr(,"match.length")
## [1] 1
## attr(,"index.type")
## [1] "chars"
## attr(,"useBytes")
## [1] TRUE
##
## [[2]]
## [1] 5 12
## attr(,"match.length")
## [1] 1 1
## attr(,"index.type")
## [1] "chars"
## attr(,"useBytes")
## [1] TRUE
stringr::str_count(my_word, "[0-9]")
## [1] 1 2
We consider the two vectors weight and group
ctl <- c(4.17, 5.58, 5.18, 6.11, 4.50, 4.61, 5.17, 4.53, 5.33, 5.14)
trt <- c(4.81, 4.17, 4.41, 3.59, 5.87, 3.83, 6.03, 4.89, 4.32, 4.69)
group <- gl(2, 10, 20, labels = c("Ctl", "Trt"))
weight <- c(ctl, trt)
X <- matrix(0, 20, 2)
X[, 1] <- group == "Ctl"
X[, 2] <- group == "Trt"
split(weight, group)
## $Ctl
## [1] 4.17 5.58 5.18 6.11 4.50 4.61 5.17 4.53 5.33 5.14
##
## $Trt
## [1] 4.81 4.17 4.41 3.59 5.87 3.83 6.03 4.89 4.32 4.69
It splits the vector into two vectors with regards to the variable group
lapply(split(weight, group), mean)
## $Ctl
## [1] 5.032
##
## $Trt
## [1] 4.661
It seems that the “control” group has higher values but this not obvious.
create matrix \(A=(X'X)\) and \(b = (X'y)\) where \(y\) is the weight
solve the equation \(A\beta=b\). We call hat_beta the solution of the equation
hat_beta <- solve(A, b)
hat_y <- X %*% hat_beta
hat_e <- (weight - hat_y)
sse <- sum(hat_e ^ 2)
sqrt(sse / 18)
## [1] 0.6963895
sst <- sum((weight - mean(weight)) ^ 2)
ssr <- sum((hat_y - mean(weight)) ^ 2)
sst == (ssr + sse)
## [1] FALSE
ssr/sst
## [1] 0.0730776
Execute the solution of exercice 3 and create a list object called res_lm which contains the residuals, the SSE value and the \(R^2\).
Create a data.frame called pred_y which contains the fitted values, the residuals and the \(Y\) variable.
Import one data set from these different web pages by using the method of your choice:
covid_data <- readr::read_csv2("https://www.data.gouv.fr/fr/datasets/r/63352e38-d353-4b54-bfd1-f1b3ee1cabd7", col_types = "cncnnnn")
## ℹ Using "','" as decimal and "'.'" as grouping mark. Use `read_delim()` for more control.
download.file("https://www.data.gouv.fr/fr/datasets/r/046bb27f-0a97-464d-b6cd-d78740c3e3a2", destfile = paste0(getwd(), "/covid.xlsx"))
covid_data <- readxl::read_xlsx("covid.xlsx")
download.file("https://www.data.gouv.fr/fr/datasets/r/021a7ffb-855f-498b-8ddc-7d9f299ba823", destfile = paste0(getwd(), "/maire.xlsx"))
maire_data <- readxl::read_xlsx("maire.xlsx", col_types = rep("text", 11))
library(wooldridge)
admnrev_wide <- tidyr::pivot_wider(admnrev, id_cols = "state",
names_from = 2,
values_from = c(3, 4, 5))
admnrev_wide
## # A tibble: 51 × 10
## state admnrev_85 admnrev_90 admnrev_95 daysfrst_85 daysfrst_90 daysfrst_95
## <chr> <int> <int> <int> <int> <int> <int>
## 1 AL 0 0 0 0 0 0
## 2 AK 1 1 1 30 30 30
## 3 AZ 0 1 1 0 30 30
## 4 AR 0 0 0 0 0 0
## 5 CA 0 1 1 0 120 30
## 6 CO 1 1 1 365 90 90
## 7 CT 0 1 1 0 90 0
## 8 DE 1 1 1 90 90 90
## 9 DC 1 1 1 0 0 0
## 10 FL 0 1 1 0 30 0
## # … with 41 more rows, and 3 more variables: daysscnd_85 <int>,
## # daysscnd_90 <int>, daysscnd_95 <int>
tidyr::pivot_longer(admnrev_wide,
cols = 2:10,
names_to = c(".value", "year"),
names_pattern = "(.*)_(.*)"
)
## # A tibble: 153 × 5
## state year admnrev daysfrst daysscnd
## <chr> <chr> <int> <int> <int>
## 1 AL 85 0 0 0
## 2 AL 90 0 0 0
## 3 AL 95 0 0 0
## 4 AK 85 1 30 365
## 5 AK 90 1 30 365
## 6 AK 95 1 30 365
## 7 AZ 85 0 0 0
## 8 AZ 90 1 30 90
## 9 AZ 95 1 30 90
## 10 AR 85 0 0 0
## # … with 143 more rows