Exercise 1.1

We use function c()

my_vec <- c(28, 29, 35, 75, 40, 52, 23, 25, 10, 50)
mean(my_vec)
## [1] 36.7
min(my_vec)
## [1] 10
max(my_vec)
## [1] 75

We store the mean in an object so that we can use this object in the formula

my_mean <- mean(my_vec)
1 / length(my_vec) * sum((my_vec - my_mean) ^ 2)
## [1] 306.41
var(my_vec)
## [1] 340.4556

Remark: the function var() \(n-1\) at the denominator (which gives an unbiased estimator of the variance for i.i.d. observations).

my_vec_st <- (my_vec - mean(my_vec)) / sd(my_vec)
getwd()
## [1] "/home/thibault/Documents/opio/R_advanced/chapter_1/slides"
save(my_vec, my_vec_st, file = "exo1.RData")

Exercise 1.2

If we print the code of the function require() we notice that it actually uses the function library() but this last one is included in the tryCatch() function.

When we apply the function library() on a package which is not installed on the machine, it causes an error and if the instruction is included in a file that we source (with function source()), the codes which appear after library() will not be executed.

When we use tryCatch(), we know that there could be an error but it will be ignored. In other term, even if a package is not installed, there will be a warning message (which is different than an error message) and the codes after require() will not be interrupted.

Moreover, require() returns a logical equal to TRUE if the package is loaded and FALSE otherwise

(require("this_package_does_not_exist"))
## Le chargement a nécessité le package : this_package_does_not_exist
## Warning in library(package, lib.loc = lib.loc, character.only = TRUE,
## logical.return = TRUE, : aucun package nommé 'this_package_does_not_exist' n'est
## trouvé
## [1] FALSE
cat("the codes coming after is still executed")
## the codes coming after is still executed
require("stringr")
require(stringr)

Actually, if we look at the code of the function require(), we can see:

if (!character.only) 
  package <- as.character(substitute(package))

In other terms, when we do not use the ““, it does not consider package argument as an object which explains why the following syntax is not working

nom_package <- "stringr"
require(nom_package)

Indeed, nom_package is not evaluated; instead of the function substitute() permits to keep nom_package as a name object which is then converted into strings with as.character().

It is not necessary to load a package for using one particular function inside it. Indeed, when we load a package, it imports lots of functions and new classes of object which is not necessaraly desired. That is why we can do:

stringr::str_to_title("data management with r (session 1)")
## [1] "Data Management With R (Session 1)"

Exercise 2.1

c(21, 180, "F", "DU", "FR", TRUE)
## [1] "21"   "180"  "F"    "DU"   "FR"   "TRUE"

Character dominates numeric/logical. All the elements of the vectors are converted in strings.

TRUE | this_object_does_not_exist
TRUE || this_object_does_not_exist

The first line gives an error message whereas it is not the case for the second line. Indeed || is a shortcut: the first verification is TRUE which means that we know that the result will be TRUE whatever the result of the second verification.

c(1, 1, 1, 1) ^ c(0, 1) + c(0, 1, 2) 

The size of the vectors are all different. The bigger vector is the first one, so the other vectors are repeated until their sizes equal 4.

c(1, 1, 1, 1) ^ c(0, 1, 0, 1) + c(0, 1, 2, 0)
## [1] 1 2 3 1
x <- seq(from = -4, to = 4, by = 0.01)
plot(x, 1 /sqrt(2 * pi) * exp(-0.5 * x ^ 2), type = "l",
     ylab = "Probability density function")

A bernoulli distribution has two possible values : 0 and 1. This is the first argument of the function sample(). We draw 100 values randomly (which is equivalent to choose \(p=0.5\)) and use argument replace = TRUE.

Exercise 2.2

Let consider the vector of strings my_word:

my_word <- c("we went 2 times to warwick",
             "moi 1 fois 1 w-e")
  • give the character position of “we” in my_word
gregexpr(pattern = "we", my_word)
## [[1]]
## [1] 1 4
## attr(,"match.length")
## [1] 2 2
## attr(,"index.type")
## [1] "chars"
## attr(,"useBytes")
## [1] TRUE
## 
## [[2]]
## [1] -1
## attr(,"match.length")
## [1] -1
## attr(,"index.type")
## [1] "chars"
## attr(,"useBytes")
## [1] TRUE
  • give the character position of “w” or “e” in my_word
gregexpr(pattern = "[we]", my_word)
## [[1]]
## [1]  1  2  4  5 14 20 23
## attr(,"match.length")
## [1] 1 1 1 1 1 1 1
## attr(,"index.type")
## [1] "chars"
## attr(,"useBytes")
## [1] TRUE
## 
## [[2]]
## [1] 14 16
## attr(,"match.length")
## [1] 1 1
## attr(,"index.type")
## [1] "chars"
## attr(,"useBytes")
## [1] TRUE
  • give the character position of “we” in my_word, knowing that there is a empty space before
gregexpr(pattern = " we", "we went to warwick")
## [[1]]
## [1] 3
## attr(,"match.length")
## [1] 3
## attr(,"index.type")
## [1] "chars"
## attr(,"useBytes")
## [1] TRUE
  • give the character position of any numbers in my_word
gregexpr(pattern = "[0-9]", my_word)
## [[1]]
## [1] 9
## attr(,"match.length")
## [1] 1
## attr(,"index.type")
## [1] "chars"
## attr(,"useBytes")
## [1] TRUE
## 
## [[2]]
## [1]  5 12
## attr(,"match.length")
## [1] 1 1
## attr(,"index.type")
## [1] "chars"
## attr(,"useBytes")
## [1] TRUE
  • count the number of times any numbers is appearing in my_word
stringr::str_count(my_word, "[0-9]")
## [1] 1 2

Exercise 2.3

We consider the two vectors weight and group

ctl <- c(4.17, 5.58, 5.18, 6.11, 4.50, 4.61, 5.17, 4.53, 5.33, 5.14)
trt <- c(4.81, 4.17, 4.41, 3.59, 5.87, 3.83, 6.03, 4.89, 4.32, 4.69)
group <- gl(2, 10, 20, labels = c("Ctl", "Trt"))
weight <- c(ctl, trt)
X <- matrix(0, 20, 2)
X[, 1] <- group == "Ctl"
X[, 2] <- group == "Trt"
split(weight, group)
## $Ctl
##  [1] 4.17 5.58 5.18 6.11 4.50 4.61 5.17 4.53 5.33 5.14
## 
## $Trt
##  [1] 4.81 4.17 4.41 3.59 5.87 3.83 6.03 4.89 4.32 4.69

It splits the vector into two vectors with regards to the variable group

lapply(split(weight, group), mean)
## $Ctl
## [1] 5.032
## 
## $Trt
## [1] 4.661

It seems that the “control” group has higher values but this not obvious.

hat_beta <- solve(A, b)
hat_y <- X %*% hat_beta
hat_e <- (weight - hat_y) 
sse <- sum(hat_e ^ 2)
sqrt(sse / 18)
## [1] 0.6963895
sst <- sum((weight - mean(weight)) ^ 2)
ssr <- sum((hat_y - mean(weight)) ^ 2)
sst == (ssr + sse)
## [1] FALSE
ssr/sst
## [1] 0.0730776

Exercise 2.4

  • Execute the solution of exercice 3 and create a list object called res_lm which contains the residuals, the SSE value and the \(R^2\).

  • Create a data.frame called pred_y which contains the fitted values, the residuals and the \(Y\) variable.

Exercise 3.1

  • Import one data set from these different web pages by using the method of your choice:

covid_data <- readr::read_csv2("https://www.data.gouv.fr/fr/datasets/r/63352e38-d353-4b54-bfd1-f1b3ee1cabd7", col_types = "cncnnnn")
## ℹ Using "','" as decimal and "'.'" as grouping mark. Use `read_delim()` for more control.
  • link 2 (import if possible a “.xls” file)
download.file("https://www.data.gouv.fr/fr/datasets/r/046bb27f-0a97-464d-b6cd-d78740c3e3a2", destfile = paste0(getwd(), "/covid.xlsx"))
covid_data <- readxl::read_xlsx("covid.xlsx")
download.file("https://www.data.gouv.fr/fr/datasets/r/021a7ffb-855f-498b-8ddc-7d9f299ba823", destfile = paste0(getwd(), "/maire.xlsx"))
maire_data <- readxl::read_xlsx("maire.xlsx", col_types = rep("text", 11))

Exercise 4.1

  • In the data admnrev from package wooldridge, transform the data from the long to the wide form with respect to the variable year. We call admnrev_wide this object
library(wooldridge)
admnrev_wide <- tidyr::pivot_wider(admnrev, id_cols = "state",
                   names_from = 2,
                   values_from = c(3, 4, 5))
admnrev_wide                   
## # A tibble: 51 × 10
##    state admnrev_85 admnrev_90 admnrev_95 daysfrst_85 daysfrst_90 daysfrst_95
##    <chr>      <int>      <int>      <int>       <int>       <int>       <int>
##  1 AL             0          0          0           0           0           0
##  2 AK             1          1          1          30          30          30
##  3 AZ             0          1          1           0          30          30
##  4 AR             0          0          0           0           0           0
##  5 CA             0          1          1           0         120          30
##  6 CO             1          1          1         365          90          90
##  7 CT             0          1          1           0          90           0
##  8 DE             1          1          1          90          90          90
##  9 DC             1          1          1           0           0           0
## 10 FL             0          1          1           0          30           0
## # … with 41 more rows, and 3 more variables: daysscnd_85 <int>,
## #   daysscnd_90 <int>, daysscnd_95 <int>
  • Transform the object admnrev_wide from wide to long object.
tidyr::pivot_longer(admnrev_wide,
                    cols = 2:10,
                    names_to = c(".value", "year"),
                    names_pattern = "(.*)_(.*)"
             )
## # A tibble: 153 × 5
##    state year  admnrev daysfrst daysscnd
##    <chr> <chr>   <int>    <int>    <int>
##  1 AL    85          0        0        0
##  2 AL    90          0        0        0
##  3 AL    95          0        0        0
##  4 AK    85          1       30      365
##  5 AK    90          1       30      365
##  6 AK    95          1       30      365
##  7 AZ    85          0        0        0
##  8 AZ    90          1       30       90
##  9 AZ    95          1       30       90
## 10 AR    85          0        0        0
## # … with 143 more rows