<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://wario84.github.io/idsc_mgs/feed.xml" rel="self" type="application/atom+xml" /><link href="https://wario84.github.io/idsc_mgs/" rel="alternate" type="text/html" /><updated>2026-03-13T09:32:11+00:00</updated><id>https://wario84.github.io/idsc_mgs/feed.xml</id><title type="html">Introduction to Applied Data Science</title><subtitle>This website covers the material for the course Applied Data Science.</subtitle><entry><title type="html">Lecture 8: Role of Data Science in Society.</title><link href="https://wario84.github.io/idsc_mgs/2022/03/29/w8_2_Lecture_08.html" rel="alternate" type="text/html" title="Lecture 8: Role of Data Science in Society." /><published>2022-03-29T00:00:00+00:00</published><updated>2022-03-29T00:00:00+00:00</updated><id>https://wario84.github.io/idsc_mgs/2022/03/29/w8_2_Lecture_08</id><content type="html" xml:base="https://wario84.github.io/idsc_mgs/2022/03/29/w8_2_Lecture_08.html"><![CDATA[<h1 id="introduction">Introduction</h1>

<p>You’ll find this post in your <code class="language-plaintext highlighter-rouge">_posts</code> directory. Go ahead and edit it and re-build the site to see your changes. You can rebuild the site in many different ways, but the most common way is to run <code class="language-plaintext highlighter-rouge">jekyll serve</code>, which launches a web server and auto-regenerates your site when a file is updated.</p>

<p>Jekyll requires blog post files to be named according to the following format:</p>

<p><code class="language-plaintext highlighter-rouge">YEAR-MONTH-DAY-title.MARKUP</code></p>

<p>Where <code class="language-plaintext highlighter-rouge">YEAR</code> is a four-digit number, <code class="language-plaintext highlighter-rouge">MONTH</code> and <code class="language-plaintext highlighter-rouge">DAY</code> are both two-digit numbers, and <code class="language-plaintext highlighter-rouge">MARKUP</code> is the file extension representing the format used in the file. After that, include the necessary front matter. Take a look at the source for this post to get an idea about how it works.</p>

<p>Jekyll also offers powerful support for code snippets:</p>

<figure class="highlight"><pre><code class="language-ruby" data-lang="ruby"><span class="k">def</span> <span class="nf">print_hi</span><span class="p">(</span><span class="nb">name</span><span class="p">)</span>
  <span class="nb">puts</span> <span class="s2">"Hi, </span><span class="si">#{</span><span class="nb">name</span><span class="si">}</span><span class="s2">"</span>
<span class="k">end</span>
<span class="n">print_hi</span><span class="p">(</span><span class="s1">'Tom'</span><span class="p">)</span>
<span class="c1">#=&gt; prints 'Hi, Tom' to STDOUT.</span></code></pre></figure>

<p>Check out the <a href="https://jekyllrb.com/docs/home">Jekyll docs</a> for more info on how to get the most out of Jekyll. File all bugs/feature requests at <a href="https://github.com/jekyll/jekyll">Jekyll’s GitHub repo</a>. If you have questions, you can ask them on <a href="https://talk.jekyllrb.com/">Jekyll Talk</a>.</p>]]></content><author><name>Mario H. Gonzalez-Sauri</name></author><summary type="html"><![CDATA[Introduction]]></summary></entry><entry><title type="html">Tutorial 7: Introduction to String Analysis.</title><link href="https://wario84.github.io/idsc_mgs/2022/03/28/w7_1_Tutorial_07.html" rel="alternate" type="text/html" title="Tutorial 7: Introduction to String Analysis." /><published>2022-03-28T00:00:00+00:00</published><updated>2022-03-28T00:00:00+00:00</updated><id>https://wario84.github.io/idsc_mgs/2022/03/28/w7_1_Tutorial_07</id><content type="html" xml:base="https://wario84.github.io/idsc_mgs/2022/03/28/w7_1_Tutorial_07.html"><![CDATA[<h1 id="exercise-1-grepgrepl">Exercise 1: <code class="language-plaintext highlighter-rouge">grep()/grepl()</code></h1>

<p>1.1 Extract only the strings that contain an email address from the
following character vector
<code class="language-plaintext highlighter-rouge">a &lt;- c("www.google.com", "www.yahoo.com", "fisher@gmail.com", "www.youtube.com", "thompson@hotmail.com")</code>.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[1] "fisher@gmail.com"     "thompson@hotmail.com"
</code></pre></div></div>
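
<p>One way to obtain this output (a sketch; any pattern that reliably identifies the e-mail addresses works) is to match the <code class="language-plaintext highlighter-rouge">@</code> character with <code class="language-plaintext highlighter-rouge">grep()</code> and return the matching values:</p>

<pre><code class="language-r">a &lt;- c("www.google.com", "www.yahoo.com", "fisher@gmail.com",
       "www.youtube.com", "thompson@hotmail.com")
grep("@", a, value = TRUE)
#&gt; [1] "fisher@gmail.com"     "thompson@hotmail.com"
</code></pre>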

<p>1.2 Display <code class="language-plaintext highlighter-rouge">TRUE</code> if the last name of the person starts with <strong>A</strong> or
<strong>G</strong>, and <code class="language-plaintext highlighter-rouge">FALSE</code> otherwise, using the following character vector
<code class="language-plaintext highlighter-rouge">b &lt;- c("Anderson", "Abel", "Armstrong", "Barbosa", "Brunton", "Boucher", "Crossley", "Cameron", "Cleveland", "Delatorre", "Durrant", "Ellwood", "Eaton", "Gibbins", "Griff", "Guzman")</code>.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> [1]  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[13] FALSE  TRUE  TRUE  TRUE
</code></pre></div></div>
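
<p>A possible solution (a sketch) anchors the pattern at the start of the string with <code class="language-plaintext highlighter-rouge">^</code> and uses a character class inside <code class="language-plaintext highlighter-rouge">grepl()</code>:</p>

<pre><code class="language-r">b &lt;- c("Anderson", "Abel", "Armstrong", "Barbosa", "Brunton", "Boucher",
       "Crossley", "Cameron", "Cleveland", "Delatorre", "Durrant",
       "Ellwood", "Eaton", "Gibbins", "Griff", "Guzman")
grepl("^[AG]", b)  # TRUE only for names starting with A or G
</code></pre>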

<h1 id="exercise-2-subgsub">Exercise 2: <code class="language-plaintext highlighter-rouge">sub()/gsub()</code></h1>

<p>2.1 Clean the following strings, transforming the character vector
<code class="language-plaintext highlighter-rouge">c &lt;- c("cSVgKl9yyb", "e7w4o11oh8", "iYdWYvV7b2", "Epal3cNuGH", "NNhbMR0ocT", "fYaRvoag8B", "LO4fkHm7Kn", "JK8jKhS5De", "DcMAZ7Rxtp", "sV0tqC8XSd")</code>
into an integer vector.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> [1]     9 74118    72     3     0     8    47    85     7     8
</code></pre></div></div>
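
<p>One possible approach (a sketch) strips every non-digit character with <code class="language-plaintext highlighter-rouge">gsub()</code> and coerces the result with <code class="language-plaintext highlighter-rouge">as.integer()</code>:</p>

<pre><code class="language-r">c &lt;- c("cSVgKl9yyb", "e7w4o11oh8", "iYdWYvV7b2", "Epal3cNuGH", "NNhbMR0ocT",
       "fYaRvoag8B", "LO4fkHm7Kn", "JK8jKhS5De", "DcMAZ7Rxtp", "sV0tqC8XSd")
as.integer(gsub("[^0-9]", "", c))  # keep only the digits, then coerce
</code></pre>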

<ol>
  <li>2.2 Clean the following strings by removing the quotes from the following
character vector
<code class="language-plaintext highlighter-rouge">d &lt;- c("\"abilene\"", "\"christian\"", "\"university\"", "\"adelphi\"", "\"university\"", "\"adrian\"", "\"college\"", "\"agnes\"", "\"scott\"", "\"college\"", "\"alaska\"", "\"pacific\"", "\"university\"")</code></li>
</ol>

<!-- -->

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> [1] "abilene"    "christian"  "university" "adelphi"    "university"
 [6] "adrian"     "college"    "agnes"      "scott"      "college"   
[11] "alaska"     "pacific"    "university"
</code></pre></div></div>
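
<p>Since the quotes inside the strings are literal characters, a fixed (non-regex) substitution with <code class="language-plaintext highlighter-rouge">gsub()</code> is enough (a sketch):</p>

<pre><code class="language-r">d &lt;- c("\"abilene\"", "\"christian\"", "\"university\"", "\"adelphi\"",
       "\"university\"", "\"adrian\"", "\"college\"", "\"agnes\"",
       "\"scott\"", "\"college\"", "\"alaska\"", "\"pacific\"",
       "\"university\"")
gsub("\"", "", d, fixed = TRUE)  # remove every literal quote character
</code></pre>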

<h1 id="exercise-3-strsplit">Exercise 3: <code class="language-plaintext highlighter-rouge">strsplit</code></h1>

<ol>
  <li>3.1 Split the following string of characters into sentences, using
<code class="language-plaintext highlighter-rouge">"\\."</code> (the escaped period) at the end of
each sentence as the delimiter. Note that the quotation marks inside the text must be escaped as <code class="language-plaintext highlighter-rouge">\"</code> for the string to parse in <strong>R</strong>:
<code class="language-plaintext highlighter-rouge">f &lt;- "Down, down, down. There was nothing else to do, so Alice soon began talking again. \"Dinah'll miss me very much to-night, I should think!\" (Dinah was the cat.) \"I hope they'll remember her saucer of milk at tea-time. Dinah my dear! I wish you were down here with me! There are no mice in the air, I'm afraid, but you might catch a bat, and that's very like a mouse, you know. But do cats eat bats, I wonder?\" And here Alice began to get rather sleepy, and went on saying to herself, in a dreamy sort of way, \"Do cats eat bats? Do cats eat bats?\" and sometimes, \"Do bats eat cats?\" for, you see, as she couldn't answer either question, it didn't much matter which way she put it. She felt that she was dozing off, and had just begun to dream that she was walking hand in hand with Dinah, and saying to her very earnestly, \"Now, Dinah, tell me the truth: did you ever eat a bat?\" when suddenly, thump! thump! down she came upon a heap of sticks and dry leaves, and the fall was over."</code></li>
</ol>

<!-- -->

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[[1]]
[1] "Down, down, down"                                                                                                                                                                                                                                                                                              
[2] " There was nothing else to do, so Alice soon began talking again"                                                                                                                                                                                                                                              
[3] " "Dinah'll miss me very much to-night, I should think!" (Dinah was the cat"                                                                                                                                                                                                                                    
[4] ") "I hope they'll remember her saucer of milk at tea-time"                                                                                                                                                                                                                                                     
[5] " Dinah my dear! I wish you were down here with me! There are no mice in the air, I'm afraid, but you might catch a bat, and that's very like a mouse, you know"
[6] " But do cats eat bats, I wonder?" And here Alice began to get rather sleepy, and went on saying to herself, in a dreamy sort of way, "Do cats eat bats? Do cats eat bats?" and sometimes, "Do bats eat cats?" for, you see, as she couldn't answer either question, it didn't much matter which way she put it"
[7] " She felt that she was dozing off, and had just begun to dream that she was walking hand in hand with Dinah, and saying to her very earnestly, "Now, Dinah, tell me the truth: did you ever eat a bat?" when suddenly, thump! thump! down she came upon a heap of sticks and dry leaves, and the fall was over"
</code></pre></div></div>
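
<p>Assuming the passage is stored in <code class="language-plaintext highlighter-rouge">f</code> (with the inner quotation marks escaped so the string parses), the split is a single call; the period must be escaped because <code class="language-plaintext highlighter-rouge">.</code> is a regex metacharacter:</p>

<pre><code class="language-r">strsplit(f, "\\.")  # returns a list with one character vector of sentences
</code></pre>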

<ol>
  <li>3.2 Create a matrix by splitting the following character vector
<code class="language-plaintext highlighter-rouge">f &lt;- c("20-04-2018","15-07-2021","11-11-2022","08-12-2021","28-01-2020")</code>
on the hyphen, allocating one column each for the day, month and
year.</li>
</ol>

<!-- -->

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>     [,1] [,2] [,3]
[1,]   20    4 2018
[2,]   15    7 2021
[3,]   11   11 2022
[4,]    8   12 2021
[5,]   28    1 2020
</code></pre></div></div>
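
<p>One way to build the matrix (a sketch) is to split on the hyphen, flatten the result, and fill a three-column matrix by row:</p>

<pre><code class="language-r">f &lt;- c("20-04-2018", "15-07-2021", "11-11-2022", "08-12-2021", "28-01-2020")
matrix(as.integer(unlist(strsplit(f, "-"))), ncol = 3, byrow = TRUE)
# columns: day, month, year
</code></pre>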

<h1 id="exercise-4-semantic-analysis">Exercise 4: Semantic Analysis</h1>

<p>4.1 Using the data set <code class="language-plaintext highlighter-rouge">Corona_NLP_test_udpipe.csv</code> (<a href="https://github.com/Wario84/idsc_mgs/raw/master/assets/data/Corona_NLP_test_udpipe.csv?raw=true">DOWNLOAD THE
DATA</a>),
load the data and, with the <code class="language-plaintext highlighter-rouge">udpipe</code> package, generate a <code class="language-plaintext highlighter-rouge">barplot</code> of the
most common terms in the corpus (VERB, NOUN, ADJ, …).</p>

<p><img src="https://github.com/Wario84/idsc_mgs/raw/master/assets/imgs/tutorial07/unnamed-chunk-8-1.png?raw=true" alt="" /><!-- --></p>
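
<p>A sketch of one possible approach; the column name <code class="language-plaintext highlighter-rouge">OriginalTweet</code> is an assumption (check <code class="language-plaintext highlighter-rouge">names()</code> of your data first), and <code class="language-plaintext highlighter-rouge">udpipe()</code> downloads the English model on first use:</p>

<pre><code class="language-r">library(udpipe)

df  &lt;- read.csv("Corona_NLP_test_udpipe.csv")
ann &lt;- udpipe(x = df$OriginalTweet, object = "english")

# frequency of universal part-of-speech tags (VERB, NOUN, ADJ, ...)
barplot(sort(table(ann$upos), decreasing = TRUE), las = 2)
</code></pre>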

<p>4.2 Using the data set <code class="language-plaintext highlighter-rouge">Corona_NLP_test_udpipe.csv</code>, load the data
and, with the <code class="language-plaintext highlighter-rouge">udpipe</code> package, generate a <code class="language-plaintext highlighter-rouge">barplot</code> of the most common
VERB, NOUN and ADJ in the corpus.</p>

<p><img src="https://github.com/Wario84/idsc_mgs/raw/master/assets/imgs/tutorial07/unnamed-chunk-9-1.png?raw=true" alt="" /><!-- --></p>

<p>4.3 Using the packages <code class="language-plaintext highlighter-rouge">wordcloud</code>, <code class="language-plaintext highlighter-rouge">igraph</code> and <code class="language-plaintext highlighter-rouge">udpipe</code>, generate a
cloud of words using the most frequent VERB, NOUN and ADJ in the
corpus. What can you interpret from the current and previous plots?</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Loading required package: RColorBrewer


Attaching package: 'igraph'

The following object is masked from 'package:tidyr':

    crossing

The following object is masked from 'package:tibble':

    as_data_frame

The following objects are masked from 'package:purrr':

    compose, simplify

The following objects are masked from 'package:dplyr':

    as_data_frame, groups, union

The following objects are masked from 'package:dials':

    degree, neighbors

The following objects are masked from 'package:stats':

    decompose, spectrum

The following object is masked from 'package:base':

    union
</code></pre></div></div>

<p><img src="https://github.com/Wario84/idsc_mgs/raw/master/assets/imgs/tutorial07/unnamed-chunk-10-1.png?raw=true" alt="" /><!-- --></p>]]></content><author><name>Diogo Leitao Requena &amp; Mario H. Gonzalez-Sauri</name></author><summary type="html"><![CDATA[Exercise 1: grep()/grepl()]]></summary></entry><entry><title type="html">Lecture 7: Introduction to String Analysis</title><link href="https://wario84.github.io/idsc_mgs/2022/03/28/w7_2_Lecture_07.html" rel="alternate" type="text/html" title="Lecture 7: Introduction to String Analysis" /><published>2022-03-28T00:00:00+00:00</published><updated>2022-03-28T00:00:00+00:00</updated><id>https://wario84.github.io/idsc_mgs/2022/03/28/w7_2_Lecture_07</id><content type="html" xml:base="https://wario84.github.io/idsc_mgs/2022/03/28/w7_2_Lecture_07.html"><![CDATA[<hr />

<h1 id="download-presentation">Download presentation</h1>

<p>Refer to the presentation of this lecture:</p>

<p><a href="https://github.com/Wario84/idsc_mgs/raw/master/assets/data/07.zip?raw=true">Download</a></p>]]></content><author><name>Mario H. Gonzalez-Sauri</name></author><summary type="html"><![CDATA[]]></summary></entry><entry><title type="html">Lecture 5: Introduction to Algorithms</title><link href="https://wario84.github.io/idsc_mgs/2022/03/27/w5_2_Lecture_05.html" rel="alternate" type="text/html" title="Lecture 5: Introduction to Algorithms" /><published>2022-03-27T00:00:00+00:00</published><updated>2022-03-27T00:00:00+00:00</updated><id>https://wario84.github.io/idsc_mgs/2022/03/27/w5_2_Lecture_05</id><content type="html" xml:base="https://wario84.github.io/idsc_mgs/2022/03/27/w5_2_Lecture_05.html"><![CDATA[<h1 id="introduction">Introduction</h1>

<p>In lectures one to four, we set the stage to introduce the heart of Applied Data Science: algorithmic thinking for problem-solving. In lecture one, we learned about the scope of Data Science and the rise of Big Data. Lecture two is an introduction to the use of inductive reasoning applied to Data Science. Lectures three and four are an overview of basic estimation using statistics and econometrics applied with <strong>R</strong> programming. In this lecture, I cover another pillar of Data Science: algorithm programming using <em>control flow structures</em> and <em>functions</em>.</p>

<p>Functions and control flow structures are the building blocks of algorithm programming. So far, we have used <code class="language-plaintext highlighter-rouge">r-packages</code> and, more specifically, <code class="language-plaintext highlighter-rouge">FUN(X)</code> functions that take arguments and perform a certain action. In this lecture, we will learn the elementary building blocks of algorithmic programming, which has two main advantages for your training in Data Science. Firstly, algorithmic programming allows you to understand in detail how functions work. I’m sure that, thus far, you know that if we pass a numeric vector <code class="language-plaintext highlighter-rouge">x</code> to the function <code class="language-plaintext highlighter-rouge">mean(x)</code>, <strong>R</strong> somehow computes the mean. After learning the building blocks of algorithmic programming, we will be able to understand how such functions work. What is the series of steps behind the computation of a given function? How are the arguments of the function used, and in which order? In a nutshell, algorithmic programming enables you to deeply understand functions and packages in <strong>R</strong>.</p>

<p>The second advantage of algorithmic programming is that it enables you to go beyond the “off-the-shelf” functions from <strong>R-base</strong> and other packages. Instead of being bound to existing functions, algorithmic programming gives you the tools to create your own. In general, it is recommended to check first whether a function already performs the action you want, but knowing algorithmic programming removes the constraint of only using available tools and gives you the freedom to develop tools that fulfil your particular needs. In practice, we build our own functions and algorithms in two cases. Firstly, when we can’t find a similar function in <a href="https://stat.ethz.ch/R-manual/R-devel/library/base/html/00Index.html">The R Base</a> or in the packages maintained by <a href="https://cran.r-project.org/web/packages/available_packages_by_name.html">The Comprehensive R Archive Network (CRAN)</a>. Especially if this is your first course in Data Science, you should verify whether a function on CRAN fulfils your needs before investing time building your own. Using a function from CRAN’s database is typically a better option, not only because it saves time, but also because the code is audited by lead experts in the corresponding fields. Secondly, we may opt to build our own function when we often perform a sequential series of functions or repetitive steps. For instance, I typically use the function <code class="language-plaintext highlighter-rouge">lapply</code> combined with the function <code class="language-plaintext highlighter-rouge">class</code> to verify the class of each column of a <code class="language-plaintext highlighter-rouge">df</code> (<code class="language-plaintext highlighter-rouge">data.frame</code>) in the following manner: <code class="language-plaintext highlighter-rouge">lapply(df, class)</code>.</p>

<h1 id="what-is-an-algorithm">What is an algorithm?</h1>

<p>An algorithm is simply a “well-defined computational procedure that takes some value, or set of values, as input and produces some value, or set of values, as output” (Cormen, Leiserson, Rivest, and Stein, 2022). Indeed, an algorithm serves a specific purpose and has a specific procedure designed to solve a problem. Before we start learning the necessary syntax to produce algorithms in <strong>R</strong>, we are going to study the structure (procedure) of example algorithms using pseudo-code and flow diagrams. Later, in a second step, we will revise the specific R code needed to implement each algorithm in <strong>R</strong>.</p>

<h2 id="example-1-fibonacci-sequence">Example 1: Fibonacci Sequence</h2>

<blockquote>
  <p>Input: A variable <code class="language-plaintext highlighter-rouge">x</code> that is the number of terms to generate in the Fibonacci series.</p>
</blockquote>

<blockquote>
  <p>Output: A series <code class="language-plaintext highlighter-rouge">z</code> that is the Fibonacci series of length <code class="language-plaintext highlighter-rouge">x</code>.</p>
</blockquote>

<!-- [![](https://mermaid.ink/img/pako:eNpdkEFrwzAMhf-K0CmF9rJjYB1tskMvGzS7rMsOXqy0gdhObRkakvz32W0CYzpJz5_fExqwMpIwxbMV3QU-8lJDqN3XQXee4ZaC9uqHLJh67hywgTNpsoLpGzab7ViwsDzCPim8Ar4QtMIxPC0fVg_PfUAhS3ZS3pnZN5jFyZFtaCGzSOZJZrzm-2sf45dM-c83j_RpODi4wXP_Mj3UU1THt_ewFvxVPsmNcEyOxN7qOfrqSVe0ghnENSqySjQy3GWIWokBVFRiGlpJtfAtl1jqKaC-k2GpV9mwsZjWonW0RuHZFL2uMGXraYHyRoQzq5mafgEpe3o8)](https://mermaid-js.github.io/mermaid-live-editor/edit#pako:eNpdkEFrwzAMhf-K0CmF9rJjYB1tskMvGzS7rMsOXqy0gdhObRkakvz32W0CYzpJz5_fExqwMpIwxbMV3QU-8lJDqN3XQXee4ZaC9uqHLJh67hywgTNpsoLpGzab7ViwsDzCPim8Ar4QtMIxPC0fVg_PfUAhS3ZS3pnZN5jFyZFtaCGzSOZJZrzm-2sf45dM-c83j_RpODi4wXP_Mj3UU1THt_ewFvxVPsmNcEyOxN7qOfrqSVe0ghnENSqySjQy3GWIWokBVFRiGlpJtfAtl1jqKaC-k2GpV9mwsZjWonW0RuHZFL2uMGXraYHyRoQzq5mafgEpe3o8) -->

<div>
 <script src="https://cdn.jsdelivr.net/npm/mermaid/dist/mermaid.min.js"></script>
    Fibonacci Sequence Generator:
    <div class="mermaid">
    graph TD
    A[Input x: length of sequence to generate] --&gt;|Start| B(Sum the last 2 numbers)
    B--&gt; C(Add the number to the series z)
    C--&gt; D(Count the number y of generated numbers)
    D--&gt; Z{Is x = y?}
    Z--&gt; |NO| B
    Z--&gt; |Yes| R(Return the sequence z)
    </div>
</div>
<p><br /></p>

<h1 id="flow-controls-while-and-if">Flow controls: <code class="language-plaintext highlighter-rouge">while</code> and <code class="language-plaintext highlighter-rouge">if</code></h1>

<p>To implement the algorithm in Example 1, we need to expand our knowledge of <strong>R</strong> operators. The operators <code class="language-plaintext highlighter-rouge">if</code> and <code class="language-plaintext highlighter-rouge">while</code> are always followed by a <code class="language-plaintext highlighter-rouge">(...)</code> that assesses a logical condition and some <code class="language-plaintext highlighter-rouge">{...}</code> brackets that perform a set of actions <code class="language-plaintext highlighter-rouge">...</code> if the condition is fulfilled. For instance, in Example 1, the input of the algorithm is a variable <code class="language-plaintext highlighter-rouge">x</code> that defines the number of elements to generate in the sequence. The output of the algorithm is a sequence <code class="language-plaintext highlighter-rouge">z</code> with <code class="language-plaintext highlighter-rouge">y</code> elements. To generate the series, we repeatedly append to <code class="language-plaintext highlighter-rouge">z</code> the sum of its last two elements until the total length is <code class="language-plaintext highlighter-rouge">x</code>. To start the algorithm we need the variable <code class="language-plaintext highlighter-rouge">x</code>, which inputs the number of values to generate in the series <code class="language-plaintext highlighter-rouge">z</code>, followed by <code class="language-plaintext highlighter-rouge">y</code>, the initial length of <code class="language-plaintext highlighter-rouge">z</code>. We assume that the user requests more than two numbers, i.e. <code class="language-plaintext highlighter-rouge">x&gt;2</code> (requesting two or fewer makes the algorithm redundant).</p>

<p>Next, the algorithm needs to be programmed to continue doing a series of steps until the goal is reached. Remember that the goal is to produce a series <code class="language-plaintext highlighter-rouge">z</code> with a total length of <code class="language-plaintext highlighter-rouge">x</code>. Notice that <code class="language-plaintext highlighter-rouge">y</code> is the actual number of elements in the series <code class="language-plaintext highlighter-rouge">z</code> at each iteration of the process of adding one element. To operationalize the algorithm we use <code class="language-plaintext highlighter-rouge">while(y&lt;x){...}</code>, which assesses the condition <code class="language-plaintext highlighter-rouge">y&lt;x</code>. The operator performs the set of actions <code class="language-plaintext highlighter-rouge">...</code> only while the condition <code class="language-plaintext highlighter-rouge">y&lt;x</code> is satisfied. That means that the algorithm using <code class="language-plaintext highlighter-rouge">while</code> stops when <code class="language-plaintext highlighter-rouge">y&gt;=x</code> (when the series <code class="language-plaintext highlighter-rouge">z</code> has a total length of <code class="language-plaintext highlighter-rouge">x</code> or more).</p>

<pre><code class="language-r">x &lt;- 20L         # number of terms to generate
y &lt;- 0L          # current length of the series
z &lt;- c(0L, 1L)   # seed values of the Fibonacci series

while (y &lt; x) {
    # append the sum of the last two elements
    z[length(z) + 1L] &lt;- sum(z[c(length(z) - 1L, length(z))])
    y &lt;- length(z)
}
z
</code></pre>

<h2 id="example-2-sorthing-algorithm">Example 2: Sorting Algorithm</h2>

<blockquote>
  <p>Input: A variable \(x={a_1, a_2, \dots, a_n}\) of <code class="language-plaintext highlighter-rouge">n</code> rational numbers.</p>
</blockquote>

<blockquote>
  <p>Output: A permutation (reordering) of <code class="language-plaintext highlighter-rouge">x</code> called <code class="language-plaintext highlighter-rouge">y</code> such that \(y={a_1^*, a_2^*, \dots, a_n^*}\)</p>
</blockquote>

<p><em>Source: (Cormen, Leiserson, Rivest, and Stein, 2022).</em></p>

<!-- [![](https://mermaid.ink/img/pako:eNpdkc1uwjAQhF9l5EtAIof0GIlULfSAVLUHuISmqlyyAUNiR_4pQYR3r5OAVGofbO9-s6Ndn9lG5cRiBr-2mtc7rOaZ7F5PHwtZO4smBochLQiqgOZWKMlLSFd9kzYoSW7tDvITYZi0S8u1bfE8WvEDIZDJdJ9EAQqtKgRNAC5zGKs0QVgIieBAp2A8GD53FTAbvapNb9LZ2R2h1vQjlLk6xgjEdB9GN9WsV63PC4PmSyAcaGdCJPDFEUpqbIhHXAZ-3fNtSqZFGo2WR17fC3EUvp8_0qtRGnlh-jAofLrPDvA_4_Gd09u7n0cmu91FbyebsIp0xUXup3_uYhnz3VaUsdhfcyq4K23GMnnxqKtzbuklF352LC54aWjCuLNqeZIbFlvt6AbNBfcfWV2pyy-iLJg9)](https://mermaid.live/edit#pako:eNpdkc1uwjAQhF9l5EtAIof0GIlULfSAVLUHuISmqlyyAUNiR_4pQYR3r5OAVGofbO9-s6Ndn9lG5cRiBr-2mtc7rOaZ7F5PHwtZO4smBochLQiqgOZWKMlLSFd9kzYoSW7tDvITYZi0S8u1bfE8WvEDIZDJdJ9EAQqtKgRNAC5zGKs0QVgIieBAp2A8GD53FTAbvapNb9LZ2R2h1vQjlLk6xgjEdB9GN9WsV63PC4PmSyAcaGdCJPDFEUpqbIhHXAZ-3fNtSqZFGo2WR17fC3EUvp8_0qtRGnlh-jAofLrPDvA_4_Gd09u7n0cmu91FbyebsIp0xUXup3_uYhnz3VaUsdhfcyq4K23GMnnxqKtzbuklF352LC54aWjCuLNqeZIbFlvt6AbNBfcfWV2pyy-iLJg9) -->

<div>
 <script src="https://cdn.jsdelivr.net/npm/mermaid/dist/mermaid.min.js"></script>
    Sorting Algorithm
    <div class="mermaid">
    graph TD
    A[Input x: a series of rational numbers length n] --&gt;|Start| B(Take 'n&gt;=j&gt;1' from 'x' and store it in 'key')
    B --&gt; C(Location of the previous number: 'i=j-1')
    C --&gt; Z{Is x_i -previous- &gt; key -next- ? }
    Z --&gt; |Yes| Y1(Swap x_i -previous with key -next- )
    Y1--&gt;Y2(Swap key-next with x_i -previous- )
    Z --&gt; |NO| B
    </div>
</div>
<p><br /></p>

<h2 id="for-loop"><code class="language-plaintext highlighter-rouge">for</code> loop</h2>

<p>The sorting algorithm of Example 2 takes a series <code class="language-plaintext highlighter-rouge">x</code> of unsorted rational numbers and, using an iterative procedure (a <code class="language-plaintext highlighter-rouge">for</code> loop), compares each value in the series with the values before it. Using an index <code class="language-plaintext highlighter-rouge">i</code> for the previous value, an index <code class="language-plaintext highlighter-rouge">j</code> for the next value, and a storage variable <code class="language-plaintext highlighter-rouge">key</code>, the algorithm swaps places whenever a previous number <code class="language-plaintext highlighter-rouge">x[i]</code> is greater than the current number being compared (<code class="language-plaintext highlighter-rouge">key</code>). This algorithm performs the same action as the function <code class="language-plaintext highlighter-rouge">sort(x)</code> with the argument <code class="language-plaintext highlighter-rouge">decreasing = FALSE</code>; its purpose here is simply to illustrate how the <code class="language-plaintext highlighter-rouge">for</code> loop is used in <strong>R</strong>. The most fundamental aspect of the <code class="language-plaintext highlighter-rouge">for</code> loop is that it takes <code class="language-plaintext highlighter-rouge">n</code> values in a series and performs a list of steps in each iteration. In this case, the algorithm evaluates each number in the series <code class="language-plaintext highlighter-rouge">x</code> to verify whether a previous number is bigger than the next number, <code class="language-plaintext highlighter-rouge">x[i]&gt;x[j]</code>. If that is the case, the algorithm swaps the previous number <code class="language-plaintext highlighter-rouge">x[i]</code> with the current number being evaluated <code class="language-plaintext highlighter-rouge">x[j]</code>.</p>

<pre><code class="language-r">
# Unsorted
x &lt;- sample(1L:99L, 15)
x

# sorted with the function
sort(x, decreasing = FALSE)

# iterative sorting algorithm
for(j in 2L:length(x)){
    key &lt;- x[j]
    i &lt;- j - 1
    while(i&gt;0&amp;&amp;x[i]&gt;key){ #previous number in the series (x[i]) is greater than next number (key)
        x[i+1] &lt;-  x[i] #swap previous number (x[i]) with the next number (x[i+1])
        i &lt;- i - 1 
        x[i + 1] &lt;- key # swap next number (key) with previous number (x[i+1])
    }
}
# Sorted
x
</code></pre>

<h2 id="example-3-odds-and-even-numbers">Example 3: Odds and Even numbers</h2>

<p>This example may have only a pedagogical application. The algorithm samples one random number at a time, <code class="language-plaintext highlighter-rouge">sample(..., 1)</code>, between one and <code class="language-plaintext highlighter-rouge">lim</code> to generate a numeric series <code class="language-plaintext highlighter-rouge">x</code> of length <code class="language-plaintext highlighter-rouge">y</code> of <em>even</em> or <em>odd</em> numbers.</p>

<blockquote>
  <p>Input: A variable <code class="language-plaintext highlighter-rouge">lim</code> that defines the range of numbers to sample \([1, lim]\) and the length of the series to generate. Also, we need a binary variable to switch the series from  <em>even</em> to <em>odd</em> numbers.</p>
</blockquote>

<blockquote>
  <p>Output: A series <code class="language-plaintext highlighter-rouge">x</code> of <em>even</em> or <em>odd</em> numbers.</p>
</blockquote>

<h2 id="if-else"><code class="language-plaintext highlighter-rouge">if</code>, <code class="language-plaintext highlighter-rouge">else</code></h2>

<p>A good analogy for understanding the dynamics of the <code class="language-plaintext highlighter-rouge">if</code> and <code class="language-plaintext highlighter-rouge">else</code> operators is a choice, or selection, between a set of possible categories.</p>

<h2 id="example-5-choice-algorithm">Example 5: Choice Algorithm</h2>

<p>Suppose you have a bag of candies with the following flavours: <code class="language-plaintext highlighter-rouge">c("orange", "lemon", "strawberry", "mango")</code>. Your preference is <code class="language-plaintext highlighter-rouge">lemon</code> above all and <code class="language-plaintext highlighter-rouge">strawberry</code> over <code class="language-plaintext highlighter-rouge">orange</code>; you dislike <code class="language-plaintext highlighter-rouge">mango</code>. Suppose that the bag contains <code class="language-plaintext highlighter-rouge">100</code> candies, and you are interested in how many candies you would need to draw (at random) before getting <code class="language-plaintext highlighter-rouge">3 lemon</code> candies in total.</p>

<blockquote>
  <p>Input: A random sample <code class="language-plaintext highlighter-rouge">c</code> with repetition of size 100 of candies (the bag).</p>
</blockquote>

<blockquote>
  <p>Output: A series <code class="language-plaintext highlighter-rouge">x</code> of at least three lemon candies.</p>
</blockquote>

<pre><code class="language-r">
candies &lt;- c("orange", "lemon", "strawberry", "mango")

candy_bag &lt;- sample(candies, 100L, replace = T)

picks &lt;- character(0)  # flavours drawn and kept so far
c &lt;- 1L                # index of the next recorded pick

while(sum(picks=="lemon")&lt;3){
  pick &lt;- sample(candy_bag, 1L)
  if(pick=="lemon"){
    picks[c] &lt;- pick
    c &lt;- c + 1L
  }else if(pick=="strawberry"){
    picks[c] &lt;- pick
    c &lt;- c + 1L
  }else if(pick=="orange"){
    picks[c] &lt;- pick
    c &lt;- c + 1L
  }
  
}

# Number of picks
length(picks)

# Distribution of picks
library(ggplot2)
ggplot(as.data.frame(table(picks)), aes(x=picks, y = Freq)) +
  geom_bar(stat="identity")
</code></pre>

<h2 id="functions"><code class="language-plaintext highlighter-rouge">Functions</code></h2>

<p><code class="language-plaintext highlighter-rouge">FUN(...)</code>: functions make explicit the kind of input that our algorithms need, in the form of <strong>arguments</strong>. Functions can take as many arguments (<code class="language-plaintext highlighter-rouge">n=...</code>) as our implementation may require. As we mentioned before, the operator <code class="language-plaintext highlighter-rouge">if(...)</code> evaluates a logical condition and is the gatekeeper of a set of operations grouped within <code class="language-plaintext highlighter-rouge">{}</code> brackets. Finally, the <code class="language-plaintext highlighter-rouge">else if</code> and <code class="language-plaintext highlighter-rouge">else</code> operators evaluate further logical conditions, always after the first condition stated in the <code class="language-plaintext highlighter-rouge">if</code> operator.</p>

<h2 id="example-6-even-or-odd-numbers">Example 6: Even or Odd numbers</h2>
<p>In the implementation of Example 6, the algorithm employs <code class="language-plaintext highlighter-rouge">if</code> and <code class="language-plaintext highlighter-rouge">else if</code> to select between an <em>odd</em> or an <em>even</em> number. Instead of a <code class="language-plaintext highlighter-rouge">for</code> loop, which has a deterministic number of iterations, the example uses a <code class="language-plaintext highlighter-rouge">while</code> loop so that the algorithm does not stop before generating a series of length <code class="language-plaintext highlighter-rouge">y</code> of even or odd numbers.</p>

<blockquote>
  <p>Input: An upper bound <code class="language-plaintext highlighter-rouge">lim</code> for the random draws, a target series length <code class="language-plaintext highlighter-rouge">y</code>, and a flag <code class="language-plaintext highlighter-rouge">even</code> selecting even or odd numbers.</p>
</blockquote>

<blockquote>
  <p>Output: A series <code class="language-plaintext highlighter-rouge">x</code> of <code class="language-plaintext highlighter-rouge">y</code> even or odd numbers.</p>
</blockquote>

<pre><code class="language-{r}">
odd_even &lt;- function(lim = 100L, y = 25L, even = TRUE){
  x &lt;- vector(mode = "numeric")
  i &lt;- 1L
  while (length(x) &lt; y) {
    n &lt;- sample(1L:lim, 1L)
    if(even &amp; n %% 2 == 0){
      x[i] &lt;- n
      i &lt;- i + 1L
    }else if(!even &amp; n %% 2 != 0){
      x[i] &lt;- n
      i &lt;- i + 1L
    }
  }
  x
}

# Generate 20 odd numbers between 1 and 1000
odd_even(lim = 1000L, y = 20L, even = FALSE)

# Generate 20 even numbers between 1 and 1000
odd_even(lim = 1000L, y = 20L, even = TRUE)
</code></pre>

<h2 id="example-7--randomized-hire-assistant">Example 7:  Randomized Hire-Assistant</h2>

<p>Finally, Example 7 combines the previous control flows. It starts by assuming that there is a fixed supply of assistants in the data-science labor market, and that a process of selection and interviewing makes each candidate's ability explicit. Candidates arrive at the interview in random order, and the goal is to select the top candidate within a fixed number of interviews.</p>

<blockquote>
  <p>Input: A vector of candidates <code class="language-plaintext highlighter-rouge">supply</code> with a vector <code class="language-plaintext highlighter-rouge">a</code> of ability. Additionally, a vector of <code class="language-plaintext highlighter-rouge">interviews</code> that contains the max number of interviews in each experiment.</p>
</blockquote>

<blockquote>
  <p>Output: A matrix <code class="language-plaintext highlighter-rouge">H</code> with the ability of the i-th hired assistant (rows) for the j-th round of interviews (columns). From this matrix, we are interested in estimating the total number of hires for each round of interviews, <code class="language-plaintext highlighter-rouge">c(150L, 250L, 1000L, 3000L, 5000L, 10000L, 15000L)</code>.</p>
</blockquote>

<p>The algorithm uses a <code class="language-plaintext highlighter-rouge">for</code> loop to iterate over the rounds of interviews given by the vector <code class="language-plaintext highlighter-rouge">interviews</code>. It then draws a random sample of candidates, <code class="language-plaintext highlighter-rouge">selected</code>, for each round. Using a <code class="language-plaintext highlighter-rouge">while</code> operator, each round continues to run until <code class="language-plaintext highlighter-rouge">length(selected)==0</code>. In each run, I sample one candidate for <code class="language-plaintext highlighter-rouge">interview</code> and remove it from the <code class="language-plaintext highlighter-rouge">selected</code> vector. Finally, using an <code class="language-plaintext highlighter-rouge">if(best&lt;interview)</code> operator, the algorithm hires a candidate if their ability is higher than that of the current best candidate.</p>

<pre><code class="language-{r}">h &lt;- 1L
supply &lt;- 20000L
hires &lt;- 0L
a &lt;- runif(supply)
interviews &lt;- c(150L, 250L, 1000L, 3000L, 5000L, 10000L, 15000L)

H &lt;- matrix(NA, nrow = supply, ncol = length(interviews))

j &lt;- 2L
for(j in seq_along(interviews)){
    selected &lt;- sample(a, interviews[j]) #interview candidate
    h &lt;- 1L
    best &lt;- 0
    while(length(selected)!=0){
        i &lt;- sample(1L:length(selected), 1)
        interview &lt;- selected[i]
        selected &lt;- selected[-i]
        
        if(best&lt;interview){
        best &lt;- interview
        H[h,j] &lt;- interview
        h &lt;- h + 1L
        }
        }
    }



# Total Hires
hires &lt;-  colSums(!is.na(H))
names(hires) &lt;- paste0(interviews)
hires

# Average Hiring Ability
ability &lt;- colSums(H, na.rm = T)
avg_ability &lt;- ability/hires
avg_ability

# Plot
plot(x=interviews, y=avg_ability)



</code></pre>

<h1 id="references">References</h1>

<div id="refs" class="references csl-bib-body hanging-indent">
<div id="ref-cormen2022introduction" class="csl-entry">
Cormen, T. H., C. E. Leiserson, R. L. Rivest, and C. Stein. 2022.
<em>Introduction to Algorithms, Fourth Edition</em>. MIT Press. <a href="https://books.google.nl/books?id=RSMuEAAAQBAJ">https://books.google.nl/books?id=RSMuEAAAQBAJ</a>.
</div>
</div>]]></content><author><name>Mario H. Gonzalez-Sauri</name></author><summary type="html"><![CDATA[Introduction]]></summary></entry><entry><title type="html">Tutorial 6: Introduction to Machine Learning.</title><link href="https://wario84.github.io/idsc_mgs/2022/03/27/w6_1_Tutorial_06.html" rel="alternate" type="text/html" title="Tutorial 6: Introduction to Machine Learning." /><published>2022-03-27T00:00:00+00:00</published><updated>2022-03-27T00:00:00+00:00</updated><id>https://wario84.github.io/idsc_mgs/2022/03/27/w6_1_Tutorial_06</id><content type="html" xml:base="https://wario84.github.io/idsc_mgs/2022/03/27/w6_1_Tutorial_06.html"><![CDATA[<h1 id="exercise-1-machine-learning-model">Exercise 1: Machine Learning Model</h1>

<p>a 1.1 Perform the exercise done during the tutorial on the LifeCycleSavings
database (you can load it in base R by running LifeCycleSavings,
<a href="https://www.rdocumentation.org/packages/datasets/versions/3.6.2/topics/LifeCycleSavings">https://www.rdocumentation.org/packages/datasets/versions/3.6.2/topics/LifeCycleSavings</a>).
Use the linear model as an engine to check how well the variables pop15,
pop75 and dpi predict sr (aggregate personal savings). Use set.seed(70).
Why did the model remove one variable? Interpret whether these variables
are good predictors of sr.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>&lt;Analysis/Assess/Total&gt;
&lt;30/20/50&gt;

Recipe

Inputs:

      role #variables
   outcome          1
 predictor          3

Training data contained 30 data points and no missing data.

Operations:

Centering for pop15, pop75, dpi [trained]
Scaling for pop15, pop75, dpi [trained]
Correlation filter on pop75 [trained]

# A tibble: 20 x 1
   .pred
   &lt;dbl&gt;
 1 10.7 
 2  7.53
 3  7.14
 4 11.8 
 5  7.61
 6 11.5 
 7 12.6 
 8  9.72
 9 11.0 
10 12.5 
11 12.0 
12 12.1 
13 12.1 
14  8.29
15  7.68
16 12.6 
17  9.41
18  7.23
19  8.90
20  8.40

# A tibble: 3 x 3
  .metric .estimator .estimate
  &lt;chr&gt;   &lt;chr&gt;          &lt;dbl&gt;
1 rmse    standard       3.98 
2 rsq     standard       0.315
3 mae     standard       3.09 
</code></pre></div></div>

<p>b   1.2 Assess the performance of the previous model using a scatter
    plot. Put the <strong>actual</strong> <code class="language-plaintext highlighter-rouge">sr (aggregate personal savings)</code> on the
    horizontal axis (x) and the <strong>predicted</strong>
    <code class="language-plaintext highlighter-rouge">sr (aggregate personal savings)</code> on the vertical axis (y). What is your
    conclusion about the quality of the prediction?</p>

<p><img src="https://github.com/Wario84/idsc_mgs/raw/master/assets/imgs/tutorial06/unnamed-chunk-3-1.png?raw=true" alt="" /><!-- --></p>

<p>c   1.3 Perform the same exercise using the <code class="language-plaintext highlighter-rouge">london_house_price</code>
    dataset. Check how well house type, area (in sq ft), number of
    bedrooms and number of bathrooms predict the house price (using a
    linear model). Use <code class="language-plaintext highlighter-rouge">set.seed(71)</code>.</p>

<!-- -->

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>&lt;Analysis/Assess/Total&gt;
&lt;2088/1392/3480&gt;

Recipe

Inputs:

      role #variables
   outcome          1
 predictor          4

Training data contained 2088 data points and no missing data.

Operations:

Centering for Area.in.sq.ft, No..of.Bedrooms, No..of.Bathrooms [trained]
Scaling for Area.in.sq.ft, No..of.Bedrooms, No..of.Bathrooms [trained]
Correlation filter on No..of.Bedrooms [trained]

# A tibble: 1,392 x 1
      .pred
      &lt;dbl&gt;
 1 2569291.
 2 1707867.
 3  842547.
 4  559158.
 5 1123157.
 6  915861.
 7  770459.
 8 3358627.
 9 6010800.
10  771794.
#... with 1,382 more rows

# A tibble: 3 x 3
  .metric .estimator   .estimate
  &lt;chr&gt;   &lt;chr&gt;            &lt;dbl&gt;
1 rmse    standard   1501431.   
2 rsq     standard         0.461
3 mae     standard    757525.   
</code></pre></div></div>

<p>d  1.4 Use the <code class="language-plaintext highlighter-rouge">heart.csv</code> dataset, (<a href="https://www.kaggle.com/datasets/nareshbhat/health-care-data-set-on-heart-attack-possibility">read more about this dataset
    here</a>),
    to train a classification model predicting the probability of
    getting a heart attack.</p>

<ul>
  <li>Fit the model using all variables predicting the dependent variable
<code class="language-plaintext highlighter-rouge">target</code>.</li>
  <li>Use a <code class="language-plaintext highlighter-rouge">prop = .75</code> in <code class="language-plaintext highlighter-rouge">initial_split</code>.</li>
  <li>Calculate the Confusion Matrix.</li>
  <li>Calculate the accuracy, sensitivity, specificity.</li>
  <li>Plot the ROC.</li>
  <li>Calculate the ROC-AUC.</li>
  <li>What is your conclusion about the model?</li>
</ul>

<!-- -->

<!-- -->

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>parsnip model object


Call:  stats::glm(formula = target ~ ., family = stats::binomial, data = data)

Coefficients:
(Intercept)          age          sex           cp     trestbps         chol  
   4.946390    -0.001948    -1.868410     0.867557    -0.026999    -0.007845  
        fbs      restecg      thalach        exang      oldpeak        slope  
  -0.065102     0.217100     0.025711    -0.776306    -0.442561     0.806031  
         ca         thal  
  -0.656924    -1.172524  

Degrees of Freedom: 225 Total (i.e. Null);  212 Residual
Null Deviance:      311.5 
Residual Deviance: 160.6    AIC: 188.6

# A tibble: 77 x 2
   .pred_0 .pred_1
     &lt;dbl&gt;   &lt;dbl&gt;
 1  0.332    0.668
 2  0.0200   0.980
 3  0.0165   0.984
 4  0.0103   0.990
 5  0.0467   0.953
 6  0.749    0.251
 7  0.0600   0.940
 8  0.150    0.850
 9  0.0611   0.939
10  0.0992   0.901
# ... with 67 more rows

          Truth
Prediction  0  1
         0 28  5
         1  7 37

# A tibble: 3 x 3
  .metric  .estimator .estimate
  &lt;chr&gt;    &lt;chr&gt;          &lt;dbl&gt;
1 accuracy binary         0.844
2 sens     binary         0.8  
3 spec     binary         0.881
</code></pre></div></div>
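As a quick sanity check (not part of the original tutorial code), the three metrics reported above can be recomputed by hand in base R from the printed confusion matrix:

```r
# Recompute accuracy, sensitivity and specificity from the 2x2 confusion
# matrix printed above (Prediction in rows, Truth in columns; class "0"
# is treated as the event, matching the yardstick convention).
cm <- matrix(c(28, 7, 5, 37), nrow = 2,
             dimnames = list(Prediction = c("0", "1"), Truth = c("0", "1")))

accuracy <- sum(diag(cm)) / sum(cm)    # (28 + 37) / 77
sens     <- cm["0", "0"] / sum(cm[, "0"])  # true "0"s correctly predicted
spec     <- cm["1", "1"] / sum(cm[, "1"])  # true "1"s correctly predicted

round(c(accuracy = accuracy, sens = sens, spec = spec), 3)
# accuracy = 0.844, sens = 0.800, spec = 0.881 -- matching the tibble above
```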

<p><img src="https://github.com/Wario84/idsc_mgs/raw/master/assets/imgs/tutorial06/unnamed-chunk-5-1.png?raw=true" alt="" /><!-- --></p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># A tibble: 1 x 3
  .metric .estimator .estimate
  &lt;chr&gt;   &lt;chr&gt;          &lt;dbl&gt;
1 roc_auc binary        0.0776
</code></pre></div></div>

<h1 id="exercise-2-dplyr-package">Exercise 2: Dplyr Package</h1>

<ul>
  <li>We will use the same database in 1.2 (prices on London houses) in
this exercise</li>
  <li>Answers must be done using a function of the dplyr package</li>
</ul>

<p>a 2.1 Create a new subset with houses where the area is equal to or
greater than 1000 sq. feet and smaller than 2000 sq. feet. Use head() to
print the first 6 observations of this subset.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>   X   Property.Name   Price       House.Type Area.in.sq.ft No..of.Bedrooms
1  3    Festing Road 1765000            House          1986               4
2  6  Alfriston Road 1475000            House          1548               4
3  8 Adam &amp; Eve Mews 2500000            House          1308               3
4 16  Cambridge Park 1450000 Flat / Apartment          1702               3
5 18  Elsworthy Rise 2275000  New development          1173               3
6 25 St Mary's Grove 1300000            House          1101               3
  No..of.Bathrooms No..of.Receptions      Location City.County Postal.Code
1                4                 4        Putney      London    SW15 1LP
2                4                 4                    London    SW11 6NW
3                3                 3                    London      W8 6UG
4                3                 3                Twickenham     TW1 2PF
5                3                 3 Primrose Hill      London     NW3 3DS
6                3                 3     Islington      London      N1 2NT
</code></pre></div></div>
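One possible dplyr sketch for 2.1, run here on a tiny made-up data frame that mimics the relevant column names of the London house-price dataset (the toy values are assumptions, not the real data):

```r
library(dplyr)

# Toy stand-in for the real dataset: same column names, invented values.
houses <- data.frame(
  Property.Name = c("A", "B", "C"),
  Price         = c(1765000, 950000, 2500000),
  Area.in.sq.ft = c(1986, 850, 1308)
)

subset_1000_2000 <- houses %>%
  filter(Area.in.sq.ft >= 1000, Area.in.sq.ft < 2000)

head(subset_1000_2000)  # rows A and C satisfy the area condition
```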

<p>b 2.2 Arrange the dataset (PS: not the subset) in decreasing order of
number of bedrooms. Use head() to print the first 6 observations.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>     X         Property.Name    Price House.Type Area.in.sq.ft No..of.Bedrooms
1   43   Old Battersea House  9975000      House         10100              10
2 1422 St. Petersburgh Place  5500000      House          4227               9
3 2619      Courtenay Avenue 16999999      House         11733               9
4 3394  Upper Wimpole Street 14750000      House          9053               9
5  224           Harper Lane  1650000      House          4016               8
6  286         Fentiman Road  3100000      House          3800               8
  No..of.Bathrooms No..of.Receptions   Location   City.County Postal.Code
1               10                10  Battersea        London    SW11 3LD
2                9                 9                   London      W2 4LA
3                9                 9   Highgate        London      N6 4LR
4                9                 9 Marylebone        London     W1G 6LG
5                8                 8    Radlett Hertfordshire     WD7 9HJ
6                8                 8                   London     SW8 1QA
</code></pre></div></div>

<p>c 2.3 Create a new variable (i.e., save it in the dataframe) that gives
you the price per square foot of each house. Use mean() to check the
mean of this new variable.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[1] 1066.25
</code></pre></div></div>
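A possible dplyr sketch for 2.3, again on a made-up two-row data frame (the column names follow the dataset; the values do not):

```r
library(dplyr)

# Toy stand-in data: invented prices and areas.
houses <- data.frame(Price = c(1000000, 2000000),
                     Area.in.sq.ft = c(1000, 1600))

# New column: price per square foot, saved back into the data frame.
houses <- houses %>%
  mutate(Price.per.sq.ft = Price / Area.in.sq.ft)

mean(houses$Price.per.sq.ft)  # (1000 + 1250) / 2 = 1125 for the toy data
```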

<p>d  2.4 Create a new variable that assumes the value 1 when the house
    type is House, 2 if the type is Penthouse, 3 if the type is Flat /
    Apartment or Studio, and 0 otherwise. Use the table() function on the
    new variable.</p>

<!-- -->

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>   0    1    2    3 
 375 1430  100 1575 
</code></pre></div></div>
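A possible dplyr sketch for 2.4 using `case_when`, on a small made-up vector of house types (the recoding rule follows the exercise; the rows are invented):

```r
library(dplyr)

# Toy stand-in data: one row per invented house type.
houses <- data.frame(House.Type = c("House", "Penthouse", "Flat / Apartment",
                                    "Studio", "Bungalow"))

houses <- houses %>%
  mutate(type_code = case_when(
    House.Type == "House" ~ 1,
    House.Type == "Penthouse" ~ 2,
    House.Type %in% c("Flat / Apartment", "Studio") ~ 3,
    TRUE ~ 0   # everything else
  ))

table(houses$type_code)  # codes 0, 1 and 2 appear once; code 3 appears twice
```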

<p>e  2.5 Get the interquartile range of the Price per House Type.</p>

<!-- -->

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># A tibble: 8 x 2
  House.Type       iqr_price
  &lt;chr&gt;                &lt;dbl&gt;
1 Bungalow           275000 
2 Duplex             337500 
3 Flat / Apartment   700050 
4 House             1512500 
5 Mews                50000 
6 New development   1520000 
7 Penthouse         2963788.
8 Studio             150000 
</code></pre></div></div>]]></content><author><name>Diogo Leitao Requena &amp; Mario H. Gonzalez-Sauri</name></author><summary type="html"><![CDATA[Exercise 1: Machine Learning Model]]></summary></entry><entry><title type="html">Lecture 6: Introduction to Machine Learning</title><link href="https://wario84.github.io/idsc_mgs/2022/03/27/w6_2_Lecture_06.html" rel="alternate" type="text/html" title="Lecture 6: Introduction to Machine Learning" /><published>2022-03-27T00:00:00+00:00</published><updated>2022-03-27T00:00:00+00:00</updated><id>https://wario84.github.io/idsc_mgs/2022/03/27/w6_2_Lecture_06</id><content type="html" xml:base="https://wario84.github.io/idsc_mgs/2022/03/27/w6_2_Lecture_06.html"><![CDATA[<h1 id="download-presentation">Download presentation</h1>

<p>Refer to the presentation of this lecture:</p>

<p><a href="https://github.com/Wario84/idsc_mgs/raw/master/assets/data/06.zip?raw=true">Download</a></p>]]></content><author><name>Mario H. Gonzalez-Sauri</name></author><summary type="html"><![CDATA[Download presentation]]></summary></entry><entry><title type="html">Lecture 4: Inferential Statistics: Causation or Correlation?</title><link href="https://wario84.github.io/idsc_mgs/2022/03/21/w4_2_Lecture_04.html" rel="alternate" type="text/html" title="Lecture 4: Inferential Statistics: Causation or Correlation?" /><published>2022-03-21T00:00:00+00:00</published><updated>2022-03-21T00:00:00+00:00</updated><id>https://wario84.github.io/idsc_mgs/2022/03/21/w4_2_Lecture_04</id><content type="html" xml:base="https://wario84.github.io/idsc_mgs/2022/03/21/w4_2_Lecture_04.html"><![CDATA[<!--  FORMAT: https://github.com/adam-p/markdown-here/wiki/Markdown-Cheatsheet -->
<h1 id="introduction">Introduction</h1>


<p>In lecture three, we reviewed the use of descriptive statistics to answer questions such as: “What is the current state of affairs?”; “How often, how many, when?” I also introduced the use of the correlation coefficient to assess “what is the association between two variables?” However, in many cases, showing that two variables are associated is not enough. Associations only measure how a set of variables change together; they say nothing about the direction or magnitude of the relationship. To say something about the direction means to discover whether one variable is the cause or determinant of another. Here there is a clear order in the relationship between two variables: for instance, \(X \rightarrow Y\) represents that \(X\) is the cause or determinant of \(Y\). The magnitude of the relationship refers to the measurement of the effect of \(X\) on \(Y\): for instance, if \(X\) changes by one unit, how much would \(Y\) vary?</p>

<p>The distinction between a correlation and a causal relationship between two variables is not only important but necessary in many applications. Imagine, for instance, the development of a vaccine or an important policy prescription. Obviously, the research that backs up these developments will impact the lives of many people. Therefore, we would like to make a precise inference, to be able to claim with robustness the magnitude and direction of the relationship that exists between variables. An association between two variables is not strong enough to draw conclusions about the population of our interest. In many cases we would like to move from showing a correlation between variables to finding which variable is the cause or determinant of the other. This kind of research is the central quest of econometrics and Data Science, and it has a special place in empirical economics.</p>
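A minimal R sketch of this point, using simulated data so that the "true" direction is known by construction (the coefficients and seed below are illustrative assumptions, not taken from the lecture):

```r
# Correlation is symmetric, so it cannot reveal the direction of a relationship.
set.seed(1)
x <- rnorm(100)
y <- 2 * x + rnorm(100)    # by construction, x causes y with effect 2

all.equal(cor(x, y), cor(y, x))  # TRUE: swapping the variables changes nothing

# A regression, by contrast, depends on which variable we treat as the outcome:
coef(lm(y ~ x))["x"]       # close to the true effect of 2
coef(lm(x ~ y))["y"]       # a much smaller number: not a causal effect of y on x
```

The correlation is identical in both directions, while the two regression slopes differ; neither statistic by itself tells us which variable is the cause.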

<h1 id="causality-and-correlation">Causality and Correlation</h1>

<p>As it turns out, the kind of relationship between two variables is often not so clear. As Data Scientists, we should proceed with scientific skepticism when we analyze the relationship between two variables. When we measure a correlation, we are merely assessing the association between two variables, and that is not the same as causation. To help you draw a line between an association or correlation and a causal relationship between two variables, I elaborate on some properties that causal relationships must have:</p>

<h2 id="a-causal-mechanism">A Causal Mechanism</h2>

<p>The development of Machine Learning and Big Data are pushing the boundaries between <strong>data-driven</strong> and <strong>theory-driven</strong> research <a href="https://aisel.aisnet.org/jais/vol19/iss12/1/">(Maass, Parsons, Et Al., 2018)</a>. Indeed, there is nowadays a real debate on the power of Data Science to replace the scientific method:</p>

<p><br /></p>

<blockquote>
  <blockquote>
    <p>Very large databases are a major opportunity for science and data analytics is a remarkable new field of investigation in computer science. The effectiveness of these tools is used to support a “philosophy” against the scientific method as developed throughout history. According to this view, computer-discovered correlations should replace understanding and guide prediction and action. Consequently, there will be no need to give scientific meaning to phenomena, by proposing, say, causal relations, since regularities in very large databases are enough: “with enough data, the numbers speak for themselves”. The “end of science” is proclaimed… <a href="https://link.springer.com/article/10.1007/s10699-016-9489-4#Sec9">Calude &amp; Longo, 2017</a>.</p>
  </blockquote>
</blockquote>

<p><br /></p>

<p>However, in Economics, we are skeptical about whether data-driven methods can really substitute for theory-driven research. The advent of Information and Communication Technologies (ICT) and now Big Data is generating large volumes of information. The availability of all sorts of data also poses the challenge of identifying meaningful relationships between variables. The issue is that, more often than before, we can find by chance pairs of variables that seem to be related but are in fact completely disconnected from each other. In Economics, there is a long-standing concern about this kind of problem, called a <strong>spurious</strong> relationship between variables. In Layman’s terms, a spurious relationship occurs when a set of variables seem to have a relationship when they are in fact completely unrelated.</p>
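A short simulation sketch of a spurious relationship (the seed and sample sizes are arbitrary assumptions): two independent random walks very often look strongly correlated, whereas independent white noise almost never does.

```r
# Spurious correlation: independent random walks often look strongly "related".
set.seed(42)
spurious    <- replicate(500, abs(cor(cumsum(rnorm(100)), cumsum(rnorm(100)))))
independent <- replicate(500, abs(cor(rnorm(100), rnorm(100))))

mean(spurious > 0.5)     # a sizable share of random-walk pairs correlate strongly
mean(independent > 0.5)  # essentially never happens for plain independent noise
```

Both series in each pair are generated completely independently; the trending behaviour of the random walks alone produces the apparent relationship.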

<p>So then, what is the solution to avoid the trap of a <em>spurious</em> relationship between two variables? The answer is a well-defined and coherent theoretical framework. In fact, the mainstream methodology of economics has always been about finding methods to prove economic theory, and not the other way around. Although that trend is changing, and some Data Scientists would argue that research is becoming more data-driven, the fact is that in economics there is no substitute for a well-defined and coherent theory. The seminal work of <a href="https://www.cambridge.org/core/books/methodology-of-economics/A02870A52E4F457D4EFBAA3242BAE541">Blaug (1992)</a> takes a closer look at the development of methodology in economics and argues that:</p>

<p><br /></p>
<blockquote>
  <blockquote>
    <p>Methodology is study of the relationship between <em>theoretical concepts</em> and warranted conclusions about the real world; in particular, methodology is that branch of economics where we examine the ways in which economists justify their theories and the reasons they offer for preferring one theory over another.
<br /></p>
  </blockquote>
</blockquote>

<h2 id="an-exogenous-model">An Exogenous Model</h2>

<p>To argue that two or more variables hold a causal relationship, we must ensure that our models are exogenous. What does that mean? Well, to say that \(X \rightarrow Y\) requires that we control in the estimation for all other factors \(Z\) that affect our dependent variable, \(Z \rightarrow Y\). If our theory suggests that \(X\) causes \(Y\), we must ensure that our estimation isolates the causal mechanism well. In other words, we must account, jointly with \(X\), for all the other determinants \(Z\) of \(Y\). If we fail to include all the variables that systematically affect \(Y\), we fall into the <strong>omitted variable bias (OVB)</strong> trap. OVB is common because variables that remain confounded or unobservable (\(Z\)) make it hard to distinguish whether it is \(X\) that determines \(Y\), or perhaps \(Z\). A graphical approach to understanding the threat OVB poses to a causal estimation is represented in the following diagram. Here we can see that our variable of interest \(X\) is indeed causing \(Y\); however, there is another variable (in the yellow region), \(Z\), that is jointly affecting \(Y\). Failing to control for \(Z\) then induces a discrepancy between the population parameter(s) and our estimate(s) called <strong>bias</strong>.</p>

<center>
<div>
 <script src="https://cdn.jsdelivr.net/npm/mermaid/dist/mermaid.min.js"></script>
    Omitted Variable Bias (OVB).
    <div class="mermaid">
  graph TD
    X --&gt;  Y
    subgraph OVB;
    Z --&gt; Y
    classDef red fill:#fdc
    class AN red
    end
    </div>
</div>

Biased model:

$$Y=\beta_1 X+\epsilon$$
<br />

Unbiased model:

$$Y=\beta_1 X+ \beta_2 Z+\epsilon$$
<br />

</center>

<p>The classic example is the estimation of the effect of years of <em>education</em> (\(X\)) on <em>income</em> (\(Y\)). The problem is that we can measure the years of education really well, but other determinants like ability, motivation and number of hours of study are very hard to measure. Even if we have psychometric measurements of IQ, these metrics are only proxies of the latent ability at the individual level. A proxy is just an approximation of the real variable, which remains confounded or unobserved. The book of <a href="https://www.pearson.com/us/higher-education/program/Stock-Introduction-to-Econometrics-Plus-My-Lab-Economics-with-Pearson-e-Text-Access-Card-Package-4th-Edition/PGM2416966.html">Stock and Watson (2019)</a> offers another example from the study of school grades (\(Y\)) and the student-teacher ratio. The intuition of the study is that if the student-teacher ratio is high, then the grades are low. The causal mechanism that explains this negative relation is the lack of capacity of teachers to properly tutor many students. However, the estimation suffers from OVB, because it does not account for the percentage of English learners in some schools. This is a problem because migrant children might require additional tutoring, given that they have not yet mastered the language. Another potential source of OVB is the failure to control for the time of the test: the time of day can impact the scores, because early in the morning and later in the evening alertness may be reduced.</p>
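The OVB mechanism can be made concrete with a small simulation in which the true effect is known by construction (all coefficients, the seed and the sample size below are illustrative assumptions):

```r
# Omitted variable bias: Z affects both X and Y, so leaving Z out of the
# regression biases the estimated effect of X on Y.
set.seed(123)
n <- 1000
Z <- rnorm(n)                   # confounder
X <- 0.8 * Z + rnorm(n)         # X is correlated with the confounder Z
Y <- 1 * X + 2 * Z + rnorm(n)   # true effect of X on Y is 1

coef(lm(Y ~ X))["X"]      # biased upward: absorbs part of Z's effect on Y
coef(lm(Y ~ X + Z))["X"]  # close to the true value of 1 once Z is controlled for
```

The short regression mixes the effect of \(X\) with the effect of the omitted \(Z\); including \(Z\) recovers the population parameter.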

<p>A special type of OVB is called <strong>self-selection</strong>. Self-selection appears in an estimation when there are inherent characteristics of the unit of observation that affect the outcome variable (\(Y\)) but remain confounded or latent. This perhaps sounds quite abstract, so let’s give some examples to clarify the concept. Imagine that you are interested in estimating the effect of education quality (\(X\)) on career success (\(Y\)), measured in monthly income. So you run a model and control for the different schools; among the sample you have graduates from Oxford, Harvard, Stanford and so on. Then, in your estimation, it appears that indeed the higher the rank of the university (per the <a href="https://www.timeshighereducation.com/world-university-rankings">World University Ranking</a>), the higher the measured salary of the graduates (\(Y\)). But wait, aren’t more able students also more likely to enroll themselves in highly ranked universities? Indeed, variables such as ability and motivation are very difficult to observe, and hence it is hard to determine whether schooling at highly ranked universities causes later career success. Although the association between university prestige and career success intuitively makes sense, most of the time we are only able to describe a simple correlation between the variables (Gonzalez-Sauri and Rossello 2022). Another example, from studies of science and technology, is whether research collaboration causes higher research productivity. Intuitively, it makes sense that collaboration brings gains of human capital, division of labor and pooling of resources that increase the overall productivity of authors. However, we disregard that these positive externalities from collaboration depend on the individual self-selection into networks or teams (Ductor 2015). The self-selection takes place because researchers do not connect or form partnerships with everybody randomly. Most of the time, a researcher’s own preferences in terms of discipline, research interests and other individual characteristics, such as their personality, are the reason behind membership in different networks. Thus, it is hard to tell whether collaboration is the determinant of productivity, or whether some other individual characteristics help some researchers to be part of prolific networks.</p>

<p>A second source of <strong>endogeneity</strong> (the opposite of exogeneity) is called <strong>reverse causality</strong>. Reverse causality is a real problem in many datasets because there is some feedback mechanism \(Y \rightarrow X\) in which the dependent variable also affects the explanatory variable. A classic example of this issue in economics is present in the functions of supply and demand. Supply \(S\) varies depending on the selling price \(P\), but simultaneously the price is also changing according to the demand \(D\). This is an issue of feedback, in which the dependent variables (supply and demand) affect the explanatory variable (price) under equilibrium. This system of equations has a problem of <strong>reverse causality</strong> that is not so straightforward to solve. In graphical form, the problem of reverse causality is represented in the following way:</p>

<center>
<div>
 <script src="https://cdn.jsdelivr.net/npm/mermaid/dist/mermaid.min.js"></script>
    Reverse Causality.
    <div class="mermaid">
  graph TD
    P[Price] --&gt;  S[Supply]
    subgraph REV-CAUSAL;
    D[Demand] --&gt; P
    classDef red fill:#fdc
    class AN red
    end
    S ---|Equilibrium: =| D
    </div>
</div>

Biased model:

$$S=\beta_1 P+\epsilon$$
<br />

Unbiased model:

$$S=\beta_1 P +\epsilon$$
$$D=\beta_2 P +\epsilon$$
<br />

</center>
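<p>The simultaneity bias can be illustrated with a small simulation. The sketch below is purely hypothetical (the variable names, coefficients and shocks are invented for illustration): price and quantity are determined jointly, so OLS on the single supply equation is biased.</p>

```r
# Hypothetical simulation of simultaneity bias (all numbers are invented).
set.seed(1)
n <- 5000
demand_shock <- rnorm(n)
supply_shock <- rnorm(n)

# Price is determined in equilibrium by BOTH shocks: the feedback D -> P -> S.
price <- 10 + demand_shock - supply_shock

# "True" supply slope is 2, but the error (supply_shock) is correlated with price.
quantity <- 2 * price + supply_shock

# OLS on the single supply equation is pulled away from the true slope of 2.
coef(lm(quantity ~ price))["price"]
```

<p>Because the supply shock is negatively correlated with price by construction, the estimate lands well below the true slope of 2; this is the bias that system estimation or instrumental variables are designed to remove.</p>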

<p>A similar problem that poses a threat to exogeneity is called <strong>circularity</strong>, and it happens when past realizations of the dependent variable affect its contemporaneous values. There are many examples of this problem in finance and time-series econometrics. For instance, in macroeconomic estimations of \(GDP_{t}\) it is crucial to include the previous state of affairs \(GDP_{t-y}\), where \(y&gt;0\) indexes an earlier period, so that the current or contemporaneous \(GDP\) depends on the state of affairs of previous years. The problem of circularity in graphical form is described as follows:</p>

<center>
<div>
 <script src="https://cdn.jsdelivr.net/npm/mermaid/dist/mermaid.min.js"></script>
    Circularity.
    <div class="mermaid">
  graph TD
    X --&gt;  GDP_t2
    subgraph CIRCULARITY;
    GDP_t1 --&gt; GDP_t2
    end
    </div>
</div>

Biased model:

$$GDP_{t+1}=\beta_1 X+\epsilon$$
<br />

Unbiased model:

$$GDP_{t+1}=\beta_1 X + \beta_2 GDP_{t} +\epsilon$$

<br />

</center>
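<p>A small simulated sketch of the circularity problem (hypothetical numbers, not real GDP data): when the lagged dependent variable is omitted and the regressor is correlated with it, the coefficient on \(X\) absorbs part of the lag's effect.</p>

```r
# Hypothetical sketch of omitted-lag bias (all coefficients are invented).
set.seed(2)
n <- 10000
gdp_lag <- rnorm(n)                      # last period's GDP
x <- 0.8 * gdp_lag + rnorm(n)            # X is partly driven by past GDP
gdp <- 1 * x + 0.5 * gdp_lag + rnorm(n)  # true effect of x is 1

coef(lm(gdp ~ x))["x"]            # biased upward: x proxies the omitted lag
coef(lm(gdp ~ x + gdp_lag))["x"]  # close to the true value of 1
```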

<h2 id="nature-of-the-data-observational-vs-experimental">Nature of the data: Observational vs Experimental</h2>

<p>If we think carefully about the threats to exogeneity discussed in the last section (OVB, self-selection, reverse causality and circularity), we may see a common problem. Indeed, at the heart of endogeneity (the opposite of exogeneity) lies a common problem of confounding factors. A confounding factor, in layman’s terms, is simply a variable that we cannot get our hands on: either because we do not have the data or cannot measure it (OVB), because of a problem of self-selection, or because our dependent variable has a form of feedback (reverse causality or circularity). All of the listed problems induce bias and yield unreliable inference because, at the backbone of the estimation, there is a problem of confounding variables. Having confounding factors in an estimation is like cooking with an incomplete recipe, or like trying to complete a jigsaw puzzle with missing pieces.</p>

<div style="text-align:center;line-height:150%">
<a href="https://www.zazzle.com/one_missing_puzzle_piece_black_tile-227261345801427234"><img src="https://rlv.zcache.com/one_missing_puzzle_piece_black_tile-rd6c9e5c34175418fa4d8d833eb958540_agtk1_8byvr_1024.jpg?max_dim=325" alt="One Missing Puzzle Piece - Black Tile" style="border:0;" /></a>
<br />
<a href="https://www.zazzle.com/one_missing_puzzle_piece_black_tile-227261345801427234">One Missing Puzzle Piece - Black Tile</a>
<br />by <a href="https://www.zazzle.com/store/flowstonegraphics">FlowstoneGraphics</a>
</div>

<p>This general issue of confounding variables is not easy to solve with the vast majority of the datasets we can get our hands on. Indeed, the aforementioned threats to exogeneity may persist even in the most tidy and organized data from relational databases (SQL, for instance), surveys or administrative records. Unfortunately, the issue of confounding factors is not solved by increasing the magnitude and quantity of the data at our disposal. Even if we could collect Big Data with millions of records using web-scraping algorithms, or obtain it from a large company or government agency, the problem may persist.</p>

<p>One way to solve the problem of confounding factors is to employ what has become the gold standard in economics and the social sciences: <strong>Randomized Control Trials (RCTs)</strong>. Data that comes from RCTs is called <strong>experimental data</strong> and differs from the data we collect from all other sources, generally called <strong>observational data</strong>. An RCT typically has a well-defined causal mechanism that directs the process of data collection to eradicate, by design, the problem of confounding variables. Indeed, the power of RCTs derives from their ability to isolate the causal mechanism by virtue of random assignment. In simple terms, the ideal RCT design starts by selecting at least two groups of similar units (individuals, firms, regions). These two groups must be so similar in all characteristics that any difference between their averages becomes insignificant. The comparison is then “apples to apples” and not “apples to oranges”.</p>

<p><img src="https://noushinn.github.io/experimentation_course/fig/RCT-graphic.png" alt="" />
 <em>Source:<a href="https://noushinn.github.io/experimentation_course/defining-the-problem.html">Initiating an Experiment, Ch.4</a></em></p>

<p>The heart of the RCT is randomly changing the circumstances that surround the causal mechanism in one of the two groups, namely the <strong>treatment group</strong>. The randomized assignment has two main virtues. First, we eradicate the problem of self-selection by controlling which of the two identical groups receives the treatment. Keep in mind that the treatment embodies the causal mechanism that we are aiming to showcase, \(X \rightarrow Y\). Second, by changing the circumstances randomly, the variable of interest is most likely disconnected from, or unrelated to, any other factor \(Z\) affecting the outcome variable \(Y\) of our research.</p>
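<p>The logic of random assignment can be checked in a toy simulation (hypothetical data, invented effect sizes): even with a strong unobserved confounder, a coin-flip treatment is uncorrelated with it, so a plain difference in means recovers the treatment effect.</p>

```r
# Hypothetical RCT sketch: random assignment neutralizes the confounder Z.
set.seed(3)
n <- 20000
ability <- rnorm(n)                  # unobserved confounder Z
treat <- rbinom(n, 1, 0.5)           # coin-flip assignment: independent of Z
y <- 2 * treat + ability + rnorm(n)  # true treatment effect is 2

mean(y[treat == 1]) - mean(y[treat == 0])  # close to 2 without controlling for Z
```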


<h1 id="examples">Examples</h1>

<h2 id="chocolate-consumption-and-noble-laureates">Chocolate consumption and Nobel laureates</h2>

<p>The study by <a href="https://www.sciencedirect.com/science/article/pii/S2590291120300711">Aloys Leo Prinz (2020)</a> examines the well-known association between Nobel laureates and chocolate consumption. At first glance, plotting the consumption of coffee and chocolate against the number of Nobel laureates suggests a positive relationship. Using descriptive statistics, we can assess this easily with a scatter plot or a correlation table.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Load plotting and data-manipulation packages
library(ggplot2)
library(gridExtra)
library(dplyr)

choc_lauretes &lt;- readRDS("choc_lauretes.rds")

# Scatter plots with fitted linear trends, side by side
grid.arrange(

choc_lauretes %&gt;%
  ggplot(aes(cholate_per_cap, no_nobel_lau)) +
  geom_point() +
  geom_smooth(method = "lm", se = TRUE),

choc_lauretes %&gt;%
  ggplot(aes(coffee_per_cap, no_nobel_lau)) +
  geom_point() +
  geom_smooth(method = "lm", se = TRUE),

  nrow = 1
)

# Pairwise correlations (columns 2-4), dropping incomplete rows
cor(choc_lauretes[, c(2L:4L)], use = "complete.obs")
</code></pre></div></div>

<table border="1">
<caption align="top"> Table 1: Correlation matrix </caption>
<tr> <th>  </th> <th> cholate_per_cap </th> <th> coffee_per_cap </th> <th> no_nobel_lau </th>  </tr>
  <tr> <td align="right"> cholate_per_cap </td> <td align="right"> 1.00 </td> <td align="right">  </td> <td align="right">  </td> </tr>
  <tr> <td align="right"> coffee_per_cap </td> <td align="right"> 0.45 </td> <td align="right"> 1.00 </td> <td align="right">  </td> </tr>
  <tr> <td align="right"> no_nobel_lau </td> <td align="right"> 0.17 </td> <td align="right"> -0.12 </td> <td align="right"> 1.00 </td> </tr>
   </table>

<p><br /></p>

<p>The correlation matrix shows that chocolate consumption has a positive association with the number of Nobel laureates (0.17), while coffee consumption has a weak negative one (-0.12). Looking at the scatter plots, the trend indicates almost no relation between coffee consumption and Nobel winners, whereas there is a clear positive trend between chocolate consumption and Nobel Prize winners. Does chocolate consumption cause people to become smarter?</p>

<p><br /></p>

<p><img src="https://github.com/Wario84/idsc_mgs/raw/master/assets/imgs/choc_coffee.svg?raw=true" alt="" /></p>

<p>As you are probably suspecting, to show a causal link between chocolate and human cognition, a simple correlation and trend analysis are not enough. But what is missing?</p>

<ul>
  <li>Causal Mechanism</li>
</ul>

<p>In fact, the study by <a href="https://www.sciencedirect.com/science/article/pii/S2590291120300711">Aloys Leo Prinz (2020)</a> offers a theory claiming that flavonoids and caffeine have a positive effect on cognition and on the dopaminergic reward system of the human brain. However, his paper does not explain in detail how flavonoids and caffeine interact with particular areas of the brain to yield that effect. In fact, his empirical study does not provide any biological evidence supporting that claim.</p>

<ul>
  <li>Data</li>
</ul>

<p>The study uses observational data and does not solve the problems of self-selection and confounding variables. Moreover, the unit of observation (countries) is quite disconnected from the unit of analysis (Nobel laureates). That is, the study attempts to describe a causal mechanism that occurs at the micro level, namely in the brains of Nobel Prize winners, yet it uses macro data at the country level to draw conclusions about the brains of researchers.</p>

<ul>
  <li>Endogeneity</li>
</ul>

<p>The study does not control for important confounding variables such as natural ability and the level of education of individuals. There is also no account of motivation or of the number of weekly hours that researchers invest in their work. The lack of these controls casts doubt on the estimation, because they are important determinants of research productivity. Furthermore, the data suffer from self-selection, because individuals choose whether to consume chocolate or coffee. Hence, we cannot observe the outcomes of individuals with similar characteristics who do not consume coffee or chocolate (a control or counterfactual group).</p>
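<p>The confounding critique can be made concrete with a toy simulation (invented variables, not the paper's data): a third factor such as national wealth can generate the chocolate-Nobel correlation even when chocolate has no effect at all.</p>

```r
# Hypothetical confounder sketch: "wealth" drives both variables.
set.seed(4)
n <- 500
wealth <- rnorm(n)
chocolate <- wealth + rnorm(n)
nobels <- wealth + rnorm(n)  # Nobel counts never depend on chocolate

cor(chocolate, nobels)                              # clearly positive
coef(lm(nobels ~ chocolate + wealth))["chocolate"]  # near zero once wealth is held fixed
```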

<h2 id="social-norms-and-energy-consumption">Social Norms and Energy Consumption.</h2>

<p>The study by <a href="https://journals.sagepub.com/doi/10.1111/j.1467-9280.2007.01917.x">Schultz, Nolan, et al. (2007)</a> conducts an experiment on 290 households in San Marcos, CA, USA. The experiment was designed to analyze the effect of two different kinds of social norms. One group received an intervention that induced a “descriptive norm”: the group was given information on its energy consumption compared to the average consumption of the neighborhood. The average consumption carries an implicit social norm, given that individuals tend to conform to the behavior of their peers. The second group was treated with another norm, the “injunctive norm”, which embodies perceptions of what is commonly right or wrong in a given situation. The core of the analysis is then to measure the effect of the two norms before and after the treatment.</p>

<ul>
  <li>Causal Mechanism</li>
</ul>

<p>The study uses the theoretical framework of “focus theory”, which predicts that if only one of the two types of norms is prominent in an individual’s consciousness, it will exert the stronger influence on behavior <a href="https://www.annualreviews.org/doi/abs/10.1146/annurev.psych.55.090902.142015?casa_token=ulIwMSo3LUQAAAAA:wRxVgEx3GoCzpwZv_2AQIWnVcLoKBs9ii-0fLxK1-8w86QqjIAarB5zkmGvyZiZXp0ISjMZkEY8F9g">(Cialdini &amp; Goldstein, 2004)</a>. The theory thus predicts that the group treated with a “descriptive norm” will move its energy consumption towards the mean, from above or below (the boomerang effect). In contrast, the group treated with an “injunctive norm” should change behavior only after receiving a negative signal (a sad face) when consumption is above the mean, and not the other way around.</p>

<ul>
  <li>Data</li>
</ul>

<p>The study uses experimental data because the treatment (social norm) is allocated randomly. The experimental data has the advantage of removing the problem of self-selection and the variable of interest \(X\), the social norm, is, by virtue of the random assignment, uncorrelated with other determinants \(Z\), of the energy consumption \(Y\).</p>

<ul>
  <li>Endogeneity</li>
</ul>

<p>The study only derives conclusions from a difference in means and does not assess other factors that might drive the change in behavior, for instance unemployment during the observation period or absence from the household due to holidays or work. Further, it is not clear that the two groups were completely isolated from one another. The causal mechanism depends on the prominence of one of the norms in the minds of individuals; however, neighbors typically communicate and interact among themselves, hence it is not unlikely that a norm affected more than one household.</p>

<h1 id="references">References</h1>

<div id="refs" class="references csl-bib-body hanging-indent">
<div id="ref-1992Blaug10.1017/CBO9780511528224" class="csl-entry">
Blaug, Mark. 1992. <em>The Methodology of Economics: Or, How Economists
Explain</em>. 2nd ed. Cambridge Surveys of Economic Literature.
Cambridge University Press. <a href="https://doi.org/10.1017/CBO9780511528224">https://doi.org/10.1017/CBO9780511528224</a>.
</div>
<div id="ref-2017Calude10.1007/s1069901694894" class="csl-entry">
Calude, Cristian S., and Giuseppe Longo. 2017. <span>"The Deluge of
Spurious Correlations in Big Data."</span> <em>Foundations of
Science</em> 22 (3): 595-612. <a href="https://doi.org/10.1007/s10699-016-9489-4">https://doi.org/10.1007/s10699-016-9489-4</a>.
</div>
<div id="ref-2004Cialdini10.1146/annurev.psych.55.090902.142015" class="csl-entry">
Cialdini, Robert B., and Noah J. Goldstein. 2004. <span>"Social
Influence: Compliance and Conformity."</span> <em>Annual Review of
Psychology</em> 55 (1): 591-621. <a href="https://doi.org/10.1146/annurev.psych.55.090902.142015">https://doi.org/10.1146/annurev.psych.55.090902.142015</a>.
</div>
<div id="ref-2015Ductorhttps//doi.org/10.1111/obes.12070" class="csl-entry">
Ductor, Lorenzo. 2015. <span>"Does Co-Authorship Lead to Higher Academic
Productivity?"</span> <em>Oxford Bulletin of Economics and
Statistics</em> 77 (3): 385-407. https://doi.org/<a href="https://doi.org/10.1111/obes.12070">https://doi.org/10.1111/obes.12070</a>.
</div>
<div id="ref-2022GonzalezSauri10.1007/s11162022096797" class="csl-entry">
Gonzalez-Sauri, Mario, and Giulia Rossello. 2022. <span>"The Role of
Early-Career University Prestige Stratification on the Future Academic
Performance of Scholars."</span> <em>Research in Higher Education</em>,
April. <a href="https://doi.org/10.1007/s11162-022-09679-7">https://doi.org/10.1007/s11162-022-09679-7</a>.
</div>
<div id="ref-2018Maass10.17705/1jais.00526" class="csl-entry">
Maass, Wolfgang, Jeffrey Parsons, Sandeep Purao, Veda C Storey, and
Carson Woo. 2018. <span>"Data-Driven Meets Theory-Driven Research in the
Era of Big Data: Opportunities and Challenges for Information Systems
Research."</span> <em>Journal of the Association for Information
Systems</em> 19 (12): 1. <a href="https://doi.org/10.17705/1jais.00526">https://doi.org/10.17705/1jais.00526</a>.
</div>
<div id="ref-2020Nabavi" class="csl-entry">
Nabavi, Noushin. 2020. <em><span class="nocase">Chapter 4 Defining the
problem</span></em>. <a href="https://noushinn.github.io/experimentation_course/defining-the-problem.html">https://noushinn.github.io/experimentation_course/defining-the-problem.html</a>.
</div>
<div id="ref-2020Prinzhttps//doi.org/10.1016/j.ssaho.2020.100082" class="csl-entry">
Prinz, Aloys Leo. 2020. <span>"Chocolate Consumption and Noble
Laureates."</span> <em>Social Sciences &amp; Humanities Open</em>. <a href="https://doi.org/10.1016/j.ssaho.2020.100082">https://doi.org/10.1016/j.ssaho.2020.100082</a>.
</div>
<div id="ref-2007Schultz10.1111/j.14679280.2007.01917.x" class="csl-entry">
Schultz, P. Wesley, Jessica M. Nolan, Robert B. Cialdini, Noah J.
Goldstein, and Vladas Griskevicius. 2007. <span>"The Constructive,
Destructive, and Reconstructive Power of Social Norms."</span>
<em>Psychological Science</em> 18 (5): 429-34. <a href="https://doi.org/10.1111/j.1467-9280.2007.01917.x">https://doi.org/10.1111/j.1467-9280.2007.01917.x</a>.
</div>
<div id="ref-2003Stock" class="csl-entry">
Stock, James, and Mark W. Watson. 2003. <em>Introduction to
Econometrics</em>. New York: Prentice Hall; Prentice Hall.
</div>
<div id="ref-2022THE" class="csl-entry">
Times Higher Education. 2022. <span>"<span>World University
Rankings</span>."</span> <a href="https://www.timeshighereducation.com/world-university-rankings">https://www.timeshighereducation.com/world-university-rankings</a>.
</div>
</div>
]]></content><author><name>Mario H. Gonzalez-Sauri</name></author><summary type="html"><![CDATA[Introduction]]></summary></entry><entry><title type="html">Tutorial 3: Descriptive Statistics and Data Visualization.</title><link href="https://wario84.github.io/idsc_mgs/2022/03/01/w2_1_Tutorial_03.html" rel="alternate" type="text/html" title="Tutorial 3: Descriptive Statistics and Data Visualization." /><published>2022-03-01T00:00:00+00:00</published><updated>2022-03-01T00:00:00+00:00</updated><id>https://wario84.github.io/idsc_mgs/2022/03/01/w2_1_Tutorial_03</id><content type="html" xml:base="https://wario84.github.io/idsc_mgs/2022/03/01/w2_1_Tutorial_03.html"><![CDATA[<h1 id="about-the-data">About the Data</h1>

<ul>
  <li><a href="https://github.com/Wario84/idsc_mgs/raw/master/assets/data/bike_data.csv?raw=true">Hourly Rental Bike Usage in Washington DC</a></li>
  <li>Data on season, day of the week, temperature, humidity, wind speed,
weather, number of users, etc.</li>
</ul>

<h1 id="exercise-1-summary-statistics">Exercise 1: Summary Statistics</h1>

<p>Solve:</p>

<ol>
  <li>1.1 Average number of total users</li>
</ol>

<!-- -->

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[1] 189.4631
</code></pre></div></div>

<ol>
  <li>1.2 Average temperature (in F)</li>
</ol>

<!-- -->

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[1] 58.77751
</code></pre></div></div>

<ol>
  <li>1.3 median Humidity</li>
</ol>

<!-- -->

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[1] 63
</code></pre></div></div>

<ol>
  <li>1.4 Variance of the Number of Registered Users</li>
</ol>

<!-- -->

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[1] 22909.03
</code></pre></div></div>

<ol>
  <li>1.5 Standard Deviation of the Number of Casual Users</li>
</ol>

<!-- -->

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[1] 49.30503
</code></pre></div></div>
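<p>One way to obtain the values above, assuming the CSV has been loaded into a data frame called <code class="language-plaintext highlighter-rouge">bike</code> with the column names shown in the Exercise 4 correlation matrix (both the name and the column spellings are assumptions):</p>

```r
# Assumes: bike <- read.csv("bike_data.csv"), columns as in Exercise 4.
mean(bike$Total.Users)       # 1.1 average number of total users
mean(bike$Temperature.F)     # 1.2 average temperature (F)
median(bike$Humidity)        # 1.3 median humidity
var(bike$Registered.Users)   # 1.4 variance of registered users
sd(bike$Casual.Users)        # 1.5 standard deviation of casual users
```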

<h1 id="exercise-2-vizualization-the-data-distribution">Exercise 2: Visualizing the Data Distribution</h1>

<p>Instructions:</p>

<ul>
  <li>to include a title: <code class="language-plaintext highlighter-rouge">main</code></li>
  <li>to change the color of lines: <code class="language-plaintext highlighter-rouge">col</code></li>
  <li>to change the labels of the y and x axes: <code class="language-plaintext highlighter-rouge">ylab</code> and <code class="language-plaintext highlighter-rouge">xlab</code></li>
  <li>to change the limits of the y and x axes: <code class="language-plaintext highlighter-rouge">ylim</code> and <code class="language-plaintext highlighter-rouge">xlim</code></li>
</ul>

<p><br /></p>

<ol>
  <li>2.1 Density plot of Humidity</li>
</ol>

<p><img src="https://github.com/Wario84/idsc_mgs/raw/master/assets/imgs/unnamed-chunk-7-1.png?raw=true" alt="" /><!-- --></p>

<ol>
  <li>
    <p>2.2 Density Plot of the Temperature including the mean and the
median
<img src="https://github.com/Wario84/idsc_mgs/raw/master/assets/imgs/unnamed-chunk-8-1.png?raw=true" alt="" /><!-- --></p>
  </li>
  <li>
    <p>2.3 Histogram of Wind Speed <br />
<img src="https://github.com/Wario84/idsc_mgs/raw/master/assets/imgs/unnamed-chunk-9-1.png?raw=true" alt="" /><!-- --></p>
  </li>
  <li>
    <p>2.4 Boxplot of Casual Users Including a line with the mean
<img src="https://github.com/Wario84/idsc_mgs/raw/master/assets/imgs/unnamed-chunk-10-1.png?raw=true" alt="" /><!-- --></p>
  </li>
</ol>
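<p>A sketch of the base-R plotting calls behind these figures, assuming the data frame is named <code class="language-plaintext highlighter-rouge">bike</code> and its columns are spelled as in the Exercise 4 correlation matrix (both are assumptions):</p>

```r
# Density, histogram and boxplot sketches for Exercise 2 (assumed `bike`).
plot(density(bike$Humidity), main = "Density of Humidity")          # 2.1

plot(density(bike$Temperature.F), main = "Density of Temperature")  # 2.2
abline(v = mean(bike$Temperature.F), col = "red")                   # mean
abline(v = median(bike$Temperature.F), col = "blue")                # median

hist(bike$Wind.Speed, main = "Histogram of Wind Speed")             # 2.3

boxplot(bike$Casual.Users, main = "Boxplot of Casual Users")        # 2.4
abline(h = mean(bike$Casual.Users), col = "red")                    # mean line
```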

<h1 id="exercise-3-normal-distribution">Exercise 3: Normal Distribution</h1>

<ol>
  <li>
    <p>3.1 Generate a normal distribution with 1000 observations, mean = 20
and sd = 3 and then plot its density plot (Use the code
set.seed(320) before)
<img src="https://github.com/Wario84/idsc_mgs/raw/master/assets/imgs/unnamed-chunk-11-1.png?raw=true" alt="" /><!-- --></p>
  </li>
  <li>
    <p>3.2 Present the Histogram of this normal distribution
<img src="https://github.com/Wario84/idsc_mgs/raw/master/assets/imgs/unnamed-chunk-12-1.png?raw=true" alt="" /><!-- --></p>
  </li>
  <li>
    <p>3.3 Present the Boxplot of this normal distribution
<img src="https://github.com/Wario84/idsc_mgs/raw/master/assets/imgs/unnamed-chunk-13-1.png?raw=true" alt="" /><!-- --></p>
  </li>
</ol>
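<p>A minimal sketch for Exercise 3, using the seed given in the instructions:</p>

```r
# Simulate and visualize a Normal(20, 3) sample of size 1000.
set.seed(320)
x <- rnorm(1000, mean = 20, sd = 3)

plot(density(x), main = "Density of Simulated Normal")  # 3.1
hist(x, main = "Histogram")                             # 3.2
boxplot(x, main = "Boxplot")                            # 3.3
```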

<h1 id="excersise-4-covariance-and-correlation">Exercise 4: Covariance and Correlation</h1>

<ol>
  <li>4.1 Covariance between temperature and Total Number of Users</li>
</ol>

<!-- -->

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[1] 1220.347
</code></pre></div></div>

<ol>
  <li>4.2 Correlation between “Feels Like” Temperature and Total Number of
Users</li>
</ol>

<!-- -->

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[1] 0.4009377
</code></pre></div></div>

<ol>
  <li>4.3 Correlation Matrix of bike data</li>
</ol>

<!-- -->

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>                          Season         Hour      Holiday Day.of.the.Week
Season               1.000000000  0.004931139  0.055947939    -0.003163450
Hour                 0.004931139  1.000000000  0.000479136    -0.003497739
Holiday              0.055947939  0.000479136  1.000000000    -0.102087791
Day.of.the.Week     -0.003163450 -0.003497739 -0.102087791     1.000000000
Working.Day         -0.036158734  0.002284998 -0.252471370     0.035955071
Weather.Type         0.040452288 -0.020202528 -0.017036113     0.003310740
Temperature.F       -0.470806327  0.137625946 -0.027356343    -0.001805613
Temperature.Feels.F -0.469271254  0.133758276 -0.030974740    -0.008817003
Humidity             0.014750149 -0.276497828 -0.010588465    -0.037158268
Wind.Speed          -0.038741686  0.137253208  0.003984692     0.011504125
Casual.Users        -0.227260165  0.301201730  0.031563628     0.032721415
Registered.Users    -0.099585576  0.374140710 -0.047345424     0.021577888
Total.Users         -0.144872483  0.394071498 -0.030927303     0.026899860
                     Working.Day Weather.Type Temperature.F Temperature.Feels.F
Season              -0.036158734   0.04045229  -0.470806327        -0.469271254
Hour                 0.002284998  -0.02020253   0.137625946         0.133758276
Holiday             -0.252471370  -0.01703611  -0.027356343        -0.030974740
Day.of.the.Week      0.035955071   0.00331074  -0.001805613        -0.008817003
Working.Day          1.000000000   0.04467222   0.055396228         0.054665178
Weather.Type         0.044672224   1.00000000  -0.102600649        -0.105570718
Temperature.F        0.055396228  -0.10260065   1.000000000         0.987677449
Temperature.Feels.F  0.054665178  -0.10557072   0.987677449         1.000000000
Humidity             0.015687512   0.41813033  -0.069889709        -0.051935510
Wind.Speed          -0.011831470   0.02622604  -0.023115427        -0.062325722
Casual.Users        -0.300942486  -0.15262788   0.459626269         0.454088895
Registered.Users     0.134325791  -0.12096552   0.335373166         0.332565807
Total.Users          0.030284368  -0.14242614   0.404785441         0.400937689
                       Humidity   Wind.Speed Casual.Users Registered.Users
Season               0.01475015 -0.038741686  -0.22726017      -0.09958558
Hour                -0.27649783  0.137253208   0.30120173       0.37414071
Holiday             -0.01058846  0.003984692   0.03156363      -0.04734542
Day.of.the.Week     -0.03715827  0.011504125   0.03272142       0.02157789
Working.Day          0.01568751 -0.011831470  -0.30094249       0.13432579
Weather.Type         0.41813033  0.026226043  -0.15262788      -0.12096552
Temperature.F       -0.06988971 -0.023115427   0.45962627       0.33537317
Temperature.Feels.F -0.05193551 -0.062325722   0.45408890       0.33256581
Humidity             1.00000000 -0.290108894  -0.34702809      -0.27393312
Wind.Speed          -0.29010889  1.000000000   0.09029235       0.08232535
Casual.Users        -0.34702809  0.090292353   1.00000000       0.50661770
Registered.Users    -0.27393312  0.082325350   0.50661770       1.00000000
Total.Users         -0.32291074  0.093239057   0.69456408       0.97215073
                    Total.Users
Season              -0.14487248
Hour                 0.39407150
Holiday             -0.03092730
Day.of.the.Week      0.02689986
Working.Day          0.03028437
Weather.Type        -0.14242614
Temperature.F        0.40478544
Temperature.Feels.F  0.40093769
Humidity            -0.32291074
Wind.Speed           0.09323906
Casual.Users         0.69456408
Registered.Users     0.97215073
Total.Users          1.00000000
</code></pre></div></div>
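<p>The calls behind Exercise 4, assuming the data is in a data frame named <code class="language-plaintext highlighter-rouge">bike</code> with the column names shown in the matrix above (an assumption):</p>

```r
# Covariance, correlation, and the full correlation matrix (assumed `bike`).
cov(bike$Temperature.F, bike$Total.Users)        # 4.1 covariance
cor(bike$Temperature.Feels.F, bike$Total.Users)  # 4.2 correlation
cor(bike[sapply(bike, is.numeric)])              # 4.3 correlation matrix
```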

<h1 id="excersise-5-visualizing-two-variables">Exercise 5: Visualizing Two Variables</h1>

<ol>
  <li>
    <p>5.1 Plot Temperature in the x-axis and Total Number of Users in
y-axis
<img src="https://github.com/Wario84/idsc_mgs/raw/master/assets/imgs/unnamed-chunk-17-1.png?raw=true" alt="" /><!-- --></p>
  </li>
  <li>
    <p>5.2 Plot “Feels Like” Temperature in the x-axis and Number of Casual
Users in y-axis, Include a line that indicates the correlation
between these two variables (i.e., use the correlation as the slope
of this line)
<img src="https://github.com/Wario84/idsc_mgs/raw/master/assets/imgs/unnamed-chunk-18-1.png?raw=true?raw=true" alt="" /><!-- --></p>
  </li>
</ol>

<h2 id="exercise-6-applied-questions">Exercise 6: Applied Questions.</h2>

<ol>
  <li>6.1 Get the correlation between Temperature and “Feels Like”
Temperature and interpret its value</li>
</ol>

<!-- -->

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[1] 0.9876774
</code></pre></div></div>

<ol>
  <li>6.2 Density Plot of Total Users including the mean and the median.
Does this variable have a skew? If so, is it left or right skewed?
Is it possible to know this just by looking at the mean and median?
How?</li>
</ol>

<p><img src="https://github.com/Wario84/idsc_mgs/raw/master/assets/imgs/unnamed-chunk-20-1.png?raw=true?raw=true" alt="" /><!-- --></p>

<ol>
  <li>6.3 Generate a normal with 10 observations, mean = 0 and sd = 1 (use
set.seed(99)). Does this distribution look like a normal? What
would you have to do to make it look more like a normal
distribution?</li>
</ol>

<p><img src="https://github.com/Wario84/idsc_mgs/raw/master/assets/imgs/unnamed-chunk-21-1.png?raw=true?raw=true" alt="" /><!-- --></p>

<ol>
  <li>6.4 Provide the boxplot of Registered Users, What can you infer from
this boxplot?
<img src="https://github.com/Wario84/idsc_mgs/raw/master/assets/imgs/unnamed-chunk-22-1.png?raw=true?raw=true" alt="" /><!-- --></li>
</ol>]]></content><author><name>Diogo Leitao Requena</name></author><summary type="html"><![CDATA[About the Data]]></summary></entry><entry><title type="html">Tutorial 4: Linear Regression.</title><link href="https://wario84.github.io/idsc_mgs/2022/03/01/w4_1_Tutorial_04.html" rel="alternate" type="text/html" title="Tutorial 4: Linear Regression." /><published>2022-03-01T00:00:00+00:00</published><updated>2022-03-01T00:00:00+00:00</updated><id>https://wario84.github.io/idsc_mgs/2022/03/01/w4_1_Tutorial_04</id><content type="html" xml:base="https://wario84.github.io/idsc_mgs/2022/03/01/w4_1_Tutorial_04.html"><![CDATA[<h1 id="about-the-data">About the Data</h1>

<ul>
  <li>
    <p>Panel Data from South Korea</p>
  </li>
  <li>
    <p>Variables included:</p>

    <ul>
      <li>id</li>
      <li>year</li>
      <li>wave : from wave 1st in 2005 to wave 14th in 2018</li>
      <li>region: 1) Seoul 2) Kyeong-gi 3) Kyoung-nam 4) Kyoung-buk 5)
Chung-nam 6) Gang-won &amp;. Chung-buk 7) Jeolla &amp; Jeju</li>
      <li>income: yearly income in 10,000 KRW(ten thousands Korean Won.
1100 KRW = 1 USD)</li>
      <li>family_member: no of family members</li>
      <li>gender: 1) male 2) female</li>
      <li>year_born</li>
      <li>education_level: 1) no education(under 7 yrs-old) 2) no
education(7 &amp; over 7 yrs-old) 3) elementary 4) middle school 5)
high school 6) college 7) university degree 8) MA 9) doctoral
degree</li>
      <li>marriage: marital status. 1) not applicable (under 18) 2)
married 3) separated by death 4) separated 5) not married yet 6)
others</li>
      <li>religion: 1) have religion 2) do not have</li>
      <li>occupation</li>
      <li>company_size</li>
      <li>reason_none_worker: 1) no capable 2) in military service 3)
studying in school 4) prepare for school 5) prepare to apply
job 6) house worker 7) caring kids at home 8) nursing 9)
giving-up economic activities 10) no intention to work 11)
others</li>
    </ul>
  </li>
</ul>

<h1 id="exercise-1-linear-regression">Exercise 1: Linear Regression</h1>

<p>1.1 Regress income on education level.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Call:
lm(formula = income ~ education_level, data = korea)

Residuals:
    Min      1Q  Median      3Q     Max 
-237119   -1461    -439     790  462253 

Coefficients:
                 Estimate Std. Error t value Pr(&gt;|t|)    
(Intercept)     -1118.833     36.118  -30.98   &lt;2e-16 ***
education_level  1010.652      7.507  134.62   &lt;2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3820 on 92855 degrees of freedom
Multiple R-squared:  0.1633,    Adjusted R-squared:  0.1633 
F-statistic: 1.812e+04 on 1 and 92855 DF,  p-value: &lt; 2.2e-16
</code></pre></div></div>

<p>1.2 Create an age variable and regress income on it.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Call:
lm(formula = income ~ age, data = korea)

Residuals:
    Min      1Q  Median      3Q     Max 
-236901   -1633    -697     800  465548 

Coefficients:
             Estimate Std. Error t value Pr(&gt;|t|)    
(Intercept) 8278.5654    48.6856   170.0   &lt;2e-16 ***
age          -82.6049     0.8013  -103.1   &lt;2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3956 on 92855 degrees of freedom
Multiple R-squared:  0.1027,    Adjusted R-squared:  0.1027 
F-statistic: 1.063e+04 on 1 and 92855 DF,  p-value: &lt; 2.2e-16
</code></pre></div></div>
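<p>The age variable in 1.2 can be derived from the survey year and year of birth. A sketch, assuming the panel has been loaded into a data frame named <code class="language-plaintext highlighter-rouge">korea</code> with the columns listed above (the name is an assumption):</p>

```r
# Derive age from the panel's year and year_born columns, then regress.
korea$age <- korea$year - korea$year_born
summary(lm(income ~ age, data = korea))
```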

<p>1.3 Regress income on both age and education level.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Call:
lm(formula = income ~ age + education_level, data = korea)

Residuals:
    Min      1Q  Median      3Q     Max 
-237303   -1463    -431     731  462951 

Coefficients:
                 Estimate Std. Error t value Pr(&gt;|t|)    
(Intercept)     1321.5122    92.4989   14.29   &lt;2e-16 ***
age              -28.3540     0.9902  -28.64   &lt;2e-16 ***
education_level  837.7978     9.6077   87.20   &lt;2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3803 on 92854 degrees of freedom
Multiple R-squared:  0.1706,    Adjusted R-squared:  0.1706 
F-statistic:  9551 on 2 and 92854 DF,  p-value: &lt; 2.2e-16
</code></pre></div></div>

<p>1.4 Regress income on the log of age and education level.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Call:
lm(formula = income ~ log_age + education_level, data = korea)

Residuals:
    Min      1Q  Median      3Q     Max 
-237212   -1459    -440     752  462701 

Coefficients:
                Estimate Std. Error t value Pr(&gt;|t|)    
(Intercept)      3180.67     240.01   13.25   &lt;2e-16 ***
log_age          -948.16      52.33  -18.12   &lt;2e-16 ***
education_level   903.97       9.53   94.85   &lt;2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3813 on 92854 degrees of freedom
Multiple R-squared:  0.1662,    Adjusted R-squared:  0.1662 
F-statistic:  9257 on 2 and 92854 DF,  p-value: &lt; 2.2e-16
</code></pre></div></div>

<ol>
  <li>1.5 Regress income on gender, the log of age, and education level.
Present only the coefficients.</li>
</ol>

<p>Tip: explore the <code class="language-plaintext highlighter-rouge">lm</code> function in the Help tab.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    (Intercept)         log_age education_level          gender 
      5307.3048       -934.4571        766.9034      -1206.0092 
</code></pre></div></div>
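<p>One way to print only the coefficients, as the tip suggests (a sketch; the names <code class="language-plaintext highlighter-rouge">log_age</code>, <code class="language-plaintext highlighter-rouge">education_level</code>, and <code class="language-plaintext highlighter-rouge">gender</code> are taken from the output above):</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code>fit &lt;- lm(income ~ log_age + education_level + gender, data = korea)
coef(fit)  # named coefficient vector, without the full summary table
</code></pre></div></div>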

<h1 id="exercise-2-t-test">Exercise 2: T-test</h1>

<ol>
  <li>2.1 Get the average number of family members for each of the
groups: 1) has a religion, 2) does not have a religion, 9) unknown.</li>
</ol>


<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>       1        2        9 
2.409554 2.559758 3.129630 
</code></pre></div></div>
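<p>A hedged sketch of how these group means can be computed (assuming <code class="language-plaintext highlighter-rouge">family_member</code> and <code class="language-plaintext highlighter-rouge">religion</code> columns in <code class="language-plaintext highlighter-rouge">korea</code>):</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code># mean family size within each religion code (1, 2, 9)
tapply(korea$family_member, korea$religion, mean)
</code></pre></div></div>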

<ol>
  <li>2.2 Perform a t-test to determine whether the means are statistically
significantly different from each other.</li>
</ol>

<p>Tip: you will have to remove the observations where religion = 9 beforehand.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    Welch Two Sample t-test

data:  family_member by religion
t = -17.737, df = 92793, p-value &lt; 2.2e-16
alternative hypothesis: true difference in means between group 1 and group 2 is not equal to 0
95 percent confidence interval:
 -0.1668026 -0.1336061
sample estimates:
mean in group 1 mean in group 2 
       2.409554        2.559758 
</code></pre></div></div>
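<p>The test above can be reproduced along these lines (a sketch; observations with religion = 9 are dropped first, as the tip notes):</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code>korea_rel &lt;- subset(korea, religion != 9)          # keep groups 1 and 2 only
t.test(family_member ~ religion, data = korea_rel)  # Welch two-sample t-test
</code></pre></div></div>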

<h1 id="exercise-3-visualization-a-regression">Exercise 3: Visualizing a Regression</h1>

<ol>
  <li>3.1 Using ggplot, visualize the regression done in 1.1</li>
</ol>


<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>`geom_smooth()` using formula 'y ~ x'
</code></pre></div></div>

<p><img src="https://github.com/Wario84/idsc_mgs/raw/master/assets/imgs/tutorial04/UNNAME_4.PNG?raw=true" alt="" /></p>
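<p>A plot like the one above can be sketched with ggplot2 (the variables of exercise 1.1 are not shown in this excerpt, so income against age is used here purely for illustration):</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code>library(ggplot2)

ggplot(korea, aes(x = age, y = income)) +
  geom_point() +
  geom_smooth(method = "lm")  # emits the `y ~ x` message shown above
</code></pre></div></div>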

<ol>
  <li>3.2 Visualize the regression done in 1.2.</li>
</ol>


<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>`geom_smooth()` using formula 'y ~ x'
</code></pre></div></div>

<p><img src="https://github.com/Wario84/idsc_mgs/raw/master/assets/imgs/tutorial04/UNNAME_1.PNG?raw=true" alt="" /></p>

<p>PS: Even though the correlation is significant, it is not that clear when
looking at the graphs (too many outliers).</p>

<ol>
  <li>3.3 Repeat 3.2, but only show observations with an income of up to
20,000.</li>
</ol>


<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>`geom_smooth()` using formula 'y ~ x'

Warning: Removed 505 rows containing non-finite values (stat_smooth).

Warning: Removed 505 rows containing missing values (geom_point).
</code></pre></div></div>

<p><img src="https://github.com/Wario84/idsc_mgs/raw/master/assets/imgs/tutorial04/UNNAME_2.PNG?raw=true" alt="" /></p>
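<p>One way to cap the plot at incomes up to 20,000 (a sketch; using an axis limit drops the out-of-range rows, which would explain the “Removed 505 rows” warnings above):</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code>library(ggplot2)

ggplot(korea, aes(x = age, y = income)) +
  geom_point() +
  geom_smooth(method = "lm") +
  ylim(0, 20000)  # observations above 20,000 are removed with a warning
</code></pre></div></div>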

<h1 id="exercise-4">Exercise 4</h1>

<ul>
  <li>We will use another dataset in this exercise: data on house prices
in London.</li>
</ul>

<ol>
  <li>4.1 Regress the price on area (in sq ft). Interpret the
coefficients (i.e., their values and statistical significance).</li>
</ol>


<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Call:
lm(formula = Price ~ Area.in.sq.ft, data = london_house)

Residuals:
     Min       1Q   Median       3Q      Max 
-8755213  -503561  -167061   129088 33546963 

Coefficients:
               Estimate Std. Error t value Pr(&gt;|t|)    
(Intercept)   -36674.05   45936.19  -0.798    0.425    
Area.in.sq.ft   1109.68      20.98  52.897   &lt;2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1688000 on 3478 degrees of freedom
Multiple R-squared:  0.4458,    Adjusted R-squared:  0.4457 
F-statistic:  2798 on 1 and 3478 DF,  p-value: &lt; 2.2e-16
</code></pre></div></div>

<ol>
  <li>4.2 Regress the price on the log of area (in sq ft). Interpret the
coefficients.</li>
</ol>


<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Call:
lm(formula = Price ~ log_area, data = london_house)

Residuals:
     Min       1Q   Median       3Q      Max 
-3447406  -834395  -205682   438856 34869633 

Coefficients:
             Estimate Std. Error t value Pr(&gt;|t|)    
(Intercept) -13584801     344947  -39.38   &lt;2e-16 ***
log_area      2138504      47561   44.96   &lt;2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1803000 on 3478 degrees of freedom
Multiple R-squared:  0.3676,    Adjusted R-squared:  0.3674 
F-statistic:  2022 on 1 and 3478 DF,  p-value: &lt; 2.2e-16
</code></pre></div></div>
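<p>The <code class="language-plaintext highlighter-rouge">log_area</code> regressor can be created and used as follows (a sketch, assuming the <code class="language-plaintext highlighter-rouge">Area.in.sq.ft</code> column of <code class="language-plaintext highlighter-rouge">london_house</code> seen in the previous call):</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code>london_house$log_area &lt;- log(london_house$Area.in.sq.ft)  # natural log
summary(lm(Price ~ log_area, data = london_house))
</code></pre></div></div>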

<ol>
  <li>4.3 Display the regression fit done in 4.1 graphically. Looking at
the plot, do you think the area is enough to predict the value of a
house?</li>
</ol>


<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>`geom_smooth()` using formula 'y ~ x'
</code></pre></div></div>

<p>!<img src="https://github.com/Wario84/idsc_mgs/raw/master/assets/imgs/tutorial04/UNNAME_3.PNG?raw=true" alt="" /><!-- --></p>

<ol>
  <li>4.4 Regress the price on the log of area again, but now control for
the city/county and the number of bedrooms. Why would we want to control
for different counties? Does the coefficient on the number of bedrooms
have the expected sign?</li>
</ol>


<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Call:
lm(formula = Price ~ log_area + City.County + No..of.Bedrooms, 
    data = london_house)

Residuals:
     Min       1Q   Median       3Q      Max 
-3539955  -784231  -191495   446677 33697553 

Coefficients:
                                     Estimate Std. Error t value Pr(&gt;|t|)    
(Intercept)                         -23734985    1789948 -13.260   &lt;2e-16 ***
log_area                              3673635      95113  38.624   &lt;2e-16 ***
City.County27 Carlton Drive           1838298    2384893   0.771    0.441    
City.County311 Goldhawk Road           251035    2384362   0.105    0.916    
City.County4 Circus Road West          637541    2384422   0.267    0.789    
City.County52 Holloway Road           1594405    2384641   0.669    0.504    
City.County6 Deal Street               600116    2384423   0.252    0.801    
City.County82-88 Fulham High Street   1353683    2384563   0.568    0.570    
City.CountyBattersea                   833206    2065216   0.403    0.687    
City.CountyBlackheath                  672067    2384912   0.282    0.778    
City.CountyBushey                    -1137737    2384441  -0.477    0.633    
City.CountyChelsea                    1106813    1946854   0.569    0.570    
City.CountyChessington                 130712    2384396   0.055    0.956    
City.CountyCity Of London             1492139    2064974   0.723    0.470    
City.CountyClapton                    1539011    2384625   0.645    0.519    
City.CountyClerkenwell                1616168    2384598   0.678    0.498    
City.CountyDe Beauvoir                 209057    2384847   0.088    0.930    
City.CountyDeptford                   1293718    2384780   0.542    0.588    
City.CountyDowns Road                 -191611    2384341  -0.080    0.936    
City.CountyE5 8DE                     -204646    2064925  -0.099    0.921    
City.CountyEaling                     -211849    2384840  -0.089    0.929    
City.CountyEssex                      -250797    1700106  -0.148    0.883    
City.CountyFitzrovia                  1869415    2384676   0.784    0.433    
City.CountyFulham                      592614    1847062   0.321    0.748    
City.CountyFulham High Street         1732175    2384854   0.726    0.468    
City.CountyGreenford                  1032605    2385692   0.433    0.665    
City.CountyHertfordshire              -478510    1777923  -0.269    0.788    
City.CountyHolland Park               1217343    2384485   0.511    0.610    
City.CountyHornchurch                  454640    2386091   0.191    0.849    
City.CountyKensington                 2114679    2384628   0.887    0.375    
City.CountyKent                      -2186168    2385234  -0.917    0.359    
City.CountyLambourne End              -992796    2385091  -0.416    0.677    
City.CountyLillie Square              2322928    2384467   0.974    0.330    
City.CountyLittle Venice              1233507    2384533   0.517    0.605    
City.CountyLondon                     1194884    1686516   0.708    0.479    
City.CountyLondon1500                   46965    2384463   0.020    0.984    
City.CountyMarylebone                 2366145    1885108   1.255    0.210    
City.CountyMiddlesex                  -215738    1697289  -0.127    0.899    
City.CountyMiddx                     -1404958    2385323  -0.589    0.556    
City.CountyN1 6FU                     1722555    2385430   0.722    0.470    
City.CountyN7 6QX                      998057    1802625   0.554    0.580    
City.CountyNorthwood                   231127    2065368   0.112    0.911    
City.CountyOxshott                   -1871493    2386503  -0.784    0.433    
City.CountyQueens Park                 648622    2384413   0.272    0.786    
City.CountyRichmond                   -201387    2065328  -0.098    0.922    
City.CountyRichmond Hill               397874    2384575   0.167    0.867    
City.CountyRomford                   -1079758    2385756  -0.453    0.651    
City.CountySpitalfields               -982532    2384692  -0.412    0.680    
City.CountySurrey                     -307927    1689877  -0.182    0.855    
City.CountySurrey Quays               1698776    2384685   0.712    0.476    
City.CountyThames Ditton              -353279    2384799  -0.148    0.882    
City.CountyThe Metal Works             748814    2384386   0.314    0.754    
City.CountyThurleigh Road              208120    1803097   0.115    0.908    
City.CountyTwickenham                  191162    1755395   0.109    0.913    
City.CountyWandsworth                   34922    2065234   0.017    0.987    
City.CountyWatford                    -759393    1886148  -0.403    0.687    
City.CountyWimbledon                  -419106    2384650  -0.176    0.860    
City.CountyWornington Road            1240295    1847103   0.671    0.502    
No..of.Bedrooms                       -625580      40088 -15.605   &lt;2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1686000 on 3421 degrees of freedom
Multiple R-squared:  0.4563,    Adjusted R-squared:  0.447 
F-statistic: 49.49 on 58 and 3421 DF,  p-value: &lt; 2.2e-16
</code></pre></div></div>

<ol>
  <li>4.5 Regress the price on area and the square of area. Do you think
the relationship between area and price is linear or
non-linear?</li>
</ol>


<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Call:
lm(formula = Price ~ area2 + Area.in.sq.ft, data = london_house)

Residuals:
     Min       1Q   Median       3Q      Max 
-8162658  -516343  -143281   160531 33506243 

Coefficients:
                Estimate Std. Error t value Pr(&gt;|t|)    
(Intercept)   -1.440e+05  6.306e+04  -2.283   0.0225 *  
area2         -1.286e-02  5.182e-03  -2.481   0.0131 *  
Area.in.sq.ft  1.208e+03  4.494e+01  26.888   &lt;2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1687000 on 3477 degrees of freedom
Multiple R-squared:  0.4468,    Adjusted R-squared:  0.4465 
F-statistic:  1404 on 2 and 3477 DF,  p-value: &lt; 2.2e-16
</code></pre></div></div>
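<p>A sketch of how the quadratic term can be built (matching the <code class="language-plaintext highlighter-rouge">area2</code> name in the call above):</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code>london_house$area2 &lt;- london_house$Area.in.sq.ft^2  # squared area term
summary(lm(Price ~ area2 + Area.in.sq.ft, data = london_house))
</code></pre></div></div>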

<h1 id="exercise-5">Exercise 5</h1>

<ol>
  <li>5.1 Create a dummy variable that is equal to 1 if the number of
bathrooms is greater than 2 and 0 otherwise. Regress the price on
this dummy. Interpret the results.</li>
</ol>


<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Call:
lm(formula = Price ~ d_bathroom, data = london_house)

Residuals:
     Min       1Q   Median       3Q      Max 
-2168411  -943411  -344676   130324 37206589 

Coefficients:
            Estimate Std. Error t value Pr(&gt;|t|)    
(Intercept)   969676      54943   17.65   &lt;2e-16 ***
d_bathroom   1573735      72877   21.59   &lt;2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2129000 on 3478 degrees of freedom
Multiple R-squared:  0.1182,    Adjusted R-squared:  0.118 
F-statistic: 466.3 on 1 and 3478 DF,  p-value: &lt; 2.2e-16
</code></pre></div></div>
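<p>The dummy can be created along these lines (a sketch; the bathroom column name <code class="language-plaintext highlighter-rouge">No..of.Bathrooms</code> is an assumption, by analogy with <code class="language-plaintext highlighter-rouge">No..of.Bedrooms</code> above):</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code># No..of.Bathrooms is a hypothetical column name
london_house$d_bathroom &lt;- as.numeric(london_house$No..of.Bathrooms &gt; 2)
summary(lm(Price ~ d_bathroom, data = london_house))
</code></pre></div></div>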

<ol>
  <li>5.2 Perform the same regression done in 5.1, but now include the
area as an additional independent variable. What happens with the
coefficient of the dummy? Interpret it. Why do you think this
happens?</li>
</ol>


<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Call:
lm(formula = Price ~ d_bathroom + Area.in.sq.ft, data = london_house)

Residuals:
     Min       1Q   Median       3Q      Max 
-9001985  -477709  -189080   107772 33486687 

Coefficients:
                Estimate Std. Error t value Pr(&gt;|t|)    
(Intercept)      1445.92   48460.12   0.030   0.9762    
d_bathroom    -170118.65   69321.99  -2.454   0.0142 *  
Area.in.sq.ft    1143.87      25.17  45.443   &lt;2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1687000 on 3477 degrees of freedom
Multiple R-squared:  0.4468,    Adjusted R-squared:  0.4465 
F-statistic:  1404 on 2 and 3477 DF,  p-value: &lt; 2.2e-16
</code></pre></div></div>

<ol>
  <li>5.3 What is omitted variable bias? Can you give two examples of
variables that are not in the data but could influence
the price of a house?</li>
</ol>]]></content><author><name>Diogo Leitao Requena</name></author><summary type="html"><![CDATA[About the Data]]></summary></entry><entry><title type="html">Tutorial 5: Algorithms.</title><link href="https://wario84.github.io/idsc_mgs/2022/03/01/w5_1_Tutorial_05.html" rel="alternate" type="text/html" title="Tutorial 5: Algorithms." /><published>2022-03-01T00:00:00+00:00</published><updated>2022-03-01T00:00:00+00:00</updated><id>https://wario84.github.io/idsc_mgs/2022/03/01/w5_1_Tutorial_05</id><content type="html" xml:base="https://wario84.github.io/idsc_mgs/2022/03/01/w5_1_Tutorial_05.html"><![CDATA[<h1 id="about-the-data">About the Data</h1>

<h1 id="exercise-1-loops">Exercise 1: Loops</h1>

<p>1.1 Create a loop that iterates over the numbers 1 to 12 and prints the
square of each value.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[1] 1
[1] 4
[1] 9
[1] 16
[1] 25
[1] 36
[1] 49
[1] 64
[1] 81
[1] 100
[1] 121
[1] 144
</code></pre></div></div>
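<p>A minimal loop that produces the output above:</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code>for (i in 1:12) {
  print(i^2)  # square of each value from 1 to 12
}
</code></pre></div></div>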

<ol>
  <li>1.2 Create a loop that iterates over the numbers 1 to 100 and only
prints even numbers</li>
</ol>


<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[1] 2
[1] 4
[1] 6
[1] 8
[1] 10
[1] 12
[1] 14
[1] 16
[1] 18
[1] 20
[1] 22
[1] 24
[1] 26
[1] 28
[1] 30
[1] 32
[1] 34
[1] 36
[1] 38
[1] 40
[1] 42
[1] 44
[1] 46
[1] 48
[1] 50
[1] 52
[1] 54
[1] 56
[1] 58
[1] 60
[1] 62
[1] 64
[1] 66
[1] 68
[1] 70
[1] 72
[1] 74
[1] 76
[1] 78
[1] 80
[1] 82
[1] 84
[1] 86
[1] 88
[1] 90
[1] 92
[1] 94
[1] 96
[1] 98
[1] 100
</code></pre></div></div>
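<p>One possible loop for the even numbers, using the modulo operator to test divisibility by 2:</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code>for (i in 1:100) {
  if (i %% 2 == 0) {  # remainder zero means the number is even
    print(i)
  }
}
</code></pre></div></div>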

<ol>
  <li>1.3 Create a loop that adds 1/2 to each element of a vector where
the first element is 2. Stop when the loop reaches 20</li>
</ol>


<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> [1]  2.0  2.5  3.0  3.5  4.0  4.5  5.0  5.5  6.0  6.5  7.0  7.5  8.0  8.5  9.0
[16]  9.5 10.0 10.5 11.0 11.5 12.0 12.5 13.0 13.5 14.0 14.5 15.0 15.5 16.0 16.5
[31] 17.0 17.5 18.0 18.5 19.0 19.5 20.0
</code></pre></div></div>
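<p>A sketch of the growing-vector loop (the last element is read with <code class="language-plaintext highlighter-rouge">v[length(v)]</code>):</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code>v &lt;- 2
while (v[length(v)] &lt; 20) {
  v &lt;- c(v, v[length(v)] + 0.5)  # append last element plus 1/2
}
v
</code></pre></div></div>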

<ol>
  <li>1.4 Create a while loop that starts with the value 0.5 and adds one
fourth of the current value on each iteration. Repeat this loop while
the value is less than 40.</li>
</ol>


<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[1] 0.5
[1] 0.625
[1] 0.78125
[1] 0.9765625
[1] 1.220703
[1] 1.525879
[1] 1.907349
[1] 2.384186
[1] 2.980232
[1] 3.72529
[1] 4.656613
[1] 5.820766
[1] 7.275958
[1] 9.094947
[1] 11.36868
[1] 14.21085
[1] 17.76357
[1] 22.20446
[1] 27.75558
[1] 34.69447
</code></pre></div></div>
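<p>A sketch of the while loop, adding one fourth of the current value each time:</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code>x &lt;- 0.5
while (x &lt; 40) {
  print(x)
  x &lt;- x + x / 4  # grow by one fourth of the current value
}
</code></pre></div></div>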

<h2 id="exercise-2-applying-loops-to-a-dataset">Exercise 2: Applying Loops to a Dataset</h2>

<ul>
  <li>In this exercise, we will use the College dataset</li>
  <li>All questions need to be answered using loops</li>
</ul>

<ol>
  <li>2.1 Use a loop to compute the variance of every column in
College.csv and print the results in a vector. If the variable is not
numeric (or integer), print the string “no variance”.</li>
</ol>


<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> [1] "no variance"      "no variance"      "14978459.5301251" "6007959.69879526"
 [5] "863368.392309836" "311.182455651528" "392.229215592618" "23526579.3264538"
 [9] "2317798.85145418" "16184661.6314367" "1202743.02797569" "27259.779945999" 
[13] "458425.753267258" "266.608635513275" "216.747840624129" "15.6685278761825"
[17] "153.556744152105" "27266865.6394771" "295.073717310831" "no variance"     
[21] "no variance"     
</code></pre></div></div>
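<p>One way to write the variance loop (a sketch; <code class="language-plaintext highlighter-rouge">college</code> is assumed to be the data frame read from College.csv, and storing the results in a character vector coerces the variances to strings, as in the output above):</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code>college &lt;- read.csv("College.csv")
res &lt;- character(ncol(college))
for (j in seq_along(college)) {
  x &lt;- college[[j]]
  # is.numeric() is TRUE for both double and integer columns
  res[j] &lt;- if (is.numeric(x)) var(x) else "no variance"
}
res
</code></pre></div></div>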

<ol>
  <li>2.2 Compute the number of unique values in each column of
College.csv</li>
</ol>


<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> [1] 777   2 711 693 581  82  89 714 566 640 553 122 294  78  65 173  61 744  81
[20]   2   2
</code></pre></div></div>
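<p>The unique-value counts can be obtained with a similar loop (a sketch):</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code>n_unique &lt;- integer(ncol(college))
for (j in seq_along(college)) {
  n_unique[j] &lt;- length(unique(college[[j]]))  # distinct values per column
}
n_unique
</code></pre></div></div>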

<ol>
  <li>2.3 Generate (using a loop) 20 random normally distributed vectors
(of length 100 each) with mean 100 and sd 10</li>
</ol>

<p>Save the mean minus the median of each randomly generated vector inside
a new vector (Use set.seed(28))</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> [1]  1.75261330  1.80825286 -1.29377663  1.22513780 -1.04835404  0.07878913
 [7] -1.82175663 -0.60195083 -1.19990087 -1.19068782  2.04934384 -1.34000940
[13] -1.65250370  0.55954496 -1.16188787 -0.84151065  1.07363336 -2.32332294
[19]  1.11157950  0.04375424
</code></pre></div></div>
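<p>A sketch of the simulation (<code class="language-plaintext highlighter-rouge">set.seed(28)</code> makes the draws reproducible, as the prompt requires):</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code>set.seed(28)
diffs &lt;- numeric(20)
for (i in 1:20) {
  x &lt;- rnorm(100, mean = 100, sd = 10)  # one random vector per iteration
  diffs[i] &lt;- mean(x) - median(x)
}
diffs
</code></pre></div></div>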

<ol>
  <li>2.4 Using a for loop, create an algorithm that returns the lowest
outlier for each numeric (or integer) variable. If the variable does
not have an outlier, it should return the string “No outliers”. If
the variable is not numeric (or integer), print the string “not
numeric/integer”.</li>
</ol>


<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> [1] "not numeric/integer" "not numeric/integer" "8000"               
 [4] "5200"                "1902"                "66"                 
 [7] "No outliers"         "8528"                "2281"               
[10] "21700"               "7262"                "96"                 
[13] "3000"                "8"                   "24"                 
[16] "2.5"                 "60"                  "17007"              
[19] "10"                  "not numeric/integer" "not numeric/integer"
</code></pre></div></div>
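<p>One possible implementation (a sketch; the usual 1.5 times the IQR boxplot rule is assumed here as the definition of an outlier):</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code>out &lt;- character(ncol(college))
for (j in seq_along(college)) {
  x &lt;- college[[j]]
  if (!is.numeric(x)) {
    out[j] &lt;- "not numeric/integer"
    next
  }
  # boxplot rule: outliers lie beyond Q1 - 1.5*IQR or Q3 + 1.5*IQR
  lims &lt;- quantile(x, c(0.25, 0.75)) + c(-1.5, 1.5) * IQR(x)
  extreme &lt;- x[x &lt; lims[1] | x &gt; lims[2]]
  out[j] &lt;- if (length(extreme) == 0) "No outliers" else min(extreme)
}
out
</code></pre></div></div>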

<h2 id="exercise-3-generate-the-following-series-using-loops-print-100-iterations-of-each">Exercise 3. Generate the Following Series (using loops), printing 100 iterations of each:</h2>

<ol>
  <li>3.1 Harmonic Series: \(\frac{1}{1} + \frac{1}{2} + \frac{1}{3} + ... \rightarrow \infty\)</li>
</ol>


<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>  [1] 0.000000 1.000000 1.500000 1.833333 2.083333 2.283333 2.450000 2.592857
  [9] 2.717857 2.828968 2.928968 3.019877 3.103211 3.180134 3.251562 3.318229
 [17] 3.380729 3.439553 3.495108 3.547740 3.597740 3.645359 3.690813 3.734292
 [25] 3.775958 3.815958 3.854420 3.891457 3.927171 3.961654 3.994987 4.027245
 [33] 4.058495 4.088798 4.118210 4.146781 4.174559 4.201586 4.227902 4.253543
 [41] 4.278543 4.302933 4.326743 4.349999 4.372726 4.394948 4.416687 4.437964
 [49] 4.458797 4.479205 4.499205 4.518813 4.538044 4.556912 4.575430 4.593612
 [57] 4.611469 4.629013 4.646255 4.663204 4.679870 4.696264 4.712393 4.728266
 [65] 4.743891 4.759276 4.774427 4.789352 4.804058 4.818551 4.832837 4.846921
 [73] 4.860810 4.874509 4.888022 4.901356 4.914514 4.927501 4.940321 4.952979
 [81] 4.965479 4.977825 4.990020 5.002068 5.013973 5.025738 5.037366 5.048860
 [89] 5.060224 5.071459 5.082571 5.093560 5.104429 5.115182 5.125820 5.136346
 [97] 5.146763 5.157072 5.167277 5.177378 5.187378
</code></pre></div></div>
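<p>A sketch of the harmonic-series loop (the printed vector starts at 0 because the running total is stored before the first term is added):</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code>s &lt;- 0
series &lt;- s          # keep the initial 0 as the first element
for (n in 1:100) {
  s &lt;- s + 1 / n     # add the next reciprocal
  series &lt;- c(series, s)
}
series
</code></pre></div></div>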

<ol>
  <li>3.2 Sum of Reciprocals of Square Numbers: \(\frac{1}{1} + \frac{1}{4} + \frac{1}{9} + ... = \frac{\pi^2}{6}\)</li>
</ol>


<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>  [1] 0.000000 1.000000 1.250000 1.361111 1.423611 1.463611 1.491389 1.511797
  [9] 1.527422 1.539768 1.549768 1.558032 1.564977 1.570894 1.575996 1.580440
 [17] 1.584347 1.587807 1.590893 1.593663 1.596163 1.598431 1.600497 1.602387
 [25] 1.604123 1.605723 1.607203 1.608574 1.609850 1.611039 1.612150 1.613191
 [33] 1.614167 1.615086 1.615951 1.616767 1.617539 1.618269 1.618962 1.619619
 [41] 1.620244 1.620839 1.621406 1.621947 1.622463 1.622957 1.623430 1.623882
 [49] 1.624316 1.624733 1.625133 1.625517 1.625887 1.626243 1.626586 1.626917
 [57] 1.627235 1.627543 1.627840 1.628128 1.628406 1.628674 1.628934 1.629186
 [65] 1.629431 1.629667 1.629897 1.630120 1.630336 1.630546 1.630750 1.630948
 [73] 1.631141 1.631329 1.631511 1.631689 1.631862 1.632031 1.632195 1.632356
 [81] 1.632512 1.632664 1.632813 1.632958 1.633100 1.633238 1.633374 1.633506
 [89] 1.633635 1.633761 1.633884 1.634005 1.634123 1.634239 1.634352 1.634463
 [97] 1.634571 1.634678 1.634782 1.634884 1.634984
</code></pre></div></div>

<ol>
  <li>3.3 Sum of Reciprocals of the powers of any \(n &gt; 1\):
\(\frac{1}{1} + \frac{1}{n} + \frac{1}{n^2} + \frac{1}{n^3} + ... = \frac{n}{n-1}\)</li>
</ol>

<p>Do it once for n = 2 and again for n = 10.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>  [1] 1.000000 1.500000 1.750000 1.875000 1.937500 1.968750 1.984375 1.992188
  [9] 1.996094 1.998047 1.999023 1.999512 1.999756 1.999878 1.999939 1.999969
 [17] 1.999985 1.999992 1.999996 1.999998 1.999999 2.000000 2.000000 2.000000
 [25] 2.000000 2.000000 2.000000 2.000000 2.000000 2.000000 2.000000 2.000000
 [33] 2.000000 2.000000 2.000000 2.000000 2.000000 2.000000 2.000000 2.000000
 [41] 2.000000 2.000000 2.000000 2.000000 2.000000 2.000000 2.000000 2.000000
 [49] 2.000000 2.000000 2.000000 2.000000 2.000000 2.000000 2.000000 2.000000
 [57] 2.000000 2.000000 2.000000 2.000000 2.000000 2.000000 2.000000 2.000000
 [65] 2.000000 2.000000 2.000000 2.000000 2.000000 2.000000 2.000000 2.000000
 [73] 2.000000 2.000000 2.000000 2.000000 2.000000 2.000000 2.000000 2.000000
 [81] 2.000000 2.000000 2.000000 2.000000 2.000000 2.000000 2.000000 2.000000
 [89] 2.000000 2.000000 2.000000 2.000000 2.000000 2.000000 2.000000 2.000000
 [97] 2.000000 2.000000 2.000000 2.000000 2.000000

  [1] 1.000000 1.100000 1.110000 1.111000 1.111100 1.111110 1.111111 1.111111
  [9] 1.111111 1.111111 1.111111 1.111111 1.111111 1.111111 1.111111 1.111111
 [17] 1.111111 1.111111 1.111111 1.111111 1.111111 1.111111 1.111111 1.111111
 [25] 1.111111 1.111111 1.111111 1.111111 1.111111 1.111111 1.111111 1.111111
 [33] 1.111111 1.111111 1.111111 1.111111 1.111111 1.111111 1.111111 1.111111
 [41] 1.111111 1.111111 1.111111 1.111111 1.111111 1.111111 1.111111 1.111111
 [49] 1.111111 1.111111 1.111111 1.111111 1.111111 1.111111 1.111111 1.111111
 [57] 1.111111 1.111111 1.111111 1.111111 1.111111 1.111111 1.111111 1.111111
 [65] 1.111111 1.111111 1.111111 1.111111 1.111111 1.111111 1.111111 1.111111
 [73] 1.111111 1.111111 1.111111 1.111111 1.111111 1.111111 1.111111 1.111111
 [81] 1.111111 1.111111 1.111111 1.111111 1.111111 1.111111 1.111111 1.111111
 [89] 1.111111 1.111111 1.111111 1.111111 1.111111 1.111111 1.111111 1.111111
 [97] 1.111111 1.111111 1.111111 1.111111 1.111111
</code></pre></div></div>
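<p>Both runs can come from one small helper (a sketch; <code class="language-plaintext highlighter-rouge">geom_partial</code> is a hypothetical name):</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code>geom_partial &lt;- function(n, iter = 100) {
  s &lt;- 0
  out &lt;- numeric(iter + 1)
  for (k in 0:iter) {
    s &lt;- s + 1 / n^k   # k = 0 gives the leading 1
    out[k + 1] &lt;- s
  }
  out
}
geom_partial(2)   # converges to 2/(2 - 1) = 2
geom_partial(10)  # converges to 10/(10 - 1) = 1.111...
</code></pre></div></div>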

<ol>
  <li>3.4 Grandi’s Series:
\(1 - 1 + 1 - 1 + 1 - ... = \sum_{n=0}^{\infty}(-1)^n\)</li>
</ol>


<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>  [1] 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1
 [38] 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0
 [75] 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1
</code></pre></div></div>
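<p>A sketch of the Grandi partial sums, which alternate between 1 and 0 and never converge:</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code>g &lt;- numeric(0)
s &lt;- 0
for (n in 0:100) {
  s &lt;- s + (-1)^n  # alternating +1 / -1 terms
  g &lt;- c(g, s)
}
g
</code></pre></div></div>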

<h2 id="exercise-4-theorethical-questions">Exercise 4. Theoretical Questions</h2>

<ol>
  <li>
    <p>In which recursion problems is it better to use a <code class="language-plaintext highlighter-rouge">while</code> rather than a
<code class="language-plaintext highlighter-rouge">for</code> loop?</p>
  </li>
  <li>
    <p>What is an algorithm, and what makes it different from a
mathematical operation?</p>
  </li>
  <li>
    <p>In which cases is it worth investing time in building your own
algorithms?</p>
  </li>
</ol>]]></content><author><name>Diogo Leitao Requena &amp; Mario H. Gonzalez-Sauri</name></author><summary type="html"><![CDATA[About the Data]]></summary></entry></feed>