using computer simulation. Based on examples from the infer package. Code for Quiz 13.
Load the R packages we will use.
Replace all the instances of ???. These are answers on your moodle quiz.
Run all the individual code chunks to make sure the answers in this file correspond with your quiz answers
After you check all your code chunks run then you can knit it. It won’t knit until the ??? are replaced
Save a plot to be your preview plot
set.seed(123)
hr_2_tidy.csv
is the name of your data subset
hr
hr <- read_csv("https://estanny.com/static/week13/data/hr_2_tidy.csv",
col_types = "fddfff")
use skim to summarize the data in hr
skim(hr)
Name | hr |
Number of rows | 500 |
Number of columns | 6 |
_______________________ | |
Column type frequency: | |
factor | 4 |
numeric | 2 |
________________________ | |
Group variables | None |
Variable type: factor
skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
---|---|---|---|---|---|
gender | 0 | 1 | FALSE | 2 | mal: 256, fem: 244 |
evaluation | 0 | 1 | FALSE | 4 | bad: 154, fai: 142, goo: 108, ver: 96 |
salary | 0 | 1 | FALSE | 6 | lev: 95, lev: 94, lev: 87, lev: 85 |
status | 0 | 1 | FALSE | 3 | fir: 194, pro: 179, ok: 127 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
age | 0 | 1 | 39.86 | 11.55 | 20.3 | 29.60 | 40.2 | 50.1 | 59.9 | ▇▇▇▇▇ |
hours | 0 | 1 | 49.39 | 13.15 | 35.0 | 37.48 | 45.6 | 58.9 | 79.9 | ▇▃▂▂▂ |
The mean hours worked per week is: 49.4
specify
that hours
is the variable of interest
Response: hours (numeric)
# A tibble: 500 × 1
hours
<dbl>
1 78.1
2 35.1
3 36.9
4 38.5
5 36.1
6 78.1
7 76
8 35.6
9 35.6
10 56.8
# … with 490 more rows
hr %>%
specify(response = hours) %>%
hypothesize(null = "point", mu = 48)
Response: hours (numeric)
Null Hypothesis: point
# A tibble: 500 × 1
hours
<dbl>
1 78.1
2 35.1
3 36.9
4 38.5
5 36.1
6 78.1
7 76
8 35.6
9 35.6
10 56.8
# … with 490 more rows
generate
1000 replicates representing the null hypothesishr %>%
specify(response = hours) %>%
hypothesize(null = "point", mu = 48) %>%
generate(reps = 1000, type = "bootstrap")
Response: hours (numeric)
Null Hypothesis: point
# A tibble: 500,000 × 2
# Groups: replicate [1,000]
replicate hours
<int> <dbl>
1 1 39.7
2 1 44.3
3 1 46.8
4 1 33.7
5 1 39.6
6 1 39.5
7 1 40.5
8 1 55.8
9 1 72.6
10 1 35.7
# … with 499,990 more rows
calculate
the distribution of statistics from the generated datanull_t_distribution
null_t_distribution
null_t_distribution <- hr %>%
specify(response = age) %>%
hypothesize(null = "point", mu = 48) %>%
generate(reps = 1000, type = "bootstrap") %>%
calculate(stat = "t")
null_t_distribution
Response: age (numeric)
Null Hypothesis: point
# A tibble: 1,000 × 2
replicate stat
<int> <dbl>
1 1 0.144
2 2 -1.72
3 3 0.404
4 4 -1.11
5 5 0.00894
6 6 1.46
7 7 -0.905
8 8 -0.663
9 9 0.291
10 10 3.09
# … with 990 more rows
visualize(null_t_distribution)
calculate
the statistic from your observed dataAssign the output observed_t_statistic
Display observed_t_statistic
observed_t_statistic <- hr %>%
specify(response = hours) %>%
hypothesize(null = "point", mu = 48) %>%
calculate(stat = "t")
observed_t_statistic
Response: hours (numeric)
Null Hypothesis: point
# A tibble: 1 × 1
stat
<dbl>
1 2.37
null_t_distribution %>%
get_p_value(obs_stat = observed_t_statistic, direction = "two-sided")
# A tibble: 1 × 1
p_value
<dbl>
1 0.014
shade_p_value
on the simulated null distributionnull_t_distribution %>%
visualize() +
shade_p_value(obs_stat = observed_t_statistic, direction = "two-sided")
Is the p-value < 0.05? yes
Does your analysis support the null hypothesis that the true mean number of hours worked was 48? no
hr_2_tidy.csv
is the name of your data subset
Read it into and assign to hr_2
hr_2 <- read_csv("https://estanny.com/static/week13/data/hr_2_tidy.csv",
col_types = "fddfff")
skim
to summarize the data in hr_2
by gender
Name | Piped data |
Number of rows | 500 |
Number of columns | 6 |
_______________________ | |
Column type frequency: | |
factor | 3 |
numeric | 2 |
________________________ | |
Group variables | gender |
Variable type: factor
skim_variable | gender | n_missing | complete_rate | ordered | n_unique | top_counts |
---|---|---|---|---|---|---|
evaluation | male | 0 | 1 | FALSE | 4 | bad: 79, fai: 68, goo: 61, ver: 48 |
evaluation | female | 0 | 1 | FALSE | 4 | bad: 75, fai: 74, ver: 48, goo: 47 |
salary | male | 0 | 1 | FALSE | 6 | lev: 49, lev: 48, lev: 48, lev: 44 |
salary | female | 0 | 1 | FALSE | 6 | lev: 47, lev: 46, lev: 41, lev: 39 |
status | male | 0 | 1 | FALSE | 3 | fir: 93, pro: 90, ok: 73 |
status | female | 0 | 1 | FALSE | 3 | fir: 101, pro: 89, ok: 54 |
Variable type: numeric
skim_variable | gender | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|---|
age | male | 0 | 1 | 38.63 | 11.57 | 20.3 | 28.50 | 37.85 | 49.52 | 59.6 | ▇▇▆▆▆ |
age | female | 0 | 1 | 41.14 | 11.43 | 20.3 | 31.30 | 41.60 | 50.90 | 59.9 | ▆▅▇▇▇ |
hours | male | 0 | 1 | 49.30 | 13.24 | 35.0 | 37.35 | 46.00 | 59.23 | 79.9 | ▇▃▂▂▂ |
hours | female | 0 | 1 | 49.49 | 13.08 | 35.0 | 37.68 | 45.05 | 58.73 | 78.4 | ▇▃▃▂▂ |
Females worked an average of 49.5 hours per week Males worked an average of 49.3 hours per week
geom_boxplot
to plot distributions of hours worked by genderhr_2 %>%
ggplot(aes(x = gender, y = hours)) +
geom_boxplot()
specify
the variables of interest are hours
and gender
Response: hours (numeric)
Explanatory: gender (factor)
# A tibble: 500 × 2
hours gender
<dbl> <fct>
1 78.1 male
2 35.1 female
3 36.9 female
4 38.5 male
5 36.1 male
6 78.1 female
7 76 female
8 35.6 female
9 35.6 male
10 56.8 male
# … with 490 more rows
hypothesize
that the number of hours worked and gender are independenthr_2 %>%
specify(response = hours, explanatory = gender) %>%
hypothesize(null = "independence")
Response: hours (numeric)
Explanatory: gender (factor)
Null Hypothesis: independence
# A tibble: 500 × 2
hours gender
<dbl> <fct>
1 78.1 male
2 35.1 female
3 36.9 female
4 38.5 male
5 36.1 male
6 78.1 female
7 76 female
8 35.6 female
9 35.6 male
10 56.8 male
# … with 490 more rows
hr_2 %>%
specify(response = hours, explanatory = gender) %>%
hypothesize(null = "independence") %>%
generate(reps = 1000, type = "permute")
Response: hours (numeric)
Explanatory: gender (factor)
Null Hypothesis: independence
# A tibble: 500,000 × 3
# Groups: replicate [1,000]
hours gender replicate
<dbl> <fct> <int>
1 47.8 male 1
2 60.3 female 1
3 46.5 female 1
4 37.2 male 1
5 74.1 male 1
6 35.9 female 1
7 35.6 female 1
8 54.5 female 1
9 55.6 male 1
10 44.1 male 1
# … with 499,990 more rows
Assign the output null_distribution_2_sample_permute
Display null_distribution_2_sample_permute
null_distribution_2_sample_permute <- hr_2 %>%
specify(response = hours, explanatory = gender) %>%
hypothesize(null = "independence") %>%
generate(reps = 1000, type = "permute") %>%
calculate(stat = "t", order = c("female", "male"))
null_distribution_2_sample_permute
Response: hours (numeric)
Explanatory: gender (factor)
Null Hypothesis: independence
# A tibble: 1,000 × 2
replicate stat
<int> <dbl>
1 1 0.505
2 2 -0.650
3 3 0.279
4 4 0.435
5 5 1.73
6 6 -0.139
7 7 -2.14
8 8 0.274
9 9 0.766
10 10 1.52
# … with 990 more rows
null_t_distribution
has 1000 t-statsvisualize(null_distribution_2_sample_permute)
observed_t_2_sample_stat
observed_t_2_sample_stat
observed_t_2_sample_stat <- hr_2 %>%
specify(response = hours, explanatory = gender) %>%
calculate(stat = "t", order = c("female", "male"))
observed_t_2_sample_stat
Response: hours (numeric)
Explanatory: gender (factor)
# A tibble: 1 × 1
stat
<dbl>
1 0.160
null_t_distribution %>%
get_p_value(obs_stat = observed_t_2_sample_stat, direction = "two-sided")
# A tibble: 1 × 1
p_value
<dbl>
1 0.924
shade_p_value
on the simulated null distributionnull_t_distribution %>%
visualize() +
shade_p_value(obs_stat = observed_t_2_sample_stat, direction = "two-sided")
Is the p-value < 0.05? no
Does your analysis support the null hypothesis that the true mean number of hours worked by female and male employees was the same? yes
hr_1_tidy.csv is the name of your data subset
Read it into and assign to hr_anova
Note: col_types = “fddfff” defines the column types factor-double-double-factor-factor-factor
hr_anova <- read_csv("https://estanny.com/static/week13/data/hr_1_tidy.csv",
col_types = "fddfff")
use skim
to summarize the data in hr_anova
by status
Name | Piped data |
Number of rows | 500 |
Number of columns | 6 |
_______________________ | |
Column type frequency: | |
factor | 3 |
numeric | 2 |
________________________ | |
Group variables | status |
Variable type: factor
skim_variable | status | n_missing | complete_rate | ordered | n_unique | top_counts |
---|---|---|---|---|---|---|
gender | fired | 0 | 1 | FALSE | 2 | fem: 96, mal: 89 |
gender | ok | 0 | 1 | FALSE | 2 | fem: 77, mal: 76 |
gender | promoted | 0 | 1 | FALSE | 2 | fem: 87, mal: 75 |
evaluation | fired | 0 | 1 | FALSE | 4 | bad: 65, fai: 63, goo: 31, ver: 26 |
evaluation | ok | 0 | 1 | FALSE | 4 | bad: 69, fai: 59, goo: 15, ver: 10 |
evaluation | promoted | 0 | 1 | FALSE | 4 | ver: 63, goo: 60, fai: 20, bad: 19 |
salary | fired | 0 | 1 | FALSE | 6 | lev: 41, lev: 37, lev: 32, lev: 32 |
salary | ok | 0 | 1 | FALSE | 6 | lev: 40, lev: 37, lev: 29, lev: 23 |
salary | promoted | 0 | 1 | FALSE | 6 | lev: 37, lev: 35, lev: 29, lev: 23 |
Variable type: numeric
skim_variable | status | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|---|
age | fired | 0 | 1 | 38.64 | 11.43 | 20.2 | 28.30 | 38.30 | 47.60 | 59.6 | ▇▇▇▅▆ |
age | ok | 0 | 1 | 41.34 | 12.11 | 20.3 | 31.00 | 42.10 | 51.70 | 59.9 | ▆▆▆▆▇ |
age | promoted | 0 | 1 | 42.13 | 10.98 | 21.0 | 33.40 | 42.95 | 50.98 | 59.9 | ▆▅▆▇▇ |
hours | fired | 0 | 1 | 41.67 | 7.88 | 35.0 | 36.10 | 38.90 | 43.90 | 75.5 | ▇▂▁▁▁ |
hours | ok | 0 | 1 | 48.05 | 11.65 | 35.0 | 37.70 | 45.60 | 56.10 | 78.2 | ▇▃▃▂▁ |
hours | promoted | 0 | 1 | 59.27 | 12.90 | 35.0 | 51.12 | 60.10 | 70.15 | 79.7 | ▆▅▇▇▇ |
Use geom_boxplot
to plot distributions of hours worked by status
hr_anova %>%
ggplot(aes(x = status, y = hours)) +
geom_boxplot()
specify
the variables of interest are hours
and status
Response: hours (numeric)
Explanatory: status (factor)
# A tibble: 500 × 2
hours status
<dbl> <fct>
1 36.5 fired
2 55.8 ok
3 35 fired
4 52 promoted
5 35.1 ok
6 36.3 ok
7 40.1 promoted
8 42.7 fired
9 66.6 promoted
10 35.5 ok
# … with 490 more rows
hr_anova %>%
specify(response = hours, explanatory = status) %>%
hypothesize(null = "independence")
Response: hours (numeric)
Explanatory: status (factor)
Null Hypothesis: independence
# A tibble: 500 × 2
hours status
<dbl> <fct>
1 36.5 fired
2 55.8 ok
3 35 fired
4 52 promoted
5 35.1 ok
6 36.3 ok
7 40.1 promoted
8 42.7 fired
9 66.6 promoted
10 35.5 ok
# … with 490 more rows
hr_anova %>%
specify(response = hours, explanatory = status) %>%
hypothesize(null = "independence") %>%
generate(reps = 1000, type = "permute")
Response: hours (numeric)
Explanatory: status (factor)
Null Hypothesis: independence
# A tibble: 500,000 × 3
# Groups: replicate [1,000]
hours status replicate
<dbl> <fct> <int>
1 40.3 fired 1
2 40.3 ok 1
3 37.3 fired 1
4 50.5 promoted 1
5 35.1 ok 1
6 67.8 ok 1
7 39.3 promoted 1
8 35.7 fired 1
9 40.2 promoted 1
10 38.4 ok 1
# … with 499,990 more rows
The output has 500,000 rows
Assign the output null_distribution_anova
Display null_distribution_anova
null_distribution_anova <- hr_anova %>%
specify(response = hours, explanatory = status) %>%
hypothesize(null = "independence") %>%
generate(reps = 1000, type = "permute") %>%
calculate(stat = "F")
null_distribution_anova
Response: hours (numeric)
Explanatory: status (factor)
Null Hypothesis: independence
# A tibble: 1,000 × 2
replicate stat
<int> <dbl>
1 1 0.667
2 2 2.78
3 3 1.24
4 4 0.330
5 5 2.08
6 6 1.95
7 7 0.243
8 8 0.312
9 9 0.440
10 10 0.0281
# … with 990 more rows
visualize the simulated null distribution
visualize(null_distribution_anova)
Assign the output observed_f_sample_stat
Display observed_f_sample_stat
observed_f_sample_stat <- hr_anova %>%
specify(response = hours, explanatory = status) %>%
calculate(stat = "F")
observed_f_sample_stat
Response: hours (numeric)
Explanatory: status (factor)
# A tibble: 1 × 1
stat
<dbl>
1 115.
null_distribution_anova %>%
get_p_value(obs_stat = observed_f_sample_stat, direction = "greater")
# A tibble: 1 × 1
p_value
<dbl>
1 0
shade_p_value
on the simulated null distributionnull_t_distribution %>%
visualize() +
shade_p_value(obs_stat = observed_f_sample_stat, direction = "greater")
Save the previous plot to preview.png and add to the yaml chunk at the top
If the p-value < 0.05? yes
Does your analysis support the null hypothesis that the true means of the number of hours worked for those that were “fired”, “ok” and “promoted” were the same? no