Loading the data

omega <- read_csv(here::here("data", "omega.csv"))
glimpse(omega) # examine the data frame
## Rows: 50
## Columns: 3
## $ salary     <dbl> 81894, 69517, 68589, 74881, 65598, 76840, 78800, 70033, 635…
## $ gender     <chr> "male", "male", "male", "male", "male", "male", "male", "ma…
## $ experience <dbl> 16, 25, 15, 33, 16, 19, 32, 34, 1, 44, 7, 14, 33, 19, 24, 3…

Relationship: Salary vs. Gender?

# Summary Statistics of salary by gender
mosaic::favstats (salary ~ gender, data=omega)
##   gender   min    Q1 median    Q3   max  mean   sd  n missing
## 1 female 47033 60338  64618 70033 78800 64543 7567 26       0
## 2   male 54768 68331  74675 78568 84576 73239 7463 24       0
# Data frame with two rows (female, male) whose columns are gender, mean, SD, sample size,
# the standard error, the t-critical value, the margin of error,
# and the low/high endpoints of a 95% confidence interval

salary_gender <- omega %>% 
  group_by(gender) %>% 
  summarise (mean = mean(salary), SD = sd(salary), sample_size = n()) %>% 
  mutate(se = sqrt(SD^2/sample_size), t_value = qt(p=.05/2, df=sample_size-1, lower.tail=FALSE),
         margin_of_error = t_value*se, salary_low = mean-t_value*se, salary_high = mean+t_value*se)

salary_gender
## # A tibble: 2 × 9
##   gender   mean    SD sample_size    se t_value margin_of_error salary_low
##   <chr>   <dbl> <dbl>       <int> <dbl>   <dbl>           <dbl>      <dbl>
## 1 female 64543. 7567.          26 1484.    2.06           3056.     61486.
## 2 male   73239. 7463.          24 1523.    2.07           3151.     70088.
## # … with 1 more variable: salary_high <dbl>

The 95% confidence interval for the mean female salary runs from 61486 to 67599, while that for the mean male salary runs from 70088 to 76490. Because the two intervals do not overlap, we can reject the null hypothesis of equal means: there is a statistically significant difference between the mean salaries of women and men.
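Non-overlapping intervals are a sufficient but not a necessary sign of a significant difference, so it is worth computing the two-sample statistic directly. The sketch below does this by hand from the rounded group summaries in the `salary_gender` table, using the Welch (unequal-variance) formulas; the variable names are illustrative, and small rounding differences from any software output are to be expected.

```r
# Group summaries copied (rounded) from the salary_gender table above
mean_f <- 64543; sd_f <- 7567; n_f <- 26
mean_m <- 73239; sd_m <- 7463; n_m <- 24

# Squared standard error of each group mean
v_f <- sd_f^2 / n_f
v_m <- sd_m^2 / n_m

# Standard error of the difference in means and the Welch t-statistic
se_diff <- sqrt(v_f + v_m)
t_stat  <- (mean_f - mean_m) / se_diff

# Welch-Satterthwaite approximation to the degrees of freedom
df_welch <- (v_f + v_m)^2 / (v_f^2 / (n_f - 1) + v_m^2 / (n_m - 1))

# Two-sided p-value
p_value <- 2 * pt(-abs(t_stat), df = df_welch)

c(t = t_stat, df = df_welch, p = p_value)
```

The resulting statistic is roughly -4 on about 48 degrees of freedom, far enough from zero to agree with the reading of the confidence intervals.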

Let us run a hypothesis test, taking as the null hypothesis that the difference in mean salaries is zero, that is, that men and women earn the same on average. We run the test twice: once with t.test() and once with the simulation-based approach from the infer package.

# hypothesis testing using t.test() 
t.test(salary ~ gender, data = omega)
## 
##  Welch Two Sample t-test
## 
## data:  salary by gender
## t = -4, df = 48, p-value = 2e-04
## alternative hypothesis: true difference in means between group female and group male is not equal to 0
## 95 percent confidence interval:
##  -12973  -4420
## sample estimates:
## mean in group female   mean in group male 
##                64543                73239
# hypothesis testing using infer package
set.seed(1234)

salary_gender_boot <- omega %>% 
  # Specify the variable of interest "salary" and group by gender
  specify(salary ~ gender) %>% 
  
  # Hypothesize a null of no (or zero) difference
  hypothesize (null = "independence") %>% 
  
  # Generate 1000 simulated samples by permuting the gender labels
  generate (reps = 1000, type = "permute") %>% 
  
  # Find the difference in means (female - male) for each simulated sample
  calculate(stat = "diff in means",
            order = c("female", "male"))


# Select the low and high endpoint from the formula-calculated CIs
formula_ci <- salary_gender %>%
  select(salary_low,salary_high)

# 95% percentile interval of the difference in mean salaries under the null (permuted data)
percentile_ci <- salary_gender_boot %>% 
  get_confidence_interval(level = 0.95, type = "percentile")

percentile_ci
## # A tibble: 1 × 2
##   lower_ci upper_ci
##      <dbl>    <dbl>
## 1   -4829.    5025.
observed_difference <- salary_gender$mean[1]-salary_gender$mean[2]

visualize(salary_gender_boot) + 
  annotate("rect", xmin=-Inf, xmax=observed_difference, ymin=0, ymax=Inf, alpha=0.3, fill = "pink") +
  annotate("rect", xmin=-observed_difference, xmax=Inf, ymin=0, ymax=Inf, alpha=0.3, fill  = "pink") +
  #shade_ci(endpoints = percentile_ci,fill = "khaki")+
  labs(title='Differences in Female and Male Mean Salary in a world where there is no difference', 
       subtitle = "Observed difference marked in red",
       x = "Mean (female salary - male salary)", y = "Count")+
  geom_vline(xintercept = observed_difference, colour = "red", linetype="solid", size=1.2)+
  theme_bw()+
  NULL

salary_gender_boot %>% 
  get_pvalue(obs_stat = observed_difference, direction = "both")
## # A tibble: 1 × 1
##   p_value
##     <dbl>
## 1       0

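Using permutation, we simulate the distribution of the difference in mean salaries in a null world where salary is independent of gender. The 95% percentile interval of that null distribution runs from about -4829 to 5025 and does not contain the observed difference of about -8696. None of the 1000 permuted differences is as extreme as the observed one, so the reported p-value of 0 is best read as p < 0.001 rather than exactly zero. We therefore reject the null hypothesis: there is a significant difference in mean salary between the two genders.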
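The permutation logic behind the infer pipeline above can be sketched in base R. Because omega.csv is not reproduced here, the sketch uses simulated stand-in data drawn from normal distributions with the group means and standard deviations reported earlier. The `sim` data frame and the `diff_in_means` helper are hypothetical names, and the numbers it produces will not match the real output.

```r
set.seed(1234)

# Hypothetical stand-in data, simulated from the reported group summaries
# (NOT the real omega.csv)
sim <- data.frame(
  gender = rep(c("female", "male"), times = c(26, 24)),
  salary = c(rnorm(26, mean = 64543, sd = 7567),
             rnorm(24, mean = 73239, sd = 7463))
)

# Difference in mean salary, female minus male
diff_in_means <- function(salary, gender) {
  mean(salary[gender == "female"]) - mean(salary[gender == "male"])
}

# Observed statistic on the (simulated) data
obs <- diff_in_means(sim$salary, sim$gender)

# Null distribution: shuffle the gender labels, which breaks any link
# between gender and salary, and recompute the statistic each time
null_diffs <- replicate(1000, diff_in_means(sim$salary, sample(sim$gender)))

# Two-sided p-value: share of shuffled differences at least as extreme as obs
p_value <- mean(abs(null_diffs) >= abs(obs))
p_value
```

This uses one common definition of a two-sided permutation p-value; infer's get_pvalue() may compute it slightly differently (for instance by doubling the smaller tail), so small discrepancies are possible.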

Relationship: Experience vs Gender?

# Summary Statistics of experience by gender
favstats (experience ~ gender, data=omega)
##   gender min    Q1 median   Q3 max  mean    sd  n missing
## 1 female   0  0.25    3.0 14.0  29  7.38  8.51 26       0
## 2   male   1 15.75   19.5 31.2  44 21.12 10.92 24       0
t.test(experience ~ gender, data = omega)
## 
##  Welch Two Sample t-test
## 
## data:  experience by gender
## t = -5, df = 43, p-value = 1e-05
## alternative hypothesis: true difference in means between group female and group male is not equal to 0
## 95 percent confidence interval:
##  -19.35  -8.13
## sample estimates:
## mean in group female   mean in group male 
##                 7.38                21.12

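The t-statistic is about -5, well beyond the critical value of roughly 2 in absolute terms, and the p-value is 1e-05. The 95% confidence interval for the difference in means (-19.35 to -8.13 years) also excludes zero. There is therefore a significant difference in mean experience between the two genders.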
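The familiar 1.96 threshold is the normal approximation. With about 43 degrees of freedom, as reported by the Welch test above, the t critical value is slightly larger. A quick check, with illustrative variable names:

```r
# Two-sided 5% critical values
z_crit <- qnorm(0.975)        # normal approximation, about 1.96
t_crit <- qt(0.975, df = 43)  # t distribution with the Welch df reported above

c(z = z_crit, t = t_crit)
```

Either way, a statistic of about -5 lies far beyond the threshold, so the conclusion does not depend on which critical value is used.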

Relationship: Salary vs. Experience?

salary_exp <- omega %>% 
  ggplot(aes(x = experience, y = salary))+
  geom_point()+
  labs(title = "Relationship between salary and number of years of experience", x = "Year(s) of experience",y = "Salary")+
  theme_bw()+
  NULL

salary_exp

Correlations between the data:

omega %>% 
  select(gender, experience, salary) %>% #order variables they will appear in ggpairs()
  ggpairs(aes(colour=gender, alpha = 0.3))+
  theme_bw()

In the scatter plot, years of experience are spread more widely for men than for women. More women than men have 0 years of experience, and no woman has more than 30 years. Overall, experience and salary are positively related. The gender-experience box plot shows that men have a higher median experience than women, and the gender-salary box plot shows that men also have a higher median salary, as expected given the test results above. However, the relative gap in salary is much smaller than the relative gap in experience: men's mean salary is about 13% higher than women's, while their mean experience is nearly three times as large. Put differently, the salary gap between the genders is narrower than the experience gap. This suggests that the salary difference may be driven largely by experience rather than by gender itself. In the box plots, the interquartile ranges for experience do not overlap (no woman in the middle 50% has more experience than a man in the middle 50%), whereas the interquartile ranges for salary do overlap.
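The comparison of the two gaps can be made concrete with the group means reported by favstats above. This is simple arithmetic on the rounded summary values, with illustrative variable names:

```r
# Group means copied from the favstats output above
salary_ratio     <- 73239 / 64543   # male / female mean salary
experience_ratio <- 21.12 / 7.38    # male / female mean experience

# Roughly 1.13 for salary and 2.86 for experience
round(c(salary = salary_ratio, experience = experience_ratio), 2)
```

A ratio comparison like this is only descriptive; it does not by itself separate the effect of experience from the effect of gender.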