Scaling Disproportionate Impact (DI) Calculations for Interactive Visualizations

Vinh Nguyen

2021-08-31

Introduction

It is often desirable to visualize student success data in an interactive dashboard (e.g., Tableau or Power BI) that can disaggregate by multiple group variables to highlight equity gaps and disproportionate impact (DI). Calculating disproportionate impact on the fly in standard dashboard tools is certainly feasible, but doing so:

  1. increases development time,
  2. increases the likelihood of calculation errors, as the code has to be “re-written” for each dashboard, and
  3. is more difficult to maintain and support, especially when transitioning projects between analysts.

A suggested workflow is to:

  1. start with a student-level data set;
  2. call a single function to pre-calculate success rates and disproportionate impact across all levels of disaggregation, cohorts, and scenarios;
  3. export the pre-calculated data set;
  4. import the pre-calculated data set to the dashboard tool of choice for visualization, where every point visualized is a row from the imported data set.

Using this workflow, one could scale up DI calculations and rapidly develop dashboards with the ability to disaggregate and highlight equity gaps / disproportionate impact for many disaggregation variables, many outcomes, and many scenarios / student populations.

The DisImpact package offers the di_iterate function that allows one to accomplish step 2 in the suggested workflow.
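Steps 3 and 4 of the suggested workflow need nothing beyond base R. The sketch below is illustrative only: df_summary is a small stand-in with made-up values (in practice it would be the data frame returned in step 2), and the file name di_summary.csv is a hypothetical choice.

```r
# Stand-in for the pre-calculated summary (hypothetical values; in practice
# this would be the data frame produced in step 2).
df_summary <- data.frame(
  cohort = c(2017, 2017, 2018, 2018),
  disaggregation = 'Gender',
  group = c('Female', 'Male', 'Female', 'Male'),
  n = c(100, 120, 110, 115),
  success = c(70, 80, 75, 85),
  stringsAsFactors = FALSE
)
df_summary$pct <- df_summary$success / df_summary$n

# Step 3: export to a flat file that a dashboard tool can read.
out_file <- file.path(tempdir(), 'di_summary.csv') # hypothetical file name
write.csv(df_summary, out_file, row.names = FALSE)

# Step 4 (emulated): the dashboard tool reads the file back, one row per plotted point.
df_check <- read.csv(out_file, stringsAsFactors = FALSE)
nrow(df_check)
```

Writing to tempdir() keeps the example free of side effects; a real project would write to a location the dashboard tool can reach.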

Load DisImpact and toy data set

First, load the necessary packages.

library(DisImpact)
library(dplyr) # Ease in manipulations with data frames

Second, load a toy data set.

data(student_equity) # provided from DisImpact
dim(student_equity)
## [1] 20000    24
# head(student_equity)
A few rows from the student_equity data set.
        Ethnicity Gender Cohort Transfer Cohort_Math Math Cohort_English English
1 Native American Female   2017        0        2017    1           2017       0
2 Native American Female   2017        0        2018    1             NA      NA
3 Native American Female   2017        0        2018    1           2017       0
4 Native American   Male   2017        1        2017    1           2018       1
5 Native American   Male   2017        0        2017    1           2019       0
6 Native American   Male   2017        1        2019    1           2018       1
       Ed_Goal     College_Status Student_ID EthnicityFlag_Asian EthnicityFlag_Black
1 Deg/Transfer First-time College     100001                   0                   0
2 Deg/Transfer First-time College     100002                   0                   0
3 Deg/Transfer First-time College     100003                   0                   0
4        Other First-time College     100004                   0                   0
5 Deg/Transfer              Other     100005                   0                   0
6        Other First-time College     100006                   0                   0
  EthnicityFlag_Hispanic EthnicityFlag_NativeAmerican EthnicityFlag_PacificIslander
1                      0                            1                             0
2                      0                            1                             0
3                      0                            1                             0
4                      0                            1                             0
5                      0                            1                             0
6                      0                            1                             0
  EthnicityFlag_White EthnicityFlag_Carribean EthnicityFlag_EastAsian
1                   0                       0                       0
2                   0                       0                       0
3                   0                       0                       0
4                   0                       0                       0
5                   0                       0                       0
6                   0                       0                       0
  EthnicityFlag_SouthEastAsian EthnicityFlag_SouthWestAsianNorthAfrican
1                            0                                        0
2                            0                                        0
3                            0                                        0
4                            0                                        0
5                            0                                        0
6                            0                                        0
  EthnicityFlag_AANAPI EthnicityFlag_Unknown EthnicityFlag_TwoorMoreRaces
1                    1                     0                            0
2                    1                     0                            0
3                    1                     0                            0
4                    1                     0                            0
5                    1                     0                            0
6                    1                     0                            0

To get a description of each variable, type ?student_equity in the R console.

Execute di_iterate on a data set

Let’s illustrate the di_iterate function with some key arguments:

  • data: a data frame of unitary (student) level or summarized data.
  • success_vars: all outcome variables of interest.
  • group_vars: all variables to disaggregate by (for calculating equity gaps and disproportionate impact).
  • cohort_vars (optional): variables defining cohorts, corresponding to those in success_vars.
  • scenario_repeat_by_vars (optional): variables across whose combinations the DI calculations are repeated. Use this only when a DI analysis of the variables in group_vars is wanted for everyone in data and, separately, for each subpopulation defined by a combination of the scenario_repeat_by_vars variables. Each combination (e.g., full-time, first-time college students with an ed goal of degree/transfer) constitutes one iteration / sample for which disproportionate impact is calculated for the outcomes listed in success_vars and the disaggregation variables listed in group_vars.

To see the details of these and other arguments, type ?di_iterate in the R console.
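To make the notion of a scenario concrete, the sketch below enumerates the combinations implied by two scenario variables. It assumes that Ed_Goal takes the values 'Deg/Transfer' and 'Other', that College_Status takes 'First-time College' and 'Other' (as in student_equity), and that an '- All' level is added for each variable. The values are typed by hand here rather than read from the package.

```r
# Levels of each scenario variable, with '- All' standing for "no filter on this variable".
ed_goal <- c('- All', 'Deg/Transfer', 'Other')
college_status <- c('- All', 'First-time College', 'Other')

# Every pairing of the two is one scenario (one subpopulation to analyze).
scenarios <- expand.grid(Ed_Goal = ed_goal
                       , College_Status = college_status
                       , stringsAsFactors = FALSE
                         )
nrow(scenarios) # 3 x 3 = 9 scenarios
scenarios
```

The row with '- All' for both variables corresponds to the full data set; every other row restricts the analysis to a subpopulation.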

df_di_summary <- di_iterate(data=student_equity
                          , success_vars=c('Math', 'English', 'Transfer')
                          , group_vars=c('Ethnicity', 'Gender')
                          , cohort_vars=c('Cohort_Math', 'Cohort_English', 'Cohort')
                          , scenario_repeat_by_vars=c('Ed_Goal', 'College_Status')
                            )
## Joining, by = c("Ed_Goal", "College_Status")
## Joining, by = c("Ed_Goal", "College_Status")
## Joining, by = "College_Status"
## Joining, by = c("Ed_Goal", "College_Status")
## Joining, by = c("Ed_Goal", "College_Status")
## Joining, by = "College_Status"
## Joining, by = "Ed_Goal"
## Joining, by = "Ed_Goal"
## df_di_summary <- di_iterate(data=student_equity, success_vars=c('Math', 'English', 'Transfer'), group_vars=c('Ethnicity', 'Gender'), cohort_vars=c('Cohort', 'Cohort', 'Cohort'), scenario_repeat_by_vars=c('Ed_Goal', 'College_Status'))

## df_di_summary <- di_iterate(data=student_equity, success_vars=c('Math', 'English', 'Transfer'), group_vars=c('Ethnicity', 'Gender'), scenario_repeat_by_vars=c('Ed_Goal', 'College_Status'))

## df_di_summary_2 <- di_iterate(data=student_equity, success_vars=c('Math', 'English', 'Transfer'), group_vars=c('Ethnicity', 'Gender'), cohort_vars=c('Cohort_Math', 'Cohort_English', 'Cohort'), scenario_repeat_by_vars=c('Ed_Goal', 'College_Status'), ppg_reference_groups=c('White', 'Male'), di_80_index_reference_groups=c('White', 'Male'))

## df_di_summary <- di_iterate(data=student_equity, success_vars=c('Math', 'English', 'Transfer'), group_vars=c('Ethnicity', 'Gender'), cohort_vars=c('Cohort_Math', 'Cohort_English', 'Cohort'), scenario_repeat_by_vars=c('Ed_Goal', 'College_Status'), ppg_reference_groups=c('all but current'), di_80_index_reference_groups=c('White', 'Male'))

Explore resulting summary data set

dim(df_di_summary)
## [1] 898  27
df_di_summary %>% head %>% as.data.frame # first few rows
##        Ed_Goal     College_Status success_variable cohort_variable cohort
## 1 Deg/Transfer First-time College             Math     Cohort_Math   2017
## 2 Deg/Transfer First-time College             Math     Cohort_Math   2017
## 3 Deg/Transfer First-time College             Math     Cohort_Math   2017
## 4 Deg/Transfer First-time College             Math     Cohort_Math   2017
## 5 Deg/Transfer First-time College             Math     Cohort_Math   2017
## 6 Deg/Transfer First-time College             Math     Cohort_Math   2017
##   disaggregation           group   n success       pct ppg_reference
## 1      Ethnicity           Asian 776     692 0.8917526     0.8427699
## 2      Ethnicity           Black 235     185 0.7872340     0.8427699
## 3      Ethnicity        Hispanic 474     347 0.7320675     0.8427699
## 4      Ethnicity Multi-Ethnicity 117      99 0.8461538     0.8427699
## 5      Ethnicity Native American  30      28 0.9333333     0.8427699
## 6      Ethnicity           White 823     718 0.8724180     0.8427699
##   ppg_reference_group        moe    pct_lo    pct_hi di_indicator_ppg
## 1             overall 0.03517995 0.8565726 0.9269325                0
## 2             overall 0.06392815 0.7233059 0.8511622                0
## 3             overall 0.04501289 0.6870546 0.7770804                1
## 4             overall 0.09060103 0.7555528 0.9367549                0
## 5             overall 0.17892270 0.7544106 1.1122560                0
## 6             overall 0.03416065 0.8382573 0.9065786                0
##   success_needed_not_di_ppg success_needed_full_parity_ppg di_prop_index
## 1                         0                              0     1.0581211
## 2                         0                             14     0.9341032
## 3                        32                             53     0.8686446
## 4                         0                              0     1.0040153
## 5                         0                              0     1.1074593
## 6                         0                              0     1.0351794
##   di_indicator_prop_index success_needed_not_di_prop_index
## 1                       0                                0
## 2                       0                                0
## 3                       0                                0
## 4                       0                                0
## 5                       0                                0
## 6                       0                                0
##   success_needed_full_parity_prop_index di_80_index_reference_group di_80_index
## 1                                     0             Native American   0.9554492
## 2                                    15             Native American   0.8434650
## 3                                    66             Native American   0.7843580
## 4                                     0             Native American   0.9065934
## 5                                     0             Native American   1.0000000
## 6                                     0             Native American   0.9347336
##   di_indicator_80_index success_needed_not_di_80_index
## 1                     0                              0
## 2                     0                              0
## 3                     1                              7
## 4                     0                              0
## 5                     0                              0
## 6                     0                              0
##   success_needed_full_parity_80_index
## 1                                  33
## 2                                  35
## 3                                  96
## 4                                  11
## 5                                   0
## 6                                  51

The variables di_indicator_ppg, di_indicator_prop_index, and di_indicator_80_index are DI flags (1 = DI, 0 = not DI) from the three methods: the percentage point gap, the proportionality index, and the 80% index, respectively. For additional explanations on other variables/columns in the returned data set, type ?di_iterate in the R console to bring up the documentation.
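As a rough check on how the three flags relate to the counts, the base R sketch below recomputes them for the six rows printed above (Math, 2017 cohort, Deg/Transfer, First-time College, by Ethnicity). The formulas are inferred from the printed values, not taken from the package source: a margin of error of 1.96 * sqrt(0.25 / n) with a floor of 0.03, a cutoff of 0.8 for both indices, and the highest-rate group as the 80% index reference. Treat them as assumptions; ?di_iterate has the authoritative definitions and defaults.

```r
# Counts copied from the first six rows of df_di_summary shown above.
d <- data.frame(
  group = c('Asian', 'Black', 'Hispanic', 'Multi-Ethnicity', 'Native American', 'White'),
  n = c(776, 235, 474, 117, 30, 823),
  success = c(692, 185, 347, 99, 28, 718),
  stringsAsFactors = FALSE
)
d$pct <- d$success / d$n

# Percentage point gap: flag a group when its rate plus a margin of error
# still falls below the overall rate (assumed MOE formula and 0.03 floor).
overall <- sum(d$success) / sum(d$n)
d$moe <- pmax(1.96 * sqrt(0.25 / d$n), 0.03)
d$flag_ppg <- as.integer(d$pct + d$moe < overall)

# Additional successes needed to clear the flag, and to reach the overall rate.
d$need_not_di_ppg <- pmax(0, ceiling((overall - d$moe - d$pct) * d$n))
d$need_parity_ppg <- pmax(0, ceiling((overall - d$pct) * d$n))

# Proportionality index: share of successes divided by share of the cohort.
d$prop_index <- (d$success / sum(d$success)) / (d$n / sum(d$n))
d$flag_prop_index <- as.integer(d$prop_index < 0.8) # assumed cutoff

# 80% index: each rate relative to the highest rate among the groups.
d$index_80 <- d$pct / max(d$pct)
d$flag_80 <- as.integer(d$index_80 < 0.8) # assumed cutoff

d[, c('group', 'flag_ppg', 'flag_prop_index', 'flag_80')]
```

Under these assumptions only the Hispanic group is flagged by the percentage point gap and 80% index methods, and no group is flagged by the proportionality index, which agrees with the three di_indicator columns in the output above.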

Next, note that the scenario '- All' is included for all variables passed to scenario_repeat_by_vars by default:

table(df_di_summary$Ed_Goal)
## 
##        - All Deg/Transfer        Other 
##          300          300          298
table(df_di_summary$College_Status)
## 
##              - All First-time College              Other 
##                300                300                298

Also note that di_iterate returns non-disaggregated results by default (rows where disaggregation is '- None'):

table(df_di_summary$disaggregation)
## 
##    - None Ethnicity    Gender 
##        90       539       269
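The counts above can be reconciled with a little arithmetic. The sketch below assumes 3 x 3 = 9 scenarios (each scenario variable has two values plus '- All'), four cohorts each for Math and English and two for Transfer, and six ethnicity and three gender groups, all read off the output shown in this document. The observed counts are typed in by hand.

```r
n_scenarios <- 3 * 3 # Ed_Goal levels x College_Status levels, each including '- All'
outcome_cohorts <- c(Math = 4, English = 4, Transfer = 2) # cohorts per outcome
groups <- c('- None' = 1, Ethnicity = 6, Gender = 3) # groups per disaggregation

# Upper bound on rows if every group appears in every cohort of every scenario.
expected <- n_scenarios * sum(outcome_cohorts) * groups
expected

# Counts reported by table(df_di_summary$disaggregation) above.
observed <- c('- None' = 90, Ethnicity = 539, Gender = 269)
expected - observed # one Ethnicity row and one Gender row short of the bound
```

The two missing rows presumably correspond to a group with no students in some cohort of some scenario, which would explain why the total is 898 and not 900.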

Let’s inspect the rows corresponding to non-disaggregated results.

# No Disaggregation
df_di_summary %>%
  filter(Ed_Goal=='- All', College_Status=='- All', disaggregation=='- None') %>%
  as.data.frame
##    Ed_Goal College_Status success_variable cohort_variable cohort disaggregation
## 1    - All          - All             Math     Cohort_Math   2017         - None
## 2    - All          - All             Math     Cohort_Math   2018         - None
## 3    - All          - All             Math     Cohort_Math   2019         - None
## 4    - All          - All             Math     Cohort_Math   2020         - None
## 5    - All          - All          English  Cohort_English   2017         - None
## 6    - All          - All          English  Cohort_English   2018         - None
## 7    - All          - All          English  Cohort_English   2019         - None
## 8    - All          - All          English  Cohort_English   2020         - None
## 9    - All          - All         Transfer          Cohort   2017         - None
## 10   - All          - All         Transfer          Cohort   2018         - None
##    group     n success       pct ppg_reference ppg_reference_group       moe    pct_lo
## 1  - All  4398    3722 0.8462938     0.8462938             overall 0.0300000 0.8162938
## 2  - All  7295    6193 0.8489376     0.8489376             overall 0.0300000 0.8189376
## 3  - All  4456    3807 0.8543537     0.8543537             overall 0.0300000 0.8243537
## 4  - All  1780    1505 0.8455056     0.8455056             overall 0.0300000 0.8155056
## 5  - All  5520    4183 0.7577899     0.7577899             overall 0.0300000 0.7277899
## 6  - All  8543    6532 0.7646026     0.7646026             overall 0.0300000 0.7346026
## 7  - All  3866    2938 0.7599586     0.7599586             overall 0.0300000 0.7299586
## 8  - All   913     678 0.7426068     0.7426068             overall 0.0324333 0.7101735
## 9  - All 10000    5140 0.5140000     0.5140000             overall 0.0300000 0.4840000
## 10 - All 10000    5388 0.5388000     0.5388000             overall 0.0300000 0.5088000
##       pct_hi di_indicator_ppg success_needed_not_di_ppg
## 1  0.8762938                0                         0
## 2  0.8789376                0                         0
## 3  0.8843537                0                         0
## 4  0.8755056                0                         0
## 5  0.7877899                0                         0
## 6  0.7946026                0                         0
## 7  0.7899586                0                         0
## 8  0.7750401                0                         0
## 9  0.5440000                0                         0
## 10 0.5688000                0                         0
##    success_needed_full_parity_ppg di_prop_index di_indicator_prop_index
## 1                               0             1                       0
## 2                               0             1                       0
## 3                               0             1                       0
## 4                               0             1                       0
## 5                               0             1                       0
## 6                               0             1                       0
## 7                               0             1                       0
## 8                               0             1                       0
## 9                               0             1                       0
## 10                              0             1                       0
##    success_needed_not_di_prop_index success_needed_full_parity_prop_index
## 1                                 0                                     0
## 2                                 0                                     0
## 3                                 0                                     0
## 4                                 0                                     0
## 5                                 0                                     0
## 6                                 0                                     0
## 7                                 0                                     0
## 8                                 0                                     0
## 9                                 0                                     0
## 10                                0                                     0
##    di_80_index_reference_group di_80_index di_indicator_80_index
## 1                        - All           1                     0
## 2                        - All           1                     0
## 3                        - All           1                     0
## 4                        - All           1                     0
## 5                        - All           1                     0
## 6                        - All           1                     0
## 7                        - All           1                     0
## 8                        - All           1                     0
## 9                        - All           1                     0
## 10                       - All           1                     0
##    success_needed_not_di_80_index success_needed_full_parity_80_index
## 1                               0                                   0
## 2                               0                                   0
## 3                               0                                   0
## 4                               0                                   0
## 5                               0                                   0
## 6                               0                                   0
## 7                               0                                   0
## 8                               0                                   0
## 9                               0                                   0
## 10                              0                                   0

Visualization (emulating dashboard features)

In this section, we emulate what a dashboard could visualize.

Imagine a dashboard with the following dropdown menus and option values:

  • Ed Goal
    • ‘- All’
    • ‘Degree/Transfer’
    • ‘Other’
  • College Status
    • ‘- All’
    • ‘First-time college’
    • ‘Other’
  • Outcome:
    • ‘Transfer’
    • ‘Math’
    • ‘English’
  • Disaggregation:
    • ‘- None’
    • ‘Ethnicity’
    • ‘Gender’

Each combination of this set of dropdown menus could be visualized using a subset of rows in df_di_summary.
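One way to picture the mapping from dropdown selections to rows is a small filter function. The sketch below is illustrative only: dashboard_rows is a hypothetical helper (not part of DisImpact), and toy is a made-up stand-in that shares only the four key columns of df_di_summary, so the example runs without the package.

```r
# Hypothetical helper: return the rows behind one combination of dropdown selections.
dashboard_rows <- function(df, ed_goal = '- All', college_status = '- All',
                           outcome = 'Math', disaggregation = '- None') {
  keep <- df$Ed_Goal == ed_goal &
    df$College_Status == college_status &
    df$success_variable == outcome &
    df$disaggregation == disaggregation
  df[keep, , drop = FALSE]
}

# Made-up stand-in with one row per combination of the four dropdown values.
toy <- expand.grid(
  Ed_Goal = c('- All', 'Deg/Transfer', 'Other'),
  College_Status = c('- All', 'First-time College', 'Other'),
  success_variable = c('Math', 'English', 'Transfer'),
  disaggregation = c('- None', 'Ethnicity', 'Gender'),
  stringsAsFactors = FALSE
)
toy$pct <- seq_len(nrow(toy)) / nrow(toy) # placeholder rates, not real results

# Selections: Ed Goal = 'Deg/Transfer', Outcome = 'English', Disaggregation = 'Gender'.
one_view <- dashboard_rows(toy, ed_goal = 'Deg/Transfer'
                         , outcome = 'English', disaggregation = 'Gender')
one_view
```

Applied to df_di_summary, the same call would return one row per cohort and group, which is what a dashboard would plot for that combination of selections.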

For example, let’s visualize non-disaggregated results for math (the dropdown selections are described at the top of the visualization):

# No Disaggregation
df_di_summary %>%
  filter(Ed_Goal=='- All', College_Status=='- All', success_variable=='Math', disaggregation=='- None') %>%
  as.data.frame
##   Ed_Goal College_Status success_variable cohort_variable cohort disaggregation
## 1   - All          - All             Math     Cohort_Math   2017         - None
## 2   - All          - All             Math     Cohort_Math   2018         - None
## 3   - All          - All             Math     Cohort_Math   2019         - None
## 4   - All          - All             Math     Cohort_Math   2020         - None
##   group    n success       pct ppg_reference ppg_reference_group  moe    pct_lo
## 1 - All 4398    3722 0.8462938     0.8462938             overall 0.03 0.8162938
## 2 - All 7295    6193 0.8489376     0.8489376             overall 0.03 0.8189376
## 3 - All 4456    3807 0.8543537     0.8543537             overall 0.03 0.8243537
## 4 - All 1780    1505 0.8455056     0.8455056             overall 0.03 0.8155056
##      pct_hi di_indicator_ppg success_needed_not_di_ppg
## 1 0.8762938                0                         0
## 2 0.8789376                0                         0
## 3 0.8843537                0                         0
## 4 0.8755056                0                         0
##   success_needed_full_parity_ppg di_prop_index di_indicator_prop_index
## 1                              0             1                       0
## 2                              0             1                       0
## 3                              0             1                       0
## 4                              0             1                       0
##   success_needed_not_di_prop_index success_needed_full_parity_prop_index
## 1                                0                                     0
## 2                                0                                     0
## 3                                0                                     0
## 4                                0                                     0
##   di_80_index_reference_group di_80_index di_indicator_80_index
## 1                       - All           1                     0
## 2                       - All           1                     0
## 3                       - All           1                     0
## 4                       - All           1                     0
##   success_needed_not_di_80_index success_needed_full_parity_80_index
## 1                              0                                   0
## 2                              0                                   0
## 3                              0                                   0
## 4                              0                                   0
library(ggplot2)
library(forcats)
library(scales)

# No Disaggregation
df_di_summary %>%
  filter(Ed_Goal=='- All', College_Status=='- All', success_variable=='Math', disaggregation=='- None') %>%
  select(cohort, group, n, pct, di_indicator_ppg, di_indicator_prop_index, di_indicator_80_index) %>%
  mutate(group=factor(group) %>% fct_reorder(desc(pct))) %>% 
  ggplot(data=., mapping=aes(x=factor(cohort), y=pct, group=group, color=group)) +
  geom_point() +
  geom_line() +
  xlab('Cohort') +
  ylab('Rate') +
  theme_bw() +
  scale_color_manual(values=c('#1b9e77'), name='Group') +
  # labs(size='Disproportionate Impact') +
  scale_y_continuous(labels = percent, limits=c(0, 1)) +
  ggtitle('Dashboard drop-down selections:', subtitle=paste0("Ed Goal = '- All' | College Status = '- All' | Outcome = 'Math' | Disaggregation = '- None'"))

In this dashboard, one could choose to disaggregate by ethnicity and highlight disproportionate impact (for simplicity, let’s use the percentage point gap method, i.e., the di_indicator_ppg flag, in subsequent visualizations):

# Disaggregation: Ethnicity
df_di_summary %>%
  filter(Ed_Goal=='- All', College_Status=='- All', success_variable=='Math', disaggregation=='Ethnicity') %>%
  select(cohort, group, n, pct, di_indicator_ppg, di_indicator_prop_index, di_indicator_80_index) %>%
  as.data.frame
##    cohort           group    n       pct di_indicator_ppg
## 1    2017           Asian 1406 0.8968706                0
## 2    2017           Black  421 0.7862233                1
## 3    2017        Hispanic  815 0.7325153                1
## 4    2017 Multi-Ethnicity  211 0.8293839                0
## 5    2017 Native American   45 0.9333333                0
## 6    2017           White 1500 0.8773333                0
## 7    2018           Asian 2212 0.9235986                0
## 8    2018           Black  684 0.7441520                1
## 9    2018        Hispanic 1386 0.7366522                1
## 10   2018 Multi-Ethnicity  369 0.7940379                1
## 11   2018 Native American   68 0.8088235                0
## 12   2018           White 2576 0.8819876                0
## 13   2019           Asian 1429 0.9083275                0
## 14   2019           Black  411 0.7834550                1
## 15   2019        Hispanic  786 0.7404580                1
## 16   2019 Multi-Ethnicity  225 0.8000000                0
## 17   2019 Native American   47 0.8297872                0
## 18   2019           White 1558 0.8896021                0
## 19   2020           Asian  573 0.9301920                0
## 20   2020           Black  180 0.7333333                1
## 21   2020        Hispanic  304 0.7171053                1
## 22   2020 Multi-Ethnicity   99 0.7575758                0
## 23   2020 Native American   14 0.6428571                0
## 24   2020           White  610 0.8819672                0
##    di_indicator_prop_index di_indicator_80_index
## 1                        0                     0
## 2                        0                     0
## 3                        0                     1
## 4                        0                     0
## 5                        0                     0
## 6                        0                     0
## 7                        0                     0
## 8                        0                     0
## 9                        0                     1
## 10                       0                     0
## 11                       0                     0
## 12                       0                     0
## 13                       0                     0
## 14                       0                     0
## 15                       0                     0
## 16                       0                     0
## 17                       0                     0
## 18                       0                     0
## 19                       0                     0
## 20                       0                     1
## 21                       0                     1
## 22                       0                     0
## 23                       1                     1
## 24                       0                     0
# Disaggregation: Ethnicity
df_di_summary %>%
  filter(Ed_Goal=='- All', College_Status=='- All', success_variable=='Math', disaggregation=='Ethnicity') %>%
  select(cohort, group, n, pct, di_indicator_ppg, di_indicator_prop_index, di_indicator_80_index) %>%
  mutate(group=factor(group) %>% fct_reorder(desc(pct))) %>% 
  ggplot(data=., mapping=aes(x=factor(cohort), y=pct, group=group, color=group)) +
  geom_point(aes(size=factor(di_indicator_ppg, levels=c(0, 1), labels=c('Not DI', 'DI')))) +
  geom_line() +
  xlab('Cohort') +
  ylab('Rate') +
  theme_bw() +
  scale_color_manual(values=c('#1b9e77', '#d95f02', '#7570b3', '#e7298a', '#66a61e', '#e6ab02'), name='Ethnicity') +
  labs(size='Disproportionate Impact') +
  scale_y_continuous(labels = percent, limits=c(0, 1)) +
  ggtitle('Dashboard drop-down selections:', subtitle=paste0("Ed Goal = '- All' | College Status = '- All' | Outcome = 'Math' | Disaggregation = 'Ethnicity'"))
## Warning: Using size for a discrete variable is not advised.

In a dashboard, the user might be interested in focusing on degree/transfer students. We emulate this by filtering on Ed_Goal=='Deg/Transfer':

# Disaggregation: Ethnicity; Deg/Transfer
df_di_summary %>%
  filter(Ed_Goal=='Deg/Transfer', College_Status=='- All', success_variable=='Math', disaggregation=='Ethnicity') %>%
  select(cohort, group, n, pct, di_indicator_ppg, di_indicator_prop_index, di_indicator_80_index) %>%
  as.data.frame
##    cohort           group    n       pct di_indicator_ppg
## 1    2017           Asian  975 0.8984615                0
## 2    2017           Black  290 0.7827586                1
## 3    2017        Hispanic  591 0.7292724                1
## 4    2017 Multi-Ethnicity  148 0.8445946                0
## 5    2017 Native American   36 0.9444444                0
## 6    2017           White 1039 0.8748797                0
## 7    2018           Asian 1552 0.9233247                0
## 8    2018           Black  478 0.7322176                1
## 9    2018        Hispanic  988 0.7439271                1
## 10   2018 Multi-Ethnicity  246 0.7886179                0
## 11   2018 Native American   45 0.7555556                0
## 12   2018           White 1829 0.8737015                0
## 13   2019           Asian  972 0.8971193                0
## 14   2019           Black  302 0.7913907                1
## 15   2019        Hispanic  556 0.7607914                1
## 16   2019 Multi-Ethnicity  162 0.8148148                0
## 17   2019 Native American   33 0.8181818                0
## 18   2019           White 1081 0.8843663                0
## 19   2020           Asian  402 0.9203980                0
## 20   2020           Black  127 0.6850394                1
## 21   2020        Hispanic  204 0.7107843                1
## 22   2020 Multi-Ethnicity   69 0.7681159                0
## 23   2020 Native American    8 0.6250000                0
## 24   2020           White  418 0.8851675                0
##    di_indicator_prop_index di_indicator_80_index
## 1                        0                     0
## 2                        0                     0
## 3                        0                     1
## 4                        0                     0
## 5                        0                     0
## 6                        0                     0
## 7                        0                     0
## 8                        0                     1
## 9                        0                     0
## 10                       0                     0
## 11                       0                     0
## 12                       0                     0
## 13                       0                     0
## 14                       0                     0
## 15                       0                     0
## 16                       0                     0
## 17                       0                     0
## 18                       0                     0
## 19                       0                     0
## 20                       0                     1
## 21                       0                     1
## 22                       0                     0
## 23                       1                     1
## 24                       0                     0
# Disaggregation: Ethnicity; Deg/Transfer
df_di_summary %>%
  filter(Ed_Goal=='Deg/Transfer', College_Status=='- All', success_variable=='Math', disaggregation=='Ethnicity') %>%
  select(cohort, group, n, pct, di_indicator_ppg, di_indicator_prop_index, di_indicator_80_index) %>%
  mutate(group=factor(group) %>% fct_reorder(desc(pct))) %>% 
  ggplot(data=., mapping=aes(x=factor(cohort), y=pct, group=group, color=group)) +
  geom_point(aes(size=factor(di_indicator_ppg, levels=c(0, 1), labels=c('Not DI', 'DI')))) +
  geom_line() +
  xlab('Cohort') +
  ylab('Rate') +
  theme_bw() +
  scale_color_manual(values=c('#1b9e77', '#d95f02', '#7570b3', '#e7298a', '#66a61e', '#e6ab02'), name='Ethnicity') +
  labs(size='Disproportionate Impact') +
  scale_y_continuous(labels = percent, limits=c(0, 1)) +
  ggtitle('Dashboard drop-down selections:', subtitle=paste0("Ed Goal = 'Deg/Transfer' | College Status = '- All' | Outcome = 'Math' | Disaggregation = 'Ethnicity'"))
## Warning: Using size for a discrete variable is not advised.

In a dashboard, the user could switch the outcome to English and disaggregate by Gender:

# Disaggregation: Gender; Deg/Transfer; English
df_di_summary %>%
  filter(Ed_Goal=='Deg/Transfer', College_Status=='- All', success_variable=='English', disaggregation=='Gender') %>%
  as.data.frame
##         Ed_Goal College_Status success_variable cohort_variable cohort
## 1  Deg/Transfer          - All          English  Cohort_English   2017
## 2  Deg/Transfer          - All          English  Cohort_English   2017
## 3  Deg/Transfer          - All          English  Cohort_English   2017
## 4  Deg/Transfer          - All          English  Cohort_English   2018
## 5  Deg/Transfer          - All          English  Cohort_English   2018
## 6  Deg/Transfer          - All          English  Cohort_English   2018
## 7  Deg/Transfer          - All          English  Cohort_English   2019
## 8  Deg/Transfer          - All          English  Cohort_English   2019
## 9  Deg/Transfer          - All          English  Cohort_English   2019
## 10 Deg/Transfer          - All          English  Cohort_English   2020
## 11 Deg/Transfer          - All          English  Cohort_English   2020
## 12 Deg/Transfer          - All          English  Cohort_English   2020
##    disaggregation  group    n success       pct ppg_reference
## 1          Gender Female 1916    1424 0.7432150     0.7496751
## 2          Gender   Male 1863    1411 0.7573806     0.7496751
## 3          Gender  Other   68      49 0.7205882     0.7496751
## 4          Gender Female 2833    2151 0.7592658     0.7597185
## 5          Gender   Male 3003    2296 0.7645688     0.7597185
## 6          Gender  Other  132      87 0.6590909     0.7597185
## 7          Gender Female 1385    1032 0.7451264     0.7577753
## 8          Gender   Male 1308    1003 0.7668196     0.7577753
## 9          Gender  Other   40      36 0.9000000     0.7577753
## 10         Gender Female  307     213 0.6938111     0.7192429
## 11         Gender   Male  315     234 0.7428571     0.7192429
## 12         Gender  Other   12       9 0.7500000     0.7192429
##    ppg_reference_group        moe    pct_lo    pct_hi di_indicator_ppg
## 1              overall 0.03000000 0.7132150 0.7732150                0
## 2              overall 0.03000000 0.7273806 0.7873806                0
## 3              overall 0.11884246 0.6017458 0.8394307                0
## 4              overall 0.03000000 0.7292658 0.7892658                0
## 5              overall 0.03000000 0.7345688 0.7945688                0
## 6              overall 0.08529805 0.5737929 0.7443890                1
## 7              overall 0.03000000 0.7151264 0.7751264                0
## 8              overall 0.03000000 0.7368196 0.7968196                0
## 9              overall 0.15495161 0.7450484 1.0549516                0
## 10             overall 0.05593155 0.6378795 0.7497426                0
## 11             overall 0.05521674 0.6876404 0.7980739                0
## 12             overall 0.28290163 0.4670984 1.0329016                0
##    success_needed_not_di_ppg success_needed_full_parity_ppg di_prop_index
## 1                          0                             13     0.9913829
## 2                          0                              0     1.0102784
## 3                          0                              2     0.9612007
## 4                          0                              2     0.9994041
## 5                          0                              0     1.0063843
## 6                          3                             14     0.8675462
## 7                          0                             18     0.9833077
## 8                          0                              0     1.0119352
## 9                          0                              0     1.1876871
## 10                         0                              8     0.9646408
## 11                         0                              0     1.0328321
## 12                         0                              0     1.0427632
##    di_indicator_prop_index success_needed_not_di_prop_index
## 1                        0                                0
## 2                        0                                0
## 3                        0                                0
## 4                        0                                0
## 5                        0                                0
## 6                        0                                0
## 7                        0                                0
## 8                        0                                0
## 9                        0                                0
## 10                       0                                0
## 11                       0                                0
## 12                       0                                0
##    success_needed_full_parity_prop_index di_80_index_reference_group
## 1                                     25                        Male
## 2                                      0                        Male
## 3                                      3                        Male
## 4                                      3                        Male
## 5                                      0                        Male
## 6                                     14                        Male
## 7                                     36                       Other
## 8                                      0                       Other
## 9                                      0                       Other
## 10                                    16                       Other
## 11                                     0                       Other
## 12                                     0                       Other
##    di_80_index di_indicator_80_index success_needed_not_di_80_index
## 1    0.9812967                     0                              0
## 2    1.0000000                     0                              0
## 3    0.9514216                     0                              0
## 4    0.9930641                     0                              0
## 5    1.0000000                     0                              0
## 6    0.8620427                     0                              0
## 7    0.8279182                     0                              0
## 8    0.8520217                     0                              0
## 9    1.0000000                     0                              0
## 10   0.9250814                     0                              0
## 11   0.9904762                     0                              0
## 12   1.0000000                     0                              0
##    success_needed_full_parity_80_index
## 1                                   28
## 2                                    0
## 3                                    3
## 4                                   16
## 5                                    0
## 6                                   14
## 7                                  215
## 8                                  175
## 9                                    0
## 10                                  18
## 11                                   3
## 12                                   0
# Disaggregation: Gender; Deg/Transfer; English
df_di_summary %>%
  filter(Ed_Goal=='Deg/Transfer', College_Status=='- All', success_variable=='English', disaggregation=='Gender') %>%
  select(cohort, group, n, pct, di_indicator_ppg, di_indicator_prop_index, di_indicator_80_index) %>%
  mutate(group=factor(group) %>% fct_reorder(desc(pct))) %>% 
  ggplot(data=., mapping=aes(x=factor(cohort), y=pct, group=group, color=group)) +
  geom_point(aes(size=factor(di_indicator_ppg, levels=c(0, 1), labels=c('Not DI', 'DI')))) +
  geom_line() +
  xlab('Cohort') +
  ylab('Rate') +
  theme_bw() +
  scale_color_manual(values=c('#1b9e77', '#d95f02', '#7570b3', '#e7298a', '#66a61e', '#e6ab02'), name='Gender') +
  labs(size='Disproportionate Impact') +
  scale_y_continuous(labels = percent, limits=c(0, 1)) +
  ggtitle('Dashboard drop-down selections:', subtitle=paste0("Ed Goal = 'Deg/Transfer' | College Status = '- All' | Outcome = 'English' | Disaggregation = 'Gender'"))
## Warning: Using size for a discrete variable is not advised.

What is the difference between group_vars and scenario_repeat_by_vars?

For classification variables such as age group, full-time status, and education goal, the user might be unsure whether to pass them to the group_vars argument or the scenario_repeat_by_vars argument. The answer depends on what the user wants to analyze. If we think of a single student population of interest (e.g., the data set being passed to di_iterate, such as all students enrolled at the institution), then the user should pass into group_vars all variables that they are interested in disaggregating on and performing a DI analysis for (e.g., is there disparity among ethnic student groups? Between first generation and non-first generation students?). The group_vars argument is required.

On the other hand, the scenario_repeat_by_vars argument is optional. When it is not specified, the DI analysis is performed on all outcomes specified in success_vars and all disaggregation variables specified in group_vars, using all students passed to data as a single population. The user should only pass variables into scenario_repeat_by_vars if they want to split the student population into multiple subpopulations and perform a DI analysis on each. For example, if ethnicity, first generation status, and age group were specified in group_vars, then the user is trying to answer the following questions:

  1. Is there disparity between different ethnic student groups?
  2. Is there disparity between first generation students vs. non-first generation students?
  3. Is there disparity between students of different age groups?

If on the other hand, the user passes ethnicity and first generation status to group_vars, and age group to scenario_repeat_by_vars, then the user is trying to answer the following questions:

  1. Is there disparity between different ethnic student groups?
    1. Among all students defined by data?
    2. Among different subpopulations defined by age group? (e.g., among each of these groups: 18-21, 22-25, 26-35, 36-50, 51+)
  2. Is there disparity between first generation students vs. non-first generation students?
    1. Among all students defined by data?
    2. Among different subpopulations defined by age group? (e.g., among each of these groups: 18-21, 22-25, 26-35, 36-50, 51+)
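
To make the distinction concrete, here is a minimal base R sketch on a hypothetical toy data set. The column names (ethnicity, age_group, success) and the values are made up for illustration, and the code only mimics the idea of repeating a disaggregation within subpopulations; it is not how di_iterate is implemented. The '- All' label mirrors the value used for the full population in the results shown earlier.

```r
# Toy student-level data; all column names and values are hypothetical.
students <- data.frame(
  ethnicity = c("A", "A", "A", "A", "B", "B", "B", "B"),
  age_group = c("18-21", "18-21", "22-25", "22-25",
                "18-21", "18-21", "22-25", "22-25"),
  success   = c(1, 1, 1, 0, 1, 0, 0, 0)
)

# group_vars-style question: success rate by ethnicity in the whole population.
rate_all <- aggregate(success ~ ethnicity, data = students, FUN = mean)
rate_all$age_group <- "- All"

# scenario_repeat_by_vars-style question: the same disaggregation,
# repeated within each age-group subpopulation.
rate_by_age <- aggregate(success ~ ethnicity + age_group,
                         data = students, FUN = mean)

# Stack both so each row is one (subpopulation, group) combination.
cols <- c("age_group", "ethnicity", "success")
rates <- rbind(rate_all[, cols], rate_by_age[, cols])
print(rates)
```

In this sketch, ethnicity plays the role of a group_vars variable (the comparison of interest), while age_group plays the role of a scenario_repeat_by_vars variable (it only defines which students are included in each comparison).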

Understanding the default parameters in di_iterate, and overriding them

The function di_iterate has been designed to be highly flexible through the use of function arguments / parameters, with many defaults:

args(di_iterate)
## function (data, success_vars, group_vars, cohort_vars = NULL, 
##     scenario_repeat_by_vars = NULL, exclude_scenario_df = NULL, 
##     weight_var = NULL, include_non_disagg_results = TRUE, ppg_reference_groups = "overall", 
##     min_moe = 0.03, use_prop_in_moe = FALSE, prop_sub_0 = 0.5, 
##     prop_sub_1 = 0.5, di_prop_index_cutoff = 0.8, di_80_index_cutoff = 0.8, 
##     di_80_index_reference_groups = "hpg", check_valid_reference = TRUE) 
## NULL

In this section, we illustrate how each argument could be used. Type ?di_iterate to read the description of each.

Passing a summarized data set to data and using weight_var

Instead of passing in a student-level data set, the user could also pass in a summarized data set, which takes up less disk space and less memory when imported into R. When passing a summarized data set, the user should also specify weight_var to indicate the group size of each row. Let’s illustrate with an example:

dim(student_equity)
## [1] 20000    24
## Example summarized data set
student_equity_summ <- student_equity %>%
  group_by(Ethnicity, Gender, Cohort, Cohort_Math, Cohort_English, Ed_Goal, College_Status) %>%
  summarize(N=n() %>% as.numeric # as.numeric not needed in general; used here so all.equal() below compares like types
            , Math=sum(Math, na.rm=TRUE)
            , English=sum(English, na.rm=TRUE)
            , Transfer=sum(Transfer, na.rm=TRUE)
            ) %>%
  ungroup
## `summarise()` has grouped output by 'Ethnicity', 'Gender', 'Cohort', 'Cohort_Math', 'Cohort_English', 'Ed_Goal'. You can override using the `.groups` argument.
dim(student_equity_summ) # fewer columns and far fewer rows than student_equity
## [1] 1402   11
student_equity_summ %>% head %>% as.data.frame # first few rows
##   Ethnicity Gender Cohort Cohort_Math Cohort_English      Ed_Goal
## 1     Asian Female   2017        2017           2017 Deg/Transfer
## 2     Asian Female   2017        2017           2017 Deg/Transfer
## 3     Asian Female   2017        2017           2017        Other
## 4     Asian Female   2017        2017           2017        Other
## 5     Asian Female   2017        2017           2018 Deg/Transfer
## 6     Asian Female   2017        2017           2018 Deg/Transfer
##       College_Status   N Math English Transfer
## 1 First-time College 202  185     178      157
## 2              Other  55   52      50       41
## 3 First-time College  96   87      82       70
## 4              Other  25   23      22       20
## 5 First-time College 104   88      87       71
## 6              Other  31   29      29       26
## Run on summarized data set
df_di_summary_2 <- di_iterate(data=student_equity_summ
                          , success_vars=c('Math', 'English', 'Transfer')
                          , group_vars=c('Ethnicity', 'Gender')
                          , cohort_vars=c('Cohort_Math', 'Cohort_English', 'Cohort')
                          , scenario_repeat_by_vars=c('Ed_Goal', 'College_Status')
                          , weight_var='N' # SET THIS
                            )
## Joining, by = c("Ed_Goal", "College_Status")
## Joining, by = c("Ed_Goal", "College_Status")
## Joining, by = "College_Status"
## Joining, by = c("Ed_Goal", "College_Status")
## Joining, by = c("Ed_Goal", "College_Status")
## Joining, by = "College_Status"
## Joining, by = "Ed_Goal"
## Joining, by = "Ed_Goal"
dim(df_di_summary)  ## original results
## [1] 898  27
dim(df_di_summary_2) # more rows, because of NA cohort values from Cohort_English and Cohort_Math
## [1] 1075   27
dim(df_di_summary_2 %>% filter(!is.na(cohort)))
## [1] 898  27
## ## if the user wants to see the extra rows
## extra_rows <- df_di_summary_2 %>%
##   anti_join(df_di_summary %>% select(Ed_Goal, College_Status, success_variable, cohort_variable, cohort, disaggregation, group))
## extra_rows %>% head %>% as.data.frame

all.equal(df_di_summary
        , df_di_summary_2 %>% filter(!is.na(cohort))
          ) # returned results are the same
## [1] TRUE
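
The reason the two runs agree can be seen in a small base R sketch: a success rate computed from student-level rows equals the rate computed from summarized rows once each summarized row is weighted by its group size. The data set and column names below are hypothetical, and the sketch only illustrates the arithmetic, not the internals of di_iterate.

```r
# Toy student-level data; names and values are hypothetical.
students <- data.frame(
  gender  = c("F", "F", "F", "M", "M", "M", "M"),
  cohort  = c(2017, 2017, 2018, 2017, 2017, 2018, 2018),
  success = c(1, 0, 1, 1, 1, 0, 1)
)

# Summarize: one row per gender-cohort cell, with a count (N) and a success total.
students$N <- 1
summ <- aggregate(cbind(N, success) ~ gender + cohort, data = students, FUN = sum)

# Rate by gender from student-level rows (simple mean of 0/1 outcomes).
rate_student <- tapply(students$success, students$gender, mean)

# Rate by gender from summarized rows (total successes / total group size).
rate_summ <- tapply(summ$success, summ$gender, sum) /
  tapply(summ$N, summ$gender, sum)

all.equal(as.numeric(rate_student), as.numeric(rate_summ))
```

Here the N column plays the role that weight_var='N' plays in the call above.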

Suppress non-disaggregated results using include_non_disagg_results

By default, the non-disaggregated results are also returned. If the user wants to suppress this, they could set include_non_disagg_results=FALSE:

df_di_summary_2 <- di_iterate(data=student_equity
                          , success_vars=c('Math', 'English', 'Transfer')
                          , group_vars=c('Ethnicity', 'Gender')
                          , cohort_vars=c('Cohort_Math', 'Cohort_English', 'Cohort')
                          , scenario_repeat_by_vars=c('Ed_Goal', 'College_Status')
                          , include_non_disagg_results=FALSE ## SET THIS
                            )
## Joining, by = c("Ed_Goal", "College_Status")
## Joining, by = c("Ed_Goal", "College_Status")
## Joining, by = "College_Status"
## Joining, by = c("Ed_Goal", "College_Status")
## Joining, by = c("Ed_Goal", "College_Status")
## Joining, by = "College_Status"
## Joining, by = "Ed_Goal"
## Joining, by = "Ed_Goal"
dim(df_di_summary)
## [1] 898  27
dim(df_di_summary_2) ## fewer rows because the non-disaggregated results are no longer included
## [1] 808  27
table(df_di_summary$disaggregation)
## 
##    - None Ethnicity    Gender 
##        90       539       269
table(df_di_summary_2$disaggregation) # No more '- None'
## 
## Ethnicity    Gender 
##       539       269

PPG reference groups and other parameters

For the percentage point gap (PPG) method, di_iterate defaults to using the overall success rate as the reference for comparison (ppg_reference_groups='overall'). The user could set ppg_reference_groups='hpg' to use the highest performing group as the comparison group, or ppg_reference_groups='all but current' to use the combined success rate of all other groups excluding the group of interest (e.g., if studying Hispanic students, then the reference group would be all non-Hispanic students). The latter is sometimes referred to as “PPG minus 1” or “PPG-1.” The user could also name specific groups as reference, one for each variable in group_vars:

# Highest performing group as reference
df_di_summary_2 <- di_iterate(data=student_equity
                            , success_vars=c('Math', 'English', 'Transfer')
                            , group_vars=c('Ethnicity', 'Gender')
                            , cohort_vars=c('Cohort_Math', 'Cohort_English', 'Cohort')
                            , scenario_repeat_by_vars=c('Ed_Goal', 'College_Status')
                            , ppg_reference_groups='hpg' ## SET THIS
                              )
## Joining, by = c("Ed_Goal", "College_Status")
## Joining, by = c("Ed_Goal", "College_Status")
## Joining, by = "College_Status"
## Joining, by = c("Ed_Goal", "College_Status")
## Joining, by = c("Ed_Goal", "College_Status")
## Joining, by = "College_Status"
## Joining, by = "Ed_Goal"
## Joining, by = "Ed_Goal"
# Reference: all other groups except group of interest (PPG minus 1)
df_di_summary_2 <- di_iterate(data=student_equity
                            , success_vars=c('Math', 'English', 'Transfer')
                            , group_vars=c('Ethnicity', 'Gender')
                            , cohort_vars=c('Cohort_Math', 'Cohort_English', 'Cohort')
                            , scenario_repeat_by_vars=c('Ed_Goal', 'College_Status')
                            , ppg_reference_groups='all but current' ## SET THIS
                              )
## Joining, by = c("Ed_Goal", "College_Status")
## Joining, by = c("Ed_Goal", "College_Status")
## Joining, by = "College_Status"
## Joining, by = c("Ed_Goal", "College_Status")
## Joining, by = c("Ed_Goal", "College_Status")
## Joining, by = "College_Status"
## Joining, by = "Ed_Goal"
## Joining, by = "Ed_Goal"
# Reference: custom groups
df_di_summary_2 <- di_iterate(data=student_equity
                            , success_vars=c('Math', 'English', 'Transfer')
                            , group_vars=c('Ethnicity', 'Gender')
                            , cohort_vars=c('Cohort_Math', 'Cohort_English', 'Cohort')
                            , scenario_repeat_by_vars=c('Ed_Goal', 'College_Status')
                            , ppg_reference_groups=c('White', 'Male') ## corresponds to each variable in group_vars
                              )
## Joining, by = c("Ed_Goal", "College_Status")
## Joining, by = c("Ed_Goal", "College_Status")
## Joining, by = "College_Status"
## Joining, by = c("Ed_Goal", "College_Status")
## Joining, by = c("Ed_Goal", "College_Status")
## Joining, by = "College_Status"
## Joining, by = "Ed_Goal"
## Joining, by = "Ed_Goal"

The following arguments apply to the PPG method: min_moe, use_prop_in_moe, prop_sub_0, and prop_sub_1. See ?di_ppg for more details.
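
As a rough illustration of how the three built-in reference choices differ, the base R sketch below computes each reference rate on a hypothetical three-group summary (group names and counts are made up). It shows only the reference rates and the raw gaps; the PPG method's margin-of-error adjustment, which determines the actual DI flag, is deliberately left out.

```r
# Toy summary by group; names and counts are hypothetical.
grp <- data.frame(
  group   = c("A", "B", "C"),
  n       = c(100, 50, 50),
  success = c(80, 30, 20)
)
grp$rate <- grp$success / grp$n

# 'overall': pooled success rate across all groups.
ref_overall <- sum(grp$success) / sum(grp$n)

# 'hpg': success rate of the highest performing group.
ref_hpg <- max(grp$rate)

# 'all but current': pooled rate of every group except the one in the row.
grp$ref_all_but_current <-
  (sum(grp$success) - grp$success) / (sum(grp$n) - grp$n)

# Raw gaps relative to each reference (negative means below the reference).
grp$gap_overall         <- grp$rate - ref_overall
grp$gap_hpg             <- grp$rate - ref_hpg
grp$gap_all_but_current <- grp$rate - grp$ref_all_but_current
print(grp)
```

Note that 'overall' and 'hpg' give one reference rate shared by all groups, whereas 'all but current' gives each group its own reference rate.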

Proportionality index DI threshold

For the proportionality index (PI) method, DI is determined using di_prop_index_cutoff=0.8 by default. This could be changed using the di_prop_index_cutoff argument.

df_di_summary_2 <- di_iterate(data=student_equity
                            , success_vars=c('Math', 'English', 'Transfer')
                            , group_vars=c('Ethnicity', 'Gender')
                            , cohort_vars=c('Cohort_Math', 'Cohort_English', 'Cohort')
                            , scenario_repeat_by_vars=c('Ed_Goal', 'College_Status')
                            , di_prop_index_cutoff=0.9 # Easier to declare DI using PI
                              )
## Joining, by = c("Ed_Goal", "College_Status")
## Joining, by = c("Ed_Goal", "College_Status")
## Joining, by = "College_Status"
## Joining, by = c("Ed_Goal", "College_Status")
## Joining, by = c("Ed_Goal", "College_Status")
## Joining, by = "College_Status"
## Joining, by = "Ed_Goal"
## Joining, by = "Ed_Goal"
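
To see why a higher cutoff makes it easier to declare DI, the following base R sketch computes a proportionality index on a hypothetical three-group summary. It assumes the usual definition (a group's share of successes divided by its share of the cohort) and assumes a group is flagged when its index falls below the cutoff; the data and column names are made up.

```r
# Toy summary by group; names and counts are hypothetical.
grp <- data.frame(
  group   = c("A", "B", "C"),
  n       = c(100, 50, 50),
  success = c(80, 28, 20)
)

# Share of the cohort and share of the successes held by each group.
share_cohort  <- grp$n / sum(grp$n)
share_success <- grp$success / sum(grp$success)

# Proportionality index: values below 1 mean the group is under-represented
# among successes relative to its share of the cohort.
grp$prop_index <- share_success / share_cohort

# Flag DI under two cutoffs (assumed rule: index < cutoff).
grp$di_cutoff_80 <- as.integer(grp$prop_index < 0.8)
grp$di_cutoff_90 <- as.integer(grp$prop_index < 0.9)
print(grp)
```

In this toy example, group B has an index of 0.875, so it is flagged at a cutoff of 0.9 but not at the default of 0.8.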

80% index reference groups and DI threshold

For the 80% index method, the highest performing group is used as reference by default (di_80_index_reference_groups='hpg'). Similar to the PPG, the user could specify custom reference groups.

# Custom reference groups
df_di_summary_2 <- di_iterate(data=student_equity
                            , success_vars=c('Math', 'English', 'Transfer')
                            , group_vars=c('Ethnicity', 'Gender')
                            , cohort_vars=c('Cohort_Math', 'Cohort_English', 'Cohort')
                            , scenario_repeat_by_vars=c('Ed_Goal', 'College_Status')
                            , di_80_index_reference_groups=c('White', 'Male') ## corresponds to each variable in group_vars
                              )
## Joining, by = c("Ed_Goal", "College_Status")
## Joining, by = c("Ed_Goal", "College_Status")
## Joining, by = "College_Status"
## Joining, by = c("Ed_Goal", "College_Status")
## Joining, by = c("Ed_Goal", "College_Status")
## Joining, by = "College_Status"
## Joining, by = "Ed_Goal"
## Joining, by = "Ed_Goal"

Besides a specific reference group, the function also accepts 'overall' and 'all but current'. The former uses the overall success rate as reference for comparison. The latter uses the combined success rate of all other groups as reference for comparison.

The 80% index uses 80% as the default threshold for declaring DI. The user could alter this with the di_80_index_cutoff argument.

df_di_summary_2 <- di_iterate(data=student_equity
                            , success_vars=c('Math', 'English', 'Transfer')
                            , group_vars=c('Ethnicity', 'Gender')
                            , cohort_vars=c('Cohort_Math', 'Cohort_English', 'Cohort')
                            , scenario_repeat_by_vars=c('Ed_Goal', 'College_Status')
                            , di_80_index_cutoff=0.5 # Harder to declare DI using 80% index
                              )
## Joining, by = c("Ed_Goal", "College_Status")
## Joining, by = c("Ed_Goal", "College_Status")
## Joining, by = "College_Status"
## Joining, by = c("Ed_Goal", "College_Status")
## Joining, by = c("Ed_Goal", "College_Status")
## Joining, by = "College_Status"
## Joining, by = "Ed_Goal"
## Joining, by = "Ed_Goal"
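
The effect of the cutoff can be sketched in base R on a hypothetical three-group summary. The sketch assumes the 80% index is the ratio of a group's success rate to the reference rate (here the highest performing group) and that a group is flagged when the ratio falls below the cutoff; the data and column names are made up.

```r
# Toy summary by group; names and counts are hypothetical.
grp <- data.frame(
  group   = c("A", "B", "C"),
  n       = c(100, 50, 50),
  success = c(80, 30, 15)
)
grp$rate <- grp$success / grp$n

# Reference rate: highest performing group ('hpg').
ref_rate <- max(grp$rate)

# 80% index: each group's rate relative to the reference rate.
grp$index_80 <- grp$rate / ref_rate

# Flag DI under the default cutoff and a lower one (assumed rule: index < cutoff).
grp$di_cutoff_80 <- as.integer(grp$index_80 < 0.8)
grp$di_cutoff_50 <- as.integer(grp$index_80 < 0.5)
print(grp)
```

In this toy example, group B's ratio of about 0.75 is flagged at the default cutoff of 0.8 but not at 0.5, while group C is flagged under both.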

Multiple PPG or DI parameter scenarios in results

A single call of di_iterate returns the results of all three DI methods. If the user is interested in doing DI calculations using various scenarios of the same method (e.g., using the overall rate as reference for PPG, and using a pre-specified list of reference groups), then it is recommended that the user execute di_iterate multiple times and combine (stack) the results. If the user chooses to do this, then it is a good idea to set include_non_disagg_results=FALSE in subsequent di_iterate runs to avoid duplicating rows of non-disaggregated results.

# Multiple group variables and different reference groups
df_di_summary_long <- bind_rows(
  di_iterate(data=student_equity
           , success_vars=c('Math', 'English', 'Transfer')
           , group_vars=c('Ethnicity', 'Gender')
           , cohort_vars=c('Cohort_Math', 'Cohort_English', 'Cohort')
           , scenario_repeat_by_vars=c('Ed_Goal', 'College_Status')
             )
  , di_iterate(data=student_equity
           , success_vars=c('Math', 'English', 'Transfer')
           , group_vars=c('Ethnicity', 'Gender')
           , cohort_vars=c('Cohort_Math', 'Cohort_English', 'Cohort')
           , scenario_repeat_by_vars=c('Ed_Goal', 'College_Status')
           , ppg_reference_groups=c('White', 'Male') ## corresponds to each variable in group_vars
           , include_non_disagg_results = FALSE # Already have non-disaggregated results in the first run
             )
)
## Joining, by = c("Ed_Goal", "College_Status")
## Joining, by = c("Ed_Goal", "College_Status")
## Joining, by = "College_Status"
## Joining, by = c("Ed_Goal", "College_Status")
## Joining, by = c("Ed_Goal", "College_Status")
## Joining, by = "College_Status"
## Joining, by = "Ed_Goal"
## Joining, by = "Ed_Goal"
## Joining, by = c("Ed_Goal", "College_Status")
## Joining, by = c("Ed_Goal", "College_Status")
## Joining, by = "College_Status"
## Joining, by = c("Ed_Goal", "College_Status")
## Joining, by = c("Ed_Goal", "College_Status")
## Joining, by = "College_Status"
## Joining, by = "Ed_Goal"
## Joining, by = "Ed_Goal"
dim(df_di_summary_long)
## [1] 1706   27
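
When stacking runs for a dashboard, it can help to be able to tell which run each row came from. One possible approach, sketched below in base R, is to prepend a label column before stacking. The two small data frames are hypothetical stand-ins for di_iterate results, and the scenario column and its labels are names invented for this illustration.

```r
# Hypothetical stand-ins for two di_iterate results; columns are illustrative only.
run_overall <- data.frame(group = c("A", "B"), pct = c(0.80, 0.60))
run_custom  <- data.frame(group = c("A", "B"), pct = c(0.80, 0.60))

# Name each run, then prepend that name as a 'scenario' column before stacking.
runs <- list(ppg_overall = run_overall, ppg_custom_reference = run_custom)
stacked <- do.call(
  rbind,
  Map(function(d, label) cbind(scenario = label, d), runs, names(runs))
)
rownames(stacked) <- NULL
print(stacked)
```

The label column can then serve as a drop-down filter in the dashboard tool.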

FERPA block/suppression

Since di_iterate disaggregates on many variables and subpopulations, it is not uncommon for the returned results to contain rows summarizing small samples. As is common in education research, care should be taken to not unintentionally disclose the educational outcomes of students (i.e., results that can be linked to particular students, as covered by FERPA regulations). The user might want to filter out rows with small samples (e.g., n < 10):

## df_di_summary %>%
##   mutate(FERPA_Block=ifelse(n < 10, 1, 0)) %>%
##   filter(FERPA_Block == 0)
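
As a runnable variation on the snippet above, the base R sketch below shows two options on a hypothetical results table: dropping small-sample rows entirely, or keeping the rows but blanking out their values so the dashboard can still show that the group exists. The table, its column names, and the threshold of 10 are assumptions for illustration; the appropriate threshold depends on institutional policy.

```r
# Hypothetical results table; names and values are made up.
res <- data.frame(
  group   = c("A", "B", "C"),
  n       = c(120, 8, 45),
  success = c(90, 5, 30)
)
res$pct <- res$success / res$n

min_n <- 10  # assumed suppression threshold
res$FERPA_Block <- ifelse(res$n < min_n, 1, 0)

# Option 1: drop small-sample rows.
res_dropped <- res[res$FERPA_Block == 0, ]

# Option 2: keep the rows, but mask the counts and rates.
res_masked <- res
res_masked[res_masked$FERPA_Block == 1, c("n", "success", "pct")] <- NA

print(res_dropped)
print(res_masked)
```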

Appendix: R and R Package Versions

This vignette was generated using an R session with the following packages. There may be some discrepancies when the reader replicates the code caused by version mismatch.

sessionInfo()
## R version 4.0.2 (2020-06-22)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 17763)
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=C                          
## [2] LC_CTYPE=English_United States.1252   
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C                          
## [5] LC_TIME=English_United States.1252    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] knitr_1.33       DisImpact_0.0.15 forcats_0.5.1    scales_1.1.1    
## [5] ggplot2_3.3.3    stringr_1.4.0    dplyr_1.0.6     
## 
## loaded via a namespace (and not attached):
##  [1] highr_0.9         pillar_1.6.1      bslib_0.2.5.1     compiler_4.0.2   
##  [5] jquerylib_0.1.4   prettydoc_0.4.1   tools_4.0.2       digest_0.6.27    
##  [9] jsonlite_1.7.2    evaluate_0.14     lifecycle_1.0.0   tibble_3.1.2     
## [13] gtable_0.3.0      pkgconfig_2.0.3   rlang_0.4.11      rstudioapi_0.13  
## [17] cli_2.5.0         yaml_2.2.1        xfun_0.23         withr_2.4.2      
## [21] generics_0.1.0    vctrs_0.3.8       sass_0.4.0        grid_4.0.2       
## [25] tidyselect_1.1.1  glue_1.4.2        R6_2.5.0          fansi_0.5.0      
## [29] rmarkdown_2.9     farver_2.1.0      purrr_0.3.4       tidyr_1.1.3      
## [33] magrittr_2.0.1    ps_1.6.0          ellipsis_0.3.2    htmltools_0.5.1.1
## [37] colorspace_2.0-1  labeling_0.4.2    utf8_1.2.1        stringi_1.4.6    
## [41] munsell_0.5.0     crayon_1.4.1