Visualize variation between the expected & actual percentages — plot_expected

Ideal for a 0/1 dichotomous variable.

plot_expected_proportions(
  df,
  dv,
  ...,
  trunc_length = 100,
  sort_by = c("expected", "actual"),
  threshold = 0.02,
  return_data = FALSE,
  n_field = 9,
  color_over = "navyblue",
  color_under = "red"
)

Arguments

df	data to be analyzed
dv	dependent variable
...	Arguments passed on to `refactor_columns` `split_on` variable to split data / group by `id_col` field to use as ID `n_cat` for categorical variables, the max number of unique values to keep. This field feeds the `forcats::fct_lump(n = )` argument. `collapse_by` should `n_cat` collapse by the distance to the grand mean `"dv"` leaving the extremes as is and grouping factors closer to the grand mean as "other" or should it use size `"n"` `n_quantile` for numeric/date fields, the number of quantiles used to split the data into a factor. Fields that have less than this amount will not be changed. `n_digits` for numeric fields, the number of digits to keep in the breaks ex: [1.2345 to 2.3456] will be [1.23 to 2.34] if `n_digits = 2` `avg_type` mean or median `ignore_cols` columns to ignore from analysis. Good candidates are fields that have have no duplicate values (primary keys) or fields with a large proportion of null values
trunc_length	length to shorten y-axis labels
sort_by	should data be sorted by expected or actual percentages
threshold	the cut-off (percentage difference) between actual and expected values. This allows the chart to focus on the bigger changes. Use `NULL` to keep all values
return_data	if TRUE will return a data frame instead of a plot
n_field	the max number of facets to show. The fields are sorted in descending order by those that have the most change (the 'field_delta' column).
color_over	color name/hex code for values that are over-represented
color_under	color name/hex code for values that are under-represented

Examples

# sorted by the expected representation (default)
plot_expected_proportions(
  df = employee_attrition[, 1:5],
  dv = attrition
)

# sorted by the actual representation
plot_expected_proportions(
  df = employee_attrition[, 1:5],
  dv = attrition,
  sort_by = "actual"
)

# you can return the dataframe if you want
plot_expected_proportions(
  df = employee_attrition[, 1:5],
  dv = attrition,
  return_data = TRUE
)
#> # A tibble: 14 x 10
#>    field value     n total expected actual  delta abs_delta field_delta category
#>    <fct> <chr> <int> <int>    <dbl>  <dbl>  <dbl>     <dbl>       <dbl> <chr>   
#>  1 job_~ Dire~    69     5     4.69  2.11   -2.58      2.58       46.8  under   
#>  2 job_~ Rese~    80     2     5.44  0.844  -4.60      4.60       40.4  under   
#>  3 job_~ Sale~    83    33     5.65 13.9     8.28      8.28       40.4  over    
#>  4 job_~ Mana~   102     5     6.94  2.11   -4.83      4.83       40.4  under   
#>  5 job_~ Seni~   106     5     7.21  2.11   -5.10      5.10       46.8  under   
#>  6 job_~ Heal~   131     9     8.91  3.80   -5.11      5.11       40.4  under   
#>  7 job_~ Manu~   145    10     9.86  4.22   -5.64      5.64       40.4  under   
#>  8 job_~ Labo~   259    62    17.6  26.2     8.54      8.54       40.4  over    
#>  9 depa~ Sales   446    92    30.3  38.8     8.48      8.48       18.5  over    
#> 10 job_~ Juni~   534    52    36.3  21.9   -14.4      14.4        46.8  under   
#> 11 job_~ Inte~   543   143    36.9  60.3    23.4      23.4        46.8  over    
#> 12 gend~ Fema~   588    87    40    36.7    -3.29      3.29        6.58 under   
#> 13 gend~ Male    882   150    60    63.3     3.29      3.29        6.58 over    
#> 14 depa~ Rese~   961   133    65.4  56.1    -9.26      9.26       18.5  under   

# an example with more parameters
plot_expected_proportions(
  df = employee_attrition[, 1:5], # data to use
  dv = attrition, # can be a field name or an evaluation
  n_cat = 5, # collapse field values into 5 categories, all else in "Other"
  n_field = 2, # keep the first 2 facets
  threshold = NULL # keep all values
)