Visualize variation between two groups

plot_group_split(
  df,
  split_on,
  type = c("dv", "count", "percent_field", "percent_factor"),
  dv,
  ...,
  n_cat = 10,
  trunc_length = 100,
  threshold = 0.02,
  base_group = c("1", "2"),
  return_data = FALSE,
  n_field = 9,
  color_over = "navyblue",
  color_under = "red",
  color_missing = "grey50",
  title = NULL,
  subtitle = NULL,
  caption = NULL
)

Arguments

df	dataframe to evaluate
split_on	variable to split data / group by
type	what to compare: the outcome or dependent variable ("dv"), the number of obs. ("count"), the percent of obs. within the field ("percent_field"), or the percent of obs. within each factor ("percent_factor")
dv	dependent variable to use (column name)
...	Arguments passed on to `refactor_columns`:
	`id_col`: field to use as ID
	`collapse_by`: should `n_cat` collapse by the distance to the grand mean (`"dv"`), leaving the extremes as is and grouping factors closer to the grand mean as "other", or should it use size (`"n"`)
	`n_quantile`: for numeric/date fields, the number of quantiles used to split the data into a factor. Fields that have fewer distinct values than this amount will not be changed.
	`n_digits`: for numeric fields, the number of digits to keep in the breaks, ex: [1.2345 to 2.3456] will be [1.23 to 2.34] if `n_digits = 2`
	`avg_type`: mean or median
	`ignore_cols`: columns to ignore from analysis. Good candidates are fields that have no duplicate values (primary keys) or fields with a large proportion of null values
n_cat	the number of factors to keep on the y-axis. Factors will be prioritized by the size of the difference and may not match the way categories are collapsed in `refactor_columns()`
trunc_length	number of characters to print on the y-axis
threshold	threshold for excluding nominal differences. The value should match the scale of `type`: if the counts are in the hundreds you might use 20, meaning that when viewing count differences, values where the difference is <20 will be excluded. For the proportion/percent and dv types, the default is 0.02, or 2 percent
base_group	Should group 1 or group 2 be the base. This group will be the bar and the other will be the point.
return_data	When TRUE will return data frame instead of a plot.
n_field	How many fields/facets should the plot return.
color_over	Color to use when point is higher than bar
color_under	Color to use when point is lower than bar
color_missing	Color to use when either a point or bar is missing
title	title for chart
subtitle	subtitle for chart
caption	caption for chart
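
The effect of `threshold` on count differences can be sketched in base R. This is only an illustration of the rule described above, not the function's internal code: the toy counts, the column names (`group_1`, `group_2`, `diff`), and the use of the absolute difference are all assumptions made for the example.

```r
# Toy counts per category for two groups (made-up numbers for illustration)
counts <- data.frame(
  category = c("A", "B", "C", "D"),
  group_1  = c(300, 120, 210, 95),
  group_2  = c(250, 115, 260, 90),
  stringsAsFactors = FALSE
)

# Difference between the two groups for each category
counts$diff <- counts$group_2 - counts$group_1

# Assumed rule: drop rows whose absolute difference is below the threshold
threshold <- 20
kept <- counts[abs(counts$diff) >= threshold, ]

print(kept)
```

With counts in the hundreds, a threshold of 20 keeps only the categories where the groups differ by 20 or more observations; for the percent and dv types the same idea applies on a 0 to 1 scale.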

Examples

# there are 4 types of plots available: comparing the dependent variable,
# comparing counts, comparing % of field, comparing % within each factor

# type = "dv" is used when comparing an outcome variable (dependent variable)
# here we see that men have higher rates of attrition than women in most
# categories except for when job_level = "Director" or when the person
# works in HR
employee_attrition[,1:4] %>%
  plot_group_split(split_on = gender, type = "dv", dv = attrition)

# type = "count" is used to compare raw volume differences between two groups
# here we see that there are more men than women in each of these areas
employee_attrition[,2:4] %>%
  plot_group_split(split_on = gender, type = "count")

# type = "percent_field" is used when comparing the distribution of one
# demographic vs another. A good example would be pre- vs post-COVID
# closures. In this example, more men are in intern and director roles
# than the other categories
employee_attrition[,2:4] %>%
  plot_group_split(split_on = gender, type = "percent_field")

# type = "percent_factor" is used when comparing the representation of two
# groups vs how they are represented in the overall data. In this example
# we can see that men make up ~60% of the observations but 65% of the
# director positions and 52% of the senior positions, while women have the
# inverse at 35% and 48% respectively
employee_attrition[,2:4] %>%
  plot_group_split(split_on = gender, type = "percent_factor")
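
The difference between "percent_field" and "percent_factor" can be sketched with base R tables. This is one reading of the descriptions above, not the package's implementation: the `toy` data frame is invented for the example, and `prop.table()` is used here only to show which direction the percentages are taken in.

```r
# Toy data (hypothetical, not the employee_attrition data set)
toy <- data.frame(
  gender    = c("M", "M", "M", "F", "F", "M", "F", "M", "F", "M"),
  job_level = c("Director", "Senior", "Senior", "Senior", "Director",
                "Intern", "Intern", "Senior", "Senior", "Intern"),
  stringsAsFactors = FALSE
)

# Rows are factor levels of the field, columns are the two groups
tab <- table(toy$job_level, toy$gender)

# "percent_field" idea: within each group, the share of each job level
# (each column sums to 1)
pct_field <- prop.table(tab, margin = 2)

# "percent_factor" idea: within each job level, the share of each group
# (each row sums to 1)
pct_factor <- prop.table(tab, margin = 1)

print(round(pct_field, 2))
print(round(pct_factor, 2))
```

Under this reading, "percent_field" answers "how is each group distributed across the levels of a field?", while "percent_factor" answers "how much of each level does each group make up?".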