Identify the factors with the most extreme averages for each field
summarize_factor_extremes(...)
Arguments

...
  Arguments passed on to refactor_columns (see the example call after this
  list):

  df          data frame to evaluate
  dv          dependent variable to use (column name)
  split_on    variable to split / group the data by
  id_col      field to use as the ID
  n_cat       for categorical variables, the maximum number of unique values
              to keep; this feeds the forcats::fct_lump(n = ) argument
  collapse_by how n_cat should collapse levels: by distance to the grand
              mean of dv ("dv"), keeping the extreme levels as-is and
              lumping levels closer to the grand mean into "other", or by
              group size ("n")
  n_quantile  for numeric/date fields, the number of quantiles used to cut
              the data into a factor; fields with fewer than this many
              unique values are left unchanged
  n_digits    for numeric fields, the number of digits to keep in the break
              labels, e.g. [1.2345 to 2.3456] becomes [1.23 to 2.34] when
              n_digits = 2
  avg_type    the type of average to use: "mean" or "median"
  ignore_cols columns to exclude from the analysis; good candidates are
              fields with no duplicate values (primary keys) or fields with
              a large proportion of null values
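A minimal sketch of a call exercising these arguments; the mpg data set, the
chosen values, and whether dv takes a bare or quoted column name are
assumptions for illustration, not package defaults:

library(ggplot2)          # assumed only for the mpg example data

summarize_factor_extremes(
  df = mpg,               # data frame to evaluate
  dv = hwy,               # dependent variable (bare vs. quoted name is an assumption)
  n_cat = 10,             # keep at most 10 levels per categorical field
  collapse_by = "dv",     # lump levels nearest the grand mean of hwy into "other"
  n_quantile = 6,         # cut numeric/date fields into 6 quantile bins
  n_digits = 2,           # a break of [1.2345 to 2.3456] prints as [1.23 to 2.34]
  avg_type = "mean",      # compare means rather than medians
  ignore_cols = "model"   # example: drop a high-cardinality field
)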
Examples
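A plausible call matching the output shown below; the ggplot2::mpg data set
and reliance on default arguments are assumptions:

summarize_factor_extremes(ggplot2::mpg, dv = hwy)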
#> exploring: hwy
#> p-values from Kruskal-Wallis test
#> # A tibble: 10 x 8
#> p_value field lowest_avg highest_avg lowest_factor lowest_n highest_factor
#> <dbl> <fct> <dbl> <dbl> <chr> <int> <chr>
#> 1 1.17e-40 cty 14.8 31.2 01 [8.97 to ~ 25 06 [22 to 24.~
#> 2 5.71e-34 class 16.9 28.3 pickup 33 compact
#> 3 4.86e-32 cyl 17.6 28.8 8 70 4
#> 4 2.14e-31 displ 16.9 30.8 06 [4.3 to 4~ 29 01 [1.59 to 2~
#> 5 2.12e-29 drv 19.2 28.2 4 103 f
#> 6 1.22e-21 manufa~ 17.6 32.6 jeep 8 honda
#> 7 9 e-20 model 15.3 32.8 ram 1500 pic~ 10 new beetle
#> 8 1.42e- 6 fl 13.2 25.2 e 8 p
#> 9 2.99e- 5 trans 20 26.3 auto(l6) 6 manual(m5)
#> 10 5.37e- 1 year 23.4 23.5 1999 117 2008
#> # ... with 1 more variable: highest_n <int>
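Each row summarizes one candidate field, sorted by its Kruskal-Wallis
p-value: cty, class, and cyl separate hwy most strongly, while year
(p ≈ 0.54) barely shifts the average (23.4 vs. 23.5) and would be a
reasonable ignore_cols candidate.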