Identify the factors with the most extreme averages for each field
summarize_factor_extremes(...)
Arguments

| Argument | Description |
|---|---|
| `...` | Arguments passed on to `refactor_columns`: |
| `df` | dataframe to evaluate |
| `dv` | dependent variable to use (column name) |
| `split_on` | variable to split data / group by |
| `id_col` | field to use as ID |
| `n_cat` | for categorical variables, the maximum number of unique values to keep. This field feeds the `forcats::fct_lump(n = )` argument. |
| `collapse_by` | how `n_cat` should collapse levels: `"dv"` collapses by distance to the grand mean of the dependent variable, leaving the extremes as-is and grouping factors closer to the grand mean into "other"; `"n"` collapses by group size instead. |
| `n_quantile` | for numeric/date fields, the number of quantiles used to split the data into a factor. Fields with fewer unique values than this will not be changed. |
| `n_digits` | for numeric fields, the number of digits to keep in the breaks, e.g. [1.2345 to 2.3456] will be [1.23 to 2.34] if `n_digits = 2` |
| `avg_type` | `"mean"` or `"median"` |
| `ignore_cols` | columns to exclude from the analysis. Good candidates are fields that have no duplicate values (primary keys) or fields with a large proportion of null values. |
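To illustrate how these arguments fit together, here is a hedged sketch of a call using `ggplot2::mpg` (the dataset the example output below appears to come from). The argument names are taken from the table above; the values shown are illustrative assumptions, and the exact defaults or quoting rules (e.g. whether `dv` is passed bare or as a string) may differ in the installed package version.

```r
library(ggplot2)  # for the mpg dataset

# Hypothetical call: argument names come from the Arguments table above;
# values are assumptions for illustration only.
summarize_factor_extremes(
  df = mpg,           # dataframe to evaluate
  dv = hwy,           # dependent variable (column name)
  n_cat = 10,         # lump categoricals to at most 10 levels (forcats::fct_lump)
  collapse_by = "dv", # keep the extreme levels, collapse mid levels into "other"
  n_quantile = 10,    # cut numeric/date fields into up to 10 quantile bins
  n_digits = 2,       # digits kept in the numeric break labels
  avg_type = "mean",  # or "median"
  ignore_cols = NULL  # e.g. primary keys or mostly-null fields
)
```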
Examples
#> exploring: hwy
#> p-values from Kruskal-Wallis test
#> # A tibble: 10 x 8
#> p_value field lowest_avg highest_avg lowest_factor lowest_n highest_factor
#> <dbl> <fct> <dbl> <dbl> <chr> <int> <chr>
#> 1 1.17e-40 cty 14.8 31.2 01 [8.97 to ~ 25 06 [22 to 24.~
#> 2 5.71e-34 class 16.9 28.3 pickup 33 compact
#> 3 4.86e-32 cyl 17.6 28.8 8 70 4
#> 4 2.14e-31 displ 16.9 30.8 06 [4.3 to 4~ 29 01 [1.59 to 2~
#> 5 2.12e-29 drv 19.2 28.2 4 103 f
#> 6 1.22e-21 manufa~ 17.6 32.6 jeep 8 honda
#> 7 9 e-20 model 15.3 32.8 ram 1500 pic~ 10 new beetle
#> 8 1.42e- 6 fl 13.2 25.2 e 8 p
#> 9 2.99e- 5 trans 20 26.3 auto(l6) 6 manual(m5)
#> 10 5.37e- 1 year 23.4 23.5 1999 117 2008
#> # ... with 1 more variable: highest_n <int>
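The `p_value` column above comes from a per-field Kruskal-Wallis test, as the output header states. A minimal base-R sketch of what one row's test amounts to, again assuming `ggplot2::mpg` (this reproduces the idea with `stats::kruskal.test`, not the package's internal code):

```r
library(ggplot2)  # for the mpg dataset

# Kruskal-Wallis rank-sum test of hwy across the levels of one field (class).
# A small p-value indicates hwy distributions differ across vehicle classes.
kt <- kruskal.test(hwy ~ class, data = mpg)
kt$p.value
```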