Identify the factors with the most extreme averages for each field
summarize_factor_extremes(...)
Arguments

...
  Arguments passed on to refactor_columns (see the example call after this
  list):

  df          data frame to evaluate
  dv          dependent variable to use (column name)
  split_on    variable to split / group the data by
  id_col      field to use as the ID
  n_cat       for categorical variables, the maximum number of unique values
              to keep; this feeds the forcats::fct_lump(n = ) argument
  collapse_by how n_cat should collapse levels: by distance to the grand
              mean of dv ("dv"), keeping the extreme levels as-is and
              lumping levels closer to the grand mean into "other", or by
              group size ("n")
  n_quantile  for numeric/date fields, the number of quantiles used to cut
              the data into a factor; fields with fewer than this many
              unique values are left unchanged
  n_digits    for numeric fields, the number of digits to keep in the break
              labels, e.g. [1.2345 to 2.3456] becomes [1.23 to 2.34] when
              n_digits = 2
  avg_type    the type of average to use: "mean" or "median"
  ignore_cols columns to exclude from the analysis; good candidates are
              fields with no duplicate values (primary keys) or fields with
              a large proportion of null values
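A minimal sketch of a call exercising these arguments; the mpg data set, the
chosen values, and whether dv takes a bare or quoted column name are
assumptions for illustration, not package defaults:

library(ggplot2)          # assumed only for the mpg example data

summarize_factor_extremes(
  df = mpg,               # data frame to evaluate
  dv = hwy,               # dependent variable (bare vs. quoted name is an assumption)
  n_cat = 10,             # keep at most 10 levels per categorical field
  collapse_by = "dv",     # lump levels nearest the grand mean of hwy into "other"
  n_quantile = 6,         # cut numeric/date fields into 6 quantile bins
  n_digits = 2,           # a break of [1.2345 to 2.3456] prints as [1.23 to 2.34]
  avg_type = "mean",      # compare means rather than medians
  ignore_cols = "model"   # example: drop a high-cardinality field
)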
Examples
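A plausible call matching the output shown below; the ggplot2::mpg data set
and reliance on default arguments are assumptions:

summarize_factor_extremes(ggplot2::mpg, dv = hwy)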
#> exploring: hwy
#> p-values from Kruskal-Wallis test
#> # A tibble: 10 x 8
#> p_value field lowest_avg highest_avg lowest_factor lowest_n highest_factor
#> <dbl> <fct> <dbl> <dbl> <chr> <int> <chr>
#> 1 1.17e-40 cty 14.8 31.2 01 [8.97 to ~ 25 06 [22 to 24.~
#> 2 5.71e-34 class 16.9 28.3 pickup 33 compact
#> 3 4.86e-32 cyl 17.6 28.8 8 70 4
#> 4 2.14e-31 displ 16.9 30.8 06 [4.3 to 4~ 29 01 [1.59 to 2~
#> 5 2.12e-29 drv 19.2 28.2 4 103 f
#> 6 1.22e-21 manufa~ 17.6 32.6 jeep 8 honda
#> 7 9 e-20 model 15.3 32.8 ram 1500 pic~ 10 new beetle
#> 8 1.42e- 6 fl 13.2 25.2 e 8 p
#> 9 2.99e- 5 trans 20 26.3 auto(l6) 6 manual(m5)
#> 10 5.37e- 1 year 23.4 23.5 1999 117 2008
#> # ... with 1 more variable: highest_n <int>
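Each row summarizes one candidate field, sorted by its Kruskal-Wallis
p-value: cty, class, and cyl separate hwy most strongly, while year
(p ≈ 0.54) barely shifts the average (23.4 vs. 23.5) and would be a
reasonable ignore_cols candidate.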