Calculate the importance of ALL factors in ALL fields - summarize_factors_all

Pivots the data, summarizes factor frequencies by field, and generates the statistics used for plotting.

summarize_factors_all_fields(df, ...)

Arguments

df	dataframe to evaluate
...	Arguments passed on to refactor_columns:

dv: dependent variable to use (column name)
split_on: variable to split the data on (group by)
id_col: field to use as the ID
n_cat: for categorical variables, the maximum number of unique values to keep. This value feeds the forcats::fct_lump(n = ) argument.
collapse_by: how n_cat collapses levels. "dv" collapses by distance to the grand mean, leaving the extremes as is and grouping the factors closer to the grand mean as "other"; "n" collapses by size.
n_quantile: for numeric/date fields, the number of quantiles used to split the data into a factor. Fields that have fewer than this amount will not be changed.
n_digits: for numeric fields, the number of digits to keep in the breaks. Example: [1.2345 to 2.3456] will be [1.23 to 2.34] if n_digits = 2.
avg_type: mean or median
ignore_cols: columns to ignore in the analysis. Good candidates are fields that have no duplicate values (primary keys) or fields with a large proportion of null values.
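
The effect of n_cat, n_quantile and n_digits can be illustrated outside the package. The following is a minimal sketch, not the implementation of refactor_columns: the toy vector `colors`, the helper `bin_numeric()`, and the reading of "fewer than this amount" as "fewer distinct values" are all assumptions made for illustration. Only forcats::fct_lump() and base R functions are used.

```r
library(forcats)

# n_cat: keep the n most frequent levels and lump the rest into "Other".
# `colors` is hypothetical toy data.
colors <- factor(c("red", "red", "red", "blue", "blue", "green", "pink"))
lumped <- fct_lump(colors, n = 2)
table(lumped)

# n_quantile / n_digits: split a numeric field into quantile-based bins
# and round the break values shown in the labels.
# bin_numeric() is a hypothetical helper, not part of the package.
bin_numeric <- function(x, n_quantile = 4, n_digits = 2) {
  # Assumption: a field with too few distinct values is left unchanged.
  if (length(unique(x)) < n_quantile) {
    return(x)
  }
  probs  <- seq(0, 1, length.out = n_quantile + 1)
  breaks <- unique(quantile(x, probs = probs, na.rm = TRUE))
  lower  <- round(head(breaks, -1), n_digits)
  upper  <- round(tail(breaks, -1), n_digits)
  labels <- paste0("[", lower, " to ", upper, "]")
  cut(x, breaks = breaks, labels = labels, include.lowest = TRUE)
}

binned <- bin_numeric(iris$Sepal.Width, n_quantile = 4, n_digits = 1)
table(binned)
```

Rounding is applied to the labels only, so the underlying breaks still cover every observation. The package may order, label, or collapse the bins differently.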

Details

The list option includes the original min/max of the data and the grand average.
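
The grand average and the original min/max mentioned above can be reproduced in base R. This is a minimal sketch, assuming dv = Sepal.Length and avg_type = "mean"; the variable names are illustrative and this is not the package's internal code.

```r
# Dependent variable, as in the examples on this page
dv <- iris$Sepal.Length

# Grand average: the average of the dependent variable over all rows
grand_avg <- mean(dv)

# Original min/max of the dependent variable
dv_range <- range(dv)

# Average per factor level for one field (here Species), comparable in
# spirit to the factor_avg column in the output
factor_avg <- tapply(dv, iris$Species, mean)

# Distance of each level's average from the grand average
round(factor_avg - grand_avg, 3)
```

With avg_type = "median", `median()` would take the place of `mean()` in both computations.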

Examples

summarize_factors_all_fields(df = iris, dv = Sepal.Length)
#> # A tibble: 26 x 9
#>    field   value factor_avg     n field_p_value method statistic    df grand_avg
#>  * <fct>   <chr>      <dbl> <int>         <dbl> <chr>      <dbl> <int>     <dbl>
#>  1 Sepal.~ 02 [~       5.31     7      2.76e- 2 Krusk~      18.7     9      5.84
#>  2 Sepal.~ 03 [~       5.89    22      2.76e- 2 Krusk~      18.7     9      5.84
#>  3 Sepal.~ 04 [~       6.22    24      2.76e- 2 Krusk~      18.7     9      5.84
#>  4 Sepal.~ 05 [~       6.02    37      2.76e- 2 Krusk~      18.7     9      5.84
#>  5 Sepal.~ 06 [~       5.69    31      2.76e- 2 Krusk~      18.7     9      5.84
#>  6 Sepal.~ 07 [~       5.26    10      2.76e- 2 Krusk~      18.7     9      5.84
#>  7 Sepal.~ 08 [~       5.75    11      2.76e- 2 Krusk~      18.7     9      5.84
#>  8 Petal.~ 01 [~       4.98    37      1.79e-22 Krusk~     121.      8      5.84
#>  9 Petal.~ 02 [~       5.07    13      1.79e-22 Krusk~     121.      8      5.84
#> 10 Petal.~ 05 [~       5.49     8      1.79e-22 Krusk~     121.      8      5.84
#> # ... with 16 more rows

# as with other functions, the result carries an "about" attribute you can inspect
summarize_factors_all_fields(df = iris, dv = Sepal.Length) %>% attr("about")
#> $avg_type
#> [1] "mean"
#> 
#> $avg_fn
#> function (x, ...) 
#> UseMethod("mean")
#> <bytecode: 0x0000000015997b68>
#> <environment: namespace:base>
#> 
#> $dv
#> [1] "Sepal.Length"
#> 
#> $dv_binary
#> [1] FALSE
#> 
#> $grand_avg
#> [1] 5.843333
#> 
#> $field_types
#> Sepal.Length  Sepal.Width Petal.Length  Petal.Width      Species    unique_id 
#>    "numeric"    "numeric"    "numeric"    "numeric"     "factor"    "integer" 
#>
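
The field_p_value, method, statistic and df columns in the first example describe one test per field. Assuming the truncated method value "Krusk~" refers to a Kruskal-Wallis rank sum test, a comparable test for a single field can be run with stats::kruskal.test(). This sketch uses the raw Species column, so its numbers are not expected to match the binned fields shown in the table.

```r
# One test per field: does the dependent variable differ across the
# levels of this field? (Assumed to mirror the "method" column.)
kw <- kruskal.test(Sepal.Length ~ Species, data = iris)

kw$method     # name of the test
kw$statistic  # test statistic
kw$parameter  # degrees of freedom (number of levels minus 1)
kw$p.value    # p-value for this field
```

A small p-value suggests the field's levels differ in the dependent variable, which is presumably why these statistics are repeated on every row of a given field.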