Pivots the data, summarizes factor frequencies by field, and generates the statistics used for plotting.

summarize_factors_all_fields(df, ...)

Arguments

df

dataframe to evaluate

...

Arguments passed on to refactor_columns

dv

dependent variable to use (column name)

split_on

variable to split data / group by

id_col

field to use as ID

n_cat

for categorical variables, the maximum number of unique values to keep. This value is passed to the n argument of forcats::fct_lump().

collapse_by

how n_cat collapses levels: either by distance to the grand mean ("dv"), which leaves the extremes as is and groups the factors closer to the grand mean as "other", or by size ("n")

n_quantile

for numeric/date fields, the number of quantiles used to split the data into a factor. Fields that have fewer values than this will not be changed.

n_digits

for numeric fields, the number of digits to keep in the breaks. For example, [1.2345 to 2.3456] will be [1.23 to 2.34] if n_digits = 2.

avg_type

mean or median

ignore_cols

columns to ignore from analysis. Good candidates are fields that have no duplicate values (primary keys) or fields with a large proportion of null values.
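The n_cat and n_quantile arguments above describe two refactoring steps: lumping rare categories together and cutting numeric fields into quantile bins. The base R sketch below illustrates both ideas. It is not the package's implementation: `lump_top_n()` and `bin_by_quantile()` are hypothetical helpers written for this example, and reading "fewer values" as "fewer unique values" is an assumption.

```r
# Hypothetical helper: keep the n most frequent levels and lump the rest
# into "Other" (similar in spirit to forcats::fct_lump(n = )).
lump_top_n <- function(x, n, other = "Other") {
  x <- as.character(x)
  counts <- sort(table(x), decreasing = TRUE)
  keep <- names(counts)[seq_len(min(n, length(counts)))]
  lumped <- ifelse(x %in% keep, x, other)
  # Only add the "Other" level when something was actually lumped
  factor(lumped, levels = c(keep, if (any(lumped == other)) other))
}

# Hypothetical helper: split a numeric vector into n_quantile bins.
# Assumption: vectors with fewer unique values than n_quantile are returned as is.
bin_by_quantile <- function(x, n_quantile) {
  if (length(unique(x)) < n_quantile) {
    return(x)
  }
  probs <- seq(0, 1, length.out = n_quantile + 1)
  breaks <- unique(quantile(x, probs = probs, na.rm = TRUE))
  cut(x, breaks = breaks, include.lowest = TRUE)
}

lumped <- lump_top_n(c("a", "a", "a", "b", "b", "c", "d"), n = 2)
binned <- bin_by_quantile(iris$Sepal.Length, n_quantile = 4)
unchanged <- bin_by_quantile(c(1, 2, 1, 2), n_quantile = 4)

table(lumped)
table(binned)
```

Calling `unique()` on the quantile breaks guards against duplicate break points, which `cut()` would otherwise reject when many observations share the same value.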

Details

The list option includes the original min/max of the data and the grand average.
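To make these quantities concrete, here is a minimal base R sketch that computes the original min/max, the grand average, and the per-level averages for one field of iris. The variable names are illustrative only and are not part of the package.

```r
dv <- iris$Sepal.Length

# Original range of the dependent variable and its grand average
dv_range <- range(dv)
grand_avg <- mean(dv)

# Average of the dependent variable and row count within each level of one field
factor_avg <- tapply(dv, iris$Species, mean)
n <- table(iris$Species)

# Distance of each level from the grand average, closest first; levels near
# the top are the ones a collapse by distance to the grand mean would group
dist_to_grand <- abs(factor_avg - grand_avg)
dist_to_grand <- dist_to_grand[order(dist_to_grand)]

print(grand_avg)
print(factor_avg)
```

With avg_type set to median, the same sketch would use `median` in place of `mean`.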

Examples

summarize_factors_all_fields(df = iris, dv = Sepal.Length)
#> # A tibble: 26 x 9
#>    field   value factor_avg     n field_p_value method statistic    df grand_avg
#>  * <fct>   <chr>      <dbl> <int>         <dbl> <chr>      <dbl> <int>     <dbl>
#>  1 Sepal.~ 02 [~       5.31     7      2.76e- 2 Krusk~      18.7     9      5.84
#>  2 Sepal.~ 03 [~       5.89    22      2.76e- 2 Krusk~      18.7     9      5.84
#>  3 Sepal.~ 04 [~       6.22    24      2.76e- 2 Krusk~      18.7     9      5.84
#>  4 Sepal.~ 05 [~       6.02    37      2.76e- 2 Krusk~      18.7     9      5.84
#>  5 Sepal.~ 06 [~       5.69    31      2.76e- 2 Krusk~      18.7     9      5.84
#>  6 Sepal.~ 07 [~       5.26    10      2.76e- 2 Krusk~      18.7     9      5.84
#>  7 Sepal.~ 08 [~       5.75    11      2.76e- 2 Krusk~      18.7     9      5.84
#>  8 Petal.~ 01 [~       4.98    37      1.79e-22 Krusk~      121.     8      5.84
#>  9 Petal.~ 02 [~       5.07    13      1.79e-22 Krusk~      121.     8      5.84
#> 10 Petal.~ 05 [~       5.49     8      1.79e-22 Krusk~      121.     8      5.84
#> # ... with 16 more rows
# similar to other functions, you can see the attributes
summarize_factors_all_fields(df = iris, dv = Sepal.Length) %>%
  attr("about")
#> $avg_type
#> [1] "mean"
#>
#> $avg_fn
#> function (x, ...)
#> UseMethod("mean")
#> <bytecode: 0x0000000015997b68>
#> <environment: namespace:base>
#>
#> $dv
#> [1] "Sepal.Length"
#>
#> $dv_binary
#> [1] FALSE
#>
#> $grand_avg
#> [1] 5.843333
#>
#> $field_types
#> Sepal.Length  Sepal.Width Petal.Length  Petal.Width      Species    unique_id
#>    "numeric"    "numeric"    "numeric"    "numeric"     "factor"    "integer"
#>
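The field_p_value, method, statistic and df columns in the output above look like the result of a Kruskal-Wallis test run once per field (the method column is truncated to "Krusk~"). Assuming that is the case, the sketch below shows how one such row could be produced with `stats::kruskal.test()` for a single field. The `field_stats` data frame is illustrative and is not the function's actual return value.

```r
# Kruskal-Wallis rank sum test of the dependent variable across the levels of one field
kw <- kruskal.test(Sepal.Length ~ Species, data = iris)

# Collect the pieces that resemble the columns in the output above
field_stats <- data.frame(
  field = "Species",
  field_p_value = kw$p.value,
  method = kw$method,
  statistic = unname(kw$statistic),
  df = unname(kw$parameter)
)

print(field_stats)
```

The degrees of freedom equal the number of levels minus one, which is consistent with df values of 9 and 8 for fields split into ten and nine groups.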