Convert all fields to factors

refactor_columns(
  df,
  dv,
  split_on = NA_character_,
  id_col = NULL,
  n_cat = 10,
  collapse_by = c("dv", "n"),
  n_quantile = 10,
  n_digits = 2,
  avg_type = c("mean", "median"),
  ignore_cols = NULL
)

Arguments

df

dataframe to evaluate

dv

dependent variable to use (column name)

split_on

variable to split data / group by

id_col

field to use as ID

n_cat

for categorical variables, the max number of unique values to keep. This field feeds the forcats::fct_lump(n = ) argument.
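
As a rough illustration of the lumping that n_cat controls, the sketch below calls forcats::fct_lump() directly on a toy factor. The vector x and the choice n = 3 are made up for this example; how refactor_columns() passes its arguments to forcats internally is not shown here.

```r
library(forcats)

# Toy factor (hypothetical data): five levels with unequal frequencies
x <- factor(c(rep("a", 5), rep("b", 4), rep("c", 3), "d", "e"))

# Keep the 3 most frequent levels; the remaining levels are merged
# into forcats' default catch-all level "Other"
lumped <- fct_lump(x, n = 3)

table(lumped)
```

Here the two rare levels "d" and "e" end up in a single "Other" level, so the factor goes from five levels to four.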

collapse_by

how n_cat collapses levels. "dv" collapses by distance to the grand mean: the extremes are left as is and levels closer to the grand mean are grouped as "other". "n" collapses by size.
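
One possible reading of the "dv" option, sketched in base R: compute each level's mean of the dependent variable, measure its distance from the grand mean, keep the levels that are farthest away, and group the rest as "other". The data frame, the column names grp and dv, and n_keep are all hypothetical, and this is not taken from the package source.

```r
# Toy data (hypothetical): one categorical field and a numeric dependent variable
df <- data.frame(
  grp = c("a", "a", "b", "b", "c", "c", "d", "d", "e", "e"),
  dv  = c(1, 2, 9, 10, 5, 5, 6, 6, 4, 5),
  stringsAsFactors = FALSE
)
n_keep <- 2  # number of levels to keep, playing the role of n_cat

grand_mean  <- mean(df$dv)                                    # 5.3
level_means <- vapply(split(df$dv, df$grp), mean, numeric(1)) # mean of dv per level
dist        <- abs(level_means - grand_mean)                  # distance to the grand mean

# Keep the n_keep levels farthest from the grand mean, lump the rest
keep <- names(dist)[order(dist, decreasing = TRUE)][seq_len(n_keep)]
df$grp_collapsed <- ifelse(df$grp %in% keep, df$grp, "other")

table(df$grp_collapsed)
```

In this toy data, levels "a" and "b" sit far from the grand mean and are kept, while "c", "d" and "e" are grouped as "other".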

n_quantile

for numeric/date fields, the number of quantiles used to split the data into a factor. Fields that have fewer values than this will not be changed.
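
A minimal sketch of turning a numeric field into a factor with a fixed number of bins, using base R cut(). The break points in the example output at the end of this page (2.96, 3.2, 3.44, 3.68 for Sepal.Width) are evenly spaced, so the sketch uses equal-width intervals; whether refactor_columns() computes its bins this way is an assumption, and n_bins is a stand-in for n_quantile.

```r
x <- iris$Sepal.Width
n_bins <- 10  # stand-in for n_quantile

# Split the range of x into n_bins equal-width, left-closed intervals
binned <- cut(x, breaks = n_bins, right = FALSE)

nlevels(binned)
head(table(binned))
```

Every observation falls into exactly one interval, because cut() widens the outer breaks slightly when it is given a number of bins rather than explicit break points.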

n_digits

for numeric fields, the number of digits to keep in the breaks. Example: [1.2345 to 2.3456] becomes [1.23 to 2.34] if n_digits = 2.
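
In the example above, 2.3456 becomes 2.34 rather than 2.35, which suggests the digits are cut off rather than rounded. The helper below, truncate_digits(), is a hypothetical name used only to reproduce that arithmetic; it is not a function from the package.

```r
# Hypothetical helper: drop everything after n_digits decimal places
truncate_digits <- function(x, n_digits = 2) {
  scale <- 10^n_digits
  trunc(x * scale) / scale
}

truncate_digits(c(1.2345, 2.3456), n_digits = 2)
#> [1] 1.23 2.34
```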

avg_type

type of average to use: "mean" or "median"

ignore_cols

columns to exclude from the analysis. Good candidates are fields that have no duplicate values (primary keys) or fields with a large proportion of null values.

Examples

refactor_columns(df = iris, dv = Sepal.Length)
#> # A tibble: 150 x 7
#>    y_outcome y_split unique_id Sepal.Width      Petal.Length Petal.Width Species
#>  *     <dbl> <chr>       <int> <chr>            <chr>        <chr>       <chr>
#>  1       5.1 1               1 07 [3.44 to 3.6~ 01 [0.99 to~ 01 [0.09 t~ setosa
#>  2       4.9 1               2 05 [2.96 to 3.2) 01 [0.99 to~ 01 [0.09 t~ setosa
#>  3       4.7 1               3 06 [3.2 to 3.44) 01 [0.99 to~ 01 [0.09 t~ setosa
#>  4       4.6 1               4 05 [2.96 to 3.2) 01 [0.99 to~ 01 [0.09 t~ setosa
#>  5       5   1               5 07 [3.44 to 3.6~ 01 [0.99 to~ 01 [0.09 t~ setosa
#>  6       5.4 1               6 08 [3.68 to 3.9~ 02 [1.59 to~ 02 [0.34 t~ setosa
#>  7       4.6 1               7 06 [3.2 to 3.44) 01 [0.99 to~ 01 [0.09 t~ setosa
#>  8       5   1               8 06 [3.2 to 3.44) 01 [0.99 to~ 01 [0.09 t~ setosa
#>  9       4.4 1               9 04 [2.72 to 2.9~ 01 [0.99 to~ 01 [0.09 t~ setosa
#> 10       4.9 1              10 05 [2.96 to 3.2) 01 [0.99 to~ 01 [0.09 t~ setosa
#> # ... with 140 more rows