Convert all fields to factors
refactor_columns(
  df,
  dv,
  split_on = NA_character_,
  id_col = NULL,
  n_cat = 10,
  collapse_by = c("dv", "n"),
  n_quantile = 10,
  n_digits = 2,
  avg_type = c("mean", "median"),
  ignore_cols = NULL
)
Argument | Description |
---|---|
df | dataframe to evaluate |
dv | dependent variable to use (column name) |
split_on | variable to split data / group by |
id_col | field to use as ID |
n_cat | for categorical variables, the max number of unique values to keep. This field feeds the |
collapse_by | should |
n_quantile | for numeric/date fields, the number of quantiles used to split the data into a factor. Fields with fewer unique values than this will not be changed. |
n_digits | for numeric fields, the number of digits to keep in the breaks. ex: [1.2345 to 2.3456] will be [1.23 to 2.34] if |
avg_type | mean or median |
ignore_cols | columns to ignore from analysis. Good candidates are fields that have no duplicate values (primary keys) or fields with a large proportion of null values |
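
To illustrate what capping a categorical field at `n_cat` values can look like, here is a minimal base R sketch. It assumes the cap is applied by frequency, keeping the most common values and lumping the rest into a single level. The helper name `collapse_levels` and the `"Other"` label are hypothetical, and this is not the package's implementation.

```r
# Hypothetical helper (not part of the package): keep the n_cat most
# frequent values of a categorical vector and lump the rest together.
collapse_levels <- function(x, n_cat = 10, other = "Other") {
  x <- as.character(x)

  # Frequency of each distinct value, most common first
  counts <- sort(table(x), decreasing = TRUE)

  # Nothing to collapse when the field is already within the limit
  if (length(counts) <= n_cat) return(factor(x))

  # Values to keep; every other non-missing value becomes `other`
  keep <- names(counts)[seq_len(n_cat)]
  x[!is.na(x) & !(x %in% keep)] <- other

  factor(x, levels = c(keep, other))
}

# Usage: four distinct values reduced to the two most common plus "Other"
x <- c(rep("a", 5), rep("b", 3), rep("c", 2), "d")
table(collapse_levels(x, n_cat = 2))
```

Whether the lumped level counts toward the `n_cat` limit is a design choice. This sketch keeps `n_cat` original values and adds one extra level.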
refactor_columns(df = iris, dv = Sepal.Length)
#> # A tibble: 150 x 7
#>    y_outcome y_split unique_id Sepal.Width      Petal.Length Petal.Width Species
#>  *     <dbl> <chr>       <int> <chr>            <chr>        <chr>       <chr>
#>  1       5.1 1               1 07 [3.44 to 3.6~ 01 [0.99 to~ 01 [0.09 t~ setosa
#>  2       4.9 1               2 05 [2.96 to 3.2) 01 [0.99 to~ 01 [0.09 t~ setosa
#>  3       4.7 1               3 06 [3.2 to 3.44) 01 [0.99 to~ 01 [0.09 t~ setosa
#>  4       4.6 1               4 05 [2.96 to 3.2) 01 [0.99 to~ 01 [0.09 t~ setosa
#>  5       5   1               5 07 [3.44 to 3.6~ 01 [0.99 to~ 01 [0.09 t~ setosa
#>  6       5.4 1               6 08 [3.68 to 3.9~ 02 [1.59 to~ 02 [0.34 t~ setosa
#>  7       4.6 1               7 06 [3.2 to 3.44) 01 [0.99 to~ 01 [0.09 t~ setosa
#>  8       5   1               8 06 [3.2 to 3.44) 01 [0.99 to~ 01 [0.09 t~ setosa
#>  9       4.4 1               9 04 [2.72 to 2.9~ 01 [0.99 to~ 01 [0.09 t~ setosa
#> 10       4.9 1              10 05 [2.96 to 3.2) 01 [0.99 to~ 01 [0.09 t~ setosa
#> # ... with 140 more rows
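
In the example output, the `Sepal.Width` breaks (2.72, 2.96, 3.2, 3.44, 3.68) are evenly spaced, which is consistent with equal-width bins. The base R sketch below shows one way labels of the form `NN [lower to upper)` could be built. `bin_numeric` is a hypothetical helper, not the package's implementation. Truncating break values rather than rounding them is an assumption based on the `n_digits` example in the arguments.

```r
# Hypothetical helper (not part of the package): split a numeric vector
# into n_bins equal-width intervals labelled "NN [lower to upper)".
bin_numeric <- function(x, n_bins = 10, n_digits = 2) {
  # Fewer distinct values than bins: return the vector unchanged
  if (length(unique(x)) < n_bins) return(x)

  # n_bins + 1 evenly spaced break points across the observed range
  rng <- range(x, na.rm = TRUE)
  breaks <- seq(rng[1], rng[2], length.out = n_bins + 1)

  # Bin index per value; the top bin also includes the maximum
  idx <- findInterval(x, breaks, rightmost.closed = TRUE)

  # Truncate (rather than round) break values to n_digits decimals
  trunc_to <- function(v) trunc(v * 10^n_digits) / 10^n_digits
  lower <- trunc_to(breaks[idx])
  upper <- trunc_to(breaks[idx + 1])

  # Zero-padded bin number keeps the labels sortable as text
  sprintf("%02d [%s to %s)", idx, lower, upper)
}

# Usage: 0..100 split into ten bins of width 10
head(bin_numeric(0:100, n_bins = 10), 3)

# On real data, labels can differ from the package output in the last
# digit because of floating-point representation of the break points
head(bin_numeric(iris$Sepal.Width))
```

The zero-padded prefix means the labels sort in bin order even when stored as character, which matches the `<chr>` columns in the output above.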