Title: Machine Learning Tools
Description: A collection of machine learning helper functions, particularly assisting in the Exploratory Data Analysis phase. Makes heavy use of the 'data.table' package for optimal speed and memory efficiency. Highlights include a versatile bin_data() function, sparsify() for converting a data.table to sparse matrix format with one-hot encoding, fast evaluation metrics, and empirical_cdf() for calculating empirical Multivariate Cumulative Distribution Functions.
Authors: Ben Gorman
Maintainer: Ben Gorman <[email protected]>
License: MIT + file LICENSE
Version: 0.3.6
Built: 2024-10-25 02:59:43 UTC
Source: https://github.com/ben519/mltools
alientest
A dataset describing features of living beings.
A data.table with 5 rows and 5 variables:
  SkinColor: Skin color of the individual
  IQScore: IQ score of the individual
  Cat1: Categorical descriptor
  Cat2: Categorical descriptor
  Cat3: Categorical descriptor

alientest <- data.table::data.table(
  SkinColor = c("white", "green", "brown", "white", "red"),
  IQScore = c(79, 100, 125, 90, 115),
  Cat1 = c("type4", "type4", "type3", "type1", "type1"),
  Cat2 = c("type5", "type5", "type9", "type8", "type2"),
  Cat3 = c("type2", "type2", "type7", "type4", "type4")
)
# usethis::use_data(alientest, overwrite = TRUE)
alientrain
A dataset describing features of living beings and whether or not they are an alien.
A data.table with 8 rows and 6 variables:
  SkinColor: Skin color of the individual
  IQScore: IQ score of the individual
  Cat1: Categorical descriptor
  Cat2: Categorical descriptor
  Cat3: Categorical descriptor
  IsAlien: Is this being an alien?

alientrain <- data.table::data.table(
  SkinColor = c("green", "white", "brown", "white", "blue", "white", "green", "white"),
  IQScore = c(300, 95, 105, 250, 115, 85, 130, 115),
  Cat1 = c("type1", "type1", "type2", "type4", "type2", "type4", "type1", "type1"),
  Cat2 = c("type1", "type2", "type6", "type5", "type7", "type5", "type2", "type1"),
  Cat3 = c("type4", "type4", "type11", "type2", "type11", "type2", "type4", "type4"),
  IsAlien = c(TRUE, FALSE, FALSE, TRUE, TRUE, FALSE, TRUE, FALSE)
)
# usethis::use_data(alientrain, overwrite = TRUE)
Calculates Area Under the ROC Curve
auc_roc(preds, actuals, returnDT = FALSE)
preds: A vector of prediction values
actuals: A vector of actual values (numeric or ordered factor)
returnDT: If TRUE, a data.table of (FalsePositiveRate, TruePositiveRate) pairs is returned; otherwise the AUC ROC score is returned
If returnDT=FALSE, returns Area Under the ROC Curve.
If returnDT=TRUE, returns a data.table object with False Positive Rate and True Positive Rate for plotting the ROC curve.
https://en.wikipedia.org/wiki/Receiver_operating_characteristic#Area_under_the_curve
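Because AUC equals the probability that a randomly chosen positive sample outranks a randomly chosen negative one, the score can be cross-checked with the rank-based (Mann-Whitney) formula below. This is a minimal sketch for intuition; manual_auc is a hypothetical helper, not part of mltools.

# Hypothetical cross-check of the AUC value via average ranks (handles ties)
manual_auc <- function(preds, actuals) {
  r  <- rank(preds)            # average ranks, so tied predictions share a rank
  n1 <- sum(actuals == 1)      # positive count
  n0 <- sum(actuals == 0)      # negative count
  (sum(r[actuals == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}
manual_auc(c(.1, .3, .3, .9), c(0, 0, 1, 1))  # 0.875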
library(data.table)
preds <- c(.1, .3, .3, .9)
actuals <- c(0, 0, 1, 1)
auc_roc(preds, actuals)
auc_roc(preds, actuals, returnDT=TRUE)
Takes a vector of values and bin parameters and maps each value to an ordered factor whose levels are a set of bins like [0,1), [1,2), [2,3).
Values may be provided as a vector or via a pair of parameters - a data.table object and the name of the column to bin.
bin_data(
  x = NULL,
  binCol = NULL,
  bins = 10,
  binType = "explicit",
  boundaryType = "lcro]",
  returnDT = FALSE,
  roundbins = FALSE
)
x: A vector of values or a data.table object
binCol: The name of the column of x to bin (used when x is a data.table)
bins: Number of bins to generate, or a vector of explicit bin boundaries like c(4, 5, 6, 7, 8)
binType: "explicit" (default) bins the range of values into equally spaced (or explicitly given) bins; "quantile" bins the values by quantile
boundaryType: "lcro]" (default) makes bins [left-closed, right-open) with the final bin closed on the right; "lcro)" leaves the final bin right-open as well
returnDT: If FALSE, return an ordered factor of bins corresponding to the values given, else return a data.table object which includes all bins and values (makes a copy of data.table object if given)
roundbins: Should bin values be rounded? (Only applicable for binType = "quantile")
This function can return two different types of output, depending on whether returnDT is TRUE or FALSE.
If returnDT=FALSE, returns an ordered factor vector of bins like [1, 2), [-3, -2), ... corresponding to the values which were binned, and whose levels correspond to all the generated bins. (Note that empty bins may be present as unused factor levels.)
If returnDT=TRUE, returns a data.table object with all values and all bins (including empty bins). If a data.table is provided instead of a vector of values, a full copy of it is created and merged with the set of generated bins.
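For intuition, the default equal-width "lcro]" behavior can be approximated with base R's findInterval(); a rough sketch under those assumptions (bin_data() itself additionally builds the factor labels and handles the other bin types):

x <- c(0, 0, 1, 2)
edges <- seq(min(x), max(x), length.out = 2 + 1)  # bins = 2 -> 3 boundaries
# rightmost.closed = TRUE gives [left, right) bins with the final bin
# closed on the right, matching boundaryType = "lcro]"
findInterval(x, edges, rightmost.closed = TRUE)   # 1 1 2 2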
library(data.table)
iris.dt <- data.table(iris)

# custom bins
bin_data(iris.dt, binCol="Sepal.Length", bins=c(4, 5, 6, 7, 8))

# 10 equally spaced bins
bin_data(iris$Petal.Length, bins=10, returnDT=TRUE)

# make the last bin [left-closed, right-open)
bin_data(c(0,0,1,2), bins=2, boundaryType="lcro)", returnDT=TRUE)

# bin values by quantile
bin_data(c(0,0,0,0,1,2,3,4), bins=4, binType="quantile", returnDT=TRUE)
Map a vector of dates to a factor of type "year", "yearquarter", "yearmonth", "quarter", or "month".
date_factor(
  dateVec,
  type = "yearmonth",
  minDate = min(dateVec, na.rm = TRUE),
  maxDate = max(dateVec, na.rm = TRUE)
)
dateVec: A vector of date values
type: One of "year", "yearquarter", "yearmonth", "quarter", "month"
minDate: (Default = min(dateVec)) When determining factor levels, use this date to set the min level, after coercing dates to the specified type
maxDate: (Default = max(dateVec)) When determining factor levels, use this date to set the max level (see minDate, above)
The resulting vector is an ordered factor of the specified type (e.g. yearmonth).
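A hedged sketch of the grouping logic for type = "yearquarter"; the exact level labels produced by date_factor() may differ, and date_factor() also generates every intermediate level between minDate and maxDate:

dts <- as.Date(c("2014-01-01", "2015-01-15", "2015-06-01"))
# map each month to its quarter, then build an ordered factor (labels illustrative)
yq <- paste0(format(dts, "%Y"), " Q", (as.integer(format(dts, "%m")) - 1) %/% 3 + 1)
factor(yq, levels = sort(unique(yq)), ordered = TRUE)  # 2014 Q1 < 2015 Q1 < 2015 Q2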
library(data.table)
dts <- as.Date(c("2014-1-1", "2015-1-15", "2015-6-1"))
date_factor(dts, type = "yearmonth")
date_factor(dts, type = "yearquarter")
date_factor(
  dateVec = dts,
  type = "yearquarter",
  minDate = as.Date("2015-1-1"),
  maxDate = as.Date("2015-12-31")
)
date_factor(
  dateVec = as.Date(character(0)),
  type = "yearmonth",
  minDate = as.Date("2016-1-1"),
  maxDate = as.Date("2016-12-31")
)
Given a vector x, calculate P(x <= X) for a set of upper bounds X. Can be applied to a data.table object for multivariate use. That is, calculate P(x <= X, y <= Y, z <= Z, ...)
empirical_cdf(x, ubounds)
x: Numeric vector or a data.table object for multivariate use
ubounds: A vector of upper bounds on which to evaluate the CDF. For the multivariate version, a data.table whose names correspond to columns of x
Calculate the empirical CDF of a vector, or data.table with multiple columns for multivariate use.
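In the univariate case the quantity is simply the fraction of observations at or below each bound, so the first example below can be checked by hand:

x <- c(0.3, 1.3, 1.4, 3.6)
sapply(1:4, function(ub) mean(x <= ub))  # 0.25 0.75 0.75 1.00
# multivariate analogue for one (X, Y) bound: mean(x <= X & y <= Y)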
library(data.table)
dt <- data.table(x=c(0.3, 1.3, 1.4, 3.6), y=c(1.2, 1.2, 3.8, 3.9))
empirical_cdf(dt$x, ubounds=1:4)
empirical_cdf(dt, ubounds=CJ(x = 1:4, y = 1:4))
(Experimental) Automated Exploratory Data Analysis
explore_dataset(dt1, dt2 = NULL, targetCol = NULL, verbose = FALSE)
dt1: Dataset to analyze
dt2: (Optional) second dataset to analyze, with the same columns as dt1
targetCol: Name of the column you're trying to model/predict
verbose: Should the exploratory process steps be displayed?
Experimental. Evaluates and summarizes the data in every column of a data.table. Can identify columns with hierarchical structure and columns with perfectly correlated values.
library(data.table)
explore_dataset(alientrain)
Generate exponential weights
exponential_weight(k, base = exp(1), offset = 0, slope = 0.1)
k: Vector of values (e.g. time indexes) at which to evaluate the weight formula 1-base^(offset-slope*k)
base: Base of the exponential term in 1-base^(offset-slope*k)
offset: Offset term in 1-base^(offset-slope*k)
slope: Slope term in 1-base^(offset-slope*k)
Returns a weight based on the formula 1-base^(offset-slope*k)
exponential_weight(1:3, slope=.1)
exponential_weight(1:3, slope=1)
exponential_weight(1:3, slope=10)
Map an object x into equal (or nearly equal) size folds.
If x is a positive integer, a vector of FoldIDs of length x is returned.
If x is a vector, a matching vector of FoldIDs is returned.
If x is a data.table, a list of partitions of x is returned.
folds(x, nfolds = 5L, stratified = FALSE, seed = NULL)
x: A positive integer, a vector of values, or a data.table object
nfolds: How many folds?
stratified: If x is a vector, TRUE or FALSE indicating whether the folds should split the classes of x proportionally. If x is a data.table, the name of a column to stratify on
seed: Random number seed
Convenient method for mapping an object into equal size folds, potentially with stratification
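A quick way to confirm that stratification preserved class balance in the examples below: tabulate the returned fold IDs against the stratified vector, and each fold should contain roughly the same mix of classes.

f <- folds(alientrain$IsAlien, nfolds = 2, stratified = TRUE, seed = 2016)
table(FoldID = f, IsAlien = alientrain$IsAlien)  # balanced TRUE/FALSE per fold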
library(data.table)
folds(8, nfolds=2)
folds(alientrain$IsAlien, nfolds=2)
folds(alientrain$IsAlien, nfolds=2, stratified=TRUE, seed=2016)
folds(alientrain$IQScore, nfolds=2, stratified=TRUE, seed=2016)
folds(alientrain, nfolds=2, stratified="IsAlien", seed=2016)
Generate geometric weights
geometric_weight(k, n, r = 1)
k: Vector of values (e.g. time indexes) at which to evaluate the weight formula r^k/sum(r^seq_len(n))
n: Length of the weight sequence in r^k/sum(r^seq_len(n))
r: Common ratio in r^k/sum(r^seq_len(n))
Returns a weight based on the formula r^k/sum(r^seq_len(n)). The sequence of weights for k=1, 2, ..., n sum to 1
geometric_weight(1:3, n=3, r=1)
geometric_weight(1:3, n=3, r=.5)
geometric_weight(1:3, n=3, r=2)
Calculates group-weighted gini impurities using pairs of columns within a dataset. Can be used to locate hierarchical data or 1-1 correspondences.
gini_impurities(dt, wide = FALSE, verbose = FALSE)
dt: A data.table with at least two columns
wide: Should the results be in wide format?
verbose: Should progress be printed to the screen?
For pairs of columns (Var1, Var2) in a dataset, calculates the weighted gini impurity of Var2 relative to the groups determined by Var1
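A sketch of the quantity for a single (Var1, Var2) pair, using the package's gini_impurity() (documented below); weighted_gini is a hypothetical helper, and gini_impurities() computes this for every ordered pair of columns:

weighted_gini <- function(var1, var2) {
  groups <- split(var2, var1)  # partition Var2 by the groups of Var1
  # group-size-weighted sum of within-group gini impurities
  sum(vapply(groups, function(g) length(g) / length(var2) * gini_impurity(g), numeric(1)))
}
weighted_gini(alientrain$Cat1, alientrain$Cat3)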
library(data.table)
gini_impurities(alientrain)
gini_impurities(alientrain, wide=TRUE)
Calculates the Gini Impurity of a set
gini_impurity(vals)
vals: A vector of values. Values can be given as raw instances like c("red", "red", "blue", "green") or as a named vector of class frequencies like c(red=2, blue=1, green=1)
Gini Impurity is a measure of how often a randomly chosen element from a set would be incorrectly labeled if it was randomly labeled according to the distribution of labels in the set.
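Worked by hand for the frequency example below: with class proportions p, the Gini impurity is 1 - sum(p^2).

p <- c(red = 2, blue = 1, green = 1) / 4  # class proportions
1 - sum(p^2)                              # 1 - (0.25 + 0.0625 + 0.0625) = 0.625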
gini_impurity(c("red", "red", "blue", "green")) gini_impurity(c(red=2, blue=1, green=1))
gini_impurity(c("red", "red", "blue", "green")) gini_impurity(c(red=2, blue=1, green=1))
Calculate Matthews correlation coefficient
mcc(
  preds = NULL,
  actuals = NULL,
  TP = NULL,
  FP = NULL,
  TN = NULL,
  FN = NULL,
  confusionM = NULL
)
preds: A vector of prediction values, or a data.frame or matrix of TRUE/FALSE or 1/0 whose columns correspond to the possible classes
actuals: A vector of actual values, or a data.frame or matrix of TRUE/FALSE or 1/0 whose columns correspond to the possible classes
TP: Count of true positives (correctly predicted 1/TRUE)
FP: Count of false positives (predicted 1/TRUE, but actually 0/FALSE)
TN: Count of true negatives (correctly predicted 0/FALSE)
FN: Count of false negatives (predicted 0/FALSE, but actually 1/TRUE)
confusionM: Confusion matrix whose (i,j) element represents the number of samples with predicted class i and true class j
Calculate Matthews correlation coefficient. Provide either preds and actuals, or TP, FP, TN, and FN, or confusionM.
https://en.wikipedia.org/wiki/Matthews_correlation_coefficient
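In the binary case the coefficient reduces to the standard formula below; this is a hand-check of the TP/FP/TN/FN call in the examples, not the implementation of mcc() itself.

TP <- 3; FP <- 2; TN <- 2; FN <- 1
(TP * TN - FP * FN) /
  sqrt((TP + FP) * (TP + FN) * (TN + FP) * (TN + FN))  # ~0.258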
preds <- c(1,1,1,0,1,1,0,0)
actuals <- c(1,1,1,1,0,0,0,0)
mcc(preds, actuals)
mcc(actuals, actuals)
mcc(TP=3, FP=2, TN=2, FN=1)

# Multiclass
preds <- data.frame(
  setosa = rnorm(n = 150),
  versicolor = rnorm(n = 150),
  virginica = rnorm(n = 150)
)
preds <- preds == apply(preds, 1, max)
actuals <- data.frame(
  setosa = rnorm(n = 150),
  versicolor = rnorm(n = 150),
  virginica = rnorm(n = 150)
)
actuals <- actuals == apply(actuals, 1, max)
mcc(preds = preds, actuals = actuals)

# Confusion matrix
mcc(confusionM = matrix(c(0,3,3,3,0,3,3,3,0), nrow = 3))
mcc(confusionM = matrix(c(1,0,0,0,1,0,0,0,1), nrow = 3))
Calculate Mean-Square Error (Deviation)
For the ith sample, Squared Error is calculated as SE = (prediction - actual)^2. MSE is then mean(squared errors).
mse(preds = NULL, actuals = NULL, weights = 1, na.rm = FALSE)
preds: A vector of prediction values
actuals: A vector of actual values
weights: Optional vector of weights
na.rm: Should (prediction, actual) pairs with at least one NA value be ignored?
Calculate Mean-Square Error (Deviation)
https://en.wikipedia.org/wiki/Mean_squared_error
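A minimal sketch of the computation, assuming weights enter as a weighted mean (consistent with the weights argument above):

preds <- c(1.0, 2.0, 9.5)
actuals <- c(0.9, 2.1, 10.0)
weighted.mean((preds - actuals)^2, w = rep(1, length(preds)))  # 0.09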
preds <- c(1.0, 2.0, 9.5)
actuals <- c(0.9, 2.1, 10.0)
mse(preds, actuals)
Calculate Mean-Square-Logarithmic Error (Deviation)
For the ith sample, Squared Logarithmic Error is calculated as SLE = (log(prediction + alpha) - log(actual + alpha))^2. MSLE is then mean(squared logarithmic errors). alpha (1 by default) can be used to prevent taking log(0) for data that contains non-positive values.
msle(preds = NULL, actuals = NULL, weights = 1, na.rm = FALSE, alpha = 1)
preds: A vector of prediction values
actuals: A vector of actual values
weights: Optional vector of weights
na.rm: Should (prediction, actual) pairs with at least one NA value be ignored?
alpha: (Default = 1) See the formula details. Primary purpose is to prevent taking log(0)
Calculate Mean-Square-Logarithmic Error (Deviation)
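A sketch of the same quantity showing the role of alpha: shifting predictions and actuals by alpha before taking logs keeps log() defined when the data contain zeros.

preds <- c(1.0, 2.0, 9.5)
actuals <- c(0.9, 2.1, 10.0)
alpha <- 1
mean((log(preds + alpha) - log(actuals + alpha))^2)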
preds <- c(1.0, 2.0, 9.5)
actuals <- c(0.9, 2.1, 10.0)
msle(preds, actuals)
One-Hot-Encode unordered factor columns of a data.table
one_hot(
  dt,
  cols = "auto",
  sparsifyNAs = FALSE,
  naCols = FALSE,
  dropCols = TRUE,
  dropUnusedLevels = FALSE
)
dt: A data.table
cols: Which column(s) should be one-hot-encoded? Default = "auto" encodes all unordered factor columns
sparsifyNAs: Should NAs be converted to 0s?
naCols: Should columns be generated to indicate the presence of NAs? Will only apply to factor columns with at least one NA
dropCols: Should the resulting data.table exclude the original columns which are one-hot-encoded?
dropUnusedLevels: Should all-0 columns for unused factor levels be excluded?
One-hot-encoding converts an unordered categorical vector (i.e. a factor) to multiple binarized vectors where each binary vector of 1s and 0s indicates the presence of a class (i.e. level) of the original vector.
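A hand-rolled sketch of the encoding for a single factor column (not the data.table implementation one_hot() uses): each level becomes a 0/1 indicator, and NAs in the factor propagate to NAs in every indicator, matching the default sparsifyNAs = FALSE behavior.

color <- factor(c("red", NA, "blue", "blue"), levels = c("blue", "green", "red"))
sapply(levels(color), function(lvl) as.integer(color == lvl))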
library(data.table)
dt <- data.table(
  ID = 1:4,
  color = factor(c("red", NA, "blue", "blue"), levels=c("blue", "green", "red"))
)
one_hot(dt)
one_hot(dt, sparsifyNAs=TRUE)
one_hot(dt, naCols=TRUE)
one_hot(dt, dropCols=FALSE)
one_hot(dt, dropUnusedLevels=TRUE)
Scale a vector of values to the range [0, 1] based on rank/position
relative_position(vals)
vals: Vector of values
Values are ranked and then scaled to the range [0, 1]. Ties result in the same relative position (e.g. relative_position(c(1, 2, 2, 3)) returns the vector c(0.0, 0.5, 0.5, 1.0)). NAs remain as NAs.
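An equivalent base-R sketch (rel_pos is a hypothetical helper): average ranks rescaled to [0, 1], keeping NAs in place.

rel_pos <- function(vals) {
  r <- rank(vals, ties.method = "average", na.last = "keep")
  (r - 1) / (sum(!is.na(vals)) - 1)  # rescale ranks of non-NA values to [0, 1]
}
rel_pos(c(1, 2, 2, 3))   # 0.0 0.5 0.5 1.0
rel_pos(c(1, NA, 3, 4))  # 0.0  NA 0.5 1.0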
relative_position(1:10)
relative_position(c(1, 2, 2, 3))
relative_position(c(1, NA, 3, 4))
Convenience method for returning a copy of a vector such that NA values are substituted with a replacement value
replace_na(x, repl = "auto")
x: Vector of values
repl: What to substitute in place of NAs
Returns a copy of x such that NAs get replaced with a replacement value. The default replacement value is 0.
replace_na(c(1, NA, 1, 0))
Calculate Root-Mean-Square Error (Deviation)
For the ith sample, Squared Error is calculated as SE = (prediction - actual)^2. RMSE is then sqrt(mean(squared errors)).
rmse(preds = NULL, actuals = NULL, weights = 1, na.rm = FALSE)
preds: A vector of prediction values
actuals: A vector of actual values
weights: Optional vector of weights
na.rm: Should (prediction, actual) pairs with at least one NA value be ignored?
Calculate Root-Mean-Square Error (Deviation)
https://en.wikipedia.org/wiki/Root-mean-square_deviation
preds <- c(1.0, 2.0, 9.5)
actuals <- c(0.9, 2.1, 10.0)
rmse(preds, actuals)
Calculate Root-Mean-Square-Logarithmic Error (Deviation)
For the ith sample, Squared Logarithmic Error is calculated as SLE = (log(prediction + alpha) - log(actual + alpha))^2. RMSLE is then sqrt(mean(squared logarithmic errors)). alpha (1 by default) can be used to prevent taking log(0) for data that contains non-positive values.
rmsle(preds = NULL, actuals = NULL, weights = 1, na.rm = FALSE, alpha = 1)
preds: A vector of prediction values
actuals: A vector of actual values
weights: Optional vector of weights
na.rm: Should (prediction, actual) pairs with at least one NA value be ignored?
alpha: (Default = 1) See the formula details. Primary purpose is to prevent taking log(0)
Calculate Root-Mean-Square-Logarithmic Error (Deviation)
preds <- c(1.0, 2.0, 9.5)
actuals <- c(0.9, 2.1, 10.0)
rmsle(preds, actuals)
This function provides a way to identify the worst predictions when measuring Area Under the ROC curve. Simply put, the worst predictions are those with very low relative prediction scores (usually probabilities) on positive samples and very high scores on negative samples.
roc_scores(preds, actuals)
preds: Vector of predictions (need not be in the range [0, 1] - only order matters)
actuals: Vector of actuals - either logical or a vector of 1s and 0s
How it works:
1. The relative position (between 0 and 1) of each prediction is determined
2. The mean of actuals is determined
3. Samples whose position is on the correct side of the overall mean are given a score of 0
4. Samples whose position is on the wrong side of the overall mean are given their distance from the mean
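A sketch following the steps above, under the reading that the mean of actuals serves as the cutoff position; roc_scores() may differ in details:

pos <- relative_position(c(1, 2, 3, 4))  # step 1: relative positions
actuals <- c(1, 1, 0, 0)
m <- mean(actuals)                       # step 2: mean of actuals
# steps 3-4: positives below the cutoff and negatives above it are on the
# wrong side and receive their distance from the cutoff; all others get 0
wrong <- ifelse(actuals == 1, pos < m, pos > m)
ifelse(wrong, abs(pos - m), 0)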
roc_scores(c(1,2,3,4), actuals=c(1,1,0,0))
roc_scores(c(0.1, 0.2, 0.3, 0.4), actuals=c(TRUE, FALSE, TRUE, FALSE))
Convenience method for dealing with factors. Map a list of vectors to a list of factor vectors (1-1 mapping) such that the factor vectors all have the same levels - the unique values of the union of all the vectors in the list. Optionally group all low frequency values into a "_other_" level.
set_factor(vectorList, aggregationThreshold = 0)
vectorList: A list of values to convert to factors
aggregationThreshold: Values which appear this many times or less will be grouped into the level "_other_"
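A sketch of the core mapping without the "_other_" grouping: every vector is converted with factor levels equal to the sorted union of all observed values.

x <- c("a", "b", "c", "c")
y <- c("a", "d", "d")
allLevels <- sort(unique(c(x, y)))  # union of values across all vectors
lapply(list(x, y), factor, levels = allLevels)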
x <- c("a", "b", "c", "c") y <- c("a", "d", "d") set_factor(list(x, y)) set_factor(list(x, y), aggregationThreshold=1)
x <- c("a", "b", "c", "c") y <- c("a", "d", "d") set_factor(list(x, y)) set_factor(list(x, y), aggregationThreshold=1)
Calculates the skewness of each field in a data.table
skewness(dt)
dt: A data.table
Counts the frequency of each value in each column, then displays the results in descending order
library(data.table)
skewness(alientrain)
Convert a data.table object into a sparse matrix (with the same number of rows).
sparsify(
  dt,
  sparsifyNAs = FALSE,
  naCols = "none",
  sparsifyCols = NULL,
  memEfficient = FALSE
)
dt: A data.table object
sparsifyNAs: Should NAs be converted to 0s and sparsified?
naCols: One of "none" (default), "identify", or "efficient", controlling whether (and how) binary columns flagging NAs are generated
sparsifyCols: What columns to use. Use this to exclude columns of dt from being sparsified without having to build a column-subsetted copy of dt to input into sparsify(...). Default = NULL means use all columns of dt
memEfficient: Default = FALSE. Set this to TRUE for a slower but more memory efficient process
Converts a data.table object to a sparse matrix (class "dgCMatrix"). Requires the Matrix package. All sparsified data is assumed to take on the value 0/FALSE
Data Type: Description & NA handling

numeric: If sparsifyNAs = FALSE, only 0s will be sparsified. If sparsifyNAs = TRUE, 0s and NAs will be sparsified.
factor (unordered): Each level will generate a sparsified binary column. Column names are feature_level, e.g. "color_red", "color_blue".
factor (ordered): Levels are converted to numeric, 1 - NLevels. If sparsifyNAs = FALSE, NAs will remain as NAs. If sparsifyNAs = TRUE, NAs will be sparsified.
logical: TRUE and FALSE values will be converted to 1s and 0s. If sparsifyNAs = FALSE, only FALSEs will be sparsified. If sparsifyNAs = TRUE, FALSEs and NAs will be sparsified.
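A hedged, minimal illustration of the principle for a single logical column (not how sparsify() is implemented): FALSE maps to a structural zero in the dgCMatrix, so only the TRUE entries are stored.

library(Matrix)
logCol <- c(TRUE, FALSE, TRUE, FALSE)
m <- Matrix(matrix(as.integer(logCol), ncol = 1), sparse = TRUE)
class(m)  # "dgCMatrix"; only the two 1s are stored explicitly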
library(data.table)
library(Matrix)
dt <- data.table(
  intCol=c(1L, NA_integer_, 3L, 0L),
  realCol=c(NA, 2, NA, NA),
  logCol=c(TRUE, FALSE, TRUE, FALSE),
  ofCol=factor(c("a", "b", NA, "b"), levels=c("a", "b", "c"), ordered=TRUE),
  ufCol=factor(c("a", NA, "c", "b"), ordered=FALSE)
)
sparsify(dt)
sparsify(dt, sparsifyNAs=TRUE)
sparsify(dt[, list(realCol)], naCols="identify")
sparsify(dt[, list(realCol)], naCols="efficient")