| Title: | A Common API to Clustering |
| Version: | 0.3.0 |
| Description: | A common interface to specifying clustering models, in the same style as 'parsnip'. Creates a unified interface across different functions and computational engines. |
| License: | MIT + file LICENSE |
| URL: | https://github.com/tidymodels/tidyclust, https://tidyclust.tidymodels.org/ |
| BugReports: | https://github.com/tidymodels/tidyclust/issues |
| Depends: | R (≥ 4.1) |
| Imports: | cli (≥ 3.0.0), lifecycle, dials (≥ 1.3.0), dplyr (≥ 1.0.9), flexclust (≥ 1.3-6), generics (≥ 0.1.2), glue (≥ 1.6.2), hardhat (≥ 1.0.0), mclust, modelenv (≥ 0.2.0), parsnip (≥ 1.0.2), philentropy (≥ 0.9.0), prettyunits (≥ 1.1.0), purrr (≥ 1.0.0), rlang (≥ 1.0.6), rsample (≥ 1.0.0), stats, tibble (≥ 3.1.0), tidyr (≥ 1.2.0), tune (≥ 2.1.0), utils, vctrs (≥ 0.5.0) |
| Suggests: | butcher, cluster, ClusterR, clustMixType (≥ 0.3-5), covr, dbscan, future, future.apply, klaR, knitr, LPCM, meanShiftR, mirai (≥ 1.0.0), modeldata (≥ 1.0.0), RcppHungarian, recipes (≥ 1.0.0), rmarkdown, testthat (≥ 3.0.0), withr, workflows (≥ 1.1.2) |
| Config/Needs/website: | pkgdown, tidymodels, tidyverse, palmerpenguins, patchwork, ggforce, tidyverse/tidytemplate, mvtnorm |
| Config/testthat/edition: | 3 |
| Config/usethis/last-upkeep: | 2025-04-24 |
| Encoding: | UTF-8 |
| VignetteBuilder: | knitr |
| RoxygenNote: | 7.3.3 |
| NeedsCompilation: | no |
| Packaged: | 2026-05-21 21:54:47 UTC; emilhvitfeldt |
| Author: | Emil Hvitfeldt |
| Maintainer: | Emil Hvitfeldt <emil.hvitfeldt@posit.co> |
| Repository: | CRAN |
| Date/Publication: | 2026-05-21 22:50:02 UTC |
tidyclust: A Tidy Interface to Clustering
Description
The tidyclust package provides a tidy, unified interface to clustering models, following the same design patterns as parsnip. It creates a consistent API across different clustering functions and engines.
Details
Model specifications
- k_means(): K-means clustering (stats, ClusterR, klaR, clustMixType engines)
- hier_clust(): Hierarchical/agglomerative clustering (stats engine)
- db_clust(): Density-based clustering (dbscan engine)
- gm_clust(): Gaussian mixture model clustering (mclust engine)
Key functions
- Prediction: predict.cluster_fit()
- Extraction: extract_centroids(), extract_cluster_assignment()
- Metrics: silhouette_avg(), sse_within_total(), sse_ratio()
- Tuning: tune_cluster()
Getting started
# Create a specification
spec <- k_means(num_clusters = 3)
# Fit to data
fit <- fit(spec, ~., data = mtcars)
# Extract results
extract_centroids(fit)
extract_cluster_assignment(fit)
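The key functions listed above also include prediction. The following is a minimal sketch, assuming the default stats engine and the built-in mtcars data, of how a fitted specification might be used to assign clusters to rows of new data. The object names are illustrative only.

```r
library(tidyclust)

# k-means uses random starting centers, so fix the seed for reproducibility
set.seed(1234)

# Fit a 3-cluster k-means model on all columns of mtcars
kmeans_fit <- k_means(num_clusters = 3) |>
  fit(~., data = mtcars)

# Assign the first five rows to clusters; the result is a tibble
# with a single .pred_cluster column
preds <- predict(kmeans_fit, new_data = mtcars[1:5, ])
preds
```

Here the training rows are reused as "new" data purely for illustration; in practice new_data would hold observations that were not used for fitting.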
Author(s)
Maintainer: Emil Hvitfeldt emil.hvitfeldt@posit.co (ORCID)
Authors:
Kelly Bodwin kelly@bodwin.us
Other contributors:
Posit Software, PBC (ROR) [copyright holder, funder]
See Also
Package website: https://tidyclust.tidymodels.org/
Bug reports: https://github.com/tidymodels/tidyclust/issues
Helper functions to convert between formula and matrix interface
Description
Functions to take a formula interface and get the resulting
objects (y, x, weights, etc) back or the other way around. The functions
are intended for developer use. For the most part, this emulates the
internals of lm() (and also see the notes at
https://developer.r-project.org/model-fitting-functions.html).
.convert_form_to_x_fit() is for when the data are created for modeling.
It saves both the data objects as well as the objects needed when new data
are predicted (e.g. terms, etc.).
.convert_form_to_x_new() is used when new samples are being predicted and
only requires the predictors to be available.
Usage
.convert_form_to_x_fit(
formula,
data,
...,
na.action = na.omit,
indicators = "traditional",
composition = "data.frame",
remove_intercept = TRUE
)
.convert_form_to_x_new(
object,
new_data,
na.action = stats::na.pass,
composition = "data.frame"
)
Arguments
formula |
An object of class |
data |
A data frame containing all relevant variables (e.g. predictors, case weights, etc). |
... |
Additional arguments passed to |
na.action |
A function which indicates what should happen when the data contain NAs. |
indicators |
A string describing whether and how to create
indicator/dummy variables from factor predictors. Possible options are
|
composition |
A string describing whether the resulting |
remove_intercept |
A logical indicating whether to remove the intercept
column after |
object |
An object of class |
new_data |
A rectangular data object, such as a data frame. |
Simple Wrapper around dbscan function
Description
This wrapper prepares the data into a distance matrix to send to
dbscan::dbscan() and retains the parameters radius or min_points as an
attribute.
Usage
.db_clust_fit_dbscan(x, radius = NULL, min_points = NULL, ...)
Arguments
x |
matrix or data frame. |
radius |
Radius used to determine core-points and cluster points together. |
min_points |
Minimum number of points needed to form a cluster. |
Value
dbscan object
Simple Wrapper around hdbscan function
Description
This wrapper passes the data to dbscan::hdbscan() and stashes the training
data on the result so it can be reused for prediction and extraction.
Usage
.db_clust_fit_hdbscan(x, min_points = NULL, min_cluster_size = NULL, ...)
Arguments
x |
matrix or data frame. |
min_points |
Minimum cluster size used as the |
min_cluster_size |
Engine-specific override for |
Value
hdbscan object
Simple Wrapper around Mclust function
Description
This wrapper passes the data to mclust::Mclust() and retains the parameter
num_clusters as an attribute.
Usage
.gm_clust_fit_mclust(
x,
num_clusters = NULL,
circular = NULL,
zero_covariance = NULL,
shared_orientation = NULL,
shared_shape = NULL,
shared_size = NULL,
...
)
Arguments
x |
matrix or data frame. |
num_clusters |
Number of clusters. |
circular |
Whether or not to fit circular MVG distributions for each cluster. |
zero_covariance |
Whether or not to assign covariances of 0 for each MVG. |
shared_orientation |
Whether each cluster MVG should have the same orientation. |
shared_shape |
Whether each cluster MVG should have the same shape. |
shared_size |
Whether each cluster MVG should have the same size/volume. |
Value
mclust object
Simple Wrapper around hclust function
Description
This wrapper prepares the data into a distance matrix to send to
stats::hclust and retains the parameters num_clusters or h as an
attribute.
Usage
.hier_clust_fit_stats(
x,
num_clusters = NULL,
cut_height = NULL,
linkage_method = NULL,
dist_fun = philentropy::distance
)
Arguments
x |
matrix or data frame |
num_clusters |
the number of clusters |
cut_height |
the height to cut the dendrogram |
linkage_method |
the agglomeration method to be used. This should be (an
unambiguous abbreviation of) one of |
dist_fun |
A function of the form |
Value
A dendrogram
Simple Wrapper around ClusterR kmeans
Description
This wrapper runs ClusterR::KMeans_rcpp(), adds column names to the
centroids field, and reorders the clusters.
Usage
.k_means_fit_ClusterR(
data,
clusters,
num_init = 1,
max_iters = 100,
initializer = "kmeans++",
fuzzy = FALSE,
verbose = FALSE,
CENTROIDS = NULL,
tol = 1e-04,
tol_optimal_init = 0.3,
seed = 1
)
Arguments
data |
matrix or data frame |
clusters |
the number of clusters |
num_init |
number of times the algorithm will be run with different centroid seeds |
max_iters |
the maximum number of clustering iterations |
initializer |
the method of initialization. One of optimal_init, quantile_init, kmeans++, or random. See details for more information |
fuzzy |
either TRUE or FALSE. If TRUE, then prediction probabilities will be calculated using the distance between observations and centroids |
verbose |
either TRUE or FALSE, indicating whether progress is printed during clustering. |
CENTROIDS |
a matrix of initial cluster centroids. The rows of the CENTROIDS matrix should be equal to the number of clusters and the columns should be equal to the columns of the data. |
tol |
a float number. If, in case of an iteration (iteration > 1 and iteration < max_iters) 'tol' is greater than the squared norm of the centroids, then kmeans has converged |
tol_optimal_init |
tolerance value for the 'optimal_init' initializer. The higher this value is, the farther apart from each other the centroids are. |
seed |
integer value for random number generator (RNG) |
Value
a list with the following attributes: clusters, fuzzy_clusters (if fuzzy = TRUE), centroids, total_SSE, best_initialization, WCSS_per_cluster, obs_per_cluster, between.SS_DIV_total.SS
Simple Wrapper around clustMixType kmeans
Description
This wrapper runs clustMixType::kproto() and reorders the clusters.
Usage
.k_means_fit_clustMixType(x, k, ...)
Arguments
x |
Data frame with both numerics and factors (also ordered factors are possible). |
k |
Either the number of clusters, a vector specifying indices of initial prototypes, or a data frame of
prototypes of the same columns as |
... |
Other arguments passed to |
Value
Result from clustMixType::kproto()
Simple Wrapper around klaR kmeans
Description
This wrapper runs klaR::kmodes() and reorders the clusters.
Usage
.k_means_fit_klaR(data, modes, ...)
Arguments
data |
A matrix or data frame of categorical data. Objects have to be in rows, variables in columns. |
modes |
Either the number of modes or a set of initial
(distinct) cluster modes. If a number, a random set of (distinct)
rows in |
... |
Other arguments passed to |
Value
Result from klaR::kmodes()
Simple Wrapper around stats kmeans
Description
This wrapper runs stats::kmeans(), adds a check that centers is
specified, and reorders the clusters.
Usage
.k_means_fit_stats(data, centers = NULL, ...)
Arguments
centers |
either the number of clusters, say |
... |
Other arguments passed to |
Value
Result from stats::kmeans()
Simple Wrapper around LPCM::ms function
Description
This wrapper passes the data and bandwidth to LPCM::ms() with plotting
disabled.
Usage
.mean_shift_fit_LPCM(x, bandwidth = NULL, ...)
Arguments
x |
matrix or data frame. |
bandwidth |
Kernel bandwidth controlling the neighborhood size. |
Value
ms object
Simple Wrapper around meanShiftR::meanShift function
Description
This wrapper passes the data and bandwidth to meanShiftR::meanShift() and
stashes the training data and bandwidth on the result so they can be reused
for prediction and extraction.
Usage
.mean_shift_fit_meanShiftR(x, bandwidth = NULL, ...)
Arguments
x |
matrix or data frame. |
bandwidth |
Kernel bandwidth controlling the neighborhood size. A scalar is recycled to a per-column vector. |
Value
A list with class "ms_meanShiftR".
Augment data with predictions
Description
augment() will add column(s) for predictions to the given data.
Usage
## S3 method for class 'cluster_fit'
augment(x, new_data, ...)
Arguments
x |
A |
new_data |
A data frame or matrix. |
... |
Not currently used. |
Details
For partition models, a .pred_cluster column is added.
Preprocessing with workflows
When x is a fitted workflows::workflow() that includes a recipe, the
recipe transformations are applied to new_data before predicting. The
returned tibble contains the original (untransformed) new_data plus
the .pred_cluster column, so the data is not altered by preprocessing.
Value
A tibble containing new_data with a .pred_cluster column
appended giving the cluster assignment for each row.
Examples
kmeans_spec <- k_means(num_clusters = 5) |>
set_engine("stats")
kmeans_fit <- fit(kmeans_spec, ~., mtcars)
kmeans_fit |>
augment(new_data = mtcars)
# With a workflow that includes a recipe
library(recipes)
library(workflows)
rec <- recipe(~., data = mtcars) |>
step_normalize(all_predictors())
wf_fit <- workflow() |>
add_recipe(rec) |>
add_model(kmeans_spec) |>
fit(data = mtcars)
# Returns original (untransformed) data with .pred_cluster appended
augment(wf_fit, new_data = mtcars)
Axing a cluster_fit.
Description
cluster_fit objects are created from the tidyclust package.
Usage
axe_call.cluster_fit(x, verbose = FALSE, ...)
axe_ctrl.cluster_fit(x, verbose = FALSE, ...)
axe_data.cluster_fit(x, verbose = FALSE, ...)
axe_env.cluster_fit(x, verbose = FALSE, ...)
axe_fitted.cluster_fit(x, verbose = FALSE, ...)
Arguments
x |
A model object. |
verbose |
Print information each time an axe method is executed.
Notes how much memory is released and what functions are
disabled. Default is FALSE. |
... |
Any additional arguments related to axing. |
Value
Axed cluster_fit object.
Examples
k_fit <- k_means(num_clusters = 3) |>
parsnip::set_engine("stats") |>
fit(~., data = mtcars)
butcher::butcher(k_fit)
Bandwidth
Description
The kernel bandwidth used by mean shift to estimate the local density gradient. Smaller values yield more clusters, while larger values merge them.
Usage
bandwidth(range = c(0.01, 1), trans = NULL)
Arguments
range |
A two-element vector holding the defaults for the smallest and largest possible values, respectively. If a transformation is specified, these values should be in the transformed units. |
trans |
A |
Details
Used in tidyclust::mean_shift() models. The scale on which the bandwidth
is interpreted depends on the engine, since some engines rescale predictors
internally before applying the kernel.
Value
A dials parameter object for use with tune::tune_grid() and
related functions.
Examples
bandwidth()
Model Fit Object Information
Description
An object with class "cluster_fit" is a container for information about a model that has been fit to the data.
Details
The following model types are implemented in tidyclust:
- K-Means in k_means()
- Hierarchical (Agglomerative) Clustering in hier_clust()
The main elements of the object are:
- spec: A cluster_spec object.
- fit: The object produced by the fitting function.
- preproc: This contains any data-specific information required to process a new sample point for prediction. For example, if the underlying model function requires arguments x and the user passed a formula to fit, the preproc object would contain items such as the terms object and so on. When no information is required, this is NA.
As discussed in the documentation for cluster_spec, the original
arguments to the specification are saved as quosures. These are evaluated for
the cluster_fit object prior to fitting. If the resulting model object
prints its call, any user-defined options are shown in the call preceded by a
tilde (see the example below). This is a result of the use of quosures in the
specification.
This class and structure is the basis for how tidyclust stores model objects after seeing the data and applying a model.
Combine metric functions
Description
cluster_metric_set() allows you to combine multiple metric functions
together into a new function that calculates all of them at once.
Usage
cluster_metric_set(...)
Arguments
... |
The bare names of the functions to be included in the metric set.
These functions must be cluster metrics such as |
Details
All functions must be:
Only cluster metrics
Value
A cluster_metric_set() object, combining the use of all input
metrics.
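As a hedged illustration of the description above, the sketch below combines two of the metrics named in this manual, sse_within_total() and sse_ratio(), and applies the combined function to a fitted model. The object names are illustrative, and the exact columns of the returned tibble are an assumption based on tidymodels conventions.

```r
library(tidyclust)

# Combine two cluster metrics into a single function
my_metrics <- cluster_metric_set(sse_within_total, sse_ratio)

# Fit a model to evaluate; fix the seed because k-means starts randomly
set.seed(1234)
kmeans_fit <- k_means(num_clusters = 3) |>
  fit(~., data = mtcars)

# Calling the metric set on the fit computes every metric at once,
# returning one row per metric
metric_values <- my_metrics(kmeans_fit)
metric_values
```

The same metric set object can also be supplied wherever a collection of cluster metrics is expected, such as when tuning.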
Model Specification Information
Description
An object with class "cluster_spec" is a container for information about a model that will be fit.
Details
The following model types are implemented in tidyclust:
- K-Means in k_means()
- Hierarchical (Agglomerative) Clustering in hier_clust()
The main elements of the object are:
- args: A vector of the main arguments for the model. The names of these arguments may be different from their counterparts in the underlying model function. For example, for a k_means() model, the argument for the number of clusters is called "num_clusters" instead of "k" to make it more general and usable across different types of models (and to not be specific to a particular model function). The elements of args can be set to tune() for use with tune_cluster(). For more information see https://www.tidymodels.org/start/tuning/. If left to their defaults (NULL), the arguments will use the underlying model function's default value. As discussed below, the arguments in args are captured as quosures and are not immediately executed.
- ...: Optional model-function-specific parameters. As with args, these will be quosures and can be tune().
- mode: The type of model, such as "partition". Other modes will be added once the package adds more functionality.
- method: This is a slot that is filled in later by the model's constructor function. It generally contains lists of information that are used to create the fit and prediction code as well as required packages and similar data.
- engine: This character string declares exactly what software will be used. It can be a package name or a technology type.
This class and structure is the basis for how tidyclust stores model objects prior to seeing the data.
Argument Details
An important detail to understand when creating model specifications is that they are intended to be functionally independent of the data. While it is true that some tuning parameters are data dependent, the model specification does not interact with the data at all.
Most R functions immediately evaluate their arguments. For
example, when calling mean(dat_vec), the object dat_vec is immediately
evaluated inside of the function.
tidyclust model functions do not do this. For example, using
k_means(num_clusters = ncol(mtcars) / 5)
does not execute ncol(mtcars) / 5 when creating the specification.
This can be seen in the output:
> k_means(num_clusters = ncol(mtcars) / 5)
K Means Cluster Specification (partition)
Main Arguments:
num_clusters = ncol(mtcars)/5
Computational engine: stats
The model functions save the argument expressions and their associated
environments (a.k.a. a quosure) to be evaluated later when either
fit.cluster_spec() or fit_xy.cluster_spec() are called with the actual
data.
The consequence of this strategy is that any data required to get the parameter values must be available when the model is fit. The two main ways that this can fail are:
- The data have been modified between the creation of the model specification and when the model fit function is invoked.
- The model specification is saved and loaded into a new session where those same data objects do not exist.
The best way to avoid these issues is to not reference any data objects in
the global environment but to use data descriptors such as .cols(). Another
way of writing the previous specification is
k_means(num_clusters = .cols() / 5)
This is not dependent on any specific data object and is evaluated immediately before the model fitting process begins.
One less advantageous approach to solving this issue is to use quasiquotation. This would insert the actual R object into the model specification and might be the best idea when the data object is small. For example, using
k_means(num_clusters = ncol(!!mtcars) / 5)
would work (and be reproducible between sessions) but embeds the entire
mtcars data set into the num_clusters expression:
> k_means(num_clusters = ncol(!!mtcars) / 5)
K Means Cluster Specification (partition)
Main Arguments:
num_clusters = ncol(structure(list(mpg = c(21, 21, 22.8, 21.4, 18.7,<snip>
Computational engine: stats
However, if there were an object with the number of columns in it, this wouldn't be too bad:
> num_clusters_val <- ncol(mtcars) / 5
> num_clusters_val
[1] 2.2
> k_means(num_clusters = !!num_clusters_val)
K Means Cluster Specification (partition)
Main Arguments:
num_clusters = 2.2
More information on quosures and quasiquotation can be found at https://adv-r.hadley.nz/quasiquotation.html.
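The delayed evaluation described in this section can be seen in a small sketch. It assumes the default stats engine; n_groups is a hypothetical object name chosen for illustration and does not exist when the specification is created.

```r
library(tidyclust)

# `n_groups` is not defined yet; creating the specification does not fail
# because the expression is captured rather than evaluated
spec <- k_means(num_clusters = n_groups)

# The captured expression is evaluated only when the model is fit,
# so the object must exist by then
n_groups <- 3
set.seed(1234)
kmeans_fit <- fit(spec, ~., data = mtcars)

# One row of centroids per cluster
centroids <- extract_centroids(kmeans_fit)
centroids
```

This also shows the failure mode noted above: if n_groups were removed or changed before fit() is called, the fit would fail or use the changed value.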
One-hot contrast matrix
Description
A re-export of hardhat::contr_one_hot() for use with
indicators = "one_hot".
Usage
contr_one_hot(n, contrasts = TRUE, sparse = FALSE)
Arguments
n |
A vector of character factor levels (of length >=1) or the number of unique levels (>= 1). |
contrasts |
This argument is for backwards compatibility and only the
default of TRUE is supported. |
sparse |
This argument is for backwards compatibility and only the
default of FALSE is supported. |
Control the fit function
Description
Options can be passed to the fit.cluster_spec() function that control the
output and computations.
Usage
control_cluster(verbosity = 1L, catch = FALSE)
## S3 method for class 'control_cluster'
print(x, ...)
Arguments
verbosity |
An integer where a value of zero indicates that no messages
or output should be shown when packages are loaded or when the model is
fit. A value of 1 means that package loading is quiet but model fits can
produce output to the screen (depending on if they contain their own
|
catch |
A logical where a value of |
x |
A |
... |
Not currently used. |
Value
control_cluster() returns an S3 object with class "control_cluster" that is a named list with the results of the function call.
The print method returns the input x, invisibly.
Examples
control_cluster()
# Catch errors instead of stopping — useful inside loops or tune_cluster()
control_cluster(catch = TRUE)
# Suppress all output during fitting
control_cluster(verbosity = 0L)
# Show model output but suppress package loading messages (default)
control_cluster(verbosity = 1L)
# Show all output including package loading messages
control_cluster(verbosity = 2L)
Cut Height
Description
Used in most tidyclust::hier_clust() models.
Usage
cut_height(range = c(0, dials::unknown()), trans = NULL)
Arguments
range |
A two-element vector holding the defaults for the smallest and largest possible values, respectively. If a transformation is specified, these values should be in the transformed units. |
trans |
A |
Value
A dials parameter object for use with tune::tune_grid() and
related functions.
Examples
cut_height()
Density-Based Spatial Clustering of Applications with Noise (DBSCAN)
Description
db_clust() defines a model that fits clusters based on areas with observations
that are densely packed together, using the DBSCAN algorithm.
There are multiple implementations for this model, and the implementation is chosen by setting the model engine. The engine-specific pages for this model are listed below.
Usage
db_clust(
mode = "partition",
engine = "dbscan",
radius = NULL,
min_points = NULL
)
Arguments
mode |
A single character string for the type of model. The only
possible value for this model is |
engine |
A single character string specifying what computational engine
to use for fitting. The engine for this model is |
radius |
Positive double. Radius drawn around points to determine core points and cluster assignments (required). |
min_points |
Positive integer. Minimum number of connected points required to form a core point, including the point itself (required). |
Details
What does it mean to predict?
To predict the cluster assignment for a new observation, we determine if a point is within the radius of a core point. If so, we predict the same cluster as the core point. If not, we predict the observation to be an outlier.
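The rule above can be sketched in base R. This is an illustration of the stated logic only, not the code tidyclust or dbscan actually runs: predict_dbscan_sketch() is a hypothetical helper, Euclidean distance is assumed, ties are resolved by taking the nearest core point, and NA is used here to stand for an outlier.

```r
# Hypothetical helper for illustration; not part of tidyclust
predict_dbscan_sketch <- function(new_points, core_points, core_clusters, radius) {
  new_points <- as.matrix(new_points)
  core_points <- as.matrix(core_points)
  vapply(seq_len(nrow(new_points)), function(i) {
    # Euclidean distance from the new point to every core point
    d <- sqrt(rowSums(sweep(core_points, 2, new_points[i, ])^2))
    nearest <- which.min(d)
    # Inside the radius of a core point: inherit its cluster; otherwise outlier
    if (d[nearest] <= radius) core_clusters[nearest] else NA_integer_
  }, integer(1))
}

# Three core points in two clusters (made-up values)
core_points <- rbind(c(0, 0), c(0.2, 0), c(5, 5))
core_clusters <- c(1L, 1L, 2L)

# Two points near a core point and one far from all of them
new_points <- rbind(c(0.1, 0.1), c(5.2, 5.1), c(10, 10))
pred <- predict_dbscan_sketch(new_points, core_points, core_clusters, radius = 0.5)
pred
# returns 1 2 NA
```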
Value
A db_clust cluster specification.
Examples
# Show all engines
modelenv::get_from_env("db_clust")
db_clust()
dbscan fit helper function
Description
This function returns the cluster assignments for the training data based on their distance to the CLOSEST core point in the data.
Usage
dbscan_helper(object, ...)
Arguments
object |
db_clust object |
Value
numeric vector
Density-Based Spatial Clustering of Applications with Noise (DBSCAN) via dbscan
Description
db_clust() creates a DBSCAN model.
Details
For this engine, there is a single mode: partition
Tuning Parameters
This model has 2 tuning parameters:
- radius: Radius (type: double, default: no default)
- min_points: Minimum Number of Points (type: integer, default: no default)
Translation from tidyclust to the original package (partition)
db_clust(radius = 0.5, min_points = 5) %>%
set_engine("dbscan") %>%
set_mode("partition") %>%
translate_tidyclust()
## DBSCAN Clustering Specification (partition)
##
## Main Arguments:
##   radius = 0.5
##   min_points = 5
##
## Computational engine: dbscan
##
## Model fit template:
## tidyclust::.db_clust_fit_dbscan(x = missing_arg(), radius = missing_arg(),
##     min_points = missing_arg(), radius = 0.5, min_points = 5)
Preprocessing requirements
Factor/categorical predictors need to be converted to numeric values
(e.g., dummy or indicator variables) for this engine. When using the
formula method via fit(), tidyclust
will convert factor columns to indicators.
Predictors should have the same scale. One way to achieve this is to center and scale each so that each predictor has mean zero and a variance of one.
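As a minimal sketch of the centering and scaling described above, base R's scale() can standardize every column of a numeric data set. The built-in mtcars data is used purely for illustration; within a workflow, recipes::step_normalize() plays the same role.

```r
# Center and scale every column of the built-in mtcars data
scaled <- scale(mtcars)

# Each column now has mean zero (up to floating point error)
col_means <- colMeans(scaled)
round(col_means, 10)

# ... and variance one
col_vars <- apply(scaled, 2, var)
col_vars
```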
References
Ester, M., Kriegel, H.-P., Sander, J., & Xu, X. (1996). A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise.
Hahsler, M., Piekenbrock, M., & Doran, D. (2019). dbscan: Fast Density-Based Clustering with R. Journal of Statistical Software, 91(1). https://www.jstatsoft.org/article/view/v091i01
Kriegel, H., Kröger, P., Sander, J., & Zimek, A. (2011). Density-based clustering. WIREs Data Mining and Knowledge Discovery, 1(3), 231–240. https://wires.onlinelibrary.wiley.com/doi/10.1002/widm.30
Tran, T. N., Drab, K., & Daszykowski, M. (2013). Revised DBSCAN algorithm to cluster data with dense adjacent clusters. Chemometrics and Intelligent Laboratory Systems, 120, 92–96. https://www.sciencedirect.com/science/article/pii/S0169743912002249
Hierarchical Density-Based Spatial Clustering (HDBSCAN) via dbscan
Description
db_clust() creates an HDBSCAN model.
Details
For this engine, there is a single mode: partition
Tuning Parameters
This model has 1 tuning parameter:
- min_points: Minimum Number of Points (type: integer, default: no default)
The hdbscan engine also accepts the engine-specific argument
min_cluster_size (passed via
set_engine("hdbscan", min_cluster_size = ...)). When supplied, it
overrides min_points as the value of minPts passed to
dbscan::hdbscan(). If not supplied, min_points is used.
Translation from tidyclust to the original package (partition)
db_clust(min_points = 5) |>
set_engine("hdbscan") |>
set_mode("partition") |>
translate_tidyclust()
## DBSCAN Clustering Specification (partition)
##
## Main Arguments:
##   min_points = 5
##
## Computational engine: hdbscan
##
## Model fit template:
## tidyclust::.db_clust_fit_hdbscan(x = missing_arg(), min_points = missing_arg(),
##     min_points = 5)
Preprocessing requirements
Factor/categorical predictors need to be converted to numeric values
(e.g., dummy or indicator variables) for this engine. When using the
formula method via fit(), tidyclust
will convert factor columns to indicators.
Predictors should have the same scale. One way to achieve this is to center and scale each so that each predictor has mean zero and a variance of one.
What does it mean to predict?
To predict the cluster assignment for a new observation, the nearest training observation that was not classified as noise is found. The new observation is assigned to that neighbor’s cluster if the distance is at most the neighbor’s core distance; otherwise the new observation is marked as an outlier.
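The prediction rule for this engine can be sketched in base R. This illustrates the stated logic only and is not the package's implementation: predict_hdbscan_sketch() is a hypothetical helper, Euclidean distance is assumed, noise is assumed to be labelled 0 in the training assignments, the core distances are made-up inputs, and NA stands for an outlier.

```r
# Hypothetical helper for illustration; not part of tidyclust or dbscan
predict_hdbscan_sketch <- function(new_points, train_points, train_clusters, core_dist) {
  new_points <- as.matrix(new_points)
  train_points <- as.matrix(train_points)

  # Drop training observations classified as noise (assumed label: 0)
  keep <- train_clusters != 0L
  train_points <- train_points[keep, , drop = FALSE]
  train_clusters <- train_clusters[keep]
  core_dist <- core_dist[keep]

  vapply(seq_len(nrow(new_points)), function(i) {
    # Distance from the new point to every remaining training observation
    d <- sqrt(rowSums(sweep(train_points, 2, new_points[i, ])^2))
    nearest <- which.min(d)
    # Accept the neighbor's cluster only within the neighbor's core distance
    if (d[nearest] <= core_dist[nearest]) train_clusters[nearest] else NA_integer_
  }, integer(1))
}

# Four training points; the last one was classified as noise
train_points <- rbind(c(0, 0), c(1, 0), c(5, 5), c(20, 20))
train_clusters <- c(1L, 1L, 2L, 0L)
core_dist <- c(1, 1, 0.5, 3)

new_points <- rbind(c(0.5, 0.2), c(5, 6), c(20, 21))
pred <- predict_hdbscan_sketch(new_points, train_points, train_clusters, core_dist)
pred
# returns 1 NA NA
```

The third new point sits next to a training observation, but that observation is noise and is therefore ignored, so the point is marked as an outlier.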
References
Campello, R. J. G. B., Moulavi, D., & Sander, J. (2013). Density-Based Clustering Based on Hierarchical Density Estimates. In Advances in Knowledge Discovery and Data Mining (Vol. 7819, pp. 160–172). Springer. https://link.springer.com/chapter/10.1007/978-3-642-37456-2_14
Campello, R. J. G. B., Moulavi, D., Zimek, A., & Sander, J. (2015). Hierarchical density estimates for data clustering, visualization, and outlier detection. ACM Transactions on Knowledge Discovery from Data, 10(1), 1–51. https://dl.acm.org/doi/10.1145/2733381
Hahsler, M., Piekenbrock, M., & Doran, D. (2019). dbscan: Fast Density-Based Clustering with R. Journal of Statistical Software, 91(1). https://www.jstatsoft.org/article/view/v091i01
Gaussian Mixture Model (GMM) via mclust
Description
gm_clust() creates a GMM model.
Details
For this engine, there is a single mode: partition
Tuning Parameters
This model has 6 tuning parameters:
- num_clusters: # Clusters (type: integer, default: no default)
- circular: Circular MVG (type: logical, default: TRUE)
- zero_covariance: Zero Covariance (type: logical, default: TRUE)
- shared_orientation: Shared Orientation (type: logical, default: TRUE)
- shared_shape: Shared Shape (type: logical, default: TRUE)
- shared_size: Shared Size (type: logical, default: TRUE)
Translation from tidyclust to the original package (partition)
gm_clust(num_clusters = 3, circular = FALSE, zero_covariance = FALSE) %>%
set_engine("mclust") %>%
set_mode("partition") %>%
translate_tidyclust()
## GMM Clustering Specification (partition)
##
## Main Arguments:
##   num_clusters = 3
##   circular = FALSE
##   zero_covariance = FALSE
##   shared_orientation = TRUE
##   shared_shape = TRUE
##   shared_size = TRUE
##
## Computational engine: mclust
##
## Model fit template:
## tidyclust::.gm_clust_fit_mclust(x = missing_arg(), num_clusters = missing_arg(),
##     circular = missing_arg(), zero_covariance = missing_arg(),
##     shared_orientation = missing_arg(), shared_shape = missing_arg(),
##     shared_size = missing_arg(), num_clusters = 3, circular = FALSE,
##     zero_covariance = FALSE, shared_orientation = TRUE, shared_shape = TRUE,
##     shared_size = TRUE)
Preprocessing requirements
Gaussian Mixture Models should be fit with only quantitative predictors and without any categorical predictors. No scaling is required since the variance-covariance matrices of the Gaussian distributions account for the unequal variances between predictors and their covariances.
References
Banfield, J. D., & Raftery, A. E. (1993). Model-Based Gaussian and Non-Gaussian Clustering. Biometrics, 49(3), 803. https://www.jstor.org/stable/2532201
Celeux, G., & Govaert, G. (1995). Gaussian parsimonious clustering models. Pattern Recognition, 28(5), 781–793. https://www.sciencedirect.com/science/article/pii/0031320394001256
Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum Likelihood from Incomplete Data via the EM Algorithm.
McNicholas, P. D. (2016). Model-Based clustering. Journal of Classification, 33(3), 331–373. https://link.springer.com/article/10.1007/s00357-016-9211-9
Scrucca, L., Fop, M., Murphy, T. B., & Raftery, A. E. (2016). mclust 5: Clustering, Classification and Density Estimation Using Gaussian Finite Mixture Models. The R Journal, 8(1), 289. https://journal.r-project.org/articles/RJ-2016-021/index.html
Scrucca, L., Fraley, C., Murphy, T. B., & Raftery, A. E. (2023). Model-based clustering, classification, and density estimation using mclust in R. Chapman & Hall/CRC. https://doi.org/10.1201/9781003277965
Hierarchical (Agglomerative) Clustering via stats
Description
hier_clust() creates a Hierarchical (Agglomerative) Clustering model.
Details
For this engine, there is a single mode: partition
Tuning Parameters
This model has 1 tuning parameter:
- num_clusters: # Clusters (type: integer, default: no default)
Translation from tidyclust to the original package (partition)
hier_clust(num_clusters = integer(1)) |>
set_engine("stats") |>
set_mode("partition") |>
translate_tidyclust()
## Hierarchical Clustering Specification (partition)
##
## Main Arguments:
##   num_clusters = integer(1)
##   linkage_method = complete
##
## Computational engine: stats
##
## Model fit template:
## tidyclust::.hier_clust_fit_stats(data = missing_arg(), num_clusters = integer(1),
##     linkage_method = "complete")
Preprocessing requirements
Factor/categorical predictors need to be converted to numeric values
(e.g., dummy or indicator variables) for this engine. When using the
formula method via fit(), tidyclust
will convert factor columns to indicators.
References
Becker, R. A., Chambers, J. M. and Wilks, A. R. (1988). The New S Language. Wadsworth & Brooks/Cole. (S version.)
Everitt, B. (1974). Cluster Analysis. London: Heinemann Educ. Books.
Hartigan, J.A. (1975). Clustering Algorithms. New York: Wiley.
Sneath, P. H. A. and R. R. Sokal (1973). Numerical Taxonomy. San Francisco: Freeman.
Anderberg, M. R. (1973). Cluster Analysis for Applications. Academic Press: New York.
Gordon, A. D. (1999). Classification. Second Edition. London: Chapman and Hall / CRC
Murtagh, F. (1985). “Multidimensional Clustering Algorithms”, in COMPSTAT Lectures 4. Wuerzburg: Physica-Verlag (for algorithmic details of algorithms used).
McQuitty, L.L. (1966). Similarity Analysis by Reciprocal Pairs for Discrete and Continuous Data. Educational and Psychological Measurement, 26, 825–831. https://doi.org/10.1177/001316446602600402.
Legendre, P. and L. Legendre (2012). Numerical Ecology, 3rd English ed. Amsterdam: Elsevier Science BV.
Murtagh, Fionn and Legendre, Pierre (2014). Ward’s hierarchical agglomerative clustering method: which algorithms implement Ward’s criterion? Journal of Classification, 31, 274–295. https://doi.org/10.1007/s00357-014-9161-z.
K-means via ClusterR
Description
k_means() creates a K-means model. This engine uses the classical definition
of a K-means model, which only takes numeric predictors.
Details
For this engine, there is a single mode: partition
Tuning Parameters
This model has 1 tuning parameter:
- num_clusters: # Clusters (type: integer, default: no default)
Translation from tidyclust to the original package (partition)
k_means(num_clusters = integer(1)) |>
set_engine("ClusterR") |>
set_mode("partition") |>
translate_tidyclust()
## K Means Cluster Specification (partition)
##
## Main Arguments:
##   num_clusters = integer(1)
##
## Computational engine: ClusterR
##
## Model fit template:
## tidyclust::.k_means_fit_ClusterR(data = missing_arg(), clusters = missing_arg(),
##     clusters = integer(1))
Preprocessing requirements
Factor/categorical predictors need to be converted to numeric values
(e.g., dummy or indicator variables) for this engine. When using the
formula method via fit(), tidyclust
will convert factor columns to indicators.
Predictors should have the same scale. One way to achieve this is to center and scale each so that each predictor has mean zero and a variance of one.
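A minimal base R sketch of this centering and scaling step is shown below. It uses scale() and stats::kmeans() directly rather than a tidyclust workflow, and the choice of mtcars and of three clusters is only for illustration.

```r
# Standardize each column to mean 0 and variance 1 (base R).
scaled <- scale(mtcars)

# Check the result: column means are ~0 and standard deviations are 1.
round(colMeans(scaled), 10)
apply(scaled, 2, sd)

# K-means on the standardized predictors; the seed only fixes the random start.
set.seed(1)
km <- stats::kmeans(scaled, centers = 3)
table(km$cluster)
```

Without this step, columns measured in large units (such as disp or hp in mtcars) would dominate the Euclidean distances.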
References
Forgy, E. W. (1965). Cluster analysis of multivariate data: efficiency vs interpretability of classifications. Biometrics, 21, 768–769.
Hartigan, J. A. and Wong, M. A. (1979). Algorithm AS 136: A K-means clustering algorithm. Applied Statistics, 28, 100–108. https://doi.org/10.2307/2346830.
Lloyd, S. P. (1957, 1982). Least squares quantization in PCM. Technical Note, Bell Laboratories. Published in 1982 in IEEE Transactions on Information Theory, 28, 128–137.
MacQueen, J. (1967). Some methods for classification and analysis of multivariate observations. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, eds L. M. Le Cam & J. Neyman, 1, pp. 281–297. Berkeley, CA: University of California Press.
K-means via clustMixType
Description
k_means() creates a K-prototypes model. K-prototypes is a middle ground
between K-means and K-modes, in the sense that it can be used with
data that contains both numeric and categorical predictors.
Details
Both numeric and categorical predictors are required for this engine.
For this engine, there is a single mode: partition
Tuning Parameters
This model has 1 tuning parameter:
- num_clusters: # Clusters (type: integer, default: no default)
Translation from tidyclust to the original package (partition)
k_means(num_clusters = integer(1)) |>
set_engine("clustMixType") |>
set_mode("partition") |>
translate_tidyclust()
## K Means Cluster Specification (partition)
##
## Main Arguments:
##   num_clusters = integer(1)
##
## Computational engine: clustMixType
##
## Model fit template:
## tidyclust::.k_means_fit_clustMixType(x = missing_arg(), k = missing_arg(),
##     keep.data = missing_arg(), k = integer(1), keep.data = TRUE,
##     verbose = FALSE)
Preprocessing requirements
Both categorical and numeric predictors are required.
References
Szepannek, G. (2018): clustMixType: User-Friendly Clustering of Mixed-Type Data in R, The R Journal 10/2, 200-208, https://doi.org/10.32614/RJ-2018-048.
Aschenbruck, R., Szepannek, G., Wilhelm, A. (2022): Imputation Strategies for Clustering Mixed‑Type Data with Missing Values, Journal of Classification, https://doi.org/10.1007/s00357-022-09422-y.
Z.Huang (1998): Extensions to the k-Means Algorithm for Clustering Large Data Sets with Categorical Variables, Data Mining and Knowledge Discovery 2, 283-304.
K-means via klaR
Description
k_means() creates a K-Modes model. This model is intended to be used with
categorical predictors, although it will accept numeric predictors if they
contain only a few unique values. Such numeric predictors are then
treated as categorical.
Details
For this engine, there is a single mode: partition
Tuning Parameters
This model has 1 tuning parameter:
- num_clusters: # Clusters (type: integer, default: no default)
Translation from tidyclust to the original package (partition)
k_means(num_clusters = integer(1)) |>
set_engine("klaR") |>
set_mode("partition") |>
translate_tidyclust()
## K Means Cluster Specification (partition)
##
## Main Arguments:
##   num_clusters = integer(1)
##
## Computational engine: klaR
##
## Model fit template:
## tidyclust::.k_means_fit_klaR(data = missing_arg(), modes = missing_arg(),
##     modes = integer(1))
Preprocessing requirements
Only categorical variables are accepted, along with numerics with few unique values.
References
Huang, Z. (1997) A Fast Clustering Algorithm to Cluster Very Large Categorical Data Sets in Data Mining. in KDD: Techniques and Applications (H. Lu, H. Motoda and H. Luu, Eds.), pp. 21-34, World Scientific, Singapore.
MacQueen, J. (1967) Some methods for classification and analysis of multivariate observations. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, eds L. M. Le Cam & J. Neyman, 1, pp. 281-297. Berkeley, CA: University of California Press.
K-means via stats
Description
k_means() creates a K-means model. This engine uses the classical definition
of a K-means model, which only takes numeric predictors.
Details
For this engine, there is a single mode: partition
Tuning Parameters
This model has 1 tuning parameter:
- num_clusters: # Clusters (type: integer, default: no default)
Translation from tidyclust to the original package (partition)
k_means(num_clusters = integer(1)) |>
set_engine("stats") |>
set_mode("partition") |>
translate_tidyclust()
## K Means Cluster Specification (partition)
##
## Main Arguments:
##   num_clusters = integer(1)
##
## Computational engine: stats
##
## Model fit template:
## tidyclust::.k_means_fit_stats(x = missing_arg(), centers = missing_arg(),
##     centers = integer(1))
Preprocessing requirements
Factor/categorical predictors need to be converted to numeric values
(e.g., dummy or indicator variables) for this engine. When using the
formula method via fit(), tidyclust
will convert factor columns to indicators.
Predictors should have the same scale. One way to achieve this is to center and scale each so that each predictor has mean zero and a variance of one.
References
Forgy, E. W. (1965). Cluster analysis of multivariate data: efficiency vs interpretability of classifications. Biometrics, 21, 768–769.
Hartigan, J. A. and Wong, M. A. (1979). Algorithm AS 136: A K-means clustering algorithm. Applied Statistics, 28, 100–108. https://doi.org/10.2307/2346830.
Lloyd, S. P. (1957, 1982). Least squares quantization in PCM. Technical Note, Bell Laboratories. Published in 1982 in IEEE Transactions on Information Theory, 28, 128–137.
MacQueen, J. (1967). Some methods for classification and analysis of multivariate observations. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, eds L. M. Le Cam & J. Neyman, 1, pp. 281–297. Berkeley, CA: University of California Press.
Mean Shift Clustering via LPCM
Description
mean_shift() creates a mean shift clustering model.
Details
For this engine, there is a single mode: partition
Tuning Parameters
This model has 1 tuning parameter:
- bandwidth: Bandwidth (type: double, default: no default)
Translation from tidyclust to the original package (partition)
mean_shift(bandwidth = 0.5) |>
set_engine("LPCM") |>
set_mode("partition") |>
translate_tidyclust()
## Mean Shift Clustering Specification (partition)
##
## Main Arguments:
##   bandwidth = 0.5
##
## Computational engine: LPCM
##
## Model fit template:
## tidyclust::.mean_shift_fit_LPCM(x = missing_arg(), bandwidth = missing_arg(),
##     bandwidth = 0.5)
Preprocessing requirements
Factor/categorical predictors need to be converted to numeric values
(e.g., dummy or indicator variables) for this engine. When using the
formula method via fit(), tidyclust
will convert factor columns to indicators.
LPCM::ms() scales each variable internally to the unit range before
applying the Gaussian kernel, so the bandwidth value is expressed on the
scaled data rather than on the raw data scale. Bandwidths between roughly
0.05 and 1 are typical; smaller values find more clusters and larger
values merge them.
What does it mean to predict?
To predict the cluster assignment for a new observation, the mean shift procedure is run from the new point until it converges to a mode. The observation is then assigned to the cluster of the nearest discovered training mode by Euclidean distance.
References
Cheng, Y. (1995). Mean shift, mode seeking, and clustering. IEEE Transactions on Pattern Analysis and Machine Intelligence, 17(8), 790–799. https://ieeexplore.ieee.org/document/400568
Comaniciu, D., & Meer, P. (2002). Mean shift: A robust approach toward feature space analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(5), 603–619. https://ieeexplore.ieee.org/document/1000236
Einbeck, J., Evers, L., & Hinchliff, K. (2010). Data compression and regression based on local principal curves. In A. Fink, B. Lausen, W. Seidel, & A. Ultsch (Eds.), Advances in Data Analysis, Data Handling and Business Intelligence (pp. 701–712). Springer.
Mean Shift Clustering via meanShiftR
Description
mean_shift() creates a mean shift clustering model.
Details
For this engine, there is a single mode: partition
Tuning Parameters
This model has 1 tuning parameter:
- bandwidth: Bandwidth (type: double, default: no default)
Translation from tidyclust to the original package (partition)
mean_shift(bandwidth = 0.5) |>
set_engine("meanShiftR") |>
set_mode("partition") |>
translate_tidyclust()
## Mean Shift Clustering Specification (partition)
##
## Main Arguments:
##   bandwidth = 0.5
##
## Computational engine: meanShiftR
##
## Model fit template:
## tidyclust::.mean_shift_fit_meanShiftR(x = missing_arg(), bandwidth = missing_arg(),
##     bandwidth = 0.5)
Preprocessing requirements
Factor/categorical predictors need to be converted to numeric values
(e.g., dummy or indicator variables) for this engine. When using the
formula method via fit(), tidyclust
will convert factor columns to indicators.
Unlike the LPCM engine, meanShiftR::meanShift() does not scale
variables internally and operates on the raw data scale. The bandwidth
value is used directly as a per-dimension kernel width on the original
variables, and a scalar bandwidth is recycled to a per-column vector.
Because of this, appropriate bandwidths typically depend on the spread
of the predictors. Standardizing predictors before fitting (for example,
with recipes::step_normalize()) is
recommended; otherwise the default dials::bandwidth() range of
c(0.01, 1) may be too narrow.
What does it mean to predict?
To predict the cluster assignment for a new observation, the mean shift procedure is run from the new point against the training data’s kernel density estimate. The observation is assigned to the cluster whose training mode is closest to the converged value by Euclidean distance.
References
Cheng, Y. (1995). Mean shift, mode seeking, and clustering. IEEE Transactions on Pattern Analysis and Machine Intelligence, 17(8), 790–799. https://ieeexplore.ieee.org/document/400568
Comaniciu, D., & Meer, P. (2002). Mean shift: A robust approach toward feature space analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(5), 603–619. https://ieeexplore.ieee.org/document/1000236
Lisic, J. (2015). Parcel Level Agricultural Land Cover Prediction (Doctoral dissertation, George Mason University).
Extract elements of a tidyclust model object
Description
These functions extract various elements from a clustering object. If they do not exist yet, an error is thrown.
- extract_fit_engine() returns the engine-specific fit embedded within a tidyclust model fit. For example, when using k_means() with the "stats" engine, this returns the underlying kmeans object.
- extract_parameter_set_dials() returns a set of dials parameter objects.
Usage
## S3 method for class 'cluster_fit'
extract_fit_engine(x, ...)
## S3 method for class 'cluster_spec'
extract_parameter_set_dials(x, ...)
Arguments
x |
A |
... |
Not currently used. |
Details
Extracting the underlying engine fit can be helpful for describing the
model (via print(), summary(), plot(), etc.) or for variable
importance/explainers.
However, users should not invoke the
predict() method on an extracted model.
There may be preprocessing operations that tidyclust has executed on the
data prior to giving it to the model. Bypassing these can lead to errors or
silently generating incorrect predictions.
Good:
tidyclust_fit |> predict(new_data)
Bad:
tidyclust_fit |> extract_fit_engine() |> predict(new_data)
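The base R sketch below illustrates why skipping preprocessing can silently give wrong answers. It assumes, purely for illustration, that the preprocessing is centering and scaling; the helper `nearest()` is hypothetical and is not a tidyclust function.

```r
# Assumption for illustration: the "preprocessing" is centering and scaling.
train <- as.matrix(mtcars[, c("mpg", "disp", "hp")])
train_scaled <- scale(train)

set.seed(1)
km <- stats::kmeans(train_scaled, centers = 2)

# Hypothetical helper: index of the nearest centroid (squared Euclidean distance).
nearest <- function(point, centers) {
  which.min(colSums((t(centers) - point)^2))
}

new_raw <- train[1, ]  # first training row, on the raw scale
new_scaled <- (new_raw - attr(train_scaled, "scaled:center")) /
  attr(train_scaled, "scaled:scale")

# With the preprocessing re-applied, the point lands in its fitted cluster.
unname(nearest(new_scaled, km$centers)) == unname(km$cluster[1])

# On the raw scale the distances are dominated by large-valued columns, so the
# assignment is unreliable even though no error is raised.
nearest(new_raw, km$centers)
```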
Value
The extracted value from the tidyclust object, x, as described in the
description section.
Examples
kmeans_spec <- k_means(num_clusters = 2)
kmeans_fit <- fit(kmeans_spec, ~., data = mtcars)
extract_fit_engine(kmeans_fit)
Extract clusters from model
Description
When applied to a fitted cluster specification, returns a tibble with the cluster locations. When such locations do not make sense for the model, the mean location is used.
Usage
extract_centroids(object, ...)
Arguments
object |
A fitted |
... |
Other arguments passed to methods. Using the |
Details
Some model types, such as K-means as seen in k_means(), store the centroids
in the object itself, so this function acts as a simple extractor.
Other model types, such as Hierarchical (Agglomerative) Clustering as
seen in hier_clust(), are fit in such a way that the number of clusters can
be determined at any time after the fit. Setting num_clusters or
cut_height in this function determines the clustering that is
reported.
Furthermore, some models, like hier_clust(), do not have a notion of
"centroids". The mean of the observations within each cluster assignment is
returned as the centroid.
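The idea of cutting the tree after the fit and reporting per-cluster means can be sketched in base R as follows. This uses stats::hclust(), stats::cutree() and stats::aggregate() directly and is only an illustration of the concept, not tidyclust's implementation; the three mtcars columns are an arbitrary choice.

```r
# Fit a dendrogram on three (arbitrarily chosen) mtcars columns.
x <- mtcars[, c("mpg", "disp", "hp")]
tree <- stats::hclust(stats::dist(x), method = "complete")

# The number of clusters is chosen after the fit by cutting the tree.
assignment <- stats::cutree(tree, k = 2)

# "Centroids" are the per-cluster means of the training observations.
centroids <- stats::aggregate(x, by = list(.cluster = assignment), FUN = mean)
centroids
```

Calling stats::cutree() again with a different k (or with h for a cut height) gives a different clustering from the same fitted tree.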
The clusters are ordered so that the first observation in the training data set is in cluster 1, the next observation that doesn't belong to cluster 1 is in cluster 2, and so forth. Because the ordering of clusters carries no meaning, this convention keeps identical clusterings from receiving different labels when the model is fit multiple times.
Related functions
extract_centroids() is a part of a trio of functions doing similar things:
- extract_cluster_assignment() returns the cluster assignments of the training observations
- extract_centroids() returns the location of the centroids
- predict() returns the cluster a new observation belongs to
Value
A tibble::tibble() with one row per centroid, giving its position.
.cluster denotes the cluster name for the centroid. The remaining
variables match the variables passed into the model.
See Also
extract_cluster_assignment(), predict.cluster_fit()
Examples
set.seed(1234)
kmeans_spec <- k_means(num_clusters = 5) |>
set_engine("stats")
kmeans_fit <- fit(kmeans_spec, ~., mtcars)
kmeans_fit |>
extract_centroids()
kmeans_fit |>
extract_centroids(labels = c("A", "B", "C", "D", "E"))
# Some models, such as `hier_clust()`, are fit in such a way that you can specify
# the number of clusters after the model is fit.
# A Hierarchical (Agglomerative) Clustering method doesn't technically have
# centroids, so the mean of the observations within each cluster is returned
# instead.
hclust_spec <- hier_clust() |>
set_engine("stats")
hclust_fit <- fit(hclust_spec, ~., mtcars)
hclust_fit |>
extract_centroids(num_clusters = 2)
hclust_fit |>
extract_centroids(cut_height = 250)
Extract cluster assignments from model
Description
When applied to a fitted cluster specification, returns a tibble with cluster assignments of the data used to train the model.
Usage
extract_cluster_assignment(object, ...)
Arguments
object |
A fitted |
... |
Other arguments passed to methods. Using the |
Details
Some model types, such as K-means as seen in k_means(), store the
cluster assignments in the object itself, so this function acts as a
simple extractor. Other model types, such as Hierarchical
(Agglomerative) Clustering as seen in hier_clust(), are fit in such a way
that the number of clusters can be determined at any time after the fit.
Setting num_clusters or cut_height in this function determines the
clustering that is reported.
The clusters are ordered so that the first observation in the training data set is in cluster 1, the next observation that doesn't belong to cluster 1 is in cluster 2, and so forth. Because the ordering of clusters carries no meaning, this convention keeps identical clusterings from receiving different labels when the model is fit multiple times.
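The ordering convention can be mimicked in base R by relabelling raw engine labels in order of first appearance. The sketch below is one way to get that ordering, assuming raw labels from stats::kmeans(); it is not necessarily how tidyclust implements it.

```r
set.seed(1)
raw_labels <- stats::kmeans(mtcars, centers = 3)$cluster

# Relabel so that cluster numbers follow the order of first appearance:
# the first observation is always in cluster 1, the next new cluster is 2, ...
relabelled <- match(raw_labels, unique(raw_labels))

relabelled[1]
table(raw_labels, relabelled)
```

The cross-table has exactly one non-zero cell per row, showing that the relabelling only renames clusters and never changes which observations are grouped together.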
Related functions
extract_cluster_assignment() is a part of a trio of functions doing
similar things:
- extract_cluster_assignment() returns the cluster assignments of the training observations
- extract_centroids() returns the location of the centroids
- predict() returns the cluster a new observation belongs to
Value
A tibble::tibble() with 1 column named .cluster. This tibble will
correspond to the training data set.
See Also
extract_centroids(), predict.cluster_fit()
Examples
kmeans_spec <- k_means(num_clusters = 5) |>
set_engine("stats")
kmeans_fit <- fit(kmeans_spec, ~., mtcars)
kmeans_fit |>
extract_cluster_assignment()
kmeans_fit |>
extract_cluster_assignment(prefix = "C_")
kmeans_fit |>
extract_cluster_assignment(labels = c("A", "B", "C", "D", "E"))
# Some models, such as `hier_clust()`, are fit in such a way that you can specify
# the number of clusters after the model is fit
hclust_spec <- hier_clust() |>
set_engine("stats")
hclust_fit <- fit(hclust_spec, ~., mtcars)
hclust_fit |>
extract_cluster_assignment(num_clusters = 2)
hclust_fit |>
extract_cluster_assignment(cut_height = 250)
S3 method to get fitted model summary info depending on engine
Description
S3 method to get fitted model summary info depending on engine
Usage
extract_fit_summary(object, ...)
Arguments
object |
a fitted |
... |
other arguments passed to methods |
Details
The elements cluster_names and cluster_assignments will be factors.
Value
A list with various summary elements
Examples
kmeans_spec <- k_means(num_clusters = 5) |>
set_engine("stats")
kmeans_fit <- fit(kmeans_spec, ~., mtcars)
kmeans_fit |>
extract_fit_summary()
Splice final parameters into objects
Description
These functions are deprecated. Please use tune::finalize_model() and
tune::finalize_workflow() instead, which now support cluster_spec
objects natively.
Usage
finalize_model_tidyclust(x, parameters)
finalize_workflow_tidyclust(x, parameters)
Arguments
x |
A recipe, |
parameters |
A list or 1-row tibble of parameter values. Note that the
column names of the tibble should be the |
Value
An updated version of x.
Examples
kmeans_spec <- k_means(num_clusters = tune())
best_params <- data.frame(num_clusters = 5)
# Old:
finalize_model_tidyclust(kmeans_spec, best_params)
# New:
tune::finalize_model(kmeans_spec, best_params)
Fit a Model Specification to a Data Set
Description
fit() and fit_xy() take a model specification, translate the
required code by substituting arguments, and execute the model fit routine.
Usage
## S3 method for class 'cluster_spec'
fit(object, formula, data, control = control_cluster(), ...)
## S3 method for class 'cluster_spec'
fit_xy(object, x, case_weights = NULL, control = control_cluster(), ...)
Arguments
object |
An object of class |
formula |
An object of class |
data |
Optional, depending on the interface (see Details below). A data frame containing all relevant variables (e.g. predictors, case weights, etc). Note: when needed, a named argument should be used. |
control |
A named list with elements |
... |
Not currently used; values passed here will be ignored. Other
options required to fit the model should be passed using |
x |
A matrix, sparse matrix, or data frame of predictors. Only some
models have support for sparse matrix input. See |
case_weights |
An optional classed vector of numeric case weights. This
must return |
Details
fit() and fit_xy() substitute the current arguments in the
model specification into the computational engine's code, check them for
validity, then fit the model using the data and the engine-specific code.
Different model functions have different interfaces (e.g. formula or
x/y) and these functions translate between the interface used
when fit() or fit_xy() was invoked and the one required by the
underlying model.
When possible, these functions attempt to avoid making copies of the data.
For example, if the underlying model uses a formula and fit() is invoked,
the original data are referenced when the model is fit. However, if the
underlying model uses something else, such as x/y, the formula is
evaluated and the data are converted to the required format. In this case,
any calls in the resulting model objects reference the temporary objects
used to fit the model.
If the model engine has not been set, the model's default engine will be
used (as discussed on each model page). If the verbosity option of
control_cluster() is greater than zero, a warning will be produced.
If you would like to use an alternative method for generating contrasts
when supplying a formula to fit(), set the global option contrasts to
your preferred method. For example, you might set it to: options(contrasts = c(unordered = "contr.helmert", ordered = "contr.poly")). See the help
page for stats::contr.treatment() for more possible contrast types.
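The effect of the contrasts option can be seen with stats::model.matrix() alone. The sketch below compares the default treatment coding with Helmert coding on a made-up factor; how tidyclust consumes the option internally is not shown here.

```r
# Made-up data with one unordered factor.
toy <- data.frame(group = factor(c("a", "b", "c", "a")))

# Default treatment contrasts: 0/1 indicator columns.
default_mm <- model.matrix(~ group, data = toy)

# Temporarily switch the global option, restoring the old value afterwards.
old <- options(contrasts = c(unordered = "contr.helmert", ordered = "contr.poly"))
helmert_mm <- model.matrix(~ group, data = toy)
options(old)

default_mm
helmert_mm
```

Saving the return value of options() and passing it back restores the previous setting, which avoids changing contrasts for unrelated code in the same session.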
Value
A cluster_fit object that contains several elements:
- spec: The model specification object (object in the call to fit)
- fit: when the model is executed without error, this is the model object. Otherwise, it is a try-error object with the error message.
- preproc: any objects needed to convert between a formula and non-formula interface (such as the terms object)
The return value will also have a class related to the fitted model (e.g.
"_kmeans") before the base class of "cluster_fit".
A fitted cluster_fit object.
See Also
set_engine(), control_cluster(), cluster_spec,
cluster_fit
Examples
library(dplyr)
kmeans_mod <- k_means(num_clusters = 5)
using_formula <-
kmeans_mod |>
set_engine("stats") |>
fit(~., data = mtcars)
using_x <-
kmeans_mod |>
set_engine("stats") |>
fit_xy(x = mtcars)
using_formula
using_x
Computes distance from observations to centroids
Description
Computes distance from observations to centroids
Usage
get_centroid_dists(
  new_data,
  centroids,
  dist_fun = function(x, y) {
    philentropy::dist_many_many(x, y, method = "euclidean")
  }
)
Arguments
new_data |
A data frame |
centroids |
A data frame where each row is a centroid. |
dist_fun |
A function of the form |
Get colors for tidyclust text.
Description
Get colors for tidyclust text.
Usage
get_tidyclust_colors()
Value
a list of cli functions.
Construct a single row summary "glance" of a model, fit, or other object
Description
This method glances the model in a tidyclust model object, if it exists.
Usage
## S3 method for class 'cluster_fit'
glance(x, ...)
Arguments
x |
model or other R object to convert to single-row data frame |
... |
other arguments passed to methods |
Value
A one-row tibble with model-level summary statistics such as total within-cluster sum of squares, between-cluster sum of squares, and number of iterations. Support depends on the underlying engine.
Examples
# glance() support depends on the underlying engine.
## Not run:
kmeans_fit <- k_means(num_clusters = 3) |>
set_engine("stats") |>
fit(~., mtcars)
glance(kmeans_fit)
## End(Not run)
Gaussian Mixture Models (GMM)
Description
gm_clust() defines a model that fits clusters by fitting a specified number of
multivariate Gaussian distributions (MVG) to the data.
There are multiple implementations for this model, and the implementation is chosen by setting the model engine. The engine-specific pages for this model are listed below.
Usage
gm_clust(
mode = "partition",
engine = "mclust",
num_clusters = NULL,
circular = TRUE,
shared_size = TRUE,
zero_covariance = TRUE,
shared_orientation = TRUE,
shared_shape = TRUE
)
Arguments
mode |
A single character string for the type of model. The only possible value for this model is "partition". |
engine |
A single character string specifying what computational engine
to use for fitting. The engine for this model is |
num_clusters |
Positive integer, number of clusters in model (required). |
circular |
Boolean, whether or not to fit circular MVG distributions for each cluster. Default |
shared_size |
Boolean, whether each cluster MVG should have the same size/volume. Default |
zero_covariance |
Boolean, whether or not to assign covariances of 0 for each MVG. Default |
shared_orientation |
Boolean, whether each cluster MVG should have the same orientation. Default |
shared_shape |
Boolean, whether each cluster MVG should have the same shape. Default |
Details
What does it mean to predict?
To predict the cluster assignment for a new observation, we determine which cluster a point has the highest probability of belonging to.
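A minimal sketch of this rule for a one-dimensional, two-component mixture is given below. The mixture weights, means and standard deviations are made up, and the helper `posterior()` is hypothetical; a fitted mclust model would supply its own estimated parameters.

```r
# Made-up mixture parameters for two one-dimensional Gaussian clusters.
weights <- c(0.6, 0.4)
means <- c(0, 5)
sds <- c(1, 2)

# Posterior membership probabilities for each new point (Bayes' rule).
posterior <- function(x) {
  dens <- sapply(seq_along(weights), function(k) {
    weights[k] * stats::dnorm(x, mean = means[k], sd = sds[k])
  })
  dens <- matrix(dens, nrow = length(x))
  dens / rowSums(dens)
}

new_points <- c(-1, 2, 6)
probs <- posterior(new_points)

# Predicted cluster: the component with the highest posterior probability.
max.col(probs, ties.method = "first")
```

Each row of `probs` sums to one, and the prediction simply picks the column with the largest value in that row.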
Value
A gm_clust cluster specification.
Examples
# Show all engines
modelenv::get_from_env("gm_clust")
gm_clust()
Gaussian mixture covariance structure parameters
Description
Logical flags controlling the covariance structure of cluster Gaussians
fit by tidyclust::gm_clust() with the mclust engine. See
gm_clust() for descriptions.
Usage
circular(values = c(TRUE, FALSE))
zero_covariance(values = c(TRUE, FALSE))
shared_orientation(values = c(TRUE, FALSE))
shared_shape(values = c(TRUE, FALSE))
shared_size(values = c(TRUE, FALSE))
Arguments
values |
A vector of possible values ( |
Value
A dials parameter object for use with tune::tune_grid() and
related functions.
Examples
circular()
zero_covariance()
shared_orientation()
shared_shape()
shared_size()
Hierarchical (Agglomerative) Clustering
Description
hier_clust() defines a model that fits clusters based on a distance-based
dendrogram.
There are different ways to fit this model, and the method of estimation is chosen by setting the model engine. The engine-specific pages for this model are listed below.
Usage
hier_clust(
mode = "partition",
engine = "stats",
num_clusters = NULL,
cut_height = NULL,
linkage_method = "complete",
dist_fun = NULL
)
Arguments
mode |
A single character string for the type of model. The only possible value for this model is "partition". |
engine |
A single character string specifying what computational engine
to use for fitting. Possible engines are listed below. The default for this
model is |
num_clusters |
Positive integer, number of clusters in model (optional). |
cut_height |
Positive double, height at which to cut dendrogram to
obtain cluster assignments (only used if |
linkage_method |
the agglomeration method to be used. This should be (an
unambiguous abbreviation of) one of |
dist_fun |
A function for calculating the distance between observations.
Defaults to |
Details
What does it mean to predict?
To predict the cluster assignment for a new observation, we find the closest cluster. How we measure “closeness” is dependent on the specified type of linkage in the model:
- single linkage: The new observation is assigned to the same cluster as its nearest observation from the training data.
- complete linkage: The new observation is assigned to the cluster with the smallest maximum distance between its training observations and the new observation.
- average linkage: The new observation is assigned to the cluster with the smallest average distance between its training observations and the new observation.
- centroid method: The new observation is assigned to the cluster with the closest centroid, as in prediction for k_means.
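The four rules above can be sketched in base R by summarising the distances from a new point to the training observations within each cluster. This is an illustration of the definitions only, not tidyclust's prediction code; the new observation and the choice of three clusters are made up.

```r
x <- as.matrix(mtcars[, c("mpg", "disp", "hp")])
tree <- stats::hclust(stats::dist(x), method = "complete")
clusters <- stats::cutree(tree, k = 3)

# Euclidean distances from one new point to every training observation.
new_point <- c(mpg = 20, disp = 200, hp = 120)  # made-up observation
d <- sqrt(colSums((t(x) - new_point)^2))

# Summarise the distances per cluster, one summary per linkage rule.
single   <- tapply(d, clusters, min)
complete <- tapply(d, clusters, max)
average  <- tapply(d, clusters, mean)

# Centroid method: distance to each cluster's mean location.
centers  <- rowsum(x, clusters) / as.vector(table(clusters))
centroid <- sqrt(colSums((t(centers) - new_point)^2))

# The predicted cluster under each rule is the one with the smallest value.
sapply(
  list(single = single, complete = complete, average = average, centroid = centroid),
  function(v) names(v)[which.min(v)]
)
```

The rules can disagree for the same point, which is why the prediction depends on the linkage method stored in the model specification.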
Examples
# Show all engines
modelenv::get_from_env("hier_clust")
hier_clust()
K-Means
Description
k_means() defines a model that fits clusters based on distances to a number
of centers. This definition doesn't just include K-means, but includes
models like K-prototypes.
There are different ways to fit this model, and the method of estimation is chosen by setting the model engine. The engine-specific pages for this model are listed below.
- stats: Classical K-means
- ClusterR: Classical K-means
- klaR: K-Modes
- clustMixType: K-prototypes
Usage
k_means(mode = "partition", engine = "stats", num_clusters = NULL)
Arguments
mode |
A single character string for the type of model. The only possible value for this model is "partition". |
engine |
A single character string specifying what computational engine
to use for fitting. Possible engines are listed below. The default for this
model is |
num_clusters |
Positive integer, number of clusters in model. |
Details
What does it mean to predict?
For a K-means model, each cluster is defined by a location in the predictor space. Therefore, prediction in tidyclust is defined by calculating which cluster centroid an observation is closest to.
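The nearest-centroid rule can be written out in a few lines of base R. The sketch below uses stats::kmeans() and a hypothetical helper `dist_to_centroids()`; it illustrates the rule and is not tidyclust's predict() method.

```r
x <- scale(mtcars[, c("mpg", "disp", "hp")])
set.seed(1)
km <- stats::kmeans(x, centers = 3, iter.max = 50)

# Hypothetical helper: squared Euclidean distance from every row of
# `new_data` to every centroid, returned as an n x k matrix.
dist_to_centroids <- function(new_data, centroids) {
  apply(centroids, 1, function(ctr) rowSums(sweep(new_data, 2, ctr)^2))
}

d <- dist_to_centroids(x, km$centers)
predicted <- max.col(-d, ties.method = "first")  # index of the smallest distance

# For a converged fit, the training rows get back their fitted cluster.
all(predicted == km$cluster)
```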
Value
A k_means cluster specification.
Examples
# Show all engines
modelenv::get_from_env("k_means")
k_means()
Knit engine-specific documentation
Description
Knit engine-specific documentation
Usage
knit_engine_docs(pattern = NULL)
Arguments
pattern |
A regular expression to specify which files to knit. The default knits all engine documentation files. |
Value
A tibble with column file for the file name and result (a
character vector that echoes the output file name or, when there is
a failure, the error message).
The agglomeration Linkage method
Description
The agglomeration Linkage method
Usage
linkage_method(values = values_linkage_method)
values_linkage_method
Arguments
values |
A character string of possible values. See |
Format
An object of class character of length 8.
Details
This parameter is used in tidyclust models for hier_clust().
Value
A dials parameter object for use with tune::tune_grid() and
related functions.
Examples
values_linkage_method
linkage_method()
Locate and show errors/warnings in engine-specific documentation
Description
Locate and show errors/warnings in engine-specific documentation
Usage
list_md_problems()
Value
A tibble with column file for the file name, line indicating
the line where the error/warning occurred, and problem showing the
error/warning message.
Quietly load package namespace
Description
For one or more packages, load the namespace. This is used during parallel processing since the different parallel backends handle the package environments differently.
Usage
## S3 method for class 'cluster_spec'
load_pkgs(x, infra = TRUE, ...)
Arguments
x |
A character vector of packages. |
infra |
Should base tidymodels packages be loaded as well? |
Value
An invisible NULL.
Prepend a new class
Description
This adds an extra class to a base class of "cluster_spec".
Usage
make_classes_tidyclust(prefix)
Arguments
prefix |
A character string for a class. |
Value
A character vector.
mclust fit helper function
Description
This function returns the mclust model name based on the specified TRUE/FALSE model arguments.
Usage
mclust_helper(
circular,
zero_covariance,
shared_orientation,
shared_shape,
shared_size
)
Arguments
circular |
Whether or not to fit circular MVG distributions for each cluster. |
zero_covariance |
Whether or not to assign covariances of 0 for each MVG. |
shared_orientation |
Whether each cluster MVG should have the same orientation. |
shared_shape |
Whether each cluster MVG should have the same shape. |
shared_size |
Whether each cluster MVG should have the same size/volume. |
Value
string containing mclust model name
Mean Shift Clustering
Description
mean_shift() defines a model that fits clusters by iteratively shifting
observations toward regions of high density, with the number of clusters
determined automatically from the data.
There are different implementations for this model, and the implementation is chosen by setting the model engine. The engine-specific pages for this model are listed below.
Usage
mean_shift(mode = "partition", engine = "LPCM", bandwidth = NULL)
Arguments
mode |
A single character string for the type of model. The only
possible value for this model is |
engine |
A single character string specifying what computational engine
to use for fitting. The default engine for this model is |
bandwidth |
Positive double, kernel bandwidth controlling the size of the neighborhood used to compute the density estimate (required). |
Details
What does it mean to predict?
To predict the cluster assignment for a new observation, the mean shift procedure is run from the new point until it converges to a mode. The observation is then assigned to the cluster of the nearest discovered training mode.
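The prediction rule above can be sketched in base R. This is a minimal illustration assuming a Gaussian kernel, not the code used by the LPCM engine; `shift_to_mode()` and the toy data are hypothetical, and the "training modes" are obtained here by simply starting the procedure at the two known blob centers.

```r
# Illustrative Gaussian-kernel mean shift from one starting point.
# `train` must be a numeric matrix; this is not the engine's implementation.
shift_to_mode <- function(start, train, bandwidth, tol = 1e-6, max_iter = 500) {
  x <- start
  for (i in seq_len(max_iter)) {
    # Squared distance from the current position to every training row
    d2 <- rowSums(sweep(train, 2, x)^2)
    w <- exp(-d2 / (2 * bandwidth^2))
    # Kernel-weighted mean of the training rows
    x_new <- colSums(train * w) / sum(w)
    if (sqrt(sum((x_new - x)^2)) < tol) break
    x <- x_new
  }
  x_new
}

set.seed(1)
train <- rbind(
  matrix(rnorm(100, mean = 0, sd = 0.3), ncol = 2),
  matrix(rnorm(100, mean = 5, sd = 0.3), ncol = 2)
)

# Stand-in for the modes discovered during training (assumption)
modes <- rbind(
  shift_to_mode(c(0, 0), train, bandwidth = 1),
  shift_to_mode(c(5, 5), train, bandwidth = 1)
)

# Run the procedure from a new point, then pick the nearest training mode
new_point <- c(4.2, 4.6)
converged <- shift_to_mode(new_point, train, bandwidth = 1)
assigned <- which.min(rowSums(sweep(modes, 2, converged)^2))
assigned
```

The new point starts close to the second blob, so it drifts to that blob's mode and is assigned to it.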
Value
A mean_shift cluster specification.
Examples
# Show all engines
modelenv::get_from_env("mean_shift")
mean_shift()
Determine the minimum set of model fits
Description
Determine the minimum set of model fits
Usage
## S3 method for class 'cluster_spec'
min_grid(x, grid, ...)
Arguments
x |
A cluster specification. |
grid |
A tibble with tuning parameter combinations. |
... |
Not currently used. |
Value
A tibble with the minimum tuning parameters to fit and an additional list column with the parameter combinations used for prediction.
Minimum number of points
Description
The minimum number of connected points required to form a core point in
density-based clustering. Used in tidyclust::db_clust() with the dbscan
and hdbscan engines.
Usage
min_points(range = c(2L, 20L), trans = NULL)
Arguments
range |
A two-element vector holding the defaults for the smallest and largest possible values, respectively. If a transformation is specified, these values should be in the transformed units. |
trans |
A |
Value
A dials parameter object for use with tune::tune_grid() and
related functions.
Examples
min_points()
Construct a new clustering metric function
Description
These functions provide convenient wrappers to create the one
type of metric function in tidyclust: clustering metrics. They add a
metric-specific class to fn. These features are used by
cluster_metric_set() and by tune_cluster() when tuning.
Usage
new_cluster_metric(fn, direction)
Arguments
fn |
A function. |
direction |
A string. One of:
|
Value
A cluster_metric object.
Functions required for tidyclust-adjacent packages
Description
These functions are helpful when creating new packages that will register new cluster specifications.
Usage
new_cluster_spec(cls, args, eng_args, mode, method, engine)
Arguments
cls |
A single character string for the model type (e.g. |
args |
A named list of main model arguments. |
eng_args |
A named list of engine-specific arguments. |
mode |
A single character string for the model mode (e.g.
|
method |
A list of method details or |
engine |
A single character string for the computational engine. |
Value
A cluster_spec object made to work with tidyclust.
Model predictions
Description
Apply to a model to create different types of predictions. predict() can be
used for all types of models and uses the "type" argument for more
specificity.
Usage
## S3 method for class 'cluster_fit'
predict(object, new_data, type = NULL, opts = list(), ...)
## S3 method for class 'cluster_fit'
predict_raw(object, new_data, opts = list(), ...)
Arguments
object |
An object of class |
new_data |
A rectangular data object, such as a data frame. |
type |
A single character value or |
opts |
A list of optional arguments to the underlying predict function
that will be used when |
... |
Optional arguments passed to the underlying predict function.
Use |
Details
If "type" is not supplied to predict(), then a choice is made:
- type = "cluster" for clustering models
predict() is designed to provide a tidy result (see "Value" section below)
in a tibble output format.
The clusters are ordered so that the first observation in the training data set is in cluster 1, the next observation that doesn't belong to cluster 1 is in cluster 2, and so on. Because the ordering of clusters carries no meaning, this convention prevents identical clusterings from receiving different labels when the model is fit multiple times.
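As a rough illustration of this ordering rule (a base R sketch, not tidyclust's internal code), labels can be renumbered by order of first appearance. The vectors `raw_labels` and `other_labels` are made-up engine outputs describing the same partition under different names.

```r
# Two hypothetical label vectors for the same partition of six observations
raw_labels   <- c(3, 3, 1, 2, 1, 3)
other_labels <- c("b", "b", "c", "a", "c", "b")

# Renumber by order of first appearance in the data
relabel_by_first_appearance <- function(x) match(x, unique(x))

relabel_by_first_appearance(raw_labels)
# 1 1 2 3 2 1
relabel_by_first_appearance(other_labels)
# 1 1 2 3 2 1
```

Both inputs map to the same integer labels, which is the property the ordering rule is meant to guarantee.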
What does it mean to predict?
Prediction is not always formally defined for clustering models. Therefore,
each cluster_spec method has its own section on how "prediction"
is interpreted and, where implemented, how it is carried out.
Related functions
predict() when used with tidyclust objects is a part of a trio of functions
doing similar things:
- extract_cluster_assignment() returns the cluster assignments of the training observations
- extract_centroids() returns the location of the centroids
- predict() returns the cluster a new observation belongs to
Value
With the exception of type = "raw", the results of
predict.cluster_fit() will be a tibble with as many rows in the output as
there are rows in new_data, and the column names will be predictable.
For clustering results the tibble will have a .pred_cluster column.
Using type = "raw" with predict.cluster_fit() will return the
unadulterated results of the prediction function.
When the model fit failed and the error was captured, the predict()
function will return the same structure as above but filled with missing
values. This does not currently work for multivariate models.
See Also
extract_cluster_assignment() extract_centroids()
Examples
kmeans_spec <- k_means(num_clusters = 5) |>
set_engine("stats")
kmeans_fit <- fit(kmeans_spec, ~., mtcars)
kmeans_fit |>
predict(new_data = mtcars)
# Some models, such as `hier_clust()`, are fit in such a way that you can
# specify the number of clusters after the model is fit
hclust_spec <- hier_clust() |>
set_engine("stats")
hclust_fit <- fit(hclust_spec, ~., mtcars)
hclust_fit |>
predict(new_data = mtcars[4:6, ], num_clusters = 2)
hclust_fit |>
predict(new_data = mtcars[4:6, ], cut_height = 250)
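The reason a hierarchical fit can defer the choice of clusters is that the dendrogram is built once and only cut afterwards. The base R sketch below shows the two kinds of cut with stats::hclust() and stats::cutree(); it assumes complete linkage and Euclidean distance, and it only cuts the training tree rather than predicting on new data.

```r
# Build the dendrogram once
hc <- hclust(dist(mtcars), method = "complete")

# Cut the same tree in two different ways after fitting
by_count  <- cutree(hc, k = 2)    # ask for a number of clusters
by_height <- cutree(hc, h = 250)  # or cut at a given height

table(by_count)
table(by_height)
```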
Other predict methods.
Description
These are internal functions not meant to be directly called by the user.
Usage
predict_cluster(object, ...)
## S3 method for class 'cluster_fit'
predict_cluster(object, new_data, ...)
Arguments
object |
An object of class |
... |
Optional arguments passed to the underlying predict function.
Use |
new_data |
A rectangular data object, such as a data frame. |
Value
A tibble::tibble().
Prepares data and distance matrices for metric calculation
Description
Prepares data and distance matrices for metric calculation
Usage
prep_data_dist(
object,
new_data = NULL,
dists = NULL,
dist_fun = philentropy::distance
)
Arguments
object |
A fitted |
new_data |
A dataset to calculate predictions on. If |
dists |
A distance matrix for the data. If |
dist_fun |
A function of the form |
Value
A list
Print a cluster object
Description
Print a cluster object
Usage
## S3 method for class 'cluster_fit'
print(x, ...)
## S3 method for class 'cluster_spec'
print(x, ...)
Arguments
x |
A |
... |
Arguments passed to the underlying print method. |
Value
The input x, invisibly.
Radius
Description
The radius used by density-based clustering to determine core points and
cluster assignments. Used in tidyclust::db_clust() with the dbscan
engine.
Usage
radius(range = c(0, dials::unknown()), trans = NULL)
Arguments
range |
A two-element vector holding the defaults for the smallest and largest possible values, respectively. If a transformation is specified, these values should be in the transformed units. |
trans |
A |
Value
A dials parameter object for use with tune::tune_grid() and
related functions.
Examples
radius()
Relabels clusters to match another cluster assignment
Description
When forcing one-to-one, the user needs to decide what to prioritize:
"accuracy": optimize raw count of all observations with the same label across the two assignments
"precision": optimize the average percent of each alt cluster that matches the corresponding primary cluster
Usage
reconcile_clusterings_mapping(
primary,
alternative,
one_to_one = TRUE,
optimize = "accuracy"
)
Arguments
primary |
A vector containing cluster labels, to be matched |
alternative |
Another vector containing cluster labels, to be changed |
one_to_one |
Boolean; should each alt cluster match only one primary cluster? |
optimize |
One of "accuracy" or "precision"; see description. |
Details
Retains the cluster labels of the primary assignment and relabels the alternate assignment to match as closely as possible. The user must decide whether clusters are forced to be "one-to-one"; that is, are we allowed to assign multiple labels from the alternate assignment to the same primary label?
Cluster labels are arbitrary — two clusterings of the same data may agree on
the groups but use different label names (e.g. "Dog" vs "Apple" for the same
cluster). reconcile_clusterings_mapping() is useful when you want to
compare two clusterings, for example:
Comparing cluster assignments across cross-validation folds.
Checking stability of a clustering algorithm across different random seeds.
Aligning predicted clusters on new data with the original training labels.
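To make the relabelling idea concrete, here is a base R sketch of the simpler case where several alternate labels may map to the same primary label. It is only an illustration of the concept, not the function's implementation: each alternate label is recoded to the primary label it overlaps with most. The one-to-one case needs an assignment-problem solver and is not shown.

```r
primary <- c("Apple", "Apple", "Carrot", "Carrot", "Banana", "Banana")
alt     <- c("Dog", "Dog", "Cat", "Dog", "Fish", "Parrot")

# Contingency table: rows are alternate labels, columns are primary labels
overlap <- table(alt, primary)

# For each alternate label, take the primary label with the largest overlap
best_match <- colnames(overlap)[apply(overlap, 1, which.max)]
names(best_match) <- rownames(overlap)

alt_recoded <- unname(best_match[alt])
data.frame(primary, alt, alt_recoded)
```

Here "Dog" is recoded to "Apple" because two of its three observations fall in that primary cluster, and both "Fish" and "Parrot" are recoded to "Banana".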
Value
A tibble with 3 columns; primary, alt, alt_recoded
Examples
factor1 <- c("Apple", "Apple", "Carrot", "Carrot", "Banana", "Banana")
factor2 <- c("Dog", "Dog", "Cat", "Dog", "Fish", "Fish")
reconcile_clusterings_mapping(factor1, factor2)
factor1 <- c("Apple", "Apple", "Carrot", "Carrot", "Banana", "Banana")
factor2 <- c("Dog", "Dog", "Cat", "Dog", "Fish", "Parrot")
reconcile_clusterings_mapping(factor1, factor2, one_to_one = FALSE)
Objects exported from other packages
Description
These objects are imported from other packages. Follow the links below to see their documentation.
- dplyr
- generics
- hardhat: extract_fit_engine, extract_fit_parsnip, extract_parameter_set_dials, extract_preprocessor, extract_spec_parsnip, tune
- parsnip
- tune
Get required packages for a cluster object
Description
Get required packages for a cluster object
Usage
## S3 method for class 'cluster_spec'
required_pkgs(x, infra = TRUE, ...)
## S3 method for class 'cluster_fit'
required_pkgs(x, infra = TRUE, ...)
Arguments
x |
A |
infra |
A logical. Should tidyclust itself be included in the result? |
... |
Currently unused. |
Value
A character vector of required package names.
Change arguments of a cluster specification
Description
Change arguments of a cluster specification
Usage
## S3 method for class 'cluster_spec'
set_args(object, ...)
Arguments
object |
|
... |
One or more named model arguments. |
Value
An updated cluster_spec object.
Change engine of a cluster specification
Description
Change engine of a cluster specification
Usage
## S3 method for class 'cluster_spec'
set_engine(object, engine, ...)
Arguments
object |
|
engine |
A character string for the software that should be used to fit the model. This is highly dependent on the type of model (e.g. k-means, hierarchical clustering, etc.). |
... |
Any optional arguments associated with the chosen computational
engine. These are captured as quosures and can be tuned with |
Value
An updated cluster_spec object.
Change mode of a cluster specification
Description
Change mode of a cluster specification
Usage
## S3 method for class 'cluster_spec'
set_mode(object, mode, ...)
Arguments
object |
|
mode |
A character string for the model mode (e.g. "partition") |
... |
One or more named model arguments. |
Value
An updated cluster_spec object.
Measures silhouette between clusters
Description
Measures silhouette between clusters
Usage
silhouette(
object,
new_data = NULL,
dists = NULL,
dist_fun = philentropy::distance
)
Arguments
object |
A fitted tidyclust model |
new_data |
A dataset to predict on. If |
dists |
A distance matrix. Used if |
dist_fun |
A function of the form |
Details
silhouette_avg() is the corresponding cluster metric function that
returns the average of the values given by silhouette().
Value
A tibble giving the silhouette for each observation.
Examples
kmeans_spec <- k_means(num_clusters = 5) |>
set_engine("stats")
kmeans_fit <- fit(kmeans_spec, ~., mtcars)
dists <- mtcars |>
as.matrix() |>
dist()
silhouette(kmeans_fit, dists = dists)
Measures average silhouette across all observations
Description
Measures average silhouette across all observations
Usage
silhouette_avg(object, ...)
## S3 method for class 'cluster_spec'
silhouette_avg(object, ...)
## S3 method for class 'cluster_fit'
silhouette_avg(object, new_data = NULL, dists = NULL, dist_fun = NULL, ...)
## S3 method for class 'workflow'
silhouette_avg(object, new_data = NULL, dists = NULL, dist_fun = NULL, ...)
silhouette_avg_vec(
object,
new_data = NULL,
dists = NULL,
dist_fun = philentropy::distance,
...
)
Arguments
object |
A fitted kmeans tidyclust model |
... |
Other arguments passed to methods. |
new_data |
A dataset to predict on. If |
dists |
A distance matrix. Used if |
dist_fun |
A function of the form |
Details
Not to be confused with silhouette() that returns a tibble
with silhouette for each observation. The silhouette coefficient ranges
from -1 to 1, where values close to 1 indicate well-separated clusters.
This metric has direction = "maximize", so tune::select_best() and
tune::show_best() will return models with the highest silhouette values.
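For readers who want to see where the -1 to 1 range comes from, the following base R sketch computes the silhouette width of each observation as (b - a) / max(a, b), where a is the mean distance to the other members of its own cluster and b is the smallest mean distance to any other cluster. `silhouette_widths()` is a hypothetical helper written for this illustration. It assumes at least two clusters and is not the code tidyclust runs.

```r
# Per-observation silhouette widths from a distance object and cluster labels.
# Assumes at least two distinct clusters.
silhouette_widths <- function(d, cluster) {
  d <- as.matrix(d)
  vapply(seq_len(nrow(d)), function(i) {
    own <- cluster == cluster[i]
    own[i] <- FALSE
    if (!any(own)) return(0)  # singleton cluster: width taken as 0
    a <- mean(d[i, own])
    others <- setdiff(unique(cluster), cluster[i])
    b <- min(vapply(others, function(k) mean(d[i, cluster == k]), numeric(1)))
    (b - a) / max(a, b)
  }, numeric(1))
}

set.seed(42)
x <- scale(mtcars)
km <- kmeans(x, centers = 3, nstart = 10)

widths <- silhouette_widths(dist(x), km$cluster)
mean(widths)  # the average is the quantity a tuning routine would maximize
```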
Value
A double; the average silhouette.
See Also
Other cluster metric:
sse_ratio(),
sse_total(),
sse_within_total()
Examples
kmeans_spec <- k_means(num_clusters = 5) |>
set_engine("stats")
kmeans_fit <- fit(kmeans_spec, ~., mtcars)
dists <- mtcars |>
as.matrix() |>
dist()
silhouette_avg(kmeans_fit, dists = dists)
silhouette_avg_vec(kmeans_fit, dists = dists)
Compute the ratio of the WSS to the total SSE
Description
Compute the ratio of the WSS to the total SSE
Usage
sse_ratio(object, ...)
## S3 method for class 'cluster_spec'
sse_ratio(object, ...)
## S3 method for class 'cluster_fit'
sse_ratio(object, new_data = NULL, dist_fun = NULL, ...)
## S3 method for class 'workflow'
sse_ratio(object, new_data = NULL, dist_fun = NULL, ...)
sse_ratio_vec(
object,
new_data = NULL,
dist_fun = function(x, y) {
philentropy::dist_many_many(x, y, method =
"euclidean")
},
...
)
Arguments
object |
A fitted kmeans tidyclust model |
... |
Other arguments passed to methods. |
new_data |
A dataset to predict on. If |
dist_fun |
A function of the form |
Value
A tibble with 3 columns; .metric, .estimator, and .estimate.
See Also
Other cluster metric:
silhouette_avg(),
sse_total(),
sse_within_total()
Examples
kmeans_spec <- k_means(num_clusters = 5) |>
set_engine("stats")
kmeans_fit <- fit(kmeans_spec, ~., mtcars)
sse_ratio(kmeans_fit)
sse_ratio_vec(kmeans_fit)
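As a cross-check on what the ratio measures, the sketch below recomputes the within-cluster and total sums of squares by hand from a plain stats::kmeans() fit. It assumes squared Euclidean distance and does not call tidyclust, so it illustrates the quantity rather than the package's code path.

```r
set.seed(123)
x <- as.matrix(mtcars)
km <- kmeans(x, centers = 5, nstart = 10)

# Total SS: squared distances from every row to the overall centroid
total_ss <- sum(sweep(x, 2, colMeans(x))^2)

# Within SS: squared distances from every row to its own cluster centroid
within_ss <- sum((x - km$centers[km$cluster, ])^2)

ratio <- within_ss / total_ss
ratio
```

A ratio near 0 means most of the variation is between clusters, and a ratio near 1 means the clustering explains very little.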
Compute the total sum of squares
Description
Compute the total sum of squares
Usage
sse_total(object, ...)
## S3 method for class 'cluster_spec'
sse_total(object, ...)
## S3 method for class 'cluster_fit'
sse_total(object, new_data = NULL, dist_fun = NULL, ...)
## S3 method for class 'workflow'
sse_total(object, new_data = NULL, dist_fun = NULL, ...)
sse_total_vec(
object,
new_data = NULL,
dist_fun = function(x, y) {
philentropy::dist_many_many(x, y, method =
"euclidean")
},
...
)
Arguments
object |
A fitted kmeans tidyclust model |
... |
Other arguments passed to methods. |
new_data |
A dataset to predict on. If |
dist_fun |
A function of the form |
Value
A tibble with 3 columns; .metric, .estimator, and .estimate.
See Also
Other cluster metric:
silhouette_avg(),
sse_ratio(),
sse_within_total()
Examples
kmeans_spec <- k_means(num_clusters = 5) |>
set_engine("stats")
kmeans_fit <- fit(kmeans_spec, ~., mtcars)
sse_total(kmeans_fit)
sse_total_vec(kmeans_fit)
Calculates Sum of Squared Error in each cluster
Description
Calculates Sum of Squared Error in each cluster
Usage
sse_within(
object,
new_data = NULL,
dist_fun = function(x, y) {
philentropy::dist_many_many(x, y, method =
"euclidean")
}
)
Arguments
object |
A fitted kmeans tidyclust model |
new_data |
A dataset to predict on. If |
dist_fun |
A function of the form |
Details
sse_within_total() is the corresponding cluster metric function
that returns the sum of the values given by sse_within().
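The relationship between the per-cluster values and their total can be checked in base R. This sketch assumes squared Euclidean distance and a plain stats::kmeans() fit; it mirrors the idea only and is not tidyclust's implementation.

```r
set.seed(123)
x <- as.matrix(mtcars)
km <- kmeans(x, centers = 5, nstart = 10)

# Squared distance from each row to the centroid of its assigned cluster
sq_dist <- rowSums((x - km$centers[km$cluster, ])^2)

# One SSE value per cluster, and their sum
per_cluster <- tapply(sq_dist, km$cluster, sum)
per_cluster
sum(per_cluster)
```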
Value
A tibble with two columns, the cluster name and the SSE within that cluster.
Examples
kmeans_spec <- k_means(num_clusters = 5) |>
set_engine("stats")
kmeans_fit <- fit(kmeans_spec, ~., mtcars)
sse_within(kmeans_fit)
Compute the sum of within-cluster SSE
Description
Compute the sum of within-cluster SSE
Usage
sse_within_total(object, ...)
## S3 method for class 'cluster_spec'
sse_within_total(object, ...)
## S3 method for class 'cluster_fit'
sse_within_total(object, new_data = NULL, dist_fun = NULL, ...)
## S3 method for class 'workflow'
sse_within_total(object, new_data = NULL, dist_fun = NULL, ...)
sse_within_total_vec(
object,
new_data = NULL,
dist_fun = function(x, y) {
philentropy::dist_many_many(x, y, method =
"euclidean")
},
...
)
Arguments
object |
A fitted kmeans tidyclust model |
... |
Other arguments passed to methods. |
new_data |
A dataset to predict on. If |
dist_fun |
A function of the form |
Details
Not to be confused with sse_within() that returns a tibble
with within-cluster SSE, one row for each cluster.
Value
A tibble with 3 columns; .metric, .estimator, and .estimate.
See Also
Other cluster metric:
silhouette_avg(),
sse_ratio(),
sse_total()
Examples
kmeans_spec <- k_means(num_clusters = 5) |>
set_engine("stats")
kmeans_fit <- fit(kmeans_spec, ~., mtcars)
sse_within_total(kmeans_fit)
sse_within_total_vec(kmeans_fit)
Turn a tidyclust model object into a tidy tibble
Description
This method tidies the model in a tidyclust model object, if it exists.
Usage
## S3 method for class 'cluster_fit'
tidy(x, ...)
Arguments
x |
An object to be converted into a tidy |
... |
Additional arguments to tidying method. |
Value
A tibble with one row per cluster. Columns depend on the underlying
engine but typically include .cluster and cluster-level summary
statistics such as centroid coordinates or cluster size.
Examples
# tidy() support depends on the underlying engine. For the stats engine,
# broom must be installed.
## Not run:
kmeans_fit <- k_means(num_clusters = 3) |>
set_engine("stats") |>
fit(~., mtcars)
tidy(kmeans_fit)
hclust_fit <- hier_clust(num_clusters = 3) |>
set_engine("stats") |>
fit(~., mtcars)
tidy(hclust_fit)
## End(Not run)
Resolve a Model Specification for a Computational Engine
Description
translate_tidyclust() will translate a model specification into a
code object that is specific to a particular engine (e.g. R package). It
translates tidyclust generic parameters to their counterparts.
Usage
translate_tidyclust(x, ...)
## Default S3 method:
translate_tidyclust(x, engine = x$engine, ...)
Arguments
x |
A model specification. |
... |
Not currently used. |
engine |
The computational engine for the model (see |
Details
translate_tidyclust() produces a template call that lacks the
specific argument values (such as data, etc). These are filled in once
fit() is called with the specifics of the data for the model. The call
may also include tune() arguments if these are in the specification. To
handle the tune() arguments, you need to use the tune package. For more information see
https://www.tidymodels.org/start/tuning/
It does contain the resolved argument names that are specific to the model fitting function/engine.
This function can be useful when you need to understand how tidyclust
goes from a generic model specification to a model fitting function.
Note: this function is used internally and users should only use it to understand what the underlying syntax would be. It should not be used to modify the cluster specification.
Value
Prints translated code.
Get tunable parameters for a cluster specification
Description
Get tunable parameters for a cluster specification
Usage
## S3 method for class 'cluster_spec'
tunable(x, ...)
## S3 method for class 'hier_clust'
tunable(x, ...)
## S3 method for class 'k_means'
tunable(x, ...)
## S3 method for class 'db_clust'
tunable(x, ...)
## S3 method for class 'gm_clust'
tunable(x, ...)
Arguments
x |
An object, such as a recipe, recipe step, workflow, or model specification. |
... |
Other arguments passed to methods |
Value
A tibble with columns name, call_info, source, component,
and component_id describing each tunable parameter.
Get tune arguments for a cluster specification
Description
Get tune arguments for a cluster specification
Usage
## S3 method for class 'cluster_spec'
tune_args(object, full = FALSE, ...)
Arguments
object |
A |
... |
Other arguments passed to methods. |
Value
A tibble describing the tunable arguments in the cluster specification.
Model tuning via grid search
Description
tune_cluster() computes a set of performance metrics for a pre-defined set
of tuning parameters that correspond to a cluster model or recipe across one
or more resamples of the data.
Usage
tune_cluster(object, ...)
## S3 method for class 'cluster_spec'
tune_cluster(
object,
preprocessor,
resamples,
...,
param_info = NULL,
grid = 10,
metrics = NULL,
control = tune::control_grid()
)
## S3 method for class 'workflow'
tune_cluster(
object,
resamples,
...,
param_info = NULL,
grid = 10,
metrics = NULL,
control = tune::control_grid()
)
Arguments
object |
A |
... |
Not currently used. |
preprocessor |
A traditional model formula or a recipe created using
|
resamples |
An |
param_info |
A |
grid |
A data frame of tuning combinations or a positive integer. The data frame should have columns for each parameter being tuned and rows for tuning parameter candidates. An integer denotes the number of candidate parameter sets to be created automatically. |
metrics |
A |
control |
An object used to modify the tuning process. Defaults to
|
Value
An updated version of resamples with extra list columns for
.metrics and .notes (optional columns are .predictions and
.extracts). .notes contains warnings and errors that occur during
execution. The .notes column is a tibble with columns location,
type, note, and trace. The trace column contains
rlang::trace_back() objects for errors and warnings, which can be
useful for debugging.
Choosing metrics
The metrics argument accepts a cluster_metric_set(). If NULL, the
default metrics are sse_within_total() and sse_total().
Common metrics and their interpretation:
- sse_within_total(): Total within-cluster sum of squares. Lower values indicate tighter, more compact clusters. Use the "elbow method": plot this against num_clusters and look for where the improvement flattens.
- sse_ratio(): Ratio of within-cluster SS to total SS. Lower is better (more variance explained by the clustering).
- silhouette_avg(): Average silhouette width (range -1 to 1). Higher values indicate better-separated clusters. Values above 0.5 are generally considered good.
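The elbow method mentioned above can be tried out without any resampling. The base R sketch below fits stats::kmeans() for several values of k on scaled data and reports how much the within-cluster sum of squares improves at each step. It stands in for a tuning run and does not reproduce what tune_cluster() computes across resamples.

```r
set.seed(2024)
x <- scale(mtcars)
ks <- 1:6

# Total within-cluster sum of squares for each candidate number of clusters
wss <- vapply(
  ks,
  function(k) kmeans(x, centers = k, nstart = 20)$tot.withinss,
  numeric(1)
)

# Relative improvement gained by adding one more cluster
rel_improvement <- c(NA, -diff(wss) / wss[-length(wss)])

data.frame(num_clusters = ks, wss = wss, rel_improvement = rel_improvement)
```

The "elbow" is the value of num_clusters after which rel_improvement becomes small.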
After tuning, use these functions to inspect results:
- tune::collect_metrics(): All metrics for every parameter combination.
- tune::show_best(): Top N parameter combinations for a given metric.
- tune::select_best(): Single best parameter combination.
Configuration column
The .config column in the results follows the pattern
pre{num}_mod{num}_post{num}. The numbers encode which combination of
preprocessor, model, and postprocessor parameters was used. A value of
0 means that element was not tuned. For example, pre0_mod2_post0
means the preprocessor was not tuned and this is the second model
parameter combination.
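If the three numbers are needed as separate columns, the pattern can be split with a regular expression. The sketch below uses base R only; the `config` strings are made-up examples that follow the pattern described above, and the exact digit formatting in real results may differ.

```r
# Hypothetical .config values following pre{num}_mod{num}_post{num}
config <- c("pre0_mod2_post0", "pre1_mod3_post0")

# Capture the three numeric parts of each string
parts <- regmatches(
  config,
  regexec("^pre([0-9]+)_mod([0-9]+)_post([0-9]+)$", config)
)

# Drop the full match and keep the captured groups as integers
parsed <- do.call(rbind, lapply(parts, function(p) as.integer(p[-1])))
colnames(parsed) <- c("pre", "mod", "post")
parsed
```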
Parallel processing
Parallel processing is supported via the future and mirai packages.
To enable parallelism, set up a future plan or mirai daemons before
calling tune_cluster():
# Using future
library(future)
plan(multisession, workers = 4)
res <- tune_cluster(wflow, resamples = folds, grid = grid)
plan(sequential)

# Using mirai
library(mirai)
daemons(4)
res <- tune_cluster(wflow, resamples = folds, grid = grid)
daemons(0)
See tune::parallelism for more details.
Examples
library(recipes)
library(rsample)
library(workflows)
library(tune)
rec_spec <- recipe(~., data = mtcars) |>
step_normalize(all_numeric_predictors()) |>
step_pca(all_numeric_predictors())
kmeans_spec <- k_means(num_clusters = tune())
wflow <- workflow() |>
add_recipe(rec_spec) |>
add_model(kmeans_spec)
grid <- tibble(num_clusters = 1:3)
set.seed(4400)
folds <- vfold_cv(mtcars, v = 2)
res <- tune_cluster(
wflow,
resamples = folds,
grid = grid
)
res
collect_metrics(res)
Update a cluster specification
Description
If parameters of a cluster specification need to be modified,
update() can be used in lieu of recreating the object from scratch.
Usage
## S3 method for class 'db_clust'
update(
object,
parameters = NULL,
radius = NULL,
min_points = NULL,
fresh = FALSE,
...
)
## S3 method for class 'gm_clust'
update(
object,
parameters = NULL,
num_clusters = NULL,
circular = NULL,
zero_covariance = NULL,
shared_orientation = NULL,
shared_shape = NULL,
shared_size = NULL,
fresh = FALSE,
...
)
## S3 method for class 'hier_clust'
update(
object,
parameters = NULL,
num_clusters = NULL,
cut_height = NULL,
linkage_method = NULL,
dist_fun = NULL,
fresh = FALSE,
...
)
## S3 method for class 'k_means'
update(object, parameters = NULL, num_clusters = NULL, fresh = FALSE, ...)
## S3 method for class 'mean_shift'
update(object, parameters = NULL, bandwidth = NULL, fresh = FALSE, ...)
Arguments
object |
A cluster specification. |
parameters |
A 1-row tibble or named list with main parameters to
update. Use either |
radius |
Positive double, radius drawn around points to determine core points and cluster assignments (required). |
min_points |
Positive integer, minimum number of connected points required to form a core point, including the point itself (required). |
fresh |
A logical for whether the arguments should be modified in-place or replaced wholesale. |
... |
Not used for |
num_clusters |
Positive integer, number of clusters in model. |
circular |
Boolean, whether or not to fit circular MVG distributions for each cluster. Default |
zero_covariance |
Boolean, whether or not to assign covariances of 0 for each MVG. Default |
shared_orientation |
Boolean, whether each cluster MVG should have the same orientation. Default |
shared_shape |
Boolean, whether each cluster MVG should have the same shape. Default |
shared_size |
Boolean, whether each cluster MVG should have the same size/volume. Default |
cut_height |
Positive double, height at which to cut dendrogram to
obtain cluster assignments (only used if |
linkage_method |
the agglomeration method to be used. This should be (an
unambiguous abbreviation of) one of |
dist_fun |
A function for calculating the distance between observations.
Defaults to |
bandwidth |
Positive double, kernel bandwidth controlling the size of the neighborhood used to compute the density estimate (required). |
Value
An updated cluster specification.
Examples
kmeans_spec <- k_means(num_clusters = 5)
kmeans_spec
update(kmeans_spec, num_clusters = 1)
update(kmeans_spec, num_clusters = 1, fresh = TRUE)
param_values <- tibble::tibble(num_clusters = 10)
kmeans_spec |> update(param_values)