Aivia Software

# Automatic Object Classifier

## Comparison of methods

While both K-means and PhenoGraph-Leiden are clustering methods, their approaches and ideal applications differ significantly. K-means partitions data into clusters by minimizing intra-cluster variance, typically in Euclidean space. It requires the number of clusters (K) to be specified a priori and is sensitive to initialization. In contrast, PhenoGraph-Leiden leverages graph-based methods to emphasize local relationships and naturally reveal data structures without strictly requiring a pre-specified number of clusters. PhenoGraph-Leiden is particularly adept at handling high-dimensional data with intricate relationships, such as those found in single-cell datasets. Whereas K-means may struggle with or oversimplify such datasets because of its linear separation assumption, PhenoGraph-Leiden can unveil more nuanced cell populations or states.
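As a concrete (if simplified) illustration of the K-means side of this contrast, the sketch below implements plain Lloyd's algorithm in NumPy; the blob data, cluster count, and random seed are invented for the example and are not part of any Aivia workflow:

```python
import numpy as np

rng = np.random.default_rng(0)
# Two well-separated synthetic 2-D blobs (invented data for illustration).
X = np.vstack([rng.normal(0.0, 0.3, (50, 2)),
               rng.normal(8.0, 0.3, (50, 2))])

def kmeans(X, k, n_iter=50):
    """Plain Lloyd's algorithm: the number of clusters K is fixed up front."""
    centers = X[rng.choice(len(X), k, replace=False)]  # random initialization
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest center (Euclidean).
        dists = np.linalg.norm(X[:, None] - centers[None], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each center moves to the mean of its assigned points.
        centers = np.array([X[labels == j].mean(axis=0)
                            if np.any(labels == j) else centers[j]
                            for j in range(k)])
    return labels, centers

labels, centers = kmeans(X, k=2)  # K must be chosen a priori
```

Note that both of K-means' sensitivities mentioned above are visible here: `k` is hard-coded, and a different initialization draw can change which local optimum the loop converges to.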

### When to use K-means over PhenoGraph-Leiden:

#### Prior Knowledge

If you have prior knowledge or a reasonable assumption about the number of clusters in your data, K-means can be applied directly, whereas graph-based methods might not always offer a straightforward way to impose such prior knowledge.

#### Simplicity and Interpretability

K-means is a straightforward algorithm with a clear objective function: minimizing the variance within clusters. It also requires little parameter tuning; the only required parameter is the number of clusters to assign data points to. This simplicity often leads to easier interpretation and visualization of the results. K-means assumes clusters to be roughly spherical and generally equal in size, which works well when the actual data clusters align with this assumption.
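That objective, the within-cluster sum of squares (often called inertia), can be written down in a few lines; the toy points and hand-made assignment below are invented purely for illustration:

```python
import numpy as np

# Toy data: two well-separated pairs of points with a hand-made assignment.
X = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])
labels = np.array([0, 0, 1, 1])

# Each cluster center is the mean of its members ...
centers = np.array([X[labels == j].mean(axis=0) for j in range(2)])

# ... and the objective is the summed squared distance to those centers.
inertia = sum(((X[labels == j] - centers[j]) ** 2).sum() for j in range(2))
# Here each point sits exactly 1 unit from its center, so inertia == 4.0.
```

K-means iteratively re-assigns points and re-computes centers so that this single number decreases, which is what makes its results comparatively easy to interpret.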

#### Computational Efficiency

For certain datasets, K-means can be computationally faster, especially when the number of dimensions and data points is relatively low.

#### Avoiding Overclustering

Graph-based methods can sometimes result in overclustering, where too many small, granular clusters are identified. While this can be beneficial for revealing fine structures in data, there are contexts where broader groupings are more desirable.

### When to use PhenoGraph-Leiden over K-means:

#### High-Dimensionality

PhenoGraph-Leiden is particularly suited for high-dimensional datasets, such as single-cell RNA sequencing data, where the underlying structure can be obscured by the difficulty of analyzing and organizing data in high-dimensional spaces.

#### Complex Data Structures

Graph-based methods like PhenoGraph-Leiden can identify intricate and non-linear relationships in the data, making them adept at uncovering nuanced clusters that might not be evident to linear methods like K-means.
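PhenoGraph-Leiden itself requires dedicated libraries, but the core idea, letting graph connectivity rather than distance to a center define clusters, can be sketched with a k-nearest-neighbour graph and its connected components (a crude stand-in for the community detection that PhenoGraph-Leiden performs on such a graph). The concentric-ring data and neighbour count below are invented for the example; K-means with K=2 would simply bisect these rings, mixing the two shapes:

```python
import numpy as np

rng = np.random.default_rng(1)

def ring(radius, n=100):
    # Points on a circle with slight jitter: a non-linear cluster shape.
    theta = np.linspace(0.0, 2.0 * np.pi, n, endpoint=False)
    pts = np.c_[radius * np.cos(theta), radius * np.sin(theta)]
    return pts + rng.normal(0.0, 0.02, pts.shape)

# Two concentric rings: not linearly separable, hostile to K-means.
X = np.vstack([ring(1.0), ring(3.0)])
n = len(X)

# Build an undirected 5-nearest-neighbour graph.
d = np.linalg.norm(X[:, None] - X[None], axis=2)
np.fill_diagonal(d, np.inf)          # no self-edges
knn = np.argsort(d, axis=1)[:, :5]
adj = np.zeros((n, n), dtype=bool)
adj[np.arange(n)[:, None], knn] = True
adj |= adj.T                          # symmetrize: treat edges as undirected

# Label connected components: each ring is internally connected, while the
# gap between the rings keeps them in separate components.
labels = np.full(n, -1)
for start in range(n):
    if labels[start] != -1:
        continue
    comp = labels.max() + 1
    stack = [start]
    while stack:
        i = stack.pop()
        if labels[i] != -1:
            continue
        labels[i] = comp
        stack.extend(np.flatnonzero(adj[i] & (labels == -1)))
```

Because neighbours along each ring are far closer to one another than any point is to the other ring, connectivity alone recovers the two shapes; no centroid or sphericity assumption is involved.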

#### No Strict Need for Pre-defined Cluster Number

While K-means necessitates a predetermined number of clusters, PhenoGraph-Leiden has the capability to intuitively discern the data's inherent structure without mandating a pre-specified cluster count. Nonetheless, fine-tuning PhenoGraph-Leiden parameters to obtain optimal clustering results introduces its own set of challenges, such as increased computational cost, potential overfitting to specific data characteristics, and the complexity of navigating interdependent parameters. It is crucial to approach this tuning process with both caution and insight to ensure that the resulting clusters are contextually meaningful.

#### Emphasis on Local Relationships

PhenoGraph-Leiden focuses on local relationships between data points, emphasizing the intrinsic structure and connectivity within the data, which can sometimes be more informative than global relationships. Additionally, K-means generally assumes that clusters are spherical and equally sized, which might not always be the case, especially in complex datasets. Graph-based methods do not make such assumptions.

#### Noise and Outliers

Graph-based methods can be more robust to noise and outliers, given their emphasis on local neighborhoods and connectivity. K-means, on the other hand, is more sensitive to outliers, because its cluster centers are computed as means and are therefore easily pulled toward extreme values.
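The effect is easy to see numerically: because a K-means center is an arithmetic mean, a single extreme point can drag it far from the bulk of the cluster. The toy values below are invented for illustration:

```python
import numpy as np

# A tight cluster near the origin plus a single extreme outlier.
cluster = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1]])
X = np.vstack([cluster, [[100.0, 100.0]]])

center_clean = cluster.mean(axis=0)  # (0.05, 0.05): where the cluster lies
center_pulled = X.mean(axis=0)       # (20.04, 20.04): dragged by one outlier
```

In a k-nearest-neighbour graph, by contrast, the same outlier has only distant neighbours and tends to end up weakly connected or in its own small community, leaving the main cluster's structure largely untouched.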

## References

- MacQueen J. Some methods for classification and analysis of multivariate observations. *In Proceedings of the fifth Berkeley symposium on mathematical statistics and probability*. 1967 Jun 21 (Vol. 1, No. 14, pp. 281-297).
- Traag VA, Waltman L, Van Eck NJ. From Louvain to Leiden: guaranteeing well-connected communities. *Scientific Reports*. 2019 Mar 26;9(1):5233.
