Aivia Software

Automatic Object Classifier

Object classification, particularly in the realm of cellular biology, is a critical step in digital image analysis that enables researchers to categorize cells based on their phenotypes, behaviors, or morphological characteristics. This process involves assigning predefined labels to individual cells within an image, facilitating the quantitative analysis of cellular properties and the understanding of complex biological systems. The advancements in imaging technology and computational methods have made it possible to automate this classification, leading to more accurate and efficient data interpretation.

This page delves into the interface elements for configuring an Automatic Classifier. We will highlight the array of components and functionalities available to users, allowing them to execute unsupervised clustering algorithms on detected objects within their images. For detailed information on specific methods and parameter adjustment guidance, please refer to the respective wiki pages for each method.

Creating an automatic classifier

  1. To create an automatic classifier, click on the Create new classifier icon.  This will open a dialog box as pictured below:

    There are three properties available:

    1. Name – the name of your classifier.  This also serves as the default filename when exporting the classifier to a file, and can be adjusted at export time.
    2. Classification Type – the mode of classification. Three options are available: (1) Object Classifier, (2) Automatic Classifier, and (3) Phenotyper.  This page focuses primarily on the Automatic Classifier.

    3. Object Type – the type of object you want to apply your classifier to. The available options depend on whether a recipe was previously run on the image and whether the image contains other user-defined objects.

  2. Under the Classification Type drop-down menu, select Automatic Classifier, then select the object type you wish to classify.
  3. This will open two sections of the classifier. First, we will examine the Measure section, where you can select the object attributes to classify by.  Under Measurements, there are two modes for selecting these attributes.

    1. The first option, Standard, allows for classification of selected channels under Input Channels by their Mean Intensity.

    2. The second option, Custom, allows for selection of individual features of each channel, as well as other object properties.

  4. Under the section titled Cluster, select the clustering method from the drop-down menu next to Clustering Type.  More information on each method and on parameter tuning can be found below.  There are two clustering methods available:
    1. K-means
    2. PhenoGraph-Leiden
  5. Once the appropriate parameters and attributes for clustering have been chosen, use the drop-down menu titled Set to apply to, in the lower right-hand corner of the panel, to choose the object set the classifier will be applied to.  Once the appropriate object is selected, click Apply to run the classifier.

  6. To save/export the classifier parameters to a file, press the save classifier to file icon at the top of the menu.  This will produce a “*.classifier” file that can be reloaded via the corresponding load icon.



K-means

K-means [1] is a popular unsupervised machine learning algorithm used for clustering similar data points into groups. Given a dataset and a specified number of clusters (k), the algorithm works iteratively to assign each data point to one of the k groups based on the features provided. The process starts with random initialization of k cluster centroids. In each iteration, data points are assigned to the nearest centroid, and then centroids are recalculated as the mean of the data points in that cluster. The algorithm repeats these steps until the centroids stabilize or a specified number of iterations is reached. The result is that data points in the same cluster are more similar to each other than those in different clusters. While K-means is simple and efficient, selecting the optimal number of clusters (k) and the possibility of reaching a local minimum are challenges that users should be aware of.
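
To make the procedure concrete, here is a minimal sketch of K-means clustering on a per-object feature matrix using scikit-learn. The synthetic features array and all parameter values are illustrative assumptions, not Aivia's internal implementation:

    import numpy as np
    from sklearn.cluster import KMeans

    # Rows are detected objects, columns are measurements
    # (e.g., mean intensity per channel); values here are synthetic.
    rng = np.random.default_rng(0)
    features = np.vstack([
        rng.normal(loc=0.0, scale=0.5, size=(100, 3)),
        rng.normal(loc=3.0, scale=0.5, size=(100, 3)),
    ])

    # k = 2 cluster centers; multiple restarts (n_init) mitigate
    # sensitivity to the random initial centroid placement.
    kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
    labels = kmeans.fit_predict(features)   # one cluster label per object
    centroids = kmeans.cluster_centers_     # final centroid positions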

Parameters

Number of Cluster Centers (default: 10; minimum: 1; maximum: 100)

This parameter effectively determines the number of distinct clusters the algorithm will attempt to segment the dataset into. The algorithm initializes by selecting K data points randomly as the initial centroids. Subsequent iterations reassign data points to the closest centroid and recalculate the centroids based on the mean of the points assigned to each cluster. The iterative process continues until the centroids stabilize and no longer shift significantly or until a predefined number of iterations is reached.

Selecting an appropriate value for K is critical: an under-specified K might merge distinct data groups, while an over-specified K can fragment genuine clusters.
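
One common way to guide the choice of K is to sweep candidate values and compare a cluster-quality metric such as the silhouette score. Below is a hedged sketch reusing the synthetic features matrix from the example above; in Aivia itself, K is simply set via the Number of Cluster Centers parameter:

    from sklearn.cluster import KMeans
    from sklearn.metrics import silhouette_score

    # Evaluate several candidate K values on the same feature matrix.
    for k in range(2, 8):
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(features)
        score = silhouette_score(features, labels)  # higher is better
        print(f"K={k}: mean silhouette = {score:.3f}")
    # Choose the K where the score peaks or stops improving noticeably.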

Advantages

Efficiency

K-means is computationally efficient, especially for datasets where the number of clusters (K) is not too high.

Consistency

Given the same initial conditions and dataset, K-means will always produce the same results.

Convergence

The algorithm will always converge to a result, though it might be a local optimum.

Visual Interpretability

Results can often be visualized easily, especially in 2D or 3D datasets.

Disadvantages

Fixed Cluster Number

Requires a pre-defined number of clusters (K). Determining the optimal K can be a challenge.

Initialization Sensitivity

The final clusters can be sensitive to the initial centroid placement.

Assumption of Spherical Clusters

Assumes that clusters are spherical and equally sized, which might not hold true for all datasets.  This also can translate to difficulty with non-convex clusters, where K-means struggles with clusters of complex geometries or with clusters within clusters.

Sensitivity to Outliers

Outliers can heavily influence the position of centroids, potentially skewing clusters.

PhenoGraph-Leiden

PhenoGraph and the Leiden algorithm [2] are methods primarily used for clustering single-cell data, especially in the context of single-cell RNA sequencing (scRNA-seq) and, in the case of Aivia, single-cell image data. The PhenoGraph-Leiden approach integrates both algorithms sequentially. Initially, PhenoGraph operates by constructing a k-nearest neighbor (k-NN) graph from the high-dimensional data, followed by community detection to partition cells into clusters based on their shared neighborhoods. Subsequently, the Leiden algorithm (which refines the Louvain method for community detection) is used to optimize the modularity within the graph to discern distinct cell communities. Together, these methods emphasize the intrinsic structure and connectivity of the data, making them particularly adept at handling high-dimensional datasets where inherent structures could be masked by noise or the challenges of high dimensionality.
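
The two-stage idea (k-NN graph construction followed by Leiden community detection) can be sketched with the scanpy library. This is an illustrative approximation of the workflow, not Aivia's implementation, and the feature matrix is synthetic:

    import numpy as np
    import scanpy as sc

    # Per-object feature matrix (objects x measurements), synthetic here.
    rng = np.random.default_rng(0)
    features = rng.normal(size=(500, 10)).astype(np.float32)
    adata = sc.AnnData(features)

    # Stage 1: build the k-nearest-neighbor graph over objects.
    sc.pp.neighbors(adata, n_neighbors=30, use_rep="X")

    # Stage 2: Leiden community detection on that graph; the number of
    # clusters emerges from the data rather than being fixed up front.
    sc.tl.leiden(adata, resolution=0.1)
    labels = adata.obs["leiden"]  # one cluster label per object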

Parameters

Minimum Objects per Cluster (default: 100; minimum: 1; maximum: 10,000)

This parameter stipulates the minimum number of objects (or data points) that a cluster should contain to be considered valid. By setting this parameter, users can filter out smaller, potentially noisy or spurious clusters, ensuring that the resultant clusters are of a significant size and more likely to be biologically or contextually meaningful. It's a way to add robustness to the clustering results, as tiny clusters can sometimes arise from outliers or noise in the data.

Max Cluster Count (default: 5; minimum: 1; maximum: 100)

This defines the upper limit on the total number of clusters the algorithm should produce. It's a control mechanism to prevent over-segmentation of the data, especially in scenarios where the natural structure of the data is complex and could potentially be divided into numerous small clusters. By setting an upper limit, users can ensure that the algorithm strikes a balance between finding meaningful clusters and not over-dividing the data.

Number of Neighbors (default: 30; minimum: 2; maximum: 200)

Determines how many nearest neighbors each data point should be connected to. A commonly used starting value is 30, which can be adjusted based on the desired clustering outcome. A higher number of neighbors tends to produce denser clustering with increased connections, potentially leading to the merging of closely situated clusters; it also results in longer processing times. Conversely, a lower number of neighbors highlights separate clusters, but might also divide genuine clusters into smaller parts.

Resolution (default: 0.1; minimum: 0.0; maximum: 1.0)

Resolution is a parameter in community detection algorithms like Leiden that determines the granularity of the clusters. A higher resolution value will generally result in a larger number of smaller clusters, revealing finer substructures within the data. Conversely, a lower resolution will lead to fewer, broader clusters, which might merge distinct sub-populations. This parameter essentially allows users to zoom in or out on the data's structure, making it pivotal in uncovering varying levels of detail in the data’s inherent groupings.
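
In practice, the effect of resolution can be explored by re-running the community detection at several values and comparing the resulting cluster counts. Continuing the scanpy sketch above (again an illustrative assumption, not Aivia's internal code):

    # Sweep the resolution to see how granularity changes the
    # number of clusters the Leiden algorithm finds.
    for res in (0.05, 0.1, 0.5, 1.0):
        key = f"leiden_res_{res}"
        sc.tl.leiden(adata, resolution=res, key_added=key)
        n_clusters = adata.obs[key].nunique()
        print(f"resolution={res}: {n_clusters} clusters")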

Advantages

Graph-based Representation

Captures the intrinsic topology and relationships within high-dimensional data, making it suitable for complex datasets. This unique representation is adept at mimicking the actual structure and interrelations within datasets, especially those with nuanced and layered connectivity patterns.

Adaptive Cluster Size

Unlike K-means, which assumes roughly equal cluster sizes, PhenoGraph-Leiden can identify clusters of varying densities and sizes.

Less Sensitivity to Noise

By leveraging the community structure in graphs, it can often be more robust against noise in the data.

High-dimensional Compatibility

PhenoGraph-Leiden was tailored for datasets with high dimensionality, such as single-cell RNA sequencing data. As such, it is optimized for dissecting and revealing intricate structures and relationships within these dense datasets.

No Need for Predefined Cluster Count

One of the standout features of this approach is its ability to discern the inherent number of clusters present in the data. This contrasts with algorithms that require users to provide a predetermined cluster count, offering a more adaptive and intuitive clustering experience.

Flexibility in Resolution

By adjusting the resolution parameter, users can explore data at varying levels of granularity, revealing subclusters or broader groupings as needed.

Disadvantages

Computational Intensity

Constructing and analyzing graphs, especially those derived from dense or voluminous datasets, can lead to longer processing times and a demand for more computational resources.

Parameter Sensitivity

PhenoGraph-Leiden's effectiveness can hinge significantly on its parameters. Elements like the number of neighbors and resolution can dramatically alter outcomes, which necessitates careful calibration, understanding, and potentially multiple trial-and-error iterations to find the optimal settings.

Potential for Over-segmentation

At certain parameter settings, especially higher resolutions, there exists a risk of subdividing the data into overly fine-grained clusters. These might be challenging to interpret or might not hold substantial biological or contextual significance.

Learning Curve

For individuals unfamiliar with the nuances of graph-based clustering or the intricacies of community detection, mastering PhenoGraph-Leiden might pose a steeper learning curve compared to more straightforward clustering algorithms like K-means.

Initialization Variability

Like K-means, different runs or initializations can produce varying results, though community detection methods like Leiden aim to mitigate this.


Comparison of methods

While both K-means and PhenoGraph-Leiden are clustering methods, their approaches and ideal applications differ significantly. K-means partitions data into clusters by minimizing intra-cluster variance, often in Euclidean space. It requires the number of clusters (K) to be specified a priori and is sensitive to initialization. In contrast, PhenoGraph-Leiden leverages graph-based methods to emphasize local relationships and naturally reveal data structures without the strict necessity of pre-specifying the number of clusters. PhenoGraph-Leiden is particularly adept at handling high-dimensional data with intricate relationships, such as those found in single-cell datasets; whereas K-means might struggle with or oversimplify such datasets due to its linear separation assumption, PhenoGraph-Leiden can unveil more nuanced cell populations or states.

When to use K-means over PhenoGraph-Leiden:

Prior Knowledge

If you have prior knowledge or a reasonable assumption about the number of clusters in your data, K-means can be directly applied, whereas graph-based methods might not always offer a straightforward way to impose such prior knowledge.

Simplicity and Interpretability

K-means is a straightforward algorithm with a clear objective function of minimizing the variance within clusters, and it requires little parameter tuning: the only required parameter is the number of clusters to assign data points to. This simplicity often leads to easier interpretation and visualization of the results. K-means also assumes clusters to be spherical and generally equal in size, which can work well when the actual data clusters align with this assumption.

Computational Efficiency

For certain datasets, K-means can be computationally faster, especially when the number of dimensions and datapoints is relatively low.

Avoiding Overclustering

Graph-based methods can sometimes result in overclustering, where too many small, granular clusters are identified.  While this can be beneficial for revealing fine structures in data, there are contexts where broader groupings are more desirable. 

When to use PhenoGraph-Leiden over K-means:

High-Dimensionality

PhenoGraph-Leiden is particularly suited for high-dimensional datasets, such as single-cell RNA sequencing data, where the underlying structure might be obscured by the challenges of analyzing and organizing data in high-dimensional spaces.

Complex Data Structures

Graph-based methods like PhenoGraph-Leiden can identify intricate and non-linear relationships in the data, making them adept at uncovering nuanced clusters that might not be evident to linear methods like K-means.

No Strict Need for Pre-defined Cluster Number

While K-means necessitates a predetermined number of clusters, PhenoGraph-Leiden has the capability to intuitively discern the data's inherent structure without mandating a pre-specified cluster count. Nonetheless, fine-tuning PhenoGraph-Leiden parameters to obtain optimal clustering results introduces its own set of challenges, such as increased computational cost, potential overfitting to specific data characteristics, and the complexity of navigating interdependent parameters. It is crucial to approach this tuning process with both caution and insight to ensure that the resulting clusters are contextually meaningful.

Emphasis on Local Relationships

PhenoGraph-Leiden focuses on local relationships between data points, emphasizing the intrinsic structure and connectivity within the data, which can sometimes be more informative than global relationships. Additionally, K-means generally assumes that clusters are spherical and equally sized, which might not always be the case, especially in complex datasets. Graph-based methods do not make such assumptions.

Noise and Outliers

Graph-based methods can be more robust to noise and outliers, given their emphasis on local neighborhoods and connectivity. Algorithms like K-means clustering on the other hand are more sensitive to outliers, because computing the cluster center is easily influenced by extreme values.


Evaluating Cluster Fitting through Confidence Scores

The Silhouette Score is a widely used metric for assessing the quality of clusters in data analysis. It measures the coherence of data points within their clusters relative to other clusters. For efficient processing with large datasets, we utilize an approximation method known as the "Approximate Silhouette Score." This method is adapted for use with the automatic clustering algorithms (K-means and PhenoGraph-Leiden). The resultant confidence scoring for each point's clustering can be seen under Spreadsheets or plotted under Charts.

Methodology

1. Definition: The silhouette score traditionally calculates the mean intra-cluster distance (a) and the mean nearest-cluster distance (b) for each data point. The score for a point is (b − a) / max(a, b), which ranges from -1 (incorrectly clustered) to +1 (highly appropriate clustering) [3].
2. Approximation Technique: Given the computational intensity of exact distance calculations for large datasets, we approximate the average distances using the root-mean-squared distance. We utilize the centroid of each cluster for computing the nearest-cluster distance to each data point, while the approximate mean intra-cluster distance is computed using the root-mean-square of distances between the target point and other points within the same cluster [4].
3. Normalization: To simplify interpretation, the silhouette scores are linearly normalized from their original range of -1 to +1 to a new range of 0 to 1. Any point that has a score below 0.5 would be considered incorrectly clustered.
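
A minimal sketch of this normalized approximate score follows, using each cluster's centroid for the nearest-cluster distance (b) and a root-mean-square intra-cluster distance (a). The function name and details are illustrative assumptions rather than Aivia's exact implementation:

    import numpy as np

    def approx_silhouette(features, labels):
        """Approximate silhouette per point, rescaled to the range 0..1.

        Assumes at least two clusters are present.
        """
        labels = np.asarray(labels)
        uniq = np.unique(labels)
        centroids = np.array([features[labels == c].mean(axis=0) for c in uniq])
        scores = np.empty(len(features))
        for i, x in enumerate(features):
            same = features[labels == labels[i]]
            # a: root-mean-square distance to other points in the same cluster
            d = np.linalg.norm(same - x, axis=1)
            a = np.sqrt((d ** 2).sum() / max(len(same) - 1, 1))
            # b: distance to the nearest *other* cluster's centroid
            b = np.linalg.norm(centroids[uniq != labels[i]] - x, axis=1).min()
            s = (b - a) / max(a, b)          # classic silhouette in [-1, +1]
            scores[i] = (s + 1.0) / 2.0      # linear rescale to [0, 1]
        return scores

    # Points scoring below 0.5 would be flagged as poorly clustered.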


A tutorial on how to use and visualize confidence scores can be found at How to auto-classify objects and visualize them.

References

  1. MacQueen J. Some methods for classification and analysis of multivariate observations. In Proceedings of the fifth Berkeley symposium on mathematical statistics and probability. 1967 Jun 21 (Vol. 1, No. 14, pp. 281-297).
  2. Traag VA, Waltman L, Van Eck NJ. From Louvain to Leiden: guaranteeing well-connected communities. Scientific Reports. 2019 Mar 26;9(1):5233.
  3. Rousseeuw PJ. Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics. 1987;20:53-65.
  4. Lenssen L, Schubert E. Clustering by direct optimization of the medoid silhouette. In International Conference on Similarity Search and Applications. 2022 Sep (pp. 190-204). Cham: Springer International Publishing.