Pairwise constraint selection (skquery.pairwise)

This module regroups methods that query pairwise (i.e. must-link/cannot-link) constraints.

class skquery.pairwise.AIPC(epsilon=0.0)[source]

Active Informative Pairwise Constraint Formulation algorithm [1].

Selects pairwise constraints between strong and weak samples, where sample strength is defined by Shannon entropy computed from a fuzzy clustering.

Parameters

epsilon : float, default=0.0

Threshold value for the entropy of weak samples. By default, epsilon is computed during fit.

Attributes

epsilon : float

Threshold value for the entropy of weak samples. The value can be given at initialization, otherwise it is computed during fit.

fuzzy_partition : array-like

Soft partition of the dataset computed by fuzzy c-means clustering.

References

csts_to_file(constraints, filename='constraints')

Write the contents of a constraint dictionary into a text file.

Parameters

constraints : dict of list

Constraints to write.

filename : string, default="constraints"

Name of the output file.
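
An illustrative sketch of persisting a constraint set with csts_to_file. The dictionary keys and pair format shown here are assumptions made for the example, not guaranteed by the library; adapt them to what fit actually returns.

>>> from skquery.pairwise import AIPC
>>> qs = AIPC()
>>> # Hypothetical constraint dictionary of index pairs (keys are an assumption).
>>> constraints = {"ml": [(0, 1), (2, 3)], "cl": [(0, 4)]}
>>> qs.csts_to_file(constraints, filename="my_constraints")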

fit(X, oracle, partition=None, n_clusters=None)[source]

Select pairwise constraints with AIPC.

Parameters

X : array-like

Instances to use for query.

oracle : callable

Source of background knowledge able to answer the queries.

partition : array-like, default=None

Existing partition of the data. Used to define the number of clusters.

n_clusters : int, default=None

Number of clusters to find.

Returns

constraints : dict of lists

ML and CL constraints between strong and weak samples.
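
A minimal usage sketch. The oracle below is a plain callable answering from ground-truth labels; this interface is an assumption made for illustration, and the library may expect a dedicated oracle object.

>>> from sklearn.datasets import make_blobs
>>> from skquery.pairwise import AIPC
>>> X, y = make_blobs(n_samples=150, centers=3, random_state=0)
>>> oracle = lambda i, j: y[i] == y[j]  # assumed interface: answers whether i and j share a cluster
>>> qs = AIPC()                         # epsilon will be computed during fit
>>> constraints = qs.fit(X, oracle, n_clusters=3)  # dict of ML/CL constraints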

class skquery.pairwise.FFQS(neighborhoods=None, distances=None)[source]

Farthest First Query Selection [1].

Explores the data with farthest-first traversal to discover the cluster structure, then consolidates the skeleton with random selection.

Parameters

neighborhoods : list of lists, default=None

Optional precomputed neighborhoods of the data. This can be used as a warm start, e.g. to skip the Explore phase.

distances : array-like, default=None

Euclidean pairwise distance matrix of the data. If none given, it will be computed during fit.

Attributes

neighborhoods : list of lists

Points whose cluster assignment has been certified by the oracle's answers to the queries. Each list contains points belonging to the same cluster.

p_dists : array-like

Euclidean pairwise distance matrix of the data.

unknown : set

Points whose assignment has been queried but remains unknown, i.e. the oracle couldn't answer the query.

References

[1] Basu, S., Banerjee, A., Mooney, R. Active Semi-Supervision for Pairwise Constrained Clustering. 2004. Proceedings of the 2004 SIAM International Conference on Data Mining. ISBN 978-1-61197-274-0, pp. 333-344.

csts_to_file(constraints, filename='constraints')

Write the contents of a constraint dictionary into a text file.

Parameters

constraints : dict of list

Constraints to write.

filename : string, default="constraints"

Name of the output file.

fit(X, oracle, partition=None, n_clusters=None)[source]

Select pairwise constraints with the FFQS scheme.

Parameters

X : array-like

Instances to use for query.

oracle : callable

Source of background knowledge able to answer the queries.

partition : array-like, default=None

Existing partition of the data that provides the number of clusters to find. It has priority over n_clusters, i.e. n_clusters will not be taken into account if a partition is passed.

n_clusters : int, default=None

Number of clusters to find. If none given, only the Explore phase will be used.

Returns

constraints : dict of lists

ML and CL constraints derived from the neighborhoods.
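
A usage sketch along the same lines (the plain-callable oracle is again an assumption made for illustration):

>>> from sklearn.datasets import make_blobs
>>> from skquery.pairwise import FFQS
>>> X, y = make_blobs(n_samples=150, centers=3, random_state=0)
>>> oracle = lambda i, j: y[i] == y[j]             # assumed oracle interface
>>> qs = FFQS()
>>> constraints = qs.fit(X, oracle, n_clusters=3)  # Explore, then Consolidate
>>> skeleton = qs.neighborhoods                    # one list of certified points per discovered cluster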

get_constraints_from_neighborhoods()[source]

Derive constraints from the neighborhood structure: ML between all pairs of points in the same neighborhood and CL between all pairs of points in separate neighborhoods.

Returns

ml, cl : list of tuples

Pairwise constraints derived from the neighborhoods.
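
The derivation itself is plain pair enumeration. A standalone sketch of the rule described above, independent of the library's internal representation (the function name and layout are illustrative):

from itertools import combinations, product

def constraints_from_neighborhoods(neighborhoods):
    # ML between all pairs of points within the same neighborhood
    ml = [pair for hood in neighborhoods for pair in combinations(hood, 2)]
    # CL between all pairs of points taken from two different neighborhoods
    cl = [pair
          for hood_a, hood_b in combinations(neighborhoods, 2)
          for pair in product(hood_a, hood_b)]
    return ml, cl

# constraints_from_neighborhoods([[0, 1], [2]]) -> ([(0, 1)], [(0, 2), (1, 2)])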

class skquery.pairwise.MinMax(neighborhoods=None, distances=None)[source]

Min-max Farthest First Query Selection [1].

FFQS with a modified Consolidate step that uses a similarity criterion for constraint selection.

Parameters

neighborhoods : list of lists, default=None

Optional precomputed neighborhoods of the data. This can be used as a warm start, e.g. to skip the Explore phase.

distances : array-like, default=None

Euclidean pairwise distance matrix of the data. If none given, it will be computed during fit.

Attributes

neighborhoods : list of lists

Points whose cluster assignment has been certified by the oracle's answers to the queries. Each list contains points belonging to the same cluster.

p_dists : array-like

Euclidean pairwise distance matrix of the data.

unknown : set

Points whose assignment has been queried but remains unknown, i.e. the oracle couldn't answer the query.

References

csts_to_file(constraints, filename='constraints')

Write the contents of a constraint dictionary into a text file.

Parameters

constraints : dict of list

Constraints to write.

filename : string, default="constraints"

Name of the output file.

fit(X, oracle, partition=None, n_clusters=None)

Select pairwise constraints with the FFQS scheme.

Parameters

X : array-like

Instances to use for query.

oracle : callable

Source of background knowledge able to answer the queries.

partition : array-like, default=None

Existing partition of the data that provides the number of clusters to find. It has priority over n_clusters, i.e. n_clusters will not be taken into account if a partition is passed.

n_clusters : int, default=None

Number of clusters to find. If none given, only the Explore phase will be used.

Returns

constraints : dict of lists

ML and CL constraints derived from the neighborhoods.

get_constraints_from_neighborhoods()

Derive constraints from the neighborhood structure: ML between all pairs of points in the same neighborhood and CL between all pairs of points in separate neighborhoods.

Returns

ml, cl : list of tuples

Pairwise constraints derived from the neighborhoods.
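
A usage sketch that also passes a precomputed distance matrix, as allowed by the distances parameter (the plain-callable oracle remains an assumption made for illustration):

>>> from sklearn.datasets import make_blobs
>>> from sklearn.metrics import pairwise_distances
>>> from skquery.pairwise import MinMax
>>> X, y = make_blobs(n_samples=150, centers=3, random_state=0)
>>> oracle = lambda i, j: y[i] == y[j]       # assumed oracle interface
>>> dists = pairwise_distances(X)            # Euclidean by default
>>> qs = MinMax(distances=dists)             # avoids recomputing the matrix during fit
>>> constraints = qs.fit(X, oracle, n_clusters=3)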

class skquery.pairwise.NPU(neighborhoods=None, cc_alg=None)[source]

Normalized Point-based Uncertainty algorithm [1].

Uncertainty sampling using Shannon entropy based on an existing partition, normalized by the expected number of queries for a given point.

Parameters

neighborhoods : list of lists, default=None

Optional precomputed neighborhoods of the data. This can be used as a warm start, e.g. for incremental constrained clustering.

cc_alg : callable, default=None

Constrained clustering algorithm to use in tandem with the selection procedure. Not needed if a partition is given during fit.

Attributes

partition : array-like

Partition of the dataset.

neighborhoods : list of lists

Points whose cluster assignment has been certified by the oracle's answers to the queries. Each list contains points belonging to the same cluster.

cc_alg : callable

Constrained clustering algorithm used to compute a partition using the selected constraints.

unknown : set

Points whose assignment has been queried but remains unknown, i.e. the oracle couldn't answer the query.

Notes

This implementation of NPU works either with a constrained clustering algorithm or with a partition passed as an argument to the fit method. In the former case, only algorithms from the active-semi-supervised-clustering library are supported.

References

csts_to_file(constraints, filename='constraints')

Write the contents of a constraint dictionary into a text file.

Parameters

constraints : dict of list

Constraints to write.

filename : string, default="constraints"

Name of the output file.

fit(X, oracle, partition=None, n_clusters=None)[source]

Select pairwise constraints with NPU.

Parameters

X : array-like

Instances to use for query.

oracle : callable

Source of background knowledge able to answer the queries.

partition : array-like, default=None

Existing partition of the data. Not used if a clustering algorithm has been defined at init.

n_clusters : Ignored

Not used, present for API consistency.

Returns

constraints : dict of lists

ML and CL constraints derived from the neighborhoods.
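
As noted above, NPU can be driven by a precomputed partition instead of a constrained clustering algorithm. A sketch under the same plain-callable oracle assumption:

>>> from sklearn.cluster import KMeans
>>> from sklearn.datasets import make_blobs
>>> from skquery.pairwise import NPU
>>> X, y = make_blobs(n_samples=150, centers=3, random_state=0)
>>> oracle = lambda i, j: y[i] == y[j]   # assumed oracle interface
>>> labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
>>> qs = NPU()                           # no constrained clustering algorithm given
>>> constraints = qs.fit(X, oracle, partition=labels)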

class skquery.pairwise.RandomMLCL[source]

Random sampling of pairwise constraints.

csts_to_file(constraints, filename='constraints')

Write the contents of a constraint dictionary into a text file.

Parameters

constraints : dict of list

Constraints to write.

filename : string, default="constraints"

Name of the output file.

fit(X, oracle, partition=None, n_clusters=None)[source]

Select pairwise constraints at random.

Parameters

X : array-like

Instances to use for query.

oracle : callable

Source of background knowledge able to answer the queries.

partition : Ignored

Not used, present for API consistency.

n_clusters : Ignored

Not used, present for API consistency.
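
A sketch of the random baseline (same plain-callable oracle assumption as in the other examples):

>>> from sklearn.datasets import make_blobs
>>> from skquery.pairwise import RandomMLCL
>>> X, y = make_blobs(n_samples=150, centers=3, random_state=0)
>>> oracle = lambda i, j: y[i] == y[j]   # assumed oracle interface
>>> qs = RandomMLCL()
>>> constraints = qs.fit(X, oracle)      # partition and n_clusters are ignored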

class skquery.pairwise.SASC(alpha=0.5)[source]

Sequential approach for Active Constraint Selection algorithm [1].

Selects a subset of boundary instances with Support Vector Data Description (SVDD), then computes individual constraint utility and sequential utility of constraints.

Parameters

alpha : float, default=0.5

Proportion of constraints selected based on Assumption A.

Attributes

alpha : float

Proportion of constraints selected based on Assumption A.

support_vectors : array-like

Indexes of support vectors found through SVDD.

p_dists_sv : array-like

Euclidean pairwise distance matrix of support vectors.

References

csts_to_file(constraints, filename='constraints')

Write the contents of a constraint dictionary into a text file.

Parameters

constraints : dict of list

Constraints to write.

filename : string, default="constraints"

Name of the output file.

fit(X, oracle, partition=None, n_clusters=None)[source]

Select pairwise constraints with SASC.

Parameters

X : array-like

Instances to use for query.

oracle : callable

Source of background knowledge able to answer the queries.

partition : Ignored

Not used, present for API consistency.

n_clusters : Ignored

Not used, present for API consistency.

Returns

constraints : dict of lists

ML and CL sequentially selected constraints.
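
A usage sketch (the plain-callable oracle is an assumption made for illustration):

>>> from sklearn.datasets import make_blobs
>>> from skquery.pairwise import SASC
>>> X, y = make_blobs(n_samples=150, centers=3, random_state=0)
>>> oracle = lambda i, j: y[i] == y[j]   # assumed oracle interface
>>> qs = SASC(alpha=0.5)                 # half of the constraints selected under Assumption A
>>> constraints = qs.fit(X, oracle)      # support vectors found by SVDD are in qs.support_vectors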

is_between(X, a, b, c)[source]

u_t(c_t, u_t_ij, cl, ml, link)[source]

The u_t function handles constraint selection based on the utility measures computed by the u_indA and u_indB functions.

In summary, the function works as follows:

1. It first identifies the constraint with the highest utility in the utility matrix u_t_ij.

2. It then checks whether this pair has already been selected. If not, it adds the pair to the set of selected constraints c_t, as well as to the appropriate list of must-link (ml) or cannot-link (cl) constraints, depending on the link parameter.

3. After adding a new constraint, it updates the utility value of that constraint in the utility matrix u_t_ij so that it cannot be selected again.

4. Finally, it updates the utility values of all remaining constraints in u_t_ij according to their distance to the set of constraints already selected. This update multiplies the current utility of each constraint by the minimum distance between that constraint and the set of already selected constraints (via the djc_t() method), and divides the result by the maximum distance among all constraints.

In short, the function selects constraints one by one according to their utility, updating the utility of the remaining constraints at each step to account for the information already captured by the selected constraints.
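
A standalone sketch of this greedy loop, with a hypothetical pair_distance function standing in for djc_t() (names and data layout are illustrative assumptions, not the method's actual internals; the ml/cl bookkeeping is omitted):

import numpy as np

def greedy_select(utilities, pairs, n_constraints, pair_distance):
    """Greedily pick constraint pairs by utility, down-weighting candidates
    that are close to pairs already selected."""
    u = np.asarray(utilities, dtype=float).copy()
    selected = []
    for _ in range(n_constraints):
        remaining = [i for i in range(len(pairs)) if i not in selected]
        best = max(remaining, key=lambda i: u[i])   # step 1: highest-utility constraint
        selected.append(best)                       # step 2: add it to the selected set
        # steps 3-4: rescale utilities by the minimum distance to the selected set,
        # so constraints close to what is already selected become less attractive
        d = np.array([min(pair_distance(pairs[i], pairs[j]) for j in selected)
                      for i in range(len(pairs))])
        d_max = d.max() if d.max() > 0 else 1.0
        u = u * d / d_max                           # already-selected pairs drop to 0 utility
    return [pairs[i] for i in selected]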