Viewing a single comment thread. View all comments

resented_ape t1_irr5f0k wrote

Your problem is really how to measure similarity between graphs. If you can do that, you can use pretty much any standard clustering technique.

This task is common in drug discovery and related cheminformatics fields where small molecules are represented as graphs. One common approach there is to generate a dictionary of subgraphs and then represent each molecule as a binary vector where a 1 means the subgraph is present and 0 otherwise. There is an obvious extension to counts of features (i.e. you have an integer vector with a N if the subgraph appears N times). This strategy crucially depends on how nodes and edges are labeled.

The similarity measure is usually something like Jaccard (often referred to as Tanimoto in the chemistry literature).

7