Submitted by No_Performer203 t3_y09znb in MachineLearning

I am working on a project where my dataset consists of programs Each program is to be represented as a graph And I want to perform ‘between-graph clustering’ (clustering similar graphs)

So far all the literature I have seen talks about within graph clustering (clustering of similar nodes in a single graph)

Does anyone know of any resources that could help me with my project?

18

Comments

You must log in or register to comment.

ID4gotten t1_irqs2cs wrote

I assume you mean a mathematical graph structure and not a plot. Try searching for graph "similarity" instead of "clustering".

5

resented_ape t1_irr5f0k wrote

Your problem is really how to measure similarity between graphs. If you can do that, you can use pretty much any standard clustering technique.

This task is common in drug discovery and related cheminformatics fields where small molecules are represented as graphs. One common approach there is to generate a dictionary of subgraphs and then represent each molecule as a binary vector where a 1 means the subgraph is present and 0 otherwise. There is an obvious extension to counts of features (i.e. you have an integer vector with a N if the subgraph appears N times). This strategy crucially depends on how nodes and edges are labeled.

The similarity measure is usually something like Jaccard (often referred to as Tanimoto in the chemistry literature).

7

wealthyMoss t1_irsglr8 wrote

My thought is that if you have a set of graphs and graphs can be represented as adjacency matrices or lists, then essentially you can represent these graphs as a collection of dictionaries or data frames. Then you concatenate these into on giant dictionary or data frame.

For example, say you have 2 graphs each with nodes {a,b,c}. Then you could have a data frame with three columns and three rows for each that consist of the adjacency matrix. Add a new column to each to determine with graph is which (say column named “graph” and for the first graph it would 1 and 2 for second and so one). Then you can concatenate these matrices in a row wise manner to get a 6 row 4 column data frame where the columns are a,b,c, and graph. Then you can use whatever clustering technique you want.

I haven’t tried this but this would be my initial approach. Hope this helps and gives a new way of looking at the data!

I am also curious to see if others think this would be a valid approach as well!

1

felipeasg t1_irtrdlg wrote

Clustering basicaly is a tool to help in an exploratoy analysis. There are another aproaches that can help you to investigate graphs.

One approach from NLP (Natural Language Processing) world is called embedding (https://en.m.wikipedia.org/wiki/Word_embedding). You can use some techniques to represent your graph in a vector space like: https://github.com/guoji-fu/Event2vec

This technique is particularly good if your graph is big and has a stochastic nature.

In python you can use this libraries to work with graphs. https://networkx.org/documentation/stable/tutorial.html

And this to sample graphs: https://brandonrozek.com/blog/networkx-random-sample-graph/ https://little-ball-of-fur.readthedocs.io/en/latest/index.html

others references: https://en.m.wikipedia.org/wiki/Graph_embedding https://link.medium.com/rW0beBuz1tb

2

vwings t1_is99t4r wrote

Check contrastive pre-training of graph neural networks. The clustering methods should work well for those pretrained representations. Furthermore: do the graphs come from a particular domain? You might find some useful pre-training objectives then..

1