Submitted by NovelspaceOnly t3_11o97on in MachineLearning

Decompose Python libraries and generate coherent hierarchical topic models of the repository.
https://github.com/danielpatrickhug/GitModel

The ability to bootstrap on its own codebase is a powerful feature: GitModel is designed so that it can use its own output as input to improve itself. By generating hierarchical topic trees of GitHub repositories, it can analyze and extract insights from its own codebase and from other codebases, then use them to improve its functionality. This can lead to more efficient and effective code generation, better semantic graph generation, and improved text generation capabilities.

I spent around 10 hours today on a major refactor, creating a simple pipeline abstraction and allowing dynamic instantiation from YAML configs. It now also supports multiple GNN heads.
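For readers wondering what "dynamic instantiation from YAML configs" might look like, here is a minimal sketch. It is not GitModel's actual code: the registry, the step names (`lowercase`, `truncate`), and the `Pipeline` class are all hypothetical, and a plain dict stands in for what a YAML loader would return from a config file.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List

# Hypothetical registry mapping config names to step factories.
STEP_REGISTRY: Dict[str, Callable[..., Callable[[Any], Any]]] = {}


def register(name: str):
    """Decorator that records a step factory under a config-visible name."""
    def decorator(factory):
        STEP_REGISTRY[name] = factory
        return factory
    return decorator


@register("lowercase")
def make_lowercase():
    return lambda texts: [t.lower() for t in texts]


@register("truncate")
def make_truncate(max_len: int = 10):
    return lambda texts: [t[:max_len] for t in texts]


@dataclass
class Pipeline:
    steps: List[Callable[[Any], Any]]

    @classmethod
    def from_config(cls, config: Dict[str, Any]) -> "Pipeline":
        # Look up each step by name and build it with its parameters.
        steps = []
        for spec in config["pipeline"]:
            factory = STEP_REGISTRY[spec["name"]]
            steps.append(factory(**spec.get("params", {})))
        return cls(steps)

    def run(self, data: Any) -> Any:
        # Feed the output of each step into the next one.
        for step in self.steps:
            data = step(data)
        return data


# Assumption: this dict is what a YAML loader would produce from a config file.
config = {
    "pipeline": [
        {"name": "lowercase"},
        {"name": "truncate", "params": {"max_len": 5}},
    ]
}

result = Pipeline.from_config(config).run(["Hello World", "GNN"])
print(result)
```

The point of the registry is that adding a new step only requires registering a factory; the config decides which steps run and in what order.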

Please try it out and let me know what you think!

Example:
https://github.com/deepmind/clrs

https://preview.redd.it/ut4fc6c401na1.png?width=1506&format=png&auto=webp&v=enabled&s=b039242432c1f0526d1d81eadbfe8abc1168d2fd

109

Comments


xt-89 t1_jbrx1ss wrote

This project is interesting. The description however is hard to parse through. I’d suggest going over your README and cleaning up some things.

If I could also suggest a feature, if you could use this to generate UML diagrams that’d be great.

You mention that the code base can improve itself. I don’t see where that functionality is. Do you mean that if a person uses this tool for software analysis, productivity increases?

28

[deleted] t1_jbs3esk wrote

The full functionality has been constrained a bit due to the refactor; this will be fixed soon.

I apologize for the messy README. The main idea is that I’m using a GNN layer as an inductive bias to improve the representational power of the sentence embeddings by exponentiating the adjacency matrix (A^2) and then aggregating node features using message passing. Finally, the topic model creates a topic tree, which is fed back into the system prompt to generate higher-quality semantic context. It’s also relatively easy with BERTopic to combine this with outlier detection/filters for low-quality data removal.
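To make the A^2 idea concrete, here is a rough sketch of two-hop aggregation over node features. It is only an illustration under stated assumptions, not the repository's implementation: the self-loops, the row normalization, and the toy graph are my own choices, and plain lists are used so the example runs without dependencies.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain-Python matrix product of a (n x k) and b (k x m)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def row_normalize(m: Matrix) -> Matrix:
    """Scale each row to sum to 1 (assumption: mean-style aggregation)."""
    out = []
    for row in m:
        total = sum(row)
        out.append([v / total if total else 0.0 for v in row])
    return out


def two_hop_aggregate(adj: Matrix, feats: Matrix) -> Matrix:
    """Square the adjacency matrix, then average features over it."""
    a2 = matmul(adj, adj)  # A^2 connects nodes up to two hops apart
    return matmul(row_normalize(a2), feats)


# Toy path graph 0 - 1 - 2 with self-loops (assumed, not from the post).
adj = [
    [1.0, 1.0, 0.0],
    [1.0, 1.0, 1.0],
    [0.0, 1.0, 1.0],
]
# Stand-ins for sentence embeddings, one row per node.
feats = [
    [1.0, 0.0],
    [0.0, 1.0],
    [1.0, 1.0],
]

a2 = matmul(adj, adj)
smoothed = two_hop_aggregate(adj, feats)
print(smoothed)
```

Nodes 0 and 2 are not directly connected, but `a2[0][2]` is nonzero, so node 0's new embedding includes information from node 2.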

It’s a recurrent neural network in the sense that you’re feeding the output of the previous step back into the network.

You can also use another repo’s topic tree to suggest improvements. For example, I can add the DeepMind topic tree to comment on where it could add features such as a graph attention network, or where code can be converted to JAX, and then generate it on the fly.

Also yes, I found it very useful for deconstructing complex repos like knot theory. It gave me a lot of insight and made it easier to narrow down my research and study to the principal components of the repo.

9

xt-89 t1_jbs8od5 wrote

Yeah, you’ve definitely set up a good representation bias for modeling entire software architectures.

I had a thought a while back that GitHub Copilot is eventually going to offer a feature where they suggest improvements to entire software architectures… and then eventually just write whole projects from a text description alone. I think that the solution for that would be pretty similar to what you’ve done if scaled up and applied that way.

If your plan is to scale up the system for more advanced features, that would be awesome.

Another suggestion is that if you integrated your tool with GitHub, it would be pretty useful for enterprise software development. Most companies are pretty crappy at documentation. Even with good documentation, a chatbot is better than a static document.

Good job!

7

xt-89 t1_jbsaabf wrote

I also plan on applying the basic idea of a GNN with prompting to the thought loop of a cognitive entity (basically Open Assistant). I believe that if you take the tree you’re outputting for code but use it to aid CoT reasoning, that could be pretty powerful.

3

[deleted] t1_jbsafxc wrote

Thank you! Yes I thought the topic tree would be a great complement to the commit tree. Would be great for stale repos with little to no documentation.

There is also the option to mix in multiple repositories and message pass between them to help with brainstorming new features, or to message pass between your repo and its dependencies.

1

NovelspaceOnly OP t1_jbsdr55 wrote

I have some preliminary generation scripts for SMILES chemical graphs, Feynman diagrams, storytelling with interleaved images, and testing compilation rates. Sorry for switching accounts; this one is logged in on my laptop lol.

1

hak8or t1_jbt1wja wrote

/u/NovelspaceOnly Can you verify this?

As to /u/Main_Mathematician77 , you are effectively a software developer with the ability to dabble with machine learning. Are you located in the states or elsewhere? It would be very confusing as to how you are broke yet have that skillset.

> Don’t @ me saying this is a waste of compute, I know what I’m doing and idgaf.

That is extremely and unnecessarily antagonistic/combative.

4

jsonathan t1_jbt3hqq wrote

This is really fascinating, thanks for sharing. I'm also working on generating natural language representations of Python packages. My approach is:

  1. Extract a call graph from the package, where each node is a function and two nodes are connected if one contains a call to the other.
  2. Generate natural language summaries of each function by convolving over the graph. This involves generating summaries of the terminal nodes (i.e. functions with no dependencies), then passing those summaries to their dependents to generate summaries, and so on. Very similar to how message passing works in a GNN. The idea here is that summarizing what a function does isn't possible without summaries of what its dependencies do.
  3. Summaries of each function within a file are chained to generate a summary of that file.
  4. Summaries of each file within a directory are chained to generate a summary of that directory, and so on until the root directory is reached.
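Steps 1 and 2 above can be sketched roughly as follows. This is only an illustration, not jsonathan's code: the sample `SOURCE`, the helper names, and the `summarize` placeholder (which stands in for a language-model call) are all hypothetical. Only direct calls to functions defined in the same source are tracked.

```python
import ast
from typing import Dict, FrozenSet, List, Set

# Hypothetical package source used only for this example.
SOURCE = '''
def load(path):
    return open(path).read()

def clean(text):
    return text.strip()

def run(path):
    return clean(load(path))
'''


def build_call_graph(source: str) -> Dict[str, Set[str]]:
    """Map each function name to the locally defined functions it calls."""
    tree = ast.parse(source)
    funcs = {n.name: n for n in ast.walk(tree) if isinstance(n, ast.FunctionDef)}
    graph: Dict[str, Set[str]] = {}
    for name, node in funcs.items():
        callees: Set[str] = set()
        for child in ast.walk(node):
            # Only plain-name calls such as clean(...); attribute calls are skipped.
            if isinstance(child, ast.Call) and isinstance(child.func, ast.Name):
                if child.func.id in funcs and child.func.id != name:
                    callees.add(child.func.id)
        graph[name] = callees
    return graph


def summarize(name: str, dep_summaries: List[str]) -> str:
    """Placeholder for a language-model call that sees dependency summaries."""
    if not dep_summaries:
        return f"{name}()"
    return f"{name}(uses: {', '.join(dep_summaries)})"


def summarize_bottom_up(graph: Dict[str, Set[str]]) -> Dict[str, str]:
    """Summarize terminal functions first, then pass summaries to dependents."""
    summaries: Dict[str, str] = {}

    def visit(name: str, in_progress: FrozenSet[str]) -> str:
        if name in summaries:
            return summaries[name]
        # Back edges are skipped so that recursive call cycles terminate.
        deps = [
            visit(dep, in_progress | {name})
            for dep in sorted(graph[name])
            if dep not in in_progress
        ]
        summaries[name] = summarize(name, deps)
        return summaries[name]

    for name in sorted(graph):
        visit(name, frozenset())
    return summaries


graph = build_call_graph(SOURCE)
summaries = summarize_bottom_up(graph)
print(summaries["run"])
```

Steps 3 and 4 would then chain these per-function summaries into file and directory summaries in the same bottom-up fashion.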

I'd love to learn more about the differences/advantages of your approach compared to something like this. Thanks again for your contribution, this is insanely cool!

1

eclipsejki t1_jbt3ivm wrote

So if I understood correctly, this lib "explains" a git repo. Am I right?

2