Submitted by NovelspaceOnly t3_11o97on in MachineLearning

Decompose Python libraries and generate coherent hierarchical topic models of their repositories.
https://github.com/danielpatrickhug/GitModel

The ability to bootstrap its own codebase is a powerful feature: the system can use its own output as input to improve itself. By generating hierarchical topic trees of GitHub repositories, including its own, GitModel can analyze and extract insights from its own codebase and from other codebases to improve its functionality. This can lead to more efficient and effective code generation, better semantic graph generation, and improved text generation.

I spent around 10 hours today on a major refactor, creating a simple pipeline abstraction and allowing dynamic instantiation from YAML configs. It now also supports multiple GNN heads.
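
For anyone curious, here's roughly what that kind of dynamic instantiation could look like; the module paths, class names, and config keys below are hypothetical stand-ins, not GitModel's actual schema:

    # Hypothetical sketch of building a pipeline from a YAML spec.
    # Module/class names are illustrative, not GitModel's real API.
    import importlib
    import yaml

    CONFIG = """
    pipeline:
      - class: gitmodel.embeddings.SentenceEncoder
        args: {model_name: all-MiniLM-L6-v2}
      - class: gitmodel.gnn.MessagePassingHead
        args: {hops: 2}
      - class: gitmodel.topics.HierarchicalTopicModel
        args: {min_topic_size: 10}
    """

    def load_pipeline(config_text):
        steps = []
        for spec in yaml.safe_load(config_text)["pipeline"]:
            module_path, class_name = spec["class"].rsplit(".", 1)
            cls = getattr(importlib.import_module(module_path), class_name)
            steps.append(cls(**spec.get("args", {})))
        return steps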

Please try it out and let me know what you think!

Example:
https://github.com/deepmind/clrs

Example output screenshot: https://preview.redd.it/ut4fc6c401na1.png?width=1506&format=png&auto=webp&v=enabled&s=b039242432c1f0526d1d81eadbfe8abc1168d2fd

109

Comments


xt-89 t1_jbrx1ss wrote

This project is interesting. The description, however, is hard to parse. I'd suggest going over your README and cleaning up some things.

If I could also suggest a feature: if you could use this to generate UML diagrams, that'd be great.

You mention that the code base can improve itself. I don’t see where that functionality is. Do you mean that if a person uses this tool for software analysis, productivity increases?

28

[deleted] t1_jbs3esk wrote

The full functionality has been constrained a bit due to the refactor; it will be fixed soon.

I apologize for the messy README. The main idea is that I'm using a GNN layer as an inductive bias to improve the representational power of the sentence embeddings: I exponentiate the adjacency matrix (A^2), then aggregate node features using message passing. Finally, the topic model builds a topic tree that is fed back into the system prompt to generate higher-quality semantic context. It's also relatively easy w/ BERTopic to combine this with outlier detection/filters to remove low-quality data.
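
Roughly, a minimal sketch of that aggregation step (my reading of it, not GitModel's exact code):

    # Square the adjacency matrix so each node sees its two-hop
    # neighborhood, then mean-aggregate sentence embeddings over it.
    import numpy as np

    def two_hop_aggregate(adj, embeddings):
        """adj: (n, n) adjacency matrix; embeddings: (n, d) node features."""
        a2 = adj @ adj + np.eye(adj.shape[0])  # A^2, keeping self-features
        norm = a2.sum(axis=1, keepdims=True)
        norm[norm == 0] = 1.0                  # guard isolated nodes
        return (a2 / norm) @ embeddings        # aggregated node features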

It's a recurrent neural network in the sense that you're feeding the output of the previous step back into the network.

You can also use another repo's topic tree to suggest improvements. For example, I can add the DeepMind topic tree and have it comment on where features such as a graph attention network could be added, or where code could be converted to JAX, and then generate it on the fly.

Also, yes, I found it very useful for deconstructing complex repos, like one on knot theory; it gave me a lot of insight and made it easier to narrow my research and study down to the principal components of the repo.

9

xt-89 t1_jbs8od5 wrote

Yeah, you've definitely set up a good representation bias for modeling entire software architectures.

I had a thought a while back that GitHub Copilot is eventually going to offer a feature where they suggest improvements to entire software architectures… and then eventually just write whole projects from a text description alone. I think that the solution for that would be pretty similar to what you’ve done if scaled up and applied that way.

If your plan is to scale up the system for more advanced features, that would be awesome.

Another suggestion is that if you integrated your tool with GitHub, it would be pretty useful for enterprise software development. Most companies are pretty crappy at documentation. Even with good documentation, a chatbot is better than a static document.

Good job!

7

[deleted] t1_jbsafxc wrote

Thank you! Yes I thought the topic tree would be a great complement to the commit tree. Would be great for stale repos with little to no documentation.

There's also the option to mix in multiple repositories and message-pass between them to help with brainstorming new features, or to message-pass between your repo and its dependencies.

1

xt-89 t1_jbsaabf wrote

I also plan on applying the basic idea of a GNN with prompting to the thought loop of a cognitive entity (basically Open Assistant). I believe that if you take the tree you're outputting for code and use it to aid CoT reasoning, that could be pretty powerful.
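
Purely as a sketch of that idea (hypothetical, not Open Assistant's or GitModel's code), the topic tree could be rendered into a CoT system prompt like this:

    # Illustrative only: fold a nested {topic: subtopics} tree into a
    # chain-of-thought system prompt.
    def tree_to_outline(tree, indent=0):
        lines = []
        for topic, children in tree.items():
            lines.append("  " * indent + "- " + topic)
            if children:
                lines.append(tree_to_outline(children, indent + 1))
        return "\n".join(lines)

    tree = {"graph algorithms": {"message passing": {}, "attention": {}}}
    system_prompt = (
        "You are reasoning about a codebase with this topic structure:\n"
        + tree_to_outline(tree)
        + "\nThink step by step, referencing topics from the outline."
    )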

3

[deleted] t1_jbsama8 wrote

Yes, exactly! That's a major goal of this project. I plan on incorporating the inference server that Yannic set up for Open Assistant.

1

NovelspaceOnly OP t1_jbsdr55 wrote

I have some preliminary generation scripts for SMILES chemical graphs, Feynman diagrams, storytelling with interleaved images, and testing compilation rates. Sorry for switching accounts; this one is logged in on my laptop lol.

1

xt-89 t1_jbt5yyd wrote

That’s cool. I assume you’re going to apply this to memories for the agent. There’s already relevant research on how to do that. Here’s one from Facebookresearch: https://ai.facebook.com/blog/retrieval-augmented-generation-streamlining-the-creation-of-intelligent-natural-language-processing-models/

1

NovelspaceOnly OP t1_jbu2nki wrote

Yes! I would describe my repo as very aligned with ideas from Yann LeCun: "composition of clever abstractions".

1

[deleted] t1_jbs3u5o wrote

It's also easy to retrieve representative docs for the topics in the tree.
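
For example, with a fitted BERTopic model (BERTopic is mentioned above), and assuming `docs` is your list of text chunks:

    # Minimal sketch; `docs` is assumed to be a list of text chunks
    # (e.g. docstrings or code summaries).
    from bertopic import BERTopic

    topic_model = BERTopic(min_topic_size=10)
    topics, probs = topic_model.fit_transform(docs)
    rep_docs = topic_model.get_representative_docs(topic=0)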

1

[deleted] t1_jbrp32i wrote

[deleted]

14

LikeForeheadBut t1_jbshnil wrote

If you’re that broke, aren’t there better uses of your time than this project lol

10

hak8or t1_jbt1wja wrote

/u/NovelspaceOnly Can you verify this?

As to /u/Main_Mathematician77, you are effectively a software developer with the ability to dabble in machine learning. Are you located in the States or elsewhere? It is very confusing how you are broke yet have that skillset.

> Don’t @ me saying this is a waste of compute, I know what I’m doing and idgaf.

That is extremely and unnecessarily antagonistic/combative.

4

[deleted] t1_jbu3k8i wrote

If you're looking to build something similar, please consider contributing instead of building from scratch!

1

CryInternational7589 t1_jbs1jo0 wrote

You just helped a ton in building personal codex models.

7

[deleted] t1_jbs3psv wrote

Amazing! Let me know if you end up sharing it on Hugging Face!

2

eclipsejki t1_jbt3ivm wrote

So, if I understood it well, this lib "explains" git repos. Am I right?

2

jsonathan t1_jbt3hqq wrote

This is really fascinating, thanks for sharing. I'm also working on generating natural language representations of Python packages. My approach is:

  1. Extract a call graph from the package, where each node is a function and two nodes are connected if one contains a call to the other.
  2. Generate natural language summaries of each function by convolving over the graph. This involves generating summaries of the terminal nodes (i.e. functions with no dependencies), then passing those summaries to their dependents to generate summaries, and so on. Very similar to how message passing works in a GNN. The idea here is that summarizing what a function does isn't possible without summaries of what its dependencies do (see the sketch after this list).
  3. Summaries of each function within a file are chained to generate a summary of that file.
  4. Summaries of each file within a directory are chained to generate a summary of that directory, and so on until the root directory is reached.
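
Here's a minimal sketch of that bottom-up pass, with `summarize` standing in for whatever LLM call you use:

    # Sketch of the bottom-up summarization in steps 1-2 above;
    # `summarize(code, context)` is a stand-in for any LLM call.
    from graphlib import TopologicalSorter

    def summarize_package(call_graph, source, summarize):
        """call_graph[f] = set of functions f calls; source[f] = f's code."""
        summaries = {}
        # Callees are yielded before callers, so terminal nodes
        # (functions with no dependencies) are summarized first.
        for fn in TopologicalSorter(call_graph).static_order():
            context = "\n".join(summaries[c] for c in call_graph.get(fn, ()))
            summaries[fn] = summarize(source[fn], context)
        return summaries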

I'd love to learn more about the differences/advantages of your approach compared to something like this. Thanks again for your contribution, this is insanely cool!

1

NovelspaceOnly OP t1_jbu2xyi wrote

This is awesome. I would be happy to discuss this as well! I was going to add GCNs and GATs pretty soon. If you're up for collaborating, please reach out in DMs!

2