Submitted by vintergroena t3_123asbg in MachineLearning

Looking at how GPT can work with source code mixed with natural language, I am thinking that similar techniques could perhaps be used to construct a decent decompiler. Consider a language like C. There is plenty of open-source code that could be compiled. You could then use a dataset of (source code, compiled code) pairs to train a generative model to learn the inverse operation from data. Of course, the model would need to fill in the information lost during compilation (variable names etc.) in a human-understandable way, but looking at the recent language models and how they handle source code, this now seems rather doable. Is anyone working on this already? I would consider such an application to be extremely beneficial.
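To make the training-data idea concrete, here is a minimal sketch of how such (source code, compiled code) pairs could be assembled, assuming gcc and objdump are on the PATH; the corpus directory name and compiler flags are illustrative choices, and real projects would of course need their own build systems rather than standalone compilation:

```python
# Sketch: build (source code, compiled code) training pairs from C files.
# Assumes gcc and objdump are on PATH; handles self-contained .c files only.
import pathlib
import subprocess
import tempfile

def make_pair(c_file: pathlib.Path) -> tuple[str, str]:
    """Compile one C file and return (source text, disassembly text)."""
    source = c_file.read_text()
    with tempfile.TemporaryDirectory() as tmp:
        obj = pathlib.Path(tmp) / "out.o"
        # -O2 so the model sees realistically optimized code
        subprocess.run(
            ["gcc", "-O2", "-c", str(c_file), "-o", str(obj)],
            check=True,
        )
        disasm = subprocess.run(
            ["objdump", "-d", str(obj)],
            check=True, capture_output=True, text=True,
        ).stdout
    return source, disasm

# Pair every .c file in a corpus directory (directory name is hypothetical).
pairs = [make_pair(p) for p in pathlib.Path("c_corpus").glob("**/*.c")]
```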

78

Comments

matthkamis t1_jduqbsd wrote

Why do you need a generative model for this? Couldn’t this be done with standard supervised learning?

28

nonotan t1_jdv7m7j wrote

You could certainly do that to some extent, but I suspect it wouldn't generalize very well to programs that do things significantly differently from anything in the training set. Transforming the syntax alone would probably be straightforward enough, but the parts that need more "interpretation" of what's going on (such as assigning plausible variable/function/class names, never mind writing comments) are something I just can't see a standard supervised model handling particularly gracefully. Whereas that's one of the areas LLMs excel at.

12

Smallpaul t1_jdusjts wrote

Good question.

The generative model might be able to learn from fewer examples? Because it already knows more about coding in various languages?

Just a guess.

2

Koda_20 t1_jdy1sup wrote

These generative models also seem better at learning, though, right?

They could better understand what the user wants, too.

1

bubudumbdumb t1_jdu9382 wrote

I think that if you can get better variable names, that's already a big selling point.

18

vintergroena OP t1_jduazsh wrote

Exactly. A lot of the tech is already there; it's perhaps more about unobfuscating the code than decompiling it.
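To illustrate what "unobfuscating" would mean in practice, here is a hypothetical before/after, invented for the example: the first function is the kind of structure a classical decompiler recovers, the second is what an identifier-recovering model would ideally produce from the same logic:

```python
# What a decompiler typically emits: the structure is right, the names are gone.
def f(a, b):
    c = 0
    for d in a:
        if d > b:
            c += 1
    return c

# What an "unobfuscating" model would ideally recover from the same logic.
def count_above_threshold(values, threshold):
    count = 0
    for value in values:
        if value > threshold:
            count += 1
    return count
```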

5

sdmat t1_jdu3imb wrote

GPT-4 will do this to an extent out of the box: feed it some assembly and it will hypothesise a corresponding program in the language of your choice. For me it still has that disassembler character of over-specificity, but I didn't try very hard to get an idiomatic result.

It can give a detailed analysis of assembly too, including assessing what it does at a high level in plain English. Useful!

Edit: Of course it's going to fail hopelessly for large/complex programs.
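For reference, that out-of-the-box usage needs nothing more than a plain chat prompt. A minimal sketch using the pre-1.0 openai Python package; the model name, file name, and prompt wording are arbitrary choices:

```python
# Sketch: ask GPT-4 to "decompile" a disassembly listing.
# Uses the pre-1.0 openai package interface; set OPENAI_API_KEY first.
import os
import openai

openai.api_key = os.environ["OPENAI_API_KEY"]

asm = open("function.asm").read()  # your disassembled function

response = openai.ChatCompletion.create(
    model="gpt-4",
    messages=[{
        "role": "user",
        "content": "Translate this x86-64 assembly into idiomatic C, "
                   "with meaningful variable names and comments:\n\n" + asm,
    }],
)
print(response["choices"][0]["message"]["content"])
```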

14

orthomonas t1_jdum8od wrote

I've had some luck doing this with ChatGPT too, mainly feeding it bits of 6502 code and then saying, "Please explain the branch logic in a higher-level language." It's also reasonably able to give plain-English explanations if you let it know the context and what various addresses may represent.

5

Smallpaul t1_jdu38fb wrote

Decompilers already exist though.

3

currentscurrents t1_jdu3vwq wrote

Yeah, but they're hand-crafted algorithms and produce code that's hard to read.

10

ultraminxx t1_jdu7uz8 wrote

That said, it might also be a good approach to preprocess the input with a classical algorithm and then train a model to refactor the decompiled code so it becomes more readable.
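A sketch of that two-stage pipeline; classical_decompile is a hypothetical stand-in for whatever rule-based decompiler runs first, and llm for whatever model does the rewriting:

```python
# Sketch of the two-stage idea: rule-based decompiler first, LLM pass second.
# classical_decompile() is a placeholder for e.g. a headless decompiler run;
# llm is any callable mapping a prompt string to a completion string.
def classical_decompile(binary_path: str) -> str:
    """Placeholder: run a rule-based decompiler, return its raw C output."""
    raise NotImplementedError("wire up your decompiler of choice here")

def refactor_for_readability(raw_c: str, llm) -> str:
    """Second stage: ask a language model to rename and restructure."""
    prompt = (
        "Refactor this machine-generated C for readability. Keep the "
        "behaviour identical; rename variables and add comments:\n\n" + raw_c
    )
    return llm(prompt)
```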

10

currentscurrents t1_jdvxga6 wrote

Possibly! But it also seems like a good sequence-to-sequence translation problem: just line up the two streams of tokens and let the model figure it out.
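A minimal sketch of that framing with a generic encoder-decoder; the vocabulary sizes, dimensions, and toy batch are placeholders, and positional encodings are omitted for brevity:

```python
# Sketch: treat decompilation as token-stream translation, assembly -> C.
import torch
import torch.nn as nn

ASM_VOCAB, C_VOCAB, D_MODEL = 10_000, 10_000, 512

embed_asm = nn.Embedding(ASM_VOCAB, D_MODEL)
embed_c = nn.Embedding(C_VOCAB, D_MODEL)
model = nn.Transformer(d_model=D_MODEL, batch_first=True)
to_logits = nn.Linear(D_MODEL, C_VOCAB)

# Toy batch: one sequence of assembly token ids, one of C token ids.
asm_ids = torch.randint(0, ASM_VOCAB, (1, 128))
c_ids = torch.randint(0, C_VOCAB, (1, 96))

# Teacher forcing: predict each C token from the assembly and the C prefix,
# with a causal mask so the decoder can't peek at future tokens.
tgt_mask = nn.Transformer.generate_square_subsequent_mask(95)
out = model(embed_asm(asm_ids), embed_c(c_ids[:, :-1]), tgt_mask=tgt_mask)
loss = nn.functional.cross_entropy(
    to_logits(out).reshape(-1, C_VOCAB), c_ids[:, 1:].reshape(-1)
)
loss.backward()  # gradients for one training step
```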

2

s0n0fagun t1_jdu975r wrote

That depends on the language/compiler used. Java and C# have decompilers that turn out great code.

2

currentscurrents t1_jdvxu6g wrote

Those languages don't compile to machine code; they compile to bytecode that runs in a VM.
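Python is the same kind of case, and its dis module makes it easy to see why VM bytecode decompiles so cleanly (runnable as-is):

```python
# Bytecode for a VM keeps far more structure than machine code does.
# Python's own bytecode illustrates the point: names survive compilation.
import dis

def average(values):
    total = sum(values)
    return total / len(values)

dis.dis(average)
# The listing still contains LOAD_GLOBAL (sum), LOAD_FAST (values),
# STORE_FAST (total), etc. Identifiers sit right there in the bytecode,
# which is why Java/C# (and Python) decompilers can produce such clean code.
```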

2

bubudumbdumb t1_jdu90gu wrote

Friends working on rev.ng told me that it's actually very difficult to decompile to the original high-level structures used in the source code. Maybe C has a few ways to code a loop, but C++ has many, and figuring out the source code from assembly is very hard to achieve with rule-based systems.

5

SmellElectronic6656 t1_jdvigcv wrote

Sorry if this is a very basic question. But what are the applications of such a model? What can we do with the decompiled code?

2

vintergroena OP t1_jdvons9 wrote

Ultimately, you could force open-sourcing of any software (provided you can run it on your device, i.e., it's not SaaS).

5

suflaj t1_jdxh3kc wrote

That would be breaching copyright. Depending on the company and the product, you'd get anywhere from a pretty nasty debt to straight-up ruining your whole life (and potentially the lives of your family and people associated with you).

The same way you wouldn't steal from the mob, you wouldn't steal from a company that makes money on a product FOSS can't compete with. Aside from that, decompilers have existed for a very long time, yet we have not witnessed such vigilantism.

−5

EarthquakeBass t1_jdy7796 wrote

It's very useful for malware analysis. In malware it's all about hiding your tracks, so clearing up the intent of even just some of the code helps white hats a lot. Example: perhaps the malware inserts some magic bytes into a file to exploit an auto-run vulnerability. ChatGPT might recognize that context from its training data much more quickly.

4

suflaj t1_jdxgd3z wrote

In most cases yes, but inherently no. Understand that compilers, as part of their optimization step, might compile high-level code into something that you can't really connect with the actual code. Part of the information is lost in the optimization step, so in the general case you will not be able to revert the compilation step. At least not fully: of course you will be able to get something resembling the solution, but it is not guaranteed to be the exact code that compiled into your starting input (see the sketch below).
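One way to see that information loss concretely: two different source-level loops that an optimizing compiler typically collapses into identical machine code, so no decompiler can tell which one you started from. A sketch assuming gcc on the PATH; the C snippets are illustrative:

```python
# Sketch: show that distinct C sources can compile to identical output at -O2.
import pathlib
import subprocess
import tempfile

FOR_LOOP = """
int sum(const int *a, int n) {
    int s = 0;
    for (int i = 0; i < n; i++) s += a[i];
    return s;
}
"""

WHILE_LOOP = """
int sum(const int *a, int n) {
    int s = 0, i = 0;
    while (i < n) { s += a[i]; i++; }
    return s;
}
"""

def compile_to_asm(src: str) -> str:
    """Compile a C snippet at -O2 and return the generated assembly."""
    with tempfile.TemporaryDirectory() as tmp:
        c = pathlib.Path(tmp) / "f.c"
        s = pathlib.Path(tmp) / "f.s"
        c.write_text(src)
        subprocess.run(["gcc", "-O2", "-S", str(c), "-o", str(s)], check=True)
        return s.read_text()

print(compile_to_asm(FOR_LOOP) == compile_to_asm(WHILE_LOOP))  # typically True
```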

This is, of course, before taking into consideration that you cannot recover dead source code that was never compiled into anything. Even if a language does not otherwise optimize the source code, merely discarding dead code already loses information.

This also disregards name mangling. Name mangling can obviously be done in a way that loses information, but that is probably a minor point, since concrete entity names are not that important.

2

enn_nafnlaus t1_jdv2n2j wrote

Clever. Should be very possible.

1

Frumpagumpus t1_jdv6yrq wrote

A decompiler is child's play; train a model that reconstructs servers and databases from API endpoints.

1

ManInTheMirror2 t1_jdvsni8 wrote

Better question: can we train a cross-language IDE that lets you translate between different OOPLs?

1

col-summers t1_jdwhbte wrote

Yes, that is obviously a hugely valuable application of machine intelligence. Want to work on it?

1

Fit-Recognition9795 t1_jdtz20q wrote

You think we are far ahead of where we are... and I also wish we were there, but we are not.

Not saying it won't be possible one day, but have you tried asking GPT-4 to multiply two 3-digit numbers?
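Easy to check for yourself; a quick harness using the pre-1.0 openai package (the prompt wording is arbitrary):

```python
# Sketch: test GPT-4 on 3-digit multiplication. Pre-1.0 openai interface;
# set OPENAI_API_KEY first.
import os
import random
import openai

openai.api_key = os.environ["OPENAI_API_KEY"]

a, b = random.randint(100, 999), random.randint(100, 999)
response = openai.ChatCompletion.create(
    model="gpt-4",
    messages=[{"role": "user",
               "content": f"What is {a} * {b}? Reply with only the number."}],
)
answer = response["choices"][0]["message"]["content"].strip()
print(f"{a} * {b} = {a * b}, model said {answer}")
```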

0

vintergroena OP t1_jdtzyuo wrote

Yeah, GPT sucks at tasks that require actual thinking, and personally I'm kind of skeptical about its actual usefulness, tbh. But my impression is that despite being primarily built to work with natural language, it actually works better with computer code, probably because computer code has a much simpler structure. This got me thinking that building something more specialized, required only to work with computer code, would actually be an easier task; more similar to automated translation, perhaps, which already works pretty well using ML.

4

nonotan t1_jdv8hy1 wrote

I can't speak for GPT-4, but in my experience with ChatGPT, I would definitely not say it is better with code. It's just absurdly, terribly, unbelievably bad at maths. It's a bit better at dealing with code, but that doesn't mean it's good; you're just comparing it with its weakest area. It's not really capable of generating code that does anything even a little complex without heavy guidance pointing it towards its mistakes and getting it to make revision after revision (and even that is non-trivial to get it to do; it tends to just start generating completely different programs with completely different problems instead).

That being said, I can definitely believe it could do okay at decompilation. It's an easy enough task in general, comparatively, and the "trickiest" bit (interpreting what the program is supposed to be doing, to have the context to name variables etc.) feels like the kind of thing it'd perform surprisingly well at. Getting a general "vibe" and sticking with it, and translating A to B, are things it tends to do okay. It's when it needs to generate entirely novel outputs that must fulfill multiple requirements at once that it starts failing miserably.

2

fmfbrestel t1_jdwmb7z wrote

Most of those problems are due to the input/memory limitations of general-purpose use. I can imagine locally hosted GPTs with training access to an organization's source code, development standards, and database structures. Such a system could be incredibly useful. Human developers would just provide the prompts, then supervise, approve, and test the new or updated code.

They would have to be locally hosted, because most orgs are NOT going to feed their source code to an outside agency, regardless of the promises of efficiency.

2