
Will accurately decompiling compiled code be easy with future tech?

Delicieuxz

I don't know anything about this topic, but I'm curious whether the current challenges of accurately decompiling compiled code come from the performance limitations of today's technology or from something else.

 

Does the compiled information in an executable include information about how the source code was originally written and organized? If so, then I would think decompiling compiled code would be like breaking encryption, and I'd wonder whether a supercomputer or quantum computer could already do it today, if someone wanted to use one for that purpose.

 

But even if formatting information is lost when source code is compiled, I still wonder if it will become easier to accurately decompile compiled code with future technology.



More than formatting is lost when something is compiled. The entire source code itself is no longer there.

 

You can attempt to reverse engineer your way through the compression, the obfuscated code, and possibly encryption, but you'll end up with something nowhere near the original source code.

 

Think of the source code as the recipe for Coca-Cola and the compiled program as a can of Coke. You can't take a can of Coke and magically derive the recipe from its chemical composition. You can form an educated guess, but that's about it.



28 minutes ago, Delicieuxz said:

Does the compiled information in an executable include information about how the source code was originally written and organized?

No. There is no reason to include any such information in the compiled executable, so compilers don't put it there; the exception is debug builds, which can carry symbol names on purpose.
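For what it's worth, here's a minimal sketch of how that plays out with a typical C toolchain (the file name, function name, and build commands are made up for illustration): names can survive in the symbol table or in deliberately added debug info, but comments and formatting never make it into the binary at all.

/* whats_left.c - illustrative sketch only.
 *
 *   gcc -g whats_left.c -o debug        # keeps the symbol table and adds debug info
 *   gcc -O2 -s whats_left.c -o release  # optimized, symbol table stripped
 *
 * `nm debug` lists names such as main and seconds_per_day; `nm release`
 * reports "no symbols". In neither case do the comments survive - the
 * compiler discards them before it ever generates code.
 */
#include <stdio.h>

/* This comment and the function name exist only in this source file. */
static int seconds_per_day(void) {
    return 24 * 60 * 60;
}

int main(void) {
    printf("%d\n", seconds_per_day());
    return 0;
}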



It's not a technology problem; it's that compilers remove information. There may be advances in decompilers that produce more plausible code, but you will never be able to take Windows (or something considerably less complex) and get the original code back - or, in all likelihood, anything resembling it.

 

If you take these C programs:

int addOne(int input) {
    return input + 1;
}

int getValue() {
    return addOne(5);
}

 

int getValue() {
    return 5 + 1;
}

 

int getValue() {
    return 6;
}

Most compilers, set to any level of optimization, will produce the same code for all three (shown here as Intel-syntax x86 assembly):

some_random_label_if_youre_lucky:
    mov eax, 6
    ret

 

It's therefore impossible to reconstruct which of the inputs was used.
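To make that concrete, the best a decompiler can do with that output is roughly the following sketch (not real tool output; the auto-generated function name is made up, in the style tools such as Ghidra use):

/* Roughly what a C decompiler reconstructs from "mov eax, 6; ret". */
int FUN_00401000(void)    /* auto-generated name - "getValue" is gone */
{
    return 6;             /* addOne(5), 5 + 1 and 6 are indistinguishable */
}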

 

Other languages have stricter rules about how the code is compiled, so you can usually do a pretty good job of decompiling them and get back most of the original code. For example, Java compilers aren't (afaik) allowed to optimize out the call to addOne, nor to change its name, so decompilation will give you a lot more detail about the original code. The second and third cases will still be indistinguishable though, because the compiler can optimize 5 + 1 to 6 (and there are a bunch of other syntactic-sugar features that won't get decompiled correctly).

 

There are already some pretty good Java decompilers, but they're still limited by what's actually in the bytecode. For example, a random decompiled class I currently have open in my IDE looks like this:

    @NotNull
    public SyntaxHighlighter getSyntaxHighlighter(Project project, VirtualFile virtualFile) {
        SyntaxHighlighter var10000 = new SyntaxHighlighter();
        if (var10000 == null) {
            $$$reportNull$$$0(0);
        }

        return var10000;
    }

I'm pretty sure that's not what the code really looks like, but the local variable names aren't present in the bytecode and the null check was added by the compiler.



Data mining will never get to that point, especially since source code protection is developing more or less in step with military-grade encryption. Even games released more than ten years ago still don't have their full source code recovered from the shipped binaries.

You can think of it as a lossy process: compiling throws information away, so the "raw" original can never be recovered from what's left, no matter how the result is transferred or repackaged. And for security reasons, on both the developers' and the users' side, nobody is going to make recovering source code easy, precisely to keep devious actions from taking place.

There is already a severe problem with hacking in a lot of the applications and games we use, even without the source code being publicly available; making it easier to get at would only worsen the situation.


Thanks for all the answers.

 

Does AI learning stand to improve decompiling accuracy? Perhaps by running a program or parts of it, and analyzing what's happening as it runs?



9 hours ago, Delicieuxz said:

Does AI learning stand to improve decompiling accuracy?

Accuracy is not the primary issue. If you can compile and run the decompiled code and it behaves exactly like the original, then the code is an accurate representation of the original. But that does not mean the decompiled code is as easy to read and understand as the original, which makes it that much harder to modify (e.g. to fix a bug).

 

Let me give you a very simple example. Let's say I have a variable defined like this:

/**
 * The amount of time after which our network request must time out.
 *
 * The request must time out after one day because <reason>. Because
 * of <other reason> the time must be reduced by exactly 1.001 seconds.
 */
private static final long requestTimeout = (24 * 60 * 60 * 1000) - (1001); // ~24h

 

The compiler will take these values and replace them with a single constant (86,398,999). This means that after decompiling you will probably be left with something like this:

long a = 86398999;

 

So here's the issue: My comment is gone, my "factorization" hinting at its meaning is gone, the name of the variable hinting at its use is gone. A developer working with time a lot may very well look at a value like "86400" and immediately recognize this as the number of seconds in a day. The value 86398999 is a lot harder to understand. So you'll have to dig deeper to make any sense of what the value could mean.

 

A decompiler, no matter how advanced, is not going to be able to take the value "86398999" and return it to its original representation of "24 * 60 * 60 * 1000 - 1001", because that information is simply gone. To restore it, you would need to grasp not only the purpose of the value, but also the fact that the original representation is easier for a human to read and understand.



I hope that the whole ... as a Service / ... in the Cloud marketing bullshit will meet its fate soon(ish). You cannot decompile a web application locally at all - most of the code never reaches your machine, so the best you can do is interpret its behavior from the outside.



On 1/24/2021 at 8:49 AM, Sakuriru said:

 For something like .NET it's compiled into CIL which does decompile pretty cleanly.

 

Still, even with CIL there is method inlining, so you "lose" how a method was deliberately structured, and that can't be reverted to the originally written source code. But I agree, it's still a million times cleaner than decompiling fully native machine code.


Short answer is no. Source code is intentionally meant for people. The naming schemes, indentation, comments, etc. are all there to make it easier for a developer to look at the code and get an idea of what problem it is trying to solve. Machine code is meant to be easy for the CPU to execute, not for a person to read.

 

Don't get me wrong; yeah, some sort of AI could absolutely guess at what the source code looked like based on the machine code, but you're still not going to have the naming schemes, you're not going to have the comments, and no matter how good the AI is, it will always be a guess.

And even assuming you had an AI that could recreate 90%+ of the source code, why the f*@# are you wasting an AI on something like that?????

 

Modern Vintage Gamer has a great video on how the source code for Diablo was reconstructed.  The only way they were able to do it was because of excess information left in the commercial versions.

 


1 hour ago, JacobFW said:

Modern Vintage Gamer has a great video on how the source code for Diablo was reconstructed.  The only way they were able to do it was because of excess information left in the commercial versions.

 

 

That's a cool video.

