Machine code and byte encoders
-
Note: I'm not an expert in the area, so if I misunderstand, please explain why.
I'm looking to compile a PHP application into byte code, but I've always understood byte code to be completely reversible, whereas machine code is virtually impossible to reverse. A low-level JMP could have been a DO, IF, or SWITCH. Whether it was or not would require looking at the reverse-engineered code and trying to determine that. Byte code is not as low level, but it's not so abstract that you have constructs like IF, SWITCH, LOOP, etc., is it? Does it still use a low-level JMP command? I know the 8086 MOV instruction has like 40 or 50 different operations; it's that low-levelness that makes decompiling from machine code to source so difficult. Byte code, though: how low level is it?
When a C++ program is compiled, all the tools of the language such as consts, access control, interfaces, etc. are probably wiped clean, correct? Does the same apply to byte code? What about classes? Does byte code represent a source file as objects, or are those objects expanded into globals (functions and data)? How are objects converted into machine code or byte code? Do class names persist, or are they replaced with mangled names or integer offsets? Would converting objects into inlined code create problems due to polymorphism?
I can't really think of much else to say right now...but I'm sure I'll have more to ask once I get a few replies. :)
I'm finding the only constant in software development is change itself.
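One way to get a feel for how low level byte code is: the sketch below uses Python rather than PHP, because Python's standard `dis` module can list the byte code of a function. It assumes CPython (exact opcode names differ between versions), and `classify` is a made-up example function. It suggests that an IF does turn into a compare plus a conditional jump, while variable names are still kept next to the byte code.

```python
import dis

def classify(n):
    # A plain high-level IF with two outcomes.
    if n > 10:
        return "big"
    return "small"

# Collect the opcode names the compiler emitted for classify().
opnames = [ins.opname for ins in dis.get_instructions(classify)]

# The IF keyword is gone; what remains includes a conditional jump.
jumps = [name for name in opnames if "JUMP" in name]
print(opnames)
print("jump instructions:", jumps)

# Local variable names, however, are still stored alongside the byte code.
print("locals kept:", classify.__code__.co_varnames)
```

Other virtual machines make different choices, so this is only an illustration of one byte code format, not a statement about what the PHP encoders emit.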
Machine code is the binary language that the machine itself executes. It differs between hardware (basically processors) and operating systems, so it is machine specific.
Byte code, on the other hand, is a sort of intermediate language. It is not machine code, but something between machine language and the high-level language, and it is not machine specific. It is compiled on the fly, depending on the machine's specifications, and the resulting machine code is executed by your processor. How the byte code is converted to machine language depends on the platform. For example, a Java program is stored as byte code; when we want to run it, the Java runtime compiles it on the fly into machine code and runs it for you.
This makes a few differences very obvious:
1. Machine code is hard to decompile because it is pure zeros and ones, a set of hardware instructions. Byte code, by contrast, can easily be decompiled since it is not machine specific, nor is it in a pure zeros-and-ones format.
2. Machine code is not portable, because each processor make or operating system may interpret the instructions differently. Byte code is portable: it will always run, irrespective of the environment or hardware, as long as an interpreter is available to convert it into machine-specific code.
Hope this helps,
Pradeep :)
-
Sweet Holy Mother! How are you up and typing Mick Martin? You can only have been asleep 5 hours?
martin_hughes wrote:
Sweet Holy Mother! How are you up and typing Mick Martin? You can only have been asleep 5 hours?
:laugh: That's a good night's sleep for some programmers, depending on their project.
-
If it can be run by a processor, it can be reverse engineered. There's absolutely no such thing as impossible, just varying degrees of difficulty. Native unmanaged code has been reverse engineered since probably about 10 days after the first commercial application was released. Of course, these days few people are releasing new software with revolutionary, patent-worthy algorithms in it, so I suspect it's done far less for industrial reasons and more for license circumvention than anything else.
In the case of .NET code, obfuscation is critical because pretty much everything persists in the byte code; you can easily turn the assembly into fully compilable source code with one click. Obfuscation makes the resulting code a royal pain to work with when converted back to a .NET language. In many cases a good obfuscator causes decompilers to have seizures. A good obfuscator makes it pretty much impossible to work with the code at all: definitely impossible to build an application from it without a hell of a lot of hand renaming and coding, and extremely difficult to find specific algorithms and areas of the code and understand them. It's not impossible if you have a *lot* of time on your hands, but that's essentially far more time than it would take to just write the program yourself if it's of any size at all.
I see there are some Java bytecode obfuscators on the market, which leads me to believe the experience is similar to .NET.
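To illustrate the point that names persist in byte code, here is a minimal sketch in Python (not .NET), assuming CPython's code objects. The class `Invoice` and the helper `collect_names` are hypothetical names invented for this example. It compiles a small class to byte code and then recovers the identifiers from the compiled objects alone.

```python
import types

SOURCE = '''
class Invoice:
    def total(self, price, quantity):
        subtotal = price * quantity
        return subtotal
'''

def collect_names(code):
    """Recursively gather every identifier stored in a code object."""
    names = {code.co_name, *code.co_names, *code.co_varnames}
    for const in code.co_consts:
        if isinstance(const, types.CodeType):  # nested class or function body
            names |= collect_names(const)
    return names

# Compile to byte code only; the source text is not consulted afterwards.
module_code = compile(SOURCE, "<example>", "exec")
print(sorted(collect_names(module_code)))
```

The class name, method name, parameters and local variable all show up in the output, which is the kind of information an obfuscator would replace with meaningless names.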
"It's so simple to be wise. Just think of something stupid to say and then don't say it." -Sam Levenson
John C wrote:
I see there are some Java bytecode obfuscators on the market which leads me to believe the experience is similar to .net.
C-Shroud from Gimpel Software was a popular obfuscator in the early '90s. It worked by obfuscating the C code, which then compiled to machine code that was completely incomprehensible. I mention this simply to demonstrate that the problem and solutions existed even before .NET and Java VMs came on the scene.
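How C-Shroud worked internally is not described here, so the following is only a toy illustration of the general idea of obfuscating at the source level before compiling. It is written in Python, needs Python 3.9+ for `ast.unparse`, and `RenameLocals` is a hypothetical helper that only handles simple cases (it would break keyword-argument calls, for instance).

```python
import ast

class RenameLocals(ast.NodeTransformer):
    """Toy obfuscator: replaces parameter and local names with v0, v1, ..."""

    def __init__(self):
        self.mapping = {}

    def _alias(self, name):
        # Hand out the next meaningless name the first time we see `name`.
        return self.mapping.setdefault(name, f"v{len(self.mapping)}")

    def visit_arg(self, node):
        node.arg = self._alias(node.arg)
        return node

    def visit_Name(self, node):
        # Rename names that are assigned to, or that were already renamed;
        # globals and builtins that are only read are left alone.
        if isinstance(node.ctx, ast.Store) or node.id in self.mapping:
            node.id = self._alias(node.id)
        return node

READABLE = '''
def area(width, height):
    result = width * height
    return result
'''

tree = RenameLocals().visit(ast.parse(READABLE))
obfuscated = ast.unparse(tree)  # turn the rewritten tree back into source text
print(obfuscated)

# The renamed source still behaves the same once compiled and run.
namespace = {}
exec(obfuscated, namespace)
print(namespace["area"](3, 4))
```

The function name is kept so callers still work; only the names that carry meaning for a human reader inside the function are thrown away.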
-
Pradeep, I have to respond a bit to your statement. The only difference between "byte code" and "machine code" is that byte code runs on a virtual machine, and machine code runs on a physical machine. Byte code for one virtual machine (e.g. the Java Runtime) is not portable to another virtual machine (e.g. the .NET CLR). Only the fact that virtual machines are commonly written for multiple architectures makes byte code portable. Machine code is fully "portable" to any number of software emulators, so it is common practice for embedded software to be initially developed on a development workstation using an emulator. For example, the code for an instrument monitor that uses a Motorola 68020 as its CPU might actually be developed on a Windows PC that natively runs on a Core 2 Duo from Intel.
-
Yeah, but he'd been drinking heavily.
Ahoy! Martin Hughes
-
martin_hughes wrote:
Yeah, but he'd been drinking heavily.
In that case, I certainly hope he's not operating machinery heavier than his keyboard :omg:
-
what vm would be "running" the app that you want to "compile"? the features you are asking about would be a feature of the vm imo, and of what it is capable of doing with regard to run-time support of language features. on a side note... i write and teach php for a living and i'm absolutely fascinated as to the reasons why you would want to "compile" it into some kind of (presumably proprietary) byte code, and what you hope to achieve by doing so? :)
"mostly watching the human race is like watching dogs watch tv ... they see the pictures move but the meaning escapes them"
modified on Monday, October 6, 2008 12:22 AM
l a u r e n wrote:
what vm would be "running" the app that you want to "compile" ?
ionCube and Zend Encoder are two options...PHPA is the open source version. ionCube and Zend both have encoders/compilers that take your source and compile it into byte code. These encoded files are then uploaded to your server, but they require the proprietary decoder extensions to be installed on the server in order to decode/decrypt and execute them, basically skipping the tokenizing/parsing stages and going straight to execution. PHP Accelerator/APC are basically just extensions that hook into the parsing/execution phase and cache the resulting byte code for a certain request. The next time that request is made, the cached byte code is used and tokenization/parsing is spared. The former can protect your code (in a limited manner, after reading more about byte code -- it's basically binary code with all high-level constructs like classes, etc. intact) and speed it up under most circumstances. The latter will really only speed it up.
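The caching idea can be sketched roughly as follows. This is an analogy in Python, not how APC or PHP Accelerator are implemented: `run_script`, `_cache` and `stats` are hypothetical names, and the cache is a plain in-memory dictionary keyed by a hash of the source text.

```python
import hashlib

_cache = {}               # hypothetical cache: source hash -> compiled code object
stats = {"compiles": 0}   # counts how often we actually had to compile

def run_script(source):
    """Execute `source`, compiling it only if this exact text is new."""
    key = hashlib.sha256(source.encode("utf-8")).hexdigest()
    code = _cache.get(key)
    if code is None:
        # Cache miss: tokenize, parse and compile to byte code once.
        code = compile(source, "<script>", "exec")
        _cache[key] = code
        stats["compiles"] += 1
    namespace = {}
    exec(code, namespace)  # hit or miss, execution always uses the byte code
    return namespace

SCRIPT = "greeting = 'hello ' + 'world'"
first = run_script(SCRIPT)
second = run_script(SCRIPT)
print(first["greeting"], "| compiled", stats["compiles"], "time(s)")
```

The second call skips compilation entirely, which is the speed-up described above; nothing here hides the code, so it matches the "will really only speed it up" case.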
l a u r e n wrote:
on a side note... i write and teach php for a living and i'm absolutely fascinated as to the reasons why you would want to "compile" it into some kind of (presumably proprietary) byte code and what you hope to achieve by doing so?
I'm an architecture geek with a custom-developed framework. The framework is a compilation of good ideas borrowed from every single framework I could find on Google (about 9 months of research, study, and trial and error). I don't want my ideas ending up in Zend or Symfony or CakePHP, and even worse, in my competitors' products. My application will be a SaaS hosted application; however, I wouldn't mind giving it away for free as a marketing channel (as I have no idea how else I'm going to get people using it). Hence the compilation.
-
John C wrote:
There's absolutely no such thing as impossible,
Didn't I say virtually impossible? :P I realize that reverse engineering byte code or machine code isn't impossible, but if it were that easy, someone would have reverse engineered Windows or Adobe Photoshop and there would be Open Source versions available. :P The point was that machine code is significantly harder to reverse engineer. Byte code, as I now understand it, is quite high level.
-
I'm from the old school where we used to code in assembly language. Machine language is completely reversible, just a little harder. How do you think hackers go about removing the copy protection from commercial software? Or virus-killers go about understanding what a virus does? The determined will always be able to figure out what your app is doing.
I know it's not impossible...but removing copy protection is not the same as completely reverse engineering an application. The architecture is what I am interested in protecting, not the implementation.
-
Yes, this is exactly what I was trying to explain :) The virtual machine is actually a program in machine language. It is just like any other traditional C/C++ program. For example, the Java Runtime is the actual program in machine code. The Java runtime (virtual machine) bears the same relationship to the Java program (in byte code) as a word processor bears to Word documents. Pradeep
-
Hockey wrote:
I realize that reverse engineering byte code or machine code isn't impossible, but if it were that easy someone would have reverse engineered Windows or Adobe Photoshop and there would be Open Source versions available.
Not sure about "Open Source"... It's illegal to reverse engineer these types of projects; check the license agreements. So code released after reverse engineering will undoubtedly include copyrighted or patented material. GNU and Linux have both been bitten by this: volunteers adding code that was covered by a license, with either the code itself (copyrighted) or the algorithm (often patented) being the problem.
-
Mmmmm, I was also making the point that machine language is related to the computer hardware interface as the Java code is related to the Java runtime as the Word document is related to the Word program. Machine language, in other words, is just another script for a logical machine (where the logical machine, in this case, maps directly one-for-one to the physical machine). It's a point that's often missed except by C and assembly language programmers.
-
Byte code is interpreted by an interpreter; machine code is processed by the CPU. BASIC is interpreted, which runs slowly because the interpreter reads text and translates it as it goes. If you transform the program into byte code (function numbers), it runs faster, because the interpreter calls Function[number] from a function table instead of looking up a text command in a command list. Java used to be interpreted, but now you compile it into byte code so it can be interpreted faster. I think they also put the variables in a variable table, as with the functions, so there's no text to translate, and it makes the programs smaller. Variable 100 (variabletable[100]) is quicker and smaller than looking up myvariablename. Each byte code tells whether it is a command or a variable and how many arguments it takes.
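The function-table and variable-table idea can be sketched like this. It is a minimal made-up stack machine in Python, not the encoding of any real VM: the opcode numbers, the `run` function and the two-integers-per-instruction layout are all assumptions chosen for the example.

```python
# Hypothetical opcodes for a made-up stack machine (not any real VM's encoding).
PUSH_CONST, LOAD_VAR, STORE_VAR, ADD, MUL, HALT = range(6)

def run(program, variables):
    """Interpret `program`, a flat list of integers, against a variable table."""
    stack = []

    def push_const(arg):
        stack.append(arg)

    def load_var(arg):
        stack.append(variables[arg])   # variable referenced by slot number

    def store_var(arg):
        variables[arg] = stack.pop()

    def add(arg):
        stack.append(stack.pop() + stack.pop())

    def mul(arg):
        stack.append(stack.pop() * stack.pop())

    # Function table: the opcode number is the index, so no text lookup is needed.
    table = [push_const, load_var, store_var, add, mul]

    pc = 0
    while program[pc] != HALT:
        opcode, arg = program[pc], program[pc + 1]
        table[opcode](arg)
        pc += 2                        # every instruction is (opcode, argument)
    return variables

# total = (price + 2) * quantity
# slot 0 = price, slot 1 = quantity, slot 2 = total
program = [
    LOAD_VAR, 0,
    PUSH_CONST, 2,
    ADD, 0,
    LOAD_VAR, 1,
    MUL, 0,
    STORE_VAR, 2,
    HALT, 0,
]
print(run(program, [5, 3, 0]))
```

Note that the names price, quantity and total appear only in the comments; the program itself contains nothing but numbers, which is why this form is both smaller and faster to dispatch than text.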