Machine code and byte encoders
-
So asking about the theory of programming is not a programming question?
“Cannot find REALITY.SYS...Universe Halted.” ~ God on phone with Microsoft Customer Support
Richard Andrew x64 wrote:
So asking about the theory of programming is not a programming question?
Yes, initially Chris started the Lounge for exactly that, talking about technology and generic programming theories but definitely not language specific questions that fit into the other forums.
Michael Martin Australia "I controlled my laughter and simple said "No,I am very busy,so I can't write any code for you". The moment they heard this all the smiling face turned into a sad looking face and one of them farted. So I had to leave the place as soon as possible." - Mr.Prakash One Fine Saturday. 24/04/2004
-
Note: I'm not an expert in the area so if I misunderstand, please explain why. --- I'm looking to compile a PHP application into byte code, but I've always understood byte code to be completely reversible, whereas machine code is virtually impossible to reverse. A low-level JMP could have been a DO, IF, or SWITCH. Whether it was or not would require looking at the reverse-engineered code and trying to determine that. Byte code is not as low level, but it's not so abstract that you have constructs like IF, SWITCH, LOOP, etc., is it? Does it still use a low-level JMP command? I know the 8086 MOV instruction has like 40 or 50 different operations; it's that low-levelness that makes decompiling from machine code to source so difficult. Byte code though, how low level is it? When a C++ program is compiled, all the tools of the language such as consts, access control, interfaces, etc. are probably wiped clean, correct? Does the same apply to byte code? What about classes? Does byte code represent a source file as objects, or are those objects expanded into globals (functions and data)? How are objects converted into machine code or byte code? Do class names persist, or are they replaced with mangled names or integer offsets? Would converting objects into inlined code create problems due to polymorphism? I can't really think of much else to say right now... but I'm sure I'll have more to ask once I get a few replies. :)
I'm finding the only constant in software development is change it self.
what vm would be "running" the app that you want to "compile" ? the features you are asking about would be a feature of the vm imo and what it was capable of doing with regards to run time support of language features on a side note... i write and teach php for a living and i'm absolutely fascinated as to the reasons why you would want to "compile" it into some kind of (presumably proprietary) byte code and what you hope to achieve by doing so? :)
"mostly watching the human race is like watching dogs watch tv ... they see the pictures move but the meaning escapes them"
modified on Monday, October 6, 2008 12:22 AM
-
Why do you want to "compile" PHP code at all? Bytecode is usually more of a C#/Java thing. It's designed to run on a (usually hardware-independent) platform (JVM, CLR, etc.). (That said, "compiled" PHP code would allow for compile-time checking of errors. :-) )
-
This looks like a programming question. Did you not see the "NO PROGRAMMING QUESTIONS IN THE LOUNGE" admonition?
“Cannot find REALITY.SYS...Universe Halted.” ~ God on phone with Microsoft Customer Support
Balanced the low votes on your post. However, I agree with you only partially. It isn't a programming question as such, but it would have been better placed in the general discussions or design/architecture forum.
Many are stubborn in pursuit of the path they have chosen, few in pursuit of the goal - Friedrich Nietzsche .·´¯`·->Rajesh<-·´¯`·. [Microsoft MVP - Visual C++]
-
If it can be run by a processor, it can be reverse engineered. There's absolutely no such thing as impossible, just varying degrees of difficulty. Native unmanaged code has been reverse engineered since probably about 10 days after the first commercial application was released. Of course, these days few people are releasing new software with revolutionary, patent-worthy algorithms in it, so I suspect it's done far less for industrial reasons and more for license circumvention than anything else. In the case of .net code, obfuscation is critical because pretty much everything persists in the byte code; you can easily turn the assembly into fully compilable source code with one click. Obfuscation makes the resulting code a royal pain to work with when converted back to a .net language. In many cases a good obfuscator causes decompilers to have seizures. A good obfuscator makes it pretty much impossible to work with the code at all: it is definitely impossible to build an application from it without a hell of a lot of hand renaming and coding, and extremely difficult to find specific algorithms and areas of the code and understand them, though not impossible if you have a *lot* of time on your hands, essentially far more time than it would take to just write the program yourself if it's of any size at all. I see there are some Java bytecode obfuscators on the market which leads me to believe the experience is similar to .net.
"It's so simple to be wise. Just think of something stupid to say and then don't say it." -Sam Levenson
-
Generally speaking the only way that byte code will reveal the detail you want, is if the information is taken from the source and saved in the object file (as it is with most 'debug' versions produced by language compilers). I don't know anything about PHP or what you are compiling it into, but suggest you look at the documentation of the compiler as a first pass. If you are trying to reverse engineer an executable into something better than pure machine code, then the same rule applies - if the information about class names etc. is not in the exe file then there is no way to guess what these names may be. And there is rarely any way to figure out things like constness because that does not exist in machine-code land, it's purely a mechanism for the compiler to add some validity checks to your source. Hope this helps...
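To make the "is the information saved in the object file?" point concrete, here is a minimal sketch using CPython's byte code as a stand-in. This is an assumption for illustration only: PHP encoders such as ionCube or Zend may keep more or less than this, and the `Invoice` class and helper names are hypothetical. It compiles a small class and lists the identifiers that survive inside the compiled code objects.

```python
import types

# Hypothetical source used only for illustration.
SOURCE = '''
class Invoice:
    def total(self, prices):
        running_sum = 0
        for price in prices:
            running_sum += price
        return running_sum
'''

# Compile to a code object; nothing is executed here.
module_code = compile(SOURCE, "<example>", "exec")


def iter_code_objects(code):
    """Yield a code object and every code object nested in its constants."""
    yield code
    for const in code.co_consts:
        if isinstance(const, types.CodeType):
            yield from iter_code_objects(const)


def surviving_names(code):
    """Collect every identifier stored in the compiled code objects."""
    names = set()
    for obj in iter_code_objects(code):
        names.add(obj.co_name)          # name of the class/function body
        names.update(obj.co_names)      # globals, attributes, class members
        names.update(obj.co_varnames)   # parameters and local variables
    return names


print(sorted(surviving_names(module_code)))
```

In this particular byte code format the class name, the method name, and even the local variable names are all still present, which is why a decompiler can give readable output. A format that stored only integer offsets would not allow that.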
-
I'm from the old school where we used to code in assembly language. Machine language is completely reversible, just a little harder. How do you think hackers go about removing the copy protection from commercial software? Or virus-killers go about understanding what a virus does? The determined will always be able to figure out what your app is doing.
-
Hey... some people relax thinking in byte code... others binary... *shrug* But to put out an answer to the general question: byte code is still pretty low-level stuff, similar to machine code except that it's designed for a virtual machine instead of an actual physical processor family. It's still a binary file, so don't expect to go in and read it like a C program or anything. Byte code is generally easier to decompile, though not necessarily back to the original code, just to something functionally equivalent. Byte code also requires an appropriate interpreter to run it. Now back to dreaming in zeros, ones and the occasional two...
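On the original question of whether byte code still uses a low-level jump: a quick way to see it, assuming CPython's byte code is a fair stand-in (PHP's opcodes are different, and the two functions below are made up for the demonstration), is to disassemble an `if` and a `while` and look for jump instructions.

```python
import dis


def classify(n):
    # A plain IF in the source.
    if n > 0:
        return "positive"
    return "non-positive"


def count_down(n):
    # A plain loop in the source.
    while n > 0:
        n -= 1
    return n


def jump_opcodes(func):
    """Return the names of all jump instructions in a function's byte code."""
    return [ins.opname for ins in dis.get_instructions(func)
            if "JUMP" in ins.opname]


# Exact opcode names differ between CPython versions, so none are listed here.
print("if   ->", jump_opcodes(classify))
print("loop ->", jump_opcodes(count_down))
```

Both constructs come out as conditional or unconditional jumps. The IF and the WHILE are gone from the instruction stream, so a decompiler has to infer them from the jump pattern, much as it would with machine code.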
-
Richard Andrew x64 wrote:
This looks like a programming question.
Put your glasses on and have another look. Hockey is not asking for code or how to do something specific in code. He is asking about how bytecode and machine language differ in how they store instructions. Beyond my knowledge but not a programming question.
Michael Martin Australia "I controlled my laughter and simple said "No,I am very busy,so I can't write any code for you". The moment they heard this all the smiling face turned into a sad looking face and one of them farted. So I had to leave the place as soon as possible." - Mr.Prakash One Fine Saturday. 24/04/2004
-
Machine code is a binary language which only the machine can understand properly. It may be different for different hardware (basically processors) and operating systems. This means that the binary language is machine specific. Byte code, on the other hand, is a sort of intermediate language. It is not in machine code form, but something between machine language and the high-level language. It is not machine specific. It is compiled on the fly, depending on the machine specifications, and the resulting machine code is processed by your processor. How the byte code is converted to machine language depends on the program. For example, a Java program is stored as byte code, so when we want to run the program, the Java compiler compiles it on the fly into machine code and runs it for you. This makes a few differences very obvious:
1. Machine code is hard to decompile just because it is pure zeros and ones. It is a set of hardware instructions. Byte code, by contrast, can easily be decompiled since it is not machine specific, nor is it in a pure zeros-and-ones format.
2. Machine code is not portable, because each processor make or operating system may interpret the instructions differently. Byte code, on the other hand, is portable. It will always run irrespective of the environment or hardware, as long as an interpreter is available to convert it into machine-specific code.
Hope this helps, Pradeep :)
-
Sweet Holy Mother! How are you up and typing Mick Martin? You can only have been asleep 5 hours?
martin_hughes wrote:
Sweet Holy Mother! How are you up and typing Mick Martin? You can only have been asleep 5 hours?
:laugh: That's a good night's sleep for some programmers, depending on their project.
-
John C wrote:
I see there are some Java bytecode obfuscators on the market which leads me to believe the experience is similar to .net.
C-Shroud from Gimpel Software was a popular obfuscator in the early '90s. It worked by obfuscating the C code, which then compiled to machine code that was completely incomprehensible. I mention this simply to demonstrate that the problem and solutions existed even before .NET and Java VMs came on the scene.
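The source-level approach described here can be sketched in a few lines. This is a toy, not how C-Shroud or any commercial obfuscator works: it only renames parameters and assigned local names in a Python snippet, it assumes Python 3.9+ for `ast.unparse`, and `obfuscate` and `invoice_total` are hypothetical names.

```python
import ast

# Hypothetical source used only for illustration.
SOURCE = '''
def invoice_total(prices, tax_rate):
    subtotal = 0
    for price in prices:
        subtotal += price
    return subtotal * (1 + tax_rate)
'''


def obfuscate(source):
    """Rename parameters and assigned local names to meaningless ones."""
    tree = ast.parse(source)

    # Pass 1: collect parameter names and every name that is assigned to.
    local_names = []
    for node in ast.walk(tree):
        if isinstance(node, ast.arg):
            name = node.arg
        elif isinstance(node, ast.Name) and isinstance(node.ctx, ast.Store):
            name = node.id
        else:
            continue
        if name not in local_names:
            local_names.append(name)
    mapping = {name: f"v{i}" for i, name in enumerate(local_names)}

    # Pass 2: rewrite every occurrence of those names.
    for node in ast.walk(tree):
        if isinstance(node, ast.arg):
            node.arg = mapping[node.arg]
        elif isinstance(node, ast.Name) and node.id in mapping:
            node.id = mapping[node.id]

    return ast.unparse(tree)


obfuscated = obfuscate(SOURCE)
print(obfuscated)
```

The function name is left alone so callers keep working, and calls that pass arguments by keyword would break, so a real tool has to do far more. The point is only that the behaviour is unchanged while the names that explain the design are gone, whatever the code is compiled to afterwards.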
-
Pradeep, I have to respond a bit to your statement. The only difference between "byte code" and "machine code" is that byte code runs on a virtual machine, and machine code runs on a physical machine. Byte code for one virtual machine (e.g. the Java Runtime) is not portable to another virtual machine (e.g. the .NET CLR). Only the fact that virtual machines are commonly written for multiple architectures makes byte code portable. Machine code is just as "portable" to any number of software emulators, which is why it is common practice for embedded software to be initially developed on a workstation using an emulator: say, the code for an instrument monitor that uses a Motorola 68020 as its CPU might actually be developed on a Windows PC that natively runs on a Core 2 Duo from Intel.
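To illustrate how little separates the two, here is a minimal sketch of a virtual machine. The five-instruction set (`PUSH`, `ADD`, `DUP`, `JNZ`, `HALT`) is invented for this example and does not correspond to the JVM, the CLR, or any real CPU.

```python
def run(program):
    """Execute a list of (opcode, argument) pairs on a tiny stack machine."""
    stack = []
    pc = 0  # program counter: index of the next instruction
    while pc < len(program):
        op, arg = program[pc]
        pc += 1
        if op == "PUSH":
            stack.append(arg)
        elif op == "ADD":
            right = stack.pop()
            left = stack.pop()
            stack.append(left + right)
        elif op == "DUP":
            stack.append(stack[-1])
        elif op == "JNZ":
            # Pop the top value and jump to `arg` if it is non-zero.
            if stack.pop() != 0:
                pc = arg
        elif op == "HALT":
            break
        else:
            raise ValueError(f"unknown opcode: {op}")
    return stack


# Counts 3 down to 0. The source could have been a while, for, or do loop;
# all that is left is a conditional jump back to instruction 1.
COUNTDOWN = [
    ("PUSH", 3),    # 0: counter = 3
    ("PUSH", -1),   # 1: loop start
    ("ADD", None),  # 2: counter = counter - 1
    ("DUP", None),  # 3: copy counter for the test
    ("JNZ", 1),     # 4: if counter != 0, go back to 1
    ("HALT", None), # 5
]

print(run(COUNTDOWN))
```

The `run` function is an ordinary program. Port it to another architecture and `COUNTDOWN` runs there unchanged, which is all that byte code portability amounts to. An emulator for a physical CPU has the same shape with a larger instruction set.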
-
Yeah, but he'd been drinking heavily.
Ahoy! Martin Hughes
-
martin_hughes wrote:
Yeah, but he'd been drinking heavily.
In that case, I certainly hope he's not operating machinery heavier than his keyboard :omg:
-
l a u r e n wrote:
what vm would be "running" the app that you want to "compile" ?
ionCube and Zend Encoder are two options... PHPA is the open source version. ionCube and Zend both have encoders/compilers that take your source and compile it into byte code. These encoded files are then uploaded to your server, but they require the proprietary decoder extensions to be installed on the server in order to decode/decrypt and execute them, basically skipping the tokenizing/parsing stages and going straight to execution. PHP Accelerator/APC are basically just extensions that hook into the parsing/execution phase and cache the resulting byte code for a given request. The next time that request is made, the cached byte code is used and tokenization/parsing is skipped. The former can protect your code (in a limited manner, after reading more about byte code: it's basically binary code that keeps all the high-level constructs like classes, etc.) and speed it up under most circumstances. The latter will really only speed it up.
l a u r e n wrote:
on a side note... i write and teach php for a living and i'm absolutely fascinated as to the reasons why you would want to "compile" it into some kind of (presumably proprietary) byte code and what you hope to achieve by doing so?
I'm an architecture geek with a custom-developed framework. The framework is a compilation of good ideas borrowed from every single framework I could find on Google (about 9 months of research, study, and trial and error). I don't want my ideas ending up in Zend or Symfony or CakePHP, or even worse, in my competitors' products. My application will be a SaaS hosted application; however, I wouldn't mind giving it away for free as a marketing channel (as I have no idea how else I'm going to get people using it). Hence the compilation.
I'm finding the only constant in software development is change it self.
-
John C wrote:
There's absolutely no such thing as impossible,
Didn't I say virtually impossible? :P I realize that reverse engineering byte code or machine code isn't impossible, but if it were that easy someone would have reverse engineered Windows or Adobe Photoshop and there would be Open Source versions available. :P The point was, that machine code is significantly harder to reverse engineer. Byte code as I now understand is quite high level.
I'm finding the only constant in software development is change it self.
-
I know it's not impossible... but removing copy protection is not the same as completely reverse engineering the application. The architecture is what I am interested in protecting, not the implementation.
I'm finding the only constant in software development is change it self.
-
Yes, this is exactly what I was trying to explain :) The virtual machine is the actual program in machine language, just like any other traditional C/C++ program. For example, the Java Runtime is the actual program in machine code. The Java runtime (virtual machine) bears the same relationship to a Java program (in byte code) as a word processor bears to Word documents. Pradeep