Compiler and machine code
February 2020
The descriptions in this chapter are greatly simplified. To obtain a solid understanding of this topic, please read some textbooks or popular articles written by computer scientists and engineers.
Code
Machine code
You may have wondered what a computer program looks like. In Linux, there are many commands, and those are computer programs. For example, ls is a typical command, and it is a computer program. A ready-to-run program takes the form of a file called an executable (or binary).
In a typical Linux system, ls itself is an executable stored in /bin. Using the less command, we can look at the content of the ls executable. The following are the first few characters shown on the screen by less.
less /bin/ls
^?ELF^B^A^A^@^@^@^@^@^@^@^@^@^B^@>^@^A^@^@^@<D4>B@^@^@^@^@^@^@^@^@^@^@^@^@<F0><C3>^A^@^@
It looks like nonsense to us, but it is meaningful to a computer. Among other data, the file holds the instructions for the machine in a form that the processor can read and interpret directly. This machine-readable sequence is the machine code, and it is the only form of program that a processor can execute directly.
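To look at only the leading bytes without the terminal noise, one option is the od command. The following is a minimal sketch, assuming a Linux system where /bin/ls is an ELF executable as in the less output above; on other systems the bytes will differ.

```shell
#!/bin/sh
# Assumption: Linux, where /bin/ls is an ELF executable.
# od -c prints bytes as characters (non-printable ones as octal numbers);
# -An hides the offset column and -N 4 stops after the first 4 bytes.
magic=$(od -An -c -N 4 /bin/ls | tr -s ' ')
echo "$magic"
```

On such a system, this should show the octal value 177 (the byte that less displays as ^?) followed by the letters E, L, and F, matching the start of the listing above.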
Assembly language
In the early days of computing, users had to write machine code directly to make a computer work. The code was too primitive, too cryptic, and too time-consuming to learn for most people. A more human-readable notation was then introduced. It replaces each machine instruction with a short mnemonic while keeping the same structure as the machine code. This notation is called assembly language. You can see the assembly code of ls using the objdump command with the -d option.
objdump -d /bin/ls
/bin/ls: file format elf64-x86-64
Disassembly of section .init:
0000000000402150 <_init@@Base>:
402150: 48 83 ec 08 sub $0x8,%rsp
402154: 48 8b 05 6d 8e 21 00 mov 0x218e6d(%rip),%rax # 61afc8 <__gmon_start__>
40215b: 48 85 c0 test %rax,%rax
40215e: 74 05 je 402165 <_init@@Base+0x15>
402160: e8 23 07 00 00 callq 402888 <__sprintf_chk@plt+0x18>
402165: 48 83 c4 08 add $0x8,%rsp
402169: c3 retq
(continues)
It is still cryptic but more readable. Text such as sub $0x8,%rsp and mov 0x218e6d(%rip),%rax denotes primitive operations, each of which corresponds to a particular machine instruction. The hexadecimal bytes to the left of each one (for example, 48 83 ec 08) are the machine code itself. A tool that converts assembly code to machine code is called an assembler. Once you have an assembler, you can write assembly code and convert it to an executable. Writing assembly code is challenging because it is still low-level, meaning that it is close to the machine code.
Nowadays, some developers still use assembly language for particular software that must precisely control a machine (like operating systems and graphics drivers). Assembly language can also be used in speed-critical software. However, it is rarely used in ordinary software development.
Compiled programming languages
A programming language is designed to let people focus on describing the tasks that they want to perform. The code written in a programming language should be human-readable. A compiler, which is itself a piece of software, translates the code written in a programming language to machine code. The programmer prepares a text file containing the code (the source file), runs the compiler, and obtains an executable. A programming language that is meant to be used with a compiler is called a compiled language. Typical compiled languages include Fortran, C, C++, Go, Rust, and Swift. Languages such as C#, Java, and Julia are also compiled, although they typically compile to an intermediate code or compile at run time rather than producing a stand-alone executable in advance.
A compiled language produces an executable, which generally runs fast. Although these languages are powerful enough to develop any software, the programmer often has to write a lot of code to achieve the goal because much of it must be written from scratch. You may also find some restrictions or inflexibilities in the language, because it may have been designed to simplify the compiling process or to suit specific purposes.
Scripting languages
Some programming languages do not need a compiler, and no executable is generated from the source file. Instead, a program called an interpreter reads the statements one by one, and each statement invokes pre-compiled code to do the task. Such languages are called scripting languages. A scripting language lets the programmer use pre-built, optimized, higher-level functions that may not be readily available in compiled languages. It also makes typical tasks easier to write. Scripting languages include R, Python, Ruby, Perl, PHP, awk, shell scripts, and so on.
A scripting language is fast enough when the task is typical and well-defined. However, when the programmer tries to perform a task without pre-built functions, the operation becomes slow. Although computers have become more powerful and this disadvantage has shrunk, you may still encounter it in data processing and numerical computing. Most scripting languages can call custom programs written in compiled languages to eliminate the speed penalty. This practice is common in R and Python when performance matters, but the programmer then has to master at least two different languages (the two-language problem).
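As a small illustration of this trade-off, here is a sketch of a shell script, one of the scripting languages listed above. The numbers and variable names are made up for the example. The first part delegates the work to existing programs (sort and head), while the second part spells out the same task as a loop that the shell has to interpret step by step.

```shell
#!/bin/sh
# Each statement is interpreted in turn; no executable is generated.

# Style 1: delegate to pre-compiled programs.
# sort -n orders the numbers numerically and head -n 1 keeps the first one.
smallest=$(printf '%s\n' 30 4 17 | sort -n | head -n 1)
echo "smallest: $smallest"

# Style 2: the same task written by hand. The shell interprets every
# iteration itself, which is the slow path for large inputs.
min=""
for n in 30 4 17; do
  if [ -z "$min" ] || [ "$n" -lt "$min" ]; then
    min=$n
  fi
done
echo "smallest (loop): $min"
```

Both styles find 4 here. With three numbers the difference in speed is invisible, but with millions of values the hand-written loop would be expected to fall far behind the pre-built tools.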
Roughly speaking, compiled languages are better in performance (speed, resource usage, precise control of the machine), but development takes more time. Scripting languages are more productive for the tasks they target, but they are slow when executing out-of-scope tasks or general algorithms.
Compiling and linking
How executables are generated
Here is a typical process of development in compiled languages.
- Coding: Write a program as a text file.
- Compiling: Compile it to an executable using a compiler.
- Testing: Run the executable to see if it works as expected.
For the first step, you are going to learn the syntax of a programming language, and you write the program with a text editor. The last step is performed in a terminal or some other computing environment. Let us look into the second step here.
The figure shows a typical conversion process from a source file to an executable.
A compiler translates the code in a programming language to machine code. Whereas a source file stores the original program, an object file holds the translated code. The object file is not yet a complete executable. To generate the executable, another piece of software, the linker, combines all the object files with the pre-compiled programs required to run on the operating system (the run-time library). The run-time library comes with the compiler, so the user is usually not aware of it.
Most popular compilers do both jobs in one step. The compiler compiles the source file, then calls the linker behind the scenes to generate the executable. So, in many cases, the user does not have to run a linker directly, and the whole process, including the link, is simply called compilation.
Example of file compilation
Here we look at an actual command to compile a source file using GFortran. Suppose we have a source file prog.f90 in my repository. By convention, the source file has .f90 at the end of the file name. Some people use .f95 for Fortran 95, .f03 for Fortran 2003, and so on. However, we use .f90 for any program written in Fortran 90 or later, as many people follow this convention.
The file contains the following 3 lines.
program prog
print *,"okay"
end program prog
The following command compiles this file.
gfortran prog.f90
It generates an executable, named a.out by default. The object file is generated temporarily, and the compiler removes it once the executable is ready. You can split the process into compilation and linking using the following 2 commands.
gfortran -c prog.f90
gfortran prog.o
The first command converts the source file to an object file, prog.o. The second command calls a linker to create the executable, a.out (to be specific, gfortran calls the linker, ld). Usually, you do not have to run the compiler and the linker separately to compile a single file.
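The two-step process can be tried end to end with a short script. This is only a sketch. It assumes that gfortran is installed and on the PATH, it works in a temporary directory created with mktemp (an arbitrary choice for this example), and it recreates the 3-line prog.f90 from the text.

```shell
#!/bin/sh
# Sketch only: assumes gfortran is installed; otherwise it just reports that.
workdir=$(mktemp -d)          # scratch directory, left in place for inspection
cd "$workdir" || exit 1

# Recreate the 3-line source file shown in the text.
cat > prog.f90 <<'EOF'
program prog
   print *,"okay"
end program prog
EOF

if command -v gfortran >/dev/null 2>&1; then
  gfortran -c prog.f90        # compile only: leaves the object file prog.o
  gfortran prog.o             # link: produces the executable a.out
  ls                          # a.out, prog.f90, and prog.o should be listed
  result=$(./a.out)           # run the executable and capture what it prints
else
  result="gfortran not found"
fi
echo "$result"
```

If gfortran is available, the last line should show the word okay printed by the program. Unlike the single-command form, this route leaves prog.o on disk.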