Compiler and machine code

Yutaka Masuda

February 2020

The descriptions in this chapter are greatly simplified. For a solid understanding of this topic, please consult textbooks or articles written by computer scientists and engineers.

Code

Machine code

You may have wondered what a computer program looks like. Linux has many commands, and each of them is a computer program. For example, ls is a typical command. A ready-to-run program takes the form of a file called an executable (or binary).

In a typical Linux system, ls is an executable file stored in /bin. Using the less command, we can look at the content of the ls executable. The following shows the first few characters that less displays on the screen.

less /bin/ls

^?ELF^B^A^A^@^@^@^@^@^@^@^@^@^B^@>^@^A^@^@^@<D4>B@^@^@^@^@^@^@^@^@^@^@^@^@<F0><C3>^A^@^@

It looks like nonsense to us, but it makes sense to a computer. The file describes instructions for the machine, and the computer can read and interpret them. This machine-readable sequence is machine code, and it is the only form of program that a processor can execute directly.
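As a small check of what less displayed, the first bytes of the file can be printed as hexadecimal numbers. The sketch below assumes a Linux system where /bin/ls exists and where head and od are available; the variable names exe and magic are only illustrative. On such a system the four bytes are expected to be 7f 45 4c 46, which correspond to the characters ^?ELF shown above and identify the file format rather than encode an instruction.

```shell
#!/bin/sh
# Path to the ls executable; /bin/ls is assumed, as in the text above.
exe=/bin/ls

# Read the first four bytes and print them as hexadecimal digits.
# od -An drops the offset column, -tx1 prints one byte per field,
# and tr removes the spaces and the newline between the fields.
magic=$(head -c 4 "$exe" | od -An -tx1 | tr -d ' \n')
echo "first four bytes: $magic"

# 7f 45 4c 46 is the byte 0x7f followed by the letters E, L, F.
if [ "$magic" = "7f454c46" ]; then
    echo "$exe is an ELF executable"
else
    echo "$exe is not in ELF format (this is expected outside Linux)"
fi
```

On other operating systems the executable format differs, so the second message would be printed instead.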

Assembly language

In the early days of computing, users had to write machine code directly to make a computer work. The code was too primitive, too cryptic, and too time-consuming to learn for most people. A more human-readable notation was then introduced. It simply replaces each machine instruction with a short, meaningful name while keeping the same structure as the machine code. This notation is called assembly language. You can see the assembly code of ls using the objdump command with the -d option.

objdump -d /bin/ls

/bin/ls:     file format elf64-x86-64


Disassembly of section .init:

0000000000402150 <_init@@Base>:
  402150:       48 83 ec 08             sub    $0x8,%rsp
  402154:       48 8b 05 6d 8e 21 00    mov    0x218e6d(%rip),%rax        # 61afc8 <__gmon_start__>
  40215b:       48 85 c0                test   %rax,%rax
  40215e:       74 05                   je     402165 <_init@@Base+0x15>
  402160:       e8 23 07 00 00          callq  402888 <__sprintf_chk@plt+0x18>
  402165:       48 83 c4 08             add    $0x8,%rsp
  402169:       c3                      retq

(continues)

It is still cryptic but more readable. Text such as sub $0x8,%rsp and mov 0x218e6d(%rip),%rax represents primitive operations, each of which corresponds to a particular machine instruction. A tool called an assembler converts assembly code to machine code. Once you have an assembler, you write assembly code and then convert it to an executable. Writing assembly code is still challenging because it is low-level, meaning it is close to the machine code.

Nowadays, some developers still use assembly language for software that must control a machine precisely (like operating systems and graphics drivers). Assembly language is also used in speed-critical parts of software. However, it is rarely used in everyday software development.

Compiled programming languages

A programming language is designed to let people focus on describing the tasks that they want to perform. Code written in a programming language should be human-readable. A compiler, which is itself a piece of software, translates code written in a programming language into machine code. The programmer prepares a text file containing the code (the source file), runs the compiler, and obtains an executable. A programming language that is meant to be used with a compiler is called a compiled language. Typical compiled languages include Fortran, C, C++, C#, Java, Go, Rust, Swift, Julia, and so on. (C#, Java, and Julia are usually compiled to an intermediate form or at run time rather than directly to a standalone executable, but they still rely on a compiler.)

A compiled language produces an executable, which generally runs fast. Although these languages are powerful enough to develop any software, the programmer often has to write a lot of code because much of it must be written from scratch. You may also find restrictions or inflexibilities in a language because it may have been designed to simplify compilation or to suit specific purposes.

Scripting languages

Some programming languages do not need a separate compilation step, and no executable is generated. Instead, a program called an interpreter reads the statements one by one and carries out each of them, typically by invoking pre-compiled routines that do the actual work. Such languages are called scripting languages. A scripting language gives the programmer access to pre-built, optimized, higher-level functions that may not be readily available in compiled languages. It also makes typical tasks easier to write. Scripting languages include R, Python, Ruby, Perl, PHP, awk, shell scripts, and so on.

A scripting language is fast enough when the task is typical and well supported by pre-built functions. However, when the programmer has to implement a task without them, execution becomes slow. Although computers have become more powerful and this disadvantage has shrunk, you may still run into it in data processing and numerical computing. Most scripting languages can call custom programs written in compiled languages to avoid the speed penalty. This practice is common in R and Python when performance matters, but the programmer then has to master at least two different languages (the two-language problem).
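The difference between calling a pre-built program and doing the work statement by statement can be sketched with a shell script, which is itself a scripting language. The example below is only an illustration: the sample data (1000 lines of the form "record N") and the variable names are made up. It counts the lines of a file twice, first with the compiled program wc and then with a loop interpreted by the shell.

```shell
#!/bin/sh
# Build a small sample file (hypothetical data, 1000 lines).
datafile=$(mktemp)
i=1
while [ "$i" -le 1000 ]; do
    echo "record $i"
    i=$((i + 1))
done > "$datafile"

# (1) Use a pre-built program: wc is a compiled executable.
#     tr strips the padding that some versions of wc add.
count_wc=$(wc -l < "$datafile" | tr -d ' ')

# (2) Do the same job in the shell itself, one statement at a time.
count_loop=0
while IFS= read -r line; do
    count_loop=$((count_loop + 1))
done < "$datafile"

echo "wc: $count_wc, loop: $count_loop"
rm -f "$datafile"
```

Both approaches give the same count. With a much larger file, wrapping each approach in the time command would likely show the interpreted loop taking noticeably longer, although the exact ratio depends on the shell and the machine.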

Roughly speaking, compiled languages are better in performance (speed, resource usage, precise control of the machine) but take more time in development. Scripting languages are more productive for the tasks they are designed for but are slow when executing out-of-scope tasks or general algorithms.

Compiling and linking

Process of generating executables

Here is a typical process of development in compiled languages.

Coding: Write a program as a text file.
Compiling: Compile it to an executable using a compiler.
Testing: Run the executable to see if it works as expected.

For the first step, you learn the syntax of a programming language and write the program with a text editor. The last step is performed in a terminal or another computing environment. Let us look into the second step here.

The figure shows a typical conversion process from a source file to an executable.

A compiler translates code written in a programming language into machine code. Whereas a source file stores the original program, an object file holds the translated code. The object file alone is not a complete executable. To generate the executable, another piece of software, the linker, combines all the object files with the pre-compiled routines needed to run on the operating system (the run-time library). The run-time library comes with the compiler, so the user is usually not aware of it.

Most popular compilers do both jobs with a single command. The compiler compiles the source file and then quietly calls the linker to generate the executable. So, in many cases, the user does not have to run a linker directly, and the whole process, including linking, is simply called compilation.
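To look at what the compiler produces on the way to an object file, GCC-based compilers such as gfortran accept the -S option, which stops after translating the source into assembly code. The following is a minimal sketch, assuming gfortran is installed; the temporary directory and the variable workdir are illustrative choices, and the source is the same three-line program that appears in the next section.

```shell
#!/bin/sh
# Work in a scratch directory so that no existing file is overwritten.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Write a tiny Fortran source file.
cat > prog.f90 <<'EOF'
program prog
print *,"okay"
end program prog
EOF

if command -v gfortran >/dev/null 2>&1; then
    # -S: translate to assembly only; the result is written to prog.s.
    gfortran -S prog.f90
    # Show the first few lines of the generated assembly code.
    head -n 5 prog.s
else
    echo "gfortran is not installed; skipping"
fi
```

The content of prog.s depends on the compiler version and the processor, so it will not look the same on every machine.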

Example of file compilation

Here we look at an actual command to compile a source file using GFortran. Suppose we have a source file prog.f90 in my repository. By convention, the name of a Fortran source file ends with .f90. Some people use .f95 for Fortran 95, .f03 for Fortran 2003, and so on. However, we use .f90 for any program written in Fortran 90 or later, as many people follow this convention.

The file contains the following 3 lines.

program prog
print *,"okay"
end program prog

The following command compiles this file.

gfortran prog.f90

It generates an executable, named a.out by default. An object file is generated temporarily, but the compiler removes it once the executable is ready. You can split the process into compilation and linking using the following 2 commands.

gfortran -c prog.f90
gfortran prog.o

The first command converts the source file to an object file, prog.o. The second command calls a linker to create the executable, a.out (to be specific, gfortran calls the linker, ld). Usually, you do not have to run the compiler and the linker separately to compile a single file.
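The two steps can also be run in one script to confirm which files each step creates. This is a sketch under a few assumptions: gfortran is installed, a scratch directory is acceptable, and the executable is named prog with the -o option instead of the default a.out. The names workdir and result are illustrative.

```shell
#!/bin/sh
# Work in a scratch directory so that the generated files are easy to see.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# The three-line program from this section.
cat > prog.f90 <<'EOF'
program prog
print *,"okay"
end program prog
EOF

if command -v gfortran >/dev/null 2>&1; then
    gfortran -c prog.f90       # compile only: creates prog.o
    gfortran -o prog prog.o    # link: creates the executable prog
    ls                         # shows prog, prog.f90, and prog.o
    result=$(./prog)           # run the executable and capture its output
    echo "program printed: $result"
else
    echo "gfortran is not installed; skipping"
fi
```

Because the object file was created explicitly with -c, it stays in the directory after linking, unlike the temporary object file of the one-step command.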
