Numerical expression
February 2020
Bit and byte
Whereas people use the decimal system, a computer uses the binary system. The decimal system has 10 digits (0 to 9), and the binary system has only two (0 and 1). So, in the binary system, \(1+1=10\) with a carry. A bit is the unit that holds one of these two values.
Most computers in use today treat 8 bits as the smallest unit of operation. A byte is a unit of 8 bits, so 1 byte of data can represent exactly \(2^8=256\) different values. As shown later in this chapter, the default integer number in Fortran uses 4 bytes (32 bits).
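As a quick check of these sizes, the following sketch asks the compiler how many bits its default integer uses. The intrinsic function bit_size() is standard Fortran, but the value it reports depends on the compiler; 32 is what you can expect from GFortran and Intel Fortran with default settings. The program name bits is chosen only for this illustration.

```fortran
program bits
    print *,"values in 1 byte =", 2**8                  ! 2 to the power of 8
    print *,"bits in default integer =", bit_size(0)    ! typically 32
end program bits
```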
Literals and simple computations
Numeric and character literals
Numbers
Fortran can do any simple arithmetic. A number is expressed as a numeric literal, which is simply a sequence of digits with some symbols. Fortran supports integer numbers, real numbers written with a decimal point or in the exponential style, and complex numbers (which we do not deal with in this tutorial). See the following table for examples.
Literal | Meaning | Comments |
---|---|---|
100 | \(100\) | integer number |
+100 | \(100\) | integer number |
-1 | \(-1\) | integer number |
3.14 | \(3.14\) | real number |
-99. | \(-99.0\) | real number |
1e6 | \(1\times10^{6}\) | real number |
6.02e23 | \(6.02\times 10^{23}\) | real number |
314e-2 | \(314\times 10^{-2}\) | real number |
In Fortran, the type of number (integer vs. real) is strictly differentiated.
- If the literal has only digits and optionally a sign (\(+\) or \(-\)), it is an integer.
- If it has a decimal point (`.`) or an exponent identifier (`e` or `E`) in addition to the digits and the sign, it is a real number.
The numeric literal can also be used in `print`.
program num
print *,100,-1,3.14,6.02e23
end program num
100 -1 3.14000010 6.02000017E+23
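Note that the output format depends on the type and the magnitude of the value. By default, the output is padded with enough blanks to show any number of that type. The extra digits in 3.14000010 come from the limited precision of real numbers, which is discussed later in this chapter.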
Mix of numeric and character literals
You can mix numeric and character literals in `print`.
program num
print *,"Avogadro constant=",6.02e23
end program num
Avogadro constant= 6.02000017E+23
This is useful for labeling the numeric values in the output.
Summary
- Fortran holds both integer and real numbers; the two are differentiated.
- The `print` statement pads the output of numeric literals with blanks.
- The numeric and character literals can coexist in `print`.
Exercises
- Print 3.14 with the message “pi=”.
- Print some extremely huge (or tiny) integer literals, and see how big (small) values Fortran can hold.
- Repeat as above but for real literals.
Computations
Operators
Fortran has several operators for numeric values. The operators are very similar to the ones used in mathematics.
Operator | Meaning | Example |
---|---|---|
`+` | addition | 1+2 |
`-` | subtraction | 1-3 |
`*` | multiplication | 2*3 |
`/` | division | 6/2 |
`**` | power | 10**3 (\(10^3\)) |
`(` and `)` | change the priority of operators | (1+3)*2 |
By default, the priority of operators follows the arithmetic rule (`**` > `*` and `/` > `+` and `-`). Operators with the same priority are evaluated from left to right, except for `**`, which is evaluated from right to left. The parentheses change the priority. You can use as many operators as you need in a formula such as `1+((5+4/2)**6)/2-8/4+2`.
program num
print *,1+((5+4/2)**6)/2-8/4+2
end program num
58825
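The priority rules can be checked with a few short formulas. This is a minimal sketch (the program name prec is arbitrary); the comments give the value each line produces under the standard Fortran rules. One detail that is easy to miss is that a chain of `**` is grouped from the right, and that `**` binds tighter than a leading minus sign.

```fortran
program prec
    print *,2+3*4        ! 14: * is evaluated before +
    print *,(2+3)*4      ! 20: parentheses are evaluated first
    print *,8/4/2        ! 1: same as (8/4)/2
    print *,2**3**2      ! 512: same as 2**(3**2)
    print *,(2**3)**2    ! 64
    print *,-2**2        ! -4: same as -(2**2)
end program prec
```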
Type of result
Some arithmetic rules in Fortran are not intuitive because the type of value is strictly considered. The following rules are applied.
- A formula with only integer numbers always returns an integer number.
- A formula with both integer and real numbers always returns a real number.
- A formula with only real numbers always returns a real number.
Simply put, you get the same type of result as the numbers you put in. In the second rule, the integer number is converted to a real number before each operation that mixes the two types is evaluated, because a real number can also express fractional values. The first two rules may confuse programmers; see the following example.
program num
print *,"3/2 = ", 3/2 ! integer / integer
print *,"3.0/2 = ", 3.0/2 ! real / integer
print *,"3/2.0 = ", 3/2.0 ! integer / real
print *,"3.0/2.0 = ", 3.0/2.0 ! real / real
end program num
3/2 = 1
3.0/2 = 1.50000000
3/2.0 = 1.50000000
3.0/2.0 = 1.50000000
Only the first formula returns an integer number (`1`), which is the quotient of 3/2; the remainder is discarded. Never forget to convert a number to real if you want a real result.
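The same rule applies to each operation inside a longer formula, so the position of the real number matters. The following sketch (the program name intdiv is arbitrary) shows two formulas that look equivalent in mathematics but differ in Fortran, because the first one performs an integer division before any real number is involved.

```fortran
program intdiv
    print *,7/2          ! 3: the fraction is dropped
    print *,-7/2         ! -3: truncated toward zero
    print *,1/2*4.0      ! 0.0: 1/2 is evaluated first, as integers
    print *,4.0*1/2      ! 2.0: 4.0*1 is real, so the division is real
end program intdiv
```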
Type conversion
You may need to convert the numeric type from one to another. Also, sometimes a real number should be rounded to an integer. Fortran has conversion functions as follows. In the table, `r` means a real number, and `b` means either a real or an integer number.
Name | Meaning | Remark |
---|---|---|
`real(b)` | convert to real | |
`int(b)` | convert to integer | truncated toward zero |
`aint(r)` | convert to real | truncated toward zero |
`nint(r)` | convert to integer | rounded to nearest integer |
`anint(r)` | convert to real | rounded to nearest integer |
`ceiling(r)` | convert to integer | least integer greater than or equal to r |
`floor(r)` | convert to integer | greatest integer less than or equal to r |
program num
print *,"3/2 = ", 3/2 ! integer / integer
print *,"3/2 = ", 3/real(2) ! integer / real
print *,"3/2 = ", real(3/2) ! it is nonsense, why?
end program num
3/2 = 1
3/2 = 1.50000000
3/2 = 1.00000000
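To see how the rounding functions differ, it helps to try them on a positive and a negative value. This is a minimal sketch with arbitrarily chosen inputs (2.7 and -2.7) and an arbitrary program name; the comments list the values each line returns.

```fortran
program conv
    print *,int(2.7), int(-2.7)          ! 2 and -2: truncated toward zero
    print *,nint(2.7), nint(-2.7)        ! 3 and -3: nearest integer
    print *,floor(-2.7), ceiling(-2.7)   ! -3 and -2
    print *,aint(-2.7), anint(-2.7)      ! -2.0 and -3.0: the result stays real
    print *,real(3)/2                    ! 1.5: convert before dividing
end program conv
```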
Simple functions
Fortran supports more complicated computations than arithmetic operations. Such a computation is done using a function. For example, the cosine function is available as `cos()` in Fortran. The value passed to a function is called an argument, which should be enclosed in `()` right after the function name. A function can be used as follows.
program num
print *,cos(0.0)
end program num
1.00000000
Some readers may notice that the argument is a real number, not an integer. This is crucial because `cos()` accepts only a real (or complex) argument. If you give an integer to `cos()`, you get a compilation error.
program num
print *,cos(0)
end program num
main.f95:2:15:
print *,cos(0)
1
Error: 'x' argument of 'cos' intrinsic at (1) must be REAL or COMPLEX
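The error goes away once the argument is real. A minimal sketch of the two usual fixes is to write the literal with a decimal point or to convert the integer with real(); both lines print the same value.

```fortran
program num
    print *,cos(0.0)       ! real literal
    print *,cos(real(0))   ! integer converted to real first
end program num
```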
Here is a table of a few frequently used functions. In the table, `r` means a real number, and `b` means either a real or an integer number.
Name | Meaning | Remark |
---|---|---|
`abs(b)` | absolute value | |
`sqrt(r)` | square root | equivalent to r**0.5 |
`sin(r)` | sine | |
`cos(r)` | cosine | |
`tan(r)` | tangent | |
`log(r)` | natural logarithm | |
`log10(r)` | common logarithm | |
`exp(r)` | exponential function | \(e^{r}\) |
`mod(x,y)` | modulo (remainder of x/y) | Both arguments should be the same type. |
The functions can be nested and mixed with other numeric expressions. The order of operations is defined by operator precedence; use parentheses wherever the precedence could be confusing. See the following example.
program num
print *,cos(sin(3.14159)) ! nested function
print *,cos(sin(3.14159)**2)/2 ! mixed operations
end program num
Summary
- Fortran follows the standard arithmetic rule in terms of the precedence.
  - `**` > `*` and `/` > `+` and `-`; operators of equal precedence run from left to right (`**` from right to left).
  - Parentheses change the precedence.
- The resulting type of a formula depends on the input types.
  - If the input types are the same, the output is the same type.
  - If different, the output is the type with more precision.
- A numeric function accepts a specific type (mostly real).
- The numeric type can be converted using functions.
Exercises
- Compute \(\sqrt[3]{2}\). The result should be `1.25992107`.
- Try to compute \(\log(\sin(0))\) and see what happens.
- Compute the sum of ten 0.1 (`0.1+ ... +0.1`) and compare it with `0.1*10`. Do you find any difference?
Precision
Range and precision of numerical values
Fortran (like almost all programming languages) uses a limited amount of memory to express a number, so a number held in your program has a finite precision. In other words, some numbers are expressed exactly, but others are approximated. Also, each type of number (integer or real) has a limited range of values it can express.
The following program presents the limitation of numbers in Fortran. Some compilers refuse to compile the program because they detect the overflow in the first statement at compile time and give an error. In such a case, comment out the first print statement, save the program, and compile it again.
program lim
print *,1234567890*2 ! will overflow
print *,3.14159265359 ! too precise real number
end program lim
-1825831516
3.141593
The results may be unexpected.
- The first result is complete nonsense. The multiplication produces a result that exceeds the upper limit of the default integer number in Fortran (\(2^{31}-1=2147483647\)). It is an example of overflow, and the result is invalid.
- The second result cuts the lowest digits in the number. The default real value in Fortran holds just 7 or 8 decimal digits. If a value has more digits than that, it is rounded to the lower precision. This causes a rounding error, which makes the computation inaccurate, especially when the errors accumulate over many operations.
Fortran provides several precisions for both integer and real numbers. If the default precision is not enough, you can use a higher precision by specifying it manually.
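Instead of finding the limits by trial and error, you can ask the compiler for them. The sketch below uses the standard intrinsic functions huge() and tiny(), which take a value of the type in question and return its limits; the program name limits is arbitrary. With a 4-byte default integer, the first line prints 2147483647, and the other values depend on how your compiler stores real numbers.

```fortran
program limits
    print *,huge(0)      ! largest default integer
    print *,huge(0.0)    ! largest default real
    print *,tiny(0.0)    ! smallest positive default real with full precision
end program limits
```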
Exercise
- See the result of adding 1 to the maximum integer (2147483647).
- See if there is a difference between printing `3.14159265359` and `3.141593`.
- See what happens when a function like `log` has an illegal input (e.g., a negative number).
Integer numbers
Precision specification
On current computers, the default integer number is expressed in 4 bytes (32 bits). Fortran also handles other precisions for integers. In the following table, the function `int()` converts a number (`i`) to an integer with a particular precision. For example, `int(100,kind=1)` returns `100` expressed as a 1-byte integer.
Precision | Range | Conversion | Remark |
---|---|---|---|
1 byte (8 bit) | -128 to 127 (\(-2^{7}\) to \(2^{7}-1\)) | int(i,kind=1) or int(i,1) | |
2 byte (16 bit) | -32768 to 32767 (\(-2^{15}\) to \(2^{15}-1\)) | int(i,kind=2) or int(i,2) | |
4 byte (32 bit) | -2147483648 to 2147483647 (\(-2^{31}\) to \(2^{31}-1\)) | int(i,kind=4) or int(i,4) | Default |
8 byte (64 bit) | -9223372036854775808 to 9223372036854775807 (\(-2^{63}\) to \(2^{63}-1\)) | int(i,kind=8) or int(i,8) | |
The default precision (4 bytes) should be enough in most cases, and you should not use the other precisions unless there is a particular reason. If you mix precisions in computations, the conversions may slow the program down or cause bugs that are hard to find.
When a formula has mixed precisions, the result has the highest precision among them. The following program produces an 8-byte integer.
program lim
print *,int(1234567890,kind=8)*2
end program lim
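The ranges in the table can be confirmed with the intrinsic function huge(), which returns the largest value of the given integer precision. This sketch assumes that the kind numbers 1, 2, 4, and 8 are available, as they are in GFortran and Intel Fortran; the program name kinds is arbitrary. The last line repeats the multiplication that overflowed earlier, now in 8 bytes.

```fortran
program kinds
    print *,huge(int(0,kind=1))        ! 127
    print *,huge(int(0,kind=2))        ! 32767
    print *,huge(int(0,kind=4))        ! 2147483647
    print *,huge(int(0,kind=8))        ! 9223372036854775807
    print *,int(1234567890,kind=8)*2   ! 2469135780: no overflow
end program kinds
```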
Portable ways for precision specification
There is a tricky fact about a precision number like `8` in `int(i,kind=8)`. Such a number depends on the compiler and the platform. The kind-literals (`1`, `2`, `4`, and `8` in `kind=`) are widely used in many compilers for Intel or AMD processors, but technically, they are not universal. You should not run into any exceptions as long as you use GFortran or Intel Fortran, so, in practice, you may still use the kind-literals in your program (many textbooks and online materials use this de facto standard).
There is a portable function to get the precision instead of using such constants. It is useful to keep the portability of your program to other computers.
Function | Remark |
---|---|
`selected_int_kind(n)` | Get a precision code that covers the range \(-10^n<i<10^n\) |
In short, it returns the kind-literal that guarantees room for an integer number \(i\) with \(n\) decimal digits. For example, in many cases, `selected_int_kind(9)` returns `4` (\(-10^9<i<10^9\), i.e., a 4-byte integer) and `selected_int_kind(10)` returns `8` (\(-10^{10}<i<10^{10}\), i.e., an 8-byte integer) in GFortran and Intel Fortran. If no integer precision can hold \(n\) digits, it returns `-1`. The above program can be written with the function as follows.
program lim
print *,int(1234567890,selected_int_kind(10))*2
print *,int(1234567890,kind=selected_int_kind(10))*2 ! alternative form
end program lim
Unfortunately, it is too long to write every time, so you should define it as a constant, which will be introduced in a later chapter.
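As a preview only, here is a sketch of what such a constant can look like. The name i10 is hypothetical and chosen for this example; the parameter attribute and implicit none are standard Fortran features that this chapter has not covered, so treat the details as something to revisit later. Once the constant exists, it can also be attached to a literal with an underscore.

```fortran
program lim
    implicit none
    ! i10 is a hypothetical name: a precision code for at least 10 digits
    integer, parameter :: i10 = selected_int_kind(10)
    print *,int(1234567890,i10)*2
    print *,1234567890_i10*2     ! literal written with the precision suffix
    print *,selected_int_kind(9), selected_int_kind(10)   ! the codes themselves
end program lim
```

In GFortran and Intel Fortran, the last line is expected to show 4 and 8, matching the kind-literals in the table above.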
Using an external module, which has not been explained so far, more readable keywords are available to specify the precision. A module is like a package in other languages, and it defines extra functions and constants. Although the user can create a custom module, some modules are available by default. The built-in module `iso_fortran_env`, which is supported by almost all recent compilers, defines such meaningful keywords for precision. Here are some of them.
Keyword | Precision |
---|---|
`int8` | 8 bit (1 byte) |
`int16` | 16 bit (2 bytes) |
`int32` | 32 bit (4 bytes) |
`int64` | 64 bit (8 bytes) |
A module is loaded with the `use` statement, which must come at the top of the program, right after the `program` line. I just show an example here and explain the details in a later chapter.
program lim
use iso_fortran_env
print *,int(1234567890,int64)*2
print *,int(1234567890,kind=int64)*2 ! alternative form
end program lim
Summary
- There are several integer types with different precisions.
- The default type is 4-byte (32 bit), and it is enough in most cases.
- There are several ways to specify the integer type.
Exercises
- Try the above programs to see how to use the precision identifiers.
Real numbers
Floating-point number
In Fortran and the majority of programming languages, a real number is treated as a rounded number with a finite number of digits. It is called a floating-point number, which is expressed by the following formula.
\[ x\times 10^y \]
The value \(x\) is the significand and \(y\) is the exponent. Usually, \(x\) is a real number and \(y\) is an integer, and both have limited precision. By default, Fortran has 7 or 8 digits in \(x\) and up to 2 digits in \(y\); this kind of number is called single precision. The higher precision is double precision, which has 15 or 16 digits in \(x\) and up to 3 digits in \(y\). You can see that this expression cannot exactly represent every numeric value, for example, irrational numbers.
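As a small worked illustration (the values are taken from the literal table earlier in this chapter, and the significand is written with one digit before the decimal point only by convention), two literals fit the formula as follows.

\[ 6.02\mathrm{e}23 = 6.02\times 10^{23} \quad (x=6.02,\ y=23), \qquad 314\mathrm{e}{-2} = 3.14\times 10^{0} \quad (x=3.14,\ y=0) \]

In single precision, a value such as \(3.14159265359\) keeps only about the first 7 or 8 digits of \(x\), so it is stored as roughly \(3.141593\times 10^{0}\).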
Precision specification
In the following table, `real()` (or `dble()`) converts a number (`r`) to a real number with a particular precision.
Precision | Width | Approximate range | Smallest positive normal number | Usage |
---|---|---|---|---|
Single precision | 4 byte (32 bit) | \(-3.40\times 10^{38}\) to \(3.40\times 10^{38}\) | \(1.18\times 10^{-38}\) | real(r,kind=4) or real(r) or e literal |
Double precision | 8 byte (64 bit) | \(-1.80\times 10^{308}\) to \(1.80\times 10^{308}\) | \(2.23\times 10^{-308}\) | real(r,kind=8) or dble(r) or d literal |
Roughly, the smallest positive normal number is the smallest positive value that is stored with full precision; even smaller values (subnormal numbers) exist, but they carry fewer significant digits. See an introductory textbook for details on floating-point numbers.
Double precision is mainly used in heavy numerical computations because single precision is not enough to hold precise numbers, and rounding errors accumulate quickly during computations. In this tutorial, I use single-precision numbers because they are shorter to type and the programs here are demonstrations rather than real applications. Just remember the following rules for double-precision values.
- A double-precision literal is defined using `d` instead of `e`.
- The function `dble` is useful to convert any number to double precision.
- When single and double precisions are mixed in a formula, the result will be double precision.
- Integer numbers should be converted to double precision when double-precision values are involved in the computations.
See the following example to use double precision.
program double
print *,3.14159265359 ! single precision
print *,3.14159265359d0 ! double precision
print *,dble(2)*10 ! mixed precision = aligned to double
print *,0.1 ! not exactly expressed in the computer
print *,0.1d0 ! same in double
end program double
3.14159274
3.1415926535900001
20.000000000000000
0.100000001
0.10000000000000001
You can still see a rounding error in the second output. The 4th and 5th outputs are a typical example of a rounding error. The value 0.1 is not expressed exactly as a floating-point number because it becomes a recurring fraction in the binary system that the computer uses. The last digit is always rounded, which creates noise in the value. When you use double precision, the rounding error is much smaller, and the result is less affected.
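The number of reliable digits can also be queried from the compiler. The sketch below uses the standard intrinsic functions precision(), range(), and epsilon(); the program name fp is arbitrary. On typical hardware, the first two lines are expected to show 6 and 37 for single precision and 15 and 307 for double precision. The guaranteed digit count is slightly lower than the 7 or 8 (or 15 or 16) digits that a particular value may happen to carry.

```fortran
program fp
    print *,precision(0.0), range(0.0)       ! single: decimal digits, exponent range
    print *,precision(0.0d0), range(0.0d0)   ! double: decimal digits, exponent range
    print *,epsilon(0.0), epsilon(0.0d0)     ! gap between 1 and the next larger value
end program fp
```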
Portable ways for precision specification
As with integer numbers, a precision number like `8` in `real(i,kind=8)` is not always portable, although almost all compilers accept `8` as double precision. There is a function to get the precision instead of using such constants.
Function | Remark |
---|---|
`selected_real_kind(p,r)` | See below for usage. |
It returns a precision code that guarantees that the real number has at least `p` decimal digits in the significand and a decimal exponent range of at least `r`. For example, a typical double-precision value is defined as `selected_real_kind(15,307)`.
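A common way to use this function is to store its result in a named constant once and reuse it. The following is only a sketch: the names sp and dp are hypothetical, and the parameter attribute and implicit none are standard features that are explained outside this chapter. In GFortran and Intel Fortran, the first line is expected to print 4 and 8.

```fortran
program rkind
    implicit none
    ! sp and dp are hypothetical names chosen for this example
    integer, parameter :: sp = selected_real_kind(6,37)
    integer, parameter :: dp = selected_real_kind(15,307)
    print *,sp, dp           ! precision codes chosen by the compiler
    print *,real(1,dp)/3     ! division carried out in double precision
    print *,0.1_dp           ! double-precision literal with a suffix
end program rkind
```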
More readable keywords are also available in the module `iso_fortran_env`, as for the integer numbers.
Keyword | Precision |
---|---|
`real32` | single precision, 32 bit (4 bytes) |
`real64` | double precision, 64 bit (8 bytes) |
See the following example for the usage.
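! All the statements print a double-precision value. The last three convert the
! single-precision literal 3.14, so they keep its rounding error and differ from 3.14d0.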
program double
use iso_fortran_env
print *,3.14d0
print *,dble(3.14)
print *,real(3.14,selected_real_kind(15,307))
print *,real(3.14,real64)
end program double
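One detail in the program above is worth checking by hand: converting a single-precision literal with dble() or real() does not restore the digits that were already lost. This minimal sketch (the program name lit is arbitrary) prints the two values and their difference, which is not zero and is on the order of \(10^{-7}\), the size of a single-precision rounding error.

```fortran
program lit
    print *,3.14d0               ! the literal is double precision from the start
    print *,dble(3.14)           ! single-precision 3.14 widened to double
    print *,3.14d0-dble(3.14)    ! nonzero: the single rounding error remains
end program lit
```

When the extra digits matter, write the literal itself in double precision with `d`.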
Summary
- There are two typical types to express a real number as a floating-point number.
- The default is single-precision, but the double-precision value is often used in numerical computing.
- There are several ways to define and convert the types.
In this tutorial, I use a single-precision number because it is simpler to type.
Exercises
- Compute the sum of ten 0.1 both in single and double precisions and compare the results.
- See if `selected_real_kind(15,308)` works or not. If it works, what does it define?