# Numerical expression

February 2020

## Bit and byte

Whereas people use the decimal system, a computer uses the binary system. There are 10 (from 0 to 9) decimal digits in the decimal system, and there are only two (0 or 1) in the binary system. So, in the binary system, $1+1=10$ with a carry. The bit is a unit having one of the two values.

Most computers in use today treat 8 bits as the smallest unit of operation. This 8-bit unit is called a byte. One byte can represent exactly $2^8=256$ different values. As seen later in this chapter, the default integer number in Fortran uses 4 bytes (32 bits) as a unit.
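As a small sketch of the binary system, the following program prints integers in binary digits. It relies on the `b0` format and the `bit_size` function, neither of which is covered in this tutorial so far, so treat them as a preview rather than required knowledge.

```fortran
program bits
print '(b0)', 1+1     ! the value 2 written in binary digits: 10
print '(b0)', 255     ! eight 1-bits: the largest unsigned value of one byte
print *, bit_size(1)  ! number of bits in a default integer (typically 32)
end program bits
```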

## Literals and simple computations

### Numeric and character literals

#### Numbers

Fortran can do any simple arithmetic. A number is expressed as a numeric literal, which is simply a sequence of digits with some symbols. Fortran supports integer numbers, real numbers written with a decimal point or in exponential style, and complex numbers (which we do not deal with in this tutorial). See the following table for examples.

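| Literal | Value | Type |
|---|---|---|
| `100` | $100$ | integer number |
| `+100` | $100$ | integer number |
| `-1` | $-1$ | integer number |
| `3.14` | $3.14$ | real number |
| `-99.` | $-99.0$ | real number |
| `1e6` | $1\times10^{6}$ | real number |
| `6.02e23` | $6.02\times 10^{23}$ | real number |
| `314e-2` | $314\times 10^{-2}$ | real number |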

In Fortran, the type of number (integer vs. real) is strictly differentiated.

• If the literal has only digits and optionally a sign ($+$ or $-$), it is an integer.
• If it has a decimal point (.) or an exponent identifier (e or E) in addition to the digits and the sign, it is a real number.

The numeric literal can also be used in print.

program num
print *,100,-1,3.14,6.02e23
end program num
         100          -1   3.14000010       6.02000017E+23

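Note that the output format depends on the type and the magnitude of the value. By default, print pads each number with leading blanks so that any value of that type fits. The extra digits in 3.14000010 appear because a real number is stored with finite precision, as explained later in this chapter.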

#### Mix of numeric and character literals

You can mix numeric and character literals in print.

program num
print *,"Avogadro constant=",6.02e23
end program num
 Avogadro constant=   6.02000017E+23

This is useful for labeling the numeric values that a program displays.

#### Summary

• Fortran holds both integer and real numbers, and the two are strictly differentiated.
• The print statement pads the output of numeric literals with a margin.
• Numeric and character literals can coexist in print.

#### Exercises

1. Print 3.14 with the message “pi=”.
2. Print some extremely huge (or tiny) integer literals, and see how big (small) values Fortran can hold.
3. Repeat as above but for real literals.

### Computations

#### Operators

Fortran has several operators for numeric values. The operators are very similar to the ones used in mathematics.

| Operator | Meaning | Example |
|---|---|---|
| `+` | addition | `1+2` |
| `-` | subtraction | `1-3` |
| `*` | multiplication | `2*3` |
| `/` | division | `6/2` |
| `**` | power | `10**3` ($10^3$) |
| `(` and `)` | change the priority of operators | `(1+3)*2` |

By default, the priority of operators follows the arithmetic rule (** > * and / > + and -). Operators with the same priority are evaluated from left to right, except **, which is evaluated from right to left (2**3**2 means 2**(3**2)). Parentheses change the priority. You can use as many operators as you like in a formula, such as 1+((5+4/2)**6)/2-8/4+2.

program num
print *,1+((5+4/2)**6)/2-8/4+2
end program num
       58825
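The priority rules can be checked one at a time with a short program. This is only an illustrative sketch; the values in the comments follow from the rules of the language, where ** groups from right to left and the other operators of equal priority group from left to right.

```fortran
program prec
print *, 2+3*4       ! * before +              : 2+12      = 14
print *, (2+3)*4     ! parentheses first       : 5*4       = 20
print *, 2*3**2      ! ** before *             : 2*9       = 18
print *, 2**3**2     ! ** groups right to left : 2**(3**2) = 512
print *, (2**3)**2   ! forced left to right    : 8**2      = 64
print *, 8/4/2       ! / groups left to right  : (8/4)/2   = 1
end program prec
```

When in doubt, adding parentheses costs nothing and makes the intended order explicit.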

#### Type of result

Some arithmetic rules in Fortran are not intuitive because the type of value is strictly considered. The following rules are applied.

• A formula with only integer numbers always returns an integer number.
• A formula with both integer and real numbers always returns a real number.
• A formula with only real numbers always returns a real number.

Simply put, you get the same type of result as the inputs when all the numbers have the same type. The second rule holds because a real number can express a numerical value more precisely, so the integer number is converted to a real number before the formula is evaluated. The first two rules may confuse programmers; see the following example.

program num
print *,"3/2     = ", 3/2       ! integer / integer
print *,"3.0/2   = ", 3.0/2     ! real / integer
print *,"3/2.0   = ", 3/2.0     ! integer / real
print *,"3.0/2.0 = ", 3.0/2.0   ! real / real
end program num
3/2     =            1
3.0/2   =    1.50000000
3/2.0   =    1.50000000
3.0/2.0 =    1.50000000

Only the first formula returns an integer number (1), which is the quotient of 3/2; the remainder is discarded. Never forget to convert a number to real if you want a real result.

#### Type conversion

You may need to convert the numeric type from one to another. Also, sometimes a real number should be rounded to an integer. Fortran has conversion functions as follows. In the table, r means a real number, and b means either a real or an integer number.

| Name | Meaning | Note |
|---|---|---|
| `real(b)` | convert to real | |
| `int(b)` | convert to integer | truncated toward zero |
| `aint(r)` | convert to real | truncated toward zero |
| `nint(r)` | convert to integer | rounded to nearest integer |
| `anint(r)` | convert to real | rounded to nearest integer |
| `ceiling(r)` | convert to integer | least integer greater than or equal to r |
| `floor(r)` | convert to integer | greatest integer less than or equal to r |

program num
print *,"3/2     = ", 3/2       ! integer / integer
print *,"3/2     = ", 3/real(2) ! integer / real
print *,"3/2     = ", real(3/2) ! it is nonsense, why?
end program num
3/2     =            1
3/2     =    1.50000000
3/2     =    1.00000000
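The difference between the rounding functions is easiest to see with a positive and a negative input side by side. The following is a minimal sketch using the arbitrary sample values 2.7 and -2.7.

```fortran
program conv
print *, int(2.7),     int(-2.7)      ! toward zero       :  2  -2
print *, nint(2.7),    nint(-2.7)     ! nearest integer   :  3  -3
print *, floor(2.7),   floor(-2.7)    ! round down        :  2  -3
print *, ceiling(2.7), ceiling(-2.7)  ! round up          :  3  -2
print *, aint(2.7),    anint(2.7)     ! like int and nint, but the result stays real
end program conv
```

Note that int and floor agree for positive values and differ for negative ones.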

#### Simple functions

Fortran supports more complicated computations than arithmetic operations. Such a computation is done using a function. For example, the cosine function is available as cos() in Fortran. The value passed to a function is called an argument, which should be enclosed with () next to the function name. A function can be used as follows.

program num
print *,cos(0.0)
end program num
   1.00000000

Some readers may notice that the argument is real, not an integer. Yes, it is crucial because cos() accepts only a real (or complex) argument. If you give an integer to cos(), you get a compilation error.

program num
print *,cos(0)
end program num
main.f95:2:15:

print *,cos(0)
1
Error: 'x' argument of 'cos' intrinsic at (1) must be REAL or COMPLEX

Here is a table of a few often-used functions. In the table, r means a real number, and b means either a real or an integer number.

| Name | Meaning | Note |
|---|---|---|
| `abs(b)` | absolute value | |
| `sqrt(r)` | square root | equivalent to `r**0.5` |
| `sin(r)` | sine | |
| `cos(r)` | cosine | |
| `tan(r)` | tangent | |
| `log(r)` | natural logarithm | |
| `log10(r)` | common logarithm | |
| `exp(r)` | exponential function $e^{r}$ | |
| `mod(x,y)` | modulo (remainder of x/y) | Both arguments should be the same type. |

The functions can be nested and mixed with other numeric expressions. The order of operations is defined by operator precedence (use parentheses whenever the precedence might be confusing). See the following example.

program num
print *,cos(sin(3.14159))         ! nested function
print *,cos(sin(3.14159)**2)/2    ! mixed operations
end program num
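A few more functions from the table can be tried in the same way. This sketch uses arbitrary sample arguments; the exact number of printed digits depends on the compiler.

```fortran
program func
print *, abs(-3), abs(-3.5)       ! abs accepts an integer or a real argument
print *, sqrt(16.0), 16.0**0.5    ! two ways to take a square root
print *, mod(7,3), mod(7.5,2.0)   ! remainder; both arguments share one type
print *, exp(log(2.0))            ! exp undoes log, up to rounding
end program func
```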

#### Summary

• Fortran follows the standard arithmetic rule in terms of the precedence.
• ** > * and / > + and -; equal priorities go from left to right, except ** (right to left)
• Parentheses change the precedence.
• The resulting type of formula depends on the input types.
• If the input types are the same, the output is the same type.
• If different, the output is a type with more precision.
• A numeric function accepts a specific type (mostly real).
• The numeric type can be converted using functions.

#### Exercises

1. Compute $\sqrt[3]{2}$ (the cube root of 2). The result should be 1.25992107.
2. Try to compute $\log(\sin(0))$ and see what happens.
3. Compute the sum of ten 0.1 (0.1+ ... +0.1) and compare it with 0.1*10. Do you find any difference?

## Precision

### Range and precision of numerical values

Fortran (and almost all programming languages) uses a fixed amount of memory to express a number. A number held in your program therefore has finite precision. In other words, some numbers are expressed exactly, but others are approximated. Also, each type of number (integer or real) has a limited range of values.

The following program shows the limitations of numbers in Fortran. Some compilers cannot compile the program because they detect the overflow in the first statement at compile time and report an error. In such a case, comment out the first print statement, save the program, and compile it again.

program lim
print *,1234567890*2    ! will overflow
print *,3.14159265359   ! too precise real number
end program lim
-1825831516
3.141593

The results may be unexpected.

• The first result is complete nonsense. The multiplication produces a result that exceeds the upper limit of the default integer number in Fortran ($2^{31}-1=2147483647$). It is an example of overflow, and the result is invalid.
• The second result loses the lowest digits of the number. The default real value in Fortran holds just 7 or 8 decimal digits. If a value has more digits than that, it is rounded to the lower precision. This causes a rounding error, which makes the computation inaccurate, especially when the lost digits accumulate over many operations.

Fortran provides several precisions for both integer and real numbers. If the default precision is not enough, you can use higher-precision values by specifying the precision manually.
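The limits of the default types can also be asked from the compiler directly. The sketch below uses the inquiry functions huge, tiny, and epsilon, which are standard Fortran but not introduced in this tutorial; the printed values depend on the compiler and platform.

```fortran
program limits
print *, huge(1)       ! largest default integer
print *, huge(1.0)     ! largest default real
print *, tiny(1.0)     ! smallest positive default real that keeps full precision
print *, epsilon(1.0)  ! gap between 1.0 and the next larger default real
end program limits
```

The argument of each function is only used for its type, so any integer or real literal gives the same answer.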

#### Exercises

1. See the result of adding 1 to the maximum integer (2147483647).
2. See if there is a difference between printing 3.14159265359 and 3.141593.
3. See what happens when a function like log has an illegal input (e.g., a negative number).

### Integer numbers

#### Precision specification

On current computers, the default integer number is stored in 4 bytes (32 bits). Fortran also handles other precisions for integers. In the following table, the function int() converts an integer number (i) to a particular precision. For example, int(100,kind=1) returns 100 expressed as a 1-byte integer.

| Precision | Range | Conversion | Remark |
|---|---|---|---|
| 1 byte (8 bit) | -128 to 127 ($-2^{7}$ to $2^{7}-1$) | `int(i,kind=1)` or `int(i,1)` | |
| 2 byte (16 bit) | -32768 to 32767 ($-2^{15}$ to $2^{15}-1$) | `int(i,kind=2)` or `int(i,2)` | |
| 4 byte (32 bit) | -2147483648 to 2147483647 ($-2^{31}$ to $2^{31}-1$) | `int(i,kind=4)` or `int(i,4)` | Default |
| 8 byte (64 bit) | -9223372036854775808 to 9223372036854775807 ($-2^{63}$ to $2^{63}-1$) | `int(i,kind=8)` or `int(i,8)` | |

The default precision (4 bytes) should be enough in most cases, and you should not use the other precisions unless there is a particular reason. If you mix precisions in computations, the conversions may slow the program down or introduce issues that cannot be easily found.

When a formula has mixed precisions, the result has the highest precision among them. The following program produces an 8-byte integer.

program lim
print *,int(1234567890,kind=8)*2
end program lim

#### Portable ways for precision specification

There is a tricky fact about precision values like 8 in int(i,kind=8). The meaning of such a value depends on the compiler and the platform. The kind-literals (1, 2, 4, and 8 in kind=) are widely used by compilers for Intel or AMD processors, but technically, they are not universal. I am sure you will not meet any exceptions as long as you use GFortran or Intel Fortran, so, in practice, you may still use the kind-literals in your program (many textbooks and online materials use this de facto standard).

There is a portable function to get the precision instead of using such constants. It is useful to keep the portability of your program to other computers.

| Function | Remark |
|---|---|
| `selected_int_kind(n)` | Get a precision code for the range $-10^n<i<10^n$ |

In short, it returns the kind-literal that guarantees the range needed for an integer number $i$ with $n$ decimal digits. For example, in many cases, selected_int_kind(9) returns 4 ($-10^9<i<10^9$, i.e., a 4-byte integer) and selected_int_kind(10) returns 8 ($-10^{10}<i<10^{10}$, i.e., an 8-byte integer) in GFortran and Intel Fortran. If no integer precision can cover the requested range, it returns -1. The above program can be written with the function as follows.

program lim
print *,int(1234567890,selected_int_kind(10))*2
print *,int(1234567890,kind=selected_int_kind(10))*2   ! alternative form
end program lim

Unfortunately, it is too long to write every time, so you should define it as a constant, which will be introduced in a later chapter.
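As a preview only, the idea of a named constant might look like the following sketch. The name i8 is a hypothetical choice, and the declaration syntax (implicit none, integer, parameter) as well as the literal suffix _i8 are not explained until later.

```fortran
program lim
implicit none
integer, parameter :: i8 = selected_int_kind(10)       ! hypothetical name for the kind
print *, selected_int_kind(9), selected_int_kind(10)   ! kind values; compiler dependent
print *, int(1234567890,i8)*2                          ! conversion with the constant
print *, 1234567890_i8*2                               ! literal written with a kind suffix
end program lim
```

Both of the last two statements compute the product in the wider integer, so neither overflows.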

Using a module, which has not been explained so far, more descriptive keywords are available to specify the precision. A module is like a package in other languages, and it defines extra functions and constants. Although the user can create a custom module, some modules are available by default. The built-in module iso_fortran_env, which is supported by almost all recent compilers, defines such meaningful keywords for precision. Here are some of them.

| Keyword | Precision |
|---|---|
| `int8` | 8 bit (1 byte) |
| `int16` | 16 bit (2 bytes) |
| `int32` | 32 bit (4 bytes) |
| `int64` | 64 bit (8 bytes) |

A module is loaded with the use statement, which must come right after the program statement, before any other statement. I just show an example here and explain the details in a later chapter.

program lim
use iso_fortran_env
print *,int(1234567890,int64)*2
print *,int(1234567890,kind=int64)*2  ! alternative form
end program lim

#### Summary

• There are several precisions for integer numbers.
• The default is 4-byte (32 bit), and it is enough in most cases.
• There are several ways to specify the precision of an integer.

#### Exercises

1. Try the above programs to see how to use the precision identifiers.

### Real numbers

#### Floating-point number

In Fortran and the majority of programming languages, any real number is treated as a rounded number with a finite number of digits. It is called a floating-point number, which is expressed by the following formula.

$x\times 10^y$

The value $x$ is the significand, and $y$ is the exponent. Usually, $x$ is a real number and $y$ is an integer, and both have limited precision. By default, Fortran has 7 or 8 digits in $x$ and up to 2 digits in $y$; this kind of number is called single precision. The higher precision is double precision, which has 15 or 16 digits in $x$ and up to 3 digits in $y$. Because the number of digits is limited, this expression cannot exactly represent every real value, for example, irrational numbers.
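The number of digits available in $x$ and the range of $y$ can be queried with the inquiry functions precision and range, which are standard Fortran but not covered in this tutorial. In this sketch, 1.0d0 is a double-precision literal. The reported digit counts are often a little lower than the figures quoted above (commonly 6 and 15), because precision counts only the digits guaranteed for every value.

```fortran
program fp
print *, precision(1.0),   range(1.0)     ! digits and exponent range, single precision
print *, precision(1.0d0), range(1.0d0)   ! digits and exponent range, double precision
end program fp
```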

#### Precision specification

In the following table, real() (or dble()) returns a real number (r) with a particular precision.

| Precision | Width | Approximate range | Largest subnormal number | Usage |
|---|---|---|---|---|
| Single precision | 4 byte (32 bit) | $-3.40\times 10^{38}$ to $3.40\times 10^{38}$ | $1.18\times 10^{-38}$ | `real(r,kind=4)` or `real(r)` or e literal |
| Double precision | 8 byte (64 bit) | $-1.80\times 10^{308}$ to $1.80\times 10^{308}$ | $2.23\times 10^{-308}$ | `real(r,kind=8)` or `dble(r)` or d literal |

Roughly, the largest subnormal number marks the lower end of the positive values that keep full precision; smaller positive values can still be stored, but with fewer significant digits. See an introductory textbook for details on floating-point numbers.

Double precision is mainly used in heavy numerical computations because single precision cannot hold enough digits and quickly accumulates rounding errors during computations. In this tutorial, I use single-precision numbers because they are shorter to type and because the programs here are demonstrations, not real applications. Just remember the following rules for double-precision values.

• A double-precision literal is defined using d instead of e.
• The function dble is useful to convert any numbers to double precision.
• When single and double precisions are mixed in a formula, the result will be double precision.
• Integer numbers should be converted to double-precision when the double-precision values are involved in the computations.

See the following example to use double precision.

program double
print *,3.14159265359     ! single precision
print *,3.14159265359d0   ! double precision
print *,dble(2)*10        ! mixed precision = aligned to double
print *,0.1               ! not exactly expressed in the computer
print *,0.1d0             ! same in double
end program double
3.14159274
3.1415926535900001
20.000000000000000
0.100000001
0.10000000000000001

You can still see a rounding error in the second output. The 4th and 5th outputs are an extreme example of a rounding error. The value 0.1 cannot be stored exactly as a floating-point number because it is a recurring fraction in the binary system that current computers use. The last digits are always rounded, which creates noise in the value. When you use double precision, the rounding error is much smaller, and the result is less affected.

#### Portable ways for precision specification

As with integer numbers, a precision value like 8 in real(i,kind=8) is not always portable, although almost all compilers accept 8 as double precision. There is a function to get the precision instead of using such constants.

| Function | Remark |
|---|---|
| `selected_real_kind(p,r)` | See below for usage. |

It returns a precision code that guarantees that the real number has at least p decimal digits in the significand and a decimal exponent range of at least r. For example, a typical double-precision value is defined as selected_real_kind(15,307).
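To check what the function returns on your compiler, its result can be compared with the kind of an ordinary literal. This is a minimal sketch; kind() is a standard inquiry function not introduced in this tutorial, and the arguments (6,37) are an assumed request that single precision typically satisfies. With GFortran or Intel Fortran, each line usually prints the same value twice.

```fortran
program rk
print *, kind(1.0),   selected_real_kind(6,37)     ! single precision
print *, kind(1.0d0), selected_real_kind(15,307)   ! double precision
end program rk
```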

More descriptive keywords are also available in the module iso_fortran_env, as for integer numbers.

| Keyword | Precision |
|---|---|
| `real32` | single precision, 32 bit (4 bytes) |
| `real64` | double precision, 64 bit (8 bytes) |

See the following example for the usage.

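! Every statement prints a double-precision value, but the values differ:
! converting the single-precision literal 3.14 keeps its rounding error.
program double
use iso_fortran_env
print *,3.14d0                                  ! double-precision literal
print *,dble(3.14)                              ! single 3.14 converted to double
print *,real(3.14,selected_real_kind(15,307))   ! same conversion, portable kind
print *,real(3.14,real64)                       ! same conversion, named kind
end program double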

#### Summary

• There are two typical precisions to express a real number as a floating-point number.
• The default is single precision, but double precision is often used in numerical computing.
• There are several ways to specify and convert the precision.

In this tutorial, I use a single-precision number because it is simpler to type.

#### Exercises

1. Compute the sum of ten 0.1 both in single and double precisions and compare the results.
2. See if selected_real_kind(15,308) works or not. If working, what does it define?
