Data preparation with RENUMF90

Yutaka Masuda

September 2019


Basic data preparation

Basic usage of RENUMF90

Why RENUMF90?

BLUPF90 accepts data and pedigree files that contain only numerical values (integers or real numbers). A group label should be an integer starting from 1. A raw data file, however, usually contains characters (letters or symbols) in animal IDs or group codes. These characters must be replaced with numerical values before the analysis with BLUPF90 programs. RENUMF90 does this job. We call this process renumbering.

Trivial instruction file

RENUMF90 reads a parameter file, but it is totally different from the one used in BLUPF90. Don’t confuse the two kinds of parameter files. To avoid confusion, this tutorial refers to the parameter file used by RENUMF90 as an instruction file. This name is unofficial, but it makes the two kinds of parameter files easier to tell apart.

To understand the behavior of RENUMF90, let’s try a small example without pedigree. Here we use a raw data file and an instruction file. The raw data is a space-separated text file and contains 5 columns.

 ID006  A  1  1.0  3.0
 ID009  A  2  1.0  2.0
 ID012  A  1  2.0  4.0
 ID007  B  2  2.0  6.0
 ID010  B  1  1.0  3.0
 ID013  B  2  2.0  6.0
 ID008  C  1  2.0  6.0
 ID011  C  2  1.0  6.0
 ID014  C  1  1.0  8.0
 ID015  C  2  2.0  4.0

We will create an instruction file for the following model. \[ y_{ijk} = A_{i} + S_{j} + \beta x_{ijk} + e_{ijk} \] This model is the same as the fixed effect model introduced in the previous chapter. In the raw data, the first column is the animal ID, columns 2 and 3 are fixed cross-classified effects, column 4 is the covariate for a fixed regression, and column 5 is the observation. So, this example tries to reproduce a data set similar to the one used before.

RENUMF90 reads only the necessary columns in the raw data file, as described in the instruction file. The following instruction file demonstrates the renumbering of columns 2 and 3 as fixed cross-classified effects, column 4 as a covariate, and column 5 as a phenotype. In this case, column 1 will not be used.

DATAFILE
rawdata1.txt
TRAITS
5
FIELDS_PASSED TO OUTPUT

WEIGHT(S)

RESIDUAL_VARIANCE
1.0
EFFECT          # 1st effect
2 cross alpha
EFFECT          # 2nd effect
3 cross alpha
EFFECT          # 3rd effect
4 cov

The instruction file can also contain comments, which start with # as seen in the above example.

The instruction file looks like the parameter file for BLUPF90: it contains several pairs of a keyword and its value(s). This file uses 6 distinct keywords, and EFFECT can be repeated several times. The following keywords are required in a minimal instruction file.

Keyword                  Possible value  Description
DATAFILE                 characters      The name of the raw data file to be processed.
TRAITS                   integer         A list of the position(s) of observations in the data file.
FIELDS_PASSED TO OUTPUT  integer         A list of the position(s) of unchanged columns in the data file.
                                         An empty value is acceptable if not needed (just keep the line empty).
                                         The columns will be passed through to the renumbered file.
WEIGHT(S)                integer         The position(s) for weight in the data file.
                                         An empty value is acceptable if not needed.
RESIDUAL_VARIANCE        real value(s)   Residual (co)variances.
EFFECT                   (see below)     Description of an effect. Repeatable.

The EFFECT keyword can be repeated as many times as you need. The above instruction file contains 3 effects. The EFFECT keyword takes several values as follows.

Keyword  Position  Effect type  Data type
EFFECT   integer   cross        alpha
                                numer
                   cov          (none)

In the above table, the position means the position(s) of the group code (for a class effect) or the covariate (for a regression) in the raw data file. You can choose one of two effect types: cross for a cross-classified effect or cov for a regression. For cross-classified effects, you also choose the data type: alpha if the column may contain letters, or numer if the column surely contains integer values only. The following is an explanation of the EFFECT blocks in the above instruction file.

EFFECT
  2 cross alpha       # treat 2nd column as cross-classified effect; assuming alphabets
EFFECT
  3 cross alpha       # treat 3rd column as cross-classified effect; assuming alphabets
EFFECT
  4 cov               # treat 4th column as covariate

Note that, in this case, you can technically use numer instead of alpha in the second effect because this column contains integer values only. Nevertheless, we recommend always using alpha because it accepts any type of data, including both letters and numbers.
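To make the keyword and value layout concrete, here is a minimal Python sketch that reads the instruction file shown in this section. It is not part of RENUMF90; the function names parse_instruction and parse_effects are hypothetical, and the sketch only knows the six keywords listed in this section.

```python
INSTRUCTION = """
DATAFILE
rawdata1.txt
TRAITS
5
FIELDS_PASSED TO OUTPUT

WEIGHT(S)

RESIDUAL_VARIANCE
1.0
EFFECT          # 1st effect
2 cross alpha
EFFECT          # 2nd effect
3 cross alpha
EFFECT          # 3rd effect
4 cov
"""

# Only the keywords named in this section (an assumption of this sketch).
KEYWORDS = {"DATAFILE", "TRAITS", "FIELDS_PASSED TO OUTPUT",
            "WEIGHT(S)", "RESIDUAL_VARIANCE", "EFFECT"}


def parse_instruction(text):
    """Return a list of (keyword, [value lines]) pairs in file order."""
    blocks = []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()   # drop comments and padding
        if line in KEYWORDS:
            blocks.append((line, []))         # a keyword opens a new block
        elif line and blocks:
            blocks[-1][1].append(line)        # a non-empty line is a value
    return blocks


def parse_effects(blocks):
    """Return (position, effect type, data type) for each EFFECT block."""
    effects = []
    for keyword, values in blocks:
        if keyword != "EFFECT":
            continue
        fields = values[0].split()
        data_type = fields[2] if len(fields) > 2 else None  # cov has none
        effects.append((int(fields[0]), fields[1], data_type))
    return effects


blocks = parse_instruction(INSTRUCTION)
effects = parse_effects(blocks)
for effect in effects:
    print(effect)
```

An empty value, as for FIELDS_PASSED TO OUTPUT here, simply yields an empty list of value lines.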

Resulting files

RENUMF90 asks you to type the name of the instruction file. The program reads the original data and replaces the group labels with sequential integer values. Finally, the program generates 3 files, each with a fixed name: renf90.dat (the renumbered data), renf90.tables (the correspondence between original and new codes), and renf90.par (a parameter file for BLUPF90). If a file with the same name already exists, it will be replaced with the new one.
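Because the program prompts for the file name, a script can answer the prompt through standard input. The following Python sketch shows one way to do that. The executable name renumf90, the instruction file name renum.par, and the helper name run_renumf90 are assumptions for illustration; adjust them to your installation.

```python
import subprocess


def run_renumf90(instruction_file, exe="renumf90", runner=subprocess.run):
    """Start RENUMF90 and answer its prompt with the instruction file name.

    `exe` is an assumed executable name; `runner` can be swapped for testing.
    """
    # The file name is sent on standard input, followed by a newline,
    # exactly as if it had been typed at the prompt.
    return runner([exe], input=instruction_file + "\n", text=True, check=True)


# Example call (requires RENUMF90 to be installed and on the PATH):
# run_renumf90("renum.par")
```

With check=True, a non-zero exit status raises an exception, so a failed renumbering is not silently followed by an analysis on stale output files.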

Let’s look into the inside of renf90.dat.

 3.0 1 1 1.0
 2.0 1 2 1.0
 4.0 1 1 2.0
 6.0 2 2 2.0
 3.0 2 1 1.0
 6.0 2 2 2.0
 6.0 3 1 2.0
 6.0 3 2 1.0
 8.0 3 1 1.0
 4.0 3 2 2.0

You can see the difference between the original and renumbered files. The order of lines (rows) is preserved, but the order of columns is different. Each column has integer values or real (numerical) values. RENUMF90 changes the column order according to the following rules.

  1. First, observations: RENUMF90 doesn’t change the values.
    • column 1: a single-trait model is assumed and the 1st column is for observations.
  2. Second, effects: the order is determined with the order of EFFECT statements in the instruction file.
    • column 2 for effect 1 (corresponding to the 2nd column in the original file).
    • column 3 for effect 2 (3rd column in the original file).
    • column 4 for effect 3 (4th column in the original file).
  3. Last, passed columns: RENUMF90 adds the columns specified with the FIELDS_PASSED TO OUTPUT keyword in the instruction file. It doesn’t change the content in any way, so the passed columns may contain letters and symbols.
    • no additional columns in this case: we didn’t set any values for this keyword.
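The three rules above can be sketched in a few lines of Python. This is an illustration, not RENUMF90's implementation: the function name renumber_rows is hypothetical, and the sketch assumes that group codes are numbered in sorted order, which agrees with this small example but is not a documented rule.

```python
RAW_ROWS = [
    "ID006 A 1 1.0 3.0", "ID009 A 2 1.0 2.0", "ID012 A 1 2.0 4.0",
    "ID007 B 2 2.0 6.0", "ID010 B 1 1.0 3.0", "ID013 B 2 2.0 6.0",
    "ID008 C 1 2.0 6.0", "ID011 C 2 1.0 6.0", "ID014 C 1 1.0 8.0",
    "ID015 C 2 2.0 4.0",
]


def renumber_rows(rows, traits, effects, passed):
    """Reorder and renumber columns.

    traits, passed: lists of 1-based column positions.
    effects: list of (position, effect type) pairs in EFFECT order.
    """
    records = [row.split() for row in rows]

    # One code -> integer map per cross-classified effect
    # (sorted order is an assumption of this sketch).
    maps = {}
    for pos, kind in effects:
        if kind == "cross":
            codes = sorted({rec[pos - 1] for rec in records})
            maps[pos] = {code: n for n, code in enumerate(codes, start=1)}

    out = []
    for rec in records:
        cols = [rec[p - 1] for p in traits]          # 1. observations
        for pos, kind in effects:                    # 2. effects, in order
            value = rec[pos - 1]
            cols.append(str(maps[pos][value]) if kind == "cross" else value)
        cols += [rec[p - 1] for p in passed]         # 3. passed columns
        out.append(" ".join(cols))
    return out


EFFECTS = [(2, "cross"), (3, "cross"), (4, "cov")]
for line in renumber_rows(RAW_ROWS, traits=[5], effects=EFFECTS, passed=[]):
    print(line)
```

The printed values match renf90.dat above. Calling the function with passed=[1] would append the original animal ID, unchanged, as the last column of each line.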

Class variables for effects 1 and 2 (columns 2 and 3) are successfully replaced with integer values. For effect 1, A is replaced with 1, B with 2, and C with 3. For effect 2, 1 stays 1 and 2 stays 2. The correspondence is saved in renf90.tables.

 Effect group 1 of column 1 with 3 levels, effect # 1
 Value    #    consecutive number
A 3 1
B 3 2
C 4 3
 Effect group 2 of column 1 with 2 levels, effect # 2
 Value    #    consecutive number
1 5 1
2 5 2

In this case, each Effect group has 3 columns: the 1st for the original group code, the 2nd for the number of times the code occurred in the input, and the 3rd for the replaced integer value.

RENUMF90 also generates a parameter file for BLUPF90, renf90.par.
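The three columns of such a table can be reproduced with a short Python sketch. The function name level_table is hypothetical, and the consecutive numbers are assigned in sorted order, which is an assumption that happens to agree with this example.

```python
from collections import Counter


def level_table(codes):
    """Return (original code, count, consecutive number) for one effect."""
    counts = Counter(codes)                       # occurrences of each code
    return [(code, counts[code], number)
            for number, code in enumerate(sorted(counts), start=1)]


# Columns 2 and 3 of the raw data, in row order.
effect1 = ["A", "A", "A", "B", "B", "B", "C", "C", "C", "C"]
effect2 = ["1", "2", "1", "2", "1", "2", "1", "2", "1", "2"]

for row in level_table(effect1) + level_table(effect2):
    print(*row)
```

The counts (3, 3, 4 for effect 1 and 5, 5 for effect 2) agree with the second column of renf90.tables.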

# BLUPF90 parameter file created by RENF90
DATAFILE
 renf90.dat
NUMBER_OF_TRAITS
           1
NUMBER_OF_EFFECTS
           3
OBSERVATION(S)
    1
WEIGHT(S)

EFFECTS: POSITIONS_IN_DATAFILE NUMBER_OF_LEVELS TYPE_OF_EFFECT[EFFECT NESTED]
  2         3 cross
  3         2 cross
  4 1 cov
RANDOM_RESIDUAL VALUES
   1.0000

This file is ready to use in BLUPF90, and the program will run the analysis successfully. In this parameter file, the model description is inherited from the instruction file. For example, the above instruction file contains 3 effects, so renf90.par also contains 3 effects corresponding to the descriptions in the instruction file. Note that the EFFECTS: line is followed by extra words (POSITIONS_IN_DATAFILE etc.), but this is no problem for BLUPF90.

Running BLUPF90 with this renf90.par will produce the same solutions as described before. You can confirm this with these files.

Optional features in RENUMF90

RENUMF90 accepts some options of its own, each given on an additional line beginning with OPTION at the end of the instruction file. The options used in this tutorial are max_string_readline \(n\) and alpha_size \(n\).

Here \(n\) is replaced with an actual integer value, as in the following example.

OPTION max_string_readline 2048

If an option is not recognized by RENUMF90, the program simply passes it through to renf90.par. So, you can put OPTION lines intended for BLUPF90 in the instruction file, and you will obtain a parameter file with the desired options for your analysis. See the following example.

DATAFILE
rawdata1.txt
TRAITS
5
FIELDS_PASSED TO OUTPUT

WEIGHT(S)

RESIDUAL_VARIANCE
1.0
EFFECT
2 cross alpha   # 1st effect
EFFECT
3 cross alpha   # 2nd effect
EFFECT
4 cov           # 3rd effect
OPTION max_string_readline 2048
OPTION alpha_size 40
OPTION solv_method FSPAK

The first 2 options will be used by RENUMF90 itself, but the last one will be passed through to the output file renf90.par.
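The split between consumed and passed-through options can be sketched as follows. The set RENUMF90_OPTIONS contains only the two option names that this tutorial identifies as RENUMF90's own, so it is deliberately incomplete, and the function name split_options is hypothetical.

```python
# Option names that RENUMF90 itself uses, as far as this tutorial states.
# The real program may recognize more; this set is an assumption.
RENUMF90_OPTIONS = {"max_string_readline", "alpha_size"}


def split_options(lines):
    """Separate OPTION lines into (used by RENUMF90, passed to renf90.par)."""
    used, passed = [], []
    for line in lines:
        fields = line.split()
        if len(fields) < 2 or fields[0] != "OPTION":
            continue                              # not an OPTION line
        target = used if fields[1] in RENUMF90_OPTIONS else passed
        target.append(line)
    return used, passed


used, passed = split_options([
    "OPTION max_string_readline 2048",
    "OPTION alpha_size 40",
    "OPTION solv_method FSPAK",
])
print("used by RENUMF90:", used)
print("passed to renf90.par:", passed)
```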

What is the best practice?

Suppose you have to perform many analyses with the same data but different models. How many times do you run RENUMF90? Basically, there are 2 solutions.

  1. In every analysis, rewrite the instruction and run RENUMF90. Then just use the generated renf90.par.
  2. First, prepare the instruction file with the maximal model and run RENUMF90 once. Then, for every analysis, copy the generated renf90.par to a new file and modify it to fit the model.

The better solution depends on your situation. If you have a relatively small data set and renumbering does not take long, the first approach is applicable. If your data set is large and renumbering takes time, the second approach is the only efficient way. Also, in some cases, the generated renf90.par is incomplete and must be edited anyway. The second approach is more general, and it is the use case that the development team assumes.
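The second approach amounts to editing a copy of renf90.par. As a minimal sketch, the Python function below removes one effect from parameter-file text and lowers NUMBER_OF_EFFECTS by one. It assumes the file layout shown in this chapter (the count on the line after NUMBER_OF_EFFECTS, one effect per line after EFFECTS:), the function name drop_effect is hypothetical, and files with random effects may need further edits that this sketch does not handle.

```python
RENF90_PAR = """# BLUPF90 parameter file created by RENF90
DATAFILE
 renf90.dat
NUMBER_OF_TRAITS
           1
NUMBER_OF_EFFECTS
           3
OBSERVATION(S)
    1
WEIGHT(S)

EFFECTS: POSITIONS_IN_DATAFILE NUMBER_OF_LEVELS TYPE_OF_EFFECT[EFFECT NESTED]
  2         3 cross
  3         2 cross
  4 1 cov
RANDOM_RESIDUAL VALUES
   1.0000
"""


def drop_effect(par_text, position):
    """Return parameter-file text without the effect at data column `position`."""
    out, state = [], None
    for line in par_text.splitlines():
        fields = line.split()
        word = fields[0] if fields else ""
        if word == "NUMBER_OF_EFFECTS":
            state = "count"                       # next line holds the count
        elif word == "EFFECTS:":
            state = "effects"                     # effect lines follow
        elif state == "count":
            line = f"{int(line) - 1:12d}"         # one effect fewer
            state = None
        elif state == "effects":
            if not word.isdigit():
                state = None                      # next keyword ends the block
            elif int(word) == position:
                continue                          # skip the dropped effect
        out.append(line)
    return "\n".join(out) + "\n"


reduced = drop_effect(RENF90_PAR, position=3)
print(reduced)
```

The returned text can be saved under a new name and passed to BLUPF90, while renf90.dat is reused as it is.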

Summary
