Practical genomic analysis

Yutaka Masuda

September 2019


Files used in genomic analysis

As seen in the tutorial in the previous chapter, PREGSF90 needs 2 files to handle genomic data: an SNP marker file and a cross-reference file. This section describes the format of these 2 files in detail. It also presents some other files that are optionally needed in quality control of markers.

PREGSF90 reads the same parameter file as BLUPF90. From this file, the program reads the names of the pedigree file, the marker file, (optionally) the cross-reference file, and some options. The rest of the parameter file (e.g., the name of the data file and the model) is effectively ignored.

SNP file

The PREGSF90 program accepts 2 kinds of SNP files. One contains only integers as genotypes (e.g., from SNP chips), and the other contains real numbers as gene content (possibly from imputation software, genotyping-by-sequencing, or low-depth sequencing). We first consider the former, which we call an SNP-marker file.

SNP marker file

An SNP-marker file is a text file with 2 fields: the animal ID (possibly alphanumeric) in the 1st field and its genotypes in the 2nd field. Here again are the first and the last 5 lines of the example data presented in the previous chapter (each line is limited to its first 30 characters to save space).

 8003   211011112112012000211002
 8007   212011111111012011111012
 8016   200120111021012111102121
 8019   112211021121202111011210
 8020   211110101120002021002122
   (skip)
13496   101001212012010210212201
13497   101112111021020120222111
13498   200000220202202000022222
13499   011011111111020111121112
13500   101020112122021000221001

The format was already described in the previous chapter.
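
As a rough illustration of the two-field layout, the following Python sketch reads an SNP-marker file and checks that every animal has the same number of markers. The function name `read_snp_file` is hypothetical and not part of the BLUPF90 software, and splitting each line on whitespace is a simplifying assumption that may be looser than what PREGSF90 itself requires.

```python
import io


def read_snp_file(handle):
    """Return {animal_id: genotype_string} from an SNP-marker file.

    Hypothetical helper: assumes exactly 2 whitespace-separated fields
    per line (animal ID, then one genotype character per marker).
    """
    genotypes = {}
    for line_no, line in enumerate(handle, start=1):
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if len(fields) != 2:
            raise ValueError(f"line {line_no}: expected 2 fields, got {len(fields)}")
        animal_id, calls = fields
        genotypes[animal_id] = calls

    # Every animal should be genotyped for the same number of markers.
    lengths = {len(calls) for calls in genotypes.values()}
    if len(lengths) > 1:
        raise ValueError(f"unequal numbers of markers: {sorted(lengths)}")
    return genotypes


# The first 2 lines of the example shown above.
sample = io.StringIO(
    " 8003   211011112112012000211002\n"
    " 8007   212011111111012011111012\n"
)
snp = read_snp_file(sample)
```

For a real file, `open("snp.txt")` could be passed in place of the `io.StringIO` object; the file name here is only a placeholder.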

Gene content file

The PREGSF90 program accepts another format. Here, we call it a gene content file. The gene content file may have a number of fields with the following rules.

Cross-reference (XrefID) file

The BLUPF90 programs need a cross-reference file. This file is automatically generated with RENUMF90. This file relates a renumbered ID to the original ID for genotyped animals. Again, we show the first 5 and the last 5 lines of the actual cross-reference file presented in the previous chapter.

6127 8003
13570 8007
406 8016
10802 8019
10924 8020
  (skip)
8585 13496
8941 13497
9369 13498
9753 13499
9905 13500

This file contains just 2 fields: the first is the renumbered ID (the same as in the pedigree file) and the second is the original ID (the same as in the marker file). The 2 fields must be separated by at least 1 space; a tab is not allowed. The order of animals must be the same as in the SNP file. Again, this file is generated by RENUMF90, and the user should not edit it without a good reason.
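
The ordering rule can be checked before running PREGSF90. The sketch below is one possible check, not a tool shipped with BLUPF90: the function name `check_xref_order` is hypothetical, and it assumes both files have already been read into lists of lines.

```python
def check_xref_order(xref_lines, snp_lines):
    """Return (position, xref_original_id, snp_id) for every mismatch.

    Hypothetical helper. 'position' counts non-blank lines from 1.
    """
    xref_ids = []
    for line_no, line in enumerate(xref_lines, start=1):
        if "\t" in line:
            # The cross-reference file must use spaces, not tabs.
            raise ValueError(f"xref line {line_no}: tabs are not allowed")
        if line.strip():
            xref_ids.append(line.split()[1])  # 2nd field: original ID

    # 1st field of the SNP file: original ID
    snp_ids = [line.split()[0] for line in snp_lines if line.strip()]

    if len(xref_ids) != len(snp_ids):
        raise ValueError("the two files list different numbers of animals")

    return [
        (i, x, s)
        for i, (x, s) in enumerate(zip(xref_ids, snp_ids), start=1)
        if x != s
    ]


# The first 3 lines of the example files shown above.
xref_lines = ["6127 8003", "13570 8007", "406 8016"]
snp_lines = [
    " 8003   211011112112012000211002",
    " 8007   212011111111012011111012",
    " 8016   200120111021012111102121",
]
mismatches = check_xref_order(xref_lines, snp_lines)
```

An empty list means that the original IDs appear in the same order in both files.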

Allele frequency file (optional)

An allele frequency file contains the allele frequency of each marker. This file is optional because, by default, the allele frequencies are calculated from the current SNP file. Provide this file only if you want to use allele frequencies from an external source. Different allele frequencies may change the characteristics of \(\mathbf{G}\).

Here is an example file. This contains the allele frequency only for the first 5 markers.

1     0.711667
2     0.328000
3     0.422000
4     0.157000
5     0.492333

This file contains 2 columns:

  1. The position of the marker (from 1 to the maximum number of markers)
  2. Allele frequency as a real value (ranging from 0 to 1).

Note that, by default, PREGSF90 writes the allele frequencies calculated from the current SNP file to a file named freqdata.count.
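
To make the 2-column layout concrete, here is a minimal sketch that computes allele frequencies from genotype strings and prints them in the same form. It rests on assumptions not stated in this section: the codes 0, 1, and 2 are taken to count the copies of one allele, so the frequency is the sum of the codes divided by twice the number of animals, and any other code is treated as missing and skipped. The function name `allele_frequencies` is hypothetical, and the result is not guaranteed to match freqdata.count.

```python
def allele_frequencies(genotype_strings):
    """Return one allele frequency per marker (hypothetical helper).

    Assumes codes 0/1/2 count copies of one allele; other codes are
    treated as missing and ignored.
    """
    n_markers = len(genotype_strings[0])
    freqs = []
    for j in range(n_markers):
        calls = [int(g[j]) for g in genotype_strings if g[j] in "012"]
        if calls:
            freqs.append(sum(calls) / (2 * len(calls)))
        else:
            freqs.append(float("nan"))  # no usable genotype for this marker
    return freqs


# Toy data: 3 animals, 3 markers (not taken from the example file).
freqs = allele_frequencies(["210", "211", "200"])

# Print in the 2-column layout: marker position, then frequency.
for position, p in enumerate(freqs, start=1):
    print(f"{position}     {p:.6f}")
```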

Map file (optional)

A map file relates a marker to a chromosome, a physical location, and a specific name. This file is needed only for comprehensive quality control and for GWAS with ssGBLUP. The file should contain at least 3 fields separated by at least 1 whitespace character. The 4th field is optional and contains the name of the marker. Here we show an example (including the 4th field in this case). Only the first 3 lines are shown.

1  1     127     SNP-CODE-1
2  1     652     SNP-CODE-2
3  1    1022     SNP-CODE-3

This file follows the rules.
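
A line of the map file can be parsed along the following lines. This is only a sketch: the names `MapRecord` and `parse_map_line` are hypothetical, and the meaning assigned to the first 3 fields (marker index, chromosome number, position) is an assumption inferred from the example above.

```python
from typing import NamedTuple, Optional


class MapRecord(NamedTuple):
    """One map-file line (field meanings are assumed, see the text)."""
    index: int            # assumed: position of the marker in the SNP file
    chromosome: int       # assumed: chromosome number
    position: int         # assumed: physical location
    name: Optional[str]   # optional 4th field: marker name


def parse_map_line(line):
    """Split one map-file line on whitespace into a MapRecord."""
    fields = line.split()
    if len(fields) < 3:
        raise ValueError("a map line needs at least 3 fields")
    name = fields[3] if len(fields) > 3 else None
    return MapRecord(int(fields[0]), int(fields[1]), int(fields[2]), name)


# The 2nd line of the example shown above.
record = parse_map_line("2  1     652     SNP-CODE-2")
```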

Weight for SNP (optional)

The PREGSF90 program can create an alternative \(\mathbf{G}\) by weighting each SNP marker. This weighted matrix is especially useful for GWAS or related analyses with the POSTGSF90 program. If you supply this file, the weighted \(\mathbf{G}\) will be calculated. Otherwise, all weights are set to 1, i.e., no marker receives a specific weight. This file is optional.

The file contains only 1 column of real values. Each line corresponds to one SNP marker: the first line contains the weight for marker 1, the second line contains the weight for marker 2, and so on. The number of lines in this file shouldn’t exceed the number of markers.
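
The 1-column layout can be read and checked with a few lines of Python. The function name `read_weights` is hypothetical, and the only rule enforced is the one stated above, namely that the file has no more lines than there are markers.

```python
def read_weights(lines, n_markers):
    """Return the weights as floats, one per non-blank line.

    Hypothetical helper: line k holds the weight for marker k.
    """
    weights = [float(line) for line in lines if line.strip()]
    if len(weights) > n_markers:
        raise ValueError(
            f"{len(weights)} weights given for only {n_markers} markers"
        )
    return weights


# Uniform weights for 5 markers, equivalent to supplying no weights.
uniform_lines = ["1.0"] * 5
weights = read_weights(uniform_lines, n_markers=5)
```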
