Practical genomic analysis
September 2019
Files used in genomic analysis
As seen in the tutorial in the previous chapter, PREGSF90 needs 2 files to handle genomic data: an SNP marker file and a cross-reference file. In this section, we describe the format of these 2 files in detail. We also present some other files that are optionally needed, for example in quality control of markers.
PREGSF90 reads the same parameter file as BLUPF90. From this file, the program reads the names of the pedigree file, the marker file, (optionally) the cross-reference file, and some options. The rest of the parameter file (e.g., the name of the data file and the model) is effectively ignored.
SNP file
The PREGSF90 program accepts 2 kinds of SNP files. One contains only integers as genotypes (e.g., from SNP chips); the other contains real numbers as gene content (possibly from imputation software, genotyping-by-sequencing, or low-depth sequencing). We consider the former first and call it an SNP-marker file.
SNP marker file
An SNP-marker file is a text file with 2 fields: the animal ID (possibly alphanumeric) in the 1st field and its genotypes in the 2nd field. We again show the first and the last 5 lines of the example data presented in the previous chapter (each line is truncated to its first 30 characters to save space).
8003 211011112112012000211002
8007 212011111111012011111012
8016 200120111021012111102121
8019 112211021121202111011210
8020 211110101120002021002122
(skip)
13496 101001212012010210212201
13497 101112111021020120222111
13498 200000220202202000022222
13499 011011111111020111121112
13500 101020112122021000221001
The format was already described in the previous chapter.
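As a rough illustration of the two-field layout shown above, the following Python sketch splits each line into an animal ID and a list of per-marker integer codes. The function names are hypothetical and are not part of any BLUPF90 program. The check that every animal has the same number of markers is an assumption of this sketch, not a rule stated in this section.

```python
def parse_snp_marker_line(line):
    """Split one SNP-marker line into (animal_id, genotypes)."""
    fields = line.split()
    if len(fields) != 2:
        raise ValueError(f"expected 2 fields, found {len(fields)}: {line!r}")
    animal_id, genotype_string = fields
    if not (genotype_string.isascii() and genotype_string.isdigit()):
        raise ValueError(f"non-integer genotype code for animal {animal_id}")
    # One character per marker, e.g. "2110" -> [2, 1, 1, 0].
    return animal_id, [int(code) for code in genotype_string]


def parse_snp_marker_file(lines):
    """Parse all lines; assume (in this sketch) equal marker counts per animal."""
    records = [parse_snp_marker_line(line) for line in lines if line.strip()]
    marker_counts = {len(genotypes) for _, genotypes in records}
    if len(marker_counts) > 1:
        raise ValueError(f"inconsistent marker counts: {sorted(marker_counts)}")
    return records


# The first two (truncated) example lines shown above.
example = [
    "8003 211011112112012000211002",
    "8007 212011111111012011111012",
]
records = parse_snp_marker_file(example)
print(records[0][0], records[0][1][:5])
```

Passing an open file object instead of a list works as well, because the functions only iterate over lines.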
Gene content file
The PREGSF90 program also accepts another format, which we call a gene content file. A gene content file may have several fields, subject to the following rules.
- The animal ID must be in the 1st field, and the gene content at each locus must be in the 2nd or later fields.
- Adjacent gene-content values can be separated by white space. Values written with no separating space are also allowed.
- No headers, no comments, no other fields are allowed. The data must start on the first line.
- In animal IDs, letters, numbers, and symbols (possibly ASCII) are acceptable. The link to the renumbered pedigree is provided in the cross-reference file below.
- Gene content should be written as a real number: an integer, a floating-point value (e.g. 3.14), or an exponential expression (e.g. 0.314E+01). All markers must have the same format.
- No missing gene content is allowed, and all animals must have the same number of gene-content values. Missing gene content should be imputed before the analysis.
- The minimum number of gene-content values is 50. A file with fewer markers is not accepted.
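The rules above can be checked before running the analysis. The following Python sketch is one possible pre-check, under the assumption that values are separated by white space (it does not handle values written without separators). The function names and the animal IDs in the example are hypothetical.

```python
import math

MIN_MARKERS = 50  # minimum number of gene-content values stated in the rules


def parse_gene_content_line(line):
    """Return (animal_id, values) for one whitespace-separated line."""
    fields = line.split()
    if len(fields) < 2:
        raise ValueError(f"need an animal ID and gene content: {line!r}")
    animal_id = fields[0]
    # float() accepts integers, decimals (3.14) and exponents (0.314E+01).
    values = [float(token) for token in fields[1:]]
    if not all(math.isfinite(value) for value in values):
        raise ValueError(f"missing or non-finite value for animal {animal_id}")
    return animal_id, values


def validate_gene_content(lines):
    """Check equal value counts across animals and the 50-marker minimum."""
    records = [parse_gene_content_line(line) for line in lines]
    counts = {len(values) for _, values in records}
    if len(counts) != 1:
        raise ValueError(f"animals differ in number of values: {sorted(counts)}")
    n_markers = counts.pop()
    if n_markers < MIN_MARKERS:
        raise ValueError(f"only {n_markers} markers; at least {MIN_MARKERS} needed")
    return records


# Synthetic example with hypothetical IDs and 50 values per animal.
example = [
    "A001 " + " ".join(["1.00"] * 50),
    "A002 " + " ".join(["0.314E+01"] + ["2"] * 49),
]
records = validate_gene_content(example)
print(len(records), len(records[0][1]))
```

A header line or a comment fails in `float()` with a `ValueError`, which matches the rule that the data must start on the first line.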
Cross-reference (XrefID) file
The BLUPF90 programs need a cross-reference file. This file is generated automatically by RENUMF90 and relates the renumbered ID to the original ID of each genotyped animal. Again, we show the first 5 and the last 5 lines of the actual cross-reference file presented in the previous chapter.
6127 8003
13570 8007
406 8016
10802 8019
10924 8020
(skip)
8585 13496
8941 13497
9369 13498
9753 13499
9905 13500
This file contains just 2 fields: the first is the renumbered ID (the same as in the pedigree file) and the second is the original ID (the same as in the marker file). The 2 fields must be separated by at least 1 space. A tab is not allowed. The order of animals must be the same as in the SNP file. Again, this file is generated by RENUMF90, and the user should not edit it without a good reason.
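Because the cross-reference file must list animals in the same order as the SNP file, a quick consistency check can catch an accidental edit. The Python sketch below is illustrative only: the function names are hypothetical, and treating the renumbered ID as an integer is an assumption based on the example lines.

```python
def read_xref(lines):
    """Return a list of (renumbered_id, original_id) pairs."""
    pairs = []
    for line in lines:
        if "\t" in line:
            raise ValueError("tabs are not allowed; separate fields with spaces")
        fields = line.split()
        if len(fields) != 2:
            raise ValueError(f"expected 2 fields, found {len(fields)}: {line!r}")
        renumbered_id, original_id = fields
        # Assumption: renumbered IDs are integers, as in the example above.
        pairs.append((int(renumbered_id), original_id))
    return pairs


def first_order_mismatch(xref_pairs, snp_ids):
    """Return the 0-based position of the first disagreement, or None."""
    xref_ids = [original_id for _, original_id in xref_pairs]
    for position, (xref_id, snp_id) in enumerate(zip(xref_ids, snp_ids)):
        if xref_id != snp_id:
            return position
    if len(xref_ids) != len(snp_ids):
        # One file is longer; the mismatch starts where the shorter one ends.
        return min(len(xref_ids), len(snp_ids))
    return None


xref = read_xref(["6127 8003", "13570 8007", "406 8016"])
snp_ids = ["8003", "8007", "8016"]  # IDs in the order they appear in the SNP file
print(first_order_mismatch(xref, snp_ids))
```

A result of `None` means the two files agree in both order and length.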
Allele frequency file (optional)
An allele frequency file contains the allele frequency of each marker. This file is optional because, by default, the allele frequencies are calculated from the current SNP file. Provide this file only if you want to use allele frequencies from an external source. Different allele frequencies may change the characteristics of \(\mathbf{G}\).
Here is an example file. This contains the allele frequency only for the first 5 markers.
1 0.711667
2 0.328000
3 0.422000
4 0.157000
5 0.492333
This file contains 2 columns:
- The position of the marker (from 1 to the maximum number of markers)
- Allele frequency as a real value (ranging from 0 to 1).
Note that PREGSF90 usually creates this file containing the allele frequency calculated from the current SNP file by default (the file name is freqdata.count).
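To make the two-column layout concrete, the following Python sketch computes per-marker allele frequencies from integer genotypes and formats them like the example above. It assumes that genotype codes 0, 1, and 2 count the copies of one allele, so the frequency is the sum of codes divided by twice the number of animals. Which allele PREGSF90 counts and how it treats missing genotypes are not covered in this section, so this is an illustration rather than a description of the program's internals. The function names are hypothetical.

```python
def allele_frequencies(genotype_rows):
    """Return one frequency per marker from rows of 0/1/2 genotype codes."""
    if not genotype_rows:
        raise ValueError("no genotyped animals")
    n_animals = len(genotype_rows)
    n_markers = len(genotype_rows[0])
    frequencies = []
    for marker in range(n_markers):
        # Assumption: each code is the number of copies of the counted allele.
        copies = sum(row[marker] for row in genotype_rows)
        frequencies.append(copies / (2 * n_animals))
    return frequencies


def format_frequency_lines(frequencies):
    """Format as 'marker_position frequency', with positions starting at 1."""
    return [f"{position} {frequency:.6f}"
            for position, frequency in enumerate(frequencies, start=1)]


# Four hypothetical animals genotyped at three markers.
genotypes = [
    [2, 1, 0],
    [1, 1, 0],
    [2, 0, 0],
    [1, 2, 2],
]
lines = format_frequency_lines(allele_frequencies(genotypes))
print("\n".join(lines))
```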
Map file (optional)
A map file relates a marker to a chromosome, a physical location, and a specific name. This file is needed only for comprehensive quality control and for GWAS with ssGBLUP. The file should contain at least 3 fields separated by at least 1 white space. The 4th column is optional and contains the name of the marker. Here we show an example (including the 4th field in this case). Only the first 3 lines are shown.
1 1 127 SNP-CODE-1
2 1 652 SNP-CODE-2
3 1 1022 SNP-CODE-3
This file follows these rules.
- The first 3 fields should contain integer values: the first is a marker number, the second specifies the chromosome number, and the third represents the physical location on the chromosome. The marker number is just an integer, not necessarily consecutive, used for external data manipulation - it is not actually used by the program.
- The sex chromosome (X) can be present, but it must also be coded as an integer value. The code of the X chromosome is 0 by default.
- The 4th column (optional) can contain any letters, numbers, and symbols (possibly in ASCII) up to 50 characters.
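The map-file rules above translate into a small parser. This Python sketch uses a hypothetical `MapEntry` type and function name. Ignoring any fields beyond the 4th is a choice made for this sketch, since the text does not say how extra fields are handled.

```python
from typing import NamedTuple, Optional

MAX_NAME_LENGTH = 50  # limit on the optional marker name stated in the rules


class MapEntry(NamedTuple):
    marker_number: int
    chromosome: int   # integer code; X is coded as 0 by default
    position: int
    name: Optional[str]


def parse_map_line(line):
    """Parse one map-file line into a MapEntry."""
    fields = line.split()
    if len(fields) < 3:
        raise ValueError(f"expected at least 3 fields: {line!r}")
    # int() raises ValueError if any of the first 3 fields is not an integer.
    marker_number, chromosome, position = (int(field) for field in fields[:3])
    # Fields after the 4th are ignored in this sketch (an assumption).
    name = fields[3] if len(fields) > 3 else None
    if name is not None and len(name) > MAX_NAME_LENGTH:
        raise ValueError(f"marker name longer than {MAX_NAME_LENGTH} characters")
    return MapEntry(marker_number, chromosome, position, name)


entries = [parse_map_line(line) for line in (
    "1 1 127 SNP-CODE-1",
    "2 1 652 SNP-CODE-2",
    "3 1 1022",
)]
print(entries[0])
```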
Weight for SNP (optional)
The PREGSF90 program can create an alternative \(\mathbf{G}\) by weighting each SNP marker. This weighted matrix is especially useful for GWAS or related analyses with the POSTGSF90 program. If you supply this file, the weighted \(\mathbf{G}\) will be calculated. Otherwise, all weights are set to 1, i.e., no specific weight is applied to any SNP marker. This file is optional.
The file contains only 1 column of real values. Each row corresponds to one SNP marker: the first row contains the weight for marker 1, the second row contains the weight for marker 2, and so on. The number of lines in this file should not exceed the number of markers.
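A weight file is simple enough to check with a few lines of code. The Python sketch below uses a hypothetical function name. It returns all-ones weights when no file is supplied, mirroring the default described above, and rejects files with more lines than markers. What the program does when the file has fewer lines than markers is not stated here, so the sketch returns the weights as read in that case.

```python
def read_snp_weights(lines, n_markers):
    """Return SNP weights; all ones when no weight file is supplied."""
    if lines is None:
        return [1.0] * n_markers
    weights = []
    for line in lines:
        fields = line.split()
        if len(fields) != 1:
            raise ValueError(f"expected exactly 1 column: {line!r}")
        weights.append(float(fields[0]))
    if len(weights) > n_markers:
        raise ValueError(
            f"{len(weights)} weights supplied for only {n_markers} markers")
    return weights


print(read_snp_weights(None, 3))
print(read_snp_weights(["0.5", "2.0", "1.0"], 3))
```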