Distribution of Repeat Values at Various STR Sites for Haplogroup I1a
Analysis performed December 2005 to January 2006 by Gordon Hamilton
Some time ago I estimated the distribution of repeat values at various STR sites for those in haplogroup I1a. The results were obtained from an analysis of data in the Sorenson Molecular Genealogy Foundation (SMGF) database that was available at that time (July 2004). In the current presentation this analysis has been refined, brought up to date using the latest SMGF database, and expanded to more markers (including some markers not present in the SMGF database; these data were obtained from Ysearch), and the results are presented in a better format (with acknowledgement to Whit Athey's presentation of similar R1b data). Ken Nordtvedt has reported a detailed analysis of the characteristics of various I1a groups and the more common repeat values at STR sites for each group. The information presented here is a useful adjunct to Nordtvedt's analysis because it gives the frequencies of the less common repeat values at each STR site as well as those of the more common values. Thus, for any specific I1a haplotype, one can get a better understanding of how common or unusual its repeat values at various STR sites are.
The methods used to obtain the data in the following tables are described after the tables. The tables for the first 37 marker sites are arranged in Family Tree DNA (FTDNA) order, and the sites are defined as FTDNA defines them. The subsequent tables (38 to 48) are ordered as SMGF orders these markers, and the marker sites are defined as DNA Heritage and Relative Genetics define them.
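As an illustration of how the tables can be read for a specific haplotype, the following is a minimal sketch rather than part of the original analysis. The counts are copied from tables 1 (DYS393) and 2 (DYS390); the names `TABLES`, `frequency`, and `describe` are hypothetical, and each frequency is simply the count divided by the sum of counts in that table.

```python
# Counts copied from tables 1 (DYS393) and 2 (DYS390) on this page.
# The structure and function names are illustrative assumptions.
TABLES = {
    "DYS393": {11: 1, 12: 23, 13: 1086, 14: 104, 15: 15},
    "DYS390": {21: 12, 22: 782, 23: 396, 24: 39, 25: 1},
}


def frequency(marker, repeats):
    """Fraction of the sampled I1a records that have this repeat value."""
    counts = TABLES[marker]
    total = sum(counts.values())
    # A value never observed in the sample gets a frequency of zero.
    return counts.get(repeats, 0) / total


def describe(haplotype):
    """Map each marker of a haplotype {marker: repeats} to a percentage."""
    return {m: round(100 * frequency(m, r), 1) for m, r in haplotype.items()}


# A haplotype with DYS393 = 14 (uncommon) and DYS390 = 22 (the usual value).
example = describe({"DYS393": 14, "DYS390": 22})
print(example)
```

Extending `TABLES` with the remaining markers would give the same lookup for a full haplotype.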
1 | DYS393 | N = 1229 |
Repeats | Count | Percent |
11 | 1 | 0.1 |
12 | 23 | 1.9 |
13 | 1086 | 88 |
14 | 104 | 8.5 |
15 | 15 | 1.2 |

2 | DYS390 | N = 1230 |
Repeats | Count | Percent |
21 | 12 | 1.0 |
22 | 782 | 64 |
23 | 396 | 32 |
24 | 39 | 3.2 |
25 | 1 | 0.1 |

3 | DYS19/394 | N = 1230 |
Repeats | Count | Percent |
13 | 13 | 1.1 |
14 | 987 | 80 |
15 | 195 | 16 |
16 | 29 | 2.4 |
17 | 6 | 0.5 |

4 | DYS391 | N = 1229 |
Repeats | Count | Percent |
9 | 23 | 1.9 |
10 | 1097 | 89 |
11 | 107 | 8.7 |
12 | 2 | 0.2 |

5 | DYS385a | N = 1228 |
Repeats | Count | Percent |
11 | 5 | 0.4 |
12 | 37 | 3.0 |
13 | 721 | 59 |
14 | 434 | 35 |
15 | 31 | 2.5 |

6 | DYS385b | N = 1228 |
Repeats | Count | Percent |
12 | 1 | 0.1 |
13 | 84 | 6.8 |
14 | 851 | 69 |
15 | 251 | 20 |
16 | 34 | 2.8 |
17 | 7 | 0.6 |

7 | DYS426 | N = 1254 |
Repeats | Count | Percent |
10 | 22 | 1.8 |
11 | 1227 | 98 |
12 | 5 | 0.4 |

8 | DYS388 | N = 1229 |
Repeats | Count | Percent |
12 | 3 | 0.2 |
13 | 17 | 1.4 |
14 | 1105 | 90 |
15 | 48 | 3.9 |
16 | 55 | 4.5 |
17 | 1 | 0.1 |

9 | DYS439 | N = 1230 |
Repeats | Count | Percent |
9 | 1 | 0.1 |
10 | 40 | 3.3 |
11 | 925 | 75 |
12 | 216 | 18 |
13 | 43 | 3.5 |
14 | 5 | 0.4 |

10 | DYS389i | N = 1228 |
Repeats | Count | Percent |
11 | 11 | 0.9 |
12 | 1113 | 91 |
13 | 97 | 7.9 |
14 | 7 | 0.6 |

11 | DYS392 | N = 1264 |
Repeats | Count | Percent |
10 | 4 | 0.3 |
11 | 1227 | 97 |
12 | 27 | 2.1 |
13 | 6 | 0.4 |

12 | DYS389ii | N = 1228 |
Repeats | Count | Percent |
26 | 3 | 0.2 |
27 | 33 | 2.7 |
28 | 897 | 73 |
29 | 251 | 20 |
30 | 38 | 3.1 |
31 | 6 | 0.5 |

12a | DYS389ii-389i | N = 1228 |
Repeats | Count | Percent |
14 | 1 | 0.1 |
15 | 28 | 2.3 |
16 | 985 | 80 |
17 | 194 | 16 |
18 | 18 | 1.5 |
19 | 2 | 0.2 |

13 | DYS458 | N = 1228 |
Repeats | Count | Percent |
13 | 5 | 0.4 |
14 | 143 | 12 |
15 | 790 | 64 |
16 | 241 | 20 |
17 | 43 | 3.5 |
18 | 6 | 0.5 |

14 | DYS459a | N = 1228 |
Repeats | Count | Percent |
7 | 49 | 4.0 |
8 | 1144 | 93 |
9 | 35 | 2.9 |

15 | DYS459b | N = 1228 |
Repeats | Count | Percent |
8 | 32 | 2.6 |
9 | 1175 | 96 |
10 | 21 | 1.7 |

16 | DYS455 | N = 1227 |
Repeats | Count | Percent |
8 | 1227 | 100 |

17 | DYS454 | N = 1259 |
Repeats | Count | Percent |
9 | 2 | 0.2 |
10 | 9 | 0.7 |
11 | 1227 | 97 |
12 | 21 | 1.7 |

18 | DYS447 | N = 1228 |
Repeats | Count | Percent |
20 | 1 | 0.1 |
21 | 30 | 2.4 |
22 | 313 | 25 |
23 | 736 | 60 |
24 | 142 | 12 |
25 | 6 | 0.5 |

19 | DYS437 | N = 1194 |
Repeats | Count | Percent |
15 | 63 | 5.3 |
16 | 1112 | 93 |
17 | 19 | 1.6 |

20 | DYS448 | N = 1227 |
Repeats | Count | Percent |
18 | 6 | 0.5 |
19 | 79 | 6.4 |
20 | 1054 | 86 |
21 | 83 | 6.8 |
22 | 5 | 0.4 |

21 | DYS449 | N = 1230 |
Repeats | Count | Percent |
24 | 1 | 0.1 |
25 | 11 | 0.9 |
26 | 67 | 5.4 |
27 | 85 | 6.9 |
28 | 525 | 43 |
29 | 365 | 30 |
30 | 128 | 10 |
31 | 43 | 3.5 |
32 | 3 | 0.2 |
33 | 2 | 0.2 |

22,23,24,25 | DYS464a,b,c,d | N = 1495 |
Repeats | Count | Percent |
12,14,15,16 | 450 | 30 |
12,14,15,15 | 299 | 20 |
12,14,14,16 | 78 | 5.2 |
12,14,14,15 | 60 | 4.0 |
12,14,15,17 | 53 | 3.5 |
12,14,16,16 | 52 | 3.5 |
12,13,15,16 | 45 | 3.0 |
12,15,15,16 | 31 | 2.1 |
14,14,16,16 | 30 | 2.0 |
14,14,15,15 | 30 | 2.0 |
12,12,14,15 | 29 | 1.9 |
12,15,16,16 | 29 | 1.9 |
11,14,14,16 | 28 | 1.9 |
12,12,14,16 | 27 | 1.8 |
12,15,15,15 | 25 | 1.7 |
12,12,15,16 | 22 | 1.5 |
12,14,16,17 | 20 | 1.3 |
13,14,15,16 | 16 | 1.1 |

26 | DYS460 | N = 1230 |
Repeats | Count | Percent |
9 | 37 | 3.0 |
10 | 924 | 75 |
11 | 258 | 21 |
12 | 11 | 0.9 |

27 | GATA-H4 | N = 1228 |
Repeats | Count | Percent |
8 | 2 | 0.2 |
9 | 133 | 11 |
10 | 1014 | 82 |
11 | 78 | 6.0 |
12 | 3 | 0.2 |

28 | YCAIIa | N = 1228 |
Repeats | Count | Percent |
15 | 1 | 0.1 |
17 | 3 | 0.2 |
18 | 16 | 1.3 |
19 | 1171 | 95 |
20 | 13 | 1.1 |
21 | 24 | 2.0 |

29 | YCAIIb | N = 1228 |
Repeats | Count | Percent |
19 | 19 | 1.5 |
20 | 18 | 1.5 |
21 | 1164 | 95 |
22 | 25 | 2.0 |
23 | 2 | 0.2 |

30 | DYS456 | N = 1229 |
Repeats | Count | Percent |
12 | 2 | 0.2 |
13 | 61 | 5.0 |
14 | 870 | 71 |
15 | 256 | 21 |
16 | 29 | 2.4 |
17 | 11 | 0.9 |

31 | DYS607 | N = 803 |
Repeats | Count | Percent |
12 | 2 | 0.3 |
13 | 46 | 5.7 |
14 | 632 | 79 |
15 | 112 | 14 |
16 | 10 | 1.2 |
17 | 1 | 0.1 |

32 | DYS576 | N = 802 |
Repeats | Count | Percent |
14 | 2 | 0.2 |
15 | 67 | 8.4 |
16 | 392 | 49 |
17 | 247 | 31 |
18 | 81 | 10 |
19 | 11 | 1.4 |
20 | 2 | 0.2 |

33 | DYS570 | N = 802 |
Repeats | Count | Percent |
16 | 3 | 0.4 |
17 | 15 | 1.9 |
18 | 134 | 17 |
19 | 239 | 30 |
20 | 223 | 28 |
21 | 140 | 17 |
22 | 35 | 4.4 |
23 | 8 | 1.0 |
24 | 5 | 0.6 |

34 | CDYa | N = 799 |
Repeats | Count | Percent |
32 | 5 | 0.6 |
33 | 54 | 6.8 |
34 | 178 | 22 |
35 | 307 | 38 |
36 | 172 | 22 |
37 | 65 | 8.1 |
38 | 14 | 1.8 |
39 | 3 | 0.4 |
40 | 1 | 0.1 |

35 | CDYb | N = 799 |
Repeats | Count | Percent |
33 | 1 | 0.1 |
34 | 20 | 2.5 |
35 | 87 | 11 |
36 | 207 | 26 |
37 | 225 | 28 |
38 | 167 | 21 |
39 | 67 | 8.4 |
40 | 19 | 2.4 |
41 | 6 | 0.8 |

36 | DYS442 | N = 1178 |
Repeats | Count | Percent |
10 | 1 | 0.1 |
11 | 135 | 11 |
12 | 888 | 75 |
13 | 133 | 11 |
14 | 19 | 1.6 |
15 | 2 | 0.2 |

37 | DYS438 | N = 1258 |
Repeats | Count | Percent |
9 | 19 | 1.5 |
10 | 1227 | 98 |
11 | 12 | 1.0 |

38 | DYS441 | N = 1211 |
Repeats | Count | Percent |
13 | 1 | 0.1 |
14 | 17 | 1.4 |
15 | 213 | 18 |
16 | 908 | 75 |
17 | 70 | 5.8 |
18 | 2 | 0.2 |

39 | DYS444 | N = 1165 |
Repeats | Count | Percent |
10 | 1 | 0.1 |
11 | 7 | 0.6 |
12 | 193 | 17 |
13 | 789 | 68 |
14 | 161 | 14 |
15 | 14 | 1.2 |

40 | DYS445 | N = 1274 |
Repeats | Count | Percent |
10 | 6 | 0.5 |
11 | 1227 | 96 |
12 | 41 | 3.2 |

41 | DYS446 | N = 1157 |
Repeats | Count | Percent |
11 | 17 | 1.5 |
12 | 95 | 8.2 |
13 | 801 | 69 |
14 | 211 | 18 |
15 | 29 | 2.5 |
16 | 4 | 0.3 |

42 | DYS452 | N = 1071 |
Repeats | Count | Percent |
10 | 6 | 0.6 |
11 | 4 | 0.4 |
12 | 1024 | 96 |
13 | 35 | 3.3 |
14 | 2 | 0.2 |

43 | DYS461 | N = 1230 |
Repeats | Count | Percent |
10 | 5 | 0.4 |
11 | 161 | 13 |
12 | 971 | 79 |
13 | 90 | 7.3 |
14 | 3 | 0.2 |

44 | DYS462 | N = 1194 |
Repeats | Count | Percent |
11 | 9 | 0.8 |
12 | 871 | 73 |
13 | 306 | 26 |
14 | 8 | 0.7 |

45 | DYS463 | N = 1226 |
Repeats | Count | Percent |
17 | 2 | 0.2 |
18 | 39 | 3.2 |
19 | 1136 | 93 |
20 | 48 | 3.9 |
21 | 1 | 0.1 |

46 | GGAAT1B07 | N = 1174 |
Repeats | Count | Percent |
9 | 2 | 0.2 |
10 | 35 | 3.0 |
11 | 1106 | 94 |
12 | 31 | 2.6 |

47 | GATA-C4 | N = 1190 |
Repeats | Count | Percent |
20 | 22 | 1.8 |
21 | 474 | 40 |
22 | 494 | 42 |
23 | 148 | 12 |
24 | 40 | 3.4 |
25 | 12 | 1.0 |

48 | GATA-A10 | N = 1194 |
Repeats | Count | Percent |
9 | 1 | 0.1 |
10 | 1 | 0.1 |
11 | 9 | 0.8 |
12 | 160 | 14 |
13 | 855 | 72 |
14 | 122 | 10 |
15 | 34 | 2.9 |

The SMGF database examined in this study was the one on the web in December 2005; it was reported to contain 11,095 genotypes with all the SMGF markers examined, and 13,489 genotypes if those without a complete marker set are included. To obtain results from the database one must input values for at least 7 markers. It has long been known that a repeat value of 8 at DYS455 is characteristic of haplogroup I1a and that this value distinguishes I1a from virtually all other haplogroups. In preliminary analyses it was determined that, for haplogroup I1a, a single value was present 96% or more of the time at each of 5 additional markers. These markers (with the predominant repeat value for each in parentheses) are DYS392 (11), DYS426 (11), DYS438 (10), DYS445 (11), and DYS454 (11). Thus, the database was searched using these values at the 6 sites (DYS455 plus the 5 additional markers) together with various values at each of the other sites in turn. In each case the number of exact 7/7 matches was recorded, and the results are given in the preceding tables. It can be calculated that this method would extract results for more than 85% of the I1a records in the database. For most markers a total of about 1230 exact matches was obtained across the various repeat values, indicating that a little over 10% of those in the database are in haplogroup I1a.
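One way to arrive at the "more than 85%" figure is sketched below. This is an illustration, not necessarily the author's calculation: it assumes the five markers vary independently, takes the count at the predominant value and N from tables 11, 7, 37, 40, and 17, and assumes the 13,489-genotype total as the denominator for the database share. All names are hypothetical.

```python
# (count at the predominant value, N) copied from tables 11, 7, 37, 40, 17.
FIXED_MARKERS = {
    "DYS392": (1227, 1264),
    "DYS426": (1227, 1254),
    "DYS438": (1227, 1258),
    "DYS445": (1227, 1274),
    "DYS454": (1227, 1259),
}


def coverage(markers):
    """Fraction of I1a records expected to carry every predominant value.

    Assumes the markers vary independently, so the joint frequency is
    the product of the individual frequencies.
    """
    result = 1.0
    for modal_count, n in markers.values():
        result *= modal_count / n
    return result


cov = coverage(FIXED_MARKERS)

# About 1230 exact matches were found per marker (from the text above).
matches = 1230
estimated_i1a = matches / cov      # corrected for records the search misses

# Assumption: the share is taken against all 13,489 genotypes.
share_of_db = estimated_i1a / 13489

print(f"coverage = {cov:.3f}, I1a share = {share_of_db:.3f}")
```

Under these assumptions the coverage comes out at roughly 87% and the I1a share at roughly 10.5%, in line with the figures quoted in the text.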
Since a repeat value of 8 at DYS455 is considered to be characteristic of I1a, no serious attempt was made to search for any dispersion at this site, mainly because the search techniques used here turn up records for other haplogroups when repeat values other than 8 are used for DYS455. In an earlier discussion Phil Goff and Ken Nordtvedt noted that, although very rare, repeat values of 7 and 9 for DYS455 have in a few cases been observed in what appear in other respects to be I1a haplotypes. Also, Phil Goff has pointed out that a value of 9 at this site is observed for some individuals thought to be I1a in both the Sorrell (kit 23875) and Thornton (Yellow Group) surname projects. The author thanks Phil for pointing these out.
To obtain estimates of the dispersion in repeat values for the other DYS sites used in all the foregoing searches (namely 392, 426, 438, 445, and 454), searches were performed using repeat values of 9, 10, and 11 for DYS391 while the repeat value at each of these 5 sites was varied in turn. The total number of exact 7/7 matches was recorded and is given in the foregoing tables for each of these 5 sites. Since repeat values of 9, 10, and 11 represent virtually all those for DYS391, the total number of recorded exact matches at each of these 5 sites is larger than 1230, because one is now also picking up the few records that have unusual repeat values at the site being varied.
Data are presented in some of the foregoing tables for a few markers (DYS464a-d, 607, 576, 570, CDYa, and CDYb) that are not present in the SMGF database but are relatively common because Family Tree DNA analyzes for them. These data were extracted from the Ysearch database. To search this database one must enter repeat values for 8 markers. However, because of a quirk in the way DYS464a,b,c,d and YCAIIa,b are treated in these searches, it is possible to search the database for all other markers using legitimate values for only 2 markers. This is because the 6 markers DYS464a,b,c,d and YCAIIa,b are treated as infinite alleles, so no matter what specific value is entered for each of these markers, a genetic distance of one is calculated for any mismatch. Thus, to search for the I1a haplogroup values for any other marker in this database, the procedure is to set DYS455 = 8, enter essentially impossible values for DYS464a,b,c,d (22 was used in each case) and for YCAIIa,b (25 was used), then vary the value for the marker of interest and record the number of 8-marker matches that are at a genetic distance of exactly 6 (the mismatches will be for DYS464a,b,c,d and YCAIIa,b). This is how the data in the tables for DYS607, 576, and 570, as well as for CDYa and CDYb, were obtained. One can use the same procedure to extract I1a haplogroup values for other markers from Ysearch; these agree quite well with those extracted from the SMGF database, with only minor differences, presumably due to the presence in Ysearch of data from large surname studies. Since this probably occurs to a lesser extent in the SMGF database, the SMGF data are given in the preceding tables for those markers that SMGF analyzes. However, it is worth reiterating that the repeat value distributions for the various markers obtained from either the SMGF or the Ysearch database are very similar.
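The logic of this search trick can be sketched as follows. This is an illustrative model only, not Ysearch's actual implementation: the text states only that the six DYS464 and YCAII markers score a distance of one per mismatch, so the stepwise scoring (absolute difference in repeats) used here for the other markers is an assumption, and the function names and the sample record are hypothetical.

```python
DYS464 = ("DYS464a", "DYS464b", "DYS464c", "DYS464d")
YCAII = ("YCAIIa", "YCAIIb")
INFINITE_ALLELE = set(DYS464) | set(YCAII)


def genetic_distance(query, record):
    """Sum the per-marker distances over the markers in the query."""
    distance = 0
    for marker, value in query.items():
        other = record[marker]
        if marker in INFINITE_ALLELE:
            # Infinite-allele scoring: any mismatch counts as exactly one.
            distance += 0 if value == other else 1
        else:
            # Assumed stepwise scoring for all remaining markers.
            distance += abs(value - other)
    return distance


def build_query(marker, repeats):
    """8-marker query: DYS455 = 8, the marker of interest, 6 dummy values."""
    query = {"DYS455": 8, marker: repeats}
    query.update({m: 22 for m in DYS464})   # essentially impossible values
    query.update({m: 25 for m in YCAII})
    return query


def is_hit(query, record):
    # Distance 6 means all six dummy markers mismatched and the two real
    # markers (DYS455 and the marker of interest) matched exactly.
    return genetic_distance(query, record) == 6


# Hypothetical I1a record, made up for illustration.
record = {"DYS455": 8, "DYS607": 14, "DYS464a": 12, "DYS464b": 14,
          "DYS464c": 15, "DYS464d": 16, "YCAIIa": 19, "YCAIIb": 21}

print(is_hit(build_query("DYS607", 14), record))   # counted for DYS607 = 14
print(is_hit(build_query("DYS607", 15), record))   # not counted
```

Because the dummy values can never match a real record, they always contribute exactly 6, so counting hits at distance 6 isolates the records that match on the two legitimate markers.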
To obtain the repeat values for DYS464a,b,c,d in the foregoing table, the search was performed (24 January 2006) so that all those in Ysearch with only 25 FTDNA markers (as well as those with more) were sampled. To do this, the repeat values at 4 of the first 25 markers that have very little dispersion were set as follows: DYS392 = 11, DYS426 = 11, DYS454 = 11, and DYS455 = 8. Then the Ysearch database was searched with various values for DYS464a,b,c,d and the number of 8/8 matches was recorded. As expected from Nordtvedt's analysis, 12,14,15,16 and 12,14,15,15 are the most common sequences of repeat values for these markers but, somewhat surprisingly, together they account for only 50% of the total. The rest are made up of many different sequences; only those that are at least 1% of the total are given in the table. Several of these are ones that Nordtvedt has shown to be characteristic of different I1a groups.
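The 50% figure can be checked directly from the DYS464 table. The short sketch below uses the counts and N = 1495 as listed there; the variable names are illustrative only.

```python
# Counts copied from the DYS464a,b,c,d table (N = 1495).
N = 1495
COUNTS = {
    "12,14,15,16": 450, "12,14,15,15": 299, "12,14,14,16": 78,
    "12,14,14,15": 60, "12,14,15,17": 53, "12,14,16,16": 52,
    "12,13,15,16": 45, "12,15,15,16": 31, "14,14,16,16": 30,
    "14,14,15,15": 30, "12,12,14,15": 29, "12,15,16,16": 29,
    "11,14,14,16": 28, "12,12,14,16": 27, "12,15,15,15": 25,
    "12,12,15,16": 22, "12,14,16,17": 20, "13,14,15,16": 16,
}

# Share of the two most common sequences.
top_two = sorted(COUNTS.values(), reverse=True)[:2]
top_two_share = sum(top_two) / N

# Share covered by every sequence listed in the table (each at least 1%).
listed_share = sum(COUNTS.values()) / N

print(f"top two: {top_two_share:.1%}, all listed: {listed_share:.1%}")
```

The remainder not covered by `listed_share` corresponds to the many sequences that each fall below 1% of the total.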