Distribution of Repeat Values at Various STR Sites for Haplogroup I1a

Analysis performed December 2005 to January 2006 by Gordon Hamilton

Some time ago I estimated the distribution of repeat values at various STR sites for those in haplogroup I1a. The results were obtained from an analysis of data in the Sorenson Molecular Genealogy Foundation (SMGF) database that was available at that time (July 2004). In the current presentation this analysis has been refined, brought up to date using the latest SMGF database, and expanded to more markers (including some markers not present in the SMGF database; these data were obtained from Ysearch), and the results are presented in a better format (with acknowledgement to Whit Athey's presentation of similar R1b data). Ken Nordtvedt has reported a detailed analysis of the characteristics of various I1a groups and the more common repeat values at STR sites for each group. However, the information presented here should be a useful adjunct to Nordtvedt's analysis because it gives the frequencies of the less common repeat values at each STR site as well as those of the more common values. Thus, for any specific I1a haplotype one can get a better understanding of how common or unusual its repeat values at the various STR sites are.

The methods used to obtain the data in the following tables are described after the tables. The tables for the first 37 marker sites are arranged in Family Tree DNA (FTDNA) order, and the sites are defined as FTDNA defines them. The subsequent tables (38 to 48) are ordered as SMGF orders these markers, and the marker sites are defined as DNA Heritage and Relative Genetics define them.

Table 1. DYS393 (N = 1229)
Repeats  Count  Percent
11       1      0.1
12       23     1.9
13       1086   88
14       104    8.5
15       15     1.2

Table 2. DYS390 (N = 1230)
Repeats  Count  Percent
21       12     1.0
22       782    64
23       396    32
24       39     3.2
25       1      0.1

Table 3. DYS19/394 (N = 1230)
Repeats  Count  Percent
13       13     1.1
14       987    80
15       195    16
16       29     2.4
17       6      0.5

Table 4. DYS391 (N = 1229)
Repeats  Count  Percent
9        23     1.9
10       1097   89
11       107    8.7
12       2      0.2

Table 5. DYS385a (N = 1228)
Repeats  Count  Percent
11       5      0.4
12       37     3.0
13       721    59
14       434    35
15       31     2.5

Table 6. DYS385b (N = 1228)
Repeats  Count  Percent
12       1      0.1
13       84     6.8
14       851    69
15       251    20
16       34     2.8
17       7      0.6

Table 7. DYS426 (N = 1254)
Repeats  Count  Percent
10       22     1.8
11       1227   98
12       5      0.4

Table 8. DYS388 (N = 1229)
Repeats  Count  Percent
12       3      0.2
13       17     1.4
14       1105   90
15       48     3.9
16       55     4.5
17       1      0.1

Table 9. DYS439 (N = 1230)
Repeats  Count  Percent
9        1      0.1
10       40     3.3
11       925    75
12       216    18
13       43     3.5
14       5      0.4

Table 10. DYS389i (N = 1228)
Repeats  Count  Percent
11       11     0.9
12       1113   91
13       97     7.9
14       7      0.6

Table 11. DYS392 (N = 1264)
Repeats  Count  Percent
10       4      0.3
11       1227   97
12       27     2.1
13       6      0.4

Table 12. DYS389ii (N = 1228)
Repeats  Count  Percent
26       3      0.2
27       33     2.7
28       897    73
29       251    20
30       38     3.1
31       6      0.5

Table 12a. DYS389ii-389i (N = 1228)
Repeats  Count  Percent
14       1      0.1
15       28     2.3
16       985    80
17       194    16
18       18     1.5
19       2      0.2

Table 13. DYS458 (N = 1228)
Repeats  Count  Percent
13       5      0.4
14       143    12
15       790    64
16       241    20
17       43     3.5
18       6      0.5

Table 14. DYS459a (N = 1228)
Repeats  Count  Percent
7        49     4.0
8        1144   93
9        35     2.9

Table 15. DYS459b (N = 1228)
Repeats  Count  Percent
8        32     2.6
9        1175   96
10       21     1.7

Table 16. DYS455 (N = 1227)
Repeats  Count  Percent
8        1227   100

Table 17. DYS454 (N = 1259)
Repeats  Count  Percent
9        2      0.2
10       9      0.7
11       1227   97
12       21     1.7

Table 18. DYS447 (N = 1228)
Repeats  Count  Percent
20       1      0.1
21       30     2.4
22       313    25
23       736    60
24       142    12
25       6      0.5

Table 19. DYS437 (N = 1194)
Repeats  Count  Percent
15       63     5.3
16       1112   93
17       19     1.6

Table 20. DYS448 (N = 1227)
Repeats  Count  Percent
18       6      0.5
19       79     6.4
20       1054   86
21       83     6.8
22       5      0.4

Table 21. DYS449 (N = 1230)
Repeats  Count  Percent
24       1      0.1
25       11     0.9
26       67     5.4
27       85     6.9
28       525    43
29       365    30
30       128    10
31       43     3.5
32       3      0.2
33       2      0.2

Tables 22-25. DYS464a,b,c,d (N = 1495)
Repeats      Count  Percent
12,14,15,16  450    30
12,14,15,15  299    20
12,14,14,16  78     5.2
12,14,14,15  60     4.0
12,14,15,17  53     3.5
12,14,16,16  52     3.5
12,13,15,16  45     3.0
12,15,15,16  31     2.1
14,14,16,16  30     2.0
14,14,15,15  30     2.0
12,12,14,15  29     1.9
12,15,16,16  29     1.9
11,14,14,16  28     1.9
12,12,14,16  27     1.8
12,15,15,15  25     1.7
12,12,15,16  22     1.5
12,14,16,17  20     1.3
13,14,15,16  16     1.1

Table 26. DYS460 (N = 1230)
Repeats  Count  Percent
9        37     3.0
10       924    75
11       258    21
12       11     0.9

Table 27. GATA-H4 (N = 1228)
Repeats  Count  Percent
8        2      0.2
9        133    11
10       1014   82
11       78     6.0
12       3      0.2

Table 28. YCAIIa (N = 1228)
Repeats  Count  Percent
15       1      0.1
17       3      0.2
18       16     1.3
19       1171   95
20       13     1.1
21       24     2.0

Table 29. YCAIIb (N = 1228)
Repeats  Count  Percent
19       19     1.5
20       18     1.5
21       1164   95
22       25     2.0
23       2      0.2

Table 30. DYS456 (N = 1229)
Repeats  Count  Percent
12       2      0.2
13       61     5.0
14       870    71
15       256    21
16       29     2.4
17       11     0.9

Table 31. DYS607 (N = 803)
Repeats  Count  Percent
12       2      0.3
13       46     5.7
14       632    79
15       112    14
16       10     1.2
17       1      0.1

Table 32. DYS576 (N = 802)
Repeats  Count  Percent
14       2      0.2
15       67     8.4
16       392    49
17       247    31
18       81     10
19       11     1.4
20       2      0.2

Table 33. DYS570 (N = 802)
Repeats  Count  Percent
16       3      0.4
17       15     1.9
18       134    17
19       239    30
20       223    28
21       140    17
22       35     4.4
23       8      1.0
24       5      0.6

Table 34. CDYa (N = 799)
Repeats  Count  Percent
32       5      0.6
33       54     6.8
34       178    22
35       307    38
36       172    22
37       65     8.1
38       14     1.8
39       3      0.4
40       1      0.1

Table 35. CDYb (N = 799)
Repeats  Count  Percent
33       1      0.1
34       20     2.5
35       87     11
36       207    26
37       225    28
38       167    21
39       67     8.4
40       19     2.4
41       6      0.8

Table 36. DYS442 (N = 1178)
Repeats  Count  Percent
10       1      0.1
11       135    11
12       888    75
13       133    11
14       19     1.6
15       2      0.2

Table 37. DYS438 (N = 1258)
Repeats  Count  Percent
9        19     1.5
10       1227   98
11       12     1.0

Table 38. DYS441 (N = 1211)
Repeats  Count  Percent
13       1      0.1
14       17     1.4
15       213    18
16       908    75
17       70     5.8
18       2      0.2

Table 39. DYS444 (N = 1165)
Repeats  Count  Percent
10       1      0.1
11       7      0.6
12       193    17
13       789    68
14       161    14
15       14     1.2

Table 40. DYS445 (N = 1274)
Repeats  Count  Percent
10       6      0.5
11       1227   96
12       41     3.2

Table 41. DYS446 (N = 1157)
Repeats  Count  Percent
11       17     1.5
12       95     8.2
13       801    69
14       211    18
15       29     2.5
16       4      0.3

Table 42. DYS452 (N = 1071)
Repeats  Count  Percent
10       6      0.6
11       4      0.4
12       1024   96
13       35     3.3
14       2      0.2

Table 43. DYS461 (N = 1230)
Repeats  Count  Percent
10       5      0.4
11       161    13
12       971    79
13       90     7.3
14       3      0.2

Table 44. DYS462 (N = 1194)
Repeats  Count  Percent
11       9      0.8
12       871    73
13       306    26
14       8      0.7

Table 45. DYS463 (N = 1226)
Repeats  Count  Percent
17       2      0.2
18       39     3.2
19       1136   93
20       48     3.9
21       1      0.1

Table 46. GGAAT1B07 (N = 1174)
Repeats  Count  Percent
9        2      0.2
10       35     3.0
11       1106   94
12       31     2.6

Table 47. GATA-C4 (N = 1190)
Repeats  Count  Percent
20       22     1.8
21       474    40
22       494    42
23       148    12
24       40     3.4
25       12     1.0

Table 48. GATA-A10 (N = 1194)
Repeats  Count  Percent
9        1      0.1
10       1      0.1
11       9      0.8
12       160    14
13       855    72
14       122    10
15       34     2.9

The SMGF database examined in this study was the one on the web in December 2005; it was reported to have 11,095 genotypes with all the SMGF markers examined, and 13,489 genotypes if those without a complete marker set are included. To obtain results from the database one must enter values for at least 7 markers. It has long been known that a repeat value of 8 at DYS455 is characteristic of haplogroup I1a and that this value distinguishes I1a from virtually all other haplogroups. In preliminary analyses it was determined that for haplogroup I1a a single value was present 96% or more of the time at each of 5 additional markers. These markers (with the predominant repeat value for each in parentheses) are: DYS392 (11), DYS426 (11), DYS438 (10), DYS445 (11), and DYS454 (11). Thus, the database was searched using these values at the 6 sites plus various values at each of the other sites in turn. In each case the number of exact 7/7 matches was recorded, and the results are given in the preceding tables. It can be calculated that this method would extract results for more than 85% of the I1a records in the database. For most markers a total of about 1230 exact matches was obtained with the various repeat values, indicating that a little over 10% of those in the database are haplogroup I1a.
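
One way to check the "more than 85%" figure is sketched below. It is only a rough estimate, not necessarily the calculation originally used: it takes the modal counts and totals from the tables for the five low-dispersion markers and assumes (as a simplification) that the markers vary independently, so that the fraction of I1a records carrying the predominant value at all five sites is the product of the individual fractions. The function name search_coverage is purely illustrative.

```python
# Modal-value count and table total (N) for the five low-dispersion
# markers, copied from Tables 11, 7, 37, 40 and 17.
MODAL = {
    "DYS392": (1227, 1264),
    "DYS426": (1227, 1254),
    "DYS438": (1227, 1258),
    "DYS445": (1227, 1274),
    "DYS454": (1227, 1259),
}


def search_coverage(modal):
    """Estimated fraction of I1a records with the modal value at every
    marker, ASSUMING the markers vary independently of one another."""
    coverage = 1.0
    for count, total in modal.values():
        coverage *= count / total
    return coverage


coverage = search_coverage(MODAL)
matches = 1230         # typical number of exact 7/7 matches per marker
database_size = 11095  # complete genotypes reported in December 2005

print(f"estimated coverage of I1a records: {coverage:.1%}")
print(f"matches as share of database:      {matches / database_size:.1%}")
```

Under these assumptions the coverage comes out somewhat above 85%, and the matched records amount to a little over 10% of the complete genotypes, in line with the figures quoted above.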

Since a repeat value of 8 at DYS455 is considered to be characteristic of I1a, no serious attempt was made to search for any dispersion at this site, mainly because the search techniques used here turn up records for other haplogroups when repeat values other than 8 are used for DYS455. In an earlier discussion Phil Goff and Ken Nordtvedt noted that repeat values of 7 and 9 for DYS455, although very rare, have been observed in a few cases in what appear in other respects to be I1a haplotypes. Also, Phil Goff has pointed out that a value of 9 at this site is observed for some individuals thought to be I1a in both the Sorrell (kit 23875) and Thornton (Yellow Group) surname projects. The author thanks Phil for pointing these out.

The other DYS sites used in all the foregoing searches (namely 392, 426, 438, 445, and 454) required a different approach. To estimate the dispersion in repeat values at these sites, searches were performed using repeat values of 9, 10, and 11 for DYS391 while the repeat value at each of these 5 sites was varied in turn. The total number of exact 7/7 matches was recorded and is given in the foregoing tables for each of these 5 sites. Since repeat values of 9, 10, and 11 represent virtually all those for DYS391, the total number of recorded exact matches at each of these 5 sites is larger than 1230, because one is now also picking up the few records that have unusual repeat values at the site being varied.
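
The make-up of these larger totals can be seen by splitting each table into its modal count and the remainder. The short sketch below uses the counts from Tables 11, 7, 37, 40 and 17; the helper name summarize is only illustrative.

```python
# Counts per repeat value for the five markers used in every search.
COUNTS = {
    "DYS392": {10: 4, 11: 1227, 12: 27, 13: 6},
    "DYS426": {10: 22, 11: 1227, 12: 5},
    "DYS438": {9: 19, 10: 1227, 11: 12},
    "DYS445": {10: 6, 11: 1227, 12: 41},
    "DYS454": {9: 2, 10: 9, 11: 1227, 12: 21},
}


def summarize(counts):
    """Return (total N, modal repeat value, records with a non-modal value)."""
    total = sum(counts.values())
    modal = max(counts, key=counts.get)
    return total, modal, total - counts[modal]


for marker, counts in COUNTS.items():
    total, modal, unusual = summarize(counts)
    print(f"{marker}: N = {total}, modal value {modal}, "
          f"{unusual} records with other values")
```

In every case the modal count is the same 1227 records, and the excess of N over that figure is the number of records with unusual values at the site being varied.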

Data are presented in some of the foregoing tables for a few markers (DYS464a-d, 607, 576, 570, CDYa, and CDYb) that are not present in the SMGF database but are relatively common because Family Tree DNA tests for them. These data were extracted from the Ysearch database. To search this database one must enter repeat values for 8 markers. However, because of a quirk in the way DYS464a,b,c,d and YCAIIa,b are treated in these searches, it is possible to search the database for all other markers using legitimate values for only 2 markers. This is because the 6 markers DYS464a,b,c,d and YCAIIa,b are treated as infinite alleles, so no matter what specific value is entered for each of these markers, a genetic distance of one is calculated for any mismatch. Thus, to search for the I1a haplogroup values for any other marker in this database, the procedure is to set DYS455 = 8, enter essentially impossible values for DYS464a,b,c,d (22 was used in each case) and for YCAIIa,b (25 was used), then vary the value for the marker of interest and record the number of 8-marker matches that are at a genetic distance of exactly 6 (the mismatches being those at DYS464a,b,c,d and YCAIIa,b). This is how the data in the tables for DYS607, 576, and 570, as well as for CDYa and CDYb, were obtained.

One can use the same procedure to extract I1a haplogroup values for other markers from Ysearch, and these agree quite well with those extracted from the SMGF database. The minor differences are presumably due to the presence in Ysearch of data from large surname studies. Since this probably occurs to a lesser extent in the SMGF database, the SMGF data are given in the preceding tables for those markers that SMGF analyzes. However, it is worth reiterating that the repeat value distributions for the various markers obtained from either the SMGF or Ysearch database are very similar.
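
The logic of the Ysearch procedure can be illustrated with a toy model. The sketch below is not Ysearch's actual code, which is not described here: it simply assumes that a mismatch at any of the 6 "infinite allele" markers adds exactly 1 to the genetic distance, and that other markers add the absolute difference in repeat values (a stepwise rule chosen only for illustration). The function names and the sample record are hypothetical.

```python
# Markers assumed to be scored as infinite alleles (1 per mismatch).
INFINITE_ALLELE = ("DYS464a", "DYS464b", "DYS464c", "DYS464d",
                   "YCAIIa", "YCAIIb")


def genetic_distance(query, record):
    """Toy genetic distance between a query and a record over the
    query's markers (ASSUMED scoring rules, see the text above)."""
    distance = 0
    for marker, value in query.items():
        diff = abs(value - record[marker])
        if marker in INFINITE_ALLELE:
            distance += 1 if diff else 0  # any mismatch counts as 1
        else:
            distance += diff              # assumed stepwise counting
    return distance


def build_query(marker, value):
    """8-marker query: DYS455 = 8, the marker of interest, and
    'impossible' values (22 and 25) for the 6 infinite-allele markers."""
    query = {"DYS455": 8, marker: value}
    for name in INFINITE_ALLELE[:4]:
        query[name] = 22
    for name in INFINITE_ALLELE[4:]:
        query[name] = 25
    return query


# Hypothetical I1a record using common values from the tables.
record = {"DYS455": 8, "DYS607": 14,
          "DYS464a": 12, "DYS464b": 14, "DYS464c": 15, "DYS464d": 16,
          "YCAIIa": 19, "YCAIIb": 21}

for repeats in (13, 14, 15):
    d = genetic_distance(build_query("DYS607", repeats), record)
    print(f"DYS607 = {repeats}: distance {d}, counted: {d == 6}")
```

In this model a record is counted only when its distance is exactly 6, that is, when it matches the query at both DYS455 and the marker of interest and mismatches at all 6 of the deliberately impossible values.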

To obtain the repeat values for DYS464a,b,c,d in the foregoing table, the search was performed (24 January 2006) so that all those in Ysearch with only 25 FTDNA markers (as well as those with more) were sampled. To do this, the repeat values at 4 of the first 25 markers that have very little dispersion were set as follows: DYS392 = 11, DYS426 = 11, DYS454 = 11, and DYS455 = 8. Then the Ysearch database was searched with various values for DYS464a,b,c,d and the number of 8/8 matches recorded. As expected from Nordtvedt's analysis, 12,14,15,16 and 12,14,15,15 are the most common sequences of repeat values for these markers but, somewhat surprisingly, together they account for only 50% of the total. The rest are made up of many different sequences; only those that are at least 1% of the total are given in the table. Several of these are ones that Nordtvedt has shown to be characteristic of different I1a groups.
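
The 50% figure, and the share of records covered by the listed sequences, can be checked directly from the DYS464 table. The sketch below only re-adds the published counts; the variable names are illustrative.

```python
# Counts for the DYS464a,b,c,d sequences listed in the table (each at
# least 1% of the total), in table order, and the table total.
LISTED_COUNTS = [450, 299, 78, 60, 53, 52, 45, 31, 30, 30,
                 29, 29, 28, 27, 25, 22, 20, 16]
TOTAL = 1495

top_two = LISTED_COUNTS[0] + LISTED_COUNTS[1]  # 12,14,15,16 and 12,14,15,15
listed = sum(LISTED_COUNTS)
unlisted = TOTAL - listed                      # sequences below 1% each

print(f"two most common sequences: {top_two} ({top_two / TOTAL:.1%})")
print(f"all listed sequences:      {listed} ({listed / TOTAL:.1%})")
print(f"unlisted rarer sequences:  {unlisted} ({unlisted / TOTAL:.1%})")
```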