Bill,

But you are mixing apples and oranges in your data (M222+ and M222-) - so your results will not be accurate.

Look at this M222- guy below, his STR's seem like M222+, yet he is not part of that haplogroup. Your method would include people like this.

M222-
37784 Wilson Unknown Origin R1b1a2 13 25 14 11 12-14 12 12 12 13 14 29 18 9-9 11 11 24 15 19 30 15-16-16-17 10 11 19-23 16 15 17 17 37-38 12 12

I work as a programmer, and we have a saying: "Garbage in, Garbage out" - think about it.

Cheers,
Paul

On Thu, Jul 7, 2011 at 8:05 PM, Bill Howard <weh8@verizon.net> wrote:

> Yes, David. I agree. And correspondence with John McLaughlin has shown that he is in agreement that any individual who is reasonably close to the R:M222 37-marker modal is so likely to test positive for M222 that it may be a waste of money to do the test.
>
> My email that got this dialogue started was an attempt on my part to see how likely that might be. What I found was that if 27 or more of the 37 markers in a haplotype agreed with the modal value of the M222 group (which was self-defined by a large group of M222s), then it would be safe to conclude that the haplotype belonged to the M222 group. Granted, it was "guaranteed" by SNP testing, but the agreement between the haplotypes that WERE NOT SNP tested with those that were was so close as to be indistinguishable. Then I gave a set of those 'self defined' DYS modal values in my posting so that others would see what I meant and could do comparisons on their own.
>
> Which DYS modal values are more important? Obviously they will be the slow mutating markers for the far-ago progenitors and not the fast ones. The fast mutating markers can mutate back and forth so that they are only useful in comparing haplotypes that have a recent MRCA. In this way we can analyze the differences that define clusters on my phylogenetic tree and see which 'fingerprints' cause the differences among the clusters on the tree.
>
> On Jul 7, 2011, at 7:48 PM, David Ewing wrote:
>
> > R:M222 is 'defined' on the basis of the presence of the SNP M222. The definition has nothing to do with STR testing; ie, it does not depend at all on testing the markers we construct modals with. Iain correctly points out that David Wilson first identified a STR profile he thought was characteristic of NW Ireland that eventually proved to be associated with R:M222, but I do not believe that Wilson was aware of the SNP at that time and I do not think he tested for it.
> >
> > As it happens, since a SNP occurs in a single individual and is passed on to his descendants in perpetuity, and since he also passes his STR markers to his descendants and they will only gradually and slowly mutate away from the ancestral values through the generations, STR profiles of the descendants of the man who had the SNP will also be similar--very similar in the early generations, and then of gradually diminishing similarity as generations go on.
> >
> > The only way to be pure-d-double-L certain that an individual is in R:M222 is to test him for the M222 SNP. But any individual who is reasonably close to the R:M222 37-marker modal is so likely to test positive for M222 that it may be a waste of money to do the test.
> >
> > David Ewing
> >
> > On Thu, Jul 7, 2011 at 6:00 AM, Bill Howard <weh8@verizon.net> wrote:
> > There has been considerable discussion both on- and off-line about how the M222 SNP is defined.
> > First, I understand that its early definition depended on the first 12 markers.
> > Next, we have the deep clade test of FTDNA with a proprietary approach we know little about.
> > Next, there are discussions of how the markers agree or disagree with the modal values of the deep clade test, but only with respect to the first 12 markers of the FTDNA string.
> >
> > And now, here's my "take" on the situation.
> > I received from John McLaughlin a large set of markers that he noted were in the M222 group. Some had been SNP tested and some had not.
> > I did a study of ALL 37 markers (not just the first 12) and I determined the modal value of each DYS site.
> > I then went back and determined for EACH TESTEE the number of his own markers that matched the modal of that same marker for the M222 sample John sent me.
> > I then made a graph of the percentage of each testee's marker set that matched the overall marker set.
> > I found that virtually ALL testees in the set that John sent had 73% or more markers that agreed with the set of M222 modals -- not the first 12, but all 37 of them.
> >
> > The modal values I found for all 37 markers are the following, in the sequence given by FTDNA postings:
> > 13 25 14 11 11 13 12 12 12
> > 13 14 29 17 9 10 11 11 25 15
> > 18 30 15 16 16 17 11 11 19
> > 23 17 16 18 17 38 39 12 12
> >
> > Of the 683 M222s in the group, all matched 73% of that sequence (at least 27 of the 37 markers). The average was 85.2% and both the median and the mode were 83.8%. One testee, 26917 (MacKenzie), matched 100% of the modals.
> >
> > I also found that if you plot the number of markers that matched the M222 modal against their frequency of occurrence for all the 683 testees, the plot between 26 and 37 markers was bimodal, with two peaks. One peak was at 31 markers and the other peak was at 33 markers. A statistician might say that the departures from a Gaussian are not significant and that there are NOT two peaks, but I think it is arguable. When I do the same plot using 320 testees which are among a set with a larger number of SNP-tested testees, the bimodality is more pronounced but still statistically inconclusive. The two peaks are sharper and appear at the same place on the histogram.
> >
> > So, what do I conclude with all this?
> > First, that we cannot go by just the first 12 markers. We have more at our disposal to study.
> > Second, while we refer to the M222 SNP test of FTDNA, we realize that we take their results on faith about their criterion of who should be included in the M222 group.
> > Third, my analysis shows that you can safely (?) put a testee into the M222 group IF 73% or more of his 37 markers agree with the modal values of all 37 (not 12) markers. That is a practical working criterion for inclusion in the M222 group. I have given the modals, above, so now anyone can compare a haplotype with them and draw their own conclusion. That criterion correlates well with FTDNA's M222 SNP-tested group.
> >
> > Now, we must realize that there are extreme variations in the mutation rates of the markers and that's why less than 100% of the testees are in the M222 group. The mutation rates vary by a factor of almost 400 between the fastest and slowest mutating DYS sites. Why does 26917 MacKenzie have a 100% match? Well, statistically, out of 683 testees whose markers are mutating over the time from the M222 progenitor to the present, you would expect one line not to vary at all, and that line has led to 26917 MacKenzie. In fact, his haplotype may provide a clue or a means to tease out some of the mutations that have taken place over time. That's an exercise still to be done. Now, when you have a set of fast to slow mutating DYS sites, you should be comparing the DIFFERENCES in marker values along the mutating lines. I include now a table that shows the percentage of M222 testees that have the modal value at each marker in the haplotype. For example, every testee had the modal value for 454, and less than 50% of the testees had the modal for the two CDYs.
> >
> > DYS         %Y
> > DYS454      100%
> > DYS426      99%
> > DYS388      99%
> > DYS459a     99%
> > YCAIIa      98%
> > DYS438      98%
> > DYS393      98%
> > DYS455      98%
> > DYS448      96%
> > DYS392      95%
> > DYS385a     95%
> > DYS459b     93%
> > DYS19       93%
> > DYS437      92%
> > DYS464a     90%
> > DYS442      90%
> > Y-GATA-H4   89%
> > DYS385b     88%
> > YCAIIb      88%
> > DYS389i     88%
> > DYS447      87%
> > DYS464b     87%
> > DYS464c     86%
> > DYS464d     85%
> > DYS390      85%
> > DYS607      85%
> > DYS391      83%
> > DYS389ii    80%
> > DYS439      79%
> > DYS570      77%
> > DYS458      74%
> > DYS449      71%
> > DYS460      70%
> > DYS456      68%
> > DYS576      58%
> > CDYb        46%
> > CDYa        42%
> >
> > Now, with the modal values, and with the table just above, you could analyze the slow moving markers among the haplotypes and see what happens. The fast moving markers are useful only for small values of RCC, whereas the slow moving markers will give insight about what was happening to the marker strings nearer the time of the progenitor - the higher values of RCC.
> >
> > So, my fourth conclusion is that the sequence of junctions on the phylogenetic tree, calibrated in terms of RCC values, will probably give valuable information not only on how the DNA clusters (which later evolve into surname groups) actually evolved over time but give us valuable fingerprints that differentiate one cluster from another (and at RCC values less than 20, the TMRCAs of the progenitor who was at the junction point that leads to different surnames). A clever programmer might help here! The data are available (!).
> >
> > - Bye from Bill Howard
> >
> > --
> > Notice: This email is not secure, and is not for use by patients or for healthcare purposes in general.
>
> R1b1c7 Research and Links:
>
> http://clanmaclochlainn.com/R1b1c7/
> -------------------------------
> To unsubscribe from the list, please send an email to DNA-R1B1C7-request@rootsweb.com with the word 'unsubscribe' without the quotes in the subject and the body of the message
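
The procedure described in the quoted message (take the modal value of each marker across the sample, then count how many of each testee's markers equal that modal) can be sketched roughly as follows. This is only an illustration, not the analysis actually run on the 683 haplotypes: the function names are made up here, and the tiny 5-marker data set is invented purely to show the mechanics.

```python
from collections import Counter

def modal_haplotype(haplotypes):
    """Return the most common value at each marker position."""
    # zip(*haplotypes) walks the sample column by column (one marker at a time).
    return [Counter(column).most_common(1)[0][0] for column in zip(*haplotypes)]

def match_count(haplotype, modal):
    """Count the markers whose value equals the modal value at that position."""
    return sum(1 for value, m in zip(haplotype, modal) if value == m)

# Hypothetical toy sample: 4 testees, 5 markers each (NOT real data).
toy_sample = [
    [13, 25, 14, 11, 11],
    [13, 25, 14, 11, 12],
    [13, 24, 14, 11, 11],
    [13, 25, 15, 11, 11],
]

modal = modal_haplotype(toy_sample)
counts = [match_count(h, modal) for h in toy_sample]
percentages = [100.0 * c / len(modal) for c in counts]

print("modal:", modal)
print("matches per testee:", counts)
print("percent matching:", percentages)
```

With real 37-marker data the same two functions would give the per-testee match counts that the 27-of-37 (73%) criterion is applied to.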
Paul,

Not so, and it depends what your goal is in the analysis. Operationally, groups of what you call M222+ and M222- are indistinguishable with respect to their marker sequences. Both groups have nearly the same average numbers of markers that match each marker modal.

That being said, one can try to find the date of origin of the whole group, and of the two separately. If one finds dates that are statistically identical, then operationally the groups that have been M222-SNP tested and the ones that have not been tested lead to the same origin, just as long as the haplotypes share at least 27 of the 37 values with the modal values of their individual DYS markers. More broadly, one can compute the TMRCA for any group of haplotypes whether or not they belong to a particular group like M222.

In the M222- example you cite, 37784 is not even in the M222- group. It is a red herring since it shares only 25 markers with the modal values of the M222 group. It should be in the group of unknowns, not in the M222- group. I would not have considered it in the M222 group. To prove it, compare your sequence with the ones I gave in my initial posting of the question about how M222 is defined. The markers you cite below, namely 12, 14, 18, 9, 24, 19, 10, 16, 15, 17, 37, and 38, do not agree between your list of the Wilson haplotype and my list of modal values. Those 12 markers are different and only 25 are the same. Only about 68% of the markers (25 of 37) match the modal values; my criterion was that at least 73% should match. Wilson is not even in the M222- group.

I agree with your contention that the quip "Garbage in, Garbage out" is correct, but it doesn't apply here. Speaking of quips, when I taught at U. Michigan in the early days of programming, there was another expression that we called Kirk's Law (after a graduate student named Kirk who was enamored with programming): "Numbers that have been through a computer are worth more than numbers that haven't!" (grin).

- Bye from Bill

On Jul 7, 2011, at 8:22 PM, Paul Conroy wrote:

> Bill,
>
> But you are mixing apples and oranges in your data (M222+ and M222-) - so your results will not be accurate.
>
> Look at this M222- guy below, his STR's seem like M222+, yet he is not part of that haplogroup. Your method would include people like this.
>
> M222-
> 37784 Wilson Unknown Origin R1b1a2 13 25 14 11 12-14 12 12 12 13 14 29 18 9-9 11 11 24 15 19 30 15-16-16-17 10 11 19-23 16 15 17 17 37-38 12 12
>
> I work as a programmer, and we have a saying: "Garbage in, Garbage out" - think about it.
>
> Cheers,
> Paul
>
> On Thu, Jul 7, 2011 at 8:05 PM, Bill Howard <weh8@verizon.net> wrote:
>
>> Yes, David. I agree. And correspondence with John McLaughlin has shown that he is in agreement that any individual who is reasonably close to the R:M222 37-marker modal is so likely to test positive for M222 that it may be a waste of money to do the test.
>>
>> My email that got this dialogue started was an attempt on my part to see how likely that might be. What I found was that if 27 or more of the 37 markers in a haplotype agreed with the modal value of the M222 group (which was self-defined by a large group of M222s), then it would be safe to conclude that the haplotype belonged to the M222 group. Granted, it was "guaranteed" by SNP testing, but the agreement between the haplotypes that WERE NOT SNP tested with those that were was so close as to be indistinguishable. Then I gave a set of those 'self defined' DYS modal values in my posting so that others would see what I meant and could do comparisons on their own.
>>
>> Which DYS modal values are more important?
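
For anyone who wants to check the Wilson 37784 comparison by machine, here is a minimal sketch. The two lists are copied from the postings above (the 37 M222 modal values and the Wilson haplotype); the position-by-position alignment in FTDNA order, with the hyphenated multi-copy markers read off left to right, is an assumption of this sketch, and the function and constant names are made up for illustration.

```python
# M222 modal values for 37 markers, in FTDNA order, from the original posting.
M222_MODAL = [
    13, 25, 14, 11, 11, 13, 12, 12, 12,
    13, 14, 29, 17, 9, 10, 11, 11, 25, 15,
    18, 30, 15, 16, 16, 17, 11, 11, 19,
    23, 17, 16, 18, 17, 38, 39, 12, 12,
]

# Haplotype of 37784 Wilson as posted, hyphenated multi-copy markers split
# into separate positions (assumed to line up one-to-one with the modal list).
WILSON_37784 = [
    13, 25, 14, 11, 12, 14, 12, 12, 12,
    13, 14, 29, 18, 9, 9, 11, 11, 24, 15,
    19, 30, 15, 16, 16, 17, 10, 11, 19,
    23, 16, 15, 17, 17, 37, 38, 12, 12,
]

THRESHOLD = 27  # working criterion from the thread: at least 27 of 37 markers

def compare_to_modal(haplotype, modal):
    """Return (number of matching markers, 1-based positions that differ)."""
    if len(haplotype) != len(modal):
        raise ValueError("haplotype and modal must have the same length")
    mismatches = [
        i + 1 for i, (h, m) in enumerate(zip(haplotype, modal)) if h != m
    ]
    return len(modal) - len(mismatches), mismatches

matches, mismatches = compare_to_modal(WILSON_37784, M222_MODAL)
fraction = matches / len(M222_MODAL)
passes = matches >= THRESHOLD

print(f"{matches} of {len(M222_MODAL)} markers match ({fraction:.1%})")
print("differing positions:", mismatches)
print("meets 27-of-37 criterion:", passes)
```

Under that alignment the count comes out at 25 matching and 12 differing markers, which is the figure quoted in the reply, and it falls short of the 27-marker cutoff.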