RootsWeb.com Mailing Lists
    1. Re: [R-M222] Calculation of a correlation coeficient
    2. Alexander Paterson
    3. The fact of the matter is that there is no such thing as a multivariate correlation coefficient. It simply doesn't exist. My guess is that you assumed that such a concept existed and tried to calculate it. Now you're surprised and offended that someone doubts the validity of your calculations. I repeat: there is no such thing as a multivariate correlation coefficient. The only correlation coefficient known to man is that for a bivariate normal distribution. As soon as you move to 3 or more variables, the concept of a correlation coefficient is replaced by something called R-squared (also known as the coefficient of determination), which, in mathematical modelling, is defined as the proportion of the variance of the variable you are trying to model that is explained by the model.

    07/17/2011 01:10:25
    1. Re: [R-M222] Calculation of a correlation coeficient
    2. Stephen Forrest
    3. Hi Bill,

       I don't think this discussion is wasting people's time, though I do wish the tone were rather more civil overall. I have to say that, regardless of the specifics of the Excel implementation, Sandy's objection to the statistical foundation of RCC is valid.

       The sample correlation coefficient on paired data (x_1,y_1),...,(x_n,y_n) corresponding to two random variables X and Y is the sum of (x_i - x_m)*(y_i - y_m) for i from 1 to n, divided by the product of the magnitudes of the vectors (x_1-x_m,...,x_n-x_m) and (y_1-y_m,...,y_n-y_m), where x_m and y_m are the means of the sample data x_1,...,x_n and y_1,...,y_n respectively.

       The key point here is that the data x_1,...,x_n are supposed to be separate measurements of a single random variable X of interest. When you use them for RCC, by your design x_1,...,x_n are the STR marker values themselves. These are not separate measurements of a single quantity, and the value x_m, which is the average marker value across all 37, 67, or 111 markers, has no obvious significance. You are measuring the correspondence between two random variables on a population of 37, 67, or 111, where the population is marker values and not people.

       This introduces additional problems because marker value ranges vary widely between markers. Some, like DYS710, have high repeat numbers (mine is 35), while others, like DYS 578, are lower (mine is 9). My testing with my own 111-marker sample has shown that the RCC between my profile and my profile with a one-point mutation (i.e. the difference between RCC values before and after) *varies inversely with the distance of the particular marker value from the mean marker value x_m*. I can supply data if you like. There is absolutely no good biological reason why RCC should depend so closely on marker values: a one-point mutation is a one-point mutation whether the change is from 34 to 35 or 14 to 15.

       The fact that it does is evidence of the artificiality of this particular measure of genetic distance, to say nothing of the fact that documented mutation rates, which will certainly affect TMRCA calculations, are apparently not included in the RCC model at all.

       I have a few other points to raise about RCC which I will strive to write up and post here. I want to emphasize to all, however, that Bill has done a lot of work here and that innovation in statistical analysis of genetic data should be welcomed. That said, these innovators, like all researchers, have to be ready to face criticism, and just because a particular objection has not been raised before is not evidence of its falseness.

       Thanks to all for the discussion.

       regards, Steve
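The dependence on marker value described in this post can be explored numerically. The sketch below is an illustration only: the 12-marker haplotype is a made-up example rather than data from the thread, and the conversion RCC = 10000 * (1/CC - 1) is an assumption about how RCC is derived from the correlation coefficient, since the thread itself never states the formula.

```python
import math


def pearson(x, y):
    """Sample correlation coefficient computed from first principles."""
    n = len(x)
    x_m = sum(x) / n
    y_m = sum(y) / n
    numerator = sum((a - x_m) * (b - y_m) for a, b in zip(x, y))
    x_norm = math.sqrt(sum((a - x_m) ** 2 for a in x))
    y_norm = math.sqrt(sum((b - y_m) ** 2 for b in y))
    return numerator / (x_norm * y_norm)


def rcc(cc):
    # ASSUMPTION: RCC = 10000 * (1/CC - 1). This formula is not given in the thread.
    return 1e4 * (1.0 / cc - 1.0)


# Hypothetical 12-marker haplotype (illustrative values only).
profile = [13, 24, 14, 11, 11, 14, 12, 12, 12, 13, 13, 29]
mean_value = sum(profile) / len(profile)


def one_step_rcc(k):
    """RCC between the profile and a copy with marker k increased by one."""
    mutated = list(profile)
    mutated[k] += 1
    return rcc(pearson(profile, mutated))


# List every single-marker mutation, ordered by distance from the mean marker value.
results = [(abs(v - mean_value), v, one_step_rcc(k)) for k, v in enumerate(profile)]
for distance, value, score in sorted(results):
    print(f"marker value {value:2d}  distance from mean {distance:5.2f}  RCC {score:6.2f}")
```

In this made-up example the one-step RCC is largest for the markers nearest the mean marker value and smallest for the marker farthest from it, which is the pattern the post describes. Whether real 111-marker profiles behave the same way would need to be checked against actual data.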

    07/17/2011 11:01:55
    1. Re: [R-M222] Calculation of a correlation coeficient
    2. Alexander Paterson
    3. You have already stated that the answers that I get in comparing the 37-marker haplotypes of two people are correct. I presume that what you mean by being correct is that you get the same answers. The only way to get those answers is to pretend that there are 37 people, not 2, and that there are 2 characteristics being compared, and not 37. Whether you do this from first principles, or using Correl in Excel, or using Mathematica or some other package is irrelevant. You can do this for all possible pairings of two people out of 600 odd people and it still doesn't alter the fact that you are using a method of calculation that is appropriate for calculating each pairwise correlation coefficient as if there were 37 people instead of 2, each with 2 characteristics instead of 37. Nothing can alter that. Sandy

    07/17/2011 09:05:08
    1. Re: [R-M222] Calculation of a correlation coeficient
    2. Bill Howard
    3. You still did it wrong, Sandy. Read my reply again. And please think more carefully about what I wrote. I think we are wasting the readers' time where there are more important issues to which we should be paying attention. (Oops, I see that my full reply did not get on the list. Too bad, because it gave more reasons why you are wrong. I did not state in my full reply that the answers you got were correct, but I did say that you had approached it wrong and that I did not use the CORREL function in Excel, as you contended.) - Bye from Bill Howard
       R1b1c7 Research and Links: http://clanmaclochlainn.com/R1b1c7/
       -------------------------------
       To unsubscribe from the list, please send an email to DNA-R1B1C7-request@rootsweb.com with the word 'unsubscribe' without the quotes in the subject and the body of the message

    07/17/2011 06:00:29
    1. Re: [R-M222] David Wilson
    2. Alexander Paterson
    3. I already run two websites plus a business from home so it would be difficult for me to get involved. I simply couldn't commit to any more than I'm already involved in. It struck me though that what's likely to put many people off from volunteering is the fear that FTDNA may once again make everyone's life difficult by making yet more changes to the way they do things. So I think that the first person you'd need on board is a website designer, who would be able to handle any such changes. With someone like that in place, I think you may find a few takers, as you put it. Sandy

    07/17/2011 04:39:31
    1. [R-M222] Calculation of a correlation coeficient
    2. Alexander Paterson
    3. I managed to dig up a Statistics text-book that has a worked example of how a correlation coefficient is calculated. The book is called Statistical Analysis, by Edward C Bryant, published by McGraw-Hill. The example is given on page 139 and the input table is on page 141.

       The example entails estimating the correlation coefficient for a population, from a sample drawn from a bivariate normal distribution. The idea in this case is to see how well mid-term examination results were correlated to the year-end examination results in the subject of elementary statistics. There were 20 students included in the sample. The answer given is 0.642.

       Here is the data from the table:

       Student No   Mid-term (X)   Year-end (Y)
        1           35              60
        2           60              80
        3           55              60
        4           35              80
        5           35              75
        6           50              90
        7           30              60
        8           60             105
        9           50              60
       10           20              30
       11           55              90
       12           45              75
       13           40              80
       14           60              80
       15           40              45
       16           60              80
       17           50              80
       18           55              95
       19           50             100
       20           35              75

       This can be done from first principles and the answer arrived at is 0.642.

       Alternatively, you can do it in Excel. To do this, enter the X values into column A, and the Y values into column B. Then use the CORREL function to calculate the correlation coefficient.

       An Excel spreadsheet that does this is shown at http://dl.dropbox.com/u/2733445/STATSTUDENT.xlsx

       The answer is given in cell C20, namely 0.642 but to a few more decimal places.

       Here's my point: Correl in Excel is set up to accept bivariate data for a sample population of n (in the above example n=20). You are required to enter, for each of the 20 in the sample, the variable X (in column A) and the variable Y (in column B).

       Bill Howard does something completely different. He enters data for only 2 people, but he enters 37 numbers into column A and 37 numbers into column B. Excel interprets this to mean that there are 37 people (not just 2), and two variables (it is bi-variate), not 37.

       Correl is not set up for multivariate analysis - it is set up for bivariate analysis.

       This must be the most hilarious example I've ever seen of someone blindly banging data into software without having the faintest idea what the software does, or the input format that the software requires.

       Sandy
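For readers without Excel, the first-principles calculation on the textbook data can be reproduced with a few lines of Python. This is a minimal sketch using only the 20 pairs of marks quoted in the post; the function and variable names are chosen here for illustration.

```python
import math

# Mid-term (X) and year-end (Y) marks for the 20 students, as listed in the post.
mid_term = [35, 60, 55, 35, 35, 50, 30, 60, 50, 20,
            55, 45, 40, 60, 40, 60, 50, 55, 50, 35]
year_end = [60, 80, 60, 80, 75, 90, 60, 105, 60, 30,
            90, 75, 80, 80, 45, 80, 80, 95, 100, 75]


def correlation(x, y):
    """Pearson correlation coefficient from sums of squares and cross-products."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    s_xy = sum((a - x_mean) * (b - y_mean) for a, b in zip(x, y))
    s_xx = sum((a - x_mean) ** 2 for a in x)
    s_yy = sum((b - y_mean) ** 2 for b in y)
    return s_xy / math.sqrt(s_xx * s_yy)


r = correlation(mid_term, year_end)
print(round(r, 3))  # 0.642
```

Here each of the 20 students contributes one (X, Y) pair, so the sample size is 20 and there are two variables, which is the layout the post says CORREL expects.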

    07/17/2011 03:18:40
    1. Re: [R-M222] David Wilson
    2. Gerry
    3. If all that is stopping a qualified person from taking over the project is the need for some Web and database skills, then I can help with that, if needed. I work with databases with Web interfaces and would be happy to do whatever I can to help. I imagine that MySQL and PHP are available to do the work. My family Website is here: http://ringofgullion.com/, although it is outdated, due to all that I have learned this past year on M222. Gerry Hoy

    07/17/2011 03:00:28
    1. [R-M222] Calculation of a correlation coeficient
    2. Bill Howard
    3. Sandy,

       You are wrong. Please don't mislead the post readers without understanding what I did and how I did it. You don't use Excel's process on a matrix; you used it on only a string of values. You need to use the approach on a matrix of ROWS of numbers, not just a couple of columns.

       You don't understand how I did the analysis --- I anticipate that most of what you wrote before the four paragraphs at the end of your posting is correct, but the last four paragraphs are incorrect. I enter data for everyone and run a correlation at one time on all of them, not just a pair as Sandy writes. I do not use the Correl command at all. I use the Excel data analysis kit, as explained in my papers. The Excel data analysis kit, or Mathematica, or any of the other programs that can run a correlation on a matrix (not on just two columns) correlates each row of haplotypes against ALL other rows of haplotypes. You must click on ROWs to be correlated, not COLUMNs.

       This is not the first time you have criticized me, but each time I have shown that you either (1) did not understand what I was doing; (2) did not understand how I did it; (3) did not do a deep enough calculation yourself and just skimmed the top; (4) did not put error bars on what you did, and on and on and on.

       Your contention that "This must be the most hilarious example I've ever seen of someone blindly banging data into software without having the faintest idea what the software does, or the input format that the software requires" is again wrong, excessively vituperative and entirely unwarranted. In addition, you are tending to dominate the postings with these types of criticisms and I for one would appreciate hearing more discussions from others. I suggest you back off and do a deeper analysis of what I have done, reading my papers and my FAQ, and doing your homework -- correctly. I hope others on this rootsweb site will agree.

       - Bye from Bill Howard

       PS -- I will be sorry to see both David and John step down from their very important roles as M222 Project Administrators. They have made very important contributions, and John's very knowledgeable postings have been much appreciated.
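What a row-against-row correlation of a haplotype matrix produces can be sketched in plain Python. The example below does not use Excel's data analysis kit or Mathematica, whose exact behaviour is not described in the thread; it uses three made-up haplotypes (hypothetical values) and the usual construction of a correlation matrix, and then compares each entry with the two-vector coefficient for the same pair of rows.

```python
import math


def pearson(x, y):
    """Two-vector correlation coefficient from first principles."""
    n = len(x)
    x_m = sum(x) / n
    y_m = sum(y) / n
    numerator = sum((a - x_m) * (b - y_m) for a, b in zip(x, y))
    x_norm = math.sqrt(sum((a - x_m) ** 2 for a in x))
    y_norm = math.sqrt(sum((b - y_m) ** 2 for b in y))
    return numerator / (x_norm * y_norm)


def row_correlation_matrix(rows):
    """Correlate every row against every other row in a single pass."""
    unit_rows = []
    for row in rows:
        mean = sum(row) / len(row)
        centred = [v - mean for v in row]            # subtract the row's own mean
        norm = math.sqrt(sum(c * c for c in centred))
        unit_rows.append([c / norm for c in centred])  # scale to unit length
    # Entry (i, j) is the dot product of the standardised rows i and j.
    return [[sum(a * b for a, b in zip(u, v)) for v in unit_rows] for u in unit_rows]


# Hypothetical haplotypes, one person per ROW (illustrative values only).
haplotypes = [
    [13, 24, 14, 11, 11, 14, 12, 12, 12, 13, 13, 29],
    [13, 25, 14, 11, 11, 13, 12, 12, 12, 13, 14, 29],
    [13, 24, 14, 10, 11, 14, 12, 12, 13, 13, 13, 30],
]

matrix = row_correlation_matrix(haplotypes)
for i in range(len(haplotypes)):
    for j in range(i + 1, len(haplotypes)):
        pairwise = pearson(haplotypes[i], haplotypes[j])
        print(f"rows {i},{j}: matrix {matrix[i][j]:.6f}  pairwise {pairwise:.6f}")
```

Under this construction each off-diagonal entry agrees with the pairwise coefficient for the same two rows. The sketch only shows the arithmetic; it does not settle whether treating markers as the observations is statistically appropriate, which is the point under dispute.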

    07/17/2011 01:54:01
    1. [R-M222] David Wilson
    2. David Wilson told me a few days ago he could no longer continue as administrator of the M222 project and suggested I find a replacement (or two). Ideally I'd like to find someone or several volunteers to take over so I can retire as well. Any takers? Your main duties (which don't take long) are admitting new people to the project. If you want to take over the M222 email list, that mostly involves deleting spam messages a few times a week. John

    07/16/2011 05:31:19
    1. Re: [R-M222] M222 Tree
    2. Sandy Paterson
    3. Hi Bill I don't think we're wasting anyone's time. I think we've both made the same mistake. We both used the Excel function, yes, but with insufficient thought. The correlation coefficient is designed to produce answers that lie between -1 and +1. It requires the summing of the products of the differences between attributes and the population mean for those attributes (not the mean of the attributes for each individual). Comparing an individual marker with the mean of the marker scores of that individual is meaningless. Sandy

    07/12/2011 06:56:37
    1. Re: [R-M222] M222 Tree
    2. Bill Howard
    3. Sandy, You need to read what I have done with the correlation coefficient. I translated it into RCCs which are easier to use. Computing the cc's via the cc algorithm was found to be correct, and the simplification of the conversion into RCCs makes the analysis much easier than it would be if I had not done the conversion. No, I have not made any mistake. It all fits. I suggest you read my paper 1 again in the JoGG. Also you should review how the cc is computed. - Bye from Bill Howard

    07/12/2011 02:30:12
    1. Re: [R-M222] M222 Tree
    2. Bill Howard
    3. Sandy, One of us is confused and I don't think it is I! The Excel function for correlation is the one I used. Since Excel does it within its own 'black box', I have checked it out with statistical algorithms and find it to be correct. Enough of this! I think we are wasting the list readers' time. - Bye from Bill Howard

    07/12/2011 01:43:04
    1. Re: [R-M222] M222 Tree
    2. Sandy Paterson
    3. Really? Actually, they are garbage. In correlation, the numerator is the sum of the products of (person A value of attribute - mean attribute value of population) x (person B value of attribute - mean attribute value of population). In what I set up in the spreadsheet, this is not the case. So they are not correlation coefficients at all. They may have some meaning, I don't know, but they look like garbage to me. Sandy -----Original Message----- From: dna-r1b1c7-bounces@rootsweb.com [mailto:dna-r1b1c7-bounces@rootsweb.com] On Behalf Of Bill Howard Sent: 11 July 2011 14:46 To: dna-r1b1c7@rootsweb.com Subject: Re: [R-M222] M222 Tree To the list: Yes, Sandy has done the calculation right.
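The two numerators contrasted in this post can be compared side by side. The sketch below uses a made-up population of four 4-marker haplotypes (hypothetical values chosen so the arithmetic is easy to follow) and computes only the numerator, since that is the only part the post describes. One version centres each marker on its mean across the population; the other centres each person's values on that person's own mean across markers, which is what a two-column correlation of a pair of haplotypes does.

```python
# Hypothetical population: one person per row, one marker per column.
population = [
    [13, 24, 14, 11],
    [13, 25, 14, 11],
    [14, 24, 15, 10],
    [12, 23, 13, 12],
]

person_a = population[1]
person_b = population[2]

# Mean value of each marker across the whole population (column means).
marker_means = [sum(column) / len(column) for column in zip(*population)]

# Numerator as described in the post: deviations from the population mean of each marker.
numerator_population = sum(
    (a - m) * (b - m) for a, b, m in zip(person_a, person_b, marker_means)
)

# Numerator when each person is centred on their own mean across markers.
mean_a = sum(person_a) / len(person_a)
mean_b = sum(person_b) / len(person_b)
numerator_own_mean = sum((a - mean_a) * (b - mean_b) for a, b in zip(person_a, person_b))

print(numerator_population)  # 0.0
print(numerator_own_mean)    # 109.75
```

With these illustrative values the two centrings give very different numerators, because centring on a person's own mean is dominated by the difference in scale between markers rather than by how the two people differ from the population.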

    07/11/2011 11:13:59
    1. [R-M222] 23andMe SNPs
    2. J David Grierson
    3. List,

       As we know, David Wilson's observation that there was a bunch of consistent non-modal values in the greater R1b "pool" led to the ultimate discovery of the R-M222 SNP. With hindsight we can now see that there are 2/12, or 5/25, or 7/37 markers that are off-modal to L21, and are very, very consistent throughout the M222 cohort. We further find that there are 10/67, and 14/111 that fall into the same category.

       In my study of Grierson/Greers, ([1]http://www.shade.id.au/Grierson/GriersonDNA.htm > Excel Chart 1d) I have now identified a cluster that contains 10/111 of consistent off-modal markers with respect to M222. These markers are also mostly off-modal to L21, making a total of (roughly) 22/111 - there are, of course, significant variations within L21 due to its greater age. I am wondering whether this might, in the same way that David observed in R1b, indicate the presence of a junior (or family) SNP within M222.

       Now, the WTY process is extremely expensive. It is beyond me as a cost. However, I, and another of my distant cousins, have taken the 23andMe genotype test, which in both cases has confirmed the FTDNA YDNA rating of M222+. When comparing our individual genotypes, 23andMe says we have overall 74.5% similar in 557160 SNPs. By all other accounts we are about 600 years apart, or about 20 generations. Obviously, one of those SNPs is M222 (under the rs code, rs20321).

       I understand there is a way we can compare all SNPs, and therefore I assume that we can identify the common ones. Does anybody know whether we can further isolate those on the Y chromosome within the 23andMe results? If that is possible, identification of those on the Y chromosome would clearly assist in any potential discovery of SNP variations within M222. From my point of view, that would be much the cheaper outcome, as well!

       David Grierson

       References
       1. http://www.shade.id.au/Grierson/GriersonDNA.htm

    07/11/2011 01:48:23
    1. Re: [R-M222] M222 Tree
    2. Sandy Paterson
    3. And this statement? <Finally, about the association of genetic distance (GD) with RCC -- I have run many strings of haplotypes and have changed various marker values by 1, 2, 3, and compared many sets with each other. They show that a change of 1 in GD can cause a change in RCC of about 3, depending on which marker (low vs high) is changed....> A change in GD of 1 can be as high as 16? Maybe even higher? Sandy -----Original Message----- From: dna-r1b1c7-bounces@rootsweb.com [mailto:dna-r1b1c7-bounces@rootsweb.com] On Behalf Of Bill Howard Sent: 11 July 2011 14:46 To: dna-r1b1c7@rootsweb.com Subject: Re: [R-M222] M222 Tree To the list: Yes, Sandy has done the calculation right.

    07/11/2011 12:18:51
    1. Re: [R-M222] M222 Tree
    2. Bill Howard
    3. Sorry, Sandy -- I was not sufficiently specific. I wanted to see what a change in a marker did to the RCC after that one change. I wanted to take the simplest possible case -- how a marker change affected a starting haplotype. Think of the starting haplotype as being that of a progenitor, and you want to see how a change of one in any marker would affect the RCC of the pair (progenitor vs descendant with that one marker change). I picked an arbitrary R1b haplotype string -- I think I took a Hamilton string as the progenitor. It could be any string as long as it is representative of a real case. Then I investigated what would happen to the value of RCC between the progenitors 'starter' string and EACH of the ones that I chose to change. I changed both low and high markers; I changed by one, to four markers. I changed combinations, too. I ran the RCC determination for each example. This was done four and a half years ago. Here is the result: Markers 1Min 1Max 1Min+1Max 2Min 2Max 2Min+2Max 37 2.85 2.02 4.46 11.49 7.87 17.76 /n changes 2.85 2.02 2.23 5.74 3.94 4.11 I changed one small number by one — column 2 I changed one large number by one — column 3 I changed one small and one large number by one — column 4 I changed two small numbers by one — column 5 I changed two large numbers by one — column 6 I changed two small numbers and two large numbers by one — column 7 The results are given in row 2. After dividing by the number of changes made, you have the results in row 3 You will note that my contention (that you quoted below) is quite correct. Namely, that a change of 1 in GD can cause a change in RCC of about 3, depending on which marker (low vs high) is changed…. In fact, that is a conservative statement since the real result was 2.02 in one case and 2.85 in the other. Q.E.D. - Bye from Bill Howard On Jul 11, 2011, at 1:18 PM, Sandy Paterson wrote: > And this statement? 
> > > <Finally, about the association of genetic distance (GD) with RCC -- I have > run many strings of haplotypes and have changed various marker values by 1, > 2, 3, and compared many sets with each other. They show that a change of 1 > in GD can cause a change in RCC of about 3, depending on which marker (low > vs high) is changed....> > > > A change in GD of 1 can be as high as 16? Maybe even higher? > > > Sandy > > > > -----Original Message----- > From: dna-r1b1c7-bounces@rootsweb.com > [mailto:dna-r1b1c7-bounces@rootsweb.com] On Behalf Of Bill Howard > Sent: 11 July 2011 14:46 > To: dna-r1b1c7@rootsweb.com > Subject: Re: [R-M222] M222 Tree > > To the list: > > Yes, Sandy has done the calculation right. > > R1b1c7 Research and Links: > > http://clanmaclochlainn.com/R1b1c7/ > ------------------------------- > To unsubscribe from the list, please send an email to DNA-R1B1C7-request@rootsweb.com with the word 'unsubscribe' without the quotes in the subject and the body of the message
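
The "/n changes" row in Bill Howard's table above can be rechecked with a few lines of arithmetic. This is only a sketch: the totals are copied from row 2 of the table, the change counts are read off the column descriptions, and the names `rcc_totals`, `n_changes` and `rcc_per_change` are illustrative, not anything used on the list. By this arithmetic the last column works out to 4.44, whereas the table lists 4.11.

```python
# RCC totals from row 2 of the table (37 markers), keyed by column label.
rcc_totals = {
    "1Min": 2.85,
    "1Max": 2.02,
    "1Min+1Max": 4.46,
    "2Min": 11.49,
    "2Max": 7.87,
    "2Min+2Max": 17.76,
}

# Number of markers changed in each column, read off the column descriptions
# (an interpretation of the message, not data supplied separately).
n_changes = {
    "1Min": 1,
    "1Max": 1,
    "1Min+1Max": 2,
    "2Min": 2,
    "2Max": 2,
    "2Min+2Max": 4,
}


def rcc_per_change(totals, counts):
    """Divide each RCC total by the number of one-step changes behind it."""
    return {label: totals[label] / counts[label] for label in totals}


per_change = rcc_per_change(rcc_totals, n_changes)
for label, value in per_change.items():
    print(f"{label:<10} {value:.3f}")
```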

    07/11/2011 08:27:37
    1. Re: [R-M222] M222 Tree
    2. Sandy Paterson
    3. I don't like this method of calculation at all. However, there may be another way of calculating a sort of CC that would make more sense (to me, anyway). I'll see what I come up with. Sandy -----Original Message----- From: dna-r1b1c7-bounces@rootsweb.com [mailto:dna-r1b1c7-bounces@rootsweb.com] On Behalf Of Sandy Paterson Sent: 11 July 2011 09:06 To: dna-r1b1c7@rootsweb.com Subject: Re: [R-M222] M222 Tree I've set up an Excel spreadsheet at http://dl.dropbox.com/u/2733445/EWCON.xlsx Column A is Paul Conroy's 37-marker haplotype. Column B is the 37-marker haplotype of Ewing 26605. The CC and the RCC are in cells C37 and D37. The RCC is 97.11. If you change the CDYb value in column B from 38 to 37, the RCC changes from 97.11 to 113.26. Changes of 1 at other markers result in smaller changes in RCC. I think it would be worthwhile if someone were to check this independently from first principles. Having said that, I get the same answer of 97.11 using my own software, working from first principles. Sandy -----Original Message----- From: dna-r1b1c7-bounces@rootsweb.com [mailto:dna-r1b1c7-bounces@rootsweb.com] On Behalf Of Sandy Paterson Sent: 11 July 2011 06:48 To: dna-r1b1c7@rootsweb.com Subject: Re: [R-M222] M222 Tree [I've done that myself. If a marker (it doesn't make any difference which one) with the value of 12 is altered to 13 you will always get the same CC. Change another marker with a value of 29 to 30 and you will get a different CC. In genetic distance computations the result would be
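
For anyone wanting to try the "first principles" check Sandy asks for, a minimal sketch follows. It assumes the CC is the ordinary sample (Pearson) correlation coefficient taken across the marker values of two haplotypes, as described elsewhere in this thread; the step from CC to RCC is not reproduced because its formula is not given in these messages. The 8-marker haplotype is invented purely for illustration, and the function names are hypothetical.

```python
import math


def pearson_cc(x, y):
    """Sample correlation coefficient of paired values, from first principles."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences with at least 2 values")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    dx = [v - mean_x for v in x]
    dy = [v - mean_y for v in y]
    numerator = sum(a * b for a, b in zip(dx, dy))
    denominator = math.sqrt(sum(a * a for a in dx)) * math.sqrt(sum(b * b for b in dy))
    return numerator / denominator


def with_one_step(haplotype, index):
    """Return a copy of the haplotype with one marker increased by one repeat."""
    changed = list(haplotype)
    changed[index] += 1
    return changed


# Hypothetical 8-marker haplotype, invented for illustration only.
base = [13, 24, 14, 11, 12, 30, 16, 38]

cc_low = pearson_cc(base, with_one_step(base, 3))   # marker value 11 -> 12
cc_high = pearson_cc(base, with_one_step(base, 7))  # marker value 38 -> 39

# Both pairs are at genetic distance 1, yet the two CC values differ.
print(cc_low, cc_high)
```

Both modified haplotypes differ from the base by a single one-step change, so any difference between the two CC values comes only from which marker value was changed.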

    07/11/2011 08:13:32
    1. Re: [R-M222] M222+ vs M222-
    2. J David Grierson
    3. Bill, I have further questions: Let us assume that two brothers are born, one of whom carries the mutation SNP M222 at his conception, the other remains ancestral (we'll call him L21). Each has a chain of male descendants, a representative from each chain being born in the 20th Century, and tested by FTDNA. Their results (because they have fairly stable YDNA) come out as, say two, or three, or six GD one from the other, so far as haplotype is concerned. Question: What time would your methodology give for the TMRCA? Would you not think that the age of the haplogroup is a consideration? Regards David On 11/07/2011 8:25 AM, Bill Howard wrote: > Paul, > If the two haplotype strings are statistically the same, I don't really care. They lead to the same dates. > I agree, we are now beating on a dead horse. > I am sorry you think I am tiresome but you don't appear to understand that the date of origin depends only on the haplotypes presented to the program, not whether or not it is a member of a particular SNP. > (In the two postings immediately below, I find that only you used the word "unknown")….. > - Bye from Bill Howard > > On Jul 10, 2011, at 6:11 PM, Paul Conroy wrote: >

    07/11/2011 05:34:09
    1. Re: [R-M222] M222+ vs M222-
    2. J David Grierson
    3. Bill, I'm sorry to buy into this discussion, but I think it is incredibly important that we all, including the tyros in this (amateur) business, understand what we mean. In the ten years or so that we enthusiasts have been dealing with DNA genealogy, a certain jargon has grown up. The professionals in the business have developed more and more complicated ways of describing haplotypes and haplogroups, using alphanumeric strings. To simplify things, we amateurs use shorthand, such as M222 (which is the code for a certain clade in the greater R haplogroup, but is commonly referred to as a haplogroup itself) because it saves a lot of fiddling referencing when communicating. Included in the jargon is the meaning of M222+, which means that a testee has been shown positively to carry the M222 mutation. M222-, conversely, means that a testee has been shown positively NOT to carry the mutation, and is most commonly attached to a member of the ancestral mutation (L21 as far as we know). Rarely, it might be attached to the member of a more remotely connected clade of the "R" haplogroup, who has taken the test for whatever reason. We can further say that if the haplotype itself does not give us a degree of certainty (see below), and the bearer has not taken the SNP test, then technically the haplogroup (or the SNP) is "unknown". If the relationship was even more remote, i.e., if a member of an unconnected (except in primeval terms) haplogroup took the test, under current assumptions that would be a waste of effort, and the TMRCA would be in the tens of thousands of years. Now, we DNA genealogists are predominantly interested in the most recent millennium, because that encompasses the time of the use of surnames, and, generally speaking, the period of usable record keeping. However, that is not to say that we don't also have an interest in estimating the age of our haplogroup. 
But we mostly don't have a particular interest in estimating the relative distance between unconnected haplogroups. Hence our insistence that knowledge of haplogroup is essential; indeed, I have trouble conceptualizing why you insist that you can derive useful information from unrelated haplotype strings. Now you said "If the two haplotype strings are statistically the same, I don't really care. They lead to the same dates." There is a very good reason for this. The two sets of data you were given were, with a very low probability of error, all carriers of the M222 mutation, (theoretically M222+), even though untested. That's why they are statistically the same, and lead to the same dates. It is true that more than half had not been tested positively for M222. However, the characteristic haplotype for the bearers of this mutation is such that M222+ can be predicted quite reliably. This turns on the following particular DYS values: in the FTDNA markers 1-67, DYS385b=13, 392=14, 448=18, 449=30, YCAIIb=23, 607=16, 413a=21, 534=16, 481=25. Latterly I have further identified DYS710=35, 714=24, 549=12, and 513=13 in the 68-111 marker range. Because M222 is a "young" SNP mutation, there haven't been many random DYS mutations since then, so most M222 members carry the great majority of the above DYS values. So, in our various ways, I think we are trying to say to you that using this particular data, you shouldn't draw conclusions about unrelated data sets. These are related. I accept your statistical expertise, but ask, what knowledge do we gain by comparing unrelated data sets, say, members of Haplogroups defined by M222, L21, I1 and G2a? I ask because they are the haplogroups identified in my surname study. We already know that they divided one from the other during the last 40,000 years, but unless we are trying to define an individual's place in all of this, what is to be gained? 
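
The marker values David lists above lend themselves to a simple tally. The sketch below only counts how many of those modal values a haplotype shares; it does not decide M222 status, and no cut-off is suggested because none is given in the message. The marker names are expanded from the message's shorthand (for example "392" to "DYS392"), and `M222_MODAL`, `modal_matches` and the sample haplotype are all hypothetical.

```python
# Modal values quoted in the message above for predicted M222 carriers.
M222_MODAL = {
    "DYS385b": 13, "DYS392": 14, "DYS448": 18, "DYS449": 30, "YCAIIb": 23,
    "DYS607": 16, "DYS413a": 21, "DYS534": 16, "DYS481": 25,
    "DYS710": 35, "DYS714": 24, "DYS549": 12, "DYS513": 13,
}


def modal_matches(haplotype, modal=M222_MODAL):
    """Return (matched, tested): how many listed markers were tested and agree."""
    tested = [marker for marker in modal if marker in haplotype]
    matched = [marker for marker in tested if haplotype[marker] == modal[marker]]
    return len(matched), len(tested)


# Invented partial haplotype, for illustration only.
sample = {"DYS392": 14, "DYS448": 18, "DYS449": 29, "DYS607": 16}
matched, tested = modal_matches(sample)
print(f"{matched} of {tested} listed markers match the quoted modal values")
```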
Regards David Grierson On 11/07/2011 8:25 AM, Bill Howard wrote: Paul, If the two haplotype strings are statistically the same, I don't really care. They lead to the same dates. I agree, we are now beating on a dead horse. I am sorry you think I am tiresome but you don't appear to understand that the date of origin depends only on the haplotypes presented to the program, not whether or not it is a member of a particular SNP. (In the two postings immediately below, I find that only you used the word "unknown")….. - Bye from Bill Howard On Jul 10, 2011, at 6:11 PM, Paul Conroy wrote: Bill, Once again M222- does NOT mean untested, it mean (sic) TESTED NEGATIVE. Unknown means untested. You're getting tiresome. On 7/10/11, Bill Howard <weh8@verizon.net> wrote: Hi, David, I did see your posting and I apologize for being a bit tardy in my reply. I got into this when a friend suggested looking into the M222 SNP and to see if there is a connection between it and Niall and his descendants. My look at the situation indicates that, while Niall and the UiNeills may have carried the SNP, it cannot be proved that they did so. My date determination (see below) indicates that the SNP did not originate with them. In the process I became aware that one of the things that the DNA folks wanted to do was to try to date the origin of the M222 SNP. Since my RCC approach could do that estimate, I wanted to analyze haplotypes that were in the M222 family. To prepare for the analysis, I was given a large list of M222 folks, and later found that only some of them had been SNP tested. I found that only slightly in excess of 320 had actually been tested, so I collected them as a second database. Next, there was a list exchange that suggested that the M222 group should be separated into plus and minus groupings, with minus not being well-defined except that they had not been tested. 
Before that exchange I tried to see if I could separate the plusses and the minuses by their haplotypes alone, and I found that they were statistically the same. If there was a separation by SNP testing they certainly did not stand out as being separate from their haplotypes. That analysis has already been posted. Now, since they looked to be the same, I separated my analysis into the two databases, the ones that had been called M222, a mixture of those tested and untested, and only those that had been tested. I ran a TMRCA for both groups and found that the answers were the same within the estimated error of about 300 years SD. It is a bit premature at this stage to give the answer I got since it has not been fully discussed with my potential co-author, but it was considerably earlier than Niall and was more like the dates that John McEwan got in the BC era. More on this later. To address your question about how I can calculate a time for the mixture, I say that if I cannot distinguish the difference from the haplotypes and since Mathematica works only on those haplotypes (without any knowledge of which group it is being given to analyze), I should get the same answer if I use either the large or the small sample. And that's what I got, again within the uncertainty of the errors involved. The answer for the M222 plus sample is statistically the same as the answer from the larger database. That's because the haplotypes inputted to Mathematica in the two samples were statistically the same. So, if you want the answer to dating M222 plus alone, it is the same date. I think that my analysis has been professionally rigorous given the statistical equalities within the two databases. I hope this answers your questions, David. - Bye from Bill Howard On Jul 10, 2011, at 4:10 PM, David H. MacLennan wrote: Dear Bill, Yesterday I posted a note concerning the M222 SNP status of your data (see below), but you have not responded. Can you please comment on what I said. 
I am particularly concerned about your dating of the time of the M222 mutation. If you are looking at samples of M222+ that are mixed with M222-, how can you calculate a time of the mutation? David Dear Bill, As a biological scientist I find it distressing that you and others are trying to convince us that it doesn't really matter if your SNP test does or does not show that you are M222+, you can still be included in the M222 project on the basis of your STR haplotype. Data based on such an assumption would not be acceptable in a rigorous scientific journal. It would seem to me that the benchmark of the M222 project should be the presence of M222+. At some stage in our background two brothers may have had an identical or nearly identical STR haplotype, but brother one had a de novo mutation that created the M222 SNP and brother two did not. The descendants of brother one would be M222+ and the descendants of brother two would be M222-. This de novo mutation occurred at a specific date and we would all be very interested in that date. However, if the samples used to measure that date are a mixture of + and - SNPs, then you can't measure the date of appearance of M222 accurately because common STR haplotypes would predate the appearance of the M222 SNP. Let's focus on the rigor of the analysis, not the cost of SNP testing. David -- Dr. David H. MacLennan, Banting and Best Department of Medical Research, University of Toronto, Charles H. 
Best Institute, 112 College St., Toronto, Ontario, Canada M5G1L6 Tel:1-416-978-5008 Fax:1-416-978-8528 http://www.utoronto.ca/maclennan

    07/11/2011 04:33:19
    1. Re: [R-M222] M222 Tree
    2. Bill Howard
    3. To the list: Yes, Sandy has done the calculation right. But he misunderstands the power of correlating groups of strings, not just one. So he is falling victim to over-interpreting a result that comes from just one pair of haplotypes. The comparison that Sandy is making is equivalent to taking just one pair in an intercluster and trying to draw conclusions about the whole makeup of the intercluster. To show this explicitly, go to the URL that presents individual values of the intersections of Howles and McGoverns that I sent to the list earlier. You can see it at: <http://mysite.verizon.net/weh8/Howle-McGovernIntercluster.pdf> Look at the intercluster RCC numbers and consider the calculations below:

    15 25 18 18 15 15 17 15 25 17 17 15 15 17 12 17 14 14 16 16 18 15 19 17 17 14 14 16 12 12 14 14 11 11 13 12 18 15 15 12 12 14 12 18 15 15 13 13 10

    Average: 15.18367347
    SD: 2.990773795
    SD/Avg: 19.7%

    A value of 20% of the average, like this one, is typical of the SDs of intercluster regions.

    Expectation of the SD from an intercluster using Sandy's numbers:
    19.7% of 97.11 is 19.12804854
    19.7% of 113.26 is 22.30916258

    Sandy gets (113.26 - 97.11) = 16.15 for his difference, which he thinks is higher than it should be. But 16 is well under the one SD expected, 19 to 22, since an SD of about 20% of the average is very representative of intercluster results. Sandy's comparison is normal. Q.E.D. My conclusions: Don't use anecdotal "evidence" -- use averages and large haplotype comparisons like the ones that comprise large clusters and intercluster regions. Mathematica averages the numbers in the Howle-McGovern intercluster above to compute the junction point on the phylogenetic tree, but it does it using not only that intercluster, but the entire set of haplotypes. Refer to the junction point between the Howle and McGovern clusters and you will see that the junction is at RCC ~ 15, exactly where it should be. 
- Bye from Bill Note to Sandy -- no need to redo your calculation. It was done correctly, but your interpretation was not correct. On Jul 11, 2011, at 4:05 AM, Sandy Paterson wrote: > I've set up an Excel spreadsheet at > > http://dl.dropbox.com/u/2733445/EWCON.xlsx > > > Column A is Paul Conroy's 37-marker haplotype. Column B is the 37-marker > haplotype of Ewing 26605. The CC and the RCC are in cells C37 and D37. The > RCC is 97.11. If you change the CDYb value in column B from 38 to 37, the > RCC changes from 97.11 to 113.26. > > Changes of 1 at other markers result in smaller changes in RCC. > > I think it would be worthwhile if someone were to check this independently > from first principles. Having said that, I get the same answer of 97.11 > using my own software, working from first principles. > > > Sandy > > > > -----Original Message----- > From: dna-r1b1c7-bounces@rootsweb.com > [mailto:dna-r1b1c7-bounces@rootsweb.com] On Behalf Of Sandy Paterson > Sent: 11 July 2011 06:48 > To: dna-r1b1c7@rootsweb.com > Subject: Re: [R-M222] M222 Tree > > [I've done that myself. If a marker (it doesn't make any difference which > one) with the value of 12 is altered to 13 you will always get the same > CC. Change another marker with a value of 29 to 30 and you will get a > different CC. In genetic distance computations the result would be
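
The average, SD and percentages quoted in Bill's message above can be reproduced as follows. This is a sketch of the arithmetic only: the 49 RCC values are copied from the message, the SD is taken to be the sample standard deviation (dividing by n - 1), which is the choice that matches the quoted 2.990773795, and the variable names are illustrative.

```python
import statistics

# The 49 Howle-McGovern intercluster RCC values listed in the message.
rcc_values = [
    15, 25, 18, 18, 15, 15, 17, 15, 25, 17, 17, 15, 15, 17, 12, 17, 14,
    14, 16, 16, 18, 15, 19, 17, 17, 14, 14, 16, 12, 12, 14, 14, 11, 11,
    13, 12, 18, 15, 15, 12, 12, 14, 12, 18, 15, 15, 13, 13, 10,
]

mean_rcc = statistics.mean(rcc_values)
sd_rcc = statistics.stdev(rcc_values)  # sample SD, i.e. n - 1 in the denominator
ratio = sd_rcc / mean_rcc              # about 0.197, the "19.7%" in the message

print(f"Average: {mean_rcc:.8f}")
print(f"SD:      {sd_rcc:.9f}")
print(f"SD/Avg:  {ratio:.1%}")

# Scale the same ratio to the two RCC values from Sandy's spreadsheet.
for rcc in (97.11, 113.26):
    print(f"{ratio:.1%} of {rcc} is {ratio * rcc:.8f}")

# The one-marker difference that the message compares against those figures.
difference = 113.26 - 97.11
print(f"Difference: {difference:.2f}")
```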

    07/11/2011 03:46:11