Hi Bill,

I don't think this discussion is wasting people's time, though I do wish the tone were rather more civil overall. I have to say that regardless of the specifics of the Excel implementation, Sandy's objection to the statistical foundation of RCC is valid.

The sample correlation coefficient on paired data (x_1,y_1),...,(x_n,y_n) corresponding to two random variables X and Y is the sum of (x_i - x_m)*(y_i - y_m) for i from 1 to n, divided by the product of the magnitudes of the vectors (x_1-x_m,...,x_n-x_m) and (y_1-y_m,...,y_n-y_m), where x_m and y_m are the means of the sample data x_1,...,x_n and y_1,...,y_n respectively. The key point here is that the data x_1,...,x_n are supposed to be separate measurements of a single random variable X of interest. When you use them for RCC, by your design x_1,...,x_n are the STR marker values themselves. These are not separate measurements of a single quantity, and the value x_m, which is the average marker value across all 37, 67, or 111 markers, has no obvious significance. You are measuring the correspondence between two random variables on a population of 37, 67, or 111, where the population is marker values and not people.

This introduces additional problems because marker value ranges vary widely between markers. Some, like DYS710, have high repeat numbers (mine is 35), while others, like DYS578, are lower (mine is 9). My testing with my own 111-marker sample has shown that the RCC between my profile and my profile with a one-point mutation (i.e. the difference between RCC values before and after) *varies inversely with the distance of the particular marker value from the mean marker value x_m*. I can supply data if you like. There is absolutely no good biological reason why RCC should depend so closely on marker values: a one-point mutation is a one-point mutation whether the change is from 34 to 35 or 14 to 15.
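The dependence on marker value described above can be checked with a short sketch. It assumes a made-up 12-marker profile (not my actual 111-marker data) and uses the plain sample correlation coefficient as defined above, not Bill's RCC formula, which is not reproduced in this message. The names pearson, drop_after_one_step and PROFILE are hypothetical and exist only for this illustration.

```python
import math

def pearson(xs, ys):
    """Sample correlation coefficient, as defined in the text above."""
    n = len(xs)
    x_m = sum(xs) / n
    y_m = sum(ys) / n
    dx = [x - x_m for x in xs]
    dy = [y - y_m for y in ys]
    numerator = sum(a * b for a, b in zip(dx, dy))
    denominator = math.sqrt(sum(a * a for a in dx)) * math.sqrt(sum(b * b for b in dy))
    return numerator / denominator

def drop_after_one_step(profile, k):
    """Return 1 - r between a profile and a copy with marker k increased by one repeat."""
    mutated = list(profile)
    mutated[k] += 1
    return 1.0 - pearson(profile, mutated)

# Hypothetical marker values chosen for illustration only; their mean is 17.
PROFILE = [13, 24, 14, 11, 12, 12, 13, 29, 17, 9, 35, 15]

if __name__ == "__main__":
    mean = sum(PROFILE) / len(PROFILE)
    for k, value in enumerate(PROFILE):
        # Distance of this marker's value from the mean, and the resulting drop in r.
        print(k, value, abs(value - mean), drop_after_one_step(PROFILE, k))
```

Running it lists, for each marker, its distance from the mean marker value alongside the drop in correlation caused by a one-step mutation at that marker, so the two columns can be compared directly.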
The fact that it does is evidence of the artificiality of this particular measure of genetic distance, to say nothing of the fact that documented mutation rates, which will certainly affect TMRCA calculations, are apparently not included in the RCC model at all.

I have a few other points to raise about RCC which I will strive to write up and post here. I want to emphasize to all, however, that Bill has done a lot of work here and that innovation in statistical analysis of genetic data should be welcomed. That said, these innovators, like all researchers, have to be ready to face criticism, and just because a particular objection has not been raised before is not evidence of its falseness.

Thanks to all for the discussion.

regards,
Steve
You must be the Steve Forrest from the L21+ group. I'm curious about Hannan, kit number 96185. He is an M222+ lookalike, but is L21+, M222-. Do you know whether he has asked FTDNA to check his M222- status?

Sandy

-----Original Message-----
From: dna-r1b1c7-bounces@rootsweb.com [mailto:dna-r1b1c7-bounces@rootsweb.com] On Behalf Of Stephen Forrest
Sent: 17 July 2011 22:02
To: dna-r1b1c7@rootsweb.com
Subject: Re: [R-M222] Calculation of a correlation coeficient

Hi Bill, I don't think this discussion is wasting people's time, though I do rather wish the tone were overall rather more civil.