In a message dated 7/10/2011 1:21:47 P.M. Central Daylight Time, alexanderpatterson@btinternet.com writes:

I have explained to Bill, I think 3 times, that I don't work in Excel and seldom use spreadsheets. Mostly, I write my own software in one of 3 programming languages, depending on the application, and I use files, not spreadsheets, so there is no spreadsheet to send him.

I don't think Bill ever really explained this to the list in detail (although it is in his articles), but he uses the data analysis tool in older versions of Excel to generate the CC (correlation coefficient) half matrix he uses. I've tried that myself: it instantly compares every sample to every other sample in the spreadsheet and in seconds spits out the CCs in a half-matrix form, which should be familiar to anyone who has used the McGee utility for genetic distance, since the format is identical. To build a full matrix you need to do some copying and pasting or use some other Excel trick. Then you need to convert every CC in the matrix to RCC as you describe. You wind up with something exactly like the McGee genetic distance matrix, where every sample is compared to the others in one cell, except you have RCC numbers rather than genetic distance.

I do not think this data analysis tool is available in newer versions of Excel. In the older version I used (MS Excel 2000) it was an add-on which had to be installed from the CD. I just checked my current version of Excel and it does list a data tool kit add-on which includes correlation, but I'm not sure if I'm getting it installed correctly or not. It also lists a VBA-based data tool kit add-on.

I have no idea if that could be duplicated in software. If the McGee utility can generate a full matrix then you probably can too.

I rarely use Excel myself and find the process slow and tedious, mainly because of the learning curve involved in using Excel itself. It's not my cup of tea.
I'd rather have a software program generate the entire matrix and conversions.

I simply don't see much difference between Bill's correlation method and standard genetic distance.

<Finally, about the association of genetic distance (GD) with RCC -- I have run many strings of haplotypes and have changed various marker values by 1, 2, 3, and compared many sets with each other. They show that a change of 1 in GD can cause a change in RCC of about 3, depending on which marker (low vs high) is changed. Table 1 in my published paper in the JoGG (http://mysite.verizon.net/weh8/Howard1.pdf) confirms those more extensive calculations. I stand by this association of GD with RCC and maintain that RCC contains more valuable information because it applies to every marker value, not just citing how many of them have changed.>

I've done that myself. If a marker (it doesn't make any difference which one) with the value of 12 is altered to 13 you will always get the same CC. Change another marker with a value of 29 to 30 and you will get a different CC. In genetic distance computations the result would be two (in the above example) but it would be something slightly different in the correlation approach. A marker change from 39 to 40 would be yet a different CC value. I haven't been able to figure out yet exactly what the correlation approach is doing mathematically. I'm not sure what difference applying correlation to every marker makes in comparison to genetic distance, since most of the markers will be the same in any case. Correlation will also only note the changes.
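The comparison described above can be sketched in a few lines of Python. This is only an illustration under two assumptions that the thread does not confirm: that CC is the ordinary Pearson correlation coefficient between two marker strings, and that genetic distance is the sum of absolute marker differences (stepwise counting). The 10-marker haplotype and the helper names (`pearson_cc`, `genetic_distance`, `mutate`) are made up for the example and are not taken from any real sample or tool.

```python
from math import sqrt

def pearson_cc(x, y):
    """Pearson correlation coefficient between two equal-length marker strings."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def genetic_distance(x, y):
    """ASSUMPTION: stepwise genetic distance, the sum of absolute differences per marker."""
    return sum(abs(a - b) for a, b in zip(x, y))

def mutate(haplotype, index, step=1):
    """Return a copy of the haplotype with one marker changed by `step`."""
    changed = list(haplotype)
    changed[index] += step
    return changed

# Hypothetical 10-marker haplotype (illustrative values only, not a real sample).
base = [13, 24, 14, 11, 12, 12, 29, 17, 39, 12]

cases = [
    ("12 -> 13 (5th marker)", 4),
    ("12 -> 13 (6th marker)", 5),
    ("29 -> 30", 6),
    ("39 -> 40", 8),
]
for label, index in cases:
    other = mutate(base, index)
    gd = genetic_distance(base, other)
    cc = pearson_cc(base, other)
    print(f"{label:24s} GD={gd}  CC={cc:.6f}")
```

Under these assumptions every single-step change gives GD = 1, while the CC depends on the value of the marker that changed rather than on its position in the string.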
As a check on the correlation coefficient approach I ran the same samples Bill is using through one of the Phylip suite of programs called Kitsch, which uses genetic distance data generated by the McGee utility. A freeware program called Mega then generates the charts. Info on how to use these programs can be found on the McGee utility site. There are lots of variables that can be set on the McGee utility, some of which I thought were debatable. The instructions include using the infinite allele mutation model, setting the probability to 95%, years=25 years/generation, mutation rate = FTDNA = 0.004..0.0075.

I'm going to have to re-run this because I omitted some samples used in a tree produced by Bill with Mathematica. But the resulting tree showed basically the same thing for the McGoverns and Howles, two surnames Bill has been talking about lately. They are clustered tightly together in both systems. The Mega program, however, just gives a short time scale at the bottom of the chart. On this particular tree it's 200 years.

All things being equal, it appears Bill's methods may allow for a more accurate reading on TMRCA. Extrapolating from the 200 year scale on the Mega chart is difficult by eye, and the chart doesn't appear to go beyond 1000 years for any sample in the spreadsheet.

I've never been a fan of just using genetic distance alone in DNA analysis. I know John McEwan used it often. He too came up with phylogenetic charts for M222 which are still available on his web site. But he also used modals and in fact developed one for each of his R1bSTR clusters. The reason I distrust genetic distance alone is you can get false positives, matches that on closer inspection aren't really matches. Samples at a GD of 5 tell you nothing about which markers are different.

I've also used Fluxus charts which take the opposite approach, finding links between shared marker values in haplotypes. That is almost impossible to use in huge data sets though.
Someone sent me one for M222 a few years ago and it was an indecipherable mess.

At this stage I'm not sold on any one approach. But that's just my opinion. Everyone else is entitled to their own.

John
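On the question raised in the message above of whether the Excel half-matrix step could be duplicated in software, here is a rough Python sketch of one way it might be done. It assumes CC is the plain Pearson correlation, and it assumes the CC-to-RCC conversion is RCC = 10^4 * (1/CC - 1); that formula is a recollection of Howard's paper and is not stated anywhere in this thread, so it is passed in as a replaceable function. The three sample haplotypes and all function names are hypothetical.

```python
from math import sqrt

def pearson_cc(x, y):
    """Pearson correlation coefficient between two equal-length marker strings."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def rcc_from_cc(cc):
    """ASSUMPTION: RCC = 10^4 * (1/CC - 1); swap in the correct conversion if this is wrong."""
    return 1e4 * (1.0 / cc - 1.0)

def full_matrix(haplotypes, convert=lambda cc: cc):
    """Build a full symmetric matrix by computing the lower half and mirroring it."""
    n = len(haplotypes)
    # A sample compared with itself has CC = 1 by definition.
    matrix = [[convert(1.0)] * n for _ in range(n)]
    for i in range(n):
        for j in range(i):
            value = convert(pearson_cc(haplotypes[i], haplotypes[j]))
            matrix[i][j] = value  # lower half, as in the half-matrix output
            matrix[j][i] = value  # mirrored into the upper half
    return matrix

# Hypothetical 6-marker samples (illustrative values only).
samples = {
    "A": [13, 24, 14, 11, 12, 29],
    "B": [13, 25, 14, 11, 12, 29],
    "C": [13, 24, 15, 10, 12, 30],
}
names = list(samples)
rcc = full_matrix([samples[name] for name in names], convert=rcc_from_cc)

for name, row in zip(names, rcc):
    print(name, " ".join(f"{value:8.2f}" for value in row))
```

Computing only the lower half and mirroring it avoids both the copy-and-paste step and the redundant second calculation of each pair.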
Thanks, John. I have only this to add -- the major difference between the correlation method and standard genetic distance is at least twofold. The RCC time scale can be calibrated more easily, and GD is only an indication that somewhere in a haplotype there has been a marker change, whereas the RCC method says the same thing more precisely because it looks at the entire marker string. That's a big difference.

- Bye from Bill Howard
[I've done that myself. If a marker (it doesn't make any difference which one) with the value of 12 is altered to 13 you will always get the same CC. Change another marker with a value of 29 to 30 and you will get a different CC. In genetic distance computations the result would be two (in the above example) but it would be something slightly different in the correlation approach. A marker change from 39 to 40 would be yet a different CC value. I haven't been able to figure out yet exactly what the correlation approach is doing mathematically. I'm not sure what difference applying correlation to every marker makes in comparison to genetic distance, since most of the markers will be the same in any case. Correlation will also only note the changes.]

That's given me an idea. I should be able to set up something in Excel that allows anyone to calculate the CC and hence RCC between two haplotypes. I'll use the Conroy 16646 and Ewing 26605 37-marker haplotypes as the example. We can then examine the effect on the CC and the RCC of a single mutation at any marker.

I'm a bit slow in Excel so it may take a while, but I think it's worth it.

Sandy

-----Original Message-----
From: dna-r1b1c7-bounces@rootsweb.com [mailto:dna-r1b1c7-bounces@rootsweb.com] On Behalf Of Lochlan@aol.com
Sent: 11 July 2011 03:24
To: dna-r1b1c7@rootsweb.com
Subject: Re: [R-M222] M222 Tree
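For anyone who would rather script the single-mutation experiment proposed in Sandy's message above than build it in Excel, a minimal Python sketch follows. It assumes CC is the Pearson correlation and that RCC = 10^4 * (1/CC - 1); the conversion is recalled from Howard's paper rather than stated in this thread, so treat it as an assumption. The two 8-marker strings are placeholders and are not the Conroy or Ewing haplotypes, and the function names are hypothetical.

```python
from math import sqrt

def pearson_cc(x, y):
    """Pearson correlation coefficient between two equal-length marker strings."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def rcc_from_cc(cc):
    """ASSUMPTION: RCC = 10^4 * (1/CC - 1); not confirmed by the thread."""
    return 1e4 * (1.0 / cc - 1.0)

def single_mutation_effects(hap_a, hap_b, step=1):
    """Return the baseline RCC and, per marker, the RCC change when that
    marker in hap_b alone is altered by `step`."""
    baseline = rcc_from_cc(pearson_cc(hap_a, hap_b))
    effects = []
    for i in range(len(hap_b)):
        mutated = list(hap_b)
        mutated[i] += step
        effects.append(rcc_from_cc(pearson_cc(hap_a, mutated)) - baseline)
    return baseline, effects

# Placeholder 8-marker strings (illustrative only, not real haplotypes).
hap_a = [13, 24, 14, 11, 12, 29, 17, 38]
hap_b = [13, 25, 14, 11, 12, 29, 17, 38]

baseline, effects = single_mutation_effects(hap_a, hap_b, step=-1)
print(f"baseline RCC = {baseline:.2f}")
for position, (value, delta) in enumerate(zip(hap_b, effects), start=1):
    print(f"marker {position} (value {value}): RCC change {delta:+.2f}")
```

Replacing the placeholder lists with two real 37-marker haplotypes would give the marker-by-marker table, provided the assumed RCC conversion matches the one used in the spreadsheet.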
I've set up an Excel spreadsheet at http://dl.dropbox.com/u/2733445/EWCON.xlsx

Column A is Paul Conroy's 37-marker haplotype. Column B is the 37-marker haplotype of Ewing 26605. The CC and the RCC are in cells C37 and D37.

The RCC is 97.11. If you change the CDYb value in column B from 38 to 37, the RCC changes from 97.11 to 113.26. Changes of 1 at other markers result in smaller changes in RCC.

I think it would be worthwhile if someone were to check this independently from first principles. Having said that, I get the same answer of 97.11 using my own software, working from first principles.

Sandy

-----Original Message-----
From: dna-r1b1c7-bounces@rootsweb.com [mailto:dna-r1b1c7-bounces@rootsweb.com] On Behalf Of Sandy Paterson
Sent: 11 July 2011 06:48
To: dna-r1b1c7@rootsweb.com
Subject: Re: [R-M222] M222 Tree
I don't like this method of calculation at all. However, there may be another way of calculating a sort of CC that would make more sense (to me, anyway). I'll see what I come up with.

Sandy

-----Original Message-----
From: dna-r1b1c7-bounces@rootsweb.com [mailto:dna-r1b1c7-bounces@rootsweb.com] On Behalf Of Sandy Paterson
Sent: 11 July 2011 09:06
To: dna-r1b1c7@rootsweb.com
Subject: Re: [R-M222] M222 Tree

R1b1c7 Research and Links: http://clanmaclochlainn.com/R1b1c7/
-------------------------------
To unsubscribe from the list, please send an email to DNA-R1B1C7-request@rootsweb.com with the word 'unsubscribe' without the quotes in the subject and the body of the message