On 8/28/2017 12:32 PM, taf wrote:
> On Sunday, August 27, 2017 at 12:54:20 PM UTC-7, taf wrote:
>
>> They have an ongoing project to do whole genome sequencing, from which
>> further markers may be determined that would allow the subclade to be
>> determined,
>
> For those holding out hope, let me add that I said 'may' intentionally.
> The next-generation sequencing approaches most commonly used would
> provide useful SNP data (places where a single base is different), but
> are abysmal at determining the number of repeats in STRs (and are prone
> to guess, and when they do, to guess wrong).
>
> This has to do with the way the information is generated - it collects
> very short stretches of sequence and uses a computer to line them up.
> Since, by its nature, an STR has the same sequence repeated again and
> again, the computer doing the alignment doesn't know which set to align
> with. As an example, let's say you generated sequences that could be
> characterized as:
>
> ABCDEFG
>
> DEFGHIJ
>
> This is simple to align as ABCDEFGHIJ. However, if you have a repeat:
>
> ABCDEEEEEEE
>
> EEEEEEEFGHI
>
> there is no way to tell how many of the repeats you have, or which E
> aligns with which:
>
> ABCDEEE....EEEFGHI
>
> The only way you can count what is in between is if you have a single
> sequencing read that spans all the way from one side to the other:
> DEEEEEEEEEFG. However, most genealogically-informative STR regions are
> longer than the read lengths typically generated by the sequencing
> reaction, so you will never get this information. Unless it has
> specifically been programmed not to do this, the computer doing the
> compiling may simply align one stretch of EEEEE with another and give
> you a sequence with a definite repeat count even though the raw data
> was ambiguous.
>
> Thus, I wouldn't hold out hope that whole-genome sequencing will
> resolve this question, and one should be very careful in accepting
> reported values unless these issues are specifically evaluated.
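[Editor's illustration, not part of the original post: the alignment ambiguity described above can be demonstrated with a few lines of Python. The `overlaps` function is a toy assembler that enumerates every way two reads could be joined; the read pair with a run of E's admits several joins, each implying a different repeat count.]

```python
# Toy sketch of read assembly: list every assembled sequence consistent
# with overlapping the end of `left` with the start of `right`.
def overlaps(left, right, min_overlap=3):
    """Return all assemblies with at least `min_overlap` matching characters."""
    results = []
    for k in range(min_overlap, min(len(left), len(right)) + 1):
        if left[-k:] == right[:k]:
            results.append(left + right[k:])
    return results

# Unique flanking sequence: exactly one consistent assembly.
print(overlaps("ABCDEFG", "DEFGHIJ"))         # ['ABCDEFGHIJ']

# Reads ending/starting in a repeat run: five consistent assemblies,
# containing anywhere from 7 to 11 E repeats.
print(overlaps("ABCDEEEEEEE", "EEEEEEEFGHI"))
```

An assembler that silently picks one of those five candidates would report a definite repeat count from ambiguous raw data, which is exactly the failure mode described above.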
> The STR analysis that is typically performed is PCR-based, not
> sequencing-based, and hence does not suffer from these limitations, but
> one has to separately evaluate each STR - hence the scaled costs for
> progressively more 'markers', each additional marker being another test
> that must be performed. There is nothing to stop them doing a
> 100-marker test on Richard's DNA - if it was preserved well enough to
> do a 23-marker test, it should be good enough to do any number. It is
> just a question of whether the research group would view this as a
> priority or not. (And one may be able to sway this decision on their
> part with a sizable financial contribution to their research program,
> but it is going to cost you a lot more than simply having FTDNA test a
> cheek swab.)

Thanks for explaining this. I knew that they had to do the analysis by
looking at small segments and seeing how they would be attached, but it
hadn't occurred to me that this would make long STRs harder to analyze,
although it is pretty clear now that you have mentioned it. I know that
the typical layman probably has a much simpler process in mind, no doubt
aided by all of the CSI-type TV shows where they just put the DNA sample
into a machine, push a button, and presto! (As an expert in the field,
can you even watch those shows without cringing? There was one such show
recently where the cops had an "expert" genealogist aiding them in a
case, which was so bad it had me shaking my head in disgust.)

It is tempting to think of a chromosome as being like a long string of
recording tape that you just put in a device and read from end to end,
but I suppose that technology is still a long way off. Does the above
discussion mean that the "whole" genome sequencing that they talk about
is a misnomer?

Your comment (elsewhere in this thread) indicating that 23 markers might
not be enough to get good information made me wonder if my own results
are atypical.
Among my Y-DNA matches at Family Tree DNA (none apparently any closer
than 6th cousins), my 67-marker test shows 25 matches with "genetic
distance" between 3 and 5 (none closer than that), 19 of whose surnames
are either Baldwin or one of the variants of Maybury (Mayberry, Mabry,
etc.), with many more of the latter. My group of Baldwins appears to
have arisen from a "non-paternal event" (NPE) with a Maybury biological
father around 300 years ago, or perhaps earlier. (I have circumstantial
evidence for a specific "suspect.") Of the remaining six, two have
circumstantial evidence for either a Baldwin or Maybury NPE, two have an
ancestral geography consistent with such an NPE, and two have
insufficient information.

If I look at my matches considering only the first 25 markers, I have
21 matches with genetic distance zero, 14 of whom are either Baldwin or
Maybury. At the 12-marker level, as one might expect, there is no
obvious pattern, with a huge number of matches of distance zero bearing
apparently random surnames.

So, the "noise level" at 12 markers is obviously too high for those
results to be of much use. At 67 markers, the noise level is small and
at least plausibly attributable to NPEs. At 25 markers, the noise level
is high, but not high enough to drown out the Maybury presence. So, is
my experience typical, or has an overabundance of Maybury testees
skewed the results?

Has anybody looked at the list of names that would pop up if Richard's
current STR data were input to search for all current matches with
genetic distance zero on those markers? Even if it produces nothing of
interest, it still seems worth a try. (I would do it myself, if I knew
how, but Family Tree DNA's tools are not helpful for something like
this.)

Stewart Baldwin
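[Editor's illustration, not part of the original post: "genetic distance" between two STR panels can be sketched in a few lines. FTDNA actually uses a hybrid mutation model (some markers counted differently), so the simple step-wise version below, which just sums the absolute differences in repeat counts marker by marker, is only an approximation; the marker values shown are hypothetical.]

```python
# Step-wise genetic distance between two equal-length STR marker panels.
def genetic_distance(a, b):
    """Sum of absolute repeat-count differences, marker by marker."""
    if len(a) != len(b):
        raise ValueError("panels must cover the same markers")
    return sum(abs(x - y) for x, y in zip(a, b))

# Hypothetical 6-marker panels for illustration only.
me      = [13, 24, 14, 11, 11, 13]
match_1 = [13, 24, 14, 11, 11, 13]   # identical panel
match_2 = [13, 25, 14, 11, 12, 13]   # differs at two markers by one step each

print(genetic_distance(me, match_1))  # 0
print(genetic_distance(me, match_2))  # 2
```

This also makes the noise-level point concrete: with only 12 markers, many unrelated men will happen to score distance zero, while matching at distance 0-5 across 67 markers is far less likely by chance.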