RootsWeb.com Mailing Lists
Total: 2/2
    1. Re: [Y-DNA-projects] Y-DNA-PROJECTS Sample Size
    2. Ralph Taylor
    3. Colin asked {paraphrasing & interpreting here} how large a sample size is adequate for the Ferguson Y-DNA project. Diana answered from a scientist's perspective: ".. I go into this kind of research with no expectations, and just let the data tell me the story." and went on to other aspects.

I have a different perspective -- that of a manager, who's not comfortable "letting the chips fall where they may" and prefers taking action to affect (i.e., bias) the outcome in the desired way. I am a researcher only to the extent it's needed to answer a specific problem; when that is dealt with, I go on to a different problem. Recognizing that one may never have all the information desirable, I'm less interested in "absolute proof" than in what can & should be done; I'm more prone to define problems than to construct hypotheses.

Whichever perspective we look at the question from, we need to define what "adequate sample size" means in this instance. Does it mean (1) that we've identified a certain fraction of Y-DNA haplotypes for the target population? Or, (2) that we reach a certain probability of new participants matching existing participants?

If we define it the second way, this happens to be a question to which I've devoted a fair amount of effort and developed a possible answer. It is somewhat complex and requires graphics and mathematics, so the list is not an appropriate medium for full discussion. (E-mail me for a Word version, or view the rough draft of the HTML version at http://freepages.misc.rootsweb.ancestry.com/~taylorydna/resources/size-vs-unmatched.htm.)

However, the brief essence of the matter is:

- The probability of a random new participant matching existing participants depends on the ratio of the lines found within the project to the total number of ancestral lines for the project's target population. For short, I call this the "F/A ratio".
The probability is

    Prob(match) = F/A

(BTW, the F/A ratio could also be considered an index of "survey completeness" for the first definition.)

"Found lines" (F, the numerator of the fraction) is a simple calculation; add up the number of groups (with matching Y-DNA, symbolized by G) and the number of unmatched singletons (symbolized by S):

    F = G + S

If I have Colin's data correct, he has 20 groups and 250 participants total. G = 20, S <= 210; therefore, F <= 230.

"Ancestral lines" (A, the denominator) is a more difficult number to determine and -- in many cases -- requires a variety of estimation methods. It may be particularly difficult for a clan surname such as Ferguson, as opposed to an occupational surname.

Colin seems to have done an admirable amount of investigation into the Ferguson surname. This may establish, at least, upper & lower limits for A. (If one can't arrive at exact numbers, knowing the possible range is better than nothing.)

To sum up: Yes, Colin, I believe you're on -- or close to -- the right track.

-ralpht_/)

Message: 1
Date: Fri, 10 Sep 2010 08:50:07 -0700
From: Colin Ferguson <colin.fergie@gmail.com>
Subject: [Y-DNA-projects] Sample Size
To: Y-DNA-PROJECTS@rootsweb.com
Message-ID: <AANLkTim+6dQ05oX4PcnHk27+-SR2xZeTLX0gAh4oby0H@mail.gmail.com>
Content-Type: text/plain; charset=ISO-8859-1

I estimate that there are 300,000 Ferguson and variants worldwide with slightly more than half of these resident in the US. We only have 250 participants in the project, which with 150,000 men to sample seems ridiculously small.

However, I estimate further that back about 1600 there were only 5,000 Ferguson and 80% of these were in Scotland. If you assume an average household of 5 then that equates to 1,000 heads of household as progenitors of all 300,000 Ferguson alive today. See http://dna.cfsna.net/Demographics.htm

One number in particular that I struggle to estimate is how many of those 1,000 heads of household we would call related to one another.
If that number is one in five then I am down to only 200 earliest known ancestors that need be tested to characterize today's population of Ferguson. A reasonable sample size at least seems attainable.

In our project we have about 20 different groupings of Ferguson each sharing their own ancestor about 800 years ago: 400 years as the time back from present to our 1600s progenitors and another 400 years as a TMRCA for the 1600s progenitors, accounting for the one in five as above. The 20 different groups referred to account for about half our participants; the remainder fall in small groups or don't match other participants.

Relative to 200 earliest known ancestors our sample size is still small but at least not ridiculously so. Am I on track?

------------------------------

Message: 2
Date: Fri, 10 Sep 2010 15:45:33 -0400
From: "Diana Gale Matthiesen" <DianaGM@dgmweb.net>
Subject: Re: [Y-DNA-projects] Sample Size
To: <y-dna-projects@rootsweb.com>
Message-ID: <005c01cb5120$bb7966d0$326c3470$@dgmweb.net>
Content-Type: text/plain; charset="us-ascii"

There are so many factors that can affect how many progenitors any given surname has, it's difficult to predict how many there will actually turn out to be. Personally, I go into this kind of research with no expectations, and just let the data tell me the story. It's a patience I've learned, I guess, from being a scientist (now retired). Having expectations can lead to bias, and trying to make interpretations from too little data is futile, so I've learned to suppress the inclination to do either. I know that's not an exciting answer, but the lesson learned by every grad student is that it takes more data than you ever thought to really prove something.

There is also a sampling issue here that is important for us: until your rarest group is represented by at least three individuals, there is a high probability that you have not found all groups.
Now, the assumption here is that you are sampling a population randomly, which a surname project may not be doing. That is, I have no idea whether the FERGUSONs being tested are a random sample of FERGUSONs, or not. But, I think it's safe to say (and setting aside the issue of NPEs): As long as you have any FERGUSONs in your project who have no match, you can assume you have not, possibly not even remotely, tested all the lineages.

Rather than just sitting back and waiting for enough FERGUSONs to randomly join the project, one thing you can do to speed progress is to make a list of known FERGUSON immigrants to the U.S., then make it your goal to find and test at least one patrilineal descendant of each. The next goal would be to test a second one, to be certain the first doesn't have an NPE -- and if the first two don't match, to test a third, for the same reason.

Of course, you would love to test FERGUSONs in Scotland, but if you're having as much trouble as I am bringing Europeans into your project, that's not really an option.

Hope this helps,
Diana
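
For anyone who wants to try the F/A arithmetic from this thread, here is a minimal Python sketch. It is only an illustration and not taken from Ralph's paper. The function names are made up, and it assumes every group has at least two members, which is one way to arrive at S <= 210 from 20 groups and 250 participants. The two values tried for A (200 and 1,000) are Colin's rough figures for earliest known ancestors and 1600s heads of household, used here purely as candidate bounds.

```python
def found_lines_upper_bound(groups, participants, min_group_size=2):
    """Upper bound on F = G + S when only G and the participant total are known.

    Assumption (not stated in the thread): each group holds at least
    `min_group_size` participants, so everyone else could be a singleton.
    """
    max_singletons = participants - min_group_size * groups
    if max_singletons < 0:
        raise ValueError("too few participants for that many groups")
    return groups + max_singletons


def match_probability(found, ancestral):
    """Prob(match) = F/A, capped at 1 because F cannot truly exceed A."""
    if ancestral <= 0:
        raise ValueError("ancestral lines must be positive")
    return min(1.0, found / ancestral)


# Figures quoted in the thread: 20 groups, 250 participants.
G = 20
PARTICIPANTS = 250
F_MAX = found_lines_upper_bound(G, PARTICIPANTS)  # 20 + 210 = 230

# Candidate values for A, taken from Colin's rough estimates.
for ancestral_lines in (200, 1000):
    p = match_probability(F_MAX, ancestral_lines)
    print(f"A = {ancestral_lines}: F <= {F_MAX}, Prob(match) <= {p:.2f}")
```

With A = 200 the uncapped ratio would exceed 1, which suggests that either the upper bound on F is very loose or 200 is too low an estimate for A. Either way, it shows why knowing the range of A matters.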

    09/13/2010 07:33:21
    1. Re: [Y-DNA-projects] Y-DNA-PROJECTS Sample Size
    2. Colin Ferguson
    3. Hi Ralph,

Thanks, I am delighted to see a critical examination of sample size. I also prefer "taking action to affect (i.e., bias) the outcome in the desired way". For example, in the Fergus(s)on project we work hard to recruit Scottish and Irish participants in an attempt to reduce the US bias others alluded to. We have had some success as about 10% of our participants reside in Scotland or Ireland.

As I read your paper I realized that I think of G=Groups as the number of found lines and S=Singletons as those waiting to be found or discovered to be NPE.

Cheers,
Colin

    09/15/2010 05:00:07