RootsWeb.com Mailing Lists
    1. Re: Single-tree gedcom files question
    2. Bob LeChevalier
    3. Peter J. Seymour wrote:
>> On 2011-05-16 15:57, singhals wrote:
>>> Peter J. Seymour wrote:
>>>> I have been doing some analysis of a selection of the numerous gedcom
>>>> files out there. One thing I have found disappointing is that the larger
>>>> files tend to consist of a number of fragments rather than a single
>>>> tree. In fact, the larger the file, the more likely it is to consist of
>>>> fragments. Some large files seem to consist mostly of numerous
>>>> unconnected individuals, or couples, or perhaps small trees of three or
>>>> four people.
>>>> So this seems to be how the really large files are made: throw together
>>>> lots of data on the basis that it might be vaguely related.
>>>> This set me wondering: How large do single trees get? So here is a
>>>> challenge for you all, What is the largest single-tree gedcom you are
>>>> aware of, does it consist of sensible data, and more to the point how
>>>> large is it (File size in bytes and number of individuals, both metrics
>>>> are needed please?

(In addition to the generation count metric you added later, you also probably need some kind of metric on the amount of sourcing data, as shown below, though I'm not sure what kind of metric would be appropriate.)

I hope the following will be useful.

I have several files that might be considered "really large", which have multiple trees therein, but having one main tree linking most of the people in the data base together. The remainder of the data base consists of fragments that might eventually be linked in, but have not yet. But I cannot say that they consist of "throwing together lots of data on the basis that it might be vaguely related".

By far the largest tree I work with is last year's version of a tree created by a French colleague. His current tree likely has 20-30K more people in it - he works fast and voluminously.
The version I have from last summer has 139772 people in it, of which 130355 are in a single tree (according to Legacy, which has a utility to count the trees in a file). There are a couple of thousand fragments in the file, the largest being only 58 people, and there are 145 fragments with 10 or more people in them. The file was 30.784 MB as I first received it from him in GEDCOM format. In Legacy format, which is how I use it, the main file is about 269 MB. I use this file mostly for reference. It has almost no source information, and consists almost entirely of BMD info for the people included, with few notes or supporting data. (He has extensive notes and a good memory, so he can usually tell me where he got the information.)

A second file is based on a 2007 file by the same Frenchman, which is my active working file for my French research. That file, as received, was around 24 MB in GED format, and maybe 100000 people. I have since added around 20000 people to the file, all extensively sourced. My version of the tree has 109522 of the 120000 in one tree, and 2250 smaller fragments, the largest of which has 78 people. I just did a GEDCOM 5.5 export of those who are in the single tree of 109522, and the file is 39.980 MB, so adding sourced records for about 20% more people increased the GEDCOM size by more than 50%.

The bulk of the data in these trees comprises a sort of place study for a small region on the border between the Calvados, Manche and Orne departments in Normandy, France. My French colleague has been working a larger area, less systematically, using a variety of sources, while I have focused on only 3 parishes, for which I have personally transcribed 100 years of records for one and 70 years for the other two, and am attempting (slowly) to add all of the people mentioned to the tree. Since people in the 1600s didn't move very much, perhaps 95% of the records can be linked to someone already in the tree.
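
For anyone who wants to count trees without Legacy, here is a rough Python sketch of one way to do it. It is not Legacy's method, which I don't know; it simply treats two people as being in the same tree if they are linked through a FAM record by a HUSB, WIFE or CHIL pointer, and groups them with a union-find structure. The function name `tree_sizes` is made up for illustration, and the parsing assumes ordinary GEDCOM 5.5 lines of the form "level tag-or-xref value" and ignores every other kind of link.

```python
from collections import Counter


def tree_sizes(gedcom_lines):
    """Return the sizes of the connected groups of individuals, largest first.

    Assumption: people are connected only through HUSB/WIFE/CHIL pointers
    inside FAM records; all other record types and tags are ignored.
    """
    parent = {}

    def find(x):
        # Union-find lookup with path halving.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[ra] = rb

    individuals = set()
    current_fam = None
    for raw in gedcom_lines:
        parts = raw.strip().split(maxsplit=2)
        if len(parts) < 3:
            continue
        level, second, third = parts
        if level == "0":
            # A new top-level record starts: "0 @I1@ INDI" or "0 @F1@ FAM".
            current_fam = None
            if third == "INDI":
                individuals.add(second)
                find(second)
            elif third == "FAM":
                current_fam = second
        elif level == "1" and current_fam and second in ("HUSB", "WIFE", "CHIL"):
            # Link the family record to each person it points at.
            union(current_fam, third)

    counts = Counter(find(person) for person in individuals)
    return sorted(counts.values(), reverse=True)


sample = [
    "0 @I1@ INDI", "0 @I2@ INDI", "0 @I3@ INDI",
    "0 @I4@ INDI", "0 @I5@ INDI", "0 @I6@ INDI",
    "0 @F1@ FAM", "1 HUSB @I1@", "1 WIFE @I2@", "1 CHIL @I3@",
    "0 @F2@ FAM", "1 HUSB @I4@", "1 WIFE @I5@",
]
sizes = tree_sizes(sample)
```

On a real file you would pass the open file object instead of the sample list. The first entry of the result is the size of the main tree, and the rest are the fragments.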
Almost all of the data is extremely "sensible", even if not sourced. The primary source is parish records for the period in which French parish records exist (after around 1600 for the best parishes, but many records were destroyed at the time of the D-Day invasion, since Manche's archives are in St-Lo, and Calvados's in Caen, both cities that suffered badly in the invasion). Several lines are connected to nobility in the 1600s, and in many cases, those lines can be connected using French medieval sources back to Charlemagne, and beyond to his ancestors. But I don't know how good those French medieval sources are.

I cannot give good details on generations, because the file is too large and complex for automated analysis. My colleague a few years ago calculated that he had 94% coverage of his own direct line ancestors back 10 generations, 35% back to 13 generations (5634 of 16383), but much sparser beyond that point (only 3% of the 14th generation). Out to 29 generations, he had around 16000 total ancestors including repeats that appear in multiple lines.

I just ran a test on my 120,000 person version of his file. There are 53000 ancestors out to the 33rd generation, but there are people who are repeated 250 times or more in multiple lines, so I couldn't guess how many unique individuals that is. (One such 250-times-repeated person herself - Elizabeth de Vermandois in the 1100s - has around 150 ancestors including repeats, going back 22 more generations to the 500s AD.) Charlemagne is around 45 generations back (and appears 15 times in Elizabeth de Vermandois's ancestry alone - hence would be in many thousands of slots in a complete ancestral tree). I'm sure that the number is considerably higher for his current file, since the bulk of our work the last couple of years has been in the 1600s, and I know he has added material from at least one major medieval source.
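
The slot arithmetic behind those coverage figures can be checked with a few lines of Python. This is only a sketch under one assumption: that the 16383 figure counts the starting person as generation 0, so that the total through generation n is 2^(n+1) - 1. The function names are made up for illustration.

```python
def slots_in_generation(g):
    """Number of ancestor slots in generation g (parents are generation 1)."""
    return 2 ** g


def slots_through(n, include_self=True):
    """Total slots from generation 0 (or 1) through generation n."""
    total = 2 ** (n + 1) - 1
    return total if include_self else total - 1


total_13 = slots_through(13)              # slots through 13 generations
coverage_13 = 5634 / total_13 * 100       # percent of those slots filled
gen_14 = slots_in_generation(14)          # slots in the 14th generation alone
gen_45 = slots_in_generation(45)          # slots at Charlemagne's distance
```

Under that assumption the 13-generation total comes out to the 16383 quoted, and 5634 filled slots is a little over 34%, consistent with the rounded 35%. The 45th generation has far more slots than there were people alive at the time, which is why the same individuals have to appear over and over in a complete ancestral tree.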
---

My other projects are primarily American and include two databases of 38283 and 26185 people, of which 37013 and 25749 are in one tree, respectively, both extensively sourced, and with a good deal of transcribed census records in the notes for key heads of households.

The larger database has a second chunk of around 1000 that is strongly believed to be tied to the first (based on DNA evidence), but the link has not been found. Another chunk of 50 is connected, but only via an obscure cousins-and-marriages link that I haven't added to the database yet. A fourth chunk of 150 is connected, but by an unknown link a few hundred years ago that I haven't looked very hard for: two families with a very unusual name in the same town. There are around 150 individuals in 25 additional chunks.

That larger tree has one line of 19 generations, but most of the main lines of the tree are about 9-12 generations. It is a very bushy tree with many lines of cousins, and I'm habitually adding such things as parents of spouses, stepkids (and their spouses), etc. These usually are only a few generations deep.

The GEDCOM 5.5 export of the heavily-sourced main tree of 37013 is 35.572 MB, larger than the GEDCOM for the unsourced 140,000 person file with 50+ generations.

lojbab
---
Bob LeChevalier - artificial linguist; genealogist
[email protected]
Lojban language www.lojban.org

    05/17/2011 12:26:04