I have been doing some analysis of a selection of the numerous gedcom files out there. One thing I have found disappointing is that the larger files tend to consist of a number of fragments rather than a single tree. In fact, the larger the file, the more likely it is to consist of fragments. Some large files seem to consist mostly of numerous unconnected individuals, or couples, or perhaps small trees of three or four people.

So this seems to be how the really large files are made: throw together lots of data on the basis that it might be vaguely related.

This set me wondering: How large do single trees get? So here is a challenge for you all: What is the largest single-tree gedcom you are aware of, does it consist of sensible data, and more to the point, how large is it? (File size in bytes and number of individuals; both metrics are needed, please.)

Peter
On May 16, 8:20 am, "Peter J. Seymour" <[email protected]> wrote:
> This set me wondering: How large do single trees get? So here is a
> challenge for you all, What is the largest single-tree gedcom you are
> aware of, does it consist of sensible data, and more to the point how
> large is it (File size in bytes and number of individuals, both metrics
> are needed please?

If you google for 'BUELL001.GED' you'll find a large GEDCOM file by someone called Matthew James Buell. It is about 3MB and contains about 9,900 individuals, virtually all of whom are related. (It's a descent from the biblical Adam, so clearly parts of it are dubious.)

I'm sure I've seen larger databases, though I'm not sure I can point you to one at the moment. But this one seems to have become a fairly standard test database for applications, as it's large enough to start hitting scalability issues and diverse enough to include a wide range of dates and nationalities.

Richard
Peter J. Seymour wrote:
> I have been doing some analysis of a selection of the numerous gedcom
> files out there. One thing I have found disappointing is that the larger
> files tend to consist of a number of fragments rather than a single
> tree. In fact, the larger the file, the more likely it is to consist of
> fragments. Some large files seem to consist mostly of numerous
> unconnected individuals, or couples, or perhaps small trees of three or
> four people.
> So this seems to be how the really large files are made: throw together
> lots of data on the basis that it might be vaguely related.
> This set me wondering: How large do single trees get? So here is a
> challenge for you all, What is the largest single-tree gedcom you are
> aware of, does it consist of sensible data, and more to the point how
> large is it (File size in bytes and number of individuals, both metrics
> are needed please?

As of 10 am EDT on Sunday 15 May 2011, one of my databases has 24744 persons in an 18862080-byte file. Unfortunately, it contains branches; the largest branch has 23642 descendants & spouses of a single couple. The other branches are NOT OURS (116 persons), ?OURS? (41 persons), and the rest are parents of spouses who are being kept because two sibs married into the main database.

This file is based on family papers going back to the 1750s, added to in 1826, again in 1894, 1937, 1987, 2003, and through yesterday. Data has been documented in census records, wills, BMDs, newspapers, and official government documents (e.g., bounty-land warrants, military service records, Acts of Congress, etc.).

Another database has 14088 persons (descendants and spouses) in a 6197248-byte file. This one is taken from a 1980s book based on 1970s research; much of it has been confirmed in official records.

Neither of these is particularly large in my corner of the world.

FWIW.

Cheryl
On 05-16-2011 03:20, Peter J. Seymour wrote:
> What is the largest single-tree gedcom you are aware of, does it consist

http://myarnolds.com

> of sensible data, and more to the point how large is it (File size in

yes
don't know

> bytes and number of individuals, both metrics are needed please?

second metric on the site's welcome page

--
Wes Groleau
There are two types of people in the world …
http://Ideas.Lang-Learn.us/barrett?itemid=1157
I am somewhat curious about your question. You seem to be trying to come up with some kind of "quality" or "genuine-ness" metric based on the characteristics of a single tree.

I would have thought most genuine researchers' databases would contain a majority of people who are all connected by some sequence of parent-child and spouse-spouse relationships (the "family tree", but mathematically probably a connected forest rather than a tree). And then there would be a whole load of other folk, individually, as couples, or in small trees, who represent people who have come up in your research but whose relationship to your family either isn't established, or whom you have proven not to be related but whose information you need to hang onto to help you work out who's who (particularly when there are different families with the same surname living in the same place at the same time that you are forever stumbling over in your research). There will also be those pruned "wrong branches" from your own tree, based on some erroneous conclusion that has now been corrected. Also, people often have bits of research done for friends in their databases.

People doing things like one-name studies or local history will have GEDCOMs that are full of disconnected individuals and small trees.

I think you have to know what the GEDCOM purports to represent to know what kind of connectivity of individuals you should reasonably expect to find within it. I don't think finding large numbers of disconnected individuals or small trees in a GEDCOM file is necessarily a sign of "rubbish research". On the contrary, those people who like to build big family trees just for the sake of it will generally have a highly connected set of individuals, as they will have linked someone in their own database to someone in another database, creating an even larger connected "family tree", on the flimsiest of evidence.

I think those who are overly willing to make "connections" are more dangerous from a quality perspective than those who collect a lot of data (for whatever reason) and don't connect it.

Kerry
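
For anyone who wants to measure the connectivity discussed in this thread rather than eyeball it, here is a rough Python sketch. It is not taken from any poster's tools. It assumes GEDCOM 5.5-style text in which individuals are level-0 `INDI` records and families are level-0 `FAM` records with level-1 `HUSB`, `WIFE` and `CHIL` links, and it treats everyone named in the same family record as connected. The function name `component_sizes` is made up for illustration, and character-encoding quirks are ignored.

```python
import re
from collections import Counter

# Level-0 record header, e.g. "0 @I1@ INDI" or "0 @F1@ FAM".
RECORD_RE = re.compile(r"^0\s+(@[^@]+@)\s+(INDI|FAM)\b")
# Level-1 family link, e.g. "1 HUSB @I1@".
LINK_RE = re.compile(r"^1\s+(HUSB|WIFE|CHIL)\s+(@[^@]+@)")


def component_sizes(gedcom_text):
    """Return the sizes of the connected groups of individuals, largest first.

    Hypothetical helper: two individuals count as connected when a chain of
    FAM records (spouse or parent-child links) joins them.
    """
    parent = {}  # union-find forest keyed by individual xref id

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[ra] = rb

    in_family = False
    first_member = None
    for raw in gedcom_text.splitlines():
        line = raw.strip()
        if line.startswith("0 "):
            # A new top-level record starts; note whether it is a family.
            rec = RECORD_RE.match(line)
            in_family = bool(rec) and rec.group(2) == "FAM"
            first_member = None
            if rec and rec.group(2) == "INDI":
                find(rec.group(1))  # register the individual
            continue
        if in_family:
            link = LINK_RE.match(line)
            if link:
                person = link.group(2)
                if first_member is None:
                    first_member = person
                    find(person)
                else:
                    union(first_member, person)

    roots = Counter(find(p) for p in list(parent))
    return sorted(roots.values(), reverse=True)
```

The length of the returned list is the number of fragments, its sum is the number of individuals, and its first entry is the size of the largest connected group. The file size in bytes can be read separately with `os.path.getsize`. As Kerry points out, a large first entry only says the file is well connected, not that the connections are sound.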