Note: The RootsWeb Mailing Lists will be shut down on April 6, 2023.
RootsWeb.com Mailing Lists
    1. Re: Event-oriented genealogy software for Linux
    2. Wes Groleau
    3. On 05-16-2011 11:58, John Prentice wrote: > A colleague and I are working on a completely new approach that … … sounds like what I was working on when I decided PGV/webtrees was good enough. :-) -- Wes Groleau There are two types of people in the world … http://Ideas.Lang-Learn.us/barrett?itemid=1157

    05/17/2011 01:18:27
    1. Re: Event-oriented genealogy software for Linux
    2. Wes Groleau
    3. On 05-17-2011 06:38, [email protected] wrote: > It reminds me of a German colleague. He told me that he was having a house > built for him after discussing the plans and his wishes with his architect > for two years. I challenged him to bet that within one year of living in his > home, he would come up with some problems or things he had forgotten. He took > the bet very enthusiastically, but he never came to claim his prize. I was happy with my plans. Not so happy with the dozens of deviations the contractor let the subcontractors get away with! -- Wes Groleau There are two types of people in the world … http://Ideas.Lang-Learn.us/barrett?itemid=1157

    05/17/2011 01:16:45
    1. Re: Single-tree gedcom files question
    2. Bob LeChevalier
    3. Peter J. Seymour wrote: >> On 2011-05-16 15:57, singhals wrote: >>> Peter J. Seymour wrote: >>>> I have been doing some analysis of a selection of the numerous gedcom >>>> files out there. One thing I have found disappointing is that the larger >>>> files tend to consist of a number of fragments rather than a single >>>> tree. In fact, the larger the file, the more likely it is to consist of >>>> fragments. Some large files seem to consist mostly of numerous >>>> unconnected individuals, or couples, or perhaps small trees of three or >>>> four people. >>>> So this seems to be how the really large files are made: throw together >>>> lots of data on the basis that it might be vaguely related. >>>> This set me wondering: How large do single trees get? So here is a >>>> challenge for you all, What is the largest single-tree gedcom you are >>>> aware of, does it consist of sensible data, and more to the point how >>>> large is it (File size in bytes and number of individuals, both metrics >>>> are needed please? (In addition to the generation count metric you added later, you also probably need some kind of metric on the amount of sourcing data, as shown below, though I'm not sure what kind of metric would be appropriate.) I hope the following will be useful. I have several files that might be considered "really large", which have multiple trees therein, but having one main tree linking most of the people in the data base together. The remainder of the data base consists of fragments that might eventually be linked in, but have not yet. But I cannot say that they consist of "throwing together lots of data on the basis that it might be vaguely related". By far the largest tree I work with is last year's version of a tree created by a French colleague. His current tree likely has 20-30K more people in it - he works fast and voluminously. 
The version I have from last summer has 139772 people in it, of which 130355 are in a single tree (according to Legacy, which has a utility to count the trees in a file). There are a couple thousand fragments in the file, the largest being only 58 people, and there are 145 fragments with 10 or more people in them. The file was 30.784 MB as I first received it from him in GEDCOM format. In Legacy format, which is how I use it, the main file is about 269 MB. I use this file mostly for reference. It has almost no source information, and consists almost entirely of BMD info for the people included, with few notes or supporting data. (He has extensive notes and a good memory, so he can usually tell me where he got the information.) A second file is based on a 2007 file by the same Frenchman, which is my active working file for my French research. That file, as received, was around 24 MB in GED format, and maybe 100000 people. I have since added around 20000 people to the file, all extensively sourced. My version of the tree has 109522 of the 120000 in one tree, and 2250 smaller fragments, the largest of which is 78. I just did a GEDCOM 5.5 export of those who are in the single tree of 109522, and the file is 39.980 MB, so adding 20% more people (heavily sourced) to the database added more than 50% to the GEDCOM size. The bulk of the data in these trees comprises a sort of place study for a small region on the border between the Calvados, Manche and Orne departments in Normandy, France. My French colleague has been working a larger area, less systematically, using a variety of sources, while I have focused on only 3 parishes: I have personally transcribed 100 years of records for one and 70 years for the other two, and am attempting (slowly) to add all of the people mentioned to the tree. Since people in the 1600s didn't move very much, perhaps 95% of the records can be linked to someone already in the tree.
Almost all of the data is extremely "sensible", even if not sourced. The primary source is parish records for the period in which French parish records exist (after around 1600 for the best parishes, but many records were destroyed at the time of the D-Day invasion, since Manche's archives are in St-Lô, and Calvados's in Caen, both cities that suffered badly in the invasion). Several lines are connected to nobility in the 1600s, and in many cases, those lines can be connected using French medieval sources back to Charlemagne, and beyond to his ancestors. But I don't know how good those French medieval sources are. I cannot give good details on generations, because the file is too large and complex for automated analysis. My colleague a few years ago calculated that he had 94% coverage of his own direct-line ancestors back 10 generations, 35% back to 13 generations (5634 of 16383), but much sparser beyond that point (only 3% of the 14th generation). Out to 29 generations, he had around 16000 total ancestors, including repeats that appear in multiple lines. I just ran a test on my 120,000-person version of his file: there are 53000 ancestors out to the 33rd generation, but there are people who are repeated 250 times or more in multiple lines, so I couldn't guess how many unique individuals that is. One such 250-times-repeated person - Elizabeth de Vermandois in the 1100s - herself has around 150 ancestors including repeats, going back 22 more generations to the 500s AD. Charlemagne is around 45 generations back (and appears 15 times in Elizabeth de Vermandois's ancestry alone - hence would be in many thousand slots in a complete ancestral tree). I'm sure that the number is considerably higher for his current file, since the bulk of our work the last couple of years has been in the 1600s, and I know he has added material from at least one major medieval source.
--- My other projects are primarily American and include two databases at 38283 and 26185 people, of which 37013 and 25749 are in one tree, respectively, both extensively sourced, and with a good deal of transcribed census records in the notes for key heads of households. The larger database has a second chunk of around 1000 that is strongly believed to be tied to the first (based on DNA evidence), but the link has not been found. Another chunk of 50 is connected, but only via an obscure cousins-and-marriages link that I haven't added to the database yet. A fourth chunk of 150 is connected, but by an unknown link a few hundred years ago that I haven't looked very hard for: two families with a very unusual name in the same town. There are around 150 individuals in 25 additional chunks. That larger tree has one line of 19 generations, but most of the main lines of the tree are about 9-12 generations. It is a very bushy tree with many lines of cousins, and I'm habitually adding such things as parents of spouses, stepkids (and their spouses), etc. These usually are only a few generations deep. A GEDCOM 5.5 export of the heavily-sourced main tree of 37013 comes to 35.572 MB, larger than the GEDCOM for the unsourced 140,000-person file with 50+ generations. lojbab --- Bob LeChevalier - artificial linguist; genealogist [email protected] Lojban language www.lojban.org
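The tree-counting that Legacy performs on a file like Bob's is essentially a connected-components pass over individual and family links. A minimal sketch in Python (the tuple-based family layout is a stand-in for real GEDCOM parsing, not Legacy's actual implementation):

```python
from collections import defaultdict

class UnionFind:
    """Minimal union-find for grouping individuals into trees."""
    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            # path halving keeps the structure shallow
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            self.parent[rb] = ra

def tree_sizes(individuals, families):
    """individuals: list of person ids; families: (husband, wife, [children])
    tuples where any field may be None. Returns tree sizes, largest first."""
    uf = UnionFind()
    for ind in individuals:
        uf.find(ind)                      # register even unconnected people
    for husb, wife, children in families:
        members = [m for m in ([husb, wife] + children) if m]
        for m in members[1:]:
            uf.union(members[0], m)
    counts = defaultdict(int)
    for ind in individuals:
        counts[uf.find(ind)] += 1
    return sorted(counts.values(), reverse=True)
```

Run over a parsed file, the sorted sizes give exactly the "one main tree plus a long tail of fragments" picture described above.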

    05/17/2011 12:26:04
    1. Re: Event-oriented genealogy software for Linux
    2. Ian Goddard
    3. [email protected] wrote: > Ian Goddard wrote: >> Nope. You end up having to make such changes because you didn't think >> it through in the first place. > > Oh yes, I recognize that argument. But I've never come across a project that > didn't have it surprises, even after having written down and discussed and > agreed upon the requirements document. OTOH well planned S/W products are able to cope with a wide range of applications. I used to be a sysadmin for a client using a particular ERP package for warehouse management. Another site of the same company used the same package to support a print-buying business and a hardware service business. I don't doubt that other users of the package had distinctly different ways of deploying it. %>< >> By now it should be clear how to treat this. You recognise /in advance/ >> that the hierarchies will be time-dependent and make provision for >> optional start and finish dates. You also recognise that a particular >> place may be simultaneously in different hierarchies, e.g. >> ecclesiastical (even different ecclesiastical hierarchies, such as >> different Anglican & RC parishes), manorial, Poor Law. You adopt a data >> model that fits and then code to that. > > Sounds good, but are you british? > I think you underestimate the complexity of it all. My family tree (my > mother's work) goes back to around 1590. Our regions (Flanders) have been > jostled around between the big powers for a number of times and reorganized > again and again. The town where most of my family lives now (and also in the > past, has even been scinded in two (part Spanish, part French) for a number > of years. This is a specific example of a general requirement. It's essentially no different from the situation that Holmfirth chapelry was split between two parishes and that things were shuffled round in both ecclesiastical and civil terms. 
Provide a /general/ framework for constructing location hierarchies which makes provision for a split and it makes no difference at what level the split happens. You didn't mention whether the town has different names in different languages, which sometimes happens. Again, it's a general requirement; if you allow for synonyms you can handle Pontefract vs Pomfret just as easily as Königsberg vs Kaliningrad. > I think a lot of family tree researchers simply give up there, it > is way too time-consuming to record it all. Again, my area confuses more distant researchers because there's no effective means to convey these changes. Wouldn't it be better if there were? -- Ian The Hotmail address is my spam-bin. Real mail address is iang at austonley org uk
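The time-dependent, multi-hierarchy place model Ian describes can be sketched directly: each place carries synonyms plus date-bounded parent links, one per hierarchy. A minimal illustration (the field names and year-granularity dates are my assumptions, not a proposed standard):

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class PlaceLink:
    """One date-bounded membership of a place in some parent hierarchy."""
    parent: str
    hierarchy: str                 # e.g. "civil", "ecclesiastical", "manorial"
    start: Optional[int] = None    # year; None means open-ended
    end: Optional[int] = None

@dataclass
class Place:
    name: str
    synonyms: list = field(default_factory=list)
    links: list = field(default_factory=list)

    def parents_in(self, year, hierarchy=None):
        """Parents valid in a given year, optionally filtered to one hierarchy."""
        return [l.parent for l in self.links
                if (l.start is None or l.start <= year)
                and (l.end is None or year <= l.end)
                and (hierarchy is None or l.hierarchy == hierarchy)]
```

A boundary change or a split is then just one link ending and new ones starting, at whatever level it happens; Pontefract vs Pomfret is handled by the synonym list.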

    05/17/2011 11:29:13
    1. Re: Event-oriented genealogy software for Linux
    2. [email protected]
    3. Ian Goddard wrote: > [email protected] wrote: >> >> I've read thru this thread, and I wonder if there will be any program, >> including your own, that will fulfil all things that have been asked >> here. >> >> It seems to me that you're after a kind of logging system, rather than a > >.....snip a lot > >> >> Can such program be built? Yes, but the degree of freedom you want >> assures you of frequent (and some substantial) changes to be applied. And >> changing the program is one thing, but assuring your data survive those >> changes is another story, where ultimate care will be needed. > > Nope. You end up having to make such changes because you didn't think > it through in the first place. Oh yes, I recognize that argument. But I've never come across a project that didn't have its surprises, even after having written down and discussed and agreed upon the requirements document. It reminds me of a German colleague. He told me that he was having a house built for him after discussing the plans and his wishes with his architect for two years. I challenged him to bet that within one year of living in his home, he would come up with some problems or things he had forgotten. He took the bet very enthusiastically, but he never came to claim his prize. This illustrates something which I found out the hard way, but I have never seen it in any sort of handbook. When you start a project (be it building a home or an IT solution to some problem), you start from an analysis of the current situation and its problems and the customer's desires. Even if you succeed in fulfilling all these requirements, the simple fact that you put your solution in "production" creates a new situation. This has, in a few cases I've been thru, invalidated some of the initial presumptions you started off with (customer behavior is changing), and turns your magnificent new solution into a new problem. That's why I am very wary of this "think it thru" and "final solution".
Of course, if you build the system for yourself only, it could be a lot easier, but if you release it to the public...... ....snip more.... >> There has been discussion about hierarchy of places. I think trying to >> register such things is a bad idea in the first place, because such >> "relationships" have been so volatile in history. The only indication one >> could give that would remain consistent is something like: place X is >> part of (located nearby, ....) Y in 2011. Even geographical coordinates >> are no good, since villages etc. have moved in the run of time. >> > > By now it should be clear how to treat this. You recognise /in advance/ > that the hierarchies will be time-dependent and make provision for > optional start and finish dates. You also recognise that a particular > place may be simultaneously in different hierarchies, e.g. > ecclesiastical (even different ecclesiastical hierarchies, such as > different Anglican & RC parishes), manorial, Poor Law. You adopt a data > model that fits and then code to that. Sounds good, but are you British? I think you underestimate the complexity of it all. My family tree (my mother's work) goes back to around 1590. Our regions (Flanders) have been jostled around between the big powers a number of times and reorganized again and again. The town where most of my family lives now (and also in the past) has even been split in two (part Spanish, part French) for a number of years. I think a lot of family tree researchers simply give up there; it is way too time-consuming to record it all. Herman > > > I can envisage a system with several aspects: > > - The genealogical data itself. > > - Standing data such as location information. > > - Rules such as the fuzzy logic which Richard mentioned. > > - A shared data model to describe the above. > > - Code to handle them. > > > This leaves scope for different S/W vendors and open source teams to > provide the last part.
It also provides scope for specialists to > provide shared standing data or shared rules. It even, in an ideal > world, provides scope for archive sites such as A2A to export data in a > useable form. And it provides scope for users such as you and I to > explore that data and to find the family relationships which hide within > it. > -- Veel mensen danken hun goed geweten aan hun slecht geheugen. (G. Bomans) Lots of people owe their good conscience to their bad memory (G. Bomans)

    05/17/2011 06:38:58
    1. Re: Single-tree gedcom files question
    2. Peter J. Seymour
    3. On 2011-05-17 09:51, Richard Smith wrote: > On May 16, 9:07 pm, "Peter J. Seymour"<[email protected]> > wrote: > >> Oh and some sort of negative factor for any mention of Charlemagne, >> Pharaohs, Moses etc. > > Is it fair to include Charlemagne there? ... > > Richard Point accepted. I am a bit uneasy about such long-distance links though. It only takes one mis-reported instance of paternity and the link is completely bogus (at least in biological terms). According to someone else's genealogy I am descended from Edward II (and let's leave aside the question of what is the point of such claims). Now I don't know how reliable that research was. Even assuming it is accurate, unrecorded liaisons with resulting pregnancy are sufficiently common that one can be the more doubtful about a paternal line the longer it is. So another possible metric would be some sort of statistical probability relating number of generations to the cultural setting and the reliability of the link. How to go about deriving that one, I have no idea. Peter
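Peter's suggested metric has a simple first approximation: if each generation link has an independent chance p of being mis-reported, the probability that an n-generation line is biologically intact is (1 - p)^n. A toy sketch (the 2% default rate is purely illustrative, not an established figure; a real metric would vary p with the cultural setting and the quality of the records, as Peter suggests):

```python
def line_intact_probability(generations, misattribution_rate=0.02):
    """Probability that every link in a descent line is biologically correct,
    assuming an independent per-generation misattribution rate.
    The default rate is an illustrative placeholder, not an established figure."""
    return (1 - misattribution_rate) ** generations
```

Even a modest per-generation error rate makes a long line doubtful: at 2% per generation, a 20-generation descent has roughly a one-in-three chance of containing at least one break.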

    05/17/2011 04:37:42
    1. Re: Single-tree gedcom files question
    2. Paul Blair
    3. On 16-May-2011 5:20 pm, Peter J. Seymour wrote: > I have been doing some analysis of a selection of the numerous gedcom > files out there. One thing I have found disappointing is that the larger > files tend to consist of a number of fragments rather than a single > tree. In fact, the larger the file, the more likely it is to consist of > fragments. Some large files seem to consist mostly of numerous > unconnected individuals, or couples, or perhaps small trees of three or > four people. > So this seems to be how the really large files are made: throw together > lots of data on the basis that it might be vaguely related. > This set me wondering: How large do single trees get? So here is a > challenge for you all, What is the largest single-tree gedcom you are > aware of, does it consist of sensible data, and more to the point how > large is it (File size in bytes and number of individuals, both metrics > are needed please? > > Peter Some of the people at the PGV site (http://sourceforge.net/projects/phpgedview/forums/forum/185166) have very large 'trees'. Maybe they are not unfragmented, but I'd certainly ask the question. That link takes you to the Help forum, and there is an Open forum (under Forums on the menu bar) that might also be a place to search. I used 'large' as a starter, and found one person with a 30,000-person 'tree'. Paul

    05/17/2011 02:30:16
    1. Re: Event-oriented genealogy software for Linux
    2. Richard Smith
    3. On May 16, 4:58 pm, John Prentice <[email protected]> wrote: > A colleague and I are working on a completely new approach that I > believe will answer most of your requirements in the first version, and > probably all of them by the second or third. It will work seamlessly > across PC, Mac and Linux (with no porting requirements, and no data > incompatibilities between). It will include backup & publishing "to the > cloud", with public visibility levels that you set. There are plans to > have smartphone (Android and iOS) versions that store the text > information but don't hold binary data locally, and data will be > exportable in a number of standardised formats. (GEDCOM 5.5 and GEDCOM > 5.5 XML at a bare minimum.) That sounds like excellent news, and I look forward to hearing more about it in due course. Have you yet chosen a product name so that I know what to keep an eye out for? Richard

    05/16/2011 08:19:25
    1. Re: Single-tree gedcom files question
    2. Richard Smith
    3. On May 16, 9:07 pm, "Peter J. Seymour" <[email protected]> wrote: > Oh and some sort of negative factor for any mention of Charlemagne, > Pharaohs, Moses etc. Is it fair to include Charlemagne there? It's not a subject that I've looked at in any great detail, but I was under the impression that descents from Charlemagne via Berengar I of Frioul and into the early Portuguese royal family were generally considered valid. Richard

    05/16/2011 07:51:47
    1. Re: Single-tree gedcom files question
    2. Peter J. Seymour
    3. On 2011-05-16 19:25, singhals wrote: > Peter J. Seymour wrote: >> On 2011-05-16 15:57, singhals wrote: >>> Peter J. Seymour wrote: >>>> I have been doing some analysis of a selection of the numerous gedcom >>>> files out there. One thing I have found disappointing is that the >>>> larger >>>> files tend to consist of a number of fragments rather than a single >>>> tree. In fact, the larger the file, the more likely it is to consist of >>>> fragments. Some large files seem to consist mostly of numerous >>>> unconnected individuals, or couples, or perhaps small trees of three or >>>> four people. >>>> So this seems to be how the really large files are made: throw together >>>> lots of data on the basis that it might be vaguely related. >>>> This set me wondering: How large do single trees get? So here is a >>>> challenge for you all, What is the largest single-tree gedcom you are >>>> aware of, does it consist of sensible data, and more to the point how >>>> large is it (File size in bytes and number of individuals, both metrics >>>> are needed please? >>> >>> As of 10 am EDT on Sunday 15 May 2011, one of my databases has 24744 >>> persons in a 18862080M file. .....going back >>> to the 1750s, ...... >>> >>> Another database has 14088 persons (descendants and spouses) in a >>> 6197248M file. This one is taken from a 1980s book based on 1970s >>> research; much of it has been confirmed in official records. >>> >>> Neither of these are particularly large in my corner of the world. >>> >>> FWIW. >>> >>> Cheryl >> >> Thanks. A rough calculation shows the 24744 file as around 10 times more >> fully populated than the "Diana" file previously mentioned. What this >> says to me is that 'number of generations' should be included in tree >> metrics (I suppose that should have been obvious, but better late than >> never). The way it would work is that the larger the number of >> generations covered by a given number of individuals, the less "good" >> the file is. 
>> In the current version of Gendatam Suite, a 20M file might take around >> 80-100M of RAM when loaded. That works fine with modern computers which >> might have as much as 4000M of RAM, but wouldn't have been feasible not >> that many years ago when it was rare for a computer to have more than >> about 8M of RAM. I suppose my point is that modern computers should cope >> well with holding and processing these and larger amounts of data. >> >> Peter > > The larger database runs 13 generations from the OP to the newest addition. > > The smaller one is 12 gens, OP to 2010. > > A third database has 41 Gens to the Great Ethelred, 2953216, 3939 > individuals. Data is good to about the 8th gen, as > good-as-it-gets-in-the-US Gens 9-13, but at the 14th gen it waffles off > into the 16th century and is accordingly only showing direct-line > ascent, no sibs. Could be an issue in determining a ratio for big/good? > A lot of folks do just record straight line ancestry. And a lot of folks > omit the more lyrical connections to Charlemagne or the Caesars or Zeus > and Odin ... > > Cheryl No single figure is going to give a good point of comparison of files such as gedcoms with widely differing characteristics. That is why I am musing on what a useful set of metrics might be. So far it has: - Size - Number of individuals - Number of generations To which could be added something like: percentage of individuals in main (or only) tree. Oh and some sort of negative factor for any mention of Charlemagne, Pharaohs, Moses etc. File "goodness" is going to be a rather subjective concept; the metrics are going to be just that: metrics providing points of comparison. Some simple calculations such as ratios may provide useful information about the characteristics (the data make-up) of a file. Peter
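Peter's metric set (size, number of individuals, number of generations, percentage in the main tree) is straightforward to derive once per-tree counts are known. A hypothetical sketch of the derived ratios (the names and rounding are my own choices, not anything from Gendatam Suite):

```python
def file_metrics(file_bytes, tree_sizes, generations=None):
    """tree_sizes: per-tree individual counts, largest first.
    Returns a dict of the comparison metrics discussed in the thread."""
    total = sum(tree_sizes)
    return {
        "size_bytes": file_bytes,
        "individuals": total,
        "generations": generations,          # supplied separately if known
        "fragments": len(tree_sizes),
        "pct_in_main_tree": round(100 * tree_sizes[0] / total, 1) if total else 0.0,
        "bytes_per_individual": round(file_bytes / total, 1) if total else 0.0,
    }
```

The bytes-per-individual ratio is one way to surface the sourcing difference noted earlier in the thread: a heavily sourced file carries far more bytes per person than a bare BMD skeleton of the same size.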

    05/16/2011 03:07:31
    1. Re: Event-oriented genealogy software for Linux
    2. Nick Matthews
    3. On 16/05/2011 14:55, Ian Goddard wrote: > [email protected] wrote: >> >> I've read thru this thread, and I wonder if there will be any program, >> including your own, that will fullfil all things that have been asked here. >> >> It seems to me that you're after a kind of logging system, rather than a >> genealogy program. But you want to define persons, places, events, >> relations, sources and even dates (and a few more?) as entities and have n:n >> relations (in relational database speech) between all of them, > > If you model data in entity-relational terms most of these things have > to be seen as entities. And in some cases the relationships /are/ n:n. > In an RDBMS implementation a link table is a good way of representing > an n:n relationship and once one start's thinking in such terms it > becomes clear that such a table can hold more than just link information > which is why relationships start to emerge as entities in their own right. > > If, OTOH, you model data in OO terms you would model them as objects and > provide classes for them. And again in such a model some of the > inter-class associations are again n:n. > > "even dates"? Historical dates are a real problem. Are we talking > Gregorian or Julian? What's the start of the year, January 1 or Lady > day? What year is 3 Elizabeth? How do "Early nineteenth century", 1820 > and "First quarter nineteenth century" collate - a problem which has put > me off trying to devise a historical date data type for my favourite > RDBMS. At least the OO approach enables you to make some of this > explicit by providing such attributes to dates - I don't know if your > favoured genealogy program does this but Gramps certainly does. > >> including recursive relationships on all of them. > > Not all. However if you think about some of the > entities/objects/whatever you want to call them, they have an internal > structure which is hierarchical and the hierarchy differs from one > instance to the next. 
For instance one record might come from the > parish register of Dunny-on-the-Wold which is in the Dunshire archives. > Another might be in a book of parish register transcriptions which was > published in 1900 by the records section of the Wolds Archaeological > Society. Another might be a page reference of one of several books by > an author which is one of many books published by a publisher. Another > may be a paper in a journal.... You get the picture. They're all > similar in nature and different in detail and even in depth. And parts > of any hierarchy might be re-used - different pages from the same book, > different books by the same author, etc. And a good way of dealing with > this is to use a recursive model. That means that you don't have to > keep lots of copies of the title of the archive etc. It also means that > you can have the option of linking the PR directly to the archive but > another document to a particular collection which has its own distinct > identity within the archive. > >> So, is there then any real structure in the data?? Only in the way, as an >> example, that a date cannot appear in the place of a link where one would >> expect a person reference. But other than that? > > Yes, of course there's structure. It just makes a lot of sense to use a > data model flexible enough to fit the actual data rather than force fit > the data onto a preconceived model. For instance I know of no ancestor > of mine who lived in a city. Why should I have to warp their data to > re-use a "city" field as something else& then how does this warped data > sit properly alongside that of some of my wife's ancestors who did live > in a city? > >> >> Can such program be built? Yes, but the degree of freedom you want assures >> you to frequent (and some substantial) changes to be applied. And changing >> the program is one thing, but assuring your data survive those changes is >> another story, where ultimate care will be needed. > > Nope. 
You end up having to make such changes because you didn't think > it through in the first place. If you plan for flexibility you can code > for it. > [lengthy Example snipped] > > %>< > >> There has been discussions about hierarchy of places. I think trying to >> register such things is a bad idea in the first place, because such >> "relationships" have been so volatile in history. The only indication one >> could give that would remain consistent is something like : place X is part >> of (located nearby, ....) Y in 2011. Even geographical coordinates are no >> good, since villages etc. have moved in the run of time. >> > > By now it should be clear how to treat this. You recognise /in advance/ > that the hierarchies will be time-dependent and make provision for > optional start and finish dates. You also recognise that a particular > place may be simultaneously in different hierarchies, e.g. > ecclesiastical (even different ecclesiastical hierarchies, such as > different Anglican& RC parishes), manorial, Poor Law. You adopt a data > model that fits and then code to that. > > > I can envisage a system with several aspects: > > - The genealogical data itself. > > - Standing data such as location information. > > - Rules such as the fuzzy logic which Richard mentioned. > > - A shared data model to describe the above. > > - Code to handle them. > > > This leaves scope for different S/W vendors and open source teams to > provide the last part. It also provides scope for specialists to > provide shared standing data or shared rules. It even, in an ideal > world, provides scope for archive sites such as A2A to export data in a > useable form. And it provides scope for users such as you and I to > explore that data and to find the family relationships which hide within it. 
> I didn't intend to comment on this thread because my project is at a very early stage, there's no functional program yet but there is code and the beginnings of a practical (I hope) database design at thefamilypack.org - but that was such an accurate description of what I am trying to achieve that I feel obliged to point it out. In particular, you mention standing data; I think this is an area where open source collaborative efforts can really be made to work. I'm sure that anyone who has spent a few years on their family tree will have become expert on some small areas of local history and geography; if there were a simple way to contribute that expertise, without commercial interests taking advantage, then I'm sure it would happen. FreeBMD and friends are a good example of what can be done. Unfortunately this will be the last piece in what has become a very large jigsaw in designing such a system - but at least it is being thought about at the beginning of the process. If anyone does go to the trouble of looking it up - I should point out that the database design only includes the minimum necessary for the program code that is being written. So if you have a question starting "Why doesn't the database include ...." the answer is almost certainly "No, not yet, but ...". Nick.
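The recursive source model quoted above (archive > collection > book > page reference, with shared containers stored once and re-used) maps naturally onto a self-referencing node. A minimal sketch using the hypothetical Dunny-on-the-Wold example from the quoted text (not the actual thefamilypack.org schema):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class RepositoryNode:
    """One node in a recursive source hierarchy: an archive, a collection,
    a book, a page reference... Each node points at its container, so the
    title of a shared container (archive, author, publisher) is stored once
    no matter how many citations hang off it."""
    title: str
    parent: Optional["RepositoryNode"] = None

    def citation(self):
        """Full citation path from the outermost container down."""
        parts = []
        node = self
        while node:
            parts.append(node.title)
            node = node.parent
        return " > ".join(reversed(parts))
```

Because depth is unbounded, the same structure covers a parish register filed directly in an archive and a page reference four levels down in a published transcription, which is exactly the "similar in nature, different in detail and even in depth" property described.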

    05/16/2011 11:01:53
    1. Re: Event-oriented genealogy software for Linux
    2. John Prentice
    3. On 11/05/2011 16:52, Richard Smith wrote: > I've spent years looking for decent genealogy software that suits my > needs, and I'm almost at the stage of giving up and writing my own. Just FYI, I've had similar problems, limitations and frustrations with existing genealogy software. Right now, I'm using a combination of FTM 2011, with the supporting data stored in Dropbox (www.dropbox.com) directories that are replicated across all my computers. It's just about adequate, but it's a long way from what I'd like. (Have you tried installing FTM using WINE under Linux, by the way? It might solve some of your problems.) There's good news ahead. A colleague and I are working on a completely new approach that I believe will answer most of your requirements in the first version, and probably all of them by the second or third. It will work seamlessly across PC, Mac and Linux (with no porting requirements, and no data incompatibilities between). It will include backup & publishing "to the cloud", with public visibility levels that you set. There are plans to have smartphone (Android and iOS) versions that store the text information but don't hold binary data locally, and data will be exportable in a number of standardised formats. (GEDCOM 5.5 and GEDCOM 5.5 XML at a bare minimum.) I'm not going to go into details quite yet. Apart from anything else, we're using some new-tech tricks that haven't been applied to genealogy before, using data representations that allow much looser searches, so there's a LOT that's commercially confidential. Bear with us - it's not going to happen tomorrow. We need to define a business model to make it worth our while too! I think it might make a lot of difference to genealogists when it's done. 
John -- Maintainer of the s.g.b FAQs, at http://www.genealogy-britain.org.uk/ Tracing London names LEE, BEDFORD, CLARK, SUTTON, KEEN, SPRING, HARTLEY, WRIGHT, PETHERBRIDGE (Devonian/Cornish), MOODY (Cornish/Devonian), STEPHENS (Cornish) ** LOOK OUT, SPAM BLOCK AHEAD! ** To email me, please remove ".invalid" from the email address

    05/16/2011 10:58:00
    1. Re: Single-tree gedcom files question
    2. Peter J. Seymour
    3. On 2011-05-16 15:57, singhals wrote: > Peter J. Seymour wrote: >> I have been doing some analysis of a selection of the numerous gedcom >> files out there. One thing I have found disappointing is that the larger >> files tend to consist of a number of fragments rather than a single >> tree. In fact, the larger the file, the more likely it is to consist of >> fragments. Some large files seem to consist mostly of numerous >> unconnected individuals, or couples, or perhaps small trees of three or >> four people. >> So this seems to be how the really large files are made: throw together >> lots of data on the basis that it might be vaguely related. >> This set me wondering: How large do single trees get? So here is a >> challenge for you all, What is the largest single-tree gedcom you are >> aware of, does it consist of sensible data, and more to the point how >> large is it (File size in bytes and number of individuals, both metrics >> are needed please? > > As of 10 am EDT on Sunday 15 May 2011, one of my databases has 24744 > persons in a 18862080M file. .....going back > to the 1750s, ...... > > Another database has 14088 persons (descendants and spouses) in a > 6197248M file. This one is taken from a 1980s book based on 1970s > research; much of it has been confirmed in official records. > > Neither of these are particularly large in my corner of the world. > > FWIW. > > Cheryl Thanks. A rough calculation shows the 24744 file as around 10 times more fully populated than the "Diana" file previously mentioned. What this says to me is that 'number of generations' should be included in tree metrics (I suppose that should have been obvious, but better late than never). The way it would work is that the larger the number of generations covered by a given number of individuals, the less "good" the file is. In the current version of Gendatam Suite, a 20M file might take around 80-100M of RAM when loaded. 
That works fine with modern computers which might have as much as 4000M of RAM, but wouldn't have been feasible not that many years ago when it was rare for a computer to have more than about 8M of RAM. I suppose my point is that modern computers should cope well with holding and processing these and larger amounts of data. Peter
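Peter's proposed metric - the fewer generations needed to cover a given number of individuals, the "better" the file - can be made concrete. The sketch below is purely illustrative (the function name and the simple people-per-generation formula are my own, not Gendatam's), using the 24744-person, 13-generation file and the roughly 9,900-person, 150-generation "Diana" file discussed in this thread:

```python
def fullness(individuals: int, generations: int) -> float:
    """One possible 'fullness' score for a single-tree GEDCOM:
    individuals recorded per generation spanned. More people over
    fewer generations scores higher. (Illustrative formula only.)"""
    if generations <= 0:
        raise ValueError("generations must be positive")
    return individuals / generations

print(round(fullness(24744, 13)))   # 1903 people per generation
print(round(fullness(9900, 150)))   # 66 people per generation
```

On this measure the two files differ by a factor of nearly thirty; a real metric would presumably also need to weigh sourcing data somehow.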

    05/16/2011 10:35:24
    1. Re: Single-tree gedcom files question
    2. Peter J. Seymour
    3. On 2011-05-16 13:03, Richard Smith wrote: > On May 16, 8:20 am, "Peter J. Seymour"<[email protected]> > wrote: > >> This set me wondering: How large do single trees get? So here is a >> challenge for you all, What is the largest single-tree gedcom you are >> aware of, does it consist of sensible data, and more to the point how >> large is it (File size in bytes and number of individuals, both metrics >> are needed please? > > If you google for 'BUELL001.GED' you'll find a large GEDCOM file by > someone called Matthew James Buell. It is about 3MB and contains > about 9,900 individuals, virtually all of whom are related. (It's a > descent from the biblical Adam, so clearly parts of it are dubious.) > I'm sure I've seen larger databases, though I'm not sure I can point > you to one at the moment. But this one seems to have become a fairly > standard test database for applications as it's large enough that it > can start hitting scalability issues and diverse enough to include a > large range of dates and nationalities. > > Richard Yes, almost all the entries are in one tree. Looked at in one way the file is an "interesting" account of the ancestry of a certain someone called Diana. I'm glad you pointed it out though, at 150 generations it breaks an assumed annotation limit in the Gendatam Suite reports. It also illustrates that sorting on date of birth is not very useful if most entries do not have it recorded. Peter
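Peter's closing remark - that sorting on date of birth is not very useful when most entries lack one - has a standard mitigation: sort the dated entries normally and push the undated ones to the end. A minimal sketch (the record layout is invented for illustration):

```python
# Sort individuals by birth year, pushing entries with no recorded
# date to the end rather than letting them dominate the ordering.
people = [
    {"name": "Diana", "birth": 1961},
    {"name": "Adam", "birth": None},       # no date recorded
    {"name": "Elizabeth", "birth": 1533},
]

# Key is a pair: (has-no-date, year). False sorts before True, so all
# dated entries come first, in year order; undated entries trail.
by_birth = sorted(people, key=lambda p: (p["birth"] is None, p["birth"] or 0))
print([p["name"] for p in by_birth])  # ['Elizabeth', 'Diana', 'Adam']
```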

    05/16/2011 10:09:54
    1. Re: Event-oriented genealogy software for Linux
    2. Ian Goddard
    3. [email protected] wrote: > > I've read thru this thread, and I wonder if there will be any program, > including your own, that will fullfil all things that have been asked here. > > It seems to me that you're after a kind of logging system, rather than a > genealogy program. But you want to define persons, places, events, > relations, sources and even dates (and a few more?) as entities and have n:n > relations (in relational database speech) between all of them, If you model data in entity-relational terms most of these things have to be seen as entities. And in some cases the relationships /are/ n:n. In an RDBMS implementation a link table is a good way of representing an n:n relationship and once one starts thinking in such terms it becomes clear that such a table can hold more than just link information which is why relationships start to emerge as entities in their own right. If, OTOH, you model data in OO terms you would model them as objects and provide classes for them. And again in such a model some of the inter-class associations are again n:n. "even dates"? Historical dates are a real problem. Are we talking Gregorian or Julian? What's the start of the year, January 1 or Lady Day? What year is 3 Elizabeth? How do "Early nineteenth century", 1820 and "First quarter nineteenth century" collate - a problem which has put me off trying to devise a historical date data type for my favourite RDBMS. At least the OO approach enables you to make some of this explicit by providing such attributes to dates - I don't know if your favoured genealogy program does this but Gramps certainly does. > including recursive relationships on all of them. Not all. However if you think about some of the entities/objects/whatever you want to call them, they have an internal structure which is hierarchical and the hierarchy differs from one instance to the next. 
For instance one record might come from the parish register of Dunny-on-the-Wold which is in the Dunshire archives. Another might be in a book of parish register transcriptions which was published in 1900 by the records section of the Wolds Archaeological Society. Another might be a page reference of one of several books by an author which is one of many books published by a publisher. Another may be a paper in a journal.... You get the picture. They're all similar in nature and different in detail and even in depth. And parts of any hierarchy might be re-used - different pages from the same book, different books by the same author, etc. And a good way of dealing with this is to use a recursive model. That means that you don't have to keep lots of copies of the title of the archive etc. It also means that you can have the option of linking the PR directly to the archive but another document to a particular collection which has its own distinct identity within the archive. > So, is there then any real structure in the data?? Only in the way, as an > example, that a date cannot appear in the place of a link where one would > expect a person reference. But other than that? Yes, of course there's structure. It just makes a lot of sense to use a data model flexible enough to fit the actual data rather than force fit the data onto a preconceived model. For instance I know of no ancestor of mine who lived in a city. Why should I have to warp their data to re-use a "city" field as something else & then how does this warped data sit properly alongside that of some of my wife's ancestors who did live in a city? > > Can such program be built? Yes, but the degree of freedom you want assures > you to frequent (and some substantial) changes to be applied. And changing > the program is one thing, but assuring your data survive those changes is > another story, where ultimate care will be needed. Nope. 
You end up having to make such changes because you didn't think it through in the first place. If you plan for flexibility you can code for it. This is a non-genealogical example but one taken from real-life: A client of mine was in the secure printing business - partly in printing the stationery (think cheques, for instance) and partly in digital printing /on/ the stationery (think printing the payee, amounts, etc. on the cheques). Eventually they got a contract for a much more complex document type than they'd handled previously, in fact the contract called for two different documents. The data would arrive as XML, also a first for them. Clearly this was a trend - XML would be the basis for contracts in the future. I suggested using XSL (a rules engine, in effect) to rewrite the data so that stuff likely to be in all such contracts - despatch addresses, due dates, etc. be taken out into an XML structure and vocabulary specific to my clients whilst the document specific bit be left unchanged but wrapped up in a specific place within this new structure. The next stage would be hard-coded to take apart the structure, put the house-keeping info into a database but the document-specific XML fragments would be stored unchanged in a text field in the database. For a print run the relevant fragments of XML would be grabbed from the database, strung together in a new XML file and run through another XSLT to convert them into the form required to drive the printer. This was extremely re-usable - only the XSL stylesheets would need to be changed for new contracts. The clients wouldn't have it at all. They insisted on having a database design that exactly reflected the printed document which also resulted in having code which matched the database to construct the print file. I did the front-end as I'd planned it but instead of storing the XML fragments I ran them through an XSLT & macro processor to generate the SQL to stuff the data into the document-specific database. 
Not only did this end up with two different databases and two different programs to handle the two documents but it also resulted in a maintenance nightmare as the main document changed over the life of the contract. For all I know they're still doing that. The next contract started off with many more document types. Storing the XML was too much for the clients but I did get them to agree to a half-way house. Instead of generating SQL I generated print-file fragments (essentially what would have been the second transform of my original scheme) and stored those in a text field in the database. As the contract rolled on we accommodated more document types & changes to the originals and we re-used the program itself largely as it stood for a second contract. All we had to do was keep writing more stylesheets and printer scripts, which we'd have had to do anyway. The tweaks to actual program code were only to do with changes to the way work was organised in the factory. That's the difference between planning for flexibility and not doing so. > There has been discussions about hierarchy of places. I think trying to > register such things is a bad idea in the first place, because such > "relationships" have been so volatile in history. The only indication one > could give that would remain consistent is something like : place X is part > of (located nearby, ....) Y in 2011. Even geographical coordinates are no > good, since villages etc. have moved in the run of time. > By now it should be clear how to treat this. You recognise /in advance/ that the hierarchies will be time-dependent and make provision for optional start and finish dates. You also recognise that a particular place may be simultaneously in different hierarchies, e.g. ecclesiastical (even different ecclesiastical hierarchies, such as different Anglican & RC parishes), manorial, Poor Law. You adopt a data model that fits and then code to that. 
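The time-dependent place hierarchy just described maps naturally onto a link table that carries more than bare link information: the hierarchy type plus optional start and finish dates. A minimal sqlite3 sketch, with all table, column and place names invented for illustration:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE place (id INTEGER PRIMARY KEY, name TEXT);
-- Link table: one row per membership of a place in a parent place,
-- qualified by hierarchy type and optional validity years, so a place
-- can sit in several hierarchies at once and move between them.
CREATE TABLE place_in (
    child INTEGER REFERENCES place(id),
    parent INTEGER REFERENCES place(id),
    hierarchy TEXT,         -- e.g. 'civil', 'anglican', 'manorial'
    start_year INTEGER,     -- NULL = unknown / since records began
    end_year INTEGER        -- NULL = still current
);
""")
con.executemany("INSERT INTO place VALUES (?, ?)",
                [(1, "Dunny-on-the-Wold"), (2, "Dunshire"), (3, "Wold Deanery")])
con.executemany("INSERT INTO place_in VALUES (?, ?, ?, ?, ?)",
                [(1, 2, "civil", None, None),
                 (1, 3, "anglican", 1541, 1856)])

# Where did the village sit, ecclesiastically, in 1800?
row = con.execute("""
    SELECT p.name FROM place_in l JOIN place p ON p.id = l.parent
    WHERE l.child = 1 AND l.hierarchy = 'anglican'
      AND (l.start_year IS NULL OR l.start_year <= 1800)
      AND (l.end_year IS NULL OR l.end_year >= 1800)
""").fetchone()
print(row[0])  # Wold Deanery
```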
I can envisage a system with several aspects:
- The genealogical data itself.
- Standing data such as location information.
- Rules such as the fuzzy logic which Richard mentioned.
- A shared data model to describe the above.
- Code to handle them.
This leaves scope for different S/W vendors and open source teams to provide the last part. It also provides scope for specialists to provide shared standing data or shared rules. It even, in an ideal world, provides scope for archive sites such as A2A to export data in a usable form. And it provides scope for users such as you and me to explore that data and to find the family relationships which hide within it. -- Ian The Hotmail address is my spam-bin. Real mail address is iang at austonley org uk
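The recursive source model described earlier in this message (archive, collection, register, page, with varying depth and shared ancestors) can likewise be sketched as a self-referencing table plus a recursive query; the names are again invented:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""
CREATE TABLE source (
    id INTEGER PRIMARY KEY,
    parent INTEGER REFERENCES source(id),  -- NULL at the top of a hierarchy
    title TEXT
)""")
# Each row stores only its own title once; depth varies per instance.
con.executemany("INSERT INTO source VALUES (?, ?, ?)", [
    (1, None, "Dunshire Archives"),
    (2, 1, "Dunny-on-the-Wold parish registers"),
    (3, 2, "Baptisms 1780-1812"),
])

# Walk upwards from a citation to reconstruct its full provenance.
rows = con.execute("""
    WITH RECURSIVE chain(id, parent, title) AS (
        SELECT id, parent, title FROM source WHERE id = 3
        UNION ALL
        SELECT s.id, s.parent, s.title
        FROM source s JOIN chain c ON s.id = c.parent
    )
    SELECT title FROM chain
""").fetchall()
print(" / ".join(t for (t,) in reversed(rows)))
# Dunshire Archives / Dunny-on-the-Wold parish registers / Baptisms 1780-1812
```

Because every row points at one parent, different pages of the same book, or different books by the same author, share the upper rows instead of duplicating them.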

    05/16/2011 08:55:26
    1. Re: Single-tree gedcom files question
    2. singhals
    3. Peter J. Seymour wrote: > On 2011-05-16 15:57, singhals wrote: >> Peter J. Seymour wrote: >>> I have been doing some analysis of a selection of the numerous gedcom >>> files out there. One thing I have found disappointing is that the larger >>> files tend to consist of a number of fragments rather than a single >>> tree. In fact, the larger the file, the more likely it is to consist of >>> fragments. Some large files seem to consist mostly of numerous >>> unconnected individuals, or couples, or perhaps small trees of three or >>> four people. >>> So this seems to be how the really large files are made: throw together >>> lots of data on the basis that it might be vaguely related. >>> This set me wondering: How large do single trees get? So here is a >>> challenge for you all, What is the largest single-tree gedcom you are >>> aware of, does it consist of sensible data, and more to the point how >>> large is it (File size in bytes and number of individuals, both metrics >>> are needed please? >> >> As of 10 am EDT on Sunday 15 May 2011, one of my databases has 24744 >> persons in a 18862080M file. .....going back >> to the 1750s, ...... >> >> Another database has 14088 persons (descendants and spouses) in a >> 6197248M file. This one is taken from a 1980s book based on 1970s >> research; much of it has been confirmed in official records. >> >> Neither of these are particularly large in my corner of the world. >> >> FWIW. >> >> Cheryl > > Thanks. A rough calculation shows the 24744 file as around 10 times more > fully populated than the "Diana" file previously mentioned. What this > says to me is that 'number of generations' should be included in tree > metrics (I suppose that should have been obvious, but better late than > never). The way it would work is that the larger the number of > generations covered by a given number of individuals, the less "good" > the file is. 
> In the current version of Gendatam Suite, a 20M file might take around > 80-100M of RAM when loaded. That works fine with modern computers which > might have as much as 4000M of RAM, but wouldn't have been feasible not > that many years ago when it was rare for a computer to have more than > about 8M of RAM. I suppose my point is that modern computers should cope > well with holding and processing these and larger amounts of data. > > Peter The larger database runs 13 generations from the OP to the newest addition. The smaller one is 12 gens, OP to 2010. A third database has 41 Gens to the Great Ethelred, 2953216, 3939 individuals. Data is good to about the 8th gen, as good-as-it-gets-in-the-US Gens 9-13, but at the 14th gen it waffles off into the 16th century and is accordingly only showing direct-line ascent, no sibs. Could be an issue in determining a ratio for big/good? A lot of folks do just record straight line ancestry. And a lot of folks omit the more lyrical connections to Charlemagne or the Caesars or Zeus and Odin ... Cheryl
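Cheryl's pattern - full trees for recent generations, thinning to direct-line-only ascent further back - can be made visible by counting individuals per generation. A toy sketch with an invented parent map:

```python
from collections import defaultdict

# Toy parent map: person -> known parents. Generations 0-1 are full;
# generation 2 records a single grandparent, mimicking the thinning
# into direct-line ascent described above.
parents = {
    "me": ["mum", "dad"],
    "mum": ["gran"],   # direct line only from here up
    "dad": [],
    "gran": [],
}

def per_generation(root):
    """Count individuals reachable at each ancestral generation."""
    counts = defaultdict(int)
    frontier, gen = [root], 0
    while frontier:
        counts[gen] = len(frontier)
        frontier = [p for person in frontier for p in parents.get(person, [])]
        gen += 1
    return dict(counts)

print(per_generation("me"))  # {0: 1, 1: 2, 2: 1}
```

A count that stops growing (or collapses to 1) where a full tree would roughly double each generation flags a direct-line-only tail, which could feed into any big/good ratio.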

    05/16/2011 08:25:55
    1. Re: Event-oriented genealogy software for Linux
    2. singhals
    3. [email protected] wrote: > Richard Smith wrote: > >> I've spent years looking for decent genealogy software that suits my >> needs, and I'm almost at the stage of giving up and writing my own. >> However, before I do that, I thought I'd ask on this newsgroup whether >> anyone has any suggestions of suitable software. >> >> Most products I've tried are far too lineage-oriented. That's perhaps >> okay for storing the results of my research, but that's not what I'm >> after. I want something much more event-oriented that can store the >> research itself. I want to record that I found John Smith on the 1881 >> census, two plausible John Smiths on the 1851 census, and three >> possible baptisms. I want to be able to record what the record says, >> not what I think it probably means, including the different spellings >> used in different sources. Although that seems a reasonable enough >> requirement, a lot of products make it hard to use them like that. >> Entering census data is often particularly tedious. >> >> If I only wanted to do that, I'd probably just use a spreadsheet. But >> I also want an application that can let me say that I currently >> believe the John Smith on the 1881 census is the same person as the >> John Smith who was listed in the 1851 census on North Street, not the >> one on South Street, and that I don't believe this person is the same >> as any of the baptisms. And I'd like to be able to do this in a way >> that's easy to change when new evidence comes to light. This seems >> very hard in most of the products I've tried, and nigh-on impossible >> for negative assertions like "the John Smith on the 1881 census was >> not either of three baptisms found". I've also never found software >> that can cope satisfactorily with relationships more complicated than >> simple parent-child ones. 
For example, I would like to be able to say >> "John was the grandson of Thomas, and probably the son of Thomas's son >> Henry, though possibly an illegitimate son of Thomas's daughter >> Sarah". That's certainly something that a computer program ought to >> be able to handle in a structured fashion, but, again, I've never >> found one that can. >> >> My second requirement is that the software runs on Linux and doesn't >> require me to be connected to the Internet. (So a web-based program >> is fine, but only if I can install it locally.) If it were open >> source, that would be an added bonus, but it's not a requirement. My >> only other requirement is that the program must be able to export its >> database in some vaguely usable format and re-import it again. It's >> probably best if it's not GEDCOM because I doubt GEDCOM will map >> cleanly enough to the sort of concepts the program needs, but some XML >> format (even if it's undocumented) would be perfect. >> >> I'm not aware of anything that comes close to this. Even without the >> requirements that it runs on Linux and has an export format, I'm not >> aware of anything, and that strikes me as surprising. Surely my first >> requirement is just basic good practice? And whilst I'm sure that a >> lot of research is not done to particularly good standards, surely >> most software vendors must be familiar with what good research >> entails? So I'm really hoping that someone will be able to point me >> towards some really good piece of software that I've somehow >> overlooked. >> >> Any suggestions or comments gratefully received! >> >> Richard > > I've read thru this thread, and I wonder if there will be any program, > including your own, that will fullfil all things that have been asked here. > > It seems to me that you're after a kind of logging system, rather than a > genealogy program. But you want to define persons, places, events, > relations, sources and even dates (and a few more?) 
as entities and have n:n > relations (in relational database speech) between all of them, including > recursive relationships on all of them. > So, is there then any real structure in the data?? Only in the way, as an > example, that a date cannot appear in the place of a link where one would > expect a person reference. But other than that? > > Can such program be built? Yes, but the degree of freedom you want assures > you to frequent (and some substantial) changes to be applied. And changing > the program is one thing, but assuring your data survive those changes is > another story, where ultimate care will be needed. > In the end, you might have to decide on what to spend most of your time: > coding and maintaining your data, or doing proper research. > > As a side note: > There has been discussions about hierarchy of places. I think trying to > register such things is a bad idea in the first place, because such > "relationships" have been so volatile in history. The only indication one > could give that would remain consistent is something like : place X is part > of (located nearby, ....) Y in 2011. Even geographical coordinates are no > good, since villages etc. have moved in the run of time. > Thank you, Hermann! Cheryl

    05/16/2011 05:06:53
    1. Re: Single-tree gedcom files question
    2. singhals
    3. Peter J. Seymour wrote: > I have been doing some analysis of a selection of the numerous gedcom > files out there. One thing I have found disappointing is that the larger > files tend to consist of a number of fragments rather than a single > tree. In fact, the larger the file, the more likely it is to consist of > fragments. Some large files seem to consist mostly of numerous > unconnected individuals, or couples, or perhaps small trees of three or > four people. > So this seems to be how the really large files are made: throw together > lots of data on the basis that it might be vaguely related. > This set me wondering: How large do single trees get? So here is a > challenge for you all, What is the largest single-tree gedcom you are > aware of, does it consist of sensible data, and more to the point how > large is it (File size in bytes and number of individuals, both metrics > are needed please? As of 10 am EDT on Sunday 15 May 2011, one of my databases has 24744 persons in a 18862080M file. Unfortunately, it contains branches; the largest branch has 23642 descendants & spouses of a single couple. The other branches are NOT OURS (116 persons), ?OURS? (41 persons) and the rest are parents of spouses who are being kept because two sibs married into the main database. This file is based on family papers going back to the 1750s, added to in 1826, again in 1894, 1937, 1987, 2003, and through yesterday. Data has been documented in census records, wills, BMDs, newspapers, and official government documents (i.e, bounty-land warrants, military service records, Acts of Congress, etc). Another database has 14088 persons (descendants and spouses) in a 6197248M file. This one is taken from a 1980s book based on 1970s research; much of it has been confirmed in official records. Neither of these are particularly large in my corner of the world. FWIW. Cheryl

    05/16/2011 04:57:43
    1. Re: Event-oriented genealogy software for Linux
    2. [email protected]
    3. Richard Smith wrote: > I've spent years looking for decent genealogy software that suits my > needs, and I'm almost at the stage of giving up and writing my own. > However, before I do that, I thought I'd ask on this newsgroup whether > anyone has any suggestions of suitable software. > > Most products I've tried are far too lineage-oriented. That's perhaps > okay for storing the results of my research, but that's not what I'm > after. I want something much more event-oriented that can store the > research itself. I want to record that I found John Smith on the 1881 > census, two plausible John Smiths on the 1851 census, and three > possible baptisms. I want to be able to record what the record says, > not what I think it probably means, including the different spellings > used in different sources. Although that seems a reasonable enough > requirement, a lot of products make it hard to use them like that. > Entering census data is often particularly tedious. > > If I only wanted to do that, I'd probably just use a spreadsheet. But > I also want an application that can let me say that I currently > believe the John Smith on the 1881 census is the same person as the > John Smith who was listed in the 1851 census on North Street, not the > one on South Street, and that I don't believe this person is the same > as any of the baptisms. And I'd like to be able to do this in a way > that's easy to change when new evidence comes to light. This seems > very hard in most of the products I've tried, and nigh-on impossible > for negative assertions like "the John Smith on the 1881 census was > not either of three baptisms found". I've also never found software > that can cope satisfactorily with relationships more complicated than > simple parent-child ones. For example, I would like to be able to say > "John was the grandson of Thomas, and probably the son of Thomas's son > Henry, though possibly an illegitimate son of Thomas's daughter > Sarah". 
That's certainly something that a computer program ought to > be able to handle in a structured fashion, but, again, I've never > found one that can. > > My second requirement is that the software runs on Linux and doesn't > require me to be connected to the Internet. (So a web-based program > is fine, but only if I can install it locally.) If it were open > source, that would be an added bonus, but it's not a requirement. My > only other requirement is that the program must be able to export its > database in some vaguely usable format and re-import it again. It's > probably best if it's not GEDCOM because I doubt GEDCOM will map > cleanly enough to the sort of concepts the program needs, but some XML > format (even if it's undocumented) would be perfect. > > I'm not aware of anything that comes close to this. Even without the > requirements that it runs on Linux and has an export format, I'm not > aware of anything, and that strikes me as surprising. Surely my first > requirement is just basic good practice? And whilst I'm sure that a > lot of research is not done to particularly good standards, surely > most software vendors must be familiar with what good research > entails? So I'm really hoping that someone will be able to point me > towards some really good piece of software that I've somehow > overlooked. > > Any suggestions or comments gratefully received! > > Richard I've read thru this thread, and I wonder if there will be any program, including your own, that will fulfil all things that have been asked here. It seems to me that you're after a kind of logging system, rather than a genealogy program. But you want to define persons, places, events, relations, sources and even dates (and a few more?) as entities and have n:n relations (in relational database speech) between all of them, including recursive relationships on all of them. So, is there then any real structure in the data?? 
Only in the way, as an example, that a date cannot appear in the place of a link where one would expect a person reference. But other than that? Can such a program be built? Yes, but the degree of freedom you want guarantees that frequent (and some substantial) changes will have to be applied. And changing the program is one thing, but ensuring your data survive those changes is another story, where ultimate care will be needed. In the end, you might have to decide on what to spend most of your time: coding and maintaining your data, or doing proper research. As a side note: There have been discussions about hierarchy of places. I think trying to register such things is a bad idea in the first place, because such "relationships" have been so volatile in history. The only indication one could give that would remain consistent is something like: place X is part of (located nearby, ...) Y in 2011. Even geographical coordinates are no good, since villages etc. have moved over the course of time. -- Veel mensen danken hun goed geweten aan hun slecht geheugen. (G. Bomans) Lots of people owe their good conscience to their bad memory (G. Bomans)

    05/16/2011 04:26:32
    1. Re: Single-tree gedcom files question
    2. Richard Smith
    3. On May 16, 3:57 pm, singhals <[email protected]> wrote: > As of 10 am EDT on Sunday 15 May 2011, one of my databases > has 24744 persons in a 18862080M file. I think you probably mean 18MB. 18862080M is a little over 18TB, which is significantly more than the disk space in the vast majority of computers. Yes, datasets of that size do exist -- I have to deal with some at work -- but they require specialist tools and large clusters of computers to handle. Richard
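Richard's correction is the usual bytes-versus-megabytes slip; a small helper (illustrative, not from any of the programs mentioned) makes the units explicit:

```python
def human_bytes(n: int) -> str:
    """Render a raw byte count with binary (1024-based) units."""
    units = ["bytes", "KB", "MB", "GB", "TB"]
    i = 0
    while n >= 1024 and i < len(units) - 1:
        n /= 1024
        i += 1
    return f"{n:,.1f} {units[i]}" if i else f"{n:,} bytes"

print(human_bytes(18862080))            # 18.0 MB -- the figure read as bytes
print(human_bytes(18862080 * 1024**2))  # 18.0 TB -- the figure read as megabytes
```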

    05/16/2011 02:36:46