RootsWeb.com Mailing Lists
Subject: Re: How Should We Store Evidence in Genealogical Databases?
From: Richard Smith
On May 23, 12:52 pm, Tom Wetmore <[email protected]> wrote:

> This thread is an offshoot from the Linux thread that is going off on a number of tangents.
>
> How should we store evidence in genealogical databases?

[I've been away for a few days, so apologies for coming back into the discussion rather late.]

I regard genealogical research as a seven-stage process, and I tend to handle the data generated at each stage in different ways.

1) Planning

Sometimes I've got a specific objective in mind -- something like "find out who Thomas Smith's parents are". For each of these objectives, I create a text file with a few notes about where might be a good place to search for evidence, where I've already looked, and a mixture of speculation and notes to myself. I name the file by surname, name and some additional suffix (say "the boot-maker") to make the person unique; if there's more than one plan per person (there rarely is), I'll disambiguate it in some further way. I also use symlinks (a bit like Windows shortcuts) to maintain an index of such plans by ancestor number in a separate directory.

As I've got further back, I've found more and more frequently that I don't have such a specific objective. The ultimate objective is usually to push back one or more generations, but I'm no longer specifically targeting records with that individual in mind; instead, I'm gathering as much information as I can on the surname in the area. I have a directory with a more general set of plan files with just a surname and area (typically a parish name somewhere near the centre of the area of interest).

I use a revision control system (currently CVS) to keep track of changes to these plan files, and also to assist in backing them up.

2) Searching

Whenever I search for something, I try to note the fact that I've done it in one of the plan files. This is particularly important if the search fails to find anything.
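The symlink index described above can be sketched in a few lines. This is a minimal illustration, not the author's actual setup: the directory names `plans/` and `by-ahnentafel/`, the file name, and the ancestor number 52 are all invented for the example.

```python
import os

# Assumed layout: plan files named by surname, name and a distinguishing
# suffix live in plans/; by-ahnentafel/ holds symlinks indexing the same
# files by ancestor (Ahnentafel) number.
os.makedirs("plans", exist_ok=True)
os.makedirs("by-ahnentafel", exist_ok=True)

plan = "plans/smith-thomas-the-boot-maker.txt"
with open(plan, "w") as f:
    f.write("Objective: find out who Thomas Smith's parents are\n")

# Index the plan under ancestor number 52.  One plan file can be linked
# from several index entries without being duplicated.
link = "by-ahnentafel/52"
if not os.path.lexists(link):
    os.symlink(os.path.join("..", plan), link)
```

The same trick works for filing a scanned document under several record types: each extra location is just another symlink to the one real file.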
If I'm in a records office, I tend to have a printout of the plan file and scribble on it, typing the notes up later. Sometimes I do the same for on-line research. I find on-line sites such as ancestry.com and familysearch.org particularly troublesome in this regard -- it's far too easy to spend an hour or two searching for things and forgetting to note anything down. Neither site keeps a log (at least, not one that's available to the user) of what you've searched for, so you can't go back and write it up later.

For this reason I no longer use familysearch.org directly. The only time I ever used it was to look up things on the IGI, so I wrote a Perl script to drive the (old) site, do searches for me, download the full data set as GEDCOM, and log each search I do to the appropriate plan file. The program requires me to associate the search with a specific plan, so I can't avoid recording the fact that I've done a search. Putting these search logs into a database, and associating them with a source and/or repository, would be an obvious improvement.

I did briefly experiment with gnote and mediawiki for the plan files but gave up -- I found them both overkill for what I wanted.

The result of the search will vary. It might be a piece of GEDCOM (as per the example above), or an image (e.g. a census image on ancestry.com), or an entry in a book (in which case I may or may not have been able to make a copy of it). Any paper copies I do end up with get scanned, and everything gets stored in directories, classified by type of record and surname. I'm not a big fan of putting things like images in a database, though indexing them in a database would be useful. At the moment, the only index I have is the directory listing. (As with plan files, I sometimes use symlinks if one document should be filed in multiple places.)

3) Transcribing

Having found a document, the next job is to transcribe it. Often the result is a flat text file, again one file per source.
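The forced-logging idea -- every search must name a plan file, so failed searches get recorded too -- is the interesting part, and is easy to sketch. This is an illustration of the idea only, not the author's Perl script; the file name, repository name and query text are made up.

```python
from datetime import date

def log_search(plan_file, repository, query, result="nothing found"):
    """Append a dated search record to a plan file.  Because plan_file
    is a required argument, a search cannot be run without saying where
    it should be logged -- which is the whole point."""
    with open(plan_file, "a") as f:
        f.write(f"{date.today().isoformat()}  searched {repository}: "
                f"{query} -- {result}\n")

# A failed search is logged just like a successful one.
log_search("smith-thomas-plan.txt", "IGI",
           "Thomas Smith, baptism, Hampshire, c.1790")
```

Moving these log lines into a database table keyed by source and repository, as suggested above, would be a natural next step.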
I try to transcribe the document as accurately as plain text will allow, and there's the odd bit of ad hoc mark-up in it to document important bits of formatting: e.g. [struck-through: my daughter Isabella] or [inserted: Hampshire]. I very much like the idea that Nick Matthews suggested elsewhere of using XML for this, and may well start doing so. In longer documents, such as wills, I tend to put asterisks around people's names to assist in searching; similarly, I often add ISO-style dates in parentheses [2011-05-24]. I don't do similar tagging for place names, though if I move to a light-weight XML format, I probably will.

In other cases, the source is essentially a long table. Baptism registers or census forms are a good example of this. In these cases, I use a tab-separated text file to record each field. That makes it easy to import into a spreadsheet or database, but at present the primary version is simply the text files. Sometimes I'll use a spreadsheet to create them too, especially if I'm entering a large number by hand. If I need to add extra notes, they end up in the rightmost column. Tabular data of this sort is, again, an example of something that could usefully go into a database. At the moment, the text files get stored in CVS to retain a version history and to back them up.

4) Translating

This stage is often irrelevant as the source is usually in English (the only language I speak fluently). When it is necessary, I put the translation below the original transcript, in the same file. Even in English documents, there's sometimes an element of translation: for example, I'll add a note to remind myself what I think some obscure word or abbreviation means.

5) Extracting

This is the stage that seems to be causing all the excitement here. It is when I extract the genealogical content from the source and put it into some computer-readable form. Typically I use GEDCOM as the destination format, simply because of its ubiquity.
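The move from ad hoc bracket mark-up to lightweight XML could be largely mechanical. As a sketch only: the element names `<del>` and `<ins>` below are my assumption, not an established genealogy schema, and a real converter would also need to escape XML special characters.

```python
import re

def markup_to_xml(text):
    """Convert the ad hoc [struck-through: ...] and [inserted: ...]
    bracket notation into lightweight XML elements."""
    text = re.sub(r"\[struck-through:\s*(.*?)\]", r"<del>\1</del>", text)
    text = re.sub(r"\[inserted:\s*(.*?)\]", r"<ins>\1</ins>", text)
    return text

line = ("I give to [struck-through: my daughter Isabella] "
        "[inserted: my son John] my dwelling house")
print(markup_to_xml(line))
# -> I give to <del>my daughter Isabella</del> <ins>my son John</ins> my dwelling house
```

The same pattern would extend to tagged dates and, eventually, place names.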
Sometimes I find GEDCOM inadequate for the purpose. For example, if a will mentions two grandchildren but gives no indication of whether the grandchildren are siblings, there's no way of expressing this in GEDCOM. In such a case, I'll either misuse GEDCOM to express what I need as best I can, or simply not bother extracting that bit of information (perhaps instead putting it into a text note).

For things like censuses, baptisms and so on, because the result of the transcription is already in a nice easy-to-parse tabular form, I have scripts that automatically create GEDCOM from the tables. Sometimes it needs hand editing afterwards to add some extra information that was in the source, but outside of the expected data -- for example, I once found a census on which two children had been grouped together with a big "}" and "twins" written next to them. In earlier baptism registers, the data is often more or less tabular, but with implicit fields recording whatever the priest felt was necessary; and occasionally an entry will have extra information included. Such cases need manual handling. I've also got a number of scripts that create blank bits of GEDCOM -- templates, if you like -- with suitable source information already filled in, which I can then complete.

The result is hundreds of small GEDCOM files, one per source. Some (e.g. from a gravestone) just contain a single individual and little else; others (e.g. from a parish register or from an IGI search) may contain hundreds of individuals, some of whom may be duplicates (for example, if a couple have three children baptised, then the parents will appear three times). These GEDCOM files then get stored in CVS -- even the automatically generated ones. I will sometimes upload them into a genealogy program, but as I've not really settled on one that I like, I regard the GEDCOM as the primary version and never (well, rarely, anyway) use the program to make changes. It's just a tool to help me process or visualise the information.
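A table-to-GEDCOM script of the kind described above might look like the sketch below. The column order, the source record, and the use of a NOTE for the parents are all assumptions for illustration; the author's real scripts, and a fuller version, would presumably build proper FAM records instead.

```python
def baptisms_to_gedcom(tsv_lines, source_title):
    """Turn tab-separated baptism-register rows (date, child, father,
    mother) into a small persona-level GEDCOM file, one INDI per row,
    each citing a single shared source record."""
    out = ["0 HEAD", "1 CHAR UTF-8", "1 GEDC", "2 VERS 5.5.1",
           "0 @S1@ SOUR", f"1 TITL {source_title}"]
    for n, line in enumerate(tsv_lines, start=1):
        date, child, father, mother = line.rstrip("\n").split("\t")[:4]
        out += [f"0 @I{n}@ INDI",
                f"1 NAME {child} /{father.split()[-1]}/",
                "1 BAPM",
                f"2 DATE {date}",
                "2 SOUR @S1@",
                f"1 NOTE father {father}; mother {mother}"]
    out.append("0 TRLR")
    return "\n".join(out)

rows = ["12 MAR 1791\tThomas\tJohn Smith\tMary"]
print(baptisms_to_gedcom(rows, "Baptism register, North Dunny"))
```

Anything outside the expected columns -- the "}" and "twins" case -- still needs hand editing afterwards, exactly as described.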
I've also experimented with converting the GEDCOM to RDF and importing it into an RDF processor (typically the Redland one) so that I can run SPARQL queries against it. This is really powerful, but also painfully tedious to use. I do see a future for something like this, though. I've also got a script that can search a directory tree of GEDCOM files looking for people that match specific criteria -- at the moment it's pretty primitive, basically just matching on name, date of a particular event, and role in the event. It was originally designed for looking for baptisms, but has expanded a bit.

6) Reasoning

This is the stage that most people think of as genealogy. It's where I try to work out how I need to combine the persona-level data extracted from the sources into real people. Was the John Smith in the 1851 census the one baptised in North Dunny or South Dunny, or maybe neither?

This typically involves looking through all of the extracted persona-level data for people with the same (or a similar) surname in the locality over quite a long period. I tend to the view that unless I can understand every instance of the surname in the source record, I cannot be confident that I've pieced it all together correctly. (And sometimes even then I can't be confident of it.) An unexplained burial could be evidence that what I had considered to be one family was in fact two, for example, and that might have knock-on effects elsewhere.

How I work at this stage depends on how many people I have. Sometimes there are few enough personae that I can keep everything in my head. For larger groups, I tend to print things out and spread everything out on my dining room table. In the very largest cases that's infeasible. For example, I once had an ancestor called John Smith and all I knew was that he was a cobbler, from Southampton, and an approximate date of birth from the 1841 census. Trying to sort out all of the Smiths in a big town was a complex task.
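A primitive name search over a one-file-per-source GEDCOM tree, of the kind the paragraph above describes, fits in a page. This is a sketch under assumed conventions (`.ged` extension, `1 NAME` lines with the surname between slashes), not the author's script, and the sample file it creates is invented.

```python
import os
import re

def find_individuals(root, name_pattern):
    """Walk a directory tree of GEDCOM files and return (file, line)
    pairs for individuals whose NAME line matches the pattern."""
    pat = re.compile(name_pattern, re.IGNORECASE)
    hits = []
    for dirpath, _, files in os.walk(root):
        for fn in files:
            if fn.endswith(".ged"):
                path = os.path.join(dirpath, fn)
                for line in open(path):
                    if line.startswith("1 NAME") and pat.search(line):
                        hits.append((path, line.strip()))
    return hits

# Example: a single persona-level file, searched by surname.
os.makedirs("gedcom/smith", exist_ok=True)
with open("gedcom/smith/census-1851.ged", "w") as f:
    f.write("0 @I1@ INDI\n1 NAME John /Smith/\n1 OCCU Cobbler\n")
print(find_individuals("gedcom", r"/Smith/"))
```

Matching on event dates and roles, as the real script does, would mean tracking the INDI record each NAME line belongs to rather than grepping line by line.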
(In the end I discovered evidence that he wasn't actually from Southampton after all -- he'd just lived there for a while before his marriage.) In that case, I created a spreadsheet with everyone in it. (And I still use an extended version of that spreadsheet as an index to the other records.)

Once I've sorted things out into groups, I enter them into Gramps (my current preferred program). I'll import bits of the persona-level GEDCOM because that's a convenient way of keeping source information with it. (Irritatingly, I have to strip the repository from the GEDCOM and manually reassign it because Gramps can't, so far as I know, merge repositories as it can other things, but that's a minor difficulty.) But what this doesn't do is give me any way of documenting why I've merged the personae as I have. Sometimes this will be immediately obvious from the sources; other times it won't.

At times, the reasoning process is more sophisticated. I often start with a large number of possibilities, consider each one, and gradually discount possibilities as too improbable until only one remains, which for the time being I regard as probably correct. Documenting such things is tricky, but I really do care about documenting them: not primarily to justify my conclusions to others (though that is useful), but so that I can easily revisit them as further evidence comes to light, or as I correct any mistakes.

At present, I use the plan files that I create right at the start of the whole research process to add notes on why I came to the conclusions I did. But this means the documentation behind the merging process is not kept with the merged individuals; nor is there a computer-readable link from the source to the documentation. I really want there to be, so that if I have to correct a mistake in my transcription / translation / interpretation of a source, I can readily see what knock-on effects it might have.
7) Presentation

The final step is presenting the data in a good way. That might mean drawing trees (which many programs seem quite poor at), drawing ancestor tables (which they're much better at), or maybe just producing indexes of people. But this step is really beyond the theme of this discussion.

In practice, I expect, these seven steps often get blurred together for most people, or some of them are not relevant. But whenever I find myself thinking about how to store some new sort of data, or how to rearrange the way I file things, I do find it very useful to think in terms of these seven steps.

Richard

    05/27/2011 03:33:03