RootsWeb.com Mailing Lists
Total: 2/2
    1. RE: Latest Update
    2. Archer Barrie
    3. John,

I am not sure why it is important to know which is the first keying. Surely if there are two or more keyings that produce different results, all that is necessary is to know they are different. Which came first seems of little relevance.

Your suggestion for comparing the entries omits the difficulty of knowing which two files to compare. If everyone had followed the file naming standard this might have been possible, but as I recall only about 60% do. For example, in 1898B4 one has to compare file Pb1898_20 with file 1898B40020. Page numbers from +PAGE are the other alternative, but these are often not entered or not accurate; for example, Pb1898_20 contains page number 1 (wrong) and 1898B40020 contains page number 20 (right). Sorting this out would be a huge amount of work. And of course neither method works for random entries.

What we actually do is compare the entries, ignoring the file that they have come from, thus enabling us to take into account random entries.

Barrie

> -----Original Message-----
> From: John Fairlie [mailto:john.fairlie@blueyonder.co.uk]
> Sent: 01 December 2003 17:35
> To: FREEBMD-DISCUSS-L@rootsweb.com
> Subject: RE: Latest Update
>
> There may not be an easy answer, but I believe there has to be AN answer,
> as confusion rarely clears itself, and the subject will have to be
> addressed sooner or later.
>
> There seem to be three types of upload to FreeBMD: blocks of first
> keyings, blocks of second keyings, and random/ad hoc entries. Ignoring the
> latter, why can't files uploaded be marked as either first keying or
> second keying, in the same way as they are marked as births, deaths or
> marriages? Surely the syndicate leader must know what he/she has given
> his/her members, and as stated on the list before, transcribers must be
> told whether they are doing first or second keying anyway.
>
> We would obviously have to go back and classify all the files already
> uploaded, but this should be easy for syndicate leaders, and would be well
> worth the effort involved.
>
> When the database is compiled, it will then be known what all the first
> keying submissions are. Second keying files (and random/ad hoc) can then
> be compared to the first keying and if the records within them are
> identical, a bold entry will result on the search output screen. A second
> keying entry that does not match a first keying entry could be written to
> a file and referred to an arbitrator. I believe that the FreeBMD system
> documented on the web site already proposes arbitrators, and that an
> arbitrator's upload overrules all other uploads.
>
> Considering the obvious skill of the FreeBMD programmers, I would not have
> thought this too difficult.
>
> John Fairlie
> Mail us at ..... john@fairlie.plus.com
>                  john.fairlie@blueyonder.co.uk
> Home page... http://www.fairlie.plus.com
>
> -----Original Message-----
> From: Dave Mayall [mailto:david.mayall@ukonline.co.uk]
> Sent: Monday, December 01, 2003 11:58 AM
> To: FREEBMD-DISCUSS-L@rootsweb.com
> Subject: Re: Latest Update
>
> On Sun, 30 Nov 2003 19:23:53 -0000, you wrote:
>
> >The latest update shows a gratifying increase in the number of unique
> >records. The figures do, however, raise some questions.
> >
> >Take two event years which have apparently been fully transcribed.
> >
> >We are told that there are 990,848 unique records of births in 1898. For
> >that year, there were 2,474 pages of births to be transcribed. Assuming
> >an average of 374 births per page (and an assumption of 375 or 376 would
> >not affect the issue), there were 925,276 births in that year. So the
> >number of "unique" records exceeds the actual number of births by about
> >65,000, ie about 7 percent.
> >
> >To take another example, we are told that there are 514,581 unique
> >records of marriages in 1890. In that year, there were 1,206 pages of
> >marriages to be transcribed. At 374 entries per page, there were 451,044
> >marriage entries in all. (Each marriage of course generates two entries
> >in the index.) So the number of "unique" records exceeds the actual
> >number of marriage entries by over 63,000, ie about 14 percent.
> >
> >One possible cause of the discrepancy is inconsistencies in the keying of
> >individual index entries. I think that we have been previously told that,
> >where a page has been double keyed, different transcriptions of the same
> >record will show up as two unique records in the update statistics. If
> >this is the full explanation, then it raises some disturbing questions
> >about the accuracy of our transcription. In any event, the question does
> >arise of whether the statistics give a slightly too rosy account of
> >progress.
>
> The question of the statistics has been raised previously, and there
> is no easy answer!
>
> 1890 Marriages figures from ONS show that there are 223,000 Marriages
> (446,000 entries). These are the figures we base our completeness on.
> We do know however that there are various factors which tend to make
> the actual total to be transcribed higher.
>
> The overrun seems to be rather excessive, and is worthy of
> investigation, and I will do so.
>
> --
> Dave Mayall
>
> ==============================
> To join Ancestry.com and access our 1.2 billion online genealogy records,
> go to:
> http://www.ancestry.com/rd/redir.asp?targetid=571&sourceid=1237
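
The page-count arithmetic quoted in this message can be re-checked with a few lines of Python. This is only a sketch: the 374 entries-per-page figure is the original poster's assumption, the function name `overrun` is made up for illustration, and nothing here reflects how FreeBMD actually computes its statistics.

```python
# Figures are the ones quoted in the thread; 374 entries per page is the
# original poster's assumption, not an official FreeBMD figure.
def overrun(unique_records: int, pages: int, entries_per_page: int = 374):
    """Return (expected entries, excess over expected, excess as a percentage)."""
    expected = pages * entries_per_page
    excess = unique_records - expected
    return expected, excess, 100.0 * excess / expected

births_1898 = overrun(990_848, 2_474)
marriages_1890 = overrun(514_581, 1_206)

for label, (expected, excess, pct) in [
    ("1898 births", births_1898),
    ("1890 marriages", marriages_1890),
]:
    print(f"{label}: expected {expected:,}, excess {excess:,} ({pct:.1f}%)")
```

Against the ONS figure of 446,000 entries that Dave Mayall quotes for 1890 marriages, the excess would be somewhat larger than against the page-count estimate.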

    12/02/2003 03:27:33
    1. RE: Latest Update
    2. John Fairlie
    3. Barrie,

I think you have succinctly defined the problem. Dave Mayall also said that the main problem in performing the match is aligning the transcriptions so that we know that they need to be compared with each other.

So in your example below, what we need to do is compare Pb1898_20 with 1898B40020. So how does a dumb computer know this? I wouldn't know that these two files are the same page, so I would have to do some pretty clever coding to get a computer to work it out. Even then, the filename Pb1898_20 does not imply the fourth quarter; that must be within the file header. Then the computer has to read the page number from within the file, get 01 for file Pb1898_20, realise that it is this value that is wrong and not the 20 in the file name, and replace it with the correct page number.

To be frank, I cannot see all this being programmed up to happen automatically, as filenames and headers are so inconsistent at present. If it can, I would be delighted to be proved wrong.

Consider instead that the files were 1898B40020 first key and 1898B40020 second key, with otherwise identical headers. (Actually the first/second marker would be part of the header at the top of the file.) Of course it is irrelevant which is first key and which is second key; the important thing is that we have identified two files that are meant to contain identical contents. This is the important prelude to the actual comparison process, not to mention the database build process.

Without this information, the system can only open and collate the contents of both files. Obviously the two transcribers will differ in places, both through mistakes and through uncertain characters. So if each file was meant to contain 375 entries, we may end up with 400 unique entries and 350 matching entries. This is what leads to years being apparently over 100% complete! The 50 records that now don't match are just 50 records; no-one knows that 25 of them are non-matching duplicates of the other 25.
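
The 375/350/400 figures above can be reproduced with plain set operations. This is a minimal sketch on invented data: the entry tuples, the `make_page` helper and the way the 25 differences are simulated are all hypothetical, and it is not a description of the real FreeBMD build.

```python
# Hypothetical entries: (surname, forename, district, volume, page).
def make_page(n=375):
    return [(f"SURNAME{i:03d}", "John", "Leeds", "9b", 20) for i in range(n)]

first = make_page()
second = list(first)

# Simulate 25 entries that the second transcriber keyed differently.
for i in range(25):
    surname, forename, district, volume, page = second[i]
    second[i] = (surname + "X", forename, district, volume, page)

# Collating both files without knowing they describe the same page:
matching = set(first) & set(second)   # entries keyed identically in both
unique = set(first) | set(second)     # every distinct entry seen
unmatched = unique - matching         # looks like 50 separate records

print(len(matching), len(unique), len(unmatched))  # 350 400 50
apparent_completeness = 100.0 * len(unique) / 375
print(f"apparent completeness: {apparent_completeness:.1f}%")
```

Nothing in `unmatched` records that its 50 entries are really 25 pairs, which is the information lost when the two files are not known to be the same page.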
What we have to do is get back to that page being 375 entries and only 375 entries. Hence my suggestion that one file (and I say again that it matters not which) is considered as first key and we adopt its 375 entries. The second file may match 350 of them. OK, fine. But the other 25 that don't match ARE NOT NEW UNIQUE RECORDS! They are candidates for arbitration.

Hence my approach is that we need software that looks at all first keyings, identifies all entries that are out of sequence, have uncertain characters etc, pages out of range for the district, and entries where there are too many entries per register page, and cleans them. (Yes, I know this is not the defined FreeBMD process.) Then we need a strict file naming regime that allows second keyings to be seen as such, and not as duplicate first key files under another file name. Then we have a system that really is programmable. Then and only then can comparison software produce a result that is meaningful, and a compiled master database that is not apparently over 100% complete when it is actually under 100% complete.

A quality end product means having a quality process that picks up errors (file naming or header/page number errors) early, and ensures all stick to the defined system. Then the software can work well to produce a clean database where 100% complete means 100% complete and the actual records have a quantifiable accuracy about them!

Phew!!

John Fairlie
Mail us at ..... john@fairlie.plus.com
                 john.fairlie@blueyonder.co.uk
Home page... http://www.fairlie.plus.com

-----Original Message-----
From: Archer Barrie [mailto:Barrie.Archer@services.fujitsu.com]
Sent: Tuesday, December 02, 2003 10:28 AM
To: 'John Fairlie'
Cc: FREEBMD-DISCUSS-L@rootsweb.com
Subject: RE: Latest Update

John,

I am not sure why it is important to know which is the first keying. Surely if there are two or more keyings that produce different results, all that is necessary is to know they are different. Which came first seems of little relevance.

Your suggestion for comparing the entries omits the difficulty of knowing which two files to compare. If everyone had followed the file naming standard this might have been possible, but as I recall only about 60% do. For example, in 1898B4 one has to compare file Pb1898_20 with file 1898B40020. Page numbers from +PAGE are the other alternative, but these are often not entered or not accurate; for example, Pb1898_20 contains page number 1 (wrong) and 1898B40020 contains page number 20 (right). Sorting this out would be a huge amount of work. And of course neither method works for random entries.

What we actually do is compare the entries, ignoring the file that they have come from, thus enabling us to take into account random entries.

Barrie
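
John's proposal in this message (adopt one keying as the reference and send non-matching second-key entries to arbitration instead of counting them as new records) could be sketched roughly as follows. It assumes the two files have already been identified as the same page, which is exactly the difficulty Barrie raises; the `PageResult` class, the `compare_keyings` function and the sample entries are hypothetical and not part of any FreeBMD software.

```python
from dataclasses import dataclass, field

@dataclass
class PageResult:
    confirmed: list = field(default_factory=list)    # reference entries also found in the second keying
    unconfirmed: list = field(default_factory=list)  # reference entries with no identical second-key entry
    arbitration: list = field(default_factory=list)  # second-key entries that match nothing in the reference

def compare_keyings(reference, second):
    """Compare two keyings that are already known to cover the same page."""
    reference_set = set(reference)
    second_set = set(second)
    result = PageResult()
    for entry in reference:
        if entry in second_set:
            result.confirmed.append(entry)
        else:
            result.unconfirmed.append(entry)
    result.arbitration = [e for e in second if e not in reference_set]
    return result

# Invented example: 375 entries, 25 of them keyed differently the second time.
reference = [f"ENTRY{i:03d}" for i in range(375)]
second = [e + "?" if i < 25 else e for i, e in enumerate(reference)]

result = compare_keyings(reference, second)
page_total = len(result.confirmed) + len(result.unconfirmed)
print(page_total, len(result.confirmed), len(result.arbitration))  # 375 350 25
```

The page total stays at 375 because only reference entries are counted; the 25 disagreeing second-key entries are held for arbitration and do not inflate the number of unique records.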

    12/03/2003 12:50:27