On Wed, 3 Dec 2003 19:50:27 -0000, you wrote:

>Barrie,
>
>I think you have succinctly defined the problem. Dave Mayall also said that
>the main problem in performing the match is aligning the transcriptions so
>that we know that they need to be compared with each other.
>
>So in your example below, what we need to do is compare Pb1898_20 with
>1898B40020. So how does a dumb computer know this?? I wouldn't know that
>these two files are the same page so I would have to do some pretty clever
>coding to get a computer to work it out. Even then, the filename for
>Pb1898_20 does not infer the fourth quarter. This must be within the file
>header. Then the computer has to read the page number from within the file,
>get 01 for file Pb1898_20, realise it's that that is wrong and not the 20 in
>the file name, and replace it with the correct page number. To be frank, I
>cannot see all this being programmed up to happen automatically as filenames
>and headers are so inconsistent at present. If it can, I would be delighted
>to be proved wrong.

Well, be delighted! We use a whole load of data crunching to achieve it,
including a chunk of code which we obtained from the human genome project.

>Without this information, the system can only open and collate the contents
>of both files. Obviously the two transcribers made differences, both as
>mistakes and as uncertain characters. So if each file was meant to contain
>375 entries, we may end up with 400 unique entries and 350 matching entries.
>This is what leads to years being apparently over 100% complete!! The 50
>records that now don't match are just 50 records, no-one knows that 25 of
>them are non matching duplicates of the other 25.

That is very likely where some of the count discrepancies originate, and it
is where we are concentrating our efforts. Of course, things are complicated
by the fact that a file is NOT the basic unit of work internally, because a
file may contain more than one page.
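For anyone curious what "aligning the transcriptions" looks like in practice:
the genome-alignment style of matching can be sketched with Python's standard
difflib, which uses a similar longest-matching-run approach. This is purely an
illustration, not the actual FreeBMD code, and the sample records are made up:

```python
from difflib import SequenceMatcher

# Two keyings of the same register page (hypothetical sample records).
first_keying = ["SMITH John 5c 123", "SMYTH Jane 5c 123", "JONES Mary 5c 124"]
second_keying = ["SMITH John 5c 123", "SMITH Jane 5c 123", "JONES Mary 5c 124"]

# Align entry-by-entry: matching runs are confirmed records, while
# differing runs are candidates for arbitration, NOT new unique records.
sm = SequenceMatcher(a=first_keying, b=second_keying, autojunk=False)
matched, disputed = [], []
for tag, i1, i2, j1, j2 in sm.get_opcodes():
    if tag == "equal":
        matched.extend(first_keying[i1:i2])
    else:
        disputed.append((first_keying[i1:i2], second_keying[j1:j2]))

print(len(matched))   # entries both keyings agree on
print(disputed)       # pairs needing arbitration (SMYTH vs SMITH here)
```

Note that the alignment works purely on the contents of the two keyings, so no
file-naming convention is needed to pair them up.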
>What we have to do is get back to that page being 375 entries and only 375
>entries. Hence my suggestion that one file (and I say again that it matters
>not which), is considered as first key and we adopt it's 375 entries. The
>second file may match 350 of them. OK - fine. But the other 25 that don't
>match ARE NOT NEW UNIQUE RECORDS!!! They are candidates for arbitration.

Yes, we need to correct the counting of these records, and I think we can do
so. It will just take a while to achieve!

>Hence my approach is that we need software that looks at all first keyings
>and identifies all entries that are out of sequence, have uncertain
>characters etc, pages out of range for the district, and entries where there
>are too many entries per register page, and clean them. (Yes, I know this
>is not the defined FreeBMD process.) Then we need a strict file naming
>regime that allows second keyings to be seen as such, and not as a duplicate
>first key files under another file name.

Cleaning is a part of the process! However, we still don't need the file
naming regime or identification of first/second keying files, because we can
already match files and identify which have identical content.

I believe that we should also be able to identify the counts better. At
present, we are counting two things for each chunk of data:

1) Total records into the chunk
2) Total distinct records out of the chunk ((1) minus duplicates found)

What we need to be finding is the length of the aligned data. I suspect that
a diagram of what we are counting and what we should be counting would be
useful, and I'll have a go at one.

>A quality end product means having a quality process that picks up errors
>(file naming or header/page number errors) early, and ensures all stick to
>the defined system. Then the software can work well to produce a clean
>database where 100% complete means 100% complete and the actual records have
>a quantifiable accuracy about them!

A quality end product is fault-tolerant of things such as file naming.
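In the meantime, a small arithmetic sketch (using the 375-entry example from
earlier in this thread) shows the difference between what we count now and
the aligned length we should be counting:

```python
# A page of 375 entries keyed twice; 350 entries match exactly.
entries_per_page = 375
matching = 350
non_matching_per_keying = entries_per_page - matching  # 25 in each keying

# What is counted at present:
total_in = 2 * entries_per_page     # (1) 750 records into the chunk
distinct_out = total_in - matching  # (2) 400 "distinct" records out

# The 50 non-matching records are really 25 disputed pairs, so the
# apparent completeness exceeds 100%:
apparent_completeness = distinct_out / entries_per_page  # about 1.07

# What should be counted: the length of the aligned data, i.e. matched
# records plus one slot per disputed pair.
aligned_length = matching + non_matching_per_keying    # 375
true_completeness = aligned_length / entries_per_page  # exactly 1.0

print(distinct_out, aligned_length)
```

The variable names are mine, not anything in the actual code; the point is
only that counting distinct records overstates the total while the aligned
length lands back on the 375 entries the page actually holds.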
We have a perfectly serviceable way of doing the data alignment that doesn't
rely on file names, so we stick with that. As it happens the system is far
more sensitive to missing +PAGE lines than anything else.

This isn't going to be a quick fix, but I believe a fix is possible if Barrie
and I can battle our way through the innards of some rather complex code.
Leave it with us.

--
Dave Mayall