RootsWeb.com Mailing Lists
    1. Re: [Dyfed] NLW scanning Welsh newspapers to put online
    2. David Rowlands
    3. I think there's a point that just might need a bit of clarification for some people here, Gareth, if you don't mind my putting my two-bob's worth in.

       The NLA (and probably most other big organisations doing this sort of thing) uses both OCR and an image. So when you are looking at the Welsh journals on-line at the NLW, that's the image which you see, and good to behold it is too! But they are also OCR'ing the same thing, so when you search for a character string, the OCR text is what you search through. Sadly, the OCR isn't as good as the human eye-brain combination, so what you see as luminously readable material on-screen is not always recognised properly within the digitised record. Errors are far more common with old newspapers, where the original records are sometimes not too good to begin with.

       What the NLA has done is enable us to see the image of the records (newspapers, journals etc) alongside the OCR interpretation of the text. That enables mugs like me to spend time we haven't got offering corrections to the OCR version of the text based on what we can see in the image before us. We all hope that makes searching the records easier and more effective for others after us.

       Funnily enough, the records in Australian newspapers that get the most attention for correction of the OCR text are the hatches, matches etc, primarily because of family historians. (There's plenty of Welsh interest there too. For example, there's a mountain of material relating to Lewis Thomas, the bloke from Talybont, Ceredigion, who became a very prominent coal miner in Queensland in the nineteenth century. And I've found rellies of mine who emigrated in the 1880s and even traced descendants.)

       David
       Canberra

       On 18/03/2012, at 11:16 PM, Gareth wrote:

       > I must agree that the Optical Character Recognition (OCR) process is
       > unlikely to achieve 100% perfection - although I hesitate to say 'never'.
       > I have OCR'd a considerable amount of material into Genuki over the years
       > and some of the sources have presented major headaches.
       > I use a programme called OmniPage Pro by Nuance, which is a market leader,
       > and it is true that when faced with a long run of work then it is worthwhile
       > persevering with the 'learning phase' to the point where most, if not all,
       > mis-transcriptions are eliminated over time.
       > The rest can, in theory, be dealt with by a final human proof reading of
       > course, but it all adds up to a lengthy business - as I found with the Hanes
       > Eglwysi Annibynnol Cymru project!
       >
       > I read a lot of books on my Kindle and it is a constant source of surprise
       > that the proof reading hasn't been carried out as effectively as I would
       > have expected with a commercially produced book.
       > And some examples of 'original' Welsh pages reproduced with OCR, seen on the
       > net today, are gibberish.
       >
       > The NLW have got it right with their digitisation of Welsh journals project,
       > perfect reproduction every time.
       > They also give you the option of switching to 'text mode' with the caution;
       > "This text was generated automatically from the scanned page and has not
       > been checked. Typical character accuracy is in excess of 99%, but this
       > leaves one error per 100 characters."
       >
       > And, importantly, the search facility they provide has apparently worked
       > perfectly in the *digital* mode every time I've used it.
       > Of course you can only 'copy/paste' individual words in Text mode, which may
       > be its biggest downside in practice.
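To put a rough number on the "one error per 100 characters" caution quoted above, here is a small back-of-envelope sketch in Python. It is illustrative only: it assumes every character is misread independently and with the same probability, which real OCR output does not strictly follow (errors tend to cluster on damaged or faded print). The function names are made up for this example and do not come from any library or from the NLW or NLA systems.

```python
# Back-of-envelope sketch, not a model of any real OCR engine.
# Assumption: each character is misread independently with the same probability.

def word_intact_probability(char_accuracy: float, word_length: int) -> float:
    """Chance that every character of a word was recognised correctly."""
    return char_accuracy ** word_length


def expected_errors(char_accuracy: float, n_chars: int) -> float:
    """Average number of misread characters in a run of n_chars characters."""
    return (1.0 - char_accuracy) * n_chars


if __name__ == "__main__":
    accuracy = 0.99  # the "in excess of 99%" figure from the quoted caution
    for word in ("Thomas", "Talybont", "Ceredigion"):
        p = word_intact_probability(accuracy, len(word))
        print(f"{word}: {p:.1%} chance an exact-match search would find it")
    print(f"Expected errors per 100 characters: {expected_errors(accuracy, 100):.1f}")
```

Under that assumption, the longer the name you search for, the more likely at least one of its characters was misread, which is one reason an exact-match search can miss an entry that is perfectly legible in the page image.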

    03/20/2012 12:34:40