RootsWeb.com Mailing Lists
Total: 3/3
    1. Re: [Dyfed] NLW scanning Welsh newspapers to put online
    2. Gareth
    3. I must agree that the Optical Character Recognition (OCR) process is unlikely to achieve 100% perfection - although I hesitate to say 'never'. I have OCR'd a considerable amount of material into Genuki over the years and some of the sources have presented major headaches. I use a programme called OmniPage Pro by Nuance, which is a market leader, and it is true that when faced with a long run of work it is worthwhile persevering with the 'learning phase' to the point where most, if not all, mis-transcriptions are eliminated over time. The rest can, in theory, be dealt with by a final human proof reading of course, but it all adds up to a lengthy business - as I found with the Hanes Eglwysi Annibynnol Cymru project!

       I read a lot of books on my Kindle and it is a constant source of surprise that the proof reading hasn't been carried out as effectively as I would have expected with a commercially produced book. And some examples of 'original' Welsh pages reproduced with OCR, seen on the net today, are gibberish.

       The NLW have got it right with their digitisation of Welsh journals project: perfect reproduction every time. They also give you the option of switching to 'text mode' with the caution: "This text was generated automatically from the scanned page and has not been checked. Typical character accuracy is in excess of 99%, but this leaves one error per 100 characters." And, importantly, the search facility they provide has apparently worked perfectly in the *digital* mode every time I've used it. Of course you can only 'copy/paste' individual words in Text mode, which may be its biggest downside in practice.

       The reCAPTCHA site has this paragraph: "The transformation into text is useful because scanning a book produces images, which are difficult to store on small devices, expensive to download, and cannot be searched. The problem is that OCR is not perfect."

       There does seem to be a place for both OCR and digitisation, horses for courses?
Gareth
Genuki Wales http://www.genuki.org.uk/big/wal/
Gareth's Help Page http://www.rootsweb.ancestry.com/~ukwales2/hicks.html
Cwmgors a'r Waun http://freepages.history.rootsweb.ancestry.com/~cwmgors/Waun.html

-----Original Message-----
From: Aidan Jones
Sent: Saturday, March 17, 2012 9:57 PM
To: David Rowlands; Dyfed List
Subject: Re: [Dyfed] NLW scanning Welsh newspapers to put online

Actually I'd have put your point even more strongly. It's not merely a case of 'unlikely' about OCR not reading everything correctly - it's an absolute certainty! There are frequently issues which are traceable to the printing of the original pages. To name just a few other possible examples, there are often problems with figures such as 3, 8 and 9, and with 'm' being confused with 'r n' (hence the well-known Lancashire town of Blackbum).

AJ
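The character confusions mentioned above ('m' read for 'r n', and the figures 3, 8 and 9 mistaken for one another) suggest one way a searcher might work around uncorrected OCR text: search for the likely misreadings as well as the correct spelling. What follows is only a rough sketch in Python of that idea. The confusion table and the function name ocr_variants are illustrative assumptions, not something taken from the NLW, Trove or any OCR package, and a real table would need many more entries.

```python
from itertools import product

# Hypothetical confusion table (an assumption for illustration only):
# each key is a string as printed or as misread, and each value lists
# the spellings worth searching for in its place.
CONFUSABLE = {
    "rn": ["rn", "m"],
    "m": ["m", "rn"],
    "3": ["3", "8", "9"],
    "8": ["8", "3", "9"],
    "9": ["9", "3", "8"],
}


def ocr_variants(term):
    """Return the set of spellings of `term` that the confusions above could produce."""
    choices = []
    i = 0
    while i < len(term):
        # Try the two-character key first so that "rn" is handled as one unit.
        for width in (2, 1):
            chunk = term[i:i + width]
            if len(chunk) == width and chunk in CONFUSABLE:
                choices.append(CONFUSABLE[chunk])
                i += width
                break
        else:
            # No confusion known for this character: keep it unchanged.
            choices.append([term[i]])
            i += 1
    # Combine one choice per position into every possible spelling.
    return {"".join(parts) for parts in product(*choices)}


if __name__ == "__main__":
    for spelling in sorted(ocr_variants("Blackburn")):
        print(spelling)
```

Searching for every spelling returned for "Blackburn" would also catch pages where the OCR has produced "Blackbum". The number of variants grows quickly for terms containing several confusable characters (a year such as 1839 has three confusable figures), so in practice the list would probably need trimming.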

    03/18/2012 06:16:45
    1. Re: [Dyfed] NLW scanning Welsh newspapers to put online
    2. Aidan Jones
    3. ----- Original Message -----
       From: "Gareth" <@clara.co.uk>
       To: "Dyfed List" <[email protected]>
       Sent: Sunday, March 18, 2012 12:16 PM
       Subject: Re: [Dyfed] NLW scanning Welsh newspapers to put online

       > I must agree that the Optical Character Recognition (OCR) process is
       > unlikely to achieve 100% perfection - although I hesitate to say 'never'.
       ...
       > The NLW have got it right with their digitisation of Welsh journals project,
       > perfect reproduction every time.

       But the print quality of the original pages in most commercial books and academic journals tends to be of a noticeably higher order than that used in many early provincial newspapers. This makes it easier for the OCR to function to a higher standard - especially with modern books.

       The newspapers quite often suffer from paler patches on parts of the sheet - whether this be caused by uneven inking on the original typeface due to the limitations of the letterpress, or caused by slippages or creases, or whatever. The paper quality is also generally lower (being intended to be current for shorter periods). It is thus more likely to have suffered deterioration or other incidental damage over the years, with much less chance that any alternative copy of the original will be obtainable. Sometimes the closeness of the printed newspaper lines and the greater use of small type, or a more variable range of fonts, might cause added complications for OCR.

       Some early attempts at using OCR for old newspapers had to be based on microfilm (often bearing scratches), rather than the original pages. There is no doubt that the latter will produce a superior result. However, in some instances the original pages might no longer survive.

       The short currency of newspapers meant that printed illustrations were rare before the late 1890s (or even later). Technological improvements subsequently made the process easier and more commercially worthwhile. Before this date the only illustrations tended to be based on the re-use of engraved plates, which normally had been originally prepared for some entirely different publication (e.g. a book or a magazine) with a longer shelf life.

       AJ

    03/18/2012 07:50:53
    1. Re: [Dyfed] NLW scanning Welsh newspapers to put online
    2. Megan Roberts
    3. Just to add my own two pennyworth - I know that a lister has already mentioned the Australian archive "Trove", and frankly any library or archive looking to digitise or put online any old documents could do no better than use that as a model - even if there are new software developments around.

       Yes, the OCR can be diabolical, BUT because the original can also be seen at the same time then (A) users can contribute to others' enjoyment by adding their corrections and (B) if you are looking at an unedited page it really isn't that difficult to find what you want.

       Without Trove I would never have discovered that my distant relation John David Gambold of Rudbaxton, who was sentenced at Monmouth Assizes to be transported to Australia for life in 1832, eventually took passage from Sydney to San Francisco, as evidenced by the Shipping Intelligence in the Maitland and Hunter River General Advertiser of 1849.

       Megan

    03/18/2012 09:38:37