Ian Goddard wrote:
> singhals wrote:
>> Didn't someone in the past, oh, say, month, mention software that
>> would compare databases and flag matches?
>>
>> Not necessarily a *specific* genealogy program database, just
>> databases in general?
>>
>> I'm looking for an easy way to vacuum up "hit" lists from Ancestry,
>> WC, Google, et al, and find the common ones.
>>
>> Cheryl
>
> I don't recall anything like that and a quick google doesn't find
> anything. Wishful thinking?
>
> It's an interesting problem. First of all, what's the format of the hit
> lists? Are the hits from all the sources in the same format?
>
> Secondly, most comparison tools that I can think of work on a specific
> file format, usually a flat text file, although there are some that work
> on XML files. You would need to get the files into the appropriate format.
>
> Thirdly, many comparison tools do the opposite of what you want - they
> look for differences. My favourite approach to looking for multiple
> occurrences of *identical* lines across multiple files would be the Unix
> command
>
>     cat x y z | sort | uniq -c | sort -rn | more
>
> where x, y & z would be 3 file names (you can cat as few or as many files
> as you like). This merges the contents into sorted order so that
> duplicates follow each other, collapses each run of identical lines into
> one line prefixed with the number of times it was found, re-sorts them in
> descending order of count and pages the output. You can then see which
> lines were in more than one file (provided no line is repeated within a
> single file) but not which file they were in.
>
> This requires that you have the hits in a common flat file format or can
> convert them to that; that hits which you would consider matching are
> identical within the files; that you either don't care which lists the
> matches were in, don't mind just comparing them in pairs or are prepared
> to hunt for them in the files; and finally that you have access to
> Unix-style commands (if you're on Windows only, google for "cygwin").
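The pipeline quoted above cannot say which files a repeated line came from. One possible variant, offered only as a sketch, uses awk to count each distinct line once per file and to remember the file names. The function name `common_lines` is made up for illustration, and the same assumption applies as before: matching hits must be byte-for-byte identical lines.

```shell
# common_lines FILE...  (hypothetical helper, not a standard command)
# Prints every line that occurs in two or more of the given files as
#   COUNT <TAB> LINE <TAB> space-separated list of file names
# with the most widely shared lines first.
common_lines() {
    awk '
        # Act only on the first time a given line is seen in a given file,
        # so repeats inside one file do not inflate the count.
        !seen[FILENAME, $0]++ {
            count[$0]++
            files[$0] = files[$0] " " FILENAME
        }
        END {
            for (line in count)
                if (count[line] > 1)
                    printf "%d\t%s\t%s\n", count[line], line, substr(files[line], 2)
        }
    ' "$@" | sort -t "$(printf '\t')" -k1,1nr -k2,2
}
```

It would be called in the same way as the original, for example `common_lines x y z | more`. File names containing spaces would make the last column ambiguous, so this assumes simple names.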
just upload the two GEDCOM files to WorldConnect and compare the resulting
lists by eyeball

Hugh W

--
For genealogy and help with family and local history in Bristol and district
http://groups.yahoo.com/group/Brycgstow/

http://snaps4.blogspot.com/ photographs and walks

GENEALOGE
http://hughw36.blogspot.com/ MAIN BLOG
Hugh Watkins wrote:
> Ian Goddard wrote:
>> singhals wrote:
>>> Didn't someone in the past, oh, say, month, mention software that
>>> would compare databases and flag matches?
>>>
>>> Not necessarily a *specific* genealogy program database, just
>>> databases in general?
>>>
>>> I'm looking for an easy way to vacuum up "hit" lists from Ancestry,
>>> WC, Google, et al, and find the common ones.
>>>
>>> Cheryl
>>
>> I don't recall anything like that and a quick google doesn't find
>> anything. Wishful thinking?
>>
>> It's an interesting problem. First of all, what's the format of the
>> hit lists? Are the hits from all the sources in the same format?
> <snip>
>
> just upload the two GEDCOM files to WorldConnect and compare the resulting
> lists by eyeball
>
> Hugh W

You've assumed the data will be in GEDCOM format. What if it isn't?

One of my current problems is handling a flat file of baptisms converted
from PDF by pdftotext (http://linux.die.net/man/1/pdftotext). The PDF
column widths vary slightly from page to page, so it's not even a truly
fixed-width file, and some of the data isn't even consistently allocated
between columns.

Parsing the data sources isn't easy and you can't assume that you'll be
able to find some pre-existing facility to do it for you.

--
Ian

Hotmail is for spammers. Real mail address is igoddard at nildram co uk
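For a text file like the one described above, where column positions drift from page to page, one approach that sometimes helps is to ignore positions altogether and split on runs of spaces. The sketch below rests on assumptions that may not hold for any particular file: that columns are separated by at least two spaces and that no field contains two consecutive spaces. The function name `split_columns` and its arguments are hypothetical. It does not repair data that has landed in the wrong column; it only sets aside lines with an unexpected number of fields so they can be checked by hand.

```shell
# split_columns NFIELDS FILE  (hypothetical helper, not a standard command)
# Writes each line of FILE to stdout as tab-separated fields, splitting on
# runs of two or more spaces. Lines that do not yield exactly NFIELDS fields
# are reported on stderr with their line number instead.
split_columns() {
    awk -v want="$1" -F '  +' '
        {
            gsub(/\f/, "")      # drop the form feeds pdftotext puts between pages
            sub(/^ +/, "")      # trim leading spaces
            sub(/ +$/, "")      # trim trailing spaces; awk re-splits the fields
        }
        NF == 0 { next }        # skip blank lines
        NF != want {
            printf "line %d: %d fields: %s\n", NR, NF, $0 > "/dev/stderr"
            next
        }
        {
            out = $1
            for (i = 2; i <= NF; i++) out = out "\t" $i
            print out
        }
    ' "$2"
}
```

Redirecting stderr to a file, for example `split_columns 3 baptisms.txt > clean.tsv 2> review.txt` with made-up file names, gives a list of the lines that still need manual attention.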