On Tue, 19 Feb 2013 08:57:27 -0600, Ed Morton <mortonspam@gmail.com> wrote:

>On 2/19/2013 3:14 AM, Steve Hayes wrote:
>> In an earlier message I suggested using AWK to manipulate a GEDCOM file to
>> solve a particular problem.
>>
>> That point tended to get lost in discussion of other points like using other
>> ways to solve the problem, or discussion of flaws in the GEDCOM data model
>> itself and proposals for its replacement, which I see as a separate question.
>>
>> What I would like to see is the development of a kind of library of AWK
>> routines to manipulate GEDCOM files. Lots of genealogists have GEDCOM files,
>> and some would like to make changes to them, or extract information from them
>> in ways that might not be possible with other genealogy programs.
>>
>> Here is a GEDCOM file.
>>
>> I tried to choose a short one to use as an example, which shows the structure
>> of the file.
>
>OK, so that's presumably a good, representative input file for an awk script to
>run against. Now - what might an output file look like and (briefly!) why?
>
> Ed.

Posted in the previous thread, one example is:

So a portion of the GEDCOM might look like this:

0 @I1@ INDI
1 NAME Gerald "Bernard" /Landry/
2 GIVN Gerald "Bernard"
2 SURN Landry
1 SEX M
1 BIRT
2 DATE 9 MAR 1937
2 PLAC St-Jacques
0 @I2@ INDI
1 NAME Bernard /St-Jacques/
2 GIVN Bernard
2 SURN St-Jacques
1 SEX M
1 FAMS @F1@
1 FAMC @F2@

Where some of the given names (but not all) found in the GEDCOM have quote marks around them.
And we want to process it so that it looks like this:

0 @I1@ INDI
1 NAME Gerald ~Bernard~ /Landry/
2 GIVN Gerald ~Bernard~
2 SURN Landry
1 SEX M
1 BIRT
2 DATE 9 MAR 1937
2 PLAC St-Jacques
0 @I2@ INDI
1 NAME Bernard /St-Jacques/
2 GIVN Bernard
2 SURN St-Jacques
1 SEX M
1 FAMS @F1@
1 FAMC @F2@

Where the tilde I used might be any other character of our choosing, as long as it is not a character that would also appear elsewhere in the given name fields of the GEDCOM.

The only difficulty in processing (vs. a simple search and replace in a text editor) is that it is highly likely that there are other quote marks in other fields in the GEDCOM, as GEDCOMs typically contain many paragraphs of plain text. So the need is to direct the processing to occur only on a particular field or fields.

It wouldn't be necessary to do all the processing in one "pass", i.e., you could do the work on the GIVN field, and then on the NAME field.

Why is a bit long to explain, but it has been discussed at length in the previous thread.
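For what it's worth, here is a rough sketch of how that restriction to particular fields might look in AWK. It is only a sketch under some assumptions: that every GEDCOM line has the form "LEVEL TAG VALUE", so the tag is the second whitespace-separated field, and that the replacement character is not "&" or a backslash, which gsub() treats specially. The function name fix_given_quotes is made up for illustration. Note that it would also touch NAME lines in records other than INDI (a submitter record, say), so a real routine may need to track the enclosing level-0 record as well.

```shell
# Hypothetical helper: reads a GEDCOM file on stdin, writes the result to stdout.
# Double quotes are replaced only on lines whose tag is NAME or GIVN;
# every other line (NOTE, CONT, PLAC, ...) is passed through unchanged.
fix_given_quotes() {
    awk -v repl='~' '
        # Assumption: lines look like "LEVEL TAG VALUE", so the tag is field 2.
        # Level-0 lines such as "0 @I1@ INDI" have the xref in field 2 and never match.
        $2 == "NAME" || $2 == "GIVN" { gsub(/"/, repl) }

        # Print every line, changed or not.
        { print }
    '
}
```

It could be run as something like "fix_given_quotes < family.ged > family-fixed.ged", where both file names are placeholders. This does the NAME and GIVN fields in a single pass, but the two conditions could just as easily be split into two separate runs.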