RootsWeb.com Mailing Lists
    1. Re: Using AWK to manipulate GEDCOM files
    2. Janis Papanagnou
    3. On 19.02.2013 10:14, Steve Hayes wrote:
> In an earlier message I suggested using AWK to manipulate a GEDCOM file to
> solve a particular problem.
>
> That point tended to get lost in discussion of other points like using other
> ways to solve the problem, or discussion of flaws in the GEDCOM data model
> itself and proposals for its replacement, which I see as a separate question.
>
> What I would like to see is the development of a kind of library of AWK
> routines to manipulate GEDCOM files. Lots of genealogists have GEDCOM files,
> and some would like to make changes to them, or extract information from them
> in ways that might not be possible with other genealogy programs.

You can consider the awk operations to be quite primitive for the
given syntax of the GEDCOM files, so a library seems not really
necessary; just write the awk command. I will give examples below.

But first I'd like to ask for confirmation what a GEDCOM "field"
actually is, in terms of semantics and syntax. Is it _one whole line_ with
a specific 4-letter tag in column 2, or is it the _rest_ of a line
where the first two columns are some number and a data type tag?

To change data of a specific line identify the line by a pattern
on the type field (please note that Lew already gave such an example).
To perform an action on a "NAME" field, replacing "Service" by "S."

  awk '$2 == "NAME" { sub(/Service/, "S.") } { print $0 }'

likewise negate the condition if you want to select type tags other
than name.

To exclude tag names from processing that seem to have a specific
meaning

  awk '$2 !~ /@.*@/ { sub(/Service/, "S.") } { print $0 }'

You can combine those using logical operations like && (and)

  awk '$2 !~ /@.*@/ && $2 == "NAME" { sub(/Service/, "S.") } { print $0 }'

If you want something like a library (as you said), it's harder to cover
all conditions in a home-grown ("invented-here") language frame without
making it more complex than the awk language itself.

But I will give an example of how to parameterise awk when using only
simple conditions.

  awk -v tag="NAME" -v from="Service" -v to="S." '
    $2 == tag { sub(from,to) }
    { print $0 }
  '

where the string constants may be passed through shell variables

  awk -v tag="${1:?}" -v from="${2:?}" -v to="${3:?}" '
    $2 == tag { sub(from,to) }
    { print $0 }
  '

(One caveat in advance: a /from/ pattern passed as a string, either as
"from" or per variable, will be subject to other interpretation than
a pattern constant. But I guess that would only be an issue once we are
sure that this is what you want. Another caveat: for simplicity I
substituted over the whole line, so any substitution may affect the tag
fields; say, if you want to substitute /ME/ in the data columns, it would
change the "ME" in the tag "NAME". In case the "rest of the line (after
number and tag)" is the actual data, it may be advantageous to operate on
sub-strings of the whole line. I will expand on that on demand.)

Janis

>
> Here is a GEDCOM file.
>
> I tried to choose a short one to use as an example, which shows the structure
> of the file.
>
> 0 HEAD
> 1 SOUR ANSTFILE
> 2 VERS 4.19
> 2 NAME Ancestral File (R)
> 2 CORP The Church of Jesus Christ of Latter-day Saints
> 3 ADDR 50 East North Temple Street
> 4 CONT Salt Lake City, Utah 84150
> 2 DATA Ancestral File
> 3 DATE 5 January 1998
> 3 COPR Copyright (c) 1987, June 1998
> 1 DEST PAF
> 1 DATE 20 APR 2002
> 2 TIME 2:58:56
> 1 FILE GEDCOM4.ged
> 1 GEDC
> 2 VERS 5.5
> 2 FORM LINEAGE-LINKED
> 1 CHAR ANSEL
> 1 SUBM @SUB01@
> 1 SUBN @N01@
> 0 @SUB01@ SUBM
> 1 NAME Created by FamilySearch (TM) Internet Genealogy Service
> 1 ADDR 50 East North Temple Street
> 2 CONT Salt Lake City, Utah 84150
> 0 @S01@ SOUR
> 1 AUTH The Church of Jesus Christ of Latter-day Saints
> 1 TITL Ancestral File (R)
> 1 PUBL Copyright (c) 1987, June 1998, data as of 5 January 1998
> 1 REPO @R01@
> 0 @R01@ REPO
> 1 NAME Family History Library
> 1 ADDR 35 N West Temple Street
> 2 CONT Salt Lake City, Utah 84150 USA
> 0 @N01@ SUBN
> 1 DESC 2
> 1 ORDI N
> 0 @I3GLR-Z3@ INDI
> 1 NAME Thomas William /BALDOCK/
> 2 GIVN Thomas William
> 2 SURN BALDOCK
> 1 AFN 3GLR-Z3
> 1 SEX M
> 1 SOUR @S01@
> 1 BIRT
> 2 DATE 1 Jul 1850
> 2 PLAC Geelong, Vic, Astl
> 1 FAMS @F1794078@
> 1 FAMC @F524078@
> 0 @I3GLR-4R@ INDI
> 1 NAME Thomas /BALDOCK/
> 2 GIVN Thomas
> 2 SURN BALDOCK
> 1 AFN 3GLR-4R
> 1 SEX M
> 1 SOUR @S01@
> 1 FAMS @F524078@
> 0 @I3GLR-5X@ INDI
> 1 NAME Anne /CHAMBERS/
> 2 GIVN Anne
> 2 SURN CHAMBERS
> 1 AFN 3GLR-5X
> 1 SEX F
> 1 SOUR @S01@
> 1 FAMS @F524078@
> 0 @I98BW-JC@ INDI
> 1 NAME Emily Jane /THORNTON/
> 2 GIVN Emily Jane
> 2 SURN THORNTON
> 1 AFN 98BW-JC
> 1 SEX F
> 1 SOUR @S01@
> 1 BIRT
> 2 DATE 1854
> 2 PLAC Geelong, Victoria, Australia
> 1 DEAT
> 2 DATE 9 Dec 1890
> 2 PLAC Geelong, Victoria, Australia
> 1 FAMS @F1794078@
> 1 FAMC @F1794093@
> 0 @I98BX-N6@ INDI
> 1 NAME Charles Edwin /THORNTON/
> 2 GIVN Charles Edwin
> 2 SURN THORNTON
> 1 AFN 98BX-N6
> 1 SEX M
> 1 SOUR @S01@
> 1 FAMS @F1794093@
> 0 @I98BX-PC@ INDI
> 1 NAME Emily /GROWDON/
> 2 GIVN Emily
> 2 SURN GROWDON
> 1 AFN 98BX-PC
> 1 SEX F
> 1 SOUR @S01@
> 1 FAMS @F1794093@
> 0 @I98CJ-BW@ INDI
> 1 NAME Percy William Growdon /BALDOCK/
> 2 GIVN Percy William Growdon
> 2 SURN BALDOCK
> 1 AFN 98CJ-BW
> 1 SEX M
> 1 SOUR @S01@
> 1 BIRT
> 2 DATE ABT 1876
> 2 PLAC Geelong, Victoria, Australia
> 1 FAMC @F1794078@
> 0 @I98BW-LP@ INDI
> 1 NAME Percy William Growdon /BALDOCK/
> 2 GIVN Percy William Growdon
> 2 SURN BALDOCK
> 1 AFN 98BW-LP
> 1 SEX M
> 1 SOUR @S01@
> 1 BIRT
> 2 DATE 1879
> 2 PLAC Geelong, Victoria, Australia
> 1 DEAT
> 2 DATE 6 Sep 1886
> 2 PLAC Geelong, Victoria, Australia
> 1 FAMC @F1794078@
> 0 @I98BW-KJ@ INDI
> 1 NAME Arthur Jabez /BALDOCK/
> 2 GIVN Arthur Jabez
> 2 SURN BALDOCK
> 1 AFN 98BW-KJ
> 1 SEX M
> 1 SOUR @S01@
> 1 BIRT
> 2 DATE 1878
> 2 PLAC Geelong, Victoria, Australia
> 1 FAMC @F1794078@
> 0 @I98BW-P7@ INDI
> 1 NAME Gladys Claudine /BALDOCK/
> 2 GIVN Gladys Claudine
> 2 SURN BALDOCK
> 1 AFN 98BW-P7
> 1 SEX F
> 1 SOUR @S01@
> 1 BIRT
> 2 DATE 1887
> 2 PLAC Geelong, Victoria, Australia
> 1 DEAT
> 2 DATE 1907
> 2 PLAC
> 1 FAMC @F1794078@
> 0 @I98BW-N2@ INDI
> 1 NAME Clive Alfred /BALDOCK/
> 2 GIVN Clive Alfred
> 2 SURN BALDOCK
> 1 AFN 98BW-N2
> 1 SEX M
> 1 SOUR @S01@
> 1 BIRT
> 2 DATE 1884
> 2 PLAC Geelong, Victoria, Australia
> 1 DEAT
> 2 DATE 25 Oct 1951
> 2 PLAC
> 1 FAMC @F1794078@
> 0 @I98BW-MV@ INDI
> 1 NAME Lawrence /BALDOCK/
> 2 GIVN Lawrence
> 2 SURN BALDOCK
> 1 AFN 98BW-MV
> 1 SEX M
> 1 SOUR @S01@
> 1 BIRT
> 2 DATE 1881
> 2 PLAC Geelong, Victoria, Australia
> 1 FAMC @F1794078@
> 0 @F1794078@ FAM
> 1 HUSB @I3GLR-Z3@
> 1 WIFE @I98BW-JC@
> 1 CHIL @I98CJ-BW@
> 1 CHIL @I98BW-LP@
> 1 CHIL @I98BW-KJ@
> 1 CHIL @I98BW-P7@
> 1 CHIL @I98BW-N2@
> 1 CHIL @I98BW-MV@
> 1 MARR
> 2 DATE 20 Apr 1876
> 2 PLAC Geelong, Victoria, Australia
> 0 @F524078@ FAM
> 1 HUSB @I3GLR-4R@
> 1 WIFE @I3GLR-5X@
> 1 CHIL @I3GLR-Z3@
> 0 @F1794093@ FAM
> 1 HUSB @I98BX-N6@
> 1 WIFE @I98BX-PC@
> 1 CHIL @I98BW-JC@
> 0 TRLR
>
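The caveat about substituting over the whole line can be worked around by substituting only in the text that follows the level number and tag. What follows is a minimal sketch and not part of the original message. It assumes the tag is the second field, as in the commands above, and the NAME line with the surname MEAD is a made-up example chosen so that /ME/ occurs in both the tag and the data. The `from` string is still interpreted as a regular expression.

```shell
# Made-up NAME line (MEAD is hypothetical) plus a TIME line from the sample file.
sample='1 NAME Mary /MEAD/
2 TIME 2:58:56'

out=$(printf '%s\n' "$sample" | awk -v tag="NAME" -v from="ME" -v to="XX" '
# On lines with the wanted tag, split off "level tag " and edit only the rest.
$2 == tag && match($0, /^[ \t]*[0-9]+[ \t]+[^ \t]+[ \t]*/) {
    head = substr($0, 1, RLENGTH)     # level number, tag and separators
    rest = substr($0, RLENGTH + 1)    # the data part of the line
    sub(from, to, rest)               # the tag in head is never touched
    $0 = head rest
}
{ print }
')
printf '%s\n' "$out"
```

With these inputs the tag NAME stays intact and only the surname changes; the TIME line is passed through untouched because its tag does not match.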

    02/19/2013 07:06:25
    1. Re: Using AWK to manipulate GEDCOM files
    2. Steve Hayes
    3. On Tue, 19 Feb 2013 14:06:25 +0100, Janis Papanagnou
<janis_papanagnou@hotmail.com> wrote:

>To change data of a specific line identify the line by a pattern
>on the type field (please note that Lew already gave such an example).
>To perform an action on a "NAME" field, replacing "Service" by "S."
>
>  awk '$2 == "NAME" { sub(/Service/, "S.") } { print $0 }'
>
>likewise negate the condition if you want to select type tags other
>than name.

Thanks very much for these examples. When I get a chance I will play with them
and see how they work.

A GEDCOM file has several parts, but the first part consists of information
about individual people.

The digit at the beginning is a level number, so 0 means data on a new
individual, like this:

>> 0 @I98BW-JC@ INDI

1 is the next level, with particular information about the individual, such
as the NAME

>> 1 NAME Emily Jane /THORNTON/
>> 2 GIVN Emily Jane
>> 2 SURN THORNTON

where the next level is further information about the name,

then information about the person's BIRTH

>> 1 BIRT
>> 2 DATE 1854
>> 2 PLAC Geelong, Victoria, Australia

The kind of manipulation one might want to do would be to change place names
to make them consistent throughout the file, so one might want to use
abbreviations and change "Geelong, Victoria, Australia" to "Geelong, VIC,
AUS".

Another might be to produce a report of people who were born in Geelong, but
died elsewhere in Australia.

And so on.

--
Steve Hayes from Tshwane, South Africa
Blog: http://khanya.wordpress.com
E-mail - see web page, or parse: shayes at dunelm full stop org full stop uk
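The report described above (born in Geelong, died elsewhere in Australia) can be sketched by letting awk remember, for each INDI record, the name and the places found under BIRT and DEAT. This is only an illustration under stated assumptions: places are compared by simple pattern matching on "Geelong" and "Australia", the function names `emit` and `value` are invented here, and the second record (@I2@, John /EXAMPLE/) is made up so that the report has something to print. The first record comes from the sample file.

```shell
sample='0 @I98BW-JC@ INDI
1 NAME Emily Jane /THORNTON/
1 BIRT
2 DATE 1854
2 PLAC Geelong, Victoria, Australia
1 DEAT
2 DATE 9 Dec 1890
2 PLAC Geelong, Victoria, Australia
0 @I2@ INDI
1 NAME John /EXAMPLE/
1 BIRT
2 PLAC Geelong, Victoria, Australia
1 DEAT
2 PLAC Melbourne, Victoria, Australia
0 TRLR'

report=$(printf '%s\n' "$sample" | awk '
# Print the collected record if it matches the report condition, then reset.
function emit() {
    if (name != "" && birt ~ /Geelong/ && deat ~ /Australia/ && deat !~ /Geelong/)
        print name ": born " birt ", died " deat
    name = birt = deat = ""
}
# Return the text that follows the level number and tag.
function value(   v) {
    v = $0
    sub(/^[ \t]*[0-9]+[ \t]+[^ \t]+[ \t]*/, "", v)
    return v
}
$1 == 0 { emit(); inside = ($3 == "INDI"); next }   # a level 0 line starts a new record
!inside { next }                                    # skip records that are not individuals
$1 == 1 { event = $2 }                              # remember the current level 1 tag
$1 == 1 && $2 == "NAME" { name = value() }
$1 == 2 && $2 == "PLAC" && event == "BIRT" { birt = value() }
$1 == 2 && $2 == "PLAC" && event == "DEAT" { deat = value() }
END { emit() }
')
printf '%s\n' "$report"
```

Emily Jane Thornton is not listed because she also died in Geelong; only the made-up record is reported. Individuals with an empty death PLAC, as in the sample file, are skipped by the same condition.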

    02/19/2013 02:15:23
    1. Re: Using AWK to manipulate GEDCOM files
    2. Janis Papanagnou
    3. On 19.02.2013 20:15, Steve Hayes wrote:
> On Tue, 19 Feb 2013 14:06:25 +0100, Janis Papanagnou
> <janis_papanagnou@hotmail.com> wrote:
>
>> To change data of a specific line identify the line by a pattern
>> on the type field (please note that Lew already gave such an example).
>> To perform an action on a "NAME" field, replacing "Service" by "S."
>>
>>   awk '$2 == "NAME" { sub(/Service/, "S.") } { print $0 }'
>>
>> likewise negate the condition if you want to select type tags other
>> than name.
>
> Thanks very much for these examples. When I get a chance I will play with them
> and see how they work.
>
> A GEDCOM file has several parts, but the first part consists of information
> about individual people.
>
> The digit at the beginning is a level number, so 0 means data on a new
> individual, like this:
>
>>> 0 @I98BW-JC@ INDI
>
> 1 is the next level, with particular information about the individual, such
> as the NAME
>
>>> 1 NAME Emily Jane /THORNTON/
>>> 2 GIVN Emily Jane
>>> 2 SURN THORNTON
>
> where the next level is further information about the name,
>
> then information about the person's BIRTH
>
>>> 1 BIRT
>>> 2 DATE 1854
>>> 2 PLAC Geelong, Victoria, Australia
>
> The kind of manipulation one might want to do would be to change place names
> to make them consistent throughout the file, so one might want to use
> abbreviations and change "Geelong, Victoria, Australia" to "Geelong, VIC,
> AUS".
>
> Another might be to produce a report of people who were born in Geelong, but
> died elsewhere in Australia.
>
> And so on.

I see. I'm sure awk fits very well for such manipulations. Given your
example above, one further step would be to use a file with mapping
information, thus letting awk do all that mapping without changing the
awk program.

Janis
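One way the mapping-file idea could look is sketched below; it is not taken from the thread. It assumes a hypothetical mapping file `places.map` with one entry per line (old place name, a tab, new place name) and replaces a PLAC value only when the whole value matches an entry exactly, which avoids the regular-expression caveats of sub(). The file names and the temporary directory are placeholders.

```shell
workdir=$(mktemp -d)

# Hypothetical mapping file: old name, tab, new name.
printf 'Geelong, Victoria, Australia\tGeelong, VIC, AUS\n' > "$workdir/places.map"

# A few lines from the sample file.
cat > "$workdir/sample.ged" <<'EOF'
1 BIRT
2 DATE 1854
2 PLAC Geelong, Victoria, Australia
1 FAMS @F1794078@
EOF

mapped=$(awk '
# First file: load the mapping into an array.
NR == FNR { split($0, kv, "\t"); map[kv[1]] = kv[2]; next }
# Second file: rewrite PLAC values that have an exact entry in the mapping.
$2 == "PLAC" && match($0, /^[ \t]*[0-9]+[ \t]+[^ \t]+[ \t]*/) {
    head = substr($0, 1, RLENGTH)
    place = substr($0, RLENGTH + 1)
    if (place in map) $0 = head map[place]
}
{ print }
' "$workdir/places.map" "$workdir/sample.ged")
printf '%s\n' "$mapped"
rm -rf "$workdir"
```

New abbreviations are then added by editing the mapping file only; the awk program stays the same.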

    02/19/2013 01:48:36
    1. Re: Using AWK to manipulate GEDCOM files
    2. Steve Hayes
    3. On Tue, 19 Feb 2013 14:06:25 +0100, Janis Papanagnou
<janis_papanagnou@hotmail.com> wrote:

>On 19.02.2013 10:14, Steve Hayes wrote:
>> In an earlier message I suggested using AWK to manipulate a GEDCOM file to
>> solve a particular problem.
>>
>> That point tended to get lost in discussion of other points like using other
>> ways to solve the problem, or discussion of flaws in the GEDCOM data model
>> itself and proposals for its replacement, which I see as a separate question.
>>
>> What I would like to see is the development of a kind of library of AWK
>> routines to manipulate GEDCOM files. Lots of genealogists have GEDCOM files,
>> and some would like to make changes to them, or extract information from them
>> in ways that might not be possible with other genealogy programs.
>
>You can consider the awk operations to be quite primitive for the
>given syntax of the GEDCOM files, so a library seems not really
>necessary; just write the awk command. I will give examples below.
>
>But first I'd like to ask for confirmation what a GEDCOM "field"
>actually is, in terms of semantics and syntax. Is it _one whole line_ with
>a specific 4-letter tag in column 2, or is it the _rest_ of a line
>where the first two columns are some number and a data type tag?
>
>To change data of a specific line identify the line by a pattern
>on the type field (please note that Lew already gave such an example).
>To perform an action on a "NAME" field, replacing "Service" by "S."
>
>  awk '$2 == "NAME" { sub(/Service/, "S.") } { print $0 }'
>
>likewise negate the condition if you want to select type tags other
>than name.
>
>To exclude tag names from processing that seem to have a specific
>meaning
>
>  awk '$2 !~ /@.*@/ { sub(/Service/, "S.") } { print $0 }'

I substituted "Ellwood1.ged" for "!~" and got this:

gawk: (FILENAME=ellwood1.ged FNR=40697) fatal: cannot open file `/@.*@/' for reading (Invalid argument)

Please forgive my ignorance -- I'm still feeling my way in the dark with this
stuff.

--
Steve Hayes from Tshwane, South Africa
Blog: http://khanya.wordpress.com
E-mail - see web page, or parse: shayes at dunelm full stop org full stop uk
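The error message suggests that gawk ended up treating /@.*@/ as the name of an input file. In the command as posted, `!~` is awk's "does not match" operator and belongs to the program; the GEDCOM file name is a separate argument that comes after the program. The following is a minimal sketch assuming a POSIX shell. The names `input.ged` and `service.awk` are placeholders, with `input.ged` standing in for Ellwood1.ged and holding two lines from the sample file quoted earlier in the thread.

```shell
workdir=$(mktemp -d)

# Stand-in for Ellwood1.ged.
cat > "$workdir/input.ged" <<'EOF'
0 @SUB01@ SUBM
1 NAME Created by FamilySearch (TM) Internet Genealogy Service
EOF

# The awk program, unchanged: "!~" stays exactly where it was.
cat > "$workdir/service.awk" <<'EOF'
$2 !~ /@.*@/ { sub(/Service/, "S.") }
{ print $0 }
EOF

# The input file is named after the program, as its own argument.
result=$(awk -f "$workdir/service.awk" "$workdir/input.ged")
printf '%s\n' "$result"
rm -rf "$workdir"
```

Keeping the program in its own file and running it with `-f` also sidesteps quoting differences between shells. To keep the changed data, the output can be redirected to a new file rather than written over the original.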

    02/20/2013 01:05:36