RootsWeb.com Mailing Lists
    1. GEDCOM files. ASCII, ANSEL, Unicode and UTF-8
    2. Richard Halliday
    3. All,

In an earlier message it was asked what character set should be used when transferring information to a non-PAF genealogical database. What follows is probably more than most of you want to know about the subject. If that is the case, a short answer is at the end of this message. The selection is easier when you know what each character set represents.

Since computers do not recognize letters, punctuation marks and special characters, but work entirely with numbers, it is necessary to have an agreement as to what each number represents. A character set is merely an agreed-upon list of numbers and the character that each number represents. For example, Morse code is a character set composed of dots and dashes that represent letters, numerals, punctuation marks and control signals. In computer character sets, control characters are items such as FF (form feed: move the paper to the top of the next form), EOT (end of transmission), ACK (acknowledge), etc.

ASCII. One of the earliest computer character sets is the American Standard Code for Information Interchange. It is a seven-bit code (seven 1s or 0s per character), which gives 128 characters. Characters were usually sent as eight bits, with the eighth bit used as a parity bit. Parity is a simple method of determining whether an error in transmission has occurred. This was important at the time because telephone and telegraph lines were quite noisy, making transmission errors common. The 128 characters consist of letters, both upper case and lower case, numerals, punctuation, special characters (e.g. "/", "?", "%", etc.) and control characters. For example "A" is represented by 65, "B" by 66, "C" by 67, "a" by 97, "1" by 49, "?" by 63, etc. The letters of the alphabet, both upper case and lower case, and the numerals require 62 of the numbers, leaving 66 of them for the other characters. ASCII is sufficient for the English language, which uses the Roman alphabet without diacritical marks.
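
To make the ASCII numbers and the parity bit concrete, here is a small Python sketch. The function names are my own, and even parity is an assumption made only for illustration (odd parity was also used on real lines). Python's built-in ord() returns the standard ASCII number for each of these characters.

```python
def ascii_code(ch: str) -> int:
    """Return the ASCII number of a single character (0-127)."""
    code = ord(ch)
    if code > 127:
        raise ValueError(f"{ch!r} is not an ASCII character")
    return code


def with_even_parity(code: int) -> int:
    """Put an even-parity bit in the eighth (top) bit of a 7-bit code.

    Assumption: even parity, so the total number of 1 bits
    in the resulting eight bits is always even.
    """
    ones = bin(code).count("1")
    parity_bit = ones % 2          # 1 only if the count of 1s is odd
    return code | (parity_bit << 7)


def parity_ok(byte: int) -> bool:
    """Check a received eight-bit value against even parity."""
    return bin(byte).count("1") % 2 == 0


for ch in "ABCa1?":
    code = ascii_code(ch)
    print(ch, code, format(with_even_parity(code), "08b"))
```

If a single bit is flipped by line noise, parity_ok() returns False for the damaged value. Two flipped bits cancel out and go unnoticed, which is why parity is only a simple check.
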
As the modes of electrical transmission and storage of information became more reliable, the eighth bit of each character, the parity bit, was no longer needed for error checking. By using all eight bits for data it was possible to represent 256 characters.

ANSEL. This is an eight-bit character set, so it has room for up to 256 codes. Other European languages use the Roman alphabet, but they have additional marks (diacritical marks) that help with the pronunciation. Examples are the tilde, the umlaut and the accent mark. The American National Standard for Extended Latin (ANSEL), also referred to as the American Library Association (ALA) character set, uses the additional codes to add the diacritical marks. ANSEL is backward compatible with ASCII (i.e., it uses the same codes for letters, numerals and punctuation).

Unicode. In the late 1980s the need to encode other alphabets (e.g. Arabic, Cyrillic, etc.) and non-alphabetic writing systems (e.g., Chinese, Japanese, Korean, etc.) became urgent. This required a much larger list of characters. Unicode was the agreed-upon answer to this need. It is backward compatible with ASCII in that its first 128 characters have the same numbers as ASCII. How many bytes each character occupies depends on the encoding form. The form that Windows programs usually label simply "Unicode" is UTF-16, which uses two bytes for most characters and four bytes for the rest.

UTF-8. Unicode Transformation Format, 8-bit, is another way of storing and transferring Unicode characters. It uses one byte for each ASCII character and two to four bytes for every other character, so a plain ASCII file is already a valid UTF-8 file.

So, which one should be used when you are making a GEDCOM file for use in a non-PAF genealogical database?

If your PAF database consists solely of English given names, surnames and place names, then the ASCII character set is sufficient. If your PAF database contains European names that have diacritical marks, then the ANSEL character set is adequate. ASCII and ANSEL are the most widely used and the most compact of the character sets.

If you are using a non-Roman alphabet or a non-alphabetic writing system, then Unicode is required. A file saved as "Unicode" (UTF-16) is about twice the size of the same file in ASCII or ANSEL, and it will not work well with all programs and computers. A safer alternative is to use UTF-8. For text that is mostly in the Roman alphabet it is smaller than UTF-16 (for Chinese, Japanese or Korean text it can be somewhat larger), and it is more universal, working both on Windows and on most non-Windows operating systems.
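
For anyone who wants to check the file-size side of this, here is a rough Python sketch that counts how many bytes a name takes in ASCII, UTF-8 and UTF-16. The sample names are made up for illustration, and encoded_sizes() is my own helper, not part of PAF or any GEDCOM tool. ANSEL is left out because Python's standard library has no ANSEL codec.

```python
def encoded_sizes(text: str) -> dict:
    """Return the number of bytes `text` needs in each encoding.

    A value of None means the text cannot be written in that
    encoding at all (for example, an umlaut in plain ASCII).
    """
    sizes = {}
    for name in ("ascii", "utf-8", "utf-16-le"):
        try:
            sizes[name] = len(text.encode(name))
        except UnicodeEncodeError:
            sizes[name] = None
    return sizes


# Hypothetical sample names, written with escapes so the exact
# characters are unambiguous.
samples = [
    "Smith",                                   # English, ASCII only
    "M\u00fcller",                             # German, u with umlaut
    "\u0418\u0432\u0430\u043d\u043e\u0432",    # Russian, Cyrillic
    "\u738b",                                  # Chinese surname
]

for name in samples:
    print(name, encoded_sizes(name))
```

The English name is the same size in ASCII and UTF-8 and twice that in UTF-16. The Cyrillic name is the same size in UTF-8 and UTF-16, and the Chinese character takes three bytes in UTF-8 against two in UTF-16.
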

    06/29/2005 02:08:16