gf User's Manual

17 July 1996

Gary Houston


Running gf

gf is an SGML system which converts SGML documents conforming to several DTDs into a number of output formats. It can act as a formatter by converting the SGML markup into a language supported by a typesetting system. The LaTeX converter gives the most complete results: others will omit facilities such as the inclusion of PostScript graphics.

The command

gf foo.sgml

will convert a document `foo.sgml' into the default output format, typically LaTeX. Other output formats can be specified using the `-f' option, e.g., `gf -frtf foo'. The formats currently supported are `latex2e', `rtf', `texinfo' and a number of plain text formats:

gf will automatically determine which DTD the document is supposed to conform to (see section Document Types for the list of supported DTDs). Not every combination of DTD and output format is supported.

gf uses the nsgmls program to parse the SGML document, but this is transparent unless an error occurs during parsing. An error indicates that a document does not conform to the model specified by its document type declaration.

As an aid to the user, gf has an option for the creation of new documents. For example, to create a document named "foo.sgml" type:

gf -n dtd foo.sgml

where dtd can be general, html, spaper, smemo or sletter. See section Templates for details on how the "templates" can be modified.

A number of additional options are available, for example

gf -o foo.tex foo.sgml

will write the output to the file "foo.tex", or even

gf -o xxx.tex foo.sgml

which makes it easy to erase all the garbage later. Cleaning up garbage can occasionally be important, for example when formatting with different TeX macro packages: they often use incompatible auxiliary file formats.

See the gf man page for the complete list of options. See section Style Sheets for details on how the formatting process can be controlled through the use of style sheets.

The complexity of a typical DTD makes development of reliable SGML software difficult. There is always the possibility that some untested combination of elements and styles will cause gf to fail. In general, if a document parses correctly but gf generates incorrect output code (e.g., illegal LaTeX) then this should be considered a bug (see section Bug Reports).

Document Types

This section briefly describes each of the DTDs supported by gf.

general DTD

The "general" DTD was one of the earliest document types. It was originally published by the ISO in Annex E of the 1986 SGML standard as a demonstration of a general purpose DTD for books or articles. The date of publication (not surprisingly) appears to have preceded the development of reliable SGML parsers, since the document type declaration contains a minor error. gf includes a corrected version, with a superfluous "address" element removed from the "titlep" element.

gf can convert documents which use this DTD into LaTeX. Unfortunately I don't know of any freely available documentation for the DTD. A description is rumoured to have been published in ISO TR 9573, "Techniques for using SGML". Some idea of the intent of the DTD can be guessed from the HTML and Snafu documentation.

There are deficiencies (or bugs) in the conversion. The ones that I know of are:

Others probably exist.

HTML DTD

The HTML DTD is used in the World Wide Web project, which makes documents available across the Internet for interactive browsing.

gf can convert documents which use HTML-2.0, the current standard version (dated 1995-09-22, apparently frozen) and HTML-3.2, the most recent version released by the W3 organisation (dated 1996-06-26 and still subject to modification). HTML documents can be converted to LaTeX, RTF and the plain output formats. For current information on HTML see `<URL:http://www.w3.org/pub/WWW/MarkUp/>'.

Documents which use this DTD can start with a number of declarations, e.g.,

<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0 Level 1//EN">
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0 Strict Level 1//EN">

The 3.2 DTD doesn't currently support as many options and is specified with

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Draft//EN">

or

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2//EN">

Existing HTML documents often lack a DOCTYPE declaration: provided they begin with an `<html>' tag, gf will "impute" the non-strict level 2 declaration and print a warning message. If neither the DOCTYPE declaration nor the `<html>' tag is present, then the document will not be recognised as HTML and a large number of error messages will be generated.
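The imputation rule can be pictured with a short sketch. This is not gf's own code: the name `impute_doctype' is hypothetical, and the sketch assumes that the "non-strict level 2 declaration" is the first of the HTML 2.0 declarations listed above.

```python
import re

# Assumed to be the "non-strict level 2" declaration; gf's own choice may differ.
DEFAULT_DOCTYPE = '<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">'

def impute_doctype(text):
    """Return (document, warning); warning is None when nothing was imputed."""
    head = text.lstrip()
    if re.match(r"<!DOCTYPE\b", head, re.IGNORECASE):
        # A declaration is already present: leave the document alone.
        return text, None
    if re.match(r"<html\b", head, re.IGNORECASE):
        # No declaration, but the document starts with <html>: impute one.
        warning = "no DOCTYPE declaration: assuming HTML 2.0"
        return DEFAULT_DOCTYPE + "\n" + text, warning
    # Neither marker is present, so the document is not treated as HTML.
    raise ValueError("not recognised as HTML: no DOCTYPE declaration or <html> tag")
```

A real implementation would also have to cope with comments or white space ahead of the first tag, which this sketch handles only partially.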

HTML uses a modified concrete syntax which allows direct entry of 8-bit Latin-1 characters. The output from gf will be 7-bit LaTeX, ASCII or RTF, or 8-bit Latin-1.

Some elements are marked as deprecated, but can still be used when the non-strict declaration is used. gf attempts to support them.

Inline images can be included into LaTeX output. This requires the construction of a "links" file which provides information on each image which is not present in an HTML document, but available using the HTTP protocol. The links file should be named after the basename of the HTML document and given an extension of `.links', e.g., for `foo.html' a `foo.links' would be created. This file should have one line for each inline image with the URL as it appears in the HTML document, the image type (GIF, JPEG or EPS) and the local file where the image can be found. For example:

http://nowhere.edu/some/junk/image.jpg  JPEG  image.jpg
http://nowhere.edu/some/other/junk/image.jpg  JPEG  ../other/image.jpg
file:/localjunk/image.jpg  JPEG  /localjunk/image.jpg

The local file is located relative to the links file or as an absolute path if it starts with a "/".
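As an illustration of the file layout, the following sketch reads a links file into a table. It is only a guess at how such a file might be parsed: the function name `read_links' is hypothetical, and it assumes that the three fields are separated by white space and contain none themselves.

```python
import os

def read_links(links_path):
    """Map each image URL to a pair (image type, local file name)."""
    base_dir = os.path.dirname(os.path.abspath(links_path))
    images = {}
    with open(links_path) as handle:
        for line in handle:
            fields = line.split()
            if not fields:
                continue  # skip blank lines
            # Assumed layout: URL, image type, local file (white space separated).
            url, image_type, local = fields
            if not local.startswith("/"):
                # Relative names are resolved against the links file's directory.
                local = os.path.normpath(os.path.join(base_dir, local))
            images[url] = (image_type, local)
    return images
```

With the example above stored as `foo.links', the URL `file:/localjunk/image.jpg' would map to the absolute path `/localjunk/image.jpg', while the first entry would map to `image.jpg' in the directory holding `foo.links'.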

gf will run a program to convert images to encapsulated PostScript if necessary so that they can be read by TeX and the dvi processors. The converted images will be given an extension of ".ps". The programs are specified by entries in the style sheet, with the default being the "convert" program from ImageMagick (compiled with JPEG support). A recent version of the PBM utilities would probably also work, if JPEG support was available, using commands like:

giftoppm $infile | pnmtops -noturn > $outfile
jpgtoppm $infile | pnmtops -noturn > $outfile

$infile and $outfile are set by gf before the conversion program is run.
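How the two variables are filled in is not described here. One possibility, shown purely as a sketch, is plain textual substitution into the command taken from the style sheet; gf may equally well pass the values through the environment. The function name `build_command' is hypothetical.

```python
import shlex
from string import Template

def build_command(template, infile, outfile):
    """Substitute quoted file names for $infile and $outfile in a command line."""
    return Template(template).substitute(
        infile=shlex.quote(infile),    # quoting guards against spaces in names
        outfile=shlex.quote(outfile),
    )

command = build_command(
    "giftoppm $infile | pnmtops -noturn > $outfile", "image.gif", "image.ps"
)
```

The resulting string could then be handed to the shell. The file names used here are examples only.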

It is also possible to create the encapsulated PostScript version of the images by hand before running gf.

There are currently limitations and bugs in the HTML support in gf, including:

In addition the support for HTML 3.2 tables is limited to LaTeX output, with the following restrictions:

Other 3.2 features not supported include:

Snafu DTD

I created this DTD for my own use and experimentation. Three types of document are supported: letters, memos and simple technical papers. Several versions have been created: gf should be able to process all versions from 2 onwards. Documents conforming to the memo and letter variants can be converted to LaTeX, ASCII and RTF and technical papers can also be converted to Texinfo.

See the article "The Snafu SGML Document Types" for more information.

SGML Facilities

This section describes a few constructions which may be useful in SGML documents. Note that not all SGML features are described here: consult the standard for a complete discussion. The markup characters are those specified by the Reference Concrete Syntax.

Comments

SGML comments start and end with the symbol `--'. When comments are used within the body of a document they must also be enclosed in the markup construction `<!...>'. An example with two comments is:

<!-- This is the text of the first comment.  --
     -- This is the text of the second comment -- >

At the start of the first comment, space is not permitted between the `<!' and the `--'.

The construction `<!>' is also ignored and can be used for various purposes.

Minimalisation Techniques

The full markup of a non-empty SGML element is of the form:

<ELEMENT>element content</ELEMENT>

It will often be possible to omit the terminating tag: this depends on how the element is defined in the DTD. Sometimes the initial tag can also be omitted.

When the terminating tag is required, there are shorter forms which can be used:

<ELEMENT/element content/
<ELEMENT>element content</>

The symbol `</>' closes the most recently opened element. Beware that it may cause confusion when the element contains hidden sub-elements, since it will be the sub-element which is closed.
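The behaviour of `</>' can be modelled with a stack of open elements. The sketch below rewrites each `</>' as an explicit end tag; the name `expand_empty_end_tags' is hypothetical, and the code deliberately knows nothing about the DTD, attributes or omitted tags.

```python
import re

# Matches <name>, </name>, <> and </> (tags with attributes are not handled).
TAG = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9.-]*)?>")

def expand_empty_end_tags(text):
    """Replace each '</>' with the end tag of the most recently opened element."""
    open_elements = []

    def replace(match):
        closing, name = match.group(1), match.group(2)
        if not closing:
            if name:
                open_elements.append(name)
            return match.group(0)  # start tags (and '<>') are left unchanged
        if name:
            # Explicit end tag: discard everything opened since that element.
            while open_elements and open_elements.pop() != name:
                pass
            return match.group(0)
        if not open_elements:
            raise ValueError("'</>' with no open element")
        return "</" + open_elements.pop() + ">"

    return TAG.sub(replace, text)
```

Because the sketch sees only the tags actually typed, it cannot account for sub-elements whose start tags were omitted, which is exactly the situation in which `</>' becomes confusing.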

Likewise, there is a construction `<>' which reopens the last opened element. This can be just as confusing as `</>'.

When a number of tags appear in sequence the closing delimiters can be omitted, e.g., `<li<it>' is equivalent to `<li><it>'. A similar case is `</ul<ul>'.

A more drastic minimalisation technique is the short reference, several of which will often be defined by a DTD. They allow a single character to be used to represent an entity of some kind. Common examples are the use of " as an abbreviation for the beginning or end of a short quotation and a blank line to begin a new paragraph.

Short references can be context sensitive, by being activated only within a subset of the document. An example would be a vertical bar which separates horizontal cells in a table but retains its normal ASCII code in other parts of the document.

Entity Declarations

There are several ways in which declaring entities within a document can be useful. Entities must be declared in the document type declaration at the beginning of the document.

The declaration associates a name with an object of some kind. The default "concrete syntax" SGML rules for an entity name are restrictive: it may contain no more than eight characters, it must start with a letter, and the remaining characters may only be letters, digits, hyphens and full stops. Entity names are case sensitive. However, many SGML applications loosen the restriction on name length through the use of a modified concrete syntax.
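The naming rules of the default concrete syntax can be written as a single pattern. The following is a sketch with a hypothetical function name, `is_valid_entity_name'; note that the reference concrete syntax also admits a full stop as a name character, so the pattern allows it.

```python
import re

# A letter, then at most seven further name characters:
# letters, digits, hyphens or full stops (eight characters in total).
NAME = re.compile(r"[A-Za-z][A-Za-z0-9.-]{0,7}")

def is_valid_entity_name(name):
    """Check a name against the default eight-character naming rules."""
    return NAME.fullmatch(name) is not None
```

Under these rules `dogfig' and `response' are acceptable, while `1dog' (starts with a digit) and `verylongname' (more than eight characters) are not. A DTD with a modified concrete syntax would need a larger length limit.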

An example of an external entity declaration is:

<!DOCTYPE spaper PUBLIC "-//Houston//DTD snafu5//EN"[  
<!ENTITY dogfig SYSTEM "dog.ps" NDATA EPSF>
]>

This declares an external file with notation EPSF, to be known within the SGML document as "dogfig".

Replacement Text

The simplest use of an entity is to provide replacement text. This is useful for text which is repeated several times in the document or which is subject to frequent change. To declare such an entity, use a line like:

<!ENTITY response "<hp1>Are you sure that you want to 
do this &rarr; </hp1>">

The entity can then be referred to within the document with the entity reference: `&response;', which will generate:

Are you sure that you want to do this ->

External Entities

Text can also be brought into the document from external files. When the entity is declared, the content type can also be specified. By default, the entity contains normal marked-up SGML text and is declared using a line such as:

<!ENTITY junk SYSTEM "junk.sgml">

It is also possible to use the "notation" feature of SGML to incorporate external text which is in some non-SGML format. For example, if a notation ASCII is defined (probably by the DTD), then markup such as:

<!ENTITY realjunk SYSTEM "junk.c" NDATA ASCII>

could be used to include a program source file into the document.

Marked Sections

Text within marked sections will be treated in a non-standard way by the SGML parser. The text is enclosed within the construction:

<![ KEYWORD [marked text]]>

where `KEYWORD' can be:

CDATA
    the marked text will be treated as "character data", which means that potential SGML element tags or entity references will be ignored. This is useful when including verbatim program code (or SGML markup) in a document: only the sequence `]]>' can give problems.
RCDATA
    treat as "replaceable character data". Element tags will be ignored but entity references will be resolved as normal.
IGNORE
    treat the marked text as a comment
INCLUDE
    treat the marked text as ordinary text
TEMP
    treat the marked text as ordinary text (which will presumably be removed later).
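When verbatim text is inserted into a document by a program, the CDATA case can be handled with a small helper. This is a sketch only; `cdata_section' is a hypothetical name, and the helper simply refuses text containing the one problematic sequence rather than trying to split it.

```python
def cdata_section(text):
    """Wrap text in a CDATA marked section, rejecting text that would end it early."""
    if "]]>" in text:
        # The parser would treat this sequence as the end of the marked section.
        raise ValueError("text contains ']]>' and cannot go in one CDATA marked section")
    return "<![ CDATA [" + text + "]]>"
```

For example, the C fragment `if (a < b && c) x;' can be wrapped as it stands, since neither `<' nor `&' is recognised inside the section.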

Special Characters

Character References

A character can be generated by typing a "character reference" with its decimal value, e.g., with the default character set `&#38;' will produce & and `&#60;' will produce <.
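The correspondence between a character and its reference is simple arithmetic on the character code, as the sketch below shows. The function names are hypothetical, and Python's code points are assumed to agree with the document character set, which holds for ASCII and Latin 1.

```python
import re

def char_ref(char):
    """Return the decimal character reference for a single character."""
    return "&#%d;" % ord(char)

def resolve_char_refs(text):
    """Replace each decimal character reference with the character it names."""
    return re.sub(r"&#([0-9]+);", lambda m: chr(int(m.group(1))), text)
```

Here `char_ref("&")' gives `&#38;' and `resolve_char_refs("a &#60; b")' gives `a < b'. Named references such as `&amp;' are not handled by this sketch.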

Character Entity Sets

Character entity sets can be used to obtain special symbols within an SGML document. gf supports 6 entity sets:

  1. ISO Added Latin 1
  2. ISO Added Latin 2
  3. ISO Diacritical Marks
  4. ISO Numeric and Special Graphic
  5. ISO Publishing
  6. ISO General Technical

See Figure 1 to Figure 6 in section Character Entity Sets for tables showing the characters available. This is not the complete set of ISO character entity sets: others include Greek letters and AMS math symbols. Note that the conversion to LaTeX does not support all of the symbols in the supported sets.

(figure 1 omitted: ISO Added Latin 1 Character Entity Set)

(figure 2 omitted: ISO Added Latin 2 Character Entity Set)

(figure 3 omitted: ISO Diacritical Marks Character Entity Set. The marks can be applied to any character, in this case "a". Note: they are applied to the immediately preceding character.)

(figure 4 omitted: ISO Publishing Character Entity Set)

(figure 5 omitted: ISO Numeric and Special Graphic Character Entity Set. "nbsp" is a space where a line break is forbidden. "shy" is a soft hyphen, which allows a linebreak with the insertion of a hyphen.)

(figure 6 omitted: ISO General Technical Character Entity Set. Math operators are classed as relations, binary operators, etc., and include implicit spacing.)

A symbol can be used by entering the appropriate entity reference, e.g., `&mdash;' will generate "---" when the document is formatted.

It is possible to modify the generated characters or define new characters if this is permitted for a given DTD: see section Control of Character Entity Translation.

Customisation

There are three possibilities for user customisation: modification of the gf style sheets, modification of the templates used when creating new documents, and modification or supplementation of the character entity translation tables.

Style Sheets

A gf style sheet is a specialised SGML document containing information for the control of the formatting process for one or more of the supported output formats. It is possible to format an SGML document in many different ways, for example by changing paper size, fonts or heading styles. DTDs intended for pure "content-based" markup will attempt to avoid direct specification for this type of information within the source SGML documents.

By default gf will use a system-wide style sheet (stored in some directory such as /usr/local/share/gf) which will be named after the document DTD, e.g., general.style (use `gf -v' to check the file names and use environment variables as described in the gf man page to change them at run-time). If you want to modify the style sheet, you will need to make a local copy of it. There are two ways to do this: if you want the modified styles to apply to all of your documents, then copy the style sheet into your `.gf' directory. If you do not have such a directory then you will need to create it.

mkdir ~/.gf
cp /usr/local/share/gf/general.style  ~/.gf

Alternatively, if you only want the modified style to apply to a single document, you can place it in the same directory as the document and give it the same base name, e.g.,

cp /usr/local/share/gf/general.style  foo.style

A gf style sheet begins with an SGML declaration redefining two default values: NAMELEN is increased to 34 and NAMECASE GENERAL is set to NO.

The body of the style sheet is broken into sections, each containing formatting information for a class of outputs. The classes are:

Each section has a number of style options which are relevant for that class of output formats.

gf will always attempt to read the style sheet from three locations, which by default are: the system directory, the `~/.gf' directory, and the directory containing the file. Each time a style sheet is found, its contents are merged into a single database, overwriting data already present. Therefore if some of the style options are deleted from a local style sheet, or if a new version of gf adds extra options, then the system versions will be used.
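The merging behaviour can be sketched as follows. This is an illustration, not gf's implementation: the function names are hypothetical, the style sheets are represented as already-parsed dictionaries, and option names such as `papersize' are invented for the example.

```python
import os

def style_sheet_candidates(document, dtd, system_dir="/usr/local/share/gf"):
    """List the three default style sheet locations, lowest priority first."""
    doc_dir, doc_file = os.path.split(os.path.abspath(document))
    base = os.path.splitext(doc_file)[0]
    return [
        os.path.join(system_dir, dtd + ".style"),                    # system-wide
        os.path.join(os.path.expanduser("~/.gf"), dtd + ".style"),   # per user
        os.path.join(doc_dir, base + ".style"),                      # per document
    ]

def merge_styles(sheets):
    """Merge option dictionaries in order; later sheets overwrite earlier ones."""
    merged = {}
    for options in sheets:
        merged.update(options)
    return merged

system = {"papersize": "a4", "fontsize": "10pt"}   # hypothetical options
local = {"fontsize": "11pt"}
styles = merge_styles([system, {}, local])
```

After the merge `fontsize' comes from the local sheet, while `papersize', which the local sheet does not mention, keeps its system value.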

Documentation for the various style options can be obtained by running `gf --style-help' with the name of the document to be processed. This will print the options available for the particular DTD and output format. E.g.,

gf --style-help -fab gf.sgml

will print out the style options for the "plain" output formats when processing Snafu documents.

Templates

The gf `-n' option can be used to simplify the creation of new SGML documents. It works by copying a "template" into the current directory. Two locations are searched for the template:

The name of the template file is constructed by appending a standard extension (normally `.template') to the DTD type, e.g., `smemo.template'.

Template files can therefore be customised by copying the system templates into the local `~/.gf' directory and modifying as required.

Control of Character Entity Translation

gf uses simple mapping files for the translation of character entity references in the source SGML document into the output format. The default versions of the mapping files are usually kept in the directory `/usr/local/share/gf' and are named after the character entity set and the output format (e.g., `ISOlat1.2tex').

Each file contains a line for each character in the entity set, with the entity name (the SDATA replacement text), a tab character and the replacement text.

Before reading the system file, gf will check for a file of the same name in the `~/.gf' directory. If this exists, it will be read instead of the system file. This makes it possible to redefine the translations.

In addition to the mapping files above, gf will read any mapping file attached to the current document. This can be used either to override translations in the system files or to define mappings for characters defined in the DTD. For example, if processing the document `foo.sgml' which has the following line in the document type declaration:

<!ENTITY BleECh SDATA "[BleECh]">

then the following `foo.2ab' file can be used to specify the ASCII mapping:

[BleECh]        BleECh

To check which mapping files are read when the document is processed, use the `-v' option.
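The mapping file format and the override order can be sketched in a few lines. The function names are hypothetical, the `[mdash ]' entry is included only as an illustration of a system mapping, and the sketch assumes that exactly one tab separates the two fields.

```python
def read_mapping(lines):
    """Parse mapping lines of the form: SDATA text, a tab, replacement text."""
    mapping = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        sdata, _, replacement = line.partition("\t")
        mapping[sdata] = replacement
    return mapping

def translate(sdata, *mappings):
    """Look up SDATA text; mappings given later take priority over earlier ones."""
    for mapping in reversed(mappings):
        if sdata in mapping:
            return mapping[sdata]
    raise KeyError(sdata)

system = read_mapping(["[mdash ]\t---\n"])        # illustrative system entry
document = read_mapping(["[BleECh]\tBleECh\n"])   # from a file such as foo.2ab
```

Calling `translate("[BleECh]", system, document)' returns `BleECh', and an entry for `[mdash ]' in the document's own mapping file would take precedence over the system one.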

Bug Reports

If you find a bug please send a description and short example to `ghouston@actrix.gen.nz'.


This document was generated on 22 July 1996 using the texi2html translator version 1.50.