Whereas ordinary word-processing and desktop publishing systems produce rather unstructured documents, SGML (Standard Generalized Markup Language) deals with structured documents. Moreover, SGML was adopted as an ISO standard (ISO 8879) in 1986. Since only the ASCII character set is used, there are no conversion problems between different platforms (e.g., DOS, Unix, VAX). Thus, there is no dependence on a particular company, and SGML documents will remain legible and useful in the future.
It should be emphasized that SGML per se is only a language. For use with real documents (instances), a specific document type definition (DTD) is required. In short, a DTD defines the basic structure of the documents, the tags to be used and the (character) entities.
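The role of a DTD can be illustrated by a small sketch. The following Python fragment is not a real DTD and does not use an SGML parser; it is a toy model under stated assumptions. The element names, the entity table and the helper functions check and expand_entities are all hypothetical, and real content models (ordering, occurrence indicators, tag omission) are ignored.

```python
import re

# Toy stand-in for a DTD: for each declared element, the set of child
# elements it may contain, plus a small character entity table.
# All names are illustrative assumptions, not taken from a real DTD.
TOY_DTD = {
    "elements": {
        "article": {"title", "sect"},
        "title": set(),
        "sect": {"heading", "p"},
        "heading": set(),
        "p": {"em"},
        "em": set(),
    },
    "entities": {"alpha": "\u03b1", "auml": "\u00e4"},
}


def check(node, dtd=TOY_DTD):
    """Return a list of structure errors for a (tag, children) tree.

    Children are either strings (character data, accepted everywhere in
    this simplified model) or nested (tag, children) tuples.
    """
    tag, children = node
    if tag not in dtd["elements"]:
        return [f"undeclared element <{tag}>"]
    allowed = dtd["elements"][tag]
    errors = []
    for child in children:
        if isinstance(child, str):
            continue
        if child[0] not in allowed:
            errors.append(f"<{child[0]}> not allowed inside <{tag}>")
        errors.extend(check(child, dtd))
    return errors


def expand_entities(text, dtd=TOY_DTD):
    """Replace &name; by its declared character; leave unknown names as-is."""
    table = dtd["entities"]
    return re.sub(r"&(\w+);", lambda m: table.get(m.group(1), m.group(0)), text)
```

A document instance that respects the declared structure yields an empty error list, whereas a paragraph placed directly inside the article element is reported as a structure error.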
Here we will focus on SGML systems which are freely available. They are based on sgmls or its successor, nsgmls (SP), by James Clark (source code and binaries for various platforms are available).
First, the qwertz SGML document types by Thomas F. Gordon and the associated (Unix) tools shall be mentioned. The main purpose of the qwertz system is the representation of mathematics and the conversion to LaTeX documents. However, document types for letters, manual pages and bibliographies are also provided, as well as tools for conversion to nroff (hence ISO Latin-1 or ASCII text), grops and man pages.
Whereas conversion to HTML (HyperText Markup Language) is
not covered by the original qwertz system, this is offered by the
Linuxdoc-SGML formatting system
by Matt Welsh and Greg Hankins
which is based on the qwertz system. As the name implies,
the Linuxdoc system was developed to produce documentation
for the Linux operating system (Linux HOWTOs). However, it can be used
as a general-purpose system; the supplied replacement mappings
may have to be edited, particularly with respect to mathematics.
The Linuxdoc system (version 1.5) offers output formats including
HTML, LaTeX (2.09 or LaTeX2e, also DVI and PostScript), LyX and RTF
(Rich Text Format).
The third and last system which shall be mentioned is the general formatter (gf) by Gary Houston with the snafu SGML document type. Although gf is most powerful with SGML input conforming to the snafu DTD, it is a particularly useful tool for converting HTML documents to LaTeX (LaTeX2e), plain text (ASCII or ISO Latin-1) or RTF.
It is fairly easy to write SGML documents directly with any text editor; text editors with macro capabilities are particularly convenient. This task is comparable to writing LaTeX or HTML documents. However, it is important to obey the document structure; SGML syntax checkers are available. WYSIWYG SGML systems are available commercially but shall not be considered here.
It stands to reason that many people prefer a WYSIWYG word processor to an allegedly inconvenient text editor. Also, a vast amount of scientific literature exists already in electronic form, but not as (sufficiently) structured documents. (In the worst case, printed documents might be scanned and converted to ASCII texts by OCR.) Thus, converters capable of generating SGML documents from, say, LaTeX, RTF or HTML source will be helpful. (RTF, the Rich Text Format, is a document interchange format designed by Microsoft.) Since the structure that can be inferred from these sources is most likely insufficient, some manual formatting will usually be required.
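The idea of such a converter, and the reason why manual work remains, can be sketched for HTML input. The following Python fragment renames a few HTML tags to Linuxdoc tags by means of the standard html.parser module. It is a minimal sketch and not one of the converters discussed in this text: the mapping tables, the class HtmlToLinuxdoc and the function html_to_linuxdoc are assumptions made for illustration, and the mapping is far from complete.

```python
from html.parser import HTMLParser

# Illustrative HTML -> Linuxdoc tag mapping (an assumption, not a complete
# table). Heading end tags map to nothing because Linuxdoc headings are
# normally closed implicitly by the <p> that follows them.
START = {"h1": "<sect>", "h2": "<sect1>", "ul": "<itemize>", "ol": "<enum>",
         "li": "<item>", "b": "<bf>", "i": "<em>", "em": "<em>",
         "tt": "<tt>", "p": "<p>"}
END = {"ul": "</itemize>", "ol": "</enum>", "b": "</bf>",
       "i": "</em>", "em": "</em>", "tt": "</tt>"}


class HtmlToLinuxdoc(HTMLParser):
    """Rename known tags, copy text and entities, record unknown tags."""

    def __init__(self):
        # Keep entity references such as &alpha; instead of decoding them.
        super().__init__(convert_charrefs=False)
        self.out = []
        self.unmapped = set()

    def handle_starttag(self, tag, attrs):
        if tag in START:
            self.out.append(START[tag])
        else:
            self.unmapped.add(tag)  # structure that needs manual attention

    def handle_endtag(self, tag):
        self.out.append(END.get(tag, ""))

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")


def html_to_linuxdoc(html):
    """Return the converted text and the sorted list of unmapped HTML tags."""
    parser = HtmlToLinuxdoc()
    parser.feed(html)
    parser.close()
    return "".join(parser.out), sorted(parser.unmapped)
```

The list of unmapped tags returned by html_to_linuxdoc indicates where the source contains markup for which the sketch has no counterpart, which is exactly where manual formatting would be needed.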
Although HTML is based on SGML, it must be kept in mind that it was developed for quick online formatting. Hence current versions of HTML are inferior to full-featured SGML DTDs, and the representation of complex scientific documents in HTML presents problems. Moreover, although the fairly primitive version HTML 2.0 (RFC 1866) is essentially a standard, current extensions (HTML 3.2, Netscape HTML Extensions, W3C HTML Experimental) are not.
Many tools (filters) are available for converting other types of documents such as word processor output to HTML. Only two famous converters shall be mentioned here: the LaTeX2HTML translator (LaTeX to HTML) by Nikos Drakos and rtftohtml, a filter to translate RTF (and hence Microsoft Word documents) to HTML, by Chris Hector.
Current browsers (Netscape Navigator, Mosaic) can convert HTML
documents to text or PostScript. The gf formatter has been
mentioned above. Another tool for converting HTML to PostScript is
html2ps
by Jan Kärrman.
Two (experimental) Perl scripts have been written here, latex2sgml.pl and html2sgml.pl, which may help in converting LaTeX or HTML documents to SGML documents (Linuxdoc DTD). These tools may have to be edited to suit particular needs, and some manual formatting of the resulting SGML document is generally necessary. Another tool which may be helpful in the process of converting documents is charconv. Its purpose is the conversion between different extended character sets (e.g., DOS, Macintosh, ISO Latin 1, HTML, SGML); it does not provide any formatting.
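The kind of character set conversion described above can be sketched in a few lines of Python. This is not charconv itself, and the function name to_entities is an assumption. The sketch relies on two facilities of the Python standard library: the cp437 codec (one common DOS code page) and the table html.entities.codepoint2name, whose names for the Latin-1 range were taken over from the ISO 8879 entity sets.

```python
from html.entities import codepoint2name


def to_entities(text):
    """Replace every non-ASCII character by a named or numeric entity.

    ASCII characters, including markup characters such as < and &, are
    copied unchanged: only the character set is converted, no formatting.
    """
    out = []
    for ch in text:
        code = ord(ch)
        if code < 128:
            out.append(ch)
        elif code in codepoint2name:
            out.append(f"&{codepoint2name[code]};")
        else:
            out.append(f"&#{code};")  # no named entity known: numeric reference
    return "".join(out)


# Example: bytes from a DOS (code page 437) file, where 0x84 encodes "a umlaut".
dos_bytes = b"K\x84rrman"
converted = to_entities(dos_bytes.decode("cp437"))
```

Input from other platforms would be decoded with the corresponding codec before the entity step; the choice of cp437 here is only an example.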
Another approach which is worth mentioning is the MATHS system developed by Richard J. Botting. It provides a simple ASCII representation of formal mathematics (BNF, i.e., Backus-Naur Form, and the like) as well as tools for converting this notation (reusable mathematical information) to HTML. Two Perl scripts have been written here, mth2html and mth2sgml, which convert MATHS documents (allowing for several general-purpose extensions) to HTML and SGML (Linuxdoc DTD).
Figures have to be converted as well. (Encapsulated) PostScript is commonly used for the preparation of printed documents, whereas GIF or JPEG are popular image formats for screen presentations; there are many other formats (e.g., TIFF, PBM/PPM, XBM, RGB, PCX, WMF, BMP), but converters are available (e.g., xv, imagemagick, netpbm/pbmplus, GhostScript, paint shop, wmf2bmp).
For instance, PostScript figures can be converted to GIF images by means of the pstogif script which is part of the LaTeX2HTML package; it makes use of GhostScript and netpbm/pbmplus. The conversion of GIF images to encapsulated PostScript can be achieved by netpbm/pbmplus (e.g., giftopnm followed by pnmtops) or by convert from the imagemagick package.
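The two-step netpbm conversion mentioned above (giftopnm followed by pnmtops) amounts to a pipeline in which the output of the first program is fed to the second. A minimal Python sketch of such a pipeline is given below. The function names gif_to_ps_commands and gif_to_ps are assumptions, no command-line options are passed to either tool, and the sketch presumes that both programs are installed and found on the search path.

```python
import shutil
import subprocess


def gif_to_ps_commands(gif_path):
    """Return the argument lists for the pipeline: giftopnm FILE | pnmtops."""
    return [["giftopnm", gif_path], ["pnmtops"]]


def gif_to_ps(gif_path, ps_path):
    """Convert a GIF file to PostScript by piping giftopnm into pnmtops."""
    first, second = gif_to_ps_commands(gif_path)
    for cmd in (first, second):
        if shutil.which(cmd[0]) is None:
            raise RuntimeError(f"{cmd[0]} not found on the search path")
    with open(ps_path, "wb") as out:
        decode = subprocess.Popen(first, stdout=subprocess.PIPE)
        encode = subprocess.Popen(second, stdin=decode.stdout, stdout=out)
        # Close our copy of the pipe so giftopnm notices if pnmtops exits early.
        decode.stdout.close()
        encode_status = encode.wait()
        decode_status = decode.wait()
    if decode_status != 0 or encode_status != 0:
        raise RuntimeError("conversion failed")
```

A call such as gif_to_ps("figure.gif", "figure.ps") would then write the PostScript file; the file names are placeholders.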
mdl2gif is a convenient tool for converting molecular structures (2D)
given as MDL mol files to GIF images.
The qwertz SGML system (and hence Linuxdoc) offers a fairly
good set of tags and character entities for the representation
of mathematical formulas, although the coverage is not as
comprehensive as that provided by TeX or LaTeX. The snafu SGML system
takes a different approach by using the TeX notation
directly, embedded in <texeqn> tags, but that method is
only useful for generating LaTeX output.
Currently available WWW browsers do not support math mode, with
the exception of the experimental
Arena
browser. (Tools like
LaTeX2HTML convert mathematical formulas to graphics for use
with ordinary browsers.) Mathematical formulas and the character
entities of Greek letters (e.g., &alpha;) have to be embedded
between <math> tags. It should be mentioned that the
tool
html2ps
is able to convert (a subset of) math mode to PostScript. As an
example,
see a page dealing with a few formulas about the
chemical shift.
After the provided replacement mappings have been edited
appropriately, the Linuxdoc system can convert SGML documents
containing math mode to HTML output suitable for the Arena
browser.