2. Very Loose and Very Basic SGML (Not A GoodBook)
An SGML document instance is a fully parenthesized or strictly
hierarchically structured document. Components called elements are tagged
at the beginning and end. Inside each element is a (perhaps empty) string
of content and elements.
A Document Type Definition (DTD -- the ADT of this language) describes the
syntax of a document by declarations of elements, entities and notations.
Elements and notations may have attributes with limited sets of value
types. Each element is defined by a content model which describes a
production (in effect a finite state automaton; the model must be
unambiguous, so the parser needs no look-ahead) of the allowable patterns
of constituent elements. A valid SGML document instance conforms to the
grammar of a DTD.
- A document type declaration takes the form:
<!DOCTYPE document_type_name optional_external_identifier
[optional_declaration_subset] >
- A notation declaration takes the form:
<!NOTATION notation_name notation_identifier >
- An element type declaration takes the form:
<!ELEMENT name_of_element tag_minimization_indicators
content_model_or_declared_content
content_exceptions
>
- An element declaration may be followed by an attribute definition list
declaration:
<!ATTLIST name_of_associated_element(s)
name_of_attribute allowable_values default
name_of_attribute allowable_values default
etc.
>
An attribute list may be associated with many elements. An element may be
associated with one and only one attribute list.
In a document instance, an element and its attributes look like this:
<MyName myatt="***" my2att="888">content, other markup, etc.</myname>
There are lots and lots of rules that modify that look. Get a GoodBook.
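One of those rules explains why </myname> can close <MyName>: in the reference concrete syntax, element and attribute names are folded to upper case before they are compared. A minimal Python sketch of that idea follows. The function name parse_start_tag is made up for illustration, and the sketch assumes double-quoted attribute values with no ">" inside them, which a real parser does not get to assume.

```python
import re

# name="value" pairs; only double-quoted values are handled in this sketch.
ATTR_RE = re.compile(r'([A-Za-z][A-Za-z0-9.-]*)\s*=\s*"([^"]*)"')

def parse_start_tag(tag):
    """Split a start tag into (NAME, {ATTRIBUTE: value}) with names upper-cased."""
    m = re.fullmatch(r"<([A-Za-z][A-Za-z0-9.-]*)((?:\s+[^>]*)?)>", tag.strip())
    if m is None:
        raise ValueError("not a start tag this sketch understands: %r" % tag)
    name = m.group(1).upper()           # names are case-insensitive
    attrs = {k.upper(): v for k, v in ATTR_RE.findall(m.group(2))}
    return name, attrs                  # attribute values keep their case
```

Called on the start tag shown above, this gives the name MYNAME with attributes MYATT and MY2ATT, and parse_start_tag("<myname>") yields the same element name.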
Element type declarations are modified by data types, keywords and element
content expressions. Here again are loose examples.
2.1 Some data types for declared content
- #PCDATA -- parsed character data
- CDATA -- character data. May not include markup.
- RCDATA -- replaceable character data that may include entity references
- ANY -- may contain any elements defined in the same DTD
- EMPTY -- content is empty. A point tag possibly with attributes. No
end tag allowed.
2.2 Some keywords are
- #REQUIRED -- in the default parameter of an attribute type declaration,
this means that a value must be supplied in the element.
- #IMPLIED -- the attribute is optional and may be supplied by the
application if not supplied in the element.
- #FIXED -- the value for the attribute is specified in the DTD and
cannot be changed. This is useful as it parameterizes types similar to the
way a template does in C++.
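The three keywords above boil down to a small decision table. Here is a loose Python sketch of it. The function resolve_attribute and its parameters are hypothetical, and None is used to stand for "nothing was written in the start tag"; a real parser tracks a good deal more than this.

```python
def resolve_attribute(keyword, supplied=None, fixed_value=None):
    """Return the value handed to the application, or raise ValueError."""
    if keyword == "#REQUIRED":
        if supplied is None:
            raise ValueError("#REQUIRED attribute has no value in the start tag")
        return supplied
    if keyword == "#IMPLIED":
        # None means "no value given; the application may choose one".
        return supplied
    if keyword == "#FIXED":
        # Restating the fixed value is harmless; changing it is an error.
        if supplied is not None and supplied != fixed_value:
            raise ValueError("#FIXED attribute cannot be changed in the instance")
        return fixed_value
    raise ValueError("unknown default keyword: " + keyword)
```

So resolve_attribute("#IMPLIED") returns None and leaves the decision to the application, while resolve_attribute("#REQUIRED") with no value raises an error.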
2.3 Element content expressions are
Occurrence Indicators -- modify group or individual element. The absence
of one of these specifies a default in which the element or group must
occur once and only once.
- ? -- optional (0 or 1)
- + -- required and repeatable (1 or more)
- * -- optional and repeatable (0 or more)
Ordering connectors -- used in a group to specify order. Only one type
may occur in a group.
- , (comma) -- strict sequence. Members must occur in the order listed.
- & (ampersand) -- all members must occur, but in any order.
- | (vertical bar) -- exactly one member occurs each time the group is
evaluated. Also sometimes informally called an or indicator.
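To see how the indicators and connectors combine, here is a rough Python sketch that turns a simplified content model into a regular expression over child element names and tests a list of children against it. The names model_to_regex and matches are invented for this illustration. It only understands groups built with , and | plus the ? + * indicators; it refuses & groups and knows nothing about #PCDATA, exceptions or omitted tags, so it is a toy and not how a real SGML parser is built.

```python
import re

def model_to_regex(model):
    """Translate a simplified content model into a regex over 'name ' tokens."""
    if "&" in model:
        raise ValueError("the & connector is not handled by this sketch")
    out = []
    for tok in re.findall(r"[A-Za-z][A-Za-z0-9.-]*|[(),|?+*]", model):
        if tok == "(":
            out.append("(?:")           # SGML group -> non-capturing group
        elif tok == ",":
            continue                    # sequence is plain concatenation
        elif tok in ")|?+*":
            out.append(tok)             # same meaning in regex syntax
        else:
            out.append("(?:" + re.escape(tok) + " )")
    return "".join(out)

def matches(model, children):
    """True if the list of child element names satisfies the content model."""
    text = "".join(name + " " for name in children)
    return re.fullmatch(model_to_regex(model), text) is not None
```

For example, matches("(title, para+)", ["title", "para", "para"]) holds, while matches("(title, para+)", ["para"]) does not, because title has no occurrence indicator and so must appear exactly once.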
2.4 Others
- ( ) (parentheses) -- used to specify a group (just like mathematics)
- = (equals) -- is defined as
- -- (double hyphen no space) -- comment start or end. Must occur within
<! >.
2.5 Minimization: Rules for omitting or including begin and end tags
- - - (two hyphens separated by a space) -- both tags required.
- O O (upper or lowercase O (oh!), twice) -- both tags may be omitted.
- - O (hyphen and upper or lowercase O (OH!)) -- the end tag may be omitted.
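As a sketch of where the minimization indicators sit in a declaration, the Python below pulls a simplified element type declaration apart with a regular expression. ELEMENT_RE and parse_element are hypothetical names, and the pattern assumes both indicators are present and that any exceptions are simple parenthesized groups. Real declarations allow far more variation than this handles.

```python
import re

# Simplified shape: name, two minimization flags, content, optional exceptions.
ELEMENT_RE = re.compile(
    r"<!ELEMENT\s+(?P<name>\S+)\s+"
    r"(?P<start>[-Oo])\s+(?P<end>[-Oo])\s+"
    r"(?P<content>\(.*?\)[?+*]?|[A-Za-z]+)"
    r"(?P<exceptions>(?:\s+[-+]\([^)]*\))*)"
    r"\s*>",
    re.DOTALL,
)

def parse_element(decl):
    """Return the fields of a (simplified) element type declaration."""
    m = ELEMENT_RE.fullmatch(decl.strip())
    if m is None:
        raise ValueError("not a declaration this sketch understands: %r" % decl)
    return {
        "name": m.group("name"),
        "start_tag_omissible": m.group("start") in "Oo",   # first indicator
        "end_tag_omissible": m.group("end") in "Oo",       # second indicator
        "content": m.group("content"),
        "exceptions": re.findall(r"[-+]\([^)]*\)", m.group("exceptions")),
    }
```

Given "<!ELEMENT title - O (#PCDATA) >", the result says the start tag is required and the end tag may be left out.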
2.6 Here is a brain-dead example of a DTD and document instance
<!SGML -- declaration goes here. You know its name. Look up the number. -- >
<!DOCTYPE badbook PUBLIC "-//Not a GoodBook//DTD BadBook//EN"
[
<!ELEMENT badbook - o (front, section+, appendix+) +(footnote) >
<!ATTLIST badbook
id ID #REQUIRED
ISBN CDATA #IMPLIED >
<!ELEMENT section - - (section | paragraph)* >
<!ATTLIST section
id ID #REQUIRED >
<!ELEMENT paragraph - o (#PCDATA) -(footnote) >
]>
<badbook id="mybook.doc" ISBN="09283333">
<section id="firstsection">
<section id="firstfirst" >
<paragraph>Oh I wish I was an Oscar Mayer Wiener!
</section>
</section>
</badbook>
2.7 Note some points about the badbook
- The PUBLIC keyword in the DOCTYPE declaration suggests that this DTD has
been formally registered with a formal body like ISO. In this case,
it's a lie but you won't know that unless you notice the hyphen that
begins the public identifier. That means the owner is unregistered. A +
would indicate a registered owner. You still won't have a way to verify
it unless a published list is available somewhere. Therefore, this is
really a declaration of good faith but the parser has no way of knowing
if you are a good person nor does it have an open-book-and-check-it-out
instruction... at least not yet. |-)
- The EN at the end of the public identifier is a language code. It says
that the public text being identified (here, the DTD) is written in
English. Yes, there are DTDs and declarations for lots of languages.
I've seen one in Kanji but I couldn't swear it was right because I can't
read Kanji. So this little flag could potentially save you some
trouble.
- The ISBN attribute for badbook is declared IMPLIED but a value was
included in the instance anyway. That's OK.
- No <front> element occurs in the document instance. Since it has no
occurrence indicator, the default is that it must occur once. This is a
boo-boo and the parser should howl at you and may refuse to go an inch
further until you fix it. Oops! Where is that Appendix? HOWWWWWL!
- +(footnote) -- this is an inclusion declaration. The footnote element
type will be allowed anywhere inside badbook: in its own content and, at
any depth, in the content of the elements nested within it. This is a
power tool. Use it wisely or not at all.
- -(footnote) -- this is an exclusion declaration. This stops the rippling
effect of the inclusion declared earlier. Therefore, the paragraph
cannot contain a footnote by this rule. If you find yourself doing this
a lot, you have used an inclusion unwisely or have a renegade content
model. It's not illegal but raises the complexity of the software that
has to deal with it significantly. Let the user beware.
- footnote -- this element type is never declared in the DTD.
HOooowwwwlllll!.
- The section declaration defines a recursive section. This is not only
legal but very useful.
- </badbook> -- Although the minimization indicator for badbook allowed
one not to have a badbook endtag, including one is OK.
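The way the inclusion on badbook ripples down and the exclusion on paragraph stops it can be sketched in a few lines of Python. The EXCEPTIONS table is a hand-built, hypothetical summary of the exceptions in the example DTD, and effective_inclusions is an invented helper. It only answers which names are let in by inclusion at a given point, with an exclusion winning over an inclusion; it does not validate anything else.

```python
# Hand-built view of the exceptions declared in the badbook example.
EXCEPTIONS = {
    "badbook":   {"include": {"footnote"}, "exclude": set()},
    "section":   {"include": set(),        "exclude": set()},
    "paragraph": {"include": set(),        "exclude": {"footnote"}},
}

def effective_inclusions(open_elements):
    """Names allowed by inclusion inside the innermost open element.

    open_elements lists the currently open elements, outermost first.
    """
    included, excluded = set(), set()
    for name in open_elements:
        # Both kinds of exception apply to the element and all descendants.
        included |= EXCEPTIONS[name]["include"]
        excluded |= EXCEPTIONS[name]["exclude"]
    return included - excluded          # an exclusion overrides an inclusion
```

With this table, effective_inclusions(["badbook", "section", "section"]) still contains footnote, but effective_inclusions(["badbook", "section", "paragraph"]) is empty.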
The actual characters used for these declarations and indicators are
specified in the SGML Declaration file. The SGML Declaration precedes the
DOCTYPE declaration and essentially declares the character sets used for
the syntax of SGML, features supported, capacity requirements for a system
that will process the SGML document, etc. Study this in a GoodBook as this
part instructs the parser about the nitty-gritty of how the DOCTYPE to
follow is to be interpreted.
2.8 Entities
Entities and entity management are the heart of SGML organization,
especially for modular structuring. Two types of entities can be declared:
- 1. General entity -- used as an include by reference to insert text
and/or markup into a document instance. Can be used to store such things
as boilerplate text in an external file. Takes the declaration form:
<!ENTITY entityname "stuff to be stuffed into somewhere">
Used in the document instance, this form will include the stuff:
&entityname;
- 2. Parameter entity -- only used to include markup within SGML markup
declarations (for example, to modularize a DTD itself). Takes the
declaration form:
<!ENTITY % entityname "declaration to be stuffed into markup" >
Used in a declaration, this form will include the stuff:
%entityname;
There are sets of keywords that can be in an entity declaration that modify
its meaning. Look them up in a GoodBook.
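To make the include-by-reference idea concrete, here is a small Python sketch of general entity expansion. The function expand_entities and the dictionary of declared entities are made up for illustration. It handles only internal text entities written as &name; and uses a crude depth limit to catch an entity that refers to itself, which is an assumption of this sketch and not a rule quoted from the standard.

```python
import re

def expand_entities(text, entities, _depth=0):
    """Replace each &name; reference with its declared replacement text."""
    if _depth > 10:
        raise ValueError("entity references nested too deeply (possible loop)")

    def replace(match):
        name = match.group(1)
        if name not in entities:
            raise KeyError("undeclared entity: " + name)
        # Replacement text may itself contain references, so expand it too.
        return expand_entities(entities[name], entities, _depth + 1)

    return re.sub(r"&([A-Za-z][A-Za-z0-9.-]*);", replace, text)
```

With entities = {"entityname": "stuff to be stuffed into somewhere"}, the call expand_entities("Here: &entityname;", entities) returns the text with the stuff stuffed in. Parameter entity references could be sketched the same way with % in place of &.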