Understanding HTML and SGML

HTML is an application conforming to International Standard ISO 8879 -- Standard Generalized Markup Language (SGML). SGML is a system for defining structured document types, and markup languages to represent instances of those document types. The SGML declaration for HTML is given in SGML Declaration for HTML. It is implicit among WWW implementations.

In the event of any apparent conflict between HTML and SGML standards, the SGML standard is definitive.

Every SGML document has three parts:

SGML declaration: Binds SGML processing quantities and syntax token names to specific values. For example, the SGML declaration in the HTML DTD specifies that the string that opens an end tag is </ and the maximum length of a name is 72 characters.
Prologue: Includes one or more document type declarations (DTDs), which specify the element types, element relationships and attributes, and references that can be represented by markup. The HTML 3.0 DTD provides a definitive specification of the allowed syntax for HTML 3.0 documents.
Instance: Contains the data and markup of the document.

HTML refers to the document type as well as the markup language for representing instances of that document type.

Understanding HTML Elements

In HTML documents, tags define the start and end of headings, paragraphs, lists, character highlighting and links. Most HTML elements are identified in a document as a start tag, which gives the element name and attributes, followed by the content, followed by the end tag. Start tags are delimited by < and >, and end tags are delimited by </ and >. For example:

    <H1>This is a Heading</H1>
    <P>This is a paragraph.

Some elements appear as just a start tag. For example, to create a line break, you use <BR>. Additionally, the end tags of some other elements (e.g. P, LI, DT, DD) can be omitted as the position of the end tag is clearly implied by the context.

The content of an element is a sequence of characters and nested elements. Some elements, such as anchors, cannot be nested. Anchors and character highlighting may be put inside other constructs. The content model for an element defines the syntax permitted for its content.

Note: The SGML declaration for HTML specifies SHORTTAG YES, which means that there are other valid syntaxes for tags, such as NET tags, <EM/.../; empty start tags, <>; and empty end tags, </>. Until support for these idioms is widely deployed, their use is strongly discouraged.

Names

The element name immediately follows the tag open delimiter. An element name consists of a letter followed by letters, digits, periods, or hyphens, and may be at most 72 characters long. This limit is set by the NAMELEN parameter in the SGML declaration for HTML 3.0. Names are not case sensitive. For example, H1 is equivalent to h1.
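
The naming rule can be checked mechanically. The following is a minimal sketch in Python, not part of the specification; the function names is_valid_name and same_name are hypothetical, and the sketch assumes that the NAMELEN limit of 72 applies to the whole name, including its first letter.

```python
import re

NAMELEN = 72  # limit taken from the SGML declaration for HTML 3.0

# A letter, then up to NAMELEN - 1 letters, digits, periods, or hyphens.
# Assumption: the 72-character limit counts the leading letter as well.
_NAME_RE = re.compile(r"[A-Za-z][A-Za-z0-9.-]{0,%d}" % (NAMELEN - 1))


def is_valid_name(name: str) -> bool:
    """Return True if `name` fits the element-name syntax described above."""
    return _NAME_RE.fullmatch(name) is not None


def same_name(a: str, b: str) -> bool:
    """Compare two names without regard to case."""
    return a.upper() == b.upper()
```

With these definitions, is_valid_name("H1") holds, is_valid_name("1H") does not, and same_name("H1", "h1") holds.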

Attributes

In a start tag, white space and attributes are allowed between the element name and the closing delimiter. An attribute typically consists of an attribute name, an equal sign, and a value (although some attributes may be just a value). White space is allowed around the equal sign.

The value of the attribute may be either:

A string literal, delimited by single quotes or double quotes
A name token (a sequence of letters, digits, periods, or hyphens)

In this example, A is the element name, HREF is the attribute name, and http://host/dir/file.html is the attribute value:

    <A HREF="http://host/dir/file.html">

Some implementations consider any occurrence of the > character to signal the end of a tag. For compatibility with such implementations, when > appears in an attribute value, you may want to represent it with an entity or numeric character reference, such as:

    <IMG SRC="eq1.ps" alt="a &#62; b">

To put quotes inside of quotes, you can use single quotes if the outer quotes are double or vice versa, as in:

    <IMG SRC="image.ps" alt="First 'real' example">

Alternatively, you can use the entity reference &quot;, as in:

    <IMG SRC="image.ps" alt="First &quot;real&quot; example">

The length of an attribute value (after replacing entity and numeric character references) is limited to 1024 characters. This number is defined by the LITLEN parameter in the SGML declaration for HTML 3.0.

Note: Some implementations allow any character except space or > in a name token. Attribute values must be quoted only if they don't satisfy the syntax for a name token.
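
The quoting rules above can be combined into a small serializer. This is an illustrative sketch in Python rather than a prescribed algorithm; format_attribute is a hypothetical name, and the choice to always use double quotes and to write > as a numeric character reference is an assumption made for compatibility with the implementations mentioned earlier.

```python
import re

LITLEN = 1024  # limit on attribute value length from the SGML declaration

# Name token as described above: letters, digits, periods, or hyphens.
_NAME_TOKEN_RE = re.compile(r"[A-Za-z0-9.-]+")


def format_attribute(name: str, value: str) -> str:
    """Serialize one attribute, quoting the value only when required."""
    # The limit applies to the value after references are replaced,
    # that is, to the raw value handled here.
    if len(value) > LITLEN:
        raise ValueError("attribute value exceeds LITLEN")
    if _NAME_TOKEN_RE.fullmatch(value):
        return f"{name}={value}"
    # Escape & first so the references added below are not re-escaped.
    escaped = (
        value.replace("&", "&amp;")
        .replace(">", "&#62;")
        .replace('"', "&quot;")
    )
    return f'{name}="{escaped}"'
```

For example, format_attribute("ALT", "a > b") yields ALT="a &#62; b", while format_attribute("COMPACT", "compact") yields COMPACT=compact because the value is already a name token.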

Attributes with a declared value of NAME (e.g. ISMAP, COMPACT) may be written using a minimized syntax. The markup:

    <UL COMPACT="compact">

can be written as:

    <UL COMPACT>

Note: Some implementations understand these attributes only when the minimized syntax is used.

Undefined Tag and Attribute Names

It is an accepted networking principle to be conservative in that which one produces, and liberal in that which one accepts. HTML parsers should be liberal except when verifying code. HTML generators should generate strictly conforming HTML. It is suggested that wherever practical, parsers should at least flag the presence of markup errors, as this will help to avoid bad markup being produced inadvertently.

WWW applications that read HTML documents and encounter tag or attribute names they do not understand should behave as follows: for an unknown tag, as though the whole tag had not been there but its content had; for an unknown attribute, as though the attribute had not been present.
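
The recovery rule for unknown names can be sketched with Python's standard html.parser module. The vocabulary in KNOWN_ELEMENTS and the names LenientFilter and drop_unknown_markup are hypothetical and exist only for this illustration; note also that html.parser is a modern, lenient tokenizer and not an SGML parser, so it stands in here only to show the intended behavior.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical vocabulary for the demo: element name -> known attributes.
KNOWN_ELEMENTS = {"p": {"align"}, "em": set(), "a": {"href", "name"}}


class LenientFilter(HTMLParser):
    """Re-emit a document, skipping tags and attributes that are unknown."""

    def __init__(self):
        super().__init__(convert_charrefs=False)
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag not in KNOWN_ELEMENTS:
            return  # drop the tag itself; its content is still emitted
        parts = []
        for name, value in attrs:
            if name not in KNOWN_ELEMENTS[tag]:
                continue  # behave as if the attribute were absent
            if value is None:
                parts.append(f" {name.upper()}")
            else:
                parts.append(f' {name.upper()}="{escape(value)}"')
        self.out.append(f"<{tag.upper()}{''.join(parts)}>")

    def handle_endtag(self, tag):
        if tag in KNOWN_ELEMENTS:
            self.out.append(f"</{tag.upper()}>")

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")

    def handle_comment(self, data):
        self.out.append(f"<!--{data}-->")


def drop_unknown_markup(source: str) -> str:
    parser = LenientFilter()
    parser.feed(source)
    parser.close()
    return "".join(parser.out)
```

Given the input <P>Some <BLINK>loud</BLINK> text, the sketch returns <P>Some loud text: the unknown BLINK tags disappear while their content remains.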

Special Characters

The characters between the tags represent text in the ISO-Latin-1 character set, which is a superset of ASCII. Because certain characters will be interpreted as markup, they should be represented by markup, using entity or numeric character references. For instance, the character "&" must be represented by the entity &amp;. See the Special Characters section of this specification for more information.
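
As an illustration of how text is converted to and from its markup representation, the sketch below uses Python's standard html module. The wrapper names encode_text and decode_text are hypothetical. The html module follows current HTML entity rules, which are assumed here to cover the references used by HTML 3.0.

```python
from html import escape, unescape


def encode_text(text: str) -> str:
    """Replace &, < and > with entity references for use between tags."""
    return escape(text, quote=False)


def decode_text(markup: str) -> str:
    """Replace entity and numeric character references with characters."""
    return unescape(markup)
```

For example, encode_text("AT&T <rocks>") gives AT&amp;T &lt;rocks&gt;, and decode_text("a &#62; b &amp; c") gives a > b & c.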

Comments

To include comments in an HTML document that will be ignored by the parser, surround them with <!-- and -->. After the comment delimiter, all text up to the next occurrence of --> is ignored. Hence comments cannot be nested. White space is allowed between the closing -- and >, but not between the opening <! and --.

For example:

    <HEAD>
    <TITLE>HTML Guide: Recommended Usage</TITLE>
    <!-- Id: Text.html,v 1.6 1994/04/25 17:33:48 connolly Exp -->
    </HEAD>

Note: Some historical implementations incorrectly consider a > sign to terminate a comment.
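
The comment rules above translate into a short pattern. The following Python sketch follows only the simplified description given in this section, not the full SGML comment syntax; strip_comments is a hypothetical name.

```python
import re

# "<!--" with no white space inside the opening delimiter, then any text
# up to the next "--", then optional white space before the closing ">".
_COMMENT_RE = re.compile(r"<!--.*?--\s*>", re.DOTALL)


def strip_comments(source: str) -> str:
    """Remove comments, matching each one up to the first closing delimiter."""
    return _COMMENT_RE.sub("", source)
```

Because the match stops at the first closing delimiter, a > sign inside a comment does not end it, and text such as <! -- not a comment --> is left untouched because white space separates <! from --.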

Formal Variants of HTML 3.0

The HTML 3.0 document type definition includes two flags for controlling how prescriptive or how lax the language is. This makes use of SGML marked sections in the DTD to enable or disable certain features.

HTML.Recommended

Certain features of the language are necessary for compatibility with widespread usage, but they may compromise the structural integrity of a document. The HTML.Recommended entity should be defined as INCLUDE in the DTD subset to enable a more prescriptive version of HTML 3.0 that eliminates the above features. For example:

    <!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 3.0//EN"
    [ <!ENTITY % HTML.Recommended "INCLUDE"> ] >

In particular, this prevents text from appearing except within block elements.

HTML.Deprecated

By default, for backwards compatibility, the %HTML.Deprecated entity is defined as INCLUDE, enabling certain features which are now deprecated. These features can be eliminated by defining this entity as IGNORE in the DTD subset. For example:

    <!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 3.0//EN"
    [ <!ENTITY % HTML.Deprecated "IGNORE"> ] >

Note: Defining %HTML.Recommended as INCLUDE automatically sets %HTML.Deprecated to IGNORE.

In the spirit of being liberal in what you accept and strict in what you generate, HTML user agents are recommended to accept syntax corresponding to the specification with %HTML.Deprecated turned on, while HTML user agents generating HTML are recommended to generate documents that conform to the specification with %HTML.Recommended turned on.