<!DOCTYPE
Contents
- DOCTYPE Syntax
- Deconstruction
- Acronyms Explained
- Declaration in Plain English
- The Code
- Let's Rock [HTML]
- And Roll [XHTML]
- Content Negotiation
The DTD Declaration
Before I get into the code, I’ll start by describing the syntax of a DTD declaration. Trust me, it helps to deconstruct the thing before you try gluing it back together. Mental parsing if you like. Recall that my doctype() function will automatically generate any of the XHTML 1.1/1.0 or HTML 4.01 DOCTYPEs.
Okay, if you’re really starved for some results have a look at this demo, which will come in handy as we proceed.
DOCTYPE Syntax
You may on occasion see a DTD referred to as a markup declaration. See also naming rules for more details. I’ve also had feedback from a few users who were a little confused by some of the terminology in this and other articles in this series. So, to clarify: a document type declaration is what I’m automating here, the code that goes at the top of your pages before the rest of the markup. A document type definition (DTD) is the grammar, the file that the declaration points at.
Deconstruction
Note: Why is the root element HTML in the first three cases and html in the last four? Because HTML isn’t case-sensitive while XHTML, which is an XML application language, is. XML is case-sensitive, and in XHTML all element names must be in lowercase. So must attribute names (and every attribute must have a value, which must be quoted). The situation with attribute values is a little more confusing, and it really depends on the attribute. Some have predefined values, so you must follow the rules in these situations. Others, e.g. the id attribute, are case-sensitive and have certain other restrictions. See ID and Name tokens for more details.
In addition, all tags must be closed, even ones that have no formal closing tag. This is achieved through self-closing, like this: <br />. This is known as the minimized tag syntax for empty elements; see the W3C site for guidelines.
Documents that follow these rules, and a few others, are often referred to as well-formed.
Which ensures that any application (user agent) can parse and understand the source. As
you might have guessed already, all this and more is defined in the DTDs.
FAQ: What is an empty element? Most elements contain some form of character data (PCDATA) between an opening and a closing tag. A so-called empty element contains all its data within the tag itself in the form of attribute name/value pairs, or possibly nothing at all as in the example above.
The next element in a DTD declaration is the FPI (Formal Public Identifier). It is made up of four fields, separated by a token consisting of a pair of forward slashes ‘//’. Note that the entire FPI is enclosed in double quotes:
"-//W3C//DTD XHTML 1.0 Strict//EN"
- +/- indicates the owner is/isn’t ISO registered
- W3C is the unique Owner ID of the resource
- DTD XHTML 1.0 Strict are the PTC and the PTD
- EN is the PCL
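To make the four fields concrete, here is a minimal sketch that takes an FPI apart again. The parse_fpi() helper and its array keys are my own names for illustration; they are not part of the doctype() function.

```php
<?php
// Split a Formal Public Identifier into its fields.
// parse_fpi() is a hypothetical helper, not part of doctype().
function parse_fpi(string $fpi): array
{
    // The four fields are separated by a pair of forward slashes.
    [$registered, $owner, $text, $language] = explode('//', $fpi);

    // The third field holds the class (first word) and the description (the rest).
    [$class, $description] = explode(' ', $text, 2);

    return [
        'registered'  => $registered === '+', // '-' means not ISO registered
        'owner'       => $owner,
        'class'       => $class,
        'description' => $description,
        'language'    => $language,
    ];
}

print_r(parse_fpi('-//W3C//DTD XHTML 1.0 Strict//EN'));
```

The sketch assumes a well-formed FPI with exactly four fields; it does no error checking.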
Acronyms Explained
Sorry folks, I don’t make this stuff up. Much of it has its roots in SGML and the various standards organizations I’m constantly referring to. Thankfully, the W3C maintains an excellent Glossary and Dictionary.
The PTC, or declaration type, is in this case naturally DTD. The PTD includes the markup language (XHTML), the version (1.0) and type (Strict). And finally the PCL is the natural language that the DTD is written in, given as an uppercase 2-character ISO 639-1 language code (EN = English).
Note: As indicated by the minus character, the W3C is not ISO registered. Neither is the IETF, which you may occasionally see here in older DOCTYPE declarations. All this really means is these DTDs are privately owned and maintained resources.
Before closing the DOCTYPE you add the complete URI to the definition:
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"
also in double quotes, and finally close with a ‘>’ character.
Note: For the strict document types (which include XHTML 1.1) the DTD URI is mandatory; for the others it’s optional. I always include it for completeness, and since it will be generated by the function anyway you need never worry about it after that. Or you may alter the code in any way you see fit. Details are available at Strictly Conforming Documents, and are summarized here:
1. the root document element must be <html>
2. this element must contain an XMLNS declaration whose attribute name and URI value are: xmlns="http://www.w3.org/1999/xhtml"
3. an xml:lang attribute, which takes precedence, and for backwards compatibility, a lang attribute are also necessary: xml:lang="en" lang="en"
4. in XHTML 1.1 the lang attribute is dropped; see changes for more on the transition from XHTML 1.0 Strict to 1.1.
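The rules summarized above can be sketched as a small function that builds the opening tag. The name xhtml_open_tag() and its parameters are hypothetical, chosen only for this illustration.

```php
<?php
// Build the opening <html> tag for an XHTML document.
// xhtml_open_tag() is a hypothetical helper, not part of doctype().
function xhtml_open_tag(string $ver = '1.1', string $lang = 'en'): string
{
    // The namespace and xml:lang are needed for every XHTML version.
    $attrs = 'xmlns="http://www.w3.org/1999/xhtml" xml:lang="' . $lang . '"';

    // XHTML 1.0 keeps the plain lang attribute for backwards compatibility;
    // XHTML 1.1 drops it.
    if ($ver === '1.0') {
        $attrs .= ' lang="' . $lang . '"';
    }

    return '<html ' . $attrs . '>';
}

echo xhtml_open_tag('1.0'), "\n";
echo xhtml_open_tag('1.1'), "\n";
```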
Note: It is a happy accident that the
list items (3.) and (4.) above invalidate the XHTML 1.0 Strict and 1.1 document types. For
reasons known only to themselves, the W3C deprecated the value attribute of the
<li> tag in HTML 4.01. Which propagated into XHTML 1.0 and 1.1. Why happy? Because now I can
demonstrate the power of this entire exercise. I simply scrolled to the top of this source
code, switched DOCTYPEs and problem solved. Thanks to Tantek for his archive on this topic.
Declaration in Plain English
This is a document type declaration for XHTML version 1.0 Strict. The top
level element of the language is
<html>, whose namespace definition can be
found at the location http://www.w3.org/1999/xhtml. The definition is owned
and maintained by the organization designated by W3C. It is publicly available
at the address http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd. The natural
language of both the DTD and its namespace is English.
Why didn’t they just say that? Because it’s designed to be machine-readable, or the inverse of how I’m about to build it.
The Code
All modern programming languages have some form of the indexed sequential array or key/value list in which the keys are integers, often starting with zero, and whose values are related in some way. Many, like PHP, Perl and Python also have a similar structure known as an associative array (or hash or map or dictionary or lookup table...) in which the keys are strings rather than numbers. This is a powerful concept and I will be using it to my advantage often. For brevity I will be using the term hash.
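For anyone new to PHP, here is a minimal sketch of the two kinds of array side by side. The variable names and values are made up for the example and do not come from the doctype() source.

```php
<?php
// An indexed array: the keys are integers starting at zero.
$types = ['Strict', 'Transitional', 'Frameset'];

// A hash (associative array): the keys are strings.
$versions = ['HTML' => '4.01', 'XHTML' => '1.1'];

echo $types[0], "\n";          // look up by position
echo $versions['XHTML'], "\n"; // look up by name

// Both kinds can be walked with foreach.
foreach ($versions as $doc => $ver) {
    echo "$doc $ver\n";
}
```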
You may want to open another window (or tab) to the source code for my doctype() function so you can refer to it as I explain what I’m doing. The file is indexed by line number, and I will be linking directly to line numbers as I discuss them, so you can jump to any line and switch back and forth.
On line 4 I populate the $agent hash by calling the PHP get_browser() function, which gives me the boolean variable $_IE on line 8. As mentioned earlier in the series, if I send the XML declaration before the DOCTYPE, this will put IE into quirks mode, and IE is bad enough already in so-called compliance mode.
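get_browser() only works when a browscap file is configured on the server, so as a rough stand-in here is a sketch that derives the same kind of boolean straight from the User-Agent string. The is_ie() function is hypothetical, and the Opera exclusion is my own assumption about browsers of that era that could announce themselves as MSIE.

```php
<?php
// A hypothetical stand-in for the get_browser() lookup: guess whether the
// user agent is Internet Explorer from its User-Agent string alone.
function is_ie(string $user_agent): bool
{
    // Assumption: exclude Opera, which could identify itself as MSIE.
    return stripos($user_agent, 'MSIE') !== false
        && stripos($user_agent, 'Opera') === false;
}

// $_SERVER['HTTP_USER_AGENT'] is not set on the command line, hence the fallback.
$_IE = is_ie($_SERVER['HTTP_USER_AGENT'] ?? '');

var_dump($_IE);
```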
On line 10 I create a hash $media that maps the document type to, you guessed it, the Content-type HTTP header. Maybe.
What? Recall that some browsers don’t get XML, but they will get XHTML (as HTML anyway) if you follow the guidelines as mentioned earlier. At least most recent browsers will; for this site I consider IE 6.0 the LCD (lowest common denominator). Sigh. This is referred to as content negotiation.
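A hash like that might look something like the following. The keys are my guess at a sensible layout, not a copy of the real $media hash.

```php
<?php
// Map each document type to the media type it would ideally be served as.
// The keys used here are assumptions made for this sketch.
$media = [
    'HTML'  => 'text/html',
    'XHTML' => 'application/xhtml+xml',
];

$doc = 'XHTML';

// Fall back to text/html for anything the hash does not know about.
$media_type = $media[$doc] ?? 'text/html';

echo $media_type, "\n";
```

The "maybe" is the point: the value looked up here is only the preferred type, and it can still be overridden once the Accept header has been checked.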
On lines 15 and 16 I define the document’s character set $charset and language $lang
variables. Don’t confuse the latter with language as discussed earlier. I’m referring to
the language I write in, not the one the DTD is written in. Use UTF-8 people, there’s no
other way to go. With one caveat, that I will get into sometime later.
On line 18 the doctype() function is declared. It accepts 3 parameters: $doc, $type and $ver, all of which are optional. Refer to the demo for examples of calling the function to generate each of the 7 DOCTYPEs using these parameters. When parameters are omitted, it defaults towards the stricter types. In the case of HTML DTDs, the $ver variable is moot as 4.01 will be returned for all three. If you want to generate pre-4.01 DOCTYPEs you’ll have to modify the code yourself. I can think of no good reason for sending pre-4.01 DTDs. Make any modifications you want, just don’t expect me to fix it for you if you break it.
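Optional parameters in PHP are simply parameters with default values. The sketch below shows only that part; doctype_args() is a hypothetical name, and the defaults are my reading of "defaults towards the stricter types", not the actual signature.

```php
<?php
// Resolve the three optional parameters to concrete values.
// doctype_args() and its defaults are assumptions made for this sketch.
function doctype_args(string $doc = 'XHTML', string $type = 'Strict', string $ver = '1.1'): array
{
    // For HTML the version argument is moot: 4.01 is always used.
    if (strtoupper($doc) === 'HTML') {
        $ver = '4.01';
    }

    return [$doc, $type, $ver];
}

print_r(doctype_args());                              // all defaults
print_r(doctype_args('HTML', 'Transitional', '3.2')); // version is overridden
```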
Since the $media hash is outside the scope of the doctype() function, I declare it as global on line 20. Another variable, $media_type, is likewise declared global and will be used later during content negotiation. You will see it again in the next lab when I discuss the function that creates the <head> element.
On lines 22 and 23 I fold the first two parameters to the doctype() function to upper and lowercase respectively using PHP’s strtoupper() and strtolower() functions.
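Folding case up front means callers can be sloppy about how they type the arguments. A minimal sketch, with sample input values of my own choosing:

```php
<?php
// Normalize the arguments so 'xhtml', 'XHTML' and 'Xhtml' all behave the same.
$doc  = strtoupper('xhtml');  // document type in uppercase
$type = strtolower('STRICT'); // type in lowercase

// ucfirst() later restores the leading capital where the FPI needs it.
echo $doc, ' ', ucfirst($type), "\n";
```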
Let’s Rock
- line 25: availability is PUBLIC
- line 29: not ISO registered
- line 30: owner ID is W3C
- line 31: PTC is DTD
- line 35: PCL is ENglish
- line 36: xml:lang/lang is en
- line 40: begin DTD $URI
- line 42: open <html> top level document element $doc_top
Line 44: it’s HTML we want, so $top is HTML (46) and $media_type is 'text/html' (47). Brief digression: PHP’s string concatenation operator is the '.' (or period) character. You will see this often; it is equivalent to the JavaScript '+' operator on strings.
On line 49 I start building the PTD by combining the value of $doc (HTML) with the version 4.01 (see ¶ above regarding HTML versions).
On lines 51 through 69 I use PHP’s switch/case statement to complete the PTD. The PHP ucfirst() function converts the first character of the $type variable to uppercase. On line 70 I close the <html> top level tag, since HTML doesn’t require any additional attributes such as the XMLNS. Remember, this is not an XML application language. This ends the HTML DOCTYPE block of code.
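Here is a rough sketch of what a switch/case like that could look like for the three HTML 4.01 DOCTYPEs. The html_ptd_and_uri() function is hypothetical and the structure is my own; only the identifiers and DTD URIs are the W3C’s.

```php
<?php
// Complete the PTD and the DTD URI for an HTML 4.01 document type.
// html_ptd_and_uri() is a hypothetical helper, not the real code.
function html_ptd_and_uri(string $type): array
{
    $ptd = 'HTML' . ' ' . '4.01'; // '.' is the concatenation operator
    $uri = 'http://www.w3.org/TR/html4/';

    switch (ucfirst(strtolower($type))) {
        case 'Transitional':
            $ptd .= ' Transitional';
            $uri .= 'loose.dtd';
            break;
        case 'Frameset':
            $ptd .= ' Frameset';
            $uri .= 'frameset.dtd';
            break;
        default:
            // Strict (and anything unrecognized): the PTD carries no type word.
            $uri .= 'strict.dtd';
    }

    return [$ptd, $uri];
}

print_r(html_ptd_and_uri('transitional'));
```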
And Roll
Which leads to the XHTML block. Line 76 is a just-in-case test: everyone makes typos.
The $top (root element) variable is now lowercase on line 78 since this is XHTML. On line 79 I initialize the namespace and xml:lang attributes. Lines 81 through 84 handle XHTML 1.1 (which is the default). The PHP implode() function is similar to concatenation, except that it takes a string to use as the glue and an array of the pieces to join. The array in this example is referred to as anonymous (or lambda, but that is generally in reference to functions, not variables, see: functional programming) because no variable really exists for the parameter; I’m simply creating it on the fly to satisfy the function and get the results I’m after. You will see the implode() function again in just a bit when I build the FPI. Notice that the PTD does not include the Strict $type, as it is implicit in that version. I also complete the $URI to the DTD.
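A minimal sketch of implode() with an anonymous array, using the XHTML 1.1 case as the example. The variable names follow the ones mentioned in the text, but the statements themselves are my own reconstruction.

```php
<?php
$doc = 'XHTML';
$ver = '1.1';

// The array exists only for the duration of the call: no variable holds it.
// The glue here is a single space.
$ptd = implode(' ', [$doc, $ver]);

// XHTML 1.1 has a single DTD, so there is no type word in the PTD or the URI.
$URI = 'http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd';

echo $ptd, "\n";
echo $URI, "\n";
```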
Bug Fixed: Dated Feb·12·2005 — Per the W3C’s Recommendations on Media Type Usage, I have altered the doctype() function so that browsers (typically IE) that do NOT Accept application/xhtml+xml are not sent the XHTML 1.1 DTD. I have never had a problem with doing so, but then I’m not sending anything outside the XHTML namespace (yet). Sending XHTML 1.0 Strict is fine, which is what I’m now doing under that specific condition. Run the demo from IE/Win 6.0 to see what I meant. At any rate, if you happen to be using this code, here is a fixed version.
The else block on lines 85 through 92 handles XHTML 1.0. The difference here is the addition of the $type (Strict, Transitional or Frameset) and the lang attribute for backwards compatibility. The top level <html> tag is closed on line 94.
Content Negotiation
As I mentioned in the HTTP lab, PHP gives you access to a whole slew of Web server information including the request and response headers. PHP maintains these values in a hash of its own called $_SERVER (kinda makes sense). I will be exploiting that fact in the next chunk of code. As I hope you recall, the user agent will send a list of media types it understands in the Accept header. For some reason, unknown to me, the W3C validator does NOT do this, even though application/xhtml+xml is a perfectly valid media type to be sending it. In other words the validator understands XML, it just doesn’t tell you it understands it. Sigh.
So, on line 101 I use the PHP stristr() function to search the User-Agent request header for the W3C validator signature string. If I get a match then I set the correct Content-type. Recall that I’m saving this string in the global variable $media_type for use later. Otherwise, on the very next line (102), I use the same stristr() function to search the Accept header for the application/xhtml+xml media type, and I fall back to good old trusty text/html if I don’t spot it. On line 104 I pull in the variables $_IE, $charset and $lang from way back at the top of doctype().
On line 108 I send the header(). You must be very careful not to send any other data after this call if you plan on sending more headers. They must all go in one chunk, followed by a blank line, before you start sending the DOCTYPE bits. PHP will take care of the blank line for you, but if you call header() after you’ve started to send normal data it will raise a “headers already sent” warning. Sending normal data is exactly what I do right off the bat on line 113, starting with the XML declaration. But not if the user agent is IE, or you’ll put it into tag soup mode. Bad browser, go to your room!
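The negotiation described above can be sketched as two small functions. Both function names are hypothetical, and the 'W3C_Validator' signature string is my assumption about what the validator sends in its User-Agent header. The real doctype() does this inline and also takes the requested document type into account, which this sketch ignores.

```php
<?php
// Pick the media type from the Accept and User-Agent request headers.
// negotiate_media_type() is a hypothetical helper.
function negotiate_media_type(string $accept, string $user_agent): string
{
    $xhtml = 'application/xhtml+xml';

    // Assumed signature: the validator understands XHTML but does not say so.
    if (stristr($user_agent, 'W3C_Validator') !== false) {
        return $xhtml;
    }

    // Everyone else gets XHTML only if they asked for it.
    return stristr($accept, $xhtml) !== false ? $xhtml : 'text/html';
}

// Return the XML declaration, or nothing for IE (it would trigger quirks mode).
function xml_declaration(bool $is_ie, string $charset): string
{
    if ($is_ie) {
        return '';
    }
    return '<?xml version="1.0" encoding="' . $charset . '"?' . ">\n";
}

$charset    = 'UTF-8';
$media_type = negotiate_media_type(
    $_SERVER['HTTP_ACCEPT'] ?? '',
    $_SERVER['HTTP_USER_AGENT'] ?? ''
);

// Headers go first, before a single byte of the body is sent.
if (!headers_sent()) {
    header("Content-Type: $media_type; charset=$charset");
}

echo xml_declaration(false, $charset);
```

Keeping the decision in pure functions, separate from the header() call, makes it easy to try out different header combinations without a Web server.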
Ready to pull your hair out yet? I know I am after that last comment. The good news is we’re on the home stretch now. Line 116 is magic, it glues all the various parts together by calling implode() to create the FPI.
On lines 118 through 125 I use PHP’s heredoc syntax to spit the DTD declaration back to the
browser directly followed by the top level <html> tag. On line 127 doctype() closes.
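Putting the last two steps together, here is a minimal sketch of gluing the FPI with implode() and printing the declaration with a heredoc. The values are hard-coded for XHTML 1.0 Strict, and the layout of the heredoc is my own guess at something reasonable.

```php
<?php
// Glue the four FPI fields together with the '//' token.
$fpi = implode('//', ['-', 'W3C', 'DTD XHTML 1.0 Strict', 'EN']);

$top     = 'html';
$URI     = 'http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd';
$doc_top = '<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">';

// Variables are interpolated inside a heredoc; the double quotes are literal.
$declaration = <<<EOT
<!DOCTYPE $top PUBLIC "$fpi"
    "$URI">
$doc_top
EOT;

echo $declaration, "\n";
```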
Go get a cold one, you deserve it after all that. I would but I don’t drink anymore. Maybe a cigar instead...wait, I hate cigars. Chocolate!
Caveat: If you’re using mod_gzip
and you’re going to be sending Content-type headers with the charset
attribute you should probably make note of the following additional and/or modified settings
for your httpd.conf file: