HTML Made Special

What is Vernacular HTML?

HTML is one of the most widespread and interoperable data formats available. It is extremely versatile and flexible, and if you have access to a computer, you have access to a browser with which to run it. Combined with JavaScript, CSS, SVG, and the rest of the Web platform, it can be used as a very powerful format for all aspects of digital publishing.

This generality comes at a cost, however. When your goal is the interchange of specialised information within a given domain, the full generality of HTML can get in the way. Your applications should not have to sift through some vague information soup in order to extract the data that they require, and equally they need an obvious way of encoding their data so that other applications will reliably understand it.

That is the traditional purview of domain-specific data formats. These are often relatively simple (at least compared to the fullness of HTML) data formats, typically encoded in XML or JSON, with predictable rules that (one hopes) enable the interoperable interchange of information.

Domain-specific data formats are, however, not a panacea. They require conversion to an output format (usually HTML, sometimes PDF) in order to be consumed, which is often a lossy operation. This can make it harder for users to interact with them; it also means that round-tripping them to and from user-friendly formats is an error-prone process. Not being directly published on the Web, their content tends to be less discoverable.

“Vernacular HTML” is the growing practice of creating domain-specific data formats using the HTML platform. Its goal is to enable interoperable, specialised data interchange while retaining as much of HTML’s full power as is sensible for the given domain. It is not a technology so much as an incipient body of good practices.

Why bother talking about Vernacular HTML?

Bodies of good practices are suspicious things. Why not do and advocate through example rather than talk about doing? In some cases, though, it is worth sitting down and writing up a few tips and tricks acquired through experience that might not be immediately obvious from just looking at what others are doing.

HTML already has several extensibility mechanisms, ranging from well-known ones such as class and data-* to less obvious parts like rel. It also has (sadly, several) ways of overlaying additional semantics atop its markup: Microdata, RDFa, and Microformats. Its elements and attributes can at times have non-obvious meanings, a fact made worse by the occasionally cryptic language in which they are described in the specification. Additionally, HTML’s processing rules make it possible to mint your own elements and attributes and expect a reliable DOM to come out at the other end. With all of these options it is no surprise that people could find creating their own vernacular daunting.
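To make these hooks concrete, here is a minimal sketch in Python that walks a small fragment and records each extensibility mechanism it meets: class tokens, data-* attributes, rel values, and a self-minted element. Everything in the sample markup is an assumption made up for illustration (the recipe-step element, the ingredient class, the data-quantity and data-order attributes, the example.org URL), as are the names HookCollector and collect_hooks. Note also that Python's html.parser is a simple tokenizer rather than a full implementation of the HTML parsing algorithm, so this only approximates the DOM a browser would build.

```python
from html.parser import HTMLParser

# Hypothetical vernacular fragment: the <recipe-step> element, the "ingredient"
# class, the data-* attributes and the URL are invented for this illustration.
SAMPLE = """
<article>
  <a rel="license" href="https://example.org/licence">Licence</a>
  <span class="ingredient" data-quantity="200g">flour</span>
  <recipe-step data-order="1">Mix everything.</recipe-step>
</article>
"""


class HookCollector(HTMLParser):
    """Records each extensibility hook seen on a start tag, in document order."""

    def __init__(self):
        super().__init__()
        self.hooks = []

    def handle_starttag(self, tag, attrs):
        # Treat a hyphen in the tag name as the sign of a self-minted element.
        if "-" in tag:
            self.hooks.append(("element", tag))
        for name, value in attrs:
            if name in ("class", "rel"):
                # Both attributes hold space-separated tokens.
                for token in (value or "").split():
                    self.hooks.append((name, token))
            elif name.startswith("data-"):
                self.hooks.append((name, value))


def collect_hooks(markup):
    """Parse markup and return the list of (kind, value) hooks found."""
    parser = HookCollector()
    parser.feed(markup)
    parser.close()
    return parser.hooks


if __name__ == "__main__":
    for kind, value in collect_hooks(SAMPLE):
        print(kind, value)
```

Even on a fragment this small, four different mechanisms carry domain data, which is exactly why a vernacular needs to say which hook means what.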

The HTML specification does contain some notes about how to use this or that extensibility feature, but it is worth noting that standards authors can at times suffer from self-important notions of right and wrong. Not that their advice should be ignored wholesale, but it needs to be appraised through the sieve of your detailed knowledge of the specific problem you are solving.

Our approach here is of a decidedly pragmatic bent. We simply wish to exchange good ideas and good examples; never-ending debates about the significance of semantic meaning are very much out of scope.

How do I build a Vernacular?

Much of the time, if you sit down and start typing out examples, you will quickly get a good feel for the kind of markup you want for your specific use case. Modulo some syntactical variation, the overall shape of it is likely to be obvious. What may not be so obvious are the bits that can trip you farther down the line. As a result, many of the practices listed here are actually bad practices, things you should probably not be doing. I say “probably” because the philosophy at play here is that, ultimately, you know what you’re doing (if not, no amount of advice will save you). You should have no qualms about ignoring a red flag found here; all we’re saying is that it is likely a good idea to think about it.

The guidelines below commonly make use of examples drawn from XML vocabularies. That is because the problems one is likely to encounter in the creation of XML and HTML languages have a fair amount of overlap, and the XML examples tend to be more outrageously wrong. Don’t let that lead you to think that using HTML makes you safe from such mistakes: if a problem is listed here, someone smart has hit that problem before.

Language Design
Wishful Thinking and Doe-Eyed Beliefs
It’s for Humans
RDFa/Microdata Problems

Where can I find some Vernacular HTML?

Vernacular HTML is a practice that existed long before it was given a name — as well it should be! Here is a short list of some examples (with varying qualities of documentation). If you wish to expand it, please simply file a pull request.

AMP (Accelerated Mobile Pages) is a project that defines a subset of HTML (with a number of extensions) that enables increased performance in HTML delivery by getting rid of many of the causes of slowness.
A vernacular to write books in HTML, in extensive use at O'Reilly.
W3C Publication Rules (“pubrules”)
Web standards are relatively formal affairs, and as such have their own formalised HTML vernacular in which they are expressed. It is a fairly old vernacular (created circa 1997) which, in terms of best practices, can sometimes show. It has nevertheless become instantly recognisable in the Web technology world, and certainly benefits from great simplicity (as can be seen in its style sheet).
ReSpec and Bikeshed
Since W3C “pubrules” markup can prove repetitive and at times hard to get right, many tools have been developed to assist people in producing it — these are the two main ones. ReSpec documents are essentially valid HTML with some extra configuration that a JS library turns into the real thing; Bikeshed is a Python preprocessor that can apply to HTML but is more often used in Markdown mode (and therefore not completely a candidate for this list).
The AngularJS project sports a wealth of additions to HTML (which can take the form of new elements, new attributes, data-ng-* attributes, and a few other options) to add interactivity to HTML.
Scholarly HTML
Scholarly HTML is a set of constraints on HTML meant to enable the interchange of scientific articles. (Disclaimer: it is done by the same people who are doing this.)
The purpose of html-version, from the ineffable substack, is “to make the Web more distributed, permanent, and robust against outages, censorship, and orphaned content”. It does so through specific application of link and meta values.

Any Other Question?

This is great! How do I contribute or point out a bug?
It’s all on GitHub. Feel free to make a pull request or to file an issue! We are completely open to new advice, or to discussing the validity of some of the practices described here.
Do I need to produce a validator for my vernacular?
As always, it depends on the usage context. In general it should not be considered a high-priority requirement — describing clear processing rules through which content can be reliably and robustly extracted from your vernacular is far more important. However, in a complex production chain, especially one involving multiple parties, validation can be a life saver. There are no special rules for HTML vernaculars: the same common sense that applies to other data formats is just as valid here.
How do I create a validator if I need one?
One approach can be to produce a modified version of the VNU validator, but it is not a lightweight undertaking and it involves Java. We are considering working on a validation framework architected in a manner similar to Specberus, to make it easier for people to write their own validation tooling. Let us know if that’s something you’re interested in!
You sometimes seem to interpret standards in a flexible manner. Is that not bad?
No. Standard texts have several levels of conformance. When dealing with normative rules (the “MUST” ones) it is generally a really good idea not to fool around. Breaking those leads to interoperability problems, and either you or your users will get bitten by that. If, however, we are considering advice from specifications (the “SHOULD” bits) then there is room for context. The idea is that you ought to understand the why behind the statement and not take it as a strict mandate. Once you understand the reasoning, you should have a fairly good idea of when to stick to it and when to break it. Unfortunately, this type of informative recommendation is often just stated, almost as a command, with no explanation as to the problems it is expected to protect you from. That’s when a little bit of creative thinking might be necessary.
I have another question. Where do I ask it?
You can file an issue, or you can ping @sciencedotai or @robinberjon on Twitter.
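On the validator question above: if you do decide you need one, a lightweight starting point can be a list of small, independent rule functions run over the parsed document. The sketch below is only an illustration of that shape and makes no claim about how the VNU validator or Specberus are actually implemented. The two rules (exactly one h1, every section carries an id) are hypothetical constraints invented for the example, as are the names ElementCollector, RULES, and validate.

```python
from html.parser import HTMLParser


class ElementCollector(HTMLParser):
    """Flattens markup into a list of (tag, attributes) pairs in document order."""

    def __init__(self):
        super().__init__()
        self.elements = []

    def handle_starttag(self, tag, attrs):
        self.elements.append((tag, dict(attrs)))


# Each rule takes the element list and returns a list of error messages.
# Both rules encode hypothetical constraints of an imaginary vernacular.
def rule_single_h1(elements):
    count = sum(1 for tag, _ in elements if tag == "h1")
    return [] if count == 1 else [f"expected exactly one <h1>, found {count}"]


def rule_sections_have_id(elements):
    return [
        "<section> without an id attribute"
        for tag, attrs in elements
        if tag == "section" and not attrs.get("id")
    ]


RULES = [rule_single_h1, rule_sections_have_id]


def validate(markup, rules=RULES):
    """Run every rule over the markup and return all error messages."""
    collector = ElementCollector()
    collector.feed(markup)
    collector.close()
    errors = []
    for rule in rules:
        errors.extend(rule(collector.elements))
    return errors


if __name__ == "__main__":
    print(validate('<h1>Title</h1><section id="intro"></section>'))
    print(validate("<section></section>"))
```

Keeping each rule as a plain function means a new constraint can be added, or dropped for a given production chain, without touching the others.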