document.characterSet and another meaningless example of flexibility destroying protocols

fiatjaf February 22, 2025

I always knew of at least two standardized ways browsers use to determine the charset of a given webpage: the Content-Type header and the <meta charset> tag. These are widely understood, taught and documented specs that a lot of developers assume are being followed because they're "web standards".
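For the record, those two declaration mechanisms look roughly like this (illustrative values, not a real response). First, the HTTP header:

```
HTTP/1.1 200 OK
Content-Type: text/html; charset=utf-8
```

And second, the tag inside the document itself:

```html
<!doctype html>
<meta charset="utf-8">
```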

Turns out there are a lot of pages on the internet that declare themselves as UTF-8 but actually use some other encoding (here's an example), and just by looking at the headers and meta tags you would think they really are UTF-8, since they render correctly on Chromium and Firefox.
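To see how such a mismatch plays out, here's a quick sketch (in Python, purely for illustration): bytes saved as windows-1252 are not in general valid UTF-8, so a client that trusted the declared charset and decoded strictly would fail outright, while a lenient decoder would silently produce mojibake.

```python
# A page whose body was actually saved as windows-1252
# but whose headers and meta tags claim UTF-8.
body = "café".encode("windows-1252")  # b'caf\xe9'

# A strict, spec-trusting client blows up:
try:
    body.decode("utf-8")
except UnicodeDecodeError as e:
    print("strict UTF-8 decode failed:", e)

# A lenient one silently produces a replacement character:
print(body.decode("utf-8", errors="replace"))  # 'caf\ufffd'
```

Either outcome is what a page author would notice and fix, if browsers actually behaved this way.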

But the truth is that browsers ignore these declarations completely and use their own internal heuristics to determine the actual charset. And they expose the result of that guess in the DOM property document.characterSet.

"Oh, that's great! Technology is awesome, they've fixed a problem!", you may think. But the actual result of that is:

  1. developers never learn that they're wrongly declaring "UTF-8" when their content is actually "windows-1252", because they never see their page rendered wrongly;
  2. the de facto spec is now that browsers must correctly guess a page's encoding instead of just following what is declared;
  3. people continue to teach, learn (and write) these useless HTTP headers and <meta> tags, never finding out that they are ignored;
  4. new browsers entering the space must first discover that this is a thing at all, which is not obvious nor written anywhere, and then implement it, because if they just follow the spec people will blame the new browser when broken pages render with broken characters;
  5. barriers to entry get higher, and the protocol centralizes more and more;
  6. anyone else trying to read these HTML pages, from any software that isn't Google Chrome or Mozilla Firefox, will hit the same problem and have to learn all of this and come up with their own charset detection mechanism; this again restricts the content of webpages more and more to the walled garden of existing browser vendors.
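A minimal version of the detection that every non-browser consumer ends up reimplementing might look like this. This is a naive sketch, nowhere near the real browser heuristics: try the declared encoding, then common fallbacks, and keep the first one that decodes cleanly.

```python
def sniff_decode(raw: bytes, declared: str = "utf-8") -> tuple[str, str]:
    """Try the declared charset first, then common fallbacks.

    Returns (text, encoding_actually_used), a crude stand-in for
    what document.characterSet reports after the browser's guess.
    """
    for encoding in (declared, "utf-8", "windows-1252"):
        try:
            return raw.decode(encoding), encoding
        except (UnicodeDecodeError, LookupError):
            continue
    # windows-1252 can still fail on a handful of unassigned bytes,
    # so as a last resort decode with replacement characters.
    return raw.decode("windows-1252", errors="replace"), "windows-1252"

text, used = sniff_decode("café".encode("windows-1252"), declared="utf-8")
print(used)  # 'windows-1252': the declaration was wrong, the guess wins
```

The function name and fallback order here are made up for the example; the point is that every non-browser HTML reader has to invent something like it.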

I think we can all agree these are not good outcomes.

In the end, this is just one very small example, but "the web" protocol has thousands of such small examples, and they add up.

Also, arguably the spec should have said "browsers must do their own charset detection" from the beginning, but that's beside the point. The fact is that it didn't (and still doesn't; the specs weren't updated as far as I know), and here is yet another undeniable example of how being flexible can bloat a protocol.