John Jago

XML signal-to-noise ratio

XML stands for Extensible Markup Language and was introduced roughly 25 years ago. Its popularity as a format for marking up data seems to be declining, although I don’t have any numbers to support this. I do know from personal experience that many programming tools from the early 2000s dealt with XML in some way, for example for configuration. These tools are not fun to use.

Also from personal experience, and I may certainly be biased here since this is what I grew up with, I can’t think of any modern programming language or tooling that relies on XML as a configuration or data storage format.

Certainly the ubiquity of the web as a platform has helped JSON become a replacement for XML. But one of the reasons JSON was created in the first place may have been to address the verbosity of XML.

In larger configuration files for legacy Java projects, I always found it to be a burden to sift through tag soup to find the actual content. Here is a simpler example, and we’ll do some calculations to see what exactly the signal-to-noise ratio is.

For the purposes of this article, signal is any meaningful content, the actual data. Noise is any meta structure that is used to format the data.

We’ll calculate a ratio of signal characters to noise characters. The higher the ratio, the better. Tab indentation counts as noise, and line breaks are not counted at all. Note that JSON requires a little more indentation to make it readable.

What’s the actual data?

For both formats, the actual data is the same: a total of 162 characters describing various species of plants. Notice that some values, like 4, don’t make sense alone. We need some context to know what 4 means, and that’s where XML and JSON come in.

bloodroot
sanguinaria canadensis
4
mostly shady
$2.44
031599
columbine
aquilegia canadensis
3
mostly shady
$9.37
030699
marsh marigold
caltha palustris
4
mostly sunny
$6.81
051799
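
The 162 figure can be checked by summing the lengths of the values, ignoring the line breaks between them. Here is a minimal Python sketch; the names `values` and `signal_chars` are just illustrative.

```python
# The 18 data values from the listing above, one string per line of the listing.
values = [
    "bloodroot", "sanguinaria canadensis", "4", "mostly shady", "$2.44", "031599",
    "columbine", "aquilegia canadensis", "3", "mostly shady", "$9.37", "030699",
    "marsh marigold", "caltha palustris", "4", "mostly sunny", "$6.81", "051799",
]

# Signal is the number of characters in the values themselves;
# the line breaks separating them are not counted.
signal_chars = sum(len(v) for v in values)
print(signal_chars)  # 162
```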

XML

Below we have the data formatted with XML.

<catalog>
	<plant>
		<common>bloodroot</common>
		<botanical>sanguinaria canadensis</botanical>
		<zone>4</zone>
		<light>mostly shady</light>
		<price>$2.44</price>
		<availability>031599</availability>
	</plant>
	<plant>
		<common>columbine</common>
		<botanical>aquilegia canadensis</botanical>
		<zone>3</zone>
		<light>mostly shady</light>
		<price>$9.37</price>
		<availability>030699</availability>
	</plant>
	<plant>
		<common>marsh marigold</common>
		<botanical>caltha palustris</botanical>
		<zone>4</zone>
		<light>mostly sunny</light>
		<price>$6.81</price>
		<availability>051799</availability>
	</plant>
</catalog>

The noise looks like this and contains 442 characters.

<catalog>
	<plant>
		<common></common>
		<botanical></botanical>
		<zone></zone>
		<light></light>
		<price></price>
		<availability></availability>
	</plant>
	<plant>
		<common></common>
		<botanical></botanical>
		<zone></zone>
		<light></light>
		<price></price>
		<availability></availability>
	</plant>
	<plant>
		<common></common>
		<botanical></botanical>
		<zone></zone>
		<light></light>
		<price></price>
		<availability></availability>
	</plant>
</catalog>

We have 162 signal characters and 442 noise characters, which gives a ratio of 162 / 442, or about 0.37.
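
The XML numbers can be reproduced with a short script. The Python sketch below rebuilds the XML listing line by line and subtracts the signal from the total. It assumes that tab indentation counts as noise and that line breaks are ignored, which is the convention that yields 442. The names `FIELDS`, `PLANTS`, and `xml_lines` are made up for this example.

```python
FIELDS = ["common", "botanical", "zone", "light", "price", "availability"]
PLANTS = [
    ["bloodroot", "sanguinaria canadensis", "4", "mostly shady", "$2.44", "031599"],
    ["columbine", "aquilegia canadensis", "3", "mostly shady", "$9.37", "030699"],
    ["marsh marigold", "caltha palustris", "4", "mostly sunny", "$6.81", "051799"],
]


def xml_lines(plants):
    """Rebuild the tab-indented XML listing, one string per line."""
    lines = ["<catalog>"]
    for plant in plants:
        lines.append("\t<plant>")
        for field, value in zip(FIELDS, plant):
            lines.append(f"\t\t<{field}>{value}</{field}>")
        lines.append("\t</plant>")
    lines.append("</catalog>")
    return lines


# Total characters across all lines (tabs included, line breaks excluded).
total = sum(len(line) for line in xml_lines(PLANTS))
signal = sum(len(value) for plant in PLANTS for value in plant)
noise = total - signal

print(signal, noise, round(signal / noise, 2))  # 162 442 0.37
```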

JSON

Below is the same data in JSON format.

{ "catalog": {
	"plants": [
		{
			"common": "bloodroot",
			"botanical": "sanguinaria canadensis",
			"zone": "4",
			"light": "mostly shady",
			"price": "$2.44",
			"availability": "031599"
		},
		{
			"common": "columbine",
			"botanical": "aquilegia canadensis",
			"zone": "3",
			"light": "mostly shady",
			"price": "$9.37",
			"availability": "030699"
		},
		{
			"common": "marsh marigold",
			"botanical": "caltha palustris",
			"zone": "4",
			"light": "mostly sunny",
			"price": "$6.81",
			"availability": "051799"
		}
	]
}}

The noise looks like this.

{ "catalog": {
	"plants": [
		{
			"common": "",
			"botanical": "",
			"zone": "",
			"light": "",
			"price": "",
			"availability": ""
		},
		{
			"common": "",
			"botanical": "",
			"zone": "",
			"light": "",
			"price": "",
			"availability": ""
		},
		{
			"common": "",
			"botanical": "",
			"zone": "",
			"light": "",
			"price": "",
			"availability": ""
		}
	]
}}

We have 162 signal characters like last time, but only 350 noise characters, which is a ratio of 0.46. This includes the characters for additional indentation that the JSON required to be readable.
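
The JSON count can be checked the same way. This Python sketch reproduces the exact layout of the JSON listing, including the three levels of tab indentation, confirms with the standard `json` module that the text parses, and then counts characters without line breaks. The names `FIELDS`, `PLANTS`, and `json_lines` are made up for this example.

```python
import json

FIELDS = ["common", "botanical", "zone", "light", "price", "availability"]
PLANTS = [
    ["bloodroot", "sanguinaria canadensis", "4", "mostly shady", "$2.44", "031599"],
    ["columbine", "aquilegia canadensis", "3", "mostly shady", "$9.37", "030699"],
    ["marsh marigold", "caltha palustris", "4", "mostly sunny", "$6.81", "051799"],
]


def json_lines(plants):
    """Rebuild the tab-indented JSON listing, one string per line."""
    lines = ['{ "catalog": {', '\t"plants": [']
    for i, plant in enumerate(plants):
        lines.append("\t\t{")
        for j, (field, value) in enumerate(zip(FIELDS, plant)):
            # Every key/value pair except the last one ends with a comma.
            comma = "," if j < len(FIELDS) - 1 else ""
            lines.append(f'\t\t\t"{field}": "{value}"{comma}')
        # Every object except the last one is followed by a comma.
        lines.append("\t\t}," if i < len(plants) - 1 else "\t\t}")
    lines += ["\t]", "}}"]
    return lines


lines = json_lines(PLANTS)
parsed = json.loads("\n".join(lines))  # raises ValueError if the text is not valid JSON

signal = sum(len(value) for plant in PLANTS for value in plant)
noise = sum(len(line) for line in lines) - signal

print(signal, noise, round(signal / noise, 2))  # 162 350 0.46
```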

What does this mean?

The signal-to-noise ratio for JSON in this example, 0.46, is higher than the ratio for the XML data, 0.37. That means there’s more real content relative to the characters used to structure it. In the end, this improves readability, and it may be why JSON has become more popular than XML in recent years.

What’s more interesting than these small examples is how the idea manifests in other areas of software development. Domain-specific languages aim to solve particular problems, and with this constraint they are often less verbose than general-purpose languages. With certain assumptions, the language can be reduced to only what’s needed to express the solution to a problem. The result is a higher signal-to-noise ratio.

Ruby on Rails is probably the best example of this. Although it’s a framework that uses Ruby, you could argue that it’s practically a web application DSL at this point. You need to type very little to express what HTTP endpoints you want to create and what web pages you want to serve.

Wasp seems to be taking this a step further, with the concepts of a web application embedded into the programming language.

I think there’s a big future for domain-specific languages that improve the signal-to-noise ratio, allowing people from more fields to leverage software to their advantage.