Posted On: 2020-06-08
When developing, I often need to transfer data between systems. Whenever I am faced with such a task, JSON is the first tool I reach for. I have mentioned this in passing more than once, but today I'll dig deeper into the particulars of why it's my first choice for data transfer and storage.
JSON is a human-readable data transfer format. It's designed to be simple and intuitive, especially for those familiar with the syntax and conventions of the JavaScript programming language. Fortuitously, JavaScript itself was designed to be easy for newcomers and non-programmers to pick up, and JSON enjoys some of those benefits: if the data itself is simple enough, JSON can represent it in a way that anyone can understand. Consider the following example of a simple recipe (I bet anyone can follow along):
{
    "Name": "Buttered Toast",
    "Ingredients": [
        "Bread",
        "Butter"
    ],
    "Instructions": [
        "Cut bread into thin (1-2cm) slices.",
        "Heat bread slices in a skillet on low heat, flipping occasionally, until golden-brown.",
        "Remove from heat and, with a knife, spread butter across one face of each slice.",
        "Let stand until cool enough to eat.",
        "Enjoy."
    ]
}
Data transfer is a very common situation: any time one needs to send information elsewhere (such as sending an email or copying files to a different computer), one is performing a data transfer operation. In order for data transfer to be possible, all the systems involved must agree on how to represent the data - otherwise one or more of the systems won't be able to understand it. This is similar to how human writing works: if two people share a common language, they can write to each other, but if they use different languages, the words one person writes will look like nonsense to the other.
Since it is human-readable, JSON can also be used to transfer data between humans and machines. This can be especially handy during early development, since one can read and write JSON with even a simple text editor (though, as the data becomes larger or more complex, specialized tools are often preferable to writing JSON by hand.)
JSON is by no means the only human-readable data transfer format. XML and YAML are both popular alternatives. What's more, there is a plethora of non-human-readable formats as well. While each of these has different advantages and disadvantages, I do tend to prefer the set of advantages that JSON provides over the others.
XML is a highly structured format, offering high levels of customizability and the embedding of information into the structure of the data itself. It also supports enforcing conformance to specific standards (schemas and DTDs) as well as defining intra-document references to both types and data. What's more, many of these features were built with the web in mind, allowing XML documents to import definitions from remote locations (such as web sites.)
Unfortunately, none of those features are useful to me, but they bring a plethora of security issues along with them. Most XML parsers have to be explicitly locked down to guard against external entity attacks, and doing so comes with the risk of rendering some documents unreadable (those that legitimately use the disabled features.) At its heart, XML appears (to me) to be designed to solve an impossible problem: to make a document sufficiently self-describing that any system reading the document will somehow work, without any additional development effort.
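As a concrete illustration, here is a minimal sketch of that kind of lockdown in .NET (the sample document and class name are hypothetical; real input would come from elsewhere):

using System.IO;
using System.Xml;

class LockedDownXmlExample
{
    static void Main()
    {
        // Hypothetical document standing in for untrusted input.
        var xml = "<recipe><name>Buttered Toast</name></recipe>";

        var settings = new XmlReaderSettings
        {
            // Refuse DTDs outright, blocking external entity (XXE) expansion...
            DtdProcessing = DtdProcessing.Prohibit,
            // ...and never fetch external resources (e.g. remote definitions).
            XmlResolver = null
        };

        using (var reader = XmlReader.Create(new StringReader(xml), settings))
        {
            // Documents that legitimately rely on the disabled features
            // will now throw instead of parsing.
            while (reader.Read()) { }
        }
    }
}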
JSON neatly avoids all these issues by keeping things simple. Only a small, predefined set of types are allowed*, and if any validation is required, the developers are responsible for it (rather than expecting the data to tell the parser to do it.) Unsurprisingly, the concept of a JSON schema has emerged to fill in some of those gaps (such as automating validation), but this is (currently) only available through separate tools and libraries**.
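To make "the developers are responsible for it" concrete, here is a minimal sketch of hand-rolled validation with JSON.net, using the recipe from earlier (the specific rules are my own assumptions, not part of any standard):

using System;
using Newtonsoft.Json.Linq;

class RecipeValidationExample
{
    static void Main()
    {
        var json = "{ \"Name\": \"Buttered Toast\", \"Ingredients\": [ \"Bread\", \"Butter\" ] }";
        var recipe = JObject.Parse(json);

        // The parser only guarantees well-formed JSON; domain rules
        // (required fields, expected types) are checked by hand.
        if (recipe["Name"]?.Type != JTokenType.String)
            throw new FormatException("Recipe must have a string Name.");
        if (!(recipe["Ingredients"] is JArray ingredients) || ingredients.Count == 0)
            throw new FormatException("Recipe must list at least one ingredient.");

        Console.WriteLine($"Valid recipe: {recipe["Name"]}");
    }
}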
YAML is a human-readable format designed specifically to be easy for developers to write by hand. It is less verbose than XML and (unlike JSON) gives white space semantic meaning, enforcing clear, legible structure. In keeping with its design, YAML supports defining anchors that can later be aliased, which tells the parser to reuse the anchored data - thereby avoiding repetition. YAML also supports a much wider variety of collections and data types than JSON.
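For example, a hypothetical YAML fragment of the recipe could use an anchor (&) and an alias (*) to avoid writing "Butter" twice:

name: Buttered Toast
spread: &spread Butter    # anchor the value here...
ingredients:
  - Bread
  - *spread               # ...and have the parser reuse it here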
While my experience using YAML is limited, it seems like a good choice if one is authoring data by hand, as it has many conveniences (like anchors) that can avoid repetition (and potentially human error.) On the flip side, YAML's impressive feature set seems like a nightmare from a security perspective: that is a much larger attack surface that parsers (and/or developers using said parsers) need to secure.
Perhaps most telling, YAML is self-described as a "data serialization standard", in contrast with JSON, which is a "data-interchange format". From the usage I've seen, this distinction seems to be the best way to summarize it: you can solve data transfer problems with YAML, but that is not its intended purpose. Conversely, that is the intended purpose of JSON - and many of the design choices and limitations in JSON reflect that.
Comparisons aside, JSON has a couple of neat benefits that I regularly enjoy. First, when working in C#, the JSON.net library makes mapping from JSON to POCOs trivially simple:
JObject.Parse(jsonString).ToObject<DesiredType>();
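For a fuller picture, here is a small sketch of that mapping in action (the Recipe class and sample string are illustrative, not from a real project):

using System;
using Newtonsoft.Json.Linq;

// An illustrative POCO matching the recipe JSON above.
class Recipe
{
    public string Name { get; set; }
    public string[] Ingredients { get; set; }
    public string[] Instructions { get; set; }
}

class Program
{
    static void Main()
    {
        var jsonString =
            "{ \"Name\": \"Buttered Toast\", \"Ingredients\": [ \"Bread\", \"Butter\" ], \"Instructions\": [ \"Enjoy.\" ] }";

        // One line to go from raw JSON to a strongly-typed object.
        var recipe = JObject.Parse(jsonString).ToObject<Recipe>();

        Console.WriteLine(recipe.Name);           // Buttered Toast
        Console.WriteLine(recipe.Ingredients[0]); // Bread
    }
}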
Second, many web developers read JSON on a daily basis, and, as such, an ecosystem of great JSON-reading tooling has grown to support that. Browsers have built-in support for viewing JSON as trees of data, and many software editing tools (including Visual Studio) offer syntax highlighting and autocompletion for JSON files. Between the tools and my own prior experience, reading (and to a lesser extent writing) JSON is about as easy as using plain text*.
Finally, since JSON works both as a data transfer format and a data storage format, a lot of reuse possibilities are available. Systems designed to work with stored data can just as easily be fed remote data (provided that it's properly validated), and those that work with remote data can be fed stored data (which can function as a replay or pre-scripted event.) As such, reaching for JSON as my first tool of choice helps keep possibilities open, even when transferring data isn't (yet) on the road map.
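As a sketch of what that reuse can look like, here is one parsing path fed from both a file and the network (the Recipe POCO is the same illustrative one as above; the file path and URL are hypothetical):

using System;
using System.IO;
using System.Net.Http;
using Newtonsoft.Json.Linq;

// The same illustrative POCO as in the earlier example.
class Recipe
{
    public string Name { get; set; }
    public string[] Ingredients { get; set; }
    public string[] Instructions { get; set; }
}

class ReuseExample
{
    // A single parsing path serves stored and remote data alike.
    static Recipe LoadRecipe(string json) =>
        JObject.Parse(json).ToObject<Recipe>();

    static void Main()
    {
        // Stored data...
        var fromDisk = LoadRecipe(File.ReadAllText("recipe.json"));
        Console.WriteLine(fromDisk.Name);

        // ...and remote data flow through the same code. (Blocking on
        // .Result keeps the sketch short; production code would validate
        // the payload and use async/await.)
        using (var client = new HttpClient())
        {
            var fromWeb = LoadRecipe(
                client.GetStringAsync("https://example.com/recipe.json").Result);
            Console.WriteLine(fromWeb.Name);
        }
    }
}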
JSON is an excellent choice when you need to transfer human-readable data between machines. Even when you don't, using JSON can help keep your options open (provided that there isn't a compelling reason to use another tool.) If you haven't had the chance to work with JSON, I highly recommend giving it a try: you might be surprised what you can create with it*.