Recently I got involved in a conversation about why SQL remains so popular. My friend and colleague Svetlana suggested I blog on the topic as it might be useful to others as a way of thinking about this. She noted that Chamberlin and Boyce were working in the same IBM office. They “talked to each other in SQL”, to test its viability, taking it in turns to formulate the answers to data management and query problems. They published two papers one on DDL and the other on DML. One of these was the paper that motivated Larry Ellison to adopt and promote SQL for his company, Oracle.
I’d remarked that I believe that SQL endures not just because it happens to be a popular language with wide adoption, though that is undoubtedly true. However, I believe it goes deeper than than that; it is one of the few data languages, maybe the only language, based on a firm theoretical foundation. It is underpinned with a solid theory of how you go about defining, manipulating and querying data. I’d argue that no other language has emerged so far that does this better. To replace it would mean we’d found another language that formed an equally solid base, and it’s not obvious what that would be.
Before getting into this too deeply I’d like to make clear that I’m not saying that SQL is perfect and should be used for everything. But it is a solid solution for a wide range of applications – and that I believe that this is due to more than just its existing popularity.
“The report of my death is an exaggeration” – Mark Twain
The death of SQL is constantly being predicted, and yet it continues to thrive. There was a large “NoSQL” movement which implied no use of SQL, this morphed into Not Only SQL, and now many NoSQL database system offer an SQL interface – what gives? Far from dying out SQL is extending itself into those parts of the market that seemed to be set up to avoid using it.
Let’s take a step back and consider the problem of working with any kind of data, big or small.
When we look inside our big databases / data stores what we find is:
- Some of the data represents things – let’s call them “Entities”
- Things / Entities are distinguished by various features – let’s call them “Attributes”. We need to discuss and characterize those attributes, because they distinguish entities from each other. The data tells us about the attributes too.
- Things / Entities can have different relationships to each other, and these can be represented in the data by matching data values – let’s call them “Relations”
Peter Chen did a lot of work on this, as well as Codd, Boyce, Chamberlin and Date.
These considerations are completely separate from any physical DBMS implementation. A lot of research was done into the manner in which you could represent and access information.
Once this groundwork had been done it turned out that you could represent these things using mathematical set theory – sets, unions, intersections – thus there was a significant and solid mathematical underpinning that flowed from this.
We may have forgotten this now, but we had other data access languages previously. These were basically ad hoc attempts at defining a data navigation language. They were developed by starting with a particular physical organization for the data. Then various operators and features that seemed useful would then be tacked on, until we had what felt like a more or less complete language.
Data Language I (DL/I) from IBM was basically hierarchical in nature. You could specify parent and child records. There were operators such as “Get Child”, “Get Next Child”, “Get Parent” etc. these were pretty arbitrary, and defined a physical access path. This was the language used by the mainframe CICS and IMS systems.
CODASYL represented a network organization, that is, the data structure was a graph, rather than a hierarchy. One of its main claims to fame was that it was adopted by multiple vendors and thus was one of the first multi-platform data management system. Thus, graph representations are at least 40 years old.
In those systems navigation was via physical links and pointers. You moved around the data structure by following the pointers that were embedded in the records. Having navigated a link between records you found data at the other end of the link. This implied a relationship between the record you’d come from and the record you’d arrived at.
However – a physical link between records does not necessarily imply a logical relationship. The connection may have been established in one context and should not be read in another. Interestingly, Codd’s 1970 paper, has a great example of this. In it he describes something called the “connection trap”.
The example given is a supplier can be linked to the parts supplied. Parts can then be linked to the projects they are used on. But following all paths would imply that the supplier is involved in a wide range of projects. This is not true, parts supplied to a project could have been supplied by another supplier.
We are experiencing similar modelling challenges with present day graph and document databases. These have their own place in our systems portfolios, and are very useful for the right applications, providing a natural way to represent certain types of data, and with runtime efficiency too. However, they navigate their data using particular physical structures. When the logical data organization does not suit those structures data modelling becomes harder.
With SQL, an access language based on relational theory, you specify what you want to retrieve not how you should retrieve it.
By expressing data and the relationships defined in abstract terms, purely as data, you leave the DBMS lots of degrees of freedom in how to service the request. The beauty and flexibility of the relational model is that relationships are defined by the data values themselves, not by any separate hard coded linkages or physical organization.
If your access language stipulates how to navigate the data structure (“follow this link”, “drop down a level to read the child records”) then any change in data structure breaks that access logic.
This also means that is very hard to use any kind of theory to predict or infer what a particular query will do, or how it might be transformed in order to be optimized. After all, the DBMS may not see the whole query, just individual navigation steps, so there is a limit to what it can infer.
The set theory operations of SQL are by contrast pretty easy to optimize. The query is declarative, it is specified in advance in its entirety. Thus, having seen it, the DBMS can then proceed to work out how the data should be accessed.
Having worked out the relational algebra there were several attempts at expressing this as a data access language. There was QUEL from Ingres, and the Structured English Query Langauge (SEQUEL), which became SQL, from IBM.
Teradata had their own TEQUEL early on, but decided to jump on the SQL train when it became obvious that is was the one with most momentum. See
Teradata Query execution order (For many years Teradata would continue to accept TQUEL statements that contained constructs that would only appear many years later in SQL).
As mentioned earlier, in these early days SQL was adopted by a small company called Oracle, who popularized SQL on the new Unix based systems that began competing with IBM and their mainframes.
Thus, SQL wins, not just because it’s popular and lots of people know it, but, IMHO, because it is the only, data access language that is predictable, has a robust mathematical foundation, can be generally applied, and can thus be systematically optimized. If there is another data access language that can rival SQL across all these dimensions then it’s not obvious what that might be.
SQL is a language that survives from an era of other programming languages which are now no longer used for new development. Languages such like COBOL, PL/I, FORTRAN, IBM ASSEMBLER, Algol, Algol 68, Pascal. Some presence remains but most have been largely superseded. Put another way, which languages from the 70’s and 80’s are still in widespread use – even gaining in popularity? – this should tell us something. I appreciate that some languages survive in systems awaiting a refresh, such as COBOL, PL/I, Algol. Also, that the programs they are written in are getting new leases of life with software that allows them to be ported to new platforms. However, these older languages are not the subject of substantial new development.
However, SQL remains a mainstream development language for developing new software. In a large part I believe that the substantial thought that went into it provides a sound theoretical base for how to go about information access. That thought process has stood the test of time and whilst it is now being complemented by other approaches it is not likely to be replaced any time soon.
BTW: There are many sources on the History of SQL and these are well worth exploring.
We’ll be happy to talk about this and other data and architecture related topics at our various conferences this year, for example the Data and Analytics Summits that go on throughout the year and are mentioned in the link from my colleague Ted Friedman, plus our Gartner for Technical Professionals Catalyst conferences in San Diego (August) and London (September)
Note about the title:
*after “Godel, Escher, Bach: An Eternal Golden Braid” – Douglas R. Hofstadter. Which is a philosophical oldie but goldie. I Notice that most of my blogs contain book recommendations. This isn’t deliberate, but not a bad habit to have