10 essential skills: (3) database knowledge
- Pamela Kinga Gill
- Jan 15, 2019
- 5 min read
Database knowledge is an essential skill for the big data professional. Whether data forms the backbone of the company, an arm, a leg, or even a vestigial structure, every enterprise is going to have some database management system (DBMS) in place. In today's digital age, the selection of that system is a careful, if not critical, decision for the enterprise as determined by its requirements.
Therefore, to manage, store, and analyze the volume of data being captured today, the professional must have an in-depth understanding of databases and how different databases are going to be more or less appropriate for their needs. In particular, a clear understanding of the distinction between relational and non-relational database management systems is essential. This post describes the key differences between these systems, their features and strengths, and how this knowledge is applied in context.

Firstly, how important is a database? Well, you tell me...
In May 2017, The Economist published two thought-provoking articles under the common theme of "data as fuel". The first, entitled “The world’s most valuable resource is no longer oil, but data,” describes data as a lucrative commodity, and not unlike oil, a resource dominated by a handful of tech giants. The article quite literally draws a parallel between those giants, Amazon, Google, Microsoft, Uber, Facebook, and Tesla, with oil conglomerates, illustrating them as off-shore deep sea oil rigs drilling through the depths of the ocean (of data).
The second article, published under the heading "Fuel of the future", is entitled "Data is giving rise to a new economy." This article outlines the role data plays in transforming industry, competition, markets, and regulation/policy.
From subway trains and wind turbines to toilet seats and toasters—all sorts of devices are becoming sources of data - Source
Needless to say, the volume of data is exploding and companies are racing to translate "big data" into actionable insights. In parallel, there is an equally explosive need to collect, organize, store, process, and analyze that data. Enter step one: the database.
The fun begins ... almost. What is a database?
A database is a set of data stored in a computer [and] ... is usually structured in a way that makes the data easily accessible - Source
Database structures and their models
Every database has an underlying logical structure that is very important to understanding how the database operates and satisfies key requirements.
This structure is determined by a database model. The model shows the relationships and constraints that define how data is organized, stored, accessed, and manipulated. Think about this model as a governing set of rules and concepts that guide the database architecture. Then, this architecture is what develops the databases we use today.
The variety of database models has evolved alongside computers; with the advent of faster, more efficient, and cheaper commercial computing, the need to store, organize, access, and process data in different ways has emerged to serve a variety of requirements for the enterprise. This article is a fantastic Brief History of Database Management and describes the functional evolution of different models behind DBMSes.
Let's begin high-level: relational and non-relational database models
Databases can be divided according to whether they follow a relational or a non-relational model.
Relational database models and databases
This is important. The term "relational database" was invented by E. F. Codd at IBM in 1970 and introduced in his research paper "A Relational Model of Data for Large Shared Data Banks." Its structure identifies and accesses data in relation to other data, and most of us understand this organization as a series of tables made up of rows and columns, much like a spreadsheet (also understood as "tabular relations"). Codd built his relational database model around the concept of data normalization, which reduces duplicated data and so saved file space on storage disk drives. This was significant at a time when computing was still prohibitively expensive.
Databases built on this model are termed relational database management systems (RDBMS).
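To make normalization a little more concrete, here is a minimal sketch using Python's built-in sqlite3 module (SQLite is one of the open-source systems listed below). The table and column names are hypothetical, chosen only for illustration. The customer's name is stored once, and each order points back to it through a key; a join then relates the two tables.

```python
import sqlite3

# In-memory database; the schema and the data are made up for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (
        id   INTEGER PRIMARY KEY,
        name TEXT NOT NULL
    );
    CREATE TABLE orders (
        id          INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customers(id),
        item        TEXT NOT NULL
    );
""")

conn.execute("INSERT INTO customers (id, name) VALUES (1, 'Ada')")
conn.executemany(
    "INSERT INTO orders (customer_id, item) VALUES (?, ?)",
    [(1, "keyboard"), (1, "monitor")],
)

# The name lives in one row of `customers`; the join relates it to each order.
rows = conn.execute("""
    SELECT c.name, o.item
    FROM orders AS o
    JOIN customers AS c ON c.id = o.customer_id
    ORDER BY o.id
""").fetchall()
print(rows)
```

If the customer's name changes, only one row in the customers table needs to be updated, which is the practical payoff of storing each fact once.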
Popular closed-source RDBMSes today are: Oracle, IBM DB2, and Microsoft SQL Server.

Popular open-source RDBMSes today are: MySQL, SQLite, and PostgreSQL.

* As suspected, almost all relational databases use Structured Query Language (SQL) to maintain and query the database.
A simplified (and incomplete) list of advantages to relational databases is:
Simplicity
Ease of data retrieval
Data integrity
Defined schema
Flexibility for scaling
Normalization
High Speed
Large data volume
Low code approach
Easy Tuning
Easy Reporting and Analytics facility
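Two items on this list, data integrity and a defined schema, are easy to see in action. The following is a small sketch, again with Python's sqlite3 module and a hypothetical schema: the database itself refuses rows that break the declared rules. Note that SQLite only enforces foreign keys when the foreign_keys pragma is switched on; other systems may behave differently.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite checks foreign keys only when this pragma is enabled.
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
    CREATE TABLE customers (
        id   INTEGER PRIMARY KEY,
        name TEXT NOT NULL
    );
    CREATE TABLE orders (
        id          INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customers(id),
        quantity    INTEGER NOT NULL CHECK (quantity > 0)
    );
""")

def try_insert(sql, params):
    """Run one insert and report whether the schema accepted it."""
    try:
        conn.execute(sql, params)
        return "accepted"
    except sqlite3.IntegrityError as exc:
        return f"rejected ({exc})"

results = [
    # Valid customer.
    try_insert("INSERT INTO customers (id, name) VALUES (?, ?)", (1, "Ada")),
    # Missing name violates NOT NULL.
    try_insert("INSERT INTO customers (id, name) VALUES (?, ?)", (2, None)),
    # Valid order for an existing customer.
    try_insert("INSERT INTO orders (customer_id, quantity) VALUES (?, ?)", (1, 3)),
    # Zero quantity violates the CHECK constraint.
    try_insert("INSERT INTO orders (customer_id, quantity) VALUES (?, ?)", (1, 0)),
    # Customer 99 does not exist, so the foreign key is violated.
    try_insert("INSERT INTO orders (customer_id, quantity) VALUES (?, ?)", (99, 1)),
]
for line in results:
    print(line)
```

The point of the sketch is that the rules live in the schema, so every application writing to the database is held to them.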
Non-relational database models and databases
Let's look to the alternative: non-relational databases. Naturally, the non-relational database is one that does not follow the relational model. This category of databases, also referred to as NoSQL databases, has grown in popularity with the rise of "big data" applications. Websites like Google, Twitter, and Facebook use this type of technology.
NoSQL is also understood as "not only" SQL because these non-relational databases may support SQL-like languages. NoSQL databases are motivated to serve requirements that are troublesome for the relational database, such as horizontal scaling. Examples of popular NoSQL databases are: MongoDB, Cassandra, and CouchDB.
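Horizontal scaling means spreading data across several machines rather than buying one bigger machine. As a rough illustration of the idea only, the sketch below assigns each key to one of three hypothetical nodes by hashing it. The node names and the put/get helpers are assumptions made for this example, and real NoSQL systems typically use more elaborate schemes (consistent hashing, replication) than this simple modulo approach.

```python
import hashlib

# Hypothetical node names; in practice these would be separate servers.
NODES = ["node-a", "node-b", "node-c"]

# Each "node" is simulated here by an ordinary dictionary.
shards = {name: {} for name in NODES}

def node_for(key: str) -> str:
    """Pick a node deterministically from a hash of the key."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(NODES)
    return NODES[index]

def put(key: str, value) -> None:
    shards[node_for(key)][key] = value

def get(key: str):
    return shards[node_for(key)].get(key)

for user_id in range(10):
    put(f"user:{user_id}", {"id": user_id})

print(get("user:7"))
print({name: len(data) for name, data in shards.items()})
```

Because the same key always hashes to the same node, a read knows where to look without consulting the other nodes. The trade-off is that queries spanning many keys, such as joins, become harder.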

A simplified (and incomplete) list of advantages to non-relational databases is:
Speed: often faster than a traditional (relational) database for the workloads they are designed for
High Performance
Works with rapidly changing data types: structured, semi-structured, and unstructured data
Less need for extract, transform, load (ETL) procedures
Scalable
Dynamic schema
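The "dynamic schema" item is perhaps easiest to appreciate with a toy example. The sketch below imitates a document-style collection using plain Python dictionaries. It is not the API of MongoDB, CouchDB, or any real product; the insert and find helpers are hypothetical. Notice that the two records carry different fields, and nothing had to be declared up front.

```python
import json

# A toy "collection": just a list of dictionaries (documents).
collection = []

def insert(doc: dict) -> None:
    """Store a copy of the document; no schema is checked."""
    collection.append(dict(doc))

def find(**criteria) -> list:
    """Return documents whose top-level fields match all criteria."""
    return [
        doc for doc in collection
        if all(doc.get(field) == value for field, value in criteria.items())
    ]

# Two documents with different shapes live side by side.
insert({"name": "Ada", "email": "ada@example.com"})
insert({"name": "Grace", "languages": ["COBOL"], "address": {"city": "Arlington"}})

matches = find(name="Grace")
print(json.dumps(matches, indent=2))
```

The flexibility cuts both ways: with no schema to reject malformed records, the job of validating data shifts to the application.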
Review of relational vs. non-relational
Here are some resources I recommend that further describe the differences between these two models:
Published by MongoDB: "What is a non-relational database?"
Published by Codecademy: "What is a relational database management system?"
Published by Upwork: "SQL vs. NoSQL: What's the difference?"

Ultimately, understanding how different databases organize, store, access, and manipulate data is going to determine which database gives you the type of performance and functionality you need to make the most of your data. Collecting more data is meaningless if its power can't be harnessed - sort of like the step between collecting crude oil and refining it into gasoline, diesel fuel, etc. - and this is going to depend critically on the data model/database employed.
Big data and NoSQL (non-relational) databases
Believe it or not, there are many more database models in use. Many of them fit into the "NoSQL" category, and for good reason....
While RDBMS systems were an efficient way to store and process structured data, processing speeds got faster, and “unstructured” data (art, photographs, music, etc.) became much more common place. Unstructured data is both non-relational and schema-less, and Relational Database Management Systems simply were not designed to handle this kind of data - Source
I also really like this blog post that uses diagrams to explain "Why relational databases are not the cure-all." It motivates the conversation by introducing a range of database models, several of which fall outside the strictly relational approach, such as:
Hierarchical
Network
Object-oriented
Entity-relationship
Document
Entity-attribute-value
Star schema
Object-relational
It is fascinating to learn how each of these serves key functionalities. Here is a simple visual of some popular databases that are built off of the different NoSQL models.

It's not necessary to get into the details behind each model, per se. At least, not yet. Perhaps this post is a precursor to an article devoted to the NoSQL data models. Nevertheless, the objective here was to note that key distinctions between relational and non-relational databases persist and are relevant. The big data professional will find they are required to use large data sets that are unstructured or semi-structured, continuously evolving, and growing exponentially. Selecting the right database among the alternatives will become crucial depending on the requirements. Important trade-offs exist.
I hope this post illuminates the importance of database knowledge and gives the reader a new-found appreciation for the technology underlying database systems in use today.