Databases 101

29 May 2020

“Computational processes are abstract beings that inhabit computers. As they evolve, processes manipulate other abstract things called data. The evolution of a process is directed by a pattern of rules called a program. People create programs to direct processes.”

I came across this profound description of a computational process in SICP, which substantially changed how I think about programming. It simplified programming into two essential things, data, and rules: no languages, no frameworks. I’m already aware of the rules, and the data side of things is something I became curious about ever since.

I followed my curiosity and discovered many things. I came across tools and techniques that helped me think clearly with a data-first approach. As time passed, I started realizing that most of the software, in essence, is some sort of UI in the form of a website, mobile app, wearable, or an AR/VR layered over a very well-defined data models that are highly specific to the business. That realization blew my mind; I couldn’t stop thinking about data models and relations. I had to respond to this calling immediately, and that’s how I ended up here with an urge to learn everything about databases.

I’m relatively new to databases; I needed to learn things from the ground up. So, I researched quite a bit, found some fantastic resources online, and spent some time learning. I thought of writing about all my learnings’ to clarify my understanding and here are my reading notes;

Databases and management systems

A database is the model representation of some aspect of the real world. A database management system (DBMS) is a tool that allows end-users to interact with this data to organize, store, and analyze.

File-based system

The most straightforward implementation of a database is a flat-file (e.g., a .csv file), and we can use any programming language as DBMS to read this record and analyze it. They are known as filed-based systems.

It worked, but it is incompetent for different reasons;

Well, it’s slow. (When you think of billions of records)
Data Integrity
- How do we ensure there are no duplicates?
- How do we ensure we are using proper data types during manipulation? (E.g., Update a string type with a and vice versa)
Implementation
- What if we want another application to use this file?
- What if two threads try to write to the same file?
Durability
- What happens to our data if an application crashes?

This approach is not sophisticated, and it’s expensive.

To overcome this, many organizations, computer engineers came up with different designs of DBMSs. But these initial DBMSs were challenging to build and manage. Tightly coupled physical and logic layers made it painful to write and maintain databases because you have to tell the system how to store information as a specific type in a database and build logic.

E.g., If you store a piece of data as a tree structure, or as a hash table in your database. You have to use different types of queries (APIs) to handle it, depending on the model.

Programmers need to rewrite the database management system every time there is a significant change in application-related data.

Sounds unproductive, right? Welcome to relational databases; Thanks, Codd. He got baffled by all these rewrites at IBM and decided to solve this problem.

Relational databases

So, Ted Codd came up with a theory to abstract things and avoided rewrites and maintenance. He inscribed some of the great ideas in this paper A Relational Model of Data for Large Shared Data Banks.

In his theory, he suggests;

Store database in simple data structures
Access data through a high-level language
Leave physical storage to implementation

To abstract things like this we have to understand two important concepts;

Data model — It is a collection of concepts for describing the data in a database. E.g., Relational, Key/Value, Document
Schema — It is a description of a particular group of data using a given data model.

This theory helped people build relational databases that solved all the issues faced earlier;

Structure — Definition of relations
Integrity — Constraints
Manipulation — Access and modify

In relational databases, DBMS tools will take care of implementation and storage, end-users will manipulate it using different languages.

We can manipulate a relational database using two different data types of languages;

Procedural; The query specifies the (high-level) strategy the DBMS should use to find the desired result (Relational algebra)
Non-Procedural; The query specifies only what data is wanted and not how to find it (Relational calculus)

Both these manipulation languages use relational algebra to manipulate data.

In a relational database, there are only two essential elements; Relations and Tuples. They are the building blocks for storage and retrieval.

SQL

SQL is a query language for data manipulation. It implements relational algebra or relational calculus.

SQL is not a single language, but it’s a collection of multiple things;

Data Manipulation Language (DML) is a collection of things that manipulate data. E.g., statements; INSERT, UPDATE, SELECT.
Data Definition Language (DDL) is used to create and modify database objects such as tables and indexes. E.g., statements; CREATE, DROP.
Data Control Language (DCL) is used to define control access to data stored in a database. E.g., statements; GRANT, REVOKE.

That’s it for now; this is a broad introduction to databases. I will cover more, especially the operations in databases and query languages in my next post.