Database Normalization
By: Cam Wohlfeil
Published: 2018-12-11 1130 EST
Category: General
Tags:
databases
This is a more advanced topic and assumes you have basic knowledge of relation databases. If you don’t know what an RDMBS or datatypes are, this isn’t for you quite yet :). If you need help visualizing these concepts, check out the referenced Wikipedia articles and YouTube video. For examples I’ll use a pretty standard conceptual database, one for an online shopping (ecommerce) system.
The goal of database normalization is simply to prevent repetition of data. This has the additional benefits of preventing anomalies like data not being created, updated or deleted (CUD) when it should, and limiting (if not eliminating) the need to CUD data in several locations. While they increase in complexity and decrease in importance as you go up the chain, a well-designed and maintainable database should easily meet the first 3 normal forms. Also, while it’s not always mentioned, you must meet all previous forms to meet the higher forms. This means that to be in third normal form you must already be in first and second normal form.
First Normal Form (1NF)
The first normal form requires that:
- Data is atomic, i.e. only contains one piece of information.
- Entries in a column are of the same datatype, i.e. VARCHAR, INT, DATETIME. While not strictly required, the data should be comparable as well, i.e. addresses in an address column, names in a name column, dates in a date column, etc.
- Rows should be uniquely identified, i.e. have a primary key.
The easiest way to achieve this is by separating out data in to separate tables, i.e. a table for customer data and a table for order information which relates to the customer data (using foreign keys), with a RDBMS generated primary key column.
Second Normal Form (2NF)
The second normal form requires that all attributes (columns with actual data, not keys) should be dependent on the key. What does this mean? If you have a customer table, the price of an order would not be dependent on the customer key. The price does not depend on who buys it, and who buys it does not depend on the price. The price should be in a product table, and the order in a junction table that relates to the customer who made the order and the product they ordered.
Third Normal Form (3NF)
The third normal form requires that all fields (columns) can be determined only by the key in the table and no other column. This is where you really start cracking down on data repetition by breaking out data in to tables and relying on relationships.
In your product table, you’re likely to have multiple products from the same supplier, multiple orders from the same customers, and multiple charges to the same payment processors, etc. Rather than repeating their info in the tables for products, payments, orders, etc., instead we create tables with only unique data and relate the data to the necessary tables. There’s a stricter form of this called Boyce–Codd normal form (BCNF or 3.5NF), but it’s pretty rare for 3NF tables to not meet this standard.
Fourth Normal Form (4NF)
The fourth normal form is often left out when talking about database normalization, maybe because it’s more complicated and harder to achieve. It requires that there are no multi-valued dependencies, i.e. one column should not have multiple values that causes repetition in the other columns.
For example, if our company were to manage multiple warehouses we’d have to track what products and shipping methods are available at each of them. If we create a table with columns for each location, each product at that location, and each shipping method available, even using foreign keys as we should we’ll still be repeating tons of data. Instead we break this in to multiple tables, such as one for the products at each location and one for the available shipping methods at each location.
Other Normal Forms
Feel free to dig in to the referenced Wikipedia article for other normal forms, such as 5NF and 6NF, but these are much more complicated and less useful.