Put simply, normalization is an attempt to make sure you do not destroy true data or create false data in your database. Errors are avoided by representing a fact in the database one way, one time, and in one place. Duplicate data is a problem as old as data processing. Efficient and accurate data processing relies on the minimizing redundant data and maximizing data integrity. Normalization and the Normal Forms (NF) are efforts to achieve these two core objectives of data processing. This article will examine the concept of normalization in-depth.
Third Normal Form (3NF) and Boyce-Codd Normal Form (BCNF)
Third normal form states that a table must have no transitive dependencies. This means that a row could be uniquely identified by each column individually but that no column depends on any other column to identify the row. If columns X, Y and Z exist, deleting any two columns will still leave a set of uniquely identifiable rows. BCNF extends 3NF, stating that no non-trivial functional dependencies can exist on anything other than the superkey - that is, a superset of the candidate keys.
Typically, 3NF means there are no transitive dependencies. A transitive dependency is when two columnar relationships imply another relationship. For example, name -> extension and extension -> store_location, so name -> store_location, which is not a dependency we want to model in our table and could lead to faulty data. In the table we have defined, though, we still have a 3NF table. Also, what happens if an employee changes extension and the old sales records aren't updated? Or if a customer moves? These entry points for error are problematic but acceptable in 3NF because the dependencies are trivial. What we need to look to is BCNF, or Boyce-Codd Normal Form, which requires all transitive dependencies be eliminated in addition to the table being 3NF.
CREATE TABLE Sales (
employee_id, customer_name, product_id );
CREATE TABLE Employees (
employee_id, manager_id, name, extension );
CREATE TABLE Customers (
name, address );
CREATE TABLE Products (
product_id, name, price );
Now information about employees, customers, and products are isolated from the Sales table. There are obvious practical problems with this example, such as product price changes breaking sales records, but that is a matter for another discussion. Each fact is represented in a single row in a table - an employee's extension and manager are listed once in the Employees table, rather than repeatedly (and probably erroneously) in the Sales table.