Database Normalization

Put simply, normalization is an attempt to make sure you do not destroy true data or create false data in your database. Errors are avoided by representing a fact in the database one way, one time, and in one place. Duplicate data is a problem as old as data processing. Efficient and accurate data processing relies on the minimizing redundant data and maximizing data integrity. Normalization and the Normal Forms (NF) are efforts to achieve these two core objectives of data processing. This article will examine the concept of normalization in-depth.

Database Normalization

Normalization is the process of reducing duplication in a database, with the ultimate goal of eliminating duplicate data entirely. While duplicated data can cause a database to be greedy with disk space, the bigger issue is consistency. Duplication creates the risk of data corruption when information is inserted, updated, or deleted, by having a particular piece of information in more than once place.

Normalization as a formal process has been around since the early 1970s when Edgar Codd proposed the first normal form (1NF). Codd later defined 2NF and 3NF. Since that time, several higher level normal forms have been defined. 4NF and 5NF are often used only in an academic sense because queries against “real world” data in a 4NF or 5NF schema contain excessive number of joins and require extensive use of views. 6NF is fairly new and contains rules specific to temporal data. As one progresses from 1NF on 5NF and even 6NF, the data modeling requirements become more strict and each new form provides a lower degree of duplication than the preceding form.

{mospagebreak title=First Normal Form (1NF)}

First Normal Form (1NF)

First Normal Form states that each attribute in a relation has to be atomic and scalar. It also states that a table must have a primary key to identify any particular row. You cannot store a list of values in a single column and you cannot store a list of columns that emulate a fixed size list – also known as a repeating group. For example, suppose you were to create a table to hold contact information for business contacts:

CREATE TABLE Contacts (

      first_name*,
      last_name,
      company,
      phone1,
      phone2,
      phone3,
      fax,|
      email
);

* Data types ellided

This table contains a repeating group – phone1-3. More often than not, several of those columns will be empty, which is a waste. It is also nearly impossible to enforce any uniqueness for this group because a constraint cannot be declared such that any phone1 is unique across phone1, phone2 and phone3.

Consider another set of problems – what if you want to add a company, not a person, to your Contacts table? Or just a phone number and a first name? You would have to leave one or more of the candidate key columns (first_name, last_name, company), which would most certainly violate the only possible primary key constraint. What if you want to delete a fax number from your table? If you simply DELETE WHERE fax=X, you may be deleting people as well.

{mospagebreak title=Second Normal Form (2NF)}

Second Normal Form (2NF)

Second normal form states that there are no non-trivial functional dependencies on a non-key attribute. Let’s look at this table:

CREATE TABLE Sales (

            name,
            extension,
            sales_manager,
            store_location,
            customer_name,
            customer_address,
            product_name,
            product_price
);

The primary key for this table could be defined as (name, store_location, customer_name), giving us a 1NF table. There is a problem though when considering 2NF – what happens if a customer changes their mind at the last minute and changes the product? The product_price column has to change, which means the product_price column has a functional dependency on the product_name column. In order for this table to be 2NF, we would need to have a separate table for products that stored the price – something like this:

CREATE TABLE Sales (

            name,
            extension,
            sales_manager,
            store_location,
            customer_name,
            customer_address,
            product_id
);

CREATE TABLE Products (

            product_id,
            name,
            price
);

In this case there are no functional dependencies, but it is obvious that there are still some issues to be resolved. Let’s have a look at 3NF.

{mospagebreak title=Third Normal Form (3NF) and BCNF}

Third Normal Form (3NF) and Boyce-Codd Normal Form (BCNF)

Third normal form states that a table must have no transitive dependencies. This means that a row could be uniquely identified by each column individually but that no column depends on any other column to identify the row. If columns X, Y and Z exist, deleting any two columns will still leave a set of uniquely identifiable rows. BCNF extends 3NF, stating that no non-trivial functional dependencies can exist on anything other than the superkey – that is, a superset of the candidate keys.

Typically, 3NF means there are no transitive dependencies. A transitive dependency is when two columnar relationships imply another relationship. For example, name -> extension and extension -> store_location, so name -> store_location, which is not a dependency we want to model in our table and could lead to faulty data. In the table we have defined, though, we still have a 3NF table. Also, what happens if an employee changes extension and the old sales records aren’t updated? Or if a customer moves? These entry points for error are problematic but acceptable in 3NF because the dependencies are trivial. What we need to look to is BCNF, or Boyce-Codd Normal Form, which requires all transitive dependencies be eliminated in addition to the table being 3NF.

CREATE TABLE Sales (

            employee_id,
            customer_name,
            product_id
);

CREATE TABLE Employees (

            employee_id,
            manager_id,
            name,
            extension
);

CREATE TABLE Customers (

            name,
            address
);

CREATE TABLE Products (

            product_id,
            name,
            price
);

Now information about employees, customers, and products are isolated from the Sales table. There are obvious practical problems with this example, such as product price changes breaking sales records, but that is a matter for another discussion. Each fact is represented in a single row in a table – an employee’s extension and manager are listed once in the Employees table, rather than repeatedly (and probably erroneously) in the Sales table.

{mospagebreak title=Fourth Normal Form (4NF) and Fifth Normal Form (5NF)}

Fourth Normal Form (4NF) and Fifth Normal Form (5NF)

Fourth normal form requires that a table be BCNF and contain no multi-valued dependencies. 4NF solves the problem of multivalued dependencies, which we would encounter with our above Sales table when a customer purchased multiple products with a single order – we would have numerous rows that were dependent on one another without any clear definition of that dependency in the table.

Fifth normal form, also known as join-projection normal form (JPNF), states that no non-trivial join dependencies exist. 5NF states that any fact should be able to be reconstructed without any anomalous results in any case, regardless of the number of tables being joined. A 5NF table should have only candidate keys and it’s primary key should consist of only a single column.

The problem with these forms is the size of the joins that are required to reconstruct any non-trivial data.We can rely on views and procedures to simplify them but the underlying data still ends up very complex. There are also performance issues to consider – which is why 4NF and 5NF are often academic. In most cases, 3NF (or BCNF) are ideal.

{mospagebreak title=Conclusion}

Conclusion

This is only an introduction to the subject of normalization. The topic is extensive. For those of you who are developing databases on a day to day basis, knowing how to formally normalize a table can save large amounts of time in the long run. Being able to determine what level of normalization a database needs and designing to that level of normalization can be difficult in larger applications and will often have many possible solutions. Education and experience are the tools you will need to design efficient databases with minimal (or, in a dream, no) anomalies. I hope this introduction to normalization is helpful to the readers, and I intend to publish a future article covering denormalization and case studies of some database and the way in which they were normalized.

[gp-comments width="770" linklove="off" ]

antalya escort bayan antalya escort bayan Antalya escort diyarbakir escort