Database Programming :: Lessons :: Normalization
Redundancy and Anomalies
Redundancy in a database is when attribute values are repeated unnecessarily in a table. Check out the example below.
Team Name | Stadium Name | Player Name | Annual Salary |
---|---|---|---|
Chicago Cubs | Wrigley Field | Kris Bryant | $652,000 |
Chicago Cubs | Wrigley Field | Jon Lester | $20 million |
Chicago White Sox | U.S. Cellular Field | Chris Sale | $9.15 million |
Chicago White Sox | U.S. Cellular Field | Todd Frazier | $7.5 million |
St. Louis Cardinals | Busch Stadium | Yadier Molina | $15 million |
What happens if the White Sox changes the name of their stadium? In this case every tuple involving the White Sox would have to be changed. If we forget to change a tuple this could lead to an update anomaly where information in the database does not match. A deletion anomaly can occur when data is lost after a deletion. If we delete Yadier Molina from the database we would lose the stadium that the St. Louis Cardinals play in.
What if a new team, say the Las Vegas Extra Terrestrials, were added to the database? When the team is created they will not have any players until the expansion draft. In this case, a part of the primary key, the Player Name field, will be null, which isn't allowed. This is known as an insertion anomaly.
You can fix an anomaly by decomposing the relation into two tables like so:
Team (Team Name, Stadium Name, Player Name, Annual Salary) Team (Team Name, Stadium Name) Chicago Cubs Wrigley Field Chicago White Sox U.S. Cellular Field St. Louis Cardinals Busch Stadium Player (Team Name, Player Name, Annual Salary) Chicago Cubs Kris Bryant $652,000 Chicago Cubs Jon Lester $20 million Chicago White Sox Chris Sale $9.15 million Chicago White Sox Todd Frazier $7.5 million St. Louis Cardinals Yadier Molina $15 million
There is no longer any redundancy in the Team table after decomposing it into two tables. We can insert new teams without leaving a primary key null since the Player Name primary key is now in the Player table. In addition, we can delete players from the Player table without losing the stadium information. Decomposing relations does make queries take longer, but the elimination of anomalies is worth the loss of speed.
Functional Dependency
Functional dependency is when one attribute of a relationship uniquely determines another attribute. Let's use the original Team table from earlier: Team (Team Name, Stadium Name, Player Name, Annual Salary). The following functional dependencies (FD) exist in the table:
Team Name --> Stadium Name
Team Name and Player Name --> Salary
Stadium Name -/-> Player Name
Both Team Name and Player Name determine Salary since a player could move to another team and make a different salary. Stadium Name does not determine Player Name since, again, a player could move to a different team or the team could move to a different stadium.
Functional dependencies are established by the database designer and must hold true for all possible data values. Functional dependencies can be enforced during database insertion if programmed into the database design. Examine the following table: Class (Teacher Name, Class Number, Section, Class, Department, Department Chair)
Teacher Name | Class Number | Section | Class | Department | Department Chair |
---|---|---|---|---|---|
Derek Miller | 605 | 1 | AP Computer Science | Computer Science | Jon Calder |
Tim Peters | 302 | 4 | Physics | Science | Harry Wolf |
Derek Miller | 605 | 2 | AP Computer Science | Computer Science | Jon Calder |
Seth Hettel | 515 | 1 | PLTW Engineering | Engineering Technology | Jon Calder |
The functional dependencies of the Teacher table are the following:
Class Number --> Class, Department, Department Chair
Class Number, Section --> Teacher Name
Department --> Department Chair
Section --> Class
We cannot add classes to the table unless they are assigned a class number, which is an insertion anomaly. If a new department chair is hired for the Computer Science department we will have to change a number of tuples, which is an update anomaly. If we remove the Seth Hettel tuple we will lose all of the information connected to him, including the department and department chair.
Normalization
An unnormalized table has repeating groups. Use the following relation as an example where the parenthesis indicate a repeating group.
R (A, B, (c1, c2, c3)*, D, E, F)
FD = { A --> B, D, E, F
A, c1 --> c2, c3 }
For a table to be in first normal form (1NF) all values must be atomic, meaning there should be no repeating groups. The above table can be put into first normal form in the following way:
R1 (A, B, D, E, F)
R2 (A, c1, c2, c3)
The earlier Teacher table is already in 1NF since there are no repeating groups.
A set of attributes (X) is fully dependent on another set of attributes (Y) if X --> Y and Y can't be determined by any subset of X. In the Class table example above, Teacher Name is fully dependent on Class Number and Section while Class, Department and Department Chair are not fully dependent on Class Number and Section since the Class Number alone can determine the Class, Department, and Department Chair.
A prime attribute is an attribute that is part of a key of the table. In the Class table, Class Number and Section are prime attributes while Teacher Name, Class, Department, and Department Chair are non-prime attributes.
A table is in second normal form (2NF) if it is in 1NF and each of its non-prime attributes are fully dependent on the entire primary key. Since the Class Number alone determines the Class, Department, and Department Chair the Class table is not in 2NF. We can normalize the table into the following two tables to put it in 2NF.
Teacher (Class Number, Section, Teacher Name) Class (Class Number, Class, Department, Department Chair)
There are still problems with 2NF. For example, if a department chair is moved to a different department we would have to make multiple changes in the Class table, which is an update anomaly. Also, if a class is removed from the Class table we will lose information about the department, which is a deletion anomaly. Finally, an insertion anomaly occurs if we try to insert information about a new department. We can't add the new department until there are classes to assign to it.
A non-prime attribute is transitively dependent on the primary key of an attribute if there is also a non-prime attribute that functionally determine the original non-prime attribute. For example, Department Chair is transitively dependent on the Class Number since the Class Number determines the Department and the Department determines the Department Chair. A table is in third normal form (3NF) if none of its non-prime attributes are transitively dependent on its key. The following is our example in 3NF:
Teacher (Class Number, Section, Teacher Name) Class (Class Number, Class, Department) Department (Department, Department Chair