Denormalization

In a relational database, denormalization is an approach to speeding up read performance (data retrieval) in which the administrator selectively adds back specific instances of redundant data after the data structure has been normalized.

A denormalized data model is not the same as a data model that has never been normalized. Denormalization should take place only after a satisfactory level of normalization has been reached, and after any required constraints and/or rules have been created to deal with the anomalies inherent in the design. For example, all the relations are in third normal form, and any relations with join or multi-valued dependencies are handled appropriately.

Denormalization is usually done to decrease the time required to execute complex queries. The main drawback of a normalized database is performance: because data is divided and stored in multiple entities rather than in one large table, more joins are required to gather all the information from those entities.

Other reasons you may want to denormalize:

  • All or nearly all of the most frequent queries require access to the full set of joined data.
  • A majority of applications perform table scans when joining tables.
  • Computational complexity of derived columns requires temporary tables or excessively complex queries.

Denormalization input:

When deciding whether to denormalize, you need to analyze the data access requirements of the applications in your environment and their actual performance characteristics. Often, good indexing and other solutions solve many performance problems without the need to denormalize. Some of the issues to examine when considering denormalization include:

  • What are the critical transactions, and what is the expected response time?
  • How often are the transactions executed?
  • What tables or columns do the critical transactions use? How many rows do they access each time?
  • What is the mix of transaction types: select, insert, update, and delete?
  • What is the usual sort order?
  • What are the concurrency expectations?
  • How big are the most frequently accessed tables?
  • Do any processes compute summaries?
  • Where is the data physically located?

Performance advantages:

  • Minimizing the need for joins.
  • Reducing the number of foreign keys on tables.
  • Reducing the number of indexes, saving storage space, and reducing data modification time.
  • Precomputing aggregate values, that is, computing them at data modification time rather than at select time.
  • Reducing the number of tables (in some cases).

Techniques:

The most prevalent denormalization techniques are:

  • Adding redundant columns
  • Adding derived columns
  • Collapsing tables
  • Duplicating tables
  • Splitting tables

Let’s discuss each technique in detail.

Adding redundant columns:

You can add redundant columns to eliminate frequent joins. For example, if you are performing frequent joins on the titleauthor and authors tables to retrieve the author’s last name, you can add the au_lname column to titleauthor.

Adding redundant columns eliminates joins for many queries (a sketch of the change follows the list). The problems with this solution are that it:

  • Requires maintenance of new columns: you must make changes to two tables, and possibly to many rows in one of the tables.
  • Requires more disk space, since au_lname is duplicated.
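A minimal sketch of the change in Transact-SQL, matching the pubs2 examples above (the varchar(40) type is an assumption; match it to the definition of au_lname in authors):

-- add the redundant column to titleauthor
alter table titleauthor add au_lname varchar(40) null

-- populate it from the authors table
update titleauthor
set au_lname = a.au_lname
from titleauthor, authors a
where titleauthor.au_id = a.au_id

From then on, any application that changes au_lname in authors must also update titleauthor (see "Managing denormalized data" below).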

Adding derived columns:

Adding derived columns can eliminate some joins and reduce the time needed to produce aggregate values. The total_sales column in the titles table of the pubs2 database provides one example of a derived column used to reduce aggregate-value processing time.

Before denormalization, a frequent join between the titleauthor and titles tables is needed to provide the total advance for a particular book title:

select title, sum(advance)
from titleauthor ta, titles t
where ta.title_id = t.title_id
group by t.title_id, title


After adding derived column (sum_adv):

select title, sum_adv from titles


You can create and maintain a derived data column (sum_adv) in the titles table, eliminating both the join and the aggregate at runtime.

Note: This increases storage needs, and requires maintenance of the derived column whenever changes are made to the titles table.
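One way to create and populate the column, as a sketch (it assumes, as the join above implies, that advance is stored per titleauthor row, and that money matches its type):

-- add the derived column to titles
alter table titles add sum_adv money null

-- precompute the aggregate; it must be refreshed whenever the
-- underlying advance values change
update titles
set sum_adv = (
    select sum(ta.advance)
    from titleauthor ta
    where ta.title_id = titles.title_id
)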

Collapsing tables:

If most users need to see the full set of joined data from two tables, collapsing the two tables into one can improve performance by eliminating the join. For example, if users frequently need to see the author name, author ID, and the blurbs copy data at the same time, the solution is to collapse the two tables into one. The data in the two tables must be in a one-to-one relationship for collapsing to be possible.

Collapsing the tables eliminates the join but loses the conceptual separation of the data. If some users still need access to only one of the original tables' data, this access can be restored by using queries that select only the needed columns or by using views.
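A sketch of collapsing the two pubs2 tables (author_blurbs and blurbs_v are hypothetical names; select into is the Transact-SQL shortcut for creating a table from a query):

-- build the collapsed table from the one-to-one pair
select a.au_id, a.au_lname, a.au_fname, b.copy
into author_blurbs
from authors a, blurbs b
where a.au_id = b.au_id

-- restore the blurbs-only shape for users who still need it
-- (run as a separate batch)
create view blurbs_v as
select au_id, copy from author_blurbs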

Duplicating tables:

If a group of users regularly needs only a subset of data, you can duplicate the critical table subset for that group.
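A sketch, again in Transact-SQL (sales_titles and the subset condition are hypothetical):

-- copy the critical subset into a duplicate table for one user group
select title_id, title, price
into sales_titles
from titles
where type = 'business'

Because the data is duplicated, the copy must be refreshed or rebuilt whenever the source table changes.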

Splitting tables:

Sometimes splitting normalized tables can improve performance. You can split tables in two ways:


Horizontal splitting: Use horizontal splitting if:

  • A table is large, and reducing its size reduces the number of index pages read in a query.
  • The table split corresponds to a natural separation of the rows, such as different geographical sites or historical versus current data. You might choose horizontal splitting if you have a table that stores huge amounts of rarely used historical data, and your applications have high performance needs for current data in the same table.
  • Table splitting distributes data over the physical media; however, there are other ways to accomplish this goal.

Generally, horizontal splitting requires different table names in queries, depending on values in the tables. In most database applications this complexity far outweighs the advantages of table splitting. However, if many queries perform table scans, horizontal splitting may improve performance enough to be worth the extra maintenance effort.

For example, suppose a table of employee records has an emp_status column whose values are ‘active’ and ‘inactive’, and the application usually accesses only the active records. Instead of reading the whole table every time, you can split it horizontally into two tables, Active_employee and Inactive_employee, as sketched below.
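A sketch of the split (the employee table and its columns are hypothetical):

-- carve the hypothetical employee table in two on emp_status
select * into Active_employee
from employee
where emp_status = 'active'

select * into Inactive_employee
from employee
where emp_status = 'inactive'

Queries against current data then read only the much smaller Active_employee table.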

Vertical splitting: Use vertical splitting if:

  • Some columns are accessed more frequently than other columns.
  • The table has wide rows, and splitting the table reduces the number of pages that need to be read.

When a table contains very long columns that are accessed infrequently, placing them in a separate table can greatly speed the retrieval of the more frequently used columns. With shorter rows, more data rows fit on a data page, so for many queries fewer pages need to be read.
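A sketch of a vertical split of the same hypothetical employee table, moving a rarely read wide column into its own table keyed by emp_id:

-- narrow, frequently used columns stay together
select emp_id, emp_name, emp_status
into Employee_core
from employee

-- the wide, rarely read column moves to its own table
select emp_id, resume_text
into Employee_resume
from employee

Queries that need both halves join on emp_id; a view over that join can preserve the original table's shape.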

Managing denormalized data:

Whatever denormalization techniques you use, you need to ensure data integrity by using:

  • Triggers, which can update duplicated or derived data anytime the base data changes.
  • Application logic, using transactions in each application that updates denormalized data, to make sure that changes are atomic.
  • Batch reconciliation, run at appropriate intervals, to bring the denormalized data back into agreement with the base data.
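As one illustration, a trigger sketch that keeps the redundant titleauthor.au_lname column from the first technique in sync when an author's last name changes (Transact-SQL; assumes the column added earlier):

-- fires on updates to authors and propagates the new last name
create trigger authors_upd
on authors
for update
as
update titleauthor
set au_lname = i.au_lname
from titleauthor, inserted i
where titleauthor.au_id = i.au_id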
