One interesting aspect of data warehouse design is the use of
surrogate keys versus
natural keys. Natural keys are those unique identifiers that are used in
operational systems. For example, your business probably has a core table that stores data about your customers. Each customer gets a unique identifier that identifies that customer in that system. For the most part, natural keys are only used once and are unique across at least one operational system. Surrogate keys, on the other hand, are a unique identifier in your table or database. A natural key may not be unique in the data warehouse envronment and require the use of surrogate keys. These surrogate keys are then used as the unique identifier throughout your data warehouse.
In my experience, natural keys can be anything from numbers, to text, to a combination of numbers and text. I have also seen different natural keys coming from different operational systems that refer to the same entity (for example Customer A in one system may have the identifier of 123 but can have the identifier of 987 in another). When you build a data warehouse the natural key is not sufficient for long-term storage of data nor for company-wide reporting purposes. At some point, these keys will have to be reused or on the flip side, a customer may have to get a new natural key. Plus, you‘ll want to track some historical data on customers, and this is most easily accomplished with the use of surrogate keys (aka
slowly changing dimensions).
The question then becomes what type of surrogate key should be used. If you are using SQL Server, you have essentially three choices:
GUID unique identifier, 4-byte Integer identity and 8-byte integer identity.
A GUID is globally unique. Globally, means exactly that: you won’t come across two GUIDs that are the same. This is due to the complex algorithm used to generate these IDs; typically, it includes the unique system identifier where the application is installed. This sounds great and is very useful if you are sharing or moving data across applications and data bases. However, this is a 16-byte data type which takes quite a bit of space both in storage, indexes and computing power. It’s not a key you’d want to use in an OLTP application and could be too cumbersome even for a data warehouse, but to ensure ultimate uniqueness and for higher security, it’s a good choice.
Integer identities can be either 4-byte or 8-byte. These identities start at any integer within their range, positive or negative and are assigned in sequential order at an interval you choose. The benefit of integer identities is that they are small and easily indexed. They will not be unique even across a database, but if all you’re looking for is a unique row identifier, this is your choice. The difference between the two is the range, starting at -2,147,483,648 for 4-byte and -9,223,372,036,854,775,808 for 8-byte. Typically, integers start at 0 and increment by 1, but you get double the total number of unique ids if you begin at the minimum with an interval of 1. Choosing between a 4-byte and 8-byte depends on how you’ll be using the IDs and how many unique rows you expect to acquire.
I hope this explanation will help you next time you’re deciding which surrogate key to use.