[BOOK REVIEW] The Data Warehouse Toolkit by Ralph Kimball/Retail Case Study

Hang Nguyen
5 min readMar 23, 2022

--

On how to design dimensional modeling in retail industry based on a case study.

Case study

Working at headquarter of a large grocery chain with around 100 grocery stores cross 5 states. Each store has grocery, frozen foods, dairy, meat, produce, bakery, floral, and healthy/beauty aids departments. Each store has 60,000 stock keeping units (SKUs) a.k.a products on shelves.

At the grocery store, concern was put on the logistics of ordering,
stocking, and selling products while maximizing profit. The profit ultimately comes from charging as much as possible for each product, lowering costs for product acquisition and overhead, and at the same time attracting as many customers as possible in a highly competitive environment. Some of the most significant management decisions have to do with pricing and promotions. (Promotions in a grocery store include temporary price reductions, ads
in newspapers and newspaper inserts, displays in the grocery store, and coupons). A rather common strategy for instance put toilet paper on sales for 50% to get a sale jump by a factor of 10, but the towels probably are being sold at a loss. As a result, the visibility of all forms of promotion is an important part of analyzing the operations of a grocery store.

Data is collected from cash registers as customers purchase products as in figure below.

Sample cash register receipt (page 73)

Pre-design dimensional model

Regardless of the industry, we strongly encourage the four-step process for tackling dimensional model designs.

Step 1: Select Business Process

In this case study, we want to understand better customer purchases as captured by POS system. Hence, the business process we are modeling is POS retail sales transactions.

This data enables the business users to analyze which products are selling in which stores in which states on which days under what promotional conditions in which transactions.

Step 2: Declare the Grain

Tackling data at its lowest atomic grain makes sense for many reasons. Atomic
data is highly dimensional. The more detailed and atomic the fact measurement, the more things you know for sure. All those things you know for sure translate into dimensions. In this regard, atomic data is a perfect match for the dimensional approach.

Although aggregated data plays an important role for performance
tuning, it is not a substitute for giving users access to the lowest level details; users
can easily summarize atomic data, but it’s impossible to create details from summary data.

The grain here is an individual product purchased on a POS transaction.

Step 3: Identify the Dimensions

A careful grain statement determines the primary dimensionality of the
fact table. You then add more dimensions to the fact table if these additional dimensions naturally take on only one value under each combination of the primary dimensions. If the additional dimension violates the grain by causing additional fact rows to be generated, the dimension needs to be disqualified or the grain statement needs to be revisited.

After identifying the grain, the following dimensions include: product, store, state, date. promotion, cashier, and method of payment.

Step 4: Identify the Facts

Example of measured facts in retail sales schema (page 77)

In this step, we need to determine which should be put into fact table. No need to include non-additive facts.

In this case study, the gross profit calculation is straightforward, but storing it means it’s computed consistently in the ETL process, eliminating the possibility of user calculation errors.

Trick: Assuming that gross revenues are $4 billion per year and that the average price of an item on a customer ticket is $2.00, you can calculate that there are approximately 2 billion transaction line items per year.

Design dimensional model

Date Dimension

Even 20 years’ worth of days is only approximately 7,300 rows, which is a relatively small dimension table.

Example of Date Dimension (page 80)
Date dimension sample rows (page 81)

Should avoid using cryptic such as Y/N, 1/0 and instead use Holiday and Non-holiday.

Product Dimension

The product dimension is almost always sourced from the operational product master file. Most retailers administer their product master file at headquarters and download a subset to each store’s POS system at frequent intervals. It is headquarters’ responsibility to define the appropriate product master record (and unique SKU number) for each new product.

The merchandise hierarchy is an important group of attributes. Typically, individual SKUs roll up to brands, brands roll up to categories, and categories roll up to departments. Each of these is a many-to-one relationship.

Product dimension table (page 85)
Example of product dimension sample rows (page 84)

Data involved in calculations should be in fact tables and data involved in constraints, groups and labels should be in dimension tables

Rule of thumbs:

A reasonable product dimension table can have 50 or more descriptive attributes.
Each attribute is a rich source for constraining and constructing row header labels.

Store dimension

Store dimension table (page 88)

Promotion Dimension

Promotion dimension table (page 91)

You must avoid null keys in the fact table. A proper design includes
a row in the corresponding dimension table to identify that the dimension is not applicable to the measurement.

Querying retail sales

Querying the retail sales schema (page 95)

Result:

Factless Fact Tables

In this case, should be What products were on promotion but did not sell?

Promotion coverage factless fact table (set difference in OLAP)

SHOULD IN ANY CASE AVOID THIS DESIGN

Centipede fact table with too many normalized dimensions (page 108)

--

--

Hang Nguyen
Hang Nguyen

Written by Hang Nguyen

Just sharing (data) knowledge

No responses yet