[BOOK REVIEW] The Data Warehouse Toolkit by Ralph Kimball/Retail Case Study
On how to design dimensional modeling in retail industry based on a case study.
Case study
Working at headquarter of a large grocery chain with around 100 grocery stores cross 5 states. Each store has grocery, frozen foods, dairy, meat, produce, bakery, floral, and healthy/beauty aids departments. Each store has 60,000 stock keeping units (SKUs) a.k.a products on shelves.
At the grocery store, concern was put on the logistics of ordering,
stocking, and selling products while maximizing profit. The profit ultimately comes from charging as much as possible for each product, lowering costs for product acquisition and overhead, and at the same time attracting as many customers as possible in a highly competitive environment. Some of the most significant management decisions have to do with pricing and promotions. (Promotions in a grocery store include temporary price reductions, ads
in newspapers and newspaper inserts, displays in the grocery store, and coupons). A rather common strategy for instance put toilet paper on sales for 50% to get a sale jump by a factor of 10, but the towels probably are being sold at a loss. As a result, the visibility of all forms of promotion is an important part of analyzing the operations of a grocery store.
Data is collected from cash registers as customers purchase products as in figure below.
Pre-design dimensional model
Regardless of the industry, we strongly encourage the four-step process for tackling dimensional model designs.
Step 1: Select Business Process
In this case study, we want to understand better customer purchases as captured by POS system. Hence, the business process we are modeling is POS retail sales transactions.
This data enables the business users to analyze which products are selling in which stores in which states on which days under what promotional conditions in which transactions.
Step 2: Declare the Grain
Tackling data at its lowest atomic grain makes sense for many reasons. Atomic
data is highly dimensional. The more detailed and atomic the fact measurement, the more things you know for sure. All those things you know for sure translate into dimensions. In this regard, atomic data is a perfect match for the dimensional approach.Although aggregated data plays an important role for performance
tuning, it is not a substitute for giving users access to the lowest level details; users
can easily summarize atomic data, but it’s impossible to create details from summary data.
The grain here is an individual product purchased on a POS transaction.
Step 3: Identify the Dimensions
A careful grain statement determines the primary dimensionality of the
fact table. You then add more dimensions to the fact table if these additional dimensions naturally take on only one value under each combination of the primary dimensions. If the additional dimension violates the grain by causing additional fact rows to be generated, the dimension needs to be disqualified or the grain statement needs to be revisited.
After identifying the grain, the following dimensions include: product, store, state, date. promotion, cashier, and method of payment.
Step 4: Identify the Facts
In this step, we need to determine which should be put into fact table. No need to include non-additive facts.
In this case study, the gross profit calculation is straightforward, but storing it means it’s computed consistently in the ETL process, eliminating the possibility of user calculation errors.
Trick: Assuming that gross revenues are $4 billion per year and that the average price of an item on a customer ticket is $2.00, you can calculate that there are approximately 2 billion transaction line items per year.
Design dimensional model
Date Dimension
Even 20 years’ worth of days is only approximately 7,300 rows, which is a relatively small dimension table.
Should avoid using cryptic such as Y/N, 1/0 and instead use Holiday and Non-holiday.
Product Dimension
The product dimension is almost always sourced from the operational product master file. Most retailers administer their product master file at headquarters and download a subset to each store’s POS system at frequent intervals. It is headquarters’ responsibility to define the appropriate product master record (and unique SKU number) for each new product.
The merchandise hierarchy is an important group of attributes. Typically, individual SKUs roll up to brands, brands roll up to categories, and categories roll up to departments. Each of these is a many-to-one relationship.
Data involved in calculations should be in fact tables and data involved in constraints, groups and labels should be in dimension tables
Rule of thumbs:
A reasonable product dimension table can have 50 or more descriptive attributes.
Each attribute is a rich source for constraining and constructing row header labels.
Store dimension
Promotion Dimension
You must avoid null keys in the fact table. A proper design includes
a row in the corresponding dimension table to identify that the dimension is not applicable to the measurement.
Querying retail sales
Result:
Factless Fact Tables
In this case, should be What products were on promotion but did not sell?
SHOULD IN ANY CASE AVOID THIS DESIGN