SQL-Driven Business Intelligence Dashboard (Chinook ‘Digital Music Store’)

SQL
R
Python
Dashboard
Data Analysis
Business Intelligence
Exploratory Analysis
Retail Analytics
Chinook
Short
DuckDB
This project explores the Chinook dataset — a mock digital music store — to uncover key business insights around revenue, customers, and performance. It combines SQL analysis with dashboard development to present findings visually.
Author

Morrigan M.

Published

July 2, 2025

Modified

July 2, 2025

Project Overview

  • Goal: Build a BI dashboard for a fictional digital media business.
  • Focus: Revenue insights, customer analysis, business performance.
  • Why it matters: Business stakeholders benefit from fast, accessible reporting on key performance indicators (KPIs) to support decision-making.

This project is intended for both technical and business audiences, with clear visual insights supported by SQL-based exploration.

For the best visualization experience, viewing in a desktop browser is recommended; optimization of these exploratory visualizations for other devices is limited.

Why This Project?

This project was selected for its relevance to real-world business intelligence scenarios and its flexibility as a learning and demonstration tool. The Chinook dataset provides a realistic simulation of sales and customer data, enabling exploration of KPIs commonly used in industries such as retail, e-commerce, and SaaS.

Key focus areas include:

  • End-to-end analytical workflow, from data exploration in DuckDB to dashboard construction, highlighting technical fluency and analytical structure.
  • Emphasis on stakeholder communication, showcasing the ability to translate complex queries into clear, business-oriented narratives.
  • Broad applicability, with insights on customer retention, regional performance, and product demand that transfer across diverse verticals.

By combining strong SQL capabilities with clear business insight, this project demonstrates how structured data exploration can inform decisions and strategy at scale.

Business Questions

This project aims to provide insights on the following key business questions, with a time-based perspective embedded throughout each analysis to identify trends and changes over time:

  1. Where is revenue coming from geographically?
  2. What genres or artists generate the most income?
  3. How many customers are repeat buyers?

Deliverables

  • SQL-based analysis of key business metrics with visualizations in both R and Python.
  • Dashboards built in R (Shiny) and Python (Dash) [TO DO].
  • [Considering a Tableau version of the dashboard; TO DO]

Technology Stack

The following tools were used to merge, analyze, and present the data:

Tool Purpose
SQL Data transformation and KPIs
DuckDB Lightweight, embedded relational database
R + Shiny Dashboard development (R-based)
Python + Dash Dashboard development (Python-based)

This project demonstrates dashboard development in both R (Shiny) and Python (Dash), highlighting flexibility across technical ecosystems.


Show Code
library(DBI)
library(duckdb)
library(dplyr)
library(tidyr)
library(forcats)
library(countrycode)
library(plotly)
library(ggplot2)
Show Code
import duckdb
import numpy as np
import pandas as pd
import pycountry
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
from plotly.graph_objs import Figure
from plotly.colors import sample_colorscale

Executive Summary

The Chinook dataset reveals that revenue is predominantly concentrated in North America and Central Europe, with stable year-over-year trends punctuated by notable regional anomalies—such as Australia’s revenue drop and a Europe-wide spike in 2011—indicating operational or market influences worth deeper exploration. Genre remains consistent overall, but efficiency varies significantly: top genres by revenue-per-track tend to be niche with high engagement, while top artists by total revenue rely on volume across many tracks rather than high per-track returns. Customer retention drops sharply within the first six months but shows unexpected spikes around months 28–33, suggesting irregular purchase behavior or promotional effects. Strategic opportunities include expanding in underpenetrated yet high-value markets, tailoring content and marketing by region and genre, optimizing large but underperforming catalogs, and implementing targeted retention and re-engagement campaigns.

Data Exploration & Schema Understanding

The SQLite version of the data was downloaded and converted into a DuckDB database for this project to take advantage of DuckDB’s speed and in-memory querying capabilities.

The database schema was examined through existing documentation and query-based exploration. A schema diagram was created with dbdiagram.

A database schema diagram for the Chinook database.

Chinook Database Schema


From this schema, key tables of interest were identified for each business question:

  1. Where is revenue coming from geographically?
  • Key Tables: Invoice, Customer
  • Explanation: The Invoice table records total sales and billing address information, which identifies where revenue is being billed geographically. Customer provides customer addressing, which may differ from billing location and allows for a more detailed geographic analysis.
  2. What genres or artists generate the most income?
  • Key Tables: InvoiceLine, Track, Artist, Genre
  • Explanation: InvoiceLine records individual track purchases, including price and quantity. Joining to the Track table provides track-level metadata through further links to Artist and Genre, enabling revenue analysis by artist or genre.
  3. How many customers are repeat buyers?
  • Key Tables: Invoice, Customer
  • Explanation: The Customer table identifies each buyer, while Invoice records all purchase transactions. Counting invoices per customer reveals how many made repeat purchases, offering insight into customer retention.
  4. How do sales trends evolve over time?
  • Key Tables: Invoice, InvoiceLine
  • Explanation: The Invoice table includes invoice dates and total amounts, allowing for time-based trend analysis (e.g., monthly revenue). InvoiceLine can be joined for more granular insights, such as which tracks or genres are trending over specific time periods, or purchase volume trends.
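As a sketch of how question 3 above can be answered, counting customers with more than one invoice: this toy example uses Python's built-in SQLite driver in place of DuckDB (the SQL shape is the same), with invented sample rows.

```python
import sqlite3

# Toy stand-in for the Invoice table (SQLite via the standard
# library here; the project itself queries DuckDB, but the SQL
# is identical in shape)
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE Invoice (InvoiceId INTEGER, CustomerId INTEGER)")
con.executemany(
  "INSERT INTO Invoice VALUES (?, ?)",
  [(1, 1), (2, 1), (3, 2), (4, 3), (5, 3), (6, 3)],
)

# A repeat buyer is any customer with more than one invoice
repeat_buyers = con.execute(
  """
  SELECT COUNT(*) FROM (
    SELECT CustomerId
    FROM Invoice
    GROUP BY CustomerId
    HAVING COUNT(InvoiceId) > 1
  )
  """
).fetchone()[0]
print(repeat_buyers)  # → 2 (customers 1 and 3)
```

Against the full Chinook database, the same HAVING clause separates repeat buyers from one-time purchasers.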

With the schema understood and key tables identified, the next step is to query the data using SQL and create exploratory R visualizations to begin answering the business questions.

Connection and Validation

A connection was made to the DuckDB database file.

Show Code
con_chinook <- DBI::dbConnect(
  duckdb::duckdb(), 
  dbdir = "../data/chinook.duckdb",
  read_only = TRUE
  )
Show Code
con_chinook = duckdb.connect("../data/chinook.duckdb", read_only = True)

Initial exploration confirmed the data’s structure and date range matched documentation.

A query of the InvoiceDate column in the Invoice table confirmed that the records span a date range from 2009-01-01 to 2013-12-22.

Show Code
SQL
-- Get date range of Invoices
SELECT 
  MIN(i.InvoiceDate) as MinDate, 
  MAX(i.InvoiceDate) as MaxDate
FROM Invoice i;
1 records
MinDate MaxDate
2009-01-01 2013-12-22

As expected, InvoiceLine and Track had the highest number of unique records, reflecting their one-to-many relationships with Invoice and Album, respectively. Metadata tables such as Genre and MediaType had fewer unique values.

Show Code
SQL
-- Get Number of Unique Key Values in Each Table
SELECT 
  'Employees' AS TableName, 
  COUNT(DISTINCT EmployeeId) AS UniqueKeys 
FROM Employee
UNION ALL
SELECT 
  'Customers' AS TableName, 
  COUNT(DISTINCT Customerid) AS UniqueKeys 
FROM Customer
UNION ALL
SELECT 
  'Invoices' AS TableName, 
  COUNT(DISTINCT InvoiceId) AS UniqueKeys 
FROM Invoice
UNION ALL
SELECT 
  'Invoice Lines' AS TableName, 
  COUNT(DISTINCT InvoiceLineId) AS UniqueKeys 
FROM InvoiceLine
UNION ALL
SELECT 
  'Tracks' AS TableName, 
  COUNT(DISTINCT TrackId) AS UniqueKeys 
FROM Track
UNION ALL
SELECT 
  'Artists' AS TableName, 
  COUNT(DISTINCT ArtistId) AS UniqueKeys 
FROM Artist
UNION ALL
SELECT 
  'Albums' AS TableName, 
  COUNT(DISTINCT AlbumId) AS UniqueKeys 
FROM Album
UNION ALL
SELECT 
  'Genres' AS TableName, 
  COUNT(DISTINCT GenreId) AS UniqueKeys 
FROM Genre
UNION ALL
SELECT 
  'Media Types' AS TableName,
  COUNT(DISTINCT MediaTypeId) AS UniqueKeys 
FROM MediaType
UNION ALL
SELECT 
  'Playlists' AS TableName, 
  COUNT(DISTINCT PlaylistId) AS UniqueKeys 
FROM Playlist
ORDER BY UniqueKeys DESC;
Displaying records 1 - 10
TableName UniqueKeys
Tracks 3503
Invoice Lines 2240
Invoices 412
Albums 347
Artists 275
Customers 59
Genres 25
Playlists 18
Employees 8
Media Types 5
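The ten near-identical UNION ALL branches in the query above can also be generated programmatically rather than hand-written; a minimal sketch, listing only three of the (label, table, key) triples taken from the schema:

```python
# Generate the per-table key-count query from (label, table, key)
# triples instead of hand-writing each UNION ALL branch
tables = [
  ("Employees", "Employee", "EmployeeId"),
  ("Customers", "Customer", "CustomerId"),
  ("Invoices", "Invoice", "InvoiceId"),
]

selects = [
  f"SELECT '{label}' AS TableName, COUNT(DISTINCT {key}) AS UniqueKeys FROM {table}"
  for label, table, key in tables
]
query = "\nUNION ALL\n".join(selects) + "\nORDER BY UniqueKeys DESC;"
print(query)
```

Extending the list to all ten tables reproduces the full query.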

Next Step: Key performance indicators (KPIs) will be extracted with SQL and visualized with R and Python to answer the business questions outlined previously, beginning with a geographic revenue analysis.

Question 1: Where is revenue coming from geographically?

Executive Summary

Global revenue is highly concentrated, with the U.S., Canada, and France driving over 50% of total earnings. Yet, smaller markets like Chile and Hungary show outsized revenue per customer, indicating growth potential. Geographic disparities in engagement suggest tailored strategies — retention and upsell in large markets, acquisition in high-value niches. Visual analysis also reveals year-specific anomalies that merit further exploration.

The data set documented revenue of $2,328.60 (USD) from customers in 24 countries.

Analysis of geographic revenue began with the country recorded in the Invoice table. This geographic value reflected where purchases were billed, which is often used as the default location reference for financial reporting.

Show Code
SQL
-- Revenue by Country (Billing)
SELECT 
    i.BillingCountry, 
    SUM(i.Total) AS TotalRevenue,
    ROUND(SUM(i.Total)*100.0 / (SELECT SUM(Total) from Invoice), 2) AS PercentGlobalRevenue,
    COUNT(DISTINCT c.CustomerId) AS NumCustomers,
    ROUND(SUM(i.Total) / COUNT(DISTINCT c.CustomerId), 2) AS RevenuePerCustomer
FROM Customer c
JOIN Invoice i ON c.CustomerId = i.CustomerId
GROUP BY i.BillingCountry
-- Sort Revenue (Highest to Lowest)
ORDER BY TotalRevenue DESC;
Displaying records 1 - 10
BillingCountry TotalRevenue PercentGlobalRevenue NumCustomers RevenuePerCustomer
USA 523.06 22.46 13 40.24
Canada 303.96 13.05 8 37.99
France 195.10 8.38 5 39.02
Brazil 190.10 8.16 5 38.02
Germany 156.48 6.72 4 39.12
United Kingdom 112.86 4.85 3 37.62
Czech Republic 90.24 3.88 2 45.12
Portugal 77.24 3.32 2 38.62
India 75.26 3.23 2 37.63
Chile 46.62 2.00 1 46.62
Insights

Revenue is geographically concentrated, with a few countries dominating global totals.

  • The United States alone accounts for nearly a quarter of global revenue (22.46%).
  • The top five countries (USA, Canada, France, Brazil, Germany) contribute over half (58.77%) of global revenue.

Opportunity: Expanding to underrepresented regions could be a growth opportunity, if demand can be identified and activated.

Insights

There are different customer behavior patterns in each country.

  • None of the top revenue-generating countries appear in the top five for revenue-per-customer (Chile, Hungary, Ireland, Czech Republic, Austria).
  • High total revenue countries likely have broad but less engaged customer bases.
  • High revenue per customer regions represent smaller but highly engaged audiences.

Opportunities:

  • High total revenue countries are ideal for retention and upselling strategies.
  • High per-customer revenue countries are ideal for acquisition-focused campaigns.

However, the Customer table also contained a Country field. Differences between billing and customer country could reflect travel, gift purchases, or mismatched contact vs. billing addresses.

Show Code
SQL
-- Total Revenue by Country (Customer)
SELECT 
    c.Country, 
    SUM(i.Total) AS TotalRevenue,
    ROUND(SUM(i.Total)*100.0 / (SELECT SUM(Total) from Invoice), 2) AS PercentGlobalRevenue,
    COUNT(DISTINCT c.CustomerId) AS NumCustomers,
    ROUND(SUM(i.Total) / COUNT(DISTINCT c.CustomerId), 2) AS RevenuePerCustomer
FROM Customer c
JOIN Invoice i ON c.CustomerId = i.CustomerId
GROUP BY c.Country
-- Sort Revenue (Highest to Lowest)
ORDER BY TotalRevenue DESC;
Displaying records 1 - 10
Country TotalRevenue PercentGlobalRevenue NumCustomers RevenuePerCustomer
USA 523.06 22.46 13 40.24
Canada 303.96 13.05 8 37.99
France 195.10 8.38 5 39.02
Brazil 190.10 8.16 5 38.02
Germany 156.48 6.72 4 39.12
United Kingdom 112.86 4.85 3 37.62
Czech Republic 90.24 3.88 2 45.12
Portugal 77.24 3.32 2 38.62
India 75.26 3.23 2 37.63
Chile 46.62 2.00 1 46.62

To identify any countries with mismatched revenue attribution, the aggregated views were joined and compared.

Show Code
SQL
-- Rows with discrepancies in Revenue by Country
-- (Billing vs Customer)

-- Revenue by Invoice.BillingCountry
WITH billing_country_revenue AS (
    SELECT 
        BillingCountry AS Country,
        SUM(Total) AS Revenue_Billing
    FROM Invoice
    GROUP BY BillingCountry
),

-- Revenue by Customer Country (joined from Invoice.Customer)
customer_country_revenue AS (
    SELECT 
        c.Country AS Country,
        SUM(i.Total) AS Revenue_Customer
    FROM Invoice i
    JOIN Customer c ON i.CustomerId = c.CustomerId
    GROUP BY c.Country
)

-- Join the aggregations into a single table (by Country)
SELECT 
    COALESCE(b.Country, c.Country) AS Country,
    b.Revenue_Billing,
    c.Revenue_Customer
FROM billing_country_revenue b
FULL OUTER JOIN customer_country_revenue c
    ON b.Country = c.Country
-- Select only rows where revenue differs by country source
WHERE
  b.Revenue_Billing IS DISTINCT FROM c.Revenue_Customer
ORDER BY Country;
0 records
Country Revenue_Billing Revenue_Customer
Insights

Billing and customer country match exactly in this data set, indicating no divergence due to travel, gifting, or alternate addresses. This simplifies location-based analysis but may also reflect a limitation in the data set’s realism.
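The reconciliation logic of the query above can be sketched outside SQL as well; a minimal Python illustration with two hypothetical per-country totals (values invented for the example):

```python
# Hypothetical per-country totals from the two aggregations
revenue_billing  = {"USA": 523.06, "Canada": 303.96}
revenue_customer = {"USA": 523.06, "Canada": 303.96}

# Full-outer-join semantics: consider every country seen on either
# side, flagging any whose totals differ (a missing side counts as
# a mismatch, mirroring IS DISTINCT FROM)
mismatches = {
  country
  for country in revenue_billing.keys() | revenue_customer.keys()
  if revenue_billing.get(country) != revenue_customer.get(country)
}
print(sorted(mismatches))  # → []
```

As in the data set itself, identical totals on both sides yield an empty mismatch set.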

Exploratory Visualizations

In preparation for exploratory visualization generation, the data was retrieved using SQL queries and prepared in both R and Python.

Show Code
# SQL Queries
## Yearly Breakdown
res_yearly_df <- DBI::dbGetQuery(
  con_chinook, 
  "SELECT 
    -- Get Country and Year for grouping
    i.BillingCountry as Country, 
    YEAR(i.InvoiceDate) as Year,
    -- Calculate Total Revenue
    SUM(i.Total) AS TotalRevenue,
    -- Calculate % of Total/Global Revenue
    ROUND(SUM(i.Total)*100.0 / (SELECT SUM(Total) from Invoice), 2) AS PercentGlobalRevenue,
    -- Get Number of Customers
    COUNT(DISTINCT c.CustomerId) AS NumCustomers,
    -- Calculate Revenue per Customer
    ROUND(SUM(i.Total) / COUNT(DISTINCT c.CustomerId), 2) AS RevenuePerCustomer
FROM Customer c
JOIN Invoice i ON c.CustomerId = i.CustomerId
GROUP BY i.BillingCountry, Year
-- Sort Revenue (Highest to Lowest)
ORDER BY Year, TotalRevenue DESC;"
)

## Total (all years)
res_agg_df <- DBI::dbGetQuery(
  con_chinook, 
  "SELECT 
    -- Get Country for grouping
    i.BillingCountry as Country,
    -- Set 'Year' to 'All' for grouping
    'All' AS Year,
    -- Calculate Total Revenue
    SUM(i.Total) AS TotalRevenue,
    -- Calculate % of Total/Global Revenue
    ROUND(SUM(i.Total)*100.0 / (SELECT SUM(Total) from Invoice), 2) AS PercentGlobalRevenue,
    -- Get Number of Customers
    COUNT(DISTINCT c.CustomerId) AS NumCustomers,
    -- Calculate Revenue per Customer
    ROUND(SUM(i.Total) / COUNT(DISTINCT c.CustomerId), 2) AS RevenuePerCustomer
FROM Customer c
JOIN Invoice i ON c.CustomerId = i.CustomerId
GROUP BY i.BillingCountry
-- Sort Revenue (Highest to Lowest)
ORDER BY Year, TotalRevenue DESC;"
)

# Combine data frames in R
res_df <- dplyr::bind_rows(
  res_agg_df,
  res_yearly_df |> dplyr::mutate(Year = as.character(Year))
  ) |>
  dplyr::mutate(
    ## Add ISO Country Codes
    iso_alpha = countrycode::countrycode(
      Country, 
      origin = 'country.name', 
      destination = 'iso3c'
      ),
    ## Format Hover Text (<b>Country:</b><br> $TotalRevenue.##")
    hover_text = paste0(
      "<b>", Country, ":</b><br> $",
      formatC(TotalRevenue, format = 'f', big.mark = ",", digits = 2)
      )
    ) 

# Get vector of unique years (layers/traces) - order with "All" first.
years <- c("All", sort(unique(res_yearly_df$Year)))
Show Code
# SQL Queries
## Yearly Breakdown
res_yearly_df = con_chinook.execute(
    """SELECT 
      -- Get Country and Year for grouping
      i.BillingCountry as Country, 
      YEAR(i.InvoiceDate) as Year,
      -- Calculate Total Revenue
      SUM(i.Total) AS TotalRevenue,
      -- Calculate % of Total/Global Revenue
      ROUND(SUM(i.Total)*100.0 / (SELECT SUM(Total) from Invoice), 2) AS PercentGlobalRevenue,
      -- Get Number of Customers
      COUNT(DISTINCT c.CustomerId) AS NumCustomers,
      -- Calculate Revenue per Customer
      ROUND(SUM(i.Total) / COUNT(DISTINCT c.CustomerId), 2) AS RevenuePerCustomer
  FROM Customer c
  JOIN Invoice i ON c.CustomerId = i.CustomerId
  GROUP BY i.BillingCountry, Year
  -- Sort Revenue (Highest to Lowest)
  ORDER BY Year, TotalRevenue DESC;"""
  ).df()

## Total (all years)
res_agg_df = con_chinook.execute(
    """SELECT 
      -- Get Country for grouping
      i.BillingCountry as Country,
      -- Set 'Year' to 'All' for grouping
      'All' AS Year,
      -- Calculate Total Revenue
      SUM(i.Total) AS TotalRevenue,
      -- Calculate % of Total/Global Revenue
      ROUND(SUM(i.Total)*100.0 / (SELECT SUM(Total) from Invoice), 2) AS PercentGlobalRevenue,
      -- Get Number of Customers
      COUNT(DISTINCT c.CustomerId) AS NumCustomers,
      -- Calculate Revenue per Customer
      ROUND(SUM(i.Total) / COUNT(DISTINCT c.CustomerId), 2) AS RevenuePerCustomer
  FROM Customer c
  JOIN Invoice i ON c.CustomerId = i.CustomerId
  GROUP BY i.BillingCountry
  -- Sort Revenue (Highest to Lowest)
  ORDER BY Year, TotalRevenue DESC;"""
  ).df()

# Combine data frames and ensure consistent types
res_df = pd.concat([
  res_agg_df,
  res_yearly_df.assign(Year=res_yearly_df['Year'].astype(str))
  ], ignore_index=True)

# Add ISO Country Codes
def get_iso_alpha3(country_name):
  try:
    match = pycountry.countries.search_fuzzy(country_name)
    return match[0].alpha_3
  except LookupError:
    return None
 
res_df['iso_alpha'] = res_df['Country'].apply(get_iso_alpha3)

# Get unique years (layers/traces) - order with "All" first.
years = ["All"] + sorted(res_df[res_df['Year'] != 'All']['Year'].unique().tolist())
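Both the countrycode and pycountry lookups above resolve country names to ISO-3166 alpha-3 codes, which is what the choropleth traces key on; a minimal stand-in with a hand-built mapping (subset only, for illustration):

```python
# Hand-built subset of the country-name → ISO-3166 alpha-3 mapping
# that countrycode (R) and pycountry (Python) resolve automatically
ISO3 = {"USA": "USA", "Canada": "CAN", "France": "FRA", "Brazil": "BRA"}

def to_iso3(name):
  # Unknown names become None instead of raising, matching the
  # LookupError fallback in get_iso_alpha3 above
  return ISO3.get(name)

print([to_iso3(c) for c in ["Canada", "Atlantis"]])  # → ['CAN', None]
```

Rows whose name fails to resolve carry a None code and are simply not drawn on the map.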

Total Revenue: Choropleth by Country

Total revenue per country across the entire data set was visualized with a choropleth map.

Show Code
# Format Hover Text (<b>Country:</b><br> $TotalRevenue.##")
res_df <- res_df |> 
  dplyr::mutate(
    hover_text = paste0(
      "<b>", Country, ":</b><br> $",
      formatC(TotalRevenue, format = 'f', big.mark = ",", digits = 2)
      )
    ) 

# Get minimum and maximum values for TotalRevenue (Colorbar consistency)
z_min_val <- min(res_df$TotalRevenue, na.rm = TRUE)
z_max_val <- max(res_df$TotalRevenue, na.rm = TRUE)

# Generate plotly Choropleth
fig <- plotly::plot_ly(
  data = res_df,
  type = 'choropleth',
  locations = ~iso_alpha,
  z = ~TotalRevenue,
  # Set hover text to only display our desired, formatted output
  text = ~hover_text,
  hoverinfo = "text",
  frame = ~Year,
  # Set minimum and maximum TotalRevenue values, for consistent scale
  zmin = z_min_val,
  zmax = z_max_val,
  # Title Colorbar/Legend
  colorbar = list(
    title = "Total Revenue (USD$)"
  ),
  # Color-blind friendly color scale
  colorscale = "Viridis",
  reversescale = TRUE,
  showscale = TRUE,
  # Give national boundaries a dark gray outline
  marker = list(line = list(color = "darkgrey", width = 0.5))
)

# Layout with animation controls
fig <- fig %>%
  plotly::layout(
    title = list(
      text = "Total Revenue by Country <br><sub>2009-01-01 to 2013-12-22</sub>",
      x = 0.5,
      xanchor = "center",
      font = list(size = 18)
    ),
    geo = list(
      # Add a neat little frame around the world
      showframe = TRUE, 
      # Add coast lines - ensures countries that aren't in data are seen
      showcoastlines = TRUE, 
      # Use natural earth projection
      projection = list(type = 'natural earth')
      ),
    updatemenus = list(
      list(
        type = "dropdown",
        showactive = TRUE,
        buttons = purrr::map(years, function(yr) {
          list(
            method = "animate",
            args = list(list(yr), list(mode = "immediate", frame = list(duration = 0, redraw = TRUE))),
            label = yr
          )
        }),
        # Positioning of dropdown menu
        x = 0.1,
        y = 1.15,
        xanchor = "left",
        yanchor = "top"
      )
    ),
    margin = list(t = 80)
  ) %>%
  plotly::animation_opts(frame = 1000, transition = 0, redraw = TRUE)

# Display interactive plot
fig
Show Code
# Specify hover text
res_df['hover_text'] = res_df.apply(
  lambda row: f"<b>{row['Country']}</b><br> ${row['TotalRevenue']:.2f}", axis = 1
  )

# Get maximum and minimum TotalRevenue values (consistent scale)
z_min_val = res_df['TotalRevenue'].min()
z_max_val = res_df['TotalRevenue'].max()

# Create frames (one per year, and aggregate)
frames = []

for year in years:
  df_year = res_df[res_df['Year'] == year]
  frames.append(go.Frame(
    name = str(year),
    data = [go.Choropleth(
      locations = df_year['iso_alpha'],
      z = df_year['TotalRevenue'],
      zmin = z_min_val,
      zmax = z_max_val,
      text = df_year['hover_text'],
      hoverinfo = 'text',
      # Color-blind friendly color scale (reversed: darker with higher revenues)
      colorscale = 'Viridis_r',
      # Give national boundaries a dark grey outline
      marker = dict(line=dict(color='darkgrey', width=0.5))
    )]
  ))
  
# First frame (initial state)
init_df = res_df[res_df['Year'] == 'All']

# Generate plotly Choropleth
fig = go.Figure(
  data=[go.Choropleth(
        locations=init_df['iso_alpha'],
        z=init_df['TotalRevenue'],
        text=init_df['hover_text'],
        hoverinfo='text',
        # Color-blind friendly color scale (reversed: darker with higher revenues)
        colorscale='Viridis_r',
        zmin=z_min_val,
        zmax=z_max_val,
        # Give national boundaries a dark grey outline
        marker=dict(line=dict(color='darkgrey', width=0.5)),
        # Title Colorbar/Legend
        colorbar=dict(title='Total Revenue (USD$)')
    )],
    frames=frames
  )
  
# Format Layout with Animation Controls
fig.update_layout(
  title = dict(
    text = "Total Revenue by Country <br><sub>2009-01-01 to 2013-12-22</sub>",
    x = 0.5,
    xanchor = 'center',
    font = dict(size=18)
  ),
  margin=dict(t=80),
  # Frame and view
  geo = dict(
    # Show countries and boundaries
    showcountries = True,
    # Give national boundaries a dark gray outline
    countrycolor="darkgrey",
    # Add coast lines - ensure countries that aren't in data are seen
    showcoastlines = True,
    coastlinecolor = "gray",
    # Add a neat little frame around the world
    showframe = True,
    framecolor = "black",
    # Use natural earth projection
    projection_type = "natural earth"
  ),
  # Buttons/Menus
  updatemenus = [dict(
    ## Play/Pause
        # First button active by default (yr == "All")
        type = "buttons",
        direction = "left",
        x = 0,
        y = 0,
        showactive = False,
        xanchor = "left",
        yanchor = "bottom",
        pad = dict(r = 10, t = 70),
        buttons = [dict(
          label = "Play",
          method = "animate",
          args = [None, {
            "frame": {"duration": 1000, "redraw": True},
            "fromcurrent": True,
            "transition": {"duration": 300, "easing": "quadratic-in-out"}
          }]
        ), dict(
          label = "Pause",
          method = "animate",
          args=[[None], {"frame": {"duration": 0}, "mode": "immediate"}] 
          )] 
      )] +
  ## Year Dropdown Menu
    [dict(
      type="dropdown",
      x = 0.1,
      y = 1.15,
      xanchor="left",
      yanchor="top",
      showactive=True,
      buttons=[dict(
        label=str(year),
        method="animate",
        args=[
            [str(year)],
            {"mode": "immediate",
             "frame": {"duration": 0, "redraw": True},
             "transition": {"duration": 0}}
        ]
    ) for year in years]
  )],
  sliders = [dict(
      active = 0,
      # Positioning of slider menu
      x = 0.1,
      y = -0.2,
      len = 0.8,
      xanchor = "left",
      yanchor = "bottom",
      pad = dict(t= 30, b=10),
      currentvalue = dict(
        visible = True,
        prefix = "Year: ",
        xanchor = "right",
        font = dict(size=14, color = "#666")
      ),
    steps = [dict(
      method = 'animate',
      args =[[str(year)], {
        "mode": "immediate",
        "frame": {"duration": 1000, "redraw": True},
        "transition": {"duration": 300}
        }],
        label = str(year) 
      ) for year in years]
    )] 
  );

# Display interactive plot
fig.show()
Insights

The customer base exhibits persistent geographic disparities in both scale and engagement.

  • Overall revenue trends are relatively flat, with limited signs of organic growth.
  • North America and Brazil consistently drive the highest revenues, indicating strong market presence and sustained demand.
  • India and parts of Central Europe maintain modest but consistent revenue, suggesting a stable (though modest) consumer base that may respond well to targeted growth strategies.
  • Australia showed signs of growth until 2013, after which revenue dropped to zero — this may reflect market exit, operational changes, or demand saturation.
  • Other parts of South America and Europe show sporadic revenue, possibly tied to one-time purchases or minimal customer engagement.

Opportunities:

  • There is untapped potential in underrepresented regions. If market demand can be properly assessed and activated — through localized marketing, partnerships, or product adaptation — these regions could represent growth markets.
  • South America and Europe may benefit from customer acquisition and retention strategies.
  • Australia’s sharp revenue drop in 2013 warrants deeper investigation — identifying root causes could help preempt similar risks in other markets.

Revenue per Customer: Choropleth by Country

Initial exploration revealed a significant mismatch between total revenue and revenue per customer, which was explored further with a similar choropleth map.

Show Code
# Format Hover Text (<b>Country:</b><br> $RevenuePerCustomer.##")
res_df <- res_df |> 
  dplyr::mutate(
    hover_text = paste0(
      "<b>", Country, ":</b><br> $",
      formatC(RevenuePerCustomer, format = 'f', big.mark = ",", digits = 2)
      )
    ) 

# Get minimum and maximum values for RevenuePerCustomer (Colorbar consistency)
z_min_val <- min(res_df$RevenuePerCustomer, na.rm = TRUE)
z_max_val <- max(res_df$RevenuePerCustomer, na.rm = TRUE)

# Generate plotly Choropleth
fig <- plotly::plot_ly(
  data = res_df,
  type = 'choropleth',
  locations = ~iso_alpha,
  z = ~RevenuePerCustomer,
  # Set hover text to only display our desired, formatted output
  text = ~hover_text,
  hoverinfo = "text",
  frame = ~Year,
  # Set minimum and maximum RevenuePerCustomer values, for consistent scale
  zmin = z_min_val,
  zmax = z_max_val,
  # Title Colorbar/Legend
  colorbar = list(
    title = "Revenue per Customer (USD$)"
  ),
  # Color-blind friendly color scale
  colorscale = "Viridis",
  reversescale = TRUE,
  showscale = TRUE,
  # Give national boundaries a dark gray outline
  marker = list(line = list(color = "darkgrey", width = 0.5))
)

# Layout with animation controls
fig <- fig %>%
  plotly::layout(
    title = list(
      text = "Revenue per Customer by Country <br><sub>2009-01-01 to 2013-12-22</sub>",
      x = 0.5,
      xanchor = "center",
      font = list(size = 18)
    ),
    geo = list(
      # Add a neat little frame around the world
      showframe = TRUE, 
      # Add coast lines - ensures countries that aren't in data are seen
      showcoastlines = TRUE, 
      # Use natural earth projection
      projection = list(type = 'natural earth')
      ),
    updatemenus = list(
      list(
        type = "dropdown",
        showactive = TRUE,
        buttons = purrr::map(years, function(yr) {
          list(
            method = "animate",
            args = list(list(yr), list(mode = "immediate", frame = list(duration = 0, redraw = TRUE))),
            label = yr
          )
        }),
        # Positioning of dropdown menu
        x = 0.1,
        y = 1.15,
        xanchor = "left",
        yanchor = "top"
      )
    ),
    margin = list(t = 80)
  ) %>%
  plotly::animation_opts(frame = 1000, transition = 0, redraw = TRUE)

# Display interactive plot
fig
Show Code
# Specify hover text
res_df['hover_text'] = res_df.apply( \
  lambda row: f"<b>{row['Country']}</b><br> ${row['RevenuePerCustomer']:.2f}", axis = 1 \
  )

# Get maximum and minimum RevenuePerCustomer values (consistent scale)
z_min_val = res_df['RevenuePerCustomer'].min()
z_max_val = res_df['RevenuePerCustomer'].max()

# Create frames (one per year, and aggregate)
frames = []

for year in years:
  df_year = res_df[res_df['Year'] == year]
  frames.append(go.Frame(
    name = str(year),
    data = [go.Choropleth(
      locations = df_year['iso_alpha'],
      z = df_year['RevenuePerCustomer'],
      zmin = z_min_val,
      zmax = z_max_val,
      text = df_year['hover_text'],
      hoverinfo = 'text',
      # Color-blind friendly color scale (reversed: darker with higher revenues)
      colorscale = 'Viridis_r',
      # Give national boundaries a dark grey outline
      marker = dict(line=dict(color='darkgrey', width=0.5))
    )]
  ))
  
# First frame (initial state)
init_df = res_df[res_df['Year'] == 'All']

# Generate plotly Choropleth
fig = go.Figure(
  data=[go.Choropleth(
        locations=init_df['iso_alpha'],
        z=init_df['RevenuePerCustomer'],
        text=init_df['hover_text'],
        hoverinfo='text',
        # Color-blind friendly color scale (reversed: darker with higher revenues)
        colorscale='Viridis_r',
        zmin=z_min_val,
        zmax=z_max_val,
        # Give national boundaries a dark grey outline
        marker=dict(line=dict(color='darkgrey', width=0.5)),
        # Title Colorbar/Legend
        colorbar=dict(title='Revenue per Customer (USD$)')
    )],
    frames=frames
  )
  
# Format Layout with Animation Controls
fig.update_layout(
  title = dict(
    text = "Revenue per Customer by Country <br><sub>2009-01-01 to 2013-12-22</sub>",
    x = 0.5,
    xanchor = 'center',
    font = dict(size=18)
  ),
  margin=dict(t=80),
  # Frame and view
  geo = dict(
    # Show countries and boundaries
    showcountries = True,
    # Give national boundaries a dark gray outline
    countrycolor="darkgrey",
    # Add coast lines - ensure countries that aren't in data are seen
    showcoastlines = True,
    coastlinecolor = "gray",
    # Add a neat little frame around the world
    showframe = True,
    framecolor = "black",
    # Use natural earth projection
    projection_type = "natural earth"
  ),
  # Buttons/Menus
  updatemenus = [dict(
    ## Play/Pause
        # First button active by default (yr == "All")
        type = "buttons",
        direction = "left",
        x = 0,
        y = 0,
        showactive = False,
        xanchor = "left",
        yanchor = "bottom",
        pad = dict(r = 10, t = 70),
        buttons = [dict(
          label = "Play",
          method = "animate",
          args = [None, {
            "frame": {"duration": 1000, "redraw": True},
            "fromcurrent": True,
            "transition": {"duration": 300, "easing": "quadratic-in-out"}
          }]
        ), dict(
          label = "Pause",
          method = "animate",
          args=[[None], {"frame": {"duration": 0}, "mode": "immediate"}] 
          )] 
      )] +
  ## Year Dropdown Menu
    [dict(
      type="dropdown",
      x = 0.1,
      y = 1.15,
      xanchor="left",
      yanchor="top",
      showactive=True,
      buttons=[dict(
        label=str(year),
        method="animate",
        args=[
            [str(year)],
            {"mode": "immediate",
             "frame": {"duration": 0, "redraw": True},
             "transition": {"duration": 0}}
        ]
    ) for year in years]
  )],
  sliders = [dict(
      active = 0,
      # Positioning of slider menu
      x = 0.1,
      y = -0.2,
      len = 0.8,
      xanchor = "left",
      yanchor = "bottom",
      pad = dict(t= 30, b=10),
      currentvalue = dict(
        visible = True,
        prefix = "Year: ",
        xanchor = "right",
        font = dict(size=14, color = "#666")
      ),
    steps = [dict(
      method = 'animate',
      args =[[str(year)], {
        "mode": "immediate",
        "frame": {"duration": 1000, "redraw": True},
        "transition": {"duration": 300}
        }],
        label = str(year) 
      ) for year in years]
    )] 
  );

# Display interactive plot
fig.show()
Insights

Year-over-year shifts in revenue per customer suggest evolving engagement patterns across regions.

  • High total-revenue countries (USA, Canada, Brazil) and India show flat revenue per customer trends, indicating stable but possibly casual engagement.
  • South America saw increases in revenue per customer in both 2010 and 2013, suggesting periods of heightened individual customer value — possibly driven by regional promotions or market-specific trends.
  • Europe experienced a spike in 2011, with multiple countries (e.g., Austria, Hungary, Ireland) showing high per-customer revenue. The cause is unclear but may reflect a surge in high-value purchases or regional campaigns.

Opportunities:

  • High per-customer revenue regions (especially those with smaller total revenue footprints) may benefit from customer acquisition efforts, as existing users demonstrate strong engagement or purchasing behavior.
  • Flat or declining per-customer revenue in large markets highlights a need for upselling, bundling, or personalized offers to increase customer lifetime value.
  • Investigating year-over-year anomalies (e.g., Australia’s 2012 spike, Europe in 2011) could uncover replicable growth levers or emerging market trends.

Summary of Exploratory Findings

Total revenue measures market size, while revenue per customer reflects intensity of engagement. Both are important for guiding different types of strategic decisions (e.g., acquisition vs. retention).

Geographic revenue in the Chinook dataset is concentrated in a few strong markets, while many others remain underdeveloped. This disparity presents both risk and opportunity: efforts to deepen engagement in large casual markets (e.g., USA, Brazil) and expand in high-value but small markets (e.g., Austria, Chile) could lead to measurable revenue growth. Year-over-year changes — such as Australia’s exit and Europe’s 2011 spike — suggest that regional trends and operational shifts have meaningful financial impacts worth further investigation.

  • Top 5 Countries by Total Revenue: USA, Canada, France, Brazil, Germany
  • Top 5 Countries by Revenue per Customer: Chile, Hungary, Ireland, Czech Republic, Austria
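
The distinction between the two metrics can be sketched with a toy calculation (hypothetical figures, Python standard library only):

```python
from collections import defaultdict

# Toy invoices: (country, customer, amount) - hypothetical figures
invoices = [
    ("USA", "cust_1", 9.90), ("USA", "cust_2", 5.94),
    ("Chile", "cust_3", 17.91),
]

totals = defaultdict(float)   # market size (total revenue)
customers = defaultdict(set)  # unique buyers per country
for country, cust, amount in invoices:
    totals[country] += amount
    customers[country].add(cust)

# Intensity of engagement: revenue per customer
rev_per_customer = {c: round(totals[c] / len(customers[c]), 2) for c in totals}
# USA leads on total revenue; Chile leads on revenue per customer
```

Even in this tiny example, the rankings diverge, which is why both metrics are reported above.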
Possible Next Steps

If this were real-world data, the following actions could strengthen both analytical insight and strategic decision-making.

Deepen the Analysis:

  • Investigate Australia’s 2012–13 drop-off.
    Examine customer churn, pricing changes, or external events that may explain the abrupt loss of revenue.
  • Analyze U.S. cohort churn to identify upsell opportunities.
    As the largest revenue contributor with moderate revenue-per-customer, the U.S. market may contain untapped upsell or retention opportunities.

Strategic Business Opportunities:

  • Launch targeted consumer acquisition campaigns in underpenetrated, high-potential markets.
    Focus on Chile, Hungary, the Czech Republic and Ireland — all show strong revenue-per-customer despite smaller customer bases.
  • Localize and invest in regional content for India and Asia.
    These high-population markets show minimal engagement. Tailored content, pricing, or distribution could unlock substantial growth.
  • Re-engage customers in stalled or declining regions.
    Countries like Brazil and Sweden once generated revenue but saw sustained drop-offs. A targeted win-back campaign could reclaim lapsed users.

Question 2: What genres or artists generate the most income?

Executive Summary

Revenue across genres and artists is highly concentrated, with a few categories (e.g., Rock, TV-related genres) accounting for most earnings. Yet, when normalized by catalog size, smaller genres like Sci Fi & Fantasy and Bossa Nova outperform in per-track efficiency. These findings suggest that total revenue alone understates the value of high-margin, niche content. Year-over-year trends also reveal temporal spikes that signal short-term demand surges — potential targets for agile promotion or bundling strategies.

To understand commercial performance across the catalog, revenue was analyzed at both the genre and artist levels.

The pertinent sales data was stored in the InvoiceLine table, where each row corresponded to a purchased track. These transaction records were linked to metadata through the Track, Genre, and Artist tables.
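
As a minimal, runnable sketch of this join path, using Python's built-in sqlite3 with a toy version of the schema (table and column names follow the Chinook conventions; the rows are made up):

```python
import sqlite3

# Toy schema mirroring the join path: InvoiceLine -> Track -> Genre
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE Genre (GenreId INTEGER PRIMARY KEY, Name TEXT);
CREATE TABLE Track (TrackId INTEGER PRIMARY KEY, GenreId INTEGER);
CREATE TABLE InvoiceLine (TrackId INTEGER, UnitPrice REAL, Quantity INTEGER);
INSERT INTO Genre VALUES (1, 'Rock'), (2, 'TV Shows');
INSERT INTO Track VALUES (10, 1), (11, 1), (20, 2);
INSERT INTO InvoiceLine VALUES (10, 0.99, 1), (11, 0.99, 2), (20, 1.99, 1);
""")

rows = con.execute("""
SELECT g.Name, ROUND(SUM(il.UnitPrice * il.Quantity), 2) AS TotalRevenue
FROM InvoiceLine il
JOIN Track t ON il.TrackId = t.TrackId
JOIN Genre g ON t.GenreId = g.GenreId
GROUP BY g.Name
ORDER BY TotalRevenue DESC
""").fetchall()
# Rock: 0.99*1 + 0.99*2 = 2.97; TV Shows: 1.99
```

The real queries below follow the same shape, with additional joins (Album, Artist) and a subquery for catalog sizes.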

Initial exploration focused on genre-level revenue, with metrics capturing total earnings, sales volume, average revenue per track, and share of overall income.

Show Code
SQL
-- Revenue by Genre
SELECT 
  g.Name AS Genre,
  ROUND(SUM(il.UnitPrice * il.Quantity), 2) AS TotalRevenue,
  -- Number of Tracks
  COUNT(*) AS NumTracksSold,
  -- Average Revenue per Track
  ROUND(SUM(il.UnitPrice * il.Quantity)/COUNT(*), 2) AS AvgRevenuePerTrack,
  -- Percentage of Total Revenue
  ROUND(SUM(il.UnitPrice * il.Quantity)*100.0 / (SELECT SUM(UnitPrice * Quantity) FROM InvoiceLine), 2) AS PercentOfRevenue,
  -- Percentage of Volume (Units Sold)
  ROUND(COUNT(*)*100.0 / (SELECT COUNT(*) FROM InvoiceLine),2) AS PercentOfUnitSales,
  -- Total number of tracks in the catalog in this genre
  track_counts.TotalTracksInGenre,
  -- Proportion of catalog that was actually sold
  ROUND(COUNT(DISTINCT il.TrackId) * 100.0 / track_counts.TotalTracksInGenre, 2) AS PercentOfTracksSold,
  -- Revenue per total track in genre
  ROUND(SUM(il.UnitPrice * il.Quantity) / track_counts.TotalTracksInGenre, 2) AS RevenuePerTotalTrack,
FROM InvoiceLine il
JOIN Track t ON il.TrackId = t.TrackId
JOIN Genre g ON t.GenreId = g.GenreId
-- Subquery to get total number of tracks in each genre
JOIN (
    SELECT GenreId, COUNT(*) AS TotalTracksInGenre
    FROM Track
    GROUP BY GenreId
) AS track_counts ON g.GenreId = track_counts.GenreId
GROUP BY g.Name, track_counts.TotalTracksInGenre
-- Arrange by TotalRevenue (Highest to Lowest)
ORDER BY TotalRevenue DESC;
Displaying records 1 - 10
Genre TotalRevenue NumTracksSold AvgRevenuePerTrack PercentOfRevenue PercentOfUnitSales TotalTracksInGenre PercentOfTracksSold RevenuePerTotalTrack
Rock 826.65 835 0.99 35.50 37.28 1297 57.44 0.64
Latin 382.14 386 0.99 16.41 17.23 579 58.72 0.66
Metal 261.36 264 0.99 11.22 11.79 374 61.76 0.70
Alternative & Punk 241.56 244 0.99 10.37 10.89 332 61.14 0.73
TV Shows 93.53 47 1.99 4.02 2.10 93 46.24 1.01
Jazz 79.20 80 0.99 3.40 3.57 130 52.31 0.61
Blues 60.39 61 0.99 2.59 2.72 81 65.43 0.75
Drama 57.71 29 1.99 2.48 1.29 64 42.19 0.90
Classical 40.59 41 0.99 1.74 1.83 74 48.65 0.55
R&B/Soul 40.59 41 0.99 1.74 1.83 61 60.66 0.67
Insights
  • Revenue is highly concentrated.
    • The top four genres — Rock, Latin, Metal, Alternative & Punk — generate 73.5% of total income and account for 77.2% of unit sales.
    • Rock alone contributes 35.5% of total revenue, establishing it as the primary commercial driver.
  • Revenue is predominantly volume-driven across genres, with minimal pricing differentiation.
    • The average revenue per track across genres is approximately $0.99, indicating consistent pricing strategies.
    • A small subset of non-music genres (e.g. TV Shows, Science Fiction, Sci Fi & Fantasy, Comedy, and Drama) command significantly higher per-unit revenue (~$1.99 per track). These categories contribute minimally to total volume but represent higher-margin segments.
  • Sales penetration varies by genre:
    • The top five genres by percentage of catalog sold are Bossa Nova (93.3%), Sci Fi & Fantasy (76.9%), Blues (65.4%), Metal (61.8%), and Alternative & Punk (61.1%).
    • Total track count within genres closely aligns with total revenue, highlighting the impact of catalog size on sales.
  • Some mid-sized genres — including Sci Fi & Fantasy, Comedy, and TV Shows — deliver above-average revenue per catalog track, outperforming larger genres in revenue efficiency.

Opportunities:

  • Premium-priced genres may offer growth potential if supported through improved visibility, targeted marketing, or content expansion strategies.
  • The strong correlation between catalog size and revenue suggests sales are heavily dependent on the volume of available content within each genre. Diversifying through strategic content acquisition or development in underrepresented genres could mitigate risk from shifts in consumer preferences while creating new revenue streams.
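
The RevenuePerTotalTrack efficiency comparison can be reproduced directly from the table's figures:

```python
# RevenuePerTotalTrack = TotalRevenue / TotalTracksInGenre
# (TotalRevenue, TotalTracksInGenre) taken from the genre table above
catalog = {"Rock": (826.65, 1297), "TV Shows": (93.53, 93)}

efficiency = {g: round(rev / n_tracks, 2) for g, (rev, n_tracks) in catalog.items()}
ranked = sorted(efficiency, key=efficiency.get, reverse=True)
# TV Shows out-earn Rock per catalog track despite far lower total revenue
```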
Show Code
SQL
-- Revenue by Artist
SELECT 
  ar.Name AS Artist,
  -- Total Revenue
  ROUND(SUM(il.UnitPrice * il.Quantity), 2) AS TotalRevenue,
  -- Number of Tracks
  COUNT(*) AS NumTracksSold,
  -- Average Revenue per Track Sold
  ROUND(SUM(il.UnitPrice * il.Quantity)/COUNT(*), 2) AS AvgRevenuePerTrack,
  -- Percentage of Total Revenue
  ROUND(SUM(il.UnitPrice * il.Quantity)*100.0 / (SELECT SUM(UnitPrice * Quantity) FROM InvoiceLine), 2) AS PercentOfRevenue,
  -- Percentage of Volume (Units Sold)
  ROUND(COUNT(*)*100.0 / (SELECT COUNT(*) FROM InvoiceLine),2) AS PercentOfUnitSales,
  -- Total number of tracks in the catalog for artist
  track_counts.TotalTracksByArtist,
  -- Proportion of catalog that was actually sold
  ROUND(COUNT(DISTINCT il.TrackId) * 100.0 / track_counts.TotalTracksByArtist, 2) AS PercentOfTracksSold,
  -- Revenue per total track by artist
  ROUND(SUM(il.UnitPrice * il.Quantity) / track_counts.TotalTracksByArtist, 2) AS RevenuePerTotalTrack
FROM InvoiceLine il
JOIN Track t ON il.TrackId = t.TrackId
JOIN Album al ON t.AlbumId = al.AlbumId
JOIN Artist ar ON ar.ArtistId = al.ArtistId
-- Subquery to get total number of tracks for each artist
JOIN (
    SELECT 
      al.ArtistId,
      COUNT(*) AS TotalTracksByArtist
    FROM Track t
    JOIN Album al ON t.AlbumId = al.AlbumId
    GROUP BY al.ArtistId
) AS track_counts ON ar.ArtistId = track_counts.ArtistId
GROUP BY ar.Name, track_counts.TotalTracksByArtist
-- Arrange by TotalRevenue (Highest to Lowest)
ORDER BY TotalRevenue DESC
-- Limit output for readability (165 total artists)
LIMIT 25;
Displaying records 1 - 10
Artist TotalRevenue NumTracksSold AvgRevenuePerTrack PercentOfRevenue PercentOfUnitSales TotalTracksByArtist PercentOfTracksSold RevenuePerTotalTrack
Iron Maiden 138.60 140 0.99 5.95 6.25 213 57.75 0.65
U2 105.93 107 0.99 4.55 4.78 135 67.41 0.78
Metallica 90.09 91 0.99 3.87 4.06 112 70.54 0.80
Led Zeppelin 86.13 87 0.99 3.70 3.88 114 67.54 0.76
Lost 81.59 41 1.99 3.50 1.83 92 43.48 0.89
The Office 49.75 25 1.99 2.14 1.12 53 41.51 0.94
Os Paralamas Do Sucesso 44.55 45 0.99 1.91 2.01 49 79.59 0.91
Deep Purple 43.56 44 0.99 1.87 1.96 92 44.57 0.47
Faith No More 41.58 42 0.99 1.79 1.88 52 69.23 0.80
Eric Clapton 39.60 40 0.99 1.70 1.79 48 72.92 0.82
Insights
  • Revenue is more evenly distributed among artists, with lower concentration than at the genre level. The top four artists — Iron Maiden, U2, Metallica, and Led Zeppelin — generate only 18.08% of total revenue, in contrast to the much higher concentration observed across genres.
  • Non-music content features prominently among top performers.
    TV Shows, such as Lost (#5, 3.50% of revenue) and The Office (#6, 2.14% of revenue), rank among the top revenue-generating artists, reflecting the impact of premium pricing and niche appeal.
  • Catalog penetration varies widely across artists.
    The use of a single “artist” value for each track complicates interpretation, as each combination or collaboration is listed as a separate artist (rather than attributed to all artists). Single-track contributors (often collaborations or soundtrack entries) show near-complete sales penetration, despite limited catalog depth.
  • Catalog size strongly correlates with total revenue, reinforcing the importance of content volume in driving performance.

Opportunities:

  • Investment in high-margin non-music content.
    Episodic and cinematic media command higher per-unit performance and penetration despite smaller catalogs. Expanding premium non-music offerings — such as curated soundtracks and narrative series — could unlock high-margin growth.
  • Diversify the artist portfolio.
    With revenue spread broadly across many artists, expanding representation — particularly in underdeveloped genres or emerging niches — could mitigate risk and attract a wider audience.
  • Maximize existing catalog performance.
    High-performing artists with unsold tracks represent untapped value. Targeted promotions, playlists, or bundling strategies could increase catalog penetration and enhance revenue efficiency.
Show Code
SQL
-- Top Artists within the Top Genres
SELECT 
  g.Name AS Genre,
  ar.Name AS Artist,
  -- Total Revenue
  ROUND(SUM(il.UnitPrice * il.Quantity), 2) AS TotalRevenue,
  -- Number of Tracks Sold
  COUNT(*) AS NumTracksSold,
  -- Percentage of Total Revenue
  ROUND(SUM(il.UnitPrice * il.Quantity) * 100.0 / (SELECT SUM(UnitPrice * Quantity) FROM InvoiceLine), 2) AS PercentOfRevenue,
  -- Average Revenue per Track Sold
  ROUND(SUM(il.UnitPrice * il.Quantity) / COUNT(*), 2) AS AvgRevenuePerTrack,
  -- Share of distinct tracks sold (note: distinct-track based, unlike the unit-based measure in the earlier queries)
  COUNT(DISTINCT il.TrackId) * 100.0 / (SELECT COUNT(DISTINCT TrackId) FROM InvoiceLine) AS PercentOfUnitSales,
  -- Total number of tracks in the catalog for artist-genre combo
  catalog_stats.TotalTracksInCatalog,
  -- Proportion of catalog that was actually sold
  ROUND(COUNT(DISTINCT il.TrackId) * 100.0 / catalog_stats.TotalTracksInCatalog, 2) AS PercentOfCatalogSold,
  -- Revenue per total track by artist-genre combo
  ROUND(SUM(il.UnitPrice * il.Quantity) / catalog_stats.TotalTracksInCatalog, 2) AS RevenuePerCatalogTrack

FROM InvoiceLine il
JOIN Track t ON il.TrackId = t.TrackId
JOIN Album al ON t.AlbumId = al.AlbumId
JOIN Artist ar ON al.ArtistId = ar.ArtistId
JOIN Genre g ON t.GenreId = g.GenreId

-- Subquery to get the total catalog size for each Artist-Genre pair
JOIN (
    SELECT 
      al.ArtistId,
      t.GenreId,
      COUNT(*) AS TotalTracksInCatalog
    FROM Track t
    JOIN Album al ON t.AlbumId = al.AlbumId
    GROUP BY al.ArtistId, t.GenreId
) AS catalog_stats 
  ON al.ArtistId = catalog_stats.ArtistId AND t.GenreId = catalog_stats.GenreId
  
GROUP BY g.Name, ar.Name, catalog_stats.TotalTracksInCatalog
-- Arrange by TotalRevenue (Highest to Lowest)
ORDER BY TotalRevenue DESC
-- Limit output for readability
LIMIT 50;
Displaying records 1 - 10
Genre Artist TotalRevenue NumTracksSold PercentOfRevenue AvgRevenuePerTrack PercentOfUnitSales TotalTracksInCatalog PercentOfCatalogSold RevenuePerCatalogTrack
Metal Metallica 90.09 91 3.87 0.99 3.981855 112 70.54 0.80
Rock U2 90.09 91 3.87 0.99 3.881048 112 68.75 0.80
Rock Led Zeppelin 86.13 87 3.70 0.99 3.881048 114 67.54 0.76
Metal Iron Maiden 69.30 70 2.98 0.99 3.074597 95 64.21 0.73
Rock Iron Maiden 53.46 54 2.30 0.99 2.318548 81 56.79 0.66
TV Shows Lost 45.77 23 1.97 1.99 1.108871 48 45.83 0.95
Latin Os Paralamas Do Sucesso 44.55 45 1.91 0.99 1.965726 49 79.59 0.91
Rock Deep Purple 43.56 44 1.87 0.99 2.066532 92 44.57 0.47
Rock Queen 36.63 37 1.57 0.99 1.663307 45 73.33 0.81
Rock Creedence Clearwater Revival 36.63 37 1.57 0.99 1.562500 40 77.50 0.92
Insights
  • Artist-Genre revenue leaders align with broader trends.
    The top artist-genre combinations by revenue mirror those from the individual artist and genre views. For example, Metallica and Iron Maiden dominate within the Metal genre, while U2, Led Zeppelin, and Iron Maiden lead within Rock. Together, the top 5 artist-genre combinations account for 16.72% of total revenue and 17.14% of unit sales, underscoring their importance in driving overall volume.
  • High-efficiency revenue segments are concentrated in niche content.
    The top entries by revenue per track in catalog are overwhelmingly skewed towards TV shows and classical music ensembles, such as Aquaman (TV Show, $3.98 per catalog track), Battlestar Galactica (Sci Fi & Fantasy, $1.99 per catalog track), and Emerson String Quartet (Classical, $1.98 per catalog track). These offerings, while limited in volume, generate significantly more revenue per track than mainstream genres and artists, reflecting strong premium pricing and dedicated niche demand.
  • Catalog penetration varies widely by artist–genre combo.
    Several high-revenue artists still show less-than-complete catalog sales, revealing unrealized value within popular genres. On the other hand, certain one-off soundtrack or TV show entries show 100% catalog penetration, despite having only one or two tracks.

Opportunities:

  • Expand and promote high-margin, niche content.
    TV Shows demonstrate high catalog penetration while maintaining premium pricing and small catalog size. These segments offer disproportionately high returns and could benefit from improved discoverability (e.g. curated collections, themed playlists), bundling or pricing experiments (e.g., multi-track purchase bundles), and/or licensing or development of similar content with proven formats.
  • Drive deeper engagement within top genres and artists.
    High-performing combinations like Metallica (Metal) and U2 (Rock) have both scale and engagement, yet still show gaps in catalog sales. Tactics like targeted re-promotions, anniversary releases, or live/rare editions could convert more of the existing catalog into revenue.
  • Leverage genre-artist analytics for marketing segmentation.
    This crossover view reveals which artist–genre pairs have loyal followings and which may benefit from strategic push. Personalized marketing based on known affinities (e.g. fans of Battlestar Galactica may enjoy other Sci-Fi media) could unlock untapped audiences.
  • Mitigate risk through diversified content investment.
    Concentration remains a factor — a small number of artist–genre pairs account for a large share of unit sales. Expanding representation in emerging or underrepresented artist–genre combinations could reduce dependency on top performers while broadening reach.
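
As a quick sanity check, the top-5 concentration can be recomputed by summing the PercentOfRevenue and PercentOfUnitSales columns of the table above:

```python
# Top 5 rows of the artist-genre table above
top5_revenue_pct = [3.87, 3.87, 3.70, 2.98, 2.30]                       # PercentOfRevenue
top5_units_pct = [3.981855, 3.881048, 3.881048, 3.074597, 2.318548]     # PercentOfUnitSales

revenue_share = round(sum(top5_revenue_pct), 2)  # share of total revenue
units_share = round(sum(top5_units_pct), 2)      # share of unit sales
```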

Exploratory Visualizations

To prepare the exploratory visualizations, the data was retrieved with SQL queries and shaped in both R and Python.
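
The preparation pattern (a yearly breakdown unioned with an "All" aggregate, then ordered so "All" leads the animation controls) can be sketched in pandas with toy data:

```python
import pandas as pd

# Toy yearly breakdown (the real data comes from the SQL queries below)
yearly = pd.DataFrame({"Genre": ["Rock", "Rock"], "Year": [2009, 2010],
                       "TotalRevenue": [100.0, 120.0]})

# "All" aggregate, tagged so it shares the Year column with the yearly rows
agg = (yearly.groupby("Genre", as_index=False)["TotalRevenue"].sum()
             .assign(Year="All"))

# Union, with Year coerced to string for consistent types
combined = pd.concat([agg, yearly.assign(Year=yearly["Year"].astype(str))],
                     ignore_index=True)

# Ordered categorical: "All" first, then the years, for slider/dropdown order
order = ["All", "2009", "2010"]
combined["Year"] = pd.Categorical(combined["Year"], categories=order, ordered=True)
```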

Show Code
# SQL Queries
## Genre
### Yearly Breakdown
res_genre_yearly_df <- DBI::dbGetQuery(
  con_chinook, 
  "SELECT 
  -- Get Genre and Year for grouping
  g.Name AS Genre,
  YEAR(i.InvoiceDate) as Year,
  -- Total Revenue
  ROUND(SUM(il.UnitPrice * il.Quantity), 2) AS TotalRevenue,
  -- Number of Tracks
  COUNT(*) AS NumTracksSold,
  -- Average Revenue per Track
  ROUND(SUM(il.UnitPrice * il.Quantity)/COUNT(*), 2) AS AvgRevenuePerTrack,
  -- Percentage of Total Revenue
  ROUND(SUM(il.UnitPrice * il.Quantity)*100.0 / (SELECT SUM(UnitPrice * Quantity) FROM InvoiceLine), 2) AS PercentOfRevenue,
  -- Percentage of Volume (Units Sold)
  ROUND(COUNT(*)*100.0 / (SELECT COUNT(*) FROM InvoiceLine),2) AS PercentOfUnitSales,
  -- Total number of tracks in the catalog in this genre
  track_counts.TotalTracksInGenre,
  -- Proportion of catalog that was actually sold
  ROUND(COUNT(DISTINCT il.TrackId) * 100.0 / track_counts.TotalTracksInGenre, 2) AS PercentOfTracksSold,
  -- Revenue per total track in genre
  ROUND(SUM(il.UnitPrice * il.Quantity) / track_counts.TotalTracksInGenre, 2) AS RevenuePerTotalTrack,
FROM InvoiceLine il
JOIN Invoice i ON il.InvoiceId = i.InvoiceId
JOIN Track t ON il.TrackId = t.TrackId
JOIN Genre g ON t.GenreId = g.GenreId
-- Subquery to get total number of tracks in each genre
JOIN (
    SELECT GenreId, COUNT(*) AS TotalTracksInGenre
    FROM Track
    GROUP BY GenreId
) AS track_counts ON g.GenreId = track_counts.GenreId
GROUP BY g.Name, Year, track_counts.TotalTracksInGenre
-- Arrange by TotalRevenue (Highest to Lowest)
ORDER BY Year, TotalRevenue DESC;")

## Aggregate
res_genre_agg_df <- DBI::dbGetQuery(
  con_chinook,
  "SELECT 
  -- Get Genre and Year for grouping
  g.Name AS Genre,
  'All' as Year,
  -- Total Revenue
  ROUND(SUM(il.UnitPrice * il.Quantity), 2) AS TotalRevenue,
  -- Number of Tracks
  COUNT(*) AS NumTracksSold,
  -- Average Revenue per Track
  ROUND(SUM(il.UnitPrice * il.Quantity)/COUNT(*), 2) AS AvgRevenuePerTrack,
  -- Percentage of Total Revenue
  ROUND(SUM(il.UnitPrice * il.Quantity)*100.0 / (SELECT SUM(UnitPrice * Quantity) FROM InvoiceLine), 2) AS PercentOfRevenue,
  -- Percentage of Volume (Units Sold)
  ROUND(COUNT(*)*100.0 / (SELECT COUNT(*) FROM InvoiceLine),2) AS PercentOfUnitSales,
  -- Total number of tracks in the catalog in this genre
  track_counts.TotalTracksInGenre,
  -- Proportion of catalog that was actually sold
  ROUND(COUNT(DISTINCT il.TrackId) * 100.0 / track_counts.TotalTracksInGenre, 2) AS PercentOfTracksSold,
  -- Revenue per total track in genre
  ROUND(SUM(il.UnitPrice * il.Quantity) / track_counts.TotalTracksInGenre, 2) AS RevenuePerTotalTrack,
FROM InvoiceLine il
JOIN Track t ON il.TrackId = t.TrackId
JOIN Genre g ON t.GenreId = g.GenreId
-- Subquery to get total number of tracks in each genre
JOIN (
    SELECT GenreId, COUNT(*) AS TotalTracksInGenre
    FROM Track
    GROUP BY GenreId
) AS track_counts ON g.GenreId = track_counts.GenreId
GROUP BY g.Name, Year, track_counts.TotalTracksInGenre
-- Arrange by TotalRevenue (Highest to Lowest)
ORDER BY Year, TotalRevenue DESC;")

## Artist
res_artist_yearly_df <- DBI::dbGetQuery(
  con_chinook,
  "SELECT 
  -- Select Artist and Year for Grouping
  ar.Name AS Artist,
  YEAR(i.InvoiceDate) as Year,
  -- Total Revenue
  ROUND(SUM(il.UnitPrice * il.Quantity), 2) AS TotalRevenue,
  -- Number of Tracks
  COUNT(*) AS NumTracksSold,
  -- Average Revenue per Track Sold
  ROUND(SUM(il.UnitPrice * il.Quantity)/COUNT(*), 2) AS AvgRevenuePerTrack,
  -- Percentage of Total Revenue
  ROUND(SUM(il.UnitPrice * il.Quantity)*100.0 / (SELECT SUM(UnitPrice * Quantity) FROM InvoiceLine), 2) AS PercentOfRevenue,
  -- Percentage of Volume (Units Sold)
  ROUND(COUNT(*)*100.0 / (SELECT COUNT(*) FROM InvoiceLine),2) AS PercentOfUnitSales,
  -- Total number of tracks in the catalog for artist
  track_counts.TotalTracksByArtist,
  -- Proportion of catalog that was actually sold
  ROUND(COUNT(DISTINCT il.TrackId) * 100.0 / track_counts.TotalTracksByArtist, 2) AS PercentOfTracksSold,
  -- Revenue per total track by artist
  ROUND(SUM(il.UnitPrice * il.Quantity) / track_counts.TotalTracksByArtist, 2) AS RevenuePerTotalTrack
FROM InvoiceLine il
JOIN Invoice i ON il.InvoiceId = i.InvoiceId
JOIN Track t ON il.TrackId = t.TrackId
JOIN Album al ON t.AlbumId = al.AlbumId
JOIN Artist ar ON ar.ArtistId = al.ArtistId
-- Subquery to get total number of tracks for each artist
JOIN (
    SELECT 
      al.ArtistId,
      COUNT(*) AS TotalTracksByArtist
    FROM Track t
    JOIN Album al ON t.AlbumId = al.AlbumId
    GROUP BY al.ArtistId
) AS track_counts ON ar.ArtistId = track_counts.ArtistId
GROUP BY ar.Name, Year, track_counts.TotalTracksByArtist
-- Arrange by TotalRevenue (Highest to Lowest)
ORDER BY Year, TotalRevenue DESC;"
)

res_artist_agg_df <- DBI::dbGetQuery(
  con_chinook,
  "SELECT 
  -- Select Artist and Year for Grouping
  ar.Name AS Artist,
  'All' as Year,
  -- Total Revenue
  ROUND(SUM(il.UnitPrice * il.Quantity), 2) AS TotalRevenue,
  -- Number of Tracks
  COUNT(*) AS NumTracksSold,
  -- Average Revenue per Track Sold
  ROUND(SUM(il.UnitPrice * il.Quantity)/COUNT(*), 2) AS AvgRevenuePerTrack,
  -- Percentage of Total Revenue
  ROUND(SUM(il.UnitPrice * il.Quantity)*100.0 / (SELECT SUM(UnitPrice * Quantity) FROM InvoiceLine), 2) AS PercentOfRevenue,
  -- Percentage of Volume (Units Sold)
  ROUND(COUNT(*)*100.0 / (SELECT COUNT(*) FROM InvoiceLine),2) AS PercentOfUnitSales,
  -- Total number of tracks in the catalog for artist
  track_counts.TotalTracksByArtist,
  -- Proportion of catalog that was actually sold
  ROUND(COUNT(DISTINCT il.TrackId) * 100.0 / track_counts.TotalTracksByArtist, 2) AS PercentOfTracksSold,
  -- Revenue per total track by artist
  ROUND(SUM(il.UnitPrice * il.Quantity) / track_counts.TotalTracksByArtist, 2) AS RevenuePerTotalTrack
FROM InvoiceLine il
JOIN Track t ON il.TrackId = t.TrackId
JOIN Album al ON t.AlbumId = al.AlbumId
JOIN Artist ar ON ar.ArtistId = al.ArtistId
-- Subquery to get total number of tracks for each artist
JOIN (
    SELECT 
      al.ArtistId,
      COUNT(*) AS TotalTracksByArtist
    FROM Track t
    JOIN Album al ON t.AlbumId = al.AlbumId
    GROUP BY al.ArtistId
) AS track_counts ON ar.ArtistId = track_counts.ArtistId
GROUP BY ar.Name, Year, track_counts.TotalTracksByArtist
-- Arrange by TotalRevenue (Highest to Lowest)
ORDER BY Year, TotalRevenue DESC;"
)

# Combine data frames in R
res_genre_df <- dplyr::bind_rows(
  res_genre_agg_df,
  res_genre_yearly_df |> dplyr::mutate(Year = as.character(Year))
  )

res_artist_df <- dplyr::bind_rows(
  res_artist_agg_df,
  res_artist_yearly_df |> dplyr::mutate(Year = as.character(Year))
  )

# Get vector of unique years (layers/traces) - order with "All" first.
years <- c("All", sort(unique(res_genre_yearly_df$Year)))

# Make "Year" an ordered factor (Plot generation ordering)
res_genre_df <- res_genre_df |>
  dplyr::mutate(Year = factor(Year, levels = years, ordered = TRUE))

res_artist_df <- res_artist_df |>
  dplyr::mutate(Year = factor(Year, levels = years, ordered = TRUE))

# Get vector of unique Genres - order by total revenue overall
genres <- res_genre_agg_df |>
  dplyr::arrange(dplyr::desc(TotalRevenue)) |>
  dplyr::select(Genre) |>
  dplyr::distinct() |> 
  dplyr::pull()

# Make "Genre" an ordered factor (Plot generation ordering)
res_genre_df <- res_genre_df |>
  dplyr::mutate(
    Genre = factor(Genre, levels = genres, ordered = TRUE)
  )

# Get vector of unique Artists - order by total revenue overall
artists <- res_artist_agg_df |>
  dplyr::arrange(dplyr::desc(TotalRevenue)) |>
  dplyr::select(Artist) |>
  dplyr::distinct() |> 
  dplyr::pull()

# Make "Artist" an ordered factor (Plot generation ordering)
res_artist_df <- res_artist_df |>
  dplyr::mutate(
    Artist = factor(Artist, levels = artists, ordered = TRUE)
  )
Show Code
# SQL Queries
## Genre
### Yearly Breakdown
res_genre_yearly_df = con_chinook.execute(
  """SELECT 
  -- Get Genre and Year for grouping
  g.Name AS Genre,
  YEAR(i.InvoiceDate) as Year,
  -- Total Revenue
  ROUND(SUM(il.UnitPrice * il.Quantity), 2) AS TotalRevenue,
  -- Number of Tracks
  COUNT(*) AS NumTracksSold,
  -- Average Revenue per Track
  ROUND(SUM(il.UnitPrice * il.Quantity)/COUNT(*), 2) AS AvgRevenuePerTrack,
  -- Percentage of Total Revenue
  ROUND(SUM(il.UnitPrice * il.Quantity)*100.0 / (SELECT SUM(UnitPrice * Quantity) FROM InvoiceLine), 2) AS PercentOfRevenue,
  -- Percentage of Volume (Units Sold)
  ROUND(COUNT(*)*100.0 / (SELECT COUNT(*) FROM InvoiceLine),2) AS PercentOfUnitSales,
  -- Total number of tracks in the catalog in this genre
  track_counts.TotalTracksInGenre,
  -- Proportion of catalog that was actually sold
  ROUND(COUNT(DISTINCT il.TrackId) * 100.0 / track_counts.TotalTracksInGenre, 2) AS PercentOfTracksSold,
  -- Revenue per total track in genre
  ROUND(SUM(il.UnitPrice * il.Quantity) / track_counts.TotalTracksInGenre, 2) AS RevenuePerTotalTrack,
FROM InvoiceLine il
JOIN Invoice i ON il.InvoiceId = i.InvoiceId
JOIN Track t ON il.TrackId = t.TrackId
JOIN Genre g ON t.GenreId = g.GenreId
-- Subquery to get total number of tracks in each genre
JOIN (
    SELECT GenreId, COUNT(*) AS TotalTracksInGenre
    FROM Track
    GROUP BY GenreId
) AS track_counts ON g.GenreId = track_counts.GenreId
GROUP BY g.Name, Year, track_counts.TotalTracksInGenre
-- Arrange by TotalRevenue (Highest to Lowest)
ORDER BY Year, TotalRevenue DESC;"""
  ).df()
  
### Aggregate
res_genre_agg_df = con_chinook.execute(
  """SELECT 
  -- Get Genre and Year for grouping
  g.Name AS Genre,
  'All' as Year,
  -- Total Revenue
  ROUND(SUM(il.UnitPrice * il.Quantity), 2) AS TotalRevenue,
  -- Number of Tracks
  COUNT(*) AS NumTracksSold,
  -- Average Revenue per Track
  ROUND(SUM(il.UnitPrice * il.Quantity)/COUNT(*), 2) AS AvgRevenuePerTrack,
  -- Percentage of Total Revenue
  ROUND(SUM(il.UnitPrice * il.Quantity)*100.0 / (SELECT SUM(UnitPrice * Quantity) FROM InvoiceLine), 2) AS PercentOfRevenue,
  -- Percentage of Volume (Units Sold)
  ROUND(COUNT(*)*100.0 / (SELECT COUNT(*) FROM InvoiceLine),2) AS PercentOfUnitSales,
  -- Total number of tracks in the catalog in this genre
  track_counts.TotalTracksInGenre,
  -- Proportion of catalog that was actually sold
  ROUND(COUNT(DISTINCT il.TrackId) * 100.0 / track_counts.TotalTracksInGenre, 2) AS PercentOfTracksSold,
  -- Revenue per total track in genre
  ROUND(SUM(il.UnitPrice * il.Quantity) / track_counts.TotalTracksInGenre, 2) AS RevenuePerTotalTrack,
FROM InvoiceLine il
JOIN Track t ON il.TrackId = t.TrackId
JOIN Genre g ON t.GenreId = g.GenreId
-- Subquery to get total number of tracks in each genre
JOIN (
    SELECT GenreId, COUNT(*) AS TotalTracksInGenre
    FROM Track
    GROUP BY GenreId
) AS track_counts ON g.GenreId = track_counts.GenreId
GROUP BY g.Name, Year, track_counts.TotalTracksInGenre
-- Arrange by TotalRevenue (Highest to Lowest)
ORDER BY Year, TotalRevenue DESC;"""
  ).df()
  
## Artist
### Yearly Breakdown
res_artist_yearly_df = con_chinook.execute(
  """SELECT 
  -- Select Artist and Year for Grouping
  ar.Name AS Artist,
  YEAR(i.InvoiceDate) as Year,
  -- Total Revenue
  ROUND(SUM(il.UnitPrice * il.Quantity), 2) AS TotalRevenue,
  -- Number of Tracks
  COUNT(*) AS NumTracksSold,
  -- Average Revenue per Track Sold
  ROUND(SUM(il.UnitPrice * il.Quantity)/COUNT(*), 2) AS AvgRevenuePerTrack,
  -- Percentage of Total Revenue
  ROUND(SUM(il.UnitPrice * il.Quantity)*100.0 / (SELECT SUM(UnitPrice * Quantity) FROM InvoiceLine), 2) AS PercentOfRevenue,
  -- Percentage of Volume (Units Sold)
  ROUND(COUNT(*)*100.0 / (SELECT COUNT(*) FROM InvoiceLine),2) AS PercentOfUnitSales,
  -- Total number of tracks in the catalog for artist
  track_counts.TotalTracksByArtist,
  -- Proportion of catalog that was actually sold
  ROUND(COUNT(DISTINCT il.TrackId) * 100.0 / track_counts.TotalTracksByArtist, 2) AS PercentOfTracksSold,
  -- Revenue per total track by artist
  ROUND(SUM(il.UnitPrice * il.Quantity) / track_counts.TotalTracksByArtist, 2) AS RevenuePerTotalTrack
FROM InvoiceLine il
JOIN Invoice i ON il.InvoiceId = i.InvoiceId
JOIN Track t ON il.TrackId = t.TrackId
JOIN Album al ON t.AlbumId = al.AlbumId
JOIN Artist ar ON ar.ArtistId = al.ArtistId
-- Subquery to get total number of tracks for each artist
JOIN (
    SELECT 
      al.ArtistId,
      COUNT(*) AS TotalTracksByArtist
    FROM Track t
    JOIN Album al ON t.AlbumId = al.AlbumId
    GROUP BY al.ArtistId
) AS track_counts ON ar.ArtistId = track_counts.ArtistId
GROUP BY ar.Name, Year, track_counts.TotalTracksByArtist
-- Arrange by TotalRevenue (Highest to Lowest)
ORDER BY Year, TotalRevenue DESC;"""
  ).df()
  
### Aggregate
res_artist_agg_df = con_chinook.execute(
  """SELECT 
  -- Select Artist and Year for Grouping
  ar.Name AS Artist,
  'All' AS Year,
  -- Total Revenue
  ROUND(SUM(il.UnitPrice * il.Quantity), 2) AS TotalRevenue,
  -- Number of Tracks
  COUNT(*) AS NumTracksSold,
  -- Average Revenue per Track Sold
  ROUND(SUM(il.UnitPrice * il.Quantity)/COUNT(*), 2) AS AvgRevenuePerTrack,
  -- Percentage of Total Revenue
  ROUND(SUM(il.UnitPrice * il.Quantity)*100.0 / (SELECT SUM(UnitPrice * Quantity) FROM InvoiceLine), 2) AS PercentOfRevenue,
  -- Percentage of Volume (Units Sold)
  ROUND(COUNT(*)*100.0 / (SELECT COUNT(*) FROM InvoiceLine),2) AS PercentOfUnitSales,
  -- Total number of tracks in the catalog for artist
  track_counts.TotalTracksByArtist,
  -- Proportion of catalog that was actually sold
  ROUND(COUNT(DISTINCT il.TrackId) * 100.0 / track_counts.TotalTracksByArtist, 2) AS PercentOfTracksSold,
  -- Revenue per total track by artist
  ROUND(SUM(il.UnitPrice * il.Quantity) / track_counts.TotalTracksByArtist, 2) AS RevenuePerTotalTrack
FROM InvoiceLine il
JOIN Track t ON il.TrackId = t.TrackId
JOIN Album al ON t.AlbumId = al.AlbumId
JOIN Artist ar ON ar.ArtistId = al.ArtistId
-- Subquery to get total number of tracks by each artist
JOIN (
    SELECT 
      al.ArtistId,
      COUNT(*) AS TotalTracksByArtist
    FROM Track t
    JOIN Album al ON t.AlbumId = al.AlbumId
    GROUP BY al.ArtistId
) AS track_counts ON ar.ArtistId = track_counts.ArtistId
GROUP BY ar.Name, Year, track_counts.TotalTracksByArtist
-- Arrange by TotalRevenue (Highest to Lowest)
ORDER BY Year, TotalRevenue DESC;"""
  ).df()
  
# Combine data frames and ensure consistent types
res_genre_df = pd.concat([
  res_genre_agg_df,
  res_genre_yearly_df.assign(Year=res_genre_yearly_df['Year'].astype(str))
  ], ignore_index=True)
  
res_artist_df = pd.concat([
  res_artist_agg_df,
  res_artist_yearly_df.assign(Year=res_artist_yearly_df['Year'].astype(str))
  ], ignore_index=True)

# Get unique years
years = ["All"] + sorted(res_genre_df[res_genre_df['Year'] != 'All']['Year'].unique().tolist())

# Get unique genres, sorted by total revenue share
genres = (
    res_genre_df[res_genre_df['Year'] == 'All']
    .sort_values(by='TotalRevenue', ascending=False)['Genre']
    .tolist()
)

## Make Genre an ordered categorical (for plots)
res_genre_df['Genre'] = pd.Categorical(
  res_genre_df['Genre'], 
  categories=genres, 
  ordered=True
  )

res_genre_df = res_genre_df.sort_values(['Genre', 'Year'])

# Get unique artists, sorted by total revenue share
artists = (
    res_artist_df[res_artist_df['Year'] == 'All']
    .sort_values(by='TotalRevenue', ascending=False)['Artist']
    .tolist()
)

## Make Artist an ordered categorical (for plots)
res_artist_df['Artist'] = pd.Categorical(
  res_artist_df['Artist'], 
  categories=artists, 
  ordered=True
  )
  
res_artist_df = res_artist_df.sort_values(['Artist', 'Year'])

Revenue by Genre: Stacked Bar Chart

To better understand genre-level commercial performance, revenue was analyzed both in total and normalized by catalog size. This approach distinguished high-volume genres from those that generated more revenue per available track. The results were visualized with stacked plots to compare total earnings and catalog-adjusted efficiency over time.
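The normalization itself is simple: divide a genre's revenue by the number of tracks it has in the catalog. A minimal pure-Python sketch of the idea, using made-up figures rather than actual query results:

```python
# Illustrative sketch of the catalog-adjusted metric used in this section.
# The figures below are invented; real values come from the DuckDB queries.

def revenue_per_catalog_track(total_revenue, tracks_in_catalog):
    """Normalize revenue by the number of tracks available in the catalog."""
    return round(total_revenue / tracks_in_catalog, 2)

# A large catalog can dominate total revenue yet earn less per track
# than a small, efficient one.
big_catalog = revenue_per_catalog_track(826.65, 1297)   # e.g., a Rock-sized catalog
small_catalog = revenue_per_catalog_track(39.60, 26)    # e.g., a niche genre
print(big_catalog, small_catalog)  # 0.64 1.52
```

This contrast is why the analysis pairs total revenue with revenue per catalog track: the two metrics rank genres very differently.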

Show Code
# TO DO: Figure out how to get the x-axis values to show

# Stacked Bar Plot: Total Revenue by Genre
p1 <- ggplot2::ggplot(
  res_genre_df |>
    dplyr::filter(Year != "All") |>
    dplyr::mutate(Year =  forcats::fct_rev(Year)), 
  ggplot2::aes(x = Genre, y = TotalRevenue, fill = Year)
  ) +
  ggplot2::geom_bar(stat="identity") +
  # Format Y-axis into dollars
  ggplot2::scale_y_continuous(labels = scales::dollar) +
  # Labels
  ggplot2::labs(
    title = "Total Revenue",
    x = "Genre",
    y = "Total Revenue (USD$)",
    fill = "Year"
    ) +
  # Colorblind friendly color scheme
  ggplot2::scale_fill_viridis_d() +
  ggplot2::theme_minimal() +
  # Make Legend Vertical, Turn X-axis text horizontal (legibility)
  ggplot2::theme(
    legend.position = "bottom",
    legend.direction = "vertical",
    axis.text.x = ggplot2::element_text(angle = 90, vjust = 0.5, hjust = 1)
    )

# Stacked Bar Plot: Revenue per Track in Catalog by Genre
p2 <- ggplot2::ggplot(
  res_genre_df |>
    dplyr::filter(Year != "All") |>
    dplyr::mutate(Year = forcats::fct_rev(Year)), 
  ggplot2::aes(x = Genre, y = RevenuePerTotalTrack, fill = Year)
  ) +
  ggplot2::geom_bar(stat="identity") +
  # Format Y-axis into dollars
  ggplot2::scale_y_continuous(labels = scales::dollar) +
  # Labels
  ggplot2::labs(
    title = "Revenue per Track in Catalog",
    x = "Genre",
    y = "Revenue per Track (USD$)",
    fill = "Year"
    ) +
  # Colorblind friendly color scheme
  ggplot2::scale_fill_viridis_d() +
  ggplot2::theme_minimal() +
  # Make Legend Vertical, Remove X-axis label (large text & in prev plot)
  ggplot2::theme(
    legend.position = "bottom",
    legend.direction = "vertical",
    axis.text.x = ggplot2::element_blank()
    )

## Convert to interactive plotly plots
p1_int <- plotly::ggplotly(p1)
p2_int <- plotly::ggplotly(p2)

# For p2, turn off legend for all traces (already in p1)
for (i in seq_along(p2_int$x$data)) {
  p2_int$x$data[[i]]$showlegend <- FALSE
}

# Combine plots and display
plotly::subplot(
  p1_int, 
  p2_int, 
  nrows = 2, 
  shareX = TRUE, 
  titleY = TRUE
) %>%
  plotly::layout(
    title = list(text = "Revenue by Genre<br><sup>2009-01-01 to 2013-12-22</sup>"),
    legend = list(orientation = "h", x = 0.5, xanchor = "center", y = -0.1)
  )
Show Code
# Filter out "All" year
df = res_genre_df[res_genre_df["Year"] != "All"]

# Sort years in reverse
years_r = sorted(df["Year"].unique(), reverse=True)

# Use a function to suppress intermediate outputs from quarto rendering
def build_plot(df):

  # Create subplot
  fig = make_subplots(
      rows=2, cols=1,
      shared_xaxes=True,
      vertical_spacing=0.08,
      subplot_titles=("Total Revenue", "Revenue per Track in Catalog")
  )

  # Colorblind-friendly color scheme
  colors = sample_colorscale("Viridis", [i/len(years_r) for i in range(len(years_r))])
  
  # Plot 1: Total Revenue
  for i, year in enumerate(years_r):
      subset = df[df["Year"] == year]
      fig.add_trace(
          go.Bar(
              x=subset["Genre"],
              y=subset["TotalRevenue"],
              name=str(year),
              marker_color=colors[i],
              showlegend=False  # suppressed here; the lower subplot carries the full legend
          ),
          row=1, col=1
      )

  # Plot 2: Revenue per Track in Catalog
  for i, year in enumerate(years_r):
      subset = df[df["Year"] == year]
      fig.add_trace(
          go.Bar(
              x=subset["Genre"],
              y=subset["RevenuePerTotalTrack"],
              name=str(year),
              marker_color=colors[i],
              showlegend=True  # show for all so full legend appears
          ),
          row=2, col=1
      )

  # Format Layout
  fig.update_layout(
    ## Make stacked bar plot
      barmode='stack',
      title=dict(
          text='Revenue by Genre<br><sub>2009-01-01 to 2013-12-22</sub>',
          x=0.5,
          xanchor='center',
          font=dict(size=20)
      ),
      height=700,
      margin=dict(t=120),
      ## Legend on bottom, horizontal
      legend=dict(
          orientation='h',
          yanchor='bottom',
          y=-0.25,
          xanchor='center',
          x=0.5,
          title='Year'
      )
  )

  # Format axes
  ## Y-axis as dollars
  fig.update_yaxes(title_text="Total Revenue (USD)", row=1, col=1, tickformat="$,.2f")
  fig.update_yaxes(title_text="Revenue per Track (USD)", row=2, col=1, tickformat="$,.2f")
  ## Ensure only one x-axis shows, rotated (for legibility)
  fig.update_xaxes(title_text=None, row=1, col=1)
  fig.update_xaxes(title_text="Genre", tickangle=45, row=2, col=1)

  # return figure
  return fig

# Display plot
fig = build_plot(df)
fig.show()
Insights
  • Genre-level revenue remains stable over time, with no significant year-over-year shifts. This aligns with earlier findings: a small number of genres — primarily Rock, along with Latin, Metal, Alternative & Punk, and TV Shows — consistently drive the majority of revenue.
  • Revenue per track in catalog tells a different story. While Rock shows consistent but moderate returns (likely due to its large catalog volume), several niche genres outperform in efficiency:
    • Sci Fi & Fantasy shows disproportionately high revenue per track, largely driven by premium TV content such as Battlestar Galactica. This genre saw a strong spike in 2011–2012 before tapering off in 2013 — suggesting a temporary surge in popularity.
    • Comedy experienced an anomalous peak in 2012, attributable to sales of The Office episodes.
    • Bossa Nova displayed a notable jump in 2012 revenue per catalog track, indicating strong demand relative to catalog size. Though less pronounced, this trend continued into 2013, hinting at sustained niche appeal.
    • TV Show performance is likely undervalued in the current database structure, as Battlestar Galactica and The Office (TV shows classified as Sci Fi & Fantasy and Comedy, respectively) are drivers of high per-track revenue in genres that are otherwise underrepresented in total revenue.

Opportunities:

  • Invest in high-margin, premium content.
    TV-based genres such as Sci Fi & Fantasy and Comedy show strong per-track performance despite limited catalog sizes. Expanding these offerings — particularly episodic or narrative-driven media — could unlock outsized returns relative to content volume.
  • Capitalize on short-term demand spikes.
    Genres like Bossa Nova and Comedy demonstrated temporal revenue surges. Identifying and acting on these trends through curated promotions, seasonal playlists, or limited-time bundles may help convert interest into sustained revenue.
  • Reevaluate catalog strategies for large-volume genres.
    While Rock contributes the most revenue overall, its performance per track is relatively modest. Optimizing underperforming catalog segments — through targeted marketing or re-packaging — may improve monetization efficiency without requiring new content.
  • Monitor genre saturation and track penetration.
    Genres with high sales penetration (e.g., Sci Fi & Fantasy) may signal either strong consumer interest or an exhausted catalog. Ensuring a consistent stream of new content in high-penetration genres can help maintain engagement and avoid plateaus.

Artist Revenue

Revenue by Artist: Stacked Bar Chart

To better understand artist-level commercial performance, revenue was analyzed both in total and normalized by catalog size. This approach distinguished high-volume artists from those that generated more revenue per available track. The results were visualized with stacked plots to compare total earnings and catalog-adjusted efficiency over time.

Show Code
# TO DO: Figure out how to make the x-axis values show

# Stacked Bar Plot: Total Revenue by Artist
p1 <- ggplot2::ggplot(
  res_artist_df |>
    dplyr::filter(Year != "All") |>
    dplyr::mutate(Year =  forcats::fct_rev(Year)) |>
   ## Restrict to top 20 artists for legibility
    dplyr::filter(Artist %in% artists[1:20]) |>
    dplyr::mutate(Artist = factor(Artist, levels = artists[1:20], ordered = T)), 
  ggplot2::aes(x = Artist, y = TotalRevenue, fill = Year)
  ) +
  ggplot2::geom_bar(stat="identity") +
  # Format Y-axis into dollars
  ggplot2::scale_y_continuous(labels = scales::dollar) +
  # Labels
  ggplot2::labs(
    title = "Total Revenue",
    x = "Artist",
    y = "Total Revenue (USD$)",
    fill = "Year"
    ) +
  # Colorblind friendly color scheme
  ggplot2::scale_fill_viridis_d() +
  ggplot2::theme_minimal() +
  # Make Legend Vertical, Turn X-axis text horizontal (legibility)
  ggplot2::theme(
    legend.position = "bottom",
    legend.direction = "vertical",
    axis.text.x = ggplot2::element_text(angle = 90, vjust = 0.5, hjust = 1)
    )

# Stacked Bar Plot: Revenue per Track in Catalog by Artist
p2 <- ggplot2::ggplot(
  res_artist_df |>
    dplyr::filter(Year != "All") |>
    dplyr::mutate(Year = forcats::fct_rev(Year)) |>
   ## Restrict to top 20 artists for legibility
    dplyr::filter(Artist %in% artists[1:20]) |>
    dplyr::mutate(Artist = factor(Artist, levels = artists[1:20], ordered = T)), 
  ggplot2::aes(x = Artist, y = RevenuePerTotalTrack, fill = Year)
  ) +
  ggplot2::geom_bar(stat="identity") +
  # Format Y-axis into dollars
  ggplot2::scale_y_continuous(labels = scales::dollar) +
  # Labels
  ggplot2::labs(
    title = "Revenue per Track in Catalog",
    x = "Artist",
    y = "Revenue per Track (USD$)",
    fill = "Year"
    ) +
  # Colorblind friendly color scheme
  ggplot2::scale_fill_viridis_d() +
  ggplot2::theme_minimal() +
  # Make Legend Vertical, Remove X-axis label (large text & in prev plot)
  ggplot2::theme(
    legend.position = "bottom",
    legend.direction = "vertical",
    axis.text.x = ggplot2::element_blank()
    )

## Convert to interactive plotly plots
p1_int <- plotly::ggplotly(p1)
p2_int <- plotly::ggplotly(p2)

# For p2, turn off legend for all traces (already in p1)
for (i in seq_along(p2_int$x$data)) {
  p2_int$x$data[[i]]$showlegend <- FALSE
}

# Combine plots and display
plotly::subplot(
  p1_int, 
  p2_int, 
  nrows = 2, 
  shareX = TRUE, 
  titleY = TRUE
) %>%
  plotly::layout(
    title = list(text = "Revenue by Artist<br><sup>2009-01-01 to 2013-12-22</sup>"),
    legend = list(orientation = "h", x = 0.5, xanchor = "center", y = -0.1)
  )
Show Code
# Filter out "All" year
df = res_artist_df[res_artist_df["Year"] != "All"]

# Restrict to top 20 artists (for legibility)
df = df[df["Artist"].isin(artists[0:20])]

# Use a function to suppress intermediate outputs from quarto rendering
def build_plot(df):

  # Sort years in reverse
  years_r = sorted(df["Year"].unique(), reverse=True)
  
  # Create subplot
  fig = make_subplots(
      rows=2, cols=1,
      shared_xaxes=True,
      vertical_spacing=0.08,
      subplot_titles=("Total Revenue", "Revenue per Track in Catalog")
  )

  # Colorblind-friendly color scheme
  colors = sample_colorscale("Viridis", [i/len(years_r) for i in range(len(years_r))])

  # Plot 1: Total Revenue
  for i, year in enumerate(years_r):
      subset = df[df["Year"] == year]
      fig.add_trace(
          go.Bar(
              x=subset["Artist"],
              y=subset["TotalRevenue"],
              name=str(year),
              marker_color=colors[i],
              showlegend=False  # suppressed here; the lower subplot carries the full legend
          ),
          row=1, col=1
      )

  # Plot 2: Revenue per Track in Catalog
  for i, year in enumerate(years_r):
      subset = df[df["Year"] == year]
      fig.add_trace(
          go.Bar(
              x=subset["Artist"],
              y=subset["RevenuePerTotalTrack"],
              name=str(year),
              marker_color=colors[i],
              showlegend=True  # show for all so full legend appears
          ),
          row=2, col=1
      )

  # Format Layout
  fig.update_layout(
    ## Make stacked bar plot
      barmode='stack',
      title=dict(
          text='Revenue by Artist<br><sub>2009-01-01 to 2013-12-22</sub>',
          x=0.5,
          xanchor='center',
          font=dict(size=20)
      ),
      height=700,
      margin=dict(t=120),
      ## Legend on bottom, horizontal
      legend=dict(
          orientation='h',
          yanchor='bottom',
          y=-0.25,
          xanchor='center',
          x=0.5,
          title='Year'
      )
  )

  # Format axes
  ## Y-axis as dollars
  fig.update_yaxes(title_text="Total Revenue (USD)", row=1, col=1, tickformat="$,.2f")
  fig.update_yaxes(title_text="Revenue per Track (USD)", row=2, col=1, tickformat="$,.2f")
  ## Ensure only one x-axis shows, rotated (for legibility)
  fig.update_xaxes(title_text=None, row=1, col=1)
  fig.update_xaxes(title_text="Artist", tickangle=45, row=2, col=1)
  
  # return figure
  return fig

# Display plot
fig = build_plot(df)
fig.show()
Insights
  • Artist revenue is temporally consistent.
    Most artists show stable annual revenue across the 2009–2013 period. While individual years may fluctuate (e.g., dips for Metallica and Led Zeppelin in 2011), these appear to be temporary rather than indicative of long-term trends.
  • Average revenue per track sold is remarkably stable (~$0.75).
    A small number of high-volume artists (e.g., Deep Purple, Pearl Jam, Van Halen) show lower per-track revenue (~$0.50), suggesting that large catalogs may dilute revenue unless paired with sustained demand. This could indicate catalog saturation or uneven consumer engagement across their offerings.
  • Revenue per track in catalog highlights structural differences.
    This metric remains consistent for most music artists, but TV show content (e.g., Lost, The Office) shows a different pattern: exceptionally high revenue per catalog track but concentrated in specific years. For instance, Lost peaked in 2012 before falling to zero in 2013, and The Office spiked from 2010–2012 with no revenue in adjacent years.
  • TV Shows demonstrate high efficiency per asset.
    Despite small catalogs, TV content generates significantly higher revenue per track than traditional music. This suggests strong demand relative to supply — possibly pointing to unmet or episodic consumption patterns.

Opportunities:

  • Expand premium TV show content.
    The high return per catalog item in TV genres suggests an underexploited category. Adding similar content (episodic, narrative-driven media) could yield high revenue efficiency without requiring large-scale production.
  • Reinvigorate large but underperforming catalogs.
    Artists with expansive libraries but low per-track revenue (e.g., Deep Purple, Pearl Jam) may benefit from targeted marketing efforts such as curated collections, remastered releases, or highlight reels that surface their most engaging content.
  • Use consumption patterns to guide licensing and promotions.
    The temporal spikes in shows like Lost and The Office suggest windows of elevated demand. Monitor and act on these seasonal or cultural surges to time promotions, exclusive bundles, or re-releases effectively.
  • Monitor catalog saturation signals.
    When large catalogs yield diminishing per-track returns, it may be time to reassess content strategy. This includes retiring underperforming assets or introducing new content types that align better with current listener interests.

Summary of Exploratory Findings

While overall genre revenue remains stable across years, genre-specific performance varies widely depending on how it’s measured:

  • Total Revenue reflects market size. Rock dominates here, alongside Latin, Metal, and Alternative & Punk.
  • Revenue per Track in Catalog reveals efficiency. Sci Fi & Fantasy, Comedy, and Bossa Nova perform disproportionately well relative to their catalog sizes — often driven by episodic TV content.

These patterns suggest that catalog volume and monetization efficiency are often misaligned. Notably, the top-performing genres by revenue-per-track tend to be narrow in scope but high in engagement, hinting at an opportunity to rebalance the content portfolio.

Artist-level trends are similar, suggesting artists with expansive libraries but low per-track revenue may benefit from targeted marketing efforts such as curated collections, remastered releases, or highlight reels that surface their most engaging content.

  • Most Efficient Genres by Revenue per Track (2012 spike): Sci Fi & Fantasy (Battlestar Galactica), Comedy (The Office), Bossa Nova
  • Top Genres by Total Revenue: Rock, Latin, Metal, Alternative & Punk, TV Shows
Possible Next Steps

If this were real-world data, the following actions could strengthen both analytical insight and strategic decision-making.

Deepen the Analysis:

  • Attribute-level analysis for top-performing genres.
    Disaggregate genre revenue by artist, album, or track to pinpoint the specific drivers of efficiency within high-performing niches.
  • Explore customer-level purchase behavior.
    Are niche genre buyers also more valuable customers overall? Understanding cross-genre engagement can inform bundling or recommendation strategies.

Strategic Business Opportunities:

  • Invest in premium, episodic content.
    Sci Fi & Fantasy and Comedy genres — driven by TV content — deliver strong returns per track. Expanding this segment could yield high-margin growth.
  • React quickly to demand spikes.
    Temporal surges in Bossa Nova and Comedy suggest that curated playlists or limited-time offers can help capitalize on emerging trends.
  • Reoptimize large, underperforming genres.
    Rock has broad catalog coverage but moderate per-track performance. Marketing, repackaging, or removal of underperformers could improve efficiency.
  • Prioritize catalog expansion where penetration is high.
    For genres with high track sales penetration (e.g., Sci Fi), maintaining momentum may require fresh content to meet continuing demand.

Question 3: How many customers are repeat buyers?

  • Key Tables: Invoice, Customer, InvoiceLine
  • Explanation: Customer provides key customer data. Links to Invoice provide sales frequency information linked to each Customer, which may be further broken down to individual units or cost-per-unit with links to InvoiceLine.
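The core of the first query below is a purchase count per customer, with a CASE expression splitting one-time from repeat buyers. A hedged, pure-Python sketch of that logic (the invoice tuples and totals are invented; names mirror the Chinook schema):

```python
# Sketch of the one-time vs. repeat classification, on mock invoice rows
# of (CustomerId, Total). Data here is invented for illustration.
from collections import defaultdict

invoices = [(1, 9.90), (1, 3.96), (2, 5.94), (3, 1.98), (3, 8.91), (3, 0.99)]

# Aggregate purchase count and spend per customer (the CustomerPurchases CTE)
purchases = defaultdict(lambda: {"count": 0, "spent": 0.0})
for customer_id, total in invoices:
    purchases[customer_id]["count"] += 1
    purchases[customer_id]["spent"] += total

# Same logic as the SQL CASE WHEN: exactly one purchase => one-time buyer
buyer_type = {
    cid: ("One-Time Buyer" if p["count"] == 1 else "Repeat Buyer")
    for cid, p in purchases.items()
}
print(buyer_type)  # customer 2 is the only one-time buyer in this mock data
```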
Executive Summary

Customer retention patterns reveal a rapid decline within the first six months post-purchase, with periodic spikes suggesting promotional or catalog-driven re-engagement. These findings suggest that targeted re-engagement campaigns at the 4-6 month mark and long-term loyalty programs are potential strategies to extend customer lifetime value beyond typical drop-off points (e.g., after 2-3 years).

Initial exploration focused on whether average spend per purchase differs between one-time and repeat purchasers.

Show Code
SQL
-- Average Spend per Customer (One-time vs Repeat)
-- Purchase count and spend per customer
WITH CustomerPurchases AS (
    SELECT
        CustomerId,
        COUNT(*) AS PurchaseCount,
        SUM(Total) AS TotalSpent
    FROM Invoice
    GROUP BY CustomerId
)
-- Classify customers (One-time vs Repeat), Calculate Avg Spend in Category
SELECT
    CASE 
        WHEN PurchaseCount = 1 THEN 'One-Time Buyer'
        ELSE 'Repeat Buyer'
    END AS BuyerType,
    COUNT(*) AS NumCustomers,
    AVG(TotalSpent) AS AvgSpendPerCustomer
FROM CustomerPurchases
GROUP BY BuyerType;
1 record
BuyerType NumCustomers AvgSpendPerCustomer
Repeat Buyer 59 39.4678
Insights

All customers are repeat customers in this data set, indicating no one-time purchase behavior. This simplifies some analyses, but reflects a limitation in the data set’s realism.

Any business with such exceptional customer retention would be ideal for subscription models, loyalty programs, or cross-sells to improve lifetime value of such loyal customers.

To further characterize customer spending patterns, purchase intervals were queried.

Show Code
SQL
-- Repeat-customer purchase intervals
-- Get date of each customer purchase
WITH CustomerDates AS (
    SELECT
        CustomerId,
        InvoiceId,
        InvoiceDate,
        ROW_NUMBER() OVER (PARTITION BY CustomerId ORDER BY InvoiceDate) AS rn
    FROM Invoice
),

-- Calculate intervals between purchases
DateDiffs AS (
    SELECT
        c1.CustomerId,
        DATE_DIFF('day', c1.InvoiceDate, c2.InvoiceDate) AS DaysBetween
    FROM CustomerDates c1
    JOIN CustomerDates c2
        ON c1.CustomerId = c2.CustomerId
        AND c2.rn = c1.rn + 1
),

-- Aggregate purchases, tracks, and spend per customer
CustomerPurchaseStats AS (
    SELECT
        i.CustomerId,
        COUNT(DISTINCT i.InvoiceId) AS NumPurchases,
        AVG(il.TrackCount) AS AvgTracksPerPurchase,
        AVG(i.Total) AS AvgSpendPerPurchase
    FROM Invoice i
    JOIN (
        SELECT
            InvoiceId,
            COUNT(*) AS TrackCount
        FROM InvoiceLine
        GROUP BY InvoiceId
    ) il ON i.InvoiceId = il.InvoiceId
    GROUP BY i.CustomerId
)

-- Combine everything
SELECT
    s.CustomerId,
    s.NumPurchases,
    s.AvgTracksPerPurchase,
    s.AvgSpendPerPurchase,
    AVG(d.DaysBetween) AS AvgDaysBetweenPurchases
FROM CustomerPurchaseStats s
LEFT JOIN DateDiffs d ON s.CustomerId = d.CustomerId
GROUP BY s.CustomerId, s.NumPurchases, s.AvgTracksPerPurchase, s.AvgSpendPerPurchase
-- Order by average days between purchases (High to Low)
ORDER BY AvgDaysBetweenPurchases DESC;
Displaying records 1 - 10
CustomerId NumPurchases AvgTracksPerPurchase AvgSpendPerPurchase AvgDaysBetweenPurchases
23 7 5.428571 5.374286 298.0000
21 7 5.428571 5.374286 289.1667
25 7 5.428571 6.088571 289.1667
46 7 5.428571 6.517143 289.1667
8 7 5.428571 5.374286 289.1667
42 7 5.428571 5.660000 289.1667
4 7 5.428571 5.660000 289.1667
29 7 5.428571 5.374286 266.0000
16 7 5.428571 5.374286 266.0000
50 7 5.428571 5.374286 266.0000
Insights

Customers in this data set exhibited uniform purchase behavior in terms of track count per purchase (5.43), highlighting a limitation in the synthetic data set’s realism.

  • Average spend per purchase varies modestly, ranging from $5.37 to $7.09.
    • Top spenders (≥ $6.00 per purchase) represent 18.6% of customers.
  • Purchase intervals are long and inconsistent.
    • All customers wait more than 200 days (~6 months) between purchases on average.
    • 37.29% of customers wait over 8 months (243 days).
    • Top spenders are evenly distributed across the range, with no clear link between spend and purchase frequency.

Opportunities:

  • Re-engage customers sooner through loyalty programs or timed promotions around the 180–200 day mark, aiming to shorten long repurchase cycles.
  • Target high spenders with exclusive offers or early access content to maximize revenue from existing customers.
  • Introduce upsell strategies, such as larger or premium bundles, to grow average invoice size beyond the flat pattern observed here.
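The ROW_NUMBER self-join in the interval query above is a SQL idiom for "sort each customer's purchase dates, then difference consecutive pairs." A pure-Python equivalent with mock dates:

```python
# Equivalent of the ROW_NUMBER self-join: difference consecutive purchase
# dates per customer. Dates below are mock values for illustration.
from datetime import date

def purchase_gaps(dates):
    """Days between consecutive purchases, in chronological order."""
    ordered = sorted(dates)
    return [(later - earlier).days for earlier, later in zip(ordered, ordered[1:])]

gaps = purchase_gaps([date(2009, 1, 1), date(2009, 10, 1), date(2010, 8, 1)])
avg_gap = sum(gaps) / len(gaps)
print(gaps, avg_gap)  # [273, 304] 288.5
```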

The rate of retention decay was then examined as a proxy for cohort-level retention behavior.
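Conceptually, retention at month offset m is the share of the cohort that makes any purchase m months after their first. A minimal sketch with mock (CustomerId, months-since-first) pairs, mirroring the grouping logic of the SQL:

```python
# Sketch of retention-by-month-offset on mock data: each pair is
# (CustomerId, months since that customer's first purchase).
from collections import defaultdict

purchases = [(1, 0), (1, 1), (1, 6), (2, 0), (2, 1), (3, 0), (3, 6)]
cohort = {cid for cid, _ in purchases}  # every customer has an offset-0 row

# Distinct returning customers per month offset (offset 0 is the first purchase)
returning = defaultdict(set)
for cid, offset in purchases:
    if offset > 0:
        returning[offset].add(cid)

retention = {
    m: round(100.0 * len(cids) / len(cohort), 1)
    for m, cids in sorted(returning.items())
}
print(retention)  # {1: 66.7, 6: 66.7}
```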

Show Code
SQL
-- Customer repeat rate (monthly return behavior)
-- Obtain First Purchase Date (DuckDB requires a separate CTE)
WITH FirstPurchase AS (
    SELECT
        CustomerId,
        MIN(DATE(InvoiceDate)) AS FirstPurchaseDate
    FROM Invoice
    GROUP BY CustomerId
),

-- Get subsequent purchase date info from invoice
CustomerInvoices AS (
    SELECT
        i.CustomerId,
        DATE(i.InvoiceDate) AS PurchaseDate,
        fp.FirstPurchaseDate
    FROM Invoice i
    JOIN FirstPurchase fp ON i.CustomerId = fp.CustomerId
),

-- Calculate months since first purchase
RepeatInfo AS (
    SELECT
        CustomerId,
        DATE_DIFF('month', FirstPurchaseDate, PurchaseDate) AS MonthsSinceFirst
    FROM CustomerInvoices
)

-- Count distinct returning customers per month offset
SELECT
    MonthsSinceFirst,
    COUNT(DISTINCT CustomerId) AS ReturningCustomers,
    (SELECT COUNT(*) FROM FirstPurchase) AS CohortSize,
    ROUND(100.0 * COUNT(DISTINCT CustomerId) * 1.0 / (SELECT COUNT(*) FROM FirstPurchase), 1) AS RetentionPct
FROM RepeatInfo
WHERE MonthsSinceFirst > 0
GROUP BY MonthsSinceFirst
-- Arrange by "per month offset" (Low to High)
ORDER BY MonthsSinceFirst;
Displaying records 1 - 10
MonthsSinceFirst ReturningCustomers CohortSize RetentionPct
1 18 59 30.5
3 22 59 37.3
6 18 59 30.5
7 3 59 5.1
8 1 59 1.7
9 18 59 30.5
10 4 59 6.8
13 6 59 10.2
14 12 59 20.3
18 7 59 11.9
Insights
  • Customer retention is moderate and irregular over time, with notable fluctuations rather than a smooth decay curve.
  • The average retention rate across all months is relatively low at 16.6%, reflecting infrequent repeat purchase behavior.
  • Retention rates peak at 31 months (61.0%), suggesting some customers return in clusters far from their initial purchase.
  • Early retention is moderate, with 30-37% of customers returning at 1, 3, and 6 months.
  • Retention drops below 20% after 42 months, indicating most customers have ceased repeat purchases by then.

Opportunities:

  • Focus on early engagement campaigns to boost retention within the first 6 months, where natural return rates already show promise.
  • Investigate drivers behind the 31-month retention spike. Targeted marketing or product offerings around this timeline may reinforce repeat purchases.
  • Consider long-term loyalty programs to extend customer lifetime value beyond typical drop-off points (e.g., after 2-3 years).
  • Use segmentation to tailor retention strategies for different customer groups, given the irregular patterns observed.

Exploratory Visualizations

In preparation for exploratory visualization generation, the data was retrieved using SQL queries and prepared in both R and Python.

Show Code
# SQL Queries
## Retention Decay
res_repeat_df <- DBI::dbGetQuery(
  con_chinook, 
  "-- Customer repeat rate (monthly return behavior)
  -- Obtain First Purchase Date (DuckDB requires a separate CTE)
  WITH FirstPurchase AS (
      SELECT
          CustomerId,
          MIN(DATE(InvoiceDate)) AS FirstPurchaseDate
      FROM Invoice
      GROUP BY CustomerId
  ),
  
  -- Get subsequent purchase date info from invoice
  CustomerInvoices AS (
      SELECT
          i.CustomerId,
          DATE(i.InvoiceDate) AS PurchaseDate,
          fp.FirstPurchaseDate
      FROM Invoice i
      JOIN FirstPurchase fp ON i.CustomerId = fp.CustomerId
  ),
  
  -- Calculate months since first purchase
  RepeatInfo AS (
      SELECT
          CustomerId,
          DATE_DIFF('month', FirstPurchaseDate, PurchaseDate) AS MonthsSinceFirst
      FROM CustomerInvoices
  )
  
  -- Count distinct returning customers per month offset
  SELECT
      MonthsSinceFirst,
      COUNT(DISTINCT CustomerId) AS ReturningCustomers,
      (SELECT COUNT(*) FROM FirstPurchase) AS CohortSize,
      ROUND(100.0 * COUNT(DISTINCT CustomerId) * 1.0 / (SELECT COUNT(*) FROM FirstPurchase), 1) AS RetentionPct
  FROM RepeatInfo
  WHERE MonthsSinceFirst > 0
  GROUP BY MonthsSinceFirst
  -- Arrange by 'per month offset' (Low to High)
  ORDER BY MonthsSinceFirst;"
  )

## Retention Across Cohorts
res_cohort_df <- DBI::dbGetQuery(
  con_chinook,
  "-- Retention across cohorts
  -- First purchase per customer and cohort assignment
  WITH FirstPurchases AS (
      SELECT
          CustomerId,
          MIN(DATE(InvoiceDate)) AS FirstPurchaseDate,
          DATE_TRUNC('month', MIN(InvoiceDate)) AS CohortMonth
      FROM Invoice
      GROUP BY CustomerId
  ),
  
  -- All purchases labeled with cohort info
  AllPurchases AS (
      SELECT
          i.CustomerId,
          DATE(i.InvoiceDate) AS PurchaseDate,
          DATE_TRUNC('month', i.InvoiceDate) AS PurchaseMonth,
          fp.CohortMonth,
          DATE_DIFF('month', fp.FirstPurchaseDate, i.InvoiceDate) AS MonthsSinceFirst
      FROM Invoice i
      JOIN FirstPurchases fp ON i.CustomerId = fp.CustomerId
  ),
  
  -- Cohort sizes
  CohortSizes AS (
      SELECT
          CohortMonth,
          COUNT(DISTINCT CustomerId) AS CohortSize
      FROM FirstPurchases
      GROUP BY CohortMonth
  ),
  
  -- Distinct customer activity per cohort/month
  CustomerActivity AS (
      SELECT DISTINCT
          CustomerId,
          CohortMonth,
          MonthsSinceFirst
      FROM AllPurchases
  )
  
  -- Final aggregation with cohort size and retention %
  SELECT
      ca.CohortMonth,
      ca.MonthsSinceFirst,
      COUNT(DISTINCT ca.CustomerId) AS NumActiveCustomers,
      cs.CohortSize,
      ROUND(100.0 * COUNT(DISTINCT ca.CustomerId)::DOUBLE / cs.CohortSize, 1) AS RetentionPct
  FROM CustomerActivity ca
  JOIN CohortSizes cs ON ca.CohortMonth = cs.CohortMonth
  GROUP BY ca.CohortMonth, ca.MonthsSinceFirst, cs.CohortSize
  ORDER BY ca.CohortMonth, ca.MonthsSinceFirst;"
  )
Show Code
# SQL Queries
## Retention Decay
res_repeat_df = con_chinook.execute(
  """-- Customer repeat rate (monthly return behavior)
  -- Obtain first purchase date (DuckDB requires this in a separate CTE)
  WITH FirstPurchase AS (
      SELECT
          CustomerId,
          MIN(DATE(InvoiceDate)) AS FirstPurchaseDate
      FROM Invoice
      GROUP BY CustomerId
  ),
  
  -- Get subsequent purchase date info from invoice
  CustomerInvoices AS (
      SELECT
          i.CustomerId,
          DATE(i.InvoiceDate) AS PurchaseDate,
          fp.FirstPurchaseDate
      FROM Invoice i
      JOIN FirstPurchase fp ON i.CustomerId = fp.CustomerId
  ),
  
  -- Calculate months since first purchase
  RepeatInfo AS (
      SELECT
          CustomerId,
          DATE_DIFF('month', FirstPurchaseDate, PurchaseDate) AS MonthsSinceFirst
      FROM CustomerInvoices
  )
  
  -- Count distinct returning customers per month offset
  SELECT
      MonthsSinceFirst,
      COUNT(DISTINCT CustomerId) AS ReturningCustomers,
      (SELECT COUNT(*) FROM FirstPurchase) AS CohortSize,
      ROUND(100.0 * COUNT(DISTINCT CustomerId) / (SELECT COUNT(*) FROM FirstPurchase), 1) AS RetentionPct
  FROM RepeatInfo
  WHERE MonthsSinceFirst > 0
  GROUP BY MonthsSinceFirst
  -- Arrange by "per month offset" (Low to High)
  ORDER BY MonthsSinceFirst;"""
  ).df()
  
## Retention Across Cohorts
res_cohort_df = con_chinook.execute(
  """-- Retention across cohorts
  -- First purchase per customer and cohort assignment
  WITH FirstPurchases AS (
      SELECT
          CustomerId,
          MIN(DATE(InvoiceDate)) AS FirstPurchaseDate,
          DATE_TRUNC('month', MIN(InvoiceDate)) AS CohortMonth
      FROM Invoice
      GROUP BY CustomerId
  ),
  
  -- All purchases labeled with cohort info
  AllPurchases AS (
      SELECT
          i.CustomerId,
          DATE(i.InvoiceDate) AS PurchaseDate,
          DATE_TRUNC('month', i.InvoiceDate) AS PurchaseMonth,
          fp.CohortMonth,
          DATE_DIFF('month', fp.FirstPurchaseDate, i.InvoiceDate) AS MonthsSinceFirst
      FROM Invoice i
      JOIN FirstPurchases fp ON i.CustomerId = fp.CustomerId
  ),
  
  -- Cohort sizes
  CohortSizes AS (
      SELECT
          CohortMonth,
          COUNT(DISTINCT CustomerId) AS CohortSize
      FROM FirstPurchases
      GROUP BY CohortMonth
  ),
  
  -- Distinct customer activity per cohort/month
  CustomerActivity AS (
      SELECT DISTINCT
          CustomerId,
          CohortMonth,
          MonthsSinceFirst
      FROM AllPurchases
  )
  
  -- Final aggregation with cohort size and retention %
  SELECT
      ca.CohortMonth,
      ca.MonthsSinceFirst,
      COUNT(DISTINCT ca.CustomerId) AS NumActiveCustomers,
      cs.CohortSize,
      ROUND(100.0 * COUNT(DISTINCT ca.CustomerId)::DOUBLE / cs.CohortSize, 1) AS RetentionPct
  FROM CustomerActivity ca
  JOIN CohortSizes cs ON ca.CohortMonth = cs.CohortMonth
  GROUP BY ca.CohortMonth, ca.MonthsSinceFirst, cs.CohortSize
  ORDER BY ca.CohortMonth, ca.MonthsSinceFirst;"""
  ).df()
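For readers who prefer dataframes to SQL, the cohort query above can be mirrored in pandas. The sketch below runs on a hypothetical toy stand-in for the Invoice table (not the Chinook data); month arithmetic via `Period` subtraction approximates DuckDB's `DATE_DIFF('month', ...)`.

```python
import pandas as pd

# Hypothetical toy stand-in for the Invoice table (illustrative values only)
invoices = pd.DataFrame({
    "CustomerId": [1, 1, 1, 2, 2, 3],
    "InvoiceDate": pd.to_datetime([
        "2009-01-05", "2009-02-10", "2009-04-01",
        "2009-01-20", "2009-03-15",
        "2009-02-02",
    ]),
})

# First purchase per customer and cohort assignment (mirrors FirstPurchases)
first = (
    invoices.groupby("CustomerId")["InvoiceDate"].min()
    .rename("FirstPurchaseDate")
    .reset_index()
)
df = invoices.merge(first, on="CustomerId")
df["CohortMonth"] = df["FirstPurchaseDate"].dt.to_period("M")

# Whole-month offset since first purchase (approximates DATE_DIFF('month', ...))
df["MonthsSinceFirst"] = (
    df["InvoiceDate"].dt.to_period("M") - df["CohortMonth"]
).apply(lambda offset: offset.n)

# Cohort sizes (mirrors CohortSizes)
cohort_size = (
    df.groupby("CohortMonth")["CustomerId"].nunique().rename("CohortSize")
)

# Distinct active customers per cohort/month offset, plus retention %
activity = (
    df.groupby(["CohortMonth", "MonthsSinceFirst"])["CustomerId"]
    .nunique()
    .rename("NumActiveCustomers")
    .reset_index()
    .merge(cohort_size.reset_index(), on="CohortMonth")
)
activity["RetentionPct"] = (
    100.0 * activity["NumActiveCustomers"] / activity["CohortSize"]
).round(1)
```

On the toy data, the January 2009 cohort has two customers, so a single returning customer in month 1 yields a 50% retention cell, matching what the SQL would report.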

Retention Decay Plot

To better understand customer retention over time, a retention decay curve was generated.

Show Code
# Retention Decay Plot
# Add a column for formatted hover text
res_repeat_df$hover_text <- paste0(
  "Month: ", res_repeat_df$MonthsSinceFirst, "\n",
  "Retention: ", res_repeat_df$RetentionPct, "%"
)

# Retention Curve Line Plot
p <- ggplot2::ggplot(
  res_repeat_df, 
  ggplot2::aes(x = MonthsSinceFirst, y = RetentionPct)) +
  # Make line plot - emphasize points
  ggplot2::geom_line(color = "cadetblue4", linewidth = 1.2) +
  ggplot2::geom_point(color = "cadetblue4", size = 2, ggplot2::aes(text = hover_text)) +
  # Mark every 3 months on the x axis
  ggplot2::scale_x_continuous(
    breaks = seq(0, max(res_repeat_df$MonthsSinceFirst), by = 3)
  ) +
  # Label y axis with "%"
  ggplot2::scale_y_continuous(labels = function(x) paste0(x, "%"), limits = c(0, 100)) +
  # Add labels
  ggplot2::labs(
    title = "Customer Retention Decay <br><sub>2009-01-01 to 2013-12-22</sub>",
    x = "Months Since First Purchase",
    y = "Retention (%)"
  ) +
  ggplot2::theme_minimal(base_size = 14)

# Convert to interactive plotly for quarto display
plotly::ggplotly(p, tooltip = "text")
Show Code
# Retention Decay Plot
# Add a column for formatted hover text
res_repeat_df["hover_text"] = (
    "Month: " + res_repeat_df["MonthsSinceFirst"].astype(str) + "<br>" +
    "Retention: " + res_repeat_df["RetentionPct"].astype(str) + "%"
)

# Use a function to suppress intermediate outputs from quarto rendering
def build_plot(df):

  # Initialize the plot
  fig = go.Figure()

  # Retention Curve Line Plot
  fig.add_trace(go.Scatter(
      x=df["MonthsSinceFirst"],
      y=df["RetentionPct"],
      mode="lines+markers",
      line=dict(color="cadetblue", width=2),
      marker=dict(size=6),
      hovertext=df["hover_text"],
      hoverinfo="text",
      name="Retention"
  ))

  # Labels
  fig.update_layout(
      title=dict(
          text="Customer Retention Decay <br><sub>2009-01-01 to 2013-12-22</sub>",
          x=0.5,
          xanchor="center",
          font=dict(size=20)
      ),
      xaxis=dict(
          title="Months Since First Purchase",
          tickmode="linear",
          tick0=0,
          dtick=3
      ),
      yaxis=dict(
          title="Retention (%)",
          range=[0, 100],
          ticksuffix="%"
      ),
      template="plotly_white",
      height=500
  )

  # return figure
  return fig

# Display plot
fig = build_plot(res_repeat_df)
fig.show()
Insights
  • Retention decays quickly after the first 6 months, indicating limited long-term engagement.
  • The pattern is sporadic and largely flat, suggesting irregular purchase cycles or infrequent triggers for re-engagement.
  • Unexpected retention spikes between months 28–33 (peaking at 61% in month 31) may reflect anomalies in the synthetic dataset, delayed reactivation behavior, or unknown promotional triggers.
  • After month 42, retention levels flatten at or below 20%, indicating minimal long-term customer activity.

Opportunities:

  • Implement re-engagement campaigns at the 4–6 month mark, when drop-off risk is highest.
  • Introduce loyalty programs or time-based offers to reward continued engagement in the first year.
  • Investigate drivers behind the month 28-31 spike. If this behavior was tied to promotions or catalog changes, similar tactics could be used to reactivate dormant customers.

Retention Across Cohorts: Heatmap

For further perspective on customer retention dynamics, a retention heatmap was generated.

Show Code
# Cohort Retention Heatmap
# Fill in missing Cohort-Month combinations with missing values (`NA`)
res_cohort_df <- tidyr::complete(
  res_cohort_df,
  CohortMonth = seq.Date(
    min(res_cohort_df$CohortMonth), 
    max(res_cohort_df$CohortMonth), 
    by = "month"
    ),
  MonthsSinceFirst = seq(
    min(res_cohort_df$MonthsSinceFirst), 
    max(res_cohort_df$MonthsSinceFirst), 
    by = 1
    )
  ) |>
# Filter out any non-included values (Cohort pre Jan 2009, MonthsSince <= 0)
  dplyr::filter(
    MonthsSinceFirst > 0 & CohortMonth >= "2009-01-01"
  )

# Add a column for formatted hover text
res_cohort_df <- res_cohort_df |>
  dplyr::mutate(
    hover_text = ifelse(
      !is.na(RetentionPct), 
      paste0(
        "Cohort: ", format(CohortMonth, "%b %Y"), "\n",
        "Cohort Size: ", CohortSize, "\n",
        "Month: ", MonthsSinceFirst, "\n",
        "Retention: ", sprintf("%1.2f%%", RetentionPct)
        ),
      paste0(
        "Cohort: ", format(CohortMonth, "%b %Y"), "\n",
        "Month: ", MonthsSinceFirst, "\n",
        "No data"
        )
      )
    )

# Retention Heatmap
p <- ggplot2::ggplot(
  res_cohort_df, 
  ggplot2::aes(
    x = MonthsSinceFirst, 
    y = CohortMonth, 
    fill = RetentionPct,
    text = hover_text
    )
  ) +
  # Put a white box around each tile
  ggplot2::geom_tile(color = "white", linewidth = 0.2) +
  # Use Color-blind friendly color scheme
  ggplot2::scale_fill_viridis_c(
    # Make empty/missing tiles a gray color - does not transfer to plotly
    na.value = "gray90",      
    # Default version
    option = "D",                
    # Make higher values a darker color
    direction = -1               
  ) +
  # Make x axis evenly marked 
  ggplot2::scale_x_continuous(breaks = pretty(res_cohort_df$MonthsSinceFirst)) +
  # Label Y axis in human legible format (Month short name + YYYY)
  ggplot2::scale_y_date(date_breaks = "1 month", date_labels = "%b %Y") +
  # Labels
  ggplot2::labs(
    title = "Customer Retention by Cohort <br><sub>2009-01-01 to 2013-12-22</sub>",
    x = "Months Since First Purchase",
    y = "Cohort Month (First Purchase)",
    fill = "Retention (%)"
  ) +
  # Optional styling
  ggplot2::theme_minimal(base_size = 14) +
  ggplot2::theme(
    axis.text.x = ggplot2::element_text(angle = 45, hjust = 1),
    panel.grid = ggplot2::element_blank()
  )

# Convert to interactive plotly for quarto display
plotly::ggplotly(p, tooltip = "text")
Show Code
# Cohort Retention Heatmap
# Create grid that fills in missing Cohort-Month combinations with missing values (`NA`)
cohort_range = pd.date_range(
  start = res_cohort_df['CohortMonth'].min(), 
  end = res_cohort_df['CohortMonth'].max(), 
  freq='MS'
  )
  
month_range = range(
  res_cohort_df['MonthsSinceFirst'].min(),
  res_cohort_df['MonthsSinceFirst'].max() + 1
  )

# Create complete combination grid
full_grid = pd.MultiIndex.from_product(
    [cohort_range, month_range],
    names=['CohortMonth', 'MonthsSinceFirst']
).to_frame(index=False)

## Merge with actual data to fill in missing combinations
res_df = full_grid.merge(
  res_cohort_df, 
  on=['CohortMonth', 'MonthsSinceFirst'], 
  how='left'
  )

# Filter out any non-included values (Cohort pre Jan 2009, MonthsSince <= 0)
res_df = res_df[
    (res_df['CohortMonth'] >= pd.to_datetime('2009-01-01')) & 
    (res_df['MonthsSinceFirst'] > 0)
]

# Create a human-legible version of the cohort month for labels
res_df['CohortLabel'] = res_df['CohortMonth'].dt.strftime('%b %Y')

# Ensure they stay in order
res_df['CohortLabel'] = pd.Categorical(
    res_df['CohortLabel'], 
    ordered=True,
    categories=sorted(res_df['CohortMonth'].dt.strftime('%b %Y').unique(), key=lambda x: pd.to_datetime(x))
)

# Add a column for formatted hover text
res_df['hover_text'] = np.where(
    ~res_df['RetentionPct'].isna(),
    "Cohort: " + res_df['CohortMonth'].dt.strftime("%b %Y") +
    "<br>Cohort Size: " + res_df['CohortSize'].astype('Int64').astype(str) +
    "<br>Month: " + res_df['MonthsSinceFirst'].astype(str) +
    "<br>Retention: " + (res_df['RetentionPct']).round(2).astype(str) + "%",
    "Cohort: " + res_df['CohortLabel'].astype(str) +
    "<br>Month: " + res_df['MonthsSinceFirst'].astype(str) +
    "<br>No data"
)

# Use a function to suppress intermediate outputs from quarto rendering
def build_plot(df):
  # Create pivot tables
  retention = df.pivot(index="CohortLabel", columns="MonthsSinceFirst", values="RetentionPct")
  text = df.pivot(index="CohortLabel", columns="MonthsSinceFirst", values="hover_text")

  # Create heatmap with Plotly
  fig = go.Figure(
      data=go.Heatmap(
          z=retention.values,
          x=retention.columns,
          y=retention.index,
          # Set hover-text
          text=text.values,
          hoverinfo='text',
          # Colorblind-friendly color scheme (Reverse - darker is higher %)
          colorscale='Viridis_r',
          zmin=0,
          zmax=100,
          colorbar=dict(title="Retention (%)"),
          showscale=True,
          hoverongaps=False,
          zauto=False
      )
  )

  # Layout (Labels)
  fig.update_layout(
      title="Customer Retention by Cohort<br><sub>2009-01-01 to 2013-12-22</sub>",
      xaxis_title="Months Since First Purchase",
      yaxis_title="Cohort Month (First Purchase)",
      xaxis=dict(type='category'),
      yaxis=dict(type='category'),
      height=600
  )
  
  # return figure
  return fig

# Display plot
fig = build_plot(res_df)
fig.show()
Insights
  • There are no strong diagonal trends, suggesting that retention behavior is not driven by seasonality or time-of-year events.
  • Vertical spikes at months 1, 3, 6, 9, 14, 28, 31, 33, 40, and 42 indicate that certain months consistently see higher re-engagement, regardless of cohort. These patterns are more pronounced in later cohorts (September 2009 and onward).
  • Earlier cohorts (Jan–Aug 2009) show more scattered retention spikes in later months (e.g., 18–27, 46–53), though these are generally smaller in magnitude.
  • The strongest retention occurs at month 31, with most cohorts retaining 50% or more of their users.
  • The absence of a consistent decay slope and the presence of periodic spikes suggest irregular customer behavior, potentially influenced by catalog changes, promotions, or synthetic data quirks.

Opportunities:

  • Investigate month 31 and other spike periods for possible triggers. Catalog expansions, price changes, or customer lifecycle milestones could be replicated or scaled.
  • Design targeted campaigns around known high-retention intervals (e.g., months 1, 3, 6, 9) to encourage early and repeat engagement.
  • Consider dynamic lifecycle-based messaging rather than time-based marketing, given the erratic nature of long-term retention.
  • Explore subscription models or bundling strategies to stabilize and extend engagement beyond the initial 6–9 month window.
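As a concrete starting point for the spike investigation suggested above, unusual retention months can be flagged against a rolling baseline before digging into event logs. This is a rough sketch on an illustrative series shaped like `res_repeat_df`; the window size and 15-point margin are assumptions, not tuned thresholds.

```python
import pandas as pd

# Illustrative decay series shaped like res_repeat_df (values are made up)
decay = pd.DataFrame({
    "MonthsSinceFirst": range(1, 13),
    "RetentionPct": [40, 35, 38, 30, 28, 25, 24, 23, 55, 24, 22, 21],
})

# Centered rolling median as a local baseline for each month offset
decay["RollingMedian"] = (
    decay["RetentionPct"].rolling(window=5, center=True, min_periods=3).median()
)

# Flag months whose retention exceeds the baseline by the assumed margin
decay["Spike"] = decay["RetentionPct"] > decay["RollingMedian"] + 15

spikes = decay.loc[decay["Spike"], "MonthsSinceFirst"].tolist()
```

On the real data, the flagged offsets (months 28–33 in this analysis) would then be cross-referenced against promotional calendars or catalog change logs.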

Summary of Exploratory Findings

Customer retention drops steeply within the first 6 months after initial purchase, with retention falling below 20% beyond month 42. This indicates that long-term engagement is limited, and most customers do not return frequently after their initial purchases. Unexpected spikes in retention at months 28–33 suggest irregular purchase cycles or promotional triggers. These spikes are consistent across cohorts, especially in later cohorts after September 2009.

  • Peak retention at month 31.
  • Initial drop-off after month 6.
Possible Next Steps

If this were real-world data, the following actions could strengthen both analytical insight and strategic decision-making.

Deepen the Analysis:

  • Investigate the causes of retention spikes, particularly around month 31, through detailed promotional or catalog event logs.
  • Analyze customer segments to identify differences in retention behavior and tailor marketing strategies accordingly.
  • Explore product-level repeat purchase patterns for granular insights.

Strategic Business Opportunities:

  • Implement re-engagement campaigns at the 4–6 month mark, when drop-off risk is highest.
  • Introduce loyalty programs or time-based offers to reward continued engagement in the first year.
  • Track customer reactivation triggers more systematically to reduce sporadic behavior and increase predictable retention patterns.

Summary

Revenue is primarily driven by North American and Central European markets, with consistent year-over-year patterns. However, anomalies like Australia’s steep revenue decline post-2011 and Europe’s spike in the same year suggest operational shifts or market events that merit further investigation. Markets like the USA and Brazil offer growth potential through deeper engagement, while smaller but lucrative markets such as Austria and Chile present attractive expansion opportunities.

Overall genre revenue is stable, but there is considerable variation in monetization efficiency. High revenue-per-track genres tend to be narrowly focused with strong audience engagement, whereas top artists by total revenue typically generate income through large catalogs with lower revenue per track. Artists with expansive libraries and low per-track returns could benefit from marketing strategies like curated collections or remastered releases to surface their most compelling content.

Retention declines steeply in the first six months after purchase, with an unusual spike in months 28–33, possibly due to promotional campaigns or irregular buying cycles. Understanding these patterns is critical to designing effective re-engagement and loyalty programs.

Possible Next Steps

If this were real-world data, the following actions could strengthen both analytical insight and strategic decision-making.

Deepen the Analysis

  • Investigate Australia’s 2012–13 revenue decline to identify churn drivers or external influences.
  • Analyze U.S. customer cohorts for upsell and retention opportunities given its dominant revenue role.
  • Perform attribute-level analysis within top genres to uncover key drivers of efficiency.
  • Examine cross-genre customer behavior and its impact on lifetime value.
  • Explore causes behind retention spikes and identify effective promotional triggers.
  • Segment customers by retention behavior for tailored marketing.
  • Study product-level repeat purchases for granular insights.

Strategic Business Opportunities

  • Target acquisition in underpenetrated but high-value markets like Chile, Hungary, and Ireland.
  • Localize content and marketing for emerging regions such as India and broader Asia.
  • Re-engage lapsed users in markets with past revenue declines (e.g., Brazil, Sweden).
  • Expand premium, episodic content in high-return genres like Sci Fi and Comedy.
  • Capitalize on temporal demand spikes with curated playlists and limited-time offers.
  • Optimize large but underperforming catalogs (e.g., Rock) through marketing or pruning.
  • Maintain content momentum in high-penetration genres by refreshing catalogs.
  • Implement targeted re-engagement campaigns at 4-6 months to counter early churn.
  • Develop loyalty programs and systematically track reactivation triggers to enhance retention.