Appendix A β€” Sample Data

Clickstream dataset generator and public ML datasets

Keywords

snowflake, feature store, ml, machine learning, mlops

A.1 Overview

This appendix documents the sample data used throughout the Feature Store Implementation Guide:

  1. Clickstream Dataset - A synthetic e-commerce dataset generated specifically for this guide
  2. Public ML Datasets - Standard datasets (Iris, Titanic, etc.) for simple examples
Recommended: Streamlit Data Manager

For first-time setup and interactive data management, we recommend using the Streamlit Data Manager app. It provides a visual interface for all data operations with guided workflows.

cd data_manager/
export SNOWFLAKE_CONNECTION_NAME=your_connection
streamlit run app.py

πŸ“– Full Streamlit App Documentation β†’

A.2 Clickstream Dataset

A.2.1 Quick Start

You can manage sample data using either the Streamlit UI (interactive) or CLI tools (scriptable).

cd data_manager/

# Set Snowflake connection
export SNOWFLAKE_CONNECTION_NAME=your_connection

# Launch the UI
streamlit run app.py

The UI provides:

  • πŸ“Š Overview - View current data status
  • πŸ“₯ Initial Load - Generate and load synthetic data
  • πŸ”„ Incremental Generator - Manage continuous data generation
  • 🌿 Development Branch - Create sampled dev environments
  • πŸ“š Public Datasets - Load standard ML datasets
  • πŸ“ˆ Monitoring - Track task and DT executions

See detailed walkthrough with screenshots β†’

cd generator/

# Set Snowflake connection
export SNOWFLAKE_CONNECTION_NAME=your_connection

# Step 1: Initial load (scale 0.01 = ~1K users)
python main.py --scale 0.01 --output snowflake

# Step 2 (Optional): Deploy continuous generation
cd snowflake_native
python deploy.py --start
from data_manager.core import DataManager

dm = DataManager()
dm.connect()

# Initial load
result = dm.run_initial_load(scale=0.01)
print(f"Loaded: {result.row_counts}")

# Create dev branch
result = dm.create_dev_branch("MY_DEV_DB", sample_pct=10)

A.2.2 Required Privileges

The data loading scripts require specific Snowflake privileges depending on your use case.

A.2.2.1 Minimum Privileges (Using Existing Database)

If you have an existing database and just need to create schemas and tables:

Privilege Object Purpose
USAGE Warehouse Execute queries
USAGE Database Access the database
CREATE SCHEMA Database Create CLICKSTREAM_DATA schema
CREATE TABLE Schema Create data tables
CREATE STAGE Schema Upload Parquet files (bulk load)
-- Grant minimum privileges to a role (canonical guide names shown)
GRANT USAGE ON WAREHOUSE FS_DEV_WH TO ROLE my_role;
GRANT USAGE ON DATABASE FEATURE_STORE_DEMO TO ROLE my_role;
GRANT CREATE SCHEMA ON DATABASE FEATURE_STORE_DEMO TO ROLE my_role;

-- After schema creation, grant table privileges
GRANT ALL ON SCHEMA FEATURE_STORE_DEMO.CLICKSTREAM_DATA TO ROLE my_role;

A.2.2.2 Full Setup (Creating New Database)

For the complete setup including database creation:

Privilege Object Purpose
CREATE DATABASE Account Create FEATURE_STORE_DEMO database
USAGE Warehouse Execute queries
-- Grant full setup privileges
GRANT CREATE DATABASE ON ACCOUNT TO ROLE my_role;
GRANT USAGE ON WAREHOUSE FS_DEV_WH TO ROLE my_role;

A.2.2.3 Incremental Generator (Additional)

To deploy and run the Snowflake-native incremental generator:

Privilege Object Purpose
CREATE PROCEDURE Schema Create stored procedure
CREATE TASK Schema Create scheduled task
EXECUTE TASK Account Allow task execution
-- Grant incremental generator privileges (admin schema for generator objects)
GRANT CREATE PROCEDURE ON SCHEMA FEATURE_STORE_DEMO.CLICKSTREAM_ADMIN TO ROLE my_role;
GRANT CREATE TASK ON SCHEMA FEATURE_STORE_DEMO.CLICKSTREAM_ADMIN TO ROLE my_role;
GRANT EXECUTE TASK ON ACCOUNT TO ROLE my_role;

-- After task creation, grant operate privilege
GRANT OPERATE ON TASK incremental_data_task TO ROLE my_role;
Using a Custom Database

If you don’t have CREATE DATABASE privilege, use the --database flag to specify an existing database:

python main.py --scale 0.01 --database MY_EXISTING_DB --schema MY_SCHEMA --output snowflake
Recommended: Create a Dedicated Role

For production use, create a dedicated role with only the required privileges:

-- See Appendix B: Environment Setup for the full security model
-- Quick start: run scripts/01_dba_setup.sql (Appendix B)

-- Or manually create a role:
CREATE ROLE IF NOT EXISTS FS_ADMIN_ROLE;
GRANT USAGE ON WAREHOUSE FS_DEV_WH TO ROLE FS_ADMIN_ROLE;
GRANT CREATE DATABASE ON ACCOUNT TO ROLE FS_ADMIN_ROLE;
GRANT ROLE FS_ADMIN_ROLE TO USER my_user;

A.2.3 Schema Organization

The generator creates two schemas:

Schema Purpose
CLICKSTREAM_DATA Data tables (11 tables)
CLICKSTREAM_ADMIN Generator config, state, task

A.2.4 Data Tables

Table Description Key Columns
CATEGORIES Product categories CATEGORY_ID, CATEGORY_NAME
SUPPLIERS Product suppliers SUPPLIER_ID, SUPPLIER_NAME
PRODUCTS Product catalog PRODUCT_ID, PRODUCT_NAME, CATEGORY_ID
PRODUCT_SUPPLIER Product-supplier mapping PRODUCT_ID, SUPPLIER_ID (composite PK)
HOUSEHOLDS Customer households HOUSEHOLD_ID, ADDRESS
VISITORS Anonymous visitors VISITOR_ID, FIRST_SEEN_TS
USERS Registered users USER_ID, VISITOR_ID, HOUSEHOLD_ID
SESSIONS User sessions SESSION_ID, VISITOR_ID, USER_ID
EVENTS Clickstream events EVENT_ID, SESSION_ID, EVENT_TYPE
ORDERS Purchase orders ORDER_ID, USER_ID, ORDER_TS
ORDER_ITEMS Order line items ORDER_ITEM_ID, ORDER_ID, PRODUCT_ID

A.2.5 Scale Factors

Scale Users Sessions Events Orders
0.01 1K 5K 44K 250
0.1 10K 50K 440K 2.5K
1.0 100K 500K 4.4M 25K

A.2.6 Incremental Generation

For Dynamic Table and Online Feature testing we need an incremental data generator to show the evolution of the feature store data through feature pipelines over time. This can be deployed using the following SQL commands:

-- Start continuous generation (every minute)
ALTER TASK FEATURE_STORE_DEMO.CLICKSTREAM_ADMIN.INCREMENTAL_DATA_TASK RESUME;

-- Monitor
SELECT * FROM FEATURE_STORE_DEMO.CLICKSTREAM_ADMIN.GENERATION_STATUS;

-- Stop when done
ALTER TASK FEATURE_STORE_DEMO.CLICKSTREAM_ADMIN.INCREMENTAL_DATA_TASK SUSPEND;

Note: Leaving these executing will continue to generate data and update the feature store data over time, but will incur costs for the compute resources used to generate the data. When you are done with the testing, you can suspend the task by running the following SQL command:

ALTER TASK FEATURE_STORE_DEMO.CLICKSTREAM_ADMIN.INCREMENTAL_DATA_TASK SUSPEND;

You can resume the task at any time when you want to continue testing and it will continue to generate data and update the feature store data over time, filling in the gaps where data was not generated. The first task iteration after a suspension will start from the last batch timestamp, and continue to generate data and update the feature store data over time. This initial execution will be slower than subsequent executions as it will need to catch up and apply more data.

A.2.7 Files

File/Folder Purpose
generator/main.py Initial data generation
generator/snowflake_native/ Snowflake-native incremental generation
generator/README.md Complete documentation

A.3 Public ML Datasets

For simple inline examples and testing, we provide standard ML datasets:

Dataset Records Sample Use Case
penguins 344 All Multi-class classification
titanic 891 All Binary classification
iris 150 All Classification basics
wine 1,599 All Regression / classification
california_housing 20,640 All Regression
nyc_taxi ~3M 50K Temporal features, fare prediction
credit_card_fraud 284K 50K Imbalanced classification, temporal

A.3.1 Loading Public Datasets

cd public_datasets/

# List available datasets
python load_datasets.py --list

# Load specific dataset
python load_datasets.py penguins

# Load all datasets
python load_datasets.py all

# Full dataset (no sampling) for large datasets
python load_datasets.py nyc_taxi --full

Small datasets can also be loaded inline using FROM VALUES clauses for self-contained code snippets.

A.4 Development Database Subset Tool

Create development database β€œbranches” from production data with filtered subsets using Dynamic Tables.

A.4.1 Quick Start

cd database_subset/

# Create development database (10% sample)
python deploy.py create \
    --prod-db FEATURE_STORE_DEMO \
    --dev-db FEATURE_STORE_DEMO_DEV \
    --sample-pct 10 \
    --warehouse FS_DEV_WH

# Check status
python deploy.py status --dev-db FEATURE_STORE_DEMO_DEV

# Suspend Dynamic Tables (cost savings)
python deploy.py suspend --dev-db FEATURE_STORE_DEMO_DEV

# Resume Dynamic Tables
python deploy.py resume --dev-db FEATURE_STORE_DEMO_DEV

A.4.2 Key Features

  • Same Object Names: All schemas/tables have identical names below database level
  • Environment Switching: Change USE DATABASE to switch between DEV and PROD
  • Real-time Sync: Dynamic Tables automatically refresh from production
  • Referential Integrity: Cascading filters maintain FK relationships

A.4.3 Multi-Environment Setup

flowchart LR
  subgraph PROD[FEATURE_STORE_DEMO]
    P1[CLICKSTREAM_DATA]
  end
  subgraph DEV[FEATURE_STORE_DEMO_DEV]
    D1[CLICKSTREAM_DATA sampled]
  end
  PROD --- DEV

Production vs development database subset sizes

See database_subset/README.md for complete documentation.