Appendix A — Sample Data

Clickstream dataset generator and public ML datasets

Keywords

snowflake, feature store, ml, machine learning, mlops

A.1 Overview

This appendix documents the sample data used throughout the Feature Store Implementation Guide:

Clickstream Dataset - A synthetic e-commerce dataset generated specifically for this guide
Public ML Datasets - Standard datasets (Iris, Titanic, etc.) for simple examples

Recommended: Streamlit Data Manager

For first-time setup and interactive data management, we recommend using the Streamlit Data Manager app. It provides a visual interface for all data operations with guided workflows.

cd data_manager/
export SNOWFLAKE_CONNECTION_NAME=your_connection
streamlit run app.py

📖 Full Streamlit App Documentation →

A.2 Clickstream Dataset

A.2.1 Quick Start

You can manage sample data using either the Streamlit UI (interactive) or CLI tools (scriptable).

cd data_manager/

# Set Snowflake connection
export SNOWFLAKE_CONNECTION_NAME=your_connection

# Launch the UI
streamlit run app.py

The UI provides:

📊 Overview - View current data status
📥 Initial Load - Generate and load synthetic data
🔄 Incremental Generator - Manage continuous data generation
🌿 Development Branch - Create sampled dev environments
📚 Public Datasets - Load standard ML datasets
📈 Monitoring - Track task and DT executions

See detailed walkthrough with screenshots →

cd generator/

# Set Snowflake connection
export SNOWFLAKE_CONNECTION_NAME=your_connection

# Step 1: Initial load (scale 0.01 = ~1K users)
python main.py --scale 0.01 --output snowflake

# Step 2 (Optional): Deploy continuous generation
cd snowflake_native
python deploy.py --start

from data_manager.core import DataManager

dm = DataManager()
dm.connect()

# Initial load
result = dm.run_initial_load(scale=0.01)
print(f"Loaded: {result.row_counts}")

# Create dev branch
result = dm.create_dev_branch("MY_DEV_DB", sample_pct=10)

A.2.2 Required Privileges

The data loading scripts require specific Snowflake privileges depending on your use case.

A.2.2.1 Minimum Privileges (Using Existing Database)

If you have an existing database and just need to create schemas and tables:

Privilege	Object	Purpose
`USAGE`	Warehouse	Execute queries
`USAGE`	Database	Access the database
`CREATE SCHEMA`	Database	Create CLICKSTREAM_DATA schema
`CREATE TABLE`	Schema	Create data tables
`CREATE STAGE`	Schema	Upload Parquet files (bulk load)

-- Grant minimum privileges to a role (canonical guide names shown)
GRANT USAGE ON WAREHOUSE FS_DEV_WH TO ROLE my_role;
GRANT USAGE ON DATABASE FEATURE_STORE_DEMO TO ROLE my_role;
GRANT CREATE SCHEMA ON DATABASE FEATURE_STORE_DEMO TO ROLE my_role;

-- After schema creation, grant table privileges
GRANT ALL ON SCHEMA FEATURE_STORE_DEMO.CLICKSTREAM_DATA TO ROLE my_role;

A.2.2.2 Full Setup (Creating New Database)

For the complete setup including database creation:

Privilege	Object	Purpose
`CREATE DATABASE`	Account	Create FEATURE_STORE_DEMO database
`USAGE`	Warehouse	Execute queries

-- Grant full setup privileges
GRANT CREATE DATABASE ON ACCOUNT TO ROLE my_role;
GRANT USAGE ON WAREHOUSE FS_DEV_WH TO ROLE my_role;

A.2.2.3 Incremental Generator (Additional)

To deploy and run the Snowflake-native incremental generator:

Privilege	Object	Purpose
`CREATE PROCEDURE`	Schema	Create stored procedure
`CREATE TASK`	Schema	Create scheduled task
`EXECUTE TASK`	Account	Allow task execution

-- Grant incremental generator privileges (admin schema for generator objects)
GRANT CREATE PROCEDURE ON SCHEMA FEATURE_STORE_DEMO.CLICKSTREAM_ADMIN TO ROLE my_role;
GRANT CREATE TASK ON SCHEMA FEATURE_STORE_DEMO.CLICKSTREAM_ADMIN TO ROLE my_role;
GRANT EXECUTE TASK ON ACCOUNT TO ROLE my_role;

-- After task creation, grant operate privilege
GRANT OPERATE ON TASK incremental_data_task TO ROLE my_role;

Using a Custom Database

If you don’t have CREATE DATABASE privilege, use the --database flag to specify an existing database:

python main.py --scale 0.01 --database MY_EXISTING_DB --schema MY_SCHEMA --output snowflake

Recommended: Create a Dedicated Role

For production use, create a dedicated role with only the required privileges:

-- See Appendix B: Environment Setup for the full security model
-- Quick start: run scripts/01_dba_setup.sql (Appendix B)

-- Or manually create a role:
CREATE ROLE IF NOT EXISTS FS_ADMIN_ROLE;
GRANT USAGE ON WAREHOUSE FS_DEV_WH TO ROLE FS_ADMIN_ROLE;
GRANT CREATE DATABASE ON ACCOUNT TO ROLE FS_ADMIN_ROLE;
GRANT ROLE FS_ADMIN_ROLE TO USER my_user;

A.2.3 Schema Organization

The generator creates two schemas:

Schema	Purpose
`CLICKSTREAM_DATA`	Data tables (11 tables)
`CLICKSTREAM_ADMIN`	Generator config, state, task

A.2.4 Data Tables

Table	Description	Key Columns
CATEGORIES	Product categories	CATEGORY_ID, CATEGORY_NAME
SUPPLIERS	Product suppliers	SUPPLIER_ID, SUPPLIER_NAME
PRODUCTS	Product catalog	PRODUCT_ID, PRODUCT_NAME, CATEGORY_ID
PRODUCT_SUPPLIER	Product-supplier mapping	PRODUCT_ID, SUPPLIER_ID (composite PK)
HOUSEHOLDS	Customer households	HOUSEHOLD_ID, ADDRESS
VISITORS	Anonymous visitors	VISITOR_ID, FIRST_SEEN_TS
USERS	Registered users	USER_ID, VISITOR_ID, HOUSEHOLD_ID
SESSIONS	User sessions	SESSION_ID, VISITOR_ID, USER_ID
EVENTS	Clickstream events	EVENT_ID, SESSION_ID, EVENT_TYPE
ORDERS	Purchase orders	ORDER_ID, USER_ID, ORDER_TS
ORDER_ITEMS	Order line items	ORDER_ITEM_ID, ORDER_ID, PRODUCT_ID

A.2.5 Scale Factors

Scale	Users	Sessions	Events	Orders
0.01	1K	5K	44K	250
0.1	10K	50K	440K	2.5K
1.0	100K	500K	4.4M	25K

A.2.6 Incremental Generation

For Dynamic Table and Online Feature testing we need an incremental data generator to show the evolution of the feature store data through feature pipelines over time. This can be deployed using the following SQL commands:

-- Start continuous generation (every minute)
ALTER TASK FEATURE_STORE_DEMO.CLICKSTREAM_ADMIN.INCREMENTAL_DATA_TASK RESUME;

-- Monitor
SELECT * FROM FEATURE_STORE_DEMO.CLICKSTREAM_ADMIN.GENERATION_STATUS;

-- Stop when done
ALTER TASK FEATURE_STORE_DEMO.CLICKSTREAM_ADMIN.INCREMENTAL_DATA_TASK SUSPEND;

Note: Leaving these executing will continue to generate data and update the feature store data over time, but will incur costs for the compute resources used to generate the data. When you are done with the testing, you can suspend the task by running the following SQL command:

ALTER TASK FEATURE_STORE_DEMO.CLICKSTREAM_ADMIN.INCREMENTAL_DATA_TASK SUSPEND;

You can resume the task at any time when you want to continue testing and it will continue to generate data and update the feature store data over time, filling in the gaps where data was not generated. The first task iteration after a suspension will start from the last batch timestamp, and continue to generate data and update the feature store data over time. This initial execution will be slower than subsequent executions as it will need to catch up and apply more data.

A.2.7 Files

File/Folder	Purpose
`generator/main.py`	Initial data generation
`generator/snowflake_native/`	Snowflake-native incremental generation
`generator/README.md`	Complete documentation

A.3 Public ML Datasets

For simple inline examples and testing, we provide standard ML datasets:

Dataset	Records	Sample	Use Case
`penguins`	344	All	Multi-class classification
`titanic`	891	All	Binary classification
`iris`	150	All	Classification basics
`wine`	1,599	All	Regression / classification
`california_housing`	20,640	All	Regression
`nyc_taxi`	~3M	50K	Temporal features, fare prediction
`credit_card_fraud`	284K	50K	Imbalanced classification, temporal

A.3.1 Loading Public Datasets

cd public_datasets/

# List available datasets
python load_datasets.py --list

# Load specific dataset
python load_datasets.py penguins

# Load all datasets
python load_datasets.py all

# Full dataset (no sampling) for large datasets
python load_datasets.py nyc_taxi --full

Small datasets can also be loaded inline using FROM VALUES clauses for self-contained code snippets.

A.4 Development Database Subset Tool

Create development database “branches” from production data with filtered subsets using Dynamic Tables.

A.4.1 Quick Start

cd database_subset/

# Create development database (10% sample)
python deploy.py create \
    --prod-db FEATURE_STORE_DEMO \
    --dev-db FEATURE_STORE_DEMO_DEV \
    --sample-pct 10 \
    --warehouse FS_DEV_WH

# Check status
python deploy.py status --dev-db FEATURE_STORE_DEMO_DEV

# Suspend Dynamic Tables (cost savings)
python deploy.py suspend --dev-db FEATURE_STORE_DEMO_DEV

# Resume Dynamic Tables
python deploy.py resume --dev-db FEATURE_STORE_DEMO_DEV

A.4.2 Key Features

Same Object Names: All schemas/tables have identical names below database level
Environment Switching: Change USE DATABASE to switch between DEV and PROD
Real-time Sync: Dynamic Tables automatically refresh from production
Referential Integrity: Cascading filters maintain FK relationships

A.4.3 Multi-Environment Setup

flowchart LR
  subgraph PROD[FEATURE_STORE_DEMO]
    P1[CLICKSTREAM_DATA]
  end
  subgraph DEV[FEATURE_STORE_DEMO_DEV]
    D1[CLICKSTREAM_DATA sampled]
  end
  PROD --- DEV

Production vs development database subset sizes

See database_subset/README.md for complete documentation.

--- title: "Sample Data" subtitle: "Clickstream dataset generator and public ML datasets" --- ## Overview This appendix documents the sample data used throughout the Feature Store Implementation Guide: 1. **Clickstream Dataset** - A synthetic e-commerce dataset generated specifically for this guide 2. **Public ML Datasets** - Standard datasets (Iris, Titanic, etc.) for simple examples ::: {.callout-tip} ## Recommended: Streamlit Data Manager For first-time setup and interactive data management, we recommend using the **Streamlit Data Manager** app. It provides a visual interface for all data operations with guided workflows. ```bash cd data_manager/ export SNOWFLAKE_CONNECTION_NAME=your_connection streamlit run app.py ``` **[📖 Full Streamlit App Documentation →](data_manager/streamlit_app.qmd)** ::: ## Clickstream Dataset ### Quick Start You can manage sample data using either the **Streamlit UI** (interactive) or **CLI tools** (scriptable). ::: {.panel-tabset} ## Streamlit UI (Recommended) ```bash cd data_manager/ # Set Snowflake connection export SNOWFLAKE_CONNECTION_NAME=your_connection # Launch the UI streamlit run app.py ``` The UI provides: - 📊 **Overview** - View current data status - 📥 **Initial Load** - Generate and load synthetic data - 🔄 **Incremental Generator** - Manage continuous data generation - 🌿 **Development Branch** - Create sampled dev environments - 📚 **Public Datasets** - Load standard ML datasets - 📈 **Monitoring** - Track task and DT executions **[See detailed walkthrough with screenshots →](data_manager/streamlit_app.qmd)** ## CLI Tools ```bash cd generator/ # Set Snowflake connection export SNOWFLAKE_CONNECTION_NAME=your_connection # Step 1: Initial load (scale 0.01 = ~1K users) python main.py --scale 0.01 --output snowflake # Step 2 (Optional): Deploy continuous generation cd snowflake_native python deploy.py --start ``` ## Python API ```python from data_manager.core import DataManager dm = DataManager() dm.connect() # Initial load result = dm.run_initial_load(scale=0.01) print(f"Loaded: {result.row_counts}") # Create dev branch result = dm.create_dev_branch("MY_DEV_DB", sample_pct=10) ``` ::: ### Required Privileges The data loading scripts require specific Snowflake privileges depending on your use case. #### Minimum Privileges (Using Existing Database) If you have an existing database and just need to create schemas and tables: | Privilege | Object | Purpose | |-----------|--------|---------| | `USAGE` | Warehouse | Execute queries | | `USAGE` | Database | Access the database | | `CREATE SCHEMA` | Database | Create CLICKSTREAM_DATA schema | | `CREATE TABLE` | Schema | Create data tables | | `CREATE STAGE` | Schema | Upload Parquet files (bulk load) | ```sql -- Grant minimum privileges to a role (canonical guide names shown) GRANT USAGE ON WAREHOUSE FS_DEV_WH TO ROLE my_role; GRANT USAGE ON DATABASE FEATURE_STORE_DEMO TO ROLE my_role; GRANT CREATE SCHEMA ON DATABASE FEATURE_STORE_DEMO TO ROLE my_role; -- After schema creation, grant table privileges GRANT ALL ON SCHEMA FEATURE_STORE_DEMO.CLICKSTREAM_DATA TO ROLE my_role; ``` #### Full Setup (Creating New Database) For the complete setup including database creation: | Privilege | Object | Purpose | |-----------|--------|---------| | `CREATE DATABASE` | Account | Create FEATURE_STORE_DEMO database | | `USAGE` | Warehouse | Execute queries | ```sql -- Grant full setup privileges GRANT CREATE DATABASE ON ACCOUNT TO ROLE my_role; GRANT USAGE ON WAREHOUSE FS_DEV_WH TO ROLE my_role; ``` #### Incremental Generator (Additional) To deploy and run the Snowflake-native incremental generator: | Privilege | Object | Purpose | |-----------|--------|---------| | `CREATE PROCEDURE` | Schema | Create stored procedure | | `CREATE TASK` | Schema | Create scheduled task | | `EXECUTE TASK` | Account | Allow task execution | ```sql -- Grant incremental generator privileges (admin schema for generator objects) GRANT CREATE PROCEDURE ON SCHEMA FEATURE_STORE_DEMO.CLICKSTREAM_ADMIN TO ROLE my_role; GRANT CREATE TASK ON SCHEMA FEATURE_STORE_DEMO.CLICKSTREAM_ADMIN TO ROLE my_role; GRANT EXECUTE TASK ON ACCOUNT TO ROLE my_role; -- After task creation, grant operate privilege GRANT OPERATE ON TASK incremental_data_task TO ROLE my_role; ``` ::: {.callout-tip} ## Using a Custom Database If you don't have `CREATE DATABASE` privilege, use the `--database` flag to specify an existing database: ```bash python main.py --scale 0.01 --database MY_EXISTING_DB --schema MY_SCHEMA --output snowflake ``` ::: ::: {.callout-note} ## Recommended: Create a Dedicated Role For production use, create a dedicated role with only the required privileges: ```sql -- See Appendix B: Environment Setup for the full security model -- Quick start: run scripts/01_dba_setup.sql (Appendix B) -- Or manually create a role: CREATE ROLE IF NOT EXISTS FS_ADMIN_ROLE; GRANT USAGE ON WAREHOUSE FS_DEV_WH TO ROLE FS_ADMIN_ROLE; GRANT CREATE DATABASE ON ACCOUNT TO ROLE FS_ADMIN_ROLE; GRANT ROLE FS_ADMIN_ROLE TO USER my_user; ``` ::: ### Schema Organization The generator creates two schemas: | Schema | Purpose | |--------|---------| | `CLICKSTREAM_DATA` | Data tables (11 tables) | | `CLICKSTREAM_ADMIN` | Generator config, state, task | ### Data Tables | Table | Description | Key Columns | |-------|-------------|-------------| | CATEGORIES | Product categories | CATEGORY_ID, CATEGORY_NAME | | SUPPLIERS | Product suppliers | SUPPLIER_ID, SUPPLIER_NAME | | PRODUCTS | Product catalog | PRODUCT_ID, PRODUCT_NAME, CATEGORY_ID | | PRODUCT_SUPPLIER | Product-supplier mapping | PRODUCT_ID, SUPPLIER_ID (composite PK) | | HOUSEHOLDS | Customer households | HOUSEHOLD_ID, ADDRESS | | VISITORS | Anonymous visitors | VISITOR_ID, FIRST_SEEN_TS | | USERS | Registered users | USER_ID, VISITOR_ID, HOUSEHOLD_ID | | SESSIONS | User sessions | SESSION_ID, VISITOR_ID, USER_ID | | EVENTS | Clickstream events | EVENT_ID, SESSION_ID, EVENT_TYPE | | ORDERS | Purchase orders | ORDER_ID, USER_ID, ORDER_TS | | ORDER_ITEMS | Order line items | ORDER_ITEM_ID, ORDER_ID, PRODUCT_ID | ### Scale Factors | Scale | Users | Sessions | Events | Orders | |-------|-------|----------|--------|--------| | 0.01 | 1K | 5K | 44K | 250 | | 0.1 | 10K | 50K | 440K | 2.5K | | 1.0 | 100K | 500K | 4.4M | 25K | ### Incremental Generation For Dynamic Table and Online Feature testing we need an incremental data generator to show the evolution of the feature store data through feature pipelines over time. This can be deployed using the following SQL commands: ```sql -- Start continuous generation (every minute) ALTER TASK FEATURE_STORE_DEMO.CLICKSTREAM_ADMIN.INCREMENTAL_DATA_TASK RESUME; -- Monitor SELECT * FROM FEATURE_STORE_DEMO.CLICKSTREAM_ADMIN.GENERATION_STATUS; -- Stop when done ALTER TASK FEATURE_STORE_DEMO.CLICKSTREAM_ADMIN.INCREMENTAL_DATA_TASK SUSPEND; ``` Note: Leaving these executing will continue to generate data and update the feature store data over time, but will incur costs for the compute resources used to generate the data. When you are done with the testing, you can suspend the task by running the following SQL command: ```sql ALTER TASK FEATURE_STORE_DEMO.CLICKSTREAM_ADMIN.INCREMENTAL_DATA_TASK SUSPEND; ``` You can resume the task at any time when you want to continue testing and it will continue to generate data and update the feature store data over time, filling in the gaps where data was not generated. The first task iteration after a suspension will start from the last batch timestamp, and continue to generate data and update the feature store data over time. This initial execution will be slower than subsequent executions as it will need to catch up and apply more data. ### Files | File/Folder | Purpose | |-------------|---------| | `generator/main.py` | Initial data generation | | `generator/snowflake_native/` | Snowflake-native incremental generation | | `generator/README.md` | Complete documentation | ## Public ML Datasets For simple inline examples and testing, we provide standard ML datasets: | Dataset | Records | Sample | Use Case | |---------|---------|--------|----------| | `penguins` | 344 | All | Multi-class classification | | `titanic` | 891 | All | Binary classification | | `iris` | 150 | All | Classification basics | | `wine` | 1,599 | All | Regression / classification | | `california_housing` | 20,640 | All | Regression | | `nyc_taxi` | ~3M | 50K | Temporal features, fare prediction | | `credit_card_fraud` | 284K | 50K | Imbalanced classification, temporal | ### Loading Public Datasets ```bash cd public_datasets/ # List available datasets python load_datasets.py --list # Load specific dataset python load_datasets.py penguins # Load all datasets python load_datasets.py all # Full dataset (no sampling) for large datasets python load_datasets.py nyc_taxi --full ``` Small datasets can also be loaded inline using `FROM VALUES` clauses for self-contained code snippets. ## Development Database Subset Tool Create development database "branches" from production data with filtered subsets using Dynamic Tables. ### Quick Start ```bash cd database_subset/ # Create development database (10% sample) python deploy.py create \ --prod-db FEATURE_STORE_DEMO \ --dev-db FEATURE_STORE_DEMO_DEV \ --sample-pct 10 \ --warehouse FS_DEV_WH # Check status python deploy.py status --dev-db FEATURE_STORE_DEMO_DEV # Suspend Dynamic Tables (cost savings) python deploy.py suspend --dev-db FEATURE_STORE_DEMO_DEV # Resume Dynamic Tables python deploy.py resume --dev-db FEATURE_STORE_DEMO_DEV ``` ### Key Features - **Same Object Names**: All schemas/tables have identical names below database level - **Environment Switching**: Change `USE DATABASE` to switch between DEV and PROD - **Real-time Sync**: Dynamic Tables automatically refresh from production - **Referential Integrity**: Cascading filters maintain FK relationships ### Multi-Environment Setup ```{mermaid} %%| fig-cap: "Production vs development database subset sizes" %%| fig-alt: "Production and dev databases with sampled table row counts" flowchart LR subgraph PROD[FEATURE_STORE_DEMO] P1[CLICKSTREAM_DATA] end subgraph DEV[FEATURE_STORE_DEMO_DEV] D1[CLICKSTREAM_DATA sampled] end PROD --- DEV ``` See `database_subset/README.md` for complete documentation.