flowchart LR
subgraph PROD[FEATURE_STORE_DEMO]
P1[CLICKSTREAM_DATA]
end
subgraph DEV[FEATURE_STORE_DEMO_DEV]
D1[CLICKSTREAM_DATA sampled]
end
PROD --- DEV
Appendix A β Sample Data
Clickstream dataset generator and public ML datasets
snowflake, feature store, ml, machine learning, mlops
A.1 Overview
This appendix documents the sample data used throughout the Feature Store Implementation Guide:
- Clickstream Dataset - A synthetic e-commerce dataset generated specifically for this guide
- Public ML Datasets - Standard datasets (Iris, Titanic, etc.) for simple examples
For first-time setup and interactive data management, we recommend using the Streamlit Data Manager app. It provides a visual interface for all data operations with guided workflows.
A.2 Clickstream Dataset
A.2.1 Quick Start
You can manage sample data using either the Streamlit UI (interactive) or CLI tools (scriptable).
cd data_manager/
# Set Snowflake connection
export SNOWFLAKE_CONNECTION_NAME=your_connection
# Launch the UI
streamlit run app.pyThe UI provides:
- π Overview - View current data status
- π₯ Initial Load - Generate and load synthetic data
- π Incremental Generator - Manage continuous data generation
- πΏ Development Branch - Create sampled dev environments
- π Public Datasets - Load standard ML datasets
- π Monitoring - Track task and DT executions
A.2.2 Required Privileges
The data loading scripts require specific Snowflake privileges depending on your use case.
A.2.2.1 Minimum Privileges (Using Existing Database)
If you have an existing database and just need to create schemas and tables:
| Privilege | Object | Purpose |
|---|---|---|
USAGE |
Warehouse | Execute queries |
USAGE |
Database | Access the database |
CREATE SCHEMA |
Database | Create CLICKSTREAM_DATA schema |
CREATE TABLE |
Schema | Create data tables |
CREATE STAGE |
Schema | Upload Parquet files (bulk load) |
-- Grant minimum privileges to a role (canonical guide names shown)
GRANT USAGE ON WAREHOUSE FS_DEV_WH TO ROLE my_role;
GRANT USAGE ON DATABASE FEATURE_STORE_DEMO TO ROLE my_role;
GRANT CREATE SCHEMA ON DATABASE FEATURE_STORE_DEMO TO ROLE my_role;
-- After schema creation, grant table privileges
GRANT ALL ON SCHEMA FEATURE_STORE_DEMO.CLICKSTREAM_DATA TO ROLE my_role;A.2.2.2 Full Setup (Creating New Database)
For the complete setup including database creation:
| Privilege | Object | Purpose |
|---|---|---|
CREATE DATABASE |
Account | Create FEATURE_STORE_DEMO database |
USAGE |
Warehouse | Execute queries |
A.2.2.3 Incremental Generator (Additional)
To deploy and run the Snowflake-native incremental generator:
| Privilege | Object | Purpose |
|---|---|---|
CREATE PROCEDURE |
Schema | Create stored procedure |
CREATE TASK |
Schema | Create scheduled task |
EXECUTE TASK |
Account | Allow task execution |
-- Grant incremental generator privileges (admin schema for generator objects)
GRANT CREATE PROCEDURE ON SCHEMA FEATURE_STORE_DEMO.CLICKSTREAM_ADMIN TO ROLE my_role;
GRANT CREATE TASK ON SCHEMA FEATURE_STORE_DEMO.CLICKSTREAM_ADMIN TO ROLE my_role;
GRANT EXECUTE TASK ON ACCOUNT TO ROLE my_role;
-- After task creation, grant operate privilege
GRANT OPERATE ON TASK incremental_data_task TO ROLE my_role;For production use, create a dedicated role with only the required privileges:
-- See Appendix B: Environment Setup for the full security model
-- Quick start: run scripts/01_dba_setup.sql (Appendix B)
-- Or manually create a role:
CREATE ROLE IF NOT EXISTS FS_ADMIN_ROLE;
GRANT USAGE ON WAREHOUSE FS_DEV_WH TO ROLE FS_ADMIN_ROLE;
GRANT CREATE DATABASE ON ACCOUNT TO ROLE FS_ADMIN_ROLE;
GRANT ROLE FS_ADMIN_ROLE TO USER my_user;A.2.3 Schema Organization
The generator creates two schemas:
| Schema | Purpose |
|---|---|
CLICKSTREAM_DATA |
Data tables (11 tables) |
CLICKSTREAM_ADMIN |
Generator config, state, task |
A.2.4 Data Tables
| Table | Description | Key Columns |
|---|---|---|
| CATEGORIES | Product categories | CATEGORY_ID, CATEGORY_NAME |
| SUPPLIERS | Product suppliers | SUPPLIER_ID, SUPPLIER_NAME |
| PRODUCTS | Product catalog | PRODUCT_ID, PRODUCT_NAME, CATEGORY_ID |
| PRODUCT_SUPPLIER | Product-supplier mapping | PRODUCT_ID, SUPPLIER_ID (composite PK) |
| HOUSEHOLDS | Customer households | HOUSEHOLD_ID, ADDRESS |
| VISITORS | Anonymous visitors | VISITOR_ID, FIRST_SEEN_TS |
| USERS | Registered users | USER_ID, VISITOR_ID, HOUSEHOLD_ID |
| SESSIONS | User sessions | SESSION_ID, VISITOR_ID, USER_ID |
| EVENTS | Clickstream events | EVENT_ID, SESSION_ID, EVENT_TYPE |
| ORDERS | Purchase orders | ORDER_ID, USER_ID, ORDER_TS |
| ORDER_ITEMS | Order line items | ORDER_ITEM_ID, ORDER_ID, PRODUCT_ID |
A.2.5 Scale Factors
| Scale | Users | Sessions | Events | Orders |
|---|---|---|---|---|
| 0.01 | 1K | 5K | 44K | 250 |
| 0.1 | 10K | 50K | 440K | 2.5K |
| 1.0 | 100K | 500K | 4.4M | 25K |
A.2.6 Incremental Generation
For Dynamic Table and Online Feature testing we need an incremental data generator to show the evolution of the feature store data through feature pipelines over time. This can be deployed using the following SQL commands:
-- Start continuous generation (every minute)
ALTER TASK FEATURE_STORE_DEMO.CLICKSTREAM_ADMIN.INCREMENTAL_DATA_TASK RESUME;
-- Monitor
SELECT * FROM FEATURE_STORE_DEMO.CLICKSTREAM_ADMIN.GENERATION_STATUS;
-- Stop when done
ALTER TASK FEATURE_STORE_DEMO.CLICKSTREAM_ADMIN.INCREMENTAL_DATA_TASK SUSPEND;Note: Leaving these executing will continue to generate data and update the feature store data over time, but will incur costs for the compute resources used to generate the data. When you are done with the testing, you can suspend the task by running the following SQL command:
You can resume the task at any time when you want to continue testing and it will continue to generate data and update the feature store data over time, filling in the gaps where data was not generated. The first task iteration after a suspension will start from the last batch timestamp, and continue to generate data and update the feature store data over time. This initial execution will be slower than subsequent executions as it will need to catch up and apply more data.
A.2.7 Files
| File/Folder | Purpose |
|---|---|
generator/main.py |
Initial data generation |
generator/snowflake_native/ |
Snowflake-native incremental generation |
generator/README.md |
Complete documentation |
A.3 Public ML Datasets
For simple inline examples and testing, we provide standard ML datasets:
| Dataset | Records | Sample | Use Case |
|---|---|---|---|
penguins |
344 | All | Multi-class classification |
titanic |
891 | All | Binary classification |
iris |
150 | All | Classification basics |
wine |
1,599 | All | Regression / classification |
california_housing |
20,640 | All | Regression |
nyc_taxi |
~3M | 50K | Temporal features, fare prediction |
credit_card_fraud |
284K | 50K | Imbalanced classification, temporal |
A.3.1 Loading Public Datasets
cd public_datasets/
# List available datasets
python load_datasets.py --list
# Load specific dataset
python load_datasets.py penguins
# Load all datasets
python load_datasets.py all
# Full dataset (no sampling) for large datasets
python load_datasets.py nyc_taxi --fullSmall datasets can also be loaded inline using FROM VALUES clauses for self-contained code snippets.
A.4 Development Database Subset Tool
Create development database βbranchesβ from production data with filtered subsets using Dynamic Tables.
A.4.1 Quick Start
cd database_subset/
# Create development database (10% sample)
python deploy.py create \
--prod-db FEATURE_STORE_DEMO \
--dev-db FEATURE_STORE_DEMO_DEV \
--sample-pct 10 \
--warehouse FS_DEV_WH
# Check status
python deploy.py status --dev-db FEATURE_STORE_DEMO_DEV
# Suspend Dynamic Tables (cost savings)
python deploy.py suspend --dev-db FEATURE_STORE_DEMO_DEV
# Resume Dynamic Tables
python deploy.py resume --dev-db FEATURE_STORE_DEMO_DEVA.4.2 Key Features
- Same Object Names: All schemas/tables have identical names below database level
- Environment Switching: Change
USE DATABASEto switch between DEV and PROD - Real-time Sync: Dynamic Tables automatically refresh from production
- Referential Integrity: Cascading filters maintain FK relationships
A.4.3 Multi-Environment Setup
See database_subset/README.md for complete documentation.