# ==============================================================================
# CONFIGURATION
# ==============================================================================
SOURCE_DATABASE = "FEATURE_STORE_DEMO"
SOURCE_SCHEMA = "CLICKSTREAM_DATA"
FS_NAME = "FEATURE_STORE"
WAREHOUSE = "FS_DEV_WH"
# ==============================================================================
# IMPORTS
# ==============================================================================
import json
from datetime import datetime, timedelta
from decimal import Decimal
import pandas as pd
import numpy as np
from snowflake.snowpark import Session
from snowflake.snowpark.version import VERSION
from snowflake.snowpark import functions as F
from snowflake.snowpark import types as T
from snowflake.snowpark.context import get_active_session
from snowflake.ml.feature_store import (
FeatureStore,
FeatureView,
Entity,
CreationMode,
)
# ==============================================================================
# SESSION CREATION
# ==============================================================================
# Option 1: Running in Snowflake Notebook (preferred)
try:
session = get_active_session()
except Exception:
# Option 2: Running locally β reads ~/.snowflake/connections.toml [default]
session = Session.builder.config("connection_name", "default").create()
session.sql_simplifier_enabled = TrueIntroduction
Setup, prerequisites, and guide overview
snowflake, feature store, ml, machine learning, mlops
Author: Simon Field
Overview
This guide provides comprehensive best practices for implementing Snowflake Feature Store in production ML systems. It covers the full lifecycle from initial design through deployment and ongoing operations.
The Snowflake Feature Store is a managed data management layer that helps ML teams organize, store, and serve features at scale. It provides:
- Centralized Feature Management: A single source of truth for feature definitions and metadata
- Automated Feature Engineering Pipelines: Dynamic Table-based materialization (pre-computed) or View-based (query-time computation)
- Point-in-Time Feature Retrieval: Temporally correct feature joins for ML training and batch inference, preventing data leakage
- Online Serving: Low-latency feature retrieval via Online Feature Tables for real-time inference
- Governance & Lineage: Full tracking from source data to model predictions
π Chapter code: Browse companion scripts on GitHub
Whatβs Covered in This Guide
| Chapter | Topic | Description |
|---|---|---|
| 00 | Introduction | Setup, prerequisites, guide overview |
| 01 | Concepts | Core objects, terminology, transformation taxonomy |
| 02 | Design & Organization | Feature Store structure, multi-environment patterns |
| 03 | Entities & Hierarchies | Entity keys (simple & compound), relationships |
| 04 | Feature Views | DT vs View, versioning, lifecycle management |
| 05 | Feature Pipelines | dbt, Dynamic Tables, Temporal API pipelines |
| 06 | Temporal Features | Point-in-time correctness, tiling, aggregations |
| 07 | Aggregations API | Feature class, windows, rollups |
| 08 | Online Features | Online feature model; Hybrid Table backend (GA); batch sync |
| 09 | Real-Time & Online Serving | Postgres backend; batch sync, Stream FVs, RTFVs, Feature Groups |
| 10 | Preprocessing | Encoding, scaling, model-dependent transforms |
| 11 | Training & Inference | Spine design, dataset generation |
| 12 | Operations | Monitoring, DMFs, lineage, drift detection |
| 13 | Advanced Patterns | CI/CD, streaming workarounds, multi-FS operations |
| 14 | Migration Guide | Tecton, SageMaker, Vertex AI migration paths |
Prerequisites
Required Snowflake Access
- Snowflake account (Standard Edition or higher)
- Database privileges to create schemas, tables, dynamic tables, views, and tags
- Warehouse access for compute
snowflake-ml-pythonpackage installed (available via pip/conda)
Required Privileges
Feature Store requires specific privileges. A role setup script is available in the Snowflake documentation.
Minimum required privileges:
-- Create a role for Feature Store users
CREATE ROLE IF NOT EXISTS FEATURE_STORE_USER;
-- Grant database and schema access
GRANT USAGE ON DATABASE <database> TO ROLE FEATURE_STORE_USER;
GRANT CREATE SCHEMA ON DATABASE <database> TO ROLE FEATURE_STORE_USER;
GRANT USAGE ON WAREHOUSE <warehouse> TO ROLE FEATURE_STORE_USER;
-- For Feature Store schema operations
GRANT CREATE TABLE ON SCHEMA <schema> TO ROLE FEATURE_STORE_USER;
GRANT CREATE DYNAMIC TABLE ON SCHEMA <schema> TO ROLE FEATURE_STORE_USER;
GRANT CREATE VIEW ON SCHEMA <schema> TO ROLE FEATURE_STORE_USER;
GRANT CREATE TAG ON SCHEMA <schema> TO ROLE FEATURE_STORE_USER;Software Requirements
# Minimum package versions for this guide (as of April 2026)
snowflake-ml-python >= 1.21.0 # Feature Store incl. Feature aggregation API (latest: 1.34.0)
snowflake-snowpark-python >= 1.40.0 # DataFrame operations (latest: 1.48.1)
# Additional packages used in examples
pandas >= 2.0.0
numpy >= 1.23.5
scikit-learn >= 1.5.0The Feature aggregation class (1.5.3.1 Method 3: Feature Aggregation Class (Time-Windowed)) and Temporal Aggregated Feature Views require snowflake-ml-python >= 1.21.0. This guide assumes 1.21.0 as the minimum baseline.
Environment Setup
Session Configuration
The following code establishes a Snowpark session and verifies the environment. Adjust SOURCE_DATABASE, SOURCE_SCHEMA, FS_NAME, and WAREHOUSE to match your environment and needs.
π Full code:
_code/setup_session.py
Verify Environment
π Full code:
_code/verify_environment.py
# Capture environment details
snowflake_environment = session.sql('SELECT current_user(), current_version()').collect()
snowpark_version = VERSION
session_role = session.get_current_role().replace('"', "")
session_database = session.get_current_database().replace('"', "")
session_schema = session.get_current_schema().replace('"', "")
session_warehouse = session.get_current_warehouse().replace('"', "")
# Check warehouse status (size/state can be NULL for some warehouse types)
wh_status = session.sql(f"SHOW WAREHOUSES LIKE '{session_warehouse}'").collect()[0]
wh_size = wh_status["size"]
wh_state = wh_status["state"]
warehouse_size = wh_size.upper() if wh_size else "N/A"
warehouse_state = wh_state if wh_state is not None else "N/A"
print('=' * 70)
print('CONNECTION ESTABLISHED')
print('=' * 70)
print(f'Account : {session.sql("SELECT current_account()").collect()[0][0]}')
print(f'User : {snowflake_environment[0][0]}')
print(f'Role : {session_role}')
print(f'Database : {session_database}')
print(f'Schema : {session_schema}')
print(f'Warehouse : {session_warehouse}')
print(f'Warehouse Size : {warehouse_size}')
print(f'Warehouse State : {warehouse_state}')
print(f'Snowflake Version : {snowflake_environment[0][1]}')
print(f'Snowpark Version : {snowpark_version[0]}.{snowpark_version[1]}.{snowpark_version[2]}')
print('=' * 70)Expected output (values depend on your environment):
======================================================================
CONNECTION ESTABLISHED
======================================================================
Account : YOUR_ACCOUNT
User : YOUR_USER
Role : FS_DEV_ROLE
Database : FEATURE_STORE_DEMO
Schema : CLICKSTREAM_DATA
Warehouse : FS_DEV_WH
Warehouse Size : X-SMALL
Warehouse State : STARTED
Snowflake Version : 9.x.x
Snowpark Version : 1.25.0
======================================================================
Create Feature Store
π Full code:
_code/create_feature_store.py
# Create working schema for sample data
session.sql(f'CREATE SCHEMA IF NOT EXISTS {SOURCE_DATABASE}.{SOURCE_SCHEMA}').collect()
# Initialize Feature Store
fs = FeatureStore(
session=session,
database=SOURCE_DATABASE,
name=FS_NAME,
default_warehouse=WAREHOUSE,
creation_mode=CreationMode.CREATE_IF_NOT_EXIST,
)
print(f'Feature Store initialized: {SOURCE_DATABASE}.{FS_NAME}')Sample Data
π Full code:
_code/sample_data.py
This guide uses a synthetic clickstream dataset designed to demonstrate all Feature Store capabilities, including:
- Temporal patterns: Event timestamps with realistic distributions
- Entity hierarchies: Visitor β User β Household relationships
- Composite keys: Product-Supplier junction table (M:N)
- Semi-structured data: VARIANT, ARRAY, OBJECT columns
See Appendix A: Clickstream Data Model for the full data model specification.
Quick Data Overview
The clickstream dataset includes these core tables:
| Table | Description | Entity Key | Temporal Column |
|---|---|---|---|
VISITORS |
Anonymous visitors | VISITOR_ID |
FIRST_SEEN_TS |
USERS |
Identified users | USER_ID |
CREATED_TS |
SESSIONS |
User sessions | SESSION_ID |
SESSION_START_TS |
EVENTS |
Clickstream events | EVENT_ID |
EVENT_TS |
ORDERS |
Purchase orders | ORDER_ID |
ORDER_TS |
PRODUCTS |
Product catalog | PRODUCT_ID |
CREATED_TS |
PRODUCT_SUPPLIER |
Supplier relationships | (PRODUCT_ID, SUPPLIER_ID) |
VALID_FROM_TS |
The base tables above are designed to be append-only β rows are inserted once and not updated. This pattern is common for source tables dealing with high scale ingestion rates, such as clickstream data. Append-only sources can also be more efficient for Dynamic Table incremental refresh, because the DT engine only needs to scan newly added micro-partitions rather than diffing existing ones for in-place row updates. However, Dynamic Tables work with any source table regardless of whether it is append-only or not.
As of April 2026, source tables that are periodically fully replaced (e.g., via INSERT OVERWRITE or CREATE TABLE ... AS SELECT) can also support incremental downstream processing by declaring a primary key with the RELY property (ALTER TABLE ... ADD PRIMARY KEY (...) RELY). Snowflake uses the reliable PK to compare rows by key value across rewrites, computing only actual changes for downstream DTs. This is a SQL-level capability not yet exposed through the Feature Store Python API. See Chapter 5: Feature Pipelines and Understanding primary keys in dynamic tables for details.
Note that the Dynamic Tables themselves will insert and update rows as new source data arrives β this is expected. For example, when a new event is appended to EVENTS, the DT computing MAX(EVENT_TS) per session updates that sessionβs aggregated row. The DT engine handles this transparently during incremental refresh.
Closing or derived timestamps (e.g., session end time, last-seen time, supplier relationship end date) are computed in downstream Feature Views rather than maintained as updated columns on the base table. For example, SESSION_END_TS is derived from the last event in a session:
Inline Snippet Pattern
For simple examples, we use self-contained data that you can copy-paste directly:
from snowflake.snowpark import Row
sample_events = session.create_dataframe([
Row(USER_ID='usr_001', EVENT_TS='2025-01-01 10:00:00', EVENT_TYPE='page_view', VALUE=1.0),
Row(USER_ID='usr_001', EVENT_TS='2025-01-01 10:05:00', EVENT_TYPE='click', VALUE=1.0),
Row(USER_ID='usr_001', EVENT_TS='2025-01-01 10:10:00', EVENT_TYPE='purchase', VALUE=99.0),
Row(USER_ID='usr_002', EVENT_TS='2025-01-01 11:00:00', EVENT_TYPE='page_view', VALUE=1.0),
Row(USER_ID='usr_002', EVENT_TS='2025-01-01 11:02:00', EVENT_TYPE='page_view', VALUE=1.0),
])
sample_events.write.save_as_table("SAMPLE_EVENTS", mode="overwrite")
sample_events.show()All code examples in this guide are available in the _code/ directories. You can also click the βView sourceβ icon on any code block to see the full file.
Guide Conventions
Code Examples
All code examples follow these conventions:
- SQL: UPPERCASE keywords, lowercase identifiers
- Python: PEP 8 style, type hints where helpful
- Naming:
SCREAMING_SNAKE_CASEfor Snowflake objects,snake_casefor Python - SQL and Snowpark DataFrame examples: Feature View definitions and other DataFrames are shown in both SQL and Snowpark DataFrame form using tabbed panels. Both approaches produce identical lazy Snowpark DataFrames. SQL is shown first (as it is more widely known), but the Snowpark DataFrame API is equally valid. See Chapter 4: SQL vs Snowpark DataFrame API for a comparison table and guidance on choosing. When you rely on generated SQL (Snowpark, dbplyr, or similar), third-party pretty-printers and optional CTE hoists can make plans easier to read in Snowsight or
QUERY_HISTORY; see Appendix C.
For clarity, code examples in this guide often use literal object names (e.g., session.table("EVENTS"), "FEATURE_STORE_DEMO") rather than fully-qualified references or variable substitution. Production code should parameterise database, schema, and object names so the same codebase works across DEV, TEST, and PROD without edits:
# Guide example (for readability)
df = session.table("EVENTS")
fs = FeatureStore(session=session, database="FEATURE_STORE_DEMO", name="FEATURE_STORE", ...)
# Production equivalent (parameterised)
df = session.table(f"{SOURCE_DATABASE}.{SOURCE_SCHEMA}.EVENTS")
fs = FeatureStore(session=session, database=SOURCE_DATABASE, name=FS_NAME, ...)Values such as SOURCE_DATABASE and FS_NAME are typically sourced from configuration files, environment variables, CI/CD secrets, or templating engines (e.g., Jinja, Terraform, Snowflake DCM Projects). See Chapter 2: Multi-Environment Deployment for full parameterisation patterns.
Abbreviations Used in This Guide
| Abbreviation | Meaning |
|---|---|
| DT | Dynamic Table |
| FV | Feature View |
| GA | Generally Available (production-ready; no preview enrollment required) |
| PubPr | Public Preview (preview enrollment may be required; APIs and performance may change) |
| PrivPr | Private Preview (limited availability; account enablement required) |
| OFT | Online Feature Table |
| PIT | Point-in-Time |
| MIT | Model-Independent Transformation |
| MDT | Model-Dependent Transformation |
| ODT | On-Demand Transformation |
| SCD | Slowly Changing Dimension |
| RBAC | Role-Based Access Control |
Each abbreviation is expanded on first use in every chapter. After that, the short form is used for readability. Release-status labels (GA, PubPr, PrivPr) appear frequently in platform capability tables and online-serving chapters.
Feature Naming Conventions
| Suffix | Meaning | Example |
|---|---|---|
_TS |
Timestamp | ORDER_TS, CREATED_TS |
_DT |
Date | ORDER_DT |
_CNT |
Count | SESSION_CNT, ORDER_CNT |
_DCNT |
Distinct count | PRODUCT_VIEW_DCNT |
_SUM |
Sum/total | REVENUE_SUM, QUANTITY_SUM |
_AVG |
Average | ORDER_VALUE_AVG |
_AMT |
Amount (currency) | TOTAL_AMT, DISCOUNT_AMT |
IS_ |
Boolean flag | IS_CONVERTED, IS_CURRENT |
_LST |
Last | LAST_ORDER_TS |
_FST |
First | FIRST_ORDER_TS |
_MAX |
Maximum | MAX_ORDER_VALUE |
_MIN |
Minimum | MIN_ORDER_VALUE |
_STD |
Standard deviation | ORDER_VALUE_STD |
_VAR |
Variance | ORDER_VALUE_VAR |
_MED |
Median | ORDER_VALUE_MED |
_PCT |
Percentile | ORDER_VALUE_PCT_90 |
Agreeing on feature naming conventions at the start of a project prevents inconsistency, reduces refactoring, and makes features discoverable across teams. The table above is a starting point β adapt it to your organization and enforce it through code review and automation. See Chapter 2: Establish Naming Conventions Early for object-level conventions (databases, schemas, Feature Views, Entities) and Chapter 7: Naming Conventions for aggregation-specific prefixing patterns.
Version Callouts
New features are highlighted with version callouts:
Description of the new feature and how it improves on previous approaches.
Next Steps
Continue to Chapter 1: Concepts to learn about the core Feature Store objects and terminology.