Introduction

Setup, prerequisites, and guide overview

Keywords

snowflake, feature store, ml, machine learning, mlops

Author: Simon Field

Overview

This guide provides comprehensive best practices for implementing Snowflake Feature Store in production ML systems. It covers the full lifecycle from initial design through deployment and ongoing operations.

The Snowflake Feature Store is a managed data management layer that helps ML teams organize, store, and serve features at scale. It provides:

  • Centralized Feature Management: A single source of truth for feature definitions and metadata
  • Automated Feature Engineering Pipelines: Dynamic Table-based materialization (pre-computed) or View-based (query-time computation)
  • Point-in-Time Feature Retrieval: Temporally correct feature joins for ML training and batch inference, preventing data leakage
  • Online Serving: Low-latency feature retrieval via Online Feature Tables for real-time inference
  • Governance & Lineage: Full tracking from source data to model predictions

πŸ“‚ Chapter code: Browse companion scripts on GitHub

What’s Covered in This Guide

Chapter Topic Description
00 Introduction Setup, prerequisites, guide overview
01 Concepts Core objects, terminology, transformation taxonomy
02 Design & Organization Feature Store structure, multi-environment patterns
03 Entities & Hierarchies Entity keys (simple & compound), relationships
04 Feature Views DT vs View, versioning, lifecycle management
05 Feature Pipelines dbt, Dynamic Tables, Temporal API pipelines
06 Temporal Features Point-in-time correctness, tiling, aggregations
07 Aggregations API Feature class, windows, rollups
08 Online Features Online feature model; Hybrid Table backend (GA); batch sync
09 Real-Time & Online Serving Postgres backend; batch sync, Stream FVs, RTFVs, Feature Groups
10 Preprocessing Encoding, scaling, model-dependent transforms
11 Training & Inference Spine design, dataset generation
12 Operations Monitoring, DMFs, lineage, drift detection
13 Advanced Patterns CI/CD, streaming workarounds, multi-FS operations
14 Migration Guide Tecton, SageMaker, Vertex AI migration paths

Prerequisites

Required Snowflake Access

  • Snowflake account (Standard Edition or higher)
  • Database privileges to create schemas, tables, dynamic tables, views, and tags
  • Warehouse access for compute
  • snowflake-ml-python package installed (available via pip/conda)

Required Privileges

Feature Store requires specific privileges. A role setup script is available in the Snowflake documentation.

Minimum required privileges:

-- Create a role for Feature Store users
CREATE ROLE IF NOT EXISTS FEATURE_STORE_USER;

-- Grant database and schema access
GRANT USAGE ON DATABASE <database> TO ROLE FEATURE_STORE_USER;
GRANT CREATE SCHEMA ON DATABASE <database> TO ROLE FEATURE_STORE_USER;
GRANT USAGE ON WAREHOUSE <warehouse> TO ROLE FEATURE_STORE_USER;

-- For Feature Store schema operations
GRANT CREATE TABLE ON SCHEMA <schema> TO ROLE FEATURE_STORE_USER;
GRANT CREATE DYNAMIC TABLE ON SCHEMA <schema> TO ROLE FEATURE_STORE_USER;
GRANT CREATE VIEW ON SCHEMA <schema> TO ROLE FEATURE_STORE_USER;
GRANT CREATE TAG ON SCHEMA <schema> TO ROLE FEATURE_STORE_USER;

Software Requirements

# Minimum package versions for this guide (as of April 2026)
snowflake-ml-python >= 1.21.0     # Feature Store incl. Feature aggregation API (latest: 1.34.0)
snowflake-snowpark-python >= 1.40.0  # DataFrame operations (latest: 1.48.1)

# Additional packages used in examples
pandas >= 2.0.0
numpy >= 1.23.5
scikit-learn >= 1.5.0
Feature Aggregation API

The Feature aggregation class (1.5.3.1 Method 3: Feature Aggregation Class (Time-Windowed)) and Temporal Aggregated Feature Views require snowflake-ml-python >= 1.21.0. This guide assumes 1.21.0 as the minimum baseline.

Environment Setup

Session Configuration

The following code establishes a Snowpark session and verifies the environment. Adjust SOURCE_DATABASE, SOURCE_SCHEMA, FS_NAME, and WAREHOUSE to match your environment and needs.

πŸ“ Full code: _code/setup_session.py

# ==============================================================================
# CONFIGURATION
# ==============================================================================

SOURCE_DATABASE = "FEATURE_STORE_DEMO"
SOURCE_SCHEMA = "CLICKSTREAM_DATA"
FS_NAME = "FEATURE_STORE"
WAREHOUSE = "FS_DEV_WH"

# ==============================================================================
# IMPORTS
# ==============================================================================

import json
from datetime import datetime, timedelta
from decimal import Decimal

import pandas as pd
import numpy as np

from snowflake.snowpark import Session
from snowflake.snowpark.version import VERSION
from snowflake.snowpark import functions as F
from snowflake.snowpark import types as T
from snowflake.snowpark.context import get_active_session

from snowflake.ml.feature_store import (
    FeatureStore,
    FeatureView,
    Entity,
    CreationMode,
)

# ==============================================================================
# SESSION CREATION
# ==============================================================================

# Option 1: Running in Snowflake Notebook (preferred)
try:
    session = get_active_session()
except Exception:
    # Option 2: Running locally β€” reads ~/.snowflake/connections.toml [default]
    session = Session.builder.config("connection_name", "default").create()

session.sql_simplifier_enabled = True

Verify Environment

πŸ“ Full code: _code/verify_environment.py

# Capture environment details
snowflake_environment = session.sql('SELECT current_user(), current_version()').collect()
snowpark_version = VERSION
session_role = session.get_current_role().replace('"', "")
session_database = session.get_current_database().replace('"', "")
session_schema = session.get_current_schema().replace('"', "")
session_warehouse = session.get_current_warehouse().replace('"', "")

# Check warehouse status (size/state can be NULL for some warehouse types)
wh_status = session.sql(f"SHOW WAREHOUSES LIKE '{session_warehouse}'").collect()[0]
wh_size = wh_status["size"]
wh_state = wh_status["state"]
warehouse_size = wh_size.upper() if wh_size else "N/A"
warehouse_state = wh_state if wh_state is not None else "N/A"

print('=' * 70)
print('CONNECTION ESTABLISHED')
print('=' * 70)
print(f'Account                : {session.sql("SELECT current_account()").collect()[0][0]}')
print(f'User                   : {snowflake_environment[0][0]}')
print(f'Role                   : {session_role}')
print(f'Database               : {session_database}')
print(f'Schema                 : {session_schema}')
print(f'Warehouse              : {session_warehouse}')
print(f'Warehouse Size         : {warehouse_size}')
print(f'Warehouse State        : {warehouse_state}')
print(f'Snowflake Version      : {snowflake_environment[0][1]}')
print(f'Snowpark Version       : {snowpark_version[0]}.{snowpark_version[1]}.{snowpark_version[2]}')
print('=' * 70)

Expected output (values depend on your environment):

======================================================================
CONNECTION ESTABLISHED
======================================================================
Account                : YOUR_ACCOUNT
User                   : YOUR_USER
Role                   : FS_DEV_ROLE
Database               : FEATURE_STORE_DEMO
Schema                 : CLICKSTREAM_DATA
Warehouse              : FS_DEV_WH
Warehouse Size         : X-SMALL
Warehouse State        : STARTED
Snowflake Version      : 9.x.x
Snowpark Version       : 1.25.0
======================================================================

Create Feature Store

πŸ“ Full code: _code/create_feature_store.py

# Create working schema for sample data
session.sql(f'CREATE SCHEMA IF NOT EXISTS {SOURCE_DATABASE}.{SOURCE_SCHEMA}').collect()

# Initialize Feature Store
fs = FeatureStore(
    session=session,
    database=SOURCE_DATABASE,
    name=FS_NAME,
    default_warehouse=WAREHOUSE,
    creation_mode=CreationMode.CREATE_IF_NOT_EXIST,
)

print(f'Feature Store initialized: {SOURCE_DATABASE}.{FS_NAME}')

Sample Data

πŸ“ Full code: _code/sample_data.py

This guide uses a synthetic clickstream dataset designed to demonstrate all Feature Store capabilities, including:

  • Temporal patterns: Event timestamps with realistic distributions
  • Entity hierarchies: Visitor β†’ User β†’ Household relationships
  • Composite keys: Product-Supplier junction table (M:N)
  • Semi-structured data: VARIANT, ARRAY, OBJECT columns

See Appendix A: Clickstream Data Model for the full data model specification.

Quick Data Overview

The clickstream dataset includes these core tables:

Table Description Entity Key Temporal Column
VISITORS Anonymous visitors VISITOR_ID FIRST_SEEN_TS
USERS Identified users USER_ID CREATED_TS
SESSIONS User sessions SESSION_ID SESSION_START_TS
EVENTS Clickstream events EVENT_ID EVENT_TS
ORDERS Purchase orders ORDER_ID ORDER_TS
PRODUCTS Product catalog PRODUCT_ID CREATED_TS
PRODUCT_SUPPLIER Supplier relationships (PRODUCT_ID, SUPPLIER_ID) VALID_FROM_TS
Append-Friendly Base Tables

The base tables above are designed to be append-only – rows are inserted once and not updated. This pattern is common for source tables dealing with high scale ingestion rates, such as clickstream data. Append-only sources can also be more efficient for Dynamic Table incremental refresh, because the DT engine only needs to scan newly added micro-partitions rather than diffing existing ones for in-place row updates. However, Dynamic Tables work with any source table regardless of whether it is append-only or not.

As of April 2026, source tables that are periodically fully replaced (e.g., via INSERT OVERWRITE or CREATE TABLE ... AS SELECT) can also support incremental downstream processing by declaring a primary key with the RELY property (ALTER TABLE ... ADD PRIMARY KEY (...) RELY). Snowflake uses the reliable PK to compare rows by key value across rewrites, computing only actual changes for downstream DTs. This is a SQL-level capability not yet exposed through the Feature Store Python API. See Chapter 5: Feature Pipelines and Understanding primary keys in dynamic tables for details.

Note that the Dynamic Tables themselves will insert and update rows as new source data arrives – this is expected. For example, when a new event is appended to EVENTS, the DT computing MAX(EVENT_TS) per session updates that session’s aggregated row. The DT engine handles this transparently during incremental refresh.

Closing or derived timestamps (e.g., session end time, last-seen time, supplier relationship end date) are computed in downstream Feature Views rather than maintained as updated columns on the base table. For example, SESSION_END_TS is derived from the last event in a session:

session_features_df = (
    session.table("EVENTS")
    .group_by("SESSION_ID")
    .agg(
        F.min("EVENT_TS").alias("SESSION_START_TS"),
        F.max("EVENT_TS").alias("SESSION_END_TS"),
        F.count("EVENT_ID").alias("EVENT_COUNT"),
    )
)

Inline Snippet Pattern

For simple examples, we use self-contained data that you can copy-paste directly:

from snowflake.snowpark import Row

sample_events = session.create_dataframe([
    Row(USER_ID='usr_001', EVENT_TS='2025-01-01 10:00:00', EVENT_TYPE='page_view', VALUE=1.0),
    Row(USER_ID='usr_001', EVENT_TS='2025-01-01 10:05:00', EVENT_TYPE='click', VALUE=1.0),
    Row(USER_ID='usr_001', EVENT_TS='2025-01-01 10:10:00', EVENT_TYPE='purchase', VALUE=99.0),
    Row(USER_ID='usr_002', EVENT_TS='2025-01-01 11:00:00', EVENT_TYPE='page_view', VALUE=1.0),
    Row(USER_ID='usr_002', EVENT_TS='2025-01-01 11:02:00', EVENT_TYPE='page_view', VALUE=1.0),
])

sample_events.write.save_as_table("SAMPLE_EVENTS", mode="overwrite")
sample_events.show()
Full Code Available

All code examples in this guide are available in the _code/ directories. You can also click the β€œView source” icon on any code block to see the full file.

Guide Conventions

Code Examples

All code examples follow these conventions:

  • SQL: UPPERCASE keywords, lowercase identifiers
  • Python: PEP 8 style, type hints where helpful
  • Naming: SCREAMING_SNAKE_CASE for Snowflake objects, snake_case for Python
  • SQL and Snowpark DataFrame examples: Feature View definitions and other DataFrames are shown in both SQL and Snowpark DataFrame form using tabbed panels. Both approaches produce identical lazy Snowpark DataFrames. SQL is shown first (as it is more widely known), but the Snowpark DataFrame API is equally valid. See Chapter 4: SQL vs Snowpark DataFrame API for a comparison table and guidance on choosing. When you rely on generated SQL (Snowpark, dbplyr, or similar), third-party pretty-printers and optional CTE hoists can make plans easier to read in Snowsight or QUERY_HISTORY; see Appendix C.
Adapting Code for Your Environment

For clarity, code examples in this guide often use literal object names (e.g., session.table("EVENTS"), "FEATURE_STORE_DEMO") rather than fully-qualified references or variable substitution. Production code should parameterise database, schema, and object names so the same codebase works across DEV, TEST, and PROD without edits:

# Guide example (for readability)
df = session.table("EVENTS")
fs = FeatureStore(session=session, database="FEATURE_STORE_DEMO", name="FEATURE_STORE", ...)

# Production equivalent (parameterised)
df = session.table(f"{SOURCE_DATABASE}.{SOURCE_SCHEMA}.EVENTS")
fs = FeatureStore(session=session, database=SOURCE_DATABASE, name=FS_NAME, ...)

Values such as SOURCE_DATABASE and FS_NAME are typically sourced from configuration files, environment variables, CI/CD secrets, or templating engines (e.g., Jinja, Terraform, Snowflake DCM Projects). See Chapter 2: Multi-Environment Deployment for full parameterisation patterns.

Abbreviations Used in This Guide

Abbreviation Meaning
DT Dynamic Table
FV Feature View
GA Generally Available (production-ready; no preview enrollment required)
PubPr Public Preview (preview enrollment may be required; APIs and performance may change)
PrivPr Private Preview (limited availability; account enablement required)
OFT Online Feature Table
PIT Point-in-Time
MIT Model-Independent Transformation
MDT Model-Dependent Transformation
ODT On-Demand Transformation
SCD Slowly Changing Dimension
RBAC Role-Based Access Control

Each abbreviation is expanded on first use in every chapter. After that, the short form is used for readability. Release-status labels (GA, PubPr, PrivPr) appear frequently in platform capability tables and online-serving chapters.

Feature Naming Conventions

Suffix Meaning Example
_TS Timestamp ORDER_TS, CREATED_TS
_DT Date ORDER_DT
_CNT Count SESSION_CNT, ORDER_CNT
_DCNT Distinct count PRODUCT_VIEW_DCNT
_SUM Sum/total REVENUE_SUM, QUANTITY_SUM
_AVG Average ORDER_VALUE_AVG
_AMT Amount (currency) TOTAL_AMT, DISCOUNT_AMT
IS_ Boolean flag IS_CONVERTED, IS_CURRENT
_LST Last LAST_ORDER_TS
_FST First FIRST_ORDER_TS
_MAX Maximum MAX_ORDER_VALUE
_MIN Minimum MIN_ORDER_VALUE
_STD Standard deviation ORDER_VALUE_STD
_VAR Variance ORDER_VALUE_VAR
_MED Median ORDER_VALUE_MED
_PCT Percentile ORDER_VALUE_PCT_90
Establish naming conventions early

Agreeing on feature naming conventions at the start of a project prevents inconsistency, reduces refactoring, and makes features discoverable across teams. The table above is a starting point β€” adapt it to your organization and enforce it through code review and automation. See Chapter 2: Establish Naming Conventions Early for object-level conventions (databases, schemas, Feature Views, Entities) and Chapter 7: Naming Conventions for aggregation-specific prefixing patterns.

Version Callouts

New features are highlighted with version callouts:

πŸ†• New in snowflake-ml X.X.X

Description of the new feature and how it improves on previous approaches.


Next Steps

Continue to Chapter 1: Concepts to learn about the core Feature Store objects and terminology.