Introduction

Setup, prerequisites, and guide overview

Keywords

snowflake, feature store, ml, machine learning, mlops

Overview

This guide provides comprehensive best practices for implementing Snowflake Feature Store in production ML systems. It covers the full lifecycle from initial design through deployment and ongoing operations.

The Snowflake Feature Store is a managed data management layer that helps ML teams organize, store, and serve features at scale. It provides:

Centralized Feature Management: A single source of truth for feature definitions and metadata
Automated Feature Engineering Pipelines: Dynamic Table-based materialization (pre-computed) or View-based (query-time computation)
Point-in-Time Feature Retrieval: Temporally correct feature joins for ML training and batch inference, preventing data leakage
Online Serving: Low-latency feature retrieval via Online Feature Tables for real-time inference
Governance & Lineage: Full tracking from source data to model predictions

📂 Chapter code: Browse companion scripts on GitHub

What’s Covered in This Guide

Chapter	Topic	Description
00	Introduction	Setup, prerequisites, guide overview
01	Concepts	Core objects, terminology, transformation taxonomy
02	Design & Organization	Feature Store structure, multi-environment patterns
03	Entities & Hierarchies	Entity keys (simple & compound), relationships
04	Feature Views	DT vs View, versioning, lifecycle management
05	Feature Pipelines	dbt, Dynamic Tables, Temporal API pipelines
06	Temporal Features	Point-in-time correctness, tiling, aggregations
07	Aggregations API	`Feature` class, windows, rollups
08	Online Features	Online feature model; Hybrid Table backend (GA); batch sync
09	Real-Time & Online Serving	Postgres backend; batch sync, Stream FVs, RTFVs, Feature Groups
10	Preprocessing	Encoding, scaling, model-dependent transforms
11	Training & Inference	Spine design, dataset generation
12	Operations	Monitoring, DMFs, lineage, drift detection
13	Advanced Patterns	CI/CD, streaming workarounds, multi-FS operations
14	Migration Guide	Tecton, SageMaker, Vertex AI migration paths

Prerequisites

Required Snowflake Access

Snowflake account (Standard Edition or higher)
Database privileges to create schemas, tables, dynamic tables, views, and tags
Warehouse access for compute
snowflake-ml-python package installed (available via pip/conda)

Required Privileges

Feature Store requires specific privileges. A role setup script is available in the Snowflake documentation.

Minimum required privileges:

-- Create a role for Feature Store users
CREATE ROLE IF NOT EXISTS FEATURE_STORE_USER;

-- Grant database and schema access
GRANT USAGE ON DATABASE <database> TO ROLE FEATURE_STORE_USER;
GRANT CREATE SCHEMA ON DATABASE <database> TO ROLE FEATURE_STORE_USER;
GRANT USAGE ON WAREHOUSE <warehouse> TO ROLE FEATURE_STORE_USER;

-- For Feature Store schema operations
GRANT CREATE TABLE ON SCHEMA <schema> TO ROLE FEATURE_STORE_USER;
GRANT CREATE DYNAMIC TABLE ON SCHEMA <schema> TO ROLE FEATURE_STORE_USER;
GRANT CREATE VIEW ON SCHEMA <schema> TO ROLE FEATURE_STORE_USER;
GRANT CREATE TAG ON SCHEMA <schema> TO ROLE FEATURE_STORE_USER;

Software Requirements

# Minimum package versions for this guide (as of April 2026)
snowflake-ml-python >= 1.21.0     # Feature Store incl. Feature aggregation API (latest: 1.34.0)
snowflake-snowpark-python >= 1.40.0  # DataFrame operations (latest: 1.48.1)

# Additional packages used in examples
pandas >= 2.0.0
numpy >= 1.23.5
scikit-learn >= 1.5.0

Feature Aggregation API

The Feature aggregation class (1.5.3.1 Method 3: Feature Aggregation Class (Time-Windowed)) and Temporal Aggregated Feature Views require snowflake-ml-python >= 1.21.0. This guide assumes 1.21.0 as the minimum baseline.

Environment Setup

Session Configuration

The following code establishes a Snowpark session and verifies the environment. Adjust SOURCE_DATABASE, SOURCE_SCHEMA, FS_NAME, and WAREHOUSE to match your environment and needs.

📁 Full code: _code/setup_session.py

# ==============================================================================
# CONFIGURATION
# ==============================================================================

SOURCE_DATABASE = "FEATURE_STORE_DEMO"
SOURCE_SCHEMA = "CLICKSTREAM_DATA"
FS_NAME = "FEATURE_STORE"
WAREHOUSE = "FS_DEV_WH"

# ==============================================================================
# IMPORTS
# ==============================================================================

import json
from datetime import datetime, timedelta
from decimal import Decimal

import pandas as pd
import numpy as np

from snowflake.snowpark import Session
from snowflake.snowpark.version import VERSION
from snowflake.snowpark import functions as F
from snowflake.snowpark import types as T
from snowflake.snowpark.context import get_active_session

from snowflake.ml.feature_store import (
    FeatureStore,
    FeatureView,
    Entity,
    CreationMode,
)

# ==============================================================================
# SESSION CREATION
# ==============================================================================

# Option 1: Running in Snowflake Notebook (preferred)
try:
    session = get_active_session()
except Exception:
    # Option 2: Running locally — reads ~/.snowflake/connections.toml [default]
    session = Session.builder.config("connection_name", "default").create()

session.sql_simplifier_enabled = True

Verify Environment

📁 Full code: _code/verify_environment.py

# Capture environment details
snowflake_environment = session.sql('SELECT current_user(), current_version()').collect()
snowpark_version = VERSION
session_role = session.get_current_role().replace('"', "")
session_database = session.get_current_database().replace('"', "")
session_schema = session.get_current_schema().replace('"', "")
session_warehouse = session.get_current_warehouse().replace('"', "")

# Check warehouse status (size/state can be NULL for some warehouse types)
wh_status = session.sql(f"SHOW WAREHOUSES LIKE '{session_warehouse}'").collect()[0]
wh_size = wh_status["size"]
wh_state = wh_status["state"]
warehouse_size = wh_size.upper() if wh_size else "N/A"
warehouse_state = wh_state if wh_state is not None else "N/A"

print('=' * 70)
print('CONNECTION ESTABLISHED')
print('=' * 70)
print(f'Account                : {session.sql("SELECT current_account()").collect()[0][0]}')
print(f'User                   : {snowflake_environment[0][0]}')
print(f'Role                   : {session_role}')
print(f'Database               : {session_database}')
print(f'Schema                 : {session_schema}')
print(f'Warehouse              : {session_warehouse}')
print(f'Warehouse Size         : {warehouse_size}')
print(f'Warehouse State        : {warehouse_state}')
print(f'Snowflake Version      : {snowflake_environment[0][1]}')
print(f'Snowpark Version       : {snowpark_version[0]}.{snowpark_version[1]}.{snowpark_version[2]}')
print('=' * 70)

Expected output (values depend on your environment):

======================================================================
CONNECTION ESTABLISHED
======================================================================
Account                : YOUR_ACCOUNT
User                   : YOUR_USER
Role                   : FS_DEV_ROLE
Database               : FEATURE_STORE_DEMO
Schema                 : CLICKSTREAM_DATA
Warehouse              : FS_DEV_WH
Warehouse Size         : X-SMALL
Warehouse State        : STARTED
Snowflake Version      : 9.x.x
Snowpark Version       : 1.25.0
======================================================================

Create Feature Store

📁 Full code: _code/create_feature_store.py

# Create working schema for sample data
session.sql(f'CREATE SCHEMA IF NOT EXISTS {SOURCE_DATABASE}.{SOURCE_SCHEMA}').collect()

# Initialize Feature Store
fs = FeatureStore(
    session=session,
    database=SOURCE_DATABASE,
    name=FS_NAME,
    default_warehouse=WAREHOUSE,
    creation_mode=CreationMode.CREATE_IF_NOT_EXIST,
)

print(f'Feature Store initialized: {SOURCE_DATABASE}.{FS_NAME}')

Sample Data

📁 Full code: _code/sample_data.py

This guide uses a synthetic clickstream dataset designed to demonstrate all Feature Store capabilities, including:

Temporal patterns: Event timestamps with realistic distributions
Entity hierarchies: Visitor → User → Household relationships
Composite keys: Product-Supplier junction table (M:N)
Semi-structured data: VARIANT, ARRAY, OBJECT columns

See Appendix A: Clickstream Data Model for the full data model specification.

Quick Data Overview

The clickstream dataset includes these core tables:

Table	Description	Entity Key	Temporal Column
`VISITORS`	Anonymous visitors	`VISITOR_ID`	`FIRST_SEEN_TS`
`USERS`	Identified users	`USER_ID`	`CREATED_TS`
`SESSIONS`	User sessions	`SESSION_ID`	`SESSION_START_TS`
`EVENTS`	Clickstream events	`EVENT_ID`	`EVENT_TS`
`ORDERS`	Purchase orders	`ORDER_ID`	`ORDER_TS`
`PRODUCTS`	Product catalog	`PRODUCT_ID`	`CREATED_TS`
`PRODUCT_SUPPLIER`	Supplier relationships	`(PRODUCT_ID, SUPPLIER_ID)`	`VALID_FROM_TS`

Append-Friendly Base Tables

The base tables above are designed to be append-only – rows are inserted once and not updated. This pattern is common for source tables dealing with high scale ingestion rates, such as clickstream data. Append-only sources can also be more efficient for Dynamic Table incremental refresh, because the DT engine only needs to scan newly added micro-partitions rather than diffing existing ones for in-place row updates. However, Dynamic Tables work with any source table regardless of whether it is append-only or not.

As of April 2026, source tables that are periodically fully replaced (e.g., via INSERT OVERWRITE or CREATE TABLE ... AS SELECT) can also support incremental downstream processing by declaring a primary key with the RELY property (ALTER TABLE ... ADD PRIMARY KEY (...) RELY). Snowflake uses the reliable PK to compare rows by key value across rewrites, computing only actual changes for downstream DTs. This is a SQL-level capability not yet exposed through the Feature Store Python API. See Chapter 5: Feature Pipelines and Understanding primary keys in dynamic tables for details.

Note that the Dynamic Tables themselves will insert and update rows as new source data arrives – this is expected. For example, when a new event is appended to EVENTS, the DT computing MAX(EVENT_TS) per session updates that session’s aggregated row. The DT engine handles this transparently during incremental refresh.

Closing or derived timestamps (e.g., session end time, last-seen time, supplier relationship end date) are computed in downstream Feature Views rather than maintained as updated columns on the base table. For example, SESSION_END_TS is derived from the last event in a session:

session_features_df = (
    session.table("EVENTS")
    .group_by("SESSION_ID")
    .agg(
        F.min("EVENT_TS").alias("SESSION_START_TS"),
        F.max("EVENT_TS").alias("SESSION_END_TS"),
        F.count("EVENT_ID").alias("EVENT_COUNT"),
    )
)

Inline Snippet Pattern

For simple examples, we use self-contained data that you can copy-paste directly:

from snowflake.snowpark import Row

sample_events = session.create_dataframe([
    Row(USER_ID='usr_001', EVENT_TS='2025-01-01 10:00:00', EVENT_TYPE='page_view', VALUE=1.0),
    Row(USER_ID='usr_001', EVENT_TS='2025-01-01 10:05:00', EVENT_TYPE='click', VALUE=1.0),
    Row(USER_ID='usr_001', EVENT_TS='2025-01-01 10:10:00', EVENT_TYPE='purchase', VALUE=99.0),
    Row(USER_ID='usr_002', EVENT_TS='2025-01-01 11:00:00', EVENT_TYPE='page_view', VALUE=1.0),
    Row(USER_ID='usr_002', EVENT_TS='2025-01-01 11:02:00', EVENT_TYPE='page_view', VALUE=1.0),
])

sample_events.write.save_as_table("SAMPLE_EVENTS", mode="overwrite")
sample_events.show()

Full Code Available

All code examples in this guide are available in the _code/ directories. You can also click the “View source” icon on any code block to see the full file.

Guide Conventions

Code Examples

All code examples follow these conventions:

SQL: UPPERCASE keywords, lowercase identifiers
Python: PEP 8 style, type hints where helpful
Naming: SCREAMING_SNAKE_CASE for Snowflake objects, snake_case for Python
SQL and Snowpark DataFrame examples: Feature View definitions and other DataFrames are shown in both SQL and Snowpark DataFrame form using tabbed panels. Both approaches produce identical lazy Snowpark DataFrames. SQL is shown first (as it is more widely known), but the Snowpark DataFrame API is equally valid. See Chapter 4: SQL vs Snowpark DataFrame API for a comparison table and guidance on choosing. When you rely on generated SQL (Snowpark, dbplyr, or similar), third-party pretty-printers and optional CTE hoists can make plans easier to read in Snowsight or QUERY_HISTORY; see Appendix C.

Adapting Code for Your Environment

For clarity, code examples in this guide often use literal object names (e.g., session.table("EVENTS"), "FEATURE_STORE_DEMO") rather than fully-qualified references or variable substitution. Production code should parameterise database, schema, and object names so the same codebase works across DEV, TEST, and PROD without edits:

# Guide example (for readability)
df = session.table("EVENTS")
fs = FeatureStore(session=session, database="FEATURE_STORE_DEMO", name="FEATURE_STORE", ...)

# Production equivalent (parameterised)
df = session.table(f"{SOURCE_DATABASE}.{SOURCE_SCHEMA}.EVENTS")
fs = FeatureStore(session=session, database=SOURCE_DATABASE, name=FS_NAME, ...)

Values such as SOURCE_DATABASE and FS_NAME are typically sourced from configuration files, environment variables, CI/CD secrets, or templating engines (e.g., Jinja, Terraform, Snowflake DCM Projects). See Chapter 2: Multi-Environment Deployment for full parameterisation patterns.

Abbreviations Used in This Guide

Abbreviation	Meaning
DT	Dynamic Table
FV	Feature View
GA	Generally Available (production-ready; no preview enrollment required)
PubPr	Public Preview (preview enrollment may be required; APIs and performance may change)
PrivPr	Private Preview (limited availability; account enablement required)
OFT	Online Feature Table
PIT	Point-in-Time
MIT	Model-Independent Transformation
MDT	Model-Dependent Transformation
ODT	On-Demand Transformation
SCD	Slowly Changing Dimension
RBAC	Role-Based Access Control

Each abbreviation is expanded on first use in every chapter. After that, the short form is used for readability. Release-status labels (GA, PubPr, PrivPr) appear frequently in platform capability tables and online-serving chapters.

Feature Naming Conventions

Suffix	Meaning	Example
`_TS`	Timestamp	`ORDER_TS`, `CREATED_TS`
`_DT`	Date	`ORDER_DT`
`_CNT`	Count	`SESSION_CNT`, `ORDER_CNT`
`_DCNT`	Distinct count	`PRODUCT_VIEW_DCNT`
`_SUM`	Sum/total	`REVENUE_SUM`, `QUANTITY_SUM`
`_AVG`	Average	`ORDER_VALUE_AVG`
`_AMT`	Amount (currency)	`TOTAL_AMT`, `DISCOUNT_AMT`
`IS_`	Boolean flag	`IS_CONVERTED`, `IS_CURRENT`
`_LST`	Last	`LAST_ORDER_TS`
`_FST`	First	`FIRST_ORDER_TS`
`_MAX`	Maximum	`MAX_ORDER_VALUE`
`_MIN`	Minimum	`MIN_ORDER_VALUE`
`_STD`	Standard deviation	`ORDER_VALUE_STD`
`_VAR`	Variance	`ORDER_VALUE_VAR`
`_MED`	Median	`ORDER_VALUE_MED`
`_PCT`	Percentile	`ORDER_VALUE_PCT_90`

Establish naming conventions early

Agreeing on feature naming conventions at the start of a project prevents inconsistency, reduces refactoring, and makes features discoverable across teams. The table above is a starting point — adapt it to your organization and enforce it through code review and automation. See Chapter 2: Establish Naming Conventions Early for object-level conventions (databases, schemas, Feature Views, Entities) and Chapter 7: Naming Conventions for aggregation-specific prefixing patterns.

Version Callouts

New features are highlighted with version callouts:

🆕 New in snowflake-ml X.X.X

Description of the new feature and how it improves on previous approaches.

Next Steps

Continue to Chapter 1: Concepts to learn about the core Feature Store objects and terminology.

--- subtitle: "Setup, prerequisites, and guide overview" --- # Introduction {.unnumbered} *Author: [Simon Field](https://www.linkedin.com/in/fieldy6961/)* ## Overview This guide provides comprehensive best practices for implementing Snowflake Feature Store in production ML systems. It covers the full lifecycle from initial design through deployment and ongoing operations. The Snowflake Feature Store is a managed data management layer that helps ML teams organize, store, and serve features at scale. It provides: - **Centralized Feature Management**: A single source of truth for feature definitions and metadata - **Automated Feature Engineering Pipelines**: Dynamic Table-based materialization (pre-computed) or View-based (query-time computation) - **Point-in-Time Feature Retrieval**: Temporally correct feature joins for ML training and batch inference, preventing data leakage - **Online Serving**: Low-latency feature retrieval via Online Feature Tables for real-time inference - **Governance & Lineage**: Full tracking from source data to model predictions > 📂 **Chapter code:** [Browse companion scripts on GitHub](https://github.com/Snowflake-Labs/snowflake-featurestore-imp-guide/tree/main/Snowflake_FeatureStore_Implementation_Guide/00_introduction/_code) ## What's Covered in This Guide | Chapter | Topic | Description | |---------|-------|-------------| | **00** | Introduction | Setup, prerequisites, guide overview | | [**01**](../01_concepts/index.qmd) | Concepts | Core objects, terminology, transformation taxonomy | | [**02**](../02_design_organization/index.qmd) | Design & Organization | Feature Store structure, multi-environment patterns | | [**03**](../03_entities_hierarchies/index.qmd) | Entities & Hierarchies | Entity keys (simple & compound), relationships | | [**04**](../04_feature_views/index.qmd) | Feature Views | DT vs View, versioning, lifecycle management | | [**05**](../05_feature_pipelines/index.qmd) | Feature Pipelines | dbt, Dynamic Tables, Temporal API pipelines | | [**06**](../06_temporal_features/index.qmd) | Temporal Features | Point-in-time correctness, tiling, aggregations | | [**07**](../07_aggregations_api/index.qmd) | Aggregations API | `Feature` class, windows, rollups | | [**08**](../08_online_features/index.qmd) | Online Features | Online feature model; Hybrid Table backend (GA); batch sync | | [**09**](../09_real_time_online_serving/index.qmd) | Real-Time & Online Serving | Postgres backend; batch sync, Stream FVs, RTFVs, Feature Groups | | [**10**](../10_preprocessing/index.qmd) | Preprocessing | Encoding, scaling, model-dependent transforms | | [**11**](../11_training_inference/index.qmd) | Training & Inference | Spine design, dataset generation | | [**12**](../12_operations/index.qmd) | Operations | Monitoring, DMFs, lineage, drift detection | | [**13**](../13_advanced_patterns/index.qmd) | Advanced Patterns | CI/CD, streaming workarounds, multi-FS operations | | [**14**](../14_migration_guide/index.qmd) | Migration Guide | Tecton, SageMaker, Vertex AI migration paths | ## Prerequisites ### Required Snowflake Access - Snowflake account (Standard Edition or higher) - Database privileges to create schemas, tables, dynamic tables, views, and tags - Warehouse access for compute - `snowflake-ml-python` package installed (available via pip/conda) ### Required Privileges Feature Store requires specific privileges. A role setup script is available in the [Snowflake documentation](https://docs.snowflake.com/en/developer-guide/snowflake-ml/feature-store/rbac). Minimum required privileges: ```sql -- Create a role for Feature Store users CREATE ROLE IF NOT EXISTS FEATURE_STORE_USER; -- Grant database and schema access GRANT USAGE ON DATABASE <database> TO ROLE FEATURE_STORE_USER; GRANT CREATE SCHEMA ON DATABASE <database> TO ROLE FEATURE_STORE_USER; GRANT USAGE ON WAREHOUSE <warehouse> TO ROLE FEATURE_STORE_USER; -- For Feature Store schema operations GRANT CREATE TABLE ON SCHEMA <schema> TO ROLE FEATURE_STORE_USER; GRANT CREATE DYNAMIC TABLE ON SCHEMA <schema> TO ROLE FEATURE_STORE_USER; GRANT CREATE VIEW ON SCHEMA <schema> TO ROLE FEATURE_STORE_USER; GRANT CREATE TAG ON SCHEMA <schema> TO ROLE FEATURE_STORE_USER; ``` ### Software Requirements ```python # Minimum package versions for this guide (as of April 2026) snowflake-ml-python >= 1.21.0 # Feature Store incl. Feature aggregation API (latest: 1.34.0) snowflake-snowpark-python >= 1.40.0 # DataFrame operations (latest: 1.48.1) # Additional packages used in examples pandas >= 2.0.0 numpy >= 1.23.5 scikit-learn >= 1.5.0 ``` ::: {.callout-note} ## Feature Aggregation API The `Feature` aggregation class (@sec-aggregations-api) and Temporal Aggregated Feature Views require `snowflake-ml-python >= 1.21.0`. This guide assumes 1.21.0 as the minimum baseline. ::: ## Environment Setup ### Session Configuration The following code establishes a Snowpark session and verifies the environment. Adjust `SOURCE_DATABASE`, `SOURCE_SCHEMA`, `FS_NAME`, and `WAREHOUSE` to match your environment and needs. > 📁 **Full code:** [`_code/setup_session.py`](_code/setup_session.py) ```{python} #| output: false # ============================================================================== # CONFIGURATION # ============================================================================== SOURCE_DATABASE = "FEATURE_STORE_DEMO" SOURCE_SCHEMA = "CLICKSTREAM_DATA" FS_NAME = "FEATURE_STORE" WAREHOUSE = "FS_DEV_WH" # ============================================================================== # IMPORTS # ============================================================================== import json from datetime import datetime, timedelta from decimal import Decimal import pandas as pd import numpy as np from snowflake.snowpark import Session from snowflake.snowpark.version import VERSION from snowflake.snowpark import functions as F from snowflake.snowpark import types as T from snowflake.snowpark.context import get_active_session from snowflake.ml.feature_store import ( FeatureStore, FeatureView, Entity, CreationMode, ) # ============================================================================== # SESSION CREATION # ============================================================================== # Option 1: Running in Snowflake Notebook (preferred) try: session = get_active_session() except Exception: # Option 2: Running locally — reads ~/.snowflake/connections.toml [default] session = Session.builder.config("connection_name", "default").create() session.sql_simplifier_enabled = True ``` ### Verify Environment {#sec-verify-environment} > 📁 **Full code:** [`_code/verify_environment.py`](_code/verify_environment.py) ```{python} #| output: false # Capture environment details snowflake_environment = session.sql('SELECT current_user(), current_version()').collect() snowpark_version = VERSION session_role = session.get_current_role().replace('"', "") session_database = session.get_current_database().replace('"', "") session_schema = session.get_current_schema().replace('"', "") session_warehouse = session.get_current_warehouse().replace('"', "") # Check warehouse status (size/state can be NULL for some warehouse types) wh_status = session.sql(f"SHOW WAREHOUSES LIKE '{session_warehouse}'").collect()[0] wh_size = wh_status["size"] wh_state = wh_status["state"] warehouse_size = wh_size.upper() if wh_size else "N/A" warehouse_state = wh_state if wh_state is not None else "N/A" print('=' * 70) print('CONNECTION ESTABLISHED') print('=' * 70) print(f'Account : {session.sql("SELECT current_account()").collect()[0][0]}') print(f'User : {snowflake_environment[0][0]}') print(f'Role : {session_role}') print(f'Database : {session_database}') print(f'Schema : {session_schema}') print(f'Warehouse : {session_warehouse}') print(f'Warehouse Size : {warehouse_size}') print(f'Warehouse State : {warehouse_state}') print(f'Snowflake Version : {snowflake_environment[0][1]}') print(f'Snowpark Version : {snowpark_version[0]}.{snowpark_version[1]}.{snowpark_version[2]}') print('=' * 70) ``` Expected output (values depend on your environment): ``` ====================================================================== CONNECTION ESTABLISHED ====================================================================== Account : YOUR_ACCOUNT User : YOUR_USER Role : FS_DEV_ROLE Database : FEATURE_STORE_DEMO Schema : CLICKSTREAM_DATA Warehouse : FS_DEV_WH Warehouse Size : X-SMALL Warehouse State : STARTED Snowflake Version : 9.x.x Snowpark Version : 1.25.0 ====================================================================== ``` ### Create Feature Store > 📁 **Full code:** [`_code/create_feature_store.py`](_code/create_feature_store.py) ```{python} # Create working schema for sample data session.sql(f'CREATE SCHEMA IF NOT EXISTS {SOURCE_DATABASE}.{SOURCE_SCHEMA}').collect() # Initialize Feature Store fs = FeatureStore( session=session, database=SOURCE_DATABASE, name=FS_NAME, default_warehouse=WAREHOUSE, creation_mode=CreationMode.CREATE_IF_NOT_EXIST, ) print(f'Feature Store initialized: {SOURCE_DATABASE}.{FS_NAME}') ``` ## Sample Data > 📁 **Full code:** [`_code/sample_data.py`](_code/sample_data.py) This guide uses a synthetic clickstream dataset designed to demonstrate all Feature Store capabilities, including: - **Temporal patterns**: Event timestamps with realistic distributions - **Entity hierarchies**: Visitor → User → Household relationships - **Composite keys**: Product-Supplier junction table (M:N) - **Semi-structured data**: VARIANT, ARRAY, OBJECT columns See [Appendix A: Clickstream Data Model](../appendices/A_sample_data/index.qmd) for the full data model specification. ### Quick Data Overview The clickstream dataset includes these core tables: | Table | Description | Entity Key | Temporal Column | |-------|-------------|------------|-----------------| | `VISITORS` | Anonymous visitors | `VISITOR_ID` | `FIRST_SEEN_TS` | | `USERS` | Identified users | `USER_ID` | `CREATED_TS` | | `SESSIONS` | User sessions | `SESSION_ID` | `SESSION_START_TS` | | `EVENTS` | Clickstream events | `EVENT_ID` | `EVENT_TS` | | `ORDERS` | Purchase orders | `ORDER_ID` | `ORDER_TS` | | `PRODUCTS` | Product catalog | `PRODUCT_ID` | `CREATED_TS` | | `PRODUCT_SUPPLIER` | Supplier relationships | `(PRODUCT_ID, SUPPLIER_ID)` | `VALID_FROM_TS` | ::: {.callout-tip} ## Append-Friendly Base Tables The base tables above are designed to be **append-only** -- rows are inserted once and not updated. This pattern is common for source tables dealing with high scale ingestion rates, such as clickstream data. Append-only sources can also be more efficient for Dynamic Table incremental refresh, because the DT engine only needs to scan newly added micro-partitions rather than diffing existing ones for in-place row updates. However, Dynamic Tables work with any source table regardless of whether it is append-only or not. As of April 2026, source tables that are periodically **fully replaced** (e.g., via `INSERT OVERWRITE` or `CREATE TABLE ... AS SELECT`) can also support incremental downstream processing by declaring a primary key with the `RELY` property (`ALTER TABLE ... ADD PRIMARY KEY (...) RELY`). Snowflake uses the reliable PK to compare rows by key value across rewrites, computing only actual changes for downstream DTs. This is a SQL-level capability not yet exposed through the Feature Store Python API. See [Chapter 5: Feature Pipelines](../05_feature_pipelines/index.qmd) and [Understanding primary keys in dynamic tables](https://docs.snowflake.com/en/user-guide/dynamic-tables/input-data-optimization) for details. Note that the Dynamic Tables *themselves* will insert and update rows as new source data arrives -- this is expected. For example, when a new event is appended to `EVENTS`, the DT computing `MAX(EVENT_TS)` per session updates that session's aggregated row. The DT engine handles this transparently during incremental refresh. Closing or derived timestamps (e.g., session end time, last-seen time, supplier relationship end date) are computed in downstream Feature Views rather than maintained as updated columns on the base table. For example, `SESSION_END_TS` is derived from the last event in a session: ```python session_features_df = ( session.table("EVENTS") .group_by("SESSION_ID") .agg( F.min("EVENT_TS").alias("SESSION_START_TS"), F.max("EVENT_TS").alias("SESSION_END_TS"), F.count("EVENT_ID").alias("EVENT_COUNT"), ) ) ``` ::: ### Inline Snippet Pattern For simple examples, we use self-contained data that you can copy-paste directly: ```{python} from snowflake.snowpark import Row sample_events = session.create_dataframe([ Row(USER_ID='usr_001', EVENT_TS='2025-01-01 10:00:00', EVENT_TYPE='page_view', VALUE=1.0), Row(USER_ID='usr_001', EVENT_TS='2025-01-01 10:05:00', EVENT_TYPE='click', VALUE=1.0), Row(USER_ID='usr_001', EVENT_TS='2025-01-01 10:10:00', EVENT_TYPE='purchase', VALUE=99.0), Row(USER_ID='usr_002', EVENT_TS='2025-01-01 11:00:00', EVENT_TYPE='page_view', VALUE=1.0), Row(USER_ID='usr_002', EVENT_TS='2025-01-01 11:02:00', EVENT_TYPE='page_view', VALUE=1.0), ]) sample_events.write.save_as_table("SAMPLE_EVENTS", mode="overwrite") sample_events.show() ``` ::: {.callout-tip} ## Full Code Available All code examples in this guide are available in the `_code/` directories. You can also click the "View source" icon on any code block to see the full file. ::: ## Guide Conventions ### Code Examples All code examples follow these conventions: - **SQL**: UPPERCASE keywords, lowercase identifiers - **Python**: PEP 8 style, type hints where helpful - **Naming**: `SCREAMING_SNAKE_CASE` for Snowflake objects, `snake_case` for Python - **SQL and Snowpark DataFrame examples**: Feature View definitions and other DataFrames are shown in **both SQL and Snowpark DataFrame** form using tabbed panels. Both approaches produce identical lazy Snowpark DataFrames. SQL is shown first (as it is more widely known), but the Snowpark DataFrame API is equally valid. See [Chapter 4: SQL vs Snowpark DataFrame API](../04_feature_views/index.qmd#sec-sql-vs-snowpark) for a comparison table and guidance on choosing. When you rely on generated SQL (Snowpark, **dbplyr**, or similar), third-party pretty-printers and optional CTE hoists can make plans easier to read in Snowsight or `QUERY_HISTORY`; see [Appendix C](../appendices/C_snowpark_to_dynamic_table/index.qmd#sec-sql-formatting). ::: {.callout-important} ## Adapting Code for Your Environment For clarity, code examples in this guide often use **literal object names** (e.g., `session.table("EVENTS")`, `"FEATURE_STORE_DEMO"`) rather than fully-qualified references or variable substitution. Production code should parameterise database, schema, and object names so the same codebase works across DEV, TEST, and PROD without edits: ```python # Guide example (for readability) df = session.table("EVENTS") fs = FeatureStore(session=session, database="FEATURE_STORE_DEMO", name="FEATURE_STORE", ...) # Production equivalent (parameterised) df = session.table(f"{SOURCE_DATABASE}.{SOURCE_SCHEMA}.EVENTS") fs = FeatureStore(session=session, database=SOURCE_DATABASE, name=FS_NAME, ...) ``` Values such as `SOURCE_DATABASE` and `FS_NAME` are typically sourced from configuration files, environment variables, CI/CD secrets, or templating engines (e.g., Jinja, Terraform, [Snowflake DCM Projects](https://docs.snowflake.com/en/user-guide/dcm-projects/dcm-projects-overview)). See [Chapter 2: Multi-Environment Deployment](../02_design_organization/index.qmd) for full parameterisation patterns. ::: ### Abbreviations Used in This Guide {#sec-abbreviations} | Abbreviation | Meaning | |--------------|---------| | **DT** | Dynamic Table | | **FV** | Feature View | | **GA** | Generally Available (production-ready; no preview enrollment required) | | **PubPr** | Public Preview (preview enrollment may be required; APIs and performance may change) | | **PrivPr** | Private Preview (limited availability; account enablement required) | | **OFT** | Online Feature Table | | **PIT** | Point-in-Time | | **MIT** | Model-Independent Transformation | | **MDT** | Model-Dependent Transformation | | **ODT** | On-Demand Transformation | | **SCD** | Slowly Changing Dimension | | **RBAC** | Role-Based Access Control | Each abbreviation is expanded on first use in every chapter. After that, the short form is used for readability. Release-status labels (**GA**, **PubPr**, **PrivPr**) appear frequently in platform capability tables and online-serving chapters. ### Feature Naming Conventions {#sec-naming-conventions} | Suffix | Meaning | Example | |--------|---------|---------| | `_TS` | Timestamp | `ORDER_TS`, `CREATED_TS` | | `_DT` | Date | `ORDER_DT` | | `_CNT` | Count | `SESSION_CNT`, `ORDER_CNT` | | `_DCNT` | Distinct count | `PRODUCT_VIEW_DCNT` | | `_SUM` | Sum/total | `REVENUE_SUM`, `QUANTITY_SUM` | | `_AVG` | Average | `ORDER_VALUE_AVG` | | `_AMT` | Amount (currency) | `TOTAL_AMT`, `DISCOUNT_AMT` | | `IS_` | Boolean flag | `IS_CONVERTED`, `IS_CURRENT` | | `_LST` | Last | `LAST_ORDER_TS` | | `_FST` | First | `FIRST_ORDER_TS` | | `_MAX` | Maximum | `MAX_ORDER_VALUE` | | `_MIN` | Minimum | `MIN_ORDER_VALUE` | | `_STD` | Standard deviation | `ORDER_VALUE_STD` | | `_VAR` | Variance | `ORDER_VALUE_VAR` | | `_MED` | Median | `ORDER_VALUE_MED` | | `_PCT` | Percentile | `ORDER_VALUE_PCT_90` | ::: {.callout-tip} ## Establish naming conventions early Agreeing on feature naming conventions at the start of a project prevents inconsistency, reduces refactoring, and makes features discoverable across teams. The table above is a starting point — adapt it to your organization and enforce it through code review and automation. See [Chapter 2: Establish Naming Conventions Early](../02_design_organization/index.qmd) for object-level conventions (databases, schemas, Feature Views, Entities) and [Chapter 7: Naming Conventions](../07_aggregations_api/index.qmd) for aggregation-specific prefixing patterns. ::: ### Version Callouts New features are highlighted with version callouts: ::: {.callout-note} ## 🆕 New in snowflake-ml X.X.X Description of the new feature and how it improves on previous approaches. ::: --- ## Next Steps Continue to [Chapter 1: Concepts](../01_concepts/index.qmd) to learn about the core Feature Store objects and terminology.