1 Core Concepts

Entities, Feature Views, Features, and the Spine

Keywords

snowflake, feature store, ml, machine learning, mlops

1.1 Overview

This chapter introduces the core concepts and terminology of Snowflake Feature Store. Understanding these fundamentals is essential before diving into implementation patterns.

1.2 Learning Objectives

After completing this chapter, you will be able to:

Define what a Feature Store is and articulate its value in ML systems
Identify and describe the key components: Entities, Feature Views, Features, and Spines
Understand how Feature Store fits into the broader ML lifecycle
Apply the correct transformation taxonomy (MIT/MDT/ODT) to your use cases
Map Snowflake terminology to industry-standard terms

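📂 Chapter code: Browse companion scripts on GitHub (https://github.com/Snowflake-Labs/snowflake-featurestore-imp-guide/tree/main/Snowflake_FeatureStore_Implementation_Guide/01_concepts/_code)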

1.3 What is a Feature Store?

A Feature Store is a centralized repository for storing, managing, and serving features for machine learning. It acts as the data management layer between raw data sources and ML models.

flowchart LR
  subgraph DS[Data Sources]
    d1[Tables / Streams / External / APIs]
  end
  subgraph FS[Feature Store]
    f1[Pipelines / Storage / Serving / Metadata]
  end
  subgraph ML[ML Consumers]
    m1[Training / Batch / Real-time / Analytics]
  end
  DS --> FS --> ML

ML data infrastructure: sources, Feature Store, and ML consumers

1.3.1 Features: The Building Blocks of ML

A feature is a measurable property or characteristic used as input to an ML model. Features can be:

Feature Type	Description	Example
Raw	Direct attributes from source data	`user_age`, `product_price`
Derived	Calculated from raw features	`price_per_unit`, `age_bucket`
Aggregated	Computed over time windows	`total_orders_30d`, `avg_session_duration_7d`
Encoded	Transformed for model consumption	`category_one_hot`, `amount_scaled`
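
To make the four feature types concrete, the following plain-Python sketch derives one feature of each type from a handful of made-up records. The records, the 30-day window, and the bucket boundaries are illustrative assumptions, not part of any Snowflake API.

```python
from datetime import datetime, timedelta

# Hypothetical raw records; names and values are illustrative only.
user = {"user_id": "usr_001", "user_age": 34}
orders = [
    {"order_ts": datetime(2025, 1, 10), "total_amt": 50.0, "units": 2},
    {"order_ts": datetime(2025, 1, 12), "total_amt": 130.0, "units": 5},
    {"order_ts": datetime(2024, 11, 1), "total_amt": 20.0, "units": 1},
]
as_of = datetime(2025, 1, 15)

# Raw: taken directly from the source record.
user_age = user["user_age"]

# Derived: calculated from raw attributes.
age_bucket = "18-34" if user_age < 35 else "35+"
price_per_unit = [o["total_amt"] / o["units"] for o in orders]

# Aggregated: computed over a 30-day window ending at as_of.
window_start = as_of - timedelta(days=30)
total_orders_30d = sum(1 for o in orders if window_start < o["order_ts"] <= as_of)

# Encoded: transformed for model consumption (one-hot over known buckets).
buckets = ["18-34", "35+"]
age_bucket_one_hot = [1 if age_bucket == b else 0 for b in buckets]

print(user_age, age_bucket, price_per_unit, total_orders_30d, age_bucket_one_hot)
```

In a Feature Store, the first three types would typically live in Feature Views, while encodings that depend on a specific model are usually applied closer to the model.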

1.4 Why Use a Feature Store?

Feature Stores solve critical challenges in production ML systems:

1. Feature Reusability

Without a Feature Store, teams often recreate the same features independently, leading to:

Duplicated effort across data scientists
Inconsistent feature definitions
Wasted compute resources

With Feature Store: Features are defined once, stored centrally, and reused across models and teams.

2. Training-Serving Skew

A common cause of ML model degradation is training-serving skew: the features computed during training differ from those used during inference.

With Feature Store: The same feature definitions serve both training and inference, eliminating skew.

3. Point-in-Time Correctness

Using future data to train models (data leakage) produces artificially good training metrics but poor production performance.

With Feature Store: Built-in temporal joins ensure features are computed using only data available at the prediction time, preventing data leakage.
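
The effect of leakage can be shown with a small plain-Python sketch. The events, the prediction timestamp, and the helper `spend_sum` are hypothetical; the point is only that an aggregate computed without a time cutoff silently includes data that was not available when the prediction would have been made.

```python
from datetime import datetime

# Hypothetical order events for one user: (order timestamp, amount).
events = [
    (datetime(2025, 1, 10), 50.0),
    (datetime(2025, 1, 12), 130.0),
    (datetime(2025, 1, 20), 25.0),  # happens after the prediction time
]
prediction_ts = datetime(2025, 1, 15)

def spend_sum(events, as_of=None):
    """Sum amounts; when as_of is given, ignore events after it."""
    return sum(amt for ts, amt in events if as_of is None or ts <= as_of)

leaky_spend = spend_sum(events)                   # includes the Jan 20 order
correct_spend = spend_sum(events, prediction_ts)  # only data known on Jan 15

print(leaky_spend, correct_spend)
```

The leaky value would look plausible in a training table, which is why this class of error is easy to miss without a temporal join enforcing the cutoff.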

4. Feature Discovery

As organizations scale ML, finding existing features becomes challenging.

With Feature Store: Searchable catalog of features with metadata, lineage, and documentation.

5. Governance & Compliance

Tracking where features come from and how they’re used is essential for regulatory compliance.

With Feature Store: Full lineage from source data through feature transformations to model predictions.

Snowflake Feature Store addresses these challenges by combining a centralized, searchable feature repository with lineage metadata, managed feature computation (Dynamic Tables and Views), and point-in-time-correct feature retrieval for training and inference.

1.5 Key Components

Snowflake Feature Store consists of four primary components:

1.5.1 Entity

An Entity represents a business object that features describe. It defines the join keys used to retrieve features.

📁 Full code: _code/entity_examples.py

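from snowflake.ml.feature_store import Entity

# Assumes `fs` is an initialized FeatureStore instance

# Simple entity with single key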
user_entity = Entity(
    name="USER",
    join_keys=["USER_ID"],
    desc="Registered user in the system"
)

# Compound entity with multiple keys
product_supplier_entity = Entity(
    name="PRODUCT_SUPPLIER",
    join_keys=["PRODUCT_ID", "SUPPLIER_ID"],
    desc="Product-Supplier relationship for supplier-specific features"
)

# Register entities
fs.register_entity(user_entity)
fs.register_entity(product_supplier_entity)
print(f"Registered entities: {[e['NAME'] for e in fs.list_entities().collect()]}")

Key Concepts:

Concept	Description
`name`	Unique identifier for the entity within the Feature Store
`join_keys`	Column(s) that uniquely identify an instance of the entity
Simple Key	Single column identifier (e.g., `USER_ID`)
Compound Key	Multiple columns together form the identifier (e.g., `PRODUCT_ID` + `SUPPLIER_ID`)

Entities enable features from different Feature Views to be joined together when generating training data or serving predictions.

Join key names must match exactly

The Feature Store does not support synonyms or aliases for join keys. The column names defined in the Entity’s join_keys must appear with the same name in every Feature View’s feature_df and in every spine DataFrame. If your source column has a different name, alias it in the SQL or Snowpark DataFrame that defines the Feature View (e.g., SELECT CUST_KEY AS USER_ID ... or .with_column_renamed("CUST_KEY", "USER_ID")). See Chapter 3: Consistent Key Naming for details.
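
A quick pre-registration check can catch mismatched key names early. The sketch below is a minimal plain-Python illustration; `missing_join_keys` is a hypothetical helper, not a Feature Store function, and it uses a simple exact string comparison rather than modeling Snowflake's identifier resolution rules.

```python
def missing_join_keys(join_keys, columns):
    """Return the join keys absent from a column list (exact string match)."""
    available = set(columns)
    return [key for key in join_keys if key not in available]

entity_join_keys = ["USER_ID"]

# Source columns before aliasing: the key is called CUST_KEY here.
source_columns = ["CUST_KEY", "ORDER_TS", "TOTAL_AMT"]

# Column list after renaming CUST_KEY to USER_ID.
renamed_columns = ["USER_ID" if c == "CUST_KEY" else c for c in source_columns]

before = missing_join_keys(entity_join_keys, source_columns)
after = missing_join_keys(entity_join_keys, renamed_columns)

print(before, after)
```

With a Snowpark DataFrame, the column list could be taken from `df.columns` and the same check run against both the Feature View source and the spine.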

Standardize timestamp_col names across Feature Views

Just as entity join keys must be named consistently, adopting a standard timestamp column name (e.g., FV_TS) across all Feature Views simplifies spine design and validation. Alias the source timestamp in the Feature View SQL:

SELECT USER_ID,
       ORDER_TS AS FV_TS,       -- standardized name
       SUM(TOTAL_AMT) AS ...
FROM ...
GROUP BY USER_ID, ORDER_TS

When every Feature View uses the same timestamp_col name, the spine only needs a single spine_timestamp_col value, include_feature_view_timestamp_col=True output is predictable, and leakage-validation code does not need per-FV column mappings. See Chapter 6: timestamp_col Requirements for details.

📖 Deep Dive

See Chapter 3: Entities & Hierarchies for entity design patterns, hierarchies, and compound key strategies.

1.5.2 Feature View

A Feature View is a collection of related features computed from one or more source DataFrames (tables, views, etc.). It defines:

Which entity or entities the features belong to
How features are computed (the transformation logic)
How features are materialized (Dynamic Table vs View)

📁 Full code: _code/featureview_examples.py

from snowflake.ml.feature_store import FeatureView

# Create a Feature View from a Snowpark DataFrame
user_features_fv = FeatureView(
    name="USER_PURCHASE_FEATURES",
    entities=[user_entity],
    feature_df=user_purchase_df,  # Snowpark DataFrame with feature logic
    timestamp_col="UPDATED_TS",   # For point-in-time correctness
    refresh_freq="1 hour",        # Dynamic Table refresh (omit for View)
    desc="User purchase behavior features"
)

# Register in Feature Store
user_features_fv = fs.register_feature_view(
    feature_view=user_features_fv,
    version="V01",
    block=True  # Wait for initial materialization
)

Materialization Options:

Type	Created When	Use Case	Compute Model
Dynamic Table	`refresh_freq` is specified	Pre-computed features, automatic refresh	Snowflake manages refresh
View	`refresh_freq` is omitted	Query-time computation, always fresh	Compute on each query

Key Concepts:

Concept	Description
`name`	Unique identifier for the Feature View
`entities`	List of entities this Feature View provides features for
`feature_df`	Snowpark DataFrame defining the feature transformations
`timestamp_col`	Column used for point-in-time feature retrieval
`refresh_freq`	How often to refresh (e.g., `"1 hour"`, `"1 day"`)
`version`	Version string for managing Feature View evolution

📖 Deep Dive

See Chapter 4: Feature Views for detailed coverage of Feature View types, versioning, and lifecycle management.

1.5.3 Feature

A Feature is an individual column within a Feature View. Features are defined through the feature_df parameter—a Snowpark DataFrame that specifies the transformation logic.

Both session.sql() (SQL) and the Snowpark DataFrame API produce identical lazy Snowpark DataFrames. See Chapter 4: SQL vs Snowpark DataFrame API for a detailed comparison.

📁 Full code: _code/feature_dataframe_api.py | _code/feature_sql_api.py

SQL

user_purchase_df = session.sql("""
    SELECT
        USER_ID,
        COUNT(DISTINCT ORDER_ID) AS ORDER_CNT,
        SUM(TOTAL_AMT) AS SPEND_SUM,
        AVG(TOTAL_AMT) AS ORDER_VALUE_AVG,
        MAX(ORDER_TS) AS LAST_ORDER_TS,
        DATEDIFF('day', MIN(ORDER_TS), MAX(ORDER_TS)) AS CUSTOMER_TENURE_DAYS
    FROM ORDERS
    GROUP BY USER_ID
""")

Snowpark DataFrame

from snowflake.snowpark import functions as F

# Same aggregations as the SQL version, expressed with the DataFrame API
user_purchase_df = (
    session.table(ORDERS_TABLE)
    .group_by("USER_ID")
    .agg(
        F.count_distinct("ORDER_ID").alias("ORDER_CNT"),
        F.sum("TOTAL_AMT").alias("SPEND_SUM"),
        F.avg("TOTAL_AMT").alias("ORDER_VALUE_AVG"),
        F.max("ORDER_TS").alias("LAST_ORDER_TS"),
        F.datediff("day", F.min("ORDER_TS"), F.max("ORDER_TS")).alias("CUSTOMER_TENURE_DAYS"),
    )
)

Create and register a Feature View using the DataFrame above:

user_purchase_fv = FeatureView(
    name="USER_PURCHASE_FEATURES",
    entities=[user_entity],
    feature_df=user_purchase_df,
    timestamp_col="LAST_ORDER_TS",
    refresh_freq="1 hour",
    desc="User purchase behavior features",
)

user_purchase_fv = fs.register_feature_view(
    feature_view=user_purchase_fv,
    version="V01",
    block=True,
    overwrite=True,
)

print(f"Registered: {user_purchase_fv.name}/V01")
print(f"Status: {user_purchase_fv.status}")

Time-windowed features over sparse data produce incorrect results at retrieval time

SQL window functions such as RANGE BETWEEN INTERVAL '7 days' PRECEDING AND CURRENT ROW compute correct windows at materialization time – each row’s 7-day window is anchored to that row’s own timestamp. The problem surfaces at retrieval time. When the Feature Store performs an ASOF join against a pre-materialized Dynamic Table, it returns the most recent row whose timestamp is <= the spine timestamp. If the source data is sparse (some time grains have no record), that row may be days older than the spine timestamp, and its pre-computed window is anchored to the wrong point in time.

Example: A spine row requests features as of Jan 15, but the nearest pre-computed row in the DT has a timestamp of Jan 10 (no source activity between Jan 10-15). The returned “7-day sum” covers Jan 3-10, not the expected Jan 8-15.

This is an inherent limitation of pre-computing fixed windows and retrieving them via ASOF: the window boundaries are baked into the row at materialization time and cannot shift to match an arbitrary query timestamp. For time-windowed aggregations, use the Feature Aggregation API (Section 1.5.3.1) instead. Its tiling mechanism stores partial aggregates per time grain and reassembles the correct window from the spine timestamp backwards at retrieval time – regardless of source sparsity. See Chapter 5: Temporal Aggregation Pipelines for the detailed motivation and Chapter 7 for API reference.
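
The difference between the two retrieval strategies can be reproduced in a few lines of plain Python. This is a toy model under stated assumptions: the order history is made up, the helpers (`window_sum`, `asof_lookup`, `tiled_sum`) are hypothetical, and daily tiles stand in for whatever granularity the real tiling mechanism uses. It is not the Feature Store's implementation.

```python
from bisect import bisect_right
from datetime import date, timedelta

WINDOW = timedelta(days=7)

# Hypothetical sparse order history for one user: no activity after Jan 10.
orders = [(date(2025, 1, 4), 10.0), (date(2025, 1, 9), 20.0), (date(2025, 1, 10), 30.0)]

def window_sum(rows, end):
    """Sum amounts with end - 7 days <= ts <= end (RANGE ... PRECEDING semantics)."""
    return sum(amt for ts, amt in rows if end - WINDOW <= ts <= end)

# Materialization time: one pre-computed row per source row,
# each window anchored to that row's own timestamp.
precomputed = [(ts, window_sum(orders, ts)) for ts, _ in orders]

def asof_lookup(rows, spine_ts):
    """Return the value of the latest row with ts <= spine_ts, or None."""
    idx = bisect_right([ts for ts, _ in rows], spine_ts)
    return rows[idx - 1][1] if idx else None

# Tiling: keep one partial aggregate per day and assemble the window at retrieval time.
daily_tiles = {}
for ts, amt in orders:
    daily_tiles[ts] = daily_tiles.get(ts, 0.0) + amt

def tiled_sum(tiles, spine_ts):
    """Re-anchor the 7-day window to the spine timestamp."""
    return sum(v for day, v in tiles.items() if spine_ts - WINDOW <= day <= spine_ts)

spine_ts = date(2025, 1, 15)
stale_value = asof_lookup(precomputed, spine_ts)  # window anchored to Jan 10
tiled_value = tiled_sum(daily_tiles, spine_ts)    # window anchored to Jan 15

print(stale_value, tiled_value)
```

The ASOF lookup returns the Jan 10 row, whose sum still includes the Jan 4 order; the tiled computation drops it because Jan 4 falls outside the window ending on Jan 15.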

1.5.3.1 Method 3: Feature Aggregation Class (Time-Windowed)

For time-windowed aggregations, the Feature class provides a declarative API:

📁 Full code: _code/feature_aggregation_api.py

from snowflake.ml.feature_store import Feature

# Define time-windowed aggregated features
features = [
    Feature.sum("TOTAL_AMT", "7d").alias("SPEND_SUM_7D"),
    Feature.count("ORDER_ID", "30d").alias("ORDER_CNT_30D"),
    Feature.avg("TOTAL_AMT", "24h").alias("ORDER_VALUE_AVG_24H"),
    Feature.last_n("PRODUCT_ID", "7d", n=5).alias("RECENT_PRODUCTS"),
]

# Use in a tiled Feature View for efficient computation
tiled_fv = FeatureView(
    name="USER_ORDER_AGGREGATES",
    entities=[user_entity],
    feature_df=session.table("ORDERS"),
    timestamp_col="ORDER_TS",
    refresh_freq="1h",
    feature_granularity="1h",
    features=features,
)

🆕 New in snowflake-ml 1.21+

The Feature aggregation class provides declarative syntax for time-windowed aggregations with automatic tiling for efficient incremental computation. See Chapter 7: Aggregations API for the complete API reference.

1.5.3.2 Choosing the Right Method

Method	Best For	Complexity
DataFrame API	Simple aggregations, joins, filtering	Low
SQL via session.sql()	Complex queries, window functions, reuse of existing SQL	Medium
Feature class	Time-windowed aggregations, sliding windows	Advanced

1.5.4 Feature Slice

A Feature Slice identifies specific feature columns within a Feature View that you want to retrieve. This allows selective feature retrieval rather than fetching all columns.

# Get specific features from a Feature View
# (column names must exist in the Feature View's feature_df)
user_slice = user_features_fv.slice(["SPEND_SUM", "ORDER_CNT"])

# Use the slice in dataset generation
dataset = fs.generate_dataset(
    name="USER_PURCHASE_SLICE_DATASET",  # illustrative dataset name
    spine_df=training_spine,
    features=[user_slice],  # Only retrieves sliced features
)

1.6 The Spine: Connecting Features to ML

The Spine is the base DataFrame that defines which entities need features and when. Its structure differs slightly between training and inference:

Use Case	Required Columns	Optional Columns
Training	Entity keys, Timestamp	Label, additional context columns
Batch Inference	Entity keys, Timestamp	Additional context columns
Online Inference	Entity keys only	—

The spine can carry any additional columns alongside the entity keys and timestamp – labels, context features, or columns sourced from tables that are not managed by the Feature Store. These columns pass through unchanged into the generated dataset.

Prefer routing all features through the Feature Store

If you find yourself adding feature columns directly to the spine from external tables, consider instead creating a view-based Feature View that references those external tables. This ensures all features – whether internally computed or externally managed – are discoverable, governed, and retrievable through the Feature Store. The view adds no storage or refresh cost; it simply wraps an existing table as a Feature View.

# External table with features not yet in the Feature Store
external_features_df = session.table("ANALYTICS.FEATURES.CREDIT_SCORES")

credit_fv = FeatureView(
    name="CREDIT_SCORES_EXTERNAL",
    entities=[user_entity],
    feature_df=external_features_df,
    timestamp_col="SCORE_TS",
    desc="Credit scores - maintained by risk team, registered for FS discovery"
)

1.6.0.1 Training Spine

For training, the spine defines the historical points where you want to retrieve features, along with the target variable (label) you’re predicting:

📁 Full code: _code/spine_examples.py

# Training spine: includes label for supervised learning
training_spine = session.sql("""
    SELECT 
        s.USER_ID,                         -- Entity key
        s.SESSION_START_TS AS EVENT_TS,    -- Point-in-time timestamp
        s.IS_CONVERTED AS LABEL            -- Target: did the session convert?
    FROM SESSIONS s
    WHERE s.USER_ID IS NOT NULL
""")

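# Generate training dataset with features joined to spine
# (session_features_fv stands for a second registered Feature View)
training_set = fs.generate_dataset(
    name="SESSION_CONVERSION_TRAINING",  # illustrative dataset name
    spine_df=training_spine,
    features=[user_features_fv, session_features_fv],
    spine_timestamp_col="EVENT_TS",
    spine_label_cols=["LABEL"],
)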

1.6.0.2 Batch Inference Spine

For batch inference, you only need entity keys and timestamps—there’s no label because that’s what the model will predict:

# Inference spine: no label column
inference_spine = session.sql("""
    SELECT 
        USER_ID,                           -- Entity key
        CURRENT_TIMESTAMP() AS EVENT_TS    -- Features as of now
    FROM USERS
    WHERE SUBSCRIPTION_STATUS != 'none'
""")

# Retrieve features for prediction
inference_data = fs.generate_dataset(
    name="USER_BATCH_INFERENCE",  # illustrative dataset name
    spine_df=inference_spine,
    features=[user_features_fv, session_features_fv],
    spine_timestamp_col="EVENT_TS",
)

1.6.0.3 Online Inference

For real-time serving, features are retrieved from Online Feature Tables using entity keys only—no timestamp is needed since OFTs store only the current (latest) feature values:

# Online serving: entity keys only, retrieves current feature values
features = fs.retrieve_feature_values(
    spine_df=session.create_dataframe([{"USER_ID": "usr_001"}]),
    features=[user_features_fv],
)

1.6.1 How Spine Works

flowchart TB
  SP[Spine DataFrame USER_ID + EVENT_TS]
  FV1[USER_PURCHASE_FV features]
  FV2[USER_SESSION_FV features]
  SP --> FV1
  SP --> FV2
  RES[Result: features as-of EVENT_TS per USER_ID]
  FV1 --> RES
  FV2 --> RES

Spine-based feature retrieval: spine joined to multiple Feature Views

The Feature Store joins each spine row to the appropriate Feature Views, ensuring features are computed using only data available at the EVENT_TS timestamp. This spine-based ASOF retrieval applies to training and batch inference only. For online (real-time) inference, features are served via direct key-based lookup against Online Feature Tables – there is no spine and no ASOF join.
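
The spine-based retrieval described above can be approximated in plain Python as a per-row ASOF lookup against each Feature View. The in-memory dictionaries, the `SESSION_CNT` feature, and the helpers `asof_row` and `join_spine` are hypothetical stand-ins used only to illustrate the join semantics; the Feature Store performs the equivalent join inside Snowflake.

```python
from bisect import bisect_right
from datetime import datetime

# Hypothetical Feature View contents: {USER_ID: [(feature_ts, features), ...]}, sorted by ts.
purchase_fv = {"usr_001": [(datetime(2025, 1, 10), {"ORDER_CNT": 1}),
                           (datetime(2025, 1, 12), {"ORDER_CNT": 2})]}
session_fv = {"usr_001": [(datetime(2025, 1, 11), {"SESSION_CNT": 4})]}

def asof_row(history, ts):
    """Return the features of the latest row with feature_ts <= ts, or {} if none."""
    idx = bisect_right([row_ts for row_ts, _ in history], ts)
    return history[idx - 1][1] if idx else {}

def join_spine(spine, feature_views):
    """Attach as-of features from every feature view to each spine row."""
    result = []
    for row in spine:
        out = dict(row)  # spine columns (keys, timestamp, label) pass through
        for fv in feature_views:
            out.update(asof_row(fv.get(row["USER_ID"], []), row["EVENT_TS"]))
        result.append(out)
    return result

spine = [
    {"USER_ID": "usr_001", "EVENT_TS": datetime(2025, 1, 11, 12), "LABEL": 1},
    {"USER_ID": "usr_001", "EVENT_TS": datetime(2025, 1, 9), "LABEL": 0},
]
rows = join_spine(spine, [purchase_fv, session_fv])

for r in rows:
    print(r)
```

The first spine row picks up the Jan 10 purchase row rather than the Jan 12 one, because Jan 12 lies after its `EVENT_TS`. The second spine row precedes all feature rows, so this sketch adds no feature columns to it; a SQL ASOF join would return NULLs in that case.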

📖 Deep Dive

See Chapter 11: Training & Inference for spine design patterns and best practices.

1.7 Feature Store Architecture

Snowflake Feature Store leverages native Snowflake capabilities:

flowchart TB
  SD[Source Data] --> SFS[Feature Store Schema]
  subgraph SFS
    FV[Feature Views DT + View]
    MD[Metadata Tags]
  end
  FV --> T[Training Datasets]
  FV --> B[Batch Inference]
  FV --> O[Online Serving OFT]
  MD -.-> FV

Snowflake Feature Store architecture overview

1.7.1 Physical Implementation

Logical Concept	Snowflake Object	Purpose
Feature Store	Schema + Tags	Container for all Feature Store objects
Entity	Tag	Metadata defining join keys
Feature View (materialized)	Dynamic Table	Pre-computed features with full history
Feature View (query-time)	View	On-demand computed features
Online Feature Table (OFT)	Online Feature Table	Low-latency serving, current values only
Dataset	Dataset	Materialized training/inference data

Online Feature Tables Store Current Values Only

Online Feature Tables store only the latest (current) value for each feature per entity—they do not retain historical feature values. For point-in-time historical retrieval (training, batch inference), use the standard Feature View backed by Dynamic Table or View.
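
Conceptually, the online view of a feature is the offline history collapsed to the latest row per entity key. The plain-Python sketch below illustrates that relationship with made-up data; `latest_per_key` is a hypothetical helper, and how Snowflake actually maintains Online Feature Tables is not modeled here.

```python
from datetime import datetime

# Hypothetical offline history: (USER_ID, feature timestamp, ORDER_CNT).
history = [
    ("usr_001", datetime(2025, 1, 10), 1),
    ("usr_001", datetime(2025, 1, 12), 2),
    ("usr_002", datetime(2025, 1, 11), 1),
]

def latest_per_key(rows):
    """Keep only the most recent value for each entity key."""
    latest = {}
    for key, ts, value in rows:
        if key not in latest or ts > latest[key][0]:
            latest[key] = (ts, value)
    return {key: value for key, (ts, value) in latest.items()}

online_values = latest_per_key(history)
print(online_values)
```

Once the history is collapsed this way, the Jan 10 value for `usr_001` can no longer be recovered, which is why point-in-time retrieval needs the offline Feature View.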

1.8 Transformation Taxonomy

Understanding where transformations should occur is critical for maintainable ML systems. We categorize transformations into three types:

Type	Full Name	Location	Stored In
MIT	Model-Independent Transformations	Feature Pipeline	Feature View
MDT	Model-Dependent Transformations	Training + Inference Pipeline	Model Registry
ODT	On-Demand Transformations	Inference Time	Not stored

1.8.1 Quick Decision Guide

flowchart TD
  Q1{Reusable across models?}
  Q1 -->|YES| MIT[MIT in Feature View]
  Q1 -->|NO| Q2{Depends on training stats?}
  Q2 -->|YES| MDT[MDT with model]
  Q2 -->|NO| Q3{Depends on request context?}
  Q3 -->|YES| ODT[ODT at inference]
  Q3 -->|NO| MIT2[Probably MIT]

MIT vs MDT vs ODT decision flowchart
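
The decision guide translates directly into a small function. The sketch below encodes the three questions in the same order as the flowchart; the function name and the example classifications in the comments are illustrative assumptions rather than rules taken from the Feature Store itself.

```python
def classify_transformation(reusable_across_models,
                            depends_on_training_stats,
                            depends_on_request_context):
    """Follow the decision flow: reusable -> MIT, training stats -> MDT, request context -> ODT."""
    if reusable_across_models:
        return "MIT"
    if depends_on_training_stats:
        return "MDT"
    if depends_on_request_context:
        return "ODT"
    return "MIT"  # the flowchart's "Probably MIT" fallback

# Illustrative examples (assumed, not from the source):
order_count_30d = classify_transformation(True, False, False)    # shared aggregate
standard_scaling = classify_transformation(False, True, False)   # needs training mean and std
distance_to_store = classify_transformation(False, False, True)  # needs request location

print(order_count_30d, standard_scaling, distance_to_store)
```

Because reusability is checked first, a transformation that is both reusable and dependent on training statistics would be classified as MIT by this sketch; such cases are worth reviewing by hand.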

📖 Deep Dive

See Transformation Taxonomy: MIT vs MDT vs ODT for detailed examples, anti-patterns, and implementation guidance.

1.9 Terminology Mapping

The table below maps Snowflake Feature Store terminology to the closest equivalents in other feature store platforms:

Snowflake	Feast	Hopsworks	Tecton	Description
Feature Store	Feature Store (registry)	Feature Store	Feature Store	Central feature repository
Entity	Entity	Entity	Entity	Business object with join keys
Feature View	Feature View	Feature Group	Feature View	Collection of related features
Dynamic Table FV	Offline Store	Offline Store	Batch Feature View	Pre-computed, stored features
View FV	On-Demand Feature View	-	On-Demand Feature View	Query-time computed features
Online Feature Table	Online Store	Online Store	Online Store	Low-latency serving
Spine	Entity DataFrame	Spine DataFrame	Spine	Request keys + timestamps
Feature Slice	Feature	Feature	Feature	Selected feature column(s) within a Feature View
`timestamp_col`	`event_timestamp`	Event Time	Timestamp Column	Point-in-time reference
`refresh_freq`	- (external orchestration)	Materialization Schedule	`batch_schedule`	Update frequency
`generate_dataset()`	`get_historical_features()`	Get Training Data	Get Dataset	Training data generation
`retrieve_feature_values()`	`get_online_features()`	Get Feature Values	Get Online Features	Feature serving

1.9.1 Key Terminology Notes

Feature View vs Feature Group: Snowflake and Feast both use “Feature View” while Hopsworks uses “Feature Group.” All represent a collection of features computed together.
Dynamic Table vs Offline Store: Both refer to pre-materialized features for batch training and inference. Feast relies on an external offline store (e.g., BigQuery, Snowflake, Redshift) whereas Snowflake manages materialization natively via Dynamic Tables.
Online Feature Table: Snowflake’s implementation uses Hybrid Tables for low-latency serving, equivalent to an “Online Store” backed by DynamoDB or Redis in Feast, or similar stores in other platforms.
Refresh scheduling: Snowflake’s refresh_freq and Tecton’s batch_schedule are built-in scheduling mechanisms. Feast does not include a built-in scheduler – feature materialization is driven by external orchestration (e.g., Airflow, cron).

1.10 Summary

Concept	Definition	Key Attributes
Feature Store	Centralized feature management layer	Schema + Tags in Snowflake
Entity	Business object with join keys	`name`, `join_keys`, `desc`
Feature View	Feature collection with transformations	`entities`, `feature_df`, `refresh_freq`
Feature	Individual computed value	Defined in DataFrame or `Feature` class
Spine	Request DataFrame	Entity keys + timestamp (+ optional label and context columns)
MIT/MDT/ODT	Transformation taxonomy	Where transformations belong

1.11 Next Steps

Continue to Chapter 2: Design & Organization to learn how to structure your Feature Store for scale and maintainability.

--- title: "Core Concepts" subtitle: "Entities, Feature Views, Features, and the Spine" --- ## Overview This chapter introduces the core concepts and terminology of Snowflake Feature Store. Understanding these fundamentals is essential before diving into implementation patterns. ## Learning Objectives After completing this chapter, you will be able to: - Define what a Feature Store is and articulate its value in ML systems - Identify and describe the key components: Entities, Feature Views, Features, and Spines - Understand how Feature Store fits into the broader ML lifecycle - Apply the correct transformation taxonomy (MIT/MDT/ODT) to your use cases - Map Snowflake terminology to industry-standard terms > 📂 **Chapter code:** [Browse companion scripts on GitHub](https://github.com/Snowflake-Labs/snowflake-featurestore-imp-guide/tree/main/Snowflake_FeatureStore_Implementation_Guide/01_concepts/_code) ```{python} #| output: false #| echo: false # Session setup (shared across chapter cells) from snowflake.snowpark import Session, Row from snowflake.snowpark import functions as F from snowflake.snowpark import types as T from snowflake.snowpark.context import get_active_session from snowflake.ml.feature_store import ( FeatureStore, FeatureView, Entity, CreationMode, ) try: session = get_active_session() except Exception: session = Session.builder.config("connection_name", "default").create() session.sql_simplifier_enabled = True SOURCE_DATABASE = "FEATURE_STORE_DEMO" FS_NAME = "FEATURE_STORE" WAREHOUSE = "FS_DEV_WH" fs = FeatureStore( session=session, database=SOURCE_DATABASE, name=FS_NAME, default_warehouse=WAREHOUSE, creation_mode=CreationMode.CREATE_IF_NOT_EXIST, ) # Create sample ORDERS table for chapter examples from datetime import datetime orders_data = [ ("usr_001", "ord_001", datetime(2025, 1, 10, 9, 0), 49.99), ("usr_001", "ord_002", datetime(2025, 1, 12, 14, 30), 129.50), ("usr_001", "ord_003", datetime(2025, 1, 15, 11, 0), 25.00), ("usr_002", "ord_004", 
datetime(2025, 1, 11, 10, 15), 89.99), ("usr_002", "ord_005", datetime(2025, 1, 14, 16, 45), 199.00), ("usr_003", "ord_006", datetime(2025, 1, 13, 8, 30), 15.50), ] orders_schema = T.StructType([ T.StructField("USER_ID", T.StringType()), T.StructField("ORDER_ID", T.StringType()), T.StructField("ORDER_TS", T.TimestampType()), T.StructField("TOTAL_AMT", T.FloatType()), ]) orders_df = session.create_dataframe(orders_data, orders_schema) ORDERS_TABLE = f"{SOURCE_DATABASE}.CLICKSTREAM_DATA.ORDERS" orders_df.write.save_as_table(ORDERS_TABLE, mode="overwrite") ``` --- ## What is a Feature Store? A **Feature Store** is a centralized repository for storing, managing, and serving features for machine learning. It acts as the data management layer between raw data sources and ML models. ```{mermaid} %%| fig-cap: "ML data infrastructure: sources, Feature Store, and ML consumers" %%| fig-alt: "Flow from data sources through Feature Store to ML consumers" flowchart LR subgraph DS[Data Sources] d1[Tables / Streams / External / APIs] end subgraph FS[Feature Store] f1[Pipelines / Storage / Serving / Metadata] end subgraph ML[ML Consumers] m1[Training / Batch / Real-time / Analytics] end DS --> FS --> ML ``` ### Features: The Building Blocks of ML A **feature** is a measurable property or characteristic used as input to an ML model. Features can be: | Feature Type | Description | Example | |--------------|-------------|---------| | **Raw** | Direct attributes from source data | `user_age`, `product_price` | | **Derived** | Calculated from raw features | `price_per_unit`, `age_bucket` | | **Aggregated** | Computed over time windows | `total_orders_30d`, `avg_session_duration_7d` | | **Encoded** | Transformed for model consumption | `category_one_hot`, `amount_scaled` | --- ## Why Use a Feature Store? Feature Stores solve critical challenges in production ML systems: ### 1. 
Feature Reusability {.unnumbered} Without a Feature Store, teams often recreate the same features independently, leading to: - Duplicated effort across data scientists - Inconsistent feature definitions - Wasted compute resources **With Feature Store**: Features are defined once, stored centrally, and reused across models and teams. ### 2. Training-Serving Skew {.unnumbered} One of the most common causes of ML model degradation is when features computed during training differ from those used during inference. **With Feature Store**: The same feature definitions serve both training and inference, eliminating skew. ### 3. Point-in-Time Correctness {.unnumbered} Using future data to train models (data leakage) produces artificially good training metrics but poor production performance. **With Feature Store**: Built-in temporal joins ensure features are computed using only data available at the prediction time, preventing data leakage. ### 4. Feature Discovery {.unnumbered} As organizations scale ML, finding existing features becomes challenging. **With Feature Store**: Searchable catalog of features with metadata, lineage, and documentation. ### 5. Governance & Compliance {.unnumbered} Tracking where features come from and how they're used is essential for regulatory compliance. **With Feature Store**: Full lineage from source data through feature transformations to model predictions. Snowflake Feature Store addresses all of these challenges by providing a centralized repository for features, a way to compute features, and a way to serve features. --- ## Key Components Snowflake Feature Store consists of four primary components: ### Entity {#sec-entity} An **Entity** represents a business object that features describe. It defines the join keys used to retrieve features. 
> 📁 **Full code:** [`_code/entity_examples.py`](_code/entity_examples.py) ```{python} # Simple entity with single key user_entity = Entity( name="USER", join_keys=["USER_ID"], desc="Registered user in the system" ) # Compound entity with multiple keys product_supplier_entity = Entity( name="PRODUCT_SUPPLIER", join_keys=["PRODUCT_ID", "SUPPLIER_ID"], desc="Product-Supplier relationship for supplier-specific features" ) # Register entities fs.register_entity(user_entity) fs.register_entity(product_supplier_entity) print(f"Registered entities: {[e['NAME'] for e in fs.list_entities().collect()]}") ``` **Key Concepts**: | Concept | Description | |---------|-------------| | `name` | Unique identifier for the entity within the Feature Store | | `join_keys` | Column(s) that uniquely identify an instance of the entity | | **Simple Key** | Single column identifier (e.g., `USER_ID`) | | **Compound Key** | Multiple columns together form the identifier (e.g., `PRODUCT_ID` + `SUPPLIER_ID`) | Entities enable features from different Feature Views to be joined together when generating training data or serving predictions. ::: {.callout-important} ## Join key names must match exactly The Feature Store does not support synonyms or aliases for join keys. The column names defined in the Entity's `join_keys` must appear **with the same name** in every Feature View's `feature_df` and in every spine DataFrame. If your source column has a different name, alias it in the SQL or Snowpark DataFrame that defines the Feature View (e.g., `SELECT CUST_KEY AS USER_ID ...` or `.with_column_renamed("CUST_KEY", "USER_ID")`). See [Chapter 3: Consistent Key Naming](../03_entities_hierarchies/index.qmd#sec-consistent-keys) for details. ::: ::: {.callout-tip} ## Standardize `timestamp_col` names across Feature Views Just as entity join keys must be named consistently, adopting a **standard timestamp column name** (e.g., `FV_TS`) across all Feature Views simplifies spine design and validation. 
Alias the source timestamp in the Feature View SQL: ```sql SELECT USER_ID, ORDER_TS AS FV_TS, -- standardized name SUM(TOTAL_AMT) AS ... FROM ... GROUP BY USER_ID, ORDER_TS ``` When every Feature View uses the same `timestamp_col` name, the spine only needs a single `spine_timestamp_col` value, `include_feature_view_timestamp_col=True` output is predictable, and leakage-validation code does not need per-FV column mappings. See [Chapter 6: timestamp_col Requirements](../06_temporal_features/index.qmd#sec-timestamp-col-requirements) for details. ::: ::: {.callout-note} ## 📖 Deep Dive See [Chapter 3: Entities & Hierarchies](../03_entities_hierarchies/index.qmd) for entity design patterns, hierarchies, and compound key strategies. ::: --- ### Feature View {#sec-featureview} A **Feature View** is a collection of related features computed from one or more source dataframe (tables, views, etc.). It defines: - Which entity/ies the features belong to - How features are computed (the transformation logic) - How features are materialized (Dynamic Table vs View) > 📁 **Full code:** [`_code/featureview_examples.py`](_code/featureview_examples.py) ```python from snowflake.ml.feature_store import FeatureView # Create a Feature View from a Snowpark DataFrame user_features_fv = FeatureView( name="USER_PURCHASE_FEATURES", entities=[user_entity], feature_df=user_purchase_df, # Snowpark DataFrame with feature logic timestamp_col="UPDATED_TS", # For point-in-time correctness refresh_freq="1 hour", # Dynamic Table refresh (omit for View) desc="User purchase behavior features" ) # Register in Feature Store user_features_fv = fs.register_feature_view( feature_view=user_features_fv, version="V01", block=True # Wait for initial materialization ) ``` **Materialization Options**: | Type | Created When | Use Case | Compute Model | |------|--------------|----------|---------------| | **Dynamic Table** | `refresh_freq` is specified | Pre-computed features, automatic refresh | Snowflake manages 
refresh | | **View** | `refresh_freq` is omitted | Query-time computation, always fresh | Compute on each query | **Key Concepts**: | Concept | Description | |---------|-------------| | `name` | Unique identifier for the Feature View | | `entities` | List of entities this Feature View provides features for | | `feature_df` | Snowpark DataFrame defining the feature transformations | | `timestamp_col` | Column used for point-in-time feature retrieval | | `refresh_freq` | How often to refresh (e.g., `"1 hour"`, `"1 day"`) | | `version` | Version string for managing Feature View evolution | ::: {.callout-note} ## 📖 Deep Dive See [Chapter 4: Feature Views](../04_feature_views/index.qmd) for detailed coverage of Feature View types, versioning, and lifecycle management. ::: --- ### Feature {#sec-feature} A **Feature** is an individual column within a Feature View. Features are defined through the `feature_df` parameter—a Snowpark DataFrame that specifies the transformation logic. Both `session.sql()` (SQL) and the Snowpark DataFrame API produce identical lazy Snowpark DataFrames. See [Chapter 4: SQL vs Snowpark DataFrame API](../04_feature_views/index.qmd#sec-sql-vs-snowpark) for a detailed comparison. 

> 📁 **Full code:** [`_code/feature_dataframe_api.py`](_code/feature_dataframe_api.py) | [`_code/feature_sql_api.py`](_code/feature_sql_api.py)

::: {.panel-tabset group="lang"}

## SQL

```python
user_purchase_df = session.sql("""
    SELECT
        USER_ID,
        COUNT(DISTINCT ORDER_ID) AS ORDER_CNT,
        SUM(TOTAL_AMT) AS SPEND_SUM,
        AVG(TOTAL_AMT) AS ORDER_VALUE_AVG,
        MAX(ORDER_TS) AS LAST_ORDER_TS,
        DATEDIFF('day', MIN(ORDER_TS), MAX(ORDER_TS)) AS CUSTOMER_TENURE_DAYS
    FROM ORDERS
    GROUP BY USER_ID
""")
```

## Snowpark DataFrame

```python
from snowflake.snowpark import functions as F

user_purchase_df = (
    session.table(ORDERS_TABLE)
    .group_by("USER_ID")
    .agg(
        F.sum("TOTAL_AMT").alias("SPEND_SUM"),
        F.count("ORDER_ID").alias("ORDER_CNT"),
        F.avg("TOTAL_AMT").alias("ORDER_VALUE_AVG"),
        F.max("ORDER_TS").alias("LAST_ORDER_TS"),
    )
)
```

:::

```{python}
#| echo: false
#| output: false
user_purchase_df = (
    session.table(ORDERS_TABLE)
    .group_by("USER_ID")
    .agg(
        F.sum("TOTAL_AMT").alias("SPEND_SUM"),
        F.count("ORDER_ID").alias("ORDER_CNT"),
        F.avg("TOTAL_AMT").alias("ORDER_VALUE_AVG"),
        F.max("ORDER_TS").alias("LAST_ORDER_TS"),
    )
)
```

Create and register a Feature View using the DataFrame above:

```{python}
user_purchase_fv = FeatureView(
    name="USER_PURCHASE_FEATURES",
    entities=[user_entity],
    feature_df=user_purchase_df,
    timestamp_col="LAST_ORDER_TS",
    refresh_freq="1 hour",
    desc="User purchase behavior features",
)

user_purchase_fv = fs.register_feature_view(
    feature_view=user_purchase_fv,
    version="V01",
    block=True,
    overwrite=True,
)

print(f"Registered: {user_purchase_fv.name}/V01")
print(f"Status: {user_purchase_fv.status}")
```

::: {.callout-warning}
## Time-windowed features over sparse data produce incorrect results at retrieval time

SQL window functions such as `RANGE BETWEEN INTERVAL '7 days' PRECEDING AND CURRENT ROW` compute correct windows **at materialization time** -- each row's 7-day window is anchored to that row's own timestamp.

The problem surfaces at **retrieval time**.
When the Feature Store performs an ASOF join against a pre-materialized Dynamic Table, it returns the most recent row whose timestamp is <= the spine timestamp. If the source data is sparse (no record for every time grain), that row may be **days older** than the spine timestamp, and its pre-computed window is anchored to the wrong point in time.

**Example:** A spine row requests features as of **Jan 15**, but the nearest pre-computed row in the DT has a timestamp of **Jan 10** (no source activity between Jan 10-15). The returned "7-day sum" covers **Jan 3-10**, not the expected **Jan 8-15**.

This is an inherent limitation of pre-computing fixed windows and retrieving them via ASOF: the window boundaries are baked into the row at materialization time and cannot shift to match an arbitrary query timestamp.

For time-windowed aggregations, use the **Feature Aggregation API** (@sec-aggregations-api) instead. Its tiling mechanism stores partial aggregates per time grain and reassembles the correct window from the spine timestamp backwards at retrieval time -- regardless of source sparsity. See [Chapter 5: Temporal Aggregation Pipelines](../05_feature_pipelines/index.qmd#sec-temporal-api) for the detailed motivation and [Chapter 7](../07_aggregations_api/index.qmd) for API reference.
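
The Jan 10 / Jan 15 example can be reproduced with a small pure-Python sketch. It does not call the Feature Store or Snowflake; the order events, the `window_sum` and `asof_lookup` helpers, and the year are hypothetical and only imitate an inclusive 7-day `RANGE` window followed by an ASOF lookup.

```python
from bisect import bisect_right
from datetime import date, timedelta

# Hypothetical sparse order events for one user: (order date, amount).
events = [
    (date(2024, 1, 2), 40.0),
    (date(2024, 1, 5), 10.0),
    (date(2024, 1, 10), 25.0),
]

WINDOW = timedelta(days=7)


def window_sum(as_of, rows=events):
    """Sum amounts whose date falls in [as_of - 7 days, as_of], both ends inclusive."""
    return sum(amt for d, amt in rows if as_of - WINDOW <= d <= as_of)


# "Materialization": one pre-computed row per source event,
# with the window anchored to that event's own date.
materialized = [(d, window_sum(d)) for d, _ in events]


def asof_lookup(spine_date):
    """Return the latest materialized row whose date is <= spine_date, or None."""
    idx = bisect_right([d for d, _ in materialized], spine_date)
    return materialized[idx - 1] if idx else None


spine_date = date(2024, 1, 15)
row_date, stale_sum = asof_lookup(spine_date)  # row anchored to Jan 10
correct_sum = window_sum(spine_date)           # window anchored to Jan 15

print(f"ASOF row {row_date}: pre-computed 7-day sum = {stale_sum}")
print(f"Window ending {spine_date}: 7-day sum = {correct_sum}")
```

In this sketch the ASOF lookup returns the Jan 10 row, whose sum (35.0) still includes the Jan 5 order, while a window anchored to Jan 15 contains only the Jan 10 order (25.0).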
:::

#### Method 3: Feature Aggregation Class (Time-Windowed) {#sec-aggregations-api}

For time-windowed aggregations, the `Feature` class provides a declarative API:

> 📁 **Full code:** [`_code/feature_aggregation_api.py`](_code/feature_aggregation_api.py)

```python
from snowflake.ml.feature_store import Feature

# Define time-windowed aggregated features
features = [
    Feature.sum("TOTAL_AMT", "7d").alias("SPEND_SUM_7D"),
    Feature.count("ORDER_ID", "30d").alias("ORDER_CNT_30D"),
    Feature.avg("TOTAL_AMT", "24h").alias("ORDER_VALUE_AVG_24H"),
    Feature.last_n("PRODUCT_ID", "7d", n=5).alias("RECENT_PRODUCTS"),
]

# Use in a tiled Feature View for efficient computation
tiled_fv = FeatureView(
    name="USER_ORDER_AGGREGATES",
    entities=[user_entity],
    feature_df=session.table("ORDERS"),
    timestamp_col="ORDER_TS",
    refresh_freq="1h",
    feature_granularity="1h",
    features=features,
)
```

::: {.callout-note}
## 🆕 New in snowflake-ml 1.21+

The `Feature` aggregation class provides declarative syntax for time-windowed aggregations with automatic tiling for efficient incremental computation. See [Chapter 7: Aggregations API](../07_aggregations_api/index.qmd) for the complete API reference.
:::

#### Choosing the Right Method

| Method | Best For | Complexity |
|--------|----------|------------|
| **DataFrame API** | Simple aggregations, joins, filtering | Low |
| **SQL via session.sql()** | Complex queries, window functions, reuse of existing SQL | Medium |
| **Feature class** | Time-windowed aggregations, sliding windows | Advanced |

---

### Feature Slice

A **Feature Slice** identifies the specific feature columns within a Feature View that you want to retrieve. This allows selective feature retrieval rather than fetching all columns.

```python
# Get specific features from a Feature View
# (assumes the Feature View exposes these columns)
user_slice = user_features_fv.slice(["SPEND_SUM_7D", "ORDER_CNT_30D"])

# Use the slice in dataset generation
dataset = fs.generate_dataset(
    name="USER_SLICE_DATASET",  # illustrative dataset name; `name` is a required argument
    spine_df=training_spine,
    features=[user_slice],      # Only retrieves sliced features
)
```

---

## The Spine: Connecting Features to ML {#sec-spine}

The **Spine** is a foundation DataFrame that defines which entities need features and when. Its structure differs slightly between training and inference:

| Use Case | Required Columns | Optional Columns |
|----------|------------------|------------------|
| **Training** | Entity keys, Timestamp | Label, additional context columns |
| **Batch Inference** | Entity keys, Timestamp | Additional context columns |
| **Online Inference** | Entity keys only | None |

The spine can carry any additional columns alongside the entity keys and timestamp -- labels, context features, or columns sourced from tables that are not managed by the Feature Store. These columns pass through unchanged into the generated dataset.

::: {.callout-tip}
## Prefer routing all features through the Feature Store

If you find yourself adding feature columns directly to the spine from external tables, consider instead creating a **view-based Feature View** that references those external tables. This ensures all features -- whether internally computed or externally managed -- are discoverable, governed, and retrievable through the Feature Store. The view adds no storage or refresh cost; it simply wraps an existing table as a Feature View.

```python
# External table with features not yet in the Feature Store
external_features_df = session.table("ANALYTICS.FEATURES.CREDIT_SCORES")

credit_fv = FeatureView(
    name="CREDIT_SCORES_EXTERNAL",
    entities=[user_entity],
    feature_df=external_features_df,
    timestamp_col="SCORE_TS",
    desc="Credit scores - maintained by risk team, registered for FS discovery"
)
```
:::

#### Training Spine

For training, the spine defines the historical points where you want to retrieve features, along with the target variable (label) you're predicting:

> 📁 **Full code:** [`_code/spine_examples.py`](_code/spine_examples.py)

```python
# Training spine: includes label for supervised learning
training_spine = session.sql("""
    SELECT
        s.USER_ID,                       -- Entity key
        s.SESSION_START_TS AS EVENT_TS,  -- Point-in-time timestamp
        s.IS_CONVERTED AS LABEL          -- Target: did the session convert?
    FROM SESSIONS s
    WHERE s.USER_ID IS NOT NULL
""")

# Generate training dataset with features joined to spine
training_set = fs.generate_dataset(
    name="USER_CONVERSION_TRAINING",  # illustrative dataset name; `name` is a required argument
    spine_df=training_spine,
    features=[user_features_fv, session_features_fv],
    spine_timestamp_col="EVENT_TS",
)
```

#### Batch Inference Spine

For batch inference, you only need entity keys and timestamps. There is no label because that is what the model will predict:

```python
# Inference spine: no label column
inference_spine = session.sql("""
    SELECT
        USER_ID,                         -- Entity key
        CURRENT_TIMESTAMP() AS EVENT_TS  -- Features as of now
    FROM USERS
    WHERE SUBSCRIPTION_STATUS != 'none'
""")

# Retrieve features for prediction
inference_data = fs.generate_dataset(
    name="USER_CONVERSION_INFERENCE",  # illustrative dataset name; `name` is a required argument
    spine_df=inference_spine,
    features=[user_features_fv, session_features_fv],
    spine_timestamp_col="EVENT_TS",
)
```

#### Online Inference

For real-time serving, features are retrieved from Online Feature Tables using entity keys only. No timestamp is needed because OFTs store only the current (latest) feature values:

```python
# Online serving: entity keys only, retrieves current feature values
features = fs.retrieve_feature_values(
    spine_df=session.create_dataframe([{"USER_ID": "usr_001"}]),
    features=[user_features_fv],
)
```

### How Spine Works

```{mermaid}
%%| fig-cap: "Spine-based feature retrieval: spine joined to multiple Feature Views"
%%| fig-alt: "Spine dataframe joined to purchase and session feature views"
flowchart TB
  SP[Spine DataFrame USER_ID + EVENT_TS]
  FV1[USER_PURCHASE_FV features]
  FV2[USER_SESSION_FV features]
  SP --> FV1
  SP --> FV2
  RES[Result: features as-of EVENT_TS per USER_ID]
  FV1 --> RES
  FV2 --> RES
```

The Feature Store joins each spine row to the appropriate Feature Views, so that each row receives only feature values that were available at or before its `EVENT_TS` timestamp.

This spine-based ASOF retrieval applies to **training and batch inference** only. For **online (real-time) inference**, features are served via direct key-based lookup against [Online Feature Tables](../08_online_features/index.qmd) -- there is no spine and no ASOF join.

::: {.callout-note}
## 📖 Deep Dive

See [Chapter 11: Training & Inference](../11_training_inference/index.qmd) for spine design patterns and best practices.
:::

---

## Feature Store Architecture {#sec-architecture}

Snowflake Feature Store leverages native Snowflake capabilities:

```{mermaid}
%%| fig-cap: "Snowflake Feature Store architecture overview"
%%| fig-alt: "Source data flows into feature views and metadata, then to training, batch, and online serving"
flowchart TB
  SD[Source Data] --> SFS[Feature Store Schema]
  subgraph SFS
    FV[Feature Views DT + View]
    MD[Metadata Tags]
  end
  FV --> T[Training Datasets]
  FV --> B[Batch Inference]
  FV --> O[Online Serving OFT]
  MD -.-> FV
```

### Physical Implementation

| Logical Concept | Snowflake Object | Purpose |
|-----------------|------------------|---------|
| Feature Store | Schema + Tags | Container for all Feature Store objects |
| Entity | Tag | Metadata defining join keys |
| Feature View (materialized) | Dynamic Table | Pre-computed features with full history |
| Feature View (query-time) | View | On-demand computed features |
| Online Feature Table (OFT) | Online Feature Table | Low-latency serving, **current values only** |
| Dataset | Dataset | Materialized training/inference data |

::: {.callout-important}
## Online Feature Tables Store Current Values Only

Online Feature Tables store only the **latest (current) value** for each feature per entity; they do not retain historical feature values. For point-in-time historical retrieval (training, batch inference), use the standard Feature View backed by Dynamic Table or View.
:::

---

## Transformation Taxonomy {#sec-taxonomy}

Understanding where transformations should occur is critical for maintainable ML systems.
We categorize transformations into three types:

| Type | Full Name | Location | Stored In |
|------|-----------|----------|-----------|
| **MIT** | Model-Independent Transformations | Feature Pipeline | Feature View |
| **MDT** | Model-Dependent Transformations | Training + Inference Pipeline | Model Registry |
| **ODT** | On-Demand Transformations | Inference Time | Not stored |

### Quick Decision Guide

```{mermaid}
%%| fig-cap: "MIT vs MDT vs ODT decision flowchart"
%%| fig-alt: "Decision tree for where to store transformations"
flowchart TD
  Q1{Reusable across models?}
  Q1 -->|YES| MIT[MIT in Feature View]
  Q1 -->|NO| Q2{Depends on training stats?}
  Q2 -->|YES| MDT[MDT with model]
  Q2 -->|NO| Q3{Depends on request context?}
  Q3 -->|YES| ODT[ODT at inference]
  Q3 -->|NO| MIT2[Probably MIT]
```

::: {.callout-note}
## 📖 Deep Dive

See [Transformation Taxonomy: MIT vs MDT vs ODT](./transformation_taxonomy.md) for detailed examples, anti-patterns, and implementation guidance.
:::

---

## Terminology Mapping

This section maps Snowflake Feature Store terminology to other platforms and to common industry terms.
Here's how terms map across platforms:

| Snowflake | Feast | Hopsworks | Tecton | Description |
|-----------|-------|-----------|--------|-------------|
| **Feature Store** | Feature Store (registry) | Feature Store | Feature Store | Central feature repository |
| **Entity** | Entity | Entity | Entity | Business object with join keys |
| **Feature View** | Feature View | Feature Group | Feature View | Collection of related features |
| **Dynamic Table FV** | Offline Store | Offline Store | Batch Feature View | Pre-computed, stored features |
| **View FV** | On-Demand Feature View | - | On-Demand Feature View | Query-time computed features |
| **Online Feature Table** | Online Store | Online Store | Online Store | Low-latency serving |
| **Spine** | Entity DataFrame | Spine DataFrame | Spine | Request keys + timestamps |
| **Feature Slice** | Feature | Feature | Feature | Individual feature column |
| `timestamp_col` | `event_timestamp` | Event Time | Timestamp Column | Point-in-time reference |
| `refresh_freq` | - (external orchestration) | Materialization Schedule | `batch_schedule` | Update frequency |
| `generate_dataset()` | `get_historical_features()` | Get Training Data | Get Dataset | Training data generation |
| `retrieve_feature_values()` | `get_online_features()` | Get Feature Values | Get Online Features | Feature serving |

### Key Terminology Notes

1. **Feature View vs Feature Group**: Snowflake and Feast both use "Feature View" while Hopsworks uses "Feature Group." All represent a collection of features computed together.

2. **Dynamic Table vs Offline Store**: Both refer to pre-materialized features for batch training and inference. Feast relies on an external offline store (e.g., BigQuery, Snowflake, Redshift) whereas Snowflake manages materialization natively via Dynamic Tables.

3. **Online Feature Table**: Snowflake's implementation uses Hybrid Tables for low-latency serving, equivalent to an "Online Store" backed by DynamoDB or Redis in Feast, or similar stores in other platforms.

4. **Refresh scheduling**: Snowflake's `refresh_freq` and Tecton's `batch_schedule` are built-in scheduling mechanisms. Feast does not include a built-in scheduler -- feature materialization is driven by external orchestration (e.g., Airflow, cron).

---

## Summary

| Concept | Definition | Key Attributes |
|---------|------------|----------------|
| **Feature Store** | Centralized feature management layer | Schema + Tags in Snowflake |
| **Entity** | Business object with join keys | `name`, `join_keys`, `desc` |
| **Feature View** | Feature collection with transformations | `entities`, `feature_df`, `refresh_freq` |
| **Feature** | Individual computed value | Defined in DataFrame or `Feature` class |
| **Spine** | Request DataFrame | Entity keys + timestamp + labels |
| **MIT/MDT/ODT** | Transformation taxonomy | Where transformations belong |

---

## Next Steps

Continue to [Chapter 2: Design & Organization](../02_design_organization/index.qmd) to learn how to structure your Feature Store for scale and maintainability.