flowchart LR
subgraph DS[Data Sources]
d1[Tables / Streams / External / APIs]
end
subgraph FS[Feature Store]
f1[Pipelines / Storage / Serving / Metadata]
end
subgraph ML[ML Consumers]
m1[Training / Batch / Real-time / Analytics]
end
DS --> FS --> ML
1 Core Concepts
Entities, Feature Views, Features, and the Spine
snowflake, feature store, ml, machine learning, mlops
1.1 Overview
This chapter introduces the core concepts and terminology of Snowflake Feature Store. Understanding these fundamentals is essential before diving into implementation patterns.
1.2 Learning Objectives
After completing this chapter, you will be able to:
- Define what a Feature Store is and articulate its value in ML systems
- Identify and describe the key components: Entities, Feature Views, Features, and Spines
- Understand how Feature Store fits into the broader ML lifecycle
- Apply the correct transformation taxonomy (MIT/MDT/ODT) to your use cases
- Map Snowflake terminology to industry-standard terms
📂 Chapter code: Browse companion scripts on GitHub
1.3 What is a Feature Store?
A Feature Store is a centralized repository for storing, managing, and serving features for machine learning. It acts as the data management layer between raw data sources and ML models.
1.3.1 Features: The Building Blocks of ML
A feature is a measurable property or characteristic used as input to an ML model. Features can be:
| Feature Type | Description | Example |
|---|---|---|
| Raw | Direct attributes from source data | user_age, product_price |
| Derived | Calculated from raw features | price_per_unit, age_bucket |
| Aggregated | Computed over time windows | total_orders_30d, avg_session_duration_7d |
| Encoded | Transformed for model consumption | category_one_hot, amount_scaled |
1.4 Why Use a Feature Store?
Feature Stores solve critical challenges in production ML systems:
1. Feature Reusability
Without a Feature Store, teams often recreate the same features independently, leading to:
- Duplicated effort across data scientists
- Inconsistent feature definitions
- Wasted compute resources
With Feature Store: Features are defined once, stored centrally, and reused across models and teams.
2. Training-Serving Skew
One of the most common causes of ML model degradation is when features computed during training differ from those used during inference.
With Feature Store: The same feature definitions serve both training and inference, eliminating skew.
3. Point-in-Time Correctness
Using future data to train models (data leakage) produces artificially good training metrics but poor production performance.
With Feature Store: Built-in temporal joins ensure features are computed using only data available at the prediction time, preventing data leakage.
4. Feature Discovery
As organizations scale ML, finding existing features becomes challenging.
With Feature Store: Searchable catalog of features with metadata, lineage, and documentation.
5. Governance & Compliance
Tracking where features come from and how they’re used is essential for regulatory compliance.
With Feature Store: Full lineage from source data through feature transformations to model predictions.
Snowflake Feature Store addresses all of these challenges by providing a centralized repository for features, a way to compute features, and a way to serve features.
1.5 Key Components
Snowflake Feature Store consists of four primary components:
1.5.1 Entity
An Entity represents a business object that features describe. It defines the join keys used to retrieve features.
📁 Full code:
_code/entity_examples.py
from snowflake.ml.feature_store import Entity

# Simple entity with a single join key
user_entity = Entity(
    name="USER",
    join_keys=["USER_ID"],
    desc="Registered user in the system",
)

# Compound entity with multiple join keys
product_supplier_entity = Entity(
    name="PRODUCT_SUPPLIER",
    join_keys=["PRODUCT_ID", "SUPPLIER_ID"],
    desc="Product-Supplier relationship for supplier-specific features",
)

# Register entities (fs is an existing FeatureStore instance)
fs.register_entity(user_entity)
fs.register_entity(product_supplier_entity)
print(f"Registered entities: {[e['NAME'] for e in fs.list_entities().collect()]}")
Key Concepts:
| Concept | Description |
|---|---|
| name | Unique identifier for the entity within the Feature Store |
| join_keys | Column(s) that uniquely identify an instance of the entity |
| Simple Key | Single column identifier (e.g., USER_ID) |
| Compound Key | Multiple columns together form the identifier (e.g., PRODUCT_ID + SUPPLIER_ID) |
Entities enable features from different Feature Views to be joined together when generating training data or serving predictions.
The Feature Store does not support synonyms or aliases for join keys. The column names defined in the Entity’s join_keys must appear with the same name in every Feature View’s feature_df and in every spine DataFrame. If your source column has a different name, alias it in the SQL or Snowpark DataFrame that defines the Feature View (e.g., SELECT CUST_KEY AS USER_ID ... or .with_column_renamed("CUST_KEY", "USER_ID")). See Chapter 3: Consistent Key Naming for details.
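A small pre-flight check can surface naming mismatches before a Feature View or spine is used. The sketch below is plain Python and not part of the Feature Store API: `missing_join_keys` is a hypothetical helper, the column lists stand in for what `feature_df.columns` or a spine's columns might contain, and the case-insensitive comparison assumes unquoted Snowflake identifiers.

```python
def missing_join_keys(join_keys, columns):
    """Return the join keys that do not appear (case-insensitively) in columns."""
    available = {c.upper() for c in columns}
    return [k for k in join_keys if k.upper() not in available]


# Hypothetical column lists standing in for DataFrame.columns
entity_join_keys = ["USER_ID"]
raw_columns = ["CUST_KEY", "ORDER_TS", "TOTAL_AMT"]     # source uses a different key name
aliased_columns = ["USER_ID", "ORDER_TS", "TOTAL_AMT"]  # after CUST_KEY AS USER_ID

missing_before = missing_join_keys(entity_join_keys, raw_columns)
missing_after = missing_join_keys(entity_join_keys, aliased_columns)
print(missing_before)
print(missing_after)
```

Running a check like this against every Feature View DataFrame and spine keeps the aliasing rule enforceable rather than a convention people have to remember.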
timestamp_col names across Feature Views
Just as entity join keys must be named consistently, adopting a standard timestamp column name (e.g., FV_TS) across all Feature Views simplifies spine design and validation. Alias the source timestamp in the Feature View SQL:
SELECT USER_ID,
       ORDER_TS AS FV_TS, -- standardized name
       SUM(TOTAL_AMT) AS ...
FROM ...
GROUP BY USER_ID, ORDER_TS
When every Feature View uses the same timestamp_col name, the spine only needs a single spine_timestamp_col value, include_feature_view_timestamp_col=True output is predictable, and leakage-validation code does not need per-FV column mappings. See Chapter 6: timestamp_col Requirements for details.
See Chapter 3: Entities & Hierarchies for entity design patterns, hierarchies, and compound key strategies.
1.5.2 Feature View
A Feature View is a collection of related features computed from one or more source DataFrames (tables, views, etc.). It defines:
- Which entity or entities the features belong to
- How features are computed (the transformation logic)
- How features are materialized (Dynamic Table vs View)
📁 Full code:
_code/featureview_examples.py
from snowflake.ml.feature_store import FeatureView

# Create a Feature View from a Snowpark DataFrame
user_features_fv = FeatureView(
    name="USER_PURCHASE_FEATURES",
    entities=[user_entity],
    feature_df=user_purchase_df,  # Snowpark DataFrame with feature logic
    timestamp_col="UPDATED_TS",   # For point-in-time correctness
    refresh_freq="1 hour",        # Dynamic Table refresh (omit for View)
    desc="User purchase behavior features",
)

# Register in Feature Store
user_features_fv = fs.register_feature_view(
    feature_view=user_features_fv,
    version="V01",
    block=True,  # Wait for initial materialization
)
Materialization Options:
| Type | Created When | Use Case | Compute Model |
|---|---|---|---|
| Dynamic Table | refresh_freq is specified | Pre-computed features, automatic refresh | Snowflake manages refresh |
| View | refresh_freq is omitted | Query-time computation, always fresh | Compute on each query |
Key Concepts:
| Concept | Description |
|---|---|
| name | Unique identifier for the Feature View |
| entities | List of entities this Feature View provides features for |
| feature_df | Snowpark DataFrame defining the feature transformations |
| timestamp_col | Column used for point-in-time feature retrieval |
| refresh_freq | How often to refresh (e.g., "1 hour", "1 day") |
| version | Version string for managing Feature View evolution |
See Chapter 4: Feature Views for detailed coverage of Feature View types, versioning, and lifecycle management.
1.5.3 Feature
A Feature is an individual column within a Feature View. Features are defined through the feature_df parameter—a Snowpark DataFrame that specifies the transformation logic.
Both session.sql() (SQL) and the Snowpark DataFrame API produce the same kind of object: a lazily evaluated Snowpark DataFrame. Either can therefore be passed as feature_df. See Chapter 4: SQL vs Snowpark DataFrame API for a detailed comparison.
📁 Full code:
_code/feature_dataframe_api.py|_code/feature_sql_api.py
Create and register a Feature View using the DataFrame above:
user_purchase_fv = FeatureView(
    name="USER_PURCHASE_FEATURES",
    entities=[user_entity],
    feature_df=user_purchase_df,
    timestamp_col="LAST_ORDER_TS",
    refresh_freq="1 hour",
    desc="User purchase behavior features",
)

user_purchase_fv = fs.register_feature_view(
    feature_view=user_purchase_fv,
    version="V01",
    block=True,
    overwrite=True,
)
print(f"Registered: {user_purchase_fv.name}/V01")
print(f"Status: {user_purchase_fv.status}")
SQL window functions such as RANGE BETWEEN INTERVAL '7 days' PRECEDING AND CURRENT ROW compute correct windows at materialization time – each row’s 7-day window is anchored to that row’s own timestamp. The problem surfaces at retrieval time. When the Feature Store performs an ASOF join against a pre-materialized Dynamic Table, it returns the most recent row whose timestamp is <= the spine timestamp. If the source data is sparse (not a record for every time grain), that row may be days older than the spine timestamp, and its pre-computed window is anchored to the wrong point in time.
Example: A spine row requests features as of Jan 15, but the nearest pre-computed row in the DT has a timestamp of Jan 10 (no source activity between Jan 10-15). The returned “7-day sum” covers Jan 3-10, not the expected Jan 8-15.
This is an inherent limitation of pre-computing fixed windows and retrieving them via ASOF: the window boundaries are baked into the row at materialization time and cannot shift to match an arbitrary query timestamp. For time-windowed aggregations, use the Feature Aggregation API (Section 1.5.3.1) instead. Its tiling mechanism stores partial aggregates per time grain and reassembles the correct window from the spine timestamp backwards at retrieval time – regardless of source sparsity. See Chapter 5: Temporal Aggregation Pipelines for the detailed motivation and Chapter 7 for API reference.
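The Jan 10 / Jan 15 example can be reproduced in a few lines of plain Python. This is a simplified illustration of the idea, not how Snowflake implements ASOF joins or tiling: the event data, the year, the daily tile size, and the helper names (`window_sum`, `asof_lookup`, `tiled_sum`) are all assumptions made for the sketch.

```python
from bisect import bisect_right
from datetime import date, timedelta

WINDOW = timedelta(days=7)

# Sparse source events for one user: (event date, order amount).
events = [
    (date(2024, 1, 2), 5.0),
    (date(2024, 1, 4), 20.0),
    (date(2024, 1, 7), 30.0),
    (date(2024, 1, 9), 40.0),
    (date(2024, 1, 10), 10.0),
]  # no activity between Jan 11 and Jan 15


def window_sum(rows, anchor):
    """Sum amounts whose date falls in [anchor - 7 days, anchor]."""
    return sum(amt for day, amt in rows if anchor - WINDOW <= day <= anchor)


# 1) Pre-computed approach: one row per event, window anchored to that row.
precomputed = [(day, window_sum(events, day)) for day, _ in events]


def asof_lookup(rows, spine_ts):
    """Return the latest pre-computed row with timestamp <= spine_ts."""
    idx = bisect_right([day for day, _ in rows], spine_ts)
    return rows[idx - 1] if idx else None


# 2) Tile-based approach: store one partial sum per day, combine at retrieval.
tiles = {}
for day, amt in events:
    tiles[day] = tiles.get(day, 0.0) + amt


def tiled_sum(tile_map, spine_ts):
    """Reassemble the window backwards from the spine timestamp."""
    return sum(v for day, v in tile_map.items() if spine_ts - WINDOW <= day <= spine_ts)


spine_ts = date(2024, 1, 15)
stale_row = asof_lookup(precomputed, spine_ts)   # row anchored at Jan 10
stale_sum = stale_row[1]                         # covers Jan 3-10
correct_sum = tiled_sum(tiles, spine_ts)         # covers Jan 8-15
print(stale_row, correct_sum)
```

The pre-computed lookup returns the Jan 10 row, whose sum includes the Jan 4 and Jan 7 orders that fall outside the window the spine actually asked for. The tile-based sum only includes Jan 9 and Jan 10.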
1.5.3.1 Method 3: Feature Aggregation Class (Time-Windowed)
For time-windowed aggregations, the Feature class provides a declarative API:
📁 Full code:
_code/feature_aggregation_api.py
from snowflake.ml.feature_store import Feature

# Define time-windowed aggregated features
features = [
    Feature.sum("TOTAL_AMT", "7d").alias("SPEND_SUM_7D"),
    Feature.count("ORDER_ID", "30d").alias("ORDER_CNT_30D"),
    Feature.avg("TOTAL_AMT", "24h").alias("ORDER_VALUE_AVG_24H"),
    Feature.last_n("PRODUCT_ID", "7d", n=5).alias("RECENT_PRODUCTS"),
]

# Use in a tiled Feature View for efficient computation
tiled_fv = FeatureView(
    name="USER_ORDER_AGGREGATES",
    entities=[user_entity],
    feature_df=session.table("ORDERS"),
    timestamp_col="ORDER_TS",
    refresh_freq="1h",
    feature_granularity="1h",
    features=features,
)
The Feature aggregation class provides declarative syntax for time-windowed aggregations with automatic tiling for efficient incremental computation. See Chapter 7: Aggregations API for the complete API reference.
1.5.3.2 Choosing the Right Method
| Method | Best For | Complexity |
|---|---|---|
| DataFrame API | Simple aggregations, joins, filtering | Low |
| SQL via session.sql() | Complex queries, window functions, prefer existing SQL | Medium |
| Feature class | Time-windowed aggregations, sliding windows | Advanced |
1.5.4 Feature Slice
A Feature Slice identifies specific feature columns within a Feature View that you want to retrieve. This allows selective feature retrieval rather than fetching all columns.
# Get specific features from a Feature View
user_slice = user_features_fv.slice(["SPEND_SUM_7D", "ORDER_CNT_30D"])

# Use the slice in dataset generation
dataset = fs.generate_dataset(
    name="USER_SLICE_DATASET",  # illustrative name; generate_dataset requires one
    spine_df=training_spine,
    features=[user_slice],  # Only retrieves sliced features
)
1.6 The Spine: Connecting Features to ML
The Spine is a foundation dataframe that defines which entities need features and when. Its structure differs slightly between training and inference:
| Use Case | Required Columns | Optional Columns |
|---|---|---|
| Training | Entity keys, Timestamp | Label, additional context columns |
| Batch Inference | Entity keys, Timestamp | Additional context columns |
| Online Inference | Entity keys only | — |
The spine can carry any additional columns alongside the entity keys and timestamp – labels, context features, or columns sourced from tables that are not managed by the Feature Store. These columns pass through unchanged into the generated dataset.
If you find yourself adding feature columns directly to the spine from external tables, consider instead creating a view-based Feature View that references those external tables. This ensures all features – whether internally computed or externally managed – are discoverable, governed, and retrievable through the Feature Store. The view adds no storage or refresh cost; it simply wraps an existing table as a Feature View.
# External table with features not yet in the Feature Store
external_features_df = session.table("ANALYTICS.FEATURES.CREDIT_SCORES")

credit_fv = FeatureView(
    name="CREDIT_SCORES_EXTERNAL",
    entities=[user_entity],
    feature_df=external_features_df,
    timestamp_col="SCORE_TS",
    desc="Credit scores - maintained by risk team, registered for FS discovery",
)
1.6.0.1 Training Spine
For training, the spine defines the historical points where you want to retrieve features, along with the target variable (label) you’re predicting:
📁 Full code:
_code/spine_examples.py
# Training spine: includes label for supervised learning
training_spine = session.sql("""
    SELECT
        s.USER_ID,                       -- Entity key
        s.SESSION_START_TS AS EVENT_TS,  -- Point-in-time timestamp
        s.IS_CONVERTED AS LABEL          -- Target: did the session convert?
    FROM SESSIONS s
    WHERE s.USER_ID IS NOT NULL
""")

# Generate training dataset with features joined to spine
training_set = fs.generate_dataset(
    name="SESSION_CONVERSION_TRAINING",  # illustrative name; generate_dataset requires one
    spine_df=training_spine,
    features=[user_features_fv, session_features_fv],
    spine_timestamp_col="EVENT_TS",
)
1.6.0.2 Batch Inference Spine
For batch inference, you only need entity keys and timestamps—there’s no label because that’s what the model will predict:
# Inference spine: no label column
inference_spine = session.sql("""
    SELECT
        USER_ID,                         -- Entity key
        CURRENT_TIMESTAMP() AS EVENT_TS  -- Features as of now
    FROM USERS
    WHERE SUBSCRIPTION_STATUS != 'none'
""")

# Retrieve features for prediction
inference_data = fs.generate_dataset(
    name="SESSION_CONVERSION_INFERENCE",  # illustrative name; generate_dataset requires one
    spine_df=inference_spine,
    features=[user_features_fv, session_features_fv],
    spine_timestamp_col="EVENT_TS",
)
1.6.0.3 Online Inference
For real-time serving, features are retrieved from Online Feature Tables using entity keys only—no timestamp is needed since OFTs store only the current (latest) feature values:
1.6.1 How Spine Works
flowchart TB
SP[Spine DataFrame USER_ID + EVENT_TS]
FV1[USER_PURCHASE_FV features]
FV2[USER_SESSION_FV features]
SP --> FV1
SP --> FV2
RES[Result: features as-of EVENT_TS per USER_ID]
FV1 --> RES
FV2 --> RES
The Feature Store joins each spine row to the appropriate Feature Views, ensuring features are computed using only data available at the EVENT_TS timestamp. This spine-based ASOF retrieval applies to training and batch inference only. For online (real-time) inference, features are served via direct key-based lookup against Online Feature Tables – there is no spine and no ASOF join.
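The retrieval logic described above can be sketched in plain Python. This is a conceptual model under stated assumptions, not the Feature Store's implementation: the in-memory dictionaries stand in for two Feature Views, integer timestamps stand in for real ones, the feature name SESSION_CNT_7D is made up, and `asof` and `join_spine` are hypothetical helpers. A real join would return NULL for unavailable features, whereas this sketch simply omits them.

```python
from bisect import bisect_right

# Hypothetical feature histories: {user_id: [(timestamp, feature_dict), ...]}, sorted by time.
purchase_fv = {
    "u1": [(1, {"SPEND_SUM_7D": 10.0}), (5, {"SPEND_SUM_7D": 25.0})],
    "u2": [(3, {"SPEND_SUM_7D": 7.0})],
}
session_fv = {
    "u1": [(2, {"SESSION_CNT_7D": 1}), (8, {"SESSION_CNT_7D": 4})],
}


def asof(history, ts):
    """Latest feature values with timestamp <= ts, or None if nothing is available yet."""
    idx = bisect_right([t for t, _ in history], ts)
    return history[idx - 1][1] if idx else None


def join_spine(spine, feature_views):
    """Attach as-of feature values from each Feature View to every spine row."""
    result = []
    for row in spine:
        out = dict(row)  # spine columns (keys, timestamp, label) pass through unchanged
        for fv in feature_views:
            values = asof(fv.get(row["USER_ID"], []), row["EVENT_TS"])
            if values:
                out.update(values)
        result.append(out)
    return result


spine = [
    {"USER_ID": "u1", "EVENT_TS": 6, "LABEL": 1},
    {"USER_ID": "u2", "EVENT_TS": 2, "LABEL": 0},
]
training_rows = join_spine(spine, [purchase_fv, session_fv])
for r in training_rows:
    print(r)
```

For u1 at EVENT_TS 6, the session count recorded at timestamp 8 is ignored because it lies in the future relative to the spine row. For u2 at EVENT_TS 2, no purchase feature exists yet, so none is attached.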
See Chapter 10: Training & Inference for spine design patterns and best practices.
1.7 Feature Store Architecture
Snowflake Feature Store leverages native Snowflake capabilities:
flowchart TB
SD[Source Data] --> SFS[Feature Store Schema]
subgraph SFS
FV[Feature Views DT + View]
MD[Metadata Tags]
end
FV --> T[Training Datasets]
FV --> B[Batch Inference]
FV --> O[Online Serving OFT]
MD -.-> FV
1.7.1 Physical Implementation
| Logical Concept | Snowflake Object | Purpose |
|---|---|---|
| Feature Store | Schema + Tags | Container for all Feature Store objects |
| Entity | Tag | Metadata defining join keys |
| Feature View (materialized) | Dynamic Table | Pre-computed features with full history |
| Feature View (query-time) | View | On-demand computed features |
| Online Feature Table (OFT) | Online Feature Table | Low-latency serving, current values only |
| Dataset | Dataset | Materialized training/inference data |
Online Feature Tables store only the latest (current) value for each feature per entity—they do not retain historical feature values. For point-in-time historical retrieval (training, batch inference), use the standard Feature View backed by Dynamic Table or View.
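The difference between the offline history and the online view of the same features can be pictured as follows. The sketch is conceptual and uses no Snowflake API: the tuples, integer timestamps, and `build_online_table` helper are assumptions chosen to show the "latest value per key" behavior.

```python
# Hypothetical offline history: (user_id, timestamp, spend_sum_7d), as a Feature View would retain it.
offline_history = [
    ("u1", 1, 10.0),
    ("u1", 5, 25.0),
    ("u2", 3, 7.0),
    ("u1", 8, 40.0),
]


def build_online_table(history):
    """Keep only the most recent value per entity key."""
    latest = {}
    for key, ts, value in history:
        if key not in latest or ts > latest[key][0]:
            latest[key] = (ts, value)
    return {key: value for key, (ts, value) in latest.items()}


online_table = build_online_table(offline_history)

# Online lookup: entity key only, no timestamp argument.
u1_online = online_table.get("u1")
u3_online = online_table.get("u3")  # unknown key
print(online_table, u1_online, u3_online)
```

Once history is collapsed this way, a question such as "what was u1's value at timestamp 5?" can no longer be answered from the online structure, which is why training and batch inference read from the Feature View instead.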
1.8 Transformation Taxonomy
Understanding where transformations should occur is critical for maintainable ML systems. We categorize transformations into three types:
| Type | Full Name | Location | Stored In |
|---|---|---|---|
| MIT | Model-Independent Transformations | Feature Pipeline | Feature View |
| MDT | Model-Dependent Transformations | Training + Inference Pipeline | Model Registry |
| ODT | On-Demand Transformations | Inference Time | Not stored |
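To make the three categories concrete, here is a minimal sketch with one transformation of each kind. The data values, the `scale` and `basket_ratio` helpers, and the dictionary used to hold scaler parameters are illustrative assumptions, not part of any Snowflake API.

```python
from statistics import mean, pstdev

# MIT: model-independent values, as they would come from a Feature View (hypothetical data).
train_spend_30d = [20.0, 40.0, 60.0]

# MDT: scaling parameters are learned from the training set and stored with the model.
scaler = {"mean": mean(train_spend_30d), "std": pstdev(train_spend_30d)}


def scale(value, params):
    """Standardize a value using parameters fitted at training time."""
    return (value - params["mean"]) / params["std"]


# ODT: computed from request-time context only, never stored.
def basket_ratio(request_basket_total, spend_30d):
    """Ratio of the current request's basket to the user's 30-day spend."""
    return request_basket_total / spend_30d if spend_30d else 0.0


# Inference: reuse the stored scaler rather than refitting on inference data.
inference_spend_30d = 40.0
scaled = scale(inference_spend_30d, scaler)
ratio = basket_ratio(10.0, inference_spend_30d)
print(scaled, ratio)
```

The raw 30-day spend is reusable by any model, the scaler only makes sense alongside the model it was fitted for, and the basket ratio cannot exist until a request arrives.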
1.8.1 Quick Decision Guide
flowchart TD
Q1{Reusable across models?}
Q1 -->|YES| MIT[MIT in Feature View]
Q1 -->|NO| Q2{Depends on training stats?}
Q2 -->|YES| MDT[MDT with model]
Q2 -->|NO| Q3{Depends on request context?}
Q3 -->|YES| ODT[ODT at inference]
Q3 -->|NO| MIT2[Probably MIT]
See Transformation Taxonomy: MIT vs MDT vs ODT for detailed examples, anti-patterns, and implementation guidance.
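The decision guide can also be written as a small function, which is handy for design reviews or feature-request templates. `classify_transformation` is a hypothetical helper that mirrors the flowchart's three questions in order; `distance_to_store` is an invented example of a request-dependent feature.

```python
def classify_transformation(reusable_across_models, depends_on_training_stats, depends_on_request_context):
    """Mirror the decision guide: returns 'MIT', 'MDT', or 'ODT'."""
    if reusable_across_models:
        return "MIT"
    if depends_on_training_stats:
        return "MDT"
    if depends_on_request_context:
        return "ODT"
    return "MIT"  # the guide's fallback: "Probably MIT"


examples = {
    "total_orders_30d": classify_transformation(True, False, False),
    "amount_scaled": classify_transformation(False, True, False),
    "distance_to_store": classify_transformation(False, False, True),
}
print(examples)
```

Because the questions are evaluated in order, a transformation that is reusable across models is classified as MIT even if it also happens to touch request context, which matches the flowchart's first branch.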
1.9 Terminology Mapping
The table below maps Snowflake Feature Store terminology to the equivalent terms used by other feature store platforms:
| Snowflake | Feast | Hopsworks | Tecton | Description |
|---|---|---|---|---|
| Feature Store | Feature Store (registry) | Feature Store | Feature Store | Central feature repository |
| Entity | Entity | Entity | Entity | Business object with join keys |
| Feature View | Feature View | Feature Group | Feature View | Collection of related features |
| Dynamic Table FV | Offline Store | Offline Store | Batch Feature View | Pre-computed, stored features |
| View FV | On-Demand Feature View | - | On-Demand Feature View | Query-time computed features |
| Online Feature Table | Online Store | Online Store | Online Store | Low-latency serving |
| Spine | Entity DataFrame | Spine DataFrame | Spine | Request keys + timestamps |
| Feature Slice | Feature | Feature | Feature | Individual feature column |
| timestamp_col | event_timestamp | Event Time | Timestamp Column | Point-in-time reference |
| refresh_freq | - (external orchestration) | Materialization Schedule | batch_schedule | Update frequency |
| generate_dataset() | get_historical_features() | Get Training Data | Get Dataset | Training data generation |
| retrieve_feature_values() | get_online_features() | Get Feature Values | Get Online Features | Feature serving |
1.9.1 Key Terminology Notes
Feature View vs Feature Group: Snowflake and Feast both use “Feature View” while Hopsworks uses “Feature Group.” All represent a collection of features computed together.
Dynamic Table vs Offline Store: Both refer to pre-materialized features for batch training and inference. Feast relies on an external offline store (e.g., BigQuery, Snowflake, Redshift) whereas Snowflake manages materialization natively via Dynamic Tables.
Online Feature Table: Snowflake’s implementation uses Hybrid Tables for low-latency serving, equivalent to an “Online Store” backed by DynamoDB or Redis in Feast, or similar stores in other platforms.
Refresh scheduling: Snowflake’s refresh_freq and Tecton’s batch_schedule are built-in scheduling mechanisms. Feast does not include a built-in scheduler – feature materialization is driven by external orchestration (e.g., Airflow, cron).
1.10 Summary
| Concept | Definition | Key Attributes |
|---|---|---|
| Feature Store | Centralized feature management layer | Schema + Tags in Snowflake |
| Entity | Business object with join keys | name, join_keys, desc |
| Feature View | Feature collection with transformations | entities, feature_df, refresh_freq |
| Feature | Individual computed value | Defined in DataFrame or Feature class |
| Spine | Request DataFrame | Entity keys + timestamp + labels |
| MIT/MDT/ODT | Transformation taxonomy | Where transformations belong |
1.11 Next Steps
Continue to Chapter 2: Design & Organization to learn how to structure your Feature Store for scale and maintainability.