1  Core Concepts

Entities, Feature Views, Features, and the Spine

Keywords

snowflake, feature store, ml, machine learning, mlops

1.1 Overview

This chapter introduces the core concepts and terminology of Snowflake Feature Store. Understanding these fundamentals is essential before diving into implementation patterns.

1.2 Learning Objectives

After completing this chapter, you will be able to:

  • Define what a Feature Store is and articulate its value in ML systems
  • Identify and describe the key components: Entities, Feature Views, Features, and Spines
  • Understand how Feature Store fits into the broader ML lifecycle
  • Apply the correct transformation taxonomy (MIT/MDT/ODT) to your use cases
  • Map Snowflake terminology to industry-standard terms

📂 Chapter code: Browse companion scripts on GitHub


1.3 What is a Feature Store?

A Feature Store is a centralized repository for storing, managing, and serving features for machine learning. It acts as the data management layer between raw data sources and ML models.

flowchart LR
  subgraph DS[Data Sources]
    d1[Tables / Streams / External / APIs]
  end
  subgraph FS[Feature Store]
    f1[Pipelines / Storage / Serving / Metadata]
  end
  subgraph ML[ML Consumers]
    m1[Training / Batch / Real-time / Analytics]
  end
  DS --> FS --> ML

ML data infrastructure: sources, Feature Store, and ML consumers

1.3.1 Features: The Building Blocks of ML

A feature is a measurable property or characteristic used as input to an ML model. Features can be:

| Feature Type | Description | Example |
|---|---|---|
| Raw | Direct attributes from source data | user_age, product_price |
| Derived | Calculated from raw features | price_per_unit, age_bucket |
| Aggregated | Computed over time windows | total_orders_30d, avg_session_duration_7d |
| Encoded | Transformed for model consumption | category_one_hot, amount_scaled |
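
The four feature types can be illustrated with a few lines of plain Python. This is only a sketch: the order records, the `as_of` date, and the variable names are made up for illustration and do not come from any Snowflake API.

```python
from datetime import datetime, timedelta

# Hypothetical order records for one user (illustrative values only).
orders = [
    {"order_ts": datetime(2023, 12, 15), "amount": 40.0, "units": 4, "category": "books"},
    {"order_ts": datetime(2024, 1, 20), "amount": 90.0, "units": 3, "category": "games"},
    {"order_ts": datetime(2024, 1, 28), "amount": 30.0, "units": 2, "category": "books"},
]
as_of = datetime(2024, 1, 31)
latest = orders[-1]

# Raw: taken directly from the source record.
product_price = latest["amount"]

# Derived: calculated from raw attributes.
price_per_unit = latest["amount"] / latest["units"]

# Aggregated: computed over a 30-day window ending at `as_of`.
window_start = as_of - timedelta(days=30)
total_orders_30d = sum(1 for o in orders if window_start < o["order_ts"] <= as_of)

# Encoded: transformed into a model-friendly representation (one-hot here).
categories = sorted({o["category"] for o in orders})
category_one_hot = [1 if latest["category"] == c else 0 for c in categories]
```

In a Feature Store, the raw, derived, and aggregated values would typically live in Feature Views, while encodings are often fitted per model (see the transformation taxonomy later in this chapter).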

1.4 Why Use a Feature Store?

Feature Stores solve critical challenges in production ML systems:

1. Feature Reusability

Without a Feature Store, teams often recreate the same features independently, leading to:

  • Duplicated effort across data scientists
  • Inconsistent feature definitions
  • Wasted compute resources

With Feature Store: Features are defined once, stored centrally, and reused across models and teams.

2. Training-Serving Skew

One of the most common causes of ML model degradation is when features computed during training differ from those used during inference.

With Feature Store: The same feature definitions serve both training and inference, which removes the most common source of skew.

3. Point-in-Time Correctness

Using future data to train models (data leakage) produces artificially good training metrics but poor production performance.

With Feature Store: Built-in temporal joins ensure features are computed using only data available at the prediction time, preventing data leakage.
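
To make the leakage problem concrete, the sketch below compares a feature computed over all available data with the same feature computed only from data known at prediction time. The event log, function names, and dates are hypothetical; this illustrates the idea, not the Feature Store's internal join logic.

```python
from datetime import datetime

# Hypothetical event log: (user_id, order_ts). Values are illustrative only.
ORDER_EVENTS = [
    ("u1", datetime(2024, 3, 1)),
    ("u1", datetime(2024, 3, 5)),
    ("u1", datetime(2024, 3, 20)),  # placed after the prediction time below
]

def order_count_leaky(events, user_id):
    """Counts every order, including ones placed after the prediction time."""
    return sum(1 for uid, _ in events if uid == user_id)

def order_count_as_of(events, user_id, prediction_ts):
    """Counts only orders that were already known at prediction_ts."""
    return sum(1 for uid, ts in events if uid == user_id and ts <= prediction_ts)

prediction_ts = datetime(2024, 3, 10)
leaky = order_count_leaky(ORDER_EVENTS, "u1")
correct = order_count_as_of(ORDER_EVENTS, "u1", prediction_ts)
```

Here `leaky` includes the March 20 order that did not exist on March 10, so a model trained on it sees information it will never have in production; `correct` reflects only what was known at the prediction time.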

4. Feature Discovery

As organizations scale ML, finding existing features becomes challenging.

With Feature Store: Searchable catalog of features with metadata, lineage, and documentation.

5. Governance & Compliance

Tracking where features come from and how they’re used is essential for regulatory compliance.

With Feature Store: Full lineage from source data through feature transformations to model predictions.

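Snowflake Feature Store addresses all of these challenges by combining three capabilities: a centralized repository for feature definitions and metadata, managed feature computation (Dynamic Tables and Views), and feature serving for training, batch inference, and online inference.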


1.5 Key Components

Snowflake Feature Store consists of four primary components:

1.5.1 Entity

An Entity represents a business object that features describe. It defines the join keys used to retrieve features.

📁 Full code: _code/entity_examples.py

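from snowflake.ml.feature_store import Entity

# Assumes `fs` is an already initialized FeatureStore instance.

# Simple entity with single key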
user_entity = Entity(
    name="USER",
    join_keys=["USER_ID"],
    desc="Registered user in the system"
)

# Compound entity with multiple keys
product_supplier_entity = Entity(
    name="PRODUCT_SUPPLIER",
    join_keys=["PRODUCT_ID", "SUPPLIER_ID"],
    desc="Product-Supplier relationship for supplier-specific features"
)

# Register entities
fs.register_entity(user_entity)
fs.register_entity(product_supplier_entity)
print(f"Registered entities: {[e['NAME'] for e in fs.list_entities().collect()]}")

Key Concepts:

| Concept | Description |
|---|---|
| name | Unique identifier for the entity within the Feature Store |
| join_keys | Column(s) that uniquely identify an instance of the entity |
| Simple Key | Single column identifier (e.g., USER_ID) |
| Compound Key | Multiple columns together form the identifier (e.g., PRODUCT_ID + SUPPLIER_ID) |

Entities enable features from different Feature Views to be joined together when generating training data or serving predictions.

Join key names must match exactly

The Feature Store does not support synonyms or aliases for join keys. The column names defined in the Entity’s join_keys must appear with the same name in every Feature View’s feature_df and in every spine DataFrame. If your source column has a different name, alias it in the SQL or Snowpark DataFrame that defines the Feature View (e.g., SELECT CUST_KEY AS USER_ID ... or .with_column_renamed("CUST_KEY", "USER_ID")). See Chapter 3: Consistent Key Naming for details.
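
A small pre-flight check can catch mismatched key names before a Feature View is registered. The helper below is a hypothetical sketch (it is not part of the Feature Store API) that compares a list of column names against an entity's join keys by exact name.

```python
def missing_join_keys(columns, join_keys):
    """Return the join keys that do not appear, by exact name, in `columns`."""
    present = set(columns)
    return [key for key in join_keys if key not in present]

entity_join_keys = ["USER_ID"]

# Source columns before and after aliasing CUST_KEY to USER_ID (illustrative).
source_columns = ["CUST_KEY", "ORDER_TS", "TOTAL_AMT"]
aliased_columns = ["USER_ID", "ORDER_TS", "TOTAL_AMT"]

missing_before = missing_join_keys(source_columns, entity_join_keys)
missing_after = missing_join_keys(aliased_columns, entity_join_keys)
```

With a Snowpark DataFrame, the column list could be taken from `df.columns`; the comparison here is deliberately strict and does no case folding or quoting normalization.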

Standardize timestamp_col names across Feature Views

Just as entity join keys must be named consistently, adopting a standard timestamp column name (e.g., FV_TS) across all Feature Views simplifies spine design and validation. Alias the source timestamp in the Feature View SQL:

SELECT USER_ID,
       ORDER_TS AS FV_TS,       -- standardized name
       SUM(TOTAL_AMT) AS ...
FROM ...
GROUP BY USER_ID, ORDER_TS

When every Feature View uses the same timestamp_col name, the spine only needs a single spine_timestamp_col value, include_feature_view_timestamp_col=True output is predictable, and leakage-validation code does not need per-FV column mappings. See Chapter 6: timestamp_col Requirements for details.
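
The benefit for leakage validation can be sketched as follows. The example assumes retrieved rows are available as plain dictionaries and that the Feature View timestamp appears under the column name `FV_TS`; the actual output column name produced with `include_feature_view_timestamp_col=True` should be checked against your library version. The helper name is hypothetical.

```python
from datetime import datetime

def find_leaky_rows(rows, spine_ts_col="EVENT_TS", fv_ts_col="FV_TS"):
    """Return rows whose feature timestamp is later than the spine timestamp."""
    return [
        row for row in rows
        if row.get(fv_ts_col) is not None and row[fv_ts_col] > row[spine_ts_col]
    ]

# Hypothetical retrieved rows, with the Feature View timestamp included.
retrieved = [
    {"USER_ID": "u1", "EVENT_TS": datetime(2024, 5, 10), "FV_TS": datetime(2024, 5, 9)},
    {"USER_ID": "u2", "EVENT_TS": datetime(2024, 5, 10), "FV_TS": datetime(2024, 5, 11)},
    {"USER_ID": "u3", "EVENT_TS": datetime(2024, 5, 10), "FV_TS": None},  # no feature row found
]

leaky_users = [row["USER_ID"] for row in find_leaky_rows(retrieved)]
```

Because every Feature View shares one timestamp column name, the same check works for all of them without a per-Feature-View mapping.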

📖 Deep Dive

See Chapter 3: Entities & Hierarchies for entity design patterns, hierarchies, and compound key strategies.


1.5.2 Feature View

A Feature View is a collection of related features computed from one or more source DataFrames (backed by tables, views, etc.). It defines:

  • Which entity or entities the features belong to
  • How features are computed (the transformation logic)
  • How features are materialized (Dynamic Table vs View)

📁 Full code: _code/featureview_examples.py

from snowflake.ml.feature_store import FeatureView

# Create a Feature View from a Snowpark DataFrame
user_features_fv = FeatureView(
    name="USER_PURCHASE_FEATURES",
    entities=[user_entity],
    feature_df=user_purchase_df,  # Snowpark DataFrame with feature logic
    timestamp_col="UPDATED_TS",   # For point-in-time correctness
    refresh_freq="1 hour",        # Dynamic Table refresh (omit for View)
    desc="User purchase behavior features"
)

# Register in Feature Store
user_features_fv = fs.register_feature_view(
    feature_view=user_features_fv,
    version="V01",
    block=True  # Wait for initial materialization
)

Materialization Options:

| Type | Created When | Use Case | Compute Model |
|---|---|---|---|
| Dynamic Table | refresh_freq is specified | Pre-computed features, automatic refresh | Snowflake manages refresh |
| View | refresh_freq is omitted | Query-time computation, always fresh | Compute on each query |

Key Concepts:

| Concept | Description |
|---|---|
| name | Unique identifier for the Feature View |
| entities | List of entities this Feature View provides features for |
| feature_df | Snowpark DataFrame defining the feature transformations |
| timestamp_col | Column used for point-in-time feature retrieval |
| refresh_freq | How often to refresh (e.g., "1 hour", "1 day") |
| version | Version string for managing Feature View evolution |

📖 Deep Dive

See Chapter 4: Feature Views for detailed coverage of Feature View types, versioning, and lifecycle management.


1.5.3 Feature

A Feature is an individual column within a Feature View. Features are defined through the feature_df parameter—a Snowpark DataFrame that specifies the transformation logic.

Both session.sql() (SQL) and the Snowpark DataFrame API return the same kind of object: a lazily evaluated Snowpark DataFrame. Either can be passed as feature_df. See Chapter 4: SQL vs Snowpark DataFrame API for a detailed comparison.

📁 Full code: _code/feature_dataframe_api.py | _code/feature_sql_api.py

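Using SQL via session.sql():

user_purchase_df = session.sql("""
    SELECT
        USER_ID,
        COUNT(DISTINCT ORDER_ID) AS ORDER_CNT,
        SUM(TOTAL_AMT) AS SPEND_SUM,
        AVG(TOTAL_AMT) AS ORDER_VALUE_AVG,
        MAX(ORDER_TS) AS LAST_ORDER_TS,
        DATEDIFF('day', MIN(ORDER_TS), MAX(ORDER_TS)) AS CUSTOMER_TENURE_DAYS
    FROM ORDERS
    GROUP BY USER_ID
""")

Using the Snowpark DataFrame API:

from snowflake.snowpark import functions as F

ORDERS_TABLE = "ORDERS"  # assumed here: the same table queried by the SQL variant

user_purchase_df = (
    session.table(ORDERS_TABLE)
    .group_by("USER_ID")
    .agg(
        F.sum("TOTAL_AMT").alias("SPEND_SUM"),
        # count_distinct matches COUNT(DISTINCT ORDER_ID) in the SQL variant
        F.count_distinct("ORDER_ID").alias("ORDER_CNT"),
        F.avg("TOTAL_AMT").alias("ORDER_VALUE_AVG"),
        F.max("ORDER_TS").alias("LAST_ORDER_TS"),
    )
)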

Create and register a Feature View using the DataFrame above:

user_purchase_fv = FeatureView(
    name="USER_PURCHASE_FEATURES",
    entities=[user_entity],
    feature_df=user_purchase_df,
    timestamp_col="LAST_ORDER_TS",
    refresh_freq="1 hour",
    desc="User purchase behavior features",
)

user_purchase_fv = fs.register_feature_view(
    feature_view=user_purchase_fv,
    version="V01",
    block=True,
    overwrite=True,
)

print(f"Registered: {user_purchase_fv.name}/V01")
print(f"Status: {user_purchase_fv.status}")

Time-windowed features over sparse data produce incorrect results at retrieval time

SQL window functions such as RANGE BETWEEN INTERVAL '7 days' PRECEDING AND CURRENT ROW compute correct windows at materialization time – each row’s 7-day window is anchored to that row’s own timestamp. The problem surfaces at retrieval time. When the Feature Store performs an ASOF join against a pre-materialized Dynamic Table, it returns the most recent row whose timestamp is <= the spine timestamp. If the source data is sparse (not a record for every time grain), that row may be days older than the spine timestamp, and its pre-computed window is anchored to the wrong point in time.

Example: A spine row requests features as of Jan 15, but the nearest pre-computed row in the DT has a timestamp of Jan 10 (no source activity between Jan 10-15). The returned “7-day sum” covers Jan 3-10, not the expected Jan 8-15.

This is an inherent limitation of pre-computing fixed windows and retrieving them via ASOF: the window boundaries are baked into the row at materialization time and cannot shift to match an arbitrary query timestamp. For time-windowed aggregations, use the Feature Aggregation API (Section 1.5.3.1) instead. Its tiling mechanism stores partial aggregates per time grain and reassembles the correct window from the spine timestamp backwards at retrieval time – regardless of source sparsity. See Chapter 5: Temporal Aggregation Pipelines for the detailed motivation and Chapter 7 for API reference.
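
The Jan 10 / Jan 15 example above can be reproduced in plain Python. The sketch below is a simplified model, not the Feature Aggregation API: it uses daily tiles, treats both window boundaries as inclusive, and all data and function names are hypothetical.

```python
from datetime import date, timedelta

WINDOW = timedelta(days=7)

# Hypothetical sparse source: (order_date, amount) for one user.
orders = [
    (date(2024, 1, 3), 10.0),
    (date(2024, 1, 5), 20.0),
    (date(2024, 1, 10), 30.0),  # no activity between Jan 10 and Jan 15
]

def trailing_sum(rows, anchor):
    """Sum amounts whose date falls in [anchor - 7 days, anchor]."""
    return sum(amt for d, amt in rows if anchor - WINDOW <= d <= anchor)

# Approach 1: pre-compute one window per source row, then look it up ASOF.
precomputed = {d: trailing_sum(orders, d) for d, _ in orders}

def asof_lookup(table, ts):
    """Return the value of the latest row at or before ts (None if there is none)."""
    eligible = [d for d in table if d <= ts]
    return table[max(eligible)] if eligible else None

# Approach 2: keep per-day partial sums (tiles) and assemble the window at query time.
tiles = {}
for d, amt in orders:
    tiles[d] = tiles.get(d, 0.0) + amt

def window_from_tiles(tile_sums, ts):
    """Sum the tiles that fall in [ts - 7 days, ts]."""
    return sum(v for d, v in tile_sums.items() if ts - WINDOW <= d <= ts)

spine_ts = date(2024, 1, 15)
stale = asof_lookup(precomputed, spine_ts)   # window anchored at Jan 10
fresh = window_from_tiles(tiles, spine_ts)   # window anchored at Jan 15
```

The ASOF lookup returns the Jan 10 row, whose sum covers Jan 3-10 and therefore includes all three orders. Reassembling from tiles anchors the window at the spine timestamp and counts only the Jan 10 order, which is the value a 7-day window ending Jan 15 should produce.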

1.5.3.1 Method 3: Feature Aggregation Class (Time-Windowed)

For time-windowed aggregations, the Feature class provides a declarative API:

📁 Full code: _code/feature_aggregation_api.py

from snowflake.ml.feature_store import Feature

# Define time-windowed aggregated features
features = [
    Feature.sum("TOTAL_AMT", "7d").alias("SPEND_SUM_7D"),
    Feature.count("ORDER_ID", "30d").alias("ORDER_CNT_30D"),
    Feature.avg("TOTAL_AMT", "24h").alias("ORDER_VALUE_AVG_24H"),
    Feature.last_n("PRODUCT_ID", "7d", n=5).alias("RECENT_PRODUCTS"),
]

# Use in a tiled Feature View for efficient computation
tiled_fv = FeatureView(
    name="USER_ORDER_AGGREGATES",
    entities=[user_entity],
    feature_df=session.table("ORDERS"),
    timestamp_col="ORDER_TS",
    refresh_freq="1h",
    feature_granularity="1h",
    features=features,
)

🆕 New in snowflake-ml 1.21+

The Feature aggregation class provides declarative syntax for time-windowed aggregations with automatic tiling for efficient incremental computation. See Chapter 7: Aggregations API for the complete API reference.

1.5.3.2 Choosing the Right Method

| Method | Best For | Complexity |
|---|---|---|
| DataFrame API | Simple aggregations, joins, filtering | Low |
| SQL via session.sql() | Complex queries, window functions, reusing existing SQL | Medium |
| Feature class | Time-windowed aggregations, sliding windows | Advanced |

1.5.4 Feature Slice

A Feature Slice identifies specific feature columns within a Feature View that you want to retrieve. This allows selective feature retrieval rather than fetching all columns.

# Get specific features from a Feature View
# (column names must exist in the Feature View being sliced)
user_slice = user_features_fv.slice(["SPEND_SUM", "ORDER_CNT"])

# Use the slice in dataset generation
dataset = fs.generate_dataset(
    name="USER_SLICE_DATASET",  # illustrative dataset name
    spine_df=training_spine,
    features=[user_slice],  # Only retrieves sliced features
)

1.6 The Spine: Connecting Features to ML

The Spine is the DataFrame that defines which entities need features and as of when. Its structure differs slightly between training and inference:

| Use Case | Required Columns | Optional Columns |
|---|---|---|
| Training | Entity keys, Timestamp | Label, additional context columns |
| Batch Inference | Entity keys, Timestamp | Additional context columns |
| Online Inference | Entity keys only | - |

The spine can carry any additional columns alongside the entity keys and timestamp – labels, context features, or columns sourced from tables that are not managed by the Feature Store. These columns pass through unchanged into the generated dataset.

Prefer routing all features through the Feature Store

If you find yourself adding feature columns directly to the spine from external tables, consider instead creating a view-based Feature View that references those external tables. This ensures all features – whether internally computed or externally managed – are discoverable, governed, and retrievable through the Feature Store. The view adds no storage or refresh cost; it simply wraps an existing table as a Feature View.

# External table with features not yet in the Feature Store
external_features_df = session.table("ANALYTICS.FEATURES.CREDIT_SCORES")

credit_fv = FeatureView(
    name="CREDIT_SCORES_EXTERNAL",
    entities=[user_entity],
    feature_df=external_features_df,
    timestamp_col="SCORE_TS",
    desc="Credit scores - maintained by risk team, registered for FS discovery"
)

1.6.0.1 Training Spine

For training, the spine defines the historical points where you want to retrieve features, along with the target variable (label) you’re predicting:

📁 Full code: _code/spine_examples.py

# Training spine: includes label for supervised learning
training_spine = session.sql("""
    SELECT 
        s.USER_ID,                         -- Entity key
        s.SESSION_START_TS AS EVENT_TS,    -- Point-in-time timestamp
        s.IS_CONVERTED AS LABEL            -- Target: did the session convert?
    FROM SESSIONS s
    WHERE s.USER_ID IS NOT NULL
""")

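# Generate training dataset with features joined to spine
training_set = fs.generate_dataset(
    name="USER_CONVERSION_TRAINING",  # illustrative dataset name
    spine_df=training_spine,
    features=[user_features_fv, session_features_fv],
    spine_timestamp_col="EVENT_TS",
    spine_label_cols=["LABEL"],       # marks LABEL as the target column
)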

1.6.0.2 Batch Inference Spine

For batch inference, you only need entity keys and timestamps—there’s no label because that’s what the model will predict:

# Inference spine: no label column
inference_spine = session.sql("""
    SELECT 
        USER_ID,                           -- Entity key
        CURRENT_TIMESTAMP() AS EVENT_TS    -- Features as of now
    FROM USERS
    WHERE SUBSCRIPTION_STATUS != 'none'
""")

# Retrieve features for prediction
inference_data = fs.generate_dataset(
    name="USER_BATCH_INFERENCE",  # illustrative dataset name
    spine_df=inference_spine,
    features=[user_features_fv, session_features_fv],
    spine_timestamp_col="EVENT_TS",
)

1.6.0.3 Online Inference

For real-time serving, features are retrieved from Online Feature Tables using entity keys only—no timestamp is needed since OFTs store only the current (latest) feature values:

# Online serving: entity keys only, retrieves current feature values
features = fs.retrieve_feature_values(
    spine_df=session.create_dataframe([{"USER_ID": "usr_001"}]),
    features=[user_features_fv],
)

1.6.1 How Spine Works

flowchart TB
  SP[Spine DataFrame USER_ID + EVENT_TS]
  FV1[USER_PURCHASE_FV features]
  FV2[USER_SESSION_FV features]
  SP --> FV1
  SP --> FV2
  RES[Result: features as-of EVENT_TS per USER_ID]
  FV1 --> RES
  FV2 --> RES

Spine-based feature retrieval: spine joined to multiple Feature Views

The Feature Store joins each spine row to the appropriate Feature Views, ensuring features are computed using only data available at the EVENT_TS timestamp. This spine-based ASOF retrieval applies to training and batch inference only. For online (real-time) inference, features are served via direct key-based lookup against Online Feature Tables – there is no spine and no ASOF join.
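
The retrieval logic described above can be approximated in plain Python. This sketch models each Feature View as a single feature with a per-user history of (timestamp, value) pairs; the data, names, and the `join_spine` helper are hypothetical and are not how the Feature Store is implemented.

```python
from bisect import bisect_right
from datetime import datetime

def asof_value(history, ts):
    """history: list of (timestamp, value) sorted by timestamp.
    Returns the value of the latest entry at or before ts, else None."""
    idx = bisect_right([t for t, _ in history], ts)
    return history[idx - 1][1] if idx else None

# Hypothetical feature histories: feature name -> user -> [(timestamp, value)].
feature_views = {
    "ORDER_CNT": {"u1": [(datetime(2024, 1, 1), 1), (datetime(2024, 2, 1), 4)]},
    "SESSION_CNT": {"u1": [(datetime(2024, 1, 15), 7)]},
}

spine = [
    {"USER_ID": "u1", "EVENT_TS": datetime(2024, 1, 10), "LABEL": 0},
    {"USER_ID": "u1", "EVENT_TS": datetime(2024, 2, 10), "LABEL": 1},
]

def join_spine(spine_rows, views):
    """Attach to each spine row the feature values known at its EVENT_TS."""
    result = []
    for row in spine_rows:
        out = dict(row)  # spine columns, including LABEL, pass through unchanged
        for name, by_user in views.items():
            out[name] = asof_value(by_user.get(row["USER_ID"], []), row["EVENT_TS"])
        result.append(out)
    return result

training_rows = join_spine(spine, feature_views)
```

The first spine row (Jan 10) sees `ORDER_CNT` as 1 and no session value yet, while the second (Feb 10) sees the later values. Each row only receives data that existed at its own `EVENT_TS`.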

📖 Deep Dive

See Chapter 10: Training & Inference for spine design patterns and best practices.


1.7 Feature Store Architecture

Snowflake Feature Store leverages native Snowflake capabilities:

flowchart TB
  SD[Source Data] --> SFS[Feature Store Schema]
  subgraph SFS
    FV[Feature Views DT + View]
    MD[Metadata Tags]
  end
  FV --> T[Training Datasets]
  FV --> B[Batch Inference]
  FV --> O[Online Serving OFT]
  MD -.-> FV

Snowflake Feature Store architecture overview

1.7.1 Physical Implementation

| Logical Concept | Snowflake Object | Purpose |
|---|---|---|
| Feature Store | Schema + Tags | Container for all Feature Store objects |
| Entity | Tag | Metadata defining join keys |
| Feature View (materialized) | Dynamic Table | Pre-computed features with full history |
| Feature View (query-time) | View | On-demand computed features |
| Online Feature Table (OFT) | Online Feature Table | Low-latency serving, current values only |
| Dataset | Dataset | Materialized training/inference data |

Online Feature Tables Store Current Values Only

Online Feature Tables store only the latest (current) value for each feature per entity—they do not retain historical feature values. For point-in-time historical retrieval (training, batch inference), use the standard Feature View backed by Dynamic Table or View.
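
Conceptually, an online table is the feature history collapsed to one row per entity. The sketch below shows that reduction in plain Python; it is a mental model with hypothetical data, not a description of how Online Feature Tables are implemented.

```python
from datetime import datetime

# Hypothetical offline history: (user_id, feature_ts, order_cnt).
history = [
    ("u1", datetime(2024, 1, 1), 1),
    ("u1", datetime(2024, 2, 1), 4),
    ("u2", datetime(2024, 1, 20), 2),
]

def latest_per_entity(rows):
    """Keep only the most recent value for each entity key."""
    latest = {}
    for user_id, ts, value in rows:
        if user_id not in latest or ts > latest[user_id][0]:
            latest[user_id] = (ts, value)
    return {uid: value for uid, (_, value) in latest.items()}

online_view = latest_per_entity(history)
```

A key-based lookup such as `online_view["u1"]` returns the current value only; the January value for `u1` is no longer recoverable from this structure, which is why training and batch inference read from the historical Feature View instead.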


1.8 Transformation Taxonomy

Understanding where transformations should occur is critical for maintainable ML systems. We categorize transformations into three types:

| Type | Full Name | Location | Stored In |
|---|---|---|---|
| MIT | Model-Independent Transformations | Feature Pipeline | Feature View |
| MDT | Model-Dependent Transformations | Training + Inference Pipeline | Model Registry |
| ODT | On-Demand Transformations | Inference Time | Not stored |
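
One small example of each transformation type may help. The functions below are hypothetical illustrations in plain Python; the feature names and values are assumptions, and the standard-scaling example is just one common kind of model-dependent transformation.

```python
from datetime import datetime
from statistics import mean, pstdev

# MIT: model-independent, computed once in the feature pipeline and reusable.
def spend_sum(amounts):
    return sum(amounts)

# MDT: depends on statistics of a specific training set, so the fitted
# parameters are stored and versioned together with the model.
def fit_scaler(training_values):
    return {"mean": mean(training_values), "std": pstdev(training_values)}

def scale(value, params):
    return (value - params["mean"]) / params["std"]

# ODT: depends on the inference request itself, so it is computed at request time.
def seconds_since_last_order(request_ts, last_order_ts):
    return (request_ts - last_order_ts).total_seconds()

mit_value = spend_sum([100.0, 150.0])
scaler = fit_scaler([100.0, 300.0])
mdt_value = scale(mit_value, scaler)
odt_value = seconds_since_last_order(
    datetime(2024, 1, 1, 12, 0, 0), datetime(2024, 1, 1, 11, 59, 0)
)
```

The same `mit_value` could feed many models, while `scaler` belongs to exactly one trained model and `odt_value` exists only for the duration of a request.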

1.8.1 Quick Decision Guide

flowchart TD
  Q1{Reusable across models?}
  Q1 -->|YES| MIT[MIT in Feature View]
  Q1 -->|NO| Q2{Depends on training stats?}
  Q2 -->|YES| MDT[MDT with model]
  Q2 -->|NO| Q3{Depends on request context?}
  Q3 -->|YES| ODT[ODT at inference]
  Q3 -->|NO| MIT2[Probably MIT]

MIT vs MDT vs ODT decision flowchart
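
The decision flowchart can be written as a short function. This is a sketch that mirrors the three questions in order; the function and parameter names are hypothetical.

```python
def classify_transformation(
    reusable_across_models: bool,
    depends_on_training_stats: bool,
    depends_on_request_context: bool,
) -> str:
    """Walk the decision guide top to bottom and return MIT, MDT, or ODT."""
    if reusable_across_models:
        return "MIT"  # define it in a Feature View
    if depends_on_training_stats:
        return "MDT"  # package it with the model
    if depends_on_request_context:
        return "ODT"  # compute it at inference time
    return "MIT"      # the guide's "Probably MIT" fallback

# Example: a 30-day spend total is reusable across models.
spend_total_type = classify_transformation(True, False, False)

# Example: scaling by a training-set mean is tied to one model.
scaling_type = classify_transformation(False, True, False)
```

The order of the checks matters: a transformation that is reusable across models is classified as MIT before the later questions are considered, exactly as in the flowchart.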

📖 Deep Dive

See Transformation Taxonomy: MIT vs MDT vs ODT for detailed examples, anti-patterns, and implementation guidance.


1.9 Terminology Mapping

The table below maps Snowflake Feature Store terminology to the equivalent terms in other platforms:

| Snowflake | Feast | Hopsworks | Tecton | Description |
|---|---|---|---|---|
| Feature Store | Feature Store (registry) | Feature Store | Feature Store | Central feature repository |
| Entity | Entity | Entity | Entity | Business object with join keys |
| Feature View | Feature View | Feature Group | Feature View | Collection of related features |
| Dynamic Table FV | Offline Store | Offline Store | Batch Feature View | Pre-computed, stored features |
| View FV | On-Demand Feature View | - | On-Demand Feature View | Query-time computed features |
| Online Feature Table | Online Store | Online Store | Online Store | Low-latency serving |
| Spine | Entity DataFrame | Spine DataFrame | Spine | Request keys + timestamps |
| Feature Slice | Feature | Feature | Feature | Individual feature column |
| timestamp_col | event_timestamp | Event Time | Timestamp Column | Point-in-time reference |
| refresh_freq | - (external orchestration) | Materialization Schedule | batch_schedule | Update frequency |
| generate_dataset() | get_historical_features() | Get Training Data | Get Dataset | Training data generation |
| retrieve_feature_values() | get_online_features() | Get Feature Values | Get Online Features | Feature serving |

1.9.1 Key Terminology Notes

  1. Feature View vs Feature Group: Snowflake and Feast both use “Feature View” while Hopsworks uses “Feature Group.” All represent a collection of features computed together.

  2. Dynamic Table vs Offline Store: Both refer to pre-materialized features for batch training and inference. Feast relies on an external offline store (e.g., BigQuery, Snowflake, Redshift) whereas Snowflake manages materialization natively via Dynamic Tables.

  3. Online Feature Table: Snowflake’s implementation uses Hybrid Tables for low-latency serving, equivalent to an “Online Store” backed by DynamoDB or Redis in Feast, or similar stores in other platforms.

  4. Refresh scheduling: Snowflake’s refresh_freq and Tecton’s batch_schedule are built-in scheduling mechanisms. Feast does not include a built-in scheduler – feature materialization is driven by external orchestration (e.g., Airflow, cron).


1.10 Summary

| Concept | Definition | Key Attributes |
|---|---|---|
| Feature Store | Centralized feature management layer | Schema + Tags in Snowflake |
| Entity | Business object with join keys | name, join_keys, desc |
| Feature View | Feature collection with transformations | entities, feature_df, refresh_freq |
| Feature | Individual computed value | Defined in DataFrame or Feature class |
| Spine | Request DataFrame | Entity keys + timestamp + labels |
| MIT/MDT/ODT | Transformation taxonomy | Where transformations belong |

1.11 Next Steps

Continue to Chapter 2: Design & Organization to learn how to structure your Feature Store for scale and maintainability.