10 Training & Inference

Generating datasets, spine design, and Model Registry integration

Keywords

snowflake, feature store, ml, machine learning, mlops

10.1 Overview

This chapter covers the end-to-end workflow for using Feature Store in ML pipelines: from designing spines and generating training datasets to batch and online inference.

Code examples in this chapter use the shared demo environment (FEATURE_STORE_DEMO.CLICKSTREAM_DATA) established in the Introduction.

10.2 Learning Objectives

After completing this chapter, you will be able to:

Design effective spines for training and inference
Choose between generate_dataset() and generate_training_set() for your workload
Generate training data with point-in-time correctness
Apply feature column prefixing for disambiguation
Integrate with Snowflake Model Registry using open-source or Snowflake ML estimators
Implement batch inference workflows, including Dynamic Table (DT)-based continuous scoring

📂 Chapter code: Browse companion scripts on GitHub

10.3 Spine Design

The spine defines which entities need features and when. Its structure differs between training and inference.

Use Case	Required Columns	Optional Columns
Training	Entity keys, Timestamp	Label (target)
Batch Inference	Entity keys, Timestamp	—
Online Inference	Entity keys only	—

10.3.1 Training Spine

📁 Full code: _code/spine_design.py

Use SESSION_START_TS from SESSIONS as the point-in-time column so its name matches what you pass to spine_timestamp_col. If you alias it (for example SESSION_START_TS AS EVENT_TS), the original name is no longer in the DataFrame—downstream code and spine_timestamp_col must use the aliased name consistently.

# Training spine: USER_ID + PIT timestamp + label (conversion from session)
training_spine = session.sql("""
    SELECT
        USER_ID,
        SESSION_START_TS,
        IS_CONVERTED AS LABEL
    FROM FEATURE_STORE_DEMO.CLICKSTREAM_DATA.SESSIONS
""")

10.3.2 Inference Spine

The timestamp can be CURRENT_TIMESTAMP() (score as of now) or an event timestamp when you want to persist predictions against a specific point in time:

# Inference spine: score active users as of now
inference_spine = session.sql("""
    SELECT
        USER_ID,
        CURRENT_TIMESTAMP() AS PREDICTION_TS
    FROM FEATURE_STORE_DEMO.CLICKSTREAM_DATA.USERS
    WHERE IS_ACTIVE = TRUE
""")

10.4 `generate_dataset()` vs `generate_training_set()`

Both methods accept spine_df, features, and spine_timestamp_col and perform point-in-time correct feature retrieval. They target different downstream workflows.

	`fs.generate_dataset()`	`fs.generate_training_set()`
Returns	A versioned Snowflake ML Dataset (Parquet files in a stage)	A Snowpark DataFrame (not a versioned Dataset object)
Versioning	Supports a `version` parameter for the Dataset	No built-in versioning
Typical use	Deep learning (TensorFlow, PyTorch) and pipelines that consume the Dataset API	Classic ML (scikit-learn, XGBoost), especially training inside a Virtual Warehouse via Python stored procedures / UDFs
Persist	Dataset lifecycle is managed as a registered Dataset	Use `save_as` parameter, or call `df.write.save_as_table()` on the returned DataFrame

Both methods ultimately return a Snowpark DataFrame that you can further transform, persist via df.write.save_as_table(), or convert to pandas for local training. The key difference is that generate_dataset() additionally wraps the result in a versioned Dataset object for Parquet-based consumption.

See the Snowflake ML Dataset documentation: Snowflake ML Dataset.

10.4.1 Versioned Dataset (deep learning)

dataset = fs.generate_dataset(
    spine_df=training_spine,
    features=[user_order_fv, user_session_fv],
    spine_timestamp_col="SESSION_START_TS",
    version="V01",
)
# Consume Parquet-backed Dataset per Snowflake ML Dataset docs (e.g. TensorFlow / PyTorch).

10.4.2 Snowpark DataFrame (classic ML)

training_df = fs.generate_training_set(
    spine_df=training_spine,
    features=[user_order_fv, user_session_fv],
    spine_timestamp_col="SESSION_START_TS",
    save_as="FEATURE_STORE_DEMO.FEATURE_STORE.TRAINING_SET_CONVERSION_V01",
)
# training_df is a Snowpark DataFrame; use in Snowpark or convert for sklearn / XGBoost as needed.

10.5 Generating Training Datasets

10.5.1 Basic usage (Snowpark / classic ML)

training_df = fs.generate_training_set(
    spine_df=training_spine,
    features=[user_order_fv, user_session_fv],
    spine_timestamp_col="SESSION_START_TS",
)

10.5.2 Joining multiple Feature Views

training_df = fs.generate_training_set(
    spine_df=training_spine,
    features=[
        user_order_fv,
        user_session_fv,
        user_profile_fv,
    ],
    spine_timestamp_col="SESSION_START_TS",
)

10.5.3 Multi-entity spine design (hierarchical features)

When Feature Views exist at different entity grains (e.g., user-level, order-level, lineitem-level), the spine must carry all relevant join keys so each Feature View can match on its entity. Build the spine by joining through the entity hierarchy:

# Spine at the order level, carrying keys for customer and nation Feature Views
multi_entity_spine = session.sql("""
    SELECT
        o.ORDER_ID,
        o.CUSTOMER_ID,
        c.NATION_ID,
        o.ORDER_DATE AS EVENT_TS
    FROM ORDERS o
    JOIN CUSTOMERS c ON o.CUSTOMER_ID = c.CUSTOMER_ID
""")

training_df = fs.generate_training_set(
    spine_df=multi_entity_spine,
    features=[
        order_fv,             # entity: ORDER_ID
        order_lineitem_agg_fv,  # entity: ORDER_ID (aggregated from lineitems)
        customer_fv,          # entity: CUSTOMER_ID
        nation_fv,            # entity: NATION_ID
    ],
    spine_timestamp_col="EVENT_TS",
)

The Feature Store LEFT-JOINs each Feature View on its entity keys. Higher-level features (nation, customer) are denormalized onto every order row. This is intentional – each training example gets the full context of its parent entities.

Spine grain	Feature View grain	Result
Same as FV	1:1 match	One set of features per spine row
Finer than FV	N:1 (parent)	Parent features duplicated per child row
Coarser than FV	1:N (child)	Multiple matches – training data expands; use with care

If the spine is coarser than the Feature View (e.g., spine at order level, FV at lineitem level), the LEFT JOIN produces multiple rows per spine entry. This is usually undesirable – prefer an aggregated rollup Feature View at the spine grain instead (see Chapter 4: Hierarchical Feature Views).

10.6 Feature Column Prefixing

When joining multiple Feature Views, column name collisions can occur.

10.6.1 Using `auto_prefix`

training_df = fs.generate_training_set(
    spine_df=training_spine,
    features=[user_orders_fv, user_sessions_fv],
    spine_timestamp_col="SESSION_START_TS",
    auto_prefix=True,
)

# Result columns (example):
# USER_ORDERS__TOTAL_SPEND_7D
# USER_SESSIONS__SESSION_CNT_7D

10.6.2 Using `.with_name()`

training_df = fs.generate_training_set(
    spine_df=training_spine,
    features=[
        user_orders_fv.with_name("ord"),
        user_sessions_fv.with_name("sess"),
    ],
    spine_timestamp_col="SESSION_START_TS",
)

# Result columns (example):
# ord__TOTAL_SPEND_7D
# sess__SESSION_CNT_7D

10.6.3 Choosing a strategy

Scenario	Recommendation
Single Feature View	No prefix needed
Multiple FVs, unique names	No prefix needed
Multiple FVs, potential collisions	Use `auto_prefix=True`
Want custom short prefixes	Use `.with_name()`

10.7 Model Registry integration

Use open-source libraries (scikit-learn, XGBoost, LightGBM, etc.) when your training data fits in memory, or snowflake.ml.modeling estimators when you need distributed fitting inside the warehouse. Both produce a trained object you can log with Registry.log_model(). See Chapter 9: Preprocessing – Pipeline Classes for guidance on choosing between the two.

📁 Full code: _code/model_registry.py

from snowflake.ml.registry import Registry

reg = Registry(
    session=session,
    database_name="FEATURE_STORE_DEMO",
    schema_name="MODEL_REGISTRY",
)
reg.log_model(
    model,
    model_name="churn_model",
    version_name="V01",
    comment="Trained with USER_ORDER_FV V01, USER_SESSION_FV V01; spine SESSIONS + SESSION_START_TS",
)

10.7.1 Lineage flow

flowchart LR
  SRC[Source tables] --> FV[Feature Views] --> DS[Training Dataset] --> M[Model] --> P[Predictions]

End-to-end lineage from sources through feature views to model

10.8 Batch Inference

10.8.1 Ad-Hoc Scoring

The spine timestamp for inference can be CURRENT_TIMESTAMP() (score as of now) or an event timestamp you want to persist predictions against (e.g. SESSION_START_TS for a specific session):

# Inference spine: entities to score as of now
inference_spine = session.sql("""
    SELECT
        USER_ID,
        CURRENT_TIMESTAMP() AS PREDICTION_TS
    FROM FEATURE_STORE_DEMO.CLICKSTREAM_DATA.USERS
    WHERE IS_ACTIVE = TRUE
""")

# Point-in-time features as Snowpark DataFrame
inference_df = fs.generate_training_set(
    spine_df=inference_spine,
    features=[user_order_fv, user_session_fv],
    spine_timestamp_col="PREDICTION_TS",
)

# Score and persist
predictions = model.predict(inference_df.to_pandas().drop(columns=[...]))
inference_df.write.save_as_table("FEATURE_STORE_DEMO.FEATURE_STORE.PREDICTIONS_LATEST", mode="overwrite")

For frameworks that consume a Snowflake ML Dataset instead, build a spine the same way and use generate_dataset() with an appropriate version.

10.8.2 Continuous Batch Inference with Dynamic Tables

For single-Feature-View models where all features come from one table (no ASOF join required), you can use a Dynamic Table (or DT-backed Feature View) to persist predictions that refresh automatically. Reference the Model Registry’s SQL function directly in the DT definition:

CREATE DYNAMIC TABLE FEATURE_STORE.USER_CHURN_PREDICTIONS
  TARGET_LAG = '1 hour'
  WAREHOUSE  = FS_DEV_WH
AS
SELECT
    USER_ID,
    FEATURE_STORE_DEMO.MODEL_REGISTRY.CHURN_MODEL!PREDICT(
        ORDER_CNT, SPEND_SUM, AVG_ORDER_AMT
    ):output_feature_0::FLOAT AS CHURN_PROBABILITY,
    CURRENT_TIMESTAMP() AS SCORED_AT
FROM FEATURE_STORE.USER_PURCHASE_STATS$V01;

This pattern works well when the model reads directly from a single Feature View’s columns. Each DT refresh re-scores all entities with the latest feature values. To make these predictions discoverable via the Feature Store API and Snowsight, register the scoring DT as a Feature View — see Scoring Feature Views below for worked examples including multi-FV variants.

The Model Registry’s !PREDICT SQL function is automatically registered as IMMUTABLE, which makes it eligible for DT incremental refresh. If you create your own scoring UDF/UDTF outside of Model Registry, you must declare it immutable for the same reason – see UDF immutability for incremental refresh below.

Model version rotation works seamlessly with incremental DTs (tested April 2026)

When you promote a new model version via ALTER MODEL ... SET DEFAULT_VERSION = v_new, the DT engine automatically detects the change and triggers a REINITIALIZE (one-time full refresh) – all rows are re-scored with the new model. After the reinitialize, incremental mode resumes: subsequent source data changes are processed incrementally with the new version. The DT is not broken and no CREATE OR REPLACE is needed.

This means the full pipeline – retrain model, register new version, promote to default, incremental scoring resumes – works end-to-end without manual intervention.

DTs pinned to a specific version (WITH mv AS MODEL my_model VERSION V1 SELECT mv!PREDICT(...)) are unaffected by default-version changes.

Cost implication: Each version rotation triggers a full re-score of the entire scoring table. For very large tables, factor this into your retraining cadence – the one-time reinitialize cost is the price of correctness (no mixed-version predictions).

This behavior is specific to Model Registry’s MODEL()!PREDICT(). Standard IMMUTABLE UDFs do not get this treatment – a CREATE OR REPLACE FUNCTION breaks the DT rather than triggering a graceful reinitialize. See UDF immutability below.

Incremental scoring downstream of full-refresh Feature Views (April 2026)

If the upstream Feature View (USER_PURCHASE_STATS$V01 above) uses full refresh – for example, because its query contains a FLOAT+JOIN pattern or other unsupported incremental construct – the scoring DT can still refresh incrementally. When the upstream DT has a system-derived unique key (from GROUP BY or QUALIFY ROW_NUMBER() = 1), the scoring DT only re-scores entities whose feature values actually changed, rather than re-scoring the entire population on every cycle. Opt in with ALTER DYNAMIC TABLE ... SET REFRESH_MODE = INCREMENTAL (this is a SQL-level capability; the Feature Store Python API does not yet support it). Verify with SHOW UNIQUE KEYS IN <upstream_fv_dt>. See Understanding primary keys in dynamic tables.

ASOF join limitation

At this time, ASOF join is not supported for incremental DT refresh. If your model requires features from multiple Feature Views joined via ASOF (point-in-time correctness), you cannot embed that join in a Dynamic Table definition. Instead, use the ad-hoc scoring pattern above or schedule batch inference via a Task or external orchestrator.

Pipeline lag affects batch inference too

The scoring DT above reads from USER_PURCHASE_STATS$V01, which is itself a Dynamic Table with its own refresh_freq. The features the model scores against are subject to cumulative lag: source pipeline delay + upstream FV refresh + scoring DT target_lag. A Stream/Task/Stored Procedure batch pattern has the same issue – the Task schedule determines when scoring runs, but the upstream features may already be stale.

If the model was trained on point-in-time-perfect features (no lag), it may perform worse in production where features are always somewhat stale. To avoid this training-serving skew, train with a conservative spine that bakes in realistic lag. See Chapter 8: Feature Freshness and Training-Serving Skew for detailed guidance on measuring actual lag and alternative mitigation strategies.

10.8.3 UDF Immutability for Incremental Refresh

Dynamic Tables in incremental refresh mode require every function in the SELECT to be deterministic. The Model Registry’s !PREDICT function is registered as IMMUTABLE automatically, but if you build your own scoring UDF/UDTF outside of Model Registry, you must explicitly declare it immutable. A VOLATILE UDF forces the DT to use full refresh on every cycle, negating the performance benefits of incremental refresh.

SQL:

CREATE OR REPLACE FUNCTION FEATURE_STORE.SCORE_CHURN(
    order_cnt NUMBER, spend_sum NUMBER, avg_order_amt NUMBER
)
RETURNS FLOAT
LANGUAGE PYTHON
IMMUTABLE
RUNTIME_VERSION = '3.11'
HANDLER = 'run'
AS $$
def run(order_cnt, spend_sum, avg_order_amt):
    # Your scoring logic here
    ...
$$;

Snowpark (decorator):

from snowflake.snowpark.functions import udf
from snowflake.snowpark.types import FloatType, IntegerType

@udf(
    name="FEATURE_STORE.SCORE_CHURN",
    input_types=[IntegerType(), FloatType(), FloatType()],
    return_type=FloatType(),
    immutable=True,
    replace=True,
)
def score_churn(order_cnt: int, spend_sum: float, avg_order_amt: float) -> float:
    ...

Snowpark (session.udf.register):

session.udf.register(
    func=score_churn,
    name="FEATURE_STORE.SCORE_CHURN",
    input_types=[IntegerType(), FloatType(), FloatType()],
    return_type=FloatType(),
    immutable=True,
    replace=True,
)

Setting immutable=True causes Snowpark to emit the UDF with IMMUTABLE volatility in SQL. The DT can then treat it as eligible for incremental refresh. See the Dynamic Table supported queries documentation for the full list of incremental refresh requirements.

Standard UDF replacement breaks incremental DTs

Unlike Model Registry (which handles version rotation gracefully via REINITIALIZE – see above), replacing a standard IMMUTABLE UDF with CREATE OR REPLACE FUNCTION while it is referenced by an incremental DT breaks the DT. The DT must be fully recreated (CREATE OR REPLACE DYNAMIC TABLE).

This is an important design consideration: if your UDF logic is expected to change over time (updated model weights, bug fixes, new feature engineering rules), prefer Model Registry for inference UDFs. MR’s MODEL(name)!PREDICT() handles version rotation seamlessly – the DT detects the default-version change, triggers a one-time full re-score, then resumes incremental mode.

For UDFs that must remain outside Model Registry, plan for the DT recreation cost when the function changes. In CI/CD pipelines, this typically means including CREATE OR REPLACE DYNAMIC TABLE in the deployment script alongside CREATE OR REPLACE FUNCTION.

10.9 Scoring Feature Views

A Scoring Feature View is a Dynamic Table that calls a Model Registry function (MODEL()!PREDICT()) to produce predictions, then is registered as a Feature View through the Feature Store API. This makes model outputs discoverable alongside the features that feed them — teams can browse predictions in Snowsight, track lineage from source tables through features to scores, and consume predictions downstream via read_feature_view() just like any other feature.

Three variants cover progressively complex scenarios:

flowchart LR
  subgraph v1[Single_FV]
    FV1[Feature View] --> S1["Scoring DT\nMODEL()!PREDICT()"]
  end
  subgraph v2[Multi_FV_PreJoin]
    FVa[FV A] --> C[Combined FV]
    FVb[FV B] --> C
    C --> S2["Scoring DT"]
  end
  subgraph v3[Orchestrated]
    FVx[FV A] --> GTS["generate_training_set()\n+ ASOF join"]
    FVy[FV B] --> GTS
    GTS --> P["model.predict()"]
  end

Scoring Feature View variants — from single-source to PIT-correct multi-FV

10.9.1 Variant 1: Single-FV Scoring DT

The Continuous Batch Inference section above shows how to build a scoring DT with MODEL()!PREDICT(). To make those predictions discoverable in Snowsight and consumable via the Feature Store API, wrap the same feature_df in a FeatureView and register it:

scoring_fv = FeatureView(
    name="USER_CHURN_SCORES",
    entities=[Entity(name="USER", join_keys=["USER_ID"])],
    feature_df=scoring_df,   # same DataFrame used in the DT example above
    refresh_freq="1 hour",
    desc="Churn probability from USER_PURCHASE_STATS via churn_model",
)
fs.register_feature_view(scoring_fv, version="V01")

All the existing behaviour applies: model version rotation triggers a REINITIALIZE then resumes incremental mode, and UDF immutability rules carry over unchanged.

10.9.2 Variant 2: Multi-FV Pre-Join Pattern

When the model requires features from multiple Feature Views, ASOF join is not available inside a Dynamic Table (see the ASOF limitation callout above). The workaround is to pre-join the upstream FVs into a combined intermediate FV, then score from that single source:

# Step 1: Read upstream Feature Views
orders_df  = fs.read_feature_view(user_order_fv)    # USER_ID, ORDER_CNT, SPEND_SUM, ...
sessions_df = fs.read_feature_view(user_session_fv)  # USER_ID, SESSION_CNT, AVG_DURATION, ...

# Step 2: Join into a combined source (latest-value join on entity key)
combined_df = orders_df.join(sessions_df, "USER_ID")

# Step 3: Register the combined FV
combined_fv = FeatureView(
    name="USER_COMBINED_FEATURES",
    entities=[user_entity],
    feature_df=combined_df,
    refresh_freq="1 hour",
    desc="Pre-joined order + session features for scoring",
)
fs.register_feature_view(combined_fv, version="V01")

# Step 4: Score from the combined FV
scoring_df = session.sql("""
    SELECT
        USER_ID,
        FEATURE_STORE_DEMO.MODEL_REGISTRY.CHURN_MODEL!PREDICT(
            ORDER_CNT, SPEND_SUM, SESSION_CNT, AVG_DURATION
        ):output_feature_0::FLOAT AS CHURN_PROBABILITY,
        CURRENT_TIMESTAMP()          AS SCORED_AT
    FROM FEATURE_STORE.USER_COMBINED_FEATURES$V01
""")

scoring_fv = FeatureView(
    name="USER_CHURN_SCORES",
    entities=[user_entity],
    feature_df=scoring_df,
    refresh_freq="1 hour",
    desc="Churn probability from combined order + session features",
)
fs.register_feature_view(scoring_fv, version="V01")

The DT chain is: upstream FVs → Combined FV → Scoring FV. Each layer refreshes incrementally (subject to the usual constraints), and model version rotation propagates automatically.

Latest-value join, not PIT-correct

The pre-join in Step 2 is an equi-join on entity key — it picks the current latest values from each upstream FV, not the values as-of a specific timestamp. This is appropriate for real-time scoring (score with the freshest features available) but does not provide point-in-time correctness. If PIT-correct multi-FV scoring is required, use Variant 3 below.

10.9.3 Variant 3: Orchestrated Batch Scoring (PIT-Correct)

When the model requires features from multiple Feature Views and predictions must be point-in-time correct (e.g. for regulatory audit, back-testing, or time-series evaluation), use generate_training_set() with an inference spine inside a Snowflake Task or Stored Procedure:

# Inside a Task or Stored Procedure — scheduled or event-triggered
inference_spine = session.sql("""
    SELECT
        USER_ID,
        CURRENT_TIMESTAMP() AS PREDICTION_TS
    FROM FEATURE_STORE_DEMO.CLICKSTREAM_DATA.USERS
    WHERE IS_ACTIVE = TRUE
""")

# ASOF join retrieves PIT-correct feature values
inference_df = fs.generate_training_set(
    spine_df=inference_spine,
    features=[user_order_fv, user_session_fv],
    spine_timestamp_col="PREDICTION_TS",
)

# Score
model_ref = Registry(session=session, database_name="FEATURE_STORE_DEMO",
                     schema_name="MODEL_REGISTRY").get_model("churn_model")
predictions = model_ref.default.run(inference_df, function_name="predict")

# Persist
predictions.write.save_as_table(
    "FEATURE_STORE_DEMO.FEATURE_STORE.USER_CHURN_PREDICTIONS",
    mode="overwrite",
)

This variant trades continuous refresh for correctness: the Task runs on a schedule (e.g. hourly via SCHEDULE = 'USING CRON 0 * * * * UTC'), and each run produces a full PIT-correct scoring snapshot.

Choosing a scoring variant

	Single-FV DT	Multi-FV Pre-Join	Orchestrated (Task/Sproc)
Feature sources	One FV	Multiple FVs	Multiple FVs
Join type	Direct read	Equi-join (latest)	ASOF (PIT-correct)
Refresh	Continuous (DT)	Continuous (DT chain)	Scheduled (Task)
Incremental	Yes	Yes	No (full each run)
PIT-correct	N/A (single source)	No	Yes
Discoverable as FV	Yes	Yes	Manual (register output table)
Best for	Simple models, low latency	Multi-feature models, real-time	Audit, back-testing, compliance

10.10 Dataset Generation at Scale

When training datasets involve many Feature Views (50+) or very large spines (hundreds of millions to billions of rows), the default single-query join strategy can hit memory or shuffle limits. Understanding these constraints helps you right-size your infrastructure and leverage SDK optimisations.

10.10.1 The Batched Join Strategy (SDK 1.32+)

By default, generate_training_set and generate_dataset produce a single SQL statement that LEFT JOINs all Feature Views at once. At high Feature View counts, the final join stage shuffles data proportional to spine_rows × feature_view_count × avg_features_per_fv, which can cause out-of-memory (OOM) failures or queries that never complete.

SDK version 1.32+ automatically batches Feature View joins into groups of ~10, produces intermediate results for each batch, and performs a final merge. This happens transparently – no API changes are required. The improvement can be dramatic:

Scenario	Pre-1.32 (single query)	1.32+ (batched)
75+ FVs, ~1,000 features, ~100M spine rows	~40 min on 4XL	~17 min on 4XL Gen2
75+ FVs, ~2,000 features, ~2B spine rows	OOM on 4XL	~2 hours on 4XL Gen2

If you are on an older SDK version and cannot upgrade, the manual workaround is to split Feature Views into batches of 10-15, call generate_training_set for each batch, and join the results:

import math

def generate_training_set_batched(
    fs, spine_df, feature_views, spine_timestamp_col, batch_size=10
):
    """Manually batch Feature View joins for older SDK versions."""
    batches = [
        feature_views[i:i + batch_size]
        for i in range(0, len(feature_views), batch_size)
    ]

    result_df = spine_df
    for batch in batches:
        batch_df = fs.generate_training_set(
            spine_df=spine_df,
            features=batch,
            spine_timestamp_col=spine_timestamp_col,
        )
        join_keys = [spine_timestamp_col] + [
            k for fv in batch for k in fv.entity.join_keys
        ]
        result_df = result_df.join(
            batch_df.drop(*spine_df.columns),
            on=list(set(join_keys) & set(result_df.columns)),
            how="left",
        )
    return result_df

10.10.2 Warehouse Sizing for Dataset Generation

The optimal warehouse configuration depends on whether your bottleneck is parallelism (many joins, wide data shuffles) or memory (large intermediate results per node).

Bottleneck	Symptom	Recommendation
Parallelism	High idle execution skew; CPU-bound	Scale up standard warehouse size (more nodes)
Memory	OOM errors; high disk spill	Consider Snowpark Optimized warehouse, or reduce batch size
Both	OOM + skew	Snowpark Optimized at larger size (3XL+)

Prefer standard Gen2 warehouses

For join-heavy dataset generation workloads, standard Gen2 warehouses typically outperform Snowpark Optimized warehouses of equivalent credit cost. A standard 4XL Gen2 (128 nodes × 16 GB) provides far more parallelism than an SPO 3XL (16 nodes × 256 GB) at similar cost. Only move to SPO when profiling confirms memory – not parallelism – is the constraint.

10.10.3 The Wide Resultset Problem

When training datasets exceed ~500 features across many Feature Views, Snowflake’s query planner and columnar engine encounter non-linear performance degradation. The issue stems from SQL expression count, query compilation complexity, and memory pressure – not just data volume:

Scale	Compilation	Execution (full resultset)
10 FVs, ~200 features	Seconds	Minutes
50 FVs, ~1,000 features	Minutes	Tens of minutes
100 FVs, ~2,000 features	35+ minutes	6+ hours (likely OOM)

Known platform limitation

This is a known Snowflake limitation for wide resultsets. The Feature Store SDK mitigates it with batched joins (see above), but extremely wide datasets (2,000+ features) may still require architectural workarounds described below.

VARIANT column approach: For extreme scale (>500 features), encapsulating each Feature View’s output as a single OBJECT column dramatically reduces SQL expression count:

# Instead of: SELECT entity_id, feat_1, feat_2, ..., feat_200 FROM fv_table
# Use:        SELECT entity_id, OBJECT_CONSTRUCT(*) AS features FROM fv_table

# Benchmark: 100 FVs, 2000 features
# Column-per-feature: ~25 minutes for a single-row collect()
# VARIANT approach:     ~4 seconds for a single-row collect()

Trade-offs of the VARIANT approach:

Aspect	Column-per-feature	VARIANT per FV
Query compilation	Non-linear with feature count	Near-constant
Column-level lineage	Full	Lost
Column-level security	Full	Lost
Client-side processing	Direct DataFrame access	Requires `json.loads()` per row
Memory overhead	Standard	~2x (JSON deserialization)

For most use cases, the batched join strategy (SDK 1.32+) combined with appropriate warehouse sizing is sufficient. Reserve the VARIANT approach for the largest deployments where batched joins still hit compilation or memory limits.

10.10.4 Parquet and Iceberg Output

For training datasets that feed distributed ML frameworks (Ray, Spark, PyTorch), writing directly to Parquet avoids the memory bottleneck of to_pandas():

Scale	Warehouse	Write time
1B rows	Standard 4XL	~4 minutes
12B rows	Standard 4XL	~35 minutes

Options for Parquet output:

generate_dataset with output_type="parquet" – writes directly to a Snowflake stage; files are readable by Ray’s read_parquet().
Iceberg-backed Feature Views (SDK 1.26+) – Feature Views backed by Dynamic Iceberg Tables that expose data as standard Parquet/Iceberg on external cloud storage. ML frameworks read Iceberg natively; no Snowflake connector needed. See Iceberg-backed Feature Views below.
Snowflake-managed Iceberg tables – queryable via both Snowflake SQL and external Parquet readers. Useful when the same training data is consumed by both Snowflake-based and external ML pipelines.
COPY INTO from a materialized table – if you need fine-grained control over Parquet file layout (row group size, compression).

Skip to_pandas() for large datasets

For datasets exceeding ~100M rows or ~500 features, writing to Parquet and reading directly from your ML framework is both faster and more memory-efficient than to_pandas(). This is especially true when using the VARIANT approach, where to_pandas() returns JSON strings requiring single-threaded json.loads() deserialization.

10.11 Iceberg-Backed Feature Views

Starting with snowflake-ml-python v1.26.0 (February 2026), Feature Views can be backed by Dynamic Iceberg Tables – Dynamic Tables that materialize their output as Parquet files in Iceberg format on external cloud storage (S3, Azure Blob, GCS). This gives ML frameworks direct, native access to feature data without requiring a Snowflake connector or intermediate export step.

10.11.1 Creating an Iceberg-Backed Feature View

Use StorageConfig to specify Iceberg storage when creating a Feature View:

from snowflake.ml.feature_store import FeatureView, StorageConfig, StorageFormat

storage = StorageConfig(
    format=StorageFormat.ICEBERG,
    external_volume='MY_EXTERNAL_VOLUME',
    base_location='feature_store/user_purchase_stats'
)

iceberg_fv = FeatureView(
    name="USER_PURCHASE_STATS_ICE",
    entities=[user_entity],
    feature_df=user_purchase_df,
    timestamp_col="LAST_ORDER_TS",
    refresh_freq="1 hour",
    storage_config=storage,
    desc="User purchase statistics materialized as Iceberg"
)
fs.register_feature_view(feature_view=iceberg_fv, version="V01")

Prerequisites

An external volume must be configured for your cloud storage location.
refresh_freq is required – Iceberg storage only applies to managed (DT-backed) Feature Views, not static Views.
The FeatureStore constructor accepts default_iceberg_external_volume to avoid repeating the volume name in every StorageConfig.

All standard Feature Store APIs (generate_dataset, generate_training_set, list_feature_views, etc.) work with Iceberg-backed Feature Views. The feature data is simultaneously queryable via Snowflake SQL and via any Iceberg-compatible reader.

10.11.2 Training Directly from Iceberg Feature Views

Because Iceberg-backed Feature Views expose standard Parquet files, ML frameworks can read them natively:

import pyiceberg
from pyiceberg.catalog import load_catalog

catalog = load_catalog("snowflake", **catalog_config)
table = catalog.load_table("feature_store_demo.feature_store.USER_PURCHASE_STATS_ICE$V01")
df = table.scan().to_pandas()

This eliminates the need for generate_dataset() or to_pandas() when all you need is the feature data – the Feature View is the training dataset in a standard open format.

10.11.3 Current Limitations

Limitation	Detail
Online Feature Tables	Not supported with Iceberg storage (future work)
DT partitioning parameters	`PARTITION_BY`, `TARGET_FILE_SIZE`, and `PATH_LAYOUT` (GA April 2026) are not yet exposed through `StorageConfig` – use `ALTER DYNAMIC ICEBERG TABLE` after registration to set these
View-based Feature Views	Iceberg storage requires `refresh_freq` (DT-backed); it does not apply to View-based FVs

Iceberg FVs as an alternative to the Dataset API

For many training workflows, an Iceberg-backed Feature View provides the same benefits as generate_dataset() – versioned Parquet files accessible to distributed ML frameworks – but in a standard open format that any Iceberg-compatible tool can read. The Feature View is also incrementally maintained by the DT scheduler, so training data stays fresh without manual regeneration.

If your pipeline reads from a single Feature View (or a small number of them) and does not require spine-based ASOF joins, an Iceberg-backed Feature View may be simpler than the generate_dataset() → Parquet → framework path.

Future: Fully automated incremental training with ASOF-in-DT

When ASOF join support lands in Dynamic Tables (stretch goal), the combination with Iceberg-backed Feature Views will enable a fully automated incremental training pipeline:

A DT with ASOF join pre-materializes point-in-time correct training data as an Iceberg table
ML frameworks read the Iceberg table directly for training
As new data arrives, the DT incrementally updates the training set
A downstream scoring DT (using Model Registry’s MODEL()!PREDICT()) incrementally scores new entities
The entire retrain → evaluate → promote → score pipeline operates continuously with no manual orchestration

This is one of the most compelling future-state patterns for Feature Store. See Chapter 12: Advanced Patterns for discussion of Iceberg table integration patterns.

10.12 Best practices

1. Spine design

If your Feature Views follow the recommended convention of aliasing their timestamp column to a standard name (e.g., FV_TS – see Chapter 6: timestamp_col Requirements), name the spine timestamp to match:

# ✅ GOOD: Standardized timestamp name matches FV convention
spine = session.sql("""
    SELECT
        USER_ID,
        SESSION_START_TS AS FV_TS,
        IS_CONVERTED AS LABEL
    FROM FEATURE_STORE_DEMO.CLICKSTREAM_DATA.SESSIONS
""")
# spine_timestamp_col="FV_TS" everywhere — same as every Feature View's timestamp_col

# ✅ ALSO GOOD: Explicit source name when you prefer traceability
spine = session.sql("""
    SELECT
        USER_ID,
        SESSION_START_TS,
        IS_CONVERTED AS LABEL
    FROM FEATURE_STORE_DEMO.CLICKSTREAM_DATA.SESSIONS
""")
# spine_timestamp_col="SESSION_START_TS"

# ❌ BAD: Ambiguous or opaque timestamp column
spine = session.sql("""
    SELECT
        USER_ID,
        SOME_DATE
    FROM FEATURE_STORE_DEMO.CLICKSTREAM_DATA.SESSIONS
""")

2. Feature selection

Registered Feature Views expose .slice(["col1", "col2"]) to restrict columns. Do not assume .select() on the Feature View object.

training_df = fs.generate_training_set(
    spine_df=training_spine,
    features=[
        user_order_fv.slice(["TOTAL_SPEND_7D", "ORDER_CNT_30D"]),
    ],
    spine_timestamp_col="SESSION_START_TS",
)

3. Validate before training

# Check for data issues (column names depend on your spine / FVs)
assert training_df.filter(F.col("LABEL").isNull()).count() == 0
assert training_df.count() > 1000

10.13 Common pitfalls

10.13.1 ❌ Pitfall 1: No timestamp column

Problem: PIT retrieval does not work without a timestamp.

Solution: Always include a spine timestamp and set spine_timestamp_col to its name.

10.13.2 ❌ Pitfall 2: Feature leakage

Problem: Using features computed from future data relative to the spine row.

Solution: Use spine_timestamp_col and validate cutoff behavior.

10.13.3 ❌ Pitfall 3: Column collisions

Problem: Same feature names from multiple Feature Views.

Solution: Use auto_prefix or .with_name().

10.13.4 ❌ Pitfall 4: Wrong API for the framework

Problem: Expecting a Snowpark DataFrame from generate_dataset() or Dataset Parquet from generate_training_set().

Solution: Use the comparison table above; match the method to DL (Dataset) vs classic ML (DataFrame / save_as).

10.13.5 ❌ Pitfall 5: Training-serving skew from pipeline lag

Problem: The model is trained on point-in-time-perfect features (ASOF join returns the exact values available at each event), but at inference time – whether via an OFT, a DT-based scoring pipeline, or a scheduled Task/Sproc – the features are subject to cumulative lag: source ETL delay + upstream Feature View refresh_freq + inference-layer lag (target_lag, Task schedule, etc.). The model sees feature distributions at inference that differ from training, degrading prediction accuracy.

Solution: Train with a conservative spine that shifts the ASOF cutoff back by the expected end-to-end pipeline lag (see Chapter 6: Late-Arriving Data). This ensures the training features reflect realistic staleness. For detailed guidance on measuring actual lag and alternative mitigation strategies, see Chapter 8: Feature Freshness and Training-Serving Skew.

10.14 Summary

Concept	Description
Spine	DataFrame defining entities and timestamps (and label for training)
`generate_dataset()`	Versioned Snowflake ML Dataset (Parquet on a stage); strong fit for deep learning
`generate_training_set()`	Snowpark DataFrame; persist with `save_as` or `df.write.save_as_table()`; strong fit for classic ML
`auto_prefix`	Automatic column prefixing across Feature Views
`.with_name()`	Custom column prefixing
`.slice([...])`	Restrict features from a registered Feature View
DT-based inference	Single-FV models can use a Dynamic Table with Model Registry SQL functions for continuous batch scoring
Scoring Feature Views	Register scoring DTs as Feature Views; single-FV (direct), multi-FV pre-join (latest-value), or orchestrated Task/Sproc (PIT-correct)
Iceberg-backed FVs	`StorageConfig(format=StorageFormat.ICEBERG)` materializes feature data as Parquet/Iceberg for native ML framework access (SDK 1.26+)

10.15 Next steps

Continue to Chapter 11: Operations to learn about monitoring, data quality, and operational management.

--- title: "Training & Inference" subtitle: "Generating datasets, spine design, and Model Registry integration" --- ## Overview This chapter covers the end-to-end workflow for using Feature Store in ML pipelines: from designing spines and generating training datasets to batch and online inference. Code examples in this chapter use the shared demo environment (`FEATURE_STORE_DEMO.CLICKSTREAM_DATA`) established in the [Introduction](../00_introduction/index.qmd). ## Learning Objectives After completing this chapter, you will be able to: - Design effective spines for training and inference - Choose between `generate_dataset()` and `generate_training_set()` for your workload - Generate training data with point-in-time correctness - Apply feature column prefixing for disambiguation - Integrate with Snowflake Model Registry using open-source or Snowflake ML estimators - Implement batch inference workflows, including Dynamic Table (DT)-based continuous scoring > 📂 **Chapter code:** [Browse companion scripts on GitHub](https://github.com/Snowflake-Labs/snowflake-featurestore-imp-guide/tree/main/Snowflake_FeatureStore_Implementation_Guide/10_training_inference/_code) --- ## Spine Design The **spine** defines which entities need features and when. Its structure differs between training and inference. | Use Case | Required Columns | Optional Columns | |----------|------------------|------------------| | **Training** | Entity keys, Timestamp | Label (target) | | **Batch Inference** | Entity keys, Timestamp | — | | **Online Inference** | Entity keys only | — | ### Training Spine > 📁 **Full code:** [`_code/spine_design.py`](_code/spine_design.py) Use **`SESSION_START_TS`** from `SESSIONS` as the point-in-time column so its name matches what you pass to `spine_timestamp_col`. If you alias it (for example `SESSION_START_TS AS EVENT_TS`), the original name is no longer in the DataFrame—downstream code and `spine_timestamp_col` must use the aliased name consistently. ```python # Training spine: USER_ID + PIT timestamp + label (conversion from session) training_spine = session.sql(""" SELECT USER_ID, SESSION_START_TS, IS_CONVERTED AS LABEL FROM FEATURE_STORE_DEMO.CLICKSTREAM_DATA.SESSIONS """) ``` ### Inference Spine The timestamp can be `CURRENT_TIMESTAMP()` (score as of now) or an event timestamp when you want to persist predictions against a specific point in time: ```python # Inference spine: score active users as of now inference_spine = session.sql(""" SELECT USER_ID, CURRENT_TIMESTAMP() AS PREDICTION_TS FROM FEATURE_STORE_DEMO.CLICKSTREAM_DATA.USERS WHERE IS_ACTIVE = TRUE """) ``` --- ## `generate_dataset()` vs `generate_training_set()` Both methods accept `spine_df`, `features`, and `spine_timestamp_col` and perform point-in-time correct feature retrieval. They target different downstream workflows. | | `fs.generate_dataset()` | `fs.generate_training_set()` | |---|-------------------------|------------------------------| | **Returns** | A versioned Snowflake ML **Dataset** (Parquet files in a stage) | A **Snowpark DataFrame** (not a versioned Dataset object) | | **Versioning** | Supports a `version` parameter for the Dataset | No built-in versioning | | **Typical use** | Deep learning (TensorFlow, PyTorch) and pipelines that consume the Dataset API | Classic ML (scikit-learn, XGBoost), especially training inside a Virtual Warehouse via Python stored procedures / UDFs | | **Persist** | Dataset lifecycle is managed as a registered Dataset | Use `save_as` parameter, or call `df.write.save_as_table()` on the returned DataFrame | Both methods ultimately return a Snowpark DataFrame that you can further transform, persist via `df.write.save_as_table()`, or convert to pandas for local training. The key difference is that `generate_dataset()` additionally wraps the result in a versioned Dataset object for Parquet-based consumption. See the Snowflake ML Dataset documentation: [Snowflake ML Dataset](https://docs.snowflake.com/en/developer-guide/snowflake-ml/dataset). ### Versioned Dataset (deep learning) ```python dataset = fs.generate_dataset( spine_df=training_spine, features=[user_order_fv, user_session_fv], spine_timestamp_col="SESSION_START_TS", version="V01", ) # Consume Parquet-backed Dataset per Snowflake ML Dataset docs (e.g. TensorFlow / PyTorch). ``` ### Snowpark DataFrame (classic ML) ```python training_df = fs.generate_training_set( spine_df=training_spine, features=[user_order_fv, user_session_fv], spine_timestamp_col="SESSION_START_TS", save_as="FEATURE_STORE_DEMO.FEATURE_STORE.TRAINING_SET_CONVERSION_V01", ) # training_df is a Snowpark DataFrame; use in Snowpark or convert for sklearn / XGBoost as needed. ``` --- ## Generating Training Datasets ### Basic usage (Snowpark / classic ML) ```python training_df = fs.generate_training_set( spine_df=training_spine, features=[user_order_fv, user_session_fv], spine_timestamp_col="SESSION_START_TS", ) ``` ### Joining multiple Feature Views ```python training_df = fs.generate_training_set( spine_df=training_spine, features=[ user_order_fv, user_session_fv, user_profile_fv, ], spine_timestamp_col="SESSION_START_TS", ) ``` ### Multi-entity spine design (hierarchical features) {#sec-multi-entity-spine} When Feature Views exist at **different entity grains** (e.g., user-level, order-level, lineitem-level), the spine must carry **all relevant join keys** so each Feature View can match on its entity. Build the spine by joining through the entity hierarchy: ```python # Spine at the order level, carrying keys for customer and nation Feature Views multi_entity_spine = session.sql(""" SELECT o.ORDER_ID, o.CUSTOMER_ID, c.NATION_ID, o.ORDER_DATE AS EVENT_TS FROM ORDERS o JOIN CUSTOMERS c ON o.CUSTOMER_ID = c.CUSTOMER_ID """) training_df = fs.generate_training_set( spine_df=multi_entity_spine, features=[ order_fv, # entity: ORDER_ID order_lineitem_agg_fv, # entity: ORDER_ID (aggregated from lineitems) customer_fv, # entity: CUSTOMER_ID nation_fv, # entity: NATION_ID ], spine_timestamp_col="EVENT_TS", ) ``` The Feature Store LEFT-JOINs each Feature View on its entity keys. **Higher-level features** (nation, customer) are **denormalized** onto every order row. This is intentional -- each training example gets the full context of its parent entities. | Spine grain | Feature View grain | Result | |---|---|---| | Same as FV | 1:1 match | One set of features per spine row | | Finer than FV | N:1 (parent) | Parent features **duplicated** per child row | | Coarser than FV | 1:N (child) | **Multiple matches** -- training data expands; use with care | If the spine is **coarser** than the Feature View (e.g., spine at order level, FV at lineitem level), the LEFT JOIN produces multiple rows per spine entry. This is usually undesirable -- prefer an **aggregated rollup Feature View** at the spine grain instead (see [Chapter 4: Hierarchical Feature Views](../04_feature_views/index.qmd#sec-hierarchy-fvs)). --- ## Feature Column Prefixing {#sec-training-prefixing} When joining multiple Feature Views, column name collisions can occur. ### Using `auto_prefix` ```python training_df = fs.generate_training_set( spine_df=training_spine, features=[user_orders_fv, user_sessions_fv], spine_timestamp_col="SESSION_START_TS", auto_prefix=True, ) # Result columns (example): # USER_ORDERS__TOTAL_SPEND_7D # USER_SESSIONS__SESSION_CNT_7D ``` ### Using `.with_name()` ```python training_df = fs.generate_training_set( spine_df=training_spine, features=[ user_orders_fv.with_name("ord"), user_sessions_fv.with_name("sess"), ], spine_timestamp_col="SESSION_START_TS", ) # Result columns (example): # ord__TOTAL_SPEND_7D # sess__SESSION_CNT_7D ``` ### Choosing a strategy | Scenario | Recommendation | |----------|----------------| | Single Feature View | No prefix needed | | Multiple FVs, unique names | No prefix needed | | Multiple FVs, potential collisions | Use `auto_prefix=True` | | Want custom short prefixes | Use `.with_name()` | --- ## Model Registry integration Use **open-source libraries** (scikit-learn, XGBoost, LightGBM, etc.) when your training data fits in memory, or **`snowflake.ml.modeling`** estimators when you need distributed fitting inside the warehouse. Both produce a trained object you can log with **`Registry.log_model()`**. See [Chapter 9: Preprocessing -- Pipeline Classes](../09_preprocessing/index.qmd#pipeline-classes-scikit-learn-vs-snowflake-ml) for guidance on choosing between the two. > 📁 **Full code:** [`_code/model_registry.py`](_code/model_registry.py) ```python from snowflake.ml.registry import Registry reg = Registry( session=session, database_name="FEATURE_STORE_DEMO", schema_name="MODEL_REGISTRY", ) reg.log_model( model, model_name="churn_model", version_name="V01", comment="Trained with USER_ORDER_FV V01, USER_SESSION_FV V01; spine SESSIONS + SESSION_START_TS", ) ``` ### Lineage flow ```{mermaid} %%| fig-cap: "End-to-end lineage from sources through feature views to model" %%| fig-alt: "Source tables feature views training dataset model predictions" flowchart LR SRC[Source tables] --> FV[Feature Views] --> DS[Training Dataset] --> M[Model] --> P[Predictions] ``` --- ## Batch Inference ### Ad-Hoc Scoring The spine timestamp for inference can be `CURRENT_TIMESTAMP()` (score as of *now*) or an event timestamp you want to persist predictions against (e.g. `SESSION_START_TS` for a specific session): ```python # Inference spine: entities to score as of now inference_spine = session.sql(""" SELECT USER_ID, CURRENT_TIMESTAMP() AS PREDICTION_TS FROM FEATURE_STORE_DEMO.CLICKSTREAM_DATA.USERS WHERE IS_ACTIVE = TRUE """) # Point-in-time features as Snowpark DataFrame inference_df = fs.generate_training_set( spine_df=inference_spine, features=[user_order_fv, user_session_fv], spine_timestamp_col="PREDICTION_TS", ) # Score and persist predictions = model.predict(inference_df.to_pandas().drop(columns=[...])) inference_df.write.save_as_table("FEATURE_STORE_DEMO.FEATURE_STORE.PREDICTIONS_LATEST", mode="overwrite") ``` For frameworks that consume a Snowflake ML Dataset instead, build a spine the same way and use `generate_dataset()` with an appropriate `version`. ### Continuous Batch Inference with Dynamic Tables {#sec-continuous-batch} For **single-Feature-View models** where all features come from one table (no ASOF join required), you can use a Dynamic Table (or DT-backed Feature View) to persist predictions that refresh automatically. Reference the Model Registry's SQL function directly in the DT definition: ```sql CREATE DYNAMIC TABLE FEATURE_STORE.USER_CHURN_PREDICTIONS TARGET_LAG = '1 hour' WAREHOUSE = FS_DEV_WH AS SELECT USER_ID, FEATURE_STORE_DEMO.MODEL_REGISTRY.CHURN_MODEL!PREDICT( ORDER_CNT, SPEND_SUM, AVG_ORDER_AMT ):output_feature_0::FLOAT AS CHURN_PROBABILITY, CURRENT_TIMESTAMP() AS SCORED_AT FROM FEATURE_STORE.USER_PURCHASE_STATS$V01; ``` This pattern works well when the model reads directly from a single Feature View's columns. Each DT refresh re-scores all entities with the latest feature values. To make these predictions discoverable via the Feature Store API and Snowsight, register the scoring DT as a Feature View — see [Scoring Feature Views](#sec-scoring-fv) below for worked examples including multi-FV variants. The Model Registry's `!PREDICT` SQL function is automatically registered as **IMMUTABLE**, which makes it eligible for DT incremental refresh. If you create your own scoring UDF/UDTF outside of Model Registry, you must declare it immutable for the same reason -- see [UDF immutability for incremental refresh](#sec-udf-immutability) below. ::: {.callout-tip title="Model version rotation works seamlessly with incremental DTs (tested April 2026)"} When you promote a new model version via `ALTER MODEL ... SET DEFAULT_VERSION = v_new`, the DT engine **automatically detects** the change and triggers a **REINITIALIZE** (one-time full refresh) -- all rows are re-scored with the new model. After the reinitialize, **incremental mode resumes**: subsequent source data changes are processed incrementally with the new version. The DT is not broken and no `CREATE OR REPLACE` is needed. This means the full pipeline -- retrain model, register new version, promote to default, incremental scoring resumes -- works end-to-end without manual intervention. DTs pinned to a specific version (`WITH mv AS MODEL my_model VERSION V1 SELECT mv!PREDICT(...)`) are unaffected by default-version changes. **Cost implication:** Each version rotation triggers a full re-score of the entire scoring table. For very large tables, factor this into your retraining cadence -- the one-time reinitialize cost is the price of correctness (no mixed-version predictions). **This behavior is specific to Model Registry's `MODEL()!PREDICT()`.** Standard `IMMUTABLE` UDFs do *not* get this treatment -- a `CREATE OR REPLACE FUNCTION` breaks the DT rather than triggering a graceful reinitialize. See [UDF immutability](#sec-udf-immutability) below. ::: ::: {.callout-tip title="Incremental scoring downstream of full-refresh Feature Views (April 2026)"} If the upstream Feature View (`USER_PURCHASE_STATS$V01` above) uses **full refresh** -- for example, because its query contains a FLOAT+JOIN pattern or other unsupported incremental construct -- the scoring DT can still refresh incrementally. When the upstream DT has a **system-derived unique key** (from `GROUP BY` or `QUALIFY ROW_NUMBER() = 1`), the scoring DT only re-scores entities whose feature values actually changed, rather than re-scoring the entire population on every cycle. Opt in with `ALTER DYNAMIC TABLE ... SET REFRESH_MODE = INCREMENTAL` (this is a SQL-level capability; the Feature Store Python API does not yet support it). Verify with `SHOW UNIQUE KEYS IN <upstream_fv_dt>`. See [Understanding primary keys in dynamic tables](https://docs.snowflake.com/en/user-guide/dynamic-tables/input-data-optimization). ::: ::: {.callout-warning} ## ASOF join limitation At this time, ASOF join is **not supported for incremental DT refresh**. If your model requires features from multiple Feature Views joined via ASOF (point-in-time correctness), you cannot embed that join in a Dynamic Table definition. Instead, use the ad-hoc scoring pattern above or schedule batch inference via a Task or external orchestrator. ::: ::: {.callout-warning} ## Pipeline lag affects batch inference too The scoring DT above reads from `USER_PURCHASE_STATS$V01`, which is itself a Dynamic Table with its own `refresh_freq`. The features the model scores against are subject to cumulative lag: **source pipeline delay + upstream FV refresh + scoring DT target_lag**. A Stream/Task/Stored Procedure batch pattern has the same issue -- the Task schedule determines when scoring runs, but the upstream features may already be stale. If the model was trained on point-in-time-perfect features (no lag), it may perform worse in production where features are always somewhat stale. To avoid this **training-serving skew**, train with a [conservative spine](../06_temporal_features/index.qmd#solutions) that bakes in realistic lag. See [Chapter 8: Feature Freshness and Training-Serving Skew](../08_online_features/index.qmd#sec-training-serving-skew) for detailed guidance on measuring actual lag and alternative mitigation strategies. ::: ### UDF Immutability for Incremental Refresh {#sec-udf-immutability} Dynamic Tables in **incremental** refresh mode require every function in the `SELECT` to be deterministic. The Model Registry's `!PREDICT` function is registered as **IMMUTABLE** automatically, but if you build your own scoring UDF/UDTF outside of Model Registry, you must explicitly declare it immutable. A `VOLATILE` UDF forces the DT to use **full refresh** on every cycle, negating the performance benefits of incremental refresh. **SQL:** ```sql CREATE OR REPLACE FUNCTION FEATURE_STORE.SCORE_CHURN( order_cnt NUMBER, spend_sum NUMBER, avg_order_amt NUMBER ) RETURNS FLOAT LANGUAGE PYTHON IMMUTABLE RUNTIME_VERSION = '3.11' HANDLER = 'run' AS $$ def run(order_cnt, spend_sum, avg_order_amt): # Your scoring logic here ... $$; ``` **Snowpark (decorator):** ```python from snowflake.snowpark.functions import udf from snowflake.snowpark.types import FloatType, IntegerType @udf( name="FEATURE_STORE.SCORE_CHURN", input_types=[IntegerType(), FloatType(), FloatType()], return_type=FloatType(), immutable=True, replace=True, ) def score_churn(order_cnt: int, spend_sum: float, avg_order_amt: float) -> float: ... ``` **Snowpark (`session.udf.register`):** ```python session.udf.register( func=score_churn, name="FEATURE_STORE.SCORE_CHURN", input_types=[IntegerType(), FloatType(), FloatType()], return_type=FloatType(), immutable=True, replace=True, ) ``` Setting `immutable=True` causes Snowpark to emit the UDF with `IMMUTABLE` volatility in SQL. The DT can then treat it as eligible for incremental refresh. See the [Dynamic Table supported queries](https://docs.snowflake.com/en/user-guide/dynamic-tables/supported-queries) documentation for the full list of incremental refresh requirements. ::: {.callout-warning title="Standard UDF replacement breaks incremental DTs"} **Unlike Model Registry** (which handles version rotation gracefully via REINITIALIZE -- see above), replacing a standard `IMMUTABLE` UDF with `CREATE OR REPLACE FUNCTION` while it is referenced by an incremental DT **breaks the DT**. The DT must be fully recreated (`CREATE OR REPLACE DYNAMIC TABLE`). This is an important design consideration: if your UDF logic is expected to change over time (updated model weights, bug fixes, new feature engineering rules), prefer **Model Registry** for inference UDFs. MR's `MODEL(name)!PREDICT()` handles version rotation seamlessly -- the DT detects the default-version change, triggers a one-time full re-score, then resumes incremental mode. For UDFs that must remain outside Model Registry, plan for the DT recreation cost when the function changes. In CI/CD pipelines, this typically means including `CREATE OR REPLACE DYNAMIC TABLE` in the deployment script alongside `CREATE OR REPLACE FUNCTION`. ::: --- ## Scoring Feature Views {#sec-scoring-fv} A **Scoring Feature View** is a Dynamic Table that calls a Model Registry function (`MODEL()!PREDICT()`) to produce predictions, then is registered as a Feature View through the Feature Store API. This makes model outputs discoverable alongside the features that feed them — teams can browse predictions in Snowsight, track lineage from source tables through features to scores, and consume predictions downstream via `read_feature_view()` just like any other feature. Three variants cover progressively complex scenarios: ```{mermaid} %%| fig-cap: "Scoring Feature View variants — from single-source to PIT-correct multi-FV" %%| fig-alt: "Three scoring patterns: single FV DT, multi-FV pre-join DT, and orchestrated Task/Sproc" flowchart LR subgraph v1[Single_FV] FV1[Feature View] --> S1["Scoring DT\nMODEL()!PREDICT()"] end subgraph v2[Multi_FV_PreJoin] FVa[FV A] --> C[Combined FV] FVb[FV B] --> C C --> S2["Scoring DT"] end subgraph v3[Orchestrated] FVx[FV A] --> GTS["generate_training_set()\n+ ASOF join"] FVy[FV B] --> GTS GTS --> P["model.predict()"] end ``` ### Variant 1: Single-FV Scoring DT The [Continuous Batch Inference](#sec-continuous-batch) section above shows how to build a scoring DT with `MODEL()!PREDICT()`. To make those predictions discoverable in Snowsight and consumable via the Feature Store API, wrap the same `feature_df` in a `FeatureView` and register it: ```python scoring_fv = FeatureView( name="USER_CHURN_SCORES", entities=[Entity(name="USER", join_keys=["USER_ID"])], feature_df=scoring_df, # same DataFrame used in the DT example above refresh_freq="1 hour", desc="Churn probability from USER_PURCHASE_STATS via churn_model", ) fs.register_feature_view(scoring_fv, version="V01") ``` All the existing behaviour applies: model version rotation triggers a REINITIALIZE then resumes incremental mode, and UDF immutability rules carry over unchanged. ### Variant 2: Multi-FV Pre-Join Pattern {#sec-multi-fv-prejoin} When the model requires features from **multiple Feature Views**, ASOF join is not available inside a Dynamic Table (see the [ASOF limitation callout](#sec-continuous-batch) above). The workaround is to **pre-join** the upstream FVs into a combined intermediate FV, then score from that single source: ```python # Step 1: Read upstream Feature Views orders_df = fs.read_feature_view(user_order_fv) # USER_ID, ORDER_CNT, SPEND_SUM, ... sessions_df = fs.read_feature_view(user_session_fv) # USER_ID, SESSION_CNT, AVG_DURATION, ... # Step 2: Join into a combined source (latest-value join on entity key) combined_df = orders_df.join(sessions_df, "USER_ID") # Step 3: Register the combined FV combined_fv = FeatureView( name="USER_COMBINED_FEATURES", entities=[user_entity], feature_df=combined_df, refresh_freq="1 hour", desc="Pre-joined order + session features for scoring", ) fs.register_feature_view(combined_fv, version="V01") # Step 4: Score from the combined FV scoring_df = session.sql(""" SELECT USER_ID, FEATURE_STORE_DEMO.MODEL_REGISTRY.CHURN_MODEL!PREDICT( ORDER_CNT, SPEND_SUM, SESSION_CNT, AVG_DURATION ):output_feature_0::FLOAT AS CHURN_PROBABILITY, CURRENT_TIMESTAMP() AS SCORED_AT FROM FEATURE_STORE.USER_COMBINED_FEATURES$V01 """) scoring_fv = FeatureView( name="USER_CHURN_SCORES", entities=[user_entity], feature_df=scoring_df, refresh_freq="1 hour", desc="Churn probability from combined order + session features", ) fs.register_feature_view(scoring_fv, version="V01") ``` The DT chain is: upstream FVs → Combined FV → Scoring FV. Each layer refreshes incrementally (subject to the usual constraints), and model version rotation propagates automatically. ::: {.callout-warning title="Latest-value join, not PIT-correct"} The pre-join in Step 2 is an equi-join on entity key — it picks the **current latest values** from each upstream FV, not the values as-of a specific timestamp. This is appropriate for **real-time scoring** (score with the freshest features available) but does **not** provide point-in-time correctness. If PIT-correct multi-FV scoring is required, use Variant 3 below. ::: ### Variant 3: Orchestrated Batch Scoring (PIT-Correct) {#sec-orchestrated-scoring} When the model requires features from **multiple Feature Views** and predictions must be **point-in-time correct** (e.g. for regulatory audit, back-testing, or time-series evaluation), use `generate_training_set()` with an inference spine inside a **Snowflake Task** or **Stored Procedure**: ```python # Inside a Task or Stored Procedure — scheduled or event-triggered inference_spine = session.sql(""" SELECT USER_ID, CURRENT_TIMESTAMP() AS PREDICTION_TS FROM FEATURE_STORE_DEMO.CLICKSTREAM_DATA.USERS WHERE IS_ACTIVE = TRUE """) # ASOF join retrieves PIT-correct feature values inference_df = fs.generate_training_set( spine_df=inference_spine, features=[user_order_fv, user_session_fv], spine_timestamp_col="PREDICTION_TS", ) # Score model_ref = Registry(session=session, database_name="FEATURE_STORE_DEMO", schema_name="MODEL_REGISTRY").get_model("churn_model") predictions = model_ref.default.run(inference_df, function_name="predict") # Persist predictions.write.save_as_table( "FEATURE_STORE_DEMO.FEATURE_STORE.USER_CHURN_PREDICTIONS", mode="overwrite", ) ``` This variant trades continuous refresh for correctness: the Task runs on a schedule (e.g. hourly via `SCHEDULE = 'USING CRON 0 * * * * UTC'`), and each run produces a full PIT-correct scoring snapshot. ::: {.callout-tip title="Choosing a scoring variant"} | | **Single-FV DT** | **Multi-FV Pre-Join** | **Orchestrated (Task/Sproc)** | |---|---|---|---| | **Feature sources** | One FV | Multiple FVs | Multiple FVs | | **Join type** | Direct read | Equi-join (latest) | ASOF (PIT-correct) | | **Refresh** | Continuous (DT) | Continuous (DT chain) | Scheduled (Task) | | **Incremental** | Yes | Yes | No (full each run) | | **PIT-correct** | N/A (single source) | No | Yes | | **Discoverable as FV** | Yes | Yes | Manual (register output table) | | **Best for** | Simple models, low latency | Multi-feature models, real-time | Audit, back-testing, compliance | ::: --- ## Dataset Generation at Scale {#sec-scale} When training datasets involve many Feature Views (50+) or very large spines (hundreds of millions to billions of rows), the default single-query join strategy can hit memory or shuffle limits. Understanding these constraints helps you right-size your infrastructure and leverage SDK optimisations. ### The Batched Join Strategy (SDK 1.32+) By default, `generate_training_set` and `generate_dataset` produce a single SQL statement that LEFT JOINs **all** Feature Views at once. At high Feature View counts, the final join stage shuffles data proportional to `spine_rows × feature_view_count × avg_features_per_fv`, which can cause out-of-memory (OOM) failures or queries that never complete. SDK version 1.32+ automatically **batches** Feature View joins into groups of ~10, produces intermediate results for each batch, and performs a final merge. This happens transparently -- no API changes are required. The improvement can be dramatic: | Scenario | Pre-1.32 (single query) | 1.32+ (batched) | |----------|------------------------|-----------------| | 75+ FVs, ~1,000 features, ~100M spine rows | ~40 min on 4XL | ~17 min on 4XL Gen2 | | 75+ FVs, ~2,000 features, ~2B spine rows | OOM on 4XL | ~2 hours on 4XL Gen2 | If you are on an older SDK version and cannot upgrade, the manual workaround is to split Feature Views into batches of 10-15, call `generate_training_set` for each batch, and join the results: ```python import math def generate_training_set_batched( fs, spine_df, feature_views, spine_timestamp_col, batch_size=10 ): """Manually batch Feature View joins for older SDK versions.""" batches = [ feature_views[i:i + batch_size] for i in range(0, len(feature_views), batch_size) ] result_df = spine_df for batch in batches: batch_df = fs.generate_training_set( spine_df=spine_df, features=batch, spine_timestamp_col=spine_timestamp_col, ) join_keys = [spine_timestamp_col] + [ k for fv in batch for k in fv.entity.join_keys ] result_df = result_df.join( batch_df.drop(*spine_df.columns), on=list(set(join_keys) & set(result_df.columns)), how="left", ) return result_df ``` ### Warehouse Sizing for Dataset Generation The optimal warehouse configuration depends on whether your bottleneck is **parallelism** (many joins, wide data shuffles) or **memory** (large intermediate results per node). | Bottleneck | Symptom | Recommendation | |------------|---------|----------------| | Parallelism | High idle execution skew; CPU-bound | Scale **up** standard warehouse size (more nodes) | | Memory | OOM errors; high disk spill | Consider Snowpark Optimized warehouse, or reduce batch size | | Both | OOM + skew | Snowpark Optimized at larger size (3XL+) | ::: {.callout-tip title="Prefer standard Gen2 warehouses"} For join-heavy dataset generation workloads, **standard Gen2 warehouses** typically outperform Snowpark Optimized warehouses of equivalent credit cost. A standard 4XL Gen2 (128 nodes × 16 GB) provides far more parallelism than an SPO 3XL (16 nodes × 256 GB) at similar cost. Only move to SPO when profiling confirms memory -- not parallelism -- is the constraint. ::: ### The Wide Resultset Problem {#sec-wide-resultset} When training datasets exceed ~500 features across many Feature Views, Snowflake's query planner and columnar engine encounter **non-linear** performance degradation. The issue stems from SQL expression count, query compilation complexity, and memory pressure -- not just data volume: | Scale | Compilation | Execution (full resultset) | |-------|-------------|---------------------------| | 10 FVs, ~200 features | Seconds | Minutes | | 50 FVs, ~1,000 features | Minutes | Tens of minutes | | 100 FVs, ~2,000 features | **35+ minutes** | **6+ hours (likely OOM)** | ::: {.callout-warning title="Known platform limitation"} This is a known Snowflake limitation for wide resultsets. The Feature Store SDK mitigates it with batched joins (see above), but extremely wide datasets (2,000+ features) may still require architectural workarounds described below. ::: **VARIANT column approach**: For extreme scale (>500 features), encapsulating each Feature View's output as a single OBJECT column dramatically reduces SQL expression count: ```python # Instead of: SELECT entity_id, feat_1, feat_2, ..., feat_200 FROM fv_table # Use: SELECT entity_id, OBJECT_CONSTRUCT(*) AS features FROM fv_table # Benchmark: 100 FVs, 2000 features # Column-per-feature: ~25 minutes for a single-row collect() # VARIANT approach: ~4 seconds for a single-row collect() ``` **Trade-offs of the VARIANT approach:** | Aspect | Column-per-feature | VARIANT per FV | |--------|-------------------|----------------| | Query compilation | Non-linear with feature count | Near-constant | | Column-level lineage | Full | Lost | | Column-level security | Full | Lost | | Client-side processing | Direct DataFrame access | Requires `json.loads()` per row | | Memory overhead | Standard | ~2x (JSON deserialization) | For most use cases, the batched join strategy (SDK 1.32+) combined with appropriate warehouse sizing is sufficient. Reserve the VARIANT approach for the largest deployments where batched joins still hit compilation or memory limits. ### Parquet and Iceberg Output {#sec-parquet-output} For training datasets that feed distributed ML frameworks (Ray, Spark, PyTorch), writing directly to Parquet avoids the memory bottleneck of `to_pandas()`: | Scale | Warehouse | Write time | |-------|-----------|------------| | 1B rows | Standard 4XL | ~4 minutes | | 12B rows | Standard 4XL | ~35 minutes | **Options for Parquet output:** 1. **`generate_dataset` with `output_type="parquet"`** -- writes directly to a Snowflake stage; files are readable by Ray's `read_parquet()`. 2. **Iceberg-backed Feature Views** (SDK 1.26+) -- Feature Views backed by Dynamic Iceberg Tables that expose data as standard Parquet/Iceberg on external cloud storage. ML frameworks read Iceberg natively; no Snowflake connector needed. See [Iceberg-backed Feature Views](#sec-iceberg-fv) below. 3. **Snowflake-managed Iceberg tables** -- queryable via both Snowflake SQL and external Parquet readers. Useful when the same training data is consumed by both Snowflake-based and external ML pipelines. 4. **`COPY INTO` from a materialized table** -- if you need fine-grained control over Parquet file layout (row group size, compression). ::: {.callout-tip title="Skip to_pandas() for large datasets"} For datasets exceeding ~100M rows or ~500 features, writing to Parquet and reading directly from your ML framework is both faster and more memory-efficient than `to_pandas()`. This is especially true when using the VARIANT approach, where `to_pandas()` returns JSON strings requiring single-threaded `json.loads()` deserialization. ::: --- ## Iceberg-Backed Feature Views {#sec-iceberg-fv} Starting with `snowflake-ml-python` v1.26.0 (February 2026), Feature Views can be backed by **Dynamic Iceberg Tables** -- Dynamic Tables that materialize their output as Parquet files in Iceberg format on external cloud storage (S3, Azure Blob, GCS). This gives ML frameworks direct, native access to feature data without requiring a Snowflake connector or intermediate export step. ### Creating an Iceberg-Backed Feature View Use `StorageConfig` to specify Iceberg storage when creating a Feature View: ```python from snowflake.ml.feature_store import FeatureView, StorageConfig, StorageFormat storage = StorageConfig( format=StorageFormat.ICEBERG, external_volume='MY_EXTERNAL_VOLUME', base_location='feature_store/user_purchase_stats' ) iceberg_fv = FeatureView( name="USER_PURCHASE_STATS_ICE", entities=[user_entity], feature_df=user_purchase_df, timestamp_col="LAST_ORDER_TS", refresh_freq="1 hour", storage_config=storage, desc="User purchase statistics materialized as Iceberg" ) fs.register_feature_view(feature_view=iceberg_fv, version="V01") ``` ::: {.callout-note title="Prerequisites"} - An [external volume](https://docs.snowflake.com/en/sql-reference/sql/create-external-volume) must be configured for your cloud storage location. - `refresh_freq` is required -- Iceberg storage only applies to managed (DT-backed) Feature Views, not static Views. - The `FeatureStore` constructor accepts `default_iceberg_external_volume` to avoid repeating the volume name in every `StorageConfig`. ::: All standard Feature Store APIs (`generate_dataset`, `generate_training_set`, `list_feature_views`, etc.) work with Iceberg-backed Feature Views. The feature data is simultaneously queryable via Snowflake SQL and via any Iceberg-compatible reader. ### Training Directly from Iceberg Feature Views Because Iceberg-backed Feature Views expose standard Parquet files, ML frameworks can read them natively: ```python import pyiceberg from pyiceberg.catalog import load_catalog catalog = load_catalog("snowflake", **catalog_config) table = catalog.load_table("feature_store_demo.feature_store.USER_PURCHASE_STATS_ICE$V01") df = table.scan().to_pandas() ``` This eliminates the need for `generate_dataset()` or `to_pandas()` when all you need is the feature data -- the Feature View *is* the training dataset in a standard open format. ### Current Limitations | Limitation | Detail | |-----------|--------| | **Online Feature Tables** | Not supported with Iceberg storage (future work) | | **DT partitioning parameters** | `PARTITION_BY`, `TARGET_FILE_SIZE`, and `PATH_LAYOUT` (GA April 2026) are not yet exposed through `StorageConfig` -- use `ALTER DYNAMIC ICEBERG TABLE` after registration to set these | | **View-based Feature Views** | Iceberg storage requires `refresh_freq` (DT-backed); it does not apply to View-based FVs | ::: {.callout-tip title="Iceberg FVs as an alternative to the Dataset API"} For many training workflows, an Iceberg-backed Feature View provides the same benefits as `generate_dataset()` -- versioned Parquet files accessible to distributed ML frameworks -- but in a standard open format that any Iceberg-compatible tool can read. The Feature View is also incrementally maintained by the DT scheduler, so training data stays fresh without manual regeneration. If your pipeline reads from a single Feature View (or a small number of them) and does not require spine-based ASOF joins, an Iceberg-backed Feature View may be simpler than the `generate_dataset()` → Parquet → framework path. ::: ::: {.callout-note title="Future: Fully automated incremental training with ASOF-in-DT"} When ASOF join support lands in Dynamic Tables (stretch goal), the combination with Iceberg-backed Feature Views will enable a fully automated incremental training pipeline: 1. A DT with ASOF join pre-materializes **point-in-time correct training data** as an Iceberg table 2. ML frameworks read the Iceberg table directly for training 3. As new data arrives, the DT incrementally updates the training set 4. A downstream scoring DT (using Model Registry's `MODEL()!PREDICT()`) incrementally scores new entities 5. The entire retrain → evaluate → promote → score pipeline operates continuously with no manual orchestration This is one of the most compelling future-state patterns for Feature Store. See [Chapter 12: Advanced Patterns](../12_advanced_patterns/index.qmd) for discussion of Iceberg table integration patterns. ::: --- ## Best practices ### 1. Spine design {.unnumbered} If your Feature Views follow the recommended convention of aliasing their timestamp column to a standard name (e.g., `FV_TS` -- see [Chapter 6: timestamp_col Requirements](../06_temporal_features/index.qmd#sec-timestamp-col-requirements)), name the spine timestamp to match: ```python # ✅ GOOD: Standardized timestamp name matches FV convention spine = session.sql(""" SELECT USER_ID, SESSION_START_TS AS FV_TS, IS_CONVERTED AS LABEL FROM FEATURE_STORE_DEMO.CLICKSTREAM_DATA.SESSIONS """) # spine_timestamp_col="FV_TS" everywhere — same as every Feature View's timestamp_col # ✅ ALSO GOOD: Explicit source name when you prefer traceability spine = session.sql(""" SELECT USER_ID, SESSION_START_TS, IS_CONVERTED AS LABEL FROM FEATURE_STORE_DEMO.CLICKSTREAM_DATA.SESSIONS """) # spine_timestamp_col="SESSION_START_TS" # ❌ BAD: Ambiguous or opaque timestamp column spine = session.sql(""" SELECT USER_ID, SOME_DATE FROM FEATURE_STORE_DEMO.CLICKSTREAM_DATA.SESSIONS """) ``` ### 2. Feature selection {.unnumbered} Registered Feature Views expose **`.slice(["col1", "col2"])`** to restrict columns. Do not assume `.select()` on the Feature View object. ```python training_df = fs.generate_training_set( spine_df=training_spine, features=[ user_order_fv.slice(["TOTAL_SPEND_7D", "ORDER_CNT_30D"]), ], spine_timestamp_col="SESSION_START_TS", ) ``` ### 3. Validate before training {.unnumbered} ```python # Check for data issues (column names depend on your spine / FVs) assert training_df.filter(F.col("LABEL").isNull()).count() == 0 assert training_df.count() > 1000 ``` --- ## Common pitfalls ### ❌ Pitfall 1: No timestamp column **Problem**: PIT retrieval does not work without a timestamp. **Solution**: Always include a spine timestamp and set `spine_timestamp_col` to its name. ### ❌ Pitfall 2: Feature leakage **Problem**: Using features computed from future data relative to the spine row. **Solution**: Use `spine_timestamp_col` and validate cutoff behavior. ### ❌ Pitfall 3: Column collisions **Problem**: Same feature names from multiple Feature Views. **Solution**: Use `auto_prefix` or `.with_name()`. ### ❌ Pitfall 4: Wrong API for the framework **Problem**: Expecting a Snowpark DataFrame from `generate_dataset()` or Dataset Parquet from `generate_training_set()`. **Solution**: Use the comparison table above; match the method to DL (Dataset) vs classic ML (DataFrame / `save_as`). ### ❌ Pitfall 5: Training-serving skew from pipeline lag {#sec-pitfall-skew} **Problem**: The model is trained on point-in-time-perfect features (ASOF join returns the exact values available at each event), but at inference time -- whether via an OFT, a DT-based scoring pipeline, or a scheduled Task/Sproc -- the features are subject to cumulative lag: source ETL delay + upstream Feature View `refresh_freq` + inference-layer lag (`target_lag`, Task schedule, etc.). The model sees feature distributions at inference that differ from training, degrading prediction accuracy. **Solution**: Train with a **conservative spine** that shifts the ASOF cutoff back by the expected end-to-end pipeline lag (see [Chapter 6: Late-Arriving Data](../06_temporal_features/index.qmd#solutions)). This ensures the training features reflect realistic staleness. For detailed guidance on measuring actual lag and alternative mitigation strategies, see [Chapter 8: Feature Freshness and Training-Serving Skew](../08_online_features/index.qmd#sec-training-serving-skew). --- ## Summary | Concept | Description | |---------|-------------| | **Spine** | DataFrame defining entities and timestamps (and label for training) | | `generate_dataset()` | Versioned Snowflake ML Dataset (Parquet on a stage); strong fit for deep learning | | `generate_training_set()` | Snowpark DataFrame; persist with `save_as` or `df.write.save_as_table()`; strong fit for classic ML | | `auto_prefix` | Automatic column prefixing across Feature Views | | `.with_name()` | Custom column prefixing | | `.slice([...])` | Restrict features from a registered Feature View | | **DT-based inference** | Single-FV models can use a Dynamic Table with Model Registry SQL functions for continuous batch scoring | | **Scoring Feature Views** | Register scoring DTs as Feature Views; single-FV (direct), multi-FV pre-join (latest-value), or orchestrated Task/Sproc (PIT-correct) | | **Iceberg-backed FVs** | `StorageConfig(format=StorageFormat.ICEBERG)` materializes feature data as Parquet/Iceberg for native ML framework access (SDK 1.26+) | --- ## Next steps Continue to [Chapter 11: Operations](../11_operations/index.qmd) to learn about monitoring, data quality, and operational management.