10  Training & Inference

Generating datasets, spine design, and Model Registry integration

Keywords

snowflake, feature store, ml, machine learning, mlops

10.1 Overview

This chapter covers the end-to-end workflow for using Feature Store in ML pipelines: from designing spines and generating training datasets to batch and online inference.

Code examples in this chapter use the shared demo environment (FEATURE_STORE_DEMO.CLICKSTREAM_DATA) established in the Introduction.

10.2 Learning Objectives

After completing this chapter, you will be able to:

  • Design effective spines for training and inference
  • Choose between generate_dataset() and generate_training_set() for your workload
  • Generate training data with point-in-time correctness
  • Apply feature column prefixing for disambiguation
  • Integrate with Snowflake Model Registry using open-source or Snowflake ML estimators
  • Implement batch inference workflows, including Dynamic Table (DT)-based continuous scoring

📂 Chapter code: Browse companion scripts on GitHub


10.3 Spine Design

The spine defines which entities need features and when. Its structure differs between training and inference.

Use Case Required Columns Optional Columns
Training Entity keys, Timestamp Label (target)
Batch Inference Entity keys, Timestamp
Online Inference Entity keys only

10.3.1 Training Spine

📁 Full code: _code/spine_design.py

Use SESSION_START_TS from SESSIONS as the point-in-time column so its name matches what you pass to spine_timestamp_col. If you alias it (for example SESSION_START_TS AS EVENT_TS), the original name is no longer in the DataFrame—downstream code and spine_timestamp_col must use the aliased name consistently.

# Training spine: USER_ID + PIT timestamp + label (conversion from session)
training_spine = session.sql("""
    SELECT
        USER_ID,
        SESSION_START_TS,
        IS_CONVERTED AS LABEL
    FROM FEATURE_STORE_DEMO.CLICKSTREAM_DATA.SESSIONS
""")

10.3.2 Inference Spine

The timestamp can be CURRENT_TIMESTAMP() (score as of now) or an event timestamp when you want to persist predictions against a specific point in time:

# Inference spine: score active users as of now
inference_spine = session.sql("""
    SELECT
        USER_ID,
        CURRENT_TIMESTAMP() AS PREDICTION_TS
    FROM FEATURE_STORE_DEMO.CLICKSTREAM_DATA.USERS
    WHERE IS_ACTIVE = TRUE
""")

10.4 generate_dataset() vs generate_training_set()

Both methods accept spine_df, features, and spine_timestamp_col and perform point-in-time correct feature retrieval. They target different downstream workflows.

fs.generate_dataset() fs.generate_training_set()
Returns A versioned Snowflake ML Dataset (Parquet files in a stage) A Snowpark DataFrame (not a versioned Dataset object)
Versioning Supports a version parameter for the Dataset No built-in versioning
Typical use Deep learning (TensorFlow, PyTorch) and pipelines that consume the Dataset API Classic ML (scikit-learn, XGBoost), especially training inside a Virtual Warehouse via Python stored procedures / UDFs
Persist Dataset lifecycle is managed as a registered Dataset Use save_as parameter, or call df.write.save_as_table() on the returned DataFrame

Both methods ultimately return a Snowpark DataFrame that you can further transform, persist via df.write.save_as_table(), or convert to pandas for local training. The key difference is that generate_dataset() additionally wraps the result in a versioned Dataset object for Parquet-based consumption.

See the Snowflake ML Dataset documentation: Snowflake ML Dataset.

10.4.1 Versioned Dataset (deep learning)

dataset = fs.generate_dataset(
    spine_df=training_spine,
    features=[user_order_fv, user_session_fv],
    spine_timestamp_col="SESSION_START_TS",
    version="V01",
)
# Consume Parquet-backed Dataset per Snowflake ML Dataset docs (e.g. TensorFlow / PyTorch).

10.4.2 Snowpark DataFrame (classic ML)

training_df = fs.generate_training_set(
    spine_df=training_spine,
    features=[user_order_fv, user_session_fv],
    spine_timestamp_col="SESSION_START_TS",
    save_as="FEATURE_STORE_DEMO.FEATURE_STORE.TRAINING_SET_CONVERSION_V01",
)
# training_df is a Snowpark DataFrame; use in Snowpark or convert for sklearn / XGBoost as needed.

10.5 Generating Training Datasets

10.5.1 Basic usage (Snowpark / classic ML)

training_df = fs.generate_training_set(
    spine_df=training_spine,
    features=[user_order_fv, user_session_fv],
    spine_timestamp_col="SESSION_START_TS",
)

10.5.2 Joining multiple Feature Views

training_df = fs.generate_training_set(
    spine_df=training_spine,
    features=[
        user_order_fv,
        user_session_fv,
        user_profile_fv,
    ],
    spine_timestamp_col="SESSION_START_TS",
)

10.5.3 Multi-entity spine design (hierarchical features)

When Feature Views exist at different entity grains (e.g., user-level, order-level, lineitem-level), the spine must carry all relevant join keys so each Feature View can match on its entity. Build the spine by joining through the entity hierarchy:

# Spine at the order level, carrying keys for customer and nation Feature Views
multi_entity_spine = session.sql("""
    SELECT
        o.ORDER_ID,
        o.CUSTOMER_ID,
        c.NATION_ID,
        o.ORDER_DATE AS EVENT_TS
    FROM ORDERS o
    JOIN CUSTOMERS c ON o.CUSTOMER_ID = c.CUSTOMER_ID
""")

training_df = fs.generate_training_set(
    spine_df=multi_entity_spine,
    features=[
        order_fv,             # entity: ORDER_ID
        order_lineitem_agg_fv,  # entity: ORDER_ID (aggregated from lineitems)
        customer_fv,          # entity: CUSTOMER_ID
        nation_fv,            # entity: NATION_ID
    ],
    spine_timestamp_col="EVENT_TS",
)

The Feature Store LEFT-JOINs each Feature View on its entity keys. Higher-level features (nation, customer) are denormalized onto every order row. This is intentional – each training example gets the full context of its parent entities.

Spine grain Feature View grain Result
Same as FV 1:1 match One set of features per spine row
Finer than FV N:1 (parent) Parent features duplicated per child row
Coarser than FV 1:N (child) Multiple matches – training data expands; use with care

If the spine is coarser than the Feature View (e.g., spine at order level, FV at lineitem level), the LEFT JOIN produces multiple rows per spine entry. This is usually undesirable – prefer an aggregated rollup Feature View at the spine grain instead (see Chapter 4: Hierarchical Feature Views).


10.6 Feature Column Prefixing

When joining multiple Feature Views, column name collisions can occur.

10.6.1 Using auto_prefix

training_df = fs.generate_training_set(
    spine_df=training_spine,
    features=[user_orders_fv, user_sessions_fv],
    spine_timestamp_col="SESSION_START_TS",
    auto_prefix=True,
)

# Result columns (example):
# USER_ORDERS__TOTAL_SPEND_7D
# USER_SESSIONS__SESSION_CNT_7D

10.6.2 Using .with_name()

training_df = fs.generate_training_set(
    spine_df=training_spine,
    features=[
        user_orders_fv.with_name("ord"),
        user_sessions_fv.with_name("sess"),
    ],
    spine_timestamp_col="SESSION_START_TS",
)

# Result columns (example):
# ord__TOTAL_SPEND_7D
# sess__SESSION_CNT_7D

10.6.3 Choosing a strategy

Scenario Recommendation
Single Feature View No prefix needed
Multiple FVs, unique names No prefix needed
Multiple FVs, potential collisions Use auto_prefix=True
Want custom short prefixes Use .with_name()

10.7 Model Registry integration

Use open-source libraries (scikit-learn, XGBoost, LightGBM, etc.) when your training data fits in memory, or snowflake.ml.modeling estimators when you need distributed fitting inside the warehouse. Both produce a trained object you can log with Registry.log_model(). See Chapter 9: Preprocessing – Pipeline Classes for guidance on choosing between the two.

📁 Full code: _code/model_registry.py

from snowflake.ml.registry import Registry

reg = Registry(
    session=session,
    database_name="FEATURE_STORE_DEMO",
    schema_name="MODEL_REGISTRY",
)
reg.log_model(
    model,
    model_name="churn_model",
    version_name="V01",
    comment="Trained with USER_ORDER_FV V01, USER_SESSION_FV V01; spine SESSIONS + SESSION_START_TS",
)

10.7.1 Lineage flow

flowchart LR
  SRC[Source tables] --> FV[Feature Views] --> DS[Training Dataset] --> M[Model] --> P[Predictions]

End-to-end lineage from sources through feature views to model


10.8 Batch Inference

10.8.1 Ad-Hoc Scoring

The spine timestamp for inference can be CURRENT_TIMESTAMP() (score as of now) or an event timestamp you want to persist predictions against (e.g. SESSION_START_TS for a specific session):

# Inference spine: entities to score as of now
inference_spine = session.sql("""
    SELECT
        USER_ID,
        CURRENT_TIMESTAMP() AS PREDICTION_TS
    FROM FEATURE_STORE_DEMO.CLICKSTREAM_DATA.USERS
    WHERE IS_ACTIVE = TRUE
""")

# Point-in-time features as Snowpark DataFrame
inference_df = fs.generate_training_set(
    spine_df=inference_spine,
    features=[user_order_fv, user_session_fv],
    spine_timestamp_col="PREDICTION_TS",
)

# Score and persist
predictions = model.predict(inference_df.to_pandas().drop(columns=[...]))
inference_df.write.save_as_table("FEATURE_STORE_DEMO.FEATURE_STORE.PREDICTIONS_LATEST", mode="overwrite")

For frameworks that consume a Snowflake ML Dataset instead, build a spine the same way and use generate_dataset() with an appropriate version.

10.8.2 Continuous Batch Inference with Dynamic Tables

For single-Feature-View models where all features come from one table (no ASOF join required), you can use a Dynamic Table (or DT-backed Feature View) to persist predictions that refresh automatically. Reference the Model Registry’s SQL function directly in the DT definition:

CREATE DYNAMIC TABLE FEATURE_STORE.USER_CHURN_PREDICTIONS
  TARGET_LAG = '1 hour'
  WAREHOUSE  = FS_DEV_WH
AS
SELECT
    USER_ID,
    FEATURE_STORE_DEMO.MODEL_REGISTRY.CHURN_MODEL!PREDICT(
        ORDER_CNT, SPEND_SUM, AVG_ORDER_AMT
    ):output_feature_0::FLOAT AS CHURN_PROBABILITY,
    CURRENT_TIMESTAMP() AS SCORED_AT
FROM FEATURE_STORE.USER_PURCHASE_STATS$V01;

This pattern works well when the model reads directly from a single Feature View’s columns. Each DT refresh re-scores all entities with the latest feature values. To make these predictions discoverable via the Feature Store API and Snowsight, register the scoring DT as a Feature View — see Scoring Feature Views below for worked examples including multi-FV variants.

The Model Registry’s !PREDICT SQL function is automatically registered as IMMUTABLE, which makes it eligible for DT incremental refresh. If you create your own scoring UDF/UDTF outside of Model Registry, you must declare it immutable for the same reason – see UDF immutability for incremental refresh below.

Model version rotation works seamlessly with incremental DTs (tested April 2026)

When you promote a new model version via ALTER MODEL ... SET DEFAULT_VERSION = v_new, the DT engine automatically detects the change and triggers a REINITIALIZE (one-time full refresh) – all rows are re-scored with the new model. After the reinitialize, incremental mode resumes: subsequent source data changes are processed incrementally with the new version. The DT is not broken and no CREATE OR REPLACE is needed.

This means the full pipeline – retrain model, register new version, promote to default, incremental scoring resumes – works end-to-end without manual intervention.

DTs pinned to a specific version (WITH mv AS MODEL my_model VERSION V1 SELECT mv!PREDICT(...)) are unaffected by default-version changes.

Cost implication: Each version rotation triggers a full re-score of the entire scoring table. For very large tables, factor this into your retraining cadence – the one-time reinitialize cost is the price of correctness (no mixed-version predictions).

This behavior is specific to Model Registry’s MODEL()!PREDICT(). Standard IMMUTABLE UDFs do not get this treatment – a CREATE OR REPLACE FUNCTION breaks the DT rather than triggering a graceful reinitialize. See UDF immutability below.

Incremental scoring downstream of full-refresh Feature Views (April 2026)

If the upstream Feature View (USER_PURCHASE_STATS$V01 above) uses full refresh – for example, because its query contains a FLOAT+JOIN pattern or other unsupported incremental construct – the scoring DT can still refresh incrementally. When the upstream DT has a system-derived unique key (from GROUP BY or QUALIFY ROW_NUMBER() = 1), the scoring DT only re-scores entities whose feature values actually changed, rather than re-scoring the entire population on every cycle. Opt in with ALTER DYNAMIC TABLE ... SET REFRESH_MODE = INCREMENTAL (this is a SQL-level capability; the Feature Store Python API does not yet support it). Verify with SHOW UNIQUE KEYS IN <upstream_fv_dt>. See Understanding primary keys in dynamic tables.

ASOF join limitation

At this time, ASOF join is not supported for incremental DT refresh. If your model requires features from multiple Feature Views joined via ASOF (point-in-time correctness), you cannot embed that join in a Dynamic Table definition. Instead, use the ad-hoc scoring pattern above or schedule batch inference via a Task or external orchestrator.

Pipeline lag affects batch inference too

The scoring DT above reads from USER_PURCHASE_STATS$V01, which is itself a Dynamic Table with its own refresh_freq. The features the model scores against are subject to cumulative lag: source pipeline delay + upstream FV refresh + scoring DT target_lag. A Stream/Task/Stored Procedure batch pattern has the same issue – the Task schedule determines when scoring runs, but the upstream features may already be stale.

If the model was trained on point-in-time-perfect features (no lag), it may perform worse in production where features are always somewhat stale. To avoid this training-serving skew, train with a conservative spine that bakes in realistic lag. See Chapter 8: Feature Freshness and Training-Serving Skew for detailed guidance on measuring actual lag and alternative mitigation strategies.

10.8.3 UDF Immutability for Incremental Refresh

Dynamic Tables in incremental refresh mode require every function in the SELECT to be deterministic. The Model Registry’s !PREDICT function is registered as IMMUTABLE automatically, but if you build your own scoring UDF/UDTF outside of Model Registry, you must explicitly declare it immutable. A VOLATILE UDF forces the DT to use full refresh on every cycle, negating the performance benefits of incremental refresh.

SQL:

CREATE OR REPLACE FUNCTION FEATURE_STORE.SCORE_CHURN(
    order_cnt NUMBER, spend_sum NUMBER, avg_order_amt NUMBER
)
RETURNS FLOAT
LANGUAGE PYTHON
IMMUTABLE
RUNTIME_VERSION = '3.11'
HANDLER = 'run'
AS $$
def run(order_cnt, spend_sum, avg_order_amt):
    # Your scoring logic here
    ...
$$;

Snowpark (decorator):

from snowflake.snowpark.functions import udf
from snowflake.snowpark.types import FloatType, IntegerType

@udf(
    name="FEATURE_STORE.SCORE_CHURN",
    input_types=[IntegerType(), FloatType(), FloatType()],
    return_type=FloatType(),
    immutable=True,
    replace=True,
)
def score_churn(order_cnt: int, spend_sum: float, avg_order_amt: float) -> float:
    ...

Snowpark (session.udf.register):

session.udf.register(
    func=score_churn,
    name="FEATURE_STORE.SCORE_CHURN",
    input_types=[IntegerType(), FloatType(), FloatType()],
    return_type=FloatType(),
    immutable=True,
    replace=True,
)

Setting immutable=True causes Snowpark to emit the UDF with IMMUTABLE volatility in SQL. The DT can then treat it as eligible for incremental refresh. See the Dynamic Table supported queries documentation for the full list of incremental refresh requirements.

Standard UDF replacement breaks incremental DTs

Unlike Model Registry (which handles version rotation gracefully via REINITIALIZE – see above), replacing a standard IMMUTABLE UDF with CREATE OR REPLACE FUNCTION while it is referenced by an incremental DT breaks the DT. The DT must be fully recreated (CREATE OR REPLACE DYNAMIC TABLE).

This is an important design consideration: if your UDF logic is expected to change over time (updated model weights, bug fixes, new feature engineering rules), prefer Model Registry for inference UDFs. MR’s MODEL(name)!PREDICT() handles version rotation seamlessly – the DT detects the default-version change, triggers a one-time full re-score, then resumes incremental mode.

For UDFs that must remain outside Model Registry, plan for the DT recreation cost when the function changes. In CI/CD pipelines, this typically means including CREATE OR REPLACE DYNAMIC TABLE in the deployment script alongside CREATE OR REPLACE FUNCTION.


10.9 Scoring Feature Views

A Scoring Feature View is a Dynamic Table that calls a Model Registry function (MODEL()!PREDICT()) to produce predictions, then is registered as a Feature View through the Feature Store API. This makes model outputs discoverable alongside the features that feed them — teams can browse predictions in Snowsight, track lineage from source tables through features to scores, and consume predictions downstream via read_feature_view() just like any other feature.

Three variants cover progressively complex scenarios:

flowchart LR
  subgraph v1[Single_FV]
    FV1[Feature View] --> S1["Scoring DT\nMODEL()!PREDICT()"]
  end
  subgraph v2[Multi_FV_PreJoin]
    FVa[FV A] --> C[Combined FV]
    FVb[FV B] --> C
    C --> S2["Scoring DT"]
  end
  subgraph v3[Orchestrated]
    FVx[FV A] --> GTS["generate_training_set()\n+ ASOF join"]
    FVy[FV B] --> GTS
    GTS --> P["model.predict()"]
  end

Scoring Feature View variants — from single-source to PIT-correct multi-FV

10.9.1 Variant 1: Single-FV Scoring DT

The Continuous Batch Inference section above shows how to build a scoring DT with MODEL()!PREDICT(). To make those predictions discoverable in Snowsight and consumable via the Feature Store API, wrap the same feature_df in a FeatureView and register it:

scoring_fv = FeatureView(
    name="USER_CHURN_SCORES",
    entities=[Entity(name="USER", join_keys=["USER_ID"])],
    feature_df=scoring_df,   # same DataFrame used in the DT example above
    refresh_freq="1 hour",
    desc="Churn probability from USER_PURCHASE_STATS via churn_model",
)
fs.register_feature_view(scoring_fv, version="V01")

All the existing behaviour applies: model version rotation triggers a REINITIALIZE then resumes incremental mode, and UDF immutability rules carry over unchanged.

10.9.2 Variant 2: Multi-FV Pre-Join Pattern

When the model requires features from multiple Feature Views, ASOF join is not available inside a Dynamic Table (see the ASOF limitation callout above). The workaround is to pre-join the upstream FVs into a combined intermediate FV, then score from that single source:

# Step 1: Read upstream Feature Views
orders_df  = fs.read_feature_view(user_order_fv)    # USER_ID, ORDER_CNT, SPEND_SUM, ...
sessions_df = fs.read_feature_view(user_session_fv)  # USER_ID, SESSION_CNT, AVG_DURATION, ...

# Step 2: Join into a combined source (latest-value join on entity key)
combined_df = orders_df.join(sessions_df, "USER_ID")

# Step 3: Register the combined FV
combined_fv = FeatureView(
    name="USER_COMBINED_FEATURES",
    entities=[user_entity],
    feature_df=combined_df,
    refresh_freq="1 hour",
    desc="Pre-joined order + session features for scoring",
)
fs.register_feature_view(combined_fv, version="V01")

# Step 4: Score from the combined FV
scoring_df = session.sql("""
    SELECT
        USER_ID,
        FEATURE_STORE_DEMO.MODEL_REGISTRY.CHURN_MODEL!PREDICT(
            ORDER_CNT, SPEND_SUM, SESSION_CNT, AVG_DURATION
        ):output_feature_0::FLOAT AS CHURN_PROBABILITY,
        CURRENT_TIMESTAMP()          AS SCORED_AT
    FROM FEATURE_STORE.USER_COMBINED_FEATURES$V01
""")

scoring_fv = FeatureView(
    name="USER_CHURN_SCORES",
    entities=[user_entity],
    feature_df=scoring_df,
    refresh_freq="1 hour",
    desc="Churn probability from combined order + session features",
)
fs.register_feature_view(scoring_fv, version="V01")

The DT chain is: upstream FVs → Combined FV → Scoring FV. Each layer refreshes incrementally (subject to the usual constraints), and model version rotation propagates automatically.

Latest-value join, not PIT-correct

The pre-join in Step 2 is an equi-join on entity key — it picks the current latest values from each upstream FV, not the values as-of a specific timestamp. This is appropriate for real-time scoring (score with the freshest features available) but does not provide point-in-time correctness. If PIT-correct multi-FV scoring is required, use Variant 3 below.

10.9.3 Variant 3: Orchestrated Batch Scoring (PIT-Correct)

When the model requires features from multiple Feature Views and predictions must be point-in-time correct (e.g. for regulatory audit, back-testing, or time-series evaluation), use generate_training_set() with an inference spine inside a Snowflake Task or Stored Procedure:

# Inside a Task or Stored Procedure — scheduled or event-triggered
inference_spine = session.sql("""
    SELECT
        USER_ID,
        CURRENT_TIMESTAMP() AS PREDICTION_TS
    FROM FEATURE_STORE_DEMO.CLICKSTREAM_DATA.USERS
    WHERE IS_ACTIVE = TRUE
""")

# ASOF join retrieves PIT-correct feature values
inference_df = fs.generate_training_set(
    spine_df=inference_spine,
    features=[user_order_fv, user_session_fv],
    spine_timestamp_col="PREDICTION_TS",
)

# Score
model_ref = Registry(session=session, database_name="FEATURE_STORE_DEMO",
                     schema_name="MODEL_REGISTRY").get_model("churn_model")
predictions = model_ref.default.run(inference_df, function_name="predict")

# Persist
predictions.write.save_as_table(
    "FEATURE_STORE_DEMO.FEATURE_STORE.USER_CHURN_PREDICTIONS",
    mode="overwrite",
)

This variant trades continuous refresh for correctness: the Task runs on a schedule (e.g. hourly via SCHEDULE = 'USING CRON 0 * * * * UTC'), and each run produces a full PIT-correct scoring snapshot.

Choosing a scoring variant
Single-FV DT Multi-FV Pre-Join Orchestrated (Task/Sproc)
Feature sources One FV Multiple FVs Multiple FVs
Join type Direct read Equi-join (latest) ASOF (PIT-correct)
Refresh Continuous (DT) Continuous (DT chain) Scheduled (Task)
Incremental Yes Yes No (full each run)
PIT-correct N/A (single source) No Yes
Discoverable as FV Yes Yes Manual (register output table)
Best for Simple models, low latency Multi-feature models, real-time Audit, back-testing, compliance

10.10 Dataset Generation at Scale

When training datasets involve many Feature Views (50+) or very large spines (hundreds of millions to billions of rows), the default single-query join strategy can hit memory or shuffle limits. Understanding these constraints helps you right-size your infrastructure and leverage SDK optimisations.

10.10.1 The Batched Join Strategy (SDK 1.32+)

By default, generate_training_set and generate_dataset produce a single SQL statement that LEFT JOINs all Feature Views at once. At high Feature View counts, the final join stage shuffles data proportional to spine_rows × feature_view_count × avg_features_per_fv, which can cause out-of-memory (OOM) failures or queries that never complete.

SDK version 1.32+ automatically batches Feature View joins into groups of ~10, produces intermediate results for each batch, and performs a final merge. This happens transparently – no API changes are required. The improvement can be dramatic:

Scenario Pre-1.32 (single query) 1.32+ (batched)
75+ FVs, ~1,000 features, ~100M spine rows ~40 min on 4XL ~17 min on 4XL Gen2
75+ FVs, ~2,000 features, ~2B spine rows OOM on 4XL ~2 hours on 4XL Gen2

If you are on an older SDK version and cannot upgrade, the manual workaround is to split Feature Views into batches of 10-15, call generate_training_set for each batch, and join the results:

import math

def generate_training_set_batched(
    fs, spine_df, feature_views, spine_timestamp_col, batch_size=10
):
    """Manually batch Feature View joins for older SDK versions."""
    batches = [
        feature_views[i:i + batch_size]
        for i in range(0, len(feature_views), batch_size)
    ]

    result_df = spine_df
    for batch in batches:
        batch_df = fs.generate_training_set(
            spine_df=spine_df,
            features=batch,
            spine_timestamp_col=spine_timestamp_col,
        )
        join_keys = [spine_timestamp_col] + [
            k for fv in batch for k in fv.entity.join_keys
        ]
        result_df = result_df.join(
            batch_df.drop(*spine_df.columns),
            on=list(set(join_keys) & set(result_df.columns)),
            how="left",
        )
    return result_df

10.10.2 Warehouse Sizing for Dataset Generation

The optimal warehouse configuration depends on whether your bottleneck is parallelism (many joins, wide data shuffles) or memory (large intermediate results per node).

Bottleneck Symptom Recommendation
Parallelism High idle execution skew; CPU-bound Scale up standard warehouse size (more nodes)
Memory OOM errors; high disk spill Consider Snowpark Optimized warehouse, or reduce batch size
Both OOM + skew Snowpark Optimized at larger size (3XL+)
Prefer standard Gen2 warehouses

For join-heavy dataset generation workloads, standard Gen2 warehouses typically outperform Snowpark Optimized warehouses of equivalent credit cost. A standard 4XL Gen2 (128 nodes × 16 GB) provides far more parallelism than an SPO 3XL (16 nodes × 256 GB) at similar cost. Only move to SPO when profiling confirms memory – not parallelism – is the constraint.

10.10.3 The Wide Resultset Problem

When training datasets exceed ~500 features across many Feature Views, Snowflake’s query planner and columnar engine encounter non-linear performance degradation. The issue stems from SQL expression count, query compilation complexity, and memory pressure – not just data volume:

Scale Compilation Execution (full resultset)
10 FVs, ~200 features Seconds Minutes
50 FVs, ~1,000 features Minutes Tens of minutes
100 FVs, ~2,000 features 35+ minutes 6+ hours (likely OOM)
Known platform limitation

This is a known Snowflake limitation for wide resultsets. The Feature Store SDK mitigates it with batched joins (see above), but extremely wide datasets (2,000+ features) may still require architectural workarounds described below.

VARIANT column approach: For extreme scale (>500 features), encapsulating each Feature View’s output as a single OBJECT column dramatically reduces SQL expression count:

# Instead of: SELECT entity_id, feat_1, feat_2, ..., feat_200 FROM fv_table
# Use:        SELECT entity_id, OBJECT_CONSTRUCT(*) AS features FROM fv_table

# Benchmark: 100 FVs, 2000 features
# Column-per-feature: ~25 minutes for a single-row collect()
# VARIANT approach:     ~4 seconds for a single-row collect()

Trade-offs of the VARIANT approach:

Aspect Column-per-feature VARIANT per FV
Query compilation Non-linear with feature count Near-constant
Column-level lineage Full Lost
Column-level security Full Lost
Client-side processing Direct DataFrame access Requires json.loads() per row
Memory overhead Standard ~2x (JSON deserialization)

For most use cases, the batched join strategy (SDK 1.32+) combined with appropriate warehouse sizing is sufficient. Reserve the VARIANT approach for the largest deployments where batched joins still hit compilation or memory limits.

10.10.4 Parquet and Iceberg Output

For training datasets that feed distributed ML frameworks (Ray, Spark, PyTorch), writing directly to Parquet avoids the memory bottleneck of to_pandas():

Scale Warehouse Write time
1B rows Standard 4XL ~4 minutes
12B rows Standard 4XL ~35 minutes

Options for Parquet output:

  1. generate_dataset with output_type="parquet" – writes directly to a Snowflake stage; files are readable by Ray’s read_parquet().
  2. Iceberg-backed Feature Views (SDK 1.26+) – Feature Views backed by Dynamic Iceberg Tables that expose data as standard Parquet/Iceberg on external cloud storage. ML frameworks read Iceberg natively; no Snowflake connector needed. See Iceberg-backed Feature Views below.
  3. Snowflake-managed Iceberg tables – queryable via both Snowflake SQL and external Parquet readers. Useful when the same training data is consumed by both Snowflake-based and external ML pipelines.
  4. COPY INTO from a materialized table – if you need fine-grained control over Parquet file layout (row group size, compression).
Skip to_pandas() for large datasets

For datasets exceeding ~100M rows or ~500 features, writing to Parquet and reading directly from your ML framework is both faster and more memory-efficient than to_pandas(). This is especially true when using the VARIANT approach, where to_pandas() returns JSON strings requiring single-threaded json.loads() deserialization.


10.11 Iceberg-Backed Feature Views

Starting with snowflake-ml-python v1.26.0 (February 2026), Feature Views can be backed by Dynamic Iceberg Tables – Dynamic Tables that materialize their output as Parquet files in Iceberg format on external cloud storage (S3, Azure Blob, GCS). This gives ML frameworks direct, native access to feature data without requiring a Snowflake connector or intermediate export step.

10.11.1 Creating an Iceberg-Backed Feature View

Use StorageConfig to specify Iceberg storage when creating a Feature View:

from snowflake.ml.feature_store import FeatureView, StorageConfig, StorageFormat

storage = StorageConfig(
    format=StorageFormat.ICEBERG,
    external_volume='MY_EXTERNAL_VOLUME',
    base_location='feature_store/user_purchase_stats'
)

iceberg_fv = FeatureView(
    name="USER_PURCHASE_STATS_ICE",
    entities=[user_entity],
    feature_df=user_purchase_df,
    timestamp_col="LAST_ORDER_TS",
    refresh_freq="1 hour",
    storage_config=storage,
    desc="User purchase statistics materialized as Iceberg"
)
fs.register_feature_view(feature_view=iceberg_fv, version="V01")
Prerequisites
  • An external volume must be configured for your cloud storage location.
  • refresh_freq is required – Iceberg storage only applies to managed (DT-backed) Feature Views, not static Views.
  • The FeatureStore constructor accepts default_iceberg_external_volume to avoid repeating the volume name in every StorageConfig.

All standard Feature Store APIs (generate_dataset, generate_training_set, list_feature_views, etc.) work with Iceberg-backed Feature Views. The feature data is simultaneously queryable via Snowflake SQL and via any Iceberg-compatible reader.

10.11.2 Training Directly from Iceberg Feature Views

Because Iceberg-backed Feature Views expose standard Parquet files, ML frameworks can read them natively:

import pyiceberg
from pyiceberg.catalog import load_catalog

catalog = load_catalog("snowflake", **catalog_config)
table = catalog.load_table("feature_store_demo.feature_store.USER_PURCHASE_STATS_ICE$V01")
df = table.scan().to_pandas()

This eliminates the need for generate_dataset() or to_pandas() when all you need is the feature data – the Feature View is the training dataset in a standard open format.

10.11.3 Current Limitations

Limitation Detail
Online Feature Tables Not supported with Iceberg storage (future work)
DT partitioning parameters PARTITION_BY, TARGET_FILE_SIZE, and PATH_LAYOUT (GA April 2026) are not yet exposed through StorageConfig – use ALTER DYNAMIC ICEBERG TABLE after registration to set these
View-based Feature Views Iceberg storage requires refresh_freq (DT-backed); it does not apply to View-based FVs
Iceberg FVs as an alternative to the Dataset API

For many training workflows, an Iceberg-backed Feature View provides the same benefits as generate_dataset() – versioned Parquet files accessible to distributed ML frameworks – but in a standard open format that any Iceberg-compatible tool can read. The Feature View is also incrementally maintained by the DT scheduler, so training data stays fresh without manual regeneration.

If your pipeline reads from a single Feature View (or a small number of them) and does not require spine-based ASOF joins, an Iceberg-backed Feature View may be simpler than the generate_dataset() → Parquet → framework path.

Future: Fully automated incremental training with ASOF-in-DT

When ASOF join support lands in Dynamic Tables (stretch goal), the combination with Iceberg-backed Feature Views will enable a fully automated incremental training pipeline:

  1. A DT with ASOF join pre-materializes point-in-time correct training data as an Iceberg table
  2. ML frameworks read the Iceberg table directly for training
  3. As new data arrives, the DT incrementally updates the training set
  4. A downstream scoring DT (using Model Registry’s MODEL()!PREDICT()) incrementally scores new entities
  5. The entire retrain → evaluate → promote → score pipeline operates continuously with no manual orchestration

This is one of the most compelling future-state patterns for Feature Store. See Chapter 12: Advanced Patterns for discussion of Iceberg table integration patterns.


10.12 Best practices

1. Spine design

If your Feature Views follow the recommended convention of aliasing their timestamp column to a standard name (e.g., FV_TS – see Chapter 6: timestamp_col Requirements), name the spine timestamp to match:

# ✅ GOOD: Standardized timestamp name matches FV convention
spine = session.sql("""
    SELECT
        USER_ID,
        SESSION_START_TS AS FV_TS,
        IS_CONVERTED AS LABEL
    FROM FEATURE_STORE_DEMO.CLICKSTREAM_DATA.SESSIONS
""")
# spine_timestamp_col="FV_TS" everywhere — same as every Feature View's timestamp_col

# ✅ ALSO GOOD: Explicit source name when you prefer traceability
spine = session.sql("""
    SELECT
        USER_ID,
        SESSION_START_TS,
        IS_CONVERTED AS LABEL
    FROM FEATURE_STORE_DEMO.CLICKSTREAM_DATA.SESSIONS
""")
# spine_timestamp_col="SESSION_START_TS"

# ❌ BAD: Ambiguous or opaque timestamp column
spine = session.sql("""
    SELECT
        USER_ID,
        SOME_DATE
    FROM FEATURE_STORE_DEMO.CLICKSTREAM_DATA.SESSIONS
""")

2. Feature selection

Registered Feature Views expose .slice(["col1", "col2"]) to restrict columns. Do not assume .select() on the Feature View object.

training_df = fs.generate_training_set(
    spine_df=training_spine,
    features=[
        user_order_fv.slice(["TOTAL_SPEND_7D", "ORDER_CNT_30D"]),
    ],
    spine_timestamp_col="SESSION_START_TS",
)

3. Validate before training

# Check for data issues (column names depend on your spine / FVs)
assert training_df.filter(F.col("LABEL").isNull()).count() == 0
assert training_df.count() > 1000

10.13 Common pitfalls

10.13.1 ❌ Pitfall 1: No timestamp column

Problem: PIT retrieval does not work without a timestamp.

Solution: Always include a spine timestamp and set spine_timestamp_col to its name.

10.13.2 ❌ Pitfall 2: Feature leakage

Problem: Using features computed from future data relative to the spine row.

Solution: Use spine_timestamp_col and validate cutoff behavior.

10.13.3 ❌ Pitfall 3: Column collisions

Problem: Same feature names from multiple Feature Views.

Solution: Use auto_prefix or .with_name().

10.13.4 ❌ Pitfall 4: Wrong API for the framework

Problem: Expecting a Snowpark DataFrame from generate_dataset() or Dataset Parquet from generate_training_set().

Solution: Use the comparison table above; match the method to DL (Dataset) vs classic ML (DataFrame / save_as).

10.13.5 ❌ Pitfall 5: Training-serving skew from pipeline lag

Problem: The model is trained on point-in-time-perfect features (ASOF join returns the exact values available at each event), but at inference time – whether via an OFT, a DT-based scoring pipeline, or a scheduled Task/Sproc – the features are subject to cumulative lag: source ETL delay + upstream Feature View refresh_freq + inference-layer lag (target_lag, Task schedule, etc.). The model sees feature distributions at inference that differ from training, degrading prediction accuracy.

Solution: Train with a conservative spine that shifts the ASOF cutoff back by the expected end-to-end pipeline lag (see Chapter 6: Late-Arriving Data). This ensures the training features reflect realistic staleness. For detailed guidance on measuring actual lag and alternative mitigation strategies, see Chapter 8: Feature Freshness and Training-Serving Skew.


10.14 Summary

Concept Description
Spine DataFrame defining entities and timestamps (and label for training)
generate_dataset() Versioned Snowflake ML Dataset (Parquet on a stage); strong fit for deep learning
generate_training_set() Snowpark DataFrame; persist with save_as or df.write.save_as_table(); strong fit for classic ML
auto_prefix Automatic column prefixing across Feature Views
.with_name() Custom column prefixing
.slice([...]) Restrict features from a registered Feature View
DT-based inference Single-FV models can use a Dynamic Table with Model Registry SQL functions for continuous batch scoring
Scoring Feature Views Register scoring DTs as Feature Views; single-FV (direct), multi-FV pre-join (latest-value), or orchestrated Task/Sproc (PIT-correct)
Iceberg-backed FVs StorageConfig(format=StorageFormat.ICEBERG) materializes feature data as Parquet/Iceberg for native ML framework access (SDK 1.26+)

10.15 Next steps

Continue to Chapter 11: Operations to learn about monitoring, data quality, and operational management.