12  Advanced Patterns

Streaming, multi-Feature Store, CI/CD, testing, wide & sparse data, and advanced architectures

Keywords

snowflake, feature store, ml, machine learning, mlops

12.1 Overview

This chapter covers advanced Feature Store patterns for enterprise-scale deployments, including streaming features, multi-Feature Store architectures, CI/CD integration, testing strategies, multi-region and DR considerations, and complex use cases.

Examples follow the clickstream sample layout: database FEATURE_STORE_DEMO, Feature Store schema FEATURE_STORE, source schema CLICKSTREAM_DATA, warehouse FS_DEV_WH, and tables such as EVENTS, SESSIONS, ORDERS, USERS, and PRODUCTS. Register Feature Views with version strings in V01 format (for example version="V01"). Prefer aggregate column suffixes such as _CNT, _SUM, _AVG, timestamp columns ending in _TS, boolean prefixes IS_, and monetary fields such as TOTAL_AMT.

12.2 Learning Objectives

After completing this chapter, you will be able to:

  • Design streaming feature architectures using short-refresh DTs or external pipelines, and interpret platform roadmap status for native Streaming Feature Views
  • Implement multi-Feature Store patterns with RBAC isolation and cross-domain training
  • Plan multi-region and disaster recovery considerations for Feature Store schemas
  • Integrate Feature Store deployments with CI/CD pipelines and environment promotion
  • Apply structured testing: unit logic, PIT correctness, schema validation, and freshness checks
  • Apply hybrid storage, two-tier Feature View architecture, and Dynamic Table (DT) cascades to manage wide and sparse feature data at scale
  • Use OBJECT_CONSTRUCT and semi-structured types for embeddings, JSON features, and column-count reduction

📂 Chapter code: Browse companion scripts on GitHub


12.3 Platform capabilities and preview status (April 2026)

Several Feature Store capabilities are at different stages of the release lifecycle. Always confirm in Snowflake release notes and your account’s feature bundle before planning production work.

Capability Status Notes
RollupConfig (hierarchical aggregations) Verify availability API names may evolve.
Time-windowed Aggregation (Feature class, tiling) Available via snowflake-ml-python >= 1.24.0 Client-side SDK feature — no account enablement needed. Generates standard SQL under the hood. Official docs may be limited; use help(Feature). See Chapter 7.
Streaming Feature Views Private Preview (snowflake-ml-python >= 1.36) Postgres-backed online store, dual-write kappa path, 2–3s E2E freshness. Account enablement required. See internal chapter for details.
Postgres Online Store (batch FVs) Private Preview (snowflake-ml-python >= 1.36) Drop-in replacement for Hybrid Table OFT with <20ms p50 reads. Account enablement required.
REST Ingest / Query API Private Preview HTTP endpoints for streaming ingestion and online feature retrieval. Bundled with Postgres online store.
Note

These callouts are not removed when features graduate—timelines shift. Re-check Snowflake documentation before committing architecture.


12.4 Streaming Feature Views (roadmap and workarounds)

📁 Full code: _code/streaming_patterns.py

12.4.1 Current state

Many ML use cases (fraud scoring, session personalization, real-time recommendations) need features computed over data that is seconds or minutes old, not hours. Native Streaming Feature Views are the platform’s long-term answer, but may not yet be enabled in every account and region.

Until your account is explicitly enabled, plan on workarounds that approximate low latency:

Workaround 1 – Short-refresh Dynamic Tables. Set refresh_freq to the shortest interval your warehouse budget and data volume allow. A 1-minute refresh is not true streaming, but for many use cases (session-level aggregates, rolling 5-minute counts) the practical difference is negligible.

realtime_fv = FeatureView(
    name="USER_REALTIME_FEATURES",
    entities=[user_entity],
    feature_df=session.table("FEATURE_STORE_DEMO.CLICKSTREAM_DATA.EVENTS"),
    timestamp_col="EVENT_TS",
    refresh_freq="1 minute",
    desc="Near real-time event counts, 1-min refresh cycle",
)

Workaround 2 – External streaming into a View-based Feature View. An external pipeline (Kafka Connect, Snowpipe Streaming, or a custom producer) writes into a staging table. A View-based Feature View wraps that table so it is always current as of the last micro-batch:

flowchart TB
  K[Kafka / Snowpipe Streaming] --> S[EVENTS_STAGING table, continuously appended]
  S --> FV[View-based Feature View always reads latest rows]

External streaming into a View-based Feature View

12.4.2 Illustrative native streaming pattern

When native Streaming Feature Views are available in your account, the API should resemble the following. Confirm the exact signature and parameters in the release notes before using this in production:

streaming_fv = FeatureView(
    name="USER_STREAMING_FEATURES",
    entities=[user_entity],
    source_stream="USER_EVENTS_STREAM",  # Snowflake Stream object
    streaming=True,
    desc="True streaming features -- verify API availability",
)

12.5 Multi-region and disaster recovery

📁 Full code: _code/multi_region_dr.py

Enterprise deployments often require multiple regions and disaster recovery (DR). At a high level:

  • Database replication: Snowflake database replication can replicate databases that hold Feature Store schemas (for example FEATURE_STORE_DEMO / FEATURE_STORE), subject to replication rules, edition, and licensing.
  • Feature definitions: Feature Views and related metadata can be promoted via clone, CI/CD redeploy from repository definitions, or replication-oriented workflows; choose based on whether secondary regions must own writable primaries or read-only copies.
  • Online Feature Tables: Online Feature Tables require separate provisioning per region (and separate serving paths); replication alone does not replace regional online stores or cache layers.
  • Scope: Full multi-region and DR planning is beyond this guide. Use current Snowflake documentation on replication, failover, and business continuity, and involve your platform team for RPO/RTO targets.

12.6 Multi-Feature Store operations

Large organizations rarely run a single Feature Store schema. Different business domains (marketing, fraud, risk, personalization) have distinct data ownership, sensitivity levels, and development cadences. The Snowflake Feature Store maps one-to-one with a schema, so multiple Feature Stores are simply multiple schemas – potentially in different databases.

12.6.1 Cross-domain ML

A cross-domain model (e.g., a churn model that combines marketing engagement, fraud signals, and base demographic features) pulls Feature Views from multiple Feature Store schemas at training time via generate_dataset or generate_training_set. The diagram below illustrates the topology:

flowchart TB
  MKT[Marketing FS] --> X[Cross-domain ML]
  FRD[Fraud FS] --> X
  SH[Shared FS] --> X

Multi-Feature Store architecture feeding cross-domain ML

Each Feature Store instance points to its own schema. At training time the consumer session must have grants on all schemas involved.

12.6.2 RBAC considerations

Domain isolation is enforced through standard Snowflake RBAC. Each Feature Store schema can be owned by a different role. A “shared” schema holds cross-domain features that multiple teams consume. The consuming role only needs USAGE on the schema and SELECT on the underlying objects:

marketing_fs = FeatureStore(session, database="FEATURE_STORE_DEMO", name="MARKETING_FS")
shared_fs    = FeatureStore(session, database="FEATURE_STORE_DEMO", name="SHARED_FS")
fraud_fs     = FeatureStore(session, database="FEATURE_STORE_DEMO", name="FRAUD_FS")
Tip

When generating training data across Feature Stores, the session role must have grants on all schemas involved. Use a dedicated ML_TRAINING role with SELECT grants on each domain’s Feature Store schema, rather than granting broad privileges to individual users.

12.6.3 Cross-domain training: pulling Feature Views from multiple Feature Stores

Create separate FeatureStore instances for each domain, retrieve the Feature Views you need, and pass them all to a single generate_dataset or generate_training_set call:

marketing_fs = FeatureStore(session, database="FEATURE_STORE_DEMO", name="MARKETING_FS",
                            default_warehouse="FS_PROD_WH")
shared_fs    = FeatureStore(session, database="FEATURE_STORE_DEMO", name="SHARED_FS",
                            default_warehouse="FS_PROD_WH")
fraud_fs     = FeatureStore(session, database="FEATURE_STORE_DEMO", name="FRAUD_FS",
                            default_warehouse="FS_PROD_WH")

# Retrieve Feature Views from each domain
campaign_fv = marketing_fs.get_feature_view("CAMPAIGN_FV", "V01")
txn_risk_fv = fraud_fs.get_feature_view("TRANSACTION_RISK_FV", "V01")
user_base_fv = shared_fs.get_feature_view("USER_BASE_FV", "V01")

# Generate a cross-domain training set (any FeatureStore instance can call generate_*)
training_df = shared_fs.generate_training_set(
    spine_df=churn_spine,
    features=[campaign_fv, txn_risk_fv, user_base_fv],
    spine_timestamp_col="EVENT_TS",
)

12.6.4 Cross-domain Feature View: sourcing from another Feature Store

A View-based Feature View in one Feature Store can read from a materialized Feature View in another. This is useful when a domain wants to expose a curated subset of another domain’s features without duplicating the underlying DT:

cross_domain_df = session.sql("""
    SELECT USER_ID, LIFETIME_SPEND, ACCOUNT_AGE_DAYS, LAST_LOGIN_TS
    FROM FEATURE_STORE_DEMO.SHARED_FS."USER_BASE_FV$V01"
""")
cross_domain_df = session.table('FEATURE_STORE_DEMO.SHARED_FS."USER_BASE_FV$V01"').select(
    "USER_ID", "LIFETIME_SPEND", "ACCOUNT_AGE_DAYS", "LAST_LOGIN_TS"
)
user_base_marketing_fv = FeatureView(
    name="USER_BASE_MARKETING_FV",
    entities=[user_entity],
    feature_df=cross_domain_df,
    timestamp_col="LAST_LOGIN_TS",
    refresh_freq=None,
    desc="Curated user base features sourced from SHARED_FS for marketing models",
)

marketing_fs.register_feature_view(user_base_marketing_fv, "V01")

12.6.5 Discovering Feature Views across Feature Stores

Feature Store schemas use Snowflake tags for metadata. You can query across multiple schemas using INFORMATION_SCHEMA or TAG_REFERENCES to build a cross-domain feature catalog:

discovery_df = session.sql("""
    SELECT
        TAG_VALUE AS FEATURE_STORE_SCHEMA,
        OBJECT_NAME AS FEATURE_VIEW_NAME,
        OBJECT_DATABASE,
        OBJECT_SCHEMA
    FROM TABLE(INFORMATION_SCHEMA.TAG_REFERENCES_ALL_COLUMNS(
        'FEATURE_STORE_DEMO.SHARED_FS', 'SCHEMA'))
    WHERE TAG_NAME = 'SNOWML_FEATURE_VIEW_METADATA'
    UNION ALL
    SELECT
        TAG_VALUE, OBJECT_NAME, OBJECT_DATABASE, OBJECT_SCHEMA
    FROM TABLE(INFORMATION_SCHEMA.TAG_REFERENCES_ALL_COLUMNS(
        'FEATURE_STORE_DEMO.MARKETING_FS', 'SCHEMA'))
    WHERE TAG_NAME = 'SNOWML_FEATURE_VIEW_METADATA'
""")

discovery_df.show()

Alternatively, iterate programmatically:

for fs_name in ["SHARED_FS", "MARKETING_FS", "FRAUD_FS"]:
    fs = FeatureStore(session, database="FEATURE_STORE_DEMO", name=fs_name,
                      default_warehouse="FS_PROD_WH")
    print(f"\n--- {fs_name} ---")
    fs.list_feature_views().select("NAME", "VERSION", "DESC").show()

12.6.6 Feature Sharing Patterns

The cross-domain patterns above work within a single Snowflake account. For broader sharing — across accounts, regions, or organisational boundaries — four complementary patterns cover the spectrum:

1. View-based cross-FS Feature Views (same account). Already demonstrated in Cross-domain Feature View above: a View in the consumer’s Feature Store schema wraps the provider’s materialized DT. The consumer schema holds only the View definition; data stays in the provider schema and is read at query time.

2. Secure Data Sharing / Listings (cross-account). Snowflake Secure Data Sharing and the Snowflake Marketplace can expose a Feature Store schema (or selected objects within it) to consumer accounts:

-- Provider account: share the Feature Store schema
CREATE SHARE FEATURE_STORE_SHARE;
GRANT USAGE ON DATABASE FEATURE_STORE_DEMO TO SHARE FEATURE_STORE_SHARE;
GRANT USAGE ON SCHEMA FEATURE_STORE_DEMO.FEATURE_STORE TO SHARE FEATURE_STORE_SHARE;
GRANT SELECT ON ALL DYNAMIC TABLES IN SCHEMA FEATURE_STORE_DEMO.FEATURE_STORE
  TO SHARE FEATURE_STORE_SHARE;

-- Consumer account: create a database from the share
CREATE DATABASE SHARED_FEATURES FROM SHARE provider_account.FEATURE_STORE_SHARE;

The consumer sees read-only tables whose data is always up-to-date (no copy, no refresh). Feature Store TAGs are not replicated via shares, so the consumer must apply TAGs locally (see Appendix D) or access the shared tables via direct SQL rather than the Feature Store SDK.

3. Single-definition multi-deploy via CI/CD. For organisations that want the same Feature View definitions in multiple schemas or accounts (e.g. regional deployments), parameterise the deployment script and run it once per target. This is the same pattern used for environment promotion (see Chapter 2: Re-register via CI/CD and Repository Layout above):

for env in ["US_EAST", "EU_WEST", "AP_SOUTHEAST"]:
    fs = FeatureStore(session, database=f"{env}_FEATURE_STORE",
                      name="FEATURE_STORE", default_warehouse=f"{env}_FS_WH")
    deploy_all_features(fs, config)  # same config, different target

Each deployment creates its own DTs and materialises data independently — suited for data residency requirements where data cannot leave a region.

4. Materialise once, reference many. When multiple teams consume the same features, prefer one DT materialisation with View-based Feature Views in each consumer’s schema pointing back to the shared DT. This avoids duplicate compute and storage costs:

Strategy Compute cost Storage cost Data freshness Best for
Duplicate DTs per schema N x DT refresh N copies Independent per copy Data residency, isolation
Shared DT + consumer Views 1 x DT refresh 1 copy Shared (same lag) Cost optimisation, single-region
Secure Data Sharing 0 (provider refreshes) 0 (no copy) Always current Cross-account, read-only
Discovery and RBAC for shared features

Consumers need USAGE on the provider’s database and schema plus SELECT on the shared objects. Use the Feature Catalog View or TAG_REFERENCES_ALL_COLUMNS to discover features across schemas. See Chapter 2: RBAC for the recommended role hierarchy that supports cross-schema grants.


12.7 External Data Sources

Feature Views can wrap tables managed by external systems. For dbt integration (View-based Feature Views backed by dbt-managed tables), see Chapter 5: Feature Pipelines – External Orchestration.

12.7.1 Iceberg Tables as Sources

Apache Iceberg tables registered as Snowflake External Tables or Snowflake-managed Iceberg Tables can serve as Feature View sources. The Feature View wraps the Iceberg table with a View (no refresh_freq) since the external system controls data updates:

iceberg_feature_df = session.sql("""
    SELECT USER_ID, TOTAL_AMT, ORDER_TS
    FROM iceberg_catalog.feature_store_demo_clickstream.orders
""")
iceberg_feature_df = session.table("iceberg_catalog.feature_store_demo_clickstream.orders").select(
    "USER_ID", "TOTAL_AMT", "ORDER_TS"
)
iceberg_fv = FeatureView(
    name="USER_FEATURES_ICEBERG",
    entities=[user_entity],
    feature_df=iceberg_feature_df,
    timestamp_col="ORDER_TS",
    desc="User order features from Iceberg-managed table",
)

This is functionally identical to any View-based Feature View – the Iceberg table is just another source. Freshness depends entirely on the external write cadence.

12.7.2 Iceberg-Backed Feature Views (Dynamic Iceberg Tables)

The reverse pattern – Feature Views that output to Iceberg format – is supported since SDK 1.26.0 via StorageConfig. This creates a Dynamic Iceberg Table: a DT that materializes its output as Parquet files on external cloud storage, with Iceberg metadata for catalog-based access.

from snowflake.ml.feature_store import StorageConfig, StorageFormat

storage = StorageConfig(
    format=StorageFormat.ICEBERG,
    external_volume='MY_EXTERNAL_VOLUME',
    base_location='feature_store/user_features'
)

iceberg_output_fv = FeatureView(
    name="USER_PURCHASE_STATS_ICE",
    entities=[user_entity],
    feature_df=user_purchase_df,
    timestamp_col="LAST_ORDER_TS",
    refresh_freq="1 hour",
    storage_config=storage,
    desc="Purchase stats materialized as Iceberg for external access"
)
fs.register_feature_view(feature_view=iceberg_output_fv, version="V01")
Direction Backing Object Use Case
Iceberg as source View (no refresh_freq) External tools write features to Iceberg; Feature Store wraps them for retrieval
Iceberg as output Dynamic Iceberg Table (refresh_freq required) Feature Store computes features; external tools read them from Iceberg

The Iceberg-as-output pattern is particularly valuable for teams that train models outside Snowflake (Databricks, SageMaker, custom Kubernetes jobs) – they can read feature data via standard Iceberg catalog integration without the Snowflake SDK. For a full discussion of Iceberg-backed Feature Views in training workflows, including limitations and the future automated training pipeline vision, see Chapter 10: Iceberg-Backed Feature Views.


12.8 CI/CD for Feature Store

📁 Full code: _code/cicd_patterns.py

Feature definitions should be treated as production code: version-controlled, reviewed, tested, and deployed through an automated pipeline. This prevents configuration drift between environments and ensures that every Feature View change is auditable.

12.8.1 Feature definition as code

Store Feature View definitions (name, query, entities, refresh frequency, version) in a declarative format (YAML, JSON, or Python config) alongside the deployment script. A CI/CD pipeline triggers on changes to these definitions:

# .github/workflows/feature-deploy.yml
name: Deploy Features

on:
  push:
    branches: [main]
    paths:
      - 'features/**'

jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      
      - name: Deploy Feature Views
        env:
          SNOWFLAKE_ACCOUNT: ${{ secrets.SF_ACCOUNT }}
          SNOWFLAKE_USER: ${{ secrets.SF_USER }}
        run: |
          python scripts/deploy_features.py

The paths filter ensures the pipeline only runs when Feature View definitions change – not on unrelated code changes.

12.8.2 Deployment script

The deployment script reads a configuration file, constructs each Feature View, and registers it. Version labels (V01, V02, …) are part of the configuration so that version bumps are explicit and reviewable in pull requests:

def deploy_feature_view(
    fs: FeatureStore,
    config: dict,
) -> None:
    fv = FeatureView(
        name=config["name"],
        entities=config["entities"],
        feature_df=session.sql(config["query"]),
        timestamp_col=config["timestamp_col"],
        refresh_freq=config.get("refresh_freq"),
    )
    fs.register_feature_view(fv, version=config["version"], block=True)
Environment promotion

Combine this with the environment-strategy patterns from Chapter 2. Parameterize database and schema names so that the same deployment script works across DEV, QA, and PROD by changing environment variables.

Config-driven definitions vs config-driven deployment

This section covers deployment-time config: which Feature Views to register, in which environment, at what version. For definition-time config — programmatically generating the Feature objects and feature_df columns from data or specification tables — see Chapter 7: Data-Driven Feature Definitions. The two patterns complement each other: data-driven definitions produce the Feature View content, and config-driven deployment pushes it through environments.


12.8.3 Reconciling Desired vs Actual State

Feature Store management operates on two layers that are not yet unified by a single tool:

Layer Managed by Scope
Physical DCM Projects, SQL DDL, or Terraform Dynamic Tables, Views, Tags, grants, warehouses
Logical Feature Store Python SDK Entities, Feature View metadata (descriptions, versions, entity associations)

DCM Projects (see Chapter 2: DCM Projects) give you declarative, plan-then-deploy control over the physical objects. But there is no fs.plan() or fs.apply() that compares a config file against the live Feature Store and surfaces a diff at the logical layer. Until a unified declarative API emerges, a pre-deploy reconciliation script bridges the gap:

def reconcile_feature_state(
    fs: FeatureStore,
    desired: list[dict],
) -> dict:
    """Compare desired config against live Feature Store state.

    Returns a report of missing, extra, and version-mismatched FVs.
    """
    live_fvs = {
        fv.name: [v.version for v in fv.versions]
        for fv in fs.list_feature_views()
    }
    desired_map = {d["name"]: d["version"] for d in desired}

    missing   = [n for n in desired_map if n not in live_fvs]
    extra     = [n for n in live_fvs if n not in desired_map]
    mismatch  = [
        n for n in desired_map
        if n in live_fvs and desired_map[n] not in live_fvs[n]
    ]

    return {
        "missing_in_live": missing,
        "unexpected_in_live": extra,
        "version_mismatch": mismatch,
        "in_sync": len(missing) == 0
            and len(extra) == 0
            and len(mismatch) == 0,
    }

Run this before every deployment to catch drift early. In CI, a non-empty missing or mismatch list can trigger the deployment step; a non-empty extra list raises a warning for manual review.

Deployment strategies. How you handle the diff depends on your risk tolerance:

Strategy Behaviour Best for
Append-only Never mutate existing versions; only register new ones (V01V02 → …) Production environments; clean audit trail; consumers stay on a known version until they opt in
Full-redeploy CREATE OR REPLACE every object from config on each deploy Dev/test environments where speed matters more than stability
Hybrid Append-only in PROD; full-redeploy in DEV/TEST Most teams — balances safety with iteration speed
Emerging tooling

The Snowflake ecosystem is actively evolving in this space. Snowflake CLI (snow object create, snow object describe) provides imperative object management. The Snowflake Terraform provider can declare DTs and TAGs. DCM Projects handle plan/deploy for supported objects. None of these currently understand Feature Store Entities or Feature Views as first-class constructs — but the TAG-based registration approach (see Appendix D) means any tool that can create a DT and apply TAGs can produce a valid Feature View. Watch for future SDK enhancements that may add a native fs.plan() / fs.apply() workflow.

12.8.4 Recommended Repository Layout

A consistent directory structure makes it easy for multiple teams to contribute Feature Views through a shared CI/CD pipeline:

feature-store-repo/
├── features/                    # One directory per Feature View
│   ├── user_purchase/
│   │   ├── config.yaml          # Name, version, entities, refresh_freq, description
│   │   └── feature_df.py        # Snowpark DataFrame logic (or SQL file)
│   └── user_session/
│       ├── config.yaml
│       └── feature_df.py
├── entities/
│   └── user.yaml                # Entity name, join_keys, description
├── scripts/
│   ├── deploy_features.py       # Reads configs, registers Feature Views
│   └── reconcile.py             # Pre-deploy diff (see above)
├── tests/
│   ├── unit/                    # DataFrame logic tests
│   ├── integration/             # End-to-end PIT, schema, freshness tests
│   └── conftest.py              # Snowpark session fixtures
├── config/
│   ├── dev.yaml                 # Environment-specific parameters
│   ├── test.yaml
│   └── prod.yaml
└── .github/workflows/
    └── feature-deploy.yml       # CI/CD pipeline (see above)

Each config.yaml is a declarative spec for one Feature View; feature_df.py contains the Snowpark or SQL logic that produces the feature_df. The deployment script iterates over features/*/config.yaml, calls reconcile_feature_state() to check for drift, and registers any missing or updated versions. Entity definitions in entities/ are registered first since Feature Views depend on them.


12.9 Testing strategies

📁 Full code: _code/testing_strategies.py

Reliable Feature Store operations depend on tests that go beyond “the job finished.” The patterns below are the minimum bar for advanced deployments. Most can be implemented with pytest and Snowpark’s local testing framework or a dedicated Snowflake test account.

1. Unit tests for feature logic

Test DataFrame or Snowpark transformations in isolation: build small fixtures (or sampled tables from FEATURE_STORE_DEMO.CLICKSTREAM_DATA), run the same logic your Feature View uses, and assert row counts, column names, and values—including naming conventions (ORDER_CNT, TOTAL_AMT_SUM, LAST_ORDER_TS).

import pytest
from snowflake.snowpark.functions import col

def test_user_order_aggregates_schema_and_totals(order_stats_df):
    """order_stats_df: result of GROUP BY USER_ID on ORDERS-like input."""
    cols = order_stats_df.columns
    assert "USER_ID" in cols
    assert "ORDER_CNT" in cols
    assert "TOTAL_AMT_SUM" in cols
    row = order_stats_df.filter(col("USER_ID") == 42).collect()
    assert row[0]["ORDER_CNT"] == 3

2. PIT correctness tests (no future leakage)

After generate_dataset, assert that feature view timestamps never exceed the spine’s event time for the same row.

from snowflake.snowpark.functions import col

def test_no_future_leakage(fs, test_spine, fv):
    dataset = fs.generate_dataset(
        spine_df=test_spine,
        features=[fv],
        spine_timestamp_col="EVENT_TS",
        include_feature_view_timestamp_col=True,
    )
    df = dataset.read.to_snowpark_dataframe()
    assert df.filter(col("FV_TS") > col("EVENT_TS")).count() == 0

Extend this pattern with spot checks on known entities and times (golden spines) and with multiple Feature Views joined on the same spine.

3. Schema validation

Before or after register_feature_view, assert that outputs match an expected schema: column set, types, and (where your contracts require it) order. Options include:

  • DESCRIBE TABLE or DESCRIBE VIEW on the underlying object (see Chapter 11 for how to identify the object type), compared to a checked-in manifest
  • Queries against INFORMATION_SCHEMA.COLUMNS for FEATURE_STORE_DEMO.FEATURE_STORE
  • API metadata from fs.get_feature_view(name="USER_ORDER_FV", version="V01") where available

Keep one manifest (YAML/JSON) as the source of truth shared by registration scripts and tests.

4. Freshness and quality hooks

Combine the above with operational checks (lag, null rates) as in Chapter 11—those belong in monitoring as well as in smoke tests after deploy.

def test_feature_freshness():
    lag = get_feature_lag("USER_ORDER_FV")
    assert lag < timedelta(hours=2)

def test_feature_quality():
    null_rate = get_null_rate("USER_ORDER_FV", "TOTAL_AMT_SUM")
    assert null_rate < 0.01

12.10 Wide & sparse feature data

ML feature tables are often wide. Sensor data, one-hot encodings, interaction features, and aggregation windows routinely produce tables with hundreds to thousands of columns. In many domains the data is also sparse – a large fraction of feature values are NULL or zero.

Scenario Typical column count Sparsity
Sensor aggregations (manufacturing, IoT) 400–600 30–50%
One-hot encoded categoricals 500–10,000+ 95%+
User-item interaction matrices 1,000–100,000+ 99%+
Time-windowed aggregation features 200–2,000 40–70%

Feature Store tables inherit these characteristics. When generate_training_set() or inference queries must process wide Feature Views, several Snowflake-specific costs compound.

12.10.1 Why wide tables hurt in Snowflake

SQL compilation cost. Every column in a query must be parsed, resolved, and optimized. At 500+ columns, compilation can dominate total query time – particularly for UNPIVOT, PIVOT, and SELECT * operations.

Micro-partition overhead. Each column adds metadata overhead (min/max statistics, null counts, bloom filters). At 500+ columns per-partition metadata grows substantially, impacting partition pruning and scan initialisation.

Column limits. Standard tables, Iceberg tables, and Dynamic Tables all have a practical limit of ~2,000 columns. Hybrid Tables (for online serving) have similar limits, with latency proportional to row width.

Transfer and memory expansion. Wide tables transferred via Arrow or Parquet carry every column even when sparse. There is also a known Snowflake Datasets issue where all numeric columns may be expanded to full-width NUMBER(38,0) in the underlying Parquet files, causing significant memory bloat when loaded into Pandas.

Feature View joins at scale. generate_training_set() performs ASOF joins across Feature Views. More columns per Feature View increases the join payload. Splitting wide data across many Feature Views (vertical partitioning) multiplies the number of joins instead.

12.10.2 Column count thresholds

Feature count Recommendation
< 100 Standard wide table – no special handling needed
100–500 Consider hybrid packing for sparse features (> 60% sparse)
500–2,000 Hybrid packing recommended; consider DT cascade
> 2,000 Hybrid packing mandatory; vertical partitioning for logical groups

12.10.3 Semi-structured types in Feature Views

Snowflake’s VARIANT, OBJECT, and ARRAY types are first-class column types and work inside Feature Views. They serve three roles: vector embeddings, JSON-structured data, and packed wide features.

Embeddings and arrays. ML embeddings are naturally represented as ARRAY or VECTOR columns, avoiding hundreds of individual dimension columns:

embedding_df = session.sql("""
    SELECT USER_ID, EMBEDDING_VECTOR, LAST_UPDATED_TS
    FROM FEATURE_STORE_DEMO.FEATURE_STORE.USER_EMBEDDINGS
""")
embedding_df = session.table("FEATURE_STORE_DEMO.FEATURE_STORE.USER_EMBEDDINGS").select(
    "USER_ID", "EMBEDDING_VECTOR", "LAST_UPDATED_TS",
)
embedding_fv = FeatureView(
    name="USER_EMBEDDINGS",
    entities=[user_entity],
    feature_df=embedding_df,
    timestamp_col="LAST_UPDATED_TS",
    desc="User embeddings as ARRAY column",
)
import numpy as np
df = fs.generate_dataset(spine_df=spine, features=[embedding_fv],
                         spine_timestamp_col="EVENT_TS").read.to_pandas()
X_embed = np.stack(df["EMBEDDING_VECTOR"].values)

JSON features. User preferences, configuration objects, or other nested structures can be stored as VARIANT or OBJECT columns, preserving hierarchical structure:

prefs_df = session.sql("""
    SELECT USER_ID, PREFERENCES, UPDATED_TS
    FROM FEATURE_STORE_DEMO.CLICKSTREAM_DATA.USER_PROFILES
""")
prefs_df = session.table("FEATURE_STORE_DEMO.CLICKSTREAM_DATA.USER_PROFILES").select(
    "USER_ID", "PREFERENCES", "UPDATED_TS",
)
json_fv = FeatureView(
    name="USER_PREFERENCES",
    entities=[user_entity],
    feature_df=prefs_df,
    timestamp_col="UPDATED_TS",
    desc="User preferences as VARIANT/OBJECT column",
)

12.10.4 Hybrid storage: key columns + OBJECT payload

The recommended pattern for wide Feature Views is hybrid storage: entity keys, timestamp, and any frequently queried columns remain as native columns; all other features are packed into a single OBJECT column. This combines efficient joins/filtering on keys with compact storage for the feature payload.

Aspect Hybrid (keys + OBJECT) Standard wide table
SQL queryability on keys Excellent Excellent
Storage efficiency (sparse data) Good – absent keys = no storage Poor – NULLs/zeros stored explicitly
Column-limit safe Yes – one OBJECT column regardless of feature count No – hits ~2,000 limit
SQL compile time Low – single column in projection High – each column parsed individually
ML library compatibility Good – unpack in Python Excellent – direct DataFrame

Packing with OBJECT_CONSTRUCT:

SELECT
    USER_ID,
    OBJECT_CONSTRUCT(* EXCLUDE (USER_ID, UPDATED_TS)) AS FEATURES,
    UPDATED_TS
FROM FEATURE_STORE_DEMO.CLICKSTREAM_DATA.USER_WIDE_FEATURES;

The * EXCLUDE syntax packs all columns except the named ones. OBJECT_CONSTRUCT automatically drops NULL keys, providing inherent sparsity compression.

NULL vs zero handling

Use OBJECT_CONSTRUCT (drops NULLs) when absent features should default to a known value (e.g., 0). Use OBJECT_CONSTRUCT_KEEP_NULL when you need to preserve the distinction between NULL (missing/unknown) and zero (measured as zero). For most ML use cases where sparse features default to 0, combining NULLIF(col, 0) with OBJECT_CONSTRUCT eliminates both NULL and zero keys from the OBJECT, maximising compression:

OBJECT_CONSTRUCT(
    'OHE_CAT_1', NULLIF(OHE_CAT_1, 0),
    'OHE_CAT_2', NULLIF(OHE_CAT_2, 0),
    ...
) AS SPARSE_FEATURES

For dynamically generated columns (e.g., per-category aggregates where new categories appear), use OBJECT_AGG:

SELECT
    USER_ID,
    OBJECT_AGG(CATEGORY_NAME, SPEND_AMT) AS CATEGORY_SPEND_MAP,
    MAX(UPDATED_TS) AS UPDATED_TS
FROM FEATURE_STORE_DEMO.CLICKSTREAM_DATA.USER_SPEND_BY_CATEGORY
GROUP BY USER_ID;

Registering the Feature View:

packed_fv = FeatureView(
    name="USER_CATEGORY_SPEND",
    entities=[user_entity],
    feature_df=session.table("FEATURE_STORE_DEMO.FEATURE_STORE.USER_CATEGORY_SPEND_V"),
    timestamp_col="UPDATED_TS",
    desc="Per-category spend packed into OBJECT column",
)

Unpacking in Python:

import pandas as pd

df = fs.generate_dataset(
    spine_df=spine, features=[packed_fv],
    spine_timestamp_col="EVENT_TS",
).read.to_pandas()

features_df = pd.json_normalize(df["FEATURES"])
training_df = pd.concat([df.drop(columns=["FEATURES"]), features_df], axis=1)

For sparse matrices (e.g., feeding XGBoost or LightGBM which accept sparse input natively):

from scipy.sparse import coo_matrix
import json

records = df["FEATURES"].apply(json.loads).tolist()
feature_names = sorted(set(k for r in records for k in r))
name_to_idx = {n: i for i, n in enumerate(feature_names)}

rows, cols, vals = [], [], []
for i, rec in enumerate(records):
    for k, v in rec.items():
        rows.append(i)
        cols.append(name_to_idx[k])
        vals.append(float(v))

X_sparse = coo_matrix((vals, (rows, cols)), shape=(len(records), len(feature_names))).tocsr()

12.10.5 Two-tier Feature View architecture

For production-grade wide Feature Views backed by Dynamic Tables, the recommended pattern is a two-tier architecture: a materialized bundled DT underneath, with an expansion View on top. The expansion View presents the same wide-column schema that consumers expect while the bundled DT stores data compactly.

flowchart TB
  USQL[User Feature SQL wide columns] --> BUN[MY_FEATURES__BUNDLED Dynamic Table]
  BUN --> EXPV[MY_FEATURES View with casts from OBJECT]

Two-tier bundled Dynamic Table with expansion View

Object Name Type Consumer-facing?
Expanded Feature View {name} View Yes – consumers query this
Bundled Feature View {name}__BUNDLED Dynamic Table Internal – efficient storage

The consumer-facing name goes on the expansion View so that SELECT * FROM my_features returns the expected wide columns. The bundled DT is an implementation detail.

Expansion SQL:

The expansion query can be used in two ways – either as a View-based Feature View registered through the Feature Store API, or as a standalone CREATE VIEW statement managed outside the API:

-- Expansion query: extract sparse features from the bundled OBJECT
SELECT
    USER_ID,
    UPDATED_TS,
    AGE,
    INCOME,
    CITY,
    COALESCE(SPARSE_FEATURES:OHE_CAT_1::INTEGER, 0)  AS OHE_CAT_1,
    COALESCE(SPARSE_FEATURES:OHE_CAT_2::INTEGER, 0)  AS OHE_CAT_2,
    COALESCE(SPARSE_FEATURES:OHE_CAT_3::INTEGER, 0)  AS OHE_CAT_3
    -- ... generated from metadata
FROM FEATURE_STORE_DEMO.FEATURE_STORE.USER_WIDE_FEATURES__BUNDLED
# Option A: Register as a View-based Feature View (preferred — visible in the Feature Store)
expansion_df = session.sql("SELECT ... FROM USER_WIDE_FEATURES__BUNDLED")  # query above

wide_fv = FeatureView(
    name="USER_WIDE_FEATURES", entities=[user_entity],
    feature_df=expansion_df, timestamp_col="UPDATED_TS",
    refresh_freq=None,  # View — reads from the bundled DT at query time
    desc="Wide expansion of USER_WIDE_FEATURES__BUNDLED",
)
wide_fv = fs.register_feature_view(feature_view=wide_fv, version="V01")
-- Option B: Standalone CREATE VIEW (managed outside the Feature Store API)
CREATE VIEW FEATURE_STORE_DEMO.FEATURE_STORE.USER_WIDE_FEATURES AS
SELECT ... FROM FEATURE_STORE_DEMO.FEATURE_STORE.USER_WIDE_FEATURES__BUNDLED;
TAG registration required for API visibility

Option B creates the View outside the Feature Store API. It will not appear in fs.list_feature_views() or the Snowsight Feature Store UI unless the correct internal TAGs are applied. See Appendix D: Tag Convention for the required TAG schema.

COALESCE with the appropriate default (0 for numeric, NULL for strings) handles features that were absent from the OBJECT due to sparsity.

When NOT to bundle

If the Feature View has no refresh frequency (i.e., it is a plain View with no materialization), bundling serves no purpose – there is no stored data to compress. The packing/unpacking adds unnecessary query-time overhead. Only apply the two-tier pattern to DT-backed Feature Views.

12.10.6 Schema evolution with IMMUTABLE WHERE

Adding a new derived feature expression to the upstream VIEW (e.g., a computed ratio) is picked up by the bundled DT incrementally (REFRESH_ACTION = INCREMENTAL) — not as a REINITIALIZE. Every row in the mutable window whose OBJECT would gain the new key is recomputed; the full row-level recompute (ins=N/del=N) applies to the mutable region. For DTs with large history, this means recomputing all mutable rows just to add one new key.

IMMUTABLE WHERE constrains the incremental recompute to only the mutable window. The constraint can be added retroactively to an existing DT:

-- 1. Protect history on the bundled DT
ALTER DYNAMIC TABLE my_features__bundled SET
    IMMUTABLE WHERE (ts < '2026-04-01'::TIMESTAMP_NTZ);

-- 2. Replace the upstream VIEW with the new feature
CREATE OR REPLACE VIEW feature_source_v AS
SELECT entity_id, ts,
       OBJECT_CONSTRUCT(* EXCLUDE (entity_id, ts)) AS features
FROM source_with_new_derived_column;

-- 3. Refresh — only mutable rows recomputed
ALTER DYNAMIC TABLE my_features__bundled REFRESH;

-- 4. (Optional) Remove constraint
ALTER DYNAMIC TABLE my_features__bundled UNSET IMMUTABLE WHERE;

After the refresh, immutable rows retain their original OBJECT (e.g., keys {A, B, C}), while mutable rows have the full key set including the new feature (e.g., {A, B, C, D}). The expansion View handles this transparently – COALESCE(features:"D"::NUMBER, 0) returns the default for immutable rows, the same pattern already used for sparse features.

OBJECT columns make heterogeneous key sets natural

In a scalar DT, every row must have the same column list – partial recompute via BACKFILL produces visible NULL columns. With OBJECT columns, rows with different key sets coexist naturally. A missing key is indistinguishable from a sparse feature that was absent due to OBJECT_CONSTRUCT dropping NULLs. Consumers already handle this via COALESCE, so no additional logic is needed.

Lazy and phased backfill

If the new feature is needed across all history, you can either remove the constraint and run a single full refresh, or phase the backfill by progressively moving the immutability boundary backward – e.g., one month at a time – so each refresh recomputes only a slice of history. Phased backfill spreads the compute cost across multiple windows, which is useful when the full history is too large for a single off-hours refresh or when warehouse budget is constrained. See Chapter 4: Retroactive IMMUTABLE WHERE and Chapter 6: Optimising backfill for full examples.

12.10.7 Choosing between SELECT * column pickup and OBJECT bundling

Snowflake updated Dynamic Table behaviour so that a DT defined with SELECT * FROM base_table can now pick up new columns added to that base table on its next refresh, rather than failing. This creates a new lightweight path for schema evolution alongside the OBJECT-bundled pattern.

The two approaches solve overlapping but distinct problems:

flowchart LR
    subgraph pathA ["Path A: SELECT * pickup (automatic, uncontrolled)"]
        A_src["Base table<br/>ADD COLUMN"] -->|"automatic trigger"| A_dt["DT SELECT *<br/>full recompute<br/>cascades downstream"]
        A_dt --> A_cons["Consumer sees<br/>new scalar column"]
    end
    subgraph pathB ["Path B: OBJECT via VIEW gate (controlled)"]
        B_src["Base or source<br/>ADD COLUMN"] -->|"no effect on DT"| B_gate["VIEW replacement<br/>(deliberate operator action)"]
        B_gate --> B_dt["OBJECT DT<br/>INCREMENTAL on<br/>your schedule"]
        B_dt --> B_cons["Consumer sees<br/>new key in OBJECT<br/>schema unchanged"]
    end

SELECT * column pickup vs OBJECT-bundled schema evolution

Dimension SELECT * scalar columns OBJECT inline in DT body OBJECT via VIEW intermediary
Feature count ≤ ~100 cols (compilation grows) Recommended for 100+ cols; schema constant Same — schema constant
Source of new columns Base table physical columns only Base table physical columns only Any expression in the VIEW (computed ratios, windowed functions)
DT schema on column add New scalar column added to DT schema DT schema unchanged (OBJECT column grows) DT schema unchanged
Triggered by upstream ADD COLUMN? ⚠️ Yes — automatic, no human action required ⚠️ Yes — automatic, no human action required No — gated by VIEW replacement; nothing happens until you replace the VIEW
Who controls when recompute happens? Upstream table owner (you have no veto) Upstream table owner (you have no veto) You — VIEW replacement is a deliberate operator action
REFRESH_ACTION on column add INCREMENTAL (full row-level recompute) INCREMENTAL (full row-level recompute) INCREMENTAL — but only when you replace the VIEW
Pre-existing rows for new column Recomputed from source — no NULLs Recomputed from source — no NULLs Recomputed from source — no NULLs (or NULL for protected region if using IMMUTABLE WHERE)
Cost of column add Full row-level recompute (ins=N/del=N); cascades to all downstream SELECT * DTs Full row-level recompute (ins=N/del=N); no cascade beyond this DT Full row-level recompute (INCREMENTAL, ins=N/del=N) on your schedule; IMMUTABLE WHERE limits how much mutable history is recomputed
IMMUTABLE WHERE Respected: immutable rows get NULL; mutable recomputed Respected: immutable rows get NULL key; mutable recomputed ✅ Best controlled here — you choose exactly when to ALTER the VIEW and how far back the mutable window reaches
Downstream DT cascade ⚠️ Downstream SELECT * DTs also recompute; cascade multiplies cost ✅ No cascade schema impact — downstream sees a stable OBJECT column ✅ No cascade schema impact
Sparse features / null handling Explicit NULL scalar columns stored OBJECT_CONSTRUCT drops NULL-valued keys; use OBJECT_CONSTRUCT_KEEP_NULLS when a complete, stable key-set is required — e.g. for position-aware or schema-validating consumers Same — use OBJECT_CONSTRUCT_KEEP_NULLS in the VIEW or DT body when downstream must always see a full key-set
Consumer query change on add None if using SELECT * (but schema changes) OBJECT path extraction on new key OBJECT path extraction on new key
Version / group control ❌ None — new columns visible immediately to all consumers ❌ None — new keys appear immediately, automatically ✅ Explicit — VIEW definition is the gate; you decide what enters the bundle and when
Computed expressions No No Yes — VIEW can add ratios, flags, windowed computations
Data-driven derivations No ✅ TRANSFORM + OBJECT_KEYS auto-expand with new keys ✅ LATERAL FLATTEN + OBJECT_AGG auto-expand with new keys in presentation VIEW
Multi-team / production safety ⚠️ Risky — any upstream schema change triggers recompute across all consumers ⚠️ Moderate — recompute contained to this DT, but still uncontrolled ✅ Safe — upstream changes are inert until intentionally promoted through the VIEW
SELECT * and inline OBJECT_CONSTRUCT(*) are triggered by upstream schema changes you do not control

Both the SELECT * scalar pattern and the inline OBJECT_CONSTRUCT(* EXCLUDE ...) pattern respond automatically to any ADD COLUMN on the upstream base table. You — or your consumers — do not need to take any action; the recompute simply happens on the next refresh.

In a multi-team or production environment this means an upstream table owner adding a new column (perhaps for an unrelated purpose) will trigger a full row-level recompute (ins=N/del=N) on your DT. For the SELECT * scalar pattern this cascade continues into every downstream SELECT * DT, multiplying the cost.

The VIEW intermediary pattern provides a gate. An ADD COLUMN on the source leaves the DT completely unaffected until you deliberately replace the VIEW. You control the timing, you can apply IMMUTABLE WHERE to limit how much history is recomputed, and you can review and test the change before it enters production consumers.

Use SELECT * scalar pickup when:

  • The DT is a thin pass-through over a single managed base table that you own and control.
  • Feature count stays below ~100 columns.
  • Consumers tolerate dynamic schemas (or use their own SELECT *).
  • No explicit consumer version gate is needed, and no uncontrolled upstream owners.
  • Dev/experimentation contexts where the automatic recompute is acceptable.

**Use inline OBJECT_CONSTRUCT(*) in the DT body when:**

  • Feature count ≥ 100 columns, or you want a stable DT schema regardless of source width.
  • You own (or tightly coordinate with) the base table — unintended upstream ADD COLUMN will still trigger an automatic full recompute on this DT.
  • You want data-driven derivations to auto-expand with no operational intervention (see confirmed pattern below).
  • Downstream DTs read from this DT and must not gain new scalar columns or be recomputed (the OBJECT column is stable; downstream is insulated from the cascade).

Use VIEW intermediary → OBJECT DT when:

  • The source table is owned by another team, or you need explicit control over which upstream columns enter the feature bundle and when.
  • New features include computed expressions (ratios, flags, windowed functions) that do not exist as physical base-table columns.
  • You want to gate schema changes behind a human review step before recompute occurs.
  • You need fine-grained backfill control: use IMMUTABLE WHERE to recompute only the mutable region, and progressively move the boundary backward to backfill history as budget permits. With SELECT * / inline OBJECT you have no such gate.
  • The LATERAL FLATTEN + OBJECT_AGG presentation layer still auto-adapts to new keys without any VIEW changes (confirmed by prototype).
Confirmed: inline OBJECT_CONSTRUCT(*) + TRANSFORM auto-evolves with source schema

Both OBJECT_CONSTRUCT(* EXCLUDE ...) written directly in the DT body and downstream TRANSFORM + OBJECT_KEYS derivation chains have been confirmed to auto-evolve when new columns are added to the base table — with INCREMENTAL refresh action throughout.

-- Single DT: passthrough OBJECT + auto-squared derived features
-- When a new column is added to source_base_table, both raw_features
-- and derived_features update automatically on next refresh.
CREATE DYNAMIC TABLE FEATURE_STORE."USER_FEATURES_DERIVED$V01"
    TARGET_LAG   = '1 hour'
    WAREHOUSE    = FS_WH
    REFRESH_MODE = INCREMENTAL
AS
WITH src AS (
    SELECT
        USER_ID,
        EVENT_TS,
        OBJECT_CONSTRUCT(* EXCLUDE (USER_ID, EVENT_TS)) AS raw_features
    FROM SOURCE_DB.DATA.USER_FEATURES_BASE
)
SELECT
    USER_ID,
    EVENT_TS,
    raw_features,
    ARRAYS_TO_OBJECT(
        TRANSFORM(OBJECT_KEYS(raw_features), k -> k || '_sq'),
        TRANSFORM(OBJECT_KEYS(raw_features),
                  k -> (GET(raw_features, k)::FLOAT
                        * GET(raw_features, k)::FLOAT)::FLOAT)
    ) AS derived_sq_features
FROM src;

When USER_FEATURES_BASE gains a new column (e.g., RISK_SCORE): - raw_features automatically gains RISK_SCORE key on next refresh - derived_sq_features automatically gains RISK_SCORE_sq key - No VIEW update, no DT recreation, no new version registration - REFRESH_ACTION = INCREMENTAL throughout

The VIEW intermediary is the right choice when new features require computed expressions that do not exist as physical base-table columns. In that case, replace the VIEW — the OBJECT(*) DT picks up the new derived column incrementally (REFRESH_ACTION = INCREMENTAL), and any downstream presentation queries using LATERAL FLATTEN + OBJECT_AGG will automatically include the new key’s transformed value.

For test results and detailed behaviour, see _internal_development/dt_schema_evolution/internal_notes.md.

See Chapter 4: Incremental Column Pickup with SELECT * for the full DT behaviour details, backfill semantics, and IMMUTABLE WHERE interaction.

12.10.8 ARRAY-based feature bundles

OBJECT bundling stores features as named key-value pairs. For fixed-width, positionally-defined feature vectors — e.g. homogeneous numeric features where the consumer knows the schema in advance — ARRAY_CONSTRUCT and ARRAY_CONSTRUCT_COMPACT offer an alternative with different null and storage semantics.

12.10.8.1 SQL NULL vs JSON null in Snowflake arrays

Snowflake distinguishes two null-like values inside semi-structured arrays:

Value How produced Array representation Stored?
SQL NULL Column IS NULL or absent undefined No — sparse gap, efficient
JSON null PARSE_JSON('null') null Yes — real element

This has a direct consequence for which ARRAY variant you choose:

WITH T AS (
    SELECT
        NULL          AS SQL_NULL,
        'hello'       AS HELLO,
        PARSE_JSON('null') AS JSON_NULL,
        'goodbye'     AS GOODBYE,
        3::DOUBLE     AS THREE_DOUBLE,
        4             AS FOUR_INT
)
SELECT
    ARRAY_CONSTRUCT(*)         AS ac,
    ARRAY_CONSTRUCT_COMPACT(*) AS acc
FROM T;
ac  = [undefined, "hello", null, "goodbye", 3.0, 4]   -- SQL NULL → undefined gap
acc = [           "hello", null, "goodbye", 3.0, 4]   -- SQL NULL removed; JSON null kept
  • ARRAY_CONSTRUCT(*) preserves all positions. SQL NULL becomes an undefined element — a sparse gap that does not consume the same storage as a real element. JSON null is preserved as a real null element.
  • ARRAY_CONSTRUCT_COMPACT(*) drops SQL NULL arguments entirely, shifting remaining elements left. JSON null values are preserved. Use when you want to exclude absent features from the bundle entirely.

12.10.8.2 Storage characteristics

Snowflake arrays are stored as semi-structured VARIANT data internally:

  • Dense typed arrays (ARRAY(FLOAT), used for VECTOR-ready serving) fill every position — absent values are replaced with 0.0 — and consume nearly constant storage regardless of sparsity.
  • Semi-structured sparse arrays (VARIANT ARRAY via ARRAY_CONSTRUCT with SQL NULLs) store only the present elements; undefined gaps do not materialise as bytes.
  • The practical per-cell size limit (128 MB) applies to the full ARRAY value, not individual elements.

Measured storage comparison (100K rows, 200-feature schema, uncompressed logical bytes):

Format 10% sparsity 50% sparsity 90% sparsity
OBJECT (named key-value, NULLs absent) 567 MB 315 MB 63 MB
Dense ARRAY(FLOAT) (COALESCE → 0.0) 449 MB 445 MB 441 MB
Sparse positional OBJECT (integer-keyed proxy) 531 MB 295 MB 59 MB

Actual Snowflake compressed storage is substantially smaller (Zstandard typically 5–10×), but relative ratios hold.

Key conclusions:

  • OBJECT and dense ARRAY cross over at approximately 40% sparsity. Below ~40% (most features present), dense ARRAY is more compact — JSON key-value encoding overhead outweighs the benefit of dropped NULL keys. Above ~40%, OBJECT becomes progressively more compact.
  • At 90% sparsity, OBJECT is ~7× more compact than dense ARRAY (63 MB vs 441 MB).
  • Dense ARRAY cost is nearly flat across sparsity levels — absent values are zero-filled, so all N float values are always serialised.

For Feature Views where features are mostly present (low sparsity), dense ARRAY can be the more compact at-rest format. For genuinely sparse feature sets (>~40% of values absent), OBJECT is the better storage choice.

There is no SQL-native sparse positional ARRAY

ARRAY_INSERT(arr, pos, val) in Snowflake is an insert-and-shift operation, not a set-at-position operation. Chaining ARRAY_INSERT calls to build a sparse positional array (with features at known fixed positions) displaces earlier elements, producing an array with incorrect positions and wrong size. There is no SQL primitive equivalent to arr[pos] = val.

Consequences for Feature Store design:

  • You cannot construct a correctly-positioned sparse ARRAY from individual feature columns in pure SQL.
  • Any Feature View that needs features at stable positional indices must use a dense ARRAY with explicit 0.0 fill (ARRAY_CONSTRUCT(COALESCE(f1, 0.0), COALESCE(f2, 0.0), ...)) or the TRANSFORM+OBJECT path described below.
  • The only SQL workaround is to store features in an OBJECT with integer string keys and densify at read time using TRANSFORM + ARRAY_GENERATE_RANGE.

12.10.8.3 Null preservation with ARRAY_CONSTRUCT and OBJECT_CONSTRUCT_KEEP_NULLS

When downstream consumers require a complete, fixed-width representation — for example, a model serving layer that expects a specific array length, or a schema-validating consumer that must see every key — you need to explicitly preserve NULLs:

Goal OBJECT equivalent ARRAY equivalent
Drop absent features OBJECT_CONSTRUCT(...) ARRAY_CONSTRUCT_COMPACT(...)
Preserve absent features as null OBJECT_CONSTRUCT_KEEP_NULLS(...) ARRAY_CONSTRUCT(...) (SQL NULL → undefined)
Preserve absent features as JSON null Explicit COALESCE(col, PARSE_JSON('null')) per column ARRAY_CONSTRUCT(COALESCE(col, PARSE_JSON('null')), ...)
ARRAY_CONSTRUCT(*) and column ordering

ARRAY_CONSTRUCT(*) positional order follows the column order of the source table or query. This order is not guaranteed to be stable across DT recreations or schema changes unless the query uses an explicit column list. For Feature Views where consumers depend on positional indexing, always use an explicit column list rather than *:

ARRAY_CONSTRUCT(FEATURE_A, FEATURE_B, FEATURE_C) AS features_array

Schema evolution (adding a new feature) requires either appending to the end of the explicit list and updating all consumers, or creating a new Feature View version — there is no equivalent of the OBJECT key-based pickup for positional arrays.

12.10.8.4 TRANSFORM + ARRAY_GENERATE_RANGE: O(1)-compile densification

When a downstream consumer requires a dense numeric array or VECTOR — for ML training export, embedding distance queries, or model serving — the recommended path is Higher-Order Function (HOF) densification via TRANSFORM rather than ARRAY_CONSTRUCT.

Why this matters: compilation cost at scale.

ARRAY_CONSTRUCT(f1, f2, ..., fN) with N explicit columns is compiled by the Snowflake query planner as N independent expressions — the plan scales as O(N^1.5). A 1,000-feature ARRAY_CONSTRUCT takes ~2 seconds just to compile, and this cost is paid on every DT refresh that recompiles the query.

TRANSFORM(ARRAY_GENERATE_RANGE(0, N), i -> expr(i)) compiles as a single lambda expression regardless of N — O(1) compilation time, approximately 100–140 ms across the 200–1,000 feature range:

Width ARRAY_CONSTRUCT compile TRANSFORM+RANGE compile
200 79 ms 119 ms (ARRAY_CONSTRUCT faster below ~200)
500 747 ms 139 ms (5.4× faster)
1,000 2,061 ms 105 ms (20× faster)

The crossover is around 200 features. Below ~200, ARRAY_CONSTRUCT is slightly faster to compile. Above 200 — which is the normal range for real-world Feature Views — TRANSFORM wins decisively.

The canonical pattern for densifying an OBJECT or sparse ARRAY to a dense VECTOR:

-- Densify from OBJECT with integer string keys → VECTOR(FLOAT, N)
SELECT
    entity_id,
    event_ts,
    TRANSFORM(
        ARRAY_GENERATE_RANGE(0, 500),
        i -> COALESCE(GET(features, i::VARCHAR)::FLOAT, 0.0)
    )::VECTOR(FLOAT, 500) AS feature_vec
FROM feature_store.user_features_obj_v01;
-- Densify from a pre-built ARRAY(FLOAT) column → VECTOR(FLOAT, N)
-- Use when the ARRAY was materialised by a prior DT layer
SELECT
    entity_id,
    event_ts,
    TRANSFORM(
        ARRAY_GENERATE_RANGE(0, 500),
        i -> COALESCE(GET(arr, i)::FLOAT, 0.0)
    )::VECTOR(FLOAT, 500) AS feature_vec
FROM feature_store.user_features_arr_v01;

Important: the O(1) compilation benefit of TRANSFORM only applies when reading from an existing ARRAY or OBJECT column. If you start from individual scalar columns and build the ARRAY within the same query using ARRAY_CONSTRUCT, you pay the O(N^1.5) cost there. This is why an intermediate materialisation layer (an ARRVEC DT, or an OBJECT DT) is architecturally valuable: the O(N^1.5) cost is paid once at materialisation time; all subsequent reads and downstream DT definitions operate in O(1) compile.

Use TRANSFORM-based DT definitions for ARRVEC layers at width > 200

If your Feature Store includes a dense ARRAY (ARRAY(FLOAT)) DT as a materialisation layer for VECTOR serving, define it using TRANSFORM + ARRAY_GENERATE_RANGE reading from an upstream OBJECT DT or ARRAY column — not ARRAY_CONSTRUCT over individual scalar columns. This makes the DT definition itself O(1) to compile, which matters most at refresh-triggered recompilation of wide Feature Views.

12.10.8.5 Sparse-at-rest, dense-on-read architecture

The probe results confirm a practical three-layer architecture for sparse numeric feature sets destined for ML consumption:

flowchart LR
    SRC["Source table\n(sparse columns)"] --> OBJ["OBJECT DT\nOBJECT_CONSTRUCT\n(NULL keys dropped)\nSparse at rest"]
    OBJ --> ARR["ARRVEC DT\nTRANSFORM + RANGE\nO(1) compile\nDense 0-fill"]
    ARR --> VEC["VECTOR(FLOAT, N)\ncast at export\nML training / serving"]
    OBJ -.->|"High sparsity:\nskip ARRVEC"| COO["Sparse export\nFLATTEN → COO\nor OBJECT→dict"]

Sparse-at-rest, dense-on-read pipeline for ML feature serving

Layer 1 — OBJECT DT (sparse at rest): OBJECT_CONSTRUCT with NULL-omission stores only the non-null features. At >40% sparsity this is substantially more compact than a dense ARRAY (up to 7× at 90% sparsity). Schema evolution (adding or dropping features via VIEW replacement) is incremental. This layer serves key-based feature retrieval for all non-VECTOR consumers.

Layer 2 — ARRVEC DT (dense ARRAY, optional): A second DT reads from the OBJECT DT and materialises a dense ARRAY(FLOAT) using TRANSFORM + ARRAY_GENERATE_RANGE. The DT definition compiles in O(1) regardless of feature width. This layer exists solely to serve consumers that need VECTOR casting or sequential array access. It is not needed if VECTOR casting happens only at query time (the TRANSFORM can be applied in the retrieval query directly from the OBJECT layer).

VECTOR cast at export boundary: ::VECTOR(FLOAT, N) is applied at the point where data leaves Snowflake for ML training or serving. VECTOR is always dense and fixed-width; it does not preserve sparse semantics.

High-sparsity alternative — sparse export path: For feature sets where >60–70% of values are absent, densifying to VECTOR(FLOAT, N) at export materialises a large volume of zeros that consume transfer bandwidth and client memory. In these cases prefer a COO-style sparse export:

-- Sparse export: FLATTEN OBJECT → (entity, feature_name, value) tuples
-- Transfer volume proportional to non-zero entries only
SELECT
    entity_id,
    event_ts,
    f.key   AS feature_name,
    f.value::FLOAT AS feature_value
FROM feature_store.user_features_obj_v01,
LATERAL FLATTEN(input => features) f;

The COO representation can be reconstructed into a sparse tensor or scipy sparse matrix on the client without ever materialising the full dense array.

12.10.8.6 When to use ARRAY vs OBJECT bundling

Consideration OBJECT bundle Dense ARRAY (ARRAY(FLOAT))
Consumer access pattern Key-based extraction (features:col_name) Positional indexing; direct VECTOR cast
Schema stability Named keys tolerate incremental addition Fixed-width; any addition requires new version
Homogeneous feature types Mixed types natural; cast at extraction Homogeneous numeric; type locked at ARRAY build
Storage at low sparsity (<40%) More verbose than dense ARRAY (JSON key overhead) ✅ More compact
Storage at high sparsity (>40%) ✅ More compact; up to 7× at 90% sparsity Nearly constant — zero-fill wastes space
Schema evolution Incremental key pickup via VIEW gate No equivalent — new version required
VECTOR serving Requires TRANSFORM HOF densification at read ✅ Direct ::VECTOR(FLOAT, N) cast; TRANSFORM O(1) for downstream DTs
DT definition compile cost O(1) for OBJECT_CONSTRUCT(*) O(N^1.5) for ARRAY_CONSTRUCT; O(1) if reading from existing ARRAY/OBJECT
Sparse positional construction ❌ No SQL set-at-position primitive; integer-keyed OBJECT is the workaround ❌ ARRAY_INSERT is insert-and-shift, not set-at-position

For the majority of Feature Store use cases — heterogeneous feature types, named access, and schema evolution — OBJECT bundling is the recommended at-rest pattern. The ARRVEC (dense ARRAY) layer is a purpose-built serving layer added on top for consumers that require VECTOR types or sequential positional access. Both layers can be served from a single OBJECT DT source using the TRANSFORM HOF densification pattern.

12.10.9 Type fidelity and metadata

When features are packed into an OBJECT, all values become VARIANT. The original Snowflake types (INTEGER, DOUBLE, NUMBER(10,2), VARCHAR, BOOLEAN) are lost. The expansion View must cast each feature back to its original type.

Maintain a metadata table or manifest that records each packed feature’s original type so that expansion Views and Python unpacking can apply the correct casts:

Original type OBJECT extraction Default
INTEGER / NUMBER(p,0) features:col::INTEGER 0
DOUBLE / FLOAT features:col::DOUBLE 0.0
NUMBER(p,s) where s > 0 features:col::NUMBER(p,s) 0
VARCHAR features:col::VARCHAR NULL or ''
BOOLEAN features:col::BOOLEAN FALSE

12.10.10 Dynamic Table cascade for multiple representations

When wide Feature Views serve multiple query patterns (parameter discovery, point lookups, full key-value scans), a DT cascade pre-materialises multiple representations from a single source:

flowchart TD
  DT1[DT1 wide aggregated] --> DT2[DT2 bundled OBJECT]
  DT1 --> EXP[Expansion View]
  DT2 --> DT3[DT3 optional KV flatten]
  DT2 --> EXP

Dynamic Table cascade wide bundled expansion and optional KV

Representation Best for
DT1 (wide) Direct columnar access, traditional ML training
DT2 (hybrid) Point lookups (obj:SENSOR_042), efficient transfer, OBJECT_KEYS
DT3 (KV, optional) Pre-built tall/EAV rows – replaces runtime UNPIVOT
Expansion View Consumers expecting wide columns – schema-identical to DT1

DT cascade SQL:

CREATE DYNAMIC TABLE my_features__bundled
    TARGET_LAG = DOWNSTREAM  WAREHOUSE = FS_DEV_WH
AS SELECT entity_id, ts,
          OBJECT_CONSTRUCT(* EXCLUDE (entity_id, ts)) AS features
   FROM (/* user's wide feature query */);

-- Optional: pre-materialized KV for heavy KV consumers
CREATE DYNAMIC TABLE my_features__kv
    TARGET_LAG = '1 hour'  WAREHOUSE = FS_DEV_WH
AS SELECT b.entity_id, b.ts,
          f.key   AS feature_name,
          f.value AS feature_value
   FROM my_features__bundled b,
        LATERAL FLATTEN(input => b.features) f;

-- Expansion View (always created)
CREATE VIEW my_features AS
SELECT entity_id, ts,
       COALESCE(features:feat_1::INTEGER, 0)   AS feat_1,
       COALESCE(features:feat_2::DOUBLE, 0.0)  AS feat_2,
       ...
FROM my_features__bundled;

Set TARGET_LAG = DOWNSTREAM on intermediate DTs so the cascade is driven by a concrete lag on the leaf DT (see Chapter 5: Feature Pipelines for DOWNSTREAM behavior). Only create DT3 (the KV materialization) if multiple consumers run frequent KV-pattern queries and the resulting row count (entity_count * feature_count) is manageable.

Primary key-based change reduction in cascades (April 2026)

DT cascades benefit from primary key derivation. When an intermediate DT’s query uses GROUP BY or QUALIFY ROW_NUMBER() = 1, Snowflake derives a unique key and uses it for value-based change filtering – rows where the PK exists in both the old and new versions with identical values are filtered out before propagating to the next stage. This reduces the volume of changes flowing through the cascade, even if an upstream stage uses full refresh. Downstream stages can opt into REFRESH_MODE = INCREMENTAL via SQL (ALTER DYNAMIC TABLE ... SET REFRESH_MODE = INCREMENTAL); this is not yet available through the Feature Store Python API. See Understanding primary keys in dynamic tables.

Note

The SQL examples above use CREATE DYNAMIC TABLE / CREATE VIEW for clarity, but each stage can equally be implemented through the Feature Store API (register a View-based or DT-backed Feature View). Using the API makes the objects discoverable within the Feature Store and ensures entity/timestamp metadata is tracked.

12.10.11 Feature discovery without UNPIVOT

Querying the distinct set of feature names is a common need. The naive approach (UNPIVOT + DISTINCT) is expensive on wide tables because it materialises all rows then deduplicates. With the hybrid DT, use OBJECT_KEYS() on a single row instead:

-- Fast: reads one row from the hybrid DT
SELECT f.value::VARCHAR AS feature_name
FROM (SELECT features FROM my_features__bundled LIMIT 1) s,
     LATERAL FLATTEN(input => OBJECT_KEYS(s.features)) f;
Warning

OBJECT_CONSTRUCT vs. OBJECT_CONSTRUCT_KEEP_NULL: If the bundle was created with OBJECT_CONSTRUCT (without KEEP_NULL), keys whose values are NULL are dropped from the OBJECT. OBJECT_KEYS on any single row will only return keys that have non-NULL values in that specific row, so the result may be incomplete. To get the full set of feature names, either:

  • Use OBJECT_CONSTRUCT_KEEP_NULL when building the bundle so all keys are always present, or
  • Query INFORMATION_SCHEMA.COLUMNS on the expansion View (zero-scan metadata query), or
  • Sample multiple rows and take the DISTINCT union of OBJECT_KEYS results.

INFORMATION_SCHEMA.COLUMNS on the expansion View is the most reliable zero-scan approach regardless of how the bundle was built.

12.10.12 Cross-engine expansion

Store data in compact (hybrid/OBJECT) form. Transfer it between engines as-is. Expand to wide/dense form only at the point of model training:

Engine Expansion from OBJECT
Python (Pandas) pd.json_normalize()
Python (scipy) dictcoo_matrix() for sparse input to XGBoost/LightGBM
R jsonlite::fromJSON() + tidyr::pivot_wider()
DuckDB json_extract() + PIVOT
Spark from_json() + select

This “expand at the last moment” principle minimises transfer costs and memory usage across all engines.

12.10.13 Anti-patterns for wide data

Anti-pattern Problem Fix
Runtime UNPIVOT on wide tables Materialises all rows × columns, compilation alone can take minutes at 500+ columns Pre-materialise KV in a DT, or use OBJECT_CONSTRUCT + FLATTEN
SELECT DISTINCT on UNPIVOT for parameter names Scans entire table to discover a fixed set of names Use OBJECT_KEYS() on one row, or INFORMATION_SCHEMA.COLUMNS
Vertical partitioning as only strategy Splitting wide data across many Feature Views multiplies ASOF join cost Use hybrid packing within each Feature View to keep column count manageable
Ignoring sparsity when values are zeros Standard wide tables store zeros explicitly; at 90% sparsity, 90% of storage is wasted Use NULLIF(col, 0) before OBJECT_CONSTRUCT to convert zeros to NULLs; reconstitute via COALESCE(..., 0) in the expansion View
Further reading

See also Chapter 9: Preprocessing – Pre-Encoding Categoricals for OBJECT-column strategies specific to one-hot encoded features, and Chapter 4: Feature Views for the Base + Presentation Layer Model which complements the two-tier approach.


12.11 Dataset Generation at Enterprise Scale

At large scale (50+ Feature Views, billions of spine rows, thousands of features), the default single-query join strategy in generate_training_set / generate_dataset hits memory and shuffle limits. This section covers architectural patterns for scale.

12.11.1 Batched Joins

SDK version 1.32+ automatically batches Feature View joins into groups of ~10, producing intermediate results before a final merge. This resolved OOM failures seen at 75+ Feature Views and ~2 billion spine rows. See Chapter 10: Dataset Generation at Scale for manual batching code when upgrading the SDK is not immediately possible.

12.11.2 Warehouse Selection: Standard vs Snowpark Optimized

A common misconception is that Snowpark Optimized (SPO) warehouses are always better for large-scale dataset generation. In practice, standard Gen2 warehouses provide superior performance for join-heavy workloads because they offer more parallelism per credit:

Warehouse Nodes Memory per node Total memory Best for
Standard 4XL Gen2 128 16 GB ~2 TB Wide joins, high parallelism
SPO 3XL 16 256 GB ~4 TB Python UDFs, memory-heavy single-node ops

SPO warehouses are designed for Python UDFs and stored procedures that run on a single node and need large local memory. Dataset generation (generate_training_set) is a distributed SQL operation where parallelism – not per-node memory – is the bottleneck. At equivalent credit rates, standard Gen2 provides ~8x more nodes, significantly reducing shuffle and sort times.

Tip

Reserve SPO warehouses for model training inside stored procedures (where the model fitting runs on a single node) and use standard warehouses for dataset generation.

12.11.3 Sizing Guidelines

Spine rows Feature Views Estimated features Recommended warehouse
< 10M < 20 < 200 Standard M or L
10M-100M 20-50 200-500 Standard XL-2XL Gen2
100M-1B 50-100 500-2,000 Standard 4XL Gen2
> 1B 100+ 2,000+ Standard 4XL+ Gen2; batch FVs if OOM

12.11.4 Wide Resultset Mitigation

At extreme scale (100+ Feature Views, 2,000+ features), Snowflake’s query compilation scales non-linearly with the number of output columns and join expressions. Even with batched joins, compilation times of 35+ minutes and execution times of 6+ hours have been observed. The root cause is the SQL expression tree size and columnar scan overhead for very wide resultsets.

When batched joins alone are insufficient, two architectural patterns can help:

  1. VARIANT column encoding – encapsulate each Feature View’s features into a single OBJECT column, reducing the SQL expression count to one per Feature View rather than one per feature. This reduces a 2,000-feature compilation from ~25 minutes to ~4 seconds in benchmarks. See Chapter 10: The Wide Resultset Problem for trade-offs.

  2. Parquet/Iceberg output – bypass to_pandas() entirely by writing training data to Parquet files on a stage or to a Snowflake-managed Iceberg table. At 4XL warehouse scale, 1B rows writes in ~4 minutes and 12B rows in ~35 minutes. The output is natively readable by Ray, Spark, and PyTorch DataLoader. See Chapter 10: Parquet and Iceberg Output.

Note

These patterns are most relevant for customers with 500+ features. For most deployments, the SDK batched join strategy and standard Gen2 warehouse sizing are sufficient.


12.12 Best practices

1. Partition Feature Views by refresh cadence

Features that change every minute should not share a Dynamic Table with features that change daily. Mixing cadences forces the entire table to refresh at the fastest rate, wasting compute:

fast_features_fv = FeatureView(name="FAST_FEATURES_FV", refresh_freq="5 minutes", ...)
slow_features_fv = FeatureView(name="SLOW_FEATURES_FV", refresh_freq="1 day", ...)

Separate Feature Views by update frequency, then combine them at training time via generate_dataset or generate_training_set.

2. Document feature ownership and dependencies

As the number of Feature Views grows, it becomes critical to track which source tables feed each Feature View and which models consume them. A simple dependency manifest (YAML, JSON, or a database table) prevents orphaned features and simplifies impact analysis when source schemas change:

features:
  USER_ORDER_FV:
    owner: data-engineering
    depends_on:
      - FEATURE_STORE_DEMO.CLICKSTREAM_DATA.ORDERS
      - FEATURE_STORE_DEMO.CLICKSTREAM_DATA.USERS
    consumers:
      - churn_model
      - ltv_model

Combine this with the FeatureView.lineage() API and SQL lineage queries described in Chapter 11: Operations.

3. Automate testing in CI

Run pytest (unit + PIT + schema contracts) on every change to feature definitions; gate deploys on green builds. See the Testing strategies section above for the minimum test categories.


12.13 Common pitfalls

12.13.1 Pitfall 1: Over-engineering from day one

Problem: Teams design multi-region, multi-schema, streaming architectures before they have a single model in production.

Solution: Start with a single Feature Store schema, View-based or simple DT-based Feature Views, and a manual deployment script. Add streaming, multi-schema isolation, and CI/CD automation only when the scale and organizational complexity demand it. The patterns in this chapter are a menu, not a mandatory checklist.

12.13.2 Pitfall 2: Inconsistent patterns across teams

Problem: Each team uses different naming conventions, version strategies, and refresh configurations. This creates confusion when cross-domain models attempt to join Feature Views from different schemas.

Solution: Establish organization-wide standards early: V01-style versioning, consistent naming conventions (see Chapter 2), and shared Entity definitions. A “Feature Store working group” or a shared manifest format prevents divergence before it becomes entrenched.

12.13.3 Pitfall 3: No testing beyond “it runs”

Problem: Feature issues (future leakage, schema drift, stale data) are discovered in production, often after a model has been serving incorrect predictions for days.

Solution: Implement the minimum test bar described in the Testing strategies section above: unit tests for transformation logic, PIT correctness tests, schema validation, and freshness/quality checks. Integrate these into CI/CD so that no Feature View change reaches production without passing them.

12.13.4 Pitfall 4: Unexpected full refresh on aggregate DTs

Problem: A Dynamic Table Feature View that should be refreshing incrementally is instead performing full refresh every cycle, causing unexpectedly high compute costs. This often goes unnoticed until the bill arrives.

Solution: After registering any DT-backed Feature View, check REFRESH_MODE in INFORMATION_SCHEMA.DYNAMIC_TABLES(). The most common cause of unexpected full refresh is aggregations on FLOAT-typed columns combined with a JOIN in the same query block. The DT engine cannot guarantee incremental correctness for floating-point aggregate diffs across joins. Cast to fixed-point (NUMBER(38, 6)) before aggregating, or split into a base DT + View layer. See Chapter 4, Chapter 5, and Chapter 11 for detailed guidance and workarounds.

Even when a DT cannot avoid full refresh, downstream DTs can still process incrementally if the full-refresh DT has a system-derived unique key (e.g., from GROUP BY or QUALIFY ROW_NUMBER() = 1). Set REFRESH_MODE = INCREMENTAL on the downstream DT via SQL. Similarly, source tables refreshed via INSERT OVERWRITE can enable incremental downstream processing by declaring a RELY primary key. These are SQL-level capabilities not yet in the Feature Store Python API – see Chapter 5 for the full pattern and Understanding primary keys in dynamic tables.


12.14 Summary

Pattern Use Case Key Consideration
Streaming / low latency Near real-time features Short-refresh DTs or external pipelines until native streaming is GA
Multi-Feature Store Domain isolation, cross-domain ML RBAC grants must span schemas for training; use a shared ML_TRAINING role
Feature sharing Cross-account / cross-region feature access Secure Data Sharing for zero-copy; single-def multi-deploy for data residency; materialise once + Views for cost
Multi-region / DR Geographic resilience Database replication works; Online Feature Tables need per-region provisioning
External data sources Iceberg integration View-based FVs wrap external tables; dbt covered in Ch05
CI/CD Automated, auditable deployments Parameterise database/schema for environment promotion
State reconciliation Detecting drift between config and live Feature Store Pre-deploy diff script; append-only vs full-redeploy strategies
Repository layout Consistent multi-team Feature View management features/*/config.yaml + feature_df.py; shared entities, scripts, and CI workflows
Testing Preventing production incidents Minimum bar: unit, PIT, schema, freshness tests
Wide & sparse data 100+ feature columns, sparse data, sensor/OHE tables Hybrid storage (keys + OBJECT), two-tier FV, DT cascade; expand at last moment
Semi-structured features Embeddings, JSON, packed columns ARRAY, VARIANT, OBJECT_CONSTRUCT; unpack in Python/R/Spark

12.15 Next steps

Continue to Chapter 13: Migration Guide for guidance on migrating from other Feature Store platforms.