7 Aggregations API

The Feature class for declarative time-windowed aggregations

Keywords

snowflake, feature store, ml, machine learning, mlops

7.1 Overview

The Aggregations API provides a declarative way to define time-windowed features using the Feature class. Instead of writing complex SQL with window functions, you specify what aggregations you want and the Feature Store handles the efficient computation.

Examples in this chapter use the canonical demo layout: database FEATURE_STORE_DEMO, Feature Store schema FEATURE_STORE, clickstream source schema CLICKSTREAM_DATA, warehouse FS_DEV_WH, and tables such as ORDERS (ORDER_TS, ORDER_ID, USER_ID, TOTAL_AMT).

SDK Version and Documentation Status

The Feature class and tiled aggregation API require snowflake-ml-python >= 1.24.0. This is a client-side SDK feature – it generates standard SQL (Dynamic Tables, window functions, aggregations) under the hood, so no account-level enablement or feature flag is needed. Install the required SDK version and you are ready to use it.

Official API documentation on the Snowflake docs portal may be limited at time of reading. For the authoritative parameter reference, use help(Feature) in a Python session or inspect the SDK source. API names and signatures may evolve in future releases.

See also the platform capabilities and preview status table in Chapter 12.

7.2 Learning Objectives

After completing this chapter, you will be able to:

Use the Feature class to define time-windowed aggregations
Understand tiling for efficient incremental computation
Configure feature granularity and refresh frequencies
Apply feature column prefixing for disambiguation
Name aggregated features following best practices
Generate features programmatically using data-driven patterns (categorical pivots, cross-product factories, config-driven specs)

📂 Chapter code: Browse companion scripts on GitHub

7.3 The Feature Class

7.3.1 Basic Usage

📁 Full code: _code/feature_class.py

# Define time-windowed aggregations declaratively (aliases: _SUM, _CNT, _AVG, …)
features = [
    Feature.sum("TOTAL_AMT", "7d").alias("TOTAL_AMT_SUM_7D"),
    Feature.sum("TOTAL_AMT", "30d").alias("TOTAL_AMT_SUM_30D"),
    Feature.count("ORDER_ID", "7d").alias("ORDER_ID_CNT_7D"),
    Feature.avg("TOTAL_AMT", "7d").alias("TOTAL_AMT_AVG_7D"),
    Feature.max("TOTAL_AMT", "30d").alias("TOTAL_AMT_MAX_30D"),
]

# Create tiled Feature View (source: canonical clickstream ORDERS)
orders_fqn = "FEATURE_STORE_DEMO.CLICKSTREAM_DATA.ORDERS"
user_agg_fv = FeatureView(
    name="USER_PURCHASE_AGGREGATES",
    entities=[user_entity],
    feature_df=session.table(orders_fqn),
    timestamp_col="ORDER_TS",
    refresh_freq="1 hour",
    feature_granularity="1 hour",
    features=features,
)

registered_agg_fv = fs.register_feature_view(
    feature_view=user_agg_fv, version="V01", block=True, overwrite=True,
)
print(f"Registered: {registered_agg_fv.name}/V01  (tiled={registered_agg_fv.is_tiled})")
print(f"Granularity: {registered_agg_fv.feature_granularity}")

7.3.2 Available Aggregation Functions

Function	Description	Example
`Feature.sum(col, window)`	Sum over window	`Feature.sum("TOTAL_AMT", "7d")`
`Feature.min(col, window)`	Minimum over window	`Feature.min("TOTAL_AMT", "30d")`
`Feature.max(col, window)`	Maximum over window	`Feature.max("TOTAL_AMT", "30d")`
`Feature.count(col, window)`	Count over window	`Feature.count("ORDER_ID", "30d")`
`Feature.avg(col, window)`	Average over window	`Feature.avg("TOTAL_AMT", "7d")`
`Feature.std(col, window)`	Standard deviation	`Feature.std("TOTAL_AMT", "30d")`
`Feature.var(col, window)`	Variance	`Feature.var("TOTAL_AMT", "30d")`
`Feature.last_n(col, window, n=N)`	N most recent values	`Feature.last_n("PRODUCT_ID", "7d", n=5)`
`Feature.first_n(col, window, n=N)`	N oldest values	`Feature.first_n("PRODUCT_ID", "7d", n=5)`
`Feature.last_distinct_n(col, window, n=N)`	N most recent unique values	`Feature.last_distinct_n("PRODUCT_ID", "7d", n=5)`
`Feature.first_distinct_n(col, window, n=N)`	N oldest unique values	`Feature.first_distinct_n("PRODUCT_ID", "7d", n=5)`
`Feature.approx_count_distinct(col, window)`	HLL distinct count	`Feature.approx_count_distinct("PRODUCT_ID", "7d")`
`Feature.approx_percentile(col, window, percentile=0.5)`	Approximate percentile	`Feature.approx_percentile("TOTAL_AMT", "30d", percentile=0.5)`

Float columns and incremental refresh

Aggregation Feature Views backed by Dynamic Tables may fall back to full refresh if the underlying SQL combines aggregations on FLOAT-typed expressions (sum, avg, min, max, std, var) with a JOIN in the same query block. The DT engine cannot guarantee incremental correctness for floating-point aggregate diffs across joins.

If your source columns are FLOAT/DOUBLE, cast them to a fixed-point type before the aggregation reaches the DT definition:

CAST(TOTAL_AMT AS NUMBER(38, 6))

The Temporal Aggregation API generates the underlying DT SQL automatically, so if your source table’s numeric columns are FLOAT and the API introduces a join (e.g., for tiling or rollup), the resulting DT may silently use full refresh. Check REFRESH_MODE in INFORMATION_SCHEMA.DYNAMIC_TABLES() after registration and alter source column types to NUMBER if needed.

ML model impact: For most ML use cases the precision difference between FLOAT and NUMBER(38, 8) is negligible – the variance across training runs or hyperparameter choices far exceeds the rounding delta. If you need higher fidelity, NUMBER(38, 12) provides ~12 decimal digits of precision, comparable to FLOAT for the value ranges typical in feature engineering. The key trade-off is incremental refresh (fast, cheap) vs full refresh (slow, expensive) – the precision loss from fixed-point casting is almost always the better deal.

Recommended approach for source table owners: Alter the source column type to NUMBER/DECIMAL at the table level rather than casting in every Feature View query. This avoids repeating the cast in multiple Feature Views and ensures all downstream consumers benefit:

ALTER TABLE FEATURE_STORE_DEMO.CLICKSTREAM_DATA.ORDERS
  ALTER COLUMN TOTAL_AMT SET DATA TYPE NUMBER(18, 8);

For rollup Feature Views, the same constraint applies transitively – if the source tiled Feature View aggregates FLOAT columns, the rollup DT also falls back to full refresh. Fix the types at the source level.

7.3.3 Aggregation Function Performance Characteristics

Not all aggregation functions are equal in terms of tile storage, computation cost, and incremental refresh compatibility. Understanding these trade-offs is important when designing features at scale.

Function	Tile storage	Incremental	Notes
`count`	Partial counts per tile	Yes	Lightweight; most efficient aggregation
`sum`	Partial sums per tile	Yes (if not `FLOAT`+JOIN)	Cast to `NUMBER` if source is `FLOAT`
`min`, `max`	Per-tile min/max	Yes	Used for windowed extremes
`avg`	Stored as sum + count	Yes (derived)	Reconstructed from partial sum/count at query time
`std`, `var`	Partial sum, sum-of-squares, count	Yes (derived)	Higher storage per tile; 3 partial aggregates
`approx_count_distinct`	HLL sketch per tile	Yes	Fixed-size sketch (~12 KB); merges across tiles
`last_n`, `first_n`	Array per tile	Yes	Storage scales with `n`; larger `n` increases tile size
`last_distinct_n`, `first_distinct_n`	Deduplicated array per tile	Yes	More expensive than `last_n` due to dedup logic
`approx_percentile`	T-Digest sketch per tile	Yes	Fixed-size sketch; approximate result

Guidance for high-cardinality, high-volume data:

Prefer count and sum as primary features – they are the most efficient for both tiling and query-time reassembly.
Use approx_count_distinct instead of exact distinct counts when precision within ~2% is acceptable. The HLL sketch is constant-size regardless of cardinality, whereas exact distinct counts require per-tile array storage.
Limit last_n / first_n values of n to what the model actually needs. Each tile stores up to n values; at n=100 with hourly tiles and 90-day windows, tile storage grows substantially.
std and var store three partial aggregates per tile (sum, sum-of-squares, count). If you only need variance for a subset of features, consider computing it post-retrieval from sum and count features to reduce tile storage.

7.3.4 Filter Expressions Within Features

When the same source data feeds features that differ only by a filter condition (e.g., count of automated messages vs. count of all messages), use separate Feature definitions with filter logic in the source feature_df rather than creating multiple Feature Views:

SQL
Snowpark DataFrame

filtered_source_df = session.sql("""
    SELECT
        USER_ID,
        EVENT_TS,
        ORDER_ID,
        TOTAL_AMT,
        CASE WHEN CHANNEL = 'AUTOMATED' THEN ORDER_ID END AS AUTO_ORDER_ID,
        CASE WHEN CHANNEL = 'AUTOMATED' THEN TOTAL_AMT END AS AUTO_TOTAL_AMT
    FROM FEATURE_STORE_DEMO.CLICKSTREAM_DATA.ORDERS
""")

import snowflake.snowpark.functions as F

orders = session.table("FEATURE_STORE_DEMO.CLICKSTREAM_DATA.ORDERS")
filtered_source_df = orders.select(
    F.col("USER_ID"),
    F.col("EVENT_TS"),
    F.col("ORDER_ID"),
    F.col("TOTAL_AMT"),
    F.when(F.col("CHANNEL") == "AUTOMATED", F.col("ORDER_ID")).alias("AUTO_ORDER_ID"),
    F.when(F.col("CHANNEL") == "AUTOMATED", F.col("TOTAL_AMT")).alias("AUTO_TOTAL_AMT"),
)

features = [
    Feature.count("ORDER_ID", "7d").alias("ORDER_CNT_7D"),
    Feature.count("AUTO_ORDER_ID", "7d").alias("AUTO_ORDER_CNT_7D"),
    Feature.sum("TOTAL_AMT", "30d").alias("SPEND_SUM_30D"),
    Feature.sum("AUTO_TOTAL_AMT", "30d").alias("AUTO_SPEND_SUM_30D"),
]

This approach keeps all related features in a single Feature View, which means they share tiles and refresh together – significantly more efficient than creating separate Feature Views per filter condition. The CASE WHEN produces NULL for non-matching rows, and aggregation functions skip NULLs, giving the correct filtered result.

The static approach above works well when the filter values are stable and few. When dimension values change over time (new channels, product categories, regions), see Data-Driven Feature Definitions for a programmatic alternative that discovers values from data and generates the CASE columns and Feature objects automatically.

7.3.5 Window Specifications

Windows can be specified using shorthand notation:

Notation	Meaning
`"1h"` or `"1 hour"`	1 hour
`"24h"`	24 hours
`"7d"` or `"7 day"`	7 days
`"30d"`	30 days
`"1w"` or `"1 week"`	1 week

7.3.6 Window offset

Pass offset to shift the aggregation window backward in time. This is the standard way to build comparative features such as week-over-week or month-over-month trends:

features = [
    Feature.sum("TOTAL_AMT", "7d").alias("TOTAL_AMT_SUM_CURRENT_WEEK"),
    Feature.sum("TOTAL_AMT", "7d", offset="7d").alias("TOTAL_AMT_SUM_PREV_WEEK"),
]

Pair the current-window and offset-window versions to capture momentum or change over time. For example, a model can detect spending acceleration when TOTAL_AMT_SUM_CURRENT_WEEK exceeds TOTAL_AMT_SUM_PREV_WEEK.

The offset must be a multiple of feature_granularity so the shifted window aligns cleanly to tile boundaries. For example, with feature_granularity="1h", valid offsets are "1h", "2h", "24h", "7d", etc.

Use .alias("NAME") on each Feature to set the output column name.

7.4 Naming Aggregated Features

7.4.1 Best Practice: Descriptive Aliases

Always use .alias() to give features meaningful names:

# ✅ GOOD: Suffix metric role (_SUM, _CNT, _AVG) and window
features = [
    Feature.sum("TOTAL_AMT", "7d").alias("TOTAL_AMT_SUM_7D"),
    Feature.sum("TOTAL_AMT", "30d").alias("TOTAL_AMT_SUM_30D"),
    Feature.count("ORDER_ID", "7d").alias("ORDER_ID_CNT_7D"),
]

# ❌ BAD: Auto-generated names are hard to interpret
features = [
    Feature.sum("TOTAL_AMT", "7d"),  # opaque default name
    Feature.sum("TOTAL_AMT", "30d"),
]

7.5 Feature Column Prefixing for Disambiguation

When joining features from multiple Feature Views, column name collisions can occur. Feature Store provides two mechanisms for disambiguation.

7.5.1 The Problem

# Name collision: same aggregation function on same column, different entity groupings,
# using the default naming derivation -- both resolve to ORDER_ID_CNT_7D
user_orders_fv = FeatureView(
    name="USER_ORDERS",
    entities=[user_entity],
    features=[Feature.count("ORDER_ID", "7d")],  # default name: ORDER_ID_CNT_7D
    # ...
)

merchant_orders_fv = FeatureView(
    name="MERCHANT_ORDERS",
    entities=[merchant_entity],
    features=[Feature.count("ORDER_ID", "7d")],  # same default name: ORDER_ID_CNT_7D
    # ...
)

# Without prefixing, collision occurs
dataset = fs.generate_dataset(
    spine_df=spine,
    features=[user_orders_fv, user_sessions_fv],  # Which ORDER_ID_CNT_7D?
)

Prefer explicit .alias() calls with distinct names (e.g., MERCHANT_ORDER_ID_CNT_7D) or use auto_prefix=True / .with_name() so semantics stay clear even when multiple Feature Views aggregate the same source column.

7.5.2 Solution 1: Auto-Prefix

Use auto_prefix=True to automatically prefix columns with Feature View name:

📁 Full code: _code/prefixing.py

dataset = fs.generate_dataset(
    spine_df=spine,
    features=[user_orders_fv, user_sessions_fv],
    auto_prefix=True,  # Columns become USER_ORDERS__ORDER_ID_CNT_7D, etc.
)

# Result columns:
# - USER_ORDERS__ORDER_ID_CNT_7D
# - USER_SESSIONS__SESSION_ID_CNT_7D

7.5.3 Solution 2: Custom Prefixes with `.with_name()`

For more control, use .with_name() on individual Feature Views or slices:

dataset = fs.generate_dataset(
    spine_df=spine,
    features=[
        user_orders_fv.with_name("orders"),      # Prefix: orders__
        user_sessions_fv.with_name("sessions"),  # Prefix: sessions__
    ],
)

# Result columns:
# - orders__ORDER_ID_CNT_7D
# - sessions__SESSION_ID_CNT_7D

7.5.4 Solution 3: Combining Both Approaches

dataset = fs.generate_dataset(
    spine_df=spine,
    features=[
        user_orders_fv.with_name("orders"),  # Custom prefix
        user_sessions_fv,                     # Will use auto_prefix
    ],
    auto_prefix=True,
)

Priority

.with_name() takes precedence over auto_prefix. If a Feature View has .with_name() set, that prefix is used regardless of auto_prefix setting.

7.5.5 Choosing a Prefixing Strategy

Scenario	Recommendation
Single Feature View	No prefix needed
Multiple FVs, unique column names	No prefix needed
Multiple FVs, potential collisions	Use `auto_prefix=True`
Want short, custom prefixes	Use `.with_name()`
Consistent naming across projects	Define naming convention + `.with_name()`

7.6 Feature Granularity (Tiling)

7.6.1 How Tiling Works

The feature_granularity parameter sets the tile size for incremental computation:

# Hourly tiles for high-frequency data
fv = FeatureView(
    feature_granularity="1 hour",  # Each tile covers 1 hour
    refresh_freq="15 minutes",      # Refresh every 15 minutes
    # ...
)

7.6.2 Tile Size Recommendations

Data Pattern	Recommended Tile	Reason
Clickstream (high frequency)	`1 hour`	Many events per hour
Transactions (medium)	`1 day`	Daily aggregation patterns
Weekly metrics	`1 week`	Coarse granularity sufficient

7.6.3 Multi-Granularity Tile Strategy

When Feature Views need to serve both short lookback windows (1d, 7d) and long lookback windows (90d, 360d), a single tile size creates a trade-off: small tiles (1 hour) provide precision for short windows but require combining thousands of tiles for 360-day windows; large tiles (1 day) are efficient for long windows but lose sub-day precision.

The recommended approach is to use multiple Feature Views at different tile granularities, each optimized for the windows it serves:

short_window_features = [
    Feature.sum("TOTAL_AMT", "1d").alias("SPEND_SUM_1D"),
    Feature.sum("TOTAL_AMT", "7d").alias("SPEND_SUM_7D"),
    Feature.count("ORDER_ID", "1d").alias("ORDER_CNT_1D"),
    Feature.count("ORDER_ID", "7d").alias("ORDER_CNT_7D"),
]

short_window_fv = FeatureView(
    name="USER_PURCHASE_SHORT_WINDOW",
    entities=[user_entity],
    feature_df=session.table(orders_fqn),
    timestamp_col="ORDER_TS",
    refresh_freq="1 hour",
    feature_granularity="1 hour",
    features=short_window_features,
    desc="Short-window purchase features (1d, 7d) with hourly tiles",
)

long_window_features = [
    Feature.sum("TOTAL_AMT", "30d").alias("SPEND_SUM_30D"),
    Feature.sum("TOTAL_AMT", "90d").alias("SPEND_SUM_90D"),
    Feature.sum("TOTAL_AMT", "360d").alias("SPEND_SUM_360D"),
    Feature.count("ORDER_ID", "30d").alias("ORDER_CNT_30D"),
    Feature.count("ORDER_ID", "360d").alias("ORDER_CNT_360D"),
]

long_window_fv = FeatureView(
    name="USER_PURCHASE_LONG_WINDOW",
    entities=[user_entity],
    feature_df=session.table(orders_fqn),
    timestamp_col="ORDER_TS",
    refresh_freq="1 day",
    feature_granularity="1 day",
    features=long_window_features,
    desc="Long-window purchase features (30d-360d) with daily tiles",
)

At training time, join both Feature Views on the same spine:

training_df = fs.generate_training_set(
    spine_df=spine,
    features=[short_window_fv, long_window_fv],
    spine_timestamp_col="EVENT_TS",
)

Trade-offs:

Approach	Tiles for 360d window	Refresh cost	Short-window precision
Single FV, `1 hour` tiles	8,640 tiles	High (hourly refresh)	Sub-hour
Single FV, `1 day` tiles	360 tiles	Low (daily refresh)	Daily only
Two FVs (recommended)	360 tiles (daily FV) + 168 tiles for 7d (hourly FV)	Balanced	Sub-hour for short, daily for long

The two-FV approach also lets you set different refresh_freq values: hourly for short windows where freshness matters, daily for long windows where the marginal value of hourly refresh is negligible.

7.7 Multi-Window Features

Define features with multiple time windows efficiently:

features = [
    Feature.sum("TOTAL_AMT", "1d").alias("TOTAL_AMT_SUM_1D"),
    Feature.sum("TOTAL_AMT", "7d").alias("TOTAL_AMT_SUM_7D"),
    Feature.sum("TOTAL_AMT", "30d").alias("TOTAL_AMT_SUM_30D"),
    Feature.sum("TOTAL_AMT", "90d").alias("TOTAL_AMT_SUM_90D"),
    Feature.count("ORDER_ID", "7d").alias("ORDER_ID_CNT_7D"),
    Feature.count("ORDER_ID", "30d").alias("ORDER_ID_CNT_30D"),
]

# Single Feature View efficiently computes all windows (required FV args shown above)
multi_window_fv = FeatureView(
    name="USER_MULTI_WINDOW",
    entities=[user_entity],
    feature_df=session.table("FEATURE_STORE_DEMO.CLICKSTREAM_DATA.ORDERS"),
    timestamp_col="ORDER_TS",
    refresh_freq="1 day",
    features=features,
    feature_granularity="1 day",
)

When you have many columns and windows, a helper function avoids repetitive boilerplate and enforces consistent naming:

from snowflake.ml.feature_store import Feature

def build_features(columns: list[str], windows: list[str], agg: str = "sum") -> list[Feature]:
    agg_fn = getattr(Feature, agg)
    return [
        agg_fn(col, w).alias(f"{col}_{agg.upper()}_{w.upper()}")
        for col in columns
        for w in windows
    ]

features = (
    build_features(["TOTAL_AMT", "DISCOUNT_AMT"], ["7d", "30d", "90d"], "sum")
    + build_features(["ORDER_ID"], ["7d", "30d"], "count")
)

7.8 Best Practices

1. Use Consistent Naming Conventions

# Pattern: <COL_OR_ENTITY>_<ROLE>_<WINDOW> (roles: _SUM, _CNT, _AVG, …)
features = [
    Feature.sum("TOTAL_AMT", "7d").alias("TOTAL_AMT_SUM_7D"),
    Feature.sum("TOTAL_AMT", "30d").alias("TOTAL_AMT_SUM_30D"),
    Feature.count("ORDER_ID", "7d").alias("ORDER_ID_CNT_7D"),
    Feature.avg("TOTAL_AMT", "7d").alias("TOTAL_AMT_AVG_7D"),
]

2. Match Tile Size to Data Frequency

# High-frequency: smaller tiles
clickstream_fv = FeatureView(
    feature_granularity="1 hour",
    # ...
)

# Low-frequency: larger tiles
weekly_metrics_fv = FeatureView(
    feature_granularity="1 day",
    # ...
)

3. Group Related Features and Consolidate Feature Views

When multiple Feature Views share the same source table, entity, and tile granularity, they automatically share the same underlying tile computation. Consolidating them into a single Feature View reduces operational overhead (fewer DTs to monitor) and ensures tile computation happens once:

# ✅ GOOD: Related features in same Feature View -- single tile computation
purchase_features = [
    Feature.sum("TOTAL_AMT", "7d").alias("TOTAL_AMT_SUM_7D"),
    Feature.sum("TOTAL_AMT", "30d").alias("TOTAL_AMT_SUM_30D"),
    Feature.count("ORDER_ID", "7d").alias("ORDER_ID_CNT_7D"),
    Feature.count("ORDER_ID", "30d").alias("ORDER_ID_CNT_30D"),
    Feature.avg("TOTAL_AMT", "7d").alias("TOTAL_AMT_AVG_7D"),
]

# ❌ BAD: Same source split across multiple FVs -- duplicated tile computation
spend_fv = FeatureView(
    name="USER_SPEND_FV",
    feature_df=session.table(orders_fqn),
    features=[Feature.sum("TOTAL_AMT", "7d").alias("TOTAL_AMT_SUM_7D")],
    # ...
)
count_fv = FeatureView(
    name="USER_ORDER_COUNT_FV",
    feature_df=session.table(orders_fqn),
    features=[Feature.count("ORDER_ID", "7d").alias("ORDER_ID_CNT_7D")],
    # ...
)

Splitting features from the same source into separate Feature Views forces independent tile computations over the same data. Each Feature View creates its own DT with its own tiles, effectively doubling (or more) the compute cost. Consolidate into one Feature View when the source, entity, and granularity are shared.

When separate Feature Views are appropriate:

Different source tables (e.g., ORDERS vs EVENTS)
Different entities (e.g., USER vs PRODUCT)
Different tile granularities (e.g., hourly tiles for short windows, daily for long – see Multi-Granularity)
Different refresh cadences needed for operational reasons
Unrelated feature domains that evolve independently

# ❌ BAD: Unrelated features mixed together
mixed_features = [
    Feature.sum("TOTAL_AMT", "7d"),  # Purchase feature
    Feature.count("EVENT_ID", "7d"),  # Clickstream feature (different grain / entity!)
]

7.9 Rollup Aggregated Feature Views

Rollup Feature Views enable hierarchical aggregation – creating coarser-grained features from finer-grained tiled Feature Views without recomputing from raw data. For example, clickstream-style hierarchies often roll up VISITOR → USER → HOUSEHOLD: a visitor-level “7-day page view count” can be rolled up to user-level (sum across all visitors linked to that user) and then to household-level (sum across all users in the household).

The RollupConfig defines the relationship between a finer-grained (source) Feature View and the coarser-grained (target) Feature View. The Feature Store leverages Dynamic Tables for automatic, incremental updates.

7.9.1 RollupConfig Arguments

Argument	Description
`source`	The parent tiled Feature View to roll up from. Must be registered (not in DRAFT status)
`mapping_df`	A Snowpark DataFrame containing the entity mapping – all join keys from both the source (finer) and target (coarser) entities

7.9.2 Example: Visitor → Subscriber Rollup

from snowflake.ml.feature_store import FeatureStore, FeatureView, Entity, RollupConfig, Feature

# Entities at both levels of the hierarchy
visitor_entity = Entity("VISITOR", ["VISITOR_ID", "COMPANY_ID"])
subscriber_entity = Entity("SUBSCRIBER", ["SUBSCRIBER_ID", "COMPANY_ID"])
fs.register_entity(visitor_entity)
fs.register_entity(subscriber_entity)

# Fine-grained: tiled visitor-level Feature View (already registered)
visitor_features = [
    Feature.sum("PAGE_VIEWS", "7d").alias("PAGE_VIEWS_7D"),
    Feature.count("SESSION_ID", "30d").alias("SESSION_CNT_30D"),
]
visitor_fv = FeatureView(
    name="VISITOR_ENGAGEMENT_FV",
    entities=[visitor_entity],
    feature_df=session.table("FEATURE_STORE_DEMO.CLICKSTREAM_DATA.EVENTS"),
    timestamp_col="EVENT_TS",
    refresh_freq="1 hour",
    feature_granularity="1 hour",
    features=visitor_features,
    desc="Visitor-level engagement features — tiled",
)
registered_visitor_fv = fs.register_feature_view(visitor_fv, version="V01")

# Entity mapping: links visitor keys to subscriber keys
mapping_df = session.table("FEATURE_STORE_DEMO.CLICKSTREAM_DATA.VISITOR_SUBSCRIBER_MAPPING")
# Expected columns: VISITOR_ID, COMPANY_ID, SUBSCRIBER_ID

# Rollup configuration
rollup_config = RollupConfig(
    source=registered_visitor_fv,
    mapping_df=mapping_df,
)

# Coarse-grained: subscriber-level rollup Feature View
subscriber_fv = FeatureView(
    name="SUBSCRIBER_ENGAGEMENT_FV",
    entities=[subscriber_entity],
    rollup_config=rollup_config,
    desc="Subscriber-level engagement — rolled up from visitor-level tiles",
)
fs.register_feature_view(subscriber_fv, version="V01")

The resulting SUBSCRIBER_ENGAGEMENT_FV automatically aggregates the visitor-level tiles up to the subscriber level using the entity mapping. Because it is backed by a Dynamic Table, it refreshes incrementally as the source visitor Feature View updates.

Note

The source Feature View must be a registered, tiled Feature View (one with features and feature_granularity). You cannot roll up a non-tiled DT-based or View-based Feature View.

7.9.3 Rollup Design Guidelines

Define entities that match the grain of each Feature View (for example, visitor-level tiles on EVENTS, user-level aggregates on ORDERS keyed by USER_ID).
Reuse the same aggregation patterns (Feature.sum, Feature.count, …) at each level; change the spine entity and join keys when moving from visitor to user to household.
Keep aliases explicit (VISITOR_PAGE_VIEWS_7D vs USER_PAGE_VIEWS_7D) so rolled-up datasets remain interpretable when multiple hierarchy levels appear in one training set.
Design mapping tables carefully. The mapping_df determines which fine-grained entities roll up to which coarse-grained entities. For static relationships a simple DISTINCT mapping suffices; for time-varying relationships (visitors identified as subscribers over time), consider SCD-Type2 patterns or MAPPED_AT timestamps. See Chapter 3: Entity Mapping Tables for patterns and PIT correctness implications.
Watch for late-arriving mappings. If entity mappings are added retroactively, the rollup may aggregate historical tile data under a mapping that did not exist at the original event time, creating subtle data leakage. See the mapping table PIT warning in Chapter 3 and Chapter 6.

7.10 Common Pitfalls

7.10.1 ❌ Pitfall 1: No Alias

Problem: Auto-generated names are unclear.

Solution: Always use .alias() with descriptive names.

7.10.2 ❌ Pitfall 2: Wrong Tile Size

Problem: Tile size doesn’t match data frequency.

Solution: Use smaller tiles for high-frequency data.

7.10.3 ❌ Pitfall 3: Column Collisions

Problem: Same column names from different Feature Views.

Solution: Use auto_prefix=True or .with_name().

7.10.4 ❌ Pitfall 4: Internal _PARTIAL Columns in Output

Problem: Internal columns used for tile aggregation (prefixed _PARTIAL_) sometimes appear in output DataFrames from generate_training_set() or retrieve_feature_values(). These are implementation artifacts – partial aggregates stored per tile – and are not user-facing features.

Solution: Filter them before consuming the output:

result_df = fs.generate_training_set(spine_df=spine, features=fv_list, ...)

# Drop internal tile columns
output_cols = [c for c in result_df.columns if not c.startswith("_PARTIAL_")]
result_df = result_df.select(output_cols)

7.10.5 ❌ Pitfall 5: Partial Leading-Edge Tiles from Incomplete Source Intervals

Problem: When the source data is a continuous stream aggregated into fixed time buckets (e.g., 15-minute intervals), the current in-progress bucket is incomplete at DT refresh time. The tiled Feature View generates a tile from this partial interval, producing aggregation values with a different distribution than complete historical tiles. This creates training/serving skew: the model trains on features derived from complete intervals but at inference time receives features that include a partial tile.

This affects all tiled aggregations — Feature.avg, Feature.sum, Feature.count, Feature.min, Feature.max — because the partial tile has fewer source events than a complete one.

Solution: Two complementary strategies:

Strategy 1 — Use offset to skip the current interval (DT-safe):

The offset parameter shifts the aggregation window back, inherently excluding the potentially partial current tile. This is deterministic SQL that preserves incremental DT refresh:

# Window covers intervals ending one interval ago — skips partial current tile
Feature.avg("QTY", "1h", offset="15m").alias("QTY_ROLL_4_MEAN")
Feature.sum("QTY", "1h", offset="15m").alias("QTY_ROLL_4_SUM")

This is the simplest and most robust approach for tiled Feature Views. The trade-off is that features always reflect the state one interval behind — typically acceptable for forecasting use cases where the current interval is incomplete anyway.

Strategy 2 — Gate the source table externally (outside the DT):

Use a scheduled Task or Stream+Task pipeline that writes only complete intervals to the source table. The DT reads a table that never contains partial data:

-- Task runs every 15 minutes: write only complete intervals to source
INSERT INTO SOURCE_ORDERS_15MIN
SELECT TIME_SLICE(ORDER_TS, 15, 'MINUTE') AS INTERVAL_ID,
       ITEM_CODE, BRANCH_CODE, COUNT(*) AS QTY
FROM RAW_ORDERS
WHERE TIME_SLICE(ORDER_TS, 15, 'MINUTE')
      < TIME_SLICE(CURRENT_TIMESTAMP(), 15, 'MINUTE', 'START')
GROUP BY 1, 2, 3;

Do not use CURRENT_TIMESTAMP() inside a DT or feature_df

CURRENT_TIMESTAMP() is non-deterministic and forces FULL refresh. The TIME_SLICE gating filter must be applied in a Task or ingestion pipeline outside the DT — never in the DT definition SQL or the feature_df passed to a Feature View. Even when passed via session.sql(...), the non-deterministic expression flows through to the DT and prevents incremental refresh.

For non-tiled DT Feature Views with explicit window functions, use ROWS BETWEEN N PRECEDING AND 1 PRECEDING (excluding CURRENT ROW) as the DT-safe equivalent. See Chapter 5: Partial Leading-Edge Interval and Chapter 6: Data Leakage from Partial Intervals.

7.11 Data-Driven Feature Definitions

The examples earlier in this chapter define features with hard-coded column names, filter values, and window periods. That works well for a handful of stable features, but becomes brittle and verbose as feature sets grow. This section introduces data-driven patterns that generate Feature objects and the corresponding feature_df columns programmatically.

Schema Is Fixed at Registration

The Feature View schema is fixed at registration time. Data-driven generation makes the code path adaptive, not the registered Feature View itself. When underlying data changes (e.g., a new channel appears), re-run the generation code and register a new version of the Feature View.

Pattern 1: Categorical Pivot — Static vs Dynamic

Filter Expressions Within Features shows how CASE WHEN decomposes a metric by a single hard-coded dimension value. When the set of values is large or evolving, generate the columns programmatically:

Static (2 known channels):

SQL
Snowpark DataFrame

source_df = session.sql("""
    SELECT USER_ID, ORDER_TS, ORDER_ID, TOTAL_AMT,
           CASE WHEN CHANNEL = 'AUTOMATED' THEN ORDER_ID END AS AUTOMATED_ORDER_ID,
           CASE WHEN CHANNEL = 'WEB'       THEN ORDER_ID END AS WEB_ORDER_ID
    FROM FEATURE_STORE_DEMO.CLICKSTREAM_DATA.ORDERS
""")

import snowflake.snowpark.functions as F

orders = session.table("FEATURE_STORE_DEMO.CLICKSTREAM_DATA.ORDERS")
source_df = orders.select(
    "USER_ID", "ORDER_TS", "ORDER_ID", "TOTAL_AMT",
    F.when(F.col("CHANNEL") == "AUTOMATED", F.col("ORDER_ID")).alias("AUTOMATED_ORDER_ID"),
    F.when(F.col("CHANNEL") == "WEB", F.col("ORDER_ID")).alias("WEB_ORDER_ID"),
)

features = [
    Feature.count("ORDER_ID", "7d").alias("ORDER_CNT_7D"),
    Feature.count("AUTOMATED_ORDER_ID", "7d").alias("AUTOMATED_ORDER_CNT_7D"),
    Feature.count("WEB_ORDER_ID", "7d").alias("WEB_ORDER_CNT_7D"),
]

Dynamic (discovers all channels from data):

from snowflake.snowpark import functions as F

channels = [row["CHANNEL"] for row in
    session.table("FEATURE_STORE_DEMO.CLICKSTREAM_DATA.ORDERS")
    .select("CHANNEL").distinct().collect()]

source_df = session.table("FEATURE_STORE_DEMO.CLICKSTREAM_DATA.ORDERS") \
    .select("USER_ID", "ORDER_TS", "ORDER_ID", "TOTAL_AMT", "CHANNEL")

for ch in channels:
    tag = ch.upper().replace(" ", "_")
    source_df = source_df.with_column(
        f"{tag}_ORDER_ID",
        F.when(F.col("CHANNEL") == ch, F.col("ORDER_ID"))
    )

feature_df = source_df.drop("CHANNEL")

features = [Feature.count("ORDER_ID", "7d").alias("ORDER_CNT_7D")]
for ch in channels:
    tag = ch.upper().replace(" ", "_")
    features.append(
        Feature.count(f"{tag}_ORDER_ID", "7d").alias(f"{tag}_ORDER_CNT_7D")
    )

When to prefer dynamic: Dimension values change over time (new channels, product categories, regions). Adding a value requires only a data change and a new Feature View version — no code edits.

Pattern 2: Multi-Window Sweep

The Multi-Window Features section above shows a build_features helper that loops over columns and windows. This is already a data-driven pattern: the columns and windows lists are the configuration, and the helper generates Feature objects with consistent naming. Refer to that helper as the starting point for more complex patterns.

Pattern 3: Metric × Dimension × Window Factory

Patterns 1 and 2 combine naturally into a full cross-product: for every (metric, dimension value, window) triple, generate a filtered column and a Feature. With 3 metrics, 4 channels, and 3 windows, this produces 36 features from a single function call instead of 36 hand-written definitions.

from itertools import product as cartesian
from snowflake.snowpark import functions as F
from snowflake.ml.feature_store import Feature

AGG_DISPATCH = {
    "count": Feature.count,
    "sum":   Feature.sum,
    "avg":   Feature.avg,
    "max":   Feature.max,
    "min":   Feature.min,
}

def build_pivot_features(
    session,
    source_table: str,
    entity_col: str,
    ts_col: str,
    pivot_col: str,
    metrics: dict[str, str],      # {"ORDER_ID": "count", "TOTAL_AMT": "sum"}
    windows: list[str],           # ["7d", "30d", "90d"]
) -> tuple:
    """Return (feature_df, features_list) with full cross-product."""
    pivots = [r[0] for r in session.sql(
        f"SELECT DISTINCT {pivot_col} FROM {source_table} ORDER BY 1"
    ).collect()]

    df = session.table(source_table) \
        .select(entity_col, ts_col, pivot_col, *metrics.keys())

    features = []

    # Base (unfiltered) features across all windows
    for col, fn_name in metrics.items():
        for w in windows:
            features.append(
                AGG_DISPATCH[fn_name](col, w).alias(f"{col}_{fn_name.upper()}_{w.upper()}")
            )

    # Per-pivot filtered features
    for pv in pivots:
        tag = pv.upper().replace(" ", "_")
        for col in metrics:
            derived = f"{tag}_{col}"
            df = df.with_column(
                derived, F.when(F.col(pivot_col) == pv, F.col(col))
            )
            for w in windows:
                fn_name = metrics[col]
                features.append(
                    AGG_DISPATCH[fn_name](derived, w)
                    .alias(f"{tag}_{col}_{fn_name.upper()}_{w.upper()}")
                )

    return df.drop(pivot_col), features

Usage:

feature_df, features = build_pivot_features(
    session,
    source_table="FEATURE_STORE_DEMO.CLICKSTREAM_DATA.ORDERS",
    entity_col="USER_ID",
    ts_col="ORDER_TS",
    pivot_col="CHANNEL",
    metrics={"ORDER_ID": "count", "TOTAL_AMT": "sum"},
    windows=["7d", "30d"],
)

fv = FeatureView(
    name="USER_CHANNEL_METRICS",
    entities=[user_entity],
    feature_df=feature_df,
    timestamp_col="ORDER_TS",
    refresh_freq="1 day",
    features=features,
    feature_granularity="1 day",
)

Pattern 4: Config-Driven Feature Specifications

Instead of embedding feature logic in code, define feature specifications as data — either a Python dict for version-controlled config or a Snowflake table for runtime flexibility.

Python dict (version-controlled):

FEATURE_SPECS = [
    {"col": "TOTAL_AMT", "agg": "sum",   "windows": ["7d", "30d", "90d"]},
    {"col": "ORDER_ID",  "agg": "count", "windows": ["7d", "30d"]},
    {"col": "TOTAL_AMT", "agg": "avg",   "windows": ["30d"]},
]

features = []
for spec in FEATURE_SPECS:
    fn = AGG_DISPATCH[spec["agg"]]
    for w in spec["windows"]:
        features.append(
            fn(spec["col"], w).alias(f"{spec['col']}_{spec['agg'].upper()}_{w.upper()}")
        )

Snowflake table (runtime-configurable):

CREATE TABLE IF NOT EXISTS FEATURE_STORE_DEMO.FEATURE_STORE.FEATURE_SPECS (
    DOMAIN      VARCHAR,   -- e.g. 'ORDERS', 'SESSIONS'
    SOURCE_COL  VARCHAR,   -- column to aggregate
    AGG_FN      VARCHAR,   -- 'sum', 'count', 'avg', etc.
    WINDOW_SIZE VARCHAR,   -- '7d', '30d', etc.
    ENABLED     BOOLEAN DEFAULT TRUE
);

specs = session.table("FEATURE_STORE_DEMO.FEATURE_STORE.FEATURE_SPECS") \
    .filter(F.col("DOMAIN") == "ORDERS") \
    .filter(F.col("ENABLED") == True) \
    .collect()

features = []
for row in specs:
    fn = AGG_DISPATCH[row["AGG_FN"]]
    features.append(
        fn(row["SOURCE_COL"], row["WINDOW_SIZE"])
        .alias(f"{row['SOURCE_COL']}_{row['AGG_FN'].upper()}_{row['WINDOW_SIZE'].upper()}")
    )

Approach	Pros	Cons
Python dict	Version-controlled, reviewable in PRs, easy to test	Requires code deployment to change
Snowflake table	Non-developers can manage; takes effect on next FV version	State split between git (code) and table (config); harder to reproduce; audit trail needed

Tip

A hybrid approach keeps the dict in code as the source of truth and syncs it to a Snowflake table for discoverability. This gives you version control and runtime queryability.

Pattern 5: Ratio and Derived Features

Many useful features are derived from base aggregations — conversion rates, averages-per-order, or period-over-period growth. The Feature class supports base aggregation functions only; ratios and formulas must be computed separately.

Tiled Feature Views store partial aggregates

The underlying Dynamic Table of a tiled Feature View contains internal _PARTIAL columns — tile-level fragments, not final values. You cannot compute ratios by reading the DT directly (e.g., session.table("FEATURE_STORE.USER_ORDER_FV$V01")). You must use the Feature Store API to reassemble tiles into final aggregated values first.

Approach A: Post-retrieval (recommended for training pipelines)

generate_dataset and retrieve_feature_values handle tile reassembly automatically. Compute ratios on the returned DataFrame:

RATIOS = [
    ("AUTOMATED_ORDER_CNT_7D", "ORDER_CNT_7D", "AUTOMATION_RATE_7D"),
    ("DISCOUNT_AMT_SUM_30D",   "TOTAL_AMT_SUM_30D", "DISCOUNT_RATE_30D"),
]

dataset_df = fs.generate_dataset(
    name="TRAINING_SET",
    spine_df=spine,
    features=[base_fv],
    output_type="dataset",
).read.to_snowpark_dataframe()

for num, denom, alias in RATIOS:
    dataset_df = dataset_df.with_column(
        alias,
        F.iff(F.col(denom) != 0, F.col(num) / F.col(denom), F.lit(None))
    )

Approach B: Downstream View-based Feature View

Use read_feature_view to get the reassembled values, then build a non-tiled Feature View that computes derived columns:

base_fv = fs.get_feature_view("USER_ORDER_METRICS", "V01")
reassembled_df = fs.read_feature_view(base_fv)

for num, denom, alias in RATIOS:
    reassembled_df = reassembled_df.with_column(
        alias,
        F.iff(F.col(denom) != 0, F.col(num) / F.col(denom), F.lit(None))
    )

ratio_fv = FeatureView(
    name="USER_ORDER_RATIOS",
    entities=[user_entity],
    feature_df=reassembled_df,
    # No refresh_freq — View-based, recomputes on read
)
fs.register_feature_view(ratio_fv, version="V01")

Note

Safe division (F.iff(denom != 0, ...)) prevents NaN / divide-by-zero errors that can silently corrupt model training. Always guard ratio features.

Pattern 6: Interaction Features

Interaction features capture non-linear relationships by combining pairs of columns — typically products (a × b) or ratios (a / b). These apply to any Feature View source DataFrame, not just aggregation Feature Views.

Static (hand-picked pairs):

source_df = source_df.with_column(
    "AMT_X_QTY", F.col("TOTAL_AMT") * F.col("QUANTITY")
).with_column(
    "PRICE_PER_UNIT", F.iff(F.col("QUANTITY") != 0,
                            F.col("TOTAL_AMT") / F.col("QUANTITY"),
                            F.lit(None))
)

Dynamic (all pairwise products from a list):

from itertools import combinations

NUMERIC_COLS = ["TOTAL_AMT", "QUANTITY", "DISCOUNT_AMT"]

for a, b in combinations(NUMERIC_COLS, 2):
    source_df = source_df.with_column(
        f"{a}_X_{b}", F.col(a) * F.col(b)
    )

With 5 numeric columns this generates 10 interaction features; with 10 columns it generates 45. Use this for exploratory model experiments, then prune non-contributing interactions after feature importance evaluation to avoid unnecessary compute.

Static vs Dynamic — Trade-Off Summary

Dimension	Static	Dynamic
Readability	Immediately clear; each feature visible in code	Requires understanding the generation logic
Verbosity	High — repetitive definitions	Low — DRY, loop-based
Testability	Each feature testable individually	Test the generator + spot-check outputs
Adaptability	Manual edit per new value/metric/window	Adapts to data changes; new FV version only
Debugging	Straightforward — read the code	Inspect generated DataFrame schema
Git diffs	Explicit per-feature changes	Changes to generator logic only
Registration	Same	Same — schema fixed at `register_feature_view`

Guidance: Start static for small, stable feature sets (under ~10 features). Switch to dynamic when you find yourself copying and renaming feature definitions, when dimension values evolve, or when you need to experiment with many metric/window combinations rapidly.

Consuming Data-Driven Features with OBJECT_CONSTRUCT

The factory patterns above enforce a naming convention — {DIMENSION}_{METRIC}_{AGG}_{WINDOW} — that becomes a queryable contract. Snowflake’s OBJECT_CONSTRUCT with ILIKE and EXCLUDE filtering can slice dynamically generated columns by any naming axis without hard-coding column lists:

-- Slice features by dimension, window, or collect everything — no column list needed
SELECT
    USER_ID,
    OBJECT_CONSTRUCT(* ILIKE 'AUTOMATED_%')             AS automated_features,
    OBJECT_CONSTRUCT(* ILIKE '%_7D')                     AS window_7d_features,
    OBJECT_CONSTRUCT(* ILIKE '%_30D')                    AS window_30d_features,
    OBJECT_CONSTRUCT(* EXCLUDE (USER_ID, ORDER_TS))      AS all_features
FROM FEATURE_STORE.USER_CHANNEL_METRICS$V01

This is particularly valuable when the factory generates dozens of columns — consumers can dynamically group features by dimension, window, or metric type without knowing every column name. The resulting OBJECTs are well-suited for:

Feature serving APIs that return JSON payloads
Feature exploration — OBJECT_KEYS(all_features) lists what was generated
Selective retrieval — extract specific features with all_features:"AUTOMATED_ORDER_ID_COUNT_7D" path syntax

Two-tier Feature View pattern: Combine this with a lightweight View-based Feature View that repackages the flat columns from a tiled DT-backed Feature View into structured OBJECTs. The tiled FV handles efficient incremental computation; the View FV provides the structured consumption interface. See Chapter 13: Wide & Sparse Feature Data for the full two-tier architecture.

Note

OBJECT_CONSTRUCT operates at the consumption layer — downstream queries and serving. The Feature class still requires flat scalar columns as inputs during definition. The naming convention from the factory is what bridges the two: structured names at definition time enable wildcard grouping at consumption time.

7.12 Summary

Concept	Description
Feature class	Declarative API for time-windowed aggregations (`Feature.sum`, `Feature.count`, …)
Window	Time period for aggregation (e.g., `"7d"`, `"30d"`); optional `offset` shifts the window back
Tile (feature_granularity)	Pre-computation unit for efficiency
Required FV args	`timestamp_col`, `refresh_freq`, `feature_granularity`, `features` (plus `name`, `entities`, `feature_df`)
Version	Register with zero-padded strings such as `"V01"`, `"V02"`
auto_prefix	Automatic column prefixing with FV name
`.with_name()`	Custom column prefixing
RollupConfig	Hierarchical aggregation from fine-grained to coarse-grained entities via `RollupConfig(source, mapping_df)`
Data-driven definitions	Programmatic generation of `Feature` objects and source columns from data or config; patterns include categorical pivot, cross-product factory, config-driven specs, ratio features, and interaction features

7.13 Tiled Feature Views and Online Serving

Tiled Feature Views Cannot Be Served Directly via Hybrid Table OFTs

A tiled Feature View’s underlying Dynamic Table stores partial aggregates per tile interval — columns like _PARTIAL_SUM_TOTAL_AMT, _PARTIAL_COUNT_ORDER_ID, and a TILE_START timestamp. The original timestamp_col (e.g., ORDER_TS) does not exist in the DT schema. When you attempt to create a Hybrid Table Online Feature Table from a tiled FV, the CREATE ONLINE FEATURE TABLE ... TIMESTAMP_COLUMN = ORDER_TS statement fails:

SQL compilation error: column 'ORDER_TS' does not exist

The SDK does not validate this upfront — the error only surfaces at OFT creation time during register_feature_view.

For offline retrieval (fs.read_feature_view, fs.retrieve_feature_values), the Feature Store performs tile reassembly automatically — window aggregations are computed correctly at read time. The issue only arises when you want to push these aggregated values into a Hybrid Table for low-latency online serving.

7.13.1 Workaround: `online_fv_from_tiled` Helper

Manually re-expressing the same aggregation logic as a SQL GROUP BY is error-prone and violates DRY. The online_fv_from_tiled() helper introspects a tiled Feature View’s Feature definitions and automatically generates a non-tiled, OFT-compatible companion:

from online_from_tiled import online_fv_from_tiled

# Step 1: Define the tiled FV (offline only, no online_config)
tiled_fv = FeatureView(
    name="USER_PURCHASE_TILED",
    entities=[user_entity],
    feature_df=session.table("FEATURE_STORE_DEMO.CLICKSTREAM_DATA.ORDERS"),
    timestamp_col="ORDER_TS",
    refresh_freq="1 hour",
    feature_granularity="1 hour",
    features=purchase_features,
    desc="Tiled aggregations — offline only",
)
tiled_reg = fs.register_feature_view(tiled_fv, version="V01", block=True)

# Step 2: Auto-generate a non-tiled, OFT-ready companion
online_fv = online_fv_from_tiled(
    tiled_reg, session,
    source_query="SELECT * FROM FEATURE_STORE_DEMO.CLICKSTREAM_DATA.ORDERS",
    refresh_freq="1 minute",
    target_lag="1 minute",
    window_buffer="1d",
)
online_reg = fs.register_feature_view(online_fv, version="V01", block=True)

The helper produces a GROUP BY query scoped to only max(window + offset) + buffer of source data. For the purchase features above (7-day windows, no offset, 1-day buffer), the generated SQL is:

SELECT
    USER_ID,
    SUM(CASE WHEN ORDER_TS >= DATEADD('DAY', -7, CURRENT_TIMESTAMP())
        THEN TOTAL_AMT END)  AS TOTAL_AMT_SUM_7D,
    COUNT(CASE WHEN ORDER_TS >= DATEADD('DAY', -7, CURRENT_TIMESTAMP())
        THEN ORDER_ID END)   AS ORDER_ID_CNT_7D,
    AVG(CASE WHEN ORDER_TS >= DATEADD('DAY', -7, CURRENT_TIMESTAMP())
        THEN TOTAL_AMT END)  AS TOTAL_AMT_AVG_7D,
    MAX(ORDER_TS) AS LAST_ORDER_TS
FROM FEATURE_STORE_DEMO.CLICKSTREAM_DATA.ORDERS
WHERE ORDER_TS >= DATEADD('DAY', -8, CURRENT_TIMESTAMP())
GROUP BY USER_ID

Each feature gets its own CASE WHEN filter for its specific window, so features with different windows (e.g., 7-day and 30-day) coexist correctly in a single query. The outer WHERE clause scopes the entire query to the longest window plus the buffer.

How the helper works

online_fv_from_tiled() reads each Feature object’s internal attributes (_function, _column, _window, _offset, _alias) and maps them to standard SQL aggregation functions. It:

Converts each Feature aggregation type to its SQL equivalent (e.g., Feature.sum → SUM, Feature.stddev → STDDEV)
Applies per-feature CASE WHEN window filters based on the feature’s _window and _offset
Computes the maximum lookback across all features and adds window_buffer to form the outer WHERE clause
Adds MAX(timestamp_col) AS LAST_{timestamp_col} as the timestamp for OFT tracking
Returns a new FeatureView with online_config enabled and no tiling

For a registered tiled FV (whose internal query has been rewritten to the tiled form), you must pass source_query= pointing at the original source table.

Unsupported aggregations: FIRST_N, LAST_N, FIRST_DISTINCT_N, and LAST_DISTINCT_N cannot be reduced to a simple GROUP BY — the helper raises a clear error for these.

You can also use online_sql_from_features() to inspect or customise the generated SQL before creating a FeatureView:

from online_from_tiled import online_sql_from_features

sql = online_sql_from_features(
    features=purchase_features,
    join_keys=["USER_ID"],
    source_table="FEATURE_STORE_DEMO.CLICKSTREAM_DATA.ORDERS",
    timestamp_col="ORDER_TS",
)
print(sql)

7.13.2 Manual Approach: Direct SQL

If you prefer full control or need to customise the query beyond what the helper generates, you can write the non-tiled FV manually:

SQL
Snowpark DataFrame

from snowflake.ml.feature_store import feature_view as fv_mod

online_df = session.sql("""
    SELECT
        USER_ID,
        SUM(CASE WHEN ORDER_TS >= DATEADD('DAY', -7, CURRENT_TIMESTAMP())
            THEN TOTAL_AMT END)  AS TOTAL_AMT_SUM_7D,
        COUNT(CASE WHEN ORDER_TS >= DATEADD('DAY', -7, CURRENT_TIMESTAMP())
            THEN ORDER_ID END)   AS ORDER_ID_CNT_7D,
        AVG(CASE WHEN ORDER_TS >= DATEADD('DAY', -7, CURRENT_TIMESTAMP())
            THEN TOTAL_AMT END)  AS TOTAL_AMT_AVG_7D,
        MAX(ORDER_TS) AS LAST_ORDER_TS
    FROM FEATURE_STORE_DEMO.CLICKSTREAM_DATA.ORDERS
    WHERE ORDER_TS >= DATEADD('DAY', -8, CURRENT_TIMESTAMP())
    GROUP BY USER_ID
""")

online_fv = FeatureView(
    name="USER_PURCHASE_ONLINE",
    entities=[user_entity],
    feature_df=online_df,
    timestamp_col="LAST_ORDER_TS",
    refresh_freq="1 minute",
    online_config=fv_mod.OnlineConfig(enable=True, target_lag="1 minute"),
    desc="Non-tiled online companion for USER_PURCHASE_TILED",
)
online_reg = fs.register_feature_view(online_fv, version="V01", block=True)

from snowflake.snowpark import functions as F
from snowflake.ml.feature_store import feature_view as fv_mod

orders = session.table("FEATURE_STORE_DEMO.CLICKSTREAM_DATA.ORDERS")
cutoff_7d = F.dateadd("DAY", F.lit(-7), F.current_timestamp())
cutoff_8d = F.dateadd("DAY", F.lit(-8), F.current_timestamp())

online_df = (
    orders
    .filter(F.col("ORDER_TS") >= cutoff_8d)
    .group_by("USER_ID")
    .agg(
        F.sum(F.when(F.col("ORDER_TS") >= cutoff_7d, F.col("TOTAL_AMT")))
            .alias("TOTAL_AMT_SUM_7D"),
        F.count(F.when(F.col("ORDER_TS") >= cutoff_7d, F.col("ORDER_ID")))
            .alias("ORDER_ID_CNT_7D"),
        F.avg(F.when(F.col("ORDER_TS") >= cutoff_7d, F.col("TOTAL_AMT")))
            .alias("TOTAL_AMT_AVG_7D"),
        F.max("ORDER_TS").alias("LAST_ORDER_TS"),
    )
)

online_fv = FeatureView(
    name="USER_PURCHASE_ONLINE",
    entities=[user_entity],
    feature_df=online_df,
    timestamp_col="LAST_ORDER_TS",
    refresh_freq="1 minute",
    online_config=fv_mod.OnlineConfig(enable=True, target_lag="1 minute"),
    desc="Non-tiled online companion for USER_PURCHASE_TILED",
)
online_reg = fs.register_feature_view(online_fv, version="V01", block=True)

Performance: FULL Refresh on Scoped Window

Both the helper and the manual approach use CURRENT_TIMESTAMP() in the WHERE clause to scope data to only the maximum window needed. This forces the resulting Dynamic Table to use FULL refresh on every cycle (not incremental), because Snowflake cannot track changes on queries with non-deterministic functions.

For short-to-medium windows (e.g., 7–30 days), the scoped data volume is small and FULL refresh typically completes in 1–2 seconds. For wide windows (90d, 180d, 365d), the FULL refresh must re-scan a proportionally larger slice of the source table on every cycle. At 365 days the scoped query effectively touches most of the table, negating much of the scoping benefit.

Window Width	Approx. Source Scope	FULL Refresh Impact
7 days	~1 week + buffer	Fast (seconds)
30 days	~1 month + buffer	Low (seconds to tens of seconds)
90–180 days	Quarters to half-year	Moderate — monitor DT refresh duration and warehouse cost
365 days	~full table	High — consider whether online serving of year-wide aggregates is truly needed, or pre-compute in a scheduled task instead

Component	Refresh Mode	Why
Tiled FV (offline)	INCREMENTAL	Deterministic tiling query, efficient for large historical datasets
Online FV (helper-generated)	FULL	`CURRENT_TIMESTAMP()` in WHERE prevents change tracking, but scoped to bounded window
OFT (Hybrid Table)	Mirrors DT	Syncs changes from the online DT each cycle

The recommended pattern is to use both: the Aggregations API for offline computation (efficient tiled storage) and the online_fv_from_tiled() helper for online serving (scoped, bounded FULL refresh). If your widest feature window exceeds 90 days, benchmark the DT refresh time and cost before committing to a 1-minute refresh_freq — a longer refresh cycle (e.g., 5–15 minutes) may be more appropriate for wide windows.

Postgres Online Store (PubPr)

The Postgres-backed online store (OnlineStoreType.POSTGRES) handles tiled Feature Views with query-time tile reassembly. See Chapter 9: Real-Time and Online Serving.

7.14 Next Steps

Continue to Chapter 8: Online Features to learn about Online Feature Tables for real-time serving, including the limitations and workarounds for tiled Feature Views discussed above.

--- title: "Aggregations API" subtitle: "The Feature class for declarative time-windowed aggregations" --- ## Overview The Aggregations API provides a declarative way to define time-windowed features using the `Feature` class. Instead of writing complex SQL with window functions, you specify what aggregations you want and the Feature Store handles the efficient computation. Examples in this chapter use the canonical demo layout: database `FEATURE_STORE_DEMO`, Feature Store schema `FEATURE_STORE`, clickstream source schema `CLICKSTREAM_DATA`, warehouse `FS_DEV_WH`, and tables such as `ORDERS` (`ORDER_TS`, `ORDER_ID`, `USER_ID`, `TOTAL_AMT`). ::: {.callout-note title="SDK Version and Documentation Status"} The `Feature` class and tiled aggregation API require **`snowflake-ml-python >= 1.24.0`**. This is a **client-side SDK feature** -- it generates standard SQL (Dynamic Tables, window functions, aggregations) under the hood, so no account-level enablement or feature flag is needed. Install the required SDK version and you are ready to use it. Official API documentation on the Snowflake docs portal may be limited at time of reading. For the authoritative parameter reference, use `help(Feature)` in a Python session or inspect the SDK source. API names and signatures may evolve in future releases. See also the [platform capabilities and preview status table](../13_advanced_patterns/index.qmd#platform-capabilities-and-preview-status-march-2026) in Chapter 12. ::: ## Learning Objectives After completing this chapter, you will be able to: - Use the Feature class to define time-windowed aggregations - Understand tiling for efficient incremental computation - Configure feature granularity and refresh frequencies - Apply feature column prefixing for disambiguation - Name aggregated features following best practices - Generate features programmatically using data-driven patterns (categorical pivots, cross-product factories, config-driven specs) > 📂 **Chapter code:** [Browse companion scripts on GitHub](https://github.com/Snowflake-Labs/snowflake-featurestore-imp-guide/tree/main/Snowflake_FeatureStore_Implementation_Guide/07_aggregations_api/_code) ```{python} #| output: false #| echo: false from snowflake.snowpark import Session from snowflake.snowpark.context import get_active_session from snowflake.ml.feature_store import FeatureStore, FeatureView, Feature, Entity, CreationMode try: session = get_active_session() except Exception: session = Session.builder.config("connection_name", "default").create() fs = FeatureStore( session=session, database="FEATURE_STORE_DEMO", name="FEATURE_STORE", default_warehouse="FS_DEV_WH", creation_mode=CreationMode.CREATE_IF_NOT_EXIST, ) user_entity = Entity(name="USER", join_keys=["USER_ID"]) ``` --- ## The Feature Class ### Basic Usage > 📁 **Full code:** [`_code/feature_class.py`](_code/feature_class.py) ```{python} # Define time-windowed aggregations declaratively (aliases: _SUM, _CNT, _AVG, …) features = [ Feature.sum("TOTAL_AMT", "7d").alias("TOTAL_AMT_SUM_7D"), Feature.sum("TOTAL_AMT", "30d").alias("TOTAL_AMT_SUM_30D"), Feature.count("ORDER_ID", "7d").alias("ORDER_ID_CNT_7D"), Feature.avg("TOTAL_AMT", "7d").alias("TOTAL_AMT_AVG_7D"), Feature.max("TOTAL_AMT", "30d").alias("TOTAL_AMT_MAX_30D"), ] # Create tiled Feature View (source: canonical clickstream ORDERS) orders_fqn = "FEATURE_STORE_DEMO.CLICKSTREAM_DATA.ORDERS" user_agg_fv = FeatureView( name="USER_PURCHASE_AGGREGATES", entities=[user_entity], feature_df=session.table(orders_fqn), timestamp_col="ORDER_TS", refresh_freq="1 hour", feature_granularity="1 hour", features=features, ) registered_agg_fv = fs.register_feature_view( feature_view=user_agg_fv, version="V01", block=True, overwrite=True, ) print(f"Registered: {registered_agg_fv.name}/V01 (tiled={registered_agg_fv.is_tiled})") print(f"Granularity: {registered_agg_fv.feature_granularity}") ``` ### Available Aggregation Functions | Function | Description | Example | |----------|-------------|---------| | `Feature.sum(col, window)` | Sum over window | `Feature.sum("TOTAL_AMT", "7d")` | | `Feature.min(col, window)` | Minimum over window | `Feature.min("TOTAL_AMT", "30d")` | | `Feature.max(col, window)` | Maximum over window | `Feature.max("TOTAL_AMT", "30d")` | | `Feature.count(col, window)` | Count over window | `Feature.count("ORDER_ID", "30d")` | | `Feature.avg(col, window)` | Average over window | `Feature.avg("TOTAL_AMT", "7d")` | | `Feature.std(col, window)` | Standard deviation | `Feature.std("TOTAL_AMT", "30d")` | | `Feature.var(col, window)` | Variance | `Feature.var("TOTAL_AMT", "30d")` | | `Feature.last_n(col, window, n=N)` | N most recent values | `Feature.last_n("PRODUCT_ID", "7d", n=5)` | | `Feature.first_n(col, window, n=N)` | N oldest values | `Feature.first_n("PRODUCT_ID", "7d", n=5)` | | `Feature.last_distinct_n(col, window, n=N)` | N most recent unique values | `Feature.last_distinct_n("PRODUCT_ID", "7d", n=5)` | | `Feature.first_distinct_n(col, window, n=N)` | N oldest unique values | `Feature.first_distinct_n("PRODUCT_ID", "7d", n=5)` | | `Feature.approx_count_distinct(col, window)` | HLL distinct count | `Feature.approx_count_distinct("PRODUCT_ID", "7d")` | | `Feature.approx_percentile(col, window, percentile=0.5)` | Approximate percentile | `Feature.approx_percentile("TOTAL_AMT", "30d", percentile=0.5)` | ::: {.callout-warning title="Float columns and incremental refresh"} Aggregation Feature Views backed by Dynamic Tables may fall back to **full refresh** if the underlying SQL combines aggregations on `FLOAT`-typed expressions (`sum`, `avg`, `min`, `max`, `std`, `var`) with a JOIN in the same query block. The DT engine cannot guarantee incremental correctness for floating-point aggregate diffs across joins. If your source columns are `FLOAT`/`DOUBLE`, cast them to a fixed-point type **before** the aggregation reaches the DT definition: ```sql CAST(TOTAL_AMT AS NUMBER(38, 6)) ``` The Temporal Aggregation API generates the underlying DT SQL automatically, so if your source table's numeric columns are `FLOAT` and the API introduces a join (e.g., for tiling or rollup), the resulting DT may silently use full refresh. Check `REFRESH_MODE` in `INFORMATION_SCHEMA.DYNAMIC_TABLES()` after registration and alter source column types to `NUMBER` if needed. **ML model impact:** For most ML use cases the precision difference between `FLOAT` and `NUMBER(38, 8)` is negligible -- the variance across training runs or hyperparameter choices far exceeds the rounding delta. If you need higher fidelity, `NUMBER(38, 12)` provides ~12 decimal digits of precision, comparable to `FLOAT` for the value ranges typical in feature engineering. The key trade-off is **incremental refresh (fast, cheap) vs full refresh (slow, expensive)** -- the precision loss from fixed-point casting is almost always the better deal. **Recommended approach for source table owners:** Alter the source column type to `NUMBER`/`DECIMAL` at the table level rather than casting in every Feature View query. This avoids repeating the cast in multiple Feature Views and ensures all downstream consumers benefit: ```sql ALTER TABLE FEATURE_STORE_DEMO.CLICKSTREAM_DATA.ORDERS ALTER COLUMN TOTAL_AMT SET DATA TYPE NUMBER(18, 8); ``` For rollup Feature Views, the same constraint applies transitively -- if the source tiled Feature View aggregates `FLOAT` columns, the rollup DT also falls back to full refresh. Fix the types at the source level. ::: ### Aggregation Function Performance Characteristics {#sec-agg-performance} Not all aggregation functions are equal in terms of tile storage, computation cost, and incremental refresh compatibility. Understanding these trade-offs is important when designing features at scale. | Function | Tile storage | Incremental | Notes | |----------|-------------|-------------|-------| | `count` | Partial counts per tile | Yes | Lightweight; most efficient aggregation | | `sum` | Partial sums per tile | Yes (if not `FLOAT`+JOIN) | Cast to `NUMBER` if source is `FLOAT` | | `min`, `max` | Per-tile min/max | Yes | Used for windowed extremes | | `avg` | Stored as sum + count | Yes (derived) | Reconstructed from partial sum/count at query time | | `std`, `var` | Partial sum, sum-of-squares, count | Yes (derived) | Higher storage per tile; 3 partial aggregates | | `approx_count_distinct` | HLL sketch per tile | Yes | Fixed-size sketch (~12 KB); merges across tiles | | `last_n`, `first_n` | Array per tile | Yes | Storage scales with `n`; larger `n` increases tile size | | `last_distinct_n`, `first_distinct_n` | Deduplicated array per tile | Yes | More expensive than `last_n` due to dedup logic | | `approx_percentile` | T-Digest sketch per tile | Yes | Fixed-size sketch; approximate result | **Guidance for high-cardinality, high-volume data:** - **Prefer `count` and `sum`** as primary features -- they are the most efficient for both tiling and query-time reassembly. - **Use `approx_count_distinct`** instead of exact distinct counts when precision within ~2% is acceptable. The HLL sketch is constant-size regardless of cardinality, whereas exact distinct counts require per-tile array storage. - **Limit `last_n` / `first_n` values of `n`** to what the model actually needs. Each tile stores up to `n` values; at `n=100` with hourly tiles and 90-day windows, tile storage grows substantially. - **`std` and `var`** store three partial aggregates per tile (sum, sum-of-squares, count). If you only need variance for a subset of features, consider computing it post-retrieval from `sum` and `count` features to reduce tile storage. ### Filter Expressions Within Features {#sec-filter-expr} When the same source data feeds features that differ only by a **filter condition** (e.g., count of automated messages vs. count of all messages), use separate `Feature` definitions with filter logic in the source `feature_df` rather than creating multiple Feature Views: ::: {.panel-tabset group="lang"} ## SQL ```python filtered_source_df = session.sql(""" SELECT USER_ID, EVENT_TS, ORDER_ID, TOTAL_AMT, CASE WHEN CHANNEL = 'AUTOMATED' THEN ORDER_ID END AS AUTO_ORDER_ID, CASE WHEN CHANNEL = 'AUTOMATED' THEN TOTAL_AMT END AS AUTO_TOTAL_AMT FROM FEATURE_STORE_DEMO.CLICKSTREAM_DATA.ORDERS """) ``` ## Snowpark DataFrame ```python import snowflake.snowpark.functions as F orders = session.table("FEATURE_STORE_DEMO.CLICKSTREAM_DATA.ORDERS") filtered_source_df = orders.select( F.col("USER_ID"), F.col("EVENT_TS"), F.col("ORDER_ID"), F.col("TOTAL_AMT"), F.when(F.col("CHANNEL") == "AUTOMATED", F.col("ORDER_ID")).alias("AUTO_ORDER_ID"), F.when(F.col("CHANNEL") == "AUTOMATED", F.col("TOTAL_AMT")).alias("AUTO_TOTAL_AMT"), ) ``` ::: ```python features = [ Feature.count("ORDER_ID", "7d").alias("ORDER_CNT_7D"), Feature.count("AUTO_ORDER_ID", "7d").alias("AUTO_ORDER_CNT_7D"), Feature.sum("TOTAL_AMT", "30d").alias("SPEND_SUM_30D"), Feature.sum("AUTO_TOTAL_AMT", "30d").alias("AUTO_SPEND_SUM_30D"), ] ``` This approach keeps all related features in a **single Feature View**, which means they share tiles and refresh together -- significantly more efficient than creating separate Feature Views per filter condition. The `CASE WHEN` produces `NULL` for non-matching rows, and aggregation functions skip NULLs, giving the correct filtered result. The static approach above works well when the filter values are stable and few. When dimension values change over time (new channels, product categories, regions), see [Data-Driven Feature Definitions](#sec-data-driven) for a programmatic alternative that discovers values from data and generates the CASE columns and `Feature` objects automatically. ### Window Specifications Windows can be specified using shorthand notation: | Notation | Meaning | |----------|---------| | `"1h"` or `"1 hour"` | 1 hour | | `"24h"` | 24 hours | | `"7d"` or `"7 day"` | 7 days | | `"30d"` | 30 days | | `"1w"` or `"1 week"` | 1 week | ### Window offset {#sec-window-offset} Pass `offset` to shift the aggregation window backward in time. This is the standard way to build **comparative features** such as week-over-week or month-over-month trends: ```python features = [ Feature.sum("TOTAL_AMT", "7d").alias("TOTAL_AMT_SUM_CURRENT_WEEK"), Feature.sum("TOTAL_AMT", "7d", offset="7d").alias("TOTAL_AMT_SUM_PREV_WEEK"), ] ``` Pair the current-window and offset-window versions to capture momentum or change over time. For example, a model can detect spending acceleration when `TOTAL_AMT_SUM_CURRENT_WEEK` exceeds `TOTAL_AMT_SUM_PREV_WEEK`. The `offset` must be a **multiple of `feature_granularity`** so the shifted window aligns cleanly to tile boundaries. For example, with `feature_granularity="1h"`, valid offsets are `"1h"`, `"2h"`, `"24h"`, `"7d"`, etc. Use `.alias("NAME")` on each `Feature` to set the output column name. --- ## Naming Aggregated Features {#sec-naming} ### Best Practice: Descriptive Aliases Always use `.alias()` to give features meaningful names: ```python # ✅ GOOD: Suffix metric role (_SUM, _CNT, _AVG) and window features = [ Feature.sum("TOTAL_AMT", "7d").alias("TOTAL_AMT_SUM_7D"), Feature.sum("TOTAL_AMT", "30d").alias("TOTAL_AMT_SUM_30D"), Feature.count("ORDER_ID", "7d").alias("ORDER_ID_CNT_7D"), ] # ❌ BAD: Auto-generated names are hard to interpret features = [ Feature.sum("TOTAL_AMT", "7d"), # opaque default name Feature.sum("TOTAL_AMT", "30d"), ] ``` --- ## Feature Column Prefixing for Disambiguation {#sec-prefixing} When joining features from multiple Feature Views, column name collisions can occur. Feature Store provides two mechanisms for disambiguation. ### The Problem ```python # Name collision: same aggregation function on same column, different entity groupings, # using the default naming derivation -- both resolve to ORDER_ID_CNT_7D user_orders_fv = FeatureView( name="USER_ORDERS", entities=[user_entity], features=[Feature.count("ORDER_ID", "7d")], # default name: ORDER_ID_CNT_7D # ... ) merchant_orders_fv = FeatureView( name="MERCHANT_ORDERS", entities=[merchant_entity], features=[Feature.count("ORDER_ID", "7d")], # same default name: ORDER_ID_CNT_7D # ... ) # Without prefixing, collision occurs dataset = fs.generate_dataset( spine_df=spine, features=[user_orders_fv, user_sessions_fv], # Which ORDER_ID_CNT_7D? ) ``` Prefer explicit `.alias()` calls with distinct names (e.g., `MERCHANT_ORDER_ID_CNT_7D`) or use `auto_prefix=True` / `.with_name()` so semantics stay clear even when multiple Feature Views aggregate the same source column. ### Solution 1: Auto-Prefix Use `auto_prefix=True` to automatically prefix columns with Feature View name: > 📁 **Full code:** [`_code/prefixing.py`](_code/prefixing.py) ```python dataset = fs.generate_dataset( spine_df=spine, features=[user_orders_fv, user_sessions_fv], auto_prefix=True, # Columns become USER_ORDERS__ORDER_ID_CNT_7D, etc. ) # Result columns: # - USER_ORDERS__ORDER_ID_CNT_7D # - USER_SESSIONS__SESSION_ID_CNT_7D ``` ### Solution 2: Custom Prefixes with `.with_name()` For more control, use `.with_name()` on individual Feature Views or slices: ```python dataset = fs.generate_dataset( spine_df=spine, features=[ user_orders_fv.with_name("orders"), # Prefix: orders__ user_sessions_fv.with_name("sessions"), # Prefix: sessions__ ], ) # Result columns: # - orders__ORDER_ID_CNT_7D # - sessions__SESSION_ID_CNT_7D ``` ### Solution 3: Combining Both Approaches ```python dataset = fs.generate_dataset( spine_df=spine, features=[ user_orders_fv.with_name("orders"), # Custom prefix user_sessions_fv, # Will use auto_prefix ], auto_prefix=True, ) ``` ::: {.callout-note} ## Priority `.with_name()` takes precedence over `auto_prefix`. If a Feature View has `.with_name()` set, that prefix is used regardless of `auto_prefix` setting. ::: ### Choosing a Prefixing Strategy | Scenario | Recommendation | |----------|----------------| | Single Feature View | No prefix needed | | Multiple FVs, unique column names | No prefix needed | | Multiple FVs, potential collisions | Use `auto_prefix=True` | | Want short, custom prefixes | Use `.with_name()` | | Consistent naming across projects | Define naming convention + `.with_name()` | --- ## Feature Granularity (Tiling) {#sec-tiling} ### How Tiling Works The `feature_granularity` parameter sets the tile size for incremental computation: ```python # Hourly tiles for high-frequency data fv = FeatureView( feature_granularity="1 hour", # Each tile covers 1 hour refresh_freq="15 minutes", # Refresh every 15 minutes # ... ) ``` ### Tile Size Recommendations | Data Pattern | Recommended Tile | Reason | |--------------|------------------|--------| | Clickstream (high frequency) | `1 hour` | Many events per hour | | Transactions (medium) | `1 day` | Daily aggregation patterns | | Weekly metrics | `1 week` | Coarse granularity sufficient | ### Multi-Granularity Tile Strategy {#sec-multi-granularity} When Feature Views need to serve both **short lookback windows** (1d, 7d) and **long lookback windows** (90d, 360d), a single tile size creates a trade-off: small tiles (1 hour) provide precision for short windows but require combining thousands of tiles for 360-day windows; large tiles (1 day) are efficient for long windows but lose sub-day precision. The recommended approach is to use **multiple Feature Views at different tile granularities**, each optimized for the windows it serves: ```python short_window_features = [ Feature.sum("TOTAL_AMT", "1d").alias("SPEND_SUM_1D"), Feature.sum("TOTAL_AMT", "7d").alias("SPEND_SUM_7D"), Feature.count("ORDER_ID", "1d").alias("ORDER_CNT_1D"), Feature.count("ORDER_ID", "7d").alias("ORDER_CNT_7D"), ] short_window_fv = FeatureView( name="USER_PURCHASE_SHORT_WINDOW", entities=[user_entity], feature_df=session.table(orders_fqn), timestamp_col="ORDER_TS", refresh_freq="1 hour", feature_granularity="1 hour", features=short_window_features, desc="Short-window purchase features (1d, 7d) with hourly tiles", ) long_window_features = [ Feature.sum("TOTAL_AMT", "30d").alias("SPEND_SUM_30D"), Feature.sum("TOTAL_AMT", "90d").alias("SPEND_SUM_90D"), Feature.sum("TOTAL_AMT", "360d").alias("SPEND_SUM_360D"), Feature.count("ORDER_ID", "30d").alias("ORDER_CNT_30D"), Feature.count("ORDER_ID", "360d").alias("ORDER_CNT_360D"), ] long_window_fv = FeatureView( name="USER_PURCHASE_LONG_WINDOW", entities=[user_entity], feature_df=session.table(orders_fqn), timestamp_col="ORDER_TS", refresh_freq="1 day", feature_granularity="1 day", features=long_window_features, desc="Long-window purchase features (30d-360d) with daily tiles", ) ``` At training time, join both Feature Views on the same spine: ```python training_df = fs.generate_training_set( spine_df=spine, features=[short_window_fv, long_window_fv], spine_timestamp_col="EVENT_TS", ) ``` **Trade-offs:** | Approach | Tiles for 360d window | Refresh cost | Short-window precision | |----------|----------------------|--------------|----------------------| | Single FV, `1 hour` tiles | 8,640 tiles | High (hourly refresh) | Sub-hour | | Single FV, `1 day` tiles | 360 tiles | Low (daily refresh) | Daily only | | Two FVs (recommended) | 360 tiles (daily FV) + 168 tiles for 7d (hourly FV) | Balanced | Sub-hour for short, daily for long | The two-FV approach also lets you set different `refresh_freq` values: hourly for short windows where freshness matters, daily for long windows where the marginal value of hourly refresh is negligible. --- ## Multi-Window Features Define features with multiple time windows efficiently: ```python features = [ Feature.sum("TOTAL_AMT", "1d").alias("TOTAL_AMT_SUM_1D"), Feature.sum("TOTAL_AMT", "7d").alias("TOTAL_AMT_SUM_7D"), Feature.sum("TOTAL_AMT", "30d").alias("TOTAL_AMT_SUM_30D"), Feature.sum("TOTAL_AMT", "90d").alias("TOTAL_AMT_SUM_90D"), Feature.count("ORDER_ID", "7d").alias("ORDER_ID_CNT_7D"), Feature.count("ORDER_ID", "30d").alias("ORDER_ID_CNT_30D"), ] # Single Feature View efficiently computes all windows (required FV args shown above) multi_window_fv = FeatureView( name="USER_MULTI_WINDOW", entities=[user_entity], feature_df=session.table("FEATURE_STORE_DEMO.CLICKSTREAM_DATA.ORDERS"), timestamp_col="ORDER_TS", refresh_freq="1 day", features=features, feature_granularity="1 day", ) ``` When you have many columns and windows, a helper function avoids repetitive boilerplate and enforces consistent naming: ```python from snowflake.ml.feature_store import Feature def build_features(columns: list[str], windows: list[str], agg: str = "sum") -> list[Feature]: agg_fn = getattr(Feature, agg) return [ agg_fn(col, w).alias(f"{col}_{agg.upper()}_{w.upper()}") for col in columns for w in windows ] features = ( build_features(["TOTAL_AMT", "DISCOUNT_AMT"], ["7d", "30d", "90d"], "sum") + build_features(["ORDER_ID"], ["7d", "30d"], "count") ) ``` --- ## Best Practices ### 1. Use Consistent Naming Conventions {.unnumbered} ```python # Pattern: <COL_OR_ENTITY>_<ROLE>_<WINDOW> (roles: _SUM, _CNT, _AVG, …) features = [ Feature.sum("TOTAL_AMT", "7d").alias("TOTAL_AMT_SUM_7D"), Feature.sum("TOTAL_AMT", "30d").alias("TOTAL_AMT_SUM_30D"), Feature.count("ORDER_ID", "7d").alias("ORDER_ID_CNT_7D"), Feature.avg("TOTAL_AMT", "7d").alias("TOTAL_AMT_AVG_7D"), ] ``` ### 2. Match Tile Size to Data Frequency {.unnumbered} ```python # High-frequency: smaller tiles clickstream_fv = FeatureView( feature_granularity="1 hour", # ... ) # Low-frequency: larger tiles weekly_metrics_fv = FeatureView( feature_granularity="1 day", # ... ) ``` ### 3. Group Related Features and Consolidate Feature Views {#sec-consolidate .unnumbered} When multiple Feature Views share the **same source table, entity, and tile granularity**, they automatically share the same underlying tile computation. Consolidating them into a single Feature View reduces operational overhead (fewer DTs to monitor) and ensures tile computation happens once: ```python # ✅ GOOD: Related features in same Feature View -- single tile computation purchase_features = [ Feature.sum("TOTAL_AMT", "7d").alias("TOTAL_AMT_SUM_7D"), Feature.sum("TOTAL_AMT", "30d").alias("TOTAL_AMT_SUM_30D"), Feature.count("ORDER_ID", "7d").alias("ORDER_ID_CNT_7D"), Feature.count("ORDER_ID", "30d").alias("ORDER_ID_CNT_30D"), Feature.avg("TOTAL_AMT", "7d").alias("TOTAL_AMT_AVG_7D"), ] # ❌ BAD: Same source split across multiple FVs -- duplicated tile computation spend_fv = FeatureView( name="USER_SPEND_FV", feature_df=session.table(orders_fqn), features=[Feature.sum("TOTAL_AMT", "7d").alias("TOTAL_AMT_SUM_7D")], # ... ) count_fv = FeatureView( name="USER_ORDER_COUNT_FV", feature_df=session.table(orders_fqn), features=[Feature.count("ORDER_ID", "7d").alias("ORDER_ID_CNT_7D")], # ... ) ``` Splitting features from the same source into separate Feature Views forces independent tile computations over the same data. Each Feature View creates its own DT with its own tiles, effectively doubling (or more) the compute cost. Consolidate into one Feature View when the source, entity, and granularity are shared. **When separate Feature Views are appropriate:** - Different **source tables** (e.g., ORDERS vs EVENTS) - Different **entities** (e.g., USER vs PRODUCT) - Different **tile granularities** (e.g., hourly tiles for short windows, daily for long -- see [Multi-Granularity](#sec-multi-granularity)) - Different **refresh cadences** needed for operational reasons - **Unrelated** feature domains that evolve independently ```python # ❌ BAD: Unrelated features mixed together mixed_features = [ Feature.sum("TOTAL_AMT", "7d"), # Purchase feature Feature.count("EVENT_ID", "7d"), # Clickstream feature (different grain / entity!) ] ``` ### 4. Tile Sharing Across Feature Views {#sec-tile-sharing .unnumbered} When two Feature Views use the **same source, entity, and `feature_granularity`**, the underlying tile DTs compute identical partial aggregates. Today, the Feature Store does not automatically share tiles between separately registered Feature Views -- each creates its own tile DT. This means: - **Consolidating features** into fewer Feature Views (per best practice #3 above) is the primary mechanism for avoiding duplicate tile computation. - If you need separate Feature Views for organisational reasons (different owners, different version lifecycles), be aware that tile compute is duplicated. The cost impact depends on source data volume and tile granularity. - A future platform optimisation may deduplicate tiles across Feature Views with identical source/entity/granularity, but until then, consolidation is the recommended approach. ::: {.callout-warning title="Stacking tiled Feature Views is not supported"} You **cannot** chain (stack) one Time Aggregation Feature View on top of another. Attempting it causes duplicate internal `_PARTIAL_*` column names and SQL compilation errors that block both `retrieve_feature_values()` and `generate_training_set()`. To aggregate from a finer-grained entity to a coarser one (e.g., visitor → subscriber), use **Rollup Feature Views** with `RollupConfig` as described below -- not raw stacking of tiled Feature Views. ::: --- ## Rollup Aggregated Feature Views {#sec-rollup} Rollup Feature Views enable **hierarchical aggregation** -- creating coarser-grained features from finer-grained tiled Feature Views without recomputing from raw data. For example, clickstream-style hierarchies often roll up **VISITOR** → **USER** → **HOUSEHOLD**: a visitor-level "7-day page view count" can be rolled up to user-level (sum across all visitors linked to that user) and then to household-level (sum across all users in the household). The `RollupConfig` defines the relationship between a finer-grained (source) Feature View and the coarser-grained (target) Feature View. The Feature Store leverages Dynamic Tables for automatic, incremental updates. ### RollupConfig Arguments | Argument | Description | |----------|-------------| | `source` | The parent **tiled** Feature View to roll up from. Must be registered (not in DRAFT status) | | `mapping_df` | A Snowpark DataFrame containing the entity mapping -- all join keys from both the source (finer) and target (coarser) entities | ### Example: Visitor → Subscriber Rollup ```python from snowflake.ml.feature_store import FeatureStore, FeatureView, Entity, RollupConfig, Feature # Entities at both levels of the hierarchy visitor_entity = Entity("VISITOR", ["VISITOR_ID", "COMPANY_ID"]) subscriber_entity = Entity("SUBSCRIBER", ["SUBSCRIBER_ID", "COMPANY_ID"]) fs.register_entity(visitor_entity) fs.register_entity(subscriber_entity) # Fine-grained: tiled visitor-level Feature View (already registered) visitor_features = [ Feature.sum("PAGE_VIEWS", "7d").alias("PAGE_VIEWS_7D"), Feature.count("SESSION_ID", "30d").alias("SESSION_CNT_30D"), ] visitor_fv = FeatureView( name="VISITOR_ENGAGEMENT_FV", entities=[visitor_entity], feature_df=session.table("FEATURE_STORE_DEMO.CLICKSTREAM_DATA.EVENTS"), timestamp_col="EVENT_TS", refresh_freq="1 hour", feature_granularity="1 hour", features=visitor_features, desc="Visitor-level engagement features — tiled", ) registered_visitor_fv = fs.register_feature_view(visitor_fv, version="V01") # Entity mapping: links visitor keys to subscriber keys mapping_df = session.table("FEATURE_STORE_DEMO.CLICKSTREAM_DATA.VISITOR_SUBSCRIBER_MAPPING") # Expected columns: VISITOR_ID, COMPANY_ID, SUBSCRIBER_ID # Rollup configuration rollup_config = RollupConfig( source=registered_visitor_fv, mapping_df=mapping_df, ) # Coarse-grained: subscriber-level rollup Feature View subscriber_fv = FeatureView( name="SUBSCRIBER_ENGAGEMENT_FV", entities=[subscriber_entity], rollup_config=rollup_config, desc="Subscriber-level engagement — rolled up from visitor-level tiles", ) fs.register_feature_view(subscriber_fv, version="V01") ``` The resulting `SUBSCRIBER_ENGAGEMENT_FV` automatically aggregates the visitor-level tiles up to the subscriber level using the entity mapping. Because it is backed by a Dynamic Table, it refreshes incrementally as the source visitor Feature View updates. ::: {.callout-note} The source Feature View **must** be a registered, tiled Feature View (one with `features` and `feature_granularity`). You cannot roll up a non-tiled DT-based or View-based Feature View. ::: ### Rollup Design Guidelines - **Define entities** that match the grain of each Feature View (for example, visitor-level tiles on `EVENTS`, user-level aggregates on `ORDERS` keyed by `USER_ID`). - **Reuse the same aggregation patterns** (`Feature.sum`, `Feature.count`, …) at each level; change the spine entity and join keys when moving from visitor to user to household. - **Keep aliases explicit** (`VISITOR_PAGE_VIEWS_7D` vs `USER_PAGE_VIEWS_7D`) so rolled-up datasets remain interpretable when multiple hierarchy levels appear in one training set. - **Design mapping tables carefully.** The `mapping_df` determines which fine-grained entities roll up to which coarse-grained entities. For static relationships a simple `DISTINCT` mapping suffices; for time-varying relationships (visitors identified as subscribers over time), consider SCD-Type2 patterns or `MAPPED_AT` timestamps. See [Chapter 3: Entity Mapping Tables](../03_entities_hierarchies/index.qmd#sec-mapping-tables) for patterns and PIT correctness implications. - **Watch for late-arriving mappings.** If entity mappings are added retroactively, the rollup may aggregate historical tile data under a mapping that did not exist at the original event time, creating subtle data leakage. See the mapping table PIT warning in [Chapter 3](../03_entities_hierarchies/index.qmd#sec-mapping-tables) and [Chapter 6](../06_temporal_features/index.qmd#sec-mapping-leakage). --- ## Common Pitfalls ### ❌ Pitfall 1: No Alias **Problem**: Auto-generated names are unclear. **Solution**: Always use `.alias()` with descriptive names. ### ❌ Pitfall 2: Wrong Tile Size **Problem**: Tile size doesn't match data frequency. **Solution**: Use smaller tiles for high-frequency data. ### ❌ Pitfall 3: Column Collisions **Problem**: Same column names from different Feature Views. **Solution**: Use `auto_prefix=True` or `.with_name()`. ### ❌ Pitfall 4: Internal _PARTIAL Columns in Output **Problem**: Internal columns used for tile aggregation (prefixed `_PARTIAL_`) sometimes appear in output DataFrames from `generate_training_set()` or `retrieve_feature_values()`. These are implementation artifacts -- partial aggregates stored per tile -- and are not user-facing features. **Solution**: Filter them before consuming the output: ```python result_df = fs.generate_training_set(spine_df=spine, features=fv_list, ...) # Drop internal tile columns output_cols = [c for c in result_df.columns if not c.startswith("_PARTIAL_")] result_df = result_df.select(output_cols) ``` ### ❌ Pitfall 5: Partial Leading-Edge Tiles from Incomplete Source Intervals {#sec-partial-interval-tiling} **Problem**: When the source data is a continuous stream aggregated into fixed time buckets (e.g., 15-minute intervals), the current in-progress bucket is **incomplete** at DT refresh time. The tiled Feature View generates a tile from this partial interval, producing aggregation values with a different distribution than complete historical tiles. This creates **training/serving skew**: the model trains on features derived from complete intervals but at inference time receives features that include a partial tile. This affects all tiled aggregations — `Feature.avg`, `Feature.sum`, `Feature.count`, `Feature.min`, `Feature.max` — because the partial tile has fewer source events than a complete one. **Solution**: Two complementary strategies: **Strategy 1 — Use `offset` to skip the current interval (DT-safe):** The `offset` parameter shifts the aggregation window back, inherently excluding the potentially partial current tile. This is deterministic SQL that preserves incremental DT refresh: ```python # Window covers intervals ending one interval ago — skips partial current tile Feature.avg("QTY", "1h", offset="15m").alias("QTY_ROLL_4_MEAN") Feature.sum("QTY", "1h", offset="15m").alias("QTY_ROLL_4_SUM") ``` This is the simplest and most robust approach for tiled Feature Views. The trade-off is that features always reflect the state one interval behind — typically acceptable for forecasting use cases where the current interval is incomplete anyway. **Strategy 2 — Gate the source table externally (outside the DT):** Use a scheduled Task or Stream+Task pipeline that writes only complete intervals to the source table. The DT reads a table that never contains partial data: ```sql -- Task runs every 15 minutes: write only complete intervals to source INSERT INTO SOURCE_ORDERS_15MIN SELECT TIME_SLICE(ORDER_TS, 15, 'MINUTE') AS INTERVAL_ID, ITEM_CODE, BRANCH_CODE, COUNT(*) AS QTY FROM RAW_ORDERS WHERE TIME_SLICE(ORDER_TS, 15, 'MINUTE') < TIME_SLICE(CURRENT_TIMESTAMP(), 15, 'MINUTE', 'START') GROUP BY 1, 2, 3; ``` ::: {.callout-warning title="Do not use CURRENT_TIMESTAMP() inside a DT or feature_df"} `CURRENT_TIMESTAMP()` is non-deterministic and forces FULL refresh. The `TIME_SLICE` gating filter must be applied in a Task or ingestion pipeline *outside* the DT — never in the DT definition SQL or the `feature_df` passed to a Feature View. Even when passed via `session.sql(...)`, the non-deterministic expression flows through to the DT and prevents incremental refresh. ::: For non-tiled DT Feature Views with explicit window functions, use `ROWS BETWEEN N PRECEDING AND 1 PRECEDING` (excluding `CURRENT ROW`) as the DT-safe equivalent. See [Chapter 5: Partial Leading-Edge Interval](../05_feature_pipelines/index.qmd#sec-partial-interval) and [Chapter 6: Data Leakage from Partial Intervals](../06_temporal_features/index.qmd#sec-partial-interval-leakage). --- ## Data-Driven Feature Definitions {#sec-data-driven} The examples earlier in this chapter define features with hard-coded column names, filter values, and window periods. That works well for a handful of stable features, but becomes brittle and verbose as feature sets grow. This section introduces **data-driven patterns** that generate `Feature` objects and the corresponding `feature_df` columns programmatically. ::: {.callout-important} ## Schema Is Fixed at Registration The Feature View schema is fixed at registration time. Data-driven generation makes the **code path** adaptive, not the registered Feature View itself. When underlying data changes (e.g., a new channel appears), re-run the generation code and register a **new version** of the Feature View. ::: ### Pattern 1: Categorical Pivot — Static vs Dynamic {.unnumbered} [Filter Expressions Within Features](#sec-filter-expr) shows how `CASE WHEN` decomposes a metric by a single hard-coded dimension value. When the set of values is large or evolving, generate the columns programmatically: **Static (2 known channels):** ::: {.panel-tabset group="lang"} ## SQL ```python source_df = session.sql(""" SELECT USER_ID, ORDER_TS, ORDER_ID, TOTAL_AMT, CASE WHEN CHANNEL = 'AUTOMATED' THEN ORDER_ID END AS AUTOMATED_ORDER_ID, CASE WHEN CHANNEL = 'WEB' THEN ORDER_ID END AS WEB_ORDER_ID FROM FEATURE_STORE_DEMO.CLICKSTREAM_DATA.ORDERS """) ``` ## Snowpark DataFrame ```python import snowflake.snowpark.functions as F orders = session.table("FEATURE_STORE_DEMO.CLICKSTREAM_DATA.ORDERS") source_df = orders.select( "USER_ID", "ORDER_TS", "ORDER_ID", "TOTAL_AMT", F.when(F.col("CHANNEL") == "AUTOMATED", F.col("ORDER_ID")).alias("AUTOMATED_ORDER_ID"), F.when(F.col("CHANNEL") == "WEB", F.col("ORDER_ID")).alias("WEB_ORDER_ID"), ) ``` ::: ```python features = [ Feature.count("ORDER_ID", "7d").alias("ORDER_CNT_7D"), Feature.count("AUTOMATED_ORDER_ID", "7d").alias("AUTOMATED_ORDER_CNT_7D"), Feature.count("WEB_ORDER_ID", "7d").alias("WEB_ORDER_CNT_7D"), ] ``` **Dynamic (discovers all channels from data):** ```python from snowflake.snowpark import functions as F channels = [row["CHANNEL"] for row in session.table("FEATURE_STORE_DEMO.CLICKSTREAM_DATA.ORDERS") .select("CHANNEL").distinct().collect()] source_df = session.table("FEATURE_STORE_DEMO.CLICKSTREAM_DATA.ORDERS") \ .select("USER_ID", "ORDER_TS", "ORDER_ID", "TOTAL_AMT", "CHANNEL") for ch in channels: tag = ch.upper().replace(" ", "_") source_df = source_df.with_column( f"{tag}_ORDER_ID", F.when(F.col("CHANNEL") == ch, F.col("ORDER_ID")) ) feature_df = source_df.drop("CHANNEL") features = [Feature.count("ORDER_ID", "7d").alias("ORDER_CNT_7D")] for ch in channels: tag = ch.upper().replace(" ", "_") features.append( Feature.count(f"{tag}_ORDER_ID", "7d").alias(f"{tag}_ORDER_CNT_7D") ) ``` **When to prefer dynamic:** Dimension values change over time (new channels, product categories, regions). Adding a value requires only a data change and a new Feature View version — no code edits. ### Pattern 2: Multi-Window Sweep {.unnumbered} The [Multi-Window Features](#multi-window-features) section above shows a `build_features` helper that loops over columns and windows. This is already a data-driven pattern: the `columns` and `windows` lists are the configuration, and the helper generates `Feature` objects with consistent naming. Refer to that helper as the starting point for more complex patterns. ### Pattern 3: Metric × Dimension × Window Factory {.unnumbered} Patterns 1 and 2 combine naturally into a full cross-product: for every *(metric, dimension value, window)* triple, generate a filtered column and a `Feature`. With 3 metrics, 4 channels, and 3 windows, this produces 36 features from a single function call instead of 36 hand-written definitions. ```python from itertools import product as cartesian from snowflake.snowpark import functions as F from snowflake.ml.feature_store import Feature AGG_DISPATCH = { "count": Feature.count, "sum": Feature.sum, "avg": Feature.avg, "max": Feature.max, "min": Feature.min, } def build_pivot_features( session, source_table: str, entity_col: str, ts_col: str, pivot_col: str, metrics: dict[str, str], # {"ORDER_ID": "count", "TOTAL_AMT": "sum"} windows: list[str], # ["7d", "30d", "90d"] ) -> tuple: """Return (feature_df, features_list) with full cross-product.""" pivots = [r[0] for r in session.sql( f"SELECT DISTINCT {pivot_col} FROM {source_table} ORDER BY 1" ).collect()] df = session.table(source_table) \ .select(entity_col, ts_col, pivot_col, *metrics.keys()) features = [] # Base (unfiltered) features across all windows for col, fn_name in metrics.items(): for w in windows: features.append( AGG_DISPATCH[fn_name](col, w).alias(f"{col}_{fn_name.upper()}_{w.upper()}") ) # Per-pivot filtered features for pv in pivots: tag = pv.upper().replace(" ", "_") for col in metrics: derived = f"{tag}_{col}" df = df.with_column( derived, F.when(F.col(pivot_col) == pv, F.col(col)) ) for w in windows: fn_name = metrics[col] features.append( AGG_DISPATCH[fn_name](derived, w) .alias(f"{tag}_{col}_{fn_name.upper()}_{w.upper()}") ) return df.drop(pivot_col), features ``` Usage: ```python feature_df, features = build_pivot_features( session, source_table="FEATURE_STORE_DEMO.CLICKSTREAM_DATA.ORDERS", entity_col="USER_ID", ts_col="ORDER_TS", pivot_col="CHANNEL", metrics={"ORDER_ID": "count", "TOTAL_AMT": "sum"}, windows=["7d", "30d"], ) fv = FeatureView( name="USER_CHANNEL_METRICS", entities=[user_entity], feature_df=feature_df, timestamp_col="ORDER_TS", refresh_freq="1 day", features=features, feature_granularity="1 day", ) ``` ### Pattern 4: Config-Driven Feature Specifications {.unnumbered} Instead of embedding feature logic in code, define feature specifications as data — either a Python dict for version-controlled config or a Snowflake table for runtime flexibility. **Python dict (version-controlled):** ```python FEATURE_SPECS = [ {"col": "TOTAL_AMT", "agg": "sum", "windows": ["7d", "30d", "90d"]}, {"col": "ORDER_ID", "agg": "count", "windows": ["7d", "30d"]}, {"col": "TOTAL_AMT", "agg": "avg", "windows": ["30d"]}, ] features = [] for spec in FEATURE_SPECS: fn = AGG_DISPATCH[spec["agg"]] for w in spec["windows"]: features.append( fn(spec["col"], w).alias(f"{spec['col']}_{spec['agg'].upper()}_{w.upper()}") ) ``` **Snowflake table (runtime-configurable):** ```sql CREATE TABLE IF NOT EXISTS FEATURE_STORE_DEMO.FEATURE_STORE.FEATURE_SPECS ( DOMAIN VARCHAR, -- e.g. 'ORDERS', 'SESSIONS' SOURCE_COL VARCHAR, -- column to aggregate AGG_FN VARCHAR, -- 'sum', 'count', 'avg', etc. WINDOW_SIZE VARCHAR, -- '7d', '30d', etc. ENABLED BOOLEAN DEFAULT TRUE ); ``` ```python specs = session.table("FEATURE_STORE_DEMO.FEATURE_STORE.FEATURE_SPECS") \ .filter(F.col("DOMAIN") == "ORDERS") \ .filter(F.col("ENABLED") == True) \ .collect() features = [] for row in specs: fn = AGG_DISPATCH[row["AGG_FN"]] features.append( fn(row["SOURCE_COL"], row["WINDOW_SIZE"]) .alias(f"{row['SOURCE_COL']}_{row['AGG_FN'].upper()}_{row['WINDOW_SIZE'].upper()}") ) ``` | Approach | Pros | Cons | |----------|------|------| | **Python dict** | Version-controlled, reviewable in PRs, easy to test | Requires code deployment to change | | **Snowflake table** | Non-developers can manage; takes effect on next FV version | State split between git (code) and table (config); harder to reproduce; audit trail needed | ::: {.callout-tip} A **hybrid** approach keeps the dict in code as the source of truth and syncs it to a Snowflake table for discoverability. This gives you version control *and* runtime queryability. ::: ### Pattern 5: Ratio and Derived Features {.unnumbered} Many useful features are **derived from base aggregations** — conversion rates, averages-per-order, or period-over-period growth. The `Feature` class supports base aggregation functions only; ratios and formulas must be computed separately. ::: {.callout-warning} ## Tiled Feature Views store partial aggregates The underlying Dynamic Table of a tiled Feature View contains internal `_PARTIAL` columns — tile-level fragments, not final values. You **cannot** compute ratios by reading the DT directly (e.g., `session.table("FEATURE_STORE.USER_ORDER_FV$V01")`). You must use the Feature Store API to reassemble tiles into final aggregated values first. ::: **Approach A: Post-retrieval (recommended for training pipelines)** `generate_dataset` and `retrieve_feature_values` handle tile reassembly automatically. Compute ratios on the returned DataFrame: ```python RATIOS = [ ("AUTOMATED_ORDER_CNT_7D", "ORDER_CNT_7D", "AUTOMATION_RATE_7D"), ("DISCOUNT_AMT_SUM_30D", "TOTAL_AMT_SUM_30D", "DISCOUNT_RATE_30D"), ] dataset_df = fs.generate_dataset( name="TRAINING_SET", spine_df=spine, features=[base_fv], output_type="dataset", ).read.to_snowpark_dataframe() for num, denom, alias in RATIOS: dataset_df = dataset_df.with_column( alias, F.iff(F.col(denom) != 0, F.col(num) / F.col(denom), F.lit(None)) ) ``` **Approach B: Downstream View-based Feature View** Use `read_feature_view` to get the reassembled values, then build a non-tiled Feature View that computes derived columns: ```python base_fv = fs.get_feature_view("USER_ORDER_METRICS", "V01") reassembled_df = fs.read_feature_view(base_fv) for num, denom, alias in RATIOS: reassembled_df = reassembled_df.with_column( alias, F.iff(F.col(denom) != 0, F.col(num) / F.col(denom), F.lit(None)) ) ratio_fv = FeatureView( name="USER_ORDER_RATIOS", entities=[user_entity], feature_df=reassembled_df, # No refresh_freq — View-based, recomputes on read ) fs.register_feature_view(ratio_fv, version="V01") ``` ::: {.callout-note} Safe division (`F.iff(denom != 0, ...)`) prevents `NaN` / divide-by-zero errors that can silently corrupt model training. Always guard ratio features. ::: ### Pattern 6: Interaction Features {.unnumbered} Interaction features capture non-linear relationships by combining pairs of columns — typically products (`a × b`) or ratios (`a / b`). These apply to **any Feature View source DataFrame**, not just aggregation Feature Views. **Static (hand-picked pairs):** ```python source_df = source_df.with_column( "AMT_X_QTY", F.col("TOTAL_AMT") * F.col("QUANTITY") ).with_column( "PRICE_PER_UNIT", F.iff(F.col("QUANTITY") != 0, F.col("TOTAL_AMT") / F.col("QUANTITY"), F.lit(None)) ) ``` **Dynamic (all pairwise products from a list):** ```python from itertools import combinations NUMERIC_COLS = ["TOTAL_AMT", "QUANTITY", "DISCOUNT_AMT"] for a, b in combinations(NUMERIC_COLS, 2): source_df = source_df.with_column( f"{a}_X_{b}", F.col(a) * F.col(b) ) ``` With 5 numeric columns this generates 10 interaction features; with 10 columns it generates 45. **Use this for exploratory model experiments**, then prune non-contributing interactions after feature importance evaluation to avoid unnecessary compute. ### Static vs Dynamic — Trade-Off Summary {.unnumbered} | Dimension | Static | Dynamic | |-----------|--------|---------| | **Readability** | Immediately clear; each feature visible in code | Requires understanding the generation logic | | **Verbosity** | High — repetitive definitions | Low — DRY, loop-based | | **Testability** | Each feature testable individually | Test the generator + spot-check outputs | | **Adaptability** | Manual edit per new value/metric/window | Adapts to data changes; new FV version only | | **Debugging** | Straightforward — read the code | Inspect generated DataFrame schema | | **Git diffs** | Explicit per-feature changes | Changes to generator logic only | | **Registration** | Same | Same — schema fixed at `register_feature_view` | **Guidance:** Start static for small, stable feature sets (under ~10 features). Switch to dynamic when you find yourself copying and renaming feature definitions, when dimension values evolve, or when you need to experiment with many metric/window combinations rapidly. ### Consuming Data-Driven Features with OBJECT_CONSTRUCT {.unnumbered} The factory patterns above enforce a naming convention — `{DIMENSION}_{METRIC}_{AGG}_{WINDOW}` — that becomes a **queryable contract**. Snowflake's [`OBJECT_CONSTRUCT`](https://docs.snowflake.com/en/sql-reference/functions/object_construct) with `ILIKE` and `EXCLUDE` filtering can slice dynamically generated columns by any naming axis without hard-coding column lists: ```sql -- Slice features by dimension, window, or collect everything — no column list needed SELECT USER_ID, OBJECT_CONSTRUCT(* ILIKE 'AUTOMATED_%') AS automated_features, OBJECT_CONSTRUCT(* ILIKE '%_7D') AS window_7d_features, OBJECT_CONSTRUCT(* ILIKE '%_30D') AS window_30d_features, OBJECT_CONSTRUCT(* EXCLUDE (USER_ID, ORDER_TS)) AS all_features FROM FEATURE_STORE.USER_CHANNEL_METRICS$V01 ``` This is particularly valuable when the factory generates dozens of columns — consumers can dynamically group features by dimension, window, or metric type without knowing every column name. The resulting OBJECTs are well-suited for: - **Feature serving APIs** that return JSON payloads - **Feature exploration** — `OBJECT_KEYS(all_features)` lists what was generated - **Selective retrieval** — extract specific features with `all_features:"AUTOMATED_ORDER_ID_COUNT_7D"` path syntax **Two-tier Feature View pattern:** Combine this with a lightweight View-based Feature View that repackages the flat columns from a tiled DT-backed Feature View into structured OBJECTs. The tiled FV handles efficient incremental computation; the View FV provides the structured consumption interface. See [Chapter 13: Wide & Sparse Feature Data](../13_advanced_patterns/index.qmd#sec-wide-sparse) for the full two-tier architecture. ::: {.callout-note} `OBJECT_CONSTRUCT` operates at the **consumption layer** — downstream queries and serving. The `Feature` class still requires flat scalar columns as inputs during definition. The naming convention from the factory is what bridges the two: structured names at definition time enable wildcard grouping at consumption time. ::: --- ## Summary | Concept | Description | |---------|-------------| | **Feature class** | Declarative API for time-windowed aggregations (`Feature.sum`, `Feature.count`, …) | | **Window** | Time period for aggregation (e.g., `"7d"`, `"30d"`); optional `offset` shifts the window back | | **Tile (feature_granularity)** | Pre-computation unit for efficiency | | **Required FV args** | `timestamp_col`, `refresh_freq`, `feature_granularity`, `features` (plus `name`, `entities`, `feature_df`) | | **Version** | Register with zero-padded strings such as `"V01"`, `"V02"` | | **auto_prefix** | Automatic column prefixing with FV name | | `.with_name()` | Custom column prefixing | | **RollupConfig** | Hierarchical aggregation from fine-grained to coarse-grained entities via `RollupConfig(source, mapping_df)` | | **Data-driven definitions** | Programmatic generation of `Feature` objects and source columns from data or config; patterns include categorical pivot, cross-product factory, config-driven specs, ratio features, and interaction features | --- ## Tiled Feature Views and Online Serving {#sec-tiled-online} ::: {.callout-warning title="Tiled Feature Views Cannot Be Served Directly via Hybrid Table OFTs"} A tiled Feature View's underlying Dynamic Table stores **partial aggregates** per tile interval — columns like `_PARTIAL_SUM_TOTAL_AMT`, `_PARTIAL_COUNT_ORDER_ID`, and a `TILE_START` timestamp. The original `timestamp_col` (e.g., `ORDER_TS`) **does not exist** in the DT schema. When you attempt to create a Hybrid Table Online Feature Table from a tiled FV, the `CREATE ONLINE FEATURE TABLE ... TIMESTAMP_COLUMN = ORDER_TS` statement fails: ``` SQL compilation error: column 'ORDER_TS' does not exist ``` The SDK does not validate this upfront — the error only surfaces at OFT creation time during `register_feature_view`. ::: For **offline retrieval** (`fs.read_feature_view`, `fs.retrieve_feature_values`), the Feature Store performs tile reassembly automatically — window aggregations are computed correctly at read time. The issue only arises when you want to push these aggregated values into a Hybrid Table for low-latency online serving. ### Workaround: `online_fv_from_tiled` Helper {#sec-tiled-online-workaround} Manually re-expressing the same aggregation logic as a SQL `GROUP BY` is error-prone and violates DRY. The `online_fv_from_tiled()` helper introspects a tiled Feature View's `Feature` definitions and **automatically generates** a non-tiled, OFT-compatible companion: ```{python} #| eval: false from online_from_tiled import online_fv_from_tiled # Step 1: Define the tiled FV (offline only, no online_config) tiled_fv = FeatureView( name="USER_PURCHASE_TILED", entities=[user_entity], feature_df=session.table("FEATURE_STORE_DEMO.CLICKSTREAM_DATA.ORDERS"), timestamp_col="ORDER_TS", refresh_freq="1 hour", feature_granularity="1 hour", features=purchase_features, desc="Tiled aggregations — offline only", ) tiled_reg = fs.register_feature_view(tiled_fv, version="V01", block=True) # Step 2: Auto-generate a non-tiled, OFT-ready companion online_fv = online_fv_from_tiled( tiled_reg, session, source_query="SELECT * FROM FEATURE_STORE_DEMO.CLICKSTREAM_DATA.ORDERS", refresh_freq="1 minute", target_lag="1 minute", window_buffer="1d", ) online_reg = fs.register_feature_view(online_fv, version="V01", block=True) ``` The helper produces a `GROUP BY` query scoped to only `max(window + offset) + buffer` of source data. For the purchase features above (7-day windows, no offset, 1-day buffer), the generated SQL is: ```sql SELECT USER_ID, SUM(CASE WHEN ORDER_TS >= DATEADD('DAY', -7, CURRENT_TIMESTAMP()) THEN TOTAL_AMT END) AS TOTAL_AMT_SUM_7D, COUNT(CASE WHEN ORDER_TS >= DATEADD('DAY', -7, CURRENT_TIMESTAMP()) THEN ORDER_ID END) AS ORDER_ID_CNT_7D, AVG(CASE WHEN ORDER_TS >= DATEADD('DAY', -7, CURRENT_TIMESTAMP()) THEN TOTAL_AMT END) AS TOTAL_AMT_AVG_7D, MAX(ORDER_TS) AS LAST_ORDER_TS FROM FEATURE_STORE_DEMO.CLICKSTREAM_DATA.ORDERS WHERE ORDER_TS >= DATEADD('DAY', -8, CURRENT_TIMESTAMP()) GROUP BY USER_ID ``` Each feature gets its own `CASE WHEN` filter for its specific window, so features with different windows (e.g., 7-day and 30-day) coexist correctly in a single query. The outer `WHERE` clause scopes the entire query to the longest window plus the buffer. ::: {.callout-note title="How the helper works"} `online_fv_from_tiled()` reads each `Feature` object's internal attributes (`_function`, `_column`, `_window`, `_offset`, `_alias`) and maps them to standard SQL aggregation functions. It: 1. Converts each `Feature` aggregation type to its SQL equivalent (e.g., `Feature.sum` → `SUM`, `Feature.stddev` → `STDDEV`) 2. Applies per-feature `CASE WHEN` window filters based on the feature's `_window` and `_offset` 3. Computes the maximum lookback across all features and adds `window_buffer` to form the outer `WHERE` clause 4. Adds `MAX(timestamp_col) AS LAST_{timestamp_col}` as the timestamp for OFT tracking 5. Returns a new `FeatureView` with `online_config` enabled and no tiling For a registered tiled FV (whose internal query has been rewritten to the tiled form), you must pass `source_query=` pointing at the original source table. **Unsupported aggregations:** `FIRST_N`, `LAST_N`, `FIRST_DISTINCT_N`, and `LAST_DISTINCT_N` cannot be reduced to a simple `GROUP BY` — the helper raises a clear error for these. ::: You can also use `online_sql_from_features()` to inspect or customise the generated SQL before creating a FeatureView: ```{python} #| eval: false from online_from_tiled import online_sql_from_features sql = online_sql_from_features( features=purchase_features, join_keys=["USER_ID"], source_table="FEATURE_STORE_DEMO.CLICKSTREAM_DATA.ORDERS", timestamp_col="ORDER_TS", ) print(sql) ``` ### Manual Approach: Direct SQL {#sec-tiled-online-manual} If you prefer full control or need to customise the query beyond what the helper generates, you can write the non-tiled FV manually: ::: {.panel-tabset group="feature-definition"} ## SQL ```{python} #| eval: false from snowflake.ml.feature_store import feature_view as fv_mod online_df = session.sql(""" SELECT USER_ID, SUM(CASE WHEN ORDER_TS >= DATEADD('DAY', -7, CURRENT_TIMESTAMP()) THEN TOTAL_AMT END) AS TOTAL_AMT_SUM_7D, COUNT(CASE WHEN ORDER_TS >= DATEADD('DAY', -7, CURRENT_TIMESTAMP()) THEN ORDER_ID END) AS ORDER_ID_CNT_7D, AVG(CASE WHEN ORDER_TS >= DATEADD('DAY', -7, CURRENT_TIMESTAMP()) THEN TOTAL_AMT END) AS TOTAL_AMT_AVG_7D, MAX(ORDER_TS) AS LAST_ORDER_TS FROM FEATURE_STORE_DEMO.CLICKSTREAM_DATA.ORDERS WHERE ORDER_TS >= DATEADD('DAY', -8, CURRENT_TIMESTAMP()) GROUP BY USER_ID """) online_fv = FeatureView( name="USER_PURCHASE_ONLINE", entities=[user_entity], feature_df=online_df, timestamp_col="LAST_ORDER_TS", refresh_freq="1 minute", online_config=fv_mod.OnlineConfig(enable=True, target_lag="1 minute"), desc="Non-tiled online companion for USER_PURCHASE_TILED", ) online_reg = fs.register_feature_view(online_fv, version="V01", block=True) ``` ## Snowpark DataFrame ```{python} #| eval: false from snowflake.snowpark import functions as F from snowflake.ml.feature_store import feature_view as fv_mod orders = session.table("FEATURE_STORE_DEMO.CLICKSTREAM_DATA.ORDERS") cutoff_7d = F.dateadd("DAY", F.lit(-7), F.current_timestamp()) cutoff_8d = F.dateadd("DAY", F.lit(-8), F.current_timestamp()) online_df = ( orders .filter(F.col("ORDER_TS") >= cutoff_8d) .group_by("USER_ID") .agg( F.sum(F.when(F.col("ORDER_TS") >= cutoff_7d, F.col("TOTAL_AMT"))) .alias("TOTAL_AMT_SUM_7D"), F.count(F.when(F.col("ORDER_TS") >= cutoff_7d, F.col("ORDER_ID"))) .alias("ORDER_ID_CNT_7D"), F.avg(F.when(F.col("ORDER_TS") >= cutoff_7d, F.col("TOTAL_AMT"))) .alias("TOTAL_AMT_AVG_7D"), F.max("ORDER_TS").alias("LAST_ORDER_TS"), ) ) online_fv = FeatureView( name="USER_PURCHASE_ONLINE", entities=[user_entity], feature_df=online_df, timestamp_col="LAST_ORDER_TS", refresh_freq="1 minute", online_config=fv_mod.OnlineConfig(enable=True, target_lag="1 minute"), desc="Non-tiled online companion for USER_PURCHASE_TILED", ) online_reg = fs.register_feature_view(online_fv, version="V01", block=True) ``` ::: ::: {.callout-important title="Performance: FULL Refresh on Scoped Window"} Both the helper and the manual approach use `CURRENT_TIMESTAMP()` in the `WHERE` clause to scope data to only the maximum window needed. This forces the resulting Dynamic Table to use **FULL refresh** on every cycle (not incremental), because Snowflake cannot track changes on queries with non-deterministic functions. For short-to-medium windows (e.g., 7–30 days), the scoped data volume is small and FULL refresh typically completes in 1–2 seconds. For **wide windows** (90d, 180d, 365d), the FULL refresh must re-scan a proportionally larger slice of the source table on every cycle. At 365 days the scoped query effectively touches most of the table, negating much of the scoping benefit. | Window Width | Approx. Source Scope | FULL Refresh Impact | |-------------|---------------------|---------------------| | 7 days | ~1 week + buffer | Fast (seconds) | | 30 days | ~1 month + buffer | Low (seconds to tens of seconds) | | 90–180 days | Quarters to half-year | Moderate — monitor DT refresh duration and warehouse cost | | 365 days | ~full table | **High** — consider whether online serving of year-wide aggregates is truly needed, or pre-compute in a scheduled task instead | | Component | Refresh Mode | Why | |-----------|-------------|-----| | Tiled FV (offline) | **INCREMENTAL** | Deterministic tiling query, efficient for large historical datasets | | Online FV (helper-generated) | **FULL** | `CURRENT_TIMESTAMP()` in WHERE prevents change tracking, but scoped to bounded window | | OFT (Hybrid Table) | Mirrors DT | Syncs changes from the online DT each cycle | The **recommended pattern** is to use both: the Aggregations API for offline computation (efficient tiled storage) and the `online_fv_from_tiled()` helper for online serving (scoped, bounded FULL refresh). If your widest feature window exceeds 90 days, benchmark the DT refresh time and cost before committing to a 1-minute `refresh_freq` — a longer refresh cycle (e.g., 5–15 minutes) may be more appropriate for wide windows. ::: ::: {.callout-tip title="Postgres Online Store (PubPr)"} The **Postgres-backed online store** (`OnlineStoreType.POSTGRES`) handles tiled Feature Views with **query-time tile reassembly**. See [Chapter 9: Real-Time and Online Serving](../09_real_time_online_serving/index.qmd#sec-batch-postgres). ::: --- ## Next Steps Continue to [Chapter 8: Online Features](../08_online_features/index.qmd) to learn about Online Feature Tables for real-time serving, including the limitations and workarounds for tiled Feature Views discussed above.

7.1 Overview

7.2 Learning Objectives

7.3 The Feature Class

7.3.1 Basic Usage

7.3.2 Available Aggregation Functions

7.3.3 Aggregation Function Performance Characteristics

7.3.4 Filter Expressions Within Features

7.3.5 Window Specifications

7.3.6 Window offset

7.4 Naming Aggregated Features

7.4.1 Best Practice: Descriptive Aliases

7.5 Feature Column Prefixing for Disambiguation

7.5.1 The Problem

7.5.2 Solution 1: Auto-Prefix

7.5.3 Solution 2: Custom Prefixes with .with_name()

7.5.4 Solution 3: Combining Both Approaches

7.5.5 Choosing a Prefixing Strategy

7.6 Feature Granularity (Tiling)

7.6.1 How Tiling Works

7.6.2 Tile Size Recommendations

7.6.3 Multi-Granularity Tile Strategy

7.7 Multi-Window Features

7.8 Best Practices

1. Use Consistent Naming Conventions

2. Match Tile Size to Data Frequency

3. Group Related Features and Consolidate Feature Views

4. Tile Sharing Across Feature Views

7.9 Rollup Aggregated Feature Views

7.9.1 RollupConfig Arguments

7.9.2 Example: Visitor → Subscriber Rollup

7.9.3 Rollup Design Guidelines

7.10 Common Pitfalls

7.10.1 ❌ Pitfall 1: No Alias

7.10.2 ❌ Pitfall 2: Wrong Tile Size

7.10.3 ❌ Pitfall 3: Column Collisions

7.10.4 ❌ Pitfall 4: Internal _PARTIAL Columns in Output

7.10.5 ❌ Pitfall 5: Partial Leading-Edge Tiles from Incomplete Source Intervals

7.11 Data-Driven Feature Definitions

Pattern 1: Categorical Pivot — Static vs Dynamic

Pattern 2: Multi-Window Sweep

Pattern 3: Metric × Dimension × Window Factory

Pattern 4: Config-Driven Feature Specifications

Pattern 5: Ratio and Derived Features

Pattern 6: Interaction Features

Static vs Dynamic — Trade-Off Summary

Consuming Data-Driven Features with OBJECT_CONSTRUCT

7.12 Summary

7.13 Tiled Feature Views and Online Serving

7.13.1 Workaround: online_fv_from_tiled Helper

7.13.2 Manual Approach: Direct SQL

7.14 Next Steps

7.5.3 Solution 2: Custom Prefixes with `.with_name()`

7.13.1 Workaround: `online_fv_from_tiled` Helper