3 Entities & Hierarchies

Entity design patterns, compound keys, and hierarchical relationships

Keywords

snowflake, feature store, ml, machine learning, mlops

3.1 Overview

Entities are the foundational building blocks of a Feature Store. They define the business objects that features describe and determine how features can be joined together. This chapter covers entity design patterns, from simple single-key entities to complex compound keys and hierarchical relationships.

3.2 Learning Objectives

After completing this chapter, you will be able to:

Design entities that accurately reflect your business domain
Choose between simple and compound entity keys
Model entity hierarchies for multi-level aggregations
Make informed decisions using the entity design framework

📂 Chapter code: Browse companion scripts on GitHub

3.3 Entity Fundamentals

An Entity represents a business object that features describe. It defines:

What business concept the features relate to (for example User, Product, Session)
How features are retrieved via join keys
Which Feature Views can be combined for a given request

Entity registration lifecycle

Register each entity with the Feature Store using fs.register_entity(entity) before you create or register Feature Views that reference it. Entities are not auto-created when you register a Feature View. If an entity is missing from the registry, Feature View registration fails until you register it explicitly.

📁 Full code: _code/entity_definitions.py

from snowflake.ml.feature_store import Entity

# 1. Define entities
user_entity = Entity(
    name="USER",
    join_keys=["USER_ID"],
    desc="Registered user in the clickstream platform"
)

# 2. Register entities (must happen before any Feature View that references them)
fs.register_entity(user_entity)
fs.register_entity(visitor_entity)
# ... other entities ...

3.3.1 Entity Components

Component	Description	Requirements
`name`	Unique identifier within the Feature Store	SCREAMING_SNAKE_CASE, descriptive
`join_keys`	Column(s) that uniquely identify an instance	Must exist in spine and Feature View
`desc`	Human-readable description	Document the business meaning

3.4 Simple Entity Keys

Simple entities use a single column as the join key. This is the most common pattern.

3.4.1 Examples

# Visitor (pre-login identity)
visitor_entity = Entity(
    name="VISITOR",
    join_keys=["VISITOR_ID"],
    desc="Site visitor before or without login"
)

# User entity
user_entity = Entity(
    name="USER",
    join_keys=["USER_ID"],
    desc="Registered user with account"
)

# Household rollup
household_entity = Entity(
    name="HOUSEHOLD",
    join_keys=["HOUSEHOLD_ID"],
    desc="Household containing one or more users"
)

# Product and category
product_entity = Entity(
    name="PRODUCT",
    join_keys=["PRODUCT_ID"],
    desc="Product in the catalog"
)
category_entity = Entity(
    name="CATEGORY",
    join_keys=["CATEGORY_ID"],
    desc="Product taxonomy category"
)

# Session entity
session_entity = Entity(
    name="SESSION",
    join_keys=["SESSION_ID"],
    desc="User browsing session"
)

# Register all entities
for entity in [visitor_entity, user_entity, household_entity,
               product_entity, category_entity, session_entity]:
    fs.register_entity(entity)

print(f"Registered {len(fs.list_entities().collect())} entities:")
for e in fs.list_entities().collect():
    keys = e["JOIN_KEYS"]
    print(f"  {e['NAME']:20s}  keys: {keys}")

3.4.2 When to Use Simple Entities

Scenario	Example
Single business object	Users, products, sessions
Natural unique identifier exists	User ID, product ID, session ID
Features are always at this granularity	User-level churn prediction

3.5 Compound Entity Keys

Compound entities use multiple columns together as the join key. Essential for many-to-many relationships.

3.5.1 Why Compound Keys?

flowchart LR
  P[PRODUCT P001 Widget Pro] --> J[(PRODUCT_SUPPLIER)]
  J --> R1[P001 S001 10.00]
  J --> R2[P001 S002 9.50]

Product-supplier compound key relationship

3.5.2 Defining Compound Entities

📁 Full code: _code/compound_entities.py

# Primary compound entity: product–supplier (sourcing / many-to-many)
product_supplier_entity = Entity(
    name="PRODUCT_SUPPLIER",
    join_keys=["PRODUCT_ID", "SUPPLIER_ID"],
    desc="Product and supplier combination for sourcing features"
)

# User–product affinity (interaction grain)
user_product_entity = Entity(
    name="USER_PRODUCT",
    join_keys=["USER_ID", "PRODUCT_ID"],
    desc="User-product affinity features"
)

# Order line within an order (USER places ORDER → ORDER_ITEMS)
order_item_entity = Entity(
    name="ORDER_ITEM",
    join_keys=["ORDER_ID", "LINE_NUMBER"],
    desc="Individual line item within an order"
)

# Register compound entities
for entity in [product_supplier_entity, user_product_entity, order_item_entity]:
    fs.register_entity(entity)
    print(f"  {entity.name:20s}  keys: {entity.join_keys}")

3.5.3 When to Use Compound Entities

Scenario	Join Keys	Example Use Case
Many-to-many relationships	`[A_ID, B_ID]`	User-product affinity
Intersection tables	`[PARENT_ID, CHILD_ID]`	Product-supplier pricing
Sub-entity granularity	`[PARENT_ID, SEQ]`	Order line items
Time-partitioned entities	`[ENTITY_ID, DATE]`	Daily snapshots

3.6 Entity Hierarchies

Entities can form hierarchies where one entity “rolls up” to another. This enables multi-level aggregations.

3.6.1 Clickstream hierarchy (demo domain)

Relationships used throughout this guide (database FEATURE_STORE_DEMO, source schema CLICKSTREAM_DATA, Feature Store schema FEATURE_STORE, warehouse FS_DEV_WH):

VISITOR (VISITOR_ID) → identified as → USER (USER_ID) → belongs to → HOUSEHOLD (HOUSEHOLD_ID)
USER → has → SESSION (SESSION_ID) → contains events
USER → places → ORDER → ORDER_ITEM rows (ORDER_ID, LINE_NUMBER)
PRODUCT (PRODUCT_ID) → belongs to → CATEGORY (CATEGORY_ID)
PRODUCT_SUPPLIER (PRODUCT_ID, SUPPLIER_ID) is the compound key for supplier-specific product facts

flowchart TB
  V[VISITOR] --> U[USER] --> H[HOUSEHOLD]
  U --> S[SESSION]
  U --> O[ORDER] --> OI[ORDER_ITEM]
  P[PRODUCT] --> C[CATEGORY]

Clickstream entity relationships

3.6.2 Implementing Hierarchies

📁 Full code: _code/entity_hierarchies.py

from snowflake.ml.feature_store import Entity, FeatureView

# Register each entity first: fs.register_entity(visitor_entity), etc.
visitor_entity = Entity(name="VISITOR", join_keys=["VISITOR_ID"])
household_entity = Entity(name="HOUSEHOLD", join_keys=["HOUSEHOLD_ID"])
user_entity = Entity(name="USER", join_keys=["USER_ID"])
session_entity = Entity(name="SESSION", join_keys=["SESSION_ID"])

# Feature Views at different granularities (version labels: V01, V02, ...)
household_fv = FeatureView(
    name="HOUSEHOLD_FEATURES",
    entities=[household_entity],
    feature_df=household_agg_df,
    version="V01",
    # timestamp_col, refresh_freq, desc, ...
)

user_fv = FeatureView(
    name="USER_FEATURES",
    entities=[user_entity],
    feature_df=user_agg_df,
    version="V01",
)

session_fv = FeatureView(
    name="SESSION_FEATURES",
    entities=[session_entity],
    feature_df=session_df,
    version="V01",
)

3.6.3 Entity Mapping Tables for Rollup Feature Views

Rollup Feature Views (see Chapter 7) aggregate features from a finer-grained entity to a coarser-grained entity using a mapping DataFrame. The design of this mapping table significantly affects both correctness and performance.

Basic mapping table (static, 1:1 or N:1):

CREATE TABLE FEATURE_STORE_DEMO.CLICKSTREAM_DATA.VISITOR_USER_MAPPING AS
SELECT DISTINCT VISITOR_ID, USER_ID
FROM FEATURE_STORE_DEMO.CLICKSTREAM_DATA.VISITOR_SESSIONS;

SCD-Type2 mapping table (time-varying relationships):

When the relationship between entities changes over time (e.g., a visitor becomes identified as a different user, or a subscriber changes accounts), a static mapping table can introduce data leakage – the rollup uses the current mapping to aggregate historical data. For temporal correctness, use an SCD-Type2 mapping with validity windows:

CREATE TABLE VISITOR_USER_MAPPING_SCD2 (
    VISITOR_ID   VARCHAR,
    USER_ID      VARCHAR,
    VALID_FROM   TIMESTAMP_NTZ,
    VALID_TO     TIMESTAMP_NTZ  -- NULL = currently active
);

When the entity mapping is time-varying, the mapping_df passed to RollupConfig should include only currently active rows (where VALID_TO IS NULL), because the rollup Feature View aggregates tiles across the mapping at refresh time – it does not perform point-in-time mapping alignment per tile. For training scenarios that require historical mapping correctness, consider building the rollup in a custom SQL Feature View that joins the SCD2 mapping with validity range predicates.

Late-arriving entity mappings and PIT correctness

A common source of subtle data leakage in hierarchical Feature Stores is late-arriving entity mappings. If a visitor-to-subscriber mapping is added retroactively (e.g., a visitor is identified as a subscriber after their browsing session), the rollup Feature View will include that visitor’s historical features in the subscriber’s aggregates even though the mapping did not exist at the original event time.

For offline training, this means the subscriber’s “7-day page view count” at training time T may include visitor activity that was not attributable to the subscriber at time T. The model learns from features that would not have been available in production.

Mitigation strategies:

Track MAPPED_AT timestamp in the mapping table and filter the mapping to MAPPED_AT <= spine_timestamp when constructing training data.
Accept the approximation if the mapping lag is consistently small relative to the feature windows (e.g., mappings arrive within minutes but feature windows are 7+ days).
Use the fine-grained entity (visitor) for training if mapping timeliness cannot be guaranteed, and only use the coarser entity (subscriber) for inference.

See Chapter 6: PIT Correctness with Entity Mappings for a detailed discussion.

Performance at scale: For entity hierarchies with very high cardinality (e.g., 500M+ visitors mapping to 100M+ subscribers), the mapping table join is the most expensive part of the rollup. Ensure the mapping table is well-clustered on the fine-grained entity key (VISITOR_ID) and avoid including unnecessary columns in the mapping DataFrame.

3.6.4 Combining Hierarchical Features

To use features from multiple hierarchy levels, your spine must include all join keys:

# Spine with multi-level keys (source tables in CLICKSTREAM_DATA)
spine = session.sql("""
    SELECT
        s.SESSION_ID,
        s.USER_ID,
        u.HOUSEHOLD_ID,
        s.SESSION_START_TS AS EVENT_TS
    FROM FEATURE_STORE_DEMO.CLICKSTREAM_DATA.SESSIONS s
    JOIN FEATURE_STORE_DEMO.CLICKSTREAM_DATA.USERS u
      ON s.USER_ID = u.USER_ID
""")

# Generate dataset with features from all levels
dataset = fs.generate_dataset(
    spine_df=spine,
    features=[session_fv, user_fv, household_fv],
)

3.6.5 Modeling at a Different Grain Than the Source

A common design question is: the ML model operates at a parent grain (e.g., ORDER), but the source data lives at a child grain (e.g., LINEITEM / ORDER_ITEMS). You have two options:

Option A: Entity at the child grain, parent key on the Feature View. Register the entity at the finest grain (e.g., [ORDER_ID, LINE_NUMBER]) and include the parent key as a column. Training sets will have multiple rows per parent, which is useful when the model needs line-level detail.

Option B: Entity at the parent grain, aggregated child features. Register the entity at the parent grain (e.g., [ORDER_ID]) and build a Feature View that aggregates child data up:

ORDER = Entity(name="ORDER", join_keys=["ORDER_ID"])

SQL
Snowpark DataFrame

order_agg_df = session.sql("""
    SELECT
        ORDER_ID,
        SUM(EXTENDED_PRICE) AS ORDER_TOTAL_AMT,
        COUNT(*) AS LINE_COUNT,
        SUM(CASE WHEN RETURN_FLAG = 'R' THEN 1 ELSE 0 END) AS RETURN_LINE_CNT,
        MAX(SHIP_DATE) AS LAST_SHIP_DATE
    FROM ORDER_ITEMS
    GROUP BY ORDER_ID
""")

import snowflake.snowpark.functions as F

order_items = session.table("ORDER_ITEMS")
order_agg_df = order_items.group_by("ORDER_ID").agg(
    F.sum("EXTENDED_PRICE").alias("ORDER_TOTAL_AMT"),
    F.count("*").alias("LINE_COUNT"),
    F.sum(F.when(F.col("RETURN_FLAG") == "R", 1).otherwise(0)).alias("RETURN_LINE_CNT"),
    F.max("SHIP_DATE").alias("LAST_SHIP_DATE"),
)

order_lineitem_agg_fv = FeatureView(
    name="ORDER_LINEITEM_AGG_FV",
    entities=[ORDER],
    feature_df=order_agg_df,
    timestamp_col="LAST_SHIP_DATE",
    refresh_freq="1 hour",
    desc="Lineitem features aggregated to the order level",
)

When the spine carries keys from multiple levels of the hierarchy (e.g., ORDER_ID, CUSTOMER_ID, NATION_ID), generate_training_set performs LEFT JOINs against each Feature View on its entity keys. Higher-level features are denormalized (repeated) onto every child row. See Chapter 11: Training & Inference for multi-entity spine design patterns.

3.7 Entity Design Patterns

3.7.1 Pattern 1: Domain-Aligned Entities

Align entities with business domain concepts:

# Clickstream / e-commerce domain (this guide)
USER = Entity(name="USER", join_keys=["USER_ID"])
PRODUCT = Entity(name="PRODUCT", join_keys=["PRODUCT_ID"])
ORDER = Entity(name="ORDER", join_keys=["ORDER_ID"])

# Financial domain
ACCOUNT = Entity(name="ACCOUNT", join_keys=["ACCOUNT_ID"])
TRANSACTION = Entity(name="TRANSACTION", join_keys=["TXN_ID"])
MERCHANT = Entity(name="MERCHANT", join_keys=["MERCHANT_ID"])

3.7.2 Pattern 2: Interaction Entities

Model relationships between domain entities:

# User–product interaction (same grain as USER_PRODUCT in compound examples)
USER_PRODUCT = Entity(
    name="USER_PRODUCT",
    join_keys=["USER_ID", "PRODUCT_ID"],
    desc="User-product affinity and behavior"
)

# User-Merchant interaction
USER_MERCHANT = Entity(
    name="USER_MERCHANT",
    join_keys=["USER_ID", "MERCHANT_ID"],
    desc="User spending patterns per merchant"
)

3.7.3 Pattern 3: Time-Grain Entities (Forecasting / Time-Series)

For forecasting and time-series models, the unit of analysis is often a combination of a business entity and a time period – not just the entity alone. For example, a demand forecast model predicts sales per store per day; a patient monitoring model tracks health metrics per patient per week. The time dimension becomes part of the entity key, not just the timestamp_col.

# Daily demand forecasting: one prediction per store per date
STORE_DAY = Entity(
    name="STORE_DAY",
    join_keys=["STORE_ID", "FORECAST_DATE"],
    desc="Store-day grain for demand forecasting models",
)

# Weekly inventory planning: one prediction per SKU per week
SKU_WEEK = Entity(
    name="SKU_WEEK",
    join_keys=["SKU_ID", "WEEK_START_DATE"],
    desc="SKU-week grain for inventory planning models",
)

Key distinction: The time-grain join key (e.g., FORECAST_DATE) defines the analysis granularity – it determines which row the spine matches on. The timestamp_col on the Feature View still serves its standard role: it records when the feature row was computed or became valid, enabling PIT correctness if features are refreshed incrementally.

SQL
Snowpark DataFrame

store_day_features_df = session.sql("""
    SELECT
        STORE_ID,
        SALE_DATE AS FORECAST_DATE,
        SUM(QUANTITY) AS DAILY_UNITS_SOLD,
        SUM(REVENUE) AS DAILY_REVENUE,
        COUNT(DISTINCT PRODUCT_ID) AS DISTINCT_PRODUCTS_SOLD,
        MAX(UPDATED_TS) AS FEATURE_TS
    FROM SALES
    GROUP BY STORE_ID, SALE_DATE
""")

import snowflake.snowpark.functions as F

sales = session.table("SALES")
store_day_features_df = sales.group_by("STORE_ID", F.col("SALE_DATE").alias("FORECAST_DATE")).agg(
    F.sum("QUANTITY").alias("DAILY_UNITS_SOLD"),
    F.sum("REVENUE").alias("DAILY_REVENUE"),
    F.count_distinct("PRODUCT_ID").alias("DISTINCT_PRODUCTS_SOLD"),
    F.max("UPDATED_TS").alias("FEATURE_TS"),
)

store_day_fv = FeatureView(
    name="STORE_DAY_SALES_FV",
    entities=[STORE_DAY],
    feature_df=store_day_features_df,
    timestamp_col="FEATURE_TS",
    refresh_freq="1 hour",
    desc="Daily sales features per store for demand forecasting",
)

Spine design for time-grain entities typically enumerates the (entity, time-period) tuples you want to score:

forecast_spine = session.sql("""
    SELECT STORE_ID, cal.CAL_DATE AS FORECAST_DATE
    FROM STORES
    CROSS JOIN (SELECT CAL_DATE FROM CALENDAR WHERE CAL_DATE BETWEEN '2025-01-01' AND '2025-01-31') cal
""")

See Chapter 11: Training & Inference for spine design patterns, including multi-horizon forecasting spines.

3.8 Best Practices

1. Use Descriptive Names

# ✅ GOOD: Clear business meaning (SCREAMING_SNAKE_CASE, singular)
Entity(name="USER", join_keys=["USER_ID"])
Entity(name="PRODUCT_SUPPLIER", join_keys=["PRODUCT_ID", "SUPPLIER_ID"])

# ❌ BAD: Vague or technical names
Entity(name="ENTITY1", join_keys=["ID"])
Entity(name="TBL_KEY", join_keys=["K1", "K2"])

2. Document Business Context

Entity(
    name="HOUSEHOLD",
    join_keys=["HOUSEHOLD_ID"],
    desc="Group of users sharing the same billing address. Used for household-level targeting and fraud detection."
)

3. Consistent Key Naming

The Feature Store does not support synonyms or aliases for join keys. The column names in the Entity’s join_keys must appear with exactly the same name in every Feature View’s feature_df and in every spine DataFrame. There is no mapping layer that translates USER_KEY to USER_ID at runtime – a name mismatch will cause registration or dataset generation to fail.

While this is a constraint, it is also a conformance benefit: it enforces a single, consistent vocabulary for entity keys across all Feature Views and consumers. Teams cannot accidentally use different column names for the same logical concept, which eliminates a common source of silent join errors.

If source tables use different column names, rename them in the Feature View’s feature_df to match the Entity definition:

# Source table uses "CUST_ID", but Entity expects "USER_ID"
feature_df = session.table("LEGACY_CUSTOMERS").with_column_renamed("CUST_ID", "USER_ID")

# ✅ GOOD: Consistent USER_ID everywhere
Entity(name="USER", join_keys=["USER_ID"])
# Feature View source: SELECT USER_ID, ... FROM ...
# Spine: SELECT USER_ID, ...

# ❌ BAD: Inconsistent naming for the same logical user
# Table A: USER_ID, Table B: USER_KEY, Table C: USR_ID

3.9 Common Pitfalls

3.9.1 ❌ Pitfall 1: Key Column Mismatch

Problem: Join key column missing from Feature View or spine.

Solution: Ensure join key columns exist in all relevant DataFrames.

3.9.2 ❌ Pitfall 2: Wrong Granularity

Problem: Entity doesn’t match feature computation granularity.

Solution: Ensure Feature View aggregates to entity grain.

3.9.3 ❌ Pitfall 3: Missing Hierarchy Links

Problem: Can’t combine features from different hierarchy levels.

Solution: Include all necessary foreign keys in spine.

3.9.4 ❌ Pitfall 4: Entity grain vs. query-time timestamps

Problem: Features that must be computed from the spine timestamp (for example age or tenure as of each training row) cannot rely on columns that are not in the Feature View’s feature_df. The spine timestamp is not injected into Feature View SQL for you.

Solution: Use an entity grain where the reference time is already a column in the feature data (for example snapshot rows with SNAPSHOT_TS as timestamp_col), or enrich after retrieval with the same logic for training and scoring. Event-level entities (one row per application, session, etc.) can anchor features to an intrinsic event time. For SCD2 and point-in-time–correct derived attributes, consider a worked banking example where snapshot-based entity grains provide the temporal anchor.

3.10 Summary

Pattern	Join Keys	Use Case
Simple	Single column	Standard domain entities
Compound	Multiple columns	Many-to-many, intersections
Hierarchical	Cascading relationships	Multi-level aggregations

3.11 Next Steps

Continue to Chapter 4: Feature Views to learn about Feature View types, versioning, and lifecycle management.

--- title: "Entities & Hierarchies" subtitle: "Entity design patterns, compound keys, and hierarchical relationships" --- ## Overview Entities are the foundational building blocks of a Feature Store. They define the business objects that features describe and determine how features can be joined together. This chapter covers entity design patterns, from simple single-key entities to complex compound keys and hierarchical relationships. ## Learning Objectives After completing this chapter, you will be able to: - Design entities that accurately reflect your business domain - Choose between simple and compound entity keys - Model entity hierarchies for multi-level aggregations - Make informed decisions using the entity design framework > 📂 **Chapter code:** [Browse companion scripts on GitHub](https://github.com/Snowflake-Labs/snowflake-featurestore-imp-guide/tree/main/Snowflake_FeatureStore_Implementation_Guide/03_entities_hierarchies/_code) ```{python} #| output: false #| echo: false from snowflake.snowpark import Session from snowflake.snowpark.context import get_active_session from snowflake.ml.feature_store import FeatureStore, FeatureView, Entity, CreationMode try: session = get_active_session() except Exception: session = Session.builder.config("connection_name", "default").create() session.sql_simplifier_enabled = True fs = FeatureStore( session=session, database="FEATURE_STORE_DEMO", name="FEATURE_STORE", default_warehouse="FS_DEV_WH", creation_mode=CreationMode.CREATE_IF_NOT_EXIST, ) ``` --- ## Entity Fundamentals An **Entity** represents a business object that features describe. It defines: - **What** business concept the features relate to (for example User, Product, Session) - **How** features are retrieved via join keys - **Which** Feature Views can be combined for a given request ::: {.callout-note} ## Entity registration lifecycle Register each entity with the Feature Store using `fs.register_entity(entity)` **before** you create or register Feature Views that reference it. Entities are **not** auto-created when you register a Feature View. If an entity is missing from the registry, Feature View registration fails until you register it explicitly. ::: > 📁 **Full code:** [`_code/entity_definitions.py`](_code/entity_definitions.py) ```python from snowflake.ml.feature_store import Entity # 1. Define entities user_entity = Entity( name="USER", join_keys=["USER_ID"], desc="Registered user in the clickstream platform" ) # 2. Register entities (must happen before any Feature View that references them) fs.register_entity(user_entity) fs.register_entity(visitor_entity) # ... other entities ... ``` ### Entity Components | Component | Description | Requirements | |-----------|-------------|--------------| | `name` | Unique identifier within the Feature Store | SCREAMING_SNAKE_CASE, descriptive | | `join_keys` | Column(s) that uniquely identify an instance | Must exist in spine and Feature View | | `desc` | Human-readable description | Document the business meaning | --- ## Simple Entity Keys Simple entities use a single column as the join key. This is the most common pattern. ### Examples ```{python} # Visitor (pre-login identity) visitor_entity = Entity( name="VISITOR", join_keys=["VISITOR_ID"], desc="Site visitor before or without login" ) # User entity user_entity = Entity( name="USER", join_keys=["USER_ID"], desc="Registered user with account" ) # Household rollup household_entity = Entity( name="HOUSEHOLD", join_keys=["HOUSEHOLD_ID"], desc="Household containing one or more users" ) # Product and category product_entity = Entity( name="PRODUCT", join_keys=["PRODUCT_ID"], desc="Product in the catalog" ) category_entity = Entity( name="CATEGORY", join_keys=["CATEGORY_ID"], desc="Product taxonomy category" ) # Session entity session_entity = Entity( name="SESSION", join_keys=["SESSION_ID"], desc="User browsing session" ) # Register all entities for entity in [visitor_entity, user_entity, household_entity, product_entity, category_entity, session_entity]: fs.register_entity(entity) print(f"Registered {len(fs.list_entities().collect())} entities:") for e in fs.list_entities().collect(): keys = e["JOIN_KEYS"] print(f" {e['NAME']:20s} keys: {keys}") ``` ### When to Use Simple Entities | Scenario | Example | |----------|---------| | Single business object | Users, products, sessions | | Natural unique identifier exists | User ID, product ID, session ID | | Features are always at this granularity | User-level churn prediction | --- ## Compound Entity Keys {#sec-compound-keys} Compound entities use multiple columns together as the join key. Essential for many-to-many relationships. ### Why Compound Keys? ```{mermaid} %%| fig-cap: "Product-supplier compound key relationship" %%| fig-alt: "Product links to multiple supplier rows requiring compound key" flowchart LR P[PRODUCT P001 Widget Pro] --> J[(PRODUCT_SUPPLIER)] J --> R1[P001 S001 10.00] J --> R2[P001 S002 9.50] ``` ### Defining Compound Entities > 📁 **Full code:** [`_code/compound_entities.py`](_code/compound_entities.py) ```{python} # Primary compound entity: product–supplier (sourcing / many-to-many) product_supplier_entity = Entity( name="PRODUCT_SUPPLIER", join_keys=["PRODUCT_ID", "SUPPLIER_ID"], desc="Product and supplier combination for sourcing features" ) # User–product affinity (interaction grain) user_product_entity = Entity( name="USER_PRODUCT", join_keys=["USER_ID", "PRODUCT_ID"], desc="User-product affinity features" ) # Order line within an order (USER places ORDER → ORDER_ITEMS) order_item_entity = Entity( name="ORDER_ITEM", join_keys=["ORDER_ID", "LINE_NUMBER"], desc="Individual line item within an order" ) # Register compound entities for entity in [product_supplier_entity, user_product_entity, order_item_entity]: fs.register_entity(entity) print(f" {entity.name:20s} keys: {entity.join_keys}") ``` ### When to Use Compound Entities | Scenario | Join Keys | Example Use Case | |----------|-----------|------------------| | Many-to-many relationships | `[A_ID, B_ID]` | User-product affinity | | Intersection tables | `[PARENT_ID, CHILD_ID]` | Product-supplier pricing | | Sub-entity granularity | `[PARENT_ID, SEQ]` | Order line items | | Time-partitioned entities | `[ENTITY_ID, DATE]` | Daily snapshots | --- ## Entity Hierarchies {#sec-hierarchies} Entities can form hierarchies where one entity "rolls up" to another. This enables multi-level aggregations. ### Clickstream hierarchy (demo domain) Relationships used throughout this guide (database `FEATURE_STORE_DEMO`, source schema `CLICKSTREAM_DATA`, Feature Store schema `FEATURE_STORE`, warehouse `FS_DEV_WH`): - **VISITOR** (`VISITOR_ID`) → identified as → **USER** (`USER_ID`) → belongs to → **HOUSEHOLD** (`HOUSEHOLD_ID`) - **USER** → has → **SESSION** (`SESSION_ID`) → contains events - **USER** → places → **ORDER** → **ORDER_ITEM** rows (`ORDER_ID`, `LINE_NUMBER`) - **PRODUCT** (`PRODUCT_ID`) → belongs to → **CATEGORY** (`CATEGORY_ID`) - **PRODUCT_SUPPLIER** (`PRODUCT_ID`, `SUPPLIER_ID`) is the compound key for supplier-specific product facts ```{mermaid} %%| fig-cap: "Clickstream entity relationships" %%| fig-alt: "Visitor user household session order and product relationships" flowchart TB V[VISITOR] --> U[USER] --> H[HOUSEHOLD] U --> S[SESSION] U --> O[ORDER] --> OI[ORDER_ITEM] P[PRODUCT] --> C[CATEGORY] ``` ### Implementing Hierarchies > 📁 **Full code:** [`_code/entity_hierarchies.py`](_code/entity_hierarchies.py) ```python from snowflake.ml.feature_store import Entity, FeatureView # Register each entity first: fs.register_entity(visitor_entity), etc. visitor_entity = Entity(name="VISITOR", join_keys=["VISITOR_ID"]) household_entity = Entity(name="HOUSEHOLD", join_keys=["HOUSEHOLD_ID"]) user_entity = Entity(name="USER", join_keys=["USER_ID"]) session_entity = Entity(name="SESSION", join_keys=["SESSION_ID"]) # Feature Views at different granularities (version labels: V01, V02, ...) household_fv = FeatureView( name="HOUSEHOLD_FEATURES", entities=[household_entity], feature_df=household_agg_df, version="V01", # timestamp_col, refresh_freq, desc, ... ) user_fv = FeatureView( name="USER_FEATURES", entities=[user_entity], feature_df=user_agg_df, version="V01", ) session_fv = FeatureView( name="SESSION_FEATURES", entities=[session_entity], feature_df=session_df, version="V01", ) ``` ### Entity Mapping Tables for Rollup Feature Views {#sec-mapping-tables} Rollup Feature Views (see [Chapter 7](../07_aggregations_api/index.qmd#sec-rollup)) aggregate features from a finer-grained entity to a coarser-grained entity using a **mapping DataFrame**. The design of this mapping table significantly affects both correctness and performance. **Basic mapping table** (static, 1:1 or N:1): ```sql CREATE TABLE FEATURE_STORE_DEMO.CLICKSTREAM_DATA.VISITOR_USER_MAPPING AS SELECT DISTINCT VISITOR_ID, USER_ID FROM FEATURE_STORE_DEMO.CLICKSTREAM_DATA.VISITOR_SESSIONS; ``` **SCD-Type2 mapping table** (time-varying relationships): When the relationship between entities changes over time (e.g., a visitor becomes identified as a different user, or a subscriber changes accounts), a static mapping table can introduce data leakage -- the rollup uses the **current** mapping to aggregate **historical** data. For temporal correctness, use an SCD-Type2 mapping with validity windows: ```sql CREATE TABLE VISITOR_USER_MAPPING_SCD2 ( VISITOR_ID VARCHAR, USER_ID VARCHAR, VALID_FROM TIMESTAMP_NTZ, VALID_TO TIMESTAMP_NTZ -- NULL = currently active ); ``` When the entity mapping is time-varying, the `mapping_df` passed to `RollupConfig` should include only **currently active** rows (where `VALID_TO IS NULL`), because the rollup Feature View aggregates tiles across the mapping at refresh time -- it does not perform point-in-time mapping alignment per tile. For training scenarios that require historical mapping correctness, consider building the rollup in a custom SQL Feature View that joins the SCD2 mapping with validity range predicates. ::: {.callout-warning title="Late-arriving entity mappings and PIT correctness"} A common source of subtle data leakage in hierarchical Feature Stores is **late-arriving entity mappings**. If a visitor-to-subscriber mapping is added retroactively (e.g., a visitor is identified as a subscriber after their browsing session), the rollup Feature View will include that visitor's historical features in the subscriber's aggregates even though the mapping did not exist at the original event time. For offline training, this means the subscriber's "7-day page view count" at training time `T` may include visitor activity that was not attributable to the subscriber at time `T`. The model learns from features that would not have been available in production. Mitigation strategies: - **Track `MAPPED_AT` timestamp** in the mapping table and filter the mapping to `MAPPED_AT <= spine_timestamp` when constructing training data. - **Accept the approximation** if the mapping lag is consistently small relative to the feature windows (e.g., mappings arrive within minutes but feature windows are 7+ days). - **Use the fine-grained entity** (visitor) for training if mapping timeliness cannot be guaranteed, and only use the coarser entity (subscriber) for inference. See [Chapter 6: PIT Correctness with Entity Mappings](../06_temporal_features/index.qmd#sec-mapping-leakage) for a detailed discussion. ::: **Performance at scale:** For entity hierarchies with very high cardinality (e.g., 500M+ visitors mapping to 100M+ subscribers), the mapping table join is the most expensive part of the rollup. Ensure the mapping table is well-clustered on the fine-grained entity key (`VISITOR_ID`) and avoid including unnecessary columns in the mapping DataFrame. ### Combining Hierarchical Features To use features from multiple hierarchy levels, your spine must include all join keys: ```python # Spine with multi-level keys (source tables in CLICKSTREAM_DATA) spine = session.sql(""" SELECT s.SESSION_ID, s.USER_ID, u.HOUSEHOLD_ID, s.SESSION_START_TS AS EVENT_TS FROM FEATURE_STORE_DEMO.CLICKSTREAM_DATA.SESSIONS s JOIN FEATURE_STORE_DEMO.CLICKSTREAM_DATA.USERS u ON s.USER_ID = u.USER_ID """) # Generate dataset with features from all levels dataset = fs.generate_dataset( spine_df=spine, features=[session_fv, user_fv, household_fv], ) ``` ### Modeling at a Different Grain Than the Source {#sec-cross-grain} A common design question is: the ML model operates at a **parent** grain (e.g., ORDER), but the source data lives at a **child** grain (e.g., LINEITEM / ORDER_ITEMS). You have two options: **Option A: Entity at the child grain, parent key on the Feature View.** Register the entity at the finest grain (e.g., `[ORDER_ID, LINE_NUMBER]`) and include the parent key as a column. Training sets will have **multiple rows per parent**, which is useful when the model needs line-level detail. **Option B: Entity at the parent grain, aggregated child features.** Register the entity at the parent grain (e.g., `[ORDER_ID]`) and build a Feature View that aggregates child data up: ```python ORDER = Entity(name="ORDER", join_keys=["ORDER_ID"]) ``` ::: {.panel-tabset group="lang"} ## SQL ```python order_agg_df = session.sql(""" SELECT ORDER_ID, SUM(EXTENDED_PRICE) AS ORDER_TOTAL_AMT, COUNT(*) AS LINE_COUNT, SUM(CASE WHEN RETURN_FLAG = 'R' THEN 1 ELSE 0 END) AS RETURN_LINE_CNT, MAX(SHIP_DATE) AS LAST_SHIP_DATE FROM ORDER_ITEMS GROUP BY ORDER_ID """) ``` ## Snowpark DataFrame ```python import snowflake.snowpark.functions as F order_items = session.table("ORDER_ITEMS") order_agg_df = order_items.group_by("ORDER_ID").agg( F.sum("EXTENDED_PRICE").alias("ORDER_TOTAL_AMT"), F.count("*").alias("LINE_COUNT"), F.sum(F.when(F.col("RETURN_FLAG") == "R", 1).otherwise(0)).alias("RETURN_LINE_CNT"), F.max("SHIP_DATE").alias("LAST_SHIP_DATE"), ) ``` ::: ```python order_lineitem_agg_fv = FeatureView( name="ORDER_LINEITEM_AGG_FV", entities=[ORDER], feature_df=order_agg_df, timestamp_col="LAST_SHIP_DATE", refresh_freq="1 hour", desc="Lineitem features aggregated to the order level", ) ``` When the spine carries keys from **multiple levels** of the hierarchy (e.g., `ORDER_ID`, `CUSTOMER_ID`, `NATION_ID`), `generate_training_set` performs LEFT JOINs against each Feature View on its entity keys. Higher-level features are **denormalized** (repeated) onto every child row. See [Chapter 11: Training & Inference](../11_training_inference/index.qmd) for multi-entity spine design patterns. --- ## Entity Design Patterns ### Pattern 1: Domain-Aligned Entities Align entities with business domain concepts: ```python # Clickstream / e-commerce domain (this guide) USER = Entity(name="USER", join_keys=["USER_ID"]) PRODUCT = Entity(name="PRODUCT", join_keys=["PRODUCT_ID"]) ORDER = Entity(name="ORDER", join_keys=["ORDER_ID"]) # Financial domain ACCOUNT = Entity(name="ACCOUNT", join_keys=["ACCOUNT_ID"]) TRANSACTION = Entity(name="TRANSACTION", join_keys=["TXN_ID"]) MERCHANT = Entity(name="MERCHANT", join_keys=["MERCHANT_ID"]) ``` ### Pattern 2: Interaction Entities Model relationships between domain entities: ```python # User–product interaction (same grain as USER_PRODUCT in compound examples) USER_PRODUCT = Entity( name="USER_PRODUCT", join_keys=["USER_ID", "PRODUCT_ID"], desc="User-product affinity and behavior" ) # User-Merchant interaction USER_MERCHANT = Entity( name="USER_MERCHANT", join_keys=["USER_ID", "MERCHANT_ID"], desc="User spending patterns per merchant" ) ``` ### Pattern 3: Time-Grain Entities (Forecasting / Time-Series) For forecasting and time-series models, the **unit of analysis** is often a combination of a business entity and a **time period** -- not just the entity alone. For example, a demand forecast model predicts sales per store per day; a patient monitoring model tracks health metrics per patient per week. The time dimension becomes part of the **entity key**, not just the `timestamp_col`. ```python # Daily demand forecasting: one prediction per store per date STORE_DAY = Entity( name="STORE_DAY", join_keys=["STORE_ID", "FORECAST_DATE"], desc="Store-day grain for demand forecasting models", ) # Weekly inventory planning: one prediction per SKU per week SKU_WEEK = Entity( name="SKU_WEEK", join_keys=["SKU_ID", "WEEK_START_DATE"], desc="SKU-week grain for inventory planning models", ) ``` **Key distinction:** The time-grain join key (e.g., `FORECAST_DATE`) defines the **analysis granularity** -- it determines which row the spine matches on. The `timestamp_col` on the Feature View still serves its standard role: it records **when** the feature row was computed or became valid, enabling PIT correctness if features are refreshed incrementally. ::: {.panel-tabset group="lang"} ## SQL ```python store_day_features_df = session.sql(""" SELECT STORE_ID, SALE_DATE AS FORECAST_DATE, SUM(QUANTITY) AS DAILY_UNITS_SOLD, SUM(REVENUE) AS DAILY_REVENUE, COUNT(DISTINCT PRODUCT_ID) AS DISTINCT_PRODUCTS_SOLD, MAX(UPDATED_TS) AS FEATURE_TS FROM SALES GROUP BY STORE_ID, SALE_DATE """) ``` ## Snowpark DataFrame ```python import snowflake.snowpark.functions as F sales = session.table("SALES") store_day_features_df = sales.group_by("STORE_ID", F.col("SALE_DATE").alias("FORECAST_DATE")).agg( F.sum("QUANTITY").alias("DAILY_UNITS_SOLD"), F.sum("REVENUE").alias("DAILY_REVENUE"), F.count_distinct("PRODUCT_ID").alias("DISTINCT_PRODUCTS_SOLD"), F.max("UPDATED_TS").alias("FEATURE_TS"), ) ``` ::: ```python store_day_fv = FeatureView( name="STORE_DAY_SALES_FV", entities=[STORE_DAY], feature_df=store_day_features_df, timestamp_col="FEATURE_TS", refresh_freq="1 hour", desc="Daily sales features per store for demand forecasting", ) ``` **Spine design** for time-grain entities typically enumerates the (entity, time-period) tuples you want to score: ```python forecast_spine = session.sql(""" SELECT STORE_ID, cal.CAL_DATE AS FORECAST_DATE FROM STORES CROSS JOIN (SELECT CAL_DATE FROM CALENDAR WHERE CAL_DATE BETWEEN '2025-01-01' AND '2025-01-31') cal """) ``` See [Chapter 11: Training & Inference](../11_training_inference/index.qmd) for spine design patterns, including multi-horizon forecasting spines. --- ## Best Practices ### 1. Use Descriptive Names {.unnumbered} ```python # ✅ GOOD: Clear business meaning (SCREAMING_SNAKE_CASE, singular) Entity(name="USER", join_keys=["USER_ID"]) Entity(name="PRODUCT_SUPPLIER", join_keys=["PRODUCT_ID", "SUPPLIER_ID"]) # ❌ BAD: Vague or technical names Entity(name="ENTITY1", join_keys=["ID"]) Entity(name="TBL_KEY", join_keys=["K1", "K2"]) ``` ### 2. Document Business Context {.unnumbered} ```python Entity( name="HOUSEHOLD", join_keys=["HOUSEHOLD_ID"], desc="Group of users sharing the same billing address. Used for household-level targeting and fraud detection." ) ``` ### 3. Consistent Key Naming {#sec-consistent-keys .unnumbered} The Feature Store **does not support synonyms or aliases** for join keys. The column names in the Entity's `join_keys` must appear with exactly the same name in every Feature View's `feature_df` and in every spine DataFrame. There is no mapping layer that translates `USER_KEY` to `USER_ID` at runtime -- a name mismatch will cause registration or dataset generation to fail. While this is a constraint, it is also a **conformance benefit**: it enforces a single, consistent vocabulary for entity keys across all Feature Views and consumers. Teams cannot accidentally use different column names for the same logical concept, which eliminates a common source of silent join errors. If source tables use different column names, rename them in the Feature View's `feature_df` to match the Entity definition: ```python # Source table uses "CUST_ID", but Entity expects "USER_ID" feature_df = session.table("LEGACY_CUSTOMERS").with_column_renamed("CUST_ID", "USER_ID") # ✅ GOOD: Consistent USER_ID everywhere Entity(name="USER", join_keys=["USER_ID"]) # Feature View source: SELECT USER_ID, ... FROM ... # Spine: SELECT USER_ID, ... # ❌ BAD: Inconsistent naming for the same logical user # Table A: USER_ID, Table B: USER_KEY, Table C: USR_ID ``` --- ## Common Pitfalls ### ❌ Pitfall 1: Key Column Mismatch **Problem**: Join key column missing from Feature View or spine. **Solution**: Ensure join key columns exist in all relevant DataFrames. ### ❌ Pitfall 2: Wrong Granularity **Problem**: Entity doesn't match feature computation granularity. **Solution**: Ensure Feature View aggregates to entity grain. ### ❌ Pitfall 3: Missing Hierarchy Links **Problem**: Can't combine features from different hierarchy levels. **Solution**: Include all necessary foreign keys in spine. ### ❌ Pitfall 4: Entity grain vs. query-time timestamps **Problem**: Features that must be computed from the **spine timestamp** (for example age or tenure as of each training row) cannot rely on columns that are not in the Feature View’s `feature_df`. The spine timestamp is not injected into Feature View SQL for you. **Solution**: Use an entity grain where the reference time is already a column in the feature data (for example snapshot rows with `SNAPSHOT_TS` as `timestamp_col`), or enrich after retrieval with the same logic for training and scoring. Event-level entities (one row per application, session, etc.) can anchor features to an intrinsic event time. For SCD2 and point-in-time–correct derived attributes, consider a worked banking example where snapshot-based entity grains provide the temporal anchor. --- ## Summary | Pattern | Join Keys | Use Case | |---------|-----------|----------| | **Simple** | Single column | Standard domain entities | | **Compound** | Multiple columns | Many-to-many, intersections | | **Hierarchical** | Cascading relationships | Multi-level aggregations | --- ## Next Steps Continue to [Chapter 4: Feature Views](../04_feature_views/index.qmd) to learn about Feature View types, versioning, and lifecycle management.