13 Migration Guide

Migrating to Snowflake Feature Store from other platforms

Keywords

snowflake, feature store, ml, machine learning, mlops

13.1 Overview

This chapter provides guidance for teams migrating to Snowflake Feature Store from other platforms, primarily Tecton. It covers terminology mapping, migration strategies, and common patterns.

Examples follow the clickstream sample layout: database FEATURE_STORE_DEMO, Feature Store schema FEATURE_STORE, source schema CLICKSTREAM_DATA (tables such as EVENTS, SESSIONS, ORDERS, USERS, PRODUCTS), warehouse FS_DEV_WH. Register Feature Views with version strings in V01 format (for example fs.register_feature_view(..., version="V01")).

13.2 Learning Objectives

After completing this chapter, you will be able to:

Map terminology from other Feature Stores to Snowflake
Plan a phased migration approach
Migrate feature definitions and pipelines
Validate migrated features

📂 Chapter code: Browse companion scripts on GitHub

13.3 Why Migrate to Snowflake Feature Store?

13.3.1 Key Benefits

Benefit	Description
Unified Platform	Features, training, inference all in Snowflake
Native Integration	Uses Snowflake objects (DTs, Views, Hybrid Tables)
Zero Infrastructure	No separate serving infrastructure needed
Enterprise Security	Inherits Snowflake RBAC and governance
Cost Efficiency	No data movement, unified compute

13.4 Terminology Mapping

13.4.1 Tecton → Snowflake

📁 Full code: _code/terminology_mapping.py

Tecton	Snowflake	Notes
Feature View	Feature View	Registered with `register_feature_view`; use `version="V01"` (or `V02`, …)
Entity	Entity	Same concept; join keys align with spine columns
Batch Feature View	Feature View (Dynamic Table)	Set `refresh_freq` to a period (e.g. `"1 hour"`) or CRON string
On-Demand Feature View	Feature View (View)	Omit `refresh_freq` or set `refresh_freq=None`; query-time computation
Feature Service	`generate_training_set()` / `generate_dataset()`	No separate service object: pass spine + list of Feature Views (see Chapter 10)
Online Store	Online Feature Table (Hybrid Table)	Enable with `OnlineConfig`; freshness via `target_lag`
Offline Store	Feature Store schema	e.g. `FEATURE_STORE_DEMO.FEATURE_STORE`; materialized batch features live in Dynamic Tables there
Data Source	Source table or view	e.g. `FEATURE_STORE_DEMO.CLICKSTREAM_DATA.ORDERS`
Transformation	Feature View query (`feature_df`)	Snowpark DataFrame / SQL vs Tecton Python decorators
Backfill	`fs.refresh_feature_view(fv)`	Triggers refresh outside the scheduled `refresh_freq`
TTL	`refresh_freq` / `target_lag`	Offline lag ↔︎ Dynamic Table `refresh_freq`; online freshness ↔︎ `OnlineConfig.target_lag` (not a single global TTL)
`get_historical_features`	`generate_training_set()` / `generate_dataset()`	Snowpark DataFrame vs versioned Dataset (Parquet); same PIT semantics
`get_online_features`	`get_online_features`	Retrieve from Online Feature Table
Spine	Spine DataFrame	Entity keys + PIT timestamp column passed as `spine_timestamp_col`

13.4.2 Feast → Snowflake (abbreviated)

Feast	Snowflake
Feature Store (registry)	Feature Store schema + platform metadata
Feature View	Feature View
Online Store	Online Feature Table (Hybrid Table)
Offline Store	Feature Store schema
Feature Service	`generate_training_set()` / `generate_dataset()`
`get_historical_features`	`generate_training_set()` / `generate_dataset()`

13.5 Migration Approach

13.5.1 Three-Phase Migration

flowchart LR
  P1[Phase 1 Assessment] --> P2[Phase 2 Migration] --> P3[Phase 3 Validation]

Three-phase migration: assessment migration validation

13.5.2 Phase 1: Assessment

Inventory existing features
- List all Feature Views/Groups
- Document entities and join keys
- Identify refresh frequencies (map to refresh_freq / target_lag)
Map to Snowflake concepts
- Use terminology mapping table
- Identify gaps or differences
Plan migration order
- Start with simple features
- Group by model dependency

13.5.3 Phase 2: Migration

📁 Full code: _code/migration_utils.py

# Example: Migrating a Tecton batch Feature View to Snowflake (clickstream ORDERS)

from snowflake.ml.feature_store import FeatureStore, FeatureView, Entity

# Tecton (original)
"""
@batch_feature_view(
    entities=[user_entity],
    feature_table=user_features,
    schedule=timedelta(hours=1),
)
def user_purchase_fv(user_txns):
    return user_txns.groupby('USER_ID').agg(
        total_spend=('AMOUNT', 'sum'),
        order_count=('ORDER_ID', 'count'),
    )
"""

# Snowflake (migrated) — FEATURE_STORE_DEMO.CLICKSTREAM_DATA.ORDERS uses TOTAL_AMT, ORDER_ID, ORDER_TS
user_purchase_df = session.sql("""
    SELECT
        USER_ID,
        SUM(TOTAL_AMT) AS ORDER_TOTAL_AMT_SUM,
        COUNT(ORDER_ID) AS ORDER_CNT,
        MAX(ORDER_TS) AS LAST_ORDER_TS
    FROM FEATURE_STORE_DEMO.CLICKSTREAM_DATA.ORDERS
    GROUP BY USER_ID
""")

user_purchase_fv = FeatureView(
    name="USER_PURCHASE_FV",
    entities=[user_entity],
    feature_df=user_purchase_df,
    timestamp_col="LAST_ORDER_TS",
    refresh_freq="1 hour",
    desc="USER purchase aggregates from CLICKSTREAM_DATA.ORDERS",
)

fs.register_feature_view(feature_view=user_purchase_fv, version="V01")

13.5.4 Phase 3: Validation

from snowflake.snowpark import functions as F

def validate_migration(
    source_features,
    snowflake_features,
    entity_cols: list,
) -> dict:
    """Compare features between source platform and Snowflake (Snowpark DataFrames)."""

    comparison = source_features.join(
        snowflake_features,
        on=entity_cols,
        how="full_outer",
    )

    mismatches = comparison.filter(
        F.col("source_ORDER_TOTAL_AMT_SUM") != F.col("sf_ORDER_TOTAL_AMT_SUM")
    ).count()

    total = comparison.count()
    return {
        "total_rows": total,
        "mismatches": mismatches,
        "match_rate": (1 - mismatches / total) * 100 if total else 0.0,
    }

13.6 Platform-Specific Guides

13.6.1 Tecton Migration

Priority

Tecton migration is prioritized given Databricks Online Tables EOL.

# Key differences to handle:
# 1. Feature definitions use SQL/Snowpark (feature_df) vs Tecton Python transformations
# 2. Online serving uses Hybrid Tables + OnlineConfig.target_lag vs Tecton's online store
# 3. Offline scheduling uses Feature View refresh_freq vs Tecton's schedule parameter
# 4. Backfill / ad-hoc refresh: fs.refresh_feature_view(fv)
# 5. Training data: fs.generate_training_set() (Snowpark DF) or fs.generate_dataset() (ML Dataset)

13.6.2 Feast Migration

# Key differences:
# 1. Feature Store is a Snowflake schema + platform objects vs external registry + plugins
# 2. Online store is Online Feature Table (Hybrid Table) vs Redis/DynamoDB
# 3. No separate Feature Service — use generate_training_set() / generate_dataset() with spine + FVs

13.6.3 Migrating from Pandas-based Feature Engineering

Many ML teams start with local pandas for feature engineering and need a path to productionize those pipelines in the Feature Store. The difficulty varies widely depending on what pandas operations are used.

13.6.3.1 Complexity ladder

Tier 1 – Trivial (direct SQL rewrite). Simple groupby/agg, filters, joins. These map 1:1 to SQL or Snowpark DataFrame operations.

import pandas as pd

df = pd.read_sql("SELECT * FROM ORDERS", conn)
features = df.groupby("USER_ID").agg(
    ORDER_CNT=("ORDER_ID", "count"),
    SPEND_SUM=("TOTAL_AMT", "sum"),
    LAST_ORDER_TS=("ORDER_TS", "max"),
).reset_index()

user_purchase_df = session.sql("""
    SELECT
        USER_ID,
        COUNT(ORDER_ID)  AS ORDER_CNT,
        SUM(TOTAL_AMT)   AS SPEND_SUM,
        MAX(ORDER_TS)     AS LAST_ORDER_TS
    FROM FEATURE_STORE_DEMO.CLICKSTREAM_DATA.ORDERS
    GROUP BY USER_ID
""")

import snowflake.snowpark.functions as F

orders = session.table("FEATURE_STORE_DEMO.CLICKSTREAM_DATA.ORDERS")
user_purchase_df = orders.group_by("USER_ID").agg(
    F.count("ORDER_ID").alias("ORDER_CNT"),
    F.sum("TOTAL_AMT").alias("SPEND_SUM"),
    F.max("ORDER_TS").alias("LAST_ORDER_TS"),
)

Tier 2 – Moderate (Snowpark DataFrame rewrite). Multi-step transformations, conditional columns, window functions. Requires translating pandas idioms to Snowpark equivalents.

df["IS_HIGH_VALUE"] = (df["TOTAL_AMT"] > 100).astype(int)
df["ORDER_RANK"] = df.groupby("USER_ID")["ORDER_TS"].rank(ascending=False)
df = df[df["ORDER_RANK"] <= 5]

recent_orders_df = session.sql("""
    SELECT USER_ID, ORDER_ID, TOTAL_AMT, ORDER_TS,
           CASE WHEN TOTAL_AMT > 100 THEN 1 ELSE 0 END AS IS_HIGH_VALUE,
           ROW_NUMBER() OVER (PARTITION BY USER_ID ORDER BY ORDER_TS DESC) AS ORDER_RANK
    FROM FEATURE_STORE_DEMO.CLICKSTREAM_DATA.ORDERS
    QUALIFY ORDER_RANK <= 5
""")

from snowflake.snowpark import Window
import snowflake.snowpark.functions as F

orders = session.table("FEATURE_STORE_DEMO.CLICKSTREAM_DATA.ORDERS")
recent_orders_df = (
    orders
    .with_column("IS_HIGH_VALUE", F.when(F.col("TOTAL_AMT") > 100, 1).otherwise(0))
    .with_column(
        "ORDER_RANK",
        F.row_number().over(Window.partition_by("USER_ID").order_by(F.col("ORDER_TS").desc()))
    )
    .filter(F.col("ORDER_RANK") <= 5)
)

Tier 3 – Complex (LLM-assisted conversion + manual tuning). Pandas operations relying on row order (.shift(), .rolling() with custom functions, .apply() with arbitrary Python). These require expressing the logic with SQL window functions (LAG, LEAD, cumulative sums) or vectorized UDFs.

# pandas (before) — row-ordered operations
df = df.sort_values(["USER_ID", "ORDER_TS"])
df["PREV_ORDER_AMT"] = df.groupby("USER_ID")["TOTAL_AMT"].shift(1)
df["AMT_CHANGE"] = df["TOTAL_AMT"] - df["PREV_ORDER_AMT"]

# SQL (after) — window functions replace row-ordered pandas
order_change_df = session.sql("""
    SELECT
        USER_ID,
        ORDER_TS,
        TOTAL_AMT,
        LAG(TOTAL_AMT) OVER (
            PARTITION BY USER_ID ORDER BY ORDER_TS
        ) AS PREV_ORDER_AMT,
        TOTAL_AMT - LAG(TOTAL_AMT) OVER (
            PARTITION BY USER_ID ORDER BY ORDER_TS
        ) AS AMT_CHANGE
    FROM FEATURE_STORE_DEMO.CLICKSTREAM_DATA.ORDERS
""")

Use Coding LLMs for Tier 3 Conversions

Use coding LLM tools (Cursor, Copilot, ChatGPT, etc.) to generate a first-pass SQL or Snowpark DataFrame translation of complex pandas code. The LLM output typically handles 80-90% of the logic correctly. Review and manually tune the remaining edge cases – especially around window frame boundaries, NULL handling, and DT-compatibility (avoid non-deterministic functions like CURRENT_TIMESTAMP()).

13.6.3.2 Snowpark pandas: not recommended for production Feature Views

Snowpark pandas and Feature Store

Snowpark pandas (import snowflake.snowpark.modin.pandas as pd) runs pandas-syntax code on Snowflake warehouse compute. However, it is not a supported transformation language for Feature Store feature_df in production:

With default enforce_ordering=True, it generates multi-statement SQL with temporary tables that cannot back a Dynamic Table
With enforce_ordering=False, some pipelines collapse to a single SQL statement, but known correctness issues make this experimental and not recommended for production
Feature Store assumes relational, unordered sources keyed by entity + timestamp – pandas row-ordered semantics are fundamentally at odds with this model

Recommendation: Use Snowpark pandas for exploration and prototyping during migration, then rewrite the final pipeline in SQL or Snowpark DataFrames for the registered feature_df.

13.6.3.3 Anti-pattern: passing a local pandas DataFrame as `feature_df`

Passing a local pd.DataFrame as feature_df will fail – the Feature Store API expects a Snowpark DataFrame, not a pandas object. If you have data in a local pandas DataFrame, upload it to a Snowflake table first and then reference it:

# Upload local pandas DataFrame to Snowflake
session.write_pandas(local_df, "MY_STAGING_TABLE", auto_create_table=True)

# Now use it as feature_df
feature_df = session.table("MY_STAGING_TABLE")

13.7 Best Practices

1. Migrate Incrementally

Week 1: Entities + 2-3 simple Feature Views
Week 2: Remaining batch features
Week 3: Online serving setup
Week 4: Consumer cutover

2. Run in Parallel

Operate both systems during migration:

# Historical / training: compare Snowpark path (generate_training_set) and Dataset path (generate_dataset) as needed
source_features = tecton_client.get_historical_features(...)
sf_training_df = snowflake_fs.generate_training_set(
    spine_df=training_spine,
    features=[user_purchase_fv],
    spine_timestamp_col="SESSION_START_TS",
)
# Or: sf_dataset = snowflake_fs.generate_dataset(..., version="V01") for Parquet-backed ML Dataset workflows

validate_migration(source_features, sf_training_df, entity_cols=["USER_ID"])

3. Validate PIT Correctness

# Ensure point-in-time retrieval produces same results
def test_pit_correctness(spine, source_fs, snowflake_fs):
    source_result = source_fs.get_historical_features(spine)
    sf_df = snowflake_fs.generate_training_set(
        spine_df=spine,
        features=[user_purchase_fv],
        spine_timestamp_col="SESSION_START_TS",
    )
    assert_dataframes_equal(source_result, sf_df)

13.8 Common Pitfalls

13.8.1 ❌ Pitfall 1: Big Bang Migration

Problem: Migrating everything at once causes extended outages.

Solution: Phased migration with parallel operation.

13.8.2 ❌ Pitfall 2: Skipping Validation

Problem: Subtle differences in feature values undetected.

Solution: Comprehensive output comparison.

13.8.3 ❌ Pitfall 3: Ignoring Timestamp Semantics

Problem: Different interpretation of event time vs. processing time.

Solution: Document and align timestamp semantics.

13.8.4 ❌ Pitfall 4: Confusing TTL with a Single Knob

Problem: Tecton TTL maps to different Snowflake controls for offline vs online paths.

Solution: Set refresh_freq for Dynamic Table materialization and OnlineConfig.target_lag for online sync independently; align both with product freshness SLAs.

13.9 Summary

Phase	Activities
Assessment	Inventory, terminology mapping, planning
Migration	Recreate entities, port definitions (`feature_df`), register with `V01`, setup pipelines
Validation	Compare outputs (`generate_training_set` / `generate_dataset`), validate PIT, cutover

13.10 Additional Resources

--- title: "Migration Guide" subtitle: "Migrating to Snowflake Feature Store from other platforms" --- ## Overview This chapter provides guidance for teams migrating to Snowflake Feature Store from other platforms, **primarily Tecton**. It covers terminology mapping, migration strategies, and common patterns. Examples follow the **clickstream** sample layout: database `FEATURE_STORE_DEMO`, Feature Store schema `FEATURE_STORE`, source schema `CLICKSTREAM_DATA` (tables such as `EVENTS`, `SESSIONS`, `ORDERS`, `USERS`, `PRODUCTS`), warehouse `FS_DEV_WH`. Register Feature Views with version strings in **`V01`** format (for example `fs.register_feature_view(..., version="V01")`). ## Learning Objectives After completing this chapter, you will be able to: - Map terminology from other Feature Stores to Snowflake - Plan a phased migration approach - Migrate feature definitions and pipelines - Validate migrated features > 📂 **Chapter code:** [Browse companion scripts on GitHub](https://github.com/Snowflake-Labs/snowflake-featurestore-imp-guide/tree/main/Snowflake_FeatureStore_Implementation_Guide/13_migration_guide/_code) --- ## Why Migrate to Snowflake Feature Store? ### Key Benefits | Benefit | Description | |---------|-------------| | **Unified Platform** | Features, training, inference all in Snowflake | | **Native Integration** | Uses Snowflake objects (DTs, Views, Hybrid Tables) | | **Zero Infrastructure** | No separate serving infrastructure needed | | **Enterprise Security** | Inherits Snowflake RBAC and governance | | **Cost Efficiency** | No data movement, unified compute | --- ## Terminology Mapping {#sec-terminology} ### Tecton → Snowflake > 📁 **Full code:** [`_code/terminology_mapping.py`](_code/terminology_mapping.py) | Tecton | Snowflake | Notes | |--------|-----------|-------| | Feature View | Feature View | Registered with `register_feature_view`; use `version="V01"` (or `V02`, …) | | Entity | Entity | Same concept; join keys align with spine columns | | Batch Feature View | Feature View (Dynamic Table) | Set `refresh_freq` to a **period** (e.g. `"1 hour"`) or **CRON** string | | On-Demand Feature View | Feature View (View) | Omit `refresh_freq` or set `refresh_freq=None`; query-time computation | | Feature Service | `generate_training_set()` / `generate_dataset()` | No separate service object: pass spine + list of Feature Views (see Chapter 10) | | Online Store | Online Feature Table (Hybrid Table) | Enable with `OnlineConfig`; freshness via `target_lag` | | Offline Store | Feature Store schema | e.g. `FEATURE_STORE_DEMO.FEATURE_STORE`; materialized batch features live in Dynamic Tables there | | Data Source | Source table or view | e.g. `FEATURE_STORE_DEMO.CLICKSTREAM_DATA.ORDERS` | | Transformation | Feature View query (`feature_df`) | Snowpark DataFrame / SQL vs Tecton Python decorators | | Backfill | `fs.refresh_feature_view(fv)` | Triggers refresh outside the scheduled `refresh_freq` | | TTL | `refresh_freq` / `target_lag` | **Offline** lag ↔ Dynamic Table `refresh_freq`; **online** freshness ↔ `OnlineConfig.target_lag` (not a single global TTL) | | `get_historical_features` | `generate_training_set()` / `generate_dataset()` | Snowpark DataFrame vs versioned Dataset (Parquet); same PIT semantics | | `get_online_features` | `get_online_features` | Retrieve from Online Feature Table | | Spine | Spine DataFrame | Entity keys + PIT timestamp column passed as `spine_timestamp_col` | ### Feast → Snowflake (abbreviated) | Feast | Snowflake | |-------|-----------| | Feature Store (registry) | Feature Store schema + platform metadata | | Feature View | Feature View | | Online Store | Online Feature Table (Hybrid Table) | | Offline Store | Feature Store schema | | Feature Service | `generate_training_set()` / `generate_dataset()` | | `get_historical_features` | `generate_training_set()` / `generate_dataset()` | --- ## Migration Approach ### Three-Phase Migration ```{mermaid} %%| fig-cap: "Three-phase migration: assessment migration validation" %%| fig-alt: "Assessment then migration then validation phases" flowchart LR P1[Phase 1 Assessment] --> P2[Phase 2 Migration] --> P3[Phase 3 Validation] ``` ### Phase 1: Assessment 1. **Inventory existing features** - List all Feature Views/Groups - Document entities and join keys - Identify refresh frequencies (map to `refresh_freq` / `target_lag`) 2. **Map to Snowflake concepts** - Use terminology mapping table - Identify gaps or differences 3. **Plan migration order** - Start with simple features - Group by model dependency ### Phase 2: Migration > 📁 **Full code:** [`_code/migration_utils.py`](_code/migration_utils.py) ```python # Example: Migrating a Tecton batch Feature View to Snowflake (clickstream ORDERS) from snowflake.ml.feature_store import FeatureStore, FeatureView, Entity # Tecton (original) """ @batch_feature_view( entities=[user_entity], feature_table=user_features, schedule=timedelta(hours=1), ) def user_purchase_fv(user_txns): return user_txns.groupby('USER_ID').agg( total_spend=('AMOUNT', 'sum'), order_count=('ORDER_ID', 'count'), ) """ # Snowflake (migrated) — FEATURE_STORE_DEMO.CLICKSTREAM_DATA.ORDERS uses TOTAL_AMT, ORDER_ID, ORDER_TS user_purchase_df = session.sql(""" SELECT USER_ID, SUM(TOTAL_AMT) AS ORDER_TOTAL_AMT_SUM, COUNT(ORDER_ID) AS ORDER_CNT, MAX(ORDER_TS) AS LAST_ORDER_TS FROM FEATURE_STORE_DEMO.CLICKSTREAM_DATA.ORDERS GROUP BY USER_ID """) user_purchase_fv = FeatureView( name="USER_PURCHASE_FV", entities=[user_entity], feature_df=user_purchase_df, timestamp_col="LAST_ORDER_TS", refresh_freq="1 hour", desc="USER purchase aggregates from CLICKSTREAM_DATA.ORDERS", ) fs.register_feature_view(feature_view=user_purchase_fv, version="V01") ``` ### Phase 3: Validation ```python from snowflake.snowpark import functions as F def validate_migration( source_features, snowflake_features, entity_cols: list, ) -> dict: """Compare features between source platform and Snowflake (Snowpark DataFrames).""" comparison = source_features.join( snowflake_features, on=entity_cols, how="full_outer", ) mismatches = comparison.filter( F.col("source_ORDER_TOTAL_AMT_SUM") != F.col("sf_ORDER_TOTAL_AMT_SUM") ).count() total = comparison.count() return { "total_rows": total, "mismatches": mismatches, "match_rate": (1 - mismatches / total) * 100 if total else 0.0, } ``` --- ## Platform-Specific Guides ### Tecton Migration ::: {.callout-note} ## Priority Tecton migration is prioritized given Databricks Online Tables EOL. ::: ```python # Key differences to handle: # 1. Feature definitions use SQL/Snowpark (feature_df) vs Tecton Python transformations # 2. Online serving uses Hybrid Tables + OnlineConfig.target_lag vs Tecton's online store # 3. Offline scheduling uses Feature View refresh_freq vs Tecton's schedule parameter # 4. Backfill / ad-hoc refresh: fs.refresh_feature_view(fv) # 5. Training data: fs.generate_training_set() (Snowpark DF) or fs.generate_dataset() (ML Dataset) ``` ### Feast Migration ```python # Key differences: # 1. Feature Store is a Snowflake schema + platform objects vs external registry + plugins # 2. Online store is Online Feature Table (Hybrid Table) vs Redis/DynamoDB # 3. No separate Feature Service — use generate_training_set() / generate_dataset() with spine + FVs ``` ### Migrating from Pandas-based Feature Engineering {#sec-pandas-migration} Many ML teams start with local pandas for feature engineering and need a path to productionize those pipelines in the Feature Store. The difficulty varies widely depending on what pandas operations are used. #### Complexity ladder **Tier 1 -- Trivial (direct SQL rewrite).** Simple groupby/agg, filters, joins. These map 1:1 to SQL or Snowpark DataFrame operations. ::: {.panel-tabset group="lang"} ## pandas (before) ```python import pandas as pd df = pd.read_sql("SELECT * FROM ORDERS", conn) features = df.groupby("USER_ID").agg( ORDER_CNT=("ORDER_ID", "count"), SPEND_SUM=("TOTAL_AMT", "sum"), LAST_ORDER_TS=("ORDER_TS", "max"), ).reset_index() ``` ## SQL (after) ```python user_purchase_df = session.sql(""" SELECT USER_ID, COUNT(ORDER_ID) AS ORDER_CNT, SUM(TOTAL_AMT) AS SPEND_SUM, MAX(ORDER_TS) AS LAST_ORDER_TS FROM FEATURE_STORE_DEMO.CLICKSTREAM_DATA.ORDERS GROUP BY USER_ID """) ``` ## Snowpark DataFrame (after) ```python import snowflake.snowpark.functions as F orders = session.table("FEATURE_STORE_DEMO.CLICKSTREAM_DATA.ORDERS") user_purchase_df = orders.group_by("USER_ID").agg( F.count("ORDER_ID").alias("ORDER_CNT"), F.sum("TOTAL_AMT").alias("SPEND_SUM"), F.max("ORDER_TS").alias("LAST_ORDER_TS"), ) ``` ::: **Tier 2 -- Moderate (Snowpark DataFrame rewrite).** Multi-step transformations, conditional columns, window functions. Requires translating pandas idioms to Snowpark equivalents. ::: {.panel-tabset group="lang"} ## pandas (before) ```python df["IS_HIGH_VALUE"] = (df["TOTAL_AMT"] > 100).astype(int) df["ORDER_RANK"] = df.groupby("USER_ID")["ORDER_TS"].rank(ascending=False) df = df[df["ORDER_RANK"] <= 5] ``` ## SQL (after) ```python recent_orders_df = session.sql(""" SELECT USER_ID, ORDER_ID, TOTAL_AMT, ORDER_TS, CASE WHEN TOTAL_AMT > 100 THEN 1 ELSE 0 END AS IS_HIGH_VALUE, ROW_NUMBER() OVER (PARTITION BY USER_ID ORDER BY ORDER_TS DESC) AS ORDER_RANK FROM FEATURE_STORE_DEMO.CLICKSTREAM_DATA.ORDERS QUALIFY ORDER_RANK <= 5 """) ``` ## Snowpark DataFrame (after) ```python from snowflake.snowpark import Window import snowflake.snowpark.functions as F orders = session.table("FEATURE_STORE_DEMO.CLICKSTREAM_DATA.ORDERS") recent_orders_df = ( orders .with_column("IS_HIGH_VALUE", F.when(F.col("TOTAL_AMT") > 100, 1).otherwise(0)) .with_column( "ORDER_RANK", F.row_number().over(Window.partition_by("USER_ID").order_by(F.col("ORDER_TS").desc())) ) .filter(F.col("ORDER_RANK") <= 5) ) ``` ::: **Tier 3 -- Complex (LLM-assisted conversion + manual tuning).** Pandas operations relying on row order (`.shift()`, `.rolling()` with custom functions, `.apply()` with arbitrary Python). These require expressing the logic with SQL window functions (`LAG`, `LEAD`, cumulative sums) or vectorized UDFs. ```python # pandas (before) — row-ordered operations df = df.sort_values(["USER_ID", "ORDER_TS"]) df["PREV_ORDER_AMT"] = df.groupby("USER_ID")["TOTAL_AMT"].shift(1) df["AMT_CHANGE"] = df["TOTAL_AMT"] - df["PREV_ORDER_AMT"] ``` ```python # SQL (after) — window functions replace row-ordered pandas order_change_df = session.sql(""" SELECT USER_ID, ORDER_TS, TOTAL_AMT, LAG(TOTAL_AMT) OVER ( PARTITION BY USER_ID ORDER BY ORDER_TS ) AS PREV_ORDER_AMT, TOTAL_AMT - LAG(TOTAL_AMT) OVER ( PARTITION BY USER_ID ORDER BY ORDER_TS ) AS AMT_CHANGE FROM FEATURE_STORE_DEMO.CLICKSTREAM_DATA.ORDERS """) ``` ::: {.callout-tip title="Use Coding LLMs for Tier 3 Conversions"} Use coding LLM tools (Cursor, Copilot, ChatGPT, etc.) to generate a first-pass SQL or Snowpark DataFrame translation of complex pandas code. The LLM output typically handles 80-90% of the logic correctly. Review and manually tune the remaining edge cases -- especially around window frame boundaries, NULL handling, and DT-compatibility (avoid non-deterministic functions like `CURRENT_TIMESTAMP()`). ::: #### Snowpark pandas: not recommended for production Feature Views ::: {.callout-warning title="Snowpark pandas and Feature Store"} Snowpark pandas (`import snowflake.snowpark.modin.pandas as pd`) runs pandas-syntax code on Snowflake warehouse compute. However, it is **not a supported transformation language for Feature Store `feature_df` in production**: - With default `enforce_ordering=True`, it generates multi-statement SQL with temporary tables that **cannot back a Dynamic Table** - With `enforce_ordering=False`, some pipelines collapse to a single SQL statement, but known correctness issues make this **experimental and not recommended for production** - Feature Store assumes relational, unordered sources keyed by entity + timestamp -- pandas row-ordered semantics are fundamentally at odds with this model **Recommendation:** Use Snowpark pandas for **exploration and prototyping** during migration, then rewrite the final pipeline in SQL or Snowpark DataFrames for the registered `feature_df`. ::: #### Anti-pattern: passing a local pandas DataFrame as `feature_df` Passing a local `pd.DataFrame` as `feature_df` will **fail** -- the Feature Store API expects a Snowpark DataFrame, not a pandas object. If you have data in a local pandas DataFrame, upload it to a Snowflake table first and then reference it: ```python # Upload local pandas DataFrame to Snowflake session.write_pandas(local_df, "MY_STAGING_TABLE", auto_create_table=True) # Now use it as feature_df feature_df = session.table("MY_STAGING_TABLE") ``` --- ## Best Practices ### 1. Migrate Incrementally {.unnumbered} ``` Week 1: Entities + 2-3 simple Feature Views Week 2: Remaining batch features Week 3: Online serving setup Week 4: Consumer cutover ``` ### 2. Run in Parallel {.unnumbered} Operate both systems during migration: ```python # Historical / training: compare Snowpark path (generate_training_set) and Dataset path (generate_dataset) as needed source_features = tecton_client.get_historical_features(...) sf_training_df = snowflake_fs.generate_training_set( spine_df=training_spine, features=[user_purchase_fv], spine_timestamp_col="SESSION_START_TS", ) # Or: sf_dataset = snowflake_fs.generate_dataset(..., version="V01") for Parquet-backed ML Dataset workflows validate_migration(source_features, sf_training_df, entity_cols=["USER_ID"]) ``` ### 3. Validate PIT Correctness {.unnumbered} ```python # Ensure point-in-time retrieval produces same results def test_pit_correctness(spine, source_fs, snowflake_fs): source_result = source_fs.get_historical_features(spine) sf_df = snowflake_fs.generate_training_set( spine_df=spine, features=[user_purchase_fv], spine_timestamp_col="SESSION_START_TS", ) assert_dataframes_equal(source_result, sf_df) ``` --- ## Common Pitfalls ### ❌ Pitfall 1: Big Bang Migration **Problem**: Migrating everything at once causes extended outages. **Solution**: Phased migration with parallel operation. ### ❌ Pitfall 2: Skipping Validation **Problem**: Subtle differences in feature values undetected. **Solution**: Comprehensive output comparison. ### ❌ Pitfall 3: Ignoring Timestamp Semantics **Problem**: Different interpretation of event time vs. processing time. **Solution**: Document and align timestamp semantics. ### ❌ Pitfall 4: Confusing TTL with a Single Knob **Problem**: Tecton TTL maps to different Snowflake controls for offline vs online paths. **Solution**: Set `refresh_freq` for Dynamic Table materialization and `OnlineConfig.target_lag` for online sync independently; align both with product freshness SLAs. --- ## Summary | Phase | Activities | |-------|------------| | **Assessment** | Inventory, terminology mapping, planning | | **Migration** | Recreate entities, port definitions (`feature_df`), register with `V01`, setup pipelines | | **Validation** | Compare outputs (`generate_training_set` / `generate_dataset`), validate PIT, cutover | --- ## Additional Resources - [Snowflake Feature Store Documentation](https://docs.snowflake.com/en/developer-guide/snowflake-ml/feature-store) - [Tecton Documentation](https://docs.tecton.ai/) (for reference) - [Feast Documentation](https://docs.feast.dev/) (for reference)

13.1 Overview

13.2 Learning Objectives

13.3 Why Migrate to Snowflake Feature Store?

13.3.1 Key Benefits

13.4 Terminology Mapping

13.4.1 Tecton → Snowflake

13.4.2 Feast → Snowflake (abbreviated)

13.5 Migration Approach

13.5.1 Three-Phase Migration

13.5.2 Phase 1: Assessment

13.5.3 Phase 2: Migration

13.5.4 Phase 3: Validation

13.6 Platform-Specific Guides

13.6.1 Tecton Migration

13.6.2 Feast Migration

13.6.3 Migrating from Pandas-based Feature Engineering

13.6.3.1 Complexity ladder

13.6.3.2 Snowpark pandas: not recommended for production Feature Views

13.6.3.3 Anti-pattern: passing a local pandas DataFrame as feature_df

13.7 Best Practices

1. Migrate Incrementally

2. Run in Parallel

3. Validate PIT Correctness

13.8 Common Pitfalls

13.8.1 ❌ Pitfall 1: Big Bang Migration

13.8.2 ❌ Pitfall 2: Skipping Validation

13.8.3 ❌ Pitfall 3: Ignoring Timestamp Semantics

13.8.4 ❌ Pitfall 4: Confusing TTL with a Single Knob

13.9 Summary

13.10 Additional Resources

13.6.3.3 Anti-pattern: passing a local pandas DataFrame as `feature_df`