13  Migration Guide

Migrating to Snowflake Feature Store from other platforms

Keywords

snowflake, feature store, ml, machine learning, mlops

13.1 Overview

This chapter provides guidance for teams migrating to Snowflake Feature Store from other platforms, primarily Tecton. It covers terminology mapping, migration strategies, and common patterns.

Examples follow the clickstream sample layout: database FEATURE_STORE_DEMO, Feature Store schema FEATURE_STORE, source schema CLICKSTREAM_DATA (tables such as EVENTS, SESSIONS, ORDERS, USERS, PRODUCTS), warehouse FS_DEV_WH. Register Feature Views with version strings in V01 format (for example fs.register_feature_view(..., version="V01")).

13.2 Learning Objectives

After completing this chapter, you will be able to:

  • Map terminology from other Feature Stores to Snowflake
  • Plan a phased migration approach
  • Migrate feature definitions and pipelines
  • Validate migrated features

📂 Chapter code: Browse companion scripts on GitHub


13.3 Why Migrate to Snowflake Feature Store?

13.3.1 Key Benefits

Benefit Description
Unified Platform Features, training, inference all in Snowflake
Native Integration Uses Snowflake objects (DTs, Views, Hybrid Tables)
Zero Infrastructure No separate serving infrastructure needed
Enterprise Security Inherits Snowflake RBAC and governance
Cost Efficiency No data movement, unified compute

13.4 Terminology Mapping

13.4.1 Tecton → Snowflake

📁 Full code: _code/terminology_mapping.py

Tecton Snowflake Notes
Feature View Feature View Registered with register_feature_view; use version="V01" (or V02, …)
Entity Entity Same concept; join keys align with spine columns
Batch Feature View Feature View (Dynamic Table) Set refresh_freq to a period (e.g. "1 hour") or CRON string
On-Demand Feature View Feature View (View) Omit refresh_freq or set refresh_freq=None; query-time computation
Feature Service generate_training_set() / generate_dataset() No separate service object: pass spine + list of Feature Views (see Chapter 10)
Online Store Online Feature Table (Hybrid Table) Enable with OnlineConfig; freshness via target_lag
Offline Store Feature Store schema e.g. FEATURE_STORE_DEMO.FEATURE_STORE; materialized batch features live in Dynamic Tables there
Data Source Source table or view e.g. FEATURE_STORE_DEMO.CLICKSTREAM_DATA.ORDERS
Transformation Feature View query (feature_df) Snowpark DataFrame / SQL vs Tecton Python decorators
Backfill fs.refresh_feature_view(fv) Triggers refresh outside the scheduled refresh_freq
TTL refresh_freq / target_lag Offline lag ↔︎ Dynamic Table refresh_freq; online freshness ↔︎ OnlineConfig.target_lag (not a single global TTL)
get_historical_features generate_training_set() / generate_dataset() Snowpark DataFrame vs versioned Dataset (Parquet); same PIT semantics
get_online_features get_online_features Retrieve from Online Feature Table
Spine Spine DataFrame Entity keys + PIT timestamp column passed as spine_timestamp_col

13.4.2 Feast → Snowflake (abbreviated)

Feast Snowflake
Feature Store (registry) Feature Store schema + platform metadata
Feature View Feature View
Online Store Online Feature Table (Hybrid Table)
Offline Store Feature Store schema
Feature Service generate_training_set() / generate_dataset()
get_historical_features generate_training_set() / generate_dataset()

13.5 Migration Approach

13.5.1 Three-Phase Migration

flowchart LR
  P1[Phase 1 Assessment] --> P2[Phase 2 Migration] --> P3[Phase 3 Validation]

Three-phase migration: assessment migration validation

13.5.2 Phase 1: Assessment

  1. Inventory existing features
    • List all Feature Views/Groups
    • Document entities and join keys
    • Identify refresh frequencies (map to refresh_freq / target_lag)
  2. Map to Snowflake concepts
    • Use terminology mapping table
    • Identify gaps or differences
  3. Plan migration order
    • Start with simple features
    • Group by model dependency

13.5.3 Phase 2: Migration

📁 Full code: _code/migration_utils.py

# Example: Migrating a Tecton batch Feature View to Snowflake (clickstream ORDERS)

from snowflake.ml.feature_store import FeatureStore, FeatureView, Entity

# Tecton (original)
"""
@batch_feature_view(
    entities=[user_entity],
    feature_table=user_features,
    schedule=timedelta(hours=1),
)
def user_purchase_fv(user_txns):
    return user_txns.groupby('USER_ID').agg(
        total_spend=('AMOUNT', 'sum'),
        order_count=('ORDER_ID', 'count'),
    )
"""

# Snowflake (migrated) — FEATURE_STORE_DEMO.CLICKSTREAM_DATA.ORDERS uses TOTAL_AMT, ORDER_ID, ORDER_TS
user_purchase_df = session.sql("""
    SELECT
        USER_ID,
        SUM(TOTAL_AMT) AS ORDER_TOTAL_AMT_SUM,
        COUNT(ORDER_ID) AS ORDER_CNT,
        MAX(ORDER_TS) AS LAST_ORDER_TS
    FROM FEATURE_STORE_DEMO.CLICKSTREAM_DATA.ORDERS
    GROUP BY USER_ID
""")

user_purchase_fv = FeatureView(
    name="USER_PURCHASE_FV",
    entities=[user_entity],
    feature_df=user_purchase_df,
    timestamp_col="LAST_ORDER_TS",
    refresh_freq="1 hour",
    desc="USER purchase aggregates from CLICKSTREAM_DATA.ORDERS",
)

fs.register_feature_view(feature_view=user_purchase_fv, version="V01")

13.5.4 Phase 3: Validation

from snowflake.snowpark import functions as F

def validate_migration(
    source_features,
    snowflake_features,
    entity_cols: list,
) -> dict:
    """Compare features between source platform and Snowflake (Snowpark DataFrames)."""

    comparison = source_features.join(
        snowflake_features,
        on=entity_cols,
        how="full_outer",
    )

    mismatches = comparison.filter(
        F.col("source_ORDER_TOTAL_AMT_SUM") != F.col("sf_ORDER_TOTAL_AMT_SUM")
    ).count()

    total = comparison.count()
    return {
        "total_rows": total,
        "mismatches": mismatches,
        "match_rate": (1 - mismatches / total) * 100 if total else 0.0,
    }

13.6 Platform-Specific Guides

13.6.1 Tecton Migration

Priority

Tecton migration is prioritized given Databricks Online Tables EOL.

# Key differences to handle:
# 1. Feature definitions use SQL/Snowpark (feature_df) vs Tecton Python transformations
# 2. Online serving uses Hybrid Tables + OnlineConfig.target_lag vs Tecton's online store
# 3. Offline scheduling uses Feature View refresh_freq vs Tecton's schedule parameter
# 4. Backfill / ad-hoc refresh: fs.refresh_feature_view(fv)
# 5. Training data: fs.generate_training_set() (Snowpark DF) or fs.generate_dataset() (ML Dataset)

13.6.2 Feast Migration

# Key differences:
# 1. Feature Store is a Snowflake schema + platform objects vs external registry + plugins
# 2. Online store is Online Feature Table (Hybrid Table) vs Redis/DynamoDB
# 3. No separate Feature Service — use generate_training_set() / generate_dataset() with spine + FVs

13.6.3 Migrating from Pandas-based Feature Engineering

Many ML teams start with local pandas for feature engineering and need a path to productionize those pipelines in the Feature Store. The difficulty varies widely depending on what pandas operations are used.

13.6.3.1 Complexity ladder

Tier 1 – Trivial (direct SQL rewrite). Simple groupby/agg, filters, joins. These map 1:1 to SQL or Snowpark DataFrame operations.

import pandas as pd

df = pd.read_sql("SELECT * FROM ORDERS", conn)
features = df.groupby("USER_ID").agg(
    ORDER_CNT=("ORDER_ID", "count"),
    SPEND_SUM=("TOTAL_AMT", "sum"),
    LAST_ORDER_TS=("ORDER_TS", "max"),
).reset_index()
user_purchase_df = session.sql("""
    SELECT
        USER_ID,
        COUNT(ORDER_ID)  AS ORDER_CNT,
        SUM(TOTAL_AMT)   AS SPEND_SUM,
        MAX(ORDER_TS)     AS LAST_ORDER_TS
    FROM FEATURE_STORE_DEMO.CLICKSTREAM_DATA.ORDERS
    GROUP BY USER_ID
""")
import snowflake.snowpark.functions as F

orders = session.table("FEATURE_STORE_DEMO.CLICKSTREAM_DATA.ORDERS")
user_purchase_df = orders.group_by("USER_ID").agg(
    F.count("ORDER_ID").alias("ORDER_CNT"),
    F.sum("TOTAL_AMT").alias("SPEND_SUM"),
    F.max("ORDER_TS").alias("LAST_ORDER_TS"),
)

Tier 2 – Moderate (Snowpark DataFrame rewrite). Multi-step transformations, conditional columns, window functions. Requires translating pandas idioms to Snowpark equivalents.

df["IS_HIGH_VALUE"] = (df["TOTAL_AMT"] > 100).astype(int)
df["ORDER_RANK"] = df.groupby("USER_ID")["ORDER_TS"].rank(ascending=False)
df = df[df["ORDER_RANK"] <= 5]
recent_orders_df = session.sql("""
    SELECT USER_ID, ORDER_ID, TOTAL_AMT, ORDER_TS,
           CASE WHEN TOTAL_AMT > 100 THEN 1 ELSE 0 END AS IS_HIGH_VALUE,
           ROW_NUMBER() OVER (PARTITION BY USER_ID ORDER BY ORDER_TS DESC) AS ORDER_RANK
    FROM FEATURE_STORE_DEMO.CLICKSTREAM_DATA.ORDERS
    QUALIFY ORDER_RANK <= 5
""")
from snowflake.snowpark import Window
import snowflake.snowpark.functions as F

orders = session.table("FEATURE_STORE_DEMO.CLICKSTREAM_DATA.ORDERS")
recent_orders_df = (
    orders
    .with_column("IS_HIGH_VALUE", F.when(F.col("TOTAL_AMT") > 100, 1).otherwise(0))
    .with_column(
        "ORDER_RANK",
        F.row_number().over(Window.partition_by("USER_ID").order_by(F.col("ORDER_TS").desc()))
    )
    .filter(F.col("ORDER_RANK") <= 5)
)

Tier 3 – Complex (LLM-assisted conversion + manual tuning). Pandas operations relying on row order (.shift(), .rolling() with custom functions, .apply() with arbitrary Python). These require expressing the logic with SQL window functions (LAG, LEAD, cumulative sums) or vectorized UDFs.

# pandas (before) — row-ordered operations
df = df.sort_values(["USER_ID", "ORDER_TS"])
df["PREV_ORDER_AMT"] = df.groupby("USER_ID")["TOTAL_AMT"].shift(1)
df["AMT_CHANGE"] = df["TOTAL_AMT"] - df["PREV_ORDER_AMT"]
# SQL (after) — window functions replace row-ordered pandas
order_change_df = session.sql("""
    SELECT
        USER_ID,
        ORDER_TS,
        TOTAL_AMT,
        LAG(TOTAL_AMT) OVER (
            PARTITION BY USER_ID ORDER BY ORDER_TS
        ) AS PREV_ORDER_AMT,
        TOTAL_AMT - LAG(TOTAL_AMT) OVER (
            PARTITION BY USER_ID ORDER BY ORDER_TS
        ) AS AMT_CHANGE
    FROM FEATURE_STORE_DEMO.CLICKSTREAM_DATA.ORDERS
""")
Use Coding LLMs for Tier 3 Conversions

Use coding LLM tools (Cursor, Copilot, ChatGPT, etc.) to generate a first-pass SQL or Snowpark DataFrame translation of complex pandas code. The LLM output typically handles 80-90% of the logic correctly. Review and manually tune the remaining edge cases – especially around window frame boundaries, NULL handling, and DT-compatibility (avoid non-deterministic functions like CURRENT_TIMESTAMP()).

13.6.3.3 Anti-pattern: passing a local pandas DataFrame as feature_df

Passing a local pd.DataFrame as feature_df will fail – the Feature Store API expects a Snowpark DataFrame, not a pandas object. If you have data in a local pandas DataFrame, upload it to a Snowflake table first and then reference it:

# Upload local pandas DataFrame to Snowflake
session.write_pandas(local_df, "MY_STAGING_TABLE", auto_create_table=True)

# Now use it as feature_df
feature_df = session.table("MY_STAGING_TABLE")

13.7 Best Practices

1. Migrate Incrementally

Week 1: Entities + 2-3 simple Feature Views
Week 2: Remaining batch features
Week 3: Online serving setup
Week 4: Consumer cutover

2. Run in Parallel

Operate both systems during migration:

# Historical / training: compare Snowpark path (generate_training_set) and Dataset path (generate_dataset) as needed
source_features = tecton_client.get_historical_features(...)
sf_training_df = snowflake_fs.generate_training_set(
    spine_df=training_spine,
    features=[user_purchase_fv],
    spine_timestamp_col="SESSION_START_TS",
)
# Or: sf_dataset = snowflake_fs.generate_dataset(..., version="V01") for Parquet-backed ML Dataset workflows

validate_migration(source_features, sf_training_df, entity_cols=["USER_ID"])

3. Validate PIT Correctness

# Ensure point-in-time retrieval produces same results
def test_pit_correctness(spine, source_fs, snowflake_fs):
    source_result = source_fs.get_historical_features(spine)
    sf_df = snowflake_fs.generate_training_set(
        spine_df=spine,
        features=[user_purchase_fv],
        spine_timestamp_col="SESSION_START_TS",
    )
    assert_dataframes_equal(source_result, sf_df)

13.8 Common Pitfalls

13.8.1 ❌ Pitfall 1: Big Bang Migration

Problem: Migrating everything at once causes extended outages.

Solution: Phased migration with parallel operation.

13.8.2 ❌ Pitfall 2: Skipping Validation

Problem: Subtle differences in feature values undetected.

Solution: Comprehensive output comparison.

13.8.3 ❌ Pitfall 3: Ignoring Timestamp Semantics

Problem: Different interpretation of event time vs. processing time.

Solution: Document and align timestamp semantics.

13.8.4 ❌ Pitfall 4: Confusing TTL with a Single Knob

Problem: Tecton TTL maps to different Snowflake controls for offline vs online paths.

Solution: Set refresh_freq for Dynamic Table materialization and OnlineConfig.target_lag for online sync independently; align both with product freshness SLAs.


13.9 Summary

Phase Activities
Assessment Inventory, terminology mapping, planning
Migration Recreate entities, port definitions (feature_df), register with V01, setup pipelines
Validation Compare outputs (generate_training_set / generate_dataset), validate PIT, cutover

13.10 Additional Resources