25  Many-Model Patterns

Partition-keyed forecasting and batch registry

Keywords

snowflake, R, RStudio, Posit, VS Code, workspace notebooks, snowflakeR, RSnowflake, mlops

25.1 Overview

A many-model problem is one where you need many models of the same type, each fit to a different slice of data — one forecast per SKU, one churn model per region, one anomaly detector per sensor. The modeling code is identical; only the partition key and input series change.

That pattern is everywhere in retail, supply chain, and IoT, but it collides with two practical limits:

  1. Scale — thousands of for loops on a laptop take days
  2. Governance — exporting each series to CSV loses Snowflake’s access controls and lineage

snowflakeR addresses both by combining doSnowflake (parallel R on SPCS), stage persistence (.rds models in-account), and Model Registry batch APIs (sfr_log_many_model(), sfr_run_batch()). You write one foreach loop; Snowflake runs it across a compute pool and registers the result as a single registry version that knows how to score every partition.

This chapter applies doSnowflake to partitioned forecasting and registry batch scoring. Read Parallel doSnowflake first for serialization, tasks/task_XXX.rds, queue rows, and how unit_data reaches workers (data_query vs stage files).

Primary reference: many-model-howto vignette (HTML) or vignette("many-model-howto", package = "snowflakeR") after install.

25.2 Learning Objectives

  • Decide when many-model beats one global model
  • Trace data from source table → parallel training → stage → registry → batch inference
  • Choose Tasks vs Queue dispatch and the screen-then-refine pattern
  • Wire data_query so each SKU’s series is available as unit_data in the worker
  • Understand the worker return contract (model_obj, metrics) and aggregator registration

25.3 When many-model applies

Pattern Example partition key Why not one global model?
Per-SKU demand UNIT_ID / SKU_ID Series length, seasonality, and intermittency differ
Per-store sales STORE_ID Local promotions, regional trends
Per-customer churn CUSTOMER_ID Sparse events; pooled model washes out signal
Per-asset monitoring SENSOR_ID Failure signatures are device-specific

Use a single global model when:

  • Entities share stable structure and you have enough pooled training data
  • A hierarchical or mixed-effects model explicitly shares information across partitions
  • Operational simplicity outweighs per-partition accuracy gains

Use many-model when:

  • Partitions are independent conditional on their own history
  • The same R function fits every partition (auto.arima, lm, tidymodels workflow)
  • You need governed storage and batch scoring inside Snowflake

25.4 End-to-end architecture

The many-model pipeline has four phases. Each phase uses different Snowflake surfaces — warehouse SQL for data, SPCS for R training, stage for artifacts, Registry for serving.

flowchart TB
  subgraph data [1 — Data in Snowflake]
    SRC["SERIES_EVENTS table<br/>(UNIT_ID, DATE, Y)"]
  end

  subgraph train [2 — Parallel training]
    FE["foreach(unit_id = ids) %dopar%"]
    DS[doSnowflake Tasks or Queue]
    W[SPCS R workers]
    FE --> DS --> W
  end

  subgraph persist [3 — Persist artifacts]
    STG["@STAGE — one .rds per partition"]
    IDX[model_index.json]
    STG --> IDX
  end

  subgraph serve [4 — Register and score]
    REG["sfr_log_many_model() — aggregator"]
    BATCH["sfr_run_batch() — server-side"]
    OUT[FORECAST_RESULTS table]
    REG --> BATCH --> OUT
  end

  SRC --> FE
  W --> STG
  IDX --> REG
  SRC --> BATCH

Phase What happens Snowflake compute
1. Data Source table with partition key + time series (and optional covariates) Warehouse (SQL ETL, Dynamic Tables)
2. Train foreach body fits one model per partition SPCS workers via doSnowflake
3. Persist Each worker writes .rds to stage via volume mount SPCS → stage I/O
4. Register & infer One aggregator version in Registry; batch job scores all partitions Registry + SPCS batch (run_batch)

25.4.1 Why registration is O(1) in model count

sfr_log_many_model() registers a single CustomModel aggregator — not thousands of separate registry entries. The aggregator holds a model_index.json mapping partition_key → stage_path. Registration writes the index (~seconds) regardless of whether you have 200 or 20,000 underlying .rds files. Batch inference uses that index to load the right model per row.


25.5 Dispatch: Tasks vs Queue

Both modes run the same %dopar% loop body. The difference is coordination (Task DAG vs Hybrid Table queue), not the modeling code.

Tasks Queue
Best for Full pass over all SKUs (screen pass) Time-boxed refine pass on worst performers
Completion Every chunk must finish Feeder can stop early; workers drain queue

Queue rows are one per chunk, not per SKU. doSnowflake creates tasks/task_XXX.rds on the stage automatically when you call %dopar% — you never author those files by hand. See Who creates task_XXX.rds? and Queue rows.


25.6 Screen-then-refine

A practical two-pass strategy from the parallel forecast demo:

Pass 1 — Tasks (fast screen):

  • Run stepwise auto.arima (or other cheap model) over all partitions via Tasks mode
  • Log accuracy metrics per SKU to a results table
  • Every partition completes — you get a full baseline

Pass 2 — Queue (expensive refine):

  • Rank SKUs by error; enqueue only the worst N (or all below a threshold)
  • Queue mode with a time budget — feeder stops adding rows when time expires
  • Run a model contest (ARIMA vs ETS vs TBATS with holdout validation) only where it matters

This spends compute on partitions that benefit from a richer model, not on easy series.

flowchart LR
  ALL[All SKUs] --> P1[Tasks pass — fast ARIMA]
  P1 --> MET[TRAINING_METRICS table]
  MET --> RANK[Rank by MAPE]
  RANK --> P2[Queue pass — model contest]
  P2 --> STG[Refined .rds on stage]

Runnable walkthrough: many-model-howto vignette and workspace_parallel_spcs_demo.ipynb in snowflakeR/inst/notebooks.


25.7 Worker expression contract

The %dopar% body runs on SPCS workers, not in your master session.

Variable How it appears
unit_id (foreach loop variable) Serialized automatically in task_XXX.rdsargs
unit_data Not automatic. Worker bulk-reads via data_query on registerDoSnowflake() — see Getting data into each partition

Register with bulk read (typical many-model setup):

registerDoSnowflake(conn, mode = "tasks",
  compute_pool = cfg$compute_pool,
  image_uri    = cfg$image_uri,
  stage        = cfg$dosnowflake_stage_name,
  chunks_per_job = 50L,
  save_models    = TRUE,
  model_run_id   = "tasks_20260529_1200",
  model_key_arg  = "unit_id",
  data_query = list(
    table      = "MY_DB.MY_SCHEMA.SERIES_EVENTS",
    key_column = "UNIT_ID",
    key_arg    = "unit_id",
    columns    = c("UNIT_ID", "OBS_DATE", "Y"),
    order_by   = "UNIT_ID, OBS_DATE",
    warehouse  = cfg$warehouse
  )
)

Example worker body (after unit_data injection):

suppressPackageStartupMessages(library(forecast))

ts_full <- ts(unit_data$Y, frequency = 7)
model <- auto.arima(ts_full, stepwise = TRUE)
fc    <- forecast(model, h = 30L)

list(
  unit_id    = unit_id,
  model      = as.character(fc$method),
  forecast   = as.numeric(fc$mean),
  model_obj  = model,
  accuracy   = forecast::accuracy(model)
)

Return contract: a list per iteration. Include model_obj (fitted R object) when save_models = TRUE; the worker writes @stage/models/<model_run_id>/<unit_id>.rds and strips model_obj from chunk results. The string field model is metadata only. Within-chunk CPU use (mclapply, instance_family, chunks_per_job) is covered in Parallel doSnowflake.


25.8 Bootstrap and configuration

Environment-specific values live in YAML — compute pool, image URI, stage, queue table, demo sizing:

parallel_lab:
  compute_pool: MY_COMPUTE_POOL
  image_uri: "mydb.myschema.myrepo/dosnowflake-worker:latest"
  dosnowflake_stage_name: DOSNOWFLAKE_STAGE
  queue_table: DOSNOWFLAKE_QUEUE
  instance_family: CPU_X64_L
  demo_forecast_n_skus: 2000
  demo_tasks_chunks_per_job: 4
  demo_queue_n_workers: 4

Setup workflow:

cfg  <- parallel_lab_load_config("snowflaker_parallel_spcs_config.yaml")
conn <- sfr_connect(name = "my_demo_jwt")
parallel_lab_setup(conn, cfg, create_series = TRUE, n_units = 2000)

parallel_lab_setup() creates schemas, stage (with directory table), Hybrid Table queue, metrics tables, and optionally synthetic SERIES_EVENTS data for demos.


25.9 Batch inference after registration

After sfr_log_many_model():

sfr_run_batch(
  reg,
  model_name  = "MANY_MODEL_FORECAST",
  version     = "V1",
  input_table = "MY_DB.MY_SCHEMA.SERIES_EVENTS",
  output_table = "MY_DB.MY_SCHEMA.FORECAST_RESULTS",
  partition_col = "UNIT_ID"
)

Snowflake runs server-side scoring: for each partition, load .rds from stage via the index, run R predict through rpy2, write results. Your R session does not pull every model locally.


25.10 Prerequisites

  • Parallel doSnowflake — serialization, data_query, stage layout, modes
  • SPCS compute pool and worker image (linux/amd64)
  • Warehouse for data_query reads, orchestration, and batch jobs (consider multi-cluster when many chunks read in parallel)
  • Hybrid Table entitlement for queue mode

25.11 Demos and notebooks

Resource Description
many-model-howto vignette Step-by-step with code excerpts
workspace_parallel_spcs_setup.ipynb One-time infrastructure (snowflakeR inst/notebooks)
workspace_parallel_spcs_demo.ipynb Runnable forecast demo

25.12 Next steps

Model Registrysfr_log_many_model() and aggregator detail.

End-to-End Pipeline — full lifecycle placement.