flowchart TB
subgraph data [1 — Data in Snowflake]
SRC["SERIES_EVENTS table<br/>(UNIT_ID, DATE, Y)"]
end
subgraph train [2 — Parallel training]
FE["foreach(unit_id = ids) %dopar%"]
DS[doSnowflake Tasks or Queue]
W[SPCS R workers]
FE --> DS --> W
end
subgraph persist [3 — Persist artifacts]
STG["@STAGE — one .rds per partition"]
IDX[model_index.json]
STG --> IDX
end
subgraph serve [4 — Register and score]
REG["sfr_log_many_model() — aggregator"]
BATCH["sfr_run_batch() — server-side"]
OUT[FORECAST_RESULTS table]
REG --> BATCH --> OUT
end
SRC --> FE
W --> STG
IDX --> REG
SRC --> BATCH
25 Many-Model Patterns
Partition-keyed forecasting and batch registry
snowflake, R, RStudio, Posit, VS Code, workspace notebooks, snowflakeR, RSnowflake, mlops
25.1 Overview
A many-model problem is one where you need many models of the same type, each fit to a different slice of data — one forecast per SKU, one churn model per region, one anomaly detector per sensor. The modeling code is identical; only the partition key and input series change.
That pattern is everywhere in retail, supply chain, and IoT, but it collides with two practical limits:
- Scale — thousands of
forloops on a laptop take days - Governance — exporting each series to CSV loses Snowflake’s access controls and lineage
snowflakeR addresses both by combining doSnowflake (parallel R on SPCS), stage persistence (.rds models in-account), and Model Registry batch APIs (sfr_log_many_model(), sfr_run_batch()). You write one foreach loop; Snowflake runs it across a compute pool and registers the result as a single registry version that knows how to score every partition.
This chapter applies doSnowflake to partitioned forecasting and registry batch scoring. Read Parallel doSnowflake first for serialization, tasks/task_XXX.rds, queue rows, and how unit_data reaches workers (data_query vs stage files).
Primary reference: many-model-howto vignette (HTML) or vignette("many-model-howto", package = "snowflakeR") after install.
25.2 Learning Objectives
- Decide when many-model beats one global model
- Trace data from source table → parallel training → stage → registry → batch inference
- Choose Tasks vs Queue dispatch and the screen-then-refine pattern
- Wire
data_queryso each SKU’s series is available asunit_datain the worker - Understand the worker return contract (
model_obj, metrics) and aggregator registration
25.3 When many-model applies
| Pattern | Example partition key | Why not one global model? |
|---|---|---|
| Per-SKU demand | UNIT_ID / SKU_ID |
Series length, seasonality, and intermittency differ |
| Per-store sales | STORE_ID |
Local promotions, regional trends |
| Per-customer churn | CUSTOMER_ID |
Sparse events; pooled model washes out signal |
| Per-asset monitoring | SENSOR_ID |
Failure signatures are device-specific |
Use a single global model when:
- Entities share stable structure and you have enough pooled training data
- A hierarchical or mixed-effects model explicitly shares information across partitions
- Operational simplicity outweighs per-partition accuracy gains
Use many-model when:
- Partitions are independent conditional on their own history
- The same R function fits every partition (
auto.arima,lm, tidymodels workflow) - You need governed storage and batch scoring inside Snowflake
25.4 End-to-end architecture
The many-model pipeline has four phases. Each phase uses different Snowflake surfaces — warehouse SQL for data, SPCS for R training, stage for artifacts, Registry for serving.
| Phase | What happens | Snowflake compute |
|---|---|---|
| 1. Data | Source table with partition key + time series (and optional covariates) | Warehouse (SQL ETL, Dynamic Tables) |
| 2. Train | foreach body fits one model per partition |
SPCS workers via doSnowflake |
| 3. Persist | Each worker writes .rds to stage via volume mount |
SPCS → stage I/O |
| 4. Register & infer | One aggregator version in Registry; batch job scores all partitions | Registry + SPCS batch (run_batch) |
25.4.1 Why registration is O(1) in model count
sfr_log_many_model() registers a single CustomModel aggregator — not thousands of separate registry entries. The aggregator holds a model_index.json mapping partition_key → stage_path. Registration writes the index (~seconds) regardless of whether you have 200 or 20,000 underlying .rds files. Batch inference uses that index to load the right model per row.
25.5 Dispatch: Tasks vs Queue
Both modes run the same %dopar% loop body. The difference is coordination (Task DAG vs Hybrid Table queue), not the modeling code.
| Tasks | Queue | |
|---|---|---|
| Best for | Full pass over all SKUs (screen pass) | Time-boxed refine pass on worst performers |
| Completion | Every chunk must finish | Feeder can stop early; workers drain queue |
Queue rows are one per chunk, not per SKU. doSnowflake creates tasks/task_XXX.rds on the stage automatically when you call %dopar% — you never author those files by hand. See Who creates task_XXX.rds? and Queue rows.
25.6 Screen-then-refine
A practical two-pass strategy from the parallel forecast demo:
Pass 1 — Tasks (fast screen):
- Run stepwise
auto.arima(or other cheap model) over all partitions via Tasks mode - Log accuracy metrics per SKU to a results table
- Every partition completes — you get a full baseline
Pass 2 — Queue (expensive refine):
- Rank SKUs by error; enqueue only the worst N (or all below a threshold)
- Queue mode with a time budget — feeder stops adding rows when time expires
- Run a model contest (ARIMA vs ETS vs TBATS with holdout validation) only where it matters
This spends compute on partitions that benefit from a richer model, not on easy series.
flowchart LR ALL[All SKUs] --> P1[Tasks pass — fast ARIMA] P1 --> MET[TRAINING_METRICS table] MET --> RANK[Rank by MAPE] RANK --> P2[Queue pass — model contest] P2 --> STG[Refined .rds on stage]
Runnable walkthrough: many-model-howto vignette and workspace_parallel_spcs_demo.ipynb in snowflakeR/inst/notebooks.
25.7 Worker expression contract
The %dopar% body runs on SPCS workers, not in your master session.
| Variable | How it appears |
|---|---|
unit_id (foreach loop variable) |
Serialized automatically in task_XXX.rds → args |
unit_data |
Not automatic. Worker bulk-reads via data_query on registerDoSnowflake() — see Getting data into each partition |
Register with bulk read (typical many-model setup):
registerDoSnowflake(conn, mode = "tasks",
compute_pool = cfg$compute_pool,
image_uri = cfg$image_uri,
stage = cfg$dosnowflake_stage_name,
chunks_per_job = 50L,
save_models = TRUE,
model_run_id = "tasks_20260529_1200",
model_key_arg = "unit_id",
data_query = list(
table = "MY_DB.MY_SCHEMA.SERIES_EVENTS",
key_column = "UNIT_ID",
key_arg = "unit_id",
columns = c("UNIT_ID", "OBS_DATE", "Y"),
order_by = "UNIT_ID, OBS_DATE",
warehouse = cfg$warehouse
)
)Example worker body (after unit_data injection):
suppressPackageStartupMessages(library(forecast))
ts_full <- ts(unit_data$Y, frequency = 7)
model <- auto.arima(ts_full, stepwise = TRUE)
fc <- forecast(model, h = 30L)
list(
unit_id = unit_id,
model = as.character(fc$method),
forecast = as.numeric(fc$mean),
model_obj = model,
accuracy = forecast::accuracy(model)
)Return contract: a list per iteration. Include model_obj (fitted R object) when save_models = TRUE; the worker writes @stage/models/<model_run_id>/<unit_id>.rds and strips model_obj from chunk results. The string field model is metadata only. Within-chunk CPU use (mclapply, instance_family, chunks_per_job) is covered in Parallel doSnowflake.
25.8 Bootstrap and configuration
Environment-specific values live in YAML — compute pool, image URI, stage, queue table, demo sizing:
parallel_lab:
compute_pool: MY_COMPUTE_POOL
image_uri: "mydb.myschema.myrepo/dosnowflake-worker:latest"
dosnowflake_stage_name: DOSNOWFLAKE_STAGE
queue_table: DOSNOWFLAKE_QUEUE
instance_family: CPU_X64_L
demo_forecast_n_skus: 2000
demo_tasks_chunks_per_job: 4
demo_queue_n_workers: 4Setup workflow:
cfg <- parallel_lab_load_config("snowflaker_parallel_spcs_config.yaml")
conn <- sfr_connect(name = "my_demo_jwt")
parallel_lab_setup(conn, cfg, create_series = TRUE, n_units = 2000)parallel_lab_setup() creates schemas, stage (with directory table), Hybrid Table queue, metrics tables, and optionally synthetic SERIES_EVENTS data for demos.
25.9 Batch inference after registration
After sfr_log_many_model():
sfr_run_batch(
reg,
model_name = "MANY_MODEL_FORECAST",
version = "V1",
input_table = "MY_DB.MY_SCHEMA.SERIES_EVENTS",
output_table = "MY_DB.MY_SCHEMA.FORECAST_RESULTS",
partition_col = "UNIT_ID"
)Snowflake runs server-side scoring: for each partition, load .rds from stage via the index, run R predict through rpy2, write results. Your R session does not pull every model locally.
25.10 Prerequisites
- Parallel doSnowflake — serialization,
data_query, stage layout, modes - SPCS compute pool and worker image (
linux/amd64) - Warehouse for
data_queryreads, orchestration, and batch jobs (consider multi-cluster when many chunks read in parallel) - Hybrid Table entitlement for queue mode
25.11 Demos and notebooks
| Resource | Description |
|---|---|
many-model-howto vignette |
Step-by-step with code excerpts |
workspace_parallel_spcs_setup.ipynb |
One-time infrastructure (snowflakeR inst/notebooks) |
workspace_parallel_spcs_demo.ipynb |
Runnable forecast demo |
25.12 Next steps
Model Registry — sfr_log_many_model() and aggregator detail.
End-to-End Pipeline — full lifecycle placement.