LLM pipelines are fragile, often crashing mid-process from API timeouts, rate limiting, or worker evictions. When a worker restarts, its state is gone, forcing potentially expensive re-runs that burn tokens on already completed work. Traditional orchestration tools like Airflow or Prefect get applied here, but they are mostly designed for batch ETL and don’t handle state well for multi-hour agent loops.
Temporal addresses this with Durable Execution, a programming model that:
- Lets processes resume exactly where they failed
- Avoids re-executing expensive steps that already completed
What Temporal provides
Temporal is an orchestration platform built on three core primitives:
Workflows define business logic as code. A workflow can run for days or months, maintaining state in local variables that survive process crashes. The platform automatically persists execution history through event sourcing, allowing deterministic replay from any failure point. This eliminates external checkpointing systems.
Activities encapsulate non-deterministic operations — LLM API calls, database queries, external service interactions. Temporal applies configurable retry policies with exponential backoff automatically. Workflows don’t track individual retry attempts — that happens automatically in the activity layer.
Event History records every workflow execution step (activity scheduling, completion, input/output data) as an immutable audit log. You get Time-Travel Debugging — replaying production failures locally with identical state reconstruction. For debugging non-deterministic AI behavior, this history lets you replay exact execution paths.
Workers poll task queues, pick up tasks, execute the code, and return results. You scale by adding more workers; state lives in the Temporal Server's event log, not in worker memory.
When to use Temporal
The decision depends on pipeline characteristics and operational constraints.
Use Temporal when:
- Long-running processes (hours/days/months) where losing progress is expensive, such as multi-stage document analysis, conversational agents maintaining long-term state, or financial audit workflows
- Critical fault tolerance requirements, for example in payment processing triggered by LLM agents, medical diagnosis pipelines, or any scenario requiring compensating transactions (part of the “Saga” pattern) to handle rollbacks
- Dynamic agent loops where next steps are determined by LLM output at runtime, including ReAct agents, tool selection workflows, or multi-agent coordination that cannot be expressed as static DAGs
- Human-in-the-loop approval flows, like HR recruitment or compliance pipelines, where processes must pause indefinitely for user input and resume without state loss
Don’t use Temporal for:
- Simple batch jobs (<10 minutes) with low failure risk, like standard ETL/ELT tasks where Prefect or Airflow provide faster time-to-value with less operational overhead
- Stateless API services requiring sub-100ms latency. Single synchronous LLM calls don’t benefit from durable execution — they add unnecessary complexity
- Rapid prototyping where determinism constraints slow iteration. Temporal requires learning distributed systems patterns (Event Sourcing, deterministic workflows) that increase initial development friction
Key tradeoff: Temporal’s core workflow code must be strictly deterministic. All non-deterministic operations (LLM calls, time.now(), random number generation) must be isolated in Activities. Violating this causes non-deterministic errors during event replay. This is a steep learning curve for AI developers used to standard imperative Python.
Practical example: refactoring an LLM pipeline
Start with a fragile document parser that needs reliability. We’ll refactor it step-by-step (incremental migration) instead of rewriting everything.
Original pipeline
```python
def parse_document(doc_id: str):
    doc = db.fetch(doc_id)
    parsed = extract_text(doc.path)
    summary = llm_client.generate(parsed)  # Fails on rate limit
    db.save_summary(doc_id, summary)
```
Problems: no state preservation, no retry logic, crashes lose progress.
Step 1: basic workflow with durable state
- Problem: The original `parse_document` function is completely stateless. If the worker crashes, everything is lost, including the `doc_id` being processed.
- Change: We move from a simple Python function to a `DocumentWorkflow` class. State is now held in `self.state`, which Temporal automatically persists.
```python
from temporalio import workflow

@workflow.defn
class DocumentWorkflow:
    def __init__(self):
        self.state = {
            "doc_id": None,
            "parsed_text": None,
            "summary": None,
            "status": "pending"
        }

    @workflow.run
    async def run(self, doc_id: str) -> dict:
        self.state["doc_id"] = doc_id
        self.state["status"] = "processing"
        # Activities will go here
        return self.state
```
Pattern: Durable state in class variables. Temporal persists `self.state` automatically via event sourcing, so worker crashes no longer lose context.
Step 2: LLM activity with retry policy
- Problem: Our LLM call has no retry logic. If it fails due to a rate limit (HTTP 429), the entire workflow fails instead of just waiting and trying again.
- Change: We wrap the LLM call in an Activity (`generate_summary`). This isolates the non-deterministic code and allows the workflow to add a `RetryPolicy`.
```python
from temporalio import activity
from datetime import timedelta

from llm_client import LLMClient  # Abstract client interface

# Model configuration
llm_model_name = "your-llm-model"  # Replace with your provider-specific model
llm_client = LLMClient()

@activity.defn
async def generate_summary(text: str) -> str:
    response = await llm_client.generate(
        model=llm_model_name,
        messages=[{"role": "user", "content": f"Summarize: {text}"}]
    )
    return response.content
```

```python
# In the workflow:
from temporalio.common import RetryPolicy

summary = await workflow.execute_activity(
    generate_summary,
    self.state["parsed_text"],
    retry_policy=RetryPolicy(
        initial_interval=timedelta(seconds=1),
        backoff_coefficient=2.0,
        maximum_attempts=5,
        maximum_interval=timedelta(seconds=60),
        non_retryable_error_types=["InvalidAPIKeyError", "AuthenticationError"]
    ),
    start_to_close_timeout=timedelta(seconds=30)
)
self.state["summary"] = summary
```
Pattern: Activities encapsulate non-deterministic LLM calls. The retry policy handles rate limits (HTTP 429) automatically with exponential backoff, and `non_retryable_error_types` prevents wasted retries on permanent failures like auth errors.
`start_to_close_timeout` (30s) caps the duration of a single attempt. `schedule_to_close_timeout` would cap total time including all retries; set it higher (e.g., 5 minutes) to accommodate backoff periods.
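To size `schedule_to_close_timeout`, it helps to add up the worst-case backoff. A small helper (not part of Temporal; the server may also apply jitter, so treat this as an approximate lower bound):

```python
from datetime import timedelta

def total_backoff(initial: timedelta, coefficient: float,
                  attempts: int, cap: timedelta) -> timedelta:
    """Sum the waits between retry attempts (delays grow geometrically, capped)."""
    total, delay = timedelta(), initial
    for _ in range(attempts - 1):  # one delay between each pair of attempts
        total += min(delay, cap)
        delay = timedelta(seconds=delay.total_seconds() * coefficient)
    return total

# The policy above: 1s initial, 2.0 coefficient, 5 attempts, 60s cap
# -> 1 + 2 + 4 + 8 = 15s of backoff, plus up to 5 x 30s of attempt time
print(total_backoff(timedelta(seconds=1), 2.0, 5, timedelta(seconds=60)))
```

With at most 15 seconds of backoff and 150 seconds of attempt time, a 5-minute `schedule_to_close_timeout` leaves comfortable headroom.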
Step 3: human-in-the-loop via signals
- Problem: The process is fully automated. It cannot pause and wait for a human to approve the generated summary before saving it to the database.
- Change: We add `workflow.wait_condition` to allow the workflow to "sleep" (without consuming worker resources). We also add a Signal (`approve_summary`) to let an external user "wake" the workflow with an input.
```python
@workflow.defn
class DocumentWorkflow:
    def __init__(self):
        self.state = {...}
        self.approval_received = None  # Signal target

    @workflow.signal
    async def approve_summary(self, approved: bool):
        self.approval_received = approved

    @workflow.run
    async def run(self, doc_id: str) -> dict:
        # ... generate summary ...

        # Wait for human approval (can wait hours/days)
        await workflow.wait_condition(
            lambda: self.approval_received is not None,
            timeout=timedelta(days=7)
        )

        if not self.approval_received:
            self.state["status"] = "rejected"
            return self.state  # Compensating action

        # Continue processing...
```
Pattern: Workflows can pause via `wait_condition`. The pause consumes no worker resources and can last indefinitely; always set a `timeout`, as shown above, to prevent infinite waits.
Step 4: query for status monitoring
- Problem: We have no idea what's happening inside a running workflow. Is it stuck or waiting for approval? The only way to know is to check logs or the database.
- Change: We add a `@workflow.query` method (`get_status`). This lets our UI (or any other service) synchronously read the workflow's internal `self.state` at any time without interrupting it.
```python
@workflow.query
def get_status(self) -> dict:
    return {
        "doc_id": self.state["doc_id"],
        "status": self.state["status"],
        "progress": "awaiting_approval" if self.approval_received is None else "processing"
    }
```
Pattern: Queries provide synchronous state access for UIs/dashboards without interrupting workflow execution. They are read-only by design (meaning they cannot change workflow state, only view it).
Step 5: GPU resource isolation via task queues
- Problem: We have an expensive `run_local_model` Activity that requires a GPU. If a normal CPU worker picks it up, it will either crash (OOM) or block other tasks.
- Change: When executing the activity, we specify `task_queue="gpu-workers"`. This ensures only workers specifically configured to listen to that queue (and deployed on GPU machines) will execute this task.
```python
@activity.defn
async def run_local_model(text: str) -> str:
    # Requires GPU, expensive
    model = load_model()
    return model.generate(text)

# In the workflow:
result = await workflow.execute_activity(
    run_local_model,
    text,
    task_queue="gpu-workers",  # Dedicated queue
    start_to_close_timeout=timedelta(minutes=10)
)

# Worker configuration
from temporalio.worker import Worker

worker = Worker(
    client,
    task_queue="gpu-workers",
    workflows=[DocumentWorkflow],
    activities=[run_local_model],
    max_concurrent_activities=2  # Only 2 parallel GPU tasks
)
```
Pattern: Task queues isolate resource-intensive activities. Deploy workers with limited activity slots on GPU machines, preventing cascading failures and controlling costs. Implements Bulkhead pattern.
Step 6: continue-as-new for long processes
- Problem: If the workflow processes 10,000 documents in a loop or runs for months (like a chatbot), its Event History becomes huge, which slows down replay and recovery.
- Change: We periodically check the history size (`workflow.info().get_current_history_length()`). If it exceeds a threshold, we use `workflow.continue_as_new()` to start a fresh workflow execution, carrying over the state but with a clean history.
```python
@workflow.run
async def run(self, doc_id: str, iteration: int = 0) -> dict:
    # Process document...

    # Check history size
    history_length = workflow.info().get_current_history_length()
    if history_length > 1000:
        # Start a fresh workflow execution with current state
        # (multiple arguments go in the args list)
        workflow.continue_as_new(args=[doc_id, iteration + 1])

    return self.state
```
Pattern: Continue-as-new prevents event history bloat by starting a new workflow execution with a clean history. Warning: this is not a "continue" but a restart. The new run loses all `self.state` unless you explicitly pass the necessary data as arguments to the `continue_as_new` call.
Complete integration
Full example with error handling:
```python
from datetime import timedelta

from temporalio import workflow
from temporalio.common import RetryPolicy

@workflow.defn
class DocumentWorkflow:
    def __init__(self):
        self.state = {"doc_id": None, "summary": None, "status": "pending"}
        self.approval_received = None

    @workflow.run
    async def run(self, doc_id: str) -> dict:
        self.state["doc_id"] = doc_id
        try:
            # Fetch with retry
            doc = await workflow.execute_activity(
                fetch_document,
                doc_id,
                retry_policy=RetryPolicy(maximum_attempts=3),
                start_to_close_timeout=timedelta(seconds=10)
            )

            # LLM with rate limit handling
            summary = await workflow.execute_activity(
                generate_summary,
                doc.text,
                retry_policy=RetryPolicy(
                    backoff_coefficient=2.0,
                    maximum_attempts=5,
                    non_retryable_error_types=["InvalidAPIKeyError"]
                ),
                start_to_close_timeout=timedelta(seconds=30)
            )
            self.state["summary"] = summary

            # Human approval
            await workflow.wait_condition(
                lambda: self.approval_received is not None,
                timeout=timedelta(hours=24)
            )

            if self.approval_received:
                await workflow.execute_activity(
                    save_results,
                    self.state,
                    start_to_close_timeout=timedelta(seconds=10)
                )
                self.state["status"] = "completed"
            else:
                self.state["status"] = "rejected"

        except Exception as e:
            self.state["status"] = "failed"
            self.state["error"] = str(e)

        return self.state

    @workflow.signal
    async def approve_summary(self, approved: bool):
        self.approval_received = approved

    @workflow.query
    def get_status(self) -> dict:
        return self.state
```
Best practices
Use exponential backoff for rate limits. Set backoff_coefficient > 1.5 (typically 2.0) and maximum_interval to prevent retry storms. Add jitter if calling shared services to avoid thundering herd effects.
Store references, not large payloads. Event History isn’t designed for petabyte storage. For large LLM outputs or document content, store only S3 URLs or database IDs in workflow state. Fetch actual data in Activities.
Monitor Event History size. Workflows exceeding 1000–2000 events should implement Continue-as-new. High event counts degrade replay performance and risk hitting platform limits.
Set appropriate timeouts. start_to_close_timeout should exceed typical activity duration but fail stuck calls quickly. schedule_to_close_timeout should account for total retry time including backoff periods. Missing timeouts risk infinite hangs.
Make Activities idempotent. Temporal guarantees at-least-once execution. Activities may execute multiple times. Use idempotency keys for payments or other non-repeatable operations.
Use Task Queues for resource isolation. Deploy GPU workers on dedicated queues with max_concurrent_activities limits. This implements Bulkhead pattern, preventing resource exhaustion and controlling costs.
Conclusion
Temporal trades upfront complexity for long-term reliability. The determinism constraint and operational overhead (self-hosted requires Cassandra/Postgres, Elasticsearch for Visibility) make it overkill for simple batch jobs or stateless services.
For LLM pipelines with expensive API calls, multi-hour execution times, or critical fault tolerance requirements, the architecture works. Automatic state recovery, configurable retry policies, and Time-Travel Debugging fix entire classes of production failures.
Add Temporal when failure recovery and durable orchestration become blocking issues, not as a default architecture choice.