Running multiple Dagster agents

Each Dagster+ full deployment (e.g., prod) needs to have at least one agent running. A single agent is adequate for many use cases, but you may want to run multiple agents to provide redundancy if a single agent goes down.

When to use multiple agents

For redundancy and high availability
When you need to run agents in completely separate infrastructure environments or AWS accounts, with separate compute resources, volumes, and networks
To dedicate specific agents for branch deployments
To support multi-region failover

Considerations

Additional infrastructure management overhead
More complex configuration required for multiple agent setup
Need to configure agent queues for proper workload routing

Running multiple agents in the same environment

To run multiple agents in the same environment (e.g., multiple Kubernetes agents in the same namespace), you can set the number of replicas in the configuration for your particular agent type:

Docker
Kubernetes
Amazon ECS

In Docker

In Docker, you can set the number of replicas for a service in the docker-compose.yaml file if the deployment mode is set to replicated (which is the default):

services:
  dagster-cloud-agent:
    ...
    deploy:
      mode: replicated
      replicas: 2

In Kubernetes

In Kubernetes, the number of replicas is set in the Helm chart. You can set the number of replicas in the Helm command:

helm upgrade \
    ...
    --set dagsterCloudAgent.replicas=2

or if using a values.yaml file:

dagsterCloudAgent:
  ...
  replicas: 2

In Amazon ECS

In Amazon ECS, the number of replicas can be set via the CloudFormation template:

DagsterCloudAgent:
  Type: AWS::ECS::Service
  Properties:
    ...
    DesiredCount: 2

If using the CloudFormation template provided by Dagster, the number of replicas can be set via the NumReplicas parameter in the Amazon Web Services (AWS) UI.

Running multiple agents in different environments

To run multiple agents in an environment where each agent can not access the others' resources (for example, multiple Kubernetes namespaces or different clusters), enable the isolated_agents option. This is supported for all agent types.

Docker
Kubernetes
Amazon ECS

In Docker

Add the following to the dagster.yaml file:

# dagster.yaml
isolated_agents:
  enabled: true

dagster_cloud_api:
  # <your other config>
  agent_label: 'My agent' # optional

In Kubernetes

Add the following options to your Helm command:

helm upgrade \
    ...
    --set isolatedAgents.enabled=true \
    --set dagsterCloud.agentLabel="My agent" # optional, only supported on 0.13.14 and later

Or if you're using a values.yaml file:

isolatedAgents:
  enabled: true

dagsterCloud:
  agentLabel: 'My agent' # optional, only supported on 0.13.14 and later

In Amazon ECS

The isolated_agents option can be set as per-deployment configuration on the dagster.yaml file used by your agent. See the ECS configuration reference guide for more information.

Routing requests to specific agents

note

Agent queues are a Dagster+ Pro feature and require agents to use version 1.6.0 or greater.

Every Dagster+ agent serves requests from one or more queues. By default, requests for each code location are placed on a default queue and your agent will read requests only from that default queue.

In some cases, you might want to route requests for certain code locations to specific agents. For example, routing requests for one code location to an agent running in an on-premise data center, but then routing requests for all other code locations to an agent running in AWS.

To route requests for a code location to a specific agent, annotate the code locations with the name of a custom queue and configure an agent to serve only requests for that queue.

Step 1: Define an agent queue for the code location

First, set an agent queue for the code location in your pyproject.toml:

# pyproject.toml

[tool.dg.project]
root_module = "quickstart_etl"
agent_queue = "special-queue"

Step 2: Configure an agent to handle the agent queue

Next, configure an agent to handle your agent queue.

Docker
Kubernetes
Amazon ECS

In Docker

Add the following to your project's dagster.yaml file:

agent_queues:
  include_default_queue: True # Continue to handle requests for code locations that aren't annotated with a specific queue
  additional_queues:
    - special-queue

In Kubernetes

Add the following options to your Helm command:

helm upgrade \
    ...
    --set dagsterCloud.agentQueues.additionalQueues={"special-queue"}

Or if you're using a values.yaml file:

dagsterCloud:
  agentQueues:
    # Continue to handle requests for code locations that aren't
    # assigned to a specific agent queue
    includeDefaultQueue: true
    additionalQueues:
      - special-queue

In Amazon ECS

Modify your ECS Cloud Formation template to add the following configuration to the dagster.yaml file passed to the agent (the ECS agent configuration reference can be found here):

# dagster.yaml
agent_queues:
  # Continue to handle requests for code locations that aren't
  # assigned to a specific agent queue
  include_default_queue: true
  additional_queues:
    - special-queue

Multi-region failover

note

Multi-region failover uses agent queues, which are a Dagster+ Pro feature and require agents to use version 1.6.0 or greater.

Running agents in multiple geographic regions lets you fail over to a secondary region if your primary region becomes unavailable. The approach:

Deploy one agent per region, each with a region-specific label and dedicated queue
Assign code locations to the primary region's queue
When the primary region is unavailable, update the queue assignment to point to the secondary region

Failover only requires updating the queue assignment because the agent is stateless — it polls for work but holds no persistent state of its own. All run history, asset metadata, schedule and sensor state, and event logs live in the Dagster+ control plane. Switching which agent handles a code location's queue doesn't require any state migration.

Step 1: Deploy agents with region labels

Deploy an agent in each region with isolated_agents enabled and a descriptive agent_label so you can identify which agent is active in the Dagster+ UI.

Docker
Kubernetes
Amazon ECS

# dagster.yaml — us-east-1 agent
isolated_agents:
  enabled: true

dagster_cloud_api:
  # <your other config>
  agent_label: 'us-east-1'

# dagster.yaml — eu-west-1 agent
isolated_agents:
  enabled: true

dagster_cloud_api:
  # <your other config>
  agent_label: 'eu-west-1'

# values-us-east-1.yaml
isolatedAgents:
  enabled: true

dagsterCloud:
  agentLabel: 'us-east-1'

# values-eu-west-1.yaml
isolatedAgents:
  enabled: true

dagsterCloud:
  agentLabel: 'eu-west-1'

Set isolated_agents and agent_label in the dagster.yaml file for each regional agent. See the ECS configuration reference for details.

Step 2: Configure region-specific agent queues

Configure each agent to handle only its own region's queue:

Docker
Kubernetes
Amazon ECS

# dagster.yaml — us-east-1 agent
agent_queues:
  include_default_queue: false
  additional_queues:
    - us-east-1

# dagster.yaml — eu-west-1 agent
agent_queues:
  include_default_queue: false
  additional_queues:
    - eu-west-1

# values-us-east-1.yaml
dagsterCloud:
  agentQueues:
    includeDefaultQueue: false
    additionalQueues:
      - us-east-1

# values-eu-west-1.yaml
dagsterCloud:
  agentQueues:
    includeDefaultQueue: false
    additionalQueues:
      - eu-west-1

Add the queue configuration to the dagster.yaml for each regional agent:

# dagster.yaml — us-east-1 agent
agent_queues:
  include_default_queue: false
  additional_queues:
    - us-east-1

# dagster.yaml — eu-west-1 agent
agent_queues:
  include_default_queue: false
  additional_queues:
    - eu-west-1

Step 3: Assign code locations to the primary region

In each code location's pyproject.toml, set agent_queue to the primary region:

# pyproject.toml
[tool.dg.project]
root_module = "my_project"
agent_queue = "us-east-1"

Performing a failover

When the primary region (us-east-1) becomes unavailable, update agent_queue in each affected code location's pyproject.toml to the secondary region and redeploy:

# pyproject.toml
[tool.dg.project]
root_module = "my_project"
agent_queue = "eu-west-1"

Once the updated code location is deployed, the secondary region's agent handles all requests. No changes to the agent configuration are needed.

Optional: Reduce standby costs

By default, user code server TTL is disabled for full deployments, meaning code location servers run indefinitely even when idle. In a standby region that isn't handling any traffic, this means you're paying to keep servers warm that aren't doing any work.

To reduce costs, set a short server_ttl in the standby agent's configuration. Code location servers will shut down when idle and restart on demand when the agent starts receiving requests after a failover:

Docker
Kubernetes
Amazon ECS

# dagster.yaml — eu-west-1 standby agent
user_code_launcher:
  module: dagster_cloud.workspace.docker
  class: DockerUserCodeLauncher
  config:
    server_ttl:
      full_deployments:
        enabled: true
        ttl_seconds: 300 # shut down idle servers after 5 minutes

# values-eu-west-1.yaml — standby region
workspace:
  serverTTL:
    fullDeployments:
      enabled: true
      ttlSeconds: 300

# dagster.yaml — eu-west-1 standby agent
user_code_launcher:
  module: dagster_cloud.workspace.ecs
  class: EcsUserCodeLauncher
  config:
    server_ttl:
      full_deployments:
        enabled: true
        ttl_seconds: 300 # shut down idle servers after 5 minutes

The tradeoff is a brief startup delay (typically a few seconds) the first time a code location is accessed after failover, as the server initializes. For most use cases this is acceptable given the cost savings of not running idle servers in the standby region.

Optional: Active-active redundancy

To eliminate the need for a manual failover, configure each agent to listen on both regional queues. With this setup, both agents share the load from all code locations, and if one region becomes unavailable the remaining agent continues processing everything automatically.

Docker
Kubernetes
Amazon ECS

# dagster.yaml — us-east-1 agent
agent_queues:
  include_default_queue: false
  additional_queues:
    - us-east-1
    - eu-west-1

# dagster.yaml — eu-west-1 agent
agent_queues:
  include_default_queue: false
  additional_queues:
    - us-east-1
    - eu-west-1

# values-us-east-1.yaml
dagsterCloud:
  agentQueues:
    includeDefaultQueue: false
    additionalQueues:
      - us-east-1
      - eu-west-1

# values-eu-west-1.yaml
dagsterCloud:
  agentQueues:
    includeDefaultQueue: false
    additionalQueues:
      - us-east-1
      - eu-west-1

# dagster.yaml — us-east-1 agent
agent_queues:
  include_default_queue: false
  additional_queues:
    - us-east-1
    - eu-west-1

# dagster.yaml — eu-west-1 agent
agent_queues:
  include_default_queue: false
  additional_queues:
    - us-east-1
    - eu-west-1

The tradeoff is that both agents run at full capacity at all times. Work is still placed on the queue specified by each code location's agent_queue setting, but since both agents monitor all queues, either agent may process it.

When to use multiple agents​

Considerations​

Running multiple agents in the same environment​

In Docker​

In Kubernetes​

In Amazon ECS​

Running multiple agents in different environments​

In Docker​

In Kubernetes​

In Amazon ECS​

Routing requests to specific agents​

Step 1: Define an agent queue for the code location​

Step 2: Configure an agent to handle the agent queue​

In Docker​

In Kubernetes​

In Amazon ECS​

Multi-region failover​

Step 1: Deploy agents with region labels​

Step 2: Configure region-specific agent queues​

Step 3: Assign code locations to the primary region​

Performing a failover​

Optional: Reduce standby costs​

Optional: Active-active redundancy​

When to use multiple agents

Considerations

Running multiple agents in the same environment

In Docker

In Kubernetes

In Amazon ECS

Running multiple agents in different environments

In Docker

In Kubernetes

In Amazon ECS

Routing requests to specific agents

Step 1: Define an agent queue for the code location

Step 2: Configure an agent to handle the agent queue

In Docker

In Kubernetes

In Amazon ECS

Multi-region failover

Step 1: Deploy agents with region labels

Step 2: Configure region-specific agent queues

Step 3: Assign code locations to the primary region

Performing a failover

Optional: Reduce standby costs

Optional: Active-active redundancy