Distributed HA Atlantis Server #6147

marcportabellaclotet-mt · 2026-02-03T19:27:14Z

Exploring a Distributed / HA Atlantis Architecture (PoC)

Context

Over the weekend I experimented with Atlantis and built a small, opinionated fork to explore what a distributed, HA-friendly Atlantis could look like.

This fork is not a proposal to merge code, and it is not production-ready.
The goal is to validate feasibility, surface design challenges, and encourage discussion around a potential long-term direction for Atlantis.

Repository with details and code:
👉 https://github.com/totmicro/atlantis/tree/feat/atlantis-distributed

Problems this PoC tries to explore

This experiment is motivated by a few recurring challenges when running Atlantis at scale:

High availability
- Atlantis is typically deployed as a single logical executor
- HA setups mainly protect webhook reception, not job execution state
Job execution coupling
- Planning/applying Terraform is tightly coupled to the Atlantis server lifecycle
- Server restarts or crashes can interrupt long-running jobs
Horizontal scaling
- Scaling Atlantis replicas does not naturally distribute execution
- Concurrency is limited by a single process / workspace locking model
Kubernetes-native execution
- Terraform runs often benefit from:
  - Ephemeral execution environments
  - Per-job isolation
  - Explicit resource limits
- This is hard to express cleanly with the current architecture
Queueing and backpressure
- No clear global view of:
  - Job prioritization
  - Retries
  - Saturation / circuit breaking
Security
- Atlantis agents can follow the principle of least privilege
Network isolation
- Atlantis agents can run inside isolated networks to reach private networks

Proposed idea (PoC only)

The fork separates Atlantis into control-plane and execution-plane responsibilities.

Control plane: Atlantis Masters

Stateless, HA-deployed
Responsibilities:
- Receive VCS webhooks
- Parse repo config
- Persist job state in PostgreSQL
- Schedule jobs via a central queue
- Assign jobs to available agents via gRPC
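To make "schedule jobs via a central queue" a bit more concrete, here is a minimal in-memory sketch of a priority queue built on Go's `container/heap`. The `Job` fields, the `Queue` type, and the FIFO tie-breaking rule are assumptions for illustration and are not taken from the fork, where the queue state would be persisted in PostgreSQL rather than held in memory.

```go
package main

import (
	"container/heap"
	"fmt"
)

// Job is a hypothetical unit of work (one plan or apply).
// Field names are illustrative, not the fork's actual schema.
type Job struct {
	ID       string
	Priority int   // higher values are dequeued first
	Seq      int64 // enqueue order, used to break ties FIFO
}

// jobHeap implements heap.Interface, ordered by priority, then enqueue order.
type jobHeap []*Job

func (h jobHeap) Len() int { return len(h) }
func (h jobHeap) Less(i, j int) bool {
	if h[i].Priority != h[j].Priority {
		return h[i].Priority > h[j].Priority
	}
	return h[i].Seq < h[j].Seq
}
func (h jobHeap) Swap(i, j int) { h[i], h[j] = h[j], h[i] }
func (h *jobHeap) Push(x any)   { *h = append(*h, x.(*Job)) }
func (h *jobHeap) Pop() any {
	old := *h
	n := len(old)
	j := old[n-1]
	*h = old[:n-1]
	return j
}

// Queue is an in-memory stand-in for the central queue.
type Queue struct {
	h   jobHeap
	seq int64
}

// Enqueue adds a job with the given priority.
func (q *Queue) Enqueue(id string, priority int) {
	q.seq++
	heap.Push(&q.h, &Job{ID: id, Priority: priority, Seq: q.seq})
}

// Dequeue returns the highest-priority job, or false if the queue is empty.
func (q *Queue) Dequeue() (*Job, bool) {
	if q.h.Len() == 0 {
		return nil, false
	}
	return heap.Pop(&q.h).(*Job), true
}

func main() {
	q := &Queue{}
	q.Enqueue("plan-1", 1)
	q.Enqueue("apply-1", 5)
	q.Enqueue("plan-2", 1)
	for {
		j, ok := q.Dequeue()
		if !ok {
			break
		}
		fmt.Println(j.ID) // apply-1, then plan-1, then plan-2
	}
}
```

Breaking ties by enqueue order keeps jobs of equal priority from overtaking each other, which matters if plans for the same pull request should run in the order they were requested.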

Execution plane: Agents

One or more agents connected via long-lived gRPC streams
Agents execute Terraform jobs and report results
Two execution models explored:
- Long-running agents executing Terraform directly
- Static agents spawning ephemeral Kubernetes Jobs per execution
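One way to picture the two execution models is as two implementations of the same interface, so the agent loop does not care how a job is actually run. The sketch below is an assumption about shape only: `Executor`, `DirectExecutor`, `JobSpawningExecutor`, and `RunAgent` are hypothetical names, the executors are stubs that do not call Terraform or Kubernetes, and a plain Go channel stands in for the long-lived gRPC stream.

```go
package main

import "fmt"

// Job and Result are hypothetical message shapes, not the fork's protocol.
type Job struct{ ID, Command string }

type Result struct {
	JobID   string
	Mode    string
	Success bool
}

// Executor hides how a job is run from the agent loop.
type Executor interface {
	Execute(job Job) Result
}

// DirectExecutor stands in for a long-running agent that runs Terraform itself.
// This stub only labels the result; a real one would invoke terraform.
type DirectExecutor struct{}

func (DirectExecutor) Execute(job Job) Result {
	return Result{JobID: job.ID, Mode: "direct", Success: true}
}

// JobSpawningExecutor stands in for a static agent that creates one
// Kubernetes Job per execution. This stub does not talk to a cluster.
type JobSpawningExecutor struct{ Namespace string }

func (e JobSpawningExecutor) Execute(job Job) Result {
	return Result{JobID: job.ID, Mode: "k8s-job/" + e.Namespace, Success: true}
}

// RunAgent drains the jobs channel (a stand-in for the gRPC stream)
// and reports one result per job until the channel is closed.
func RunAgent(jobs <-chan Job, results chan<- Result, exec Executor) {
	for job := range jobs {
		results <- exec.Execute(job)
	}
}

func main() {
	jobs := make(chan Job, 2)
	results := make(chan Result, 2)
	jobs <- Job{ID: "1", Command: "plan"}
	jobs <- Job{ID: "2", Command: "apply"}
	close(jobs)

	RunAgent(jobs, results, JobSpawningExecutor{Namespace: "atlantis"})
	close(results)
	for r := range results {
		fmt.Println(r.JobID, r.Mode, r.Success)
	}
}
```

Swapping `JobSpawningExecutor` for `DirectExecutor{}` changes the execution model without touching the loop.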

Simplified architecture

flowchart TB
    subgraph Master["Atlantis Master (HA / Stateless)"]
        Scheduler["Enhanced Scheduler
        - Priority Queue
        - Retry Manager
        - Circuit Breaker
        - Router"]

        AgentManager["Agent Manager
        (gRPC Orchestrator)"]

        Scheduler --> AgentManager
    end

    AgentManager -->|gRPC stream| Agent1["Agent 1
    Terraform Executor"]
    AgentManager -->|gRPC stream| Agent2["Agent 2
    Terraform Executor"]
    AgentManager -->|gRPC stream| AgentN["Agent N
    Terraform Executor"]
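The scheduler box lists a circuit breaker. As a rough illustration of what that could mean per agent, here is a minimal consecutive-failure breaker. Everything in it is an assumption: the `Breaker` type, the threshold and cooldown parameters, and the injectable clock are hypothetical and may not match how the fork implements it.

```go
package main

import (
	"fmt"
	"time"
)

// Breaker is a minimal consecutive-failure circuit breaker (hypothetical).
type Breaker struct {
	threshold int              // failures in a row that open the breaker
	cooldown  time.Duration    // how long to reject before probing again
	now       func() time.Time // injectable clock, so tests need no sleeping
	failures  int
	open      bool
	openedAt  time.Time
}

func NewBreaker(threshold int, cooldown time.Duration, now func() time.Time) *Breaker {
	return &Breaker{threshold: threshold, cooldown: cooldown, now: now}
}

// Allow reports whether a job may be dispatched. Once the cooldown has
// elapsed, an open breaker lets calls through again as probes.
func (b *Breaker) Allow() bool {
	if !b.open {
		return true
	}
	return b.now().Sub(b.openedAt) >= b.cooldown
}

// Record updates the breaker with the outcome of a dispatched job.
// A success closes it; a failure at or past the threshold (re)opens it.
func (b *Breaker) Record(success bool) {
	if success {
		b.failures = 0
		b.open = false
		return
	}
	b.failures++
	if b.failures >= b.threshold {
		b.open = true
		b.openedAt = b.now()
	}
}

// manualClock is a hand-advanced clock for the demo below.
type manualClock struct{ t time.Time }

func (c *manualClock) Now() time.Time          { return c.t }
func (c *manualClock) Advance(d time.Duration) { c.t = c.t.Add(d) }

func main() {
	clock := &manualClock{}
	b := NewBreaker(2, 30*time.Second, clock.Now)

	b.Record(false)
	b.Record(false)
	fmt.Println(b.Allow()) // false: two failures in a row opened the breaker

	clock.Advance(30 * time.Second)
	fmt.Println(b.Allow()) // true: cooldown elapsed, a probe is allowed

	b.Record(true)
	fmt.Println(b.Allow()) // true: the probe succeeded, breaker is closed
}
```

This version lets any number of probes through once the cooldown has passed; a stricter half-open state would admit a single probe at a time.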

Kubernetes-based execution mode (optional)

flowchart LR
    Master["Atlantis Master
    (Server Mode)"]
    StaticAgent["Static Agent
    (Long-running)"]

    Master <-->|gRPC stream
    Job assignments| StaticAgent

    StaticAgent -->|"Spawns K8s Job"| Pod["Ephemeral Pod
    (agent-executor)

    1. GetJob(ID)
    2. Execute Terraform
    3. ReportResult
    4. Exit"]
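The four steps in the ephemeral pod (GetJob, execute, ReportResult, exit) can be sketched as a single function that returns a process exit code. The `MasterClient` interface, the `RunOnce` function, the exit code convention, and the in-memory `fakeMaster` are all assumptions for illustration; the real executor would talk to the master over gRPC and shell out to Terraform.

```go
package main

import (
	"errors"
	"fmt"
)

// Job and Result are hypothetical message shapes.
type Job struct{ ID, Command string }

type Result struct {
	JobID   string
	Success bool
	Output  string
}

// MasterClient is a hypothetical view of the master API seen by the pod.
type MasterClient interface {
	GetJob(id string) (Job, error)
	ReportResult(r Result) error
}

// RunOnce performs the pod lifecycle and returns an exit code:
// 0 = job succeeded, 1 = job failed, 2 = could not talk to the master.
func RunOnce(c MasterClient, jobID string, run func(Job) (string, error)) int {
	job, err := c.GetJob(jobID) // 1. GetJob(ID)
	if err != nil {
		return 2
	}
	out, runErr := run(job) // 2. Execute Terraform (injected here)
	res := Result{JobID: job.ID, Success: runErr == nil, Output: out}
	if err := c.ReportResult(res); err != nil { // 3. ReportResult
		return 2
	}
	if runErr != nil {
		return 1
	}
	return 0 // 4. Exit: the caller would pass this to os.Exit
}

// fakeMaster is an in-memory MasterClient for the demo.
type fakeMaster struct {
	jobs     map[string]Job
	reported []Result
}

func (f *fakeMaster) GetJob(id string) (Job, error) {
	j, ok := f.jobs[id]
	if !ok {
		return Job{}, errors.New("job not found: " + id)
	}
	return j, nil
}

func (f *fakeMaster) ReportResult(r Result) error {
	f.reported = append(f.reported, r)
	return nil
}

func main() {
	m := &fakeMaster{jobs: map[string]Job{"42": {ID: "42", Command: "plan"}}}
	code := RunOnce(m, "42", func(j Job) (string, error) {
		return "ran " + j.Command, nil // stand-in for invoking terraform
	})
	fmt.Println(code, m.reported[0].Output)
}
```

Reporting the result even when the run fails means the master learns the outcome from the pod itself rather than only from the Kubernetes Job status.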

Non-goals

This PoC intentionally does not aim to:

Be production-ready
Preserve full backward compatibility
Define a final API or protocol
Replace the current Atlantis architecture
Be reviewed as mergeable code

Why share this?

To make the discussion more concrete than a theoretical proposal
To explore whether a distributed Atlantis is:
- Reasonable
- Overkill
- Or somewhere in between
To gather feedback from maintainers and users running Atlantis at scale

Happy to clarify design choices, explain trade-offs, or walk through the code paths.
Even if this ends up being the wrong direction, I hope it helps frame future discussions around HA and scalability for Atlantis.

Thanks for reading 🙏

Screenshots:
