Distributed HA Atlantis Server #6147
marcportabellaclotet-mt
started this conversation in
Ideas
Exploring a Distributed / HA Atlantis Architecture (PoC)
Context
Over the weekend I experimented with Atlantis and built a small, opinionated fork to explore what a distributed, HA-friendly Atlantis could look like.
This fork is not a proposal to merge code, and it is not production-ready.
The goal is to validate feasibility, surface design challenges, and encourage discussion around a potential long-term direction for Atlantis.
Repository with details and code:
👉 https://github.com/totmicro/atlantis/tree/feat/atlantis-distributed
Problems this PoC tries to explore
This experiment is motivated by a few recurring challenges when running Atlantis at scale:
High availability
Job execution coupling
Horizontal scaling
Kubernetes-native execution
Terraform runs often benefit from:
This is hard to express cleanly with the current architecture
Queueing and backpressure
No clear global view of:
Security
Network isolation
Proposed idea (PoC only)
The fork separates Atlantis into control-plane and execution-plane responsibilities.
Control plane: Atlantis Masters
Stateless, HA-deployed
Responsibilities:
Execution plane: Agents
One or more agents connected via long-lived gRPC streams
Agents execute Terraform jobs and report results
Two execution models explored:
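To make the master/agent split a bit more concrete, here is a minimal in-memory sketch of a scheduler that queues jobs by priority, refuses work when the queue is full, and routes each job to the least-loaded agent. It is not code from the fork: the `Scheduler` class, its method names, and the "lower number means higher priority" rule are all assumptions made for illustration, and the gRPC streams are replaced by plain function calls.

```python
import heapq
import itertools
from dataclasses import dataclass, field


@dataclass(order=True)
class Job:
    # Lower number = more urgent; seq keeps FIFO order within one priority.
    priority: int
    seq: int
    name: str = field(compare=False)


class Scheduler:
    """Hypothetical in-memory stand-in for the master's scheduler."""

    def __init__(self, agents, max_queue=100):
        self._heap = []
        self._seq = itertools.count()
        self._load = {agent: 0 for agent in agents}  # running jobs per agent
        self._max_queue = max_queue

    def submit(self, name, priority):
        # Backpressure: reject new work instead of growing without bound.
        if len(self._heap) >= self._max_queue:
            return False
        heapq.heappush(self._heap, Job(priority, next(self._seq), name))
        return True

    def dispatch(self):
        """Pop the most urgent job and route it to the least-loaded agent."""
        if not self._heap:
            return None
        job = heapq.heappop(self._heap)
        agent = min(self._load, key=self._load.get)
        self._load[agent] += 1
        return agent, job.name

    def report_result(self, agent):
        # Called when an agent reports that one of its jobs has finished.
        self._load[agent] -= 1


if __name__ == "__main__":
    sched = Scheduler(["agent-1", "agent-2"], max_queue=10)
    sched.submit("plan repo-a", priority=5)
    sched.submit("apply repo-b", priority=1)
    print(sched.dispatch())  # the apply job is dispatched first
    print(sched.dispatch())
```

A real implementation would also need to handle agent disconnects and requeue in-flight jobs, which this sketch leaves out.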
Simplified architecture
flowchart TB
  subgraph Master["Atlantis Master (HA / Stateless)"]
    Scheduler["Enhanced Scheduler - Priority Queue - Retry Manager - Circuit Breaker - Router"]
    AgentManager["Agent Manager (gRPC Orchestrator)"]
    Scheduler --> AgentManager
  end
  AgentManager -->|gRPC stream| Agent1["Agent 1 Terraform Executor"]
  AgentManager -->|gRPC stream| Agent2["Agent 2 Terraform Executor"]
  AgentManager -->|gRPC stream| AgentN["Agent N Terraform Executor"]
Kubernetes-based execution mode (optional)
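The diagram for this mode gives the ephemeral pod four steps: GetJob(ID), execute Terraform, ReportResult, exit. A rough sketch of that one-shot lifecycle follows. Everything named here (`FakeJobAPI`, `run_executor`, the `run` callable standing in for the Terraform invocation) is hypothetical and not taken from the fork, and which component actually serves the GetJob and ReportResult calls is an assumption left open.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class Job:
    job_id: str
    command: List[str]  # e.g. ["terraform", "plan"]


class FakeJobAPI:
    """In-memory stand-in for whatever serves GetJob/ReportResult (hypothetical)."""

    def __init__(self, jobs: Dict[str, Job]):
        self._jobs = jobs
        self.results: Dict[str, Tuple[int, str]] = {}

    def get_job(self, job_id: str) -> Job:
        return self._jobs[job_id]

    def report_result(self, job_id: str, exit_code: int, output: str) -> None:
        self.results[job_id] = (exit_code, output)


def run_executor(
    api: FakeJobAPI,
    job_id: str,
    run: Callable[[List[str]], Tuple[int, str]],
) -> int:
    """One-shot executor: fetch the job, run it, report, return the exit code."""
    job = api.get_job(job_id)  # 1. GetJob(ID)
    try:
        exit_code, output = run(job.command)  # 2. Execute Terraform
    except Exception as exc:
        # Report failures as well, so a crashed run is not silently lost.
        exit_code, output = 1, f"executor error: {exc}"
    api.report_result(job.job_id, exit_code, output)  # 3. ReportResult
    return exit_code  # 4. Exit: the caller would pass this to sys.exit()


if __name__ == "__main__":
    api = FakeJobAPI({"job-1": Job("job-1", ["terraform", "plan"])})
    # The lambda stands in for actually invoking the Terraform binary.
    code = run_executor(api, "job-1", lambda cmd: (0, "ran: " + " ".join(cmd)))
    print(code, api.results["job-1"])
```

Injecting `run` keeps the sketch runnable without Terraform installed; in a pod it could wrap `subprocess.run`.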
flowchart LR
  Master["Atlantis Master (Server Mode)"]
  StaticAgent["Static Agent (Long-running)"]
  Master <-->|gRPC stream Job assignments| StaticAgent
  StaticAgent -->|"Spawns K8s Job"| Pod["Ephemeral Pod (agent-executor) 1. GetJob(ID) 2. Execute Terraform 3. ReportResult 4. Exit"]
Non-goals
This PoC intentionally does not aim to:
Why share this?
To make the discussion more concrete than a theoretical proposal
To explore whether a distributed Atlantis is:
To gather feedback from maintainers and users running Atlantis at scale
Happy to clarify design choices, explain trade-offs, or walk through the code paths.
Even if this ends up being the wrong direction, I hope it helps frame future discussions around HA and scalability for Atlantis.
Thanks for reading 🙏