Odoo High Availability with Patroni and HAProxy: HA architecture

How to design a real HA architecture for Odoo with Patroni, PostgreSQL streaming replication, HAProxy and shared filestore: RTO, RPO and failover testing included.

Why High Availability is critical in production Odoo

A production Odoo system managing orders, invoices, payroll or logistics cannot afford unexpected downtime. Every minute of inactivity has a direct cost: orders not recorded, operators idle waiting for delivery notes, customers calling. In mid-sized deployments (50-200 concurrent users) a 15-minute outage can mean losses of between 2,000 and 20,000 EUR, depending on the sector.

Yet the vast majority of Odoo deployments in Spain run on a single server with a single PostgreSQL instance. If that server fails -- disk, kernel panic, data corruption, a package update that breaks something -- there is no plan B. This guide describes the high-availability architecture we have implemented in production, using open-source components, controlled cost and automatic failover in under 30 seconds.

Key concepts: RTO, RPO and failure modes

Before designing the architecture it is useful to clarify two parameters that define the SLA:

RTO (Recovery Time Objective): maximum tolerable time from the moment a failure occurs until the service is back online. In the architecture described here, the target RTO is < 30 seconds for database-node failures.
RPO (Recovery Point Objective): maximum acceptable data loss. With synchronous replication, the RPO is 0 confirmed transactions. With asynchronous replication (more common to avoid latency impact), the RPO can be 1-5 seconds.

The failure modes this architecture covers are: primary PostgreSQL node failure, Odoo application node failure, data corruption on a single node, planned maintenance with zero downtime, and load saturation.

Reference architecture: overview

The architecture consists of four independent layers working together:

         ┌──────────────────────────────────────────┐
         │         CLIENTS (browsers, apps)         │
         └────────────────┬─────────────────────────┘
                          │
         ┌────────────────▼─────────────────────────┐
         │         HAProxy (active/passive)         │
         │   :80/:443 → Odoo workers  :5432 → PG    │
         │   stats: :8404                           │
         └───────────┬──────────────┬───────────────┘
                     │              │
       ┌─────────────▼───┐    ┌─────▼───────────────┐
       │  Odoo Worker 1  │    │  Odoo Worker 2      │
       │  (active)       │    │  (active)           │
       │  8069/8072      │    │  8069/8072          │
       └──────┬──────────┘    └──────────────┬──────┘
              │  Shared filestore (NFS/S3)   │
              └──────────────┬───────────────┘
                             │
         ┌───────────────────▼──────────────────────┐
         │            PostgreSQL HA layer           │
         │                                          │
         │   ┌─────────┐  ┌─────────┐  ┌─────────┐  │
         │   │ PG Prim │  │ PG Rep1 │  │ PG Rep2 │  │
         │   │ (R/W)   │─▶│(standby)│  │(standby)│  │
         │   └─────────┘  └─────────┘  └─────────┘  │
         │        ▲            ▲            ▲       │
         │        └────────────┴────────────┘       │
         │              Patroni + etcd              │
         └──────────────────────────────────────────┘

Layer 1: PostgreSQL HA with Patroni and etcd

Patroni is the de-facto standard for managing PostgreSQL clusters in high availability. It handles leader election, automatic promotion of a standby to primary and controlled restart of failed nodes. It works alongside a distributed consensus system (etcd, Consul or ZooKeeper) to avoid the split-brain problem.

Recommended topology

3 PostgreSQL nodes: 1 primary (R/W) + 2 streaming replicas (read-only)
3 etcd nodes (can co-exist on the same hosts as PostgreSQL in mid-sized environments)
Asynchronous replication by default; synchronous optional for RPO=0 at the cost of write latency

Patroni configuration file (patroni.yml)

scope: odoo-cluster
namespace: /service/
name: pg-node-1

restapi:
  listen: 0.0.0.0:8008
  connect_address: 10.0.1.11:8008

etcd3:
  hosts:
    - 10.0.1.11:2379
    - 10.0.1.12:2379
    - 10.0.1.13:2379

bootstrap:
  dcs:
    ttl: 30
    loop_wait: 10
    retry_timeout: 10
    maximum_lag_on_failover: 1048576   # maximum lag (1 MB) for a failover candidate
    postgresql:
      use_pg_rewind: true
      use_slots: true
      parameters:
        wal_level: replica
        hot_standby: "on"
        max_wal_senders: 10
        max_replication_slots: 10
        wal_log_hints: "on"
        archive_mode: "on"
        archive_command: "cp %p /var/lib/postgresql/wal_archive/%f"
        synchronous_commit: "off"   # "off" also skips waiting for the local WAL flush; for RPO=0 set "on" and enable Patroni synchronous_mode
  initdb:
    - encoding: UTF8
    - data-checksums

postgresql:
  listen: 0.0.0.0:5432
  connect_address: 10.0.1.11:5432
  data_dir: /var/lib/postgresql/16/main
  bin_dir: /usr/lib/postgresql/16/bin
  pgpass: /tmp/pgpass0
  authentication:
    replication:
      username: replicator
      password: <REPLICATION_PASSWORD>
    superuser:
      username: postgres
      password: <SUPERUSER_PASSWORD>
  parameters:
    unix_socket_directories: "."
    shared_buffers: "2GB"
    effective_cache_size: "6GB"
    maintenance_work_mem: "512MB"
    work_mem: "64MB"
    max_connections: 200
    log_min_duration_statement: 1000   # log queries slower than 1 s

tags:
  nofailover: false
  noloadbalance: false
  clonefrom: false
  nosync: false

Repeat this file on each node, adjusting name and both connect_address values (restapi and postgresql) for pg-node-2 and pg-node-3. The Patroni service is managed with systemd:

# /etc/systemd/system/patroni.service
[Unit]
Description=Patroni PostgreSQL HA
After=network.target

[Service]
Type=simple
User=postgres
ExecStart=/usr/local/bin/patroni /etc/patroni/patroni.yml
Restart=on-failure
RestartSec=5

[Install]
WantedBy=multi-user.target
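
The per-node differences in patroni.yml are small enough to generate rather than edit by hand. The following is a minimal Python sketch, assuming the three node names and 10.0.1.x addresses used in this guide; the helper node_overrides is hypothetical and not part of Patroni.

```python
# Hypothetical helper: list the patroni.yml values that differ per node.
# Node names and IP addresses are the ones used in this guide.
NODES = {
    "pg-node-1": "10.0.1.11",
    "pg-node-2": "10.0.1.12",
    "pg-node-3": "10.0.1.13",
}


def node_overrides(name: str, ip: str) -> dict:
    """Return the settings that must change on each node."""
    return {
        "name": name,
        "restapi.connect_address": f"{ip}:8008",
        "postgresql.connect_address": f"{ip}:5432",
    }


if __name__ == "__main__":
    for name, ip in NODES.items():
        print(node_overrides(name, ip))
```

Everything else in the file (scope, etcd3 hosts, bootstrap section) stays identical across the three nodes.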

Layer 2: HAProxy as load balancer and traffic switcher

HAProxy serves two purposes: balancing HTTP traffic across the Odoo application nodes, and routing PostgreSQL connections to the current primary node (using Patroni health checks via the REST API on port 8008).

haproxy.cfg file

global
    log /dev/log local0
    log /dev/log local1 notice
    maxconn 4096
    user haproxy
    group haproxy
    daemon

defaults
    log global
    mode http
    option httplog
    option dontlognull
    retries 3
    timeout connect 5s
    timeout client 30s
    timeout server 60s

#
# FRONTEND Odoo HTTP (workers)
#
frontend odoo_http
    bind *:80
    bind *:443 ssl crt /etc/ssl/certs/skanndar.pem
    http-request redirect scheme https unless { ssl_fc }
    default_backend odoo_workers

backend odoo_workers
    balance roundrobin
    option httpchk GET /web/health
    http-check expect status 200
    server odoo1 10.0.1.21:8069 check inter 5s fall 3 rise 2
    server odoo2 10.0.1.22:8069 check inter 5s fall 3 rise 2

#
# LONGPOLLING / GEVENT (port 8072)
#
frontend odoo_longpoll
    bind *:8072
    default_backend odoo_longpoll_backend

backend odoo_longpoll_backend
    balance source
    option httpchk GET /web/health
    http-check expect status 200
    timeout tunnel 3600s
    server odoo1 10.0.1.21:8072 check inter 5s fall 3 rise 2
    server odoo2 10.0.1.22:8072 check inter 5s fall 3 rise 2

#
# FRONTEND PostgreSQL (TCP mode)
# Patroni answers 200 on /master (port 8008) only when the node is the primary
#
frontend pg_primary_frontend
    mode tcp
    bind *:5432
    default_backend pg_primary

backend pg_primary
    mode tcp
    option httpchk GET /master
    http-check expect status 200
    default-server inter 3s fastinter 1s fall 3 rise 2 on-marked-down shutdown-sessions
    server pg1 10.0.1.11:5432 check port 8008
    server pg2 10.0.1.12:5432 check port 8008
    server pg3 10.0.1.13:5432 check port 8008

#
# FRONTEND PostgreSQL replicas (read-only)
#
frontend pg_replica_frontend
    mode tcp
    bind *:5433
    default_backend pg_replicas

backend pg_replicas
    mode tcp
    option httpchk GET /replica
    http-check expect status 200
    default-server inter 3s fall 3 rise 2
    server pg1 10.0.1.11:5432 check port 8008
    server pg2 10.0.1.12:5432 check port 8008
    server pg3 10.0.1.13:5432 check port 8008

#
# STATS
#
listen stats
    bind *:8404
    stats enable
    stats uri /stats
    stats refresh 5s
    stats auth admin:<STATS_PASSWORD>

The key is the PostgreSQL health check: HAProxy calls GET /master on Patroni's REST port (8008). Only the primary node responds 200; standbys respond 503. This guarantees that write traffic always reaches the current primary, even after a failover.
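
The same check can be reproduced outside HAProxy, which is handy when debugging routing. This is a minimal sketch using only the Python standard library; it assumes the node IPs of this guide and the 200/503 behaviour of /master described above. The function names probe and primaries are illustrative.

```python
import urllib.error
import urllib.request

PATRONI_NODES = ["10.0.1.11", "10.0.1.12", "10.0.1.13"]  # IPs from this guide


def probe(ip: str, path: str = "/master", timeout: float = 2.0) -> int:
    """Return the HTTP status Patroni answers with, or 0 if unreachable."""
    url = f"http://{ip}:8008{path}"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as exc:
        return exc.code  # for example 503 from a standby
    except OSError:
        return 0  # connection refused, timeout, DNS failure...


def primaries(statuses: dict) -> list:
    """Nodes that would be eligible for write traffic (status 200)."""
    return [ip for ip, status in statuses.items() if status == 200]


if __name__ == "__main__":
    statuses = {ip: probe(ip) for ip in PATRONI_NODES}
    print(statuses, "->", primaries(statuses))
```

In a healthy cluster the list contains exactly one address. An empty list is expected for a few seconds during an election; more than one entry would point to a split-brain situation.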

Layer 3: Odoo workers and session configuration

Odoo in multi-worker mode launches independent processes that must be able to run on any application node. This requires two things:

Shared filestore

Odoo attachments, images and documents are stored in the filestore (by default ~/.local/share/Odoo/filestore/). In a multi-node cluster, this directory must be shared. The options are:

NFS mounted on both nodes: simple, low latency on LAN. Use NFSv4 with locks enabled.
S3 / S3-compatible (MinIO): preferred option for cloud environments. The OCA module base_attachment_s3 allows using S3 as an attachment backend transparently.
GlusterFS: distributed alternative with no single point of failure for the filestore itself.

User sessions

By default Odoo stores sessions as local files (in the sessions/ subdirectory of its data directory). In a multi-node cluster, users lose their session whenever an HTTP request reaches a node that does not have their session file. The solution is to store sessions in Redis or in PostgreSQL (OCA module session_db):

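# odoo.conf (application node)
[options]
workers = 8
max_cron_threads = 2
# db_host is the HAProxy VIP, which routes to the PG primary.
# Keep comments on their own line: Odoo's config parser does not strip inline comments.
db_host = 10.0.1.1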
db_port = 5432
db_user = odoo
db_password = <DB_PASSWORD>
dbfilter = ^odoo_prod$
http_port = 8069
gevent_port = 8072
proxy_mode = True
limit_memory_hard = 2684354560
limit_memory_soft = 2147483648
limit_request = 8192
limit_time_cpu = 120
limit_time_real = 240
log_level = warn
logfile = /var/log/odoo/odoo.log
# Sessions in Redis (requires OCA module web_session_redis)
# redis_url = redis://10.0.1.30:6379/0

Longpolling and gevent

Odoo's real-time chat and notification module uses gevent on port 8072. In an HA environment, longpolling connections for a given session must always go to the same node (sticky sessions by source IP). The balance source directive in HAProxy's longpolling backend ensures this. In Nginx, if placed in front of HAProxy, use ip_hash for requests to /longpolling/.
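
To illustrate why source-IP balancing keeps a client on one node, here is a minimal Python sketch of deterministic hashing. The hash used is purely illustrative and is not HAProxy's actual algorithm; the server names are the ones from the longpolling backend.

```python
import hashlib

BACKENDS = ["odoo1", "odoo2"]  # server names from the longpolling backend


def pick_backend(client_ip: str, backends=BACKENDS) -> str:
    """Map a client IP to a backend deterministically.

    Illustrative hash only; HAProxy uses its own hashing for balance source.
    """
    digest = hashlib.sha256(client_ip.encode()).digest()
    index = int.from_bytes(digest[:4], "big") % len(backends)
    return backends[index]


if __name__ == "__main__":
    for ip in ("192.0.2.10", "192.0.2.11", "192.0.2.10"):
        print(ip, "->", pick_backend(ip))
```

The same address always maps to the same server as long as the backend list does not change. One consequence worth keeping in mind is that all users behind a single NAT address land on the same node.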

Automatic failover: what happens step by step

When the primary PostgreSQL node fails, the sequence is as follows:

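Patroni on the primary stops renewing its leader key in etcd. After a hard crash the standbys notice when the key expires (ttl, 30 s in the configuration above, checked every loop_wait of 10 s); on a clean shutdown the key is released immediately. Lower ttl, loop_wait and retry_timeout together if hard crashes must also meet the < 30 s target.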
The remaining nodes initiate a leader election through etcd. The candidate with the most recent WAL and no excessive lag wins.
The elected node runs pg_ctl promote: the standby becomes the primary. This process takes between 5 and 15 seconds.
Patroni updates the DCS key to reflect the new leader.
HAProxy detects on the next health check cycle (every 3 s with fastinter 1s after a failure) that /master now returns 200 on the new primary and 503 on the others.
Write traffic is automatically redirected. Active connections in Odoo are interrupted with a database connection error, but Odoo retries automatically on reconnect.

Total failover time in production: 15-30 seconds. Users will see a 500 error during that interval, but will not lose confirmed data.
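
External integrations and scripts that reach the database or Odoo through HAProxy need to ride out that window themselves. Below is a minimal sketch of a retry loop with exponential backoff. The helper retry is hypothetical; in real code retry_on would be the driver's own exception (for example psycopg2.OperationalError), and only idempotent operations should be retried.

```python
import time


def retry(operation, attempts=6, first_delay=1.0, factor=2.0,
          retry_on=(ConnectionError,), sleep=time.sleep):
    """Call operation() until it succeeds or the attempts are exhausted.

    With the defaults the waits are 1, 2, 4, 8 and 16 s (31 s in total),
    enough to cover a 15-30 s failover.
    """
    delay = first_delay
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except retry_on:
            if attempt == attempts:
                raise  # give up and surface the last error
            sleep(delay)
            delay *= factor


if __name__ == "__main__":
    state = {"calls": 0}

    def flaky():
        # Simulated operation: fails twice, then succeeds.
        state["calls"] += 1
        if state["calls"] < 3:
            raise ConnectionError("database unavailable")
        return "ok"

    print(retry(flaky, first_delay=0.1))
```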

Failover testing: how to validate the architecture

An untested HA architecture is not HA. These are the minimum tests that must be run before going to production and periodically (chaos engineering):

Test 1: Primary node failure

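# On pg-node-1 (the primary): clean stop, Patroni releases the leader key
sudo systemctl stop patroni
# Or simulate a crash by killing the postmaster (its PID is the first line of postmaster.pid).
# If Patroni itself keeps running, it will first try to restart PostgreSQL locally.
sudo kill -9 "$(sudo head -n 1 /var/lib/postgresql/16/main/postmaster.pid)"

# Watch the election in real time from a surviving node:
watch -n 1 patronictl -c /etc/patroni/patroni.yml topology
# Expected result: pg-node-2 or pg-node-3 becomes Leader in < 30 s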

Test 2: Verify that HAProxy redirects traffic

# From an external machine with psql installed (password via ~/.pgpass or PGPASSWORD):
watch -n 1 "psql -h 10.0.1.1 -p 5432 -U odoo -d odoo_prod -c 'SELECT pg_is_in_recovery(), inet_server_addr();'"
# Expected result: pg_is_in_recovery = f (the node is the primary)
# After the failover the IP must change while the result stays f
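
To turn this observation into a number that can be compared with the RTO target, the probe results can be recorded as (timestamp, ok) pairs and reduced to the longest outage. This is a minimal sketch; longest_outage is a hypothetical helper and the sample data is invented for illustration.

```python
def longest_outage(samples):
    """Return the longest outage in seconds.

    samples: iterable of (timestamp_seconds, ok) pairs ordered by time.
    An outage runs from the first failed probe to the next successful one.
    An outage still open at the end of the data is not counted.
    """
    longest = 0.0
    down_since = None
    for timestamp, ok in samples:
        if not ok and down_since is None:
            down_since = timestamp
        elif ok and down_since is not None:
            longest = max(longest, timestamp - down_since)
            down_since = None
    return longest


if __name__ == "__main__":
    # Invented data: probes fail from t=1 and succeed again at t=19.
    samples = [(0, True), (1, False), (2, False), (19, True), (20, True)]
    outage = longest_outage(samples)
    print(f"longest outage: {outage} s, within 30 s RTO: {outage < 30}")
```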

Test 3: Odoo application node failure

# Stop Odoo on one of the nodes
ssh odoo-node-1 "sudo systemctl stop odoo"
# HAProxy must stop sending traffic to that node in < 15 s (3 checks x 5 s)
# Check the HAProxy stats page: http://10.0.1.1:8404/stats
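
The 15 s figure follows directly from the inter and fall values in haproxy.cfg. A small Python sketch of that arithmetic follows; it is an approximation that ignores check timeouts and the moment within the interval at which the node actually dies.

```python
def detection_window(inter_s: float, fall: int) -> float:
    """Approximate time for HAProxy to mark a server DOWN:
    'fall' consecutive failed checks spaced 'inter_s' seconds apart."""
    return inter_s * fall


if __name__ == "__main__":
    # Values taken from the haproxy.cfg in this guide.
    print("Odoo backends:", detection_window(inter_s=5, fall=3), "s")
    print("PostgreSQL backends:", detection_window(inter_s=3, fall=3), "s")
```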

Test 4: pg_rewind after failed primary recovery

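When the original primary comes back online, Patroni automatically reintegrates it as a standby, using pg_rewind to discard the WAL that diverged from the new primary. Verify that use_pg_rewind: true is set in patroni.yml and that wal_log_hints or data checksums are enabled (both are in the configuration above). Patroni runs pg_rewind with the superuser credentials from the authentication section; from PostgreSQL 11 onward a dedicated non-superuser account can be configured under authentication.rewind instead.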

Common errors and how to avoid them

Split-brain

Split-brain occurs when two nodes simultaneously believe they are the primary. Patroni prevents this through the etcd quorum: if a node cannot write to etcd it cannot be primary. Always use an odd number of etcd nodes (3 or 5) to guarantee quorum.
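
The preference for odd cluster sizes comes from simple majority arithmetic, sketched below in Python. A fourth etcd node raises the quorum without tolerating any additional failure.

```python
def quorum(members: int) -> int:
    """Smallest majority of a cluster with the given number of members."""
    return members // 2 + 1


def tolerated_failures(members: int) -> int:
    """How many members can be lost while a majority still exists."""
    return members - quorum(members)


if __name__ == "__main__":
    for n in range(1, 6):
        print(f"{n} etcd nodes: quorum={quorum(n)}, "
              f"tolerated failures={tolerated_failures(n)}")
```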

Out-of-sync filestore

If the filestore is not correctly shared, attachments created on one node are not visible from the other. Symptom: product images or documents that appear and disappear depending on which node serves the request. Solution: NFS with options rsize=131072,wsize=131072,hard (intr is ignored by current kernels), or migrate to S3.

Database connections not released after failover

Odoo keeps a pool of open PostgreSQL connections, and these can be left half-open (zombie connections) if the primary crashes abruptly. Configure tcp_keepalives_idle = 60 and tcp_keepalives_interval = 10 in PostgreSQL and equivalent keepalives on the client side, and keep db_maxconn in Odoo at a reasonable value (default 64). Placing PgBouncer as a connection pooler between Odoo and PostgreSQL dramatically improves recovery after failover.

Duplicate cron jobs

In a two-node Odoo application cluster, cron jobs run on both nodes if max_cron_threads > 0 on both. This can cause duplicate emails, invoices generated twice, etc. Solution: dedicate a single node as the cron node (max_cron_threads = 2 only on that node; on the others max_cron_threads = 0).

Replication lag during peak hours

With write-intensive workloads (bulk imports, period-end closings), the replica can accumulate lag. If the lag exceeds maximum_lag_on_failover (1 MB by default), Patroni excludes that node from failover candidates. Monitor lag with:

SELECT application_name, state, sent_lsn, replay_lsn,
       (sent_lsn - replay_lsn) AS lag_bytes
FROM pg_stat_replication;
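
When the LSN values are collected by a monitoring script rather than read in psql, the same lag can be computed client-side. A minimal Python sketch, assuming the textual LSN format PostgreSQL prints (two hexadecimal numbers separated by a slash) and the 1 MB threshold from patroni.yml. Patroni's own eligibility check is based on the leader position stored in the DCS, so this is only an approximation of it.

```python
MAXIMUM_LAG_ON_FAILOVER = 1048576  # bytes, value from patroni.yml in this guide


def lsn_to_int(lsn: str) -> int:
    """Convert a textual LSN such as '0/3000060' to a byte position."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def lag_bytes(sent_lsn: str, replay_lsn: str) -> int:
    """Bytes of WAL sent to the replica but not yet replayed."""
    return lsn_to_int(sent_lsn) - lsn_to_int(replay_lsn)


def within_failover_limit(sent_lsn: str, replay_lsn: str,
                          limit: int = MAXIMUM_LAG_ON_FAILOVER) -> bool:
    """True if the replica's lag does not exceed the configured limit."""
    return lag_bytes(sent_lsn, replay_lsn) <= limit


if __name__ == "__main__":
    print(lag_bytes("0/3000060", "0/3000000"))
    print(within_failover_limit("0/4000000", "0/3000000"))
```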

Achievable RTO and RPO summary

Scenario	RTO	RPO	Notes
Primary PG node failure	< 30 s	0-5 s (async) / 0 (sync)	Automatic Patroni failover
Odoo application node failure	< 15 s	0	HAProxy health check, no state in app
Data corruption (disk)	< 5 min	Last snapshot + WAL	Restore from backup + WAL replay
Planned maintenance	0	0	Rolling restart: remove node from HAProxy, update, reintegrate
