Why High Availability is critical in production Odoo
A production Odoo system managing orders, invoices, payroll or logistics cannot afford unexpected downtime. Every minute of inactivity has a direct cost: orders not recorded, operators idle waiting for delivery notes, customers calling. In mid-sized deployments (50-200 concurrent users) a 15-minute outage can mean losses of between 2,000 and 20,000 EUR, depending on the sector.
Yet the vast majority of Odoo deployments in Spain run on a single server with a single PostgreSQL instance. If that server fails -- disk, kernel panic, data corruption, a package update that breaks something -- there is no plan B. This guide describes the high-availability architecture we have implemented in production, using open-source components, controlled cost and automatic failover in under 30 seconds.
Key concepts: RTO, RPO and failure modes
Before designing the architecture it is useful to clarify two parameters that define the SLA:
- RTO (Recovery Time Objective): maximum tolerable time from the moment a failure occurs until the service is back online. In the architecture described here, the target RTO is < 30 seconds for database-node failures.
- RPO (Recovery Point Objective): maximum acceptable data loss. With synchronous replication, the RPO is 0 confirmed transactions. With asynchronous replication (more common to avoid latency impact), the RPO can be 1-5 seconds.
The failure modes this architecture covers are: primary PostgreSQL node failure, Odoo application node failure, data corruption on a single node, planned maintenance with zero downtime, and load saturation.
Reference architecture: overview
The architecture consists of four independent layers working together:
┌──────────────────────────────────────────────────┐
│             CLIENTS (browsers, apps)             │
└────────────────────────┬─────────────────────────┘
                         │
┌────────────────────────▼─────────────────────────┐
│             HAProxy (active/passive)             │
│  :80/:443 → Odoo workers       :5432 → PG        │
│  stats: :8404                                    │
└───────────┬──────────────────────────┬───────────┘
            │                          │
┌───────────▼──────────┐    ┌──────────▼───────────┐
│    Odoo Worker 1     │    │     Odoo Worker 2    │
│       (active)       │    │       (active)       │
│      8069/8072       │    │       8069/8072      │
└───────────┬──────────┘    └──────────┬───────────┘
            │ Shared filestore: NFS/S3 │
            └────────────┬─────────────┘
                         │
┌────────────────────────▼─────────────────────────┐
│               PostgreSQL HA layer                │
│                                                  │
│    ┌─────────┐     ┌─────────┐     ┌─────────┐   │
│    │ PG Prim │     │ PG Rep1 │     │ PG Rep2 │   │
│    │  (R/W)  │ ──▶ │(standby)│     │(standby)│   │
│    └─────────┘     └─────────┘     └─────────┘   │
│         ▲               ▲               ▲        │
│         └───────────────┴───────────────┘        │
│                  Patroni + etcd                  │
└──────────────────────────────────────────────────┘
Layer 1: PostgreSQL HA with Patroni and etcd
Patroni is the de-facto standard for managing PostgreSQL clusters in high availability. It handles leader election, automatic promotion of a standby to primary and controlled restart of failed nodes. It works alongside a distributed consensus system (etcd, Consul or ZooKeeper) to avoid the split-brain problem.
Recommended topology
- 3 PostgreSQL nodes: 1 primary (R/W) + 2 streaming replicas (read-only)
- 3 etcd nodes (can co-exist on the same hosts as PostgreSQL in mid-sized environments)
- Asynchronous replication by default; synchronous optional for RPO=0 at the cost of write latency
Patroni configuration file (patroni.yml)
scope: odoo-cluster
namespace: /service/
name: pg-node-1
restapi:
  listen: 0.0.0.0:8008
  connect_address: 10.0.1.11:8008
etcd3:
  hosts:
    - 10.0.1.11:2379
    - 10.0.1.12:2379
    - 10.0.1.13:2379
bootstrap:
  dcs:
    ttl: 30
    loop_wait: 10
    retry_timeout: 10
    maximum_lag_on_failover: 1048576  # maximum lag of 1 MB for a failover candidate
    postgresql:
      use_pg_rewind: true
      use_slots: true
      parameters:
        wal_level: replica
        hot_standby: "on"
        max_wal_senders: 10
        max_replication_slots: 10
        wal_log_hints: "on"
        archive_mode: "on"
        archive_command: "cp %p /var/lib/postgresql/wal_archive/%f"
        synchronous_commit: "off"  # asynchronous; for RPO=0 set "on" and enable Patroni's synchronous_mode
  initdb:
    - encoding: UTF8
    - data-checksums
postgresql:
  listen: 0.0.0.0:5432
  connect_address: 10.0.1.11:5432
  data_dir: /var/lib/postgresql/16/main
  bin_dir: /usr/lib/postgresql/16/bin
  pgpass: /tmp/pgpass0
  authentication:
    replication:
      username: replicator
      password: <REPLICATION_PASSWORD>
    superuser:
      username: postgres
      password: <SUPERUSER_PASSWORD>
  parameters:
    unix_socket_directories: "."
    shared_buffers: "2GB"
    effective_cache_size: "6GB"
    maintenance_work_mem: "512MB"
    work_mem: "64MB"
    max_connections: 200
    log_min_duration_statement: 1000  # log queries slower than 1 s
tags:
  nofailover: false
  noloadbalance: false
  clonefrom: false
  nosync: false
Repeat this file on each node adjusting name and connect_address (pg-node-2 / pg-node-3). The Patroni service is managed with systemd:
# /etc/systemd/system/patroni.service
[Unit]
Description=Patroni PostgreSQL HA
After=network.target
[Service]
Type=simple
User=postgres
ExecStart=/usr/local/bin/patroni /etc/patroni/patroni.yml
Restart=on-failure
RestartSec=5
[Install]
WantedBy=multi-user.target
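The per-node differences in patroni.yml come down to three values. The following is a minimal Python sketch, assuming the node names and 10.0.1.11-13 addresses used in this guide; the helper name `node_overrides` is hypothetical and not part of Patroni. It derives those values from a single table so the three files do not drift apart:

```python
# Hypothetical helper: derive the values that differ between the three
# patroni.yml files. Node names and addresses follow this guide's examples.
NODES = {
    "pg-node-1": "10.0.1.11",
    "pg-node-2": "10.0.1.12",
    "pg-node-3": "10.0.1.13",
}


def node_overrides(name: str) -> dict:
    """Return the keys that must change in patroni.yml on the given node."""
    ip = NODES[name]
    return {
        "name": name,
        "restapi": {"connect_address": f"{ip}:8008"},
        "postgresql": {"connect_address": f"{ip}:5432"},
    }
```

Calling `node_overrides("pg-node-2")` returns the name plus both connect_address values for the second node; everything else in the file stays identical across nodes.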
Layer 2: HAProxy as load balancer and traffic switcher
HAProxy serves two purposes: balancing client HTTP traffic across the Odoo application nodes, and routing PostgreSQL connections to the current primary node (using Patroni health checks via the REST API on port 8008).
haproxy.cfg file
global
log /dev/log local0
log /dev/log local1 notice
maxconn 4096
user haproxy
group haproxy
daemon
defaults
log global
mode http
option httplog
option dontlognull
retries 3
timeout connect 5s
timeout client 30s
timeout server 60s
#
# FRONTEND Odoo HTTP (workers)
#
frontend odoo_http
bind *:80
bind *:443 ssl crt /etc/ssl/certs/skanndar.pem
http-request redirect scheme https unless { ssl_fc }
default_backend odoo_workers
backend odoo_workers
balance roundrobin
option httpchk GET /web/health
http-check expect status 200
server odoo1 10.0.1.21:8069 check inter 5s fall 3 rise 2
server odoo2 10.0.1.22:8069 check inter 5s fall 3 rise 2
#
# LONGPOLLING / GEVENT (port 8072)
#
frontend odoo_longpoll
bind *:8072
default_backend odoo_longpoll_backend
backend odoo_longpoll_backend
balance source
option httpchk GET /web/health
http-check expect status 200
timeout tunnel 3600s
server odoo1 10.0.1.21:8072 check inter 5s fall 3 rise 2
server odoo2 10.0.1.22:8072 check inter 5s fall 3 rise 2
#
# FRONTEND PostgreSQL (TCP mode)
# Patroni exposes /master on 8008 when the node is the primary
#
frontend pg_primary_frontend
mode tcp
bind *:5432
default_backend pg_primary
backend pg_primary
mode tcp
option httpchk GET /master
http-check expect status 200
default-server inter 3s fastinter 1s fall 3 rise 2 on-marked-down shutdown-sessions
server pg1 10.0.1.11:5432 check port 8008
server pg2 10.0.1.12:5432 check port 8008
server pg3 10.0.1.13:5432 check port 8008
#
# FRONTEND PostgreSQL replicas (read-only)
#
frontend pg_replica_frontend
mode tcp
bind *:5433
default_backend pg_replicas
backend pg_replicas
mode tcp
option httpchk GET /replica
http-check expect status 200
default-server inter 3s fall 3 rise 2
server pg1 10.0.1.11:5432 check port 8008
server pg2 10.0.1.12:5432 check port 8008
server pg3 10.0.1.13:5432 check port 8008
#
# STATS
#
listen stats
bind *:8404
stats enable
stats uri /stats
stats refresh 5s
stats auth admin:<STATS_PASSWORD>
The key is the PostgreSQL health check: HAProxy calls GET /master on Patroni's REST port (8008). Only the primary node responds 200; standbys respond 503. This guarantees that write traffic always reaches the current primary, even after a failover.
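The same check can be reproduced by hand when debugging. This is a minimal Python sketch, assuming the three node addresses from this guide; the function names and the injectable `fetch` parameter are illustrative choices, not part of Patroni or HAProxy. It asks each node's REST API for /master and keeps the nodes that answer 200:

```python
import urllib.error
import urllib.request

# Patroni REST API hosts (addresses taken from this guide's examples).
PATRONI_NODES = ["10.0.1.11", "10.0.1.12", "10.0.1.13"]


def http_status(url: str, timeout: float = 2.0) -> int:
    """Return the HTTP status of a GET request, or 0 if the host is unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as exc:
        return exc.code  # e.g. 503 from a standby
    except OSError:
        return 0  # connection refused, timeout, DNS failure


def find_primaries(nodes, fetch=http_status):
    """Return the nodes whose /master endpoint answers 200."""
    return [n for n in nodes if fetch(f"http://{n}:8008/master") == 200]
```

In a healthy cluster `find_primaries(PATRONI_NODES)` should return exactly one address. An empty list means an election is in progress or the REST API is unreachable.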
Layer 3: Odoo workers and session configuration
Odoo in multi-worker mode launches independent processes that must be able to run on any application node. This requires two things:
Shared filestore
Odoo attachments, images and documents are stored in the filestore (by default ~/.local/share/Odoo/filestore/). In a multi-node cluster, this directory must be shared. The options are:
- NFS mounted on both nodes: simple, low latency on LAN. Use NFSv4 with locks enabled.
- S3 / S3-compatible (MinIO): preferred option for cloud environments. The OCA module base_attachment_s3 allows using S3 as an attachment backend transparently.
- GlusterFS: distributed alternative with no single point of failure for the filestore itself.
User sessions
By default Odoo stores sessions in local files (a sessions/ directory under the Odoo data directory). In a multi-node cluster this causes users to lose their session if an HTTP request goes to a node that does not have their session file. The solution is to store sessions in Redis or in PostgreSQL (OCA module session_db):
# odoo.conf (application node)
[options]
workers = 8
max_cron_threads = 2
# HAProxy VIP → PostgreSQL primary (comment kept on its own line: Odoo's config parser does not strip inline comments)
db_host = 10.0.1.1
db_port = 5432
db_user = odoo
db_password = <DB_PASSWORD>
dbfilter = ^odoo_prod$
http_port = 8069
gevent_port = 8072
proxy_mode = True
limit_memory_hard = 2684354560
limit_memory_soft = 2147483648
limit_request = 8192
limit_time_cpu = 120
limit_time_real = 240
log_level = warn
logfile = /var/log/odoo/odoo.log
# Sessions in Redis (requires the OCA module web_session_redis)
# redis_url = redis://10.0.1.30:6379/0
Longpolling and gevent
Odoo's real-time chat and notification module uses gevent on port 8072. In an HA environment, longpolling connections for a given session must always reach the same node (sticky sessions by source IP). The balance source directive in HAProxy's longpolling backend ensures this. If Nginx is placed in front of HAProxy, use ip_hash for requests to /longpolling/ (/websocket from Odoo 16 onwards).
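To see why source-IP balancing keeps a client on one node, consider this minimal Python sketch. It uses CRC32 purely as an illustrative hash (HAProxy's internal hash function is different), and `pick_node` is a hypothetical name:

```python
import zlib


def pick_node(client_ip: str, nodes: list) -> str:
    """Map a client IP to a node deterministically (illustrative hash only)."""
    return nodes[zlib.crc32(client_ip.encode()) % len(nodes)]
```

The same IP always maps to the same entry as long as the node list is unchanged. When a node is removed the mapping shifts, which is why some clients reconnect to a different node after an application-node failure.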
Automatic failover: what happens step by step
When the primary PostgreSQL node fails, the sequence is as follows:
- Patroni detects that the primary has stopped renewing its leader key in etcd (timing is governed by ttl and loop_wait, 30 s and 10 s in the configuration above).
- The remaining nodes initiate a leader election through etcd. The candidate with the most recent WAL and no excessive lag wins.
- The elected node runs pg_ctl promote: the standby becomes the primary. This process takes between 5 and 15 seconds.
- Patroni updates the DCS key to reflect the new leader.
- HAProxy detects on the next health check cycle (every 3 s, or every 1 s with fastinter while a server is changing state) that /master now returns 200 on the new primary and 503 on the others.
- Write traffic is automatically redirected. Active connections in Odoo are interrupted with a database connection error, but Odoo retries automatically on reconnect.
Total failover time in production: 15-30 seconds. Users will see a 500 error during that interval, but will not lose confirmed data.
Failover testing: how to validate the architecture
An untested HA architecture is not HA. These are the minimum tests that must be run before going to production and periodically (chaos engineering):
Test 1: Primary node failure
# From node pg-node-1 (the primary)
sudo systemctl stop patroni
# Or simulate a crash by killing the PostgreSQL processes:
sudo pkill -9 -x postgres
# Watch the election in real time from any other node:
watch -n 1 "patronictl -c /etc/patroni/patroni.yml topology"
# Expected result: pg-node-2 or pg-node-3 becomes Leader in < 30 s
Test 2: Verify that HAProxy redirects traffic
# From an external machine with psql installed:
watch -n 1 "psql -h 10.0.1.1 -p 5432 -U odoo -d odoo_prod -c 'SELECT pg_is_in_recovery(), inet_server_addr();'"
# Expected result: pg_is_in_recovery = f (the node is the primary)
# After the failover the IP must change while the result stays f
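To turn that observation into a number that can be compared with the RTO target, the probe results can be logged and the longest gap computed afterwards. This is a minimal Python sketch, assuming a list of (timestamp, ok) samples collected by any probe of your choice; the function name `longest_outage` is hypothetical:

```python
def longest_outage(samples):
    """Return the longest outage, in seconds, found in the samples.

    samples: list of (timestamp_seconds, ok) tuples in chronological order.
    An outage runs from the first failed probe to the next successful one.
    """
    longest = 0.0
    outage_start = None
    for ts, ok in samples:
        if not ok and outage_start is None:
            outage_start = ts  # first failed probe of a new outage
        elif ok and outage_start is not None:
            longest = max(longest, ts - outage_start)
            outage_start = None
    # Outage still in progress when sampling stopped.
    if outage_start is not None:
        longest = max(longest, samples[-1][0] - outage_start)
    return longest
```

With one probe per second, the result is accurate to about one second, which is enough to check a 30-second target.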
Test 3: Odoo application node failure
# Stop Odoo on one of the nodes
ssh odoo-node-1 "sudo systemctl stop odoo"
# HAProxy must stop sending traffic to that node in < 15 s (3 checks x 5 s)
# Check the HAProxy stats page: http://10.0.1.1:8404/stats
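The 15-second figure is simply the check interval multiplied by the fall count. Here is a small Python sketch of that arithmetic, assuming the inter, fall and rise values from the haproxy.cfg in this guide. It ignores check timeouts and fastinter, so treat the result as an approximation:

```python
def detection_window(inter_s: float, count: int) -> float:
    """Approximate seconds needed for `count` consecutive checks spaced `inter_s` apart."""
    return inter_s * count


# Odoo backends: inter 5s, fall 3, rise 2
odoo_down = detection_window(5, 3)  # time to mark a node down
odoo_up = detection_window(5, 2)    # time to mark it up again
# PostgreSQL backends: inter 3s, fall 3
pg_down = detection_window(3, 3)
```

Lowering inter shortens detection but increases health-check traffic against Odoo and the Patroni REST API.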
Test 4: pg_rewind after failed primary recovery
When the original primary comes back online, Patroni automatically reintegrates it as a standby using pg_rewind to synchronise divergent WALs. Verify that use_pg_rewind: true is configured in patroni.yml and that wal_log_hints or data checksums are enabled (both are set in the configuration above). Patroni runs pg_rewind with the superuser credentials unless a dedicated rewind user is defined, which is possible from PostgreSQL 11 onwards.
Common errors and how to avoid them
Split-brain
Split-brain occurs when two nodes simultaneously believe they are the primary. Patroni prevents this through the etcd quorum: if a node cannot write to etcd it cannot be primary. Always use an odd number of etcd nodes (3 or 5) to guarantee quorum.
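The preference for odd cluster sizes follows from majority arithmetic. A minimal Python sketch (the function names are illustrative):

```python
def quorum(members: int) -> int:
    """Smallest number of members that forms a strict majority."""
    return members // 2 + 1


def tolerated_failures(members: int) -> int:
    """How many members can fail while a majority remains."""
    return members - quorum(members)
```

Three members tolerate one failure and five tolerate two. A fourth member raises the quorum to three without tolerating any more failures than a three-member cluster, so it adds cost without adding availability.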
Out-of-sync filestore
If the filestore is not correctly shared, attachments created on one node are not visible from the other. Symptom: product images or documents that appear and disappear depending on which node serves the request. Solution: NFS with options rsize=131072,wsize=131072,hard,intr or migrate to S3.
Database connections not released after failover
Odoo opens connections to PostgreSQL that can become zombie connections if the primary crashes abruptly. Configure tcp_keepalives_idle = 60 and tcp_keepalives_interval = 10 in PostgreSQL, and keep db_maxconn on the Odoo side at a reasonable value (default 64). Placing PgBouncer as a connection pooler between Odoo and PostgreSQL dramatically improves recovery after failover.
Duplicate cron jobs
In a two-node Odoo application cluster, cron jobs run on both nodes if max_cron_threads > 0 on both. This can cause duplicate emails, invoices generated twice, etc. Solution: dedicate a single node as the cron node (max_cron_threads = 2 only on that node; on the others max_cron_threads = 0).
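One way to keep this consistent is to generate the per-node setting rather than edit it by hand. This is a minimal Python sketch using the standard configparser module; the node names (odoo-node-1, odoo-node-2) and the function name `render_overrides` are assumptions made for illustration:

```python
import configparser
import io


def render_overrides(node: str, cron_node: str, cron_threads: int = 2) -> str:
    """Render the [options] fragment that sets max_cron_threads for one node."""
    cfg = configparser.ConfigParser()
    cfg["options"] = {
        # Only the designated cron node gets cron threads; all others get 0.
        "max_cron_threads": str(cron_threads if node == cron_node else 0),
    }
    buf = io.StringIO()
    cfg.write(buf)
    return buf.getvalue()
```

If the cron node fails, scheduled jobs stop until another node is designated, so the choice of cron node belongs in the failover runbook.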
Replication lag during peak hours
With write-intensive workloads (bulk imports, period-end closings), the replica can accumulate lag. If the lag exceeds maximum_lag_on_failover (1 MB by default), Patroni excludes that node from failover candidates. Monitor lag with:
SELECT application_name, state, sent_lsn, replay_lsn,
(sent_lsn - replay_lsn) AS lag_bytes
FROM pg_stat_replication;
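For alerting, the same difference can be computed outside the database from the textual LSN values. This is a minimal Python sketch, assuming LSNs in PostgreSQL's usual 'high/low' hexadecimal form. The comparison with maximum_lag_on_failover is only an approximation of Patroni's own check, which uses the leader position stored in the DCS:

```python
MAX_LAG = 1048576  # maximum_lag_on_failover from patroni.yml (1 MB)


def lsn_to_int(lsn: str) -> int:
    """Convert a pg_lsn string such as '16/B374D848' to a byte position."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) + int(low, 16)


def lag_bytes(sent_lsn: str, replay_lsn: str) -> int:
    """Bytes of WAL sent by the primary but not yet replayed by the replica."""
    return lsn_to_int(sent_lsn) - lsn_to_int(replay_lsn)


def within_failover_lag(sent_lsn: str, replay_lsn: str, max_lag: int = MAX_LAG) -> bool:
    """True if the replica's lag does not exceed the configured maximum."""
    return lag_bytes(sent_lsn, replay_lsn) <= max_lag
```

A replica that repeatedly fails this check during peak hours would likely be skipped as a failover candidate, so it is worth alerting on well before the threshold is reached.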
Achievable RTO and RPO summary
| Scenario | RTO | RPO | Notes |
|---|---|---|---|
| Primary PG node failure | < 30 s | 0-5 s (async) / 0 (sync) | Automatic Patroni failover |
| Odoo application node failure | < 15 s | 0 | HAProxy health check, no state in app |
| Data corruption (disk) | < 5 min | Last snapshot + WAL | Restore from backup + WAL replay |
| Planned maintenance | 0 | 0 | Rolling restart: remove node from HAProxy, update, reintegrate |