From two million scattered references to a unified CRM
At Cymit Química we managed a catalogue of more than two million references from dozens of domestic and international suppliers. The alternative to automating that management was a dedicated team for manual catalogue updates: a team that could never have competed in speed with the price and availability update cycles that the chemical distribution market demands. The crawling and scraping system I designed and implemented was not an IT project: it was a business decision that turned that volume of data into a real competitive advantage over distributors that lacked that automation.
This guide covers the complete technical architecture of that system, adapted as a reference for any project that needs to feed Odoo's CRM or catalogue with data extracted from external sources at scale.
Overall architecture: from source to CRM
An industrial scraping pipeline has four clearly differentiated layers. Mixing them in a single script is the most frequent mistake and the one that causes scraping projects to die within a few months:
- Extraction: obtaining the raw HTML or data from the source.
- Transformation and normalisation: cleaning, structuring and unifying the data format.
- Deduplication and validation: ensuring no duplicate or incorrect records are inserted.
- Loading into Odoo: writing to the CRM via API with idempotency guarantees.
Each layer has its own tools, its own failure modes and its own scaling strategies. Below I detail each one.
Layer 1: extraction -- choosing the right tool
There is no universal tool for scraping. The choice depends on what the source does:
- Requests + BeautifulSoup / lxml: for static HTML pages or undocumented REST APIs returning JSON. The lightest and fastest option. 60-70% of a B2B distributor's cases are solved here.
- Scrapy: when volume is high (thousands or millions of URLs) and you need fine control over concurrency, middlewares, pipelines and retries. Scrapy is a framework, not a library: it has a learning curve, but the long-term maintainability difference justifies it.
- Playwright (or Selenium): for pages that require JavaScript to render content. Playwright is the modern choice: faster than Selenium, better async API, Chromium/Firefox/WebKit support and native headless mode.
In the Cymit project we used all three in parallel, assigned by source type. The task orchestrator decided which engine to use for each supplier.
Example: extraction with Requests and lxml for a static HTML catalogue
import requests
from lxml import html
import time
import random
SESSION_HEADERS = {
"User-Agent": (
"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
"(KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36"
),
"Accept-Language": "es-ES,es;q=0.9",
"Accept-Encoding": "gzip, deflate, br",
}
def fetch_product_page(url: str, session: requests.Session) -> dict | None:
"""Extrae datos de una página de producto individual."""
try:
response = session.get(url, headers=SESSION_HEADERS, timeout=15)
response.raise_for_status()
except requests.RequestException as exc:
print(f"[ERROR] {url}: {exc}")
return None
tree = html.fromstring(response.content)
name = tree.xpath("//h1[@class='product-title']/text()")
ref = tree.xpath("//span[@data-field='product_ref']/text()")
price = tree.xpath("//span[@class='price-value']/text()")
availability = tree.xpath("//div[@class='stock-info']//text()")
return {
"url": url,
"name": name[0].strip() if name else None,
"ref": ref[0].strip() if ref else None,
"price_raw": price[0].strip() if price else None,
"availability": " ".join(availability).strip() if availability else None,
}
def crawl_catalog(base_url: str, page_count: int) -> list[dict]:
"""Recorre las páginas de listado y extrae URLs de producto."""
session = requests.Session()
product_data = []
for page in range(1, page_count + 1):
url = f"{base_url}?page={page}"
resp = session.get(url, headers=SESSION_HEADERS, timeout=15)
tree = html.fromstring(resp.content)
links = tree.xpath("//a[@class='product-link']/@href")
for link in links:
full_url = f"https://proveedor.example.com{link}"
data = fetch_product_page(full_url, session)
if data:
product_data.append(data)
time.sleep(random.uniform(1.5, 3.5))
return product_data
Scrapy for volume: distributor catalogue spider
When the catalogue exceeds tens of thousands of pages, manually managing concurrency, retries and queues becomes unmanageable. Scrapy solves this with its pipeline and middleware architecture:
import scrapy
from scrapy.http import Response
class CatalogSpider(scrapy.Spider):
name = "catalog_spider"
allowed_domains = ["proveedor.example.com"]
start_urls = ["https://proveedor.example.com/catalogo/"]
custom_settings = {
"CONCURRENT_REQUESTS": 4,
"DOWNLOAD_DELAY": 2,
"RANDOMIZE_DOWNLOAD_DELAY": True,
"AUTOTHROTTLE_ENABLED": True,
"AUTOTHROTTLE_TARGET_CONCURRENCY": 2.0,
"RETRY_TIMES": 3,
"RETRY_HTTP_CODES": [429, 500, 502, 503, 504],
"ITEM_PIPELINES": {
"catalog_scraper.pipelines.NormalizePipeline": 100,
"catalog_scraper.pipelines.DedupPipeline": 200,
"catalog_scraper.pipelines.OdooIngestPipeline": 300,
},
"ROTATING_PROXY_LIST_PATH": "/etc/scraper/proxies.txt",
}
def parse(self, response: Response):
for href in response.css("a.product-link::attr(href)").getall():
yield response.follow(href, callback=self.parse_product)
next_page = response.css("a.pagination-next::attr(href)").get()
if next_page:
yield response.follow(next_page, callback=self.parse)
def parse_product(self, response: Response):
yield {
"source_url": response.url,
"name": response.css("h1.product-title::text").get("").strip(),
"ref_supplier": response.css("[data-field=product_ref]::text").get("").strip(),
"price_raw": response.css(".price-value::text").get("").strip(),
"description": response.css(".product-description").get(""),
"availability": response.css(".stock-info::text").get("").strip(),
}
Playwright for JavaScript pages
Some supplier portals render price and availability via JS. Requests cannot see that data; Playwright can:
import asyncio
from playwright.async_api import async_playwright
async def scrape_dynamic_page(url: str) -> dict:
async with async_playwright() as pw:
browser = await pw.chromium.launch(headless=True)
context = await browser.new_context(
user_agent="Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 Chrome/124.0.0.0 Safari/537.36"
)
page = await context.new_page()
api_responses = []
async def handle_response(response):
if "/api/product" in response.url:
try:
data = await response.json()
api_responses.append(data)
except Exception:
pass
page.on("response", handle_response)
await page.goto(url, wait_until="networkidle", timeout=30_000)
await page.wait_for_selector(".price-value", timeout=10_000)
name = await page.text_content("h1.product-title")
price = await page.text_content(".price-value")
await browser.close()
return {
"name": name.strip() if name else None,
"price_raw": price.strip() if price else None,
"api_data": api_responses[0] if api_responses else None,
}
Scale management and anti-blocking
A scraper that works with a hundred requests fails with a million. The difference between an artisanal and an industrial scraping system lies in how it handles rejections, rate limits and detection.
Rate limiting and jitter
The golden rule is not to look like a bot. Bots make requests at exact intervals; humans do not. Adding a random component to the delay between requests (jitter) dramatically reduces the detection rate. In Scrapy, RANDOMIZE_DOWNLOAD_DELAY = True does this automatically. In scripts with Requests, time.sleep(random.uniform(1.5, 4.0)) between requests is the minimum.
Proxy and User-Agent rotation
For volumes of millions of requests, a single IP will inevitably be blocked. The solution is a pool of rotating residential or datacenter proxies. In Scrapy, scrapy-rotating-proxies manages the pool automatically, marking blocked proxies and redistributing traffic among active ones:
# settings.py de Scrapy
DOWNLOADER_MIDDLEWARES = {
"rotating_proxies.middlewares.RotatingProxyMiddleware": 610,
"rotating_proxies.middlewares.BanDetectionMiddleware": 620,
}
ROTATING_PROXY_LIST_PATH = "/etc/scraper/proxies.txt"
ROTATING_PROXY_PAGE_RETRY_TIMES = 5
# proxies.txt: una per línia
# http://user:pass@proxy1.example.com:8080
# http://user:pass@proxy2.example.com:8080
User-Agent rotation complements proxy rotation. The scrapy-fake-useragent middleware pulls updated lists of real browser User-Agents for each request.
Distributed queues and workers with Celery + Redis
For a catalogue of two million references that updates periodically, a single sequential process takes days. The solution is to distribute the work across parallel workers with a task queue. Celery with Redis as broker is the de-facto Python standard:
from celery import Celery
app = Celery(
"scraper",
broker="redis://localhost:6379/0",
backend="redis://localhost:6379/1",
)
app.conf.update(
task_serializer="json",
result_serializer="json",
accept_content=["json"],
task_acks_late=True,
worker_prefetch_multiplier=1,
task_routes={
"scraper.tasks.scrape_product": {"queue": "scraping"},
"scraper.tasks.ingest_odoo": {"queue": "odoo_ingest"},
},
)
@app.task(bind=True, max_retries=3, default_retry_delay=60)
def scrape_product(self, url: str, supplier_id: int) -> dict:
try:
data = fetch_product_page(url, requests.Session())
if data:
ingest_odoo.delay(data, supplier_id)
return data
except Exception as exc:
raise self.retry(exc=exc)
With 8 workers dedicated to scraping, Cymit's system processed tens of thousands of references per hour during periodic bulk updates.
Layer 2: data normalisation
Data extracted from different suppliers arrives in heterogeneous formats: prices with currency symbols and comma decimals, availability as «En stock» / «Out of stock» / «Disponible en 3-5 días» / «S/D», references with or without supplier prefix. The normalisation pipeline must convert all this to a unified schema before touching Odoo's database.
import re
from decimal import Decimal, InvalidOperation
AVAILABILITY_MAP = {
"en stock": "available",
"disponible": "available",
"out of stock": "out_of_stock",
"no disponible": "out_of_stock",
"agotado": "out_of_stock",
"s/d": "out_of_stock",
r"disponible en \d+-\d+ días": "on_demand",
r"\d+ semanas": "on_demand",
}
def normalize_price(raw: str) -> Decimal | None:
if not raw:
return None
cleaned = re.sub(r"[^\d.,]", "", raw).strip()
if "," in cleaned and "." in cleaned:
cleaned = cleaned.replace(".", "").replace(",", ".")
elif "," in cleaned:
cleaned = cleaned.replace(",", ".")
try:
return Decimal(cleaned)
except InvalidOperation:
return None
def normalize_availability(raw: str) -> str:
normalized = raw.lower().strip()
for pattern, status in AVAILABILITY_MAP.items():
if re.search(pattern, normalized):
return status
return "unknown"
def normalize_product(raw: dict, supplier_code: str) -> dict:
ref = (raw.get("ref_supplier") or "").strip().upper()
return {
"external_id": f"{supplier_code}_{ref}",
"name": (raw.get("name") or "").strip(),
"default_code": ref,
"list_price": normalize_price(raw.get("price_raw", "")),
"availability": normalize_availability(raw.get("availability", "")),
"description": (raw.get("description") or "").strip(),
"source_url": raw.get("source_url"),
}
Layer 3: deduplication
With two million references and multiple suppliers, the same reference can appear from different sources under slightly different names. The deduplication strategy has two levels:
- Exact dedup by
external_id: the canonical identifier{supplier_code}_{ref}guarantees that the same reference from the same supplier is never inserted twice. - Fuzzy dedup across suppliers: to detect the same reference sold by two different distributors under different names, string comparison techniques such as
rapidfuzzare used on the normalised name and the CAS number when available.
from rapidfuzz import fuzz
def find_duplicate_candidate(
product: dict,
existing_refs: list[dict],
threshold: int = 90
) -> dict | None:
name = product["name"].lower()
for existing in existing_refs:
score = fuzz.token_sort_ratio(name, existing["name"].lower())
if score >= threshold:
return existing
return None
Layer 4: ingestion into Odoo via API
Odoo exposes two API interfaces: XML-RPC (classic, available since Odoo 6) and JSON-RPC (more modern, same capabilities). Both allow authentication, search, record creation and update for any ERP model. For bulk ingestion from an external scraper, JSON-RPC with create_or_write via external_id is the right combination.
Connection and authentication
import xmlrpc.client
class OdooClient:
def __init__(self, url: str, db: str, username: str, password: str):
self.url = url
self.db = db
self.uid = None
self._models = xmlrpc.client.ServerProxy(f"{url}/xmlrpc/2/object")
self._common = xmlrpc.client.ServerProxy(f"{url}/xmlrpc/2/common")
self.password = password
self._authenticate(username)
def _authenticate(self, username: str) -> None:
self.uid = self._common.authenticate(self.db, username, self.password, {})
if not self.uid:
raise ValueError(f"Authentication failed for user {username}")
def execute(self, model: str, method: str, *args, **kwargs):
return self._models.execute_kw(self.db, self.uid, self.password, model, method, list(args), kwargs)
def search_read(self, model: str, domain: list, fields: list) -> list:
return self.execute(model, "search_read", domain, fields=fields)
def upsert_by_external_id(self, model: str, external_id: str, values: dict) -> int:
existing = self.execute(
"ir.model.data", "search_read",
[["name", "=", external_id], ["model", "=", model]],
fields=["res_id"],
)
if existing:
record_id = existing[0]["res_id"]
self.execute(model, "write", [record_id], values)
return record_id
else:
record_id = self.execute(model, "create", values)
self.execute("ir.model.data", "create", {
"name": external_id, "model": model,
"res_id": record_id, "module": "scraper_import", "noupdate": False,
})
return record_id
Bulk lead and product ingestion
For bulk ingestion, the individual upsert_by_external_id operation per reference is too slow. Odoo allows batch create by passing a list of dicts. Combined with chunk-loading, performance improves dramatically:
def ingest_products_bulk(client, products, chunk_size=100):
created = updated = errors = 0
ext_ids_in_odoo = {
row["name"]: row["res_id"]
for row in client.execute(
"ir.model.data", "search_read",
[["module", "=", "scraper_import"], ["model", "=", "product.template"]],
fields=["name", "res_id"],
)
}
for i in range(0, len(products), chunk_size):
chunk = products[i:i + chunk_size]
to_create = []
to_update = {}
for p in chunk:
ext_id = p.pop("external_id")
values = {
"name": p["name"], "default_code": p["default_code"],
"list_price": float(p["list_price"]) if p["list_price"] else 0.0,
"description_sale": p.get("description", ""), "type": "product",
}
if ext_id in ext_ids_in_odoo:
to_update[ext_ids_in_odoo[ext_id]] = values
else:
to_create.append((ext_id, values))
if to_create:
ids = client.execute("product.template", "create", [v for _, v in to_create])
for (ext_id, _), record_id in zip(to_create, ids):
client.execute("ir.model.data", "create", {
"name": ext_id, "model": "product.template",
"res_id": record_id, "module": "scraper_import",
})
created += len(to_create)
for record_id, values in to_update.items():
client.execute("product.template", "write", [record_id], values)
updated += len(to_update)
return {"created": created, "updated": updated, "errors": errors}
Lead ingestion into crm.lead
To feed the CRM directly with leads extracted from directories, forms or aggregators, the target model is crm.lead. The same external_id pattern guarantees idempotency: the same lead detected in two different scraper runs is not duplicated:
def ingest_lead(client, lead_data):
ext_id = lead_data["external_id"]
values = {
"name": lead_data["company_name"],
"partner_name": lead_data.get("contact_name", ""),
"email_from": lead_data.get("email", ""),
"phone": lead_data.get("phone", ""),
"website": lead_data.get("website", ""),
"street": lead_data.get("address", ""),
"city": lead_data.get("city", ""),
"country_id": 69,
"type": "lead",
"source_id": lead_data.get("odoo_source_id"),
"description": lead_data.get("raw_description", ""),
}
return client.upsert_by_external_id("crm.lead", ext_id, values)
Legal and ethical aspects of scraping
No technical scraping guide is complete without this section, especially in the European context. These are the points I evaluate in every project before writing a single line of extraction code:
- Source terms of use: many sites explicitly prohibit automated scraping in their ToS. Failing to respect this exposes you to civil liability and possible legal action.
- Personal data and GDPR: if the extracted data includes information about natural persons (names, emails, phone numbers), GDPR applies. You need a legal basis for processing and must guarantee ARCO rights. This is non-negotiable in the Spanish market.
- Robots.txt: although it has no direct legal force in Spain, ignoring
robots.txtcan worsen your position in litigation and is a sign of bad faith. Respecting it is the minimum ethical practice. - Copyrighted content: copying and storing full product descriptions may constitute copyright infringement. Extracting structured data (price, availability, reference) is in a more favourable grey area than copying verbatim descriptions.
- Load on the target server: a scraper without rate limiting can degrade service on the target site, which in extreme cases may be classified as negligent DoS.
At Cymit, extraction was performed from supplier catalogues with whom we had commercial agreements that included access to product information. That pre-existing contractual relationship is the safest way to operate at scale.
Pipeline orchestration and monitoring
A production scraping pipeline needs observability. The minimum metrics we monitored at Cymit included: success rate per supplier, number of references updated per run, execution time per crawler, and alerts when a supplier changed its HTML structure (which broke the selectors). With Telegram alerts via a simple Python bot, the team was notified in under five minutes whenever any crawler stopped working, without having to review logs manually.
Conclusió
A well-designed bulk scraping system is not a collection of scripts: it is a data architecture with clearly separated layers, idempotency guarantees, scale management and observability. The difference between a script that works on day one and a pipeline that keeps working two years later is exactly that: the architecture. At Cymit, this system turned two million references from an operational problem into a competitive advantage that contributed to the exit to Grupo PALEX. The same approach applies to any project that needs to feed Odoo with data from the outside world at scale.