How to Build a Competitive Pricing Pipeline: Track 5,000 SKUs Across 8 Competitors Daily

Ethan Brown

Advanced Bot Mitigation Engineer

28-May-2026

Key Takeaways:

  • Competitive pricing is a basket problem, not a product problem. A pricing team tracking 5,000 SKUs across 8 competitors in 4 markets is running 160,000 reads per day. The architecture that scales is one render call per URL with the egress pinned per market, plus a single normalized output schema, not 160,000 ad-hoc fetches.
  • The market dictates the egress. Prices, currency, and availability shift by region and by IP reputation. Pinning the proxy country to the market under measurement keeps every recorded price comparable; mixing US and EU egress on the same SKU produces a price history that means nothing.
  • One canonical schema across competitors. Each retailer's DOM is different; the warehouse table is not. Normalize on extraction: {your_sku, competitor, market, price_value, price_currency, availability, promo_state, captured_at}. Decisions read the warehouse, not the raw HTML.
  • Anti-detection is handled server-side. Each request renders inside the Scrapeless cloud with residential egress, JavaScript execution, and fingerprint randomization. The pipeline sends a URL and a country; it gets rendered HTML back. No browser binaries, no proxy rotation logic, and no third-party CDP client on your machine.
  • The pipeline ends at a diff, not at HTML. Raw rendered pages are scratch storage. The signal pricing teams act on is the diff between your price and the competitor's, per market, per SKU, surfaced to a repricing rule, a Slack alert, or an analyst dashboard.
  • Free to start. New Scrapeless accounts include free runtime; sign up at app.scrapeless.com.

Introduction: From web data to a competitive pricing decision

Competitive pricing teams have lived with the same constraint for years: prices change faster than the data feeds that inform pricing decisions. A retailer revises a sticker price overnight; the BI tile updates 48 hours later; by the time the analyst sees the gap, the promotional window has closed. Web data closes that loop, but only if the collection layer keeps up with the pace of change and feeds a schema the warehouse can join on.

The structural challenge is not "scraping a product page." It is operating a fleet of scrapes across a basket of SKUs, across a basket of competitors, across a basket of markets: every day, every market, every retailer, with the same accuracy guarantees. Each retailer's DOM rotates. Each market's prices localize. Each request needs to clear the retailer's anti-bot layer and return clean, rendered HTML. The Octoparse OptiGroup case study captured the same pattern at scale: 50 subsidiaries, dozens of competitor sites, regional prices, a centralized pricing decision layer.

This guide walks through the architecture and the Python code for the collection layer of a pricing intelligence pipeline on top of Scrapeless. The output is a normalized NDJSON stream feeding a warehouse table; the input is the basket file the analyst defines. Read once for the pattern; reuse for every competitor by changing the per-retailer extractor.


What You Can Do With It

  • Daily competitive basket reads. Track 5,000 SKUs across 8 competitors in 4 markets on a daily schedule with bounded runtime and one canonical schema.
  • Market-specific repricing. Pin the egress country to each market; pull localized prices that reflect what a local shopper actually sees, not a geo-fallback price.
  • Promo-state monitoring. Capture both the listed price and the promotional state (on sale, percent off, time-bound badge) so the warehouse knows the difference between an everyday price and a clearance push.
  • MAP compliance audits. Compare retailer-listed prices against your MAP (minimum advertised price) policy and surface violations to the channel-management team.
  • New-product launch tracking. Watch for first-appearance of competitor SKUs in a category; the pipeline doubles as an "is the competitor about to launch X?" signal.
  • Price elasticity datasets. Daily snapshots over 90 days produce the time series that revenue management uses to compute elasticity at the SKU level.

At Scrapeless, we only access publicly available data while strictly complying with applicable laws, regulations, and website privacy policies. The content in this post is for demonstration purposes only.


Why Scrapeless for competitive pricing

Scrapeless renders each target URL in an anti-detection cloud browser powered by self-developed Chromium and returns the finished HTML over a single API call. For a pricing intelligence pipeline specifically, it brings:

  • Residential proxies in 195+ countries, pinned per request with a country code, so egress geography is one field per market.
  • Cloud-side JavaScript rendering. Retailer product pages are React or Next.js apps; the price element lands after hydration. js_render=True means your pipeline reads the post-paint DOM, not the SSR shell.
  • Server-side anti-detection. UA, timezone, WebGL, canvas, and headless flags are randomized in the cloud per request. No local stealth-plugin maintenance, no browser binaries to install.
  • A stateless request shape. Each product page is an independent read: send a URL plus a country, get rendered HTML back. That maps cleanly onto a basket of thousands of independent SKU reads.
  • One API key for the whole pipeline. Rendering, residential proxies, and the SDK all bill against the same Scrapeless account; no per-tier integration.

Get your API key on the free plan at app.scrapeless.com.


Prerequisites

  • Python 3.10 or newer
  • A Scrapeless account and API key (sign up at app.scrapeless.com)
  • Familiarity with requests-style HTTP and a CSS-selector library
  • A target competitor list and an SKU basket file

Pipeline architecture at a glance

basket.yaml                      (analyst-defined input)
       │
       ▼
┌──────────────────┐
│   orchestrator   │  one task per (market, competitor, SKU); bounded concurrency
└──────┬───────────┘
       │
       ▼
┌──────────────────┐
│   Scrapeless     │  client.universal.scrape(url, country): residential egress,
│  (cloud render)  │  JS rendering, anti-detection, all server-side
└──────┬───────────┘
       │  rendered HTML
       ▼
┌──────────────────┐
│   normalizer     │  per-retailer extractor → canonical schema
└──────┬───────────┘
       │
       ▼
prices.ndjson          (one row per (product, competitor, market, day))
       │
       ▼
warehouse load + diff vs your prices + alert

Each stage is a Python module; the seven steps below build it bottom-up.


Step 1: Install the Scrapeless SDK

pip install scrapeless lxml cssselect pyyaml

scrapeless is the official Python SDK; it renders pages cloud-side and returns HTML, so there are no browser binaries and no third-party automation library to install. lxml is the parser, cssselect is the separate package that lxml's .cssselect() method depends on (used by the extractors in Step 5), and pyyaml reads the basket config.


Step 2: Define the basket

The pricing team owns this file. Keep it boring: markets, competitors, SKU mappings. One row per (your_sku, competitor, competitor_url, market):

# basket.yaml
markets:
  - US
  - GB
  - DE
  - JP

basket:
  - your_sku: SKU-1001
    name: "Acme Widget Pro"
    competitors:
      - retailer: target_competitor_a
        url:
          US: "https://competitor-a.com/p/widget-pro"
          GB: "https://competitor-a.co.uk/p/widget-pro"
          DE: "https://competitor-a.de/p/widget-pro"
          JP: "https://competitor-a.co.jp/p/widget-pro"
      - retailer: target_competitor_b
        url:
          US: "https://competitor-b.com/products/widget-pro"
          GB: "https://competitor-b.co.uk/products/widget-pro"

A 5,000-SKU basket lives in the same shape; the warehouse joins on your_sku to align against your own price feed.
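To make the shape concrete, here is a minimal sketch that flattens the nested basket into one independent task per (your_sku, retailer, market). It assumes the file has already been parsed into a dict with the layout shown above; `Task` and `expand_tasks` are hypothetical names, and skipping URLs for undeclared markets is an illustrative choice rather than something the basket format requires.

```python
from typing import Iterator, NamedTuple

class Task(NamedTuple):
    your_sku: str
    retailer: str
    market: str
    url: str

def expand_tasks(basket: dict) -> Iterator[Task]:
    """Flatten the nested basket into one task per (SKU, retailer, market)."""
    allowed = set(basket.get("markets", []))
    for item in basket["basket"]:
        for comp in item["competitors"]:
            for market, url in comp["url"].items():
                # Skip URLs for markets not declared at the top of the file.
                if allowed and market not in allowed:
                    continue
                yield Task(item["your_sku"], comp["retailer"], market, url)

# A two-market sample in the same shape as basket.yaml.
sample = {
    "markets": ["US", "GB"],
    "basket": [{
        "your_sku": "SKU-1001",
        "competitors": [
            {"retailer": "target_competitor_a",
             "url": {"US": "https://competitor-a.com/p/widget-pro",
                     "GB": "https://competitor-a.co.uk/p/widget-pro",
                     "DE": "https://competitor-a.de/p/widget-pro"}},
            {"retailer": "target_competitor_b",
             "url": {"US": "https://competitor-b.com/products/widget-pro"}},
        ],
    }],
}
tasks = list(expand_tasks(sample))
```

At full coverage the same expansion yields 5,000 x 8 x 4 = 160,000 tasks; in practice the count is lower wherever a competitor does not sell a SKU in a given market.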


Step 3: Render a product page through Scrapeless

One render call per (market, competitor, SKU). The country pin sets the residential egress; js_render=True returns the post-hydration DOM:

import os
from scrapeless import Scrapeless
from scrapeless.types.universal import (
    UniversalScrapingRequest, UniversalJsRenderInput, UniversalProxy,
)

client = Scrapeless()  # reads SCRAPELESS_API_KEY from env

def scrape_rendered(url: str, market: str) -> str:
    """Render one product page in the Scrapeless cloud and return the HTML."""
    request = UniversalScrapingRequest(
        actor="unlocker.webunlocker",
        input=UniversalJsRenderInput(url=url, js_render=True, headless=True),
        proxy=UniversalProxy(country=market),
    )
    return client.universal.scrape(request)  # returns rendered HTML (str)

The country pin is the load-bearing field. The same product URL renders a different price, currency, and availability state per region, so pinning the egress keeps every recorded price on the same market. js_render=True waits for the page to paint before returning, so React/Vue/Next.js retailers return the price element, not an empty shell.
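Individual renders can still fail transiently, so it may be worth wrapping the call in a small retry helper. The sketch below is deliberately SDK-agnostic: `fetch` is any callable taking (url, market) and returning HTML, such as scrape_rendered above. The attempt count, the backoff delays, and the decision to treat an empty body as a failure are all assumptions to tune, and `fetch_with_retry` is a hypothetical name.

```python
import time
from typing import Callable, Optional

def fetch_with_retry(
    fetch: Callable[[str, str], str],
    url: str,
    market: str,
    attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Call fetch(url, market), retrying on exceptions or empty responses."""
    last_error: Optional[Exception] = None
    for attempt in range(attempts):
        try:
            html = fetch(url, market)
            if html:  # assumption: an empty body counts as a failed render
                return html
            last_error = ValueError("empty response")
        except Exception as exc:  # narrow to the SDK's error types in practice
            last_error = exc
        if attempt < attempts - 1:
            sleep(base_delay * 2 ** attempt)  # exponential backoff: 1s, 2s, 4s
    raise RuntimeError(f"giving up on {url} ({market})") from last_error
```

Passing `sleep` as a parameter keeps the helper testable without real delays.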


Step 4: Walk the basket

Each SKU is an independent render call, so the basket walk is a plain loop (or a bounded thread pool for parallelism). There is no session to hold and no homepage to warm, because the cloud render clears the retailer's anti-bot layer per request:

import yaml

def load_basket(path: str = "basket.yaml") -> dict:
    with open(path, encoding="utf-8") as f:
        return yaml.safe_load(f)

def walk_basket(basket: dict):
    """Yield (your_sku, retailer, market, url, html) for every basket entry."""
    for item in basket["basket"]:
        for comp in item["competitors"]:
            for market, url in comp["url"].items():
                html = scrape_rendered(url, market)
                yield item["your_sku"], comp["retailer"], market, url, html

For a 5,000-SKU basket, wrap scrape_rendered in a concurrent.futures.ThreadPoolExecutor and cap the worker count to a level the account plan allows. Each call is stateless, so parallelism scales by adding workers; there is no shared session to contend on.
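A minimal sketch of that bounded fan-out, using only the standard library. `walk_parallel` is a hypothetical name, `fetch` stands in for scrape_rendered (or a retrying wrapper around it), and the default of 8 workers is a placeholder rather than a documented plan limit.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Iterable, Iterator, Optional, Tuple

Task = Tuple[str, str, str, str]  # (your_sku, retailer, market, url)

def walk_parallel(
    tasks: Iterable[Task],
    fetch: Callable[[str, str], str],
    max_workers: int = 8,  # assumption: set this to what the plan allows
) -> Iterator[Tuple[Task, Optional[str], Optional[Exception]]]:
    """Yield (task, html, error) as each render finishes, in completion order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Map each future back to its task so results can be labelled.
        futures = {pool.submit(fetch, t[3], t[2]): t for t in tasks}
        for fut in as_completed(futures):
            task = futures[fut]
            try:
                yield task, fut.result(), None
            except Exception as exc:
                # One failed SKU should not abort the other reads.
                yield task, None, exc
```

Yielding the error alongside the task, instead of raising, lets the caller log the failed (SKU, market) pair and carry on with the remaining reads.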



Step 5: Extract into the canonical schema

Each retailer's DOM is different; the warehouse table is not. The extractor's job is to turn whatever the retailer renders into the same shape every time. The output schema (one row per (your_sku, competitor, market, captured_at)):

from dataclasses import dataclass, asdict
from datetime import datetime, timezone
from typing import Optional
from lxml import html as lxml_html

@dataclass
class PriceRecord:
    your_sku: str
    competitor: str
    market: str
    url: str
    price_value: Optional[float]
    price_currency: Optional[str]
    availability: Optional[str]       # "in_stock" | "out_of_stock" | "preorder" | None
    promo_state: Optional[str]        # "none" | "on_sale" | "clearance" | None
    promo_discount_pct: Optional[float]
    captured_at: str                  # ISO-8601 UTC

Per-retailer extractors plug into the same return type:

def extract_competitor_a(html: str, your_sku: str, market: str, url: str) -> PriceRecord:
    doc = lxml_html.fromstring(html)

    price_el = doc.cssselect("[data-test='price'] .value")
    currency_el = doc.cssselect("[data-test='price'] .currency")
    availability_el = doc.cssselect("[data-test='availability']")
    promo_el = doc.cssselect("[data-test='promo-badge']")

    availability = (
        "in_stock" if availability_el and "In stock" in availability_el[0].text_content()
        else "out_of_stock" if availability_el
        else None
    )

    return PriceRecord(
        your_sku=your_sku,
        competitor="target_competitor_a",
        market=market,
        url=url,
        price_value=_to_float(price_el[0].text_content()) if price_el else None,
        price_currency=currency_el[0].text_content().strip() if currency_el else None,
        availability=availability,
        promo_state="on_sale" if promo_el else "none",
        promo_discount_pct=_to_float(promo_el[0].get("data-discount-pct")) if promo_el else None,
        captured_at=datetime.now(timezone.utc).isoformat(),
    )

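def _to_float(text) -> Optional[float]:
    """Parse a rendered price such as "$1,299.00", "1.299,00 €", or "¥12,800"."""
    if not text:
        return None
    cleaned = "".join(c for c in text if c.isdigit() or c in ".,")
    # Heuristic: the last separator is the decimal mark unless exactly three
    # digits follow it, in which case it is read as a thousands separator.
    pos = max(cleaned.rfind("."), cleaned.rfind(","))
    if pos != -1 and len(cleaned) - pos - 1 != 3:
        whole, frac = cleaned[:pos], cleaned[pos + 1:]
    else:
        whole, frac = cleaned, ""
    whole = whole.replace(".", "").replace(",", "")
    try:
        return float(f"{whole}.{frac}" if frac else whole)
    except (ValueError, TypeError):
        return None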

Every retailer gets its own extract_<name> function; every function returns the same PriceRecord. The orchestrator does not know what DOM each retailer uses, only the function name to call.
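One way to wire that lookup is a small registry keyed by the retailer name from the basket file. This is a sketch of the dispatch idea only: `EXTRACTORS`, `register`, and `extract` are hypothetical names, and the stand-in extractor returns a plain dict so the block runs on its own, where the pipeline's extractors would return a PriceRecord.

```python
from typing import Callable, Dict

# Signature shared by every extractor: (html, your_sku, market, url) -> record
Extractor = Callable[[str, str, str, str], dict]
EXTRACTORS: Dict[str, Extractor] = {}

def register(retailer: str) -> Callable[[Extractor], Extractor]:
    """Decorator that files an extractor under its retailer key."""
    def wrap(fn: Extractor) -> Extractor:
        EXTRACTORS[retailer] = fn
        return fn
    return wrap

@register("target_competitor_a")
def extract_a(html: str, your_sku: str, market: str, url: str) -> dict:
    # Stand-in body; a real extractor would parse the HTML here.
    return {"your_sku": your_sku, "competitor": "target_competitor_a",
            "market": market, "url": url}

def extract(retailer: str, html: str, your_sku: str, market: str, url: str) -> dict:
    """Look up the extractor by retailer name and run it."""
    try:
        fn = EXTRACTORS[retailer]
    except KeyError:
        raise LookupError(f"no extractor registered for {retailer!r}") from None
    return fn(html, your_sku, market, url)
```

Failing loudly on an unregistered retailer surfaces a basket entry that has no extractor yet, rather than silently dropping its rows.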

Selector design notes:

  • Prefer [data-test='...'] attributes when retailers expose them. They survive cosmetic class-name rotation; classes like .text-lg.font-semibold change every release.
  • Treat absent fields as nullable. A None price for an out-of-stock product is data, not a failure.
  • Capture the currency string the retailer renders. Don't infer currency from market; some retailers list USD on their .de domain for cross-border products. Store what the page says.

Step 6: Stream to NDJSON for warehouse load

Stream-write to NDJSON so the pipeline survives mid-run interruptions without losing records. Each row is one rendered SKU; the file is append-only:

import json
from pathlib import Path

def append_records(records: list[PriceRecord], out_path: str = "prices.ndjson"):
    Path(out_path).parent.mkdir(parents=True, exist_ok=True)
    with open(out_path, "a", encoding="utf-8") as f:
        for r in records:
            f.write(json.dumps(asdict(r)) + "\n")

NDJSON loads directly into Snowflake (COPY INTO ... FILE_FORMAT = (TYPE = JSON)), BigQuery (bq load --source_format=NEWLINE_DELIMITED_JSON), Redshift, ClickHouse, and DuckDB. Pick whichever the BI stack already uses; the schema is the same.
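Because the file is append-only, a restarted run can read back what it already wrote and skip those reads. The sketch below assumes the record fields from Step 5 and that a run is identified by the date prefix of captured_at; `completed_keys` is a hypothetical helper, not part of any SDK.

```python
import json
from pathlib import Path

def completed_keys(out_path: str, day: str) -> set:
    """Return (your_sku, competitor, market) keys already written for `day`.

    `day` is a YYYY-MM-DD prefix matched against the ISO-8601 captured_at.
    """
    done = set()
    path = Path(out_path)
    if not path.exists():
        return done
    with path.open(encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            try:
                row = json.loads(line)
            except json.JSONDecodeError:
                continue  # a half-written final line from an interrupted run
            if str(row.get("captured_at", "")).startswith(day):
                done.add((row["your_sku"], row["competitor"], row["market"]))
    return done
```

The orchestrator can filter its task list against this set before submitting work, so a rerun after a crash only pays for the reads that are still missing.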


Step 7: Compute diffs and route pricing decisions

The signal the pricing team acts on is not the raw price; it is the diff between the competitor's price and yours, per market, per SKU. The diff lives in the warehouse, not in the scraper:

-- Daily price gap, per SKU per competitor per market
WITH yours AS (
  SELECT sku, market, list_price, currency, captured_date
  FROM your_internal_prices
  WHERE captured_date = CURRENT_DATE
),
theirs AS (
  SELECT your_sku, competitor, market, price_value, price_currency,
         availability, promo_state, CAST(captured_at AS DATE) AS captured_date
  FROM competitor_prices
  WHERE CAST(captured_at AS DATE) = CURRENT_DATE
)
SELECT
  t.your_sku,
  t.competitor,
  t.market,
  y.list_price                                  AS our_price,
  t.price_value                                 AS their_price,
  ROUND(100.0 * (y.list_price - t.price_value) / NULLIF(t.price_value, 0), 2)
                                                AS price_gap_pct,
  t.availability,
  t.promo_state
FROM theirs t
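LEFT JOIN yours y
  ON y.sku = t.your_sku AND y.market = t.market
 AND y.currency = t.price_currency  -- compare like-for-like currencies only (assumes both sides store the same code format)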
WHERE y.list_price IS NOT NULL
  AND t.price_value IS NOT NULL
ORDER BY price_gap_pct DESC;

Route the rows where price_gap_pct exceeds the threshold the pricing rule defines:

  • Above your-price threshold (e.g. you are 5%+ more expensive than the leader) → repricing review.
  • Below MAP threshold → MAP-violation alert to channel management.
  • Promo-state change since yesterday → competitive-promo notification to category managers.

The diff query is the contract between collection and decision. As long as the warehouse schema stays stable, the pricing team's downstream BI tiles, alerts, and pricing rules never change when a retailer rotates its DOM β€” only the per-retailer extractor in Step 5 changes.
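The three routing rules can be sketched as a small function over one diff row. The gap formula mirrors the SQL above, and the 5% default comes from the example threshold in the list. The `map_price` and `prev_promo_state` fields are hypothetical inputs that would come from a MAP policy table and yesterday's snapshot; they are not columns the query above produces.

```python
from typing import Optional

def price_gap_pct(our_price: float, their_price: float) -> Optional[float]:
    """Mirror the SQL: 100 * (ours - theirs) / theirs, rounded to 2 places."""
    if not their_price:
        return None  # same role as NULLIF(t.price_value, 0)
    return round(100.0 * (our_price - their_price) / their_price, 2)

def route(row: dict, reprice_threshold_pct: float = 5.0) -> list:
    """Return the list of actions a single diff row triggers."""
    actions = []
    gap = price_gap_pct(row["our_price"], row["their_price"])
    if gap is not None and gap >= reprice_threshold_pct:
        actions.append("repricing_review")
    map_price = row.get("map_price")  # hypothetical: from the MAP policy table
    if map_price is not None and row["their_price"] < map_price:
        actions.append("map_violation_alert")
    prev = row.get("prev_promo_state")  # hypothetical: yesterday's promo_state
    if prev is not None and prev != row.get("promo_state"):
        actions.append("competitive_promo_notification")
    return actions
```

A row can trigger more than one action, which is why the function returns a list instead of a single label.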


What You Get Back

One NDJSON row per (your_sku, competitor, market, day), shaped like this:

{
  "your_sku": "SKU-1001",
  "competitor": "target_competitor_a",
  "market": "US",
  "url": "https://competitor-a.com/p/widget-pro",
  "price_value": 79.99,
  "price_currency": "USD",
  "availability": "in_stock",
  "promo_state": "on_sale",
  "promo_discount_pct": 15.0,
  "captured_at": "<ISO-8601 UTC timestamp written at read time>"
}

Honest observations from running the pattern:

  • Rendering timing matters more than DOM specificity. A selector that runs against the SSR shell returns an empty string before the price element paints. js_render=True returns the post-hydration DOM, which is what makes the price selector resolve.
  • Currency is not redundant with market. Cross-border SKUs sometimes list a non-local currency even on a localized domain. Store the rendered string; let the warehouse layer normalize.
  • Promo state has at least three values, not two. none, on_sale, and clearance behave differently in repricing rules: a clearance markdown signals end-of-life, not a promotional push.
  • Availability is the second-most-actionable field. A 20% price gap on an out-of-stock SKU is not the same competitive signal as the same gap on an in-stock SKU. Surface both to the decision layer.
  • One canonical schema is the load-bearing decision. Per-retailer fields, currency conventions, and promo formats vary; the warehouse table does not. Push the variability into the extractor functions, keep the schema flat.

Conclusion: scale your competitive pricing pipeline

The pipeline reduces to six moves: define the basket → render each SKU through Scrapeless with the egress pinned per market → extract into a canonical schema → stream to NDJSON → load the warehouse → diff against your own prices. Each step is small enough to read; the composition handles 5,000 SKUs across 8 competitors and 4 markets on a single daily cron.

For a vendor-comparison view of pricing-adjacent scraping (real-estate pricing in particular), the Best Zillow Scrapers in 2026 listicle ranks eight tools against the same kind of localized-price extraction challenge. For loading the NDJSON output into a cloud warehouse, the Scrapeless + Snowflake data ingestion guide walks the COPY INTO and streaming paths.

Pin the egress country per market, render each SKU independently, normalize on extraction, store one canonical row per SKU/competitor/market/day, and diff in the warehouse, not the scraper.


Ready to Build Your AI-Powered Data Pipeline?

Join our community to claim a free plan and connect with developers building competitive-pricing pipelines: Discord · Telegram.

Sign up at app.scrapeless.com for free runtime and adapt the patterns above to the markets, competitors, and SKU baskets the pricing pipeline needs. Pricing details at scrapeless.com/en/pricing; the proxy solutions product page is at scrapeless.com/en/product/proxy-solutions; full SDK reference at docs.scrapeless.com.


FAQ

Q1: Is scraping competitor prices legal?

Pricing is public information on retailer product pages, and price comparison is a well-established commercial practice. Legality depends on what you scrape, from where, and under what terms. Publicly visible data is generally accessible; site terms of service, regional privacy laws (GDPR, CCPA), and copyright apply. Consult counsel for high-stakes use cases. Scrapeless accesses publicly available data only.

Q2: Do I need a proxy for competitive pricing?

Yes, and the country pin matters more than the IP rotation. Retailers localize prices by market; a US-egress request to a .co.uk domain may return a fallback price, a redirect, or a geo-block. Pin the country to the market under measurement via UniversalProxy(country=...). Scrapeless residential proxies in 195+ countries cover the typical pricing basket without bringing a separate proxy provider into the stack.

Q3: How do I handle anti-bot challenges and bot detection?

The rendering runs server-side in the Scrapeless cloud with residential egress, real JavaScript execution, and randomized fingerprinting, so the request that reaches the retailer looks like an ordinary browser from a residential IP in the target market. Set js_render=True so the response is the post-hydration DOM rather than a pre-render shell, and pin the country to the market you measure.

Q4: How often should the pipeline run?

Daily is the canonical cadence for repricing decisions; hourly is realistic for promotional-window monitoring where prices change within the day. Per-SKU cost is bounded by a single render call, so a 5,000-SKU basket at daily cadence is well inside a single-cron-shop budget. Higher frequencies add cost linearly, so pick the cadence the pricing decision actually consumes.
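A back-of-the-envelope calculation helps when choosing between cadences. The sketch below estimates wall-clock time for one full pass and the worker count needed to fit a time window. The function names are hypothetical, and the 6-second average render time in the usage note is an assumed figure, not a measured Scrapeless latency.

```python
import math

def estimate_runtime_minutes(skus: int, competitors: int, markets: int,
                             avg_render_seconds: float, workers: int) -> float:
    """Rough wall-clock estimate for one full basket pass."""
    reads = skus * competitors * markets
    batches = math.ceil(reads / workers)  # each worker handles one read at a time
    return batches * avg_render_seconds / 60.0

def workers_needed(reads: int, avg_render_seconds: float, window_minutes: float) -> int:
    """Smallest worker count that finishes `reads` inside the window."""
    return math.ceil(reads * avg_render_seconds / (window_minutes * 60.0))
```

Under the assumed 6 seconds per render, `estimate_runtime_minutes(5000, 8, 4, 6.0, 50)` comes to 320 minutes for the 160,000-read basket, and `workers_needed(160000, 6.0, 240)` gives 67 workers to finish inside four hours. Measure the real latency before sizing the pool.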

Q5: What happens when a retailer rotates its DOM?

The per-retailer extractor in Step 5 is the only file that changes. The canonical schema, the warehouse table, the BI tiles, the diff query, and the alerting rules are all unaffected. Re-check selectors when a retailer ships a release; prefer [data-test='...'] attributes when available; treat the extractor as the volatile layer and the schema as the stable layer.
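A rotated DOM usually shows up as a sudden rise in null prices for one retailer, so a simple monitor over the day's records can flag which extractor to re-check. This is a sketch under assumptions: rows are dicts in the canonical schema, out-of-stock rows are excluded because a null price there is expected, and the 20% alert threshold is a placeholder to tune.

```python
from collections import defaultdict

def null_price_rates(rows: list) -> dict:
    """Fraction of rows with a null price_value per competitor, ignoring out-of-stock rows."""
    totals = defaultdict(int)
    nulls = defaultdict(int)
    for row in rows:
        if row.get("availability") == "out_of_stock":
            continue  # a null price on an out-of-stock SKU is expected data
        totals[row["competitor"]] += 1
        if row.get("price_value") is None:
            nulls[row["competitor"]] += 1
    return {c: nulls[c] / totals[c] for c in totals}

def suspect_extractors(rows: list, max_null_rate: float = 0.2) -> list:
    """Competitors whose null rate exceeds the (assumed) alert threshold."""
    return sorted(c for c, rate in null_price_rates(rows).items() if rate > max_null_rate)
```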

Q6: Can I run multiple retailers in parallel?

Yes. Each render call is stateless, so the orchestrator fans out (market, competitor, SKU) tasks across a thread pool and caps the worker count to the level the account plan allows. Parallelism scales by adding workers, not by sharing a session; there is no held connection to contend on.

Q7: How do I capture promo state and discount percentages?

The Step 5 extractor reads the promo badge directly from the rendered DOM and stores promo_state and promo_discount_pct as separate fields. The schema allows "on_sale", "clearance", and "none"; the sample extractor only distinguishes "on_sale" from "none", so detecting clearance needs a retailer-specific rule. The diff query carries promo_state through, so the pricing rule can branch on "is the competitor on sale right now?" vs "what is the competitor's everyday price?"

Q8: What about international currencies and FX?

Store the rendered currency string per record (USD, EUR, JPY, GBP). Currency conversion belongs in the warehouse layer, not the scraper: keep the raw price, raw currency, and market in the NDJSON, and run a daily FX cross-join on the BI side. That way one bad FX rate doesn't poison the entire history.
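The conversion itself is a lookup against a per-day rates table, applied at read time so the stored raw values never change. The sketch below shows the idea in Python for consistency with the rest of the pipeline; `to_reporting_currency` is a hypothetical helper and the rates are placeholder values, not real exchange rates.

```python
from typing import Optional

def to_reporting_currency(price_value: Optional[float], price_currency: Optional[str],
                          rates: dict, reporting: str = "USD") -> Optional[float]:
    """Convert a raw price with a per-day table of rates into `reporting`."""
    if price_value is None or price_currency is None:
        return None
    if price_currency == reporting:
        return round(price_value, 2)
    rate = rates.get(price_currency)
    if rate is None:
        return None  # unknown currency: leave the gap visible instead of guessing
    return round(price_value * rate, 2)

# Placeholder rates for illustration only (units of USD per one unit of currency).
rates = {"EUR": 1.10, "GBP": 1.25, "JPY": 0.0065}
```

Because the raw price and currency stay untouched in the NDJSON, correcting a bad rate means rerunning the conversion and nothing else.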

At Scrapeless, we only access publicly available data while strictly complying with applicable laws, regulations, and website privacy policies. The content in this blog is for demonstration purposes only and does not involve any illegal or infringing activities. We make no guarantees and disclaim all liability for the use of information from this blog or third-party links. Before engaging in any scraping activities, consult your legal advisor and review the target website's terms of service or obtain the necessary permissions.
