How to Stream Web Data Into Snowflake with Scrapeless and Snowpipe Streaming

Alex Johnson

Senior Web Scraping Engineer

20-May-2026

TL;DR:

Land scraped web data in Snowflake without a fixed schema. Scrapeless Scraping Browser renders a page in a cloud browser and emits newline-delimited JSON (NDJSON); Snowflake ingests it into a VARIANT column, so new fields never break the load.
Four ingestion methods, one data shape. Bulk COPY INTO for one-shot loads, Snowpipe for continuous file-based batches, Snowpipe Streaming for low-latency rows, and the Kafka connector for event-driven pipelines — all read the same NDJSON Scrapeless produces.
Snowpipe Streaming's high-performance architecture is generally available (GA since September 2025). It writes rows directly through an SDK (Java, Python, Node.js) or REST, with channels, offset tokens, and exactly-once recovery — no staged files.
Schema-on-read keeps scraped data flexible. Query a VARIANT column with col:field::type notation and unnest arrays with LATERAL FLATTEN — no migration when the source page adds a field.
The Snowflake CLI (snow) is the current tool. pip install snowflake-cli, then snow sql -f ingest.sql runs the whole setup from one file.
Free to start. New Scrapeless accounts include free Scraping Browser runtime — sign up at app.scrapeless.com.

Introduction: from rendered page to a Snowflake table

Analytics teams increasingly want web data — product catalogs, listings, reviews, market signals — in the same warehouse as their first-party data, so it can join, model, and feed BI. Snowflake is a common destination because its VARIANT type stores semi-structured JSON natively and makes it queryable with SQL.

The friction is the gap between the two systems. Scraped pages are JavaScript-rendered and behind anti-bot defenses; the data often arrives as nested JSON whose shape drifts as the source site changes. Hand-built loaders that map every field to a column break the first time the page adds one.

This post walks through a terminal-first workflow that closes that gap. Scrapeless Scraping Browser handles the rendering and anti-detection side and emits NDJSON; Snowflake ingests it four different ways depending on how fresh the data needs to be. The example producer is the public scraping sandbox books.toscrape.com, so every command below is reproducible — the same pattern applies to harder targets (see the sibling Best Zillow Scrapers in 2026 and Best Amazon Scrapers in 2026 guides).

What You Can Do With It

Build a web-data lakehouse. Land scraped catalogs and listings in Snowflake and join them to internal sales or inventory data.
Run scheduled market snapshots. Drop a new NDJSON file per run into a stage and let Snowpipe auto-load it within minutes.
Feed near-real-time dashboards. Stream scraped events row-by-row with Snowpipe Streaming for sub-minute freshness.
Bridge an existing Kafka backbone. Push scraped records onto a topic and let the Snowflake Kafka connector land them.
Keep schema flexible. Store the raw JSON in VARIANT and shape it at query time, so source-site changes never block a load.

Why Scrapeless Scraping Browser

Scrapeless Scraping Browser is a customizable, anti-detection cloud browser designed for web crawlers and AI agents. As the producer side of a Snowflake pipeline, it brings:

Cloud-side JavaScript rendering, so the data is present in the DOM before extraction
Residential proxies in 195+ countries, pinned per session
Anti-detection browser fingerprinting
A single scrapeless-scraping-browser CLI surface whose eval returns JSON you can reshape into NDJSON in one step
Session persistence for multi-page crawls

Get your API key on the free plan at app.scrapeless.com.

Prerequisites

Node.js 18 or newer
A Scrapeless account and API key — sign up at app.scrapeless.com
A Snowflake account with a role that can create databases, warehouses, stages, and pipes
The Snowflake CLI: pip install snowflake-cli (Python 3.10+)
For Snowpipe auto-ingest and external stages: a cloud bucket (AWS S3, GCS, or Azure) and permission to create a storage integration
jq is optional (a Node one-liner fallback is shown for the NDJSON conversion)

Configure a Snowflake connection once, in ~/.snowflake/config.toml:

toml Copy

[connections.demo]
account   = "myorg-myaccount"
user      = "jondoe"
password  = "your_password_here"
warehouse = "ingest_wh"
database  = "web_data"
schema    = "raw"
role      = "sysadmin"

Then snow sql -c demo -q "SELECT CURRENT_VERSION();" confirms it works.

The pipeline at a glance

Copy

Scrapeless Scraping Browser   →   NDJSON file   →   Snowflake stage   →   table (VARIANT)
   (render + extract)              (one object        (internal or          COPY INTO  | one-shot
                                    per line)          external bucket)      Snowpipe   | continuous
                                                                             Streaming  | low-latency
                                                                             Kafka      | event-driven

The shape never changes: Scrapeless emits one JSON object per line, the file lands in a stage, and one of four methods loads it into a VARIANT column you query with SQL.

Step 1 — Produce NDJSON with Scrapeless

Install the CLI and set your key:

bash Copy

npm install -g scrapeless-scraping-browser
scrapeless-scraping-browser config set apiKey your_api_token_here

Open one cloud session, navigate to the catalog page, wait for a stable marker, and extract the book records with eval. The new-session JSON nests the id under data.taskId — use jq, or the portable grep fallback shown:

bash Copy

# open a session and capture the task id (jq path is .data.taskId)
SID=$(scrapeless-scraping-browser new-session --name books --ttl 300 --proxy-country US --json | jq -r '.data.taskId')
# no jq? portable fallback:
# SID=$(scrapeless-scraping-browser new-session --name books --ttl 300 --proxy-country US --json | grep -oE '"taskId":"[^"]*"' | head -1 | cut -d'"' -f4)

# render the catalog page, then wait for the product grid
scrapeless-scraping-browser --session-id "$SID" open "https://books.toscrape.com/catalogue/page-1.html"
scrapeless-scraping-browser --session-id "$SID" wait "article.product_pod"

# extract one record per book; the eval returns a JSON array
scrapeless-scraping-browser --session-id "$SID" eval '
  JSON.stringify(Array.from(document.querySelectorAll("article.product_pod")).map(el => ({
    title: el.querySelector("h3 a")?.getAttribute("title") ?? null,
    price: el.querySelector(".price_color")?.textContent.trim() ?? null,
    rating: el.querySelector("p.star-rating")?.className.replace("star-rating", "").trim() ?? null,
    in_stock: /In stock/i.test(el.querySelector(".availability")?.textContent ?? ""),
    url: el.querySelector("h3 a")?.href ?? null
  })))
' > books.raw.json

scrapeless-scraping-browser --session-id "$SID" close

Convert the array to NDJSON — one object per line, which is the format Snowflake's loaders read most cleanly:

bash Copy

# with jq
jq -c '.[]' books.raw.json > books.ndjson

# or, without jq, a Node one-liner
node -e 'JSON.parse(require("fs").readFileSync("books.raw.json","utf8")).forEach(o=>console.log(JSON.stringify(o)))' > books.ndjson

books.ndjson now holds one self-contained JSON object per line. Scrapeless Scraping Browser manages session health and anti-detection resilience on the cloud side, so the page renders fully before eval runs without you scripting recovery logic.

Step 2 — Prepare Snowflake

Create the warehouse, database, schema, a JSON file format, and a landing table with a single VARIANT column. Save this as setup.sql and run it with snow sql -c demo -f setup.sql:

sql Copy

CREATE WAREHOUSE IF NOT EXISTS ingest_wh WITH WAREHOUSE_SIZE = 'XSMALL' AUTO_SUSPEND = 60;
CREATE DATABASE  IF NOT EXISTS web_data;
CREATE SCHEMA    IF NOT EXISTS web_data.raw;

USE WAREHOUSE ingest_wh;
USE SCHEMA web_data.raw;

-- NDJSON: one JSON object per line, so do not strip an outer array
CREATE OR REPLACE FILE FORMAT ndjson_format
  TYPE = JSON
  STRIP_OUTER_ARRAY = FALSE
  COMPRESSION = AUTO;

-- land the raw record as-is; shape it at query time
CREATE OR REPLACE TABLE raw_books (
  src       VARIANT,
  loaded_at TIMESTAMP_NTZ DEFAULT CURRENT_TIMESTAMP()
);

STRIP_OUTER_ARRAY = FALSE is correct for NDJSON because each line is already its own object — STRIP_OUTER_ARRAY = TRUE is only for a file that is one big [ ... ] array.

Step 3 — Method 1: One-shot bulk load with `COPY INTO`

For a single file or a manual batch, stage the file and run COPY INTO. The simplest path is an internal named stage plus PUT:

sql Copy

-- a named internal stage bound to the JSON format
CREATE OR REPLACE STAGE books_stage FILE_FORMAT = ndjson_format;

bash Copy

# upload the local NDJSON to the internal stage (snow CLI runs PUT)
snow sql -c demo -q "PUT file://$(pwd)/books.ndjson @books_stage AUTO_COMPRESS=TRUE OVERWRITE=TRUE"

sql Copy

-- load every object as one row into the VARIANT column
COPY INTO raw_books (src)
  FROM @books_stage
  FILE_FORMAT = (FORMAT_NAME = 'ndjson_format')
  ON_ERROR = 'CONTINUE';

To map JSON keys straight to typed columns instead of a VARIANT, create a table whose column names match the keys and use MATCH_BY_COLUMN_NAME:

sql Copy

CREATE OR REPLACE TABLE books (
  title VARCHAR, price VARCHAR, rating VARCHAR, in_stock BOOLEAN, url VARCHAR
);

COPY INTO books
  FROM @books_stage
  FILE_FORMAT = (TYPE = 'JSON')
  MATCH_BY_COLUMN_NAME = 'CASE_INSENSITIVE';

If you would rather let Snowflake derive the schema from the staged files, INFER_SCHEMA with CREATE TABLE … USING TEMPLATE builds the column list for you:

sql Copy

CREATE OR REPLACE TABLE books_auto
  USING TEMPLATE (
    SELECT ARRAY_AGG(OBJECT_CONSTRUCT(*))
    FROM TABLE(INFER_SCHEMA(
      LOCATION => '@books_stage',
      FILE_FORMAT => 'ndjson_format'
    ))
  );

For data already sitting in a cloud bucket, point an external stage at it instead of uploading. With a storage integration (no inline keys):

sql Copy

CREATE OR REPLACE STAGE books_s3_stage
  URL = 's3://my-bucket/scraped/books/'
  STORAGE_INTEGRATION = my_s3_integration
  FILE_FORMAT = ndjson_format;

COPY INTO raw_books (src) FROM @books_s3_stage;

Get your API key on the free plan: app.scrapeless.com

Step 4 — Method 2: Continuous batches with Snowpipe

When the scraper drops a new file into a bucket on a schedule, Snowpipe loads each file automatically — no manual COPY and no dedicated warehouse. A pipe wraps a COPY INTO statement; with AUTO_INGEST = TRUE, a cloud event notification triggers the load:

sql Copy

CREATE OR REPLACE PIPE books_pipe
  AUTO_INGEST = TRUE
  AWS_SNS_TOPIC = 'arn:aws:sns:us-east-1:123456789012:scraped-bucket'
  AS
  COPY INTO raw_books (src)
  FROM @books_s3_stage
  FILE_FORMAT = (TYPE = 'JSON');

On S3 the event flows through SNS/SQS into a Snowflake-managed queue; GCS uses Pub/Sub and Azure uses Event Grid, each wired with a notification integration. If you would rather call Snowpipe explicitly, leave AUTO_INGEST unset and POST the staged file paths to the insertFiles REST endpoint, then poll insertReport.

Two operational notes from Snowflake's guidance:

Billing is serverless and now charged on a per-GB basis for the data Snowpipe ingests — there is no warehouse to size, and the former per-file component is retired.
File sizing matters. Aim for files around 100–250 MB compressed, and stage no more than about once per minute; staging more frequently than that adds queue-management overhead without reducing latency. Buffer small scrape batches into larger files before staging.

Snowpipe makes data available within minutes, which fits scheduled market snapshots well.

Step 5 — Method 3: Low-latency rows with Snowpipe Streaming

When freshness needs to be seconds, not minutes, Snowpipe Streaming writes rows directly into a table — no staged files. The high-performance architecture has been generally available since September 2025, with SDKs for Java, Python, and Node.js plus a REST API over a shared client core; the classic file-based Snowpipe Streaming is on a deprecation path.

The model has three core concepts:

Channels — a named, long-lived streaming connection to a table. Rows are committed in order within a channel.
Offset tokens — a string your application attaches to each batch. After a restart, getLatestCommittedOffsetToken() tells you the last durably committed position, so you replay only what follows — the basis for exactly-once delivery.
Throughput billing — credits per uncompressed GB ingested, rather than per file.

The Java client shape is small:

java Copy

SnowflakeStreamingIngestClient client =
    SnowflakeStreamingIngestClientFactory.builder("BOOKS_CLIENT")
        .setProperties(props).build();

OpenChannelRequest request = OpenChannelRequest.builder("BOOKS_CHANNEL")
    .setDBName("WEB_DATA").setSchemaName("RAW").setTableName("RAW_BOOKS")
    .setOnErrorOption(OpenChannelRequest.OnErrorOption.CONTINUE)
    .build();

SnowflakeStreamingIngestChannel channel = client.openChannel(request);
channel.insertRow(rowAsMap, offsetToken);   // one scraped record
channel.getLatestCommittedOffsetToken();     // for recovery

Reach for Streaming when scraped records arrive as a continuous flow (an agent emitting events as it crawls) and the dashboard needs them within seconds.

Step 6 — Method 4: Event-driven loads with the Kafka connector

If scraped records already flow over Apache Kafka, the Snowflake Connector for Kafka lands a topic into a table (one topic maps to one table). It runs inside a Kafka Connect worker. A single property selects the ingestion engine underneath:

properties Copy

name=scraped-books-sink
connector.class=com.snowflake.kafka.connector.SnowflakeSinkConnector
topics=scraped_books
snowflake.database.name=WEB_DATA
snowflake.schema.name=RAW
# SNOWPIPE (file-based, default) or SNOWPIPE_STREAMING (low-latency)
snowflake.ingestion.method=SNOWPIPE_STREAMING

Each table the connector creates has two VARIANT columns: RECORD_CONTENT (the message payload) and RECORD_METADATA (topic, partition, offset, timestamps, and headers). Query the payload exactly as you would any VARIANT.

For a fully managed alternative — no Kafka Connect cluster to operate — Snowflake Openflow (generally available, built on Apache NiFi) ingests from Kafka, Kinesis, databases, and SaaS sources into Snowflake through managed pipelines.

Choosing a method

Method	Latency	Data shape	Ops overhead	Use when
`COPY INTO`	Manual	Files in a stage	Lowest	One-shot loads, backfills
Snowpipe	Minutes	Files dropped in a bucket	Low (serverless)	Scheduled scrape batches
Snowpipe Streaming	Seconds	Rows via SDK/REST	Medium (write a client)	Continuous event flow
Kafka connector	Seconds–minutes	Kafka topic records	Medium (Connect worker)	An existing Kafka backbone

Most teams start with COPY INTO to validate the schema, move to Snowpipe once the scraper runs on a schedule, and adopt Streaming or Kafka only when sub-minute freshness justifies the extra moving piece.

Querying the loaded data

Because the raw record lives in a VARIANT, you shape it at read time. Navigate with the : operator and cast with :::

sql Copy

SELECT
  src:title::string   AS title,
  src:price::string   AS price,
  src:rating::string  AS rating,
  src:in_stock::boolean AS in_stock,
  src:url::string     AS url
FROM raw_books;

When a scraped record carries an array — a photos list, a priceHistory series — LATERAL FLATTEN unnests it into rows:

sql Copy

-- raw_listings: a hypothetical scraped-listings table loaded the same way as raw_books.
-- (The books example has no nested array; this shows the pattern for a source that does.)
SELECT
  src:title::string AS title,
  ph.value:date::string  AS price_date,
  ph.value:price::number AS price
FROM raw_listings,
LATERAL FLATTEN(INPUT => src:priceHistory) ph;

No migration is needed when the source page adds a field — it simply appears under src on the next load.

What You Get Back

After COPY INTO, each scraped object is one row in raw_books. The VARIANT column holds the record verbatim; the schema below is normative and field values are illustrative samples:

json Copy

// illustrative sample — one row's src VARIANT, as emitted by the Step 1 extractor.
{
  "title": "A Light in the Attic",
  "price": "£51.77",
  "rating": "Three",
  "in_stock": true,
  "url": "https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html"
}

A few honest observations:

Prices arrive as display strings. Cast and clean in SQL (REPLACE(src:price::string, '£', '')::number) rather than expecting numbers from the page.
Conditional fields are nullable. A field absent on a given page is simply missing from src; VARIANT access returns NULL rather than erroring.
MATCH_BY_COLUMN_NAME skips unmatched keys. New keys land in a VARIANT table automatically but are dropped by a typed table until you add the column.
Snowpipe load history is retained 14 days in pipe metadata; bulk COPY history is retained 64 days in table metadata — keep that in mind when auditing a backfill.

Conclusion

Getting scraped web data into Snowflake reduces to four moves: render and extract with Scrapeless, emit NDJSON, stage the file, and load it with the method whose latency fits — COPY INTO, Snowpipe, Snowpipe Streaming, or the Kafka connector. Land the raw record in a VARIANT column and shape it at query time, so the pipeline survives the source site adding fields.

Pin US egress when the target needs it, keep the Scrapeless session chain inside one shell invocation, follow the discover → extract pattern, and treat absent fields as nullable. For harder targets than the sandbox used here, the same producer pattern carries over — see the sibling Best Zillow Scrapers in 2026 guide, the Scraping Browser product page, and the Scrapeless docs.

Ready to Build Your AI-Powered Data Pipeline?

Join our community to claim a free plan and connect with developers building web-data-to-warehouse pipelines: Discord · Telegram.

Sign up at app.scrapeless.com for free Scraping Browser runtime, and see scrapeless.com/en/pricing to scale session minutes and concurrency as the pipeline grows.

FAQ

Q: Is web scraping for warehouse ingestion legal?

Collecting publicly visible data is broadly defensible, but legality depends on the target site's terms, the jurisdiction, and the data type. Review the site's terms of service, avoid personal or restricted data, and consult counsel before commercial use. The sandbox site used here, books.toscrape.com, exists specifically for scraping practice.

Q: NDJSON or a JSON array — which should the scraper emit?

NDJSON (one object per line) loads most cleanly and streams without buffering the whole file. Set STRIP_OUTER_ARRAY = FALSE. If your producer emits a single [ ... ] array, set STRIP_OUTER_ARRAY = TRUE so each element becomes a row.

Q: Should I load into a VARIANT column or typed columns?

Land raw scraped data in a VARIANT and shape it with SQL — source pages change, and VARIANT absorbs new fields without a migration. Use MATCH_BY_COLUMN_NAME into typed columns only once the schema is stable.

Q: Which ingestion method should I pick?

COPY INTO for one-shot loads, Snowpipe for scheduled file batches (minutes of latency, serverless), Snowpipe Streaming for sub-minute row-level freshness, and the Kafka connector when records already flow over Kafka. Start with COPY INTO, then graduate as freshness needs grow.

Q: What keeps the scraping side reliable on hard targets?

Scrapeless Scraping Browser handles session health, fingerprinting, and anti-detection resilience on the cloud side, so the page renders fully before eval runs. The producer side stays decoupled from Snowflake, which simply loads whatever NDJSON lands in the stage.

Q: Do I need a warehouse running for Snowpipe?

No. Snowpipe is serverless and billed per GB ingested — Snowflake supplies the compute. A user-managed warehouse is only needed for bulk COPY INTO and for querying.

Q: Can I run this without an AI agent?

Yes. The scrapeless-scraping-browser CLI produces the NDJSON end-to-end from a plain shell, and the Snowflake side is ordinary SQL. An MCP-connected agent is the convenient path, not a requirement.

At Scrapeless, we only access publicly available data while strictly complying with applicable laws, regulations, and website privacy policies. The content in this blog is for demonstration purposes only and does not involve any illegal or infringing activities. We make no guarantees and disclaim all liability for the use of information from this blog or third-party links. Before engaging in any scraping activities, consult your legal advisor and review the target website's terms of service or obtain the necessary permissions.

How to Stream Web Data Into Snowflake with Scrapeless and Snowpipe Streaming

TL;DR:

Introduction: from rendered page to a Snowflake table

What You Can Do With It

Why Scrapeless Scraping Browser

Prerequisites

The pipeline at a glance

Step 1 — Produce NDJSON with Scrapeless

Step 2 — Prepare Snowflake

Step 3 — Method 1: One-shot bulk load with `COPY INTO`

Step 4 — Method 2: Continuous batches with Snowpipe

Step 5 — Method 3: Low-latency rows with Snowpipe Streaming

Step 6 — Method 4: Event-driven loads with the Kafka connector

Choosing a method

Querying the loaded data

What You Get Back

Conclusion

Ready to Build Your AI-Powered Data Pipeline?

FAQ

Most Popular Articles

n8n + LLM Scraper: Capture AI Answers in a No-Code Workflow

How to Enhance Crawl4AI with Scrapeless Cloud Browser

Scrapeless MCP Server Is Officially Live! Build Your Ultimate AI-Web Connector

How to Stream Web Data Into Snowflake with Scrapeless and Snowpipe Streaming

TL;DR:

Introduction: from rendered page to a Snowflake table

What You Can Do With It

Why Scrapeless Scraping Browser

Prerequisites

The pipeline at a glance

Step 1 — Produce NDJSON with Scrapeless

Step 2 — Prepare Snowflake

Step 3 — Method 1: One-shot bulk load with COPY INTO

Step 4 — Method 2: Continuous batches with Snowpipe

Step 5 — Method 3: Low-latency rows with Snowpipe Streaming

Step 6 — Method 4: Event-driven loads with the Kafka connector

Choosing a method

Querying the loaded data

What You Get Back

Conclusion

Ready to Build Your AI-Powered Data Pipeline?

FAQ

Most Popular Articles

n8n + LLM Scraper: Capture AI Answers in a No-Code Workflow

How to Enhance Crawl4AI with Scrapeless Cloud Browser

Scrapeless MCP Server Is Officially Live! Build Your Ultimate AI-Web Connector

Step 3 — Method 1: One-shot bulk load with `COPY INTO`