How to Stream Web Data Into Snowflake with Scrapeless and Snowpipe Streaming
Senior Web Scraping Engineer
Key Takeaways:
- Land scraped web data in Snowflake without a fixed schema. Scrapeless Scraping Browser renders a page in a cloud browser and emits newline-delimited JSON (NDJSON); Snowflake ingests it into a
VARIANTcolumn, so new fields never break the load. - Four ingestion methods, one data shape. Bulk
COPY INTOfor one-shot loads, Snowpipe for continuous file-based batches, Snowpipe Streaming for low-latency rows, and the Kafka connector for event-driven pipelines — all read the same NDJSON Scrapeless produces. - Snowpipe Streaming's high-performance architecture is generally available (GA since September 2025). It writes rows directly through an SDK (Java, Python, Node.js) or REST, with channels, offset tokens, and exactly-once recovery — no staged files.
- Schema-on-read keeps scraped data flexible. Query a
VARIANTcolumn withcol:field::typenotation and unnest arrays withLATERAL FLATTEN— no migration when the source page adds a field. - The Snowflake CLI (
snow) is the current tool.pip install snowflake-cli, thensnow sql -f ingest.sqlruns the whole setup from one file. - Free to start. New Scrapeless accounts include free Scraping Browser runtime — sign up at Scrapeless Website.
Introduction: from rendered page to a Snowflake table
Analytics teams increasingly want web data — product catalogs, listings, reviews, market signals — in the same warehouse as their first-party data, so it can join, model, and feed BI. Snowflake is a common destination because its VARIANT type stores semi-structured JSON natively and makes it queryable with SQL.
The friction is the gap between the two systems. Scraped pages are JavaScript-rendered and behind anti-bot defenses; the data often arrives as nested JSON whose shape drifts as the source site changes. Hand-built loaders that map every field to a column break the first time the page adds one.
This post walks through a terminal-first workflow that closes that gap. Scrapeless Scraping Browser handles the rendering and anti-detection side and emits NDJSON; Snowflake ingests it four different ways depending on how fresh the data needs to be. The example producer is the public scraping sandbox books.toscrape.com, so every command below is reproducible — the same pattern applies to harder targets (see the sibling Best Zillow Scrapers in 2026 and Best Amazon Scrapers in 2026 guides).
What You Can Do With It
- Build a web-data lakehouse. Land scraped catalogs and listings in Snowflake and join them to internal sales or inventory data.
- Run scheduled market snapshots. Drop a new NDJSON file per run into a stage and let Snowpipe auto-load it within minutes.
- Feed near-real-time dashboards. Stream scraped events row-by-row with Snowpipe Streaming for sub-minute freshness.
- Bridge an existing Kafka backbone. Push scraped records onto a topic and let the Snowflake Kafka connector land them.
- Keep schema flexible. Store the raw JSON in
VARIANTand shape it at query time, so source-site changes never block a load.
At Scrapeless, we only access publicly available data while strictly complying with applicable laws, regulations, and website privacy policies. The content in this post is for demonstration purposes only.
Why Scrapeless Scraping Browser
Scrapeless Scraping Browser is a customizable, anti-detection cloud browser designed for web crawlers and AI agents. As the producer side of a Snowflake pipeline, it brings:
- Cloud-side JavaScript rendering, so the data is present in the DOM before extraction
- Residential proxies in 195+ countries, pinned per session
- Anti-detection browser fingerprinting
- A single
scrapeless-scraping-browserCLI surface whoseevalreturns JSON you can reshape into NDJSON in one step - Session persistence for multi-page crawls
Get your API key on the free plan at Scrapeless Website.
Prerequisites
- Node.js 18 or newer
- A Scrapeless account and API key — sign up at Scrapeless Website
- A Snowflake account with a role that can create databases, warehouses, stages, and pipes
- The Snowflake CLI:
pip install snowflake-cli(Python 3.10+) - For Snowpipe auto-ingest and external stages: a cloud bucket (AWS S3, GCS, or Azure) and permission to create a storage integration
jqis optional (a Node one-liner fallback is shown for the NDJSON conversion)
Configure a Snowflake connection once, in ~/.snowflake/config.toml:
toml
[connections.demo]
account = "myorg-myaccount"
user = "jondoe"
password = "your_password_here"
warehouse = "ingest_wh"
database = "web_data"
schema = "raw"
role = "sysadmin"
Then snow sql -c demo -q "SELECT CURRENT_VERSION();" confirms it works.
The pipeline at a glance
Scrapeless Scraping Browser → NDJSON file → Snowflake stage → table (VARIANT)
(render + extract) (one object (internal or COPY INTO | one-shot
per line) external bucket) Snowpipe | continuous
Streaming | low-latency
Kafka | event-driven
The shape never changes: Scrapeless emits one JSON object per line, the file lands in a stage, and one of four methods loads it into a VARIANT column you query with SQL.
Step 1 — Produce NDJSON with Scrapeless
Install the CLI and set your key:
bash
npm install -g scrapeless-scraping-browser
scrapeless-scraping-browser config set apiKey your_api_token_here
Open one cloud session, navigate to the catalog page, wait for a stable marker, and extract the book records with eval. The new-session JSON nests the id under data.taskId — use jq, or the portable grep fallback shown:
bash
# open a session and capture the task id (jq path is .data.taskId)
SID=$(scrapeless-scraping-browser new-session --name books --ttl 300 --proxy-country US --json | jq -r '.data.taskId')
# no jq? portable fallback:
# SID=$(scrapeless-scraping-browser new-session --name books --ttl 300 --proxy-country US --json | grep -oE '"taskId":"[^"]*"' | head -1 | cut -d'"' -f4)
# render the catalog page, then wait for the product grid
scrapeless-scraping-browser --session-id "$SID" open "https://books.toscrape.com/catalogue/page-1.html"
scrapeless-scraping-browser --session-id "$SID" wait "article.product_pod"
# extract one record per book; the eval returns a JSON array
scrapeless-scraping-browser --session-id "$SID" eval '
JSON.stringify(Array.from(document.querySelectorAll("article.product_pod")).map(el => ({
title: el.querySelector("h3 a")?.getAttribute("title") ?? null,
price: el.querySelector(".price_color")?.textContent.trim() ?? null,
rating: el.querySelector("p.star-rating")?.className.replace("star-rating", "").trim() ?? null,
in_stock: /In stock/i.test(el.querySelector(".availability")?.textContent ?? ""),
url: el.querySelector("h3 a")?.href ?? null
})))
' > books.raw.json
scrapeless-scraping-browser --session-id "$SID" close
Convert the array to NDJSON — one object per line, which is the format Snowflake's loaders read most cleanly:
bash
# with jq
jq -c '.[]' books.raw.json > books.ndjson
# or, without jq, a Node one-liner
node -e 'JSON.parse(require("fs").readFileSync("books.raw.json","utf8")).forEach(o=>console.log(JSON.stringify(o)))' > books.ndjson
books.ndjson now holds one self-contained JSON object per line. If a cold session returns an empty shell or a transient os error 10054, close the session, create a fresh one, and retry a bounded number of times before extracting.
Step 2 — Prepare Snowflake
Create the warehouse, database, schema, a JSON file format, and a landing table with a single VARIANT column. Save this as setup.sql and run it with snow sql -c demo -f setup.sql:
sql
CREATE WAREHOUSE IF NOT EXISTS ingest_wh WITH WAREHOUSE_SIZE = 'XSMALL' AUTO_SUSPEND = 60;
CREATE DATABASE IF NOT EXISTS web_data;
CREATE SCHEMA IF NOT EXISTS web_data.raw;
USE WAREHOUSE ingest_wh;
USE SCHEMA web_data.raw;
-- NDJSON: one JSON object per line, so do not strip an outer array
CREATE OR REPLACE FILE FORMAT ndjson_format
TYPE = JSON
STRIP_OUTER_ARRAY = FALSE
COMPRESSION = AUTO;
-- land the raw record as-is; shape it at query time
CREATE OR REPLACE TABLE raw_books (
src VARIANT,
loaded_at TIMESTAMP_NTZ DEFAULT CURRENT_TIMESTAMP()
);
STRIP_OUTER_ARRAY = FALSE is correct for NDJSON because each line is already its own object — STRIP_OUTER_ARRAY = TRUE is only for a file that is one big [ ... ] array.
Step 3 — Method 1: One-shot bulk load with COPY INTO
For a single file or a manual batch, stage the file and run COPY INTO. The simplest path is an internal named stage plus PUT:
sql
-- a named internal stage bound to the JSON format
CREATE OR REPLACE STAGE books_stage FILE_FORMAT = ndjson_format;
bash
# upload the local NDJSON to the internal stage (snow CLI runs PUT)
snow sql -c demo -q "PUT file://$(pwd)/books.ndjson @books_stage AUTO_COMPRESS=TRUE OVERWRITE=TRUE"
sql
-- load every object as one row into the VARIANT column
COPY INTO raw_books (src)
FROM @books_stage
FILE_FORMAT = (FORMAT_NAME = 'ndjson_format')
ON_ERROR = 'CONTINUE';
To map JSON keys straight to typed columns instead of a VARIANT, create a table whose column names match the keys and use MATCH_BY_COLUMN_NAME:
sql
CREATE OR REPLACE TABLE books (
title VARCHAR, price VARCHAR, rating VARCHAR, in_stock BOOLEAN, url VARCHAR
);
COPY INTO books
FROM @books_stage
FILE_FORMAT = (TYPE = 'JSON')
MATCH_BY_COLUMN_NAME = 'CASE_INSENSITIVE';
If you would rather let Snowflake derive the schema from the staged files, INFER_SCHEMA with CREATE TABLE … USING TEMPLATE builds the column list for you:
sql
CREATE OR REPLACE TABLE books_auto
USING TEMPLATE (
SELECT ARRAY_AGG(OBJECT_CONSTRUCT(*))
FROM TABLE(INFER_SCHEMA(
LOCATION => '@books_stage',
FILE_FORMAT => 'ndjson_format'
))
);
For data already sitting in a cloud bucket, point an external stage at it instead of uploading. With a storage integration (no inline keys):
sql
CREATE OR REPLACE STAGE books_s3_stage
URL = 's3://my-bucket/scraped/books/'
STORAGE_INTEGRATION = my_s3_integration
FILE_FORMAT = ndjson_format;
COPY INTO raw_books (src) FROM @books_s3_stage;
Get your API key on the free plan: app.scrapeless.com
Step 4 — Method 2: Continuous batches with Snowpipe
When the scraper drops a new file into a bucket on a schedule, Snowpipe loads each file automatically — no manual COPY and no dedicated warehouse. A pipe wraps a COPY INTO statement; with AUTO_INGEST = TRUE, a cloud event notification triggers the load:
sql
CREATE OR REPLACE PIPE books_pipe
AUTO_INGEST = TRUE
AWS_SNS_TOPIC = 'arn:aws:sns:us-east-1:123456789012:scraped-bucket'
AS
COPY INTO raw_books (src)
FROM @books_s3_stage
FILE_FORMAT = (TYPE = 'JSON');
On S3 the event flows through SNS/SQS into a Snowflake-managed queue; GCS uses Pub/Sub and Azure uses Event Grid, each wired with a notification integration. If you would rather call Snowpipe explicitly, leave AUTO_INGEST unset and POST the staged file paths to the insertFiles REST endpoint, then poll insertReport.
Two operational notes from Snowflake's guidance:
- Billing is serverless and now charged on a per-GB basis for the data Snowpipe ingests — there is no warehouse to size, and the former per-file component is retired.
- File sizing matters. Aim for files around 100–250 MB compressed, and stage no more than about once per minute; staging more frequently than that adds queue-management overhead without reducing latency. Buffer small scrape batches into larger files before staging.
Snowpipe makes data available within minutes, which fits scheduled market snapshots well.
Step 5 — Method 3: Low-latency rows with Snowpipe Streaming
When freshness needs to be seconds, not minutes, Snowpipe Streaming writes rows directly into a table — no staged files. The high-performance architecture has been generally available since September 2025, with SDKs for Java, Python, and Node.js plus a REST API over a shared client core; the classic file-based Snowpipe Streaming is on a deprecation path.
The model has three core concepts:
- Channels — a named, long-lived streaming connection to a table. Rows are committed in order within a channel.
- Offset tokens — a string your application attaches to each batch. After a restart,
getLatestCommittedOffsetToken()tells you the last durably committed position, so you replay only what follows — the basis for exactly-once delivery. - Throughput billing — credits per uncompressed GB ingested, rather than per file.
The Java client shape is small:
java
SnowflakeStreamingIngestClient client =
SnowflakeStreamingIngestClientFactory.builder("BOOKS_CLIENT")
.setProperties(props).build();
OpenChannelRequest request = OpenChannelRequest.builder("BOOKS_CHANNEL")
.setDBName("WEB_DATA").setSchemaName("RAW").setTableName("RAW_BOOKS")
.setOnErrorOption(OpenChannelRequest.OnErrorOption.CONTINUE)
.build();
SnowflakeStreamingIngestChannel channel = client.openChannel(request);
channel.insertRow(rowAsMap, offsetToken); // one scraped record
channel.getLatestCommittedOffsetToken(); // for recovery
Reach for Streaming when scraped records arrive as a continuous flow (an agent emitting events as it crawls) and the dashboard needs them within seconds.
Step 6 — Method 4: Event-driven loads with the Kafka connector
If scraped records already flow over Apache Kafka, the Snowflake Connector for Kafka lands a topic into a table (one topic maps to one table). It runs inside a Kafka Connect worker. A single property selects the ingestion engine underneath:
properties
name=scraped-books-sink
connector.class=com.snowflake.kafka.connector.SnowflakeSinkConnector
topics=scraped_books
snowflake.database.name=WEB_DATA
snowflake.schema.name=RAW
# SNOWPIPE (file-based, default) or SNOWPIPE_STREAMING (low-latency)
snowflake.ingestion.method=SNOWPIPE_STREAMING
Each table the connector creates has two VARIANT columns: RECORD_CONTENT (the message payload) and RECORD_METADATA (topic, partition, offset, timestamps, and headers). Query the payload exactly as you would any VARIANT.
For a fully managed alternative — no Kafka Connect cluster to operate — Snowflake Openflow (generally available, built on Apache NiFi) ingests from Kafka, Kinesis, databases, and SaaS sources into Snowflake through managed pipelines.
Choosing a method
| Method | Latency | Data shape | Ops overhead | Use when |
|---|---|---|---|---|
COPY INTO |
Manual | Files in a stage | Lowest | One-shot loads, backfills |
| Snowpipe | Minutes | Files dropped in a bucket | Low (serverless) | Scheduled scrape batches |
| Snowpipe Streaming | Seconds | Rows via SDK/REST | Medium (write a client) | Continuous event flow |
| Kafka connector | Seconds–minutes | Kafka topic records | Medium (Connect worker) | An existing Kafka backbone |
Most teams start with COPY INTO to validate the schema, move to Snowpipe once the scraper runs on a schedule, and adopt Streaming or Kafka only when sub-minute freshness justifies the extra moving piece.
Querying the loaded data
Because the raw record lives in a VARIANT, you shape it at read time. Navigate with the : operator and cast with :::
sql
SELECT
src:title::string AS title,
src:price::string AS price,
src:rating::string AS rating,
src:in_stock::boolean AS in_stock,
src:url::string AS url
FROM raw_books;
When a scraped record carries an array — a photos list, a priceHistory series — LATERAL FLATTEN unnests it into rows:
sql
-- raw_listings: a hypothetical scraped-listings table loaded the same way as raw_books.
-- (The books example has no nested array; this shows the pattern for a source that does.)
SELECT
src:title::string AS title,
ph.value:date::string AS price_date,
ph.value:price::number AS price
FROM raw_listings,
LATERAL FLATTEN(INPUT => src:priceHistory) ph;
No migration is needed when the source page adds a field — it simply appears under src on the next load.
What You Get Back
After COPY INTO, each scraped object is one row in raw_books. The VARIANT column holds the record verbatim; the schema below is normative and field values are illustrative samples:
json
// One row's src VARIANT, as emitted by the Step 1 extractor.
{
"title": "A Light in the Attic",
"price": "£51.77",
"rating": "Three",
"in_stock": true,
"url": "https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html"
}
A few honest observations:
- Prices arrive as display strings. Cast and clean in SQL (
REPLACE(src:price::string, '£', '')::number) rather than expecting numbers from the page. - Conditional fields are nullable. A field absent on a given page is simply missing from
src;VARIANTaccess returnsNULLrather than erroring. MATCH_BY_COLUMN_NAMEskips unmatched keys. New keys land in aVARIANTtable automatically but are dropped by a typed table until you add the column.- Snowpipe load history is retained 14 days in pipe metadata; bulk
COPYhistory is retained 64 days in table metadata — keep that in mind when auditing a backfill.
Conclusion
Getting scraped web data into Snowflake reduces to four moves: render and extract with Scrapeless, emit NDJSON, stage the file, and load it with the method whose latency fits — COPY INTO, Snowpipe, Snowpipe Streaming, or the Kafka connector. Land the raw record in a VARIANT column and shape it at query time, so the pipeline survives the source site adding fields.
Pin US egress when the target needs it, keep the Scrapeless session chain inside one shell invocation, follow the discover → extract pattern, and treat absent fields as nullable. For harder targets than the sandbox used here, the same producer pattern carries over — see the sibling Best Zillow Scrapers in 2026 guide, the Scraping Browser product page, and the Scrapeless docs.
Ready to Build Your AI-Powered Data Pipeline?
Join our community to claim a free plan and connect with developers building web-data-to-warehouse pipelines: Discord · Telegram.
Sign up at Scrapeless Website for free Scraping Browser runtime, and see scrapeless.com/en/pricing to scale session minutes and concurrency as the pipeline grows.
FAQ
Q1: Is web scraping for warehouse ingestion legal?
Collecting publicly visible data is broadly defensible, but legality depends on the target site's terms, the jurisdiction, and the data type. Review the site's terms of service, avoid personal or restricted data, and consult counsel before commercial use. The sandbox site used here, books.toscrape.com, exists specifically for scraping practice.
Q2: NDJSON or a JSON array — which should the scraper emit?
NDJSON (one object per line) loads most cleanly and streams without buffering the whole file. Set STRIP_OUTER_ARRAY = FALSE. If your producer emits a single [ ... ] array, set STRIP_OUTER_ARRAY = TRUE so each element becomes a row.
Q3: Should I load into a VARIANT column or typed columns?
Land raw scraped data in a VARIANT and shape it with SQL — source pages change, and VARIANT absorbs new fields without a migration. Use MATCH_BY_COLUMN_NAME into typed columns only once the schema is stable.
Q4: Which ingestion method should I pick?
COPY INTO for one-shot loads, Snowpipe for scheduled file batches (minutes of latency, serverless), Snowpipe Streaming for sub-minute row-level freshness, and the Kafka connector when records already flow over Kafka. Start with COPY INTO, then graduate as freshness needs grow.
Q5: How do I handle transient scraper errors like os error 10054 or a 503?
Treat them as transient: close the Scrapeless session, create a fresh one, navigate again, and wait for a stable selector before extracting. Keep retries bounded. These belong to the producer side and do not affect Snowflake, which loads whatever lands in the stage.
Q6: Do I need a warehouse running for Snowpipe?
No. Snowpipe is serverless and billed per GB ingested — Snowflake supplies the compute. A user-managed warehouse is only needed for bulk COPY INTO and for querying.
Q7: Can I run this without an AI agent?
Yes. The scrapeless-scraping-browser CLI produces the NDJSON end-to-end from a plain shell, and the Snowflake side is ordinary SQL. An MCP-connected agent is the convenient path, not a requirement.
At Scrapeless, we only access publicly available data while strictly complying with applicable laws, regulations, and website privacy policies. The content in this blog is for demonstration purposes only and does not involve any illegal or infringing activities. We make no guarantees and disclaim all liability for the use of information from this blog or third-party links. Before engaging in any scraping activities, consult your legal advisor and review the target website's terms of service or obtain the necessary permissions.



