Scala Web Scraping: Fetch, Parse, and Unlock Protected Pages
TL;DR:
- Scala scrapes the web with two pieces: a JVM HTTP client to fetch, and a jsoup-backed parser to extract. requests-scala makes the request; scala-scraper turns the HTML into CSS-selectable nodes.
- The whole project is three sbt dependencies. requests-scala 0.9.0 for HTTP, scala-scraper 3.2.0 for parsing, and ujson 4.1.0 for the one JSON envelope you decode later — no framework to learn.
- Pagination is a loop over the "next" link. scala-scraper reads the next-page href with one optional selector, so a catalog walk is a tail-recursive function, not a queue.
- Static fetch has a hard ceiling: it cannot run JavaScript or pass a bot challenge. A plain requests.get returns the empty shell of a client-rendered page and gets a challenge interstitial on protected sites.
- The Scrapeless Universal Scraping API closes that gap with a plain HTTP POST. js_render: true runs the page server-side and returns the finished DOM; the same requests-scala client that talks to a site can talk to the API.
- The unlock call was run live against the endpoint: HTTP 200, 51,275 bytes of rendered HTML, 20 product titles. The request and response shape in this guide come straight from that live run.
- Free to start. New Scrapeless accounts include free runtime — sign up at app.scrapeless.com.
Introduction: where Scala fits in scraping
Scala runs on the JVM, which means a scraper written in it inherits jsoup, Akka, and a mature HTTP ecosystem for free. The language is a natural fit when scraping feeds something already on the JVM — a Spark job, a Kafka producer, a data service — and you want the extraction in the same codebase, with the same types, as the pipeline that consumes it.
The fetch-and-parse half of that job is short. A handful of lines pull a page and read values out of it with CSS selectors. The friction starts where every scraper's does: a growing share of the web builds its content with JavaScript that has to actually run before the data exists, and protected sites gate access behind TLS fingerprinting and challenge pages that a raw HTTP client never clears.
This guide builds the static scraper first — sbt project, an HTTP fetch, selector-based extraction, pagination — then draws the honest line where that approach stops and hands the hard pages to the Scrapeless Universal Scraping API. The API call at the end was run live; its numbers are a real capture.
What you can do with this stack
- Fetch and parse on the JVM — requests-scala for the request, scala-scraper (a jsoup wrapper) for CSS-selector extraction.
- Keep extraction in your data codebase — read values into Scala types right next to the Spark or Kafka job that uses them.
- Walk paginated listings — follow the next-page link in a tail-recursive loop until it runs out.
- Reach JavaScript-rendered and protected pages — POST them to the Universal Scraping API and parse the rendered HTML the same way.
- Skip the anti-bot stack — TLS fingerprinting, residential IPs, and challenge solving live in the API, not your Scala code.
Why Scrapeless Universal Scraping API
The Scrapeless Universal Scraping API takes a target URL and returns the rendered, unblocked HTML. For a Scala client specifically, it brings:
- Server-side JavaScript rendering: js_render: true returns the finished DOM, so scala-scraper sees real content instead of an empty shell.
- Residential proxies in 195+ countries: the fetch egresses from trusted IPs; you never build or rotate a pool in Scala.
- Anti-bot handling: TLS fingerprinting and challenge solving happen API-side, off your JVM process.
- A plain HTTPS POST: no SDK to add to build.sbt; the requests-scala client you already have is enough.
- A small envelope: {"code":200,"data":"<html>…"}, decoded with the same ujson you use elsewhere.
Get your API key on the free plan at app.scrapeless.com.
Prerequisites
- A JDK (11 or newer) and sbt installed
- Scala 2.13 (the dependency versions below are the 2.13 builds)
- A Scrapeless account and API key — sign up at app.scrapeless.com
- Basic familiarity with the terminal
Note: The Scala code in the build and step sections below was not executed for this guide. No JVM/sbt runtime was available on the verifying machine, so those blocks were composed and checked against the libraries' current APIs and Maven Central versions instead. The Scrapeless unlock call, the part the rest of the guide depends on, was run live against the endpoint; its request and response are a real capture.
Install
Create a project directory with two files. build.sbt pins the language and the three dependencies:
scala
ThisBuild / scalaVersion := "2.13.16"

lazy val scraper = (project in file("."))
  .settings(
    name := "scala-scraper-demo",
    libraryDependencies ++= Seq(
      "com.lihaoyi" %% "requests" % "0.9.0",
      "net.ruippeixotog" %% "scala-scraper" % "3.2.0",
      "com.lihaoyi" %% "ujson" % "4.1.0"
    )
  )
project/build.properties pins sbt itself:
text
sbt.version=1.12.13
scala-scraper pulls in jsoup transitively, so you parse with jsoup's engine through a typed Scala DSL without depending on jsoup directly. Run sbt update once to resolve everything, then sbt console for a REPL or sbt run for a main.
Step 1 — Fetch a page
requests-scala is a thin, synchronous HTTP client. One call gets the page body as a string:
scala
val res = requests.get(
  "https://books.toscrape.com/",
  headers = Map("User-Agent" -> "Mozilla/5.0 (compatible; scala-scraper-demo)")
)
println(res.statusCode) // 200
val html: String = res.text()
res.text() is the raw HTML. For a server-rendered page like this one, that string already holds the data; for a client-rendered page it would hold an empty shell, which is the limit Step 4 addresses.
Step 2 — Parse with scala-scraper
scala-scraper parses the string into a document and selects nodes with CSS selectors through its DSL. The >> operator extracts; elementList, attr, and texts shape the result into Scala values:
scala
import net.ruippeixotog.scalascraper.browser.JsoupBrowser
import net.ruippeixotog.scalascraper.dsl.DSL._
import net.ruippeixotog.scalascraper.dsl.DSL.Extract._
val doc = JsoupBrowser().parseString(html)
val titles: List[String] = doc >> elementList("article.product_pod h3 a") >> attr("title")
val prices: List[String] = doc >> texts("p.price_color")
val books = titles.zip(prices)
books.foreach { case (t, p) => println(s"$p $t") }
article.product_pod h3 a is the durable selector here — the product card class plus the link inside its heading — and title carries the full name even when the visible text is truncated. Pulling the value out of an attribute rather than the rendered text is the more stable read whenever the site offers it.
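Step 2 leaves each price as a display string such as £51.77. If the downstream job wants a numeric type, one possible approach is to strip the currency symbol and parse the remainder. The following is a minimal sketch using only the standard library; the names Book, parsePrice, and toBooks are hypothetical, and it assumes a dot decimal separator with no thousands separators.

```scala
import scala.util.Try

// Hypothetical record type for this sketch; not part of scala-scraper.
final case class Book(title: String, price: Option[BigDecimal])

// Keep only ASCII digits and the decimal point, then try to parse.
// Assumption: dot decimal separator, no thousands separators.
def parsePrice(raw: String): Option[BigDecimal] = {
  val numeric = raw.filter(c => (c >= '0' && c <= '9') || c == '.')
  Try(BigDecimal(numeric)).toOption
}

// Turn the (title, price string) pairs from titles.zip(prices) into typed records.
def toBooks(pairs: List[(String, String)]): List[Book] =
  pairs.map { case (title, rawPrice) => Book(title, parsePrice(rawPrice)) }

val sample = toBooks(List(("A Light in the Attic", "£51.77"), ("Untitled", "n/a")))
sample.foreach(println)
```

A price that does not parse becomes None rather than an exception, so one malformed card does not abort the whole page.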
Step 3 — Follow pagination
The catalog continues across pages, each linking to the next through a li.next a element. scala-scraper's optional selector >?> returns None when that link is absent, which is exactly the loop's stop condition:
scala
import net.ruippeixotog.scalascraper.browser.JsoupBrowser
import net.ruippeixotog.scalascraper.dsl.DSL._
import net.ruippeixotog.scalascraper.dsl.DSL.Extract._
import net.ruippeixotog.scalascraper.model.Document

// Build the next-page URL by appending the href to the catalogue base; None on the last page.
def nextUrl(doc: Document, base: String): Option[String] =
  (doc >?> element("li.next a")).map(a => base + a.attr("href"))

@annotation.tailrec
def crawl(url: String, base: String, acc: List[String]): List[String] = {
  val doc = JsoupBrowser().parseString(requests.get(url).text())
  val names = doc >> elementList("article.product_pod h3 a") >> attr("title")
  nextUrl(doc, base) match {
    case Some(next) => crawl(next, base, acc ++ names)
    case None       => acc ++ names
  }
}

val all = crawl(
  "https://books.toscrape.com/catalogue/page-1.html",
  "https://books.toscrape.com/catalogue/",
  Nil
)
println(all.size)
Keep the loop polite — one host at a time, a small delay between pages — and treat absent fields as Option, never as a value you assume is present.
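The politeness advice above can be sketched as a variant of the walk that pauses between pages, caps the page count, and resolves relative hrefs with java.net.URI instead of string concatenation. This is an illustration under stated assumptions, not the guide's verified code: Page, walk, and the load function are hypothetical names, and load stands in for "fetch the URL and parse it" so the sketch runs without network access.

```scala
import java.net.URI
import scala.annotation.tailrec

// Hypothetical shape for one parsed page: extracted items plus the raw next-page href.
final case class Page(items: List[String], nextHref: Option[String])

// Walk from `start`, sleeping `delayMs` between pages and visiting at most `maxPages`.
// `load` is an assumed stand-in for fetching and parsing a URL.
def walk(start: String, delayMs: Long, maxPages: Int)(load: String => Page): List[String] = {
  @tailrec
  def loop(url: String, seen: Set[String], acc: Vector[String]): Vector[String] = {
    val page = load(url)
    val collected = acc ++ page.items
    // Resolve the href against the current page URL, so relative links work from any depth.
    val next = page.nextHref.map(href => URI.create(url).resolve(href).toString)
    next match {
      case Some(n) if !seen(n) && seen.size < maxPages =>
        Thread.sleep(delayMs) // small pause between requests to the same host
        loop(n, seen + n, collected)
      case _ => collected
    }
  }
  loop(start, Set(start), Vector.empty).toList
}

// In-memory stand-in for a two-page catalog.
val site: Map[String, Page] = Map(
  "https://example.test/catalogue/page-1.html" -> Page(List("Book A", "Book B"), Some("page-2.html")),
  "https://example.test/catalogue/page-2.html" -> Page(List("Book C"), None)
)
val names = walk("https://example.test/catalogue/page-1.html", delayMs = 0L, maxPages = 50)(site)
println(names)
```

The seen set also stops the walk if a site's next link ever points back to a page already visited.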
Where static fetch stops
requests.get does one thing: it returns the bytes the server sends to an anonymous client. That is enough for a server-rendered catalog and nothing more. Two cases break it, and both are common:
- Client-rendered pages. When a site builds its content with JavaScript, the HTML you fetch is an empty shell with the data still locked in scripts. scala-scraper has nothing to select because the content was never in the bytes.
- Protected pages. Sites with active anti-bot defenses answer an anonymous request with a challenge interstitial, not the page. A plain HTTP client has no way to clear it.
Reproducing the fix in Scala — a headless browser to run the JavaScript, a residential proxy pool, a challenge solver — is a far larger project than the scrape itself. The pragmatic move is to stop making Scala do that part and hand those URLs to a rendering API.
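One way to decide which URLs to hand off is to try the static fetch first and escalate only when the result lacks the content you expect. The sketch below shows that decision in isolation; Fetched, fetchWithFallback, and the three function parameters are hypothetical names, and the fetchers are passed in so the logic can be exercised without a network.

```scala
import scala.util.Try

// Result of a fetch, plus a flag recording whether the rendering path was needed.
final case class Fetched(html: String, rendered: Boolean)

// Try the cheap static fetch; fall back to the rendered fetch when the static
// fetch fails or `hasContent` rejects the HTML it returned.
def fetchWithFallback(url: String)(
    staticFetch: String => String,
    renderedFetch: String => String,
    hasContent: String => Boolean
): Fetched =
  Try(staticFetch(url)).toOption.filter(hasContent) match {
    case Some(html) => Fetched(html, rendered = false)
    case None       => Fetched(renderedFetch(url), rendered = true)
  }

// Stand-ins: an empty client-rendered shell and the finished markup.
val shell = "<html><body><div id=\"root\"></div></body></html>"
val full  = "<html><body><article class=\"product_pod\"></article></body></html>"

val result = fetchWithFallback("https://example.test/")(
  staticFetch = _ => shell,
  renderedFetch = _ => full,
  hasContent = _.contains("product_pod")
)
println(result.rendered)
```

In a real scraper hasContent would more likely run the page's key selector and check for a non-empty result; the substring test here is only a placeholder.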
The cloud twist: render server-side, parse in Scala
The Scrapeless Universal Scraping API takes a target URL, runs it server-side through a real browser and residential egress, and returns the finished HTML. From Scala it is one POST with the same requests-scala client, and ujson decodes the response:
scala
val apiKey = sys.env("SCRAPELESS_API_KEY")

val payload = ujson.Obj(
  "actor" -> "unlocker.webunlocker",
  "input" -> ujson.Obj(
    "url" -> "https://books.toscrape.com/",
    "method" -> "GET",
    "redirect" -> true,
    "js_render" -> true
  )
)

val res = requests.post(
  "https://api.scrapeless.com/api/v1/unlocker/request",
  headers = Map("Content-Type" -> "application/json", "x-api-token" -> apiKey),
  data = ujson.write(payload),
  readTimeout = 120000
)

val env = ujson.read(res.text())
val html = env("data").str // rendered DOM as a String
js_render: true is the load-bearing flag: it tells the API to run the page's JavaScript and return the finished DOM, so a site that builds its content client-side comes back as real markup. From here, html goes straight into the same JsoupBrowser().parseString(html) and the same selectors from Step 2 — the parsing half of your scraper does not change, only the fetch.
What You Get Back
The API response is a small, predictable envelope:
json
{
  "code": 200,
  "data": "<html>...rendered DOM...</html>"
}
// illustrative sample: schema is the real shape from a live call; the "data" string is truncated here. In the verified run "data" held 51,275 bytes of JSON-escaped rendered HTML.
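Reading env("data").str directly throws if the field is missing or is not a string. A more defensive decode might check the envelope before handing the HTML to the parser. The sketch below assumes the ujson dependency from build.sbt; decodeEnvelope is a hypothetical helper, and because only the success shape is documented here, anything other than code 200 with a string data field is treated as an error.

```scala
import scala.util.Try

// Return the rendered HTML on success, or a short error description otherwise.
// Assumption: any envelope other than {"code": 200, "data": "<string>"} is a failure.
def decodeEnvelope(body: String): Either[String, String] =
  Try(ujson.read(body)).toOption.flatMap(_.objOpt) match {
    case None => Left("response is not a JSON object")
    case Some(fields) =>
      val code = fields.get("code").flatMap(_.numOpt).map(_.toInt)
      val data = fields.get("data").flatMap(_.strOpt)
      (code, data) match {
        case (Some(200), Some(html)) => Right(html)
        case _ => Left(s"unexpected envelope: code=${code.getOrElse("missing")}")
      }
  }

decodeEnvelope("""{"code":200,"data":"<html></html>"}""") match {
  case Right(html) => println(s"got ${html.length} characters of HTML")
  case Left(error) => println(s"unlock failed: $error")
}
```

Returning Either keeps the failure visible to the caller instead of surfacing later as an empty selector result.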
A live call to the endpoint for the catalog page above returned HTTP 200 with 51,275 bytes of rendered HTML; running Step 2's selectors over that HTML yields 20 product titles, the first being "A Light in the Attic" at £51.77. A few notes from the run:
- js_render: true costs latency but buys content. Turn it off for static pages to go faster; turn it on when the page is blank without it.
- ujson reads the one field you need. env("data").str is the whole decode; the rest of the envelope is just the status code.
- Selectors stay in Scala. The API hands back HTML, so extraction logic, types, and tests live in your codebase, not behind a managed schema.
- Treat absent fields as Option. A nullable selector with >?> is the right read whenever a card might omit a price or a heading.
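The last note can be illustrated by reading each product card on its own rather than zipping two page-wide lists, which would misalign titles and prices as soon as one card lacks a field. This is a sketch assuming the scala-scraper dependency from build.sbt and the books.toscrape.com markup used earlier; Card, readCard, and readCards are hypothetical names, and like the other Scala blocks in this guide it has not been executed here.

```scala
import net.ruippeixotog.scalascraper.browser.JsoupBrowser
import net.ruippeixotog.scalascraper.dsl.DSL._
import net.ruippeixotog.scalascraper.dsl.DSL.Extract._
import net.ruippeixotog.scalascraper.model.Element

// Hypothetical record: every field is optional because any card may omit it.
final case class Card(title: Option[String], price: Option[String])

// Read one product card; >?> yields None when the selector matches nothing.
def readCard(card: Element): Card = Card(
  title = (card >?> element("h3 a")).flatMap(_.attrs.get("title")),
  price = card >?> text("p.price_color")
)

// Select the card containers first, then extract fields inside each one.
def readCards(html: String): List[Card] = {
  val doc = JsoupBrowser().parseString(html)
  (doc >> elementList("article.product_pod")).map(readCard)
}

val sampleHtml =
  """<article class="product_pod"><h3><a title="Book A">Book...</a></h3><p class="price_color">£10.00</p></article>
    |<article class="product_pod"><h3><a title="Book B">Book...</a></h3></article>""".stripMargin

readCards(sampleHtml).foreach(println)
```

Scoping the selectors to a single card keeps a missing price attached to the right title.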
Conclusion: Scala for the parse, an API for the hard fetch
A Scala scraper is short where the JVM is strong — requests-scala for the request, scala-scraper for CSS-selector extraction, a tail-recursive walk over the next-page link. It runs into the same wall every static scraper does: client-rendered pages and active anti-bot defenses that a plain HTTP client cannot clear. Routing those URLs through the Universal Scraping API keeps the fix to a single POST and leaves your parsing untouched. For the same fetch-then-parse split in another language, see the JavaScript and Node.js scraping guide; the docs cover the full API and its parameters. Pin js_render to what the page needs, keep selectors in Scala, and treat every field as optional.
Ready to Build Your AI-Powered Data Pipeline?
Join our community to claim a free plan and connect with developers building JVM scrapers: Discord · Telegram.
Sign up at app.scrapeless.com for free runtime and adapt the program above to the sites and selectors your Scala pipeline needs. See pricing for scale.
FAQ
Q: Is scraping with Scala legal?
Scraping publicly visible data is generally permissible, but the rules vary by jurisdiction and by site. Review the target's terms of service, respect robots directives, avoid personal or restricted data, and consult counsel for anything commercial.
Q: Do I need a proxy?
For light scraping of a server-rendered site, no. For protected or client-rendered pages the request egresses through the Universal Scraping API's residential proxies in 195+ countries, so you do not build a pool in Scala.
Q: What does a bot challenge look like, and how do I get a clean render?
Instead of the page, an anonymous request gets a challenge interstitial. Route that URL through the Universal Scraping API with js_render: true; it runs the page server-side from a trusted residential IP and returns the finished HTML.
Q: Why scala-scraper instead of calling jsoup directly?
scala-scraper wraps jsoup in a typed Scala DSL, so selectors return List[String] or Option[Element] instead of Java collections. You get jsoup's parser with results that fit Scala pattern matching.
Q: My selectors broke after the site changed. What now?
Markup rotates. Re-inspect the page and tighten the selector — prefer a stable container class plus an attribute read (article.product_pod h3 a → title) over a hashed CSS class that changes on the next redesign.
Q: Can I run many pages in parallel?
Yes, but keep it to roughly three workers per host so you stay polite and avoid rate limits. A tail-recursive single-host walk with a small delay is the safe default.
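A bounded worker count can be sketched with a fixed-size thread pool behind Scala Futures, using only the standard library. fetchAll and its fetch parameter are hypothetical names; fetch stands in for whatever blocking call loads one URL, and the ten-minute timeout is an arbitrary placeholder.

```scala
import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration._

// Run `fetch` over all URLs with at most `workers` running at once.
// Results come back in the same order as the input list.
def fetchAll[A](urls: List[String], workers: Int)(fetch: String => A): List[A] = {
  val pool = Executors.newFixedThreadPool(workers)
  implicit val ec: ExecutionContext = ExecutionContext.fromExecutorService(pool)
  try Await.result(Future.traverse(urls)(url => Future(fetch(url))), 10.minutes)
  finally pool.shutdown() // release the threads even if a fetch fails
}

// Stand-in fetch so the sketch runs without a network.
val pages = fetchAll(List("page-1", "page-2", "page-3", "page-4"), workers = 3) { url =>
  s"<html>$url</html>"
}
println(pages.size)
```

The pool size is the concurrency limit: with three threads, a fourth URL waits until one of the first three finishes.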
At Scrapeless, we only access publicly available data while strictly complying with applicable laws, regulations, and website privacy policies. The content in this blog is for demonstration purposes only and does not involve any illegal or infringing activities. We make no guarantees and disclaim all liability for the use of information from this blog or third-party links. Before engaging in any scraping activities, consult your legal advisor and review the target website's terms of service or obtain the necessary permissions.



