How to Use Puppeteer Without Being Detected

Websites nowadays employ anti-bot software that can identify scrapers, so the key to a smooth scraping process is using appropriate masking techniques, such as headless browsers.
Puppeteer is a headless Chrome automation tool that can mimic real user activity to evade anti-bot systems like Cloudflare when web scraping. So how do you go about it?
This post covers the best techniques for scraping with Puppeteer without being detected. But first...
Puppeteer: What Is It?
Puppeteer is a Node.js library that offers a high-level API for programmatically controlling a headless Chromium browser.
It is simple to install with Yarn or npm, and one of its main advantages is that it drives the browser through the DevTools Protocol.
Can Anti-Bots Detect Puppeteer?
Yes, anti-bot systems can detect headless browsers driven by tools such as Selenium or Puppeteer.
To demonstrate this, let's try to crawl NowSecure as a quick scraping example. This website runs bot-verification tests and tells you whether or not you have passed its protection.
To do that, first install Node.js, and once that is done, install Puppeteer by running the following simple command:
npm install puppeteer
Then create a script that launches the browser, visits the target site, waits for the security check, and takes a screenshot:
const puppeteer = require('puppeteer');

(async () => {
  // Initiate the browser
  const browser = await puppeteer.launch();
  // Create a new page with the default browser context
  const page = await browser.newPage();
  // Set the page viewport
  await page.setViewport({ width: 1280, height: 720 });
  // Go to the target website
  await page.goto('https://nowsecure.nl/');
  // Wait for the security check
  await page.waitForTimeout(30000);
  // Take a screenshot
  await page.screenshot({ path: 'image.png', fullPage: true });
  // Close the browser and all of its pages
  await browser.close();
})();
In that example, we created a new browser page and visited the target website using the basic Puppeteer configuration, then took a screenshot after the security check.
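One reason this default configuration is easy to flag is that an automated browser exposes telltale signals; for example, navigator.webdriver is true when Chromium is driven by Puppeteer. As a quick side check, you can log that signal yourself with page.evaluate():

const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://nowsecure.nl/');
  // navigator.webdriver is true in automated browsers; it is one of the signals anti-bots inspect
  const isWebdriver = await page.evaluate(() => navigator.webdriver);
  console.log('navigator.webdriver:', isWebdriver);
  await browser.close();
})();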
3 Ways to Prevent Puppeteer Detection
Avoiding Puppeteer bot detection goes a long way toward a seamless crawling operation. Here's how to avoid getting blocked and keep Puppeteer from being detected while scraping:
1. Use Proxies
IP tracking is one of the most common anti-bot techniques: the bot-detection system monitors the requests the website receives, and when a single IP sends out a large number of requests in a short time, it can flag the Puppeteer scraper.
To evade detection in Puppeteer, you can use proxies, which act as a gateway between you and the internet: the proxy forwards your requests to the server and passes the response data back to you, so the target site only sees the proxy's IP.
To do this, we can launch Puppeteer with a proxy passed in the args option, as follows:
const puppeteer = require('puppeteer');

const proxy = ''; // Add your proxy here

(async () => {
  // Initiate the browser with a proxy
  const browser = await puppeteer.launch({ args: [`--proxy-server=${proxy}`] });
  // ... continue as before
})();
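If your proxy requires authentication, Puppeteer can pass the credentials with page.authenticate(). Below is a minimal sketch; the host, username, and password values are placeholders you would replace with your own:

const puppeteer = require('puppeteer');

const proxyHost = 'proxy.example.com:8080'; // placeholder proxy address

(async () => {
  // Launch the browser through the proxy
  const browser = await puppeteer.launch({ args: [`--proxy-server=${proxyHost}`] });
  const page = await browser.newPage();
  // Supply the proxy credentials before navigating
  await page.authenticate({ username: 'proxyUser', password: 'proxyPass' }); // placeholder credentials
  await page.goto('https://nowsecure.nl/');
  await browser.close();
})();

Rotating among several such proxies spreads requests across multiple IPs, which holds up better against IP-based rate limits than a single static proxy.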
2. Headers
Headers carry context and metadata about the HTTP request, and they can reveal whether the client is a bot or a standard web browser. Setting the appropriate headers on your requests helps prevent detection.
Puppeteer's default User-Agent identifies it as HeadlessChrome, so you should override it, along with other headers, with real browser values. This widely used header describes the application, operating system, vendor, and version of the client making the request.
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  // Add headers
  await page.setExtraHTTPHeaders({
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36',
    'upgrade-insecure-requests': '1',
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8',
    'accept-encoding': 'gzip, deflate, br',
    'accept-language': 'en-US,en;q=0.9'
  });
  // ... continue as before
})();
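Sending the exact same User-Agent on every request can itself become a pattern at scale. Here is a minimal sketch of rotating it per page with page.setUserAgent(), assuming you maintain your own pool of real browser User-Agent strings:

const puppeteer = require('puppeteer');

// Example pool of real browser User-Agent strings (extend with your own)
const userAgents = [
  'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36',
  'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36'
];

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  // Pick a random User-Agent for this page
  const ua = userAgents[Math.floor(Math.random() * userAgents.length)];
  await page.setUserAgent(ua);
  await page.goto('https://nowsecure.nl/');
  await browser.close();
})();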
3. Limit Requests
As mentioned earlier, an anti-bot can monitor behavior through the number of requests a user sends. Since real users don't fire hundreds of requests per second, limiting the number of requests and pausing between them helps prevent Puppeteer detection.
You can also limit the resources Puppeteer loads by using the page.setRequestInterception() method:
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  // Limit requests by intercepting them
  await page.setRequestInterception(true);
  page.on('request', async (request) => {
    if (request.resourceType() === 'image') {
      // Abort requests for images
      await request.abort();
    } else {
      // Let everything else through
      await request.continue();
    }
  });
  // ... continue as before
})();
By enabling interception with page.setRequestInterception(true), we abort Puppeteer's requests for images. This limits the number of requests, and because there are fewer resources to load and wait for, the scraper also runs faster.
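To cover the other half of the advice above, pausing between requests, you can add a randomized delay between page visits. Here is a minimal sketch; the urls array is a placeholder for whatever pages you need to scrape:

const puppeteer = require('puppeteer');

// Simple sleep helper
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  const urls = ['https://nowsecure.nl/']; // placeholder list of target pages
  for (const url of urls) {
    await page.goto(url, { waitUntil: 'domcontentloaded' });
    // ... extract data here
    // Pause 2-5 seconds between requests to mimic human pacing
    await sleep(2000 + Math.random() * 3000);
  }
  await browser.close();
})();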
Conclusion
There are a variety of techniques for evading detection with Puppeteer; in this post, we covered the most effective and straightforward ones.
Proxies, custom headers, request limiting, and Puppeteer-Stealth (sketched below) all have their limitations, but they can get the job done. Still, these techniques frequently fall short against sophisticated anti-bot defenses.
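For reference, the Puppeteer-Stealth approach is usually set up through the community puppeteer-extra and puppeteer-extra-plugin-stealth packages, which patch many of the signals anti-bots check for. A minimal sketch, assuming both packages are installed:

// npm install puppeteer-extra puppeteer-extra-plugin-stealth
const puppeteer = require('puppeteer-extra');
const StealthPlugin = require('puppeteer-extra-plugin-stealth');

// Register the stealth plugin before launching
puppeteer.use(StealthPlugin());

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://nowsecure.nl/');
  await page.screenshot({ path: 'stealth.png', fullPage: true });
  await browser.close();
})();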
With just one API request, Scrapeless handles every aspect of anti-bot bypassing for you, including CAPTCHAs, headless browsers, and rotating proxies. Getting started is also free.
At Scrapeless, we only access publicly available data while strictly complying with applicable laws, regulations, and website privacy policies. The content in this blog is for demonstration purposes only and does not involve any illegal or infringing activities. We make no guarantees and disclaim all liability for the use of information from this blog or third-party links. Before engaging in any scraping activities, consult your legal advisor and review the target website's terms of service or obtain the necessary permissions.