How to Use Puppeteer Without Being Detected

Websites nowadays employ anti-bot software that can identify scrapers, so the key to a smooth scraping process is using appropriate masking techniques, such as headless browsers.
Puppeteer is a headless Chrome automation tool that can mimic real user activity to evade anti-bot systems like Cloudflare when web scraping. So how do you go about it?
This post covers the best techniques for scraping with Puppeteer without being detected. But first...
Puppeteer: What Is It?
Puppeteer is a Node.js library that offers a high-level API for programmatically controlling a headless Chromium browser.
It is simple to install with Yarn or npm, and one of its main advantages is that it drives the browser through the DevTools Protocol.
Can Anti-Bots Detect Puppeteer?
Yes, anti-bot systems can detect headless browsers driven by tools such as Selenium or Puppeteer.
To demonstrate this, let's try to crawl NowSecure as a quick scraping example. This website runs bot-verification tests and tells you whether or not you have passed its protection.
To do that, first install Node.js, and once that is done, install Puppeteer by running the following simple command:
npm install puppeteer
Then create a script that launches the browser, visits the target site, waits for the security check, and takes a screenshot:
const puppeteer = require('puppeteer');

(async () => {
  // Initiate the browser
  const browser = await puppeteer.launch();
  // Create a new page with the default browser context
  const page = await browser.newPage();
  // Set the page viewport
  await page.setViewport({ width: 1280, height: 720 });
  // Go to the target website
  await page.goto('https://nowsecure.nl/');
  // Wait for the security check
  await page.waitForTimeout(30000);
  // Take a screenshot
  await page.screenshot({ path: 'image.png', fullPage: true });
  // Close the browser and all of its pages
  await browser.close();
})();
In that example, we created a new browser page and visited the target website using the basic Puppeteer configuration, then took a screenshot after the security check.
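One reason this default configuration is easy to flag is that an automated browser exposes telltale signals; for example, navigator.webdriver is true when Chromium is driven by Puppeteer. As a quick side check, you can log that signal yourself with page.evaluate():

const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://nowsecure.nl/');
  // navigator.webdriver is true in automated browsers; it is one of the signals anti-bots inspect
  const isWebdriver = await page.evaluate(() => navigator.webdriver);
  console.log('navigator.webdriver:', isWebdriver);
  await browser.close();
})();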
3 Ways to Prevent Puppeteer Detection
Avoiding Puppeteer bot detection goes a long way toward a seamless crawling operation. Here's how to avoid getting blocked and keep Puppeteer from being detected while scraping:
1. Use Proxies
IP tracking is one of the most common anti-bot techniques: the bot-detection system monitors the requests the website receives, and when a single IP sends out a large number of requests in a short time, it can flag the Puppeteer scraper.
To evade detection in Puppeteer, you can use proxies, which act as a gateway between you and the internet: the proxy forwards your requests to the server and passes the response data back to you, so the target site only sees the proxy's IP.
To do this, we can launch Puppeteer with a proxy passed in the args option, as follows:
const puppeteer = require('puppeteer');

const proxy = ''; // Add your proxy here

(async () => {
  // Initiate the browser with a proxy
  const browser = await puppeteer.launch({ args: [`--proxy-server=${proxy}`] });
  // ... continue as before
})();
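If your proxy requires authentication, Puppeteer can pass the credentials with page.authenticate(). Below is a minimal sketch; the host, username, and password values are placeholders you would replace with your own:

const puppeteer = require('puppeteer');

const proxyHost = 'proxy.example.com:8080'; // placeholder proxy address

(async () => {
  // Launch the browser through the proxy
  const browser = await puppeteer.launch({ args: [`--proxy-server=${proxyHost}`] });
  const page = await browser.newPage();
  // Supply the proxy credentials before navigating
  await page.authenticate({ username: 'proxyUser', password: 'proxyPass' }); // placeholder credentials
  await page.goto('https://nowsecure.nl/');
  await browser.close();
})();

Rotating among several such proxies spreads requests across multiple IPs, which holds up better against IP-based rate limits than a single static proxy.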
2. Headers
Headers carry context and metadata about the HTTP request, and they can reveal whether the client is a bot or a standard web browser. Setting the appropriate headers on your requests helps prevent detection.
Puppeteer's default User-Agent identifies it as HeadlessChrome, so you should override it, along with other headers, with real browser values. This widely used header describes the application, operating system, vendor, and version of the client making the request.
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  // Add headers
  await page.setExtraHTTPHeaders({
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36',
    'upgrade-insecure-requests': '1',
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8',
    'accept-encoding': 'gzip, deflate, br',
    'accept-language': 'en-US,en;q=0.9'
  });
  // ... continue as before
})();
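Sending the exact same User-Agent on every request can itself become a pattern at scale. Here is a minimal sketch of rotating it per page with page.setUserAgent(), assuming you maintain your own pool of real browser User-Agent strings:

const puppeteer = require('puppeteer');

// Example pool of real browser User-Agent strings (extend with your own)
const userAgents = [
  'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36',
  'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36'
];

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  // Pick a random User-Agent for this page
  const ua = userAgents[Math.floor(Math.random() * userAgents.length)];
  await page.setUserAgent(ua);
  await page.goto('https://nowsecure.nl/');
  await browser.close();
})();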
3. Limit Requests
As mentioned earlier, an anti-bot can monitor behavior through the number of requests a user sends. Since real users don't fire hundreds of requests per second, limiting the number of requests and pausing between them helps prevent Puppeteer detection.
You can also limit the resources Puppeteer loads by using the page.setRequestInterception() method:
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  // Limit requests by intercepting them
  await page.setRequestInterception(true);
  page.on('request', async (request) => {
    if (request.resourceType() === 'image') {
      // Abort requests for images
      await request.abort();
    } else {
      // Let everything else through
      await request.continue();
    }
  });
  // ... continue as before
})();
By enabling interception with page.setRequestInterception(true), we abort Puppeteer's requests for images. This limits the number of requests, and because there are fewer resources to load and wait for, the scraper also runs faster.
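To cover the other half of the advice above, pausing between requests, you can add a randomized delay between page visits. Here is a minimal sketch; the urls array is a placeholder for whatever pages you need to scrape:

const puppeteer = require('puppeteer');

// Simple sleep helper
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  const urls = ['https://nowsecure.nl/']; // placeholder list of target pages
  for (const url of urls) {
    await page.goto(url, { waitUntil: 'domcontentloaded' });
    // ... extract data here
    // Pause 2-5 seconds between requests to mimic human pacing
    await sleep(2000 + Math.random() * 3000);
  }
  await browser.close();
})();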
Conclusion
There are a variety of techniques for evading detection with Puppeteer; in this post, we covered the most effective and straightforward ones.
Proxies, custom headers, request limiting, and Puppeteer-Stealth (sketched below) all have their limitations, but they can get the job done. Still, these techniques frequently fall short against sophisticated anti-bot defenses.
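For reference, the Puppeteer-Stealth approach is usually set up through the community puppeteer-extra and puppeteer-extra-plugin-stealth packages, which patch many of the signals anti-bots check for. A minimal sketch, assuming both packages are installed:

// npm install puppeteer-extra puppeteer-extra-plugin-stealth
const puppeteer = require('puppeteer-extra');
const StealthPlugin = require('puppeteer-extra-plugin-stealth');

// Register the stealth plugin before launching
puppeteer.use(StealthPlugin());

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://nowsecure.nl/');
  await page.screenshot({ path: 'stealth.png', fullPage: true });
  await browser.close();
})();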
With just one API request, Scrapeless handles every aspect of anti-bot bypassing for you, including CAPTCHAs, headless browsers, and rotating proxies. Getting started is also free.
At Scrapeless, we only access publicly available data while strictly complying with applicable laws, regulations, and website privacy policies. The content in this blog is for demonstration purposes only and does not involve any illegal or infringing activities. We make no guarantees and disclaim all liability for the use of information from this blog or third-party links. Before engaging in any scraping activities, consult your legal advisor and review the target website's terms of service or obtain the necessary permissions.