
How to Use Pyppeteer with a Proxy in 2024

James Thompson

Scraping and Proxy Management Expert

18-Sep-2024

Routing HTTP requests through many different IP addresses is crucial to avoid getting banned while web scraping. That's why in this tutorial we'll learn how to set up a proxy with Pyppeteer!

Prerequisites

Make sure your local system is running Python 3.6 or above.
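
You can confirm the interpreter version from a terminal:

bash
python --version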

Next, use pip to install Pyppeteer from PyPI by executing the line below.

bash
pip install pyppeteer
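
Note that on its first launch, Pyppeteer downloads a compatible Chromium build, which can take a while. If you'd rather fetch it ahead of time, the package ships a small helper command:

bash
pyppeteer-install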

Are you tired of constantly running into web scraping blocks?

Scrapeless: the best all-in-one web scraping solution available!

Stay anonymous and avoid IP-based bans with our intelligent, high-performance proxy rotation:

Try it for free!

How to Use a Proxy with Pyppeteer

To get started, create a script named scraper.py that requests your current IP address from ident.me.

python
import asyncio
from pyppeteer import launch
 
async def main():
    # Create a new headless browser instance
    browser = await launch()
    # Create a new page
    page = await browser.newPage()
    # Navigate to target website
    await page.goto('https://ident.me')
    # Select the body element
    body = await page.querySelector('body')
    # Get the text content of the selected element
    content = await page.evaluate('(element) => element.textContent', body)
    # Dump the result
    print(content)
    await browser.close()
 
asyncio.get_event_loop().run_until_complete(main())
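
A quick note on the last line: asyncio.get_event_loop().run_until_complete() mirrors the older Pyppeteer examples, but newer Python versions deprecate creating an event loop this way. On a recent interpreter, the modern equivalent works just as well:

python
asyncio.run(main())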

Run the script to print the body content of the target page, which is your current public IP address.

bash
python scraper.py

It's time to update your script to route requests through a proxy. To do that, grab a free proxy from FreeProxyList (free proxies are short-lived, so the one shown below will likely no longer work by the time you read this).

The scraper.py script uses the launch() function, which opens a new browser instance and accepts a number of options. One of them is args, a list of extra arguments to pass to the browser process. Set the --proxy-server argument there to tell the browser to route Pyppeteer's requests through a proxy.

python
# ...
async def main():
    # Create a new headless browser instance
    browser = await launch(args=['--proxy-server=http://20.219.108.109:8080'])
    # Create a new page
    page = await browser.newPage()
# ...

This is the whole code:

python
import asyncio
from pyppeteer import launch
 
async def main():
    # Create a new headless browser instance
    browser = await launch(args=['--proxy-server=http://20.219.108.109:8080'])
    # Create a new page
    page = await browser.newPage()
    # Navigate to target website
    await page.goto('https://ident.me')
    # Select the body element
    body = await page.querySelector('body')
    # Get the text content of the selected element
    content = await page.evaluate('(element) => element.textContent', body)
    # Dump the result
    print(content)
    await browser.close()
 
asyncio.get_event_loop().run_until_complete(main())

This time, when you run the script again with python scraper.py, the proxy's IP address should appear on the screen.

Output
20.219.108.109
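
Keep in mind that free proxies are often slow or already dead, in which case page.goto() will hang until Pyppeteer's default navigation timeout (30 seconds) and then raise an error. Here is a sketch of how you might guard the navigation with a shorter explicit timeout:

python
# Sketch: cap the navigation wait and fail gracefully instead of crashing.
try:
    await page.goto('https://ident.me', {'timeout': 15000})  # 15-second cap
    body = await page.querySelector('body')
    content = await page.evaluate('(element) => element.textContent', body)
    print(content)
except Exception as error:  # e.g. pyppeteer.errors.TimeoutError
    print(f'Proxy request failed: {error}')
finally:
    await browser.close()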

Proxy Authentication in Pyppeteer

You will need a username and password for authentication if you use a premium proxy. One approach is the --proxy-auth argument, though not all Chromium builds honor it:

python
# ...
    # Create a new headless browser instance
    browser = await launch(args=[
        '--proxy-server=http://20.219.108.109:8080',
        '--proxy-auth=<YOUR_USERNAME>:<YOUR_PASSWORD>',
        ])
# ...

The more dependable alternative is to authenticate through the page API, as seen below:

python
# ...
    # Create a new page
    page = await browser.newPage()
    await page.authenticate({ 'username': '<YOUR_USERNAME>', 'password': '<YOUR_PASSWORD>' })
# ...
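
Putting the two pieces together: pass only the proxy address to Chromium at launch, then supply the credentials through page.authenticate() before the first navigation. A minimal sketch (host, port, and credentials are placeholders):

python
import asyncio
from pyppeteer import launch

async def main():
    # Chromium gets only the proxy address; credentials go through the page API
    browser = await launch(args=['--proxy-server=<PROXY_HOST>:<PROXY_PORT>'])
    page = await browser.newPage()
    # Answer the proxy's authentication challenge before navigating
    await page.authenticate({'username': '<YOUR_USERNAME>', 'password': '<YOUR_PASSWORD>'})
    await page.goto('https://ident.me')
    print(await page.evaluate('() => document.body.textContent'))
    await browser.close()

asyncio.get_event_loop().run_until_complete(main())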

Use Pyppeteer to Configure a Dynamic Proxy

To avoid being blacklisted while scraping, use a dynamic proxy rather than the static one you used above. With Pyppeteer, you can launch several browser instances, each with its own proxy configuration.

To begin, obtain additional free proxies and compile a list of them:

python
# ...
import random

proxies = [
    'http://20.219.108.109:8080',
    'http://210.22.77.94:9002',
    'http://103.150.18.218:80',
]
# ...

Next, write an asynchronous function that accepts a proxy as an argument and uses it to make a Pyppeteer request to ident.me:

python
# ...
async def init_pyppeteer_proxy_request(proxy):
    # Create a new headless browser instance that routes through the given proxy
    browser = await launch(args=[
        f'--proxy-server={proxy}',
        ])
    # Create a new page
    page = await browser.newPage()
    # Navigate to target website
    await page.goto('https://ident.me')
    # Select the body element
    body = await page.querySelector('body')
    # Get the text content of the selected element
    content = await page.evaluate('(element) => element.textContent', body)
    # Dump the result
    print(content)
    await browser.close()
# ...

Now, change the main() function so that it calls the newly created function with a randomly chosen proxy:

python
# ...
async def main():
    for i in range(3):
        await init_pyppeteer_proxy_request(random.choice(proxies))
# ...

This is how your code should currently appear:

python
import asyncio
from pyppeteer import launch
import random

proxies = [
    'http://20.219.108.109:8080',
    'http://210.22.77.94:9002',
    'http://103.150.18.218:80',
]

async def init_pyppeteer_proxy_request(proxy):
    # Create a new headless browser instance that routes through the given proxy
    browser = await launch(args=[
        f'--proxy-server={proxy}',
        ])
    # Create a new page
    page = await browser.newPage()
    # Navigate to target website
    await page.goto('https://ident.me')
    # Select the body element
    body = await page.querySelector('body')
    # Get the text content of the selected element
    content = await page.evaluate('(element) => element.textContent', body)
    # Dump the result
    print(content)
    await browser.close()
 
async def main():
    for i in range(3):
        await init_pyppeteer_proxy_request(random.choice(proxies))
    
 
asyncio.get_event_loop().run_until_complete(main())
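
Because free proxies die without notice, a single bad pick will crash the run. In practice you may want to retry with a different proxy on failure. A minimal sketch that wraps the function defined above (the retry count of 3 is an arbitrary choice):

python
async def request_with_retries(retries=3):
    # Try up to `retries` randomly chosen proxies before giving up
    for _ in range(retries):
        proxy = random.choice(proxies)
        try:
            await init_pyppeteer_proxy_request(proxy)
            return  # success, stop retrying
        except Exception as error:
            print(f'{proxy} failed ({error}), trying another proxy...')
    print('All attempts failed.')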

Conclusion

Using a proxy with Pyppeteer can greatly improve your web scraping success rate, and you now know how to send requests through both static and dynamic proxies.

That said, a dedicated tool can often complete the task more quickly and reliably. If you need to scrape at scale without worrying about infrastructure, and want stronger assurances that you will actually get the data you need, the Scrapeless web scraping toolkit can be your ally.

At Scrapeless, we only access publicly available data while strictly complying with applicable laws, regulations, and website privacy policies. The content in this blog is for demonstration purposes only and does not involve any illegal or infringing activities. We make no guarantees and disclaim all liability for the use of information from this blog or third-party links. Before engaging in any scraping activities, consult your legal advisor and review the target website's terms of service or obtain the necessary permissions.
