Rotating Free Elite Proxies for Web Scraping in Python 3 and an Alternative

I recently started playing around with web scraping for one of my data mining projects. Like most serious web scrapers, I wanted to avoid getting blocked by the websites I intended to scrape. (Though, I have to admit that I did this purely for learning purposes, on a web server specifically set up for this task.)

Some common techniques for minimizing the risk of getting blocked are:

  • Rotating IP addresses
  • Using Proxies
  • Rotating and Spoofing user agents
  • Using headless browsers
  • Reducing the crawling rate

As you probably guessed from the title of this post, I’ll be focusing on the first two bullet points. When those two techniques are combined, the result is often called a rotating proxy.

Info
A rotating proxy is a proxy server that assigns a new IP address from the proxy pool for every connection. That means you can launch a script to send 1,000 requests to any number of sites and get 1,000 different IP addresses.
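
For context, here is how a single proxy is used with Python’s requests module: you pass a proxies dictionary that maps each scheme to the proxy’s address. A minimal sketch, using a placeholder address (203.0.113.10 is a reserved documentation IP, not a real proxy):

import requests

# Placeholder proxy address for illustration only
proxy = {"http": "203.0.113.10:3128", "https": "203.0.113.10:3128"}

# https://httpbin.org/ip echoes the IP the request came from,
# which should be the proxy's IP instead of your own
r = requests.get("https://httpbin.org/ip", proxies=proxy, timeout=7)
print(r.json()["origin"])

A rotating proxy setup simply swaps the address in this dictionary for every request.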

As this isn’t something I came up with myself, lots of paid solutions such as scraperapi.com, scrapehero.com, or parsehub.com can be purchased as a service. While these offer dedicated proxies of various kinds (such as mobile or residential proxies) and usually don’t even require rotating proxies manually, they tend to get pricey quite quickly (especially if you are running a hobby project). If you don’t care too much about reliability, confidentiality, or scalability, then I’ve got a free pythonic alternative.

There ain’t no such thing as a free lunch 🍱

Not only can free proxies be super slow, they are also notorious for man-in-the-middle attacks: it’s estimated that roughly a quarter of them modify the content passed through them. Confidentiality isn’t really a priority for free proxy servers either, as more than 60 % of them completely ban SSL-encrypted traffic. An interesting read about the risks of free proxies can be found here. 1 For a more reliable option than rotating free proxies, see the section on alternatives below.

Therefore, you should always use HTTPS-enabled proxies and never transmit any sensitive data (such as passwords, session cookies, tokens, etc.) through them, as it could end up in the wrong hands.

If you are using proxy servers for the sake of anonymity, make sure to use an elite proxy, as elite proxies do not reveal the source IP in the request headers.
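
If you want to verify this yourself, you can inspect which headers actually arrive at the target server. A minimal sketch, again with a placeholder proxy address: it requests http://httpbin.org/headers through the proxy (plain HTTP on purpose, since with HTTPS the proxy only tunnels the traffic and cannot add headers) and looks for the typical giveaway headers a transparent or anonymous proxy would add:

import requests

# Placeholder proxy address; substitute one from your pool
proxy = {"http": "203.0.113.10:3128", "https": "203.0.113.10:3128"}

r = requests.get("http://httpbin.org/headers", proxies=proxy, timeout=7)
headers = r.json()["headers"]

# Transparent and anonymous proxies typically add one of these headers;
# an elite proxy should add none of them
for leak in ("X-Forwarded-For", "Via", "X-Real-Ip"):
    if leak in headers:
        print(f"Proxy reveals itself via {leak}: {headers[leak]}")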

How It Works in Theory

Great, you made it past the disclaimer. Let’s have a look at how a script for rotating proxies could work in theory.

It works like this:

  1. Get a list of free elite proxies that allow HTTPS from https://sslproxies.org/ and store them as list entries.
  2. Make a new GET request using Python’s requests module:
    1. Select a random proxy server from the list
    2. Return the response (website) if successful
    3. In case of an error, remove the proxy server from the list. This step is crucial when using free proxy servers, as they are oftentimes overloaded or no longer available. Furthermore, we should catch any SSL connection errors.
    4. Rinse and repeat
  3. Once the list of available servers is used up, we’ll do a complete refresh of the proxy list.

What the Implementation Looks Like in Python

Once I knew how it should work in theory, it was quite easy to implement in Python. Basically, I recycled a couple of other scripts I had seen online and aggregated them into one neat piece of code. 2 3 One adjustment I made myself was to strictly filter for elite proxies, as they are considered more secure than transparent or anonymous proxies (a transparent proxy forwards your real IP, an anonymous proxy identifies itself as a proxy, while an elite proxy does neither). I further bundled everything into a class, which allows me to reuse the whole script in lots of different projects.

from random import choice

import requests
from lxml.html import fromstring


class Proxies:

    def __init__(self):
        self.proxy = None
        self.proxies = []

    @staticmethod
    def get_proxies():
        # Scrape the free proxy list and keep only HTTPS-enabled elite proxies
        url = 'https://sslproxies.org/'
        response = requests.get(url)
        parser = fromstring(response.text)
        p = []
        for i in parser.xpath('//tbody/tr'):
            # Column 7 ("Https") must be "yes" and column 5 ("Anonymity") must be "elite proxy"
            if i.xpath('.//td[7][contains(text(),"yes")]'):
                if i.xpath('.//td[5][contains(text(),"elite proxy")]'):
                    # Join IP (column 1) and port (column 2) into "ip:port"
                    proxy = ":".join([i.xpath('.//td[1]/text()')[0], i.xpath('.//td[2]/text()')[0]])
                    p.append(proxy)
        return p

    @staticmethod
    def to_proxy(proxy):
        # Build the proxies dictionary expected by requests
        return {"http": proxy, "https": proxy}

    def get(self):
        # Refresh the pool once it is (almost) used up
        if len(self.proxies) < 2:
            self.proxies = self.get_proxies()
        self.proxy = choice(self.proxies)
        return self.to_proxy(self.proxy)

    def remove(self):
        # Remove the failing proxy from the pool
        if self.proxy in self.proxies:
            self.proxies.remove(self.proxy)

    def scrape(self, url, **kwargs):
        # Retry with a new random proxy until the request is successful
        while True:
            try:
                proxy = self.get()
                print("Proxy currently being used: {}".format(proxy))
                response = requests.get(url, proxies=proxy, timeout=7, **kwargs)
                # If the request is successful, no exception is raised
                break
            except requests.exceptions.ProxyError:
                print("Proxy error, choosing a new proxy")
                self.remove()
            except requests.exceptions.ConnectTimeout:
                print("Connect error, choosing a new proxy")
                self.remove()
            except requests.exceptions.SSLError:
                print("SSL error, choosing a new proxy")
                self.remove()
        return response

How to Use the Script

The script can be used like this:

import proxies  # the Proxies class from above, saved as proxies.py
proxy = proxies.Proxies()
# Make each request using a randomly selected proxy
for i in range(10):
    r = proxy.scrape('https://httpbin.org/ip')
    print(r.text)

The output will look like this:

Proxy currently being used: {'http': '103.194.171.162:5836', 'https': '103.194.171.162:5836'}
{
  "origin": "103.194.171.162"
}

Proxy currently being used: {'http': '85.10.219.98:1080', 'https': '85.10.219.98:1080'}
Proxy error, choosing a new proxy
Proxy currently being used: {'http': '103.194.171.161:5836', 'https': '103.194.171.161:5836'}
{
  "origin": "103.194.171.161"
}

A More Reliable and Faster Alternative 👑

While this setup certainly works for fetching small amounts of data from simple sites, it has a couple of drawbacks. As stated earlier, free proxy servers are notoriously overloaded, which causes requests to time out frequently. In one of my tests, only 5 out of 200 requests returned a valid and complete response. This may be acceptable for a small personal project, but not for anything larger. Another drawback is that the free proxy servers are widely known and therefore tend to be blocked by the web application firewalls and application proxies of common sites such as Google, Amazon, or even major online travel agencies.

An easy way to overcome these hurdles is a proxy network such as crawlera, which rotates proxies for you from its own pool. By routing requests automatically through a network of hundreds of (working) proxies, it does the heavy lifting for you. In one of my tests using Crawlera, all of 10,000 requests delivered a valid response, something I haven’t achieved with any other solution, be it self-coded or paid.

Using crawlera is extremely simple, and it can easily be added to existing projects.

  1. Sign up for a free account and pick a plan (the cheapest plan also includes a free trial)
  2. Pass your API Key as part of the proxy dictionary as shown in the following snippet.
  3. Protect your API key like your password. Never commit it to GitHub or expose it elsewhere.
import requests

proxy_host = "proxy.crawlera.com"
proxy_port = "8010"
proxy_auth = "<CRAWLERA API KEY>:"  # Make sure to include ':' at the end
proxies = {
    "https": f"http://{proxy_auth}@{proxy_host}:{proxy_port}/",
    "http": f"http://{proxy_auth}@{proxy_host}:{proxy_port}/"
}

url = "https://httpbin.org/ip"  # replace with the URL you want to scrape

r = requests.get(url, proxies=proxies, verify='/path/to/crawlera-ca.crt')

I hope my script is also useful for your own Python web scraping projects.


  1. The Risks of a Free Proxy Server proxyrack.com ↩︎

  2. How To Rotate Proxies and change IP Addresses using Python 3 scrapehero.com ↩︎

  3. Proxy Rotator in Python – Complete Guide zenscrape.com ↩︎