Rotating Free Elite Proxies for Web Scraping in Python 3 and an Alternative
I recently started playing around with web scraping for one of my data mining projects. Like most serious web scrapers, I wanted to avoid getting blocked by the websites I was scraping. (Though, I have to admit that I did this purely for learning purposes, against a web server specifically set up for this task.)
Some common techniques for minimizing the risk of getting blocked are:
- Rotating IP addresses
- Using Proxies
- Rotating and Spoofing user agents
- Using headless browsers
- Reducing the crawling rate
Like you probably guessed from the title of this post, I’ll be focusing on the first two bullet points. If those two points are combined, it’s often called a rotating proxy.
This isn’t something I came up with myself; lots of paid solutions, such as scraperapi.com, scrapehero.com, or parsehub.com, offer it as a service. While these have dedicated proxies of various kinds (such as mobile or residential proxies) and usually don’t even require rotating proxies manually, they tend to get pricey quite quickly (especially if you are running a hobby project). If you don’t care too much about reliability, confidentiality, or scalability, then I’ve got a free Pythonic alternative.
There ain’t no such thing as a free lunch 🍱
Free proxies are not just often painfully slow; they are also notorious for man-in-the-middle attacks. It’s estimated that roughly a quarter of them modify the content passed through them. Confidentiality isn’t a priority for free proxy servers either, as more than 60 % ban SSL-encrypted traffic completely. An interesting read about the risks of free proxies can be found here. 1 For an alternative to rotating free proxies, see the last section of this post.
Therefore, you should always use HTTPS enabled proxies and never transmit any sensitive data (such as passwords, session cookies, tokens etc.) through them, as those could end up in the wrong hands.
If you are using proxy servers for the sake of anonymity, make sure to use an elite proxy, as these do not reveal the source IP in the request headers.
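One quick way to verify this is to request a header-echo endpoint such as https://httpbin.org/headers through the proxy and make sure none of the revealing headers come back. The helper below is my own sketch (the header list and function name are assumptions, not part of any library):

```python
# Headers that transparent or anonymous proxies typically add, revealing
# either your real IP or the fact that you are behind a proxy at all.
REVEALING_HEADERS = {"X-Forwarded-For", "X-Real-Ip", "Via", "Forwarded"}

def looks_elite(echoed_headers):
    """Check the headers a test endpoint echoed back for a request made
    through the proxy: an elite proxy should leak none of these."""
    return not REVEALING_HEADERS & {h.title() for h in echoed_headers}

# Usage sketch (live network call, hence commented out):
# import requests
# echoed = requests.get("https://httpbin.org/headers",
#                       proxies={"https": "103.194.171.162:5836"},
#                       timeout=10).json()["headers"]
# print(looks_elite(echoed))
```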
How It Works in Theory
Great, you made it past the disclaimer. Let’s have a look at how a script for rotating proxies could work in theory.
It works like this:
- Get a list of free elite proxies from https://sslproxies.org/ which allow HTTPS, and store them as list entries.
- Make a new GET request using Python’s requests module:
  - Select a random proxy server from the list.
  - Return the response (website) if successful.
  - In case of an error, remove the proxy server from the list. This step is crucial if you’re using free proxy servers, as they are oftentimes overloaded or no longer available. We should also catch any SSL connection errors.
- Rinse and repeat.
- Once the list of available servers is used up, do a complete refresh of the proxy list.
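The retry part of the steps above can be sketched as a single bare-bones function. Step 1 (fetching the list) is assumed to have happened elsewhere; the function and its name are my own illustration:

```python
import random

import requests


def get_with_rotation(url, proxies):
    """Try random proxies from `proxies` (a list of "host:port" strings)
    until one of them answers."""
    while proxies:
        proxy = random.choice(proxies)  # select a random proxy server
        try:
            # Return the response if the request succeeds.
            return requests.get(
                url, proxies={"http": proxy, "https": proxy}, timeout=10
            )
        except requests.exceptions.RequestException:
            # Covers SSL errors, timeouts and dead proxies alike:
            # drop the server and try the next one.
            proxies.remove(proxy)
    # List used up: the caller should refresh it and call again.
    raise RuntimeError("proxy list exhausted, refresh it")
```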
What the Implementation Looks Like in Python
Once I knew how it should work in theory, it was quite easy to implement in Python. Basically, I recycled a couple of other scripts I had seen online and aggregated them into one neat piece of code. 2 3 One adjustment I made myself was to strictly filter for elite proxies, as they are considered more secure than transparent or anonymous proxies. I further bundled everything into a class, which allows me to reuse the whole script in lots of different projects.
How to Use the Script
The script can be used like this:
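Since the class boils down to the steps from the theory section plus the elite-proxy filter, a self-contained sketch might look as follows. The parsing assumes sslproxies.org still serves its proxies as an HTML table with IP and port in the first two cells of each row and the anonymity level in a later one; the class and method names are illustrative:

```python
import random
import re

import requests


class ProxyRotator:
    """Rotate through free elite HTTPS proxies scraped from sslproxies.org."""

    PROXY_LIST_URL = "https://sslproxies.org/"

    def __init__(self, proxies=None):
        # A prepared list can be injected (handy for testing);
        # otherwise the free list is scraped on construction.
        self.proxies = list(proxies) if proxies is not None else self._fetch_proxies()

    @staticmethod
    def _parse(html):
        # Pull "ip:port" pairs out of the table rows, strictly
        # filtering for elite proxies.
        proxies = []
        for row in re.findall(r"<tr>.*?</tr>", html, re.S):
            match = re.search(
                r"<td>(\d{1,3}(?:\.\d{1,3}){3})</td><td>(\d{2,5})</td>", row
            )
            if match and "elite proxy" in row:
                proxies.append(f"{match.group(1)}:{match.group(2)}")
        return proxies

    def _fetch_proxies(self):
        html = requests.get(self.PROXY_LIST_URL, timeout=10).text
        return self._parse(html)

    def get(self, url, **kwargs):
        """Fetch `url`, retrying with a new random proxy on every error."""
        while True:
            if not self.proxies:
                # List used up: do a complete refresh.
                self.proxies = self._fetch_proxies()
            proxy = random.choice(self.proxies)
            proxy_dict = {"http": proxy, "https": proxy}
            print("Proxy currently being used:", proxy_dict)
            try:
                return requests.get(url, proxies=proxy_dict, timeout=10, **kwargs)
            except requests.exceptions.RequestException:
                # Dead, overloaded or SSL-blocking proxy: drop it.
                print("Proxy error, choosing a new proxy")
                self.proxies.remove(proxy)


# Usage sketch (live network calls, hence commented out):
# rotator = ProxyRotator()
# print(rotator.get("https://httpbin.org/ip").text)
```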
The output will look like this:
Proxy currently being used: {'http': '103.194.171.162:5836', 'https': '103.194.171.162:5836'}
{
"origin": "103.194.171.162"
}
Proxy currently being used: {'http': '85.10.219.98:1080', 'https': '85.10.219.98:1080'}
Proxy error, choosing a new proxy
Proxy currently being used: {'http': '103.194.171.161:5836', 'https': '103.194.171.161:5836'}
{
"origin": "103.194.171.161"
}
A More Reliable and Faster Alternative 👑
While this setup certainly works for fetching small amounts of data from simple sites, it has a couple of drawbacks. As stated earlier, free proxy servers are notoriously overloaded, which causes requests to time out frequently. In one test, only 5 out of 200 requests returned a valid and complete response. This may be acceptable for a small, personal project, but not for anything larger. Another drawback is that the free proxy servers are widely known and therefore tend to be blocked by the web application firewalls and application proxies of common sites such as Google, Amazon or even the major online travel agencies. An easy way to overcome these hurdles is a proxy network such as Crawlera, which rotates proxies for you and maintains its own pool of proxies. By routing requests automatically through a network of hundreds of (working) proxies, it does the heavy lifting for you. In one of my tests using Crawlera, all 10,000 requests delivered a valid response, something I haven’t achieved with any other service, be it self-coded or paid.
Using Crawlera is extremely simple, and it can easily be added to existing projects.
- Sign up for a free account and pick a plan (the cheapest plan also includes a free trial)
- Pass your API Key as part of the proxy dictionary as shown in the following snippet.
- Protect your API key like your password. Never commit it to GitHub or expose it elsewhere.
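A proxy dictionary along these lines should do it. The endpoint and the key-as-username convention below follow Crawlera’s proxy-style API as I remember it from their documentation, and the key itself is a placeholder you replace with your own:

```python
# Crawlera acts like a regular HTTP proxy: pass the API key as the
# proxy username with an empty password (endpoint per Crawlera's docs).
API_KEY = "<YOUR_CRAWLERA_API_KEY>"  # placeholder, never commit the real key

crawlera_proxies = {
    "http": f"http://{API_KEY}:@proxy.crawlera.com:8010/",
    "https": f"http://{API_KEY}:@proxy.crawlera.com:8010/",
}

# Usage with requests (commented out: live call, and HTTPS targets also
# need Crawlera's CA certificate to be trusted, see their docs):
# import requests
# print(requests.get("https://httpbin.org/ip", proxies=crawlera_proxies).text)
```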
I hope my script is also useful for your own Python web scraping projects.
The Risks of a Free Proxy Server proxyrack.com ↩︎
How To Rotate Proxies and change IP Addresses using Python 3 scrapehero.com ↩︎
Proxy Rotator in Python – Complete Guide zenscrape.com ↩︎