This article explains how to change the User-Agent of urllib, a module of the standard library of Python.
While using Python's urllib module, you may want to change the User-Agent sent by its functions.
The Python Library Reference says that, by default, the URLopener class sends a User-Agent header of urllib/VVV, where VVV is the urllib version number, as we can see in the following code:
>>> from urllib import URLopener
>>> URLopener.version
'Python-urllib/1.16'
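The session above is Python 2. As a rough sketch for Python 3, assuming the reorganized urllib.request module, the default value can be read from the header list of a freshly built opener. The helper name default_user_agent is illustrative and not part of the library:

```python
import urllib.request


def default_user_agent():
    """Return the User-Agent value a freshly built opener would send."""
    opener = urllib.request.build_opener()
    # addheaders is a list of (name, value) tuples; compare names
    # case-insensitively to avoid depending on the exact capitalization.
    headers = {name.lower(): value for name, value in opener.addheaders}
    return headers.get('user-agent')


print(default_user_agent())
```

The printed string should start with Python-urllib/ followed by the interpreter's version number.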
Several sites (e.g. Google, Wikipedia) don't like this User-Agent and will return an error message when you try to access their pages using urllib:
>>> from urllib import urlopen
>>> page = urlopen('http://www.google.com/search?q=python')
>>> page.read()
[…]<b>Error</b><H1>Forbidden</H1>Your client does not have permission
to get URL <code>/search?q=python</code> from this server.[…]
>>> page = urlopen('http://en.wikipedia.org/wiki/Python')
>>> page.read()
[…]Error: ERR_ACCESS_DENIED, errno [No Error] at Tue, 25 Dec 2007 15:45:20 GMT[…]
So, how can we change the User-Agent? If we don't want to change the headers using a lower-level module such as httplib, the solution is quite easy:
Applications can define their own User-Agent header by subclassing URLopener or FancyURLopener and setting the class attribute version to an appropriate string value in the subclass definition.
Let's see how it works:
>>> from urllib import FancyURLopener
>>> class MyOpener(FancyURLopener):
...     version = 'My new User-Agent'
We have defined a new class, named MyOpener, with a new UA: 'My new User-Agent'.
>>> MyOpener.version
'My new User-Agent'
However, this is not enough if we want to access Google or Wikipedia. These sites want a browser-like User-Agent, so we need to change the version with:
>>> class MyOpener(FancyURLopener):
...     version = 'Mozilla/5.0 (Windows; U; Windows NT 5.1; it; rv:1.8.1.11) Gecko/20071127 Firefox/2.0.0.11'
Now we can create a new instance of MyOpener and try again using the .open() method (instead of the urlopen() function):
>>> myopener = MyOpener()
>>> page = myopener.open('http://www.google.com/search?q=python')
>>> page.read()
[…]Results <b>1</b> - <b>10</b> of about <b>81,800,000</b> for <b>python</b>[…]
>>> page = myopener.open('http://en.wikipedia.org/wiki/Python')
>>> page.read()
[…]<h1 class="firstHeading">Python</h1><h3 id="siteSub">From Wikipedia, the free encyclopedia</h3>[…]
Using the methods of MyOpener we will be able to open or retrieve the pages we need, sending our User-Agent instead of the one used by urllib.
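The code in this article targets Python 2, where FancyURLopener lives in urllib. As a rough sketch for Python 3, assuming the urllib.request module and its build_opener() function, a similar effect can be obtained by replacing the opener's addheaders list instead of subclassing. The names make_opener and BROWSER_UA are illustrative and not part of the library:

```python
import urllib.request

# Same browser-like string used in the article.
BROWSER_UA = ('Mozilla/5.0 (Windows; U; Windows NT 5.1; it; rv:1.8.1.11) '
              'Gecko/20071127 Firefox/2.0.0.11')


def make_opener(user_agent=BROWSER_UA):
    """Build an opener that sends user_agent instead of the default UA."""
    opener = urllib.request.build_opener()
    # Replace the default header list, which only holds the User-Agent.
    opener.addheaders = [('User-Agent', user_agent)]
    return opener


myopener = make_opener()
# The call below needs network access, so it is left commented out:
# page = myopener.open('http://en.wikipedia.org/wiki/Python')
```

This variant changes the header on one opener object only, so other code that calls urlopen() directly is not affected.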
How can we replace the functions of urllib with the methods of MyOpener? Even in this case the answer is easy:
>>> urlopen = MyOpener().open
>>> urlretrieve = MyOpener().retrieve
After this, we will be able to open or retrieve a page using page = urlopen('http://www.google.com'). Note that the URL must include the scheme (http://).
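For Python 3, a comparable way to make the module-level function use a custom User-Agent is sketched below. It assumes urllib.request.install_opener(), which sets the opener used by urllib.request.urlopen(). The helper name install_user_agent is hypothetical:

```python
import urllib.request


def install_user_agent(user_agent):
    """Make urllib.request.urlopen() send user_agent from now on."""
    opener = urllib.request.build_opener()
    opener.addheaders = [('User-Agent', user_agent)]
    # install_opener() registers the opener as the process-wide default.
    urllib.request.install_opener(opener)
    return opener


opener = install_user_agent('My new User-Agent')
# After this, urllib.request.urlopen('http://www.google.com') would go
# through the installed opener (not executed here: needs network access).
```

Because the installed opener is global to the process, this approach suits small scripts better than libraries that share the interpreter with other code.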
How can we vary the UA instead of always sending the same one? If you want to choose a random User-Agent from a list you can try this:
from urllib import FancyURLopener
from random import choice
user_agents = [
'Mozilla/5.0 (Windows; U; Windows NT 5.1; it; rv:1.8.1.11) Gecko/20071127 Firefox/2.0.0.11',
'Opera/9.25 (Windows NT 5.1; U; en)',
'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)',
'Mozilla/5.0 (compatible; Konqueror/3.5; Linux) KHTML/3.5.5 (like Gecko) (Kubuntu)',
'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.0.12) Gecko/20070731 Ubuntu/dapper-security Firefox/1.5.0.12',
'Lynx/2.8.5rel.1 libwww-FM/2.14 SSL-MM/1.4.1 GNUTLS/1.2.9'
]
class MyOpener(FancyURLopener, object):
    version = choice(user_agents)
myopener = MyOpener()
myopener.retrieve('http://www.useragent.org/', 'useragent.html')
This code creates an instance of MyOpener and saves the page http://www.useragent.org/ in the file useragent.html using the retrieve method. If you open the page you can see the User-Agent used by the script.
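Note that choice() is evaluated only once, when the class is defined, so every instance of MyOpener created during the same run sends the same UA. The UA changes only between runs of the script.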
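If a different User-Agent is wanted for each request rather than one per run, one option is to pick the string when the request is built. The following is a minimal sketch for Python 3, assuming urllib.request.Request and its headers argument. The function name random_ua_request is illustrative, and the list is shortened from the one above:

```python
import random
import urllib.request

USER_AGENTS = [
    'Mozilla/5.0 (Windows; U; Windows NT 5.1; it; rv:1.8.1.11) '
    'Gecko/20071127 Firefox/2.0.0.11',
    'Opera/9.25 (Windows NT 5.1; U; en)',
    'Lynx/2.8.5rel.1 libwww-FM/2.14 SSL-MM/1.4.1 GNUTLS/1.2.9',
]


def random_ua_request(url, agents=USER_AGENTS):
    """Build a Request whose User-Agent is drawn at random on every call."""
    return urllib.request.Request(
        url, headers={'User-Agent': random.choice(agents)})


req = random_ua_request('http://www.useragent.org/')
# Request stores header names capitalized, hence 'User-agent' here.
print(req.get_header('User-agent'))
# Fetching would be: urllib.request.urlopen(req)  (needs network access)
```

Each call to random_ua_request() draws again from the list, so consecutive requests may carry different User-Agent strings.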
Ezio Melotti - ©2007 - This work is licensed under a Creative Commons BY-NC-SA 3.0 License.