How SEOs Can Use Python
During the last six months or so, I started learning more and more about coding. It all started with WordPress and quickly moved on to other things.
As an SEO, one of the things I notice is how much work is sometimes involved in simple tasks like writing title or description tags for clients. As this is SEO 101, we end up doing it for every client. Most of the time the sorts of things we want in a good title are already on the page. We just need a way to extract them. Crawlers like Xenu make finding all the pages easy. But then what?
Python is my scripting language of choice. Its easy syntax, batteries-included attitude, and libraries for just about everything make it a great choice for a great many things. It makes tasks like writing titles and descriptions go much faster.
In short, knowing Python (or any other scripting language) gives an SEO the tools to get results for their clients faster.
When you ask a client how many pages their site has, chances are you’ll get a pretty inconclusive answer. “Maybe about 10,000?” they’ll say. As an SEO, you need to know, so you send out a crawler, like Xenu, to find everything.
There’s another way, however: the sitemap. Almost every ecommerce or CMS platform generates a sitemap. All you need is an XML parser and a way to fetch the sitemap URL to find every URL your client deemed important enough to put in a sitemap. This becomes especially useful when clients have multiple sitemaps (categories, products, static pages, etc.). It lets you find specific pages to optimize first, like product pages on an ecommerce site.
Sitemaps are also good Python practice because the spec is well known and widely used. You can count on most sitemaps being the same and having well-formed XML. That makes them easy to handle with a parser like BeautifulSoup (see below).
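For reference, a single entry in a sitemap looks something like this. This is a made-up example following the sitemaps.org protocol; the example.com URL and the values are placeholders, not from any real site.
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://www.example.com/some-page/</loc>
    <lastmod>2012-01-01</lastmod>
    <changefreq>monthly</changefreq>
    <priority>0.8</priority>
  </url>
</urlset>
Those four child elements (loc, lastmod, changefreq, priority) are exactly what we’ll pull out of each url entry below.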
We’ll use two third-party Python libraries for this example. Requests is a nice wrapper around a lot of the Python URL and HTTP libraries with a much prettier API. My XML/HTML parser of choice is BeautifulSoup, which, apart from its humorous name, has great documentation and works well. We’ll import these two libraries and the with_statement future import at the top of our file.
from __future__ import with_statement  # we'll use this later, has to be here
import requests
from BeautifulSoup import BeautifulStoneSoup as Soup
To get started, we’ll write a function that takes a URL as its only argument. It will then grab the content of the page with a simple GET request. We’ll fetch the URL with requests.get, which returns an object. That object has several attributes, but we’re only going to worry about two: status_code and content. After getting the URL, we’ll check to make sure it returned a 200 OK response.
def parse_sitemap(url):
    resp = requests.get(url)
    # we didn't get a valid response, bail
    if 200 != resp.status_code:
        return False
With that done, we can use BeautifulStoneSoup to parse the XML. This returns an object which contains several useful methods, but we’ll only use find and findAll. We’ll use findAll, which takes a tag name as its only required, positional argument, to find all the url tags in the sitemap.
resp = requests.get(url)
# we didn't get a valid response, bail
if 200 != resp.status_code:
    return False
# BeautifulStoneSoup to parse the document
soup = Soup(resp.content)
# find all the <url> tags in the document
urls = soup.findAll('url')
findAll returns a list, and each of its items is a Tag object that supports the same find methods as the soup itself. If we didn’t get any URLs, findAll will return an empty list, which we’ll check for. Next we’ll iterate through our list of URLs and extract each of the child elements with a call to find. The .string attribute at the end of each find pulls out just the text of the element. After that’s all done, we can return what we found. The entire function looks like this.
def parse_sitemap(url):
    resp = requests.get(url)
    # we didn't get a valid response, bail
    if 200 != resp.status_code:
        return False
    # BeautifulStoneSoup to parse the document
    soup = Soup(resp.content)
    # find all the <url> tags in the document
    urls = soup.findAll('url')
    # no urls? bail
    if not urls:
        return False
    # storage for later...
    out = []
    # extract what we need from each url
    for u in urls:
        loc = u.find('loc').string
        prio = u.find('priority').string
        change = u.find('changefreq').string
        last = u.find('lastmod').string
        out.append([loc, prio, change, last])
    return out
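One caveat: the sitemap protocol only requires loc; priority, changefreq, and lastmod are all optional, so a call like u.find('priority').string will raise an AttributeError on a sitemap that omits them. If you run into that, a slightly more defensive version of the extraction loop might look like this. The get_text helper is a small sketch of my own, not part of the original script.
def get_text(parent, tag):
    # return the text of a child tag, or an empty string if the tag is missing or empty
    el = parent.find(tag)
    return el.string if el is not None and el.string else ''

# inside parse_sitemap, the extraction loop then becomes:
for u in urls:
    out.append([get_text(u, t) for t in ('loc', 'priority', 'changefreq', 'lastmod')])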
That’s it! You could use this function from the Python shell:
>>> from xml import parse_sitemap
>>> l = parse_sitemap('http://www.classicalguitar.org/post-sitemap.xml')
>>> for url in l:
...     print url  # do stuff here, like write to file
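If you wanted to flesh out that loop, for example by dumping everything to a CSV file, a minimal sketch might look like this. The sitemap.csv filename is just a placeholder, and the csv module here is the standard library's, not something from the original script.
>>> import csv
>>> with open('sitemap.csv', 'wb') as f:
...     writer = csv.writer(f)
...     writer.writerow(['loc', 'priority', 'changefreq', 'lastmod'])
...     for row in l:
...         writer.writerow([i.encode('utf-8') for i in row])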
Or you could write an if __name__ == '__main__' clause at the bottom of the file with the logic of what should happen when the script is run directly, like this:
shell> python xml.py
That clause might look a bit like this:
if __name__ == '__main__':
    # ArgumentParser comes from the standard library's argparse module
    from argparse import ArgumentParser

    options = ArgumentParser()
    options.add_argument('-u', '--url', action='store', dest='url',
                         help='The sitemap URL to parse')
    options.add_argument('-o', '--output', action='store', dest='out', default='out.txt',
                         help='Where you would like to save the data')
    args = options.parse_args()

    urls = parse_sitemap(args.url)
    if not urls:
        print 'There was an error!'
    else:
        with open(args.out, 'w') as out:
            for u in urls:
                out.write('\t'.join([i.encode('utf-8') for i in u]) + '\n')
And then run it from the command line:
shell> python xml.py -u http://www.classicalguitar.org/post-sitemap.xml -o output.txt
This entire script is available on GitHub if you’re interested.
Every SEO should know some scripting language. It’s a tool that lets you do work for your clients quickly and get things in place faster so you can start getting results.
Posted by: Christopher Davis