
How SEOs Can Use Python

5 MINUTE READ | October 24, 2011

Christopher Davis

During the last six months or so, I started learning more and more about coding. It all started with WordPress and quickly moved on to other things.

As an SEO, one of the things I notice is how much work is sometimes involved in simple tasks like writing title or description tags for clients. As this is SEO 101, we end up doing it for every client. Most of the time the sorts of things we want in a good title are already on the page. We just need a way to extract them. Crawlers like Xenu make finding all the pages easy. But then what?
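To make that concrete, here's a rough sketch of pulling the raw material for a title rewrite off a page. It uses the same two libraries introduced below; the URL is a placeholder and the tags are assumed to exist.

# rough sketch: grab the on-page elements a title/description
# rewrite usually starts from (placeholder URL, tags assumed present)
import requests
from BeautifulSoup import BeautifulSoup

resp = requests.get('http://www.example.com/some-product/')
soup = BeautifulSoup(resp.content)

title = soup.find('title').string  # the existing title tag
h1 = soup.find('h1').string        # main heading, often good title material
meta = soup.find('meta', {'name': 'description'})
description = meta['content'] if meta else None

print title, h1, description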

Python is my scripting language of choice. Its easy syntax, batteries-included attitude, and library for just about everything make it a great choice for a great many things. It makes tasks like writing titles and descriptions go quicker.

In short, knowing Python (or any other scripting language) gives an SEO the tools to get results for their clients faster.

When you ask a client how many pages their site has, chances are you’ll get a pretty inconclusive answer. “Maybe about 10,000?” they’ll say. As an SEO, you need to know, so you send out a crawler, like Xenu, to find everything.

There’s another way, however: the sitemap. Almost every ecommerce or CMS platform generates a sitemap. All you need is an XML parser and a way to fetch the sitemap URL to find every URL your client deemed important enough to throw in a sitemap. This becomes especially useful when clients have multiple sitemaps (categories, products, static pages, etc.). It lets you find specific pages to optimize first – like product pages on an ecommerce site.
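For example, here's a hedged sketch of unpacking a sitemap index (the URL is made up) so you can pick which child sitemap to start with; it uses the same libraries introduced below:

# hypothetical sketch: a sitemap index is just <sitemap><loc> entries
# pointing at the category, product, and static-page sitemaps
import requests
from BeautifulSoup import BeautifulStoneSoup

resp = requests.get('http://www.example.com/sitemap_index.xml')
soup = BeautifulStoneSoup(resp.content)

# one <loc> per child sitemap; feed each to the parser we build below
children = [s.find('loc').string for s in soup.findAll('sitemap')]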

Sitemaps are also good Python practice because the spec is well known and widely used. You can count on most sitemaps being the same and having well-formed XML. That makes it easy to use a parser like BeautifulSoup (see below).
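For reference, a single entry in a sitemap looks like this under the sitemaps.org spec (the values here are invented; everything but <loc> is optional):

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://www.example.com/some-page/</loc>
    <lastmod>2011-10-24</lastmod>
    <changefreq>monthly</changefreq>
    <priority>0.8</priority>
  </url>
</urlset>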

We’ll use two third-party Python libraries for this example. Requests is a nice wrapper around a lot of the Python URL and HTTP libraries with a much prettier API. My XML/HTML parser of choice is BeautifulSoup, which, apart from its humorous name, has great documentation and works well. We’ll import these two libraries and the with statement from __future__ at the top of our file.

from __future__ import with_statement  # we'll use this later, has to be here
import requests
from BeautifulSoup import BeautifulStoneSoup as Soup

To get started, we’ll write a function that takes a URL as its only argument. It will then grab the content of the page with a simple GET request. We’ll fetch the URL with requests.get, which returns an object. That object has several attributes, but we’re only going to worry about two: status_code and content. After getting the URL, we’ll check to make sure it returned a 200 OK response.

def parse_sitemap(url):
    resp = requests.get(url)

    # we didn't get a valid response, bail
    if 200 != resp.status_code:
        return False

With that done, we can use BeautifulStoneSoup to parse the XML. This returns an object which contains several useful methods, but we’ll only use find and findAll. We’ll use findAll, which takes a tag name as its only required, positional argument, to find all the url tags in the sitemap.

    resp = requests.get(url)

    # we didn't get a valid response, bail
    if 200 != resp.status_code:
        return False

    # BeautifulStoneSoup to parse the document
    soup = Soup(resp.content)

    # find all the <url> tags in the document
    urls = soup.findAll('url')

findAll returns a list, and each of its items is a BeautifulSoup tag object with the same find/findAll interface. If we didn’t get any URLs, findAll will return an empty list, which we’ll check for. Next we’ll iterate through our list of URLs and extract each of the elements with a call to find. The .string attribute at the end of each find extracts the text of the element only. After that’s all done, we can return what we found. The entire function looks like this:

def parse_sitemap(url):
    resp = requests.get(url)

    # we didn't get a valid response, bail
    if 200 != resp.status_code:
        return False

    # BeautifulStoneSoup to parse the document
    soup = Soup(resp.content)

    # find all the <url> tags in the document
    urls = soup.findAll('url')

    # no urls? bail
    if not urls:
        return False

    # storage for later...
    out = []

    # extract what we need from each url
    for u in urls:
        loc = u.find('loc').string
        prio = u.find('priority').string
        change = u.find('changefreq').string
        last = u.find('lastmod').string
        out.append([loc, prio, change, last])

    return out

That’s it! You could use this function from the Python shell:

>>> from xml import parse_sitemap
>>> l = parse_sitemap('http://www.classicalguitar.org/post-sitemap.xml')
>>> for url in l:
...     pass  # do stuff here, like write to file
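As a hedged example of that last step, you could dump the rows to a CSV with the standard library (urls.csv is an arbitrary name; the encode mirrors what the full script does below):

>>> import csv
>>> with open('urls.csv', 'wb') as fh:
...     writer = csv.writer(fh)
...     writer.writerows([[i.encode('utf-8') for i in row] for row in l])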

Or you could write an if __name__ == '__main__' clause at the bottom of the file with the logic of what should happen when the script is run directly, like this:

shell> python xml.py

That clause might look a bit like this:

if __name__ == '__main__':
    from argparse import ArgumentParser

    options = ArgumentParser()
    options.add_argument('-u', '--url', action='store', dest='url',
        help='The sitemap url you would like to parse')
    options.add_argument('-o', '--output', action='store', dest='out',
        default='out.txt', help='Where you would like to save the data')
    args = options.parse_args()

    urls = parse_sitemap(args.url)
    if not urls:
        print 'There was an error!'
    else:
        with open(args.out, 'w') as out:
            for u in urls:
                out.write('\t'.join([i.encode('utf-8') for i in u]) + '\n')

And then from the command line:

shell> python xml.py -u http://www.classicalguitar.org/post-sitemap.xml -o output.txt
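If you’re dealing with the multiple-sitemap case from earlier, a hypothetical variation could take a file with one sitemap URL per line instead; the -f/--file flag and this whole clause are illustrative, not part of the script above:

# hypothetical variation: crawl every sitemap listed in a text file
if __name__ == '__main__':
    from argparse import ArgumentParser

    options = ArgumentParser()
    options.add_argument('-f', '--file', action='store', dest='file',
        help='A file containing one sitemap url per line')
    options.add_argument('-o', '--output', action='store', dest='out',
        default='out.txt', help='Where you would like to save the data')
    args = options.parse_args()

    rows = []
    with open(args.file) as fh:
        for line in fh:
            urls = parse_sitemap(line.strip())
            if urls:
                rows.extend(urls)

    with open(args.out, 'w') as out:
        for u in rows:
            out.write('\t'.join([i.encode('utf-8') for i in u]) + '\n')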

This entire script is available on GitHub if you’re interested.


Well, every SEO should know some scripting language. It’s a tool that lets you do work for your clients quickly and get things in place faster, so you can start getting results.

