If you’re like me, you were super excited when Screaming Frog added the custom extraction feature to their spider tool. Now, you can do some basic web scraping when crawling a site. This is particularly helpful when it comes to checking if a client’s site is using schema code or checking for hreflang tags. But before you get to scraping, you’re going to have to understand how to use xpath. First of all, what is xpath?
XPath is a language that describes a way to locate and process items in Extensible Markup Language (XML) documents by using an addressing syntax based on a path through the document’s logical structure or hierarchy. – TechTarget
Basic Syntax for Xpath
You can scrape almost anything in an XML or HTML document using xpath. First, you’ll need to know the basic syntax.
// – Selects matching nodes anywhere in the document, no matter where they are located
//h2 – Returns all h2 headings in a document
/ – Selects from the root node
//div/span – Only returns a span if contained in a div
. – Selects the current node
.. – Selects the parent of the current node
@ – Selects an attribute
//@href – Returns the href attribute of every link in a document
* – Wildcard, matches any element node
//*[@class='intro'] – Finds any element with a class named “intro”
@* – Wildcard, matches any attribute node
//span[@*] – Finds any span tag with an attribute
[ ] – Used to find a specific node or a node that contains a specific value, known as a predicate
Common predicates used include:
//ul/li[1] – Finds the first item in an unordered list
//ul/li[last()] – Finds the last item in an unordered list
//ul[@class='first-list']/li – Only finds list items in an unordered list with the class “first-list”
//a[contains(., 'Click Here')]/@href – Finds links with the anchor text “Click Here”
| – OR, selects the union of multiple paths
//meta[@name|@content] – Finds meta tags with either a ‘name’ or a ‘content’ attribute
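A quick way to sanity-check the syntax above is Python’s built-in xml.etree.ElementTree module, which supports a limited XPath subset (no last(), contains(), |, or direct @href selection; for full XPath 1.0 you’d want lxml). The document below is a made-up example:

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring("""
<html>
  <body>
    <p class="intro">Welcome</p>
    <a href="/home">Home</a>
    <div><span>inside a div</span></div>
    <span>not inside a div</span>
  </body>
</html>
""")

# //div/span - only returns a span if contained in a div
div_spans = doc.findall(".//div/span")

# //*[@class='intro'] - finds any element with a class named "intro"
intros = doc.findall(".//*[@class='intro']")

# //@href has no direct ElementTree equivalent; select the elements
# carrying the attribute, then read it off each one
hrefs = [el.get("href") for el in doc.findall(".//*[@href]")]

print([s.text for s in div_spans])  # ['inside a div']
print([e.text for e in intros])     # ['Welcome']
print(hrefs)                        # ['/home']
```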
Xpath for Extracting Schema Code
Scraping structured data marked up with Schema code is actually pretty easy. If it is in the microdata format, you can use this simple formula: //span[@itemprop='insert property name here'] and then set Screaming Frog to extract the text. Keep in mind, however, that xpath is case sensitive, so be sure to use the correct capitalization in the property names.
Location Data (PostalAddress)
- For Street Address, use the expression: //span[@itemprop='streetAddress']
- If the address has 2 lines like in the example below, you’ll also have to use the expression: (//span[@itemprop='streetAddress'])[2] to extract the second occurrence.
- For City, use the expression: //span[@itemprop='addressLocality']
- For State, use the expression: //span[@itemprop='addressRegion']
Ratings & Reviews (AggregateRating)
- For star rating, use the expression: //span[@itemprop='ratingValue']
- For review count, use the expression: //span[@itemprop='reviewCount']
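The microdata recipes above can be sketched in Python with the built-in ElementTree module. The markup below is a made-up example page, not a real listing, and a real page would need an HTML parser (e.g. lxml.html) since ElementTree only accepts well-formed XML:

```python
import xml.etree.ElementTree as ET

# Made-up microdata snippet combining the address and rating properties
page = ET.fromstring("""
<div itemscope="itemscope" itemtype="http://schema.org/LocalBusiness">
  <span itemprop="streetAddress">123 Main St</span>
  <span itemprop="addressLocality">Springfield</span>
  <span itemprop="addressRegion">IL</span>
  <span itemprop="ratingValue">4.5</span>
  <span itemprop="reviewCount">87</span>
</div>
""")

def prop(name):
    # //span[@itemprop='name'], with the case-sensitive property name
    el = page.find(f".//span[@itemprop='{name}']")
    return el.text if el is not None else None

print(prop("streetAddress"))  # 123 Main St
print(prop("ratingValue"))    # 4.5
```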
In this example, I extracted star ratings and review counts from an iTunes page for their top health and fitness apps and then sorted them by highest star rating. You can see the extracted data below.
Finding Schema Types
So what if you are unsure whether the website in question is using Schema? You can search by Schema type using the expression: //*[@itemtype='http://schema.org/insert type here']/@itemtype.
Common types include: Organization, LocalBusiness, Product, Article, Review, and Event.
OR you can set up the expressions like so:
//*[@itemtype='http://schema.org/Organization']/@itemtype
//*[@itemtype='http://schema.org/Product']/@itemtype
Xpath for Extracting HREFlang Tags
There are two approaches you can take for scraping hreflang tags; you can scrape any instance of the tag or you can scrape by language.
To scrape by instance (which is helpful if you have no idea whether the site contains hreflang tags or which countries are being used), you would use the expression: //link[@rel='alternate'][n]/@hreflang
Replace n with a number from 1-10 (since Screaming Frog only has 10 fields you can use when extracting data). The number represents the instance in which that path occurs. So in the example below, the expression //link[@rel='alternate'][3]/@hreflang would return “en-ca”.
<link rel="alternate" href="http://www.example.com/us" hreflang="en-us" />
<link rel="alternate" href="http://www.example.com/ie/" hreflang="en-ie" />
<link rel="alternate" href="http://www.example.com/ca/" hreflang="en-ca" />
<link rel="alternate" href="http://www.example.com/au/" hreflang="en-au" />
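Both approaches can be sketched against those four link tags with Python’s built-in ElementTree (which handles the nth-instance lookup in plain Python rather than with the [n] predicate):

```python
import xml.etree.ElementTree as ET

head = ET.fromstring("""
<head>
  <link rel="alternate" href="http://www.example.com/us" hreflang="en-us" />
  <link rel="alternate" href="http://www.example.com/ie/" hreflang="en-ie" />
  <link rel="alternate" href="http://www.example.com/ca/" hreflang="en-ca" />
  <link rel="alternate" href="http://www.example.com/au/" hreflang="en-au" />
</head>
""")

# By instance: //link[@rel='alternate'] matches every alternate link;
# index into the result list to pick the nth occurrence (0-based here)
alternates = head.findall(".//link[@rel='alternate']")
print(alternates[2].get("hreflang"))  # en-ca

# By language: filter on the hreflang value you are checking for
ca = head.find(".//link[@hreflang='en-ca']")
print(ca.get("href"))  # http://www.example.com/ca/
```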
The only issue with this method is that results may not be returned in the same order and it may be difficult to discern which pages are missing hreflang tags with certain country and language codes (as seen in the example below). Ideally, each language and country code would have its own dedicated column.
If you already know which countries you are checking for, you can use the expression: //link[@rel='alternate'][@hreflang='insert language-country code here']/@hreflang
As you can see below, the results of the scrape are much cleaner and it is easier to see which pages are missing tags and which pages are not available in a particular country/language.
Xpath for Extracting Open Graph Tags & Twitter Cards
This expression comes in handy if you want to see which titles and descriptions a page will render when shared on Facebook or Twitter. To do so, use the expression:
//meta[@property='insert type here']/@content for Open Graph tags, or //meta[@name='insert type here']/@content for Twitter cards
The most common ones you’ll probably use are:
Facebook Title: //meta[@property='og:title']/@content
Twitter Title: //meta[@name='twitter:title']/@content
Twitter Description: //meta[@name='twitter:description']/@content
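Those three lookups can be sketched with built-in ElementTree against a made-up page head:

```python
import xml.etree.ElementTree as ET

# Made-up <head> carrying Open Graph and Twitter card tags
head = ET.fromstring("""
<head>
  <meta property="og:title" content="My Page" />
  <meta name="twitter:title" content="My Page on Twitter" />
  <meta name="twitter:description" content="A short summary." />
</head>
""")

# //meta[@property='og:title']/@content
og_title = head.find(".//meta[@property='og:title']").get("content")
# //meta[@name='twitter:description']/@content
tw_desc = head.find(".//meta[@name='twitter:description']").get("content")

print(og_title)  # My Page
print(tw_desc)   # A short summary.
```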
Writing xpath expressions can be tricky, but here are some helpful tools you can use:
You can find a specific xpath using the inspector tool in Chrome. However, this option is typically only good for finding one specific path at a time.
There’s also a browser extension you can use called XPather (available for both Chrome and Firefox). XPather adds a search box at the top of your browser that allows you to input an xpath expression. By doing so, you can check to make sure you are using the correct syntax and see which results are returned for the given expression.
A big limitation with Screaming Frog that prevents it from being a full-fledged data mining tool is that you are limited to 10 fields to input your xpath expressions and it only allows for one result per data cell. For instance, if you had an e-commerce page with 30 different products and you wanted to scrape the prices for each product, you would only be able to extract the first 10 prices. If that’s your goal, you can try a Chrome extension called Scraper. Scraper allows you to quickly extract data from a web page into a spreadsheet.
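Outside of Screaming Frog, that limit disappears: a single XPath query can return every match at once. A minimal Python sketch with built-in ElementTree, using a hypothetical product list and an assumed class name of “price”:

```python
import xml.etree.ElementTree as ET

# Hypothetical e-commerce page with 30 products ("price" class is assumed)
page = ET.fromstring(
    "<div>"
    + "".join(f'<span class="price">${i}.99</span>' for i in range(1, 31))
    + "</div>"
)

# //span[@class='price'] returns all 30 matches, not just the first 10
prices = [p.text for p in page.findall(".//span[@class='price']")]

print(len(prices))  # 30
print(prices[:3])   # ['$1.99', '$2.99', '$3.99']
```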
To learn more about xpath, check out these additional resources:
Xpath syntax: http://www.w3schools.com/xsl/xpath_syntax.asp
Xpath tutorial: http://zvon.org/xxl/XPathTutorial/General/examples.html
Using Xpath in Screaming Frog: http://www.screamingfrog.co.uk/web-scraping/