Crawling Ajax Webpages with PhantomJS

5 MINUTE READ | February 10, 2012

Chris Alvares, Head of Technology

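Sometimes, as an SEO, you need to analyze a site before working on it. Many sites today load content and links after the page has loaded. This content is loaded asynchronously via a JavaScript technique known as AJAX. A crawler that only downloads the HTML never sees that content, but PhantomJS, a headless WebKit browser, runs the page's JavaScript and can.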

To get started, you need two things: the PhantomJS binary and a copy of the latest version of jQuery (store it in a folder; I called mine “includes”). Having jQuery on hand makes everything easier and lets you use jQuery functions throughout your PhantomJS script.

Getting PhantomJS to work for the first time can be a hassle. Luckily, the PhantomJS team provides detailed installation instructions for every supported OS.

Injecting jQuery lets us use it inside PhantomJS. It isn’t a necessary step for every project, but jQuery’s selectors and static functions make life a lot easier for the coder.

//inject jQuery so you can use it inside this script
phantom.injectJs("includes/jquery.js");

PhantomJS actually takes two different sets of arguments when running a program: one for PhantomJS itself and one for your JavaScript file.

The syntax looks like this:

[code]phantomjs [phantomjs args] "yourscript.js" [yourscript args][/code]

With the PhantomJS arguments, you can change the kinds of things a normal browser can do, such as whether to load JavaScript or enable proxies. For this project, we want to make the site’s URL dynamic, so we will pass it as one of the script’s own arguments.

if(phantom.args.length < 1)
{
    console.log('{"error":"not correct amount of args"}');
    phantom.exit();
}

var site = jQuery.trim(phantom.args[0]);

if(site == "")
{
    console.log('{"error":"empty search keyword"}');
    phantom.exit();
}
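The same checks can be pulled into a small function, which makes them easy to try without launching PhantomJS. This is only a sketch: `parseSiteArg` is a hypothetical helper name, and it trims with a regular expression so it does not depend on jQuery being loaded.

```javascript
// Hypothetical helper (not part of PhantomJS): validate the script arguments
// and return either { site: ... } or { error: ... } instead of exiting inline.
function parseSiteArg(args) {
    if (!args || args.length < 1) {
        return { error: "not correct amount of args" };
    }
    // Trim with a regex so the sketch does not depend on jQuery.
    var site = String(args[0]).replace(/^\s+|\s+$/g, "");
    if (site === "") {
        return { error: "empty search keyword" };
    }
    return { site: site };
}

// Example calls with made-up argument arrays.
console.log(JSON.stringify(parseSiteArg([])));
console.log(JSON.stringify(parseSiteArg(["  http://pmg.co  "])));
```

In the crawler itself, you would pass phantom.args to it and call phantom.exit() when the result has an error property.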

The next step is to create a new WebPage and change a couple of its settings. We want to tell the page to load images and to change the user agent. Setting a custom user agent lets site owners tell the web-crawling bot apart from actual visitors.

//Set the user agent so analytics of some sort will be able to pick up the bot
var page = new WebPage();
page.settings.loadImages = true;
page.settings.userAgent = "PMG Web Crawler Bot/1.0";

When you call page.open in PhantomJS, you pass two parameters: a URL and a callback function that is called whenever a new page is loaded. This means the callback can actually be called more than once if the page redirects to another page. You can check whether the page loaded using the status parameter of the callback function.

page.open(site, function(status)
{
    if (status !== 'success')
    {
        console.log('{"error":"Unable to load the address for page"}');
        phantom.exit();
    }
});
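Since the callback may fire more than once, one option is to wrap it in a run-once guard so the crawl logic only executes for the first load. The sketch below uses plain JavaScript; `once` is a hypothetical helper, and the two calls at the bottom simply simulate a callback that fires twice.

```javascript
// Hypothetical helper: wrap a callback so only its first invocation runs.
function once(callback) {
    var called = false;
    return function () {
        if (called) {
            return undefined;
        }
        called = true;
        return callback.apply(this, arguments);
    };
}

// Simulate a load callback that fires twice, as it might after a redirect.
var handled = [];
var onLoad = once(function (status) {
    handled.push(status);
});
onLoad("success");
onLoad("success");
console.log(handled.length); // 1
```

Whether the first or the last load is the one you care about depends on the site, so treat this as a starting point rather than a rule.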

While there are other means of checking for AJAX, none of them is 100% effective, since PhantomJS does not track AJAX calls. So in this code, I just wait 5 seconds before executing any of the code. Once that is done, I inject the same jQuery file we saved before into the web page. Previously, we injected the script into the phantom object, which only makes jQuery available to our own script. Some sites do not include a jQuery library of their own, so we also inject it into the page after it has loaded.

//wait 5 seconds for any AJAX content to load
window.setTimeout(function()
{
    //inject jQuery into the actual web page.
    page.injectJs("includes/jquery.js");
}, 5000);
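One alternative to a fixed delay is to count the requests that are still in flight and continue once none remain. The sketch below is plain JavaScript with a hypothetical `createRequestTracker` helper. In PhantomJS it could presumably be fed from the page's resource callbacks (such as page.onResourceRequested and page.onResourceReceived), but that wiring is an assumption and is not shown here. Like the other approaches, it is not 100% effective: an idle network does not prove the page has finished rendering.

```javascript
// Hypothetical tracker: counts requests that have started but not finished.
function createRequestTracker() {
    var pending = {};
    var count = 0;
    return {
        // Call when a request with the given id starts.
        requested: function (id) {
            if (!Object.prototype.hasOwnProperty.call(pending, id)) {
                pending[id] = true;
                count += 1;
            }
        },
        // Call when the request with the given id has finished.
        finished: function (id) {
            if (Object.prototype.hasOwnProperty.call(pending, id)) {
                delete pending[id];
                count -= 1;
            }
        },
        // True when nothing is in flight.
        isIdle: function () {
            return count === 0;
        }
    };
}

// Made-up request ids, just to show the bookkeeping.
var tracker = createRequestTracker();
tracker.requested(1);
tracker.requested(2);
tracker.finished(1);
console.log(tracker.isIdle()); // false
tracker.finished(2);
console.log(tracker.isIdle()); // true
```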

This part gets a little tricky. When you use the WebPage.evaluate() function, the code actually runs on the web page, and it is sandboxed. That means the code cannot access the phantom object or any variables you defined earlier in your script. You can, however, return values back to PhantomJS. In this case, we use our injected jQuery object to find all of the anchor tags on the page and return their href values.

var results = page.evaluate(function()
{
    var hrefs = new Array();
    jQuery("a").each(function(){
        hrefs.push(jQuery(this).attr("href"));
    });

    return hrefs;
});

console.log(JSON.stringify(results));
phantom.exit();
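To get a feel for the sandbox without PhantomJS, you can imitate it in plain JavaScript. This is only a simulation under stated assumptions: `evaluateLikeSandbox` is a made-up name, and it assumes the function is shipped as source text (so it loses its closure) and that the result comes back as plain data.

```javascript
// Simulation only: mimic a sandboxed evaluate by rebuilding the function from
// its source text (dropping its closure) and passing the result through JSON.
function evaluateLikeSandbox(fn) {
    var rebuilt = new Function("return (" + fn.toString() + ")();");
    return JSON.parse(JSON.stringify(rebuilt()));
}

function demo() {
    var outerValue = "defined outside"; // visible here, not inside the sandbox

    return evaluateLikeSandbox(function () {
        return {
            seesOuterValue: typeof outerValue !== "undefined",
            hrefs: ["http://pmg.co/", "#"]
        };
    });
}

console.log(JSON.stringify(demo()));
// {"seesOuterValue":false,"hrefs":["http://pmg.co/","#"]}
```

The practical takeaway is the same as in the crawler: compute everything you need inside the evaluated function and return it as simple values such as arrays and strings.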

You might have noticed that I added phantom.exit() calls in the page-loading and link-extraction code. This is because PhantomJS does not quit automatically; you must call the phantom.exit() function. Here, I call it if the page fails to load and again after we output all the anchor hrefs. Another safety measure is to set a timeout in case something unexpected happens in the program or you miss a phantom.exit() call.

//this is a timeout to make sure the phantom process eventually stops
window.setTimeout(function(){
    phantom.exit();
}, 120000);

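To run the crawler, change to the directory that contains the script and pass the URL as the script argument:

[code]
cd ""
phantomjs crawler.js "http://pmg.co"
[/code]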

The output should be something like this:

[code]
["http://pmg.co","http://pmg.co/","http://pmg.co/news","http://pmg.co/solutions","http://pmg.co/work/","http://pmg.co/about","http://pmg.co/contact","http://www.facebook.com/pages/Performance-Media-Group/226744320697630","http://twitter.com/agencypmg","http://www.linkedin.com/company/performance-media-group","http://feeds.feedburner.com/performancemediagroup","http://pmg.co/","http://www.adobe.com/go/getflash","http://pmg.co/work","http://pmg.co/contact","http://pmg.co/","http://www.adobe.com/go/getflash","http://pmg.co/news","http://pmg.co/color-analysis-when-designing-for-mobile-devices-part-2-color-design-tools","http://pmg.co/color-analysis-when-designing-for-mobile-devices-part-2-color-design-tools","http://pmg.co/understanding-google-multi-channel-funnels","http://pmg.co/understanding-google-multi-channel-funnels","http://pmg.co/mobile-development-and-ocr","http://pmg.co/mobile-development-and-ocr","http://twitter.com/agencypmg","http://t.co/LpUsCv7m","http://pmg.co/","http://pmg.co/news","http://pmg.co/solutions","http://pmg.co/work/","http://pmg.co/about","http://pmg.co/contact","#","#","#","#"]
[/code]
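The raw list contains repeats and "#" placeholders, so you may want to tidy it before analyzing it. Here is a minimal sketch of one way to do that; `cleanLinks` is a hypothetical helper, and the sample array is a shortened copy of the output above.

```javascript
// Hypothetical post-processing step: drop "#" placeholders, empty or
// non-string entries, and duplicates while keeping the first-seen order.
function cleanLinks(hrefs) {
    var seen = {};
    var cleaned = [];
    for (var i = 0; i < hrefs.length; i++) {
        var href = hrefs[i];
        if (typeof href !== "string" || href === "" || href === "#") {
            continue;
        }
        if (!Object.prototype.hasOwnProperty.call(seen, href)) {
            seen[href] = true;
            cleaned.push(href);
        }
    }
    return cleaned;
}

var sample = ["http://pmg.co", "http://pmg.co/", "http://pmg.co/news", "http://pmg.co/", "#", "#"];
console.log(JSON.stringify(cleanLinks(sample)));
// ["http://pmg.co","http://pmg.co/","http://pmg.co/news"]
```

Note that this compares strings exactly, so "http://pmg.co" and "http://pmg.co/" still count as two different links. Normalizing trailing slashes would be a separate decision.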

Here is the full code in case anyone wants it:

//inject jQuery so you can use it inside this script
phantom.injectJs("includes/jquery.js");

if(phantom.args.length < 1)
{
    console.log('{"error":"not correct amount of args"}');
    phantom.exit();
}

var site = jQuery.trim(phantom.args[0]);

if(site == "")
{
    console.log('{"error":"empty search keyword"}');
    phantom.exit();
}

//this is a timeout to make sure the phantom process eventually stops
window.setTimeout(function(){
    phantom.exit();
}, 120000);

//set the user agent so analytics of some sort will be able to pick up the bot
var page = new WebPage();
page.settings.loadImages = true;
page.settings.userAgent = "PMG Web Crawler Bot/1.0";

page.open(site, function(status)
{
    if (status !== 'success')
    {
        console.log('{"error":"Unable to load the address for page"}');
        phantom.exit();
        //do not schedule the crawl if the page failed to load
        return;
    }
    //wait 5 seconds for any ajax to load
    window.setTimeout(function()
    {
        //inject jQuery into the actual web page
        page.injectJs("includes/jquery.js");

        var results = page.evaluate(function()
        {
            var hrefs = new Array();
            jQuery("a").each(function(){
                hrefs.push(jQuery(this).attr("href"));
            });

            return hrefs;
        });

        console.log(JSON.stringify(results));

        phantom.exit();
    }, 5000);
});

Chris Alvares is the Products Architect at PMG. Follow him on Twitter: @chrisalvares