Parsing HTML in a Nutshell

Ever wanted to get data from a particular webpage or service that doesn't offer a public API? You really want to use the data for whatever reason floats your boat? If so, what you'll want to do is crawl the webpage and gather all the data you need.

For this tutorial we're going to parse HTML data via the Simple HTML DOM Parser PHP script. It fetches the contents of a webpage and makes it searchable with CSS-like selectors. You can find the source for this gem at sourceforge.net. All we need is the simple_html_dom.php script; everything else in the download is example data.

We'll work with the following scenario: I want to fetch all the articles from, in my case, the homepage of NetTuts+ so I can email them to myself every morning (via a cronjob). When I open my email, I want to see the title of each post, its permalink and its thumbnail. I won't cover the email part in this tutorial because it is out of scope.

The Basics

Let's start with the basic setup of the parser script. Create a new file within your server environment of choice (I use XAMPP, but you could also use WAMP or a hosted solution). Call it parser.php and fill it with the following code. Make sure you have downloaded the simple_html_dom.php script into the same folder as the file you just created.

<?php
include_once('simple_html_dom.php');
$html = file_get_html('http://net.tutsplus.com');
echo $html;

Save the script and run it in your web browser of choice. As you can see, the page from NetTuts+ has been rendered inside your page. Now that we have our data, we can start doing something useful with it.

Scraping HTML pages requires insight into how a webpage's structure is defined. When we look at the structure of NetTuts+, we can see a recurring pattern: every post on the front page has its own post id. We are going to use this to identify posts and get their title, permalink and thumbnail.

[Image: nettuts-structure — the recurring post structure on the NetTuts+ homepage]
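To make that pattern concrete, the recurring structure looks roughly like this. This is a hypothetical sketch (the real NetTuts+ markup differs in its details), but it shows the parts the selectors in this tutorial rely on:

```html
<!-- Hypothetical sketch of one post block; the real markup differs -->
<div id="post-12345" class="post">
    <div class="post_image">
        <a href="http://net.tutsplus.com/..."><img src="thumbnail.jpg" /></a>
    </div>
    <h1 class="post_title">
        <a href="http://net.tutsplus.com/...">The Title!</a>
    </h1>
</div>
```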

Our parser script provides us with a clean method of finding data. To get all the posts we use the following line of code.

$posts = $html->find('div[id*=post]');

What it does is find all the divs with an id that contains the word "post" and store them in the $posts array variable. One thing worth mentioning: do not attempt to read out all the data inside the $posts variable through print_r. It will "crash" your browser because of the vast amount of data stored in it.
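As an aside, if you are curious what div[id*=post] actually matches, here is the same "id contains 'post'" selection done with PHP's built-in DOMDocument and DOMXPath, purely for comparison (the sample markup here is made up; no external library is needed):

```php
<?php
// For comparison: the CSS-like selector div[id*=post] corresponds to the
// XPath expression //div[contains(@id, "post")].
$sample = '<div id="post-1">A</div><div id="sidebar">B</div><div id="post-2">C</div>';

$doc = new DOMDocument();
$doc->loadHTML($sample);

$xpath = new DOMXPath($doc);
$nodes = $xpath->query('//div[contains(@id, "post")]');

foreach ($nodes as $node) {
    echo $node->getAttribute('id'), "\n"; // prints post-1, then post-2
}
```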

Parsing Data

Next up is looping through all the posts and gathering the data we need. Here is the full script.

<?php
include_once('simple_html_dom.php');

$html = file_get_html('http://net.tutsplus.com');
$posts = $html->find('div[id*=post]');

// Initiate an empty array to store all the data in.
$data = array();
foreach($posts as $post)
{
    $link = $post->find('h1.post_title a', 0);
    $img  = $post->find('div.post_image a img', 0);

    // Guard against divs whose id merely contains "post" but that are
    // not actual articles; find() returns null when nothing matches.
    if (!$link || !$img) {
        continue;
    }

    $title     = $link->innertext;
    $permalink = $link->href;
    $thumbnail = $img->src;

    $data[] = array(
        'title' => $title,
        'href'  => $permalink,
        'image' => $thumbnail
    );
}

// Testing the data
print_r($data);

While the script loops through the posts, it tries to find the first title anchor tag (the second argument of the find method is an index value; 0 being the first match). If you echo the $link variable, it outputs something like this.

<a href="http://net.tutsplus.com/..." class="post_title" id="...">The Title!</a>

After it has retrieved the link anchor tag, it can extract the innertext, a built-in property that returns everything inside the element's tags. This gives us the title of the post. Next, we want the permalink of the post. The parser script uses "magic getters": if you echo $link->href, you'll get the contents of the href attribute of this link. The same goes if you access the id or the class (if present!).
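If you are wondering how such magic getters work, here is a stripped-down, purely illustrative sketch using PHP's __get(). This is not the parser's actual implementation; the class and property names are my own:

```php
<?php
// Illustrative only -- not how simple_html_dom is actually implemented.
// __get() is called whenever an undefined property like ->href is read.
class ElementSketch
{
    private $attributes;

    public function __construct(array $attributes)
    {
        $this->attributes = $attributes;
    }

    public function __get($name)
    {
        // Return the attribute if it exists, null otherwise.
        return isset($this->attributes[$name]) ? $this->attributes[$name] : null;
    }
}

$link = new ElementSketch(array(
    'href'  => 'http://net.tutsplus.com/...',
    'class' => 'post_title'
));
echo $link->href; // prints http://net.tutsplus.com/...
```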

Because I also wanted the thumbnail included with each post, I did the same thing and ran a find for the first thumbnail inside each article.

If all went well, you will see something like the following in your browser.

[Image: nettuts-scraped — the scraped results printed in the browser]

Warning: scraping / crawling / parsing a webpage can create a high load on someone else's server, which may result in an IP ban. Only use this software to fetch results and cache them for later use, not to fetch results in real time.
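A minimal caching sketch along those lines could look like this. The function and file names are my own invention; it simply reuses a saved copy of the page while it is still fresh:

```php
<?php
// Fetch the page at most once per $ttl seconds; otherwise reuse the cache.
function fetch_cached($url, $cacheFile, $ttl = 3600)
{
    // Reuse the cached copy while it is still fresh.
    if (file_exists($cacheFile) && (time() - filemtime($cacheFile)) < $ttl) {
        return file_get_contents($cacheFile);
    }

    $contents = file_get_contents($url);
    if ($contents !== false) {
        file_put_contents($cacheFile, $contents);
    }
    return $contents;
}
```

You can then feed the cached copy to the parser with the library's str_get_html() instead of file_get_html(), e.g. $html = str_get_html(fetch_cached('http://net.tutsplus.com', 'nettuts.cache'));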

Closing Thoughts

There are many ways to extend this code. To give you an example, I order the posts by ID and only fetch results that are newer than a stored value. That way I only receive mail when new posts are published.
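Here is one way to sketch that filter. It assumes you also store each post's numeric id in the $data entries (the script above does not do this yet), and the function name is my own:

```php
<?php
// Keep only the posts whose numeric id is higher than the last one seen.
function only_new_posts(array $data, $lastSeenId)
{
    $fresh = array();
    foreach ($data as $post) {
        if ($post['id'] > $lastSeenId) {
            $fresh[] = $post;
        }
    }
    return $fresh;
}

$data = array(
    array('id' => 101, 'title' => 'Old post'),
    array('id' => 105, 'title' => 'New post'),
);

$fresh = only_new_posts($data, 103);
// $fresh now holds only the post with id 105; store 105 as the new
// "last seen" value (in a file or database) before the next cron run.
```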

This concludes the tutorial Parsing HTML in a Nutshell. If you have any questions feel free to ask away in the comments below.