Tools for Citizen Journalists

Deeper reporting builds community

An example of how to get and use data

Web scraping the Census Bureau with PHP

The Census Bureau is a treasure trove of information about communities. But sometimes getting to the data and making sense of it takes a bit of work. It can be hard to find the data and even harder to wade through it.

For those who want large portions of the data collected by the bureau, there are gigabytes of raw data available for download and import into a local database for analysis. But for those who need only tiny pieces of demographic data, the American FactFinder on the Census Web site makes getting the information easy. The information can be copied and pasted manually from the Web site or downloaded in Microsoft Excel format.

About this tutorial

Unlike most other elements in this module, this tutorial is designed for more advanced Web programmers.

Specifically, it assumes:

--You are familiar with HTML

--You are familiar with PHP

--Or you have access to someone on your staff or in your community who has those skills.

Unfortunately, the information compiled by the American FactFinder is not easily available on the fly to an outside Web application; the Census does not provide the information as a Web service or as an XML feed. With a little work, however, that data can be pulled out of the Web page automatically by a computer script through a process known as Web scraping.

In a post on his blog at readwriteweb.com, Alex Iskold explains: "Web scraping is essentially reverse engineering of HTML pages." Basically, "Scrapers are the programs that 'know' how to get the data back from a given HTML page. They work by learning the details of the particular markup and figuring out where the actual data is."

Web scraping issues

You should know that Web scraping -- sometimes also called "screen scraping" -- is not without problems or controversy.

Derek Willis, database editor at washingtonpost.com, maintains the site's widely recognized Congressional votes application, which relies heavily on scraping techniques, as well as other data-dependent projects. He says one issue is simply practical: "Web-scraping is dependent on the source, so should the source of the information change, that could break the scraper or inject errors into the data."
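One way to guard against the breakage Willis describes is to check that each scraped value still looks the way we expect before using it. The sketch below is only an illustration: the function name parseScrapedCount is hypothetical, and it assumes the value we want is a whole number with optional comma separators, such as a population count.

```php
<?php
// Hypothetical helper: convert a scraped count such as "12,587" to an
// integer, or return null when the text does not look like a count.
function parseScrapedCount($text) {
    $text = trim($text);
    // Accept comma-grouped digits ("12,587") or plain digits ("587").
    if (!preg_match('/^\d{1,3}(,\d{3})*$|^\d+$/', $text)) {
        return null; // the page layout probably changed; do not trust the value
    }
    return (int) str_replace(',', '', $text);
}

$count = parseScrapedCount("12,587");
if ($count === null) {
    echo "Scraper needs attention: unexpected value\n";
} else {
    echo $count . "\n";
}
```

If the source page changes and the scraper starts picking up markup or an empty line, the function returns null and the application can show a warning rather than a wrong number.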

Some Webmasters consider it "rude," because if you're not careful, you can create a lot of traffic on a server if you are grabbing lots of data. Willis says, "You also need to be a polite scraper - don't flood a server with multiple requests at the same time, or you could get banned from accessing that server." Some Webmasters have been known to take steps to block scrapers once they realize what's happening.

It's probably best to notify a Webmaster if you're launching a persistent scraping tool or trying to collect a large amount of information.
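A scraper can also be polite by design. The following is a minimal sketch, not part of the sample application: the function name fetchPolitely, the one-day cache lifetime and the one-second pause are all assumptions chosen for illustration. It keeps a local copy of each page so that repeat visitors do not trigger repeat requests, and it pauses before every real request.

```php
<?php
// Hypothetical helper: return a page from a local cache when a fresh copy
// exists; otherwise pause, fetch it once, and save it for next time.
function fetchPolitely($url, $cacheDir, $maxAgeSeconds = 86400, $delaySeconds = 1) {
    $cacheFile = $cacheDir . "/" . md5($url) . ".html";

    // Reuse the cached copy if it is recent enough.
    if (file_exists($cacheFile) && (time() - filemtime($cacheFile)) < $maxAgeSeconds) {
        return file_get_contents($cacheFile);
    }

    // Wait before touching the remote server so that requests are spaced out.
    sleep($delaySeconds);

    $html = @file_get_contents($url);
    if ($html === false) {
        return false; // let the caller decide how to handle a failed request
    }
    file_put_contents($cacheFile, $html);
    return $html;
}

// Example call (assumes a writable "cache" folder already exists):
// $html = fetchPolitely($url, "cache");
```

Because Census figures for a zip code change rarely, a long cache lifetime costs little in freshness and spares the source server most of the traffic.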

In some cases, you might need to get legal advice. If you are taking information from a commercial site or if you plan to use the data you get for commercial purposes, you might run into copyright issues. Those should be less of a problem if you're dealing with government sites and public information.

John Perry has written a primer on Web scraping for Uplink, a publication of the National Institute for Computer-Assisted Reporting and Investigative Reporters and Editors. If you are serious about using data, or if you want to learn more about all kinds of reporting techniques, IRE is a great resource. (Full disclosure: Wendell Cochran, one of the developers of this module, is a member of the IRE Board of Directors and a shameless booster of the organization.)

The nuts and bolts of the Census graphing 'widget'

Okay, with those caveats in mind, and with full recognition that this will take you pretty deep in the weeds of Web programming, this tutorial shows you the basics of building a Web site that scrapes select demographic data about a user-defined zip code from the Census Web site. Credit for both the tutorial and the coding goes to Josh Williams, who recently received his master's degree in interactive journalism from the School of Communication at American University.

For this example, we will be collecting and using racial data for an Arlington, Va., zip code. The widget is based on PHP. It has three main features:

  • Permits users to enter a zip code
  • Grabs the data from American FactFinder
  • Creates and displays either a bar graph or pie chart of the information. The bar graph compares the zip code's characteristics to national averages. Both graphs include a text data table.

The full source code of the sample applications is also available. You are free to use it either as a stand-alone application or as the basis of your own Census-based Web application. If you do decide to use the code or incorporate it on your site, we would appreciate credit and a link back to this page.

Many computer languages facilitate Web scraping, and several languages have special libraries or packages to make it even easier. To reach a very broad audience, we are going to use PHP, a free Web development language available on almost all standard shared Web hosts. For the second of the two graphing examples, we will use GD, an optional graphics library for PHP that may not be installed with PHP on all Web hosts, but which can easily be downloaded and installed.

While meant to be simple, with code examples written more for clarity than efficiency, this tutorial assumes a basic familiarity with Web development in PHP and HTML.

The tutorial is broken into three parts:

  1. The Basics – Getting the data automatically.
  2. Graphing The Data – Building a bar graph.
  3. Advanced Graphing – Creating pie charts with dynamic images and the GD library.

The basics

In an ideal world, building Web applications with data from the Web would be as easy as pulling data from an XML or RSS file. However, to pull race-related data from the Census Web site, we’re going to have to scrape it.

The first step in this exercise is to determine the location of the desired data. For basic zip code-level demographic data at the American FactFinder, the URL is: http://factfinder.census.gov/servlet/SAFFFacts?_event=Search&_zip=22203.

The part we’re concerned about in the URL is the _zip variable at the end of the query string. Changing “22203” to another valid five-digit zip code brings up that area’s data.
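Because the zip code will eventually come from a user, it is worth checking it before it goes into the query string. Here is a minimal sketch; the function name buildFactFinderUrl is hypothetical, and the check simply assumes that a valid zip code is exactly five digits.

```php
<?php
// Hypothetical helper: build the FactFinder URL for a user-supplied zip code.
function buildFactFinderUrl($zip) {
    $zip = trim($zip);
    // Accept exactly five digits; reject anything else before it reaches the URL.
    if (!preg_match('/^\d{5}$/', $zip)) {
        return false;
    }
    $query = http_build_query(array("_event" => "Search", "_zip" => $zip), "", "&");
    return "http://factfinder.census.gov/servlet/SAFFFacts?" . $query;
}

// prints http://factfinder.census.gov/servlet/SAFFFacts?_event=Search&_zip=22203
echo buildFactFinderUrl("22203") . "\n";
```

Anything that is not five digits returns false, so the application can ask the user to try again instead of requesting a page that does not exist.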

Now that we know where the demographic information is, we need to parse out only the information we want. Let’s start with the white population for 22203, the first one listed on the Web page.

On the page it looks like:

HTML Table

The corresponding HTML is:

HTML Source

Take note of the location of each of the numbers we want (12,587, 68.0, 75.1). Notice that each one sits directly below a line containing a table data cell (<td>) with a unique “headers” value. One simple way for our application to grab the statistics about the white population in 22203 is to loop through each line of the HTML document, search for the unique “headers” value above the numbers we want, strip out the HTML on the next line -- which leaves only the desired value on that line -- and ignore everything else on the page.

Here is the PHP code to accomplish that:

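<?php
//zip code we want demographic data for
$zip = "22203";
//the URL of the data
$url = "http://factfinder.census.gov/servlet/SAFFFacts?_event=Search&_zip=" . $zip;
//put Census page HTML in an array, one line per element
$lines = file($url);
//stop if the page could not be retrieved
if ($lines === false) {
    die("Could not retrieve the Census page.");
}
//default value in case the pattern is never found
$totalWhite = "";
//loop through HTML and search for unique table values
foreach ($lines as $line_num => $line) {
    //white population (denoted by the R9 R10 C2 headers)
    //the slashes mark the start and end of the pattern
    if (preg_match("/<td headers=\"R9 R10 C2\" align=\"RIGHT\">/", $line)) {
        //the value sits on the line after the match
        $totalWhite = trim($lines[$line_num + 1]);
    }
}
//print number of white people in 22203
echo($totalWhite);
?>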

Explaining the PHP

The first few lines simply define the location of the Census data and put the source of that page into an array, with one line of HTML per array element. The “$zip” variable can be changed to any valid five-digit zip code.

The next bit gets interesting. The “foreach” loop works through each element of the array (each element being a line of HTML) while “preg_match” looks for the unique headers value. Once we’ve found the matching pattern, we take the next line of HTML, which is the next element in the array, and assign it to a variable for later use: $totalWhite = trim($lines[$line_num + 1]);

If you got lost in the details of how the code works, the important part to remember is that “$totalWhite = trim($lines[$line_num + 1]);” assigns the value 12,587 to the variable “$totalWhite” by taking the line of HTML after the one that matched the unique headers we searched for and trimming it down to the value itself. If we wanted to know the percent white for the zip code, we could simply grab the value three lines down from our match, like so: “$percentWhite = trim($lines[$line_num + 3]);.”

To grab other statistics from the page, simply count the lines from the match found above or add a new conditional and “preg_match” search. Once all the data is stored in variables, printing it in tabular form is a snap.
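The line-counting approach can be wrapped in a small function so that each statistic is described by a marker and an offset. This is a sketch under stated assumptions: the function name extractValues is hypothetical, the sample lines are stand-ins rather than the real Census markup, and the "R11 R12 C2" marker is invented to show what happens when a marker is missing. It uses strpos() rather than preg_match() because the marker is a fixed string, and strip_tags() to remove any leftover markup.

```php
<?php
// Hypothetical helper: for each wanted statistic, find the line containing
// its marker and read the value a fixed number of lines below it.
function extractValues(array $lines, array $targets) {
    $found = array();
    foreach ($targets as $name => $target) {
        list($marker, $offset) = $target;
        $found[$name] = null; // stays null if the marker is never seen
        foreach ($lines as $lineNum => $line) {
            if (strpos($line, $marker) !== false && isset($lines[$lineNum + $offset])) {
                $found[$name] = trim(strip_tags($lines[$lineNum + $offset]));
                break;
            }
        }
    }
    return $found;
}

// Stand-in lines for illustration only; the real page's markup may differ.
$lines = array(
    "<td headers=\"R9 R10 C2\" align=\"RIGHT\">\n",
    "12,587</td>\n",
    "<td headers=\"R9 R10 C3\" align=\"RIGHT\">\n",
    "68.0</td>\n",
);

$values = extractValues($lines, array(
    "totalWhite"   => array('headers="R9 R10 C2"', 1),
    "percentWhite" => array('headers="R9 R10 C2"', 3),
    "totalBlack"   => array('headers="R11 R12 C2"', 1), // invented marker, absent above
));

// Print each statistic, flagging any that could not be found.
foreach ($values as $name => $value) {
    echo $name . ": " . ($value === null ? "not found" : $value) . "\n";
}
```

Keeping the markers and offsets in one array means that when the source page changes, there is a single place to update.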

Try the application. View the source.

Graphing the data

Now that we know how to automatically grab Census demographic data for any zip code, we can represent that data graphically. A bar chart is a simple way to show the relationship between population percentages, both locally and nationally, for much of the demographic data on the American FactFinder.

Follow the instructions in section one to grab Census data and store them in variables for later use. This tutorial will stick with race data for zip code 22203.

We are going to make our chart by dynamically sizing static images of different colors in order to represent the two series of data (the zip code and U.S. total). The “height” and “width” attributes of the image tag size an image, even if the values of the attributes do not match the true pixel size of the image. Using this feature, we can create very small images and stretch them to appear to be separate, individually-sized bars.

We take two images, each one pixel high by one pixel wide, and stretch several instances of them to produce the following result:

Here is a sample of the relevant HTML (you may have to change the paths to match those on your server):

<table width="508" height="250" border="0" cellpadding="0" cellspacing="0">
...
<td width="32" valign="bottom"><img src="widgets/images/chart_green.png" width="32" height="7" /></td>
<td width="32" valign="bottom"><img src="widgets/images/chart_greenW.png" width="32" height="5" /></td>
...
<td width="32" valign="bottom"><img src="widgets/images/chart_green.png" width="32" height="28" /></td>
<td width="32" valign="bottom"><img src="widgets/images/chart_greenW.png" width="32" height="15" /></td>
...
</table>

Notice we used a table to contain the dynamic images. Tables can have an image as a background, which we will use as our scale.

Here is the same table with a background image we created to show scale:

Scale                            

Now all we have to do is make the bars the correct size relative to the values they represent. Our table is 250 pixels high, so a value of “100%” would need to be 250 pixels high. A value of “50%” would need to be 125 pixels high, et cetera.

The equation to determine bar heights: image height = (race percent / 100) * 250.

We can express this in a simple PHP function:

function calculateBarGraph($v){
$v = ceil(($v/100)*250);
return $v;
}

$v is the percent value we want to turn into an image height. We round every result up to the next whole pixel with “ceil()” so that any value greater than zero has at least one pixel visible.
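To tie the formula to the table cells, here is a sketch of how the bars might be generated from PHP. The helper names barHeight and barCell are hypothetical, the image paths are copied from the sample HTML, and which color stands for which series is an assumption. The cells use valign="bottom", the attribute that controls vertical alignment in a table cell.

```php
<?php
// Hypothetical helper mirroring the bar-height formula: percent -> pixel
// height on a chart that is $chartHeight pixels tall.
function barHeight($percent, $chartHeight = 250) {
    return (int) ceil(($percent / 100) * $chartHeight);
}

// Hypothetical helper: one table cell holding a stretched one-pixel image.
function barCell($image, $percent) {
    return '<td width="32" valign="bottom"><img src="' . htmlspecialchars($image)
        . '" width="32" height="' . barHeight($percent) . '" /></td>';
}

// White population share: zip code 22203 first, then the U.S. figure.
echo barCell("widgets/images/chart_green.png", 68.0) . "\n";
echo barCell("widgets/images/chart_greenW.png", 75.1) . "\n";
```

Note that a value of exactly zero still produces a height of zero, so a group with no population in the zip code shows no bar at all.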

Try the application. View the source.

Advanced graphing

The bar chart technique in the previous section, which uses small images stretched to appear various sizes, can be used in many situations. There are times, however, when stretching images into bars will not suffice. Line graphs and pie graphs, for example, are valuable tools that require more horsepower than has been introduced thus far. Fortunately, there is an optional library called GD that can be compiled into PHP. This library facilitates the creation of dynamic images in various formats. Your system administrator can tell you if GD is installed with your version of PHP.
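If you would rather check for GD yourself, a short script can do it. This is a minimal sketch; the function name gdIsAvailable is our own, while extension_loaded(), function_exists() and gd_info() are standard PHP functions.

```php
<?php
// Report whether the GD extension is loaded in this PHP installation.
function gdIsAvailable() {
    return extension_loaded("gd") && function_exists("imagecreatetruecolor");
}

if (gdIsAvailable()) {
    // gd_info() describes the installed GD library, including its version.
    $info = gd_info();
    echo "GD is available: " . $info["GD Version"] . "\n";
} else {
    echo "GD is not available; ask your host to enable it.\n";
}
```

Upload the script to your Web host and open it in a browser to see which message appears.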

Building on the first two sections, we are going to create dynamic pie charts with GD and libchart, another open source library written in pure PHP that only needs to be uploaded to your Web server to work.

Better still, libchart makes pie charts very easy by handling all of the low-level GD details. Simply upload the “libchart” folder from the download to the project folder you want to use for your chart and add one line at the top of your PHP page: include "libchart/libchart.php";

From here, we take the demographic data that we collected in the first step of the tutorial and feed the variables to libchart, as shown in this PHP code snippet:

//name of image to create
$dynamicPieChart = "cache/" . $zip . "_pie.png";

//only create if it doesn't exist already
if (!file_exists($dynamicPieChart)) {
    $pie_chart = new PieChart(510, 250);
    $pie_chart->setLogo("images/blankLogo.png");
    $pie_chart->addPoint(new Point("White - $percentWhite", $percentWhite));
    $pie_chart->addPoint(new Point("Black / African American - $percentBlack", $percentBlack));
    $pie_chart->addPoint(new Point("American Indian - $percentAmericanIndian", $percentAmericanIndian));
    $pie_chart->addPoint(new Point("Asian - $percentAsian", $percentAsian));
    $pie_chart->addPoint(new Point("Native Hawaiian - $percentHawaiian", $percentHawaiian));
    $pie_chart->addPoint(new Point("Some Other Race - $percentOther", $percentOther));
    $pie_chart->addPoint(new Point("Two Or More Races - $percent2More", $percent2More));
    $pie_chart->setTitle("Race Breakdown for Zip Code " . $zip);
    $pie_chart->render($dynamicPieChart);
}

Notice the $dynamicPieChart variable. That is the location on the Web server where we are going to physically create a static pie chart. We chose to create a “cache” folder on the Web server for our images. Regardless of the location, PHP needs to have write access to the folder. If you have problems writing to the folder, contact your system administrator.
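Before rendering a chart, the script can confirm that the folder is usable. The following is a sketch only; the function name cacheFolderReady is hypothetical, and the 0755 permission setting is an assumption that may not suit every host.

```php
<?php
// Hypothetical helper: make sure the image cache folder exists and that
// PHP can write to it. Returns true when the folder is ready to use.
function cacheFolderReady($dir) {
    // Try to create the folder if it is not there yet.
    if (!is_dir($dir) && !@mkdir($dir, 0755, true)) {
        return false; // could not create the folder
    }
    return is_writable($dir);
}

if (!cacheFolderReady("cache")) {
    echo "PHP cannot write to the cache folder; check its permissions.\n";
}
```

Running the check first turns a silent failure to save the image into a message that points at the likely cause.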

The rest of the code creates a new instance of the PieChart class from the libchart library and feeds it the relevant data. The only other thing needed is a little HTML to place the image on a page.
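That HTML can be as small as a single image tag. Here is one way it might look, assuming the same "cache" folder and file naming used for $dynamicPieChart; the function name pieChartImgTag and the alt text are our own choices.

```php
<?php
// Hypothetical helper: build the <img> tag for a zip code's cached pie chart.
function pieChartImgTag($zip) {
    $src = "cache/" . $zip . "_pie.png";
    $alt = "Race Breakdown for Zip Code " . $zip;
    // The width and height match the size passed to PieChart(510, 250).
    return '<img src="' . htmlspecialchars($src) . '" width="510" height="250" alt="'
        . htmlspecialchars($alt) . '" />';
}

// prints <img src="cache/22203_pie.png" width="510" height="250" alt="Race Breakdown for Zip Code 22203" />
echo pieChartImgTag("22203") . "\n";
```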

The result is this:

pie chart

Try the application. View the source.
