Collecting and using Census Bureau data
An example of Web programming techniques
The Census Bureau is a treasure trove of information about communities. But sometimes getting to the data and making sense of it takes a bit — OK, a lot — of work.
For those who want it, there are gigabytes of raw data available for download and import into a local database for analysis. But for those who need only tiny pieces of demographic data, the American FactFinder on the Census Web site makes finding the information easy. With some effort, the information can be cut and pasted manually from the Web site or downloaded in Microsoft Excel format.
Unfortunately, the information compiled by the American FactFinder is not easily available on the fly to an outside Web application; the Census does not provide the information as a Web service or as an XML feed. With a little work, however, that data can be pulled out of the Web page automatically by a computer script through a process known as Web scraping.
In a post on his blog about Web technologies at readwriteweb.com, Alex Iskold explains: “Web scraping is essentially reverse engineering of HTML pages.” Basically, “Scrapers are the programs that ‘know’ how to get the data back from a given HTML page. They work by learning the details of the particular markup and figuring out where the actual data is.”
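To make Iskold's description concrete, here is a minimal sketch of a scraper in Python that uses only the standard library. The HTML sample, the table id, and the figures in it are invented for illustration; they are not the actual FactFinder markup, which a real scraper would have to be written against. The names `TableScraper` and `scrape_table` are likewise hypothetical.

```python
from html.parser import HTMLParser

# Hypothetical markup and figures, invented for illustration only.
# This is NOT the real American FactFinder page structure.
SAMPLE_HTML = """
<table id="demographics">
  <tr><th>Subject</th><th>Number</th></tr>
  <tr><td>Total population</td><td>12,345</td></tr>
  <tr><td>Median age (years)</td><td>36.2</td></tr>
</table>
"""


class TableScraper(HTMLParser):
    """Collects the text of every <td> cell, grouped by table row."""

    def __init__(self):
        super().__init__()
        self.rows = []        # finished rows, each a list of cell strings
        self._row = None      # cells of the row currently being read
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_endtag(self, tag):
        if tag == "td" and self._in_cell:
            self._row[-1] = self._row[-1].strip()
            self._in_cell = False
        elif tag == "tr" and self._row is not None:
            if self._row:     # header rows hold only <th>, so they are skipped
                self.rows.append(self._row)
            self._row = None

    def handle_data(self, data):
        if self._in_cell:
            self._row[-1] += data


def scrape_table(html):
    """Return a {label: value} dict from a two-column HTML table."""
    parser = TableScraper()
    parser.feed(html)
    return {row[0]: row[1] for row in parser.rows if len(row) == 2}


data = scrape_table(SAMPLE_HTML)
print(data)
```

The scraper "knows" only one thing about the page: that each data row holds a label cell followed by a value cell. That knowledge is exactly what has to be rediscovered whenever the page layout changes.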
Web scraping issues
You should know that Web scraping — sometimes also called “screen scraping” — is not without problems or controversy.
Derek Willis, database editor at washingtonpost.com, maintains the site’s widely recognized Congressional votes application, which relies heavily on scraping techniques, as well as other data-dependent projects. He says one issue is simply practical: “Web-scraping is dependent on the source, so should the source of the information change, that could break the scraper or inject errors into the data.”
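One way to guard against the problem Willis describes is to check scraped values before using them, so that a changed page produces a visible failure instead of silently wrong data. The sketch below is one possible approach, not a method taken from the article; the function name `validate_scrape` and the labels and values in the example are hypothetical.

```python
def validate_scrape(data, expected_labels):
    """Return a list of problems found in a scraped {label: value} dict.

    An empty list means every expected label is present and its value
    parses as a number once thousands separators are removed.
    """
    problems = []
    for label in expected_labels:
        if label not in data:
            problems.append("missing: " + label)
            continue
        try:
            float(data[label].replace(",", ""))
        except ValueError:
            problems.append("not a number: " + label + " = " + repr(data[label]))
    return problems


# Hypothetical scraped results, invented for illustration.
good = {"Total population": "12,345", "Median age (years)": "36.2"}
broken = {"Total population": "See footnote"}
labels = ["Total population", "Median age (years)"]

print(validate_scrape(good, labels))
print(validate_scrape(broken, labels))
```

A scraper that runs on a schedule could stop and alert an editor whenever the returned list is not empty.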
Some Webmasters consider it “rude,” because if you’re not careful, you can create a lot of traffic on a server if you are grabbing lots of data. Willis says, “You also need to be a polite scraper - don’t flood a server with multiple requests at the same time, or you could get banned from accessing that server.” Some Webmasters have been known to take steps to block scrapers once they realize what’s happening.
It’s probably best to notify a Webmaster if you’re launching a persistent scraping tool or trying to collect a large amount of information.
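The advice about being a polite scraper can be sketched in code: request pages one at a time, pause between requests, and identify yourself so a Webmaster can reach you. This is a minimal illustration using Python's standard library. The two-second delay, the User-Agent text, and the function names are assumptions chosen for the example, not recommendations from the sources quoted here.

```python
import time
import urllib.request

# Hypothetical identification string; replace with real contact details.
USER_AGENT = "newsroom-scraper-example (contact: webmaster@example.com)"


def fetch(url):
    """Download one page, sending a User-Agent that identifies the scraper."""
    request = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
    with urllib.request.urlopen(request, timeout=30) as response:
        return response.read()


def polite_fetch_all(urls, delay_seconds=2.0, fetcher=fetch, sleep=time.sleep):
    """Fetch URLs strictly one after another, pausing between requests.

    `fetcher` and `sleep` are parameters so the pacing logic can be
    exercised without touching the network.
    """
    pages = []
    for index, url in enumerate(urls):
        if index > 0:
            sleep(delay_seconds)   # wait before every request except the first
        pages.append(fetcher(url))
    return pages
```

Because requests are made in sequence with a pause in between, the server never sees more than one request from the scraper at a time.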
In some cases, you might need to get legal advice. If you are taking information from a commercial site or if you plan to use the data you get for commercial purposes, you might run into copyright issues. Those should be less of a problem if you’re dealing with government sites and public information.
John Perry has written a primer on Web scraping for Uplink, a publication of the National Institute for Computer-Assisted Reporting and Investigative Reporters and Editors. If you are serious about using data, or if you want to learn more about all kinds of reporting techniques, IRE is a great resource. (Full disclosure: Wendell Cochran, one of the developers of this module, is a member of the IRE Board of Directors and a shameless booster of the organization.)
For details on how we built our widget, read this page. Be aware that it is written for sophisticated Web programmers.



