Screen Scraping Your Way Into RSS
Introduction: RSS is one of the hottest technologies at the moment, and even big web publishers (such as the New York Times) are adopting it. However, there are still a lot of websites that do not have RSS feeds.
If you still want to be able to check those websites in your favorite aggregator, you need to create your own RSS feed for them. This can be done automatically with PHP, using a method called screen scraping. Screen scraping is usually frowned upon, as it's mostly used to steal content from other websites.
I personally believe that in this case, where the goal is to automatically generate an RSS feed, screen scraping is not a bad thing. Now, on to the code!
Getting the content
For this article, we'll use PHPit as an example, despite the fact that PHPit already has RSS feeds (http://www.phpit.net/syndication/).
We want to generate an RSS feed from the content listed on the front page (http://www.phpit.net). The first step in screen scraping is getting the complete page. In PHP this can be done very easily by using implode("", file("[the url here]")); if your web host allows it (file() can only open URLs when the allow_url_fopen setting is enabled). If you can't use file(), you'll have to use a different method of getting the page, e.g. the cURL library (http://www.php.net/curl).
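As a rough illustration of the fetching step, here is a minimal sketch that tries file() first and falls back to cURL. The helper name fetch_page() and the timeout value are assumptions made for this example, not something the article prescribes.

```php
<?php
/**
 * Fetch a page as one string, or return false on failure.
 * fetch_page() is a hypothetical helper name used for illustration.
 */
function fetch_page($url)
{
    // file() can only open URLs when allow_url_fopen is enabled.
    if (ini_get('allow_url_fopen')) {
        $lines = @file($url);
        if ($lines !== false) {
            // file() returns an array of lines; join them into one string.
            return implode('', $lines);
        }
    }

    // Fall back to cURL when the extension is installed.
    if (function_exists('curl_init')) {
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow redirects
        curl_setopt($ch, CURLOPT_TIMEOUT, 10);          // assumed timeout, in seconds
        return curl_exec($ch);                          // string on success, false on failure
    }

    return false;
}
```

Calling fetch_page('http://www.phpit.net') would then return the front page as a single string, or false if neither method worked.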
Now that we have the page, we can parse it for the content items using regular expressions. The key to screen scraping is looking for patterns that match the content, e.g. are all the content items wrapped in <div> elements or something else? If you can successfully discover a pattern, then you can use preg_match_all() to get all the content items.
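To make the pattern idea concrete, here is a small sketch of preg_match_all() at work. The markup it matches (a div with class "contentitem" holding a linked heading and a paragraph) is an assumed example, not PHPit's actual HTML, and extract_items() is a hypothetical helper name. For a real site, the pattern would need to be adapted to whatever structure the page source shows.

```php
<?php
/**
 * Extract content items from a page.
 * The markup matched here is an assumed example, not PHPit's real HTML.
 */
function extract_items($html)
{
    // The "s" modifier lets "." match newlines, so items may span several lines.
    $pattern = '#<div class="contentitem">\s*'
             . '<h3><a href="([^"]+)">(.*?)</a></h3>\s*'
             . '<p>(.*?)</p>\s*'
             . '</div>#s';

    // PREG_SET_ORDER groups the results per item: $m[0] is the whole match,
    // $m[1] to $m[3] are the captured URL, title and description.
    preg_match_all($pattern, $html, $matches, PREG_SET_ORDER);

    $items = array();
    foreach ($matches as $m) {
        $items[] = array(
            'url'         => $m[1],
            'title'       => trim(strip_tags($m[2])),
            'description' => trim(strip_tags($m[3])),
        );
    }
    return $items;
}

// Sample input written in the assumed markup.
$sample = '<div class="contentitem">
<h3><a href="/article/one/">First article</a></h3>
<p>A short <em>summary</em>.</p>
</div>
<div class="contentitem">
<h3><a href="/article/two/">Second article</a></h3>
<p>Another summary.</p>
</div>';

$items = extract_items($sample);
foreach ($items as $item) {
    echo $item['title'], ' -> ', $item['url'], "\n";
}
```

Regular expressions are brittle against markup changes, so a pattern like this tends to need updating whenever the target site changes its layout.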
More By Jase Dow