Use Perl to harness XML data sources
A step-by-step guide to processing the Moreover XML headline feeds
Web gurus are constantly telling us that great content is the key to an even greater website. But ‘content wizards’ and JavaScript code snippets offer only a limited way of bringing content from external sources into your site, so where do you turn for headlines, statistics and other useful data? The answer is XML.
The Perl server-side scripting language is a natural partner for XML, as it enables you to actually use the data from XML sources. This guide assumes a basic knowledge of the Perl language and of how to upload and maintain scripts on a server.
Perl lets developers fetch web pages (and other files) from the web with its LWP library. The following script, which uses the LWP::Simple module, downloads a web page and passes it on to the user:
#!/usr/bin/perl
use LWP::Simple;
$WebPage=get('http://www.yoursite.com/index.asp'); # $WebPage now holds the yoursite.com page
print "Content-type: text/html\n\n"; # Print the header for web browser viewing
print $WebPage;
When uploaded to a Perl/CGI-enabled host and viewed through a web browser, this script should display the yoursite.com homepage. As you’ll have guessed, get() can also be used to retrieve XML documents.
You’ll find many sources of XML-formatted data on the web, but some may limit your commercial usage of such content. Moreover.com are famous for their JavaScript-creating web wizard, but did you know that they also offer their content in the form of XML for your own customized use? A full list of XML addresses from Moreover is available here.
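One thing the script above does not do is check whether the download worked: get() from LWP::Simple returns undef when the request fails. The following is a minimal sketch of such a check, not a definitive implementation. The helper name page_or_message and its optional second argument (a stand-in fetcher, used here so the example runs without a network connection) are assumptions made for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# page_or_message is a hypothetical helper, not part of LWP.
# $fetcher is optional: any code reference that takes a URL and returns the
# page body, or undef on failure. It defaults to LWP::Simple's get().
sub page_or_message {
    my ($url, $fetcher) = @_;
    unless ($fetcher) {
        require LWP::Simple;               # only loaded when actually needed
        $fetcher = \&LWP::Simple::get;
    }
    my $body = $fetcher->($url);
    return defined $body
        ? $body
        : "<p>Sorry, $url could not be retrieved.</p>";
}

# Demonstration with stand-in fetchers, so nothing is downloaded here.
my $ok   = page_or_message('http://www.yoursite.com/index.asp',
                           sub { '<html>hello</html>' });
my $fail = page_or_message('http://www.yoursite.com/index.asp',
                           sub { return undef });

print "Content-type: text/html\n\n";
print "$ok\n$fail\n";
```

In a real CGI script you would call page_or_message() with the URL alone and print whatever it returns.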
For the following example we are going to use the Microsoft Corporation XML feed (http://p.moreover.com/cgi-local/page?c=Microsoft%20news&o=xml).
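Note the %20 in that address: spaces and other special characters have to be percent-encoded before they go into a URL. The sketch below shows one way to build such an address in Perl. The helper name moreover_feed_url is hypothetical, and it assumes that the c parameter names the category and that o=xml selects XML output, as the address above suggests; which category names Moreover actually accepts is not covered here.

```perl
use strict;
use warnings;

# moreover_feed_url is a hypothetical helper. It assumes 'c' carries the
# category name and 'o=xml' asks for XML, as in the example address.
sub moreover_feed_url {
    my ($category) = @_;
    # Percent-encode every character outside the RFC 3986 "unreserved" set.
    # This simple version assumes plain ASCII input.
    (my $encoded = $category) =~
        s/([^A-Za-z0-9\-._~])/sprintf('%%%02X', ord $1)/ge;
    return "http://p.moreover.com/cgi-local/page?c=$encoded&o=xml";
}

print moreover_feed_url('Microsoft news'), "\n";
# prints http://p.moreover.com/cgi-local/page?c=Microsoft%20news&o=xml
```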
On Moreover, content is offered in the form:
<article id="ARTICLE_ID">
<url>ARTICLE_URL</url>
<headline_text>HEADLINE_CLIPPET</headline_text>
<source>ORIGINATION_OF_ARTICLE</source>
<media_type>text</media_type>
<cluster>moreover...</cluster>
<tagline> </tagline>
<document_url>ORIGINATION_WEB_ADDRESS</document_url>
<harvest_time>TIME_HARVESTED</harvest_time>
<access_registration> </access_registration>
<access_status> </access_status>
</article>
Of course, document headers surround these repeating clusters of data, but these are the pieces of data we’ll be working with.
So, to start writing a Perl script to collect, parse and redisplay this data, we’ll begin with the mandatory opening lines and fetch the feed:
#!/usr/bin/perl
use LWP::Simple;
$_=get('http://p.moreover.com/cgi-local/page?c=Microsoft%20news&o=xml');
You may want to replace the XML address with your preferred choice, but at this point we’ll have the entire XML page in $_. Now we can run a loop: while the script can still find the start of a new article (<article id="ARTICLE_ID">), it finds each piece of information (headline text, source URL, etc.) and places it in its own array.
while (m/<article id="/) { #Find start of new article
    #First let's get the URL
    $_=$'; #Now $_ contains all data after the latest '<article id="'
    m/<url>/; #Get first piece of article data - a link
    $_=$'; #$_ contains URL and rest of data
    m#</url>#; #$` contains text before latest find of '</url>' and $' contains text after
    $URL[$ArticleNumber] = $`; #$URL[$ArticleNumber] contains the article's URL
    #Now retrieve headline text
    $_=$'; #Set $_ to contain data after last find
    m/<headline_text>/; #Get the headline start
    $_=$'; #$_ contains headline and rest of data
    m#</headline_text>#; #$` contains text before latest find of '</headline_text>' and $' contains text after
    $Headline[$ArticleNumber] = $`; #$Headline[$ArticleNumber] contains headline
    #Now retrieve source of article
    $_=$'; #Set $_ to contain data after last find
    m/<source>/; #Get the source start
    $_=$'; #$_ contains source and rest of data
    m#</source>#; #$` contains text before find of '</source>' and $' contains text after
    $Source[$ArticleNumber] = $`; #$Source[$ArticleNumber] contains article headline source
    #Now retrieve media type of article
    $_=$'; #Set $_ to contain data after last find
    m/<media_type>/; #Get the media type start
    $_=$'; #$_ contains media type and rest of data
    m#</media_type>#; #$` contains text before find of '</media_type>' and $' contains text after
    $MediaType[$ArticleNumber] = $`; #$MediaType[$ArticleNumber] contains the article's media type
    #Now retrieve tagline of article
    $_=$'; #Set $_ to contain data after last find
    m/<tagline>/; #Get the tagline start
    $_=$'; #$_ contains tagline and rest of data
    m#</tagline>#; #$` contains text before find of '</tagline>' and $' contains text after
    $Tagline[$ArticleNumber] = $`; #$Tagline[$ArticleNumber] contains the article's tagline
    #Now retrieve document URL of article
    $_=$'; #Set $_ to contain data after last find
    m/<document_url>/; #Get the document URL start
    $_=$'; #$_ contains document URL and rest of data
    m#</document_url>#; #$` contains text before find of '</document_url>' and $' contains text after
    $DocumentURL[$ArticleNumber] = $`; #$DocumentURL[$ArticleNumber] contains the article's document URL
    #Now retrieve harvest time of article
    $_=$'; #Set $_ to contain data after last find
    m/<harvest_time>/; #Get the harvest time start
    $_=$'; #$_ contains harvest time and rest of data
    m#</harvest_time>#; #$` contains text before find of '</harvest_time>' and $' contains text after
    $HarvestTime[$ArticleNumber] = $`; #$HarvestTime[$ArticleNumber] contains the article's time of harvest
    #Now retrieve access registration of article
    $_=$'; #Set $_ to contain data after last find
    m/<access_registration>/; #Get the access registration start
    $_=$'; #$_ contains access registration and rest of data
    m#</access_registration>#; #$` contains text before find of '</access_registration>' and $' contains text after
    $AccessRegistration[$ArticleNumber] = $`; #$AccessRegistration[$ArticleNumber] contains the article's access registration
    #Now retrieve access status of article
    $_=$'; #Set $_ to contain data after last find
    m/<access_status>/; #Get the access status start
    $_=$'; #$_ contains access status and rest of data
    m#</access_status>#; #$` contains text before find of '</access_status>' and $' contains text after
    $AccessStatus[$ArticleNumber] = $`; #$AccessStatus[$ArticleNumber] contains the article's access status
    $ArticleNumber++; # Move on to the next array index, ready for the next article
}
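The loop above leans on two of Perl's special variables: after a successful match, $` holds the text before the match and $' holds the text after it. The short sketch below isolates that technique on a made-up string (the example.com address and "Example Wire" are placeholders, not real feed data) and compares it with a capturing group, which does the same job in one step.

```perl
use strict;
use warnings;

# Made-up sample data for illustration only.
my $sample = '<url>http://example.com/story</url><source>Example Wire</source>';

# The technique used in the loop: prematch and postmatch variables.
$_ = $sample;
m/<url>/;            # match the opening tag
$_ = $';             # $' is everything after the match
m#</url>#;           # match the closing tag
my $url = $`;        # $` is everything before the match

# An alternative: capture the text between the tags directly.
my ($captured) = $sample =~ m#<url>(.*?)</url>#s;

print "$url\n";        # prints http://example.com/story
print "$captured\n";   # prints http://example.com/story
```

Perl's own documentation warns that on older interpreters the use of $` and $' can slow down every match in a program, so the capturing form is often preferred in larger scripts.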
We now have 9 arrays of article data, all indexed by article number, so the same index in every array refers to the same article. For example, the link for the headline in $Headline[5] can be found in $URL[5], and the web address of its source in $DocumentURL[5]. What can we do with the data now held in the arrays? The main thing you’ll probably want to do is simply display it. A simple piece of code which can follow the last loop is:
print "Content-type: text/html\n\n"; # HTTP header for viewing the page through a web browser
for ($Article=0; $Article < $ArticleNumber; $Article++) { # Go through each article
    print "<A HREF=\"$URL[$Article]\">$Headline[$Article]</A><BR>$HarvestTime[$Article] from <A HREF=\"$DocumentURL[$Article]\">$Source[$Article]</A><BR><BR>";
}
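The parsing loop repeats the same steps for every tag, so it can be folded into a list of tag names and a single pattern. The following is a sketch of that idea under stated assumptions, not the method used above: the helper name parse_articles is hypothetical, the sample record is invented, and it stores one hash per article rather than nine parallel arrays.

```perl
use strict;
use warnings;

# Tag names taken from the feed format shown earlier.
my @tags = qw(url headline_text source media_type tagline
              document_url harvest_time access_registration access_status);

# parse_articles is a hypothetical helper: it returns one hash reference
# per <article> block, keyed by tag name (plus 'id').
sub parse_articles {
    my ($xml) = @_;
    my @articles;
    while ($xml =~ m#<article id="([^"]*)">(.*?)</article>#gs) {
        my ($id, $body) = ($1, $2);   # copy before the inner matches reset them
        my %article = (id => $id);
        for my $tag (@tags) {
            # Left undef when the tag is absent from this article.
            ($article{$tag}) = $body =~ m#<$tag>(.*?)</$tag>#s;
        }
        push @articles, \%article;
    }
    return @articles;
}

# Invented sample record, used so the example runs without a download.
my $sample = <<'XML';
<article id="_1">
<url>http://example.com/a</url>
<headline_text>First headline</headline_text>
<source>Example Wire</source>
<document_url>http://example.com/</document_url>
<harvest_time>Jan 1 2001 9:00AM</harvest_time>
</article>
XML

for my $article (parse_articles($sample)) {
    print "$article->{headline_text} ($article->{source})\n";
}
# prints First headline (Example Wire)
```

Regular expressions are enough for a feed this regular; for XML that is nested or less predictable, a dedicated parser module from CPAN such as XML::Parser is likely to be more robust.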
XML opens up a wide range of possibilities: data from sources anywhere in the world can be distributed and re-presented freely, parsed easily and updated automatically. What is more, Moreover.com are just one of many suppliers of harvested data, and the market is growing as Microsoft promote this area. The outlook is certainly bright for XML and for content kings in the online world.
Copyright © 2001 Adam Waude. All Rights Reserved.
Author Information: Adam Waude - adamwaude@hotmail.com