Use Perl to harness XML data sources
By: Developer Shed
    2003-08-09


    A step-by-step guide to processing the Moreover XML headline feeds

    Web gurus are constantly telling us that great content is the key to an even greater website. But with ‘content wizards’ and JavaScript code snippets offering only a limited way to bring external content into your site, where do you turn for headlines, statistics and other useful data? The answer is XML.

    Perl, a server-side scripting language, is a good partner for XML because it lets you fetch XML sources and put their data to work. We assume a basic knowledge of Perl and of how to upload and maintain scripts on a server.

    Perl lets developers fetch web pages (and other files) from the web using its LWP module. The following script downloads a web page and passes it on to the user:


    #!/usr/bin/perl
    use LWP::Simple;

    $WebPage = get('http://www.yoursite.com/index.asp'); # $WebPage now holds the yoursite.com page

    print "Status: 200 OK\nContent-type: text/html\n\n"; # Print headers for web browser viewing (double quotes so \n becomes a newline)

    print $WebPage;


    When uploaded to a Perl/CGI-enabled host and viewed through a web browser, this script should display the yoursite.com homepage. As you’ll have guessed, get() can also be used to retrieve XML documents.
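
    One detail the script above does not cover is a failed download: LWP::Simple's get() returns undef when it cannot fetch the page. The following is a minimal sketch of one way to handle that case. The helper name build_response and the choice of a 502 status are assumptions made for illustration, not part of the original script.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: wrap downloaded content in a CGI response.
# $content is whatever get() returned; undef means the download failed.
sub build_response {
    my ($content) = @_;
    my $headers = "Content-type: text/html\n\n";

    # Assumed policy: report a gateway error when there is nothing to show.
    return "Status: 502 Bad Gateway\n" . $headers . "<p>Feed unavailable.</p>\n"
        unless defined $content;

    return "Status: 200 OK\n" . $headers . $content;
}

# In the real script the argument would come from LWP::Simple's get().
print build_response("<html><body>Hello</body></html>");
```

    Keeping the header logic in one function means the same check can be reused when the script later fetches XML instead of HTML.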

    You’ll find many sources of XML-formatted data on the web, though some may limit commercial use of their content. Moreover.com are famous for their JavaScript-generating web wizard, but did you know that they also offer their content as XML for your own customized use? A full list of XML addresses from Moreover is available here.

    For the following example we are going to use the Microsoft Corporation XML feed (http://p.moreover.com/cgi-local/page?c=Microsoft%20news&o=xml).

    On Moreover, content is offered in the form:

    <article id="ARTICLE_ID">
        <url>ARTICLE_URL</url>
        <headline_text>HEADLINE_CLIPPET</headline_text>
        <source>ORIGINATION_OF_ARTICLE</source>
        <media_type>text</media_type>
        <cluster>moreover...</cluster>
        <tagline> </tagline>
        <document_url>ORIGINATION_WEB_ADDRESS</document_url>
        <harvest_time>TIME_HARVESTED</harvest_time>
        <access_registration> </access_registration>
        <access_status> </access_status>
    </article>

    Of course, document headers surround these repeating clusters of data, but these are the pieces of data we’ll be working with.
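
    To make the repeating structure concrete, here is a small sketch that splits a feed into its article blocks with a single global match. The sample data, the <wrapper> element and the example.com addresses are made up for illustration; the real feed's surrounding headers are not shown in this article.

```perl
use strict;
use warnings;

# Hypothetical sample feed; <wrapper> stands in for the real document headers.
my $feed = <<'XML';
<wrapper>
<article id="_1001">
<url>http://example.com/a</url>
<headline_text>First headline</headline_text>
</article>
<article id="_1002">
<url>http://example.com/b</url>
<headline_text>Second headline</headline_text>
</article>
</wrapper>
XML

# In list context, a /g match with two capture groups returns a flat list:
# (id1, body1, id2, body2, ...). The /s flag lets . match newlines.
my @parts = $feed =~ m{<article id="([^"]*)">(.*?)</article>}gs;

my @ids;
while (my ($id, $body) = splice(@parts, 0, 2)) {
    push @ids, $id;
    print "article $id has ", length($body), " characters of data\n";
}
```

    Each $body holds only the elements of one article, so any field lookup done on it cannot accidentally run into the next article.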

    So, to write a Perl script that collects, parses and redisplays this data, we’ll begin with the mandatory headers:

    #!/usr/bin/perl

    use LWP::Simple;

    $_ = get('http://p.moreover.com/cgi-local/page?c=Microsoft%20news&o=xml');

    You may want to replace the XML address with your preferred choice, but at this point we’ll have the entire XML page in $_. Now we can run a loop: for as long as it can still find the start of a new article (<article id="ARTICLE_ID">), the script finds each piece of information (headline text, source, URL and so on) and places it in its own array.


    $ArticleNumber = 0; # Index of the article currently being stored

    while (m/<article id="/) { # Find start of new article

        # First let's get the URL
        $_=$'; # Now $_ contains all data after the latest '<article id="'
        m/<url>/; # Get first piece of article data - a link
        $_=$'; # $_ contains URL and rest of data
        m#</url>#; # $` contains text before latest find of '</url>' and $' contains text after
        $URL[$ArticleNumber] = $`; # $URL[$ArticleNumber] contains the article's URL

        # Now retrieve headline text
        $_=$'; # Set $_ to contain data after last find
        m/<headline_text>/; # Get the headline start
        $_=$'; # $_ contains headline and rest of data
        m#</headline_text>#; # $` contains text before latest find of '</headline_text>' and $' contains text after
        $Headline[$ArticleNumber] = $`; # $Headline[$ArticleNumber] contains headline

        # Now retrieve source of article
        $_=$'; # Set $_ to contain data after last find
        m/<source>/; # Get the source start
        $_=$'; # $_ contains source and rest of data
        m#</source>#; # $` contains text before find of '</source>' and $' contains text after
        $Source[$ArticleNumber] = $`; # $Source[$ArticleNumber] contains article headline source

        # Now retrieve media type of article
        $_=$'; # Set $_ to contain data after last find
        m/<media_type>/; # Get the media type start
        $_=$'; # $_ contains media type and rest of data
        m#</media_type>#; # $` contains text before find of '</media_type>' and $' contains text after
        $MediaType[$ArticleNumber] = $`; # $MediaType[$ArticleNumber] contains the article's media type

        # Now retrieve tagline of article
        $_=$'; # Set $_ to contain data after last find
        m/<tagline>/; # Get the tagline start
        $_=$'; # $_ contains tagline and rest of data
        m#</tagline>#; # $` contains text before find of '</tagline>' and $' contains text after
        $Tagline[$ArticleNumber] = $`; # $Tagline[$ArticleNumber] contains the article's tagline

        # Now retrieve document URL of article
        $_=$'; # Set $_ to contain data after last find
        m/<document_url>/; # Get the document URL start
        $_=$'; # $_ contains document URL and rest of data
        m#</document_url>#; # $` contains text before find of '</document_url>' and $' contains text after
        $DocumentURL[$ArticleNumber] = $`; # $DocumentURL[$ArticleNumber] contains the article's document URL

        # Now retrieve harvest time of article
        $_=$'; # Set $_ to contain data after last find
        m/<harvest_time>/; # Get the harvest time start
        $_=$'; # $_ contains harvest time and rest of data
        m#</harvest_time>#; # $` contains text before find of '</harvest_time>' and $' contains text after
        $HarvestTime[$ArticleNumber] = $`; # $HarvestTime[$ArticleNumber] contains the article's time of harvest

        # Now retrieve access registration of article
        $_=$'; # Set $_ to contain data after last find
        m/<access_registration>/; # Get the access registration start
        $_=$'; # $_ contains access registration and rest of data
        m#</access_registration>#; # $` contains text before find of '</access_registration>' and $' contains text after
        $AccessRegistration[$ArticleNumber] = $`; # $AccessRegistration[$ArticleNumber] contains the article's access registration

        # Now retrieve access status of article
        $_=$'; # Set $_ to contain data after last find
        m/<access_status>/; # Get the access status start
        $_=$'; # $_ contains access status and rest of data
        m#</access_status>#; # $` contains text before find of '</access_status>' and $' contains text after
        $AccessStatus[$ArticleNumber] = $`; # $AccessStatus[$ArticleNumber] contains the article's access status

        $ArticleNumber++; # Move to the next array index so the next article gets its own slot
    }
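
    The loop above leans on two of Perl's special match variables: $` holds the text before the most recent successful match, and $' holds the text after it. The short sketch below shows the same two-step extraction on a single made-up string (the example.com address is a placeholder) and, for comparison, a capture group that reaches the same value in one step.

```perl
use strict;
use warnings;

# Hypothetical fragment of feed data.
my $data = 'junk<url>http://example.com/story</url>rest';

my ($url, $after);
if ($data =~ m/<url>/) {
    $after = $';           # everything after '<url>'
    if ($after =~ m#</url>#) {
        $url   = $`;       # text before '</url>', i.e. the URL itself
        $after = $';       # remaining data after '</url>'
    }
}
print "url=$url rest=$after\n";

# A capture group extracts the same text with one match.
my ($captured) = $data =~ m{<url>(.*?)</url>}s;
print "captured=$captured\n";
```

    Note that a failed match leaves $` and $' at the values from the last successful match, so testing the result of each match, as this sketch does, is a safer habit when a feed might omit an element.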

    We now have 9 arrays of article data, and the items at the same index in each array describe the same article. For example, the URL of the article behind the headline $Headline[5] can be found in $URL[5], and the web address of its source in $DocumentURL[5]. What can be done with this data? The main thing you’ll probably want to do is simply display it. A simple piece of code which can follow the last loop is:


    print "Status: 200 OK\nContent-type: text/html\n\n"; # HTTP headers for viewing page through a web browser

    for ($Article=0; $Article < $ArticleNumber; $Article++) { # Go through each article

        # Headline links to the article itself; source name links to the originating site
        print "<A HREF=\"$URL[$Article]\">$Headline[$Article]</A><BR>$HarvestTime[$Article] from <A HREF=\"$DocumentURL[$Article]\">$Source[$Article]</A><BR><BR>";
    }
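
    As a variation on the display step, the sketch below shows only a limited number of headlines and skips any record without a URL. It assumes the data has been gathered into a list of hash records rather than parallel arrays; the record layout, the sample values and the helper name render_headlines are all hypothetical.

```perl
use strict;
use warnings;

# Hypothetical sample records; in the article's script these values live in
# the parallel arrays (@Headline, @URL, @Source, ...).
my @articles = (
    { headline => 'First headline',  url => 'http://example.com/a', source => 'Example News' },
    { headline => 'No link here',    url => '',                     source => 'Example News' },
    { headline => 'Second headline', url => 'http://example.com/b', source => 'Example Wire' },
    { headline => 'Third headline',  url => 'http://example.com/c', source => 'Example Wire' },
);

# Hypothetical helper: render at most $limit headlines, skipping records with no URL.
sub render_headlines {
    my ($limit, @records) = @_;
    my $html  = '';
    my $shown = 0;
    for my $rec (@records) {
        last if $shown >= $limit;          # stop once enough headlines are shown
        next unless length $rec->{url};    # skip records that cannot be linked
        $html .= qq{<A HREF="$rec->{url}">$rec->{headline}</A> ($rec->{source})<BR>\n};
        $shown++;
    }
    return $html;
}

print render_headlines(2, @articles);
```

    Returning the HTML as a string, rather than printing inside the loop, keeps the CGI headers and the page body under the caller's control.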


    XML opens up a wide range of possibilities: data from sources anywhere in the world can be distributed and re-presented freely, parsed easily and updated automatically. What is more, Moreover.com are just one of many suppliers of harvested data, and the market is growing as Microsoft promote this area. The outlook is certainly bright for XML and for content kings in the online world.

    Copyright © 2001 Adam Waude. All Rights Reserved.

    Author Information: Adam Waude - adamwaude@hotmail.com


