
Maintaining Link Integrity to External Web Sites in a Hyperwave-based Learning Environment

C. Anderson
J. Lennon
HyperMedia Unit
Department of Computer Science,
Private Bag 92019, Auckland, New Zealand

Abstract: Teaching in an Internet-based learning environment provides unique challenges. Of particular interest is the question of maintaining link integrity. Although several modern hypermedia systems can guarantee that links within their systems are maintained, it is more difficult to ensure that links to external Web servers remain intact. The system we have developed maintains consistency to external Web sites for all essential documents (as specified by the courseware authors).

1 Introduction

Internet-based learning environments are an exciting development worldwide. Unfortunately, broken links pose a considerable problem for system administrators, teachers, and students alike. For example, it is most frustrating to discover in the middle of a lecture that a link, which was valid when the lecture was prepared, has disappeared, maybe just one hour later (as we have experienced more than once). Students also experience additional stress when they discover just before an exam that material, which they briefly skimmed at the beginning of a semester, has disappeared.

1.1 Our Virtual Learning Centre

We have aimed to create an interactive distributed learning environment, using a combination of new course material and lecture material available from previous years [Lennon and Maurer, 1994]. Since there is a wide range of hardware that has to be supported, we have relied on common Web browsers that support Java applets. Successful features of our project include:

1.2 A Proliferation of Files

Experience has shown that students expect annotated lecture notes, tutorial notes, assignment information and a whole raft of extra information. They also expect that anything on the screen should have a reasonably high pictorial content, which creates a further proliferation of files, since every graphic, regardless of how small it is, has to reside in a separate file. The number of files for one course can easily exceed 200. Since the number of courses given in one semester at the University of Auckland is currently just under 3000, we can estimate that if all courses were given electronically we should be planning for over 600,000 files.

1.3 A Proliferation of Broken Links

The next question to be considered is how many links may be involved in a database of 600,000 files. This is more difficult to answer since authors vary widely in the number of links that they include in any document. Taking our own site as an example, we estimate over one million. This means that someone may have to try and maintain the integrity of over a million links in the database.

Also to be considered is the question of how an administrator of one server knows when a document on another Web site has been moved or removed. In the current climate, the answer is that they will rarely know.

Hyperwave [Andrews et al., 1995, Maurer, 1996], which was developed to address these problems, has been used for the management of all electronic material in our Internet-based learning environment.

1.4 Further Motivation For Our Work

It is becoming increasingly popular for publishing agencies to produce related Web sites for their publications. Many books now refer not only to their publisher's sites, but also to other related Web sites. Unfortunately, a book which has perhaps spent over six months in production, and which contains links to various Web sites, may well be out of date the day it is finally distributed. An ideal example of this problem is The Internet Yellow Pages [Harley Halm & Rick Stout, 1994], published in 1994. This book is entirely dedicated to listings of newsgroups, FTP sites and Web pages, many of which are now well out of date. Following an arbitrary URL from this book will likely result in the infamous "404: file not found" error from the host site, because the original document referred to has been either renamed or deleted.

Figure 1. The dangling link problem

Online guides and lists of links also suffer from this problem, although to a lesser extent. It is of course much easier to alter a Web page to reflect a change in a URL than to reprint an entire book. However, Web site changes still require manual alteration of the document in question and, more importantly, someone must first be available to notice that a particular link is now out of date.

What is needed is a method of specifying links in relatively stable Web documents and printed material in a manner which allows automatic notification of broken links and provides a simple method of correcting or re-directing these links.

Before moving on to the problem of links to documents stored in external Web servers, we first highlight a few key built-in features of Hyperwave.

2 A Brief Summary of Key Features of the Hyperwave System

Hyperwave (originally called Hyper-G) is an advanced hypermedia document management system developed at the Institute for Computer Science and Computer Supported New Media (IICM) in Graz, Austria. The principal architects are Prof. Hermann Maurer and Frank Kappe [Kappe et al., 1993].

Hyperwave was developed to store high-volume hypermedia resources, which can be spread over multiple servers [Kappe, 1995]. The long-term maintenance and development of databases becomes crucial when databases grow beyond a certain size. To expedite management, particularly of links, a more structured approach needs to be pursued right from the beginning [Flohr, 1995, Maurer et al., 1994].

Hyperwave follows a more sophisticated concept than most available Web server systems. The project's aims are [Andrews et al., 1994]:

Hyperwave offers a self-contained working and navigation environment, and relies on the Internet as its base communication medium. TCP/IP port 418 has been reserved by the Internet community for communication to Hyperwave servers.

The core of a Hyperwave server is formed by a set of object-oriented databases. It supports customary database features such as the assigning of attributes to objects, indexing, and searching. The server also maintains a hierarchy of user and group information. The following attributes can be assigned to any document:

Since attributes are not stored within the objects (as in most Web servers), but in a separate database, they can be efficiently extracted and manipulated. In addition, objects can be locked to ensure they are not updated concurrently.

2.1 Local and Remote Link Maintenance

There exist utilities which can augment Web server functionality and ensure that all links at a site are pointing to the correct documents or discard links which no longer point to existing documents [Jones and Pitkow, 1995]. These utilities generally only ensure that all 'on site' links are kept up-to-date.

Hyperwave also addresses this problem on a 'site' basis. The details of a link within a document stored in a Hyperwave server are kept separate from the actual document itself. The server is always aware of exactly which document a particular link refers to and, more importantly, which documents make use of the link.

When a document is moved on a Hyperwave server, all links referring to that document are automatically updated. Similarly, when a document is deleted, all links referring to it are 'deactivated'. As soon as the document referred to is restored or a new one created, the link becomes active once more. This automatic maintenance of local links also works across multiple Hyperwave servers.
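The idea of keeping link details apart from document content can be sketched as follows. This is only an illustration under our own assumptions: the LinkStore and Link classes and all method names are hypothetical, and the sketch achieves move-safety by letting links refer to stable document identifiers, which may differ from how a Hyperwave server does it internally.

```python
from dataclasses import dataclass, field


@dataclass
class Link:
    source: str          # id of the document containing the link
    target: str          # id of the document the link points to
    active: bool = True  # False while the target document is missing


@dataclass
class LinkStore:
    """Hypothetical link store kept separate from the documents themselves."""
    documents: dict = field(default_factory=dict)  # document id -> current path
    links: list = field(default_factory=list)

    def add_document(self, doc_id, path):
        self.documents[doc_id] = path
        # Re-activate any links that were waiting for this document.
        for link in self.links:
            if link.target == doc_id:
                link.active = True

    def add_link(self, source, target):
        link = Link(source, target, active=target in self.documents)
        self.links.append(link)
        return link

    def move_document(self, doc_id, new_path):
        # Links refer to ids rather than paths, so a move needs no link edits.
        self.documents[doc_id] = new_path

    def delete_document(self, doc_id):
        del self.documents[doc_id]
        # Deactivate, rather than discard, every link to the deleted document.
        for link in self.links:
            if link.target == doc_id:
                link.active = False

    def referrers(self, doc_id):
        """Ids of the documents that make use of a link to doc_id."""
        return [link.source for link in self.links if link.target == doc_id]

    def resolve(self, link):
        """Current path of the link target, or None while the link is inactive."""
        return self.documents[link.target] if link.active else None
```

Since the store knows both ends of every link, it can answer "which documents refer to this one?" directly, which is what makes deactivation and later re-activation possible.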

A full description of Hyperwave, including all technical details, can be found in [Maurer, 1996].

However, Hyperwave does not check that links pointing out of Hyperwave documents to documents on remote Web servers remain intact. This is the topic of the rest of this paper.

3.1 Remote Link De-referencing

It is not practical to expect everyone in the world to start using a Hyperwave server to ensure the integrity of links world-wide. We propose the use of special link and document caches, implemented using Hyperwave, which could be used to de-reference links on a local server and protect static documents from changes on the Web at large.

Hyperwave allows links to remote sites to be loaded as entities in their own right. When a user accesses a link in a local document, the browser does not connect directly to the remote document; instead, it connects to the local remote link object. The remote link object automatically refers to the remote document, and this redirection is virtually transparent to the user.

Figure 2. De-referencing a link

By using and maintaining the remote link objects, several features can be implemented:

In practice, a user will click on a link in a document, which will refer to a remote link object. This object will either re-direct the user to the remote document, or provide the cached document and explain that the actual remote document is unavailable for whatever reason.
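The behaviour just described can be illustrated with a minimal sketch. The RemoteLink class, its fields, and the resolve function are hypothetical names chosen for this illustration; they are not part of Hyperwave, and the wording of the notice is an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RemoteLink:
    """Hypothetical stand-in for a remote link object held on the local server."""
    url: str                           # the remote document this object refers to
    available: bool = True             # result of the most recent availability check
    cached_text: Optional[str] = None  # text-only copy of the remote document, if any


def resolve(link):
    """Decide what a click on a local remote link object should produce.

    Returns a pair (action, payload):
      ("redirect", url)     the remote document is believed to be reachable
      ("cached", text)      the remote document is unavailable, a cached copy exists
      ("unavailable", url)  the remote document is unavailable and nothing is cached
    """
    if link.available:
        return ("redirect", link.url)
    if link.cached_text is not None:
        notice = (
            "The original document at " + link.url
            + " is currently unavailable; this is a cached copy.\n\n"
        )
        return ("cached", notice + link.cached_text)
    return ("unavailable", link.url)
```

Because every local document points at the remote link object rather than at the remote URL, correcting a moved document in this scheme means changing one url field instead of editing every referring document.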

3.2 External Link Maintenance

Setting up remote link objects and referencing them is a relatively simple exercise using Hyperwave. The main problem to be overcome is detecting when a remote document no longer exists (or is no longer valid) and deciding what to do when this happens. It is relatively easy to detect when a document has been deleted from a Web site, simply by periodically attempting to download the document. Some parameters are required to determine how 'periodically' the links should be checked, together with an indication of how long a document must be missing or unobtainable before it is deemed to have been deleted (or to be just too unreliable to use).

Another complicating factor is that the content of a document may change to the point where it is no longer relevant in its original context. The current relevance of a document must ultimately lie in the hands of the author who decided to use the link in the first place. It would therefore be sensible for the link cache to report to the author that a remote document has been altered.

HTTP, the protocol used by Web servers, allows document information to be retrieved without actually downloading the entire document [http://www.hmu.com/web_links/http2]. The remote document's modification time is ideal for noting changes in a remote document. This modification time can then be added to the remote link object as an attribute for future reference.
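As an illustration, the modification time can be obtained with an HTTP HEAD request, which returns the response headers without the document body. The sketch below uses the Python standard library; the function names are our own, and treating a missing or malformed Last-Modified header as "unknown" (None) is an assumption rather than something the prototype is known to do.

```python
import urllib.request
from email.utils import parsedate_to_datetime


def parse_last_modified(header_value):
    """Convert a Last-Modified header value to a datetime, or None if unusable."""
    if not header_value:
        return None
    try:
        return parsedate_to_datetime(header_value)
    except (TypeError, ValueError):
        # Malformed date: treat the modification time as unknown.
        return None


def fetch_last_modified(url, timeout=10):
    """Issue a HEAD request and return the remote document's modification time.

    Network and HTTP errors (for example a 404) propagate as exceptions
    from urllib, so the caller can treat them as a failed check.
    """
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return parse_last_modified(response.headers.get("Last-Modified"))
```

Not every server sends a Last-Modified header, so a checker built this way needs a fallback for documents whose modification time is unknown.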

3.3 Remote Document Caching

When a remote document has been removed or renamed, and the link author informed, it would be advantageous to be able to supply a temporary version of the document until such time as the author alters the link's destination, or the original remote document becomes available once again.

After creating the original remote link object, and checking the existence of the remote document, the text part of the document can be cached (ignoring images and other embedded items). The remote link object can then use the cached document in the event that the remote document becomes unavailable. When supplying the temporary cached document, the viewer should be informed that the document they are seeing is not the original and they should be allowed the option of trying the original remote document.
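One possible reading of "the text part of the document" is the visible text with images and other embedded items left out. The sketch below follows that reading using the standard-library HTML parser; the class and function names are hypothetical, and the set of skipped elements is an assumption.

```python
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Collects visible text and ignores images and embedded content."""

    # Elements whose contents are dropped entirely (an assumed list).
    SKIPPED = {"script", "style", "applet", "object"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        # Tags such as <img> carry no text, so they simply contribute nothing.
        if tag in self.SKIPPED:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def text_for_cache(html):
    """Return a text-only version of an HTML document, one fragment per line."""
    parser = TextOnly()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.parts)
```

Storing only text keeps the document cache small, at the cost of losing layout and pictures in the temporary copy shown to the viewer.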

Figure 3. Supplying the cached document

4 Implementation

We have implemented a prototype link cache, as described above, using a Hyperwave server running on a UNIX system. Two separate programs run periodically: link_setup and link_maintenance. The link_setup program creates new remote link objects from existing links, and link_maintenance maintains the remote link and remote document caches.

When a user on the Hyperwave system decides that they want the external links in a document to be maintained, they add a custom attribute "CacheURLs" to the document. This flags the document to have all external links de-referenced by the link_setup program, which runs as a cron job once a day. The user also has the ability to run link_setup immediately if they wish, via an HTML form and CGI.

Once it executes, link_setup processes each document on the server with a new CacheURLs attribute. It removes all URLs linking off the local site (with the exception of those accessing CGIs) and creates remote link objects in the Remote Link Cache (RLC). The HTML of the source document is altered to use the links in the RLC (see Figure 3). As each remote link is created, a text-only version of the remote document is downloaded and stored in the Remote Document Cache (RDC).

The CacheURLs attribute is altered to indicate a successful load and an attribute is added to the remote link object, indicating the last modification time (and date) of the remote document. Finally, the user is emailed the results of the load, stating which links were successfully loaded and reporting any problems.
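The link rewriting step performed by link_setup might look roughly like the sketch below. It is not the prototype's code: the function names, the "/rlc/" path prefix, the numbering of remote link objects, and the rule used to recognise CGI URLs (a cgi-bin path or a query string) are all assumptions made for illustration, and the simple regular expression only handles double-quoted href attributes.

```python
import re
from urllib.parse import urlparse

# Matches double-quoted href attributes only (a simplification).
HREF = re.compile(r'href="([^"]+)"', re.IGNORECASE)


def is_external(url, local_host):
    """True if the URL names a host other than the local server."""
    host = urlparse(url).netloc
    return bool(host) and host.lower() != local_host.lower()


def looks_like_cgi(url):
    """Assumed heuristic: CGI URLs have a cgi-bin path or a query string."""
    parsed = urlparse(url)
    return "/cgi-bin/" in parsed.path or bool(parsed.query)


def dereference_links(html, local_host, rlc_prefix="/rlc/"):
    """Rewrite external links to point at local remote link objects.

    Returns (rewritten_html, mapping), where mapping takes each new
    local path to the original remote URL it stands for.
    """
    mapping = {}  # local RLC path -> original remote URL
    by_url = {}   # original remote URL -> local RLC path

    def replace(match):
        url = match.group(1)
        if not is_external(url, local_host) or looks_like_cgi(url):
            return match.group(0)  # leave local and CGI links untouched
        if url not in by_url:
            by_url[url] = f"{rlc_prefix}{len(by_url) + 1}"
            mapping[by_url[url]] = url
        return f'href="{by_url[url]}"'

    return HREF.sub(replace, html), mapping
```

Reusing one remote link object per distinct URL means that a later correction to that object repairs every place in the document where the URL was used.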

The link_maintenance program also runs as a cron job, but on a more frequent basis. Under normal circumstances (i.e. when there are currently no broken links) it will attempt to verify all links in the RLC every six hours. If there are links which have previously been found broken and are using temporary documents from the RDC, then link_maintenance will check these links once an hour until they are either corrected or deemed deleted.
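The two checking periods can be expressed as a small per-link rule. This is a sketch only: the state names "ok" and "broken" and the function is_due are hypothetical, and the prototype may organise its schedule differently (for example through separate cron entries).

```python
from datetime import datetime, timedelta

# Checking periods taken from the text: six hours normally, one hour when broken.
CHECK_INTERVAL = {
    "ok": timedelta(hours=6),
    "broken": timedelta(hours=1),
}


def is_due(state, last_checked, now):
    """True if a link in the given state should be checked again at time now."""
    return now - last_checked >= CHECK_INTERVAL[state]
```

With such a rule, the maintenance job can be started hourly and simply skip working links that were verified less than six hours ago.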

Upon execution, link_maintenance checks each remote link in the RLC. First it attempts to get the last modification time of the remote document referenced and compares this to the modification time attribute of the remote link object. If these time stamps are identical, no further action is required for this link; otherwise the remote document must have been changed, and the owner of the remote link is informed via email.

In the case of a deleted remote document, or a non-responding site, the remote link object will be re-directed to the cached version of the document. The link owner is then informed of this action. Links that are still not resolved after a week are deemed to be lost and the link owners are emailed once again, with the advice that they should remove the link permanently.
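The decisions made for a single link during a maintenance pass can be summarised in a sketch such as the following. The function check_link, its parameters, and the action names are hypothetical; only the comparison of modification times, the switch to the cached copy, and the one-week limit come from the description above.

```python
from datetime import datetime, timedelta

LOST_AFTER = timedelta(weeks=1)  # unresolved for a week: deemed lost


def check_link(stored_mtime, remote_mtime, broken_since, now):
    """Decide what to do with one remote link.

    stored_mtime  modification time recorded on the remote link object
    remote_mtime  modification time just retrieved, or None if the remote
                  document was deleted or its site did not respond
    broken_since  when the link was first found broken, or None
    Returns (action, new_broken_since).
    """
    if remote_mtime is None:
        if broken_since is None:
            # First failure: redirect to the cached copy and tell the owner.
            return ("serve_cache_and_notify", now)
        if now - broken_since >= LOST_AFTER:
            # Still unresolved after a week: advise permanent removal.
            return ("advise_removal", broken_since)
        return ("keep_serving_cache", broken_since)
    if remote_mtime != stored_mtime:
        # Reachable but altered: the owner must judge whether it is still relevant.
        return ("notify_changed", None)
    return ("no_action", None)
```

Keeping the time of the first failure, rather than a simple broken flag, is what allows the one-week limit to be applied without extra bookkeeping.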

5 Future Work

The system we have implemented based on the description above is certainly not complete, nor is it as user-friendly as it could be, but it performs its core functions admirably. Users are usually informed of changes in the status of their links within a day of a link breaking. Unfortunately, at this stage some knowledge of the underlying system is required to change the destination of the remote link objects.

The actual work involved in creating the raw functionality was quite simple compared to the complications encountered in making the system as convenient and transparent to the user as possible. Other than improving the user interface, most of the future work required will be in determining the optimum periods for checking links (both working and broken).

6 Conclusion

The system we have developed on top of Hyperwave lets teachers designate that particular documents should have all links to external Web sites de-referenced to point to the local server's link and document caches. This enables teachers to simply set one attribute on each of the documents they deem to be important and the system will de-reference all links to external Web sites contained within that document. We believe that this facility will significantly improve the reliability of electronic courseware and thus reduce levels of stress for teachers and students alike.

7 References

[Andrews et al., 1994] Andrews, K., Kappe, F. and Maurer, H., (1994). Hyper-G and Harmony: Towards the next generation of networked information technology. Technical report, IICM.

[Andrews et al., 1995] Andrews, K., Kappe, F., Maurer, H. and Schmaranz, K., (1995). On second generation hypermedia systems. In Proc. ED-MEDIA 95 - World Conference on Educational Multimedia and Hypermedia, pages 75-80. AACE.

[Flohr, 1995] Flohr, U., (1995). Hyper-G Organizes the Web. Byte, November, pages 59-64.

[Harley Halm & Rick Stout, 1994] Halm, H., and Stout, R., (1994). The Internet Yellow Pages. Osborne McGraw-Hill.

[Jones and Pitkow, 1995] Jones, R. and Pitkow, J., (1995). Atlas: Supporting the Web with a Distributed Hyperlink Database, http://www.cc.gatech.edu/grads/j/Kipp.Jones/atlas/boston/atlas.html

[Kappe et al., 1993] Kappe, F., Maurer, H., and Scherbakov, N., (1993). Hyper-G: A universal hypermedia system. Journal of Educational Multimedia and Hypermedia, 2(1):39-66.

[Lennon and Maurer, 1994] Lennon, J., and Maurer, H., (1994). Lecturing technology: A future with hypermedia. Educational Technology, 34:5-14.

[Kappe, 1995] Kappe, F., (1995). A scalable architecture for maintaining referential integrity in distributed information systems. Journal of Universal Computer Science, 1(2):84-104.

[Maurer et al., 1994] Maurer, H., Philpott, A., and Scherbakov, N., (1994). Hypermedia systems without links. Journal of Microcomputer Applications, 17:321-332.

[Maurer, 1996] Maurer, H., (1996). HyperWave: The Next Generation Web Solution. Addison-Wesley, UK.
