WebTracker - a Web Service for tracking documents

Ken Fishkin
Eric Bier

Xerox PARC
3333 Coyote Hill Road
Palo Alto, CA 94304
{fishkin, bier}@parc.xerox.com

Abstract

The growth of the World-Wide Web has led to an explosion in the types and number of documents available to knowledge workers via the Internet and intranets. Often, such workers would like to be notified of changes in Web documents, to share Web documents through a common repository, and to have such documents made available to them locally and automatically.

WebTracker is a Web service designed to address these needs. It allows users to express interest in a set of URLs, or in a set of patterns for URLs. When such URLs change or are created, the user is notified.

Introduction

The World-Wide Web ('Web') has revolutionized the availability of documents, increasing access to them both qualitatively and quantitatively. However, this increase in access has not been matched by a corresponding increase in the tools available for using these documents intelligently.

In particular, the Web provides no built-in support for monitoring or tracking documents, whereby a user is notified whenever a document changes in an interesting way and can easily see that change. A few attempts have been made to address this gap, as discussed in the Related Work section, but none were powerful enough for our needs. This paper presents our own attempt, a system we call WebTracker.

WebTracker allows users to monitor a set of Web pages (URLs). Its main features are described in the sections that follow.

Interaction

WebTracker is made available as a Web service. On visiting the WebTracker page, the user is presented with a list of the URLs that all users are currently tracking. This shared list allows users to share common URLs (for example, in an intranet setting) without redundancy. Filters can restrict the display, to avoid information overload as the set of URLs grows. The user can add new URLs to this list, or examine a particular URL by clicking on its title.

When the user clicks on a URL title, they are presented with a more detailed set of information about that URL: its size, the set of other users interested in that URL, when it was last downloaded, and so forth. By clicking on a button, the user subscribes to that URL - whenever the URL changes, the user will be sent e-mail notifying them of the change, and allowing them to easily see the difference.
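The per-URL information described above can be pictured as a small record. The following is only a sketch: the class name, the field names and the toggle methods are hypothetical and are not taken from the WebTracker implementation (which is written in C++); only the two button labels come from the interface described in this paper.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Optional, Set


@dataclass
class TrackedUrl:
    """One entry in the shared list of tracked URLs (hypothetical layout)."""
    url: str
    title: str
    size_bytes: int = 0
    last_downloaded: Optional[datetime] = None
    subscribers: Set[str] = field(default_factory=set)

    def toggle_label(self, user: str) -> str:
        """Text of the button shown to `user` on the detail page."""
        if user in self.subscribers:
            return "unsubscribe myself"
        return "subscribe myself"

    def toggle(self, user: str) -> None:
        """Subscribe `user` if they are not subscribed, otherwise unsubscribe them."""
        if user in self.subscribers:
            self.subscribers.remove(user)
        else:
            self.subscribers.add(user)


# Example: 'fishkin' subscribes to the WWW6 conference page.
entry = TrackedUrl(url="http://www6conf.slac.edu", title="WWW6 Details")
entry.toggle("fishkin")
```

Keeping the subscribers as a set makes the subscribe and unsubscribe actions idempotent per user, which suits a list that several people edit.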

For example, the figure below shows what a user sees when examining the URL that carries information about the WWW6 conference: http://www6conf.slac.edu. The name 'fishkin' is in bold, indicating that this is the name of the current user: if 'fishkin' were not presently subscribed to this URL, the 'unsubscribe myself' button would instead read 'subscribe myself'.



The user can interact with the system at three levels. At the top level, the user sees all or some of the tracked URLs. At the second level, the user sees the information associated with a particular URL, as shown in the previous figure. At the third level, the user sees 'sub-URLs' obtained by pattern search on the contents of a particular URL, as will be described later. The user interface uses 'progressive disclosure' or holophrasting to accomplish this - at each level, the user is shown the surrounding context of higher levels, while progressively more detailed information is shown according to their current level.

Seeing the Difference

When a URL changes, the user is very often interested less in the fact that the document changed than in the nature of the change. Accordingly, when a document changes the user is given a simple mechanism (pressing a button on the Web interface, or clicking a URL in the mail notification) to see what changed in the document.

Presently, this is done by comparing the two most recent versions of the URL's contents with the Unix diff utility and presenting the output to the user.
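As an illustration only, the comparison of two stored versions can be sketched with Python's standard difflib module. This is not the WebTracker code, which calls the Unix diff program, and the output here is in unified-diff format rather than the classic diff format. The function name is hypothetical, and the first line of the sample schedule text is made up.

```python
import difflib
from typing import List


def describe_change(old_text: str, new_text: str) -> List[str]:
    """Return unified-diff lines for two versions of a document.

    n=0 suppresses context lines, so only the changed lines are reported.
    An empty list means the two versions are identical.
    """
    return list(
        difflib.unified_diff(
            old_text.splitlines(),
            new_text.splitlines(),
            fromfile="previous",
            tofile="current",
            lineterm="",
            n=0,
        )
    )


# Two versions of a (made-up) conference room schedule.
previous = "3/4/97 Staff meeting\n3/11/97\n"
current = "3/4/97 Staff meeting\n3/11/97 Hadar Shemtov: Tentative\n"
changes = describe_change(previous, current)
print("\n".join(changes))
```

Only the line for 3/11/97 appears in the result, so the reader does not have to scan the whole file.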

For example, the figure below shows the information the user receives when checking a URL which contains scheduling information for a conference room: the 'diff' output shows the user, without having to parse the entire file, that a talk by Hadar Shemtov is now tentatively scheduled for 3/11/97.

The difference between the last two versions of URL
http://parcweb.parc.xerox.com/project/istl/whistle/schedule.text

14c14
<  3/11/97
---
>  3/11/97 Hadar Shemtov: Tentative

click here to return to the application.

URL patterns

Often, a user isn't interested in a URL per se, but rather in some of the URLs referenced by that URL. For example, consider a 'download' page which lists the most recent drivers, patches, etc. for users to download. When this page changes, the user is not interested in the change in the page itself, but rather in the fact that a new downloadable document has been posted.

Accordingly, WebTracker allows the user to specify a set of regular expressions (filters) to associate with a particular URL. When a new embedded URL is found in the host URL, if that URL fits the given filter, the embedded URL is automatically downloaded and made available to the user.
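A minimal sketch of this filtering step follows, using only the Python standard library. It is an assumption that filters behave like shell-style wildcards, based on the '*.exe' form used in the examples in this paper; if the filters are full regular expressions, the re module would replace fnmatch. The class name, the function name and the sample page are hypothetical.

```python
from fnmatch import fnmatch
from html.parser import HTMLParser
from typing import List
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects the absolute form of every <a href=...> in a page."""

    def __init__(self, base_url: str) -> None:
        super().__init__()
        self.base_url = base_url
        self.links: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the host URL.
                self.links.append(urljoin(self.base_url, value))


def matching_sub_urls(base_url: str, html: str, filters: List[str]) -> List[str]:
    """Return the embedded URLs that fit at least one filter."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    return [
        link
        for link in collector.links
        if any(fnmatch(link, pattern) for pattern in filters)
    ]


# A made-up fragment of a driver page with one matching link.
page = '<a href="3db/3dpb73p1.exe">driver</a> <a href="readme.html">notes</a>'
found = matching_sub_urls(
    "http://www.creaf.com/creative/drivers/", page, ["*.exe", "*.zip"]
)
```

Each URL in the result would then be downloaded and offered to the subscribed users.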

For example, in the figure below, WebTracker is tracking a URL used by Creative Labs, makers of PC sound cards, to present drivers and patches for their cards. There are two 'filters' that users have added to this: any reference within this URL to another URL which ends with '.exe', or one which ends in '.zip', will be automatically downloaded. At present, 9 such .exe files have been found and downloaded, and 0 .zip files.



Just as a URL is downloaded when it first appears via a filter, when such a URL is no longer found via the filter (i.e. it is no longer referenced by the parent URL), the sub-URL and its associated local storage are deleted. For example, one of the URLs we track is the download page for the latest Norton anti-virus files. When the anti-virus files for April, for example, are posted, the anti-virus file for March will no longer be referenced by the parent URL, and accordingly will be deleted locally.
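The download and delete behaviour amounts to comparing the set of sub-URLs found on the previous run with the set found now. A sketch, with a hypothetical function name and made-up file names:

```python
from typing import Iterable, Set, Tuple


def reconcile(
    previously_found: Iterable[str], currently_found: Iterable[str]
) -> Tuple[Set[str], Set[str]]:
    """Return (sub-URLs to download, sub-URLs whose local copies to delete)."""
    before, now = set(previously_found), set(currently_found)
    to_download = now - before   # newly referenced by the parent URL
    to_delete = before - now     # no longer referenced by the parent URL
    return to_download, to_delete


# The April file replaces the March file on the parent page.
to_download, to_delete = reconcile(
    ["http://example.com/av/march.zip"],
    ["http://example.com/av/april.zip"],
)
```

Sub-URLs present in both sets are left alone.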

Notification

When a URL changes, or is downloaded for the first time, users are notified via e-mail. Users only receive one piece of mail, no matter how many URLs change, in order to reduce e-mail overhead. Each notification includes a brief description of the notifying URL, and links to allow the user to quickly view the URL, the local downloaded version of the URL, and the changes in that URL. For example, in the figure below, we see a piece of e-mail in which the user is notified that the web page for the WWW6 conference has changed, and that a new executable file has been referenced by the Creative Labs driver page, and downloaded for the user:

Greetings from WebTracker ( http://girweb/cgi-bin/webtracker )

Some of the URLs which I am tracking for you have changed.

In particular:

1) URL http://www.creaf.com/creative/drivers/3db/3dpb73p1.exe
	, obtained via the '*.exe' filter
	of the 'drivers for Creative Labs (PC sound cards)'' nugget.
	has been downloaded for the first time.
	a local cached version has been stored in
		/project/webtracker/data/3dpb73p1.exe
	a URL for this is
		http://parcweb/project/webtracker/data/3dpb73p1.exe

	 WebTracker's page with information about this nugget is
		http://girweb/cgi-bin/webtracker?op_16_14_*.exe=x

2) URL http://www6conf.slac.stanford.edu/
	titled 'WWW6 Details'
	has changed.
	a local cached version has been stored in
		/project/webtracker/data/url_b1957.html
	a URL for this is
		http://parcweb/project/webtracker/data/url_b1957.html
	a URL to see what changed is:
		http://girweb/cgi-bin/webtracker?op_10_11_X=X

	 WebTracker's page with information about this nugget is
		http://girweb/cgi-bin/webtracker?focus=11#L11

Yours,
	WebTracker ( http://girweb/cgi-bin/webtracker )
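The one-message-per-user behaviour can be sketched as a grouping step: change events are collected per user and then numbered within a single message body. The function name and the event tuple layout are hypothetical; the wording of the message is borrowed from the sample above.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def build_digests(events: Iterable[Tuple[str, str, str]]) -> Dict[str, str]:
    """Group (user, url, what_happened) events into one message body per user."""
    per_user: Dict[str, List[str]] = defaultdict(list)
    for user, url, what_happened in events:
        per_user[user].append(f"URL {url}\n\t{what_happened}")

    digests: Dict[str, str] = {}
    for user, items in per_user.items():
        numbered = "\n\n".join(
            f"{number}) {item}" for number, item in enumerate(items, start=1)
        )
        digests[user] = (
            "Some of the URLs which I am tracking for you have changed.\n\n"
            "In particular:\n\n" + numbered
        )
    return digests


events = [
    ("fishkin", "http://www.creaf.com/creative/drivers/3db/3dpb73p1.exe",
     "has been downloaded for the first time."),
    ("fishkin", "http://www6conf.slac.stanford.edu/", "has changed."),
]
digests = build_digests(events)
```

Sending the message is left out here; each value in the result would become the body of one e-mail.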




Related Work

Both the WebMinder and Smart Bookmarks systems share the basic functionality of allowing users to indicate interest in a set of URLs, and receive notification when those URLs change. Smart Bookmarks also allows users to easily view and edit their list of URLs. However, neither supports collaboration, local downloading, quick viewing of changes, or pattern-matched downloads.

Finally, the Grassroots architecture from Stanford incorporates such URL tracking within its system for document management. However, Grassroots is an architecture, not an implementation of a service.

Current Status

The WebTracker system has been in use amongst the employees of the Information Sciences and Technology (ISTL) group at Xerox PARC, a group of roughly 40 people, for a bit over a month. Currently, WebTracker is tracking 56 URLs (25 specified directly, and 31 through pattern-matching), and is notifying 13 users.

The system is implemented in GNU G++, and runs on a Sun SPARCstation 10 under SunOS 4.1.3. URL monitoring is performed twice nightly - once at 1 AM, and again at 3 AM. This twice-nightly monitoring is done to reduce the odds that a particular server will be down when the monitoring is performed. There is no particular architectural limitation on the frequency of the monitoring.
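The benefit of two passes can be sketched as follows. This paper does not say how a change is detected, so comparing a stored hash of the contents is purely an assumption here, as are the function name, the fetch callback and the example.com URLs. A URL whose server is down in the first pass is simply picked up by the second, and a URL already seen in the first pass is not reported twice.

```python
import hashlib
from typing import Callable, Dict, Iterable, List, Optional


def monitoring_pass(
    urls: Iterable[str],
    fetch: Callable[[str], Optional[bytes]],
    digests: Dict[str, str],
) -> List[str]:
    """Return the URLs that are new or changed since they were last seen.

    `fetch` returns the page contents, or None if the server is unreachable.
    `digests` maps each URL to the hash of its last seen contents (assumed scheme).
    """
    changed = []
    for url in urls:
        body = fetch(url)
        if body is None:
            continue  # server down: leave this URL for a later pass
        digest = hashlib.sha1(body).hexdigest()
        if digests.get(url) != digest:
            digests[url] = digest
            changed.append(url)
    return changed


# Simulated night: one server is down at 1 AM and back up at 3 AM.
server_state = {"down": True}


def fake_fetch(url: str) -> Optional[bytes]:
    if url == "http://b.example.com/" and server_state["down"]:
        return None
    return b"contents of " + url.encode()


tracked = ["http://a.example.com/", "http://b.example.com/"]
seen: Dict[str, str] = {}
first_pass = monitoring_pass(tracked, fake_fetch, seen)
server_state["down"] = False
second_pass = monitoring_pass(tracked, fake_fetch, seen)
```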

To support collaborative use of the system, semaphores are used to ensure that multiple users can safely access and change the WebTracker database at the same time.
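The idea can be illustrated with a binary semaphore guarding a read-modify-write on shared data. This is a sketch under assumptions: it uses threads within one Python process, whereas a Web service serving several users at once would more likely need a semaphore shared between processes. The in-memory dictionary standing in for the database and the function name are hypothetical.

```python
import threading
from typing import Dict, Set

# Stand-in for the WebTracker database: URL -> set of subscribed users.
database: Dict[str, Set[str]] = {}

# A binary semaphore: at most one holder may touch the database at a time.
database_guard = threading.Semaphore(1)


def subscribe(url: str, user: str) -> None:
    """Add `user` to the subscribers of `url` as one guarded update."""
    with database_guard:
        subscribers = set(database.get(url, set()))  # read
        subscribers.add(user)                        # modify
        database[url] = subscribers                  # write back


# Eight users subscribe to the same URL concurrently.
workers = [
    threading.Thread(target=subscribe, args=("http://www6conf.slac.edu", f"user{i}"))
    for i in range(8)
]
for worker in workers:
    worker.start()
for worker in workers:
    worker.join()
```

Without the guard, two updates could read the same old set and one subscription could be lost when both write back.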

Future Work

In the future, we would like to extend the system to support personal, as well as shared, URL sets, so that users can monitor 'private' URLs which they don't wish to make a system-wide resource (e.g. school lunch menus). We would also like to explore better ways to show file changes than the Unix diff, a venerable but limited tool.


