Mining the World Wide Web: An Information Search Approach

Mike Thelwall (University of Wolverhampton, UK)

Journal of Documentation

ISSN: 0022-0418

Article publication date: 1 April 2002

Citation

Thelwall, M. (2002), "Mining the World Wide Web: An Information Search Approach", Journal of Documentation, Vol. 58 No. 2, pp. 232-234. https://doi.org/10.1108/jd.2002.58.2.232.4

Publisher: Emerald Group Publishing Limited

Copyright © 2002, MCB UP Limited


This is a well‐written publication that, as its cover claims, is suitable for developers of Web information systems and as support for advanced level courses in information retrieval (IR), databases and data mining. It is not, however, about Web mining, although it does give background information for this topic and a brief introduction to it. Although primarily of interest to computer scientists, its focus on data mining and Web IR makes it also of interest to information scientists. The most important computing concepts are introduced when met, but prior knowledge of database concepts and querying languages, as well as software engineering fundamentals, would be an advantage for readers in the first half of the book. The book is split into three parts. The first concerns IR on the Web, covering the formal computing aspects of the kind of queries that can now be executed through the advanced search facilities of many search engines. The second part deals with data mining, Web mining and Web crawler construction, and the third is a case study of a Web‐based IR system.

In Part 1, an informative introduction to each topic area is followed by a series of more detailed case studies. The first chapter gives a general overview of the main types of commercial search engine. The next two chapters describe two different approaches to performing more complex types of search on the Web. Both involve constructing search and retrieval tools. The first approach queries existing search engines and then combines and processes their results in response to user queries. The second creates a topic‐specific local database, automatically gathered from the Web, which can then be queried with complex requests. The latter differs from the advanced facilities of a search engine such as AltaVista in being topic‐specific (so that its database is smaller and could potentially be hosted on a single small computer) and in offering the ability to perform requests involving several pages at once. For example, a request could be sent for pages containing the word “documentation” in the title and linking to at least ten others. The idea is intriguing, but the focus is on the syntax and scope of the various query languages, and implementation details are not given. The final chapter in the first part covers multimedia search engines and explains how queries can be built from the text surrounding image, video and audio tags in HTML, but gives little insight into how matches can be made from the contents of the resources themselves.
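To make the example request concrete, the following is a minimal sketch in Python of how such a query might be evaluated against a small local collection. It does not reproduce any of the query languages covered in the book: the `Page` record, the `title_and_outlink_query` function and the example URLs are all hypothetical, and a real system would query a database populated by a crawler.

```python
from dataclasses import dataclass, field


@dataclass
class Page:
    """One record in a hypothetical topic-specific local database."""
    url: str
    title: str
    outlinks: set = field(default_factory=set)  # URLs this page links to


def title_and_outlink_query(pages, word, min_links):
    """Return URLs of pages whose title contains `word` (case-insensitive)
    and that link to at least `min_links` distinct other pages."""
    word = word.lower()
    hits = []
    for page in pages:
        others = page.outlinks - {page.url}  # ignore self-links
        if word in page.title.lower() and len(others) >= min_links:
            hits.append(page.url)
    return hits


# Invented sample data: only the first page satisfies both conditions.
pages = [
    Page("http://a.example/", "Journal of Documentation",
         {f"http://t{i}.example/" for i in range(12)}),
    Page("http://b.example/", "Documentation notes",
         {"http://a.example/"}),
    Page("http://c.example/", "Web crawlers",
         {f"http://t{i}.example/" for i in range(15)}),
]
result = title_and_outlink_query(pages, "documentation", 10)
```

The point of the sketch is that the second condition depends on link data stored alongside each page, which is the kind of request involving several pages that the review describes.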

The second part of the book begins with an overview of data mining. This differs from IR in that IR focuses on finding the documents relevant to the user, whereas data mining is concerned with extracting knowledge from whole collections of documents. A good overview of a range of techniques is given. To give a flavour, association mining is the extraction of inference rules, with the result potentially being the identification of a pattern useful to the data owner, such as that customers who buy one organic food tend to buy several more in the same visit. This could not be discovered from any one record in the database, only by analysing a large number. Text mining differs from data mining in that the object of study is an unstructured collection of documents rather than a “proper” structured database. The aim is still to extract knowledge, and several techniques are explained. One is online detection, which can be used to process news stories automatically and flag those that are sufficiently similar to previously processed clusters of stories. The seventh chapter finally arrives at Web mining and gives an overview of what are probably the two most popular areas: Web usage mining and Web structure mining. In Web usage mining, the logs of Web servers are analysed to identify patterns in the behaviour of site visitors. This can lead to the restructuring of Web sites to make them easier to use, or to additional information about visitor behaviour that would be useful for marketing purposes. Web structure mining involves discovering patterns in the hyperlink structure of sets of Web pages. In particular, connectivity analysis is described as a tool to aid the discovery of Web pages that appear to be authoritative for a topic (i.e. are linked to by many relevant pages) or a source of relevant links. This is the type of approach that has led to the success of Google’s PageRank algorithm. The final chapter in Part 2 describes some of the workings of a Web crawler, but does not really give any useful insight that would aid understanding of the other topics covered.
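The connectivity analysis mentioned above, in which some pages are authoritative and others are good sources of links, can be illustrated with a short sketch. This is a simplified version of the hubs‐and‐authorities iteration and is not taken from the book. The function name, the fixed iteration count and the tiny link graph are assumptions made purely for illustration.

```python
import math


def hubs_and_authorities(links, iterations=20):
    """Score pages as hubs and authorities.

    `links` maps each page to the set of pages it links to.
    A page's authority score is the sum of the hub scores of the pages
    linking to it; its hub score is the sum of the authority scores of
    the pages it links to. Scores are rescaled to unit length each round.
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority update: accumulate hub scores along incoming links.
        auth = {p: 0.0 for p in pages}
        for src, targets in links.items():
            for t in targets:
                auth[t] += hub[src]
        # Hub update: accumulate authority scores along outgoing links.
        hub = {p: sum(auth[t] for t in links.get(p, ())) for p in pages}
        # Normalise both score vectors so the values stay bounded.
        for scores in (auth, hub):
            norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
            for p in scores:
                scores[p] /= norm
    return hub, auth


# Invented graph: three link lists pointing at two content sites.
links = {
    "list1": {"siteA", "siteB"},
    "list2": {"siteA", "siteB"},
    "list3": {"siteA"},
    "siteA": set(),
    "siteB": set(),
}
hub, auth = hubs_and_authorities(links)
top_authority = max(auth, key=auth.get)  # the page most linked to by good hubs
```

In this toy graph the page linked to by all three lists receives the highest authority score, and the lists that point to both sites score higher as hubs than the list that points to only one.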

Part 3 contains a case study of a topic‐specific search engine created by the authors. Disappointingly, it did not include any data‐mining capability and its query interface did not appear to go significantly further than specifying the part of a Web page that search text should match.

This book offers a reasonable account of topics in Web IR. It also gives a good background to Web mining and would make a useful introduction to the topic, but it does not offer any depth. The level of detail is variable, with some sections giving brief topic overviews and others perhaps too much information. For example, the details of some of the many different query languages covered in the first part will probably only be of interest to computer scientists. It is worth buying a copy for the library to support any relevant courses, however, because of the overall clarity of its exposition.
