Web structures are difficult to predict because they change rapidly as documents on the Web are frequently updated. Nevertheless, once the structure is known, information providers can discover users' behavior patterns and characteristics and offer better services, and users can find useful information easily and accurately. This paper proposes an improved method for extracting Web structures.
The method consists of two steps. The first step constructs a directed graph whose nodes are Web documents and whose edges are their hyperlinks, using the depth-first search algorithm. The second step supplements the directed graph by discovering hidden hyperlinks, that is, hyperlinks that were not extracted in the first step. They can be found by analyzing Web access logs, which contain click streams. However, the click streams do not include clicks on the 'Back' button, because Web browsers serve such pages from their local cache and no request reaches the server. As a result, hidden hyperlinks may be identified incorrectly. To cope with this problem, this paper proposes an algorithm for searching for hidden hyperlinks. We have simulated the discovery of hidden hyperlinks to evaluate the proposed method experimentally.
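The two steps described above can be illustrated with a minimal sketch. It is not the algorithm proposed in this paper, only one plausible reading of it under stated assumptions: the site is a hypothetical in-memory dictionary instead of fetched pages, each click stream is a list of page names for one session, and a missing 'Back' click is assumed whenever an earlier page of the session links to the requested page. The names build_graph, find_hidden_links, and site are illustrative.

```python
from typing import Callable, Dict, Iterable, List, Set, Tuple

Graph = Dict[str, Set[str]]


def build_graph(root: str, get_links: Callable[[str], Iterable[str]]) -> Graph:
    """Step 1: depth-first traversal from `root`, recording extracted hyperlinks."""
    graph: Graph = {}
    stack = [root]
    while stack:
        page = stack.pop()
        if page in graph:
            continue  # already visited
        links = set(get_links(page))
        graph[page] = links
        stack.extend(link for link in links if link not in graph)
    return graph


def find_hidden_links(graph: Graph,
                      sessions: Iterable[List[str]]) -> Set[Tuple[str, str]]:
    """Step 2 (assumed heuristic): report transitions that no page on the
    reconstructed navigation path can explain, even allowing unlogged Back clicks."""
    hidden: Set[Tuple[str, str]] = set()
    for session in sessions:
        path: List[str] = []  # reconstructed navigation path of this session
        for page in session:
            if path and page == path[-1]:
                continue  # ignore reloads of the current page
            if path:
                # Look backwards for the most recent page that links to `page`.
                for i in range(len(path) - 1, -1, -1):
                    if page in graph.get(path[i], ()):
                        # Assume unlogged Back clicks returned the user to path[i].
                        del path[i + 1:]
                        break
                else:
                    # No known hyperlink explains the transition: hidden candidate.
                    hidden.add((path[-1], page))
            path.append(page)
    return hidden


# Hypothetical site: page "E" is reachable only through a link the crawl misses.
site = {"A": ["B", "C"], "B": ["D"], "C": [], "D": [], "E": []}
graph = build_graph("A", lambda page: site.get(page, []))

# Session 1 contains an unlogged Back (D -> B -> A) before the click on C.
sessions = [["A", "B", "D", "C"], ["A", "B", "E"]]
hidden = find_hidden_links(graph, sessions)
print(hidden)
```

In this sketch the transition from D to C is attributed to unlogged 'Back' clicks, because A links to C, whereas the transition from B to E has no such explanation and is reported as a hidden hyperlink candidate.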
Through the simulations, we observed that the proposed method discovers most of the hidden hyperlinks that appear in the click streams.
In future work, we plan to develop tools for visualizing the discovered Web structures and to improve the proposed algorithm so that hidden hyperlinks are discovered more accurately.