Automatic classification applied to the full-text Internet documents in a robot-generated subject index

Anders Ardö, DTV, Lyngby, Denmark

Traugott Koch, NetLab, Lund, Sweden

Manuscript of a forthcoming publication in: Proceedings of the Online Information 99 Conference, London.

Abstract: The practical goal of this work within the EU-project DESIRE is to integrate a manually selected, catalogued and quality assessed collection of WWW-resources with a much larger robot-generated subject index via cross-browsing and cross-searching. We aim at exploring different methods of automatic classification on a robot-generated subject index in order to improve resource discovery for Internet resources. To demonstrate our findings we will add this functionality to the service Engineering Electronic Library, Sweden (EELS).

We tested two approaches for harvesting a robot-generated Engineering index starting from a number of quality collections on the Internet:

  1. follow all links from the top-pages of these collections in three steps recursively.
  2. harvest a few collections completely and follow all links in two steps recursively.

We found a surprisingly low overlap between the Engineering link collections used as start sites for the robot and in addition a low overlap between the resulting databases from the two approaches. An intellectual evaluation of the contents of both databases showed the same percentage of relevant documents (77%). Both methods combined are therefore required if one wishes to achieve a more complete index.

An improved approach that we started to test is to use a thesaurus on the fly during the harvesting process. This is utilized to determine whether a page should be included and its links followed, thus allowing harvesting depth to be adapted to the distribution of relevant material.

To provide a subject based browsing interface for the robot index we carry out an automatic classification and generate an equivalent Engineering Information Inc. (Ei) structure to the one used in the quality service EELS.

Matching words in full-text documents to words in the Ei-thesaurus provides the classification, relying on the editorial mapping between the thesaurus and the Ei-classification system. We apply several different approaches, heuristics and weighting schemes to improve the classification process and the resulting browsing structure.

A comparison between the results of an intellectual classification for close to 1000 web pages and the automatic classification for the same pages shows identical or a more specific automatic classification in 57-66% of the cases.

We will also test more advanced methods of automatic classification: linguistic methods and advanced knowledge bases based on the Dewey Decimal Classification.

A demonstration page, including a classification service, is available at: http://www.lub.lu.se/desire/demonstration.html

Keywords: automatic classification; harvesting; robot-generated subject index; subject gateways; web resource discovery; metadata; Engineering; Dewey Decimal Classification; Ei thesaurus and classification

 

1. Introduction

Several years ago, resource discovery services on the Internet began to use manual classification to give their services a satisfactory browsing structure and to improve searching (as reported in the DESIRE classification report 1997 [1] and updated in a paper in German [2]).

The European Union project DESIRE [3], after gaining experience with manual classification in the three test bed subject gateways EELS [4], SOSIG [5] and DutchESS [6], started to explore different approaches to automatic classification of documents in an Internet subject index.

Our findings in the DESIRE project were that users want to be able to explore a large number of resources from the net as a complement to the relatively small number of quality assessed resources in a subject-based information gateway (SBIG), without drowning in the huge numbers of unrelated hits offered by global search services. This is why we used robot software to collect a large subject index.

One of the practical goals of this project is to explore ways of combining a manually selected, catalogued and quality assessed collection of WWW-resources (a Subject-based Information Gateway, SBIG) with a much larger robot-generated subject index. We use the subject gateway EELS (Engineering Electronic Library, Sweden [4]) and the robot-generated subject index "All" Engineering [7] as the services to improve and build upon.

A large subject index with the same browsing structure as the small quality gateway will offer the users, besides the already established cross-searching, a cross-browsing option from each subject section in the gateway to the section carrying a large number of corresponding records in the subject index (and vice versa).

EELS offers resources that are selected, described and classified by human experts, and it uses the Ei classification (Engineering Information Inc. [8]) as its browsing structure. The challenge then becomes to find ways to provide an equivalent browsing structure for the robot-generated subject index "All" Engineering. Since it contains more than 250000 resources, manual classification is not feasible. Consequently we decided to explore simple applications of automatic classification methods.

The tasks to be addressed in the second phase of the DESIRE project were:

In addition, a comprehensive state-of-the-art report on projects, methods, alternatives and problems with automatic classification will be presented.

After a series of evaluations, the most suitable solutions found will be added to the EELS service.

We are convinced that our findings and developments will be valuable for similar subject services and contribute to an improvement of resource discovery in the Internet in general.

2. Influence of different methods for harvesting a subject index

Before focusing on the automatic classification developments, the main issue of this article, we want to discuss which methods to use in creating a robot-generated subject index (cf. our detailed working paper [17]). The quality and composition of the resulting index's content obviously has an important influence on the usefulness of the classification structure created to browse among the resources. A high percentage of documents not related to the subject area of the index will disturb the classification process and pollute the service; when the harvesting process, on the other hand, misses many relevant resources, the browsing structure will display many unnecessarily empty areas.

2.1 General methods for robot collection of a subject index

It is obvious that a lot of different methods could be applied to construct a robot-generated subject index. Since only a small number of real services of this kind exist, a very limited number of methods have been applied so far.

Here is a list of possible procedures with references to some real examples.

  A. Index all resources listed in several co-operating sites applying some sort of quality selection (used in ARGOS, HIPPIAS, Digger, cf. [12])
  B. Index all pages at a manually selected list of organizational sites in a particular subject area (used in Europe Physics Broker, Energysearch, GERMLST Broker, ChristWeb, cf. [12] and most national academic web indices like GERHARD [13])
  C. Index all resources listed in a few quality services and include the resources pointed to in one or two steps of citation (used in DESIRE)
  D. Index all resources listed in a few quality services (as in C) and collect the resources in one step of citation. Identify the relevant pages by comparison with a comprehensive vocabulary in the chosen subject area (as a filter). Add the relevant pages to the database and follow all links from the relevant pages in the next step of harvesting. Repeat the same procedure ad infinitum in order to improve both coverage and precision of the resulting index. Here the number of steps of citation followed or the sites included is not decided upon beforehand as in all other methods. (experimented with in DESIRE)
  E. Harvest all resources from one or two quality controlled subject services and use an algorithm to detect other large link collections among them. Index all resources listed in the quality controlled subject services and in the discovered link lists (is a variant of C)
  F. Harvest the complete sites for all pages gathered in method C (or A)
  G. Select several good link lists/digital libraries in the subject area and harvest all pages linked to in three steps of the citation chain starting from the top page (used in "All" Engineering [7])

From model A to G one might expect the coverage to increase whereas the subject focus (relevance) decreases (with the exception of method D). At least control over what is included is handed over to a larger community.

These methods dictate which resources are harvested in order to build the initial index. Through both preprocessing (changes as far as the selection of seed pages is concerned) and postprocessing (removal of non-relevant resources) the final composition of the index can be influenced. The goal is to increase the percentage of resources relevant to the chosen subject area of the index.

2.2 Our approaches

Initially we tested two approaches using our own modular harvesting software Combine [14] for the task. In both we start from a number of quality collections within Engineering. In the first case we follow all links from a small number (30) of seed pages (the top-pages from manually selected Engineering quality collections) in three steps recursively while harvesting all pages. In the second case we use all pages from a few of these collections (7), more than 1500 pages as start pages. These are variants of the general methods G and C mentioned above.

The main difference between the two approaches is that in the second all pages from the selected link collections are used as start pages for the robot, while we only follow citation links two steps away from the seed pages. The hypothesis was that this approach would result in a more focused index.

In every step of the link chain the size of the database increases approximately tenfold. Both approaches result in a database containing close to 200000 web pages.

2.3 Result: Combined subject-index

We found a surprisingly low overlap between the Engineering link collections we used as start pages for the robot. Only 20% of the total number of WWW sites were found in two or more services. The vast majority of all sites are only found in one service. The implication of this finding is that we have to be very careful so as not to include too few link lists when feeding the harvesting robot with start pages. At least in Engineering, this process and subsequently all comprehensive information searching has to use basically all important link lists and digital libraries in the subject area.
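The overlap figure above can be computed in a few lines. The following is a minimal sketch, assuming each start collection has already been reduced to a list of site names; the function name and the host names are placeholders and not part of our harvesting software.

```python
from collections import Counter

def share_in_multiple_services(collections):
    """Return the fraction of distinct sites listed by two or more services."""
    # Count each site once per service, even if a service lists it repeatedly.
    counts = Counter(site for sites in collections.values() for site in set(sites))
    if not counts:
        return 0.0
    shared = sum(1 for n in counts.values() if n >= 2)
    return shared / len(counts)

# Hypothetical start collections; the host names are placeholders.
collections = {
    "service_a": ["a.example", "b.example", "c.example"],
    "service_b": ["c.example", "d.example"],
    "service_c": ["e.example"],
}
# One of the five distinct sites (c.example) appears in two services.
print(share_in_multiple_services(collections))
```

A low value signals that leaving out any single link list is likely to lose sites that no other list contributes.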

An intellectual evaluation of the contents of the two databases showed almost exactly the same percentage of relevant documents (77%), indicating that the main difference between those approaches was rather how many and which individual resources the two resulting databases covered. We carried out an overlap study into the records collected in both full databases in order to find out more about the relative coverage of the two approaches. To our surprise the overlap between the two approaches was very small despite our starting the harvesting process from basically the same sites.

The interconnectivity in today’s Engineering pages on the Web appears to be rather low in spite of an average number of links per page of about 10. We could very well find a different linking behaviour in other subject areas. Both methods combined are therefore required if one wishes to attain a complete index covering most Engineering material on the Web and this combination is used for the service at the moment. It is hard, however, to evaluate the degree of coverage of a subject index and more empirical studies are needed to further improve our approach.

2.4 Advanced gathering model

A more precise approach for collecting a subject specific database would be to use a simple matching algorithm with a thesaurus on the fly during the harvesting process to determine whether a page should be included and thus allowing the harvesting depth to be adapted to the distribution of relevant material (as in the general model D above).

Figure 1: Advanced gathering model

In narrow subject areas or in cases where not enough good starting collections for the robot are known, possible start pages could be found by sending a short list of terms or the descriptors from a small thesaurus as queries to a couple of search services. Pages from the hit lists would then be harvested and run through this advanced gathering model. The matching process with a subject specific vocabulary and its heuristics and weighting schemes (cf. below) would then guarantee a good resulting selection of documents for the subject service and avoid unnecessary inclusion of and following of links from unrelated pages.
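The gathering loop described above can be sketched as follows. This is an illustration under simplifying assumptions, not the Combine implementation: `fetch` stands in for the robot, the vocabulary is reduced to single words, and the `min_hits` threshold is a hypothetical relevance test in place of the weighting scheme discussed later.

```python
from collections import deque

def focused_harvest(seeds, fetch, vocabulary, min_hits=2):
    """Breadth-first harvest that only follows links from relevant pages.

    fetch(url) must return (text, links); it stands in for the robot.
    """
    index = {}                      # url -> text of pages judged relevant
    seen = set(seeds)
    queue = deque(seeds)
    while queue:
        url = queue.popleft()
        text, links = fetch(url)
        hits = len(set(text.lower().split()) & vocabulary)
        if hits < min_hits:
            continue                # not stored, and its links are not followed
        index[url] = text
        for link in links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return index

# Toy link graph used instead of real web pages (all names are made up).
toy_web = {
    "seed": ("welding of steel pipes", ["p1", "p2"]),
    "p1": ("steel corrosion and welding defects", ["p3"]),
    "p2": ("holiday photos from the beach", ["p4"]),
    "p3": ("corrosion of steel", []),
    "p4": ("steel welding handbook", []),
}
vocab = {"welding", "steel", "corrosion", "pipes"}
index = focused_harvest(["seed"], lambda url: toy_web[url], vocab)
print(sorted(index))
```

In the toy graph the page `p4` is relevant but is never reached, because it is linked only from the unrelated page `p2`. This is the price paid for not following links from pages that fail the filter.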

At the moment we are testing this method in the subject area of carnivorous plants (cf. the test site [15], [21]).

An additional issue we want to explore is whether it is necessary to explicitly include pages pointing/linking to resources which are already in the index. We would have to use methods like the ones described by Dean and Henzinger [16] to identify pages related to documents in our database by connectivity (parents, other children of the same parents, other parents of the same children).

The study would have to check how many of those connected pages already are in the database via the harvesting methods used in our project. The possible addition would limit itself to the first and last steps in our citation chain (approach G and C): to parents of our starting pages, other children from the same parents and to other parents to all children throughout our own citation steps.

3. Automatic classification of full-text documents from the Internet

3.1 Goal

In order to provide a subject based browsing interface for the robot-generated index some kind of automatic classification is needed. The goal is to structure the index using the same Ei classification that is used in the quality service EELS. This will allow cross-browsing between both services.

3.2 Method overview

To accomplish a rough subject classification we used the Ei thesaurus that contains more than 16000 terms, intellectually mapped to more than 700 classification categories.

For each record in the index database all metadata, headings and plain text were extracted. Then each of the vocabulary terms (and the captions from the classification categories) was matched against this text. If a match was found, the corresponding list of classification codes was associated with the record with a score that depended on several factors such as term complexity (single word, Boolean expression or phrase), match location (metadata, headings or plain text) and type of classification (main or optional).

In the end all scores were summed for each class (adding scores from classes directly above in the Ei hierarchy to the most specific classes below). For every document a list of classification suggestions in decreasing order of the scores was generated. In order to decide how many classifications to assign to every record we experimented with different heuristics for cut-off points. The outcome of this rough classification process is presented in a browsing structure for the robot-generated engineering index.

3.3 Ei thesaurus preprocessing

We are using the Ei thesaurus [8] 2nd ed. 1995, in electronic form. This thesaurus is useful since it contains editorial mappings between thesaurus terms and class codes from the Ei classifications.

The thesaurus contains 17458 terms. 8273 of those are preferred terms. Together with the Ei headings there are over 14000 class codes assigned.

The thesaurus has to be converted into our internal thesaurus format. Starting from the Ei-thesaurus all terms are extracted. For each of the terms a list of Ei classification codes (MC - main classification and/or OC - optional classification) is assigned, taken from the intellectual mapping in the Ei Thesaurus. We also added the Ei captions, as terms, with their respective Ei classification code assigned.

A number of text preprocessing steps were applied to the thesaurus vocabulary in order to prepare the terms for use in our automatic classification system. The preprocessing includes text operations like case conversion and removing special characters. We also remove stop-words and a few terms (e.g. one and two letter terms and geographical names) in order to reduce false hits. Inverted terms are converted to Boolean expressions. Optionally, stemming using Porter's stemming algorithm can be applied.

The preprocessing generates a number of triplets each containing a weight (see below section 3.4.2), a term and one or more class codes. These triplets constitute the thesaurus format used in our system. Each term can consist of one or more words (a phrase) or a Boolean expression.
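The conversion into triplets can be sketched as below. This is a minimal illustration, not our Perl code: the stop-word list is a placeholder, the weight is passed in rather than derived from the term and class code types, and the input form "Pipe, Supports" for an inverted term is an assumption. The `@and` notation follows the matched terms shown in example A.

```python
import re

STOP_WORDS = {"and", "of", "the", "in", "for"}   # placeholder stop list

def normalise(text):
    """Lower-case, strip special characters and drop stop words."""
    words = re.sub(r"[^a-z0-9 ]+", " ", text.lower()).split()
    return [w for w in words if w not in STOP_WORDS]

def to_triplet(term, class_codes, weight):
    """Turn one thesaurus entry into a (weight, term, class codes) triplet.

    Returns None for one and two letter terms, which are dropped.
    """
    if "," in term:
        # Assumed form of an inverted term: the parts may occur in any order,
        # so the term becomes a Boolean expression.
        parts = [" ".join(normalise(p)) for p in term.split(",")]
        cleaned = " @and ".join(p for p in parts if p)
    else:
        cleaned = " ".join(normalise(term))
    if len(cleaned) <= 2:
        return None
    return (weight, cleaned, tuple(class_codes))

# The weights 4 and 8 are illustrative values only.
print(to_triplet("Pipe, Supports", ["619.1"], 4))
print(to_triplet("Engineering Management", ["91"], 8))
print(to_triplet("pH", ["placeholder-code"], 3))
```

Geographical names and stemming are left out of the sketch; both would be further filters applied before the triplet is emitted.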

After the preprocessing the vocabulary contains 13586 terms with MC class codes and 7355 with OC class codes assigned. In total 854 different class codes are used with an average of 25 terms per class code.

There are more than 3000 single word terms and close to 18000 composite (phrases or Boolean expressions) terms. Most of the composite terms have 2 or 3 words but there are a few terms with 7 or more words, giving an average of 2.4 words per term.

Most preferred terms have 1 or 2 class codes but here, too, there are a few with 7 or more class codes. On average we have 1.7 class codes per preferred term.

The Ei vocabulary with this large share of composite terms provides an unusually rich and precise vocabulary with the potential to reduce the risks of false hits. Compared with the captions alone the linked thesaurus provides us with a rich additional vocabulary for every class.

3.4 Automatic classification process

Figure 2 shows an overview of the classification process, which is made up of several steps. First the document that is to be classified is fetched. From this document text is extracted, and all thesaurus terms are matched to it. Some heuristic processing rules are applied to the results from the matching process. Finally the outcome is formatted for either presentation or storing in a database. Each of these steps is described more fully below.

Figure 2: Automatic classification process

The system presently has routines for handling two different input formats, HTML and the record-syntax used in the Combine harvester [14].

Text from the document is extracted by parsing it into a maximum of three different mark-up groups for each document. The mark-up groups are used for weighting of term matches. They are metadata, important text (such as headings) and plain text.

During text extraction we apply text preprocessing similar to that applied to the thesaurus. Optionally, stemming using Porter's stemming algorithm can be applied.

3.4.1 Matching

For each of the text mark-up groups all terms from the thesaurus are tried. If a match is found, the corresponding list of classification codes is associated with the record with a score that is dependent on several factors such as term complexity (single word, Boolean expression or phrase), match location (metadata, important text or plain text) and type of classification (main or optional). The weighting scheme is described in more detail below (section 3.4.2). This term matching is done in a case sensitive way. A term can match several times on a long document and each match contributes to the scores.

The matching process is straightforward for terms consisting of single words and phrases where exact matching is used. For Boolean expressions we match each of the constituent terms against the entire text mark-up group counting how many times they match. The number of matches for the entire Boolean expression is calculated as the minimum number of matches for the individual terms.
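The counting rule for the three kinds of terms can be sketched as follows. This is an illustration only: it assumes the text has already been preprocessed and that matching works on whitespace-separated words, and the function names are ours for this example.

```python
def count_occurrences(words, phrase):
    """Count exact occurrences of a word or phrase in a list of words."""
    target = phrase.split()
    n = len(target)
    return sum(1 for i in range(len(words) - n + 1) if words[i:i + n] == target)

def count_matches(term, text):
    """Number of matches of a thesaurus term in one text mark-up group."""
    words = text.split()
    if " @and " in term:
        # Boolean expression: the rarest constituent limits the match count.
        return min(count_occurrences(words, part) for part in term.split(" @and "))
    return count_occurrences(words, term)

text = "pipe supports hold the pipe while other supports carry the pipe"
print(count_matches("pipe", text))                # single word
print(count_matches("pipe supports", text))       # phrase, exact sequence
print(count_matches("pipe @and supports", text))  # Boolean: min(3, 2)
```

Because the Boolean expression is evaluated against the whole mark-up group, its constituents count as a match even when they occur far apart in the text.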

False matches are a problem. For example, the term "tar" is associated with classes 411 (Bituminous Materials), 513 (Petroleum Refining), 524 (Solid Fuels), and 804 (Chemical Products), while it is also very much in use as a file extension for software archives (and as such is common in the type of documents we are classifying). The influence of false matches is diminished by giving matches from phrases and Boolean terms higher weights than matches from single word terms.

Example A: Matched terms and associated class codes in a document.

Mark-up group: plain text

913.5: maintenance; 693.1: pipe @and supports; 91: engineering management; 401.1: pipe @and supports; 511.2: equipment, equipment; 603: tools, tools; 801: solutions; 703.1: schematics; 901.2: training; 901: engineering, engineering, engineering, engineering, engineering; 902.1: drawing; 605: tools, tools; 803: solutions; 804: solutions; 921.2: integration, integration, integration; 912.2: management; 408.2: pipe @and supports; 913.1: production; 446.1: equipment, equipment; 535.2: drawing; 723.3: databases; 706.2: pipe @and supports; 913.2: production; 912.4: training; 816.1: drawing; 681: maintenance, maintenance; 619.1: pipe @and supports, pipe;

Mark-up group: important text

901: engineering;

3.4.2 Weighting

Weighting is applied to the resulting matches according to

  a. term complexity and type of classification
  b. match location
  c. matching frequency

Each pair of term-class code is assigned a weight depending on the type of term (Boolean, phrase or single word) and the type of class code (MC, OC). The weights are assigned with the motivation to accomplish a differentiation of term matches according to:

The text mark-up groups are weighted so that metadata matches contribute more than important text matches which contribute more than plain text matches.

A term can match several times on a long document and each match contributes to the score.

The values for the first two elements (a and b) in the weighting scheme are multiplied, and the resulting values from each match are then summed to constitute the final score.

Example B: Summed (in parentheses) and weighted results from example A.

901(26), 921.2(9), 91(8), 619.1(7), 693.1(4), 401.1(4), 511.2(4), 605(4), 603(4), 408.2(4), 706.2(4), 446.1(4), 681(4), 723.3(3), 902.1(3), 912.4(3), 913.5(3), 912.2(3), 913.2(2), 901.2(2), 804(2), 535.2(2), 803(2), 816.1(2), 703.1(2), 801(2), 913.1(2)
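The scoring rule described in this section can be sketched as follows. All weight values in the two tables are hypothetical and chosen only to respect the stated ordering (composite terms above single words, metadata above important text above plain text); they are not the values behind example B, so the sketch does not reproduce its numbers.

```python
# Hypothetical weights; the values used in the real system are not given here.
TERM_WEIGHT = {
    ("boolean", "MC"): 4, ("boolean", "OC"): 3,
    ("phrase", "MC"): 4, ("phrase", "OC"): 3,
    ("single", "MC"): 2, ("single", "OC"): 1,
}
LOCATION_WEIGHT = {"metadata": 4, "important": 2, "plain": 1}

def score_classes(matches):
    """Sum weighted matches per class code.

    matches: iterable of (class_code, term_type, code_type, location, count).
    """
    scores = {}
    for code, term_type, code_type, location, count in matches:
        # Elements a and b are multiplied; element c is the match count.
        weight = TERM_WEIGHT[(term_type, code_type)] * LOCATION_WEIGHT[location]
        scores[code] = scores.get(code, 0) + weight * count
    return scores

matches = [
    ("901", "single", "MC", "plain", 5),      # "engineering" five times in plain text
    ("901", "single", "MC", "important", 1),  # and once in important text
    ("921.2", "single", "MC", "plain", 3),    # "integration" three times
]
print(score_classes(matches))
```

With these assumed weights class 901 receives 2*1*5 + 2*2*1 = 14 and class 921.2 receives 2*1*3 = 6.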

3.4.3 Preparation for display

At the end we apply some heuristic post-processing in order to select the suggestions to use for the display in the service.

First we propagate scores down the classification tree to the leaf (the most specific) classifications. Assigning the most specific class code is common classification practice.

For every document a list of classification suggestions in decreasing order of the scores is generated. This list is truncated by throwing away all classifications with an absolute score lower than a cut-off value. This is done to reduce the number of suggested classifications and thus interdependency in the browsing system.

Example C: The resulting list after heuristics (cut-off=10) have been applied to example B.

901.2(28), 913.5(11), 912.4(11), 912.2(11)
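The propagation and cut-off steps can be sketched as follows, using a subset of the scores from example B. Two points are assumptions of this sketch: a class is treated as an ancestor of every longer code that starts with the same digits, and a score exactly equal to the cut-off is kept. The real system works on the full Ei tree and may treat boundary cases differently.

```python
def is_ancestor(parent, child):
    """Assume a class is an ancestor of any longer code sharing its digits."""
    p, c = parent.replace(".", ""), child.replace(".", "")
    return len(p) < len(c) and c.startswith(p)

def suggest(scores, cutoff):
    """Propagate scores to the most specific classes, then apply the cut-off."""
    leaves = [c for c in scores
              if not any(is_ancestor(c, other) for other in scores)]
    totals = {
        leaf: scores[leaf] + sum(s for c, s in scores.items() if is_ancestor(c, leaf))
        for leaf in leaves
    }
    kept = [(c, s) for c, s in totals.items() if s >= cutoff]
    # Highest score first; ties are ordered by class code.
    return sorted(kept, key=lambda item: (-item[1], item[0]))

# Subset of the weighted scores from example B.
scores = {"901": 26, "901.2": 2, "91": 8, "913.5": 3, "912.4": 3, "921.2": 9}
print(suggest(scores, cutoff=10))
```

For this subset the score of 901 is added to 901.2 (26 + 2 = 28), the score of 91 is added to 913.5 and 912.4 (8 + 3 = 11 each), and 921.2 is dropped because its score of 9 lies below the cut-off.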

Another possible way to limit the list would be to just accept the n highest ranking suggestions. A more refined way would be to take the tree structure of Ei into account and just keep classifications within one or two main branches, disregarding outliers. Whether this is feasible needs more detailed study.

Output formatting can be done in a number of ways. One way of displaying the outcome of this rough classification process is to present it in a browsing structure for the robot-generated engineering index. We have also implemented other types of formatting like RDF [19] encoding.

3.4.4 Software

The software is implemented in the programming language Perl as two modules, which provide an application programming interface (API) to the functions [20]. A program can directly use this API in order to build an automatic classification application.

3.5 Results of classifying a robot-generated subject index in engineering with the Ei vocabulary

Using the automatic classification procedure each record in the database produced by method 2 (harvest a few collections completely and follow links in two steps - see section 2.2 above) was processed with our system to assign classification codes to the records.

Of a total 155611 records 132120 are in English (85 %). (We are using a tri-gram method to decide upon the main language of a document). We only attempt to classify English documents since the Ei vocabulary only contains English terms. Out of these records 129890 contained at least one match against the Ei vocabulary. Knowing that 77 % of the total database is considered relevant for engineering the system should ideally be able to classify 119820 records, which means 101732 English records. Somewhat more than 2 % of the records got no match at all.

Using the heuristics mentioned above and setting the cut-off score to 10 we get the following results for two runs of automatic classification (with and without stemming). The outcome of this first simple approach can be found at the project's demonstration site [21].

Figure 3: Ei classification suggestions

               No of documents classified   No of class-codes assigned   Mean no of class-codes per document   No of documents not classified
no stemming    86468                        568440                       6.57                                  45652
stemming       107297                       1355191                      12.63                                 24823

As expected, stemming produces many more assignments of class codes. It also introduces more false matches.

Statistics on term matching (percentages are calculated relative to the number of unique terms in the Ei vocabulary):

                unique terms in total   Boolean expressions   phrases        single words
Ei vocabulary   17347                   6600                  8035           2712
no stemming     13004 (75 %)            4751 (71 %)           5722 (71 %)    2531 (93 %)
stemming        14096                   5429                  6253           2414

Close to 3/4 of all unique terms are actually matched in our documents.

However, further analysis of term matching statistics shows that there is a huge dominance for single word terms. Single word terms account for about 80 % of all matches. Out of the 100 most frequently matched terms only 8 are phrases or Boolean expressions.

False hits in phrases or Boolean expressions are rather unlikely, while single words are much more likely to suffer. Stemming gives many more matches for obvious reasons, but on the other hand it is also more likely to produce false matches. Stemming makes some of the terms equal, which is why there are fewer unique single word term matches with stemming than without. This is also why we have chosen not to calculate any percentage figures for the stemming case.

The techniques applied allow us to classify, in the best case, 80 % of the English documents in our database and to assign on average between 7 and 13 class codes to each document.

3.6 Evaluation

The first obvious test is to see if the automatic classification system can correctly assign class codes to the 700+ browsing pages of EELS [4] (which are organized according to the Ei structure and contain Ei headings) in spite of the complicated weighting scheme and our applied heuristics. The system passes this test nicely with 100 % correct classifications.

To evaluate the accuracy of the automatic classifications we compared our automatic classifications (both stemming and no stemming) with the intellectually assigned Ei classifications made for resources included in EELS. A sample of resources present both in EELS and in our database was extracted and the two classifications were compared, taking the Ei hierarchy into account. Results are shown in the table below:

Automatic        no of     Classification agreement
classification   samples   correct or finer   correct to the first
                                              3 digits   2 digits   1 digit
no stemming      923       57 %               64 %       77 %       90 %
stemming         999       60 %               66 %       81 %       93 %
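One way to read the agreement levels in the table is sketched below. This is our interpretation for illustration, not the evaluation script itself: it assumes one manual class code per sample, treats a longer automatic code starting with the same digits as "finer", and counts a level as reached if any automatic suggestion reaches it.

```python
def agreement_levels(manual, automatic):
    """Compare one manual class code with the automatic suggestions.

    Returns flags for 'correct or finer' and for agreement on the
    first 3, 2 and 1 digits of the code.
    """
    m = manual.replace(".", "")
    autos = [a.replace(".", "") for a in automatic]
    result = {"correct_or_finer": any(a.startswith(m) for a in autos)}
    for n in (3, 2, 1):
        result[f"first_{n}_digits"] = any(a[:n] == m[:n] for a in autos)
    return result

# The automatic code 913.5 is finer than the manual code 913.
print(agreement_levels("913", ["913.5", "901.2"]))
# Here the codes agree only on the first two digits.
print(agreement_levels("912.2", ["913.5"]))
```

The percentages in each column would then be the share of samples for which the corresponding flag is true.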

It can be noted that stemming, although it produces twice as many classifications per document, is only better by a few percent. On the other hand it succeeds in assigning class codes to more documents.

The general level of agreement between the intellectual and the automatic classification seems to lie between 57 and 66 %. At first glance this might sound like a low level, but other studies show even lower "success rates". Larson [24] classified 283 MARC records based on the titles and subject headings with the Library of Congress classification. He tested sixty different methods. The single method with the best accuracy was able to select the correct classification for only about 46 % of the records. Other studies through the years have demonstrated rather poor, uneven and inconsistent classification efforts by human beings. So a very high degree of agreement with human classification does not necessarily mean correct classification (whatever that is, which can be an issue for another article).

Coming evaluations will give possible reasons for the remaining disagreement. Our first analysis indicated a problem of lost context: the context of a page is lost in the automatic classification but is seen by the human classifiers.

We intend to apply methods and heuristics to correct this difference.

3.8 Applications

To demonstrate our findings we will add a classification and browsing structure, with the most suitable solution found during the project, to the pilot service Engineering Electronic Library, Sweden [4].

In addition we provide a free online classification service [21] for engineering resources on the WWW where anyone can enter a URL for an engineering related WWW page and view the resulting classifications either as an HTML page or as an RDF structure. In that service it is possible to change the text mark-up group weights and the cut-off value.

4. Further work

During the remaining time of the DESIRE project, we will concentrate on trying more elaborate methods of automatic classification and on comparing and evaluating the different solutions in order to offer good recommendations to other Internet subject services.

In addition, only a few tests will be run to improve the methods of creation of a robot-generated subject index. We especially wish to find out if it is necessary to actively include citing references, in other words, web pages that point to documents already in the database.

As far as the advanced classification methods are concerned we will at least explore the following:

At the end of the project, during spring 2000, the existing EELS/"All" Engineering service [7] will be upgraded with the most suitable solution for automatic classification. A cross-browsing feature in both directions between the quality service and the robot-generated database will be introduced as well as the option to use classification as a search filter.

In a related workpackage within the DESIRE project, we are exploring several methods necessary to accomplish cross-browsing between different subject services: mapping with a Prolog inference engine between vocabulary systems coded in RDF and methods of conceptual relationship analysis for mapping and conversion to improve the mapping techniques.

5. Conclusion

To improve the discovery of Internet resources, automatic methods of gathering and knowledge organization are necessary. Not even a large co-operative effort can cope with the quantities and the amount of changes to the documents for a service or subject area of some size. Our DESIRE project developed an approach to integrate a manually selected, catalogued and quality assessed collection of WWW-resources with a much larger robot-generated subject index in the same subject area.

We showed that a combination of different harvesting methods and a careful and broad selection of starting sites are necessary to reach a satisfactory level of coverage in the subject index. An advanced gathering model, using a thesaurus, seems appropriate to improve the relevance of the collected resources to the subject area. Focused crawling [18] in such robot-generated subject indices has the potential to overcome several of the most disturbing weaknesses of general web search engines: incompleteness, infrequent updating and lack of topical coherence of the resources.
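The idea of letting a thesaurus decide during harvesting whether a page is included and its links followed can be sketched as below. This is a simplified illustration, not the gathering model itself: the toy vocabulary, the relevance measure (share of words found in the vocabulary), the threshold and the in-memory `pages` and `links` tables standing in for the Web are all assumptions.

```python
from collections import deque

# Toy vocabulary standing in for thesaurus terms; invented for illustration.
THESAURUS_TERMS = {"turbine", "alloy", "circuit"}


def relevance(text, terms=THESAURUS_TERMS):
    """Share of the words in `text` that occur in the vocabulary."""
    words = text.lower().split()
    if not words:
        return 0.0
    return sum(1 for word in words if word in terms) / len(words)


def focused_harvest(pages, links, start_urls, threshold=0.2):
    """Breadth-first harvest that only follows links from relevant pages."""
    accepted = []
    seen = set(start_urls)
    queue = deque(start_urls)
    while queue:
        url = queue.popleft()
        if relevance(pages.get(url, "")) < threshold:
            continue  # rejected: not indexed and its links are not followed
        accepted.append(url)
        for target in links.get(url, []):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return accepted


# Hypothetical miniature "web" for illustration.
pages = {
    "a": "turbine alloy research group",
    "b": "circuit design notes",
    "c": "holiday photos from the coast",
    "d": "alloy casting methods",
}
links = {"a": ["b", "c"], "b": ["a"], "c": ["d"]}
harvested = focused_harvest(pages, links, ["a"])
```

In this toy run page `d` is relevant but is never reached, because the only path to it passes through the rejected page `c`. The sketch thus also shows the trade-off of such a model: harvesting depth adapts to the distribution of relevant material, at the cost of missing relevant pages that are only linked from irrelevant ones.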

The DESIRE project demonstrated in earlier studies the general usefulness of library classification systems for the knowledge organization of Internet resource collections. The work reported here used relatively simple approaches to automatic classification to successfully classify a large collection of full-text documents, more than 250,000 records, with an established international subject classification system in the field of Engineering. With a relatively limited set of heuristics and quite a simple weighting algorithm, an apparently good classification could be accomplished. Part of the reason is the existence of a rather good vocabulary system with an integrated thesaurus and classification system. A direct comparison with the intellectual classification of identical web pages showed a surprisingly high level of agreement, around 60%.

More advanced automatic classification methods and much more thorough evaluations remain to be carried out.

For our practical goal, to allow cross-browsing between a quality controlled subject gateway and a robot-generated index using the same classification structure, the approach of using an established library classification system seems to be successful.

For other services and purposes, however, alternative solutions need to be explored and compared. Different clustering methods, whether content-based, usage-pattern-based or citation-based, might turn out to be superior. Artificial Neural Networks or other AI techniques for classification tasks, with a certain learning ability, could be promising and have been used in different contexts.

The established library classification systems themselves might need to be changed and adapted to become really suitable as browsing structures for Internet services. Tailor-made visualization and navigation techniques have to be developed and thorough user studies need to be conducted.

The work to improve the knowledge organization of large digital collections has just begun.

 

Anders Ardö
Technical Knowledge Center & Library of Denmark
P.O. Box 777
DK-2800 Lyngby, Denmark
Homepage: http://nwi.dtv.dk/anders/
E-Mail: and@dtv.dk

Traugott Koch
NetLab
Lund University Library
P.O. Box 3
S-221 00 Lund, Sweden
Homepage: http://www.lub.lu.se/koch.html
E-Mail: Traugott.Koch@lub.lu.se

Notes

[1] Koch, T. and Day, M. (1997)
The role of classification schemes in internet resource description and discovery.
EU Project DESIRE. Deliverable D3.2.3.
http://www.lub.lu.se/desire/radar/reports/D3.2.3/

[2] Koch, T. (1998)
Nutzung von Klassifikationssystemen zur verbesserten Beschreibung, Organisation und Suche von Internet Ressourcen.
Buch und Bibliothek 50:5, pp.326-335.
Manuscript with hyperlinks at: http://www.lub.lu.se/tk/publ/bubmanus.html

[3] DESIRE: http://www.desire.org
DESIRE at NetLab: http://www.lub.lu.se/desire/

[4] EELS: http://eels.lub.lu.se/

[5] SOSIG: http://www.sosig.ac.uk/

[6] DutchESS: http://www.konbib.nl/dutchess/

[7] "All" Engineering: http://eels.lub.lu.se/ae/

[8] Engineering Information Inc.: http://www.ei.org/

[9] Z39.50: http://lcweb.loc.gov/z3950/agency/

[10] Zebra: http://www.indexdata.dk/zebra/

[11] Dublin Core: http://purl.org/DC/

[12] "Browsing and searching Internet resources" page:
http://www.lub.lu.se/nav_menu.html#rosubj

[13] GERHARD: German Harvest Automated Retrieval and Directory:
http://www.gerhard.de/

[14] Combine: http://www.lub.lu.se/combine/

[15] Advanced gathering model, testsite: http://dtv25.dtv.dk/CP/cp.html

[16] Dean, J. and Henzinger, M.R. (1999)
Finding related pages in the World Wide Web.
Proc. of the 8. Internat. WWW Conference, Toronto. pp. 389-401

[17] Koch, T., Ardö, A. and Noodén, L. (1999)
The construction of a robot-generated subject index.
EU Project DESIRE II D3.6a, Working Paper 1.
http://www.lub.lu.se/desire/DESIRE36a-WP1.html

[18] Chakrabarti, S., van den Berg, M. and Dom, B. (1999)
Focused crawling: a new approach to topic-specific Web resource discovery.
Proc. of the 8. Internat. WWW Conference, Toronto. pp. 545-562
http://www8.org/w8-papers/5a-search-query/crawling/index.html

[19] RDF: http://www.w3.org/RDF/

[20] Koch, T. and Ardö, A. (1999)
Automatic classification of robot-generated subject indexes.
EU Project DESIRE II D3.6a, Working Paper 2.
http://www.lub.lu.se/~traugott/DESIRE36a-WP2.html

[21] DESIRE automatic classification demonstration page:
http://www.lub.lu.se/desire/demonstration.html

[22] Scorpion: http://purl.oclc.org/scorpion

[23] Koch, T. and Vizine-Goetz, D. (1999)
Automatic classification and content navigation support for web services.
DESIRE II co-operates with OCLC.
Annual Review of OCLC Research 1998.
http://www.oclc.org/oclc/research/publications/review98/koch_vizine-goetz/automatic.htm.

[24] Larson, R.R. (1992)
Experiments in automatic Library of Congress classification.
JASIS 43(2), pp. 130-148