Taxonomy Research

Russell Blakeborough

School of Everything

May 2009

Summary

I've been working on a system to scan our taxonomy data set and automatically find related subjects. The aim of this project is to make our taxonomy work. At the moment we have a very fragmented free-tag data set, which makes it very hard for users to find what they are looking for on the site.

Across the site, we currently only make matches for people if they have typed in exactly the same subject.

This article describes a research project that has successfully found relationships between our taxonomy terms using normalized co-occurrence (NCO) analysis.

The project had 3 phases: research, practical tests and results analysis.

Research

I looked at 3 papers on freetagging (or 'folksonomy') data analysis:

1. Collaborative Creation of Communal Hierarchical Taxonomies in Social Tagging Systems
Paul Heymann and Hector Garcia-Molina
Computer Science Department, Stanford University

2. The Structure of Collaborative Tagging Systems
Scott A. Golder and Bernardo A. Huberman
Information Dynamics Lab, HP Labs

3. Clustering Tags in Enterprise and Web Folksonomies
Edwin Simpson
HP Labs

The last paper, by Edwin Simpson, turned out to be the most useful. It lays out the benefits of NCO analysis on 2 different test data sets: one from an internal information system at HP, the other from Delicious.

Normalized co-occurrence analysis gives each term relationship a rating: the number of occurrences where both terms appear together (the intersection) divided by the number of occurrences where either term appears (the union).
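
As a minimal sketch of that calculation, assuming each term is represented by the set of items tagged with it (the function name nco and the sample item IDs below are illustrative, not taken from our code):

```python
def nco(items_a: set, items_b: set) -> float:
    """Normalized co-occurrence: size of intersection / size of union."""
    union = items_a | items_b
    if not union:
        # Neither term occurs anywhere, so there is no relationship to rate.
        return 0.0
    return len(items_a & items_b) / len(union)


# Hypothetical item IDs tagged with each term.
tai_chi = {1, 2, 3, 4}
martial_arts = {3, 4, 5, 6, 7, 8}
driving = {20, 21}

print(nco(tai_chi, martial_arts))  # 2 shared items out of 8 in total: 0.25
print(nco(tai_chi, driving))       # no shared items: 0.0
```

The rating is always between 0 (the terms never appear together) and 1 (the terms always appear together).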

Practical Tests

My first tests were manual SQL queries on my local development server. I wanted to see if our data set would give good results under NCO analysis on some simple test cases. For example, I wanted to see that 'Tai Chi' could be related to 'martial arts', and that 'Driving' is not strongly related to 'Tai Chi'. Initial tests gave good results:

Driving - Tai Chi .. NCO: 0
Tai Chi - martial arts .. NCO: 0.04
National Standards Cycling Test - cycling .. NCO: 0.23
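
A manual test of this kind might look roughly like the sketch below. It is not the query I actually ran: it assumes a hypothetical term_item table with one row per (term, item) tagging, uses an in-memory SQLite database with made-up rows, and relies on SQLite's INTERSECT and UNION, which other databases may not support.

```python
import sqlite3

# Hypothetical schema and made-up data: one row per tagging.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE term_item (term TEXT, item_id INTEGER);
    INSERT INTO term_item VALUES
        ('Tai Chi', 1), ('Tai Chi', 2), ('Tai Chi', 3),
        ('martial arts', 2), ('martial arts', 3), ('martial arts', 4),
        ('Driving', 9);
""")

# Count the items tagged with both terms and the items tagged with either.
PAIR_SQL = """
    SELECT
        (SELECT COUNT(*) FROM (
            SELECT item_id FROM term_item WHERE term = :a
            INTERSECT
            SELECT item_id FROM term_item WHERE term = :b)) AS n_both,
        (SELECT COUNT(*) FROM (
            SELECT item_id FROM term_item WHERE term = :a
            UNION
            SELECT item_id FROM term_item WHERE term = :b)) AS n_either
"""


def pair_nco(term_a: str, term_b: str) -> float:
    """NCO for one pair of terms, computed from the two SQL counts."""
    n_both, n_either = conn.execute(PAIR_SQL, {"a": term_a, "b": term_b}).fetchone()
    return n_both / n_either if n_either else 0.0


print(pair_nco("Tai Chi", "martial arts"))  # 2 shared of 4 items: 0.5
print(pair_nco("Driving", "Tai Chi"))       # nothing shared: 0.0
```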

These good results persuaded me to go on to write an automated analysis system. I decided to analyse only seed terms with 5 or more occurrences. NCO analysis was performed on every term that co-occurs with one of these seed terms. The results were written into a new term relationship table with the following columns:

- Term ID A
- Term ID B
- A intersection B
- A union B
- NCO
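
The shape of that analysis can be sketched as follows. This is an in-memory illustration under stated assumptions, not the production code: the input is assumed to be a list of (term ID, item ID) taggings, the function name build_relationships is hypothetical, and the real system worked through SQL queries instead.

```python
from collections import defaultdict
from itertools import chain

MIN_SEED_OCCURRENCES = 5  # seed threshold described in the text


def build_relationships(taggings, min_seed=MIN_SEED_OCCURRENCES):
    """Return rows of (term_a, term_b, intersection, union, nco)."""
    items_by_term = defaultdict(set)
    terms_by_item = defaultdict(set)
    for term_id, item_id in taggings:
        items_by_term[term_id].add(item_id)
        terms_by_item[item_id].add(term_id)

    rows = []
    for seed, seed_items in items_by_term.items():
        if len(seed_items) < min_seed:
            continue  # too few occurrences to act as a seed term
        # Every other term that shares at least one item with the seed.
        candidates = set(chain.from_iterable(terms_by_item[i] for i in seed_items))
        candidates.discard(seed)
        for other in sorted(candidates):
            other_items = items_by_term[other]
            intersection = len(seed_items & other_items)
            union = len(seed_items | other_items)
            rows.append((seed, other, intersection, union, intersection / union))
    return rows


# Made-up taggings: term 1 has 5 occurrences, term 2 has 3, term 3 has 1.
sample = [(1, 1), (1, 2), (1, 3), (1, 4), (1, 5),
          (2, 4), (2, 5), (2, 6),
          (3, 7)]
rows = build_relationships(sample)
print(rows)  # one row: seed 1 with term 2, intersection 2, union 6
```

Only term 1 qualifies as a seed here, so the sketch produces a single relationship row even though term 2 also co-occurs with term 1.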

My first tests were on just the first 8 seed terms. These looked good, so I scaled up to run the analysis on the full data set. Analysis on the full data set involves a large number of queries! I left it running overnight, and came back to find that it had timed out around one third of the way through the analysis. Even so, this gave an excellent first set of real results, with over 30 000 entries in the term similarity table.

Results analysis

Some further SQL gave the results we'd been looking for: a table of our best taxonomy relationships, sorted by NCO. I included the values for intersection and union in this table too, on a hunch that we might also need to look at these. Sure enough, even when a term relationship has a good NCO, there are too many false positives where the intersection is only 1, so I rejected these from our results. An intersection of 2 is borderline - there are some excellent matches at intersection 2 where the NCO is high, but also some false positives. A little bit of extra logic might eliminate these false positives for an intersection of 2 - I suspect that they will often be down to the same person tagging 2 of their resources with the same combination of subjects.
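
The intersection cutoff, plus one possible form of that extra logic, could be sketched like this. The same-person check has not been built or tested; the function name keep_relationship, the owner_of mapping and the user names are all hypothetical, and the sketch assumes we can look up which user owns each tagged item.

```python
def keep_relationship(shared_items, owner_of, min_intersection=2, min_owners=2):
    """Decide whether a term pair survives the cutoffs.

    shared_items: set of item IDs tagged with both terms
    owner_of: dict mapping item ID to the ID of the user who owns it
    """
    if len(shared_items) < min_intersection:
        return False  # intersection of 1: too many false positives
    # Require the co-occurrence to come from more than one person.
    owners = {owner_of[item] for item in shared_items}
    return len(owners) >= min_owners


# Made-up ownership data.
owner_of = {1: "alice", 2: "alice", 3: "bob"}

print(keep_relationship({1}, owner_of))     # False: intersection is only 1
print(keep_relationship({1, 2}, owner_of))  # False: both items belong to one user
print(keep_relationship({1, 3}, owner_of))  # True: two different users agree
```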

Here are some of the results of the NCO system:

The relationship with the highest NCO was Cosmology - Astrophysics, with an NCO of 0.0727.

The next 4 are:
burlesque - Tassle Making
anatomy - Physiology
chinese cluture - Chinese history
natural voice - acappella

The first 200 relationships are almost all excellent, and where there are false positives we can see that users might also be interested in the 'related' subject, so in a sense they would not be false positives from our point of view.

By taking some random samples moving to lower and lower NCOs I have tried to see how far down the NCO values we can find good relationships. At entry number 2000 in the list we get to an NCO of 0.0051, and the value of the relationships is still good:

For example:

+-----------------+---------------------------------------------+--------+---------+--------+
| Term A          | Term B                                      | NCO    | A and B | A or B |
+-----------------+---------------------------------------------+--------+---------+--------+
| painting        | The Bob Ross Oil Painting Technique         | 0.0051 |       3 |    589 |
| painting        | collage                                     | 0.0051 |       3 |    588 |
| painting        | fine art                                    | 0.0051 |       3 |    591 |
| mandolin        | Cello                                       | 0.0051 |       2 |    393 |
| mandolin        | Bass                                        | 0.0051 |       2 |    390 |
| history         | Sociology                                   | 0.0051 |       3 |    586 |
| music tutor     | Music                                       | 0.0051 |       6 |   1179 |
| crafts          | drawing                                     | 0.0051 |       3 |    590 |
+-----------------+---------------------------------------------+--------+---------+--------+

By entry 3000 we get to an NCO of 0.0034, and things are starting to tail off, although you can still see some pretty good relationships, and there are some entertaining ones:

+-----------------+---------------------------------------------+--------+---------+--------+
| Term A          | Term B                                      | NCO    | A and B | A or B |
+-----------------+---------------------------------------------+--------+---------+--------+
| philosophy      | architecture                                | 0.0034 |       2 |    596 |
| philosophy      | metaphysics                                 | 0.0034 |       2 |    583 |
| wine            | French                                      | 0.0034 |       4 |   1169 |
| Classical       | Songwriting                                 | 0.0034 |       3 |    892 |
+-----------------+---------------------------------------------+--------+---------+--------+

Conclusion

This stuff rocks! I feel that we could considerably increase the value of our offering to people by using this technology right across the site. The next step is for us to look at producing an extension to our taxonomy system to handle this new term relation data. This system should also allow us to do manual correction / relationship designation. We could also look at manual seeding of this algorithm - a kind of semi-manual process where we decide that we want to cluster on a certain topic, and set cutoffs for NCO and intersection.

2 other ideas for further research that have occurred to me during this process are:

- clustering (edge finding)
- seasonal taxonomy variations

