Data Harmony White Paper
There is a lot of confusion in the marketplace. Word is out that rule bases take a lot of up-front investment. Co-occurrence systems advertise a purely programmatic solution and appeal to many IT professionals.
Let’s look at the actual data. This case study compares an implementation of the Data Harmony MAIstro rules-based system with an implementation of a statistics-based system (such as Autonomy, Nstein, or Stratify).
First, a couple of assumptions:
A simple rule base (matching preferred and equivalent/synonym terms) is created automatically as terms are added to the controlled vocabulary (thesaurus, taxonomy, etc.). If there is an existing thesaurus or authority file, this is a 2-hour process. Rules for both equivalent (synonym) and preferred terms are created when a controlled vocabulary is imported.
Complex rules are generally needed for less than 10% of the terms in the vocabulary. Complex rules are created at a rate of 4 to 6 per hour. So for a 6000-term thesaurus, 600 complex rules at 6 per hour require 100 hours, or 2.5 man-weeks. Some people begin indexing with the software immediately to get some baseline statistics and then do the rule building. The simple rule base alone usually provides 60% accuracy. With the addition of complex rules, the accuracy increases to 85 to 92%.
There is no limit to the number of users, the number of terms used in the taxonomy created, or the number of taxonomies put on a server.
Data (e.g. an existing taxonomy) can be preloaded into the Data Harmony software before shipping. It may already be available in one of three formats (tab- or comma-delimited, XML, or left-tagged ASCII). If not, a short conversion script can put it into an appropriate format.
On average, Data Harmony customers are up and running one month after the contract is done.
The up-front time and dollar investment, based on the implementation workflow for the full Thesaurus Master and Machine Aided Indexer (MAIstro in combination), is:
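The rule-building estimate in the assumptions above is simple arithmetic and can be checked with a short sketch. The function name and the 40-hour work week are assumptions made for illustration (the 40-hour week is implied by 100 hours being equated with 2.5 man-weeks); the 10% share and the rate of 4 to 6 rules per hour come from the assumptions themselves.

```python
def complex_rule_effort(term_count, complex_percent=10, rules_per_hour=6,
                        hours_per_week=40):
    """Estimate the effort needed to write complex rules for a thesaurus.

    hours_per_week=40 is an assumption; the paper states only the
    resulting figure in man-weeks.
    """
    rules = term_count * complex_percent / 100   # terms needing a complex rule
    hours = rules / rules_per_hour               # editorial hours to write them
    weeks = hours / hours_per_week               # the same effort in man-weeks
    return rules, hours, weeks


if __name__ == "__main__":
    # Both ends of the stated 4 to 6 rules-per-hour range.
    for rate in (6, 4):
        rules, hours, weeks = complex_rule_effort(6000, rules_per_hour=rate)
        print(f"{rules:.0f} rules at {rate}/hour: "
              f"{hours:.0f} hours, {weeks:.2f} man-weeks")
```

At 6 rules per hour this reproduces the 2.5 man-weeks quoted for a 6000-term thesaurus; at the slower rate of 4 per hour the same work takes 150 hours.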
There is a proven fourfold productivity increase for editors. At a loaded editor expense of $45 per hour, the return on investment is 8 months with one editor or 1 month with 8 editors. Editors will index with more accuracy, be more consistent, and do deeper indexing (and enjoy the system). The time made available can be used for those other things you’ve been wanting to get to, like talking with customers, increasing coverage, or automatically filtering data.
The U.S. GAO (Lockheed) reports 92% accuracy, and CSA reports a fourfold increase in productivity.
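The two payback figures quoted above are consistent with payback time shrinking in proportion to the number of editors sharing the system. The following is a minimal sketch under that assumption, taking the 8-month single-editor figure as given. The function name is hypothetical, and the paper does not state the monthly saving per editor that underlies the 8-month figure.

```python
def payback_months(editors, single_editor_months=8.0):
    """Payback period in months when the saving scales with editor count.

    Assumes every editor contributes the same monthly saving, so the
    payback period is the single-editor figure divided by the head count.
    """
    if editors < 1:
        raise ValueError("editors must be at least 1")
    return single_editor_months / editors


if __name__ == "__main__":
    for n in (1, 8):
        print(f"{n} editor(s): {payback_months(n):.1f} months")
```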
If we take the same example, based on a 6000-term thesaurus, the thesaurus creation cost should be the same.
The cost of the software usually starts at about $75,000. (We will use this lower number, although it can be much higher.) Training and support are an additional expense of about $2000 per day. Usually one week of training is required ($10,000).
The up-front time and dollar investment, based on the implementation workflow for the statistical (Bayesian, DNA, etc.) systems, is:
Now you are ready to begin implementation.
Elapsed time: Assume all people are ready and standing by to move to the next step when needed.
A twofold productivity increase has been noted by the American Psychological Association. No accuracy above 72% has been reported at present.
The table below compares the return on investment for the rule-based system and the statistics-based system in terms of total cost and time to implementation. It is apparent that there are considerable savings in using the rule-based system over the statistics-based system, by a factor of almost seven, based on the assumptions outlined above.
| | Rules Based | Statistics Based |
|---|---|---|
| Total time frame: | 1 month | 33+ weeks |
| Total man hours: | 104-154 hours, +24 hours training | 6488 hours, +40 hours training |
| Total up-front cost: | $64,840 | $449,375 |
| ROI assuming 6 editors: | 1 month | 57.73 months |
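The "factor of almost seven" can be checked directly from the up-front cost figures in the table. The sketch below is only that check: the variable names are illustrative, the dollar and hour figures are copied from the table, and training hours are left out of the hours comparison.

```python
# Figures copied from the comparison table.
RULES_BASED_COST = 64_840          # dollars, total up-front cost
STATISTICS_BASED_COST = 449_375    # dollars, total up-front cost
RULES_BASED_HOURS = (104, 154)     # man-hours, low and high estimates
STATISTICS_BASED_HOURS = 6_488     # man-hours

# How many times more the statistics-based implementation costs up front.
cost_ratio = STATISTICS_BASED_COST / RULES_BASED_COST

# Range of the man-hour ratio, using the high and low rules-based estimates.
hours_ratio_low = STATISTICS_BASED_HOURS / RULES_BASED_HOURS[1]
hours_ratio_high = STATISTICS_BASED_HOURS / RULES_BASED_HOURS[0]

print(f"Up-front cost ratio: {cost_ratio:.2f}")
print(f"Man-hour ratio: {hours_ratio_low:.1f} to {hours_ratio_high:.1f}")
```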
The Data Harmony M.A.I.™ system is both efficient and cost-effective right out of the box. A simple rule base is generated automatically on the basis of your controlled vocabulary (thesaurus, taxonomy, authority file). Rules are generated for both preferred terms and specified synonym terms.
The accuracy of results from the simple rule base is enhanced by fine-tuning the rules to reflect editorial analysis, interpretation, and insight. For about 10% of the terms, complex rules are required to capture the meaning and conditions of use of the term. (This estimate varies with the wording of taxonomy terms and document writing style.)
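To make the idea of a simple rule base concrete, the sketch below builds one matching rule per preferred term and per synonym, then uses those rules to suggest indexing terms for a piece of text. It is an illustration only and not the M.A.I. rule format or matching logic, which this paper does not describe. The function names, the dictionary layout, and the toy vocabulary are all assumptions.

```python
import re


def build_simple_rules(vocabulary):
    """Map every preferred term and synonym (lowercased) to its preferred term.

    `vocabulary` is a dict of preferred term -> list of synonyms
    (a hypothetical layout chosen for this example).
    """
    rules = {}
    for preferred, synonyms in vocabulary.items():
        for text in [preferred, *synonyms]:
            rules[text.lower()] = preferred
    return rules


def suggest_terms(text, rules):
    """Return the sorted preferred terms whose rule text appears in `text`."""
    found = set()
    lowered = text.lower()
    for rule_text, preferred in rules.items():
        # Match whole words or phrases only, so "cars" does not fire on "scars".
        if re.search(rf"\b{re.escape(rule_text)}\b", lowered):
            found.add(preferred)
    return sorted(found)


if __name__ == "__main__":
    # Toy vocabulary invented for this example.
    vocabulary = {
        "Natural gas": ["methane gas"],
        "Automobiles": ["cars", "motor vehicles"],
    }
    rules = build_simple_rules(vocabulary)
    print(suggest_terms("New cars can run on methane gas.", rules))
```

A complex rule would add conditions on top of a plain match like this, which is where the editorial fine-tuning described above comes in.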
How quickly can M.A.I. be implemented?
The software is delivered by CD-ROM or FTP immediately upon payment. Your data can be preloaded in the software for immediate use. Data formatted as tab- or comma-delimited text, XML, or left-tagged ASCII is ready to go; format conversion requires a small amount of additional time. Our customers are typically up and running one month after the contract is done.
What about automatic taxonomy generation?
This is partially possible. However, using training sets and full or unstructured text to create a categorization system produces many misleading associations. For example, if Enron is searched in the news today, it will co-occur with fraud, embezzlement, etc. If the same search had been run four years ago, it would have co-occurred with energy, gas distribution, etc. Using rules ensures the proper usage and application of the language over the life of the project. We recommend augmentation of an existing vocabulary as a faster, more accurate, more reliable, and more consistent methodology for taxonomy creation. It is also less expensive.
Can I purchase or augment an existing thesaurus? Yes.
Data Harmony offers 40 Knowledge Domains, including ready-made thesauri with associated rule bases covering a variety of topics.
With our experience, we can construct a thesaurus for your specific needs. The time required varies with the topic; the estimates provided reflect an average 6000-term thesaurus.
Time investment: 4 months
Cost investment: $32,000 including software