Frequently Asked Questions - Answers!

Business Advantages

How can Data Harmony products help me find what I'm looking for and improve search results?
Who are your target users?
Is my business too small or simple for M.A.I. / TM?
What is the Return on Investment (ROI)?

What makes MAIstro different?

What's the difference between M.A.I./TM and MAIstro?
Why don't you use training sets?
Aren't rulebases difficult to manage?
Isn't it faster to get a semantic engine running than a rulebase?

Taxonomy/Thesaurus Basics

Controlled vocabulary / taxonomy / thesaurus -- What are they and how do they differ?
What advantages does a taxonomy offer?
What extra advantages does a thesaurus offer?
What can I do with a taxonomy?

Indexing Basics

What is indexing and why bother?
Why is an indexed search better than a full-text search?
What is automatic indexing?
What is machine aided indexing?
What is the difference between "automatic" and "assisted" indexing?
If I have a search engine (like Verity), why do I need MAIstro or M.A.I.?

Using MAIstro

How long does it take to build a rulebase?
How long does it take to maintain a rulebase?
What kind of people do I need to maintain a rulebase and taxonomy (thesaurus)?
How long will it take to index my legacy collection?
Who will build the rulebase?
Who will maintain the rulebase?
Do I have to come to you to add or change a term?

Organizing Data

How does Data Harmony work with a content management system (CMS)?
What makes a good content management system (CMS)?
Do I need a database?
What is a spider?
I have lots of data but I don't have a database - what do I do?
What is "well-formed data" in a database?
What is a DTD?

Integrating With My System

How does it connect to other systems?
What is an API?
How do I get my data to M.A.I.?
How does Data Harmony software relate to a portal?
Can I use Data Harmony remotely?
What are Internet protocols and why are they important?

Under the Hood

What are the system requirements?
What operating systems do Data Harmony products support?
In what computer language are Data Harmony products written?
What is Java?
What is XML?

Working With Search Systems

How does it work with a search system?
How does it work with search software?
If I have a Natural Language search software, why do I need MAIstro or M.A.I.?
What is an inverted index?


Business Advantages / Benefits / R.O.I.

How can Data Harmony products help me find what I'm looking for and improve search results?
Data Harmony promotes precise and consistent topic or subject labeling of documents, following rules of use designed for your specific needs. The result is pinpoint accuracy in document retrieval allowing more effective knowledge management and trend spotting. Data Harmony enables information specialists to do their jobs better.

Who are your target users?
Data Harmony products are designed to manage primarily text-based data. We encourage anyone with the need to classify, store and retrieve quantities of information objects to use the Data Harmony suite. Most of our customers are libraries and information centers, secondary publishers and keepers of portal and knowledge management systems.

Is my business too small or simple for M.A.I. / TM?
Do you give different users access to your information? Do you have more than 5000 items in a data collection or more than 14 fields of data to access through your system? If the answer to any of these is "yes," then Data Harmony products can improve the organization and retrieval of your data, to help you get the information you need.

What is the Return on Investment (ROI)?
There are two major gains in the use of the Data Harmony products: 1) Productivity and 2) Quality.

1) Our users usually see a substantial increase in productivity for indexing. We measure productivity increases in two ways: increased speed and increased quality. Productivity increase is usually about fourfold. Suppose you have four indexers and you pay each of them $20,000 per year. If you purchase a $60,000 system and get a fourfold increase in productivity, those indexers will process four times as many records in the same time period, or twice as many records with half as many people. Valued at salary cost, the extra output in the first case is worth $240,000 per year (the work of twelve additional indexers), so the system pays for itself in about three months of usage.

2) Another important measure is the quality of indexing. Our users experience a substantial increase in the quality and consistency of indexing. They benefit from a marked decrease in editorial drift, the tendency for editors to focus on different indexing terms at different times, or for the same editor to use different terms on different occasions. Increased speed and improved quality and consistency all contribute to the benefits gained through using Data Harmony software. Over a longer time period, e.g. five years, the ROI continues to grow.
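The payback arithmetic in point 1 can be checked with a short calculation. This is only a sketch under stated assumptions: the $20,000 figure is taken as the salary of each indexer, extra output is valued at what it would cost in salary to produce without the system, and the function name `payback_months` is made up for this illustration.

```python
def payback_months(system_cost, indexers, salary, productivity_factor):
    """Months until the value of the extra output equals the system cost.

    Assumption: extra output is valued at the salary cost of the
    additional indexers who would otherwise be needed to produce it.
    """
    baseline_value = indexers * salary  # yearly output, valued at salary cost
    extra_value = baseline_value * (productivity_factor - 1)
    return 12 * system_cost / extra_value


# Figures from the example above: four indexers at $20,000 each,
# a $60,000 system, and a fourfold productivity increase.
months = payback_months(60_000, 4, 20_000, 4)
print(months)  # 3.0
```

Changing the salary, team size, or productivity factor shows how sensitive the payback period is to those inputs.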


What makes MAIstro different?

What's the difference between M.A.I./TM and MAIstro?
MAIstro is the combination of Thesaurus Master (TM) and M.A.I. The two work best when functioning as an integrated unit, and we find that most customers want both. In MAIstro, the backend processing is unified so that the two components work together as a single program and installation.

Why don't you use training sets?
To build a training set requires manually finding a set of documents that are about a particular concept. Since this effort must be repeated for each concept, the time required can be significant. Each time a new concept is added, the training needs to be redone.

We find that a rulebase approach is more efficient, more flexible, and easier and less costly to maintain. A rulebase can be easily managed by an editor and does not require the more expensive services of a programmer. There is no limit to the number of controlled vocabulary terms or the size of the thesaurus it serves. Rules are easily modified or added. A rulebase does not require research to locate and verify a large corpus of documents that exemplify the concept represented by a single term.

Aren't rulebases difficult to manage?
Quite the contrary. Rules governing the use of indexing terms are accessible and transparent, not hidden in a virtual black box. The editor can review and fine tune the requirements for term use at any time to produce more accurate term suggestions. M.A.I. automatically maintains statistics that point out any discrepancies between the editor's use of terms and the M.A.I. suggested terms. These statistical results are presented in order of frequency of occurrence so that an editor's time is used most productively, targeting rules most in need of fine tuning. Maintaining the rulebase to continually improve indexing term suggestions takes approximately two hours weekly.
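To give a feel for the kind of statistics described above, here is a minimal sketch that counts how often the editor's terms and the suggested terms disagree, and lists the most frequent discrepancies first. The data layout and the function name `discrepancy_report` are assumptions for illustration only; they do not describe M.A.I.'s actual file formats or reports.

```python
from collections import Counter


def discrepancy_report(records):
    """Count, per term, how often editor and system disagreed.

    `records` is a list of (suggested_terms, editor_terms) pairs of sets.
    A term counts as a discrepancy when it was suggested but rejected,
    or added by the editor without having been suggested.
    """
    counts = Counter()
    for suggested, chosen in records:
        for term in suggested ^ chosen:  # symmetric difference of the two sets
            counts[term] += 1
    # Most frequent discrepancies first; ties are broken alphabetically.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical sample data: three indexed documents.
records = [
    ({"Lighting", "Astronomy"}, {"Lighting"}),
    ({"Lighting", "Astronomy"}, {"Lighting", "Gardening"}),
    ({"Astronomy"}, {"Astronomy"}),
]
report = discrepancy_report(records)
print(report)  # [('Astronomy', 2), ('Gardening', 1)]
```

In this made-up sample, the rule for "Astronomy" would be the first candidate for fine tuning.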

Isn't it faster to get a semantic engine running than a rulebase?
A semantic engine needs to be trained and retrained for each new term. M.A.I. is ready to operate as soon as it is installed with its basic rules governing each of the taxonomy terms for indexing. The system is functional immediately and its performance improves with feedback from every document editors index. If you are going to embed M.A.I. into another system, you will of course have to wait until the rest of that system is ready.


Taxonomy/Thesaurus Basics

Controlled vocabulary / taxonomy / thesaurus -- What are they and how do they differ?
A controlled vocabulary is a limited set of terms that are valid for indexing (keywording or topic tagging) a set of documents. The list is generally alphabetized, but no further internal organization of terms is implied.

A taxonomy is a controlled vocabulary presented in an outline view, also called a classified view or hierarchy. Terms are organized in categories reflecting general concepts (Top Terms), major groups (Broader Terms), and more specific concepts (Narrower Terms). The final terms at the end of a branch, often called nodes, can represent any specific instance of a Broader Term, including terms from an authority file of people, organizations, places, or things.

A thesaurus is a controlled vocabulary that is displayed as a taxonomy or other display format. The key difference is that a thesaurus is enhanced to specify not only the relative position of terms (Top Terms, Broader Terms, and Narrower Terms), but also to provide synonyms (nonpreferred terms or Use/Used for indicators) for valid terms in the thesaurus and associations between conceptually related terms. A thesaurus allows for scope notes and term history. More advanced thesauri may involve additional equivalencies such as alternate languages, numerical codes, and status designation for terms. National and international standards dictate the details of thesaurus construction and display.

The hierarchy view of a thesaurus parallels a classic taxonomy. The alphabetic view of a thesaurus presents the full term record for each term, with hierarchy relationships, conceptual associations, notes, etc. Alternative displays include an alphabetical listing of terms and nonpreferred terms, the permuted index (KWIC or KeyWord In Context), and other views.
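As an illustration of what a full term record holds, the sketch below models one record and prints it in a conventional alphabetic-view layout (SN for scope note, UF for Used For, BT/NT/RT for Broader, Narrower, and Related Terms). The class, the function, and the sample terms are hypothetical and are not taken from any Data Harmony product.

```python
from dataclasses import dataclass, field


@dataclass
class TermRecord:
    """One thesaurus term with its relationships (illustrative only)."""
    term: str
    broader: list = field(default_factory=list)    # BT
    narrower: list = field(default_factory=list)   # NT
    related: list = field(default_factory=list)    # RT
    used_for: list = field(default_factory=list)   # UF (nonpreferred terms)
    scope_note: str = ""                            # SN


def format_record(rec):
    """Render a term record as indented text, one relationship per line."""
    lines = [rec.term]
    if rec.scope_note:
        lines.append(f"  SN {rec.scope_note}")
    for label, values in (("UF", rec.used_for), ("BT", rec.broader),
                          ("NT", rec.narrower), ("RT", rec.related)):
        for value in values:
            lines.append(f"  {label} {value}")
    return "\n".join(lines)


# Made-up sample record.
rec = TermRecord(
    term="Light bulbs",
    broader=["Lighting equipment"],
    narrower=["Fluorescent tubes", "Incandescent lamps"],
    related=["Light fixtures"],
    used_for=["Lightbulbs"],
    scope_note="Use for replaceable electric light sources.",
)
print(format_record(rec))
```

A hierarchy (taxonomy) view would use only the `broader` and `narrower` fields of the same records.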

What advantages does a taxonomy offer?
A taxonomy's hierarchical organization makes it easy to locate the most accurate subject indexing term. The hierarchical view makes it easy to navigate across categories or from a major category along a branch to singular examples. This benefit is enjoyed by indexers, internal document editors, and end-users.

What extra advantages does a thesaurus offer?
In addition to the hierarchical display of the taxonomy view, a thesaurus may be viewed as individual term records with the added value of conceptual associations, notes on use, translations to other languages, etc. This information provides the user ideas for alternative terms that may be appropriate for indexing or finding documents on conceptually related topics.

What can I do with a taxonomy?
You can use a taxonomy in many ways:
1. Index the data using the taxonomy terms
2. Add the taxonomy terms as metadata to HTML records
3. Add the taxonomy as a browsable topic list to a portal or web site
4. Search the data in your database by taxonomy term
5. Set a spider to crawl the web for information you want
6. Hook your data files to a visual display system
7. Fill the metadata header in your web records


Indexing Basics

What is indexing and why bother?
Indexing is a way to increase retrieval precision and accuracy by consistent application of subject terms in their preferred forms. Good indexing leads to the optimal combination of precision retrieval and comprehensive recall. The result is retrieval of documents appropriate for the search and blocking documents that may share a common word but are conceptually irrelevant.

Why is an indexed search better than a full-text search?
The results of a search will always depend on what you are looking for. In a full-text search, if your query words do not match words in the document text, you will not get back all the relevant documents. If they do match words in the document, they may not be what the document is primarily about, so you get back a lot of irrelevant documents. An indexed search avoids both problems because each document carries consistent terms describing what it is about. If you are searching for a new and as yet undefined concept not reflected in the controlled vocabulary of indexing terms, then a full-text search is preferable.

What is automatic indexing?
Automatic indexing is the application of index terms from a thesaurus or controlled vocabulary without human intervention, using only the computer.

What is machine aided indexing?
Machine aided indexing allows the selection from a list of the automatically suggested index terms drawn from a thesaurus or controlled vocabulary for increased accuracy in storage and in retrieval. An editor can also apply additional thesaurus terms not suggested by the system.

What is the difference between "automatic" and "assisted" indexing?
Automatic indexing will work without human review or intervention and is great for filtering large amounts of data, such as a news feed. In assisted or machine aided indexing, a human reviews the suggested terms and tweaks them for the highest quality results. Data Harmony supports both modes of operation.
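The two modes can be sketched in a few lines. This is a deliberately simplified illustration: the "rules" here are plain phrase matches, which is an assumption made for the example and not a description of the M.A.I. rule language, and the function names are hypothetical.

```python
def suggest_terms(text, rules):
    """Return thesaurus terms whose trigger phrases occur in the text."""
    lowered = text.lower()
    return sorted({term for phrase, term in rules.items() if phrase in lowered})


def index_document(text, rules, review=None):
    """Automatic mode when `review` is None; assisted mode otherwise.

    In assisted mode, `review` stands in for the human editor: it receives
    the suggested terms and returns the terms to keep.
    """
    suggested = suggest_terms(text, rules)
    if review is None:
        return suggested
    return sorted(review(suggested))


# Hypothetical trigger phrases mapped to thesaurus terms.
rules = {
    "incandescent": "Light bulbs",
    "fluorescent tube": "Light bulbs",
    "daffodil": "Flower bulbs",
}
text = "Replacing incandescent lamps near the daffodil beds."

automatic = index_document(text, rules)
print(automatic)  # ['Flower bulbs', 'Light bulbs']

# The editor drops one suggestion and adds a term the system did not suggest.
assisted = index_document(
    text, rules,
    review=lambda terms: (set(terms) - {"Flower bulbs"}) | {"Energy efficiency"},
)
print(assisted)  # ['Energy efficiency', 'Light bulbs']
```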

If I have a search engine (like Verity), why do I need MAIstro or M.A.I.?
Either MAIstro or M.A.I. allows you to introduce precision with high relevant recall to the search in a Verity type system. M.A.I. (or MAIstro) and Verity in combination offer a very powerful solution, assuring access to documents that have been precision indexed with controlled vocabulary or thesaurus terms using M.A.I. or MAIstro, as well as Verity's free text search of documents with important words that have not yet been incorporated into the thesaurus.


Using MAIstro

How long does it take to build a rulebase?
An effective starter rulebase can be prepared in two hours from an existing thesaurus or controlled vocabulary. This is the simple rulebase which gives about 60% accuracy in hits. After the simple rulebase is built we recommend about one month to build a complex rulebase to take you to 85% accuracy in hits. Actual time required varies with the size of the taxonomy and the writing style of documents.

How long does it take to maintain a rulebase?
After an initial period of rulebase preparation, the time required for maintenance drops off steadily. Approximately two hours of weekly editorial work is generally sufficient to fine tune the rulebase and keep up good statistical results.

What kind of people do I need to maintain a rulebase and taxonomy (thesaurus)?
Editors who manage your document collection and indexing or categorization needs can easily learn to maintain a rulebase and taxonomy.

How long will it take to index my legacy collection?
This depends on the number of metadata fields to be filled, the complexity of the materials, and of course the size of the legacy collection. If just the subject field is required, we have been able to index 8 documents per hour on complicated documents and up to 60 per hour on more straightforward text. Many fully automatic indexing systems provide only 60% accuracy; before indexing a collection automatically, we recommend bringing the accuracy level to 85% or higher. Then an automatic run will give excellent access to a legacy collection. Ongoing indexing depends on the number of metadata fields you are filling.

Who will build the rulebase?
We can train your editors to construct your rulebase, or we can develop it for you, using your documents, and then pass it over to you for continued use and maintenance. Once trained, most of our customers prefer to build and maintain their rulebase independently.

Who will maintain the rulebase?
We will train your editorial staff to maintain your rulebase, or you can outsource the task to our experienced editorial staff. The training lasts one day and includes practice time.

Do I have to come to you to add or change a term?
No, for each term that you add, M.A.I. creates a simple starter rule for its use for indexing documents. You can modify the rule at any time to improve its function. If you change a term, existing rules are automatically changed to suggest the revised term. These processes are quick and simple, and performed by your staff.


Organizing Data

How does Data Harmony work with a content management system (CMS)?
Data Harmony is integrated with a CMS through documented Application Programming Interfaces (APIs). Depending on the client's needs, indexing terms may be presented interactively in real time, document by document, or by a batch approach, filtering and suggesting terms for a number of documents at a time.

We work with other software vendors to create custom packages such as MAI STAR, integrating M.A.I. with Cuadra's STAR information management software.

What makes a good content management system (CMS)?
A good content management system includes the following parts:
1. Input system for document creation and editing
2. Search system to find the documents in the system
3. The display of the documents to the user either by web site (portal) or print or a customized user interface
4. Administration modules to create custom reports and document sets
5. Nice to have features include hooking to a publishing system or portal interface

Do I need a database?
You don't necessarily need a database to use the M.A.I. and TM or MAIstro. You can access data through filtering of content feeds or by setting a spider. The rule of thumb is that you are better off with a database if you have more than 5000 items or 14 fields or metadata areas. If not, it may not be worth the expense of the database creation and maintenance.

What is a spider?
A spider is a program that automatically surfs the web. It is generally used to find documents on topics that you specify.

I have lots of data but I don't have a database - what do I do?
You don't need a database to use the M.A.I. and TM or MAIstro. Even without a formal database, the software can work on data feeds or streams, or can work in conjunction with a spider or web crawler to gather data for your needs. Then the software can classify your data for display and enable the data to be displayed in a web site or portal. Additionally, M.A.I. can be used for query expansion in a search or with a federated search engine.

If a formal database is indicated, the Data Harmony support team can help you build one.

What is "well-formed data" in a database?
Databases usually put data into a particular format. The term "well-formed data" comes from the XML standard published by the World Wide Web Consortium (W3C). It means that the data follows the syntax rules of XML exactly and will parse in any XML system.

Data must be well-formed and valid so that it will pass into the system of your choice, or into any other system that follows the standard, and load without problems into the new software or your client's software. "Well-formed" is distinct from "valid": valid data, in addition to being well-formed, conforms to a DTD or schema.
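A quick way to see what "well-formed" means in practice is to hand a record to an XML parser. The sketch below uses Python's standard `xml.etree.ElementTree` module; the function name `is_well_formed` and the sample records are made up for the example. Note that this parser only checks well-formedness. It does not validate against a DTD or schema, which requires a validating parser.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if the text parses as XML, False otherwise."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


good = "<record><title>Light bulbs</title></record>"
bad = "<record><title>Light bulbs</record>"  # <title> is never closed

print(is_well_formed(good))  # True
print(is_well_formed(bad))   # False
```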

What is a DTD?
A DTD is a Document Type Definition, a mechanism for describing the structure of documents. It was created to specify the fields in SGML (Standard Generalized Markup Language). SGML was created to allow platform independence in the publishing of documents. HTML and XML are descended from SGML. In XML, many of the constraints and features of SGML have been removed to make it simpler to use. The structure of an XML document can also be defined with a DTD.

An XML Schema is an alternative to a DTD. A schema is generally broader and may have specific rules for constraining the content of an XML document.

Your data will be valid if it is well-formed and follows the DTD.


Integrating With My System

How does it connect to other systems?
We connect to other systems through an Application Programming Interface (API). This allows different systems to communicate seamlessly.

What is an API?
An Application Programming Interface is a list of methods that enable one program to communicate with another program. Data Harmony has published APIs that allow other software to hook to TM, M.A.I., and MAIstro.

How do I get my data to M.A.I.?
You can get your data into M.A.I. in one of the following three ways:
1. Your data input system can be modified to use the Data Harmony API for sending data interactively through the M.A.I. rulebase. This produces indexing term suggestions interactively, on a document by document basis.
2. You can use the Data Harmony Batch program to run batch files, which produces suggested terms for a number of documents at a time. This approach requires a modest amount of preformatting in order to feed into Data Harmony Batch.
3. Custom batch programs can be written to handle any text format you require.

How does Data Harmony software relate to a portal?
Data Harmony software enables you to build and maintain a taxonomy of indexing terms that describe documents. A portal uses the taxonomy in two ways:
1. Add the taxonomy hierarchy as a browsable topic list to a portal or web site. HTML links from the topic terms provide access to the documents.
2. Add the taxonomy terms as metadata to HTML records, filling in Name and/or Keyword fields for access by search systems and spiders.
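The second use above can be sketched as follows: given the indexing terms assigned to a document, build the keywords metadata tag for its HTML header. The function name `keywords_meta` and the sample terms are assumptions for illustration; how a given portal or CMS actually writes its headers will differ.

```python
from html import escape


def keywords_meta(terms):
    """Build an HTML keywords meta tag from a list of indexing terms."""
    content = ", ".join(terms)
    # Escape &, <, > and quotes so the terms are safe inside an attribute.
    return f'<meta name="keywords" content="{escape(content, quote=True)}">'


print(keywords_meta(["Light bulbs", "Energy efficiency"]))
# <meta name="keywords" content="Light bulbs, Energy efficiency">
```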

Can I use Data Harmony remotely?
Data Harmony uses Client/Server architecture, allowing the Client (the editor's workstation) to access the Server wherever a network connection allows. The software does not need a browser in place to work. The data can be accessed over the Internet.

What are Internet protocols and why are they important?
We believe in using the Internet for ease of data movement and communication. If you are sending data over the Internet, you have to comply with Internet protocols.

Data Harmony uses the TCP/IP protocols. TCP/IP stands for Transmission Control Protocol and Internet Protocol. The first ensures that the data arrives at its destination complete and in the correct order. The second addresses the data and routes it across networks in discrete packets; on a typical Ethernet network, each packet carries up to 1500 bytes.

We use TCP/IP because it can run on any network: an organization's internal LAN, a WAN, or a worldwide network. This network flexibility means that Data Harmony products can be used without being impeded by network constraints, can be accessed by users with password access from anywhere, and are as scalable as the Internet in the size of the community they can serve.


Under the Hood

What are the system requirements?
Data Harmony is written in Java to be platform independent. Data Harmony requires only the ability to run Java and the presence of JRE (Java Runtime Environment) to run. We recommend JRE 1.5 or higher for best performance.

Partnered products like MAI STAR have specific requirements and customers are best directed to their sites for the current information.

What operating systems do Data Harmony products support?
Data Harmony products can run on any operating system that is Java compliant, including Windows 95, Windows 98, Windows NT and Windows Server, Windows XP and Vista, Unix, Linux and Mac. We recommend JRE 1.5 or better.

Special note on Mac: Mac OS X has Java pre-installed and so should run Data Harmony products with no problem. Mac OS lower versions use a Macintosh Runtime for Java (MRJ2.2.5), based on Sun Microsystems' Java 1.1.8 specification. This is not the recommended version and in some cases may not support Data Harmony.

In what computer language are Data Harmony products written?
Data Harmony products are written in the Java programming language, making them platform independent.

What is Java?
Java is a programming language that is platform independent, just as XML is platform independent. It allows you to run programs written in Java on one or many platforms without rewriting the software for each individual computer type. There are now two kinds of Java. One is open and supported by Sun Microsystems and others. The second is the Microsoft version. They have subtle differences in compilation.

What is XML?
XML stands for eXtensible Markup Language. It is a World Wide Web Consortium standard for the consistent, portable markup of documents and other objects. It allows data to be marked up into fields, called elements, with further descriptive information attached to them as attributes. It also allows transfer of that data without requiring specific software or a specific operating platform. It frees the user from relying on a particular software or hardware vendor and allows easy, reliable data movement at any time.
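To make elements and attributes concrete, the sketch below builds a small record, serializes it to XML text, and reads it back, using Python's standard `xml.etree.ElementTree` module. The element names, attribute names, and values are invented for the example and do not represent a Data Harmony record format.

```python
import xml.etree.ElementTree as ET

# Build a record: elements hold the data, attributes describe it further.
record = ET.Element("record", attrib={"id": "42", "lang": "en"})
ET.SubElement(record, "title").text = "Choosing a light bulb"
ET.SubElement(record, "subject", attrib={"source": "thesaurus"}).text = "Light bulbs"

# Serialize to text that any XML-aware system can read...
xml_text = ET.tostring(record, encoding="unicode")
print(xml_text)

# ...and parse it back, as a receiving system would.
parsed = ET.fromstring(xml_text)
print(parsed.get("id"), parsed.find("subject").text)
```

Because the serialized text is plain XML, the receiving side does not need the same software or operating system as the sender.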


Working With Search Systems

How does it work with a search system?
Although the front ends and options vary at the end of the process, all search systems build an inverted index (alphabetic list of all term entries) and then run the queries against the inverted index. This is true for all search systems including MuseGlobal, Intelliseek, Verity, PostgreSQL, Infoseek, FAST, Oracle, MySQL, Sequel, SAP, and so forth.

M.A.I. can work with a search system in two ways:
1. M.A.I. enables construction of an inverted index of terms, i.e. a list of documents associated with each indexing term. These indexing terms, arranged in the inverted index, provide consistent and accurate subject access to the data. This is the preferred method in well-formed databases with field formatting or metadata access.
2. M.A.I. can transition from a searcher's query word to the valid indexing term, and then access documents indexed with that term. This works well with natural language query systems such as Verity.

M.A.I. enables precision indexing for highly accurate retrieval of documents or information objects. A search system provides the supplemental ability to spot important words not yet incorporated in the indexing thesaurus.

How does it work with search software?
To capture the maximum recall with precision, M.A.I. allows you to search for concepts you want with your precise needs in mind and retrieve only those items relevant to your own use. You may also expand your search by using synonyms (non-preferred or Used For terms) from the thesaurus and searching for all of them in Verity for maximum recall of related data.
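Query expansion with thesaurus synonyms can be sketched like this. The Used For table, the function names, and the sample terms are hypothetical; the expanded list is simply what would be passed on to the search software.

```python
def build_lookup(used_for):
    """Map every preferred and nonpreferred term (lowercased) to its preferred term."""
    lookup = {}
    for preferred, synonyms in used_for.items():
        lookup[preferred.lower()] = preferred
        for synonym in synonyms:
            lookup[synonym.lower()] = preferred
    return lookup


def expand_query(query, used_for):
    """Return the preferred term plus its synonyms, or the query unchanged."""
    preferred = build_lookup(used_for).get(query.strip().lower())
    if preferred is None:
        return [query]  # not in the thesaurus: fall back to free-text search
    return [preferred] + list(used_for[preferred])


# Hypothetical thesaurus fragment: preferred term -> Used For (nonpreferred) terms.
used_for = {"Light bulbs": ["Lightbulbs", "Electric lamps"]}

print(expand_query("electric lamps", used_for))
# ['Light bulbs', 'Lightbulbs', 'Electric lamps']
print(expand_query("daffodils", used_for))
# ['daffodils']
```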

If I have a Natural Language search software, why do I need MAIstro or M.A.I.?
By coordinating with M.A.I., a search software is not limited to specific words, which may vary in meaning from document to document and may miss many alternate expressions of a concept. M.A.I. boosts Verity's search success by enabling it to also find in metadata the indexing terms that accurately represent concepts, using the terms that were attached to the documents by M.A.I.

With M.A.I., you introduce precision and highly relevant recall to the search in a system such as Verity or FAST. While these systems can locate a given query word, they do not supply consistency in indexing, nor do they fully expand to the synonyms. They do not provide the ability to jump to related terms. These options come with the introduction of a thesaurus and application of those thesaurus terms to the individual information objects.

M.A.I. and your search software in combination offer a very powerful solution, assuring access to documents that have been precision indexed with controlled vocabulary or thesaurus terms using M.A.I., as well as Verity's free text search of documents with important words that have not yet been incorporated into the thesaurus.

What is an inverted index?
An inverted index is an alphabetical list of all the words that occur in all documents in a document set. Each word in the list is hooked to every document that contains that word. When the user searches for "light bulb," the inverted index points to all documents containing either or both of those words. It will, therefore, bring up documents pertaining to "light years" as well as "daffodil bulbs." It will not retrieve documents on the basis of "incandescent" or "fluorescent tubes." M.A.I., however, can interpret and index those references as "light bulbs" and place that tag in the document's metadata, making the document retrievable despite the absence of those query words.

In a free text search, the inverted index enables the search system to find a specified query word anywhere in the document text. An inverted index of a particular data field, e.g., the author field or date field, will have a much narrower selection of elements. An inverted index of indexing terms will include only those words that occur in the valid indexing terms drawn from your thesaurus or controlled vocabulary.
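The "light bulb" example can be reproduced with a very small inverted index. This is a sketch only: it splits text on whitespace with no stemming or stop words, and the documents, tags, and function names are made up for the illustration.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index


def search_any(index, query):
    """Return ids of documents containing any of the query words."""
    hits = set()
    for word in query.lower().split():
        hits |= index.get(word, set())
    return sorted(hits)


docs = {
    "d1": "how to change a light bulb",
    "d2": "distances measured in light years",
    "d3": "storing a daffodil bulb over winter",
    "d4": "incandescent lamps and fluorescent tubes compared",
}
text_index = build_inverted_index(docs)
print(search_any(text_index, "light bulb"))  # ['d1', 'd2', 'd3']

# An inverted index of indexing terms: the tags are assumed to have been
# assigned during indexing, so d4 is found even without the query words.
tags = {"d1": ["Light bulbs"], "d4": ["Light bulbs"]}
term_index = defaultdict(set)
for doc_id, terms in tags.items():
    for term in terms:
        term_index[term].add(doc_id)
print(sorted(term_index["Light bulbs"]))  # ['d1', 'd4']
```

The free text search returns the two irrelevant documents and misses d4, while the search on the indexing term returns exactly the documents about light bulbs.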

If you have any other questions, please feel free to give us a call at 1-800-926-8328.
