Frequently Asked Questions - Answers!
Business Advantages
How can Data Harmony products help me find what I'm looking for and improve search results?
Who are your target users?
Is my business too small or simple for M.A.I. / TM?
What is the Return on Investment (ROI)?
What makes MAIstro different?
What's the difference between M.A.I./TM and MAIstro?
Why don't you use training sets?
Aren't rulebases difficult to manage?
Isn't it faster to get a semantic engine running than a rulebase?
Taxonomy/Thesaurus Basics
Controlled vocabulary / taxonomy / thesaurus -- What are they and how do they differ?
What advantages does a taxonomy offer?
What extra advantages does a thesaurus offer?
What can I do with a taxonomy?
Indexing Basics
What is indexing and why bother?
Why is an indexed search better than a full-text search?
What is automatic indexing?
What is machine aided indexing?
What is the difference between "automatic" and "assisted" indexing?
If I have a search engine (like Verity), why do I need MAIstro or M.A.I.?
Using MAIstro
How long does it take to build a rulebase?
How long does it take to maintain a rulebase?
What kind of people do I need to maintain a rulebase and taxonomy (thesaurus)?
How long will it take to index my legacy collection?
Who will build the rulebase?
Who will maintain the rulebase?
Do I have to come to you to add or change a term?
Organizing Data
How does Data Harmony work with a content management system (CMS)?
What makes a good content management system (CMS)?
Do I need a database?
What is a spider?
I have lots of data but I don't have a database - what do I do?
What is "well-formed data" in a database?
What is a DTD?
Integrating With My System
How does it connect to other systems?
What is an API?
How do I get my data to M.A.I.?
How does Data Harmony software relate to a portal?
Can I use Data Harmony remotely?
What are Internet protocols and why are they important?
Under the Hood
What are the system requirements?
What operating systems do Data Harmony products support?
In what computer language are Data Harmony products written?
What is Java?
What is XML?
Working With Search Systems
How does it work with a search system?
How does it work with search software?
If I have a Natural Language search software, why do I need MAIstro or M.A.I.?
What is an inverted index?
Business Advantages / Benefits / R.O.I.
How can Data
Harmony products help me find what I'm looking for and improve
search results?
Data Harmony promotes precise and
consistent topic or subject labeling of documents, following
rules of use designed for your specific needs. The result is
pinpoint accuracy in document retrieval allowing more effective
knowledge management and trend spotting. Data Harmony enables
information specialists to do their jobs better.
Who
are your target users?
Data Harmony products are
designed to manage primarily text-based data. We encourage
anyone with the need to classify, store and retrieve quantities
of information objects to use the Data Harmony suite. Most of
our customers are libraries and information centers, secondary
publishers and keepers of portal and knowledge management
systems.
Is my business too small or simple for M.A.I.
/ TM?
Do you give different users access to your
information? Do you have more than 5000 items in a data
collection or more than 14 fields of data to access through your
system? If the answer to any of these is "yes,"
then Data Harmony products can improve the organization and
retrieval of your data, to help you get the information you
need.
What is the
Return on Investment (ROI)?
There are two major gains in
the use of the Data Harmony products: 1) Productivity and 2)
Quality.
1) Our users usually see a substantial increase in productivity for indexing. We measure productivity increases in two ways: increased speed and increased quality. Productivity increase is usually about fourfold. Suppose you have four indexers and you pay each of them $20,000 per year. If you purchase a $60,000 system and get a fourfold increase in productivity, those indexers will process four times as many records in the same time period, or twice the records with half as many people. Either way, the extra output is worth about $240,000 per year at those salary rates, so the system will pay for itself in about three months of usage.
2) Another important measure is the quality of indexing. Our users experience a substantial increase in the quality and consistency of indexing. They benefit from a marked decrease in editorial drift, the tendency for editors to focus on different indexing terms at different times, or for the same editor to use different terms on different occasions. Increased speed and improved quality and consistency all contribute to the benefits gained through using Data Harmony software. Over a longer time period, e.g. five years, the ROI continues to grow.
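The payback arithmetic in point 1 can be checked with a short calculation. This is only a sketch using the figures quoted above, reading the $20,000 salary as per indexer. The function name, and the assumption that extra output is valued at the cost of hiring more indexers at the same salary, are illustrative and not part of any Data Harmony product.

```python
def payback_months(indexers, salary, system_cost, productivity_factor):
    """Months until the value of the extra indexing output equals the system cost.

    Assumption: extra output is valued at what it would cost to hire
    additional indexers at the same salary.
    """
    current_payroll = indexers * salary
    # Output after the productivity gain, expressed as equivalent payroll.
    equivalent_payroll = current_payroll * productivity_factor
    annual_gain = equivalent_payroll - current_payroll
    return 12 * system_cost / annual_gain


months = payback_months(indexers=4, salary=20_000, system_cost=60_000,
                        productivity_factor=4)
print(months)  # 3.0
```

With a smaller productivity gain the payback period lengthens: a twofold gain with the same figures gives nine months.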
What makes MAIstro different?
What's the
difference between M.A.I./TM and MAIstro?
MAIstro
is the combination of Thesaurus
Master and M.A.I.
TM and M.A.I. work best when functioning as an integrated
unit, and we find that most customers want both components. In
MAIstro, the backend processing is unified to make the two
component software pieces work together as a single program and
installation.
Why don't you use
training sets?
To build a training set requires manually
finding a set of documents that are about a particular concept.
Since this effort must be repeated for each concept, the time
required can be significant. Each time a new concept is added, the training needs to be redone.
We find that a
rulebase approach is more efficient, more flexible, and both
easier and less costly to maintain. A rulebase can be easily
managed by an editor, not requiring the more expensive services
of a programmer. There is no limit to the number of controlled
vocabulary terms or size of thesaurus it serves. Modification or
addition of rules is easily accomplished. A rulebase does not
require research to locate and prove a large corpus of documents
that exemplify the concept represented by a single term.
Aren't
rulebases difficult to manage?
Quite the contrary. Rules
governing the use of indexing terms are accessible and
transparent, not hidden in a virtual black box. The editor can
review and fine tune the requirements for term use at any time
to produce more accurate term suggestions. M.A.I. automatically
maintains statistics that point out any discrepancies between
the editor's use of terms and the M.A.I. suggested terms. These
statistical results are presented in order of frequency of
occurrence so that an editor's time is used most productively,
targeting rules most in need of fine tuning. Maintaining the
rulebase to continually improve indexing term suggestions takes
approximately two hours weekly.
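As a rough illustration of the kind of statistics described above, the sketch below counts how often each term was suggested but not used by the editor, or used by the editor without being suggested, and lists the terms by frequency. The data layout and function name are hypothetical; M.A.I.'s actual statistics and reports are not reproduced here.

```python
from collections import Counter


def term_discrepancies(records):
    """Count, per term, how often suggestion and editor choice disagree.

    `records` is a list of (suggested_terms, editor_terms) pairs, one per
    indexed document. This layout is an assumption for illustration.
    """
    counts = Counter()
    for suggested, chosen in records:
        # Symmetric difference: suggested but rejected, or added by the editor.
        for term in set(suggested) ^ set(chosen):
            counts[term] += 1
    # Most frequent discrepancies first; ties broken alphabetically.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


records = [
    ({"Lighting", "Energy"}, {"Lighting"}),
    ({"Energy"}, {"Lighting"}),
    ({"Lighting", "Energy"}, {"Lighting", "Energy"}),
]
report = term_discrepancies(records)
print(report)  # [('Energy', 2), ('Lighting', 1)]
```

Terms at the top of such a list are the ones whose rules most need an editor's attention.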
Isn't it faster
to get a semantic engine running than a rulebase?
A
semantic engine needs to be trained and retrained for each new
term. M.A.I. is ready to operate as soon as it is installed with
its basic rules governing each of the taxonomy terms for
indexing. The system is functional immediately and its
performance improves with feedback from every document that editors index.
If you are going to embed M.A.I. into another system, you will of
course have to wait until the rest of that system is ready.
Taxonomy/Thesaurus Basics
Controlled
vocabulary / taxonomy / thesaurus -- What are they and how do
they differ?
A controlled vocabulary is a limited set of
terms that are valid for indexing (keywording or topic tagging)
a set of documents. The list is generally alphabetized, but no
further internal organization of terms is implied.
A taxonomy is a controlled vocabulary presented in an outline view, also called a classified view or hierarchy. Terms are organized in categories reflecting general concepts (Top Terms), major groups (Broader Terms), and more specific concepts (Narrower Terms). The final terms at the end of a branch, often called nodes, can represent any specific instance of a Broader Term, including terms from an authority file of people, organizations, places, or things.
A thesaurus is a controlled vocabulary that is displayed as a taxonomy or other display format. The key difference is that a thesaurus is enhanced to specify not only the relative position of terms (Top Terms, Broader Terms, and Narrower Terms), but also to provide synonyms (nonpreferred terms or Use/Used for indicators) for valid terms in the thesaurus and associations between conceptually related terms. A thesaurus allows for scope notes and term history. More advanced thesauri may involve additional equivalencies such as alternate languages, numerical codes, and status designation for terms. National and international standards dictate the details of thesaurus construction and display.
The hierarchy view of a thesaurus parallels a classic taxonomy. The alphabetic view of a thesaurus presents the full term record for each term, with hierarchy relationships, conceptual associations, notes, etc. Alternative displays include an alphabetical listing of terms and nonpreferred terms, the permuted index (KWIC or KeyWord In Context), and other views.
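To make the distinction concrete, here is a minimal sketch of a thesaurus term record as a data structure. The field names follow the relationships described above (Broader Terms, Narrower Terms, related terms, Used For synonyms, scope notes). The class, the helper function, and the sample terms are illustrative only and do not represent Thesaurus Master's internal format.

```python
from dataclasses import dataclass, field


@dataclass
class TermRecord:
    term: str
    broader: list = field(default_factory=list)    # Broader Terms (BT)
    narrower: list = field(default_factory=list)   # Narrower Terms (NT)
    related: list = field(default_factory=list)    # conceptual associations (RT)
    used_for: list = field(default_factory=list)   # nonpreferred synonyms (UF)
    scope_note: str = ""


def outline(records, top, depth=0):
    """Return the hierarchy (taxonomy) view as indented lines."""
    lines = ["  " * depth + top]
    for child in records[top].narrower:
        lines.extend(outline(records, child, depth + 1))
    return lines


records = {
    "Lighting": TermRecord("Lighting", narrower=["Light bulbs"]),
    "Light bulbs": TermRecord(
        "Light bulbs",
        broader=["Lighting"],
        narrower=["Fluorescent tubes"],
        used_for=["Lamps, electric"],
        scope_note="Electric light sources.",
    ),
    "Fluorescent tubes": TermRecord("Fluorescent tubes", broader=["Light bulbs"]),
}

# Controlled vocabulary: just the alphabetized list of valid terms.
print(sorted(records))
# Taxonomy: the same terms shown as a hierarchy.
print("\n".join(outline(records, "Lighting")))
```

The alphabetized list corresponds to a plain controlled vocabulary, the outline to a taxonomy, and the full records with synonyms and notes to a thesaurus.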
What
advantages does a taxonomy offer?
A taxonomy's
hierarchical organization makes it easy to locate the most
accurate subject indexing term. The hierarchical view makes it
easy to navigate across categories or from a major category
along a branch to singular examples. This benefit is enjoyed by
indexers, internal document editors, and end-users.
What
extra advantages does a thesaurus offer?
In addition to
the hierarchical display of the taxonomy view, a thesaurus may
be viewed as individual term records with the added value of
conceptual associations, notes on use, translations to other
languages, etc. This information provides the user ideas for
alternative terms that may be appropriate for indexing or
finding documents on conceptually related topics.
What can I do
with a taxonomy?
You can use a taxonomy in many ways:
1. Index the data using the taxonomy terms
2. Add the taxonomy terms as metadata to HTML records
3. Add the taxonomy as a browsable topic list to a portal or web site
4. Search the data in your database by taxonomy term
5. Set a spider to crawl the web for information you want
6. Hook your data files to a visual display system
7. Fill the metadata header in your web records
Indexing Basics
What is indexing
and why bother?
Indexing is a way to increase retrieval
precision and accuracy by consistent application of subject
terms in their preferred forms. Good indexing leads to the
optimal combination of precision retrieval and comprehensive
recall. The result is retrieval of documents appropriate for the
search and blocking documents that may share a common word but
are conceptually irrelevant.
Why is an indexed
search better than a full-text search?
The results of a
search will always depend on what you are looking for. In a
full-text search, if your query words do not match words in the
document text, you will not get back all the relevant documents.
If they do match words in the document, they may not be what the
document is primarily about, so you get back a lot of irrelevant
documents. If you are searching for a new and as yet undefined
concept not reflected in the controlled vocabulary of indexing
terms, then a full text search is preferable.
What is automatic
indexing?
Automatic indexing is the application of index
terms from a thesaurus or controlled vocabulary without human
intervention, using only the computer.
What is machine
aided indexing?
Machine aided indexing allows the
selection from a list of the automatically suggested index terms
drawn from a thesaurus or controlled vocabulary for increased
accuracy in storage and in retrieval. An editor can also apply
additional thesaurus terms not suggested by the system.
What is the
difference between "automatic" and "assisted"
indexing?
Automatic indexing will work without human
review or intervention and is great for filtering large amounts
of data, such as a news feed. In assisted or machine aided
indexing, a human reviews the suggested terms and tweaks them
for the highest quality results. Data Harmony supports both
modes of operation.
If I have a
search engine (like Verity), why do I need MAIstro or M.A.I.?
Either MAIstro or M.A.I. allows you to introduce
precision and highly relevant recall to the search in a
Verity-type system. M.A.I. (or MAIstro) and Verity in combination offer
a very powerful solution, assuring access to documents that have
been precision indexed with controlled vocabulary or thesaurus
terms using M.A.I. or MAIstro, as well as Verity's free text
search of documents with important words that have not yet been
incorporated into the thesaurus.
Using MAIstro
How long does it
take to build a rulebase?
An effective starter rulebase
can be prepared in two hours from an existing thesaurus or
controlled vocabulary. This is the simple rulebase which gives
about 60% accuracy in hits. After the simple rulebase is built
we recommend about one month to build a complex rulebase to take
you to 85% accuracy in hits. Actual time required varies with
the size of the taxonomy and the writing style of documents.
How long does it
take to maintain a rulebase?
After an initial period of
rulebase preparation, the time required for maintenance drops
off steadily. Approximately two hours of weekly editorial work
is generally sufficient to fine tune the rulebase and keep up
good statistical results.
What kind of
people do I need to maintain a rulebase and taxonomy
(thesaurus)?
Editors who manage your document collection
and indexing or categorization needs can easily learn to
maintain a rulebase and taxonomy.
How long will it
take to index my legacy collection?
This depends on the
number of metadata fields to be filled, the complexity of the
materials, and of course the size of the legacy collection. If
just the subject field is required, we have been able to index 8
documents per hour on complicated documents and up to 60 per hour on more
straightforward text. While many fully automatic indexing
systems provide only 60% accuracy, to automatically index the
collection we recommend that the accuracy level be at 85% or
higher. Then an automatic run will give excellent access to a
legacy collection. Ongoing indexing depends on the number of
metadata fields you are filling.
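For a rough planning estimate, the quoted rates can be turned into hours. This is a sketch only: the collection size below is a hypothetical figure, not one from this FAQ, and the rates are the 8 and 60 documents per hour mentioned above for a subject field alone.

```python
def indexing_hours(documents, docs_per_hour):
    """Hours needed to index a collection at a given rate."""
    return documents / docs_per_hour


collection_size = 12_000  # hypothetical collection size, for illustration only
slow = indexing_hours(collection_size, 8)    # complicated documents
fast = indexing_hours(collection_size, 60)   # more straightforward text
print(f"{fast:.0f} to {slow:.0f} hours")  # 200 to 1500 hours
```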
Who will build
the rulebase?
We can train your editors to construct your
rulebase, or we can develop it for you, using your documents,
and then pass it over to you for continued use and maintenance.
Once trained, most of our customers prefer to build and maintain
their rulebase independently.
Who will maintain
the rulebase?
We will train your editorial staff to
maintain your rulebase, or you can outsource the task to our
experienced editorial staff. The training lasts one day and
includes practice time.
Do I have to come
to you to add or change a term?
No, for each term that
you add, M.A.I. creates a simple starter rule for its use for
indexing documents. You can modify the rule at any time to
improve its function. If you change a term, existing rules are
automatically changed to suggest the revised term. These
processes are quick and simple, and performed by your staff.
Organizing Data
How does Data
Harmony work with a content management system (CMS)?
Data
Harmony is integrated with a CMS through documented Application
Programming Interfaces (APIs). Depending on the client's needs,
indexing terms may be presented interactively in real time
document by document, or by a batch approach, filtering and
suggesting terms for a number of documents at a time.
We work with other software vendors to create custom packages such as MAI STAR, integrating M.A.I. with Cuadra's STAR information management software.
What makes a good
content management system (CMS)?
A good content
management system includes the following parts:
1. Input system for document creation and editing
2. Search system to find the documents in the system
3. The display of the documents to the user either by web site (portal) or print or a customized user interface
4. Administration modules to create custom reports and document sets
5. Nice to have features include hooking to a publishing system or portal interface
Do I need a
database?
You don't necessarily need a database to use
the M.A.I. and TM or MAIstro. You can access data through
filtering of content feeds or by setting a spider. The rule of
thumb is that you are better off with a database if you have
more than 5000 items or 14 fields or metadata areas. If not, it
may not be worth the expense of the database creation and
maintenance.
What is a
spider?
A spider is a program that automatically surfs
the web. It is generally used to find documents on topics that
you specify.
I have lots of
data but I don't have a database - what do I do?
You don't
need a database to use the M.A.I. and TM or MAIstro. Even
without a formal database, the software can work on data feeds
or streams, or can work in conjunction with a spider or web
crawler to gather data for your needs. Then the software can
classify your data for display and enable the data to be
displayed in a web site or portal. Additionally, M.A.I. can be
used for query expansion in a search or with a federated search
engine.
If a formal database is indicated, the Data Harmony support team can help you build a database if that suits your needs.
What is
"well-formed data" in a database?
Databases usually put data into a particular format. The term
"well-formed data" comes from the XML standard published by the
World Wide Web Consortium (W3C). It means that the data follows
XML syntax exactly and will parse in any XML system.
Data must be well-formed and valid so that it can pass into the system of your choice, or into any other system that follows the standard, and load without problems into the new software or your client's software. "Well-formed" is distinct from "valid": valid data is well-formed and, in addition, conforms to a DTD or schema.
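A quick way to see what "well-formed" means is to hand a record to an XML parser. The sketch below uses Python's standard-library parser. The sample records and the function name are made up for illustration. Note that this parser checks well-formedness only; checking validity against a DTD or schema requires a validating parser, which is not shown.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if the text parses as XML, False otherwise."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


good = "<record><title>Light bulbs</title></record>"
bad = "<record><title>Light bulbs</record>"  # <title> is never closed

print(is_well_formed(good))  # True
print(is_well_formed(bad))   # False
```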
What is a DTD?
A
DTD is a Document Type Definition, a mechanism to describe the
structure of documents. It was created to specify the fields in
SGML (Standard Generalized Markup Language). SGML was created to
allow platform independence in publishing of documents. HTML and
XML are descended from SGML. In XML, many of the constraints and
features of SGML have been removed to make it simpler to
use. XML can also be defined with a DTD.
An XML Schema is an alternative to a DTD. A schema is generally broader and may have specific rules for constraining the content of an XML document.
If the data in your database follows the DTD, it is valid as well as well-formed.
Integrating With My System
How does it
connect to other systems?
We connect to other systems
through an Application Programming Interface (API). This allows different
systems to communicate seamlessly.
What is an API?
An Application Programming Interface is a list of
methods that enable one program to communicate with another
program. Data Harmony has published APIs that allow other
software to hook to TM, M.A.I., and MAIstro.
How do I get my
data to M.A.I.?
You can get your data into M.A.I. in one
of the following three ways:
1. Your data input system can be modified to use the Data Harmony API for sending data interactively through the M.A.I. rulebase. This produces indexing term suggestions interactively, on a document by document basis.
2. You can use the Data Harmony Batch program to run batch files, which produces suggested terms for a number of documents at a time. This approach requires a modest amount of preformatting in order to feed into Data Harmony Batch.
3. Custom batch programs can be written to handle any text format you require.
How does Data
Harmony software relate to a portal?
Data Harmony
software enables you to build and maintain a taxonomy of
indexing terms that describe documents. A portal uses the
taxonomy in two ways:
1. Add the taxonomy hierarchy as a
browsable topic list to a portal or web site. HTML links from
the topic terms provide access to the documents.
2. Add the
taxonomy terms as metadata to HTML records, filling in Name
and/or Keyword fields for access by search systems and spiders.
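The second use can be sketched in a few lines: turning the indexing terms assigned to a document into an HTML keywords meta tag. The function name and sample terms are hypothetical, and this is not Data Harmony's own export code.

```python
from html import escape


def keyword_meta_tag(terms):
    """Build an HTML <meta name="keywords"> tag from indexing terms."""
    content = ", ".join(terms)
    # Escape quotes and angle brackets so the attribute stays valid HTML.
    return f'<meta name="keywords" content="{escape(content, quote=True)}">'


tag = keyword_meta_tag(["Light bulbs", "Energy efficiency"])
print(tag)
# <meta name="keywords" content="Light bulbs, Energy efficiency">
```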
Can I use Data
Harmony remotely?
Data Harmony uses Client/Server
architecture, allowing the Client (the editor's workstation) to
access the Server wherever a network connection allows. The
software does not need a browser in place to work. The data can
be accessed over the Internet.
What are Internet
protocols and why are they important?
We believe in
using the Internet for ease of data movement and communication.
If you are sending data over the Internet, you have to comply
with Internet protocols.
The Internet protocols used in Data Harmony are TCP/IP, which stands for Transmission Control Protocol and Internet Protocol. TCP ensures that the data arrives at its destination complete and in the correct order. IP addresses and routes the data, which travels in discrete packets, typically of up to about 1500 bytes each on Ethernet networks.
We use TCP/IP because it can run on any network: an organization's internal LAN, a WAN, or a worldwide network. This network flexibility means that Data Harmony products can be used without being impeded by network constraints, can be accessed by users with password access from anywhere, and are as scalable as the Internet in the size of the community they can serve.
Under the Hood
What are the
system requirements?
Data Harmony is written in Java to
be platform independent. Data Harmony requires only the ability
to run Java and the presence of JRE (Java Runtime
Environment) to run. We recommend JRE 1.5 or higher for best performance.
Partnered products like MAI STAR have specific requirements and customers are best directed to their sites for the current information.
What operating
systems do Data Harmony products support?
Data Harmony
products can run on any operating system that is Java compliant,
including Windows 95, Windows 98, Windows NT and Windows Server, Windows XP and Vista, Unix, Linux and
Mac. We recommend JRE 1.5 or better.
Special note on Mac: Mac OS X has Java pre-installed and so should run Data Harmony products with no problem. Mac OS lower versions use a Macintosh Runtime for Java (MRJ2.2.5), based on Sun Microsystems' Java 1.1.8 specification. This is not the recommended version and in some cases may not support Data Harmony.
In what computer
language are Data Harmony products written?
Data Harmony
products are written in the Java programming language, making
them platform independent.
What is Java?
Java is a programming language that is platform
independent, just as XML is platform independent. It allows you
to run programs written in Java on one or many platforms without
rewriting the software for each individual computer type. There
are now two kinds of Java. One is open and supported by Sun
Microsystems and others. The second is the Microsoft version.
They have subtle differences in compilation.
What is XML?
XML
stands for eXtensible Markup Language. It is a World Wide Web
Consortium standard for the consistent portable markup of
documents and other objects. It allows markup into fields or
elements of data with accompanying further description of
information called Attributes. It also allows transfer of that
data without requiring specific data software or operating
platform. It frees the user from relying on a particular
software or hardware vendor and allows easy, reliable data movement
at any time.
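As a small illustration of elements and attributes, the sketch below builds a record, writes it out as XML text, and reads it back with a standard parser. The element and attribute names (record, title, subject, id, source) are invented for the example and are not a Data Harmony format.

```python
import xml.etree.ElementTree as ET

# Build a record: elements hold the data, attributes describe it further.
record = ET.Element("record", attrib={"id": "42"})
title = ET.SubElement(record, "title")
title.text = "Choosing a light bulb"
subject = ET.SubElement(record, "subject", attrib={"source": "thesaurus"})
subject.text = "Light bulbs"

# Serialize to plain text that can be moved between systems.
xml_text = ET.tostring(record, encoding="unicode")
print(xml_text)

# Any XML parser can read the same text back.
parsed = ET.fromstring(xml_text)
print(parsed.get("id"))                      # 42
print(parsed.find("subject").text)           # Light bulbs
print(parsed.find("subject").get("source"))  # thesaurus
```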
Working With Search Systems
How does it work
with a search system?
Although the front ends and
options vary at the end of the process, all search systems build
an inverted index (alphabetic list of all term entries) and then
run the queries against the inverted index. This is true for all
search systems including MuseGlobal, Intelliseek, Verity,
PostgreSQL, Infoseek, FAST, Oracle, MySQL, Sequel, SAP, and so
forth.
M.A.I. can work with a search system
in two ways:
1. M.A.I. enables construction of an inverted
index of terms, i.e. a list of documents associated with each
indexing term. These indexing terms, arranged in the inverted
index, provide consistent and accurate subject access to the
data. This is the preferred method in well-formed databases with
field formatting or metadata access.
2. M.A.I. can
transition from a searcher's query word to the valid indexing
term, and then access documents indexed with that term. This
works well with natural language query systems such as Verity.
M.A.I. enables precision indexing for highly accurate retrieval of documents or information objects. A search system provides the supplemental ability to spot important words not yet incorporated in the indexing thesaurus.
How does it work
with search software?
To capture the maximum recall with
precision, M.A.I. allows you to search for concepts you want
with your precise needs in mind and retrieve only those items
relevant to your own use. You may also expand your search by
using synonyms (non-preferred or Used For terms) from the
thesaurus and searching for all of them in Verity for maximum
recall of related data.
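The synonym expansion described above can be sketched as a simple lookup. The mapping, the sample terms, and the function name are hypothetical; how the expanded list is passed to a particular search engine is not shown.

```python
def expand_query(query, used_for):
    """Return the preferred term plus its nonpreferred synonyms.

    `used_for` maps each preferred term to its Used For terms
    (an assumed layout for this sketch).
    """
    wanted = query.lower()
    for preferred, synonyms in used_for.items():
        names = [preferred] + synonyms
        if wanted in (name.lower() for name in names):
            return names
    return [query]  # not in the thesaurus: fall back to the raw query


used_for = {"Light bulbs": ["Incandescent lamps", "Electric lamps"]}
print(expand_query("electric lamps", used_for))
# ['Light bulbs', 'Incandescent lamps', 'Electric lamps']
print(expand_query("daffodils", used_for))
# ['daffodils']
```

Searching for every name in the returned list is what raises recall, while the preferred term keeps the indexing consistent.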
If I have a
Natural Language search software, why do I need MAIstro or
M.A.I.?
By coordinating with M.A.I., a search software
is not limited to specific words, which may vary in meaning from
document to document and may miss many alternate expressions of
a concept. M.A.I. boosts Verity's search success by enabling it
to also find in metadata the indexing terms that accurately
represent concepts, using the terms that were attached to the
documents by M.A.I.
With M.A.I., you introduce precision and highly relevant recall to the search in a system such as Verity or FAST. While these systems can locate a given query word, they do not supply consistency in indexing, nor do they fully expand to the synonyms or provide the ability to jump to related terms. These options come with the introduction of a thesaurus and application of those thesaurus terms to the individual information objects.
M.A.I. and your search software in combination offer a very powerful solution, assuring access to documents that have been precision indexed with controlled vocabulary or thesaurus terms using M.A.I., as well as Verity's free text search of documents with important words that have not yet been incorporated into the thesaurus.
What is an
inverted index?
An inverted index is an alphabetical list
of all the words that occur in all documents in a document set.
Each word in the list is hooked to every document that contains
that word. When the user searches for "light bulb,"
the inverted index points to all documents containing either or
both of those words. It will, therefore, bring up documents
pertaining to "light years" as well as "daffodil
bulbs." It will not retrieve documents on the basis of
"incandescent" or "fluorescent tubes."
M.A.I., however, can interpret and index those references as
"light bulbs" and place that tag in the document's
metadata, making the document retrievable despite the absence
of those query words.
In a free text search, the inverted index enables the search system to find a specified query word anywhere in the document text. An inverted index of a particular data field, e.g., the author field or date field, will have a much narrower selection of elements. An inverted index of indexing terms will include only those words that occur in the valid indexing terms drawn from your thesaurus or controlled vocabulary.
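The "light bulb" example can be reproduced with a toy inverted index. This is a minimal sketch, not how any particular search engine or M.A.I. is implemented: the sample documents are invented, the plural stripping is a deliberately crude assumption so that "bulbs" matches "bulb", and the indexing terms in the last step are assigned by hand.

```python
from collections import defaultdict


def normalize(word):
    """Lowercase, trim punctuation, and crudely strip a plural 's'."""
    word = word.lower().strip('.,;:"')
    if len(word) > 3 and word.endswith("s"):
        word = word[:-1]
    return word


def build_inverted_index(documents):
    """Map each normalized word to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in text.split():
            index[normalize(word)].add(doc_id)
    return index


def search_any(index, query):
    """Return ids of documents containing any of the query words."""
    hits = set()
    for word in query.split():
        hits |= index.get(normalize(word), set())
    return sorted(hits)


documents = {
    1: "Replacing a light bulb safely",
    2: "Distances measured in light years",
    3: "Planting daffodil bulbs in autumn",
    4: "Fluorescent tubes and incandescent lamps compared",
}
index = build_inverted_index(documents)
print(search_any(index, "light bulb"))  # [1, 2, 3]

# An inverted index of indexing terms (assigned by hand in this sketch).
term_index = {"Light bulbs": {1, 4}}
print(sorted(term_index["Light bulbs"]))  # [1, 4]
```

The free text search returns the irrelevant documents 2 and 3 and misses document 4, while the lookup by indexing term returns exactly the two documents about light bulbs.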
If you have any other questions, please feel free to give us a call at 1-800-926-8328.

