Metadata Extractor
The Data Harmony Metadata Extractor (DH-ME) is a new tool that automatically builds a document record by converting raw, unstructured, or semi-structured information into structured information. It accepts any digital document, such as HTML web pages, office documents, and PDFs. It extracts common entities such as dates, names, and numbers, as well as custom, client-specific fields such as titles, publication dates, and document types.
DH-ME uses innovative technology that enables users to define virtually any field or entity to be extracted from a document. Positional and formatting information is fed into an inference engine that allows the program to logically extract the fields.
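To make the idea of positional and formatting inference concrete, here is a minimal Python sketch. It is not DH-ME's actual engine or API: the `Line` class, the `extract_fields` function, the two rules (the title is the most prominent line near the top; the publication date is the first ISO-style date found), and the five-line window are all assumptions chosen for illustration.

```python
import re
from dataclasses import dataclass


@dataclass
class Line:
    """One line of a document with hypothetical positional/formatting cues."""
    text: str
    index: int          # position in the document (0 = first line)
    font_size: float    # formatting cue
    bold: bool = False  # formatting cue


# Assumed date format for this sketch: YYYY-MM-DD.
DATE_RE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")


def extract_fields(lines):
    """Apply two illustrative rules and return a partial document record."""
    record = {}

    # Rule 1 (assumption): the title is the largest, boldest line
    # among the first five lines; earlier lines win ties.
    head = [ln for ln in lines if ln.index < 5]
    if head:
        title = max(head, key=lambda ln: (ln.font_size, ln.bold, -ln.index))
        record["title"] = title.text.strip()

    # Rule 2 (assumption): the publication date is the first
    # ISO-style date that appears anywhere in the document.
    for ln in lines:
        match = DATE_RE.search(ln.text)
        if match:
            record["publication_date"] = match.group(0)
            break

    return record


sample = [
    Line("Acme Journal", 0, 10.0),
    Line("Indexing Legacy Archives", 1, 18.0, bold=True),
    Line("Published 2009-03-15", 2, 10.0),
    Line("Body text begins here.", 3, 10.0),
]
record = extract_fields(sample)
```

In a real system the rules would presumably be user-defined per field rather than hard-coded, which is what lets virtually any field be described in terms of where it sits and how it is formatted.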
Integrated with existing Data Harmony tools such as MAIstro and M.A.I., DH-ME uses domain knowledge (from thesauri, ontologies, and authority files) as well as positional inference. An author's name, for example, can be recognized in several variants (with or without a middle initial or middle name) but recorded in the preferred standardized format.
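The author-name example can be sketched as a lookup against an authority file. The following Python is only an illustration under stated assumptions: the `AUTHORITY` list, the `name_key` matching rule (first and last name only, middle parts ignored), and the function names are hypothetical and do not reflect how DH-ME or M.A.I. store or match authority data.

```python
# Hypothetical authority file: one preferred form per author.
AUTHORITY = ["Smith, John Quincy", "Garcia, Maria Elena"]


def name_key(name):
    """Reduce a name to a (first, last) key, ignoring middle names/initials.

    Accepts both "First Middle Last" and "Last, First Middle".
    Note: ignoring the middle part can conflate distinct people;
    a production matcher would need stricter rules.
    """
    if "," in name:
        last, rest = name.split(",", 1)
        parts = rest.split() + [last.strip()]
    else:
        parts = name.split()
    parts = [p.strip(".").lower() for p in parts if p.strip(".")]
    if not parts:
        return None
    return (parts[0], parts[-1])


def build_index(authority):
    """Map each (first, last) key to its preferred form."""
    return {name_key(preferred): preferred for preferred in authority}


def normalize(name, index):
    """Return the preferred form if the name matches, else the name unchanged."""
    return index.get(name_key(name), name)


index = build_index(AUTHORITY)
variants = ["John Smith", "John Q. Smith", "John Quincy Smith"]
normalized = [normalize(v, index) for v in variants]
```

All three variants resolve to the single preferred entry, while a name absent from the authority list passes through unchanged so that a reviewer can decide how to handle it.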
DH-ME can be used to convert legacy documents into structured records after they have been run through an OCR program, or to automatically populate a check-in form for a document repository. DH-ME can also be combined with M.A.I.'s automatic indexing to produce a complete document record that the user simply validates and uploads before moving on to the next document.
