Metadata Extractor – Metadata Enrichment & Document Autosummarization

Metadata Extractor is a managed Web service for document handlers. It extracts relevant information from a document, generates metadata that describes the content, and adds that structured metadata to an XML output file of the text. This Data Harmony extension also serves as a content autosummarization tool.

What the software does

Metadata Extractor creates a full bibliographic citation and autosummary abstract, handles author parsing and assigns taxonomy terms automatically. This Web service extension takes an unstructured or semi-structured document and converts it to a data object with a richer, more useful structure.
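To make the author-parsing step concrete, here is a minimal Python sketch. It is not Data Harmony's implementation: the function name parse_authors and the "last token is the family name" rule are assumptions chosen for illustration, and real bylines (suffixes, multi-word surnames, affiliations) would need more careful handling.

```python
import re

def parse_authors(byline):
    """Split a free-text byline into given/family name pairs.

    Hypothetical helper: separators are commas, semicolons, '&' and the
    word 'and'; the last token of each name is treated as the family name.
    """
    parts = re.split(r"\s*(?:,|;|&|\band\b)\s*", byline)
    authors = []
    for part in parts:
        tokens = part.split()
        if not tokens:
            continue  # skip empty pieces, e.g. after ", and"
        authors.append({"given": " ".join(tokens[:-1]), "family": tokens[-1]})
    return authors

authors = parse_authors("Jane Doe, Richard Roe, and Ann Lee")
for author in authors:
    print(author["family"], "/", author["given"])
```

Structured name parts like these are what make a byline usable in a bibliographic citation rather than a plain string.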

Metadata Extractor accepts these input formats:

  • Adobe PDF publications
  • MSWord documents
  • HTML pages
  • TXT files

Metadata Extractor interface

Powered by a unique set of programs and driven by elements in a publishing schema

The process draws upon a unique set of Data Harmony programs, with metadata extraction determined by user-defined elements from the document publishing schema or Document Type Definition (DTD).

The Metadata Extractor Web service generates subject keywords suggested by MAIstro™ (the integrated module where Thesaurus Master® is combined with M.A.I.™). A refined extraction system and the autosummarizer application facilitate metadata mining of Adobe PDF documents in batches, without editorial intervention.

The system renders bibliographic output according to the Dublin Core standard elements and syntax.
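For readers unfamiliar with Dublin Core, the following Python sketch shows what a record built from its standard elements might look like. The wrapper element name record, the function build_dublin_core_record, and the sample values are assumptions made for illustration. Only the Dublin Core element names and namespace URI come from the standard itself, and the actual Metadata Extractor output may be structured differently.

```python
import xml.etree.ElementTree as ET

# Namespace of the Dublin Core Metadata Element Set, version 1.1.
DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)

def build_dublin_core_record(fields):
    """Return an XML string with one dc:* element per supplied value.

    `fields` maps a Dublin Core element name (e.g. "title") to a string
    or a list of strings. The <record> wrapper is a hypothetical choice.
    """
    record = ET.Element("record")
    for name, values in fields.items():
        if isinstance(values, str):
            values = [values]
        for value in values:
            child = ET.SubElement(record, f"{{{DC_NS}}}{name}")
            child.text = value
    return ET.tostring(record, encoding="unicode")

example = build_dublin_core_record({
    "title": "A Sample Article",
    "creator": ["Doe, Jane", "Roe, Richard"],
    "subject": ["Metadata", "Taxonomies"],
    "description": "A short machine-generated summary would go here.",
})
print(example)
```

Repeating an element such as dc:creator or dc:subject once per value keeps each author and each keyword individually addressable downstream.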

Software feature – defining document elements that contain metadata

Metadata Extractor surfaces relevant information (from a PDF file, for example) and includes it in an XML record, inside elements that suit an organization’s journal publication style, so that the resulting metadata is consistently effective. You define which document elements in the publishing pipeline contain metadata for extraction.

Both metadata extraction and the output XML elements are configured to reflect the structure of the publication style, because the input document carries style elements that can be identified for metadata enrichment.
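The idea of mapping identifiable parts of the input document to output XML elements can be sketched in Python as follows. Everything here is hypothetical: the section labels, the output element names, and the extract_fields helper are illustrative assumptions, not the product's configuration format.

```python
# Hypothetical mapping: labelled part of the input document -> output XML element.
ELEMENT_MAP = {
    "Title": "article-title",
    "Authors": "contrib-group",
    "Abstract": "abstract",
}

def extract_fields(sections, element_map):
    """Keep only the configured sections, keyed by their output element name.

    `sections` maps a label found in the input document to its raw text.
    Sections that are not listed in `element_map` are ignored.
    """
    return {
        output_name: sections[source_label].strip()
        for source_label, output_name in element_map.items()
        if source_label in sections
    }

sections = {
    "Title": "  A Sample Article ",
    "Authors": "Jane Doe, Richard Roe",
    "Acknowledgements": "Thanks to the reviewers.",
}
fields = extract_fields(sections, ELEMENT_MAP)
print(fields)
```

Keeping the mapping in data rather than in code means a change to the publication style only requires editing the mapping.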

Access Innovations customizes implementation of the Web service extension

Access Innovations provides customization and administration services during configuration of the Metadata Extractor Web service extension. Every publishing schema requires a targeted approach that draws on the most appropriate document fields to generate the best metadata.

Graphical user interfaces (GUIs) and input elements for metadata extraction are adjustable. With Metadata Extractor, your information administrator designates the document data fields and entity data fields for the application to extract, and can revise them as the publication schema evolves over time.

We recommend establishing regular monitoring of the output to maintain an optimal accuracy level for metadata extraction and content autosummarization as new semantic patterns appear in the publishing pipeline.
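One simple way to monitor accuracy, offered here only as a sketch, is to compare automatically assigned terms with a sample that an editor has reviewed, then track precision and recall over time. The function name precision_recall and the sample terms are assumptions for illustration, and this is not a description of any built-in Data Harmony report.

```python
def precision_recall(assigned, approved):
    """Compare automatically assigned terms with editor-approved terms.

    Precision: share of assigned terms that the editor approved.
    Recall: share of approved terms that were actually assigned.
    """
    assigned, approved = set(assigned), set(approved)
    hits = len(assigned & approved)
    precision = hits / len(assigned) if assigned else 0.0
    recall = hits / len(approved) if approved else 0.0
    return precision, recall

# Hypothetical review sample for a single document.
precision, recall = precision_recall(
    assigned=["Metadata", "Taxonomies", "Indexing", "XML"],
    approved=["Metadata", "Taxonomies", "Thesauri"],
)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

A sustained drop in either figure across review samples would suggest that the extraction rules or the taxonomy need attention.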

Written by

Data Harmony