Entity Extractor

The Entity Extractor finds people, places, and things in the full text of articles, extracts them as named entities, and delivers them as XML tags. We combine several complementary methods to provide maximum flexibility and extract the most information from client content.

  1. Extracting people, places, and things based on a known authority list. For this we use Data Harmony® M.A.I. with a flat file of names and aliases. When the client already has lists of product names, staff names, place names, etc., implementation is straightforward.

  2. Extracting unknown names from text. This method is based on the novelty detection system. The program has two parts. The first is a large dictionary, essentially the English language with a stop word list added. If a term or phrase is not in the dictionary, it is considered new and is suggested as a candidate term for matching against the entity extraction parameters. Used alone, this approach has a drawback: an unknown word could be anything, such as a typo, a word from another language, a newly coined word, or an odd phrase. Names make up only a small percentage of the unknown terms, so we add a third component to the system.

  3. Proper name extraction, which relies on initial capitalization to find and extract names. These names may then be compared against a known list of names and synonyms (the flat list in #1 above). Names found on that list are presented as accepted; the remaining names are presented as a second list for the editor to decide how to handle.

  4. Finally, we add our own analysis and functions that check for certain indicators, such as several words in a row with initial caps or in all caps, or words that occur within a set number of words of cues like "city," "country," "Mr.," or "Mister."

Taken together, these components provide excellent, reliable entity extraction. Used in conjunction with M.A.I., they form the basis for full semantic mining of the full text.
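
As a rough illustration of how the four components described above could fit together, here is a minimal Python sketch. It is not Data Harmony or M.A.I. code. The names AUTHORITY, DICTIONARY, CUES, extract_entities, and the window parameter are all hypothetical, and the tiny word lists stand in for the client's flat file and the full English dictionary. The sketch assumes that a capitalized run is accepted when it matches the authority list, is ignored when every word in it is an ordinary dictionary word, and is otherwise passed to the editor along with a flag showing whether a cue word appears nearby.

```python
import re

# Hypothetical authority list (step 1): alias -> preferred name.
AUTHORITY = {
    "acme corp": "Acme Corporation",
    "acme corporation": "Acme Corporation",
}

# Tiny stand-in for the large dictionary plus stop word list (step 2).
DICTIONARY = {"the", "a", "of", "in", "met", "mayor", "yesterday"}

# Illustrative cue words (step 4), stored lowercase without punctuation.
CUES = {"mr", "mister", "city", "country"}

WORD = re.compile(r"[A-Za-z]+")
# A run of adjacent words that each start with a capital letter (step 3).
# All-caps words also match this pattern.
CAP_RUN = re.compile(r"\b[A-Z][A-Za-z]*(?:\s+[A-Z][A-Za-z]*)*")


def extract_entities(text, window=3):
    """Return accepted names and candidates that need editorial review."""
    accepted, review = [], []
    for match in CAP_RUN.finditer(text):
        run = match.group()
        key = run.lower()
        if key in CUES:
            continue  # a cue such as "Mr" is context, not an entity
        if key in AUTHORITY:
            accepted.append(AUTHORITY[key])  # known name or alias
            continue
        if all(word in DICTIONARY for word in key.split()):
            continue  # ordinary word, e.g. capitalized at sentence start
        # Unknown capitalized run: check for a cue within `window` words.
        before = WORD.findall(text[:match.start()])[-window:]
        after = WORD.findall(text[match.end():])[:window]
        near_cue = any(w.lower() in CUES for w in before + after)
        review.append((run, near_cue))
    return {"accepted": accepted, "review": review}


sample = "The mayor met Mr. John Smith of Acme Corp in Springfield yesterday."
result = extract_entities(sample)
print(result["accepted"])  # ['Acme Corporation']
print(result["review"])    # [('John Smith', True), ('Springfield', False)]
```

In this sketch the cue flag only annotates each review candidate. A production system would more likely use such indicators to rank or filter candidates, and it would also need to handle punctuation, sentence boundaries, and multi-word cues that this example ignores.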