1
Identifying Chemical Names
with MAI Chem™
  • Marjorie M. K. Hlava
  • Jay Ven Eman
  • Data Harmony Software
  • Access Innovations, Inc.
  • www.accessinn.com
  • www.dataharmony.com



2
What is MAI Chem?
  • Software for identifying chemical names in electronic documents
  • Quickly locate all chemical names in documents
    • even when such documents are exceedingly large!
3
MAI Chem Allows
  • Rapid searching of text‑containing documents
  • Identifying and isolating chemical names and common chemical expressions from surrounding text
  • The chemical names are returned to the user in a list
    • sorted in descending order
    • by number of occurrences of each chemical name
  • The list also contains all synonyms for each chemical name found
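The ranked list described above can be sketched in a few lines. This is only an illustration, assuming the chemical names have already been extracted; the `SYNONYMS` table and the `rank_names` function are hypothetical names, not part of MAI Chem.

```python
from collections import Counter

# Hypothetical synonym table; the real MAI Chem table is not shown in the slides.
SYNONYMS = {
    "sodium chloride": ["table salt", "NaCl"],
    "hydrogen peroxide": ["H2O2"],
}

def rank_names(found_names):
    """Return (name, count, synonyms) tuples, most frequent name first."""
    counts = Counter(found_names)
    return [(name, n, SYNONYMS.get(name, [])) for name, n in counts.most_common()]

# Names as they might come out of the extraction step, one entry per occurrence.
found = ["hydrogen peroxide", "sodium chloride", "hydrogen peroxide"]
report = rank_names(found)
for name, n, syns in report:
    print(f"{n:3d}  {name}  {syns}")
```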
4
Challenge – Find
 Chemical Names
  • Searching the entire document by hand is an extremely time-consuming task
  • There is a risk of overlooking some chemical names
  • Slowly reviewing the document
      • Reduces that risk
      • Increases the time required to review the document
  • Large documents can span many pages
  • Yet they may contain only a handful of chemical names
    • Like locating a needle in a haystack!
  • An automated process is needed
5
Compare each word
  • Against a list of chemical names
  • Each containing an element selected from
      • Chemical prefixes
      • Chemical suffixes
      • Combinations of these
      • Multiple occurrences of these
6
Distinguish the parts
  • Chemical names from the non‑chemical names
  • Chemical terms not used in the text‑containing documents as part of a chemical name
  • Synonyms are displayed along with the chemical names
  • Non‑individual chemical names can be grouped
  • Individual chemical names are identified and expanded
  • The number of occurrences of each chemical name is presented


7
Definition
  • "Chemical name" is used for simplicity to cover
    • all chemical names
    • chemical compounds
    • symbols
    • expressions
    • common chemical names
8
Other uses
  • Primarily for identifying chemical names
  • New technology area with developing lexicons
  • Filtering coded information
  • Adapt to search for words, phrases, and symbols of any lexicon
9
Use it whenever …
  • You work with patents, journal articles, technical reports, or other full text
  • You have an unlimited number of potential chemical compounds, and a variety of ways that a particular compound can be named
  • You need to match names against text
  • You need to filter incoming data
10
How does it work?
  • Processes the text
    • against regular expressions
    • that match typical chemical morphemes
    • such as "hydro" or "amine"
  • Distinguishes between non-chemical words and legitimate chemical names
    • "hydrophobia" (non-chemical)
    • "hydrogen sulfate" (chemical name)
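A minimal sketch of the idea on this slide, assuming a word-by-word check. The two morphemes come from the slide; the stopword set and the `looks_chemical` function are hypothetical and stand in for the much larger lists the product would use.

```python
import re

# Morpheme patterns named on the slide; the real list is presumably longer.
MORPHEMES = [re.compile(p) for p in ("hydro", "amine")]

# Hypothetical list of non-chemical words that happen to contain a morpheme.
STOPWORDS = {"hydrophobia"}

def looks_chemical(word):
    """True if the word contains a chemical morpheme and is not a known false match."""
    w = word.lower()
    if w in STOPWORDS:
        return False
    return any(p.search(w) for p in MORPHEMES)

for word in ("hydrogen", "hydrophobia", "ethylamine", "document"):
    print(word, looks_chemical(word))
```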
11
Where will it run?
  • A stand-alone program
  • Over a network
  • A simple HTTP server is set to run
    • loads the word lists
    • except for the synonym table
    • waits for messages
  • A client program transmits a document to the server
  • The server runs a program, then sends the client a list of all chemical names contained in the original text‑containing document.
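The client/server exchange above could look roughly like the following sketch, built on Python's standard `http.server` and `http.client` modules. The JSON reply format, the `find_chemical_names` stand-in, and the tiny word list are assumptions; the slides do not describe the actual message format or extractor.

```python
import http.client
import json
import re
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

# Word list loaded once at startup (a stand-in for the real lists).
CHEMICAL_WORDS = {"sodium", "chlorine", "hydrogen"}

def find_chemical_names(text):
    """Hypothetical extractor: report known chemical words found in the text."""
    words = re.findall(r"[a-z]+", text.lower())
    return sorted({w for w in words if w in CHEMICAL_WORDS})

class ChemHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read the document sent by the client and reply with a JSON list.
        length = int(self.headers.get("Content-Length", 0))
        text = self.rfile.read(length).decode("utf-8")
        body = json.dumps(find_chemical_names(text)).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the example quiet

def query(port, text):
    """Client side: transmit a document and return the list of names."""
    conn = http.client.HTTPConnection("127.0.0.1", port)
    conn.request("POST", "/", body=text.encode("utf-8"))
    data = json.loads(conn.getresponse().read())
    conn.close()
    return data

server = HTTPServer(("127.0.0.1", 0), ChemHandler)  # port 0 picks a free port
threading.Thread(target=server.serve_forever, daemon=True).start()
names = query(server.server_address[1], "Sodium reacts with chlorine.")
server.shutdown()
server.server_close()
print(names)
```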
12
What happens?
  • The text is split into words
  • Punctuation is removed, except for parentheses
  • Each word is then checked by various matching algorithms
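The splitting step might be sketched as follows. One assumption is made explicit here: punctuation is stripped only from the edges of each word, so hyphens inside a word survive. The slides do not say how interior punctuation is handled, and `tokenize` is a hypothetical name.

```python
import string

# Punctuation to strip from word edges: everything except parentheses.
# Assumption: hyphens inside a word are kept, since patterns such as "(1)-"
# need them to remain in the token.
EDGE_PUNCT = "".join(c for c in string.punctuation if c not in "()")

def tokenize(text):
    """Split text on whitespace and strip edge punctuation from each word."""
    words = (w.strip(EDGE_PUNCT) for w in text.split())
    return [w for w in words if w]

# Made-up sentence for illustration.
tokens = tokenize("Dissolve (1)-menthol in ethanol, then add water.")
print(tokens)
```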
13
Things behind the scenes
  • Stopwords
    • A very large list of non-chemical words
    • Saves time later
    • Removes words which might match regular expressions, but are not chemical names
    • "Hydrophobia" matches the regular expression /hydro/, but is not a chemical name
14
Things behind the scenes
  • Regular expression stopwords
    • A very small number of regular expressions which eliminate other strings that may occur
    • For example, any number standing alone can be eliminated
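As an illustration of a regular-expression stopword, the sketch below drops any token that is a number standing alone. Only that one rule comes from the slide; the exact pattern (which also accepts decimals) and the function name are assumptions.

```python
import re

# Hypothetical regular-expression stopwords; a token is dropped only when
# the whole token matches, so "2-butanol" is not affected.
REGEX_STOPWORDS = [re.compile(r"\d+(\.\d+)?")]

def is_regex_stopword(word):
    return any(p.fullmatch(word) for p in REGEX_STOPWORDS)

words = ["25", "3.5", "2-butanol", "(1)-"]
kept = [w for w in words if not is_regex_stopword(w)]
print(kept)
```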
15
Things behind the scenes
  • Chemical names
    • Exact matches
    • All element names
    • Common words for chemicals
    • (words not picked up by regular expressions, such as "salt" or "soda")
16
Things behind the scenes
  • Chemical name starts
    • Regular expressions for patterns that often start long compound names
    • For example, \(\d\)- matches (1)-
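The pattern from this slide can be tried directly. In this sketch the match is anchored at the start of the word, which is an assumption consistent with the phrase "name starts"; `starts_like_compound` is a hypothetical helper.

```python
import re

# The pattern from the slide: an opening parenthesis, one digit,
# a closing parenthesis, and a hyphen.
NAME_STARTS = [re.compile(r"\(\d\)-")]

def starts_like_compound(word):
    # re.match only succeeds at the beginning of the string.
    return any(p.match(word) for p in NAME_STARTS)

for word in ("(1)-", "(1)-hexene", "(12)-hexene", "hexene"):
    print(word, starts_like_compound(word))
```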
17
Things behind the scenes
  • Regular expressions
    • Common chemical morphemes
    • Show up in long compound names
    • "hydro", "sulf", or "oxy"
    • (non-chemical matches, such as "proxy", go in the stopword list)
    • Keep it short
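Taken together, the lists from the last few slides suggest a cascade of checks per word. The sketch below shows one possible ordering; the order, the list contents beyond the examples on the slides, and the name `is_chemical_word` are all assumptions.

```python
import re

STOPWORDS = {"proxy", "hydrophobia"}            # non-chemical words
REGEX_STOPWORDS = [re.compile(r"\d+")]          # e.g. numbers standing alone
EXACT_NAMES = {"salt", "soda", "sodium", "iron"}  # exact matches
NAME_STARTS = [re.compile(r"\(\d\)-")]          # typical compound beginnings
MORPHEMES = [re.compile(p) for p in ("hydro", "sulf", "oxy")]

def is_chemical_word(word):
    """Apply the lists in turn; the first list that matches decides."""
    w = word.lower()
    if w in STOPWORDS:
        return False
    if any(p.fullmatch(w) for p in REGEX_STOPWORDS):
        return False
    if w in EXACT_NAMES:
        return True
    if any(p.match(w) for p in NAME_STARTS):
        return True
    return any(p.search(w) for p in MORPHEMES)

for word in ("oxygen", "proxy", "salt", "42", "(1)-hexene", "table"):
    print(word, is_chemical_word(word))
```

Checking the stopword list first means a short morpheme list can stay permissive: false matches such as "proxy" are removed before the /oxy/ pattern ever sees them.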
18
Things behind the scenes
  • Marks each word as chemical or non‑chemical
  • Groups the chemical words into actual chemical terms
    • "hydrogen peroxide", and not two terms such as "hydrogen" and "peroxide"
  • Determine words that are part of a longer term
    • "acid" is not a chemical
    • "hydrochloric acid" is
  • Review adjectives
    • "linear" can be the beginning of a chemical name
    • it is not included if it is not followed by a chemical word
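The grouping rules on this slide can be sketched as a single pass over the marked words. The sets `PART_ONLY` and `ADJECTIVES`, and the function `group_terms`, are hypothetical; the sketch assumes a per-word chemical test is supplied by the caller.

```python
PART_ONLY = {"acid"}      # chemical only as part of a longer term
ADJECTIVES = {"linear"}   # may begin a name, never stands alone

def group_terms(words, is_chemical):
    """Join runs of adjacent chemical words into multi-word terms."""
    terms, run = [], []

    def flush():
        while run and run[-1] in ADJECTIVES:
            run.pop()  # adjective not followed by a chemical word
        # Keep the run only if it holds at least one genuinely chemical word.
        if any(w not in PART_ONLY and w not in ADJECTIVES for w in run):
            terms.append(" ".join(run))
        run.clear()

    for word in words:
        w = word.lower()
        if w in PART_ONLY or w in ADJECTIVES or is_chemical(w):
            run.append(w)
        else:
            flush()
    flush()
    return terms

chem = {"hydrochloric", "hydrogen", "peroxide", "alkylbenzene"}
text = "the linear approach used hydrochloric acid and hydrogen peroxide not acid alone"
terms = group_terms(text.split(), lambda w: w in chem)
print(terms)
```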
19
Things behind the scenes
  • Identify chemical words separated by non‑chemical words which have common endings or beginnings
    • "sodium and potassium sulfates"
    • returns "sodium sulfates" and "potassium sulfates"
  • Chemical names, plus any synonyms, are presented to the user
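The "sodium and potassium sulfates" expansion can be illustrated with a small regular expression. This sketch handles only the shape "A and B ending" (or "A or B ending") with a shared final word, and assumes all three words were already marked as chemical; shared beginnings, which the slide also mentions, are not covered. The function name is hypothetical.

```python
import re

# Two words joined by "and"/"or", followed by one shared ending word.
COORD = re.compile(r"\b(\w+)\s+(?:and|or)\s+(\w+)\s+(\w+)")

def expand_shared_ending(phrase):
    """Distribute the shared final word over both coordinated words."""
    m = COORD.search(phrase)
    if not m:
        return [phrase]
    first, second, ending = m.groups()
    return [f"{first} {ending}", f"{second} {ending}"]

expanded = expand_shared_ending("sodium and potassium sulfates")
print(expanded)
```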


20
Insert Text
21
Resulting set
22
MAI Chem Results
23
What is Unique about
MAI Chem?
  • Identifying chemical terms which are not used in the text‑containing documents as part of a chemical name
  • Chemical names are bolded
  • Placed in a list
  • Synonyms
  • Grouping non‑individual chemical names
  • Identifying and expanding individual chemical names
  • States number of occurrences


24
MAI Chem™ Partners
  • ACD/Labs
  • Vivisimo
  • Mark Logic
  • RedDot
  • VxInsight
  • VisWave
  • More wanted – please ask
28
MAI Chem™
brought to you by:
  • Data Harmony Division
  • Access Innovations, Inc.
  • 131 Adams NE
  • Albuquerque, NM  87108
  • www.accessinn.com
  • www.dataharmony.com
  • Jay Ven Eman
  • Marjorie M.K. Hlava
  • Ask us about the other Data Harmony software products and Data Base Services such as...
29
MAIstro™