|
1
|
- Marjorie M. K. Hlava
- Jay Ven Eman
- Data Harmony Software
- Access Innovations, Inc.
- www.accessinn.com
- www.dataharmony.com
|
|
2
|
- Software for identifying chemical names in electronic documents
- Quickly locate all chemical names in documents
- even when such documents are exceedingly large!
|
|
3
|
- Rapid searching of text‑containing documents
- Identifying and isolating chemical names and common chemical expressions from surrounding text
- The chemical names are returned to the user in a list, in descending order of the number of occurrences of each chemical name
- The list also contains all synonyms for each chemical name found
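
The ranked output described on this slide can be sketched in a few lines of Python. This is only an illustration: the function name `rank_names`, the case-folding, and the sample synonym table are assumptions, not details of the product.

```python
from collections import Counter

def rank_names(found, synonyms):
    """Return (name, count, synonyms) tuples, most frequent name first."""
    # Assumption: names are compared case-insensitively.
    counts = Counter(name.lower() for name in found)
    return [(name, n, synonyms.get(name, [])) for name, n in counts.most_common()]

# Hypothetical input: names already extracted from a document.
found = ["benzene", "toluene", "Benzene", "salt", "benzene", "toluene"]
# Hypothetical synonym table.
synonyms = {"toluene": ["methylbenzene"], "salt": ["sodium chloride"]}

ranked = rank_names(found, synonyms)
for name, count, syns in ranked:
    print(count, name, syns)
```

`Counter.most_common()` sorts by count in descending order, which is the ordering the slide describes.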
|
|
4
|
- Search the entire document
- An extremely time-consuming task
- Risk overlooking some chemical names
- Slowly reviewing the document
- Reduces risk
- Increases the time required to review the document
- Large documents can span many pages
- Contain only a handful of chemical names
- Like locating a needle in a haystack!
- Need an automated process
|
|
5
|
- List of chemical names, each containing an element selected from:
- Chemical prefixes
- Chemical suffixes
- Combinations
- Multiple occurrences
|
|
6
|
- Distinguishes chemical names from non‑chemical names
- Identifies chemical terms not used in the text‑containing documents as part of a chemical name
- Synonyms are displayed along with the chemical names
- Non‑individual chemical names can be grouped
- Individual chemical names are identified and expanded
- The number of occurrences of each chemical name is presented
|
|
7
|
- "Chemical name" is used for simplicity to cover:
- all chemical names
- chemical compounds
- symbols
- expressions
- common chemical names
|
|
8
|
- Primarily for identifying chemical names
- New technology area with developing lexicons
- Filtering coded information
- Adapt to search for words, phrases, and symbols of any lexicon
|
|
9
|
- Patents, journal articles, technical reports, other full text
- You have an unlimited number of potential chemical compounds, and a
variety of ways that a particular compound can be named
- You need to match names against text
- You need to filter incoming data
|
|
10
|
- Processes the text against regular expressions that match typical chemical morphemes, such as "hydro" or "amine"
- Distinguishes non-chemical words, such as "hydrophobia", from legitimate chemical names, such as "hydrogen sulfate"
|
|
11
|
- Runs as a stand-alone program or over a network
- A simple HTTP server is started
- It loads the word lists (except for the synonym table)
- It waits for messages
- A client program transmits a document to the server
- The server runs a program, then sends the client a list of all chemical names contained in the original text‑containing document.
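
The client/server arrangement on this slide might look roughly like the following sketch, built on Python's standard `http.server` module. Everything here is assumed for illustration: the handler name `ChemHandler`, the use of POST with a JSON reply, and the tiny word list standing in for the real lists.

```python
import json
import re
from http.server import BaseHTTPRequestHandler, HTTPServer

# Loaded once at startup. Hypothetical stand-in for the real word lists;
# the synonym table is deliberately not loaded here, as the slide notes.
CHEMICAL_WORDS = {"benzene", "toluene", "sodium"}

def find_names(text):
    """Return the known chemical words found in text, sorted."""
    words = re.findall(r"[a-z]+", text.lower())
    return sorted({w for w in words if w in CHEMICAL_WORDS})

class ChemHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # The client sends the whole document as the request body.
        length = int(self.headers.get("Content-Length", 0))
        text = self.rfile.read(length).decode("utf-8")
        body = json.dumps(find_names(text)).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

def make_server(port):
    """Build a server bound to localhost; the caller starts it."""
    return HTTPServer(("127.0.0.1", port), ChemHandler)
```

Calling `make_server(8080).serve_forever()` would start the server and leave it waiting for documents; the port number is arbitrary.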
|
|
12
|
- The text is split into words
- Punctuation removed except for parentheses
- Each word is compared against various algorithms
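
A minimal sketch of the splitting step, reading the slide literally: every punctuation character except parentheses is removed, then the text is split on whitespace. The function name `split_words` is hypothetical, and the slide does not say how the real tool treats hyphens or digits; in this sketch hyphens are removed like any other punctuation.

```python
import re

def split_words(text):
    """Strip punctuation except parentheses, then split on whitespace."""
    # Keep word characters, whitespace, and the two parenthesis characters.
    cleaned = re.sub(r"[^\w\s()]", "", text)
    return cleaned.split()

words = split_words("Add salt, then HCl (aq); stir.")
print(words)
```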
|
|
13
|
- Stopwords
- A very large list of non-chemical words
- Saves time later
- Removes words which might match regular expressions, but are not chemical names
- "Hydrophobia" will be matched with the regular expression /hydro/, but is not a chemical name
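
The "hydrophobia" case can be illustrated as follows. The two-word stopword set and the function name `looks_chemical` are placeholders; the real list is described as very large.

```python
import re

# Hypothetical miniature stopword list.
STOPWORDS = {"hydrophobia", "proxy"}
HYDRO = re.compile(r"hydro")

def looks_chemical(word):
    """Check the stopword list first, then fall back to the regular expression."""
    w = word.lower()
    if w in STOPWORDS:
        return False  # rejected before any regular expression is tried
    return HYDRO.search(w) is not None

print(looks_chemical("Hydrophobia"))  # stopword, so rejected
print(looks_chemical("hydrogen"))     # matches /hydro/ and is not a stopword
```

Checking the stopword set first is a cheap hash lookup, which is one plausible reading of "saves time later".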
|
|
14
|
- Regular expression stopwords
- A very small number of regular expressions which eliminate other strings that may occur
- For example, any number standing alone can be eliminated
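
As a sketch of the standalone-number rule, assuming "number" means a run of digits and that the whole word must match (the pattern list and function name are hypothetical):

```python
import re

# Hypothetical list with a single pattern: a word made only of digits.
REGEX_STOPWORDS = [re.compile(r"\d+")]

def is_regex_stopword(word):
    """True if the entire word matches one of the stopword patterns."""
    return any(p.fullmatch(word) for p in REGEX_STOPWORDS)
```

Using `fullmatch` rather than `search` matters here: a word that merely contains a digit is not eliminated.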
|
|
15
|
- Chemical names
- Exact matches
- All element names
- Common words for chemicals
- (words that would not be picked up by regular expressions, such as "salt" or "soda")
|
|
16
|
- Chemical name starts
- Regular expressions that often start long compounds
- \(\d\)- which will match (1)-
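
The pattern quoted on this slide can be tried directly in Python. Only the pattern itself comes from the slide; the function name and the test strings are made up for illustration.

```python
import re

# Pattern from the slide: one digit in parentheses, followed by a hyphen.
NAME_START = re.compile(r"\(\d\)-")

def has_chemical_start(word):
    """True if the word begins with a prefix such as "(1)-"."""
    return NAME_START.match(word) is not None
```

`re.match` anchors at the start of the string, so the prefix counts only when it begins the word.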
|
|
17
|
- Regular expressions
- Common chemical morphemes
- Show up in long compound names
- "hydro", "sulf", or "oxy"
- (non-chemical matches, such as "proxy", go in the stopword list)
- Keep it short
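
Putting the five word lists together, one possible lookup order is sketched below. The order simply follows the order of the slides and is an assumption, as are the function name `classify` and the miniature lists.

```python
import re

# Hypothetical miniature versions of the five word lists.
STOPWORDS = {"hydrophobia", "proxy"}
REGEX_STOPWORDS = [re.compile(r"\d+")]
CHEMICAL_NAMES = {"hydrogen", "sodium", "salt", "soda"}
NAME_STARTS = [re.compile(r"\(\d\)-")]
MORPHEMES = [re.compile(r"hydro|sulf|oxy")]

def classify(word):
    """Return True if the word is treated as chemical, else False."""
    w = word.lower()
    if w in STOPWORDS:
        return False                      # known non-chemical word
    if any(p.fullmatch(w) for p in REGEX_STOPWORDS):
        return False                      # e.g. a number standing alone
    if w in CHEMICAL_NAMES:
        return True                       # exact match
    if any(p.match(w) for p in NAME_STARTS):
        return True                       # typical start of a long compound
    return any(p.search(w) for p in MORPHEMES)  # morpheme anywhere in the word

for w in ["proxy", "oxygen", "sulfate", "salt", "42", "table"]:
    print(w, classify(w))
```

With this order, "proxy" is rejected by the stopword list before the "oxy" morpheme is ever tried.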
|
|
18
|
- Marks each word as chemical or non‑chemical
- Groups the chemical words into actual chemical terms
- "hydrogen peroxide" is returned as one term, not as two terms such as "hydrogen" and "peroxide"
- Determine words that are part of a longer term
- "acid" is not a chemical
- "hydrochloric acid" is
- Review adjectives
- "linear" can be the beginning of a chemical name
- It is not included if it is not followed by a chemical word
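
The grouping rules on this slide can be sketched as a pass over tagged words. The four tags used here (`chem`, `part`, `adj`, `other`) and the function name are assumptions made for the example; the slides do not say how the real tool represents them.

```python
def group_terms(tagged):
    """Join runs of chemical-related words into terms.

    tagged is a list of (word, tag) pairs with these hypothetical tags:
      "chem"  - a chemical word on its own ("hydrogen")
      "part"  - chemical only inside a longer term ("acid")
      "adj"   - an adjective that may begin a name ("linear")
      "other" - anything else
    """
    terms, run = [], []

    def flush():
        # An adjective with no chemical word after it is dropped.
        while run and run[-1][1] == "adj":
            run.pop()
        # Keep the run only if it holds at least one true chemical word.
        if any(tag == "chem" for _, tag in run):
            terms.append(" ".join(word for word, _ in run))
        run.clear()

    for word, tag in tagged:
        if tag == "other":
            flush()
        else:
            run.append((word, tag))
    flush()
    return terms

tagged = [
    ("the", "other"), ("hydrochloric", "chem"), ("acid", "part"),
    ("and", "other"), ("hydrogen", "chem"), ("peroxide", "chem"),
    ("in", "other"), ("linear", "adj"), ("alkylbenzene", "chem"),
    ("is", "other"), ("linear", "adj"), ("acid", "part"),
]
terms = group_terms(tagged)
print(terms)
```

In the sample input, the final "linear acid" is discarded because it contains no word that is chemical on its own.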
|
|
19
|
- Identify chemical words separated by non‑chemical words which have common endings or beginnings
- "sodium and potassium sulfates"
- returns "sodium sulfates" and "potassium sulfates"
- Chemical names, plus any synonyms, are presented to the user
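
The "sodium and potassium sulfates" expansion could be approximated as below. This sketch handles only the shared-ending case with a single connector word between two chemical words; the shared-beginning case, the function name, and the connector list are left as assumptions.

```python
def expand_shared_ending(words, is_chemical, connectors=("and", "or")):
    """Expand "A and B C" into ["A C", "B C"] when A, B and C are chemical."""
    out = []
    i = 0
    while i < len(words):
        if (
            i + 3 < len(words)
            and is_chemical(words[i])
            and words[i + 1] in connectors
            and is_chemical(words[i + 2])
            and is_chemical(words[i + 3])
        ):
            ending = words[i + 3]
            out.append(f"{words[i]} {ending}")
            out.append(f"{words[i + 2]} {ending}")
            i += 4  # skip past the whole pattern
        else:
            i += 1
    return out

# Hypothetical chemical-word test backed by a tiny set.
chemical = {"sodium", "potassium", "sulfates"}
words = "mix sodium and potassium sulfates with water".split()
expanded = expand_shared_ending(words, lambda w: w in chemical)
print(expanded)
```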
|
|
23
|
- Identifying chemical terms which are not used in the text‑containing documents as part of a chemical name
- Chemical names are bolded
- Placed in a list
- Synonyms
- Grouping non‑individual chemical names
- Identifying and expanding individual chemical names
- States number of occurrences
|
|
24
|
- ACD/Labs
- Vivisimo
- Mark Logic
- RedDot
- VxInsight
- VisWave
- More wanted – please ask
|
|
28
|
- Data Harmony Division
- Access Innovations, Inc.
- 131 Adams NE
- Albuquerque, NM 87108
- www.accessinn.com
- www.dataharmony.com
- Jay Ven Eman
- Marjorie M.K. Hlava
- Ask us about the other Data Harmony software products and Data Base
Services such as...
|