Where did we come from? Where do we go? (And will we ever reveal the identity of Cotton-Eyed Joe?)
Answering these questions starts with the word ontology, the name for a branch of metaphysics focused on the nature and grouping of entities that exist (or are said to exist) according to their similarities and differences.
Origins of Ontology
Ontology comes from the Greek ontos (being: that which is) and logia (logical discourse), and the term has generally been used to assert that different languages and cultures have different rules for describing reality and actuality as they relate to entities, ideas, and events.
But creating ontologies has a less metaphysical, more practical purpose as well. In information science, an ontology is a formal naming and definition of the categories, properties, and relations among concepts and data within a domain. (In AI, a “domain” is the subject area being studied, such as “customer complaints to a retail bank” or “the insurance industry.”)
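To make the definition concrete, here is a minimal sketch of what an ontology can look like as a data structure: categories arranged in a hierarchy, plus relations between concepts. The domain, category names, and relations below are invented for illustration only.

```python
# A toy ontology for a hypothetical retail-banking domain:
# categories with parent links (a hierarchy) and relation triples.
ontology = {
    "categories": {
        "Complaint": {"parent": "CustomerInteraction"},
        "CustomerInteraction": {"parent": None},
        "OverdraftFee": {"parent": "Fee"},
        "Fee": {"parent": None},
    },
    "relations": [
        # (subject, relation, object)
        ("Complaint", "is_about", "Fee"),
        ("OverdraftFee", "charged_on", "CheckingAccount"),
    ],
}

def is_a(ontology, category, ancestor):
    """Walk the parent chain to test whether category falls under ancestor."""
    while category is not None:
        if category == ancestor:
            return True
        category = ontology["categories"].get(category, {}).get("parent")
    return False

print(is_a(ontology, "OverdraftFee", "Fee"))   # True
print(is_a(ontology, "Complaint", "Fee"))      # False
```

Even this tiny example shows the core idea: the system can only reason about “OverdraftFee” being a kind of “Fee” because someone wrote that link down.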
Many natural language systems have been built on the idea that, if we can gather enough data and then properly define the relationships between different words and concepts in that data, we can create a system that’s very good at understanding what someone is talking about within a specific subject.
These contextual differences between real objects or entities and the broader, subtly related words that identify them make programming natural language understanding (NLU) or machine learning systems a very tall order.
Why So Many Natural Language Understanding Systems Need Ontologies
Most natural language technologies rely on ontologies and long keyword lists because they need a reference for domain-specific concepts and relationships. For instance, the word “remote” refers to a kind of access in IT, but to a handheld device for a cable TV company. However, these ontologies are often built manually, making them highly error-prone and costly, not to mention the constant upkeep required to keep them relevant.
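The “remote” example amounts to a per-domain lookup: the same surface word resolves to different concepts depending on which domain ontology is active. A minimal sketch, with invented domain labels and sense strings:

```python
# The same keyword maps to a different concept in each domain ontology.
# Domain names and senses here are illustrative, not from any real system.
domain_senses = {
    "it_support": {"remote": "remote access to a machine"},
    "cable_tv":   {"remote": "handheld remote control device"},
}

def resolve(term, domain):
    """Look up a term's sense in the active domain's vocabulary."""
    return domain_senses.get(domain, {}).get(term, "unknown sense")

print(resolve("remote", "it_support"))  # remote access to a machine
print(resolve("remote", "cable_tv"))    # handheld remote control device
```

The maintenance burden follows directly: every new domain, and every new sense within a domain, is another entry someone has to write and keep current by hand.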
The other issue here is that these systems require some kind of synonym representation, ultimately limiting them to controlled vocabularies and very few resources to pull from when trying to make sense of a domain-specific dataset. For the system to correctly identify a term, it needs not only the term itself, but also how it relates to other terms.
For instance, if a natural language system looks for the keyword or concept “pulmonary hypertension” but has no reference telling it that the term is related to high blood pressure in the lungs, or that it affects the circulatory system, it simply won’t make the right connections.
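In code, this kind of reference is a map from a term to its related terms, and “making the connection” is expanding a query through it. A sketch, with the relations hand-written from the example in the text:

```python
# A hand-built relatedness table: without these edges, pure keyword
# matching on "pulmonary hypertension" misses closely related mentions.
related = {
    "pulmonary hypertension": {
        "high blood pressure in the lungs",
        "circulatory system",
    },
}

def expand(term):
    """Return the term plus everything the reference links it to."""
    return {term} | related.get(term, set())

terms = expand("pulmonary hypertension")
# A document that only says "high blood pressure in the lungs" now matches:
print("high blood pressure in the lungs" in terms)  # True
# A term with no entry in the table expands to just itself:
print(expand("remote"))  # {'remote'}
```

The failure mode is equally visible: delete the entry from `related` and the expansion collapses to the bare keyword, which is exactly the situation the paragraph describes.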
Why Luminoso Doesn’t Need Ontologies
Luminoso has no need for ontologies because it natively “learns” new domain-specific language with the help of ConceptNet, which provides a foundation of 28 million concepts and their connections to one another. This knowledge base is a collection of facts about how the world works (things like, “the sun is hot,” or “dogs and cats can both be pets”) that gives our software the same understanding that a human would have going into a conversation.
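The facts in such a knowledge base are commonly stored as (subject, relation, object) triples. Here is a minimal sketch using only the two example facts from the text; “IsA” and “HasProperty” are genuine ConceptNet relation names, but this tiny fact set and the helper functions are illustrative, not Luminoso’s actual machinery.

```python
# ConceptNet-style facts as (subject, relation, object) triples.
facts = [
    ("sun", "HasProperty", "hot"),
    ("dog", "IsA", "pet"),
    ("cat", "IsA", "pet"),
]

def objects_of(subject, relation):
    """All objects linked to subject via the given relation."""
    return {o for s, r, o in facts if s == subject and r == relation}

def shared_category(a, b):
    """Categories that both a and b belong to (via IsA edges)."""
    return objects_of(a, "IsA") & objects_of(b, "IsA")

print(objects_of("sun", "HasProperty"))  # {'hot'}
print(shared_category("dog", "cat"))     # {'pet'}
```

The point of scale is that ConceptNet supplies millions of such edges up front, so the software starts with background knowledge instead of an empty table that humans must fill in per domain.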
With this valuable foundation, our software can quickly understand new words and map out the relationships between different concepts in the data with a much higher degree of accuracy and relevance. Luminoso also analyzes word frequency, allowing it to de-prioritize very common words and prioritize less common ones, which offer much more relevant information about the dataset being analyzed.
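The frequency idea can be sketched with plain inverse document frequency: words that appear in nearly every document get weight near zero, while rarer words score higher. This is a standard textbook weighting used here for illustration, not a description of Luminoso’s actual algorithm, and the sample documents are invented.

```python
import math
from collections import Counter

# Toy corpus of customer comments (invented for illustration).
docs = [
    "the battery died after one week",
    "the screen is great but the battery drains fast",
    "the delivery was on time",
]

# Count how many documents each word appears in (document frequency).
doc_count = Counter()
for doc in docs:
    doc_count.update(set(doc.split()))

def weight(word):
    """Inverse document frequency: common words -> low weight."""
    df = doc_count.get(word, 0)
    return 0.0 if df == 0 else math.log(len(docs) / df)

print(weight("the"))                       # 0.0 (appears in every document)
print(weight("battery") > weight("the"))   # True
```

With this weighting, a word like “the” contributes nothing, while “battery” — present in only some documents — stands out as a signal worth prioritizing.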
This “no ontologies” idea may confuse data scientists used to spending months collecting data and hand-tuning rules and hierarchies associated with a specific industry, like financial services or consumer electronics. But it’s true: Luminoso’s approach effectively develops an understanding of the language based on one and only one domain: the subject of whatever documents Luminoso analyzes.
So, in spite of ontologies being a logical way to give these systems a reference point for human language, the upkeep and the amount of manual tagging they require make ontologies difficult to scale and even more difficult to use on larger datasets.