Content enrichment jargon buster

Content enrichment comes with a number of terms that need some clarification. Below we provide our definitions of the relevant terms and how they relate to content enrichment.


AI Artificial Intelligence. A broad-ranging term for the use of computers to simulate intelligent human behaviour in order to tackle complex problems that are difficult to solve using traditional computational approaches. It covers a wide range of problems, including machine vision, playing chess or Go, and understanding and translating documents. It encompasses many of the techniques of automated content enrichment, such as machine learning and natural language processing.

Algorithm A step-by-step sequence of rules that, when followed, produces a result such as a decision.

Automated curation Often the automatic classification of documents according to a taxonomy, but more widely also the identification of keywords, entities etc. that would previously have been carried out by a human indexer or taxonomist.

Cognitive computing A subset of AI that has been described as follows: "The goal of cognitive computing is to simulate human thought processes in a computerized model. Using self-learning algorithms that use data mining, pattern recognition and natural language processing, the computer can mimic the way the human brain works". IBM's Watson is often held up as the prime example. However, others suggest that the term is largely marketing hype and not demonstrably different from AI.

Content collection A set of content items gathered around a specific subject. Typically used to aid discovery by helping content consumers locate content around the subject they are interested in.

Content enrichment The application of modern content processing techniques like machine learning, AI and natural language processing to add structure, context and metadata to content to make it more useful to humans and computers.

Discoverability The ease with which a piece of information can be found. Discoverability can be greatly improved using content enrichment.

Document databases Databases designed to hold document-oriented data such as XML (eXtensible Markup Language) or JSON (JavaScript Object Notation). Generally the structure of the documents is not mandated but there is usually some support for indexing the content of the documents. This means it can hold data with different levels of detail and even different structures without problem.

Domain A specific area of knowledge or research, e.g. 17th-century playwrights or stem cell therapies.

Entity A term or phrase that represents a concept with an identifier from a known knowledge model e.g. a drug name, a place, an economic approach.

Entity page A page of information about a particular entity gathered from a variety of sources, brought together for the convenience of the reader - see Content collection.

Faceted search A means of finding content by applying multiple filters to defined categories. These filters could come from document metadata, predefined taxonomies or identified categories.
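
To make the idea of combining filters concrete, here is a minimal sketch of faceted search over a handful of in-memory records. The documents, the facet names (`subject`, `year`, `type`) and the helper functions are all invented for illustration; a real search platform would do this against an index rather than a Python list.

```python
from collections import Counter

# Hypothetical documents; the field names are illustrative assumptions.
DOCUMENTS = [
    {"title": "Stem cell therapies in practice", "subject": "medicine", "year": 2019, "type": "article"},
    {"title": "Restoration comedy", "subject": "literature", "year": 2019, "type": "book"},
    {"title": "Advances in stem cell research", "subject": "medicine", "year": 2021, "type": "article"},
    {"title": "17th-century playwrights", "subject": "literature", "year": 2021, "type": "article"},
]


def faceted_search(documents, **filters):
    """Return the documents whose fields match every supplied facet value."""
    return [
        doc for doc in documents
        if all(doc.get(facet) == value for facet, value in filters.items())
    ]


def facet_counts(documents, facet):
    """Count how many documents fall under each value of one facet."""
    return Counter(doc[facet] for doc in documents)


# Apply two filters at once, then show how the results split by year.
results = faceted_search(DOCUMENTS, subject="medicine", type="article")
print([doc["title"] for doc in results])
print(facet_counts(results, "year"))
```

The counts per facet value are what a search interface would typically show next to each filter, so the user can see how many results remain before narrowing further.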

Graph A collection of concepts (nodes) connected by defined relationships (edges). Because the relationships are stored explicitly, a graph can typically be queried by meaning, for example by following the connections from one concept to related ones. This structure is used in computing to model entities and relationships in graph databases such as Neo4j and in RDF triplestores.

Keyword A term or phrase that has been identified as important or relevant in defining a piece of content. Keywords are different from entities in that they do not have a domain specific identifier and they have not been located from within a known knowledge model.

Knowledge model A taxonomy, ontology or other data structure containing concepts that aims to provide structure to a specific domain of knowledge.

Linked Data A set of best practices for publishing and connecting structured data on the web.

Linked Open Data A set of Linked Data that is made available to all.

Machine learning A form of artificial intelligence (AI) that enables computers to learn without being explicitly programmed. A machine learning system discovers patterns in a large set of existing data and uses them to suggest facts about new data. For example, given a large set of pictures, some of which are flagged as containing faces, a machine learning system can "learn" to recognise pictures of faces.

Metadata Structured data about data. For example, the title, author and creation date of a document are metadata, while the content of the document itself is the data.

Metadata schema The set of metadata that is applicable to a set of documents. Standard schemas exist for different purposes, e.g. Dublin Core for document resources.

Natural language processing (NLP) Automatic processing of ordinary human language, usually to determine the topics being discussed, identify the sentiments being expressed, summarize the discussion and so on. NLP is a component of artificial intelligence (AI).

NoSQL databases Databases where the data is not structured in tables as in a traditional relational database. This includes graph, key-value, column and document-oriented databases. There can be advantages to using these databases, but getting those benefits often requires careful thought about how the data will be used.

Ontology A description of the classes (types of object) in a domain and the properties (relationships) between them. Typically used with RDF to describe how data, such as bibliographic data, is structured. An ontology can be used to infer information and to map data between different ontologies.

RDF A mechanism for modelling information as a set of "triples" - i.e. statements split into a subject, predicate and object such as 'the cat' 'sat on' 'the mat'. This makes it easy to model and query entities and their relationships. RDF is often stored within a triplestore and queried via SPARQL.
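
As a rough illustration of the triple model, the sketch below holds a few statements as plain Python tuples and finds them by pattern, which is loosely what a triplestore query does. It is a simplification under stated assumptions: real RDF identifies subjects and predicates with IRIs rather than free text, and the `match` helper is a hypothetical stand-in, not part of any RDF library.

```python
# Each statement is a (subject, predicate, object) triple.
# Plain strings are used for readability; real RDF would use IRIs.
TRIPLES = [
    ("the cat", "sat on", "the mat"),
    ("the cat", "is a", "animal"),
    ("the dog", "sat on", "the sofa"),
    ("the dog", "is a", "animal"),
]


def match(triples, subject=None, predicate=None, obj=None):
    """Return the triples that fit the pattern; None acts as a wildcard."""
    pattern = (subject, predicate, obj)
    return [
        triple for triple in triples
        if all(want is None or want == got for want, got in zip(pattern, triple))
    ]


# "What sat on what?"
print(match(TRIPLES, predicate="sat on"))

# "What did the cat sit on?"
print(match(TRIPLES, subject="the cat", predicate="sat on"))
```

Leaving one or more positions as wildcards plays a similar role to the variables in a SPARQL query pattern.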

Semantic computing Another term describing the use of AI to analyse meaning.

Semantic enrichment The process of adding meaning to content. See content enrichment.

Semantic fingerprint The complete set of metadata for a document that describes it uniquely and can therefore be used to match the document with a degree of similarity against other documents. This can be used, for example, to personalise a user's search experience.
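
One simple way to picture matching by fingerprint is to treat each fingerprint as a set of terms and measure how much two sets overlap. The sketch below uses Jaccard similarity for this, which is an assumption made for illustration and only one of many possible measures; the document identifiers and terms are invented.

```python
def jaccard(a, b):
    """Share of terms two fingerprints have in common (0.0 to 1.0)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


# Hypothetical fingerprints: each document is described by a set of terms.
FINGERPRINTS = {
    "doc-1": {"stem cells", "therapy", "clinical trial"},
    "doc-2": {"stem cells", "therapy", "regulation"},
    "doc-3": {"playwrights", "17th century", "theatre"},
}


def most_similar(doc_id, fingerprints):
    """Return the id of the other document with the closest fingerprint."""
    target = fingerprints[doc_id]
    others = {key: terms for key, terms in fingerprints.items() if key != doc_id}
    return max(others, key=lambda key: jaccard(target, others[key]))


print(jaccard(FINGERPRINTS["doc-1"], FINGERPRINTS["doc-2"]))
print(most_similar("doc-1", FINGERPRINTS))
```

A personalised search could, for instance, rank results by their similarity to documents the user has already read.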

Semantic search Searching semantically enriched information instead of the text to provide more relevant and detailed results. For example, searching for an author of scientific papers regardless of how their name is listed in those papers.

Sentiment analysis A particular type of natural language processing concerned with determining the opinion being expressed by the author, particularly whether it is positive, negative or neutral towards a given subject. This is especially useful in automatically categorising market feedback.

SPARQL A query language for RDF, in the same way as SQL is a query language for relational databases.

Taxonomic browsing A website feature whereby the user navigates documents via the site's taxonomy or several taxonomies, which can be quite elaborate with multiple facets.

Taxonomy Typically a system for hierarchical classification. Often used interchangeably with thesaurus, term list, and ontology - which are all related systems for organizing knowledge.

Text analytics The process of analysing natural language text to derive information about a content item, such as identifying and extracting entities within it.

Text mining The process of using a computer program to automatically scan one or more documents and extract information by analysing the text it is written in.

tf-idf "Term frequency - inverse document frequency" - an approach for determining the distinctive terms within a document based on how frequently those terms appear in that document versus how frequently they appear in a wider corpus of documents.
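
The calculation can be sketched in a few lines. This version assumes one common formulation, where the term frequency is the term's share of the words in the document and the inverse document frequency is the natural logarithm of (number of documents / number of documents containing the term); other weightings exist. The three-sentence corpus is invented for illustration.

```python
import math
from collections import Counter

# A tiny, made-up corpus of three "documents".
CORPUS = [
    "the cat sat on the mat",
    "the dog sat on the sofa",
    "the cat chased the dog",
]


def tf_idf(corpus):
    """Return one {term: score} dict per document in the corpus."""
    tokenised = [doc.lower().split() for doc in corpus]
    # Document frequency: the number of documents containing each term.
    doc_freq = Counter(term for tokens in tokenised for term in set(tokens))
    scores = []
    for tokens in tokenised:
        counts = Counter(tokens)
        scores.append({
            term: (count / len(tokens)) * math.log(len(corpus) / doc_freq[term])
            for term, count in counts.items()
        })
    return scores


scores = tf_idf(CORPUS)
for text, doc_scores in zip(CORPUS, scores):
    # Print each document with its single most distinctive term.
    print(text, "->", max(doc_scores, key=doc_scores.get))
```

A word such as "the", which appears in every document, scores zero however often it occurs, while a word found in only one document scores highest there.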

Topic maps A way of visualising the topics that describe a set of documents, and the connections between them.

Training set A set of documents that are pre-defined (by a subject matter expert) as belonging unequivocally to one topic within a set of topics. These documents can be used as 'archetypes' to train a computer algorithm to correctly categorise new documents into the most appropriate topic.
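
The sketch below shows the role a training set plays, using a deliberately naive classifier: it remembers which words appear in the example documents for each topic and assigns a new document to the topic whose vocabulary it overlaps most. The example texts, topic labels and the `train` and `classify` helpers are all hypothetical, and production systems use far more robust statistical models.

```python
from collections import defaultdict

# Hypothetical training set: (document text, topic chosen by an expert).
TRAINING_SET = [
    ("stem cells can repair damaged tissue", "medicine"),
    ("clinical trials test new drug therapies", "medicine"),
    ("the playwright staged a new comedy", "theatre"),
    ("actors rehearsed the play on stage", "theatre"),
]


def train(training_set):
    """Collect the set of words seen in the examples for each topic."""
    vocabulary = defaultdict(set)
    for text, topic in training_set:
        vocabulary[topic].update(text.lower().split())
    return dict(vocabulary)


def classify(model, text):
    """Pick the topic whose vocabulary shares the most words with the text."""
    words = set(text.lower().split())
    return max(model, key=lambda topic: len(words & model[topic]))


model = train(TRAINING_SET)
print(classify(model, "stem cell drug trials"))
print(classify(model, "the actors staged a comedy"))
```

The quality of the result depends heavily on the examples: documents that sit ambiguously between topics would blur the vocabularies and make new documents harder to place.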

Triplestore A database that stores RDF triples (subject, predicate, object).

XML schema languages Languages used to describe and validate the structure of an XML document. These languages include DTD, XML Schema (XSD) and RELAX NG.

XQuery A query language for XML data, typically held in XML-aware document databases, in the same way as SQL is a query language for relational databases.
