An introduction to content enrichment approaches

Content enrichment is the process of adding structure, context or metadata to content to make it more useful to humans and computers. It can be a manual or automated process or a combination of manual and automated (augmented) processes. A content enrichment capability typically incorporates technologies, workflow and people. 67 Bricks helps publishers select the right combination of manual and automated processes, technologies, workflows and tasks to deliver an organisation’s strategic objectives.

Statistical analysis uses approaches such as ‘bag of words’, corpus-based modelling and network analysis to identify important and relevant words, phrases, sentences, paragraphs and other content features. Once identified, these features can form part of a semantic fingerprint or be used to generate new metadata for a content item. Different approaches suit different scenarios, depending on the business objective. For example, a ‘bag of words’ approach is very fast and can therefore be used to analyse large quantities of legacy content, whilst a corpus-based approach might deliver a better result in a specific, well-defined content domain.
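As a rough illustration of the statistical idea, the sketch below treats each document as a bag of words and ranks the terms of one document by TF-IDF, so that words common to the whole collection score low. The function names, the sample documents and the choice of TF-IDF as the weighting are assumptions made for this example, not a description of any particular product.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def top_terms(documents, doc_index, n=3):
    """Rank the terms of one document by TF-IDF against the whole collection."""
    bags = [Counter(tokenize(d)) for d in documents]
    # Document frequency: the number of documents each term appears in.
    df = Counter(term for bag in bags for term in bag)
    bag = bags[doc_index]
    total = sum(bag.values())
    scores = {
        term: (count / total) * math.log(len(documents) / df[term])
        for term, count in bag.items()
    }
    # Highest score first; ties are broken alphabetically for stable output.
    return sorted(scores, key=lambda t: (-scores[t], t))[:n]

# Hypothetical sample collection.
docs = [
    "Aspirin reduces fever and aspirin reduces pain",
    "The court reduces the fine and costs",
    "Fever and pain are common symptoms",
]
```

Calling `top_terms(docs, 0, n=2)` ranks "aspirin" first because it is frequent in the first document and absent from the others, while "and" scores zero because it appears everywhere.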

Grammatical analysis looks at the structure of sentences and uses this information to identify important words or phrases. This approach is typically used in conjunction with a statistical approach to apply additional weighting to terms.

Rules-based approaches can be used to identify concepts that have a regular structure but for which a database of all known examples would be impractical, e.g. car number plates or postcodes.
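A minimal sketch of a rules-based approach, using a regular expression for UK postcodes. The pattern is a deliberately simplified assumption: it captures the general shape of a postcode but does not check that the postcode actually exists, and the function name is hypothetical.

```python
import re

# Simplified shape of a UK postcode: one or two letters, a digit, an optional
# letter or digit, an optional space, then a digit and two letters.
POSTCODE = re.compile(r"\b[A-Z]{1,2}[0-9][A-Z0-9]? ?[0-9][A-Z]{2}\b")

def find_postcodes(text):
    """Return every substring of the text that has the shape of a postcode."""
    return POSTCODE.findall(text)
```

For the text "Send it to SW1A 1AA or M1 1AE, not to ABC 123." this returns the two well-formed postcodes and ignores "ABC 123", without needing a list of every postcode in use.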

Machine learning is a form of artificial intelligence (AI) that enables computers to learn without being explicitly programmed. The process of learning typically involves introducing new data to the machine learning application, from which it can improve its processing algorithms. Machine learning has many applications within publishing; one of the most tested is automated classification. In this case we can feed the application with training data about how content items should be categorised. The application learns from this and applies categories to new content items. Implementing a feedback loop means that the application can continue to improve over time.
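The training and feedback cycle described above can be sketched with a small naive Bayes text classifier. Naive Bayes is chosen here only because it is compact enough to show in full; the class name, the categories and the training sentences are all assumptions made for illustration.

```python
import math
import re
from collections import Counter, defaultdict

class NaiveBayesClassifier:
    """Multinomial naive Bayes over word counts, with add-one smoothing."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # category -> word -> count
        self.doc_counts = Counter()              # category -> number of examples
        self.vocabulary = set()

    @staticmethod
    def tokenize(text):
        return re.findall(r"[a-z]+", text.lower())

    def learn(self, text, category):
        """Add one labelled example; feedback corrections use the same call."""
        words = self.tokenize(text)
        self.word_counts[category].update(words)
        self.doc_counts[category] += 1
        self.vocabulary.update(words)

    def classify(self, text):
        """Return the category with the highest log-probability for the text."""
        total_docs = sum(self.doc_counts.values())
        best_category, best_score = None, -math.inf
        for category in sorted(self.doc_counts):
            counts = self.word_counts[category]
            denominator = sum(counts.values()) + len(self.vocabulary)
            score = math.log(self.doc_counts[category] / total_docs)
            for word in self.tokenize(text):
                score += math.log((counts[word] + 1) / denominator)
            if score > best_score:
                best_category, best_score = category, score
        return best_category

# Hypothetical training data supplied by editors.
model = NaiveBayesClassifier()
model.learn("The court ruled on the appeal", "law")
model.learn("The judge dismissed the case", "law")
model.learn("The drug lowered blood pressure", "medicine")
model.learn("Patients received the vaccine", "medicine")

item = "The tribunal reviewed the vaccine patent"
before = model.classify(item)  # the model's first guess
model.learn(item, "law")       # feedback: an editor supplies the right category
after = model.classify(item)   # the guess after the correction
```

The feedback loop is nothing more than calling `learn` again with the corrected category, which is why the application can keep improving as editors review its output.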

Entity identification is the process of analysing content to find known concepts. This approach typically utilises a knowledge model, e.g. a taxonomy or ontology, that tells the system which concepts we are looking for. Being able to confirm that a phrase is an entity from a knowledge model supports additional product features that other approaches do not. For example, knowing that a phrase is a known drug, we can use its identifier to pull in data from other sources and present more value to the content consumer.
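In its simplest form, entity identification is a lookup of labels and synonyms from the knowledge model against the text. The sketch below assumes a tiny, made-up taxonomy fragment; the identifiers such as `DRUG:0001` and the function name are hypothetical.

```python
import re

# Hypothetical taxonomy fragment: label or synonym -> concept identifier.
TAXONOMY = {
    "aspirin": "DRUG:0001",
    "acetylsalicylic acid": "DRUG:0001",
    "warfarin": "DRUG:0002",
}

def find_entities(text, taxonomy=TAXONOMY):
    """Return (matched text, concept identifier) for each known label found."""
    # Try longer labels first so that multi-word labels win over shorter ones.
    labels = sorted(taxonomy, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(label) for label in labels) + r")\b",
        re.IGNORECASE,
    )
    return [(m.group(0), taxonomy[m.group(0).lower()]) for m in pattern.finditer(text)]
```

Because a synonym resolves to the same identifier as the preferred label, "Acetylsalicylic acid" and "Aspirin" both map to one concept, and that identifier is what allows data from other sources to be joined to the content.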

Ontological classification is the process of using a structured ontology of terms to classify a piece of content. The ontology typically needs to be designed for this specific purpose. A software application is then used to build classification rules based on the ontology that can be applied to content items. This is a good approach in domains where specific terms identify the category a piece of content falls within. For example, if the content mentions the Supreme Court, then we might be able to assume that this content item is referencing the US legal system.
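A minimal sketch of building classification rules from an ontology and applying them to a content item. The ontology fragment, its terms and the simple hit-counting score are assumptions for illustration; a production rule set would be far richer and would weight terms rather than count them equally.

```python
# Hypothetical ontology fragment: each category lists terms that signal it.
ONTOLOGY = {
    "US legal system": ["supreme court", "first amendment", "congress"],
    "UK legal system": ["crown court", "house of lords", "king's counsel"],
}

def build_rules(ontology):
    """Flatten the ontology into (term, category) classification rules."""
    return [(term, category) for category, terms in ontology.items() for term in terms]

def classify(text, rules):
    """Return the matching categories, ordered by number of rule terms found."""
    lowered = text.lower()
    scores = {}
    for term, category in rules:
        hits = lowered.count(term)  # simple substring match, for brevity
        if hits:
            scores[category] = scores.get(category, 0) + hits
    return sorted(scores, key=lambda c: (-scores[c], c))

rules = build_rules(ONTOLOGY)
```

A content item that mentions the Supreme Court and the First Amendment matches two rules for "US legal system" and none for the other category, which mirrors the reasoning in the example above.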

Sentiment analysis is the process of analysing text to determine the opinion of the author. Typically this is used to determine whether a piece of text is positive, negative or neutral towards a particular subject.
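One simple way to approximate this is a lexicon-based score: count positive and negative words and flip the polarity of a word that follows a negator. The word lists below are tiny and hypothetical, and the sketch scores the text as a whole rather than sentiment towards a particular subject.

```python
import re

# Tiny hypothetical lexicon; a real one would be much larger and domain-tuned.
POSITIVE = {"good", "excellent", "clear", "useful", "effective"}
NEGATIVE = {"bad", "poor", "confusing", "useless", "ineffective"}
NEGATORS = {"not", "never", "no"}

def sentiment(text):
    """Classify text as 'positive', 'negative' or 'neutral' from lexicon hits."""
    words = re.findall(r"[a-z]+", text.lower())
    score = 0
    for i, word in enumerate(words):
        polarity = (word in POSITIVE) - (word in NEGATIVE)  # +1, -1 or 0
        # Flip the polarity when the previous word is a negator ("not effective").
        if i > 0 and words[i - 1] in NEGATORS:
            polarity = -polarity
        score += polarity
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"
```

With this lexicon, "The guide is clear and useful" is scored as positive, "The treatment was not effective" as negative, and a sentence with no lexicon words as neutral.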

Relationship identification is the process of identifying connections between specific concepts within the content. This typically requires very specific rules for specific content domains, and good-quality results are hard to achieve, but it can deliver significant value when successful. For example, it might be extremely useful to know when a piece of content identifies a specific interaction between two drugs.
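To show why the rules need to be so domain-specific, the sketch below looks for a drug interaction only when two known drug names are joined by one of a few trigger phrases. The drug list, the phrases and the `interacts_with` relation name are assumptions for this example.

```python
import re

# Hypothetical list of known drug names, e.g. taken from a taxonomy.
DRUGS = ["warfarin", "aspirin", "ibuprofen"]
DRUG = r"\b(" + "|".join(DRUGS) + r")\b"

# Each rule is two drug names joined by a phrase that signals an interaction.
PATTERNS = [
    re.compile(DRUG + r"\s+(?:interacts|may interact)\s+with\s+" + DRUG, re.IGNORECASE),
    re.compile(DRUG + r"\s+(?:increases|reduces)\s+the\s+effect\s+of\s+" + DRUG, re.IGNORECASE),
]

def find_interactions(text):
    """Return (drug, 'interacts_with', drug) triples in order of appearance."""
    found = []
    for pattern in PATTERNS:
        for m in pattern.finditer(text):
            found.append((m.start(), (m.group(1).lower(), "interacts_with", m.group(2).lower())))
    return [triple for _, triple in sorted(found)]
```

A sentence that merely mentions two drugs, such as "Patients took aspirin and warfarin", yields nothing, because no rule phrase connects them. That precision is the benefit of narrow rules, and the phrasings they miss are the cost.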
