
The challenge with NLP: structuring data that is sometimes highly unstructured

11/10/2021 by Johannes Humbert

This article is an excerpt from the whitepaper “Functionality and areas of operation of Natural Language Processing”.

Humans need years to learn how to use language. Its nuances and countless irregularities make meaning hard to recognise and distinguish, and that in turn makes it hard for NLP solutions to convert unstructured language into structured, usable data. The difficulty is not limited to sarcasm, metaphors and synonyms. Grammar, syntax and idioms all have their exceptions, and above all, meaning depends on context. Example:

The German sentence for “I’m going to my bank” can mean two things, because the German word “Bank” denotes both a bank and a bench: going to a branch of a bank to deposit money, or intending to leisurely read a book on a bench at one’s favourite spot in a park.
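How context can resolve such an ambiguity can be illustrated with a very small sketch. It counts how many words in the sentence overlap with a set of cue words for each sense, loosely in the spirit of dictionary-overlap approaches. The sense inventory, the cue words and the function name `disambiguate` are assumptions made up for this illustration; real NLP solutions typically rely on trained models rather than hand-written word lists.

```python
from typing import Optional

# Hypothetical mini sense inventory for the German noun "Bank".
# The cue words are illustrative assumptions, not a real lexicon.
SENSES = {
    "financial institution": {"geld", "konto", "einzahlen", "filiale", "kredit"},
    "bench": {"park", "sitzen", "buch", "lesen", "garten"},
}


def disambiguate(sentence: str, senses=SENSES) -> Optional[str]:
    """Return the sense whose cue words overlap most with the sentence.

    Returns None when no cue word occurs or when the best score is tied,
    i.e. when the context does not settle the meaning.
    """
    # Crude tokenisation: split on whitespace, strip punctuation, lowercase.
    tokens = {t.strip(".,!?\"'").lower() for t in sentence.split()}
    scores = {sense: len(tokens & cues) for sense, cues in senses.items()}
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    best_sense, best_score = ranked[0]
    tied = len(ranked) > 1 and ranked[1][1] == best_score
    return None if best_score == 0 or tied else best_sense


if __name__ == "__main__":
    print(disambiguate("Ich gehe zu meiner Bank und will Geld einzahlen."))
    print(disambiguate("Ich gehe zu meiner Bank im Park und lese ein Buch."))
    print(disambiguate("Ich gehe zu meiner Bank."))
```

Without any surrounding context, as in the last call, the sketch deliberately returns no sense at all, which mirrors the point of the example: the sentence alone is ambiguous.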

NLP Use Cases

Specific challenges with spoken language:

Dialects, slang, loanwords, mumbling, poor connections, speech that is reeled off quickly, variations in pitch and amplitude, rhetorical questions, half-sentences, single-word utterances, etc. All of these make recognition harder, and with it the creation of context, which is essential for processing.

Specific challenges with written language:

Typing errors, variant spellings and sentence structures, grammatical errors, missing or incorrect punctuation marks, and abbreviations pose major but solvable challenges for NLP solutions, as many entities have to be recognised and correctly assigned. Homonyms, words that are spelled the same but have different meanings, are among these challenges too. For example, the German word “Band” can be a volume in a multi-part book series, a narrow strip of textile, or a music group.
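Typing errors are among the more tractable problems in this list. The following sketch maps a misspelled token to the closest entry in a known vocabulary using `difflib.get_close_matches` from the Python standard library. The vocabulary, the similarity cutoff of 0.8 and the function name `correct_token` are assumptions chosen for this illustration, not part of any particular NLP product.

```python
from difflib import get_close_matches

# Hypothetical domain vocabulary; a real system would use a much larger one.
VOCABULARY = ["invoice", "delivery", "contract", "payment"]


def correct_token(token: str, vocabulary=VOCABULARY, cutoff: float = 0.8) -> str:
    """Return the most similar vocabulary entry, or the token itself if none is close."""
    matches = get_close_matches(token.lower(), vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else token


if __name__ == "__main__":
    for word in ["invoise", "paymnt", "xyz"]:
        print(word, "->", correct_token(word))
```

A high cutoff keeps unrelated words untouched at the cost of missing heavier misspellings. Where that trade-off should sit depends on the data and is best tuned on real examples.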

NLP solutions recognise context and improve continuously: they learn from data and thereby optimise themselves automatically. The more training data is available at the start of an NLP project, the better.


