Computational linguistics is the scientific discipline concerned with the computational modelling of natural language, in both written and spoken form. It draws on a variety of fields, including linguistics, artificial intelligence, and cognitive science, to facilitate human interaction with machines by giving them a computational understanding of language.
From a theoretical standpoint, computational linguistics aims to formulate grammatical and semantic frameworks that characterize languages in a way that lets computers produce syntactically and semantically sound structures. A key goal is to discover the learning properties and processing methods underlying both the structural and statistical aspects of language.
Its practical goals include efficient text retrieval, effective question answering, translation, text summarization, sentiment analysis, and dialogue conducted with human-like proficiency.
Whether applied to natural languages or computer languages, parsing is essentially the process of analysing text ‘strings’ according to fundamental grammatical rules.
Traditionally, sentence parsing is performed to determine the exact meaning of a sentence with the help of syntax trees, which highlight grammatical roles and categories such as subjects and nouns, as follows:
Parsing, from a computational linguistics perspective, refers to the computational analysis of a sentence, segmenting its words according to their syntactic relations to one another.
First, in order to parse any natural language, programmers must decide on the grammar to be used; this choice addresses both linguistic and computational concerns. While some parsing systems use lexical functional grammar, others use head-driven phrase structure grammar.
Either way, many modern parsers are at least partly statistical, relying on a corpus of text for training. This training data has already been manually annotated, allowing the system to learn the frequency with which linguistic constructs occur in specific contexts.
One example of an algorithm used to parse natural language is the CYK algorithm, which uses dynamic programming to decide whether a sentence can be derived from a context-free grammar in Chomsky normal form.
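The idea can be sketched in a few lines. The toy grammar and sentence below are illustrative, not drawn from any real treebank; a minimal table-filling CYK recognizer under those assumptions looks like this:

```python
# A minimal sketch of the CYK algorithm for a toy grammar in Chomsky
# normal form (CNF). Grammar, lexicon, and sentence are made up for
# illustration.
from itertools import product

# CNF rules: a pair of right-hand-side symbols maps to the set of
# non-terminals that can produce it.
grammar = {
    ("NP", "VP"): {"S"},
    ("Det", "N"): {"NP"},
    ("V", "NP"): {"VP"},
}
lexicon = {"the": {"Det"}, "dog": {"N"}, "cat": {"N"}, "chased": {"V"}}

def cyk(words):
    n = len(words)
    # table[i][j] holds the non-terminals spanning words[i : i+j+1]
    table = [[set() for _ in range(n)] for _ in range(n)]
    for i, w in enumerate(words):
        table[i][0] = set(lexicon.get(w, ()))
    for span in range(2, n + 1):            # length of the span
        for start in range(n - span + 1):
            for split in range(1, span):    # where the span divides
                for b, c in product(table[start][split - 1],
                                    table[start + split][span - split - 1]):
                    table[start][span - 1] |= grammar.get((b, c), set())
    return "S" in table[0][n - 1]           # did we derive a sentence?

print(cyk("the dog chased the cat".split()))  # True
```

Because every cell is filled from smaller spans, the recognizer runs in cubic time in the sentence length, which is why CYK is a standard baseline for parsing with context-free grammars.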
As previously mentioned, the goals of computational linguistics are varied and span a broad spectrum. For this reason, linguistic systems are widely used in both business and scientific fields for various purposes, such as:
Spell checkers aim to detect and correct typographic and orthographic errors in a given text. Errors typically occur for one of two reasons: either people fail to coordinate the movements of their hands properly when typing, causing them to make mistakes, or they simply do not know the correct spelling of some words (especially across languages). First, the spell checker detects the strings that are not correct words. Once detected, these are highlighted and left for the user to correct in whichever way they prefer (manually or with the help of the program).
Spell checkers usually take a resource-intensive approach, combining a dictionary of words, a measure of word similarity, and a few assumptions about common typographic errors in a specific language. However, achieving deeper insight into correction problems requires a much more detailed understanding of morphology, which in turn allows a more compact dictionary to be modelled.
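The dictionary-plus-similarity approach can be sketched concretely. Here the "dictionary" is a tiny hand-made set and Levenshtein edit distance stands in for the similarity measure; both are illustrative choices, not what any particular spell checker uses:

```python
# Toy spell checker: look the word up in a dictionary; if it is absent,
# rank dictionary words by edit distance and suggest the closest ones.

def edit_distance(a, b):
    # Classic dynamic-programming Levenshtein distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

dictionary = {"language", "linguistics", "grammar", "parser", "sentence"}

def suggest(word, max_dist=2):
    if word in dictionary:
        return [word]                  # already spelled correctly
    candidates = [(edit_distance(word, w), w) for w in dictionary]
    return [w for d, w in sorted(candidates) if d <= max_dist]

print(suggest("grammer"))  # ['grammar']
```

A real system would also weight the assumed common typos (adjacent-key swaps, doubled letters) rather than treating all edits equally.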
Information retrieval systems (IRS) are designed to find relevant information within a large documentary database. An IRS operates in the following way: the system digests a given query by first trying to find the documents that contain all keywords present in the query. It then repeats the process, looking for documents containing all but one of the keywords, and continues until it searches for documents containing only one of the query's keywords. In the end, the documents containing more of the keywords are presented to the user first.
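That ranking loop amounts to ordering documents by how many query keywords they contain. A minimal sketch, using made-up document snippets:

```python
# Rank documents by the number of query keywords they contain:
# full matches first, then all-but-one, and so on down to one keyword.

docs = {
    "d1": "parsing natural language with a grammar",
    "d2": "statistical parsing of language corpora",
    "d3": "a grammar reference for students",
}

def retrieve(query):
    keywords = set(query.lower().split())
    scored = []
    for name, text in docs.items():
        hits = keywords & set(text.lower().split())
        if hits:                       # drop documents with no keyword
            scored.append((len(hits), name))
    # Documents containing more of the keywords come first.
    return [name for hits, name in sorted(scored, reverse=True)]

print(retrieve("parsing language grammar"))  # ['d1', 'd2', 'd3']
```

Real systems refine this with term weighting and indexing, but the match-count ordering is the core of the scheme described above.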
The qualitative characteristics of an IRS are typically recall and precision. Recall is the number of relevant documents retrieved divided by the total number of relevant documents existing in the database. Precision, on the other hand, is the number of relevant documents retrieved divided by the total number of documents retrieved, including both relevant and non-relevant ones.
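The two ratios are easy to compute for a single query; the document identifiers below are invented for the example:

```python
# Recall and precision for one query, on made-up document sets.

relevant  = {"d1", "d2", "d3", "d4"}   # all relevant documents in the database
retrieved = {"d1", "d2", "d5"}         # what the system actually returned

true_hits = relevant & retrieved       # relevant documents that were retrieved

recall    = len(true_hits) / len(relevant)    # 2 / 4 = 0.5
precision = len(true_hits) / len(retrieved)   # 2 / 3 ≈ 0.67

print(recall, round(precision, 2))  # 0.5 0.67
```

The example also shows the usual tension: retrieving more documents tends to raise recall while lowering precision.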
Nowadays, sophisticated systems can automatically generate sets of keywords when simply given a text or document. Internet search engines are typically based on this concept, called automatic abstracting. The main issue when building such a system is that, statistically, the most frequently used words are usually auxiliary verbs and other function words that do not reflect the essence of the text.
Summarization is often required to determine what documents are about, whether to classify them by topic for indexing in an IRS, to guide people through a large set of documents, and so on. There are various types of summarization; for the purpose of this explanation we will look at topical summarization, which functions in the following way:
First, the system neutralizes all morphological variation, reducing each word to its standard form (e.g. standing → stand, running → run). It then uses a large thesaurus to assign each word, in its standard form, a position within a pre-built hierarchy of topics. For example, the word dictate belongs under linguistics, which in turn belongs under social sciences, which in turn places it under the topic science.
Finally, the system counts how often each topic is mentioned in the document, and the most frequently mentioned topic is taken to be the document's main topic.
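The pipeline above can be sketched end to end. The normalization table and thesaurus here are tiny, made-up stand-ins for a real morphological analyzer and topic hierarchy:

```python
# Topical summarization sketch: normalize word forms, map words to
# topics via a (toy) thesaurus, and count topic mentions.
from collections import Counter

# Crude normalization table standing in for real morphology.
standard_form = {"running": "run", "dictates": "dictate", "words": "word"}

# Word -> topic assignments from a hypothetical thesaurus hierarchy.
thesaurus = {"dictate": "linguistics", "word": "linguistics", "run": "sport"}

def main_topic(text):
    topics = Counter()
    for w in text.lower().split():
        w = standard_form.get(w, w)     # reduce to standard form
        if w in thesaurus:
            topics[thesaurus[w]] += 1   # count the topic, not the word
    return topics.most_common(1)[0][0] if topics else None

print(main_topic("He dictates words while running"))  # linguistics
```

Two words map to linguistics but only one to sport, so linguistics wins the count, exactly as the step-by-step description predicts.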
One of the first automatic translation programs was developed more than 40 years ago.
Although many translation programs are available, ranging from large international projects to simple automatic dictionaries, their quality is still very low compared to manual human translation.
The most prominent issue in automatic translation is the weak disambiguation between word and sense. In any bilingual dictionary, each source word has a list of possible translations; which one should the program select? Overcoming this issue requires deep linguistic analysis of the text segment, so that the correct word can be chosen based on the meaning of the surrounding words in the text.
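A very shallow stand-in for that analysis is to score each candidate translation by its overlap with the surrounding words, in the spirit of Lesk-style disambiguation. The bilingual "dictionary" and its context clues below are entirely invented for illustration:

```python
# Toy sense selection: each candidate translation carries a set of
# typical context words; the candidate sharing the most words with the
# actual sentence is chosen.

candidates = {
    # Two hypothetical translations of one source word, with made-up
    # context clues for each sense.
    "bank (finance)": {"money", "account", "loan"},
    "bank (river)":   {"river", "water", "shore"},
}

def choose_translation(sentence):
    context = set(sentence.lower().split())
    # Pick the candidate whose clue words overlap most with the sentence.
    return max(candidates, key=lambda c: len(candidates[c] & context))

print(choose_translation("she opened an account at the bank"))
# bank (finance)
```

Real systems go far beyond bag-of-words overlap, but the example shows why context is indispensable: without the word "account", the two senses would be indistinguishable to the program.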