Computational linguistics stretches all the way back to the Cold War era, when we built machines to translate Russian into English. This task is called machine translation, or MT.
To give you an idea of the state of computing back then, computers were the size of rooms and took their input from punch cards.
Computational Perspective
What does it mean for a computer to communicate in or interpret a human language?
Language involves complex symbol systems.
Computers are very fast mechanical symbol-processors.
A computer’s linguistic capabilities come from the programs we write for it.
There are natural connections between linguistic processing and computation, and between the complexity of linguistic patterns and the complexity of mathematical models of computation.
I mean, think about the Star Trek Universal Translator!
Statistical Analysis
Methods based on statistical analyses of linguistic data have improved the accuracy with which systems carry out tasks like understanding the syntactic structure of a sentence.
Their success also suggests that humans might learn language from experience, by induction over statistical regularities.
So let’s jump right into morphological processing.
Morphological processing
The task of an automatic morphological analyzer is to take a word in a language and break it down into its stem along with any modifiers that are attached.
Tokenization
The first step is to identify separate words. This can be easy in a language like English, where we have delimiters like spaces and punctuation, and sentences start with capital letters. But in languages that don’t mark word boundaries, like Japanese, the task is way harder.
With Kanji in particular, there’s genuine ambiguity about how certain common multi-character words should be segmented, and different analyzers make different segmentation decisions.
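To make the contrast concrete, here is a minimal Python sketch. The function names, the regular expression, and the toy lexicon are all assumptions made up for illustration. The first function splits English-style text on spaces and punctuation. The second uses greedy longest-match segmentation, one simple strategy among many for text without word boundaries, and is not how any particular Japanese analyzer works.

```python
import re

def tokenize(text):
    """Split delimiter-separated text into words and punctuation marks."""
    # \w+(?:'\w+)* keeps contractions such as "don't" in one piece;
    # [^\w\s] picks up each punctuation character as its own token.
    return re.findall(r"\w+(?:'\w+)*|[^\w\s]", text)

def max_match(text, lexicon):
    """Greedy longest-match segmentation of text that has no spaces."""
    max_len = max(len(word) for word in lexicon)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink it.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # A single character is always accepted as a fallback.
            if size == 1 or candidate in lexicon:
                tokens.append(candidate)
                i += size
                break
    return tokens

# Hypothetical toy lexicon, chosen to force an ambiguous segmentation.
TOY_LEXICON = {"the", "theta", "table", "bled", "down", "own", "there"}

print(tokenize("Computers don't read, they compute."))
# ['Computers', "don't", 'read', ',', 'they', 'compute', '.']
print(max_match("thetabledownthere", TOY_LEXICON))
# ['theta', 'bled', 'own', 'there']
```

The second result is a legal segmentation but probably not the intended one ("the table down there"), which is the same kind of ambiguity that makes analyzers disagree with each other.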
Stemming
Sometimes, instead of a full analysis, a simple stemming algorithm is used, which strips a word of all its modifiers. This is good enough for search engines, which are perfectly happy to find love when the word given is loved.
You could also use a fully inflected lexicon, which lists every possible affixed form of every word in the language, but that can be impractical: Tamil has about 2,000 inflected forms per verb, while the number of inflected forms for a given Turkish stem may be in the millions.
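The suffix-stripping idea can be sketched in a few lines of Python. This is a deliberately naive illustration, not a real stemming algorithm: the suffix list, the minimum stem length, and the function name are all assumptions chosen so that the love/loved example works.

```python
# Hypothetical suffix list, checked in this order.
SUFFIXES = ("ing", "ed", "es", "e", "s")
MIN_STEM_LENGTH = 3  # avoid stripping short words like "the" down to nothing

def stem(word):
    """Strip the first matching suffix, if enough of the word remains."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM_LENGTH:
            return word[:-len(suffix)]
    return word

for form in ("love", "loved", "loves", "loving"):
    print(form, "->", stem(form))
# love -> lov
# loved -> lov
# loves -> lov
# loving -> lov
```

Note that the stem does not have to be a real word. For search, it only matters that the query and the document end up with the same stem. A sketch this simple also misses spelling changes: it would not connect cried to cry, which is where a proper morphological analysis comes in.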
Morphological analysis and synthesis
So, you may be familiar with the idea that whenever we learn a language, we learn the patterns first and are then taught the exceptions. Computers are no different.
Figure: a finite-state transducer
Figure: a declarative pattern-action morphological rule
We do this with a pattern-action approach. The one I’m going to show is declarative, meaning it separates the linguistic data from the computer programs that operate on that data. Those programs are called finite-state transducers, since they have a finite number of states (6 in this case) and that number is fixed before runtime.
Let’s walk through the computer analyzing cried using the transducer:
- It starts in its initial state; on receiving c, it prints c and advances to state 2.
- On seeing r it prints r and advances to state 3.
- On seeing i as input, it prints y as output, advancing to state 4.
- On seeing e in its input, it prints + as output, advancing to state 5.
- On seeing d as input, it prints ed as output, and reaches the final state.
Given cried as input, the transducer yields cry+ed as output. Finite-state transducers are bidirectional, so the same one can be used for both analysis and synthesis.
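Here is a rough Python sketch of the walk-through above. It hard-codes only the five transitions described for cried, so it is a toy and not a general analyzer. The table layout, the names, and numbering the initial state 1 and the final state 6 are assumptions made for illustration.

```python
# (current state, input read) -> (output printed, next state)
ANALYSIS = {
    (1, "c"): ("c", 2),
    (2, "r"): ("r", 3),
    (3, "i"): ("y", 4),
    (4, "e"): ("+", 5),
    (5, "d"): ("ed", 6),
}
START_STATE = 1
FINAL_STATES = {6}

def invert(transitions):
    """Swap input and output labels to run the transducer backwards."""
    return {(src, out): (inp, dst) for (src, inp), (out, dst) in transitions.items()}

def transduce(transitions, text):
    """Run the transducer over text; return the output, or None on failure."""
    state, pos, output = START_STATE, 0, []
    while pos < len(text):
        for (src, label), (emit, dst) in transitions.items():
            # Labels may be longer than one character (e.g. "ed").
            if src == state and text.startswith(label, pos):
                output.append(emit)
                state, pos = dst, pos + len(label)
                break
        else:
            return None  # no transition fits the remaining input
    return "".join(output) if state in FINAL_STATES else None

SYNTHESIS = invert(ANALYSIS)

print(transduce(ANALYSIS, "cried"))    # cry+ed
print(transduce(SYNTHESIS, "cry+ed"))  # cried
```

The rules live in a plain table and the `transduce` function knows nothing about English, which is the declarative split between data and program. Running the inverted table shows the bidirectional part: the same five rules handle both analysis and synthesis.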
Acknowledgements
If you liked this post and want to learn more, we highly recommend Contemporary Linguistics: An Introduction by William O’Grady. It’s a great textbook on linguistics in general and would make a great addition to any scientist’s library.