Sentence Segmentation
Getting a computer to figure out where English sentences begin and end hasn’t been perfectly solved. There are libraries that do it, but they all seem focused on getting incrementally closer to how humans handle rare, ambiguous, or grammatically incorrect cases (and publishing academic papers along the way).
As a small part of a fun side project that’s just getting started, I needed sentence breaking specifically for well-edited text. For example, if someone forgot to capitalize the start of their sentence, I’d rather just move on than try to compensate with something that could misfire on valid sentences but makes a great research paper.
Besides, the whole project is for fun, and writing this part sounded fun.
Code
Here’s the core of the result in Python:
def ends_sentence(word, next_word):
    return (punctuated_as_end(word) and capitalized(next_word) and
            not (capitalized(word) and abbreviated_name(word)) and
            not acronym(word))
The basic approach was to spend a weekend turning the crank of Test-Driven Design (TDD): starting with simple test cases, moving on to more sophisticated ones, and finishing with the test cases in "How to Split Sentences" by actual experts Olexiy Sliusarenko & Vsevolod Dyomkin.[1]
You can check the full code to see all the tests and which of them actually pass (there’s no perfect solution, after all). The main helpers are below:
import re

# letters(), END_PUNCTUATION, and ABBREVIATED_NAMES are defined in the full code.

def punctuated_as_end(word):
    # True if the trailing punctuation includes a sentence-ending mark.
    try:
        punct_suffix = re.findall(r"\W+$", word)[-1]
    except IndexError:
        return False
    return not END_PUNCTUATION.isdisjoint(frozenset(punct_suffix))

def capitalized(word):
    return re.match("[A-Z][^A-Z]*", letters(word))

def acronym(word):
    return re.match(r"[A-Z.]+\.$", word)

def abbreviated_name(word):
    ltrs = letters(word)
    if re.match("[A-Z]*$", ltrs):
        return True  # initial
    return ltrs in ABBREVIATED_NAMES
As sometimes happens with TDD, the whole is more accurate than its parts. For example, abbreviated_name calls “I” an abbreviated name, but the other checks for things like a period and the next word being capitalized usually catch that.
As a final, non-TDD touch, the lookup sets were beefed up with obvious alternatives. (For example, there’s only a test for “Mr.,” but ABBREVIATED_NAMES has “Mrs.” as well.)
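To try the pieces together, here is a minimal, self-contained sketch. The four helpers and ends_sentence follow the excerpts above, but the definitions of letters, END_PUNCTUATION, and ABBREVIATED_NAMES are assumptions made up for illustration (the real ones live in the full code and may differ), and split_sentences is a hypothetical driver that is not part of the original.

```python
import re

# Assumed stand-ins: the excerpts above do not show these definitions.
END_PUNCTUATION = frozenset(".!?")
ABBREVIATED_NAMES = frozenset({"Mr", "Mrs", "Ms", "Dr"})


def letters(word):
    """Assumed helper: keep only the ASCII letters of a token."""
    return re.sub(r"[^A-Za-z]", "", word)


def punctuated_as_end(word):
    try:
        punct_suffix = re.findall(r"\W+$", word)[-1]
    except IndexError:
        return False
    return not END_PUNCTUATION.isdisjoint(frozenset(punct_suffix))


def capitalized(word):
    return re.match("[A-Z][^A-Z]*", letters(word))


def acronym(word):
    return re.match(r"[A-Z.]+\.$", word)


def abbreviated_name(word):
    ltrs = letters(word)
    if re.match("[A-Z]*$", ltrs):
        return True  # initial
    return ltrs in ABBREVIATED_NAMES


def ends_sentence(word, next_word):
    return (punctuated_as_end(word) and capitalized(next_word) and
            not (capitalized(word) and abbreviated_name(word)) and
            not acronym(word))


def split_sentences(text):
    """Hypothetical driver: group whitespace-separated tokens into sentences."""
    words = text.split()
    sentences, current = [], []
    # Pair each word with its successor; the last word is paired with "".
    for word, next_word in zip(words, words[1:] + [""]):
        current.append(word)
        if next_word == "" or ends_sentence(word, next_word):
            sentences.append(" ".join(current))
            current = []
    return sentences


sample = ("Mr. Smith met J. R. Tolkien in the U.S.A. last May. "
          "They talked! I think so.")
for sentence in split_sentences(sample):
    print(sentence)
```

With these assumed lookups, the sample comes out as three sentences: the title, the initials, and the acronym do not trigger a break, while "May." and "talked!" do. It also shows the point about “I”: abbreviated_name("I") is true on its own, yet the bare “I” in "I think so." never ends a sentence because it carries no end punctuation.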
Sample Output
Common practice in natural language processing is to judge the code by running it on huge, tagged corpora, but that requires access to those corpora.
Instead, here’s output from running the code on some novels from Project Gutenberg. Below are excerpts from each novel’s output, with links to the whole output. Paragraph breaks were preserved, and the line breaks were changed only within each paragraph.
"What do I hear?
You, my dear master! you in this terrible plight!
What misfortune has happened to you?
Why are you no longer in the most magnificent of castles?
What has become of Miss Cunegonde, the pearl of girls, and nature's masterpiece?"
– Candide
And meanwhile his hunger grew and grew.
The only relief poor Pinocchio had was to yawn; and he certainly did yawn, such a big yawn that his mouth stretched out to the tips of his ears.
Soon he became dizzy and faint.
He wept and wailed to himself:
"The Talking Cricket was right.
It was wrong of me to disobey Father and to run away from home.
If he were here now, I wouldn't be so hungry!
Oh, how horrible it is to be hungry!"
"To whom dost thou talk of alighting or sleeping?" said Don Quixote.
"Am I one of those knights who take repose in time of danger?
Sleep thou, who wert born to sleep, or do what thou wilt:
I shall act as becomes my profession."
– The History of Don Quixote de la Mancha
[1] Somehow that article came to me with a title calling them “Golden Rules,” which is catchy but not in the article itself.