Yes, I still use PHP. Yes, I still love it. Yes, I am old 😁. No, it's not a dead language. It might not be the best choice for what I am trying to do here, but sometimes you simply don't have an alternative, or maybe you just want to prove that languages are only tools and that, if you know how to use them, you can achieve anything (with trade-offs that may be acceptable in certain situations).
Spellchecking and auto-correcting have become an indispensable part of many software products and services, assisting users with their writing, giving suggestions as they type, showing mistakes or better ways of expressing something. Lately, and especially with the recent advances in AI, an increasing number of apps have also started to help people use correct punctuation. You may have noticed it in office and productivity tools, texting apps or social platforms.
I wanted to dive a little bit into the matter, but I didn't find anything for PHP out there, so I decided to at least get the ball rolling and have a look at a few ways in which we could check and restore punctuation in a PHP-based development environment, regardless of whether you use the CLI or just want to implement this on the server side of a web application - the main scenario I had in mind during my research.
Unlike spellchecking - where you could have as little as a basic function that just checks a text against a list of correct terms and makes sure everything is written accordingly - punctuation is not that simple, because its rules are a little more complex and they may also vary based on the context of the writing and the writer.
We use punctuation for two main purposes: logic (to construct grammatically correct texts that can be followed and understood) and rhetoric (to help the reader move through a text at a certain pace, to transmit a feeling or an intended meaning, and to help with the intonation). While the first part can be handled elegantly enough by algorithms to some extent, the second one is largely influenced by the person doing the writing, the style and the nature of their work; identifying these characteristics with precision and without additional input from the user is not an easy chore for a digital computer.
Let me start with an example: a bunch of sentences, first lacking any punctuation, then with some basic punctuation for better logic and, lastly, as written by somebody who uses punctuation extensively in order to improve the rhetorical aspect of the text.
- No punctuation
  - I went to the store today It was a long and boring trip And I bought some things eggs bread and cheese Wonderful
    I also saw my friend John there He was looking for a birthday gift for his sister
- Basic punctuation
  - I went to the store today. It was a long and boring trip. And I bought some things: eggs, bread, and cheese. Wonderful!
    I also saw my friend John there. He was looking for a birthday gift for his sister.
- Rich punctuation
  - I went to the store today - it was a long and boring trip - and I bought some things: eggs, bread, and cheese. Wonderful!
    I also saw my friend John there; he was looking for a birthday gift for his sister.
The punctuation marks used in these phrases are . (period or full stop), , (comma), : (colon), ! (exclamation point), ; (semicolon), - (dash).
As you can see, besides the ones that leave no room for doubt (usually, commas and full stops are the easiest to place), many of them depend on the factors presented previously (like the dash or the semicolon). Therefore, we can either try to reach the best compromise between logic and rhetoric (a proper option in informal contexts) or give the user options to guide the restoration process.
Let's leave these examples here for later reference and see what courses one could take when embarking on such an endeavour. Initially, I wanted to add the full code in this post, but I realized that it would make it too large, messy and not very helpful. In any case, I have also decided to start a project that implements the solutions proposed here and I will append a link to the repository at the end of this page, for those interested in more than just explanations.
Solution 1: Only PHP
Probably the most painstaking way to do something like this is by using just PHP. I mean, it's not hard, but it's a little bit like building one of those old mechanical computers: lots of little cogs to tweak and adjust to make it work. The advantage is that you won't need anything else but this code to run it anywhere (CLI, website, web app). In this first challenge, the most important step is to establish clear rules for each of the punctuation marks: when and how each should be applied.
I am not a grammar genius (though I was pretty good at it in school), but if we were to start with the period, one obvious rule is that most sentences end with one. Now - and ignoring the exceptions for the moment - we know that a sentence should at least have a subject and a predicate to make sense; which means we need to identify the parts of the given text. Actually, this will be an essential first action to take when placing any punctuation correctly in a sentence.
Word tagging
Hence, the first component of our restoration script should be what is called a POS (part of speech) tagger: a function that identifies each word in a sentence and returns its most probable role. One straightforward way to achieve it will be by creating a dictionary. It could exist as a table inside a database, as a data file using formats like XML or JSON, or we could simply use an array. Something like this:
$text = "I went to the store today It was a long and boring trip And I bought some things eggs bread and cheese wonderful";
$dictionary = [
"i" => "pronoun", "went" => "verb", "to" => "preposition", "the" => "article",
"store" => "noun", "today" => "adverb", "it" => "pronoun", "was" => "verb",
"a" => "article", "long" => "adjective", "and" => "conjunction", "boring" => "adjective",
"trip" => "noun", "bought" => "verb", "some" => "determiner", "things" => "noun",
"eggs" => "noun", "bread" => "noun", "cheese" => "noun", "wonderful" => "adjective"
];
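To make the idea more concrete, here is a minimal sketch of how such an array could be used to tag a text. The function name `tagWords` and the "unknown" fallback role are assumptions made for this illustration only, and the dictionary is shortened to keep the example small.

```php
<?php
// Hypothetical helper: looks each word up in a small dictionary array.
function tagWords(string $text, array $dictionary): array
{
    // Split on whitespace; PREG_SPLIT_NO_EMPTY drops empty chunks.
    $words = preg_split('/\s+/', trim($text), -1, PREG_SPLIT_NO_EMPTY);
    $tagged = [];
    foreach ($words as $word) {
        $key = strtolower($word); // the dictionary keys are lowercase
        $tagged[] = [
            'word' => $word,
            'role' => $dictionary[$key] ?? 'unknown', // assumed fallback
        ];
    }
    return $tagged;
}

// Shortened dictionary, just enough for the sample sentence.
$dictionary = [
    "i" => "pronoun", "went" => "verb", "to" => "preposition",
    "the" => "article", "store" => "noun", "today" => "adverb",
];

$tagged = tagWords("I went to the store today", $dictionary);
foreach ($tagged as $item) {
    echo $item['word'], ' => ', $item['role'], PHP_EOL;
}
```

Words that are missing from the dictionary come back as "unknown", which is where the other sources discussed in this section would step in.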
This won't work when trying to cover an entire language, though. The array would be too large and impractical. A database would do a better job, especially if caching is also enabled. Alternatively, one could take advantage of the several APIs out there that can provide such information and even more. You could even mix both sources (use an API when the word is not found in the database). One external dictionary that I found on the web is https://dictionaryapi.dev. You can give it a try.
I have also found a PHP package that claims to tag parts of speech, called NaiPosTagger. It seems that nobody has updated it in a long time, but it might be worth a shot.
As a last resort, we could also try to guess what part of speech a word could be, based on general patterns that words of the same class sometimes share. For example, words ending in "ing" are often verb forms (present participles or gerunds).
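Mixing both sources could look roughly like the sketch below. Everything in it is an assumption for illustration: `lookupRole` is a made-up name, a plain array plays the part of the database, and the external API is replaced by a stub closure so that the example runs without a network connection.

```php
<?php
// Hypothetical lookup: local store first, external source as a fallback.
function lookupRole(string $word, array &$localStore, callable $remoteLookup): ?string
{
    $key = strtolower($word);
    if (array_key_exists($key, $localStore)) {
        return $localStore[$key];
    }
    // Not found locally: ask the external source (returns a role or null).
    $role = $remoteLookup($key);
    // Remember the answer, even null, so the same word is not requested twice.
    $localStore[$key] = $role;
    return $role;
}

// Stand-in for a real HTTP call to a dictionary API (assumption: no network here).
$calls = 0;
$fakeRemote = function (string $word) use (&$calls): ?string {
    $calls++;
    $remoteData = ['cheese' => 'noun', 'bought' => 'verb'];
    return $remoteData[$word] ?? null;
};

$store  = ['store' => 'noun'];
$first  = lookupRole('Cheese', $store, $fakeRemote); // fetched from the stub
$second = lookupRole('cheese', $store, $fakeRemote); // served from the local store
```

Passing the remote lookup as a callable keeps the function independent of any particular API, so the real HTTP client can be plugged in later.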
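A guess based on word endings might be sketched like this. The suffix list and the function name `guessRoleBySuffix` are my own illustrative assumptions, far from a complete rule set.

```php
<?php
// Hypothetical suffix rules; the list is illustrative, not exhaustive.
function guessRoleBySuffix(string $word): ?string
{
    $rules = [
        'ing'  => 'verb',
        'ly'   => 'adverb',
        'ness' => 'noun',
        'tion' => 'noun',
        'ous'  => 'adjective',
    ];
    $lower = strtolower($word);
    foreach ($rules as $suffix => $role) {
        $suffixLength = strlen($suffix);
        // Require at least three letters before the suffix,
        // which skips short words like "sing" or "fly".
        if (strlen($lower) > $suffixLength + 2
            && substr($lower, -$suffixLength) === $suffix) {
            return $role;
        }
    }
    return null; // no pattern matched
}

$guess = guessRoleBySuffix('looking');
```

Keep in mind that this is only a guess: "boring" from our sample text would be reported as a verb, although it acts as an adjective there. That is why it should stay a last resort, used after the dictionary and any external source have failed.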
Sentence extraction
Unless we target a very specific scenario, and if we want to create a solution that can restore any given text, the most unfavourable cases must be considered. The worst: there's no punctuation at all in the text. Extracting sentences from such texts is quite a challenge. Let's go back to our examples and see one reason why:
I went to the store today It was a long and boring trip And I bought some things eggs bread and cheese Wonderful
Where does one sentence end and where does the next one start? If we look at it from different angles, there are several possible outputs:
I went to the store. Today it was a long and boring trip. And I bought some things ...
I went to the store today. It was a long and boring trip. And I bought some things ...
I went to the store today. It was a long and boring trip and I bought some things ...
To solve the mystery, user feedback could be requested at this point. This could be done explicitly, using prompts, popups or less intrusive elements like hints or signs, or just by providing settings that can be adjusted according to each user's preferences. For example, there could be an option that disallows sentences that start with a conjunction (like "and") or with an adverb (like "today").
Ultimately, there should be a default mode of interpretation for those who just want a quick restoration and are not too concerned about all of this, and an advanced mode in which users can tweak the various settings until they get exactly what they expect.
After much trial and error, I managed to put together the following initial steps for the algorithm that will extract sentences from a text:
- Get the word's role based on the dictionary and its relationship with surrounding words
- Insert a new sentence in the array if none has been created already
- If there's a possibility that the word belongs to a different sentence than the current one, insert the new sentence in the array
- Assign weight values to the word for each existing sentence to which it might belong
- Increase or decrease these values based on surrounding words or factors like the user settings
- Finally, place the words in the sentences for which they have the biggest weights and return the result
Some of these steps are easy enough to implement; others will need more patience (we will have to test many scenarios - unit testing might come in really handy at this stage). But I think this is solid enough to serve us both for extracting sentences and even for finding the place of any other punctuation mark. It's a good starting point.
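To give a feeling of how the weights could play out, here is a heavily simplified sketch of these steps. It is not the final algorithm: the function name `extractSentences`, the `allowConjunctionStart` setting, the three rules and all the weight values are assumptions chosen only so that the sample text behaves as described above. Instead of keeping one weight per candidate sentence, each word gets just two weights: one for staying in the current sentence and one for starting a new one.

```php
<?php
// Simplified sketch: the larger of the two weights decides where a word goes.
function extractSentences(array $tagged, array $settings = []): array
{
    $allowConjunctionStart = $settings['allowConjunctionStart'] ?? true;
    $sentences = [];
    $current = [];
    $hasVerb = false; // does the current sentence already contain a verb?

    foreach ($tagged as $i => [$word, $role]) {
        $prevRole = $tagged[$i - 1][1] ?? null;
        $nextRole = $tagged[$i + 1][1] ?? null;
        $stay = 0.5; // default weight for staying in the current sentence
        $new  = 0.0; // weight for opening a new sentence

        // A pronoun followed by a verb, after a clause that already has a verb.
        if ($hasVerb && $role === 'pronoun' && $nextRole === 'verb') {
            $new += 0.8;
        }
        // A conjunction introducing a new clause, if the user allows it.
        if ($hasVerb && $role === 'conjunction' && $nextRole === 'pronoun'
            && $allowConjunctionStart) {
            $new += 0.8;
        }
        // A word right after a conjunction tends to stay with it.
        if ($prevRole === 'conjunction') {
            $stay += 0.5;
        }

        if ($new > $stay && $current !== []) {
            $sentences[] = $current;
            $current = [];
            $hasVerb = false;
        }
        $current[] = $word;
        if ($role === 'verb') {
            $hasVerb = true;
        }
    }
    if ($current !== []) {
        $sentences[] = $current;
    }
    return $sentences;
}

$tagged = [
    ['I', 'pronoun'], ['went', 'verb'], ['to', 'preposition'], ['the', 'article'],
    ['store', 'noun'], ['today', 'adverb'], ['It', 'pronoun'], ['was', 'verb'],
    ['a', 'article'], ['long', 'adjective'], ['and', 'conjunction'],
    ['boring', 'adjective'], ['trip', 'noun'], ['And', 'conjunction'],
    ['I', 'pronoun'], ['bought', 'verb'], ['some', 'determiner'], ['things', 'noun'],
];

$default = extractSentences($tagged);
$strict  = extractSentences($tagged, ['allowConjunctionStart' => false]);
```

With the default settings the sample is split into three sentences, the last one starting with "And". With `allowConjunctionStart` set to false, the conjunction stays attached to the previous sentence and only two sentences come out. Capitalisation is left untouched in this sketch.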
Modifiers
I mentioned that a word's role and its weight can be modified by the presence of other words. The word "store" is a noun by itself, but when we have the preposition "to" and the article "the" in front of it, this word becomes part of an adverbial of place, which increases the probability that they all belong to the same sentence.
In order to take all these complex combinations into account, we can define modifiers for the algorithm to make more precise assumptions about the properties of a word. For example, "to" comes with a modifier that will increase the weight property of the next noun, meaning there's a higher probability that these two words belong together in the same sentence. The modifier can also make the noun the object of the same preposition, which can be useful for figuring out other punctuation marks.
There will be some general modifiers for the most common occurrences (like a definite article, which is almost always followed by a noun) and then there will be specific modifiers that target particular words or groups of words. The behavior of general modifiers could even be hardcoded (unless we want support for multiple languages) or defined as a separate list of structures that are parsed and interpreted by the algorithm at a global level. Specific modifiers can be defined inside the dictionary, with the words to which they are tightly related, to be applied only in those individual cases.
One last thing to give thought to is the abstraction of such modifiers. They will most likely be stored in the form of a structure (maybe in JSON format), containing several key-value pairs that specify the words or classes that are affected by the modifier, the relationship and range / distance between words, the weight's value and, if the modifier changes the role of a word, its new class. This structure will crystallize as we start building up our dictionary and testing the code. It will also allow the collaboration of a linguist, who can work on improving the grammatical part of the project without requiring advanced technical knowledge and without changing the code.
Just for the sake of example, a modifier's structure could look like this when laid down as a PHP array:
[ "to" => [ "+[0..2]noun" => [ "weight" => 0.8, "role" => "adverb" ] ] ]
First comes the word that carries the modifier, then the type of words that it modifies (any noun that follows at a distance of at most 2 words), then the weight added to the affected words and, finally, the new role assigned to them.
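Below is a rough sketch of how such a structure could be parsed and applied. It uses the pattern string as the array key, since PHP only accepts integers and strings as keys. The function names, the `weight` field on each word and the reading of the range (a distance of 0 meaning the directly adjacent word) are assumptions made for this example.

```php
<?php
// Parses a pattern such as "+[0..2]noun" into direction, range and target class.
function parseModifierPattern(string $pattern): ?array
{
    if (!preg_match('/^([+-])\[(\d+)\.\.(\d+)\](\w+)$/', $pattern, $m)) {
        return null; // unknown pattern format
    }
    return [
        'direction' => $m[1] === '+' ? 1 : -1, // "+" looks forward, "-" backward
        'min'       => (int) $m[2],
        'max'       => (int) $m[3],
        'class'     => $m[4],
    ];
}

// Applies modifiers to items shaped like ['word' => ..., 'role' => ..., 'weight' => ...].
function applyModifiers(array $words, array $modifiers): array
{
    foreach ($words as $i => $item) {
        $rules = $modifiers[strtolower($item['word'])] ?? [];
        foreach ($rules as $pattern => $effect) {
            $p = parseModifierPattern($pattern);
            if ($p === null) {
                continue;
            }
            // Assumption: distance 0 is the directly adjacent word.
            for ($d = $p['min']; $d <= $p['max']; $d++) {
                $j = $i + $p['direction'] * ($d + 1);
                if (isset($words[$j]) && $words[$j]['role'] === $p['class']) {
                    $words[$j]['weight'] += $effect['weight'];
                    $words[$j]['role'] = $effect['role'] ?? $words[$j]['role'];
                    break; // only the nearest match is affected
                }
            }
        }
    }
    return $words;
}

$modifiers = [ "to" => [ "+[0..2]noun" => [ "weight" => 0.8, "role" => "adverb" ] ] ];
$words = [
    ['word' => 'went',  'role' => 'verb',        'weight' => 0.0],
    ['word' => 'to',    'role' => 'preposition', 'weight' => 0.0],
    ['word' => 'the',   'role' => 'article',     'weight' => 0.0],
    ['word' => 'store', 'role' => 'noun',        'weight' => 0.0],
];
$result = applyModifiers($words, $modifiers);
```

Here "store" is found one word away from "to", so it receives the extra weight and the new role, while "the" is left as it is.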
Punctuation marks
Sentences can end in other punctuation marks, not just a period. We can simply append the end punctuation of a sentence to the array containing its words which I mentioned earlier. The sentence will then be constructed just by joining all the elements, including the punctuation marks.
$sentences = [
[ "I", "also", "saw", "my", "friend", "John", "there", "." ],
[ "He", "was", "looking", "for", "a", "birthday", "gift", "for", "his", "sister", "." ]
];
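Joining the elements could be sketched as follows. The function name `buildText` is hypothetical, and the extra regular expression is my own addition: it removes the space that a plain `implode()` would leave in front of each punctuation mark.

```php
<?php
// Joins each word array into a string and attaches punctuation to the previous word.
function buildText(array $sentences): string
{
    $parts = [];
    foreach ($sentences as $tokens) {
        $sentence = implode(' ', $tokens);
        // Remove the space that implode() leaves before punctuation marks.
        $parts[] = preg_replace('/\s+([.,:;!?])/', '$1', $sentence);
    }
    return implode(' ', $parts);
}

$sentences = [
    [ "I", "also", "saw", "my", "friend", "John", "there", "." ],
    [ "He", "was", "looking", "for", "a", "birthday", "gift", "for", "his", "sister", "." ]
];

echo buildText($sentences), PHP_EOL;
// prints: I also saw my friend John there. He was looking for a birthday gift for his sister.
```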
Now that we have sketched out a possible approach to the extraction of sentences and their final punctuation, we shall move to the other punctuation marks. Their place within a sentence can be calculated in a similar way: using general and specific modifiers. For example, an enumeration of nouns adds a comma between each one of them. This is a general modifier. The word "things", followed by an enumeration of nouns, shall be followed by a colon, before the enumeration. This is a specific modifier that comes with this exact word. Again, users can be provided with different settings to control special cases and to bend the behavior of the algorithm according to their needs.
One last thing to keep in mind is whether the text to analyze already has some punctuation in it or not. Here, we could either strip it of any special characters and fully trust our algorithm, or take advantage of the existing punctuation in order to speed up the process. For instance, existing periods would help extract the sentences faster and move on to the next stages. The user should, again, have an option to choose how existing punctuation should be treated.
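The two rules from this example could be sketched as below, with both of them hardcoded for brevity instead of being read from modifier structures. The function name `punctuateEnumeration`, the `$colonWords` parameter and the choice to place a comma before the conjunction only when at least three items are listed are assumptions of this sketch.

```php
<?php
// Inserts ":" after an introducing word and "," between enumerated nouns.
function punctuateEnumeration(array $tagged, array $colonWords = ['things']): array
{
    $out = [];
    foreach ($tagged as $i => [$word, $role]) {
        $out[] = $word;
        if ($role !== 'noun') {
            continue;
        }
        $prevRole  = $tagged[$i - 1][1] ?? null;
        $nextRole  = $tagged[$i + 1][1] ?? null;
        $afterNext = $tagged[$i + 2][1] ?? null;

        if ($nextRole === 'noun' && in_array(strtolower($word), $colonWords, true)) {
            $out[] = ':'; // specific rule: "things" introduces the enumeration
        } elseif ($nextRole === 'noun') {
            $out[] = ','; // general rule: noun directly followed by a noun
        } elseif ($prevRole === 'noun' && $nextRole === 'conjunction' && $afterNext === 'noun') {
            $out[] = ','; // comma before "and" when three or more items are listed
        }
    }
    return $out;
}

$tagged = [
    ['I', 'pronoun'], ['bought', 'verb'], ['some', 'determiner'], ['things', 'noun'],
    ['eggs', 'noun'], ['bread', 'noun'], ['and', 'conjunction'], ['cheese', 'noun'],
];
$tokens = punctuateEnumeration($tagged);
// $tokens: I bought some things : eggs , bread , and cheese
```

This naive version treats any two adjacent nouns as an enumeration, so a compound like "birthday gift" would wrongly get a comma. Cases like that would need their own modifiers or user settings.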
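Both treatments of existing punctuation could be sketched as follows. The function name and the `$trustExisting` flag are hypothetical; the flag stands in for the user option mentioned above.

```php
<?php
// Either splits the text on existing end punctuation, or strips all punctuation.
function splitOnExistingPunctuation(string $text, bool $trustExisting = true): array
{
    if (!$trustExisting) {
        // Remove the punctuation marks and return the text as one undivided chunk.
        $stripped = preg_replace('/[.,:;!?]/', '', $text);
        return [trim(preg_replace('/\s+/', ' ', $stripped))];
    }
    // Split after ".", "!" or "?" followed by whitespace,
    // keeping each mark attached to its sentence.
    return preg_split('/(?<=[.!?])\s+/', trim($text), -1, PREG_SPLIT_NO_EMPTY);
}

$text = "I also saw my friend John there. He was looking for a birthday gift for his sister.";
$trusted  = splitOnExistingPunctuation($text);
$stripped = splitOnExistingPunctuation($text, false);
```

Trusting periods blindly has its own pitfalls (abbreviations such as "Mr." would cut a sentence in two), so even this shortcut would benefit from the dictionary.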
... to be continued