Validating complex user input (using normalized context-free grammars and CYK)

Here I am again, presenting to you another 'OH MI GOD, SO MUCH MATH' article.

TL; DR;

I made a simple JavaScript library that allows you to validate complex user input whenever regular expressions can't help you (validating formulae for example). Unfortunately, due to the algorithm's generality it ain't the quickest. Go to last section to see how to use the library if you are interested.

Validating words in sample grammars

WTF is a context-free grammar?

Theoretical computer scientists spent an awful lot of time examining properties of certain languages (actually that's how they created programming languages). They've discovered that there are a few classes of languages that are useful in practice. The latter can be arranged into a hierarchy called Chomsky's hierarchy named after the famous Noam Chomsky. Two of the most important of those classes are the regular languages (languages that you can use regular expressions to match) and the context-free languages about which this article is all about. Now what exactly is a language? Let's take a look at a few definitions.

We'll call an alphabet any set of characters and we'll denote it with Σ.
We'll call a word (string) any sequence of characters over a given alphabet. Empty word will be the word with length 0. Denote any word with w and denote the empty word with ε.
We'll call a language any set (be it finite or infinite) of words (strings) over a given alphabet. Denote it with L.

Languages are just a formal way to say 'Here's a collection of strings that have the same structure / look alike / fall under the same rules'. Now given a language, how do we:

1. Generate a random string in the language?
2. Tell whether a specific string is in the language?

Well, there is no general algorithm that can solve the tasks above for any language (no such algorithm can even exist!) but for certain classes of languages there are. Having said that, we arrive at our final definition which I know looks quite confusing, so read it twice if you need to.

Firstly, call a nonterminal a special character that will allow us to do kewl thingz (think of it as a variable). We'll denote nonterminals with uppercase letters.
Secondly, consider an object which consists of the following:

1. A set of nonterminals N
2. An alphabet Σ – call the alphabet's characters terminals
3. A special nonterminal that is contained in the set N, call it the start symbol, denote it with S
4. A set of derivation rules R that allow us to substitute a nonterminal for a mixed sequence of nonterminals and terminals

Formally, our object is the quadruple$G=$. We just gave birth to a context – free grammar (CFG).

That's cool, now what?

Let us look at an example grammar:

$G=<\{ S \}, \{ a, b \}, S, R>$, where our rules $R$ are:

• $S\rightarrow aSb$
• $S\rightarrow ab$

This means that we have a single nonterminal – the starting symbol S and the alphabet consists solely from the letters a and b. Let's put our grammar to practice.

We'll start with… well – the starting symbol. Apply on the start symbol any of our two rules by substituting S with the right hand side of the rule until there's nothing to be replaced. The algorithm goes like that:

Or we might choose to apply rule 2 immediately:

It is obvious that the algorithm terminates whenever only terminals are left (which explains where did terminals and nonterminals got their names from).

Using the above derivations, we generated two words – namely aaabbb and ab over the alphabet $\Sigma =\{ a, b \}$. In other words, we've just discovered an algorithm for generating words in the language that consists of all strings that begin with a number of a and end with the same number of b.

Consider another grammar:

$S=<\{ S, A, N, V, \}, all \; lowercase \; English \; letters \; and \; whitespace, S, R >$, where the rules $R$ are:

• $S \rightarrow N \; V \; A$
• $A\rightarrow awesome | handsome | legendary$
• $N\rightarrow dimitroff | nikola | the \; author$
• $V \rightarrow is$

(the disjunction symbol | allows us to conveniently write several rules on one line - $N\rightarrow dimitroff | nikola | the \; author$ is equivalent to ) $N\rightarrow dimitroff, N\rightarrow nikola, N\rightarrow the \; author$

Applying rules at random we can generate$S \rightarrow N \; V \; A \rightarrow nikola \; V \; A \rightarrow nikola \; is \; A \rightarrow nikola \; is \; awesome$. Clearly, the grammar just presented can construct valid English sentences. Even more, you can construct incredibly complex grammars that generate an astonishing amount of words. Without formal proof, we can conclude that a CFG is a great language generator. Let us call any language that can be produced by CFG a context – free language.

Recognizing context – free languages

A great deal of languages are context – free. For example, almost all programming languages you've learnt or heard of are CF. Furthermore, the compilers and interpreters for those languages make heavy use of CFGs. Here's the grammar used to produce the JavaScript language.

All languages created by regular expressions are context – free. Unfortunately, most context – free languages can't be described with regular expressions and this is why we must use grammars. There are a few algorithms that deal with the question 'Does this string belong to this CF language?'. Compilers usually use LR parsers but the latter only work on a special subset of CFGs. The library I wish to present to you uses the more general (and thus slower) CYK algorithm for solving the same problem. When saying slower, I mean really slow – worst time is $O(|G|n^3)$ - that is cubic in the length of the word being validated and linear in the size of grammar (the grammar's size is the length of all rules combined)

The algorithm works in two steps:

1. Pre-process the grammar – normalize it, precompute some important sets
2. Given a word, use CYK to test whether the word is valid

I won't go into details but you can read more in this paper by Lange and Leiss.

Here's a really simple and ugly – looking demo to see that CYK actually works.

Did I succeed in convincing you that there are cases in which regular expressions are powerless and you MUST fall to CFGs?
If not, come back if you feel the urge.
If yes, go to the repository @ github, download the library and validate stuff like a champ.

Using the CfgSolver library

To use CfgSolver you simply follow the steps I outlined above. Begin by creating your grammar as a standard JS object. Here's an example grammar and its JS representation:

$G=<\{ E, N \}, \{ +, -, *, /, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9 \}, E, R>$, with rules :

• $E \rightarrow EOE | N$
• $O \rightarrow + | - | * | /$
• $N \rightarrow NN | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7| 8 | 9$

($E$ stands for expression, $O$ stands for operation and N$N$ stands for number)

The rules effectively say that

• an expression is either an operation over two other expressions or simply a number
• an operation is either +, -, *, /
• a number is either two numbers concatenated or a single digit

This grammar generates all arithmetic expressions without parentheses (the only problem here is that there may be leading 0s but for simplicity we won't deal with it).

Here's the JavaScript object that represents the grammar:

var grammar = {
"E": ["EOE", "N"],
"O": ["+", "-", "*", "/"],
"N": ["NN", "0", "1", "2", "3", "4", "5", "6", "7", "8", "9",],
}


Pretty much the same thing, eh? Use it like that:

// Pre-process the grammar and save the returned container in a variable
var container = CfgSolver.preprocess(grammar);
// Use the container to test whether a word is valid
var word = "5+3*4";
var isValid = CfgSolver.recognizeWord(container, word); // isValid is now true


That's it, you now have a Boolean that shows you whether the word is valid.

Notes

There are a few things I couldn't find where to fit, so I'll just drop them all here.

<

ul>

• When designing your grammar, be careful what symbols you are using. Nonterminals must always be single characters and once a nonterminal has been declared you may not use the same symbol as terminal e.g. if you want to derive a word containing capital S you must not use S as nonterminal.
• Sometimes, it is convenient to have a rule that derives nothing (remember we denote the empty string with ε). For instance, you might want to model variable declaration in C# with the following grammar:
• $S\rightarrow TA N;$
• $T \rightarrow int | long | char | \ldots$
• $A \rightarrow [\;] | \epsilon$
• $N \rightarrow aN | bN | cN | a | b| c$

($T$ stands for type, $A$ stands for array square brackets and is optional, $N$ stands for name – for simplicity names can only be strings containing the letters a, b and c)

This grammar allows to have both array and normal variables declared. In case an array is declared use the rule$A \rightarrow [\;]$, else use the rule $A \rightarrow \epsilon$. To write rule deriving the empty word in your JavaScript, use CfgSolver.epsilon

• Unfortunately, the CYK version used in the library doesn't allow the empty string to be derived from the start symbol (for improved performance). An empty string is not much of a interesting string so that's not a loss to mourn about but you need to be careful when designing your grammar. Here's an example grammar that indirectly derives the start symbol:
• $S \rightarrow AB | \ldots$
• $A \rightarrow \epsilon | \ldots$
• $B \rightarrow \epsilon | \ldots$

Applying the rules $S \rightarrow AB \rightarrow \epsilon B \rightarrow \epsilon \epsilon \rightarrow \epsilon$ derives the empty word and CfgSolver will throw an exception when pre-processing your grammar.

• CfgSolver provides a printer property that prints your grammar in the console in a readable way:
CfgSolver.printer.printGrammar(grammar);