02 Jun 2011
Making Whitespace Significant
Every time I code up a new programming language I forget one of the most important things about programming languages. High-level programming languages should be designed to be easy for humans to read and write, rather than easy for computers to parse or execute.
Whitespace, and indentation in particular, is significant to humans when representing logical structure, but programming languages rarely treat indentation as significant. (Python and Haskell are exceptions: in both, the logical structure of a program is represented by indentation level.) Most programming languages use explicit tokens such as begin and end or { and } to mark the start and end of logical blocks. In these languages programmers tend to use both indentation and the explicit block delimiters, thus introducing redundant information that may fall out of sync. A poorly indented program can be much more difficult to understand. By “poorly” indented I mean that the indentation does not represent the logical structure of the program or, worse, gives a false impression of that structure.
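To make the "out of sync" problem concrete, here is a minimal sketch of a checker that compares the indentation of each line against the brace nesting depth. The function name `indentation_mismatches` and the `indent_width` parameter are made up for illustration, and the brace counting is deliberately naive: it assumes space-only indentation and does not understand braces inside strings or comments.

```python
def indentation_mismatches(source: str, indent_width: int = 4) -> list[int]:
    """Return 1-based line numbers whose indentation disagrees with brace depth."""
    mismatches = []
    depth = 0  # number of blocks currently open
    for number, line in enumerate(source.splitlines(), start=1):
        stripped = line.strip()
        if not stripped:
            continue  # blank lines carry no structural information
        # A line that starts with a closing brace belongs to the outer block.
        expected_depth = depth - 1 if stripped.startswith("}") else depth
        actual_indent = len(line) - len(line.lstrip(" "))
        if actual_indent != expected_depth * indent_width:
            mismatches.append(number)
        # Naive count: ignores braces inside strings and comments.
        depth += stripped.count("{") - stripped.count("}")
    return mismatches


# The last line is indented as if it were inside the block, but it is not.
misleading = """\
if (ready) {
    start();
}
    cleanup();
"""
print(indentation_mismatches(misleading))  # [4]
```

A checker like this only exists because the braces and the indentation say the same thing twice. In a language where indentation is the block structure, there is nothing left to compare.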
The author of [In praise of mandatory indentation for novice programmers] seemed shocked that a language that uses indentation to delimit blocks was such a boon to novice programmers. Most of the objections that arise when proposing significant indentation do not apply to novices. They have not yet learnt enough to think it is weird; in fact it may be more familiar, as that is the way they are taught to structure their natural language writing. Nor do they have stylistic idioms that they have developed over years and rigidly adhere to. Novice programmers have not yet learnt to chunk code at a higher level, so increasing the number of tokens tends to increase the number of chunks they have to remember (potentially exceeding the 7±2 rule).
Style is another issue that crops up with high frequency. Countless hours have been wasted debating the tiniest and most irrelevant details of code layout. Do spaces occur outside or inside braces? Does the { appear on a new line or not? Do you use tabs or 2/4/8 spaces to indent code? For any significant codebase, using a single style is going to be a benefit, but no style is likely to be significantly better than any other commonly accepted style. Yet we waste time debating style, writing tools to reformat code and writing tools to check code compliance.
To deal with this, [Style is substance] proposes that for each language a definitive style is selected and enforced by the grammar. More than just using indentation to indicate program structure, this locks down every style variation so that there is just one style. We would no longer need separate tools to enforce style (the compiler would do it) or to format code, nor would there be a gazillion options in IDEs to determine how the code is laid out.
Selecting one style and enforcing it is an embodiment of Python’s philosophy “there should be one obvious way to do it”. Under the Python model the language implementer selects the “one way” and those who use the language must put up with it. Compare this to Perl’s philosophy “there is more than one way to do it” (TIMTOWTDI). This philosophy encourages each person to adopt their own style, and Perl programmers often have difficulty reading each other’s code, whereas most Python code is easy for a fellow Python enthusiast to understand. The problem with the Python approach is that sometimes the language implementer gets it wrong and it is not possible for the users to route around the problem.
Ruby does not attempt to dictate a single approach, but neither does it adhere to TIMTOWTDI as fervently as Perl does. If people have a problem with Ruby they are given enough power to fix the problem. If the “fix” attracts the attention of the language implementers it can be folded back into the core language (see [Language design philosophy: more than one way?]). This makes it possible for the users of the language to evolve the language, and the language implementer can cherry-pick the best changes.
The question is: should syntax and style be a language feature that is evolved by the end users? (By making style part of the grammar, it becomes part of the syntax.) A Lisp user would say that it is necessary for a truly powerful language. Macros manipulating s-expressions allow the user to redefine the syntax and expand the language. And in fact most Lisps seem to have suffered from the same proliferation of syntaxes that occurs in Perl. Then again, CLOS represents a crystallization of syntax that would not have occurred without the ability of end users to define their own extensions.
I am unsure whether I would prefer a language kernel that is tightly locked down and loosened up by language extensions, or a language kernel that requires extensions to tighten up the syntax. That is, do you create an extension to make whitespace significant, or an extension to make whitespace insignificant?
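As a rough illustration of the first option, an "extension" that makes whitespace significant can be little more than a preprocessor that rewrites indentation into the explicit delimiters the kernel already understands. The sketch below is an assumption of mine rather than a design from any of the linked articles: `indent_to_braces` is a hypothetical name, the input is assumed to be indented with spaces only, and the kernel is assumed to use { and } for blocks.

```python
def indent_to_braces(source: str) -> str:
    """Rewrite indentation-delimited blocks as explicit { } blocks."""
    output = []
    levels = [0]  # stack of indentation widths for the currently open blocks
    for line in source.splitlines():
        stripped = line.strip()
        if not stripped:
            continue  # blank lines do not open or close blocks
        indent = len(line) - len(line.lstrip(" "))
        if indent > levels[-1]:
            # Deeper indentation opens a block on the previous statement.
            if not output:
                raise ValueError("first statement must not be indented")
            output[-1] += " {"
            levels.append(indent)
        while indent < levels[-1]:
            # Shallower indentation closes every block it leaves.
            levels.pop()
            output.append(" " * levels[-1] + "}")
        if indent != levels[-1]:
            raise ValueError(f"inconsistent dedent: {line!r}")
        output.append(" " * indent + stripped)
    # Close any blocks still open at the end of the input.
    while len(levels) > 1:
        levels.pop()
        output.append(" " * levels[-1] + "}")
    return "\n".join(output)


example = """\
if ready
    start
    if verbose
        log
cleanup
"""
print(indent_to_braces(example))
```

Run on the example, this emits `if ready {`, the indented body with a nested `if verbose { ... }`, the closing braces, and then `cleanup` at the top level. The opposite extension, making whitespace insignificant in a locked-down kernel, would have to run the same mapping in reverse and also choose the one canonical layout to emit.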
Regardless, high-level languages should be designed for human consumption. Whitespace is significant for humans and thus should be significant for programs. To reduce the time wasted defining, adopting and enforcing a code style, language developers should consider adopting a definitive style and enforcing it in the compiler.