ACSL Interpreter


Fully functional parser, interpreter, and debugger for the pseudo-code language invented by ACSL.

In high school, I participated in an year-wide competition called American Computer Science League (ACSL). ACSL competitions test participants on a variety of computer science topics like boolean algebra, graph theory, and assembly. One topic that participants are quizzed on is called “What Does This Program Do”. Students are given a program in a fictitious programming language that strongly resembles Python or Ruby. They designed it this way probably so people couldn’t just cheat by running the program.

I thought it would be funny if I made it so you can actually run their programs.

Features

These are the full capabilities of the language as defined by the ACSL wiki

ConstructCode Segment
Operators! (not), ^ or ↑(exponent), *, / (real division), % (modulus), +, -, >, <, >=, <=, !=, ==, && (and), || (or) in that order of precedence
Functionsabs(x) - absolute value, sqrt(x) - square root, int(x) - greatest integer <= x
VariablesStart with a letter, only letters and digits
Sequential statementsINPUT variable
variable = expression (assignment)
OUTPUT variable
Decision statementsIF boolean expression THEN
Statement(s)
ELSE (optional)
Statement(s)
END IF
Indefinite Loop statementsWHILE boolean expression
Statement(s)
END WHILE
Definite Loop statementsFOR variable = start TO end STEP increment
Statement(s)
NEXT
Arrays:1 dimensional arrays use a single subscript such as A(5). 2 dimensional arrays use (row, col) order such as A(2,3). Arrays can start at location 0 for 1 dimensional arrays and location (0,0) for 2 dimensional arrays. Most ACSL past problems start with either A(1) or A(1,1). The size of the array will usually be specified in the problem statement.
Strings:Basically python strings

And here’s more features found in practice problems that they failed to mention in their wiki:

ConstructCode Segment
Alternate Decision Statement SyntaxIF expression THEN Statement ELSE Statement
(one line)
Program Early TerminationEND

Basic Theory on Interpreters

I know if you’re using this tool, odds are you are a high school student (or younger!) doing ACSL. Here’s a quick description on how this program works so hopefully I can inspire you to be more interested in programming languages because IMO they’re extremely extremely cool.

This program goes through three phases to turn raw code text into a program:

  1. Lexing
  2. Parsing
  3. Interpreting

Lexer

Strings are an array of characters. To a computer, the string "fortnite" looks like this: ['f', 'o', 'r', 't', 'n', 'i', 't', 'e']. Our code in its raw form looks like that too. We can turn it into a more usable form by lexing/tokenizing it. Instead of having an array of characters, we can have an array of “tokens” instead.

For example, the tokenization of int varName = 30 + func(3); is:

['int', 'varName', '=', '30', '+', 'func', '(', '3', ')']

You can think of it kind of like doing string.split(" ") on our code, but there’s a little more to it than that. For example, note how it split up func(3) into 4 separate tokens.

Additionally, each token is labelled. We know that the int token is a type, varName is a variable name, and 30 is a string literal.

Parser

The parser turns the list of tokens into structures of code. In our code, if you see the pattern type variableName = value, you know that this is an assignment.

The parser will look for these patterns and turn them into something called an Abstract Syntax Tree (AST).

ast

ANTLR and Backus-Normal Form Notation

I didn’t write the lexer or parser myself. I used a program called ANTLR that generates the code for the lexer and parser if I give it a file that describes the programming language I want to parse.

(Most) programming languages can be described using mathematical expressions. I wrote a description of the language I wanted to parse using those expressions, gave it to ANTLR, and ANTLR wrote the code for me.

Here’s an excerpt that describes an assignment and a for loop:

assignment: (id | array_reference) '=' expression; // A = "value" or A[3] = "value"

for_loop_stmt:
    'FOR' assignment 'TO' expression ('STEP' expression)? // The STEP is optional
        stmt_list
    'NEXT';

You can find the grammar definition file for ACSL pseudo-code here.

Interpreter

Funnily enough, the purpose of “What Does This Program Do” questions is to train your brain to think like the interpreter. You already know how the interpreter works.

There are two main parts to the interpreter: evaluating expressions and control logic.

Evaluating Expressions

Expressions in the AST might look like this. Evaluating these is straightforward - evaluate the bottom nodes first and work your way up. If one of the nodes is a reference to a variable, it will have to look up the variable’s value. Nodes can also be a function call, and it will have to execute the function to find its value.

Control Logic

There are no surprises here. Statements are executed exactly as you would expect. For example:

for J = 0 to 10 step 2
    [ Some code here ]
next

The for loop is broken up into 4 parts: the initialization assignment J = 0, the end condition 10, the step amount 2, and the body of the for loop.

  1. First, it has to find the direction of the for loop. The loop condition of for J = 0 to 10 is implicitly J <= 10, but the loop condition of for J = 10 to 0 step -1 is implicitly J >= 0. Note how the comparison changed from <= to >=.
  2. Next, it checks loop condition. If it’s true, we proceed to step 3, otherwise stop.
  3. Execute the body of the for loop.
  4. Increment/decrement the loop variable by executing J = J + STEP.
  5. Loop back to 2.

Like I said, nothing surprising.

Why ACSL’s Pseudocode is a Terrible Language

The language is clearly supposed to be some dialect of BASIC. Because of that, it inherited some very odd language characteristics from the BASIC family which really irk me.

Case Insensitive Variables

The ACSL wiki page describing the language has this piece of code as an example:

for J = len(A) - 1 to 0 step - 1
     T = T + A[j]
next

In this example, the J variable is declared as upper case J but used in the loop as lower case j, which shows that this language has case-insensitive variable names.

Why, in my opinion, programming languages should be case-sensitive:

  • Forces the programmer to be very deliberate with code.
  • Very important for math and physics where different cases of the same character are often mixed (eg V is volume and v is velocity)
  • When you define a variable, you can encode meta-information through casing (ex camelCase vs PascalCase)

At first, I thought this was a typo. There are so many typos scattered throughout the code examples and the language spec doesn’t say anything about being case-insensitive. After a Google search, it turns out that this was actually intended.

I grew up in a world of case-sensitive languages like Java and Python. I recently picked up Golang which finds the casing of variable names so essential that proper PascalCase and camelCase usage is enforced by the compiler. To me, case-sensitivity is just the way programming languages are supposed to be. That’s all I’ve ever known. Now, apparently, case sensitivity is not as universal as I thought.

Timeline of case-sensitive vs insensitive languages
Figure 1 Timeline of case-sensitive (green) vs case-insensitive (red) languages

There is a clear point in time when the standard shifted from insensitive to sensitive.

Nim is the only exception I can find of a very modern language that’s case-insensitive. Nim actually takes it a few steps further by not differentiating between camel case and snake case (thisVarName is equivalent to this_var_name in Nim). Then, they take it a step further by making only the first character case-sensitive. Nim is a real freak. You can read their justification here.

Functions vs Arrays Ambiguity

Another irk I have with this language is how functions calls and arrays have the same notation - funcName(params...) vs arrayName(i, j, ...). This makes it impossible to define a grammar that automatically distinguishes the two. In an ideal world, once I parse code, I should be able to differentiate the two because they have different AST node types, so it’s clear how they should be resolved.

How I Solved This Problem

I first solved this by making function names reserved-words. The language specification on this page says there are exactly 4 functions - int, sqrt, abs, and len. I changed the grammar to this:

start: function_call | array_reference

function_call: function_name '(' expression* ')';
function_name: ABS | INT_FUNC | SQRT | LEN;

array_reference: id '(' expression (',' expression)* ')';

When it tries to parse, if the name does not match one of the tokens in function_name, it would fall back by assuming it is an array reference.

This solution works, but I’m not satisfied with it. I shouldn’t have to change the grammar every time I want to add a function to my language. I wanted to solve this in a better way.

How Microsoft Solved This Problem

I have no idea.

Visual Basic has this exact same problem. VB’s procedures and arrays have the syntax Array(index1, index2, ...) and Procedure(args...) too. Luckily, Microsoft published the official grammar for Visual Basic here, so by referencing it, I should be able to use the same structure in my grammar, too, right? Their grammar for differentiating arrays and procedures looks like this:

Expression
    : SimpleExpression
    | TypeExpression
    | ...
    | InvocationExpression // Procedure Calls
    | IndexExpression      // Array access
    | ...
    ;

InvocationExpression: Expression ( '(' ArgumentList ')' )?;
IndexExpression: Expression '(' ArgumentList ')';

I studied their grammar for a while and I couldn’t figure out how they got it to work. These rules are indirect left-recursive, which ANTLR explicitly can’t parse due to the limitations of how it was implemented. Microsoft published the grammar as an ANTLR grammar for some reason, which makes things even more bizarre. Additionally, InvocationExpression will always shadow IndexExpression, so I don’t see how this could work in any parser.

How I Solved This Problem (Again)

I made my parser two-pass.

On the first pass, it won’t try to distinguish arrays vs functions. It will leave it ambiguous in the AST.

function_or_array = id '(' arguments* ')'

On the second pass, it will traverse the AST and build a symbol table of every declared variable. Once it encounters a function_or_array node, it will try to resolve it using the symbol table. If an array was declared with the same name, it assumes it to be an array. Otherwise, at execution, it will try to resolve the function from the environment.

The Language Flexibility Downgrade

Based on the language specification, this language could completely ignore whitespace and new-line characters. Additionally, the start and end of statements are inherently unambiguous even without a delimiter such as a semicolon.

But…

The language specification does not mention a secret second form for if statements:

IF condition THEN statement // (One line)

If statements should be delimited by the END IF token, except in this one case. If we were to ignore new-line characters, we could not tell the difference between

IF condition1 THEN
    IF condition2 THEN statement1
    statement2
END IF

and

IF condition1 THEN (IF condition2 THEN (statement1 statement2) ENDIF)

I had to rewrite my grammar to make the language new-line sensitive because of this one single edge case.

This really isn’t a complaint of the language itself. I actually like languages that enforce good layout (like python), but as a developer it was just annoying.

Why is it This Way?

According to random people on StackOverflow and Quora, BASIC was case-insensitive, and it used () for both arrays and functions because hardware limitations at the time (we’re talking punch cards). BASIC’s character set wasn’t large enough to include both upper-case and lower-case letters, and its character set didn’t even have [], so it used () for everything.

This makes sense, and it’s pretty interesting - but ACSL’s pseudocode doesn’t even follow this logic.

  • Using ASCII is acceptable, but then it betrays its restricted-charset history by using Unicode tokens like .
  • It uses brackets for string index access anyway. So why not let arrays use brackets too?

As for the trouble with newlines, BASIC was famously designed to be line based.

Weird Spacing

Another example of the actual problems having new syntax rules is how operators in expressions are allowed to have spacing in them. For example:

// These are syntactically equivalent
1 == 1  // True
1 = = 1 // True

This was rather easy to solve, though. Normally in your grammar, you would have two separate tokens EQ: = and EQEQ: ==. To fix this, I redefined EQEQ in terms of EQ allowing whitespace - EQEQ: EQ [ ]* EQ