The material presented so far allows you to write powerful SNOBOL4 programs. In this chapter, we will examine other interesting and useful features of the language.
This function produces a pattern which will match zero or more consecutive occurrences of the pattern specified by its argument. As its name implies, ARBNO is useful when an arbitrary number of instances of a pattern may occur. For example, ARBNO(LEN(3)) matches strings of length 0, 3, 6, 9, .... There is no restriction on the complexity of the pattern argument.
Like the ARB pattern, ARBNO is shy, and tries to match the shortest possible string. Initially, it simply matches the null string. If a subsequent pattern component fails to match, SNOBOL4 backs up, and asks ARBNO to try again. Each time ARBNO is retried, it supplies another instance of its argument pattern. In other words, ARBNO(PAT) behaves like
( '' | PAT | PAT PAT | PAT PAT PAT | ... )Also like ARB, ARBNO is usually used with adjacent patterns to "draw it out." Let's consider a simple example. We want to write a pattern to test for a list. We'll define a list as being one or more numbers separated by comma, and enclosed by parentheses. Use CODE.SNO to try this definition:
? ITEM = SPAN('0123456789') ? LIST = POS(0) '(' ITEM ARBNO(',' ITEM) ')' RPOS(0) ? '(12,345,6)' LIST Success ? '(12,,34)' LIST FailureARBNO is retried and extended until its subsequent, ')', finally matches. POS(0) and RPOS(0) force the pattern to be applied to the entire subject string.
Alternation may be used within ARBNO's argument. This pattern matches any number of pairs of certain letters:
? PAIRS = POS(0) ARBNO('AA' | 'BB' | 'CC') RPOS(0) ? 'CCBBAAAACC' PAIRS Success ? 'AABBB' PAIRS Failure
This is the pattern analogue of a recursive function -- a pattern is defined in terms of itself. The unevaluated expression operator makes the definition possible.
Suppose we wanted to expand the previous definition of a list to say that a list item may be a span of digits, or another list. The definition proceeds as before, except that the unevaluated expression operator is used in the first statement; the concept of a list has not yet been defined:
? ITEM = SPAN('0123456789') | *LIST ? LIST = '(' ITEM ARBNO(',' ITEM) ')' ? TEST = POS(0) LIST RPOS(0) ? '(12,(3,45,(6)),78)' TEST Success ? '(12,(34)' TEST FailureRecursion occurs because LIST is defined in terms of ITEM, which is defined in terms of LIST, and so on. Note that functions POS(0) and RPOS(0) were "moved out one level," to TEST, because LIST must now match substrings within the subject.
In our previous discussion of recursive functions, we said they work because successive calls present the function with progressively simpler problems, until the problem can be solved without further recursion. Similarly, patterns ITEM and LIST are applied to successively smaller substrings, until ITEM can use its SPAN() alternative instead of invoking LIST again.
In general, you will need an alternative somewhere in the recursive loop to allow the pattern matcher "a way out." Also, you should place recursive objects last in a series of alternatives, so that the simpler, nonrecursive patterns are attempted first and "recursive plunges" can be avoided.
SNOBOL4 saves information on a "pattern stack" during the pattern match process. Heavily recursive patterns and long subject strings can sometimes result in stack overflow. If this occurs, you should break the problem apart into several smaller pattern matches.
As recursive patterns use the unevaluated expression operator, it is sometimes necessary to disable SNOBOL4's heuristics by setting &FULLSCAN = 1.
Pattern matching can be very time-consuming because of the number of possibilities which must be attempted. In the normal "quickscan" mode, SNOBOL4 stops searching for a match when it
thinks further efforts would be futile. The heuristics are complex, but can be summarized as follows: pattern matching fails when there are insufficient subject characters to satisfy the remaining pattern components.
The cursor operator can be used to demonstrate at what point SNOBOL4 gives up. For example, in the pattern match
? 'ABCD' @OUTPUT 'X' LEN(3) 0 FailureSNOBOL4 does not attempt to match 'X' against 'B' because fewer than 3 subject characters remain after it, and LEN(3) could never succeed.
A second type of heuristic is the "one character assumption" for unevaluated expressions. SNOBOL4 assumes that unevaluated expressions will match at least one character. This heuristic was originally provided to break recursive loops, but can cause programming problems when an unevaluated expression must match the null string. Consider a pattern which succeeds if 'B' is at least 4 character positions beyond an 'A' in the subject:
? P = 'A' ARB $ X 'B' *GE(SIZE(X), 4) ? 'A12345BC' P Success ? 'A12345B' P FailureThe characters between 'A' and 'B' are matched by ARB, and immediately assigned to X. The size of X is then compared to 4 by the GE function, which succeeds and returns the null string. This null string result should not interfere with the pattern match, but we find the pattern misbehaves when 'B' is the last character of the subject. The unevaluated expression operator made SNOBOL4 assume a one character length for the GE function, and matching 'B' against the last subject character was never attempted.
For most pattern matching, heuristics are invisible. However, there are circumstances when we would like SNOBOL4 to be exhaustive in its match attempts. We can disable heuristics and enter "fullscan" mode by setting keyword &FULLSCAN nonzero:
? &FULLSCAN = 1 ? 'A12345B' P Success ? 'ABCD' @OUTPUT 'X' LEN(3) 0 1 2 3 4 FailureThe quickscan mode can be reinstated by setting &FULLSCAN = 0.
We can accomplish quite a lot with just the primitive patterns ARB and REM. However, there are five additional patterns which you should be aware of:
The ABORT pattern causes immediate failure of the entire pattern match, without seeking other alternatives. Usually a match succeeds when we find a subject sequence which satisfies the pattern. The ABORT pattern does the opposite: if we find a certain pattern, we will abort the match and fail immediately. For example, suppose we are looking for an 'A' or 'B', but want to fail if '1' is encountered first:
? '--AB-1-' (ANY('AB') | '1' ABORT) Success ? '--1B-A-' (ANY('AB') | '1' ABORT) FailureThe last example may be confusing because the ANY function appears as the first alternative, fostering the illusion that it will find the 'B' in the subject before the other pattern alternative is tried. However, that is not the order of pattern matching; ALL pattern alternatives are tried at cursor position zero in the subject. If none succeed, the cursor is advanced by one, and all alternatives are tried again. When the cursor is in front of subject character '1', ANY still does not match, but the second alternative now does. As the '1's match, ABORT is reached, causing failure.
The BAL pattern matches the shortest nonnull string in which parentheses are balanced. (A string without parentheses is also considered to be balanced.) These strings are balanced:
(X) Y (A!(C:D)) (AB)+(CD) 9395These are not:
)A+B (A*(B+) (X))BAL is concerned only with left and right parentheses. The matching string does not have to be a well-formed expression in the algebraic sense; in fact, it needn't be an algebraic expression at all. Like ARB, BAL is most useful when constrained on both sides by other pattern components.
The FAIL pattern signals failure of this portion of the pattern match, causing the pattern matcher to backtrack and seek other alternatives. FAIL will also suppress a successful match, which can be very useful when the match is being performed for its side effects, such as immediate assignment. For example, in unanchored mode, this statement will display the subject characters, one per line:
SUBJECT LEN(1) $ OUTPUT FAILLEN(1) matches the first subject character, and immediately assigns it to OUTPUT. FAIL tells the pattern matcher to try again, and since there are no other alternatives, the entire match is retried at the next subject character. Forced failure and retries continue until the subject is exhausted.
Pattern FENCE matches the null string and has no effect when the pattern matcher is moving left to right in a pattern. However, if the pattern matcher is backing up to try other alternatives, and encounters FENCE, the match fails.
FENCE can be used to "lock in" an earlier success. Suppose we want to succeed if the first 'A' or 'B' in the subject is followed by a plus sign. In the following example, the 'A's match, we go through the FENCE, and find '+' does not match the next subject character, 'B'. SNOBOL4 tries to backtrack, but is stopped by the FENCE and fails:
? '1AB+' ANY('AB') FENCE '+' FailureIf FENCE were omitted, backtracking would match ANY to 'B', and then proceed forward again to match '+'.
If FENCE appears as the first component of a pattern, SNOBOL4 cannot back up through it to try another subject starting position. This results in an anchored pattern, even if the &ANCHOR keyword specifies unanchored mode:
? 'ABC' FENCE 'B' Failure
This pattern matches the null string and always succeeds. If the scanner is backtracking when it encounters SUCCEED, it reverses and starts forward again. Placing a pattern between SUCCEED and FAIL causes the pattern matcher to oscillate.
I'd like to briefly point out a few more built-in functions. See Chapter 8 of the Reference Manual for a complete description of their use.
APPLY | Allows an indirect call to a function through a variable. |
CONVERT | Provides explicit conversion from one data type to another. Chapter 6 of the Reference Manual, "Data Types and Conversion," describes the conversions possible. |
ENDFILE | Closes a file and detaches all variables associated with it. |
ITEM | Allows an indirect reference to an array or table. |
LPAD & RPAD | These are padding functions, which will pad a string on its left or right side with blanks or a given character. Padding is provided to a specified width, and is useful when producing columnar output. |
Operation: | Negation |
Symbol: | ~ (tilde) |
Operation: | Interrogation |
Symbol | ? (question mark) |
The two functions described below are among the most esoteric features, not just of SNOBOL4, but of all programming languages in existence. While your program is executing, the entire SNOBOL4 compiler is just a function call away.
A SNOBOL4 program is nothing more than a string of characters. The functions EVAL and CODE let you supply the compiler with character strings from within the program itself.
This function is used to evaluate an expression. Its argument may take a number of forms:
? OUTPUT = EVAL(19) 19
? E = *('N SQUARED IS ' N ** 2) ? N = 15 ? OUTPUT = EVAL(E) N SQUARED IS 225This is similar to our earlier use of unevaluated expressions with patterns. In this case, however, the unevaluated expression operator (*) must be applied to the entire expression to create an object with the EXPRESSION data type.
? OUTPUT = EVAL('3 * N + 2') 47If the string compiles without error, EVAL then evaluates the expression and returns the result.
It is this last use of EVAL -- to compile a string -- which is the most interesting. Here is a trivial program which behaves like a simple desk calculator.
LOOP OUTPUT = EVAL(INPUT) :S(LOOP) ENDYou can easily try it by placing a semicolon after the GOTO to protect it from CODE.SNO's own machinations:
?LOOP OUTPUT = EVAL(INPUT) :S(LOOP); 4 * (5 - 2) / 2 6 N + 1 16 ^ZThe program reads a line of input, compiles and evaluates it, and displays the result. Each expression you enter must be wellformed according to SNOBOL4's syntax rules. In particular, this means there must be blanks around the binary operators.
The BNF program included with Vanilla SNOBOL4 demonstrates that EVAL's power is useful even if the input data does not conform to SNOBOL4 syntax. It reads a definition of a grammar from a file, and converts it to SNOBOL4 patterns.
EVAL fails if evaluation of the argument fails, or if the argument contains a syntax error. The SNOBOL4 keyword &ERRTEXT will contain a string describing the error.
The expressions used with EVAL may return any SNOBOL4 data type, not just numbers. For instance, the expression might construct a new pattern, and return it as the result:
ITEM = EVAL('SPAN("0123456789") | *LIST')Note that EVAL can only call the compiler with a string argument. If we used a pattern as the argument, we would produce an execution error:
ITEM = EVAL(SPAN("0123456789") | *LIST) (incorrect)
CODE accepts a string argument containing one or more statements to be compiled. Multiple statements are separated by semicolons (;). Statements may be labeled, and can include all the usual components -- subject, pattern, replacement, and GOTO. However, comment and continuation statements are not permitted.
The CODE function compiles the statements, and returns a pointer to the resulting object code. It fails if any statement contains an error, and places an error message in &ERRTEXT.
There are two ways to execute the new object code.
* Compile a sample piece of code: S = 'L OUTPUT = N; N = LT(N,10) N + 1 :S(L)F(DONE)' CODE(S) * Transfer to a label in it: :(L) * Come here when the new code transfers back. DONE . . .Notice how we placed a GOTO from the new code back to label DONE in the main program. If we had not done this, SNOBOL4 would terminate when execution "fell out of the bottom" of the new code block.
* Compile a sample piece of code: S = 'L OUTPUT = N; N = LT(N,10) N + 1 :S(L)F(DONE)' C = CODE(S) * Transfer to the first statement in the block: :<C> DONE . . .
Labels contained in the new program fragment override any labels of the same name in your main program. This provides the ability to write self-modifying SNOBOL4 programs, and makes the division between "code" and "data" far less distinct than in other high-level languages.