Previous chapter
·
Next chapter
·
Table of Contents
In this chapter we will explore some additional SNOBOL4 operators and data types. Many of these concepts are entirely absent from other programming languages. Far from being esoteric, they fit quite naturally into SNOBOL4, and add to its conciseness and power of expression. In the following examples, we will continue to use the CODE.SNO program to illustrate each new idea.
In conventional programming languages, a variable's name may be specified only at the time the program is written. In fact, once the run-time storage has been allocated, the textual form of the name can be discarded. This is not the case in SNOBOL4; you can create new variables during execution, and reference existing ones from names specified in character strings.
The unary operator dollar sign ($) is the "indirect reference operator." By applying it to a variable you instruct SNOBOL4 to use its contents as the name of another variable, and to continue on to reference that variable. SNOBOL4 "goes through" the operand to reach the variable. Try the following simple example:
? DOG = 'BARK'
? CAT = 'MEOW'
? ANIMAL = 'CAT'
? OUTPUT = $ANIMAL
MEOW
? ANIMAL = 'DOG'
? OUTPUT = $ANIMAL
BARK
These statements make their indirect reference through the
string contained in variable ANIMAL. ANIMAL's contents are
treated as a "pointer" to the final destination. That is, using
ANIMAL by itself retrieves the string 'DOG', while $ANIMAL refers
to the variable DOG.
New variables may also be created by using an indirect reference as the object of an assignment. Here, $DOG causes variable BARK to be created, and assigned the string 'RUFF':
? $DOG = 'RUFF'
? OUTPUT = BARK
RUFF
Indirect referencing may proceed to any depth, provided the
null string is never encountered as a variable name:
? OUTPUT = $ANIMAL '-' $$ANIMAL
BARK-RUFF
? OUTPUT = $RUFF
Execution error #4, Null string in illegal context
In the first example, $ANIMAL produces the contents of variable
DOG, while $$ANIMAL refers to the variable BARK. The second example
attempts to go through RUFF -- which was not previously defined -- and
obtains the null string. Of course, the null string
is not a valid variable name.
Indirect referencing provides a means of "programming by association." Suppose we want to write a program that allows the user to enter a state name and receive the state's capital in response. We've provided a data file called CAPITAL.DAT, in which each line contains a state name, comma, and the capital. The first part of the program will read the file and set up an associative data base:
* Trim input, attach data file to variable INFILE
&TRIM = 1
INPUT('INFILE', 1, , 'CAPITAL.DAT') :F(ERR)
* Read a line from file. Start querying upon EOF
READF LINE = INFILE :F(QUERY)
* Break out state and capital from line
LINE BREAK(',') . STATE LEN(1) REM . CAPITAL :F(ERR)
* Convert state name into a variable, and assign the
* capital city string to it. Then read next line.
$STATE = CAPITAL :(READF)
ERR OUTPUT = 'Illegal data file' :(END)
QUERY . . .
We attach the file, and associate variable INFILE with it.
Successive file lines are read into variable LINE. Pattern
matching assigns the state name and capital city to variables
STATE and CAPITAL respectively. We use an indirect reference
through $STATE to create a new variable with the state's name,
and assign the capital city to it. For example, the file line
'COLORADO,DENVER' creates variable COLORADO, containing 'DENVER'.
Having established a data base, completing the program to access it is trivial:
* Read state name, access it as a variable
QUERY OUTPUT = $INPUT :S(QUERY)
END
An input line is read from the user, and used for an indirect
reference. If the user types a state name, treating it as a
variable name obtains the state capital. An invalid state name
would reference a new variable, whose value is the null string,
and a blank line would be output. A more complete program might
test for this null string and produce an error message.
The addition of one statement to the program loop creating the data base allows us to enter either the state name or capital city, and obtain the other:
$STATE = CAPITAL
$CAPITAL = STATE :(READF)
How would we solve this problem in a language like BASIC?
States and capitals could be stored in an array. We would then
use a loop to sequentially compare the user's input string with
the array elements. If a match were found, the result would be
displayed from another array element. In SNOBOL4, we did it all
with one statement: OUTPUT = $INPUT. Associative programming can
often replace a conventional linear search.
Earlier I said that variable names were composed of letters, digits, and the characters period and underscore. These restrictions apply only to variables which appear in program text. Variable names created or referenced with the indirect reference operator may be composed of ANY nonnull string of characters, and may be as long as any other string. If we set keyword &DUMP nonzero, we would see a list of states and capitals when the program terminated. The variable names created by $STATE are in the left column, and their string contents in the right column:
ALABAMA = 'MONTGOMERY'
ALASKA = 'JUNEAU'
. . .
NEW HAMPSHIRE = 'CONCORD'
The dump reveals a variable named NEW HAMPSHIRE, which contains
a "blank" within its name. Clearly, you cannot directly say:
NEW HAMPSHIRE = 'CONCORD'
since SNOBOL4 sees this as a pattern match statement: variable
NEW is the subject, and variable HAMPSHIRE contains the pattern.
To reference this variable, we must use:
$'NEW HAMPSHIRE' = 'CONCORD'
Try CODE.SNO with some unconventional variable names:
? $'"' = 'DOUBLE QUOTE'
? $'$#@!*' = 53
? OUTPUT = $'$#@!*' $'"'
53 DOUBLE QUOTE
Indirect referencing is not restricted to the main body of a statement. It may be used in the GOTO field to transfer control to a label specified by a variable. Suppose variable OP held the one-character string '+', '-', '*', or '/'. This GOTO would transfer to one of four statements, labeled L+, L-, L*, or L/:
statement :($('L' OP))
L+ statement
L- statement
The string in OP is appended to string 'L', and the result is
used with an indirect reference to obtain the final label name.
Indirect referencing in the GOTO field is a more powerful version of the computed GOTO which appears in some languages. It allows a program to quickly perform a multiway control branch based on an item of data. Of course, the computed label name must be defined in the program. SNOBOL4 provides an error message if your program transfers to an undefined label.
Indirect referencing may not be used in a statement's label field. Dynamically changing the name of a statement during execution is excessive even by SNOBOL4 standards.
The pattern data type appears when a pattern structure is stored in a variable for subsequent use in a pattern match. For example, a pattern to capture the next N characters after a colon, and store them in variable ITEM could be written as:
NPAT = ':' LEN(N) . ITEM
Notice that a definition such as this is static. NPAT captures
the value of variable N at the time of pattern construction. If
we subsequently alter N in the program, NPAT retains N's original
value. One way to use the current value of N is to explicitly
specify the pattern each time it is needed:
SUBJECT ':' LEN(N) . ITEM
Now the pattern is being constructed anew whenever the statement
is executed. But reconstructing a pattern whenever it is
used is inefficient, so a one-time definition is preferable.
The "unevaluated expression" operator allows us to obtain the efficiency of the NPAT formulation, yet use the current value of N when NPAT is referenced. It is a unary operator, whose graphic symbol is the asterisk (*). Now we would specify NPAT like this:
NPAT = ':' LEN(*N) . ITEM
The pattern is only constructed once, and assigned to NPAT.
The current value of N is ignored at this time. Later, when NPAT
is used in a pattern match, the unevaluated expression operator
tells SNOBOL4 to fetch the current value of N.
The unevaluated expression operator may be used with the argument of the pattern functions ANY, BREAK, LEN, NOTANY, POS, RPOS, RTAB, SPAN, or TAB. It may also be applied to an alternate or subsequent clause or to an entire pattern. Here's an example:
? PAT = TAB(*I) . OUTPUT SPAN(*S) . OUTPUT
? SUB = '123AABBCC'
? I = 4
? S = 'AB'
? SUB PAT
123A
ABB
? I = 3
? SUB PAT
123
AABB
It's worth noting that I and S were undefined when PAT was
first constructed. Later, we will apply this technique to construct
recursive patterns.
Our examples have made extensive use of the conditional assignment operator to capture matched substrings after a successful pattern match. The "immediate assignment" operator allows us to capture intermediate results during the pattern match.
Immediate assignment occurs whenever a subpattern matches, even if the entire pattern match ultimately fails. Immediate assignment is a binary operator whose graphic symbol is the dollar sign ($). Like conditional assignment, the matching substring on its left is assigned to the variable on its right. Here are examples with CODE.SNO where we use variable OUTPUT to reveal the work of the pattern matcher:
? S = 'ABCDEFG'
? S 'A' ARB $ OUTPUT 'E'
B
BC
BCD
Success
? S ('B' LEN(2) | 'C' LEN(3)) $ OUTPUT 'G'
BCD
CDEF
Success
?
As useful as immediate assignment is for revealing the inner workings of a pattern match, a more powerful use is possible. It can be used with the unevaluated expression operator to develop a new class of patterns. An interesting substring at the beginning of the subject is immediately assigned to a variable, and the variable is then subsequently used in the very same pattern.
Suppose a number at the beginning of the subject specifies the length of a variable width field that follows. We would like to capture the number into variable N, then use it with the LEN function to transfer the data into variable FIELD. When used with LEN, N must be preceded by the unevaluated expression operator, so that its new value is retrieved. For instance:
? FPAT = SPAN('0123456789') $ N LEN(*N) . FIELD
? '12ABCDEFGHIJKLMNOPQ' FPAT
Success
? OUTPUT = FIELD
ABCDEFGHIJKL
SPAN matched the field length, 12, and immediately assigned it
to N. LEN(*N) then matched the next 12 characters. Another subject,
with a different field length, would update N appropriately.
Type conversion was working quietly behind the scenes
here: N was assigned the string '12', yet it appeared as integer
12 to the LEN function.
Now here is an example which provides a glimpse of just how powerful SNOBOL4's pattern matching can be. Problem: Examine a subject for an arbitrary three-character substring which appears twice in a row, or bracketed in parentheses. Solution:
? TWOPAT = LEN(3) $ X . OUTPUT *(X | "(" X ")")
? 'ABCDECDEFGH' TWOPAT
CDE
Success
? 'ABCDE(CDE)BA' TWOPAT
CDE
Success
As you experiment with these types of patterns, you may discover
some which fail when they should succeed. The problem is
that SNOBOL4 stops matching when it believes further match attempts
would be futile. These "heuristics" are normally invisible,
and speed program execution. At this time, we'll defer discussing
heuristics, and simply mention that they can be disabled
with the statement:
&FULLSCAN = 1
Let's take a break from pattern matching, and examine some
other SNOBOL4 data types.
Arrays in SNOBOL4 are similar to arrays in other programming languages. They allow a single variable name to specify more than one data element; integer subscripts distinguish the individual members of an array. Each array element may contain any data type, independent of the types in other array elements.
A one-dimensional array is a "vector;" it is simply a list of I items. A two-dimensional array is a "grid" composed of several adjacent vectors -- an I by J array has I rows and J columns. A three-dimensional array, I by J by K in size, is a rectangular solid consisting of K adjacent grids. There's no limit to the number of dimensions allowed, but such arrays become increasingly difficult to visualize.
In keeping with SNOBOL4's pliability, an array is defined during program execution, rather than at compilation time. Its size and shape is specified by a string. The definition of an array may be changed at any time, or the array may be deleted and its memory reused when it is no longer needed.
Arrays are created by the SNOBOL4 function ARRAY. A program calls this function with a "prototype string" which specifies the number of dimensions and their sizes. The function returns an "array pointer," which is stored in a variable; the array elements are referenced by applying subscripts to this variable. Here are two statements for use with CODE.SNO. They create oneand two-dimensional arrays named LIST and BOX respectively:
? LIST = ARRAY('25')
? BOX = ARRAY('12,3')
LIST points to a vector of 25 elements. BOX points to a grid,
12 rows high and 3 columns wide, containing 36 elements. The
ARRAY function initializes all array elements to the null string.
Array subscripts are integer valued, and are specified by angular or square brackets (<> or []). Subscript values range from 1 to the size of each dimension. If you attempt to use a subscript outside this range, the array reference will fail, and the failure may be detected in the GOTO portion of the statement. Try some array references with CODE.SNO:
? LIST<3> = 'MAPLE'
? BOX[10,2] = 3
? LIST[33] = 4
Failure
? OUTPUT = LIST[3] LIST[4] BOX<10,2>
MAPLE3
Angular and square brackets are interchangeable. The reference
to LIST[33] failed because the largest subscript allowed for that
array is 25. LIST[4] produced its initialized value, the null
string, and had no effect on the concatenation. The array
pointer in LIST can be assigned to another variable:
? B = LIST
? OUTPUT = B[3]
MAPLE
? B<3> = 'WILLOW'
? OUTPUT = LIST<3>
WILLOW
Assigning the pointer in LIST to B made both variables point to
the same array. Since there's but one actual array, array references
made using LIST or B are equivalent. The COPY function
(see Chapter 8 of the Reference Manual) creates a duplicate copy of an entire array.
Array elements may be used anywhere a variable name is allowed -- expressions, patterns, function arguments, etc. The fact that an array reference fails if a subscript is out-ofbounds can be used in a simple and natural way when scanning an array. Rather than having to know an array's size, we simply loop until an array reference fails. A program segment to display the members of an array SCORE might look like this:
I = 0 I = I + 1 OUTPUT = SCORE[I] :S(PRINT) . . .
Arrays may be created with an initial value other than the null string. ARRAY accepts a second argument which specifies this initial value. We can create a three-dimensional array with all elements initialized to the string 'PA-18' as follows:
? A = ARRAY('2,3,4','PA-18')
? OUTPUT = A[1,2,3]
PA-18
Ordinarily, subscripts range from 1 to the size of each dimension. However, if you find it more convenient, other subscript ranges may be used. The prototype string for ARRAY's first argument has the general form:
'L1:H1,L2:H3,...,Ln:Hn'
The L's and H's are integers specifying the lower and upper
bounds of each dimension. If the lower bound and colon are
omitted from any dimension, the integer 1 is assumed. Here is a
five element vector, with allowed subscripts -2, -1, 0, 1 and 2:
? A = ARRAY('-2:2','PIPER')
? OUTPUT = A[-1]
PIPER
? OUTPUT = A[3]
Failure
Arrays are a traditional computer programming concept. Now
we'll see how SNOBOL4 takes the idea one important step further,
with the concept of tables.
A "table" is similar to a one-dimensional array, with two important differences. First, a table's size is not fixed; it extends itself automatically whenever a new element is added to it. Second, table subscripts are not limited to integers, but may be any SNOBOL4 data type. Strings and patterns may be used as subscripts. Tables combine the idea of associative programming with the data grouping of arrays.
Tables are created by the SNOBOL4 function TABLE. No arguments are required, since a table's size is not fixed. The function returns a table pointer, which you store in a variable. Like arrays, table elements are referenced by applying subscripts to the variable. Try this example with CODE.SNO:
? T = TABLE()
? T['ROSE'] = 'RED'
? T['N'] = 6
? OUTPUT = T['N'] T['THE'] T['ROSE']
6RED
? FLOWER = 'ROSE'
? T[FLOWER] = T[FLOWER] ',THORNS'
? OUTPUT = T[FLOWER]
RED,THORNS
Here, strings have been used as table subscripts. The concept
of an "out-of-bounds" subscript does not exist with tables. The
reference to T['THE'] created a new entry, and assigned it the
null string. Unlike arrays, no initial value for new entries may
be specified in the call to TABLE; new table entries are always
initialized to the null string.
In the above example, we know what values were used as table subscripts. But if the table were constructed from data in a file, how can we determine what items were placed in the table? We need to know the subscripts to view the table, but the subscripts themselves are part of the table. If this were an array, we could run an integer subscript over the array to see the data. Applying integer subscripts to a table only creates more entries.
SNOBOL4 provides a simple solution to this dilemma -- a method to convert a table to an array. An N row by 2 column array can be created from a table. The first array column contains the subscripts that were used to create the table. The second column contains the data items stored with the corresponding table subscript. N is the number of table entries with nonnull values.
Once the table is in array form, integer subscripts can be applied to the array to display the subscripts and their values. A table is converted to an array with the CONVERT function, which accepts a table argument and the word 'ARRAY', and returns a pointer to the new array. Continuing with the previous example:
? A = CONVERT(T, 'ARRAY')
Success
? OUTPUT = A[1,1] '-' A[1,2]
ROSE-RED,THORNS
? OUTPUT = A[2,1] '-' A[2,2]
N-6
As you would expect with SNOBOL4, the inverse operation -- conversion
of an array to a table -- is also possible. The array
must be rectangular, N rows by 2 columns. The array entries in
the first column become the table subscripts. The array's second
column becomes the table entry values:
? W = CONVERT(A, 'TABLE')
Success
? OUTPUT = W['ROSE']
RED,THORNS
Tables are useful when we want to record a number of pair associations, where each half of the pair might have any data type. A classic example of a table's utility is a word usage program. Earlier, we developed a program to count the total number of words in a file. We will modify that program to count the number of times each unique word appears. The program begins like this:
* Simple word usage program, WORDU.SNO.
*
* A word is defined to be a contiguous run of letters,
* digits, apostrophe and hyphen. This definition of legal
* letters in a word can be altered for specialized text.
*
* If the file to be counted is TEXT.IN, run as follows:
* B>SNOBOL4 WORDU /I=TEXT
*
&TRIM = 1
* Define the characters which comprise a 'word'
WORD = "'-" '0123456789' &LCASE
* Pattern to isolate each word as assign it to ITEM:
WPAT = BREAK(WORD) SPAN(WORD) . ITEM
* Create a table to maintain the word counts
WCOUNT = TABLE()
* Read a line of input and obtain the next word
NEXTL LINE = REPLACE(INPUT, &UCASE, &LCASE) :F(DONE)
NEXTW LINE WPAT = :F(NEXTL)
* Use word as subscript, update its usage count
WCOUNT<ITEM> = WCOUNT<ITEM> + 1 :(NEXTW)
DONE . . .
We'll convert the input to lower case, so words like 'The' and
'the' are counted together. WPAT has been changed to store each
word in variable ITEM. When a word is identified, it is used as
a subscript for table WCOUNT. When ITEM contains a new word, the
first reference to WCOUNT<ITEM> creates a new table entry and
returns the null string. Integer 1 is added to the null string,
and the result, 1, is stored back into WCOUNT<ITEM>. If the same
word is encountered again, WCOUNT<ITEM> for that word will be
incremented to 2.
The program reads the input file, building a table with entries for each unique word. When End-of-File is read, control transfers to label DONE, where we display the words and their respective counts. We convert WCOUNT to an array, and use integer subscripts to retrieve the words and their counts. Conversion fails if the table is empty. Continuing with this program:
* Convert table to array. Fail if table is empty
DONE A = CONVERT(WCOUNT, 'ARRAY') :F(EMPTY)
* Scan array, printing words and counts
I = 0
PRINT I = I + 1
OUTPUT = A<I,1> '--' A<I,2> :S(PRINT) F(END)
EMPTY OUTPUT = 'No words'
END
The table subscripts were the file's words, and have been
placed in the first column of the array, A<I,1>. The count for
each word was the table entry, now in the second column, A<I,2>.
Tables are very convenient for recording information about data
items, while conversion to an array makes it easy to systematically
examine the recorded information.
The unary name operator provides the address or location in memory where a variable is stored. Its graphic symbol is the period (.). We'll introduce it here through an example.
Consider the indirect reference operator mentioned earlier. Suppose we want to use a variable to point to different elements of an array or table. If we try the following, we immediately discover a problem:
? A = ARRAY('10,10')
? A[4,2] = 'DOG'
? V = 'A[4,2]'
? OUTPUT = $V
Success
The indirect reference operator treats the string 'A[4,2]' as a variable name, rather than an array element. Remember, any character sequence can be used indirectly to create a variable. SNOBOL4 creates a variable called A[4,2] that has absolutely no connection with array A. The fact that this character sequence happens to look like an array reference to us is purely coincidental from SNOBOL4's point of view.
To make this work, the name operator is applied to A[4,2] to obtain the address of that array element. The address can be stored in variable V, and referenced with the indirect operator:
? V = .A[4,2]
? OUTPUT = $V
DOG
The name operator provides a general method for specifying the
name of an object. Both of these statements are correct for
specifying the first argument to the INPUT function:
INPUT('INFILE', 1, , 'CAPITAL.DAT')
INPUT(.INFILE, 1, , 'CAPITAL.DAT')
Either form, 'INFILE' or .INFILE, tells the INPUT function the
name of the variable to be input associated. However, using the
name operator allows us to associate a file with an array or
table element:
INPUT('A[4,2]', 1, , 'CAPITAL.DAT') (incorrect)
INPUT(.A[4,2], 1, , 'CAPITAL.DAT')
Note that alternate use of the indirect reference and name
operators "cancel" one another, so
? OUTPUT = $(.($(.A[4,2])))
DOG
is simply a reference to A[4,2].
Previous chapter
·
Next chapter
·
Table of Contents