Tutorial on Collections and References in Perl

Anoop Sarkar

This page is a short tutorial on collections (lists and hash tables) in Perl and on how to use references to such collection types. You should first read a more basic tutorial on Perl, which will cover other important facets of the language (notably regular expression matching). This tutorial assumes some knowledge of another programming language.

Scalars

Scalars are atomic types. Integers, floating point numbers and strings are all scalar values. Scalar variables are denoted by the prefix $ on the variable name. A comment starts with the symbol # and runs up to the end of the line.

$scalarInt = 42; # scalar integer variable

$scalarFloat = 42.3; # scalar floating point variable

$scalarInterpolatedString = "value=$scalarInt\n"; # scalar string variable with interpolation of values into the string

$scalarString = 'value'; # scalar string without interpolation

There are many operators (such as ++, eq, ...) and functions (such as lc, length, ...) that can be used to manipulate scalar variables. See the man page for perlop for more on operators (using man perlop) and the man page on perl functions (man perlfunc).
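As a small illustration of the operators and functions named above, here is a minimal sketch. The variable names are made up for this example; only ++, eq, ==, lc and length come from Perl itself.

```perl
use strict;
use warnings;

my $count = 41;
$count++;                        # numeric increment, $count is now 42

my $word  = "Perl";
my $lower = lc($word);           # lower-cased copy of the string: "perl"
my $chars = length($word);       # number of characters in the string: 4

# eq compares two scalars as strings, == compares them as numbers
my $sameString = ($lower eq "perl") ? "yes" : "no";
my $sameNumber = ($count == 42)     ? "yes" : "no";

print "count=$count lower=$lower chars=$chars\n";
print "sameString=$sameString sameNumber=$sameNumber\n";
```

Note that lc returns a new string and leaves $word unchanged.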

Arrays

The most basic collection type is the array. Arrays can also be viewed as lists (no random access), but in Perl lists are still stored as arrays and are not a distinct collection type. Array variables are denoted by the prefix @ on the variable name. Arrays are indexed from 0 in Perl. Notice that a single element of the array (e.g. $intArray[0]) is written with the prefix $, since each element is a scalar.

@intArray = (1,2,3,4,5); # array of integers

$firstElement = $intArray[0]; # random access to the first element of @intArray

$thirdElement = $intArray[2]; # random access to the third element of @intArray

@miscArray = (1, "second", 2, 2.34); # arrays are not strongly typed, each array element can be a different scalar type

The function pop removes the last element of the array, and the function push adds a new element as the last element of the array. From the other end, the function shift removes the first element of the array, while unshift adds a new element as the first element of the array.

@intArray = (); # the empty array

push(@intArray, 1); # adds 1 as the last element

push(@intArray, 2); # adds 2 as the last element

$value = shift(@intArray); # $value == 1 and @intArray becomes (2)

unshift(@intArray, 3); # adds 3 as first element, @intArray becomes (3,2)

$value = pop(@intArray); # $value == 2, @intArray becomes (3)

$intArray[1] = 4; # assigns to index 1; since @intArray was (3), this extends it to (3,4)
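Because these four functions work on both ends of an array, the same array type can serve as a stack or as a queue. The following is a minimal sketch of that idea; the names @stack, @queue, $top and $front are chosen only for this example.

```perl
use strict;
use warnings;

# Stack: push and pop both work on the last element (last in, first out)
my @stack = ();
push(@stack, "a");
push(@stack, "b");
push(@stack, "c");
my $top = pop(@stack);           # $top is "c", @stack is now ("a", "b")

# Queue: push adds at the end, shift removes from the front (first in, first out)
my @queue = ();
push(@queue, "a");
push(@queue, "b");
push(@queue, "c");
my $front = shift(@queue);       # $front is "a", @queue is now ("b", "c")

print "top=$top stack=@stack\n";
print "front=$front queue=@queue\n";
```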

The length of an array can be obtained in two different ways. For each array variable, Perl provides a special variable written with the prefix $# (for example $#intArray for the array @intArray), which is an integer equal to the index of the last element in the array. The length of the array can also be obtained by coercing the array variable into a scalar.

@intArray = (1..3); # initializes array with all integers between 1 and 3 inclusive

$lastIndex = $#intArray; # index of the last element, $lastIndex == 2

$length1 = $lastIndex + 1; # size of the array

$length2 = scalar(@intArray); # type coercion, reports size of the array, $length1 == $length2
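A few related cases may be worth seeing side by side. This is a sketch with illustrative variable names: it shows that assigning an array to a scalar variable performs the same coercion as scalar(), what the last index of an empty array looks like, and that a negative index counts from the end of the array.

```perl
use strict;
use warnings;

my @items = (10, 20, 30);
my @empty = ();

my $count       = @items;            # assignment to a scalar coerces the array: 3
my $lastOfEmpty = $#empty;           # an empty array has last index -1
my $last        = $items[$#items];   # last element via the last index: 30
my $alsoLast    = $items[-1];        # negative indices count from the end: 30

print "count=$count lastOfEmpty=$lastOfEmpty last=$last alsoLast=$alsoLast\n";
```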

The functions join and split are useful in converting from arrays to strings and vice versa.

$line = "This is a line of text\n"; # $line is a scalar string variable

chomp($line); # chomp removes the final newline character

@wordArray = split(" ", $line); # the special separator " " splits on runs of whitespace and ignores leading whitespace

@wordArray = split(/\s+/, $line); # split can also use regular expressions (see 'man perlre' for more on regexps)

$line = join(" ", @wordArray); # merges elements of @wordArray with " " between each element to form a scalar

$line = $line . "\n"; # concatenates the newline character to form the original value of $line

Just like scalar variables can be interpolated into strings, array variables are also automatically interpolated.

print join(" ", @wordArray), "\n"; # one method to print the contents of an array variable

print "@wordArray\n"; # alternative method to print the array

Arrays can also be multi-dimensional. Here is an example of a two dimensional array:

$distance[14][20] = 120;

Hash Tables

Hash table variables are denoted by the prefix %. Hash tables create a mapping from scalars (the keys) to scalars (the values). The keys are always strings: any other scalar used as a key (an integer, for example) is converted to its string form. The syntax to insert elements into a hash table is similar to the syntax for arrays, with curly braces {} in place of square brackets; an entry is removed with the function delete.

%hashTable = ();

$hashTable{"first"} = 1; # inserts a new hash table entry with key="first" and value=1

$hashTable{"second"} = 2; # new entry with key="second" and value=2

@hashkeys = keys(%hashTable); # the function keys returns a list containing only the keys, "first" and "second", in no particular order

for my $key (keys(%hashTable)) { print $hashTable{$key}, "\n"; } # printing every value in the hash table
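The text above mentions deleting entries but does not show it. Here is a minimal sketch using the built-in functions exists and delete, plus sort to get the keys in a predictable order. The variable names other than %hashTable are assumptions made for this example.

```perl
use strict;
use warnings;

my %hashTable = ();
$hashTable{"first"}  = 1;
$hashTable{"second"} = 2;
$hashTable{"third"}  = 3;

# exists tests whether a key is present in the hash table
my $hasSecond = exists($hashTable{"second"}) ? 1 : 0;        # 1

# delete removes the entry and returns the value that was stored
my $removed = delete($hashTable{"second"});                  # 2

my $stillHasSecond = exists($hashTable{"second"}) ? 1 : 0;   # 0

# keys come back in no particular order, so sort them before printing
my @sortedKeys = sort(keys(%hashTable));                     # ("first", "third")
for my $key (@sortedKeys) { print "$key => $hashTable{$key}\n"; }
```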

Note that only scalars can be stored as values. So any variable that is not scalar will be coerced to a scalar value before insertion as a value into the hash table. Remember that an array variable when coerced to a scalar will report the length of the array.

@intArray = (1..3);

$hashTable{"third"} = @intArray; # coercion to a scalar type means the value is not the array but its length

References

So if we wanted to store more complex information in arrays or hash tables we need a method to convert complex data types into scalars. References allow us to do just that. Let's take an example program: say we need to store for each word in the input text a list of all the line numbers in which that word occurred. The natural method to store this information would be to store each word as a key in a hash table and have a list of line numbers as the value. Let's assume we have a list of line numbers:

@lineNumbers = (34, 23, 78, 122, 455);

A reference to a list is a scalar value (as usual denoted by the $ prefix to the variable name). The operator \ creates a reference:

$refList = \@lineNumbers;

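The function ref returns a true value if its scalar argument is a reference, and the empty string (which is false) otherwise. For a reference, the value returned is a string naming the kind of thing referred to, such as ARRAY, HASH, CODE (a function) or SCALAR, so the reference internally knows whether it refers to an array or a hash table (or even a function or a scalar variable).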

if (ref $refList) { print "yes\n"; } else { print "no\n"; } # will print 'yes'
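To see what a reference knows about its target, one can look at the string that ref returns. This is a small sketch; all variable names are invented for the example.

```perl
use strict;
use warnings;

my @list   = (1, 2, 3);
my %table  = ("one" => 1);
my $number = 42;

my $kindOfArrayRef  = ref(\@list);                # "ARRAY"
my $kindOfHashRef   = ref(\%table);               # "HASH"
my $kindOfScalarRef = ref(\$number);              # "SCALAR"
my $kindOfCodeRef   = ref(sub { return 1; });     # "CODE"
my $kindOfPlain     = ref($number);               # "" because $number is not a reference

print "$kindOfArrayRef $kindOfHashRef $kindOfScalarRef $kindOfCodeRef [$kindOfPlain]\n";
```

Since every non-empty string returned here is true and the empty string is false, a test such as if (ref $x) works as a simple "is this a reference" check.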

Note that references are scalar variables (as seen from the prefix $ on the variable name). Since collections are over scalars, this means that we can use references as members in lists and hash tables. So if we wanted to store a list of lists, we can now do it using references:

@lineNumbers = (34, 23, 78, 122, 455);

$refList = \@lineNumbers;

$refList = [34, 23, 78, 122, 455]; # short-hand that creates a reference to a new, unnamed list with the same contents (it does not refer to @lineNumbers itself)

$reflistA = [1,2,3];
$reflistB = [4,5,6];
$reflistC = [7,8,9];

@listOfLists = ($reflistA, $reflistB, $reflistC);  # this creates a list with three elements: each a reference to a list

$listOfLists[3] = [10,11,12]; # this adds a new list ref to the list @listOfLists
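One detail worth checking with a small experiment is the difference between \@array and [ ... ]. The sketch below, with illustrative names $alias and $copy, suggests that the first refers to the named array itself while the second refers to a separate new list.

```perl
use strict;
use warnings;

my @lineNumbers = (34, 23, 78);

my $alias = \@lineNumbers;       # refers to @lineNumbers itself
my $copy  = [ @lineNumbers ];    # refers to a new list holding copies of the elements

push(@$alias, 122);              # changes @lineNumbers
push(@$copy, 455);               # changes only the new list

print "lineNumbers = @lineNumbers\n";   # prints: lineNumbers = 34 23 78 122
print "copy        = @$copy\n";         # prints: copy        = 34 23 78 455
```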

Now that we have a scalar variable that refers to a list, we can store this variable $refList as a value in a hash table. Let's assume the word is "keyword", then:

$hashTable{"keyword"} = $refList;

To get the original list variable back, we need to dereference the reference:

$reflist = $hashTable{"keyword"};

@lineNumbers = @{$reflist};

These two lines can be combined into one statement (in this case the precedence requires the use of the curly braces {} around the reference):

@lineNumbers = @{$hashTable{"keyword"}};

The dereference, which is the rvalue in the above statement, can also be used as an lvalue:

$newLineNumber = 677;

push( @{ $hashTable{"keyword"} } , $newLineNumber );
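Putting the pieces together, the word-to-line-numbers table described at the start of this section can be sketched as follows. The input is a small in-memory list of lines rather than a file, and the names @lines and %index are assumptions made for this example.

```perl
use strict;
use warnings;

my @lines = ("the cat sat", "the dog", "a cat");

my %index = ();          # maps each word to a reference to a list of line numbers
my $lineNumber = 0;

for my $line (@lines) {
    ++$lineNumber;
    for my $word (split(' ', $line)) {
        # the list reference is created automatically the first time a word is seen
        push( @{ $index{$word} }, $lineNumber );
    }
}

for my $word (sort(keys(%index))) {
    print "$word: @{ $index{$word} }\n";
}
```

In this sketch a word that occurs twice on the same line gets that line number recorded twice.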

The statement shown above modifies the contents of the list pointed to by the reference that was stored as a value in the hash table: it first dereferences the reference and then modifies the list. Perl has some syntactic sugar that makes access to individual elements through a reference easier:

@listVar = (1..10); # create a list variable

$reflist = \@listVar; # create a reference to the list variable

${$reflist}[0] = 11; # changes the first element of @listVar to 11

$reflist->[0] = 11; # same as previous statement, changes first element of @listVar, but written with the arrow operator instead of an explicit ${...}

The above notation can be useful when dealing with lists of lists (read the description earlier in this section on how to define list of lists):

@listOfLists = ([1,2,3], [5,6,7], [9,10,11]); # create a list variable with three listrefs

$listOfLists[0]->[3] = 4; # add the element 4 to the first listref element

$listOfLists[1]->[3] = 8; # add the element 8 to the second listref element

$listOfLists[2]->[3] = 12; # add the element 12 to the third listref element

for my $refList (@listOfLists) { print join(" ", @$refList), "\n"; } # print out the contents of @listOfLists
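When reading a single element out of a list of lists, there are several equivalent spellings. The sketch below shows three of them and a nested loop over all elements; the variable names other than @listOfLists are invented for the example.

```perl
use strict;
use warnings;

my @listOfLists = ([1,2,3], [5,6,7], [9,10,11]);

my $viaArrow   = $listOfLists[1]->[2];       # 7
my $noArrow    = $listOfLists[1][2];         # 7, the arrow is optional between two subscripts
my $viaBraces  = ${ $listOfLists[1] }[2];    # 7, explicit dereference

# sum every element by looping over the list refs and then over each list
my $total = 0;
for my $row (@listOfLists) {
    for my $value (@$row) { $total += $value; }
}

print "viaArrow=$viaArrow noArrow=$noArrow viaBraces=$viaBraces total=$total\n";
```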

Hash tables can also be defined over multiple keys (analogous to the use of multi-dimensional arrays):

%mdimHash = (); # initialize hash table to be empty

$mdimHash{"key1"}{"key2"} = "valuestring";

The value of $mdimHash{"key1"} is actually a reference to another hash table, which contains the key "key2" with the value "valuestring". Here is how to loop over all the keys of a multi-dimensional hash table:

for my $k1 (keys(%mdimHash)) {
  for my $k2 (keys(%{ $mdimHash{$k1} })) {
    print "key1=$k1 key2=$k2 value=$mdimHash{$k1}{$k2}\n";
  }
}

As we saw before, we can also dereference values using the -> operator. Hence the following comparison is true (eq is used because the values are strings):

$mdimHash{"key1"}{"key2"} eq $mdimHash{"key1"}->{"key2"}
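A common use of a two-key hash table is counting pairs of adjacent words. The following is a minimal sketch on a made-up input string; the name %bigramCount is an assumption for this example.

```perl
use strict;
use warnings;

my @words = split(' ', "a b a b c");

my %bigramCount = ();
for my $i (1 .. $#words) {
    # first key is the previous word, second key is the current word
    $bigramCount{ $words[$i-1] }{ $words[$i] }++;
}

for my $w1 (sort(keys(%bigramCount))) {
    for my $w2 (sort(keys(%{ $bigramCount{$w1} }))) {
        print "$w1 $w2 $bigramCount{$w1}{$w2}\n";
    }
}
```

The inner hash tables are created automatically the first time a new first key is used.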

An Example

Now you're in a position to understand an example Perl program which combines the ideas presented above. The input to the program is a text file with one sentence per line (you can try hw1.txt for example). For every word pair w[i-1], w[i], the program prints the list of all line numbers k such that the pair appears in the sentence at line number k. An additional condition is that both count(w[i-1]) and count(w[i]) must be greater than the minimum word count defined in the variable $minWordCount.

use strict;

my $minWordCount = 10;
my %pairCount = ();
my %wordCount = ();
my $line;
my $lineNumber = 0;

while ($line = <>) {
    chomp($line);
    ++$lineNumber;
    my @words = split(' ', $line);

    # store the count of every word on the line (including the first
    # word, w[0]) in the hash table %wordCount
    for my $word (@words) { $wordCount{$word}++; }

    # size of the array @words
    my $sz = scalar(@words);

    # go to next line if current line has fewer than 2 words
    if ($sz < 2) { next; }

    for (my $i=1; $i < $sz; $i++) {

        # store the current line number in the reference to the list
        # stored in the hash table %pairCount with two keys: w[i-1]
        # and w[i]
        push( @{ $pairCount{ $words[$i-1] }{ $words[$i] } } , $lineNumber);
    }
}

for my $w1 (keys(%pairCount)) {
    if ($wordCount{$w1} > $minWordCount) {
        for my $w2 (keys(%{ $pairCount{$w1} })) {
            if ($wordCount{$w2} > $minWordCount) {
                print "word1=$w1 word2=$w2: ";
                # reference to a list in a multi-dimensional hash table
                print "linenumbers = @{ $pairCount{$w1}{$w2} } \n";
            }
        }
    }
}

Anoop Sarkar <anoop at cs.sfu.ca>