Tutorial on Collections and References in Perl
Anoop Sarkar
This page is a short tutorial on collections
(lists and hash tables) in Perl and how to use references
to such collection types. You should first read a more basic
tutorial on Perl, which will also cover other
important facets of the language (notably regular expression
matching). This tutorial assumes some knowledge of another programming
language.
Scalars
Scalars are atomic values. Integers, floating point numbers and strings are all scalars.
Scalar variables are denoted by the prefix $ on the variable name. A comment starts with the symbol # and runs up to the end of the line.
$scalarInt = 42; # scalar integer variable
$scalarFloat = 42.3; # scalar floating point variable
$scalarInterpolatedString = "value=$scalarInt\n"; # scalar string variable with interpolation of values into the string
$scalarString = 'value'; # scalar string without interpolation
There are many operators (such as ++, eq, ...) and
functions (such as lc, length, ...) that can be used to
manipulate scalar variables. See the perlop man page (man perlop) for more on
operators and the perlfunc man page (man perlfunc) for more on Perl's
built-in functions.
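As a small illustration of the operators and functions just named, the following sketch applies ++, eq, ==, lc and length to a few scalars. The variable names are made up for this example.

```perl
use strict;
use warnings;

my $count = 41;
$count++;                           # numeric increment: $count is now 42

my $word   = "Perl";
my $lower  = lc($word);             # lower-cased copy: "perl"
my $length = length($word);         # number of characters: 4

# eq compares strings, == compares numbers
my $sameString = ($lower eq "perl") ? "yes" : "no";
my $sameNumber = ($count == 42)     ? "yes" : "no";

print "count=$count lower=$lower length=$length\n";
print "sameString=$sameString sameNumber=$sameNumber\n";
```

Using == on strings (or eq on numbers) is a common source of bugs, which is why both comparisons appear here.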
Arrays
The most basic collection type is the array. Arrays can also be viewed
as lists (no random access), but in Perl lists are still stored as
arrays and are not a distinct collection type. Array variables
are denoted by the prefix @ on the variable name. Arrays
are indexed from 0 in Perl. Notice that a single element of the array
(e.g. $intArray[0]) is written with the prefix $, since each
element is a scalar.
@intArray = (1,2,3,4,5); # array of integers
$firstElement = $intArray[0]; # random access to the first element of @intArray
$thirdElement = $intArray[2]; # random access to the third element of @intArray
@miscArray = (1, "second", 2, 2.34); # arrays are not strongly typed, each array element can be a different scalar type
The function pop removes the last element of the array, and the
function push adds a new element as the last element of the array. From
the other end, the function shift removes the first element of the
array, while unshift adds a new element as the first element of the
array.
@intArray = (); # the empty array
push(@intArray, 1); # adds 1 as the last element
push(@intArray, 2); # adds 2 as the last element
$value = shift(@intArray); # $value == 1 and @intArray becomes (2)
unshift(@intArray, 3); # adds 3 as first element, @intArray becomes (3,2)
$value = pop(@intArray); # $value == 2, @intArray becomes (3)
$intArray[1] = 4; # assigns 4 at index 1; since @intArray was (3), it grows to (3,4)
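The four functions above are enough to use an array as a stack or as a queue. The sketch below shows both; the array names @stack and @queue are chosen only for this example.

```perl
use strict;
use warnings;

# stack: push and pop work on the same end (last in, first out)
my @stack = ();
push(@stack, "a");
push(@stack, "b");
push(@stack, "c");
my $top = pop(@stack);              # "c", @stack becomes ("a", "b")

# queue: push adds at the end, shift removes from the front (first in, first out)
my @queue = ();
push(@queue, "a");
push(@queue, "b");
push(@queue, "c");
my $front = shift(@queue);          # "a", @queue becomes ("b", "c")

print "top=$top stack=@stack\n";        # prints: top=c stack=a b
print "front=$front queue=@queue\n";    # prints: front=a queue=b c
```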
The length of an array can be obtained in two different ways. For each
array variable @name, Perl provides the special form $#name,
which is an integer equal to the index of the
last element in the array (-1 for an empty array). The length of the array can also be obtained
by coercing the array variable into a scalar value.
@intArray = (1..3); # initializes array with all integers between 1 and 3 inclusive
$lastIndex = $#intArray; # index of the last element, $lastIndex == 2
$length1 = $lastIndex + 1; # size of the array
$length2 = scalar(@intArray); # type coercion, reports size of the array, $length1 == $length2
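A typical use of the last index is looping over array positions. The following is a minimal sketch with made-up array names; it also shows the values reported for an empty array.

```perl
use strict;
use warnings;

my @colors = ("red", "green", "blue");
my @labels = ();
for my $i (0 .. $#colors) {             # $i takes the values 0, 1, 2
    push(@labels, "$i=$colors[$i]");
}
print "@labels\n";                      # prints: 0=red 1=green 2=blue

my @empty = ();
my $emptyLast   = $#empty;              # -1: there is no last element
my $emptyLength = scalar(@empty);       # 0
print "last=$emptyLast length=$emptyLength\n";
```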
The functions join and split are useful for
converting from arrays to strings and vice versa.
$line = "This is a line of text\n"; # $line is a scalar string variable
chomp($line); # chomp removes the final newline character
@wordArray = split(" ", $line); # the special pattern " " splits on runs of whitespace and ignores leading whitespace
@wordArray = split(/\s+/, $line); # split can also use regular expressions (see 'man perlre' for more on regexps)
$line = join(" ", @wordArray); # merges elements of @wordArray with " " between each element to form a scalar
$line = $line . "\n"; # concatenates the newline character to form the original value of $line
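The difference between the string pattern " " and a regular expression matching a single space shows up on input with extra whitespace. The sketch below uses a made-up string to compare the two.

```perl
use strict;
use warnings;

my $padded = "  two  words ";

my @bySpaceString = split(' ', $padded);    # ("two", "words")
my @bySingleSpace = split(/ /, $padded);    # ("", "", "two", "", "words")

my $rejoined = join("_", @bySpaceString);   # "two_words"

print scalar(@bySpaceString), " fields: $rejoined\n";
print scalar(@bySingleSpace), " fields: ", join("|", @bySingleSpace), "\n";
```

With the regular expression, every single space is a separator, so leading and doubled spaces produce empty fields (trailing empty fields are dropped by split).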
Just like scalar variables can be interpolated into strings, array
variables are also automatically interpolated.
print join(" ", @wordArray), "\n"; # one method to print the contents of an array variable
print "@wordArray\n"; # alternative method to print the array
Arrays can also be multi-dimensional. Here is an example of a two
dimensional array:
$distance[14][20] = 120;
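Assigning to an element that does not exist yet makes Perl create the intermediate rows automatically. The sketch below inspects the result of the assignment above; each row is really a reference to an array (see the References section), which is why the row length is read with @{ ... }.

```perl
use strict;
use warnings;

my @distance = ();
$distance[14][20] = 120;

my $rows     = scalar(@distance);             # 15: indices 0..14 now exist
my $cols     = scalar(@{ $distance[14] });    # 21: indices 0..20 in row 14
my $emptyRow = defined($distance[0]) ? "defined" : "undefined";

print "rows=$rows cols=$cols row0=$emptyRow value=$distance[14][20]\n";
```

Only row 14 holds a list; the rows before it exist as undefined placeholders.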
Hash Tables
Hash table variables are denoted by the prefix %. Hash
tables create a mapping from scalar values (the keys) to scalar values (the values).
The keys are treated as strings: any scalar used as a key (such as an
integer) is converted to its string form before the hashcode is computed.
The syntax for inserting elements into hash tables is similar to the
syntax for assigning to array elements, with curly braces {} in place of
square brackets []; the function delete removes an entry.
%hashTable = ();
$hashTable{"first"} = 1; # inserts a new hash table entry with key="first" and value=1
$hashTable{"second"} = 2; # new entry with key="second" and value=2
@hashkeys = keys(%hashTable); # the function keys returns an array containing only the keys, in no particular order: e.g. ("first", "second")
for my $key (keys(%hashTable)) { print $hashTable{$key}, "\n"; } # printing every value in the hash table
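Deleting and testing for entries can be sketched as follows, using the built-in functions delete and exists. The keys are sorted before printing because the order returned by keys is not fixed.

```perl
use strict;
use warnings;

my %hashTable = ();
$hashTable{"first"}  = 1;
$hashTable{"second"} = 2;
$hashTable{"third"}  = 3;

delete $hashTable{"second"};            # removes the entry with key "second"

my $hasSecond = (exists $hashTable{"second"}) ? "yes" : "no";   # "no"
my $size      = scalar(keys(%hashTable));                       # 2 entries remain

# sort the keys so the output order is repeatable
for my $key (sort keys %hashTable) {
    print "$key => $hashTable{$key}\n";
}
```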
Note that only scalars can be stored as values. So any variable that is
not scalar will be coerced to a scalar value before insertion as a
value into the hash table. Remember that an array variable when coerced
to a scalar will report the length of the array.
@intArray = (1..3);
$hashTable{"third"} = @intArray; # coercion to a scalar type means the value is not the array but its length
References
To store more complex information in arrays or hash
tables, we need a method to convert complex data types into scalars.
References allow us to do just that. Let's take an example program: say
we need to store for each word in the input text a list of all the line
numbers in which that word occurred. The natural method to store this
information would be to store each word as a key in a hash table and
have a list of line numbers as the value. Let's assume we have a list
of line numbers:
@lineNumbers = (34, 23, 78, 122, 455);
A reference to a list is a
scalar value (as usual denoted by the $ prefix on the
variable name). The operator \ creates a reference:
$refList = \@lineNumbers;
The function ref returns a true value for an input scalar
variable if the variable is a reference, and a false value otherwise.
The true value is a string naming the kind of thing referred to (such as
ARRAY or HASH): the reference internally knows
whether it is a reference to an array or a hash table (or even a
function or a scalar variable).
if (ref $refList) { print "yes\n"; } else { print "no\n"; } # will print 'yes'
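The strings returned by ref for the common kinds of reference can be checked with a short sketch like this one. The variable names are illustrative only.

```perl
use strict;
use warnings;

my @list  = (1, 2, 3);
my %table = ("key" => "value");
my $plain = 42;

my $listKind   = ref(\@list);               # "ARRAY"
my $tableKind  = ref(\%table);              # "HASH"
my $scalarKind = ref(\$plain);              # "SCALAR"
my $codeKind   = ref(sub { return 1; });    # "CODE"
my $notARef    = ref($plain);               # "" (the empty string, which is false)

print "$listKind $tableKind $scalarKind $codeKind\n";
print "plain scalar gives '$notARef'\n";
```

Testing the returned string (for example ref($x) eq "ARRAY") is a simple way to decide how a reference should be dereferenced.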
Note that references are scalar variables (as seen from the prefix $
on the variable name). Since collections are over scalars, this means
that we can use references as members in lists and hash tables. So if
we wanted to store a list of lists, we can now do it using references:
@lineNumbers = (34, 23, 78, 122, 455);
$refList = \@lineNumbers;
$refList = [34, 23, 78, 122, 455]; # short-hand for creating a reference to a list directly; the reference points to a new anonymous list with the same contents, not to @lineNumbers
$reflistA = [1,2,3];
$reflistB = [4,5,6];
$reflistC = [7,8,9];
@listOfLists = ($reflistA, $reflistB, $reflistC); # this creates a list with three elements: each a reference to a list
$listOfLists[3] = [10,11,12]; # this adds a new list ref to the list @listOfLists
Now that we have a scalar variable that refers to a list, we can store
this variable $refList as a value in a hash table. Let's
assume the word is "keyword", then:
$hashTable{"keyword"} = $refList;
To get the original list back, we need to dereference the reference (the assignment copies the elements into @lineNumbers):
$refList = $hashTable{"keyword"};
@lineNumbers = @{$refList};
These two lines can be combined into one statement (in this case the
precedence requires the use of the curly braces {} around
the reference):
@lineNumbers = @{$hashTable{"keyword"}};
The dereference, which is an rvalue
in the above statement, can also be used as an lvalue:
$newLineNumber = 677;
push( @{ $hashTable{"keyword"} } , $newLineNumber );
The statement shown above actually modifies the contents of the list
pointed to by the reference that was stored as a value in the hash
table. In other words, to change a list for which we only have a
reference, we first dereference the reference and then modify the
list. Perl has some
syntactic sugar that can make this process easier:
@listVar = (1..10); # create a list variable
$reflist = \@listVar; # create a reference to the list variable
${$reflist}[0] = 11; # changes the first element of @listVar to 11
$reflist->[0] = 11; # same as previous statement, changes first element of @listVar, but written with the arrow operator instead of ${...}
The above notation can be useful when dealing with lists of lists (read
the description earlier in this section on how to define list of lists):
@listOfLists = ([1,2,3], [5,6,7], [9,10,11]); # create a list variable with three listrefs
$listOfLists[0]->[3] = 4; # add the element 4 to the first listref element
$listOfLists[1]->[3] = 8; # add the element 8 to the second listref element
$listOfLists[2]->[3] = 12; # add the element 12 to the third listref element
for my $refList (@listOfLists) { print join(" ", @$refList), "\n"; } # print out the contents of @listOfLists
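The different ways of reaching into a list of lists, and the difference between changing a row through its reference and changing a copy of it, can be sketched as follows. The variable names are chosen for this example only.

```perl
use strict;
use warnings;

my @listOfLists = ([1,2,3], [5,6,7], [9,10,11]);

# three equivalent ways to read row 1, column 2
my $explicit = ${ $listOfLists[1] }[2];     # dereference with ${...}
my $arrow    = $listOfLists[1]->[2];        # arrow notation
my $short    = $listOfLists[1][2];          # arrow omitted between subscripts

# pushing through the reference changes the row stored in @listOfLists
push(@{ $listOfLists[0] }, 4);

# copying the row out gives an independent list
my @rowCopy = @{ $listOfLists[0] };
push(@rowCopy, 99);

print "$explicit $arrow $short\n";          # prints: 7 7 7
print "row 0: @{ $listOfLists[0] }\n";      # prints: row 0: 1 2 3 4
print "copy: @rowCopy\n";                   # prints: copy: 1 2 3 4 99
```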
Hash tables can also be defined over multiple keys (analogous to the
use of multi-dimensional arrays):
%mdimHash = (); # initialize hash table to be empty
$mdimHash{"key1"}{"key2"} = "valuestring";
The value of $mdimHash{"key1"} is actually a reference to
another hash table, which contains the key "key2" storing
the value "valuestring". Here is how to loop over all the
keys of a multi-dimensional hash table:
for my $k1 (keys(%mdimHash)) {
for my $k2 (keys(%{ $mdimHash{$k1} })) {
print "key1=$k1 key2=$k2 value=$mdimHash{$k1}{$k2}\n";
}
}
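Because keys returns entries in no particular order, a variant of this loop that sorts the keys at both levels gives repeatable output. The sketch below uses made-up keys, and also shows a side effect worth knowing about: testing a nested key with exists creates the outer entry if it was missing.

```perl
use strict;
use warnings;

my %mdimHash = ();
$mdimHash{"b"}{"y"} = 2;
$mdimHash{"a"}{"z"} = 3;
$mdimHash{"a"}{"x"} = 1;

# sort the keys at both levels so the output order is repeatable
my @entries = ();
for my $k1 (sort keys %mdimHash) {
    for my $k2 (sort keys %{ $mdimHash{$k1} }) {
        push(@entries, "$k1/$k2=$mdimHash{$k1}{$k2}");
    }
}
print join(" ", @entries), "\n";        # prints: a/x=1 a/z=3 b/y=2

# testing a nested key creates the outer entry as a side effect
my $before = scalar(keys(%mdimHash));                   # 2
my $found  = (exists $mdimHash{"c"}{"x"}) ? 1 : 0;      # 0
my $after  = scalar(keys(%mdimHash));                   # 3: "c" now exists
print "before=$before found=$found after=$after\n";
```

To avoid creating the outer entry, test exists $mdimHash{"c"} first and look at the inner key only if that succeeds.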
As we saw before, we can also dereference values using the ->
operator, which is optional between adjacent subscripts. Hence the following is a true statement:
$mdimHash{"key1"}{"key2"} eq $mdimHash{"key1"}->{"key2"}
An Example
Now you're in a position to understand an example Perl program which
combines the ideas presented above. The input to the program is a text
file with one sentence per line (you can try hw1.txt
for example). For every word pair w[i-1], w[i]
that appears next to each other in some sentence, the program prints the list
of all the line numbers k at which that pair appears.
An additional condition is that both count(w[i-1])
and count(w[i]) must be greater than the minimum word count
defined in the variable $minWordCount.
use strict;
use warnings;
my $minWordCount = 10;
my %pairCount = ();
my %wordCount = ();
my $line;
my $lineNumber = 0;
while ($line = <>) {
    chomp($line);
    ++$lineNumber;
    my @words = split(' ', $line);
    # store the count of every word in the hash table %wordCount,
    # including the first word of the line and words on short lines
    for my $word (@words) { $wordCount{$word}++; }
    # size of the array @words
    my $sz = scalar(@words);
    # go to next line if current line has less than 2 words
    if ($sz < 2) { next; }
    for (my $i=1; $i < $sz; $i++) {
        # store the current line number in the reference to the list
        # stored in the hash table %pairCount with two keys: w[i-1]
        # and w[i]
        push( @{ $pairCount{ $words[$i-1] }{ $words[$i] } } , $lineNumber);
    }
}
for my $w1 (keys(%pairCount)) {
    if ($wordCount{$w1} > $minWordCount) {
        for my $w2 (keys(%{ $pairCount{$w1} })) {
            if ($wordCount{$w2} > $minWordCount) {
                print "word1=$w1 word2=$w2: ";
                # reference to a list in a multi-dimensional hash table
                print "linenumbers = @{ $pairCount{$w1}{$w2} } \n";
            }
        }
    }
}
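To try the idea without an input file, the same pair indexing can be sketched on a small in-memory list of sentences. The helper name pairIndex, its parameters and the sample sentences are all made up for this illustration. Note that this sketch counts every word on a line, including the first one.

```perl
use strict;
use warnings;

# Hypothetical helper: returns a reference to a hash table mapping "w1 w2"
# to a reference to the list of line numbers where that pair occurs,
# keeping only pairs whose words both occur more than $minWordCount times.
sub pairIndex {
    my ($lines, $minWordCount) = @_;
    my (%wordCount, %pairCount);
    my $lineNumber = 0;
    for my $line (@$lines) {
        ++$lineNumber;
        my @words = split(' ', $line);
        for my $word (@words) { $wordCount{$word}++; }
        for my $i (1 .. $#words) {
            push(@{ $pairCount{ $words[$i-1] }{ $words[$i] } }, $lineNumber);
        }
    }
    my %result;
    for my $w1 (keys(%pairCount)) {
        next unless $wordCount{$w1} > $minWordCount;
        for my $w2 (keys(%{ $pairCount{$w1} })) {
            next unless $wordCount{$w2} > $minWordCount;
            $result{"$w1 $w2"} = $pairCount{$w1}{$w2};
        }
    }
    return \%result;
}

my @sentences = ("the cat sat", "the cat ran", "a dog sat");
my $index = pairIndex(\@sentences, 1);
for my $pair (sort keys %$index) {
    print "$pair: @{ $index->{$pair} }\n";
}
```

With a minimum count of 1, only "the", "cat" and "sat" occur often enough, so the sketch reports the pairs "cat sat" (line 1) and "the cat" (lines 1 and 2).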
Anoop Sarkar <anoop at cs.sfu.ca>