Skip to main content

Github Clone Link

Academic Integrity.

Adhere to the highest levels of academic integrity. Cite any sources that you have referred to. Also, state anyone who you may have discussed your approach with. Use the ‘Whiteboard approach’ discussed in lecture at the beginning of the semester.

You are expected to use the provided data structures and methods only. You are forbid from using your own methods or data structures that go against the spirit of the assignment and cannot be tested by our tools.

e.g., Adding your own my_strlen helper method is ok (as long as they do not modify the headers). It is not ok to replace vector_char with your own data structure or knowingly bypassing our methods or with your own methods or using strtok.

Our internal grading tools will simply mark as 0. Note that the grade distributed within the assignment scripts is tentative only. If you contravene the rules we cannot guarantee that your code will run against our internal testing.

# In many cases below you will see a reference to $REPO. $REPO refers to the folder you have clones the repo to locally e.g., github username is student
git clone git@github.com/CMPT-295-SFU/assignment-1-student
# $REPO is the folder assignment-1-student.

Goals

  • Enhance understanding of pointers in C
  • Work with the complex pointer-based data structures
  • Prepare you to work with hardware

Check Yourself

  • Have you completed lab 0? WARNING: WE WILL NOT GRADE UNTIL YOU HAVE COMPLETED LAB 0!
  • Do you know how to access fields in a struct?
  • Do you know how to malloc, realloc and free ? Watch Week 2 video
  • Do you know how to use cgdb?
  • Do you know how to print in hex? (See Week 1)

If you cannot answer these questions, revise lab 1 and read

  • Essential C Particularly section 3.

  • Do you know what in gcc -I./include what does -I flag does. Short answer: It adds additional folders for compiler searches for headers. Answer
  • Do you know what make -n does ? It’s a dry run that shows what commands are being run by make during the exe build process.

Pre Ass: Part 1 Arraylist

Prereq self assesment deadline: Try to complete within 1st three days

  • Expected Completion Time: 1 hour.
  • This part tests your understanding of pointers and dynamic memory allocation which should have been covered by the prerequesite courses CMPT 125,130
  • If you find this part challenging this is an indication that you need to review the prerequsite courses.

Source

In this module we will implement 3 functions in the file Arraylist/arraylist.c. YOU CANNOT MODIFY ANY OTHER FILES OR FUNCTIONS

  • void arraylist_add(arraylist *a, void *x)
  • void arraylist_free(arraylist *a)
  • void arraylist_insert(arraylist *a, unsigned int index, void *x)

Step 1: Implement the arraylist_add

(arraylist* a, void* x)

The elements of an arraylist are stored consecutively in a dynamically allocated array. This function takes a pointer to an arraylist and a value, and appends that value to the end of the arraylist. It should also increment the length of the arraylist to include the new element.

If this is a conceptual drawing of what the arguments to the function look like at the beginning of a call to arraylist_add:

Then this is a conceptual drawing of what the arguments should look like at the end of the function arraylist_add:
.

If the arraylist is currently at capacity when arraylist_add is called, you should create a new dynamic array that is twice the size of the old one and copy over all the elements currently in the arraylist. You may find that the standard library function realloc is useful for doing this.

Step 2: Implement arraylist_insert

(arraylist* a, unsigned int index, void* x)

Hint: Use memmove(). This function should take advantage of your arraylist_add to insert an element at an arbitrary location in your arraylist. You do not have to worry about attempting to insert to locations beyond the end of the arraylist (doing so is undefined behavior).

Step 3: Implement the `arraylist_free

(arraylist* a)

The arraylist_free simply has to free all the resources used by an arraylist. Note that the arraylist struct contains a pointer!

Test and Validate

$ cd $REPO
$ cd Arraylist
$ make -n # prints the steps for building the executable
gcc -c -o obj/arraylist.o arraylist.c -ggdb -I./include
gcc -c -o obj/main.o main.c -ggdb -I./include
gcc -o arraylist.bin obj/arraylist.o obj/main.o -ggdb -I./include -lm
$ make # Builds the executable
$ ./arraylist.bin --verbose
# If success
Test Test 0:
  Case Add:1-5: Inserts numbers 1 to 5.
    main.c:90: Check !strcmp(vec->output, output)... ok
  SUCCESS: All conditions have passed.

Test Test 1:
  Case Add:1-3,Insert:100:0-3:  Inserts numbers 1 to 5. followed up insertion of 100 at positions 0-3.
    main.c:127: Check !strcmp(vec->output, output)... ok
  SUCCESS: All conditions have passed.

Test Test 2:
  Case Add:1-5,Insert:100:0-6,Free: Inserts numbers 1 to 5. followed up insertion of 100 at positions 0-3. Finally free the arraylist.
    main.c:60: Check !strcmp(vec->output, output)... ok
  SUCCESS: All conditions have passed.

Summary:
  Count of all unit tests:        3
  Count of run unit tests:        3
  Count of failed unit tests:     0
  Count of skipped unit tests:    0
# If test case fails, you will see the expected output.
# Test 1 and 2 WILL LEAK MEMORY. TEST 3 should not.

Part 2 Document Analyzer

This assignment will help you understand

  • how to use C pointers and data structures.
  • how to allocate and deallocate memory
  • how to write a larger C program

You will write C code to implement programs that takes in an input string and:

  • splits it into words
  • creates a dictionary of words (i.e., list of all unique words)
  • counts the number of occurences of each word.

  • Alphanumeric character: An alphanumeric character (includes any of a-z, A-Z, or 0-9 ) that appears in a file. Any character that is not a alphabet or digit should be ignored and is not considered part of the word. Read Hint
  • Word A word is a sequence of alphanumeric characters separated by non-alphanumeric character.

e.g.,

Input string:  Lorem Lorem i dolor sit amet,
Step 1: Words:
# Notice that we ignored the space and punctuations.
Lorem
Lorem
i
dolor
sit
amet
# Step 2: Dictionary of words (only unify)
1. Lorem
2. i
3. dolor
4. sit
5. amet
# Step 3: Word counts
Lorem:2
i:1
dolor:1
sit:1
amet:1

Source files

All command files below here refer to word-count folder

cd $REPO
cd word-count
Filename Description
dictionary.c TODO: main(). Implement dictionary
tokenize.c TODO: main(). Implement tokenizer
linecount.c TODO: main(). Implement line count
include header files
obj where compiled .o files are stored
txt Input test files.
out folder for holding output from tests
reference Reference output for each step
str_cmp.c TODO: Implement your own strcmp
vector_char.c A dynamically resizable character array
vector_string.c TODO: A list of words
table_string.c TODO: A table of words
unit_test_vector_char.c A simple program illustrating vector chars
unit_test_vector_string.c A simple program to test your vector string implementation

Rules

  • For this assignment you cannot use any of the following functions.strcmp|strlen|isalnum|isalpha|isdigit|strdup|strcat|strncpy|strncmp|strtok. You can read about them here. You will have to write your own implementations when required.
  • Attempt steps below in order.
  • You can only modify the source code with TODO: blocks in the code. You can modify or add files prefixed with unit_test.

Check Yourself

  1. Do you know the difference between char* and char**? See example
  2. Do you know what the `\0’ is ? Null terminated string. Iterating over character array without \0
  3. Do you know what is the int value representing ‘a’ and ‘A’ ? What is the ascii table ? Ascii Table Lecture 1 Slides
  4. In the code below, what is the size (in bytes) of header_as_type vs header_as_pointer ? what is the size in bytes of entry_t ? What is the problem with linue 24 (head_p.variable->value = string;)? Run in python tutor here . WARNING: DO NOT PROCEED WITHOUT UNDERSTANDING THIS EXAMPLE AND HOW TO FIX THE UNINITIALIZED POINTER
struct entry_t {
  char *value;
  char *next;
};

struct header_as_type {
 struct entry_t variable;
};

struct header_as_pointer {
  struct entry_t* variable;
};

void main() {
 struct header_as_type head_t;
 struct header_as_pointer head_p;
 char string[] = "Hello World";
 head_t.variable.value = string;
 printf("%p",head_t.variable.value);

 // INCORRECT EXECUTION.
// UNCOMMENT Line below if you want correct execution
// head_p.variable = (struct entry_t*)malloc(sizeof(struct entry_t));
 head_p.variable->value = string;
 printf("%p",head_p.variable->value);
}

Step 0 String comparison

In this assignment you will not be using any of the string helper functions. To help you get started in the first step you have to implement your own strcmp function.

  • Compare two character arrays. return 0 if they match, 1 otherwise
  • You can assume that s1 and s2 are null terminated strings.
  • WARNING: strings could potentially be of different lengths e.g., “apple” does not match “apple “ (which includes the extra space). Value to be returned will be 1.
  • You cannot use any of the string helper functions including strlen and strncmp, strcmp.
  • Read here for C’s own strcmp. WARNING:Our my_str_cmp is slightly different. When strings mismatch, we simply return 1

SOURCE FILES

If you do not fill any code and simply try running there will be a segmentation fault and/or compiler errors

   
str_cmp.c TODO: Implement string compare
test_str_cmp.c YOU CANNOT MODIFY THIS FILE. Tester for str_cmp (will not work with gdb)
unit_test_str_cmp.c TODO: Write your own unit test case for gdbing
# Build unit test case
gcc -I./include/ unit_test_strcmp.c str_cmp.c -o unit_test_strcmp.bin

Complete testing and grading

make test_strcmp.bin
./test_strcmp.bin --verbose
Test Test:
  test_strcmp.c:24: Check result == test_vectors[i].expected... failed
    String s1: a
    String s2: b
  test_strcmp.c:24: Check result == test_vectors[i].expected... failed
    String s1: apple
    String s2: appl@
  test_strcmp.c:24: Check result == test_vectors[i].expected... failed
    String s1: apple and oranges
    String s2: apple and oranges
  test_strcmp.c:24: Check result == test_vectors[i].expected... failed
    String s1: 123456
    String s2: 123456
  test_strcmp.c:24: Check result == test_vectors[i].expected... failed
    String s1: 342342;
    String s2: afasdfsafd;
  FAILED: 5 conditions have failed.

Summary:
  Count of all unit tests:        1
  Count of run unit tests:        1
  Count of failed unit tests:     1
  Count of skipped unit tests:    0

You should now see the fruits of your labor! If any of these cases fail the tester will display the produced and expected values. Use this feedback to check constructed by your approach.

Step 1 Word Tokenizer

When writing software that deals with user input, a tokenizer is a function designed to take the input and divide it into smaller pieces called “tokens”. In this assignment, you will be creating a simple word tokenizer.

A word tokenizer cuts an input string into individual words. For our purposes, a word is defined as a contiguous sequence of alphabets or numbers (from ‘a’ to ‘z’, ‘A’ to ‘Z’, and ‘0’ to ‘9’).

Helper Source Files

A useful data structure for tokenizing is a character vector, since words in the input string can be of varying length. A vector is an array that can dynamically grow and be expanded as more fields are added to it. Unfortunately C language lacks a dynamic vector, it only has support for fixed-size character arrays. To help you with your task we have provided one such data structure, a vector_char, a vector of characters and associated helper functions. You do not need to know how vector_char works (however it may help if you do), you can complete this assignment by only knowing how to use it.

Watch youtube video on vector char

To assist you in implementing the word tokenizer, we have provided the following helper files:

  • vector_char.c: Implementation of the vector_char data structure (DO NOT MODIFY)
  • unit_test_vector_char.c: Example usage of the vector_char data structure

You can compile and run the unit_test_vector_char.c file to see how the vector_char works.

   
vector_char.c Implementation of vector_char DO NOT MODIFY
unit_test_vector_char.c READ: Illustrates how to use a vector_char
make unit_test_vector_char.bin
./unit_test_vector_char.bin         ∞
Actual:HELLO WORLD
Lower: hello world
Length (without \0 termination character): 11%

If you can answer the following questions you can proceed.

  • What does vector_char_allocate() allocate ? space for string OR space for pointer to string?
  • Where does header reside ? unit_test_vector_char.c line 10:vector_char_t *header ? stack or heap ?
  • where is header->data allocated stack or heap ?

MODIFY SOURCE FILES

tokenize.c is where main is located. To make it easier we have already provided the basic code for reading the input file into a null terminated character array, source. Follow the instructions in tokenize.c.

./tokenize.bin txt/small.txt
# source contains contents of txt/small.txt as character array
   
tokenize.c TODO: Modify listing to iterate over string and tokenize
refererence/.tokenize Reference outputs. Your output should match that

Test and Validaton

If things work correctly then the output of tokenize.bin should be as shown below. IMPORTANT: See below for grading

make tokenize.bin
# Test below shows expected output.
./tokenize.bin txt/small.txt
Lorem
Lorem
i
dolor
sit
amet
# Validate
./tokenize.bin txt/small.txt > out/small.txt.tokenize; diff out/small.txt.tokenize reference/small.txt.tokenize

If you use vscode it has a more powerful diff. $ code --diff out/small.txt.tokenize reference/small.txt.tokenize

# Source is the array of characters that needs to be tokenized into words
# This is only meant to be a guide and may not be fully accurate or correct
for ch in source[]:
  word = new word()
  if ch is alphabetical
    word.add(ch)
  else
    if word is not empty:
      print word
      word = ""

WARNING 1: You cannot use strtok. WARNING 2: You cannot assume the max word length is known. WARNING 3: The data ptr may change as you call vector_char_add and other functions

Use vector_char. It’s required and there to help you. Further its required in subsequent steps

Step 2 Dictionary

Now that we know how to tokenize , let’s build a dictionary. A dictionary is a collection of all the unify words. e.g.,

1. Lorem
2. i
3. dolor
4. sit
5. amet

Pseudocode

for each word (w words)
  // Each word
  lookup word in vector string
  if word is present, continue
  else
    insert at the end of vector string
end

entries = array of all unify keys (words) in the vector string
for each entry
  print index + ". " + word
end

Check Yourself

  • Do you know what a linked list is ? Linked list wiki
  • Have you completed Part 2. C-Lab-1.

MODIFY SOURCE FILE

To help you with this task we want you to implement a data structure called vector_string, a dynamic linked list in which each entry points to a unify word.

   
include/vector_string.h READ Contains the top-level structs describing the vector_string
vector_string.c TODO: Implement the TODO blocks
refererence/*.dictionary Reference outputs. Your output should match that
dictionary.c TODO: Implement main()

You are expected to implement the following five functions

  • vector_string *vector_string_allocate(); Create a vector string on the heap
  • bool vector_string_find(vector_string *vs, char *key); Search the vector string pointed to by vs and return true if entry found, false otherwise
  • void vector_string_insert(vector_string *vs, char *key); Insert a string into the vector string at the tail.
  • void vector_string_deallocate(vector_string *vs);Remove all entries and deallocate vector string
  • void vector_string_print(vector_string *vs); Print all value in each entry of the vector string, one line at a time.
  • WARNING: print format has to exactly match for tests to pass

Test and Validation

make dictionary.bin
# Validate
./dictionary.bin txt/small.txt > out/small.txt.dictionary; diff out/small.txt.dictionary reference/small.txt.dictionary

Step 3 Line count

Track each line in which a word occurs. Provide a list of all unify words in a document along with the exact line number in which they occur.

$ cat word-count/reference/input.txt.linecount
Lorem: 1
amet: 1 4 6 8
consectetur: 1 7
adipiscing: 1
felis: 3
tempor: 4
....
....

If you open txt/input.txt in a text editor and search for Sed, you will see it occurs in line 1 and 7 while adipiscing occurs only in line 1

Hints

  • You can look for new line character in C as \n
  • You will be using a dynamic array to track the set of line numbers.
  • Make sure you free the array in the end. You cannot make any assumptions about the size of the array. Refer slides in Week 2 for : realloc. (https://www.tutorialspoint.com/c_standard_library/c_function_realloc.htm)
  • Remember in C you do not know the size of dynamic arrays, hence you have to explicitly track it in a variable.

    SOURCE FILES

A key limitation of the vector_string is that it uses a single linked list to keep track of the words in the document. Hence searching through it is expensive. Think of the checking if a word exists? In the worst-case how many entries in the list would we have to check?. To make things faster we will be building a new data structure with new methods, table_string. Figure below illustrates a table_string. Unlike a vector_string, a table_string maintains multiple lists (you can refer to these as buckets). Each word is hashed or mapped to a unify bucket. The function djb2_word_to_bucket(char* word, int buckets):table_string.c takes in a string and returns the bucket corresponding to the word. It return a value between [0,buckets-1]. The word should be inserted only in that bucket. Each bucket is simply a linked list of entries. This ensures that words are naturally partitioned between buckets. The linked list in each bucket will not be as long and this makes it faster and easier to search for a particular word.

File Description
include/table_string.h READ Contains the top-level structs describing the table_string
table_string.c TODO: Implement the TODO blocks
unit_test_table_string.c READ. Contains unit test and framework for illustrating how to use table string
linecount.c TODO: main() for tracking which lines a word occurs in

These are the four functions that you are expected to implement for table_string.

  • table_string *table_string_allocate(); Create a table string on the heap. The reference outputs were generated assuming a table_string with 4 buckets.
  • void table_string_insert_or_add(table_string *vs, char *key, int line); Insert the word pointed to by key into the table string at the tail (if it does not exist), otherwise add new line entry. Refer to vs_entry_t in include/vector_string.h. Both table_string and vector_string use nearly identical entries (table_string includes extra array for tracking the lines in which each word occurs.)
  • void table_string_deallocate(table_string *vs); Remove all entries and deallocate table string
    void table_string_print(table_string *vs); Print all value in each entry of the table string. See above for format WARNING: print format has to exactly match for tests to pass. See table_string.c

WARNING: It is very important that you insert at the tail in each bucket If the order of words in the bucket or the bucket in which you insert a word is wrong then you will not match the reference output

# Make sure you initialize the table_string with 4 buckets in linecount.c
# The reference output was generated assuming 4 buckets
# in the table string.
# If you do not set buckets to 4, then each word will be fine. But the order of the words could be different leading to mismatches against the reference
cd $REPO/word-count
make linecount.bin
# Validate. 
./linecount.bin ./txt/small.txt > ./out/small.txt.linecount; diff  ./out/small.txt.linecount ./reference/small.txt.linecount

Step 4 Unify

We will now proceed to find the words from two files that are unify words across file1 and 2. You will display each occurrence of words between the two files along with the line numbers in which the word occurs within each file. Common words will only be displayed once with the line numbers in which they occur in each file.

SOURCE FILES

We are going to build a useful application on existing data structures. Table strings make it really easy to search since they partition the words across buckets. We are going to be building a unify word detector. Provided two files, we will search for unify words between the two files. The steps are as follows:

  • Create a table string to track the occurrence of words in each file. You will create one table string per file (lets call it table1 and table2)
  • iterate over the table string table1, for each word check table string table2.
  • You will track the unify words using a new struct dedup entry.
  • dedup_entry has two pointers, line1* and line2* for tracking the lines in which the word occurs in each file.
  • If you are clever, you will notice that you don’t have to copy over the line numbers, but simply reuse the array in your table strings, table1 and table2.
  • You also need to track the file names file1 and file2
File Description
include/dedup.h READ Contains the top-level structs describing the dedup entry for tracking unifys
dedup.c TODO: Implement the unify() function. Print the union of both table strings (allocate only one entry per duplicate words)
duplicate.c TODO: main() for tracking for invoking table_string and unify values

There is only one function that you are expected to implement for unifys.

int unify(table_string *t1, char* f1, table_string *t2, char* f2):

  • f1 and f2 are the filenames of the two files in which to search for unify words.
  • t1 and t2 are the table strings that you will use to track the words in the two files.
# Below. First print all the words in f1
# If word also found in ts2 i.e., duplicate print a single entry and both lines found in f1 and f2.
# Print all words found in f2, but not in f1.
# Below here adipiscing occurs in only input.txt, but Lorem occurs in both.
# test only occurs in small2.txt
cat $REPO/word-count/reference/input.txt.unify
adipiscing
txt/input.txt 1,
.....
Lorem
txt/input.txt 1,
txt/small2.txt 1,1,
amet
txt/input.txt 1,4,6,8,
txt/small2.txt 1,
......
test
txt/small2.txt 2,

Source files

# Make sure you initialize the table_string with 8 buckets in linecount.c
# The reference output was generated assuming 8 buckets
# in the table string.
cd $REPO/word-count
# Even though binary name says duplicate. Note that you 
# are expected to implement unify finding algorithm.
make duplicate.bin 
# To test make sure you are in the parent folder
cd $REPO 
./word-count/duplicate.bin  ./word-count/txt/input.txt ./word-count/txt/small2.txt > ./word-count/out/input.txt.unify
diff  ./word-count/out/input.txt.unify ./word-count/reference/input.txt.unify
# To match reference output, you have to run from the $REPO folder.

Grading

Congrats. You are now ready to receive your tentative grade.

$ cd $REPO # $REPO refers to the top of your cloned repo.
$ source ./scripts/localci.sh # run from the repo folder
# Check if SUCCESS OR FAILED was dumped by scripts
# There will be a _Grade.json (this is your tentative grade). The file is in json format.
# The mark component in each module indicates your score for that portion.
# LOG.md contains the log of your runs. This will give you more information. See the FAILED header.
# To check if you are using these banned functions.
$ cd $REPO/word-count
$ egrep "strcmp|strlen|isalnum|isalpha|isdigit|strdup|strcat|strncpy|strncmp|strtok" table_string.c vector_string.c dictionary.c tokenize.c linecount.c duplicate.c -n
# You can ignore line. word-count/table_string.c:21:  for (i = 0; i < strlen(word); i++). This is code provided by us. You can also ignore any usage in unit_tests. We do not grade unit_tests.
# The above will return any lines where you are using those banned functions.
# Note that if you are writing your own my_strlen and use it anywhere in your code it will match again. You will have to rename it to something like my_str_len

This grade is only tentative. Based on additional test cases in our evaluation, you could score less points. If valgrind reports any memory errors other than leaks e.g., writes to unallocated memory etc then your score will be zeroed out for that portion

Module Total
arraylist 5
strcmp 5
tokenize 20 (+5 points if no memory leaks)
dictionary 20 (+5 points if no memory leaks)
linecount 20 (+5 points if no memory leaks)
unify 40 (+5 points if no memory leaks)

Memory Management

In this assignment, you are also expected to manage your memory and prevent memory leaks from happening. A good tool for checking whether your program has memory leak is Valgrind. You can learn how to use it here. A short example is below. We will deduct points if you fail to free all memory before the program exits.

$ valgrind --leak-check=yes ./executable

Debugging tips

  • Valgrind run memcheck to find uninitialized memory locations, references
$ valgrind  --tool=memcheck ./executable
  • Read here to find common errors Slide 29
  • Refer here to catch typical memory errors
  • Follow gdb instructions from lab 0
  • Are you writing unallocated memory
  • Use python tutor to visualize individual functions. Your code has to be < 200 lines of code.
  • If you see garbage characters after your string e.g., Lorem^?. Check if you inserted null termination character.
  • If your output seems similar to reference but you are still failing, then check for additional spaces or new lines. example
  • Do not forget to free the line array in table-string at the end.

TA Help

We strictly adhere to a ‘No Print No Help’ policy. Please review the policy on assignment page for more details.

To ensure the TA meetings are productive, we recommend bringing the following information:

  • The data pointer and struct pointer of vector_char.
  • Each node in the linked list of vector_string (including head, tail, and next).
  • The array of buckets and nodes in the linked list of table_string.

Please ensure that you have thoroughly reviewed your code and attempted to debug the issue before seeking assistance.