Academic Integrity.
Adhere to the highest levels of academic integrity. Cite any sources that you have referred to. Also, state anyone who you may have discussed your approach with. Use the ‘Whiteboard approach’ discussed in lecture at the beginning of the semester.
You are expected to use the provided data structures and methods only. You are forbid from using your own methods or data structures that go against the spirit of the assignment and cannot be tested by our tools.
e.g., Adding your own my_strlen helper method is ok (as long as they do not modify the headers). It is not ok to replace vector_char with your own data structure or knowingly bypassing our methods or with your own methods or using strtok.
Our internal grading tools will simply mark as 0. Note that the grade distributed within the assignment scripts is tentative only. If you contravene the rules we cannot guarantee that your code will run against our internal testing.
# In many cases below you will see a reference to $REPO. $REPO refers to the folder you have clones the repo to locally e.g., github username is student
git clone git@github.com/CMPT-295-SFU/assignment-1-student
# $REPO is the folder assignment-1-student.
WARNING: WE WILL NOT GRADE UNTIL YOU HAVE COMPLETED LAB 0!
If you cannot answer these questions, revise lab 1 and read
Essential C Particularly section 3.
-I
flag does. Short answer: It adds additional folders for compiler searches for headers. Answermake -n
does ? It’s a dry run that shows what commands are being run by make during the exe build process.Prereq self assesment deadline: Try to complete within 1st three days
Expected Completion Time: 1 hour
.In this module we will implement 3 functions in the file Arraylist/arraylist.c
. YOU CANNOT MODIFY ANY OTHER FILES OR FUNCTIONS
void arraylist_add(arraylist *a, void *x)
void arraylist_free(arraylist *a)
void arraylist_insert(arraylist *a, unsigned int index, void *x)
arraylist_add
(arraylist* a, void* x)
The elements of an arraylist are stored consecutively in a dynamically allocated array. This function takes a pointer to an arraylist and a value, and appends that value to the end of the arraylist. It should also increment the length of the arraylist to include the new element.
If this is a conceptual drawing of what the arguments to the function look like at the beginning of a call to arraylist_add:
Then this is a conceptual drawing of what the arguments should look like at the end of the function arraylist_add:
.
If the arraylist is currently at capacity when arraylist_add
is called, you should create a new dynamic array that is twice the size of the old one and copy over all the elements currently in the arraylist. You may find that the standard library function realloc
is useful for doing this.
arraylist_insert
(arraylist* a, unsigned int index, void* x)
Hint: Use memmove(). This function should take advantage of your arraylist_add
to insert an element at an arbitrary location in your arraylist. You do not have to worry about attempting to insert to locations beyond the end of the arraylist (doing so is undefined behavior).
(arraylist* a)
The arraylist_free
simply has to free all the resources used by an arraylist. Note that the arraylist struct contains a pointer!
$ cd $REPO
$ cd Arraylist
$ make -n # prints the steps for building the executable
gcc -c -o obj/arraylist.o arraylist.c -ggdb -I./include
gcc -c -o obj/main.o main.c -ggdb -I./include
gcc -o arraylist.bin obj/arraylist.o obj/main.o -ggdb -I./include -lm
$ make # Builds the executable
$ ./arraylist.bin --verbose
# If success
Test Test 0:
Case Add:1-5: Inserts numbers 1 to 5.
main.c:90: Check !strcmp(vec->output, output)... ok
SUCCESS: All conditions have passed.
Test Test 1:
Case Add:1-3,Insert:100:0-3: Inserts numbers 1 to 5. followed up insertion of 100 at positions 0-3.
main.c:127: Check !strcmp(vec->output, output)... ok
SUCCESS: All conditions have passed.
Test Test 2:
Case Add:1-5,Insert:100:0-6,Free: Inserts numbers 1 to 5. followed up insertion of 100 at positions 0-3. Finally free the arraylist.
main.c:60: Check !strcmp(vec->output, output)... ok
SUCCESS: All conditions have passed.
Summary:
Count of all unit tests: 3
Count of run unit tests: 3
Count of failed unit tests: 0
Count of skipped unit tests: 0
# If test case fails, you will see the expected output.
# Test 1 and 2 WILL LEAK MEMORY. TEST 3 should not.
This assignment will help you understand
You will write C code to implement programs that takes in an input string and:
counts the number of occurences of each word.
Any character that is not a alphabet or digit should be ignored and is not considered part of the word.
Read Hinte.g.,
Input string: Lorem Lorem i dolor sit amet,
Step 1: Words:
# Notice that we ignored the space and punctuations.
Lorem
Lorem
i
dolor
sit
amet
# Step 2: Dictionary of words (only unify)
1. Lorem
2. i
3. dolor
4. sit
5. amet
# Step 3: Word counts
Lorem:2
i:1
dolor:1
sit:1
amet:1
All command files below here refer to word-count folder
cd $REPO
cd word-count
Filename | Description |
---|---|
dictionary.c | TODO: main(). Implement dictionary |
tokenize.c | TODO: main(). Implement tokenizer |
linecount.c | TODO: main(). Implement line count |
include | header files |
obj | where compiled .o files are stored |
txt | Input test files. |
out | folder for holding output from tests |
reference | Reference output for each step |
str_cmp.c | TODO: Implement your own strcmp |
vector_char.c | A dynamically resizable character array |
vector_string.c | TODO: A list of words |
table_string.c | TODO: A table of words |
unit_test_vector_char.c | A simple program illustrating vector chars |
unit_test_vector_string.c | A simple program to test your vector string implementation |
strcmp|strlen|isalnum|isalpha|isdigit|strdup|strcat|strncpy|strncmp|strtok
. You can read about them here. You will have to write your own implementations when required.TODO:
blocks in the code. You can modify or add files prefixed with unit_test.header_as_type
vs header_as_pointer
? what is the size in bytes of entry_t
? What is the problem with linue 24 (head_p.variable->value = string;
)? Run in python tutor here . WARNING: DO NOT PROCEED WITHOUT UNDERSTANDING THIS EXAMPLE AND HOW TO FIX THE UNINITIALIZED POINTER
struct entry_t {
char *value;
char *next;
};
struct header_as_type {
struct entry_t variable;
};
struct header_as_pointer {
struct entry_t* variable;
};
void main() {
struct header_as_type head_t;
struct header_as_pointer head_p;
char string[] = "Hello World";
head_t.variable.value = string;
printf("%p",head_t.variable.value);
// INCORRECT EXECUTION.
// UNCOMMENT Line below if you want correct execution
// head_p.variable = (struct entry_t*)malloc(sizeof(struct entry_t));
head_p.variable->value = string;
printf("%p",head_p.variable->value);
}
In this assignment you will not be using any of the string helper functions. To help you get started in the first step you have to implement your own strcmp function.
WARNING:Our my_str_cmp is slightly different. When strings mismatch, we simply return 1
If you do not fill any code and simply try running there will be a segmentation fault and/or compiler errors
str_cmp.c | TODO: Implement string compare |
test_str_cmp.c | YOU CANNOT MODIFY THIS FILE . Tester for str_cmp (will not work with gdb) |
unit_test_str_cmp.c | TODO: Write your own unit test case for gdbing |
# Build unit test case
gcc -I./include/ unit_test_strcmp.c str_cmp.c -o unit_test_strcmp.bin
make test_strcmp.bin
./test_strcmp.bin --verbose
Test Test:
test_strcmp.c:24: Check result == test_vectors[i].expected... failed
String s1: a
String s2: b
test_strcmp.c:24: Check result == test_vectors[i].expected... failed
String s1: apple
String s2: appl@
test_strcmp.c:24: Check result == test_vectors[i].expected... failed
String s1: apple and oranges
String s2: apple and oranges
test_strcmp.c:24: Check result == test_vectors[i].expected... failed
String s1: 123456
String s2: 123456
test_strcmp.c:24: Check result == test_vectors[i].expected... failed
String s1: 342342;
String s2: afasdfsafd;
FAILED: 5 conditions have failed.
Summary:
Count of all unit tests: 1
Count of run unit tests: 1
Count of failed unit tests: 1
Count of skipped unit tests: 0
You should now see the fruits of your labor! If any of these cases fail the tester will display the produced and expected values. Use this feedback to check constructed by your approach.
When writing software that deals with user input, a tokenizer is a function designed to take the input and divide it into smaller pieces called “tokens”. In this assignment, you will be creating a simple word tokenizer.
A word tokenizer cuts an input string into individual words. For our purposes, a word is defined as a contiguous sequence of alphabets or numbers (from ‘a’ to ‘z’, ‘A’ to ‘Z’, and ‘0’ to ‘9’).
A useful data structure for tokenizing is a character vector, since words in the input string can be of varying length. A vector is an array that can dynamically grow and be expanded as more fields are added to it. Unfortunately C language lacks a dynamic vector, it only has support for fixed-size character arrays. To help you with your task we have provided one such data structure, a vector_char
, a vector of characters and associated helper functions. You do not need to know how vector_char works (however it may help if you do), you can complete this assignment by only knowing how to use it.
Watch youtube video on vector char
To assist you in implementing the word tokenizer, we have provided the following helper files:
vector_char.c
: Implementation of the vector_char
data structure (DO NOT MODIFY)unit_test_vector_char.c
: Example usage of the vector_char
data structureYou can compile and run the unit_test_vector_char.c
file to see how the vector_char
works.
vector_char.c | Implementation of vector_char DO NOT MODIFY |
unit_test_vector_char.c | READ : Illustrates how to use a vector_char |
make unit_test_vector_char.bin
./unit_test_vector_char.bin ∞
Actual:HELLO WORLD
Lower: hello world
Length (without \0 termination character): 11%
If you can answer the following questions you can proceed.
unit_test_vector_char.c line 10:vector_char_t *header
? stack or heap ?tokenize.c is where main is located. To make it easier we have already provided the basic code for reading the input file into a null terminated character array, source
.
Follow the instructions in tokenize.c
.
./tokenize.bin txt/small.txt
# source contains contents of txt/small.txt as character array
tokenize.c | TODO : Modify listing to iterate over string and tokenize |
refererence/.tokenize | Reference outputs. Your output should match that |
If things work correctly then the output of tokenize.bin should be as shown below.
IMPORTANT
: See below for grading
make tokenize.bin
# Test below shows expected output.
./tokenize.bin txt/small.txt
Lorem
Lorem
i
dolor
sit
amet
# Validate
./tokenize.bin txt/small.txt > out/small.txt.tokenize; diff out/small.txt.tokenize reference/small.txt.tokenize
If you use vscode
it has a more powerful diff.
$ code --diff out/small.txt.tokenize reference/small.txt.tokenize
# Source is the array of characters that needs to be tokenized into words
# This is only meant to be a guide and may not be fully accurate or correct
for ch in source[]:
word = new word()
if ch is alphabetical
word.add(ch)
else
if word is not empty:
print word
word = ""
WARNING 1: You cannot use strtok.
WARNING 2: You cannot assume the max word length is known.
WARNING 3: The data ptr may change as you call vector_char_add and other functions
Use vector_char. It’s required and there to help you. Further its required in subsequent steps
Now that we know how to tokenize , let’s build a dictionary. A dictionary is a collection of all the unify words
. e.g.,
1. Lorem
2. i
3. dolor
4. sit
5. amet
Pseudocode
for each word (w words)
// Each word
lookup word in vector string
if word is present, continue
else
insert at the end of vector string
end
entries = array of all unify keys (words) in the vector string
for each entry
print index + ". " + word
end
To help you with this task we want you to implement a data structure called vector_string
, a dynamic linked list in which each entry points to a unify word.
include/vector_string.h | READ Contains the top-level structs describing the vector_string |
vector_string.c | TODO: Implement the TODO blocks |
refererence/*.dictionary | Reference outputs. Your output should match that |
dictionary.c | TODO : Implement main() |
You are expected to implement the following five functions
vector_string *vector_string_allocate();
Create a vector string on the heapbool vector_string_find(vector_string *vs, char *key);
Search the vector string pointed to by vs and return true if entry found, false otherwisevoid vector_string_insert(vector_string *vs, char *key);
Insert a string into the vector string at the tail.void vector_string_deallocate(vector_string *vs);
Remove all entries and deallocate vector stringvoid vector_string_print(vector_string *vs);
Print all value in each entry of the vector string, one line at a time.WARNING: print format has to exactly match for tests to pass
make dictionary.bin
# Validate
./dictionary.bin txt/small.txt > out/small.txt.dictionary; diff out/small.txt.dictionary reference/small.txt.dictionary
Track each line in which a word occurs. Provide a list of all unify words in a document along with the exact line number in which they occur.
$ cat word-count/reference/input.txt.linecount
Lorem: 1
amet: 1 4 6 8
consectetur: 1 7
adipiscing: 1
felis: 3
tempor: 4
....
....
If you open txt/input.txt in a text editor
and search for Sed, you will see it occurs in line 1 and 7
while adipiscing occurs only in line 1
\n
A key limitation of the vector_string is that it uses a single linked list to keep track of the words in the document. Hence searching through it is expensive. Think of the checking if a word exists? In the worst-case how many entries in the list would we have to check?
. To make things faster we will be building a new data structure with new methods, table_string
. Figure below illustrates a table_string. Unlike a vector_string, a table_string maintains multiple lists (you can refer to these as buckets). Each word is hashed or mapped to a unify bucket. The function djb2_word_to_bucket(char* word, int buckets):table_string.c
takes in a string and returns the bucket corresponding to the word. It return a value between [0,buckets-1]. The word should be inserted only in that bucket. Each bucket is simply a linked list of entries. This ensures that words are naturally partitioned between buckets. The linked list in each bucket will not be as long and this makes it faster and easier to search for a particular word.
File | Description |
---|---|
include/table_string.h | READ Contains the top-level structs describing the table_string |
table_string.c | TODO: Implement the TODO blocks |
unit_test_table_string.c | READ . Contains unit test and framework for illustrating how to use table string |
linecount.c | TODO: main() for tracking which lines a word occurs in |
These are the four functions that you are expected to implement for table_string.
table_string *table_string_allocate();
Create a table string on the heap. The reference outputs were generated assuming a table_string with 4 buckets.void table_string_insert_or_add(table_string *vs, char *key, int line);
Insert the word pointed to by key into the table string at the tail (if it does not exist
), otherwise add new line entry. Refer to vs_entry_t
in include/vector_string.h. Both table_string and vector_string use nearly identical entries (table_string includes extra array for tracking the lines in which each word occurs.)void table_string_deallocate(table_string *vs);
Remove all entries and deallocate table stringvoid table_string_print(table_string *vs);
Print all value in each entry of the table string. See above for format WARNING: print format has to exactly match for tests to pass. See table_string.c
WARNING: It is very important that you insert at the tail in each bucket
If the order of words in the bucket or the bucket in which you insert a word
is wrong then you will not match the reference output
# Make sure you initialize the table_string with 4 buckets in linecount.c
# The reference output was generated assuming 4 buckets
# in the table string.
# If you do not set buckets to 4, then each word will be fine. But the order of the words could be different leading to mismatches against the reference
cd $REPO/word-count
make linecount.bin
# Validate.
./linecount.bin ./txt/small.txt > ./out/small.txt.linecount; diff ./out/small.txt.linecount ./reference/small.txt.linecount
We will now proceed to find the words from two files that are unify words across file1 and 2. You will display each occurrence of words between the two files along with the line numbers in which the word occurs within each file. Common words will only be displayed once with the line numbers in which they occur in each file.
We are going to build a useful application on existing data structures. Table strings make it really easy to search since they partition the words across buckets. We are going to be building a unify word detector. Provided two files, we will search for unify words between the two files. The steps are as follows:
File | Description |
---|---|
include/dedup.h | READ Contains the top-level structs describing the dedup entry for tracking unifys |
dedup.c | TODO: Implement the unify() function. Print the union of both table strings (allocate only one entry per duplicate words) |
duplicate.c | TODO: main() for tracking for invoking table_string and unify values |
There is only one function that you are expected to implement for unifys.
int unify(table_string *t1, char* f1, table_string *t2, char* f2)
:
# Below. First print all the words in f1
# If word also found in ts2 i.e., duplicate print a single entry and both lines found in f1 and f2.
# Print all words found in f2, but not in f1.
# Below here adipiscing occurs in only input.txt, but Lorem occurs in both.
# test only occurs in small2.txt
cat $REPO/word-count/reference/input.txt.unify
adipiscing
txt/input.txt 1,
.....
Lorem
txt/input.txt 1,
txt/small2.txt 1,1,
amet
txt/input.txt 1,4,6,8,
txt/small2.txt 1,
......
test
txt/small2.txt 2,
# Make sure you initialize the table_string with 8 buckets in linecount.c
# The reference output was generated assuming 8 buckets
# in the table string.
cd $REPO/word-count
# Even though binary name says duplicate. Note that you
# are expected to implement unify finding algorithm.
make duplicate.bin
# To test make sure you are in the parent folder
cd $REPO
./word-count/duplicate.bin ./word-count/txt/input.txt ./word-count/txt/small2.txt > ./word-count/out/input.txt.unify
diff ./word-count/out/input.txt.unify ./word-count/reference/input.txt.unify
# To match reference output, you have to run from the $REPO folder.
Congrats. You are now ready to receive your tentative grade.
$ cd $REPO # $REPO refers to the top of your cloned repo.
$ source ./scripts/localci.sh # run from the repo folder
# Check if SUCCESS OR FAILED was dumped by scripts
# There will be a _Grade.json (this is your tentative grade). The file is in json format.
# The mark component in each module indicates your score for that portion.
# LOG.md contains the log of your runs. This will give you more information. See the FAILED header.
# To check if you are using these banned functions.
$ cd $REPO/word-count
$ egrep "strcmp|strlen|isalnum|isalpha|isdigit|strdup|strcat|strncpy|strncmp|strtok" table_string.c vector_string.c dictionary.c tokenize.c linecount.c duplicate.c -n
# You can ignore line. word-count/table_string.c:21: for (i = 0; i < strlen(word); i++). This is code provided by us. You can also ignore any usage in unit_tests. We do not grade unit_tests.
# The above will return any lines where you are using those banned functions.
# Note that if you are writing your own my_strlen and use it anywhere in your code it will match again. You will have to rename it to something like my_str_len
This grade is only tentative. Based on additional test cases in our evaluation, you could score less points. If valgrind reports any memory errors other than leaks e.g., writes to unallocated memory etc then your score will be zeroed out for that portion
Module | Total |
---|---|
arraylist | 5 |
strcmp | 5 |
tokenize | 20 (+5 points if no memory leaks) |
dictionary | 20 (+5 points if no memory leaks) |
linecount | 20 (+5 points if no memory leaks) |
unify | 40 (+5 points if no memory leaks) |
In this assignment, you are also expected to manage your memory and prevent memory leaks from happening. A good tool for checking whether your program has memory leak is Valgrind. You can learn how to use it here. A short example is below. We will deduct points if you fail to free all memory before the program exits.
$ valgrind --leak-check=yes ./executable
$ valgrind --tool=memcheck ./executable
Lorem^?
. Check if you inserted null termination character.We strictly adhere to a ‘No Print No Help’ policy. Please review the policy on assignment page for more details.
To ensure the TA meetings are productive, we recommend bringing the following information:
data
pointer and struct
pointer of vector_char
.vector_string
(including head
, tail
, and next
).table_string
.Please ensure that you have thoroughly reviewed your code and attempted to debug the issue before seeking assistance.