Notes@HKU by Jax

Data Structures

The purpose of a program is to process data efficiently. A data structure is a way to group & store data in a computer; each structure has pros and cons which make it suitable for different scenarios.

Operations

A data structure has the following operations:

  • Access
  • Insertion & deletion
  • Search

We use Ω, Θ and O to describe the best, average and worst case time complexity of an algorithm.

Overview of data structures

Linear vs. Non-linear

A Linear Data Structure is one where each item relates only to the item in front of it and behind it.

A Non-linear Data Structure is one where each item can relate to multiple other items.

Abstract vs. Concrete data types

An abstract data type (ADT) defines the operations that can be performed on a data structure, without specifying the implementation details.

A concrete data type (CDT) is an implementation of an ADT.

The following is an overview of the ADTs discussed in the course:

Core implementations

The following ADTs can be used to implement other data structures that will be introduced in the following sections.

Array / Direct access table

A set number of items stored in adjacent memory locations, identified by the first item.

  • Access: Simple access by index. O(1)
  • Insertion & deletion: Slow, as it involves shifting the surrounding items and creating a new array. O(n)

Node

A node is an item, which can be connected to other nodes by edges. In order to access an item in a node-based data structure, we have to traverse the nodes by following the edges.

Edges can be directional (one-way access) or non-directional (two-way access), and can have weights (values).

Linked list

A chain of nodes, each connected to the next node. Identified by the first node.

  • Access: Slow, as we have to traverse the list from the start. O(n)
  • Insertion & deletion: Fast, as we only need to change the pointers of the neighbouring nodes. O(1)

Reversing a linked list

We can reverse a linked list in O(n) time by iterating through the list and reversing the direction of the edges using three pointers:

  • Save the next node
  • Make the current node point to the previous node
  • Advance the previous and current pointers by one node
  • Repeat until the end of the list

Ex. pseudocode
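The steps above can be sketched in Python; the Node class here is a minimal assumed singly-linked node for illustration, not part of the course material:

```python
class Node:
    """Minimal singly-linked list node (assumed structure for illustration)."""
    def __init__(self, value, next=None):
        self.value = value
        self.next = next

def reverse(head):
    prev = None
    curr = head
    while curr is not None:
        nxt = curr.next    # save the next node before overwriting the pointer
        curr.next = prev   # make the current node point to the previous node
        prev = curr        # advance: previous becomes current
        curr = nxt         # advance: current becomes the saved next node
    return prev            # prev is now the new head
```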

Hare and tortoise algorithm

We can find the middle node of a linked list in O(n) time by using two pointers:

  • Move one pointer one node at a time
  • Move the other pointer two nodes at a time
  • When the fast pointer reaches the end, the slow pointer will be at the middle

Ex. pseudocode
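A minimal Python sketch of the two-pointer technique, again assuming a simple node structure for illustration:

```python
class Node:
    """Minimal singly-linked list node (assumed structure for illustration)."""
    def __init__(self, value, next=None):
        self.value = value
        self.next = next

def middle(head):
    slow = fast = head
    # the fast pointer moves two nodes per step, the slow pointer one
    while fast is not None and fast.next is not None:
        slow = slow.next
        fast = fast.next.next
    return slow  # for even lengths this is the second of the two middle nodes
```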

Linear structures

Push pop & access

Stacks and queues are concepts that extend from a linear list of items.

They have the following characteristics:

  • Only one item can be accessed at a time.
  • Items cannot be inserted or removed freely.

In a stack / queue, push / enqueue (in) means adding an item, and pop / dequeue (out) means removing an item. Only the out item can be accessed.

Stacks & Queues

  • Stack: Last In, First Out (LIFO).
  • Queue: First In, First Out (FIFO).

Accessing the out item is O(1). Accessing / operating on other items is O(n).
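A quick sketch of the in/out behaviour in Python, using a list as a stack and collections.deque as a queue:

```python
from collections import deque

stack = []
for x in (1, 2, 3):
    stack.append(x)       # push
top = stack.pop()         # pop: the last item pushed comes out first (LIFO)

queue = deque()
for x in (1, 2, 3):
    queue.append(x)       # enqueue
front = queue.popleft()   # dequeue: the first item enqueued comes out (FIFO)
```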

Priority queues are a special type of queue; learn more about them in the Heaps section.

Non-linear structures

Graph

Nodes which can connect to multiple other nodes (loops are allowed).

  • Directed: Each edge can only be traversed in one direction.
  • Undirected: Each edge can be traversed in both directions.
  • Weighted: Each edge has a weight.
  • Unweighted: Edges have no weights.
  • Number of nodes: n
  • Number of edges: m
  • Degree of node: d, the number of edges connected to the node.
  • Maximum degree of graph: d_max, the maximum degree over all nodes.

Implementations:

Adjacency matrix

An n × n matrix, where A[i][j] = 1 if there is an edge between nodes i and j (and 0 otherwise).

  • Space: O(n²) - better for dense graphs (more edges).
  • Access (check): O(1) for determining if there is a connection between two nodes.
  • Access (process): O(n²) for processing all edges of the graph.
Example: a path graph 1-2-3 gives the matrix

[0 1 0]
[1 0 1]
[0 1 0]

Adjacency list

An array with nn items, each item in the array is a list of nodes that the current node is connected to.

  • Space: O(n+m) - better for sparse graphs (fewer edges).
  • Access (check): O(d_max) for determining if there is a connection between two nodes, as we scan one node's list.
  • Access (process): Better for sparse graphs, as there are fewer than n² elements.
Example: the same path graph gives

[[2], [1, 3], [2]]

Edge list

An array with mm items, each item is a tuple of the two nodes that the edge is connecting.

Example:

[(0, 1), (1, 2)]
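The three representations can be sketched in Python for the same 3-node path graph (0-indexed here, unlike the 1-indexed adjacency-list example above):

```python
n = 3
edge_list = [(0, 1), (1, 2)]  # edge list of the path graph 0-1-2

# adjacency matrix: A[i][j] == 1 iff nodes i and j are connected
A = [[0] * n for _ in range(n)]
for u, v in edge_list:
    A[u][v] = A[v][u] = 1     # undirected: set both directions

# adjacency list: adj[i] is the list of neighbours of node i
adj = [[] for _ in range(n)]
for u, v in edge_list:
    adj[u].append(v)
    adj[v].append(u)
```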

Traversal:

Breadth-first search (BFS)

Visits nodes in increasing order of distance from the starting node.

  • Enqueue the starting node and mark it as queued
  • Repeat until the queue is empty: dequeue a node and visit it
  • Enqueue its unqueued adjacent nodes and mark them as queued

Example implementation: Graph Bfs
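The steps above can be sketched as follows, assuming an adjacency-list representation:

```python
from collections import deque

def bfs(adj, start):
    order = []                       # nodes in visiting order
    queued = [False] * len(adj)
    queue = deque([start])           # enqueue the starting node
    queued[start] = True
    while queue:                     # repeat until the queue is empty
        node = queue.popleft()       # dequeue and visit a node
        order.append(node)
        for neighbour in adj[node]:  # enqueue unqueued adjacent nodes
            if not queued[neighbour]:
                queued[neighbour] = True
                queue.append(neighbour)
    return order
```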

Depth-first search (DFS)

Visit as far along each branch as possible, then backtrack.
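A recursive sketch in Python, using the same adjacency-list representation as BFS:

```python
def dfs(adj, start, visited=None, order=None):
    if visited is None:
        visited, order = set(), []
    visited.add(start)
    order.append(start)
    for neighbour in adj[start]:
        if neighbour not in visited:
            dfs(adj, neighbour, visited, order)  # go deeper first
    # returning from the call backtracks to the previous node
    return order
```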

Heaps

Priority queues

A priority queue is a queue where each item has a priority, and the item with the highest priority is accessed / removed first. The most common implementation is the heap.

  • Access: O(1) (size, peek)
  • Insertion & deletion: O(log n) (enqueue, dequeue)

Heap

A (max) heap is a binary tree where each node has a value greater than or equal to its children. The root node has the highest value.

  • Max heap: The root node has the highest value.
  • Min heap: The root node has the lowest value.

Though a tree-based data structure, we usually implement heaps as arrays, which is useful in cases such as heap sort.

  • Root at index 0
  • Left child at 2i + 1
  • Right child at 2i + 2
  • Parent at ⌊(i - 1) / 2⌋

Heapify and building heaps

We can build (max) heaps using heapify, which converts a binary tree into a heap, given that its children are already heaps. This takes O(log n) time.

  1. Find the largest among the root and its children.

  2. If the largest node is not the root node, swap the largest node and the root, then heapify the affected subtree recursively.

Due to the limitation of heapify, we must build the heap from the bottom up. Heapifying leaf nodes does nothing. Observe that the last non-leaf node is at index ⌊n/2⌋ - 1, so we just have to heapify all nodes up to that index to build the complete heap. As items further down the array are deeper in the heap:

  3. Iterate through the array from the last non-leaf node to the root node.

  4. Heapify the current node. Building the complete heap this way takes O(n) time in total.
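The two procedures above can be sketched as:

```python
def heapify(a, n, i):
    """Sift a[i] down, assuming both of its subtrees are already max-heaps."""
    largest = i
    left, right = 2 * i + 1, 2 * i + 2
    if left < n and a[left] > a[largest]:       # find the largest among the
        largest = left                          # root and its children
    if right < n and a[right] > a[largest]:
        largest = right
    if largest != i:                            # largest is not the root:
        a[i], a[largest] = a[largest], a[i]     # swap it with the root
        heapify(a, n, largest)                  # heapify the affected subtree

def build_heap(a):
    n = len(a)
    # from the last non-leaf node (index n//2 - 1) up to the root
    for i in range(n // 2 - 1, -1, -1):
        heapify(a, n, i)
```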

Finding k largest and smallest elements

Heaps are especially useful for finding the largest and smallest elements. To find the largest k elements:

  1. Put first k elements in a min-heap.
  2. For each remaining element, peek the min-heap and check if the current element is greater than the peeked element. If so, remove the peeked element and add the current element to the min-heap.
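A sketch of the procedure using Python's heapq module (which provides a min-heap over a list):

```python
import heapq

def k_largest(items, k):
    heap = list(items[:k])
    heapq.heapify(heap)                 # min-heap of the first k elements
    for x in items[k:]:
        if x > heap[0]:                 # peek: heap[0] is the smallest kept
            heapq.heapreplace(heap, x)  # pop the smallest, push the new one
    return sorted(heap, reverse=True)
```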

Merging sorted arrays

We can also use heaps to merge sorted arrays.

  1. Put the first element of each array into a min-heap.
  2. Pop the min-heap to the result, and add the next element of the array that the popped element came from.
  3. Repeat until all elements are added.

Time complexity: O(n log k), where n is the total number of elements and k is the number of arrays.

We insert n elements into the heap, and each insertion takes O(log k) time.

Space complexity: O(n+k), as the heap has at most k elements and the result array has n elements.
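The merge above can be sketched with Python's heapq; the tuples keep track of which array each value came from:

```python
import heapq

def merge_sorted(arrays):
    result = []
    # heap entries: (value, array index, position within that array)
    heap = [(arr[0], i, 0) for i, arr in enumerate(arrays) if arr]
    heapq.heapify(heap)                   # first element of each array
    while heap:
        value, i, j = heapq.heappop(heap)  # smallest of the current fronts
        result.append(value)
        if j + 1 < len(arrays[i]):         # push the next element from the
            heapq.heappush(heap, (arrays[i][j + 1], i, j + 1))  # same array
    return result
```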

Trees

Nodes which have one parent each and can connect to multiple children.

Definitions

  • Root: The top node of the tree.
  • Internal node: A node with children.
  • Leaf / external node: A node with no children.
  • Siblings: Nodes with the same parent.
  • Height: The number of edges from the root to the furthest leaf.
  • Depth: The number of edges from the root to the current node.
  • Sub tree: A tree that is a child of a node.
  • Degree of node: The number of immediate children of the node.
  • Degree of tree: The maximum degree of all nodes.
  • (Proper) ancestor: A node that has the current node as a descendant. If not proper, a node can be an ancestor of itself.
  • (Proper) descendant: A node that has the current node as an ancestor. If not proper, a node can be a descendant of itself.

Hash tables

Hash table

Dynamic set of key-value pairs, where items are stored in an array of fixed size. To access an item, its index is found by a hash function.

A collision occurs when h(k₁) = h(k₂). We deal with it by different methods.

  • Access: Θ(1) average, O(n) worst case
  • Insertion & deletion: Θ(1) average, O(n) worst case

Where hh is the hash function and kk is the key.

Note: Hash tables are a specific implementation of a dictionary. Each item in the hash table is called a bucket or a slot.

Load factor

α = n / m

Where n is the number of items, and m is the size of the table.

Hash function

A hash function is a function that takes a key and returns an index. It should depend on the size of the table m.

The most basic hash function is the division method:

h(k) = k mod prime

A good value of prime should be a prime number that is not close to a power of 2.

Collision resolution

When h(k₁) = h(k₂), we have a collision. We can resolve it by:

  • Chaining: Store colliding items in the same slot by appending them to a linked list in the slot.
  • Open addressing: Store the item in another slot. For open addressing, we have the following implementations; the slot where the item is stored is:

  • Linear probing: A[(h(k) + i) mod m]
  • Quadratic probing: A[(h(k) + i²) mod m]
  • Double hashing: A[(h(k) + i · h′(k)) mod m]

Where i is the number of collisions so far.
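A minimal sketch of insertion with linear probing, assuming integer keys and the division-method hash from above (table size and keys here are made up for illustration):

```python
m = 7                       # table size (a prime not close to a power of 2)
table = [None] * m

def h(k):
    return k % m            # division-method hash

def insert(k):
    for i in range(m):             # i = number of collisions so far
        slot = (h(k) + i) % m      # linear probing: try h(k), h(k)+1, ...
        if table[slot] is None:
            table[slot] = k
            return slot
    raise RuntimeError("hash table is full")
```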

Primary clustering

Primary clustering occurs in open-addressing hash tables that use linear probing for collision resolution, which can lead to clusters of occupied slots.

This can lead to longer search times for items that are stored in the same cluster.

General steps:

  1. For each slot, assume a collision and resolve until an empty slot is found, counting the steps (no. of slots inspected). We usually count nil slots as well.
  2. Sum and divide by m.

Notes:

  • For double hashing, we must consider the second hash function. In the other resolution methods we already assumed a collision, so the first hash function does not matter. Refer to the example for detailed steps.

Example

General steps:

  1. For each slot, count the number of items in the slot.
  2. Sum and divide by m.

Averages and spaces

For chaining:

  • S_n = 1 + α/2

  • U_n = α

  • Space = m·p_s + n·(p_s + e_s)

For open addressing:

  • S_n = (1/2)(1 + 1/(1-α))

  • U_n = (1/2)(1 + 1/(1-α)²)

  • Space = m·e_s

Where S_n is the average number of slots inspected for a successful search, U_n is the average number of slots inspected for an unsuccessful search, m is the size of the table, n is the number of items, p_s is the space for a pointer, e_s is the space for an element, and α is the load factor.
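As a quick numerical sanity check of the search-cost formulas at load factor α = 0.5:

```python
alpha = 0.5

# chaining
s_chain = 1 + alpha / 2                     # successful search: 1.25 slots
u_chain = alpha                             # unsuccessful search: 0.5 slots

# open addressing
s_open = 0.5 * (1 + 1 / (1 - alpha))        # successful search: 1.5 slots
u_open = 0.5 * (1 + 1 / (1 - alpha) ** 2)   # unsuccessful search: 2.5 slots
```

As expected, open addressing degrades faster than chaining as the table fills up.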
