CSCI 260 Fall 2010: Notes

CSCI 260 Fall 2010: Choosing ADTs

Two of the most important decisions in designing a software solution are:

How should I decompose the problem into logical modules and methods?
What data structures should I use to model the information needed for the solution?

Here I want to very briefly summarize some of the questions you should ask and decisions you should make when trying to select a data structure for a particular application.

(Note that after this course, when you're given a problem you'll generally be expected to pick, implement, and use an appropriate data type on your own, so the decision-making process is extremely important.)

Available data types

Here is a quick rundown of the major data structures discussed so far (in CSCI 161 and 260):

Unsorted arrays: good for storing small amounts of data quickly and efficiently
Sorted arrays: good if we have a reasonable upper limit on the size of the data set and insert/remove operations are relatively rare
Queues: good for processing data in the order it arrives, with circular buffer implementations for queues with a reasonable fixed size limit or pointer-based implementations for more dynamic sizes
Stacks: good for interrupt-based processing of data (where we remember the state we were in while we switch to a new task, then pick up where we left off), also with array implementations for stacks with a reasonable fixed size limit or pointer-based implementations for more dynamic sizes
Unsorted linked lists: good if the list size is highly dynamic (making an array inappropriate) and we don't need to perform fast searches (e.g. very slow performance is ok, or we're usually accessing items near the front or back of the list)
Sorted linked lists: good if the list size is highly dynamic (making an array inappropriate) and we don't need to perform fast searches but we do often process the data in a sequential (sorted) fashion
Self-organizing lists: good if we need to quickly search the data and we know data requests are very unevenly distributed (i.e. we know that either some items are searched for much more often than most, or we know that requests for any given item are likely to come in "clumps")
Caches: not really a data structure, but a possible supplement for another searchable data type. Used in the same circumstances as self-organizing lists, but where we want the "main" list to be maintained in some other format.
Skiplists: (assuming random node levels are in use) these give good average search, insert, and remove speeds, and are not sensitive to the order in which data items are inserted or removed
Binary search trees: a simple data structure with good search, insert, and remove speeds if the data insertion and removal order is highly randomized but very poor performance if the data insertion order tends to be mostly sorted (or reverse sorted)
AVL trees: slightly more complex than binary search trees, with very good search, insert, and remove efficiency, and not sensitive to data orderings; somewhat more space efficient than skiplists
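
To make the point about binary search trees and insertion order concrete, here is a minimal sketch in Python (not the course's own code). The names Node, bst_insert, height, and build are hypothetical helpers introduced only for this illustration. It builds a plain, unbalanced BST twice from the same keys, once in sorted order and once shuffled, and compares the heights.

```python
import random

class Node:
    def __init__(self, key):
        self.key = key
        self.left = None
        self.right = None

def bst_insert(root, key):
    """Insert key into a plain (unbalanced) BST iteratively; return the root."""
    new = Node(key)
    if root is None:
        return new
    cur = root
    while True:
        if key < cur.key:
            if cur.left is None:
                cur.left = new
                return root
            cur = cur.left
        else:
            if cur.right is None:
                cur.right = new
                return root
            cur = cur.right

def height(root):
    """Number of nodes on the longest root-to-leaf path (0 for an empty tree)."""
    best = 0
    stack = [(root, 1)] if root is not None else []
    while stack:
        node, depth = stack.pop()
        best = max(best, depth)
        for child in (node.left, node.right):
            if child is not None:
                stack.append((child, depth + 1))
    return best

def build(keys):
    root = None
    for k in keys:
        root = bst_insert(root, k)
    return root

n = 500
sorted_keys = list(range(n))
shuffled_keys = sorted_keys[:]
random.Random(260).shuffle(shuffled_keys)  # fixed seed so runs are repeatable

sorted_height = height(build(sorted_keys))      # degenerates into a chain of n nodes
shuffled_height = height(build(shuffled_keys))  # typically a small multiple of log2(n)
print(sorted_height, shuffled_height)
```

With sorted input every new key goes to the right of the previous one, so the tree is effectively a linked list and searches take linear time. An AVL tree or a skiplist with random levels avoids this sensitivity, at the cost of a more complex implementation.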

Questions and implications

Here are some of the key questions you should ask before picking a data structure:

How large can the data set grow,
how much does the size vary over time,
and is it reasonable for us to reserve space for the entire data set all the time?
Do I need to search the data set quickly or frequently,
and (if so) are the searches randomly distributed?
Do I frequently need to access data in a particular position in the data set (top, back, front, ...) or with a particular characteristic (largest, smallest, most-used, ...)?
Is the data set (or most of it) being supplied at the beginning, or are there frequent inserts and removes later?
Is the data supplied in a (nearly) predictable order (e.g. sorted or mostly sorted), or is the initial ordering highly random?
Do I frequently need to access the data in sorted order (in either direction)?

Based on the answers to those questions, you should be able to narrow down the set of appropriate data types quite quickly.

Always look for a solution that is most effective for your situation in terms of longevity, simplicity, maintainability, and efficiency.

Combining data types

Of course, for complex problems you will often find the best solution is a combination of the simpler data types, for example:

Suppose you need to implement a job queue with the following properties:
- the number of jobs in the queue is highly variable and unpredictable
- jobs have priority levels, and higher priority jobs are processed first
- jobs have unique id numbers, and we need to be able to search for jobs with a specific id (e.g. so we can cancel them and remove them from the queue)
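A heap might be a reasonable implementation for the priority queue, whereas an AVL tree would be more reasonable for the search aspects.
Ideally, we combine the two, e.g. creating our own "avlheap", so that each job can be reached both by priority (through the heap) and by id (through the AVL tree).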
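
One way the job queue idea might look in code is sketched below, under some loudly stated assumptions. The class name JobQueue and its methods are hypothetical. Python's standard heapq module provides the heap, and a built-in dict stands in for the AVL tree as the searchable index by job id (a hash table rather than a balanced tree, chosen only to keep the sketch short). Cancelled jobs are marked and skipped when they reach the top of the heap, rather than being removed from the middle of the heap immediately.

```python
import heapq

class JobQueue:
    """Priority queue of jobs with cancel-by-id (hypothetical interface)."""

    def __init__(self):
        self._heap = []    # entries: [-priority, seq, job_id]
        self._index = {}   # job_id -> heap entry; stand-in for the AVL tree
        self._seq = 0      # tie-breaker: earlier jobs first within a priority

    def add(self, job_id, priority):
        if job_id in self._index:
            raise KeyError(f"duplicate job id {job_id}")
        # heapq is a min-heap, so negate the priority to pop the highest first
        entry = [-priority, self._seq, job_id]
        self._seq += 1
        self._index[job_id] = entry
        heapq.heappush(self._heap, entry)

    def cancel(self, job_id):
        """Mark the job cancelled; its heap entry is discarded lazily."""
        entry = self._index.pop(job_id)  # raises KeyError for an unknown id
        entry[2] = None

    def pop(self):
        """Remove and return the id of the highest-priority live job."""
        while self._heap:
            _, _, job_id = heapq.heappop(self._heap)
            if job_id is not None:
                del self._index[job_id]
                return job_id
        raise IndexError("pop from empty JobQueue")

    def __len__(self):
        return len(self._index)

q = JobQueue()
q.add(101, priority=1)
q.add(102, priority=5)
q.add(103, priority=5)
q.cancel(102)                 # found through the id index, not by scanning the heap
order = [q.pop(), q.pop()]    # job 103 first (higher priority), then job 101
```

The design choice to note is that the two structures share the same entries: the heap orders them by priority and the index finds them by id, so neither operation needs a linear scan.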
Suppose we need an extraordinarily large but very sparse table, with a potentially infinite number of row values and column values, and some rows and columns that contain a great many values. Suppose, furthermore, that we need to be able to search for specific cells very quickly.
We can't maintain arrays of row and column pointers, because we don't have sufficient storage space and couldn't afford the time to search them anyway. Thus we might use one AVL tree whose nodes contain the individual row pointers, and another AVL tree for the column pointers.
Each individual row and column would also be an AVL tree, so that we could rapidly search the row (or column) for a specific cell. Thus the row pointers and column pointers actually point to the root of the AVL tree for that particular row or column.
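
The shape of this two-level structure can be sketched as follows. This is only an illustration: the class SparseTable and its method names are hypothetical, and Python dicts are used in place of AVL trees at both levels. A dict gives fast lookup of a specific cell but, unlike an AVL tree, does not keep rows or columns in sorted order.

```python
class SparseTable:
    """Sparse 2-D table indexed by row and by column (hypothetical interface)."""

    def __init__(self):
        self._rows = {}   # row key -> {column key -> value}; stand-in for AVL trees
        self._cols = {}   # column key -> {row key -> value}; stand-in for AVL trees

    def set(self, row, col, value):
        # Only cells that actually hold a value take up any space.
        self._rows.setdefault(row, {})[col] = value
        self._cols.setdefault(col, {})[row] = value

    def get(self, row, col, default=None):
        return self._rows.get(row, {}).get(col, default)

    def remove(self, row, col):
        """Delete a cell, dropping the row or column entry once it is empty."""
        del self._rows[row][col]   # raises KeyError if the cell is absent
        del self._cols[col][row]
        if not self._rows[row]:
            del self._rows[row]
        if not self._cols[col]:
            del self._cols[col]

    def row(self, row):
        """Copy of all occupied cells in one row, keyed by column."""
        return dict(self._rows.get(row, {}))

    def column(self, col):
        """Copy of all occupied cells in one column, keyed by row."""
        return dict(self._cols.get(col, {}))

t = SparseTable()
t.set(10**12, 7, "a")   # row and column keys can be arbitrarily large
t.set(10**12, 9, "b")
t.set(3, 7, "c")
col7 = t.column(7)      # the two occupied cells in column 7
```

Every cell is stored once per index, so the cost of fast access by row and by column is roughly double the bookkeeping on each insert and remove.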

Again, always look for a solution that is most effective for your situation in terms of longevity, simplicity, maintainability, and efficiency.