CSCI 260 Fall 2010: Choosing ADTs
Two of the most important decisions in designing a software solution
are:
- How should I decompose the problem into logical modules and methods?
- What data structures should I use to model the information needed
for the solution?
Here I want to very briefly summarize some of the questions you should
ask and decisions you should make when trying to select a data structure
for a particular application.
(Note that after this course, when you're given a problem you'll generally be
expected to pick, implement, and use an appropriate data type on your own,
so the decision-making process is extremely important.)
Available data types
Here is a quick rundown of the major data structures discussed
so far (in CSCI 161 and 260):
- Unsorted arrays: good for storing small amounts of data quickly and
efficiently
- Sorted arrays: good if we have a reasonable upper limit on the size
of the data set and insert/remove operations are relatively rare
- Queues: good for processing data in the order it arrives,
with circular buffer implementations for queues with a reasonable
fixed size limit or pointer-based implementations for more dynamic sizes
- Stacks: good for interrupt-based processing of data (where we
remember the state we were in while we switch to a new task, then
pick up where we left off), also with array implementations for stacks
with a reasonable fixed size limit or pointer-based implementations
for more dynamic sizes
- Unsorted linked lists: good if the list size is highly dynamic
(making an array inappropriate) and we don't need to perform fast
searches (e.g. very slow performance is ok, or we're usually accessing
items near the front or back of the list)
- Sorted linked lists: good if the list size is highly dynamic (making
an array inappropriate) and we don't need to perform fast searches
but we do often process the data in a sequential (sorted) fashion
- Self-organizing lists: good if we need to quickly search the data
and we know data requests are very unevenly distributed (i.e. we know
that either some items are searched for much more often than
most, or we know that requests for any given item are likely to come
in "clumps")
- Caches: not really a data structure, but a possible supplement for
another searchable data type. Used in the same circumstances as
self-organizing lists, but where we want the "main" list to be maintained
in some other format.
- Skiplists: (assuming random node levels are in use) these give good
average search, insert, and remove speeds, and are not sensitive to the
order in which data items are inserted or removed
- Binary search trees: a simple data structure with good search, insert,
and remove speeds if the data insertion and removal order is
highly randomized but very poor performance if the data insertion order
tends to be mostly sorted (or reverse sorted)
- AVL trees: slightly more complex than binary search trees, with very good
search, insert, and remove efficiency that is not
sensitive to data orderings; somewhat more space efficient than skiplists
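The warning about binary search trees and sorted input can be checked with a small experiment. The following is a minimal sketch, not course code: the names Node and bst_height_after_inserts, the data set size, and the random seed are all assumptions made for illustration. It inserts the same keys into a plain (unbalanced) binary search tree twice, once in sorted order and once shuffled, and records the resulting tree height.

```python
import random


class Node:
    """One node of a plain, unbalanced binary search tree."""

    def __init__(self, key):
        self.key = key
        self.left = None
        self.right = None


def bst_height_after_inserts(keys):
    """Insert keys one at a time (no rebalancing); return the height in nodes."""
    root = None
    height = 0
    for key in keys:
        depth = 1  # depth at which the new node ends up (root is depth 1)
        if root is None:
            root = Node(key)
        else:
            cur = root
            while True:
                depth += 1
                if key < cur.key:
                    if cur.left is None:
                        cur.left = Node(key)
                        break
                    cur = cur.left
                else:
                    if cur.right is None:
                        cur.right = Node(key)
                        break
                    cur = cur.right
        height = max(height, depth)
    return height


n = 1000                                        # assumed size for the demo
sorted_keys = list(range(n))
shuffled_keys = random.Random(260).sample(sorted_keys, n)  # assumed seed

sorted_height = bst_height_after_inserts(sorted_keys)
shuffled_height = bst_height_after_inserts(shuffled_keys)
print("height with sorted insertions:  ", sorted_height)
print("height with shuffled insertions:", shuffled_height)
```

With sorted insertions every new key goes to the right of the previous one, so the tree degenerates into a linked list whose height equals the number of items. With shuffled insertions the height stays a small multiple of log2(n). An AVL tree or skiplist avoids the degenerate case regardless of insertion order.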
Questions and implications
Here are some of the key questions you should ask before picking a data structure:
- How large can the data set grow,
how much does the size vary over time,
and is it reasonable for us to reserve space for the entire data set
all the time?
- Do I need to search the data set quickly or frequently,
and (if so) are the searches randomly distributed?
- Do I frequently need to access data in a particular position in
the data set (top, back, front, ...) or with a particular
characteristic (largest, smallest, most-used, ...)?
- Is the data set (or most of it) being supplied at the beginning,
or are there frequent inserts and removes later?
- Is the data supplied in a (nearly) predictable order
(e.g. sorted or mostly sorted), or is the initial ordering
highly random?
- Do I frequently need to access the data in sorted order (in either
direction)?
Based on the answers to those questions, you should be able to narrow down
the set of appropriate data types quite quickly.
Always look for a solution that is most effective for your situation
in terms of longevity, simplicity, maintainability, and efficiency.
Combining data types
Of course, for complex problems you will often find the best solution is a combination
of the simpler data types, for example:
- Suppose you need to implement a job queue with the following properties:
- the number of jobs in the queue is highly variable and unpredictable
- jobs have priority levels, and higher priority jobs are processed first
- jobs have unique id numbers, and we need to be able to search for
jobs with a specific id (e.g. so we can cancel them and remove them
from the queue)
A heap might be a reasonable implementation for the priority queue,
whereas an avltree would be more reasonable for the search aspects.
Ideally, we combine the two, e.g. creating our own avlheap:
Nodes in the avlheap have both heap pointers and avltree pointers,
e.g. avlleft, avlright, avlparent, heapright, heapleft, heapparent
When we insert a job we call both a heap insert
(using the heap pointers to place the item in the heap) and a tree insert
(using the avl pointers to place the item in the tree).
When we need to get the next item for processing we do a heapremove,
which removes the root item from the heap (and fixes the heap afterward)
and then call avlremove to take the item out of the tree.
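To cancel a job, we would first search the tree for its id, then call both
avlremove and heapremove on that node (note that here heapremove must be able
to remove an item from anywhere in the heap, not just the root).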
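The idea of keeping every job in two structures at once can be sketched briefly. This is an illustration under stated assumptions, not the avlheap itself: to stay short, a Python dict stands in for the avltree (it gives lookup by id but no balancing or ordering), and the heap is array-based rather than pointer-based, with each job remembering its heap position in place of heap pointers. The names Job, JobQueue, pos, next_job, and cancel are hypothetical.

```python
class Job:
    def __init__(self, job_id, priority):
        self.job_id = job_id
        self.priority = priority
        self.pos = -1  # current index in the heap array (-1 = not in heap)


class JobQueue:
    """Each job lives in two structures: a max-heap and an id index."""

    def __init__(self):
        self.heap = []   # array-based binary max-heap ordered by priority
        self.by_id = {}  # ASSUMPTION: dict stands in for the avltree on ids

    def _swap(self, i, j):
        h = self.heap
        h[i], h[j] = h[j], h[i]
        h[i].pos, h[j].pos = i, j

    def _sift_up(self, i):
        while i > 0:
            parent = (i - 1) // 2
            if self.heap[i].priority <= self.heap[parent].priority:
                break
            self._swap(i, parent)
            i = parent

    def _sift_down(self, i):
        n = len(self.heap)
        while True:
            largest = i
            for child in (2 * i + 1, 2 * i + 2):
                if (child < n and
                        self.heap[child].priority > self.heap[largest].priority):
                    largest = child
            if largest == i:
                break
            self._swap(i, largest)
            i = largest

    def _heap_remove(self, i):
        """Remove the job at heap index i (the root or any other position)."""
        last = len(self.heap) - 1
        if i != last:
            self._swap(i, last)
        job = self.heap.pop()
        job.pos = -1
        if i < len(self.heap):
            # The job moved into slot i may need to travel either direction.
            self._sift_up(i)
            self._sift_down(i)
        return job

    def insert(self, job_id, priority):
        if job_id in self.by_id:
            raise KeyError("duplicate job id")
        job = Job(job_id, priority)
        self.by_id[job_id] = job          # "tree insert"
        job.pos = len(self.heap)
        self.heap.append(job)             # "heap insert"
        self._sift_up(job.pos)

    def next_job(self):
        """Take the highest-priority job out of both structures."""
        if not self.heap:
            raise IndexError("no jobs queued")
        job = self._heap_remove(0)
        del self.by_id[job.job_id]
        return job.job_id

    def cancel(self, job_id):
        """Find the job by id, then take it out of both structures."""
        job = self.by_id.pop(job_id)      # raises KeyError if id is unknown
        self._heap_remove(job.pos)
```

A short usage example: after q = JobQueue() and a few calls such as q.insert(101, 2) and q.insert(102, 5), calling q.cancel(102) removes job 102 from both structures, and q.next_job() then returns the id of the highest-priority job that remains. The key design point carried over from the avlheap is that every operation updates both structures, so they never disagree about which jobs exist.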
- Suppose we need an extraordinarily large but very sparse table,
with a potentially infinite number of row values and column values,
and some rows and columns
that contain a great many values. Suppose, furthermore, that we need to
be able to search for specific cells very quickly.
We can't maintain arrays of row and column pointers, because we don't have
sufficient storage space and couldn't afford the time to search them anyway.
Thus we might use one avltree whose nodes contain the individual row pointers,
and another avltree for the column pointers.
Each individual row and column would also be an avltree, so that we could
rapidly search the row (or column) for a specific cell. Thus the row pointers
and column pointers actually point to the root of the avltree for that
particular row or column.
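The two-level layout described above (an index of rows and an index of columns, each entry leading to its own searchable collection of cells) can be sketched as follows. This is only an illustration of the shape of the structure: Python dicts stand in for every avltree, so the keys are not kept in sorted order and are sorted on demand instead. The names SparseTable, set, get, remove, row, and column are hypothetical.

```python
class SparseTable:
    """Sparse 2D table; only cells that actually hold a value use memory."""

    def __init__(self):
        # ASSUMPTION: each dict below stands in for an avltree.
        self.rows = {}  # row key -> {column key -> value}
        self.cols = {}  # column key -> {row key -> value}

    def set(self, r, c, value):
        # Every cell is entered under its row and under its column.
        self.rows.setdefault(r, {})[c] = value
        self.cols.setdefault(c, {})[r] = value

    def get(self, r, c, default=None):
        # Two lookups: find the row, then find the cell within that row.
        return self.rows.get(r, {}).get(c, default)

    def remove(self, r, c):
        del self.rows[r][c]
        del self.cols[c][r]
        # Drop a row or column entry once it holds no cells.
        if not self.rows[r]:
            del self.rows[r]
        if not self.cols[c]:
            del self.cols[c]

    def row(self, r):
        """All (column, value) pairs in row r, ordered by column key."""
        return sorted(self.rows.get(r, {}).items())

    def column(self, c):
        """All (row, value) pairs in column c, ordered by row key."""
        return sorted(self.cols.get(c, {}).items())


table = SparseTable()
table.set(10**12, 7, "a")   # huge row index, no huge array needed
table.set(3, 7, "b")
table.set(3, 10**9, "c")
print(table.get(3, 7))
print(table.column(7))
```

With real avltrees in place of the dicts, a cell lookup costs two logarithmic searches (one in the row index, one within the row), and walking a row or column in order needs no extra sorting step.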
Again, always look for a solution that is most effective for your situation
in terms of longevity, simplicity, maintainability, and efficiency.