top of page

10 essential skills: (5) data structures and algorithms.

  • Writer: Pamela Kinga Gill
    Pamela Kinga Gill
  • Jan 26, 2019
  • 7 min read

Working with 'big' data sets, it becomes crucially important to understand data structures, algorithms, and run-time/space complexity to program efficiently and really understand the tools in use. *Time complexity is not added to this discussion, but for a review of Big 'O' Notation here is a great YouTube video by CS Dojo.


An earlier discussion on database knowledge shared the concept of the database and identified popular tools to store and handle data in a permanent way. In contrast, the purpose of data structures is to prepare data in a specific format so that it can be organized and stored. The appropriate selection of such structures will provide a specific purpose which defines how data can be inserted/accessed (among many other operations) but ultimately worked on by the programmer (data scientist/engineer/analyst). This becomes significant due to the variety of applications and requirements confronting the big data professional. For example, you may which to classify your data in a unique way and so build your own machine learning algorithm - you'll need a strong grasp of data structure basics and algorithms to code efficiently.


An important application of this knowledge in the big data context is in data mining and data science. Data structures are fundamental building blocks of computer science and so they are also critical to software development. In this series ... I limit the context to the big data space.

There are a huge variety of data structures: some simple, some complex. It is also important to understand their implementation.

Knowledge of data structures could be approached in the following ways:

  1. How can data structures be created and manipulated for a variety of data types?

  2. Are we choosing data structures that are aligned with our anticipated needs for that data - i.e. which algorithms will be required and which data structures support these algorithms? (As opposed to: how can I most easily store this data? Ultimately, you don't want to become constrained!)

  3. What are the appropriate data structures for writing optimal programs?

To approach data structures in such a way requires a dedicated effort to learning and practice. It's not overnight one acquires this type of understanding which is in-depth and critical. With that, it's well worth the endeavour to begin to practice and learn data structures and their implementations. At the bottom of this article, I will share some resources to get started. Learning to program in Python, you become acquainted with data structures immediately. Therefore, my experience learning programming and learning data structures has been hand-in-hand (I continue this practice daily by the way)! I believe approaching your training from this perspective will give you a strong grasp of the essentials.


What are data structures?


A data structure is a way to store and organize data in a computer so that it can be used efficiently. Data structures can be organized by their implementation and operation. Abstract data types (ADTs) define data and operations, but not their implementation. Basic operations include how to access, insert, delete, find, and sort data. A more technical description of ADT's is found here by the University of Waterloo.


Fundamental abstract data structures

  • Arrays - a most fundamental data structure and the most important to machine learning. An array is a linear collection of items of a single type (i.e., an array of integers, or strings. It is uncommon for an array to hold multiple types). A one-dimensional array is a vector, a two-dimensional array is a matrix, and increasing the dimension leads to higher-order tensors. These vectors, matrices, and tensors are building blocks of machine learning and deep learning algorithms. So it makes sense. Arrays are very flexible as data structures and in the Python programming language arrays are extensible, meaning that we can actually make them larger (the implementation of this is less important to this conversation at this time). To explain, and ignoring the elements contained in the array, there are two additional pieces of information that are part of the array structure. They are: the storage space, and the size of the array. An extensible array means that, once the size of the array > storage space, a new array is created that is double the size of its predecessor, whose values are copied and inserted into the new array. The old array is deleted. In addition, arrays allow for random access - like a book, where each page can be accessed independent of the other. This random access is critical to certain algorithms such as binary search.

There are three elements (a.k.a values) in this array whose integer values are: 2,5,7. In Python and C, arrays are zero-indexed. They start at [0]. The location of each element is identified by its physical place in memory, i.e. array[2] = 7.
  • Linked Lists - another linear collection of elements (contained in nodes) but here each element includes a reference point to the next element. This reference point actually requires additional storage space which can become costly. To access each element, we actually have to traverse the entire linked list starting at the origin (the 'head') and working through each element and its reference to the next. Herein lies the linked list's strength and weakness: Linked lists allow inserting/deleting nodes at any point in the list which is great. However, since simple linked lists do not allow random access to the data or any form of efficient indexing. Many of the basic operations such as accessing, deleting, finding, or sorting a specific element can require as much as iterating through the entire list of elements. This may not be the most efficient way of handling the data and has implications on the use of algorithms.

Here is a linked list as a collection of nodes containing integer values, linked together to form a sequence. Each node includes a reference to the next.

  • Stacks/Queues - another linear data structure. Stacks are described by their LIFO rules; operations are performed on the last-in-first-out principle, i.e. inserting and deleting data (also knowns as 'push' and 'pop') whereas queues are described by their FIFO rules; first-in-first-out and operations like adding elements at the end is known as 'enqueue' and deleting elements from the front is known as 'dequeue'.

Visual representation of a stack and queue and their respective insert/delete operations.

  • Binary trees - finally, a non-linear data structure! Here we have nodes containing elements (a.k.a values or data) connected with edges (these edges manage the relationship between nodes) and the tree starts at a root (the top of the tree hierarchy). Each node has at most two children and the root grows to the left and the right. If there are 'n' nodes then there are at most 'n-1' edges. Common operations include inserting and deleting notes. In addition to this, there are a few methods for traversing a tree which have implications on searching through nodes and related search algorithms. There is also a search tree ADT which is a tree data structure used for locating specific elements (data contained in the node) from within a set. For a tree to function as a search tree, the element in each node must be greater than any elements in subtrees on the left and less than any elements in subtrees on the right. A simple example is a binary search tree. A key strength here is that search tree structures allow efficient insert, delete, and search operations. For example, when a tree is sorted in this way, it is much faster to find the min or max of the set.


Here are a few other basic abstract data structures:

  • Heaps/Priority

  • Queues

  • Hash Tables

  • Strings

  • Sets

  • Graphs


Algorithms


Algorithms are essentially a series of rules that take data as input and perform operations on this data to create a desired output. Algorithms, then, are instructions we tell a computer to repeat in order to solve problems for us. These problems can be as simple as "finding the minimum value" or "ranking values based on magnitude" or "classify data according to {}", etc... Data held in a database is organized, accessed, and stored according to its data structure. This data is also the input the computer is working with. When we build an algorithm or use a canned-algorithm, we essentially subscribe to a set of rules that perform operations and deliver a desire output whose success and performance will be determined in part by the underlying structure of the data we are acting on. We select algorithms therefore with an understanding of the methods being used and the data structure that make those methods efficient or usable.


Examples:

  • Bubble Sort - "s a simple sorting algorithm that repeatedly steps through the list, compares adjacent pairs and swaps them if they are in the wrong order. The pass through the list is repeated until the list is sorted. The algorithm, which is a comparison sort, is named for the way smaller or larger elements "bubble" to the top of the list. " - Wikipedia

  • Binary Search - "is a search algorithm that finds the position of a target value within a sorted array. Binary search compares the target value to the middle element of the array. If they are not equal, the half in which the target cannot lie is eliminated and the search continues on the remaining half, again taking the middle element to compare to the target value, and repeating this until the target value is found." - Wikipedia

  • Depth-First Search (DFS) - DFS “is an algorithm for traversing or searching tree data structure. One starts at the root and explores as far as possible along each branch before backtracking.” — Wikipedia

  • Breadth-First Search (BFS) - BFS “is an algorithm for traversing or searching tree data structure. It starts at the tree root and explores the neighbor nodes first, before moving to the next level neighbors.” — Wikipedia

Applications


Data mining is the practice of uncovering hidden patterns in big data using machine learning, artificial intelligence and statistics. It makes up one of the range of activities performed by the data scientist. The use of algorithms in data mining can get quite complex in comparison to those discussed above. But knowledge of these algorithms is more easily acquired once the underlying data structures are understood. For an article on the top ten data mining algorithms, here is an article by KDNuggets.


My recommended resources

  • "Data Structures & Algorithms in Python" a textbook by Goodrich, Tamassia, Goldwasser provides an introduction to data structures and algorithms, including their design, analysis, and implementation. This book is designed for use in a beginning- level data structures course, or in an intermediate-level introduction to algorithms course.

  • "Data Structures and Algorithms Through Python In Depth" a lecture series on Udemy. This course explains data structures like linked lists, stacks and queues, binary search trees, heap. Various sorting algorithms with implementation and analysis are included. Concept of recursion is very important for designing and understanding certain algorithms so the process of recursion is explained with the help of several examples.

  • 'Data structures and algorithms' a specialization by Coursera. This specialization is a mix of theory and practice: you learn algorithmic techniques for solving various computational problems and will implement about 100 algorithmic coding problems in a programming language of your choice! There are four courses in the specialization: (1) Algorithmic Toolbox; (2) Data Structures; (3) Algorithms on Graphs; (4) Algorithms on String.


Not coming from a computer science background, this is the most challenging aspect of my career toward data science. Nevertheless, if you break it down into building blocks, the challenge can be fun! Practical understanding of data structures and algorithms are fundamental to the work of the big data professional and will make the work easier and more enjoyable as your understanding and practice increases. Happy coding! :)

Commentaires


  • instagram
  • linkedin
  • twitter

Join the Community!

White on Transparent.png
bottom of page