Six essential math skills every data scientist needs to know

Posted on 22nd December 2019

Posted in Data Science

While data scientists can work in almost every industry, one key thing that unites them all is an understanding of maths. Whether they’re working on statistics, data analysis or machine learning, maths is at the heart of all of them.

Here are six essential maths skills that every data scientist needs.

Arithmetic

The maths we learn at school, arithmetic, is at the base of almost all other mathematics and essential maths for data science. Arithmetic is the study of numbers and what we do to them, such as addition, subtraction, multiplication and division. 

Logarithms are also part of arithmetic, and they are behind the dynamics of binary search algorithms. A logarithm answers the question of how many times one number must be multiplied by itself to reach another. For example, 10³ = 10 × 10 × 10 = 1,000, so the logarithm of 1,000 to base 10 is 3.

If you have sorted data, a binary search algorithm needs only a logarithmic number of steps to find an item, because it halves the remaining search range at every step. So, instead of looking through a million elements individually, it can complete the same task in about 20 steps (log₂ 1,000,000 ≈ 20).
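As an illustrative sketch (the code and numbers below are our own, not from any particular library), here is a minimal binary search in Python that counts its steps over a million sorted elements:

```python
def binary_search(sorted_items, target):
    """Return (index, steps) for target in sorted_items, or (-1, steps)."""
    lo, hi = 0, len(sorted_items) - 1
    steps = 0
    while lo <= hi:
        steps += 1
        mid = (lo + hi) // 2              # halve the search range each step
        if sorted_items[mid] == target:
            return mid, steps
        elif sorted_items[mid] < target:
            lo = mid + 1
        else:
            hi = mid - 1
    return -1, steps

data = list(range(1_000_000))             # binary search requires sorted data
index, steps = binary_search(data, 765_432)
print(index, steps)                       # steps stays at or below log2(1,000,000) ≈ 20
```

In practice you would reach for Python's built-in `bisect` module rather than rolling your own, but the step count above makes the logarithmic behaviour visible.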

A binary search algorithm is a valuable tool that is used in programming for debugging. Once the program is written, the algorithm can pinpoint the place where a bug occurs quickly, rather than scrolling through large chunks of code.

Linear Algebra

This branch of algebra is concerned with linear equations, vector spaces and matrices (the plural of matrix). Linear algebra takes arithmetic and supercharges it for application in geometry, science and engineering.

The essential maths for data science here is matrix algebra. Data science author Tirthajyoti Sarkar declares that matrix algebra powers “everything from friend suggestions on Facebook, to song recommendations on Spotify, to transferring your selfie to a Salvador Dali-style portrait using deep transfer learning”.

The matrix is also behind neural networks which are machine learning models inspired by the human brain. Scientists in America have recently used neural networks to improve the identification of molecular gases. In future, this application could be used in airport security to identify an unknown chemical or to eliminate impurities in drug manufacturing. 
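To make the matrix connection concrete, here is an illustrative sketch in plain Python: a single neural-network layer amounts to a matrix-vector product plus a bias. The weights, bias and inputs below are made-up values, not from any real model:

```python
def mat_vec(matrix, vector):
    """Multiply a matrix (a list of rows) by a column vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

# A tiny layer mapping 3 inputs to 2 outputs (illustrative weights only).
weights = [[0.2, 0.8, -0.5],
           [0.5, -0.91, 0.26]]
bias = [2.0, 3.0]
inputs = [1.0, 2.0, 3.0]

# Forward pass: outputs = W · x + b
outputs = [h + b for h, b in zip(mat_vec(weights, inputs), bias)]
print(outputs)
```

Real deep-learning libraries do exactly this, just with enormous matrices and hardware-accelerated multiplication.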

Geometry

If you’ve ever used a protractor, compass or set square, then you’ve got some understanding of geometry. It’s the measurement of shape, size and relative positions of objects in terms of lengths, areas and volumes. You might recall some of geometry’s theorems, such as all right angles are equal, or the shortest distance between any two points is a straight line.

Euclidean geometry, the same geometry taught in high schools, supplies the distance formula that makes the leap from measuring the distance between physical objects to measuring the distance between data points.

These measurements are used by the K-means clustering algorithm, which is also known as Lloyd’s algorithm.

The K-means clustering algorithm is an unsupervised machine learning method, which means it finds structure in data without needing labelled examples. This algorithm is used in healthcare to find structure in unlabelled data and make better forecasts for patients. It’s also been used to rate hospitals more accurately based on patient feedback.
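Here is a minimal sketch of Lloyd's algorithm in plain Python, assuming hand-picked starting centroids and a tiny made-up 2-D dataset. Each iteration assigns every point to its nearest centroid by Euclidean distance, then moves each centroid to the mean of its cluster:

```python
import math

def euclidean(p, q):
    """Straight-line (Euclidean) distance between two points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def k_means(points, centroids, iterations=10):
    """Lloyd's algorithm: assign points to nearest centroid, then re-centre."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: euclidean(p, centroids[i]))
            clusters[nearest].append(p)
        centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            if cluster else c
            for cluster, c in zip(clusters, centroids)
        ]
    return centroids, clusters

# Two obvious groups of 2-D points; initial centroids chosen by hand.
points = [(1, 1), (1.5, 2), (1, 0), (8, 8), (9, 9), (8, 9)]
centroids, clusters = k_means(points, centroids=[(0, 0), (10, 10)])
print(centroids)
```

A production library such as scikit-learn adds smarter initialisation and convergence checks, but the core loop is exactly this assign-then-average cycle.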

Calculus

At this point, we’ll have to be careful that we don’t go off on a tangent. Calculus is the study of continuous change: it measures both the slope of a curve at any point (its rate of change) and the area underneath a curve. A tangent is the best straight-line approximation of a curve at a given point.
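As a quick illustration (the function and step size below are arbitrary choices), the slope of a tangent can be approximated numerically by measuring the rate of change over a very small interval:

```python
def derivative(f, x, h=1e-6):
    """Approximate f'(x) with a symmetric difference quotient."""
    return (f(x + h) - f(x - h)) / (2 * h)

# For f(x) = x², calculus says the exact slope at x is 2x, so 6 at x = 3.
slope = derivative(lambda x: x ** 2, 3.0)
print(round(slope, 4))
```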

Understanding the relationship between tangents and curves is an integral part of working with regression analysis in statistics. This method of statistical modelling sits behind the linear regression algorithm, which is used to model the relationship between a continuous outcome and one or more explanatory variables.
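Here is a hedged sketch of simple linear regression in plain Python, using the closed-form least-squares solution on made-up data:

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    # Slope is the covariance of x and y divided by the variance of x.
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Made-up data that is roughly linear.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 8.0, 9.9]
slope, intercept = fit_line(xs, ys)
print(round(slope, 2), round(intercept, 2))
```

The least-squares solution is itself a product of calculus: it comes from setting the derivative of the total squared error to zero.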

A linear regression algorithm was used in human resources research to identify the possibility that innovation could be generated through a process. Researchers analysed five years of data from 154 companies that used a social media style innovation aggregator called Spigit. The continuous variables necessary for successful innovation turned out to be more participants, more ideas, more people evaluating those ideas and more diversity in the people contributing.

Calculus also feeds into logistic regression algorithms, which are similar to linear regression algorithms, except they produce a probability between 0 and 1 rather than an unbounded continuous value.

Probability

What is the probability that you will study a Master of Data Science online after reading this blog? The probability is a numerical value expressed as a fraction or percentage between 0 and 1. If the probability is 1, then you’re definitely going to do it, and we’ll see you at James Cook University soon!

Probability is one of the fundamental elements of statistics. For equally likely outcomes, it is equal to the number of desired outcomes (X) divided by the total number of possible outcomes (T). So, when you flip a fair coin and call heads, the number of desired outcomes is one and the number of possible outcomes is two (heads or tails). The probability of heads is ½ or 0.5.

The same calculation can be used repeatedly as part of a decision tree to determine much more specific outcomes. Decision tree algorithms have been used in sport to improve players’ performance under varying conditions. This works particularly well in a sport like baseball where much of the game comes down to the interaction between the pitcher and the batter.
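The coin-flip calculation, and the idea of chaining it repeatedly along the branches of a tree, can be sketched in a few lines of Python (the events here are purely illustrative):

```python
from fractions import Fraction

def probability(desired, total):
    """P = desired outcomes / possible outcomes, for equally likely outcomes."""
    return Fraction(desired, total)

heads = probability(1, 2)
print(heads)                 # 1/2

# Chaining independent events multiplies their probabilities,
# like following one branch of a probability tree.
two_heads = heads * heads
print(two_heads)             # 1/4
```

Using `Fraction` keeps the arithmetic exact, which is handy when many small probabilities are multiplied together.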

Bayes’ Theorem

Another element of statistics that is essential maths for data science is Bayes’ Theorem, which comes into play when you have previously calculated probabilities. This can be used to work out the effectiveness of a medical test for a specific disease. For example, let’s say we know:

  • the probability of the medical test returning a correct positive result
  • the probability of the medical test returning a false positive result for someone without the disease
  • the probability of anyone in the population having the disease. 

With these three probabilities, we can work out the probability of someone having the disease if their medical test returns a positive result.
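Here is a sketch of that calculation in Python; the sensitivity, false positive rate and prevalence below are made-up illustration values, not real medical figures:

```python
def posterior(sensitivity, false_positive_rate, prevalence):
    """Bayes' Theorem: P(disease | positive test result)."""
    true_positives = sensitivity * prevalence
    false_positives = false_positive_rate * (1 - prevalence)
    return true_positives / (true_positives + false_positives)

# Illustrative numbers only: a 99%-sensitive test with a 5% false
# positive rate, for a disease affecting 1% of the population.
p = posterior(sensitivity=0.99, false_positive_rate=0.05, prevalence=0.01)
print(round(p, 3))
```

The striking result, with these illustrative numbers, is that a positive test still leaves only about a one-in-six chance of actually having the disease, because the disease is rare and false positives outnumber true ones.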

Naive Bayes classifiers are a family of algorithms based on Bayes’ Theorem that use what we already know to predict the probability of a particular outcome. In addition to medical applications, Naive Bayes classifiers are being used alongside other tools for very detailed research in DNA.

Gain the necessary maths to become a data scientist with the Master of Data Science online. Speak to one of our Enrolment advisors on 1300 535 919.