Data Analytics Skills that Accelerate Scientific Discovery (1)

The following main skills are essential for researchers and technology innovators:

  • Data summarization
  • Uncertainty and quantification of uncertainty
  • Predictive models
  • Design and analysis of experimental data 

None of these skills is trivial. In separate posts, we will discuss each of the above topics with a focus on practical application that provides immediate benefits. Further study, for example through university courses or advanced texts, is always welcome. In each post, we will first summarize the basic knowledge, then illustrate how it may be applied in a real-world setting using one or more scientific and technological examples.

1. Data Summarization Basics

Data summarization is the basic skill underlying all data analysis methods. A good understanding of the data provides a foundation for choosing the best method to tackle a scientific or technological problem. To understand the data, the first step is to check

  1. Types of the data (numerical, categorical, or a mix of both)
  2. Structure of the data (a single series, multiple series as in a table, or unstructured data such as text or images)
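
As a minimal sketch of the first check, the snippet below labels each column of a small table as numerical, categorical, or mixed. The helper name `column_type` and the example records are hypothetical and made up for illustration; real data sets usually need more careful handling (missing values, numbers stored as text, dates).

```python
from numbers import Number


def column_type(values):
    """Label a column 'numerical', 'categorical', or 'mixed' from its values."""
    # bool is a subclass of int in Python, so it is treated as categorical here.
    is_num = [isinstance(v, Number) and not isinstance(v, bool) for v in values]
    if all(is_num):
        return "numerical"
    if not any(is_num):
        return "categorical"
    return "mixed"


# Made-up records, for illustration only.
records = {
    "temperature_c": [21.5, 22.0, 19.8],
    "material": ["steel", "copper", "steel"],
    "reading": [3.2, "n/a", 4.1],
}

types = {name: column_type(col) for name, col in records.items()}
print(types)
```

A "mixed" label is often a sign that the column needs cleaning before any summary is computed.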

To summarize numerical data, we need to focus on

  1. The center and location of the data (mean, median, mode, quantiles)
  2. The variation of the data (variance, max, min, range)
  3. The distribution pattern (symmetric vs. skewed, and the direction of skewness)
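
The three groups of summaries above can be sketched with Python's standard `statistics` module. The function name `summarize_numeric`, the sample values, and the mean-versus-median skewness hint are assumptions chosen for illustration; the hint is only a rough rule of thumb, not a formal skewness statistic.

```python
import statistics


def summarize_numeric(values):
    """Return basic center, spread, and shape summaries for a list of numbers."""
    ordered = sorted(values)
    mean = statistics.mean(ordered)
    median = statistics.median(ordered)
    # With n=4, quantiles() returns the three quartile cut points (Q1, Q2, Q3).
    q1, _, q3 = statistics.quantiles(ordered, n=4)
    if mean > median:
        skew_hint = "right"      # mean pulled up by a longer right tail
    elif mean < median:
        skew_hint = "left"       # mean pulled down by a longer left tail
    else:
        skew_hint = "symmetric"
    return {
        "mean": mean,
        "median": median,
        "mode": statistics.mode(ordered),
        "q1": q1,
        "q3": q3,
        "variance": statistics.variance(ordered),  # sample variance (n - 1)
        "min": ordered[0],
        "max": ordered[-1],
        "range": ordered[-1] - ordered[0],
        "skew_hint": skew_hint,
    }


# Made-up measurements, for illustration only.
sample = [2, 3, 3, 4, 5, 6, 15]
summary = summarize_numeric(sample)
for name, value in summary.items():
    print(f"{name}: {value}")
```

In this sample the single large value 15 pulls the mean above the median, which is why the hint reports a right tail.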

To summarize categorical data, we need to check

  1. The frequencies or relative frequencies of each category
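
A frequency table of this kind can be sketched with `collections.Counter` from the standard library. The helper `frequency_table` and the category labels are hypothetical and serve only as an illustration.

```python
from collections import Counter


def frequency_table(labels):
    """Map each category to a (count, relative frequency) pair."""
    counts = Counter(labels)
    total = sum(counts.values())
    # most_common() orders the categories from most to least frequent.
    return {label: (count, count / total) for label, count in counts.most_common()}


# Made-up category labels, for illustration only.
observations = ["A", "B", "A", "C", "A", "B", "A", "C"]
table = frequency_table(observations)
for label, (count, share) in table.items():
    print(f"{label}: count={count}, relative frequency={share:.2f}")
```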

If the data contains multiple series, as is usual in a table, then in addition to summarizing each individual series we need to check the statistical relationships between the series (the columns or variables of the table). The most common statistical relationship is linear correlation. Correlation-type measures can be computed between two numerical series, between a numerical and a categorical series, and between two categorical series, although the appropriate measure differs in each case. More about that will be described later. A complete correlation matrix shows which pairs of series are closely related. Linear correlation paints a direct picture of the strength and direction of the association between two series, and it often gives a first indication of how the series are related. Note that this provides only very basic knowledge: many relationships are hidden much deeper and require more advanced methods to discover, which we will introduce later.
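
As a minimal sketch for the numerical case, the snippet below computes the Pearson correlation coefficient for every pair of columns in a small table. The function names `pearson` and `correlation_matrix` and the column values are assumptions made up for illustration; in practice a data analysis library would normally be used instead.

```python
import math


def pearson(x, y):
    """Pearson linear correlation coefficient of two equal-length numeric series."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def correlation_matrix(columns):
    """Nested dict holding the correlation of every pair of columns."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}


# Made-up table with three numerical columns, for illustration only.
data = {
    "dose": [1, 2, 3, 4, 5],
    "response": [2, 4, 6, 8, 10],
    "noise": [5, 3, 6, 2, 4],
}

matrix = correlation_matrix(data)
for row_name, row in matrix.items():
    print(row_name, {col: round(r, 2) for col, r in row.items()})
```

Values near 1 or -1 indicate a strong linear association, while values near 0 indicate a weak one. This sketch does not handle constant columns, for which the coefficient is undefined.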