Chapter 1: Foundations of Statistical Inference and Data Science

Overview

Statistics is a discipline that bridges the gap between raw data and meaningful conclusions about the real world. In this chapter, you'll learn why statistics matters, how data collection works, and the fundamental concepts that underpin all statistical thinking. The core idea is simple but powerful: we gather information from samples to draw conclusions about populations, using probability as our guide for confidence in those conclusions.


1.1 Statistical Inference, Samples, Populations, and the Role of Probability

The Context: Why Statistics Matters

Beginning in the 1980s and continuing today, there has been enormous focus on improving quality in American industry. Much of this success has been attributed to the use of statistical methods and statistical thinking among management personnel. The Japanese industrial success in manufacturing was often called the "Japanese industrial miracle," and it was achieved largely through the systematic application of statistical techniques.

Statistical methods are essential across many fields: manufacturing, product development, software engineering, pharmaceuticals, energy, and many others. The use of scientific data—information collected systematically and stored for analysis—has been a cornerstone of scientific work for over a thousand years.

Statistical Inference: The process of drawing conclusions about a population based on information gathered from a sample, using probability to quantify the strength of our conclusions.

Use of Scientific Data

In many practical situations, data are gathered and collected, but there is an important distinction between simple data collection and inferential statistics. Inferential statistics have received significant attention in recent decades as a "toolbox" of methods for making scientific judgments in the face of uncertainty and variation. These methods are designed to help us understand where variability comes from and how to analyze data to improve process quality.

In manufacturing, for example, if you observe variation in the density of materials coming off a production line, statistical methods help you determine whether this variation is due to batch-to-batch differences or within-batch differences. This understanding is crucial for maintaining quality.

Key Insight: The gathering of information as scientific data—collections of observations—is fundamental. The process of sampling is introduced in Chapter 2, and sampling distributions are covered throughout the book.

Variability in Scientific Data

Consider an engineer studying sulfur monoxide levels in polluted air. There are two sources of variation to contend with:

  1. Same-location, same-day variation: Measurements from the same location on the same day fluctuate.
  2. Variation between observed and true values: The measured values differ from the actual sulfur monoxide level in the air.

Real-world examples illustrate why variability matters. In a drug study, a new hypertension medication brought relief to 85% of patients, while the older drug provided relief to 80%. But the new drug is more expensive and has side effects. Should it be adopted? The key is that in real problems like this, variation from patient to patient is endemic—variation must be taken into account in decision-making.

📝 Section Recap: Statistical data naturally contains variability. Understanding this variability is essential for making sound scientific judgments. Statisticians use specialized techniques to analyze variation and draw reliable conclusions about the underlying system.


1.2 Sampling Procedures; Collection of Data

Simple Random Sampling

The importance of proper sampling centers on the degree of confidence with which an analyst can answer questions. When studying populations, simple random sampling means that any particular sample of a given size has the same chance of being selected as any other sample of the same size.

Simple Random Sampling: A sampling procedure in which each possible sample of a specified size has an equal probability of being selected.

The virtue of simple random sampling is that it helps eliminate bias and ensures the sample reflects the population fairly. However, simple random sampling is not always appropriate. In some cases, the population naturally divides into groups—called strata—that differ in meaningful ways. When this occurs, stratified random sampling is used.

Stratified Random Sampling: A procedure in which random selection occurs within each stratum, ensuring that each group is properly represented in the sample.

For example, if you want to survey opinions about a referendum in a city with distinct neighborhoods, you might randomly sample families from each neighborhood separately to ensure each community's views are captured.
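As a minimal sketch of the difference between the two procedures, consider the following Python example. The neighborhoods, family counts, and the 5% sampling fraction are all hypothetical, chosen only to illustrate the mechanics:

```python
import random

random.seed(1)  # fixed seed for reproducibility

# Hypothetical sampling frame: family IDs grouped by neighborhood (the strata).
neighborhoods = {
    "north": list(range(0, 600)),     # 600 families
    "south": list(range(600, 900)),   # 300 families
    "east":  list(range(900, 1000)),  # 100 families
}
population = [fam for families in neighborhoods.values() for fam in families]

# Simple random sampling: every 50-family subset of the city is equally likely.
srs = random.sample(population, 50)

# Stratified random sampling: sample within each stratum,
# proportionally to stratum size (5% of each neighborhood here).
stratified = []
for name, families in neighborhoods.items():
    stratified.extend(random.sample(families, len(families) // 20))

print(len(srs), len(stratified))  # 50 50
```

Note that the stratified sample is guaranteed to contain families from every neighborhood, while a simple random sample of the same size could, by chance, under-represent the smallest one.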

Experimental Design

The concept of randomness or random assignment is crucial in experimental work. This is a fundamental principle in almost any area of engineering or experimental science. When we run an experimental design, we systematically assign treatments to experimental units.

A treatment or treatment combination refers to conditions applied to the units being studied. For example:

  • In a drug study, "placebo" versus "active drug" are treatments.
  • In a materials study, different coating types and humidity levels are treatment combinations.

Experimental Unit: The basic object to which a treatment is applied; the smallest independent unit in the experiment.

The random assignment of experimental units to treatments is critical. Why? Consider this: if you're testing a new drug and all the sickest patients happen to receive it (through non-random assignment), any observed benefit might be due to patient differences, not the drug itself. Random assignment ensures that differences between groups are attributable to the treatment, not to pre-existing differences.
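A minimal sketch of random assignment, assuming hypothetical patient IDs and two equal-sized treatment groups:

```python
import random

random.seed(7)  # fixed seed for reproducibility

# Hypothetical: 8 patient IDs to be split evenly between two treatments.
patients = ["P1", "P2", "P3", "P4", "P5", "P6", "P7", "P8"]

shuffled = patients[:]      # copy, then randomize the order
random.shuffle(shuffled)

# The first half of the shuffled list gets the placebo, the second half
# the active drug; chance alone decides who lands in which group.
assignment = {
    "placebo": sorted(shuffled[:4]),
    "active drug": sorted(shuffled[4:]),
}
print(assignment)
```

Because the shuffle is random, any pre-existing patient characteristic (severity, weight, age) is equally likely to end up in either group, which is exactly what protects the comparison from bias.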

Why Assign Experimental Units Randomly?

When experimental units are not randomly assigned, bias can creep in. For instance, suppose patients naturally selected for a placebo group tend to be heavier than those in the treatment group. Any difference in blood pressure might reflect weight differences, not the drug's effect.

Variability is the key insight. Excessive variability among experimental units—particularly when it's related to the outcome being measured—confounds the results. If units are very different from one another, it becomes hard to detect the true effect of a treatment.

Observational Studies Versus Designed Experiments

Not all scientific questions can be studied through experiments. In observational studies, scientists observe data but have no control over which units receive which treatments. In designed experiments, scientists control the assignment of treatments.

The critical distinction is control. In a designed experiment studying how humidity affects corrosion, the experimenter deliberately sets humidity levels and measures corrosion. In an observational study of blood cholesterol and sodium intake, the researcher simply observes people as they naturally live.

📝 Section Recap: Data collection methods vary: simple random sampling for population studies, stratified random sampling when the population has natural subgroups, and experimental design with random assignment of treatments for causal inference. Proper sampling and experimental design are essential for drawing reliable conclusions.


1.3 Measures of Location: The Sample Mean and Median

The Sample Mean

Measures of location describe where the "center" of data is located. The most intuitive and useful measure is the sample mean—the arithmetic average.

Definition 1.1 - Sample Mean: Suppose the observations in a sample are $x_1, x_2, \ldots, x_n$. The sample mean, denoted by $\bar{x}$, is

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i = \frac{x_1 + x_2 + \cdots + x_n}{n}$$

The sample mean represents the center of the data in a balancing sense—if you imagine the data points as weights on a seesaw, the mean is the fulcrum point where the system balances.

The Sample Median

Another important measure is the sample median, which is useful because it is uninfluenced by extreme values (outliers).

Definition 1.2 - Sample Median: Given that the observations in a sample are $x_1, x_2, \ldots, x_n$, arranged in increasing order of magnitude, the sample median is

$$\tilde{x} = \begin{cases} x_{(n+1)/2}, & \text{if } n \text{ is odd,} \\ \frac{1}{2}\left(x_{n/2} + x_{n/2+1}\right), & \text{if } n \text{ is even.} \end{cases}$$

To illustrate the difference, consider the data set: 1.7, 2.2, 3.9, 3.11, and 14.7.

  • Sample mean: $\bar{x} = 5.12$ grams
  • Sample median: $\tilde{x} = 3.11$ grams

The mean is heavily influenced by the extreme observation (14.7), whereas the median places emphasis on the true "center" of the data. In practical applications, when one or two outliers exist, the median may better represent the typical value.
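The comparison can be checked directly with Python's standard library:

```python
import statistics

# The five measurements from the example above
data = [1.7, 2.2, 3.9, 3.11, 14.7]

mean = statistics.mean(data)      # pulled upward by the outlier 14.7
median = statistics.median(data)  # middle value of the sorted data

print(round(mean, 2))  # 5.12
print(median)          # 3.11
```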

Comparing Mean and Median

For the nitrogen treatment example in the chapter, the stem weight data shows:

  • No nitrogen: $\bar{x} = 0.399$ grams, $\tilde{x} = 0.400$ grams
  • With nitrogen: $\bar{x} = 0.565$ grams, $\tilde{x} = 0.505$ grams

In this case, the mean and median are quite similar because there are no extreme outliers. The choice between them depends on the data structure and your research question.

Key Concept: The sample mean is an estimate of the population mean $\mu$. The purpose of statistical inference is to draw conclusions about population characteristics or parameters based on sample statistics.

Other Measures of Location

Beyond the mean and median, there are alternative measures. Trimmed means are computed by "trimming away" a certain percent of both the largest and smallest values. For example, a 10% trimmed mean eliminates the largest 10% and smallest 10% of observations, then averages the remaining values.

For the stem weight data with no nitrogen (10 observations), a 10% trim removes the smallest and largest values, leaving the eight observations 0.32, 0.37, 0.47, 0.43, 0.36, 0.42, 0.38, 0.43:

$$\bar{x}_{tr(10)} = \frac{0.32 + 0.37 + 0.47 + 0.43 + 0.36 + 0.42 + 0.38 + 0.43}{8} = 0.3975$$

Trimmed means are less sensitive to outliers than the sample mean but more sensitive than the sample median, offering a middle ground.
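As a sketch in Python: the two trimmed-away values (0.28 and 0.53) are an assumption used here to complete the sample of 10; they are consistent with the mean (0.399) and median (0.400) reported above:

```python
import statistics

# No-nitrogen stem weights (grams). The values 0.28 and 0.53 are assumed
# here to complete the sample of 10, consistent with the reported
# mean (0.399) and median (0.400).
weights = [0.32, 0.53, 0.28, 0.37, 0.47, 0.43, 0.36, 0.42, 0.38, 0.43]

def trimmed_mean(data, proportion):
    """Drop `proportion` of the observations from each end, then average."""
    k = int(len(data) * proportion)
    trimmed = sorted(data)[k:len(data) - k]
    return statistics.mean(trimmed)

print(round(statistics.mean(weights), 3))     # 0.399
print(round(statistics.median(weights), 3))   # 0.4
print(round(trimmed_mean(weights, 0.10), 4))  # 0.3975
```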

📝 Section Recap: The sample mean and median are complementary measures of center. The mean is influenced by all observations and is useful when data are symmetrically distributed; the median is robust to outliers. Trimmed means offer a compromise. All three serve as estimates of population location, a fundamental concept in statistical inference.


1.4 Measures of Variability

Why Variability Matters

Sample variability is critical in data analysis. Process and product variability is a fact of life in engineering and scientific systems. More and more process engineers and managers realize that product quality depends on understanding and controlling process variability. Much of Chapters 9 through 15 deals with data analysis and modeling procedures where sample variability plays a major role.

Here's a concrete example: Consider two data sets, each with two samples and roughly the same difference in means:

  • Data set A: Larger variability within each sample
  • Data set B: Smaller variability within each sample

If the goal is to distinguish between two populations, data set B provides much sharper contrast because the samples themselves are internally homogeneous. In data set A, the large variability makes it harder to see the difference between the populations.

Sample Range and Sample Standard Deviation

Just as there are many measures of location, there are several measures of spread or variability.

Definition 1.3 - Sample Variance and Standard Deviation: The sample variance, denoted by $s^2$, is given by

$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2$$

The sample standard deviation, denoted by $s$, is the positive square root of $s^2$, that is,

$$s = \sqrt{s^2}$$

The denominator $n - 1$ is called the degrees of freedom associated with the variance estimate. In this context, it represents the number of independent pieces of information available for computing variability.

The sample range $R = x_{\max} - x_{\min}$ is simpler but less informative, since it uses only two observations (the largest and smallest).

Practical Calculation

Consider a pH meter calibration study where 10 measurements on a neutral substance (pH = 7.0) yielded:

7.07, 7.00, 7.10, 6.97, 7.00, 7.03, 7.01, 7.01, 6.98, 7.08

The sample mean is $\bar{x} = 7.0250$.

The sample variance is:

$$s^2 = \frac{1}{9}\left[(7.07 - 7.025)^2 + (7.00 - 7.025)^2 + \cdots + (7.08 - 7.025)^2\right] = 0.001939$$

The sample standard deviation is:

$$s = \sqrt{0.001939} = 0.044$$

This means observations typically deviate from the mean by about 0.044 pH units.
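The calculation can be reproduced with Python's standard library, whose `statistics.variance` and `statistics.stdev` use the n - 1 divisor from Definition 1.3:

```python
import statistics

# Ten pH readings of a neutral (pH 7.0) substance
ph = [7.07, 7.00, 7.10, 6.97, 7.00, 7.03, 7.01, 7.01, 6.98, 7.08]

mean = statistics.mean(ph)    # sample mean
s2 = statistics.variance(ph)  # sample variance, divides by n - 1 = 9
s = statistics.stdev(ph)      # sample standard deviation

print(round(mean, 4))  # 7.025
print(round(s2, 6))    # 0.001939
print(round(s, 3))     # 0.044
```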

Units and Interpretation

Variance has units that are the square of the original data units. If stem weights are measured in grams, variance is in grams². The standard deviation is measured in the same units as the original data, making it more intuitive for interpretation. For stem weight data measured in grams, standard deviations are in grams.

📝 Section Recap: The sample range, variance, and standard deviation all measure spread. The standard deviation is most commonly used because it operates in the same units as the data and plays a central role in statistical inference theory. Sample variability is essential for understanding whether differences between groups are real or simply due to chance variation.


1.5 Discrete and Continuous Data

Types of Data

Statistical inference is used in many scientific areas, but the nature of data varies. Data may be discrete or continuous depending on the context.

Discrete data arise when observations take on count or categorical values. For example:

  • Number of defective items in a sample
  • Number of radioactive particles passing through a detector
  • Response to a binary question (success/failure)

Continuous data arise when measurements fall on a continuum. Examples include:

  • Product density or weight
  • Chemical concentration in a solution
  • Temperature in an experiment

Great distinctions are made between discrete and continuous data in probability theory, and different statistical methods apply depending on which you're analyzing.

Binary Data and Sample Proportions

Special attention should be paid to binary data early in this textbook. Many applications involve binary outcomes (yes/no, success/failure). For such data, the basic measure is the sample proportion.

In a medical application, if 50 patients received a drug and 20 experienced improvement, then:

$$\frac{x}{n} = \frac{20}{50} = 0.4$$

is the sample proportion for which the drug was a success. The complement, $1 - 0.4 = 0.6$, is the sample proportion for which the drug was not successful.

When Are Binary Methods Needed?

The kinds of problems facing scientists and engineers dealing with binary data differ significantly from those involving continuous measurements. For example, suppose a manufacturer reports that 100 of 5000 randomly selected tires show blemishes. Here the sample proportion is $\frac{100}{5000} = 0.02$.

Consider a process designed to reduce blemishes. If a second sample of 5000 tires yields 90 blemished tires, the sample proportion has been reduced to $\frac{90}{5000} = 0.018$. The question becomes: "Does this decrease suggest a real improvement in the population proportion?" Both illustrations require analysis of sample proportions in discrete (binary) populations.
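The formal comparison belongs to Chapter 10, but a rough normal-approximation sketch in Python illustrates the idea. The pooled-proportion z statistic used here is one standard approach, not the book's specific method:

```python
import math

def prop_diff_z(x1, n1, x2, n2):
    """Normal-approximation z statistic for comparing two sample proportions."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)  # proportion if both samples are combined
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se

# 100 blemished of 5000 before, 90 of 5000 after (0.02 vs 0.018)
z = prop_diff_z(100, 5000, 90, 5000)
p_one_sided = 0.5 * math.erfc(z / math.sqrt(2))  # P(Z >= z) under "no change"

print(round(z, 2))         # 0.73
print(p_one_sided > 0.05)  # True: a drop this small could easily be chance
```

The large one-sided probability says the observed decrease from 0.02 to 0.018 is well within what chance variation alone would produce, so by itself it is weak evidence of real improvement.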

📝 Section Recap: Data can be discrete (counts, categories) or continuous (measurements on a scale). Binary data—with only two outcomes—require special methods centered on sample proportions. These methods differ from those for continuous data, and both are important throughout the statistical toolkit.


1.6 Statistical Modeling, Scientific Inspection, and Graphical Diagnostics

The Role of Statistical Models

Often the end result of a statistical analysis is the estimation of parameters of a postulated model. A statistical model is not deterministic but includes some probabilistic aspects. A model forms the foundation of assumptions that are made by the analyst.

For example, suppose in the nitrogen/no-nitrogen experiment (Example 1.2), the scientist wishes to draw some level of distinction between the two populations through sample information. The analysis may require an assumption that the two samples come from normal (or Gaussian) distributions. See Chapter 6 for a detailed discussion of the normal distribution.

Clearly Stating Assumptions

Obviously, the user of statistical methods cannot generate sufficient information from observational or experimental data to characterize the population totally. But sets of data do often lead to learning about certain properties of the population. Scientists and engineers are accustomed to dealing with data sets. The importance of characterizing or summarizing the nature of collections of data should be obvious. Often a summary of a collection via a graphical display can provide insight regarding the system from which the data were taken.

Scatter Plots for Data Visualization

Consider a textile manufacturer designing an experiment with cloth specimens containing various percentages of cotton. At times the model postulated may take on a complicated form. For example, a textile manufacturer might postulate a regression model relating population mean tensile strength to cotton concentration:

$$\mu_{t,c} = \beta_0 + \beta_1 C + \beta_2 C^2$$

where $\mu_{t,c}$ is the population mean tensile strength, which varies with the amount of cotton in the product, $C$. The functional form is chosen by the scientist. At times the data analysis may suggest that the model be changed. Then the data analyst "entertains" a model that may be altered after some analysis is done. The use of an empirical model is accompanied by estimation theory, where $\beta_0$, $\beta_1$, and $\beta_2$ are estimated from the data. Further, statistical inference can then be used to determine model adequacy.

A scatter plot shows the relationship between two variables visually. For the cotton/tensile strength data, a scatter plot displays tensile strength on the vertical axis and cotton percentage on the horizontal axis. Two important points emerge: (1) the type of model used depends on the experimental goal, and (2) the structure of the model should take advantage of nonstatistical scientific knowledge.
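As an illustration of how the coefficients of such a model might be estimated, here is a pure-Python least-squares sketch. The (cotton percentage, tensile strength) pairs are hypothetical, generated from a known quadratic with no noise, so the fit should recover the coefficients exactly:

```python
# Least-squares estimates for the quadratic model
#   mu = beta0 + beta1*C + beta2*C^2

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x

def fit_quadratic(C, y):
    """Build and solve the normal equations X'X beta = X'y."""
    X = [[1.0, c, c * c] for c in C]
    XtX = [[sum(X[i][a] * X[i][b] for i in range(len(C))) for b in range(3)]
           for a in range(3)]
    Xty = [sum(X[i][a] * y[i] for i in range(len(C))) for a in range(3)]
    return solve(XtX, Xty)

# Hypothetical cotton percentages and strengths from a known quadratic
C = [15, 20, 25, 30, 35]
y = [10 + 2.0 * c - 0.04 * c * c for c in C]

beta = fit_quadratic(C, y)
print([round(b, 3) for b in beta])  # [10.0, 2.0, -0.04]
```

With real, noisy data the estimates would only approximate the underlying coefficients, and the model-adequacy checks mentioned above become essential.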

Stem-and-Leaf Plots

Statistical data, generated in large masses, can be very useful for studying the behavior of distributions if presented in a combined tabular and graphic display. A stem-and-leaf plot is one such effective tool.

To construct a stem-and-leaf plot from car battery life data:

  • Split each observation into a stem (first digit(s)) and a leaf (last digit)
  • List stems vertically on the left
  • Record leaves on the right opposite the appropriate stem

For example, car battery life data ranging from 2.2 to 4.7 years:

Stem Leaf Frequency
2 25669 5
3 00111122233444... 25
4 11234 8

The stem-and-leaf plot represents an effective way to summarize data. If needed, a double-stem-and-leaf plot, in which each stem is split into two lines (one for leaves 0–4, one for leaves 5–9), can provide finer detail.
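The construction steps above can be sketched in Python with hypothetical battery-life values:

```python
from collections import defaultdict

# Hypothetical battery lives (years), one decimal place as in the example
lives = [2.2, 2.5, 2.6, 3.0, 3.1, 3.1, 3.4, 3.7, 4.1, 4.7]

stems = defaultdict(list)
for value in sorted(lives):
    stem, leaf = divmod(round(value * 10), 10)  # 3.4 -> stem 3, leaf 4
    stems[stem].append(str(leaf))

# Prints:
# 2 | 256  (3)
# 3 | 01147  (5)
# 4 | 17  (2)
for stem in sorted(stems):
    leaves = "".join(stems[stem])
    print(f"{stem} | {leaves}  ({len(leaves)})")
```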

Histograms and Frequency Distributions

Dividing each class frequency by the total number of observations yields the proportion in each class interval. A table listing relative frequencies is called a relative frequency distribution. The information provided by a relative frequency distribution in tabular form is easier to grasp if presented graphically by constructing a relative frequency histogram.

For battery life data grouped into class intervals (1.5–1.9, 2.0–2.4, 2.5–2.9, etc.):

Class Interval Frequency Relative Frequency
1.5–1.9 2 0.050
2.0–2.4 1 0.025
2.5–2.9 4 0.100
3.0–3.4 15 0.375
3.5–3.9 10 0.250

Relative frequency histograms reveal the shape of the distribution. Many continuous frequency distributions can be represented graphically by the characteristic bell-shaped curve. Graphical tools such as histograms aid in the characterization of the nature of the population.

Skewness and Distribution Shape

A distribution may exhibit skewness. A distribution said to be skewed lacks symmetry with respect to a vertical axis. Some distributions are skewed to the right (with a long right tail), others to the left. Understanding distribution shape is crucial for choosing appropriate statistical methods.

Box-and-Whisker Plots

Another display that is helpful for reflecting properties of a sample is the box-and-whisker plot (or box plot). This plot encloses the interquartile range of the data in a box that has the median displayed within. The interquartile range has as its extremes the 75th percentile (upper quartile) and the 25th percentile (lower quartile). In addition to the box, "whiskers" extend showing extreme observations in the sample.

A box plot can provide the viewer with information regarding which observations may be outliers: observations considered to be unusually far from the bulk of the data. A common procedure is to use a multiple of the interquartile range; for example, if an observation's distance from the box exceeds 1.5 times the interquartile range, it may be labeled an outlier.

The visual information in the box-and-whisker plot is not intended to be a formal test for outliers; rather, it is a diagnostic tool. The exact determination of which observations are flagged varies with the software used.
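The 1.5 × IQR rule can be sketched in Python on hypothetical data (note that quartile conventions vary between software packages, which is why different tools can flag different observations):

```python
import statistics

# Hypothetical sample with one value far from the bulk of the data
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]

# Quartiles by linear interpolation between data points ("inclusive" method)
q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = [x for x in data if x < lower or x > upper]
print((q1, q3, iqr))  # (3.25, 7.75, 4.5)
print(outliers)       # [100]
```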

📝 Section Recap: Statistical modeling involves postulating relationships between variables through assumptions (often distributional). Scatter plots, stem-and-leaf plots, histograms, and box-and-whisker plots are graphical tools that help visualize data structure and distribution shape. These displays are critical for exploratory data analysis and for detecting outliers or assumption violations before formal statistical analysis.


1.7 General Types of Statistical Studies: Designed Experiment, Observational Study, and Retrospective Study

The Three Types of Scientific Studies

In the foregoing sections, we have emphasized the notion of sampling from a population and the use of statistical methods to learn or perhaps affirm important information about the population. The information sought and learned through the use of these statistical methods can often be influential in decision making and problem solving in many important scientific and engineering areas.

Designed Experiments

A designed experiment involves the systematic manipulation of factors and random assignment of treatments. Researchers control which experimental units receive which treatments. This allows for causal inference under proper conditions. In Example 1.3 (the corrosion study), the engineer systematically selected four treatment combinations (coating, humidity), with eight experimental units per combination assigned randomly.

The power of designed experiments lies in the ability to control factors and attribute differences in outcomes to the treatments applied, rather than to pre-existing differences among units.

What Is Interaction?

A crucial concept in designed experiments is interaction. When two factors interact, the effect of one factor depends on the level of the other factor.

For example, in the corrosion study with coating type and relative humidity:

  • The effect of humidity on corrosion might differ between uncoated and coated specimens
  • If humidity has a larger impact on uncoated specimens than on chemically coated ones, there is an interaction effect

Understanding interactions is essential. The presence of interaction means that we cannot simply add up individual factor effects; the factors work together in a way that produces a combined effect that is greater or less than the sum of individual effects.
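A tiny numeric sketch with hypothetical cell means shows how an interaction is measured:

```python
# Hypothetical 2x2 cell means: average corrosion for each
# (coating, humidity) treatment combination.
corrosion = {
    ("uncoated", "low"):  2.0,
    ("uncoated", "high"): 8.0,  # humidity hurts uncoated specimens a lot...
    ("coated",   "low"):  1.5,
    ("coated",   "high"): 2.5,  # ...but coated specimens only a little
}

# Effect of humidity within each coating type
effect_uncoated = corrosion[("uncoated", "high")] - corrosion[("uncoated", "low")]
effect_coated = corrosion[("coated", "high")] - corrosion[("coated", "low")]

# Interaction: does the humidity effect depend on the coating?
interaction = effect_uncoated - effect_coated
print(effect_uncoated, effect_coated, interaction)  # 6.0 1.0 5.0
```

If the interaction were zero, the humidity effect would be the same for both coatings and the two factor effects could simply be added; the nonzero value here means they cannot.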

Observational Studies

Not all scientific questions can be addressed through designed experiments. An observational study is one in which researchers observe data but have no control over which units receive which treatments. Scientists and engineers must accept what nature provides.

For example, a study of blood cholesterol levels and the amount of sodium measured in the blood cannot be conducted as a designed experiment. Researchers cannot ethically or practically assign people to high or low sodium diets for extended periods. Instead, researchers observe existing variation in sodium intake and cholesterol levels and seek to understand relationships from naturally occurring data.

The critical disadvantage of observational studies is the difficulty in determining true causation. Differences found in the outcome may be due to nuisance factors—variables that were not controlled. For instance, sodium intake and exercise activity are naturally correlated; if high-sodium consumers are also more sedentary, differences in cholesterol may be due to activity level, not sodium.

Retrospective Studies

A retrospective study uses strictly historical data—information taken over a specific period of time. One obvious advantage is reduced cost of collecting data. However, there are clear disadvantages:

  1. Validity and reliability of historical data are often in doubt
  2. If time is an important aspect of the data structure, there may be data missing
  3. There may be errors in collection of the data that are not known
  4. There is no control on the ranges of measured variables (the factors in a study)

The ranges found in historical data may not be relevant for current studies.

When Observational or Historical Data Must Be Used

In many fields, designed experiments are simply not possible. Thus, observational or historical data must be used. For example, Exercise 12.5 on page 450 asks students to build a model relating monthly electric power consumed to average ambient temperature, the number of days in the month, the average product purity, and the tons of product produced. These data are from real historical records and were not generated from a designed experiment.

The advantage of observational data and retrospective data is found in their ability to provide real-world context. But the disadvantage is the difficulty in drawing firm causal conclusions. Graphical and modeling tools become extremely important when designed experiments are impossible and observational or historical data must be analyzed.

📝 Section Recap: Three types of statistical studies—designed experiments, observational studies, and retrospective studies—each offer different advantages and limitations. Designed experiments allow causal inference but may be impractical or unethical. Observational studies provide real-world context but make causal attribution difficult. Retrospective studies are cost-effective but raise questions about data quality and relevance. Each has its place in scientific research depending on the research question and practical constraints.


1.8 How Do Probability and Statistical Inference Work Together?

Bridging Probability and Inference

It is important for you to understand the clear distinction between the discipline of probability, a science in its own right, and the discipline of inferential statistics. The use or application of concepts in probability allows real-life interpretation of the results of statistical inference.

As we have already indicated, the use of concepts in probability allows conclusions to be drawn about some feature of the population. The sample information is made available to the analyst, and, with the aid of statistical methods and elements of probability, conclusions are drawn about some feature of the population (the process).

Key Distinction: In probability, we reason deductively: if we know properties of a population, we can compute the probability of observations. In statistical inference, we reason inductively: given observations, we infer properties of the population.

The Fundamental Relationship

Consider Example 1.1 again. The quality of information regarding the process is determined by probability. Here, the elements of probability provide a summary that the scientist or engineer can use as evidence on which to build a decision. Statistical methods, which we will actually detail in Chapter 10, produced a P-value of 0.0282. This result suggests that the process very likely is not acceptable.

In Example 1.2, the study revolves around whether "probability and statistical evidence" would support the inference. These methods will be discussed in Chapter 10. The issue revolves around the "probability that data like these could be observed given that nitrogen has no effect" (in other words, given that both samples were generated from the same population).

Suppose this probability is small, say 0.03. Then there would certainly be strong evidence that the use of nitrogen does indeed influence (apparently increases) average stem weight of the red oak seedlings.

As we move into Chapter 2 and beyond, the reader will note that, unlike in our two examples here, we will not focus on solving statistical problems. Many examples will be given in which no sample is involved. There will be a population clearly described, with all features of the population known. Then questions of importance will focus on the nature of data that might hypothetically be drawn from that population. Thus, one can say that elements of probability allow us to draw conclusions about characteristics of hypothetical data taken from the population, based on known features of the population. This type of reasoning is deductive in nature.


Key Concepts Summary

Fundamental Definitions

Concept Definition
Population The entire collection of items or individuals of a particular type
Sample A subset of the population, used to draw inferences
Sample Mean The arithmetic average of sample observations; estimates the population mean
Sample Variance/SD Measures the spread of data around the mean
Discrete Data Data that takes on count or categorical values
Continuous Data Data that takes on values along a continuum
Designed Experiment Study where treatments are assigned randomly and factors are controlled
Observational Study Study where researchers observe but do not control treatment assignment

Core Relationships

The relationship between probability and statistical inference forms the backbone of statistical reasoning:

  1. Probability allows us to reason from populations to samples (deductive)
  2. Statistical Inference allows us to reason from samples to populations (inductive)
  3. Both are complementary: probability provides the theoretical foundation for statistical inference

Practical Applications

  • Quality Control: Statistical methods identify when process variation exceeds acceptable limits
  • Drug Testing: Variability in patient response requires careful experimental design and analysis
  • Manufacturing: Understanding variability in material properties drives process improvement
  • Risk Assessment: Probability enables quantification of confidence in conclusions

Final Recap: Chapter 1 establishes the foundation for all statistical thinking. You now understand why data must be gathered carefully, why variability matters, how to describe data through location and spread, and how samples relate to populations through the lens of probability. These concepts—sampling, variability, inference, and probability—form the backbone of statistical science and are essential for anyone who works with data in engineering or science.