# Correlation

**Skill level:** Intermediate

**Description**

Correlation is a statistical tool used to assess the strength and direction of the relationship between an input variable and an output variable.

Correlation does not necessarily imply causation. That said, with adequate understanding of a process, the intent is generally to learn how an input variable can be used to affect the output variable.

Adequate understanding of the process is needed to determine causation. If the process is not well enough understood, more research is required to determine whether the relationship is causal.

There are many extensions to basic correlation that allow the user to build direct mathematical models for process variables. In its simplest form, correlation makes it easy to determine the direction (positive or negative) and the degree of a relationship. This is helpful when communicating process parameters.

**Benefits**

- Powerful
- Visually communicates potential dependencies
- Can be extended to develop working models
- Well understood by experts, and quick checks are available to prevent misinterpretation

**How to Use**

**Step 1.** Identify the variables.

**Step 2.** Determine what predictor or independent input variables are important and what variables will be tested.

**Step 3.** Determine what dependent output variable is of interest and is tied to the desired outcome.

**Step 4.** Collect data.

**Step 5.** Using a data analysis package (something as simple as a spreadsheet plotting function can be used), plot the data where x is the independent variable and y is the dependent variable.

**Step 6.** Review the plot for positive or negative correlation.

**Step 7.** Calculate r squared (*r*²) values.
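Steps 6 and 7 can also be checked numerically. The following is a minimal sketch using only the Python standard library; the function name `pearson_r` and the sample data are hypothetical and chosen purely for illustration. Plotting (Step 5) is left to whichever tool is at hand.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient for two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of cross-products and squared deviations from the means
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    return sxy / sqrt(sxx * syy)

# Hypothetical data: x is the independent variable, y the dependent variable
x = [1, 2, 3, 4, 5, 6]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2]

r = pearson_r(x, y)
r_squared = r ** 2

# Step 6: the sign of r gives the direction; Step 7: r squared gives the degree
direction = "positive" if r > 0 else "negative" if r < 0 else "none"
print(f"r = {r:.3f}, r squared = {r_squared:.3f}, direction: {direction}")
```

A numeric check like this complements the plot rather than replacing it, since a plot can reveal curved patterns or outliers that a single coefficient hides.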

**Relevant Definitions**

*Negative correlation:* When the dependent variable tends to decrease as the independent variable increases.

*Positive correlation:* When the dependent variable tends to increase as the independent variable increases.

*r² value:* Derived from the Pearson correlation coefficient r, which reflects the strength and direction of the linear correlation. Squaring r gives the proportion of the variation in y that is accounted for by its linear relationship with x. *r²* is never negative and ranges from 0 to 1. A value of 1 designates perfect correlation, effectively forming a line where y=f(x) with every data point on the line. A value of 0 indicates that there is effectively no linear correlation between y and x.
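For reference, the standard textbook form of the Pearson coefficient for n paired observations is given below. The notation is an assumption introduced here, not taken from the source: the x and y values with subscript i are the individual observations, and the barred symbols are the sample means.

$$
r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}, \qquad -1 \le r \le 1, \qquad 0 \le r^2 \le 1
$$

Because squaring discards the sign, the direction of the relationship has to be read from r itself or from the plot. As a quick arithmetic check, a squared value of 0.86 corresponds to a magnitude of r of roughly 0.93.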

*Regression analysis:* The next step after correlation. Allows the practitioner to develop the mathematical model y=f(x). Regression can use a single independent input variable or several.
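As a rough illustration of the single-variable case, the sketch below fits a straight line y = slope * x + intercept by ordinary least squares. This is one common choice of model, not necessarily the one a given practitioner would use; the function name `fit_line` and the data are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("cannot fit a line when x is constant")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data for illustration only
x = [1, 2, 3, 4, 5]
y = [3.1, 4.9, 7.2, 8.8, 11.0]

slope, intercept = fit_line(x, y)

# Use the fitted model to estimate y at a new x value
predicted = slope * 6 + intercept
print(f"y = {slope:.2f} * x + {intercept:.2f}; estimate at x = 6: {predicted:.2f}")
```

Estimates from a fitted line are most trustworthy inside the range of the collected data; extrapolating well beyond it assumes the relationship keeps the same shape.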

**Example**

A copier service company, in an attempt to optimize service cost, sought to understand whether there was a relationship between preventative maintenance (PM) time and cost of service per printed page. Engineering staff developed a standard PM procedure and determined the length of time it should take. The company realized that the PM time seen in the field varied widely from the standard time and decided to test whether deviations from the standard PM time correlated with higher service cost.

Data were extracted from the service tracking system to show cost per 1000 pages and PM time. The absolute value of the difference between each PM time and the standard time was used, because the company wanted to know whether cost was correlated with a deviation in either direction: taking too long or cutting a PM short.
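The reason for taking the absolute value can be sketched in a few lines. Everything numeric here is hypothetical (the 60-minute standard, the PM times, and the costs are invented for illustration and are not the company's data). The costs are deliberately made symmetric around the standard, so the signed deviation shows essentially no linear correlation while the absolute deviation shows a strong positive one.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Compact Pearson correlation coefficient (no input validation)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

STANDARD_PM_MINUTES = 60  # hypothetical standard PM time

# Hypothetical field records: PM time in minutes and cost per 1000 pages
pm_minutes = [30, 40, 50, 60, 70, 80, 90]
cost_per_1000_pages = [8.0, 6.0, 4.5, 3.0, 4.5, 6.0, 8.0]

# Signed deviation keeps the direction (short vs. long PM)
signed_dev = [t - STANDARD_PM_MINUTES for t in pm_minutes]
# Absolute deviation treats "too short" and "too long" alike
abs_dev = [abs(d) for d in signed_dev]

r_signed = pearson_r(signed_dev, cost_per_1000_pages)
r_abs = pearson_r(abs_dev, cost_per_1000_pages)
print(f"signed deviation r = {r_signed:.3f}, absolute deviation r = {r_abs:.3f}")
```

With a V-shaped pattern like this, the costs of short and long PMs cancel each other out in the signed calculation, which is why the transformation matters when deviations in both directions are expected to raise cost.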

The graph below shows a clear, strong positive correlation. The *r*² value was 0.86, or 86%. The exercise indicated that this is an area the company should investigate further to understand whether a causal relationship exists.