Determining Outliers


Thus far, you have seen three ways of determining whether a given value is an outlier.  They are listed below.  The first two should be used only if your data is mound-shaped and reasonably symmetric.  If your data is skewed or not mound-shaped, then you should use the IQR test (method 3 below).


1.  If the data is mound-shaped and reasonably symmetric -- you can use the  EMPIRICAL RULE.

 

In particular, any data value outside the interval (mean - 2*std. deviation,  mean + 2*std. deviation), but still within 3 standard deviations of the mean, would be a mild outlier,

and any data value outside the interval (mean - 3*std. deviation,  mean + 3*std. deviation) would be an extreme outlier.
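
The two intervals above can be checked mechanically. The following is a minimal sketch, not part of the course materials: the function name is made up, the mean of 100 and standard deviation of 10 are hypothetical values chosen only for illustration, and a value sitting exactly on an interval endpoint is assumed to count as inside the interval.

```python
def empirical_rule_label(value, center, spread):
    """Label a value by how many standard deviations it lies from the mean."""
    distance = abs(value - center)
    if distance > 3 * spread:
        return "extreme outlier"   # outside (mean - 3*sd, mean + 3*sd)
    if distance > 2 * spread:
        return "mild outlier"      # outside the 2-sd interval only
    return "not an outlier"


# Hypothetical summary values, for illustration only: mean 100, std. deviation 10.
print(empirical_rule_label(115, 100, 10))  # 1.5 sd from the mean
print(empirical_rule_label(125, 100, 10))  # 2.5 sd from the mean
print(empirical_rule_label(135, 100, 10))  # 3.5 sd from the mean
```

The function checks the 3-standard-deviation interval first, so a value outside both intervals gets the more severe label.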

 


2.  If the data is mound-shaped and reasonably symmetric -- you can also use  Z-scores.

 

  Recall:

Z-score = (data - mean) / (std. deviation)

 

Since virtually all mound-shaped data falls within 3 standard deviations of its mean, the Z-scores of such data are virtually all between -3 and 3.  Thus any data value that has a Z-score less than -3 or greater than +3 would be an extreme outlier.
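
As a small illustration, the Z-score test can be written in a few lines. This is only a sketch: the function names are made up, and the mean of 100 and standard deviation of 10 are hypothetical values rather than data from the course.

```python
def z_score(value, center, spread):
    """Number of standard deviations that value lies above (+) or below (-) the mean."""
    return (value - center) / spread


def is_extreme_by_z(value, center, spread):
    """True when the Z-score is less than -3 or greater than +3."""
    return abs(z_score(value, center, spread)) > 3


# Hypothetical mean 100 and std. deviation 10, for illustration only.
print(z_score(135, 100, 10))          # 3.5
print(is_extreme_by_z(135, 100, 10))  # True
print(is_extreme_by_z(125, 100, 10))  # False
```

This is the same test as the 3-standard-deviation interval in method 1, just expressed on the Z-score scale.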

 


 

3.  Use the Interquartile Range (IQR).  Any data value that is more than 1½ times the IQR below the lower quartile, or more than 1½ times the IQR above the upper quartile, is considered an outlier.

 

    Steps:

1.  Determine Q1 and Q3, then the IQR = Q3 - Q1

 

2.  Multiply this value (IQR) by 1.5.

 

3.  Subtract this new value (1.5*IQR) from the lower quartile Q1

 

4.  Add this value (1.5*IQR) to the upper quartile Q3.

 

5.  You have now created an interval   (Q1 - 1.5*IQR,  Q3 + 1.5*IQR).  

Check to see whether any of your data lie outside this interval.

 

6.  All data outside (not in) the interval are outliers.
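
The six steps above can be sketched in code. This is one possible version, under two assumptions that are not stated in the steps themselves: Q1 and Q3 are taken as the medians of the lower and upper halves of the sorted data (leaving out the middle value when the count is odd), and the data set has at least two values. The sample list is made up purely for illustration.

```python
from statistics import median


def quartiles(data):
    """Q1 and Q3 as the medians of the lower and upper halves of the sorted data."""
    ordered = sorted(data)
    half = len(ordered) // 2          # the middle value is left out when n is odd
    return median(ordered[:half]), median(ordered[-half:])


def iqr_outliers(data):
    """Return the values lying outside (Q1 - 1.5*IQR, Q3 + 1.5*IQR)."""
    q1, q3 = quartiles(data)                           # step 1
    step = 1.5 * (q3 - q1)                             # step 2
    low, high = q1 - step, q3 + step                   # steps 3, 4 and 5
    return [x for x in data if x < low or x > high]    # step 6


# Made-up values, for illustration only.
sample = [2, 4, 5, 6, 7, 8, 9, 30]
print(quartiles(sample))     # (4.5, 8.5)
print(iqr_outliers(sample))  # [30]
```

Other software may compute quartiles with a slightly different convention, which can shift the interval a little.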

 

Example:  LPGA data (Activity 6-2):

        The 5 - Number summary is:   (452, 584,  688, 856,  2588).

        Thus Q1 = 584   and Q3 = 856.

        The difference between these two numbers is the IQR  = 856-584= 272.

        Multiply the IQR by 1.5  (1.5)*272 =  408.

 

        Now -- subtract 408 from Q1  to get  176   (584 - 408)

        And add 408 to Q3  to get 1264    (856 + 408)

 

Our interval of numbers, in thousands of dollars, is  (176, 1264).  Any LPGA golfer earning less than $176,000 or more than $1,264,000 would be an outlier.  There are no women golfers in our list who earned less than $176,000, so there are no low outliers.  There are 3 women golfers who earned more than $1,264,000 -- they are Sorenstam ($2,588,000), Creamer ($1,532,000) and Kerr ($1,361,000).  Thus we have 3 high outliers among this list of lady golfers.
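
The arithmetic in this example can be double-checked with a short script. It is only a sketch: the variable names are made up, the amounts are in thousands of dollars, and only the earnings quoted above are included since the full list is not reproduced here.

```python
# Quartiles from the 5-number summary, in thousands of dollars.
q1, q3 = 584, 856

iqr = q3 - q1                  # 272
step = 1.5 * iqr               # 408.0
low_cutoff = q1 - step         # 176.0
high_cutoff = q3 + step        # 1264.0

# Only the earnings quoted in the text (thousands of dollars).
earnings = {"Sorenstam": 2588, "Creamer": 1532, "Kerr": 1361}
high_outliers = [name for name, amount in earnings.items() if amount > high_cutoff]

minimum = 452                  # smallest value in the 5-number summary

print(low_cutoff, high_cutoff)  # 176.0 1264.0
print(high_outliers)            # ['Sorenstam', 'Creamer', 'Kerr']
print(minimum < low_cutoff)     # False, so there are no low outliers
```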

 


 

If you find that there are outliers using this method -- you should indicate the outliers by using a MODIFIED BOXPLOT.

 

In a MODIFIED BOXPLOT the outliers are marked with *s, and the whiskers are extended only to the largest/smallest non-outliers in the data.

 

Here is a modified boxplot of the LPGA winnings:

Note that although the actual maximum of this set is $2,588,000 -- that is an outlier -- so the upper whisker does not extend from Q3 all the way up to $2,588,000.  The upper whisker extends only from Q3 ($856,000) to the largest non-outlier (which is $1,202,000).
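
The rule for where the whiskers end can be written as a small function. This is a sketch with a made-up function name; the list holds only the earnings quoted in the text (in thousands of dollars), not the full LPGA data set, and the cutoffs 176 and 1264 are the ones computed in the example above.

```python
def whisker_ends(data, low_cutoff, high_cutoff):
    """Smallest and largest values that are not outliers."""
    non_outliers = [x for x in data if low_cutoff <= x <= high_cutoff]
    return min(non_outliers), max(non_outliers)


# Values quoted in the text, in thousands of dollars: the minimum,
# the largest non-outlier, and the three high outliers.
quoted = [452, 1202, 1361, 1532, 2588]
print(whisker_ends(quoted, 176, 1264))  # (452, 1202)
```

The three values above 1264 are left out of the whisker and would be drawn as separate * marks.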


Warning - Boxplots have limitations.   Here are two examples:

 

1)   Boxplots cannot show clusters:  

For example,  suppose you are taking a class and have four test grades in the course:  {45, 45, 95, 95}. 

The "5-Number Summary" would be {45, 45, 70, 95, 95 } -- which is already misleading since we've just created more numbers than we had in our original list!   The boxplot that displays this would be:

                       

 

and although our data consists entirely of two clusters -- there is no evidence of this on the boxplot.
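
To see why the clusters disappear, it may help to compute the summary directly. The sketch below assumes the quartiles are the medians of the lower and upper halves of the sorted data, which reproduces the summary given above; the function name is made up.

```python
from statistics import median


def five_number_summary(data):
    """(min, Q1, median, Q3, max), with Q1 and Q3 as medians of the two halves."""
    ordered = sorted(data)
    half = len(ordered) // 2
    return (ordered[0], median(ordered[:half]), median(ordered),
            median(ordered[-half:]), ordered[-1])


grades = [45, 45, 95, 95]
summary = five_number_summary(grades)

print(summary)              # (45, 45.0, 70.0, 95.0, 95)
print(sorted(set(grades)))  # [45, 95]: only two distinct grades
print(70 in grades)         # False: the median is not an actual grade
```

The boxplot is drawn from the five summary numbers alone, so nothing in it records that every grade is either 45 or 95.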

 

 

2) Boxplots can be very misleading with small sample sizes:

Suppose that you have five test grades:  {45, 87, 88, 89, 89}.

The "5-Number Summary" would be {45, 66, 88, 89, 89}. 

The boxplot looks like this:

(Note that the box appears to be missing a whisker.  Why?  Is it really missing a whisker?)

The boxplot indicates that the middle 50% of the data is spread over the interval from 66 to 89.  In fact -- all of the data except for one value is in the tiny interval from 87 to 89.  Although WE KNOW that the 45 is an outlier, the IQR outlier test doesn't tell us that this value is an outlier:  the IQR is 89 - 66 = 23, so the lower cutoff is 66 - 1.5*23 = 31.5, and 45 is above it.  The test may fail to detect an outlier if your data set is this small.  With such a small data set -- it is usually not wise to try to summarize the set with the 5-number summary or a boxplot.
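
The failure of the IQR test on these five grades can be confirmed numerically. This sketch assumes the quartiles are the medians of the lower and upper halves of the sorted data, with the middle value left out; under that convention Q1 works out to 66, and the conclusion does not hinge on small differences in Q1.

```python
from statistics import median

grades = [45, 87, 88, 89, 89]
ordered = sorted(grades)
half = len(ordered) // 2           # middle value (88) is left out of both halves

q1 = median(ordered[:half])        # median of 45 and 87
q3 = median(ordered[-half:])       # median of 89 and 89
low_cutoff = q1 - 1.5 * (q3 - q1)

print(q1, q3)                      # 66.0 89.0
print(low_cutoff)                  # 31.5
print(45 < low_cutoff)             # False: the test does not flag 45
```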



...