Stat I, Prof. Vinod, Class notes: Frequency Distribution, Frequency Histogram, Frequency Polygon and Ogives
1) What is a frequency distribution?
It is a summary technique for organizing data into classes. It yields a table, from which one calculates frequencies (f_{j}), relative frequencies f_{j}/n and cumulative frequencies.
2) How to construct a frequency Distribution? Need to construct a table. (See Table 1 below).
First find the smallest value Xmin and the largest value Xmax in the data. Class intervals need a starting-lower-limit-of-the-first-class-interval (StartLo). It should be SMALLER than or equal to Xmin. (In your midterm EXAM I might specify StartLo.)
Rule 1) StartLo ≤ Xmin (the symbol ≤ means less than or equal to)
Rule 2) StartLo should be a number where it is intuitively natural to start an interval (e.g. a round number). This is not a hard and fast rule. It can happen that StartLo =1.2 say.
How many classes should we make? Let k denote the number of classes. Let j denote the class interval number. In Table 1 we have j=0 as the first “dummy” class interval. It is called dummy because it has no real observations in it. It is included for the purpose of finding the midpoint where to join the frequency polygon on the horizontal axis.
Now j=1 is the first honest-to-goodness class interval (i.e., an interval which has some real observations in it). This interval starts at the starting-lower-limit-of-the-first-class-interval (StartLo) defined above, j=2 is the next interval, and so on till j=k as the last honest-to-goodness interval. Finally j=k+1 is the last “dummy” interval.
Rule 3) k = the number of classes chosen by the investigator. This usually ranges between 3 and 20. If the data series is short, with only (say) 20 observations, k=3 or 4 is adequate. The more data there are, the more classes may be needed, so the k chosen by the researcher moves closer to 20.
Rule 4) Let CIW denote “class interval width.”
CIW ≥ (Maximum value minus the Minimum value) / (number of classes)
CIW ≥ (Xmax - Xmin) / k
EXAMPLE: Original Unclassified Data: 50 98 82 23 46 40 63 52 92 54. We have n=10 observations here. Assume that we are asked to make exactly k=3 class intervals, j=1 to j=3.
We must begin by sorting the data from the smallest to the largest as:
Xmin=23 40 46 50 52 54 63 82 92 Xmax=98
Rule 1 says StartLo ≤ Xmin, which means StartLo ≤ 23. From Rule 2 we choose a round number, 20. Now by Rule 4, the class interval width must be at least 25 by the formula:
CIW ≥ (98 - 23)/3, or 75/3, or 25
This says that the Width of class intervals should be at least 25. Let us try a round number larger than 25, say 30. We choose CIW=30. It turns out that if we had chosen CIW=25 the last upper limit would become 95 leaving the data point 98 an orphan and we will have to revise our scheme by increasing the CIW.
Recall that StartLo=20 and width CIW=30. So the lower limit of first honest to goodness class interval is StartLo=20 and upper limit is simply StartLo plus the width 30 leading to 50. The next interval is simply upper limit of previous interval plus CIW=30 and so on.
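The arithmetic above can be sketched in R. This is only an illustration under the choices made in the example (StartLo=20, CIW=30, k=3); the variable names min_ciw, start_lo, ciw and limits are made up here and are not part of the notes.

```r
# Data from the EXAMPLE above
x <- c(50, 98, 82, 23, 46, 40, 63, 52, 92, 54)
k <- 3                              # number of classes requested

min_ciw  <- (max(x) - min(x)) / k   # Rule 4 lower bound: 75/3 = 25
start_lo <- 20                      # round number at or below Xmin = 23
ciw      <- 30                      # chosen width, at least min_ciw

# k + 1 limits: lower limit of class 1, then each successive upper limit
limits <- start_lo + ciw * (0:k)    # 20 50 80 110

# Rule 1, Rule 4, and "Xmax must fall below the last upper limit"
stopifnot(start_lo <= min(x), ciw >= min_ciw, max(x) < max(limits))
```

Trying ciw <- 25 instead makes the last check fail, because the last upper limit becomes 95 and the point 98 is left out.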
An Ambiguity Solved By Convention:
Upper limit of each class interval is the lower limit plus the width. There is ambiguity with respect to the upper limits, but it is resolved by convention. We always let the real upper limit be a notch below what is alleged to be the upper limit. For example, the convention says that the upper limit 50 is really 49.999999999999999999999999, but not quite 50 even if it is shown to be 50.
So the measurement 50 belongs to the next class 50 to 80. The reason for this convention is that it saves the Govt. in printing costs and makes the tables more readable (less cluttered).
Table 1
Sequence no. of the class j | Lower limit | Upper limit | Midpoint | Tally marks | Frequency f_{j} | Relative freq. = f_{j}/n
0 (dummy interval) | -10 (= 20 - CIW) | 20 | 5 | | 0 |
1 | 20 | 50 | 35 | III | 3 | 0.3
2 | 50 | 80 | 65 | IIII | 4 | 0.4
3 = k | 80 | 110 | 95 | III | 3 | 0.3
Dummy interval | 110 | 140 (= 110 + CIW) | 125 | | 0 |
Totals | | | | | n=10 | 1.0

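The tally in Table 1 can be reproduced with a short R sketch. It assumes the class limits 20, 50, 80, 110 chosen above; the names classes, f and mid are illustrative only. The option right = FALSE makes each interval include its lower limit and exclude its upper limit, which is the convention described above.

```r
x <- c(50, 98, 82, 23, 46, 40, 63, 52, 92, 54)
limits <- c(20, 50, 80, 110)

# right = FALSE gives intervals [lower, upper), so 50 falls in 50 to 80
classes <- cut(x, breaks = limits, right = FALSE)
f   <- as.vector(table(classes))                    # frequencies f_j: 3 4 3
mid <- (head(limits, -1) + tail(limits, -1)) / 2    # midpoints: 35 65 95

# Genuine rows of Table 1 (dummy intervals omitted)
data.frame(j = seq_along(f), lower = head(limits, -1),
           upper = tail(limits, -1), mid = mid, f = f, rel = f / length(x))
```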

This classification has been successful in the sense that we are asked to make exactly 3 classes and we have exactly 3 meaningful classes. The scheme is meaningful if it satisfies two tests: (i) There should be no orphan points and (ii) There should be no orphan intervals in the sense defined below.
Orphan Points Problem:
If there are points in the unclassified data, which are not allocated to any interval whatsoever, then we call them orphan points. Then we say that the classification is not meaningful or has failed. A good check is to make sure that Xmin is allocated to the genuine interval with j=1 and Xmax is allocated to the last genuine interval with j=k.
An example of orphan points: If we choose StartLo=0 and CIW=30 then the intervals would be 0 to 30, 30 to 60 and 60 to 90. Now the two numbers 92 and 98 are orphaned, as they belong to no class. This is not acceptable. We have to go back to the drawing board and fix things.
Orphan Intervals Problem:
Whether we have 3 meaningful classes or not is decided by looking at the genuine class intervals j=1 and j=k (not the dummy intervals). Here they are meaningful in the sense that there is at least one observation (tally mark) in each of them. So we have no orphan intervals problem and things are OK. An orphan interval means that there are no observations (tally marks) in the first or the last genuine interval, that is, in the j=1 or the j=k interval. If there are no observations in an intermediate interval (j=2 to j=k-1), that may be the true nature of the data (that is the way the cookie crumbles) and not an artifact of our classification scheme, hence that situation is not defined as an orphan intervals problem. Remember that the two dummy class intervals are always, by definition, orphans and do not pose any problem.
An example of an orphan interval: If we choose StartLo=20 and CIW=40 then the intervals would be 20 to 60, 60 to 100 and 100 to 140. Now the last interval is orphaned, as no observation belongs in it. This is not acceptable. You were asked to make 3 intervals and you have effectively made two intervals, 20 to 60 and 60 to 100, which contain the entire data. We have to go back to the drawing board and fix things.
Trial and Error in Classification is needed if the classification fails the first time around. The solution to the orphans problem is to go back (to the drawing board), choose a different StartLo (the lower limit of the starting interval) and/or a different “class interval width” (CIW), and redo the tally marks and the entire classification.
Since the theory requires that CIW ≥ (Xmax - Xmin)/k, we can choose any CIW which satisfies this inequality. For example, the value 25 found in our example is only the minimum; a larger CIW still satisfies the inequality.
Common solutions to the orphans problem:
If j=1 is an orphan interval StartLo may have been wrong.
If j=k is an orphan interval, CIW is too large and needs to be reduced.
If there are orphan points, we increase the chosen class interval width (CIW)
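The two tests can be automated with a small R sketch. The helper check_classes below is hypothetical (it is not part of the notes or of R itself); it assumes equal-width classes with each upper limit excluded from its interval.

```r
# Hypothetical helper: reports orphan points and orphan end intervals
check_classes <- function(x, start_lo, ciw, k) {
  limits <- start_lo + ciw * (0:k)
  inside <- x >= limits[1] & x < limits[k + 1]
  f <- as.vector(table(cut(x[inside], breaks = limits, right = FALSE)))
  list(orphan_points   = sort(x[!inside]),   # points in no interval
       first_is_orphan = f[1] == 0,          # j = 1 has no observations
       last_is_orphan  = f[k] == 0)          # j = k has no observations
}

x <- c(50, 98, 82, 23, 46, 40, 63, 52, 92, 54)
check_classes(x, start_lo = 20, ciw = 30, k = 3)  # Table 1 scheme: no orphans
check_classes(x, start_lo = 0,  ciw = 30, k = 3)  # 92 and 98 are orphan points
check_classes(x, start_lo = 20, ciw = 40, k = 3)  # last interval is an orphan
```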
Frequency Distribution Graphics by a histogram and frequency polygon
Assume that the trial and error is complete, no interval is an orphan and no point is an orphan. Only now have we classified the data and constructed the frequency distribution. The word distribution suggests that we are distributing the n items into k classes. The table represents the frequency distribution. Now we are ready for the frequency histogram and the frequency polygon, which are graphical representations of the frequency distribution.
A histogram is a graphical image of the Frequency Distribution or Relative Frequency Distribution with measurements on the horizontal axis and frequency on the vertical axis. (It looks a bit like the NYC skyline.) We represent frequency by pillars (bars). The height of a pillar is proportional to the frequency in the particular class interval, and the width of the pillar on the horizontal axis starts at the lower limit of the interval and ends at the upper limit. We draw as many pillars as there are intervals. The dummy intervals have zero height since they have zero frequency. Hence it is not customary to show the dummy intervals for histograms. (See Table 1.) In business and economics applications the pillars (bars) are usually drawn touching one another, with no gaps between them.
http://www.stat.sc.edu/~west/javahtml/Histogram.html
has a nice Java applet which teaches the effect of changing the width on a histogram.
Two equivalent definitions of a midpoint:
Midpoint = (upper limit + lower limit) / 2
Midpoint = lower limit + (width / 2)
Both definitions work.
In the software called R, the following input will draw the frequency histogram and polygon:
x <- c(23, 40, 46, 51, 52, 54, 63, 82, 92, 98)
# Note: I changed 50 to 51 to get a cleaner software illustration.
# By default hist() closes intervals on the right, so 50 would be counted in 20 to 50.
# The breaks include both dummy intervals (-10 to 20 and 110 to 140).
hist(x, breaks = c(-10, 20, 50, 80, 110, 140), main = "Histogram and Polygon 23, 40, 46, 51, 52, 54, 63, 82, 92, 98", xlab = "measurements", axes = FALSE)
tik <- seq(-10, 140, by = 15)  # define location of tick marks
axis(1, tik, tik)  # first tik is for location, second is for labels
# axis(1, ...) is for the x axis and axis(2, ...) is for the y axis
axis(2, 0:4, 0:4)
# now join consecutive midpoints to get the polygon
lines(x = c(5, 35), y = c(0, 3))
lines(x = c(35, 65), y = c(3, 4))
lines(x = c(65, 95), y = c(4, 3))
lines(x = c(95, 125), y = c(3, 0))
For a FREQUENCY POLYGON we need two dummy class intervals at the two ends!
The lower limit of the first dummy interval on the left side = (starting value) MINUS (width).
In Table 1, it is 20 MINUS (CIW=30) = 20 - 30 = -10, or MINUS 10.
The upper limit of the first dummy interval is just the lower limit of the first regular (non-dummy) interval, also called the starting value of the classification process.
The upper limit of the 2nd dummy interval on the right side = (upper limit of last interval) PLUS CIW (width). In Table 1, it is 110+30=140. Of course, the lower limit of this 2nd dummy interval is simply the upper limit of the last genuine (non-dummy) interval.
In order to draw the frequency polygon, find the midpoints of the two dummy intervals.
Join the midpoints of all intervals consecutively at the tops of the pillars to form a polygon.
A FREQUENCY POLYGON is usually drawn right on top of the freq. histogram by joining the midpoints at the tops of consecutive pillars. (Take care to include the dummy intervals before drawing the freq. polygon and determine the midpoints of the dummy intervals.) Note that the pillars at the dummy intervals have zero height, representing the fact that there are no observations there. So the frequency polygon line starts at the midpoint of the left side dummy interval and ends at the midpoint of the right side dummy interval.
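The polygon points can also be computed rather than typed in by hand. The following R sketch assumes the Table 1 scheme (StartLo=20, CIW=30, frequencies 3, 4, 3); the names limits, mid and height are illustrative.

```r
start_lo <- 20
ciw <- 30
f <- c(3, 4, 3)                        # frequencies from Table 1
k <- length(f)

# Limits including one dummy interval on each side: -10 20 50 80 110 140
limits <- start_lo + ciw * (-1:(k + 1))
mid    <- head(limits, -1) + ciw / 2   # midpoints: 5 35 65 95 125
height <- c(0, f, 0)                   # dummy intervals have zero frequency

# Polygon drawn on its own; lines(mid, height) would overlay it on a histogram
plot(mid, height, type = "l", xlab = "measurements", ylab = "frequency",
     main = "Frequency polygon")
```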
Good Graph: Any valid graph should have a Title, Both axes should be properly labeled, there should be Legends for all curves and a source should be indicated.
Stem and Leaf display is a hybrid graphical method similar to a histogram, but the data remain visible. The STEM is simply the first digit of the number and the LEAF is the second digit. For example, if the number is 78, the stem is 7 and the leaf is 8.
Just list each first digit (stem), then a colon, and then the second digits (leaves) of all numbers with that first digit.
For example, if the heart rates are 45, 56, 44, 70, 72, 60, 61, 47, 53, 48, then the stem and leaf display is:
4: 5, 4, 7, 8
5: 6, 3
6: 0, 1
7: 0, 2
http://regentsprep.org/Regents/math/data/stemleaf.htm
has a nice description. Sometimes the stem can be based on first two digits and sometimes the leaf may involve dropping the last digit.
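The heart-rate display above can be rebuilt in R with a few lines. This is a sketch that assumes two-digit data, so the stem is the tens digit and the leaf is the units digit; the names rates and leaves are illustrative. (R also has a built-in stem() function, but its layout differs from the display above.)

```r
rates <- c(45, 56, 44, 70, 72, 60, 61, 47, 53, 48)

# Stem = tens digit, leaf = units digit; leaves stay in data order
leaves <- split(rates %% 10, rates %/% 10)

# One line per stem: the stem, a colon, then its leaves
for (s in names(leaves)) {
  cat(s, ": ", paste(leaves[[s]], collapse = ", "), "\n", sep = "")
}
```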
Baseball example: (Leaf is going in two directions for comparison of two players)
Babe Ruth (leaves) | stem | Barry Bonds (leaves)
0 4 3 2 6 | 0 |
1 | 1 | 6 9
9 5 2 | 2 | 5 4 5
5 4 | 3 | 3 4 7 3 7 4 9
1 6 7 6 9 6 1 | 4 | 6 2 0 9 6
4 9 4 | 5 |
0 | 6 |
 | 7 | 3


What do you conclude? Babe Ruth's seasons are 1914 to 1935; Barry Bonds's are 1986 to 2003.
Relative frequency Distribution.
Relative freq. in class j = (frequency number in class j) / (total no. of observations).
These must add up to 1, as they do. Relative freq. is interpreted as the probability of being in that class interval. Lord Keynes gave the probability interpretation in the 1920s in a book called A Treatise on Probability.
Sequence no. of class j | Lower limit | Upper limit | Midpoint | Frequency | Relative frequency
0 (dummy) | -10 | 20 | (-10 + 20)/2 = 5 | 0 | 0
1 | 20 | 50 | 35 | 3 | 0.3
2 | 50 | 80 | 65 | 4 | 0.4
3 | 80 | 110 | 95 | 3 | 0.3
4 (dummy) | 110 | 140 (110+CIW) | 125 | 0 | 0
Total | | | | 10 | 1.0


Cumulative frequency (top to bottom) = sum of the frequencies in the current class and all previous classes. It goes with the upper limits of the class intervals. The interpretation of the cumulative freq. for j=1 (top to bottom case) is simply that there are 3 measurements in the data set that are less than the upper limit 50 of the j=1 interval. A graph of this, having the upper limits of the intervals (measurements) on the horizontal axis and cumulative frequency on the vertical axis, is called the “Less Than Ogive.”
Cumulative frequency (bottom to top) = sum of the frequencies in the current class and all subsequent classes below it. This number goes with the lower limits of the class intervals. What does it mean? The interpretation of the cumulative freq. for j=2 (bottom to top case) is simply that there are 7 measurements in the data set that are greater than or equal to the lower limit 50 of the j=2 interval. A graph of this, having the lower limits of the intervals (measurements) on the horizontal axis and bottom to top cumulative frequencies on the vertical axis, is called the “Greater Than Ogive.” Be sure to label the axes: measurements on the horizontal axis and cumulative frequency on the vertical axis.
See the Table below. The scale on the vertical axis for the cumulative frequencies goes from zero to n (=10). Hence it is obvious that the graph of Ogives should be drawn separately from the graph of histogram or polygon. However the two Ogives can and should be drawn on the same graph because the Median of the Classified Data (grouped data) can be determined graphically as the measurement where the “Less Than Ogive” intersects the “Greater Than Ogive.”
Sequence no. of class j | Lower limit (horizontal axis for “Greater than ogive”) | Upper limit (horizontal axis for “Less than ogive”) | Frequency | Cumulative freq. (top to bottom), Less than Ogive | Cumulative freq. (bottom to top), Greater than Ogive
0 (dummy) | -10 | 20 | 0 | 0 | 10
1 | 20 | 50 | 3 | 3 | 10
2 | 50 | 80 | 4 | 7 | 7
3 | 80 | 110 | 3 | 10 | 3
4 (dummy) | 110 | 140 (110+CIW) | 0 | 10 | 0
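Both cumulative columns and the two ogives can be sketched in R as follows. The limits and frequencies are those of the table above; the names less_than, greater_than and median_grouped are illustrative. Reading the median off the graph is done here by straight-line interpolation along the less-than ogive, which amounts to assuming that the observations are spread evenly within each class.

```r
lower <- c(-10, 20, 50, 80, 110)       # lower limits, including dummies
upper <- c(20, 50, 80, 110, 140)       # upper limits, including dummies
f     <- c(0, 3, 4, 3, 0)
n     <- sum(f)

less_than    <- cumsum(f)              # top to bottom:  0  3  7 10 10
greater_than <- rev(cumsum(rev(f)))    # bottom to top: 10 10  7  3  0

# Both ogives on one graph, with title, axis labels and legend
plot(upper, less_than, type = "b", xlim = c(-10, 140), ylim = c(0, n),
     xlab = "measurements", ylab = "cumulative frequency",
     main = "Less than and greater than ogives")
lines(lower, greater_than, type = "b", lty = 2)
legend("right", legend = c("Less than", "Greater than"), lty = c(1, 2))

# The ogives cross where the less-than ogive reaches n/2
median_grouped <- approx(less_than[1:4], upper[1:4], xout = n / 2)$y   # 65
```

Only the first four points are passed to approx() because the less-than ogive is flat (10, 10) over the right-hand dummy interval.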

A Warning:
See "How to Lie With Statistics" by Darrell Huff. He wrote that "the secret language of statistics, so appealing in a fact-minded culture, is employed to sensationalize, inflate, confuse and oversimplify."
He preached to readers that they were part of the chain of accountability, and needed to look for bias in the origin of the statistics or their treatment, and to ask the kinds of questions that poked holes in shoddy or dishonest work.