ISBN 9781292026152
Introduction to Data Mining
Tan Steinbach Kumar
First Edition
Pearson Education Limited
Edinburgh Gate
Harlow
Essex CM20 2JE
England and Associated Companies throughout the world
Visit us on the World Wide Web at: www.pearsoned.co.uk
© Pearson Education Limited 2014
All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted
in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, without either the
prior written permission of the publisher or a licence permitting restricted copying in the United Kingdom
issued by the Copyright Licensing Agency Ltd, Saffron House, 6–10 Kirby Street, London EC1N 8TS.
All trademarks used herein are the property of their respective owners. The use of any trademark
in this text does not vest in the author or publisher any trademark ownership rights in such
trademarks, nor does the use of such trademarks imply any affiliation with or endorsement of this
book by such owners.
British Library Cataloguing-in-Publication Data
A catalogue record for this book is available from the British Library
Printed in the United States of America
ISBN 10: 1292026154
ISBN 13: 9781292026152
Table of Contents
Pearson Custom Library
All chapters and appendices are by Pang-Ning Tan, Michael Steinbach, and Vipin Kumar.
Chapter 1. Introduction
Chapter 2. Data
Chapter 3. Exploring Data
Chapter 4. Classification: Basic Concepts, Decision Trees, and Model Evaluation
Chapter 5. Classification: Alternative Techniques
Chapter 6. Association Analysis: Basic Concepts and Algorithms
Chapter 7. Association Analysis: Advanced Concepts
Chapter 8. Cluster Analysis: Basic Concepts and Algorithms
Chapter 9. Cluster Analysis: Additional Issues and Algorithms
Chapter 10. Anomaly Detection
Appendix B: Dimensionality Reduction
Appendix D: Regression
Appendix E: Optimization
Copyright Permissions
Index
1 Introduction
Rapid advances in data collection and storage technology have enabled
organizations to accumulate vast amounts of data. However, extracting useful
information has proven extremely challenging. Often, traditional data analysis
tools and techniques cannot be used because of the massive size of a data
set. Sometimes, the nontraditional nature of the data means that traditional
approaches cannot be applied even if the data set is relatively small. In other
situations, the questions that need to be answered cannot be addressed using
existing data analysis techniques, and thus, new methods need to be developed.
Data mining is a technology that blends traditional data analysis methods
with sophisticated algorithms for processing large volumes of data. It has also
opened up exciting opportunities for exploring and analyzing new types of
data and for analyzing old types of data in new ways. In this introductory
chapter, we present an overview of data mining and outline the key topics
to be covered in this book. We start with a description of some well-known
applications that require new techniques for data analysis.
Business. Point-of-sale data collection technologies (bar code scanners, radio
frequency identification (RFID), and smart cards) have allowed retailers to
collect up-to-the-minute data about customer purchases at the checkout
counters of their stores. Retailers can utilize this information, along with other
business-critical data such as Web logs from e-commerce Web sites and
customer service records from call centers, to help them better understand the
needs of their customers and make more informed business decisions.
Data mining techniques can be used to support a wide range of business
intelligence applications such as customer profiling, targeted marketing,
workflow management, store layout, and fraud detection. They can also help
retailers answer important business questions such as “Who are the most
profitable customers?” “What products can be cross-sold or up-sold?” and “What
is the revenue outlook of the company for next year?” Some of these questions
motivated the creation of association analysis (Chapters 6 and 7), a new data
analysis technique.
From Chapter 1 of Introduction to Data Mining, First Edition. Pang-Ning Tan, Michael Steinbach, Vipin Kumar. Copyright © 2006 by Pearson Education, Inc. All rights reserved.
Medicine, Science, and Engineering. Researchers in medicine, science,
and engineering are rapidly accumulating data that is key to important new
discoveries. For example, as an important step toward improving our
understanding of the Earth’s climate system, NASA has deployed a series of
Earth-orbiting satellites that continuously generate global observations of the
land surface, oceans, and atmosphere. However, because of the size and
spatio-temporal nature of the data, traditional methods are often not suitable
for analyzing these data sets. Techniques developed in data mining can aid Earth
scientists in answering questions such as “What is the relationship of
the frequency and intensity of ecosystem disturbances such as droughts and
hurricanes to global warming?” “How are land surface precipitation and
temperature affected by ocean surface temperature?” and “How well can we predict
the beginning and end of the growing season for a region?”
As another example, researchers in molecular biology hope to use the large
amounts of genomic data currently being gathered to better understand the
structure and function of genes. In the past, traditional methods in molecular
biology allowed scientists to study only a few genes at a time in a given
experiment. Recent breakthroughs in microarray technology have enabled
scientists to compare the behavior of thousands of genes under various situations.
Such comparisons can help determine the function of each gene and perhaps
isolate the genes responsible for certain diseases. However, the noisy and
high-dimensional nature of the data requires new types of data analysis. In addition
to analyzing gene array data, data mining can also be used to address other
important biological challenges such as protein structure prediction, multiple
sequence alignment, the modeling of biochemical pathways, and phylogenetics.
1.1 What Is Data Mining?
Data mining is the process of automatically discovering useful information in
large data repositories. Data mining techniques are deployed to scour large
databases in order to find novel and useful patterns that might otherwise
remain unknown. They also provide capabilities to predict the outcome of a
future observation, such as predicting whether a newly arrived customer will
spend more than $100 at a department store.
Not all information discovery tasks are considered to be data mining. For
example, looking up individual records using a database management system
or finding particular Web pages via a query to an Internet search engine are
tasks related to the area of information retrieval. Although such tasks are
important and may involve the use of sophisticated algorithms and data
structures, they rely on traditional computer science techniques and obvious
features of the data to create index structures for efficiently organizing and
retrieving information. Nonetheless, data mining techniques have been used
to enhance information retrieval systems.
Data Mining and Knowledge Discovery
Data mining is an integral part of knowledge discovery in databases
(KDD), which is the overall process of converting raw data into useful
information, as shown in Figure 1.1. This process consists of a series of
transformation steps, from data preprocessing to postprocessing of data mining
results.
[Figure: Input Data → Data Preprocessing (feature selection, dimensionality reduction, normalization, data subsetting) → Data Mining → Postprocessing (filtering patterns, visualization, pattern interpretation) → Information]
Figure 1.1. The process of knowledge discovery in databases (KDD).
The input data can be stored in a variety of formats (flat files,
spreadsheets, or relational tables) and may reside in a centralized data repository
or be distributed across multiple sites. The purpose of preprocessing is
to transform the raw input data into an appropriate format for subsequent
analysis. The steps involved in data preprocessing include fusing data from
multiple sources, cleaning data to remove noise and duplicate observations,
and selecting records and features that are relevant to the data mining task
at hand. Because of the many ways data can be collected and stored, data
preprocessing is perhaps the most laborious and time-consuming step in the
overall knowledge discovery process.
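As a rough illustration of the preprocessing steps just described, the following Python sketch removes duplicate observations, keeps only the features relevant to a task, and rescales one numeric feature. The record layout, the field names (`id`, `age`, `income`, `notes`), and the choice of min-max normalization are assumptions made for this example; the text does not prescribe any of them.

```python
def remove_duplicates(records):
    """Drop exact duplicate records while preserving the original order."""
    seen = set()
    unique = []
    for rec in records:
        key = tuple(sorted(rec.items()))
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique


def select_features(records, features):
    """Keep only the named features of each record."""
    return [{f: rec[f] for f in features} for rec in records]


def min_max_normalize(records, feature):
    """Rescale one numeric feature to the interval [0, 1]."""
    values = [rec[feature] for rec in records]
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1  # avoid division by zero for a constant feature
    return [{**rec, feature: (rec[feature] - lo) / span} for rec in records]


# Hypothetical raw input containing one duplicate observation.
raw = [
    {"id": 1, "age": 34, "income": 125, "notes": "a"},
    {"id": 2, "age": 51, "income": 100, "notes": "b"},
    {"id": 2, "age": 51, "income": 100, "notes": "b"},
    {"id": 3, "age": 27, "income": 70, "notes": "c"},
]

clean = remove_duplicates(raw)
clean = select_features(clean, ["id", "age", "income"])
clean = min_max_normalize(clean, "income")
```

Real preprocessing pipelines usually also have to fuse several sources and handle noisy or missing values, which this sketch leaves out.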
“Closing the loop” is the phrase often used to refer to the process of
integrating data mining results into decision support systems. For example,
in business applications, the insights offered by data mining results can be
integrated with campaign management tools so that effective marketing
promotions can be conducted and tested. Such integration requires a
postprocessing step that ensures that only valid and useful results are incorporated
into the decision support system. An example of postprocessing is
visualization (see Chapter 3), which allows analysts to explore the data and the data
mining results from a variety of viewpoints. Statistical measures or
hypothesis testing methods can also be applied during postprocessing to eliminate
spurious data mining results.
1.2 Motivating Challenges
As mentioned earlier, traditional data analysis techniques have often
encountered practical difficulties in meeting the challenges posed by new data sets.
The following are some of the specific challenges that motivated the
development of data mining.
Scalability. Because of advances in data generation and collection, data sets
with sizes of gigabytes, terabytes, or even petabytes are becoming common.
If data mining algorithms are to handle these massive data sets, then they
must be scalable. Many data mining algorithms employ special search
strategies to handle exponential search problems. Scalability may also require the
implementation of novel data structures to access individual records in an
efficient manner. For instance, out-of-core algorithms may be necessary when
processing data sets that cannot fit into main memory. Scalability can also be
improved by using sampling or developing parallel and distributed algorithms.
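To make the sampling idea concrete, here is a minimal sketch of reservoir sampling, one well-known way to draw a uniform random sample of fixed size in a single pass over data that is too large to hold in memory. The text does not name a specific sampling method, so this choice, the function name `reservoir_sample`, and the `seed` parameter are assumptions for illustration.

```python
import random


def reservoir_sample(stream, k, seed=None):
    """Return k items drawn uniformly at random from an iterable of unknown length."""
    rng = random.Random(seed)
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Replace a stored item with probability k / (i + 1).
            j = rng.randint(0, i)  # both endpoints are inclusive
            if j < k:
                reservoir[j] = item
    return reservoir


# The stream is consumed one item at a time, so it never has to fit in memory.
sample = reservoir_sample(range(100_000), k=5, seed=0)
```

Only the k sampled items are stored at any time, which is what makes the approach attractive when the full data set cannot be loaded.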
High Dimensionality. It is now common to encounter data sets with
hundreds or thousands of attributes instead of the handful common a few decades
ago. In bioinformatics, progress in microarray technology has produced gene
expression data involving thousands of features. Data sets with temporal
or spatial components also tend to have high dimensionality. For example,
consider a data set that contains measurements of temperature at various
locations. If the temperature measurements are taken repeatedly for an
extended period, the number of dimensions (features) increases in proportion to
the number of measurements taken. Traditional data analysis techniques that
were developed for low-dimensional data often do not work well for such
high-dimensional data. Also, for some data analysis algorithms, the computational
complexity increases rapidly as the dimensionality (the number of features)
increases.
Heterogeneous and Complex Data. Traditional data analysis methods
often deal with data sets containing attributes of the same type, either
continuous or categorical. As the role of data mining in business, science, medicine,
and other fields has grown, so has the need for techniques that can handle
heterogeneous attributes. Recent years have also seen the emergence of more
complex data objects. Examples of such nontraditional types of data include
collections of Web pages containing semi-structured text and hyperlinks; DNA
data with sequential and three-dimensional structure; and climate data that
consists of time series measurements (temperature, pressure, etc.) at various
locations on the Earth’s surface. Techniques developed for mining such
complex objects should take into consideration relationships in the data, such as
temporal and spatial autocorrelation, graph connectivity, and parent-child
relationships between the elements in semi-structured text and XML documents.
Data Ownership and Distribution. Sometimes, the data needed for an
analysis is not stored in one location or owned by one organization. Instead,
the data is geographically distributed among resources belonging to multiple
entities. This requires the development of distributed data mining techniques.
The key challenges faced by distributed data mining algorithms
include (1) how to reduce the amount of communication needed to perform the
distributed computation, (2) how to effectively consolidate the data mining
results obtained from multiple sources, and (3) how to address data security
issues.
Nontraditional Analysis. The traditional statistical approach is based on
a hypothesize-and-test paradigm. In other words, a hypothesis is proposed,
an experiment is designed to gather the data, and then the data is analyzed
with respect to the hypothesis. Unfortunately, this process is extremely
labor-intensive. Current data analysis tasks often require the generation and
evaluation of thousands of hypotheses, and consequently, the development of some
data mining techniques has been motivated by the desire to automate the
process of hypothesis generation and evaluation. Furthermore, the data sets
analyzed in data mining are typically not the result of a carefully designed
experiment and often represent opportunistic samples of the data, rather than
random samples. Also, the data sets frequently involve nontraditional types
of data and data distributions.
1.3 The Origins of Data Mining
Brought together by the goal of meeting the challenges of the previous
section, researchers from different disciplines began to focus on developing more
efficient and scalable tools that could handle diverse types of data. This work,
which culminated in the field of data mining, built upon the methodology and
algorithms that researchers had previously used. In particular, data mining
draws upon ideas such as (1) sampling, estimation, and hypothesis testing
from statistics and (2) search algorithms, modeling techniques, and learning
theories from artificial intelligence, pattern recognition, and machine learning.
Data mining has also been quick to adopt ideas from other areas, including
optimization, evolutionary computing, information theory, signal processing,
visualization, and information retrieval.
A number of other areas also play key supporting roles. In particular,
database systems are needed to provide support for efficient storage,
indexing, and query processing. Techniques from high performance (parallel)
computing are often important in addressing the massive size of some data sets.
Distributed techniques can also help address the issue of size and are essential
when the data cannot be gathered in one location.
Figure 1.2 shows the relationship of data mining to other areas.
[Figure: Data Mining at the meeting point of Statistics and of AI, Machine Learning, and Pattern Recognition, resting on Database Technology, Parallel Computing, and Distributed Computing]
Figure 1.2. Data mining as a confluence of many disciplines.
1.4 Data Mining Tasks
Data mining tasks are generally divided into two major categories:
Predictive tasks. The objective of these tasks is to predict the value of a
particular attribute based on the values of other attributes. The attribute
to be predicted is commonly known as the target or dependent
variable, while the attributes used for making the prediction are known as
the explanatory or independent variables.
Descriptive tasks. Here, the objective is to derive patterns (correlations,
trends, clusters, trajectories, and anomalies) that summarize the
underlying relationships in data. Descriptive data mining tasks are often
exploratory in nature and frequently require postprocessing techniques
to validate and explain the results.
Figure 1.3 illustrates four of the core data mining tasks that are described
in the remainder of this book.
[Figure: a sample data set, shown below, surrounded by illustrations of the four tasks: Predictive Modeling, Cluster Analysis, Association Analysis, and Anomaly Detection]
ID   Home Owner   Marital Status   Annual Income   Defaulted Borrower
1    Yes          Single           125K            No
2    No           Married          100K            No
3    No           Single           70K             No
4    Yes          Married          120K            No
5    No           Divorced         95K             Yes
6    No           Married          80K             No
7    Yes          Divorced         220K            No
8    No           Single           85K             Yes
9    No           Married          75K             No
10   No           Single           90K             Yes
Figure 1.3. Four of the core data mining tasks.
Predictive modeling refers to the task of building a model for the target
variable as a function of the explanatory variables. There are two types of
predictive modeling tasks: classification, which is used for discrete target
variables, and regression, which is used for continuous target variables. For
example, predicting whether a Web user will make a purchase at an online
bookstore is a classification task because the target variable is binary-valued.
On the other hand, forecasting the future price of a stock is a regression task
because price is a continuous-valued attribute. The goal of both tasks is to
learn a model that minimizes the error between the predicted and true values
of the target variable. Predictive modeling can be used to identify customers
who will respond to a marketing campaign, predict disturbances in the Earth’s
ecosystem, or judge whether a patient has a particular disease based on the
results of medical tests.
Example 1.1 (Predicting the Type of a Flower). Consider the task of
predicting a species of flower based on the characteristics of the flower. In
particular, consider classifying an Iris flower as to whether it belongs to one
of the following three Iris species: Setosa, Versicolour, or Virginica. To
perform this task, we need a data set containing the characteristics of various
flowers of these three species. A data set with this type of information is
the well-known Iris data set from the UCI Machine Learning Repository at
http://www.ics.uci.edu/~mlearn. In addition to the species of a flower,
this data set contains four other attributes: sepal width, sepal length, petal
length, and petal width. (The Iris data set and its attributes are described
further in Section 3.1.) Figure 1.4 shows a plot of petal width versus petal
length for the 150 flowers in the Iris data set. Petal width is broken into the
categories low, medium, and high, which correspond to the intervals [0, 0.75),
[0.75, 1.75), [1.75, ∞), respectively. Also, petal length is broken into categories
low, medium, and high, which correspond to the intervals [0, 2.5), [2.5, 5), [5,
∞), respectively. Based on these categories of petal width and length, the
following rules can be derived:
Petal width low and petal length low implies Setosa.
Petal width medium and petal length medium implies Versicolour.
Petal width high and petal length high implies Virginica.
While these rules do not classify all the flowers, they do a good (but not
perfect) job of classifying most of the flowers. Note that flowers from the
Setosa species are well separated from the Versicolour and Virginica species
with respect to petal width and length, but the latter two species overlap
somewhat with respect to these attributes.
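The three rules of Example 1.1 can be written down directly. The sketch below uses the intervals and rules stated in the example; the function names and the sample measurements in the usage lines are illustrative assumptions and are not taken from the Iris data set.

```python
import math

# Category boundaries from Example 1.1, as half-open intervals [low, high).
WIDTH_BINS = [(0.0, 0.75, "low"), (0.75, 1.75, "medium"), (1.75, math.inf, "high")]
LENGTH_BINS = [(0.0, 2.5, "low"), (2.5, 5.0, "medium"), (5.0, math.inf, "high")]

# (petal width category, petal length category) -> species
RULES = {
    ("low", "low"): "Setosa",
    ("medium", "medium"): "Versicolour",
    ("high", "high"): "Virginica",
}


def categorize(value, bins):
    """Map a measurement in centimeters to low, medium, or high."""
    for lo, hi, label in bins:
        if lo <= value < hi:
            return label
    raise ValueError(f"measurement out of range: {value}")


def classify_iris(petal_width, petal_length):
    """Apply the three rules; return None when no rule covers the flower."""
    key = (categorize(petal_width, WIDTH_BINS), categorize(petal_length, LENGTH_BINS))
    return RULES.get(key)


# Made-up measurements (petal width, petal length) for illustration only.
print(classify_iris(0.2, 1.4))
print(classify_iris(1.3, 4.5))
print(classify_iris(2.1, 5.8))
print(classify_iris(1.8, 4.8))  # high width, medium length: no rule applies
```

Returning `None` for uncovered combinations mirrors the remark that the rules do not classify all the flowers.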
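[Figure: scatter plot of petal width (cm, axis from 0 to 2.5) against petal length (cm, axis from 0 to 7) for the Setosa, Versicolour, and Virginica species, with the category boundaries marked at petal widths of 0.75 and 1.75 and petal lengths of 2.5 and 5]
Figure 1.4. Petal width versus petal length for 150 Iris flowers.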
Association analysis is used to discover patterns that describe strongly
associated features in the data. The discovered patterns are typically represented
in the form of implication rules or feature subsets. Because of the exponential
size of its search space, the goal of association analysis is to extract the most
interesting patterns in an efficient manner. Useful applications of association
analysis include finding groups of genes that have related functionality,
identifying Web pages that are accessed together, or understanding the relationships
between different elements of Earth’s climate system.
Example 1.2 (Market Basket Analysis). The transactions shown in
Table 1.1 illustrate point-of-sale data collected at the checkout counters of a
grocery store. Association analysis can be applied to find items that are
frequently bought together by customers. For example, we may discover the
rule {Diapers} → {Milk}, which suggests that customers who buy diapers
also tend to buy milk. This type of rule can be used to identify potential
cross-selling opportunities among related items.
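One way to check how strongly the data backs the rule {Diapers} → {Milk} is to compute its support and confidence. These two measures are standard in association analysis but are not defined in this chapter, so their use here is an assumption of the sketch; the transactions are copied from Table 1.1.

```python
# The ten transactions of Table 1.1.
transactions = [
    {"Bread", "Butter", "Diapers", "Milk"},
    {"Coffee", "Sugar", "Cookies", "Salmon"},
    {"Bread", "Butter", "Coffee", "Diapers", "Milk", "Eggs"},
    {"Bread", "Butter", "Salmon", "Chicken"},
    {"Eggs", "Bread", "Butter"},
    {"Salmon", "Diapers", "Milk"},
    {"Bread", "Tea", "Sugar", "Eggs"},
    {"Coffee", "Sugar", "Chicken", "Eggs"},
    {"Bread", "Diapers", "Milk", "Salt"},
    {"Tea", "Eggs", "Cookies", "Diapers", "Milk"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in the itemset."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Among transactions containing the antecedent, the fraction that also contain the consequent."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)


rule_support = support({"Diapers", "Milk"}, transactions)
rule_confidence = confidence({"Diapers"}, {"Milk"}, transactions)
print(rule_support, rule_confidence)  # 0.5 1.0
```

In this small table, every transaction that contains diapers also contains milk, which is why the confidence comes out as 1.0.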
Table 1.1. Market basket data.
Transaction ID Items
1 {Bread, Butter, Diapers, Milk}
2 {Coffee, Sugar, Cookies, Salmon}
3 {Bread, Butter, Coffee, Diapers, Milk, Eggs}
4 {Bread, Butter, Salmon, Chicken}
5 {Eggs, Bread, Butter}
6 {Salmon, Diapers, Milk}
7 {Bread, Tea, Sugar, Eggs}
8 {Coffee, Sugar, Chicken, Eggs}
9 {Bread, Diapers, Milk, Salt}
10 {Tea, Eggs, Cookies, Diapers, Milk}
Cluster analysis seeks to find groups of closely related observations so that
observations that belong to the same cluster are more similar to each other
than observations that belong to other clusters. Clustering has been used to
group sets of related customers, find areas of the ocean that have a significant
impact on the Earth’s climate, and compress data.
Example 1.3 (Document Clustering). The collection of news articles
shown in Table 1.2 can be grouped based on their respective topics. Each
article is represented as a set of word-frequency pairs (w, c), where w is a word
and c is the number of times the word appears in the article. There are two
natural clusters in the data set. The first cluster consists of the first four
articles, which correspond to news about the economy, while the second cluster
contains the last four articles, which correspond to news about health care. A
good clustering algorithm should be able to identify these two clusters based
on the similarity between words that appear in the articles.
Table 1.2. Collection of news articles.
Article Words
1 dollar: 1, industry: 4, country: 2, loan: 3, deal: 2, government: 2
2 machinery: 2, labor: 3, market: 4, industry: 2, work: 3, country: 1
3 job: 5, inflation: 3, rise: 2, jobless: 2, market: 3, country: 2, index: 3
4 domestic: 3, forecast: 2, gain: 1, market: 2, sale: 3, price: 2
5 patient: 4, symptom: 2, drug: 3, health: 2, clinic: 2, doctor: 2
6 pharmaceutical: 2, company: 3, drug: 2, vaccine: 1, flu: 3
7 death: 2, cancer: 4, drug: 3, public: 4, health: 3, director: 2
8 medical: 2, cost: 3, increase: 2, patient: 2, health: 3, care: 1
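To see how word overlap alone can separate the two topics in Table 1.2, the sketch below computes the cosine similarity between word-frequency vectors and then groups articles that are linked, directly or through a chain of other articles, by a similarity greater than zero. Both the cosine measure and this simple linking procedure are assumptions chosen for illustration; they are not the clustering algorithms presented later in the book.

```python
import math

# Word-frequency pairs copied from Table 1.2.
articles = {
    1: {"dollar": 1, "industry": 4, "country": 2, "loan": 3, "deal": 2, "government": 2},
    2: {"machinery": 2, "labor": 3, "market": 4, "industry": 2, "work": 3, "country": 1},
    3: {"job": 5, "inflation": 3, "rise": 2, "jobless": 2, "market": 3, "country": 2, "index": 3},
    4: {"domestic": 3, "forecast": 2, "gain": 1, "market": 2, "sale": 3, "price": 2},
    5: {"patient": 4, "symptom": 2, "drug": 3, "health": 2, "clinic": 2, "doctor": 2},
    6: {"pharmaceutical": 2, "company": 3, "drug": 2, "vaccine": 1, "flu": 3},
    7: {"death": 2, "cancer": 4, "drug": 3, "public": 4, "health": 3, "director": 2},
    8: {"medical": 2, "cost": 3, "increase": 2, "patient": 2, "health": 3, "care": 1},
}


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse word-count vectors."""
    dot = sum(count * b.get(word, 0) for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b)


def group_by_similarity(docs, threshold=0.0):
    """Group documents connected by chains of pairwise similarity above the threshold."""
    unassigned = set(docs)
    groups = []
    while unassigned:
        start = min(unassigned)
        unassigned.remove(start)
        group, frontier = {start}, [start]
        while frontier:
            current = frontier.pop()
            linked = {d for d in unassigned
                      if cosine_similarity(docs[current], docs[d]) > threshold}
            unassigned -= linked
            group |= linked
            frontier.extend(linked)
        groups.append(sorted(group))
    return groups


clusters = group_by_similarity(articles)
print(clusters)  # [[1, 2, 3, 4], [5, 6, 7, 8]]
```

The grouping works here only because the economy and health care articles share no words at all; on realistic text a threshold of zero would merge almost everything, and a proper clustering algorithm would be needed.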
Anomaly detection is the task of identifying observations whose
characteristics are significantly different from the rest of the data. Such observations
are known as anomalies or outliers. The goal of an anomaly detection
algorithm is to discover the real anomalies and avoid falsely labeling normal
objects as anomalous. In other words, a good anomaly detector must have
a high detection rate and a low false alarm rate. Applications of anomaly
detection include the detection of fraud, network intrusions, unusual patterns
of disease, and ecosystem disturbances.
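The two rates can be computed from labeled outcomes. The sketch below assumes the usual definitions, which the text does not spell out: the detection rate is the fraction of true anomalies that are flagged, and the false alarm rate is the fraction of normal objects that are flagged. The labels are made up for illustration.

```python
def detection_and_false_alarm_rates(actual, flagged):
    """Return (detection rate, false alarm rate) for paired boolean labels.

    actual[i] is True when object i really is an anomaly;
    flagged[i] is True when the detector labeled it as one.
    Assumes at least one anomaly and one normal object are present.
    """
    true_pos = sum(1 for a, f in zip(actual, flagged) if a and f)
    false_pos = sum(1 for a, f in zip(actual, flagged) if not a and f)
    n_anomalies = sum(1 for a in actual if a)
    n_normal = len(actual) - n_anomalies
    return true_pos / n_anomalies, false_pos / n_normal


# Hypothetical outcome for ten objects, two of which are real anomalies.
actual = [False, False, True, False, False, False, True, False, False, False]
flagged = [False, True, True, False, False, False, False, False, False, False]

detection_rate, false_alarm_rate = detection_and_false_alarm_rates(actual, flagged)
print(detection_rate, false_alarm_rate)  # 0.5 0.125
```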
Example 1.4 (Credit Card Fraud Detection). A credit card company
records the transactions made by every credit card holder, along with personal
information such as credit limit, age, annual income, and address. Since the
number of fraudulent cases is relatively small compared to the number of
legitimate transactions, anomaly detection techniques can be applied to build
a profile of legitimate transactions for the users. When a new transaction
arrives, it is compared against the profile of the user. If the characteristics of
the transaction are very different from the previously created profile, then the
transaction is flagged as potentially fraudulent.
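A very reduced version of this profile comparison is sketched below. It assumes that a user's profile consists only of the mean and standard deviation of past transaction amounts and that a new amount is suspicious when it lies more than three standard deviations from the mean. The single attribute, the threshold, and the sample amounts are all assumptions for illustration; a real profile would draw on many more characteristics.

```python
import statistics


def build_profile(amounts):
    """Summarize a user's legitimate transaction amounts as (mean, standard deviation)."""
    return statistics.mean(amounts), statistics.stdev(amounts)


def is_suspicious(amount, profile, threshold=3.0):
    """Flag an amount that deviates from the profile by more than `threshold` standard deviations."""
    mean, std = profile
    if std == 0:
        return amount != mean
    return abs(amount - mean) / std > threshold


# Hypothetical history of legitimate transaction amounts for one card holder.
history = [42.0, 55.5, 38.2, 61.0, 47.3, 52.8, 44.1, 58.6]
profile = build_profile(history)

print(is_suspicious(50.0, profile))   # False: close to the usual amounts
print(is_suspicious(900.0, profile))  # True: far outside the profile
```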
1.5 Scope and Organization of the Book
This book introduces the major principles and techniques used in data mining
from an algorithmic perspective. A study of these principles and techniques is
essential for developing a better understanding of how data mining technology
can be applied to various kinds of data. This book also serves as a starting
point for readers who are interested in doing research in this field.
We begin the technical discussion of this book with a chapter on data
(Chapter 2), which discusses the basic types of data, data quality,
preprocessing techniques, and measures of similarity and dissimilarity. Although
this material can be covered quickly, it provides an essential foundation for
data analysis. Chapter 3, on data exploration, discusses summary statistics,
visualization techniques, and On-Line Analytical Processing (OLAP). These
techniques provide the means for quickly gaining insight into a data set.
Chapters 4 and 5 cover classification. Chapter 4 provides a foundation
by discussing decision tree classifiers and several issues that are important
to all classification: overfitting, performance evaluation, and the comparison
of different classification models. Using this foundation, Chapter 5 describes
a number of other important classification techniques: rule-based systems,
nearest-neighbor classifiers, Bayesian classifiers, artificial neural networks,
support vector machines, and ensemble classifiers, which are collections of
classifiers. The multiclass and imbalanced class problems are also discussed. These
topics can be covered independently.
Association analysis is explored in Chapters 6 and 7. Chapter 6 describes
the basics of association analysis: frequent itemsets, association rules, and
some of the algorithms used to generate them. Specific types of frequent
itemsets (maximal, closed, and hyperclique) that are important for data
mining are also discussed, and the chapter concludes with a discussion of
evaluation measures for association analysis. Chapter 7 considers a variety of more
advanced topics, including how association analysis can be applied to
categorical and continuous data or to data that has a concept hierarchy. (A concept
hierarchy is a hierarchical categorization of objects, e.g., store items, clothing,
shoes, sneakers.) This chapter also describes how association analysis can be
extended to find sequential patterns (patterns involving order), patterns in
graphs, and negative relationships (if one item is present, then the other is
not).
Cluster analysis is discussed in Chapters 8 and 9. Chapter 8 first describes
the different types of clusters and then presents three specific clustering
techniques: K-means, agglomerative hierarchical clustering, and DBSCAN. This
is followed by a discussion of techniques for validating the results of a
clustering algorithm. Additional clustering concepts and techniques are explored in
Chapter 9, including fuzzy and probabilistic clustering, Self-Organizing Maps
(SOM), graph-based clustering, and density-based clustering. There is also a
discussion of scalability issues and factors to consider when selecting a
clustering algorithm.
The last chapter, Chapter 10, is on anomaly detection. After some basic
definitions, several different types of anomaly detection are considered:
statistical, distance-based, density-based, and clustering-based. Appendices A
through E give a brief review of important topics that are used in portions of
the book: linear algebra, dimensionality reduction, statistics, regression, and
optimization.
The subject of data mining, while relatively young compared to statistics
or machine learning, is already too large to cover in a single book. Selected
references to topics that are only briefly covered, such as data quality, are
provided in the bibliographic notes of the appropriate chapter. References to
topics not covered in this book, such as data mining for streams and
privacy-preserving data mining, are provided in the bibliographic notes of this chapter.
1.6 Bibliographic Notes
The topic of data mining has inspired many textbooks. Introductory
textbooks include those by Dunham [10], Han and Kamber [21], Hand et al. [23],
and Roiger and Geatz [36]. Data mining books with a stronger emphasis on
business applications include the works by Berry and Linoff [2], Pyle [34], and
Parr Rud [33]. Books with an emphasis on statistical learning include those
by Cherkassky and Mulier [6], and Hastie et al. [24]. Some books with an
emphasis on machine learning or pattern recognition are those by Duda et
al. [9], Kantardzic [25], Mitchell [31], Webb [41], and Witten and Frank [42].
There are also some more specialized books: Chakrabarti [4] (web mining),
Fayyad et al. [13] (collection of early articles on data mining), Fayyad et al.
[11] (visualization), Grossman et al. [18] (science and engineering), Kargupta
and Chan [26] (distributed data mining), Wang et al. [40] (bioinformatics),
and Zaki and Ho [44] (parallel data mining).
There are several conferences related to data mining. Some of the main
conferences dedicated to this field include the ACM SIGKDD International
Conference on Knowledge Discovery and Data Mining (KDD), the IEEE
International Conference on Data Mining (ICDM), the SIAM International
Conference on Data Mining (SDM), the European Conference on Principles and
Practice of Knowledge Discovery in Databases (PKDD), and the Pacific-Asia
Conference on Knowledge Discovery and Data Mining (PAKDD). Data
mining papers can also be found in other major conferences such as the ACM
SIGMOD/PODS conference, the International Conference on Very Large Data
Bases (VLDB), the Conference on Information and Knowledge Management
(CIKM), the International Conference on Data Engineering (ICDE), the
International Conference on Machine Learning (ICML), and the National
Conference on Artificial Intelligence (AAAI).
Journal publications on data mining include IEEE Transactions on
Knowledge and Data Engineering, Data Mining and Knowledge Discovery,
Knowledge and Information Systems, Intelligent Data Analysis, Information
Systems, and the Journal of Intelligent Information Systems.
There have been a number of general articles on data mining that define the
field or its relationship to other fields, particularly statistics. Fayyad et al. [12]
describe data mining and how it fits into the total knowledge discovery process.
Chen et al. [5] give a database perspective on data mining. Ramakrishnan
and Grama [35] provide a general discussion of data mining and present several
viewpoints. Hand [22] describes how data mining differs from statistics, as does
Friedman [14]. Lambert [29] explores the use of statistics for large data sets and
provides some comments on the respective roles of data mining and statistics.
13
Chapter 1 Introduction
Glymour et al. [16] consider the lessons that statistics may have for data
mining. Smyth et al. [38] describe how the evolution of data mining is being
driven by new types of data and applications, such as those involving streams,
graphs, and text. Han et al. [20] consider emerging applications in data
mining, and Smyth [37] describes some research challenges in data mining.
A discussion of how developments in data mining research can be turned into
practical tools is given by Wu et al. [43]. Data mining standards are the
subject of a paper by Grossman et al. [17]. Bradley [3] discusses how data
mining algorithms can be scaled to large data sets.
With the emergence of new data mining applications have come new
challenges that need to be addressed. For instance, concerns about privacy breaches
as a result of data mining have escalated in recent years, particularly in
application domains such as Web commerce and health care. As a result, there
is growing interest in developing data mining algorithms that maintain user
privacy. Developing techniques for mining encrypted or randomized data is
known as privacy-preserving data mining. Some general references in this
area include papers by Agrawal and Srikant [1], Clifton et al. [7], and Kargupta
et al. [27]. Verykios et al. [39] provide a survey.
Recent years have witnessed a growing number of applications that rapidly
generate continuous streams of data. Examples of stream data include network
traffic, multimedia streams, and stock prices. Several issues must be considered
when mining data streams, such as the limited amount of memory available,
the need for online analysis, and the change of the data over time. Data
mining for stream data has become an important area in data mining. Some
selected publications are Domingos and Hulten [8] (classification), Giannella
et al. [15] (association analysis), Guha et al. [19] (clustering), Kifer et al. [28]
(change detection), Papadimitriou et al. [32] (time series), and Law et al. [30]
(dimensionality reduction).
Bibliography
[1] R. Agrawal and R. Srikant. Privacy-preserving data mining. In Proc. of 2000 ACM
SIGMOD Intl. Conf. on Management of Data, pages 439–450, Dallas, Texas, 2000.
ACM Press.
[2] M. J. A. Berry and G. Linoff. Data Mining Techniques: For Marketing, Sales, and
Customer Relationship Management. Wiley Computer Publishing, 2nd edition, 2004.
[3] P. S. Bradley, J. Gehrke, R. Ramakrishnan, and R. Srikant. Scaling mining algorithms
to large databases. Communications of the ACM, 45(8):38–43, 2002.
[4] S. Chakrabarti. Mining the Web: Discovering Knowledge from Hypertext Data. Morgan
Kaufmann, San Francisco, CA, 2003.
[5] M.-S. Chen, J. Han, and P. S. Yu. Data Mining: An Overview from a Database
Perspective. IEEE Transactions on Knowledge and Data Engineering, 8(6):866–883,
1996.
[6] V. Cherkassky and F. Mulier. Learning from Data: Concepts, Theory, and Methods.
Wiley Interscience, 1998.
[7] C. Clifton, M. Kantarcioglu, and J. Vaidya. Defining privacy for data mining. In
National Science Foundation Workshop on Next Generation Data Mining, pages 126–
133, Baltimore, MD, November 2002.
[8] P. Domingos and G. Hulten. Mining high-speed data streams. In Proc. of the 6th Intl.
Conf. on Knowledge Discovery and Data Mining, pages 71–80, Boston, Massachusetts,
2000. ACM Press.
[9] R. O. Duda, P. E. Hart, and D. G. Stork. Pattern Classification. John Wiley & Sons,
Inc., New York, 2nd edition, 2001.
[10] M. H. Dunham. Data Mining: Introductory and Advanced Topics. Prentice Hall, 2002.
[11] U. M. Fayyad, G. G. Grinstein, and A. Wierse, editors. Information Visualization in
Data Mining and Knowledge Discovery. Morgan Kaufmann Publishers, San Francisco,
CA, September 2001.
[12] U. M. Fayyad, G. Piatetsky-Shapiro, and P. Smyth. From Data Mining to Knowledge
Discovery: An Overview. In Advances in Knowledge Discovery and Data Mining, pages
1–34. AAAI Press, 1996.
[13] U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, editors. Advances
in Knowledge Discovery and Data Mining. AAAI/MIT Press, 1996.
[14] J. H. Friedman. Data Mining and Statistics: What’s the Connection? Unpublished.
wwwstat.stanford.edu/∼jhf/ftp/dmstat.ps, 1997.
[15] C. Giannella, J. Han, J. Pei, X. Yan, and P. S. Yu. Mining Frequent Patterns in Data
Streams at Multiple Time Granularities. In H. Kargupta, A. Joshi, K. Sivakumar, and
Y. Yesha, editors, Next Generation Data Mining, pages 191–212. AAAI/MIT, 2003.
[16] C. Glymour, D. Madigan, D. Pregibon, and P. Smyth. Statistical Themes and Lessons
for Data Mining. Data Mining and Knowledge Discovery, 1(1):11–28, 1997.
[17] R. L. Grossman, M. F. Hornick, and G. Meyer. Data mining standards initiatives.
Communications of the ACM, 45(8):59–61, 2002.
[18] R. L. Grossman, C. Kamath, P. Kegelmeyer, V. Kumar, and R. Namburu, editors. Data
Mining for Scientific and Engineering Applications. Kluwer Academic Publishers, 2001.
[19] S. Guha, A. Meyerson, N. Mishra, R. Motwani, and L. O’Callaghan. Clustering Data
Streams: Theory and Practice. IEEE Transactions on Knowledge and Data Engineering,
15(3):515–528, May/June 2003.
[20] J. Han, R. B. Altman, V. Kumar, H. Mannila, and D. Pregibon. Emerging scientific
applications in data mining. Communications of the ACM, 45(8):54–58, 2002.
[21] J. Han and M. Kamber. Data Mining: Concepts and Techniques. Morgan Kaufmann
Publishers, San Francisco, 2001.
[22] D. J. Hand. Data Mining: Statistics and More? The American Statistician, 52(2):
112–118, 1998.
[23] D. J. Hand, H. Mannila, and P. Smyth. Principles of Data Mining. MIT Press, 2001.
[24] T. Hastie, R. Tibshirani, and J. H. Friedman. The Elements of Statistical Learning:
Data Mining, Inference, Prediction. Springer, New York, 2001.
[25] M. Kantardzic. Data Mining: Concepts, Models, Methods, and Algorithms. Wiley-IEEE
Press, Piscataway, NJ, 2003.
[26] H. Kargupta and P. K. Chan, editors. Advances in Distributed and Parallel Knowledge
Discovery. AAAI Press, September 2002.
[27] H. Kargupta, S. Datta, Q. Wang, and K. Sivakumar. On the Privacy Preserving
Properties of Random Data Perturbation Techniques. In Proc. of the 2003 IEEE Intl. Conf.
on Data Mining, pages 99–106, Melbourne, Florida, December 2003. IEEE Computer
Society.
[28] D. Kifer, S. Ben-David, and J. Gehrke. Detecting Change in Data Streams. In Proc. of
the 30th VLDB Conf., pages 180–191, Toronto, Canada, 2004. Morgan Kaufmann.
[29] D. Lambert. What Use is Statistics for Massive Data? In ACM SIGMOD Workshop
on Research Issues in Data Mining and Knowledge Discovery, pages 54–62, 2000.
[30] M. H. C. Law, N. Zhang, and A. K. Jain. Nonlinear Manifold Learning for Data
Streams. In Proc. of the SIAM Intl. Conf. on Data Mining, Lake Buena Vista, Florida,
April 2004. SIAM.
[31] T. Mitchell. Machine Learning. McGraw-Hill, Boston, MA, 1997.
[32] S. Papadimitriou, A. Brockwell, and C. Faloutsos. Adaptive, unsupervised stream
mining. VLDB Journal, 13(3):222–239, 2004.
[33] O. Parr Rud. Data Mining Cookbook: Modeling Data for Marketing, Risk and Customer
Relationship Management. John Wiley & Sons, New York, NY, 2001.
[34] D. Pyle. Business Modeling and Data Mining. Morgan Kaufmann, San Francisco, CA,
2003.
[35] N. Ramakrishnan and A. Grama. Data Mining: From Serendipity to Science—Guest
Editors’ Introduction. IEEE Computer, 32(8):34–37, 1999.
[36] R. Roiger and M. Geatz. Data Mining: A Tutorial-Based Primer. Addison-Wesley,
2002.
[37] P. Smyth. Breaking out of the Black-Box: Research Challenges in Data Mining. In
Proc. of the 2001 ACM SIGMOD Workshop on Research Issues in Data Mining and
Knowledge Discovery, 2001.
[38] P. Smyth, D. Pregibon, and C. Faloutsos. Data-driven evolution of data mining
algorithms. Communications of the ACM, 45(8):33–37, 2002.
[39] V. S. Verykios, E. Bertino, I. N. Fovino, L. P. Provenza, Y. Saygin, and Y. Theodoridis.
State-of-the-art in privacy preserving data mining. SIGMOD Record, 33(1):50–57, 2004.
[40] J. T. L. Wang, M. J. Zaki, H. Toivonen, and D. E. Shasha, editors. Data Mining in
Bioinformatics. Springer, September 2004.
[41] A. R. Webb. Statistical Pattern Recognition. John Wiley & Sons, 2nd edition, 2002.
[42] I. H. Witten and E. Frank. Data Mining: Practical Machine Learning Tools and
Techniques with Java Implementations. Morgan Kaufmann, 1999.
[43] X. Wu, P. S. Yu, and G. Piatetsky-Shapiro. Data Mining: How Research Meets Practical
Development? Knowledge and Information Systems, 5(2):248–261, 2003.
[44] M. J. Zaki and C.-T. Ho, editors. Large-Scale Parallel Data Mining. Springer, September
2002.
1.7 Exercises
1. Discuss whether or not each of the following activities is a data mining task.
(a) Dividing the customers of a company according to their gender.
(b) Dividing the customers of a company according to their profitability.
(c) Computing the total sales of a company.
(d) Sorting a student database based on student identification numbers.
(e) Predicting the outcomes of tossing a (fair) pair of dice.
(f) Predicting the future stock price of a company using historical records.
(g) Monitoring the heart rate of a patient for abnormalities.
(h) Monitoring seismic waves for earthquake activities.
(i) Extracting the frequencies of a sound wave.
2. Suppose that you are employed as a data mining consultant for an Internet
search engine company. Describe how data mining can help the company by
giving specific examples of how techniques, such as clustering, classification,
association rule mining, and anomaly detection can be applied.
3. For each of the following data sets, explain whether or not data privacy is an
important issue.
(a) Census data collected from 1900–1950.
(b) IP addresses and visit times of Web users who visit your Website.
(c) Images from Earth-orbiting satellites.
(d) Names and addresses of people from the telephone book.
(e) Names and email addresses collected from the Web.
2 Data
This chapter discusses several data-related issues that are important for
successful data mining:
The Type of Data Data sets differ in a number of ways. For example, the
attributes used to describe data objects can be of different types—quantitative
or qualitative—and data sets may have special characteristics; e.g., some data
sets contain time series or objects with explicit relationships to one another.
Not surprisingly, the type of data determines which tools and techniques can
be used to analyze the data. Furthermore, new research in data mining is
often driven by the need to accommodate new application areas and their new
types of data.
The Quality of the Data Data is often far from perfect. While most data
mining techniques can tolerate some level of imperfection in the data, a focus
on understanding and improving data quality typically improves the quality
of the resulting analysis. Data quality issues that often need to be addressed
include the presence of noise and outliers; missing, inconsistent, or duplicate
data; and data that is biased or, in some other way, unrepresentative of the
phenomenon or population that the data is supposed to describe.
From Chapter 2 of Introduction to Data Mining, First Edition. Pang-Ning Tan, Michael Steinbach,
Vipin Kumar. Copyright © 2006 by Pearson Education, Inc. All rights reserved.
Preprocessing Steps to Make the Data More Suitable for Data Mining
Often, the raw data must be processed in order to make it suitable for
analysis. While one objective may be to improve data quality, other goals
focus on modifying the data so that it better fits a specified data mining
technique or tool. For example, a continuous attribute, e.g., length, may need to
be transformed into an attribute with discrete categories, e.g., short, medium,
or long, in order to apply a particular technique. As another example, the
number of attributes in a data set is often reduced because many techniques
are more effective when the data has a relatively small number of attributes.
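As a minimal sketch of the discretization just described, the following Python function maps a continuous length to one of the categories short, medium, or long. The function name and the two cut points are assumptions made purely for illustration; the text does not specify how the categories are bounded.

```python
def discretize_length(length, short_max=10.0, medium_max=20.0):
    """Map a continuous length to a discrete category.

    The thresholds are hypothetical; in practice they would be chosen
    from domain knowledge or from the distribution of the data.
    """
    if length < short_max:
        return "short"
    if length < medium_max:
        return "medium"
    return "long"


lengths = [3.2, 12.5, 27.0]
categories = [discretize_length(x) for x in lengths]
print(categories)
```

Fixed cut points are only one possible design; equal-width or equal-frequency binning would serve the same purpose.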
Analyzing Data in Terms of Its Relationships One approach to data
analysis is to find relationships among the data objects and then perform
the remaining analysis using these relationships rather than the data objects
themselves. For instance, we can compute the similarity or distance between
pairs of objects and then perform the analysis—clustering, classification, or
anomaly detection—based on these similarities or distances. There are many
such similarity or distance measures, and the proper choice depends on the
type of data and the particular application.
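The idea of replacing the data objects by their pairwise relationships can be sketched as follows. Euclidean distance is used here only as one familiar choice, and the function names and sample points are assumptions for illustration; as noted above, the proper measure depends on the data and the application.

```python
import math


def euclidean_distance(x, y):
    """Euclidean distance between two equal-length numeric vectors."""
    if len(x) != len(y):
        raise ValueError("objects must have the same number of attributes")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def distance_matrix(objects):
    """All pairwise distances; later analysis can use this matrix alone."""
    return [[euclidean_distance(p, q) for q in objects] for p in objects]


# Three hypothetical data objects with two numeric attributes each.
objects = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
D = distance_matrix(objects)
for row in D:
    print(row)
```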
Example 2.1 (An Illustration of Data-Related Issues). To further
illustrate the importance of these issues, consider the following hypothetical
situation. You receive an email from a medical researcher concerning a project
that you are eager to work on.
Hi,
I’ve attached the data file that I mentioned in my previous email.
Each line contains the information for a single patient and consists
of five fields. We want to predict the last field using the other fields.
I don’t have time to provide any more information about the data
since I’m going out of town for a couple of days, but hopefully that
won’t slow you down too much. And if you don’t mind, could we
meet when I get back to discuss your preliminary results? I might
invite a few other members of my team.
Thanks and see you in a couple of days.
Despite some misgivings, you proceed to analyze the data. The first few
rows of the file are as follows:
012 232 33.5 0 10.7
020 121 16.9 2 210.1
027 165 24.0 0 427.6
…
A brief look at the data reveals nothing strange. You put your doubts aside
and start the analysis. There are only 1000 lines, a smaller data file than you
had hoped for, but two days later, you feel that you have made some progress.
You arrive for the meeting, and while waiting for others to arrive, you strike
up a conversation with a statistician who is working on the project. When she
learns that you have also been analyzing the data from the project, she asks
if you would mind giving her a brief overview of your results.
Statistician: So, you got the data for all the patients?
Data Miner: Yes. I haven’t had much time for analysis, but I
do have a few interesting results.
Statistician: Amazing. There were so many data issues with
this set of patients that I couldn’t do much.
Data Miner: Oh? I didn’t hear about any possible problems.
Statistician: Well, first there is field 5, the variable we want to
predict. It’s common knowledge among people who analyze
this type of data that results are better if you work with the
log of the values, but I didn’t discover this until later. Was it
mentioned to you?
Data Miner: No.
Statistician: But surely you heard about what happened to field
4? It’s supposed to be measured on a scale from 1 to 10, with
0 indicating a missing value, but because of a data entry
error, all 10’s were changed into 0’s. Unfortunately, since
some of the patients have missing values for this field, it’s
impossible to say whether a 0 in this field is a real 0 or a 10.
Quite a few of the records have that problem.
Data Miner: Interesting. Were there any other problems?
Statistician: Yes, fields 2 and 3 are basically the same, but I
assume that you probably noticed that.
Data Miner: Yes, but these fields were only weak predictors of
field 5.
Statistician: Anyway, given all those problems, I’m surprised
you were able to accomplish anything.
Data Miner: True, but my results are really quite good. Field 1
is a very strong predictor of field 5. I’m surprised that this
wasn’t noticed before.
Statistician: What? Field 1 is just an identification number.
Data Miner: Nonetheless, my results speak for themselves.
Statistician: Oh, no! I just remembered. We assigned ID
numbers after we sorted the records based on field 5. There is
a strong connection, but it’s meaningless. Sorry.
Although this scenario represents an extreme situation, it emphasizes the
importance of “knowing your data.” To that end, this chapter will address
each of the four issues mentioned above, outlining some of the basic challenges
and standard approaches.
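The final twist in Example 2.1 can be reproduced with a small experiment. The sketch below, written under stated assumptions, generates hypothetical outcome values standing in for field 5, assigns identification numbers either before or after sorting by that outcome, and compares the resulting rank correlations. All names and values are invented for illustration and are not taken from the data file in the example.

```python
import random


def ranks(values):
    """Rank of each value (0 = smallest); assumes there are no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order):
        result[index] = rank
    return result


def spearman(x, y):
    """Spearman rank correlation for two sequences without ties."""
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1 - 6 * d2 / (n * (n * n - 1))


random.seed(0)
# Hypothetical, distinct outcome values for 1000 patients.
outcomes = random.sample(range(1, 10_000), 1000)

# IDs assigned in arrival order, before any sorting.
ids_arrival = list(range(1, len(outcomes) + 1))
honest = spearman(ids_arrival, outcomes)

# IDs assigned after the records were sorted by the outcome.
sorted_outcomes = sorted(outcomes)
ids_after_sort = list(range(1, len(sorted_outcomes) + 1))
leaky = spearman(ids_after_sort, sorted_outcomes)

print(f"arrival-order IDs: {honest:.3f}, IDs assigned after sorting: {leaky:.3f}")
```

An identifier that predicts the target almost perfectly is a signal to ask how the identifier was produced, not a result to report.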
2.1 Types of Data
A data set can often be viewed as a collection of data objects. Other
names for a data object are record, point, vector, pattern, event, case, sample,
observation, or entity. In turn, data objects are described by a number of
attributes that capture the basic characteristics of an object, such as the
mass of a physical object or the time at which an event occurred. Other
names for an attribute are variable, characteristic, field, feature, or dimension.
Example 2.2 (Student Information). Often, a data set is a file, in which
the objects are records (or rows) in the file and each field (or column)
corresponds to an attribute. For example, Table 2.1 shows a data set that consists
of student information. Each row corresponds to a student and each column
is an attribute that describes some aspect of a student, such as grade point
average (GPA) or identification number (ID).
Table 2.1. A sample data set containing student information.
Student ID   Year        Grade Point Average (GPA)   . . .
…
1034262      Senior      3.24                        . . .
1052663      Sophomore   3.51                        . . .
1082246      Freshman    3.62                        . . .
…
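One way to hold such record data in a program is sketched below, assuming Python and a hypothetical Student class. Only the three columns and three rows visible in Table 2.1 are modeled; the elided columns and rows are left out rather than guessed.

```python
from dataclasses import dataclass


@dataclass
class Student:
    """One data object (row); each field is an attribute (column)."""
    student_id: int  # an identifier, so only equality tests are meaningful
    year: str
    gpa: float


students = [
    Student(1034262, "Senior", 3.24),
    Student(1052663, "Sophomore", 3.51),
    Student(1082246, "Freshman", 3.62),
]

# Selecting one attribute across all objects yields a column.
gpa_column = [s.gpa for s in students]
mean_gpa = sum(gpa_column) / len(gpa_column)
print(round(mean_gpa, 2))
```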
Although record-based data sets are common, either in flat files or
relational database systems, there are other important types of data sets and
systems for storing data. In Section 2.1.2, we will discuss some of the types of
data sets that are commonly encountered in data mining. However, we first
consider attributes.
2.1.1 Attributes and Measurement
In this section we address the issue of describing data by considering what
types of attributes are used to describe data objects. We first define an
attribute, then consider what we mean by the type of an attribute, and finally
describe the types of attributes that are commonly encountered.
What Is an Attribute?
We start with a more detailed definition of an attribute.
Definition 2.1. An attribute is a property or characteristic of an object
that may vary, either from one object to another or from one time to another.
For example, eye color varies from person to person, while the temperature
of an object varies over time. Note that eye color is a symbolic attribute with
a small number of possible values {brown, black, blue, green, hazel, etc.}, while
temperature is a numerical attribute with a potentially unlimited number of
values.
At the most basic level, attributes are not about numbers or symbols.
However, to discuss and more precisely analyze the characteristics of objects,
we assign numbers or symbols to them. To do this in a well-defined way, we
need a measurement scale.
Definition 2.2. A measurement scale is a rule (function) that associates
a numerical or symbolic value with an attribute of an object.
Formally, the process of measurement is the application of a measurement
scale to associate a value with a particular attribute of a specific object.
While this may seem a bit abstract, we engage in the process of measurement
all the time. For instance, we step on a bathroom scale to determine our
weight, we classify someone as male or female, or we count the number of
chairs in a room to see if there will be enough to seat all the people coming to
a meeting. In all these cases, the “physical value” of an attribute of an object
is mapped to a numerical or symbolic value.
With this background, we can now discuss the type of an attribute, a
concept that is important in determining if a particular data analysis technique
is consistent with a specific type of attribute.
The Type of an Attribute
It should be apparent from the previous discussion that the properties of an
attribute need not be the same as the properties of the values used to
measure it. In other words, the values used to represent an attribute may have
properties that are not properties of the attribute itself, and vice versa. This
is illustrated with two examples.
Example 2.3 (Employee Age and ID Number). Two attributes that
might be associated with an employee are ID and age (in years). Both of these
attributes can be represented as integers. However, while it is reasonable to
talk about the average age of an employee, it makes no sense to talk about
the average employee ID. Indeed, the only aspect of employees that we want
to capture with the ID attribute is that they are distinct. Consequently, the
only valid operation for employee IDs is to test whether they are equal. There
is no hint of this limitation, however, when integers are used to represent the
employee ID attribute. For the age attribute, the properties of the integers
used to represent age are very much the properties of the attribute. Even so,
the correspondence is not complete since, for example, ages have a maximum,
while integers do not.
Example 2.4 (Length of Line Segments). Consider Figure 2.1, which
shows some objects—line segments—and how the length attribute of these
objects can be mapped to numbers in two different ways. Each successive
line segment, going from the top to the bottom, is formed by appending the
topmost line segment to itself. Thus, the second line segment from the top is
formed by appending the topmost line segment to itself twice, the third line
segment from the top is formed by appending the topmost line segment to
itself three times, and so forth. In a very real (physical) sense, all the line
segments are multiples of the first. This fact is captured by the measurements
on the right-hand side of the figure, but not by those on the left-hand side.
More specifically, the measurement scale on the left-hand side captures only
the ordering of the length attribute, while the scale on the right-hand side
captures both the ordering and additivity properties. Thus, an attribute can be
measured in a way that does not capture all the properties of the attribute.
The type of an attribute should tell us what properties of the attribute are
reflected in the values used to measure it. Knowing the type of an attribute
is important because it tells us which properties of the measured values are
consistent with the underlying properties of the attribute, and therefore, it
allows us to avoid foolish actions, such as computing the average employee ID.
Note that it is common to refer to the type of an attribute as the type of a
measurement scale.
[Figure 2.1 shows five line segments of increasing length, each labeled on two scales. Left-hand scale: 1, 3, 7, 8, 10 (a mapping of lengths to numbers that captures only the order properties of length). Right-hand scale: 1, 2, 3, 4, 5 (a mapping of lengths to numbers that captures both the order and additivity properties of length).]
Figure 2.1. The measurement of the length of line segments on two different scales of measurement.
The Different Types of Attributes
A useful (and simple) way to specify the type of an attribute is to identify
the properties of numbers that correspond to underlying properties of the
attribute. For example, an attribute such as length has many of the properties
of numbers. It makes sense to compare and order objects by length, as well
as to talk about the differences and ratios of length. The following properties
(operations) of numbers are typically used to describe attributes.
1. Distinctness = and ≠
2. Order <, ≤, >, and ≥
3. Addition + and −
4. Multiplication ∗ and /
Given these properties, we can define four types of attributes: nominal,
ordinal, interval, and ratio. Table 2.2 gives the definitions of these types,
along with information about the statistical operations that are valid for each
type. Each attribute type possesses all of the properties and operations of the
attribute types above it. Consequently, any property or operation that is valid
for nominal, ordinal, and interval attributes is also valid for ratio attributes.
In other words, the definition of the attribute types is cumulative. However,
this does not mean that the operations appropriate for one attribute type are
appropriate for the attribute types above it.
Table 2.2. Different attribute types.
Categorical (Qualitative)
  Nominal
    Description: The values of a nominal attribute are just different names; i.e., nominal values provide only enough information to distinguish one object from another. (=, ≠)
    Examples: zip codes, employee ID numbers, eye color, gender
    Operations: mode, entropy, contingency correlation, χ² test
  Ordinal
    Description: The values of an ordinal attribute provide enough information to order objects. (<, >)
    Examples: hardness of minerals, {good, better, best}, grades, street numbers
    Operations: median, percentiles, rank correlation, run tests, sign tests
Numeric (Quantitative)
  Interval
    Description: For interval attributes, the differences between values are meaningful, i.e., a unit of measurement exists. (+, −)
    Examples: calendar dates, temperature in Celsius or Fahrenheit
    Operations: mean, standard deviation, Pearson’s correlation, t and F tests
  Ratio
    Description: For ratio variables, both differences and ratios are meaningful. (*, /)
    Examples: temperature in Kelvin, monetary quantities, counts, age, mass, length, electrical current
    Operations: geometric mean, harmonic mean, percent variation
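The cumulative character of the four attribute types can be sketched in code. The example below is an illustration under stated assumptions, not a standard API: the AttributeType enumeration, the REQUIRED_TYPE lookup, and the summarize function are hypothetical names, and only four of the operations from Table 2.2 are covered. Each statistic is tagged with the lowest attribute type for which it is meaningful, and a request is rejected when the attribute's type falls below that level.

```python
import statistics
from enum import IntEnum


class AttributeType(IntEnum):
    """The four attribute types, ordered so that each includes those before it."""
    NOMINAL = 1
    ORDINAL = 2
    INTERVAL = 3
    RATIO = 4


# Lowest attribute type for which each statistic is meaningful
# (a subset of the operations listed in Table 2.2).
REQUIRED_TYPE = {
    "mode": AttributeType.NOMINAL,
    "median": AttributeType.ORDINAL,
    "mean": AttributeType.INTERVAL,
    "geometric_mean": AttributeType.RATIO,
}

STATISTIC = {
    "mode": statistics.mode,
    "median": statistics.median_low,  # always returns an observed value
    "mean": statistics.mean,
    "geometric_mean": statistics.geometric_mean,
}


def summarize(values, attribute_type, name):
    """Compute a statistic only if it is valid for the given attribute type."""
    if attribute_type < REQUIRED_TYPE[name]:
        raise TypeError(
            f"{name} is not meaningful for {attribute_type.name.lower()} attributes"
        )
    return STATISTIC[name](values)


ages = [23, 35, 35, 41]            # ratio attribute, hypothetical values
employee_ids = [1034, 1052, 1082]  # nominal attribute, hypothetical values

mean_age = summarize(ages, AttributeType.RATIO, "mean")
try:
    summarize(employee_ids, AttributeType.NOMINAL, "mean")
    id_mean_allowed = True
except TypeError:
    id_mean_allowed = False

print(mean_age, id_mean_allowed)
```

Using an ordered enumeration makes the rule a single comparison: a ratio attribute passes every check, while a nominal attribute passes only the check for the mode.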
Nominal and ordinal attributes are collectively referred to as categorical
or qualitative attributes. As the name suggests, qualitative attributes, such
as employee ID, lack most of the properties of numbers. Even if they are
represented by numbers, i.e., integers, they should be treated more like symbols.
The remaining two types of attributes, interval and ratio, are collectively
referred to as quantitative or numeric attributes. Quantitative attributes are
represented by numbers and have most of the properties of numbers. Note
that quantitative attributes can be integer-valued or continuous.
The types of attributes can also be described in terms of transformations
that do not change the meaning of an attribute. Indeed, S. Smith Stevens, the
psychologist who originally defined the types of attributes shown in Table 2.2,
defined them in terms of these permissible transformations. For example,
the meaning of a length attribute is unchanged if it is measured in meters
instead of feet.
Table 2.3. Transformations that define attribute levels.
Categorical (Qualitative)
  Nominal
    Transformation: Any one-to-one mapping, e.g., a permutation of values
    Comment: If all employee ID numbers are reassigned, it will not make any difference.
  Ordinal
    Transformation: An order-preserving change of values, i.e., new value = f(old value), where f is a monotonic function.
    Comment: An attribute encompassing the notion of good, better, best can be represented equally well by the values {1, 2, 3} or by {0.5, 1, 10}.
Numeric (Quantitative)
  Interval
    Transformation: new value = a ∗ old value + b, a and b constants.
    Comment: The Fahrenheit and Celsius temperature scales differ in the location of their zero value and the size of a degree (unit).
  Ratio
    Transformation: new value = a ∗ old value
    Comment: Length can be measured in meters or feet.
The statistical operations that make sense for a particular type of attribute
are those that will yield the same results when the attribute is transformed
using a transformation that preserves the attribute’s meaning. To illustrate, the
average length of a set of objects is different when measured in meters rather
than in feet, but both averages represent the same length. Table 2.3 shows the
permissible (meaning-preserving) transformations for the four attribute types
of Table 2.2.
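A small numerical check, offered as a sketch rather than a proof, makes this criterion concrete. The mean of a ratio attribute commutes with a change of unit, and the median of an ordinal attribute commutes with an order-preserving recoding, but the mean does not survive that recoding. The sample values are invented; the recoding {1, 2, 3} to {0.5, 1, 10} is the one given in Table 2.3.

```python
import math
import statistics

# Ratio attribute: converting meters to feet is a permissible transformation.
FEET_PER_METER = 3.28084
lengths_m = [1.0, 2.0, 6.0]
lengths_ft = [FEET_PER_METER * x for x in lengths_m]
mean_m = statistics.mean(lengths_m)
mean_ft = statistics.mean(lengths_ft)
same_length = math.isclose(mean_ft, FEET_PER_METER * mean_m)

# Ordinal attribute: good/better/best coded as 1/2/3, then recoded monotonically.
recode = {1: 0.5, 2: 1, 3: 10}
ratings = [1, 1, 2, 3, 3]
recoded = [recode[r] for r in ratings]

# The median picks out the same category under either coding.
median_agrees = statistics.median_low(recoded) == recode[statistics.median_low(ratings)]

# The mean does not: under the first coding it equals the code for "better",
# under the second it falls between the codes for "better" and "best".
mean_original = statistics.mean(ratings)
mean_recoded = statistics.mean(recoded)
mean_agrees = mean_recoded == recode[2]

print(same_length, median_agrees, mean_agrees)
```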
Example 2.5 (Temperature Scales). Temperature provides a good
illustration of some of the concepts that have been described. First, temperature
can be either an interval or a ratio attribute, depending on its measurement
scale. When measured on the Kelvin scale, a temperature of 2° is, in a
physically meaningful way, twice that of a temperature of 1°. This is not true when
temperature is measured on either the Celsius or Fahrenheit scales, because,
physically, a temperature of 1° Fahrenheit (Celsius) is not much different than
a temperature of 2° Fahrenheit (Celsius). The problem is that the zero points
of the Fahrenheit and Celsius scales are, in a physical sense, arbitrary, and
therefore, the ratio of two Celsius or Fahrenheit temperatures is not
physically meaningful.
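The point of Example 2.5 can be checked with a few lines of arithmetic. In the sketch below the two temperatures are chosen arbitrarily for illustration. The same pair of physical temperatures gives a different ratio on each scale, while the Celsius and Fahrenheit differences agree up to the constant unit factor of 9/5.

```python
def celsius_to_fahrenheit(c):
    return c * 9 / 5 + 32


def celsius_to_kelvin(c):
    return c + 273.15


t1_c, t2_c = 10.0, 20.0  # two hypothetical temperatures in degrees Celsius

ratio_c = t2_c / t1_c
ratio_f = celsius_to_fahrenheit(t2_c) / celsius_to_fahrenheit(t1_c)
ratio_k = celsius_to_kelvin(t2_c) / celsius_to_kelvin(t1_c)

# Differences, unlike ratios, transform consistently between interval scales.
diff_c = t2_c - t1_c
diff_f = celsius_to_fahrenheit(t2_c) - celsius_to_fahrenheit(t1_c)

print(f"ratio in C: {ratio_c:.3f}, in F: {ratio_f:.3f}, in K: {ratio_k:.3f}")
print(f"difference in C: {diff_c}, in F: {diff_f}")
```

Only the Kelvin ratio reflects a physical relationship; the other two depend on where each scale happens to place its zero.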
Describing Attributes by the Number of Values
An independent way of distinguishing between attributes is by the number of
values they can take.
Discrete A discrete attribute has a finite or countably infinite set of values.
Such attributes can be categorical, such as zip codes or ID numbers,
or numeric, such as counts. Discrete attributes are often represented
using integer variables. Binary attributes are a special case of
discrete attributes and assume only two values, e.g., true/false, yes/no,
male/female, or 0/1. Binary attributes are often represented as Boolean
variables, or as integer variables that only take the values 0 or 1.
Continuous A continuous attribute is one whose values are real numbers.
Examples include attributes such as temperature, height, or weight.
Continuous attributes are typically represented as floating-point variables.
Practically, real values can only be measured and represented with
limited precision.
In theory, any of the measurement scale types—nominal, ordinal, interval, and
ratio—could be combined with any of the types based on the number of
attribute values—binary, discrete, and continuous. However, some combinations
occur only infrequently or do not make much sense. For instance, it is difficult
to think of a realistic data set that contains a continuous binary attribute.
Typically, nominal and ordinal attributes are binary or discrete, while interval
and ratio attributes are continuous. However, count attributes, which are
discrete, are also ratio attributes.
Asymmetric Attributes
For asymmetric attributes, only presence—a nonzero attribute value—is
regarded as important. Consider a data set where each object is a student and
each attribute records whether or not a student took a particular course at
a university. For a specific student, an attribute has a value of 1 if the
student took the course associated with that attribute and a value of 0 otherwise.
Because students take only a small fraction of all available courses, most of
the values in such a data set would be 0. Therefore, it is more meaningful
and more efficient to focus on the nonzero values. To illustrate, if students
are compared on the basis of the courses they don’t take, then most students
would seem very similar, at least if the number of courses is large. Binary
attributes where only nonzero values are important are called asymmetric
binary attributes. This type of attribute is particularly important for
association analysis, which is discussed in Chapter 6. It is also possible to have
discrete or continuous asymmetric features. For instance, if the number of
credits associated with each course is recorded, then the resulting data set will
consist of asymmetric discrete or continuous attributes.
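The effect described above can be seen by comparing two students with two standard similarity measures. These measures are brought in here as an assumption, since this passage does not name them: the simple matching coefficient counts agreement on both 0s and 1s, whereas the Jaccard coefficient ignores positions where both values are 0. The course catalogue size and enrollments are hypothetical.

```python
def simple_matching(x, y):
    """Fraction of positions where the two binary vectors agree (0-0 or 1-1)."""
    matches = sum(1 for a, b in zip(x, y) if a == b)
    return matches / len(x)


def jaccard(x, y):
    """Shared 1s divided by positions where at least one vector has a 1."""
    both = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)
    either = sum(1 for a, b in zip(x, y) if a == 1 or b == 1)
    return both / either if either else 0.0


N_COURSES = 1000  # hypothetical size of the course catalogue
student_a = [0] * N_COURSES
student_b = [0] * N_COURSES
for course in (0, 1, 2, 3):
    student_a[course] = 1
for course in (3, 4, 5, 6):  # the two students share only course 3
    student_b[course] = 1

smc = simple_matching(student_a, student_b)
jac = jaccard(student_a, student_b)
print(f"simple matching: {smc:.3f}, Jaccard: {jac:.3f}")
```

Counting the many courses neither student took makes the pair look nearly identical, while restricting attention to the nonzero values shows how little they share.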
2.1.2 Types of Data Sets
There are many types of data sets, and as the field of data mining develops
and matures, a greater variety of data sets become available for analysis. In
this section, we describe some of the most common types. For convenience,
we have grouped the types of data sets into three groups: record data,
graph-based data, and ordered data. These categories do not cover all possibilities
and other groupings are certainly possible.
General Characteristics of Data Sets
Before providing details of specific kinds of data sets, we discuss three
characteristics that apply to many data sets and have a significant impact on the
data mining techniques that are used: dimensionality, sparsity, and resolution.
Dimensionality The dimensionality of a data set is the number of attributes
that the objects in the data set possess. Data with a small number of
dimensions tends to be qualitatively different than moderate or high-dimensional
data. Indeed, the difficulties associated with analyzing high-dimensional data
are sometimes referred to as the curse of dimensionality. Because of this,
an important motivation in preprocessing the data is dimensionality
reduction. These issues are discussed in more depth later in this chapter and in
Appendix B.
Sparsity For some data sets, such as those with asymmetric features, most
attributes of an object have values of 0; in many cases, fewer than 1% of
the entries are nonzero. In practical terms, sparsity is an advantage because
usually only the nonzero values need to be stored and manipulated. This
results in significant savings with respect to computation time and storage.
Furthermore, some data mining algorithms work well only for sparse data.
Resolution It is frequently possible to obtain data at different levels of
resolution, and often the properties of the data are different at different resolutions.
For instance, the surface of the Earth seems very uneven at a resolution of a
few meters, but is relatively smooth at a resolution of tens of kilometers. The
patterns in the data also depend on the level of resolution. If the resolution
is too fine, a pattern may not be visible or may be buried in noise; if the
resolution is too coarse, the pattern may disappear. For example, variations
in atmospheric pressure on a scale of hours reflect the movement of storms
and other weather systems. On a scale of months, such phenomena are not
detectable.
Record Data
Much data mining work assumes that the data set is a collection of records
(data objects), each of which consists of a fixed set of data fields (attributes).
See Figure 2.2(a). For the most basic form of record data, there is no explicit
relationship among records or data fields, and every record (object) has the
same set of attributes. Record data is usually stored either in flat files or in
relational databases. Relational databases are certainly more than a collection
of records, but data mining often does not use any of the additional information
available in a relational database. Rather, the database serves as a convenient
place to find records. Different types of record data are described below and
are illustrated in Figure 2.2.
Transaction or Market Basket Data Transaction data is a special type
of record data, where each record (transaction) involves a set of items.
Consider a grocery store. The set of products purchased by a customer during one
shopping trip constitutes a transaction, while the individual products that
were purchased are the items. This type of data is called market basket
data because the items in each record are the products in a person’s
“market basket.” Transaction data is a collection of sets of items, but it can be
viewed as a set of records whose fields are asymmetric attributes. Most often,
the attributes are binary, indicating whether or not an item was purchased,
but more generally, the attributes can be discrete or continuous, such as the
number of items purchased or the amount spent on those items. Figure 2.2(b)
shows a sample transaction data set. Each row represents the purchases of a
particular customer at a particular time.
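The view of transaction data as records with asymmetric binary attributes can be sketched in a few lines of Python. This is only an illustration: the baskets are the five transactions of Figure 2.2(b), while the function name to_binary_records and the alphabetical ordering of the item columns are assumptions made here, not part of the text.

```python
# Transactions from Figure 2.2(b): transaction ID -> set of purchased items.
transactions = {
    1: {"Bread", "Soda", "Milk"},
    2: {"Beer", "Bread"},
    3: {"Beer", "Soda", "Diaper", "Milk"},
    4: {"Beer", "Bread", "Diaper", "Milk"},
    5: {"Soda", "Diaper", "Milk"},
}


def to_binary_records(transactions):
    """Hypothetical helper: one 0/1 attribute per item for each transaction."""
    # The set of all items defines the attributes (columns), sorted for stability.
    items = sorted(set().union(*transactions.values()))
    records = {
        tid: [1 if item in basket else 0 for item in items]
        for tid, basket in transactions.items()
    }
    return items, records


items, records = to_binary_records(transactions)
print(items)       # attribute names
print(records[2])  # binary record for transaction 2
```

Only the 1 entries carry information here, which is what makes the attributes asymmetric.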
The Data Matrix If the data objects in a collection of data all have the
same fixed set of numeric attributes, then the data objects can be thought of as
points (vectors) in a multidimensional space, where each dimension represents
a distinct attribute describing the object. A set of such data objects can be
interpreted as an m by n matrix, where there are m rows, one for each object,
and n columns, one for each attribute. (A representation that has data objects
as columns and attributes as rows is also fine.) This matrix is called a data
matrix or a pattern matrix. A data matrix is a variation of record data,
but because it consists of numeric attributes, standard matrix operations can
be applied to transform and manipulate the data. Therefore, the data matrix
is the standard data format for most statistical data. Figure 2.2(c) shows a
sample data matrix.
Tid  Refund  Marital Status  Taxable Income  Defaulted Borrower
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes
(a) Record data.
TID  ITEMS
1    Bread, Soda, Milk
2    Beer, Bread
3    Beer, Soda, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Soda, Diaper, Milk
(b) Transaction data.
Projection of x Load  Projection of y Load  Distance  Load  Thickness
10.23                 5.27                  15.22     27    1.2
12.65                 6.25                  16.22     22    1.1
13.54                 7.23                  17.34     23    1.2
14.27                 8.43                  18.45     25    0.9
(c) Data matrix.
            team  coach  play  score  game  win  lost  timeout  season  ball
Document 1  3     0      5     0      2     6    0     2        0       2
Document 2  0     7      0     2      1     0    0     3        0       0
Document 3  0     1      0     0      1     2    2     0        3       0
(d) Document-term matrix.
Figure 2.2. Different variations of record data.
The Sparse Data Matrix A sparse data matrix is a special case of a data
matrix in which the attributes are of the same type and are asymmetric; i.e.,
only nonzero values are important. Transaction data is an example of a sparse
data matrix that has only 0–1 entries. Another common example is document
data. In particular, if the order of the terms (words) in a document is ignored,
then a document can be represented as a term vector, where each term is
a component (attribute) of the vector and the value of each component is
the number of times the corresponding term occurs in the document. This
representation of a collection of documents is often called a document-term
matrix. Figure 2.2(d) shows a sample document-term matrix. The documents
are the rows of this matrix, while the terms are the columns. In practice, only
the nonzero entries of sparse data matrices are stored.
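A minimal sketch of this idea follows, assuming whitespace tokenization and two invented toy documents (neither comes from the text). Each document keeps only its nonzero term counts, and a dense row is built on demand for comparison.

```python
from collections import Counter


def term_vector(text):
    """Count term occurrences, ignoring word order (illustrative tokenizer)."""
    return Counter(text.lower().split())


# Invented toy documents, used only to illustrate the representation.
docs = {
    "Document 1": "team play play score game game",
    "Document 2": "coach coach timeout",
}

# Sparse storage: each row holds only the terms with a nonzero count.
sparse_rows = {name: dict(term_vector(text)) for name, text in docs.items()}

# Dense view: one column per vocabulary term, zeros filled in explicitly.
vocabulary = sorted({term for row in sparse_rows.values() for term in row})
dense_rows = {
    name: [row.get(term, 0) for term in vocabulary]
    for name, row in sparse_rows.items()
}

print(sparse_rows["Document 2"])
print(vocabulary)
print(dense_rows["Document 2"])
```

With a realistic vocabulary of many thousands of terms, the dense rows would be almost entirely zeros, which is why the sparse form is normally the one that is stored.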
Graph-Based Data
A graph can sometimes be a convenient and powerful representation for data.
We consider two specific cases: (1) the graph captures relationships among
data objects and (2) the data objects themselves are represented as graphs.
Data with Relationships among Objects The relationships among
objects frequently convey important information. In such cases, the data is often
represented as a graph. In particular, the data objects are mapped to nodes
of the graph, while the relationships among objects are captured by the links
between objects and link properties, such as direction and weight. Consider
Web pages on the World Wide Web, which contain both text and links to
other pages. In order to process search queries, Web search engines collect
and process Web pages to extract their contents. It is well known, however,
that the links to and from each page provide a great deal of information about
the relevance of a Web page to a query, and thus, must also be taken into
consideration. Figure 2.3(a) shows a set of linked Web pages.
Data with Objects That Are Graphs If objects have structure, that
is, the objects contain subobjects that have relationships, then such objects
are frequently represented as graphs. For example, the structure of chemical
compounds can be represented by a graph, where the nodes are atoms and the
links between nodes are chemical bonds. Figure 2.3(b) shows a ball-and-stick
diagram of the chemical compound benzene, which contains atoms of carbon
(black) and hydrogen (gray). A graph representation makes it possible to
determine which substructures occur frequently in a set of compounds and to
ascertain whether the presence of any of these substructures is associated with
the presence or absence of certain chemical properties, such as melting point
or heat of formation. Substructure mining, which is a branch of data mining
that analyzes such data, is considered in Section 7.5.
(a) Linked Web pages. (b) Benzene molecule.
Figure 2.3. Different variations of graph data.
Ordered Data
For some types of data, the attributes have relationships that involve order
in time or space. Different types of ordered data are described next and are
shown in Figure 2.4.
Sequential Data Sequential data, also referred to as temporal data, can
be thought of as an extension of record data, where each record has a time
associated with it. Consider a retail transaction data set that also stores the
time at which the transaction took place. This time information makes it
possible to find patterns such as “candy sales peak before Halloween.” A time
can also be associated with each attribute. For example, each record could
be the purchase history of a customer, with a listing of items purchased at
different times. Using this information, it is possible to find patterns such as
“people who buy DVD players tend to buy DVDs in the period immediately
following the purchase.”
Figure 2.4(a) shows an example of sequential transaction data. There
are five different times (t1, t2, t3, t4, and t5), three different customers (C1,
C2, and C3), and five different items (A, B, C, D, and E). In the top table,
each row corresponds to the items purchased at a particular time by a particular
customer. For instance, at time t3, customer C2 purchased items A and D. In
the bottom table, the same information is displayed, but each row corresponds
to a particular customer. Each row contains information on each transaction
involving the customer, where a transaction is considered to be a set of items
and the time at which those items were purchased. For example, customer C3
bought items A and C at time t2.
Time  Customer  Items Purchased
t1    C1        A, B
t2    C3        A, C
t2    C1        C, D
t3    C2        A, D
t4    C2        E
t5    C1        A, E
Customer  Time and Items Purchased
C1        (t1: A, B) (t2: C, D) (t5: A, E)
C2        (t3: A, D) (t4: E)
C3        (t2: A, C)
(a) Sequential transaction data.
GGTTCCGCCTTCAGCCCCGCGCC
CGCAGGGCCCGCCCCGCGCCGTC
GAGAAGGGCCCGCCTGGCGGGCG
GGGGGAGGCGGGGCCGCCCGAGC
CCAACCGAGTCCGACCAGGTGCC
CCCTCTGCTCGGCCTAGACCTGA
GCTCATTAGGCGGCAGCGGACAG
GCCAAGTAGAACACGCGAAGCGC
TGGGCTGCCTGCTGCGACCAGGG
(b) Genomic sequence data.
[Plot: Minneapolis Average Monthly Temperature (1982–1993). Horizontal axis: Year. Vertical axis: Temperature (Celsius).]
(c) Temperature time series.
[Map: Temp shown on a color scale. Horizontal axis: Longitude. Vertical axis: Latitude.]
(d) Spatial temperature data.
Figure 2.4. Different variations of ordered data.
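The two tables of Figure 2.4(a) hold the same information, and moving from the time-ordered view to the per-customer view is a simple grouping step. The sketch below assumes the rows are given as (time, customer, items) tuples and that time labels have the form "t" followed by an integer; the function name by_customer is hypothetical.

```python
from collections import defaultdict

# Rows of the top table in Figure 2.4(a): (time, customer, items purchased).
rows = [
    ("t1", "C1", ["A", "B"]),
    ("t2", "C3", ["A", "C"]),
    ("t2", "C1", ["C", "D"]),
    ("t3", "C2", ["A", "D"]),
    ("t4", "C2", ["E"]),
    ("t5", "C1", ["A", "E"]),
]


def by_customer(rows):
    """Group transactions by customer, keeping each customer's time order."""
    sequences = defaultdict(list)
    # Sort on the integer part of the time label so t10 would follow t9.
    for time, customer, items in sorted(rows, key=lambda r: int(r[0][1:])):
        sequences[customer].append((time, items))
    return dict(sequences)


sequences = by_customer(rows)
for customer in sorted(sequences):
    print(customer, sequences[customer])
```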
Sequence Data Sequence data consists of a data set that is a sequence of
individual entities, such as a sequence of words or letters. It is quite similar to
sequential data, except that there are no time stamps; instead, there are
positions in an ordered sequence. For example, the genetic information of plants
and animals can be represented in the form of sequences of nucleotides that
are known as genes. Many of the problems associated with genetic sequence
data involve predicting similarities in the structure and function of genes from
similarities in nucleotide sequences. Figure 2.4(b) shows a section of the
human genetic code expressed using the four nucleotides from which all DNA is
constructed: A, T, G, and C.
Time Series Data Time series data is a special type of sequential data
in which each record is a time series, i.e., a series of measurements taken
over time. For example, a financial data set might contain objects that are
time series of the daily prices of various stocks. As another example, consider
Figure 2.4(c), which shows a time series of the average monthly temperature
for Minneapolis during the years 1982 to 1994. When working with temporal
data, it is important to consider temporal autocorrelation; i.e., if two
measurements are close in time, then the values of those measurements are
often very similar.
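One common way to quantify temporal autocorrelation is the lag-1 sample autocorrelation coefficient. The sketch below uses that estimator as an assumption (the text does not prescribe a formula), and both series are invented for illustration.

```python
def autocorrelation(series, lag=1):
    """Sample autocorrelation at the given lag (values range from -1 to 1)."""
    n = len(series)
    mean = sum(series) / n
    # Total squared deviation from the mean (the normalizing term).
    total = sum((x - mean) ** 2 for x in series)
    # Co-deviation of each value with the value `lag` steps later.
    shifted = sum(
        (series[i] - mean) * (series[i + lag] - mean) for i in range(n - lag)
    )
    return shifted / total


# Invented series: one changes smoothly, the other flips sign at every step.
smooth = [1, 2, 3, 4, 5, 6, 7, 8]
alternating = [1, -1, 1, -1, 1, -1, 1, -1]

print(autocorrelation(smooth))       # positive: neighbors are similar
print(autocorrelation(alternating))  # negative: neighbors are dissimilar
```

A value near 1 indicates that measurements close in time are similar, so they should not be treated as independent observations.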
Spatial Data Some objects have spatial attributes, such as positions or
areas, as well as other types of attributes. An example of spatial data is weather
data (precipitation, temperature, pressure) that is collected for a variety of
geographical locations. An important aspect of spatial data is spatial
autocorrelation; i.e., objects that are physically close tend to be similar in other
ways as well. Thus, two points on the Earth that are close to each other
usually have similar values for temperature and rainfall.
Important examples of spatial data are the science and engineering data
sets that are the result of measurements or model output taken at regularly
or irregularly distributed points on a two- or three-dimensional grid or mesh.
For instance, Earth science data sets record the temperature or pressure
measured at points (grid cells) on latitude–longitude spherical grids of various
resolutions, e.g., 1° by 1°. (See Figure 2.4(d).) As another example, in the
simulation of the flow of a gas, the speed and direction of flow can be recorded
for each grid point in the simulation.
Handling Non-Record Data
Most data mining algorithms are designed for record data or its variations,
such as transaction data and data matrices. Record-oriented techniques can
be applied to non-record data by extracting features from data objects and
using these features to create a record corresponding to each object. Consider
the chemical structure data that was described earlier. Given a set of common
substructures, each compound can be represented as a record with binary
attributes that indicate whether a compound contains a specific substructure.
Such a representation is actually a transaction data set, where the transactions
are the compounds and the items are the substructures.
In some cases, it is easy to represent the data in a record format, but
this type of representation does not capture all the information in the data.
Consider spatiotemporal data consisting of a time series from each point on
a spatial grid. This data is often stored in a data matrix, where each row
represents a location and each column represents a particular point in time.
However, such a representation does not explicitly capture the time
relationships that are present among attributes and the spatial relationships that
exist among objects. This does not mean that such a representation is
inappropriate, but rather that these relationships must be taken into consideration
during the analysis. For example, it would not be a good idea to use a data
mining technique that assumes the attributes are statistically independent of
one another.
2.2 Data Quality
Data mining applications are often applied to data that was collected for
another purpose, or for future, but unspecified applications. For that reason,
data mining cannot usually take advantage of the significant benefits of
“addressing quality issues at the source.” In contrast, much of statistics deals
with the design of experiments or surveys that achieve a prespecified level of
data quality. Because preventing data quality problems is typically not an
option, data mining focuses on (1) the detection and correction of data quality
problems and (2) the use of algorithms that can tolerate poor data quality.
The first step, detection and correction, is often called data cleaning.
The following sections discuss specific aspects of data quality. The focus is
on measurement and data collection issues, although some application-related
issues are also discussed.
2.2.1 Measurement and Data Collection Issues
It is unrealistic to expect that data will be perfect. There may be problems due
to human error, limitations of measuring devices, or flaws in the data collection
process. Values or even entire data objects may be missing. In other cases,
there may be spurious or duplicate objects; i.e., multiple data objects that all
correspond to a single “real” object. For example, there might be two different
records for a person who has recently lived at two different addresses. Even if
all the data is present and “looks fine,” there may be inconsistencies—a person
has a height of 2 meters, but weighs only 2 kilograms.
In the next few sections, we focus on aspects of data quality that are related
to data measurement and collection. We begin with a definition of
measurement and data collection errors and then consider a variety of problems that
involve measurement error: noise, artifacts, bias, precision, and accuracy. We
conclude by discussing data quality issues that may involve both measurement
and data collection problems: outliers, missing and inconsistent values, and
duplicate data.
Measurement and Data Collection Errors
The term measurement error refers to any problem resulting from the
measurement process. A common problem is that the value recorded differs from
the true value to some extent. For continuous attributes, the numerical
difference of the measured and true value is called the error. The term data
collection error refers to errors such as omitting data objects or attribute
values, or inappropriately including a data object. For example, a study of
animals of a certain species might include animals of a related species that are
similar in appearance to the species of interest. Both measurement errors and
data collection errors can be either systematic or random.
We will only consider general types of errors. Within particular domains,
there are certain types of data errors that are commonplace, and there often
exist well-developed techniques for detecting and/or correcting these errors.
For example, keyboard errors are common when data is entered manually, and
as a result, many data entry programs have techniques for detecting and, with
human intervention, correcting such errors.
Noise and Artifacts
Noise is the random component of a measurement error. It may involve the
distortion of a value or the addition of spurious objects. Figure 2.5 shows a
time series before and after it has been disrupted by random noise. If a bit
more noise were added to the time series, its shape would be lost. Figure 2.6
shows a set of data points before and after some noise points (indicated by
‘+’s) have been added. Notice that some of the noise points are intermixed
with the non-noise points.
(a) Time series. (b) Time series with noise.
Figure 2.5. Noise in a time series context.
(a) Three groups of points. (b) With noise points (+) added.
Figure 2.6. Noise in a spatial context.
The term noise is often used in connection with data that has a spatial or
temporal component. In such cases, techniques from signal or image
processing can frequently be used to reduce noise and thus, help to discover patterns
(signals) that might be “lost in the noise.” Nonetheless, the elimination of
noise is frequently difficult, and much work in data mining focuses on
devising robust algorithms that produce acceptable results even when noise is
present.
Data errors may be the result of a more deterministic phenomenon, such
as a streak in the same place on a set of photographs. Such deterministic
distortions of the data are often referred to as artifacts.
Precision, Bias, and Accuracy
In statistics and experimental science, the quality of the measurement process
and the resulting data are measured by precision and bias. We provide the
standard definitions, followed by a brief discussion. For the following
definitions, we assume that we make repeated measurements of the same underlying
quantity and use this set of values to calculate a mean (average) value that
serves as our estimate of the true value.
Definition 2.3 (Precision). The closeness of repeated measurements (of the
same quantity) to one another.
Definition 2.4 (Bias). A systematic variation of measurements from the
quantity being measured.
Precision is often measured by the standard deviation of a set of values,
while bias is measured by taking the difference between the mean of the set
of values and the known value of the quantity being measured. Bias can
only be determined for objects whose measured quantity is known by means
external to the current situation. Suppose that we have a standard laboratory
weight with a mass of 1g and want to assess the precision and bias of our new
laboratory scale. We weigh the mass five times, and obtain the following five
values: {1.015, 0.990, 1.013, 1.001, 0.986}. The mean of these values is 1.001,
and hence, the bias is 0.001. The precision, as measured by the standard
deviation, is 0.013.
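The arithmetic of this example can be checked with Python's standard statistics module. The five measurements and the 1 g reference mass are those given above; the variable names are illustrative. The sample standard deviation (divisor n − 1) is used here because it reproduces the stated value of 0.013.

```python
import statistics

# Five weighings of the 1 g standard laboratory weight, from the example above.
measurements = [1.015, 0.990, 1.013, 1.001, 0.986]
true_mass = 1.0

mean = statistics.mean(measurements)
bias = mean - true_mass                      # systematic offset from the known value
precision = statistics.stdev(measurements)   # sample standard deviation (n - 1)

print(round(mean, 3), round(bias, 3), round(precision, 3))
# prints: 1.001 0.001 0.013
```

The population standard deviation (statistics.pstdev, divisor n) would round to 0.012 for the same values, so the choice of estimator matters when reporting precision for small samples.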
It is common to use the more general term, accuracy, to refer to the
degree of measurement error in data.
Definition 2.5 (Accuracy). The closeness of measurements to the true value
of the quantity being measured.
Accuracy depends on precision and bias, but since it is a general concept,
there is no specific formula for accuracy in terms of these two quantities.
One important aspect of accuracy is the use of significant digits. The
goal is to use only as many digits to represent the result of a measurement or
calculation as are justified by the precision of the data. For example, if the
length of an object is measured with a meter stick whose smallest markings are
millimeters, then we should only record the length of data to the nearest
millimeter. The precision of such a measurement would be ±0.5 mm. We do not
review the details of working with significant digits, as most readers will have
encountered them in previous courses, and they are covered in considerable
depth in science, engineering, and statistics textbooks.
Issues such as significant digits, precision, bias, and accuracy are sometimes
overlooked, but they are important for data mining as well as statistics and
science. Many times, data sets do not come with information on the precision
of the data, and furthermore, the programs used for analysis return results
without any such information. Nonetheless, without some understanding of
the accuracy of the data and the results, an analyst runs the risk of committing
serious data analysis blunders.
Outliers
Outliers are either (1) data objects that, in some sense, have characteristics
that are different from most of the other data objects in the data set, or
(2) values of an attribute that are unusual with respect to the typical values
for that attribute. Alternatively, we can speak of anomalous objects or
values. There is considerable leeway in the definition of an outlier, and many
different definitions have been proposed by the statistics and data mining
communities. Furthermore, it is important to distinguish between the notions
of noise and outliers. Outliers can be legitimate data objects or values. Thus,
unlike noise, outliers may sometimes be of interest. In fraud and network
intrusion detection, for example, the goal is to find unusual objects or events
from among a large number of normal ones. Chapter 10 discusses anomaly
detection in more detail.
Missing Values
It is not unusual for an object to be missing one or more attribute values.
In some cases, the information was not collected; e.g., some people decline to
give their age or weight. In other cases, some attributes are not applicable
to all objects; e.g., often, forms have conditional parts that are filled out only
when a person answers a previous question in a certain way, but for simplicity,
all fields are stored. Regardless, missing values should be taken into account
during the data analysis.
There are several strategies (and variations on these strategies) for dealing
with missing data, each of which may be appropriate in certain circumstances.
These strategies are listed next, along with an indication of their advantages
and disadvantages.
Eliminate Data Objects or Attributes A simple and effective strategy
is to eliminate objects with missing values. However, even a partially
specified data object contains some information, and if many objects have missing
values, then a reliable analysis can be difficult or impossible. Nonetheless, if
a data set has only a few objects that have missing values, then it may be
expedient to omit them. A related strategy is to eliminate attributes that
have missing values. This should be done with caution, however, since the
eliminated attributes may be the ones that are critical to the analysis.
Estimate Missing Values Sometimes missing data can be reliably
estimated. For example, consider a time series that changes in a reasonably
smooth fashion, but has a few, widely scattered missing values. In such cases,
the missing values can be estimated (interpolated) by using the remaining
values. As another example, consider a data set that has many similar data
points. In this situation, the attribute values of the points closest to the point
with the missing value are often used to estimate the missing value. If the
attribute is continuous, then the average attribute value of the nearest
neighbors is used; if the attribute is categorical, then the most commonly occurring
attribute value can be taken. For a concrete illustration, consider precipitation
measurements that are recorded by ground stations. For areas not containing
a ground station, the precipitation can be estimated using values observed at
nearby ground stations.
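For the smooth time series case, one simple estimator is linear interpolation between the nearest known values on either side of a gap. The sketch below assumes missing values are marked with None and equally spaced time steps; the function name and the sample series are invented for illustration, and other interpolation schemes would be equally valid.

```python
def interpolate_missing(series):
    """Fill None entries by linear interpolation between known neighbors.

    Gaps at the very start or end have no neighbor on one side and are
    left as None rather than extrapolated.
    """
    filled = list(series)
    known = [i for i, value in enumerate(series) if value is not None]
    # Walk over consecutive pairs of known positions and fill the gap between.
    for left, right in zip(known, known[1:]):
        step = (series[right] - series[left]) / (right - left)
        for i in range(left + 1, right):
            filled[i] = series[left] + step * (i - left)
    return filled


# Invented series with a few scattered missing values.
series = [1.0, None, 3.0, None, None, 6.0]
print(interpolate_missing(series))
```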
Ignore the Missing Value during Analysis Many data mining approaches
can be modified to ignore missing values. For example, suppose that objects
are being clustered and the similarity between pairs of data objects needs to
be calculated. If one or both objects of a pair have missing values for some
attributes, then the similarity can be calculated by using only the attributes
that do not have missing values. It is true that the similarity will only be
approximate, but unless the total number of attributes is small or the
number of missing values is high, this degree of inaccuracy may not matter much.
Likewise, many classification schemes can be modified to work with missing
values.
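As a rough sketch of this approach, the function below computes a Euclidean distance (used here as a stand-in for a similarity measure) over only those attributes that are present in both objects. Marking missing values with None and the function name are assumptions made for the illustration.

```python
import math


def distance_ignoring_missing(a, b):
    """Euclidean distance over attributes that are present in both objects.

    Returns None when the two objects share no non-missing attribute.
    """
    shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    if not shared:
        return None
    return math.sqrt(sum((x - y) ** 2 for x, y in shared))


# The second attribute of the first object is missing, so it is skipped.
print(distance_ignoring_missing([1.0, None, 3.0], [4.0, 5.0, 7.0]))  # prints: 5.0
```

Because pairs with more missing attributes are compared on fewer dimensions, their distances are not directly comparable to those of complete pairs, which is the approximation the text refers to.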
Inconsistent Values
Data can contain inconsistent values. Consider an address field, where both a
zip code and city are listed, but the specified zip code area is not contained in
that city. It may be that the individual entering this information transposed
two digits, or perhaps a digit was misread when the information was scanned
from a handwritten form. Regardless of the cause of the inconsistent values,
it is important to detect and, if possible, correct such problems.
Some types of inconsistencies are easy to detect. For instance, a person’s
height should not be negative. In other cases, it can be necessary to consult
an external source of information. For example, when an insurance company
processes claims for reimbursement, it checks the names and addresses on the
reimbursement forms against a database of its customers.
Once an inconsistency has been detected, it is sometimes possible to correct
the data. A product code may have “check” digits, or it may be possible to
double-check a product code against a list of known product codes, and then
correct the code if it is incorrect, but close to a known code. The correction
of an inconsistency requires additional or redundant information.
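The list-based correction can be sketched as follows. The text does not define what "close" means, so this example assumes that a code is close to a known code when the two have the same length and differ in at most one position; the product codes and function names are invented.

```python
def differing_positions(a, b):
    """Number of positions at which two equal-length codes differ."""
    return sum(x != y for x, y in zip(a, b))


def correct_code(code, known_codes, max_diff=1):
    """Return the code if valid, a unique close known code, or None."""
    if code in known_codes:
        return code
    close = [
        known
        for known in known_codes
        if len(known) == len(code)
        and differing_positions(known, code) <= max_diff
    ]
    # Correct only when exactly one known code is close; otherwise the
    # redundant information is not enough to decide, so give up.
    return close[0] if len(close) == 1 else None


# Invented list of valid product codes.
known_codes = {"A1234", "B5678", "C9012"}

print(correct_code("A1284", known_codes))  # one digit off from A1234
print(correct_code("Z0000", known_codes))  # no close match
```

Refusing to guess when several known codes are equally close reflects the point made above: a correction is only as reliable as the redundant information behind it.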
Example 2.6 (Inconsistent Sea Surface Temperature). This example
illustrates an inconsistency in actual time series data that measures the sea
surface temperature (SST) at various points on the ocean. SST data was
originally collected using ocean-based measurements from ships or buoys, but more
recently, satellites have been used to gather the data. To create a long-term
data set, both sources of data must be used. However, because the data comes
from different sources, the two parts of the data are subtly different. This
discrepancy is visually displayed in Figure 2.7, which shows the correlation of
SST values between pairs of years. If a pair of years has a positive correlation,
then the location corresponding to the pair of years is colored white; otherwise
it is colored black. (Seasonal variations were removed from the data since,
otherwise, all the years would be highly correlated.) There is a distinct change in
behavior where the data has been put together in 1983. Years within each of
the two groups, 1958–1982 and 1983–1999, tend to have a positive correlation
with one another, but a negative correlation with years in the other group.
This does not mean that this data should not be used, only that the analyst
should consider the potential impact of such discrepancies on the data mining
analysis.
[Plot: both axes show Year, with tick marks from 60 to 95.]
Figure 2.7. Correlation of SST data between pairs of years. White areas indicate positive correlation.
Black areas indicate negative correlation.
Duplicate Data
A data set may include data objects that are duplicates, or almost duplicates,
of one another. Many people receive duplicate mailings because they appear
in a database multiple times under slightly different names. To detect and
eliminate such duplicates, two main issues must be addressed. First, if there
are two objects that actually represent a single object, then the values of
corresponding attributes may differ, and these inconsistent values must be
resolved. Second, care needs to be taken to avoid accidentally combining data
objects that are similar, but not duplicates, such as two distinct people with
identical names. The term deduplication is often used to refer to the process
of dealing with these issues.
In some cases, two or more objects are identical with respect to the
attributes measured by the database, but they still represent different objects.
Here, the duplicates are legitimate, but may still cause problems for some
algorithms if the possibility of identical objects is not specifically accounted for
in their design. An example of this is given in Exercise 13 on page 91.
2.2.2 Issues Related to Applications
Data quality issues can also be considered from an application viewpoint as
expressed by the statement “data is of high quality if it is suitable for its
intended use.” This approach to data quality has proven quite useful,
particularly in business and industry. A similar viewpoint is also present in statistics
and the experimental sciences, with their emphasis on the careful design of
experiments to collect the data relevant to a specific hypothesis. As with quality
issues at the measurement and data collection level, there are many issues that
are specific to particular applications and fields. Again, we consider only a few
of the general issues.
Timeliness Some data starts to age as soon as it has been collected. In
particular, if the data provides a snapshot of some ongoing phenomenon or
process, such as the purchasing behavior of customers or Web browsing
patterns, then this snapshot represents reality for only a limited time. If the data
is out of date, then so are the models and patterns that are based on it.
Relevance The available data must contain the information necessary for
the application. Consider the task of building a model that predicts the
accident rate for drivers. If information about the age and gender of the driver is
omitted, then it is likely that the model will have limited accuracy unless this
information is indirectly available through other attributes.
Making sure that the objects in a data set are relevant is also challenging.
A common problem is sampling bias, which occurs when a sample does not
contain different types of objects in proportion to their actual occurrence in
the population. For example, survey data describes only those who respond to
the survey. (Other aspects of sampling are discussed further in Section 2.3.2.)
Because the results of a data analysis can reflect only the data that is present,
sampling bias will typically result in an erroneous analysis.
Knowledge about the Data Ideally, data sets are accompanied by
documentation that describes different aspects of the data; the quality of this
documentation can either aid or hinder the subsequent analysis. For example,
if the documentation identifies several attributes as being strongly related,
these attributes are likely to provide highly redundant information, and we
may decide to keep just one. (Consider sales tax and purchase price.) If the
documentation is poor, however, and fails to tell us, for example, that the
missing values for a particular field are indicated with a 9999, then our
analysis of the data may be faulty. Other important characteristics are the precision
of the data, the type of features (nominal, ordinal, interval, ratio), the scale
of measurement (e.g., meters or feet for length), and the origin of the data.
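To make the missing-value example concrete, the following minimal sketch shows how an undocumented sentinel distorts a simple average. The field name, the sample values, and the helper functions are hypothetical; only the sentinel 9999 comes from the text.

```python
MISSING_SENTINEL = 9999  # sentinel taken from the example in the text

def clean_field(values, sentinel=MISSING_SENTINEL):
    """Replace sentinel-coded entries with None so they are not treated as data."""
    return [None if v == sentinel else v for v in values]

def mean_ignoring_missing(values):
    """Average only the entries that are actually present."""
    present = [v for v in values if v is not None]
    return sum(present) / len(present) if present else None

incomes = [52, 48, 9999, 61, 9999]        # hypothetical field, in thousands
naive_mean = sum(incomes) / len(incomes)  # sentinel wrongly counted as data
clean_mean = mean_ignoring_missing(clean_field(incomes))
```

Here the naive mean is dominated by the two sentinel entries, while the cleaned mean reflects only the three recorded values.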
2.3 Data Preprocessing
In this section, we address the issue of which preprocessing steps should be
applied to make the data more suitable for data mining. Data preprocessing
is a broad area and consists of a number of different strategies and techniques
that are interrelated in complex ways. We will present some of the most
important ideas and approaches, and try to point out the interrelationships
among them. Specifically, we will discuss the following topics:
• Aggregation
• Sampling
• Dimensionality reduction
• Feature subset selection
• Feature creation
• Discretization and binarization
• Variable transformation
Roughly speaking, these items fall into two categories: selecting data
objects and attributes for the analysis or creating/changing the attributes. In
both cases the goal is to improve the data mining analysis with respect to
time, cost, and quality. Details are provided in the following sections.
A quick note on terminology: In the following, we sometimes use synonyms
for attribute, such as feature or variable, in order to follow common usage.
2.3.1 Aggregation
Sometimes “less is more” and this is the case with aggregation, the combining
of two or more objects into a single object. Consider a data set consisting of
transactions (data objects) recording the daily sales of products in various
store locations (Minneapolis, Chicago, Paris, . . .) for different days over the
course of a year. See Table 2.4. One way to aggregate transactions for this data
set is to replace all the transactions of a single store with a single storewide
transaction. This reduces the hundreds or thousands of transactions that occur
daily at a specific store to a single daily transaction, and the number of data
objects is reduced to the number of stores.
An obvious issue is how an aggregate transaction is created; i.e., how the
values of each attribute are combined across all the records corresponding to a
particular location to create the aggregate transaction that represents the sales
of a single store or date. Quantitative attributes, such as price, are typically
aggregated by taking a sum or an average. A qualitative attribute, such as
item, can either be omitted or summarized as the set of all the items that were
sold at that location.
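The aggregation rules just described can be sketched as follows. This is a minimal illustration, assuming records shaped like the rows of Table 2.4; the function name and the choice to group by store and date are assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical rows modeled on Table 2.4: (transaction_id, item, store, date, price)
transactions = [
    (101123, "Watch", "Chicago", "09/06/04", 25.99),
    (101123, "Battery", "Chicago", "09/06/04", 5.99),
    (101124, "Shoes", "Minneapolis", "09/06/04", 75.00),
]

def aggregate_by_store_and_date(rows):
    """Collapse all transactions for a (store, date) pair into one record."""
    totals = defaultdict(float)   # quantitative attribute (price): summed
    items = defaultdict(set)      # qualitative attribute (item): kept as a set
    for _tid, item, store, date, price in rows:
        key = (store, date)
        totals[key] += price
        items[key].add(item)
    return {key: {"total_price": round(totals[key], 2), "items": items[key]}
            for key in totals}

daily_store_sales = aggregate_by_store_and_date(transactions)
```

The three input transactions become two aggregate records, one per store for that date. Using an average instead of a sum, or dropping the item set entirely, would be equally valid choices under the description above.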
The data in Table 2.4 can also be viewed as a multidimensional array,
where each attribute is a dimension. From this viewpoint, aggregation is the
process of eliminating attributes, such as the type of item, or reducing the
number of values for a particular attribute; e.g., reducing the possible values
for date from 365 days to 12 months. This type of aggregation is commonly
used in Online Analytical Processing (OLAP), which is discussed further in
Chapter 3.
Table 2.4. Data set containing information about customer purchases.
Transaction ID  Item     Store Location  Date      Price   ...
...             ...      ...             ...       ...
101123          Watch    Chicago         09/06/04  $25.99  ...
101123          Battery  Chicago         09/06/04  $5.99   ...
101124          Shoes    Minneapolis     09/06/04  $75.00  ...
...             ...      ...             ...       ...
There are several motivations for aggregation. First, the smaller data sets
resulting from data reduction require less memory and processing time, and
hence, aggregation may permit the use of more expensive data mining
algorithms. Second, aggregation can act as a change of scope or scale by providing
a high-level view of the data instead of a low-level view. In the previous
example, aggregating over store locations and months gives us a monthly, per
store view of the data instead of a daily, per item view. Finally, the behavior
of groups of objects or attributes is often more stable than that of individual
objects or attributes. This statement reflects the statistical fact that aggregate
quantities, such as averages or totals, have less variability than the
individual objects being aggregated. For totals, the actual amount of variation is
larger than that of individual objects (on average), but the percentage of the
variation is smaller, while for means, the actual amount of variation is less
than that of individual objects (on average). A disadvantage of aggregation is
the potential loss of interesting details. In the store example, aggregating over
months loses information about which day of the week has the highest sales.
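The claim about the variability of totals and means can be checked with a small simulation. The sketch below assumes hypothetical daily values drawn independently from one normal distribution, which is a simplification; real sales data would show trends and seasonality.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical daily sales for one store, 120 blocks of 30 days each,
# drawn independently from the same distribution (an assumption).
days_per_month, n_months = 30, 120
daily = [random.gauss(100.0, 20.0) for _ in range(days_per_month * n_months)]

# Aggregate consecutive 30-day blocks into monthly means and monthly totals.
blocks = [daily[i:i + days_per_month] for i in range(0, len(daily), days_per_month)]
monthly_means = [statistics.mean(b) for b in blocks]
monthly_totals = [sum(b) for b in blocks]

def coefficient_of_variation(xs):
    """Standard deviation as a fraction of the mean (relative variation)."""
    return statistics.stdev(xs) / statistics.mean(xs)

sd_daily = statistics.stdev(daily)            # roughly 20
sd_means = statistics.stdev(monthly_means)    # roughly 20 / sqrt(30)
sd_totals = statistics.stdev(monthly_totals)  # roughly 20 * sqrt(30)
```

Under these assumptions the monthly means vary less than the daily values in absolute terms, while the monthly totals vary more in absolute terms but less as a fraction of their mean, which `coefficient_of_variation` makes visible.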
Example 2.7 (Australian Precipitation). This example is based on
precipitation in Australia from the period 1982 to 1993. Figure 2.8(a) shows
a histogram for the standard deviation of average monthly precipitation for
3,030 0.5° by 0.5° grid cells in Australia, while Figure 2.8(b) shows a histogram
for the standard deviation of the average yearly precipitation for the same
locations. The average yearly precipitation has less variability than the average
monthly precipitation. All precipitation measurements (and their standard
deviations) are in centimeters.
[Figure 2.8: two histograms; horizontal axis: Standard Deviation; vertical axis: Number of Land Locations. (a) Histogram of standard deviation of average monthly precipitation. (b) Histogram of standard deviation of average yearly precipitation.]
Figure 2.8. Histograms of standard deviation for monthly and yearly precipitation in Australia for the
period 1982 to 1993.
2.3.2 Sampling
Sampling is a commonly used approach for selecting a subset of the data
objects to be analyzed. In statistics, it has long been used for both the
preliminary investigation of the data and the final data analysis. Sampling can
also be very useful in data mining. However, the motivations for sampling
in statistics and data mining are often different. Statisticians use sampling
because obtaining the entire set of data of interest is too expensive or time
consuming, while data miners sample because it is too expensive or time
consuming to process all the data. In some cases, using a sampling algorithm can
reduce the data size to the point where a better, but more expensive algorithm
can be used.
The key principle for effective sampling is the following: Using a sample
will work almost as well as using the entire data set if the sample is
representative. In turn, a sample is representative if it has approximately the
same property (of interest) as the original set of data. If the mean (average)
of the data objects is the property of interest, then a sample is representative
if it has a mean that is close to that of the original data. Because sampling is
a statistical process, the representativeness of any particular sample will vary,
and the best that we can do is choose a sampling scheme that guarantees a
high probability of getting a representative sample. As discussed next, this
involves choosing the appropriate sample size and sampling techniques.
Sampling Approaches
There are many sampling techniques, but only a few of the most basic ones
and their variations will be covered here. The simplest type of sampling is
simple random sampling. For this type of sampling, there is an equal
probability of selecting any particular item. There are two variations on random
sampling (and other sampling techniques as well): (1) sampling without
replacement, in which each item is removed from the set of all objects
that together constitute the population as it is selected, and (2) sampling with
replacement, in which objects are not removed from the population as they are
selected for the sample. In sampling with replacement, the same object can be
picked more than once. The samples produced by the two methods are not much
different when samples are relatively small compared to the data set size, but
sampling with replacement is simpler to analyze since the probability of selecting any
object remains constant during the sampling process.
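The two variations can be illustrated with Python's standard `random` module. This is a minimal sketch; the population of integer IDs, the sample size, and the seed are arbitrary choices made for the example.

```python
import random

rng = random.Random(42)            # fixed seed; any seed works
population = list(range(1000))     # hypothetical object IDs
sample_size = 50

# Without replacement: each selected object leaves the pool, so no repeats.
without_replacement = rng.sample(population, sample_size)

# With replacement: the pool never changes, so an object can be drawn again.
with_replacement = rng.choices(population, k=sample_size)

n_distinct_with = len(set(with_replacement))
```

Because the sample is small relative to the population, `n_distinct_with` will usually be close to the sample size, which is the sense in which the two methods give similar samples.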
When the population consists of different types of objects, with widely
different numbers of objects, simple random sampling can fail to adequately
represent those types of objects that are less frequent. This can cause
problems when the analysis requires proper representation of all object types. For
example, when building classification models for rare classes, it is critical that
the rare classes be adequately represented in the sample. Hence, a sampling
scheme that can accommodate differing frequencies for the items of interest is
needed. Stratified sampling, which starts with prespecified groups of
objects, is such an approach. In the simplest version, equal numbers of objects
are drawn from each group even though the groups are of different sizes. In
another variation, the number of objects drawn from each group is proportional
to the size of that group.
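Both versions of stratified sampling can be sketched in a few lines. The function below is an illustrative assumption, not a standard library routine: `per_group` selects the equal-numbers version and `fraction` the proportional version, and the rounding rule for proportional draws is one of several reasonable choices.

```python
import random
from collections import defaultdict

def stratified_sample(objects, group_of, per_group=None, fraction=None, seed=0):
    """Draw from each prespecified group, without replacement.

    Pass per_group for equal-size draws, or fraction for proportional draws.
    """
    if (per_group is None) == (fraction is None):
        raise ValueError("specify exactly one of per_group or fraction")
    rng = random.Random(seed)
    groups = defaultdict(list)
    for obj in objects:
        groups[group_of(obj)].append(obj)
    sample = []
    for members in groups.values():
        if per_group is not None:
            k = min(per_group, len(members))
        else:
            k = max(1, round(fraction * len(members)))  # at least one per group
        sample.extend(rng.sample(members, k))
    return sample

# Hypothetical data: 990 "common" objects and 10 "rare" ones.
data = [("common", i) for i in range(990)] + [("rare", i) for i in range(10)]
label = lambda obj: obj[0]
equal = stratified_sample(data, label, per_group=5)
proportional = stratified_sample(data, label, fraction=0.1)
```

With these inputs the equal version returns five objects of each type, while the proportional version returns 99 common objects and 1 rare one, so only the equal version gives the rare class substantial weight.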
Example 2.8 (Sampling and Loss of Information). Once a sampling
technique has been selected, it is still necessary to choose the sample size.
Larger sample sizes increase the probability that a sample will be
representative, but they also eliminate much of the advantage of sampling. Conversely,
with smaller sample sizes, patterns may be missed or erroneous patterns can be
detected. Figure 2.9(a) shows a data set that contains 8000 two-dimensional
points, while Figures 2.9(b) and 2.9(c) show samples from this data set of size
2000 and 500, respectively. Although most of the structure of this data set is
present in the sample of 2000 points, much of the structure is missing in the
sample of 500 points.
(a) 8000 points (b) 2000 points (c) 500 points
Figure 2.9. Example of the loss of structure with sampling.
Example 2.9 (Determining the Proper Sample Size). To illustrate that
determining the proper sample size requires a methodical approach, consider
the following task.
Given a set of data that consists of a small number of almost
equal-sized groups, find at least one representative point for each of the
groups. Assume that the objects in each group are highly similar
to each other, but not very similar to objects in different groups.
Also assume that there are a relatively small number of groups,
e.g., 10. Figure 2.10(a) shows an idealized set of clusters (groups)
from which these points might be drawn.
This problem can be efficiently solved using sampling. One approach is to
take a small sample of data points, compute the pairwise similarities between
points, and then form groups of points that are highly similar. The desired
set of representative points is then obtained by taking one point from each of
these groups. To follow this approach, however, we need to determine a sample
size that would guarantee, with a high probability, the desired outcome; that
is, that at least one point will be obtained from each cluster. Figure 2.10(b)
shows the probability of getting one object from each of the 10 groups as the
sample size runs from 10 to 60. Interestingly, with a sample size of 20, there is
little chance (20%) of getting a sample that includes all 10 clusters. Even with
a sample size of 30, there is still a moderate chance (almost 40%) of getting a
sample that doesn’t contain objects from all 10 clusters. This issue is further
explored in the context of clustering by Exercise 4 on page 559.
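The probabilities quoted in Example 2.9 can be approximated with a short calculation. The sketch below assumes exactly equal-sized groups and independent draws that are uniform over groups (sampling with replacement, or a population much larger than the sample); the text does not state how Figure 2.10(b) was computed, so this is one plausible model rather than the authors' method.

```python
from math import comb

def prob_all_groups_covered(sample_size, n_groups=10):
    """P(a random sample contains at least one object from every group).

    Assumes equal-sized groups and independent, uniform draws over groups.
    Uses inclusion-exclusion over the set of groups that are missed.
    """
    return sum(
        (-1) ** k * comb(n_groups, k) * (1 - k / n_groups) ** sample_size
        for k in range(n_groups + 1)
    )

for n in (10, 20, 30, 40, 50, 60):
    print(n, round(prob_all_groups_covered(n), 3))
```

Under these assumptions the probability is about 0.21 for a sample of 20 and about 0.63 for a sample of 30, which agrees with the figures of 20% and almost 40% (for missing a cluster) given in the example.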
[Figure 2.10: (a) Ten groups of points. (b) Probability a sample contains points from each of 10 groups; horizontal axis: Sample Size; vertical axis: Probability.]
Figure 2.10. Finding representative points from 10 groups.
Progressive Sampling
The proper sample size can be difficult to determine, so adaptive or
progressive sampling schemes are sometimes used. These approaches start with a
small sample, and then increase the sample size until a sample of sufficient
size has been obtained. While this technique eliminates the need to determine
the correct sample size initially, it requires that there be a way to evaluate the
sample to judge if it is large enough.
Suppose, for instance, that progressive sampling is used to learn a
predictive model. Although the accuracy of predictive models increases as the
sample size increases, at some point the increase in accuracy levels off. We
want to stop increasing the sample size at this leveling-off point. By keeping
track of the change in accuracy of the model as we take progressively larger
samples, and by taking other samples close to the size of the current one, we
can get an estimate as to how close we are to this leveling-off point, and thus,
stop sampling.
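One simple way to realize this idea is to grow the sample geometrically and stop when the gain in accuracy falls below a threshold. The sketch below is an assumption-laden illustration: `accuracy_at` is a hypothetical caller-supplied function, the doubling schedule and the threshold are arbitrary, and the refinement of probing other sample sizes near the current one is omitted.

```python
def progressive_sample_size(accuracy_at, start=100, growth=2, max_size=100_000,
                            min_gain=0.005):
    """Grow the sample until the accuracy gain per step drops below min_gain.

    accuracy_at(n) is a caller-supplied (hypothetical) function that trains a
    model on a sample of size n and returns its estimated accuracy.
    """
    size = start
    accuracy = accuracy_at(size)
    while size * growth <= max_size:
        next_size = size * growth
        next_accuracy = accuracy_at(next_size)
        if next_accuracy - accuracy < min_gain:   # curve has leveled off
            return size, accuracy
        size, accuracy = next_size, next_accuracy
    return size, accuracy

# Stand-in learning curve for illustration only: accuracy rises toward 0.9
# and flattens as n grows. A real curve would come from training a model.
def toy_learning_curve(n):
    return 0.9 - 0.5 / n ** 0.5

chosen_size, chosen_accuracy = progressive_sample_size(toy_learning_curve)
```

With the toy curve, the loop stops at a sample size of 1600, because doubling to 3200 would improve accuracy by less than the chosen threshold.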
2.3.3 Dimensionality Reduction
Data sets can have a large number of features. Consider a set of documents,
where each document is represented by a vector whose components are the
frequencies with which each word occurs in the document. In such cases,
there are typically thousands or tens of thousands of attributes (components),
one for each word in the vocabulary. As another example, consider a set of
time series consisting of the daily closing price of various stocks over a period
of 30 years. In this case, the attributes, which are the prices on specific days,
again number in the thousands.
There are a variety of benefits to dimensionality reduction. A key benefit
is that many data mining algorithms work better if the dimensionality (the
number of attributes in the data) is lower. This is partly because
dimensionality reduction can eliminate irrelevant features and reduce noise and partly
because of the curse of dimensionality, which is explained below. Another
benefit is that a reduction of dimensionality can lead to a more understandable
model because the model may involve fewer attributes. Also, dimensionality
reduction may allow the data to be more easily visualized. Even if
dimensionality reduction doesn’t reduce the data to two or three dimensions, data
is often visualized by looking at pairs or triplets of attributes, and the
number of such combinations is greatly reduced. Finally, the amount of time and
memory required by the data mining algorithm is reduced with a reduction in
dimensionality.
The term dimensionality reduction is often reserved for those techniques
that reduce the dimensionality of a data set by creating new attributes that
are a combination of the old attributes. The reduction of dimensionality by
selecting new attributes that are a subset of the old is known as feature subset
selection or feature selection. It will be discussed in Section 2.3.4.
In the remainder of this section, we briefly introduce two important topics:
the curse of dimensionality and dimensionality reduction techniques based on
linear algebra approaches such as principal components analysis (PCA). More
details on dimensionality reduction can be found in Appendix B.
The Curse of Dimensionality
The curse of dimensionality refers to the phenomenon that many types of
data analysis become significantly harder as the dimensionality of the data
increases. Specifically, as dimensionality increases, the data becomes
increasingly sparse in the space that it occupies. For classification, this can mean
that there are not enough data objects to allow the creation of a model that
reliably assigns a class to all possible objects. For clustering, the definitions
of density and the distance between points, which are critical for clustering,
become less meaningful. (This is discussed further in Sections 9.1.2, 9.4.5, and
9.4.7.) As a result, many clustering and classification algorithms (and other
data analysis algorithms) have trouble with high-dimensional data, which shows
up as reduced classification accuracy and poor quality clusters.
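The loss of meaning in distances can be observed numerically. The following sketch measures how much farther the farthest point is than the nearest one from a random query point; the uniform distribution on the unit hypercube, the number of points, and the name `relative_contrast` are assumptions chosen only for illustration.

```python
import math
import random

def relative_contrast(dim, n_points=200, seed=0):
    """(farthest - nearest) / nearest distance from a random query point.

    Points are drawn uniformly from the unit hypercube; this is an assumed,
    idealized data distribution.
    """
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    dists = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(dim)]
        dists.append(math.dist(query, point))
    return (max(dists) - min(dists)) / min(dists)

contrast_by_dim = {d: relative_contrast(d) for d in (2, 10, 100, 1000)}
```

As the dimension grows, the contrast shrinks toward zero: the nearest and farthest points end up at nearly the same distance, so "near" and "far" carry less information.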
Linear Algebra Techniques for Dimensionality Reduction
Some of the most common approaches for dimensionality reduction,
particularly for continuous data, use techniques from linear algebra to project the
data from a high-dimensional space into a lower-dimensional space. Principal
Components Analysis (PCA) is a linear algebra technique for continuous
attributes that finds new attributes (principal components) that (1) are linear
combinations of the original attributes, (2) are orthogonal (perpendicular) to
each other, and (3) capture the maximum amount of variation in the data. For
example, the first two principal components capture as much of the variation
in the data as is possible with two orthogonal attributes that are linear
combinations of the original attributes. Singular Value Decomposition (SVD)
is a linear algebra technique that is related to PCA and is also commonly used
for dimensionality reduction. For additional details, see Appendices A and B.
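To give a feel for what a principal component is, the sketch below estimates the first one for a tiny data set. It uses power iteration on the sample covariance matrix, a technique swapped in here only to keep the example free of external libraries; it is not the procedure described in the text or its appendices, and the data points are hypothetical.

```python
def first_principal_component(rows, iterations=200):
    """Estimate the first principal component by power iteration on the
    sample covariance matrix. Returns (unit direction, fraction of total
    variance captured along that direction)."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    v = [1.0 / d ** 0.5] * d                     # arbitrary unit start vector
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    variance_along_v = sum(v[i] * cov[i][j] * v[j]
                           for i in range(d) for j in range(d))
    total_variance = sum(cov[i][i] for i in range(d))
    return v, variance_along_v / total_variance

# Hypothetical 2-D data lying close to the line y = 2x.
points = [(-2.0, -4.1), (-1.0, -1.9), (0.0, 0.1), (1.0, 2.1), (2.0, 3.9)]
direction, explained = first_principal_component(points)
```

For these points the recovered direction is a linear combination of the two original attributes that points along the line y = 2x, and it accounts for nearly all of the variation, so one new attribute could replace the original two with little loss.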
2.3.4 Feature Subset Selection
Another way to reduce the dimensionality is to use only a subset of the
features. While it might seem that such an approach would lose information, this
is not the case if redundant and irrelevant features are present. Redundant
features duplicate much or all of the information contained in one or more
other attributes. For example, the purchase price of a product and the amount
of sales tax paid contain much of the same information. Irrelevant features
contain almost no useful information for the data mining task at hand. For
instance, students’ ID numbers are irrelevant to the task of predicting
students’ grade point averages. Redundant and irrelevant features can reduce
classification accuracy and the quality of the clusters that are found.
While some irrelevant and redundant attributes can be eliminated
immediately by using common sense or domain knowledge, selecting the best subset
of features frequently requires a systematic approach. The ideal approach to
feature selection is to try all possible subsets of features as input to the data
mining algorithm of interest, and then take the subset that produces the best
results. This method has the advantage of reflecting the objective and bias of
the data mining algorithm that will eventually be used. Unfortunately, since
the number of subsets involving n attributes is $2^n$, such an approach is
impractical in most situations and alternative strategies are needed. There are three
standard approaches to feature selection: embedded, filter, and wrapper.
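The ideal approach and its cost can be sketched directly. In the example below the feature names and the scoring function are hypothetical stand-ins; in practice the score would come from running the data mining algorithm on each subset.

```python
from itertools import combinations

def all_nonempty_subsets(features):
    """Yield every non-empty subset of the feature list."""
    for size in range(1, len(features) + 1):
        yield from combinations(features, size)

def best_subset_exhaustive(features, score):
    """Try every subset and keep the best; only feasible for small n."""
    return max(all_nonempty_subsets(features), key=score)

features = ["age", "gender", "mileage", "driver_id"]   # hypothetical names

# Stand-in score for illustration: reward two useful features, penalize size.
def toy_score(subset):
    return len(set(subset) & {"age", "mileage"}) - 0.1 * len(subset)

n_enumerated = sum(1 for _ in all_nonempty_subsets(features))  # 2**4 - 1
best = best_subset_exhaustive(features, toy_score)
```

Four features give 15 non-empty subsets, which is easy to enumerate, but 30 features would already give more than a billion (2 to the power 30 is 1,073,741,824), which is why the exhaustive approach is rarely practical.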
Embedded approaches Feature selection occurs naturally as part of the
data mining algorithm. Specifically, during the operation of the data mining
algorithm, the algorithm itself decides which attributes to use and which to
ignore. Algorithms for building decision tree classifiers, which are discussed in
Chapter 4, often operate in this manner.
Filter approaches Features are selected before the data mining algorithm
is run, using some approach that is independent of the data mining task. For
example, we might select sets of attributes whose pairwise correlation is as low
as possible.
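A filter of this kind might look like the following sketch. The greedy rule and the 0.9 threshold are assumptions: the text asks only for low pairwise correlation and does not prescribe a procedure. The column names and values are hypothetical, and constant columns (zero variance) are not handled.

```python
def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

def filter_low_correlation(columns, threshold=0.9):
    """Greedily keep a feature only if it is not strongly correlated
    (|r| >= threshold) with any feature already kept."""
    kept = []
    for name in columns:
        if all(abs(pearson(columns[name], columns[k])) < threshold for k in kept):
            kept.append(name)
    return kept

# Hypothetical columns; sales_tax is an exact multiple of price.
columns = {
    "price":     [10.0, 20.0, 30.0, 40.0, 50.0],
    "sales_tax": [0.8, 1.6, 2.4, 3.2, 4.0],
    "weight":    [3.0, 1.0, 4.0, 1.0, 5.0],
}
selected = filter_low_correlation(columns)
```

Here `sales_tax` is dropped because it is perfectly correlated with `price`, while `weight` is kept. Note that the selection never consults the data mining task itself, which is what makes this a filter approach.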
Wrapper approaches These methods use the target data mining algorithm
as a black box to find the best subset of attributes, in a way similar to that
of the ideal algorithm described above, but typically without enumerating all
possible subsets.
Since the embedded approaches are algorithm-specific, only the filter and
wrapper approaches will be discussed further here.
An Architecture for Feature Subset Selection
It is possible to encompass both the filter and wrapper approaches within a
common architecture. The feature selection process is viewed as consisting of
four parts: a measure for evaluating a subset, a search strategy that controls
the generation of a new subset of features, a stopping criterion, and a
validation procedure. Filter methods and wrapper methods differ only in the way
in which they evaluate a subset of features. For a wrapper method, subset
evaluation uses the target data mining algorithm, while for a filter approach,
the evaluation technique is distinct from the target data mining algorithm.
The following discussion provides some details of this approach, which is
summarized in Figure 2.11.
Conceptually, feature subset selection is a search over all possible subsets
of features. Many different types of search strategies can be used, but the
search strategy should be computationally inexpensive and should find optimal
or near optimal sets of features. It is usually not possible to satisfy both
requirements, and thus, tradeoffs are necessary.
An integral part of the search is an evaluation step to judge how the current
subset of features compares to others that have been considered. This requires
an evaluation measure that attempts to determine the goodness of a subset of
attributes with respect to a particular data mining task, such as classification
or clustering. For the filter approach, such measures attempt to predict how
well the actual data mining algorithm will perform on a given set of attributes.
For the wrapper approach, where evaluation consists of actually running the
target data mining application, the subset evaluation function is simply the
criterion normally used to measure the result of the data mining.
[Figure 2.11: flowchart with the elements Attributes, Search Strategy, Subset of Attributes, Evaluation, Stopping Criterion (outcomes: Done, Not Done), Selected Attributes, and Validation Procedure.]
Figure 2.11. Flowchart of a feature subset selection process.
Because the number of subsets can be enormous and it is impractical to
examine them all, some sort of stopping criterion is necessary. This criterion is
usually based on one or more conditions involving the following: the number
of iterations, whether the value of the subset evaluation measure is optimal or
exceeds a certain threshold, whether a subset of a certain size has been
obtained, whether simultaneous size and evaluation criteria have been achieved,
and whether any improvement can be achieved by the options available to the
search strategy.
Finally, once a subset of features has been selected, the results of the
target data mining algorithm on the selected subset should be validated. A
straightforward evaluation approach is to run the algorithm with the full set
of features and compare the full results to results obtained using the subset of
features. Hopefully, the subset of features will produce results that are better
than or almost as good as those produced when using all features. Another
validation approach is to use a number of different feature selection algorithms
to obtain subsets of features and then compare the results of running the data
mining algorithm on each subset.
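The search, evaluation, and stopping parts of this architecture can be sketched with greedy forward selection. This is one common search strategy among many and is an assumption here, since the text does not commit to a particular one. The `evaluate` function is a hypothetical black box, and the final validation step is left out.

```python
def forward_selection(features, evaluate, max_features=None, min_gain=1e-9):
    """Greedy forward search over feature subsets.

    evaluate(subset) is a caller-supplied black box: for a wrapper approach it
    would run the target data mining algorithm; for a filter approach it would
    be an independent measure. Higher scores are assumed to be better.
    """
    selected, best_score = [], float("-inf")
    remaining = list(features)
    limit = max_features if max_features is not None else len(remaining)
    while remaining and len(selected) < limit:          # stopping: subset size
        # Search strategy: extend the current subset by one feature.
        scored = [(evaluate(selected + [f]), f) for f in remaining]
        score, feature = max(scored)
        if score - best_score < min_gain:               # stopping: no improvement
            break
        selected.append(feature)
        remaining.remove(feature)
        best_score = score
    return selected, best_score

# Stand-in evaluation for illustration only: two useful features, and a small
# penalty per feature. A real wrapper would train and score a model here.
USEFUL = {"age": 0.30, "mileage": 0.25}
def toy_evaluate(subset):
    return 0.5 + sum(USEFUL.get(f, 0.0) for f in subset) - 0.01 * len(subset)

chosen, chosen_score = forward_selection(
    ["age", "gender", "mileage", "driver_id"], toy_evaluate)
```

With the toy evaluation the search adds `age`, then `mileage`, and stops when no remaining feature improves the score. It evaluates far fewer subsets than exhaustive enumeration, at the price of possibly missing the optimal subset.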
Feature Weighting
Feature weighting is an alternative to keeping or eliminating features. More
important features are assigned a higher weight, while less important features
are given a lower weight. These weights are sometimes assigned based on
domain knowledge about the relative importance of features. Alternatively, they
may be determined automatically. For example, some classification schemes,
such as support vector machines (Chapter 5), produce classification models in
which each feature is given a weight. Features with larger weights play a more
important role in the model. The normalization of objects that takes place
when computing the cosine similarity (Section 2.4.5) can also be regarded as
a type of feature weighting.
2.3.5 Feature Creation
It is frequently possible to create, from the original attributes, a new set of
attributes that captures the important information in a data set much more
effectively. Furthermore, the number of new attributes can be smaller than the
number of original attributes, allowing us to reap all the previously described
benefits of dimensionality reduction. Three related methodologies for creating
new attributes are described next: feature extraction, mapping the data to a
new space, and feature construction.
Feature Extraction
The creation of a new set of features from the original raw data is known as
feature extraction. Consider a set of photographs, where each photograph
is to be classified according to whether or not it contains a human face. The
raw data is a set of pixels, and as such, is not suitable for many types of
classification algorithms. However, if the data is processed to provide
higher-level features, such as the presence or absence of certain types of edges and
areas that are highly correlated with the presence of human faces, then a much
broader set of classification techniques can be applied to this problem.
Unfortunately, in the sense in which it is most commonly used, feature
extraction is highly domain-specific. For a particular field, such as image
processing, various features and the techniques to extract them have been
developed over a period of time, and often these techniques have limited
applicability to other fields. Consequently, whenever data mining is applied to a
relatively new area, a key task is the development of new features and feature
extraction methods.
[Figure 2.12: (a) Two time series. (b) Noisy time series. (c) Power spectrum. Horizontal axes: Time (seconds) for (a) and (b), Frequency for (c).]
Figure 2.12. Application of the Fourier transform to identify the underlying frequencies in time series
data.
Mapping the Data to a New Space
A totally different view of the data can reveal important and interesting
features. Consider, for example, time series data, which often contains periodic
patterns. If there is only a single periodic pattern and not much noise, then
the pattern is easily detected. If, on the other hand, there are a number of
periodic patterns and a significant amount of noise is present, then these
patterns are hard to detect. Such patterns can, nonetheless, often be detected
by applying a Fourier transform to the time series in order to change to a
representation in which frequency information is explicit. In the example that
follows, it will not be necessary to know the details of the Fourier transform.
It is enough to know that, for each time series, the Fourier transform produces
a new data object whose attributes are related to frequencies.
Example 2.10 (Fourier Analysis). The time series presented in Figure
2.12(b) is the sum of three other time series, two of which are shown in Figure
2.12(a) and have frequencies of 7 and 17 cycles per second, respectively. The
third time series is random noise. Figure 2.12(c) shows the power spectrum
that can be computed after applying a Fourier transform to the original time
series. (Informally, the power spectrum is proportional to the square of each
frequency attribute.) In spite of the noise, there are two peaks that correspond
to the frequencies of the two original, non-noisy time series. Again, the main point
is that better features can reveal important aspects of the data.
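The idea behind Example 2.10 can be reproduced in miniature. The sketch below builds a noisy series from 7 and 17 cycle-per-second sine waves and computes a power spectrum with a naive discrete Fourier transform. The sampling rate, the noise level, and the seed are assumptions and do not match the data behind Figure 2.12; a practical implementation would use a fast Fourier transform routine instead of this quadratic-time loop.

```python
import cmath
import math
import random

def power_spectrum(signal):
    """Squared magnitude of the discrete Fourier transform (naive O(N^2) DFT),
    for the non-negative frequency bins 0 .. N/2 - 1."""
    n = len(signal)
    spectrum = []
    for k in range(n // 2):
        coeff = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(signal))
        spectrum.append(abs(coeff) ** 2)
    return spectrum

rng = random.Random(0)
n_samples = 256                      # one second sampled at 256 Hz (assumed)
times = [t / n_samples for t in range(n_samples)]
noisy = [math.sin(2 * math.pi * 7 * t)      # 7 cycles per second
         + math.sin(2 * math.pi * 17 * t)   # 17 cycles per second
         + rng.gauss(0.0, 1.0)              # noise level is an assumption
         for t in times]

power = power_spectrum(noisy)
# With a one-second window, bin k corresponds to k cycles per second.
top_two = sorted(sorted(range(len(power)), key=power.__getitem__)[-2:])
```

Although the two sine waves are hard to see in the raw noisy values, the two largest entries of the power spectrum fall at bins 7 and 17, which is the sense in which the new representation makes the frequency information explicit.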
Many other sorts of transformations are also possible. Besides the Fourier
transform, the wavelet transform has also proven very useful for time series
and other types of data.
Feature Construction
Sometimes the features in the original data sets have the necessary information,
but it is not in a form suitable for the data mining algorithm. In this situation,
one or more new features constructed out of the original features can be more
useful than the original features.
Example 2.11 (Density). To illustrate this, consider a data set consisting
of information about historical artifacts, which, along with other information,
contains the volume and mass of each artifact. For simplicity, assume that
these artifacts are made of a small number of materials (wood, clay, bronze,
gold) and that we want to classify the artifacts with respect to the material
of which they are made. In this case, a density feature constructed from the
mass and volume features, i.e., density = mass/volume, would most directly
yield an accurate classification. Although there have been some attempts to
automatically perform feature construction by exploring simple mathematical
combinations of existing attributes, the most common approach is to construct
features using domain expertise.
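Example 2.11 can be sketched as follows. The reference densities are rough, illustrative values in grams per cubic centimeter and are assumptions rather than data from the text; the nearest-density rule is likewise just one simple way to use the constructed feature.

```python
# Approximate, illustrative densities in g/cm^3; real materials vary, and these
# reference values are assumptions rather than data from the text.
REFERENCE_DENSITY = {"wood": 0.7, "clay": 1.8, "bronze": 8.8, "gold": 19.3}

def add_density(artifact):
    """Return a copy of the record with density = mass / volume added."""
    enriched = dict(artifact)
    enriched["density"] = artifact["mass_g"] / artifact["volume_cm3"]
    return enriched

def guess_material(artifact):
    """Pick the material whose reference density is closest."""
    density = add_density(artifact)["density"]
    return min(REFERENCE_DENSITY, key=lambda m: abs(REFERENCE_DENSITY[m] - density))

# Hypothetical artifacts: similar mass, very different volume.
ring = {"mass_g": 193.0, "volume_cm3": 10.0}
bowl = {"mass_g": 180.0, "volume_cm3": 100.0}
```

Neither mass nor volume alone separates the two artifacts well, since their masses are similar, but the constructed density feature does: `guess_material(ring)` gives gold and `guess_material(bowl)` gives clay.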
2.3.6 Discretization and Binarization
Some data mining algorithms, especially certain classification algorithms,
require that the data be in the form of categorical attributes. Algorithms that
find association patterns require that the data be in the form of binary
attributes. Thus, it is often necessary to transform a continuous attribute into
a categorical attribute (discretization), and both continuous and discrete
attributes may need to be transformed into one or more binary attributes
(binarization). Additionally, if a categorical attribute has a large number of
values (categories), or some values occur infrequently, then it may be beneficial
for certain data mining tasks to reduce the number of categories by combining
some of the values.
As with feature selection, the best discretization and binarization approach
is the one that “produces the best result for the data mining algorithm that
will be used to analyze the data.” It is typically not practical to apply such a
criterion directly. Consequently, discretization or binarization is performed in
a way that satisfies a criterion that is thought to have a relationship to good
performance for the data mining task being considered.
Table 2.5. Conversion of a categorical attribute to three binary attributes.
Categorical Value  Integer Value  x1  x2  x3
awful              0              0   0   0
poor               1              0   0   1
OK                 2              0   1   0
good               3              0   1   1
great              4              1   0   0
Table 2.6. Conversion of a categorical attribute to five asymmetric binary attributes.
Categorical Value  Integer Value  x1  x2  x3  x4  x5
awful              0              1   0   0   0   0
poor               1              0   1   0   0   0
OK                 2              0   0   1   0   0
good               3              0   0   0   1   0
great              4              0   0   0   0   1
Binarization
A simple technique to binarize a categorical attribute is the following: If there
are m categorical values, then uniquely assign each original value to an integer
in the interval [0, m − 1]. If the attribute is ordinal, then order must be
maintained by the assignment. (Note that even if the attribute is originally
represented using integers, this process is necessary if the integers are not in the
interval [0, m − 1].) Next, convert each of these m integers to a binary number.
Since $n = \lceil \log_2(m) \rceil$ binary digits are required to represent these integers,
represent these binary numbers using n binary attributes. To illustrate, a
categorical variable with 5 values {awful, poor, OK, good, great} would require
three binary variables x1, x2, and x3. The conversion is shown in Table 2.5.
Such a transformation can cause complications, such as creating unintended
relationships among the transformed attributes. For example, in Table
2.5, attributes x2 and x3 are correlated because information about the good
value is encoded using both attributes. Furthermore, association analysis
requires asymmetric binary attributes, where only the presence of the attribute
(value = 1) is important. For association problems, it is therefore necessary to
introduce one binary attribute for each categorical value, as in Table 2.6. If the
number of resulting attributes is too large, then the techniques described below
can be used to reduce the number of categorical values before binarization.
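The two encodings discussed above can be sketched in a few lines. This is a minimal illustration rather than a reference implementation; the function names `binary_encode` and `one_hot_encode` are hypothetical, and the values are assumed to be supplied already in their intended order.

```python
from math import ceil, log2

def binary_encode(values):
    """Map each value to n = ceil(log2(m)) binary attributes (as in Table 2.5)."""
    m = len(values)
    n = max(1, ceil(log2(m)))  # at least one bit, even for a single value
    return {v: [int(b) for b in format(i, f"0{n}b")] for i, v in enumerate(values)}

def one_hot_encode(values):
    """Map each value to m asymmetric binary attributes (as in Table 2.6)."""
    m = len(values)
    return {v: [1 if j == i else 0 for j in range(m)] for i, v in enumerate(values)}

quality = ["awful", "poor", "OK", "good", "great"]  # ordinal, listed in order
table_2_5 = binary_encode(quality)
table_2_6 = one_hot_encode(quality)
```

Note that the first encoding needs only three attributes for five values, while the second needs five but keeps each attribute tied to the presence of exactly one value.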
Likewise, for association problems, it may be necessary to replace a single
binary attribute with two asymmetric binary attributes. Consider a binary
attribute that records a person’s gender, male or female. For traditional
association rule algorithms, this information needs to be transformed into two
asymmetric binary attributes, one that is a 1 only when the person is male
and one that is a 1 only when the person is female. (For asymmetric binary
attributes, the information representation is somewhat inefficient in that two
bits of storage are required to represent each bit of information.)
Discretization of Continuous Attributes
Discretization is typically applied to attributes that are used in classification
or association analysis. In general, the best discretization depends on the
algorithm being used, as well as the other attributes being considered. Typically,
however, the discretization of an attribute is considered in isolation.
Transformation of a continuous attribute to a categorical attribute involves
two subtasks: deciding how many categories to have and determining how to
map the values of the continuous attribute to these categories. In the first step,
after the values of the continuous attribute are sorted, they are then divided
into n intervals by specifying n − 1 split points. In the second, rather trivial
step, all the values in one interval are mapped to the same categorical value.
Therefore, the problem of discretization is one of deciding how many split
points to choose and where to place them. The result can be represented
either as a set of intervals $\{(x_0, x_1], (x_1, x_2], \ldots, (x_{n-1}, x_n)\}$, where $x_0$ and $x_n$
may be $-\infty$ or $+\infty$, respectively, or equivalently, as a series of inequalities
$x_0 < x \le x_1, \ldots, x_{n-1} < x < x_n$.
Unsupervised Discretization A basic distinction between discretization
methods for classification is whether class information is used (supervised) or
not (unsupervised). If class information is not used, then relatively simple
approaches are common. For instance, the equal width approach divides the
range of the attribute into a user-specified number of intervals each having the
same width. Such an approach can be badly affected by outliers, and for that
reason, an equal frequency (equal depth) approach, which tries to put
the same number of objects into each interval, is often preferred. As another
example of unsupervised discretization, a clustering method, such as K-means
(see Chapter 8), can also be used. Finally, visually inspecting the data can
sometimes be an effective approach.
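As a rough sketch of the equal width and equal frequency approaches, the following code computes split points for a small list of values. The function names and the sample data are hypothetical, and placing an equal frequency split midway between neighboring values is an assumption made here for illustration. The number of intervals is assumed not to exceed the number of values.

```python
def equal_width_splits(values, n_intervals):
    """Return n_intervals - 1 split points that divide the range into equal widths."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_intervals
    return [lo + i * width for i in range(1, n_intervals)]

def equal_frequency_splits(values, n_intervals):
    """Return split points that put roughly the same number of values in each interval."""
    ordered = sorted(values)
    m = len(ordered)
    splits = []
    for i in range(1, n_intervals):
        cut = i * m // n_intervals  # index of the first value in the next interval
        splits.append((ordered[cut - 1] + ordered[cut]) / 2)
    return splits

data = [1, 2, 3, 4, 5, 6, 7, 8, 100]  # 100 plays the role of an outlier
width_splits = equal_width_splits(data, 3)
frequency_splits = equal_frequency_splits(data, 3)
```

With this data, the outlier stretches the range so much that the equal width splits leave almost all values in the first interval, whereas the equal frequency splits place three values in each.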
Example 2.12 (Discretization Techniques). This example demonstrates
how these approaches work on an actual data set. Figure 2.13(a) shows data
points belonging to four different groups, along with two outliers—the large
dots on either end. The techniques of the previous paragraph were applied
to discretize the x values of these data points into four categorical values.
(Points in the data set have a random y component to make it easy to see
how many points are in each group.) Visually inspecting the data works quite
well, but is not automatic, and thus, we focus on the other three approaches.
The split points produced by the techniques equal width, equal frequency, and
K-means are shown in Figures 2.13(b), 2.13(c), and 2.13(d), respectively. The
split points are represented as dashed lines. If we measure the performance of
a discretization technique by the extent to which objects in the same group are
assigned the same categorical value and objects in different groups are assigned
different values, then K-means performs best, followed by equal frequency, and
finally, equal width.
Supervised Discretization The discretization methods described above
are usually better than no discretization, but keeping the end purpose in mind
and using additional information (class labels) often produces better results.
This should not be surprising, since an interval constructed with no knowledge
of class labels often contains a mixture of class labels. A conceptually simple
approach is to place the splits in a way that maximizes the purity of the
intervals. In practice, however, such an approach requires potentially arbitrary
decisions about the purity of an interval and the minimum size of an interval.
To overcome such concerns, some statistically based approaches start with each
attribute value as a separate interval and create larger intervals by merging
adjacent intervals that are similar according to a statistical test. Entropy-based
approaches are among the most promising approaches to discretization,
and a simple approach based on entropy will be presented.
Figure 2.13. Different discretization techniques: (a) original data; (b) equal
width discretization; (c) equal frequency discretization; (d) K-means
discretization. (The horizontal axis of each panel runs from 0 to 20.)
First, it is necessary to define entropy. Let $k$ be the number of different
class labels, $m_i$ be the number of values in the $i$th interval of a partition, and
$m_{ij}$ be the number of values of class $j$ in interval $i$. Then the entropy $e_i$ of the
$i$th interval is given by the equation
$$e_i = -\sum_{j=1}^{k} p_{ij} \log_2 p_{ij},$$
where $p_{ij} = m_{ij}/m_i$ is the probability (fraction of values) of class $j$ in the $i$th
interval. The total entropy, $e$, of the partition is the weighted average of the
individual interval entropies, i.e.,
$$e = \sum_{i=1}^{n} w_i e_i,$$
where $m$ is the number of values, $w_i = m_i/m$ is the fraction of values in the
$i$th interval, and $n$ is the number of intervals. Intuitively, the entropy of an
interval is a measure of the purity of an interval. If an interval contains only
values of one class (is perfectly pure), then the entropy is 0 and it contributes
nothing to the overall entropy. If the classes of values in an interval occur
equally often (the interval is as impure as possible), then the entropy is a
maximum.
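The entropy of an interval and of a partition can be computed directly from the class labels that fall in each interval. The sketch below is illustrative only: the function names and the toy labels are hypothetical, and classes that do not occur in an interval are simply skipped, which treats their contribution as 0.

```python
from collections import Counter
from math import log2

def interval_entropy(labels):
    """Entropy of one interval, given the class labels of the values in it."""
    m_i = len(labels)
    # Counter only lists classes that occur, so log2 is never taken of 0.
    return -sum((c / m_i) * log2(c / m_i) for c in Counter(labels).values())

def partition_entropy(intervals):
    """Weighted average of the interval entropies, with weights m_i / m."""
    m = sum(len(labels) for labels in intervals)
    return sum(len(labels) / m * interval_entropy(labels) for labels in intervals)

pure = ["a", "a", "a", "a"]   # one class only
mixed = ["a", "a", "b", "b"]  # two classes, equally frequent
total = partition_entropy([pure, mixed])
```

Here the pure interval contributes nothing, and the evenly mixed two-class interval has the largest entropy possible for two classes, namely 1 bit.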
A simple approach for partitioning a continuous attribute starts by bisecting
the initial values so that the resulting two intervals give minimum entropy.
This technique only needs to consider each value as a possible split point,
because it is assumed that intervals contain ordered sets of values. The splitting
process is then repeated with another interval, typically choosing the interval
with the worst (highest) entropy, until a user-specified number of intervals is
reached, or a stopping criterion is satisfied.
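A single bisection step of this procedure might look as follows. This is a sketch under stated assumptions, not the authors' implementation: the names `entropy` and `best_split` are hypothetical, candidate split points are placed midway between consecutive distinct values, and repeating the step on the interval with the highest entropy is left out for brevity.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Entropy of the class labels in one interval."""
    m = len(labels)
    return -sum((c / m) * log2(c / m) for c in Counter(labels).values())

def best_split(values, labels):
    """Return (split_point, total_entropy) for the minimum-entropy bisection."""
    pairs = sorted(zip(values, labels))
    m = len(pairs)
    best = None
    for i in range(1, m):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no split point fits between two equal values
        left = [label for _, label in pairs[:i]]
        right = [label for _, label in pairs[i:]]
        e = len(left) / m * entropy(left) + len(right) / m * entropy(right)
        if best is None or e < best[1]:
            best = ((pairs[i - 1][0] + pairs[i][0]) / 2, e)
    return best

x = [1.0, 2.0, 3.0, 7.0, 8.0, 9.0]
y = ["a", "a", "a", "b", "b", "b"]
split, e = best_split(x, y)
```

For this toy data the chosen split falls between 3.0 and 7.0, which separates the two classes perfectly.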
Example 2.13 (Discretization of Two Attributes). This method was
used to independently discretize both the x and y attributes of the
two-dimensional data shown in Figure 2.14. In the first discretization, shown in
Figure 2.14(a), the x and y attributes were both split into three intervals. (The
dashed lines indicate the split points.) In the second discretization, shown in
Figure 2.14(b), the x and y attributes were both split into five intervals.
This simple example illustrates two aspects of discretization. First, in two
dimensions, the classes of points are well separated, but in one dimension, this
is not so. In general, discretizing each attribute separately often produces
suboptimal results. Second, five intervals work better than three, but six
intervals do not improve the discretization much, at least in terms of entropy.
(Entropy values and results for six intervals are not shown.) Consequently,
it is desirable to have a stopping criterion that automatically finds the right
number of partitions.
Categorical Attributes with Too Many Values
Categorical attributes can sometimes have too many values. If the categorical
attribute is an ordinal attribute, then techniques similar to those for
continuous attributes can be used to reduce the number of categories. If the
categorical attribute is nominal, however, then other approaches are needed.
Consider a university that has a large number of departments. Consequently,
a department name attribute might have dozens of different values. In this
situation, we could use our knowledge of the relationships among different
departments to combine departments into larger groups, such as engineering,
social sciences, or biological sciences. If domain knowledge does not serve as
a useful guide or such an approach results in poor classification performance,
then it is necessary to use a more empirical approach, such as grouping values
together only if such a grouping results in improved classification accuracy or
achieves some other data mining objective.
Figure 2.14. Discretizing x and y attributes for four groups (classes) of points:
(a) three intervals; (b) five intervals. (Both axes, x and y, run from 0 to 5.)
2.3.7 Variable Transformation
A variable transformation refers to a transformation that is applied to all
the values of a variable. (We use the term variable instead of attribute to
adhere to common usage, although we will also refer to attribute transformation
on occasion.) In other words, for each object, the transformation is applied to
the value of the variable for that object. For example, if only the magnitude
of a variable is important, then the values of the variable can be transformed
by taking the absolute value. In the following section, we discuss two
important types of variable transformations: simple functional transformations and
normalization.
Simple Functions
For this type of variable transformation, a simple mathematical function is
applied to each value individually. If $x$ is a variable, then examples of such
transformations include $x^k$, $\log x$, $e^x$, $\sqrt{x}$, $1/x$, $\sin x$, or $|x|$. In statistics,
variable transformations, especially sqrt, log, and $1/x$, are often used to transform
data that does not have a Gaussian (normal) distribution into data that does.
While this can be important, other reasons often take precedence in data
mining. Suppose the variable of interest is the number of data bytes in a session,
and the number of bytes ranges from 1 to 1 billion. This is a huge range, and
it may be advantageous to compress it by using a $\log_{10}$ transformation. In
this case, sessions that transferred $10^8$ and $10^9$ bytes would be more similar
to each other than sessions that transferred 10 and 1000 bytes ($9 - 8 = 1$
versus $3 - 1 = 2$). For some applications, such as network intrusion detection,
this may be what is desired, since the first two sessions most likely represent
transfers of large files, while the latter two sessions could be two quite distinct
types of sessions.
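The arithmetic behind this comparison is easy to check. The snippet below is a small illustration; the helper name `log_gap` is hypothetical.

```python
from math import log10

def log_gap(a, b):
    """Difference between two byte counts after a log10 transformation."""
    return abs(log10(a) - log10(b))

large_sessions = log_gap(10**8, 10**9)  # two large transfers
small_sessions = log_gap(10, 1000)      # two much smaller transfers
raw_large = abs(10**9 - 10**8)          # difference on the original scale
raw_small = abs(1000 - 10)
```

On the original scale the two large sessions are far further apart than the two small ones; after the transformation the ordering is reversed.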
Variable transformations should be applied with caution since they change
the nature of the data. While this is what is desired, there can be problems
if the nature of the transformation is not fully appreciated. For instance, the
transformation $1/x$ reduces the magnitude of values that are larger than 1, but
increases the magnitude of values between 0 and 1. To illustrate, the values
$\{1, 2, 3\}$ go to $\{1, \frac{1}{2}, \frac{1}{3}\}$, but the values $\{1, \frac{1}{2}, \frac{1}{3}\}$ go to $\{1, 2, 3\}$. Thus, for
all sets of positive values, the transformation $1/x$ reverses the order. To help clarify
the effect of a transformation, it is important to ask questions such as the
following: Does the order need to be maintained? Does the transformation
apply to all values, especially negative values and 0? What is the effect of
the transformation on the values between 0 and 1? Exercise 17 on page 92
explores other aspects of variable transformation.
Normalization or Standardization
Another common type of variable transformation is the standardization or
normalization of a variable. (In the data mining community the terms are
often used interchangeably. In statistics, however, the term normalization can
be confused with the transformations used for making a variable normal, i.e.,
Gaussian.) The goal of standardization or normalization is to make an
entire set of values have a particular property. A traditional example is that
of “standardizing a variable” in statistics. If $\bar{x}$ is the mean (average) of the
attribute values and $s_x$ is their standard deviation, then the transformation
$x' = (x - \bar{x})/s_x$ creates a new variable that has a mean of 0 and a standard
deviation of 1. If different variables are to be combined in some way, then
such a transformation is often necessary to avoid having a variable with large
values dominate the results of the calculation. To illustrate, consider
comparing people based on two variables: age and income. For any two people, the
difference in income will likely be much higher in absolute terms (hundreds or
thousands of dollars) than the difference in age (less than 150). If the
differences in the range of values of age and income are not taken into account, then
the comparison between people will be dominated by differences in income. In
particular, if the similarity or dissimilarity of two people is calculated using the
similarity or dissimilarity measures defined later in this chapter, then in many
cases, such as that of Euclidean distance, the income values will dominate the
calculation.
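The standardization described above can be sketched with the standard library. The function name and the age and income figures are made up for illustration, and the sample standard deviation is assumed because the text does not say whether the sample or population form is intended.

```python
from statistics import mean, stdev

def standardize(values):
    """Subtract the mean and divide by the standard deviation."""
    x_bar = mean(values)
    s_x = stdev(values)  # sample standard deviation (an assumption); needs varying values
    return [(x - x_bar) / s_x for x in values]

ages = [25, 40, 58]                  # hypothetical ages in years
incomes = [30_000, 52_000, 110_000]  # hypothetical incomes in dollars

z_ages = standardize(ages)
z_incomes = standardize(incomes)
```

After the transformation both variables are on a comparable scale, so neither dominates a distance computed from the two of them.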
The mean and standard deviation are strongly affected by outliers, so the
above transformation is often modified. First, the mean is replaced by the
median, i.e., the middle value. Second, the standard deviation is replaced by
the absolute standard deviation. Specifically, if $x$ is a variable, then the
absolute standard deviation of $x$ is given by $\sigma_A = \sum_{i=1}^{m} |x_i - \mu|$, where $x_i$ is
the ith value of the variable, m is the number of objects, and µ is either the
mean or median. Other approaches for computing estimates of the location
(center) and spread of a set of values in the presence of outliers are described
in Sections 3.2.3 and 3.2.4, respectively. These measures can also be used to
define a standardization transformation.
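A sketch of the outlier-resistant variant follows. It uses the median as the center and the absolute standard deviation exactly as defined above, that is, as a sum of absolute deviations; the function names and data are hypothetical.

```python
from statistics import median

def absolute_standard_deviation(values, center):
    """Sum of absolute deviations from the chosen center (mean or median)."""
    return sum(abs(x - center) for x in values)

def robust_standardize(values):
    """Center on the median and scale by the absolute standard deviation."""
    mu = median(values)
    sigma_a = absolute_standard_deviation(values, mu)  # assumed to be nonzero
    return [(x - mu) / sigma_a for x in values]

data = [1, 2, 3, 4, 100]  # 100 is an outlier
sigma_a = absolute_standard_deviation(data, median(data))
scaled = robust_standardize(data)
```

The median of this data is 3 regardless of how extreme the outlier is, whereas the mean would be pulled toward it.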
2.4 Measures of Similarity and Dissimilarity
Similarity and dissimilarity are important because they are used by a number
of data mining techniques, such as clustering, nearest neighbor classification,
and anomaly detection. In many cases, the initial data set is not needed once
these similarities or dissimilarities have been computed. Such approaches can
be viewed as transforming the data to a similarity (dissimilarity) space and
then performing the analysis.
We begin with a discussion of the basics: high-level definitions of similarity
and dissimilarity, and a discussion of how they are related. For convenience,
the term proximity is used to refer to either similarity or dissimilarity. Since
the proximity between two objects is a function of the proximity between the
corresponding attributes of the two objects, we first describe how to measure
the proximity between objects having only one simple attribute, and then
consider proximity measures for objects with multiple attributes. This
includes measures such as correlation and Euclidean distance, which are useful
for dense data such as time series or two-dimensional points, as well as the
Jaccard and cosine similarity measures, which are useful for sparse data like
documents. Next, we consider several important issues concerning proximity
measures. The section concludes with a brief discussion of how to select the
right proximity measure.
2.4.1 Basics
Definitions
Informally, the similarity between two objects is a numerical measure of the
degree to which the two objects are alike. Consequently, similarities are higher
for pairs of objects that are more alike. Similarities are usually nonnegative
and are often between 0 (no similarity) and 1 (complete similarity).
The dissimilarity between two objects is a numerical measure of the
degree to which the two objects are different. Dissimilarities are lower for more
similar pairs of objects. Frequently, the term distance is used as a synonym
for dissimilarity, although, as we shall see, distance is often used to refer to
a special class of dissimilarities. Dissimilarities sometimes fall in the interval
[0, 1], but it is also common for them to range from 0 to ∞.
Transformations
Transformations are often applied to convert a similarity to a dissimilarity,
or vice versa, or to transform a proximity measure to fall within a particular
range, such as [0,1]. For instance, we may have similarities that range from 1
to 10, but the particular algorithm or software package that we want to use
may be designed to only work with dissimilarities, or it may only work with
similarities in the interval [0,1]. We discuss these issues here because we will
employ such transformations later in our discussion of proximity. In
addition, these issues are relatively independent of the details of specific proximity
measures.
Frequently, proximity measures, especially similarities, are defined or
transformed to have values in the interval [0,1]. Informally, the motivation for this
is to use a scale in which a proximity value indicates the fraction of similarity
(or dissimilarity) between two objects. Such a transformation is often
relatively straightforward. For example, if the similarities between objects range
from 1 (not at all similar) to 10 (completely similar), we can make them fall
within the range [0, 1] by using the transformation s′ = (s − 1)/9, where s and
s′ are the original and new similarity values, respectively. In the more general
case, the transformation of similarities to the interval [0, 1] is given by the
expression s′ = (s−min s)/(max s−min s), where max s and min s are the
maximum and minimum similarity values, respectively. Likewise, dissimilarity
measures with a finite range can be mapped to the interval [0,1] by using the
formula d′ = (d − min d)/(max d − min d).
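The general rescaling formula can be written as a short function. This is a minimal sketch; the function name and sample values are hypothetical, and the maximum is assumed to be strictly larger than the minimum.

```python
def rescale_to_unit_interval(values):
    """Map proximities with a finite range linearly onto [0, 1]."""
    lo, hi = min(values), max(values)  # assumes hi > lo
    return [(v - lo) / (hi - lo) for v in values]

similarities = [1, 4, 10]  # on the 1-to-10 scale used in the example
rescaled = rescale_to_unit_interval(similarities)
```

For a scale from 1 to 10 this reduces to the transformation s′ = (s − 1)/9 given in the text.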
There can be various complications in mapping proximity measures to the
interval [0, 1], however. If, for example, the proximity measure originally takes
values in the interval [0,∞], then a nonlinear transformation is needed and
values will not have the same relationship to one another on the new scale.
Consider the transformation d′ = d/(1 + d) for a dissimilarity measure that
ranges from 0 to ∞. The dissimilarities 0, 0.5, 2, 10, 100, and 1000 will be
transformed into the new dissimilarities 0, 0.33, 0.67, 0.91, 0.99, and 0.999,
respectively. Larger values on the original dissimilarity scale are compressed
into the range of values near 1, but whether or not this is desirable depends on
the application. Another complication is that the meaning of the proximity
measure may be changed. For example, correlation, which is discussed later,
is a measure of similarity that takes values in the interval [−1,1]. Mapping
these values to the interval [0,1] by taking the absolute value loses information
about the sign, which can be important in some applications. See Exercise 22
on page 94.
Transforming similarities to dissimilarities and vice versa is also relatively
straightforward, although we again face the issues of preserving meaning and
changing a linear scale into a nonlinear scale. If the similarity (or
dissimilarity) falls in the interval [0,1], then the dissimilarity can be defined as d = 1 − s
(s = 1 − d). Another simple approach is to define similarity as the
negative of the dissimilarity (or vice versa). To illustrate, the dissimilarities 0, 1,
10, and 100 can be transformed into the similarities 0, −1, −10, and −100,
respectively.
The similarities resulting from the negation transformation are not
restricted to the range [0, 1], but if that is desired, then transformations such as
$s = \frac{1}{d+1}$, $s = e^{-d}$, or $s = 1 - \frac{d - \min d}{\max d - \min d}$ can be used. For the transformation
$s = \frac{1}{d+1}$, the dissimilarities 0, 1, 10, 100 are transformed into 1, 0.5, 0.09, 0.01,
respectively. For $s = e^{-d}$, they become 1.00, 0.37, 0.00, 0.00, respectively,
while for $s = 1 - \frac{d - \min d}{\max d - \min d}$ they become 1.00, 0.99, 0.90, 0.00, respectively.
In this discussion, we have focused on converting dissimilarities to similarities.
Conversion in the opposite direction is considered in Exercise 23 on page 94.
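The three conversions can be applied to the dissimilarities 0, 1, 10, and 100 with a few lines of code. The variable names are hypothetical, and results are rounded to two decimal places only to ease comparison with the values quoted in the text.

```python
from math import exp

d_values = [0, 1, 10, 100]
d_min, d_max = min(d_values), max(d_values)

# s = 1 / (d + 1)
reciprocal = [round(1 / (d + 1), 2) for d in d_values]
# s = e^(-d)
exponential = [round(exp(-d), 2) for d in d_values]
# s = 1 - (d - min d) / (max d - min d)
linear = [round(1 - (d - d_min) / (d_max - d_min), 2) for d in d_values]
```

All three functions are monotonically decreasing in d, but only the last one preserves the relative spacing of the original values.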
In general, any monotonic decreasing function can be used to convert
dissimilarities to similarities, or vice versa. Of course, other factors also must
be considered when transforming similarities to dissimilarities, or vice versa,
or when transforming the values of a proximity measure to a new scale. We
have mentioned issues related to preserving meaning, distortion of scale, and
requirements of data analysis tools, but this list is certainly not exhaustive.
2.4.2 Similarity and Dissimilarity between Simple Attributes
The proximity of objects with a number of attributes is typically defined by
combining the proximities of individual attributes, and thus, we first discuss
proximity between objects having a single attribute. Consider objects
described by one nominal attribute. What would it mean for two such objects
to be similar? Since nominal attributes only convey information about the
distinctness of objects, all we can say is that two objects either have the same
value or they do not. Hence, in this case similarity is traditionally defined as 1
if attribute values match, and as 0 otherwise. A dissimilarity would be defined
in the opposite way: 0 if the attribute values match, and 1 if they do not.
For objects with a single ordinal attribute, the situation is more
complicated because information about order should be taken into account. Consider
an attribute that measures the quality of a product, e.g., a candy bar, on the
scale {poor, fair, OK, good, wonderful}. It would seem reasonable that a
product, P1, which is rated wonderful, would be closer to a product P2, which is
rated good, than it would be to a product P3, which is rated OK. To make this
observation quantitative, the values of the ordinal attribute are often mapped
to successive integers, beginning at 0 or 1, e.g., {poor=0, fair=1, OK=2,
good=3, wonderful=4}. Then, d(P1, P2) = 4 − 3 = 1 or, if we want the
dissimilarity to fall between 0 and 1, d(P1, P2) = (4 − 3)/4 = 0.25. A similarity for
ordinal attributes can then be defined as s = 1 − d.
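The mapping to integers and the resulting proximities can be sketched as follows. The names `SCALE`, `RANK`, `ordinal_dissimilarity`, and `ordinal_similarity` are hypothetical, and the integer mapping starts at 0.

```python
SCALE = ["poor", "fair", "OK", "good", "wonderful"]
RANK = {value: i for i, value in enumerate(SCALE)}  # poor=0, ..., wonderful=4

def ordinal_dissimilarity(a, b):
    """Absolute rank difference, scaled to [0, 1] by dividing by n - 1."""
    return abs(RANK[a] - RANK[b]) / (len(SCALE) - 1)

def ordinal_similarity(a, b):
    """Similarity defined as 1 - d."""
    return 1 - ordinal_dissimilarity(a, b)

d_p1_p2 = ordinal_dissimilarity("wonderful", "good")
d_p1_p3 = ordinal_dissimilarity("wonderful", "OK")
```

As expected, the product rated wonderful is closer to the one rated good than to the one rated OK.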
This definition of similarity (dissimilarity) for an ordinal attribute should
make the reader a bit uneasy since this assumes equal intervals, and this is not
so. Otherwise, we would have an interval or ratio attribute. Is the difference
between the values fair and good really the same as that between the values
OK and wonderful? Probably not, but in practice, our options are limited,
and in the absence of more information, this is the standard approach for
defining proximity between ordinal attributes.
For interval or ratio attributes, the natural measure of dissimilarity
between two objects is the absolute difference of their values. For example, we
might compare our current weight and our weight a year ago by saying “I am
ten pounds heavier.” In cases such as these, the dissimilarities typically range
from 0 to ∞, rather than from 0 to 1. The similarity of interval or ratio
attributes is typically expressed by transforming a dissimilarity into a similarity,
as previously described.
Table 2.7 summarizes this discussion. In this table, x and y are two objects
that have one attribute of the indicated type. Also, d(x, y) and s(x, y) are the
dissimilarity and similarity between x and y, respectively. Other approaches
are possible; these are the most common ones.
Table 2.7. Similarity and dissimilarity for simple attributes.
Attribute Type      Dissimilarity                                Similarity
Nominal             $d = 0$ if $x = y$, $d = 1$ if $x \neq y$    $s = 1$ if $x = y$, $s = 0$ if $x \neq y$
Ordinal             $d = |x - y|/(n - 1)$                        $s = 1 - d$
                    (values mapped to integers 0 to $n - 1$,
                    where $n$ is the number of values)
Interval or Ratio   $d = |x - y|$                                $s = -d$, $s = \frac{1}{1+d}$, $s = e^{-d}$,
                                                                 $s = 1 - \frac{d - \min d}{\max d - \min d}$
The following two sections consider more complicated measures of
proximity between objects that involve multiple attributes: (1) dissimilarities
between data objects and (2) similarities between data objects. This division
allows us to more naturally display the underlying motivations for employing
various proximity measures. We emphasize, however, that similarities can be
transformed into dissimilarities and vice versa using the approaches described
earlier.
2.4.3 Dissimilarities between Data Objects
In this section, we discuss various kinds of dissimilarities. We begin with a
discussion of distances, which are dissimilarities with certain properties, and
then provide examples of more general kinds of dissimilarities.
Distances
We first present some examples, and then offer a more formal description of
distances in terms of the properties common to all distances. The Euclidean
distance, $d$, between two points, $x$ and $y$, in one, two, three, or
higher-dimensional space, is given by the following familiar formula:
$$d(x, y) = \sqrt{\sum_{k=1}^{n} (x_k - y_k)^2}, \qquad (2.1)$$
where $n$ is the number of dimensions and $x_k$ and $y_k$ are, respectively, the $k$th
attributes (components) of $x$ and $y$. We illustrate this formula with Figure
2.15 and Tables 2.8 and 2.9, which show a set of points, the x and y coordinates
of these points, and the distance matrix containing the pairwise distances
of these points.
The Euclidean distance measure given in Equation 2.1 is generalized by
the Minkowski distance metric shown in Equation 2.2,
$$d(x, y) = \left( \sum_{k=1}^{n} |x_k - y_k|^r \right)^{1/r}, \qquad (2.2)$$
where r is a parameter. The following are the three most common examples
of Minkowski distances.
• r = 1. City block (Manhattan, taxicab, L1 norm) distance. A common
example is the Hamming distance, which is the number of bits that
are different between two objects that have only binary attributes, i.e.,
between two binary vectors.
• r = 2. Euclidean distance (L2 norm).
• r = ∞. Supremum ($L_{\max}$ or $L_\infty$ norm) distance. This is the maximum
difference between any attribute of the objects. More formally, the $L_\infty$
distance is defined by Equation 2.3
$$d(x, y) = \lim_{r \to \infty} \left( \sum_{k=1}^{n} |x_k - y_k|^r \right)^{1/r}. \qquad (2.3)$$
The r parameter should not be confused with the number of dimensions
(attributes) n. The Euclidean, Manhattan, and supremum distances are defined
for all values of n: 1, 2, 3, . . ., and specify different ways of combining the
differences in each dimension (attribute) into an overall distance.
Tables 2.10 and 2.11, respectively, give the proximity matrices for the L1
and L∞ distances using data from Table 2.8. Notice that all these distance
matrices are symmetric; i.e., the ijth entry is the same as the jith entry. In
Table 2.9, for instance, the fourth row of the first column and the fourth
column of the first row both contain the value 5.1.
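The three distance matrices can be reproduced from the coordinates in Table 2.8 with a short Minkowski distance function. This is an illustrative sketch; the names `minkowski` and `distance_matrix` are hypothetical, and the supremum distance is computed as a maximum rather than as a numerical limit.

```python
from math import inf

def minkowski(x, y, r):
    """Minkowski distance with parameter r; r = inf gives the supremum distance."""
    diffs = [abs(a - b) for a, b in zip(x, y)]
    if r == inf:
        return max(diffs)
    return sum(d ** r for d in diffs) ** (1 / r)

# Coordinates from Table 2.8.
points = {"p1": (0, 2), "p2": (2, 0), "p3": (3, 1), "p4": (5, 1)}

def distance_matrix(r):
    """Pairwise distances between the four points, rounded to one decimal place."""
    names = list(points)
    return [[round(minkowski(points[a], points[b], r), 1) for b in names]
            for a in names]

euclidean = distance_matrix(2)
city_block = distance_matrix(1)
supremum = distance_matrix(inf)
```

Each matrix is symmetric with zeros on the diagonal, as the text observes for Tables 2.9 through 2.11.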
Figure 2.15. Four two-dimensional points (p1, p2, p3, and p4, plotted on x and y axes).
Table 2.8. x and y coordinates of four points.
point   x coordinate   y coordinate
p1      0              2
p2      2              0
p3      3              1
p4      5              1
Table 2.9. Euclidean distance matrix for Table 2.8.
      p1    p2    p3    p4
p1    0.0   2.8   3.2   5.1
p2    2.8   0.0   1.4   3.2
p3    3.2   1.4   0.0   2.0
p4    5.1   3.2   2.0   0.0
Table 2.10. L1 distance matrix for Table 2.8.
L1    p1    p2    p3    p4
p1    0.0   4.0   4.0   6.0
p2    4.0   0.0   2.0   4.0
p3    4.0   2.0   0.0   2.0
p4    6.0   4.0   2.0   0.0
Table 2.11. L∞ distance matrix for Table 2.8.
L∞    p1    p2    p3    p4
p1    0.0   2.0   3.0   5.0
p2    2.0   0.0   1.0   3.0
p3    3.0   1.0   0.0   2.0
p4    5.0   3.0   2.0   0.0
Distances, such as the Euclidean distance, have some well-known
properties. If d(x, y) is the distance between two points, x and y, then the following
properties hold.
1. Positivity
(a) d(x, y) ≥ 0 for all x and y,
(b) d(x, y) = 0 only if x = y.
2. Symmetry
d(x, y) = d(y, x) for all x and y.
3. Triangle Inequality
d(x, z) ≤ d(x, y) + d(y, z) for all points x, y, and z.
Measures that satisfy all three properties are known as metrics. Some
people only use the term distance for dissimilarity measures that satisfy these
properties, but that practice is often violated. The three properties described
here are useful, as well as mathematically pleasing. Also, if the triangle
inequality holds, then this property can be used to increase the efficiency of
techniques (including clustering) that depend on distances possessing this property.
(See Exercise 25.) Nonetheless, many dissimilarities do not satisfy one or more
of the metric properties. We give two examples of such measures.
Example 2.14 (Nonmetric Dissimilarities: Set Differences). This
example is based on the notion of the difference of two sets, as defined in set
theory. Given two sets A and B, A − B is the set of elements of A that are
not in B. For example, if A = {1, 2, 3, 4} and B = {2, 3, 4}, then A − B = {1}
and B − A = ∅, the empty set. We can define the distance d between two
sets A and B as d(A, B) = size(A − B), where size is a function returning
the number of elements in a set. This distance measure, which is an integer
value greater than or equal to 0, does not satisfy the second part of the
positivity property, the symmetry property, or the triangle inequality. However,
these properties can be made to hold if the dissimilarity measure is modified
as follows: d(A, B) = size(A − B) + size(B − A). See Exercise 21 on page
94.
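Python sets make this example easy to try out. The function names below are hypothetical; the sets A and B are the ones used in the example.

```python
def d_one_sided(a, b):
    """size(A - B): the number of elements of a that are not in b."""
    return len(a - b)

def d_symmetric(a, b):
    """size(A - B) + size(B - A): the modified measure."""
    return len(a - b) + len(b - a)

A = {1, 2, 3, 4}
B = {2, 3, 4}

forward = d_one_sided(A, B)
backward = d_one_sided(B, A)
```

The one-sided measure gives different values in the two directions, and it is 0 for B and A even though the sets differ; the modified measure removes both problems for this pair.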
Example 2.15 (Nonmetric Dissimilarities: Time). This example gives
a more everyday example of a dissimilarity measure that is not a metric, but
that is still useful. Define a measure of the distance between times of the day
as follows:
$$d(t_1, t_2) = \begin{cases} t_2 - t_1 & \text{if } t_1 \le t_2 \\ 24 + (t_2 - t_1) & \text{if } t_1 > t_2 \end{cases} \qquad (2.4)$$
To illustrate, d (1PM, 2PM) = 1 hour, while d (2PM, 1PM) = 23 hours.
Such a definition would make sense, for example, when answering the question:
“If an event occurs at 1PM every day, and it is now 2PM, how long do I have
to wait for that event to occur again?”
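A direct transcription of this measure, assuming times are given as hours on a 24-hour clock, is shown below. The function name `wait_hours` is hypothetical, and equal times are assigned a distance of 0, which is one reasonable reading of the definition.

```python
def wait_hours(t1, t2):
    """Hours from time t1 until the next occurrence of time t2 (24-hour clock)."""
    if t1 <= t2:
        return t2 - t1
    return 24 + (t2 - t1)

one_to_two = wait_hours(13, 14)  # d(1PM, 2PM)
two_to_one = wait_hours(14, 13)  # d(2PM, 1PM)
```

The two results differ, which shows that the measure is not symmetric and therefore not a metric.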
2.4.4 Similarities between Data Objects
For similarities, the triangle inequality (or the analogous property) typically
does not hold, but symmetry and positivity typically do. To be explicit, if
s(x, y) is the similarity between points x and y, then the typical properties of
similarities are the following:
1. s(x, y) = 1 only if x = y. (0 ≤ s ≤ 1)
2. s(x, y) = s(y, x) for all x and y. (Symmetry)
There is no general analog of the triangle inequality for similarity
measures. It is sometimes possible, however, to show that a similarity measure
can easily be converted to a metric distance. The cosine and Jaccard similarity
measures, which are discussed shortly, are two examples. Also, for specific
similarity measures, it is possible to derive mathematical bounds on the similarity
between two objects that are similar in spirit to the triangle inequality.
Example 2.16 (A Nonsymmetric Similarity Measure). Consider an
experiment in which people are asked to classify a small set of characters as
they flash on a screen. The confusion matrix for this experiment records how
often each character is classified as itself, and how often each is classified as
another character. For instance, suppose that “0” appeared 200 times and was
classified as a “0” 160 times, but as an “o” 40 times. Likewise, suppose that
“o” appeared 200 times and was classified as an “o” 170 times, but as “0” only
30 times. If we take these counts as a measure of the similarity between two
characters, then we have a similarity measure, but one that is not symmetric.
In such situations, the similarity measure is often made symmetric by setting
s′(x, y) = s′(y, x) = (s(x, y)+s(y, x))/2, where s′ indicates the new similarity
measure.
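Using the counts from this example, the averaging step can be sketched as follows. The dictionary layout and the function name are hypothetical; each key is a pair (character shown, character reported).

```python
# Confusion counts from the example: (shown, reported) -> count.
confusion = {
    ("0", "0"): 160, ("0", "o"): 40,
    ("o", "o"): 170, ("o", "0"): 30,
}

def symmetric_similarity(x, y):
    """Average the two directed similarities so that s'(x, y) = s'(y, x)."""
    return (confusion[(x, y)] + confusion[(y, x)]) / 2

zero_vs_o = symmetric_similarity("0", "o")
o_vs_zero = symmetric_similarity("o", "0")
```

The directed counts, 40 and 30, are replaced by a single value that no longer depends on the order of the arguments.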
2.4.5 Examples of Proximity Measures
This section provides specific examples of some similarity and dissimilarity
measures.
Similarity Measures for Binary Data
Similarity measures between objects that contain only binary attributes are
called similarity coefficients, and typically have values between 0 and 1. A
value of 1 indicates that the two objects are completely similar, while a value
of 0 indicates that the objects are not at all similar. There are many rationales
for why one coefficient is better than another in specific instances.
Let x and y be two objects that consist of n binary attributes. The
comparison of two such objects, i.e., two binary vectors, leads to the following four
quantities (frequencies):
f00 = the number of attributes where x is 0 and y is 0
f01 = the number of attributes where x is 0 and y is 1
f10 = the number of attributes where x is 1 and y is 0
f11 = the number of attributes where x is 1 and y is 1
Simple Matching Coefficient One commonly used similarity coefficient is
the simple matching coefficient (SMC), which is defined as
$$\mathrm{SMC} = \frac{\text{number of matching attribute values}}{\text{number of attributes}} = \frac{f_{11} + f_{00}}{f_{01} + f_{10} + f_{11} + f_{00}}. \tag{2.5}$$
This measure counts both presences and absences equally. Consequently, the
SMC could be used to find students who had answered questions similarly on
a test that consisted only of true/false questions.
Jaccard Coefficient Suppose that x and y are data objects that represent
two rows (two transactions) of a transaction matrix (see Section 2.1.2). If each
asymmetric binary attribute corresponds to an item in a store, then a 1
indicates that the item was purchased, while a 0 indicates that the product was not
purchased. Since the number of products not purchased by any customer far
outnumbers the number of products that were purchased, a similarity measure
such as SMC would say that all transactions are very similar. As a result, the
Jaccard coefficient is frequently used to handle objects consisting of
asymmetric binary attributes. The Jaccard coefficient, which is often symbolized by
J, is given by the following equation:
$$J = \frac{\text{number of matching presences}}{\text{number of attributes not involved in 00 matches}} = \frac{f_{11}}{f_{01} + f_{10} + f_{11}}. \tag{2.6}$$
Example 2.17 (The SMC and Jaccard Similarity Coefficients). To
illustrate the difference between these two similarity measures, we calculate
SMC and J for the following two binary vectors.
x = (1, 0, 0, 0, 0, 0, 0, 0, 0, 0)
y = (0, 0, 0, 0, 0, 0, 1, 0, 0, 1)
f01 = 2 the number of attributes where x was 0 and y was 1
f10 = 1 the number of attributes where x was 1 and y was 0
f00 = 7 the number of attributes where x was 0 and y was 0
f11 = 0 the number of attributes where x was 1 and y was 1
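$$\mathrm{SMC} = \frac{f_{11} + f_{00}}{f_{01} + f_{10} + f_{11} + f_{00}} = \frac{0 + 7}{2 + 1 + 0 + 7} = 0.7$$
$$J = \frac{f_{11}}{f_{01} + f_{10} + f_{11}} = \frac{0}{2 + 1 + 0} = 0$$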
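The calculation in Example 2.17 can be reproduced with a short Python sketch. The function names (`binary_counts`, `smc`, `jaccard`) are chosen here for illustration, and returning 0.0 when the Jaccard denominator is zero is an assumption of this sketch rather than something the text specifies.

```python
def binary_counts(x, y):
    """Return (f00, f01, f10, f11) for two equal-length 0/1 vectors."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    f = {(0, 0): 0, (0, 1): 0, (1, 0): 0, (1, 1): 0}
    for a, b in zip(x, y):
        f[(a, b)] += 1
    return f[(0, 0)], f[(0, 1)], f[(1, 0)], f[(1, 1)]


def smc(x, y):
    """Simple matching coefficient (Equation 2.5)."""
    f00, f01, f10, f11 = binary_counts(x, y)
    return (f11 + f00) / (f00 + f01 + f10 + f11)


def jaccard(x, y):
    """Jaccard coefficient (Equation 2.6); 0-0 matches are ignored."""
    f00, f01, f10, f11 = binary_counts(x, y)
    denom = f01 + f10 + f11
    # Assumption: with no presences in either vector, report 0.0.
    return f11 / denom if denom else 0.0


x = (1, 0, 0, 0, 0, 0, 0, 0, 0, 0)
y = (0, 0, 0, 0, 0, 0, 1, 0, 0, 1)
print(binary_counts(x, y))        # (7, 2, 1, 0)
print(smc(x, y), jaccard(x, y))   # 0.7 0.0
```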
Cosine Similarity
Documents are often represented as vectors, where each attribute represents
the frequency with which a particular term (word) occurs in the document. It
is more complicated than this, of course, since certain common words are
ignored and various processing techniques are used to account for different forms
of the same word, differing document lengths, and different word frequencies.
Even though documents have thousands or tens of thousands of attributes
(terms), each document is sparse since it has relatively few nonzero attributes.
(The normalizations used for documents do not create a nonzero entry where
there was a zero entry; i.e., they preserve sparsity.) Thus, as with transaction
data, similarity should not depend on the number of shared 0 values since
any two documents are likely to “not contain” many of the same words, and
therefore, if 0–0 matches are counted, most documents will be highly similar to
most other documents. Therefore, a similarity measure for documents needs
to ignore 0–0 matches like the Jaccard measure, but also must be able to
handle nonbinary vectors. The cosine similarity, defined next, is one of the
most common measures of document similarity. If x and y are two document
vectors, then
$$\cos(x, y) = \frac{x \cdot y}{\|x\| \, \|y\|}, \tag{2.7}$$
where · indicates the vector dot product, $x \cdot y = \sum_{k=1}^{n} x_k y_k$, and ‖x‖ is the
length of vector x, $\|x\| = \sqrt{\sum_{k=1}^{n} x_k^2} = \sqrt{x \cdot x}$.
Example 2.18 (Cosine Similarity of Two Document Vectors). This
example calculates the cosine similarity for the following two data objects,
which might represent document vectors:
x = (3, 2, 0, 5, 0, 0, 0, 2, 0, 0)
y = (1, 0, 0, 0, 0, 0, 0, 1, 0, 2)
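$$x \cdot y = 3 \cdot 1 + 2 \cdot 0 + 0 \cdot 0 + 5 \cdot 0 + 0 \cdot 0 + 0 \cdot 0 + 0 \cdot 0 + 2 \cdot 1 + 0 \cdot 0 + 0 \cdot 2 = 5$$
$$\|x\| = \sqrt{3 \cdot 3 + 2 \cdot 2 + 0 \cdot 0 + 5 \cdot 5 + 0 \cdot 0 + 0 \cdot 0 + 0 \cdot 0 + 2 \cdot 2 + 0 \cdot 0 + 0 \cdot 0} = \sqrt{42} \approx 6.48$$
$$\|y\| = \sqrt{1 \cdot 1 + 0 \cdot 0 + 0 \cdot 0 + 0 \cdot 0 + 0 \cdot 0 + 0 \cdot 0 + 0 \cdot 0 + 1 \cdot 1 + 0 \cdot 0 + 2 \cdot 2} = \sqrt{6} \approx 2.45$$
$$\cos(x, y) = \frac{5}{6.48 \times 2.45} \approx 0.31$$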
As indicated by Figure 2.16, cosine similarity really is a measure of the
(cosine of the) angle between x and y. Thus, if the cosine similarity is 1, the
angle between x and y is 0°, and x and y are the same except for magnitude
(length). If the cosine similarity is 0, then the angle between x and y is 90°,
and they do not share any terms (words).
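As a check on Example 2.18, the following Python sketch evaluates Equation 2.7 directly. The helper names `dot`, `norm`, and `cosine` are illustrative choices made here. The sketch does not handle zero vectors, for which the cosine is undefined.

```python
import math


def dot(x, y):
    """Vector dot product."""
    return sum(a * b for a, b in zip(x, y))


def norm(x):
    """Euclidean length of x, i.e. sqrt(x . x)."""
    return math.sqrt(dot(x, x))


def cosine(x, y):
    """Cosine similarity (Equation 2.7); raises ZeroDivisionError for a zero vector."""
    return dot(x, y) / (norm(x) * norm(y))


x = (3, 2, 0, 5, 0, 0, 0, 2, 0, 0)
y = (1, 0, 0, 0, 0, 0, 0, 1, 0, 2)
print(dot(x, y))               # 5
print(round(norm(x), 2))       # 6.48  (square root of 42)
print(round(norm(y), 2))       # 2.45  (square root of 6)
print(round(cosine(x, y), 2))  # 0.31
```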
Figure 2.16. Geometric illustration of the cosine measure (the angle θ between vectors x and y).
Equation 2.7 can be written as Equation 2.8.
$$\cos(x, y) = \frac{x}{\|x\|} \cdot \frac{y}{\|y\|} = x' \cdot y', \tag{2.8}$$
where x′ = x/‖x‖ and y′ = y/‖y‖. Dividing x and y by their lengths normalizes
them to have a length of 1. This means that cosine similarity does not take
the magnitude of the two data objects into account when computing similarity.
(Euclidean distance might be a better choice when magnitude is important.)
For vectors with a length of 1, the cosine measure can be calculated by taking
a simple dot product. Consequently, when many cosine similarities between
objects are being computed, normalizing the objects to have unit length can
reduce the time required.
Extended Jaccard Coefficient (Tanimoto Coefficient)
The extended Jaccard coefficient can be used for document data and
reduces to the Jaccard coefficient in the case of binary attributes. The extended
Jaccard coefficient is also known as the Tanimoto coefficient. (However, there
is another coefficient that is also known as the Tanimoto coefficient.) This
coefficient, which we shall represent as EJ, is defined by the following equation:
$$EJ(x, y) = \frac{x \cdot y}{\|x\|^2 + \|y\|^2 - x \cdot y}. \tag{2.9}$$
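Equation 2.9 can be sketched in Python as follows. The binary vectors `a` and `b` are hypothetical and serve only to show the reduction to the Jaccard coefficient; the document vectors are those of Example 2.18. Returning 0.0 for two zero vectors is an assumption of this sketch.

```python
def dot(x, y):
    """Vector dot product."""
    return sum(p * q for p, q in zip(x, y))


def extended_jaccard(x, y):
    """EJ(x, y) = x.y / (||x||^2 + ||y||^2 - x.y), following Equation 2.9."""
    xy = dot(x, y)
    denom = dot(x, x) + dot(y, y) - xy
    # Assumption: two zero vectors are given similarity 0.0.
    return xy / denom if denom else 0.0


# Hypothetical binary vectors: f11 = 2, f10 = 1, f01 = 1, so J = 2 / 4.
a = (1, 1, 0, 1, 0)
b = (1, 0, 1, 1, 0)
print(extended_jaccard(a, b))  # 0.5

# Document vectors from Example 2.18: 5 / (42 + 6 - 5).
x = (3, 2, 0, 5, 0, 0, 0, 2, 0, 0)
y = (1, 0, 0, 0, 0, 0, 0, 1, 0, 2)
print(round(extended_jaccard(x, y), 3))  # 0.116
```

For 0/1 vectors, x · y counts the 1-1 matches and ‖x‖² counts the 1s in x, so the denominator equals f01 + f10 + f11, which is why the binary case agrees with J.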
Correlation
The correlation between two data objects that have binary or continuous
variables is a measure of the linear relationship between the attributes of the
objects. (The calculation of correlation between attributes, which is more
common, can be defined similarly.) More precisely, Pearson’s correlation
coefficient between two data objects, x and y, is defined by the following
equation:
$$\mathrm{corr}(x, y) = \frac{\mathrm{covariance}(x, y)}{\text{standard deviation}(x) \times \text{standard deviation}(y)} = \frac{s_{xy}}{s_x s_y}, \tag{2.10}$$
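where we are using the following standard statistical notation and definitions:
$$\mathrm{covariance}(x, y) = s_{xy} = \frac{1}{n - 1} \sum_{k=1}^{n} (x_k - \bar{x})(y_k - \bar{y}) \tag{2.11}$$
$$\text{standard deviation}(x) = s_x = \sqrt{\frac{1}{n - 1} \sum_{k=1}^{n} (x_k - \bar{x})^2}$$
$$\text{standard deviation}(y) = s_y = \sqrt{\frac{1}{n - 1} \sum_{k=1}^{n} (y_k - \bar{y})^2}$$
$$\bar{x} = \frac{1}{n} \sum_{k=1}^{n} x_k \text{ is the mean of } x$$
$$\bar{y} = \frac{1}{n} \sum_{k=1}^{n} y_k \text{ is the mean of } y$$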
Example 2.19 (Perfect Correlation). Correlation is always in the range
−1 to 1. A correlation of 1 (−1) means that x and y have a perfect positive
(negative) linear relationship; that is, $x_k = a y_k + b$, where a and b are
constants. The following two sets of values for x and y indicate cases where the
correlation is −1 and +1, respectively. In the first case, the means of x and y
were chosen to be 0, for simplicity.
x = (−3, 6, 0, 3, −6)
y = ( 1, −2, 0, −1, 2)
x = (3, 6, 0, 3, 6)
y = (1, 2, 0, 1, 2)
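The two cases of Example 2.19 can be verified with a small Python sketch of Equations 2.10 and 2.11. The function names are illustrative, and results are rounded because floating-point arithmetic will not generally give exactly ±1.

```python
import math


def mean(v):
    return sum(v) / len(v)


def covariance(x, y):
    """Sample covariance s_xy with the 1/(n - 1) factor of Equation 2.11."""
    mx, my = mean(x), mean(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) - 1)


def std(v):
    """Sample standard deviation, the square root of covariance(v, v)."""
    return math.sqrt(covariance(v, v))


def corr(x, y):
    """Pearson's correlation (Equation 2.10)."""
    return covariance(x, y) / (std(x) * std(y))


print(round(corr((-3, 6, 0, 3, -6), (1, -2, 0, -1, 2)), 6))  # -1.0
print(round(corr((3, 6, 0, 3, 6), (1, 2, 0, 1, 2)), 6))      # 1.0
```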
Figure 2.17. Scatter plots illustrating correlations from −1 to 1. (Panels show correlations of −1.00, −0.90, …, 0.90, 1.00.)
Example 2.20 (Nonlinear Relationships). If the correlation is 0, then
there is no linear relationship between the attributes of the two data objects.
However, nonlinear relationships may still exist. In the following example,
$y_k = x_k^2$, but their correlation is 0.
x = (−3, −2, −1, 0, 1, 2, 3)
y = ( 9, 4, 1, 0, 1, 4, 9)
Example 2.21 (Visualizing Correlation). It is also easy to judge the
correlation between two data objects x and y by plotting pairs of corresponding
attribute values. Figure 2.17 shows a number of these plots when x and y
have 30 attributes and the values of these attributes are randomly generated
(with a normal distribution) so that the correlation of x and y ranges from −1
to 1. Each circle in a plot represents one of the 30 attributes; its x coordinate
is the value of one of the attributes for x, while its y coordinate is the value
of the same attribute for y.
If we transform x and y by subtracting off their means and then normalizing
them so that their lengths are 1, then their correlation can be calculated by
taking the dot product. Notice that this is not the same as the standardization
used in other contexts, where we make the transformations, $x'_k = (x_k - \bar{x})/s_x$
and $y'_k = (y_k - \bar{y})/s_y$.
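The mean-centering and unit-length normalization just described can be sketched as follows. The function names are assumptions of this illustration; the data are taken from Examples 2.19 and 2.20. The second check uses a tolerance rather than an exact comparison, since rounding may leave a result that is only approximately zero.

```python
import math


def center_and_normalize(v):
    """Subtract the mean of v, then scale the result to unit Euclidean length."""
    m = sum(v) / len(v)
    centered = [a - m for a in v]
    length = math.sqrt(sum(c * c for c in centered))
    return [c / length for c in centered]


def corr_via_dot(x, y):
    """Correlation computed as the dot product of the transformed vectors."""
    xn, yn = center_and_normalize(x), center_and_normalize(y)
    return sum(a * b for a, b in zip(xn, yn))


# Second pair of Example 2.19: perfectly positively correlated.
print(round(corr_via_dot((3, 6, 0, 3, 6), (1, 2, 0, 1, 2)), 6))  # 1.0

# Example 2.20: each y value is the square of the x value, yet the correlation is 0.
x = (-3, -2, -1, 0, 1, 2, 3)
y = (9, 4, 1, 0, 1, 4, 9)
print(abs(corr_via_dot(x, y)) < 1e-12)  # True
```

The 1/(n − 1) factors in the covariance and the two standard deviations cancel in Equation 2.10, which is why no such factor appears here.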
Bregman Divergence∗ This section provides a brief description of Bregman
divergences, which are a family of proximity functions that share some
common properties. As a result, it is possible to construct general data mining
algorithms, such as clustering algorithms, that work with any Bregman
divergence. A concrete example is the K-means clustering algorithm (Section
8.2). Note that this section requires knowledge of vector calculus.
Bregman divergences are loss or distortion functions. To understand the
idea of a loss function, consider the following. Let x and y be two points, where
y is regarded as the original point and x is some distortion or approximation
of it. For example, x may be a point that was generated by
adding random noise to y. The goal is to measure the distortion or
loss that results if y is approximated by x. Of course, the more similar x and
y are, the smaller the loss or distortion. Thus, Bregman divergences can be
used as dissimilarity functions.
More formally, we have the following definition.
Definition 2.6 (Bregman Divergence). Given a strictly convex function
φ (with a few modest restrictions that are generally satisfied), the Bregman
divergence (loss function) D(x, y) generated by that function is given by the
following equation:
$$D(x, y) = \phi(x) - \phi(y) - \langle \nabla\phi(y), (x - y) \rangle \tag{2.12}$$
where ∇φ(y) is the gradient of φ evaluated at y, x − y is the vector difference
between x and y, and 〈∇φ(y), (x − y)〉 is the inner product between ∇φ(y)
and (x − y). For points in Euclidean space, the inner product is just the dot
product.
D(x, y) can be written as D(x, y) = φ(x) − L(x), where L(x) = φ(y) +
〈∇φ(y), (x − y)〉 and represents the equation of a plane that is tangent to the
function φ at y. Using calculus terminology, L(x) is the linearization of φ
around the point y and the Bregman divergence is just the difference between
a function and a linear approximation to that function. Different Bregman
divergences are obtained by using different choices for φ.
Example 2.22. We provide a concrete example using squared Euclidean
distance, but restrict ourselves to one dimension to simplify the mathematics. Let
x and y be real numbers and φ(t) be the real-valued function, φ(t) = t². In
that case, the gradient reduces to the derivative and the dot product reduces
to multiplication. Specifically, Equation 2.12 becomes Equation 2.13.
$$D(x, y) = x^2 - y^2 - 2y(x - y) = (x - y)^2 \tag{2.13}$$
The graph for this example, with y = 1, is shown in Figure 2.18. The
Bregman divergence is shown for two values of x: x = 2 and x = 3.
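The one-dimensional case of Example 2.22 can be checked with a short Python sketch. The function `bregman` and its argument names are assumptions of this illustration; it takes φ and its derivative as ordinary Python functions and covers only the scalar case.

```python
def bregman(phi, dphi, x, y):
    """One-dimensional Bregman divergence: phi(x) - phi(y) - phi'(y) * (x - y)."""
    return phi(x) - phi(y) - dphi(y) * (x - y)


def phi(t):
    """Generating function of Example 2.22: phi(t) = t^2."""
    return t * t


def dphi(t):
    """Derivative of phi."""
    return 2 * t


def tangent_at_1(x):
    """Linearization of phi around y = 1, i.e. L(x) = 2x - 1."""
    return phi(1) + dphi(1) * (x - 1)


print(bregman(phi, dphi, 2, 1))   # 1, equal to (2 - 1)^2
print(bregman(phi, dphi, 3, 1))   # 4, equal to (3 - 1)^2
print(phi(3) - tangent_at_1(3))   # 4, the gap between phi and its tangent at x = 3
```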
Figure 2.18. Illustration of Bregman divergence. (The plot shows φ(x) = x², its tangent line y = 2x − 1 at the point 1, and the divergences D(2, 1) and D(3, 1).)
2.4.6 Issues in Proximity Calculation
This section discusses several important issues related to proximity measures:
(1) how to handle the case in which attributes have different scales and/or are
correlated, (2) how to calculate proximity between objects that are composed
of different types of attributes, e.g., quantitative and qualitative, and (3) how
to handle proximity calculation when attributes have different weights; i.e.,
when not all attributes contribute equally to the proximity of objects.
Standardization and Correlation for Distance Measures
An important issue with distance measures is how to handle the situation
when attributes do not have the same range of values. (This situation is
often described by saying that “the variables have different scales.”) Earlier,
Euclidean distance was used to measure the distance between people based on
two attributes: age and income. Unless these two attributes are standardized,
the distance between two people will be dominated by income.
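The effect of differing scales can be seen in a small Python sketch. The four (age, income) records below are hypothetical values invented for illustration, and standardizing by the sample mean and sample standard deviation is one common choice assumed here, not the only possible one.

```python
import math
import statistics

# Hypothetical (age in years, income in dollars) records; illustrative only.
people = [(25, 30000.0), (50, 31000.0), (26, 90000.0), (40, 60000.0)]


def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def standardize(rows):
    """Replace each value v of an attribute by (v - mean) / sample standard deviation."""
    cols = list(zip(*rows))
    means = [statistics.mean(c) for c in cols]
    stds = [statistics.stdev(c) for c in cols]
    return [tuple((v - m) / s for v, m, s in zip(r, means, stds)) for r in rows]


z = standardize(people)

# Raw data: person 0 appears much closer to person 1 (25 years older, similar
# income) than to person 2 (nearly the same age, triple the income), because
# the income differences dominate the distance.
print(euclidean(people[0], people[1]) < euclidean(people[0], people[2]))  # True

# Standardized data: both attributes now contribute on a comparable scale.
print(round(euclidean(z[0], z[1]), 2), round(euclidean(z[0], z[2]), 2))
```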
A related issue is how to compute distance when there is correlation
between some of the attributes, perhaps in addition to differences in the ranges of
values. A generalization of Euclidean distance, the Mahalanobis distance,
is useful when attributes are correlated, have different ranges of values
(different variances), and the distribution of the data is approximately Gaussian
(normal). Specifically, the Mahalanobis distance between two objects (vectors)
x and y is defined as
$$\mathrm{mahalanobis}(x, y) = (x - y)\,\Sigma^{-1}(x - y)^T, \tag{2.14}$$
where $\Sigma^{-1}$ is the inverse of the covariance matrix of the data. Note that the
covariance matrix Σ is the matrix whose ijth entry is the covariance of the ith
and jth attributes as defined by Equation 2.11.
Example 2.23. In Figure 2.19, there are 1000 points, whose x and y
attributes have a correlation of 0.6. The distance between the two large points
at the opposite ends of the long axis of the ellipse is 14.7 in terms of Euclidean
distance, but only 6 with respect to Mahalanobis distance. In practice,
computing the Mahalanobis distance is expensive, but can be worthwhile for data
whose attributes are correlated. If the attributes are relatively uncorrelated,
but have different ranges, then standardizing the variables is sufficient.
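A two-dimensional sketch of Equation 2.14 in plain Python follows. It evaluates the formula exactly as written above, without taking a square root (some references define the Mahalanobis distance as the square root of this quantity). The six sample points are hypothetical and are not the data of Figure 2.19; the function names are likewise illustrative.

```python
def covariance_matrix_2d(points):
    """Sample covariance matrix (entries as in Equation 2.11) of (x, y) points."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    return [[sxx, sxy], [sxy, syy]]


def inverse_2x2(m):
    (a, b), (c, d) = m
    det = a * d - b * c
    if det == 0:
        raise ValueError("covariance matrix is singular")
    return [[d / det, -b / det], [-c / det, a / det]]


def mahalanobis(x, y, cov_inv):
    """(x - y) Sigma^{-1} (x - y)^T for two-dimensional points."""
    d = (x[0] - y[0], x[1] - y[1])
    row = (d[0] * cov_inv[0][0] + d[1] * cov_inv[1][0],
           d[0] * cov_inv[0][1] + d[1] * cov_inv[1][1])
    return row[0] * d[0] + row[1] * d[1]


# Hypothetical, positively correlated sample.
data = [(0, 0), (1, 1), (2, 1), (3, 3), (4, 3), (5, 5)]
inv = inverse_2x2(covariance_matrix_2d(data))

# Two pairs of points with the same Euclidean separation.
along = mahalanobis((0, 0), (4, 4), inv)    # separated along the trend
across = mahalanobis((0, 4), (4, 0), inv)   # separated against the trend
print(along < across)  # True
```

With the identity matrix in place of the inverse covariance, the same function returns the squared Euclidean distance.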
Combining Similarities for Heterogeneous Attributes
The previous definitions of similarity were based on approaches that assumed
all the attributes were of the same type. A general approach is needed when the
attributes are of different types. One straightforward approach is to compute
the similarity between each attribute separately using Table 2.7, and then
combine these similarities using a method that results in a similarity between
0 and 1. Typically, the overall similarity is defined as the average of all the
individual attribute similarities.
Figure 2.19. Set of two-dimensional points. The Mahalanobis distance between the two points
represented by large dots is 6; their Euclidean distance is 14.7.
Unfortunately, this approach does not work well if some of the attributes
are asymmetric attributes. For example, if all the attributes are asymmetric
binary attributes, then the similarity measure suggested previously reduces to
the simple matching coefficient, a measure that is not appropriate for
asymmetric binary attributes. The easiest way to fix this problem is to omit
asymmetric attributes from the similarity calculation when their values are 0 for
both of the objects whose similarity is being computed. A similar approach
also works well for handling missing values.
In summary, Algorithm 2.1 is effective for computing an overall similarity
between two objects, x and y, with different types of attributes. This
procedure can be easily modified to work with dissimilarities.
Algorithm 2.1 Similarities of heterogeneous objects.
1: For the kth attribute, compute a similarity, $s_k(x, y)$, in the range [0, 1].
2: Define an indicator variable, $\delta_k$, for the kth attribute as follows:
$$\delta_k = \begin{cases} 0 & \text{if the $k$th attribute is an asymmetric attribute and both objects have a value of 0,} \\ & \text{or if one of the objects has a missing value for the $k$th attribute} \\ 1 & \text{otherwise} \end{cases}$$
3: Compute the overall similarity between the two objects using the following
formula:
$$\mathrm{similarity}(x, y) = \frac{\sum_{k=1}^{n} \delta_k s_k(x, y)}{\sum_{k=1}^{n} \delta_k} \tag{2.15}$$
Using Weights
In much of the previous discussion, all attributes were treated equally when
computing proximity. This is not desirable when some attributes are more
important to the definition of proximity than others. To address these situations,
the formulas for proximity can be modified by weighting the contribution of
each attribute.
If the weights $w_k$ sum to 1, then (2.15) becomes
$$\mathrm{similarity}(x, y) = \frac{\sum_{k=1}^{n} w_k \delta_k s_k(x, y)}{\sum_{k=1}^{n} \delta_k}. \tag{2.16}$$
The definition of the Minkowski distance can also be modified as follows:
$$d(x, y) = \left( \sum_{k=1}^{n} w_k |x_k - y_k|^r \right)^{1/r}. \tag{2.17}$$
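Algorithm 2.1 (Equation 2.15) and the weighted Minkowski distance of Equation 2.17 can be sketched in Python as below. Several details are assumptions of this sketch rather than part of the text: `None` marks a missing value, the per-attribute similarity functions `match` and `scaled` are hypothetical, and 0.0 is returned when every attribute is skipped.

```python
def heterogeneous_similarity(sims, asymmetric, x, y):
    """Overall similarity of Equation 2.15.

    sims[k](a, b) returns a similarity in [0, 1] for attribute k;
    asymmetric[k] is True if attribute k is asymmetric;
    None marks a missing value (an assumption of this sketch).
    """
    total = 0.0
    used = 0
    for k, (a, b) in enumerate(zip(x, y)):
        missing = a is None or b is None
        both_zero = asymmetric[k] and a == 0 and b == 0
        if missing or both_zero:
            continue  # delta_k = 0: the attribute is left out
        total += sims[k](a, b)
        used += 1
    # Assumption: if every attribute is skipped, report 0.0.
    return total / used if used else 0.0


def weighted_minkowski(x, y, w, r):
    """Equation 2.17: (sum_k w_k * |x_k - y_k|^r)^(1/r)."""
    return sum(wk * abs(a - b) ** r for wk, a, b in zip(w, x, y)) ** (1.0 / r)


def match(a, b):
    """Hypothetical similarity for nominal or binary attributes."""
    return 1.0 if a == b else 0.0


def scaled(a, b):
    """Hypothetical similarity for an interval attribute on a 0-8 scale."""
    return 1.0 - abs(a - b) / 8.0


sims = [match, match, scaled, match]
asymmetric = [False, True, False, True]
x = ("red", 0, 7.0, 1)
y = ("red", 0, 3.0, None)

# Attribute 1 is a 0-0 match on an asymmetric attribute and attribute 3 is
# missing in y, so only attributes 0 and 2 count: (1.0 + 0.5) / 2.
print(heterogeneous_similarity(sims, asymmetric, x, y))  # 0.75
print(weighted_minkowski((0, 0), (3, 4), (1, 1), 2))     # 5.0
```

With all weights equal to 1, `weighted_minkowski` reduces to the ordinary Minkowski distance, as the second call shows for r = 2.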
2.4.7 Selecting the Right Proximity Measure
The following are a few general observations that may be helpful. First, the
type of proximity measure should fit the type of data. For many types of dense,
continuous data, metric distance measures such as Euclidean distance are
often used. Proximity between continuous attributes is most often expressed
in terms of differences, and distance measures provide a well-defined way of
combining these differences into an overall proximity measure. Although
attributes can have different scales and be of differing importance, these issues
can often be dealt with as described earlier.
For sparse data, which often consists of asymmetric attributes, we typically
employ similarity measures that ignore 0–0 matches. Conceptually, this
reflects the fact that, for a pair of complex objects, similarity depends on the
number of characteristics they both share, rather than the number of
characteristics they both lack. More specifically, for sparse, asymmetric data, most
objects have only a few of the characteristics described by the attributes, and
thus, are highly similar in terms of the characteristics they do not have. The
cosine, Jaccard, and extended Jaccard measures are appropriate for such data.
There are other characteristics of data vectors that may need to be
considered. Suppose, for example, that we are interested in comparing time series.
If the magnitude of the time series is important (for example, each time series
represents total sales of the same organization for a different year), then we
could use Euclidean distance. If the time series represent different quantities
(for example, blood pressure and oxygen consumption), then we usually want
to determine if the time series have the same shape, not the same magnitude.
Correlation, which uses a built-in normalization that accounts for differences
in magnitude and level, would be more appropriate.
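The contrast between the two choices can be illustrated with a short Python sketch. The two series are hypothetical: `b` is constructed from `a` by a change of scale and level, so the two have the same shape but very different magnitudes.

```python
import math


def euclidean(x, y):
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(x, y)))


def corr(x, y):
    """Pearson's correlation; the 1/(n - 1) factors cancel and are omitted."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((p - mx) * (q - my) for p, q in zip(x, y))
    sxx = sum((p - mx) ** 2 for p in x)
    syy = sum((q - my) ** 2 for q in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical series: b has the same shape as a but a different magnitude and level.
a = [1.0, 2.0, 4.0, 3.0, 5.0]
b = [10.0 * v + 100.0 for v in a]

print(euclidean(a, b) > 100.0)  # True: far apart in terms of magnitude
print(round(corr(a, b), 6))     # 1.0: identical in shape
```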
In some cases, transformation or normalization of the data is important
for obtaining a proper similarity measure since such transformations are not
always present in proximity measures. For instance, time series may have
trends or periodic patterns that significantly impact similarity. Also, a proper
computation of similarity may require that time lags be taken into account.
Finally, two time series may only be similar over specific periods of time. For
example, there is a strong relationship between temperature and the use of
natural gas, but only during the heating season.
Practical considerations can also be important. Sometimes, one or more
proximity measures are already in use in a particular field, and thus, others
will have answered the question of which proximity measures should be used.
Other times, the software package or clustering algorithm being used may
drastically limit the choices. If efficiency is a concern, then we may want to
choose a proximity measure that has a property, such as the triangle inequality,
that can be used to reduce the number of proximity calculations. (See Exercise
25.)
However, if common practice or practical restrictions do not dictate a
choice, then the proper choice of a proximity measure can be a time-consuming
task that requires careful consideration of both domain knowledge and the
purpose for which the measure is being used. A number of different similarity
measures may need to be evaluated to see which ones produce results that
make the most sense.
2.5 Bibliographic Notes
It is essential to understand the nature of the data that is being analyzed,
and at a fundamental level, this is the subject of measurement theory. In
particular, one of the initial motivations for defining types of attributes was
to be precise about which statistical operations were valid for what sorts of
data. We have presented the view of measurement theory that was initially
described in a classic paper by S. S. Stevens [79]. (Tables 2.2 and 2.3 are
derived from those presented by Stevens [80].) While this is the most common
view and is reasonably easy to understand and apply, there is, of course,
much more to measurement theory. An authoritative discussion can be found
in a three-volume series on the foundations of measurement theory [63, 69,
81]. Also of interest is a wide-ranging article by Hand [55], which discusses
measurement theory and statistics, and is accompanied by comments from
other researchers in the field. Finally, there are many books and articles that
describe measurement issues for particular areas of science and engineering.
Data quality is a broad subject that spans every discipline that uses data.
Discussions of precision, bias, accuracy, and significant figures can be found
in many introductory science, engineering, and statistics textbooks. The view
of data quality as “fitness for use” is explained in more detail in the book by
Redman [76]. Those interested in data quality may also be interested in MIT’s
Total Data Quality Management program [70, 84]. However, the knowledge
needed to deal with specific data quality issues in a particular domain is often
best obtained by investigating the data quality practices of researchers in that
field.
Aggregation is a less well-defined subject than many other preprocessing
tasks. However, aggregation is one of the main techniques used by the database
area of Online Analytical Processing (OLAP), which is discussed in Chapter 3.
There has also been relevant work in the area of symbolic data analysis (Bock
and Diday [47]). One of the goals in this area is to summarize traditional record
data in terms of symbolic data objects whose attributes are more complex than
traditional attributes. Specifically, these attributes can have values that are
sets of values (categories), intervals, or sets of values with weights (histograms).
Another goal of symbolic data analysis is to be able to perform clustering,
classification, and other kinds of data analysis on data that consists of symbolic
data objects.
Sampling is a subject that has been well studied in statistics and related
fields. Many introductory statistics books, such as the one by Lindgren [65],
have some discussion on sampling, and there are entire books devoted to the
subject, such as the classic text by Cochran [49]. A survey of sampling for
data mining is provided by Gu and Liu [54], while a survey of sampling for
databases is provided by Olken and Rotem [72]. There are a number of other
data mining and database-related sampling references that may be of interest,
including papers by Palmer and Faloutsos [74], Provost et al. [75], Toivonen
[82], and Zaki et al. [85].
In statistics, the traditional techniques that have been used for
dimensionality reduction are multidimensional scaling (MDS) (Borg and Groenen [48],
Kruskal and Uslaner [64]) and principal component analysis (PCA) (Jolliffe
[58]), which is similar to singular value decomposition (SVD) (Demmel [50]).
Dimensionality reduction is discussed in more detail in Appendix B.
Discretization is a topic that has been extensively investigated in data
mining. Some classification algorithms only work with categorical data, and
association analysis requires binary data, and thus, there is a significant
motivation to investigate how to best binarize or discretize continuous attributes.
For association analysis, we refer the reader to work by Srikant and Agrawal
[78], while some useful references for discretization in the area of classification
include work by Dougherty et al. [51], Elomaa and Rousu [52], Fayyad and
Irani [53], and Hussain et al. [56].
Feature selection is another topic well investigated in data mining. A broad
coverage of this topic is provided in a survey by Molina et al. [71] and two
books by Liu and Motoda [66, 67]. Other useful papers include those by Blum
and Langley [46], Kohavi and John [62], and Liu et al. [68].
It is difficult to provide references for the subject of feature transformations
because practices vary from one discipline to another. Many statistics books
have a discussion of transformations, but typically the discussion is restricted
to a particular purpose, such as ensuring the normality of a variable or making
sure that variables have equal variance. We offer two references: Osborne [73]
and Tukey [83].
While we have covered some of the most commonly used distance and
similarity measures, there are hundreds of such measures and more are being
created all the time. As with so many other topics in this chapter, many of
these measures are specific to particular fields; e.g., in the area of time series see
papers by Kalpakis et al. [59] and Keogh and Pazzani [61]. Clustering books
provide the best general discussions. In particular, see the books by Anderberg
[45], Jain and Dubes [57], Kaufman and Rousseeuw [60], and Sneath and Sokal
[77].
Bibliography
[45] M. R. Anderberg. Cluster Analysis for Applications. Academic Press, New York,
December 1973.
[46] A. Blum and P. Langley. Selection of Relevant Features and Examples in Machine
Learning. Artificial Intelligence, 97(1–2):245–271, 1997.
[47] H. H. Bock and E. Diday. Analysis of Symbolic Data: Exploratory Methods for
Extracting Statistical Information from Complex Data (Studies in Classification, Data Analysis,
and Knowledge Organization). Springer-Verlag Telos, January 2000.
[48] I. Borg and P. Groenen. Modern Multidimensional Scaling—Theory and Applications.
Springer-Verlag, February 1997.
[49] W. G. Cochran. Sampling Techniques. John Wiley & Sons, 3rd edition, July 1977.
[50] J. W. Demmel. Applied Numerical Linear Algebra. Society for Industrial & Applied
Mathematics, September 1997.
[51] J. Dougherty, R. Kohavi, and M. Sahami. Supervised and Unsupervised Discretization
of Continuous Features. In Proc. of the 12th Intl. Conf. on Machine Learning, pages
194–202, 1995.
[52] T. Elomaa and J. Rousu. General and Efficient Multisplitting of Numerical Attributes.
Machine Learning, 36(3):201–244, 1999.
[53] U. M. Fayyad and K. B. Irani. Multi-interval discretization of continuous-valued
attributes for classification learning. In Proc. 13th Int. Joint Conf. on Artificial
Intelligence, pages 1022–1027. Morgan Kaufmann, 1993.
[54] F. H. Gaohua Gu and H. Liu. Sampling and Its Application in Data Mining: A Survey.
Technical Report TRA6/00, National University of Singapore, Singapore, 2000.
[55] D. J. Hand. Statistics and the Theory of Measurement. Journal of the Royal Statistical
Society: Series A (Statistics in Society), 159(3):445–492, 1996.
[56] F. Hussain, H. Liu, C. L. Tan, and M. Dash. TRC6/99: Discretization: an enabling
technique. Technical report, National University of Singapore, Singapore, 1999.
[57] A. K. Jain and R. C. Dubes. Algorithms for Clustering Data. Prentice Hall
Advanced Reference Series. Prentice Hall, March 1988. Book available online at
http://www.cse.msu.edu/∼jain/Clustering Jain Dubes.pdf.
[58] I. T. Jolliffe. Principal Component Analysis. Springer Verlag, 2nd edition, October
2002.
[59] K. Kalpakis, D. Gada, and V. Puttagunta. Distance Measures for Effective Clustering
of ARIMA Time-Series. In Proc. of the 2001 IEEE Intl. Conf. on Data Mining, pages
273–280. IEEE Computer Society, 2001.
[60] L. Kaufman and P. J. Rousseeuw. Finding Groups in Data: An Introduction to Cluster
Analysis. Wiley Series in Probability and Statistics. John Wiley and Sons, New York,
November 1990.
[61] E. J. Keogh and M. J. Pazzani. Scaling up dynamic time warping for datamining
applications. In KDD, pages 285–289, 2000.
[62] R. Kohavi and G. H. John. Wrappers for Feature Subset Selection. Artificial Intelligence,
97(1–2):273–324, 1997.
[63] D. Krantz, R. D. Luce, P. Suppes, and A. Tversky. Foundations of Measurements:
Volume 1: Additive and polynomial representations. Academic Press, New York, 1971.
[64] J. B. Kruskal and E. M. Uslaner. Multidimensional Scaling. Sage Publications, August
1978.
[65] B. W. Lindgren. Statistical Theory. CRC Press, January 1993.
[66] H. Liu and H. Motoda, editors. Feature Extraction, Construction and Selection: A Data
Mining Perspective. Kluwer International Series in Engineering and Computer Science,
453. Kluwer Academic Publishers, July 1998.
[67] H. Liu and H. Motoda. Feature Selection for Knowledge Discovery and Data
Mining. Kluwer International Series in Engineering and Computer Science, 454. Kluwer
Academic Publishers, July 1998.
[68] H. Liu, H. Motoda, and L. Yu. Feature Extraction, Selection, and Construction. In
N. Ye, editor, The Handbook of Data Mining, pages 22–41. Lawrence Erlbaum
Associates, Inc., Mahwah, NJ, 2003.
[69] R. D. Luce, D. Krantz, P. Suppes, and A. Tversky. Foundations of Measurements:
Volume 3: Representation, Axiomatization, and Invariance. Academic Press, New York,
1990.
[70] MIT Total Data Quality Management Program. web.mit.edu/tdqm/www/index.shtml,
2003.
[71] L. C. Molina, L. Belanche, and A. Nebot. Feature Selection Algorithms: A Survey and
Experimental Evaluation. In Proc. of the 2002 IEEE Intl. Conf. on Data Mining, 2002.
[72] F. Olken and D. Rotem. Random Sampling from Databases—A Survey. Statistics &
Computing, 5(1):25–42, March 1995.
[73] J. Osborne. Notes on the Use of Data Transformations. Practical Assessment, Research
& Evaluation, 28(6), 2002.
[74] C. R. Palmer and C. Faloutsos. Density biased sampling: An improved method for data
mining and clustering. ACM SIGMOD Record, 29(2):82–92, 2000.
[75] F. J. Provost, D. Jensen, and T. Oates. Efficient Progressive Sampling. In Proc. of the
5th Intl. Conf. on Knowledge Discovery and Data Mining, pages 23–32, 1999.
[76] T. C. Redman. Data Quality: The Field Guide. Digital Press, January 2001.
[77] P. H. A. Sneath and R. R. Sokal. Numerical Taxonomy. Freeman, San Francisco, 1971.
[78] R. Srikant and R. Agrawal. Mining Quantitative Association Rules in Large Relational
Tables. In Proc. of 1996 ACM-SIGMOD Intl. Conf. on Management of Data, pages
1–12, Montreal, Quebec, Canada, August 1996.
[79] S. S. Stevens. On the Theory of Scales of Measurement. Science, 103(2684):677–680,
June 1946.
[80] S. S. Stevens. Measurement. In G. M. Maranell, editor, Scaling: A Sourcebook for
Behavioral Scientists, pages 22–41. Aldine Publishing Co., Chicago, 1974.
[81] P. Suppes, D. Krantz, R. D. Luce, and A. Tversky. Foundations of Measurements:
Volume 2: Geometrical, Threshold, and Probabilistic Representations. Academic Press,
New York, 1989.
[82] H. Toivonen. Sampling Large Databases for Association Rules. In VLDB96, pages
134–145. Morgan Kaufmann, September 1996.
[83] J. W. Tukey. On the Comparative Anatomy of Transformations. Annals of Mathematical
Statistics, 28(3):602–632, September 1957.
[84] R. Y. Wang, M. Ziad, Y. W. Lee, and Y. R. Wang. Data Quality. The Kluwer
International Series on Advances in Database Systems, Volume 23. Kluwer Academic
Publishers, January 2001.
[85] M. J. Zaki, S. Parthasarathy, W. Li, and M. Ogihara. Evaluation of Sampling for Data
Mining of Association Rules. Technical Report TR617, Rensselaer Polytechnic Institute,
1996.
2.6 Exercises
1. In the initial example of Chapter 2, the statistician says, “Yes, fields 2 and 3
are basically the same.” Can you tell from the three lines of sample data that
are shown why she says that?
2. Classify the following attributes as binary, discrete, or continuous. Also classify
them as qualitative (nominal or ordinal) or quantitative (interval or ratio).
Some cases may have more than one interpretation, so briefly indicate your
reasoning if you think there may be some ambiguity.
Example: Age in years. Answer: Discrete, quantitative, ratio
(a) Time in terms of AM or PM.
(b) Brightness as measured by a light meter.
(c) Brightness as measured by people’s judgments.
(d) Angles as measured in degrees between 0 and 360.
(e) Bronze, Silver, and Gold medals as awarded at the Olympics.
(f) Height above sea level.
(g) Number of patients in a hospital.
(h) ISBN numbers for books. (Look up the format on the Web.)
(i) Ability to pass light in terms of the following values: opaque, translucent,
transparent.
(j) Military rank.
(k) Distance from the center of campus.
(l) Density of a substance in grams per cubic centimeter.
(m) Coat check number. (When you attend an event, you can often give your
coat to someone who, in turn, gives you a number that you can use to
claim your coat when you leave.)
3. You are approached by the marketing director of a local company, who believes
that he has devised a foolproof way to measure customer satisfaction. He
explains his scheme as follows: “It’s so simple that I can’t believe that no one
has thought of it before. I just keep track of the number of customer complaints
for each product. I read in a data mining book that counts are ratio attributes,
and so, my measure of product satisfaction must be a ratio attribute. But
when I rated the products based on my new customer satisfaction measure and
showed them to my boss, he told me that I had overlooked the obvious, and
that my measure was worthless. I think that he was just mad because our best
selling product had the worst satisfaction since it had the most complaints.
Could you help me set him straight?”
(a) Who is right, the marketing director or his boss? If you answered, his
boss, what would you do to fix the measure of satisfaction?
(b) What can you say about the attribute type of the original product
satisfaction attribute?
4. A few months later, you are again approached by the same marketing director
as in Exercise 3. This time, he has devised a better approach to measure the
extent to which a customer prefers one product over other, similar products. He
explains, “When we develop new products, we typically create several variations
and evaluate which one customers prefer. Our standard procedure is to give
our test subjects all of the product variations at one time and then ask them to
rank the product variations in order of preference. However, our test subjects
are very indecisive, especially when there are more than two products. As a
result, testing takes forever. I suggested that we perform the comparisons in
pairs and then use these comparisons to get the rankings. Thus, if we have
three product variations, we have the customers compare variations 1 and 2,
then 2 and 3, and finally 3 and 1. Our testing time with my new procedure
is a third of what it was for the old procedure, but the employees conducting
the tests complain that they cannot come up with a consistent ranking from
the results. And my boss wants the latest product evaluations, yesterday. I
should also mention that he was the person who came up with the old product
evaluation approach. Can you help me?”
(a) Is the marketing director in trouble? Will his approach work for
generating an ordinal ranking of the product variations in terms of customer
preference? Explain.
(b) Is there a way to fix the marketing director’s approach? More generally,
what can you say about trying to create an ordinal measurement scale
based on pairwise comparisons?
(c) For the original product evaluation scheme, the overall rankings of each
product variation are found by computing its average over all test subjects.
Comment on whether you think that this is a reasonable approach. What
other approaches might you take?
5. Can you think of a situation in which identification numbers would be useful
for prediction?
6. An educational psychologist wants to use association analysis to analyze test
results. The test consists of 100 questions with four possible answers each.
(a) How would you convert this data into a form suitable for association
analysis?
(b) In particular, what type of attributes would you have and how many of
them are there?
7. Which of the following quantities is likely to show more temporal
autocorrelation: daily rainfall or daily temperature? Why?
8. Discuss why a document-term matrix is an example of a data set that has
asymmetric discrete or asymmetric continuous features.
9. Many sciences rely on observation instead of (or in addition to) designed
experiments. Compare the data quality issues involved in observational science
with those of experimental science and data mining.
10. Discuss the difference between the precision of a measurement and the terms
single and double precision, as they are used in computer science, typically to
represent floating-point numbers that require 32 and 64 bits, respectively.
11. Give at least two advantages to working with data stored in text files instead
of in a binary format.
12. Distinguish between noise and outliers. Be sure to consider the following
questions.
(a) Is noise ever interesting or desirable? Outliers?
(b) Can noise objects be outliers?
(c) Are noise objects always outliers?
(d) Are outliers always noise objects?
(e) Can noise make a typical value into an unusual one, or vice versa?
13. Consider the problem of finding the K nearest neighbors of a data object. A
programmer designs Algorithm 2.2 for this task.
Algorithm 2.2 Algorithm for finding K nearest neighbors.
1: for i = 1 to number of data objects do
2: Find the distances of the ith object to all other objects.
3: Sort these distances in decreasing order.
(Keep track of which object is associated with each distance.)
4: return the objects associated with the first K distances of the sorted list
5: end for
(a) Describe the potential problems with this algorithm if there are duplicate
objects in the data set. Assume the distance function will only return a
distance of 0 for objects that are the same.
(b) How would you fix this problem?
14. The following attributes are measured for members of a herd of Asian
elephants: weight, height, tusk length, trunk length, and ear area. Based on these
measurements, what sort of similarity measure from Section 2.4 would you use
to compare or group these elephants? Justify your answer and explain any
special circumstances.
15. You are given a set of m objects that is divided into K groups, where the ith
group is of size mi. If the goal is to obtain a sample of size n < m, what is
the difference between the following two sampling schemes? (Assume sampling
with replacement.)
(a) We randomly select n ∗ mi/m elements from each group.
(b) We randomly select n elements from the data set, without regard for the
group to which an object belongs.
16. Consider a document-term matrix, where tfij is the frequency of the ith word
(term) in the jth document and m is the number of documents. Consider the
variable transformation that is defined by
$$tf'_{ij} = tf_{ij} \cdot \log\frac{m}{df_i}, \qquad (2.18)$$
where dfi is the number of documents in which the ith term appears, which
is known as the document frequency of the term. This transformation is
known as the inverse document frequency transformation.
(a) What is the effect of this transformation if a term occurs in one document?
In every document?
(b) What might be the purpose of this transformation?
17. Assume that we apply a square root transformation to a ratio attribute x to
obtain the new attribute x∗. As part of your analysis, you identify an interval
(a, b) in which x∗ has a linear relationship to another attribute y.
(a) What is the corresponding interval (a, b) in terms of x?
(b) Give an equation that relates y to x.
18. This exercise compares and contrasts some similarity and distance measures.
(a) For binary data, the L1 distance corresponds to the Hamming distance;
that is, the number of bits that are different between two binary vectors.
The Jaccard similarity is a measure of the similarity between two binary
vectors. Compute the Hamming distance and the Jaccard similarity between
the following two binary vectors.
x = 0101010001
y = 0100011000
(b) Which approach, Jaccard or Hamming distance, is more similar to the
Simple Matching Coefficient, and which approach is more similar to the
cosine measure? Explain. (Note: The Hamming measure is a distance,
while the other three measures are similarities, but don’t let this confuse
you.)
(c) Suppose that you are comparing how similar two organisms of different
species are in terms of the number of genes they share. Describe which
measure, Hamming or Jaccard, you think would be more appropriate for
comparing the genetic makeup of two organisms. Explain. (Assume that
each animal is represented as a binary vector, where each attribute is 1 if
a particular gene is present in the organism and 0 otherwise.)
(d) If you wanted to compare the genetic makeup of two organisms of the same
species, e.g., two human beings, would you use the Hamming distance,
the Jaccard coefficient, or a different measure of similarity or distance?
Explain. (Note that two human beings share > 99.9% of the same genes.)
19. For the following vectors, x and y, calculate the indicated similarity or distance
measures.
(a) x = (1, 1, 1, 1), y = (2, 2, 2, 2) cosine, correlation, Euclidean
(b) x = (0, 1, 0, 1), y = (1, 0, 1, 0) cosine, correlation, Euclidean, Jaccard
(c) x = (0, −1, 0, 1), y = (1, 0, −1, 0) cosine, correlation, Euclidean
(d) x = (1, 1, 0, 1, 0, 1), y = (1, 1, 1, 0, 0, 1) cosine, correlation, Jaccard
(e) x = (2, −1, 0, 2, 0, −3), y = (−1, 1, −1, 0, 0, −1) cosine, correlation
20. Here, we further explore the cosine and correlation measures.
(a) What is the range of values that are possible for the cosine measure?
(b) If two objects have a cosine measure of 1, are they identical? Explain.
(c) What is the relationship of the cosine measure to correlation, if any?
(Hint: Look at statistical measures such as mean and standard deviation
in cases where cosine and correlation are the same and different.)
(d) Figure 2.20(a) shows the relationship of the cosine measure to Euclidean
distance for 100,000 randomly generated points that have been normalized
to have an L2 length of 1. What general observation can you make about
the relationship between Euclidean distance and cosine similarity when
vectors have an L2 norm of 1?
(e) Figure 2.20(b) shows the relationship of correlation to Euclidean distance
for 100,000 randomly generated points that have been standardized to
have a mean of 0 and a standard deviation of 1. What general observation
can you make about the relationship between Euclidean distance and
correlation when the vectors have been standardized to have a mean of 0
and a standard deviation of 1?
(f) Derive the mathematical relationship between cosine similarity and Euclidean
distance when each data object has an L2 length of 1.
(g) Derive the mathematical relationship between correlation and Euclidean
distance when each data point has been standardized by subtracting
its mean and dividing by its standard deviation.
[Figure 2.20: two plots, each with Euclidean distance (0 to 1.4) on the vertical axis.
Panel (a) has cosine similarity (0 to 1) on the horizontal axis; panel (b) has
correlation (0 to 1) on the horizontal axis.]
(a) Relationship between Euclidean distance and the cosine measure.
(b) Relationship between Euclidean distance and correlation.
Figure 2.20. Graphs for Exercise 20.
21. Show that the set difference metric given by
d(A, B) = size(A − B) + size(B − A) (2.19)
satisfies the metric axioms given on page 70. A and B are sets and A − B is
the set difference.
22. Discuss how you might map correlation values from the interval [−1,1] to the
interval [0,1]. Note that the type of transformation that you use might depend
on the application that you have in mind. Thus, consider two applications:
clustering time series and predicting the behavior of one time series given
another.
23. Given a similarity measure with values in the interval [0,1], describe two ways to
transform this similarity value into a dissimilarity value in the interval [0,∞].
24. Proximity is typically defined between a pair of objects.
(a) Define two ways in which you might define the proximity among a group
of objects.
(b) How might you define the distance between two sets of points in Euclidean
space?
(c) How might you define the proximity between two sets of data objects?
(Make no assumption about the data objects, except that a proximity
measure is defined between any pair of objects.)
25. You are given a set of points S in Euclidean space, as well as the distance of
each point in S to a point x. (It does not matter if x ∈ S.)
(a) If the goal is to find all points within a specified distance ε of point y,
y ≠ x, explain how you could use the triangle inequality and the already
calculated distances to x to potentially reduce the number of distance
calculations necessary. Hint: The triangle inequality, d(x, z) ≤ d(x, y) +
d(y, z), can be rewritten as d(x, y) ≥ d(x, z) − d(y, z).
(b) In general, how would the distance between x and y affect the number of
distance calculations?
(c) Suppose that you can find a small subset of points S′, from the original
data set, such that every point in the data set is within a specified distance
ε of at least one of the points in S′, and that you also have the pairwise
distance matrix for S′. Describe a technique that uses this information to
compute, with a minimum of distance calculations, the set of all points
within a distance of β of a specified point from the data set.
26. Show that 1 minus the Jaccard similarity is a distance measure between two data
objects, x and y, that satisfies the metric axioms given on page 70. Specifically,
d(x, y) = 1 − J(x, y).
27. Show that the distance measure defined as the angle between two data vectors,
x and y, satisfies the metric axioms given on page 70. Specifically, d(x, y) =
arccos(cos(x, y)).
28. Explain why computing the proximity between two attributes is often simpler
than computing the similarity between two objects.
3 Exploring Data
The previous chapter addressed high-level data issues that are important in
the knowledge discovery process. This chapter provides an introduction to
data exploration, which is a preliminary investigation of the data in order
to better understand its specific characteristics. Data exploration can aid in
selecting the appropriate preprocessing and data analysis techniques. It can
even address some of the questions typically answered by data mining. For
example, patterns can sometimes be found by visually inspecting the data.
Also, some of the techniques used in data exploration, such as visualization,
can be used to understand and interpret data mining results.
This chapter covers three major topics: summary statistics, visualization,
and On-Line Analytical Processing (OLAP). Summary statistics, such as the
mean and standard deviation of a set of values, and visualization techniques,
such as histograms and scatter plots, are standard methods that are widely
employed for data exploration. OLAP, which is a more recent development,
consists of a set of techniques for exploring multidimensional arrays of values.
OLAP-related analysis functions focus on various ways to create summary
data tables from a multidimensional data array. These techniques include
aggregating data either across various dimensions or across various attribute
values. For instance, if we are given sales information reported according
to product, location, and date, OLAP techniques can be used to create a
summary that describes the sales activity at a particular location by month
and product category.
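The kind of summary described in this example can be sketched in a few lines of Python. The sales records, the field layout, and the product-to-category mapping below are hypothetical and serve only as an illustration; OLAP systems provide this sort of aggregation as a built-in operation rather than through hand-written loops.

```python
from collections import defaultdict

# Hypothetical sales records: (product, location, date as YYYY-MM-DD, amount).
sales = [
    ("laptop", "Minneapolis", "2004-01-15", 1200.0),
    ("phone", "Minneapolis", "2004-01-20", 300.0),
    ("laptop", "Minneapolis", "2004-02-03", 1100.0),
    ("desk", "Minneapolis", "2004-01-08", 250.0),
    ("phone", "Chicago", "2004-01-11", 320.0),
]

# Hypothetical mapping from product to product category.
category = {"laptop": "electronics", "phone": "electronics", "desk": "furniture"}


def summarize(records, location):
    """Total sales at one location, aggregated by (month, product category)."""
    totals = defaultdict(float)
    for product, loc, date, amount in records:
        if loc == location:
            month = date[:7]  # keep only "YYYY-MM"
            totals[(month, category[product])] += amount
    return dict(totals)


summary = summarize(sales, "Minneapolis")
for (month, cat), total in sorted(summary.items()):
    print(month, cat, total)
```

Aggregating over the date dimension (days into months) and over attribute values (products into categories) are the two kinds of aggregation mentioned above.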
The topics covered in this chapter have considerable overlap with the area
known as Exploratory Data Analysis (EDA), which was created in the
1970s by the prominent statistician, John Tukey. This chapter, like EDA,
places a heavy emphasis on visualization. Unlike EDA, this chapter does not
include topics such as cluster analysis or anomaly detection. There are two
reasons for this. First, data mining views descriptive data analysis techniques
as an end in themselves, whereas statistics, from which EDA originated, tends
to view hypothesis-based testing as the final goal. Second, cluster analysis
and anomaly detection are large areas and require full chapters for an
in-depth discussion. Hence, cluster analysis is covered in Chapters 8 and 9, while
anomaly detection is discussed in Chapter 10.
From Chapter 3 of Introduction to Data Mining, First Edition. Pang-Ning Tan, Michael Steinbach,
Vipin Kumar. Copyright © 2006 by Pearson Education, Inc. All rights reserved.
3.1 The Iris Data Set
In the following discussion, we will often refer to the Iris data set that is
available from the University of California at Irvine (UCI) Machine Learning
Repository. It consists of information on 150 Iris flowers, 50 each from
one of three Iris species: Setosa, Versicolour, and Virginica. Each flower is
characterized by five attributes:
1. sepal length in centimeters
2. sepal width in centimeters
3. petal length in centimeters
4. petal width in centimeters
5. class (Setosa, Versicolour, Virginica)
The sepals of a flower are the outer structures that protect the more fragile
parts of the flower, such as the petals. In many flowers, the sepals are green,
and only the petals are colorful. For Irises, however, the sepals are also colorful.
As illustrated by the picture of a Virginica Iris in Figure 3.1, the sepals of an
Iris are larger than the petals and are drooping, while the petals are upright.
3.2 Summary Statistics
Summary statistics are quantities, such as the mean and standard deviation,
that capture various characteristics of a potentially large set of values with a
single number or a small set of numbers. Everyday examples of summary
statistics are the average household income or the fraction of college students
who complete an undergraduate degree in four years. Indeed, for many people,
summary statistics are the most visible manifestation of statistics. We will
concentrate on summary statistics for the values of a single attribute, but will
provide a brief description of some multivariate summary statistics.
Figure 3.1. Picture of Iris Virginica. Robert H. Mohlenbrock @ USDA-NRCS PLANTS Database/
USDA NRCS. 1995. Northeast wetland flora: Field office guide to plant species. Northeast National
Technical Center, Chester, PA. Background removed.
This section considers only the descriptive nature of summary statistics.
However, as described in Appendix C, statistics views data as arising from an
underlying statistical process that is characterized by various parameters, and
some of the summary statistics discussed here can be viewed as estimates of
statistical parameters of the underlying distribution that generated the data.
3.2.1 Frequencies and the Mode
Given a set of unordered categorical values, there is not much that can be done
to further characterize the values except to compute the frequency with which
each value occurs for a particular set of data. Given a categorical attribute x,
which can take values {v1, . . . , vi, . . . vk} and a set of m objects, the frequency
of a value vi is defined as
$$\text{frequency}(v_i) = \frac{\text{number of objects with attribute value } v_i}{m}. \qquad (3.1)$$
The mode of a categorical attribute is the value that has the highest frequency.
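As a minimal sketch of these two definitions, the following Python code computes the frequency of each value and the mode. The function names and the sample list `colors` are hypothetical and are not part of the text.

```python
from collections import Counter


def frequencies(values):
    """Map each value to the fraction of objects that have it (Equation 3.1)."""
    m = len(values)
    counts = Counter(values)
    return {v: c / m for v, c in counts.items()}


def mode(values):
    """Value with the highest frequency (ties go to the value seen first)."""
    return Counter(values).most_common(1)[0][0]


# Hypothetical categorical attribute observed for ten objects.
colors = ["red", "blue", "red", "green", "red",
          "blue", "red", "green", "blue", "red"]

freq = frequencies(colors)
most_frequent = mode(colors)
print(freq)
print(most_frequent)
```

When several values share the highest frequency, this sketch simply reports the first one encountered; whether that is appropriate depends on the application.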
Example 3.1. Consider a set of students who have an attribute, class, which
can take values from the set {freshman, sophomore, junior, senior}. Table
3.1 shows the number of students for each value of the class attribute. The
mode of the class attribute is freshman, with a frequency of 0.33. This may
indicate dropouts due to attrition or a larger than usual freshman class.
Table 3.1. Class size for students in a hypothetical college.
Class       Size   Frequency
freshman    200    0.33
sophomore   160    0.27
junior      130    0.22
senior      110    0.18
Categorical attributes often, but not always, have a small number of values,
and consequently, the mode and frequencies of these values can be interesting
and useful. Notice, though, that for the Iris data set and the class attribute,
the three types of flower all have the same frequency, and therefore, the notion
of a mode is not interesting.
For continuous data, the mode, as currently defined, is often not useful
because a single value may not occur more than once. Nonetheless, in some
cases, the mode may indicate important information about the nature of the
values or the presence of missing values. For example, the heights of 20 people
measured to the nearest millimeter will typically not repeat, but if the heights
are measured to the nearest tenth of a meter, then some people may have the
same height. Also, if a unique value is used to indicate a missing value, then
this value will often show up as the mode.
3.2.2 Percentiles
For ordered data, it is more useful to consider the percentiles of a set of
values. In particular, given an ordinal or continuous attribute x and a number
p between 0 and 100, the pth percentile xp is a value of x such that p% of the
observed values of x are less than xp. For instance, the 50th percentile is the
value x50% such that 50% of all values of x are less than x50%. Table 3.2 shows
the percentiles for the four quantitative attributes of the Iris data set.
Table 3.2. Percentiles for sepal length, sepal width, petal length, and petal width. (All values are in
centimeters.)
Percentile   Sepal Length   Sepal Width   Petal Length   Petal Width
0            4.3            2.0           1.0            0.1
10           4.8            2.5           1.4            0.2
20           5.0            2.7           1.5            0.2
30           5.2            2.8           1.7            0.4
40           5.6            3.0           3.9            1.2
50           5.8            3.0           4.4            1.3
60           6.1            3.1           4.6            1.5
70           6.3            3.2           5.0            1.8
80           6.6            3.4           5.4            1.9
90           6.9            3.6           5.8            2.2
100          7.9            4.4           6.9            2.5
Example 3.2. The percentiles, x0%, x10%, . . . , x90%, x100% of the integers from
1 to 10 are, in order, the following: 1.0, 1.5, 2.5, 3.5, 4.5, 5.5, 6.5, 7.5, 8.5, 9.5,
10.0. By tradition, min(x) = x0% and max(x) = x100%.
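The text does not state which interpolation rule produced the values in Example 3.2, and several conventions are in common use. The sketch below assumes one of them: the kth smallest of m values is treated as the 100(k − 0.5)/m percentile, values in between are linearly interpolated, and requests beyond either end are clamped to the minimum or maximum. The function name `percentile` is hypothetical. Under this assumption, the code reproduces the list given in the example.

```python
def percentile(values, p):
    """p-th percentile (0 <= p <= 100) using midpoint positions.

    Assumed convention: the k-th smallest of m values sits at the
    100 * (k - 0.5) / m percentile; intermediate percentiles are linearly
    interpolated and the ends are clamped to the minimum and maximum.
    """
    xs = sorted(values)
    m = len(xs)
    pos = p * m / 100 + 0.5              # fractional 1-based position in xs
    pos = min(max(pos, 1.0), float(m))   # clamp to [1, m]
    k = int(pos)                         # integer part of the position
    frac = pos - k
    if k == m:                           # at (or clamped to) the maximum
        return float(xs[-1])
    return xs[k - 1] + frac * (xs[k] - xs[k - 1])


data = range(1, 11)
result = [percentile(data, p) for p in range(0, 101, 10)]
print(result)
# [1.0, 1.5, 2.5, 3.5, 4.5, 5.5, 6.5, 7.5, 8.5, 9.5, 10.0]
```

Other tools may use a different rule by default and can therefore return slightly different percentiles for the same data.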
3.2.3 Measures of Location: Mean and Median
For continuous data, two of the most widely used summary statistics are the
mean and median, which are measures of the location of a set of values.
Consider a set of m objects and an attribute x. Let {x1, . . . , xm} be the
attribute values of x for these m objects. As a concrete example, these values
might be the heights of m children. Let {x(1), . . . , x(m)} represent the values
of x after they have been sorted in nondecreasing order. Thus, x(1) = min(x)
and x(m) = max(x). Then, the mean and median are defined as follows:
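$$\text{mean}(x) = \bar{x} = \frac{1}{m}\sum_{i=1}^{m} x_i \qquad (3.2)$$
$$\text{median}(x) = \begin{cases} x_{(r+1)} & \text{if } m \text{ is odd, i.e., } m = 2r + 1 \\ \frac{1}{2}\left(x_{(r)} + x_{(r+1)}\right) & \text{if } m \text{ is even, i.e., } m = 2r \end{cases} \qquad (3.3)$$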
To summarize, the median is the middle value if there are an odd number
of values, and the average of the two middle values if the number of values
is even. Thus, for seven values, the median is x(4), while for ten values, the
median is (1/2)(x(5) + x(6)).
Although the mean is sometimes interpreted as the middle of a set of values,
this is only correct if the values are distributed in a symmetric manner. If the
distribution of values is skewed, then the median is a better indicator of the
middle. Also, the mean is sensitive to the presence of outliers. For data with
outliers, the median again provides a more robust estimate of the middle of a
set of values.
To overcome problems with the traditional definition of a mean, the notion
of a trimmed mean is sometimes used. A percentage p between 0 and 100
is specified, the top and bottom (p/2)% of the data is thrown out, and the
mean is then calculated in the normal way. The median is a trimmed mean
with p = 100%, while the standard mean corresponds to p = 0%.
Example 3.3. Consider the set of values {1, 2, 3, 4, 5, 90}. The mean of these
values is 17.5, while the median is 3.5. The trimmed mean with p = 40% is
also 3.5.
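The numbers in Example 3.3 can be checked with a short Python sketch. The helper `trimmed_mean` is hypothetical, and it makes two assumptions that the text leaves open: the number of values dropped from each end is rounded down, and at least the middle one or two values are always kept, so that p = 100 yields the median.

```python
from statistics import mean, median


def trimmed_mean(values, p):
    """Mean after discarding the top and bottom (p/2)% of the sorted values.

    Assumptions: the count dropped from each end is rounded down, and the
    middle one or two values are never discarded.
    """
    xs = sorted(values)
    m = len(xs)
    k = int(m * p / 200)          # number of values to drop from each end
    k = min(k, (m - 1) // 2)      # never discard everything
    return mean(xs[k:m - k])


values = [1, 2, 3, 4, 5, 90]
print(mean(values))               # 17.5
print(median(values))             # 3.5
print(trimmed_mean(values, 40))   # 3.5
```

With p = 40 and six values, 20% of 6 is 1.2, so one value is dropped from each end and the mean of {2, 3, 4, 5} is reported.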
Example 3.4. The means, medians, and trimmed means (p = 20%) of the
four quantitative attributes of the Iris data are given in Table 3.3. The three
measures of location have similar values except for the attribute petal length.
Table 3.3. Means and medians for sepal length, sepal width, petal length, and petal width. (All values
are in centimeters.)
Measure               Sepal Length   Sepal Width   Petal Length   Petal Width
mean                  5.84           3.05          3.76           1.20
median                5.80           3.00          4.35           1.30
trimmed mean (20%)    5.79           3.02          3.72           1.12
3.2.4 Measures of Spread: Range and Variance
Another set of commonly used summary statistics for continuous data are
those that measure the dispersion or spread of a set of values. Such measures
indicate if the attribute values are widely spread out or if they are relatively
concentrated around a single point such as the mean.
The simplest measure of spread is the range, which, given an attribute x
with a set of m values {x1, . . . , xm}, is defined as
range(x) = max(x) − min(x) = x(m) − x(1). (3.4)
Table 3.4. Range, standard deviation (std), absolute average deviation (AAD), median absolute
deviation (MAD), and interquartile range (IQR) for sepal length, sepal width, petal length, and petal width.
(All values are in centimeters.)
Measure   Sepal Length   Sepal Width   Petal Length   Petal Width
range     3.6            2.4           5.9            2.4
std       0.8            0.4           1.8            0.8
AAD       0.7            0.3           1.6            0.6
MAD       0.7            0.3           1.2            0.7
IQR       1.3            0.5           3.5            1.5
Although the range identifies the maximum spread, it can be misleading if
most of the values are concentrated in a narrow band of values, but there are
also a relatively small number of more extreme values. Hence, the variance
is preferred as a measure of spread. The variance of the (observed) values of
an attribute x is typically written as $s_x^2$ and is defined below. The standard
deviation, which is the square root of the variance, is written as $s_x$ and has
the same units as x.
$$\text{variance}(x) = s_x^2 = \frac{1}{m-1}\sum_{i=1}^{m} (x_i - \bar{x})^2 \qquad (3.5)$$
The mean can be distorted by outliers, and since the variance is computed
using the mean, it is also sensitive to outliers. Indeed, the variance is particularly
sensitive to outliers since it uses the squared difference between the mean
and other values. As a result, more robust estimates of the spread of a set
of values are often used. Following are the definitions of three such measures:
the absolute average deviation (AAD), the median absolute deviation
(MAD), and the interquartile range (IQR). Table 3.4 shows these measures
for the Iris data set.
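$$\text{AAD}(x) = \frac{1}{m}\sum_{i=1}^{m} |x_i - \bar{x}| \qquad (3.6)$$
$$\text{MAD}(x) = \text{median}\big(\{|x_1 - \bar{x}|, \ldots, |x_m - \bar{x}|\}\big) \qquad (3.7)$$
$$\text{interquartile range}(x) = x_{75\%} - x_{25\%} \qquad (3.8)$$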
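The measures of spread defined in this section can be computed together, as in the following Python sketch. The function `spread_measures` and the sample values are hypothetical. Two choices are assumptions: MAD uses deviations from the mean, following the definition printed in the text, and the quartiles come from the standard library's "inclusive" method, which is only one of several percentile conventions.

```python
from statistics import mean, median, quantiles, variance


def spread_measures(values):
    """Range, variance, std, AAD, MAD, and IQR of a list of numbers."""
    x_bar = mean(values)
    abs_dev = [abs(x - x_bar) for x in values]   # |x_i - mean|
    # Quartile conventions differ between tools; "inclusive" is one choice.
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    var = variance(values)                       # divides by m - 1
    return {
        "range": max(values) - min(values),
        "variance": var,
        "std": var ** 0.5,
        "AAD": mean(abs_dev),
        "MAD": median(abs_dev),   # deviations from the mean, as in the text
        "IQR": q3 - q1,
    }


stats = spread_measures([1, 2, 3, 4, 5, 90])
for name, value in stats.items():
    print(name, round(value, 2))
```

For this sample, the single extreme value 90 inflates the range and the variance far more than it affects MAD or IQR, which is the sense in which the latter are more robust.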
3.2.5 Multivariate Summary Statistics
Measures of location for data that consists of several attributes (multivariate
data) can be obtained by computing the mean or median separately for each
attribute. Thus, given a data set, the mean of the data objects, $\bar{\mathbf{x}}$, is given by
$$\bar{\mathbf{x}} = (\bar{x}_1, \ldots, \bar{x}_n), \qquad (3.9)$$
where $\bar{x}_i$ is the mean of the ith attribute $x_i$.
For multivariate data, the spread of each attribute can be computed
independently of the other attributes using any of the approaches described in
Section 3.2.4. However, for data with continuous variables, the spread of the
data is most commonly captured by the covariance matrix S, whose ijth
entry sij is the covariance of the ith and jth attributes of the data. Thus, if xi
and xj are the ith and jth attributes, then
$$s_{ij} = \text{covariance}(x_i, x_j). \qquad (3.10)$$
In turn, covariance(xi, xj) is given by
$$\text{covariance}(x_i, x_j) = \frac{1}{m-1}\sum_{k=1}^{m} (x_{ki} - \bar{x}_i)(x_{kj} - \bar{x}_j), \qquad (3.11)$$
where xki and xkj are the values of the ith and jth attributes for the kth object.
Notice that covariance(xi, xi) = variance(xi). Thus, the covariance matrix has
the variances of the attributes along the diagonal.
The covariance of two attributes is a measure of the degree to which two
attributes vary together and depends on the magnitudes of the variables. A
value near 0 indicates that two attributes do not have a (linear) relationship,
but it is not possible to judge the degree of relationship between two variables
by looking only at the value of the covariance. Because the correlation of two
attributes immediately gives an indication of how strongly two attributes are
(linearly) related, correlation is preferred to covariance for data exploration.
(Also see the discussion of correlation in Section 2.4.5.) The ijth entry of the
correlation matrix R, is the correlation between the ith and jth attributes
of the data. If xi and xj are the ith and jth attributes, then
$$r_{ij} = \text{correlation}(x_i, x_j) = \frac{\text{covariance}(x_i, x_j)}{s_i s_j}, \qquad (3.12)$$
where $s_i$ and $s_j$ are the standard deviations of $x_i$ and $x_j$, respectively. The diagonal
entries of R are correlation(xi, xi) = 1, while the other entries are between
−1 and 1. It is also useful to consider correlation matrices that contain the
pairwise correlations of objects instead of attributes.
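The covariance and correlation matrices can be built directly from their definitions, as the following plain-Python sketch shows. The function names and the small data set are hypothetical; each correlation entry is obtained by dividing the covariance by the product of the two attributes' standard deviations, which are the square roots of the diagonal entries of S.

```python
from math import sqrt


def covariance(xs, ys):
    """Sample covariance of two equally long lists (divides by m - 1)."""
    m = len(xs)
    x_bar = sum(xs) / m
    y_bar = sum(ys) / m
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (m - 1)


def covariance_matrix(data):
    """Matrix S whose entry (i, j) is the covariance of attributes i and j."""
    n = len(data[0])
    cols = [[row[i] for row in data] for i in range(n)]
    return [[covariance(cols[i], cols[j]) for j in range(n)] for i in range(n)]


def correlation_matrix(data):
    """Matrix R whose entry (i, j) is covariance / (std_i * std_j)."""
    S = covariance_matrix(data)
    n = len(S)
    std = [sqrt(S[i][i]) for i in range(n)]   # diagonal holds the variances
    return [[S[i][j] / (std[i] * std[j]) for j in range(n)] for i in range(n)]


# Hypothetical data: four objects (rows) described by three attributes (columns).
data = [
    [1.0, 2.0, 8.0],
    [2.0, 4.0, 6.0],
    [3.0, 6.0, 4.0],
    [4.0, 8.0, 2.0],
]
S = covariance_matrix(data)
R = correlation_matrix(data)
print(S)
print(R)
```

In this data set the second attribute is twice the first and the third decreases linearly with the first, so the covariances differ in magnitude while the correlations are all 1 or −1. This illustrates why correlation is easier to interpret than covariance during exploration.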
3.2.6 Other Ways to Summarize the Data
There are, of course, other types of summary statistics. For instance, the
skewness of a set of values measures the degree to which the values are
symmetrically distributed around the mean. There are also other characteristics
of the data that are not easy to measure quantitatively, such as whether the
distribution of values is multimodal; i.e., the data has multiple “bumps” where
most of the values are concentrated. In many cases, however, the most effective
approach to understanding the more complicated or subtle aspects of how
the values of an attribute are distributed is to view the values graphically in
the form of a histogram. (Histograms are discussed in the next section.)
3.3 Visualization
Data visualization is the display of information in a graphic or tabular format.
Successful visualization requires that the data (information) be converted into
a visual format so that the characteristics of the data and the relationships
among data items or attributes can be analyzed or reported. The goal of
visualization is the interpretation of the visualized information by a person
and the formation of a mental model of the information.
In everyday life, visual techniques such as graphs and tables are often the
preferred approach used to explain the weather, the economy, and the results
of political elections. Likewise, while algorithmic or mathematical approaches
are often emphasized in most technical disciplines—data mining included—
visual techniques can play a key role in data analysis. In fact, sometimes the
use of visualization techniques in data mining is referred to as visual data
mining.
3.3.1 Motivations for Visualization
The overriding motivation for using visualization is that people can quickly
absorb large amounts of visual information and find patterns in it. Consider
Figure 3.2, which shows the Sea Surface Temperature (SST) in degrees Celsius
for July, 1982. This picture summarizes the information from approximately
250,000 numbers and is readily interpreted in a few seconds. For example, it
is easy to see that the ocean temperature is highest at the equator and lowest
at the poles.
Figure 3.2. Sea Surface Temperature (SST) for July, 1982. [Map with longitude (−180 to 180) on the
horizontal axis, latitude (−90 to 90) on the vertical axis, and a temperature scale from 0 to 30.]
Another general motivation for visualization is to make use of the domain
knowledge that is “locked up in people’s heads.” While the use of domain
knowledge is an important task in data mining, it is often difficult or impossible
to fully utilize such knowledge in statistical or algorithmic tools. In some cases,
an analysis can be performed using non-visual tools, and then the results
presented visually for evaluation by the domain expert. In other cases, having
a domain specialist examine visualizations of the data may be the best way
of finding patterns of interest since, by using domain knowledge, a person can
often quickly eliminate many uninteresting patterns and direct the focus to
the patterns that are important.
3.3.2 General Concepts
This section explores some of the general concepts related to visualization, in
particular, general approaches for visualizing the data and its attributes. A
number of visualization techniques are mentioned briefly and will be described
in more detail when we discuss specific approaches later on. We assume that
the reader is familiar with line graphs, bar charts, and scatter plots.
Representation: Mapping Data to Graphical Elements
The first step in visualization is the mapping of information to a visual format;
i.e., mapping the objects, attributes, and relationships in a set of information
to visual objects, attributes, and relationships. That is, data objects, their
attributes, and the relationships among data objects are translated into graphical
elements such as points, lines, shapes, and colors.
Objects are usually represented in one of three ways. First, if only a
single categorical attribute of the object is being considered, then objects
are often lumped into categories based on the value of that attribute, and
these categories are displayed as an entry in a table or an area on a screen.
(Examples shown later in this chapter are a cross-tabulation table and a bar
chart.) Second, if an object has multiple attributes, then the object can be
displayed as a row (or column) of a table or as a line on a graph. Finally,
an object is often interpreted as a point in two- or three-dimensional space,
where graphically, the point might be represented by a geometric figure, such
as a circle, cross, or box.
For attributes, the representation depends on the type of attribute, i.e.,
nominal, ordinal, or continuous (interval or ratio). Ordinal and continuous
attributes can be mapped to continuous, ordered graphical features such as
location along the x, y, or z axes; intensity; color; or size (diameter, width,
height, etc.). For categorical attributes, each category can be mapped to
a distinct position, color, shape, orientation, embellishment, or column in
a table. However, for nominal attributes, whose values are unordered, care
should be taken when using graphical features, such as color and position, that
have an inherent ordering associated with their values. In other words, the
graphical elements used to represent the nominal values often have an order,
but nominal values do not.
The representation of relationships via graphical elements occurs either
explicitly or implicitly. For graph data, the standard graph representation—
a set of nodes with links between the nodes—is normally used. If the nodes
(data objects) or links (relationships) have attributes or characteristics of their
own, then this is represented graphically. To illustrate, if the nodes are cities
and the links are highways, then the diameter of the nodes might represent
population, while the width of the links might represent the volume of traffic.
In most cases, though, mapping objects and attributes to graphical elements
implicitly maps the relationships in the data to relationships among
graphical elements. To illustrate, if the data object represents a physical object
that has a location, such as a city, then the relative positions of the graphical
objects corresponding to the data objects tend to naturally preserve the actual
relative positions of the objects. Likewise, if there are two or three continuous
attributes that are taken as the coordinates of the data points, then the resulting
plot often gives considerable insight into the relationships of the attributes
and the data points because data points that are visually close to each other
have similar values for their attributes.
In general, it is difficult to ensure that a mapping of objects and attributes
will result in the relationships being mapped to easily observed relationships
among graphical elements. Indeed, this is one of the most challenging aspects
of visualization. In any given set of data, there are many implicit relationships,
and hence, a key challenge of visualization is to choose a technique that makes
the relationships of interest easily observable.
Arrangement
As discussed earlier, the proper choice of visual representation of objects and
attributes is essential for good visualization. The arrangement of items within
the visual display is also crucial. We illustrate this with two examples.
Example 3.5. This example illustrates the importance of rearranging a table
of data. In Table 3.5, which shows nine objects with six binary attributes,
there is no clear relationship between objects and attributes, at least at first
glance. If the rows and columns of this table are permuted, however, as shown
in Table 3.6, then it is clear that there are really only two types of objects in
the table—one that has all ones for the first three attributes and one that has
only ones for the last three attributes.
Table 3.5. A table of nine objects (rows) with six binary attributes (columns).
    1 2 3 4 5 6
1   0 1 0 1 1 0
2   1 0 1 0 0 1
3   0 1 0 1 1 0
4   1 0 1 0 0 1
5   0 1 0 1 1 0
6   1 0 1 0 0 1
7   0 1 0 1 1 0
8   1 0 1 0 0 1
9   0 1 0 1 1 0
Table 3.6. A table of nine objects (rows) with six binary attributes (columns) permuted so that the relationships of the rows and columns are clear.
    6 1 3 2 5 4
4   1 1 1 0 0 0
2   1 1 1 0 0 0
6   1 1 1 0 0 0
8   1 1 1 0 0 0
5   0 0 0 1 1 1
3   0 0 0 1 1 1
9   0 0 0 1 1 1
1   0 0 0 1 1 1
7   0 0 0 1 1 1
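The rearrangement of Example 3.5 can be sketched in a few lines of Python. The rule used here (sort the rows by their bit patterns, then sort the columns by their values read down the reordered rows) is an assumption made for illustration; it works for this table because identical rows and identical columns end up adjacent, but it is not a general reordering method. The function name group_identical is hypothetical. The result groups the same objects and attributes as Table 3.6, although the order within each group may differ.

```python
# Table 3.5: nine objects (rows) by six binary attributes (columns).
table = [
    [0, 1, 0, 1, 1, 0],
    [1, 0, 1, 0, 0, 1],
    [0, 1, 0, 1, 1, 0],
    [1, 0, 1, 0, 0, 1],
    [0, 1, 0, 1, 1, 0],
    [1, 0, 1, 0, 0, 1],
    [0, 1, 0, 1, 1, 0],
    [1, 0, 1, 0, 0, 1],
    [0, 1, 0, 1, 1, 0],
]


def group_identical(table):
    """Return 0-based row and column orders that place identical patterns side by side."""
    n_rows, n_cols = len(table), len(table[0])
    # Rows that begin with a 1 come first (descending lexicographic order).
    row_order = sorted(range(n_rows), key=lambda i: table[i], reverse=True)
    # Columns are compared by their values read down the reordered rows.
    col_order = sorted(
        range(n_cols),
        key=lambda j: [table[i][j] for i in row_order],
        reverse=True,
    )
    return row_order, col_order


row_order, col_order = group_identical(table)
permuted = [[table[i][j] for j in col_order] for i in row_order]

# Print the permuted table with the original 1-based labels.
print("   ", " ".join(str(j + 1) for j in col_order))
for i, row in zip(row_order, permuted):
    print(f"{i + 1}  ", " ".join(map(str, row)))
```

After the permutation, the first four rows have ones only in the first three columns and the remaining five rows have ones only in the last three columns, which is the block structure visible in Table 3.6.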
Example 3.6. Consider Figure 3.3(a), which shows a visualization of a graph.
If the connected components of the graph are separated, as in Figure 3.3(b),
then the relationships between nodes and graphs become much simpler to
understand.
Figure 3.3. Two visualizations of a graph. (a) Original view of a graph. (b) Uncoupled view of connected components of the graph.
Selection
Another key concept in visualization is selection, which is the elimination
or the de-emphasis of certain objects and attributes. Specifically, while data
objects that only have a few dimensions can often be mapped to a two- or
three-dimensional graphical representation in a straightforward way, there is
no completely satisfactory and general approach to represent data with many
attributes. Likewise, if there are many data objects, then visualizing all the
objects can result in a display that is too crowded. If there are many attributes
and many objects, then the situation is even more challenging.
The most common approach to handling many attributes is to choose a
subset of attributes (usually two) for display. If the dimensionality is not too
high, a matrix of bivariate (two-attribute) plots can be constructed for simultaneous
viewing. (Figure 3.16 shows a matrix of scatter plots for the pairs
of attributes of the Iris data set.) Alternatively, a visualization program can
automatically show a series of two-dimensional plots, in which the sequence is
user directed or based on some predefined strategy. The hope is that visualizing
a collection of two-dimensional plots will provide a more complete view of
the data.
The technique of selecting a pair (or small number) of attributes is a type of
dimensionality reduction, and there are many more sophisticated dimensionality
reduction techniques that can be employed, e.g., principal components
analysis (PCA). Consult Appendices A (Linear Algebra) and B (Dimensionality
Reduction) for more information.
When the number of data points is high, e.g., more than a few hundred,
or if the range of the data is large, it is difficult to display enough information
about each object. Some data points can obscure other data points, or a
data object may not occupy enough pixels to allow its features to be clearly
displayed. For example, the shape of an object cannot be used to encode a
characteristic of that object if there is only one pixel available to display it. In
these situations, it is useful to be able to eliminate some of the objects, either
by zooming in on a particular region of the data or by taking a sample of the
data points.
3.3.3 Techniques
Visualization techniques are often specialized to the type of data being analyzed.
Indeed, new visualization techniques and approaches, as well as specialized
variations of existing approaches, are being continuously created, typically
in response to new kinds of data and visualization tasks.
Despite this specialization and the ad hoc nature of visualization, there are
some generic ways to classify visualization techniques. One such classification
is based on the number of attributes involved (1, 2, 3, or many) or whether the
data has some special characteristic, such as a hierarchical or graph structure.
Visualization methods can also be classified according to the type of attributes
involved. Yet another classification is based on the type of application: scientific,
statistical, or information visualization. The following discussion will use
three categories: visualization of a small number of attributes, visualization of
data with spatial and/or temporal attributes, and visualization of data with
many attributes.
Most of the visualization techniques discussed here can be found in a wide
variety of mathematical and statistical packages, some of which are freely
available. There are also a number of data sets that are freely available on the
World Wide Web. Readers are encouraged to try these visualization techniques
as they proceed through the following sections.
Visualizing Small Numbers of Attributes
This section examines techniques for visualizing data with respect to a small
number of attributes. Some of these techniques, such as histograms, give
insight into the distribution of the observed values for a single attribute. Other
techniques, such as scatter plots, are intended to display the relationships
between the values of two attributes.
Stem and Leaf Plots Stem and leaf plots can be used to provide insight
into the distribution of one-dimensional integer or continuous data. (We will
assume integer data initially, and then explain how stem and leaf plots can be
applied to continuous data.) For the simplest type of stem and leaf plot, we
split the values into groups, where each group contains those values that are
the same except for the last digit. Each group becomes a stem, while the last
digits of a group are the leaves. Hence, if the values are two-digit integers,
e.g., 35, 36, 42, and 51, then the stems will be the high-order digits, e.g., 3,
4, and 5, while the leaves are the low-order digits, e.g., 1, 2, 5, and 6. By
plotting the stems vertically and leaves horizontally, we can provide a visual
representation of the distribution of the data.
Example 3.7. The set of integers shown in Figure 3.4 is the sepal length in
centimeters (multiplied by 10 to make the values integers) taken from the Iris
data set. For convenience, the values have also been sorted.
The stem and leaf plot for this data is shown in Figure 3.5. Each number in
Figure 3.4 is first put into one of the vertical groups (4, 5, 6, or 7) according
to its tens digit. Its last digit is then placed to the right of the colon. Often,
especially if the amount of data is larger, it is desirable to split the stems.
For example, instead of placing all values whose tens digit is 4 in the same
“bucket,” the stem 4 is listed twice; all values 40–44 are put in the bucket
corresponding to the first stem and all values 45–49 are put in the bucket
corresponding to the second stem. This approach is shown in the stem and
leaf plot of Figure 3.6. Other variations are also possible.
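The construction described in Example 3.7 can be sketched as follows. This is a minimal illustration, assuming non-negative integer data; the function name stem_and_leaf and its split parameter are hypothetical. The sample values are the first row of Figure 3.4.

```python
from collections import defaultdict


def stem_and_leaf(values, split=False):
    """Return the lines of a stem and leaf plot for non-negative integers.

    With split=True each stem is listed twice: first with leaves 0-4,
    then with leaves 5-9.
    """
    groups = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)  # high-order digits, last digit
        half = 1 if (split and leaf >= 5) else 0
        groups[(stem, half)].append(str(leaf))
    return [
        f"{stem} : {''.join(leaves)}"
        for (stem, half), leaves in sorted(groups.items())
    ]


# First row of Figure 3.4 (sepal length in centimeters, multiplied by 10).
sample = [43, 44, 44, 44, 45, 46, 46, 46, 46, 47, 47, 48,
          48, 48, 48, 48, 49, 49, 49, 49, 49, 49, 50]

plain_lines = stem_and_leaf(sample)
split_lines = stem_and_leaf(sample, split=True)

for line in plain_lines:
    print(line)
print()
for line in split_lines:
    print(line)
```

With split=True the values 40–44 and 45–49 are placed in separate buckets, as in Figure 3.6.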
Histograms Stem and leaf plots are a type of histogram, a plot that displays
the distribution of values for attributes by dividing the possible values
into bins and showing the number of objects that fall into each bin. For categorical
data, each value is a bin. If this results in too many values, then values
are combined in some way. For continuous attributes, the range of values is divided
into bins (typically, but not necessarily, of equal width) and the values
in each bin are counted.
43 44 44 44 45 46 46 46 46 47 47 48 48 48 48 48 49 49 49 49 49 49 50
50 50 50 50 50 50 50 50 50 51 51 51 51 51 51 51 51 51 52 52 52 52 53
54 54 54 54 54 54 55 55 55 55 55 55 55 56 56 56 56 56 56 57 57 57 57
57 57 57 57 58 58 58 58 58 58 58 59 59 59 60 60 60 60 60 60 61 61 61
61 61 61 62 62 62 62 63 63 63 63 63 63 63 63 63 64 64 64 64 64 64 64
65 65 65 65 65 66 66 67 67 67 67 67 67 67 67 68 68 68 69 69 69 69 70
71 72 72 72 73 74 76 77 77 77 77 79
Figure 3.4. Sepal length data from the Iris data set.
4 : 3444566667788888999999
5 : 0000000000111111111222234444445555555666666777777778888888999
6 : 000000111111222233333333344444445555566777777778889999
7 : 0122234677779
Figure 3.5. Stem and leaf plot for the sepal length from the Iris data set.
4 : 3444
4 : 566667788888999999
5 : 000000000011111111122223444444
5 : 5555555666666777777778888888999
6 : 00000011111122223333333334444444
6 : 5555566777777778889999
7 : 0122234
7 : 677779
Figure 3.6. Stem and leaf plot for the sepal length from the Iris data set when buckets corresponding
to digits are split.
Once the counts are available for each bin, a bar plot is constructed such
that each bin is represented by one bar and the area of each bar is proportional
to the number of values (objects) that fall into the corresponding range. If all
intervals are of equal width, then all bars are the same width and the height
of a bar is proportional to the number of values in the corresponding bin.
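The counting step for equal-width bins can be sketched as follows. This is a minimal illustration under stated assumptions: the bins span the observed minimum to the observed maximum, each bin includes its lower edge, and the maximum value is placed in the last bin. The function name histogram_counts and the sample values are hypothetical.

```python
def histogram_counts(values, n_bins):
    """Count how many values fall into each of n_bins equal-width bins."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("all values are equal; equal-width bins are undefined")
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        # The maximum value goes into the last bin instead of opening a new one.
        index = min(int((v - lo) / width), n_bins - 1)
        counts[index] += 1
    edges = [lo + i * width for i in range(n_bins + 1)]
    return edges, counts


# Hypothetical values, used only to show the mechanics.
values = [1, 2, 2, 3, 3, 3, 4, 4, 5, 9]
edges, counts = histogram_counts(values, n_bins=4)

# A text rendering: with equal-width bins, bar length is proportional to the count.
for left, right, count in zip(edges, edges[1:], counts):
    print(f"[{left:4.1f}, {right:4.1f})  {'#' * count}")
```

Changing n_bins changes the counts and therefore the apparent shape of the distribution, which is why it is often worth looking at more than one choice.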
Example 3.8. Figure 3.7 shows histograms (with 10 bins) for sepal length,
sepal width, petal length, and petal width. Since the shape of a histogram
can depend on the number of bins, histograms for the same data, but with 20
bins, are shown in Figure 3.8.
Figure 3.7. Histograms of four Iris attributes (10 bins). (a) Sepal length. (b) Sepal width. (c) Petal length. (d) Petal width. (Each panel plots count against attribute value.)
Figure 3.8. Histograms of four Iris attributes (20 bins). (a) Sepal length. (b) Sepal width. (c) Petal length. (d) Petal width. (Each panel plots count against attribute value.)
There are variations of the histogram plot. A relative (frequency) histogram
replaces the count by the relative frequency. However, this is just a
change in scale of the y axis, and the shape of the histogram does not change.
Another common variation, especially for unordered categorical data, is the
Pareto histogram, which is the same as a normal histogram except that the
categories are sorted by count so that the count is decreasing from left to right.
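Both variations amount to a small transformation of the bin counts, as the following sketch suggests. The function names and the categorical sample are hypothetical and serve only as an illustration.

```python
from collections import Counter


def relative_frequencies(counts):
    """Replace each count by its share of the total (only the y scale changes)."""
    total = sum(counts.values())
    return {category: count / total for category, count in counts.items()}


def pareto_order(counts):
    """Return (category, count) pairs sorted so that counts decrease left to right."""
    return sorted(counts.items(), key=lambda item: item[1], reverse=True)


# Hypothetical unordered categorical data.
observations = ["blue", "red", "green", "red", "blue", "red"]
counts = Counter(observations)

relative = relative_frequencies(counts)
pareto = pareto_order(counts)
print(relative)
print(pareto)
```

The relative frequencies sum to 1, and the Pareto ordering lists the most frequent category first.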
Two-Dimensional Histograms Two-dimensional histograms are also possible.
Each attribute is divided into intervals and the two sets of intervals define
two-dimensional rectangles of values.
Example 3.9. Figure 3.9 shows a two-dimensional histogram of petal length
and petal width. Because each attribute is split into three bins, there are nine
rectangular two-dimensional bins. The height of each rectangular bar indicates
the number of objects (flowers in this case) that fall into each bin. Most of
the flowers fall into only three of the bins, namely those along the diagonal. It is not
possible to see this by looking at the one-dimensional distributions.
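The counting behind a two-dimensional histogram can be sketched as follows. The bin edges and the six (length, width) pairs are hypothetical values chosen for illustration, not measurements from the Iris data set, and the function names are assumptions of this sketch.

```python
def bin_index(value, edges):
    """Return the index of the bin [edges[i], edges[i+1]) that contains value.

    The last bin also includes its upper edge. Returns None if value lies
    outside the range covered by the edges.
    """
    for i in range(len(edges) - 1):
        if edges[i] <= value < edges[i + 1]:
            return i
    if value == edges[-1]:
        return len(edges) - 2
    return None


def histogram_2d(xs, ys, x_edges, y_edges):
    """Count the (x, y) pairs that fall into each rectangular bin."""
    counts = [[0] * (len(y_edges) - 1) for _ in range(len(x_edges) - 1)]
    for x, y in zip(xs, ys):
        i, j = bin_index(x, x_edges), bin_index(y, y_edges)
        if i is not None and j is not None:
            counts[i][j] += 1
    return counts


# Hypothetical (length, width) pairs; three bins per attribute give nine bins.
lengths = [1.4, 1.5, 4.5, 4.7, 5.9, 6.0]
widths = [0.2, 0.2, 1.5, 1.4, 2.1, 2.5]
counts_2d = histogram_2d(lengths, widths, x_edges=[1, 3, 5, 7], y_edges=[0, 1, 2, 3])

for row in counts_2d:
    print(row)
```

In this toy sample all objects land in the three diagonal bins, which is the kind of joint pattern that neither one-dimensional histogram reveals on its own.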
Figure 3.9. Two-dimensional histogram of petal length and width in the Iris data set. (Vertical axis: count.)
While two-dimensional histograms can be used to discover interesting facts
about how the values of two attributes co-occur, they are visually more complicated.
For instance, it is easy to imagine a situation in which some of the
columns are hidden by others.
Box Plots Box plots are another method for showing the distribution of the
values of a single numerical attribute. Figure 3.10 shows a labeled box plot for
sepal length. The lower and upper ends of the box indicate the 25th and 75th
percentiles, respectively, while the line inside the box indicates the value of the
50th percentile. The bottom and top lines of the tails indicate the 10th and
90th percentiles, respectively. Outliers are shown by “+” marks. Box plots are relatively
compact, and thus, many of them can be shown on the same plot. Simplified
versions of the box plot, which take less space, can also be used.
Example 3.10. The box plots for the first four attributes of the Iris data
set are shown in Figure 3.11. Box plots can also be used to compare how
attributes vary between different classes of objects, as shown in Figure 3.12.
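The quantities that a box plot of this kind displays can be computed as in the following sketch. Two assumptions are made here and should be read as such: percentiles are obtained by linear interpolation between neighboring sorted values (one of several conventions in use), and a value is treated as an outlier when it lies beyond the 10th or 90th percentile. The function names and the sample values are hypothetical.

```python
def percentile(sorted_values, p):
    """Return the p-th percentile by linear interpolation between neighbors."""
    position = (len(sorted_values) - 1) * p / 100
    lower = int(position)
    upper = min(lower + 1, len(sorted_values) - 1)
    fraction = position - lower
    return sorted_values[lower] + fraction * (sorted_values[upper] - sorted_values[lower])


def box_plot_summary(values):
    """Return the five percentiles drawn in the box plot and the outliers."""
    ordered = sorted(values)
    summary = {p: percentile(ordered, p) for p in (10, 25, 50, 75, 90)}
    # Assumption: points beyond the tails (10th and 90th percentiles) are outliers.
    outliers = [v for v in ordered if v < summary[10] or v > summary[90]]
    return summary, outliers


# Hypothetical values, used only to show the mechanics.
summary, outliers = box_plot_summary(list(range(1, 12)))
print(summary)
print(outliers)
```

The box would be drawn from summary[25] to summary[75] with a line at summary[50], and the tails would extend to summary[10] and summary[90].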
Pie Chart A pie chart is similar to a histogram, but is typically used with
categorical attributes that have a relatively small number of values. Instead of
showing the relative frequency of different values with the area or height of a
bar, as in a histogram, a pie chart uses the relative area of a circle to indicate
relative frequency. Although pie charts are common in popular articles, they
are used less frequently in technical publications because the size of relative
areas can be hard to judge. Histograms are preferred for technical work.
Figure 3.10. Description of box plot for sepal length. (Labeled elements: outliers, marked “+”; 90th, 75th, 50th, 25th, and 10th percentiles.)
Figure 3.11. Box plot for Iris attributes. (Sepal length, sepal width, petal length, and petal width; values in centimeters.)
Figure 3.12. Box plots of attributes by Iris species. (a) Setosa. (b) Versicolour. (c) Virginica. (Each panel shows sepal length, sepal width, petal length, and petal width; values in centimeters.)
Example 3.11. Figure 3.13 displays a pie chart that shows the distribution
of Iris species in the Iris data set. In this case, all three flower types have the
same frequency.
Figure 3.13. Distribution of the types of Iris flowers. (Pie chart with sectors for Setosa, Versicolour, and Virginica.)
Percentile Plots and Empirical Cumulative Distribution Functions
A type of diagram that shows the distribution of the data more quantitatively
is the plot of an empirical cumulative distribution function. While this type of
plot may sound complicated, the concept is straightforward. For each value of
a statistical distribution, a cumulative distribution function (CDF) shows
the probability that a point is less than or equal to that value. For each observed value, an
empirical cumulative distribution function (ECDF) shows the fraction
of points that are less than or equal to this value. Since the number of points is finite, the
empirical cumulative distribution function is a step function.
Example 3.12. Figure 3.14 shows the ECDFs of the Iris attributes. The
percentiles of an attribute provide similar information. Figure 3.15 shows the
percentile plots of the four continuous attributes of the Iris data set from
Table 3.2. The reader should compare these figures with the histograms given
in Figures 3.7 and 3.8.
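An empirical cumulative distribution function is simple to compute, as the following sketch suggests. It uses the common convention that the ECDF at x is the fraction of points less than or equal to x; the function name ecdf and the sample values are hypothetical.

```python
from bisect import bisect_right


def ecdf(values):
    """Return (x, F(x)) pairs, one per distinct observed value.

    F(x) is the fraction of points that are less than or equal to x.
    """
    ordered = sorted(values)
    n = len(ordered)
    return [(x, bisect_right(ordered, x) / n) for x in sorted(set(ordered))]


# Hypothetical values, used only to show the mechanics.
steps = ecdf([2, 1, 2, 4])
for x, fraction in steps:
    print(x, fraction)
```

Between consecutive observed values the function stays constant, which is why the plot is a step function that rises to 1 at the largest observed value.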
Scatter Plots Most people are familiar with scatter plots to some extent,
and they were used in Section 2.4.5 to illustrate linear correlation. Each data
object is plotted as a point in the plane using the values of the two attributes
as x and y coordinates. It is assumed that the attributes are either integer or
real-valued.
Example 3.13. Figure 3.16 shows a scatter plot for each pair of attributes
of the Iris data set. The different species of Iris are indicated by different
markers. The arrangement of the scatter plots of pairs of attributes in this
type of tabular format, which is known as a scatter plot matrix, provides
an organized way to examine a number of scatter plots simultaneously.
Figure 3.14. Empirical CDFs of four Iris attributes. (a) Sepal Length. (b) Sepal Width. (c) Petal Length. (d) Petal Width. (Each panel plots F(x) against x.)
Figure 3.15. Percentile plots for sepal length, sepal width, petal length, and petal width. (Value in centimeters plotted against percentile.)
Figure 3.16. Matrix of scatter plots for the Iris data set. (Attributes: sepal length, sepal width, petal length, and petal width; markers distinguish Setosa, Versicolour, and Virginica.)
There are two main uses for scatter plots. First, they graphically show
the relationship between two attributes. In Section 2.4.5, we saw how scatter
plots could be used to judge the degree of linear correlation. (See Figure 2.17.)
Scatter plots can also be used to detect nonlinear relationships, either directly
or by using a scatter plot of the transformed attributes.
Second, when class labels are available, they can be used to investigate the
degree to which two attributes separate the classes. If it is possible to draw a
line (or a more complicated curve) that divides the plane defined by the two
attributes into separate regions that contain mostly objects of one class, then
it is possible to construct an accurate classifier based on the specified pair of
attributes. If not, then more attributes or more sophisticated methods are
needed to build a classifier. In Figure 3.16, many of the pairs of attributes (for
example, petal width and petal length) provide a moderate separation of the
Iris species.
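The idea of judging a pair of attributes by how well a line separates the classes can be made concrete with a small sketch. Everything here is an illustrative assumption: the two-class setting, the fixed candidate lines, the function names, and the data points, which are hypothetical and not taken from the Iris data set.

```python
def side_of_line(point, w, b):
    """Return +1 or -1 according to the sign of w[0]*x + w[1]*y + b."""
    x, y = point
    return 1 if w[0] * x + w[1] * y + b > 0 else -1


def separation_accuracy(points, labels, w, b):
    """Fraction of objects on the side of the line that matches their label (+1 or -1)."""
    hits = sum(side_of_line(p, w, b) == label for p, label in zip(points, labels))
    return hits / len(points)


# Hypothetical (attribute 1, attribute 2) values for two classes.
points = [(1.4, 0.2), (1.5, 0.3), (1.3, 0.2), (4.5, 1.5), (5.0, 1.8), (4.0, 1.3)]
labels = [-1, -1, -1, 1, 1, 1]

good = separation_accuracy(points, labels, w=(1.0, 1.0), b=-3.5)   # line x + y = 3.5
poor = separation_accuracy(points, labels, w=(0.0, 1.0), b=-0.25)  # line y = 0.25
print(good, poor)
```

A line for which the fraction is close to 1 corresponds to regions that contain mostly objects of one class; a scatter plot lets a person find such a line by eye.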
Example 3.14. There are two separate approaches for displaying three attributes
of a data set with a scatter plot. First, each object can be displayed
according to the values of three, instead of two, attributes. Figure 3.17 shows a
three-dimensional scatter plot for three attributes in the Iris data set. Second,
one of the attributes can be associated with some characteristic of the marker,
such as its size, color, or shape. Figure 3.18 shows a plot of three attributes
of the Iris data set, where one of the attributes, sepal width, is mapped to the
size of the marker.
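Mapping a third attribute to marker size requires rescaling its values to a range of sizes that a plotting tool can draw. The following sketch uses a linear (min-max) rescaling; the size range, the function name, and the sample values are assumptions made for illustration.

```python
def scale_to_sizes(values, min_size=20.0, max_size=200.0):
    """Linearly map attribute values onto marker sizes in [min_size, max_size]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are equal: use a single mid-range size.
        return [(min_size + max_size) / 2] * len(values)
    return [
        min_size + (v - lo) / (hi - lo) * (max_size - min_size)
        for v in values
    ]


# Hypothetical attribute values.
sizes = scale_to_sizes([2.0, 3.0, 4.0])
print(sizes)
```

Because size is an ordered graphical feature, this mapping is appropriate for ordinal or continuous attributes rather than nominal ones.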
Extending Two- and Three-Dimensional Plots As illustrated by Figure 3.18,
two- or three-dimensional plots can be extended to represent a few
additional attributes. For example, scatter plots can display up to three additional
attributes using color or shading, size, and shape, allowing five or six
dimensions to be represented. There is a need for caution, however. As the
complexity of a visual representation of the data increases, it becomes harder
for the intended audience to interpret the information. There is no benefit in
packing six dimensions’ worth of information into a two- or three-dimensional
plot, if doing so makes it impossible to understand.
Visualizing Spatiotemporal Data
Data often has spatial or temporal attributes. For instance, the data may
consist of a set of observations on a spatial grid, such as observations of pressure
on the surface of the Earth or the modeled temperature at various grid
points in the simulation of a physical object. These observations can also be
made at various points in time. In addition, data may have only a temporal
component, such as time series data that gives the daily prices of stocks.
Figure 3.17. Three-dimensional scatter plot of sepal width, sepal length, and petal width. (Markers distinguish Setosa, Versicolour, and Virginica.)
Figure 3.18. Scatter plot of petal length versus petal width, with the size of the marker indicating sepal width. (Markers distinguish Setosa, Versicolour, and Virginica.)
Figure 3.19. Contour plot of SST for December 1998. (Contour labels and color scale: temperature in degrees Celsius.)
Contour Plots For some three-dimensional data, two attributes specify a
position in a plane, while the third has a continuous value, such as temperature
or elevation. A useful visualization for such data is a contour plot,
which breaks the plane into separate regions where the values of the third
attribute (temperature, elevation) are roughly the same. A common example
of a contour plot is a contour map that shows the elevation of land locations.
Example 3.15. Figure 3.19 shows a contour plot of the average sea surface
temperature (SST) for December 1998. The land is arbitrarily set to have a
temperature of 0°C. In many contour maps, such as that of Figure 3.19, the
contour lines that separate two regions are labeled with the value used to
separate the regions. For clarity, some of these labels have been deleted.
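The core of a contour plot is the assignment of each grid value to a band between two contour levels. The following sketch shows only that step, not the drawing of the contour lines. The levels follow the labels visible in Figure 3.19; the grid values and the function name are hypothetical.

```python
from bisect import bisect_right


def contour_bands(grid, levels):
    """Replace each grid value by the index of the band it falls into.

    Band 0 lies below levels[0]; band k (k >= 1) covers
    levels[k-1] <= value < levels[k]; the last band lies at or above levels[-1].
    """
    return [[bisect_right(levels, value) for value in row] for row in grid]


levels = [0, 5, 10, 15, 20, 25]  # contour values, in degrees Celsius
# Hypothetical temperatures on a 2-by-3 grid.
grid = [[-1.0, 3.0, 7.0],
        [12.0, 26.0, 5.0]]

bands = contour_bands(grid, levels)
for row in bands:
    print(row)
```

Neighboring grid cells with the same band index belong to the same region, and a contour line runs wherever the band index changes.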
Surface Plots Like contour plots, surface plots use two attributes for the
x and y coordinates. The third attribute is used to indicate the height above
the plane defined by the first two attributes. While such graphs can be useful,
they require that a value of the third attribute be defined for all combinations
of values for the first two attributes, at least over some range. Also, if the
surface is too irregular, then it can be difficult to see all the information,
unless the plot is viewed interactively. Thus, surface plots are often used to
describe mathematical functions or physical surfaces that vary in a relatively
smooth manner.
Example 3.16. Figure 3.20 shows a surface plot of the density around a set
of 12 points. This example is further discussed in Section 9.3.3.
Figure 3.20. Density of a set of 12 points. (a) Set of 12 points. (b) Overall density function, shown as a surface plot.
Vector Field Plots In some data, a characteristic may have both a magnitude
and a direction associated with it. For example, consider the flow of a
substance or the change of density with location. In these situations, it can be
useful to have a plot that displays both direction and magnitude. This type
of plot is known as a vector plot.
Example 3.17. Figure 3.21 shows a contour plot of the density of the two
smaller density peaks from Figure 3.20(b), annotated with the density gradient
vectors.
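Gradient vectors such as those in Example 3.17 can be approximated from values on a regular grid. The sketch below uses central differences at interior grid points; this numerical scheme, the grid layout (z[i][j] is the value at y = i·dy, x = j·dx), the function name, and the test surface are assumptions made for illustration.

```python
def gradient_field(z, dx=1.0, dy=1.0):
    """Approximate the gradient at interior grid points by central differences.

    Returns a dict that maps (i, j) to (gx, gy, magnitude).
    """
    rows, cols = len(z), len(z[0])
    field = {}
    for i in range(1, rows - 1):
        for j in range(1, cols - 1):
            gx = (z[i][j + 1] - z[i][j - 1]) / (2 * dx)  # change along x
            gy = (z[i + 1][j] - z[i - 1][j]) / (2 * dy)  # change along y
            field[(i, j)] = (gx, gy, (gx * gx + gy * gy) ** 0.5)
    return field


# Hypothetical surface z = 2x + 3y sampled on a 3-by-3 grid.
z = [[2 * j + 3 * i for j in range(3)] for i in range(3)]
field = gradient_field(z)
print(field[(1, 1)])
```

In a vector plot, each (gx, gy) pair would be drawn as an arrow at its grid point, with the arrow length reflecting the magnitude.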
Figure 3.21. Vector plot of the gradient (change) in density for the bottom two density peaks of Figure 3.20.
Lower-Dimensional Slices Consider a spatiotemporal data set that records
some quantity, such as temperature or pressure, at various locations over time.
Such a data set has four dimensions and cannot be easily displayed by the types
of plots that we have described so far. However, separate “slices” of the data
can be displayed by showing a set of plots, one for each month. By examining
the change in a particular area from one month to another, it is possible to
notice changes that occur, including those that may be due to seasonal factors.
Example 3.18. The underlying data set for this example consists of the average
monthly sea level pressure (SLP) from 1982 to 1999 on a 2.5° by 2.5°
latitude-longitude grid. The twelve monthly plots of pressure for one year are
shown in Figure 3.22. In this example, we are interested in slices for a particular
month in the year 1982. More generally, we can consider slices of the
data along any arbitrary dimension.
Animation Another approach to dealing with slices of data, whether or not
time is involved, is to employ animation. The idea is to display successive
two-dimensional slices of the data. The human visual system is well suited to
detecting visual changes and can often notice changes that might be difficult
to detect in another manner. Despite the visual appeal of animation, a set of
still plots, such as those of Figure 3.22, can be more useful since this type of
visualization allows the information to be studied in arbitrary order and for
arbitrary amounts of time.
Figure 3.22. Monthly plots of sea level pressure over the 12 months of 1982. (Twelve panels, January through December.)
3.3.4 Visualizing Higher-Dimensional Data
This section considers visualization techniques that can display more than the
handful of dimensions that can be observed with the techniques just discussed.
However, even these techniques are somewhat limited in that they only show
some aspects of the data.
Matrices An image can be regarded as a rectangular array of pixels, where
each pixel is characterized by its color and brightness. A data matrix is a
rectangular array of values. Thus, a data matrix can be visualized as an image
by associating each entry of the data matrix with a pixel in the image. The
brightness or color of the pixel is determined by the value of the corresponding
entry of the matrix.
Figure 3.23. Plot of the Iris data matrix where columns have been standardized to have a mean of 0 and standard deviation of 1. (Columns: sepal length, sepal width, petal length, and petal width; rows 1 to 150 grouped as Setosa, Versicolour, and Virginica; color scale in standard deviations.)
Figure 3.24. Plot of the Iris correlation matrix. (Rows and columns 1 to 150 grouped as Setosa, Versicolour, and Virginica; color scale: correlation, 0.4 to 0.9.)
There are some important practical considerations when visualizing a data
matrix. If class labels are known, then it is useful to reorder the data matrix
so that all objects of a class are together. This makes it easier, for example, to
detect if all objects in a class have similar attribute values for some attributes.
If different attributes have different ranges, then the attributes are often standardized
to have a mean of zero and a standard deviation of 1. This prevents
the attribute with the largest magnitude values from visually dominating the
plot.
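Standardizing the columns of a data matrix can be sketched as follows. The sketch assumes the sample standard deviation (dividing by n − 1); the text does not say which variant is intended, and the population version would differ only by a constant factor per column. The function name and the small matrix are hypothetical.

```python
from statistics import mean, stdev


def standardize_columns(matrix):
    """Rescale each column to have mean 0 and (sample) standard deviation 1."""
    columns = list(zip(*matrix))
    stats = [(mean(column), stdev(column)) for column in columns]
    if any(s == 0 for _, s in stats):
        raise ValueError("a constant column cannot be standardized")
    return [
        [(value - m) / s for value, (m, s) in zip(row, stats)]
        for row in matrix
    ]


# Hypothetical data matrix: the second attribute has a much larger range.
matrix = [[1, 10], [2, 20], [3, 30]]
standardized = standardize_columns(matrix)
for row in standardized:
    print(row)
```

After standardization both columns span the same range, so neither dominates the brightness or color scale when the matrix is displayed as an image.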
Example 3.19. Figure 3.23 shows the standardized data matrix for the Iris
data set. The first 50 rows represent Iris flowers of the species Setosa, the next
50 Versicolour, and the last 50 Virginica. The Setosa flowers have petal width
and length well below the average, while the Versicolour flowers have petal
width and length around average. The Virginica flowers have petal width and
length above average.
It can also be useful to look for structure in the plot of a proximity matrix
for a set of data objects. Again, it is useful to sort the rows and columns of
the similarity matrix (when class labels are known) so that all the objects of a
class are together. This allows a visual evaluation of the cohesiveness of each
class and its separation from other classes.
Example 3.20. Figure 3.24 shows the correlation matrix for the Iris data
set. Again, the rows and columns are organized so that all the flowers of a
particular species are together. The flowers in each group are most similar
to each other, but Versicolour and Virginica are more similar to one another
than to Setosa.
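Sorting the rows and columns of a proximity matrix by class label is a simple simultaneous permutation, as the following sketch suggests. The function name, the labels, and the similarity values are hypothetical.

```python
def reorder_by_class(matrix, labels):
    """Permute the rows and columns of a square matrix so that classes are contiguous."""
    order = sorted(range(len(labels)), key=lambda i: labels[i])
    reordered = [[matrix[i][j] for j in order] for i in order]
    return order, reordered


# Hypothetical similarity matrix: 1.0 within a class, 0.1 between classes.
labels = ["b", "a", "b", "a"]
similarity = [
    [1.0 if labels[i] == labels[j] else 0.1 for j in range(len(labels))]
    for i in range(len(labels))
]

order, blocks = reorder_by_class(similarity, labels)
for row in blocks:
    print(row)
```

After reordering, high within-class similarity appears as bright square blocks along the diagonal, and the off-diagonal blocks show how similar each class is to the others.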
If class labels are not known, various techniques (matrix reordering and
seriation) can be used to rearrange the rows and columns of the similarity
matrix so that groups of highly similar objects and attributes are together
and can be visually identified. Effectively, this is a simple kind of clustering.
See Section 8.5.3 for a discussion of how a proximity matrix can be used to
investigate the cluster structure of data.
Parallel Coordinates Parallel coordinates have one coordinate axis for
each attribute, but the different axes are parallel to one another instead of
perpendicular, as is traditional. Furthermore, an object is represented as a line
instead of as a point. Specifically, the value of each attribute of an object is
mapped to a point on the coordinate axis associated with that attribute, and
these points are then connected to form the line that represents the object.
It might be feared that this would yield quite a mess. However, in many
cases, objects tend to fall into a small number of groups, where the points in
each group have similar values for their attributes. If so, and if the number of
data objects is not too large, then the resulting parallel coordinates plot can
reveal interesting patterns.
Example 3.21. Figure 3.25 shows a parallel coordinates plot of the four numerical
attributes of the Iris data set. The lines representing objects of different
classes are distinguished by their shading and the use of three different line
styles: solid, dotted, and dashed. The parallel coordinates plot shows that the
classes are reasonably well separated for petal width and petal length, but less
well separated for sepal length and sepal width. Figure 3.26 is another parallel
coordinates plot of the same data, but with a different ordering of the axes.
One of the drawbacks of parallel coordinates is that the detection of patterns
in such a plot may depend on the order of the axes. For instance, if lines cross a
lot, the picture can become confusing, and thus, it can be desirable to order
the coordinate axes to obtain sequences of axes with less crossover. Compare
Figure 3.26, where sepal width (the attribute that is most mixed) is at the left
of the figure, to Figure 3.25, where this attribute is in the middle.
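To make the effect of axis order concrete, the following sketch maps each object to the vertices of its line and counts how many pairs of lines cross between adjacent axes. The crossing count is only one possible measure of clutter, and the function names and three sample flowers are assumptions made for illustration.

```python
from itertools import combinations

def polyline(obj, axis_order):
    """Vertices (axis position, value) of the line that represents one object."""
    return [(pos, obj[attr]) for pos, attr in enumerate(axis_order)]

def crossings(objects, axis_order):
    """Count pairs of lines that cross between each pair of adjacent axes."""
    total = 0
    for left, right in zip(axis_order, axis_order[1:]):
        for a, b in combinations(objects, 2):
            # Two lines cross when their vertical order flips between the axes.
            if (a[left] - b[left]) * (a[right] - b[right]) < 0:
                total += 1
    return total

# Illustrative flowers, one per species.
objects = [
    {"sepal_length": 5.1, "sepal_width": 3.5, "petal_length": 1.4, "petal_width": 0.2},
    {"sepal_length": 7.0, "sepal_width": 3.2, "petal_length": 4.7, "petal_width": 1.4},
    {"sepal_length": 6.3, "sepal_width": 3.3, "petal_length": 6.0, "petal_width": 2.5},
]
order_a = ["sepal_length", "sepal_width", "petal_length", "petal_width"]
order_b = ["sepal_width", "sepal_length", "petal_length", "petal_width"]
crossings_a = crossings(objects, order_a)
crossings_b = crossings(objects, order_b)
```

For these three flowers, the second ordering, which places sepal width at the left, produces fewer crossings than the first.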
Figure 3.25. A parallel coordinates plot of the four Iris attributes. (Axes from left to right: sepal length, sepal width, petal length, petal width; vertical scale: value in centimeters; legend: Setosa, Versicolour, Virginica.)
Figure 3.26. A parallel coordinates plot of the four Iris attributes with the attributes reordered to emphasize similarities and dissimilarities of groups. (Axes from left to right: sepal width, sepal length, petal length, petal width.)
Star Coordinates and Chernoff Faces
Another approach to displaying multidimensional data is to encode objects
as glyphs or icons, that is, symbols that impart information nonverbally. More
specifically, each attribute of an object is mapped to a particular feature of a
glyph, so that the value of the attribute determines the exact nature of the
feature. Thus, at a glance, we can distinguish how two objects differ.
Star coordinates are one example of this approach. This technique uses
one axis for each attribute. These axes all radiate from a center point, like the
spokes of a wheel, and are evenly spaced. Typically, all the attribute values
are mapped to the range [0,1].
An object is mapped onto this star-shaped set of axes using the following
process: Each attribute value of the object is converted to a fraction that
represents its distance between the minimum and maximum values of the
attribute. This fraction is mapped to a point on the axis corresponding to
this attribute. Each point is connected with a line segment to the point on
the axis preceding or following its own axis; this forms a polygon. The size
and shape of this polygon give a visual description of the attribute values of
the object. For ease of interpretation, a separate set of axes is used for each
object. In other words, each object is mapped to a polygon. An example of a
star coordinates plot of flower 150 is given in Figure 3.27(a).
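The mapping just described can be sketched as follows. The function name and the minimum and maximum values are hypothetical, and the sketch assumes that every attribute has a maximum strictly greater than its minimum.

```python
from math import cos, sin, pi

def star_polygon(obj, minima, maxima):
    """Return the polygon vertices (x, y) for one object on evenly spaced radial axes."""
    n = len(obj)
    vertices = []
    for k, (value, lo, hi) in enumerate(zip(obj, minima, maxima)):
        fraction = (value - lo) / (hi - lo)  # position between min and max, in [0, 1]
        angle = 2 * pi * k / n               # axis k of n, like the spokes of a wheel
        vertices.append((fraction * cos(angle), fraction * sin(angle)))
    return vertices

# Hypothetical attribute ranges and one object that lies halfway along every axis.
minima = [4.0, 2.0, 1.0, 0.1]
maxima = [8.0, 4.0, 7.0, 2.5]
vertices = star_polygon([6.0, 3.0, 4.0, 1.3], minima, maxima)
```

Connecting consecutive vertices, and the last vertex back to the first, gives the polygon for the object.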
It is also possible to map the values of features to those of more familiar
objects, such as faces. This technique is named Chernoff faces for its creator,
Herman Chernoff. In this technique, each attribute is associated with a specific
feature of a face, and the attribute value is used to determine the way that
the facial feature is expressed. Thus, the shape of the face may become more
elongated as the value of the corresponding data feature increases. An example
of a Chernoff face for flower 150 is given in Figure 3.27(b).
The program that we used to make this face mapped the data features to the
four facial features listed below. Other features of the face, such as width between
the eyes and length of the mouth, are given default values.
Data Feature    Facial Feature
sepal length    size of face
sepal width     forehead/jaw relative arc length
petal length    shape of forehead
petal width     shape of jaw
Example 3.22. A more extensive illustration of these two approaches to viewing
multidimensional data is provided by Figures 3.28 and 3.29, which show
the star and face plots, respectively, of 15 flowers from the Iris data set. The
first 5 flowers are of species Setosa, the second 5 are Versicolour, and the last
5 are Virginica.
Figure 3.27. Star coordinates graph and Chernoff face of the 150th flower of the Iris data set. (a) Star graph of Iris 150, with axes for sepal length, sepal width, petal length, and petal width. (b) Chernoff face of Iris 150.
Figure 3.28. Plot of 15 Iris flowers using star coordinates. (Flowers 1 to 5, 51 to 55, and 101 to 105.)
Figure 3.29. A plot of 15 Iris flowers using Chernoff faces. (Flowers 1 to 5, 51 to 55, and 101 to 105.)
Despite the visual appeal of these sorts of diagrams, they do not scale well,
and thus, they are of limited use for many data mining problems. Nonetheless,
they may still be of use as a means to quickly compare small sets of objects
that have been selected by other techniques.
3.3.5 Do’s and Don’ts
To conclude this section on visualization, we provide a short list of visualization
do’s and don’ts. While these guidelines incorporate a lot of visualization
wisdom, they should not be followed blindly. As always, guidelines are no
substitute for thoughtful consideration of the problem at hand.
ACCENT Principles The following are the ACCENT principles for effective
graphical display put forth by D. A. Burn (as adapted by Michael
Friendly):
Apprehension Ability to correctly perceive relations among variables. Does
the graph maximize apprehension of the relations among variables?
Clarity Ability to visually distinguish all the elements of a graph. Are the
most important elements or relations visually most prominent?
Consistency Ability to interpret a graph based on similarity to previous
graphs. Are the elements, symbol shapes, and colors consistent with
their use in previous graphs?
Efficiency Ability to portray a possibly complex relation in as simple a way
as possible. Are the elements of the graph economically used? Is the
graph easy to interpret?
Necessity The need for the graph, and the graphical elements. Is the graph
a more useful way to represent the data than alternatives (table, text)?
Are all the graph elements necessary to convey the relations?
Truthfulness Ability to determine the true value represented by any graphical
element by its magnitude relative to the implicit or explicit scale.
Are the graph elements accurately positioned and scaled?
Tufte’s Guidelines Edward R. Tufte has also enumerated the following
principles for graphical excellence:
• Graphical excellence is the well-designed presentation of interesting data:
a matter of substance, of statistics, and of design.
• Graphical excellence consists of complex ideas communicated with clarity,
precision, and efficiency.
• Graphical excellence is that which gives to the viewer the greatest number
of ideas in the shortest time with the least ink in the smallest space.
• Graphical excellence is nearly always multivariate.
• And graphical excellence requires telling the truth about the data.
3.4 OLAP and Multidimensional Data Analysis
In this section, we investigate the techniques and insights that come from
viewing data sets as multidimensional arrays. A number of database systems
support such a viewpoint, most notably, On-Line Analytical Processing
(OLAP) systems. Indeed, some of the terminology and capabilities of OLAP
systems have made their way into spreadsheet programs that are used by millions
of people. OLAP systems also have a strong focus on the interactive
analysis of data and typically provide extensive capabilities for visualizing the
data and generating summary statistics. For these reasons, our approach to
multidimensional data analysis will be based on the terminology and concepts
common to OLAP systems.
3.4.1 Representing Iris Data as a Multidimensional Array
Most data sets can be represented as a table, where each row is an object and
each column is an attribute. In many cases, it is also possible to view the data
as a multidimensional array. We illustrate this approach by representing the
Iris data set as a multidimensional array.
Table 3.7 was created by discretizing the petal length and petal width
attributes to have values of low, medium, and high and then counting the
number of flowers from the Iris data set that have particular combinations
of petal width, petal length, and species type. (For petal width, the categories
low, medium, and high correspond to the intervals [0, 0.75), [0.75,
1.75), [1.75, ∞), respectively. For petal length, the categories low, medium,
and high correspond to the intervals [0, 2.5), [2.5, 5), [5, ∞), respectively.)
Table 3.7. Number of flowers having a particular combination of petal width, petal length, and species type.
Petal Length  Petal Width  Species Type  Count
low           low          Setosa           46
low           medium       Setosa            2
medium        low          Setosa            2
medium        medium       Versicolour      43
medium        high         Versicolour       3
medium        high         Virginica         3
high          medium       Versicolour       2
high          medium       Virginica         3
high          high         Versicolour       2
high          high         Virginica        44
Figure 3.30. A multidimensional data representation for the Iris data set. (A cube whose axes are petal length, petal width, and species, each axis with three values; the visible face shows the Setosa counts.)
Table 3.8. Cross-tabulation of flowers according to petal length and width for flowers of the Setosa species.
                  Width
Length    low  medium  high
low        46       2     0
medium      2       0     0
high        0       0     0
Table 3.9. Cross-tabulation of flowers according to petal length and width for flowers of the Versicolour species.
                  Width
Length    low  medium  high
low         0       0     0
medium      0      43     3
high        0       2     2
Table 3.10. Cross-tabulation of flowers according to petal length and width for flowers of the Virginica species.
                  Width
Length    low  medium  high
low         0       0     0
medium      0       0     3
high        0       3    44
Empty combinations—those combinations that do not correspond to at least
one flower—are not shown.
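The construction of such a table can be sketched in Python as follows. The interval boundaries are those given above; the function names and the five sample flowers are assumptions made for illustration.

```python
from collections import Counter

def petal_width_level(width):
    """Discretize petal width using the intervals [0, 0.75), [0.75, 1.75), [1.75, inf)."""
    if width < 0.75:
        return "low"
    return "medium" if width < 1.75 else "high"

def petal_length_level(length):
    """Discretize petal length using the intervals [0, 2.5), [2.5, 5), [5, inf)."""
    if length < 2.5:
        return "low"
    return "medium" if length < 5 else "high"

def fact_table(flowers):
    """Count flowers for each (petal length, petal width, species) combination."""
    return Counter(
        (petal_length_level(length), petal_width_level(width), species)
        for length, width, species in flowers
    )

# Made-up sample of (petal length, petal width, species) triples.
flowers = [(1.4, 0.2, "Setosa"), (1.5, 0.2, "Setosa"), (4.7, 1.4, "Versicolour"),
           (6.0, 2.5, "Virginica"), (5.0, 1.7, "Virginica")]
counts = fact_table(flowers)
```

A `Counter` stores only the combinations that occur and reports 0 for any other combination, which mirrors the convention of not listing empty combinations.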
The data can be organized as a multidimensional array with three dimensions
corresponding to petal width, petal length, and species type, as illustrated
in Figure 3.30. For clarity, slices of this array are shown as a set of
three two-dimensional tables, one for each species (see Tables 3.8, 3.9, and
3.10). The information contained in both Table 3.7 and Figure 3.30 is the
same. However, in the multidimensional representation shown in Figure 3.30
(and Tables 3.8, 3.9, and 3.10), the values of the attributes—petal width, petal
length, and species type—are array indices.
What is important are the insights that can be gained by looking at data from a
multidimensional viewpoint. Tables 3.8, 3.9, and 3.10 show that each species
of Iris is characterized by a different combination of values of petal length
and width. Setosa flowers have low width and length, Versicolour flowers have
medium width and length, and Virginica flowers have high width and length.
3.4.2 Multidimensional Data: The General Case
The previous section gave a specific example of using a multidimensional approach
to represent and analyze a familiar data set. Here we describe the
general approach in more detail.
The starting point is usually a tabular representation of the data, such
as that of Table 3.7, which is called a fact table. Two steps are necessary
in order to represent data as a multidimensional array: identification of the
dimensions and identification of an attribute that is the focus of the analysis.
The dimensions are categorical attributes or, as in the previous example,
continuous attributes that have been converted to categorical attributes. The
values of an attribute serve as indices into the array for the dimension corresponding
to the attribute, and the number of attribute values is the size of
that dimension. In the previous example, each attribute had three possible
values, and thus, each dimension was of size three and could be indexed by
three values. This produced a 3 × 3 × 3 multidimensional array.
Each combination of attribute values (one value for each different attribute)
defines a cell of the multidimensional array. To illustrate using the previous
example, if petal length = low, petal width = medium, and species = Setosa,
a specific cell containing the value 2 is identified. That is, there are only two
flowers in the data set that have the specified attribute values. Notice that
each row (object) of the data set in Table 3.7 corresponds to a cell in the
multidimensional array.
The contents of each cell represent the value of a target quantity (target
variable or attribute) that we are interested in analyzing. In the Iris example,
the target quantity is the number of flowers whose petal width and length
fall within certain limits. The target attribute is quantitative because a key
goal of multidimensional data analysis is to look at aggregate quantities, such as
totals or averages.
The following summarizes the procedure for creating a multidimensional
data representation from a data set represented in tabular form. First, identify
the categorical attributes to be used as the dimensions and a quantitative
attribute to be used as the target of the analysis. Each row (object) in the
table is mapped to a cell of the multidimensional array. The indices of the cell
are specified by the values of the attributes that were selected as dimensions,
while the value of the cell is the value of the target attribute. Cells not defined
by the data are assumed to have a value of 0.
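The procedure can be sketched with the rows of Table 3.7 as the fact table. The nested-list representation and the names below are illustrative choices rather than anything prescribed by OLAP systems.

```python
LEVELS = ["low", "medium", "high"]
SPECIES = ["Setosa", "Versicolour", "Virginica"]

# Rows of Table 3.7: (petal length, petal width, species type, count).
FACT_TABLE = [
    ("low", "low", "Setosa", 46), ("low", "medium", "Setosa", 2),
    ("medium", "low", "Setosa", 2), ("medium", "medium", "Versicolour", 43),
    ("medium", "high", "Versicolour", 3), ("medium", "high", "Virginica", 3),
    ("high", "medium", "Versicolour", 2), ("high", "medium", "Virginica", 3),
    ("high", "high", "Versicolour", 2), ("high", "high", "Virginica", 44),
]

def build_cube(rows):
    """Map each fact-table row to a cell; cells not defined by the data stay 0."""
    cube = [[[0] * len(SPECIES) for _ in LEVELS] for _ in LEVELS]
    for length, width, species, count in rows:
        cube[LEVELS.index(length)][LEVELS.index(width)][SPECIES.index(species)] = count
    return cube

cube = build_cube(FACT_TABLE)
# Cell for petal length = low, petal width = medium, species = Setosa.
low_medium_setosa = cube[0][1][0]
```

The dimension attributes supply the three indices, and the target attribute (the count) supplies the cell value.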
Example 3.23. To further illustrate the ideas just discussed, we present a
more traditional example involving the sale of products. The fact table for this
example is given by Table 3.11. The dimensions of the multidimensional representation
are the product ID, location, and date attributes, while the target
attribute is the revenue. Figure 3.31 shows the multidimensional representation
of this data set. This larger and more complicated data set will be used
to illustrate additional concepts of multidimensional data analysis.
3.4.3 Analyzing Multidimensional Data
In this section, we describe different multidimensional analysis techniques. In
particular, we discuss the creation of data cubes, and related operations, such
as slicing, dicing, dimensionality reduction, roll-up, and drill-down.
Data Cubes: Computing Aggregate Quantities
A key motivation for taking a multidimensional viewpoint of data is the importance
of aggregating data in various ways. In the sales example, we might
wish to find the total sales revenue for a specific year and a specific product.
Or we might wish to see the yearly sales revenue for each location across all
products. Computing aggregate totals involves fixing specific values for some
of the attributes that are being used as dimensions and then summing over
all possible values for the attributes that make up the remaining dimensions.
There are other types of aggregate quantities that are also of interest, but for
simplicity, this discussion will use totals (sums).
Table 3.12 shows the result of summing over all locations for various combinations
of date and product. For simplicity, assume that all the dates are
within one year. If there are 365 days in a year and 1000 products, then Table
3.12 has 365,000 entries (totals), one for each product-date pair. We could
also specify the store location and date and sum over products, or specify the
location and product and sum over all dates.
Table 3.13 shows the marginal totals of Table 3.12. These totals are the
result of further summing over either dates or products. In Table 3.13, the
total sales revenue due to product 1, which is obtained by summing across
row 1 (over all dates), is $370,000. The total sales revenue on January 1,
2004, which is obtained by summing down column 1 (over all products), is
$527,362. The total sales revenue, which is obtained by summing over all rows
and columns (all times and products) is $227,352,127. All of these totals are
for all locations because the entries of Table 3.13 include all locations.
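The following sketch shows one way to compute such totals when the cells are stored in a dictionary keyed by (product ID, location, date). The Oct. 18 revenues are taken from Table 3.11; the Oct. 19 revenues, the date strings, and the function name are assumptions added for illustration.

```python
from collections import defaultdict

# (product ID, location, date) -> revenue in dollars.
sales = {
    (1, "Minneapolis", "2004-10-18"): 250, (1, "Chicago", "2004-10-18"): 79,
    (1, "Paris", "2004-10-18"): 301, (27, "Minneapolis", "2004-10-18"): 2321,
    (27, "Chicago", "2004-10-18"): 3278, (27, "Paris", "2004-10-18"): 1325,
    (1, "Minneapolis", "2004-10-19"): 200, (27, "Paris", "2004-10-19"): 1000,  # made up
}

def sum_over(cells, drop):
    """Sum out the dimensions whose positions in the key are listed in drop."""
    totals = defaultdict(int)
    for key, value in cells.items():
        kept = tuple(k for pos, k in enumerate(key) if pos not in drop)
        totals[kept] += value
    return dict(totals)

by_product_date = sum_over(sales, drop={1})       # sum over locations
by_product = sum_over(by_product_date, drop={1})  # marginal totals per product
by_date = sum_over(by_product_date, drop={0})     # marginal totals per date
grand_total = sum_over(sales, drop={0, 1, 2})[()]
```

Here `by_product_date` plays the role of Table 3.12, while `by_product`, `by_date`, and `grand_total` correspond to the marginal totals of Table 3.13.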
A key point of this example is that there are a number of different totals
(aggregates) that can be computed for a multidimensional array, depending on
how many attributes we sum over. Assume that there are $n$ dimensions and
that the $i$th dimension (attribute) has $s_i$ possible values. There are $n$ different
ways to sum only over a single attribute. If we sum over dimension $j$, then we
obtain $s_1 \times \cdots \times s_{j-1} \times s_{j+1} \times \cdots \times s_n$ totals, one for each possible combination
of attribute values of the $n - 1$ other attributes (dimensions). The totals that
result from summing over one attribute form a multidimensional array of $n - 1$
dimensions and there are $n$ such arrays of totals. In the sales example, there
are three sets of totals that result from summing over only one dimension and
each set of totals can be displayed as a two-dimensional table.
Table 3.11. Sales revenue of products (in dollars) for various locations and times.
Product ID  Location     Date           Revenue
…           …            …              …
1           Minneapolis  Oct. 18, 2004  $250
1           Chicago      Oct. 18, 2004  $79
            …            …              …
1           Paris        Oct. 18, 2004  $301
…           …            …              …
27          Minneapolis  Oct. 18, 2004  $2,321
27          Chicago      Oct. 18, 2004  $3,278
            …            …              …
27          Paris        Oct. 18, 2004  $1,325
…           …            …              …
Figure 3.31. Multidimensional data representation for sales data. (A cube whose axes are location, date, and product ID; each cell holds a revenue value in dollars.)
Table 3.12. Totals that result from summing over all locations for a fixed time and product.
                                 date
product ID  Jan 1, 2004  Jan 2, 2004  . . .  Dec 31, 2004
1           $1,001       $987         . . .  $891
…           …            …                   …
27          $10,265      $10,225      . . .  $9,325
…           …            …                   …
Table 3.13. Table 3.12 with marginal totals.
                                 date
product ID  Jan 1, 2004  Jan 2, 2004  . . .  Dec 31, 2004  total
1           $1,001       $987         . . .  $891          $370,000
…           …            …                   …             …
27          $10,265      $10,225      . . .  $9,325        $3,800,020
…           …            …                   …             …
total       $527,362     $532,953     . . .  $631,221      $227,352,127
If we sum over two dimensions (perhaps starting with one of the arrays
of totals obtained by summing over one dimension), then we will obtain a
multidimensional array of totals with $n - 2$ dimensions. There will be $\binom{n}{2}$
distinct arrays of such totals. For the sales example, there will be $\binom{3}{2} = 3$
arrays of totals that result from summing over location and product, location
and time, or product and time. In general, summing over $k$ dimensions yields
$\binom{n}{k}$ arrays of totals, each with dimension $n - k$.
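The counting argument can be checked by enumerating the arrays of totals directly. In the sketch below, the 365 dates and 1000 products come from the earlier discussion, while the figure of 10 locations and all names are assumptions.

```python
from itertools import combinations
from math import comb, prod

def total_arrays(dimension_sizes, k):
    """List every array of totals obtained by summing over k of the dimensions."""
    names = list(dimension_sizes)
    arrays = []
    for summed in combinations(names, k):
        kept = [d for d in names if d not in summed]
        entries = prod(dimension_sizes[d] for d in kept)  # number of totals in the array
        arrays.append((summed, kept, entries))
    return arrays

sizes = {"product": 1000, "location": 10, "date": 365}  # 10 locations is an assumption
one_dim = total_arrays(sizes, 1)
two_dim = total_arrays(sizes, 2)
```

Each entry records which dimensions were summed over, which remain, and how many totals the resulting array holds. Summing over location alone leaves 1000 × 365 = 365,000 totals, in agreement with the size of Table 3.12.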
A multidimensional representation of the data, together with all possible
totals (aggregates), is known as a data cube. Despite the name, the size of
each dimension—the number of attribute values—does not need to be equal.
Also, a data cube may have either more or fewer than three dimensions. More
importantly, a data cube is a generalization of what is known in statistical
terminology as a cross-tabulation. If marginal totals were added, Tables
3.8, 3.9, or 3.10 would be typical examples of cross-tabulations.
Dimensionality Reduction and Pivoting
The aggregation described in the last section can be viewed as a form of
dimensionality reduction. Specifically, the jth dimension is eliminated by
summing over it. Conceptually, this collapses each “column” of cells in the jth
dimension into a single cell. For both the sales and Iris examples, aggregating
over one dimension reduces the dimensionality of the data from 3 to 2. If $s_j$
is the number of possible values of the $j$th dimension, the number of cells is
reduced by a factor of $s_j$. Exercise 17 on page 143 asks the reader to explore
the difference between this type of dimensionality reduction and that of PCA.
Pivoting refers to aggregating over all dimensions except two. The result
is a two-dimensional cross-tabulation with the two specified dimensions as the
only remaining dimensions. Table 3.13 is an example of pivoting on date and
product.
Slicing and Dicing
These two colorful names refer to rather straightforward operations. Slicing is
selecting a group of cells from the entire multidimensional array by specifying
a specific value for one or more dimensions. Tables 3.8, 3.9, and 3.10 are
three slices from the Iris set that were obtained by specifying three separate
values for the species dimension. Dicing involves selecting a subset of cells by
specifying a range of attribute values. This is equivalent to defining a subarray
from the complete array. In practice, both operations can also be accompanied
by aggregation over some dimensions.
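Both operations amount to filtering cells by their coordinates, as the following sketch suggests. The cells are the nonempty entries of Table 3.7; the function names and the use of value sets to express a range are assumptions.

```python
# Nonempty cells of the Iris array: (petal length, petal width, species) -> count.
cells = {
    ("low", "low", "Setosa"): 46, ("low", "medium", "Setosa"): 2,
    ("medium", "low", "Setosa"): 2, ("medium", "medium", "Versicolour"): 43,
    ("medium", "high", "Versicolour"): 3, ("medium", "high", "Virginica"): 3,
    ("high", "medium", "Versicolour"): 2, ("high", "medium", "Virginica"): 3,
    ("high", "high", "Versicolour"): 2, ("high", "high", "Virginica"): 44,
}

def slice_cells(cells, dim, value):
    """Slice: keep the cells whose coordinate on dimension dim equals value."""
    return {key: v for key, v in cells.items() if key[dim] == value}

def dice_cells(cells, allowed):
    """Dice: keep the cells whose coordinates lie in the allowed values per dimension."""
    return {key: v for key, v in cells.items()
            if all(key[dim] in values for dim, values in allowed.items())}

setosa = slice_cells(cells, dim=2, value="Setosa")  # the nonzero cells of Table 3.8
# Subarray with petal length and petal width both restricted to medium..high.
larger_petals = dice_cells(cells, {0: {"medium", "high"}, 1: {"medium", "high"}})
```

The slice fixes a single species, whereas the dice keeps a subarray that spans two values on each of the two petal dimensions.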
Roll-Up and Drill-Down
In Chapter 2, attribute values were regarded as being “atomic” in some sense.
However, this is not always the case. In particular, each date has a number
of properties associated with it such as the year, month, and week. The data
can also be identified as belonging to a particular business quarter, or if the
application relates to education, a school quarter or semester. A location
also has various properties: continent, country, state (province, etc.), and
city. Products can also be divided into various categories, such as clothing,
electronics, and furniture.
Often these categories can be organized as a hierarchical tree or lattice.
For instance, years consist of months or weeks, both of which consist of days.
Locations can be divided into nations, which contain states (or other units
of local government), which in turn contain cities. Likewise, any category
of products can be further subdivided. For example, the product category,
furniture, can be subdivided into the subcategories, chairs, tables, sofas, etc.
This hierarchical structure gives rise to the roll-up and drill-down operations.
To illustrate, starting with the original sales data, which is a multidimensional
array with entries for each date, we can aggregate (roll up) the
sales across all the dates in a month. Conversely, given a representation of the
data where the time dimension is broken into months, we might want to split
the monthly sales totals (drill down) into daily sales totals. Of course, this
requires that the underlying sales data be available at a daily granularity.
Thus, roll-up and drill-down operations are related to aggregation. Notice,
however, that they differ from the aggregation operations discussed until
now in that they aggregate cells within a dimension, not across the entire
dimension.
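A roll-up from days to months might be sketched as follows. The January revenues are the totals shown in Table 3.12; the February value and the function name are assumptions.

```python
from collections import defaultdict
from datetime import date

# (product ID, day) -> revenue summed over all locations.
daily_sales = {
    (1, date(2004, 1, 1)): 1001, (1, date(2004, 1, 2)): 987,
    (27, date(2004, 1, 1)): 10265, (27, date(2004, 1, 2)): 10225,
    (1, date(2004, 2, 1)): 950,  # made-up value
}

def roll_up_to_month(daily):
    """Aggregate cells within the date dimension: days are grouped into months."""
    monthly = defaultdict(int)
    for (product, day), revenue in daily.items():
        monthly[(product, (day.year, day.month))] += revenue
    return dict(monthly)

monthly_sales = roll_up_to_month(daily_sales)
```

The date dimension is retained but becomes coarser. Drilling down from `monthly_sales` is only possible if `daily_sales` has been kept, since the monthly totals alone do not determine the daily values.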
3.4.4 Final Comments on Multidimensional Data Analysis
Multidimensional data analysis, in the sense implied by OLAP and related systems,
consists of viewing the data as a multidimensional array and aggregating
data in order to better analyze the structure of the data. For the Iris data,
the differences in petal width and length are clearly shown by such an analysis.
The analysis of business data, such as sales data, can also reveal many
interesting patterns, such as profitable (or unprofitable) stores or products.
As mentioned, there are various types of database systems that support
the analysis of multidimensional data. Some of these systems are based on
relational databases and are known as ROLAP systems. More specialized
database systems that specifically employ a multidimensional data representation
as their fundamental data model have also been designed. Such systems
are known as MOLAP systems. In addition to these types of systems, statistical
databases (SDBs) have been developed to store and analyze various types
of statistical data, e.g., census and public health data, that are collected by
governments or other large organizations. References to OLAP and SDBs are
provided in the bibliographic notes.
3.5 Bibliographic Notes
Summary statistics are discussed in detail in most introductory statistics
books, such as [92]. References for exploratory data analysis are the classic
text by Tukey [104] and the book by Velleman and Hoaglin [105].
The basic visualization techniques are readily available, being an integral
part of most spreadsheets (Microsoft EXCEL [95]), statistics programs (SAS
[99], SPSS [102], R [96], and S-PLUS [98]), and mathematics software (MATLAB
[94] and Mathematica [93]). Most of the graphics in this chapter were
generated using MATLAB. The statistics package R is freely available as an
open source software package from the R project.
The literature on visualization is extensive, covering many fields and many
decades. One of the classics of the field is the book by Tufte [103]. The book
by Spence [101], which strongly influenced the visualization portion of this
chapter, is a useful reference for information visualization—both principles and
techniques. This book also provides a thorough discussion of many dynamic
visualization techniques that were not covered in this chapter. Two other
books on visualization that may also be of interest are those by Card et al.
[87] and Fayyad et al. [89].
Finally, there is a great deal of information available about data visualization
on the World Wide Web. Since Web sites come and go frequently, the best
strategy is a search using “information visualization,” “data visualization,” or
“statistical graphics.” However, we do want to single out for attention “The
Gallery of Data Visualization,” by Friendly [90]. The ACCENT Principles for
effective graphical display as stated in this chapter can be found there, or as
originally presented in the article by Burn [86].
There are a variety of graphical techniques that can be used to explore
whether the distribution of the data is Gaussian or some other specified distribution.
Also, there are plots that display whether the observed values are
statistically significant in some sense. We have not covered any of these techniques
here and refer the reader to the previously mentioned statistical and
mathematical packages.
Multidimensional analysis has been around in a variety of forms for some
time. One of the original papers was a white paper by Codd [88], the father
of relational databases. The data cube was introduced by Gray et al. [91],
who described various operations for creating and manipulating data cubes
within a relational database framework. A comparison of statistical databases
and OLAP is given by Shoshani [100]. Specific information on OLAP can
be found in documentation from database vendors and many popular books.
Many database textbooks also have general discussions of OLAP, often in the
context of data warehousing. For example, see the text by Ramakrishnan and
Gehrke [97].
Bibliography
[86] D. A. Burn. Designing Effective Statistical Graphs. In C. R. Rao, editor, Handbook of
Statistics 9. Elsevier/North-Holland, Amsterdam, The Netherlands, September 1993.
[87] S. K. Card, J. D. MacKinlay, and B. Shneiderman, editors. Readings in Information
Visualization: Using Vision to Think. Morgan Kaufmann Publishers, San Francisco,
CA, January 1999.
[88] E. F. Codd, S. B. Codd, and C. T. Smalley. Providing OLAP (Online Analytical
Processing) to User Analysts: An IT Mandate. White Paper, E.F. Codd and Associates,
1993.
[89] U. M. Fayyad, G. G. Grinstein, and A. Wierse, editors. Information Visualization in
Data Mining and Knowledge Discovery. Morgan Kaufmann Publishers, San Francisco,
CA, September 2001.
[90] M. Friendly. Gallery of Data Visualization. http://www.math.yorku.ca/SCS/Gallery/,
2005.
[91] J. Gray, S. Chaudhuri, A. Bosworth, A. Layman, D. Reichart, M. Venkatrao, F. Pellow,
and H. Pirahesh. Data Cube: A Relational Aggregation Operator Generalizing Group-By,
Cross-Tab, and Sub-Totals. Journal Data Mining and Knowledge Discovery, 1(1):
29–53, 1997.
[92] B. W. Lindgren. Statistical Theory. CRC Press, January 1993.
[93] Mathematica 5.1. Wolfram Research, Inc. http://www.wolfram.com/, 2005.
[94] MATLAB 7.0. The MathWorks, Inc. http://www.mathworks.com, 2005.
[95] Microsoft Excel 2003. Microsoft, Inc. http://www.microsoft.com/, 2003.
[96] R: A language and environment for statistical computing and graphics. The R Project
for Statistical Computing. http://www.r-project.org/, 2005.
[97] R. Ramakrishnan and J. Gehrke. Database Management Systems. McGraw-Hill, 3rd
edition, August 2002.
[98] S-PLUS. Insightful Corporation. http://www.insightful.com, 2005.
[99] SAS: Statistical Analysis System. SAS Institute Inc. http://www.sas.com/, 2005.
[100] A. Shoshani. OLAP and statistical databases: similarities and differences. In Proc.
of the Sixteenth ACM SIGACT-SIGMOD-SIGART Symp. on Principles of Database
Systems, pages 185–196. ACM Press, 1997.
[101] R. Spence. Information Visualization. ACM Press, New York, December 2000.
[102] SPSS: Statistical Package for the Social Sciences. SPSS, Inc. http://www.spss.com/,
2005.
[103] E. R. Tufte. The Visual Display of Quantitative Information. Graphics Press, Cheshire,
CT, March 1986.
[104] J. W. Tukey. Exploratory data analysis. Addison-Wesley, 1977.
[105] P. Velleman and D. Hoaglin. The ABC’s of EDA: Applications, Basics, and Computing
of Exploratory Data Analysis. Duxbury, 1981.
3.6 Exercises
1. Obtain one of the data sets available at the UCI Machine Learning Repository
and apply as many of the different visualization techniques described in the
chapter as possible. The bibliographic notes and book Web site provide pointers
to visualization software.
2. Identify at least two advantages and two disadvantages of using color to visually
represent information.
3. What are the arrangement issues that arise with respect to three-dimensional
plots?
4. Discuss the advantages and disadvantages of using sampling to reduce the number
of data objects that need to be displayed. Would simple random sampling
(without replacement) be a good approach to sampling? Why or why not?
5. Describe how you would create visualizations to display information that describes
the following types of systems.
(a) Computer networks. Be sure to include both the static aspects of the
network, such as connectivity, and the dynamic aspects, such as traffic.
(b) The distribution of specific plant and animal species around the world for
a specific moment in time.
(c) The use of computer resources, such as processor time, main memory, and
disk, for a set of benchmark database programs.
(d) The change in occupation of workers in a particular country over the last
thirty years. Assume that you have yearly information about each person
that also includes gender and level of education.
Be sure to address the following issues:
• Representation. How will you map objects, attributes, and relationships
to visual elements?
• Arrangement. Are there any special considerations that need to be
taken into account with respect to how visual elements are displayed? Specific
examples might be the choice of viewpoint, the use of transparency,
or the separation of certain groups of objects.
• Selection. How will you handle a large number of attributes and data
objects?
6. Describe one advantage and one disadvantage of a stem and leaf plot with
respect to a standard histogram.
7. How might you address the problem that a histogram depends on the number
and location of the bins?
8. Describe how a box plot can give information about whether the value of an
attribute is symmetrically distributed. What can you say about the symmetry
of the distributions of the attributes shown in Figure 3.11?
9. Compare sepal length, sepal width, petal length, and petal width, using Figure
3.12.
10. Comment on the use of a box plot to explore a data set with four attributes:
age, weight, height, and income.
11. Give a possible explanation as to why most of the values of petal length and
width fall in the buckets along the diagonal in Figure 3.9.
12. Use Figures 3.14 and 3.15 to identify a characteristic shared by the petal width
and petal length attributes.
13. Simple line plots, such as that displayed in Figure 2.12 on page 56, which
shows two time series, can be used to effectively display high-dimensional data.
For example, in Figure 2.12 it is easy to tell that the frequencies of the two
time series are different. What characteristic of time series allows the effective
visualization of high-dimensional data?
14. Describe the types of situations that produce sparse or dense data cubes.
Illustrate with examples other than those used in the book.
15. How might you extend the notion of multidimensional data analysis so that the
target variable is a qualitative variable? In other words, what sorts of summary
statistics or data visualizations would be of interest?
16. Construct a data cube from Table 3.14. Is this a dense or sparse data cube? If
it is sparse, identify the cells that are empty.
Table 3.14. Fact table for Exercise 16.
Product ID   Location ID   Number Sold
1            1             10
1            3             6
2            1             5
2            2             22
17. Discuss the differences between dimensionality reduction based on aggregation
and dimensionality reduction based on techniques such as PCA and SVD.
4 Classification: Basic Concepts, Decision Trees, and Model Evaluation
Classification, which is the task of assigning objects to one of several predefined
categories, is a pervasive problem that encompasses many diverse applications.
Examples include detecting spam email messages based upon the message
header and content, categorizing cells as malignant or benign based upon the
results of MRI scans, and classifying galaxies based upon their shapes (see
Figure 4.1).
Figure 4.1. Classification of galaxies: (a) a spiral galaxy; (b) an elliptical galaxy. The images are from the NASA website.
From Chapter 4 of Introduction to Data Mining, First Edition. Pang-Ning Tan, Michael Steinbach,
Vipin Kumar. Copyright © 2006 by Pearson Education, Inc. All rights reserved.
[Figure 4.2: Input (attribute set x) → Classification model → Output (class label y).]
Figure 4.2. Classification as the task of mapping an input attribute set x into its class label y.
This chapter introduces the basic concepts of classification, describes some
of the key issues such as model overfitting, and presents methods for evaluating
and comparing the performance of a classification technique. While it focuses
mainly on a technique known as decision tree induction, most of the discussion
in this chapter is also applicable to other classification techniques, many of
which are covered in Chapter 5.
4.1 Preliminaries
The input data for a classification task is a collection of records. Each record,
also known as an instance or example, is characterized by a tuple (x, y), where
x is the attribute set and y is a special attribute, designated as the class label
(also known as category or target attribute). Table 4.1 shows a sample data set
used for classifying vertebrates into one of the following categories: mammal,
bird, fish, reptile, or amphibian. The attribute set includes properties of a
vertebrate such as its body temperature, skin cover, method of reproduction,
ability to fly, and ability to live in water. Although the attributes presented
in Table 4.1 are mostly discrete, the attribute set can also contain continuous
features. The class label, on the other hand, must be a discrete attribute.
This is a key characteristic that distinguishes classification from regression,
a predictive modeling task in which y is a continuous attribute. Regression
techniques are covered in Appendix D.
Definition 4.1 (Classification). Classification is the task of learning a
target function f that maps each attribute set x to one of the predefined class
labels y.
The target function is also known informally as a classification model.
A classification model is useful for the following purposes.
Descriptive Modeling A classification model can serve as an explanatory
tool to distinguish between objects of different classes. For example, it would
be useful, for both biologists and others, to have a descriptive model that
summarizes the data shown in Table 4.1 and explains what features define a
vertebrate as a mammal, reptile, bird, fish, or amphibian.
Table 4.1. The vertebrate data set.
Name           Body Temperature  Skin Cover  Gives Birth  Aquatic Creature  Aerial Creature  Has Legs  Hibernates  Class Label
human          warm-blooded      hair        yes          no                no               yes       no          mammal
python         cold-blooded      scales      no           no                no               no        yes         reptile
salmon         cold-blooded      scales      no           yes               no               no        no          fish
whale          warm-blooded      hair        yes          yes               no               no        no          mammal
frog           cold-blooded      none        no           semi              no               yes       yes         amphibian
komodo dragon  cold-blooded      scales      no           no                no               yes       no          reptile
bat            warm-blooded      hair        yes          no                yes              yes       yes         mammal
pigeon         warm-blooded      feathers    no           no                yes              yes       no          bird
cat            warm-blooded      fur         yes          no                no               yes       no          mammal
leopard shark  cold-blooded      scales      yes          yes               no               no        no          fish
turtle         cold-blooded      scales      no           semi              no               yes       no          reptile
penguin        warm-blooded      feathers    no           semi              no               yes       no          bird
porcupine      warm-blooded      quills      yes          no                no               yes       yes         mammal
eel            cold-blooded      scales      no           yes               no               no        no          fish
salamander     cold-blooded      none        no           semi              no               yes       yes         amphibian
Predictive Modeling A classification model can also be used to predict
the class label of unknown records. As shown in Figure 4.2, a classification
model can be treated as a black box that automatically assigns a class label
when presented with the attribute set of an unknown record. Suppose we are
given the following characteristics of a creature known as a gila monster:
Name           Body Temperature  Skin Cover  Gives Birth  Aquatic Creature  Aerial Creature  Has Legs  Hibernates  Class Label
gila monster   cold-blooded      scales      no           no                no               yes       yes         ?
We can use a classification model built from the data set shown in Table 4.1
to determine the class to which the creature belongs.
Classification techniques are most suited for predicting or describing data
sets with binary or nominal categories. They are less effective for ordinal
categories (e.g., to classify a person as a member of high, medium, or low
income group) because they do not consider the implicit order among the
categories. Other forms of relationships, such as the subclass–superclass
relationships among categories (e.g., humans and apes are primates, which, in
turn, are a subclass of mammals), are also ignored. The remainder of this chapter
focuses only on binary or nominal class labels.
4.2 General Approach to Solving a Classification Problem
A classification technique (or classifier) is a systematic approach to building
classification models from an input data set. Examples include decision tree
classifiers, rule-based classifiers, neural networks, support vector machines,
and naïve Bayes classifiers. Each technique employs a learning algorithm
to identify a model that best fits the relationship between the attribute set and
class label of the input data. The model generated by a learning algorithm
should both fit the input data well and correctly predict the class labels of
records it has never seen before. Therefore, a key objective of the learning
algorithm is to build models with good generalization capability; i.e., models
that accurately predict the class labels of previously unknown records.
Figure 4.3 shows a general approach for solving classification problems.
First, a training set consisting of records whose class labels are known must
be provided. The training set is used to build a classification model, which is
subsequently applied to the test set, which consists of records with unknown
class labels.
[Figure 4.3: a learning algorithm learns a model from the training set (induction); the model is then applied to the test set (deduction).]
Training Set
Tid  Attrib1  Attrib2  Attrib3  Class
1    Yes      Large    125K     No
2    No       Medium   100K     No
3    No       Small    70K      No
4    Yes      Medium   120K     No
5    No       Large    95K      Yes
6    No       Medium   60K      No
7    Yes      Large    220K     No
8    No       Small    85K      Yes
9    No       Medium   75K      No
10   No       Small    90K      Yes
Test Set
Tid  Attrib1  Attrib2  Attrib3  Class
11   No       Small    55K      ?
12   Yes      Medium   80K      ?
13   Yes      Large    110K     ?
14   No       Small    95K      ?
15   No       Large    67K      ?
Figure 4.3. General approach for building a classification model.
Table 4.2. Confusion matrix for a 2-class problem.
                          Predicted Class
                          Class = 1   Class = 0
Actual Class  Class = 1   f11         f10
              Class = 0   f01         f00
Evaluation of the performance of a classification model is based on the
counts of test records correctly and incorrectly predicted by the model. These
counts are tabulated in a table known as a confusion matrix. Table 4.2
depicts the confusion matrix for a binary classification problem. Each entry
fij in this table denotes the number of records from class i predicted to be
of class j. For instance, f01 is the number of records from class 0 incorrectly
predicted as class 1. Based on the entries in the confusion matrix, the total
number of correct predictions made by the model is (f11 + f00) and the total
number of incorrect predictions is (f10 + f01).
Although a confusion matrix provides the information needed to determine
how well a classification model performs, summarizing this information with
a single number would make it more convenient to compare the performance
of different models. This can be done using a performance metric such as
accuracy, which is defined as follows:
\[ \text{Accuracy} = \frac{\text{Number of correct predictions}}{\text{Total number of predictions}} = \frac{f_{11} + f_{00}}{f_{11} + f_{10} + f_{01} + f_{00}}. \tag{4.1} \]
Equivalently, the performance of a model can be expressed in terms of its
error rate, which is given by the following equation:
\[ \text{Error rate} = \frac{\text{Number of wrong predictions}}{\text{Total number of predictions}} = \frac{f_{10} + f_{01}}{f_{11} + f_{10} + f_{01} + f_{00}}. \tag{4.2} \]
Most classification algorithms seek models that attain the highest accuracy, or
equivalently, the lowest error rate when applied to the test set. We will revisit
the topic of model evaluation in Section 4.5.
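As a minimal sketch of Equations 4.1 and 4.2, the following Python fragment tabulates the confusion-matrix counts for a binary problem and derives accuracy and error rate from them. The function names and the eight example labels are hypothetical and chosen only for illustration; they do not come from the text.

```python
from collections import Counter

def confusion_counts(actual, predicted):
    """Return f[(i, j)]: number of records of actual class i predicted as class j."""
    counts = Counter(zip(actual, predicted))
    return {(i, j): counts.get((i, j), 0) for i in (1, 0) for j in (1, 0)}

def accuracy(f):
    """Equation 4.1: correct predictions divided by all predictions."""
    return (f[(1, 1)] + f[(0, 0)]) / sum(f.values())

def error_rate(f):
    """Equation 4.2: wrong predictions divided by all predictions."""
    return (f[(1, 0)] + f[(0, 1)]) / sum(f.values())

# Hypothetical labels for eight test records (illustrative only).
actual    = [1, 1, 1, 0, 0, 0, 0, 1]
predicted = [1, 0, 1, 0, 0, 1, 0, 1]

f = confusion_counts(actual, predicted)
print(f)
print(accuracy(f), error_rate(f))
```

By construction the two metrics sum to 1, so either one is enough to rank models on the same test set.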
4.3 Decision Tree Induction
This section introduces a decision tree classifier, which is a simple yet widely
used classification technique.
4.3.1 How a Decision Tree Works
To illustrate how classification with a decision tree works, consider a simpler
version of the vertebrate classification problem described in the previous
section. Instead of classifying the vertebrates into five distinct groups of species,
we assign them to two categories: mammals and non-mammals.
Suppose a new species is discovered by scientists. How can we tell whether
it is a mammal or a non-mammal? One approach is to pose a series of questions
about the characteristics of the species. The first question we may ask is
whether the species is cold- or warm-blooded. If it is cold-blooded, then it is
definitely not a mammal. Otherwise, it is either a bird or a mammal. In the
latter case, we need to ask a follow-up question: Do the females of the species
give birth to their young? Those that do give birth are definitely mammals,
while those that do not are likely to be non-mammals (with the exception of
egg-laying mammals such as the platypus and spiny anteater).
The previous example illustrates how we can solve a classification problem
by asking a series of carefully crafted questions about the attributes of the
test record. Each time we receive an answer, a follow-up question is asked
until we reach a conclusion about the class label of the record. The series of
questions and their possible answers can be organized in the form of a decision
tree, which is a hierarchical structure consisting of nodes and directed edges.
Figure 4.4 shows the decision tree for the mammal classification problem. The
tree has three types of nodes:
• A root node that has no incoming edges and zero or more outgoing
edges.
• Internal nodes, each of which has exactly one incoming edge and two
or more outgoing edges.
• Leaf or terminal nodes, each of which has exactly one incoming edge
and no outgoing edges.
In a decision tree, each leaf node is assigned a class label. The
non-terminal nodes, which include the root and other internal nodes, contain
attribute test conditions to separate records that have different
characteristics. For example, the root node shown in Figure 4.4 uses the attribute Body
Temperature to separate warm-blooded from cold-blooded vertebrates. Since
all cold-blooded vertebrates are non-mammals, a leaf node labeled Non-mammals
is created as the right child of the root node. If the vertebrate is warm-blooded,
a subsequent attribute, Gives Birth, is used to distinguish mammals from
other warm-blooded creatures, which are mostly birds.
[Figure 4.4: the root node tests Body Temperature. The Warm branch leads to an internal node that tests Gives Birth (Yes: leaf node Mammals; No: leaf node Non-mammals). The Cold branch leads to a leaf node Non-mammals.]
Figure 4.4. A decision tree for the mammal classification problem.
Classifying a test record is straightforward once a decision tree has been
constructed. Starting from the root node, we apply the test condition to the
record and follow the appropriate branch based on the outcome of the test.
This will lead us either to another internal node, for which a new test condition
is applied, or to a leaf node. The class label associated with the leaf node is
then assigned to the record. As an illustration, Figure 4.5 traces the path in
the decision tree that is used to predict the class label of a flamingo. The path
terminates at a leaf node labeled Non-mammals.
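The traversal just described can be sketched in a few lines of Python. The nested-dictionary representation, the `classify` function, and the lowercase outcome strings are assumptions made for this illustration; only the tree structure itself is taken from Figure 4.4.

```python
# Decision tree of Figure 4.4 as nested dicts (an assumed representation):
# an internal node holds the attribute it tests and one child per outcome;
# a leaf is simply a class-label string.
tree = {
    "attribute": "Body Temperature",
    "children": {
        "cold": "Non-mammals",
        "warm": {
            "attribute": "Gives Birth",
            "children": {"yes": "Mammals", "no": "Non-mammals"},
        },
    },
}

def classify(node, record):
    """Follow test outcomes from the root until a leaf label is reached."""
    while isinstance(node, dict):
        outcome = record[node["attribute"]]
        node = node["children"][outcome]
    return node

flamingo = {"Body Temperature": "warm", "Gives Birth": "no"}
print(classify(tree, flamingo))
```

Each record visits at most one node per level, so the cost of classifying it is bounded by the depth of the tree.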
4.3.2 How to Build a Decision Tree
In principle, there are exponentially many decision trees that can be
constructed from a given set of attributes. While some of the trees are more
accurate than others, finding the optimal tree is computationally infeasible because
of the exponential size of the search space. Nevertheless, efficient algorithms
have been developed to induce a reasonably accurate, albeit suboptimal,
decision tree in a reasonable amount of time. These algorithms usually employ
a greedy strategy that grows a decision tree by making a series of locally
optimum decisions about which attribute to use for partitioning the data. One
such algorithm is Hunt’s algorithm, which is the basis of many existing
decision tree induction algorithms, including ID3, C4.5, and CART. This section
presents a high-level discussion of Hunt’s algorithm and illustrates some of its
design issues.
[Figure 4.5: the unlabeled record (Name = Flamingo, Body Temperature = Warm, Gives Birth = No, ..., Class = ?) is passed down the decision tree of Figure 4.4, following the Warm branch of Body Temperature and then the No branch of Gives Birth to a leaf node labeled Non-mammals.]
Figure 4.5. Classifying an unlabeled vertebrate. The dashed lines represent the outcomes of applying
various attribute test conditions on the unlabeled vertebrate. The vertebrate is eventually assigned to
the Non-mammal class.
Hunt’s Algorithm
In Hunt’s algorithm, a decision tree is grown in a recursive fashion by
partitioning the training records into successively purer subsets. Let Dt be the set
of training records that are associated with node t and y = {y1, y2, ..., yc} be
the class labels. The following is a recursive definition of Hunt’s algorithm.
Step 1: If all the records in Dt belong to the same class yt, then t is a leaf
node labeled as yt.
Step 2: If Dt contains records that belong to more than one class, an
attribute test condition is selected to partition the records into smaller
subsets. A child node is created for each outcome of the test
condition and the records in Dt are distributed to the children based on the
outcomes. The algorithm is then recursively applied to each child node.
Tid  Home Owner  Marital Status  Annual Income  Defaulted Borrower
     (binary)    (categorical)   (continuous)   (class)
1    Yes         Single          125K           No
2    No          Married         100K           No
3    No          Single          70K            No
4    Yes         Married         120K           No
5    No          Divorced        95K            Yes
6    No          Married         60K            No
7    Yes         Divorced        220K           No
8    No          Single          85K            Yes
9    No          Married         75K            No
10   No          Single          90K            Yes
Figure 4.6. Training set for predicting borrowers who will default on loan payments.
To illustrate how the algorithm works, consider the problem of predicting
whether a loan applicant will repay her loan obligations or become delinquent,
subsequently defaulting on her loan. A training set for this problem can be
constructed by examining the records of previous borrowers. In the example
shown in Figure 4.6, each record contains the personal information of a
borrower along with a class label indicating whether the borrower has defaulted
on loan payments.
The initial tree for the classification problem contains a single node with
class label Defaulted = No (see Figure 4.7(a)), which means that most of
the borrowers successfully repaid their loans. The tree, however, needs to be
refined since the root node contains records from both classes. The records are
subsequently divided into smaller subsets based on the outcomes of the Home
Owner test condition, as shown in Figure 4.7(b). The justification for choosing
this attribute test condition will be discussed later. For now, we will assume
that this is the best criterion for splitting the data at this point. Hunt’s
algorithm is then applied recursively to each child of the root node. From
the training set given in Figure 4.6, notice that all borrowers who are home
owners successfully repaid their loans. The left child of the root is therefore a
leaf node labeled Defaulted = No (see Figure 4.7(b)). For the right child, we
need to continue applying the recursive step of Hunt’s algorithm until all the
records belong to the same class. The trees resulting from each recursive step
are shown in Figures 4.7(c) and (d).
[Figure 4.7, four panels: (a) a single leaf node labeled Defaulted = No; (b) a split on Home Owner in which both the Yes branch and the No branch lead to a leaf labeled Defaulted = No; (c) the No branch of Home Owner is split further on Marital Status, with {Single, Divorced} leading to Defaulted = Yes and Married leading to Defaulted = No; (d) the {Single, Divorced} branch is split further on Annual Income, with < 80K leading to Defaulted = No and >= 80K leading to Defaulted = Yes.]
Figure 4.7. Hunt’s algorithm for inducing decision trees.
Hunt’s algorithm will work if every combination of attribute values is
present in the training data and each combination has a unique class label.
These assumptions are too stringent for use in most practical situations.
Additional conditions are needed to handle the following cases:
1. It is possible for some of the child nodes created in Step 2 to be empty;
i.e., there are no records associated with these nodes. This can happen
if none of the training records have the combination of attribute values
associated with such nodes. In this case the node is declared a leaf
node with the same class label as the majority class of training records
associated with its parent node.
2. In Step 2, if all the records associated with Dt have identical attribute
values (except for the class label), then it is not possible to split these
records any further. In this case, the node is declared a leaf node with
the same class label as the majority class of training records associated
with this node.
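The two steps of Hunt's algorithm, together with the two additional conditions above, can be sketched as follows. This is only an illustration under stated assumptions: the test condition is simply the first attribute (in a fixed order) that still takes more than one value, not an impurity-based choice; every split is a multiway split over a known domain; and the continuous Annual Income attribute of Figure 4.6 is left out. The names `hunt`, `majority_class`, and `domains` are hypothetical.

```python
from collections import Counter

def majority_class(records):
    """Most frequent class label among (attributes, label) records."""
    return Counter(label for _, label in records).most_common(1)[0][0]

def hunt(records, attributes, domains, parent_majority=None):
    """Grow a tree: leaves are class labels, internal nodes are dicts."""
    if not records:
        # Additional condition 1: empty child, use the parent's majority class.
        return parent_majority
    labels = {label for _, label in records}
    if len(labels) == 1:
        # Step 1: all records belong to the same class.
        return labels.pop()
    # Attributes that still take more than one value among these records.
    usable = [a for a in attributes if len({x[a] for x, _ in records}) > 1]
    if not usable:
        # Additional condition 2: identical attribute values, mixed classes.
        return majority_class(records)
    attr = usable[0]  # ASSUMPTION: fixed order stands in for a real split criterion
    majority = majority_class(records)
    children = {}
    for value in domains[attr]:  # Step 2: one child per outcome of the test
        subset = [(x, y) for x, y in records if x[attr] == value]
        children[value] = hunt(subset, attributes, domains, majority)
    return {"attribute": attr, "children": children}

# Home Owner, Marital Status, and Defaulted columns of Figure 4.6.
rows = [
    ("Yes", "Single", "No"), ("No", "Married", "No"), ("No", "Single", "No"),
    ("Yes", "Married", "No"), ("No", "Divorced", "Yes"), ("No", "Married", "No"),
    ("Yes", "Divorced", "No"), ("No", "Single", "Yes"), ("No", "Married", "No"),
    ("No", "Single", "Yes"),
]
records = [({"Home Owner": h, "Marital Status": m}, d) for h, m, d in rows]
domains = {"Home Owner": ["Yes", "No"],
           "Marital Status": ["Single", "Married", "Divorced"]}

tree = hunt(records, list(domains), domains)
print(tree)
```

Because Annual Income is omitted, the three Single non-owners cannot be separated and that node becomes a majority-class leaf through the second additional condition, so the result resembles Figure 4.7(c) more than Figure 4.7(d).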
Design Issues of Decision Tree Induction
A learning algorithm for inducing decision trees must address the following
two issues.
1. How should the training records be split? Each recursive step
of the tree-growing process must select an attribute test condition to
divide the records into smaller subsets. To implement this step, the
algorithm must provide a method for specifying the test condition for
different attribute types as well as an objective measure for evaluating
the goodness of each test condition.
2. How should the splitting procedure stop? A stopping condition is
needed to terminate the tree-growing process. A possible strategy is to
continue expanding a node until either all the records belong to the same
class or all the records have identical attribute values. Although both
conditions are sufficient to stop any decision tree induction algorithm,
other criteria can be imposed to allow the tree-growing procedure to
terminate earlier. The advantages of early termination will be discussed
later in Section 4.4.5.
4.3.3 Methods for Expressing Attribute Test Conditions
Decision tree induction algorithms must provide a method for expressing an
attribute test condition and its corresponding outcomes for different attribute
types.
Binary Attributes The test condition for a binary attribute generates two
potential outcomes, as shown in Figure 4.8.
[Figure 4.8: a Body Temperature node with two outcomes, Warm-blooded and Cold-blooded.]
Figure 4.8. Test condition for binary attributes.
[Figure 4.9: (a) a multiway split of Marital Status into Single, Divorced, and Married; (b) the three binary splits obtained by grouping attribute values: {Married} versus {Single, Divorced}, {Single} versus {Married, Divorced}, and {Single, Married} versus {Divorced}.]
Figure 4.9. Test conditions for nominal attributes.
Nominal Attributes Since a nominal attribute can have many values, its
test condition can be expressed in two ways, as shown in Figure 4.9. For
a multiway split (Figure 4.9(a)), the number of outcomes depends on the
number of distinct values for the corresponding attribute. For example, if
an attribute such as marital status has three distinct values (single, married,
or divorced), its test condition will produce a three-way split. On the other
hand, some decision tree algorithms, such as CART, produce only binary splits
by considering all 2^(k−1) − 1 ways of creating a binary partition of k attribute
values. Figure 4.9(b) illustrates three different ways of grouping the attribute
values for marital status into two subsets.
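The count of 2^(k−1) − 1 binary partitions can be checked by enumeration. The sketch below is one possible way to list them, not a description of how CART is implemented; the function name `binary_partitions` is hypothetical. Pinning the first value to the left group avoids counting each split and its mirror image twice.

```python
from itertools import combinations

def binary_partitions(values):
    """Yield each split of `values` into two non-empty groups exactly once."""
    values = list(values)
    first, rest = values[0], values[1:]
    # The first value always goes left, so mirror-image splits are not repeated.
    for r in range(len(rest) + 1):
        for extra in combinations(rest, r):
            left = {first, *extra}
            right = set(values) - left
            if right:  # skip the trivial split that puts everything on one side
                yield left, right

splits = list(binary_partitions(["Single", "Married", "Divorced"]))
for left, right in splits:
    print(sorted(left), "vs", sorted(right))
print(len(splits))
```

For the three marital status values this yields the three groupings of Figure 4.9(b); the number of candidates doubles with each additional attribute value.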
Ordinal Attributes Ordinal attributes can also produce binary or multiway
splits. Ordinal attribute values can be grouped as long as the grouping does
not violate the order property of the attribute values. Figure 4.10 illustrates
various ways of splitting training records based on the Shirt Size attribute.
The groupings shown in Figures 4.10(a) and (b) preserve the order among
the attribute values, whereas the grouping shown in Figure 4.10(c) violates
this property because it combines the attribute values Small and Large into
the same partition while Medium and Extra Large are combined into another
partition.
[Figure 4.10: three binary groupings of Shirt Size: (a) {Small, Medium} versus {Large, Extra Large}; (b) {Small} versus {Medium, Large, Extra Large}; (c) {Small, Large} versus {Medium, Extra Large}.]
Figure 4.10. Different ways of grouping ordinal attribute values.
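One way to make the order property concrete is to require that every group be a contiguous run of the ordered values. The following sketch checks that condition and lists the order-preserving binary splits; the helper names are hypothetical and the contiguity test is an interpretation of the order property, offered as an assumption.

```python
def preserves_order(groups, ordered_values):
    """True if every group is a contiguous run of the ordered attribute values."""
    rank = {value: i for i, value in enumerate(ordered_values)}
    for group in groups:
        ranks = sorted(rank[value] for value in group)
        if ranks != list(range(ranks[0], ranks[0] + len(ranks))):
            return False
    return True

def ordered_binary_splits(ordered_values):
    """All binary splits that cut the ordered values at a single position."""
    return [(ordered_values[:i], ordered_values[i:])
            for i in range(1, len(ordered_values))]

sizes = ["Small", "Medium", "Large", "Extra Large"]

print(preserves_order([{"Small", "Medium"}, {"Large", "Extra Large"}], sizes))  # Figure 4.10(a)
print(preserves_order([{"Small"}, {"Medium", "Large", "Extra Large"}], sizes))  # Figure 4.10(b)
print(preserves_order([{"Small", "Large"}, {"Medium", "Extra Large"}], sizes))  # Figure 4.10(c)
print(ordered_binary_splits(sizes))
```

Under this interpretation an ordinal attribute with k values admits only k − 1 binary splits, far fewer than the 2^(k−1) − 1 available for a nominal attribute.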
Continuous Attributes For continuous attributes, the test condition can
be expressed as a comparison test (A < v) or (A ≥ v) with binary outcomes, or
a range query with outcomes of the form v_i ≤ A < v_(i+1), for i = 1, ..., k. The
difference between these approaches is shown in Figure 4.11. For the binary
case, the decision tree algorithm must consider all possible split positions v,
and it selects the one that produces the best partition. For the multiway
split, the algorithm must consider all possible ranges of continuous values.
One approach is to apply the discretization strategies described in Section
2.3.6 on page 57. After discretization, a new ordinal value will be assigned to
each discretized interval. Adjacent intervals can also be aggregated into wider
ranges as long as the order property is preserved.
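[Figure 4.11: (a) a binary split, Annual Income > 80K, with outcomes Yes and No; (b) a multiway split of Annual Income into the ranges < 10K, {10K, 25K}, {25K, 50K}, {50K, 80K}, and > 80K.]
Figure 4.11. Test condition for continuous attributes.
[Figure 4.12: class counts in the child nodes for three candidate splits.
(a) Gender: Male (C0: 6, C1: 4); Female (C0: 4, C1: 6).
(b) Car Type: Family (C0: 1, C1: 3); Sports (C0: 8, C1: 0); Luxury (C0: 1, C1: 7).
(c) Customer ID: one child node per value v1, ..., v20, each holding a single record; v1 through v10 hold (C0: 1, C1: 0) and v11 through v20 hold (C0: 0, C1: 1).]
Figure 4.12. Multiway versus binary splits.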
4.3.4 Measures for Selecting the Best Split
There are many measures that can be used to determine the best way to split
the records. These measures are defined in terms of the class distribution of
the records before and after splitting.
Let p(i|t) denote the fraction of records belonging to class i at a given node
t. We sometimes omit the reference to node t and express the fraction as p_i.
In a two-class problem, the class distribution at any node can be written as
(p_0, p_1), where p_1 = 1 − p_0. To illustrate, consider the test conditions shown
in Figure 4.12. The class distribution before splitting is (0.5, 0.5) because
there are an equal number of records from each class. If we split the data
using the Gender attribute, then the class distributions of the child nodes are
(0.6, 0.4) and (0.4, 0.6), respectively. Although the classes are no longer evenly
distributed, the child nodes still contain records from both classes. Splitting
on the second attribute, Car Type, will result in purer partitions.
The measures developed for selecting the best split are often based on the
degree of impurity of the child nodes. The smaller the degree of impurity, the
more skewed the class distribution. For example, a node with class
distribution (0, 1) has zero impurity, whereas a node with uniform class distribution
(0.5, 0.5) has the highest impurity. Examples of impurity measures include
\[ \text{Entropy}(t) = -\sum_{i=0}^{c-1} p(i|t) \log_2 p(i|t), \tag{4.3} \]
\[ \text{Gini}(t) = 1 - \sum_{i=0}^{c-1} [p(i|t)]^2, \tag{4.4} \]
\[ \text{Classification error}(t) = 1 - \max_i [p(i|t)], \tag{4.5} \]
where c is the number of classes and 0 log2 0 = 0 in entropy calculations.
[Figure 4.13: entropy, Gini index, and misclassification error plotted against p for p between 0 and 1, with impurity values on the vertical axis ranging from 0 to 1.]
Figure 4.13. Comparison among the impurity measures for binary classification problems.
Figure 4.13 compares the values of the impurity measures for binary
classification problems. Here p refers to the fraction of records that belong to one of the
two classes. Observe that all three measures attain their maximum value when
the class distribution is uniform (i.e., when p = 0.5). The minimum values for
the measures are attained when all the records belong to the same class (i.e.,
when p equals 0 or 1). We next provide several examples of computing the
different impurity measures.
Node N1 (Class=0: 0 records; Class=1: 6 records)
\[ \text{Gini} = 1 - (0/6)^2 - (6/6)^2 = 0 \]
\[ \text{Entropy} = -(0/6)\log_2(0/6) - (6/6)\log_2(6/6) = 0 \]
\[ \text{Error} = 1 - \max[0/6, 6/6] = 0 \]
Node N2 (Class=0: 1 record; Class=1: 5 records)
\[ \text{Gini} = 1 - (1/6)^2 - (5/6)^2 = 0.278 \]
\[ \text{Entropy} = -(1/6)\log_2(1/6) - (5/6)\log_2(5/6) = 0.650 \]
\[ \text{Error} = 1 - \max[1/6, 5/6] = 0.167 \]
Node N3 (Class=0: 3 records; Class=1: 3 records)
\[ \text{Gini} = 1 - (3/6)^2 - (3/6)^2 = 0.5 \]
\[ \text{Entropy} = -(3/6)\log_2(3/6) - (3/6)\log_2(3/6) = 1 \]
\[ \text{Error} = 1 - \max[3/6, 3/6] = 0.5 \]
The preceding examples, along with Figure 4.13, illustrate the consistency
among different impurity measures. Based on these calculations, node N1 has
the lowest impurity value, followed by N2 and N3. Despite their consistency,
the attribute chosen as the test condition may vary depending on the choice
of impurity measure, as will be shown in Exercise 3 on page 198.
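The three impurity measures of Equations 4.3 to 4.5 are short enough to write directly from their definitions. The sketch below takes the class counts at a node and reproduces the calculations for nodes N1, N2, and N3; the function names are chosen for this illustration only.

```python
from math import log2

def class_fractions(counts):
    """Convert class counts at a node into the fractions p(i|t)."""
    total = sum(counts)
    return [c / total for c in counts]

def entropy(counts):
    """Equation 4.3; terms with p = 0 are skipped, matching 0 log2 0 = 0."""
    return sum(-p * log2(p) for p in class_fractions(counts) if p > 0)

def gini(counts):
    """Equation 4.4."""
    return 1 - sum(p * p for p in class_fractions(counts))

def classification_error(counts):
    """Equation 4.5."""
    return 1 - max(class_fractions(counts))

# Class counts (Class=0, Class=1) of the three example nodes.
nodes = {"N1": [0, 6], "N2": [1, 5], "N3": [3, 3]}
for name, counts in nodes.items():
    print(name,
          round(gini(counts), 3),
          round(entropy(counts), 3),
          round(classification_error(counts), 3))
```

All three functions agree on the ordering N1, N2, N3 from least to most impure, which is the consistency the text refers to.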
To determine how well a test condition performs, we need to compare the
degree of impurity of the parent node (before splitting) with the degree of
impurity of the child nodes (after splitting). The larger their difference, the
better the test condition. The gain, ∆, is a criterion that can be used to
determine the goodness of a split:
\[ \Delta = I(\text{parent}) - \sum_{j=1}^{k} \frac{N(v_j)}{N} I(v_j), \tag{4.6} \]
where I(·) is the impurity measure of a given node, N is the total number of
records at the parent node, k is the number of attribute values, and N(v_j)
is the number of records associated with the child node v_j. Decision tree
induction algorithms often choose a test condition that maximizes the gain
∆. Since I(parent) is the same for all test conditions, maximizing the gain is
equivalent to minimizing the weighted average impurity measures of the child
nodes. Finally, when entropy is used as the impurity measure in Equation 4.6,
the difference in entropy is known as the information gain, ∆_info.
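As an illustration of Equation 4.6, the following sketch computes the information gain of the Gender and Car Type splits, assuming the class counts shown in Figure 4.12 (a parent node with 10 records of each class). The `gain` function accepts any impurity function, so entropy could be swapped for another measure; all names here are hypothetical.

```python
from math import log2

def entropy(counts):
    """Entropy of a node from its class counts (0 log2 0 treated as 0)."""
    total = sum(counts)
    return sum(-(c / total) * log2(c / total) for c in counts if c > 0)

def gain(parent_counts, children_counts, impurity):
    """Equation 4.6: parent impurity minus weighted average child impurity."""
    n = sum(parent_counts)
    weighted = sum(sum(child) / n * impurity(child) for child in children_counts)
    return impurity(parent_counts) - weighted

# Class counts (C0, C1), assumed from Figure 4.12.
parent = [10, 10]
gender = [[6, 4], [4, 6]]             # Male, Female
car_type = [[1, 3], [8, 0], [1, 7]]   # Family, Sports, Luxury

info_gain_gender = gain(parent, gender, entropy)
info_gain_car_type = gain(parent, car_type, entropy)
print(round(info_gain_gender, 3), round(info_gain_car_type, 3))
```

The Car Type split yields a much larger information gain than the Gender split, in line with the earlier observation that it produces purer partitions.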
Splitting of Binary Attributes
Consider the diagram shown in Figure 4.14. Suppose there are two ways to
split the data into smaller subsets. Before splitting, the Gini index is 0.5 since
there are an equal number of records from both classes. If attribute A is chosen
to split the data, the Gini index for node N1 is 0.4898, and for node N2, it
is 0.480. The weighted average of the Gini index for the descendent nodes is
(7/12) × 0.4898 + (5/12) × 0.480 = 0.486. Similarly, for attribute B, whose child
nodes N1 and N2 have class counts (1, 4) and (5, 2) in Figure 4.14, the Gini index
is 0.320 for N1 and 0.408 for N2, so the weighted average is
(5/12) × 0.320 + (7/12) × 0.408 = 0.371. Since the subsets
for attribute B have a smaller Gini index, it is preferred over attribute A.
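Written out in full for attribute A, and assuming the child-node class counts (C0, C1) are (4, 3) for N1 and (2, 3) for N2 as read from Figure 4.14, the computation proceeds as follows. The last line applies Equation 4.6 with the Gini index as the impurity measure.

```latex
\begin{align*}
\text{Gini}(N_1) &= 1 - (4/7)^2 - (3/7)^2 = 24/49 \approx 0.4898, \\
\text{Gini}(N_2) &= 1 - (2/5)^2 - (3/5)^2 = 12/25 = 0.480, \\
\text{Gini}_A &= \tfrac{7}{12}\,(0.4898) + \tfrac{5}{12}\,(0.480) \approx 0.486, \\
\Delta_A &= 0.500 - 0.486 = 0.014.
\end{align*}
```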
[Figure 4.14: class counts for two candidate binary splits of a parent node with C0 = 6 and C1 = 6 (Gini = 0.500). Attribute A: N1 (C0 = 4, C1 = 3), N2 (C0 = 2, C1 = 3). Attribute B: N1 (C0 = 1, C1 = 4), N2 (C0 = 5, C1 = 2).]
Figure 4.14. Splitting binary attributes.
Splitting of Nominal Attributes
As previously noted, a nominal attribute can produce either binary or multiway
splits, as shown in Figure 4.15. The computation of the Gini index for a
binary split is similar to that shown for determining binary attributes. For the
first binary grouping of the Car Type attribute, the Gini index of {Sports,
Luxury} is 0.4922 and the Gini index of {Family} is 0.3750. The weighted
average Gini index for the grouping is equal to
16/20 × 0.4922 + 4/20 × 0.3750 = 0.468.
Similarly, for the second binary grouping of {Sports} and {Family, Luxury},
the weighted average Gini index is 0.167. The second grouping has a lower
Gini index because its corresponding subsets are much purer.
[Figure 4.15: class counts and Gini index for splits on Car Type.
(a) Binary split: {Sports, Luxury} (C0 = 9, C1 = 7) versus {Family} (C0 = 1, C1 = 3), Gini = 0.468; {Sports} (C0 = 8, C1 = 0) versus {Family, Luxury} (C0 = 2, C1 = 10), Gini = 0.167.
(b) Multiway split: Family (C0 = 1, C1 = 3), Sports (C0 = 8, C1 = 0), Luxury (C0 = 1, C1 = 7), Gini = 0.163.]
Figure 4.15. Splitting nominal attributes.
Sorted values of Annual Income (in thousands) with their class labels:
Annual Income   60   70   75   85   90   95   100  120  125  220
Class           No   No   No   Yes  Yes  Yes  No   No   No   No
Candidate split positions, class counts on each side, and weighted Gini index:
Split position   Yes (<=)  Yes (>)  No (<=)  No (>)  Gini
55               0         3        0        7       0.420
65               0         3        1        6       0.400
72               0         3        2        5       0.375
80               0         3        3        4       0.343
87               1         2        3        4       0.417
92               2         1        3        4       0.400
97               3         0        3        4       0.300
110              3         0        4        3       0.343
122              3         0        5        2       0.375
172              3         0        6        1       0.400
230              3         0        7        0       0.420
Figure 4.16. Splitting continuous attributes.
For the multiway split, the Gini index is computed for every attribute value.
Since Gini({Family}) = 0.375, Gini({Sports}) = 0, and Gini({Luxury}) =
0.219, the overall Gini index for the multiway split is equal to
4/20 × 0.375 + 8/20 × 0 + 8/20 × 0.219 = 0.163.
The multiway split has a smaller Gini index compared to both two-way splits.
This result is not surprising because the two-way split actually merges some
of the outcomes of a multiway split, and thus, results in less pure subsets.
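The Car Type comparison can be reproduced from the per-value class counts alone, since the counts of a merged group are just the sums of its members' counts. The sketch below assumes the counts shown in Figure 4.15; the helper names `merge` and `weighted_gini` are hypothetical.

```python
def gini(counts):
    """Gini index of a node from its class counts."""
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

def weighted_gini(groups):
    """Weighted average Gini index over the child nodes of a split."""
    n = sum(sum(group) for group in groups)
    return sum(sum(group) / n * gini(group) for group in groups)

# Class counts (C0, C1) per Car Type value, assumed from Figure 4.15.
counts = {"Family": (1, 3), "Sports": (8, 0), "Luxury": (1, 7)}

def merge(*values):
    """Class counts of the group formed by merging several attribute values."""
    return tuple(sum(pair) for pair in zip(*(counts[v] for v in values)))

splits = {
    "{Sports, Luxury} vs {Family}": [merge("Sports", "Luxury"), merge("Family")],
    "{Sports} vs {Family, Luxury}": [merge("Sports"), merge("Family", "Luxury")],
    "multiway": [merge("Family"), merge("Sports"), merge("Luxury")],
}
results = {name: weighted_gini(groups) for name, groups in splits.items()}
for name, value in results.items():
    print(name, round(value, 4))
```

Run as written, the multiway split has the lowest weighted Gini index, followed by the second binary grouping and then the first.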
Splitting of Continuous Attributes
Consider the example shown in Figure 4.16, in which the test condition Annual
Income ≤ v is used to split the training records for the loan default
classification problem. A brute-force method for finding v is to consider every value of
the attribute in the N records as a candidate split position. For each candidate
v, the data set is scanned once to count the number of records with annual
income less than or greater than v. We then compute the Gini index for each
candidate and choose the one that gives the lowest value. This approach is
computationally expensive because it requires O(N) operations to compute
the Gini index at each candidate split position. Since there are N candidates,
the overall complexity of this task is O(N^2). To reduce the complexity, the
training records are sorted based on their annual income, a computation that
requires O(N log N) time. Candidate split positions are identified by taking
the midpoints between two adjacent sorted values: 55, 65, 72, and so on.
However, unlike the brute-force approach, we do not have to examine all N records
when evaluating the Gini index of a candidate split position.
For the first candidate, v = 55, none of the records has annual income less
than $55K. As a result, the Gini index for the descendent node with Annual
Income < $55K is zero. On the other hand, the number of records with annual
income greater than or equal to $55K is 3 (for class Yes) and 7 (for class No),
respectively. Thus, the Gini index for this node is 0.420. The overall Gini
index for this candidate split position is equal to 0 × 0 + 1 × 0.420 = 0.420.
For the second candidate, v = 65, we can determine its class distribution
by updating the distribution of the previous candidate. More specifically, the
new distribution is obtained by examining the class label of the record with
the lowest annual income (i.e., $60K). Since the class label for this record is
No, the count for class No is increased from 0 to 1 (for Annual Income ≤ $65K)
and is decreased from 7 to 6 (for Annual Income > $65K). The distribution
for class Yes remains unchanged. The new weighted-average Gini index for
this candidate split position is 0.400.
This procedure is repeated until the Gini index values for all candidates are
computed, as shown in Figure 4.16. The best split position corresponds to the
one that produces the smallest Gini index, i.e., v = 97. This procedure is less
expensive because it requires a constant amount of time to update the class
distribution at each candidate split position. It can be further optimized by
considering only candidate split positions located between two adjacent records
with different class labels. For example, because the first three sorted records
(with annual incomes $60K, $70K, and $75K) have identical class labels, the
best split position should not reside between $60K and $75K. Therefore, the
candidate split positions at v = $55K, $65K, $72K, $87K, $92K, $110K, $122K,
$172K, and $230K are ignored because they are located between two adjacent
records with the same class labels. This approach allows us to reduce the
number of candidate split positions from 11 to 2.
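The sorted scan described above can be sketched as follows. This is a minimal illustration, not the book's implementation: the function name `best_split` and the `skip_same_class` flag are assumptions, the data are the sorted values from Figure 4.16, and the code uses exact midpoints (for example 97.5) where the figure shows rounded positions such as 97. The two end positions (55 and 230) are omitted because they leave one side empty.

```python
def gini(yes, no):
    """Gini index of a node holding `yes` and `no` records of the two classes."""
    n = yes + no
    if n == 0:
        return 0.0
    return 1.0 - (yes / n) ** 2 - (no / n) ** 2


def best_split(values, labels, skip_same_class=True):
    """Return (threshold, weighted Gini) of the best test `value <= threshold`."""
    records = sorted(zip(values, labels))            # O(N log N) sort
    n = len(records)
    total_yes = sum(1 for _, label in records if label == "Yes")
    total_no = n - total_yes
    left_yes = left_no = 0                           # counts for value <= threshold
    best_threshold, best_gini = None, float("inf")
    for i in range(n - 1):
        value, label = records[i]
        # Constant-time update: this record moves from the right side to the left.
        if label == "Yes":
            left_yes += 1
        else:
            left_no += 1
        next_value, next_label = records[i + 1]
        if value == next_value:
            continue                                 # no midpoint between equal values
        if skip_same_class and label == next_label:
            continue                                 # neighbors share a class label
        left_n = i + 1
        score = (left_n / n) * gini(left_yes, left_no) + \
                ((n - left_n) / n) * gini(total_yes - left_yes, total_no - left_no)
        if score < best_gini:
            best_threshold, best_gini = (value + next_value) / 2, score
    return best_threshold, best_gini


# Sorted Annual Income values (in $K) and class labels from Figure 4.16.
incomes = [60, 70, 75, 85, 90, 95, 100, 120, 125, 220]
defaulted = ["No", "No", "No", "Yes", "Yes", "Yes", "No", "No", "No", "No"]

threshold, score = best_split(incomes, defaulted)
print(threshold, round(score, 3))
```

With `skip_same_class=True` only two positions are evaluated for this data, matching the reduction from 11 to 2 candidates. The skip rule assumes distinct attribute values; with many ties it would need more care.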
Gain Ratio
Impurity measures such as entropy and Gini index tend to favor attributes that
have a large number of distinct values. Figure 4.12 shows three alternative
test conditions for partitioning the data set given in Exercise 2 on page 198.
Comparing the first test condition, Gender, with the second, Car Type, it
is easy to see that Car Type seems to provide a better way of splitting the
data since it produces purer descendent nodes. However, if we compare both
conditions with Customer ID, the latter appears to produce purer partitions.
Yet Customer ID is not a predictive attribute because its value is unique for
each record. Even in a less extreme situation, a test condition that results in a
large number of outcomes may not be desirable because the number of records
associated with each partition is too small to enable us to make any reliable
predictions.
There are two strategies for overcoming this problem. The first strategy is
to restrict the test conditions to binary splits only. This strategy is employed
by decision tree algorithms such as CART. Another strategy is to modify the
splitting criterion to take into account the number of outcomes produced by
the attribute test condition. For example, in the C4.5 decision tree algorithm,
a splitting criterion known as gain ratio is used to determine the goodness
of a split. This criterion is defined as follows:
\[
\text{Gain ratio} = \frac{\Delta_{\text{info}}}{\text{Split Info}}. \tag{4.7}
\]
Here, $\text{Split Info} = -\sum_{i=1}^{k} P(v_i) \log_2 P(v_i)$ and $k$ is the
total number of splits. For example, if each attribute value has the same
number of records, then $\forall i : P(v_i) = 1/k$ and the split information
would be equal to $\log_2 k$. This example suggests that if an attribute
produces a large number of splits, its split information will also be large,
which in turn reduces its gain ratio.
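To see the effect of Equation 4.7, the sketch below compares Car Type with Customer ID on the 20-record data set, using entropy for the information gain. The Car Type class counts come from Figure 4.15; representing Customer ID as 20 single-record partitions follows from its value being unique for each record. The function names are illustrative, and the code assumes a split with at least two non-empty partitions (otherwise the split information is zero).

```python
from math import log2


def entropy(counts):
    """Entropy (base 2) of a distribution given as raw counts."""
    n = sum(counts)
    return -sum(c / n * log2(c / n) for c in counts if c > 0)


def gain_ratio(parent, partitions):
    """Return (information gain, split info, gain ratio) for a split."""
    n = sum(parent)
    sizes = [sum(p) for p in partitions]
    children = sum(s / n * entropy(p) for s, p in zip(sizes, partitions))
    gain = entropy(parent) - children
    split_info = entropy(sizes)          # -sum_i P(v_i) log2 P(v_i)
    return gain, split_info, gain / split_info


parent = [10, 10]                                   # [C0, C1] over all 20 records
car_type = [[1, 3], [8, 0], [1, 7]]                 # Family, Sports, Luxury
customer_id = [[1, 0]] * 10 + [[0, 1]] * 10         # one record per partition

car_gain, car_info, car_ratio = gain_ratio(parent, car_type)
id_gain, id_info, id_ratio = gain_ratio(parent, customer_id)

print(f"Car Type:    gain={car_gain:.3f} split_info={car_info:.3f} ratio={car_ratio:.3f}")
print(f"Customer ID: gain={id_gain:.3f} split_info={id_info:.3f} ratio={id_ratio:.3f}")
```

Customer ID has the larger raw gain, but its split information equals log2 20, so its gain ratio falls below that of Car Type.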
4.3.5 Algorithm for Decision Tree Induction
A skeleton decision tree induction algorithm called TreeGrowth is shown
in Algorithm 4.1. The input to this algorithm consists of the training records
E and the attribute set F. The algorithm works by recursively selecting the
best attribute to split the data (Step 7) and expanding the leaf nodes of the
tree (Steps 11 and 12) until the stopping criterion is met (Step 1).

Algorithm 4.1 A skeleton decision tree induction algorithm.
TreeGrowth(E, F)
 1: if stopping_cond(E, F) = true then
 2:    leaf = createNode().
 3:    leaf.label = Classify(E).
 4:    return leaf.
 5: else
 6:    root = createNode().
 7:    root.test_cond = find_best_split(E, F).
 8:    let V = {v | v is a possible outcome of root.test_cond}.
 9:    for each v ∈ V do
10:       E_v = {e | root.test_cond(e) = v and e ∈ E}.
11:       child = TreeGrowth(E_v, F).
12:       add child as descendent of root and label the edge (root → child) as v.
13:    end for
14: end if
15: return root.

The details of this algorithm are explained below:
1. The createNode() function extends the decision tree by creating a new
node. A node in the decision tree has either a test condition, denoted as
node.test_cond, or a class label, denoted as node.label.
2. The find_best_split() function determines which attribute should be
selected as the test condition for splitting the training records. As
previously noted, the choice of test condition depends on which impurity
measure is used to determine the goodness of a split. Some widely used
measures include entropy, the Gini index, and the χ2 statistic.
3. The Classify() function determines the class label to be assigned to a
leaf node. For each leaf node t, let p(i|t) denote the fraction of training
records from class i associated with the node t. In most cases, the leaf
node is assigned to the class that has the majority number of training
records:
\[
\text{leaf.label} = \operatorname*{argmax}_{i} \; p(i \mid t), \tag{4.8}
\]
where the argmax operator returns the argument i that maximizes the
expression p(i|t). Besides providing the information needed to determine
the class label of a leaf node, the fraction p(i|t) can also be used to
estimate the probability that a record assigned to the leaf node t belongs
to class i. Sections 5.7.2 and 5.7.3 describe how such probability
estimates can be used to determine the performance of a decision tree under
different cost functions.
4. The stopping_cond() function is used to terminate the tree-growing
process by testing whether all the records have either the same class label
or the same attribute values. Another way to terminate the recursive
function is to test whether the number of records has fallen below some
minimum threshold.
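The skeleton in Algorithm 4.1 can be sketched in Python as follows. This is one possible instantiation under stated assumptions, not the book's code: attributes are nominal, every split is multiway, the Gini index is the impurity measure, and nodes are plain dictionaries. The names `tree_growth` and `predict` are illustrative. The training records are those of Table 4.5, and the induced tree need not match Figure 4.26, since the split criterion and tie-breaking may differ.

```python
from collections import Counter


def gini(labels):
    """Gini index of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def stopping_cond(records, attributes):
    """Stop when all records share a class label or share all attribute values."""
    labels = {label for _, label in records}
    rows = {tuple(x[a] for a in attributes) for x, _ in records}
    return len(labels) == 1 or len(rows) == 1


def classify(records):
    """Majority class of the records at a leaf (Equation 4.8)."""
    return Counter(label for _, label in records).most_common(1)[0][0]


def find_best_split(records, attributes):
    """Choose the attribute whose multiway split has the lowest weighted Gini index."""
    def weighted_gini(attr):
        groups = {}
        for x, label in records:
            groups.setdefault(x[attr], []).append(label)
        return sum(len(g) / len(records) * gini(g) for g in groups.values())

    # Only attributes that still take more than one value can split the records.
    usable = [a for a in attributes if len({x[a] for x, _ in records}) > 1]
    return min(usable, key=weighted_gini)


def tree_growth(records, attributes):
    """Recursive TreeGrowth: returns a leaf {'label': ...} or an internal node."""
    if stopping_cond(records, attributes):
        return {"label": classify(records)}
    attr = find_best_split(records, attributes)
    node = {"test_cond": attr, "children": {}}
    for value in sorted({x[attr] for x, _ in records}):
        subset = [(x, y) for x, y in records if x[attr] == value]
        node["children"][value] = tree_growth(subset, attributes)
    return node


def predict(node, x):
    """Follow test conditions from the root until a leaf is reached."""
    while "test_cond" in node:
        node = node["children"][x[node["test_cond"]]]
    return node["label"]


ATTRS = ["body_temp", "gives_birth", "four_legged", "hibernates"]
rows = [  # Table 4.5: attribute values in ATTRS order, then the class label
    (("cold", "yes", "no", "no"), "no"),    # guppy
    (("cold", "no", "yes", "yes"), "no"),   # salamander
    (("warm", "no", "no", "no"), "no"),     # eagle
    (("warm", "no", "no", "yes"), "no"),    # poorwill
    (("warm", "no", "yes", "yes"), "yes"),  # platypus
]
records = [(dict(zip(ATTRS, values)), label) for values, label in rows]
tree = tree_growth(records, ATTRS)
print(tree)
```

Restricting the candidates to attributes with more than one observed value guarantees that every child receives strictly fewer records, so the recursion terminates.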
After building the decision tree, a tree-pruning step can be performed
to reduce the size of the decision tree. Decision trees that are too large are
susceptible to a phenomenon known as overfitting. Pruning helps by
trimming the branches of the initial tree in a way that improves the
generalization capability of the decision tree. The issues of overfitting and
tree pruning are discussed in more detail in Section 4.4.
(a) Example of a Web server log.

Request 1
  IP Address:          160.11.11.111
  Timestamp:           08/Aug/2004 10:15:21
  Request Method:      GET
  Requested Web Page:  http://www.cs.umn.edu/~kumar
  Protocol:            HTTP/1.1
  Status:              200
  Number of Bytes:     6424
  Referrer:            -
  User Agent:          Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)

Request 2
  IP Address:          160.11.11.111
  Timestamp:           08/Aug/2004 10:15:34
  Request Method:      GET
  Requested Web Page:  http://www.cs.umn.edu/~kumar/MINDS
  Protocol:            HTTP/1.1
  Status:              200
  Number of Bytes:     41378
  Referrer:            http://www.cs.umn.edu/~kumar
  User Agent:          Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)

Request 3
  IP Address:          160.11.11.111
  Timestamp:           08/Aug/2004 10:15:41
  Request Method:      GET
  Requested Web Page:  http://www.cs.umn.edu/~kumar/MINDS/MINDS_papers.htm
  Protocol:            HTTP/1.1
  Status:              200
  Number of Bytes:     1018516
  Referrer:            http://www.cs.umn.edu/~kumar/MINDS
  User Agent:          Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)

Request 4
  IP Address:          160.11.11.111
  Timestamp:           08/Aug/2004 10:16:11
  Request Method:      GET
  Requested Web Page:  http://www.cs.umn.edu/~kumar/papers/papers.html
  Protocol:            HTTP/1.1
  Status:              200
  Number of Bytes:     7463
  Referrer:            http://www.cs.umn.edu/~kumar
  User Agent:          Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)

Request 5
  IP Address:          35.9.2.22
  Timestamp:           08/Aug/2004 10:16:15
  Request Method:      GET
  Requested Web Page:  http://www.cs.umn.edu/~steinbac
  Protocol:            HTTP/1.0
  Status:              200
  Number of Bytes:     3149
  Referrer:            -
  User Agent:          Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.7)
                       Gecko/20040616

(b) Graph of a Web session. Nodes: http://www.cs.umn.edu/~kumar, MINDS,
papers/papers.html, and MINDS/MINDS_papers.htm.

(c) Derived attributes for Web robot detection.

Attribute Name   Description
totalPages       Total number of pages retrieved in a Web session
ImagePages       Total number of image pages retrieved in a Web session
TotalTime        Total amount of time spent by Web site visitor
RepeatedAccess   The same page requested more than once in a Web session
ErrorRequest     Errors in requesting for Web pages
GET              Percentage of requests made using GET method
POST             Percentage of requests made using POST method
HEAD             Percentage of requests made using HEAD method
Breadth          Breadth of Web traversal
Depth            Depth of Web traversal
MultiIP          Session with multiple IP addresses
MultiAgent       Session with multiple user agents

Figure 4.17. Input data for Web robot detection.
4.3.6 An Example: Web Robot Detection
Web usage mining is the task of applying data mining techniques to extract
useful patterns from Web access logs. These patterns can reveal interesting
characteristics of site visitors; e.g., people who repeatedly visit a Web site and
view the same product description page are more likely to buy the product if
certain incentives such as rebates or free shipping are offered.
In Web usage mining, it is important to distinguish accesses made by human
users from those due to Web robots. A Web robot (also known as a Web
crawler) is a software program that automatically locates and retrieves
information from the Internet by following the hyperlinks embedded in Web pages.
These programs are deployed by search engine portals to gather the documents
necessary for indexing the Web. Web robot accesses must be discarded before
applying Web mining techniques to analyze human browsing behavior.
This section describes how a decision tree classifier can be used to
distinguish between accesses by human users and those by Web robots. The input
data was obtained from a Web server log, a sample of which is shown in Figure
4.17(a). Each line corresponds to a single page request made by a Web client
(a user or a Web robot). The fields recorded in the Web log include the IP
address of the client, timestamp of the request, Web address of the requested
document, size of the document, and the client’s identity (via the user agent
field). A Web session is a sequence of requests made by a client during a single
visit to a Web site. Each Web session can be modeled as a directed graph, in
which the nodes correspond to Web pages and the edges correspond to
hyperlinks connecting one Web page to another. Figure 4.17(b) shows a graphical
representation of the first Web session given in the Web server log.
To classify the Web sessions, features are constructed to describe the
characteristics of each session. Figure 4.17(c) shows some of the features used
for the Web robot detection task. Notable features include the depth and
breadth of the traversal. Depth determines the maximum distance of a
requested page, where distance is measured in terms of the number of
hyperlinks away from the entry point of the Web site. For example,
the home page http://www.cs.umn.edu/~kumar is assumed to be at depth
0, whereas http://www.cs.umn.edu/~kumar/MINDS/MINDS_papers.htm is
located at depth 2. Based on the Web graph shown in Figure 4.17(b), the depth
attribute for the first session is equal to two. The breadth attribute measures
the width of the corresponding Web graph. For example, the breadth of the
Web session shown in Figure 4.17(b) is equal to two.
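As a rough illustration of these two features, the sketch below derives them from a session graph with a breadth-first search. Two points are assumptions rather than statements from the text: the edges are taken to run from each referrer to the requested page, and breadth is interpreted as the largest number of pages found at the same distance from the entry point. The function name `depth_and_breadth` is hypothetical.

```python
from collections import deque


def depth_and_breadth(edges, entry):
    """BFS from the entry page: depth = largest distance, breadth = widest level."""
    graph = {}
    for src, dst in edges:
        graph.setdefault(src, []).append(dst)
    distance = {entry: 0}
    queue = deque([entry])
    while queue:
        page = queue.popleft()
        for nxt in graph.get(page, []):
            if nxt not in distance:          # first visit gives the shortest distance
                distance[nxt] = distance[page] + 1
                queue.append(nxt)
    level_sizes = {}
    for d in distance.values():
        level_sizes[d] = level_sizes.get(d, 0) + 1
    return max(distance.values()), max(level_sizes.values())


home = "http://www.cs.umn.edu/~kumar"
session_edges = [  # assumed (referrer, requested page) pairs for the first session
    (home, home + "/MINDS"),
    (home + "/MINDS", home + "/MINDS/MINDS_papers.htm"),
    (home, home + "/papers/papers.html"),
]
depth, breadth = depth_and_breadth(session_edges, home)
print(depth, breadth)
```

Under these assumptions the first session has depth 2 and breadth 2, which agrees with the values quoted above.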
The data set for classification contains 2916 records, with equal numbers
of sessions due to Web robots (class 1) and human users (class 0). 10% of the
data were reserved for training while the remaining 90% were used for testing.
The induced decision tree model is shown in Figure 4.18. The tree has an
error rate equal to 3.8% on the training set and 5.3% on the test set.
The model suggests that Web robots can be distinguished from human
users in the following way:
1. Accesses by Web robots tend to be broad but shallow, whereas accesses
by human users tend to be more focused (narrow but deep).
2. Unlike human users, Web robots seldom retrieve the image pages
associated with a Web document.
3. Sessions due to Web robots tend to be long and contain a large number
of requested pages.
4. Web robots are more likely to make repeated requests for the same
document since the Web pages retrieved by human users are often cached
by the browser.

Decision Tree:
depth = 1:
    breadth > 7: class 1
    breadth <= 7:
        breadth <= 3:
            ImagePages > 0.375: class 0
            ImagePages <= 0.375:
                totalPages <= 6: class 1
                totalPages > 6:
                    breadth <= 1: class 1
                    breadth > 1: class 0
        breadth > 3:
            MultiIP = 0:
                ImagePages <= 0.1333: class 1
                ImagePages > 0.1333:
                    breadth <= 6: class 0
                    breadth > 6: class 1
            MultiIP = 1:
                TotalTime <= 361: class 0
                TotalTime > 361: class 1
depth > 1:
    MultiAgent = 0:
        depth > 2: class 0
        depth <= 2:
            MultiIP = 1: class 0
            MultiIP = 0:
                breadth <= 6: class 0
                breadth > 6:
                    RepeatedAccess <= 0.322: class 0
                    RepeatedAccess > 0.322: class 1
    MultiAgent = 1:
        totalPages <= 81: class 0
        totalPages > 81: class 1
Figure 4.18. Decision tree model for Web robot detection.
4.3.7 Characteristics of Decision Tree Induction
The following is a summary of the important characteristics of decision tree
induction algorithms.
1. Decision tree induction is a nonparametric approach for building
classification models. In other words, it does not require any prior assumptions
regarding the type of probability distributions satisfied by the class and
other attributes (unlike some of the techniques described in Chapter 5).
2. Finding an optimal decision tree is an NP-complete problem. Many
decision tree algorithms employ a heuristic-based approach to guide their
search in the vast hypothesis space. For example, the algorithm
presented in Section 4.3.5 uses a greedy, top-down, recursive partitioning
strategy for growing a decision tree.
3. Techniques developed for constructing decision trees are computationally
inexpensive, making it possible to quickly construct models even when
the training set size is very large. Furthermore, once a decision tree has
been built, classifying a test record is extremely fast, with a worst-case
complexity of O(w), where w is the maximum depth of the tree.
4. Decision trees, especially smaller-sized trees, are relatively easy to
interpret. The accuracies of the trees are also comparable to other
classification techniques for many simple data sets.
5. Decision trees provide an expressive representation for learning
discrete-valued functions. However, they do not generalize well to certain types
of Boolean problems. One notable example is the parity function, whose
value is 0 (1) when there is an odd (even) number of Boolean attributes
with the value True. Accurate modeling of such a function requires a full
decision tree with 2^d nodes, where d is the number of Boolean attributes
(see Exercise 1 on page 198).
6. Decision tree algorithms are quite robust to the presence of noise,
especially when methods for avoiding overfitting, as described in Section 4.4,
are employed.
7. The presence of redundant attributes does not adversely affect the
accuracy of decision trees. An attribute is redundant if it is strongly
correlated with another attribute in the data. One of the two redundant
attributes will not be used for splitting once the other attribute has been
chosen. However, if the data set contains many irrelevant attributes, i.e.,
attributes that are not useful for the classification task, then some of the
irrelevant attributes may be accidentally chosen during the tree-growing
process, which results in a decision tree that is larger than necessary.
Feature selection techniques can help to improve the accuracy of
decision trees by eliminating the irrelevant attributes during preprocessing.
We will investigate the issue of too many irrelevant attributes in Section
4.4.3.
8. Since most decision tree algorithms employ a top-down, recursive
partitioning approach, the number of records becomes smaller as we traverse
down the tree. At the leaf nodes, the number of records may be too
small to make a statistically significant decision about the class
representation of the nodes. This is known as the data fragmentation
problem. One possible solution is to disallow further splitting when the
number of records falls below a certain threshold.
9. A subtree can be replicated multiple times in a decision tree, as
illustrated in Figure 4.19. This makes the decision tree more complex than
necessary and perhaps more difficult to interpret. Such a situation can
arise from decision tree implementations that rely on a single attribute
test condition at each internal node. Since most of the decision tree
algorithms use a divide-and-conquer partitioning strategy, the same test
condition can be applied to different parts of the attribute space, thus
leading to the subtree replication problem.
Figure 4.19. Tree replication problem. The same subtree can appear at different branches.
(Tree with internal nodes P, Q, R, and S and leaves labeled 0 and 1; the
subtree that tests Q and then S appears twice.)
10. The test conditions described so far in this chapter involve using only a
single attribute at a time. As a consequence, the tree-growing procedure
can be viewed as the process of partitioning the attribute space into
disjoint regions until each region contains records of the same class (see
Figure 4.20). The border between two neighboring regions of different
classes is known as a decision boundary. Since the test condition
involves only a single attribute, the decision boundaries are rectilinear; i.e.,
parallel to the “coordinate axes.” This limits the expressiveness of the
decision tree representation for modeling complex relationships among
continuous attributes. Figure 4.21 illustrates a data set that cannot be
classified effectively by a decision tree algorithm that uses test conditions
involving only a single attribute at a time.

Figure 4.20. Example of a decision tree and its decision boundaries for a two-dimensional data set.
(Scatter plot over x and y, each ranging from 0 to 1, with a decision tree
that uses the test conditions x < 0.43, y < 0.47, and y < 0.33.)

Figure 4.21. Example of data set that cannot be partitioned optimally using test conditions involving
single attributes. (Scatter plot with both axes ranging from 0 to 1.)
An oblique decision tree can be used to overcome this limitation
because it allows test conditions that involve more than one attribute.
The data set given in Figure 4.21 can be easily represented by an oblique
decision tree containing a single node with test condition
x + y < 1.
Although such techniques are more expressive and can produce more
compact trees, finding the optimal test condition for a given node can
be computationally expensive.
Constructive induction provides another way to partition the data
into homogeneous, nonrectangular regions (see Section 2.3.5 on page 57).
This approach creates composite attributes representing an arithmetic
or logical combination of the existing attributes. The new attributes
provide a better discrimination of the classes and are added to the
data set prior to decision tree induction. Unlike the oblique decision tree
approach, constructive induction is less expensive because it identifies all
the relevant combinations of attributes once, prior to constructing the
decision tree. In contrast, an oblique decision tree must determine the
right attribute combination dynamically, every time an internal node is
expanded. However, constructive induction can introduce attribute
redundancy in the data since the new attribute is a combination of several
existing attributes.
11. Studies have shown that the choice of impurity measure has little effect
on the performance of decision tree induction algorithms. This is because
many impurity measures are quite consistent with each other, as shown
in Figure 4.13 on page 159. Indeed, the strategy used to prune the
tree has a greater impact on the final tree than the choice of impurity
measure.
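The parity example from item 5 can be made concrete with a small sketch. It enumerates all 2^d Boolean records for d = 4 (the value of d is an arbitrary choice for illustration) and measures how much a split on any single attribute reduces the Gini index. The `parity` function follows the convention stated in item 5.

```python
from itertools import product


def gini(labels):
    """Gini index for a list of 0/1 class labels."""
    n = len(labels)
    p = sum(labels) / n
    return 1.0 - p ** 2 - (1.0 - p) ** 2


def parity(bits):
    """0 for an odd number of True attributes, 1 for an even number."""
    return 1 if sum(bits) % 2 == 0 else 0


d = 4
records = [(bits, parity(bits)) for bits in product([False, True], repeat=d)]
parent = gini([y for _, y in records])

gains = []
for j in range(d):
    left = [y for bits, y in records if not bits[j]]
    right = [y for bits, y in records if bits[j]]
    weighted = (len(left) * gini(left) + len(right) * gini(right)) / len(records)
    gains.append(parent - weighted)

print(parent, gains)  # prints: 0.5 [0.0, 0.0, 0.0, 0.0]
```

For d of at least 2, every single-attribute split leaves both children with an even class mix, so a greedy criterion sees no gain at the root even though the full tree separates the classes perfectly.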
4.4 Model Overfitting
The errors committed by a classification model are generally divided into two
types: training errors and generalization errors. Training error, also
known as resubstitution error or apparent error, is the number of
misclassification errors committed on training records, whereas generalization
error is the expected error of the model on previously unseen records.
Recall from Section 4.2 that a good classification model must not only fit
the training data well, it must also accurately classify records it has never
seen before. In other words, a good model must have low training error as
well as low generalization error. This is important because a model that fits
the training data too well can have a poorer generalization error than a model
with a higher training error. Such a situation is known as model overfitting.

Figure 4.22. Example of a data set with binary classes. (Scatter plot of the
training set; axes x1 and x2, each ranging from 0 to 20.)
Overfitting Example in Two-Dimensional Data. For a more concrete
example of the overfitting problem, consider the two-dimensional data set
shown in Figure 4.22. The data set contains data points that belong to two
different classes, denoted as class o and class +, respectively. The data points
for the o class are generated from a mixture of three Gaussian distributions,
while a uniform distribution is used to generate the data points for the + class.
There are altogether 1200 points belonging to the o class and 1800 points
belonging to the + class. 30% of the points are chosen for training, while the
remaining 70% are used for testing. A decision tree classifier that uses the
Gini index as its impurity measure is then applied to the training set. To
investigate the effect of overfitting, different levels of pruning are applied to
the initial, fully-grown tree. Figure 4.23(b) shows the training and test error
rates of the decision tree.
Figure 4.23. Training and test error rates. (Plot of the training error and
test error rates, from 0 to 0.4, against the number of nodes, from 0 to 300.)
Notice that the training and test error rates of the model are large when the
size of the tree is very small. This situation is known as model underfitting.
Underfitting occurs because the model has yet to learn the true structure of
the data. As a result, it performs poorly on both the training and the test
sets. As the number of nodes in the decision tree increases, the tree will have
fewer training and test errors. However, once the tree becomes too large, its
test error rate begins to increase even though its training error rate continues
to decrease. This phenomenon is known as model overfitting.
To understand the overfitting phenomenon, note that the training error of
a model can be reduced by increasing the model complexity. For example, the
leaf nodes of the tree can be expanded until it perfectly fits the training data.
Although the training error for such a complex tree is zero, the test error can
be large because the tree may contain nodes that accidentally fit some of the
noise points in the training data. Such nodes can degrade the performance
of the tree because they do not generalize well to the test examples. Figure
4.24 shows the structure of two decision trees with different numbers of nodes.
The tree that contains the smaller number of nodes has a higher training error
rate, but a lower test error rate compared to the more complex tree.
Overfitting and underfitting are two pathologies that are related to the
model complexity. The remainder of this section examines some of the
potential causes of model overfitting.
Figure 4.24. Decision trees with different model complexities.
(a) Decision tree with 11 leaf nodes. (b) Decision tree with 24 leaf nodes.
4.4.1 Overfitting Due to Presence of Noise
Consider the training and test sets shown in Tables 4.3 and 4.4 for the mammal
classification problem. Two of the ten training records are mislabeled: bats
and whales are classified as non-mammals instead of mammals.

Table 4.3. An example training set for classifying mammals. Class labels with asterisk symbols
represent mislabeled records.

Name            Body Temperature  Gives Birth  Four-legged  Hibernates  Class Label
porcupine       warm-blooded      yes          yes          yes         yes
cat             warm-blooded      yes          yes          no          yes
bat             warm-blooded      yes          no           yes         no*
whale           warm-blooded      yes          no           no          no*
salamander      cold-blooded      no           yes          yes         no
komodo dragon   cold-blooded      no           yes          no          no
python          cold-blooded      no           no           yes         no
salmon          cold-blooded      no           no           no          no
eagle           warm-blooded      no           no           no          no
guppy           cold-blooded      yes          no           no          no

Table 4.4. An example test set for classifying mammals.

Name            Body Temperature  Gives Birth  Four-legged  Hibernates  Class Label
human           warm-blooded      yes          no           no          yes
pigeon          warm-blooded      no           no           no          no
elephant        warm-blooded      yes          yes          no          yes
leopard shark   cold-blooded      yes          no           no          no
turtle          cold-blooded      no           yes          no          no
penguin         cold-blooded      no           no           no          no
eel             cold-blooded      no           no           no          no
dolphin         warm-blooded      yes          no           no          yes
spiny anteater  warm-blooded      no           yes          yes         yes
gila monster    cold-blooded      no           yes          yes         no

(a) Model M1
Body Temperature = Warm-blooded:
    Gives Birth = Yes:
        Four-legged = Yes: Mammals
        Four-legged = No: Non-mammals
    Gives Birth = No: Non-mammals
Body Temperature = Cold-blooded: Non-mammals

(b) Model M2
Body Temperature = Warm-blooded:
    Gives Birth = Yes: Mammals
    Gives Birth = No: Non-mammals
Body Temperature = Cold-blooded: Non-mammals

Figure 4.25. Decision tree induced from the data set shown in Table 4.3.

A decision tree that perfectly fits the training data is shown in Figure
4.25(a). Although the training error for the tree is zero, its error rate on
the test set is 30%. Both humans and dolphins were misclassified as
non-mammals because their attribute values for Body Temperature, Gives Birth,
and Four-legged are identical to the mislabeled records in the training set.
Spiny anteaters, on the other hand, represent an exceptional case in which the
class label of a test record contradicts the class labels of other similar records
in the training set. Errors due to exceptional cases are often unavoidable and
establish the minimum error rate achievable by any classifier.
In contrast, the decision tree M2 shown in Figure 4.25(b) has a lower test
error rate (10%) even though its training error rate is somewhat higher (20%).
It is evident that the first decision tree, M1, has overfitted the training data
because there is a simpler model with lower error rate on the test set. The
Four-legged attribute test condition in model M1 is spurious because it fits
the mislabeled training records, which leads to the misclassification of records
in the test set.
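The error rates quoted for M1 and M2 can be reproduced with a short sketch. The records are transcribed from Tables 4.3 and 4.4 (with the mislabeled bat and whale kept as given), and the two functions `m1` and `m2` encode the trees of Figure 4.25 as read from the text. The function and variable names are illustrative.

```python
# Each record: (name, body temperature, gives birth, four-legged, class label)
TRAIN = [
    ("porcupine", "warm", "yes", "yes", "yes"),
    ("cat", "warm", "yes", "yes", "yes"),
    ("bat", "warm", "yes", "no", "no"),        # mislabeled in Table 4.3
    ("whale", "warm", "yes", "no", "no"),      # mislabeled in Table 4.3
    ("salamander", "cold", "no", "yes", "no"),
    ("komodo dragon", "cold", "no", "yes", "no"),
    ("python", "cold", "no", "no", "no"),
    ("salmon", "cold", "no", "no", "no"),
    ("eagle", "warm", "no", "no", "no"),
    ("guppy", "cold", "yes", "no", "no"),
]
TEST = [
    ("human", "warm", "yes", "no", "yes"),
    ("pigeon", "warm", "no", "no", "no"),
    ("elephant", "warm", "yes", "yes", "yes"),
    ("leopard shark", "cold", "yes", "no", "no"),
    ("turtle", "cold", "no", "yes", "no"),
    ("penguin", "cold", "no", "no", "no"),
    ("eel", "cold", "no", "no", "no"),
    ("dolphin", "warm", "yes", "no", "yes"),
    ("spiny anteater", "warm", "no", "yes", "yes"),
    ("gila monster", "cold", "no", "yes", "no"),
]


def m1(body, birth, four_legged):
    """Model M1: mammal only if warm-blooded, gives birth, and four-legged."""
    return "yes" if (body, birth, four_legged) == ("warm", "yes", "yes") else "no"


def m2(body, birth, four_legged):
    """Model M2: mammal if warm-blooded and gives birth (Four-legged is ignored)."""
    return "yes" if (body, birth) == ("warm", "yes") else "no"


def error_rate(model, records):
    """Fraction of records whose predicted label differs from the recorded label."""
    wrong = sum(model(b, g, f) != label for _, b, g, f, label in records)
    return wrong / len(records)


for name, model in (("M1", m1), ("M2", m2)):
    print(name, error_rate(model, TRAIN), error_rate(model, TEST))
```

The Hibernates column is left out of the records because neither tree tests it.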
4.4.2 Overfitting Due to Lack of Representative Samples
Models that make their classification decisions based on a small number of
training records are also susceptible to overfitting. Such models can be
generated because of lack of representative samples in the training data and
learning algorithms that continue to refine their models even when few training
records are available. We illustrate these effects in the example below.
Consider the five training records shown in Table 4.5. All of these training
records are labeled correctly and the corresponding decision tree is depicted
in Figure 4.26. Although its training error is zero, its error rate on the test
set is 30%.
Table 4.5. An example training set for classifying mammals.

Name        Body Temperature  Gives Birth  Four-legged  Hibernates  Class Label
salamander  cold-blooded      no           yes          yes         no
guppy       cold-blooded      yes          no           no          no
eagle       warm-blooded      no           no           no          no
poorwill    warm-blooded      no           no           yes         no
platypus    warm-blooded      no           yes          yes         yes
Humans, elephants, and dolphins are misclassified because the decision tree
classifies all warm-blooded vertebrates that do not hibernate as non-mammals.
The tree arrives at this classification decision because there is only one training
record, which is an eagle, with such characteristics. This example clearly
demonstrates the danger of making wrong predictions when there are not
enough representative examples at the leaf nodes of a decision tree.
Body Temperature = Warm-blooded:
    Hibernates = Yes:
        Four-legged = Yes: Mammals
        Four-legged = No: Non-mammals
    Hibernates = No: Non-mammals
Body Temperature = Cold-blooded: Non-mammals

Figure 4.26. Decision tree induced from the data set shown in Table 4.5.
4.4.3 Overfitting and the Multiple Comparison Procedure
Model overfitting may arise in learning algorithms that employ a methodology
known as a multiple comparison procedure. To understand the multiple
comparison procedure, consider the task of predicting whether the stock market
will rise or fall in the next ten trading days. If a stock analyst simply makes
random guesses, the probability that her prediction is correct on any trading
day is 0.5. However, the probability that she will predict correctly at least
eight out of the ten times is
\[
\frac{\binom{10}{8} + \binom{10}{9} + \binom{10}{10}}{2^{10}} = 0.0547,
\]
which seems quite unlikely.
Suppose we are interested in choosing an investment advisor from a pool of
fifty stock analysts. Our strategy is to select the analyst who makes the most
correct predictions in the next ten trading days. The flaw in this strategy is
that even if all the analysts had made their predictions in a random fashion, the
probability that at least one of them makes at least eight correct predictions
is
\[
1 - (1 - 0.0547)^{50} = 0.9399,
\]
which is very high. Although each analyst has a low probability of predicting
at least eight times correctly, putting them together, we have a high probability
of finding an analyst who can do so. Furthermore, there is no guarantee in the
future that such an analyst will continue to make accurate predictions through
random guessing.
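Both probabilities can be verified numerically. The sketch below uses `math.comb` from the Python standard library (available from Python 3.8); the variable names are illustrative.

```python
from math import comb

# Probability that one random guesser is right on at least 8 of 10 days.
p_single = sum(comb(10, k) for k in (8, 9, 10)) / 2 ** 10

# Probability that at least one of 50 independent random guessers does so.
p_any = 1 - (1 - p_single) ** 50

print(round(p_single, 4), round(p_any, 4))
```

The exact single-analyst value is 56/1024, which rounds to 0.0547. Raising its complement to the 50th power is what turns a rare event for one analyst into a near certainty for the pool.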
How does the multiple comparison procedure relate to model overfitting?
Many learning algorithms explore a set of independent alternatives, {γi}, and
then choose an alternative, γmax, that maximizes a given criterion function.
The algorithm will add γmax to the current model in order to improve its
overall performance. This procedure is repeated until no further improvement
is observed. As an example, during decision tree growing, multiple tests are
performed to determine which attribute can best split the training data. The
attribute that leads to the best split is chosen to extend the tree as long as
the observed improvement is statistically significant.
Let T0 be the initial decision tree and Tx be the new tree after inserting an
internal node for attribute x. In principle, x can be added to the tree if the
observed gain, ∆(T0, Tx), is greater than some predefined threshold α. If there
is only one attribute test condition to be evaluated, then we can avoid inserting
spurious nodes by choosing a large enough value of α. However, in practice,
more than one test condition is available and the decision tree algorithm must
choose the best attribute xmax from a set of candidates, {x1, x2, . . . , xk}, to
partition the data. In this situation, the algorithm is actually using a multiple
comparison procedure to decide whether a decision tree should be extended.
More specifically, it is testing for ∆(T0, Txmax ) > α instead of ∆(T0, Tx) > α.
As the number of alternatives, k, increases, so does our chance of finding
∆(T0, Txmax ) > α. Unless the gain function ∆ or threshold α is modified to
account for k, the algorithm may inadvertently add spurious nodes to the
model, which leads to model overfitting.
This effect becomes more pronounced when the number of training records
from which xmax is chosen is small, because the variance of ∆(T0, Txmax ) is high
when fewer examples are available for training. As a result, the probability of
finding ∆(T0, Txmax ) > α increases when there are very few training records.
This often happens when the decision tree grows deeper, which in turn reduces
the number of records covered by the nodes and increases the likelihood of
adding unnecessary nodes into the tree. Failure to compensate for the large
number of alternatives or the small number of training records will therefore
lead to model overfitting.
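The effect of the number of alternatives k can be illustrated with a small simulation. The sketch below is an assumption-laden illustration rather than the procedure of any particular algorithm: it uses the Gini index as the gain function, purely random binary attributes, and random class labels, so any observed gain is spurious by construction.

```python
import random


def gini(labels):
    """Gini index of a list of 0/1 class labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)


def gain(attribute, labels):
    """Reduction in Gini index obtained by splitting on a binary attribute."""
    left = [y for x, y in zip(attribute, labels) if x == 0]
    right = [y for x, y in zip(attribute, labels) if x == 1]
    n = len(labels)
    weighted = len(left) / n * gini(left) + len(right) / n * gini(right)
    return gini(labels) - weighted


def best_gains(n_records, max_k, seed=0):
    """Best gain among the first k random attributes, for k = 1..max_k."""
    rng = random.Random(seed)
    labels = [rng.randint(0, 1) for _ in range(n_records)]
    best, history = 0.0, []
    for _ in range(max_k):
        attribute = [rng.randint(0, 1) for _ in range(n_records)]
        best = max(best, gain(attribute, labels))
        history.append(best)
    return history


gains = best_gains(n_records=20, max_k=50)
print(f"best gain with k=1: {gains[0]:.3f}, with k=50: {gains[-1]:.3f}")
```

Because the best gain over a larger candidate set can never be smaller than over a subset of it, a fixed threshold α is crossed more easily as k grows, even though none of the attributes carries any information about the class.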
4.4.4 Estimation of Generalization Errors
Although the primary reason for overfitting is still a subject of debate, it
is generally agreed that the complexity of a model has an impact on model
overfitting, as was illustrated in Figure 4.23. The question is, how do we
determine the right model complexity? The ideal complexity is that of a
model that produces the lowest generalization error. The problem is that the
learning algorithm has access only to the training set during model building
(see Figure 4.3). It has no knowledge of the test set, and thus, does not know
how well the tree will perform on records it has never seen before. The best it
can do is to estimate the generalization error of the induced tree. This section
presents several methods for doing the estimation.
Using Resubstitution Estimate
The resubstitution estimate approach assumes that the training set is a good
representation of the overall data. Consequently, the training error, otherwise
known as resubstitution error, can be used to provide an optimistic estimate
for the generalization error. Under this assumption, a decision tree induction
algorithm simply selects the model that produces the lowest training error rate
as its final model. However, the training error is usually a poor estimate of
generalization error.
Example 4.1. Consider the binary decision trees shown in Figure 4.27. Assume
that both trees are generated from the same training data and both
make their classification decisions at each leaf node according to the majority
class. Note that the left tree, TL, is more complex because it expands some
of the leaf nodes in the right tree, TR. The training error rate for the left
tree is e(TL) = 4/24 = 0.167, while the training error rate for the right tree is
e(TR) = 6/24 = 0.25. Based on their resubstitution estimate, the left tree is
considered better than the right tree.
[Figure 4.27. Example of two decision trees generated from the same training data.
Leaf class counts (+, −): Decision Tree TL: (3, 1), (2, 1), (0, 2), (1, 2), (3, 1), (0, 5), (3, 0);
Decision Tree TR: (5, 2), (1, 4), (3, 0), (3, 6).]
Incorporating Model Complexity
As previously noted, the chance for model overfitting increases as the model
becomes more complex. For this reason, we should prefer simpler models, a
strategy that agrees with a well-known principle known as Occam’s razor or
the principle of parsimony:
Definition 4.2. Occam’s Razor: Given two models with the same generalization
errors, the simpler model is preferred over the more complex model.
Occam’s razor is intuitive because the additional components in a complex
model stand a greater chance of being fitted purely by chance. In the words of
Einstein, “Everything should be made as simple as possible, but not simpler.”
Next, we present two methods for incorporating model complexity into the
evaluation of classification models.
Pessimistic Error Estimate The first approach explicitly computes generalization
error as the sum of training error and a penalty term for model complexity.
The resulting generalization error can be considered its pessimistic
error estimate. For instance, let n(t) be the number of training records classified
by node t and e(t) be the number of misclassified records. The pessimistic
error estimate of a decision tree T, e_g(T), can be computed as follows:
$$e_g(T) = \frac{\sum_{i=1}^{k} [e(t_i) + \Omega(t_i)]}{\sum_{i=1}^{k} n(t_i)} = \frac{e(T) + \Omega(T)}{N_t},$$
where k is the number of leaf nodes, e(T ) is the overall training error of the
decision tree, Nt is the number of training records, and Ω(ti) is the penalty
term associated with each node ti.
Example 4.2. Consider the binary decision trees shown in Figure 4.27. If
the penalty term is equal to 0.5, then the pessimistic error estimate for the
left tree is
$$e_g(T_L) = \frac{4 + 7 \times 0.5}{24} = \frac{7.5}{24} = 0.3125$$
and the pessimistic error estimate for the right tree is
$$e_g(T_R) = \frac{6 + 4 \times 0.5}{24} = \frac{8}{24} = 0.3333.$$
[Figure 4.28. The minimum description length (MDL) principle. Person A holds the
labeled records (X, y), person B holds the same records X with unknown labels,
and a decision tree with test conditions A?, B?, and C? summarizes the labels.]
Thus, the left tree has a better pessimistic error rate than the right tree. For
binary trees, a penalty term of 0.5 means a node should always be expanded
into its two child nodes as long as it improves the classification of at least one
training record because expanding a node, which is equivalent to adding 0.5
to the overall error, is less costly than committing one training error.
If Ω(t) = 1 for all the nodes t, the pessimistic error estimate for the left
tree is eg(TL) = 11/24 = 0.458, while the pessimistic error estimate for the
right tree is eg(TR) = 10/24 = 0.417. The right tree therefore has a better
pessimistic error rate than the left tree. Thus, a node should not be expanded
into its child nodes unless it reduces the misclassification error for more than
one training record.
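The pessimistic error estimates for both penalty values can be reproduced with a few lines of code. This is a sketch under stated assumptions: the leaf class counts are those printed in Figure 4.27, and their grouping into TL (seven leaves) and TR (four leaves) is inferred from the totals quoted in the text (24 records, 4 and 6 training errors). The names `LEAVES_TL`, `LEAVES_TR`, and `pessimistic_error` are illustrative.

```python
# Leaf class counts (+, -) from Figure 4.27. The assignment of leaves to
# the two trees is an assumption checked against the totals in the text.
LEAVES_TL = [(3, 1), (2, 1), (0, 2), (1, 2), (3, 1), (0, 5), (3, 0)]
LEAVES_TR = [(5, 2), (1, 4), (3, 0), (3, 6)]


def training_errors(leaves):
    """Records misclassified when every leaf predicts its majority class."""
    return sum(min(pos, neg) for pos, neg in leaves)


def pessimistic_error(leaves, penalty):
    """e_g(T) = (e(T) + k * penalty) / N_t for a constant per-leaf penalty."""
    n_records = sum(pos + neg for pos, neg in leaves)
    return (training_errors(leaves) + penalty * len(leaves)) / n_records


for penalty in (0.0, 0.5, 1.0):
    left = pessimistic_error(LEAVES_TL, penalty)
    right = pessimistic_error(LEAVES_TR, penalty)
    print(f"penalty={penalty}: e_g(TL)={left:.4f}, e_g(TR)={right:.4f}")
```

A penalty of 0 recovers the resubstitution estimate, so the same function shows how the preferred tree switches from TL to TR as the penalty grows from 0.5 to 1.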
Minimum Description Length Principle Another way to incorporate
model complexity is based on an information-theoretic approach known as the
minimum description length or MDL principle. To illustrate this principle,
consider the example shown in Figure 4.28. In this example, both A and B are
given a set of records with known attribute values x. In addition, person A
knows the exact class label for each record, while person B knows none of this
information. B can obtain the classification of each record by requesting that
A transmit the class labels sequentially. Such a message would require Θ(n)
bits of information, where n is the total number of records.
Alternatively, A may decide to build a classification model that summarizes
the relationship between x and y. The model can be encoded in a compact
form before being transmitted to B. If the model is 100% accurate, then the
cost of transmission is equivalent to the cost of encoding the model. Otherwise,
A must also transmit information about which records are classified incorrectly
by the model. Thus, the overall cost of transmission is
Cost(model, data) = Cost(model) + Cost(data|model), (4.9)
where the first term on the right-hand side is the cost of encoding the model,
while the second term represents the cost of encoding the mislabeled records.
According to the MDL principle, we should seek a model that minimizes the
overall cost function. An example showing how to compute the total description
length of a decision tree is given by Exercise 9 on page 202.
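Equation 4.9 becomes concrete once an encoding is fixed. The sketch below assumes one simple scheme that this section does not specify: each internal node costs log2(m) bits to name one of m attributes, each leaf costs log2(c) bits to name one of c classes, and each misclassified record costs log2(n) bits to identify it among n records. The two trees and all counts are hypothetical.

```python
from math import log2


def description_length(n_internal, n_leaves, n_errors,
                       n_attributes, n_classes, n_records):
    """Cost(model, data) in bits under the assumed encoding scheme."""
    cost_model = n_internal * log2(n_attributes) + n_leaves * log2(n_classes)
    cost_data_given_model = n_errors * log2(n_records)
    return cost_model + cost_data_given_model


# Hypothetical setting: 16 attributes, 2 classes, 100 training records.
small_tree = description_length(n_internal=2, n_leaves=3, n_errors=7,
                                n_attributes=16, n_classes=2, n_records=100)
large_tree = description_length(n_internal=4, n_leaves=5, n_errors=4,
                                n_attributes=16, n_classes=2, n_records=100)
print(f"small tree: {small_tree:.1f} bits, large tree: {large_tree:.1f} bits")
```

Under this encoding the tree with the smaller total is preferred, so a larger tree wins only when the bits saved on errors exceed the bits spent on extra nodes.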
Estimating Statistical Bounds
The generalization error can also be estimated as a statistical correction to
the training error. Since generalization error tends to be larger than training
error, the statistical correction is usually computed as an upper bound to the
training error, taking into account the number of training records that reach
a particular leaf node. For instance, in the C4.5 decision tree algorithm, the
number of errors committed by each leaf node is assumed to follow a binomial
distribution. To compute its generalization error, we must determine the upper
bound limit to the observed training error, as illustrated in the next example.
Example 4.3. Consider the leftmost branch of the binary decision trees
shown in Figure 4.27. Observe that the leftmost leaf node of TR has been
expanded into two child nodes in TL. Before splitting, the error rate of the
node is 2/7 = 0.286. By approximating a binomial distribution with a normal
distribution, the following upper bound of the error rate e can be derived:
$$e_{upper}(N, e, \alpha) = \frac{e + \frac{z_{\alpha/2}^2}{2N} + z_{\alpha/2}\sqrt{\frac{e(1-e)}{N} + \frac{z_{\alpha/2}^2}{4N^2}}}{1 + \frac{z_{\alpha/2}^2}{N}}, \qquad (4.10)$$
where α is the confidence level, zα/2 is the standardized value from a standard
normal distribution, and N is the total number of training records used to
compute e. Substituting α = 25%, N = 7, and e = 2/7, the upper bound for
the error rate is eupper(7, 2/7, 0.25) = 0.503, which corresponds to 7 × 0.503 =
3.521 errors. If we expand the node into its child nodes as shown in TL, the
training error rates for the child nodes are 1/4 = 0.250 and 1/3 = 0.333,
respectively. Using Equation 4.10, the upper bounds of these error rates are
eupper(4, 1/4, 0.25) = 0.537 and eupper(3, 1/3, 0.25) = 0.650, respectively. The
overall estimated error of the child nodes is 4 × 0.537 + 3 × 0.650 = 4.098, which
is larger than the estimated error for the corresponding node in TR.
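The numbers in Example 4.3 can be reproduced directly from Equation 4.10. The following is a minimal sketch; the function name `e_upper` is an illustrative choice, and zα/2 is obtained from the standard library's normal distribution rather than from a table.

```python
from math import sqrt
from statistics import NormalDist


def e_upper(n, e, alpha):
    """Upper bound on the error rate from Equation 4.10."""
    z = NormalDist().inv_cdf(1 - alpha / 2)  # z_{alpha/2}
    z2 = z * z
    numerator = e + z2 / (2 * n) + z * sqrt(e * (1 - e) / n + z2 / (4 * n * n))
    return numerator / (1 + z2 / n)


# Leftmost leaf of TR before splitting: 7 records, 2 errors.
before = 7 * e_upper(7, 2 / 7, 0.25)
# Its two child nodes in TL: 4 records with 1 error, 3 records with 1 error.
after = 4 * e_upper(4, 1 / 4, 0.25) + 3 * e_upper(3, 1 / 3, 0.25)
print(f"estimated errors before split: {before:.3f}, after split: {after:.3f}")
```

Although the split leaves the training error unchanged at two records, the smaller child nodes receive wider bounds, so the estimated error rises after the split.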
Using a Validation Set
In this approach, instead of using the training set to estimate the generalization
error, the original training data is divided into two smaller subsets. One of
the subsets is used for training, while the other, known as the validation set,
is used for estimating the generalization error. Typically, two-thirds of the
training set is reserved for model building, while the remaining one-third is
used for error estimation.
This approach is typically used with classification techniques that can be
parameterized to obtain models with different levels of complexity. The complexity
of the best model can be estimated by adjusting the parameter of the
learning algorithm (e.g., the pruning level of a decision tree) until the empirical
model produced by the learning algorithm attains the lowest error rate
on the validation set. Although this approach provides a better way for estimating
how well the model performs on previously unseen records, less data
is available for training.
4.4.5 Handling Overfitting in Decision Tree Induction
In the previous section, we described several methods for estimating the generalization
error of a classification model. Having a reliable estimate of generalization
error allows the learning algorithm to search for an accurate model
without overfitting the training data. This section presents two strategies for
avoiding model overfitting in the context of decision tree induction.
Prepruning (Early Stopping Rule) In this approach, the tree-growing
algorithm is halted before generating a fully grown tree that perfectly fits the
entire training data. To do this, a more restrictive stopping condition must
be used; e.g., stop expanding a leaf node when the observed gain in impurity
measure (or improvement in the estimated generalization error) falls below a
certain threshold. The advantage of this approach is that it avoids generating
overly complex subtrees that overfit the training data. Nevertheless, it is
difficult to choose the right threshold for early termination. Too high a
threshold will result in underfitted models, while a threshold that is set too low
may not be sufficient to overcome the model overfitting problem. Furthermore,
even if no significant gain is obtained using one of the existing attribute test
conditions, subsequent splitting may result in better subtrees.
Figure 4.29. Postpruning of the decision tree for Web robot detection.
Decision Tree:
depth = 1:
    breadth > 7: class 1
    breadth <= 7:
        breadth <= 3:
            ImagePages > 0.375: class 0
            ImagePages <= 0.375:
                totalPages <= 6: class 1
                totalPages > 6:
                    breadth <= 1: class 1
                    breadth > 1: class 0
        breadth > 3:
            MultiIP = 0:
                ImagePages <= 0.1333: class 1
                ImagePages > 0.1333:
                    breadth <= 6: class 0
                    breadth > 6: class 1
            MultiIP = 1:
                TotalTime <= 361: class 0
                TotalTime > 361: class 1
depth > 1:
    MultiAgent = 0:
        depth > 2: class 0
        depth <= 2:
            MultiIP = 1: class 0
            MultiIP = 0:
                breadth <= 6: class 0
                breadth > 6:
                    RepeatedAccess <= 0.322: class 0
                    RepeatedAccess > 0.322: class 1
    MultiAgent = 1:
        totalPages <= 81: class 0
        totalPages > 81: class 1
Simplified Decision Tree:
depth = 1: (subtree raising)
    ImagePages <= 0.1333: class 1
    ImagePages > 0.1333:
        breadth <= 6: class 0
        breadth > 6: class 1
depth > 1:
    MultiAgent = 0: class 0 (subtree replacement)
    MultiAgent = 1:
        totalPages <= 81: class 0
        totalPages > 81: class 1
Postpruning In this approach, the decision tree is initially grown to its
maximum size. This is followed by a tree-pruning step, which proceeds to
trim the fully grown tree in a bottom-up fashion. Trimming can be done by
replacing a subtree with (1) a new leaf node whose class label is determined
from the majority class of records affiliated with the subtree, or (2) the most
frequently used branch of the subtree. The tree-pruning step terminates when
no further improvement is observed. Postpruning tends to give better results
than prepruning because it makes pruning decisions based on a fully grown
tree, unlike prepruning, which can suffer from premature termination of the
tree-growing process. However, for postpruning, the additional computations
needed to grow the full tree may be wasted when the subtree is pruned.
Figure 4.29 illustrates the simplified decision tree model for the Web robot
detection example given in Section 4.3.6. Notice that the subtrees rooted at
depth = 1 have been replaced by one of the branches involving the attribute
ImagePages. This approach is also known as subtree raising. The depth >
1 and MultiAgent = 0 subtree has been replaced by a leaf node assigned to
class 0. This approach is known as subtree replacement. The subtree for
depth > 1 and MultiAgent = 1 remains intact.
4.5 Evaluating the Performance of a Classifier
Section 4.4.4 described several methods for estimating the generalization error
of a model during training. The estimated error helps the learning algorithm
to do model selection; i.e., to find a model of the right complexity that is
not susceptible to overfitting. Once the model has been constructed, it can be
applied to the test set to predict the class labels of previously unseen records.
It is often useful to measure the performance of the model on the test set
because such a measure provides an unbiased estimate of its generalization
error. The accuracy or error rate computed from the test set can also be
used to compare the relative performance of different classifiers on the same
domain. However, in order to do this, the class labels of the test records
must be known. This section reviews some of the methods commonly used to
evaluate the performance of a classifier.
4.5.1 Holdout Method
In the holdout method, the original data with labeled examples is partitioned
into two disjoint sets, called the training and the test sets, respectively. A
classification model is then induced from the training set and its performance
is evaluated on the test set. The proportion of data reserved for training and
for testing is typically at the discretion of the analysts (e.g., 50-50 or
two-thirds for training and one-third for testing). The accuracy of the classifier
can be estimated based on the accuracy of the induced model on the test set.
The holdout method has several well-known limitations. First, fewer labeled
examples are available for training because some of the records are withheld
for testing. As a result, the induced model may not be as good as when all
the labeled examples are used for training. Second, the model may be highly
dependent on the composition of the training and test sets. The smaller the
training set size, the larger the variance of the model. On the other hand, if
the training set is too large, then the estimated accuracy computed from the
smaller test set is less reliable. Such an estimate is said to have a wide
confidence interval. Finally, the training and test sets are no longer independent
of each other. Because the training and test sets are subsets of the original
data, a class that is overrepresented in one subset will be underrepresented in
the other, and vice versa.
4.5.2 Random Subsampling
The holdout method can be repeated several times to improve the estimation
of a classifier’s performance. This approach is known as random subsampling.
Let acc_i be the model accuracy during the ith iteration. The overall accuracy
is given by $acc_{sub} = \sum_{i=1}^{k} acc_i / k$. Random subsampling still encounters some
of the problems associated with the holdout method because it does not utilize
as much data as possible for training. It also has no control over the number of
times each record is used for testing and training. Consequently, some records
might be used for training more often than others.
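Random subsampling amounts to repeating a holdout split and averaging the resulting accuracies. The sketch below is illustrative only: the majority-class predictor stands in for a real classifier, the data set is synthetic, and the names `holdout_split` and `random_subsampling` are assumptions made for this example.

```python
import random
from collections import Counter


def holdout_split(records, train_fraction, rng):
    """Shuffle a copy of the records and cut it into disjoint train/test sets."""
    shuffled = records[:]
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def majority_class(train):
    """Stand-in 'model': always predict the most frequent training label."""
    return Counter(label for _, label in train).most_common(1)[0][0]


def accuracy(predicted_label, test):
    return sum(label == predicted_label for _, label in test) / len(test)


def random_subsampling(records, k, train_fraction=2 / 3, seed=0):
    """Average test accuracy over k independent holdout splits."""
    rng = random.Random(seed)
    accuracies = []
    for _ in range(k):
        train, test = holdout_split(records, train_fraction, rng)
        accuracies.append(accuracy(majority_class(train), test))
    return sum(accuracies) / k


# Synthetic labeled data: 20 records of class "+" and 10 of class "-".
records = [(i, "+" if i % 3 else "-") for i in range(30)]
acc_sub = random_subsampling(records, k=10)
print(f"acc_sub over 10 repetitions: {acc_sub:.3f}")
```

Each repetition draws its split independently, so nothing prevents a record from landing in the test set several times or never, which is the lack of control noted above.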
4.5.3 Cross-Validation
An alternative to random subsampling is cross-validation. In this approach,
each record is used the same number of times for training and exactly once
for testing. To illustrate this method, suppose we partition the data into two
equal-sized subsets. First, we choose one of the subsets for training and the
other for testing. We then swap the roles of the subsets so that the previous
training set becomes the test set and vice versa. This approach is called a
two-fold cross-validation. The total error is obtained by summing up the errors for
both runs. In this example, each record is used exactly once for training and
once for testing. The k-fold cross-validation method generalizes this approach
by segmenting the data into k equal-sized partitions. During each run, one of
the partitions is chosen for testing, while the rest of them are used for training.
This procedure is repeated k times so that each partition is used for testing
exactly once. Again, the total error is found by summing up the errors for
all k runs. A special case of the k-fold cross-validation method sets k = N,
the size of the data set. In this so-called leave-one-out approach, each test
set contains only one record. This approach has the advantage of utilizing
as much data as possible for training. In addition, the test sets are mutually
exclusive and they effectively cover the entire data set. The drawback of this
approach is that it is computationally expensive to repeat the procedure N
times. Furthermore, since each test set contains only one record, the variance
of the estimated performance metric tends to be high.
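The partitioning step of k-fold cross-validation can be sketched as follows. This is a minimal illustration that works on record indices only; the function name `k_fold_indices` is an assumption, and in practice the records would usually be shuffled before being partitioned.

```python
def k_fold_indices(n, k):
    """Yield (train_idx, test_idx) for each of k folds over range(n)."""
    # Fold i receives indices i, i + k, i + 2k, ... so sizes differ by at most 1.
    folds = [list(range(i, n, k)) for i in range(k)]
    for i, test_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx


# Two-fold cross-validation on six records.
for train_idx, test_idx in k_fold_indices(6, 2):
    print("train:", train_idx, "test:", test_idx)

# Leave-one-out is the special case k = N.
loo = list(k_fold_indices(5, 5))
```

Every index appears in exactly one test fold and in the training set of the remaining k − 1 runs, which is the property that distinguishes cross-validation from random subsampling.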
4.5.4 Bootstrap
The methods presented so far assume that the training records are sampled
without replacement. As a result, there are no duplicate records in the training
and test sets. In the bootstrap approach, the training records are sampled
with replacement; i.e., a record already chosen for training is put back into
the original pool of records so that it is equally likely to be redrawn. If the
original data has N records, it can be shown that, on average, a bootstrap
sample of size N contains about 63.2% of the records in the original data. This
approximation follows from the fact that the probability a record is chosen by
a bootstrap sample is 1 − (1 − 1/N)^N. When N is sufficiently large, the
probability asymptotically approaches 1 − e^{−1} = 0.632. Records that are not
included in the bootstrap sample become part of the test set. The model
induced from the training set is then applied to the test set to obtain an
estimate of the accuracy of the bootstrap sample, ε_i. The sampling procedure
is then repeated b times to generate b bootstrap samples.
There are several variations to the bootstrap sampling approach in terms
of how the overall accuracy of the classifier is computed. One of the more
widely used approaches is the .632 bootstrap, which computes the overall
accuracy by combining the accuracies of each bootstrap sample (ε_i) with the
accuracy computed from a training set that contains all the labeled examples
in the original data (acc_s):
$$\text{Accuracy, } acc_{boot} = \frac{1}{b} \sum_{i=1}^{b} (0.632 \times \epsilon_i + 0.368 \times acc_s). \qquad (4.11)$$
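The pieces of the bootstrap procedure can be sketched in a few functions. This is an illustration under stated assumptions: the per-sample accuracies passed to `acc_632` are hypothetical numbers rather than results of a trained model, and all function names are choices made for this example.

```python
import random
from math import exp


def inclusion_probability(n):
    """Probability that a given record appears in a bootstrap sample of size n."""
    return 1 - (1 - 1 / n) ** n


def bootstrap_split(n, rng):
    """Draw n training indices with replacement; unchosen indices form the test set."""
    train_idx = [rng.randrange(n) for _ in range(n)]
    chosen = set(train_idx)
    test_idx = [i for i in range(n) if i not in chosen]
    return train_idx, test_idx


def acc_632(sample_accuracies, acc_s):
    """Equation 4.11: average of 0.632 * eps_i + 0.368 * acc_s over b samples."""
    b = len(sample_accuracies)
    return sum(0.632 * eps + 0.368 * acc_s for eps in sample_accuracies) / b


print(f"N=10: {inclusion_probability(10):.3f}, limit: {1 - exp(-1):.3f}")
train_idx, test_idx = bootstrap_split(1000, random.Random(0))
print(f"distinct training records: {len(set(train_idx))} of 1000")
print(f"acc_boot: {acc_632([0.7, 0.8], acc_s=0.9):.4f}")
```

The weights 0.632 and 0.368 mirror the expected fractions of records inside and outside a bootstrap sample.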
4.6 Methods for Comparing Classifiers
It is often useful to compare the performance of different classifiers to determine
which classifier works better on a given data set. However, depending
on the size of the data, the observed difference in accuracy between two classifiers
may not be statistically significant. This section examines some of the
statistical tests available to compare the performance of different models and
classifiers.
For illustrative purposes, consider a pair of classification models, MA and
MB. Suppose MA achieves 85% accuracy when evaluated on a test set containing
30 records, while MB achieves 75% accuracy on a different test set
containing 5000 records. Based on this information, is MA a better model
than MB?
The preceding example raises two key questions regarding the statistical
significance of the performance metrics:
1. Although MA has a higher accuracy than MB, it was tested on a smaller
test set. How much confidence can we place on the accuracy for MA?
2. Is it possible to explain the difference in accuracy as a result of variations
in the composition of the test sets?
The first question relates to the issue of estimating the confidence interval of a
given model accuracy. The second question relates to the issue of testing the
statistical significance of the observed deviation. These issues are investigated
in the remainder of this section.
4.6.1 Estimating a Confidence Interval for Accuracy
To determine the confidence interval, we need to establish the probability
distribution that governs the accuracy measure. This section describes an approach
for deriving the confidence interval by modeling the classification task
as a binomial experiment. Following is a list of characteristics of a binomial
experiment:
1. The experiment consists of N independent trials, where each trial has
two possible outcomes: success or failure.
2. The probability of success, p, in each trial is constant.
An example of a binomial experiment is counting the number of heads that
turn up when a coin is flipped N times. If X is the number of successes
observed in N trials, then the probability that X takes a particular value v is
given by a binomial distribution with mean Np and variance Np(1 − p):
$$P(X = v) = \binom{N}{v} p^{v} (1 - p)^{N - v}.$$
For example, if the coin is fair (p = 0.5) and is flipped fifty times, then the
probability that the head shows up 20 times is
$$P(X = 20) = \binom{50}{20} 0.5^{20} (1 - 0.5)^{30} = 0.0419.$$
If the experiment is repeated many times, then the average number of heads
expected to show up is 50 × 0.5 = 25, while its variance is 50 × 0.5 × 0.5 = 12.5.
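The coin-flipping figures can be verified with the binomial probability mass function. This is a short sketch; the function name `binomial_pmf` is an illustrative choice.

```python
from math import comb


def binomial_pmf(v, n, p):
    """P(X = v) for X ~ Binomial(n, p)."""
    return comb(n, v) * p**v * (1 - p) ** (n - v)


n, p = 50, 0.5
p_20_heads = binomial_pmf(20, n, p)
mean = sum(v * binomial_pmf(v, n, p) for v in range(n + 1))
variance = sum((v - mean) ** 2 * binomial_pmf(v, n, p) for v in range(n + 1))
print(f"P(X = 20) = {p_20_heads:.4f}, mean = {mean:.1f}, variance = {variance:.1f}")
```

Summing over all outcomes recovers the mean Np and variance Np(1 − p) without using the closed-form expressions.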
The task of predicting the class labels of test records can also be considered
as a binomial experiment. Given a test set that contains N records, let
X be the number of records correctly predicted by a model and p be the true
accuracy of the model. By modeling the prediction task as a binomial experiment,
X has a binomial distribution with mean Np and variance Np(1 − p).
It can be shown that the empirical accuracy, acc = X/N, also has a binomial
distribution with mean p and variance p(1 − p)/N (see Exercise 12). Although
the binomial distribution can be used to estimate the confidence interval for
acc, it is often approximated by a normal distribution when N is sufficiently
large. Based on the normal distribution, the following confidence interval for
acc can be derived:
$$P\left(-Z_{\alpha/2} \le \frac{acc - p}{\sqrt{p(1 - p)/N}} \le Z_{1-\alpha/2}\right) = 1 - \alpha, \qquad (4.12)$$
where Zα/2 and Z1−α/2 are the upper and lower bounds obtained from a standard
normal distribution at confidence level (1 − α). Since a standard normal
distribution is symmetric around Z = 0, it follows that Zα/2 = Z1−α/2. Rearranging
this inequality leads to the following confidence interval for p:
$$\frac{2 \times N \times acc + Z_{\alpha/2}^2 \pm Z_{\alpha/2}\sqrt{Z_{\alpha/2}^2 + 4N\,acc - 4N\,acc^2}}{2(N + Z_{\alpha/2}^2)}. \qquad (4.13)$$
The following table shows the values of Zα/2 at different confidence levels:
1 − α    0.99   0.98   0.95   0.9    0.8    0.7    0.5
Zα/2     2.58   2.33   1.96   1.65   1.28   1.04   0.67
Example 4.4. Consider a model that has an accuracy of 80% when evaluated
on 100 test records. What is the confidence interval for its true accuracy at a
95% confidence level? The confidence level of 95% corresponds to Zα/2 = 1.96
according to the table given above. Inserting this term into Equation 4.13
yields a confidence interval between 71.1% and 86.7%. The following table
shows the confidence interval when the number of records, N, increases:
N                     20             50             100            500            1000           5000
Confidence interval   0.584 − 0.919  0.670 − 0.888  0.711 − 0.867  0.763 − 0.833  0.774 − 0.824  0.789 − 0.811
Note that the confidence interval becomes tighter when N increases.
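Equation 4.13 is straightforward to evaluate in code. The sketch below uses the standard library's normal distribution for Zα/2 instead of the rounded table values, so the last digit may occasionally differ from a hand calculation; the function name `accuracy_interval` is an illustrative choice.

```python
from math import sqrt
from statistics import NormalDist


def accuracy_interval(acc, n, confidence=0.95):
    """Confidence interval for the true accuracy p from Equation 4.13."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # Z_{alpha/2}
    z2 = z * z
    centre = 2 * n * acc + z2
    spread = z * sqrt(z2 + 4 * n * acc - 4 * n * acc * acc)
    denominator = 2 * (n + z2)
    return (centre - spread) / denominator, (centre + spread) / denominator


for n in (20, 50, 100, 500, 1000, 5000):
    low, high = accuracy_interval(0.8, n)
    print(f"N={n:5d}: {low:.3f} - {high:.3f}")
```

Note that the interval is not symmetric around the observed accuracy of 0.8, because Equation 4.13 solves for p rather than adding a fixed margin to acc.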
4.6.2 Comparing the Performance of Two Models
Consider a pair of models, M1 and M2, that are evaluated on two independent
test sets, D1 and D2. Let n1 denote the number of records in D1 and n2 denote
the number of records in D2. In addition, suppose the error rate for M1 on
D1 is e1 and the error rate for M2 on D2 is e2. Our goal is to test whether the
observed difference between e1 and e2 is statistically significant.
Assuming that n1 and n2 are sufficiently large, the error rates e1 and e2
can be approximated using normal distributions. If the observed difference in
the error rate is denoted as d = e1 − e2, then d is also normally distributed
with mean d_t, its true difference, and variance σ_d^2. The variance of d can be
computed as follows:
$$\sigma_d^2 \simeq \hat{\sigma}_d^2 = \frac{e_1(1 - e_1)}{n_1} + \frac{e_2(1 - e_2)}{n_2}, \qquad (4.14)$$
where e1(1 − e1)/n1 and e2(1 − e2)/n2 are the variances of the error rates.
Finally, at the 100(1 − α)% confidence level, it can be shown that the confidence
interval for the true difference d_t is given by the following equation:
$$d_t = d \pm z_{\alpha/2}\,\hat{\sigma}_d. \qquad (4.15)$$
Example 4.5. Consider the problem described at the beginning of this section.
Model MA has an error rate of e1 = 0.15 when applied to n1 = 30
test records, while model MB has an error rate of e2 = 0.25 when applied
to n2 = 5000 test records. The observed difference in their error rates is
d = |0.15 − 0.25| = 0.1. In this example, we are performing a two-sided test
to check whether d_t = 0 or d_t ≠ 0. The estimated variance of the observed
difference in error rates can be computed as follows:
$$\hat{\sigma}_d^2 = \frac{0.15(1 - 0.15)}{30} + \frac{0.25(1 - 0.25)}{5000} = 0.0043$$
or σ̂_d = 0.0655. Inserting this value into Equation 4.15, we obtain the following
confidence interval for d_t at 95% confidence level:
$$d_t = 0.1 \pm 1.96 \times 0.0655 = 0.1 \pm 0.128.$$
As the interval spans the value zero, we can conclude that the observed difference
is not statistically significant at a 95% confidence level.
At what confidence level can we reject the hypothesis that dt = 0? To answer
this, we need to determine the value of Zα/2 such that the confidence interval
for dt does not span the value zero. We can reverse the preceding computation
and look for the value Zα/2 such that d > Zα/2σ̂d. Replacing the values of d
and σ̂d gives Zα/2 < 1.527. For a two-sided test, this value first occurs when
(1 − α) ≲ 0.873, which lies between the tabulated levels 0.8 (Zα/2 = 1.28) and
0.9 (Zα/2 = 1.65). The result suggests that the null hypothesis can be rejected
at a confidence level of 87.3% or lower.
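The interval computation of Equations 4.14 and 4.15 for the two models MA and MB can be sketched as follows. The function name `difference_interval` is an illustrative choice, and zα/2 is taken from the standard library rather than from the table.

```python
from math import sqrt
from statistics import NormalDist


def difference_interval(e1, n1, e2, n2, confidence=0.95):
    """Observed difference, its estimated std. deviation, and the interval for d_t."""
    d = abs(e1 - e2)
    sigma = sqrt(e1 * (1 - e1) / n1 + e2 * (1 - e2) / n2)  # Equation 4.14
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return d, sigma, (d - z * sigma, d + z * sigma)          # Equation 4.15


d, sigma, (low, high) = difference_interval(0.15, 30, 0.25, 5000)
print(f"d = {d:.3f}, sigma = {sigma:.4f}, interval = ({low:.3f}, {high:.3f})")
print(f"largest z for which the interval excludes zero: {d / sigma:.3f}")
```

Almost all of the variance comes from the 30-record test set, which is why the interval is wide despite the 5000 records used for the second model.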
4.6.3 Comparing the Performance of Two Classifiers
Suppose we want to compare the performance of two classifiers using the k-fold
cross-validation approach. Initially, the data set D is divided into k equal-sized
partitions. We then apply each classifier to construct a model from k − 1 of
the partitions and test it on the remaining partition. This step is repeated k
times, each time using a different partition as the test set.
Let M_ij denote the model induced by classification technique L_i during the
jth iteration. Note that each pair of models M_1j and M_2j are tested on the
same partition j. Let e_1j and e_2j be their respective error rates. The difference
between their error rates during the jth fold can be written as d_j = e_1j − e_2j.
If k is sufficiently large, then d_j is normally distributed with mean d_t^cv, which
is the true difference in their error rates, and variance σ^cv. Unlike the previous
approach, the overall variance in the observed differences is estimated using
the following formula:
$$\hat{\sigma}^2_{d^{cv}} = \frac{\sum_{j=1}^{k} (d_j - \bar{d})^2}{k(k - 1)}, \qquad (4.16)$$
where d̄ is the average difference. For this approach, we need to use a
t-distribution to compute the confidence interval for d_t^cv:
$$d_t^{cv} = \bar{d} \pm t_{(1-\alpha),k-1}\,\hat{\sigma}_{d^{cv}}.$$
The coefficient t_{(1−α),k−1} is obtained from a probability table with two input
parameters, its confidence level (1 − α) and the number of degrees of freedom,
k − 1. The probability table for the t-distribution is shown in Table 4.6.
Example 4.6. Suppose the estimated difference in the accuracy of models
generated by two classification techniques has a mean equal to 0.05 and a
standard deviation equal to 0.002. If the accuracy is estimated using a 30-fold
cross-validation approach, then at a 95% confidence level, the true accuracy
difference is
$$d_t^{cv} = 0.05 \pm 2.04 \times 0.002. \qquad (4.17)$$
Table 4.6. Probability table for t-distribution.
                    (1 − α)
k − 1    0.8     0.9     0.95    0.98    0.99
1        3.08    6.31    12.7    31.8    63.7
2        1.89    2.92    4.30    6.96    9.92
4        1.53    2.13    2.78    3.75    4.60
9        1.38    1.83    2.26    2.82    3.25
14       1.34    1.76    2.14    2.62    2.98
19       1.33    1.73    2.09    2.54    2.86
24       1.32    1.71    2.06    2.49    2.80
29       1.31    1.70    2.04    2.46    2.76
Since the confidence interval does not span the value zero, the observed
difference between the techniques is statistically significant.
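The cross-validated comparison can be sketched from a list of per-fold differences. The example below is an illustration under stated assumptions: the ten fold differences are hypothetical, the t coefficients are the 95% column of Table 4.6 (so only the tabulated degrees of freedom are supported), and the names `T_95` and `cv_difference_interval` are choices made for this example.

```python
from math import sqrt

# Two-sided 95% t coefficients from Table 4.6, keyed by degrees of freedom k - 1.
T_95 = {1: 12.7, 2: 4.30, 4: 2.78, 9: 2.26, 14: 2.14, 19: 2.09, 24: 2.06, 29: 2.04}


def cv_difference_interval(differences):
    """Mean difference, estimated std. deviation (Eq. 4.16), and 95% interval."""
    k = len(differences)
    d_bar = sum(differences) / k
    sigma = sqrt(sum((d - d_bar) ** 2 for d in differences) / (k * (k - 1)))
    t = T_95[k - 1]
    return d_bar, sigma, (d_bar - t * sigma, d_bar + t * sigma)


# Hypothetical error-rate differences d_j from a 10-fold cross-validation.
fold_differences = [0.05, 0.03, 0.06, 0.04, 0.07, 0.05, 0.02, 0.06, 0.05, 0.07]
d_bar, sigma, (low, high) = cv_difference_interval(fold_differences)
print(f"d_bar = {d_bar:.3f}, sigma = {sigma:.4f}, interval = ({low:.4f}, {high:.4f})")

# Example 4.6: mean 0.05, standard deviation 0.002, 30 folds (29 degrees of freedom).
example_low, example_high = 0.05 - T_95[29] * 0.002, 0.05 + T_95[29] * 0.002
print(f"Example 4.6 interval: ({example_low:.5f}, {example_high:.5f})")
```

Because both classifiers are evaluated on the same folds, the variance is computed from the paired differences rather than from the two error rates separately.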
4.7 Bibliographic Notes
Early classification systems were developed to organize a large collection of
objects. For example, the Dewey Decimal and Library of Congress classification
systems were designed to catalog and index the vast number of library
books. The categories are typically identified in a manual fashion, with the
help of domain experts.
Automated classification has been a subject of intensive research for many
years. The study of classification in classical statistics is sometimes known as
discriminant analysis, where the objective is to predict the group membership
of an object based on a set of predictor variables. A well-known classical
method is Fisher’s linear discriminant analysis [117], which seeks to find a linear
projection of the data that produces the greatest discrimination between
objects that belong to different classes.
Many pattern recognition problems also require the discrimination of objects
from different classes. Examples include speech recognition, handwritten
character identification, and image classification. Readers who are interested
in the application of classification techniques for pattern recognition can refer
to the survey articles by Jain et al. [122] and Kulkarni et al. [128] or classic
pattern recognition books by Bishop [107], Duda et al. [114], and Fukunaga
[118]. The subject of classification is also a major research topic in the fields of
neural networks, statistical learning, and machine learning. An in-depth treatment
of various classification techniques is given in the books by Cherkassky
and Mulier [112], Hastie et al. [120], Michie et al. [133], and Mitchell [136].
An overview of decision tree induction algorithms can be found in the
survey articles by Buntine [110], Moret [137], Murthy [138], and Safavian et
al. [147]. Examples of some well-known decision tree algorithms include CART
[108], ID3 [143], C4.5 [145], and CHAID [125]. Both ID3 and C4.5 employ the
entropy measure as their splitting function. An in-depth discussion of the
C4.5 decision tree algorithm is given by Quinlan [145]. Besides explaining the
methodology for decision tree growing and tree pruning, Quinlan [145] also
described how the algorithm can be modified to handle data sets with missing
values. The CART algorithm was developed by Breiman et al. [108] and uses
the Gini index as its splitting function. CHAID [125] uses the statistical χ²
test to determine the best split during the tree-growing process.
The decision tree algorithm presented in this chapter assumes that the
splitting condition is specified one attribute at a time. An oblique decision tree
can use multiple attributes to form the attribute test condition in the internal
nodes [121, 152]. Breiman et al. [108] provide an option for using linear
combinations of attributes in their CART implementation. Other approaches
for inducing oblique decision trees were proposed by Heath et al. [121], Murthy
et al. [139], Cantú-Paz and Kamath [111], and Utgoff and Brodley [152].
Although oblique decision trees help to improve the expressiveness of a decision
tree representation, learning the appropriate test condition at each node is
computationally challenging. Another way to improve the expressiveness of a
decision tree without using oblique decision trees is to apply a method known
as constructive induction [132]. This method simplifies the task of learning
complex splitting functions by creating compound features from the original
attributes.
Besides the top-down approach, other strategies for growing a decision tree
include the bottom-up approach by Landeweerd et al. [130] and Pattipati and
Alexandridis [142], as well as the bidirectional approach by Kim and Landgrebe
[126]. Schuermann and Doster [150] and Wang and Suen [154] proposed using
a soft splitting criterion to address the data fragmentation problem. In
this approach, each record is assigned to different branches of the decision tree
with different probabilities.
Model overfitting is an important issue that must be addressed to ensure
that a decision tree classifier performs equally well on previously unknown
records. The model overfitting problem has been investigated by many authors
including Breiman et al. [108], Schaffer [148], Mingers [135], and Jensen and
Cohen [123]. While the presence of noise is often regarded as one of the
primary reasons for overfitting [135, 140], Jensen and Cohen [123] argued
that overfitting is the result of using incorrect hypothesis tests in a multiple
comparison procedure.
Schapire [149] defined generalization error as “the probability of misclassifying
a new example” and test error as “the fraction of mistakes on a newly
sampled test set.” Generalization error can therefore be considered as the expected
test error of a classifier. Generalization error may sometimes refer to
the true error [136] of a model, i.e., its expected error for data points drawn
at random from the same population distribution from which the training set is
sampled. These definitions are in fact equivalent if both the training and test
sets are gathered from the same population distribution, which is often the
case in many data mining and machine learning applications.
The Occam’s razor principle is often attributed to the philosopher William
of Occam. Domingos [113] cautioned against the pitfall of misinterpreting
Occam’s razor as comparing models with similar training errors, instead of
generalization errors. A survey on decision tree pruning methods to avoid
overfitting is given by Breslow and Aha [109] and Esposito et al. [116]. Some
of the typical pruning methods include reduced error pruning [144], pessimistic
error pruning [144], minimum error pruning [141], critical value pruning [134],
cost-complexity pruning [108], and error-based pruning [145]. Quinlan and
Rivest proposed using the minimum description length principle for decision
tree pruning in [146].
Kohavi [127] performed an extensive empirical study to compare the
performance metrics obtained using different estimation methods such as random
subsampling, bootstrapping, and k-fold cross-validation. His results
suggest that the best estimation method is ten-fold stratified
cross-validation. Efron and Tibshirani [115] provided a theoretical and empirical
comparison between cross-validation and a bootstrap method known as
the .632+ rule.
Current techniques such as C4.5 require that the entire training data set fit
into main memory. There has been considerable effort to develop parallel and
scalable versions of decision tree induction algorithms. Some of the proposed
algorithms include SLIQ by Mehta et al. [131], SPRINT by Shafer et al. [151],
CMP by Wang and Zaniolo [153], CLOUDS by Alsabti et al. [106], RainForest
by Gehrke et al. [119], and ScalParC by Joshi et al. [124]. A general survey
of parallel algorithms for data mining is available in [129].
Bibliography
[106] K. Alsabti, S. Ranka, and V. Singh. CLOUDS: A Decision Tree Classifier for Large
Datasets. In Proc. of the 4th Intl. Conf. on Knowledge Discovery and Data Mining,
pages 2–8, New York, NY, August 1998.
[107] C. M. Bishop. Neural Networks for Pattern Recognition. Oxford University Press,
Oxford, U.K., 1995.
[108] L. Breiman, J. H. Friedman, R. Olshen, and C. J. Stone. Classification and Regression
Trees. Chapman & Hall, New York, 1984.
[109] L. A. Breslow and D. W. Aha. Simplifying Decision Trees: A Survey. Knowledge
Engineering Review, 12(1):1–40, 1997.
[110] W. Buntine. Learning classification trees. In Artificial Intelligence Frontiers in
Statistics, pages 182–201. Chapman & Hall, London, 1993.
[111] E. Cantú-Paz and C. Kamath. Using evolutionary algorithms to induce oblique decision
trees. In Proc. of the Genetic and Evolutionary Computation Conf., pages 1053–1060,
San Francisco, CA, 2000.
[112] V. Cherkassky and F. Mulier. Learning from Data: Concepts, Theory, and Methods.
Wiley Interscience, 1998.
[113] P. Domingos. The Role of Occam’s Razor in Knowledge Discovery. Data Mining and
Knowledge Discovery, 3(4):409–425, 1999.
[114] R. O. Duda, P. E. Hart, and D. G. Stork. Pattern Classification. John Wiley & Sons,
Inc., New York, 2nd edition, 2001.
[115] B. Efron and R. Tibshirani. Cross-validation and the Bootstrap: Estimating the Error
Rate of a Prediction Rule. Technical report, Stanford University, 1995.
[116] F. Esposito, D. Malerba, and G. Semeraro. A Comparative Analysis of Methods for
Pruning Decision Trees. IEEE Trans. Pattern Analysis and Machine Intelligence, 19
(5):476–491, May 1997.
[117] R. A. Fisher. The use of multiple measurements in taxonomic problems. Annals of
Eugenics, 7:179–188, 1936.
[118] K. Fukunaga. Introduction to Statistical Pattern Recognition. Academic Press, New
York, 1990.
[119] J. Gehrke, R. Ramakrishnan, and V. Ganti. RainForest—A Framework for Fast
Decision Tree Construction of Large Datasets. Data Mining and Knowledge Discovery, 4
(2/3):127–162, 2000.
[120] T. Hastie, R. Tibshirani, and J. H. Friedman. The Elements of Statistical Learning:
Data Mining, Inference, Prediction. Springer, New York, 2001.
[121] D. Heath, S. Kasif, and S. Salzberg. Induction of Oblique Decision Trees. In Proc. of
the 13th Intl. Joint Conf. on Artificial Intelligence, pages 1002–1007, Chambery, France,
August 1993.
[122] A. K. Jain, R. P. W. Duin, and J. Mao. Statistical Pattern Recognition: A Review.
IEEE Trans. Patt. Anal. and Mach. Intellig., 22(1):4–37, 2000.
[123] D. Jensen and P. R. Cohen. Multiple Comparisons in Induction Algorithms. Machine
Learning, 38(3):309–338, March 2000.
[124] M. V. Joshi, G. Karypis, and V. Kumar. ScalParC: A New Scalable and Efficient
Parallel Classification Algorithm for Mining Large Datasets. In Proc. of 12th Intl.
Parallel Processing Symp. (IPPS/SPDP), pages 573–579, Orlando, FL, April 1998.
[125] G. V. Kass. An Exploratory Technique for Investigating Large Quantities of
Categorical Data. Applied Statistics, 29:119–127, 1980.
[126] B. Kim and D. Landgrebe. Hierarchical decision classifiers in high-dimensional and
large class data. IEEE Trans. on Geoscience and Remote Sensing, 29(4):518–528, 1991.
[127] R. Kohavi. A Study on Cross-Validation and Bootstrap for Accuracy Estimation and
Model Selection. In Proc. of the 15th Intl. Joint Conf. on Artificial Intelligence, pages
1137–1145, Montreal, Canada, August 1995.
[128] S. R. Kulkarni, G. Lugosi, and S. S. Venkatesh. Learning Pattern Classification—A
Survey. IEEE Trans. Inf. Theory, 44(6):2178–2206, 1998.
[129] V. Kumar, M. V. Joshi, E.-H. Han, P. N. Tan, and M. Steinbach. High Performance
Data Mining. In High Performance Computing for Computational Science (VECPAR
2002), pages 111–125. Springer, 2002.
[130] G. Landeweerd, T. Timmers, E. Gersema, M. Bins, and M. Halic. Binary tree versus
single level tree classification of white blood cells. Pattern Recognition, 16:571–577,
1983.
[131] M. Mehta, R. Agrawal, and J. Rissanen. SLIQ: A Fast Scalable Classifier for Data
Mining. In Proc. of the 5th Intl. Conf. on Extending Database Technology, pages 18–32,
Avignon, France, March 1996.
[132] R. S. Michalski. A theory and methodology of inductive learning. Artificial Intelligence,
20:111–116, 1983.
[133] D. Michie, D. J. Spiegelhalter, and C. C. Taylor. Machine Learning, Neural and
Statistical Classification. Ellis Horwood, Upper Saddle River, NJ, 1994.
[134] J. Mingers. Expert Systems—Rule Induction with Statistical Data. J Operational
Research Society, 38:39–47, 1987.
[135] J. Mingers. An empirical comparison of pruning methods for decision tree induction.
Machine Learning, 4:227–243, 1989.
[136] T. Mitchell. Machine Learning. McGraw-Hill, Boston, MA, 1997.
[137] B. M. E. Moret. Decision Trees and Diagrams. Computing Surveys, 14(4):593–623,
1982.
[138] S. K. Murthy. Automatic Construction of Decision Trees from Data: A
Multi-Disciplinary Survey. Data Mining and Knowledge Discovery, 2(4):345–389, 1998.
[139] S. K. Murthy, S. Kasif, and S. Salzberg. A system for induction of oblique decision
trees. J of Artificial Intelligence Research, 2:1–33, 1994.
[140] T. Niblett. Constructing decision trees in noisy domains. In Proc. of the 2nd European
Working Session on Learning, pages 67–78, Bled, Yugoslavia, May 1987.
[141] T. Niblett and I. Bratko. Learning Decision Rules in Noisy Domains. In Research and
Development in Expert Systems III, Cambridge, 1986. Cambridge University Press.
[142] K. R. Pattipati and M. G. Alexandridis. Application of heuristic search and information
theory to sequential fault diagnosis. IEEE Trans. on Systems, Man, and Cybernetics,
20(4):872–887, 1990.
[143] J. R. Quinlan. Discovering rules by induction from large collection of examples. In
D. Michie, editor, Expert Systems in the Micro Electronic Age. Edinburgh University
Press, Edinburgh, UK, 1979.
[144] J. R. Quinlan. Simplifying Decision Trees. Intl. J. Man-Machine Studies, 27:221–234,
1987.
[145] J. R. Quinlan. C4.5: Programs for Machine Learning. Morgan-Kaufmann Publishers,
San Mateo, CA, 1993.
[146] J. R. Quinlan and R. L. Rivest. Inferring Decision Trees Using the Minimum
Description Length Principle. Information and Computation, 80(3):227–248, 1989.
[147] S. R. Safavian and D. Landgrebe. A Survey of Decision Tree Classifier Methodology.
IEEE Trans. Systems, Man and Cybernetics, 22:660–674, May/June 1998.
[148] C. Schaffer. Overfitting avoidance as bias. Machine Learning, 10:153–178, 1993.
[149] R. E. Schapire. The Boosting Approach to Machine Learning: An Overview. In MSRI
Workshop on Nonlinear Estimation and Classification, 2002.
[150] J. Schuermann and W. Doster. A decision-theoretic approach in hierarchical classifier
design. Pattern Recognition, 17:359–369, 1984.
[151] J. C. Shafer, R. Agrawal, and M. Mehta. SPRINT: A Scalable Parallel Classifier
for Data Mining. In Proc. of the 22nd VLDB Conf., pages 544–555, Bombay, India,
September 1996.
[152] P. E. Utgoff and C. E. Brodley. An incremental method for finding multivariate splits
for decision trees. In Proc. of the 7th Intl. Conf. on Machine Learning, pages 58–65,
Austin, TX, June 1990.
[153] H. Wang and C. Zaniolo. CMP: A Fast Decision Tree Classifier Using Multivariate
Predictions. In Proc. of the 16th Intl. Conf. on Data Engineering, pages 449–460, San
Diego, CA, March 2000.
[154] Q. R. Wang and C. Y. Suen. Large tree classifier with heuristic search and global
training. IEEE Trans. on Pattern Analysis and Machine Intelligence, 9(1):91–102, 1987.
4.8 Exercises
1. Draw the full decision tree for the parity function of four Boolean attributes,
A, B, C, and D. Is it possible to simplify the tree?
2. Consider the training examples shown in Table 4.7 for a binary classification
problem.
(a) Compute the Gini index for the overall collection of training examples.
(b) Compute the Gini index for the Customer ID attribute.
(c) Compute the Gini index for the Gender attribute.
(d) Compute the Gini index for the Car Type attribute using multiway split.
(e) Compute the Gini index for the Shirt Size attribute using multiway
split.
(f) Which attribute is better, Gender, Car Type, or Shirt Size?
(g) Explain why Customer ID should not be used as the attribute test
condition even though it has the lowest Gini.
3. Consider the training examples shown in Table 4.8 for a binary classification
problem.
(a) What is the entropy of this collection of training examples with respect
to the positive class?
Table 4.7. Data set for Exercise 2.
Customer ID Gender Car Type Shirt Size Class
1 M Family Small C0
2 M Sports Medium C0
3 M Sports Medium C0
4 M Sports Large C0
5 M Sports Extra Large C0
6 M Sports Extra Large C0
7 F Sports Small C0
8 F Sports Small C0
9 F Sports Medium C0
10 F Luxury Large C0
11 M Family Large C1
12 M Family Extra Large C1
13 M Family Medium C1
14 M Luxury Extra Large C1
15 F Luxury Small C1
16 F Luxury Small C1
17 F Luxury Medium C1
18 F Luxury Medium C1
19 F Luxury Medium C1
20 F Luxury Large C1
Table 4.8. Data set for Exercise 3.
Instance a1 a2 a3 Target Class
1 T T 1.0 +
2 T T 6.0 +
3 T F 5.0 −
4 F F 4.0 +
5 F T 7.0 −
6 F T 3.0 −
7 F F 8.0 −
8 T F 7.0 +
9 F T 5.0 −
(b) What are the information gains of a1 and a2 relative to these training
examples?
(c) For a3, which is a continuous attribute, compute the information gain for
every possible split.
(d) What is the best split (among a1, a2, and a3) according to the information
gain?
(e) What is the best split (between a1 and a2) according to the classification
error rate?
(f) What is the best split (between a1 and a2) according to the Gini index?
4. Show that the entropy of a node never increases after splitting it into smaller
successor nodes.
5. Consider the following data set for a binary class problem.
A B Class Label
T F +
T T +
T T +
T F −
T T +
F F −
F F −
F F −
T T −
T F −
(a) Calculate the information gain when splitting on A and B. Which
attribute would the decision tree induction algorithm choose?
(b) Calculate the gain in the Gini index when splitting on A and B. Which
attribute would the decision tree induction algorithm choose?
(c) Figure 4.13 shows that entropy and the Gini index are both monotonically
increasing on the range [0, 0.5] and they are both monotonically decreasing
on the range [0.5, 1]. Is it possible that information gain and the gain in
the Gini index favor different attributes? Explain.
6. Consider the following set of training examples.
X Y Z No. of Class C1 Examples No. of Class C2 Examples
0 0 0 5 40
0 0 1 0 15
0 1 0 10 5
0 1 1 45 0
1 0 0 10 5
1 0 1 25 0
1 1 0 5 20
1 1 1 0 15
(a) Compute a two-level decision tree using the greedy approach described in
this chapter. Use the classification error rate as the criterion for splitting.
What is the overall error rate of the induced tree?
(b) Repeat part (a) using X as the first splitting attribute and then choose the
best remaining attribute for splitting at each of the two successor nodes.
What is the error rate of the induced tree?
(c) Compare the results of parts (a) and (b). Comment on the suitability of
the greedy heuristic used for splitting attribute selection.
7. The following table summarizes a data set with three attributes A, B, C and
two class labels +, −. Build a two-level decision tree.
A B C  Number of + Instances  Number of − Instances
T T T 5 0
F T T 0 20
T F T 20 0
F F T 0 5
T T F 0 0
F T F 25 0
T F F 0 0
F F F 0 25
(a) According to the classification error rate, which attribute would be chosen
as the first splitting attribute? For each attribute, show the contingency
table and the gains in classification error rate.
(b) Repeat for the two children of the root node.
(c) How many instances are misclassified by the resulting decision tree?
(d) Repeat parts (a), (b), and (c) using C as the splitting attribute.
(e) Use the results in parts (c) and (d) to conclude about the greedy nature
of the decision tree induction algorithm.
8. Consider the decision tree shown in Figure 4.30.
(a) Compute the generalization error rate of the tree using the optimistic
approach.
(b) Compute the generalization error rate of the tree using the pessimistic
approach. (For simplicity, use the strategy of adding a factor of 0.5 to
each leaf node.)
(c) Compute the generalization error rate of the tree using the validation set
shown above. This approach is known as reduced error pruning.
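[Decision tree: the root node A splits into child nodes B and C (branches labeled 0 and 1); B and C each split on branches labeled 0 and 1 into leaves labeled, from left to right, +, −, +, −.]
Training:
Instance  A  B  C  Class
1         0  0  0  +
2         0  0  1  +
3         0  1  0  +
4         0  1  1  −
5         1  0  0  +
6         1  0  0  +
7         1  1  0  −
8         1  0  1  +
9         1  1  0  −
10        1  1  0  −
Validation:
Instance  A  B  C  Class
11        0  0  0  +
12        0  1  1  +
13        1  1  0  +
14        1  0  1  −
15        1  0  0  +
Figure 4.30. Decision tree and data sets for Exercise 8.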
9. Consider the decision trees shown in Figure 4.31. Assume they are generated
from a data set that contains 16 binary attributes and 3 classes, C1, C2, and
C3.
(a) Decision tree with 7 errors. (b) Decision tree with 4 errors.
Figure 4.31. Decision trees for Exercise 9.
Compute the total description length of each decision tree according to the
minimum description length principle.
• The total description length of a tree is given by:
Cost(tree, data) = Cost(tree) + Cost(data|tree).
• Each internal node of the tree is encoded by the ID of the splitting
attribute. If there are m attributes, the cost of encoding each attribute is
$\log_2 m$ bits.
• Each leaf is encoded using the ID of the class it is associated with. If
there are k classes, the cost of encoding a class is $\log_2 k$ bits.
• Cost(tree) is the cost of encoding all the nodes in the tree. To simplify the
computation, you can assume that the total cost of the tree is obtained
by adding up the costs of encoding each internal node and each leaf node.
• Cost(data|tree) is encoded using the classification errors the tree commits
on the training set. Each error is encoded by $\log_2 n$ bits, where n is the
total number of training instances.
Which decision tree is better, according to the MDL principle?
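The cost model above can be written as a small helper. This is only a sketch: the function name and the example tree (3 internal nodes, 4 leaves, 5 errors, n = 64 training instances) are hypothetical and do not correspond to either tree in Figure 4.31; only m = 16 and k = 3 come from the exercise statement.

```python
from math import log2

def total_description_length(n_internal, n_leaves, n_errors, m, k, n):
    """Cost(tree, data) = Cost(tree) + Cost(data|tree), in bits."""
    # Each internal node stores a splitting-attribute ID: log2(m) bits.
    # Each leaf stores a class ID: log2(k) bits.
    cost_tree = n_internal * log2(m) + n_leaves * log2(k)
    # Each training error is encoded with log2(n) bits.
    cost_data_given_tree = n_errors * log2(n)
    return cost_tree + cost_data_given_tree

# Hypothetical tree used only to exercise the formula.
cost = total_description_length(n_internal=3, n_leaves=4, n_errors=5, m=16, k=3, n=64)
print(f"{cost:.2f} bits")
```

Under the MDL principle, the tree with the smaller total is preferred, so the same helper can be called once per candidate tree and the results compared.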
10. While the .632 bootstrap approach is useful for obtaining a reliable estimate of
model accuracy, it has a known limitation [127]. Consider a two-class problem,
where there are an equal number of positive and negative examples in the data.
Suppose the class labels for the examples are generated randomly. The classifier
used is an unpruned decision tree (i.e., a perfect memorizer). Determine the
accuracy of the classifier using each of the following methods.
(a) The holdout method, where two-thirds of the data are used for training
and the remaining one-third are used for testing.
(b) Ten-fold cross-validation.
(c) The .632 bootstrap method.
(d) From the results in parts (a), (b), and (c), which method provides a more
reliable evaluation of the classifier’s accuracy?
11. Consider the following approach for testing whether a classifier A beats another
classifier B. Let N be the size of a given data set, p_A be the accuracy of classifier
A, p_B be the accuracy of classifier B, and p = (p_A + p_B)/2 be the average
accuracy for both classifiers. To test whether classifier A is significantly better
than B, the following Z-statistic is used:
$$Z = \frac{p_A - p_B}{\sqrt{\frac{2p(1-p)}{N}}}.$$
Classifier A is assumed to be better than classifier B if Z > 1.96.
Table 4.9 compares the accuracies of three different classifiers, decision tree
classifiers, naïve Bayes classifiers, and support vector machines, on various data
sets. (The latter two classifiers are described in Chapter 5.)
Table 4.9. Comparing the accuracy of various classification methods.
Data Set  Size (N)  Decision Tree (%)  Naïve Bayes (%)  Support vector machine (%)
Anneal 898 92.09 79.62 87.19
Australia 690 85.51 76.81 84.78
Auto 205 81.95 58.05 70.73
Breast 699 95.14 95.99 96.42
Cleve 303 76.24 83.50 84.49
Credit 690 85.80 77.54 85.07
Diabetes 768 72.40 75.91 76.82
German 1000 70.90 74.70 74.40
Glass 214 67.29 48.59 59.81
Heart 270 80.00 84.07 83.70
Hepatitis 155 81.94 83.23 87.10
Horse 368 85.33 78.80 82.61
Ionosphere 351 89.17 82.34 88.89
Iris 150 94.67 95.33 96.00
Labor 57 78.95 94.74 92.98
Led7 3200 73.34 73.16 73.56
Lymphography 148 77.03 83.11 86.49
Pima 768 74.35 76.04 76.95
Sonar 208 78.85 69.71 76.92
Tictactoe 958 83.72 70.04 98.33
Vehicle 846 71.04 45.04 74.94
Wine 178 94.38 96.63 98.88
Zoo 101 93.07 93.07 96.04
Summarize the performance of the classifiers given in Table 4.9 using the
following 3 × 3 table:
win-loss-draw            Decision tree   Naïve Bayes   Support vector machine
Decision tree            0 – 0 – 23
Naïve Bayes                              0 – 0 – 23
Support vector machine                                 0 – 0 – 23
Each cell in the table contains the number of wins, losses, and draws when
comparing the classifier in a given row to the classifier in a given column.
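A minimal sketch of the test in Exercise 11 follows. The function names and the accuracies passed to them are hypothetical, and treating Z < −1.96 as a loss is an assumption obtained by swapping the roles of A and B; the exercise itself states only the Z > 1.96 rule. Accuracies are given as fractions, not percentages.

```python
from math import sqrt

def z_statistic(p_a, p_b, n):
    """Z = (pA - pB) / sqrt(2p(1 - p)/N), with p the average accuracy."""
    p = (p_a + p_b) / 2
    # Undefined when p is exactly 0 or 1 (zero denominator).
    return (p_a - p_b) / sqrt(2 * p * (1 - p) / n)

def compare(p_a, p_b, n, threshold=1.96):
    """Return 'win', 'loss', or 'draw' for classifier A against classifier B."""
    z = z_statistic(p_a, p_b, n)
    if z > threshold:
        return "win"
    if z < -threshold:  # assumption: symmetric rule for B beating A
        return "loss"
    return "draw"

# Hypothetical accuracies on a data set of 1000 records.
print(compare(0.80, 0.75, 1000))
print(compare(0.76, 0.75, 1000))
```

Applying `compare` to each pair of columns in a row of Table 4.9, and counting the outcomes over all rows, would fill one cell of the 3 × 3 summary table.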
12. Let X be a binomial random variable with mean Np and variance Np(1 − p).
Show that the ratio X/N also has a binomial distribution with mean p and
variance p(1 − p)/N.
5 Classification: Alternative Techniques
The previous chapter described a simple, yet quite effective, classification
technique known as decision tree induction. Issues such as model overfitting and
classifier evaluation were also discussed in great detail. This chapter presents
alternative techniques for building classification models, from simple
techniques such as rule-based and nearest-neighbor classifiers to more advanced
techniques such as support vector machines and ensemble methods. Other
key issues such as the class imbalance and multiclass problems are also
discussed at the end of the chapter.
5.1 Rule-Based Classifier
A rule-based classifier is a technique for classifying records using a collection
of “if . . . then . . .” rules. Table 5.1 shows an example of a model generated by a
rule-based classifier for the vertebrate classification problem. The rules for the
model are represented in a disjunctive normal form, R = (r1 ∨ r2 ∨ . . . ∨ rk), where
R is known as the rule set and the ri’s are the classification rules or disjuncts.
Table 5.1. Example of a rule set for the vertebrate classification problem.
r1: (Gives Birth = no) ∧ (Aerial Creature = yes) −→ Birds
r2: (Gives Birth = no) ∧ (Aquatic Creature = yes) −→ Fishes
r3: (Gives Birth = yes) ∧ (Body Temperature = warm-blooded) −→ Mammals
r4: (Gives Birth = no) ∧ (Aerial Creature = no) −→ Reptiles
r5: (Aquatic Creature = semi) −→ Amphibians
From Chapter 5 of Introduction to Data Mining, First Edition. Pang-Ning Tan, Michael Steinbach,
Vipin Kumar. Copyright © 2006 by Pearson Education, Inc. All rights reserved.
Each classification rule can be expressed in the following way:
$$r_i : (\mathit{Condition}_i) \longrightarrow y_i. \tag{5.1}$$
The left-hand side of the rule is called the rule antecedent or precondition.
It contains a conjunction of attribute tests:
$$\mathit{Condition}_i = (A_1 \ \mathit{op} \ v_1) \land (A_2 \ \mathit{op} \ v_2) \land \ldots \land (A_k \ \mathit{op} \ v_k), \tag{5.2}$$
where $(A_j, v_j)$ is an attribute-value pair and op is a logical operator chosen
from the set {=, ≠, <, >, ≤, ≥}. Each attribute test $(A_j \ \mathit{op} \ v_j)$ is known as
a conjunct. The right-hand side of the rule is called the rule consequent,
which contains the predicted class $y_i$.
A rule r covers a record x if the precondition of r matches the attributes
of x. r is also said to be fired or triggered whenever it covers a given record.
For an illustration, consider the rule r1 given in Table 5.1 and the following
attributes for two vertebrates: hawk and grizzly bear.
Name          Body Temperature  Skin Cover  Gives Birth  Aquatic Creature  Aerial Creature  Has Legs  Hibernates
hawk          warm-blooded      feather     no           no                yes              yes       no
grizzly bear  warm-blooded      fur         yes          no                no               yes       yes
r1 covers the first vertebrate because its precondition is satisfied by the hawk’s
attributes. The rule does not cover the second vertebrate because grizzly bears
give birth to their young and cannot fly, thus violating the precondition of r1.
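The notion of a rule covering a record can be sketched in a few lines of Python. The representation below (an antecedent as a list of (attribute, operator, value) triples, plus the `covers` helper) is an illustrative assumption, not notation from the text; the rule and the two records are taken from Table 5.1 and the table above.

```python
import operator

# Logical operators allowed in an attribute test (A op v), as in Equation 5.2.
OPS = {
    "=": operator.eq, "!=": operator.ne,
    "<": operator.lt, ">": operator.gt,
    "<=": operator.le, ">=": operator.ge,
}

def covers(antecedent, record):
    """True if every conjunct (attribute, op, value) holds for the record."""
    return all(OPS[op](record[attr], value) for attr, op, value in antecedent)

# Rule r1 from Table 5.1: (Gives Birth = no) AND (Aerial Creature = yes) -> Birds
r1 = {
    "antecedent": [("Gives Birth", "=", "no"), ("Aerial Creature", "=", "yes")],
    "consequent": "Birds",
}

hawk = {"Body Temperature": "warm-blooded", "Skin Cover": "feather",
        "Gives Birth": "no", "Aquatic Creature": "no",
        "Aerial Creature": "yes", "Has Legs": "yes", "Hibernates": "no"}
grizzly_bear = {"Body Temperature": "warm-blooded", "Skin Cover": "fur",
                "Gives Birth": "yes", "Aquatic Creature": "no",
                "Aerial Creature": "no", "Has Legs": "yes", "Hibernates": "yes"}

hawk_covered = covers(r1["antecedent"], hawk)
bear_covered = covers(r1["antecedent"], grizzly_bear)
print(hawk_covered, bear_covered)
```

Keeping the operator as data rather than hard-coding equality means the same helper would also handle tests on continuous attributes, such as a hypothetical `("Weight", ">", 50)`.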
The quality of a classification rule can be evaluated using measures such as
coverage and accuracy. Given a data set D and a classification rule r : A −→ y,
the coverage of the rule is defined as the fraction of records in D that trigger
the rule r. On the other hand, its accuracy or confidence factor is defined as
the fraction of records triggered by r whose class labels are equal to y. The
formal definitions of these measures are
$$\mathrm{Coverage}(r) = \frac{|A|}{|D|}, \qquad \mathrm{Accuracy}(r) = \frac{|A \cap y|}{|A|}, \tag{5.3}$$
where |A| is the number of records that satisfy the rule antecedent, |A ∩ y| is
the number of records that satisfy both the antecedent and consequent, and
|D| is the total number of records.
Table 5.2. The vertebrate data set.
Name           Body Temperature  Skin Cover  Gives Birth  Aquatic Creature  Aerial Creature  Has Legs  Hibernates  Class Label
human          warm-blooded      hair        yes          no                no               yes       no          Mammals
python         cold-blooded      scales      no           no                no               no        yes         Reptiles
salmon         cold-blooded      scales      no           yes               no               no        no          Fishes
whale          warm-blooded      hair        yes          yes               no               no        no          Mammals
frog           cold-blooded      none        no           semi              no               yes       yes         Amphibians
komodo dragon  cold-blooded      scales      no           no                no               yes       no          Reptiles
bat            warm-blooded      hair        yes          no                yes              yes       yes         Mammals
pigeon         warm-blooded      feathers    no           no                yes              yes       no          Birds
cat            warm-blooded      fur         yes          no                no               yes       no          Mammals
guppy          cold-blooded      scales      yes          yes               no               no        no          Fishes
alligator      cold-blooded      scales      no           semi              no               yes       no          Reptiles
penguin        warm-blooded      feathers    no           semi              no               yes       no          Birds
porcupine      warm-blooded      quills      yes          no                no               yes       yes         Mammals
eel            cold-blooded      scales      no           yes               no               no        no          Fishes
salamander     cold-blooded      none        no           semi              no               yes       yes         Amphibians
Example 5.1. Consider the data set shown in Table 5.2. The rule
(Gives Birth = yes) ∧ (Body Temperature = warm-blooded) −→ Mammals
has a coverage of 33% since five of the fifteen records support the rule
antecedent. The rule accuracy is 100% because all five vertebrates covered by
the rule are mammals.
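The numbers in Example 5.1 can be reproduced with a short script. This is a sketch only: the tuple layout and the function name `coverage_and_accuracy` are assumptions made for illustration, and only the three columns of Table 5.2 that the rule needs are transcribed.

```python
# (name, body temperature, gives birth, class label), transcribed from Table 5.2.
records = [
    ("human", "warm-blooded", "yes", "Mammals"),
    ("python", "cold-blooded", "no", "Reptiles"),
    ("salmon", "cold-blooded", "no", "Fishes"),
    ("whale", "warm-blooded", "yes", "Mammals"),
    ("frog", "cold-blooded", "no", "Amphibians"),
    ("komodo dragon", "cold-blooded", "no", "Reptiles"),
    ("bat", "warm-blooded", "yes", "Mammals"),
    ("pigeon", "warm-blooded", "no", "Birds"),
    ("cat", "warm-blooded", "yes", "Mammals"),
    ("guppy", "cold-blooded", "yes", "Fishes"),
    ("alligator", "cold-blooded", "no", "Reptiles"),
    ("penguin", "warm-blooded", "no", "Birds"),
    ("porcupine", "warm-blooded", "yes", "Mammals"),
    ("eel", "cold-blooded", "no", "Fishes"),
    ("salamander", "cold-blooded", "no", "Amphibians"),
]

def coverage_and_accuracy(records, antecedent, label):
    """Coverage = |A| / |D| and Accuracy = |A and y| / |A| (Equation 5.3)."""
    covered = [r for r in records if antecedent(r)]
    correct = [r for r in covered if r[3] == label]
    coverage = len(covered) / len(records)
    accuracy = len(correct) / len(covered) if covered else 0.0
    return coverage, accuracy

def gives_birth_and_warm_blooded(r):
    # Antecedent: (Gives Birth = yes) AND (Body Temperature = warm-blooded)
    return r[2] == "yes" and r[1] == "warm-blooded"

coverage, accuracy = coverage_and_accuracy(records, gives_birth_and_warm_blooded, "Mammals")
print(f"coverage = {coverage:.2f}, accuracy = {accuracy:.2f}")
# prints: coverage = 0.33, accuracy = 1.00
```

Note that the guppy gives birth but is cold-blooded, so it does not satisfy the antecedent and does not lower the rule's accuracy.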
5.1.1 How a Rule-Based Classifier Works
A rule-based classifier classifies a test record based on the rule triggered by
the record. To illustrate how a rule-based classifier works, consider the rule
set shown in Table 5.1 and the following vertebrates:
Name           Body Temperature  Skin Cover  Gives Birth  Aquatic Creature  Aerial Creature  Has Legs  Hibernates
lemur          warm-blooded      fur         yes          no                no               yes       yes
turtle         cold-blooded      scales      no           semi              no               yes       no
dogfish shark  cold-blooded      scales      yes          yes               no               no        no
• The first vertebrate, which is a lemur, is warm-blooded and gives birth
to its young. It triggers the rule r3, and thus, is classified as a mammal.
• The second vertebrate, which is a turtle, triggers the rules r4 and r5.
Since the classes predicted by the rules are contradictory (reptiles versus
amphibians), their conflicting classes must be resolved.
• None of the rules are applicable to a dogfish shark. In this case, we
need to ensure that the classifier can still make a reliable prediction even
though a test record is not covered by any rule.
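The three cases above can be checked mechanically. In this sketch the rules of Table 5.1 are encoded as dictionaries of required attribute values, which suffices because every test in that table is an equality test; the names `RULES` and `triggered_rules` are illustrative assumptions.

```python
# Rule set from Table 5.1: (name, antecedent as attribute -> required value, class).
RULES = [
    ("r1", {"Gives Birth": "no", "Aerial Creature": "yes"}, "Birds"),
    ("r2", {"Gives Birth": "no", "Aquatic Creature": "yes"}, "Fishes"),
    ("r3", {"Gives Birth": "yes", "Body Temperature": "warm-blooded"}, "Mammals"),
    ("r4", {"Gives Birth": "no", "Aerial Creature": "no"}, "Reptiles"),
    ("r5", {"Aquatic Creature": "semi"}, "Amphibians"),
]

def triggered_rules(record):
    """Names of all rules whose antecedent is satisfied by the record."""
    return [name for name, antecedent, _ in RULES
            if all(record.get(attr) == value for attr, value in antecedent.items())]

animals = {
    "lemur": {"Body Temperature": "warm-blooded", "Gives Birth": "yes",
              "Aquatic Creature": "no", "Aerial Creature": "no"},
    "turtle": {"Body Temperature": "cold-blooded", "Gives Birth": "no",
               "Aquatic Creature": "semi", "Aerial Creature": "no"},
    "dogfish shark": {"Body Temperature": "cold-blooded", "Gives Birth": "yes",
                      "Aquatic Creature": "yes", "Aerial Creature": "no"},
}

fired = {name: triggered_rules(record) for name, record in animals.items()}
for name, rules in fired.items():
    print(name, rules)
```

The turtle fires two rules with different consequents and the dogfish shark fires none, which are exactly the two situations a rule set must be prepared to handle.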
The previous example illustrates two important properties of the rule set
generated by a rule-based classifier.
Mutually Exclusive Rules The rules in a rule set R are mutually exclusive
if no two rules in R are triggered by the same record. This property ensures
that every record is covered by at most one rule in R. An example of a
mutually exclusive rule set is shown in Table 5.3.
Exhaustive Rules A rule set R has exhaustive coverage if there is a rule
for each combination of attribute values. This property ensures that every
record is covered by at least one rule in R. Assuming that Body Temperature
and Gives Birth are binary variables, the rule set shown in Table 5.3 has
exhaustive coverage.
Table 5.3. Example of a mutually exclusive and exhaustive rule set.
r1: (Body Temperature = cold-blooded) −→ Non-mammals
r2: (Body Temperature = warm-blooded) ∧ (Gives Birth = yes) −→ Mammals
r3: (Body Temperature = warm-blooded) ∧ (Gives Birth = no) −→ Non-mammals
Together, these properties ensure that every record is covered by exactly
one rule. Unfortunately, many rule-based classifiers, including the one shown
in Table 5.1, do not have such properties. If the rule set is not exhaustive,
then a default rule, rd : () −→ yd, must be added to cover the remaining
cases. A default rule has an empty antecedent and is triggered when all other
rules have failed. yd is known as the default class and is typically assigned to
the majority class of training records not covered by the existing rules.
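For a small attribute space, both properties can be verified by brute force: enumerate every combination of attribute values and count how many rules cover it. The sketch below does this for the rule set of Table 5.3, assuming (as the text does) that Body Temperature and Gives Birth are binary; the variable names are illustrative.

```python
from itertools import product

# Attribute domains, assuming both attributes are binary.
DOMAINS = {
    "Body Temperature": ["cold-blooded", "warm-blooded"],
    "Gives Birth": ["yes", "no"],
}

# Rule set from Table 5.3: (antecedent, class).
RULES = [
    ({"Body Temperature": "cold-blooded"}, "Non-mammals"),
    ({"Body Temperature": "warm-blooded", "Gives Birth": "yes"}, "Mammals"),
    ({"Body Temperature": "warm-blooded", "Gives Birth": "no"}, "Non-mammals"),
]

def covering_rules(record):
    """Indices (1-based) of the rules that cover the record."""
    return [i for i, (antecedent, _) in enumerate(RULES, start=1)
            if all(record[attr] == value for attr, value in antecedent.items())]

attributes = list(DOMAINS)
coverage_counts = []
for values in product(*(DOMAINS[a] for a in attributes)):
    record = dict(zip(attributes, values))
    coverage_counts.append(len(covering_rules(record)))

mutually_exclusive = all(count <= 1 for count in coverage_counts)  # at most one rule
exhaustive = all(count >= 1 for count in coverage_counts)          # at least one rule
print(mutually_exclusive, exhaustive)
```

Enumeration grows exponentially with the number of attributes, so this check is practical only for toy rule sets such as this one.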
If the rule set is not mutually exclusive, then a record can be covered by
several rules, some of which may predict conflicting classes. There are two
ways to overcome this problem.
Ordered Rules In this approach, the rules in a rule set are ordered in
decreasing order of their priority, which can be defined in many ways (e.g.,
based on accuracy, coverage, total description length, or the order in which
the rules are generated). An ordered rule set is also known as a decision
list. When a test record is presented, it is classified by the highest-ranked rule
that covers the record. This avoids the problem of having conflicting classes
predicted by multiple classification rules.
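A decision list with a default rule can be sketched as a first-match loop. Two assumptions are made here for illustration: the rules of Table 5.1 are ranked in the order they are listed, and the default class is Fishes, since the guppy is the only record of Table 5.2 that none of the five rules covers.

```python
# Ordered rule set (decision list): Table 5.1 rules, ranked in listed order.
# The ranking is an assumption made for this sketch.
DECISION_LIST = [
    ({"Gives Birth": "no", "Aerial Creature": "yes"}, "Birds"),
    ({"Gives Birth": "no", "Aquatic Creature": "yes"}, "Fishes"),
    ({"Gives Birth": "yes", "Body Temperature": "warm-blooded"}, "Mammals"),
    ({"Gives Birth": "no", "Aerial Creature": "no"}, "Reptiles"),
    ({"Aquatic Creature": "semi"}, "Amphibians"),
]

# Assumed default class: majority class of the Table 5.2 records left uncovered.
DEFAULT_CLASS = "Fishes"

def classify(record, rules=DECISION_LIST, default_class=DEFAULT_CLASS):
    """Return the class of the highest-ranked rule covering the record."""
    for antecedent, label in rules:
        if all(record.get(attr) == value for attr, value in antecedent.items()):
            return label          # first match wins; lower-ranked rules are ignored
    return default_class          # default rule: empty antecedent

turtle = {"Body Temperature": "cold-blooded", "Gives Birth": "no",
          "Aquatic Creature": "semi", "Aerial Creature": "no"}
dogfish_shark = {"Body Temperature": "cold-blooded", "Gives Birth": "yes",
                 "Aquatic Creature": "yes", "Aerial Creature": "no"}

turtle_label = classify(turtle)
shark_label = classify(dogfish_shark)
print(turtle_label, shark_label)
```

With this ranking the turtle is labeled by r4 and never reaches r5, so the conflict between the two rules disappears; a different ranking could produce a different label.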
Unordered Rules This approach allows a test record to trigger multiple
classification rules and considers the consequent of each rule as a vote for
a particular class. The votes are then tallied to determine the class label
of the test record. The record is usually assigned to the class that receives
the highest number of votes. In some cases, the vote may be weighted by
the rule’s accuracy. Using unordered rules to build a rule-based classifier has
both advantages and disadvantages. Unordered rules are less susceptible to
errors caused by the wrong rule being selected to classify a test record (unlike
classifiers based on ordered rules, which are sensitive to the choice of rule
ordering criteria). Model building is also less expensive because the rules do
not have to be kept in sorted order. Nevertheless, classifying a test record can
be quite an expensive task because the attributes of the test record must be
compared against the precondition of every rule in the rule set.
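Accuracy-weighted voting can be sketched as follows for the turtle, which triggers r4 and r5 of Table 5.1. The weights are the accuracies of those two rules measured on Table 5.2 (r4 covers eight records, three of them reptiles; r5 covers four, two of them amphibians); the function name and data layout are illustrative assumptions.

```python
from collections import defaultdict

def weighted_vote(fired_rules):
    """Sum the weights of the fired rules per class and return the top class."""
    totals = defaultdict(float)
    for label, weight in fired_rules:
        totals[label] += weight
    winner = max(totals, key=totals.get)
    return winner, dict(totals)

# Rules fired by the turtle: (predicted class, rule accuracy on Table 5.2).
turtle_fired = [
    ("Reptiles", 3 / 8),    # r4
    ("Amphibians", 2 / 4),  # r5
]

winner, totals = weighted_vote(turtle_fired)
print(winner, totals)
```

With unweighted votes the two classes would tie at one vote each, so some tie-breaking policy would be needed; weighting by accuracy resolves the tie here, although not in favor of the turtle's true class.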
In the remainder of this section, we will focus on rule-based classifiers that
use ordered rules.
5.1.2 Rule-Ordering Schemes
Rule ordering can be implemented on a rule-by-rule basis or on a class-by-class
basis. The difference between these schemes is illustrated in Figure 5.1.
Rule-Based Ordering Scheme This approach orders the individual rules
by some rule quality measure. This ordering scheme ensures that every test
record is classified by the “best” rule covering it. A potential drawback of this
scheme is that lower-ranked rules are much harder to interpret because they
assume the negation of the rules preceding them. For example, the fourth rule
shown in Figure 5.1 for rule-based ordering,
Aquatic Creature = semi −→ Amphibians,
has the following interpretation: If the vertebrate does not have any feathers
or cannot fly, and is cold-blooded and semi-aquatic, then it is an amphibian.
Chapter 5 Classification: Alternative Techniques
Rule-Based Ordering
(Skin Cover=feathers, Aerial Creature=yes) ==> Birds
(Body temperature=warm-blooded, Gives Birth=yes) ==> Mammals
(Body temperature=warm-blooded, Gives Birth=no) ==> Birds
(Aquatic Creature=semi) ==> Amphibians
(Skin Cover=scales, Aquatic Creature=no) ==> Reptiles
(Skin Cover=scales, Aquatic Creature=yes) ==> Fishes
(Skin Cover=none) ==> Amphibians
Class-Based Ordering
(Skin Cover=feathers, Aerial Creature=yes) ==> Birds
(Body temperature=warm-blooded, Gives Birth=no) ==> Birds
(Body temperature=warm-blooded, Gives Birth=yes) ==> Mammals
(Aquatic Creature=semi) ==> Amphibians
(Skin Cover=none) ==> Amphibians
(Skin Cover=scales, Aquatic Creature=no) ==> Reptiles
(Skin Cover=scales, Aquatic Creature=yes) ==> Fishes
Figure 5.1. Comparison between rule-based and class-based ordering schemes.
The additional conditions (that the vertebrate does not have any feathers or
cannot fly, and is cold-blooded) are due to the fact that the vertebrate does
not satisfy the first three rules. If the number of rules is large, interpreting the
meaning of the rules residing near the bottom of the list can be a cumbersome
task.
Class-Based Ordering Scheme In this approach, rules that belong to the
same class appear together in the rule set R. The rules are then collectively
sorted on the basis of their class information. The relative ordering among the
rules from the same class is not important; as long as one of the rules fires,
the class will be assigned to the test record. This makes rule interpretation
slightly easier. However, it is possible for a high-quality rule to be overlooked
in favor of an inferior rule that happens to predict the higher-ranked class.
Since most of the well-known rule-based classifiers (such as C4.5rules and
RIPPER) employ the class-based ordering scheme, the discussion in the
remainder of this section focuses mainly on this type of ordering scheme.
5.1.3 How to Build a Rule-Based Classifier
To build a rule-based classifier, we need to extract a set of rules that identifies
key relationships between the attributes of a data set and the class label.
5.1 RuleBased Classifier
There are two broad classes of methods for extracting classification rules: (1)
direct methods, which extract classification rules directly from data, and (2)
indirect methods, which extract classification rules from other classification
models, such as decision trees and neural networks.
Direct methods partition the attribute space into smaller subspaces so that
all the records that belong to a subspace can be classified using a single
classification rule. Indirect methods use the classification rules to provide a succinct
description of more complex classification models. Detailed discussions of these
methods are presented in Sections 5.1.4 and 5.1.5, respectively.
5.1.4 Direct Methods for Rule Extraction
The sequential covering algorithm is often used to extract rules directly
from data. Rules are grown in a greedy fashion based on a certain evaluation
measure. The algorithm extracts the rules one class at a time for data sets
that contain more than two classes. For the vertebrate classification problem,
the sequential covering algorithm may generate rules for classifying birds first,
followed by rules for classifying mammals, amphibians, reptiles, and finally,
fishes (see Figure 5.1). The criterion for deciding which class should be
generated first depends on a number of factors, such as the class prevalence (i.e.,
fraction of training records that belong to a particular class) or the cost of
misclassifying records from a given class.
A summary of the sequential covering algorithm is given in Algorithm
5.1. The algorithm starts with an empty decision list, R. The Learn-One-Rule
function is then used to extract the best rule for class y that covers the
current set of training records. During rule extraction, all training records
for class y are considered to be positive examples, while those that belong to
other classes are considered to be negative examples. A rule is desirable if it
covers most of the positive examples and none (or very few) of the negative
examples. Once such a rule is found, the training records covered by the rule
are eliminated. The new rule is added to the bottom of the decision list R.
This procedure is repeated until the stopping criterion is met. The algorithm
then proceeds to generate rules for the next class.
Algorithm 5.1 Sequential covering algorithm.
1: Let E be the training records and A be the set of attribute-value pairs, {(Aj , vj )}.
2: Let Yo be an ordered set of classes {y1, y2, . . . , yk}.
3: Let R = { } be the initial rule list.
4: for each class y ∈ Yo − {yk} do
5:   while stopping condition is not met do
6:     r ← Learn-One-Rule (E, A, y).
7:     Remove training records from E that are covered by r.
8:     Add r to the bottom of the rule list: R ← R ∨ r.
9:   end while
10: end for
11: Insert the default rule, {} −→ yk, to the bottom of the rule list R.
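The sequential covering procedure can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the rule-learning step is a deliberately simple stand-in (`learn_single_conjunct`, a hypothetical helper that picks the single attribute-value test covering the most positive and no negative records), and returning `None` plays the role of the stopping condition.

```python
def covers(antecedent, attributes):
    """True when every conjunct of the antecedent matches the record."""
    return all(attributes.get(a) == v for a, v in antecedent.items())


def learn_single_conjunct(records, target):
    """Hypothetical stand-in for Learn-One-Rule: choose the one-conjunct rule
    that covers the most positive records and no negative records."""
    candidates = sorted({(a, v) for x, _ in records for a, v in x.items()})
    best, best_pos = None, 0
    for attr, value in candidates:
        covered = [y for x, y in records if x.get(attr) == value]
        pos = sum(1 for y in covered if y == target)
        neg = len(covered) - pos
        if neg == 0 and pos > best_pos:
            best, best_pos = {attr: value}, pos
    return best  # None signals that no acceptable rule remains


def sequential_covering(records, class_order, learn_one_rule):
    """records: list of (attributes, label); the last class is the default."""
    remaining = list(records)
    rule_list = []
    for target in class_order[:-1]:
        while True:
            antecedent = learn_one_rule(remaining, target)
            if antecedent is None:  # stopping condition for this class
                break
            # Remove every covered record, positive or negative.
            remaining = [(x, y) for x, y in remaining
                         if not covers(antecedent, x)]
            rule_list.append((antecedent, target))
    rule_list.append(({}, class_order[-1]))  # default rule {} -> y_k
    return rule_list


# Toy data, made up for illustration.
data = [
    ({"Gives Birth": "yes", "Skin Cover": "hair"}, "Mammals"),
    ({"Gives Birth": "yes", "Skin Cover": "hair"}, "Mammals"),
    ({"Gives Birth": "no", "Skin Cover": "feathers"}, "Birds"),
    ({"Gives Birth": "no", "Skin Cover": "scales"}, "Reptiles"),
]
for rule in sequential_covering(data, ["Birds", "Mammals", "Reptiles"],
                                learn_single_conjunct):
    print(rule)
```

Passing the rule learner as an argument keeps the covering loop independent of how an individual rule is grown, evaluated, or pruned.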
Figure 5.2 demonstrates how the sequential covering algorithm works for
a data set that contains a collection of positive and negative examples. The
rule R1, whose coverage is shown in Figure 5.2(b), is extracted first because
it covers the largest fraction of positive examples. All the training records
covered by R1 are subsequently removed and the algorithm proceeds to look
for the next best rule, which is R2.
(a) Original Data (b) Step 1 (c) Step 2 (d) Step 3
Figure 5.2. An example of the sequential covering algorithm. The regions covered by
the extracted rules are labeled R1 and R2.
Learn-One-Rule Function
The objective of the Learn-One-Rule function is to extract a classification
rule that covers many of the positive examples and none (or very few) of the
negative examples in the training set. However, finding an optimal rule is
computationally expensive given the exponential size of the search space. The
Learn-One-Rule function addresses the exponential search problem by growing
the rules in a greedy fashion. It generates an initial rule r and keeps refining
the rule until a certain stopping criterion is met. The rule is then pruned to
improve its generalization error.
Rule-Growing Strategy There are two common strategies for growing a
classification rule: general-to-specific or specific-to-general. Under the
general-to-specific strategy, an initial rule r : {} −→ y is created, where the left-hand
side is an empty set and the right-hand side contains the target class. The rule
has poor quality because it covers all the examples in the training set. New
conjuncts are subsequently added to improve the rule’s quality. Figure 5.3(a)
shows the general-to-specific rule-growing strategy for the vertebrate
classification problem. The conjunct Body Temperature=warm-blooded is initially
chosen to form the rule antecedent. The algorithm then explores all the
possible candidates and greedily chooses the next conjunct, Gives Birth=yes, to
be added into the rule antecedent. This process continues until the stopping
criterion is met (e.g., when the added conjunct does not improve the quality
of the rule).
(a) General-to-specific
{ } => Mammals
  Skin Cover=hair => Mammals
  Body Temperature=warm-blooded => Mammals
    Body Temperature=warm-blooded, Has Legs=yes => Mammals
    Body Temperature=warm-blooded, Gives Birth=yes => Mammals
    . . .
  Has Legs=no => Mammals
  . . .
(b) Specific-to-general
Body Temperature=warm-blooded, Skin Cover=hair, Gives Birth=yes,
Aquatic Creature=no, Aerial Creature=no, Has Legs=yes, Hibernates=no => Mammals
  Skin Cover=hair, Gives Birth=yes, Aquatic Creature=no, Aerial Creature=no,
  Has Legs=yes, Hibernates=no => Mammals
  Body Temperature=warm-blooded, Skin Cover=hair, Gives Birth=yes,
  Aquatic Creature=no, Aerial Creature=no, Has Legs=yes => Mammals
  . . .
Figure 5.3. General-to-specific and specific-to-general rule-growing strategies.
For the specific-to-general strategy, one of the positive examples is
randomly chosen as the initial seed for the rule-growing process. During the
refinement step, the rule is generalized by removing one of its conjuncts so
that it can cover more positive examples. Figure 5.3(b) shows the
specific-to-general approach for the vertebrate classification problem. Suppose a positive
example for mammals is chosen as the initial seed. The initial rule contains
the same conjuncts as the attribute values of the seed. To improve its
coverage, the rule is generalized by removing the conjunct Hibernates=no. The
refinement step is repeated until the stopping criterion is met, e.g., when the
rule starts covering negative examples.
The previous approaches may produce suboptimal rules because the rules
are grown in a greedy fashion. To mitigate this problem, a beam search may be
used, where k of the best candidate rules are maintained by the algorithm.
Each candidate rule is then grown separately by adding (or removing) a
conjunct from its antecedent. The quality of the candidates is evaluated and the
k best candidates are chosen for the next iteration.
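The greedy general-to-specific strategy can be sketched as follows. This is an illustrative sketch rather than a definitive implementation: rule accuracy is assumed as the quality measure, ties are broken by the sorted order of the candidate conjuncts, and the five training records are made up for the example.

```python
def covers(antecedent, attributes):
    """True when every conjunct of the antecedent matches the record."""
    return all(attributes.get(a) == v for a, v in antecedent.items())


def accuracy(antecedent, records, target):
    """Fraction of covered records that belong to the target class."""
    covered = [y for x, y in records if covers(antecedent, x)]
    return sum(y == target for y in covered) / len(covered) if covered else 0.0


def grow_rule(records, target, quality=accuracy):
    """Start from the empty antecedent and greedily add the best conjunct
    until no candidate improves the quality measure."""
    rule = {}
    best_q = quality(rule, records, target)
    while True:
        candidates = sorted({(a, v) for x, _ in records
                             for a, v in x.items() if a not in rule})
        scored = [(quality({**rule, a: v}, records, target), a, v)
                  for a, v in candidates]
        if not scored:
            break
        q, attr, value = max(scored, key=lambda t: t[0])
        if q <= best_q:  # stopping criterion: no improvement
            break
        rule[attr] = value
        best_q = q
    return rule


# Hypothetical records: (attributes, label), "+" marks a mammal.
records = [
    ({"Body Temperature": "warm-blooded", "Gives Birth": "yes"}, "+"),
    ({"Body Temperature": "warm-blooded", "Gives Birth": "yes"}, "+"),
    ({"Body Temperature": "warm-blooded", "Gives Birth": "no"}, "-"),
    ({"Body Temperature": "cold-blooded", "Gives Birth": "no"}, "-"),
    ({"Body Temperature": "cold-blooded", "Gives Birth": "yes"}, "-"),
]
print(grow_rule(records, "+"))
```

A beam search would keep the k highest-scoring antecedents at each step instead of only the single best one.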
Rule Evaluation An evaluation metric is needed to determine which
conjunct should be added (or removed) during the rule-growing process.
Accuracy is an obvious choice because it explicitly measures the fraction of training
examples classified correctly by the rule. However, a potential limitation of
accuracy is that it does not take into account the rule’s coverage. For example,
consider a training set that contains 60 positive examples and 100 negative
examples. Suppose we are given the following two candidate rules:
Rule r1: covers 50 positive examples and 5 negative examples,
Rule r2: covers 2 positive examples and no negative examples.
The accuracies for r1 and r2 are 90.9% and 100%, respectively. However,
r1 is the better rule despite its lower accuracy. The high accuracy for r2 is
potentially spurious because the coverage of the rule is too low.
The following approaches can be used to handle this problem.
1. A statistical test can be used to prune rules that have poor coverage.
For example, we may compute the following likelihood ratio statistic:
\[ R = 2 \sum_{i=1}^{k} f_i \log(f_i / e_i), \]
where k is the number of classes, fi is the observed frequency of class i
examples that are covered by the rule, and ei is the expected frequency
of a rule that makes random predictions. Note that R has a chi-square
distribution with k − 1 degrees of freedom. A large R value suggests
that the number of correct predictions made by the rule is significantly
larger than that expected by random guessing. For example, since r1
covers 55 examples, the expected frequency for the positive class is e+ =
55 × 60/160 = 20.625, while the expected frequency for the negative class
is e− = 55 × 100/160 = 34.375. Thus, the likelihood ratio for r1 is
\[ R(r_1) = 2 \times [50 \times \log_2(50/20.625) + 5 \times \log_2(5/34.375)] = 99.9. \]
Similarly, the expected frequencies for r2 are e+ = 2 × 60/160 = 0.75
and e− = 2 × 100/160 = 1.25. The likelihood ratio statistic for r2 is
\[ R(r_2) = 2 \times [2 \times \log_2(2/0.75) + 0 \times \log_2(0/1.25)] = 5.66. \]
This statistic therefore suggests that r1 is a better rule than r2.
2. An evaluation metric that takes into account the rule coverage can be
used. Consider the following evaluation metrics:
\[ \text{Laplace} = \frac{f_+ + 1}{n + k}, \tag{5.4} \]
\[ \text{m-estimate} = \frac{f_+ + k p_+}{n + k}, \tag{5.5} \]
where n is the number of examples covered by the rule, f+ is the number
of positive examples covered by the rule, k is the total number of classes,
and p+ is the prior probability for the positive class. Note that the
m-estimate is equivalent to the Laplace measure by choosing p+ = 1/k.
Depending on the rule coverage, these measures capture the tradeoff
between rule accuracy and the prior probability of the positive class. If
the rule does not cover any training example, then the Laplace
measure reduces to 1/k, which is the prior probability of the positive class
assuming a uniform class distribution. The m-estimate also reduces to
the prior probability (p+) when n = 0. However, if the rule coverage
is large, then both measures asymptotically approach the rule accuracy,
f+/n. Going back to the previous example, the Laplace measure for
r1 is 51/57 = 89.47%, which is quite close to its accuracy. Conversely,
the Laplace measure for r2 (75%) is significantly lower than its accuracy
because r2 has a much lower coverage.
3. An evaluation metric that takes into account the support count of the
rule can be used. One such metric is FOIL’s information gain.
The support count of a rule corresponds to the number of positive
examples covered by the rule. Suppose the rule r : A −→ + covers p0 positive
examples and n0 negative examples. After adding a new conjunct B, the
extended rule r′ : A ∧ B −→ + covers p1 positive examples and n1
negative examples. Given this information, the FOIL’s information gain of
the extended rule is defined as follows:
\[ \text{FOIL's information gain} = p_1 \times \left( \log_2 \frac{p_1}{p_1 + n_1} - \log_2 \frac{p_0}{p_0 + n_0} \right). \tag{5.6} \]
Since the measure increases with both p1 and p1/(p1 + n1), it prefers rules
that have high support count and accuracy. Taking the empty rule as the
starting point (p0 = 60, n0 = 100), the FOIL’s information gains for rules
r1 and r2 given in the preceding example are 63.87 and 2.83,
respectively. Therefore, r1 is a better rule than r2.
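The three kinds of measure can be evaluated for r1 and r2 with a short Python sketch. The function names are illustrative, the base-2 logarithm is assumed for the likelihood ratio because the worked example uses it, and a term with zero observed frequency is taken to contribute zero.

```python
import math


def likelihood_ratio(covered, totals):
    """covered[i]: class-i examples covered by the rule;
    totals[i]: class-i examples in the whole training set."""
    n, size = sum(covered), sum(totals)
    stat = 0.0
    for f, total in zip(covered, totals):
        expected = n * total / size
        if f > 0:  # a zero-frequency term contributes nothing
            stat += f * math.log2(f / expected)
    return 2 * stat


def laplace(f_pos, n, k):
    return (f_pos + 1) / (n + k)


def m_estimate(f_pos, n, k, p_pos):
    return (f_pos + k * p_pos) / (n + k)


def foil_gain(p0, n0, p1, n1):
    return p1 * (math.log2(p1 / (p1 + n1)) - math.log2(p0 / (p0 + n0)))


totals = (60, 100)            # positive and negative training examples
r1, r2 = (50, 5), (2, 0)      # (positive, negative) examples covered

for name, (pos, neg) in (("r1", r1), ("r2", r2)):
    print(name,
          round(likelihood_ratio((pos, neg), totals), 2),
          round(laplace(pos, pos + neg, k=2), 4),
          round(foil_gain(60, 100, pos, neg), 2))
```

Evaluating Equation 5.6 directly with p0 = 60 and n0 = 100 gives roughly 63.87 for r1 and 2.83 for r2, so all three measures rank r1 above r2.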
Rule Pruning The rules generated by the Learn-One-Rule function can be
pruned to improve their generalization errors. To determine whether pruning
is necessary, we may apply the methods described in Section 4.4 on page
172 to estimate the generalization error of a rule. For example, if the error
on the validation set decreases after pruning, we should keep the simplified rule.
Another approach is to compare the pessimistic error of the rule before and
after pruning (see Section 4.4.4 on page 179). The simplified rule is retained
in place of the original rule if the pessimistic error improves after pruning.
Rationale for Sequential Covering
After a rule is extracted, the sequential covering algorithm must eliminate
all the positive and negative examples covered by the rule. The rationale for
doing this is given in the next example.
Figure 5.4. Elimination of training records by the sequential covering algorithm. R1, R2, and R3
represent regions covered by three different rules. (The plot shows the class = + and class = –
examples together with the regions R1, R2, and R3.)
Figure 5.4 shows three possible rules, R1, R2, and R3, extracted from a
data set that contains 29 positive examples and 21 negative examples. The
accuracies of R1, R2, and R3 are 12/15 (80%), 7/10 (70%), and 8/12 (66.7%),
respectively. R1 is generated first because it has the highest accuracy. After
generating R1, it is clear that the positive examples covered by the rule must be
removed so that the next rule generated by the algorithm is different from R1.
Next, suppose the algorithm is given the choice of generating either R2 or R3.
Even though R2 has higher accuracy than R3, R1 and R3 together cover 18
positive examples and 5 negative examples (resulting in an overall accuracy of
78.3%), whereas R1 and R2 together cover 19 positive examples and 6 negative
examples (resulting in an overall accuracy of 76%). The incremental impact of
R2 or R3 on accuracy is more evident when the positive and negative examples
covered by R1 are removed before computing their accuracies. In particular, if
positive examples covered by R1 are not removed, then we may overestimate
the effective accuracy of R3, and if negative examples are not removed, then
we may underestimate the accuracy of R3. In the latter case, we might end up
preferring R2 over R3 even though half of the false positive errors committed
by R3 have already been accounted for by the preceding rule, R1.
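The arithmetic behind this argument can be checked with a few lines of Python. The counts are those quoted for Figure 5.4; the overlap between R1 and R3 is not stated directly and is derived here from the combined coverage figures.

```python
def accuracy(pos, neg):
    return pos / (pos + neg)


# (positive, negative) examples covered, as quoted in the text.
R1, R2, R3 = (12, 3), (7, 3), (8, 4)
R1_AND_R2 = (19, 6)
R1_AND_R3 = (18, 5)

# Examples shared by R1 and R3, derived from the combined coverage.
overlap_pos = R1[0] + R3[0] - R1_AND_R3[0]
overlap_neg = R1[1] + R3[1] - R1_AND_R3[1]

# Coverage of R3 once the records covered by R1 have been removed.
r3_after_removal = (R3[0] - overlap_pos, R3[1] - overlap_neg)

print("R2 alone:", accuracy(*R2))
print("R3 alone:", round(accuracy(*R3), 3))
print("R3 after removing R1's records:", accuracy(*r3_after_removal))
print("R1 + R2:", accuracy(*R1_AND_R2))
print("R1 + R3:", round(accuracy(*R1_AND_R3), 3))
```

Once the records covered by R1 are removed, R3 covers 6 positive and 2 negative examples (75%), which is higher than the 70% of R2 and agrees with the comparison of the combined rule sets.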
RIPPER Algorithm
To illustrate the direct method, we consider a widely used rule induction
algorithm called RIPPER. This algorithm scales almost linearly with the number
of training examples and is particularly suited for building models from data
sets with imbalanced class distributions. RIPPER also works well with noisy
data sets because it uses a validation set to prevent model overfitting.
For two-class problems, RIPPER chooses the majority class as its default
class and learns the rules for detecting the minority class. For multiclass
problems, the classes are ordered according to their frequencies. Let (y1, y2, . . . , yc)
be the ordered classes, where y1 is the least frequent class and yc is the most
frequent class. During the first iteration, instances that belong to y1 are
labeled as positive examples, while those that belong to other classes are labeled
as negative examples. The sequential covering method is used to generate rules
that discriminate between the positive and negative examples. Next, RIPPER
extracts rules that distinguish y2 from other remaining classes. This process
is repeated until we are left with yc, which is designated as the default class.
Rule Growing RIPPER employs a general-to-specific strategy to grow a
rule and the FOIL’s information gain measure to choose the best conjunct
to be added into the rule antecedent. It stops adding conjuncts when the
rule no longer covers any negative examples. The new rule is then pruned based
on its performance on the validation set. The following metric is computed to
determine whether pruning is needed: (p−n)/(p+n), where p (n) is the number
of positive (negative) examples in the validation set covered by the rule. This
metric is monotonically related to the rule’s accuracy on the validation set. If
the metric improves after pruning, then the conjunct is removed. Pruning is
done starting from the last conjunct added to the rule. For example, given a
rule ABCD −→ y, RIPPER checks whether D should be pruned first, followed
by CD, BCD, etc. While the original rule covers only positive examples, the
pruned rule may cover some of the negative examples in the training set.
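The pruning step can be sketched as follows. This is a simplified illustration and not RIPPER's actual implementation: the rule is assumed to be a list of attribute-value conjuncts in the order they were added, trailing conjuncts are dropped one suffix at a time, and a rule that covers no validation example is assumed to receive the worst possible score.

```python
def prune_metric(conjuncts, validation, target):
    """(p - n) / (p + n) over the validation examples covered by the rule."""
    covered = [y for x, y in validation
               if all(x.get(a) == v for a, v in conjuncts)]
    p = sum(y == target for y in covered)
    n = len(covered) - p
    return (p - n) / (p + n) if covered else -1.0  # assumption: no coverage is worst


def prune_rule(conjuncts, validation, target):
    """Try deleting the last conjunct, then the last two, and so on,
    keeping the version with the best validation metric."""
    best = list(conjuncts)
    best_score = prune_metric(best, validation, target)
    for cut in range(len(conjuncts) - 1, 0, -1):
        candidate = list(conjuncts[:cut])
        score = prune_metric(candidate, validation, target)
        if score > best_score:
            best, best_score = candidate, score
    return best


# Hypothetical rule A=1, B=1, C=1 -> "+" and a made-up validation set.
rule = [("A", 1), ("B", 1), ("C", 1)]
validation = [
    ({"A": 1, "B": 1, "C": 1}, "+"),
    ({"A": 1, "B": 1, "C": 1}, "-"),
    ({"A": 1, "B": 1, "C": 0}, "+"),
    ({"A": 1, "B": 1, "C": 0}, "+"),
    ({"A": 1, "B": 0, "C": 0}, "-"),
    ({"A": 1, "B": 0, "C": 1}, "-"),
]
print(prune_rule(rule, validation, "+"))
```

In this example the full rule scores 0, dropping C raises the score to 0.5, and dropping both B and C lowers it back to 0, so the rule is pruned to its first two conjuncts.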
Building the Rule Set After generating a rule, all the positive and negative
examples covered by the rule are eliminated. The rule is then added into the
rule set as long as it does not violate the stopping condition, which is based
on the minimum description length principle. If the new rule increases the
total description length of the rule set by at least d bits, then RIPPER stops
adding rules into its rule set (by default, d is chosen to be 64 bits). Another
stopping condition used by RIPPER is that the error rate of the rule on the
validation set must not exceed 50%.
RIPPER also performs additional optimization steps to determine whether
some of the existing rules in the rule set can be replaced by better alternative
rules. Readers who are interested in the details of the optimization method
may refer to the reference cited at the end of this chapter.
5.1.5 Indirect Methods for Rule Extraction
This section presents a method for generating a rule set from a decision tree.
In principle, every path from the root node to the leaf node of a decision tree
can be expressed as a classification rule. The test conditions encountered along
the path form the conjuncts of the rule antecedent, while the class label at the
leaf node is assigned to the rule consequent. Figure 5.5 shows an example of a
rule set generated from a decision tree. Notice that the rule set is exhaustive
and contains mutually exclusive rules. However, some of the rules can be
simplified as shown in the next example.
P = No:
  Q = No: –
  Q = Yes: +
P = Yes:
  R = No: +
  R = Yes:
    Q = No: –
    Q = Yes: +
Rule Set
r1: (P=No,Q=No) ==> –
r2: (P=No,Q=Yes) ==> +
r3: (P=Yes,R=No) ==> +
r4: (P=Yes,R=Yes,Q=No) ==> –
r5: (P=Yes,R=Yes,Q=Yes) ==> +
Figure 5.5. Converting a decision tree into classification rules.
Example 5.2. Consider the following three rules from Figure 5.5:
r2 : (P = No) ∧ (Q = Yes) −→ +
r3 : (P = Yes) ∧ (R = No) −→ +
r5 : (P = Yes) ∧ (R = Yes) ∧ (Q = Yes) −→ +
Observe that the rule set always predicts a positive class when the value of Q
is Yes. Therefore, we may simplify the rules as follows:
r2′: (Q = Yes) −→ +
r3: (P = Yes) ∧ (R = No) −→ +.
Gives Birth?
  Yes: Mammals
  No: Aquatic Creature
    Yes: Fishes
    Semi: Amphibians
    No: Aerial Creature
      Yes: Birds
      No: Reptiles
Rule-Based Classifier:
(Gives Birth=No, Aerial Creature=Yes)=>Birds
(Gives Birth=No, Aerial Creature=No, Aquatic Creature=No)=>Reptiles
(Gives Birth=No, Aquatic Creature=Yes)=>Fishes
(Gives Birth=Yes)=>Mammals
( )=>Amphibians
Figure 5.6. Classification rules extracted from a decision tree for the vertebrate classification problem.
r3 is retained to cover the remaining instances of the positive class. Although
the rules obtained after simplification are no longer mutually exclusive, they
are less complex and are easier to interpret.
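The simplification in Example 5.2 can be verified by enumerating all combinations of P, Q, and R. The sketch below assumes the path rules r1 to r5 as given in Example 5.2 (with r3 testing R = No) and treats every record not covered by a positive rule as negative.

```python
from itertools import product


def original_predict(p, q, r):
    """Rules r1 to r5, one per root-to-leaf path."""
    if p == "No":
        return "+" if q == "Yes" else "-"   # r2, r1
    if r == "No":
        return "+"                          # r3
    return "+" if q == "Yes" else "-"       # r5, r4


def simplified_predict(p, q, r):
    """Simplified positive rules r2' and r3; everything else is negative."""
    if q == "Yes":                          # r2'
        return "+"
    if p == "Yes" and r == "No":            # r3
        return "+"
    return "-"


combos = list(product(["No", "Yes"], repeat=3))
agree = all(original_predict(*c) == simplified_predict(*c) for c in combos)

# Records that trigger both r2' and r3: the simplified rules overlap.
both = [(p, q, r) for p, q, r in combos
        if q == "Yes" and p == "Yes" and r == "No"]

print(agree, both)
```

The two rule sets agree on all eight combinations, and the single combination that fires both r2′ and r3 shows why the simplified rules are no longer mutually exclusive.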
In the following, we describe an approach used by the C4.5rules algorithm
to generate a rule set from a decision tree. Figure 5.6 shows the decision tree
and resulting classification rules obtained for the data set given in Table 5.2.
Rule Generation Classification rules are extracted for every path from the
root to one of the leaf nodes in the decision tree. Given a classification rule
r : A −→ y, we consider a simplified rule, r′ : A′ −→ y, where A′ is obtained
by removing one of the conjuncts in A. The simplified rule with the lowest
pessimistic error rate is retained provided its error rate is less than that of the
original rule. The rule-pruning step is repeated until the pessimistic error of
the rule cannot be improved further. Because some of the rules may become
identical after pruning, the duplicate rules must be discarded.
Rule Ordering After generating the rule set, C4.5rules uses the class-based
ordering scheme to order the extracted rules. Rules that predict the same class
are grouped together into the same subset. The total description length for
each subset is computed, and the classes are arranged in increasing order of
their total description length. The class that has the smallest description
length is given the highest priority because it is expected to contain the best
set of rules. The total description length for a class is given by Lexception + g ×
Lmodel, where Lexception is the number of bits needed to encode the misclassified
examples, Lmodel is the number of bits needed to encode the model, and g is a
tuning parameter whose default value is 0.5. The tuning parameter depends
on the number of redundant attributes present in the model. The value of the
tuning parameter is small if the model contains many redundant attributes.
5.1.6 Characteristics of Rule-Based Classifiers
A rule-based classifier has the following characteristics:
• The expressiveness of a rule set is almost equivalent to that of a decision
tree because a decision tree can be represented by a set of mutually
exclusive and exhaustive rules. Both rule-based and decision tree classifiers
create rectilinear partitions of the attribute space and assign a class to
each partition. Nevertheless, if the rule-based classifier allows multiple
rules to be triggered for a given record, then a more complex decision
boundary can be constructed.
• Rule-based classifiers are generally used to produce descriptive models
that are easier to interpret but give comparable performance to the
decision tree classifier.
• The class-based ordering approach adopted by many rule-based
classifiers (such as RIPPER) is well suited for handling data sets with
imbalanced class distributions.
5.2 Nearest-Neighbor Classifiers
The classification framework shown in Figure 4.3 involves a two-step process:
(1) an inductive step for constructing a classification model from data, and
(2) a deductive step for applying the model to test examples. Decision tree
and rule-based classifiers are examples of eager learners because they are
designed to learn a model that maps the input attributes to the class label as
soon as the training data becomes available. An opposite strategy would be to
delay the process of modeling the training data until it is needed to classify the
test examples. Techniques that employ this strategy are known as lazy
learners. An example of a lazy learner is the Rote classifier, which memorizes the
entire training data and performs classification only if the attributes of a test
instance match one of the training examples exactly. An obvious drawback of
this approach is that some test records may not be classified because they do
not match any training example.
(a) 1-nearest neighbor (b) 2-nearest neighbor (c) 3-nearest neighbor
Figure 5.7. The 1-, 2-, and 3-nearest neighbors of an instance.
One way to make this approach more flexible is to find all the training
examples that are relatively similar to the attributes of the test example.
These examples, which are known as nearest neighbors, can be used to
determine the class label of the test example. The justification for using nearest
neighbors is best exemplified by the following saying: “If it walks like a duck,
quacks like a duck, and looks like a duck, then it’s probably a duck.” A
nearest-neighbor classifier represents each example as a data point in a d-dimensional
space, where d is the number of attributes. Given a test example, we compute
its proximity to the rest of the data points in the training set, using one of
the proximity measures described in Section 2.4 on page 65. The k-nearest
neighbors of a given example z refer to the k points that are closest to z.
Figure 5.7 illustrates the 1-, 2-, and 3-nearest neighbors of a data point
located at the center of each circle. The data point is classified based on
the class labels of its neighbors. In the case where the neighbors have more
than one label, the data point is assigned to the majority class of its nearest
neighbors. In Figure 5.7(a), the 1-nearest neighbor of the data point is a
negative example. Therefore the data point is assigned to the negative class.
If the number of nearest neighbors is three, as shown in Figure 5.7(c), then
the neighborhood contains two positive examples and one negative example.
Using the majority voting scheme, the data point is assigned to the positive
class. In the case where there is a tie between the classes (see Figure 5.7(b)),
we may randomly choose one of them to classify the data point.
The preceding discussion underscores the importance of choosing the right
value for k. If k is too small, then the nearest-neighbor classifier may be
susceptible to overfitting because of noise in the training data. On the other
hand, if k is too large, the nearest-neighbor classifier may misclassify the test
instance because its list of nearest neighbors may include data points that are
located far away from its neighborhood (see Figure 5.8).
Figure 5.8. k-nearest neighbor classification with large k.
5.2.1 Algorithm
A high-level summary of the nearest-neighbor classification method is given in
Algorithm 5.2. The algorithm computes the distance (or similarity) between
each test example z = (x′, y′) and all the training examples (x, y) ∈ D to
determine its nearest-neighbor list, Dz. Such computation can be costly if the
number of training examples is large. However, efficient indexing techniques
are available to reduce the amount of computations needed to find the nearest
neighbors of a test example.
Algorithm 5.2 The k-nearest neighbor classification algorithm.
1: Let k be the number of nearest neighbors and D be the set of training examples.
2: for each test example z = (x′, y′) do
3:   Compute d(x′, x), the distance between z and every example, (x, y) ∈ D.
4:   Select Dz ⊆ D, the set of k closest training examples to z.
5:   $y' = \operatorname*{argmax}_{v} \sum_{(\mathbf{x}_i, y_i) \in D_z} I(v = y_i)$
6: end for
Once the nearest-neighbor list is obtained, the test example is classified
based on the majority class of its nearest neighbors:
\[ \text{Majority Voting:} \quad y' = \operatorname*{argmax}_{v} \sum_{(\mathbf{x}_i, y_i) \in D_z} I(v = y_i), \tag{5.7} \]
where v is a class label, yi is the class label for one of the nearest neighbors,
and I(·) is an indicator function that returns the value 1 if its argument is
true and 0 otherwise.
In the majority voting approach, every neighbor has the same impact on the
classification. This makes the algorithm sensitive to the choice of k, as shown
in Figure 5.7. One way to reduce the impact of k is to weight the influence
of each nearest neighbor xi according to its distance: $w_i = 1/d(\mathbf{x}', \mathbf{x}_i)^2$. As
a result, training examples that are located far away from z have a weaker
impact on the classification compared to those that are located close to z.
Using the distance-weighted voting scheme, the class label can be determined
as follows:
\[ \text{Distance-Weighted Voting:} \quad y' = \operatorname*{argmax}_{v} \sum_{(\mathbf{x}_i, y_i) \in D_z} w_i \times I(v = y_i). \tag{5.8} \]
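Both voting schemes can be sketched in a few lines of Python. This is a brute-force illustration under stated assumptions: Euclidean distance is used as the proximity measure, a neighbor at distance zero is given infinite weight, and the training points are made up for the example.

```python
import math
from collections import defaultdict


def euclidean(a, b):
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


def knn_classify(train, x_new, k, weighted=False):
    """train: list of (point, label). Returns the predicted label for x_new
    using majority voting (5.7) or distance-weighted voting (5.8)."""
    neighbors = sorted(train, key=lambda ex: euclidean(ex[0], x_new))[:k]
    votes = defaultdict(float)
    for x_i, y_i in neighbors:
        if weighted:
            d = euclidean(x_i, x_new)
            # Assumption: an exact match dominates the vote.
            votes[y_i] += 1.0 / (d * d) if d > 0 else float("inf")
        else:
            votes[y_i] += 1.0
    return max(votes, key=votes.get)


# Hypothetical training set: one very close negative, two farther positives.
train = [
    ((0.1, 0.0), "-"),
    ((1.0, 0.0), "+"),
    ((0.0, 1.0), "+"),
    ((3.0, 3.0), "-"),
]
z = (0.0, 0.0)
print(knn_classify(train, z, k=1))
print(knn_classify(train, z, k=3))
print(knn_classify(train, z, k=3, weighted=True))
```

With k = 3, majority voting follows the two positive neighbors, while distance-weighted voting follows the single negative neighbor because it is ten times closer to z.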
5.2.2 Characteristics of Nearest-Neighbor Classifiers
The characteristics of the nearest-neighbor classifier are summarized below:
• Nearest-neighbor classification is part of a more general technique known
as instance-based learning, which uses specific training instances to make
predictions without having to maintain an abstraction (or model)
derived from data. Instance-based learning algorithms require a proximity
measure to determine the similarity or distance between instances and a
classification function that returns the predicted class of a test instance
based on its proximity to other instances.
• Lazy learners such as nearest-neighbor classifiers do not require model
building. However, classifying a test example can be quite expensive
because we need to compute the proximity values individually between
the test and training examples. In contrast, eager learners often spend
the bulk of their computing resources for model building. Once a model
has been built, classifying a test example is extremely fast.
• Nearest-neighbor classifiers make their predictions based on local
information, whereas decision tree and rule-based classifiers attempt to find
a global model that fits the entire input space. Because the classification
decisions are made locally, nearest-neighbor classifiers (with small values
of k) are quite susceptible to noise.
• Nearest-neighbor classifiers can produce arbitrarily shaped decision
boundaries. Such boundaries provide a more flexible model representation
compared to decision tree and rule-based classifiers that are often
constrained to rectilinear decision boundaries. The decision boundaries of
nearest-neighbor classifiers also have high variability because they
depend on the composition of training examples. Increasing the number of
nearest neighbors may reduce such variability.
• Nearest-neighbor classifiers can produce wrong predictions unless the
appropriate proximity measure and data preprocessing steps are taken.
For example, suppose we want to classify a group of people based on
attributes such as height (measured in meters) and weight (measured in
pounds). The height attribute has a low variability, ranging from 1.5 m
to 1.85 m, whereas the weight attribute may vary from 90 lb. to 250
lb. If the scale of the attributes is not taken into consideration, the
proximity measure may be dominated by differences in weight.
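The effect of attribute scale can be illustrated with a short sketch. Min-max scaling is assumed here as the preprocessing step, using the height and weight ranges quoted above; the three people are hypothetical.

```python
import math

# (min, max) for height in meters and weight in pounds, as quoted in the text.
RANGES = [(1.5, 1.85), (90.0, 250.0)]


def min_max_scale(point, ranges):
    """Map each attribute onto [0, 1] using its known range."""
    return tuple((v - lo) / (hi - lo) for v, (lo, hi) in zip(point, ranges))


def nearest(query, others, transform=lambda p: p):
    """Name of the point in `others` closest to `query` (Euclidean distance)."""
    q = transform(query)
    return min(others, key=lambda name: math.dist(q, transform(others[name])))


query = (1.50, 150.0)                 # hypothetical person: 1.50 m, 150 lb
others = {
    "b": (1.85, 152.0),               # much taller, almost the same weight
    "c": (1.52, 170.0),               # similar height, 20 lb heavier
}

print(nearest(query, others))
print(nearest(query, others, lambda p: min_max_scale(p, RANGES)))
```

On the raw attributes the 20 lb difference dominates and person b is the nearest neighbor; after scaling, the height difference of b spans the entire range and person c becomes the nearest neighbor.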
5.3 Bayesian Classifiers
In many applications the relationship between the attribute set and the class
variable is nondeterministic. In other words, the class label of a test record
cannot be predicted with certainty even though its attribute set is identical
to some of the training examples. This situation may arise because of noisy
data or the presence of certain confounding factors that affect classification
but are not included in the analysis. For example, consider the task of
predicting whether a person is at risk for heart disease based on the person’s
diet and workout frequency. Although most people who eat healthily and
exercise regularly have less chance of developing heart disease, they may still
do so because of other factors such as heredity, excessive smoking, and alcohol
abuse. Determining whether a person’s diet is healthy or the workout frequency
is sufficient is also subject to interpretation, which in turn may introduce
uncertainties into the learning problem.
This section presents an approach for modeling probabilistic relationships
between the attribute set and the class variable. The section begins with an
introduction to the Bayes theorem, a statistical principle for combining prior
Chapter 5 Classification: Alternative Techniques
knowledge of the classes with new evidence gathered from data. The use of the
Bayes theorem for solving classification problems will be explained, followed
by a description of two implementations of Bayesian classifiers: naïve Bayes
and the Bayesian belief network.
5.3.1 Bayes Theorem
Consider a football game between two rival teams: Team 0 and Team 1.
Suppose Team 0 wins 65% of the time and Team 1 wins the remaining
matches. Among the games won by Team 0, only 30% of them come
from playing on Team 1’s football field. On the other hand, 75% of the
victories for Team 1 are obtained while playing at home. If Team 1 is to
host the next match between the two teams, which team will most likely
emerge as the winner?
This question can be answered by using the well-known Bayes theorem. For
completeness, we begin with some basic definitions from probability theory.
Readers who are unfamiliar with concepts in probability may refer to Appendix
C for a brief review of this topic.
Let X and Y be a pair of random variables. Their joint probability, P(X =
x, Y = y), refers to the probability that variable X will take on the value
x and variable Y will take on the value y. A conditional probability is the
probability that a random variable will take on a particular value given that the
outcome for another random variable is known. For example, the conditional
probability P(Y = y | X = x) refers to the probability that the variable Y will
take on the value y, given that the variable X is observed to have the value x.
The joint and conditional probabilities for X and Y are related in the following
way:

$$P(X, Y) = P(Y \mid X) \times P(X) = P(X \mid Y) \times P(Y). \tag{5.9}$$
Rearranging the last two expressions in Equation 5.9 leads to the following
formula, known as the Bayes theorem:

$$P(Y \mid X) = \frac{P(X \mid Y)\,P(Y)}{P(X)}. \tag{5.10}$$

The Bayes theorem can be used to solve the prediction problem stated
at the beginning of this section. For notational convenience, let X be the
random variable that represents the team hosting the match and Y be the
random variable that represents the winner of the match. Both X and Y can
take on values from the set {0, 1}. We can summarize the information given
in the problem as follows:
Probability that Team 0 wins is P(Y = 0) = 0.65.
Probability that Team 1 wins is P(Y = 1) = 1 − P(Y = 0) = 0.35.
Probability that Team 1 hosted a match it won is P(X = 1 | Y = 1) = 0.75.
Probability that Team 1 hosted a match won by Team 0 is P(X = 1 | Y = 0) = 0.3.
Our objective is to compute P(Y = 1 | X = 1), which is the conditional
probability that Team 1 wins the next match it will be hosting, and compare
it against P(Y = 0 | X = 1). Using the Bayes theorem, we obtain
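
$$
\begin{aligned}
P(Y = 1 \mid X = 1) &= \frac{P(X = 1 \mid Y = 1) \times P(Y = 1)}{P(X = 1)} \\
&= \frac{P(X = 1 \mid Y = 1) \times P(Y = 1)}{P(X = 1, Y = 1) + P(X = 1, Y = 0)} \\
&= \frac{P(X = 1 \mid Y = 1) \times P(Y = 1)}{P(X = 1 \mid Y = 1)\,P(Y = 1) + P(X = 1 \mid Y = 0)\,P(Y = 0)} \\
&= \frac{0.75 \times 0.35}{0.75 \times 0.35 + 0.3 \times 0.65} \\
&= 0.5738,
\end{aligned}
$$
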
where the law of total probability (see Equation C.5 on page 722) was applied
in the second line. Furthermore, P(Y = 0 | X = 1) = 1 − P(Y = 1 | X = 1) =
0.4262. Since P(Y = 1 | X = 1) > P(Y = 0 | X = 1), Team 1 has a better
chance than Team 0 of winning the next match.
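The calculation above can be reproduced with a few lines of Python. This is
a minimal sketch; the function name posterior and the dictionary layout are
choices made here for illustration, not part of the original presentation.

```python
def posterior(prior, likelihood):
    """Apply the Bayes theorem to a discrete class variable.

    prior[y] is P(Y = y); likelihood[y] is P(X = x | Y = y) for the observed x.
    Returns a dict that maps each y to P(Y = y | X = x).
    """
    joint = {y: likelihood[y] * prior[y] for y in prior}
    evidence = sum(joint.values())  # P(X = x), by the law of total probability
    return {y: joint[y] / evidence for y in joint}

prior = {0: 0.65, 1: 0.35}          # P(Y = y): probability that Team y wins
hosted_by_team1 = {0: 0.30, 1: 0.75}  # P(X = 1 | Y = y)

post = posterior(prior, hosted_by_team1)
print(round(post[1], 4), round(post[0], 4))  # 0.5738 0.4262
```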
5.3.2 Using the Bayes Theorem for Classification
Before describing how the Bayes theorem can be used for classification, let
us formalize the classification problem from a statistical perspective. Let X
denote the attribute set and Y denote the class variable. If the class variable
has a nondeterministic relationship with the attributes, then we can treat
X and Y as random variables and capture their relationship probabilistically
using P(Y | X). This conditional probability is also known as the posterior
probability for Y, as opposed to its prior probability, P(Y).
During the training phase, we need to learn the posterior probabilities
P(Y | X) for every combination of X and Y based on information gathered
from the training data. By knowing these probabilities, a test record X′ can
be classified by finding the class Y′ that maximizes the posterior probability,
P(Y′ | X′). To illustrate this approach, consider the task of predicting whether
a loan borrower will default on their payments. Figure 5.9 shows a training
set with the following attributes: Home Owner, Marital Status, and Annual
Income. Loan borrowers who defaulted on their payments are classified as
Yes, while those who repaid their loans are classified as No.

Tid   Home Owner   Marital Status   Annual Income   Defaulted Borrower
      (binary)     (categorical)    (continuous)    (class)
1     Yes          Single           125K            No
2     No           Married          100K            No
3     No           Single           70K             No
4     Yes          Married          120K            No
5     No           Divorced         95K             Yes
6     No           Married          60K             No
7     Yes          Divorced         220K            No
8     No           Single           85K             Yes
9     No           Married          75K             No
10    No           Single           90K             Yes

Figure 5.9. Training set for predicting the loan default problem.
Suppose we are given a test record with the following attribute set: X =
(Home Owner = No, Marital Status = Married, Annual Income = $120K). To
classify the record, we need to compute the posterior probabilities P(Yes | X)
and P(No | X) based on information available in the training data. If
P(Yes | X) > P(No | X), then the record is classified as Yes; otherwise, it is
classified as No.
Estimating the posterior probabilities accurately for every possible
combination of class label and attribute value is a difficult problem because it
requires a very large training set, even for a moderate number of attributes.
The Bayes theorem is useful because it allows us to express the posterior
probability in terms of the prior probability P(Y), the class-conditional
probability P(X | Y), and the evidence, P(X):

$$P(Y \mid X) = \frac{P(X \mid Y) \times P(Y)}{P(X)}. \tag{5.11}$$

When comparing the posterior probabilities for different values of Y, the
denominator term, P(X), is always constant, and thus, can be ignored. The
prior probability P(Y) can be easily estimated from the training set by
computing the fraction of training records that belong to each class. To
estimate the class-conditional probabilities P(X | Y), we present two
implementations of Bayesian classification methods: the naïve Bayes classifier
and the Bayesian belief network. These implementations are described in
Sections 5.3.3 and 5.3.5, respectively.
5.3.3 Naïve Bayes Classifier
A naïve Bayes classifier estimates the class-conditional probability by assuming
that the attributes are conditionally independent, given the class label y. The
conditional independence assumption can be formally stated as follows:

$$P(X \mid Y = y) = \prod_{i=1}^{d} P(X_i \mid Y = y), \tag{5.12}$$

where each attribute set X = {X1, X2, . . . , Xd} consists of d attributes.
Conditional Independence
Before delving into the details of how a naïve Bayes classifier works, let us
examine the notion of conditional independence. Let X, Y, and Z denote
three sets of random variables. The variables in X are said to be conditionally
independent of Y, given Z, if the following condition holds:

$$P(X \mid Y, Z) = P(X \mid Z). \tag{5.13}$$

An example of conditional independence is the relationship between a person’s
arm length and his or her reading skills. One might observe that people with
longer arms tend to have higher levels of reading skills. This relationship can
be explained by the presence of a confounding factor, which is age. A young
child tends to have short arms and lacks the reading skills of an adult. If the
age of a person is fixed, then the observed relationship between arm length
and reading skills disappears. Thus, we can conclude that arm length and
reading skills are conditionally independent when the age variable is fixed.
The conditional independence between X and Y can also be written in
a form that looks similar to Equation 5.12:

$$
\begin{aligned}
P(X, Y \mid Z) &= \frac{P(X, Y, Z)}{P(Z)} \\
&= \frac{P(X, Y, Z)}{P(Y, Z)} \times \frac{P(Y, Z)}{P(Z)} \\
&= P(X \mid Y, Z) \times P(Y \mid Z) \\
&= P(X \mid Z) \times P(Y \mid Z),
\end{aligned}
\tag{5.14}
$$

where Equation 5.13 was used to obtain the last line of Equation 5.14.
How a Naïve Bayes Classifier Works
With the conditional independence assumption, instead of computing the
class-conditional probability for every combination of X, we only have to
estimate the conditional probability of each Xi, given Y. The latter approach is
more practical because it does not require a very large training set to obtain
a good estimate of the probability.
To classify a test record, the naïve Bayes classifier computes the posterior
probability for each class Y:

$$P(Y \mid X) = \frac{P(Y) \prod_{i=1}^{d} P(X_i \mid Y)}{P(X)}. \tag{5.15}$$

Since P(X) is fixed for every Y, it is sufficient to choose the class that
maximizes the numerator term, $P(Y) \prod_{i=1}^{d} P(X_i \mid Y)$. In the
next two subsections, we describe several approaches for estimating the
conditional probabilities P(Xi | Y) for categorical and continuous attributes.
Estimating Conditional Probabilities for Categorical Attributes
For a categorical attribute Xi, the conditional probability P(Xi = xi | Y = y)
is estimated according to the fraction of training instances in class y that take
on a particular attribute value xi. For example, in the training set given in
Figure 5.9, three out of the seven people who repaid their loans also own a
home. As a result, the conditional probability for P(Home Owner = Yes | No) is
equal to 3/7. Similarly, the conditional probability for defaulted borrowers
who are single is given by P(Marital Status = Single | Yes) = 2/3.
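The counting rule can be sketched in Python as follows. The records are
those of Figure 5.9; the tuple layout, the COLUMNS mapping, and the function
name conditional_probability are assumptions made for this illustration.

```python
from fractions import Fraction

# Figure 5.9 as (home owner, marital status, income in $K, defaulted borrower).
RECORDS = [
    ("Yes", "Single", 125, "No"), ("No", "Married", 100, "No"),
    ("No", "Single", 70, "No"), ("Yes", "Married", 120, "No"),
    ("No", "Divorced", 95, "Yes"), ("No", "Married", 60, "No"),
    ("Yes", "Divorced", 220, "No"), ("No", "Single", 85, "Yes"),
    ("No", "Married", 75, "No"), ("No", "Single", 90, "Yes"),
]
COLUMNS = {"home_owner": 0, "marital_status": 1}

def conditional_probability(records, attribute, value, label):
    """Fraction of the records in class `label` whose `attribute` equals `value`."""
    in_class = [r for r in records if r[3] == label]
    matches = sum(1 for r in in_class if r[COLUMNS[attribute]] == value)
    return Fraction(matches, len(in_class))

print(conditional_probability(RECORDS, "home_owner", "Yes", "No"))          # 3/7
print(conditional_probability(RECORDS, "marital_status", "Single", "Yes"))  # 2/3
```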
Estimating Conditional Probabilities for Continuous Attributes
There are two ways to estimate the class-conditional probabilities for
continuous attributes in naïve Bayes classifiers:
1. We can discretize each continuous attribute and then replace the
continuous attribute value with its corresponding discrete interval. This
approach transforms the continuous attributes into ordinal attributes.
The conditional probability P(Xi | Y = y) is estimated by computing
the fraction of training records belonging to class y that fall within the
corresponding interval for Xi. The estimation error depends on the
discretization strategy (as described in Section 2.3.6 on page 57), as well as
the number of discrete intervals. If the number of intervals is too large,
there are too few training records in each interval to provide a reliable
estimate for P(Xi | Y). On the other hand, if the number of intervals
is too small, then some intervals may aggregate records from different
classes and we may miss the correct decision boundary.
2. We can assume a certain form of probability distribution for the
continuous variable and estimate the parameters of the distribution using the
training data. A Gaussian distribution is usually chosen to represent the
class-conditional probability for continuous attributes. The distribution
is characterized by two parameters, its mean, µ, and variance, σ². For
each class yj, the class-conditional probability for attribute Xi is

$$P(X_i = x_i \mid Y = y_j) = \frac{1}{\sqrt{2\pi}\,\sigma_{ij}} \exp\left[-\frac{(x_i - \mu_{ij})^2}{2\sigma_{ij}^2}\right]. \tag{5.16}$$

The parameter µij can be estimated based on the sample mean of Xi
(x̄) for all training records that belong to the class yj. Similarly, σ²ij can
be estimated from the sample variance (s²) of such training records. For
example, consider the annual income attribute shown in Figure 5.9. The
sample mean and variance for this attribute with respect to the class No
are

$$
\begin{aligned}
\bar{x} &= \frac{125 + 100 + 70 + \cdots + 75}{7} = 110, \\
s^2 &= \frac{(125 - 110)^2 + (100 - 110)^2 + \cdots + (75 - 110)^2}{7 - 1} = 2975, \\
s &= \sqrt{2975} = 54.54.
\end{aligned}
$$

Given a test record with annual income equal to $120K, we can compute
its class-conditional probability as follows:

$$P(\text{Income} = 120 \mid \text{No}) = \frac{1}{\sqrt{2\pi}\,(54.54)} \exp\left[-\frac{(120 - 110)^2}{2 \times 2975}\right] = 0.0072.$$

Note that the preceding interpretation of class-conditional probability
is somewhat misleading. The right-hand side of Equation 5.16
corresponds to a probability density function, f(Xi; µij, σij). Since the
function is continuous, the probability that the random variable Xi takes
a particular value is zero. Instead, we should compute the conditional
probability that Xi lies within some interval, xi and xi + ε, where ε is a
small constant:

$$
\begin{aligned}
P(x_i \le X_i \le x_i + \epsilon \mid Y = y_j) &= \int_{x_i}^{x_i + \epsilon} f(X_i; \mu_{ij}, \sigma_{ij})\, dX_i \\
&\approx f(x_i; \mu_{ij}, \sigma_{ij}) \times \epsilon.
\end{aligned}
\tag{5.17}
$$

Since ε appears as a constant multiplicative factor for each class, it
cancels out when we normalize the posterior probability for P(Y | X).
Therefore, we can still apply Equation 5.16 to approximate the
class-conditional probability P(Xi | Y).
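The Gaussian estimate for the class No can be checked numerically. The
sketch below uses the seven incomes of the class No from Figure 5.9; the
function name gaussian_density is introduced here for illustration.

```python
import math
import statistics

# Annual incomes (in $K) of the seven class-No records in Figure 5.9.
incomes_no = [125, 100, 70, 120, 60, 220, 75]

def gaussian_density(x, mean, variance):
    """Right-hand side of Equation 5.16 for a given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2 * variance)) / math.sqrt(2 * math.pi * variance)

mean_no = statistics.mean(incomes_no)     # sample mean
var_no = statistics.variance(incomes_no)  # sample variance (divides by n - 1)
density = gaussian_density(120, mean_no, var_no)

print(mean_no, var_no, round(density, 4))  # 110 2975 0.0072
```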
Example of the Naïve Bayes Classifier
Consider the data set shown in Figure 5.10(a). We can compute the
class-conditional probability for each categorical attribute, along with the
sample mean and variance for the continuous attribute using the methodology
described in the previous subsections. These probabilities are summarized in
Figure 5.10(b).
To predict the class label of a test record X = (Home Owner = No, Marital
Status = Married, Income = $120K), we need to compute the posterior
probabilities P(No | X) and P(Yes | X). Recall from our earlier discussion that
these posterior probabilities can be estimated by computing the product between
the prior probability P(Y) and the class-conditional probabilities
∏i P(Xi | Y), which corresponds to the numerator of the right-hand side term in
Equation 5.15.

(a) Training set: the same ten records as in Figure 5.9.

(b) Estimated probabilities:
P(Home Owner = Yes | No) = 3/7
P(Home Owner = No | No) = 4/7
P(Home Owner = Yes | Yes) = 0
P(Home Owner = No | Yes) = 1
P(Marital Status = Single | No) = 2/7
P(Marital Status = Divorced | No) = 1/7
P(Marital Status = Married | No) = 4/7
P(Marital Status = Single | Yes) = 2/3
P(Marital Status = Divorced | Yes) = 1/3
P(Marital Status = Married | Yes) = 0
For Annual Income:
If class = No: sample mean = 110, sample variance = 2975
If class = Yes: sample mean = 90, sample variance = 25

Figure 5.10. The naïve Bayes classifier for the loan classification problem.

The prior probabilities of each class can be estimated by calculating the
fraction of training records that belong to each class. Since there are three
records that belong to the class Yes and seven records that belong to the class
No, P(Yes) = 0.3 and P(No) = 0.7. Using the information provided in Figure
5.10(b), the class-conditional probabilities can be computed as follows:
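
$$
\begin{aligned}
P(X \mid \text{No}) &= P(\text{Home Owner} = \text{No} \mid \text{No}) \times P(\text{Status} = \text{Married} \mid \text{No}) \\
&\quad \times P(\text{Annual Income} = \$120\text{K} \mid \text{No}) \\
&= 4/7 \times 4/7 \times 0.0072 = 0.0024. \\
P(X \mid \text{Yes}) &= P(\text{Home Owner} = \text{No} \mid \text{Yes}) \times P(\text{Status} = \text{Married} \mid \text{Yes}) \\
&\quad \times P(\text{Annual Income} = \$120\text{K} \mid \text{Yes}) \\
&= 1 \times 0 \times 1.2 \times 10^{-9} = 0.
\end{aligned}
$$
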
Putting them together, the posterior probability for class No is P(No | X) =
α × 7/10 × 0.0024 = 0.0016α, where α = 1/P(X) is a constant term. Using
a similar approach, we can show that the posterior probability for class Yes
is zero because its class-conditional probability is zero. Since P(No | X) >
P(Yes | X), the record is classified as No.
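The whole example can be condensed into a short script. This is a sketch
rather than a general-purpose implementation: it hard-codes the three
attributes of Figure 5.9, uses simple fractions for the categorical
attributes and the Gaussian of Equation 5.16 for income, and the names
score and gaussian_density are introduced here for illustration.

```python
import math

# Figure 5.9 as (home owner, marital status, income in $K, defaulted borrower).
RECORDS = [
    ("Yes", "Single", 125, "No"), ("No", "Married", 100, "No"),
    ("No", "Single", 70, "No"), ("Yes", "Married", 120, "No"),
    ("No", "Divorced", 95, "Yes"), ("No", "Married", 60, "No"),
    ("Yes", "Divorced", 220, "No"), ("No", "Single", 85, "Yes"),
    ("No", "Married", 75, "No"), ("No", "Single", 90, "Yes"),
]

def gaussian_density(x, mean, variance):
    return math.exp(-((x - mean) ** 2) / (2 * variance)) / math.sqrt(2 * math.pi * variance)

def score(records, label, home_owner, status, income):
    """Numerator of Equation 5.15 for one class: P(Y) times the product of P(Xi | Y)."""
    rows = [r for r in records if r[3] == label]
    n = len(rows)
    prior = n / len(records)
    p_owner = sum(r[0] == home_owner for r in rows) / n
    p_status = sum(r[1] == status for r in rows) / n
    incomes = [r[2] for r in rows]
    mean = sum(incomes) / n
    variance = sum((v - mean) ** 2 for v in incomes) / (n - 1)  # sample variance
    return prior * p_owner * p_status * gaussian_density(income, mean, variance)

# Test record: Home Owner = No, Marital Status = Married, Income = $120K.
scores = {label: score(RECORDS, label, "No", "Married", 120) for label in ("No", "Yes")}
prediction = max(scores, key=scores.get)
print(scores, prediction)
```

The score for Yes is exactly zero because no defaulted borrower in the
training set is married, which is the weakness discussed next.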
M-estimate of Conditional Probability
The preceding example illustrates a potential problem with estimating
posterior probabilities from training data. If the class-conditional probability
for one of the attributes is zero, then the overall posterior probability for the
class vanishes. This approach of estimating class-conditional probabilities using
simple fractions may seem too brittle, especially when there are few training
examples available and the number of attributes is large.
In a more extreme case, if the training examples do not cover many of
the attribute values, we may not be able to classify some of the test records.
For example, if P(Marital Status = Divorced | No) is zero instead of 1/7,
then a record with attribute set X = (Home Owner = Yes, Marital Status =
Divorced, Income = $120K) has the following class-conditional probabilities:

$$
\begin{aligned}
P(X \mid \text{No}) &= 3/7 \times 0 \times 0.0072 = 0, \\
P(X \mid \text{Yes}) &= 0 \times 1/3 \times 1.2 \times 10^{-9} = 0.
\end{aligned}
$$

The naïve Bayes classifier will not be able to classify the record. This
problem can be addressed by using the m-estimate approach for estimating the
conditional probabilities:

$$P(x_i \mid y_j) = \frac{n_c + mp}{n + m}, \tag{5.18}$$

where n is the total number of instances from class yj, nc is the number of
training examples from class yj that take on the value xi, m is a parameter
known as the equivalent sample size, and p is a user-specified parameter. If
there is no training set available (i.e., n = 0), then P(xi | yj) = p. Therefore
p can be regarded as the prior probability of observing the attribute value
xi among records with class yj. The equivalent sample size determines the
tradeoff between the prior probability p and the observed probability nc/n.
In the example given in the previous section, the conditional probability
P(Status = Married | Yes) = 0 because none of the training records for the
class has the particular attribute value. Using the m-estimate approach with
m = 3 and p = 1/3, the conditional probability is no longer zero:

$$P(\text{Marital Status} = \text{Married} \mid \text{Yes}) = \frac{0 + 3 \times 1/3}{3 + 3} = 1/6.$$

If we assume p = 1/3 for all attributes of class Yes and p = 2/3 for all
attributes of class No, then

$$
\begin{aligned}
P(X \mid \text{No}) &= P(\text{Home Owner} = \text{No} \mid \text{No}) \times P(\text{Status} = \text{Married} \mid \text{No}) \\
&\quad \times P(\text{Annual Income} = \$120\text{K} \mid \text{No}) \\
&= 6/10 \times 6/10 \times 0.0072 = 0.0026. \\
P(X \mid \text{Yes}) &= P(\text{Home Owner} = \text{No} \mid \text{Yes}) \times P(\text{Status} = \text{Married} \mid \text{Yes}) \\
&\quad \times P(\text{Annual Income} = \$120\text{K} \mid \text{Yes}) \\
&= 4/6 \times 1/6 \times 1.2 \times 10^{-9} = 1.3 \times 10^{-10}.
\end{aligned}
$$

The posterior probability for class No is P(No | X) = α × 7/10 × 0.0026 =
0.0018α, while the posterior probability for class Yes is P(Yes | X) = α ×
3/10 × 1.3 × 10⁻¹⁰ = 4.0 × 10⁻¹¹α. Although the classification decision has
not changed, the m-estimate approach generally provides a more robust way
for estimating probabilities when the number of training examples is small.
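Equation 5.18 is a one-line function. The sketch below reproduces the
smoothed estimates used above with exact fractions; the function name
m_estimate is a choice made here for illustration.

```python
from fractions import Fraction

def m_estimate(n_c, n, m, p):
    """Equation 5.18: (n_c + m * p) / (n + m)."""
    return (n_c + m * p) / (n + m)

third, two_thirds = Fraction(1, 3), Fraction(2, 3)

# Class Yes has n = 3 records; class No has n = 7 records (Figure 5.9).
married_given_yes = m_estimate(0, 3, 3, third)        # 1/6 instead of 0
not_owner_given_yes = m_estimate(3, 3, 3, third)      # 4/6 instead of 1
married_given_no = m_estimate(4, 7, 3, two_thirds)    # 6/10 instead of 4/7
not_owner_given_no = m_estimate(4, 7, 3, two_thirds)  # 6/10 instead of 4/7

print(married_given_yes, not_owner_given_yes, married_given_no)  # 1/6 2/3 3/5
```

With m = 0 the estimate reduces to the simple fraction nc/n, and with no
training records (n = 0) it returns the prior p.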
Characteristics of Naïve Bayes Classifiers
Naïve Bayes classifiers generally have the following characteristics:
• They are robust to isolated noise points because such points are averaged
out when estimating conditional probabilities from data. Naïve Bayes
classifiers can also handle missing values by ignoring the example during
model building and classification.
• They are robust to irrelevant attributes. If Xi is an irrelevant
attribute, then P(Xi | Y) becomes almost uniformly distributed. The
class-conditional probability for Xi has no impact on the overall computation
of the posterior probability.
• Correlated attributes can degrade the performance of naïve Bayes
classifiers because the conditional independence assumption no longer holds
for such attributes. For example, consider the following probabilities:

P(A = 0 | Y = 0) = 0.4, P(A = 1 | Y = 0) = 0.6,
P(A = 0 | Y = 1) = 0.6, P(A = 1 | Y = 1) = 0.4,

where A is a binary attribute and Y is a binary class variable. Suppose
there is another binary attribute B that is perfectly correlated with A
when Y = 0, but is independent of A when Y = 1. For simplicity,
assume that the class-conditional probabilities for B are the same as for
A. Given a record with attributes A = 0, B = 0, we can compute its
posterior probabilities as follows:

$$
\begin{aligned}
P(Y = 0 \mid A = 0, B = 0) &= \frac{P(A = 0 \mid Y = 0)\,P(B = 0 \mid Y = 0)\,P(Y = 0)}{P(A = 0, B = 0)} \\
&= \frac{0.16 \times P(Y = 0)}{P(A = 0, B = 0)}. \\
P(Y = 1 \mid A = 0, B = 0) &= \frac{P(A = 0 \mid Y = 1)\,P(B = 0 \mid Y = 1)\,P(Y = 1)}{P(A = 0, B = 0)} \\
&= \frac{0.36 \times P(Y = 1)}{P(A = 0, B = 0)}.
\end{aligned}
$$

If P(Y = 0) = P(Y = 1), then the naïve Bayes classifier would assign
the record to class 1. However, the truth is,

$$P(A = 0, B = 0 \mid Y = 0) = P(A = 0 \mid Y = 0) = 0.4,$$

because A and B are perfectly correlated when Y = 0. As a result, the
posterior probability for Y = 0 is

$$
\begin{aligned}
P(Y = 0 \mid A = 0, B = 0) &= \frac{P(A = 0, B = 0 \mid Y = 0)\,P(Y = 0)}{P(A = 0, B = 0)} \\
&= \frac{0.4 \times P(Y = 0)}{P(A = 0, B = 0)},
\end{aligned}
$$

which is larger than that for Y = 1. The record should have been
classified as class 0.
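The correlated-attribute example can be verified with a few lines of Python.
The sketch assumes equal priors, as in the text, and compares the
unnormalized posteriors; the variable names are introduced here for
illustration.

```python
# P(A = 0 | Y = y); B is assumed to have the same class-conditional probabilities.
p_a0 = {0: 0.4, 1: 0.6}
p_b0 = dict(p_a0)
prior = {0: 0.5, 1: 0.5}

# Naive Bayes multiplies the two conditionals regardless of the correlation.
naive = {y: p_a0[y] * p_b0[y] * prior[y] for y in prior}

# True joint P(A = 0, B = 0 | Y): B duplicates A when Y = 0 and is
# independent of A when Y = 1.
true_joint = {0: p_a0[0], 1: p_a0[1] * p_b0[1]}
exact = {y: true_joint[y] * prior[y] for y in prior}

naive_choice = max(naive, key=naive.get)
exact_choice = max(exact, key=exact.get)
print(naive_choice, exact_choice)  # 1 0
```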
5.3.4 Bayes Error Rate
Suppose we know the true probability distribution that governs P(X | Y). The
Bayesian classification method allows us to determine the ideal decision
boundary for the classification task, as illustrated in the following example.
Example 5.3. Consider the task of identifying alligators and crocodiles based
on their respective lengths. The average length of an adult crocodile is about 15
feet, while the average length of an adult alligator is about 12 feet. Assuming
that their length x follows a Gaussian distribution with a standard deviation
equal to 2 feet, we can express their class-conditional probabilities as follows:

$$P(X \mid \text{Crocodile}) = \frac{1}{\sqrt{2\pi} \cdot 2} \exp\left[-\frac{1}{2}\left(\frac{X - 15}{2}\right)^2\right] \tag{5.19}$$

$$P(X \mid \text{Alligator}) = \frac{1}{\sqrt{2\pi} \cdot 2} \exp\left[-\frac{1}{2}\left(\frac{X - 12}{2}\right)^2\right] \tag{5.20}$$

Figure 5.11. Comparing the likelihood functions of a crocodile and an alligator.
(Plot of P(x | y) against length, x, with one curve for Alligator and one for
Crocodile.)

Figure 5.11 shows a comparison between the class-conditional probabilities
for a crocodile and an alligator. Assuming that their prior probabilities are
the same, the ideal decision boundary is located at some length x̂ such that

$$P(X = \hat{x} \mid \text{Crocodile}) = P(X = \hat{x} \mid \text{Alligator}).$$

Using Equations 5.19 and 5.20, we obtain

$$\left(\frac{\hat{x} - 15}{2}\right)^2 = \left(\frac{\hat{x} - 12}{2}\right)^2,$$

which can be solved to yield x̂ = 13.5. The decision boundary for this example
is located halfway between the two means.
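The boundary can also be computed in code. The sketch below solves
prior_a × N(x; mu_a, σ) = prior_b × N(x; mu_b, σ) in closed form for two
Gaussians that share a standard deviation; the prior arguments are an
addition made here for illustration and default to the equal priors of the
example.

```python
import math

def gaussian_density(x, mean, sigma):
    return math.exp(-0.5 * ((x - mean) / sigma) ** 2) / (math.sqrt(2 * math.pi) * sigma)

def decision_boundary(mu_a, mu_b, sigma, prior_a=0.5, prior_b=0.5):
    """Length at which prior_a * N(x; mu_a, sigma) equals prior_b * N(x; mu_b, sigma).

    Taking logarithms of both sides and solving for x gives the midpoint of
    the two means plus a shift that depends on the ratio of the priors.
    """
    midpoint = (mu_a + mu_b) / 2.0
    shift = sigma ** 2 * math.log(prior_a / prior_b) / (mu_b - mu_a)
    return midpoint + shift

x_hat = decision_boundary(mu_a=12.0, mu_b=15.0, sigma=2.0)  # alligator, crocodile
print(x_hat)  # 13.5
```

With unequal priors the shift term is nonzero; for instance, a call with
prior_a=0.2 and prior_b=0.8 returns a boundary below 13.5, that is, closer
to the mean of the less probable class.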
Figure 5.12. Representing probabilistic relationships using directed acyclic graphs.
(Three panels: (a) nodes A, B, and C; (b) nodes A, B, C, and D; (c) class node y
and attribute nodes X1, X2, X3, X4, . . . , Xd.)
When the prior probabilities are different, the decision boundary shifts
toward the class with lower prior probability (see Exercise 10 on page 319).
Furthermore, the minimum error rate attainable by any classifier on the given
data can also be computed. The ideal decision boundary in the preceding
example classifies all creatures whose lengths are less than x̂ as alligators and
those whose lengths are greater than x̂ as crocodiles. The error rate of the
classifier is given by the sum of the area under the posterior probability curve
for crocodiles (from length 0 to x̂) and the area under the posterior probability
curve for alligators (from x̂ to ∞):

$$\text{Error} = \int_{0}^{\hat{x}} P(\text{Crocodile} \mid X)\, dX + \int_{\hat{x}}^{\infty} P(\text{Alligator} \mid X)\, dX.$$

The total error rate is known as the Bayes error rate.
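A numerical value for the alligator and crocodile example can be obtained
with the sketch below. It rests on two assumptions that go beyond the
formula as printed: each region is weighted by the density of X, so that
every term becomes a prior times the class-conditional mass on the wrong
side of the boundary, and the lower limit is taken as −∞ rather than 0
(the Gaussian mass below zero is negligible here). The function names are
introduced for illustration.

```python
import math

def normal_cdf(x, mean, sigma):
    """Cumulative distribution function of a Gaussian, via the error function."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sigma * math.sqrt(2.0))))

def error_rate(boundary, mu_alligator=12.0, mu_crocodile=15.0, sigma=2.0,
               prior_crocodile=0.5):
    """Probability of misclassification when lengths below `boundary` are
    labeled alligator and lengths above it are labeled crocodile."""
    prior_alligator = 1.0 - prior_crocodile
    crocodiles_missed = prior_crocodile * normal_cdf(boundary, mu_crocodile, sigma)
    alligators_missed = prior_alligator * (1.0 - normal_cdf(boundary, mu_alligator, sigma))
    return crocodiles_missed + alligators_missed

print(round(error_rate(13.5), 4))  # 0.2266
```

Moving the boundary away from 13.5 in either direction increases the value
returned, which is consistent with 13.5 being the ideal boundary under equal
priors.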
5.3.5 Bayesian Belief Networks
The conditional independence assumption made by naïve Bayes classifiers may
seem too rigid, especially for classification problems in which the attributes
are somewhat correlated. This section presents a more flexible approach for
modeling the class-conditional probabilities P(X | Y). Instead of requiring all
the attributes to be conditionally independent given the class, this approach
allows us to specify which pairs of attributes are conditionally independent.
We begin with a discussion on how to represent and build such a probabilistic
model, followed by an example of how to make inferences from the model.
Model Representation
A Bayesian belief network (BBN), or simply, Bayesian network, provides a
graphical representation of the probabilistic relationships among a set of
random variables. There are two key elements of a Bayesian network:
1. A directed acyclic graph (dag) encoding the dependence relationships
among a set of variables.
2. A probability table associating each node to its immediate parent nodes.
Consider three random variables, A, B, and C, in which A and B are
independent variables and each has a direct influence on a third variable, C.
The relationships among the variables can be summarized into the directed
acyclic graph shown in Figure 5.12(a). Each node in the graph represents a
variable, and each arc asserts the dependence relationship between the pair
of variables. If there is a directed arc from X to Y , then X is the parent of
Y and Y is the child of X. Furthermore, if there is a directed path in the
network from X to Z, then X is an ancestor of Z, while Z is a descendant
of X. For example, in the diagram shown in Figure 5.12(b), A is a descendant
of D and D is an ancestor of B. Both B and D are also nondescendants of
A. An important property of the Bayesian network can be stated as follows:
Property 1 (Conditional Independence). A node in a Bayesian network
is conditionally independent of its nondescendants, if its parents are known.
In the diagram shown in Figure 5.12(b), A is conditionally independent of
both B and D given C because the nodes for B and D are nondescendants
of node A. The conditional independence assumption made by a naïve Bayes
classifier can also be represented using a Bayesian network, as shown in Figure
5.12(c), where y is the target class and {X1, X2, . . . , Xd} is the attribute set.
Besides the conditional independence conditions imposed by the network
topology, each node is also associated with a probability table.
1. If a node X does not have any parents, then the table contains only the
prior probability P(X).
2. If a node X has only one parent, Y, then the table contains the
conditional probability P(X | Y).
3. If a node X has multiple parents, {Y1, Y2, . . . , Yk}, then the table contains
the conditional probability P(X | Y1, Y2, . . . , Yk).
Figure 5.13. A Bayesian belief network for detecting heart disease and heartburn in patients.
(Network diagram with nodes Exercise, Diet, Heart Disease, Heartburn, Blood
Pressure, and Chest Pain, annotated with the following probability tables.)
P(E = Yes) = 0.7
P(D = Healthy) = 0.25
P(HD = Yes | E = Yes, D = Healthy) = 0.25
P(HD = Yes | E = Yes, D = Unhealthy) = 0.45
P(HD = Yes | E = No, D = Healthy) = 0.55
P(HD = Yes | E = No, D = Unhealthy) = 0.75
P(Hb = Yes | D = Healthy) = 0.2
P(Hb = Yes | D = Unhealthy) = 0.85
P(BP = High | HD = Yes) = 0.85
P(BP = High | HD = No) = 0.2
P(CP = Yes | HD, Hb) over the four combinations of HD and Hb: 0.8, 0.6, 0.4, 0.1
Figure 5.13 shows an example of a Bayesian network for modeling patients
with heart disease or heartburn problems. Each variable in the diagram is
assumed to be binary-valued. The parent nodes for heart disease (HD)
correspond to risk factors that may affect the disease, such as exercise (E) and
diet (D). The child nodes for heart disease correspond to symptoms of the
disease, such as chest pain (CP) and high blood pressure (BP). For example,
the diagram shows that heartburn (Hb) may result from an unhealthy diet
and may lead to chest pain.
The nodes associated with the risk factors contain only the prior
probabilities, whereas the nodes for heart disease, heartburn, and their
corresponding symptoms contain the conditional probabilities. To save space,
some of the probabilities have been omitted from the diagram. The omitted
probabilities can be recovered by noting that P(X = x̄) = 1 − P(X = x) and
P(X = x̄ | Y) = 1 − P(X = x | Y), where x̄ denotes the opposite outcome of x.
For example, the conditional probability

$$
\begin{aligned}
&P(\text{Heart Disease} = \text{No} \mid \text{Exercise} = \text{No}, \text{Diet} = \text{Healthy}) \\
&\quad = 1 - P(\text{Heart Disease} = \text{Yes} \mid \text{Exercise} = \text{No}, \text{Diet} = \text{Healthy}) \\
&\quad = 1 - 0.55 = 0.45.
\end{aligned}
$$

Model Building
Model building in Bayesian networks involves two steps: (1) creating the
structure of the network, and (2) estimating the probability values in the tables
associated with each node. The network topology can be obtained by
encoding the subjective knowledge of domain experts. Algorithm 5.3 presents a
systematic procedure for inducing the topology of a Bayesian network.
Algorithm 5.3 Algorithm for generating the topology of a Bayesian network.
1: Let T = (X1, X2, . . . , Xd) denote a total order of the variables.
2: for j = 1 to d do
3:   Let XT(j) denote the jth highest order variable in T.
4:   Let π(XT(j)) = {XT(1), XT(2), . . . , XT(j−1)} denote the set of variables
     preceding XT(j).
5:   Remove the variables from π(XT(j)) that do not affect XT(j) (using prior
     knowledge).
6:   Create an arc between XT(j) and the remaining variables in π(XT(j)).
7: end for
Example 5.4. Consider the variables shown in Figure 5.13. After performing
Step 1, let us assume that the variables are ordered in the following way:
(E, D, HD, Hb, CP, BP). From Steps 2 to 7, starting with variable D, we
obtain the following conditional probabilities:
• P(D | E) is simplified to P(D).
• P(HD | E, D) cannot be simplified.
• P(Hb | HD, E, D) is simplified to P(Hb | D).
• P(CP | Hb, HD, E, D) is simplified to P(CP | Hb, HD).
• P(BP | CP, Hb, HD, E, D) is simplified to P(BP | HD).
Based on these conditional probabilities, we can create arcs between the nodes
(E, HD), (D, HD), (D, Hb), (HD, CP), (Hb, CP), and (HD, BP). These
arcs result in the network structure shown in Figure 5.13.
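Algorithm 5.3 and Example 5.4 can be sketched in Python as follows. The
prior knowledge of Step 5 is represented by a caller-supplied predicate; the
names build_topology, affects, and DIRECT_INFLUENCES are hypothetical and
stand in for the judgment of a domain expert.

```python
def build_topology(order, affects):
    """Return the arcs (parent, child) produced by Algorithm 5.3.

    order   -- the variables listed from highest to lowest order (Step 1)
    affects -- affects(candidate, variable) is True when prior knowledge says
               that `candidate` directly influences `variable` (Step 5)
    """
    arcs = []
    for j, variable in enumerate(order):                          # Steps 2-3
        preceding = order[:j]                                     # Step 4
        parents = [v for v in preceding if affects(v, variable)]  # Step 5
        arcs.extend((parent, variable) for parent in parents)     # Step 6
    return arcs

# Hypothetical encoding of the simplifications listed in Example 5.4:
# each variable maps to the preceding variables that are kept as its parents.
DIRECT_INFLUENCES = {
    "D": set(),
    "HD": {"E", "D"},
    "Hb": {"D"},
    "CP": {"Hb", "HD"},
    "BP": {"HD"},
}

order = ["E", "D", "HD", "Hb", "CP", "BP"]
arcs = build_topology(order, lambda c, v: c in DIRECT_INFLUENCES.get(v, set()))
print(arcs)
```

Because every arc runs from a variable to one that appears later in the
ordering, the resulting graph cannot contain a cycle.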
Algorithm 5.3 guarantees a topology that does not contain any cycles. The
proof for this is quite straightforward. If a cycle exists, then there must be at
least one arc connecting the lower-ordered nodes to the higher-ordered nodes,
and at least another arc connecting the higher-ordered nodes to the
lower-ordered nodes. Since Algorithm 5.3 prevents any arc from connecting the
lower-ordered nodes to the higher-ordered nodes, there cannot be any cycles
in the topology.
Nevertheless, the network topology may change if we apply a different
ordering scheme to the variables. Some topologies may be inferior because
they produce many arcs connecting different pairs of nodes. In principle,
we may have to examine all d! possible orderings to determine the most
appropriate topology, a task that can be computationally expensive. An
alternative approach is to divide the variables into causal and effect variables,
and then draw the arcs from each causal variable to its corresponding effect
variables. This approach eases the task of building the Bayesian network
structure.
Once the right topology has been found, the probability table associated
with each node is determined. Estimating such probabilities is fairly
straightforward and is similar to the approach used by naïve Bayes classifiers.
Example of Inferencing Using BBN
Suppose we are interested in using the BBN shown in Figure 5.13 to diagnose
whether a person has heart disease. The following cases illustrate how the
diagnosis can be made under different scenarios.
Case 1: No Prior Information
Without any prior information, we can determine whether the person is likely
to have heart disease by computing the prior probabilities P (HD = Yes) and
P (HD = No). To simplify the notation, let α ∈ {Yes, No} denote the binary
values of Exercise and β ∈ {Healthy, Unhealthy} denote the binary values
of Diet.
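
$$
\begin{aligned}
P(\text{HD} = \text{Yes}) &= \sum_{\alpha} \sum_{\beta} P(\text{HD} = \text{Yes} \mid E = \alpha, D = \beta)\, P(E = \alpha, D = \beta) \\
&= \sum_{\alpha} \sum_{\beta} P(\text{HD} = \text{Yes} \mid E = \alpha, D = \beta)\, P(E = \alpha)\, P(D = \beta) \\
&= 0.25 \times 0.7 \times 0.25 + 0.45 \times 0.7 \times 0.75 + 0.55 \times 0.3 \times 0.25 \\
&\quad + 0.75 \times 0.3 \times 0.75 \\
&= 0.49.
\end{aligned}
$$

Since P(HD = No) = 1 − P(HD = Yes) = 0.51, the person has a slightly higher
chance of not getting the disease.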
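The marginalization in Case 1 can be sketched in Python with the tables of
Figure 5.13. The dictionary names are choices made here for illustration.

```python
from itertools import product

P_E = {"Yes": 0.7, "No": 0.3}                # P(E)
P_D = {"Healthy": 0.25, "Unhealthy": 0.75}   # P(D)
P_HD_YES = {                                 # P(HD = Yes | E, D)
    ("Yes", "Healthy"): 0.25, ("Yes", "Unhealthy"): 0.45,
    ("No", "Healthy"): 0.55, ("No", "Unhealthy"): 0.75,
}

# Sum over both parents; E and D are independent, so P(E, D) = P(E) P(D).
p_hd_yes = sum(P_HD_YES[(e, d)] * P_E[e] * P_D[d] for e, d in product(P_E, P_D))
print(round(p_hd_yes, 2), round(1 - p_hd_yes, 2))  # 0.49 0.51
```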
Case 2: High Blood Pressure
If the person has high blood pressure, we can make a diagnosis about heart
disease by comparing the posterior probabilities, P (HD = YesBP = High)
against P (HD = NoBP = High). To do this, we must compute P (BP = High):
P (BP = High) =
∑
γ
P (BP = HighHD = γ)P (HD = γ)
= 0.85 × 0.49 + 0.2 × 0.51 = 0.5185.
where γ ∈ {Yes, No}. Therefore, the posterior probability that the person has
heart disease is
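\[
\begin{aligned}
P(\mathrm{HD}=\mathrm{Yes} \mid \mathrm{BP}=\mathrm{High}) &= \frac{P(\mathrm{BP}=\mathrm{High} \mid \mathrm{HD}=\mathrm{Yes})\,P(\mathrm{HD}=\mathrm{Yes})}{P(\mathrm{BP}=\mathrm{High})} \\
&= \frac{0.85 \times 0.49}{0.5185} = 0.8033.
\end{aligned}
\]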
Similarly, P(HD = No | BP = High) = 1 − 0.8033 = 0.1967. Therefore, observing
high blood pressure raises the probability of heart disease from 0.49 to 0.8033.
Case 3: High Blood Pressure, Healthy Diet, and Regular Exercise
Suppose we are told that the person exercises regularly and eats a healthy diet.
How does the new information affect our diagnosis? With the new information,
the posterior probability that the person has heart disease is
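\[
\begin{aligned}
&P(\mathrm{HD}=\mathrm{Yes} \mid \mathrm{BP}=\mathrm{High}, D=\mathrm{Healthy}, E=\mathrm{Yes}) \\
&\quad= \left[\frac{P(\mathrm{BP}=\mathrm{High} \mid \mathrm{HD}=\mathrm{Yes}, D=\mathrm{Healthy}, E=\mathrm{Yes})}{P(\mathrm{BP}=\mathrm{High} \mid D=\mathrm{Healthy}, E=\mathrm{Yes})}\right] \times P(\mathrm{HD}=\mathrm{Yes} \mid D=\mathrm{Healthy}, E=\mathrm{Yes}) \\
&\quad= \frac{P(\mathrm{BP}=\mathrm{High} \mid \mathrm{HD}=\mathrm{Yes})\,P(\mathrm{HD}=\mathrm{Yes} \mid D=\mathrm{Healthy}, E=\mathrm{Yes})}{\sum_{\gamma} P(\mathrm{BP}=\mathrm{High} \mid \mathrm{HD}=\gamma)\,P(\mathrm{HD}=\gamma \mid D=\mathrm{Healthy}, E=\mathrm{Yes})} \\
&\quad= \frac{0.85 \times 0.25}{0.85 \times 0.25 + 0.2 \times 0.75} \\
&\quad= 0.5862,
\end{aligned}
\]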
while the probability that the person does not have heart disease is
P(HD = No | BP = High, D = Healthy, E = Yes) = 1 − 0.5862 = 0.4138.
The model therefore suggests that eating healthily and exercising regularly
may reduce a person’s risk of getting heart disease.
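The three cases above can be reproduced numerically. The following is a minimal sketch, assuming the probability values read off the worked arithmetic (for example, P(E = Yes) = 0.7 and P(D = Healthy) = 0.25); the dictionary and function names are illustrative and do not come from the text.

```python
# Probabilities read off the worked arithmetic in Cases 1-3
# (assumed to match the tables of Figure 5.13).
P_E = {"Yes": 0.7, "No": 0.3}                  # P(E = alpha)
P_D = {"Healthy": 0.25, "Unhealthy": 0.75}     # P(D = beta)
P_HD_YES = {                                   # P(HD = Yes | E, D)
    ("Yes", "Healthy"): 0.25,
    ("Yes", "Unhealthy"): 0.45,
    ("No", "Healthy"): 0.55,
    ("No", "Unhealthy"): 0.75,
}
P_BP_HIGH = {"Yes": 0.85, "No": 0.2}           # P(BP = High | HD)


def prior_hd_yes():
    """Case 1: marginalize over Exercise and Diet."""
    return sum(P_HD_YES[(e, d)] * P_E[e] * P_D[d] for e in P_E for d in P_D)


def posterior_hd_yes_given_bp_high(p_hd_yes):
    """Bayes' theorem for P(HD = Yes | BP = High), given any P(HD = Yes)."""
    numerator = P_BP_HIGH["Yes"] * p_hd_yes
    evidence = numerator + P_BP_HIGH["No"] * (1.0 - p_hd_yes)
    return numerator / evidence


case1 = prior_hd_yes()
case2 = posterior_hd_yes_given_bp_high(case1)
# Case 3: condition on E = Yes and D = Healthy before applying Bayes' theorem.
case3 = posterior_hd_yes_given_bp_high(P_HD_YES[("Yes", "Healthy")])

print(round(case1, 4), round(case2, 4), round(case3, 4))
```

The same helper serves Cases 2 and 3 because, given HD, blood pressure does not depend on diet or exercise; only the probability of heart disease that enters Bayes' theorem changes.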
Characteristics of BBN
Following are some of the general characteristics of the BBN method:
1. BBN provides an approach for capturing the prior knowledge of a particular
domain using a graphical model. The network can also be used
to encode causal dependencies among variables.
2. Constructing the network can be time consuming and requires a large
amount of effort. However, once the structure of the network has been
determined, adding a new variable is quite straightforward.
3. Bayesian networks are well suited to dealing with incomplete data. Instances
with missing attributes can be handled by summing or integrating
the probabilities over all possible values of the attribute.
4. Because the data is combined probabilistically with prior knowledge, the
method is quite robust to model overfitting.
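As an illustration of item 3, suppose the value of Exercise is missing for a person known to eat a healthy diet. A sketch of the computation, assuming the same probability values used in the worked heart disease example and that Exercise and Diet are independent, as in that example:

```latex
P(\mathrm{HD}=\mathrm{Yes} \mid D=\mathrm{Healthy})
  = \sum_{\alpha} P(\mathrm{HD}=\mathrm{Yes} \mid E=\alpha, D=\mathrm{Healthy})\,P(E=\alpha)
  = 0.25 \times 0.7 + 0.55 \times 0.3
  = 0.34.
```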
5.4 Artificial Neural Network (ANN)
The study of artificial neural networks (ANN) was inspired by attempts to
simulate biological neural systems. The human brain consists primarily of
nerve cells called neurons, linked together with other neurons via strands
of fiber called axons. Axons are used to transmit nerve impulses from one
neuron to another whenever the neurons are stimulated. A neuron is connected
to the axons of other neurons via dendrites, which are extensions from the
cell body of the neuron. The contact point between a dendrite and an axon is
called a synapse. Neurologists have discovered that the human brain learns
by changing the strength of the synaptic connection between neurons upon
repeated stimulation by the same impulse.
Analogous to the structure of the human brain, an ANN is composed of an
interconnected assembly of nodes and directed links. In this section, we will
examine a family of ANN models, starting with the simplest model, called the
perceptron, and show how the models can be trained to solve classification problems.
5.4.1 Perceptron
Consider the diagram shown in Figure 5.14. The table on the left shows a data
set containing three boolean variables (x1, x2, x3) and an output variable, y,
that takes on the value −1 if at least two of the three inputs are zero, and +1
if at least two of the inputs are greater than zero.
X1   X2   X3    y
 1    0    0   −1
 1    0    1    1
 1    1    0    1
 1    1    1    1
 0    0    1   −1
 0    1    0   −1
 0    1    1    1
 0    0    0   −1
(a) Data set.
(b) Perceptron: three input nodes (X1, X2, X3), each linked to the output node y
with weight 0.3; the output node has bias factor t = 0.4.
Figure 5.14. Modeling a boolean function using a perceptron.
Figure 5.14(b) illustrates a simple neural network architecture known as a
perceptron. The perceptron consists of two types of nodes: input nodes, which
are used to represent the input attributes, and an output node, which is used
to represent the model output. The nodes in a neural network architecture
are commonly known as neurons or units. In a perceptron, each input node is
connected via a weighted link to the output node. The weighted link is used to
emulate the strength of synaptic connection between neurons. As in biological
neural systems, training a perceptron model amounts to adapting the weights
of the links until they fit the input-output relationships of the underlying data.
A perceptron computes its output value, ŷ, by performing a weighted sum
on its inputs, subtracting a bias factor t from the sum, and then examining
the sign of the result. The model shown in Figure 5.14(b) has three input
nodes, each of which has an identical weight of 0.3 to the output node and a
bias factor of t = 0.4. The output computed by the model is
\[
\hat{y} =
\begin{cases}
1, & \text{if } 0.3x_1 + 0.3x_2 + 0.3x_3 - 0.4 > 0; \\
-1, & \text{if } 0.3x_1 + 0.3x_2 + 0.3x_3 - 0.4 < 0.
\end{cases}
\tag{5.21}
\]
For example, if x1 = 1, x2 = 1, x3 = 0, then ŷ = +1 because 0.3x1 + 0.3x2 +
0.3x3 − 0.4 = 0.2 is positive. On the other hand, if x1 = 0, x2 = 1, x3 = 0, then
ŷ = −1 because the weighted sum minus the bias factor, 0.3 − 0.4 = −0.1, is negative.
Note the difference between the input and output nodes of a perceptron.
An input node simply transmits the value it receives to the outgoing link without
performing any transformation. The output node, on the other hand, is a
mathematical device that computes the weighted sum of its inputs, subtracts
the bias term, and then produces an output that depends on the sign of the
resulting sum. More specifically, the output of a perceptron model can be
expressed mathematically as follows:
\[
\hat{y} = \operatorname{sign}\bigl(w_d x_d + w_{d-1} x_{d-1} + \cdots + w_2 x_2 + w_1 x_1 - t\bigr), \tag{5.22}
\]
where w1, w2, . . . , wd are the weights of the input links and x1, x2, . . . , xd are
the input attribute values. The sign function, which acts as an activation
function for the output neuron, outputs a value +1 if its argument is positive
and −1 if its argument is negative. The perceptron model can be written in a
more compact form as follows:
\[
\hat{y} = \operatorname{sign}\bigl[w_d x_d + w_{d-1} x_{d-1} + \cdots + w_1 x_1 + w_0 x_0\bigr] = \operatorname{sign}(\mathbf{w} \cdot \mathbf{x}), \tag{5.23}
\]
where w0 = −t, x0 = 1, and w·x is the dot product between the weight vector
w and the input attribute vector x.
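The two forms of the perceptron model can be checked against the data set of Figure 5.14. The following is a minimal sketch using the weights 0.3 and bias factor t = 0.4 from the figure; the function names are illustrative, and mapping an argument of exactly zero to −1 is an assumption, since the text leaves that case undefined.

```python
def sign(value):
    """Return +1 for a positive argument and -1 otherwise.

    Assumption: an argument of exactly 0 is mapped to -1.
    """
    return 1 if value > 0 else -1


def predict_with_bias(x, weights=(0.3, 0.3, 0.3), t=0.4):
    """Equation 5.22: weighted sum of the inputs minus the bias factor t."""
    return sign(sum(w * xi for w, xi in zip(weights, x)) - t)


def predict_compact(x, w=(-0.4, 0.3, 0.3, 0.3)):
    """Equation 5.23: w0 = -t is paired with a constant input x0 = 1."""
    augmented = (1,) + tuple(x)
    return sign(sum(wj * xj for wj, xj in zip(w, augmented)))


# Data set of Figure 5.14(a): (x1, x2, x3) -> y
DATA = [
    ((1, 0, 0), -1), ((1, 0, 1), 1), ((1, 1, 0), 1), ((1, 1, 1), 1),
    ((0, 0, 1), -1), ((0, 1, 0), -1), ((0, 1, 1), 1), ((0, 0, 0), -1),
]

for x, y in DATA:
    # Columns: input, true label, Equation 5.22 output, Equation 5.23 output.
    print(x, y, predict_with_bias(x), predict_compact(x))
```

Both functions agree with the true label on all eight rows, which is what folding the bias into w0 is meant to guarantee.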
Learning Perceptron Model
During the training phase of a perceptron model, the weight parameters w
are adjusted until the outputs of the perceptron become consistent with the
true outputs of training examples. A summary of the perceptron learning
algorithm is given in Algorithm 5.4.
The key computation for this algorithm is the weight update formula given
in Step 7 of the algorithm:
\[
w_j^{(k+1)} = w_j^{(k)} + \lambda \bigl(y_i - \hat{y}_i^{(k)}\bigr) x_{ij}, \tag{5.24}
\]
where $w_j^{(k)}$ is the weight parameter associated with the jth input link after the
kth iteration, λ is a parameter known as the learning rate, and xij is the
value of the jth attribute of the training example xi.
Algorithm 5.4 Perceptron learning algorithm.
1: Let D = {(xi, yi) | i = 1, 2, . . . , N} be the set of training examples.
2: Initialize the weight vector with random values, w(0).
3: repeat
4:   for each training example (xi, yi) ∈ D do
5:     Compute the predicted output $\hat{y}_i^{(k)}$.
6:     for each weight wj do
7:       Update the weight, $w_j^{(k+1)} = w_j^{(k)} + \lambda \bigl(y_i - \hat{y}_i^{(k)}\bigr) x_{ij}$.
8:     end for
9:   end for
10: until stopping condition is met
The justification for the weight update formula is rather intuitive. Equation 5.24
shows that the new weight $w_j^{(k+1)}$ is a combination of the old weight $w_j^{(k)}$
and a term proportional to the prediction error, (y − ŷ). If the prediction is
correct, then the weight remains unchanged. Otherwise, it is modified in the
following ways:
• If y = +1 and ŷ = −1, then the prediction error is (y − ŷ) = 2. To
compensate for the error, we need to increase the value of the predicted
output by increasing the weights of all links with positive inputs and
decreasing the weights of all links with negative inputs.
• If y = −1 and ŷ = +1, then (y − ŷ) = −2. To compensate for the error,
we need to decrease the value of the predicted output by decreasing the
weights of all links with positive inputs and increasing the weights of all
links with negative inputs.
In the weight update formula, links that contribute the most to the error term
are the ones that require the largest adjustment. However, the weights should
not be changed too drastically because the error term is computed only for
the current training example. Otherwise, the adjustments made in earlier
iterations will be undone. The learning rate λ, a parameter whose value is
between 0 and 1, can be used to control the amount of adjustment made in
each iteration. If λ is close to 0, then the new weight is mostly influenced
by the value of the old weight. On the other hand, if λ is close to 1, then
the new weight is sensitive to the amount of adjustment performed in the
current iteration. In some cases, an adaptive λ value can be used; initially, λ
is moderately large during the first few iterations and then gradually decreases
in subsequent iterations.
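The learning procedure of Algorithm 5.4 can be sketched in a few lines. This is an illustrative implementation under stated assumptions, not the authors' code: the weights start at zero rather than at random values so the run is reproducible, the stopping condition is an epoch without misclassifications (or a cap of `max_epochs`), and an argument of exactly zero is mapped to −1.

```python
def sign(value):
    # Assumption: an argument of exactly 0 maps to -1.
    return 1 if value > 0 else -1


def predict(w, x):
    """Compact form of Equation 5.23 with x0 = 1 prepended to the attributes."""
    return sign(sum(wj * xj for wj, xj in zip(w, (1,) + tuple(x))))


def train_perceptron(data, learning_rate=0.5, max_epochs=1000):
    """Sketch of Algorithm 5.4; stops after an epoch with no misclassification."""
    n_attributes = len(data[0][0])
    # Assumption: zero initial weights instead of random values.
    w = [0.0] * (n_attributes + 1)
    for _ in range(max_epochs):
        mistakes = 0
        for x, y in data:
            y_hat = predict(w, x)
            if y_hat != y:
                mistakes += 1
            augmented = (1,) + tuple(x)
            for j in range(len(w)):
                # Equation 5.24; the update is zero when the prediction is correct.
                w[j] += learning_rate * (y - y_hat) * augmented[j]
        if mistakes == 0:
            break
    return w


# Data set of Figure 5.14(a): (x1, x2, x3) -> y
DATA = [
    ((1, 0, 0), -1), ((1, 0, 1), 1), ((1, 1, 0), 1), ((1, 1, 1), 1),
    ((0, 0, 1), -1), ((0, 1, 0), -1), ((0, 1, 1), 1), ((0, 0, 0), -1),
]

weights = train_perceptron(DATA)
print(weights)  # weights[0] plays the role of w0 = -t
```

The learned weights need not equal the values shown in Figure 5.14(b); any weight vector that classifies all eight rows correctly is an acceptable result of the procedure.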
The perceptron model shown in Equation 5.23 is linear in its parameters
w and attributes x. Because of this, the decision boundary of a perceptron,
which is obtained by setting w · x = 0, is a linear hyperplane that separates the
data into two classes, −1 and +1. Figure 5.15 shows the decision boundary
obtained by applying the perceptron learning algorithm to the data set given in
Figure 5.14. The perceptron learning algorithm is guaranteed to converge to an
optimal solution (as long as the learning rate is sufficiently small) for linearly
separable classification problems. If the problem is not linearly separable,
the algorithm fails to converge. Figure 5.16 shows an example of nonlinearly
separable data given by the XOR function. A perceptron cannot find the right
solution for this data because there is no linear hyperplane that can perfectly
separate the training instances.
Figure 5.15. Perceptron decision boundary for the data given in Figure 5.14
(axes: X1, X2, X3).
X1   X2    y
 0    0   −1
 1    0    1
 0    1    1
 1    1   −1
Figure 5.16. XOR classification problem. No linear hyperplane can separate the two classes.
5.4.2 Multilayer Artificial Neural Network
An artificial neural network has a more complex structure than that of a
perceptron model. The additional complexities may arise in a number of ways:
1. The network may contain several intermediary layers between its input
and output layers. Such intermediary layers are called hidden layers
and the nodes embedded in these layers are called hidden nodes. The
resulting structure is known as a multilayer neural network (see Figure
5.17). In a feedforward neural network, the nodes in one layer
are connected only to the nodes in the next layer. The perceptron is a
single-layer, feedforward neural network because it has only one layer
of nodes, the output layer, that performs complex mathematical operations.
In a recurrent neural network, the links may connect nodes
within the same layer or nodes from one layer to the previous layers.
2. The network may use types of activation functions other than the sign
function. Examples of other activation functions include linear, sigmoid
(logistic), and hyperbolic tangent functions, as shown in Figure 5.18.
These activation functions allow the hidden and output nodes to produce
output values that are nonlinear in their input parameters.
Figure 5.17. Example of a multilayer feedforward artificial neural network (ANN),
with an input layer (X1 to X5), a hidden layer, and an output layer (y).
Figure 5.18. Types of activation functions in artificial neural networks
(panels: linear function, sigmoid function, tanh function, sign function).
These additional complexities allow multilayer neural networks to model
more complex relationships between the input and output variables. For example,
consider the XOR problem described in the previous section. The instances
can be classified using two hyperplanes that partition the input space
into their respective classes, as shown in Figure 5.19(a). Because a perceptron
can create only one hyperplane, it cannot find the optimal solution. This
problem can be addressed using a two-layer, feedforward neural network, as
shown in Figure 5.19(b). Intuitively, we can think of each hidden node as a
perceptron that tries to construct one of the two hyperplanes, while the output
node simply combines the results of the perceptrons to yield the decision
boundary shown in Figure 5.19(a).
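The intuition of two hidden perceptrons combined by an output perceptron can be made concrete with hand-picked weights. The sketch below is illustrative only: the weights and bias factors are hypothetical values chosen by inspection, not values learned by an algorithm or taken from Figure 5.19, and an argument of exactly zero is mapped to −1.

```python
def sign(value):
    # Assumption: an argument of exactly 0 maps to -1.
    return 1 if value > 0 else -1


def xor_network(x1, x2):
    """Two hidden perceptrons (n3, n4) feed one output perceptron (n5)."""
    n3 = sign(x1 + x2 - 0.5)    # hyperplane 1: +1 when at least one input is 1
    n4 = sign(x1 + x2 - 1.5)    # hyperplane 2: +1 only when both inputs are 1
    return sign(n3 - n4 - 1.0)  # +1 only when n3 = +1 and n4 = -1


# XOR data of Figure 5.16: (x1, x2) -> y
XOR_DATA = [((0, 0), -1), ((1, 0), 1), ((0, 1), 1), ((1, 1), -1)]

for (x1, x2), y in XOR_DATA:
    print((x1, x2), y, xor_network(x1, x2))
```

Each hidden node defines one of the two parallel lines, and the output node keeps only the region between them, which is exactly the region containing the two positive XOR instances.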
To learn the weights of an ANN model, we need an efficient algorithm
that converges to the right solution when a sufficient amount of training data
is provided. One approach is to treat each hidden node or output node in
the network as an independent perceptron unit and to apply the same weight
update formula as Equation 5.24. This approach does not work, however,
because we lack a priori knowledge about the true outputs of the hidden
nodes. This makes it difficult to determine the error term, (y − ŷ), associated
with each hidden node. A methodology for learning the weights of a neural
network based on the gradient descent approach is presented next.
Figure 5.19. A two-layer, feedforward neural network for the XOR problem.
(a) Decision boundary in the (X1, X2) plane. (b) Neural network topology: input
nodes n1 and n2, hidden nodes n3 and n4, and output node n5, with weights w31,
w32, w41, w42, w53, and w54.
Learning the ANN Model
The goal of the ANN learning algorithm is to determine a set of weights w
that minimize the total sum of squared errors:
\[
E(\mathbf{w}) = \frac{1}{2}\sum_{i=1}^{N} (y_i - \hat{y}_i)^2. \tag{5.25}
\]
Note that the sum of squared errors depends on w because the predicted class
ŷ is a function of the weights assigned to the hidden and output nodes. Figure
5.20 shows an example of the error surface as a function of its two parameters,
w1 and w2. This type of error surface is typically encountered when ŷi is a
linear function of its parameters, w. If we substitute ŷ = w · x into Equation
5.25, then the error function becomes quadratic in its parameters and a global
minimum solution can be easily found.
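To make the last statement concrete, the following sketch works out the minimum for the linear case. The matrix X (whose rows are the training examples) and the vector y of target values are notation introduced here for illustration.

```latex
E(\mathbf{w}) = \frac{1}{2}\sum_{i=1}^{N}\bigl(y_i - \mathbf{w}\cdot\mathbf{x}_i\bigr)^2,
\qquad
\frac{\partial E}{\partial w_j}
  = -\sum_{i=1}^{N}\bigl(y_i - \mathbf{w}\cdot\mathbf{x}_i\bigr)\,x_{ij} = 0
  \quad (j = 1, \ldots, d)
\;\Longrightarrow\;
\mathbf{X}^{T}\mathbf{X}\,\mathbf{w} = \mathbf{X}^{T}\mathbf{y}.
```

When the matrix on the left-hand side is invertible, this linear system has a single solution, which is the global minimum of the quadratic error function.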
Figure 5.20. Error surface E(w1, w2) for a two-parameter model.
In most cases, the output of an ANN is a nonlinear function of its parameters
because of the choice of its activation functions (e.g., sigmoid or tanh
function). As a result, it is no longer straightforward to derive a solution for
w that is guaranteed to be globally optimal. Greedy algorithms such as those
based on the gradient descent method have been developed to efficiently solve
the optimization problem. The weight update formula used by the gradient
descent method can be written as follows:
\[
w_j \leftarrow w_j - \lambda \frac{\partial E(\mathbf{w})}{\partial w_j}, \tag{5.26}
\]
where λ is the learning rate. The second term states that the weight should be
moved in the direction that reduces the overall error term. However, because
the error function is nonlinear, it is possible that the gradient descent method
may get trapped in a local minimum.
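The update rule of Equation 5.26 can be illustrated on the simplest case, a linear unit with one input and a bias weight. The sketch below is a minimal example under stated assumptions: the data are hypothetical points generated from y = 1 + 2x, and the learning rate and iteration count are arbitrary choices that happen to be small and large enough, respectively, for this data.

```python
def gradient(w, data):
    """Partial derivatives of E(w) = 0.5 * sum (y - (w0 + w1 * x))^2."""
    g0 = g1 = 0.0
    for x, y in data:
        residual = y - (w[0] + w[1] * x)
        g0 -= residual        # dE/dw0
        g1 -= residual * x    # dE/dw1
    return [g0, g1]


def gradient_descent(data, learning_rate=0.05, n_iterations=2000):
    w = [0.0, 0.0]
    for _ in range(n_iterations):
        g = gradient(w, data)
        # Equation 5.26: move each weight against its partial derivative.
        w = [wj - learning_rate * gj for wj, gj in zip(w, g)]
    return w


# Hypothetical data generated from y = 1 + 2x.
DATA = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]

w = gradient_descent(DATA)
print(w)  # approaches w0 = 1, w1 = 2
```

Because the error surface is quadratic here, there is a single minimum and the iteration converges to it; with nonlinear activation functions the same loop may stop at a local minimum instead.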
The gradient descent method can be used to learn the weights of the output
and hidden nodes of a neural network. For hidden nodes, the computation
is not trivial because it is difficult to assess their error term, ∂E/∂wj, without
knowing what their output values should be. A technique known as backpropagation
has been developed to address this problem. There are two
phases in each iteration of the algorithm: the forward phase and the backward
phase. During the forward phase, the weights obtained from the previous iteration
are used to compute the output value of each neuron in the network. The
computation progresses in the forward direction; i.e., outputs of the neurons
at layer k are computed prior to computing the outputs at layer k + 1. During
the backward phase, the weight update formula is applied in the reverse
direction. In other words, the weights at layer k + 1 are updated before the
weights at layer k are updated. This backpropagation approach allows us to
use the errors for neurons at layer k + 1 to estimate the errors for neurons at
layer k.
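The two phases can be sketched for a small network with two inputs, two hidden nodes, and one output node. This is an illustrative example under stated assumptions rather than the book's algorithm: all nodes use the sigmoid activation function, the error is the squared error of a single training example, and the starting weights are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(params, x):
    """Forward phase: hidden-layer outputs first, then the output node."""
    W1, b1, W2, b2 = params
    h = [sigmoid(sum(wkj * xj for wkj, xj in zip(W1[k], x)) + b1[k])
         for k in range(len(W1))]
    y_hat = sigmoid(sum(wk * hk for wk, hk in zip(W2, h)) + b2)
    return h, y_hat


def error(params, x, y):
    """Squared error of Equation 5.25 for a single training example."""
    return 0.5 * (y - forward(params, x)[1]) ** 2


def backward(params, x, y):
    """Backward phase: output-node error first, then the hidden-node errors."""
    W1, b1, W2, b2 = params
    h, y_hat = forward(params, x)
    # Error signal at the output node (derivative of E w.r.t. its net input).
    delta_out = (y_hat - y) * y_hat * (1.0 - y_hat)
    grad_W2 = [delta_out * hk for hk in h]
    # Hidden-node error signals are estimated from the output-node error.
    delta_hid = [delta_out * W2[k] * h[k] * (1.0 - h[k]) for k in range(len(h))]
    grad_W1 = [[dk * xj for xj in x] for dk in delta_hid]
    return grad_W1, delta_hid, grad_W2, delta_out


def update(params, grads, learning_rate=0.5):
    """Apply Equation 5.26 to every weight and bias."""
    W1, b1, W2, b2 = params
    gW1, gb1, gW2, gb2 = grads
    new_W1 = [[w - learning_rate * g for w, g in zip(row, grow)]
              for row, grow in zip(W1, gW1)]
    new_b1 = [b - learning_rate * g for b, g in zip(b1, gb1)]
    new_W2 = [w - learning_rate * g for w, g in zip(W2, gW2)]
    return new_W1, new_b1, new_W2, b2 - learning_rate * gb2


# Hypothetical starting weights and one training example.
params = ([[0.1, -0.2], [0.4, 0.3]], [0.0, 0.1], [0.5, -0.6], 0.2)
x, y = [1.0, 0.0], 1.0
new_params = update(params, backward(params, x, y))
print(error(params, x, y), error(new_params, x, y))
```

The hidden-node error signals are never compared with a known target; they are obtained by passing the output-node error back through the weights that connect the two layers, which is the estimate described in the text.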
Design Issues in ANN Learning
Before we train a neural network to learn a classification task, the following
design issues must be considered.
1. The number of nodes in the input layer should be determined. Assign an
input node to each numerical or binary input variable. If the input variable
is categorical, we could either create one node for each categorical
value or encode the k-ary variable using ⌈log2 k⌉ input nodes.
2. The number of nodes in the output layer should be established. For
a two-class problem, it is sufficient to use a single output node. For a
k-class problem, there are k output nodes.
3. The network topology (e.g., the number of hidden layers and hidden
nodes, and feedforward or recurrent network architecture) must be selected.
Note that the target function representation depends on the
weights of the links, the number of hidden nodes and hidden layers, biases
in the nodes, and type of activation function. Finding the right
topology is not an easy task. One way to do this is to start from a fully
connected network with a sufficiently large number of nodes and hidden
layers, and then repeat the model-building procedure with a smaller
number of nodes. This approach can be very time consuming. Alternatively,
instead of repeating the model-building procedure, we could
remove some of the nodes and repeat the model evaluation procedure to
select the right model complexity.
4. The weights and biases need to be initialized. Random assignments are
usually acceptable.
5. Training examples with missing values should be removed or replaced
with the most likely values.
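The two encodings of a categorical input mentioned in item 1 can be sketched as follows. The function names are illustrative, and the category is assumed to be given as an integer index between 0 and k − 1.

```python
def one_node_per_value(index, k):
    """One input node per categorical value: exactly one node is set to 1."""
    return [1 if i == index else 0 for i in range(k)]


def binary_nodes(index, k):
    """Encode the index in binary using ceil(log2 k) input nodes."""
    # (k - 1).bit_length() equals ceil(log2 k) for k >= 2; use at least one node.
    n_nodes = max(1, (k - 1).bit_length())
    return [(index >> shift) & 1 for shift in range(n_nodes - 1, -1, -1)]


# A hypothetical attribute with k = 5 values, third value (index 2).
print(one_node_per_value(2, 5))
print(binary_nodes(2, 5))
```

The first encoding uses more input nodes but treats every value symmetrically; the second is more compact but makes some values share active nodes, a trade-off that the network has to compensate for during training.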
5.4.3 Characteristics of ANN
Following is a summary of the general characteristics of an artificial neural
network:
1. Multilayer neural networks with at least one hidden layer are universal
approximators; i.e., they can be used to approximate any target
function. Since an ANN has a very expressive hypothesis space, it is important
to choose the appropriate network topology for a given problem
to avoid model overfitting.
2. ANN can handle redundant features because the weights are automatically
learned during the training step. The weights for redundant features
tend to be very small.
3. Neural networks are quite sensitive to the presence of noise in the training
data. One approach to handling noise is to use a validation set to
determine the generalization error of the model. Another approach is to
decrease the weight by some factor at each iteration.
4. The gradient descent method used for learning the weights of an ANN
often converges to some local minimum. One way to escape from the local
minimum is to add a momentum term to the weight update formula.
5. Training an ANN is a time consuming process, especially when the number
of hidden nodes is large. Nevertheless, test examples can be classified
rapidly.
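The weight decay of item 3 and the momentum term of item 4 are both small modifications of the gradient descent update in Equation 5.26. The following is one common way to write them; the symbols μ (momentum coefficient), ε (decay factor), and Δw_j are introduced here for illustration and are not defined in the text.

```latex
% Momentum: reuse a fraction of the previous step
\Delta w_j^{(k)} = -\lambda \frac{\partial E(\mathbf{w})}{\partial w_j} + \mu\,\Delta w_j^{(k-1)},
\qquad
w_j \leftarrow w_j + \Delta w_j^{(k)},
\qquad 0 \le \mu < 1

% Weight decay: shrink the weight by a small factor at each iteration
w_j \leftarrow (1 - \epsilon)\,w_j - \lambda \frac{\partial E(\mathbf{w})}{\partial w_j},
\qquad 0 < \epsilon \ll 1
```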
5.5 Support Vector Machine (SVM)
A classification technique that has received considerable attention is the support
vector machine (SVM). This technique has its roots in statistical learning theory
and has shown promising empirical results in many practical applications,
from handwritten digit recognition to text categorization. SVM also works
very well with high-dimensional data and avoids the curse of dimensionality
problem. Another unique aspect of this approach is that it represents the decision
boundary using a subset of the training examples, known as the support
vectors.
To illustrate the basic idea behind SVM, we first introduce the concept of
a maximal margin hyperplane and explain the rationale of choosing such
a hyperplane. We then describe how a linear SVM can be trained to explicitly
look for this type of hyperplane in linearly separable data. We conclude by
showing how the SVM methodology can be extended to nonlinearly separable
data.
5.5.1 Maximum Margin Hyperplanes
Figure 5.21 shows a plot of a data set containing examples that belong to
two different classes, represented as squares and circles. The data set is also
linearly separable; i.e., we can find a hyperplane such that all the squares
reside on one side of the hyperplane and all the circles reside on the other
side. However, as shown in Figure 5.21, there are infinitely many such hyperplanes
possible. Although their training errors are zero, there is no guarantee
that the hyperplanes will perform equally well on previously unseen examples.
The classifier must choose one of these hyperplanes to represent its decision
boundary, based on how well they are expected to perform on test examples.
Figure 5.21. Possible decision boundaries for a linearly separable data set.
To get a clearer picture of how the different choices of hyperplanes affect the
generalization errors, consider the two decision boundaries, B1 and B2, shown
in Figure 5.22. Both decision boundaries can separate the training examples
into their respective classes without committing any misclassification errors.
Each decision boundary Bi is associated with a pair of hyperplanes, denoted
as bi1 and bi2, respectively. bi1 is obtained by moving a parallel hyperplane
away from the decision boundary until it touches the closest square(s), whereas
bi2 is obtained by moving the hyperplane until it touches the closest circle(s).
The distance between these two hyperplanes is known as the margin of the
classifier. From the diagram shown in Figure 5.22, notice that the margin for
B1 is considerably larger than that for B2. In this example, B1 turns out to
be the maximum margin hyperplane of the training instances.
Rationale for Maximum Margin
Decision boundaries with large margins tend to have lower generalization
errors than those with small margins. Intuitively, if the margin is small, then
any slight perturbations to the decision boundary can have quite a significant
impact on its classification. Classifiers that produce decision boundaries with
small margins are therefore more susceptible to model overfitting and tend to
generalize poorly on previously unseen examples.
Figure 5.22. Margin of a decision boundary (decision boundaries B1 and B2, with
the hyperplanes b11, b12 and b21, b22 that delimit the margin of each).
A more formal explanation relating the margin of a linear classifier to its
generalization error is given by a statistical learning principle known as structural
risk minimization (SRM). This principle provides an upper bound to
the generalization error of a classifier (R) in terms of its training error (Re),
the number of training examples (N), and the model complexity, otherwise
known as its capacity (h). More specifically, with a probability of 1 − η, the
generalization error of the classifier can be at worst
\[
R \le R_e + \varphi\left(\frac{h}{N}, \frac{\log(\eta)}{N}\right), \tag{5.27}
\]
where ϕ is a monotone increasing function of the capacity h. The preceding
inequality may seem quite familiar to the readers because it resembles
the equation given in Section 4.4.4 (on page 179) for the minimum description
length (MDL) principle. In this regard, SRM is another way to express
generalization error as a tradeoff between training error and model complexity.
The capacity of a linear model is inversely related to its margin. Models
with small margins have higher capacities because they are more flexible and
can fit many training sets, unlike models with large margins. However, according
to the SRM principle, as the capacity increases, the generalization error
bound will also increase. Therefore, it is desirable to design linear classifiers
that maximize the margins of their decision boundaries in order to ensure that
their worst-case generalization errors are minimized. One such classifier is the
linear SVM, which is explained in the next section.
5.5.2 Linear SVM: Separable Case
A linear SVM is a classifier that searches for a hyperplane with the largest
margin, which is why it is often known as a maximal margin classifier. To
understand how SVM learns such a boundary, we begin with some preliminary
discussion about the decision boundary and margin of a linear classifier.
Linear Decision Boundary
Consider a binary classification problem consisting of N training examples.
Each example is denoted by a tuple (xi, yi) (i = 1, 2, . . . , N), where xi =
(xi1, xi2, . . . , xid)^T corresponds to the attribute set for the ith example. By
convention, let yi ∈ {−1, 1} denote its class label. The decision boundary of a
linear classifier can be written in the following form:
w · x + b = 0, (5.28)
where w and b are parameters of the model.
Figure 5.23 shows a two-dimensional training set consisting of squares and
circles. A decision boundary that bisects the training examples into their
respective classes is illustrated with a solid line. Any example located along
the decision boundary must satisfy Equation 5.28. For example, if xa and xb
are two points located on the decision boundary, then
w · xa + b = 0,
w · xb + b = 0.
Subtracting the two equations will yield the following:
w · (xb − xa) = 0,
where xb − xa is a vector parallel to the decision boundary and is directed
from xa to xb. Since the dot product is zero, the direction for w must be
perpendicular to the decision boundary, as shown in Figure 5.23.
Figure 5.23. Decision boundary and margin of SVM. The plot shows the decision
boundary w · x + b = 0, the parallel hyperplanes w · x + b = 1 and w · x + b = −1,
the normal vector w, the points x1 and x2, the vector x1 − x2, and the margin d.
For any square xs located above the decision boundary, we can show that
w · xs + b = k, (5.29)
where k > 0. Similarly, for any circle xc located below the decision boundary,
we can show that
w · xc + b = k′, (5.30)
where k′ < 0. If we label all the squares as class +1 and all the circles as
class −1, then we can predict the class label y for any test example z in the
following way:
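\[
y =
\begin{cases}
1, & \text{if } \mathbf{w} \cdot \mathbf{z} + b > 0; \\
-1, & \text{if } \mathbf{w} \cdot \mathbf{z} + b < 0.
\end{cases}
\tag{5.31}
\]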
Margin of a Linear Classifier
Consider the square and the circle that are closest to the decision boundary.
Since the square is located above the decision boundary, it must satisfy Equation
5.29 for some positive value k, whereas the circle must satisfy Equation
5.30 for some negative value k′. We can rescale the parameters w and b of
the decision boundary so that the two parallel hyperplanes bi1 and bi2 can be
expressed as follows:
bi1 : w · x + b = 1, (5.32)
bi2 : w · x + b = −1. (5.33)
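The margin of the decision boundary is given by the distance between these
two hyperplanes. To compute the margin, let x1 be a data point located on
bi1 and x2 be a data point on bi2, as shown in Figure 5.23. Upon substituting
these points into Equations 5.32 and 5.33, the margin d can be computed by
subtracting the second equation from the first equation. Because w is
perpendicular to both hyperplanes, the dot product w · (x1 − x2) equals ‖w‖
times the length of the projection of x1 − x2 onto the direction of w, which is
the margin d:
\[
\begin{aligned}
\mathbf{w} \cdot (\mathbf{x}_1 - \mathbf{x}_2) &= 2 \\
\|\mathbf{w}\| \times d &= 2 \\
\therefore\; d &= \frac{2}{\|\mathbf{w}\|}.
\end{aligned}
\tag{5.34}
\]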
Learning a Linear SVM Model
The training phase of SVM involves estimating the parameters w and b of the
decision boundary from the training data. The parameters must be chosen in
such a way that the following two conditions are met:
w · xi + b ≥ 1 if yi = 1,
w · xi + b ≤ −1 if yi = −1. (5.35)
These conditions impose the requirements that all training instances from
class y = 1 (i.e., the squares) must be located on or above the hyperplane
w · x + b = 1, while those instances from class y = −1 (i.e., the circles) must
be located on or below the hyperplane w · x + b = −1. Both inequalities can
be summarized in a more compact form as follows:
yi(w · xi + b) ≥ 1, i = 1, 2, . . . , N. (5.36)
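The constraint of Equation 5.36 and the margin of Equation 5.34 can be checked for candidate parameters. The sketch below is illustrative: the training set and the three candidate parameter settings are hypothetical, and all three describe the same line x1 + x2 − 1 = 0 with differently scaled parameters.

```python
import math


def satisfies_constraints(w, b, data):
    """Equation 5.36: y_i (w . x_i + b) >= 1 for every training example."""
    return all(y * (sum(wj * xj for wj, xj in zip(w, x)) + b) >= 1
               for x, y in data)


def margin(w):
    """Equation 5.34: d = 2 / ||w||."""
    return 2.0 / math.sqrt(sum(wj * wj for wj in w))


# Hypothetical two-dimensional training set: squares (+1) and circles (-1).
DATA = [((2.0, 2.0), 1), ((1.0, 1.0), 1), ((0.0, 0.0), -1), ((-1.0, 0.0), -1)]

wide = ([1.0, 1.0], -1.0)       # feasible, margin 2 / sqrt(2)
narrow = ([2.0, 2.0], -2.0)     # feasible, but a smaller margin
too_small = ([0.5, 0.5], -0.5)  # larger nominal margin, violates Equation 5.36

for name, (w, b) in [("wide", wide), ("narrow", narrow), ("too_small", too_small)]:
    print(name, satisfies_constraints(w, b, DATA), margin(w))
```

Scaling w and b down always increases 2/‖w‖, so the constraints are what prevent the margin from growing without bound; the largest margin is reached when the closest instances satisfy the constraint with equality.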
Although the preceding conditions are also applicable to any linear classifier
(including perceptrons), SVM imposes an additional requirement that the
margin of its decision boundary must be maximal. Maximizing the margin,
however, is equivalent to minimizing the following objective function:
\[
f(\mathbf{w}) = \frac{\|\mathbf{w}\|^2}{2}. \tag{5.37}
\]
Definition 5.1 (Linear SVM: Separable Case). The learning task in SVM
can be formalized as the following constrained optimization problem:
\[
\min_{\mathbf{w}} \frac{\|\mathbf{w}\|^2}{2}
\quad \text{subject to} \quad
y_i(\mathbf{w} \cdot \mathbf{x}_i + b) \ge 1, \; i = 1, 2, \ldots, N.
\]
Since the objective function is quadratic and the constraints are linear in
the parameters w and b, this is known as a convex optimization problem,
which can be solved using the standard Lagrange multiplier method. Following
is a brief sketch of the main ideas for solving the optimization problem.
A more detailed discussion is given in Appendix E.
First, we must rewrite the objective function in a form that takes into
account the constraints imposed on its solutions. The new objective function
is known as the Lagrangian for the optimization problem:
\[
L_P = \frac{1}{2}\|\mathbf{w}\|^2 - \sum_{i=1}^{N} \lambda_i \bigl(y_i(\mathbf{w} \cdot \mathbf{x}_i + b) - 1\bigr), \tag{5.38}
\]
where the parameters λi are called the Lagrange multipliers. The first term in
the Lagrangian is the same as the original objective function, while the second
term captures the inequality constraints. To understand why the objective
function must be modified, consider the original objective function given in
Equation 5.37. It is easy to show that the function is minimized when w = 0, a
null vector whose components are all zeros. Such a solution, however, violates
the constraints given in Definition 5.1 because there is no feasible solution
for b. The solutions for w and b are infeasible if they violate the inequality
constraints; i.e., if yi(w·xi +b)−1 < 0. The Lagrangian given in Equation 5.38
incorporates this constraint by subtracting the term from its original objective
function. Assuming that λi ≥ 0, it is clear that any infeasible solution may
only increase the value of the Lagrangian.
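To minimize the Lagrangian, we must take the derivative of LP with respect
to w and b and set them to zero:
\[
\frac{\partial L_P}{\partial \mathbf{w}} = 0 \implies \mathbf{w} = \sum_{i=1}^{N} \lambda_i y_i \mathbf{x}_i, \tag{5.39}
\]
\[
\frac{\partial L_P}{\partial b} = 0 \implies \sum_{i=1}^{N} \lambda_i y_i = 0. \tag{5.40}
\]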
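The intermediate step behind Equations 5.39 and 5.40 is a short calculation. As a sketch, expanding the Lagrangian of Equation 5.38 term by term and differentiating gives:

```latex
L_P = \frac{1}{2}\,\mathbf{w} \cdot \mathbf{w}
      - \sum_{i=1}^{N} \lambda_i y_i\,(\mathbf{w} \cdot \mathbf{x}_i)
      - b \sum_{i=1}^{N} \lambda_i y_i
      + \sum_{i=1}^{N} \lambda_i,
\\[4pt]
\frac{\partial L_P}{\partial \mathbf{w}} = \mathbf{w} - \sum_{i=1}^{N} \lambda_i y_i \mathbf{x}_i,
\qquad
\frac{\partial L_P}{\partial b} = -\sum_{i=1}^{N} \lambda_i y_i.
```

Setting both derivatives to zero yields the two conditions stated in Equations 5.39 and 5.40.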
Because the Lagrange multipliers are unknown, we still cannot solve for w and
b. If Definition 5.1 contains only equality instead of inequality constraints, then
we can use the N equations from equality constraints along with Equations
5.39 and 5.40 to find the feasible solutions for w, b, and λi. Note that the
Lagrange multipliers for equality constraints are free parameters that can take
any values.
One way to handle the inequality constraints is to transform them into a
set of equality constraints. This is possible as long as the Lagrange multipliers
are restricted to be nonnegative. Such transformation leads to the following
constraints on the Lagrange multipliers, which are known as the Karush-Kuhn-Tucker
(KKT) conditions:
\[
\lambda_i \ge 0, \tag{5.41}
\]
\[
\lambda_i \bigl[y_i(\mathbf{w} \cdot \mathbf{x}_i + b) - 1\bigr] = 0. \tag{5.42}
\]
At first glance, it may seem that there are as many Lagrange multipliers
as there are training instances. It turns out that many of the Lagrange
multipliers become zero after applying the constraint given in Equation 5.42.
The constraint states that the Lagrange multiplier λi must be zero unless the
training instance xi satisfies the equation yi(w · xi + b) = 1. Such a training
instance, with λi > 0, lies along the hyperplanes bi1 or bi2 and is known as a
support vector. Training instances that do not reside along these hyperplanes
have λi = 0. Equations 5.39 and 5.42 also suggest that the parameters w and
b, which define the decision boundary, depend only on the support vectors.
Solving the preceding optimization problem is still quite a daunting task
because it involves a large number of parameters: w, b, and λi. The problem
can be simplified by transforming the Lagrangian into a function of the Lagrange
multipliers only (this is known as the dual problem). To do this, we
first substitute Equations 5.39 and 5.40 into Equation 5.38. This will lead to
the following dual formulation of the optimization problem:
\[
L_D = \sum_{i=1}^{N} \lambda_i - \frac{1}{2}\sum_{i,j} \lambda_i \lambda_j y_i y_j\, \mathbf{x}_i \cdot \mathbf{x}_j. \tag{5.43}
\]
The key differences between the dual and primal Lagrangians are as follows:
1. The dual Lagrangian involves only the Lagrange multipliers and the
training data, while the primal Lagrangian involves the Lagrange multipliers
as well as parameters of the decision boundary. Nevertheless, the
solutions for both optimization problems are equivalent.
2. The quadratic term in Equation 5.43 has a negative sign, which means
that the original minimization problem involving the primal Lagrangian,
LP, has turned into a maximization problem involving the dual Lagrangian,
LD.
For large data sets, the dual optimization problem can be solved using
numerical techniques such as quadratic programming, a topic that is beyond
the scope of this book. Once the λi’s are found, we can use Equations 5.39
and 5.42 to obtain the feasible solutions for w and b. The decision boundary
can be expressed as follows:
$$\left( \sum_{i=1}^{N} \lambda_i y_i \, \mathbf{x}_i \cdot \mathbf{x} \right) + b = 0. \tag{5.44}$$
b is obtained by solving Equation 5.42 for the support vectors. Because the λi’s
are calculated numerically and can have numerical errors, the value computed
for b may not be unique. Instead it depends on the support vector used in
Equation 5.42. In practice, the average value for b is chosen to be the parameter
of the decision boundary.
Example 5.5. Consider the two-dimensional data set shown in Figure 5.24,
which contains eight training instances. Using quadratic programming, we can
solve the optimization problem stated in Equation 5.43 to obtain the Lagrange
multiplier λi for each training instance. The Lagrange multipliers are depicted
in the last column of the table. Notice that only the first two instances have
nonzero Lagrange multipliers. These instances correspond to the support
vectors for this data set.
Let w = (w1, w2) and b denote the parameters of the decision boundary.
Using Equation 5.39, we can solve for w1 and w2 in the following way:
$$w_1 = \sum_i \lambda_i y_i x_{i1} = 65.5261 \times 1 \times 0.3858 + 65.5261 \times (-1) \times 0.4871 = -6.64,$$
$$w_2 = \sum_i \lambda_i y_i x_{i2} = 65.5261 \times 1 \times 0.4687 + 65.5261 \times (-1) \times 0.611 = -9.32.$$
The bias term b can be computed using Equation 5.42 for each support vector:
$$b^{(1)} = 1 - \mathbf{w} \cdot \mathbf{x}_1 = 1 - (-6.64)(0.3858) - (-9.32)(0.4687) = 7.9300,$$
$$b^{(2)} = -1 - \mathbf{w} \cdot \mathbf{x}_2 = -1 - (-6.64)(0.4871) - (-9.32)(0.611) = 7.9289.$$
Averaging these values, we obtain b = 7.93. The decision boundary
corresponding to these parameters is shown in Figure 5.24.
x1        x2        y     Lagrange Multiplier
0.3858    0.4687     1    65.5261
0.4871    0.611     -1    65.5261
0.9218    0.4103    -1    0
0.7382    0.8936    -1    0
0.1763    0.0579     1    0
0.4057    0.3529     1    0
0.9355    0.8132    -1    0
0.2146    0.0099     1    0

[Plot: x2 versus x1, both ranging from 0 to 1, with the decision boundary
-6.64 x1 - 9.32 x2 + 7.93 = 0.]
Figure 5.24. Example of a linearly separable data set.
Once the parameters of the decision boundary are found, a test instance z
is classified as follows:
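$$f(\mathbf{z}) = \operatorname{sign}(\mathbf{w} \cdot \mathbf{z} + b) = \operatorname{sign}\left( \sum_{i=1}^{N} \lambda_i y_i \, \mathbf{x}_i \cdot \mathbf{z} + b \right).$$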
If f (z) = 1, then the test instance is classified as a positive class; otherwise, it
is classified as a negative class.
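As a rough check of Example 5.5, the following Python sketch recomputes w and b from the Lagrange multipliers and then applies the sign rule to every training instance. The helper names (weight_vector, bias, classify) are illustrative choices rather than anything defined in the text; the data are the eight instances listed for Figure 5.24.

```python
# Training instances for Figure 5.24 as (x1, x2, y, lambda).
DATA = [
    (0.3858, 0.4687,  1, 65.5261),
    (0.4871, 0.611,  -1, 65.5261),
    (0.9218, 0.4103, -1, 0.0),
    (0.7382, 0.8936, -1, 0.0),
    (0.1763, 0.0579,  1, 0.0),
    (0.4057, 0.3529,  1, 0.0),
    (0.9355, 0.8132, -1, 0.0),
    (0.2146, 0.0099,  1, 0.0),
]


def weight_vector(data):
    """Equation 5.39: w = sum_i lambda_i * y_i * x_i."""
    w1 = sum(lam * y * x1 for x1, x2, y, lam in data)
    w2 = sum(lam * y * x2 for x1, x2, y, lam in data)
    return (w1, w2)


def bias(data, w):
    """Average the b values implied by Equation 5.42 over the support vectors.

    For a support vector, y * (w . x + b) = 1; since y is +1 or -1,
    this gives b = y - w . x.
    """
    estimates = [y - (w[0] * x1 + w[1] * x2)
                 for x1, x2, y, lam in data if lam > 0]
    return sum(estimates) / len(estimates)


def classify(z, w, b):
    """Sign rule: +1 if w . z + b is positive, otherwise -1."""
    return 1 if w[0] * z[0] + w[1] * z[1] + b > 0 else -1


w = weight_vector(DATA)
b = bias(DATA, w)
predictions = [classify((x1, x2), w, b) for x1, x2, y, lam in DATA]
labels = [y for x1, x2, y, lam in DATA]
print(w, b, predictions == labels)
```

Because the sketch keeps full precision instead of rounding w to two decimals before computing b, its b differs from the hand calculation in the last digits but rounds to the same 7.93.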
5.5.3 Linear SVM: Nonseparable Case
Figure 5.25 shows a data set that is similar to Figure 5.22, except it has two
new examples, P and Q. Although the decision boundary B1 misclassifies the
new examples, while B2 classifies them correctly, this does not mean that B2 is
a better decision boundary than B1 because the new examples may correspond
to noise in the training data. B1 should still be preferred over B2 because it
has a wider margin, and thus, is less susceptible to overfitting. However, the
SVM formulation presented in the previous section constructs only decision
boundaries that are mistake-free. This section examines how the formulation
can be modified to learn a decision boundary that tolerates small training
errors using a method known as the soft margin approach. More importantly,
the method presented in this section allows SVM to construct a linear decision
boundary even in situations where the classes are not linearly separable. To
do this, the learning algorithm in SVM must consider the tradeoff between
the width of the margin and the number of training errors committed by the
linear decision boundary.
[Diagram: decision boundaries B1 and B2, their margins (bounded by b11, b12
and by b21, b22), and the two new examples P and Q.]
Figure 5.25. Decision boundary of SVM for the nonseparable case.
[Plot: X2 versus X1, showing the lines w · x + b = 0, w · x + b = −1, and
w · x + b = −1 + ξ through the point P; the last two are a distance ξ/‖w‖ apart.]
Figure 5.26. Slack variables for nonseparable data.
While the original objective function given in Equation 5.37 is still
applicable, the decision boundary B1 no longer satisfies all the constraints given
in Equation 5.36. The inequality constraints must therefore be relaxed to
accommodate the nonlinearly separable data. This can be done by introducing
nonnegative slack variables (ξ) into the constraints of the optimization
problem, as shown in the following equations:
$$\begin{aligned} \mathbf{w} \cdot \mathbf{x}_i + b &\ge 1 - \xi_i && \text{if } y_i = 1, \\ \mathbf{w} \cdot \mathbf{x}_i + b &\le -1 + \xi_i && \text{if } y_i = -1, \end{aligned} \tag{5.45}$$
where ξi ≥ 0 for all i.
To interpret the meaning of the slack variables ξi, consider the diagram
shown in Figure 5.26. The circle P is one of the instances that violates the
constraints given in Equation 5.35. Let w · x + b = −1 + ξ denote a line that
is parallel to the decision boundary and passes through the point P. It can be
shown that the distance between this line and the hyperplane w · x + b = −1
is ξ/‖w‖. Thus, ξ provides an estimate of the error of the decision boundary
on the training example P.
In principle, we can apply the same objective function as before and impose
the conditions given in Equation 5.45 to find the decision boundary. However,
since there are no constraints on the number of mistakes the decision boundary
can make, the learning algorithm may find a decision boundary with a very
wide margin that misclassifies many of the training examples, as shown in
Figure 5.27. To avoid this problem, the objective function must be modified
to penalize a decision boundary with large values of slack variables. The
modified objective function is given by the following equation:
$$f(\mathbf{w}) = \frac{\|\mathbf{w}\|^2}{2} + C \left( \sum_{i=1}^{N} \xi_i \right)^{k},$$
where C and k are user-specified parameters representing the penalty of
misclassifying the training instances. For the remainder of this section, we assume
k = 1 to simplify the problem. The parameter C can be chosen based on the
model’s performance on the validation set.
[Diagram: a wide-margin decision boundary together with the examples P and Q.]
Figure 5.27. A decision boundary that has a wide margin but large training error.
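To make the role of the slack variables and of C concrete, here is a minimal Python sketch that computes, for a fixed boundary, the smallest slack each instance needs under Equation 5.45 and the resulting value of the modified objective. The boundary, the four points, and the function names are hypothetical toy choices made for illustration only.

```python
def slack(w, b, x, y):
    """Smallest xi >= 0 such that y * (w . x + b) >= 1 - xi (Equation 5.45)."""
    margin = y * (sum(wj * xj for wj, xj in zip(w, x)) + b)
    return max(0.0, 1.0 - margin)


def soft_margin_objective(w, b, points, labels, C, k=1):
    """Modified objective: ||w||^2 / 2 + C * (sum of slacks)^k."""
    total_slack = sum(slack(w, b, x, y) for x, y in zip(points, labels))
    return 0.5 * sum(wj * wj for wj in w) + C * total_slack ** k


# Hypothetical boundary x1 = 0 (w = (1, 0), b = 0) and four toy instances.
w, b = (1.0, 0.0), 0.0
points = [(2.0, 0.0), (-2.0, 0.0), (0.5, 0.0), (1.0, 1.0)]
labels = [1, -1, 1, -1]

slacks = [slack(w, b, x, y) for x, y in zip(points, labels)]
print(slacks)                                          # [0.0, 0.0, 0.5, 2.0]
print(soft_margin_objective(w, b, points, labels, C=1.0))   # 3.0
print(soft_margin_objective(w, b, points, labels, C=10.0))  # 25.5
```

The third instance is correctly classified but lies inside the margin (slack 0.5), while the fourth is misclassified (slack 2). Raising C from 1 to 10 leaves the margin term unchanged and multiplies the penalty on these two instances by ten.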
It follows that the Lagrangian for this constrained optimization problem
can be written as follows:
$$L_P = \frac{1}{2}\|\mathbf{w}\|^2 + C \sum_{i=1}^{N} \xi_i - \sum_{i=1}^{N} \lambda_i \{ y_i(\mathbf{w} \cdot \mathbf{x}_i + b) - 1 + \xi_i \} - \sum_{i=1}^{N} \mu_i \xi_i, \tag{5.46}$$
where the first two terms are the objective function to be minimized, the third
term represents the inequality constraints associated with the slack variables,
and the last term is the result of the nonnegativity requirements on the
values of the ξi’s. Furthermore, the inequality constraints can be transformed into
equality constraints using the following KKT conditions:
$$\xi_i \ge 0, \quad \lambda_i \ge 0, \quad \mu_i \ge 0, \tag{5.47}$$
$$\lambda_i \{ y_i(\mathbf{w} \cdot \mathbf{x}_i + b) - 1 + \xi_i \} = 0, \tag{5.48}$$
$$\mu_i \xi_i = 0. \tag{5.49}$$
Note that the Lagrange multiplier λi given in Equation 5.48 is nonvanishing
only if the training instance resides along the lines w · xi + b = ±1 or has
ξi > 0. On the other hand, the Lagrange multipliers µi given in Equation 5.49
are zero for any training instances that are misclassified (i.e., having ξi > 0).
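Setting the first-order derivatives of L with respect to w, b, and ξi to zero
would result in the following equations:
$$\frac{\partial L}{\partial w_j} = w_j - \sum_{i=1}^{N} \lambda_i y_i x_{ij} = 0 \;\Longrightarrow\; w_j = \sum_{i=1}^{N} \lambda_i y_i x_{ij}. \tag{5.50}$$
$$\frac{\partial L}{\partial b} = -\sum_{i=1}^{N} \lambda_i y_i = 0 \;\Longrightarrow\; \sum_{i=1}^{N} \lambda_i y_i = 0. \tag{5.51}$$
$$\frac{\partial L}{\partial \xi_i} = C - \lambda_i - \mu_i = 0 \;\Longrightarrow\; \lambda_i + \mu_i = C. \tag{5.52}$$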
Substituting Equations 5.50, 5.51, and 5.52 into the Lagrangian will
produce the following dual Lagrangian:
$$\begin{aligned}
L_D &= \frac{1}{2} \sum_{i,j} \lambda_i \lambda_j y_i y_j \, \mathbf{x}_i \cdot \mathbf{x}_j + C \sum_i \xi_i \\
&\quad - \sum_i \lambda_i \Big\{ y_i \Big( \sum_j \lambda_j y_j \, \mathbf{x}_i \cdot \mathbf{x}_j + b \Big) - 1 + \xi_i \Big\} - \sum_i (C - \lambda_i) \xi_i \\
&= \sum_{i=1}^{N} \lambda_i - \frac{1}{2} \sum_{i,j} \lambda_i \lambda_j y_i y_j \, \mathbf{x}_i \cdot \mathbf{x}_j,
\end{aligned} \tag{5.53}$$
which turns out to be identical to the dual Lagrangian for linearly separable
data (see Equation 5.43). Nevertheless, the constraints imposed
on the Lagrange multipliers λi are slightly different from those in the linearly
separable case. In the linearly separable case, the Lagrange multipliers must
be nonnegative, i.e., λi ≥ 0. On the other hand, Equation 5.52 suggests that
λi should not exceed C (since both µi and λi are nonnegative). Therefore,
the Lagrange multipliers for nonlinearly separable data are restricted to 0 ≤
λi ≤ C.
The dual problem can then be solved numerically using quadratic
programming techniques to obtain the Lagrange multipliers λi. These multipliers
can be substituted into Equation 5.50 and the KKT conditions to obtain the
parameters of the decision boundary.
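The disappearance of ξi, µi, and b from the dual Lagrangian can be followed term by term. The short derivation below is a sketch that uses only Equations 5.46 and 5.50 to 5.52 and introduces no new symbols. Grouping the terms of the primal Lagrangian by variable and then applying the three stationarity conditions gives:

```latex
\begin{aligned}
L_P &= \tfrac{1}{2}\|\mathbf{w}\|^2
      - \sum_i \lambda_i y_i \,\mathbf{w}\cdot\mathbf{x}_i
      - b \sum_i \lambda_i y_i
      + \sum_i \lambda_i
      + \sum_i (C - \lambda_i - \mu_i)\,\xi_i \\
    &= \tfrac{1}{2}\|\mathbf{w}\|^2
      - \sum_i \lambda_i y_i \,\mathbf{w}\cdot\mathbf{x}_i
      + \sum_i \lambda_i
      && \text{(by 5.51 and 5.52)} \\
    &= \tfrac{1}{2}\sum_{i,j} \lambda_i\lambda_j y_i y_j\,\mathbf{x}_i\cdot\mathbf{x}_j
      - \sum_{i,j} \lambda_i\lambda_j y_i y_j\,\mathbf{x}_i\cdot\mathbf{x}_j
      + \sum_i \lambda_i
      && \text{(by 5.50)} \\
    &= \sum_i \lambda_i
      - \tfrac{1}{2}\sum_{i,j} \lambda_i\lambda_j y_i y_j\,\mathbf{x}_i\cdot\mathbf{x}_j .
\end{aligned}
```

The penalty parameter C therefore does not appear in the dual objective itself; it enters only through the upper bound on each λi.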
5.5.4 Nonlinear SVM
The SVM formulations described in the previous sections construct a linear
decision boundary to separate the training examples into their respective classes.
This section presents a methodology for applying SVM to data sets that have
nonlinear decision boundaries. The trick here is to transform the data from its
original coordinate space in x into a new space Φ(x) so that a linear decision
boundary can be used to separate the instances in the transformed space.
After doing the transformation, we can apply the methodology presented in the
previous sections to find a linear decision boundary in the transformed space.
Attribute Transformation
To illustrate how attribute transformation can lead to a linear decision
boundary, Figure 5.28(a) shows an example of a two-dimensional data set consisting
of squares (classified as y = 1) and circles (classified as y = −1). The data set
is generated in such a way that all the circles are clustered near the center of
the diagram and all the squares are distributed farther away from the center.
Instances of the data set can be classified using the following equation:
$$y(x_1, x_2) = \begin{cases} 1 & \text{if } \sqrt{(x_1 - 0.5)^2 + (x_2 - 0.5)^2} > 0.2, \\ -1 & \text{otherwise.} \end{cases} \tag{5.54}$$
The decision boundary for the data can therefore be written as follows:
$$\sqrt{(x_1 - 0.5)^2 + (x_2 - 0.5)^2} = 0.2,$$
which, after squaring both sides and expanding, simplifies to the following
quadratic equation:
$$x_1^2 - x_1 + x_2^2 - x_2 = -0.46.$$
[Plots: (a) Decision boundary in the original two-dimensional space (axes X1
and X2, both from 0 to 1). (b) Decision boundary in the transformed space
(axes X1^2 − X1 and X2^2 − X2, both from −0.25 to 0).]
Figure 5.28. Classifying data with a nonlinear decision boundary.
A nonlinear transformation Φ is needed to map the data from its original
feature space into a new space where the decision boundary becomes linear.
Suppose we choose the following transformation:
$$\Phi : (x_1, x_2) \longrightarrow (x_1^2, \; x_2^2, \; \sqrt{2}x_1, \; \sqrt{2}x_2, \; 1). \tag{5.55}$$
In the transformed space, we can find the parameters w = (w0, w1, . . ., w4)
such that:
$$w_4 x_1^2 + w_3 x_2^2 + w_2 \sqrt{2} x_1 + w_1 \sqrt{2} x_2 + w_0 = 0.$$
For illustration purposes, let us plot the graph of x2^2 − x2 versus x1^2 − x1 for
the previously given instances. Figure 5.28(b) shows that in the transformed
space, all the circles are located in the lower right-hand side of the diagram. A
linear decision boundary can therefore be constructed to separate the instances
into their respective classes.
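The effect of the transformation can be checked numerically. The sketch below labels a small grid of points with the circular rule of Equation 5.54, maps each point to the plotted coordinates (x1^2 − x1, x2^2 − x2), and confirms that a straight line in those coordinates reproduces the same labels. The grid spacing of 1/7 and the function names are arbitrary choices for illustration; the spacing was picked so that no grid point falls exactly on the boundary.

```python
import math


def true_label(x1, x2):
    """Equation 5.54: +1 outside the circle of radius 0.2 centred at (0.5, 0.5)."""
    return 1 if math.hypot(x1 - 0.5, x2 - 0.5) > 0.2 else -1


def transform(x1, x2):
    """Map a point to the coordinates used in Figure 5.28(b)."""
    return (x1 * x1 - x1, x2 * x2 - x2)


def linear_label(t1, t2):
    """Linear rule in the transformed space; the boundary is t1 + t2 = -0.46."""
    return 1 if t1 + t2 > -0.46 else -1


# An 8 x 8 grid over the unit square.
grid = [(i / 7, j / 7) for i in range(8) for j in range(8)]

agreement = all(true_label(x1, x2) == linear_label(*transform(x1, x2))
                for x1, x2 in grid)
circles = sum(1 for x1, x2 in grid if true_label(x1, x2) == -1)
print(agreement, circles)
```

Only the four grid points nearest the centre fall inside the circle, and the linear rule in the transformed coordinates assigns every grid point the same label as the original circular rule.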
One potential problem with this approach is that it may suffer from the
curse of dimensionality problem often associated with high-dimensional data.
We will show how nonlinear SVM avoids this problem (using a method known
as the kernel trick) later in this section.
Learning a Nonlinear SVM Model
Although the attribute transformation approach seems promising, it raises
several implementation issues. First, it is not clear what type of mapping
function should be used to ensure that a linear decision boundary can be
constructed in the transformed space. One possibility is to transform the data
into an infinite-dimensional space, but such a high-dimensional space may not
be that easy to work with. Second, even if the appropriate mapping function is
known, solving the constrained optimization problem in the high-dimensional
feature space is a computationally expensive task.
To illustrate these issues and examine the ways they can be addressed, let
us assume that there is a suitable function, Φ(x), to transform a given data
set. After the transformation, we need to construct a linear decision boundary
that will separate the instances into their respective classes. The linear decision
boundary in the transformed space has the following form: w · Φ(x) + b = 0.
Definition 5.2 (Nonlinear SVM). The learning task for a nonlinear SVM
can be formalized as the following optimization problem:
$$\min_{\mathbf{w}} \frac{\|\mathbf{w}\|^2}{2} \quad \text{subject to} \quad y_i(\mathbf{w} \cdot \Phi(\mathbf{x}_i) + b) \ge 1, \quad i = 1, 2, \ldots, N.$$
Note the similarity between the learning task of a nonlinear SVM and that
of a linear SVM (see Definition 5.1 on page 262). The main difference is that,
instead of using the original attributes x, the learning task is performed on the
transformed attributes Φ(x). Following the approach taken in Sections 5.5.2
and 5.5.3 for linear SVM, we may derive the following dual Lagrangian for the
constrained optimization problem:
$$L_D = \sum_{i=1}^{n} \lambda_i - \frac{1}{2} \sum_{i,j} \lambda_i \lambda_j y_i y_j \, \Phi(\mathbf{x}_i) \cdot \Phi(\mathbf{x}_j) \tag{5.56}$$
Once the λi’s are found using quadratic programming techniques, the
parameters w and b can be derived using the following equations:
$$\mathbf{w} = \sum_i \lambda_i y_i \Phi(\mathbf{x}_i) \tag{5.57}$$
$$\lambda_i \Big\{ y_i \Big( \sum_j \lambda_j y_j \, \Phi(\mathbf{x}_j) \cdot \Phi(\mathbf{x}_i) + b \Big) - 1 \Big\} = 0, \tag{5.58}$$
which are analogous to Equations 5.39 and 5.42 for linear SVM. Finally, a test
instance z can be classified using the following equation:
$$f(\mathbf{z}) = \operatorname{sign}(\mathbf{w} \cdot \Phi(\mathbf{z}) + b) = \operatorname{sign}\Big( \sum_{i=1}^{n} \lambda_i y_i \, \Phi(\mathbf{x}_i) \cdot \Phi(\mathbf{z}) + b \Big). \tag{5.59}$$
Except for Equation 5.57, note that the rest of the computations
(Equations 5.58 and 5.59) involve calculating the dot product (i.e., similarity)
between pairs of vectors in the transformed space, Φ(xi) · Φ(xj). Such a
computation can be quite cumbersome and may suffer from the curse of dimensionality
problem. A breakthrough solution to this problem comes in the form of a
method known as the kernel trick.
Kernel Trick
The dot product is often regarded as a measure of similarity between two
input vectors. For example, the cosine similarity described in Section 2.4.5
on page 73 can be defined as the dot product between two vectors that are
normalized to unit length. Analogously, the dot product Φ(xi)·Φ(xj ) can also
be regarded as a measure of similarity between two instances, xi and xj , in
the transformed space.
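The kernel trick is a method for computing similarity in the transformed
space using the original attribute set. Consider the mapping function Φ given
in Equation 5.55, extended with the cross term √2 x1x2, which is needed for
the identity below to hold exactly. The dot product between two input vectors
u and v in the transformed space can be written as follows:
$$\begin{aligned}
\Phi(\mathbf{u}) \cdot \Phi(\mathbf{v}) &= (u_1^2, u_2^2, \sqrt{2}u_1u_2, \sqrt{2}u_1, \sqrt{2}u_2, 1) \cdot (v_1^2, v_2^2, \sqrt{2}v_1v_2, \sqrt{2}v_1, \sqrt{2}v_2, 1) \\
&= u_1^2 v_1^2 + u_2^2 v_2^2 + 2u_1u_2v_1v_2 + 2u_1v_1 + 2u_2v_2 + 1 \\
&= (\mathbf{u} \cdot \mathbf{v} + 1)^2.
\end{aligned} \tag{5.60}$$
This analysis shows that the dot product in the transformed space can be
expressed in terms of a similarity function in the original space:
$$K(\mathbf{u}, \mathbf{v}) = \Phi(\mathbf{u}) \cdot \Phi(\mathbf{v}) = (\mathbf{u} \cdot \mathbf{v} + 1)^2. \tag{5.61}$$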
The similarity function, K, which is computed in the original attribute space,
is known as the kernel function. The kernel trick helps to address some
of the concerns about how to implement nonlinear SVM. First, we do not
have to know the exact form of the mapping function Φ because the kernel
functions used in nonlinear SVM must satisfy a mathematical principle known
as Mercer’s theorem. This principle ensures that the kernel functions can
always be expressed as the dot product between two input vectors in some
high-dimensional space. The transformed space of the SVM kernels is called
a reproducing kernel Hilbert space (RKHS). Second, computing the
dot products using kernel functions is considerably cheaper than using the
transformed attribute set Φ(x). Third, since the computations are performed
in the original space, issues associated with the curse of dimensionality problem
can be avoided.
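The equivalence between the kernel and the explicit dot product can be verified numerically. The following Python sketch compares (u · v + 1)^2, evaluated in the original two-dimensional space, with the dot product of explicitly transformed vectors. The feature map phi used here is the degree-2 map that includes the cross term √2 x1x2; without that term the two sides differ by 2 u1 u2 v1 v2. The sample vectors and function names are arbitrary illustrations.

```python
import math


def phi(x):
    """Explicit degree-2 feature map for a 2-D vector, including the cross term."""
    x1, x2 = x
    r2 = math.sqrt(2.0)
    return (x1 * x1, x2 * x2, r2 * x1 * x2, r2 * x1, r2 * x2, 1.0)


def dot(a, b):
    """Ordinary dot product of two equal-length sequences."""
    return sum(ai * bi for ai, bi in zip(a, b))


def poly_kernel(u, v):
    """K(u, v) = (u . v + 1)^2, computed without forming phi(u) or phi(v)."""
    return (dot(u, v) + 1.0) ** 2


u, v = (0.3, 0.7), (0.9, 0.2)
explicit = dot(phi(u), phi(v))   # six multiplications in the transformed space
implicit = poly_kernel(u, v)     # two multiplications in the original space
print(explicit, implicit)        # both are approximately 1.9881
```

The kernel evaluation never builds the six-dimensional vectors, which is the saving the second point above refers to; the gap widens quickly as the polynomial degree or the number of attributes grows.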
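Figure 5.29 shows the nonlinear decision boundary obtained by SVM using
the polynomial kernel function given in Equation 5.61. A test instance z is
classified according to the following equation:
$$\begin{aligned}
f(\mathbf{z}) &= \operatorname{sign}\Big( \sum_{i=1}^{n} \lambda_i y_i \, \Phi(\mathbf{x}_i) \cdot \Phi(\mathbf{z}) + b \Big) \\
&= \operatorname{sign}\Big( \sum_{i=1}^{n} \lambda_i y_i \, K(\mathbf{x}_i, \mathbf{z}) + b \Big) \\
&= \operatorname{sign}\Big( \sum_{i=1}^{n} \lambda_i y_i \, (\mathbf{x}_i \cdot \mathbf{z} + 1)^2 + b \Big),
\end{aligned} \tag{5.62}$$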
where b is the parameter obtained using Equation 5.58. The decision boundary
obtained by nonlinear SVM is quite close to the true decision boundary shown
in Figure 5.28(a).
Mercer’s Theorem
The main requirement for the kernel function used in nonlinear SVM is that
there must exist a corresponding transformation such that the kernel function
computed for a pair of vectors is equivalent to the dot product between the
vectors in the transformed space. This requirement can be formally stated in
the form of Mercer’s theorem.
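Theorem 5.1 (Mercer’s Theorem). A kernel function K can be expressed
as
$$K(\mathbf{u}, \mathbf{v}) = \Phi(\mathbf{u}) \cdot \Phi(\mathbf{v})$$
if and only if, for any function g(x) such that ∫ g(x)^2 dx is finite,
$$\int K(\mathbf{x}, \mathbf{y}) \, g(\mathbf{x}) \, g(\mathbf{y}) \, d\mathbf{x} \, d\mathbf{y} \ge 0.$$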
[Plot: X2 versus X1, both from 0 to 1, showing the learned decision boundary.]
Figure 5.29. Decision boundary produced by a nonlinear SVM with polynomial kernel.
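Kernel functions that satisfy Theorem 5.1 are called positive definite kernel
functions. Examples of such functions are listed below:
$$K(\mathbf{x}, \mathbf{y}) = (\mathbf{x} \cdot \mathbf{y} + 1)^p \tag{5.63}$$
$$K(\mathbf{x}, \mathbf{y}) = e^{-\|\mathbf{x} - \mathbf{y}\|^2 / (2\sigma^2)} \tag{5.64}$$
$$K(\mathbf{x}, \mathbf{y}) = \tanh(k \, \mathbf{x} \cdot \mathbf{y} - \delta) \tag{5.65}$$
Note that the hyperbolic tangent kernel in Equation 5.65 satisfies Mercer’s
condition only for some values of k and δ.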
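A finite-sample analogue of Mercer's condition is that the quadratic form Σi Σj ci cj K(xi, xj) is nonnegative for any points xi and any real coefficients ci. The Python sketch below probes this for the polynomial and Gaussian kernels on randomly drawn points. It is a spot check under arbitrary assumptions (ten random points, 100 random coefficient vectors per kernel, p = 2, σ = 1), not a proof, and the function names are illustrative.

```python
import math
import random


def poly_kernel(x, y, p=2):
    """Polynomial kernel of Equation 5.63."""
    return (sum(a * b for a, b in zip(x, y)) + 1.0) ** p


def rbf_kernel(x, y, sigma=1.0):
    """Gaussian kernel of Equation 5.64."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def quadratic_form(kernel, points, coeffs):
    """Discrete analogue of Mercer's integral: sum_ij c_i c_j K(x_i, x_j)."""
    return sum(ci * cj * kernel(xi, xj)
               for xi, ci in zip(points, coeffs)
               for xj, cj in zip(points, coeffs))


rng = random.Random(0)  # fixed seed so that the run is repeatable
points = [(rng.random(), rng.random()) for _ in range(10)]

values = []
for kernel in (poly_kernel, rbf_kernel):
    for _ in range(100):
        coeffs = [rng.uniform(-1.0, 1.0) for _ in points]
        values.append(quadratic_form(kernel, points, coeffs))

smallest = min(values)
print(smallest >= -1e-9)  # nonnegative up to floating-point rounding
```

A single negative value (beyond rounding error) would be enough to show that a candidate function is not a valid kernel, whereas passing the check only fails to find a counterexample.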
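Example 5.6. Consider the polynomial kernel function given in Equation
5.63. Let g(x) be a function that has a finite L2 norm, i.e., ∫ g(x)^2 dx < ∞. Then
$$\begin{aligned}
&\int (\mathbf{x} \cdot \mathbf{y} + 1)^p \, g(\mathbf{x}) \, g(\mathbf{y}) \, d\mathbf{x} \, d\mathbf{y} \\
&= \int \sum_{i=0}^{p} \binom{p}{i} (\mathbf{x} \cdot \mathbf{y})^i \, g(\mathbf{x}) \, g(\mathbf{y}) \, d\mathbf{x} \, d\mathbf{y} \\
&= \sum_{i=0}^{p} \binom{p}{i} \int \sum_{\alpha_1, \alpha_2, \ldots} \binom{i}{\alpha_1 \, \alpha_2 \, \ldots} \Big[ (x_1 y_1)^{\alpha_1} (x_2 y_2)^{\alpha_2} (x_3 y_3)^{\alpha_3} \cdots \Big] \\
&\qquad\qquad g(x_1, x_2, \ldots) \, g(y_1, y_2, \ldots) \, dx_1 \, dx_2 \cdots dy_1 \, dy_2 \cdots \\
&= \sum_{i=0}^{p} \sum_{\alpha_1, \alpha_2, \ldots} \binom{p}{i} \binom{i}{\alpha_1 \, \alpha_2 \, \ldots} \Big[ \int x_1^{\alpha_1} x_2^{\alpha_2} \cdots g(x_1, x_2, \ldots) \, dx_1 \, dx_2 \cdots \Big]^2 .
\end{aligned}$$
Because the result of the integration is nonnegative, the polynomial kernel
function therefore satisfies Mercer’s theorem.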
5.5.5 Characteristics of SVM
SVM has many desirable qualities that make it one of the most widely used
classification algorithms. Following is a summary of the general characteristics
of SVM:
1. The SVM learning problem can be formulated as a convex optimization
problem, for which efficient algorithms are available to find the global
minimum of the objective function. Other classification methods, such
as rule-based classifiers and artificial neural networks, employ a
greedy-based strategy to search the hypothesis space. Such methods tend to
find only locally optimum solutions.
2. SVM performs capacity control by maximizing the margin of the decision
boundary. Nevertheless, the user must still provide other parameters
such as the type of kernel function to use and the cost function C for
introducing each slack variable.
3. SVM can be applied to categorical data by introducing dummy variables
for each categorical attribute value present in the data. For example, if
Marital Status has three values {Single, Married, Divorced}, we can
introduce a binary variable for each of the attribute values.
4. The SVM formulation presented in this chapter is for binary class
problems. Some of the methods available to extend SVM to multiclass
problems are presented in Section 5.8.
5.6 Ensemble Methods
The classification techniques we have seen so far in this chapter, with the
exception of the nearest-neighbor method, predict the class labels of unknown
examples using a single classifier induced from training data. This section
presents techniques for improving classification accuracy by aggregating the
predictions of multiple classifiers. These techniques are known as the
ensemble or classifier combination methods. An ensemble method constructs a
set of base classifiers from training data and performs classification by taking
a vote on the predictions made by each base classifier. This section explains
why ensemble methods tend to perform better than any single classifier and
presents techniques for constructing the classifier ensemble.
5.6.1 Rationale for Ensemble Method
The following example illustrates how an ensemble method can improve a
classifier’s performance.
Example 5.7. Consider an ensemble of twenty-five binary classifiers, each of
which has an error rate of ε = 0.35. The ensemble classifier predicts the class
label of a test example by taking a majority vote on the predictions made
by the base classifiers. If the base classifiers are identical, then the ensemble
will misclassify the same examples predicted incorrectly by the base classifiers.
Thus, the error rate of the ensemble remains 0.35. On the other hand, if the
base classifiers are independent (i.e., their errors are uncorrelated), then the
ensemble makes a wrong prediction only if more than half of the base classifiers
predict incorrectly. In this case, the error rate of the ensemble classifier is
$$e_{\text{ensemble}} = \sum_{i=13}^{25} \binom{25}{i} \epsilon^i (1 - \epsilon)^{25 - i} = 0.06, \tag{5.66}$$
which is considerably lower than the error rate of the base classifiers.
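Equation 5.66 is a binomial tail probability and is easy to evaluate directly. The short Python sketch below does so for an arbitrary base error rate, assuming independent base classifiers and an odd ensemble size; the function name ensemble_error is an illustrative choice.

```python
from math import comb


def ensemble_error(eps, n=25):
    """Probability that more than half of n independent base classifiers err.

    Implements the sum in Equation 5.66; n is assumed to be odd so that a
    majority vote cannot tie.
    """
    k_min = n // 2 + 1  # 13 when n = 25
    return sum(comb(n, i) * eps ** i * (1 - eps) ** (n - i)
               for i in range(k_min, n + 1))


for eps in (0.35, 0.5, 0.6):
    print(eps, round(ensemble_error(eps), 4))
```

For ε = 0.35 the result rounds to the 0.06 quoted above. At ε = 0.5 the ensemble error is also 0.5, and for larger ε the ensemble does worse than its base classifiers.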
Figure 5.30 shows the error rate of an ensemble of twenty-five binary
classifiers (e_ensemble) for different base classifier error rates (ε). The diagonal line
represents the case in which the base classifiers are identical, while the solid
line represents the case in which the base classifiers are independent. Observe
that the ensemble classifier performs worse than the base classifiers when ε is
larger than 0.5.
The preceding example illustrates two necessary conditions for an
ensemble classifier to perform better than a single classifier: (1) the base classifiers
should be independent of each other, and (2) the base classifiers should do
better than a classifier that performs random guessing. In practice, it is difficult to
ensure total independence among the base classifiers. Nevertheless,
improvements in classification accuracies have been observed in ensemble methods in
which the base classifiers are slightly correlated.
[Plot: ensemble classifier error (vertical axis) versus base classifier error
(horizontal axis), both from 0 to 1.]
Figure 5.30. Comparison between errors of base classifiers and errors of the ensemble classifier.
[Diagram: Step 1, create multiple data sets D1, D2, ..., Dt-1, Dt from the
original training data D. Step 2, build multiple classifiers C1, C2, ..., Ct-1, Ct.
Step 3, combine the classifiers into C*.]
Figure 5.31. A logical view of the ensemble learning method.
5.6.2 Methods for Constructing an Ensemble Classifier
A logical view of the ensemble method is presented in Figure 5.31. The basic
idea is to construct multiple classifiers from the original data and then
aggregate their predictions when classifying unknown examples. The ensemble of
classifiers can be constructed in many ways:
1. By manipulating the training set. In this approach, multiple
training sets are created by resampling the original data according to some
sampling distribution. The sampling distribution determines how likely
it is that an example will be selected for training, and it may vary from
one trial to another. A classifier is then built from each training set using
a particular learning algorithm. Bagging and boosting are two
examples of ensemble methods that manipulate their training sets. These
methods are described in further detail in Sections 5.6.4 and 5.6.5.
2. By manipulating the input features. In this approach, a subset
of input features is chosen to form each training set. The subset can
be either chosen randomly or based on the recommendation of domain
experts. Some studies have shown that this approach works very well
with data sets that contain highly redundant features. Random forest,
which is described in Section 5.6.6, is an ensemble method that
manipulates its input features and uses decision trees as its base classifiers.
3. By manipulating the class labels. This method can be used when the
number of classes is sufficiently large. The training data is transformed
into a binary class problem by randomly partitioning the class labels
into two disjoint subsets, A0 and A1. Training examples whose class
label belongs to the subset A0 are assigned to class 0, while those that
belong to the subset A1 are assigned to class 1. The relabeled examples
are then used to train a base classifier. By repeating the class-relabeling
and model-building steps multiple times, an ensemble of base classifiers
is obtained. When a test example is presented, each base classifier Ci is
used to predict its class label. If the test example is predicted as class
0, then all the classes that belong to A0 will receive a vote. Conversely,
if it is predicted to be class 1, then all the classes that belong to A1
will receive a vote. The votes are tallied and the class that receives the
highest vote is assigned to the test example. An example of this approach
is the error-correcting output coding method described on page 307.
4. By manipulating the learning algorithm. Many learning
algorithms can be manipulated in such a way that applying the algorithm
several times on the same training data may result in different models.
For example, an artificial neural network can produce different
models by changing its network topology or the initial weights of the links
between neurons. Similarly, an ensemble of decision trees can be
constructed by injecting randomness into the tree-growing procedure. For
example, instead of choosing the best splitting attribute at each node,
we can randomly choose one of the top k attributes for splitting.
The first three approaches are generic methods that are applicable to any
classifiers, whereas the fourth approach depends on the type of classifier used.
The base classifiers for most of these approaches can be generated sequentially
(one after another) or in parallel (all at once). Algorithm 5.5 shows the steps
needed to build an ensemble classifier in a sequential manner. The first step
is to create a training set from the original data D. Depending on the type
of ensemble method used, the training sets are either identical to or slight
modifications of D. The size of the training set is often kept the same as the
original data, but the distribution of examples may not be identical; i.e., some
examples may appear multiple times in the training set, while others may not
appear even once. A base classifier Ci is then constructed from each training
set Di. Ensemble methods work better with unstable classifiers, i.e., base
classifiers that are sensitive to minor perturbations in the training set.
Examples of unstable classifiers include decision trees, rule-based classifiers, and
artificial neural networks. As will be discussed in Section 5.6.3, the variability
among training examples is one of the primary sources of errors in a classifier.
Aggregating the base classifiers built from different training sets may
help to reduce such types of errors.
Finally, a test example x is classified by combining the predictions made
by the base classifiers Ci(x):
C∗(x) = Vote(C1(x), C2(x), . . . , Ck(x)).
The class can be obtained by taking a majority vote on the individual predic
tions or by weighting each prediction with the accuracy of the base classifier.
Algorithm 5.5 General procedure for ensemble method.
1: Let D denote the original training data, k denote the number of base classifiers,
   and T be the test data.
2: for i = 1 to k do
3:   Create training set, Di from D.
4:   Build a base classifier Ci from Di.
5: end for
6: for each test record x ∈ T do
7:   C∗(x) = Vote(C1(x), C2(x), . . . , Ck(x))
8: end for
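The general procedure can be sketched in a few lines of Python. The version below makes several assumptions that the algorithm leaves open: training sets are drawn by sampling with replacement (as in bagging), the base learner is a hypothetical one-attribute threshold rule (train_stump), and Vote is an unweighted majority vote. The toy data set and all names are illustrative only.

```python
import random
from collections import Counter


def train_stump(sample):
    """Hypothetical base learner: the best single threshold on a 1-D attribute."""
    best = None
    for threshold, _ in sample:
        for sign in (1, -1):
            errors = sum(1 for x, y in sample
                         if (sign if x > threshold else -sign) != y)
            if best is None or errors < best[0]:
                best = (errors, threshold, sign)
    _, threshold, sign = best

    def predict(x):
        return sign if x > threshold else -sign

    return predict


def build_ensemble(data, k, rng):
    """Steps 2 to 5: create k training sets from D and build one classifier each."""
    classifiers = []
    for _ in range(k):
        # Sample with replacement; the training set has the same size as D.
        sample = [rng.choice(data) for _ in data]
        classifiers.append(train_stump(sample))
    return classifiers


def vote(classifiers, x):
    """Step 7: unweighted majority vote over the base predictions."""
    counts = Counter(clf(x) for clf in classifiers)
    return counts.most_common(1)[0][0]


rng = random.Random(42)
# Toy data: attribute values 0.0 to 0.4 are class -1, 0.6 to 1.0 are class +1.
D = [(x / 10, -1) for x in range(0, 5)] + [(x / 10, 1) for x in range(6, 11)]
ensemble = build_ensemble(D, k=11, rng=rng)
predictions = [vote(ensemble, x) for x in (0.0, 1.0)]
print(predictions)
```

An odd k avoids ties in a two-class vote. Replacing the unweighted count in vote with accuracy-based weights would give the weighted variant mentioned above.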
5.6.3 Bias-Variance Decomposition
Bias-variance decomposition is a formal method for analyzing the prediction
error of a predictive model. The following example gives an intuitive
explanation for this method.
Figure 5.32 shows the trajectories of a projectile launched at a particular
angle. Suppose the projectile hits the floor surface at some location x, at a
distance d away from the target position t. Depending on the force applied
to the projectile, the observed distance may vary from one trial to another.
The observed distance can be decomposed into several components. The first
component, which is known as bias, measures the average distance between
the target position and the location where the projectile hits the floor. The
amount of bias depends on the angle of the projectile launcher. The second
component, which is known as variance, measures the deviation between x
and the average position x̄ where the projectile hits the floor. The variance
can be explained as a result of changes in the amount of force applied to the
projectile. Finally, if the target is not stationary, then the observed distance
is also affected by changes in the location of the target. This is considered the
noise component associated with variability in the target position. Putting
these components together, the average distance can be expressed as:
$$d_{f,\theta}(y, t) = \mathrm{Bias}_{\theta} + \mathrm{Variance}_{f} + \mathrm{Noise}_{t}, \tag{5.67}$$
where f refers to the amount of force applied and θ is the angle of the launcher.
The task of predicting the class label of a given example can be analyzed
using the same approach. For a given classifier, some predictions may turn out
to be correct, while others may be completely off the mark. We can decompose
the expected error of a classifier as a sum of the three terms given in Equation
5.67, where expected error is the probability that the classifier misclassifies a
given example. The remainder of this section examines the meaning of bias,
variance, and noise in the context of classification.
[Diagram: a projectile aimed at the target t and landing at y, with the
'Bias', 'Variance', and 'Noise' components marked.]
Figure 5.32. Bias-variance decomposition.
A classifier is usually trained to minimize its training error. However, to
be useful, the classifier must be able to make an informed guess about the
class labels of examples it has never seen before. This requires the classifier to
generalize its decision boundary to regions where there are no training
examples available, a decision that depends on the design choice of the classifier.
For example, a key design issue in decision tree induction is the amount of
pruning needed to obtain a tree with low expected error. Figure 5.33 shows
two decision trees, T1 and T2, that are generated from the same training data,
but have different complexities. T2 is obtained by pruning T1 until a tree with
maximum depth of two is obtained. T1, on the other hand, performs very little
pruning on its decision tree. These design choices will introduce a bias into
the classifier that is analogous to the bias of the projectile launcher described
in the previous example. In general, the stronger the assumptions made by
a classifier about the nature of its decision boundary, the larger the
classifier’s bias will be. T2 therefore has a larger bias because it makes stronger
assumptions about its decision boundary (which is reflected by the size of the
tree) compared to T1. Other design choices that may introduce a bias into a
classifier include the network topology of an artificial neural network and the
number of neighbors considered by a nearest-neighbor classifier.
The expected error of a classifier is also affected by variability in the training
data because different compositions of the training set may lead to different
decision boundaries. This is analogous to the variance in x when different
amounts of force are applied to the projectile. The last component of the expected
error is associated with the intrinsic noise in the target class. The target
class for some domains can be nondeterministic; i.e., instances with the same
attribute values can have different class labels. Such errors are unavoidable
even when the true decision boundary is known.
Figure 5.33. Two decision trees with different complexities induced from the same training data: (a) decision tree T1; (b) decision tree T2.
The amounts of bias and variance contributing to the expected error depend
on the type of classifier used. Figure 5.34 compares the decision boundaries
produced by a decision tree and a 1-nearest neighbor classifier. For each
classifier, we plot the decision boundary obtained by "averaging" the models
induced from 100 training sets, each containing 100 examples. The true decision
boundary from which the data is generated is also plotted using a dashed
line. The difference between the true decision boundary and the "averaged"
decision boundary reflects the bias of the classifier. After averaging the models,
observe that the difference between the true decision boundary and the
decision boundary produced by the 1-nearest neighbor classifier is smaller than
the observed difference for a decision tree classifier. This result suggests that
the bias of a 1-nearest neighbor classifier is lower than the bias of a decision
tree classifier.
On the other hand, the 1-nearest neighbor classifier is more sensitive to
the composition of its training examples. If we examine the models induced
from different training sets, there is more variability in the decision boundary
of a 1-nearest neighbor classifier than a decision tree classifier. Therefore, the
decision boundary of a decision tree classifier has a lower variance than the
1-nearest neighbor classifier.
Figure 5.34. Bias of decision tree and 1-nearest neighbor classifiers: (a) decision boundary for decision tree; (b) decision boundary for 1-nearest neighbor.
5.6.4 Bagging
Bagging, which is also known as bootstrap aggregating, is a technique that
repeatedly samples (with replacement) from a data set according to a uniform
probability distribution. Each bootstrap sample has the same size as the original
data. Because the sampling is done with replacement, some instances may
appear several times in the same training set, while others may be omitted
from the training set. On average, a bootstrap sample Di contains approximately
63% of the original training data because each instance has a probability
$1 - (1 - 1/N)^N$ of being selected in each Di. If N is sufficiently large, this
probability converges to $1 - 1/e \approx 0.632$. The basic procedure for bagging is
summarized in Algorithm 5.6. After training the k classifiers, a test instance
is assigned to the class that receives the highest number of votes.
Algorithm 5.6 Bagging algorithm.
1: Let k be the number of bootstrap samples.
2: for i = 1 to k do
3:   Create a bootstrap sample of size N, Di.
4:   Train a base classifier Ci on the bootstrap sample Di.
5: end for
6: $C^*(x) = \operatorname{argmax}_y \sum_i \delta\big(C_i(x) = y\big)$.
   {δ(·) = 1 if its argument is true and 0 otherwise.}
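The 63.2% figure quoted for bootstrap samples can be checked numerically. The following is a minimal sketch; the function name `bootstrap_inclusion_probability` and the chosen values of N are illustrative and not part of the text.

```python
import math

def bootstrap_inclusion_probability(n):
    """Probability that a given record appears at least once in a
    bootstrap sample of size n drawn with replacement from n records."""
    return 1.0 - (1.0 - 1.0 / n) ** n

# Limiting value as n grows: 1 - 1/e.
limit = 1.0 - 1.0 / math.e

for n in (10, 100, 1000, 10000):
    print(n, round(bootstrap_inclusion_probability(n), 4))
print("limit", round(limit, 4))
```

For small data sets the probability is somewhat above the limit, and it approaches 1 − 1/e from above as N increases.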
To illustrate how bagging works, consider the data set shown in Table 5.4.
Let x denote a one-dimensional attribute and y denote the class label. Suppose
we apply a classifier that induces only one-level binary decision trees, with a
test condition x ≤ k, where k is a split point chosen to minimize the entropy
of the leaf nodes. Such a tree is also known as a decision stump.
Table 5.4. Example of data set used to construct an ensemble of bagging classifiers.
x   0.1   0.2   0.3   0.4   0.5   0.6   0.7   0.8   0.9   1
y   1     1     1     −1    −1    −1    −1    1     1     1
Without bagging, the best decision stump we can produce splits the records
at either x ≤ 0.35 or x ≤ 0.75. Either way, the accuracy of the tree is at
most 70%. Suppose we apply the bagging procedure on the data set using
ten bootstrap samples. The examples chosen for training in each bagging
round are shown in Figure 5.35. On the right-hand side of each table, we also
illustrate the decision boundary produced by the classifier.
We classify the entire data set given in Table 5.4 by taking a majority
vote among the predictions made by each base classifier. The results of the
predictions are shown in Figure 5.36. Since the class labels are either −1 or
+1, taking the majority vote is equivalent to summing up the predicted values
of y and examining the sign of the resulting sum (refer to the second to last
row in Figure 5.36). Notice that the ensemble classifier perfectly classifies all
ten examples in the original data.
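The voting step of this example can be sketched in a few lines. The code below is an illustration, not the book's implementation: it skips the resampling and training steps and hard-codes the ten decision stumps of Figure 5.35 as `(split, left_label, right_label)` tuples. The leaf labels are an assumption, taken to be the ones consistent with the vote sums reported in Figure 5.36. It also checks the 70% ceiling for a single stump by brute force.

```python
# Table 5.4: one-dimensional attribute x and class label y.
X = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
Y = [1, 1, 1, -1, -1, -1, -1, 1, 1, 1]

# One stump per bagging round: (split, label if x <= split, label if x > split).
# Assumption: leaf labels inferred from the sums in Figure 5.36.
STUMPS = [
    (0.35, 1, -1), (0.65, 1, 1), (0.35, 1, -1), (0.30, 1, -1), (0.35, 1, -1),
    (0.75, -1, 1), (0.75, -1, 1), (0.75, -1, 1), (0.75, -1, 1), (0.05, -1, 1),
]

def stump_predict(stump, x):
    split, left, right = stump
    return left if x <= split else right

def accuracy(predictions, labels):
    return sum(p == t for p, t in zip(predictions, labels)) / len(labels)

# Majority vote: with labels in {-1, +1} this is the sign of the summed votes.
vote_sums = [sum(stump_predict(s, x) for s in STUMPS) for x in X]
ensemble = [1 if total > 0 else -1 for total in vote_sums]

# Best accuracy achievable by any single stump on the original data.
candidate_splits = [0.05 + 0.1 * i for i in range(11)]
best_single = max(
    accuracy([stump_predict((s, left, right), x) for x in X], Y)
    for s in candidate_splits for left in (-1, 1) for right in (-1, 1)
)

print("vote sums:", vote_sums)
print("ensemble accuracy:", accuracy(ensemble, Y))
print("best single stump accuracy:", best_single)
```

No single stump can separate the three intervals of Table 5.4, but the summed votes change sign twice, which is what lets the ensemble represent the target.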
The preceding example illustrates another advantage of using ensemble
methods in terms of enhancing the representation of the target function. Even
though each base classifier is a decision stump, combining the classifiers can
lead to a decision tree of depth 2.
Bagging improves generalization error by reducing the variance of the base
classifiers. The performance of bagging depends on the stability of the base
classifier. If a base classifier is unstable, bagging helps to reduce the errors
associated with random fluctuations in the training data. If a base classifier
is stable, i.e., robust to minor perturbations in the training set, then the
error of the ensemble is primarily caused by bias in the base classifier. In
this situation, bagging may not be able to improve the performance of the
base classifiers significantly. It may even degrade the classifier’s performance
because the effective size of each training set is about 37% smaller than the
original data.
Finally, since every sample has an equal probability of being selected, bagging
does not focus on any particular instance of the training data. It is
therefore less susceptible to model overfitting when applied to noisy data.
Figure 5.35. Example of bagging.
Bagging Round 1 (x <= 0.35 ==> y = 1; x > 0.35 ==> y = −1):
x   0.1   0.2   0.2   0.3   0.4   0.4   0.5   0.6   0.9   0.9
y   1     1     1     1     −1    −1    −1    −1    1     1
Bagging Round 2 (x <= 0.65 ==> y = 1; x > 0.65 ==> y = 1):
x   0.1   0.2   0.3   0.4   0.5   0.8   0.9   1     1     1
y   1     1     1     −1    −1    1     1     1     1     1
Bagging Round 3 (x <= 0.35 ==> y = 1; x > 0.35 ==> y = −1):
x   0.1   0.2   0.3   0.4   0.4   0.5   0.7   0.7   0.8   0.9
y   1     1     1     −1    −1    −1    −1    −1    1     1
Bagging Round 4 (x <= 0.3 ==> y = 1; x > 0.3 ==> y = −1):
x   0.1   0.1   0.2   0.4   0.4   0.5   0.5   0.7   0.8   0.9
y   1     1     1     −1    −1    −1    −1    −1    1     1
Bagging Round 5 (x <= 0.35 ==> y = 1; x > 0.35 ==> y = −1):
x   0.1   0.1   0.2   0.5   0.6   0.6   0.6   1     1     1
y   1     1     1     −1    −1    −1    −1    1     1     1
Bagging Round 6 (x <= 0.75 ==> y = −1; x > 0.75 ==> y = 1):
x   0.2   0.4   0.5   0.6   0.7   0.7   0.7   0.8   0.9   1
y   1     −1    −1    −1    −1    −1    −1    1     1     1
Bagging Round 7 (x <= 0.75 ==> y = −1; x > 0.75 ==> y = 1):
x   0.1   0.4   0.4   0.6   0.7   0.8   0.9   0.9   0.9   1
y   1     −1    −1    −1    −1    1     1     1     1     1
Bagging Round 8 (x <= 0.75 ==> y = −1; x > 0.75 ==> y = 1):
x   0.1   0.2   0.5   0.5   0.5   0.7   0.7   0.8   0.9   1
y   1     1     −1    −1    −1    −1    −1    1     1     1
Bagging Round 9 (x <= 0.75 ==> y = −1; x > 0.75 ==> y = 1):
x   0.1   0.3   0.4   0.4   0.6   0.7   0.7   0.8   1     1
y   1     1     −1    −1    −1    −1    −1    1     1     1
Bagging Round 10 (x <= 0.05 ==> y = −1; x > 0.05 ==> y = 1):
x   0.1   0.1   0.1   0.1   0.3   0.3   0.8   0.8   0.9   0.9
y   1     1     1     1     1     1     1     1     1     1
5.6.5 Boosting
Boosting is an iterative procedure used to adaptively change the distribution
of training examples so that the base classifiers will focus on examples that
are hard to classify. Unlike bagging, boosting assigns a weight to each training
example and may adaptively change the weight at the end of each boosting
round. The weights assigned to the training examples can be used in the
following ways:
1. They can be used as a sampling distribution to draw a set of bootstrap
samples from the original data.
2. They can be used by the base classifier to learn a model that is biased
toward higher-weight examples.
Figure 5.36. Example of combining classifiers constructed using the bagging approach.
Round        x=0.1  x=0.2  x=0.3  x=0.4  x=0.5  x=0.6  x=0.7  x=0.8  x=0.9  x=1.0
1            1      1      1      −1     −1     −1     −1     −1     −1     −1
2            1      1      1      1      1      1      1      1      1      1
3            1      1      1      −1     −1     −1     −1     −1     −1     −1
4            1      1      1      −1     −1     −1     −1     −1     −1     −1
5            1      1      1      −1     −1     −1     −1     −1     −1     −1
6            −1     −1     −1     −1     −1     −1     −1     1      1      1
7            −1     −1     −1     −1     −1     −1     −1     1      1      1
8            −1     −1     −1     −1     −1     −1     −1     1      1      1
9            −1     −1     −1     −1     −1     −1     −1     1      1      1
10           1      1      1      1      1      1      1      1      1      1
Sum          2      2      2      −6     −6     −6     −6     2      2      2
Sign         1      1      1      −1     −1     −1     −1     1      1      1
True Class   1      1      1      −1     −1     −1     −1     1      1      1
This section describes an algorithm that uses weights of examples to determine
the sampling distribution of its training set. Initially, the examples
are assigned equal weights, 1/N, so that they are equally likely to be chosen
for training. A sample is drawn according to the sampling distribution of the
training examples to obtain a new training set. Next, a classifier is induced
from the training set and used to classify all the examples in the original data.
The weights of the training examples are updated at the end of each boosting
round. Examples that are classified incorrectly will have their weights
increased, while those that are classified correctly will have their weights
decreased. This forces the classifier to focus on examples that are difficult to
classify in subsequent iterations.
The following table shows the examples chosen during each boosting round.
Boosting (Round 1): 7 3 2 8 7 9 4 10 6 3
Boosting (Round 2): 5 4 9 4 2 5 1 7 4 2
Boosting (Round 3): 4 4 8 10 4 5 4 6 3 4
Initially, all the examples are assigned the same weights. However, some examples
may be chosen more than once, e.g., examples 3 and 7, because the
sampling is done with replacement. A classifier built from the data is then
used to classify all the examples. Suppose example 4 is difficult to classify.
The weight for this example will be increased in future iterations as it gets
misclassified repeatedly. Meanwhile, examples that were not chosen in the previous
round, e.g., examples 1 and 5, also have a better chance of being selected
in the next round since their predictions in the previous round were likely to
be wrong. As the boosting rounds proceed, examples that are the hardest to
classify tend to become even more prevalent. The final ensemble is obtained
by aggregating the base classifiers obtained from each boosting round.
Over the years, several implementations of the boosting algorithm have
been developed. These algorithms differ in terms of (1) how the weights of
the training examples are updated at the end of each boosting round, and (2)
how the predictions made by each classifier are combined. An implementation
called AdaBoost is explored in the next section.
AdaBoost
Let $\{(x_j, y_j) \mid j = 1, 2, \ldots, N\}$ denote a set of N training examples. In the
AdaBoost algorithm, the importance of a base classifier Ci depends on its error
rate, which is defined as
\[ \epsilon_i = \frac{1}{N} \Big[ \sum_{j=1}^{N} w_j \, I\big(C_i(x_j) \neq y_j\big) \Big], \tag{5.68} \]
where I(p) = 1 if the predicate p is true, and 0 otherwise. The importance of
a classifier Ci is given by the following parameter,
\[ \alpha_i = \frac{1}{2} \ln\Big( \frac{1 - \epsilon_i}{\epsilon_i} \Big). \]
Note that αi has a large positive value if the error rate is close to 0 and a large
negative value if the error rate is close to 1, as shown in Figure 5.37.
The αi parameter is also used to update the weights of the training examples.
To illustrate, let $w_i^{(j)}$ denote the weight assigned to example (xi, yi)
during the jth boosting round. The weight update mechanism for AdaBoost
is given by the equation:
\[ w_i^{(j+1)} = \frac{w_i^{(j)}}{Z_j} \times \begin{cases} \exp(-\alpha_j) & \text{if } C_j(x_i) = y_i \\ \exp(\alpha_j) & \text{if } C_j(x_i) \neq y_i \end{cases}, \tag{5.69} \]
where Zj is the normalization factor used to ensure that $\sum_i w_i^{(j+1)} = 1$. The
weight update formula given in Equation 5.69 increases the weights of incorrectly
classified examples and decreases the weights of those classified correctly.
Figure 5.37. Plot of α as a function of training error ε (x-axis: ε from 0 to 1; y-axis: ln((1 − ε)/ε) from −5 to 5).
Instead of using a majority voting scheme, the prediction made by each
classifier Cj is weighted according to αj . This approach allows AdaBoost to
penalize models that have poor accuracy, e.g., those generated at the earlier
boosting rounds. In addition, if any intermediate round produces an error
rate higher than 50%, the weights are reset to their original uniform
values, wi = 1/N, and the resampling procedure is repeated. The AdaBoost
algorithm is summarized in Algorithm 5.7.
Let us examine how the boosting approach works on the data set shown
in Table 5.4. Initially, all the examples have identical weights. The examples
chosen for training in each of three boosting rounds are shown in Figure 5.38(a).
The weights for each example are updated at the end of each boosting round
using Equation 5.69; the resulting weights are listed in Figure 5.38(b).
Without boosting, the accuracy of the decision stump is, at best, 70%.
With AdaBoost, the results of the predictions are given in Figure 5.39(b).
The final prediction of the ensemble classifier is obtained by taking a weighted
average of the predictions made by each base classifier, which is shown in the
last row of Figure 5.39(b). Notice that AdaBoost perfectly classifies all the
examples in the training data.
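The α values and weights of this example can be reproduced from Equations 5.68 and 5.69. The sketch below is illustrative only: it does not resample or train anything, but takes the three decision stumps of Figure 5.39(a) as given, written as `(split, left_label, right_label)`. The leaf signs are an assumption, chosen to be consistent with the sums in Figure 5.39(b). The error rate deliberately keeps the 1/N factor of Equation 5.68 as printed.

```python
import math

# Table 5.4.
X = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
Y = [1, 1, 1, -1, -1, -1, -1, 1, 1, 1]
N = len(X)

# Stumps for boosting rounds 1-3: (split, label if x <= split, label if x > split).
# Assumption: leaf signs inferred from the sums in Figure 5.39(b).
STUMPS = [(0.75, -1, 1), (0.05, 1, 1), (0.3, 1, -1)]

def predict(stump, x):
    split, left, right = stump
    return left if x <= split else right

weights = [1.0 / N] * N      # uniform initial weights
alphas = []
weight_history = [weights]   # weights in force at the start of each round

for stump in STUMPS:
    preds = [predict(stump, x) for x in X]
    # Equation 5.68: weighted error, including the 1/N factor.
    eps = sum(w for w, p, y in zip(weights, preds, Y) if p != y) / N
    alpha = 0.5 * math.log((1.0 - eps) / eps)
    alphas.append(alpha)
    # Equation 5.69: shrink correct examples, grow misclassified ones, normalize.
    raw = [w * math.exp(-alpha if p == y else alpha)
           for w, p, y in zip(weights, preds, Y)]
    z = sum(raw)
    weights = [r / z for r in raw]
    weight_history.append(weights)

# Final prediction: sign of the alpha-weighted sum of the stump outputs.
scores = [sum(a * predict(s, x) for a, s in zip(alphas, STUMPS)) for x in X]
ensemble = [1 if s > 0 else -1 for s in scores]

print([round(a, 4) for a in alphas])
print([round(s, 3) for s in scores])
print(ensemble == Y)
```

Because the stumps are fixed in advance, the run is deterministic; in the full algorithm each stump would be trained on a sample drawn according to the current weights.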
Algorithm 5.7 AdaBoost algorithm.
1: w = {wj = 1/N | j = 1, 2, ..., N}. {Initialize the weights for all N examples.}
2: Let k be the number of boosting rounds.
3: for i = 1 to k do
4:   Create training set Di by sampling (with replacement) from D according to w.
5:   Train a base classifier Ci on Di.
6:   Apply Ci to all examples in the original training set, D.
7:   $\epsilon_i = \frac{1}{N} \big[ \sum_j w_j \, \delta\big(C_i(x_j) \neq y_j\big) \big]$ {Calculate the weighted error.}
8:   if εi > 0.5 then
9:     w = {wj = 1/N | j = 1, 2, ..., N}. {Reset the weights for all N examples.}
10:    Go back to Step 4.
11:  end if
12:  $\alpha_i = \frac{1}{2} \ln \frac{1 - \epsilon_i}{\epsilon_i}$.
13:  Update the weight of each example according to Equation 5.69.
14: end for
15: $C^*(x) = \operatorname{argmax}_y \sum_{j=1}^{k} \alpha_j \, \delta\big(C_j(x) = y\big)$.
An important analytical result of boosting shows that the training error of
the ensemble is bounded by the following expression:
\[ e_{\text{ensemble}} \le \prod_i \Big[ 2\sqrt{\epsilon_i (1 - \epsilon_i)} \Big], \tag{5.70} \]
where εi is the error rate of each base classifier i. If the error rate of the base
classifier is less than 50%, we can write εi = 0.5 − γi, where γi measures how
much better the classifier is than random guessing. The bound on the training
error of the ensemble becomes
\[ e_{\text{ensemble}} \le \prod_i \sqrt{1 - 4\gamma_i^2} \le \exp\Big( -2 \sum_i \gamma_i^2 \Big). \tag{5.71} \]
If γi > γ∗ for all i, for some constant γ∗ > 0, then the training error of the
ensemble decreases exponentially, which leads to the fast convergence of the
algorithm. Nevertheless, because of its tendency to focus on training examples
that are wrongly classified, the boosting technique can be quite susceptible to
overfitting.
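The step from the product bound to the exponential bound can be filled in as follows. This is a sketch that assumes the standard form of the AdaBoost training-error bound, in which each factor is $2\sqrt{\epsilon_i(1-\epsilon_i)}$, together with the elementary inequality $1 - x \le e^{-x}$; the lower bound $\gamma_*$ on the edge of every base classifier is likewise an assumption of the final line.

```latex
\begin{align*}
2\sqrt{\epsilon_i(1-\epsilon_i)}
  &= 2\sqrt{\left(\tfrac12-\gamma_i\right)\left(\tfrac12+\gamma_i\right)}
   = 2\sqrt{\tfrac14-\gamma_i^2}
   = \sqrt{1-4\gamma_i^2}, \\
\sqrt{1-4\gamma_i^2}
  &\le \sqrt{e^{-4\gamma_i^2}} = e^{-2\gamma_i^2}
  \qquad (\text{since } 1 - x \le e^{-x}), \\
\prod_{i=1}^{k} \sqrt{1-4\gamma_i^2}
  &\le \exp\Bigl(-2\sum_{i=1}^{k}\gamma_i^2\Bigr)
   \le \exp\bigl(-2k\gamma_*^2\bigr)
  \qquad (\text{if } \gamma_i \ge \gamma_* > 0 \text{ for all } i).
\end{align*}
```

The last line makes the exponential decrease explicit: the bound shrinks geometrically in the number of boosting rounds k as long as every base classifier beats random guessing by at least γ∗.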
Figure 5.38. Example of boosting.
(a) Training records chosen during boosting
Boosting Round 1:
x   0.1   0.4   0.5   0.6   0.6   0.7   0.7   0.7   0.8   1
y   1     −1    −1    −1    −1    −1    −1    −1    1     1
Boosting Round 2:
x   0.1   0.1   0.2   0.2   0.2   0.2   0.3   0.3   0.3   0.3
y   1     1     1     1     1     1     1     1     1     1
Boosting Round 3:
x   0.2   0.2   0.4   0.4   0.4   0.4   0.5   0.6   0.6   0.7
y   1     1     −1    −1    −1    −1    −1    −1    −1    −1
(b) Weights of training records
Round   x=0.1   x=0.2   x=0.3   x=0.4   x=0.5   x=0.6   x=0.7   x=0.8   x=0.9   x=1.0
1       0.1     0.1     0.1     0.1     0.1     0.1     0.1     0.1     0.1     0.1
2       0.311   0.311   0.311   0.01    0.01    0.01    0.01    0.01    0.01    0.01
3       0.029   0.029   0.029   0.228   0.228   0.228   0.228   0.009   0.009   0.009
5.6.6 Random Forests
Random forest is a class of ensemble methods specifically designed for decision
tree classifiers. It combines the predictions made by multiple decision trees,
where each tree is generated based on the values of an independent set of
random vectors, as shown in Figure 5.40. The random vectors are generated
from a fixed probability distribution, unlike the adaptive approach used in
AdaBoost, where the probability distribution is varied to focus on examples
that are hard to classify. Bagging using decision trees is a special case of
random forests, where randomness is injected into the model-building process
by randomly choosing N samples, with replacement, from the original training
set. Bagging also uses the same uniform probability distribution to generate
its bootstrapped samples throughout the entire model-building process.
It was theoretically proven that the upper bound for the generalization error
of random forests converges to the following expression when the number of
trees is sufficiently large:
\[ \text{Generalization error} \le \frac{\rho (1 - s^2)}{s^2}, \tag{5.72} \]
where ρ is the average correlation among the trees and s is a quantity that
measures the "strength" of the tree classifiers. The strength of a set of classifiers
refers to the average performance of the classifiers, where performance is
measured probabilistically in terms of the classifier's margin:
\[ \text{margin, } M(X, Y) = P(\hat{Y}_\theta = Y) - \max_{Z \neq Y} P(\hat{Y}_\theta = Z), \tag{5.73} \]
where $\hat{Y}_\theta$ is the predicted class of X according to a classifier built from some
random vector θ. The higher the margin is, the more likely it is that the
classifier correctly predicts a given example X. Equation 5.72 is quite intuitive;
as the trees become more correlated or the strength of the ensemble decreases,
the generalization error bound tends to increase. Randomization helps to
reduce the correlation among decision trees so that the generalization error of
the ensemble can be improved.
Figure 5.39. Example of combining classifiers constructed using the AdaBoost approach.
(a)
Round   Split Point   Left Class   Right Class   α
1       0.75          −1           1             1.738
2       0.05          1            1             2.7784
3       0.3           1            −1            4.1195
(b)
Round   x=0.1   x=0.2   x=0.3   x=0.4   x=0.5   x=0.6   x=0.7   x=0.8   x=0.9   x=1.0
1       −1      −1      −1      −1      −1      −1      −1      1       1       1
2       1       1       1       1       1       1       1       1       1       1
3       1       1       1       −1      −1      −1      −1      −1      −1      −1
Sum     5.16    5.16    5.16    −3.08   −3.08   −3.08   −3.08   0.397   0.397   0.397
Sign    1       1       1       −1      −1      −1      −1      1       1       1
Figure 5.40. Random forests. (Step 1: randomize the original training data D to create random vectors D1, D2, ..., Dt. Step 2: use each random vector to build a decision tree T1, T2, ..., Tt. Step 3: combine the decision trees into T*.)
Each decision tree uses a random vector that is generated from some fixed
probability distribution. A random vector can be incorporated into the
tree-growing process in many ways. The first approach is to randomly select F
input features to split at each node of the decision tree. As a result, instead of
examining all the available features, the decision to split a node is determined
from these selected F features. The tree is then grown to its entirety without
any pruning. This may help reduce the bias present in the resulting tree.
Once the trees have been constructed, the predictions are combined using a
majority voting scheme. This approach is known as Forest-RI, where RI refers
to random input selection. To increase randomness, bagging can also be used
to generate bootstrap samples for Forest-RI. The strength and correlation of
random forests may depend on the size of F. If F is sufficiently small, then
the trees tend to become less correlated. On the other hand, the strength of
the tree classifier tends to improve with a larger number of features, F. As
a tradeoff, the number of features is commonly chosen to be F = log2 d + 1,
where d is the number of input features. Since only a subset of the features
needs to be examined at each node, this approach helps to significantly reduce
the runtime of the algorithm.
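The random input selection used by Forest-RI can be sketched as follows. This is a minimal illustration rather than a tree learner: the function names are hypothetical, and truncating log2 d + 1 to an integer is an assumption, since the text does not say how a fractional value is handled.

```python
import math
import random

def num_candidate_features(d):
    """F = log2(d) + 1, truncated to an integer (truncation is an assumption)."""
    return int(math.log2(d) + 1)

def sample_candidate_features(d, rng):
    """Pick F distinct feature indices uniformly at random for one tree node."""
    return sorted(rng.sample(range(d), num_candidate_features(d)))

rng = random.Random(0)  # fixed seed only to make the run repeatable
for d in (8, 16, 60):
    print(d, num_candidate_features(d), sample_candidate_features(d, rng))
```

A tree-growing routine would call `sample_candidate_features` once per node and search for the best split among those F features only, which is where the runtime saving comes from.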
If the number of original features d is too small, then it is difficult to choose
an independent set of random features for building the decision trees. One
way to increase the feature space is to create linear combinations of the input
features. Specifically, at each node, a new feature is generated by randomly
selecting L of the input features. The input features are linearly combined
using coefficients generated from a uniform distribution in the range of [−1,
1]. At each node, F of such randomly combined new features are generated,
and the best of them is subsequently selected to split the node. This approach
is known as Forest-RC.
A third approach for generating the random trees is to randomly select
one of the F best splits at each node of the decision tree. This approach may
potentially generate trees that are more correlated than Forest-RI and Forest-RC,
unless F is sufficiently large. It also does not have the runtime savings of
Forest-RI and Forest-RC because the algorithm must examine all the splitting
features at each node of the decision tree.
It has been shown empirically that the classification accuracies of random
forests are quite comparable to those of the AdaBoost algorithm. Random forests
are also more robust to noise and run much faster than AdaBoost. The classification
accuracies of various ensemble algorithms are compared in the next section.
Table 5.5. Comparing the accuracy of a decision tree classifier against three ensemble methods.
Data Set (Number of Attributes, Classes, Records)   Decision Tree (%)   Bagging (%)   Boosting (%)   RF (%)
Anneal (39, 6, 898) 92.09 94.43 95.43 95.43
Australia (15, 2, 690) 85.51 87.10 85.22 85.80
Auto (26, 7, 205) 81.95 85.37 85.37 84.39
Breast (11, 2, 699) 95.14 96.42 97.28 96.14
Cleve (14, 2, 303) 76.24 81.52 82.18 82.18
Credit (16, 2, 690) 85.8 86.23 86.09 85.8
Diabetes (9, 2, 768) 72.40 76.30 73.18 75.13
German (21, 2, 1000) 70.90 73.40 73.00 74.5
Glass (10, 7, 214) 67.29 76.17 77.57 78.04
Heart (14, 2, 270) 80.00 81.48 80.74 83.33
Hepatitis (20, 2, 155) 81.94 81.29 83.87 83.23
Horse (23, 2, 368) 85.33 85.87 81.25 85.33
Ionosphere (35, 2, 351) 89.17 92.02 93.73 93.45
Iris (5, 3, 150) 94.67 94.67 94.00 93.33
Labor (17, 2, 57) 78.95 84.21 89.47 84.21
Led7 (8, 10, 3200) 73.34 73.66 73.34 73.06
Lymphography (19, 4, 148) 77.03 79.05 85.14 82.43
Pima (9, 2, 768) 74.35 76.69 73.44 77.60
Sonar (61, 2, 208) 78.85 78.85 84.62 85.58
Tictactoe (10, 2, 958) 83.72 93.84 98.54 95.82
Vehicle (19, 4, 846) 71.04 74.11 78.25 74.94
Waveform (22, 3, 5000) 76.44 83.30 83.90 84.04
Wine (14, 3, 178) 94.38 96.07 97.75 97.75
Zoo (17, 7, 101) 93.07 93.07 95.05 97.03
5.6.7 Empirical Comparison among Ensemble Methods
Table 5.5 shows the empirical results obtained when comparing the performance
of a decision tree classifier against bagging, boosting, and random forest.
The base classifiers used in each ensemble method consist of fifty decision
trees. The classification accuracies reported in this table are obtained from
ten-fold cross-validation. Notice that the ensemble classifiers generally
outperform a single decision tree classifier on many of the data sets.
5.7 Class Imbalance Problem
Data sets with imbalanced class distributions are quite common in many real
applications. For example, an automated inspection system that monitors
products that come off a manufacturing assembly line may find that the number
of defective products is significantly smaller than that of nondefective products.
Similarly, in credit card fraud detection, fraudulent transactions are
outnumbered by legitimate transactions. In both of these examples, there is
a disproportionate number of instances that belong to different classes. The
degree of imbalance varies from one application to another: a manufacturing
plant operating under the six sigma principle may discover four defects in a
million products shipped to its customers, while the amount of credit card
fraud may be of the order of 1 in 100. Despite their infrequent occurrences,
a correct classification of the rare class in these applications often has greater
value than a correct classification of the majority class. However, because the
class distribution is imbalanced, this presents a number of problems to existing
classification algorithms.
The accuracy measure, which is used extensively to compare the performance
of classifiers, may not be well suited for evaluating models derived from
imbalanced data sets. For example, if 1% of the credit card transactions are
fraudulent, then a model that predicts every transaction as legitimate has an
accuracy of 99% even though it fails to detect any of the fraudulent activities.
Additionally, measures that are used to guide the learning algorithm (e.g.,
information gain for decision tree induction) may need to be modified to focus
on the rare class.
Detecting instances of the rare class is akin to finding a needle in a haystack.
Because their instances occur infrequently, models that describe the rare class
tend to be highly specialized. For example, in a rule-based classifier, the
rules extracted for the rare class typically involve a large number of attributes
and cannot be easily simplified into more general rules with broader coverage
(unlike the rules for the majority class). Such models are also susceptible
to the presence of noise in training data. As a result, many of the existing
classification algorithms may not effectively detect instances of the rare class.
This section presents some of the methods developed for handling the class
imbalance problem. First, alternative metrics besides accuracy are introduced,
along with a graphical method called ROC analysis. We then describe how
cost-sensitive learning and sampling-based methods may be used to improve
the detection of rare classes.
5.7.1 Alternative Metrics
Since the accuracy measure treats every class as equally important, it may
not be suitable for analyzing imbalanced data sets, where the rare class is
considered more interesting than the majority class. For binary classification,
the rare class is often denoted as the positive class, while the majority class is
denoted as the negative class. A confusion matrix that summarizes the number
of instances predicted correctly or incorrectly by a classification model is shown
in Table 5.6.
Table 5.6. A confusion matrix for a binary classification problem in which the classes are not equally
important.
                    Predicted Class
                    +            −
Actual Class   +    f++ (TP)     f+− (FN)
               −    f−+ (FP)     f−− (TN)
The following terminology is often used when referring to the counts tabulated
in a confusion matrix:
• True positive (TP) or f++, which corresponds to the number of positive
examples correctly predicted by the classification model.
• False negative (FN) or f+−, which corresponds to the number of positive
examples wrongly predicted as negative by the classification model.
• False positive (FP) or f−+, which corresponds to the number of negative
examples wrongly predicted as positive by the classification model.
• True negative (TN) or f−−, which corresponds to the number of negative
examples correctly predicted by the classification model.
The counts in a confusion matrix can also be expressed in terms of percentages.
The true positive rate (TPR) or sensitivity is defined as the fraction of
positive examples predicted correctly by the model, i.e.,
$TPR = TP/(TP + FN)$.
Similarly, the true negative rate (TNR) or specificity is defined as the
fraction of negative examples predicted correctly by the model, i.e.,
$TNR = TN/(TN + FP)$.
Finally, the false positive rate (FPR) is the fraction of negative examples
predicted as a positive class, i.e.,
$FPR = FP/(TN + FP)$,
while the false negative rate (FNR) is the fraction of positive examples
predicted as a negative class, i.e.,
$FNR = FN/(TP + FN)$.
Recall and precision are two widely used metrics employed in applications
where successful detection of one of the classes is considered more significant
than detection of the other classes. A formal definition of these metrics
is given below.
\[ \text{Precision, } p = \frac{TP}{TP + FP} \tag{5.74} \]
\[ \text{Recall, } r = \frac{TP}{TP + FN} \tag{5.75} \]
Precision determines the fraction of records that actually turns out to be
positive in the group the classifier has declared as a positive class. The higher
the precision is, the lower the number of false positive errors committed by the
classifier. Recall measures the fraction of positive examples correctly predicted
by the classifier. Classifiers with large recall have very few positive examples
misclassified as the negative class. In fact, the value of recall is equivalent to
the true positive rate.
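The definitions above translate directly into code. The following sketch computes each rate from the four confusion-matrix counts; the function name `rates` and both sets of counts are hypothetical, chosen to mirror the 1% fraud scenario described earlier in this section.

```python
import math

def rates(tp, fn, fp, tn):
    """Metrics of this section computed from confusion-matrix counts."""
    def ratio(num, den):
        # A ratio with an empty denominator is undefined; report NaN.
        return num / den if den else float("nan")
    return {
        "accuracy": ratio(tp + tn, tp + fn + fp + tn),
        "TPR": ratio(tp, tp + fn),        # sensitivity
        "TNR": ratio(tn, tn + fp),        # specificity
        "FPR": ratio(fp, tn + fp),
        "FNR": ratio(fn, tp + fn),
        "precision": ratio(tp, tp + fp),  # Equation 5.74
        "recall": ratio(tp, tp + fn),     # Equation 5.75, same as TPR
    }

# Hypothetical counts: 10,000 transactions, 1% of them fraudulent (positive).
always_negative = rates(tp=0, fn=100, fp=0, tn=9900)   # predicts "legitimate" always
detector = rates(tp=80, fn=20, fp=120, tn=9780)        # an imperfect fraud detector

print(always_negative["accuracy"], always_negative["recall"])
print(detector["accuracy"], detector["recall"], detector["precision"])
```

The trivial model scores higher on accuracy than the detector while finding none of the positive examples, which is why recall and precision are the more informative pair here.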
It is often possible to construct baseline models that maximize one metric
but not the other. For example, a model that declares every record to be the
positive class will have a perfect recall, but very poor precision. Conversely,
a model that assigns a positive class to every test record that matches one of
the positive records in the training set has very high precision, but low recall.
Building a model that maximizes both precision and recall is the key challenge
of classification algorithms.
Precision and recall can be summarized into another metric known as the
F1 measure.
\[ F_1 = \frac{2rp}{r + p} = \frac{2 \times TP}{2 \times TP + FP + FN} \tag{5.76} \]
In principle, F1 represents a harmonic mean between recall and precision, i.e.,
\[ F_1 = \frac{2}{\frac{1}{r} + \frac{1}{p}}. \]
The harmonic mean of two numbers x and y tends to be closer to the smaller
of the two numbers. Hence, a high value of the F1 measure ensures that both
precision and recall are reasonably high. A comparison among harmonic,
geometric, and arithmetic means is given in the next example.
Example 5.8. Consider two positive numbers a = 1 and b = 5. Their arithmetic
mean is $\mu_a = (a + b)/2 = 3$ and their geometric mean is $\mu_g = \sqrt{ab} = 2.236$.
Their harmonic mean is $\mu_h = (2 \times 1 \times 5)/6 = 1.667$, which is closer to the
smaller value between a and b than the arithmetic and geometric means.
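The three means of Example 5.8 can be recomputed in a few lines. This is a small sketch with illustrative function names; the precision and recall values passed to `f1` are hypothetical.

```python
import math

def arithmetic_mean(a, b):
    return (a + b) / 2

def geometric_mean(a, b):
    return math.sqrt(a * b)

def harmonic_mean(a, b):
    return 2 * a * b / (a + b)

def f1(precision, recall):
    """F1 is the harmonic mean of precision and recall."""
    return harmonic_mean(precision, recall)

a, b = 1, 5
print(arithmetic_mean(a, b), round(geometric_mean(a, b), 3), round(harmonic_mean(a, b), 3))

# Hypothetical classifier with high precision but low recall:
# F1 stays close to the smaller of the two values.
print(round(f1(0.9, 0.1), 3))
```

With precision 0.9 and recall 0.1 the arithmetic mean would be 0.5, whereas F1 is 0.18, so a classifier cannot obtain a high F1 by excelling at only one of the two.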
More generally, the Fβ measure can be used to examine the tradeoff between
recall and precision:
\[ F_\beta = \frac{(\beta^2 + 1) r p}{r + \beta^2 p} = \frac{(\beta^2 + 1) \times TP}{(\beta^2 + 1) TP + \beta^2 FN + FP}. \tag{5.77} \]
Precision and recall are both special cases of Fβ, obtained by setting β = 0 and
β = ∞, respectively. Low values of β make Fβ closer to precision, and high values
make it closer to recall.
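The limiting behavior of Fβ is easy to confirm numerically. The sketch below implements the precision/recall form and a count-based form obtained by substituting r = TP/(TP + FN) and p = TP/(TP + FP); the function names and the confusion-matrix counts are hypothetical.

```python
def f_beta(precision, recall, beta):
    """F-beta from precision and recall."""
    b2 = beta ** 2
    return (b2 + 1) * recall * precision / (recall + b2 * precision)

def f_beta_from_counts(tp, fp, fn, beta):
    """Same quantity, written directly in terms of confusion-matrix counts."""
    b2 = beta ** 2
    return (b2 + 1) * tp / ((b2 + 1) * tp + b2 * fn + fp)

# Hypothetical counts: precision = 80/200 = 0.4, recall = 80/100 = 0.8.
tp, fp, fn = 80, 120, 20
p, r = tp / (tp + fp), tp / (tp + fn)

for beta in (0, 0.5, 1, 2, 100):
    print(beta, round(f_beta(p, r, beta), 4), round(f_beta_from_counts(tp, fp, fn, beta), 4))
```

At β = 0 the value equals the precision, at β = 1 it is the F1 measure, and for large β it approaches the recall.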
A more general metric that captures Fβ as well as accuracy is the weighted
accuracy measure, which is defined by the following equation:
\[ \text{Weighted accuracy} = \frac{w_1 TP + w_4 TN}{w_1 TP + w_2 FP + w_3 FN + w_4 TN}. \tag{5.78} \]
The relationship between weighted accuracy and other performance metrics is
summarized in the following table:
Measure     w1       w2   w3   w4
Recall      1        0    1    0
Precision   1        1    0    0
Fβ          β² + 1   1    β²   0
Accuracy    1        1    1    1
5.7.2 The Receiver Operating Characteristic Curve
A receiver operating characteristic (ROC) curve is a graphical approach for
displaying the tradeoff between true positive rate and false positive rate of a
classifier. In an ROC curve, the true positive rate (T P R) is plotted along the
y axis and the false positive rate (F P R) is shown on the x axis. Each point
along the curve corresponds to one of the models induced by the classifier.
Figure 5.41 shows the ROC curves for a pair of classifiers, M1 and M2.
Figure 5.41. ROC curves for two different classifiers, M1 and M2 (x-axis: False Positive Rate; y-axis: True Positive Rate).
There are several critical points along an ROC curve that have wellknown
interpretations:
(TPR=0, FPR=0): Model predicts every instance to be a negative class.
(TPR=1, FPR=1): Model predicts every instance to be a positive class.
(TPR=1, FPR=0): The ideal model.
A good classification model should be located as close as possible to the upper
left corner of the diagram, while a model that makes random guesses should
reside along the main diagonal, connecting the points (TPR = 0, FPR = 0)
and (TPR = 1, FPR = 1). Random guessing means that a record is classified
as a positive class with a fixed probability p, irrespective of its attribute
set. For example, consider a data set that contains n+ positive instances
and n− negative instances. The random classifier is expected to correctly
classify pn+ of the positive instances and to misclassify pn− of the negative
instances. Therefore, the TPR of the classifier is (pn+)/n+ = p, while its
FPR is (pn−)/n− = p. Since the TPR and FPR are identical, the ROC curve
for a random classifier always resides along the main diagonal.
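The argument above can be checked empirically. The sketch below simulates a classifier that labels each record positive with a fixed probability p, ignoring its attributes, and measures the resulting TPR and FPR. The class sizes, the value of p, and the seed are arbitrary assumptions chosen for illustration.

```python
import random


def random_guess_rates(n_pos, n_neg, p, seed=0):
    """Simulate random guessing and return the observed (TPR, FPR)."""
    rng = random.Random(seed)
    # Each positive record is predicted positive with probability p.
    tp = sum(rng.random() < p for _ in range(n_pos))
    # Each negative record is also predicted positive with probability p.
    fp = sum(rng.random() < p for _ in range(n_neg))
    return tp / n_pos, fp / n_neg


tpr, fpr = random_guess_rates(n_pos=20000, n_neg=50000, p=0.3)
print(tpr, fpr)  # both values should be close to p = 0.3
```

Repeating the simulation for several values of p traces out points that lie near the main diagonal, regardless of how imbalanced the two classes are.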
An ROC curve is useful for comparing the relative performance of
different classifiers. In Figure 5.41, M1 is better than M2 when FPR is less
than 0.36, while M2 is superior when FPR is greater than 0.36. Clearly,
neither of these two classifiers dominates the other.
The area under the ROC curve (AUC) provides another approach for evaluating
which model is better on average. If the model is perfect, then its area
under the ROC curve would equal 1. If the model simply performs random
guessing, then its area under the ROC curve would equal 0.5. A model that
is strictly better than another would have a larger area under the ROC curve.
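One common way to compute the area is the trapezoidal rule applied to the (FPR, TPR) points of the curve. The sketch below is a minimal illustration under that assumption; the function name and the two example curves are hypothetical.

```python
def auc_trapezoid(points):
    """Area under a piecewise-linear ROC curve given as (FPR, TPR) pairs."""
    pts = sorted(points)  # order by FPR, then by TPR
    area = 0.0
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        # Trapezoid between two consecutive points of the curve.
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


perfect_model = [(0.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
random_guessing = [(0.0, 0.0), (1.0, 1.0)]

print(auc_trapezoid(perfect_model))    # area for the ideal model
print(auc_trapezoid(random_guessing))  # area for the main diagonal
```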
Generating an ROC curve
To draw an ROC curve, the classifier should be able to produce a
continuous-valued output that can be used to rank its predictions, from the most
likely record to be classified as a positive class to the least likely record.
These outputs may correspond to the posterior probabilities generated by a
Bayesian classifier or the numeric-valued outputs produced by an artificial
neural network. The following procedure can then be used to generate an ROC curve:
1. Assuming that the continuous-valued outputs are defined for the positive
class, sort the test records in increasing order of their output values.
2. Select the lowest ranked test record (i.e., the record with the lowest output
value). Assign the selected record and those ranked above it to the
positive class. This approach is equivalent to classifying all the test
records as positive class. Because all the positive examples are classified
correctly and all the negative examples are misclassified, TPR = FPR = 1.
3. Select the next test record from the sorted list. Classify the selected
record and those ranked above it as positive, and those ranked below it
as negative. Update the counts of TP and FP by examining the actual
class label of the previously selected record. If the previously selected
record is a positive class, the TP count is decremented and the FP
count remains the same as before. If the previously selected record is a
negative class, the FP count is decremented and the TP count remains the
same as before.
4. Repeat Step 3 and update the TP and FP counts accordingly until the
highest ranked test record is selected.
5. Plot the TPR against the FPR of the classifier.
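The steps above can be sketched in a few lines of Python. This is an illustrative implementation under stated assumptions rather than a definitive one: records with tied outputs are processed one at a time in their sorted order, and one extra step is taken past the highest ranked record so that the curve ends at (0, 0). The scores and labels are our reading of the ten test records in Figure 5.42; the function name is hypothetical.

```python
def roc_points(scores, labels):
    """Return (FPR, TPR) pairs by moving records, lowest score first, to the negative side."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])  # step 1 (stable sort)
    n_pos = sum(1 for y in labels if y == "+")
    n_neg = len(labels) - n_pos
    tp, fp = n_pos, n_neg                    # step 2: every record predicted positive
    points = [(fp / n_neg, tp / n_pos)]
    for i in order:                          # steps 3 and 4
        if labels[i] == "+":
            tp -= 1                          # a positive record is now predicted negative
        else:
            fp -= 1                          # a negative record is now predicted negative
        points.append((fp / n_neg, tp / n_pos))
    return points                            # step 5: plot TPR against FPR


scores = [0.25, 0.43, 0.53, 0.76, 0.85, 0.85, 0.85, 0.87, 0.93, 0.95]
labels = ["+", "-", "+", "-", "-", "-", "+", "-", "+", "+"]

points = roc_points(scores, labels)
for fpr, tpr in points:
    print(f"FPR={fpr:.1f}  TPR={tpr:.1f}")
```

Because only one count changes per step, the whole curve is obtained in a single pass over the sorted records.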
Figure 5.42 shows an example of how to compute the ROC curve. There
are five positive examples and five negative examples in the test set. The class
labels of the test records are shown in the first row of the table. The second row
corresponds to the sorted output values for each record. For example, they
may correspond to the posterior probabilities P(+|x) generated by a naïve
Bayes classifier. The next six rows contain the counts of TP, FP, TN, and
FN, along with their corresponding TPR and FPR. The table is filled
from left to right. Initially, all the records are predicted to be positive. Thus,
TP = FP = 5 and TPR = FPR = 1. Next, we assign the test record with
the lowest output value to the negative class. Because the selected record is
actually a positive example, the TP count drops from 5 to 4 and the FP
count stays the same as before. The FPR and TPR are updated accordingly.
This process is repeated until we reach the end of the list, where TPR = 0
and FPR = 0. The ROC curve for this example is shown in Figure 5.43.

Class          +     −     +     −     −     −     +     −     +     +
Output      0.25  0.43  0.53  0.76  0.85  0.85  0.85  0.87  0.93  0.95  1.00
TP             5     4     4     3     3     3     3     2     2     1     0
FP             5     5     4     4     3     2     1     1     0     0     0
TN             0     0     1     1     2     3     4     4     5     5     5
FN             0     1     1     2     2     2     2     3     3     4     5
TPR            1   0.8   0.8   0.6   0.6   0.6   0.6   0.4   0.4   0.2     0
FPR            1     1   0.8   0.8   0.6   0.4   0.2   0.2     0     0     0
Figure 5.42. Constructing an ROC curve.

Figure 5.43. ROC curve for the data shown in Figure 5.42.
5.7.3 Cost-Sensitive Learning
A cost matrix encodes the penalty of classifying records from one class as
another. Let C(i, j) denote the cost of predicting a record from class i as class
j. With this notation, C(+, −) is the cost of committing a false negative error,
while C(−, +) is the cost of generating a false alarm. A negative entry in the
cost matrix represents the reward for making a correct classification. Given a
collection of N test records, the overall cost of a model M is
\[ C_t(M) = TP \times C(+,+) + FP \times C(-,+) + FN \times C(+,-) + TN \times C(-,-). \tag{5.79} \]
Under the 0/1 cost matrix, i.e., C(+, +) = C(−, −) = 0 and C(+, −) =
C(−, +) = 1, it can be shown that the overall cost is equivalent to the number
of misclassification errors:
\[ C_t(M) = 0 \times (TP + TN) + 1 \times (FP + FN) = N \times Err, \tag{5.80} \]
where Err is the error rate of the classifier.
Example 5.9. Consider the cost matrix shown in Table 5.7. The cost of
committing a false negative error is a hundred times larger than the cost
of committing a false alarm. In other words, failing to detect a single positive
example is just as bad as committing a hundred false alarms. Given the
classification models with the confusion matrices shown in Table 5.8, the total
cost for each model is
\[ C_t(M_1) = 150 \times (-1) + 60 \times 1 + 40 \times 100 = 3910, \]
\[ C_t(M_2) = 250 \times (-1) + 5 \times 1 + 45 \times 100 = 4255. \]
Table 5.7. Cost matrix for Example 5.9.
                            Predicted Class
                            Class = +    Class = −
Actual Class   Class = +       −1           100
               Class = −        1             0
Table 5.8. Confusion matrix for two classification models.
Model M1                    Predicted Class
                            Class +    Class −
Actual Class   Class +        150         40
               Class −         60        250

Model M2                    Predicted Class
                            Class +    Class −
Actual Class   Class +        250         45
               Class −          5        200
Notice that despite improving both of its true positive and false positive counts,
model M2 is still inferior since the improvement comes at the expense of
increasing the more costly false negative errors. A standard accuracy measure
would have preferred model M2 over M1.
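The arithmetic of Example 5.9 can be reproduced with a short script. The sketch below applies Equation 5.79 to the entries of Tables 5.7 and 5.8 and also reports the plain accuracy of each model; the dictionary layout and the function names are illustrative assumptions.

```python
def total_cost(confusion, cost):
    """Equation 5.79: sum of (count x cost) over the four confusion-matrix cells."""
    return sum(confusion[cell] * cost[cell] for cell in confusion)


def accuracy(confusion):
    """Fraction of records that are classified correctly."""
    return (confusion["TP"] + confusion["TN"]) / sum(confusion.values())


# Table 5.7: C(+,+) = -1, C(+,-) = 100, C(-,+) = 1, C(-,-) = 0.
cost = {"TP": -1, "FN": 100, "FP": 1, "TN": 0}

# Table 5.8: confusion matrices of the two models.
m1 = {"TP": 150, "FN": 40, "FP": 60, "TN": 250}
m2 = {"TP": 250, "FN": 45, "FP": 5, "TN": 200}

for name, model in (("M1", m1), ("M2", m2)):
    print(name, "cost =", total_cost(model, cost), "accuracy =", accuracy(model))
```

The output shows M2 with the higher accuracy but also the higher total cost, which is the point of the example.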
A cost-sensitive classification technique takes the cost matrix into
consideration during model building and generates a model that has the lowest cost.
For example, if false negative errors are the most costly, the learning algorithm
will try to reduce these errors by extending its decision boundary toward the
negative class, as shown in Figure 5.44. In this way, the generated model can
cover more positive examples, although at the expense of generating additional
false alarms.
Figure 5.44. Modifying the decision boundary (from B1 to B2) to reduce the false negative errors of a
classifier.
There are various ways to incorporate cost information into classification
algorithms. For example, in the context of decision tree induction, the cost
information can be used to: (1) choose the best attribute to use for splitting
the data, (2) determine whether a subtree should be pruned, (3) manipulate
the weights of the training records so that the learning algorithm converges to
a decision tree that has the lowest cost, and (4) modify the decision rule at
each leaf node. To illustrate the last approach, let p(i|t) denote the fraction of
training records at leaf node t that belong to class i. A typical decision
rule for a binary classification problem assigns the positive class to node t if
the following condition holds.
\begin{align*}
& p(+|t) > p(-|t) \\
\Longrightarrow\ & p(+|t) > 1 - p(+|t) \\
\Longrightarrow\ & 2p(+|t) > 1 \\
\Longrightarrow\ & p(+|t) > 0.5. \tag{5.81}
\end{align*}
The preceding decision rule suggests that the class label of a leaf node depends
on the majority class of the training records that reach the particular node.
Note that this rule assumes that the misclassification costs are identical for
both positive and negative examples. This decision rule is equivalent to the
expression given in Equation 4.8 on page 165.
Instead of taking a majority vote, a cost-sensitive algorithm assigns the
class label i to node t if it minimizes the following expression:
\[ C(i|t) = \sum_j p(j|t)\, C(j, i). \tag{5.82} \]
In the case where C(+, +) = C(−, −) = 0, a leaf node t is assigned to the
positive class if:
\begin{align*}
& p(+|t)\, C(+,-) > p(-|t)\, C(-,+) \\
\Longrightarrow\ & p(+|t)\, C(+,-) > (1 - p(+|t))\, C(-,+) \\
\Longrightarrow\ & p(+|t) > \frac{C(-,+)}{C(-,+) + C(+,-)}. \tag{5.83}
\end{align*}
This expression suggests that we can modify the threshold of the decision rule
from 0.5 to C(−, +)/(C(−, +) + C(+, −)) to obtain a cost-sensitive classifier.
If C(−, +) < C(+, −), then the threshold will be less than 0.5. This result
makes sense because a false negative error is then more costly than a false
alarm. Lowering the threshold will expand the
decision boundary toward the negative class, as shown in Figure 5.44.
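To make the effect of the threshold concrete, the sketch below computes the threshold of Equation 5.83 and labels a leaf by comparing the expected costs of Equation 5.82, assuming C(+, +) = C(−, −) = 0. The function names and the numeric values of p(+|t) are illustrative assumptions; the costs 1 and 100 are borrowed from Table 5.7.

```python
def cost_sensitive_threshold(c_false_alarm, c_false_negative):
    """Threshold on p(+|t) from Equation 5.83: C(-,+) / (C(-,+) + C(+,-))."""
    return c_false_alarm / (c_false_alarm + c_false_negative)


def leaf_label(p_pos, c_false_alarm, c_false_negative):
    """Pick the label with the lower expected cost (Equation 5.82, zero cost when correct)."""
    cost_if_positive = (1.0 - p_pos) * c_false_alarm   # negatives at the leaf become false alarms
    cost_if_negative = p_pos * c_false_negative        # positives at the leaf become false negatives
    return "+" if cost_if_positive < cost_if_negative else "-"


print(cost_sensitive_threshold(1, 1))     # equal costs give the majority-vote threshold
print(cost_sensitive_threshold(1, 100))   # costly false negatives lower the threshold
print(leaf_label(0.05, 1, 1))             # a leaf with 5% positives under equal costs
print(leaf_label(0.05, 1, 100))           # the same leaf when C(+,-) = 100
```

With equal costs the leaf containing 5% positive records is labeled negative, whereas with C(+, −) = 100 the same leaf is labeled positive because 0.05 exceeds the lowered threshold.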
Figure 5.45. Illustrating the effect of oversampling of the rare class: (a) without oversampling; (b) with oversampling (axes: x1 and x2).
5.7.4 Sampling-Based Approaches
Sampling is another widely used approach for handling the class imbalance
problem. The idea of sampling is to modify the distribution of instances so
that the rare class is well represented in the training set. Some of the available
techniques for sampling include undersampling, oversampling, and a hybrid
of both approaches. To illustrate these techniques, consider a data set that
contains 100 positive examples and 1000 negative examples.
In the case of undersampling, a random sample of 100 negative examples
is chosen to form the training set along with all the positive examples. One
potential problem with this approach is that some of the useful negative
examples may not be chosen for training, resulting in a less than optimal
model. A potential method to overcome this problem is to perform undersampling
multiple times and to induce multiple classifiers, similar to the ensemble
learning approach. Focused undersampling methods may also be used, where
the sampling procedure makes an informed choice with regard to the negative
examples that should be eliminated, e.g., those located far away from the
decision boundary.
Oversampling replicates the positive examples until the training set has an
equal number of positive and negative examples. Figure 5.45 illustrates the
effect of oversampling on the construction of a decision boundary using a
classifier such as a decision tree. Without oversampling, only the positive examples
at the bottom right-hand side of Figure 5.45(a) are classified correctly. The
positive example in the middle of the diagram is misclassified because there
are not enough examples to justify the creation of a new decision boundary
to separate the positive and negative instances. Oversampling provides the
additional examples needed to ensure that the decision boundary surrounding
the positive example is not pruned, as illustrated in Figure 5.45(b).
However, for noisy data, oversampling may cause model overfitting because
some of the noise examples may be replicated many times. In principle,
oversampling does not add any new information into the training set. Replication
of positive examples only prevents the learning algorithm from pruning certain
parts of the model that describe regions that contain very few training
examples (i.e., the small disjuncts). The additional positive examples also tend to
increase the computation time for model building.
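Random undersampling and oversampling by replication can be sketched as follows for the running example of 100 positive and 1000 negative records. This is a minimal illustration that assumes the negative class is the larger one; the record representation, the function names, and the seed are hypothetical.

```python
import random


def undersample(positives, negatives, rng):
    """Keep every positive record and a random subset of negatives of equal size."""
    return positives + rng.sample(negatives, len(positives))


def oversample(positives, negatives, rng):
    """Replicate randomly chosen positive records until both classes are the same size."""
    extra = [rng.choice(positives) for _ in range(len(negatives) - len(positives))]
    return positives + extra + negatives


rng = random.Random(0)
positives = [("+", i) for i in range(100)]    # 100 positive records
negatives = [("-", i) for i in range(1000)]   # 1000 negative records

under = undersample(positives, negatives, rng)
over = oversample(positives, negatives, rng)

print(len(under), len(over))
# Replication adds no new information: the number of distinct records is unchanged.
print(len(set(over)))
```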
The hybrid approach uses a combination of undersampling the majority
class and oversampling the rare class to achieve uniform class distribution.
Undersampling can be performed using random or focused subsampling.
Oversampling, on the other hand, can be done by replicating the existing positive
examples or generating new positive examples in the neighborhood of the
existing positive examples. In the latter approach, we must first determine the
k-nearest neighbors for each existing positive example. A new positive
example is then generated at some random point along the line segment that
joins the positive example to one of its k-nearest neighbors. This process is
repeated until the desired number of positive examples is reached. Unlike the
data replication approach, the new examples allow us to extend the decision
boundary for the positive class outward, similar to the approach shown in
Figure 5.44. Nevertheless, this approach may still be quite susceptible to model
overfitting.
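The generation of synthetic positive examples can be sketched as below. The sketch assumes that the k nearest neighbors are searched among the positive examples only and that distance is Euclidean, which are our assumptions rather than details given in the text; the function name, the toy points, and the seed are also hypothetical.

```python
import math
import random


def synthesize_positives(positives, k, n_new, rng):
    """Create n_new points on segments joining a positive example to one of its k nearest positive neighbors."""
    new_points = []
    for _ in range(n_new):
        i = rng.randrange(len(positives))
        x = positives[i]
        # k nearest neighbors of x among the other positive examples.
        others = [p for j, p in enumerate(positives) if j != i]
        neighbors = sorted(others, key=lambda p: math.dist(x, p))[:k]
        neighbor = rng.choice(neighbors)
        t = rng.random()  # random position along the segment from x to the neighbor
        new_points.append(tuple(a + t * (b - a) for a, b in zip(x, neighbor)))
    return new_points


rng = random.Random(0)
positives = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]  # toy positive examples
synthetic = synthesize_positives(positives, k=2, n_new=50, rng=rng)
print(synthetic[:3])
```

Every synthetic point lies between two existing positive examples, so the new records fill in the region occupied by the rare class instead of duplicating it.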
5.8 Multiclass Problem
Some of the classification techniques described in this chapter, such as support
vector machines and AdaBoost, were originally designed for binary classification
problems. Yet there are many real-world problems, such as character recognition,
face identification, and text classification, where the input data is divided
into more than two categories. This section presents several approaches for
extending the binary classifiers to handle multiclass problems. To illustrate
these approaches, let Y = {y1, y2, . . . , yK} be the set of classes of the input
data.
The first approach decomposes the multiclass problem into K binary problems.
For each class yi ∈ Y, a binary problem is created where all instances
that belong to yi are considered positive examples, while the remaining
instances are considered negative examples. A binary classifier is then
constructed to separate instances of class yi from the rest of the classes. This is
known as the one-against-rest (1-r) approach.
The second approach, which is known as the one-against-one (1-1) approach,
constructs K(K − 1)/2 binary classifiers, where each classifier is used
to distinguish between a pair of classes, (yi, yj). Instances that do not belong
to either yi or yj are ignored when constructing the binary classifier for (yi, yj).
In both 1-r and 1-1 approaches, a test instance is classified by combining the
predictions made by the binary classifiers. A voting scheme is typically
employed to combine the predictions, where the class that receives the highest
number of votes is assigned to the test instance. In the 1-r approach, if an
instance is classified as negative, then all classes except for the positive class
receive a vote. This approach, however, may lead to ties among the different
classes. Another possibility is to transform the outputs of the binary classifiers
into probability estimates and then assign the test instance to the class that
has the highest probability.
Example 5.10. Consider a multiclass problem where Y = {y1, y2, y3, y4}.
Suppose a test instance is classified as (+, −, −, −) according to the 1-r
approach. In other words, it is classified as positive when y1 is used as the
positive class and negative when y2, y3, and y4 are used as the positive class.
Using a simple majority vote, notice that y1 receives the highest number of
votes, which is four, while each of the remaining classes receives only two
votes. The test instance is therefore classified as y1.
Suppose the test instance is classified as follows using the 1-1 approach:
Binary pair       +: y1   +: y1   +: y1   +: y2   +: y2   +: y3
of classes        −: y2   −: y3   −: y4   −: y3   −: y4   −: y4
Classification      +       +       −       +       −       +
The first two rows in this table correspond to the pair of classes (yi, yj) chosen
to build the classifier and the last row represents the predicted class for the test
instance. After combining the predictions, y1 and y4 each receive two votes,
while y2 and y3 each receive only one vote. The test instance is therefore
classified as either y1 or y4, depending on the tie-breaking procedure.
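Both voting schemes of Example 5.10 can be tallied with a few lines of code. The sketch below follows the rules stated above: in the one-against-rest case a positive prediction gives one vote to the positive class and a negative prediction gives one vote to every other class, while in the one-against-one case each pairwise classifier votes for one class of its pair. The function names and data layout are illustrative assumptions.

```python
from collections import Counter


def one_against_rest_votes(classes, predictions):
    """predictions[i] is '+' or '-' from the classifier that treats classes[i] as positive."""
    votes = Counter({c: 0 for c in classes})
    for positive_class, prediction in zip(classes, predictions):
        if prediction == "+":
            votes[positive_class] += 1
        else:
            for c in classes:
                if c != positive_class:
                    votes[c] += 1
    return votes


def one_against_one_votes(pair_predictions):
    """pair_predictions maps (positive class, negative class) to '+' or '-'."""
    votes = Counter()
    for (positive_class, negative_class), prediction in pair_predictions.items():
        votes[positive_class if prediction == "+" else negative_class] += 1
    return votes


classes = ["y1", "y2", "y3", "y4"]
rest_votes = one_against_rest_votes(classes, ["+", "-", "-", "-"])
pair_votes = one_against_one_votes({
    ("y1", "y2"): "+", ("y1", "y3"): "+", ("y1", "y4"): "-",
    ("y2", "y3"): "+", ("y2", "y4"): "-", ("y3", "y4"): "+",
})

print(rest_votes.most_common())
print(pair_votes.most_common())
```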
Error-Correcting Output Coding
A potential problem with the previous two approaches is that they are sensitive
to the binary classification errors. For the 1-r approach given in Example 5.10,
if at least one of the binary classifiers makes a mistake in its prediction, then
the ensemble may end up declaring a tie between classes or making a wrong
prediction. For example, suppose the test instance is classified as (+, −, +, −)
due to misclassification by the third classifier. In this case, it will be difficult to
tell whether the instance should be classified as y1 or y3, unless the probability
associated with each class prediction is taken into account.
The error-correcting output coding (ECOC) method provides a more robust
way for handling multiclass problems. The method is inspired by an
information-theoretic approach for sending messages across noisy channels.
The idea behind this approach is to add redundancy into the transmitted
message by means of a codeword, so that the receiver may detect errors in the
received message and perhaps recover the original message if the number of
errors is small.
For multiclass learning, each class yi is represented by a unique bit string of
length n known as its codeword. We then train n binary classifiers to predict
each bit of the codeword string. The predicted class of a test instance is given
by the codeword that is closest, in terms of Hamming distance, to the bit string
produced by the binary classifiers. Recall that the Hamming distance between a
pair of bit strings is given by the number of bits that differ.
Example 5.11. Consider a multiclass problem where Y = {y1, y2, y3, y4}.
Suppose we encode the classes using the following 7-bit codewords:
Class   Codeword
y1      1 1 1 1 1 1 1
y2      0 0 0 0 1 1 1
y3      0 0 1 1 0 0 1
y4      0 1 0 1 0 1 0
Each bit of the codeword is used to train a binary classifier. If a test instance
is classified as (0,1,1,1,1,1,1) by the binary classifiers, then the Hamming
distance between this output and the codeword of y1 is 1, while the Hamming
distance to the codeword of each remaining class is 3. The test instance is
therefore classified as y1.
An interesting property of an error-correcting code is that if the minimum
Hamming distance between any pair of codewords is d, then up to ⌊(d − 1)/2⌋
errors in the output code can be corrected using its nearest codeword. In
Example 5.11, because the minimum Hamming distance between any pair of
codewords is 4, the ensemble can tolerate an error made by one of the seven
binary classifiers. If more than one classifier makes a mistake,
then the ensemble may not be able to compensate for the errors.
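The decoding step of Example 5.11 and the error-correction bound can be verified with the short sketch below. The codewords are those of the example; the function and variable names are illustrative assumptions.

```python
from itertools import combinations


def hamming(a, b):
    """Number of positions at which two equal-length bit strings differ."""
    return sum(x != y for x, y in zip(a, b))


# 7-bit codewords of Example 5.11.
codewords = {
    "y1": "1111111",
    "y2": "0000111",
    "y3": "0011001",
    "y4": "0101010",
}


def decode(output_bits):
    """Return the class whose codeword is nearest to the classifiers' output."""
    return min(codewords, key=lambda c: hamming(codewords[c], output_bits))


output = "0111111"  # bits predicted by the seven binary classifiers
distances = {c: hamming(w, output) for c, w in codewords.items()}
d_min = min(hamming(a, b) for a, b in combinations(codewords.values(), 2))
correctable = (d_min - 1) // 2  # floor((d - 1) / 2)

print(distances, decode(output))
print(d_min, correctable)
```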
An important issue is how to design the appropriate set of codewords for
different classes. In coding theory, a vast number of algorithms have been
developed for generating n-bit codewords with bounded Hamming distance.
However, the discussion of these algorithms is beyond the scope of this book.
It is worth mentioning that there is a significant difference between the
design of error-correcting codes for communication tasks and the design of those
used for multiclass learning. For communication, the codewords should maximize
the Hamming distance between the rows so that error correction can
be performed. Multiclass learning, however, requires that both the row-wise and
column-wise distances of the codewords be well separated. A larger
column-wise distance ensures that the binary classifiers are mutually independent,
which is an important requirement for ensemble learning methods.
5.9 Bibliographic Notes
Mitchell [208] provides excellent coverage of many classification techniques
from a machine learning perspective. Extensive coverage of classification can
also be found in Duda et al. [180], Webb [219], Fukunaga [187], Bishop [159],
Hastie et al. [192], Cherkassky and Mulier [167], Witten and Frank [221], Hand
et al. [190], Han and Kamber [189], and Dunham [181].
Direct methods for rule-based classifiers typically employ the sequential
covering scheme for inducing classification rules. Holte’s 1R [195] is the
simplest form of a rule-based classifier because its rule set contains only a single
rule. Despite its simplicity, Holte found that for some data sets that exhibit
a strong one-to-one relationship between the attributes and the class label,
1R performs just as well as other classifiers. Other examples of rule-based
classifiers include IREP [184], RIPPER [170], CN2 [168, 169], AQ [207], RISE
[176], and ITRULE [214]. Table 5.9 shows a comparison of the characteristics
of four of these classifiers.
For rule-based classifiers, the rule antecedent can be generalized to include
any propositional or first-order logical expression (e.g., Horn clauses).
Readers who are interested in first-order logic rule-based classifiers may refer to
references such as [208] or the vast literature on inductive logic programming
[209]. Quinlan [211] proposed the C4.5rules algorithm for extracting
classification rules from decision trees. An indirect method for extracting rules from
artificial neural networks was given by Andrews et al. in [157].
Table 5.9. Comparison of various rule-based classifiers.
                      RIPPER              CN2                CN2                   AQR
                                          (unordered)        (ordered)
Rule-growing          General-to-         General-to-        General-to-           General-to-specific
strategy              specific            specific           specific              (seeded by a
                                                                                   positive example)
Evaluation            FOIL’s Info gain    Laplace            Entropy and           Number of
metric                                                       likelihood ratio      true positives
Stopping              All examples        No performance     No performance        Rules cover only
condition for         belong to the       gain               gain                  positive class
rule-growing          same class
Rule pruning          Reduced             None               None                  None
                      error pruning
Instance              Positive and        Positive only      Positive only         Positive and
elimination           negative                                                     negative
Stopping              Error > 50% or      No performance     No performance        All positive
condition for         based on MDL        gain               gain                  examples are
adding rules                                                                       covered
Rule set              Replace or          Statistical        None                  None
pruning               modify rules        tests
Search strategy       Greedy              Beam search        Beam search           Beam search
Cover and Hart [172] presented an overview of the nearest-neighbor
classification method from a Bayesian perspective. Aha provided both theoretical
and empirical evaluations for instance-based methods in [155]. PEBLS, which
was developed by Cost and Salzberg [171], is a nearest-neighbor classification
algorithm that can handle data sets containing nominal attributes. Each
training example in PEBLS is also assigned a weight factor that depends on the
number of times the example helps make a correct prediction. Han et al. [188]
developed a weight-adjusted nearest-neighbor algorithm, in which the feature
weights are learned using a greedy, hill-climbing optimization algorithm.
Naïve Bayes classifiers have been investigated by many authors, including
Langley et al. [203], Ramoni and Sebastiani [212], Lewis [204], and Domingos
and Pazzani [178]. Although the independence assumption used in naïve Bayes
classifiers may seem rather unrealistic, the method has worked surprisingly well
for applications such as text classification. Bayesian belief networks provide a
more flexible approach by allowing some of the attributes to be interdependent.
An excellent tutorial on Bayesian belief networks is given by Heckerman in
[194].
Vapnik [217, 218] has written two authoritative books on Support Vector
Machines (SVM). Other useful resources on SVM and kernel methods include
the books by Cristianini and Shawe-Taylor [173] and Schölkopf and Smola
[213]. There are several survey articles on SVM, including those written by
Burges [164], Bennett and Campbell [158], Hearst [193], and Mangasarian [205].
A survey of ensemble methods in machine learning was given by
Dietterich [174]. The bagging method was proposed by Breiman [161]. Freund
and Schapire [186] developed the AdaBoost algorithm. Arcing, which stands
for adaptive resampling and combining, is a variant of the boosting algorithm
proposed by Breiman [162]. It uses the nonuniform weights assigned to
training examples to resample the data for building an ensemble of training sets.
Unlike AdaBoost, the votes of the base classifiers are not weighted when
determining the class label of test examples. The random forest method was
introduced by Breiman in [163].
Related work on mining rare and imbalanced data sets can be found in the
survey papers written by Chawla et al. [166] and Weiss [220]. Sampling-based
methods for mining imbalanced data sets have been investigated by many
authors, such as Kubat and Matwin [202], Japkowicz [196], and Drummond and
Holte [179]. Joshi et al. [199] discussed the limitations of boosting algorithms
for rare class modeling. Other algorithms developed for mining rare classes
include SMOTE [165], PNrule [198], and CREDOS [200].
Various alternative metrics that are well-suited for class imbalanced
problems are available. The precision, recall, and F1-measure are widely used
metrics in information retrieval [216]. ROC analysis was originally used in signal
detection theory. Bradley [160] investigated the use of area under the ROC
curve as a performance metric for machine learning algorithms. A method
for comparing classifier performance using the convex hull of ROC curves was
suggested by Provost and Fawcett in [210]. Ferri et al. [185] developed a
methodology for performing ROC analysis on decision tree classifiers. They
also proposed a methodology for incorporating area under the ROC curve
(AUC) as the splitting criterion during the tree-growing process. Joshi [197]
examined the performance of these measures from the perspective of analyzing
rare classes.
A vast amount of literature on cost-sensitive learning can be found in
the online proceedings of the ICML’2000 Workshop on cost-sensitive
learning. The properties of a cost matrix were studied by Elkan in [182].
Margineantu and Dietterich [206] examined various methods for incorporating
cost information into the C4.5 learning algorithm, including wrapper
methods, class distribution-based methods, and loss-based methods. Other
cost-sensitive learning methods that are algorithm-independent include AdaCost
[183], MetaCost [177], and costing [222].
Extensive literature is also available on the subject of multiclass learning.
This includes the works of Hastie and Tibshirani [191], Allwein et al. [156],
Kong and Dietterich [201], and Tax and Duin [215]. The error-correcting
output coding (ECOC) method was proposed by Dietterich and Bakiri [175].
They also investigated techniques for designing codes that are suitable for
solving multiclass problems.
Bibliography
[155] D. W. Aha. A study of instance-based algorithms for supervised learning tasks:
mathematical, empirical, and psychological evaluations. PhD thesis, University of
California, Irvine, 1990.
[156] E. L. Allwein, R. E. Schapire, and Y. Singer. Reducing Multiclass to Binary: A
Unifying Approach to Margin Classifiers. Journal of Machine Learning Research, 1:
113–141, 2000.
[157] R. Andrews, J. Diederich, and A. Tickle. A Survey and Critique of Techniques For
Extracting Rules From Trained Artificial Neural Networks. Knowledge Based Systems,
8(6):373–389, 1995.
[158] K. Bennett and C. Campbell. Support Vector Machines: Hype or Hallelujah. SIGKDD
Explorations, 2(2):1–13, 2000.
[159] C. M. Bishop. Neural Networks for Pattern Recognition. Oxford University Press,
Oxford, U.K., 1995.
[160] A. P. Bradley. The use of the area under the ROC curve in the Evaluation of Machine
Learning Algorithms. Pattern Recognition, 30(7):1145–1149, 1997.
[161] L. Breiman. Bagging Predictors. Machine Learning, 24(2):123–140, 1996.
[162] L. Breiman. Bias, Variance, and Arcing Classifiers. Technical Report 486, University
of California, Berkeley, CA, 1996.
[163] L. Breiman. Random Forests. Machine Learning, 45(1):5–32, 2001.
[164] C. J. C. Burges. A Tutorial on Support Vector Machines for Pattern Recognition.
Data Mining and Knowledge Discovery, 2(2):121–167, 1998.
[165] N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer. SMOTE: Synthetic
Minority Oversampling Technique. Journal of Artificial Intelligence Research, 16:321–
357, 2002.
[166] N. V. Chawla, N. Japkowicz, and A. Kolcz. Editorial: Special Issue on Learning from
Imbalanced Data Sets. SIGKDD Explorations, 6(1):1–6, 2004.
[167] V. Cherkassky and F. Mulier. Learning from Data: Concepts, Theory, and Methods.
Wiley Interscience, 1998.
[168] P. Clark and R. Boswell. Rule Induction with CN2: Some Recent Improvements. In
Machine Learning: Proc. of the 5th European Conf. (EWSL-91), pages 151–163, 1991.
[169] P. Clark and T. Niblett. The CN2 Induction Algorithm. Machine Learning, 3(4):
261–283, 1989.
[170] W. W. Cohen. Fast Effective Rule Induction. In Proc. of the 12th Intl. Conf. on
Machine Learning, pages 115–123, Tahoe City, CA, July 1995.
[171] S. Cost and S. Salzberg. A Weighted Nearest Neighbor Algorithm for Learning with
Symbolic Features. Machine Learning, 10:57–78, 1993.
[172] T. M. Cover and P. E. Hart. Nearest Neighbor Pattern Classification. IEEE
Transactions on Information Theory, 13(1):21–27, 1967.
[173] N. Cristianini and J. Shawe-Taylor. An Introduction to Support Vector Machines and
Other Kernel-based Learning Methods. Cambridge University Press, 2000.
[174] T. G. Dietterich. Ensemble Methods in Machine Learning. In First Intl. Workshop on
Multiple Classifier Systems, Cagliari, Italy, 2000.
[175] T. G. Dietterich and G. Bakiri. Solving Multiclass Learning Problems via
Error-Correcting Output Codes. Journal of Artificial Intelligence Research, 2:263–286, 1995.
[176] P. Domingos. The RISE system: Conquering without separating. In Proc. of the 6th
IEEE Intl. Conf. on Tools with Artificial Intelligence, pages 704–707, New Orleans, LA,
1994.
[177] P. Domingos. MetaCost: A General Method for Making Classifiers Cost-Sensitive. In
Proc. of the 5th Intl. Conf. on Knowledge Discovery and Data Mining, pages 155–164,
San Diego, CA, August 1999.
[178] P. Domingos and M. Pazzani. On the Optimality of the Simple Bayesian Classifier
under Zero-One Loss. Machine Learning, 29(2–3):103–130, 1997.
[179] C. Drummond and R. C. Holte. C4.5, Class imbalance, and Cost sensitivity: Why
undersampling beats oversampling. In ICML’2003 Workshop on Learning from
Imbalanced Data Sets II, Washington, DC, August 2003.
[180] R. O. Duda, P. E. Hart, and D. G. Stork. Pattern Classification. John Wiley & Sons,
Inc., New York, 2nd edition, 2001.
[181] M. H. Dunham. Data Mining: Introductory and Advanced Topics. Prentice Hall, 2002.
[182] C. Elkan. The Foundations of Cost-Sensitive Learning. In Proc. of the 17th Intl. Joint
Conf. on Artificial Intelligence, pages 973–978, Seattle, WA, August 2001.
[183] W. Fan, S. J. Stolfo, J. Zhang, and P. K. Chan. AdaCost: misclassification
cost-sensitive boosting. In Proc. of the 16th Intl. Conf. on Machine Learning, pages 97–105,
Bled, Slovenia, June 1999.
[184] J. Fürnkranz and G. Widmer. Incremental reduced error pruning. In Proc. of the 11th
Intl. Conf. on Machine Learning, pages 70–77, New Brunswick, NJ, July 1994.
[185] C. Ferri, P. Flach, and J. Hernández-Orallo. Learning Decision Trees Using the Area
Under the ROC Curve. In Proc. of the 19th Intl. Conf. on Machine Learning, pages
139–146, Sydney, Australia, July 2002.
[186] Y. Freund and R. E. Schapire. A decision-theoretic generalization of on-line learning
and an application to boosting. Journal of Computer and System Sciences, 55(1):119–
139, 1997.
[187] K. Fukunaga. Introduction to Statistical Pattern Recognition. Academic Press, New
York, 1990.
[188] E.-H. Han, G. Karypis, and V. Kumar. Text Categorization Using Weight Adjusted
k-Nearest Neighbor Classification. In Proc. of the 5th Pacific-Asia Conf. on Knowledge
Discovery and Data Mining, Lyon, France, 2001.
[189] J. Han and M. Kamber. Data Mining: Concepts and Techniques. Morgan Kaufmann
Publishers, San Francisco, 2001.
[190] D. J. Hand, H. Mannila, and P. Smyth. Principles of Data Mining. MIT Press, 2001.
[191] T. Hastie and R. Tibshirani. Classification by pairwise coupling. Annals of Statistics,
26(2):451–471, 1998.
[192] T. Hastie, R. Tibshirani, and J. H. Friedman. The Elements of Statistical Learning:
Data Mining, Inference, Prediction. Springer, New York, 2001.
[193] M. Hearst. Trends & Controversies: Support Vector Machines. IEEE Intelligent
Systems, 13(4):18–28, 1998.
[194] D. Heckerman. Bayesian Networks for Data Mining. Data Mining and Knowledge
Discovery, 1(1):79–119, 1997.
[195] R. C. Holte. Very Simple Classification Rules Perform Well on Most Commonly Used
Data sets. Machine Learning, 11:63–91, 1993.
[196] N. Japkowicz. The Class Imbalance Problem: Significance and Strategies. In Proc.
of the 2000 Intl. Conf. on Artificial Intelligence: Special Track on Inductive Learning,
volume 1, pages 111–117, Las Vegas, NV, June 2000.
[197] M. V. Joshi. On Evaluating Performance of Classifiers for Rare Classes. In Proc. of
the 2002 IEEE Intl. Conf. on Data Mining, Maebashi City, Japan, December 2002.
[198] M. V. Joshi, R. C. Agarwal, and V. Kumar. Mining Needles in a Haystack: Classifying
Rare Classes via Two-Phase Rule Induction. In Proc. of 2001 ACM-SIGMOD Intl. Conf.
on Management of Data, pages 91–102, Santa Barbara, CA, June 2001.
[199] M. V. Joshi, R. C. Agarwal, and V. Kumar. Predicting rare classes: can boosting
make any weak learner strong? In Proc. of the 8th Intl. Conf. on Knowledge Discovery
and Data Mining, pages 297–306, Edmonton, Canada, July 2002.
[200] M. V. Joshi and V. Kumar. CREDOS: Classification Using Ripple Down Structure
(A Case for Rare Classes). In Proc. of the SIAM Intl. Conf. on Data Mining, pages
321–332, Orlando, FL, April 2004.
[201] E. B. Kong and T. G. Dietterich. Error-Correcting Output Coding Corrects Bias and
Variance. In Proc. of the 12th Intl. Conf. on Machine Learning, pages 313–321, Tahoe
City, CA, July 1995.
[202] M. Kubat and S. Matwin. Addressing the Curse of Imbalanced Training Sets:
One-Sided Selection. In Proc. of the 14th Intl. Conf. on Machine Learning, pages 179–186,
Nashville, TN, July 1997.
[203] P. Langley, W. Iba, and K. Thompson. An analysis of Bayesian classifiers. In Proc. of
the 10th National Conf. on Artificial Intelligence, pages 223–228, 1992.
[204] D. D. Lewis. Naive Bayes at Forty: The Independence Assumption in Information
Retrieval. In Proc. of the 10th European Conf. on Machine Learning (ECML 1998),
pages 4–15, 1998.
[205] O. Mangasarian. Data Mining via Support Vector Machines. Technical Report
01-05, Data Mining Institute, May 2001.
[206] D. D. Margineantu and T. G. Dietterich. Learning Decision Trees for Loss Minimization
in Multi-Class Problems. Technical Report 993003, Oregon State University, 1999.
[207] R. S. Michalski, I. Mozetic, J. Hong, and N. Lavrac. The Multi-Purpose Incremental
Learning System AQ15 and Its Testing Application to Three Medical Domains. In Proc.
of 5th National Conf. on Artificial Intelligence, Orlando, August 1986.
[208] T. Mitchell. Machine Learning. McGraw-Hill, Boston, MA, 1997.
[209] S. Muggleton. Foundations of Inductive Logic Programming. Prentice Hall, Englewood
Cliffs, NJ, 1995.
[210] F. J. Provost and T. Fawcett. Analysis and Visualization of Classifier Performance:
Comparison under Imprecise Class and Cost Distributions. In Proc. of the 3rd Intl.
Conf. on Knowledge Discovery and Data Mining, pages 43–48, Newport Beach, CA,
August 1997.
[211] J. R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers,
San Mateo, CA, 1993.
[212] M. Ramoni and P. Sebastiani. Robust Bayes classifiers. Artificial Intelligence, 125:
209–226, 2001.
[213] B. Schölkopf and A. J. Smola. Learning with Kernels: Support Vector Machines,
Regularization, Optimization, and Beyond. MIT Press, 2001.
[214] P. Smyth and R. M. Goodman. An Information Theoretic Approach to Rule Induction
from Databases. IEEE Trans. on Knowledge and Data Engineering, 4(4):301–316, 1992.
[215] D. M. J. Tax and R. P. W. Duin. Using Two-Class Classifiers for Multiclass
Classification. In Proc. of the 16th Intl. Conf. on Pattern Recognition (ICPR 2002), pages
124–127, Quebec, Canada, August 2002.
[216] C. J. van Rijsbergen. Information Retrieval. Butterworth-Heinemann, Newton, MA,
1978.
[217] V. Vapnik. The Nature of Statistical Learning Theory. Springer Verlag, New York,
1995.
[218] V. Vapnik. Statistical Learning Theory. John Wiley & Sons, New York, 1998.
[219] A. R. Webb. Statistical Pattern Recognition. John Wiley & Sons, 2nd edition, 2002.
[220] G. M. Weiss. Mining with Rarity: A Unifying Framework. SIGKDD Explorations, 6
(1):7–19, 2004.
[221] I. H. Witten and E. Frank. Data Mining: Practical Machine Learning Tools and
Techniques with Java Implementations. Morgan Kaufmann, 1999.
[222] B. Zadrozny, J. C. Langford, and N. Abe. Cost-Sensitive Learning by
Cost-Proportionate Example Weighting. In Proc. of the 2003 IEEE Intl. Conf. on Data
Mining, pages 435–442, Melbourne, FL, August 2003.
5.10 Exercises
1. Consider a binary classification problem with the following set of attributes and
attribute values:
• Air Conditioner = {Working, Broken}
• Engine = {Good, Bad}
• Mileage = {High, Medium, Low}
• Rust = {Yes, No}
Suppose a rule-based classifier produces the following rule set:
Mileage = High −→ Value = Low
Mileage = Low −→ Value = High
Air Conditioner = Working, Engine = Good −→ Value = High
Air Conditioner = Working, Engine = Bad −→ Value = Low
Air Conditioner = Broken −→ Value = Low
(a) Are the rules mutually exclusive?
(b) Is the rule set exhaustive?
(c) Is ordering needed for this set of rules?
(d) Do you need a default class for the rule set?
2. The RIPPER algorithm (by Cohen [170]) is an extension of an earlier algorithm
called IREP (by Fürnkranz and Widmer [184]). Both algorithms apply the
reduced-error pruning method to determine whether a rule needs to be
pruned. The reduced-error pruning method uses a validation set to estimate
the generalization error of a classifier. Consider the following pair of rules:
R1: A −→ C
R2: A ∧ B −→ C
R2 is obtained by adding a new conjunct, B, to the left-hand side of R1. For
this question, you will be asked to determine whether R2 is preferred over R1
from the perspectives of rule-growing and rule-pruning. To determine whether
a rule should be pruned, IREP computes the following measure:
$$v_{IREP} = \frac{p + (N - n)}{P + N},$$
where P is the total number of positive examples in the validation set, N is
the total number of negative examples in the validation set, p is the number of
positive examples in the validation set covered by the rule, and n is the number
of negative examples in the validation set covered by the rule. vIREP is actually
similar to classification accuracy for the validation set. IREP favors rules that
have higher values of vIREP . On the other hand, RIPPER applies the following
measure to determine whether a rule should be pruned:
$$v_{RIPPER} = \frac{p - n}{p + n}.$$
(a) Suppose R1 covers 350 positive examples and 150 negative
examples, while R2 covers 300 positive examples and 50 negative
examples. Compute FOIL’s information gain for the rule R2 with
respect to R1.
(b) Consider a validation set that contains 500 positive examples and 500
negative examples. For R1, suppose the number of positive examples
covered by the rule is 200, and the number of negative examples covered
by the rule is 50. For R2, suppose the number of positive examples covered
by the rule is 100 and the number of negative examples is 5. Compute
vIREP for both rules. Which rule does IREP prefer?
(c) Compute vRIPPER for the previous problem. Which rule does RIPPER
prefer?
3. C4.5rules is an implementation of an indirect method for generating rules from
a decision tree. RIPPER is an implementation of a direct method for generating
rules directly from data.
(a) Discuss the strengths and weaknesses of both methods.
(b) Consider a data set that has a large difference in the class size (i.e., some
classes are much bigger than others). Which method (between C4.5rules
and RIPPER) is better in terms of finding high accuracy rules for the
small classes?
4. Consider a training set that contains 100 positive examples and 400 negative
examples. For each of the following candidate rules,
R1: A −→ + (covers 4 positive and 1 negative examples),
R2: B −→ + (covers 30 positive and 10 negative examples),
R3: C −→ + (covers 100 positive and 90 negative examples),
determine which is the best and worst candidate rule according to:
(a) Rule accuracy.
(b) FOIL’s information gain.
(c) The likelihood ratio statistic.
(d) The Laplace measure.
(e) The m-estimate measure (with k = 2 and p+ = 0.2).
5. Figure 5.4 illustrates the coverage of the classification rules R1, R2, and R3.
Determine which is the best and worst rule according to:
(a) The likelihood ratio statistic.
(b) The Laplace measure.
(c) The m-estimate measure (with k = 2 and p+ = 0.58).
(d) The rule accuracy after R1 has been discovered, where none of the
examples covered by R1 are discarded.
(e) The rule accuracy after R1 has been discovered, where only the positive
examples covered by R1 are discarded.
(f) The rule accuracy after R1 has been discovered, where both positive and
negative examples covered by R1 are discarded.
6. (a) Suppose the fraction of undergraduate students who smoke is 15% and
the fraction of graduate students who smoke is 23%. If one-fifth of the
college students are graduate students and the rest are undergraduates,
what is the probability that a student who smokes is a graduate student?
(b) Given the information in part (a), is a randomly chosen college student
more likely to be a graduate or undergraduate student?
(c) Repeat part (b) assuming that the student is a smoker.
(d) Suppose 30% of the graduate students live in a dorm but only 10% of
the undergraduate students live in a dorm. If a student smokes and lives
in the dorm, is he or she more likely to be a graduate or undergraduate
student? You can assume independence between students who live in a
dorm and those who smoke.
7. Consider the data set shown in Table 5.10.
Table 5.10. Data set for Exercise 7.
Record A B C Class
1 0 0 0 +
2 0 0 1 −
3 0 1 1 −
4 0 1 1 −
5 0 0 1 +
6 1 0 1 +
7 1 0 1 −
8 1 0 1 −
9 1 1 1 +
10 1 0 1 +
(a) Estimate the conditional probabilities for P(A|+), P(B|+), P(C|+), P(A|−),
P(B|−), and P(C|−).
(b) Use the estimate of conditional probabilities given in the previous question
to predict the class label for a test sample (A = 0, B = 1, C = 0) using
the naïve Bayes approach.
(c) Estimate the conditional probabilities using the m-estimate approach,
with p = 1/2 and m = 4.
(d) Repeat part (b) using the conditional probabilities given in part (c).
(e) Compare the two methods for estimating probabilities. Which method is
better and why?
8. Consider the data set shown in Table 5.11.
(a) Estimate the conditional probabilities for P(A = 1|+), P(B = 1|+),
P(C = 1|+), P(A = 1|−), P(B = 1|−), and P(C = 1|−) using the
same approach as in the previous problem.
Table 5.11. Data set for Exercise 8.
Instance A B C Class
1 0 0 1 −
2 1 0 1 +
3 0 1 0 −
4 1 0 0 −
5 1 0 1 +
6 0 0 1 +
7 1 1 0 −
8 0 0 0 −
9 0 1 0 +
10 1 1 1 +
(b) Use the conditional probabilities in part (a) to predict the class label for
a test sample (A = 1, B = 1, C = 1) using the naïve Bayes approach.
(c) Compare P(A = 1), P(B = 1), and P(A = 1, B = 1). State the
relationships between A and B.
(d) Repeat the analysis in part (c) using P(A = 1), P(B = 0), and
P(A = 1, B = 0).
(e) Compare P(A = 1, B = 1|Class = +) against P(A = 1|Class = +) and
P(B = 1|Class = +). Are the variables conditionally independent given
the class?
9. (a) Explain how naïve Bayes performs on the data set shown in Figure 5.46.
(b) If each class is further divided such that there are four classes (A1, A2,
B1, and B2), will naïve Bayes perform better?
(c) How will a decision tree perform on this data set (for the two-class
problem)? What if there are four classes?
10. Repeat the analysis shown in Example 5.3 for finding the location of a decision
boundary using the following information:
(a) The prior probabilities are P (Crocodile) = 2 × P (Alligator).
(b) The prior probabilities are P (Alligator) = 2 × P (Crocodile).
(c) The prior probabilities are the same, but their standard deviations are
different; i.e., σ(Crocodile) = 4 and σ(Alligator) = 2.
11. Figure 5.47 illustrates the Bayesian belief network for the data set shown in
Table 5.12. (Assume that all the attributes are binary).
(a) Draw the probability table for each node in the network.
Figure 5.46. Data set for Exercise 9. (Figure omitted; axes: Records, Attributes; labels: Distinguishing Attributes, Noise Attributes, Class A, Class B, A1, A2, B1, B2.)
Figure 5.47. Bayesian belief network. (Figure omitted; nodes: Mileage, Engine, Air Conditioner, Car Value.)
(b) Use the Bayesian network to compute P(Engine = Bad, Air Conditioner
= Broken).
12. Given the Bayesian network shown in Figure 5.48, compute the following
probabilities:
(a) P (B = good, F = empty, G = empty, S = yes).
(b) P (B = bad, F = empty, G = not empty, S = no).
(c) Given that the battery is bad, compute the probability that the car will
start.
13. Consider the one-dimensional data set shown in Table 5.13.
Table 5.12. Data set for Exercise 11.
Mileage  Engine  Air Conditioner  Number of Records      Number of Records
                                  with Car Value=Hi      with Car Value=Lo
Hi       Good    Working          3                      4
Hi       Good    Broken           1                      2
Hi       Bad     Working          1                      5
Hi       Bad     Broken           0                      4
Lo       Good    Working          9                      0
Lo       Good    Broken           5                      1
Lo       Bad     Working          1                      2
Lo       Bad     Broken           0                      2
Figure 5.48. Bayesian belief network for Exercise 12. (Figure omitted; nodes: Battery (B), Fuel (F), Gauge (G), Start (S). The probability tables attached to the nodes are listed below.)
P(B = bad) = 0.1
P(F = empty) = 0.2
P(G = empty | B = good, F = not empty) = 0.1
P(G = empty | B = good, F = empty) = 0.8
P(G = empty | B = bad, F = not empty) = 0.2
P(G = empty | B = bad, F = empty) = 0.9
P(S = no | B = good, F = not empty) = 0.1
P(S = no | B = good, F = empty) = 0.8
P(S = no | B = bad, F = not empty) = 0.9
P(S = no | B = bad, F = empty) = 1.0
(a) Classify the data point x = 5.0 according to its 1-, 3-, 5-, and 9-nearest
neighbors (using majority vote).
(b) Repeat the previous analysis using the distance-weighted voting approach
described in Section 5.2.1.
Table 5.13. Data set for Exercise 13.
x  0.5  3.0  4.5  4.6  4.9  5.2  5.3  5.5  7.0  9.5
y  −    −    +    +    +    −    −    +    −    −
14. The nearest-neighbor algorithm described in Section 5.2 can be extended to
handle nominal attributes. A variant of the algorithm called PEBLS (Parallel
Exemplar-Based Learning System) by Cost and Salzberg [171] measures the
distance between two values of a nominal attribute using the modified value
difference metric (MVDM). Given a pair of nominal attribute values, V1 and
V2, the distance between them is defined as follows:
$$d(V_1, V_2) = \sum_{i=1}^{k} \left| \frac{n_{i1}}{n_1} - \frac{n_{i2}}{n_2} \right|, \tag{5.84}$$
where $n_{ij}$ is the number of examples from class i with attribute value $V_j$ and
$n_j$ is the number of examples with attribute value $V_j$.
Consider the training set for the loan classification problem shown in Figure
5.9. Use the MVDM measure to compute the distance between every pair of
attribute values for the Home Owner and Marital Status attributes.
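The MVDM formula in Equation 5.84 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `mvdm` and the toy attribute values and class labels below are hypothetical and are not the data from Figure 5.9.

```python
from collections import Counter


def mvdm(values, labels, v1, v2):
    """Modified value difference metric between nominal values v1 and v2.

    Implements d(V1, V2) = sum_i |n_i1 / n_1 - n_i2 / n_2| (Equation 5.84),
    where n_ij counts examples of class i with value V_j and n_j counts
    all examples with value V_j.
    """
    value_counts = Counter(values)               # n_j
    joint_counts = Counter(zip(labels, values))  # n_ij
    return sum(
        abs(joint_counts[(c, v1)] / value_counts[v1]
            - joint_counts[(c, v2)] / value_counts[v2])
        for c in set(labels)
    )


# Hypothetical toy data: one nominal attribute with values "a" and "b".
values = ["a", "a", "a", "a", "b", "b", "b", "b", "b", "b"]
labels = ["+", "+", "+", "-", "+", "-", "-", "-", "-", "-"]

distance = mvdm(values, labels, "a", "b")
print(f"d(a, b) = {distance:.4f}")
```

With these made-up counts, value "a" is positive in 3 of 4 examples and value "b" in 1 of 6, so the distance is |3/4 − 1/6| + |1/4 − 5/6| = 7/6. Values whose class distributions are identical would be at distance zero.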
15. For each of the Boolean functions given below, state whether the problem is
linearly separable.
(a) A AND B AND C
(b) NOT A AND B
(c) (A OR B) AND (A OR C)
(d) (A XOR B) AND (A OR B)
16. (a) Demonstrate how the perceptron model can be used to represent the AND
and OR functions between a pair of Boolean variables.
(b) Comment on the disadvantage of using linear functions as activation
functions for multilayer neural networks.
17. You are asked to evaluate the performance of two classification models, M1 and
M2. The test set you have chosen contains 26 binary attributes, labeled as A
through Z.
Table 5.14 shows the posterior probabilities obtained by applying the models
to the test set. (Only the posterior probabilities for the positive class are
shown.) As this is a two-class problem, P(−) = 1 − P(+) and P(−|A, . . . , Z) =
1 − P(+|A, . . . , Z). Assume that we are mostly interested in detecting instances
from the positive class.
(a) Plot the ROC curve for both M1 and M2. (You should plot them on the
same graph.) Which model do you think is better? Explain your reasons.
(b) For model M1, suppose you choose the cutoff threshold to be t = 0.5. In
other words, any test instance whose posterior probability is greater than
t will be classified as a positive example. Compute the precision, recall,
and F-measure for the model at this threshold value.
Table 5.14. Posterior probabilities for Exercise 17.
Instance True Class P(+|A, . . . , Z, M1) P(+|A, . . . , Z, M2)
1 + 0.73 0.61
2 + 0.69 0.03
3 − 0.44 0.68
4 − 0.55 0.31
5 + 0.67 0.45
6 + 0.47 0.09
7 − 0.08 0.38
8 − 0.15 0.05
9 + 0.45 0.01
10 − 0.35 0.04
(c) Repeat the analysis for part (b) using the same cutoff threshold on model
M2. Compare the F-measure results for both models. Which model is
better? Are the results consistent with what you expect from the ROC
curve?
(d) Repeat part (b) for model M1 using the threshold t = 0.1. Which
threshold do you prefer, t = 0.5 or t = 0.1? Are the results consistent with what
you expect from the ROC curve?
18. Following is a data set that contains two attributes, X and Y , and two class
labels, “+” and “−”. Each attribute can take three different values: 0, 1, or 2.
X  Y  Number of + Instances  Number of − Instances
0  0  0                      100
1  0  0                      0
2  0  0                      100
0  1  10                     100
1  1  10                     0
2  1  10                     100
0  2  0                      100
1  2  0                      0
2  2  0                      100
The concept for the “+” class is Y = 1 and the concept for the “−” class is
X = 0 ∨ X = 2.
(a) Build a decision tree on the data set. Does the tree capture the “+” and
“−” concepts?
(b) What are the accuracy, precision, recall, and F1-measure of the decision
tree? (Note that precision, recall, and F1-measure are defined with respect
to the “+” class.)
(c) Build a new decision tree with the following cost function:
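$$C(i, j) = \begin{cases} 0, & \text{if } i = j; \\ 1, & \text{if } i = +,\ j = -; \\ \dfrac{\text{Number of } - \text{ instances}}{\text{Number of } + \text{ instances}}, & \text{if } i = -,\ j = +. \end{cases}$$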
(Hint: only the leaves of the old decision tree need to be changed.) Does
the decision tree capture the “+” concept?
(d) What are the accuracy, precision, recall, and F1-measure of the new
decision tree?
19. (a) Consider the cost matrix for a two-class problem. Let C(+, +) = C(−, −) =
p, C(+, −) = C(−, +) = q, and q > p. Show that minimizing the cost
function is equivalent to maximizing the classifier’s accuracy.
(b) Show that a cost matrix is scale-invariant. For example, if the cost matrix
is rescaled from C(i, j) −→ βC(i, j), where β is the scaling factor, the
decision threshold (Equation 5.82) will remain unchanged.
(c) Show that a cost matrix is translation-invariant. In other words, adding a
constant to all entries in the cost matrix will not affect the decision
threshold (Equation 5.82).
20. Consider the task of building a classifier from random data, where the attribute
values are generated randomly irrespective of the class labels. Assume the data
set contains records from two classes, “+” and “−.” Half of the data set is used
for training while the remaining half is used for testing.
(a) Suppose there are an equal number of positive and negative records in
the data and the decision tree classifier predicts every test record to be
positive. What is the expected error rate of the classifier on the test data?
(b) Repeat the previous analysis assuming that the classifier predicts each
test record to be positive class with probability 0.8 and negative class
with probability 0.2.
(c) Suppose two-thirds of the data belong to the positive class and the
remaining one-third belong to the negative class. What is the expected
error of a classifier that predicts every test record to be positive?
(d) Repeat the previous analysis assuming that the classifier predicts each
test record to be positive class with probability 2/3 and negative class
with probability 1/3.
21. Derive the dual Lagrangian for the linear SVM with nonseparable data where
the objective function is
$$f(\mathbf{w}) = \frac{\|\mathbf{w}\|^2}{2} + C\left(\sum_{i=1}^{N} \xi_i\right)^2.$$
22. Consider the XOR problem where there are four training points:
(1, 1, −), (1, 0, +), (0, 1, +), (0, 0, −).
Transform the data into the following feature space:
$$\Phi = (1, \sqrt{2}x_1, \sqrt{2}x_2, \sqrt{2}x_1x_2, x_1^2, x_2^2).$$
Find the maximum margin linear decision boundary in the transformed space.
23. Given the data sets shown in Figure 5.49, explain how the decision tree, naïve
Bayes, and k-nearest neighbor classifiers would perform on these data sets.
Figure 5.49. Data set for Exercise 23. (Figure omitted; panels:)
(a) Synthetic data set 1. Records versus attributes, split into distinguishing attributes and noise attributes; classes A and B.
(b) Synthetic data set 2. Records versus attributes, split into distinguishing attributes and noise attributes; classes A and B.
(c) Synthetic data set 3. Records versus attributes, with distinguishing attribute set 1, distinguishing attribute set 2, and noise attributes; regions are 60% or 40% filled with 1; classes A and B.
(d) Synthetic data set 4. Attribute X versus Attribute Y; regions labeled Class A and Class B.
(e) Synthetic data set 5. Attribute X versus Attribute Y; classes A and B.
(f) Synthetic data set 6. Attribute X versus Attribute Y; classes A and B.
6 Association Analysis: Basic Concepts and Algorithms
Many business enterprises accumulate large quantities of data from their
day-to-day operations. For example, huge amounts of customer purchase data are
collected daily at the checkout counters of grocery stores. Table 6.1 illustrates
an example of such data, commonly known as market basket transactions.
Each row in this table corresponds to a transaction, which contains a unique
identifier labeled TID and a set of items bought by a given customer.
Retailers are interested in analyzing the data to learn about the purchasing behavior
of their customers. Such valuable information can be used to support a
variety of business-related applications such as marketing promotions, inventory
management, and customer relationship management.
Table 6.1. An example of market basket transactions.
TID Items
1 {Bread, Milk}
2 {Bread, Diapers, Beer, Eggs}
3 {Milk, Diapers, Beer, Cola}
4 {Bread, Milk, Diapers, Beer}
5 {Bread, Milk, Diapers, Cola}
From Chapter 6 of Introduction to Data Mining, First Edition. Pang-Ning Tan, Michael Steinbach,
Vipin Kumar. Copyright © 2006 by Pearson Education, Inc. All rights reserved.
This chapter presents a methodology known as association analysis,
which is useful for discovering interesting relationships hidden in large data
sets. The uncovered relationships can be represented in the form of
association rules or sets of frequent items. For example, the following rule can be
extracted from the data set shown in Table 6.1:
{Diapers} −→ {Beer}.
The rule suggests that a strong relationship exists between the sale of diapers
and beer because many customers who buy diapers also buy beer. Retailers
can use this type of rule to help them identify new opportunities for
cross-selling their products to the customers.
Besides market basket data, association analysis is also applicable to other
application domains such as bioinformatics, medical diagnosis, Web mining,
and scientific data analysis. In the analysis of Earth science data, for example,
the association patterns may reveal interesting connections among the ocean,
land, and atmospheric processes. Such information may help Earth scientists
develop a better understanding of how the different elements of the Earth
system interact with each other. Even though the techniques presented here
are generally applicable to a wider variety of data sets, for illustrative purposes,
our discussion will focus mainly on market basket data.
There are two key issues that need to be addressed when applying
association analysis to market basket data. First, discovering patterns from a large
transaction data set can be computationally expensive. Second, some of the
discovered patterns are potentially spurious because they may happen simply
by chance. The remainder of this chapter is organized around these two
issues. The first part of the chapter is devoted to explaining the basic concepts
of association analysis and the algorithms used to efficiently mine such
patterns. The second part of the chapter deals with the issue of evaluating the
discovered patterns in order to prevent the generation of spurious results.
6.1 Problem Definition
This section reviews the basic terminology used in association analysis and
presents a formal description of the task.
Binary Representation Market basket data can be represented in a binary
format as shown in Table 6.2, where each row corresponds to a transaction
and each column corresponds to an item. An item can be treated as a binary
variable whose value is one if the item is present in a transaction and zero
otherwise. Because the presence of an item in a transaction is often considered
more important than its absence, an item is an asymmetric binary variable.
Table 6.2. A binary 0/1 representation of market basket data.
TID Bread Milk Diapers Beer Eggs Cola
1 1 1 0 0 0 0
2 1 0 1 1 1 0
3 0 1 1 1 0 1
4 1 1 1 1 0 0
5 1 1 1 0 0 1
This representation is perhaps a very simplistic view of real market basket data
because it ignores certain important aspects of the data such as the quantity
of items sold or the price paid to purchase them. Methods for handling such
nonbinary data will be explained in Chapter 7.
Itemset and Support Count Let I = {i1, i2, . . . , id} be the set of all items
in a market basket data set and T = {t1, t2, . . . , tN} be the set of all transactions.
Each transaction ti contains a subset of items chosen from I. In association
analysis, a collection of zero or more items is termed an itemset. If an itemset
contains k items, it is called a k-itemset. For instance, {Beer, Diapers, Milk}
is an example of a 3-itemset. The null (or empty) set is an itemset that does
not contain any items.
The transaction width is defined as the number of items present in a
transaction. A transaction tj is said to contain an itemset X if X is a subset of
tj. For example, the second transaction shown in Table 6.2 contains the
itemset {Bread, Diapers} but not {Bread, Milk}. An important property of an
itemset is its support count, which refers to the number of transactions that
contain a particular itemset. Mathematically, the support count, σ(X), for an
itemset X can be stated as follows:
$$\sigma(X) = \bigl|\{t_i \mid X \subseteq t_i,\ t_i \in T\}\bigr|,$$
where the symbol | · | denotes the number of elements in a set. In the data set
shown in Table 6.2, the support count for {Beer, Diapers, Milk} is equal to
two because there are only two transactions that contain all three items.
Association Rule An association rule is an implication expression of the
form X −→ Y, where X and Y are disjoint itemsets, i.e., X ∩ Y = ∅. The
strength of an association rule can be measured in terms of its support and
confidence. Support determines how often a rule is applicable to a given
data set, while confidence determines how frequently items in Y appear in
transactions that contain X. The formal definitions of these metrics are
$$\text{Support, } s(X \longrightarrow Y) = \frac{\sigma(X \cup Y)}{N}; \tag{6.1}$$
$$\text{Confidence, } c(X \longrightarrow Y) = \frac{\sigma(X \cup Y)}{\sigma(X)}. \tag{6.2}$$
Example 6.1. Consider the rule {Milk, Diapers} −→ {Beer}. Since the
support count for {Milk, Diapers, Beer} is 2 and the total number of
transactions is 5, the rule’s support is 2/5 = 0.4. The rule’s confidence is obtained
by dividing the support count for {Milk, Diapers, Beer} by the support count
for {Milk, Diapers}. Since there are 3 transactions that contain milk and
diapers, the confidence for this rule is 2/3 = 0.67.
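The calculation in Example 6.1 can be reproduced with a short Python sketch. The function names `support_count`, `support`, and `confidence` are illustrative choices, not part of the text; the transactions are those of Table 6.1.

```python
# Transactions from Table 6.1, keyed by TID.
transactions = {
    1: {"Bread", "Milk"},
    2: {"Bread", "Diapers", "Beer", "Eggs"},
    3: {"Milk", "Diapers", "Beer", "Cola"},
    4: {"Bread", "Milk", "Diapers", "Beer"},
    5: {"Bread", "Milk", "Diapers", "Cola"},
}


def support_count(itemset, transactions):
    """sigma(X): number of transactions that contain every item of X."""
    itemset = set(itemset)
    return sum(1 for items in transactions.values() if itemset <= items)


def support(antecedent, consequent, transactions):
    """s(X -> Y) = sigma(X union Y) / N (Equation 6.1)."""
    both = set(antecedent) | set(consequent)
    return support_count(both, transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """c(X -> Y) = sigma(X union Y) / sigma(X) (Equation 6.2)."""
    both = set(antecedent) | set(consequent)
    return (support_count(both, transactions)
            / support_count(antecedent, transactions))


s = support({"Milk", "Diapers"}, {"Beer"}, transactions)
c = confidence({"Milk", "Diapers"}, {"Beer"}, transactions)
print(f"support = {s:.2f}, confidence = {c:.2f}")
```

Note that `confidence` divides by σ(X), so this sketch assumes the antecedent occurs in at least one transaction.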
Why Use Support and Confidence? Support is an important measure
because a rule that has very low support may occur simply by chance. A
low support rule is also likely to be uninteresting from a business perspective
because it may not be profitable to promote items that customers seldom buy
together (with the exception of the situation described in Section 6.8). For
these reasons, support is often used to eliminate uninteresting rules. As will
be shown in Section 6.2.1, support also has a desirable property that can be
exploited for the efficient discovery of association rules.
Confidence, on the other hand, measures the reliability of the inference
made by a rule. For a given rule X −→ Y , the higher the confidence, the more
likely it is for Y to be present in transactions that contain X. Confidence also
provides an estimate of the conditional probability of Y given X.
Association analysis results should be interpreted with caution. The
inference made by an association rule does not necessarily imply causality. Instead,
it suggests a strong co-occurrence relationship between items in the antecedent
and consequent of the rule. Causality, on the other hand, requires knowledge
about the cause and effect attributes in the data and typically involves
relationships occurring over time (e.g., ozone depletion leads to global warming).
Formulation of Association Rule Mining Problem The association
rule mining problem can be formally stated as follows:
Definition 6.1 (Association Rule Discovery). Given a set of transactions
T , find all the rules having support ≥ minsup and confidence ≥ minconf ,
where minsup and minconf are the corresponding support and confidence
thresholds.
A brute-force approach for mining association rules is to compute the
support and confidence for every possible rule. This approach is prohibitively
expensive because there are exponentially many rules that can be extracted
from a data set. More specifically, the total number of possible rules extracted
from a data set that contains d items is
$$R = 3^d - 2^{d+1} + 1. \tag{6.3}$$
The proof for this equation is left as an exercise to the readers (see Exercise 5
on page 405). Even for the small data set shown in Table 6.1, this approach
requires us to compute the support and confidence for 3^6 − 2^7 + 1 = 602 rules.
More than 80% of the rules are discarded after applying minsup = 20% and
minconf = 50%, so most of the computations are wasted. To
avoid performing needless computations, it would be useful to prune the rules
early without having to compute their support and confidence values.
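As a sanity check on Equation 6.3 (not a proof), the rule count can be verified by enumeration for small d. The sketch below assigns each of the d items to the antecedent X, the consequent Y, or neither, and counts the assignments in which both X and Y are non-empty; the function names are illustrative.

```python
from itertools import product


def count_rules_by_enumeration(d):
    """Count rules X -> Y with X and Y non-empty, disjoint subsets of d items."""
    count = 0
    # Each item independently goes to X, to Y, or is left out of the rule.
    for assignment in product(("X", "Y", "neither"), repeat=d):
        if "X" in assignment and "Y" in assignment:
            count += 1
    return count


def count_rules_closed_form(d):
    """R = 3^d - 2^(d+1) + 1 (Equation 6.3)."""
    return 3 ** d - 2 ** (d + 1) + 1


for d in range(1, 8):
    assert count_rules_by_enumeration(d) == count_rules_closed_form(d)

print(count_rules_closed_form(6))  # number of rules for the 6 items of Table 6.1
```

The enumeration visits 3^d assignments, which is itself exponential in d and illustrates why evaluating every rule does not scale.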
An initial step toward improving the performance of association rule
mining algorithms is to decouple the support and confidence requirements. From
Equation 6.1, notice that the support of a rule X −→ Y depends only on
the support of its corresponding itemset, X ∪ Y. For example, the following
rules have identical support because they involve items from the same itemset,
{Beer, Diapers, Milk}:
{Beer, Diapers} −→ {Milk}, {Beer, Milk} −→ {Diapers},
{Diapers, Milk} −→ {Beer}, {Beer} −→ {Diapers, Milk},
{Milk} −→ {Beer, Diapers}, {Diapers} −→ {Beer, Milk}.
If the itemset is infrequent, then all six candidate rules can be pruned
immediately without our having to compute their confidence values.
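To make this concrete, the following sketch enumerates the six rules that can be formed from {Beer, Diapers, Milk} on the Table 6.1 transactions and prints their support and confidence. The helper name `sigma` is an illustrative choice. All six rules share one support value, while their confidence values differ.

```python
from itertools import combinations

# Transactions from Table 6.1.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diapers", "Beer", "Eggs"},
    {"Milk", "Diapers", "Beer", "Cola"},
    {"Bread", "Milk", "Diapers", "Beer"},
    {"Bread", "Milk", "Diapers", "Cola"},
]


def sigma(itemset):
    """Support count: number of transactions containing the itemset."""
    return sum(1 for t in transactions if itemset <= t)


itemset = frozenset({"Beer", "Diapers", "Milk"})
rules = []
for size in range(1, len(itemset)):
    for antecedent in combinations(sorted(itemset), size):
        x = frozenset(antecedent)
        y = itemset - x
        rule_support = sigma(itemset) / len(transactions)  # depends only on X union Y
        rule_confidence = sigma(itemset) / sigma(x)        # depends on the split
        rules.append((x, y, rule_support, rule_confidence))

for x, y, s, c in rules:
    print(f"{sorted(x)} -> {sorted(y)}: support={s:.2f}, confidence={c:.2f}")
```

Because the support is computed once per itemset, a single support check is enough to discard all six rules when the itemset falls below minsup.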
Therefore, a common strategy adopted by many association rule mining
algorithms is to decompose the problem into two major subtasks:
1. Frequent Itemset Generation, whose objective is to find all the
itemsets that satisfy the minsup threshold. These itemsets are called frequent
itemsets.
2. Rule Generation, whose objective is to extract all the high-confidence
rules from the frequent itemsets found in the previous step. These rules
are called strong rules.
Frequent itemset generation is generally more computationally expensive
than rule generation. Efficient techniques for
generating frequent itemsets and association rules are discussed in Sections 6.2
and 6.3, respectively.
Figure 6.1. An itemset lattice. [Figure: the lattice of all itemsets over
I = {a, b, c, d, e}, from the null set through the 1-itemsets a to e and on
to the 5-itemset abcde.]
6.2 Frequent Itemset Generation
A lattice structure can be used to enumerate the list of all possible itemsets.
Figure 6.1 shows an itemset lattice for I = {a, b, c, d, e}. In general, a data set
that contains k items can potentially generate up to $2^k - 1$ frequent itemsets,
excluding the null set. Because k can be very large in many practical
applications, the search space of itemsets that need to be explored is
exponentially large.
A brute-force approach for finding frequent itemsets is to determine the
support count for every candidate itemset in the lattice structure. To do
this, we need to compare each candidate against every transaction, an
operation that is shown in Figure 6.2. If the candidate is contained in a
transaction, its support count will be incremented. For example, the support
for {Bread, Milk} is incremented three times because the itemset is contained
in transactions 1, 4, and 5. Such an approach can be very expensive because it
requires O(NMw) comparisons, where N is the number of transactions,
$M = 2^k - 1$ is the number of candidate itemsets, and w is the maximum
transaction width.
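The brute-force procedure can be sketched in a few lines of Python. The five transactions are those of the running market-basket example; their order follows the statement that {Bread, Milk} occurs in transactions 1, 4, and 5, while the order of the other two transactions is an assumption. The function name is illustrative.

```python
from itertools import combinations

# Running example; the order of the second and third transactions is assumed.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diapers", "Beer", "Eggs"},
    {"Milk", "Diapers", "Beer", "Cola"},
    {"Bread", "Milk", "Diapers", "Beer"},
    {"Bread", "Milk", "Diapers", "Cola"},
]

def brute_force_support_counts(transactions):
    """Compare every non-empty itemset against every transaction."""
    items = sorted(set().union(*transactions))
    counts = {}
    for k in range(1, len(items) + 1):
        for candidate in combinations(items, k):
            # One subset test per (candidate, transaction) pair.
            counts[candidate] = sum(
                1 for t in transactions if t.issuperset(candidate)
            )
    return counts

counts = brute_force_support_counts(transactions)
print(len(counts))                # 63 candidates, i.e. 2**6 - 1
print(counts[("Bread", "Milk")])  # 3
```

Every one of the 63 candidates is tested against all five transactions, even though most of them occur in no transaction at all.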
Figure 6.2. Counting the support of candidate itemsets. [Figure: each of the
N transactions (TID 1 to 5) is compared against each of the M candidate
itemsets.]
There are several ways to reduce the computational complexity of frequent
itemset generation.
1. Reduce the number of candidate itemsets (M). The Apriori principle,
described in the next section, is an effective way to eliminate some
of the candidate itemsets without counting their support values.
2. Reduce the number of comparisons. Instead of matching each candidate
itemset against every transaction, we can reduce the number of
comparisons by using more advanced data structures, either to store the
candidate itemsets or to compress the data set. We will discuss these
strategies in Sections 6.2.4 and 6.6.
6.2.1 The Apriori Principle
This section describes how the support measure helps to reduce the number
of candidate itemsets explored during frequent itemset generation. The use of
support for pruning candidate itemsets is guided by the following principle.
Theorem 6.1 (Apriori Principle). If an itemset is frequent, then all of its
subsets must also be frequent.
To illustrate the idea behind the Apriori principle, consider the itemset
lattice shown in Figure 6.3. Suppose {c, d, e} is a frequent itemset. Clearly,
any transaction that contains {c, d, e} must also contain its subsets, {c, d},
{c, e}, {d, e}, {c}, {d}, and {e}. As a result, if {c, d, e} is frequent, then
all subsets of {c, d, e} (i.e., the shaded itemsets in this figure) must also be
frequent.
Figure 6.3. An illustration of the Apriori principle. If {c, d, e} is frequent,
then all subsets of this itemset are frequent.
Conversely, if an itemset such as {a, b} is infrequent, then all of its supersets
must be infrequent too. As illustrated in Figure 6.4, the entire subgraph
containing the supersets of {a, b} can be pruned immediately once {a, b} is
found to be infrequent. This strategy of trimming the exponential search
space based on the support measure is known as support-based pruning.
Such a pruning strategy is made possible by a key property of the support
measure, namely, that the support for an itemset never exceeds the support
for its subsets. This property is also known as the anti-monotone property
of the support measure.
Figure 6.4. An illustration of support-based pruning. If {a, b} is infrequent,
then all supersets of {a, b} are infrequent.
Definition 6.2 (Monotonicity Property). Let I be a set of items, and
$J = 2^I$ be the power set of I. A measure f is monotone (or upward closed) if
$\forall X, Y \in J : (X \subseteq Y) \rightarrow f(X) \le f(Y)$,
which means that if X is a subset of Y, then f(X) must not exceed f(Y). On
the other hand, f is anti-monotone (or downward closed) if
$\forall X, Y \in J : (X \subseteq Y) \rightarrow f(Y) \le f(X)$,
which means that if X is a subset of Y, then f(Y) must not exceed f(X).
Any measure that possesses an anti-monotone property can be incorporated
directly into the mining algorithm to effectively prune the exponential
search space of candidate itemsets, as will be shown in the next section.
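The anti-monotone property can be checked exhaustively on a small data set. The sketch below does so for the support count over the five running-example transactions (their order is an assumption) and contrasts it with itemset size, which is monotone rather than anti-monotone. The helper names are illustrative, and the empty set is left out of J for simplicity.

```python
from itertools import combinations

transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diapers", "Beer", "Eggs"},
    {"Milk", "Diapers", "Beer", "Cola"},
    {"Bread", "Milk", "Diapers", "Beer"},
    {"Bread", "Milk", "Diapers", "Cola"},
]

def support_count(itemset):
    """Number of transactions that contain every item of the itemset."""
    return sum(1 for t in transactions if t.issuperset(itemset))

items = sorted(set().union(*transactions))
all_itemsets = [
    frozenset(c)
    for k in range(1, len(items) + 1)
    for c in combinations(items, k)
]

def is_anti_monotone(measure, itemsets):
    """True if measure(Y) <= measure(X) whenever X is a subset of Y."""
    return all(
        measure(y) <= measure(x)
        for x in itemsets
        for y in itemsets
        if x <= y
    )

print(is_anti_monotone(support_count, all_itemsets))  # True
print(is_anti_monotone(len, all_itemsets))            # False
```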
6.2.2 Frequent Itemset Generation in the Apriori Algorithm
Apriori is the first association rule mining algorithm that pioneered the use
of support-based pruning to systematically control the exponential growth of
candidate itemsets. Figure 6.5 provides a high-level illustration of the frequent
itemset generation part of the Apriori algorithm for the transactions shown in
Table 6.1. We assume that the support threshold is 60%, which is equivalent
to a minimum support count equal to 3.
Figure 6.5. Illustration of frequent itemset generation using the Apriori
algorithm. [Figure: tables of candidate 1-itemsets, 2-itemsets, and 3-itemsets
with their support counts, for a minimum support count of 3; itemsets removed
because of low support are marked.]
Initially, every item is considered as a candidate 1-itemset. After counting
their supports, the candidate itemsets {Cola} and {Eggs} are discarded
because they appear in fewer than three transactions. In the next iteration,
candidate 2-itemsets are generated using only the frequent 1-itemsets because
the Apriori principle ensures that all supersets of the infrequent 1-itemsets
must be infrequent. Because there are only four frequent 1-itemsets, the
number of candidate 2-itemsets generated by the algorithm is
$\binom{4}{2} = 6$. Two of these six candidates, {Beer, Bread} and
{Beer, Milk}, are subsequently found to be infrequent after computing their
support values. The remaining four candidates are frequent, and thus will be
used to generate candidate 3-itemsets. Without support-based pruning, there
are $\binom{6}{3} = 20$ candidate 3-itemsets that can be formed using the six
items given in this example. With the Apriori principle, we only need to keep
candidate 3-itemsets whose subsets are frequent. The only candidate that has
this property is {Bread, Diapers, Milk}.
The effectiveness of the Apriori pruning strategy can be shown by counting
the number of candidate itemsets generated. A brute-force strategy of
enumerating all itemsets (up to size 3) as candidates will produce
$\binom{6}{1} + \binom{6}{2} + \binom{6}{3} = 6 + 15 + 20 = 41$
candidates. With the Apriori principle, this number decreases to
$\binom{6}{1} + \binom{4}{2} + 1 = 6 + 6 + 1 = 13$
candidates, which represents a 68% reduction in the number of candidate
itemsets even in this simple example.
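The arithmetic behind the two candidate counts can be reproduced with `math.comb` (available in Python 3.8 and later). The variable names are illustrative.

```python
from math import comb

# Brute force: every itemset of size 1, 2, or 3 over the six items.
brute_force = sum(comb(6, k) for k in (1, 2, 3))   # 6 + 15 + 20

# Apriori: six 1-itemsets, pairs of the four frequent items, one 3-itemset.
with_apriori = comb(6, 1) + comb(4, 2) + 1          # 6 + 6 + 1

reduction = 1 - with_apriori / brute_force
print(brute_force, with_apriori, f"{reduction:.0%}")  # 41 13 68%
```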
The pseudocode for the frequent itemset generation part of the Apriori
algorithm is shown in Algorithm 6.1. Let C_k denote the set of candidate
k-itemsets and F_k denote the set of frequent k-itemsets:
• The algorithm initially makes a single pass over the data set to determine
the support of each item. Upon completion of this step, the set of all
frequent 1-itemsets, F_1, will be known (steps 1 and 2).
• Next, the algorithm will iteratively generate new candidate k-itemsets
using the frequent (k − 1)-itemsets found in the previous iteration (step
5). Candidate generation is implemented using a function called
apriori-gen, which is described in Section 6.2.3.
• To count the support of the candidates, the algorithm needs to make an
additional pass over the data set (steps 6–10). The subset function is
used to determine all the candidate itemsets in C_k that are contained in
each transaction t. The implementation of this function is described in
Section 6.2.4.
• After counting their supports, the algorithm eliminates all candidate
itemsets whose support counts are less than N × minsup (step 12).
• The algorithm terminates when there are no new frequent itemsets
generated, i.e., F_k = ∅ (step 13).
Algorithm 6.1 Frequent itemset generation of the Apriori algorithm.
1: k = 1.
2: F_k = { i | i ∈ I ∧ σ({i}) ≥ N × minsup }. {Find all frequent 1-itemsets}
3: repeat
4:   k = k + 1.
5:   C_k = apriori-gen(F_{k−1}). {Generate candidate itemsets}
6:   for each transaction t ∈ T do
7:     C_t = subset(C_k, t). {Identify all candidates that belong to t}
8:     for each candidate itemset c ∈ C_t do
9:       σ(c) = σ(c) + 1. {Increment support count}
10:     end for
11:   end for
12:   F_k = { c | c ∈ C_k ∧ σ(c) ≥ N × minsup }. {Extract the frequent k-itemsets}
13: until F_k = ∅
14: Result = ⋃ F_k.
The frequent itemset generation part of the Apriori algorithm has two
important characteristics. First, it is a level-wise algorithm; i.e., it
traverses the itemset lattice one level at a time, from frequent 1-itemsets to
the maximum size of frequent itemsets. Second, it employs a generate-and-test
strategy for finding frequent itemsets. At each iteration, new candidate
itemsets are generated from the frequent itemsets found in the previous
iteration. The support for each candidate is then counted and tested against
the minsup threshold. The total number of iterations needed by the algorithm
is $k_{max} + 1$, where $k_{max}$ is the maximum size of the frequent itemsets.
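The level-wise, generate-and-test loop of Algorithm 6.1 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the reference implementation: `min_count` stands for N × minsup, candidate generation uses a simple stand-in (`naive_candidates`) rather than the apriori-gen procedure of Section 6.2.3, and the subset function is replaced by a direct containment test. All function names are hypothetical, and the order of the transactions is assumed.

```python
from itertools import combinations

def naive_candidates(prev_frequent, k):
    """Stand-in for apriori-gen: k-itemsets whose (k-1)-subsets are all frequent."""
    prev = set(prev_frequent)
    items = sorted({item for itemset in prev for item in itemset})
    return [
        c for c in combinations(items, k)
        if all(s in prev for s in combinations(c, k - 1))
    ]

def frequent_itemsets(transactions, min_count, generate=naive_candidates):
    """Level-wise search; min_count plays the role of N x minsup."""
    # Steps 1-2: one pass over the data to count single items.
    counts = {}
    for t in transactions:
        for item in t:
            counts[(item,)] = counts.get((item,), 0) + 1
    frequent = {c: n for c, n in counts.items() if n >= min_count}
    result, k = dict(frequent), 1
    while frequent:                                     # steps 3 and 13
        k += 1
        counts = {c: 0 for c in generate(frequent, k)}  # step 5
        for t in transactions:                          # steps 6-11
            for c in counts:
                if t.issuperset(c):
                    counts[c] += 1
        frequent = {c: n for c, n in counts.items() if n >= min_count}  # step 12
        result.update(frequent)
    return result

transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diapers", "Beer", "Eggs"},
    {"Milk", "Diapers", "Beer", "Cola"},
    {"Bread", "Milk", "Diapers", "Beer"},
    {"Bread", "Milk", "Diapers", "Cola"},
]
result = frequent_itemsets(transactions, min_count=3)
for itemset in sorted(result, key=lambda s: (len(s), s)):
    print(itemset, result[itemset])
```

Passing the candidate generator as a parameter keeps the loop independent of how candidates are produced, so any of the procedures discussed in Section 6.2.3 could be substituted.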
6.2.3 Candidate Generation and Pruning
The apriori-gen function shown in Step 5 of Algorithm 6.1 generates candidate
itemsets by performing the following two operations:
1. Candidate Generation. This operation generates new candidate
k-itemsets based on the frequent (k − 1)-itemsets found in the previous
iteration.
2. Candidate Pruning. This operation eliminates some of the candidate
k-itemsets using the support-based pruning strategy.
To illustrate the candidate pruning operation, consider a candidate k-itemset,
X = {i_1, i_2, . . . , i_k}. The algorithm must determine whether all of its proper
subsets, X − {i_j} (∀j = 1, 2, . . . , k), are frequent. If one of them is
infrequent, then X is immediately pruned. This approach can effectively reduce
the number of candidate itemsets considered during support counting. The
complexity of this operation is O(k) for each candidate k-itemset. However,
as will be shown later, we do not have to examine all k subsets of a given
candidate itemset. If m of the k subsets were used to generate a candidate,
we only need to check the remaining k − m subsets during candidate pruning.
In principle, there are many ways to generate candidate itemsets. The
following is a list of requirements for an effective candidate generation
procedure:
1. It should avoid generating too many unnecessary candidates. A candidate
itemset is unnecessary if at least one of its subsets is infrequent.
Such a candidate is guaranteed to be infrequent according to the
anti-monotone property of support.
2. It must ensure that the candidate set is complete, i.e., no frequent
itemsets are left out by the candidate generation procedure. To ensure
completeness, the set of candidate itemsets must subsume the set of all
frequent itemsets, i.e., ∀k : F_k ⊆ C_k.
3. It should not generate the same candidate itemset more than once. For
example, the candidate itemset {a, b, c, d} can be generated in many
ways: by merging {a, b, c} with {d}, {b, d} with {a, c}, {c} with {a, b, d},
etc. Generation of duplicate candidates leads to wasted computations
and thus should be avoided for efficiency reasons.
Next, we will briefly describe several candidate generation procedures,
including the one used by the apriori-gen function.
Brute-Force Method. The brute-force method considers every k-itemset as
a potential candidate and then applies the candidate pruning step to remove
any unnecessary candidates (see Figure 6.6). The number of candidate
itemsets generated at level k is equal to $\binom{d}{k}$, where d is the total
number of items. Although candidate generation is rather trivial, candidate
pruning becomes extremely expensive because a large number of itemsets must
be examined. Given that the amount of computation needed for each candidate
is O(k), the overall complexity of this method is
$O\left(\sum_{k=1}^{d} k \times \binom{d}{k}\right) = O\left(d \cdot 2^{d-1}\right)$.
F_{k−1} × F_1 Method. An alternative method for candidate generation is to
extend each frequent (k − 1)-itemset with other frequent items. Figure 6.7
illustrates how a frequent 2-itemset such as {Beer, Diapers} can be
augmented with a frequent item such as Bread to produce a candidate 3-itemset
{Beer, Diapers, Bread}. This method will produce $O(|F_{k-1}| \times |F_1|)$
candidate k-itemsets, where $|F_j|$ is the number of frequent j-itemsets. The
overall complexity of this step is $O\left(\sum_k k |F_{k-1}| |F_1|\right)$.
The procedure is complete because every frequent k-itemset is composed
of a frequent (k − 1)-itemset and a frequent 1-itemset. Therefore, all frequent
k-itemsets are part of the candidate k-itemsets generated by this procedure.
Figure 6.6. A brute-force method for generating candidate 3-itemsets. [Figure:
all 3-itemsets over the six items are generated as candidates and then passed
through candidate pruning.]
Figure 6.7. Generating and pruning candidate k-itemsets by merging a frequent
(k − 1)-itemset with a frequent item. Note that some of the candidates are
unnecessary because their subsets are infrequent.
This approach, however, does not prevent the same candidate itemset from
being generated more than once. For instance, {Bread, Diapers, Milk} can
be generated by merging {Bread, Diapers} with {Milk}, {Bread, Milk} with
{Diapers}, or {Diapers, Milk} with {Bread}. One way to avoid generating
duplicate candidates is by ensuring that the items in each frequent itemset are
kept sorted in their lexicographic order. Each frequent (k − 1)-itemset X is then
extended with frequent items that are lexicographically larger than the items in
X. For example, the itemset {Bread, Diapers} can be augmented with {Milk}
since Milk is lexicographically larger than Bread and Diapers. However, we
should not augment {Diapers, Milk} with {Bread} nor {Bread, Milk} with
{Diapers} because they violate the lexicographic ordering condition.
While this procedure is a substantial improvement over the brute-force
method, it can still produce a large number of unnecessary candidates. For
example, the candidate itemset obtained by merging {Beer, Diapers} with
{Milk} is unnecessary because one of its subsets, {Beer, Milk}, is infrequent.
There are several heuristics available to reduce the number of unnecessary
candidates. For example, note that, for every candidate k-itemset that survives
the pruning step, every item in the candidate must be contained in at least
k − 1 of the frequent (k − 1)-itemsets. Otherwise, the candidate is guaranteed
to be infrequent. For example, {Beer, Diapers, Milk} is a viable candidate
3-itemset only if every item in the candidate, including Beer, is contained in
at least two frequent 2-itemsets. Since there is only one frequent 2-itemset
containing Beer, all candidate itemsets involving Beer must be infrequent.
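The lexicographic extension rule of the F_{k−1} × F_1 method can be sketched as below. Itemsets are represented as sorted tuples, which is an assumption of this sketch, and `extend_candidates` is a hypothetical name. The input is the set of frequent 1- and 2-itemsets of the running example.

```python
def extend_candidates(prev_frequent, frequent_items):
    """Extend each sorted (k-1)-itemset with lexicographically larger frequent items."""
    candidates = []
    for itemset in sorted(prev_frequent):
        for item in sorted(frequent_items):
            # itemset[-1] is the largest item because each tuple is sorted.
            if item > itemset[-1]:
                candidates.append(itemset + (item,))
    return candidates

frequent_1 = ["Beer", "Bread", "Diapers", "Milk"]
frequent_2 = [
    ("Beer", "Diapers"),
    ("Bread", "Diapers"),
    ("Bread", "Milk"),
    ("Diapers", "Milk"),
]
print(extend_candidates(frequent_2, frequent_1))
# [('Beer', 'Diapers', 'Milk'), ('Bread', 'Diapers', 'Milk')]
```

Each candidate is produced exactly once, but {Beer, Diapers, Milk} is still generated even though its subset {Beer, Milk} is infrequent, so the pruning step remains necessary.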
F_{k−1} × F_{k−1} Method. The candidate generation procedure in the apriori-gen
function merges a pair of frequent (k − 1)-itemsets only if their first k − 2 items
are identical. Let A = {a_1, a_2, . . . , a_{k−1}} and B = {b_1, b_2, . . . , b_{k−1}} be a pair
of frequent (k − 1)-itemsets. A and B are merged if they satisfy the following
conditions:
$a_i = b_i$ (for i = 1, 2, . . . , k − 2) and $a_{k-1} \ne b_{k-1}$.
In Figure 6.8, the frequent itemsets {Bread, Diapers} and {Bread, Milk} are
merged to form a candidate 3-itemset {Bread, Diapers, Milk}. The algorithm
does not have to merge {Beer, Diapers} with {Diapers, Milk} because the
first item in both itemsets is different. Indeed, if {Beer, Diapers, Milk} is a
viable candidate, it would have been obtained by merging {Beer, Diapers}
with {Beer, Milk} instead. This example illustrates both the completeness of
the candidate generation procedure and the advantages of using lexicographic
ordering to prevent duplicate candidates. However, because each candidate is
obtained by merging a pair of frequent (k − 1)-itemsets, an additional candidate
pruning step is needed to ensure that the remaining k − 2 subsets of the
candidate are frequent.
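The merge-then-prune procedure described above can be sketched as two small functions. The names `merge_candidates` and `prune` are hypothetical, itemsets are assumed to be sorted tuples, and for simplicity the pruning step checks all (k − 1)-subsets rather than only the remaining k − 2. The second input, over items a to d, is a made-up example chosen to show a candidate being pruned.

```python
from itertools import combinations

def merge_candidates(prev_frequent):
    """Merge pairs of sorted (k-1)-itemsets whose first k-2 items are identical."""
    prev = sorted(prev_frequent)
    merged = []
    for i, a in enumerate(prev):
        for b in prev[i + 1:]:
            if a[:-1] == b[:-1]:
                # b sorts after a, so b[-1] > a[-1] and the result stays sorted.
                merged.append(a + (b[-1],))
    return merged

def prune(candidates, prev_frequent):
    """Drop candidates that have an infrequent (k-1)-subset."""
    prev = set(prev_frequent)
    return [
        c for c in candidates
        if all(s in prev for s in combinations(c, len(c) - 1))
    ]

f2 = [("Beer", "Diapers"), ("Bread", "Diapers"), ("Bread", "Milk"), ("Diapers", "Milk")]
print(prune(merge_candidates(f2), f2))
# [('Bread', 'Diapers', 'Milk')]

hypothetical_f2 = [("a", "b"), ("a", "c"), ("b", "d")]
print(merge_candidates(hypothetical_f2))                          # [('a', 'b', 'c')]
print(prune(merge_candidates(hypothetical_f2), hypothetical_f2))  # []
```

In the hypothetical case, {a, b, c} is produced by the merge but removed by pruning because {b, c} is not among the frequent 2-itemsets.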
Figure 6.8. Generating and pruning candidate k-itemsets by merging pairs of
frequent (k − 1)-itemsets.
6.2.4 Support Counting
Support counting is the process of determining the frequency of occurrence
for every candidate itemset that survives the candidate pruning step of the
apriori-gen function. Support counting is implemented in steps 6 through 11
of Algorithm 6.1. One approach for doing this is to compare each transaction
against every candidate itemset (see Figure 6.2) and to update the support
counts of candidates contained in the transaction. This approach is
computationally expensive, especially when the numbers of transactions and
candidate itemsets are large.
An alternative approach is to enumerate the itemsets contained in each
transaction and use them to update the support counts of their respective
candidate itemsets. To illustrate, consider a transaction t that contains five
items, {1, 2, 3, 5, 6}. There are $\binom{5}{3} = 10$ itemsets of size 3 contained
in this transaction. Some of the itemsets may correspond to the candidate
3-itemsets under investigation, in which case, their support counts are
incremented. Other subsets of t that do not correspond to any candidates can
be ignored.
Figure 6.9 shows a systematic way for enumerating the 3-itemsets contained
in t. Assuming that each itemset keeps its items in increasing lexicographic
order, an itemset can be enumerated by specifying the smallest item first,
followed by the larger items. For instance, given t = {1, 2, 3, 5, 6}, all the
3-itemsets contained in t must begin with item 1, 2, or 3. It is not possible to
construct a 3-itemset that begins with items 5 or 6 because there are only two
items in t whose labels are greater than or equal to 5. The number of ways to
specify the first item of a 3-itemset contained in t is illustrated by the Level
1 prefix structures depicted in Figure 6.9. For instance, 1 [2 3 5 6] represents
a 3-itemset that begins with item 1, followed by two more items chosen from
the set {2, 3, 5, 6}.
Figure 6.9. Enumerating subsets of three items from a transaction t. [Figure:
Level 1 fixes the first item, Level 2 fixes the second item, and Level 3 lists
the ten 3-itemsets 1 2 3, 1 2 5, 1 2 6, 1 3 5, 1 3 6, 1 5 6, 2 3 5, 2 3 6,
2 5 6, and 3 5 6.]
After fixing the first item, the prefix structures at Level 2 represent the
number of ways to select the second item. For example, 1 2 [3 5 6] corresponds
to itemsets that begin with prefix (1 2) and are followed by items 3, 5, or 6.
Finally, the prefix structures at Level 3 represent the complete set of 3-itemsets
contained in t. For example, the 3-itemsets that begin with prefix (1 2) are
{1, 2, 3}, {1, 2, 5}, and {1, 2, 6}, while those that begin with prefix (2 3) are
{2, 3, 5} and {2, 3, 6}.
The prefix structures shown in Figure 6.9 demonstrate how itemsets contained
in a transaction can be systematically enumerated, i.e., by specifying
their items one by one, from the leftmost item to the rightmost item. We
still have to determine whether each enumerated 3-itemset corresponds to an
existing candidate itemset. If it matches one of the candidates, then the
support count of the corresponding candidate is incremented. In the next
section, we illustrate how this matching operation can be performed efficiently
using a hash tree structure.
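The left-to-right enumeration can be written as a short recursive generator. The function name `k_subsets` is illustrative; the result is compared against `itertools.combinations`, which yields the same subsets in the same lexicographic order.

```python
from itertools import combinations

def k_subsets(transaction, k, prefix=()):
    """Yield the k-subsets of a sorted transaction, fixing items left to right."""
    if len(prefix) == k:
        yield prefix
        return
    needed = k - len(prefix)
    # Stop early enough that sufficient items remain to complete the itemset.
    for i in range(len(transaction) - needed + 1):
        yield from k_subsets(transaction[i + 1:], k, prefix + (transaction[i],))

t = (1, 2, 3, 5, 6)
subsets = list(k_subsets(t, 3))
print(len(subsets))   # 10
print(subsets[:3])    # [(1, 2, 3), (1, 2, 5), (1, 2, 6)]
assert subsets == list(combinations(t, 3))
```

The loop bound is what rules out items 5 and 6 as first items: at the top level only the first three positions of t are tried.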
Figure 6.10. Counting the support of itemsets using hash structure. [Figure:
the transactions with TID 1 to 5 are hashed into a hash tree whose leaf nodes
contain the candidate 2-itemsets {Beer, Bread}, {Beer, Diapers}, {Beer, Milk},
{Bread, Diapers}, {Bread, Milk}, and {Diapers, Milk}.]
Support Counting Using a Hash Tree
In the Apriori algorithm, candidate itemsets are partitioned into different
buckets and stored in a hash tree. During support counting, itemsets contained
in each transaction are also hashed into their appropriate buckets. That way,
instead of comparing each itemset in the transaction with every candidate
itemset, it is matched only against candidate itemsets that belong to the same
bucket, as shown in Figure 6.10.
Figure 6.11 shows an example of a hash tree structure. Each internal node
of the tree uses the following hash function, h(p) = p mod 3, to determine
which branch of the current node should be followed next. For example, items
1, 4, and 7 are hashed to the same branch (i.e., the leftmost branch) because
they have the same remainder after dividing the number by 3. All candidate
itemsets are stored at the leaf nodes of the hash tree. The hash tree shown in
Figure 6.11 contains 15 candidate 3-itemsets, distributed across 9 leaf nodes.
Consider a transaction, t = {1, 2, 3, 5, 6}. To update the support counts
of the candidate itemsets, the hash tree must be traversed in such a way
that all the leaf nodes containing candidate 3-itemsets belonging to t must be
visited at least once. Recall that the 3-itemsets contained in t must begin with
items 1, 2, or 3, as indicated by the Level 1 prefix structures shown in Figure
6.9. Therefore, at the root node of the hash tree, the items 1, 2, and 3 of the
transaction are hashed separately. Item 1 is hashed to the left child of the root
node, item 2 is hashed to the middle child, and item 3 is hashed to the right
child. At the next level of the tree, the transaction is hashed on the second
item listed in the Level 2 structures shown in Figure 6.9. For example, after
hashing on item 1 at the root node, items 2, 3, and 5 of the transaction are
hashed. Items 2 and 5 are hashed to the middle child, while item 3 is hashed
to the right child, as shown in Figure 6.12. This process continues until the
leaf nodes of the hash tree are reached. The candidate itemsets stored at the
visited leaf nodes are compared against the transaction. If a candidate is a
subset of the transaction, its support count is incremented. In this example, 5
out of the 9 leaf nodes are visited and 9 out of the 15 itemsets are compared
against the transaction.
Figure 6.11. Hashing a transaction at the root node of a hash tree. [Figure:
the hash function sends items 1, 4, 7 to the left branch, items 2, 5, 8 to the
middle branch, and items 3, 6, 9 to the right branch; the transaction
1 2 3 5 6 is split at the root into 1 + [2 3 5 6], 2 + [3 5 6], and 3 + [5 6].]
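A hash tree with h(p) = p mod 3 can be sketched as follows. The 15 candidate 3-itemsets are those shown in Figure 6.11. The split threshold `MAX_LEAF_SIZE = 3` is an assumption of this sketch (the text does not state it), as are the class and function names and the choice to remember visited leaves so that each leaf is compared only once.

```python
K = 3               # size of the candidate itemsets
MAX_LEAF_SIZE = 3   # assumed split threshold; not stated in the text

def h(item):
    return item % 3

class Node:
    def __init__(self):
        self.children = None   # dict branch -> Node once the node is internal
        self.itemsets = []     # candidates stored while the node is a leaf

def insert(node, itemset, depth=0):
    if node.children is not None:            # internal node: follow the hash
        child = node.children.setdefault(h(itemset[depth]), Node())
        insert(child, itemset, depth + 1)
        return
    node.itemsets.append(itemset)
    if len(node.itemsets) > MAX_LEAF_SIZE and depth < K:
        # Overflowing leaf: turn it into an internal node and redistribute.
        pending, node.itemsets, node.children = node.itemsets, [], {}
        for s in pending:
            insert(node, s, depth)

def leaves(node):
    if node.children is None:
        return [node]
    return [leaf for child in node.children.values() for leaf in leaves(child)]

def count_supports(root, t, counts):
    """Visit the leaves reachable from sorted transaction t and update counts."""
    visited = []
    def walk(node, start, depth):
        if node.children is None:
            if not any(node is v for v in visited):   # compare each leaf once
                visited.append(node)
                for c in node.itemsets:
                    if set(c) <= set(t):
                        counts[c] = counts.get(c, 0) + 1
            return
        # Hash every item that can still occupy position `depth` of a K-subset.
        for i in range(start, len(t) - (K - depth) + 1):
            child = node.children.get(h(t[i]))
            if child is not None:
                walk(child, i + 1, depth + 1)
    walk(root, 0, 0)
    return visited

candidates = [
    (1, 4, 5), (1, 2, 4), (4, 5, 7), (1, 2, 5), (4, 5, 8),
    (1, 5, 9), (1, 3, 6), (2, 3, 4), (5, 6, 7), (3, 4, 5),
    (3, 5, 6), (3, 5, 7), (6, 8, 9), (3, 6, 7), (3, 6, 8),
]
root = Node()
for c in candidates:
    insert(root, c)

counts = {}
visited = count_supports(root, (1, 2, 3, 5, 6), counts)
print(len(leaves(root)), len(visited), sum(len(v.itemsets) for v in visited))
print(counts)
```

Under the assumed threshold the tree has 9 leaves, and the traversal for t = {1, 2, 3, 5, 6} visits 5 of them and compares 9 candidates, in line with the figures quoted in the text.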
6.2.5 Computational Complexity
The computational complexity of the Apriori algorithm can be affected by the
following factors.
Support Threshold. Lowering the support threshold often results in more
itemsets being declared as frequent. This has an adverse effect on the
computational complexity of the algorithm because more candidate itemsets must
be generated and counted, as shown in Figure 6.13. The maximum size of
frequent itemsets also tends to increase with lower support thresholds. As the
maximum size of the frequent itemsets increases, the algorithm will need to
make more passes over the data set.
Figure 6.12. Subset operation on the leftmost subtree of the root of a candidate
hash tree. [Figure: below the root branch for item 1, the transaction is split
into 1 2 + [3 5 6], 1 3 + [5 6], and 1 5 + [6].]
Number of Items (Dimensionality). As the number of items increases,
more space will be needed to store the support counts of items. If the number of
frequent items also grows with the dimensionality of the data, the computation
and I/O costs will increase because of the larger number of candidate itemsets
generated by the algorithm.
Number of Transactions. Since the Apriori algorithm makes repeated
passes over the data set, its run time increases with a larger number of
transactions.
Average Transaction Width. For dense data sets, the average transaction
width can be very large. This affects the complexity of the Apriori algorithm in
two ways. First, the maximum size of frequent itemsets tends to increase as the
average transaction width increases. As a result, more candidate itemsets must
be examined during candidate generation and support counting, as illustrated
in Figure 6.14. Second, as the transaction width increases, more itemsets
are contained in the transaction. This will increase the number of hash tree
traversals performed during support counting.
Figure 6.13. Effect of support threshold on the number of candidate and frequent
itemsets. [Figure: panel (a) plots the number of candidate itemsets (×10^5) and
panel (b) the number of frequent itemsets (×10^5) against the size of itemset,
for Support = 0.1%, 0.2%, and 0.5%.]
Figure 6.14. Effect of average transaction width on the number of candidate and
frequent itemsets. [Figure: panel (a) plots the number of candidate itemsets
(×10^5) and panel (b) the number of frequent itemsets (×10^5) against the size
of itemset, for Width = 5, 10, and 15.]
A detailed analysis of the time complexity for the Apriori algorithm is
presented next.
Generation of frequent 1-itemsets. For each transaction, we need to update
the support count for every item present in the transaction. Assuming
that w is the average transaction width, this operation requires O(Nw) time,
where N is the total number of transactions.
Candidate generation. To generate candidate k-itemsets, pairs of frequent
(k − 1)-itemsets are merged to determine whether they have at least k − 2
items in common. Each merging operation requires at most k − 2 equality
comparisons. In the best-case scenario, every merging step produces a viable
candidate k-itemset. In the worst-case scenario, the algorithm must merge
every pair of frequent (k − 1)-itemsets found in the previous iteration.
Therefore, the overall cost of merging frequent itemsets is
$\sum_{k=2}^{w} (k-2)|C_k| < \text{Cost of merging} < \sum_{k=2}^{w} (k-2)|F_{k-1}|^2$.
A hash tree is also constructed during candidate generation to store the
candidate itemsets. Because the maximum depth of the tree is k, the cost for
populating the hash tree with candidate itemsets is
$O\left(\sum_{k=2}^{w} k|C_k|\right)$. During candidate pruning, we need to
verify that the k − 2 subsets of every candidate k-itemset are frequent. Since
the cost for looking up a candidate in a hash tree is O(k), the candidate
pruning step requires $O\left(\sum_{k=2}^{w} k(k-2)|C_k|\right)$ time.
Support counting. Each transaction of length |t| produces $\binom{|t|}{k}$
itemsets of size k. This is also the effective number of hash tree traversals
performed for each transaction. The cost for support counting is
$O\left(N \sum_k \binom{w}{k} \alpha_k\right)$, where w is the maximum
transaction width and $\alpha_k$ is the cost for updating the support
count of a candidate k-itemset in the hash tree.
6.3 Rule Generation
This section describes how to extract association rules efficiently from a given
frequent itemset. Each frequent k-itemset, Y, can produce up to $2^k - 2$
association rules, ignoring rules that have empty antecedents or consequents
(∅ −→ Y or Y −→ ∅). An association rule can be extracted by partitioning the
itemset Y into two nonempty subsets, X and Y − X, such that X −→ Y − X satisfies
the confidence threshold. Note that all such rules must have already met the
support threshold because they are generated from a frequent itemset.
Example 6.2. Let X = {1, 2, 3} be a frequent itemset. There are six candidate
association rules that can be generated from X: {1, 2} −→ {3}, {1, 3} −→
{2}, {2, 3} −→ {1}, {1} −→ {2, 3}, {2} −→ {1, 3}, and {3} −→ {1, 2}. Because
the support of each rule is identical to the support for X, all six rules
satisfy the support threshold.
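The six rules of Example 6.2 can be enumerated by choosing every non-empty proper subset of the itemset as the antecedent. The function name `candidate_rules` is illustrative.

```python
from itertools import combinations

def candidate_rules(itemset):
    """All rules X -> Y - X with X a non-empty proper subset of the itemset Y."""
    items = sorted(itemset)
    rules = []
    for size in range(len(items) - 1, 0, -1):
        for antecedent in combinations(items, size):
            consequent = tuple(i for i in items if i not in antecedent)
            rules.append((antecedent, consequent))
    return rules

rules = candidate_rules({1, 2, 3})
for antecedent, consequent in rules:
    print(antecedent, "->", consequent)
print(len(rules))  # 6, i.e. 2**3 - 2
```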
Computing the confidence of an association rule does not require additional
scans of the transaction data set. Consider the rule {1, 2} −→ {3}, which is
generated from the frequent itemset X = {1, 2, 3}. The confidence for this rule
is σ({1, 2, 3})/σ({1, 2}). Because {1, 2, 3} is frequent, the anti-monotone
property of support ensures that {1, 2} must be frequent, too. Since the support
counts for both itemsets were already found during frequent itemset
generation, there is no need to read the entire data set again.
6.3.1 Confidence-Based Pruning
Unlike the support measure, confidence does not have any monotone property.
For example, the confidence for X −→ Y can be larger, smaller, or equal to the
confidence for another rule X̃ −→ Ỹ , where X̃ ⊆ X and Ỹ ⊆ Y (see Exercise
3 on page 405). Nevertheless, if we compare rules generated from the same
frequent itemset Y , the following theorem holds for the confidence measure.
Theorem 6.2. If a rule X −→ Y −X does not satisfy the confidence threshold,
then any rule X′ −→ Y − X′, where X′ is a subset of X, must not satisfy the
confidence threshold as well.
To prove this theorem, consider the following two rules: X′ −→ Y − X′ and
X −→ Y − X, where X′ ⊂ X. The confidences of the rules are σ(Y)/σ(X′) and
σ(Y)/σ(X), respectively. Since X′ is a subset of X, σ(X′) ≥ σ(X). Therefore,
the former rule cannot have a higher confidence than the latter rule.
6.3.2 Rule Generation in Apriori Algorithm
The Apriori algorithm uses a level-wise approach for generating association
rules, where each level corresponds to the number of items that belong to the
rule consequent. Initially, all the high-confidence rules that have only one item
in the rule consequent are extracted. These rules are then used to generate
new candidate rules. For example, if {acd} −→ {b} and {abd} −→ {c} are
high-confidence rules, then the candidate rule {ad} −→ {bc} is generated by
merging the consequents of both rules. Figure 6.15 shows a lattice structure
for the association rules generated from the frequent itemset {a, b, c, d}. If any
node in the lattice has low confidence, then according to Theorem 6.2, the
entire subgraph spanned by the node can be pruned immediately. Suppose
the confidence for {bcd} −→ {a} is low. All the rules containing item a in
their consequents, including {cd} −→ {ab}, {bd} −→ {ac}, {bc} −→ {ad}, and
{d} −→ {abc}, can be discarded.
Figure 6.15. Pruning of association rules using the confidence measure.
[Figure: the lattice of rules generated from {a, b, c, d}, with a low-confidence
rule and the rules pruned below it marked.]
Pseudocode for the rule generation step is shown in Algorithms 6.2 and
6.3. Note the similarity between the ap-genrules procedure given in
Algorithm 6.3 and the frequent itemset generation procedure given in Algorithm
6.1. The only difference is that, in rule generation, we do not have to make
additional passes over the data set to compute the confidence of the candidate
rules. Instead, we determine the confidence of each rule by using the support
counts computed during frequent itemset generation.
Algorithm 6.2 Rule generation of the Apriori algorithm.
for each frequent k-itemset f_k, k ≥ 2 do
    H_1 = {i | i ∈ f_k}    {1-item consequents of the rule.}
    call ap-genrules(f_k, H_1)
end for
Algorithm 6.3 Procedure ap-genrules(f_k, H_m).
k = |f_k|    {size of frequent itemset.}
m = |h_m| for any h_m ∈ H_m    {size of rule consequent.}
if k > m + 1 then
    H_{m+1} = apriori-gen(H_m)
    for each h_{m+1} ∈ H_{m+1} do
        conf = σ(f_k)/σ(f_k − h_{m+1})
        if conf ≥ minconf then
            output the rule (f_k − h_{m+1}) −→ h_{m+1}
        else
            delete h_{m+1} from H_{m+1}
        end if
    end for
    call ap-genrules(f_k, H_{m+1})
end if
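The level-wise procedure can be sketched in Python as follows. This is a minimal illustration, not a transcription of Algorithms 6.2 and 6.3: the function names `apriori_gen` and `generate_rules` and the support counts are our own, and the sketch tests the 1-item consequents explicitly before merging them, which is our reading of the intent of the pseudocode.

```python
from itertools import combinations

def apriori_gen(consequents):
    """Merge m-item consequents into candidate (m+1)-item consequents,
    keeping a candidate only if all of its m-item subsets are present."""
    consequents = set(consequents)
    candidates = set()
    for h1 in consequents:
        for h2 in consequents:
            union = h1 | h2
            if len(union) == len(h1) + 1 and all(
                frozenset(s) in consequents
                for s in combinations(union, len(h1))
            ):
                candidates.add(union)
    return candidates

def generate_rules(itemset, support, minconf):
    """Level-wise rule generation for one frequent itemset.

    `support` maps frozensets to support counts found during frequent
    itemset generation, so no pass over the data is needed here."""
    itemset = frozenset(itemset)
    rules = []
    level = {frozenset([i]) for i in itemset}  # 1-item consequents
    # Stop once the consequent would swallow the whole itemset.
    while level and len(next(iter(level))) < len(itemset):
        kept = set()
        for h in level:
            conf = support[itemset] / support[itemset - h]
            if conf >= minconf:
                rules.append((itemset - h, h, conf))
                kept.add(h)  # only surviving consequents seed the next level
        level = apriori_gen(kept)
    return rules

# Illustrative support counts for {a, b, c} and its subsets (assumed values).
support = {frozenset(k): v for k, v in
           {"a": 8, "b": 7, "c": 6, "ab": 5, "ac": 4, "bc": 5, "abc": 3}.items()}
rules = generate_rules("abc", support, minconf=0.7)
```

With these counts only {a, c} −→ {b} reaches the threshold, so no 2-item consequent is ever generated, which is the pruning promised by Theorem 6.2.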
6.3.3 An Example: Congressional Voting Records
This section demonstrates the results of applying association analysis to the
voting records of members of the United States House of Representatives. The
data is obtained from the 1984 Congressional Voting Records Database, which
is available at the UCI machine learning data repository. Each transaction
contains information about the party affiliation for a representative along with
his or her voting record on 16 key issues. There are 435 transactions and 34
items in the data set. The set of items is listed in Table 6.3.
The Apriori algorithm is then applied to the data set with minsup = 30%
and minconf = 90%. Some of the high-confidence rules extracted by the
algorithm are shown in Table 6.4. The first two rules suggest that most of the
members who voted yes for aid to El Salvador and no for budget resolution and
MX missile are Republicans, while those who voted no for aid to El Salvador
and yes for budget resolution and MX missile are Democrats. These
high-confidence rules show the key issues that divide members from both political
parties. If minconf is reduced, we may find rules that contain issues that cut
across the party lines. For example, with minconf = 40%, the rules suggest
that corporation cutbacks is an issue that receives an almost equal number of
votes from both parties: 52.3% of the members who voted no are Republicans,
while the remaining 47.7% of them who voted no are Democrats.
Table 6.3. List of binary attributes from the 1984 United States Congressional Voting Records. Source:
The UCI machine learning repository.
1. Republican                            18. aid to Nicaragua = no
2. Democrat                              19. MX-missile = yes
3. handicapped-infants = yes             20. MX-missile = no
4. handicapped-infants = no              21. immigration = yes
5. water project cost sharing = yes      22. immigration = no
6. water project cost sharing = no       23. synfuel corporation cutback = yes
7. budget-resolution = yes               24. synfuel corporation cutback = no
8. budget-resolution = no                25. education spending = yes
9. physician fee freeze = yes            26. education spending = no
10. physician fee freeze = no            27. right-to-sue = yes
11. aid to El Salvador = yes             28. right-to-sue = no
12. aid to El Salvador = no              29. crime = yes
13. religious groups in schools = yes    30. crime = no
14. religious groups in schools = no     31. duty-free-exports = yes
15. anti-satellite test ban = yes        32. duty-free-exports = no
16. anti-satellite test ban = no         33. export administration act = yes
17. aid to Nicaragua = yes               34. export administration act = no
Table 6.4. Association rules extracted from the 1984 United States Congressional Voting Records.
Association Rule                                                                           Confidence
{budget resolution = no, MX-missile = no, aid to El Salvador = yes} −→ {Republican}        91.0%
{budget resolution = yes, MX-missile = yes, aid to El Salvador = no} −→ {Democrat}         97.5%
{crime = yes, right-to-sue = yes, physician fee freeze = yes} −→ {Republican}              93.5%
{crime = no, right-to-sue = no, physician fee freeze = no} −→ {Democrat}                   100%
6.4 Compact Representation of Frequent Itemsets
In practice, the number of frequent itemsets produced from a transaction data
set can be very large. It is useful to identify a small representative set of
itemsets from which all other frequent itemsets can be derived. Two such
representations are presented in this section in the form of maximal and closed
frequent itemsets.
Figure 6.16. Maximal frequent itemset. (Itemset lattice over items a through e, showing the frequent itemset border and the maximal frequent itemsets.)
6.4.1 Maximal Frequent Itemsets
Definition 6.3 (Maximal Frequent Itemset). A maximal frequent
itemset is defined as a frequent itemset for which none of its immediate supersets
is frequent.
To illustrate this concept, consider the itemset lattice shown in Figure
6.16. The itemsets in the lattice are divided into two groups: those that are
frequent and those that are infrequent. A frequent itemset border, which is
represented by a dashed line, is also illustrated in the diagram. Every itemset
located above the border is frequent, while those located below the border (the
shaded nodes) are infrequent. Among the itemsets residing near the border,
{a, d}, {a, c, e}, and {b, c, d, e} are considered to be maximal frequent itemsets
because their immediate supersets are infrequent. An itemset such as {a, d}
is maximal frequent because all of its immediate supersets, {a, b, d}, {a, c, d},
and {a, d, e}, are infrequent. In contrast, {a, c} is non-maximal because one
of its immediate supersets, {a, c, e}, is frequent.
Maximal frequent itemsets effectively provide a compact representation of
frequent itemsets. In other words, they form the smallest set of itemsets from
which all frequent itemsets can be derived. For example, the frequent itemsets
shown in Figure 6.16 can be divided into two groups:
• Frequent itemsets that begin with item a and that may contain items c,
d, or e. This group includes itemsets such as {a}, {a, c}, {a, d}, {a, e},
and {a, c, e}.
• Frequent itemsets that begin with items b, c, d, or e. This group includes
itemsets such as {b}, {b, c}, {c, d}, {b, c, d, e}, etc.
Frequent itemsets that belong in the first group are subsets of either {a, c, e}
or {a, d}, while those that belong in the second group are subsets of {b, c, d, e}.
Hence, the maximal frequent itemsets {a, c, e}, {a, d}, and {b, c, d, e} provide
a compact representation of the frequent itemsets shown in Figure 6.16.
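To make the idea of deriving frequent itemsets from the maximal ones concrete, the sketch below enumerates every non-empty subset of the three maximal frequent itemsets named above. The function name `frequent_from_maximal` is ours.

```python
from itertools import combinations

def frequent_from_maximal(maximal_itemsets):
    """Every non-empty subset of a maximal frequent itemset is frequent."""
    frequent = set()
    for m in maximal_itemsets:
        items = sorted(m)
        for size in range(1, len(items) + 1):
            for subset in combinations(items, size):
                frequent.add(frozenset(subset))
    return frequent

maximal = [{"a", "d"}, {"a", "c", "e"}, {"b", "c", "d", "e"}]
frequent = frequent_from_maximal(maximal)
print(len(frequent))
```

The result contains the itemsets but not their support counts, which is the limitation of this representation.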
Maximal frequent itemsets provide a valuable representation for data sets
that can produce very long, frequent itemsets, as there are exponentially many
frequent itemsets in such data. Nevertheless, this approach is practical only
if an efficient algorithm exists to explicitly find the maximal frequent itemsets
without having to enumerate all their subsets. We briefly describe one such
approach in Section 6.5.
Despite providing a compact representation, maximal frequent itemsets do
not contain the support information of their subsets. For example, the supports
of the maximal frequent itemsets {a, c, e}, {a, d}, and {b, c, d, e} do not provide
any hint about the supports of their subsets. An additional pass over the data
set is therefore needed to determine the support counts of the non-maximal
frequent itemsets. In some cases, it might be desirable to have a minimal
representation of frequent itemsets that preserves the support information.
We illustrate such a representation in the next section.
6.4.2 Closed Frequent Itemsets
Closed itemsets provide a minimal representation of itemsets without losing
their support information. A formal definition of a closed itemset is presented
below.
Definition 6.4 (Closed Itemset). An itemset X is closed if none of its
immediate supersets has exactly the same support count as X.
Put another way, X is not closed if at least one of its immediate supersets
has the same support count as X. Examples of closed itemsets are shown in
Figure 6.17. To better illustrate the support count of each itemset, we have
associated each node (itemset) in the lattice with a list of its corresponding
transaction IDs. For example, since the node {b, c} is associated with
transaction IDs 1, 2, and 3, its support count is equal to three. From the transactions
given in this diagram, notice that every transaction that contains b also
contains c. Consequently, the support for {b} is identical to that of {b, c}, and {b} should
not be considered a closed itemset. Similarly, since c occurs in every
transaction that contains both a and d, the itemset {a, d} is not closed. On the other
hand, {b, c} is a closed itemset because it does not have the same support
count as any of its supersets.
TID   Items
1     abc
2     abcd
3     bce
4     acde
5     de
Figure 6.17. An example of the closed frequent itemsets (with minimum support equal to 40%).
Definition 6.5 (Closed Frequent Itemset). An itemset is a closed
frequent itemset if it is closed and its support is greater than or equal to minsup.
In the previous example, assuming that the support threshold is 40%, {b,c}
is a closed frequent itemset because its support is 60%. The rest of the closed
frequent itemsets are indicated by the shaded nodes.
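The definitions can be checked by brute force on the five transactions of Figure 6.17. The sketch below is for illustration only (it enumerates all itemsets, which is not how practical algorithms work); the helper names `support_count` and `is_closed` are ours.

```python
from itertools import combinations

# The five transactions of Figure 6.17 (TIDs 1 through 5).
transactions = [set("abc"), set("abcd"), set("bce"), set("acde"), set("de")]
items = sorted(set().union(*transactions))

def support_count(itemset):
    """Number of transactions that contain every item of the itemset."""
    return sum(1 for t in transactions if itemset <= t)

def is_closed(itemset):
    """Closed: no immediate superset has the same support count."""
    count = support_count(itemset)
    return all(
        support_count(itemset | {i}) != count
        for i in items if i not in itemset
    )

minsup_count = 2  # 40% of five transactions
closed_frequent = [
    frozenset(c)
    for size in range(1, len(items) + 1)
    for c in combinations(items, size)
    if support_count(frozenset(c)) >= minsup_count and is_closed(frozenset(c))
]
print(sorted("".join(sorted(s)) for s in closed_frequent))
```

Here {b} and {a, d} are rejected because {b, c} and {a, c, d}, respectively, have the same support count.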
Algorithms are available to explicitly extract closed frequent itemsets from
a given data set. Interested readers may refer to the bibliographic notes at the
end of this chapter for further discussions of these algorithms. We can use the
closed frequent itemsets to determine the support counts for the non-closed
frequent itemsets. For example, consider the frequent itemset {a, d} shown
in Figure 6.17. Because the itemset is not closed, its support count must be
identical to one of its immediate supersets. The key is to determine which
superset (among {a, b, d}, {a, c, d}, or {a, d, e}) has exactly the same support
count as {a, d}. The Apriori principle states that any transaction that contains
the superset of {a, d} must also contain {a, d}. However, any transaction that
contains {a, d} does not have to contain the supersets of {a, d}. For this
reason, the support for {a, d} must be equal to the largest support among its
supersets. Since {a, c, d} has a larger support than both {a, b, d} and {a, d, e},
the support for {a, d} must be identical to the support for {a, c, d}. Using this
methodology, an algorithm can be developed to compute the support for the
non-closed frequent itemsets. The pseudocode for this algorithm is shown in
Algorithm 6.4. The algorithm proceeds in a specific-to-general fashion, i.e.,
from the largest to the smallest frequent itemsets. This is because, in order
to find the support for a non-closed frequent itemset, the support for all of its
supersets must be known.
Algorithm 6.4 Support counting using closed frequent itemsets.
Let C denote the set of closed frequent itemsets
Let k_max denote the maximum size of closed frequent itemsets
F_{k_max} = {f | f ∈ C, |f| = k_max}    {Find all frequent itemsets of size k_max.}
for k = k_max − 1 downto 1 do
    F_k = {f | f ⊂ F_{k+1}, |f| = k}    {Find all frequent itemsets of size k.}
    for each f ∈ F_k do
        if f ∉ C then
            f.support = max{f′.support | f′ ∈ F_{k+1}, f ⊂ f′}
        end if
    end for
end for
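A Python sketch of this specific-to-general procedure is given below, applied to the closed frequent itemsets of the Figure 6.17 data. The function name `supports_from_closed` is ours. One point is an interpretation rather than something the pseudocode states: at each level the sketch also adds the closed itemsets of size k, since a closed itemset such as {c, e} is not a subset of any larger frequent itemset and would otherwise never be visited.

```python
from itertools import combinations

def supports_from_closed(closed):
    """Recover the support counts of all frequent itemsets.

    `closed` maps each closed frequent itemset (a frozenset) to its
    support count. Works from the largest itemsets down to the smallest."""
    support = dict(closed)
    k_max = max(len(f) for f in closed)
    level = {f for f in closed if len(f) == k_max}
    for k in range(k_max - 1, 0, -1):
        # k-item subsets of the frequent (k+1)-itemsets ...
        next_level = {frozenset(s) for f in level for s in combinations(f, k)}
        # ... plus the closed itemsets of size k (an addition in this sketch).
        next_level |= {f for f in closed if len(f) == k}
        for f in next_level:
            if f not in closed:
                # A non-closed itemset inherits the largest superset support.
                support[f] = max(support[g] for g in level if f < g)
        level = next_level
    return support

# Closed frequent itemsets of the Figure 6.17 data (minimum support count 2).
closed = {frozenset(k): v for k, v in {
    "c": 4, "d": 3, "e": 3, "ac": 3, "bc": 3,
    "ce": 2, "de": 2, "abc": 2, "acd": 2}.items()}
support = supports_from_closed(closed)
print(support[frozenset("ad")], support[frozenset("a")])
```

For {a, d} the only frequent immediate superset is {a, c, d}, so its support count is copied from there, as argued in the text.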
To illustrate the advantage of using closed frequent itemsets, consider the
data set shown in Table 6.5, which contains ten transactions and fifteen items.
The items can be divided into three groups: (1) Group A, which contains
items a1 through a5; (2) Group B, which contains items b1 through b5; and
(3) Group C, which contains items c1 through c5. Note that items within each
group are perfectly associated with each other and they do not appear with
items from another group. Assuming the support threshold is 20%, the total
number of frequent itemsets is 3 × (2^5 − 1) = 93. However, there are only three
closed frequent itemsets in the data: {a1, a2, a3, a4, a5}, {b1, b2, b3, b4, b5}, and
{c1, c2, c3, c4, c5}. It is often sufficient to present only the closed frequent
itemsets to the analysts instead of the entire set of frequent itemsets.
Table 6.5. A transaction data set for mining closed itemsets.
TID a1 a2 a3 a4 a5 b1 b2 b3 b4 b5 c1 c2 c3 c4 c5
1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0
2 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0
3 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0
4 0 0 0 0 0 1 1 1 1 1 0 0 0 0 0
5 0 0 0 0 0 1 1 1 1 1 0 0 0 0 0
6 0 0 0 0 0 1 1 1 1 1 0 0 0 0 0
7 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1
8 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1
9 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1
10 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1
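The counts quoted for Table 6.5 can be reproduced with a short brute-force script. This is only a check of the arithmetic under the 20% threshold; the variable and function names are ours.

```python
from itertools import combinations

# Rebuild Table 6.5: TIDs 1-3 contain a1..a5, 4-6 contain b1..b5, 7-10 contain c1..c5.
groups = ["a", "b", "c"]
rows = [0] * 3 + [1] * 3 + [2] * 4  # group index of each of the ten transactions
transactions = [
    frozenset(f"{groups[g]}{i}" for i in range(1, 6)) for g in rows
]
items = sorted(set().union(*transactions))
minsup_count = 2  # 20% of ten transactions

def support_count(itemset):
    return sum(1 for t in transactions if itemset <= t)

frequent = {}
for size in range(1, 6):  # no transaction has more than five items
    for c in combinations(items, size):
        count = support_count(frozenset(c))
        if count >= minsup_count:
            frequent[frozenset(c)] = count

# Closed: no immediate superset has the same support count.
closed = [
    f for f, count in frequent.items()
    if all(support_count(f | {i}) != count for i in items if i not in f)
]
print(len(frequent), len(closed))  # prints: 93 3
```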
Figure 6.18. Relationships among frequent, maximal frequent, and closed frequent itemsets. (Nested sets: maximal frequent itemsets lie within closed frequent itemsets, which lie within frequent itemsets.)
Closed frequent itemsets are useful for removing some of the redundant
association rules. An association rule X −→ Y is redundant if there exists
another rule X′ −→ Y ′, where X is a subset of X′ and Y is a subset of Y ′, such
that the support and confidence for both rules are identical. In the example
shown in Figure 6.17, {b} is not a closed frequent itemset while {b, c} is closed.
The association rule {b} −→ {d, e} is therefore redundant because it has the
same support and confidence as {b, c} −→ {d, e}. Such redundant rules are
not generated if closed frequent itemsets are used for rule generation.
Finally, note that all maximal frequent itemsets are closed because none
of the maximal frequent itemsets can have the same support count as their
immediate supersets. The relationships among frequent, maximal frequent,
and closed frequent itemsets are shown in Figure 6.18.
6.5 Alternative Methods for Generating Frequent Itemsets
Apriori is one of the earliest algorithms to have successfully addressed the
combinatorial explosion of frequent itemset generation. It achieves this by
applying the Apriori principle to prune the exponential search space. Despite its
significant performance improvement, the algorithm still incurs considerable
I/O overhead since it requires making several passes over the transaction data
set. In addition, as noted in Section 6.2.5, the performance of the Apriori
algorithm may degrade significantly for dense data sets because of the
increasing width of transactions. Several alternative methods have been developed
to overcome these limitations and improve upon the efficiency of the Apriori
algorithm. The following is a high-level description of these methods.
Traversal of Itemset Lattice. A search for frequent itemsets can be
conceptually viewed as a traversal on the itemset lattice shown in Figure 6.1.
The search strategy employed by an algorithm dictates how the lattice
structure is traversed during the frequent itemset generation process. Some search
strategies are better than others, depending on the configuration of frequent
itemsets in the lattice. An overview of these strategies is presented next.
• General-to-Specific versus Specific-to-General: The Apriori
algorithm uses a general-to-specific search strategy, where pairs of frequent
(k−1)-itemsets are merged to obtain candidate k-itemsets. This
general-to-specific search strategy is effective, provided the maximum length of
a frequent itemset is not too long. The configuration of frequent
itemsets that works best with this strategy is shown in Figure 6.19(a), where
the darker nodes represent infrequent itemsets. Alternatively, a
specific-to-general search strategy looks for more specific frequent itemsets first,
before finding the more general frequent itemsets. This strategy is
useful to discover maximal frequent itemsets in dense transactions, where
the frequent itemset border is located near the bottom of the lattice,
as shown in Figure 6.19(b). The Apriori principle can be applied to
prune all subsets of maximal frequent itemsets. Specifically, if a
candidate k-itemset is maximal frequent, we do not have to examine any of its
subsets of size k − 1. However, if the candidate k-itemset is infrequent,
we need to check all of its subsets of size k − 1 in the next iteration. Another
approach is to combine both general-to-specific and specific-to-general
search strategies. This bidirectional approach requires more space to
store the candidate itemsets, but it can help to rapidly identify the
frequent itemset border, given the configuration shown in Figure 6.19(c).
Figure 6.19. General-to-specific, specific-to-general, and bidirectional search. Panels: (a) General-to-specific, (b) Specific-to-general, (c) Bidirectional.
• Equivalence Classes: Another way to envision the traversal is to first
partition the lattice into disjoint groups of nodes (or equivalence classes).
A frequent itemset generation algorithm searches for frequent itemsets
within a particular equivalence class first before moving to another
equivalence class. As an example, the level-wise strategy used in the Apriori
algorithm can be considered to be partitioning the lattice on the basis
of itemset sizes; i.e., the algorithm discovers all frequent 1-itemsets first
before proceeding to larger-sized itemsets. Equivalence classes can also
be defined according to the prefix or suffix labels of an itemset. In this
case, two itemsets belong to the same equivalence class if they share
a common prefix or suffix of length k. In the prefix-based approach,
the algorithm can search for frequent itemsets starting with the prefix
a before looking for those starting with prefixes b, c, and so on. Both
prefix-based and suffix-based equivalence classes can be demonstrated
using the tree-like structure shown in Figure 6.20.
• Breadth-First versus Depth-First: The Apriori algorithm traverses
the lattice in a breadth-first manner, as shown in Figure 6.21(a). It first
discovers all the frequent 1-itemsets, followed by the frequent 2-itemsets,
and so on, until no new frequent itemsets are generated. The itemset
lattice can also be traversed in a depth-first manner, as shown in Figures
6.21(b) and 6.22. The algorithm can start from, say, node a in Figure
6.22, and count its support to determine whether it is frequent. If so, the
algorithm progressively expands the next level of nodes, i.e., ab, abc, and
so on, until an infrequent node is reached, say, abcd. It then backtracks
to another branch, say, abce, and continues the search from there.
Figure 6.20. Equivalence classes based on the prefix and suffix labels of itemsets. Panels: (a) Prefix tree, (b) Suffix tree.
Figure 6.21. Breadth-first and depth-first traversals. Panels: (a) Breadth first, (b) Depth first.
The depth-first approach is often used by algorithms designed to find
maximal frequent itemsets. This approach allows the frequent itemset
border to be detected more quickly than using a breadth-first approach.
Once a maximal frequent itemset is found, substantial pruning can be
performed on its subsets. For example, if the node bcde shown in Figure
6.22 is maximal frequent, then the algorithm does not have to visit the
subtrees rooted at bd, be, c, d, and e because they will not contain any
maximal frequent itemsets. However, if abc is maximal frequent, only the
nodes such as ac and bc are not maximal frequent (but the subtrees of
ac and bc may still contain maximal frequent itemsets). The depth-first
approach also allows a different kind of pruning based on the support
of itemsets. For example, suppose the support for {a, b, c} is identical
to the support for {a, b}. The subtrees rooted at abd and abe can be
skipped because they are guaranteed not to have any maximal frequent
itemsets. The proof of this is left as an exercise to the readers.
Figure 6.22. Generating candidate itemsets using the depth-first approach.
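The depth-first traversal over prefix-based equivalence classes can be sketched as follows. This is a plain enumeration of frequent itemsets without the maximal-itemset pruning discussed above; the function name `depth_first_frequent` is ours, and the data are the five transactions of Figure 6.17 with a minimum support count of 2.

```python
def depth_first_frequent(transactions, minsup_count):
    """Enumerate frequent itemsets by depth-first search over prefixes."""
    items = sorted(set().union(*transactions))
    found = {}  # insertion order records the order of discovery

    def expand(prefix, extensions):
        for pos, item in enumerate(extensions):
            candidate = prefix | {item}
            count = sum(1 for t in transactions if candidate <= t)
            if count >= minsup_count:
                found[candidate] = count
                # Go deeper before trying the next sibling.
                expand(candidate, extensions[pos + 1:])
            # An infrequent node is a dead end, so the search backtracks.

    expand(frozenset(), items)
    return found

transactions = [set("abc"), set("abcd"), set("bce"), set("acde"), set("de")]
found = depth_first_frequent(transactions, minsup_count=2)
order = ["".join(sorted(s)) for s in found]
print(order)
```

The search visits a, ab, and abc before it ever counts ac, in contrast to the level-by-level order of Apriori.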
Representation of Transaction Data Set. There are many ways to
represent a transaction data set. The choice of representation can affect the I/O
costs incurred when computing the support of candidate itemsets. Figure 6.23
shows two different ways of representing market basket transactions. The
representation on the left is called a horizontal data layout, which is adopted
by many association rule mining algorithms, including Apriori. Another
possibility is to store the list of transaction identifiers (TID-list) associated with
each item. Such a representation is known as the vertical data layout. The
support for each candidate itemset is obtained by intersecting the TID-lists of
its subset items. The length of the TID-lists shrinks as we progress to larger
sized itemsets. However, one problem with this approach is that the initial
set of TID-lists may be too large to fit into main memory, thus requiring
more sophisticated techniques to compress the TID-lists. We describe another
effective approach to represent the data in the next section.
Figure 6.23. Horizontal and vertical data format.
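The conversion from a horizontal to a vertical layout, and support counting by TID-list intersection, can be sketched as follows. The five transactions are hypothetical toy data (not the contents of Figure 6.23), and the names `horizontal`, `vertical`, and `support_count` are ours.

```python
from functools import reduce

# Hypothetical horizontal layout: TID -> items (toy data for illustration).
horizontal = {
    1: {"a", "b", "e"},
    2: {"b", "c", "d"},
    3: {"c", "e"},
    4: {"a", "c", "d"},
    5: {"a", "b", "c", "d"},
}

# Vertical layout: item -> set of TIDs (the item's TID-list).
vertical = {}
for tid, items in horizontal.items():
    for item in items:
        vertical.setdefault(item, set()).add(tid)

def support_count(itemset):
    """Support count of a non-empty itemset via TID-list intersection."""
    tid_lists = [vertical[item] for item in itemset]
    return len(reduce(set.intersection, tid_lists))

print(support_count("ab"), support_count("acd"))
```

Each intersection can only shrink the TID-list, which is why the lists get shorter for larger itemsets.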
6.6 FP-Growth Algorithm
This section presents an alternative algorithm called FP-growth that takes
a radically different approach to discovering frequent itemsets. The algorithm
does not subscribe to the generate-and-test paradigm of Apriori. Instead, it
encodes the data set using a compact data structure called an FP-tree and
extracts frequent itemsets directly from this structure. The details of this
approach are presented next.
6.6.1 FP-Tree Representation
An FP-tree is a compressed representation of the input data. It is constructed
by reading the data set one transaction at a time and mapping each transaction
onto a path in the FP-tree. As different transactions can have several items
in common, their paths may overlap. The more the paths overlap with one
another, the more compression we can achieve using the FP-tree structure. If
the size of the FP-tree is small enough to fit into main memory, this will allow
us to extract frequent itemsets directly from the structure in memory instead
of making repeated passes over the data stored on disk.
Transaction Data Set
TID   Items
1     {a,b}
2     {b,c,d}
3     {a,c,d,e}
4     {a,d,e}
5     {a,b,c}
6     {a,b,c,d}
7     {a}
8     {a,b,c}
9     {a,b,d}
10    {b,c,e}
Figure 6.24. Construction of an FP-tree. Panels: (i) After reading TID=1, (ii) After reading TID=2, (iii) After reading TID=3, (iv) After reading TID=10.
Figure 6.24 shows a data set that contains ten transactions and five items.
The structures of the FP-tree after reading the first three transactions are also
depicted in the diagram. Each node in the tree contains the label of an item
along with a counter that shows the number of transactions mapped onto the
given path. Initially, the FP-tree contains only the root node represented by
the null symbol. The FP-tree is subsequently extended in the following way:
1. The data set is scanned once to determine the support count of each
item. Infrequent items are discarded, while the frequent items are sorted
in decreasing support counts. For the data set shown in Figure 6.24, a
is the most frequent item, followed by b, c, d, and e.
2. The algorithm makes a second pass over the data to construct the
FP-tree. After reading the first transaction, {a, b}, the nodes labeled as a
and b are created. A path is then formed from null → a → b to encode
the transaction. Every node along the path has a frequency count of 1.
3. After reading the second transaction, {b,c,d}, a new set of nodes is
created for items b, c, and d. A path is then formed to represent the
transaction by connecting the nodes null → b → c → d. Every node
along this path also has a frequency count equal to one. Although the
first two transactions have an item in common, which is b, their paths
are disjoint because the transactions do not share a common prefix.
4. The third transaction, {a,c,d,e}, shares a common prefix item (which
is a) with the first transaction. As a result, the path for the third
transaction, null → a → c → d → e, overlaps with the path for the
first transaction, null → a → b. Because of their overlapping path, the
frequency count for node a is incremented to two, while the frequency
counts for the newly created nodes, c, d, and e, are equal to one.
5. This process continues until every transaction has been mapped onto one
of the paths given in the FP-tree. The resulting FP-tree after reading
all the transactions is shown at the bottom of Figure 6.24.
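The two-pass construction described in the steps above can be sketched as follows. This is a minimal illustration under stated assumptions: the class `FPNode` and function `build_fp_tree` are our own names, ties in support are broken alphabetically (the text does not say how), and the node-link pointers between nodes carrying the same item are omitted. The transactions are those of Figure 6.24.

```python
from collections import Counter

class FPNode:
    """One node of the tree: an item label, a counter, and child links."""
    def __init__(self, item, parent=None):
        self.item = item
        self.count = 0
        self.parent = parent
        self.children = {}

def build_fp_tree(transactions, minsup_count):
    # Pass 1: support count of each item; infrequent items are discarded.
    counts = Counter(item for t in transactions for item in t)
    frequent = {i for i, c in counts.items() if c >= minsup_count}

    def ordered(t):
        # Decreasing support; alphabetical tie-break is an assumption.
        return sorted((i for i in t if i in frequent),
                      key=lambda i: (-counts[i], i))

    # Pass 2: map each transaction onto a path starting at the root.
    root = FPNode(None)
    for t in transactions:
        node = root
        for item in ordered(t):
            child = node.children.get(item)
            if child is None:
                child = FPNode(item, parent=node)
                node.children[item] = child
            child.count += 1  # shared prefixes just increment the counter
            node = child
    return root

transactions = ["ab", "bcd", "acde", "ade", "abc", "abcd", "a", "abc", "abd", "bce"]
root = build_fp_tree(transactions, minsup_count=2)
print({item: child.count for item, child in root.children.items()})
```

For this data the root has two children, a with count 8 and b with count 2, matching panel (iv) of Figure 6.24.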
The size of an FP-tree is typically smaller than the size of the uncompressed
data because many transactions in market basket data often share a few items
in common. In the best-case scenario, where all the transactions have the
same set of items, the FP-tree contains only a single branch of nodes. The
worst-case scenario happens when every transaction has a unique set of items.
As none of the transactions have any items in common, the size of the FP-tree
is effectively the same as the size of the original data. However, the physical
storage requirement for the FP-tree is higher because it requires additional
space to store pointers between nodes and counters for each item.
The size of an FP-tree also depends on how the items are ordered. If
the ordering scheme in the preceding example is reversed, i.e., from lowest
to highest support item, the resulting FP-tree is shown in Figure 6.25. The
tree appears to be denser because the branching factor at the root node has
increased from 2 to 5 and the number of nodes containing the high support
items such as a and b has increased from 3 to 12. Nevertheless, ordering
by decreasing support counts does not always lead to the smallest tree. For
example, suppose we augment the data set given in Figure 6.24 with 100
transactions that contain {e}, 80 transactions that contain {d}, 60 transactions
that contain {c}, and 40 transactions that contain {b}. Item e is now most
frequent, followed by d, c, b, and a. With the augmented transactions, ordering
by decreasing support counts will result in an FP-tree similar to Figure 6.25,
while a scheme based on increasing support counts produces a smaller FP-tree
similar to Figure 6.24(iv).
Figure 6.25. An FP-tree representation for the data set shown in Figure 6.24 with a different item
ordering scheme.
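The effect of item ordering on tree size can be checked without building the tree explicitly, because every distinct non-empty prefix of an ordered transaction corresponds to exactly one node. The sketch below uses this observation; the function name `tree_shape` is ours, and the transactions are those of Figure 6.24.

```python
from collections import Counter

transactions = ["ab", "bcd", "acde", "ade", "abc", "abcd", "a", "abc", "abd", "bce"]
counts = Counter(item for t in transactions for item in t)

def tree_shape(transactions, key):
    """Return (number of nodes, root branching factor) of the prefix tree
    obtained when the items of each transaction are sorted by `key`.
    Each distinct non-empty prefix corresponds to one tree node."""
    prefixes = set()
    for t in transactions:
        path = tuple(sorted(t, key=key))
        for end in range(1, len(path) + 1):
            prefixes.add(path[:end])
    root_children = {p[0] for p in prefixes}
    return len(prefixes), len(root_children)

decreasing = tree_shape(transactions, key=lambda i: (-counts[i], i))
increasing = tree_shape(transactions, key=lambda i: (counts[i], i))

# The augmented data set described in the text.
augmented = transactions + ["e"] * 100 + ["d"] * 80 + ["c"] * 60 + ["b"] * 40
aug_counts = Counter(item for t in augmented for item in t)
aug_decreasing = tree_shape(augmented, key=lambda i: (-aug_counts[i], i))
aug_increasing = tree_shape(augmented, key=lambda i: (aug_counts[i], i))

print(decreasing, increasing, aug_decreasing, aug_increasing)
```

On the original data the decreasing order gives the smaller tree, while on the augmented data the increasing order does, in line with the discussion above.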
An FP-tree also contains a list of pointers connecting nodes that
have the same items. These pointers, represented as dashed lines in Figures
6.24 and 6.25, help to facilitate the rapid access of individual items in the tree.
We explain how to use the FP-tree and its corresponding pointers for frequent
itemset generation in the next section.
6.6.2 Frequent Itemset Generation in FP-Growth Algorithm
FP-growth is an algorithm that generates frequent itemsets from an FP-tree
by exploring the tree in a bottom-up fashion. Given the example tree shown in
Figure 6.24, the algorithm looks for frequent itemsets ending in e first, followed
by d, c, b, and finally, a. This bottom-up strategy for finding frequent
itemsets ending with a particular item is equivalent to the suffix-based approach
described in Section 6.5. Since every transaction is mapped onto a path in the
FP-tree, we can derive the frequent itemsets ending with a particular item,
say, e, by examining only the paths containing node e. These paths can be
accessed rapidly using the pointers associated with node e. The extracted
paths are shown in Figure 6.26(a). The details on how to process the paths to
obtain frequent itemsets will be explained later.
Figure 6.26. Decomposing the frequent itemset generation problem into multiple subproblems, where
each subproblem involves finding frequent itemsets ending in e, d, c, b, and a. Panels: (a) Paths containing node e, (b) Paths containing node d, (c) Paths containing node c, (d) Paths containing node b, (e) Paths containing node a.
Table 6.6. The list of frequent itemsets ordered by their corresponding suffixes.
Suffix   Frequent Itemsets
e        {e}, {d,e}, {a,d,e}, {c,e}, {a,e}
d        {d}, {c,d}, {b,c,d}, {a,c,d}, {b,d}, {a,b,d}, {a,d}
c        {c}, {b,c}, {a,b,c}, {a,c}
b        {b}, {a,b}
a        {a}
After finding the frequent itemsets ending in e, the algorithm proceeds to
look for frequent itemsets ending in d by processing the paths associated with
node d. The corresponding paths are shown in Figure 6.26(b). This process
continues until all the paths associated with nodes c, b, and finally a, are
processed. The paths for these items are shown in Figures 6.26(c), (d), and
(e), while their corresponding frequent itemsets are summarized in Table 6.6.
FP-growth finds all the frequent itemsets ending with a particular suffix
by employing a divide-and-conquer strategy to split the problem into smaller
subproblems. For example, suppose we are interested in finding all frequent
itemsets ending in e. To do this, we must first check whether the itemset
{e} itself is frequent. If it is frequent, we consider the subproblem of finding
frequent itemsets ending in de, followed by ce, be, and ae. In turn, each
of these subproblems is further decomposed into smaller subproblems. By
merging the solutions obtained from the subproblems, all the frequent itemsets
ending in e can be found. This divide-and-conquer approach is the key strategy
employed by the FP-growth algorithm.
Figure 6.27. Example of applying the FP-growth algorithm to find frequent itemsets ending in e. Panels: (a) Prefix paths ending in e, (b) Conditional FP-tree for e, (c) Prefix paths ending in de, (d) Conditional FP-tree for de, (e) Prefix paths ending in ce, (f) Prefix paths ending in ae.
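The suffix-based divide-and-conquer decomposition can be sketched in a few lines of Python. To keep the example short, plain lists of ordered transactions stand in for the FP-tree and the conditional FP-trees, so the sketch shows the decomposition into subproblems but none of the compression; the function name `mine_suffixes` is ours. The data are the transactions of Figure 6.24 with a minimum support count of 2.

```python
def mine_suffixes(transactions, minsup_count, order):
    """Find frequent itemsets by suffix-based divide and conquer.

    `order` lists the items from most to least frequent. Plain lists
    stand in for the FP-tree and the conditional FP-trees."""
    rank = {item: r for r, item in enumerate(order)}
    database = [sorted(t, key=rank.get) for t in transactions]
    results = {}

    def grow(database, suffix):
        # Support of each item within the current (conditional) database.
        counts = {}
        for t in database:
            for item in t:
                counts[item] = counts.get(item, 0) + 1
        # Process items bottom-up: the last item in `order` comes first.
        for item in sorted(counts, key=rank.get, reverse=True):
            if counts[item] < minsup_count:
                continue
            itemset = (item,) + suffix
            results[itemset] = counts[item]
            # Subproblem: prefixes of the transactions that contain `item`.
            projected = [t[:t.index(item)] for t in database if item in t]
            grow(projected, itemset)

    grow(database, ())
    return results

transactions = ["ab", "bcd", "acde", "ade", "abc", "abcd", "a", "abc", "abd", "bce"]
results = mine_suffixes(transactions, minsup_count=2, order="abcde")
ending_in_e = {k for k in results if k[-1] == "e"}
print(sorted(ending_in_e))
```

The itemsets found for suffix e are {e}, {d,e}, {a,d,e}, {c,e}, and {a,e}, which agrees with the first row of Table 6.6.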
For a more concrete example on how to solve the subproblems, consider
the task of finding frequent itemsets ending with e.
1. The first step is to gather all the paths containing node e. These initial
paths are called prefix paths and are shown in Figure 6.27(a).
2. From the prefix paths shown in Figure 6.27(a), the support count for e is
obtained by adding the support counts associated with node e. Assuming
that the minimum support count is 2, {e} is declared a frequent itemset
because its support count is 3.
3. Because {e} is frequent, the algorithm has to solve the subproblems of
finding frequent itemsets ending in de, ce, be, and ae. Before solving
these subproblems, it must first convert the prefix paths into a
conditional FP-tree, which is structurally similar to an FP-tree, except
it is used to find frequent itemsets ending with a particular suffix. A
conditional FP-tree is obtained in the following way:
(a) First, the support counts along the prefix paths must be updated
because some of the counts include transactions that do not contain
item e. For example, the rightmost path shown in Figure 6.27(a),
null −→ b:2 −→ c:2 −→ e:1, includes a transaction {b, c} that
does not contain item e. The counts along the prefix path must
therefore be adjusted to 1 to reflect the actual number of
transactions containing {b, c, e}.
(b) The prefix paths are truncated by removing the nodes for e. These
nodes can be removed because the support counts along the prefix
paths have been updated to reflect only transactions that contain e
and the subproblems of finding frequent itemsets ending in de, ce,
be, and ae no longer need information about node e.
(c) After updating the support counts along the prefix paths, some
of the items may no longer be frequent. For example, the node b
appears only once and has a support count equal to 1, which means
that there is only one transaction that contains both b and e. Item b
can be safely excluded from subsequent analysis because all itemsets
ending in be must be infrequent.
The conditional FP-tree for e is shown in Figure 6.27(b). The tree looks
different from the original prefix paths because the frequency counts have
been updated and the nodes b and e have been eliminated.
4. FP-growth uses the conditional FP-tree for e to solve the subproblems of
finding frequent itemsets ending in de, ce, and ae. To find the frequent
itemsets ending in de, the prefix paths for d are gathered from the
conditional FP-tree for e (Figure 6.27(c)). By adding the frequency counts
associated with node d, we obtain the support count for {d, e}. Since
the support count is equal to 2, {d, e} is declared a frequent itemset.
Next, the algorithm constructs the conditional FP-tree for de using the
approach described in step 3. After updating the support counts and
removing the infrequent item c, the conditional FP-tree for de is shown
in Figure 6.27(d). Since the conditional FP-tree contains only one item,
a, whose support is equal to minsup, the algorithm extracts the
frequent itemset {a, d, e} and moves on to the next subproblem, which is
to generate frequent itemsets ending in ce. After processing the prefix
paths for c, only {c, e} is found to be frequent. The algorithm proceeds
to solve the next subproblem and finds {a, e} to be the only frequent
itemset remaining.
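The four steps above can be sketched in a few lines of Python. This is a minimal sketch, not the book's implementation: instead of building tree nodes, it keeps each conditional FP-tree as a flat list of (prefix path, count) pairs, which is enough to reproduce the recursion. The function name `mine_suffix` and the three prefix paths in `base_e` are assumptions; only the path b, c is spelled out in the text, and the other two are an assumed reading of Figure 6.27(a).

```python
from collections import Counter


def mine_suffix(pattern_base, suffix, minsup, found):
    """Recursively collect frequent itemsets that end with `suffix`.

    pattern_base is a list of (prefix_items, count) pairs: the ordered items
    that precede the suffix on one prefix path, and the number of
    transactions on that path that contain the suffix.
    """
    # Support of each item among transactions that contain the suffix.
    support = Counter()
    for items, count in pattern_base:
        for item in items:
            support[item] += count

    for item, sup in support.items():
        if sup < minsup:
            continue  # infrequent in this conditional base (e.g. b given e)
        new_suffix = (item,) + suffix
        found[frozenset(new_suffix)] = sup
        # Conditional base for the longer suffix: keep what precedes `item`.
        new_base = [(items[:items.index(item)], count)
                    for items, count in pattern_base if item in items]
        mine_suffix(new_base, new_suffix, minsup, found)


# Prefix paths ending in e, with counts already adjusted to the transactions
# that contain e (assumed reading of Figure 6.27(a)).
base_e = [(("a", "c", "d"), 1), (("a", "d"), 1), (("b", "c"), 1)]
minsup = 2

found = {}
support_e = sum(count for _, count in base_e)
if support_e >= minsup:  # step 2: {e} itself must be frequent
    found[frozenset(["e"])] = support_e
    mine_suffix(base_e, ("e",), minsup, found)

for itemset, sup in sorted(found.items(),
                           key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), sup)
```

Under these assumptions the sketch reports {e}, {a, e}, {c, e}, {d, e}, and {a, d, e}, the same itemsets the walkthrough arrives at. Infrequent items are skipped when they are counted rather than pruned from the paths, which yields the same itemsets as step 3(c) in this example.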
This example illustrates the divide-and-conquer approach used in the
FP-growth algorithm. At each recursive step, a conditional FP-tree is constructed
by updating the frequency counts along the prefix paths and removing all
infrequent items. Because the subproblems are disjoint, FP-growth will not
generate any duplicate itemsets. In addition, the counts associated with the
nodes allow the algorithm to perform support counting while generating the
common suffix itemsets.
FP-growth is an interesting algorithm because it illustrates how a compact
representation of the transaction data set helps to efficiently generate frequent
itemsets. In addition, for certain transaction data sets, FP-growth outperforms
the standard Apriori algorithm by several orders of magnitude. The run-time
performance of FP-growth depends on the compaction factor of the data
set. If the resulting conditional FP-trees are very bushy (in the worst case, a
full prefix tree), then the performance of the algorithm degrades significantly
because it has to generate a large number of subproblems and merge the results
returned by each subproblem.
6.7 Evaluation of Association Patterns
Association analysis algorithms have the potential to generate a large number
of patterns. For example, although the data set shown in Table 6.1 contains
only six items, it can produce up to hundreds of association rules at certain
support and confidence thresholds. As the size and dimensionality of real
commercial databases can be very large, we could easily end up with thousands
or even millions of patterns, many of which might not be interesting. Sifting
through the patterns to identify the most interesting ones is not a trivial task
because “one person’s trash might be another person’s treasure.” It is therefore
important to establish a set of well-accepted criteria for evaluating the quality
of association patterns.
The first set of criteria can be established through statistical arguments.
Patterns that involve a set of mutually independent items or cover very few
transactions are considered uninteresting because they may capture spurious
relationships in the data. Such patterns can be eliminated by applying an
objective interestingness measure that uses statistics derived from data
to determine whether a pattern is interesting. Examples of objective
interestingness measures include support, confidence, and correlation.
The second set of criteria can be established through subjective arguments.
A pattern is considered subjectively uninteresting unless it reveals unexpected
information about the data or provides useful knowledge that can lead to
profitable actions. For example, the rule {Butter} −→ {Bread} may not be
interesting, despite having high support and confidence values, because the
relationship represented by the rule may seem rather obvious. On the other
hand, the rule {Diapers} −→ {Beer} is interesting because the relationship is
quite unexpected and may suggest a new cross-selling opportunity for retailers.
Incorporating subjective knowledge into pattern evaluation is a difficult task
because it requires a considerable amount of prior information from the domain
experts.
The following are some of the approaches for incorporating subjective
knowledge into the pattern discovery task.
Visualization This approach requires a user-friendly environment to keep
the human user in the loop. It also allows the domain experts to interact with
the data mining system by interpreting and verifying the discovered patterns.
Template-based approach This approach allows the users to constrain
the type of patterns extracted by the mining algorithm. Instead of reporting
all the extracted rules, only rules that satisfy a user-specified template are
returned to the users.
Subjective interestingness measure A subjective measure can be defined
based on domain information such as concept hierarchy (to be discussed in
Section 7.3) or profit margin of items. The measure can then be used to filter
patterns that are obvious and non-actionable.
Readers interested in subjective interestingness measures may refer to
resources listed in the bibliography at the end of this chapter.
6.7.1 Objective Measures of Interestingness
An objective measure is a data-driven approach for evaluating the quality
of association patterns. It is domain-independent and requires minimal
input from the users, other than to specify a threshold for filtering low-quality
patterns. An objective measure is usually computed based on the frequency
counts tabulated in a contingency table. Table 6.7 shows an example of a
contingency table for a pair of binary variables, A and B. We use the notation
$\overline{A}$ ($\overline{B}$) to indicate that A (B) is absent from a transaction. Each entry $f_{ij}$ in
this 2 × 2 table denotes a frequency count. For example, $f_{11}$ is the number of
times A and B appear together in the same transaction, while $f_{01}$ is the
number of transactions that contain B but not A. The row sum $f_{1+}$ represents
the support count for A, while the column sum $f_{+1}$ represents the support
count for B. Finally, even though our discussion focuses mainly on asymmetric
binary variables, note that contingency tables are also applicable to other
attribute types such as symmetric binary, nominal, and ordinal variables.
Table 6.7. A 2-way contingency table for variables A and B.
                  $B$         $\overline{B}$
$A$               $f_{11}$    $f_{10}$          $f_{1+}$
$\overline{A}$    $f_{01}$    $f_{00}$          $f_{0+}$
                  $f_{+1}$    $f_{+0}$          $N$
Limitations of the Support-Confidence Framework The existing association
rule mining formulation relies on the support and confidence measures to
eliminate uninteresting patterns. The drawback of support was previously
described in Section 6.8, in which many potentially interesting patterns involving
low support items might be eliminated by the support threshold. The drawback
of confidence is more subtle and is best demonstrated with the following
example.
Example 6.3. Suppose we are interested in analyzing the relationship
between people who drink tea and coffee. We may gather information about the
beverage preferences among a group of people and summarize their responses
into a table such as the one shown in Table 6.8.
Table 6.8. Beverage preferences among a group of 1000 people.
                    Coffee    $\overline{Coffee}$
Tea                 150       50                     200
$\overline{Tea}$    650       150                    800
                    800       200                    1000
The information given in this table can be used to evaluate the association
rule {Tea} −→ {Coffee}. At first glance, it may appear that people who drink
tea also tend to drink coffee because the rule’s support (15%) and confidence
(75%) values are reasonably high. This argument would have been acceptable
except that the fraction of people who drink coffee, regardless of whether they
drink tea, is 80%, while the fraction of tea drinkers who drink coffee is only
75%. Thus knowing that a person is a tea drinker actually decreases her
probability of being a coffee drinker from 80% to 75%! The rule {Tea} −→
{Coffee} is therefore misleading despite its high confidence value.
The pitfall of confidence can be traced to the fact that the measure ignores
the support of the itemset in the rule consequent. Indeed, if the support of
coffee drinkers is taken into account, we would not be surprised to find that
many of the people who drink tea also drink coffee. What is more surprising is
that the fraction of tea drinkers who drink coffee is actually less than the overall
fraction of people who drink coffee, which points to an inverse relationship
between tea drinkers and coffee drinkers.
Because of the limitations in the support-confidence framework, various
objective measures have been used to evaluate the quality of association
patterns. Below, we provide a brief description of these measures and explain
some of their strengths and limitations.
Interest Factor The tea-coffee example shows that high-confidence rules
can sometimes be misleading because the confidence measure ignores the
support of the itemset appearing in the rule consequent. One way to address this
problem is by applying a metric known as lift:
\[ \mathrm{Lift} = \frac{c(A \longrightarrow B)}{s(B)}, \tag{6.4} \]
which computes the ratio between the rule’s confidence and the support of
the itemset in the rule consequent. For binary variables, lift is equivalent to
another objective measure called interest factor, which is defined as follows:
\[ I(A, B) = \frac{s(A, B)}{s(A) \times s(B)} = \frac{N f_{11}}{f_{1+} f_{+1}}. \tag{6.5} \]
Interest factor compares the frequency of a pattern against a baseline
frequency computed under the statistical independence assumption. The baseline
frequency for a pair of mutually independent variables is
\[ \frac{f_{11}}{N} = \frac{f_{1+}}{N} \times \frac{f_{+1}}{N}, \quad \text{or equivalently,} \quad f_{11} = \frac{f_{1+} f_{+1}}{N}. \tag{6.6} \]
Table 6.9. Contingency tables for the word pairs {p, q} and {r, s}.
                  $p$     $\overline{p}$
$q$               880     50                930
$\overline{q}$    50      20                70
                  930     70                1000

                  $r$     $\overline{r}$
$s$               20      50                70
$\overline{s}$    50      880               930
                  70      930               1000
This equation follows from the standard approach of using simple fractions
as estimates for probabilities. The fraction $f_{11}/N$ is an estimate for the joint
probability $P(A, B)$, while $f_{1+}/N$ and $f_{+1}/N$ are the estimates for $P(A)$ and
$P(B)$, respectively. If A and B are statistically independent, then $P(A, B) =
P(A) \times P(B)$, thus leading to the formula shown in Equation 6.6. Using
Equations 6.5 and 6.6, we can interpret the measure as follows:
\[ I(A, B) \begin{cases} = 1, & \text{if A and B are independent;} \\ > 1, & \text{if A and B are positively correlated;} \\ < 1, & \text{if A and B are negatively correlated.} \end{cases} \tag{6.7} \]
For the tea-coffee example shown in Table 6.8, $I = \frac{0.15}{0.2 \times 0.8} = 0.9375$, thus
suggesting a slight negative correlation between tea drinkers and coffee drinkers.
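The tea-coffee arithmetic can be checked with a short Python sketch. It evaluates support, confidence, and lift (Equation 6.4) from the four cell counts of Table 6.8; the function name `rule_measures` and its argument order are assumptions made for illustration.

```python
def rule_measures(f11, f10, f01, f00):
    """Support, confidence and lift of A -> B from a 2 x 2 contingency table."""
    n = f11 + f10 + f01 + f00
    support = f11 / n                    # s(A, B)
    confidence = f11 / (f11 + f10)       # c(A -> B) = s(A, B) / s(A)
    consequent_support = (f11 + f01) / n # s(B)
    lift = confidence / consequent_support
    return support, confidence, consequent_support, lift


# Table 6.8 with A = Tea and B = Coffee.
support, confidence, s_coffee, lift = rule_measures(150, 50, 650, 150)
print(f"support={support:.2f} confidence={confidence:.2f} "
      f"s(Coffee)={s_coffee:.2f} lift={lift:.4f}")
```

The lift falls below 1 because the confidence of the rule (75%) is lower than the overall share of coffee drinkers (80%).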
Limitations of Interest Factor We illustrate the limitation of interest
factor with an example from the text mining domain. In the text domain, it
is reasonable to assume that the association between a pair of words depends
on the number of documents that contain both words. For example, because
of their stronger association, we expect the words data and mining to appear
together more frequently than the words compiler and mining in a collection
of computer science articles.
Table 6.9 shows the frequency of occurrences between two pairs of words,
{p, q} and {r, s}. Using the formula given in Equation 6.5, the interest factor
for {p, q} is 1.02 and for {r, s} is 4.08. These results are somewhat troubling
for the following reasons. Although p and q appear together in 88% of the
documents, their interest factor is close to 1, which is the value when p and q
are statistically independent. On the other hand, the interest factor for {r, s}
is higher than that for {p, q} even though r and s seldom appear together in the same
document. Confidence is perhaps the better choice in this situation because it
considers the association between p and q (94.6%) to be much stronger than
that between r and s (28.6%).
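As a rough check of the figures quoted for Table 6.9, the sketch below evaluates Equation 6.5 and the rule confidence for both word pairs. The helper names `interest_factor` and `confidence` are illustrative assumptions.

```python
def interest_factor(f11, f10, f01, f00):
    """Interest factor I = N * f11 / (f1+ * f+1), as in Equation 6.5."""
    n = f11 + f10 + f01 + f00
    return n * f11 / ((f11 + f10) * (f11 + f01))


def confidence(f11, f10):
    """Confidence of the rule whose antecedent is the row variable."""
    return f11 / (f11 + f10)


# Cell counts (f11, f10, f01, f00) read from Table 6.9.
pq = (880, 50, 50, 20)
rs = (20, 50, 50, 880)

i_pq, i_rs = interest_factor(*pq), interest_factor(*rs)
c_pq, c_rs = confidence(pq[0], pq[1]), confidence(rs[0], rs[1])
print(f"I(p,q)={i_pq:.2f}  I(r,s)={i_rs:.2f}")
print(f"c(p,q)={c_pq:.3f}  c(r,s)={c_rs:.3f}")
```

Because both tables are symmetric, the confidence is the same in either rule direction, so the choice of antecedent does not matter here.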
Correlation Analysis Correlation analysis is a statistical-based technique
for analyzing relationships between a pair of variables. For continuous
variables, correlation is defined using Pearson’s correlation coefficient (see
Equation 2.10 on page 77). For binary variables, correlation can be measured using
the φ-coefficient, which is defined as
\[ \phi = \frac{f_{11} f_{00} - f_{01} f_{10}}{\sqrt{f_{1+} f_{+1} f_{0+} f_{+0}}}. \tag{6.8} \]
The value of correlation ranges from −1 (perfect negative correlation) to +1
(perfect positive correlation). If the variables are statistically independent,
then φ = 0. For example, the correlation between the tea and coffee drinkers
given in Table 6.8 is −0.0625.
Limitations of Correlation Analysis The drawback of using correlation
can be seen from the word association example given in Table 6.9. Although
the words p and q appear together more often than r and s, their φ-coefficients
are identical, i.e., φ(p, q) = φ(r, s) = 0.232. This is because the φ-coefficient
gives equal importance to both co-presence and co-absence of items in a
transaction. It is therefore more suitable for analyzing symmetric binary variables.
Another limitation of this measure is that it does not remain invariant when
there are proportional changes to the sample size. This issue will be discussed
in greater detail when we describe the properties of objective measures on page
377.
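The φ values quoted for Tables 6.8 and 6.9 can be reproduced with a small Python sketch of Equation 6.8. The function name `phi` is an assumption; the cell counts are taken from the two tables.

```python
from math import sqrt


def phi(f11, f10, f01, f00):
    """phi-coefficient of a 2 x 2 contingency table (Equation 6.8)."""
    numerator = f11 * f00 - f01 * f10
    denominator = sqrt((f11 + f10) * (f11 + f01) * (f01 + f00) * (f10 + f00))
    return numerator / denominator


phi_tea_coffee = phi(150, 50, 650, 150)  # Table 6.8
phi_pq = phi(880, 50, 50, 20)            # Table 6.9, pair {p, q}
phi_rs = phi(20, 50, 50, 880)            # Table 6.9, pair {r, s}
print(f"{phi_tea_coffee:.4f} {phi_pq:.3f} {phi_rs:.3f}")
```

Swapping the roles of co-presence (f11) and co-absence (f00) leaves both the numerator and the denominator unchanged, which is why the two word pairs receive the same value.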
IS Measure IS is an alternative measure that has been proposed for
handling asymmetric binary variables. The measure is defined as follows:
\[ IS(A, B) = \sqrt{I(A, B) \times s(A, B)} = \frac{s(A, B)}{\sqrt{s(A)\, s(B)}}. \tag{6.9} \]
Note that IS is large when the interest factor and support of the pattern
are large. For example, the values of IS for the word pairs {p, q} and {r, s}
shown in Table 6.9 are 0.946 and 0.286, respectively. Contrary to the results
given by interest factor and the φ-coefficient, the IS measure suggests that
the association between {p, q} is stronger than {r, s}, which agrees with what
we expect from word associations in documents.
It is possible to show that IS is mathematically equivalent to the cosine
measure for binary variables (see Equation 2.7 on page 75). In this regard, we
consider A and B as a pair of bit vectors, $\mathbf{A} \cdot \mathbf{B} = s(A, B)$ the dot product
between the vectors, and $|\mathbf{A}| = \sqrt{s(A)}$ the magnitude of vector A. Therefore:
\[ IS(A, B) = \frac{s(A, B)}{\sqrt{s(A) \times s(B)}} = \frac{\mathbf{A} \cdot \mathbf{B}}{|\mathbf{A}| \times |\mathbf{B}|} = \mathrm{cosine}(\mathbf{A}, \mathbf{B}). \tag{6.10} \]
The IS measure can also be expressed as the geometric mean between the
confidence of association rules extracted from a pair of binary variables:
\[ IS(A, B) = \sqrt{\frac{s(A, B)}{s(A)} \times \frac{s(A, B)}{s(B)}} = \sqrt{c(A \rightarrow B) \times c(B \rightarrow A)}. \tag{6.11} \]
Because the geometric mean between any two numbers is always closer to the
smaller number, the IS value of an itemset {p, q} is low whenever one of its
rules, p −→ q or q −→ p, has low confidence.
Table 6.10. Example of a contingency table for items p and q.
                  $q$     $\overline{q}$
$p$               800     100               900
$\overline{p}$    100     0                 100
                  900     100               1000
Limitations of IS Measure The IS value for a pair of independent
itemsets, A and B, is
\[ IS_{\mathrm{indep}}(A, B) = \frac{s(A, B)}{\sqrt{s(A) \times s(B)}} = \frac{s(A) \times s(B)}{\sqrt{s(A) \times s(B)}} = \sqrt{s(A) \times s(B)}. \]
Since the value depends on s(A) and s(B), IS shares a problem with
the confidence measure: the value of the measure can be quite large,
even for uncorrelated and negatively correlated patterns. For example, despite
the large IS value between items p and q given in Table 6.10 (0.889), it is
still less than the expected value when the items are statistically independent
($IS_{\mathrm{indep}} = 0.9$).
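The IS values quoted in this discussion can be recomputed directly from the cell counts. The sketch below is illustrative only; `is_measure` and `is_independent` are assumed helper names implementing Equation 6.9 and the independence baseline.

```python
from math import sqrt


def is_measure(f11, f10, f01, f00):
    """IS (cosine) measure: f11 / sqrt(f1+ * f+1)."""
    return f11 / sqrt((f11 + f10) * (f11 + f01))


def is_independent(f11, f10, f01, f00):
    """IS value expected under independence: sqrt(s(A) * s(B))."""
    n = f11 + f10 + f01 + f00
    return sqrt(((f11 + f10) / n) * ((f11 + f01) / n))


is_pq = is_measure(880, 50, 50, 20)        # Table 6.9, pair {p, q}
is_rs = is_measure(20, 50, 50, 880)        # Table 6.9, pair {r, s}
is_t610 = is_measure(800, 100, 100, 0)     # Table 6.10
baseline = is_independent(800, 100, 100, 0)
print(f"{is_pq:.3f} {is_rs:.3f} {is_t610:.3f} {baseline:.3f}")
```

For Table 6.10 the observed IS sits just below its independence baseline, even though the value itself looks large.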
Alternative Objective Interestingness Measures
Besides the measures we have described so far, there are other alternative
measures proposed for analyzing relationships between pairs of binary variables.
These measures can be divided into two categories, symmetric and
asymmetric measures. A measure M is symmetric if M(A −→ B) = M(B −→ A).
For example, interest factor is a symmetric measure because its value is
identical for the rules A −→ B and B −→ A. In contrast, confidence is an
asymmetric measure since the confidence for A −→ B and B −→ A may not
be the same. Symmetric measures are generally used for evaluating itemsets,
while asymmetric measures are more suitable for analyzing association rules.
Tables 6.11 and 6.12 provide the definitions for some of these measures in
terms of the frequency counts of a 2 × 2 contingency table.
Consistency among Objective Measures
Given the wide variety of measures available, it is reasonable to question
whether the measures can produce similar ordering results when applied to
a set of association patterns. If the measures are consistent, then we can
choose any one of them as our evaluation metric. Otherwise, it is important
to understand what their differences are in order to determine which measure
is more suitable for analyzing certain types of patterns.
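Table 6.11. Examples of symmetric objective measures for the itemset {A, B}.
Measure (Symbol)           Definition
Correlation (φ)            $\frac{N f_{11} - f_{1+} f_{+1}}{\sqrt{f_{1+} f_{+1} f_{0+} f_{+0}}}$
Odds ratio (α)             $(f_{11} f_{00}) / (f_{10} f_{01})$
Kappa (κ)                  $\frac{N f_{11} + N f_{00} - f_{1+} f_{+1} - f_{0+} f_{+0}}{N^2 - f_{1+} f_{+1} - f_{0+} f_{+0}}$
Interest (I)               $(N f_{11}) / (f_{1+} f_{+1})$
Cosine (IS)                $f_{11} / \sqrt{f_{1+} f_{+1}}$
Piatetsky-Shapiro (PS)     $\frac{f_{11}}{N} - \frac{f_{1+} f_{+1}}{N^2}$
Collective strength (S)    $\frac{f_{11} + f_{00}}{f_{1+} f_{+1} + f_{0+} f_{+0}} \times \frac{N - f_{1+} f_{+1} - f_{0+} f_{+0}}{N - f_{11} - f_{00}}$
Jaccard (ζ)                $f_{11} / (f_{1+} + f_{+1} - f_{11})$
All-confidence (h)         $\min\left[\frac{f_{11}}{f_{1+}}, \frac{f_{11}}{f_{+1}}\right]$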
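Table 6.12. Examples of asymmetric objective measures for the rule A −→ B.
Measure (Symbol)           Definition
Goodman-Kruskal (λ)        $\left(\sum_j \max_k f_{jk} - \max_k f_{+k}\right) / \left(N - \max_k f_{+k}\right)$
Mutual Information (M)     $\left(\sum_i \sum_j \frac{f_{ij}}{N} \log \frac{N f_{ij}}{f_{i+} f_{+j}}\right) / \left(-\sum_i \frac{f_{i+}}{N} \log \frac{f_{i+}}{N}\right)$
J-Measure (J)              $\frac{f_{11}}{N} \log \frac{N f_{11}}{f_{1+} f_{+1}} + \frac{f_{10}}{N} \log \frac{N f_{10}}{f_{1+} f_{+0}}$
Gini index (G)             $\frac{f_{1+}}{N} \times \left[\left(\frac{f_{11}}{f_{1+}}\right)^2 + \left(\frac{f_{10}}{f_{1+}}\right)^2\right] - \left(\frac{f_{+1}}{N}\right)^2 + \frac{f_{0+}}{N} \times \left[\left(\frac{f_{01}}{f_{0+}}\right)^2 + \left(\frac{f_{00}}{f_{0+}}\right)^2\right] - \left(\frac{f_{+0}}{N}\right)^2$
Laplace (L)                $(f_{11} + 1) / (f_{1+} + 2)$
Conviction (V)             $(f_{1+} f_{+0}) / (N f_{10})$
Certainty factor (F)       $\left(\frac{f_{11}}{f_{1+}} - \frac{f_{+1}}{N}\right) / \left(1 - \frac{f_{+1}}{N}\right)$
Added Value (AV)           $\frac{f_{11}}{f_{1+}} - \frac{f_{+1}}{N}$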
Table 6.13. Example of contingency tables.
Example f11 f10 f01 f00
E1 8123 83 424 1370
E2 8330 2 622 1046
E3 3954 3080 5 2961
E4 2886 1363 1320 4431
E5 1500 2000 500 6000
E6 4000 2000 1000 3000
E7 9481 298 127 94
E8 4000 2000 2000 2000
E9 7450 2483 4 63
E10 61 2483 4 7452
Suppose the symmetric and asymmetric measures are applied to rank the
ten contingency tables shown in Table 6.13. These contingency tables are
chosen to illustrate the differences among the existing measures. The orderings
produced by these measures are shown in Tables 6.14 and 6.15, respectively
(with 1 as the most interesting and 10 as the least interesting table). Although
some of the measures appear to be consistent with each other, there are certain
measures that produce quite different ordering results. For example, the
rankings given by the φ-coefficient largely agree with those provided by κ and
collective strength, but are somewhat different from the rankings produced by
interest factor and odds ratio. Furthermore, a contingency table such as E10 is ranked
lowest according to the φ-coefficient, but highest according to interest factor.
Table 6.14. Rankings of contingency tables using the symmetric measures given in Table 6.11.
       φ    α    κ    I    IS   PS   S    ζ    h
E1     1    3    1    6    2    2    1    2    2
E2     2    1    2    7    3    5    2    3    3
E3     3    2    4    4    5    1    3    6    8
E4     4    8    3    3    7    3    4    7    5
E5     5    7    6    2    9    6    6    9    9
E6     6    9    5    5    6    4    5    5    7
E7     7    6    7    9    1    8    7    1    1
E8     8    10   8    8    8    7    8    8    7
E9     9    4    9    10   4    9    9    4    4
E10    10   5    10   1    10   10   10   10   10
Table 6.15. Rankings of contingency tables using the asymmetric measures given in Table 6.12.
       λ    M    J    G    L    V    F    AV
E1     1    1    1    1    4    2    2    5
E2     2    2    2    3    5    1    1    6
E3     5    3    5    2    2    6    6    4
E4     4    6    3    4    9    3    3    1
E5     9    7    4    6    8    5    5    2
E6     3    8    6    5    7    4    4    3
E7     7    5    9    8    3    7    7    9
E8     8    9    7    7    10   8    8    7
E9     6    4    10   9    1    9    9    10
E10    10   10   8    10   6    10   10   8
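The disagreement between the φ-coefficient and interest factor can be reproduced from the counts in Table 6.13. The following Python sketch ranks the ten tables under both measures, using the definitions in Table 6.11; the helper names `phi`, `interest`, and `rank` are illustrative assumptions.

```python
from math import sqrt

# (f11, f10, f01, f00) for each example in Table 6.13.
tables = {
    "E1": (8123, 83, 424, 1370), "E2": (8330, 2, 622, 1046),
    "E3": (3954, 3080, 5, 2961), "E4": (2886, 1363, 1320, 4431),
    "E5": (1500, 2000, 500, 6000), "E6": (4000, 2000, 1000, 3000),
    "E7": (9481, 298, 127, 94), "E8": (4000, 2000, 2000, 2000),
    "E9": (7450, 2483, 4, 63), "E10": (61, 2483, 4, 7452),
}


def phi(f11, f10, f01, f00):
    denominator = sqrt((f11 + f10) * (f11 + f01) * (f01 + f00) * (f10 + f00))
    return (f11 * f00 - f01 * f10) / denominator


def interest(f11, f10, f01, f00):
    n = f11 + f10 + f01 + f00
    return n * f11 / ((f11 + f10) * (f11 + f01))


def rank(measure):
    """Map each table name to its rank (1 = largest measure value)."""
    ordered = sorted(tables, key=lambda name: measure(*tables[name]),
                     reverse=True)
    return {name: position for position, name in enumerate(ordered, start=1)}


phi_rank = rank(phi)
interest_rank = rank(interest)
for name in tables:
    print(name, phi_rank[name], interest_rank[name])
```

With these counts, E10 comes last under φ and first under interest factor, matching the corresponding columns of Table 6.14.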
Properties of Objective Measures
The results shown in Table 6.14 suggest that a significant number of the
measures provide conflicting information about the quality of a pattern. To
understand their differences, we need to examine the properties of these measures.
Figure 6.28. Effect of the inversion operation. The vectors C and E are inversions of vector A, while
the vector D is an inversion of vectors B and F.
                 (a)       (b)       (c)
Transaction      A  B      C  D      E  F
1                1  0      0  1      0  0
2                0  0      1  1      1  0
3                0  0      1  1      1  0
4                0  0      1  1      1  0
5                0  1      1  0      1  1
6                0  0      1  1      1  0
7                0  0      1  1      1  0
8                0  0      1  1      1  0
9                0  0      1  1      1  0
10               1  0      0  1      0  0
Inversion Property Consider the bit vectors shown in Figure 6.28. The
0/1 bit in each column vector indicates whether a transaction (row) contains
a particular item (column). For example, the vector A indicates that item a
belongs to the first and last transactions, whereas the vector B indicates that
item b is contained only in the fifth transaction. The vectors C and E are in
fact related to the vector A: their bits have been inverted from 0’s (absence)
to 1’s (presence), and vice versa. Similarly, D is related to vectors B and F by
inverting their bits. The process of flipping a bit vector is called inversion.
If a measure is invariant under the inversion operation, then its value for the
vector pair (C, D) should be identical to its value for (A, B). The inversion
property of a measure can be tested as follows.
Definition 6.6 (Inversion Property). An objective measure M is invariant
under the inversion operation if its value remains the same when exchanging
the frequency counts f11 with f00 and f10 with f01.
Measures that remain invariant under this operation include
the φ-coefficient, odds ratio, κ, PS, and collective strength. These measures may
not be suitable for analyzing asymmetric binary data. For example, the
φ-coefficient between C and D is identical to the φ-coefficient between A and
B, even though items c and d appear together more frequently than a and b.
Furthermore, the φ-coefficient between C and D is less than that between E
and F even though items e and f appear together only once! We had previously
raised this issue when discussing the limitations of the φ-coefficient on page
375. For asymmetric binary data, measures that do not remain invariant under
the inversion operation are preferred. Some of the non-invariant measures
include interest factor, IS, and the Jaccard coefficient.
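The inversion property can be checked numerically on the bit vectors of Figure 6.28. The sketch below is a minimal illustration: it builds the contingency counts for (A, B) and for their inversions (C, D), then compares the φ-coefficient with the Jaccard coefficient. The helper names are assumptions.

```python
from math import sqrt


def counts(x, y):
    """Return (f11, f10, f01, f00) for two equal-length 0/1 vectors."""
    f11 = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)
    f10 = sum(1 for a, b in zip(x, y) if a == 1 and b == 0)
    f01 = sum(1 for a, b in zip(x, y) if a == 0 and b == 1)
    f00 = sum(1 for a, b in zip(x, y) if a == 0 and b == 0)
    return f11, f10, f01, f00


def phi(f11, f10, f01, f00):
    denominator = sqrt((f11 + f10) * (f11 + f01) * (f01 + f00) * (f10 + f00))
    return (f11 * f00 - f01 * f10) / denominator


def jaccard(f11, f10, f01, f00):
    return f11 / (f11 + f10 + f01)


A = [1, 0, 0, 0, 0, 0, 0, 0, 0, 1]
B = [0, 0, 0, 0, 1, 0, 0, 0, 0, 0]
C = [1 - bit for bit in A]  # inversion of A
D = [1 - bit for bit in B]  # inversion of B

phi_ab, phi_cd = phi(*counts(A, B)), phi(*counts(C, D))
jac_ab, jac_cd = jaccard(*counts(A, B)), jaccard(*counts(C, D))
print(f"phi:     {phi_ab:.4f} vs {phi_cd:.4f}")
print(f"Jaccard: {jac_ab:.4f} vs {jac_cd:.4f}")
```

φ is unchanged by the inversion, whereas the Jaccard coefficient rises from 0 to 0.7 because c and d co-occur in seven transactions while a and b never do.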
Null Addition Property Suppose we are interested in analyzing the
relationship between a pair of words, such as data and mining, in a set of
documents. If a collection of articles about ice fishing is added to the data set,
should the association between data and mining be affected? This process of
adding unrelated data (in this case, documents) to a given data set is known
as the null addition operation.
Definition 6.7 (Null Addition Property). An objective measure M is
invariant under the null addition operation if it is not affected by increasing
$f_{00}$, while all other frequencies in the contingency table stay the same.
For applications such as document analysis or market basket analysis, the
measure is expected to remain invariant under the null addition operation.
Otherwise, the relationship between words may disappear simply by adding
enough documents that contain neither word! Examples of measures
that satisfy this property include cosine (IS) and Jaccard (ζ) measures, while
those that violate this property include interest factor, PS, odds ratio, and
the φ-coefficient.
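Definition 6.7 is easy to test numerically: increase f00 and see which measures move. The sketch below does this for the {p, q} counts of Table 6.9; the choice of 1000 added null transactions is arbitrary, and the helper names are assumptions.

```python
from math import sqrt


def cosine(f11, f10, f01, f00):
    return f11 / sqrt((f11 + f10) * (f11 + f01))


def jaccard(f11, f10, f01, f00):
    return f11 / (f11 + f10 + f01)


def interest(f11, f10, f01, f00):
    n = f11 + f10 + f01 + f00
    return n * f11 / ((f11 + f10) * (f11 + f01))


before = (880, 50, 50, 20)    # {p, q} counts from Table 6.9
after = (880, 50, 50, 1020)   # same table with 1000 extra null transactions

for measure in (cosine, jaccard, interest):
    print(f"{measure.__name__:9s} {measure(*before):.4f} -> {measure(*after):.4f}")
```

Cosine and Jaccard never read f00, so they are unaffected, while interest factor roughly doubles because N appears in its numerator.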
Scaling Property Table 6.16 shows the contingency tables for gender and
the grades achieved by students enrolled in a particular course in 1993 and
2004. The data in these tables show that the number of male students has
doubled since 1993, while the number of female students has increased by a
factor of 3. However, the male students in 2004 are not performing any better
than those in 1993 because the ratio of male students who achieve a high
grade to those who achieve a low grade is still the same, i.e., 3:4. Similarly,
the female students in 2004 are performing no better than those in 1993. The
association between grade and gender is expected to remain unchanged despite
changes in the sampling distribution.
Table 6.16. The grade-gender example.
(a) Sample data from 1993.
         Male    Female
High     30      20        50
Low      40      10        50
         70      30        100
(b) Sample data from 2004.
         Male    Female
High     60      60        120
Low      80      30        110
         140     90        230
Table 6.17. Properties of symmetric measures.
Symbol   Measure                Inversion   Null Addition   Scaling
φ        φ-coefficient          Yes         No              No
α        odds ratio             Yes         No              Yes
κ        Cohen’s                Yes         No              No
I        Interest               No          No              No
IS       Cosine                 No          Yes             No
PS       Piatetsky-Shapiro’s    Yes         No              No
S        Collective strength    Yes         No              No
ζ        Jaccard                No          Yes             No
h        All-confidence         No          No              No
s        Support                No          No              No
Definition 6.8 (Scaling Invariance Property). An objective measure M
is invariant under the row/column scaling operation if $M(T) = M(T')$, where
T is a contingency table with frequency counts $[f_{11}; f_{10}; f_{01}; f_{00}]$, $T'$ is a
contingency table with scaled frequency counts $[k_1 k_3 f_{11}; k_2 k_3 f_{10}; k_1 k_4 f_{01};
k_2 k_4 f_{00}]$, and $k_1, k_2, k_3, k_4$ are positive constants.
From Table 6.17, notice that only the odds ratio (α) is invariant under
the row and column scaling operations. All other measures such as the
φ-coefficient, κ, IS, interest factor, and collective strength (S) change their
values when the rows and columns of the contingency table are rescaled. Although
we do not discuss the properties of asymmetric measures (such as confidence,
J-measure, Gini index, and conviction), it is clear that such measures do not
preserve their values under inversion and row/column scaling operations, but
are invariant under the null addition operation.
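The grade-gender data in Table 6.16 give a quick numerical check of the scaling property. The sketch below, with assumed helper names, compares the odds ratio and the φ-coefficient for the 1993 and 2004 tables, reading the cells as (High, Male), (High, Female), (Low, Male), (Low, Female).

```python
from math import sqrt


def odds_ratio(f11, f10, f01, f00):
    return (f11 * f00) / (f10 * f01)


def phi(f11, f10, f01, f00):
    denominator = sqrt((f11 + f10) * (f11 + f01) * (f01 + f00) * (f10 + f00))
    return (f11 * f00 - f01 * f10) / denominator


# The Male column is doubled and the Female column is tripled from 1993 to 2004.
t1993 = (30, 20, 40, 10)
t2004 = (60, 60, 80, 30)

print(f"odds ratio: {odds_ratio(*t1993):.3f} vs {odds_ratio(*t2004):.3f}")
print(f"phi:        {phi(*t1993):.4f} vs {phi(*t2004):.4f}")
```

The scaling constants cancel in the odds ratio, so it is the same for both years, whereas φ shifts even though the within-gender grade ratios are unchanged.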
6.7.2 Measures beyond Pairs of Binary Variables
The measures shown in Tables 6.11 and 6.12 are defined for pairs of binary
variables (e.g., 2-itemsets or association rules). However, many of them, such as
support and all-confidence, are also applicable to larger-sized itemsets. Other
measures, such as interest factor, IS, PS, and Jaccard coefficient, can be
extended to more than two variables using the frequency tables tabulated in a
multidimensional contingency table. An example of a three-dimensional
contingency table for a, b, and c is shown in Table 6.18. Each entry $f_{ijk}$ in this
table represents the number of transactions that contain a particular
combination of items a, b, and c. For example, $f_{101}$ is the number of transactions
that contain a and c, but not b. On the other hand, a marginal frequency
such as $f_{1+1}$ is the number of transactions that contain a and c, irrespective
of whether b is present in the transaction.
Table 6.18. Example of a three-dimensional contingency table.
$c$               $b$          $\overline{b}$
$a$               $f_{111}$    $f_{101}$         $f_{1+1}$
$\overline{a}$    $f_{011}$    $f_{001}$         $f_{0+1}$
                  $f_{+11}$    $f_{+01}$         $f_{++1}$

$\overline{c}$    $b$          $\overline{b}$
$a$               $f_{110}$    $f_{100}$         $f_{1+0}$
$\overline{a}$    $f_{010}$    $f_{000}$         $f_{0+0}$
                  $f_{+10}$    $f_{+00}$         $f_{++0}$
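Given a k-itemset $\{i_1, i_2, \ldots, i_k\}$, the condition for statistical independence
can be stated as follows:
\[ f_{i_1 i_2 \ldots i_k} = \frac{f_{i_1 + \ldots +} \times f_{+ i_2 \ldots +} \times \ldots \times f_{+ + \ldots i_k}}{N^{k-1}}. \tag{6.12} \]
With this definition, we can extend objective measures such as interest factor
and PS, which are based on deviations from statistical independence, to more
than two variables:
\[ I = \frac{N^{k-1} \times f_{i_1 i_2 \ldots i_k}}{f_{i_1 + \ldots +} \times f_{+ i_2 \ldots +} \times \ldots \times f_{+ + \ldots i_k}} \]
\[ PS = \frac{f_{i_1 i_2 \ldots i_k}}{N} - \frac{f_{i_1 + \ldots +} \times f_{+ i_2 \ldots +} \times \ldots \times f_{+ + \ldots i_k}}{N^k} \]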
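The extended definitions of I and PS translate directly into code. The sketch below computes both for a k-itemset from a list of transactions; the eight toy transactions and the function name `extended_measures` are hypothetical and serve only to exercise the formulas.

```python
def extended_measures(transactions, itemset):
    """Interest factor and PS of a k-itemset, extended via Equation 6.12."""
    n = len(transactions)
    k = len(itemset)
    # Joint count: transactions that contain every item of the itemset.
    joint = sum(1 for t in transactions if itemset <= t)
    # Product of the k single-item marginal counts.
    marginal_product = 1
    for item in itemset:
        marginal_product *= sum(1 for t in transactions if item in t)
    interest = n ** (k - 1) * joint / marginal_product
    ps = joint / n - marginal_product / n ** k
    return interest, ps


# Hypothetical toy data set, not taken from the text.
transactions = [
    {"a", "b", "c"}, {"a", "b", "c"}, {"a", "b"}, {"a", "c"},
    {"b", "c"}, {"a"}, {"b"}, {"c"},
]
interest, ps = extended_measures(transactions, {"a", "b", "c"})
print(f"I = {interest:.3f}, PS = {ps:.6f}")
```

In this toy data each item appears in 5 of the 8 transactions and all three appear together in 2, so I = 64 × 2 / 125 = 1.024 and PS = 2/8 − 125/512, both indicating a slight positive deviation from independence.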
Another approach is to define the objective measure as the maximum,
minimum, or average value for the associations between pairs of items in a
pattern. For example, given a k-itemset $X = \{i_1, i_2, \ldots, i_k\}$, we may define the
φ-coefficient for X as the average φ-coefficient between every pair of items
$(i_p, i_q)$ in X. However, because the measure considers only pairwise
associations, it may not capture all the underlying relationships within a pattern.
Analysis of multidimensional contingency tables is more complicated
because of the presence of partial associations in the data. For example, some
associations may appear or disappear when conditioned upon the value of
certain variables. This problem is known as Simpson’s paradox and is described
in the next section. More sophisticated statistical techniques are available to
analyze such relationships, e.g., loglinear models, but these techniques are
beyond the scope of this book.
Table 6.19. A two-way contingency table between the sale of high-definition television and exercise
machine.
              Buy Exercise Machine
Buy HDTV      Yes      No
Yes           99       81       180
No            54       66       120
              153      147      300
Table 6.20. Example of a three-way contingency table.
Customer Group      Buy HDTV    Buy Exercise Machine    Total
                                Yes       No
College Students    Yes         1         9             10
                    No          4         30            34
Working Adult       Yes         98        72            170
                    No          50        36            86
6.7.3 Simpson’s Paradox
It is important to exercise caution when interpreting the association between
variables because the observed relationship may be influenced by the presence
of other confounding factors, i.e., hidden variables that are not included in
the analysis. In some cases, the hidden variables may cause the observed
relationship between a pair of variables to disappear or reverse its direction, a
phenomenon that is known as Simpson’s paradox. We illustrate the nature of
this paradox with the following example.
Consider the relationship between the sale of high-definition television
(HDTV) and exercise machine, as shown in Table 6.19. The rule {HDTV=Yes}
−→ {Exercise machine=Yes} has a confidence of 99/180 = 55% and the rule
{HDTV=No} −→ {Exercise machine=Yes} has a confidence of 54/120 = 45%.
Together, these rules suggest that customers who buy high-definition
televisions are more likely to buy exercise machines than those who do not buy
high-definition televisions.
However, a deeper analysis reveals that the sales of these items depend
on whether the customer is a college student or a working adult. Table 6.20
summarizes the relationship between the sale of HDTVs and exercise machines
among college students and working adults. Notice that the support counts
given in the table for college students and working adults sum up to the fre
quencies shown in Table 6.19. Furthermore, there are more working adults
384
6.7 Evaluation of Association Patterns
than college students who buy these items. For college students:
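\[ c(\{\text{HDTV=Yes}\} \longrightarrow \{\text{Exercise machine=Yes}\}) = 1/10 = 10\%, \]
\[ c(\{\text{HDTV=No}\} \longrightarrow \{\text{Exercise machine=Yes}\}) = 4/34 = 11.8\%, \]
while for working adults:
\[ c(\{\text{HDTV=Yes}\} \longrightarrow \{\text{Exercise machine=Yes}\}) = 98/170 = 57.6\%, \]
\[ c(\{\text{HDTV=No}\} \longrightarrow \{\text{Exercise machine=Yes}\}) = 50/86 = 58.1\%. \]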
The rules suggest that, for each group, customers who do not buy
high-definition televisions are more likely to buy exercise machines, which
contradicts the conclusion drawn when data from the two customer groups are
pooled together. Even if alternative measures such as correlation, odds ratio,
or interest are applied, we still find that the sales of HDTVs and exercise
machines are positively correlated in the combined data but negatively
correlated in the stratified data (see Exercise 20 on page 414). This reversal
in the direction of association is known as Simpson’s paradox.
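The reversal can be checked directly from the counts in Tables 6.19 and 6.20. The following is a minimal sketch; the dictionary layout and the helper names `confidence` and `pool` are illustrative choices, not part of the text.

```python
# Counts from Table 6.20: (HDTV, exercise machine) -> number of customers.
strata = {
    "college students": {("yes", "yes"): 1, ("yes", "no"): 9,
                         ("no", "yes"): 4, ("no", "no"): 30},
    "working adults": {("yes", "yes"): 98, ("yes", "no"): 72,
                       ("no", "yes"): 50, ("no", "no"): 36},
}


def confidence(counts, hdtv):
    """Confidence of the rule {HDTV=hdtv} -> {Exercise machine=Yes}."""
    bought = counts[(hdtv, "yes")]
    total = bought + counts[(hdtv, "no")]
    return bought / total


def pool(tables):
    """Add the cell counts of several contingency tables (gives Table 6.19)."""
    pooled = {}
    for counts in tables:
        for cell, n in counts.items():
            pooled[cell] = pooled.get(cell, 0) + n
    return pooled


pooled = pool(strata.values())

for name, counts in list(strata.items()) + [("pooled", pooled)]:
    print(f"{name:17s} HDTV=Yes: {confidence(counts, 'yes'):.3f}  "
          f"HDTV=No: {confidence(counts, 'no'):.3f}")
```

Within each stratum the HDTV=No rule has the higher confidence, while in the pooled table the HDTV=Yes rule does, which is the reversal described above.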
The paradox can be explained in the following way. Notice that most
customers who buy HDTVs are working adults. Working adults are also the
largest group of customers who buy exercise machines. Because about 85% of
the customers are working adults, the observed relationship between HDTV
and exercise machine turns out to be stronger in the combined data than
it would have been had the data been stratified. This can also be illustrated
mathematically as follows. Suppose
\[ a/b < c/d \quad \text{and} \quad p/q < r/s, \]
where a/b and p/q may represent the confidence of the rule A −→ B in two
different strata, while c/d and r/s may represent the confidence of the rule
Ā −→ B (where Ā denotes the absence of A) in the same two strata. When the
data is pooled together, the confidence values of the rules in the combined
data are (a + p)/(b + q) and (c + r)/(d + s), respectively. Simpson’s paradox
occurs when
\[ \frac{a + p}{b + q} > \frac{c + r}{d + s}, \]
thus leading to the wrong conclusion about the relationship between the
variables. The lesson here is that proper stratification is needed to avoid
generating spurious patterns resulting from Simpson’s paradox. For example,
market basket data from a major supermarket chain should be stratified
according to store locations, while medical records from various patients
should be stratified according to confounding factors such as age and gender.
Figure 6.29. Support distribution of items in the census data set.
[Plot: x-axis "Items sorted by support" (0 to 2500); y-axis "Support" (0 to 5, ×10^4).]
6.8 Effect of Skewed Support Distribution
The performance of many association analysis algorithms is influenced by
properties of their input data. For example, the computational complexity of
the Apriori algorithm depends on properties such as the number of items in
the data and average transaction width. This section examines another
important property that has significant influence on the performance of
association analysis algorithms as well as the quality of extracted patterns.
More specifically, we focus on data sets with skewed support distributions,
where most of the items have relatively low to moderate frequencies, but a
small number of them have very high frequencies.
An example of a real data set that exhibits such a distribution is shown in
Figure 6.29. The data, taken from the PUMS (Public Use Microdata Sample)
census data, contains 49,046 records and 2113 asymmetric binary variables.
We shall treat the asymmetric binary variables as items and records as
transactions in the remainder of this section. While more than 80% of the items
have support less than 1%, a handful of them have support greater than 90%.
Table 6.21. Grouping the items in the census data set based on their support values.
Group              G1       G2          G3
Support            < 1%     1% − 90%    > 90%
Number of Items    1735     358         20
To illustrate the effect of skewed support distribution on frequent itemset
mining, we divide the items into three groups, G1, G2, and G3, according to
their support levels. The number of items that belong to each group is shown
in Table 6.21.
Choosing the right support threshold for mining this data set can be quite
tricky. If we set the threshold too high (e.g., 20%), then we may miss many
interesting patterns involving the low-support items from G1. In market
basket analysis, such low-support items may correspond to expensive products
(such as jewelry) that are seldom bought by customers, but whose patterns
are still interesting to retailers. Conversely, when the threshold is set too
low, it becomes difficult to find the association patterns for the following
reasons. First, the computational and memory requirements of existing
association analysis algorithms increase considerably with low support
thresholds. Second, the number of extracted patterns also increases
substantially with low support thresholds. Third, we may extract many spurious
patterns that relate a high-frequency item such as milk to a low-frequency
item such as caviar. Such patterns, which are called cross-support patterns,
are likely to be spurious because their correlations tend to be weak. For
example, at a support threshold equal to 0.05%, there are 18,847 frequent
pairs involving items from G1 and G3. Out of these, 93% are cross-support
patterns; i.e., the patterns contain items from both G1 and G3. The maximum
correlation obtained from the cross-support patterns is 0.029, which is much
lower than the maximum correlation obtained from frequent patterns involving
items from the same group (which is as high as 1.0). A similar statement can
be made about many other interestingness measures discussed in the previous
section. This example shows that a large number of weakly correlated
cross-support patterns can be generated when the support threshold is
sufficiently low. Before presenting a methodology for eliminating such
patterns, we formally define the concept of cross-support patterns.
Definition 6.9 (Cross-Support Pattern). A cross-support pattern is an
itemset X = {i1, i2, . . . , ik} whose support ratio
\[ r(X) = \frac{\min[s(i_1), s(i_2), \ldots, s(i_k)]}{\max[s(i_1), s(i_2), \ldots, s(i_k)]}, \tag{6.13} \]
is less than a user-specified threshold hc.
Example 6.4. Suppose the support for milk is 70%, while the support for
sugar is 10% and caviar is 0.04%. Given hc = 0.01, the frequent itemset
{milk, sugar, caviar} is a cross-support pattern because its support ratio is
\[ r = \frac{\min[0.7, 0.1, 0.0004]}{\max[0.7, 0.1, 0.0004]} = \frac{0.0004}{0.7} = 0.00057 < 0.01. \]
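The support ratio of Equation 6.13 needs only the supports of the individual items. The short sketch below recomputes Example 6.4; the function names `support_ratio` and `is_cross_support` are illustrative and do not come from the text.

```python
def support_ratio(item_supports):
    """Support ratio r(X) of Equation 6.13: min item support / max item support."""
    values = list(item_supports.values())
    return min(values) / max(values)


def is_cross_support(item_supports, hc):
    """True if the itemset's support ratio is below the threshold hc."""
    return support_ratio(item_supports) < hc


# Item supports from Example 6.4.
supports = {"milk": 0.70, "sugar": 0.10, "caviar": 0.0004}
r = support_ratio(supports)
print(f"r = {r:.5f}, cross-support at hc = 0.01: {is_cross_support(supports, 0.01)}")
```

Only the smallest and largest item supports matter, so adding sugar to {milk, caviar} leaves the ratio unchanged.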
Existing measures such as support and confidence may not be sufficient
to eliminate cross-support patterns, as illustrated by the data set shown in
Figure 6.30. Assuming that hc = 0.3, the itemsets {p, q}, {p, r}, and {p, q, r}
are cross-support patterns because their support ratios, which are equal to
0.2, are less than the threshold hc. Although we can apply a high support
threshold, say, 20%, to eliminate the cross-support patterns, this may come
at the expense of discarding other interesting patterns such as the strongly
correlated itemset {q, r}, which has support equal to 16.7%.
Confidence pruning also does not help because the confidence of the rules
extracted from cross-support patterns can be very high. For example, the
confidence for {q} −→ {p} is 80% even though {p, q} is a cross-support
pattern. The fact that the cross-support pattern can produce a high-confidence
rule should not come as a surprise because one of its items (p) appears very
frequently in the data. Therefore, p is expected to appear in many of the
transactions that contain q. Meanwhile, the rule {q} −→ {r} also has high
confidence even though {q, r} is not a cross-support pattern. This example
demonstrates the difficulty of using the confidence measure to distinguish
between rules extracted from cross-support and non-cross-support patterns.
Returning to the previous example, notice that the rule {p} −→ {q} has
very low confidence because most of the transactions that contain p do not
contain q. In contrast, the rule {r} −→ {q}, which is derived from the pattern
{q, r}, has very high confidence. This observation suggests that cross-support
patterns can be detected by examining the lowest confidence rule that can be
extracted from a given itemset. The proof of this statement can be understood
as follows.
Figure 6.30. A transaction data set containing three items, p, q, and r, where
p is a high-support item and q and r are low-support items.
[Binary table with columns p, q, and r and 30 transactions; p appears in 25 of
them, and q and r appear in 5 each.]
1. Recall the following anti-monotone property of confidence:
\[ \mathrm{conf}(\{i_1 i_2\} \longrightarrow \{i_3, i_4, \ldots, i_k\}) \le \mathrm{conf}(\{i_1 i_2 i_3\} \longrightarrow \{i_4, i_5, \ldots, i_k\}). \]
This property suggests that confidence never increases as we shift more
items from the left- to the right-hand side of an association rule. Because
of this property, the lowest confidence rule extracted from a frequent
itemset contains only one item on its left-hand side. We denote the set
of all rules with only one item on their left-hand side as R1.
2. Given a frequent itemset {i1, i2, . . . , ik}, the rule
\[ \{i_j\} \longrightarrow \{i_1, i_2, \ldots, i_{j-1}, i_{j+1}, \ldots, i_k\} \]
has the lowest confidence in R1 if s(ij) = max[s(i1), s(i2), . . . , s(ik)].
This follows directly from the definition of confidence as the ratio
between the rule’s support and the support of the rule antecedent.
3. Summarizing the previous points, the lowest confidence attainable from
a frequent itemset {i1, i2, . . . , ik} is
\[ \frac{s(\{i_1, i_2, \ldots, i_k\})}{\max[s(i_1), s(i_2), \ldots, s(i_k)]}. \]
This expression is also known as the h-confidence or all-confidence
measure. Because of the anti-monotone property of support, the numerator
of the h-confidence measure is bounded by the minimum support
of any item that appears in the frequent itemset. In other words, the
h-confidence of an itemset X = {i1, i2, . . . , ik} must not exceed the
following expression:
\[ \text{h-confidence}(X) \le \frac{\min[s(i_1), s(i_2), \ldots, s(i_k)]}{\max[s(i_1), s(i_2), \ldots, s(i_k)]}. \]
Note the equivalence between the upper bound of h-confidence and the
support ratio (r) given in Equation 6.13. Because the support ratio for
a cross-support pattern is always less than hc, the h-confidence of the
pattern is also guaranteed to be less than hc.
Therefore, cross-support patterns can be eliminated by ensuring that the
h-confidence values for the patterns exceed hc. As a final note, it is worth
mentioning that the advantages of using h-confidence go beyond eliminating
cross-support patterns. The measure is also anti-monotone, i.e.,
\[ \text{h-confidence}(\{i_1, i_2, \ldots, i_k\}) \ge \text{h-confidence}(\{i_1, i_2, \ldots, i_{k+1}\}), \]
and thus can be incorporated directly into the mining algorithm. Furthermore,
h-confidence ensures that the items contained in an itemset are strongly
associated with each other. For example, suppose the h-confidence of an itemset
X is 80%. If one of the items in X is present in a transaction, there is at
least an 80% chance that the rest of the items in X also belong to the same
transaction. Such strongly associated patterns are called hyperclique patterns.
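The h-confidence test can be sketched in a few lines. The transaction list below is a hypothetical reconstruction that only matches the statistics quoted for Figure 6.30 (30 transactions; p in 25, q and r in 5 each; {p, q} in 4; {q, r} in 5); it is not claimed to reproduce the figure row by row, and the function names are illustrative.

```python
from itertools import combinations

# Hypothetical data consistent with the statistics quoted for Figure 6.30.
transactions = ([{"q", "r"}] + [{"p", "q", "r"}] * 4
                + [{"p"}] * 21 + [set()] * 4)


def support(itemset, data):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in data) / len(data)


def support_ratio(itemset, data):
    """Equation 6.13: min single-item support / max single-item support."""
    singles = [support({i}, data) for i in itemset]
    return min(singles) / max(singles)


def h_confidence(itemset, data):
    """Lowest confidence of any rule from `itemset`: s(X) / max item support."""
    return support(itemset, data) / max(support({i}, data) for i in itemset)


hc = 0.3
for k in (2, 3):
    for x in combinations(["p", "q", "r"], k):
        r = support_ratio(x, transactions)
        h = h_confidence(x, transactions)
        # h <= r always holds, so any pattern with h >= hc is not cross-support.
        verdict = "keep" if h >= hc else "prune"
        print(f"{{{', '.join(x)}}}: support ratio = {r:.2f}, "
              f"h-confidence = {h:.2f} -> {verdict}")
```

On this data only {q, r} survives the threshold hc = 0.3, while every itemset containing p is pruned, even though the rule {q} −→ {p} has 80% confidence.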
6.9 Bibliographic Notes
The association rule mining task was first introduced by Agrawal et al. in
[228, 229] to discover interesting relationships among items in market basket
transactions. Since its inception, extensive studies have been conducted to
address the various conceptual, implementation, and application issues
pertaining to the association analysis task. A summary of the various research
activities in this area is shown in Figure 6.31.
Conceptual Issues
Research in conceptual issues is focused primarily on (1) developing a
framework to describe the theoretical underpinnings of association analysis,
(2) extending the formulation to handle new types of patterns, and (3)
extending the formulation to incorporate attribute types beyond asymmetric
binary data.
Following the pioneering work by Agrawal et al., there has been a vast
amount of research on developing a theory for the association analysis problem.
In [254], Gunopulos et al. showed a relation between the problem of finding
maximal frequent itemsets and the hypergraph transversal problem. An upper
bound on the complexity of the association analysis task was also derived.
Zaki et al. [334, 336] and Pasquier et al. [294] have applied formal concept
analysis to study the frequent itemset generation problem. The work by Zaki et
al. subsequently led them to introduce the notion of closed frequent itemsets
[336]. Friedman et al. have studied the association analysis problem in the
context of bump hunting in multidimensional space [252]. More specifically,
they consider frequent itemset generation as the task of finding high
probability density regions in multidimensional space.
Over the years, new types of patterns have been defined, such as profile
association rules [225], cyclic association rules [290], fuzzy association rules
[273], exception rules [316], negative association rules [238, 304], weighted
association rules [240, 300], dependence rules [308], peculiar rules [340],
inter-transaction association rules [250, 323], and partial classification
rules [231, 285]. Other types of patterns include closed itemsets [294, 336],
maximal itemsets [234], hyperclique patterns [330], support envelopes [314],
emerging patterns [246], and contrast sets [233]. Association analysis has also
been successfully applied to sequential [230, 312], spatial [266], and
graph-based [268, 274, 293, 331, 335] data. The concept of cross-support
pattern was first introduced by Xiong et al. in [330]. An efficient algorithm
(called Hyperclique Miner) that automatically eliminates cross-support patterns
was also proposed by the authors.
Substantial research has been conducted to extend the original association
rule formulation to nominal [311], ordinal [281], interval [284], and ratio [253,
255, 311, 325, 339] attributes. One of the key issues is how to define the support
measure for these attributes. A methodology was proposed by Steinbach et
al. [315] to extend the traditional notion of support to more general patterns
and attribute types.
Figure 6.31. A summary of the various research activities in association analysis.
[Tree diagram: Research Issues in Mining Association Patterns, with top-level
branches Conceptual Issues, Implementation Issues, and Application Issues.]
Implementation Issues
Research activities in this area revolve around (1) integrating the mining
capability into existing database technology, (2) developing efficient and
scalable mining algorithms, (3) handling user-specified or domain-specific
constraints, and (4) postprocessing the extracted patterns.
There are several advantages to integrating association analysis into
existing database technology. First, it can make use of the indexing and query
processing capabilities of the database system. Second, it can also exploit the
DBMS support for scalability, checkpointing, and parallelization [301]. The
SETM algorithm developed by Houtsma et al. [265] was one of the earliest
algorithms to support association rule discovery via SQL queries. Since then,
numerous methods have been developed to provide capabilities for mining
association rules in database systems. For example, the DMQL [258] and MSQL
[267] query languages extend the basic SQL with new operators for mining
association rules. The Mine Rule operator [283] is an expressive SQL operator
that can handle both clustered attributes and item hierarchies. Tsur et al.
[322] developed a generate-and-test approach called query flocks for mining
association rules. A distributed OLAP-based infrastructure was developed by
Chen et al. [241] for mining multilevel association rules.
Dunkel and Soparkar [248] investigated the time and storage complexity
of the Apriori algorithm. The FP-growth algorithm was developed by Han et
al. in [259]. Other algorithms for mining frequent itemsets include the DHP
(dynamic hashing and pruning) algorithm proposed by Park et al. [292] and
the Partition algorithm developed by Savasere et al. [303]. A sampling-based
frequent itemset generation algorithm was proposed by Toivonen [320]. The
algorithm requires only a single pass over the data, but it can produce more
candidate itemsets than necessary. The Dynamic Itemset Counting (DIC)
algorithm [239] makes only 1.5 passes over the data and generates fewer
candidate itemsets than the sampling-based algorithm. Other notable algorithms
include the tree-projection algorithm [223] and H-Mine [295]. Survey articles
on frequent itemset generation algorithms can be found in [226, 262]. A
repository of data sets and algorithms is available at the Frequent Itemset
Mining Implementations (FIMI) repository (http://fimi.cs.helsinki.fi). Parallel
algorithms for mining association patterns have been developed by various
authors [224, 256, 287, 306, 337]. A survey of such algorithms can be found in
[333]. Online and incremental versions of association rule mining algorithms
have also been proposed by Hidber [260] and Cheung et al. [242].
Srikant et al. [313] have considered the problem of mining association rules
in the presence of boolean constraints such as the following:
(Cookies ∧ Milk) ∨ (descendents(Cookies) ∧ ¬ancestors(Wheat Bread))
Given such a constraint, the algorithm looks for rules that contain both
cookies and milk, or rules that contain the descendent items of cookies but not
ancestor items of wheat bread. Singh et al. [310] and Ng et al. [288] have also
developed alternative techniques for constraint-based association rule
mining. Constraints can also be imposed on the support for different itemsets.
This problem was investigated by Wang et al. [324], Liu et al. [279], and
Seno et al. [305].
One potential problem with association analysis is the large number of
patterns that can be generated by current algorithms. To overcome this
problem, methods to rank, summarize, and filter patterns have been developed.
Toivonen et al. [321] proposed the idea of eliminating redundant rules using
structural rule covers and grouping the remaining rules using clustering.
Liu et al. [280] applied the statistical chi-square test to prune spurious
patterns and summarized the remaining patterns using a subset of the patterns
called direction setting rules. The use of objective measures to filter
patterns has been investigated by many authors, including Brin et al. [238],
Bayardo and Agrawal [235], Aggarwal and Yu [227], and DuMouchel and Pregibon
[247]. The properties for many of these measures were analyzed by
Piatetsky-Shapiro [297], Kamber and Shinghal [270], Hilderman and Hamilton
[261], and Tan et al. [318]. The grade-gender example used to highlight the
importance of the row and column scaling invariance property was heavily
influenced by the discussion given in [286] by Mosteller. Meanwhile, the
tea-coffee example illustrating the limitation of confidence was motivated by
an example given in [238] by Brin et al. Because of the limitation of
confidence, Brin et al. [238] proposed the idea of using interest factor as a
measure of interestingness. The all-confidence measure was proposed by
Omiecinski [289]. Xiong et al. [330] introduced the cross-support property and
showed that the all-confidence measure can be used to eliminate cross-support
patterns. A key difficulty in using alternative objective measures besides
support is their lack of a monotonicity property, which makes it difficult to
incorporate the measures directly into the mining algorithms. Xiong et al.
[328] have proposed an efficient method for mining correlations by introducing
an upper bound function to the φ-coefficient. Although the measure is
non-monotone, it has an upper bound expression that can be exploited for the
efficient mining of strongly correlated item pairs.
Fabris and Freitas [249] have proposed a method for discovering
interesting associations by detecting the occurrences of Simpson’s paradox [309].
Megiddo and Srikant [282] described an approach for validating the extracted
patterns using hypothesis testing methods. A resampling-based technique was
also developed to avoid generating spurious patterns because of the multiple
comparison problem. Bolton et al. [237] have applied the Benjamini-Hochberg
[236] and Bonferroni correction methods to adjust the p-values of discovered
patterns in market basket data. Alternative methods for handling the multiple
comparison problem were suggested by Webb [326] and Zhang et al. [338].
Application of subjective measures to association analysis has been
investigated by many authors. Silberschatz and Tuzhilin [307] presented two
principles under which a rule can be considered interesting from a subjective
point of view. The concept of unexpected condition rules was introduced by Liu
et al. in [277]. Cooley et al. [243] analyzed the idea of combining soft belief
sets using the Dempster-Shafer theory and applied this approach to identify
contradictory and novel association patterns in Web data. Alternative
approaches include using Bayesian networks [269] and neighborhood-based
information [245] to identify subjectively interesting patterns.
Visualization also helps the user to quickly grasp the underlying
structure of the discovered patterns. Many commercial data mining tools display
the complete set of rules (which satisfy both support and confidence
threshold criteria) as a two-dimensional plot, with each axis corresponding to
the antecedent or consequent itemsets of the rule. Hofmann et al. [263] proposed
using Mosaic plots and Double Decker plots to visualize association rules. This
approach can visualize not only a particular rule, but also the overall
contingency table between itemsets in the antecedent and consequent parts of
the rule. Nevertheless, this technique assumes that the rule consequent
consists of only a single attribute.
Application Issues
Association analysis has been applied to a variety of application domains such
as Web mining [296, 317], document analysis [264], telecommunication alarm
diagnosis [271], network intrusion detection [232, 244, 275], and bioinformatics
[302, 327]. Applications of association and correlation pattern analysis to
Earth Science studies have been investigated in [298, 299, 319].
Association patterns have also been applied to other learning problems
such as classification [276, 278], regression [291], and clustering [257, 329, 332].
A comparison between classification and association rule mining was made
by Freitas in his position paper [251]. The use of association patterns for
clustering has been studied by many authors including Han et al. [257], Kosters
et al. [272], Yang et al. [332], and Xiong et al. [329].
Bibliography
[223] R. C. Agarwal, C. C. Aggarwal, and V. V. V. Prasad. A Tree Projection Algorithm
for Generation of Frequent Itemsets. Journal of Parallel and Distributed Computing
(Special Issue on High Performance Data Mining), 61(3):350–371, 2001.
[224] R. C. Agarwal and J. C. Shafer. Parallel Mining of Association Rules. IEEE
Transactions on Knowledge and Data Engineering, 8(6):962–969, March 1998.
[225] C. C. Aggarwal, Z. Sun, and P. S. Yu. Online Generation of Profile Association Rules.
In Proc. of the 4th Intl. Conf. on Knowledge Discovery and Data Mining, pages 129–
133, New York, NY, August 1996.
[226] C. C. Aggarwal and P. S. Yu. Mining Large Itemsets for Association Rules. Data
Engineering Bulletin, 21(1):23–31, March 1998.
[227] C. C. Aggarwal and P. S. Yu. Mining Associations with the Collective Strength
Approach. IEEE Trans. on Knowledge and Data Engineering, 13(6):863–873,
January/February 2001.
[228] R. Agrawal, T. Imielinski, and A. Swami. Database mining: A performance
perspective. IEEE Transactions on Knowledge and Data Engineering, 5:914–925, 1993.
[229] R. Agrawal, T. Imielinski, and A. Swami. Mining association rules between sets of
items in large databases. In Proc. ACM SIGMOD Intl. Conf. Management of Data,
pages 207–216, Washington, DC, 1993.
[230] R. Agrawal and R. Srikant. Mining Sequential Patterns. In Proc. of Intl. Conf. on
Data Engineering, pages 3–14, Taipei, Taiwan, 1995.
[231] K. Ali, S. Manganaris, and R. Srikant. Partial Classification using Association Rules.
In Proc. of the 3rd Intl. Conf. on Knowledge Discovery and Data Mining, pages 115–
118, Newport Beach, CA, August 1997.
[232] D. Barbará, J. Couto, S. Jajodia, and N. Wu. ADAM: A Testbed for Exploring the
Use of Data Mining in Intrusion Detection. SIGMOD Record, 30(4):15–24, 2001.
[233] S. D. Bay and M. Pazzani. Detecting Group Differences: Mining Contrast Sets. Data
Mining and Knowledge Discovery, 5(3):213–246, 2001.
[234] R. Bayardo. Efficiently Mining Long Patterns from Databases. In Proc. of 1998 ACM
SIGMOD Intl. Conf. on Management of Data, pages 85–93, Seattle, WA, June 1998.
[235] R. Bayardo and R. Agrawal. Mining the Most Interesting Rules. In Proc. of the 5th
Intl. Conf. on Knowledge Discovery and Data Mining, pages 145–153, San Diego, CA,
August 1999.
[236] Y. Benjamini and Y. Hochberg. Controlling the False Discovery Rate: A Practical
and Powerful Approach to Multiple Testing. Journal Royal Statistical Society B, 57
(1):289–300, 1995.
[237] R. J. Bolton, D. J. Hand, and N. M. Adams. Determining Hit Rate in Pattern Search.
In Proc. of the ESF Exploratory Workshop on Pattern Detection and Discovery in
Data Mining, pages 36–48, London, UK, September 2002.
[238] S. Brin, R. Motwani, and C. Silverstein. Beyond market baskets: Generalizing
association rules to correlations. In Proc. ACM SIGMOD Intl. Conf. Management of
Data, pages 265–276, Tucson, AZ, 1997.
[239] S. Brin, R. Motwani, J. Ullman, and S. Tsur. Dynamic Itemset Counting and
Implication Rules for market basket data. In Proc. of 1997 ACM-SIGMOD Intl. Conf.
on Management of Data, pages 255–264, Tucson, AZ, June 1997.
[240] C. H. Cai, A. Fu, C. H. Cheng, and W. W. Kwong. Mining Association Rules with
Weighted Items. In Proc. of IEEE Intl. Database Engineering and Applications Symp.,
pages 68–77, Cardiff, Wales, 1998.
[241] Q. Chen, U. Dayal, and M. Hsu. A Distributed OLAP infrastructure for E-Commerce.
In Proc. of the 4th IFCIS Intl. Conf. on Cooperative Information Systems, pages 209–
220, Edinburgh, Scotland, 1999.
[242] D. C. Cheung, S. D. Lee, and B. Kao. A General Incremental Technique for Maintaining
Discovered Association Rules. In Proc. of the 5th Intl. Conf. on Database Systems for
Advanced Applications, pages 185–194, Melbourne, Australia, 1997.
[243] R. Cooley, P. N. Tan, and J. Srivastava. Discovery of Interesting Usage Patterns
from Web Data. In M. Spiliopoulou and B. Masand, editors, Advances in Web Usage
Analysis and User Profiling, volume 1836, pages 163–182. Lecture Notes in Computer
Science, 2000.
[244] P. Dokas, L. Ertöz, V. Kumar, A. Lazarevic, J. Srivastava, and P. N. Tan. Data Mining
for Network Intrusion Detection. In Proc. NSF Workshop on Next Generation Data
Mining, Baltimore, MD, 2002.
[245] G. Dong and J. Li. Interestingness of discovered association rules in terms of
neighborhood-based unexpectedness. In Proc. of the 2nd Pacific-Asia Conf. on
Knowledge Discovery and Data Mining, pages 72–86, Melbourne, Australia, April 1998.
[246] G. Dong and J. Li. Efficient Mining of Emerging Patterns: Discovering Trends and
Differences. In Proc. of the 5th Intl. Conf. on Knowledge Discovery and Data Mining,
pages 43–52, San Diego, CA, August 1999.
[247] W. DuMouchel and D. Pregibon. Empirical Bayes Screening for Multi-Item
Associations. In Proc. of the 7th Intl. Conf. on Knowledge Discovery and Data
Mining, pages 67–76, San Francisco, CA, August 2001.
[248] B. Dunkel and N. Soparkar. Data Organization and Access for Efficient Data Mining.
In Proc. of the 15th Intl. Conf. on Data Engineering, pages 522–529, Sydney, Australia,
March 1999.
[249] C. C. Fabris and A. A. Freitas. Discovering surprising patterns by detecting occurrences
of Simpson’s paradox. In Proc. of the 19th SGES Intl. Conf. on Knowledge-Based
Systems and Applied Artificial Intelligence, pages 148–160, Cambridge, UK, December
1999.
[250] L. Feng, H. J. Lu, J. X. Yu, and J. Han. Mining inter-transaction associations with
templates. In Proc. of the 8th Intl. Conf. on Information and Knowledge Management,
pages 225–233, Kansas City, Missouri, Nov 1999.
[251] A. A. Freitas. Understanding the crucial differences between classification and
discovery of association rules: a position paper. SIGKDD Explorations, 2(1):65–69, 2000.
[252] J. H. Friedman and N. I. Fisher. Bump hunting in high-dimensional data. Statistics
and Computing, 9(2):123–143, April 1999.
[253] T. Fukuda, Y. Morimoto, S. Morishita, and T. Tokuyama. Mining Optimized
Association Rules for Numeric Attributes. In Proc. of the 15th Symp. on Principles of
Database Systems, pages 182–191, Montreal, Canada, June 1996.
[254] D. Gunopulos, R. Khardon, H. Mannila, and H. Toivonen. Data Mining, Hypergraph
Transversals, and Machine Learning. In Proc. of the 16th Symp. on Principles of
Database Systems, pages 209–216, Tucson, AZ, May 1997.
[255] E.-H. Han, G. Karypis, and V. Kumar. Min-Apriori: An Algorithm for Finding
Association Rules in Data with Continuous Attributes. http://www.cs.umn.edu/~han,
1997.
[256] E.-H. Han, G. Karypis, and V. Kumar. Scalable Parallel Data Mining for Association
Rules. In Proc. of 1997 ACM-SIGMOD Intl. Conf. on Management of Data, pages
277–288, Tucson, AZ, May 1997.
[257] E.-H. Han, G. Karypis, V. Kumar, and B. Mobasher. Clustering Based on Association
Rule Hypergraphs. In Proc. of the 1997 ACM SIGMOD Workshop on Research Issues
in Data Mining and Knowledge Discovery, Tucson, AZ, 1997.
[258] J. Han, Y. Fu, K. Koperski, W. Wang, and O. R. Zaïane. DMQL: A data mining query
language for relational databases. In Proc. of the 1996 ACM SIGMOD Workshop on
Research Issues in Data Mining and Knowledge Discovery, Montreal, Canada, June
1996.
[259] J. Han, J. Pei, and Y. Yin. Mining Frequent Patterns without Candidate Generation.
In Proc. ACM-SIGMOD Int. Conf. on Management of Data (SIGMOD’00), pages
1–12, Dallas, TX, May 2000.
[260] C. Hidber. Online Association Rule Mining. In Proc. of 1999 ACM-SIGMOD Intl.
Conf. on Management of Data, pages 145–156, Philadelphia, PA, 1999.
[261] R. J. Hilderman and H. J. Hamilton. Knowledge Discovery and Measures of Interest.
Kluwer Academic Publishers, 2001.
[262] J. Hipp, U. Guntzer, and G. Nakhaeizadeh. Algorithms for Association Rule Mining—
A General Survey. SigKDD Explorations, 2(1):58–64, June 2000.
[263] H. Hofmann, A. P. J. M. Siebes, and A. F. X. Wilhelm. Visualizing Association Rules
with Interactive Mosaic Plots. In Proc. of the 6th Intl. Conf. on Knowledge Discovery
and Data Mining, pages 227–235, Boston, MA, August 2000.
[264] J. D. Holt and S. M. Chung. Efficient Mining of Association Rules in Text Databases.
In Proc. of the 8th Intl. Conf. on Information and Knowledge Management, pages
234–242, Kansas City, Missouri, 1999.
[265] M. Houtsma and A. Swami. Setoriented Mining for Association Rules in Relational
Databases. In Proc. of the 11th Intl. Conf. on Data Engineering, pages 25–33, Taipei,
Taiwan, 1995.
[266] Y. Huang, S. Shekhar, and H. Xiong. Discovering Colocation Patterns from Spatial
Datasets: A General Approach. IEEE Trans. on Knowledge and Data Engineering, 16
(12):1472–1485, December 2004.
[267] T. Imielinski, A. Virmani, and A. Abdulghani. DataMine: Application Programming
Interface and Query Language for Database Mining. In Proc. of the 2nd Intl. Conf.
on Knowledge Discovery and Data Mining, pages 256–262, Portland, Oregon, 1996.
[268] A. Inokuchi, T. Washio, and H. Motoda. An Aprioribased Algorithm for Mining
Frequent Substructures from Graph Data. In Proc. of the 4th European Conf. of Prin
ciples and Practice of Knowledge Discovery in Databases, pages 13–23, Lyon, France,
2000.
[269] S. Jaroszewicz and D. Simovici. Interestingness of Frequent Itemsets Using Bayesian
Networks as Background Knowledge. In Proc. of the 10th Intl. Conf. on Knowledge
Discovery and Data Mining, pages 178–186, Seattle, WA, August 2004.
[270] M. Kamber and R. Shinghal. Evaluating the Interestingness of Characteristic Rules. In
Proc. of the 2nd Intl. Conf. on Knowledge Discovery and Data Mining, pages 263–266,
Portland, Oregon, 1996.
[271] M. Klemettinen. A Knowledge Discovery Methodology for Telecommunication Network
Alarm Databases. PhD thesis, University of Helsinki, 1999.
[272] W. A. Kosters, E. Marchiori, and A. Oerlemans. Mining Clusters with Association
Rules. In The 3rd Symp. on Intelligent Data Analysis (IDA99), pages 39–50, Amster
dam, August 1999.
[273] C. M. Kuok, A. Fu, and M. H. Wong. Mining Fuzzy Association Rules in Databases.
ACM SIGMOD Record, 27(1):41–46, March 1998.
399
Chapter 6 Association Analysis
[274] M. Kuramochi and G. Karypis. Frequent Subgraph Discovery. In Proc. of the 2001
IEEE Intl. Conf. on Data Mining, pages 313–320, San Jose, CA, November 2001.
[275] W. Lee, S. J. Stolfo, and K. W. Mok. Adaptive Intrusion Detection: A Data Mining
Approach. Artificial Intelligence Review, 14(6):533–567, 2000.
[276] W. Li, J. Han, and J. Pei. CMAR: Accurate and Efficient Classification Based on
Multiple Classassociation Rules. In Proc. of the 2001 IEEE Intl. Conf. on Data
Mining, pages 369–376, San Jose, CA, 2001.
[277] B. Liu, W. Hsu, and S. Chen. Using General Impressions to Analyze Discovered
Classification Rules. In Proc. of the 3rd Intl. Conf. on Knowledge Discovery and Data
Mining, pages 31–36, Newport Beach, CA, August 1997.
[278] B. Liu, W. Hsu, and Y. Ma. Integrating Classification and Association Rule Mining.
In Proc. of the 4th Intl. Conf. on Knowledge Discovery and Data Mining, pages 80–86,
New York, NY, August 1998.
[279] B. Liu, W. Hsu, and Y. Ma. Mining association rules with multiple minimum supports.
In Proc. of the 5th Intl. Conf. on Knowledge Discovery and Data Mining, pages 125–
134, San Diego, CA, August 1999.
[280] B. Liu, W. Hsu, and Y. Ma. Pruning and Summarizing the Discovered Associations. In
Proc. of the 5th Intl. Conf. on Knowledge Discovery and Data Mining, pages 125–134,
San Diego, CA, August 1999.
[281] A. Marcus, J. I. Maletic, and K.I. Lin. Ordinal association rules for error identifi
cation in data sets. In Proc. of the 10th Intl. Conf. on Information and Knowledge
Management, pages 589–591, Atlanta, GA, October 2001.
[282] N. Megiddo and R. Srikant. Discovering Predictive Association Rules. In Proc. of the
4th Intl. Conf. on Knowledge Discovery and Data Mining, pages 274–278, New York,
August 1998.
[283] R. Meo, G. Psaila, and S. Ceri. A New SQLlike Operator for Mining Association
Rules. In Proc. of the 22nd VLDB Conf., pages 122–133, Bombay, India, 1996.
[284] R. J. Miller and Y. Yang. Association Rules over Interval Data. In Proc. of 1997
ACMSIGMOD Intl. Conf. on Management of Data, pages 452–461, Tucson, AZ, May
1997.
[285] Y. Morimoto, T. Fukuda, H. Matsuzawa, T. Tokuyama, and K. Yoda. Algorithms for
mining association rules for binary segmentations of huge categorical databases. In
Proc. of the 24th VLDB Conf., pages 380–391, New York, August 1998.
[286] F. Mosteller. Association and Estimation in Contingency Tables. Journal of the Amer
ican Statistical Association, 63:1–28, 1968.
[287] A. Mueller. Fast sequential and parallel algorithms for association rule mining: A
comparison. Technical Report CSTR3515, University of Maryland, August 1995.
[288] R. T. Ng, L. V. S. Lakshmanan, J. Han, and A. Pang. Exploratory Mining and Pruning
Optimizations of Constrained Association Rules. In Proc. of 1998 ACMSIGMOD Intl.
Conf. on Management of Data, pages 13–24, Seattle, WA, June 1998.
[289] E. Omiecinski. Alternative Interest Measures for Mining Associations in Databases.
IEEE Trans. on Knowledge and Data Engineering, 15(1):57–69, January/February
2003.
[290] B. Ozden, S. Ramaswamy, and A. Silberschatz. Cyclic Association Rules. In Proc. of
the 14th Intl. Conf. on Data Eng., pages 412–421, Orlando, FL, February 1998.
[291] A. Ozgur, P. N. Tan, and V. Kumar. RBA: An Integrated Framework for Regression
based on Association Rules. In Proc. of the SIAM Intl. Conf. on Data Mining, pages
210–221, Orlando, FL, April 2004.
400
Bibliography
[292] J. S. Park, M.S. Chen, and P. S. Yu. An effective hashbased algorithm for mining
association rules. SIGMOD Record, 25(2):175–186, 1995.
[293] S. Parthasarathy and M. Coatney. Efficient Discovery of Common Substructures in
Macromolecules. In Proc. of the 2002 IEEE Intl. Conf. on Data Mining, pages 362–369,
Maebashi City, Japan, December 2002.
[294] N. Pasquier, Y. Bastide, R. Taouil, and L. Lakhal. Discovering frequent closed itemsets
for association rules. In Proc. of the 7th Intl. Conf. on Database Theory (ICDT’99),
pages 398–416, Jerusalem, Israel, January 1999.
[295] J. Pei, J. Han, H. J. Lu, S. Nishio, and S. Tang. HMine: HyperStructure Mining of
Frequent Patterns in Large Databases. In Proc. of the 2001 IEEE Intl. Conf. on Data
Mining, pages 441–448, San Jose, CA, November 2001.
[296] J. Pei, J. Han, B. MortazaviAsl, and H. Zhu. Mining Access Patterns Efficiently from
Web Logs. In Proc. of the 4th PacificAsia Conf. on Knowledge Discovery and Data
Mining, pages 396–407, Kyoto, Japan, April 2000.
[297] G. PiatetskyShapiro. Discovery, Analysis and Presentation of Strong Rules. In
G. PiatetskyShapiro and W. Frawley, editors, Knowledge Discovery in Databases,
pages 229–248. MIT Press, Cambridge, MA, 1991.
[298] C. Potter, S. Klooster, M. Steinbach, P. N. Tan, V. Kumar, S. Shekhar, and C. Car
valho. Understanding Global Teleconnections of Climate to Regional Model Estimates
of Amazon Ecosystem Carbon Fluxes. Global Change Biology, 10(5):693–703, 2004.
[299] C. Potter, S. Klooster, M. Steinbach, P. N. Tan, V. Kumar, S. Shekhar, R. Myneni,
and R. Nemani. Global Teleconnections of Ocean Climate to Terrestrial Carbon Flux.
J. Geophysical Research, 108(D17), 2003.
[300] G. D. Ramkumar, S. Ranka, and S. Tsur. Weighted Association Rules: Model and
Algorithm. http://www.cs.ucla.edu/˜czdemo/tsur/, 1997.
[301] S. Sarawagi, S. Thomas, and R. Agrawal. Integrating Mining with Relational Database
Systems: Alternatives and Implications. In Proc. of 1998 ACMSIGMOD Intl. Conf.
on Management of Data, pages 343–354, Seattle, WA, 1998.
[302] K. Satou, G. Shibayama, T. Ono, Y. Yamamura, E. Furuichi, S. Kuhara, and T. Takagi.
Finding Association Rules on Heterogeneous Genome Data. In Proc. of the Pacific
Symp. on Biocomputing, pages 397–408, Hawaii, January 1997.
[303] A. Savasere, E. Omiecinski, and S. Navathe. An efficient algorithm for mining associ
ation rules in large databases. In Proc. of the 21st Int. Conf. on Very Large Databases
(VLDB‘95), pages 432–444, Zurich, Switzerland, September 1995.
[304] A. Savasere, E. Omiecinski, and S. Navathe. Mining for Strong Negative Associations
in a Large Database of Customer Transactions. In Proc. of the 14th Intl. Conf. on
Data Engineering, pages 494–502, Orlando, Florida, February 1998.
[305] M. Seno and G. Karypis. LPMiner: An Algorithm for Finding Frequent Itemsets Using
LengthDecreasing Support Constraint. In Proc. of the 2001 IEEE Intl. Conf. on Data
Mining, pages 505–512, San Jose, CA, November 2001.
[306] T. Shintani and M. Kitsuregawa. Hash based parallel algorithms for mining association
rules. In Proc of the 4th Intl. Conf. on Parallel and Distributed Info. Systems, pages
19–30, Miami Beach, FL, December 1996.
[307] A. Silberschatz and A. Tuzhilin. What makes patterns interesting in knowledge discov
ery systems. IEEE Trans. on Knowledge and Data Engineering, 8(6):970–974, 1996.
[308] C. Silverstein, S. Brin, and R. Motwani. Beyond market baskets: Generalizing associ
ation rules to dependence rules. Data Mining and Knowledge Discovery, 2(1):39–68,
1998.
401
Chapter 6 Association Analysis
[309] E.H. Simpson. The Interpretation of Interaction in Contingency Tables. Journal of
the Royal Statistical Society, B(13):238–241, 1951.
[310] L. Singh, B. Chen, R. Haight, and P. Scheuermann. An Algorithm for Constrained
Association Rule Mining in Semistructured Data. In Proc. of the 3rd PacificAsia
Conf. on Knowledge Discovery and Data Mining, pages 148–158, Beijing, China, April
1999.
[311] R. Srikant and R. Agrawal. Mining Quantitative Association Rules in Large Relational
Tables. In Proc. of 1996 ACMSIGMOD Intl. Conf. on Management of Data, pages
1–12, Montreal, Canada, 1996.
[312] R. Srikant and R. Agrawal. Mining Sequential Patterns: Generalizations and Perfor
mance Improvements. In Proc. of the 5th Intl Conf. on Extending Database Technology
(EDBT’96), pages 18–32, Avignon, France, 1996.
[313] R. Srikant, Q. Vu, and R. Agrawal. Mining Association Rules with Item Constraints.
In Proc. of the 3rd Intl. Conf. on Knowledge Discovery and Data Mining, pages 67–73,
Newport Beach, CA, August 1997.
[314] M. Steinbach, P. N. Tan, and V. Kumar. Support Envelopes: A Technique for Ex
ploring the Structure of Association Patterns. In Proc. of the 10th Intl. Conf. on
Knowledge Discovery and Data Mining, pages 296–305, Seattle, WA, August 2004.
[315] M. Steinbach, P. N. Tan, H. Xiong, and V. Kumar. Extending the Notion of Support.
In Proc. of the 10th Intl. Conf. on Knowledge Discovery and Data Mining, pages 689–
694, Seattle, WA, August 2004.
[316] E. Suzuki. Autonomous Discovery of Reliable Exception Rules. In Proc. of the 3rd
Intl. Conf. on Knowledge Discovery and Data Mining, pages 259–262, Newport Beach,
CA, August 1997.
[317] P. N. Tan and V. Kumar. Mining Association Patterns in Web Usage Data. In Proc.
of the Intl. Conf. on Advances in Infrastructure for eBusiness, eEducation, eScience
and eMedicine on the Internet, L’Aquila, Italy, January 2002.
[318] P. N. Tan, V. Kumar, and J. Srivastava. Selecting the Right Interestingness Measure
for Association Patterns. In Proc. of the 8th Intl. Conf. on Knowledge Discovery and
Data Mining, pages 32–41, Edmonton, Canada, July 2002.
[319] P. N. Tan, M. Steinbach, V. Kumar, S. Klooster, C. Potter, and A. Torregrosa. Finding
SpatioTemporal Patterns in Earth Science Data. In KDD 2001 Workshop on Temporal
Data Mining, San Francisco, CA, 2001.
[320] H. Toivonen. Sampling Large Databases for Association Rules. In Proc. of the 22nd
VLDB Conf., pages 134–145, Bombay, India, 1996.
[321] H. Toivonen, M. Klemettinen, P. Ronkainen, K. Hatonen, and H. Mannila. Pruning
and Grouping Discovered Association Rules. In ECML95 Workshop on Statistics,
Machine Learning and Knowledge Discovery in Databases, pages 47 – 52, Heraklion,
Greece, April 1995.
[322] S. Tsur, J. Ullman, S. Abiteboul, C. Clifton, R. Motwani, S. Nestorov, and A. Rosen
thal. Query Flocks: A Generalization of Association Rule Mining. In Proc. of 1998
ACMSIGMOD Intl. Conf. on Management of Data, pages 1–12, Seattle, WA, June
1998.
[323] A. Tung, H. J. Lu, J. Han, and L. Feng. Breaking the Barrier of Transactions: Mining
InterTransaction Association Rules. In Proc. of the 5th Intl. Conf. on Knowledge
Discovery and Data Mining, pages 297–301, San Diego, CA, August 1999.
[324] K. Wang, Y. He, and J. Han. Mining Frequent Itemsets Using Support Constraints.
In Proc. of the 26th VLDB Conf., pages 43–52, Cairo, Egypt, September 2000.
402
BIBLIOGRAPHY
[325] K. Wang, S. H. Tay, and B. Liu. InterestingnessBased Interval Merger for Numeric
Association Rules. In Proc. of the 4th Intl. Conf. on Knowledge Discovery and Data
Mining, pages 121–128, New York, NY, August 1998.
[326] G. I. Webb. Preliminary investigations into statistically valid exploratory rule dis
covery. In Proc. of the Australasian Data Mining Workshop (AusDM03), Canberra,
Australia, December 2003.
[327] H. Xiong, X. He, C. Ding, Y. Zhang, V. Kumar, and S. R. Holbrook. Identification
of Functional Modules in Protein Complexes via Hyperclique Pattern Discovery. In
Proc. of the Pacific Symposium on Biocomputing, (PSB 2005), Maui, January 2005.
[328] H. Xiong, S. Shekhar, P. N. Tan, and V. Kumar. Exploiting a Supportbased Upper
Bound of Pearson’s Correlation Coefficient for Efficiently Identifying Strongly Corre
lated Pairs. In Proc. of the 10th Intl. Conf. on Knowledge Discovery and Data Mining,
pages 334–343, Seattle, WA, August 2004.
[329] H. Xiong, M. Steinbach, P. N. Tan, and V. Kumar. HICAP: Hierarchial Clustering
with Pattern Preservation. In Proc. of the SIAM Intl. Conf. on Data Mining, pages
279–290, Orlando, FL, April 2004.
[330] H. Xiong, P. N. Tan, and V. Kumar. Mining Strong Affinity Association Patterns in
Data Sets with Skewed Support Distribution. In Proc. of the 2003 IEEE Intl. Conf.
on Data Mining, pages 387–394, Melbourne, FL, 2003.
[331] X. Yan and J. Han. gSpan: Graphbased Substructure Pattern Mining. In Proc. of
the 2002 IEEE Intl. Conf. on Data Mining, pages 721–724, Maebashi City, Japan,
December 2002.
[332] C. Yang, U. M. Fayyad, and P. S. Bradley. Efficient discovery of errortolerant frequent
itemsets in high dimensions. In Proc. of the 7th Intl. Conf. on Knowledge Discovery
and Data Mining, pages 194–203, San Francisco, CA, August 2001.
[333] M. J. Zaki. Parallel and Distributed Association Mining: A Survey. IEEE Concurrency,
special issue on Parallel Mechanisms for Data Mining, 7(4):14–25, December 1999.
[334] M. J. Zaki. Generating NonRedundant Association Rules. In Proc. of the 6th Intl.
Conf. on Knowledge Discovery and Data Mining, pages 34–43, Boston, MA, August
2000.
[335] M. J. Zaki. Efficiently mining frequent trees in a forest. In Proc. of the 8th Intl.
Conf. on Knowledge Discovery and Data Mining, pages 71–80, Edmonton, Canada,
July 2002.
[336] M. J. Zaki and M. Orihara. Theoretical foundations of association rules. In Proc. of
the 1998 ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge
Discovery, Seattle, WA, June 1998.
[337] M. J. Zaki, S. Parthasarathy, M. Ogihara, and W. Li. New Algorithms for Fast
Discovery of Association Rules. In Proc. of the 3rd Intl. Conf. on Knowledge Discovery
and Data Mining, pages 283–286, Newport Beach, CA, August 1997.
[338] H. Zhang, B. Padmanabhan, and A. Tuzhilin. On the Discovery of Significant Statis
tical Quantitative Rules. In Proc. of the 10th Intl. Conf. on Knowledge Discovery and
Data Mining, pages 374–383, Seattle, WA, August 2004.
[339] Z. Zhang, Y. Lu, and B. Zhang. An Effective PartioningCombining Algorithm for
Discovering Quantitative Association Rules. In Proc. of the 1st PacificAsia Conf. on
Knowledge Discovery and Data Mining, Singapore, 1997.
[340] N. Zhong, Y. Y. Yao, and S. Ohsuga. Peculiarity Oriented Multidatabase Mining. In
Proc. of the 3rd European Conf. of Principles and Practice of Knowledge Discovery in
Databases, pages 136–146, Prague, Czech Republic, 1999.
403
Chapter 6 Association Analysis
6.10 Exercises
1. For each of the following questions, provide an example of an association rule
from the market basket domain that satisfies the following conditions. Also,
describe whether such rules are subjectively interesting.
(a) A rule that has high support and high confidence.
(b) A rule that has reasonably high support but low confidence.
(c) A rule that has low support and low confidence.
(d) A rule that has low support and high confidence.
2. Consider the data set shown in Table 6.22.
Table 6.22. Example of market basket transactions.
Customer ID Transaction ID Items Bought
1 0001 {a, d, e}
1 0024 {a, b, c, e}
2 0012 {a, b, d, e}
2 0031 {a, c, d, e}
3 0015 {b, c, e}
3 0022 {b, d, e}
4 0029 {c, d}
4 0040 {a, b, c}
5 0033 {a, d, e}
5 0038 {a, b, e}
(a) Compute the support for itemsets {e}, {b, d}, and {b, d, e} by treating
each transaction ID as a market basket.
(b) Use the results in part (a) to compute the confidence for the association
rules {b, d} −→ {e} and {e} −→ {b, d}. Is confidence a symmetric
measure?
(c) Repeat part (a) by treating each customer ID as a market basket. Each
item should be treated as a binary variable (1 if an item appears in at
least one transaction bought by the customer, and 0 otherwise.)
(d) Use the results in part (c) to compute the confidence for the association
rules {b, d} −→ {e} and {e} −→ {b, d}.
(e) Suppose s1 and c1 are the support and confidence values of an association
rule r when treating each transaction ID as a market basket. Also, let s2
and c2 be the support and confidence values of r when treating each customer
ID as a market basket. Discuss whether there are any relationships
between s1 and s2 or c1 and c2.
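The two ways of forming baskets in this exercise can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the names `support`, `confidence`, `baskets_by_transaction`, and `baskets_by_customer` are hypothetical helpers introduced here, and the rows are copied from Table 6.22.

```python
# Rows of Table 6.22 as (customer ID, transaction ID, items bought).
TABLE_6_22 = [
    (1, "0001", {"a", "d", "e"}),
    (1, "0024", {"a", "b", "c", "e"}),
    (2, "0012", {"a", "b", "d", "e"}),
    (2, "0031", {"a", "c", "d", "e"}),
    (3, "0015", {"b", "c", "e"}),
    (3, "0022", {"b", "d", "e"}),
    (4, "0029", {"c", "d"}),
    (4, "0040", {"a", "b", "c"}),
    (5, "0033", {"a", "d", "e"}),
    (5, "0038", {"a", "b", "e"}),
]


def baskets_by_transaction(rows):
    """Treat every transaction ID as its own market basket."""
    return [set(items) for _, _, items in rows]


def baskets_by_customer(rows):
    """Merge all transactions of a customer into one basket (binary presence)."""
    merged = {}
    for customer, _, items in rows:
        merged.setdefault(customer, set()).update(items)
    return list(merged.values())


def support(itemset, baskets):
    """Fraction of baskets that contain every item of the itemset."""
    itemset = set(itemset)
    return sum(1 for basket in baskets if itemset <= basket) / len(baskets)


def confidence(antecedent, consequent, baskets):
    """s(antecedent U consequent) / s(antecedent); antecedent must have non-zero support."""
    joint = support(set(antecedent) | set(consequent), baskets)
    return joint / support(antecedent, baskets)


if __name__ == "__main__":
    for name, baskets in [("transaction", baskets_by_transaction(TABLE_6_22)),
                          ("customer", baskets_by_customer(TABLE_6_22))]:
        print(name, support({"e"}, baskets), confidence({"b", "d"}, {"e"}, baskets))
```

Switching between the two basket definitions changes only the input list, so the same `support` and `confidence` functions can be reused for parts (a) through (d).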
3. (a) What is the confidence for the rules ∅ −→ A and A −→ ∅?
(b) Let c1, c2, and c3 be the confidence values of the rules {p} −→ {q},
{p} −→ {q, r}, and {p, r} −→ {q}, respectively. If we assume that c1, c2,
and c3 have different values, what are the possible relationships that may
exist among c1, c2, and c3? Which rule has the lowest confidence?
(c) Repeat the analysis in part (b) assuming that the rules have identical
support. Which rule has the highest confidence?
(d) Transitivity: Suppose the confidences of the rules A −→ B and B −→ C
are larger than some threshold, minconf. Is it possible that A −→ C has
a confidence less than minconf?
4. For each of the following measures, determine whether it is monotone, anti-monotone,
or non-monotone (i.e., neither monotone nor anti-monotone).
Example: Support, $s = \sigma(X)/|T|$, is anti-monotone because $s(X) \ge s(Y)$
whenever $X \subset Y$.
(a) A characteristic rule is a rule of the form {p} −→ {q1, q2, . . . , qn}, where
the rule antecedent contains only a single item. An itemset of size k can
produce up to k characteristic rules. Let ζ be the minimum confidence of
all characteristic rules generated from a given itemset:
$$\zeta(\{p_1, p_2, \ldots, p_k\}) = \min\bigl[\, c(\{p_1\} \to \{p_2, p_3, \ldots, p_k\}),\ \ldots,\ c(\{p_k\} \to \{p_1, p_2, \ldots, p_{k-1}\}) \,\bigr]$$
Is ζ monotone, anti-monotone, or non-monotone?
(b) A discriminant rule is a rule of the form {p1, p2, . . . , pn} −→ {q}, where
the rule consequent contains only a single item. An itemset of size k can
produce up to k discriminant rules. Let η be the minimum confidence of
all discriminant rules generated from a given itemset:
$$\eta(\{p_1, p_2, \ldots, p_k\}) = \min\bigl[\, c(\{p_2, p_3, \ldots, p_k\} \to \{p_1\}),\ \ldots,\ c(\{p_1, p_2, \ldots, p_{k-1}\} \to \{p_k\}) \,\bigr]$$
Is η monotone, anti-monotone, or non-monotone?
(c) Repeat the analysis in parts (a) and (b) by replacing the min function
with a max function.
5. Prove Equation 6.3. (Hint: First, count the number of ways to create an itemset
that forms the left-hand side of the rule. Next, for each size-k itemset selected
for the left-hand side, count the number of ways to choose the remaining d − k
items to form the right-hand side of the rule.)
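Before attempting the proof, the counting argument in the hint can be checked numerically. The sketch below assumes that Equation 6.3 is the usual closed form R = 3^d − 2^(d+1) + 1 for the number of rules over d items (the equation itself is not reproduced in this section); the function names are hypothetical.

```python
from itertools import combinations


def count_rules_brute_force(d):
    """Count rules X -> Y over d items with X and Y non-empty and disjoint."""
    items = range(d)
    total = 0
    for k in range(1, d):  # k = size of the left-hand side
        for lhs in combinations(items, k):
            remaining = [i for i in items if i not in lhs]
            # Choose any non-empty subset of the remaining d - k items as the right-hand side.
            for j in range(1, len(remaining) + 1):
                total += sum(1 for _ in combinations(remaining, j))
    return total


def count_rules_closed_form(d):
    """Assumed closed form of Equation 6.3: 3^d - 2^(d+1) + 1."""
    return 3 ** d - 2 ** (d + 1) + 1


if __name__ == "__main__":
    for d in range(1, 8):
        print(d, count_rules_brute_force(d), count_rules_closed_form(d))
```

Agreement for small d is only a sanity check, not a proof; the nested loops mirror the two counting steps suggested in the hint.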
Table 6.23. Market basket transactions.
Transaction ID Items Bought
1 {Milk, Beer, Diapers}
2 {Bread, Butter, Milk}
3 {Milk, Diapers, Cookies}
4 {Bread, Butter, Cookies}
5 {Beer, Cookies, Diapers}
6 {Milk, Diapers, Bread, Butter}
7 {Bread, Butter, Diapers}
8 {Beer, Diapers}
9 {Milk, Diapers, Bread, Butter}
10 {Beer, Cookies}
6. Consider the market basket transactions shown in Table 6.23.
(a) What is the maximum number of association rules that can be extracted
from this data (including rules that have zero support)?
(b) What is the maximum size of frequent itemsets that can be extracted
(assuming minsup > 0)?
(c) Write an expression for the maximum number of size-3 itemsets that can
be derived from this data set.
(d) Find an itemset (of size 2 or larger) that has the largest support.
(e) Find a pair of items, a and b, such that the rules {a} −→ {b} and {b} −→
{a} have the same confidence.
7. Consider the following set of frequent 3-itemsets:
{1, 2, 3}, {1, 2, 4}, {1, 2, 5}, {1, 3, 4}, {1, 3, 5}, {2, 3, 4}, {2, 3, 5}, {3, 4, 5}.
Assume that there are only five items in the data set.
(a) List all candidate 4-itemsets obtained by a candidate generation procedure
using the Fk−1 × F1 merging strategy.
(b) List all candidate 4-itemsets obtained by the candidate generation procedure
in Apriori.
(c) List all candidate 4-itemsets that survive the candidate pruning step of
the Apriori algorithm.
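The merge-then-prune idea behind parts (b) and (c) can be sketched as follows. This is an illustrative sketch, assuming the Apriori merge joins two frequent (k−1)-itemsets whose first k−2 items (in sorted order) are identical; the function names and the small input are hypothetical and deliberately differ from the exercise data.

```python
from itertools import combinations


def generate_candidates(frequent):
    """Merge pairs of (k-1)-itemsets that share their first k-2 items."""
    ordered = sorted(tuple(sorted(s)) for s in frequent)
    candidates = []
    for i, a in enumerate(ordered):
        for b in ordered[i + 1:]:
            if a[:-1] == b[:-1]:  # identical prefix, different last item
                candidates.append(a + (b[-1],))
    return candidates


def prune_candidates(candidates, frequent):
    """Keep a candidate only if every (k-1)-subset is a known frequent itemset."""
    known = {tuple(sorted(s)) for s in frequent}
    return [c for c in candidates
            if all(sub in known for sub in combinations(c, len(c) - 1))]


if __name__ == "__main__":
    # Hypothetical frequent 2-itemsets, not the data of this exercise.
    frequent_2 = [{1, 2}, {1, 3}, {2, 3}, {2, 4}]
    generated = generate_candidates(frequent_2)
    print("generated:", generated)
    print("after pruning:", prune_candidates(generated, frequent_2))
```

Keeping generation and pruning as separate functions makes it easy to compare the candidate lists asked for in parts (b) and (c).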
8. The Apriori algorithm uses a generate-and-count strategy for deriving frequent
itemsets. Candidate itemsets of size k + 1 are created by joining a pair of
frequent itemsets of size k (this is known as the candidate generation step). A
candidate is discarded if any one of its subsets is found to be infrequent during
the candidate pruning step. Suppose the Apriori algorithm is applied to the
data set shown in Table 6.24 with minsup = 30%, i.e., any itemset occurring
in fewer than 3 transactions is considered to be infrequent.
Table 6.24. Example of market basket transactions.
Transaction ID Items Bought
1 {a, b, d, e}
2 {b, c, d}
3 {a, b, d, e}
4 {a, c, d, e}
5 {b, c, d, e}
6 {b, d, e}
7 {c, d}
8 {a, b, c}
9 {a, d, e}
10 {b, d}
(a) Draw an itemset lattice representing the data set given in Table 6.24.
Label each node in the lattice with the following letter(s):
• N: If the itemset is not considered to be a candidate itemset by
the Apriori algorithm. There are two reasons for an itemset not to
be considered as a candidate itemset: (1) it is not generated at all
during the candidate generation step, or (2) it is generated during
the candidate generation step but is subsequently removed during
the candidate pruning step because one of its subsets is found to be
infrequent.
• F: If the candidate itemset is found to be frequent by the Apriori
algorithm.
• I: If the candidate itemset is found to be infrequent after support
counting.
(b) What is the percentage of frequent itemsets (with respect to all itemsets
in the lattice)?
(c) What is the pruning ratio of the Apriori algorithm on this data set?
(Pruning ratio is defined as the percentage of itemsets not considered
to be a candidate because (1) they are not generated during candidate
generation or (2) they are pruned during the candidate pruning step.)
(d) What is the false alarm rate (i.e., the percentage of candidate itemsets that
are found to be infrequent after performing support counting)?
9. The Apriori algorithm uses a hash tree data structure to efficiently count the
support of candidate itemsets. Consider the hash tree for candidate 3-itemsets
shown in Figure 6.32.
Figure 6.32. An example of a hash tree structure. (Each internal node hashes
items into three branches: 1,4,7; 2,5,8; 3,6,9. Leaf labels shown: L1 to L9, L11,
and L12. Candidate itemsets stored in the leaves: {125}, {127}, {145}, {158},
{168}, {178}, {246}, {258}, {278}, {289}, {346}, {356}, {367}, {379}, {456},
{457}, {458}, {459}, {568}, {678}, {689}, {789}.)
(a) Given a transaction that contains items {1, 3, 4, 5, 8}, which of the hash
tree leaf nodes will be visited when finding the candidates of the transaction?
(b) Use the visited leaf nodes in part (a) to determine the candidate itemsets
that are contained in the transaction {1, 3, 4, 5, 8}.
10. Consider the following set of candidate 3-itemsets:
{1, 2, 3}, {1, 2, 6}, {1, 3, 4}, {2, 3, 4}, {2, 4, 5}, {3, 4, 6}, {4, 5, 6}
(a) Construct a hash tree for the above candidate 3-itemsets. Assume the
tree uses a hash function where all odd-numbered items are hashed to
the left child of a node, while the even-numbered items are hashed to the
right child. A candidate k-itemset is inserted into the tree by hashing on
each successive item in the candidate and then following the appropriate
branch of the tree according to the hash value. Once a leaf node is reached,
the candidate is inserted based on one of the following conditions:
Condition 1: If the depth of the leaf node is equal to k (the root is
assumed to be at depth 0), then the candidate is inserted regardless
of the number of itemsets already stored at the node.
Condition 2: If the depth of the leaf node is less than k, then the candidate
can be inserted as long as the number of itemsets stored at the
node is less than maxsize. Assume maxsize = 2 for this question.
Condition 3: If the depth of the leaf node is less than k and the number
of itemsets stored at the node is equal to maxsize, then the leaf
node is converted into an internal node. New leaf nodes are created
as children of the old leaf node. Candidate itemsets previously stored
in the old leaf node are distributed to the children based on their hash
values. The new candidate is also hashed to its appropriate leaf node.
(b) How many leaf nodes are there in the candidate hash tree? How many
internal nodes are there?
(c) Consider a transaction that contains the following items: {1, 2, 3, 5, 6}.
Using the hash tree constructed in part (a), which leaf nodes will be
checked against the transaction? What are the candidate 3-itemsets
contained in the transaction?
Figure 6.33. An itemset lattice.
Level 0: null
Level 1: a, b, c, d, e
Level 2: ab, ac, ad, ae, bc, bd, be, cd, ce, de
Level 3: abc, abd, abe, acd, ace, ade, bcd, bce, bde, cde
Level 4: abcd, abce, abde, acde, bcde
Level 5: abcde
11. Given the lattice structure shown in Figure 6.33 and the transactions given in
Table 6.24, label each node with the following letter(s):
• M if the node is a maximal frequent itemset,
• C if it is a closed frequent itemset,
• N if it is frequent but neither maximal nor closed, and
• I if it is infrequent.
Assume that the support threshold is equal to 30%.
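The labeling rules for frequent itemsets can be made concrete with a small brute-force sketch. It assumes the standard definitions: a frequent itemset is maximal if it has no frequent proper superset, and closed if no proper superset has the same support. The function names and the toy transactions are hypothetical; infrequent itemsets (label I) are simply those absent from the result.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_count):
    """Brute-force support counting: map each frequent itemset to its count."""
    items = sorted(set().union(*transactions))
    counts = {}
    for k in range(1, len(items) + 1):
        for cand in combinations(items, k):
            count = sum(1 for t in transactions if set(cand) <= t)
            if count >= min_count:
                counts[frozenset(cand)] = count
    return counts


def label_frequent(counts):
    """Label each frequent itemset: 'M' (maximal), 'C' (closed), both, or 'N'."""
    labels = {}
    for itemset, count in counts.items():
        supersets = [s for s in counts if itemset < s]
        maximal = not supersets
        # A superset with equal support would itself be frequent, so checking
        # only the frequent supersets is sufficient for closedness.
        closed = all(counts[s] < count for s in supersets)
        labels[itemset] = (("M" if maximal else "") + ("C" if closed else "")) or "N"
    return labels


if __name__ == "__main__":
    # Hypothetical toy data, not Table 6.24.
    toy = [{"a", "b"}, {"a", "b"}, {"a"}]
    for itemset, tag in label_frequent(frequent_itemsets(toy, 2)).items():
        print(sorted(itemset), tag)
```

Every maximal frequent itemset is also closed, which is why such nodes receive both letters in this sketch.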
12. The original association rule mining formulation uses the support and confidence
measures to prune uninteresting rules.
(a) Draw a contingency table for each of the following rules using the transactions
shown in Table 6.25.
Table 6.25. Example of market basket transactions.
Transaction ID Items Bought
1 {a, b, d, e}
2 {b, c, d}
3 {a, b, d, e}
4 {a, c, d, e}
5 {b, c, d, e}
6 {b, d, e}
7 {c, d}
8 {a, b, c}
9 {a, d, e}
10 {b, d}
Rules: {b} −→ {c}, {a} −→ {d}, {b} −→ {d}, {e} −→ {c}, {c} −→ {a}.
(b) Use the contingency tables in part (a) to compute and rank the rules in
decreasing order according to the following measures.
i. Support.
ii. Confidence.
iii. Interest(X −→ Y) = $\frac{P(X,Y)}{P(X)}\,P(Y)$.
iv. IS(X −→ Y) = $\frac{P(X,Y)}{\sqrt{P(X)P(Y)}}$.
v. Klosgen(X −→ Y) = $\sqrt{P(X,Y)} \times (P(Y|X) - P(Y))$, where $P(Y|X) = \frac{P(X,Y)}{P(X)}$.
vi. Odds ratio(X −→ Y) = $\frac{P(X,Y)\,P(\overline{X},\overline{Y})}{P(X,\overline{Y})\,P(\overline{X},Y)}$.
13. Given the rankings you obtained in Exercise 12, compute the correlation
between the rankings of confidence and the other five measures. Which measure
is most highly correlated with confidence? Which measure is least correlated
with confidence?
14. Answer the following questions using the data sets shown in Figure 6.34. Note
that each data set contains 1000 items and 10,000 transactions. Dark cells
indicate the presence of items and white cells indicate the absence of items. We
will apply the Apriori algorithm to extract frequent itemsets with minsup =
10% (i.e., itemsets must be contained in at least 1000 transactions).
(a) Which data set(s) will produce the largest number of frequent itemsets?
(b) Which data set(s) will produce the fewest number of frequent itemsets?
(c) Which data set(s) will produce the longest frequent itemset?
(d) Which data set(s) will produce frequent itemsets with highest maximum
support?
(e) Which data set(s) will produce frequent itemsets containing items with
widely varying support levels (i.e., items with mixed support, ranging from
less than 20% to more than 70%)?
15. (a) Prove that the φ coefficient is equal to 1 if and only if f11 = f1+ = f+1.
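(b) Show that if A and B are independent, then
$P(A,B) \times P(\overline{A},\overline{B}) = P(A,\overline{B}) \times P(\overline{A},B)$.
(c) Show that Yule's Q and Y coefficients
$$Q = \frac{f_{11}f_{00} - f_{10}f_{01}}{f_{11}f_{00} + f_{10}f_{01}}, \qquad Y = \frac{\sqrt{f_{11}f_{00}} - \sqrt{f_{10}f_{01}}}{\sqrt{f_{11}f_{00}} + \sqrt{f_{10}f_{01}}}$$
are normalized versions of the odds ratio.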
(d) Write a simplified expression for the value of each measure shown in Tables
6.11 and 6.12 when the variables are statistically independent.
16. Consider the interestingness measure, $M = \frac{P(B|A) - P(B)}{1 - P(B)}$, for an association
rule A −→ B.
(a) What is the range of this measure? When does the measure attain its
maximum and minimum values?
(b) How does M behave when P (A, B) is increased while P (A) and P (B)
remain unchanged?
(c) How does M behave when P (A) is increased while P (A, B) and P (B)
remain unchanged?
(d) How does M behave when P (B) is increased while P (A, B) and P (A)
remain unchanged?
(e) Is the measure symmetric under variable permutation?
(f) What is the value of the measure when A and B are statistically independent?
(g) Is the measure null-invariant?
(h) Does the measure remain invariant under row or column scaling operations?
(i) How does the measure behave under the inversion operation?
Figure 6.34. Figures for Exercise 14. (Six panels, (a) through (f), each plotting
Transactions on the vertical axis, with ticks from 2000 to 8000, against Items on
the horizontal axis, with ticks from 200 to 800. Annotation following panel (f):
10% are 1s, 90% are 0s, uniformly distributed.)
17. Suppose we have market basket data consisting of 100 transactions and 20
items. The support for item a is 25%, the support for item b is 90%, and the
support for itemset {a, b} is 20%. Let the support and confidence thresholds
be 10% and 60%, respectively.
(a) Compute the confidence of the association rule {a} → {b}. Is the rule
interesting according to the confidence measure?
(b) Compute the interest measure for the association pattern {a, b}. Describe
the nature of the relationship between item a and item b in terms of the
interest measure.
(c) What conclusions can you draw from the results of parts (a) and (b)?
(d) Prove that if the confidence of the rule {a} −→ {b} is less than the support
of {b}, then:
i. $c(\{\overline{a}\} \to \{b\}) > c(\{a\} \to \{b\})$,
ii. $c(\{\overline{a}\} \to \{b\}) > s(\{b\})$,
where c(·) denotes the rule confidence and s(·) denotes the support of an
itemset.
18. Table 6.26 shows a 2 × 2 × 2 contingency table for the binary variables A and
B at different values of the control variable C.
Table 6.26. A Contingency Table.
                   A = 1   A = 0
C = 0    B = 1       0      15
         B = 0      15      30
C = 1    B = 1       5       0
         B = 0       0      15
(a) Compute the φ coefficient for A and B when C = 0, C = 1, and C = 0 or
1. Note that $\phi(\{A,B\}) = \frac{P(A,B) - P(A)P(B)}{\sqrt{P(A)P(B)(1-P(A))(1-P(B))}}$.
(b) What conclusions can you draw from the above result?
19. Consider the contingency tables shown in Table 6.27.
(a) For table I, compute support, the interest measure, and the φ correlation
coefficient for the association pattern {A, B}. Also, compute the
confidence of rules A → B and B → A.
Table 6.27. Contingency tables for Exercise 19.
(a) Table I.
                   B    $\overline{B}$
A                  9     1
$\overline{A}$     1    89
(b) Table II.
                   B    $\overline{B}$
A                 89     1
$\overline{A}$     1     9
(b) For table II, compute support, the interest measure, and the φ correlation
coefficient for the association pattern {A, B}. Also, compute the
confidence of rules A → B and B → A.
(c) What conclusions can you draw from the results of (a) and (b)?
20. Consider the relationship between customers who buy high-definition televisions
and exercise machines as shown in Tables 6.19 and 6.20.
(a) Compute the odds ratios for both tables.
(b) Compute the φ-coefficient for both tables.
(c) Compute the interest factor for both tables.
For each of the measures given above, describe how the direction of association
changes when data is pooled together instead of being stratified.
7
Association Analysis:
Advanced Concepts
The association rule mining formulation described in the previous chapter
assumes that the input data consists of binary attributes called items. The
presence of an item in a transaction is also assumed to be more important than
its absence. As a result, an item is treated as an asymmetric binary attribute
and only frequent patterns are considered interesting.
This chapter extends the formulation to data sets with symmetric binary,
categorical, and continuous attributes. The formulation will also be extended
to incorporate more complex entities such as sequences and graphs. Although
the overall structure of association analysis algorithms remains unchanged, certain
aspects of the algorithms must be modified to handle the nontraditional
entities.
7.1 Handling Categorical Attributes
There are many applications that contain symmetric binary and nominal attributes.
The Internet survey data shown in Table 7.1 contains symmetric
binary attributes such as Gender, Computer at Home, Chat Online, Shop
Online, and Privacy Concerns; as well as nominal attributes such as Level
of Education and State. Using association analysis, we may uncover interesting
information about the characteristics of Internet users such as:
{Shop Online = Yes} −→ {Privacy Concerns = Yes}.
This rule suggests that most Internet users who shop online are concerned
about their personal privacy.
From Chapter 7 of Introduction to Data Mining, First Edition. Pang-Ning Tan, Michael Steinbach,
Vipin Kumar. Copyright © 2006 by Pearson Education, Inc. All rights reserved.
Table 7.1. Internet survey data with categorical attributes.
Gender   Level of      State        Computer   Chat     Shop     Privacy
         Education                  at Home    Online   Online   Concerns
Female   Graduate      Illinois     Yes        Yes      Yes      Yes
Male     College       California   No         No       No       No
Male     Graduate      Michigan     Yes        Yes      Yes      Yes
Female   College       Virginia     No         No       Yes      Yes
Female   Graduate      California   Yes        No       No       Yes
Male     College       Minnesota    Yes        Yes      Yes      Yes
Male     College       Alaska       Yes        Yes      Yes      No
Male     High School   Oregon       Yes        No       No       No
Female   Graduate      Texas        No         Yes      No       No
. . .    . . .         . . .        . . .      . . .    . . .    . . .
To extract such patterns, the categorical and symmetric binary attributes
are transformed into “items” first, so that existing association rule mining
algorithms can be applied. This type of transformation can be performed
by creating a new item for each distinct attribute-value pair. For example,
the nominal attribute Level of Education can be replaced by three binary
items: Education = College, Education = Graduate, and Education = High
School. Similarly, symmetric binary attributes such as Gender can be converted
into a pair of binary items, Male and Female. Table 7.2 shows the
result of binarizing the Internet survey data.
Table 7.2. Internet survey data after binarizing categorical and symmetric binary attributes.
Male   Female   Education    Education   . . .   Privacy   Privacy
                = Graduate   = College           = Yes     = No
0      1        1            0           . . .   1         0
1      0        0            1           . . .   0         1
1      0        1            0           . . .   1         0
0      1        0            1           . . .   1         0
0      1        1            0           . . .   1         0
1      0        0            1           . . .   1         0
1      0        0            1           . . .   0         1
1      0        0            0           . . .   0         1
0      1        1            0           . . .   0         1
. . .  . . .    . . .        . . .       . . .   . . .     . . .
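The binarization step can be illustrated with a minimal Python sketch. The function name `binarize` and the three sample records are assumptions made for illustration; only the idea of creating one item per attribute-value pair comes from the text.

```python
def binarize(records):
    """Turn each record (attribute -> value) into a set of 'attribute=value' items."""
    return [{f"{attr}={value}" for attr, value in rec.items()} for rec in records]

# Hypothetical sample in the spirit of Table 7.1 (three attributes only).
survey = [
    {"Gender": "Female", "Education": "Graduate", "Privacy Concerns": "Yes"},
    {"Gender": "Male", "Education": "College", "Privacy Concerns": "No"},
    {"Gender": "Male", "Education": "High School", "Privacy Concerns": "No"},
]

transactions = binarize(survey)

# One column per distinct attribute-value pair, as in Table 7.2.
items = sorted(set().union(*transactions))
matrix = [[int(item in t) for item in items] for t in transactions]
```

Each row of `matrix` contains exactly one 1 per original attribute, which is why the transaction width equals the number of attributes.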
There are several issues to consider when applying association analysis to
the binarized data:
1. Some attribute values may not be frequent enough to be part of a frequent
pattern. This problem is more evident for nominal attributes
that have many possible values, e.g., state names. Lowering the support
threshold does not help because it exponentially increases the number
of frequent patterns found (many of which may be spurious) and makes
the computation more expensive. A more practical solution is to group
related attribute values into a small number of categories. For example,
each state name can be replaced by its corresponding geographical
region, such as Midwest, Pacific Northwest, Southwest, and East
Coast. Another possibility is to aggregate the less frequent attribute
values into a single category called Others, as shown in Figure 7.1.
[Figure 7.1. A pie chart with a merged category called Others. Slices: Virginia, New York, California, Massachusetts, Oregon, Texas, Minnesota, Florida, Michigan, Illinois, Ohio, Others.]
2. Some attribute values may have considerably higher frequencies than
others. For example, suppose 85% of the survey participants own a
home computer. By creating a binary item for each attribute value
that appears frequently in the data, we may potentially generate many
redundant patterns, as illustrated by the following example:
{Computer at home = Yes, Shop Online = Yes}
−→ {Privacy Concerns = Yes}.
The rule is redundant because it is subsumed by the more general rule
given at the beginning of this section. Because the high-frequency items
correspond to the typical values of an attribute, they seldom carry any
new information that can help us to better understand the pattern. It
may therefore be useful to remove such items before applying standard
association analysis algorithms. Another possibility is to apply the techniques
presented in Section 6.8 for handling data sets with a wide range
of support values.
3. Although the width of every transaction is the same as the number of
attributes in the original data, the computation time may increase especially
when many of the newly created items become frequent. This
is because more time is needed to deal with the additional candidate
itemsets generated by these items (see Exercise 1 on page 473). One
way to reduce the computation time is to avoid generating candidate
itemsets that contain more than one item from the same attribute. For
example, we do not have to generate a candidate itemset such as {State
= X, State = Y, . . .} because the support count of the itemset is zero.
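Two of the remedies listed above can be sketched in a few lines of Python: merging rare attribute values into an Others category (issue 1) and skipping candidate itemsets that contain two items from the same attribute (issue 3). The helper names, the `min_count` parameter, and the sample data are assumptions for illustration, not part of any particular algorithm implementation.

```python
from collections import Counter

def merge_rare_values(values, min_count, other="Others"):
    """Replace values occurring fewer than min_count times with a single category."""
    counts = Counter(values)
    return [v if counts[v] >= min_count else other for v in values]

def has_repeated_attribute(itemset):
    """True if two (attribute, value) items in the itemset share an attribute."""
    attrs = [attr for attr, _ in itemset]
    return len(attrs) != len(set(attrs))

# Hypothetical State column.
states = ["California", "California", "Texas", "Alaska", "Oregon", "California", "Texas"]
merged = merge_rare_values(states, min_count=2)

# A candidate such as {State = X, State = Y} has zero support and can be skipped.
bad = {("State", "Texas"), ("State", "Oregon"), ("Gender", "Male")}
good = {("State", "Texas"), ("Gender", "Male")}
```

A candidate-generation routine could call `has_repeated_attribute` before counting support, so that such itemsets never reach the counting step.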
7.2 Handling Continuous Attributes
The Internet survey data described in the previous section may also contain
continuous attributes such as the ones shown in Table 7.3. Mining the continuous
attributes may reveal useful insights about the data such as “users
whose annual income is more than $120K belong to the 45–60 age group” or
“users who have more than 3 email accounts and spend more than 15 hours
online per week are often concerned about their personal privacy.” Association
rules that contain continuous attributes are commonly known as quantitative
association rules.
This section describes the various methodologies for applying association
analysis to continuous data. We will specifically discuss three types of methods:
(1) discretization-based methods, (2) statistics-based methods, and (3)
non-discretization methods. The quantitative association rules derived using
these methods are quite different in nature.
7.2.1 Discretization-Based Methods
Table 7.3. Internet survey data with continuous attributes.
Gender   . . .   Age   Annual   No. of Hours Spent   No. of Email   Privacy
                       Income   Online per Week      Accounts       Concern
Female   . . .   26    90K      20                   4              Yes
Male     . . .   51    135K     10                   2              No
Male     . . .   29    80K      10                   3              Yes
Female   . . .   45    120K     15                   3              Yes
Female   . . .   31    95K      20                   5              Yes
Male     . . .   25    55K      25                   5              Yes
Male     . . .   37    100K     10                   1              No
Male     . . .   41    65K      8                    2              No
Female   . . .   26    85K      12                   1              No
. . .    . . .   . . . . . .    . . .                . . .          . . .
Discretization is the most common approach for handling continuous attributes.
This approach groups the adjacent values of a continuous attribute into a finite
number of intervals. For example, the Age attribute can be divided into the
following intervals:
Age ∈ [12, 16), Age ∈ [16, 20), Age ∈ [20, 24), . . . , Age ∈ [56, 60),
where [a, b) represents an interval that includes a but not b. Discretization
can be performed using any of the techniques described in Section 2.3.6 (equal
interval width, equal frequency, entropy-based, or clustering). The discrete
intervals are then mapped into asymmetric binary attributes so that existing
association analysis algorithms can be applied. Table 7.4 shows the Internet
survey data after discretization and binarization.
Table 7.4. Internet survey data after binarizing categorical and continuous attributes.
Male   Female   . . .   Age     Age          Age          . . .   Privacy   Privacy
                        < 13    ∈ [13, 21)   ∈ [21, 30)           = Yes     = No
0      1        . . .   0       0            1            . . .   1         0
1      0        . . .   0       0            0            . . .   0         1
1      0        . . .   0       0            1            . . .   1         0
0      1        . . .   0       0            0            . . .   1         0
0      1        . . .   0       0            0            . . .   1         0
1      0        . . .   0       0            1            . . .   1         0
1      0        . . .   0       0            0            . . .   0         1
1      0        . . .   0       0            0            . . .   0         1
0      1        . . .   0       0            1            . . .   0         1
. . .  . . .    . . .   . . .   . . .        . . .        . . .   . . .     . . .
Table 7.5. A breakdown of Internet users who participated in online chat according to their age group.
Age Group Chat Online = Yes Chat Online = No
[12, 16) 12 13
[16, 20) 11 2
[20, 24) 11 3
[24, 28) 12 13
[28, 32) 14 12
[32, 36) 15 12
[36, 40) 16 14
[40, 44) 16 14
[44, 48) 4 10
[48, 52) 5 11
[52, 56) 5 10
[56, 60) 4 11
A key parameter in attribute discretization is the number of intervals used
to partition each attribute. This parameter is typically provided by the users
and can be expressed in terms of the interval width (for the equal interval
width approach), the average number of transactions per interval (for the equal
frequency approach), or the number of desired clusters (for the clustering-based
approach). The difficulty in determining the right number of intervals
can be illustrated using the data set shown in Table 7.5, which summarizes the
responses of 250 users who participated in the survey. There are two strong
rules embedded in the data:
R1: Age ∈ [16, 24) −→ Chat Online = Yes (s = 8.8%, c = 81.5%).
R2: Age ∈ [44, 60) −→ Chat Online = No (s = 16.8%, c = 70%).
These rules suggest that most users in the 16–24 age group participate in
online chatting, while those in the 44–60 age group are less
likely to chat online. In this example, we consider a rule to be interesting only
if its support (s) exceeds 5% and its confidence (c) exceeds 65%. One of the
problems encountered when discretizing the Age attribute is how to determine
the interval width.
1. If the interval is too wide, then we may lose some patterns because of
their lack of confidence. For example, when the interval width is 24
years, R1 and R2 are replaced by the following rules:
R′1: Age ∈ [12, 36) −→ Chat Online = Yes (s = 30%, c = 57.7%).
R′2: Age ∈ [36, 60) −→ Chat Online = No (s = 28%, c = 58.3%).
Despite their higher supports, the wider intervals have caused the confidence
for both rules to drop below the minimum confidence threshold.
As a result, both patterns are lost after discretization.
2. If the interval is too narrow, then we may lose some patterns because of
their lack of support. For example, if the interval width is 4 years, then
R1 is broken up into the following two subrules:
$R^{(4)}_{11}$: Age ∈ [16, 20) −→ Chat Online = Yes (s = 4.4%, c = 84.6%).
$R^{(4)}_{12}$: Age ∈ [20, 24) −→ Chat Online = Yes (s = 4.4%, c = 78.6%).
Since the supports for the subrules are less than the minimum support
threshold, R1 is lost after discretization. Similarly, the rule R2, which
is broken up into four subrules, will also be lost because the support of
each subrule is less than the minimum support threshold.
3. If the interval width is 8 years, then the rule R2 is broken up into the
following two subrules:
$R^{(8)}_{21}$: Age ∈ [44, 52) −→ Chat Online = No (s = 8.4%, c = 70%).
$R^{(8)}_{22}$: Age ∈ [52, 60) −→ Chat Online = No (s = 8.4%, c = 70%).
Since $R^{(8)}_{21}$ and $R^{(8)}_{22}$ have sufficient support and confidence, R2 can be
recovered by aggregating both subrules. Meanwhile, R1 is broken up
into the following two subrules:
$R^{(8)}_{11}$: Age ∈ [12, 20) −→ Chat Online = Yes (s = 9.2%, c = 60.5%).
$R^{(8)}_{12}$: Age ∈ [20, 28) −→ Chat Online = Yes (s = 9.2%, c = 59.0%).
Unlike R2, we cannot recover the rule R1 by aggregating the subrules
because both subrules fail the confidence threshold.
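The support and confidence values quoted in the three cases above can be reproduced directly from the counts in Table 7.5. The following is a minimal sketch; the function names `rule_stats` and `is_strong` are illustrative assumptions, and the thresholds are the 5% support and 65% confidence stated in the text.

```python
# Table 7.5: lower bound of each 4-year age group -> (Chat Online = Yes, Chat Online = No)
COUNTS = {
    12: (12, 13), 16: (11, 2), 20: (11, 3), 24: (12, 13),
    28: (14, 12), 32: (15, 12), 36: (16, 14), 40: (16, 14),
    44: (4, 10), 48: (5, 11), 52: (5, 10), 56: (4, 11),
}
TOTAL = sum(yes + no for yes, no in COUNTS.values())  # 250 survey participants

def rule_stats(lo, hi, chat):
    """Support and confidence of the rule Age in [lo, hi) -> Chat Online = chat."""
    yes = sum(y for age, (y, _) in COUNTS.items() if lo <= age < hi)
    no = sum(n for age, (_, n) in COUNTS.items() if lo <= age < hi)
    hits = yes if chat == "Yes" else no
    return hits / TOTAL, hits / (yes + no)

def is_strong(lo, hi, chat, minsup=0.05, minconf=0.65):
    """A rule is kept only if it exceeds both thresholds."""
    support, confidence = rule_stats(lo, hi, chat)
    return support > minsup and confidence > minconf

r1 = rule_stats(16, 24, "Yes")        # R1
r2 = rule_stats(44, 60, "No")         # R2
too_wide = is_strong(12, 36, "Yes")   # 24-year interval: confidence too low
too_narrow = is_strong(16, 20, "Yes") # 4-year interval: support too low
```

Calling `rule_stats` with other interval boundaries (multiples of 4 starting at 12) shows how quickly a rule moves in and out of the strong region as the width changes.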
One way to address these issues is to consider every possible grouping of
adjacent intervals. For example, we can start with an interval width of 4 years
and then merge the adjacent intervals into wider intervals, Age ∈ [12, 16),
Age ∈ [12, 20), . . . , Age ∈ [12, 60), Age ∈ [16, 20), Age ∈ [16, 24), etc. This
approach enables the detection of both R1 and R2 as strong rules. However,
it also leads to the following computational issues:
1. The computation becomes extremely expensive. If the range is
initially divided into k intervals, then k(k − 1)/2 binary items must be
generated to represent all possible intervals. Furthermore, if an item
corresponding to the interval [a,b) is frequent, then all other items corresponding
to intervals that subsume [a,b) must be frequent too. This
approach can therefore generate far too many candidate and frequent
itemsets. To address these problems, a maximum support threshold can
be applied to prevent the creation of items corresponding to very wide
intervals and to reduce the number of itemsets.
2. Many redundant rules are extracted. For example, consider the
following pair of rules:
R3 : {Age ∈ [16, 20), Gender = Male} −→ {Chat Online = Yes},
R4 : {Age ∈ [16, 24), Gender = Male} −→ {Chat Online = Yes}.
R4 is a generalization of R3 (and R3 is a specialization of R4) because
R4 has a wider interval for the Age attribute. If the confidence values
for both rules are the same, then R4 should be more interesting because
it covers more examples, including those for R3. R3 is therefore
a redundant rule.
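To get a feel for how many interval items the exhaustive grouping creates, the sketch below enumerates every interval obtained by merging adjacent base intervals. The function names are assumptions for illustration. Counted this way, k base intervals yield k(k + 1)/2 intervals in total, of which k(k − 1)/2 span two or more base intervals.

```python
def merged_intervals(boundaries):
    """All intervals [a, b) formed from one or more adjacent base intervals."""
    return [(boundaries[i], boundaries[j])
            for i in range(len(boundaries) - 1)
            for j in range(i + 1, len(boundaries))]

def subsumes(wide, narrow):
    """True if interval `wide` contains interval `narrow`."""
    return wide[0] <= narrow[0] and narrow[1] <= wide[1]

boundaries = list(range(12, 61, 4))   # 12, 16, ..., 60 gives k = 12 base intervals
k = len(boundaries) - 1
intervals = merged_intervals(boundaries)
wider = [iv for iv in intervals if iv[1] - iv[0] > 4]

# R4's interval [16, 24) subsumes R3's interval [16, 20).
r4_generalizes_r3 = subsumes((16, 24), (16, 20))
```

A check like `subsumes` is one simple way to spot candidate generalization pairs such as R3 and R4 before comparing their confidence values.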
7.2.2 Statistics-Based Methods
Quantitative association rules can be used to infer the statistical properties of a
population. For example, suppose we are interested in finding the average age
of certain groups of Internet users based on the data provided in Tables 7.1 and
7.3. Using the statistics-based method described in this section, quantitative
association rules such as the following can be extracted:
{Annual Income > $100K, Shop Online = Yes} −→ Age: Mean = 38.
The rule states that the average age of Internet users whose annual income
exceeds $100K and who shop online regularly is 38 years old.
Rule Generation
To generate the statistics-based quantitative association rules, the target attribute
used to characterize interesting segments of the population must be
specified. The target attribute is withheld, and the remaining categorical and
continuous attributes in the data are binarized using the methods described
in the previous section. Existing algorithms such as Apriori or FP-growth
are then applied to extract frequent itemsets from the binarized data. Each
frequent itemset identifies an interesting segment of the population. The distribution
of the target attribute in each segment can be summarized using
descriptive statistics such as mean, median, variance, or absolute deviation.
For example, the preceding rule is obtained by averaging the age of Internet
users who support the frequent itemset {Annual Income > $100K, Shop
Online = Yes}.
The number of quantitative association rules discovered using this method
is the same as the number of extracted frequent itemsets. Because of the way
the quantitative association rules are defined, the notion of confidence is not
applicable to such rules. An alternative method for validating the quantitative
association rules is presented next.
Rule Validation
A quantitative association rule is interesting only if the statistics computed
from transactions covered by the rule are different from those computed from
transactions not covered by the rule. For example, the rule given at the beginning
of this section is interesting only if the average age of Internet users
who do not support the frequent itemset {Annual Income > $100K, Shop
Online = Yes} is significantly higher or lower than 38 years old. To determine
whether the difference in their average ages is statistically significant,
statistical hypothesis testing methods should be applied.
Consider the quantitative association rule, A −→ t : µ, where A is a
frequent itemset, t is the continuous target attribute, and µ is the average value
of t among transactions covered by A. Furthermore, let µ′ denote the average
value of t among transactions not covered by A. The goal is to test whether
the difference between µ and µ′ is greater than some user-specified threshold,
∆. In statistical hypothesis testing, two opposite propositions, known as the
null hypothesis and the alternative hypothesis, are given. A hypothesis test
is performed to determine which of these two hypotheses should be accepted,
based on evidence gathered from the data (see Appendix C).
In this case, assuming that µ < µ′, the null hypothesis is H0 : µ′ = µ + ∆,
while the alternative hypothesis is H1 : µ′ > µ + ∆. To determine which
hypothesis should be accepted, the following Z-statistic is computed:
$$Z = \frac{\mu' - \mu - \Delta}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}}, \tag{7.1}$$
where n1 is the number of transactions supporting A, n2 is the number of transactions
not supporting A, s1 is the standard deviation for t among transactions
that support A, and s2 is the standard deviation for t among transactions that
do not support A. Under the null hypothesis, Z has a standard normal distribution
with mean 0 and variance 1. The value of Z computed using Equation
7.1 is then compared against a critical value, Zα, which is a threshold that
depends on the desired confidence level. If Z > Zα, then the null hypothesis
is rejected and we may conclude that the quantitative association rule is interesting.
Otherwise, there is not enough evidence in the data to show that
the difference in mean is statistically significant.
Example 7.1. Consider the quantitative association rule
{Income > 100K, Shop Online = Yes} −→ Age : µ = 38.
Suppose there are 50 Internet users who supported the rule antecedent. The
standard deviation of their ages is 3.5. On the other hand, the average age of
the 200 users who do not support the rule antecedent is 30 and their standard
deviation is 6.5. Assume that a quantitative association rule is considered
interesting only if the difference between µ and µ′ is more than 5 years. Because
µ > µ′ here, the roles of µ and µ′ in Equation 7.1 are exchanged, and we obtain
$$Z = \frac{38 - 30 - 5}{\sqrt{\dfrac{3.5^2}{50} + \dfrac{6.5^2}{200}}} = 4.4414.$$
For a one-sided hypothesis test at a 95% confidence level, the critical value
for rejecting the null hypothesis is 1.64. Since Z > 1.64, the null hypothesis
can be rejected. We therefore conclude that the quantitative association rule
is interesting because the difference between the average ages of users who
support and do not support the rule antecedent is more than 5 years.
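The arithmetic in Example 7.1 can be checked with a short Python sketch. The function name and argument order are assumptions for illustration; the formula is Equation 7.1 written for the case where the first mean is the larger one, as in the example.

```python
from math import sqrt

def z_statistic(mean_hi, mean_lo, delta, s_hi, n_hi, s_lo, n_lo):
    """Z = (mean_hi - mean_lo - delta) / sqrt(s_hi^2 / n_hi + s_lo^2 / n_lo)."""
    return (mean_hi - mean_lo - delta) / sqrt(s_hi**2 / n_hi + s_lo**2 / n_lo)

# Values from Example 7.1: 50 supporting users (mean 38, s = 3.5),
# 200 non-supporting users (mean 30, s = 6.5), threshold of 5 years.
z = z_statistic(38, 30, 5, 3.5, 50, 6.5, 200)

CRITICAL_VALUE = 1.64   # one-sided test at the 95% confidence level, as in the text
interesting = z > CRITICAL_VALUE
```

With these inputs `z` evaluates to roughly 4.44, matching the value reported in the example.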
7.2.3 Non-discretization Methods
There are certain applications in which analysts are more interested in finding
associations among the continuous attributes, rather than associations
among discrete intervals of the continuous attributes. For example, consider
the problem of finding word associations in text documents, as shown in Table 7.6.
Each entry in the document-word matrix represents the normalized
frequency count of a word appearing in a given document. The data is normalized
by dividing the frequency of each word by the sum of the word frequency
across all documents. One reason for this normalization is to make sure that
the resulting support value is a number between 0 and 1. However, a more
important reason is to ensure that the data is on the same scale so that sets
of words that vary in the same way have similar support values.
Table 7.6. Normalized document-word matrix.
Document   word1   word2   word3   word4   word5   word6
d1         0.3     0.6     0       0       0       0.2
d2         0.1     0.2     0       0       0       0.2
d3         0.4     0.2     0.7     0       0       0.2
d4         0.2     0       0.3     0       0       0.1
d5         0       0       0       1.0     1.0     0.3
In text mining, analysts are more interested in finding associations between
words (e.g., data and mining) instead of associations between ranges of word
frequencies (e.g., data ∈ [1, 4] and mining ∈ [2, 3]). One way to do this is
to transform the data into a 0/1 matrix, where the entry is 1 if the normalized
frequency count exceeds some threshold t, and 0 otherwise. While this
approach allows analysts to apply existing frequent itemset generation algorithms
to the binarized data set, finding the right threshold for binarization
can be quite tricky. If the threshold is set too high, it is possible to miss some
interesting associations. Conversely, if the threshold is set too low, there is a
potential for generating a large number of spurious associations.
This section presents another methodology for finding word associations
known as min-Apriori. Analogous to traditional association analysis, an itemset
is considered to be a collection of words, while its support measures the
degree of association among the words. The support of an itemset can be
computed based on the normalized frequency of its corresponding words. For
example, consider the document d1 shown in Table 7.6. The normalized frequencies
for word1 and word2 in this document are 0.3 and 0.6, respectively.
One might think that a reasonable approach to compute the association between
both words is to take the average value of their normalized frequencies,
i.e., (0.3 + 0.6)/2 = 0.45. The support of an itemset can then be computed by
summing up the averaged normalized frequencies across all the documents:
$$s(\{\text{word1}, \text{word2}\}) = \frac{0.3 + 0.6}{2} + \frac{0.1 + 0.2}{2} + \frac{0.4 + 0.2}{2} + \frac{0.2 + 0}{2} = 1.$$
This result is by no means an accident. Because the frequencies of every word are
normalized to sum to 1 across the documents, averaging the normalized frequencies
makes the support for every itemset equal to 1. All itemsets are therefore
frequent using this approach, making it useless for identifying interesting patterns.
In min-Apriori, the association among words in a given document is obtained
by taking the minimum value of their normalized frequencies, i.e.,
min(word1, word2) = min(0.3, 0.6) = 0.3. The support of an itemset is computed
by aggregating its association over all the documents.
s({word1, word2}) = min(0.3, 0.6) + min(0.1, 0.2) + min(0.4, 0.2) + min(0.2, 0)
= 0.6.
The support measure defined in min-Apriori has the following desired properties,
which make it suitable for finding word associations in documents:
1. Support increases monotonically as the normalized frequency of a word
increases.
2. Support increases monotonically as the number of documents that contain
the word increases.
3. Support has an anti-monotone property. For example, consider a pair of
itemsets {A, B} and {A, B, C}. Since min({A, B}) ≥ min({A, B, C}),
s({A, B}) ≥ s({A, B, C}). Therefore, support decreases monotonically
as the number of words in an itemset increases.
The standard Apriori algorithm can be modified to find associations among
words using the new support definition.
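The two support definitions discussed above can be compared on the matrix of Table 7.6 with a minimal Python sketch. The function names `min_support` and `avg_support` are assumptions for illustration; this shows only the support computation, not a full min-Apriori implementation.

```python
WORDS = ["word1", "word2", "word3", "word4", "word5", "word6"]

# Normalized document-word matrix from Table 7.6 (one row per document).
DOCS = {
    "d1": [0.3, 0.6, 0.0, 0.0, 0.0, 0.2],
    "d2": [0.1, 0.2, 0.0, 0.0, 0.0, 0.2],
    "d3": [0.4, 0.2, 0.7, 0.0, 0.0, 0.2],
    "d4": [0.2, 0.0, 0.3, 0.0, 0.0, 0.1],
    "d5": [0.0, 0.0, 0.0, 1.0, 1.0, 0.3],
}

def _columns(words):
    """Column indices of the given words."""
    return [WORDS.index(w) for w in words]

def min_support(words):
    """min-Apriori support: sum over documents of the minimum normalized frequency."""
    cols = _columns(words)
    return sum(min(row[c] for c in cols) for row in DOCS.values())

def avg_support(words):
    """Averaging variant from the text; equals 1 for every itemset."""
    cols = _columns(words)
    return sum(sum(row[c] for c in cols) / len(cols) for row in DOCS.values())

s_min = min_support(["word1", "word2"])
s_avg = avg_support(["word1", "word2"])
s_min_3 = min_support(["word1", "word2", "word6"])
```

Up to floating-point rounding, `s_min` is 0.6 and `s_avg` is 1, and adding a third word can only keep the min-based support the same or lower it, which is the anti-monotone property that a modified Apriori search would rely on.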
7.3 Handling a Concept Hierarchy
A concept hierarchy is a multilevel organization of the various entities or concepts
defined in a particular domain. For example, in market basket analysis,
a concept hierarchy has the form of an item taxonomy describing the “is-a”
relationships among items sold at a grocery store (e.g., milk is a kind of food
and DVD is a kind of home electronics equipment; see Figure 7.2). Concept
hierarchies are often defined according to domain knowledge or based on a
standard classification scheme defined by certain organizations (e.g., the Library
of Congress classification scheme is used to organize library materials
based on their subject categories).
A concept hierarchy can be represented using a directed acyclic graph,
as shown in Figure 7.2. If there is an edge in the graph from a node p to
another node c, we call p the parent of c and c the child of p. For example,
milk is the parent of skim milk because there is a directed edge from the
node milk to the node skim milk. X̂ is called an ancestor of X (and X a
descendent of X̂) if there is a path from node X̂ to node X in the directed
acyclic graph. In the diagram shown in Figure 7.2, food is an ancestor of skim
milk and AC adaptor is a descendent of electronics.
[Figure 7.2. Example of an item taxonomy. Node labels: Food, Bread (Wheat, White), Milk (Skim, 2%), Electronics, Computers, Home, Desktop, Laptop, Laptop Accessories (AC adaptor, Docking station), DVD, TV.]
The main advantages of incorporating concept hierarchies into association
analysis are as follows:
1. Items at the lower levels of a hierarchy may not have enough support to
appear in any frequent itemsets. For example, although the sale of AC
adaptors and docking stations may be low, the sale of laptop accessories,
which is their parent node in the concept hierarchy, may be high. Unless
the concept hierarchy is used, there is a potential to miss interesting
patterns involving the laptop accessories.
2. Rules found at the lower levels of a concept hierarchy tend to be overly
specific and may not be as interesting as rules at the higher levels. For
example, staple items such as milk and bread tend to produce many low-level
rules such as skim milk −→ wheat bread, 2% milk −→ wheat
bread, and skim milk −→ white bread. Using a concept hierarchy,
they can be summarized into a single rule, milk −→ bread. Considering
only items residing at the top level of their hierarchies may not be good
enough because such rules may not be of any practical use. For example,
although the rule electronics −→ food may satisfy the support and
confidence thresholds, it is not informative because the combination of
electronics and food items that are frequently purchased by customers
is unknown. If milk and batteries are the only items sold together
frequently, then the pattern {food, electronics} may have overgeneralized
the situation.
Standard association analysis can be extended to incorporate concept hierarchies
in the following way. Each transaction t is initially replaced with its
extended transaction t′, which contains all the items in t along with their
corresponding ancestors. For example, the transaction {DVD, wheat bread}
can be extended to {DVD, wheat bread, home electronics, electronics,
bread, food}, where home electronics and electronics are the ancestors
of DVD, while bread and food are the ancestors of wheat bread. With this
approach, existing algorithms such as Apriori can be applied to the extended
database to find rules that span different levels of the concept hierarchy. This
approach has several obvious limitations:
1. Items residing at the higher levels tend to have higher support counts
than those residing at the lower levels of a concept hierarchy. As a result,
if the support threshold is set too high, then only patterns involving the
high-level items are extracted. On the other hand, if the threshold is set
too low, then the algorithm generates far too many patterns (most of
which may be spurious) and becomes computationally inefficient.
2. Introduction of a concept hierarchy tends to increase the computation
time of association analysis algorithms because of the larger number of
items and wider transactions. The number of candidate patterns and
frequent patterns generated by these algorithms may also grow exponentially
with wider transactions.
3. Introduction of a concept hierarchy may produce redundant rules. A
rule X −→ Y is redundant if there exists a more general rule X̂ −→ Ŷ ,
where X̂ is an ancestor of X, Ŷ is an ancestor of Y , and both rules
have very similar confidence. For example, suppose {bread} −→ {milk},
{white bread} −→ {2% milk}, {wheat bread} −→ {2% milk}, {white
bread} −→ {skim milk}, and {wheat bread} −→ {skim milk} have
very similar confidence. The rules involving items from the lower level of
the hierarchy are considered redundant because they can be summarized
by a rule involving the ancestor items. An itemset such as {skim milk,
milk, food} is also redundant because food and milk are ancestors of
skim milk. Fortunately, it is easy to eliminate such redundant itemsets
during frequent itemset generation, given the knowledge of the hierarchy.
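The extended-transaction construction and the redundant-itemset check described above can be sketched in Python as follows. The `PARENT` map encodes only the part of the taxonomy named in the text, and it assumes every item has a single parent (a tree), which is a simplification of the directed acyclic graph allowed by the definition. All function names are assumptions for illustration.

```python
# Partial, hypothetical encoding of the item taxonomy: child -> parent.
PARENT = {
    "DVD": "home electronics",
    "home electronics": "electronics",
    "wheat bread": "bread",
    "white bread": "bread",
    "bread": "food",
    "skim milk": "milk",
    "2% milk": "milk",
    "milk": "food",
}

def ancestors(item):
    """All ancestors of an item, nearest first."""
    result = []
    while item in PARENT:
        item = PARENT[item]
        result.append(item)
    return result

def extend(transaction):
    """Extended transaction: the original items plus all of their ancestors."""
    extended = set(transaction)
    for item in transaction:
        extended.update(ancestors(item))
    return extended

def is_redundant_itemset(itemset):
    """True if the itemset contains an item together with one of its ancestors."""
    return any(a in itemset for item in itemset for a in ancestors(item))

extended = extend({"DVD", "wheat bread"})
```

Running a standard frequent itemset algorithm on the extended transactions, and discarding candidates for which `is_redundant_itemset` is true, is one straightforward way to apply the approach.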
Sequence Database:
Object   Timestamp   Events
A        10          2, 3, 5
A        20          6, 1
A        23          1
B        11          4, 5, 6
B        17          2
B        21          7, 8, 1, 2
B        28          1, 6
C        14          1, 8, 7
[Timeline panel: the sequences for objects A, B, and C drawn along a time axis with ticks at 10, 15, 20, 25, 30, and 35.]
Figure 7.3. Example of a sequence database.
7.4 Sequential Patterns
Market basket data often contains temporal information about when an item
was purchased by customers. Such information can be used to piece together
the sequence of transactions made by a customer over a certain period of time.
Similarly, event-based data collected from scientific experiments or the monitoring
of physical systems such as telecommunications networks, computer
networks, and wireless sensor networks has an inherent sequential nature.
This means that an ordinal relation, usually based on temporal or
spatial precedence, exists among events occurring in such data. However, the
concepts of association patterns discussed so far emphasize only co-occurrence
relationships and disregard the sequential information of the data. The latter
information may be valuable for identifying recurring features of a dynamic
system or predicting future occurrences of certain events. This section presents
the basic concept of sequential patterns and the algorithms developed to discover
them.
7.4.1 Problem Formulation
The input to the problem of discovering sequential patterns is a sequence data
set, which is shown on the left-hand side of Figure 7.3. Each row records the
occurrences of events associated with a particular object at a given time. For
example, the first row contains the set of events occurring at timestamp t = 10
for object A. By sorting all the events associated with object A in increasing
order of their timestamps, a sequence for object A is obtained, as shown on
the right-hand side of Figure 7.3.
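The step of turning database rows into per-object sequences can be sketched in Python. The rows below are transcribed from the sequence database in Figure 7.3; the function name `build_sequences` and the tuple layout are assumptions for illustration.

```python
from collections import defaultdict

# (object, timestamp, events) rows from the sequence database of Figure 7.3.
ROWS = [
    ("A", 10, {2, 3, 5}), ("A", 20, {6, 1}), ("A", 23, {1}),
    ("B", 11, {4, 5, 6}), ("B", 17, {2}), ("B", 21, {7, 8, 1, 2}), ("B", 28, {1, 6}),
    ("C", 14, {1, 8, 7}),
]

def build_sequences(rows):
    """Group rows by object and order each object's elements by timestamp."""
    by_object = defaultdict(list)
    for obj, timestamp, events in rows:
        by_object[obj].append((timestamp, events))
    return {
        obj: [events for _, events in sorted(pairs, key=lambda pair: pair[0])]
        for obj, pairs in by_object.items()
    }

sequences = build_sequences(ROWS)
```

Each value in `sequences` is an ordered list of elements, where every element is the set of events recorded at one timestamp.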
Generally speaking, a sequence is an ordered list of elements. A sequence
can be denoted as s = 〈e1e2e3 . . . en〉, where each element ej is a collection of
one or more events, i.e., ej = {i1, i2, . . . , ik}. The following is a list of examples
of sequences:
• Sequence of Web pages viewed by a Web site visitor:
〈 {Homepage} {Electronics} {Cameras and Camcorders} {Digital Cameras}
{Shopping Cart} {Order Confirmation} {Return to Shopping} 〉
• Sequence of events leading to the nuclear accident at Three-Mile Island:
〈 {clogged resin} {outlet valve closure} {loss of feedwater} {condenser
polisher outlet valve shut} {booster pumps trip} {main waterpump trips}
{main turbine trips} {reactor pressure increases} 〉
• Sequence of classes taken by a computer science major:
〈 {Algorithms and Data Structures, Introduction to Operating Systems}
{Database Systems, Computer Architecture} {Computer Networks, Software
Engineering} {Computer Graphics, Parallel Programming} 〉
A sequence can be characterized by its length and the number of occurring
events. The length of a sequence corresponds to the number of elements
present in the sequence, while a k-sequence is a sequence that contains k
events. The Web sequence in the previous example contains 7 elements and
7 events; the event sequence at Three-Mile Island contains 8 elements and 8
events; and the class sequence contains 4 elements and 8 events.
Figure 7.4 provides examples of sequences, elements, and events defined for
a variety of application domains. Except for the last row, the ordinal attribute
associated with each of the first three domains corresponds to calendar time.
For the last row, the ordinal attribute corresponds to the location of the bases
(A, C, G, T) in the gene sequence. Although the discussion on sequential
patterns is primarily focused on temporal events, it can be extended to the
case where the events have spatial ordering.
Figure 7.4. Examples of elements and events in sequence data sets.

Sequence Database | Sequence | Element (Transaction) | Event (Item)
Customer | Purchase history of a given customer | A set of items bought by a customer at time t | Books, dairy products, CDs, etc.
Web Data | Browsing activity of a particular Web visitor | The collection of files viewed by a Web visitor after a single mouse click | Home page, index page, contact info, etc.
Event data | History of events generated by a given sensor | Events triggered by a sensor at time t | Types of alarms generated by sensors
Genome sequences | DNA sequence of a particular species | An element of the DNA sequence | Bases A, T, G, C

[The figure also depicts a sequence as a timeline along the ordinal attribute, with
elements made up of events such as E1, E2, E3, and E4.]

Subsequences
A sequence t is a subsequence of another sequence s if each ordered element in
t is a subset of an ordered element in s. Formally, the sequence t = 〈t_1 t_2 ... t_m〉
is a subsequence of s = 〈s_1 s_2 ... s_n〉 if there exist integers 1 ≤ j_1 < j_2 < ... <
j_m ≤ n such that t_1 ⊆ s_{j_1}, t_2 ⊆ s_{j_2}, ..., t_m ⊆ s_{j_m}. If t is a subsequence of
s, then we say that t is contained in s. The following table gives examples
illustrating the idea of subsequences for various sequences.
Sequence, s            | Sequence, t       | Is t a subsequence of s?
<{2,4} {3,5,6} {8}>    | <{2} {3,6} {8}>   | Yes
<{2,4} {3,5,6} {8}>    | <{2} {8}>         | Yes
<{1,2} {3,4}>          | <{1} {2}>         | No
<{2,4} {2,4} {2,5}>    | <{2} {4}>         | Yes
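The containment test above can be sketched in a few lines of Python. This is only an illustration: the function name is_subsequence and the list-of-sets representation are assumptions made here, not notation from the text. The sketch scans s once and matches each element of t to the earliest element of s that contains it.

```python
def is_subsequence(t, s):
    """Return True if sequence t is contained in sequence s.

    Both arguments are lists of sets, one set of events per element
    (a representation assumed for this sketch).
    """
    j = 0  # index of the next element of t that still needs a match
    for element in s:
        if j < len(t) and t[j] <= element:  # t[j] is a subset of this element
            j += 1
    return j == len(t)


s = [{2, 4}, {3, 5, 6}, {8}]
print(is_subsequence([{2}, {3, 6}, {8}], s))                  # True
print(is_subsequence([{2}, {8}], s))                          # True
print(is_subsequence([{1}, {2}], [{1, 2}, {3, 4}]))           # False
print(is_subsequence([{2}, {4}], [{2, 4}, {2, 4}, {2, 5}]))   # True
```

Matching each element of t as early as possible is sufficient here because an earlier match never rules out a later one.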
7.4.2 Sequential Pattern Discovery
Let D be a data set that contains one or more data sequences. The term
data sequence refers to an ordered list of events associated with a single data
object. For example, the data set shown in Figure 7.3 contains three data
sequences, one for each object A, B, and C.
The support of a sequence s is the fraction of all data sequences that
contain s. If the support for s is greater than or equal to a user-specified
threshold minsup, then s is declared to be a sequential pattern (or frequent
sequence).

Figure 7.5. Sequential patterns derived from a data set that contains five data sequences.

Object | Timestamp | Events
A      | 1         | 1, 2, 4
A      | 2         | 2, 3
A      | 3         | 5
B      | 1         | 1, 2
B      | 2         | 2, 3, 4
C      | 1         | 1, 2
C      | 2         | 2, 3, 4
C      | 3         | 2, 4, 5
D      | 1         | 2
D      | 2         | 3, 4
D      | 3         | 4, 5
E      | 1         | 1, 3
E      | 2         | 2, 4, 5

Examples of sequential patterns (minsup = 50%):
<{1,2}>         s = 60%
<{2,3}>         s = 60%
<{2,4}>         s = 80%
<{3} {5}>       s = 80%
<{1} {2}>       s = 80%
<{2} {2}>       s = 60%
<{1} {2,3}>     s = 60%
<{2} {2,3}>     s = 60%
<{1,2} {2,3}>   s = 60%
Definition 7.1 (Sequential Pattern Discovery). Given a sequence data
set D and a user-specified minimum support threshold minsup, the task of
sequential pattern discovery is to find all sequences with support ≥ minsup.
Figure 7.5 illustrates an example of a data set that contains five data
sequences. The support for the sequence < {1}{2} > is equal to 80% because it
occurs in four of the five data sequences (every object except for D). Assuming
that the minimum support threshold is 50%, any sequence that appears in at
least three data sequences is considered to be a sequential pattern. Examples
of sequential patterns extracted from the given data set include <{1}{2}>,
<{1,2}>, <{2,3}>, <{1,2}{2,3}>, etc.
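As a rough check of these numbers, the support computation can be sketched in Python on the data set of Figure 7.5. The names figure_7_5, contains, and support are hypothetical helpers introduced for this sketch, and each data sequence is stored as a list of event sets ordered by timestamp.

```python
# Data sequences of Figure 7.5, one list of event sets per object.
figure_7_5 = {
    "A": [{1, 2, 4}, {2, 3}, {5}],
    "B": [{1, 2}, {2, 3, 4}],
    "C": [{1, 2}, {2, 3, 4}, {2, 4, 5}],
    "D": [{2}, {3, 4}, {4, 5}],
    "E": [{1, 3}, {2, 4, 5}],
}


def contains(data_sequence, pattern):
    """True if pattern is a subsequence of data_sequence."""
    j = 0
    for element in data_sequence:
        if j < len(pattern) and pattern[j] <= element:
            j += 1
    return j == len(pattern)


def support(pattern, dataset):
    """Fraction of data sequences that contain the pattern."""
    hits = sum(contains(seq, pattern) for seq in dataset.values())
    return hits / len(dataset)


print(support([{1}, {2}], figure_7_5))        # 0.8
print(support([{1, 2}, {2, 3}], figure_7_5))  # 0.6
print(support([{3}, {5}], figure_7_5))        # 0.8
```

With minsup = 50%, all three patterns qualify as sequential patterns, in agreement with the supports listed in Figure 7.5.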
Sequential pattern discovery is a computationally challenging task because
there are exponentially many sequences contained in a given data sequence.
For example, the data sequence <{a,b} {c,d,e} {f} {g,h,i}> contains sequences
such as <{a} {c,d} {f} {g}>, <{c,d,e}>, <{b} {g}>, etc. It can be easily
shown that the total number of k-sequences present in a data sequence with
n events is $\binom{n}{k}$. A data sequence with nine events therefore contains
$\binom{9}{1} + \binom{9}{2} + \cdots + \binom{9}{9} = 2^9 - 1 = 511$
distinct sequences.
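The count of 511 can be confirmed by direct enumeration. The sketch below is an illustration written for this example data sequence only; the helper name subsequences and the tuple representation are assumptions. It picks every non-empty subset of the nine events and groups the chosen events by the element they came from.

```python
from itertools import combinations

data_sequence = [("a", "b"), ("c", "d", "e"), ("f",), ("g", "h", "i")]

# Tag every event with the index of the element it belongs to.
events = [(i, e) for i, element in enumerate(data_sequence) for e in element]


def subsequences(k):
    """All k-sequences contained in data_sequence, as tuples of tuples."""
    found = set()
    for chosen in combinations(events, k):
        elements = {}
        for i, e in chosen:
            elements.setdefault(i, []).append(e)
        found.add(tuple(tuple(v) for _, v in sorted(elements.items())))
    return found


counts = [len(subsequences(k)) for k in range(1, len(events) + 1)]
print(counts)       # [9, 36, 84, 126, 126, 84, 36, 9, 1]
print(sum(counts))  # 511
```

Each count equals the corresponding binomial coefficient because the nine events in this data sequence are all distinct.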
A brute-force approach for generating sequential patterns is to enumerate
all possible sequences and count their respective supports. Given a collection
of n events, candidate 1-sequences are generated first, followed by candidate
2-sequences, candidate 3-sequences, and so on:
1-sequences: <i_1>, <i_2>, ..., <i_n>
2-sequences: <{i_1, i_2}>, <{i_1, i_3}>, ..., <{i_{n−1}, i_n}>,
             <{i_1}{i_1}>, <{i_1}{i_2}>, ..., <{i_{n−1}}{i_n}>
3-sequences: <{i_1, i_2, i_3}>, <{i_1, i_2, i_4}>, ..., <{i_1, i_2}{i_1}>, ...,
             <{i_1}{i_1, i_2}>, ..., <{i_1}{i_1}{i_1}>, ..., <{i_n}{i_n}{i_n}>
Notice that the number of candidate sequences is substantially larger than
the number of candidate itemsets. There are two reasons for the additional
number of candidates:
1. An item can appear at most once in an itemset, but an event can appear
more than once in a sequence. Given a pair of items, i_1 and i_2, only one
candidate 2-itemset, {i_1, i_2}, can be generated. On the other hand, there
are many candidate 2-sequences, such as <{i_1, i_2}>, <{i_1}{i_2}>,
<{i_2}{i_1}>, and <{i_1}{i_1}>, that can be generated.
2. Order matters in sequences, but not for itemsets. For example, {1, 2} and
{2, 1} refer to the same itemset, whereas <{i_1}{i_2}> and <{i_2}{i_1}>
correspond to different sequences, and thus must be generated separately.
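To make the difference in candidate counts concrete, the following sketch enumerates candidate k-sequences by brute force and compares them with candidate k-itemsets. The helper names compositions and candidate_sequences are illustrative assumptions; an element is modeled as a sorted tuple of distinct events.

```python
from itertools import combinations, product


def compositions(k):
    """All ways to write k as an ordered sum of positive integers."""
    if k == 0:
        yield ()
    for first in range(1, k + 1):
        for rest in compositions(k - first):
            yield (first,) + rest


def candidate_sequences(events, k):
    """All k-sequences over the given events (brute-force enumeration)."""
    for sizes in compositions(k):
        # One list of possible elements for each element size in the sequence.
        choices = [list(combinations(events, size)) for size in sizes]
        yield from product(*choices)


events = ["i1", "i2"]
print(len(list(combinations(events, 2))))         # 1 candidate 2-itemset
print(len(list(candidate_sequences(events, 2))))  # 5 candidate 2-sequences
```

For two events, the five candidate 2-sequences are <{i1, i2}> together with the four ordered pairs <{i1}{i1}>, <{i1}{i2}>, <{i2}{i1}>, and <{i2}{i2}>.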
The Apriori principle holds for sequential data because any data sequence
that contains a particular k-sequence must also contain all of its
(k − 1)-subsequences. An Apriori-like algorithm can be developed to extract
sequential patterns from a sequence data set. The basic structure of the
algorithm is shown in Algorithm 7.1.
Notice that the structure of the algorithm is almost identical to Algorithm
6.1 presented in the previous chapter. The algorithm iteratively generates
new candidate k-sequences, prunes candidates whose (k − 1)-subsequences
are infrequent, and then counts the supports of the remaining candidates to
identify the sequential patterns. The detailed aspects of these steps are given
next.
Algorithm 7.1 Apriori-like algorithm for sequential pattern discovery.
k = 1.
F_k = { i | i ∈ I ∧ σ({i})/N ≥ minsup }. {Find all frequent 1-subsequences.}
repeat
    k = k + 1.
    C_k = apriori-gen(F_{k−1}). {Generate candidate k-subsequences.}
    for each data sequence t ∈ T do
        C_t = subsequence(C_k, t). {Identify all candidates contained in t.}
        for each candidate k-subsequence c ∈ C_t do
            σ(c) = σ(c) + 1. {Increment the support count.}
        end for
    end for
    F_k = { c | c ∈ C_k ∧ σ(c)/N ≥ minsup }. {Extract the frequent k-subsequences.}
until F_k = ∅
Answer = ⋃ F_k.

Candidate Generation A pair of frequent (k − 1)-sequences are merged to
produce a candidate k-sequence. To avoid generating duplicate candidates,
recall that the traditional Apriori algorithm merges a pair of frequent k-itemsets
only if their first k − 1 items are identical. A similar approach can be used
for sequences. The criteria for merging sequences are stated in the form of the
following procedure.
Sequence Merging Procedure
A sequence s(1) is merged with another sequence s(2) only if the subsequence
obtained by dropping the first event in s(1) is identical to the subsequence
obtained by dropping the last event in s(2). The resulting candidate is the
sequence s(1), concatenated with the last event from s(2). The last event from
s(2) can either be merged into the same element as the last event in s(1) or
different elements depending on the following conditions:
1. If the last two events in s(2) belong to the same element, then the last event
in s(2) is part of the last element in s(1) in the merged sequence.
2. If the last two events in s(2) belong to different elements, then the last event
in s(2) becomes a separate element appended to the end of s(1) in the
merged sequence.
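The merging procedure can be sketched as follows. This is an illustrative reading of the procedure, not a reference implementation: sequences are assumed to be tuples of elements, each element a sorted tuple of events, and the names drop_first, drop_last, and merge are made up for the sketch. It covers sequences with at least two events; merging two 1-sequences needs separate handling because both <{a}{b}> and <{a, b}> must be produced.

```python
def drop_first(seq):
    """Remove the first event of the first element."""
    head = seq[0][1:]
    return ((head,) if head else ()) + seq[1:]


def drop_last(seq):
    """Remove the last event of the last element."""
    tail = seq[-1][:-1]
    return seq[:-1] + ((tail,) if tail else ())


def merge(s1, s2):
    """Merge s1 with s2, or return None if the merging condition fails."""
    if drop_first(s1) != drop_last(s2):
        return None
    last_event = s2[-1][-1]
    if len(s2[-1]) > 1:
        # Condition 1: the last two events of s2 share an element.
        return s1[:-1] + (s1[-1] + (last_event,),)
    # Condition 2: the last event of s2 forms its own element.
    return s1 + ((last_event,),)


frequent_3 = [
    ((1,), (2,), (3,)),
    ((1,), (2, 5)),
    ((1,), (5,), (3,)),
    ((2,), (3,), (4,)),
    ((2, 5), (3,)),
    ((3,), (4,), (5,)),
    ((5,), (3, 4)),
]
candidates_4 = {merge(a, b) for a in frequent_3 for b in frequent_3} - {None}
for candidate in sorted(candidates_4):
    print(candidate)
```

Applied to the frequent 3-sequences of Figure 7.6, the sketch yields the five candidate 4-sequences listed in the figure.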
Figure 7.6 illustrates examples of candidate 4-sequences obtained by merging
pairs of frequent 3-sequences. The first candidate 〈{1}{2}{3}{4}〉 is obtained
by merging 〈{1}{2}{3}〉 with 〈{2}{3}{4}〉. Since events 3 and 4 belong
to different elements of the second sequence, they also belong to separate
elements in the merged sequence. On the other hand, merging 〈{1}{5}{3}〉 with
〈{5}{3, 4}〉 produces the candidate 4-sequence 〈{1}{5}{3, 4}〉. In this case,
since events 3 and 4 belong to the same element of the second sequence, they
are combined into the same element in the merged sequence. Finally, the
sequences 〈{1}{2}{3}〉 and 〈{1}{2, 5}〉 do not have to be merged because removing
the first event from the first sequence does not give the same subsequence
as removing the last event from the second sequence. Although 〈{1}{2, 5}{3}〉
is a viable candidate, it is generated by merging a different pair of sequences,
〈{1}{2, 5}〉 and 〈{2, 5}{3}〉. This example shows that the sequence merging
procedure is complete; i.e., it will not miss any viable candidate, while at the
same time, it avoids generating duplicate candidate sequences.

Figure 7.6. Example of the candidate generation and pruning steps of a sequential pattern mining
algorithm.

Frequent 3-sequences:
    < (1) (2) (3) >
    < (1) (2 5) >
    < (1) (5) (3) >
    < (2) (3) (4) >
    < (2 5) (3) >
    < (3) (4) (5) >
    < (5) (3 4) >
Candidate generation:
    < (1) (2) (3) (4) >
    < (1) (2 5) (3) >
    < (1) (5) (3 4) >
    < (2) (3) (4) (5) >
    < (2 5) (3 4) >
Candidate pruning:
    < (1) (2 5) (3) >
Candidate Pruning A candidate k-sequence is pruned if at least one of its
(k − 1)-subsequences is infrequent. For example, suppose 〈{1}{2}{3}{4}〉 is a
candidate 4-sequence. We need to check whether 〈{1}{2}{4}〉 and 〈{1}{3}{4}〉 are
frequent 3-sequences. Since both are infrequent, the candidate 〈{1}{2}{3}{4}〉
can be eliminated. Readers should be able to verify that the only candidate
4-sequence that survives the candidate pruning step in Figure 7.6 is
〈{1}{2, 5}{3}〉.
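The pruning step on the example of Figure 7.6 can be checked with a short sketch. The helper names one_event_removed and prune are assumptions of this illustration, and sequences are again written as tuples of sorted tuples.

```python
def one_event_removed(seq):
    """Yield every (k-1)-subsequence obtained by deleting a single event."""
    for i, element in enumerate(seq):
        for j in range(len(element)):
            shorter = element[:j] + element[j + 1:]
            # Drop the element entirely if it has become empty.
            yield seq[:i] + ((shorter,) if shorter else ()) + seq[i + 1:]


def prune(candidates, frequent):
    """Keep only candidates whose (k-1)-subsequences are all frequent."""
    frequent = set(frequent)
    return [c for c in candidates
            if all(sub in frequent for sub in one_event_removed(c))]


frequent_3 = [
    ((1,), (2,), (3,)), ((1,), (2, 5)), ((1,), (5,), (3,)),
    ((2,), (3,), (4,)), ((2, 5), (3,)), ((3,), (4,), (5,)),
    ((5,), (3, 4)),
]
candidates_4 = [
    ((1,), (2,), (3,), (4,)), ((1,), (2, 5), (3,)), ((1,), (5,), (3, 4)),
    ((2,), (3,), (4,), (5,)), ((2, 5), (3, 4)),
]
survivors = prune(candidates_4, frequent_3)
print(survivors)  # [((1,), (2, 5), (3,))]
```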
Support Counting During support counting, the algorithm enumerates
all candidate k-sequences contained in a particular data sequence and
increments their support counts. After counting the supports, the algorithm
identifies the frequent k-sequences and discards all candidates whose
support is less than the minsup threshold.
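Putting the three steps together, a compact level-wise miner in the spirit of Algorithm 7.1 might look like the sketch below. It is a simplified illustration under stated assumptions, not the book's implementation: there are no timing constraints, sequences are tuples of sorted tuples, candidate 2-sequences are generated by a separate base case, and all function names (contains, drop, merge, candidates, mine) are hypothetical.

```python
from itertools import combinations


def contains(data_seq, pattern):
    """True if pattern is a subsequence of data_seq (a list of event sets)."""
    j = 0
    for element in data_seq:
        if j < len(pattern) and set(pattern[j]) <= element:
            j += 1
    return j == len(pattern)


def drop(seq, i, j):
    """Delete event j of element i; remove the element if it becomes empty."""
    element = seq[i][:j] + seq[i][j + 1:]
    return seq[:i] + ((element,) if element else ()) + seq[i + 1:]


def merge(s1, s2):
    """Sequence merging procedure for sequences with two or more events."""
    if drop(s1, 0, 0) != drop(s2, len(s2) - 1, len(s2[-1]) - 1):
        return None
    last = s2[-1][-1]
    if len(s2[-1]) > 1:
        return s1[:-1] + (s1[-1] + (last,),)
    return s1 + ((last,),)


def candidates(frequent, k):
    """Candidate generation followed by candidate pruning."""
    if k == 2:  # base case: combine frequent events in both possible ways
        events = sorted(s[0][0] for s in frequent)
        same = {((a, b),) for a, b in combinations(events, 2)}
        apart = {((a,), (b,)) for a in events for b in events}
        return same | apart
    merged = {merge(s1, s2) for s1 in frequent for s2 in frequent} - {None}
    return {c for c in merged
            if all(drop(c, i, j) in frequent
                   for i in range(len(c)) for j in range(len(c[i])))}


def mine(data, minsup):
    """Return {pattern: support} for all patterns with support >= minsup."""
    n = len(data)
    events = {e for seq in data for element in seq for e in element}
    level, k, result = {((e,),) for e in events}, 1, {}
    while level:
        counts = {c: sum(contains(seq, c) for seq in data) for c in level}
        frequent = {c for c, count in counts.items() if count / n >= minsup}
        result.update({c: counts[c] / n for c in frequent})
        k += 1
        level = candidates(frequent, k) if frequent else set()
    return result


figure_7_5 = [
    [{1, 2, 4}, {2, 3}, {5}],
    [{1, 2}, {2, 3, 4}],
    [{1, 2}, {2, 3, 4}, {2, 4, 5}],
    [{2}, {3, 4}, {4, 5}],
    [{1, 3}, {2, 4, 5}],
]
patterns = mine(figure_7_5, minsup=0.5)
print(patterns[((1, 2), (2, 3))])  # 0.6
print(patterns[((3,), (5,))])      # 0.8
```

The supports reported for <{1,2}{2,3}> and <{3}{5}> agree with Figure 7.5. Recomputing every candidate's support with a full scan per level is the simplest choice for a sketch; a practical implementation would use a more efficient counting structure.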
Figure 7.7. Timing constraints of a sequential pattern.
[The figure shows a sequence whose elements each occupy a time window w
characterized by [l, u], where l is the earliest time of occurrence of an event
in w and u is the latest time of occurrence of an event in w. The width of each
window is bounded by the window size (ws), and the constraints are:
    u(s_n) − l(s_1) ≤ maxspan
    u(s_{j+1}) − l(s_j) ≤ maxgap
    l(s_{j+1}) − u(s_j) > mingap]
7.4.3 Timing Constraints
This section presents a sequential pattern formulation where timing constraints
are imposed on the events and elements of a pattern. To motivate the need
for timing constraints, consider the following sequence of courses taken by two
students who enrolled in a data mining class:
Student A: 〈 {Statistics} {Database Systems} {Data Mining} 〉.
Student B: 〈 {Database Systems} {Statistics} {Data Mining} 〉.
The sequential pattern of interest is 〈 {Statistics, Database Systems} {Data
Mining} 〉, which means that students who are enrolled in the data mining
class must have previously taken a course in statistics and database systems.
Clearly, the pattern is supported by both students even though they do not
take statistics and database systems at the same time. In contrast, a student
who took a statistics course ten years earlier should not be considered as
supporting the pattern because the time gap between the courses is too long.
Because the formulation presented in the previous section does not incorporate
these timing constraints, a new sequential pattern definition is needed.
Figure 7.7 illustrates some of the timing constraints that can be imposed
on a pattern. The definition of these constraints and the impact they have on
sequential pattern discovery algorithms will be discussed in the next sections.
Note that each element of the sequential pattern is associated with a time
window [l, u], where l is the earliest occurrence of an event within the time
window and u is the latest occurrence of an event within the time window.
The maxspan Constraint
The maxspan constraint specifies the maximum allowed time difference
between the latest and the earliest occurrences of events in the entire sequence.
For example, suppose the following data sequences contain events that
occur at consecutive time stamps (1, 2, 3, . . .). Assuming that maxspan = 3,
the following table contains sequential patterns that are supported and not
supported by a given data sequence.
Data Sequence, s                   | Sequential Pattern, t | Does s support t?
<{1,3} {3,4} {4} {5} {6,7} {8}>    | <{3} {4}>             | Yes
<{1,3} {3,4} {4} {5} {6,7} {8}>    | <{3} {6}>             | Yes
<{1,3} {3,4} {4} {5} {6,7} {8}>    | <{1,3} {6}>           | No
In general, the longer the maxspan, the more likely it is to detect a pattern
in a data sequence. However, a longer maxspan can also capture spurious
patterns because it increases the chance for two unrelated events to be temporally
related. In addition, the pattern may involve events that are already obsolete.
The maxspan constraint affects the support counting step of sequential
pattern discovery algorithms. As shown in the preceding examples, some data
sequences no longer support a candidate pattern when the maxspan constraint
is imposed. If we simply apply Algorithm 7.1, the support counts for some
patterns may be overestimated. To avoid this problem, the algorithm must be
modified to ignore cases where the interval between the first and last
occurrences of events in a given pattern is greater than maxspan.
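A maxspan-aware containment test can be sketched as follows. The sketch assumes that every pattern element must match a single timestamp (ws = 0) and that consecutive elements occur at strictly increasing timestamps; the function name supports_with_maxspan and the timestamp-to-events dictionary are illustrative choices, not part of the text.

```python
def supports_with_maxspan(data, pattern, maxspan):
    """data maps timestamp -> set of events; pattern is a list of sets."""
    times = sorted(data)
    for start in times:
        if not pattern[0] <= data[start]:
            continue
        # Match the remaining elements as early as possible after `start`,
        # which gives the smallest possible end time for this start.
        j, end = 1, start
        for t in times:
            if t > start and j < len(pattern) and pattern[j] <= data[t]:
                j, end = j + 1, t
        if j == len(pattern) and end - start <= maxspan:
            return True
    return False


s = {1: {1, 3}, 2: {3, 4}, 3: {4}, 4: {5}, 5: {6, 7}, 6: {8}}
print(supports_with_maxspan(s, [{3}, {4}], maxspan=3))     # True
print(supports_with_maxspan(s, [{3}, {6}], maxspan=3))     # True
print(supports_with_maxspan(s, [{1, 3}, {6}], maxspan=3))  # False
```

Trying every possible start time matters: <{3}{6}> fails when event 3 is taken at timestamp 1 but succeeds when it is taken at timestamp 2.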
The mingap and maxgap Constraints
Timing constraints can also be specified to restrict the time difference
between two consecutive elements of a sequence. If the maximum time difference
(maxgap) is one week, then events in one element must occur within a week’s
time of the events occurring in the previous element. If the minimum time
difference (mingap) is zero, then events in one element must occur strictly
after the events occurring in the previous element. The following table shows
examples of patterns that pass or fail the maxgap and mingap constraints,
assuming that maxgap = 3 and mingap = 1.
Data Sequence, s                   | Sequential Pattern, t | maxgap | mingap
<{1,3} {3,4} {4} {5} {6,7} {8}>    | <{3} {6}>             | Pass   | Pass
<{1,3} {3,4} {4} {5} {6,7} {8}>    | <{6} {8}>             | Pass   | Fail
<{1,3} {3,4} {4} {5} {6,7} {8}>    | <{1,3} {6}>           | Fail   | Pass
<{1,3} {3,4} {4} {5} {6,7} {8}>    | <{1} {3} {8}>         | Fail   | Fail
As with maxspan, these constraints will affect the support counting step
of sequential pattern discovery algorithms because some data sequences no
longer support a candidate pattern when mingap and maxgap constraints are
present. These algorithms must be modified to ensure that the timing
constraints are not violated when counting the support of a pattern. Otherwise,
some infrequent sequences may mistakenly be declared as frequent patterns.
A side effect of using the maxgap constraint is that the Apriori principle
might be violated. To illustrate this, consider the data set shown in Figure
7.5. Without mingap or maxgap constraints, the support for 〈{2}{5}〉 and
〈{2}{3}{5}〉 are both equal to 60%. However, if mingap = 0 and maxgap = 1,
then the support for 〈{2}{5}〉 reduces to 40%, while the support for 〈{2}{3}{5}〉
is still 60%. In other words, support has increased when the number of events
in a sequence increases, which contradicts the Apriori principle. The
violation occurs because the object D does not support the pattern 〈{2}{5}〉 since
the time gap between events 2 and 5 is greater than maxgap. This problem
can be avoided by using the concept of a contiguous subsequence.
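The violation can be reproduced numerically. The sketch below assumes ws = 0 (each element matches one timestamp) and applies the gap conditions mingap < t_next − t_prev ≤ maxgap between consecutive elements; supports, support, and figure_7_5 are hypothetical names introduced for the illustration.

```python
def supports(data, pattern, mingap=0, maxgap=float("inf")):
    """data maps timestamp -> set of events; pattern is a list of sets."""
    times = sorted(data)

    def search(j, prev):
        # Try to place pattern[j] at some timestamp compatible with `prev`.
        if j == len(pattern):
            return True
        for t in times:
            if not pattern[j] <= data[t]:
                continue
            if prev is None or mingap < t - prev <= maxgap:
                if search(j + 1, t):
                    return True
        return False

    return search(0, None)


figure_7_5 = {
    "A": {1: {1, 2, 4}, 2: {2, 3}, 3: {5}},
    "B": {1: {1, 2}, 2: {2, 3, 4}},
    "C": {1: {1, 2}, 2: {2, 3, 4}, 3: {2, 4, 5}},
    "D": {1: {2}, 2: {3, 4}, 3: {4, 5}},
    "E": {1: {1, 3}, 2: {2, 4, 5}},
}


def support(pattern, **gaps):
    hits = sum(supports(seq, pattern, **gaps) for seq in figure_7_5.values())
    return hits / len(figure_7_5)


print(support([{2}, {5}]))                 # 0.6
print(support([{2}, {5}], maxgap=1))       # 0.4
print(support([{2}, {3}, {5}], maxgap=1))  # 0.6
```

With maxgap = 1, the longer pattern <{2}{3}{5}> has higher support than its subsequence <{2}{5}>, which is exactly the behavior that breaks the original Apriori principle.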
Definition 7.2 (Contiguous Subsequence). A sequence s is a contiguous
subsequence of w = 〈e_1 e_2 ... e_k〉 if any one of the following conditions holds:
1. s is obtained from w after deleting an event from either e_1 or e_k,
2. s is obtained from w after deleting an event from any element e_i ∈ w
that contains at least two events, or
3. s is a contiguous subsequence of t and t is a contiguous subsequence of
w.
The following examples illustrate the concept of a contiguous subsequence:
Data Sequence, s           | Sequential Pattern, t | Is t a contiguous subsequence of s?
<{1} {2,3}>                | <{1} {2}>             | Yes
<{1,2} {2} {3}>            | <{1} {2}>             | Yes
<{3,4} {1,2} {2,3} {4}>    | <{1} {2}>             | Yes
<{1} {3} {2}>              | <{1} {2}>             | No
<{1,2} {1} {3} {2}>        | <{1} {2}>             | No
Using the concept of contiguous subsequences, the Apriori principle can
be modified to handle maxgap constraints in the following way.
Definition 7.3 (Modified Apriori Principle). If a k-sequence is frequent,
then all of its contiguous (k − 1)-subsequences must also be frequent.
The modified Apriori principle can be applied to the sequential pattern
discovery algorithm with minor modifications. During candidate pruning, not
all (k − 1)-subsequences of a candidate k-sequence need to be verified since
some of them may violate the maxgap constraint. For example, if maxgap = 1,
it is not necessary to check whether the subsequence 〈{1}{2, 3}{5}〉 of the
candidate 〈{1}{2, 3}{4}{5}〉 is frequent since the time difference between
elements {2, 3} and {5} is greater than one time unit. Instead, only the
contiguous subsequences of 〈{1}{2, 3}{4}{5}〉 need to be examined. These
subsequences include 〈{1}{2, 3}{4}〉, 〈{2, 3}{4}{5}〉,
〈{1}{2}{4}{5}〉, and 〈{1}{3}{4}{5}〉.
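The contiguous (k − 1)-subsequences of a candidate can be generated directly from conditions 1 and 2 of Definition 7.2. The following sketch is an illustration only; the function name contiguous_subsequences and the tuple representation are assumptions.

```python
def contiguous_subsequences(seq):
    """Contiguous (k-1)-subsequences of seq (one deletion, Definition 7.2)."""
    result = set()
    last = len(seq) - 1
    for i, element in enumerate(seq):
        # An event may be deleted from the first or last element, or from
        # any element that contains at least two events.
        if i in (0, last) or len(element) >= 2:
            for j in range(len(element)):
                shorter = element[:j] + element[j + 1:]
                result.add(seq[:i] + ((shorter,) if shorter else ()) + seq[i + 1:])
    return result


candidate = ((1,), (2, 3), (4,), (5,))
for sub in sorted(contiguous_subsequences(candidate)):
    print(sub)
```

For the candidate 〈{1}{2, 3}{4}{5}〉 this produces exactly the four subsequences named in the text, and it omits 〈{1}{2, 3}{5}〉 because {4} is a single-event element in the middle of the sequence.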
The Window Size Constraint
Finally, events within an element sj do not have to occur at the same time. A
window size threshold (ws) can be defined to specify the maximum allowed
time difference between the latest and earliest occurrences of events in any
element of a sequential pattern. A window size of 0 means all events in the
same element of a pattern must occur simultaneously.
The following example uses ws = 2 to determine whether a data
sequence supports a given sequence (assuming mingap = 0, maxgap = 3, and
maxspan = ∞).
Data Sequence, s                   | Sequential Pattern, t | Does s support t?
<{1,3} {3,4} {4} {5} {6,7} {8}>    | <{3,4} {5}>           | Yes
<{1,3} {3,4} {4} {5} {6,7} {8}>    | <{4,6} {8}>           | Yes
<{1,3} {3,4} {4} {5} {6,7} {8}>    | <{3,4,6} {8}>         | No
<{1,3} {3,4} {4} {5} {6,7} {8}>    | <{1,3,4} {6,7,8}>     | No
In the last example, although the pattern 〈{1,3,4} {6,7,8}〉 satisfies the
window size constraint, it violates the maxgap constraint because the maximum
time difference between events in the two elements is 5 units. The window
size constraint also affects the support counting step of sequential pattern
discovery algorithms. If Algorithm 7.1 is applied without imposing the window
size constraint, the support counts for some of the candidate patterns might
be underestimated, and thus some interesting patterns may be lost.
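All four timing constraints can be combined in one containment test. The sketch below is one possible reading of the constraints in Figure 7.7, written as a plain backtracking search; the names windows and supports, the dictionary representation, and the default parameter values are assumptions of the illustration.

```python
def windows(data, element, ws):
    """Yield windows (l, u) with u - l <= ws that cover every event of element."""
    times = sorted(data)
    for l in times:
        for u in times:
            if l <= u <= l + ws:
                seen = set().union(*(data[t] for t in times if l <= t <= u))
                if element <= seen:
                    yield l, u


def supports(data, pattern, ws=0, mingap=0,
             maxgap=float("inf"), maxspan=float("inf")):
    """data maps timestamp -> set of events; pattern is a list of sets."""

    def search(i, start, prev):
        if i == len(pattern):
            return True
        for l, u in windows(data, pattern[i], ws):
            if prev is not None:
                prev_l, prev_u = prev
                # l(s_{j+1}) - u(s_j) > mingap and u(s_{j+1}) - l(s_j) <= maxgap
                if not (l - prev_u > mingap and u - prev_l <= maxgap):
                    continue
            first = l if start is None else start
            # u(s_n) - l(s_1) <= maxspan, checked on the last element
            if i == len(pattern) - 1 and u - first > maxspan:
                continue
            if search(i + 1, first, (l, u)):
                return True
        return False

    return search(0, None, None)


s = {1: {1, 3}, 2: {3, 4}, 3: {4}, 4: {5}, 5: {6, 7}, 6: {8}}
for pattern in ([{3, 4}, {5}], [{4, 6}, {8}],
                [{3, 4, 6}, {8}], [{1, 3, 4}, {6, 7, 8}]):
    print(pattern, supports(s, pattern, ws=2, mingap=0, maxgap=3))
```

The four results (True, True, False, False) match the table above. Enumerating windows that merely cover an element, rather than only the tightest ones, is harmless here because a looser window can only make the gap checks harder to satisfy.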
7.4.4 Alternative Counting Schemes
There are several methods available for counting the support of a candidate
k-sequence from a database of sequences. For illustrative purposes, consider
the problem of counting the support for sequence 〈{p}{q}〉, as shown in Figure
7.8. Assume that ws = 0, mingap = 0, maxgap = 1, and maxspan = 2.
Figure 7.8. Comparing different support counting methods.

Object’s timeline (sequence: (p) (q)):
Timestamp | 1 | 2 | 3    | 4 | 5    | 6    | 7
Events    | p | p | p, q | q | p, q | p, q | q

Method  | Count
COBJ    | 1
CWIN    | 6
CMINWIN | 4
CDIST_O | 8
CDIST   | 5
• COBJ: One occurrence per object.
This method looks for at least one occurrence of a given sequence in
an object’s timeline. In Figure 7.8, even though the sequence 〈{p}{q}〉
appears several times in the object’s timeline, it is counted only once,
with p occurring at t = 1 and q occurring at t = 3.
• CWIN: One occurrence per sliding window.
In this approach, a sliding time window of fixed length (maxspan) is
moved across an object’s timeline, one unit at a time. The support
count is incremented each time the sequence is encountered in the sliding
window. In Figure 7.8, the sequence 〈{p}{q}〉 is observed six times using
this method.
• CMINWIN: Number of minimal windows of occurrence.
A minimal window of occurrence is the smallest window in which the
sequence occurs given the timing constraints. In other words, a minimal
window is the time interval such that the sequence occurs in that time
interval, but it does not occur in any of the proper subintervals of it. This
definition can be considered as a restrictive version of CWIN, because
its effect is to shrink and collapse some of the windows that are counted
by CWIN. For example, sequence 〈{p}{q}〉 has four minimal window
occurrences: (1) the pair (p: t = 2, q: t = 3), (2) the pair (p: t = 3, q:
t = 4), (3) the pair (p: t = 5, q: t = 6), and (4) the pair (p: t = 6, q:
t = 7). The occurrence of event p at t = 1 and event q at t = 3 is not a
minimal window occurrence because it contains a smaller window with
(p: t = 2, q: t = 3), which is indeed a minimal window of occurrence.
• CDIST_O: Distinct occurrences with possibility of event-timestamp
overlap.
A distinct occurrence of a sequence is defined to be the set of
event-timestamp pairs such that there has to be at least one new
event-timestamp pair that is different from a previously counted occurrence.
Counting all such distinct occurrences results in the CDIST_O method.
If the occurrence time of events p and q is denoted as a tuple (t(p), t(q)),
then this method yields eight distinct occurrences of sequence 〈{p}{q}〉
at times (1,3), (2,3), (2,4), (3,4), (3,5), (5,6), (5,7), and (6,7).
• CDIST: Distinct occurrences with no event-timestamp overlap allowed.
In CDIST_O above, two occurrences of a sequence were allowed to have
overlapping event-timestamp pairs, e.g., the overlap between (1,3) and
(2,3). In the CDIST method, no overlap is allowed. Effectively, when an
event-timestamp pair is considered for counting, it is marked as used and
is never used again for subsequent counting of the same sequence. As
an example, there are five distinct, non-overlapping occurrences of the
sequence 〈{p}{q}〉 in the diagram shown in Figure 7.8. These occurrences
happen at times (1,3), (2,4), (3,5), (5,6), and (6,7). Observe that these
occurrences are subsets of the occurrences observed in CDIST_O.
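Most of these counts can be reproduced with a short sketch. Several things here are assumptions rather than statements from the text: the timestamps of p and q are inferred from the occurrences listed above, an occurrence is taken to be any pair with 0 < t(q) − t(p) ≤ 2 (which is what the listed occurrences satisfy), and CDIST is computed with a simple greedy pass in timestamp order. CWIN is left out because its value depends on how windows at the ends of the timeline are treated.

```python
# Assumed timeline, inferred from the occurrences listed in the text.
p_times = [1, 2, 3, 5, 6]
q_times = [3, 4, 5, 6, 7]
max_diff = 2  # assumption: q must follow p by at most two time units

occurrences = [(tp, tq) for tp in p_times for tq in q_times
               if 0 < tq - tp <= max_diff]

cobj = 1 if occurrences else 0   # one occurrence per object
cdist_o = len(occurrences)       # overlapping occurrences allowed

# CMINWIN: an occurrence is minimal if no other occurrence fits inside it.
minimal = [(tp, tq) for tp, tq in occurrences
           if not any((a, b) != (tp, tq) and tp <= a and b <= tq
                      for a, b in occurrences)]

# CDIST: greedy pass in which each event-timestamp pair is used at most once.
used_p, used_q, distinct = set(), set(), []
for tp, tq in occurrences:
    if tp not in used_p and tq not in used_q:
        distinct.append((tp, tq))
        used_p.add(tp)
        used_q.add(tq)

print(cobj, cdist_o, len(minimal), len(distinct))  # 1 8 4 5
print(minimal)   # [(2, 3), (3, 4), (5, 6), (6, 7)]
print(distinct)  # [(1, 3), (2, 4), (3, 5), (5, 6), (6, 7)]
```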
One final point regarding the counting methods is the need to determine the
baseline for computing the support measure. For frequent itemset mining, the
baseline is given by the total number of transactions. For sequential pattern
mining, the baseline depends on the counting method used. For the COBJ
method, the total number of objects in the input data can be used as the
baseline. For the CWIN and CMINWIN methods, the baseline is given by the
sum of the number of time windows possible in all objects. For methods such
as CDIST and CDIST_O, the baseline is given by the sum of the number of
distinct timestamps present in the input data of each object.
7.5 Subgraph Patterns
This section describes the application of association analysis methods to more
complex entities beyond itemsets and sequences. Examples include chemical
compounds, 3D protein structures, network topologies, and tree structured
XML documents. These entities can be modeled using a graph representation,
as shown in Table 7.7.
Table 7.7. Graph representation of entities in various application domains.
Application             | Graphs                          | Vertices              | Edges
Web mining              | Web browsing patterns           | Web pages             | Hyperlink between pages
Computational chemistry | Structure of chemical compounds | Atoms or ions         | Bond between atoms or ions
Network computing       | Computer networks               | Computers and servers | Interconnection between machines
Semantic Web            | Collection of XML documents     | XML elements          | Parent-child relationship between elements
Bioinformatics          | Protein structures              | Amino acids           | Contact residue
A useful data mining task to perform on this type of data is to derive a
set of common substructures among the collection of graphs. Such a task is
known as frequent subgraph mining. A potential application of frequent
subgraph mining can be seen in the context of computational chemistry. Each
year, new chemical compounds are designed for the development of
pharmaceutical drugs, pesticides, fertilizers, etc. Although the structure of a
compound is known to play a major role in determining its chemical properties,
it is difficult to establish their exact relationship. Frequent subgraph mining
can aid this undertaking by identifying the substructures commonly associated with
certain properties of known compounds. Such information can help scientists
to develop new chemical compounds that have certain desired properties.
This section presents a methodology for applying association analysis to
graph-based data. The section begins with a review of some of the basic
graph-related concepts and definitions. The frequent subgraph mining problem
is then introduced, followed by a description of how the traditional Apriori
algorithm can be extended to discover such patterns.
7.5.1 Graphs and Subgraphs
A graph is a data structure that can be used to represent the relationships
among a set of entities. Mathematically, a graph is composed of a vertex set V
and a set of edges E connecting pairs of vertices. Each edge is denoted
by a vertex pair (v_i, v_j), where v_i, v_j ∈ V. A label l(v_i) can be assigned to each
vertex v_i representing the name of an entity. Similarly, each edge (v_i, v_j) can
also be associated with a label l(v_i, v_j) describing the relationship between a
pair of entities. Table 7.7 shows the vertices and edges associated with different
types of graphs. For example, in a Web graph, the vertices correspond to Web
pages and the edges represent the hyperlinks between Web pages.
Definition 7.4 (Subgraph). A graph G′ = (V ′, E′) is a subgraph of another
graph G = (V, E) if its vertex set V ′ is a subset of V and its edge set E′ is a
subset of E. The subgraph relationship is denoted as G′ ⊆S G.
Figure 7.9 shows a graph that contains 6 vertices and 11 edges along with
one of its possible subgraphs. The subgraph, which is shown in Figure 7.9(b),
contains only 4 of the 6 vertices and 4 of the 11 edges in the original graph.
Figure 7.9. Example of a subgraph. (a) Labeled graph. (b) Subgraph.
Definition 7.5 (Support). Given a collection of graphs $\mathcal{G}$, the support for
a subgraph g is defined as the fraction of all graphs that contain g as a
subgraph, i.e.:
$s(g) = \frac{|\{G_i \mid g \subseteq_S G_i,\ G_i \in \mathcal{G}\}|}{|\mathcal{G}|}$. (7.2)
Figure 7.10. Computing the support of a subgraph from a set of graphs.
[The figure shows a graph data set, G1 through G5, together with three
subgraphs: g1 (support = 80%), g2 (support = 60%), and g3 (support = 40%).]
Example 7.2. Consider the five graphs, G1 through G5, shown in Figure
7.10. The graph g1 shown on the top right-hand diagram is a subgraph of G1,
G3, G4, and G5. Therefore, s(g1) = 4/5 = 80%. Similarly, we can show that
s(g2) = 60% because g2 is a subgraph of G1, G2, and G3, while s(g3) = 40%
because g3 is a subgraph of G1 and G3.
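Computing the support of a labeled subgraph requires testing whether the pattern can be mapped into each graph. The sketch below does this by brute force over all vertex assignments, which is only practical for very small graphs. The three graphs are hypothetical examples made up for the sketch (they are not the graphs of Figure 7.10), and make_graph, contains, and support are illustrative names.

```python
from itertools import permutations


def make_graph(labels, edges):
    """labels: {vertex: label}; edges: iterable of (u, v, edge_label)."""
    return labels, {frozenset((u, v)): lab for u, v, lab in edges}


def contains(graph, pattern):
    """True if the labeled pattern maps into graph (brute-force search)."""
    g_labels, g_edges = graph
    p_labels, p_edges = pattern
    p_vertices = list(p_labels)
    for image in permutations(g_labels, len(p_vertices)):
        mapping = dict(zip(p_vertices, image))
        if any(p_labels[v] != g_labels[mapping[v]] for v in p_vertices):
            continue  # vertex labels must be preserved
        if all(g_edges.get(frozenset(mapping[x] for x in edge)) == lab
               for edge, lab in p_edges.items()):
            return True  # every pattern edge exists with the same label
    return False


def support(pattern, graphs):
    return sum(contains(g, pattern) for g in graphs) / len(graphs)


# Hypothetical data set of three small graphs.
graphs = [
    make_graph({1: "a", 2: "b", 3: "c"}, [(1, 2, "p"), (2, 3, "q")]),
    make_graph({1: "a", 2: "b"}, [(1, 2, "p")]),
    make_graph({1: "a", 2: "c"}, [(1, 2, "p")]),
]
g = make_graph({"x": "a", "y": "b"}, [("x", "y", "p")])
print(support(g, graphs))  # 2 of the 3 graphs contain g
```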
7.5.2 Frequent Subgraph Mining
This section presents a formal definition of the frequent subgraph mining
problem and illustrates the complexity of this task.
Definition 7.6 (Frequent Subgraph Mining). Given a set of graphs $\mathcal{G}$
and a support threshold, minsup, the goal of frequent subgraph mining is to
find all subgraphs g such that s(g) ≥ minsup.
While this formulation is generally applicable to any type of graph, the
discussion presented in this chapter focuses primarily on undirected,
connected graphs. The definitions of these graphs are given below:
1. A graph is connected if there exists a path between every pair of vertices
in the graph, in which a path is a sequence of vertices <v_1 v_2 ... v_k>
such that there is an edge connecting every pair of adjacent
vertices (v_i, v_{i+1}) in the sequence.
2. A graph is undirected if it contains only undirected edges. An edge
(v_i, v_j) is undirected if it is indistinguishable from (v_j, v_i).
Methods for handling other types of subgraphs (directed or disconnected) are
left as an exercise to the readers (see Exercise 15 on page 482).
Mining frequent subgraphs is a computationally expensive task because of
the exponential scale of the search space. To illustrate the complexity of this
task, consider a data set that contains d entities. In frequent itemset mining,
each entity is an item and the size of the search space to be explored is 2^d,
which is the number of candidate itemsets that can be generated. In frequent
subgraph mining, each entity is a vertex and can have up to d − 1 edges to
other vertices. Assuming that the vertex labels are unique, the total number
of subgraphs is
$\sum_{i=1}^{d} \binom{d}{i} \times 2^{i(i-1)/2}$,
where $\binom{d}{i}$ is the number of ways to choose i vertices to form a subgraph and
$i(i-1)/2$ is the maximum number of edges between those vertices, so that
$2^{i(i-1)/2}$ is the number of possible edge sets. Table 7.8 compares
the number of itemsets and subgraphs for different values of d.
Table 7.8. A comparison between number of itemsets and subgraphs for different dimensionality, d.
Number of entities, d | 1 | 2 | 3  | 4   | 5     | 6      | 7         | 8
Number of itemsets    | 2 | 4 | 8  | 16  | 32    | 64     | 128       | 256
Number of subgraphs   | 2 | 5 | 18 | 113 | 1,450 | 40,069 | 2,350,602 | 286,192,513
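The entries of Table 7.8 can be recomputed with a few lines of Python. One observation from doing so, offered here as an inference rather than something the text states: the tabulated subgraph counts match the sum only when an i = 0 term (the empty subgraph) is included, mirroring the way 2^d counts the empty itemset. The function names and the include_empty flag are illustrative.

```python
from math import comb


def num_itemsets(d):
    """Number of itemsets over d items, including the empty set."""
    return 2 ** d


def num_subgraphs(d, include_empty=True):
    """Number of subgraphs over d uniquely labeled vertices."""
    start = 0 if include_empty else 1
    return sum(comb(d, i) * 2 ** (i * (i - 1) // 2)
               for i in range(start, d + 1))


for d in range(1, 9):
    print(d, num_itemsets(d), num_subgraphs(d))
```

For d = 8 the sketch gives 286,192,513 subgraphs, compared with only 256 itemsets.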
The number of candidate subgraphs is actually much smaller because the
numbers given in Table 7.8 include subgraphs that are disconnected.
Disconnected subgraphs are usually ignored because they are not as interesting as
connected subgraphs.
A brute-force method for mining frequent subgraphs is to generate all
connected subgraphs as candidates and count their respective supports. For
example, consider the graphs shown in Figure 7.11(a). Assuming that the vertex
labels are chosen from the set {a, b} and the edge labels are chosen from the
set {p, q}, the list of connected subgraphs with one vertex up to three vertices
is shown in Figure 7.11(b). The number of candidate subgraphs is considerably
larger than the number of candidate itemsets in traditional association rule
mining for the following reasons:
1. An item can appear at most once in an itemset, whereas a vertex label
can appear more than once in a graph.
2. The same pair of vertex labels can have multiple choices of edge labels.
Given the large number of candidate subgraphs, a brute-force method may
break down even for moderately sized graphs.

Figure 7.11. Brute-force method for mining frequent subgraphs.
(a) Example of a graph data set (graphs G1 through G4, with vertex labels a and b
and edge labels p and q). (b) List of connected subgraphs for k = 1, k = 2, and k = 3
vertices.
Figure 7.12. Mapping a collection of graph structures into market basket transactions.
[The figure shows graphs G1 through G4 and a binary table with one row per graph
and one column per item, such as (a,b,p), (a,b,q), (a,b,r), (b,c,p), (b,c,q),
(b,c,r), ..., (d,e,r).]
7.5.3 Apriorilike Method
This section examines how an Apriorilike algorithm can be developed for
finding frequent subgraphs.
Data Transformation
One possible approach is to transform each graph into a transaction-like format
so that existing algorithms such as Apriori can be applied. Figure 7.12
illustrates how to transform a collection of graphs into its equivalent market
basket representation. In this representation, each combination of edge label
l(e) with its corresponding vertex labels, (l(vi), l(vj)), is mapped into an
“item.” The width of the “transaction” is given by the number of edges in the
graph. Despite its simplicity, this approach works only if every edge in a graph
has a unique combination of vertex and edge labels. Otherwise, such graphs
cannot be accurately modeled using this representation.
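The transformation can be sketched in a few lines of Python. This is only an illustration under stated assumptions: a graph is represented as a list of (vertex label, vertex label, edge label) triples, edges are undirected, and the function name graph_to_transaction is hypothetical rather than taken from the text.

```python
from typing import FrozenSet, List, Tuple

# Assumed representation: one triple (l(vi), l(vj), l(e)) per edge.
Edge = Tuple[str, str, str]


def graph_to_transaction(edges: List[Edge]) -> FrozenSet[Edge]:
    """Map every edge to an "item" (l(vi), l(vj), l(e)), ignoring direction."""
    items = []
    for lu, lv, le in edges:
        a, b = sorted((lu, lv))  # undirected: (a, b) and (b, a) are the same item
        items.append((a, b, le))
    if len(set(items)) != len(items):
        # Two edges share the same label combination, so a set of items
        # can no longer represent the graph accurately.
        raise ValueError("duplicate (vertex, vertex, edge) label combination")
    return frozenset(items)


g = [("a", "b", "p"), ("b", "c", "q"), ("e", "d", "r")]
print(sorted(graph_to_transaction(g)))
```

The explicit error mirrors the limitation noted above: once two edges carry the same combination of labels, the transaction no longer distinguishes them.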
General Structure of the Frequent Subgraph Mining Algorithm
An Apriori-like algorithm for mining frequent subgraphs consists of the
following steps:
1. Candidate generation, which is the process of merging pairs of
frequent (k − 1)-subgraphs to obtain a candidate k-subgraph.
2. Candidate pruning, which is the process of discarding all candidate
k-subgraphs that contain infrequent (k − 1)-subgraphs.
3. Support counting, which is the process of counting the number of
graphs in G that contain each candidate.
4. Candidate elimination, which discards all candidate subgraphs whose
support counts are less than minsup.
The specific details of these steps are discussed in the remainder of this section.
7.5.4 Candidate Generation
During candidate generation, a pair of frequent (k − 1)-subgraphs are merged
to form a candidate k-subgraph. The first question is how to define k, the size
of a subgraph. In the example shown in Figure 7.11, k refers to the number
of vertices in the graph. This approach of iteratively expanding a subgraph
by adding an extra vertex is known as vertex growing. Alternatively, k may
refer to the number of edges in the graph. This approach of adding an extra
edge to the existing subgraphs is known as edge growing.
To avoid generating duplicate candidates, we may impose an additional
condition for merging, that the two (k − 1)-subgraphs must share a common
(k − 2)-subgraph. The common (k − 2)-subgraph is known as their core. Below,
we briefly describe the candidate generation procedure for both vertex-growing
and edge-growing strategies.
Candidate Generation via Vertex Growing
Vertex growing is the process of generating a new candidate by adding a new
vertex into an existing frequent subgraph. Before describing this approach,
let us first consider the adjacency matrix representation of a graph. Each
entry M(i, j) in the matrix contains either the label of the edge connecting
the vertices vi and vj, or zero, if there is no edge between them.
The vertex-growing approach can be viewed as the process of generating a
k × k adjacency matrix by combining a pair of (k − 1) × (k − 1) adjacency
matrices, as illustrated in Figure 7.13. G1 and G2 are two graphs whose
adjacency matrices are given by M(G1) and M(G2), respectively. The core
for the graphs is indicated by dashed lines in the diagram. The procedure for
generating candidate subgraphs via vertex growing is presented next.
[Figure 7.13. Vertex-growing strategy. Graphs G1 and G2, their merge G3 = merge(G1, G2), and the adjacency matrices M(G1), M(G2), and M(G3); the entries marked "?" in M(G3) are the unknown edge labels between the two vertices outside the core.]
Subgraph Merging Procedure via Vertex Growing
An adjacency matrix M(1) is merged with another matrix M(2) if the submatrices
obtained by removing the last row and last column of M(1) and M(2) are identical
to each other. The resulting matrix is the matrix M(1), appended with the last
row and last column of matrix M(2). The remaining entries of the new matrix are
either zero or replaced by all valid edge labels connecting the pair of vertices.
The resulting graph contains one or two edges more than the original
graphs. In Figure 7.13, both G1 and G2 contain four vertices and four edges.
After merging, the resulting graph G3 has five vertices. The number of edges
in G3 depends on whether the vertices d and e are connected. If d and e
are disconnected, then G3 has five edges and the corresponding matrix entry
for (d, e) is zero. Otherwise, G3 has six edges and the matrix entry for (d, e)
corresponds to the label for the newly created edge. Since the edge label is
unknown, we need to consider all possible edge labels for (d, e), thus increasing
the number of candidate subgraphs substantially.
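The merging procedure can be sketched as follows. This is a minimal illustration, assuming that each adjacency matrix is a list of lists holding edge labels (0 for "no edge"), that the non-core vertex is stored in the last row and column, and that vertex labels are handled elsewhere. The function name merge_vertex_growing and the edge_labels parameter are hypothetical.

```python
from typing import List, Sequence, Union

Entry = Union[int, str]      # 0 means "no edge", otherwise an edge label
Matrix = List[List[Entry]]


def merge_vertex_growing(m1: Matrix, m2: Matrix,
                         edge_labels: Sequence[str]) -> List[Matrix]:
    """Merge two (k-1) x (k-1) adjacency matrices that share the same core."""
    n = len(m1)
    core1 = [row[:-1] for row in m1[:-1]]
    core2 = [row[:-1] for row in m2[:-1]]
    if len(m2) != n or core1 != core2:
        return []            # different cores: the pair cannot be merged
    candidates = []
    # The entry joining the two non-core vertices is unknown, so try
    # "no edge" (0) followed by every valid edge label.
    for new_entry in [0, *edge_labels]:
        # Core rows: row of m1 extended with the last-column entry of m2.
        merged = [row[:] + [m2[i][-1]] for i, row in enumerate(m1[:-1])]
        # Row of the non-core vertex of m1.
        merged.append(m1[-1][:] + [new_entry])
        # Row of the non-core vertex of m2.
        merged.append(m2[-1][:-1] + [new_entry, m2[-1][-1]])
        candidates.append(merged)
    return candidates


m1 = [[0, "p"], ["p", 0]]
m2 = [[0, "q"], ["q", 0]]
cands = merge_vertex_growing(m1, m2, ["p", "q"])
print(len(cands))
```

With two possible edge labels, one pair of matrices already yields three candidates, which illustrates why the unknown entry increases the number of candidates.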
Candidate Generation via Edge Growing
Edge growing inserts a new edge into an existing frequent subgraph during
candidate generation. Unlike vertex growing, the resulting subgraph does not
necessarily increase the number of vertices in the original graphs. Figure 7.14
shows two possible candidate subgraphs obtained by merging G1 and G2 via
the edge-growing strategy. The first candidate subgraph, G3, has one extra
vertex, while the second candidate subgraph, G4, has the same number of
vertices as the original graphs. The core for the graphs is indicated by dashed
lines in the diagram.
[Figure 7.14. Edge-growing strategy. Graphs G1 and G2 and the two candidates G3 = merge(G1, G2) and G4 = merge(G1, G2).]
The procedure for generating candidate subgraphs via edge growing can
be summarized as follows.
Subgraph Merging Procedure via Edge Growing
A frequent subgraph g(1) is merged with another frequent subgraph g(2) only if
the subgraph obtained by removing an edge from g(1) is topologically equivalent
to the subgraph obtained by removing an edge from g(2). After merging, the
resulting candidate is the subgraph g(1), appended with the extra edge from g(2).
The graphs to be merged may contain several vertices that are topologically
equivalent to each other. To illustrate the concept of topologically
equivalent vertices, consider the graphs shown in Figure 7.15. The graph G1
contains four vertices with identical vertex labels, “a.” If a new edge is
attached to any one of the four vertices, the resulting graph will look the same.
The vertices in G1 are therefore topologically equivalent to each other.
[Figure 7.15. Illustration of topologically equivalent vertices (graphs G1, G2, and G3).]
The graph G2 has two pairs of topologically equivalent vertices, v1 with
v4 and v2 with v3, even though the vertex and edge labels are identical. It is
easy to see that v1 is not topologically equivalent to v2 because the number of
edges incident on the vertices is different. Therefore, attaching a new edge to
v1 results in a different graph than attaching the same edge to v2. Meanwhile,
the graph G3 does not have any topologically equivalent vertices. While v1
and v4 have the same vertex labels and number of incident edges, attaching a
new edge to v1 results in a different graph than attaching the same edge to v4.
The notion of topologically equivalent vertices can help us understand why
multiple candidate subgraphs can be generated during edge growing. Consider
the (k − 1)-subgraphs G1 and G2 shown in Figure 7.16. To simplify the
notation, their core, which contains k − 2 common edges between the two
graphs, is drawn as a rectangular box. The remaining edge in G1 that is not
included in the core is shown as a dangling edge connecting the vertices a and
b. Similarly, the remaining edge in G2 that is not part of the core is shown as
a dangling edge connecting vertices c and d. Although the cores for G1 and
G2 are identical, a and c may or may not be topologically equivalent to each
other. If a and c are topologically equivalent, we denote them as a = c. For
vertices outside the core, we denote them as b = d if their labels are identical.
[Figure 7.16. General approach for merging a pair of subgraphs via edge growing.]
[Figure 7.17. Candidate subgraphs generated via edge growing. (a) a ≠ c and b ≠ d. (b) a = c and b ≠ d. (c) a ≠ c and b = d. (d) a = c and b = d.]
The following rule of thumb can be used to determine the candidate subgraphs
obtained during candidate generation:
1. If a ≠ c and b ≠ d, then there is only one possible resulting subgraph,
as shown in Figure 7.17(a).
2. If a = c but b ≠ d, then there are two possible resulting subgraphs, as
shown in Figure 7.17(b).
3. If a ≠ c but b = d, then there are two possible resulting subgraphs, as
shown in Figure 7.17(c).
4. If a = c and b = d, then there are three possible resulting subgraphs, as
shown in Figure 7.17(d).
[Figure 7.18. Multiplicity of candidates during candidate generation.]
Multiple candidate subgraphs can also be generated when there is more
than one core associated with the pair of (k − 1)-subgraphs, as shown in Figure
7.18. The shaded vertices correspond to those vertices whose edges form a
core during the merging operation. Each core may lead to a different set of
candidate subgraphs. In principle, if a pair of frequent (k − 1)-subgraphs is
merged, there can be at most k − 2 cores, each of which is obtained by removing
an edge from one of the merged graphs. Although the edge-growing procedure
can produce multiple candidate subgraphs, the number of candidate subgraphs
tends to be smaller than those produced by the vertex-growing strategy.
7.5.5 Candidate Pruning
After the candidate k-subgraphs are generated, the candidates whose
(k − 1)-subgraphs are infrequent need to be pruned. The pruning step can be
performed by successively removing an edge from the candidate k-subgraph
and checking whether the corresponding (k − 1)-subgraph is connected and
frequent. If not, the candidate k-subgraph can be discarded.
To check whether the (k − 1)-subgraph is frequent, it should be matched
against other frequent (k − 1)-subgraphs. Determining whether two graphs are
topologically equivalent (or isomorphic) is known as the graph isomorphism
problem. To illustrate the difficulty of solving the graph isomorphism problem,
consider the two graphs shown in Figure 7.19. Even though both graphs look
different, they are actually isomorphic to each other because there is a
one-to-one mapping between the vertices of the two graphs that preserves
the vertex labels and the edges.
[Figure 7.19. Graph isomorphism.]
Handling Graph Isomorphism
A standard approach for handling the graph isomorphism problem is to map
each graph into a unique string representation known as its code or canonical
label. A canonical label has the property that if two graphs are isomorphic,
then their codes must be the same. This property allows us to test for graph
isomorphism by comparing the canonical labels of the graphs.
The first step toward constructing the canonical label of a graph is to
find an adjacency matrix representation for the graph. Figure 7.20 shows an
example of such a matrix for the given graph. In principle, a graph can have
more than one adjacency matrix representation because there are multiple
ways to order the vertices in the adjacency matrix. In the example shown in
Figure 7.20, the first row and column correspond to the vertex a that has 3
edges, the second row and column correspond to another vertex a that has
2 edges, and so on. To derive all the adjacency matrix representations for
a graph, we need to consider all possible permutations of rows (and their
corresponding columns) of the matrix.
[Figure 7.20. Adjacency matrix representation of a graph.]
Mathematically, each permutation corresponds to a multiplication of the
initial adjacency matrix with a corresponding permutation matrix, as illustrated
in the following example.
Example 7.3. Consider the following matrix:
$$
M = \begin{pmatrix} 1 & 2 & 3 & 4 \\ 5 & 6 & 7 & 8 \\ 9 & 10 & 11 & 12 \\ 13 & 14 & 15 & 16 \end{pmatrix}
$$
The following permutation matrix can be used to exchange the first row (and
column) with the third row (and column) of M:
$$
P_{13} = \begin{pmatrix} 0 & 0 & 1 & 0 \\ 0 & 1 & 0 & 0 \\ 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix},
$$
where P13 is obtained by swapping the first and third rows of the identity
matrix. To exchange the first and third rows (and columns), the permutation
matrix is multiplied with M:
$$
M' = P_{13}^{T} \times M \times P_{13} = \begin{pmatrix} 11 & 10 & 9 & 12 \\ 7 & 6 & 5 & 8 \\ 3 & 2 & 1 & 4 \\ 15 & 14 & 13 & 16 \end{pmatrix}.
$$
Note that multiplying M from the right with P13 exchanges the first and third
columns of M, while multiplying M from the left with the transpose of P13
exchanges the first and third rows of M. If all three matrices are multiplied,
this will produce a matrix M′ whose first and third rows and columns have been
swapped.
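The product in Example 7.3 can be checked with a few lines of plain Python. The helper names matmul and transpose are illustrative and not part of the text; a numerical library would normally be used instead.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def transpose(a):
    """Return the transpose of a matrix given as a list of rows."""
    return [list(col) for col in zip(*a)]


M = [[1, 2, 3, 4],
     [5, 6, 7, 8],
     [9, 10, 11, 12],
     [13, 14, 15, 16]]

# Identity matrix with its first and third rows swapped.
P13 = [[0, 0, 1, 0],
       [0, 1, 0, 0],
       [1, 0, 0, 0],
       [0, 0, 0, 1]]

# Left factor swaps rows 1 and 3, right factor swaps columns 1 and 3.
M_prime = matmul(matmul(transpose(P13), M), P13)
for row in M_prime:
    print(row)
```

Because P13 is symmetric, its transpose equals P13 itself; the transpose is kept only to match the general form of the expression.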
[Figure 7.21. String representation of adjacency matrices. Two numberings of the eight vertices A(1) to A(4) and B(5) to B(8) give two adjacency matrices, with Code = 1100111000010010010100001011 for the first and Code = 1011010010100000100110001110 for the second.]
The second step is to determine the string representation for each adjacency
matrix. Since the adjacency matrix is symmetric, it is sufficient to construct
the string representation based on the upper triangular part of the matrix. In
the example shown in Figure 7.21, the code is obtained by concatenating the
entries of the upper triangular matrix in a column-wise fashion. The final step
is to compare all the string representations of the graph and choose the one
that has the lowest (or highest) lexicographic value.
The preceding approach seems expensive because it requires us to examine
all possible adjacency matrices of a graph and to compute each of their string
representations in order to find the canonical label. More specifically, there
are k! permutations that must be considered for every graph that contains k
vertices. Some of the methods developed to reduce the complexity of this task
include caching the previously computed canonical label (so that we do not
have to recompute it when performing an isomorphism test on the same
graph) and reducing the number of permutations needed to determine the
canonical label by incorporating additional information such as vertex labels
and the degree of a vertex. The latter approach is beyond the scope of this
book, but interested readers may consult the bibliographic notes at the end of
this chapter.
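The brute-force construction of a canonical label can be sketched as follows. This is an illustration only: it assumes an unlabeled graph given as a 0/1 adjacency matrix, picks the lowest lexicographic code, and uses the hypothetical names code_for_order and canonical_label. It examines all k! orderings, so it is usable only for very small graphs.

```python
from itertools import permutations


def code_for_order(adj, order):
    """Concatenate the upper-triangular entries column by column."""
    k = len(order)
    return "".join(
        str(adj[order[i]][order[j]])
        for j in range(1, k)      # column index
        for i in range(j)         # rows above the diagonal
    )


def canonical_label(adj):
    """Lowest code over all k! vertex orderings (brute force)."""
    return min(code_for_order(adj, order)
               for order in permutations(range(len(adj))))


# Two drawings of the same 3-vertex path, with different vertex numbering.
path_a = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]   # middle vertex is vertex 1
path_b = [[0, 1, 1], [1, 0, 0], [1, 0, 0]]   # middle vertex is vertex 0
triangle = [[0, 1, 1], [1, 0, 1], [1, 1, 0]]

print(canonical_label(path_a), canonical_label(path_b), canonical_label(triangle))
```

The two paths receive the same label while the triangle receives a different one, which is the property that allows isomorphism to be tested by comparing codes.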
7.5.6 Support Counting
Support counting is also a potentially costly operation because all the
candidate subgraphs contained in each graph G ∈ G must be determined. One
way to speed up this operation is to maintain a list of graph IDs associated
with each frequent (k − 1)-subgraph. Whenever a new candidate k-subgraph
is generated by merging a pair of frequent (k − 1)-subgraphs, their corresponding
lists of graph IDs are intersected. Finally, the subgraph isomorphism tests
are performed on the graphs in the intersected list to determine whether they
contain a particular candidate subgraph.
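The ID-list idea can be sketched as follows. The names candidate_graph_ids and count_support are hypothetical, and the contains argument stands for a subgraph isomorphism test supplied by the caller. In the toy usage, graphs are sets of edge items and containment is a subset test, which is valid only under the unique-label assumption of the transaction representation.

```python
def candidate_graph_ids(ids_a, ids_b):
    """Graphs that contain both parent subgraphs; only these can contain the candidate."""
    return sorted(set(ids_a) & set(ids_b))


def count_support(candidate, graphs, ids_a, ids_b, contains):
    """Run the (expensive) containment test only on the intersected ID list."""
    return sum(1 for gid in candidate_graph_ids(ids_a, ids_b)
               if contains(graphs[gid], candidate))


# Toy data: each graph is a set of (vertex label, vertex label, edge label) items.
graphs = {
    1: {("a", "b", "p"), ("b", "c", "q")},
    2: {("a", "b", "p")},
    3: {("a", "b", "p"), ("b", "c", "q"), ("c", "d", "r")},
    4: {("b", "c", "q")},
}
ids_ab = [1, 2, 3]          # graphs containing the subgraph {(a, b, p)}
ids_bc = [1, 3, 4]          # graphs containing the subgraph {(b, c, q)}
candidate = {("a", "b", "p"), ("b", "c", "q")}

support_count = count_support(candidate, graphs, ids_ab, ids_bc,
                              contains=lambda g, c: c <= g)
print(candidate_graph_ids(ids_ab, ids_bc), support_count)
```

Only two of the four graphs reach the containment test here, since graphs 2 and 4 are missing one of the parent subgraphs.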
7.6 Infrequent Patterns
The association analysis formulation described so far is based on the premise
that the presence of an item in a transaction is more important than its
absence. As a consequence, patterns that are rarely found in a database are often
considered to be uninteresting and are eliminated using the support measure.
Such patterns are known as infrequent patterns.
Definition 7.7 (Infrequent Pattern). An infrequent pattern is an itemset
or a rule whose support is less than the minsup threshold.
Although a vast majority of infrequent patterns are uninteresting, some
of them might be useful to the analysts, particularly those that correspond
to negative correlations in the data. For example, the sale of DVDs and VCRs
together is low because any customer who buys a DVD will most likely not buy
a VCR, and vice versa. Such negatively correlated patterns are useful to help
identify competing items, which are items that can be substituted for one
another. Examples of competing items include tea versus coffee, butter versus
margarine, regular versus diet soda, and desktop versus laptop computers.
Some infrequent patterns may also suggest the occurrence of interesting
rare events or exceptional situations in the data. For example, if {Fire = Yes}
is frequent but {Fire = Yes, Alarm = On} is infrequent, then the latter is an
interesting infrequent pattern because it may indicate faulty alarm systems.
To detect such unusual situations, the expected support of a pattern must
be determined, so that, if a pattern turns out to have a considerably lower
support than expected, it is declared as an interesting infrequent pattern.
Mining infrequent patterns is a challenging endeavor because there is an
enormous number of such patterns that can be derived from a given data set.
More specifically, the key issues in mining infrequent patterns are: (1) how
to identify interesting infrequent patterns, and (2) how to efficiently discover
them in large data sets. To get a different perspective on various types of
interesting infrequent patterns, two related concepts, negative patterns and
negatively correlated patterns, are introduced in Sections 7.6.1 and 7.6.2,
respectively. The relationships among these patterns are elucidated in Section
7.6.3. Finally, two classes of techniques developed for mining interesting
infrequent patterns are presented in Sections 7.6.5 and 7.6.6.
7.6.1 Negative Patterns
Let I = {i1, i2, . . . , id} be a set of items. A negative item, $\overline{i_k}$, denotes the
absence of item ik from a given transaction. For example, $\overline{\text{coffee}}$ is a negative
item whose value is 1 if a transaction does not contain coffee.
Definition 7.8 (Negative Itemset). A negative itemset X is an itemset
that has the following properties: (1) $X = A \cup \overline{B}$, where A is a set of positive
items, $\overline{B}$ is a set of negative items, $|\overline{B}| \geq 1$, and (2) s(X) ≥ minsup.
Definition 7.9 (Negative Association Rule). A negative association rule
is an association rule that has the following properties: (1) the rule is extracted
from a negative itemset, (2) the support of the rule is greater than or equal to
minsup, and (3) the confidence of the rule is greater than or equal to minconf.
The negative itemsets and negative association rules are collectively known
as negative patterns throughout this chapter. An example of a negative
association rule is tea −→ $\overline{\text{coffee}}$, which may suggest that people who drink
tea tend to not drink coffee.
7.6.2 Negatively Correlated Patterns
Section 6.7.1 on page 371 described how correlation analysis can be used to
analyze the relationship between a pair of categorical variables. Measures
such as interest factor (Equation 6.5) and the φ-coefficient (Equation 6.8)
were shown to be useful for discovering itemsets that are positively correlated.
This section extends the discussion to negatively correlated patterns.
Let X = {x1, x2, . . . , xk} denote a k-itemset and P(X) denote the
probability that a transaction contains X. In association analysis, the probability
is often estimated using the itemset support, s(X).
Definition 7.10 (Negatively Correlated Itemset). An itemset X is
negatively correlated if
$$
s(X) < \prod_{j=1}^{k} s(x_j) = s(x_1) \times s(x_2) \times \cdots \times s(x_k), \tag{7.3}
$$
where s(xj) is the support of an item xj.
The right-hand side of the preceding expression, $\prod_{j=1}^{k} s(x_j)$, represents an
estimate of the probability that a transaction contains all the items in X when
the items are statistically independent.
Definition 7.10 suggests that an itemset is negatively correlated if its support
is below the expected support computed using the statistical independence
assumption. The smaller s(X), the more negatively correlated is the pattern.
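Definition 7.10 translates directly into a short check. The sketch below assumes that transactions are Python sets of item names; the function names and the toy data are hypothetical and chosen only to show one negatively correlated pair and one positively correlated pair.

```python
from math import prod


def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def is_negatively_correlated(itemset, transactions):
    """Inequality 7.3: s(X) is below the product of the single-item supports."""
    expected = prod(support({x}, transactions) for x in itemset)
    return support(itemset, transactions) < expected


# Hypothetical toy data set with eight transactions.
transactions = [
    {"tea"}, {"tea"}, {"coffee"}, {"coffee"}, {"tea", "coffee"},
    {"bread", "butter"}, {"bread", "butter"}, {"milk"},
]

print(is_negatively_correlated({"tea", "coffee"}, transactions))
print(is_negatively_correlated({"bread", "butter"}, transactions))
```

Here s({tea, coffee}) = 1/8 is below the expected 3/8 × 3/8 = 9/64, whereas s({bread, butter}) = 2/8 exceeds the expected 1/16.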
Definition 7.11 (Negatively Correlated Association Rule). An
association rule X −→ Y is negatively correlated if
$$
s(X \cup Y) < s(X)\,s(Y), \tag{7.4}
$$
where X and Y are disjoint itemsets; i.e., X ∩ Y = ∅.
The preceding definition provides only a partial condition for negative
correlation between items in X and items in Y. A full condition for negative
correlation can be stated as follows:
$$
s(X \cup Y) < \prod_{i} s(x_i) \prod_{j} s(y_j), \tag{7.5}
$$
where xi ∈ X and yj ∈ Y. Because the items in X (and in Y) are often
positively correlated, it is more practical to use the partial condition to
define a negatively correlated association rule instead of the full condition. For
example, although the rule
{eyeglass, lens cleaner} −→ {contact lens, saline solution}
is negatively correlated according to Inequality 7.4, eyeglass is positively
correlated with lens cleaner and contact lens is positively correlated with
saline solution. If Inequality 7.5 is applied instead, such a rule could be
missed because it may not satisfy the full condition for negative correlation.
The condition for negative correlation can also be expressed in terms of
the support for positive and negative itemsets. Let $\overline{X}$ and $\overline{Y}$ denote the
corresponding negative itemsets for X and Y, respectively. Since
$$
\begin{aligned}
& s(X \cup Y) - s(X)\,s(Y) \\
&= s(X \cup Y) - \big[ s(X \cup Y) + s(X \cup \overline{Y}) \big] \big[ s(X \cup Y) + s(\overline{X} \cup Y) \big] \\
&= s(X \cup Y) \big[ 1 - s(X \cup Y) - s(X \cup \overline{Y}) - s(\overline{X} \cup Y) \big] - s(X \cup \overline{Y})\, s(\overline{X} \cup Y) \\
&= s(X \cup Y)\, s(\overline{X} \cup \overline{Y}) - s(X \cup \overline{Y})\, s(\overline{X} \cup Y),
\end{aligned}
$$
the condition for negative correlation can be stated as follows:
$$
s(X \cup Y)\, s(\overline{X} \cup \overline{Y}) < s(X \cup \overline{Y})\, s(\overline{X} \cup Y). \tag{7.6}
$$
The negatively correlated itemsets and association rules are known as
negatively correlated patterns throughout this chapter.
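The equivalence between Inequalities 7.4 and 7.6 can be checked numerically. The sketch below is an illustration under stated assumptions: the four supports are computed as the cells of a two-way contingency table, where a transaction falls in the "not X" row when it does not contain all of X. Exact fractions avoid rounding issues, and the function names and data are hypothetical.

```python
from fractions import Fraction


def contingency(X, Y, transactions):
    """Return (s(X,Y), s(X,notY), s(notX,Y), s(notX,notY)) as exact fractions."""
    X, Y = set(X), set(Y)
    counts = [0, 0, 0, 0]
    for t in transactions:
        has_x, has_y = X <= t, Y <= t
        counts[(0 if has_x else 2) + (0 if has_y else 1)] += 1
    return tuple(Fraction(c, len(transactions)) for c in counts)


def negatively_correlated(X, Y, transactions):
    a, b, c, d = contingency(X, Y, transactions)
    partial = a < (a + b) * (a + c)   # Inequality 7.4: s(X u Y) < s(X) s(Y)
    cross = a * d < b * c             # Inequality 7.6
    assert partial == cross           # both forms give the same answer
    return cross


transactions = [{"tea"}, {"tea"}, {"coffee"}, {"coffee"},
                {"tea", "coffee"}, {"milk"}]
print(contingency({"tea"}, {"coffee"}, transactions))
print(negatively_correlated({"tea"}, {"coffee"}, transactions))
```

The internal assertion reflects the derivation above: the difference s(X ∪ Y) − s(X)s(Y) equals the cross-product difference, so the two tests cannot disagree.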
7.6.3 Comparisons among Infrequent Patterns, Negative Patterns, and Negatively Correlated Patterns
Infrequent patterns, negative patterns, and negatively correlated patterns are
three closely related concepts. Although infrequent patterns and negatively
correlated patterns refer only to itemsets or rules that contain positive items,
while negative patterns refer to itemsets or rules that contain both positive
and negative items, there are certain commonalities among these concepts, as
illustrated in Figure 7.22.
[Figure 7.22. Comparisons among infrequent patterns, negative patterns, and negatively correlated patterns.]
First, note that many infrequent patterns have corresponding negative
patterns. To understand why this is the case, consider the contingency table
shown in Table 7.9. If X ∪ Y is infrequent, then it is likely to have a
corresponding negative itemset unless minsup is too high. For example, assuming
that minsup ≤ 0.25, if X ∪ Y is infrequent, then the support for at least one of
the following itemsets, $X \cup \overline{Y}$, $\overline{X} \cup Y$, or $\overline{X} \cup \overline{Y}$, must be higher than minsup
since the sum of the supports in a contingency table is 1.
Table 7.9. A two-way contingency table for the association rule X −→ Y.
$$
\begin{array}{c|cc|c}
 & Y & \overline{Y} & \\ \hline
X & s(X \cup Y) & s(X \cup \overline{Y}) & s(X) \\
\overline{X} & s(\overline{X} \cup Y) & s(\overline{X} \cup \overline{Y}) & s(\overline{X}) \\ \hline
 & s(Y) & s(\overline{Y}) & 1
\end{array}
$$
Second, note that many negatively correlated patterns also have
corresponding negative patterns. Consider the contingency table shown in Table
7.9 and the condition for negative correlation stated in Inequality 7.6. If X
and Y have strong negative correlation, then
$$
s(X \cup \overline{Y}) \times s(\overline{X} \cup Y) \gg s(X \cup Y) \times s(\overline{X} \cup \overline{Y}).
$$
Therefore, either $X \cup \overline{Y}$ or $\overline{X} \cup Y$, or both, must have relatively high support
when X and Y are negatively correlated. These itemsets correspond to the
negative patterns.
Finally, because the lower the support of X ∪ Y, the more negatively
correlated is the pattern, negatively correlated patterns that are infrequent tend
to be more interesting than negatively correlated patterns that are frequent.
The infrequent, negatively correlated patterns are illustrated by the
overlapping region in Figure 7.22 between both types of patterns.
7.6.4 Techniques for Mining Interesting Infrequent Patterns
In principle, infrequent itemsets are given by all itemsets that are not extracted
by standard frequent itemset generation algorithms such as Apriori and
FP-growth. These itemsets correspond to those located below the frequent itemset
border shown in Figure 7.23.
[Figure 7.23. Frequent and infrequent itemsets. Itemset lattice over the items a, b, c, d, and e, from null to abcde, showing the maximal frequent itemsets and the frequent itemset border that separates frequent from infrequent itemsets.]
Since the number of infrequent patterns can be exponentially large,
especially for sparse, high-dimensional data, techniques developed for mining
infrequent patterns focus on finding only interesting infrequent patterns. An
example of such patterns includes the negatively correlated patterns discussed
in Section 7.6.2. These patterns are obtained by eliminating all infrequent
itemsets that fail the negative correlation condition provided in Inequality
7.3. This approach can be computationally intensive because the supports
for all infrequent itemsets must be computed in order to determine whether
they are negatively correlated. Unlike the support measure used for mining
frequent itemsets, correlation-based measures used for mining negatively
correlated itemsets do not possess an anti-monotone property that can be exploited
for pruning the exponential search space. Although an efficient solution
remains elusive, several innovative methods have been developed, as mentioned
in the bibliographic notes provided at the end of this chapter.
The remainder of this chapter presents two classes of techniques for mining
interesting infrequent patterns. Section 7.6.5 describes methods for mining
negative patterns in data, while Section 7.6.6 describes methods for finding
interesting infrequent patterns based on support expectation.
Original Transactions
TID  Items
1    {A,B}
2    {A,B,C}
3    {C}
4    {B,C}
5    {B,D}
Transactions with Negative Items
$$
\begin{array}{c|cccccccc}
\text{TID} & A & \overline{A} & B & \overline{B} & C & \overline{C} & D & \overline{D} \\ \hline
1 & 1 & 0 & 1 & 0 & 0 & 1 & 0 & 1 \\
2 & 1 & 0 & 1 & 0 & 1 & 0 & 0 & 1 \\
3 & 0 & 1 & 0 & 1 & 1 & 0 & 0 & 1 \\
4 & 0 & 1 & 1 & 0 & 1 & 0 & 0 & 1 \\
5 & 0 & 1 & 1 & 0 & 0 & 1 & 1 & 0
\end{array}
$$
Figure 7.24. Augmenting a data set with negative items.
7.6.5 Techniques Based on Mining Negative Patterns
The first class of techniques developed for mining infrequent patterns treats
every item as a symmetric binary variable. Using the approach described in
Section 7.1, the transaction data can be binarized by augmenting it with
negative items. Figure 7.24 shows an example of transforming the original data
into transactions having both positive and negative items. By applying
existing frequent itemset generation algorithms such as Apriori on the augmented
transactions, all the negative itemsets can be derived.
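The augmentation step itself is simple to sketch. The example below assumes that transactions are Python sets and writes the negative item for x as the string "~x"; both the naming convention and the function name augment_with_negatives are illustrative choices, not notation from the text. The five transactions are those of Figure 7.24.

```python
def augment_with_negatives(transactions, items):
    """Add the negative item "~x" for every item x that a transaction lacks."""
    return [set(t) | {f"~{x}" for x in items if x not in t}
            for t in transactions]


transactions = [{"A", "B"}, {"A", "B", "C"}, {"C"}, {"B", "C"}, {"B", "D"}]
augmented = augment_with_negatives(transactions, "ABCD")

for t in augmented:
    print(sorted(t))
```

Every augmented transaction has exactly as many items as the item universe (four here), which is one reason the approach becomes expensive when all items are treated as symmetric binary variables.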
Such an approach is feasible only if a few variables are treated as symmetric
binary (i.e., we look for negative patterns involving the negation of only a
small number of items). If every item must be treated as symmetric binary,
the problem becomes computationally intractable due to the following reasons.
1. The number of items doubles when every item is augmented with its
corresponding negative item. Instead of exploring an itemset lattice of
size $2^d$, where d is the number of items in the original data set, the lattice
becomes considerably larger, as shown in Exercise 21 on page 485.
2. Support-based pruning is no longer effective when negative items are
augmented. For each variable x, either x or $\overline{x}$ has support greater than
or equal to 50%. Hence, even if the support threshold is as high as
50%, half of the items will remain frequent. For lower thresholds, many
more items and possibly itemsets containing them will be frequent. The
support-based pruning strategy employed by Apriori is effective only
when the support for most itemsets is low; otherwise, the number of
frequent itemsets grows exponentially.
3. The width of each transaction increases when negative items are
augmented. Suppose there are d items available in the original data set. For
sparse data sets such as market basket transactions, the width of each
transaction tends to be much smaller than d. As a result, the maximum
size of a frequent itemset, which is bounded by the maximum
transaction width, wmax, tends to be relatively small. When negative items are
included, the width of the transactions increases to d because an item is
either present in the transaction or absent from the transaction, but not
both. Since the maximum transaction width has grown from wmax to
d, this will increase the number of frequent itemsets exponentially. As
a result, many existing algorithms tend to break down when they are
applied to the extended data set.
The previous brute-force approach is computationally expensive because it
forces us to determine the support for a large number of positive and negative
patterns. Instead of augmenting the data set with negative items, another
approach is to determine the support of the negative itemsets based on the
support of their corresponding positive items. For example, the support for
$\{p, \overline{q}, \overline{r}\}$ can be computed in the following way:
$$
s(\{p, \overline{q}, \overline{r}\}) = s(\{p\}) - s(\{p, q\}) - s(\{p, r\}) + s(\{p, q, r\}).
$$
More generally, the support for any itemset $X \cup \overline{Y}$ can be obtained as follows:
$$
s(X \cup \overline{Y}) = s(X) + \sum_{i=1}^{n} \sum_{Z \subseteq Y,\ |Z| = i} \left\{ (-1)^{i} \times s(X \cup Z) \right\}, \tag{7.7}
$$
where n = |Y|.
To apply Equation 7.7, s(X ∪ Z) must be determined for every Z that is a
subset of Y. The support for any combination of X and Z that exceeds the
minsup threshold can be found using the Apriori algorithm. For all other
combinations, the supports must be determined explicitly, e.g., by scanning
the entire set of transactions. Another possible approach is to either ignore
the support for any infrequent itemset X ∪ Z or to approximate it with the
minsup threshold.
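Equation 7.7 is an inclusion-exclusion sum, which can be sketched and checked against a direct count. The code below assumes that transactions are Python sets and uses exact fractions; the function names negative_support and direct_count and the toy data are hypothetical.

```python
from fractions import Fraction
from itertools import combinations


def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return Fraction(sum(itemset <= t for t in transactions), len(transactions))


def negative_support(X, Y, transactions):
    """Support of X with every item of Y absent, computed via Equation 7.7."""
    total = support(X, transactions)
    for i in range(1, len(Y) + 1):
        for Z in combinations(sorted(Y), i):      # all subsets Z of Y with |Z| = i
            total += (-1) ** i * support(set(X) | set(Z), transactions)
    return total


def direct_count(X, Y, transactions):
    """Same quantity obtained by scanning the transactions directly."""
    X, Y = set(X), set(Y)
    hits = sum(X <= t and not (Y & t) for t in transactions)
    return Fraction(hits, len(transactions))


transactions = [{"p", "q"}, {"p"}, {"p", "r"}, {"p", "q", "r"}, {"q"}, {"p"}]
print(negative_support({"p"}, {"q", "r"}, transactions))
print(direct_count({"p"}, {"q", "r"}, transactions))
```

For this data, s({p}) − s({p, q}) − s({p, r}) + s({p, q, r}) = 5/6 − 2/6 − 2/6 + 1/6 = 1/3, which matches the two transactions that contain p but neither q nor r.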
Several optimization strategies are available to further improve the
performance of the mining algorithms. First, the number of variables considered as
symmetric binary can be restricted. More specifically, a negative item $\overline{y}$ is
considered interesting only if y is a frequent item. The rationale for this strategy
is that rare items tend to produce a large number of infrequent patterns,
many of which are uninteresting. By restricting the set Y given in Equation 7.7
to variables whose positive items are frequent, the number of candidate
negative itemsets considered by the mining algorithm can be substantially reduced.
Another strategy is to restrict the type of negative patterns. For example, the
algorithm may consider a negative pattern $X \cup \overline{Y}$ only if it contains at least
one positive item (i.e., |X| ≥ 1). The rationale for this strategy is that if the
data set contains very few positive items with support greater than 50%, then
most of the negative patterns of the form $\overline{X} \cup \overline{Y}$ will become frequent, thus
degrading the performance of the mining algorithm.
7.6.6 Techniques Based on Support Expectation
Another class of techniques considers an infrequent pattern to be interesting
only if its actual support is considerably smaller than its expected support. For
negatively correlated patterns, the expected support is computed based on the
statistical independence assumption. This section describes two alternative
approaches for determining the expected support of a pattern using (1) a
concept hierarchy and (2) a neighborhood-based approach known as indirect
association.
Support Expectation Based on Concept Hierarchy
Objective measures alone may not be sufficient to eliminate uninteresting
infrequent patterns. For example, suppose bread and laptop computer are
frequent items. Even though the itemset {bread, laptop computer} is
infrequent and perhaps negatively correlated, it is not interesting because their
lack of support seems obvious to domain experts. Therefore, a subjective
approach for determining expected support is needed to avoid generating such
infrequent patterns.
In the preceding example, bread and laptop computer belong to two
completely different product categories, which is why it is not surprising to
find that their support is low. This example also illustrates the advantage of
using domain knowledge to prune uninteresting patterns. For market basket
data, the domain knowledge can be inferred from a concept hierarchy such
as the one shown in Figure 7.25. The basic assumption of this approach is
that items from the same product family are expected to have similar types of
interaction with other items. For example, since ham and bacon belong to the
same product family, we expect the association between ham and chips to be
somewhat similar to the association between bacon and chips. If the actual
support for any one of these pairs is less than their expected support, then the
infrequent pattern is interesting.
Figure 7.25. Example of a concept hierarchy.
[Diagram: a product taxonomy with the node labels Food, Meat, Snack Food,
Soda, Chicken, Pork, Cookies, Chips, Regular, Diet, Ham, Bacon, Boneless,
Whole, Oatmeal, Chocolate Chip, Potato, and Taco.]
To illustrate how to compute the expected support, consider the diagram
shown in Figure 7.26. Suppose the itemset {C, G} is frequent. Let s(·) denote
the actual support of a pattern and ε(·) denote its expected support. The
expected support for any children or siblings of C and G can be computed
using the formulas shown below.
\begin{align}
\epsilon(s(E, J)) &= s(C, G) \times \frac{s(E)}{s(C)} \times \frac{s(J)}{s(G)} \tag{7.8} \\
\epsilon(s(C, J)) &= s(C, G) \times \frac{s(J)}{s(G)} \tag{7.9} \\
\epsilon(s(C, H)) &= s(C, G) \times \frac{s(H)}{s(G)} \tag{7.10}
\end{align}
For example, if soda and snack food are frequent, then the expected
support between diet soda and chips can be computed using Equation 7.8
because these items are children of soda and snack food, respectively. If
the actual support for diet soda and chips is considerably lower than their
expected value, then diet soda and chips form an interesting infrequent
pattern.
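As a rough numerical illustration of Equations 7.8 to 7.10, the following Python sketch computes an expected support and compares it with an actual one. The support values, the function names, and the `ratio` cutoff used to decide what counts as "considerably lower" are all hypothetical choices made for this example; the text does not prescribe them.

```python
def expected_support_children(s_parents, s_child1, s_parent1, s_child2, s_parent2):
    """Equation 7.8: both items are children of the frequent parent pair."""
    return s_parents * (s_child1 / s_parent1) * (s_child2 / s_parent2)


def expected_support_one_side(s_pair, s_other, s_replaced):
    """Equations 7.9 and 7.10: only one item of the frequent pair is replaced
    by a child (7.9) or a sibling (7.10)."""
    return s_pair * (s_other / s_replaced)


def is_interesting(actual, expected, ratio=0.5):
    """Assumed rule of thumb: flag the pattern when its actual support is
    below `ratio` times the expected support."""
    return actual < ratio * expected


# Hypothetical supports (fractions of all transactions).
s_soda_snack = 0.4   # s({soda, snack food})
s_soda = 0.5         # s(soda)
s_diet_soda = 0.2    # s(diet soda)
s_snack = 0.6        # s(snack food)
s_chips = 0.3        # s(chips)

expected = expected_support_children(
    s_soda_snack, s_diet_soda, s_soda, s_chips, s_snack
)
actual = 0.02        # hypothetical observed s({diet soda, chips})
flag = is_interesting(actual, expected)
```

Under these made-up numbers the expected support is 0.4 × 0.4 × 0.5 = 0.08, so an observed support of 0.02 would be flagged as an interesting infrequent pattern.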
Figure 7.26. Mining interesting negative patterns using a concept hierarchy.
[Diagram: a concept hierarchy over the nodes A, B, C, D, E, F, G, H, J, and K.]
Support Expectation Based on Indirect Association
Consider a pair of items, (a, b), that are rarely bought together by customers.
If a and b are unrelated items, such as bread and DVD player, then their support
is expected to be low. On the other hand, if a and b are related items, then
their support is expected to be high. The expected support was previously
computed using a concept hierarchy. This section presents an approach for
determining the expected support between a pair of items by looking at other
items commonly purchased together with these two items.
For example, suppose customers who buy a sleeping bag also tend to
buy other camping equipment, whereas those who buy a desktop computer
also tend to buy other computer accessories such as an optical mouse or a
printer. Assuming there is no other item frequently bought together with both
a sleeping bag and a desktop computer, the support for these unrelated items
is expected to be low. On the other hand, suppose diet and regular soda are
often bought together with chips and cookies. Even without using a concept
hierarchy, both items are expected to be somewhat related and their support
should be high. Because their actual support is low, diet and regular soda
form an interesting infrequent pattern. Such patterns are known as indirect
association patterns.
A high-level illustration of indirect association is shown in Figure 7.27.
Items a and b correspond to diet soda and regular soda, while Y , which is
known as the mediator set, contains items such as chips and cookies. A
formal definition of indirect association is presented next.
Figure 7.27. An indirect association between a pair of items.
[Diagram: items a and b, each linked to the items y1, y2, ..., yk of the
mediator set Y.]
Definition 7.12 (Indirect Association). A pair of items a, b is indirectly
associated via a mediator set Y if the following conditions hold:
1. s({a, b}) < t_s (Itempair support condition).
2. ∃Y ≠ ∅ such that:
(a) s({a} ∪ Y) ≥ t_f and s({b} ∪ Y) ≥ t_f (Mediator support condition).
(b) d({a}, Y) ≥ t_d, d({b}, Y) ≥ t_d, where d(X, Z) is an objective
measure of the association between X and Z (Mediator dependence
condition).
Note that the mediator support and dependence conditions are used to
ensure that items in Y form a close neighborhood to both a and b. Some
of the dependence measures that can be used include interest, cosine or IS,
Jaccard, and other measures previously described in Section 6.7.1 on page 371.
Indirect association has many potential applications. In the market basket
domain, a and b may refer to competing items such as desktop and laptop
computers. In text mining, indirect association can be used to identify
synonyms, antonyms, or words that are used in different contexts. For example,
given a collection of documents, the word data may be indirectly associated
with gold via the mediator mining. This pattern suggests that the word
mining can be used in two different contexts: data mining versus gold
mining.
Indirect associations can be generated in the following way. First, the set
of frequent itemsets is generated using standard algorithms such as Apriori
or FP-growth. Each pair of frequent k-itemsets is then merged to obtain
a candidate indirect association (a, b, Y), where a and b are a pair of items
and Y is their common mediator. For example, if {p, q, r} and {p, q, s} are
frequent 3-itemsets, then the candidate indirect association (r, s, {p, q}) is
obtained by merging the pair of frequent itemsets. Once the candidates have
been generated, it is necessary to verify that they satisfy the itempair support
and mediator dependence conditions provided in Definition 7.12. However,
the mediator support condition does not have to be verified because the
candidate indirect association is obtained by merging a pair of frequent itemsets.
A summary of the algorithm is shown in Algorithm 7.2.
Algorithm 7.2 Algorithm for mining indirect associations.
Generate F_k, the set of frequent itemsets.
for k = 2 to k_max do
    C_k = {(a, b, Y) | {a} ∪ Y ∈ F_k, {b} ∪ Y ∈ F_k, a ≠ b}
    for each candidate (a, b, Y) ∈ C_k do
        if s({a, b}) < t_s ∧ d({a}, Y) ≥ t_d ∧ d({b}, Y) ≥ t_d then
            I_k = I_k ∪ {(a, b, Y)}
        end if
    end for
end for
Result = ⋃ I_k.
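The procedure can be sketched in Python as follows. This is a minimal sketch under several assumptions rather than a reference implementation: frequent itemsets are enumerated by brute force as a stand-in for Apriori or FP-growth, the IS (cosine) measure is chosen as the dependence measure d, and the function names, thresholds, and toy transactions are invented for illustration.

```python
from itertools import combinations


def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = frozenset(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def is_measure(x, y, transactions):
    """IS (cosine) measure between itemsets x and y (assumed choice of d)."""
    sx, sy = support(x, transactions), support(y, transactions)
    sxy = support(set(x) | set(y), transactions)
    return sxy / (sx * sy) ** 0.5 if sx > 0 and sy > 0 else 0.0


def frequent_itemsets(transactions, t_f, k):
    """Brute-force frequent k-itemsets; a stand-in for Apriori or FP-growth."""
    items = sorted(set().union(*transactions))
    return [frozenset(c) for c in combinations(items, k)
            if support(c, transactions) >= t_f]


def indirect_associations(transactions, t_s, t_f, t_d, k_max):
    """Merge pairs of frequent k-itemsets that share k-1 items, then check the
    itempair support and mediator dependence conditions."""
    result = []
    for k in range(2, k_max + 1):
        f_k = frequent_itemsets(transactions, t_f, k)
        for p, q in combinations(f_k, 2):
            mediator = p & q
            if len(mediator) != k - 1:
                continue  # the two itemsets must differ in exactly one item
            (a,), (b,) = p - mediator, q - mediator
            if (support({a, b}, transactions) < t_s
                    and is_measure({a}, mediator, transactions) >= t_d
                    and is_measure({b}, mediator, transactions) >= t_d):
                result.append((min(a, b), max(a, b), mediator))
    return result


# Hypothetical transactions: the two sodas rarely co-occur, but each is
# often bought with chips and cookies.
transactions = [frozenset(t) for t in [
    {"diet soda", "chips", "cookies"},
    {"diet soda", "chips", "cookies"},
    {"diet soda", "chips"},
    {"regular soda", "chips", "cookies"},
    {"regular soda", "chips", "cookies"},
    {"regular soda", "chips"},
    {"diet soda", "regular soda"},
    {"bread"},
]]
found = indirect_associations(transactions, t_s=0.2, t_f=0.25, t_d=0.4, k_max=3)
```

On this toy data the sketch reports diet soda and regular soda as indirectly associated via the mediators {chips}, {cookies}, and {chips, cookies}. As in the text, the mediator support condition is not rechecked because both merged itemsets are already frequent.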
7.7 Bibliographic Notes
The problem of mining association rules from categorical and continuous data
was introduced by Srikant and Agrawal in [363]. Their strategy was to binarize
the categorical attributes and to apply equal-frequency discretization to the
continuous attributes. A partial completeness measure was also proposed
to determine the amount of information loss as a result of discretization. This
measure was then used to determine the number of discrete intervals needed
to ensure that the amount of information loss can be kept at a certain desired
level. Following this work, numerous other formulations have been proposed
for mining quantitative association rules. The statistics-based approach was
developed by Aumann and Lindell [343] to identify segments of the population
who exhibit interesting behavior characterized by some quantitative attributes.
This formulation was later extended by other authors including Webb [368] and
Zhang et al. [372]. The min-Apriori algorithm was developed by Han et al.
[349] for finding association rules in continuous data without discretization.
The problem of mining association rules in continuous data has also been
investigated by numerous other researchers including Fukuda et al. [347],
Lent et al. [355], Wang et al. [367], and Miller and Yang [357].
The method described in Section 7.3 for handling concept hierarchy using
extended transactions was developed by Srikant and Agrawal [362]. An
alternative algorithm was proposed by Han and Fu [350], where frequent itemsets
are generated one level at a time. More specifically, their algorithm initially
generates all the frequent 1-itemsets at the top level of the concept hierarchy.
The set of frequent 1-itemsets is denoted as L(1, 1). Using the frequent
1-itemsets in L(1, 1), the algorithm proceeds to generate all frequent 2-itemsets
at level 1, L(1, 2). This procedure is repeated until all the frequent itemsets
involving items from the highest level of the hierarchy, L(1, k) (k > 1), are
extracted. The algorithm then continues to extract frequent itemsets at the
next level of the hierarchy, L(2, 1), based on the frequent itemsets in L(1, 1).
The procedure is repeated until it terminates at the lowest level of the concept
hierarchy requested by the user.
The sequential pattern formulation and algorithm described in Section 7.4
were proposed by Agrawal and Srikant in [341, 364]. Similarly, Mannila et
al. [356] introduced the concept of frequent episode, which is useful for
mining sequential patterns from a long stream of events. Another formulation of
sequential pattern mining based on regular expressions was proposed by
Garofalakis et al. in [348]. Joshi et al. have attempted to reconcile the differences
between various sequential pattern formulations [352]. The result was a
universal formulation of sequential pattern with the different counting schemes
described in Section 7.4.4. Alternative algorithms for mining sequential
patterns were also proposed by Pei et al. [359], Ayres et al. [344], Cheng et al.
[346], and Seno and Karypis [361].
The frequent subgraph mining problem was initially introduced by Inokuchi
et al. in [351]. They used a vertex-growing approach for generating frequent
induced subgraphs from a graph data set. The edge-growing strategy was
developed by Kuramochi and Karypis in [353], where they also presented an
Apriori-like algorithm called FSG that addresses issues such as multiplicity
of candidates, canonical labeling, and vertex invariant schemes. Another
frequent subgraph mining algorithm known as gSpan was developed by Yan and
Han in [370]. The authors proposed using a minimum DFS code for encoding
the various subgraphs. Other variants of the frequent subgraph mining
problems were proposed by Zaki in [371], Parthasarathy and Coatney in [358], and
Kuramochi and Karypis in [354].
The problem of mining infrequent patterns has been investigated by many
authors. Savasere et al. [360] examined the problem of mining negative
association rules using a concept hierarchy. Tan et al. [365] proposed the idea of
mining indirect associations for sequential and nonsequential data. Efficient
algorithms for mining negative patterns have also been proposed by Boulicaut
et al. [345], Teng et al. [366], Wu et al. [369], and Antonie and Zaïane [342].
Bibliography
[341] R. Agrawal and R. Srikant. Mining Sequential Patterns. In Proc. of Intl. Conf. on
Data Engineering, pages 3–14, Taipei, Taiwan, 1995.
[342] M.-L. Antonie and O. R. Zaïane. Mining Positive and Negative Association Rules:
An Approach for Confined Rules. In Proc. of the 8th European Conf. of Principles
and Practice of Knowledge Discovery in Databases, pages 27–38, Pisa, Italy, September
2004.
[343] Y. Aumann and Y. Lindell. A Statistical Theory for Quantitative Association Rules.
In KDD99, pages 261–270, San Diego, CA, August 1999.
[344] J. Ayres, J. Flannick, J. Gehrke, and T. Yiu. Sequential Pattern mining using a bitmap
representation. In Proc. of the 8th Intl. Conf. on Knowledge Discovery and Data Mining,
pages 429–435, Edmonton, Canada, July 2002.
[345] J.-F. Boulicaut, A. Bykowski, and B. Jeudy. Towards the Tractable Discovery of
Association Rules with Negations. In Proc. of the 4th Intl. Conf. on Flexible Query
Answering Systems FQAS’00, pages 425–434, Warsaw, Poland, October 2000.
[346] H. Cheng, X. Yan, and J. Han. IncSpan: incremental mining of sequential patterns
in large database. In Proc. of the 10th Intl. Conf. on Knowledge Discovery and Data
Mining, pages 527–532, Seattle, WA, August 2004.
[347] T. Fukuda, Y. Morimoto, S. Morishita, and T. Tokuyama. Mining Optimized
Association Rules for Numeric Attributes. In Proc. of the 15th Symp. on Principles of Database
Systems, pages 182–191, Montreal, Canada, June 1996.
[348] M. N. Garofalakis, R. Rastogi, and K. Shim. SPIRIT: Sequential Pattern Mining with
Regular Expression Constraints. In Proc. of the 25th VLDB Conf., pages 223–234,
Edinburgh, Scotland, 1999.
[349] E.-H. Han, G. Karypis, and V. Kumar. Min-Apriori: An Algorithm for Finding
Association Rules in Data with Continuous Attributes. http://www.cs.umn.edu/~han,
1997.
[350] J. Han and Y. Fu. Mining Multiple-Level Association Rules in Large Databases. IEEE
Trans. on Knowledge and Data Engineering, 11(5):798–804, 1999.
[351] A. Inokuchi, T. Washio, and H. Motoda. An Apriori-based Algorithm for Mining
Frequent Substructures from Graph Data. In Proc. of the 4th European Conf. of Principles
and Practice of Knowledge Discovery in Databases, pages 13–23, Lyon, France, 2000.
[352] M. V. Joshi, G. Karypis, and V. Kumar. A Universal Formulation of Sequential
Patterns. In Proc. of the KDD’2001 workshop on Temporal Data Mining, San Francisco,
CA, August 2001.
[353] M. Kuramochi and G. Karypis. Frequent Subgraph Discovery. In Proc. of the 2001
IEEE Intl. Conf. on Data Mining, pages 313–320, San Jose, CA, November 2001.
[354] M. Kuramochi and G. Karypis. Discovering Frequent Geometric Subgraphs. In Proc.
of the 2002 IEEE Intl. Conf. on Data Mining, pages 258–265, Maebashi City, Japan,
December 2002.
[355] B. Lent, A. Swami, and J. Widom. Clustering Association Rules. In Proc. of the 13th
Intl. Conf. on Data Engineering, pages 220–231, Birmingham, U.K, April 1997.
[356] H. Mannila, H. Toivonen, and A. I. Verkamo. Discovery of Frequent Episodes in Event
Sequences. Data Mining and Knowledge Discovery, 1(3):259–289, November 1997.
[357] R. J. Miller and Y. Yang. Association Rules over Interval Data. In Proc. of 1997
ACM-SIGMOD Intl. Conf. on Management of Data, pages 452–461, Tucson, AZ, May
1997.
[358] S. Parthasarathy and M. Coatney. Efficient Discovery of Common Substructures in
Macromolecules. In Proc. of the 2002 IEEE Intl. Conf. on Data Mining, pages 362–369,
Maebashi City, Japan, December 2002.
[359] J. Pei, J. Han, B. Mortazavi-Asl, Q. Chen, U. Dayal, and M. Hsu. PrefixSpan: Mining
Sequential Patterns efficiently by prefix-projected pattern growth. In Proc. of the 17th
Intl. Conf. on Data Engineering, Heidelberg, Germany, April 2001.
[360] A. Savasere, E. Omiecinski, and S. Navathe. Mining for Strong Negative Associations
in a Large Database of Customer Transactions. In Proc. of the 14th Intl. Conf. on Data
Engineering, pages 494–502, Orlando, Florida, February 1998.
[361] M. Seno and G. Karypis. SLPMiner: An Algorithm for Finding Frequent Sequential
Patterns Using Length-Decreasing Support Constraint. In Proc. of the 2002 IEEE Intl.
Conf. on Data Mining, pages 418–425, Maebashi City, Japan, December 2002.
[362] R. Srikant and R. Agrawal. Mining Generalized Association Rules. In Proc. of the
21st VLDB Conf., pages 407–419, Zurich, Switzerland, 1995.
[363] R. Srikant and R. Agrawal. Mining Quantitative Association Rules in Large Relational
Tables. In Proc. of 1996 ACM-SIGMOD Intl. Conf. on Management of Data, pages 1–12,
Montreal, Canada, 1996.
[364] R. Srikant and R. Agrawal. Mining Sequential Patterns: Generalizations and
Performance Improvements. In Proc. of the 5th Intl. Conf. on Extending Database Technology
(EDBT’96), pages 18–32, Avignon, France, 1996.
[365] P. N. Tan, V. Kumar, and J. Srivastava. Indirect Association: Mining Higher Order
Dependencies in Data. In Proc. of the 4th European Conf. of Principles and Practice
of Knowledge Discovery in Databases, pages 632–637, Lyon, France, 2000.
[366] W. G. Teng, M. J. Hsieh, and M.-S. Chen. On the Mining of Substitution Rules for
Statistically Dependent Items. In Proc. of the 2002 IEEE Intl. Conf. on Data Mining,
pages 442–449, Maebashi City, Japan, December 2002.
[367] K. Wang, S. H. Tay, and B. Liu. Interestingness-Based Interval Merger for Numeric
Association Rules. In Proc. of the 4th Intl. Conf. on Knowledge Discovery and Data
Mining, pages 121–128, New York, NY, August 1998.
[368] G. I. Webb. Discovering associations with numeric variables. In Proc. of the 7th Intl.
Conf. on Knowledge Discovery and Data Mining, pages 383–388, San Francisco, CA,
August 2001.
[369] X. Wu, C. Zhang, and S. Zhang. Mining Both Positive and Negative Association Rules.
ACM Trans. on Information Systems, 22(3):381–405, 2004.
[370] X. Yan and J. Han. gSpan: Graph-based Substructure Pattern Mining. In Proc.
of the 2002 IEEE Intl. Conf. on Data Mining, pages 721–724, Maebashi City, Japan,
December 2002.
[371] M. J. Zaki. Efficiently mining frequent trees in a forest. In Proc. of the 8th Intl. Conf.
on Knowledge Discovery and Data Mining, pages 71–80, Edmonton, Canada, July 2002.
[372] H. Zhang, B. Padmanabhan, and A. Tuzhilin. On the Discovery of Significant
Statistical Quantitative Rules. In Proc. of the 10th Intl. Conf. on Knowledge Discovery and
Data Mining, pages 374–383, Seattle, WA, August 2004.
7.8 Exercises
1. Consider the traffic accident data set shown in Table 7.10.
Table 7.10. Traffic accident data set.
Weather Condition   Driver’s Condition   Traffic Violation        Seat Belt   Crash Severity
Good                Alcohol-impaired     Exceed speed limit       No          Major
Bad                 Sober                None                     Yes         Minor
Good                Sober                Disobey stop sign        Yes         Minor
Good                Sober                Exceed speed limit       Yes         Major
Bad                 Sober                Disobey traffic signal   No          Major
Good                Alcohol-impaired     Disobey stop sign        Yes         Minor
Bad                 Alcohol-impaired     None                     Yes         Major
Good                Sober                Disobey traffic signal   Yes         Major
Good                Alcohol-impaired     None                     No          Major
Bad                 Sober                Disobey traffic signal   No          Major
Good                Alcohol-impaired     Exceed speed limit       Yes         Major
Bad                 Sober                Disobey stop sign        Yes         Minor
(a) Show a binarized version of the data set.
(b) What is the maximum width of each transaction in the binarized data?
(c) Assuming that the support threshold is 30%, how many candidate and
frequent itemsets will be generated?
(d) Create a data set that contains only the following asymmetric binary
attributes: (Weather = Bad, Driver’s condition = Alcohol-impaired,
Traffic violation = Yes, Seat Belt = No, Crash Severity = Major).
For Traffic violation, only None has a value of 0. The rest of the
attribute values are assigned to 1. Assuming that the support threshold is
30%, how many candidate and frequent itemsets will be generated?
(e) Compare the number of candidate and frequent itemsets generated in
parts (c) and (d).
2. (a) Consider the data set shown in Table 7.11. Suppose we apply the following
discretization strategies to the continuous attributes of the data set.
D1: Partition the range of each continuous attribute into 3 equal-sized
bins.
D2: Partition the range of each continuous attribute into 3 bins, where
each bin contains an equal number of transactions.
Table 7.11. Data set for Exercise 2.
TID   Temperature   Pressure   Alarm 1   Alarm 2   Alarm 3
1     95            1105       0         0         1
2     85            1040       1         1         0
3     103           1090       1         1         1
4     97            1084       1         0         0
5     80            1038       0         1         1
6     100           1080       1         1         0
7     83            1025       1         0         1
8     86            1030       1         0         0
9     101           1100       1         1         1
For each strategy, answer the following questions:
i. Construct a binarized version of the data set.
ii. Derive all the frequent itemsets having support ≥ 30%.
(b) The continuous attribute can also be discretized using a clustering
approach.
i. Plot a graph of temperature versus pressure for the data points shown
in Table 7.11.
ii. How many natural clusters do you observe from the graph? Assign
a label (C1, C2, etc.) to each cluster in the graph.
iii. What type of clustering algorithm do you think can be used to
identify the clusters? State your reasons clearly.
iv. Replace the temperature and pressure attributes in Table 7.11 with
asymmetric binary attributes C1, C2, etc. Construct a transaction
matrix using the new attributes (along with attributes Alarm1,
Alarm2, and Alarm3).
v. Derive all the frequent itemsets having support ≥ 30% from the
binarized data.
3. Consider the data set shown in Table 7.12. The first attribute is continuous,
while the remaining two attributes are asymmetric binary. A rule is considered
to be strong if its support exceeds 15% and its confidence exceeds 60%. The
data given in Table 7.12 supports the following two strong rules:
(i) {(1 ≤ A ≤ 2), B = 1} → {C = 1}
(ii) {(5 ≤ A ≤ 8), B = 1} → {C = 1}
(a) Compute the support and confidence for both rules.
(b) To find the rules using the traditional Apriori algorithm, we need to
discretize the continuous attribute A. Suppose we apply the equal-width
binning approach to discretize the data, with bin-width = 2, 3, 4. For
each bin-width, state whether the above two rules are discovered by the
Apriori algorithm. (Note that the rules may not be in exactly the same
form as before because they may contain wider or narrower intervals for A.)
For each rule that corresponds to one of the above two rules, compute its
support and confidence.
Table 7.12. Data set for Exercise 3.
A    B   C
1    1   1
2    1   1
3    1   0
4    1   0
5    1   1
6    0   1
7    0   0
8    1   1
9    0   0
10   0   0
11   0   0
12   0   1
(c) Comment on the effectiveness of using the equal-width approach for
classifying the above data set. Is there a bin-width that allows you to find
both rules satisfactorily? If not, what alternative approach can you take
to ensure that you will find both rules?
4. Consider the data set shown in Table 7.13.
Table 7.13. Data set for Exercise 4.
Age (A)    Number of Hours Online per Week (B)
           0 – 5   5 – 10   10 – 20   20 – 30   30 – 40
10 – 15    2       3        5         3         2
15 – 25    2       5        10        10        3
25 – 35    10      15       5         3         2
35 – 50    4       6        5         3         2
(a) For each combination of rules given below, specify the rule that has the
highest confidence.
i. 15 < A < 25 −→ 10 < B < 20, 10 < A < 25 −→ 10 < B < 20, and
15 < A < 35 −→ 10 < B < 20.
ii. 15 < A < 25 −→ 10 < B < 20, 15 < A < 25 −→ 5 < B < 20, and
15 < A < 25 −→ 5 < B < 30.
iii. 15 < A < 25 −→ 10 < B < 20 and 10 < A < 35 −→ 5 < B < 30.
(b) Suppose we are interested in finding the average number of hours spent
online per week by Internet users between the age of 15 and 35. Write the
corresponding statistics-based association rule to characterize the segment
of users. To compute the average number of hours spent online,
approximate each interval by its midpoint value (e.g., use B = 7.5 to represent
the interval 5 < B < 10).
(c) Test whether the quantitative association rule given in part (b) is
statistically significant by comparing its mean against the average number of
hours spent online by other users who do not belong to the age group.
5. For the data set with the attributes given below, describe how you would
convert it into a binary transaction data set appropriate for association analysis.
Specifically, indicate for each attribute in the original data set
(a) how many binary attributes it would correspond to in the transaction
data set,
(b) how the values of the original attribute would be mapped to values of the
binary attributes, and
(c) if there is any hierarchical structure in the data values of an attribute that
could be useful for grouping the data into fewer binary attributes.
The following is a list of attributes for the data set along with their possible
values. Assume that all attributes are collected on a per-student basis:
• Year : Freshman, Sophomore, Junior, Senior, Graduate:Masters,
Graduate:PhD, Professional
• Zip code : zip code for the home address of a U.S. student, zip code for
the local address of a non-U.S. student
• College : Agriculture, Architecture, Continuing Education, Education,
Liberal Arts, Engineering, Natural Sciences, Business, Law, Medical,
Dentistry, Pharmacy, Nursing, Veterinary Medicine
• On Campus : 1 if the student lives on campus, 0 otherwise
• Each of the following is a separate attribute that has a value of 1 if the
person speaks the language and a value of 0, otherwise.
– Arabic
– Bengali
– Chinese Mandarin
– English
– Portuguese
– Russian
– Spanish
6. Consider the data set shown in Table 7.14. Suppose we are interested in
extracting the following association rule:
{α1 ≤ Age ≤ α2, Play Piano = Yes} −→ {Enjoy Classical Music = Yes}
Table 7.14. Data set for Exercise 6.
Age   Play Piano   Enjoy Classical Music
9     Yes          Yes
11    Yes          Yes
14    Yes          No
17    Yes          No
19    Yes          Yes
21    No           No
25    No           No
29    Yes          Yes
33    No           No
39    No           Yes
41    No           No
47    No           Yes
To handle the continuous attribute, we apply the equal-frequency approach
with 3, 4, and 6 intervals. Categorical attributes are handled by introducing as
many new asymmetric binary attributes as the number of categorical values.
Assume that the support threshold is 10% and the confidence threshold is 70%.
(a) Suppose we discretize the Age attribute into 3 equal-frequency intervals.
Find a pair of values for α1 and α2 that satisfy the minimum support and
minimum confidence requirements.
(b) Repeat part (a) by discretizing the Age attribute into 4 equal-frequency
intervals. Compare the extracted rules against the ones you had obtained
in part (a).
(c) Repeat part (a) by discretizing the Age attribute into 6 equal-frequency
intervals. Compare the extracted rules against the ones you had obtained
in part (a).
(d) From the results in parts (a), (b), and (c), discuss how the choice of
discretization intervals will affect the rules extracted by association rule
mining algorithms.
7. Consider the transactions shown in Table 7.15, with an item taxonomy given
in Figure 7.25.
Table 7.15. Example of market basket transactions.
Transaction ID Items Bought
1 Chips, Cookies, Regular Soda, Ham
2 Chips, Ham, Boneless Chicken, Diet Soda
3 Ham, Bacon, Whole Chicken, Regular Soda
4 Chips, Ham, Boneless Chicken, Diet Soda
5 Chips, Bacon, Boneless Chicken
6 Chips, Ham, Bacon, Whole Chicken, Regular Soda
7 Chips, Cookies, Boneless Chicken, Diet Soda
(a) What are the main challenges of mining association rules with item
taxonomy?
(b) Consider the approach where each transaction t is replaced by an extended
transaction t′ that contains all the items in t as well as their respective
ancestors. For example, the transaction t = { Chips, Cookies} will be
replaced by t′ = {Chips, Cookies, Snack Food, Food}. Use this approach
to derive all frequent itemsets (up to size 4) with support ≥ 70%.
(c) Consider an alternative approach where the frequent itemsets are
generated one level at a time. Initially, all the frequent itemsets involving items
at the highest level of the hierarchy are generated. Next, we use the
frequent itemsets discovered at the higher level of the hierarchy to generate
candidate itemsets involving items at the lower levels of the hierarchy. For
example, we generate the candidate itemset {Chips, Diet Soda} only if
{Snack Food, Soda} is frequent. Use this approach to derive all frequent
itemsets (up to size 4) with support ≥ 70%.
(d) Compare the frequent itemsets found in parts (b) and (c). Comment on
the efficiency and completeness of the algorithms.
8. The following questions examine how the support and confidence of an
association rule may vary in the presence of a concept hierarchy.
(a) Consider an item x in a given concept hierarchy. Let x1, x2, . . ., xk denote
the k children of x in the concept hierarchy. Show that s(x) ≤ ∑_{i=1}^{k} s(xi),
where s(·) is the support of an item. Under what conditions will the
inequality become an equality?
(b) Let p and q denote a pair of items, while p̂ and q̂ are their corresponding
parents in the concept hierarchy. If s({p, q}) > minsup, which of the
following itemsets are guaranteed to be frequent? (i) s({p̂, q}), (ii) s({p, q̂}),
and (iii) s({p̂, q̂}).
(c) Consider the association rule {p} −→ {q}. Suppose the confidence of the
rule exceeds minconf. Which of the following rules are guaranteed to
have confidence higher than minconf? (i) {p} −→ {q̂}, (ii) {p̂} −→ {q},
and (iii) {p̂} −→ {q̂}.
9. (a) List all the 4-subsequences contained in the following data sequence:
< {1, 3} {2} {2, 3} {4} >,
assuming no timing constraints.
(b) List all the 3-element subsequences contained in the data sequence for
part (a) assuming that no timing constraints are imposed.
(c) List all the 4-subsequences contained in the data sequence for part (a)
(assuming the timing constraints are flexible).