

If, as is usual in chemistry, there are many more than two measurements, it is simply necessary to extend the concept of distance to one in multidimensional space. Although we cannot visualise more than three dimensions, computers can handle geometry in an indefinite number of dimensions, and the idea of distance is easy to generalise. In the case of Figure 4.36 it is not really necessary to perform an elaborate computation to classify the unknown, but when a large number of measurements have been made, e.g. in spectroscopy, it is often hard to determine the class of an unknown by simple graphical approaches.
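As an illustration of how directly the distance generalises, the following minimal Python sketch (not from the original text; the numbers are hypothetical) computes Euclidean distances over any number of measurements and classifies an unknown by a majority vote of its k nearest neighbours:

    import numpy as np

    def knn_classify(unknown, X_train, y_train, k=3):
        # Euclidean distance generalises to J dimensions as the square
        # root of the sum of J squared differences.
        d = np.sqrt(((X_train - unknown) ** 2).sum(axis=1))
        nearest = np.argsort(d)[:k]                      # k closest training samples
        labels, counts = np.unique(y_train[nearest], return_counts=True)
        return labels[np.argmax(counts)]                 # majority vote

    # Hypothetical training set: two classes, four measurements each
    X = np.array([[0.3, 0.4, 0.1, 0.2], [0.5, 0.6, 0.2, 0.1],
                  [0.2, 0.1, 0.6, 0.7], [0.1, 0.3, 0.7, 0.8]])
    y = np.array(['A', 'A', 'B', 'B'])
    print(knn_classify(np.array([0.4, 0.5, 0.2, 0.2]), X, y))  # -> 'A'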

4.5.5.2 Limitations

This conceptually simple approach works well in many situations, but it is important to understand the limitations.

The first is that the numbers in each class of the training set should be approximately equal, otherwise the ‘votes’ will be biased towards the class with the most representatives. The second is that, for the simplest implementations, each variable assumes equal significance. In spectroscopy, we may record hundreds of wavelengths, and some will either not be diagnostic or else be correlated. A way of getting round this is either to select the variables or else to use another distance measure, just as in cluster analysis; the Mahalanobis distance is a common alternative. The third problem is that ambiguous or outlying samples in the training set can cause major problems in the resulting classification. Fourth, the methods take no account of the spread or variance in a class. For example, if we were trying to determine whether a forensic sample is a forgery, it is likely that the class of forgeries has a much higher variance than the class of non-forged samples. The methods in Sections 4.5.2 to 4.5.4 would normally take this into account.
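For instance, the Mahalanobis distance mentioned above can be sketched in a few lines of Python; this is an illustrative implementation only, and assumes the class covariance matrix is invertible (i.e. more samples than variables):

    import numpy as np

    def mahalanobis(x, class_samples):
        # Distance of x from the class centroid, rescaled by the inverse
        # covariance matrix so that correlated or high-variance variables
        # do not dominate as they would with the plain Euclidean distance.
        mean = class_samples.mean(axis=0)
        cov = np.cov(class_samples, rowvar=False)   # variables in columns
        diff = x - mean
        return float(np.sqrt(diff @ np.linalg.inv(cov) @ diff))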

However, KNN is a very simple approach that can be easily understood and programmed. Many chemists like these approaches, whereas statisticians often prefer the more elaborate methods involving modelling the data. KNN makes very few assumptions, whereas methods based on modelling often inherently make assumptions, such as normality of the noise distributions, that are not always experimentally justified, especially when statistical tests are then employed to provide probabilities of class membership. In practice, a good strategy is to use several different methods for classification and see whether similar results are obtained. Often the differences in performance of various approaches are due not entirely to the algorithm itself but to data scaling, distance measures, variable selection, validation method and so on. Some advocates of certain approaches do not always make this entirely clear.

4.6 Multiway Pattern Recognition

Most traditional chemometrics is concerned with two-way data, often represented by matrices. However, over the past decade there has been increasing interest in three-way chemical data. Instead of organising the information as a two-dimensional array [Figure 4.38(a)], it falls into a three-dimensional ‘tensor’ or box [Figure 4.38(b)]. Such datasets are surprisingly common. In Chapter 5 we discuss multiway PLS (Section 5.5.3); the discussion in this section is restricted to pattern recognition.


Figure 4.38
(a) Two-way and (b) three-way data: an I × J matrix and an I × J × K box

Figure 4.39
Possible method of arranging environmental sampling data: a 20 × 24 × 6 box of 20 sampling-site planes, each of dimensions 24 (days) × 6 (elements)

Consider, for example, an environmental chemistry experiment in which the concentrations of six elements are measured at 20 sampling sites on 24 days in a year. There will be 20 × 24 × 6 or 2880 measurements; however, these can be organised as a ‘box’ with 20 planes, each corresponding to a sampling site and of dimensions 24 × 6 (Figure 4.39). Such datasets have been available for many years to psychologists and in sensory research. A typical example might involve a taste panel assessing 20 food products, where each food is scored by 10 judges on eight attributes, resulting in a 20 × 10 × 8 box. In psychology, we might be following the reactions of 15 individuals to five different tests on 10 different days, possibly each day under slightly different conditions, and so have a 15 × 5 × 10 box. These problems involve finding the main factors that influence the taste of a food, the source of a pollutant or the reactions of an individual, and are a form of pattern recognition.

Three-dimensional analogies to principal components are required. There are no direct analogies to scores and loadings as in PCA, so the components in each of the three dimensions are often called ‘weights’. There are a number of methods available to tackle this problem.

4.6.1 Tucker3 Models

These models involve calculating weight matrices corresponding to each of the three dimensions (e.g. sampling site, date and metal), together with a ‘core’ box or array, which provides a measure of magnitude. The three weight matrices do not necessarily have the same dimensions, so the number of significant components for the sampling sites may be different to those for the dates, unlike normal PCA, where one of the dimensions of both the scores and loadings matrices must be identical. This model (or decomposition) is represented in Figure 4.40.

Figure 4.40
Tucker3 decomposition: an I × J × K box expressed as weight matrices of dimensions I × L, J × M and K × N together with an L × M × N core array

The easiest mathematical approach is by expressing the model as a summation:

 

x_{ijk} = \sum_{l=1}^{L} \sum_{m=1}^{M} \sum_{n=1}^{N} a_{il} b_{jm} c_{kn} z_{lmn}

where z represents what is often called a core array and a, b and c are functions relating to each of the three types of variable. Some authors use the concept of ‘tensor multiplication’, being a 3D analogy to ‘matrix multiplication’ in two dimensions; however, the details are confusing and conceptually it is probably best to stick to summations, which is what computer programs do.
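A literal translation of the summation into code makes the point. The sketch below is illustrative (hypothetical array names, not code from the text); it rebuilds every element x[i, j, k] from weight matrices a, b, c and a core array z:

    import numpy as np

    def tucker3_reconstruct(a, b, c, z):
        # a: I x L, b: J x M, c: K x N weight matrices; z: L x M x N core.
        # Direct summation, exactly as in the equation above; equivalent
        # one-liner: np.einsum('il,jm,kn,lmn->ijk', a, b, c, z)
        I, L = a.shape
        J, M = b.shape
        K, N = c.shape
        x = np.zeros((I, J, K))
        for i in range(I):
            for j in range(J):
                for k in range(K):
                    for l in range(L):
                        for m in range(M):
                            for n in range(N):
                                x[i, j, k] += a[i, l] * b[j, m] * c[k, n] * z[l, m, n]
        return x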

4.6.2 PARAFAC

PARAFAC (parallel factor analysis) differs from the Tucker3 models in that each of the three dimensions contains the same number of components. Hence the model can be represented as the sum of contributions due to G components, just as in normal PCA, as illustrated in Figure 4.41 and represented algebraically by

x_{ijk} = \sum_{g=1}^{G} a_{ig} b_{jg} c_{kg}

Each component can be characterised by one vector that is analogous to a scores vector and two vectors that are analogous to loadings, but some keep to the notation of ‘weights’ in three dimensions.

Figure 4.41
PARAFAC: the I × J × K box expressed as a sum of G single-component contributions

Components can, in favourable circumstances, be assigned a physical meaning. A simple example might involve following a reaction by recording a chromatogram from HPLC–DAD at different reaction times. A box whose dimensions are reaction time × elution time × wavelength is obtained. If there are three factors in the data, this would imply three significant compounds in a cluster in the chromatogram (or three significant reactants), and the weights should correspond to the reaction profile, the chromatogram and the spectrum of each compound.
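The PARAFAC model is even simpler to express in code, as a sum of G one-component contributions. The sketch below is illustrative (hypothetical array names, not code from the text):

    import numpy as np

    def parafac_reconstruct(a, b, c):
        # a: I x G, b: J x G, c: K x G; each column g contributes one
        # rank-one term, e.g. a reaction profile, an elution profile and
        # a spectrum in the HPLC-DAD example above.
        # Equivalent one-liner: np.einsum('ig,jg,kg->ijk', a, b, c)
        I, G = a.shape
        x = np.zeros((I, b.shape[0], c.shape[0]))
        for g in range(G):
            x += np.multiply.outer(np.outer(a[:, g], b[:, g]), c[:, g])
        return x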

PARAFAC is more difficult to use than PCA, however, and is conceptually more complex. Nevertheless, it can lead to results that are directly interpretable physically, whereas the factors in PCA have a purely abstract meaning.

4.6.3 Unfolding

Another approach is simply to ‘unfold’ the ‘box’ to give a long matrix. In the environmental chemistry example, instead of each sample being represented by a 24 × 6 matrix, it could be represented by a vector of length 144, each measurement consisting of the measurement of one element on one date, e.g. the measurement of Cd concentration on July 15. Then a matrix of dimensions 20 (sampling sites) × 144 (variables) is produced (Figure 4.42) and subjected to normal PCA. Note that a box can be subdivided into planes in three different ways (compare Figure 4.39 with Figure 4.42), according to which dimension is regarded as the ‘major’ dimension. When unfolding it is also important to consider details of scaling and centring which become far more complex in three dimensions as opposed to two. After unfolding, normal PCA can be performed. Components can be averaged over related variables, for example we could take an average loading for Cd over all dates to give an overall picture of its influence on the observed data.
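Unfolding is a one-line reshape in most matrix environments. The following Python sketch (simulated numbers, purely illustrative) unfolds a 20 × 24 × 6 box into a 20 × 144 matrix, performs ordinary PCA via the singular value decomposition and averages the PC1 loadings of one element over all dates, as suggested above:

    import numpy as np

    rng = np.random.default_rng(0)
    box = rng.normal(size=(20, 24, 6))      # sites x dates x elements (simulated)

    X = box.reshape(20, 24 * 6)             # one row of 144 variables per site
    Xc = X - X.mean(axis=0)                 # column centring before PCA
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = U * s                          # PCA scores of the unfolded matrix
    loadings = Vt                           # one loading per (date, element) pair

    # Average PC1 loading of the third element over all 24 dates:
    print(loadings[0].reshape(24, 6)[:, 2].mean())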

Figure 4.42
Unfolding: the 20 × 24 × 6 box unfolded into a 20 × 144 matrix

This comparatively simple approach is sometimes sufficient, but the PCA calculation neglects to take into account the relationships between the variables. For example, the relationship between the concentration of Cd on July 15 and that on August 1 is considered to be no stronger than the relationship between the Cd concentration on July 15 and the Hg concentration on November 1 during the calculation of the components. However, after the calculations have been performed it is still possible to regroup the loadings, and sometimes an easily understood method such as unfolded PCA can be of value.

Problems

Problem 4.1 Grouping of Elements from Fundamental Properties Using PCA

Section 4.3.2 Section 4.3.3.1 Section 4.3.5 Section 4.3.6.4

The table below lists 27 elements, divided into six groups according to their position in the periodic table together with five physical properties.

Element  Group  Melting point (K)  Boiling point (K)  Density (kg m-3)  Oxidation number  Electronegativity
Li       1       453.69             1615                534              1                 0.98
Na       1       371                1156                970              1                 0.93
K        1       336.5              1032                860              1                 0.82
Rb       1       312.5               961               1530              1                 0.82
Cs       1       301.6               944               1870              1                 0.79
Be       2      1550                3243               1800              2                 1.57
Mg       2       924                1380               1741              2                 1.31
Ca       2      1120                1760               1540              2                 1
Sr       2      1042                1657               2600              2                 0.95
F        3        53.5                85                  1.7           -1                 3.98
Cl       3       172.1               238.5                3.2           -1                 3.16
Br       3       265.9               331.9             3100             -1                 2.96
I        3       386.6               457.4             4940             -1                 2.66
He       4         0.9                 4.2                0.2            0                 0
Ne       4        24.5                27.2                0.8            0                 0
Ar       4        83.7                87.4                1.7            0                 0
Kr       4       116.5               120.8                3.5            0                 0
Xe       4       161.2               166                  5.5            0                 0
Zn       5       692.6              1180               7140              2                 1.6
Co       5      1765                3170               8900              3                 1.8
Cu       5      1356                2868               8930              2                 1.9
Fe       5      1808                3300               7870              2                 1.8
Mn       5      1517                2370               7440              2                 1.5
Ni       5      1726                3005               8900              2                 1.8
Bi       6       544.4              1837               9780              3                 2.02
Pb       6       600.61             2022              11340              2                 1.8
Tl       6       577                1746              11850              3                 1.62

1. Standardise the five variables, using the population (rather than sample) standard deviation. Why is this preprocessing necessary to obtain sensible results in this case? (A computational sketch follows question 6.)


2. Calculate the scores, loadings and eigenvalues of the first two PCs of the standardised data. What is the sum of the first two eigenvalues, and what proportion of the overall variability do they represent?

3. Plot a graph of the scores of PC2 versus PC1, labelling the points. Comment on the grouping in the scores plot.

4. Plot a graph of the loadings of PC2 versus PC1, labelling the points. Which variables cluster together and which appears to behave differently? Hence which physical property mainly accounts for PC2?

5. Calculate the correlation matrix between each of the five fundamental parameters. How does this relate to clustering in the loadings plot?

6. Remove the parameter that exhibits a high loading in PC2 and recalculate the scores using only four parameters. Plot the scores. What do you observe, and why?
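One possible computational route through questions 1, 2 and 5 is sketched below in Python. It is a sketch only, not a worked answer, and assumes the table above has been entered as a 27 × 5 NumPy array X:

    import numpy as np

    # X: enter the 27 x 5 table of physical properties above as an array.
    Z = (X - X.mean(axis=0)) / X.std(axis=0)    # population (ddof=0) standardisation

    U, s, Vt = np.linalg.svd(Z, full_matrices=False)
    scores = (U * s)[:, :2]                     # scores of the first two PCs
    loadings = Vt[:2]                           # loadings of the first two PCs
    eigenvalues = s[:2] ** 2                    # eigenvalue = sum of squared scores
    print(eigenvalues.sum() / (Z ** 2).sum())   # proportion of total variability

    R = np.corrcoef(Z, rowvar=False)            # 5 x 5 correlation matrix (question 5)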

Problem 4.2 Introductory PCA

Section 4.3.2 Section 4.3.3.1

The following is a data matrix, consisting of seven samples and six variables:

2.7   4.3   5.7   2.3   4.6   1.4
2.6   3.7   7.6   9.1   7.4   1.8
4.3   8.1   4.2   5.7   8.4   2.4
2.5   3.5   6.5   5.4   5.6   1.5
4.0   6.2   5.4   3.7   7.4   3.2
3.1   5.3   6.3   8.4   8.9   2.4
3.2   5.0   6.3   5.3   7.8   1.7

The scores of the first two principal components on the centred data matrix are given as follows:

1.6700   4.0863
3.5206   2.0486
0.0119   3.7487
0.7174   2.3799
1.8423   1.7281
3.1757   0.6012
0.0384   0.0206

1. Since X ≈ T.P, calculate the loadings for the first two PCs using the pseudoinverse, remembering to centre the original data matrix first. (A computational sketch follows question 3.)

2. Demonstrate that the two scores vectors are orthogonal and the two loadings vectors are orthonormal. Remember that the answer will only be correct to within a certain degree of numerical accuracy.

3. Determine the eigenvalues and percentage variance of the first two principal components.
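One way of tackling these three questions in Python is sketched below. The scores are recomputed here by SVD rather than typed in from the table above (the printed scores appear to have lost their signs in reproduction), so this is an illustrative sketch rather than a prescribed solution:

    import numpy as np

    # The 7 x 6 data matrix from the problem statement.
    X = np.array([[2.7, 4.3, 5.7, 2.3, 4.6, 1.4],
                  [2.6, 3.7, 7.6, 9.1, 7.4, 1.8],
                  [4.3, 8.1, 4.2, 5.7, 8.4, 2.4],
                  [2.5, 3.5, 6.5, 5.4, 5.6, 1.5],
                  [4.0, 6.2, 5.4, 3.7, 7.4, 3.2],
                  [3.1, 5.3, 6.3, 8.4, 8.9, 2.4],
                  [3.2, 5.0, 6.3, 5.3, 7.8, 1.7]])

    Xc = X - X.mean(axis=0)                 # centre the columns first
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    T = (U * s)[:, :2]                      # scores of the first two PCs

    P = np.linalg.pinv(T) @ Xc              # loadings via the pseudoinverse
    print(np.round(T.T @ T, 6))             # off-diagonals ~0: orthogonal scores
    print(np.round(P @ P.T, 6))             # ~identity: orthonormal loadings

    eigenvalues = (T ** 2).sum(axis=0)      # sum of squares of each scores vector
    percent = 100 * eigenvalues / (Xc ** 2).sum()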

Problem 4.3 Introduction to Cluster Analysis

Section 4.4

The following dataset consists of seven measurements (rows) on six objects A–F (columns):

      A     B     C     D     E     F
     0.9   0.3   0.7   0.5   1.0   0.3
     0.5   0.2   0.2   0.4   0.7   0.1
     0.2   0.6   0.1   1.1   2.0   0.3
     1.6   0.7   0.9   1.3   2.2   0.5
     1.5   0.1   0.1   0.2   0.4   0.1
     0.4   0.9   0.7   1.8   3.7   0.4
     1.5   0.3   0.3   0.6   1.1   0.2

1. Calculate the correlation matrix between the six objects.

2. Using the correlation matrix, perform cluster analysis using the furthest neighbour method. Illustrate each stage of linkage. (A computational sketch follows question 3.)

3. From the results in 2, draw a dendrogram, and deduce which objects cluster closely into groups.
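A Python sketch of questions 1 and 2 follows. It converts the correlations to dissimilarities as 1 − r before complete-linkage (furthest-neighbour) clustering; that conversion is one common choice, not one prescribed by the text:

    import numpy as np
    from scipy.cluster.hierarchy import dendrogram, linkage
    from scipy.spatial.distance import squareform

    # Seven measurements (rows) on six objects A-F (columns), from above.
    X = np.array([[0.9, 0.3, 0.7, 0.5, 1.0, 0.3],
                  [0.5, 0.2, 0.2, 0.4, 0.7, 0.1],
                  [0.2, 0.6, 0.1, 1.1, 2.0, 0.3],
                  [1.6, 0.7, 0.9, 1.3, 2.2, 0.5],
                  [1.5, 0.1, 0.1, 0.2, 0.4, 0.1],
                  [0.4, 0.9, 0.7, 1.8, 3.7, 0.4],
                  [1.5, 0.3, 0.3, 0.6, 1.1, 0.2]])

    R = np.corrcoef(X, rowvar=False)        # 6 x 6 correlations between objects
    D = squareform(1 - R, checks=False)     # similarity -> dissimilarity
    Z = linkage(D, method='complete')       # furthest-neighbour (complete) linkage
    dendrogram(Z, labels=list('ABCDEF'))    # dendrogram for question 3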

Problem 4.4 Classification Using Euclidean Distance and KNN

Section 4.5.5 Section 4.4.1

The following data represent three measurements, x, y and z, made on two classes of compound:

Object   Class    x     y     z
1        A       0.3   0.4   0.1
2        A       0.5   0.6   0.2
3        A       0.7   0.5   0.3
4        A       0.5   0.6   0.5
5        A       0.2   0.5   0.1
6        B       0.2   0.1   0.6
7        B       0.3   0.4   0.5
8        B       0.1   0.3   0.7
9        B       0.4   0.5   0.7

1. Calculate the centroids of each class (this is done simply by averaging the values of the three measurements over each class). (A computational sketch follows question 5.)

2. Calculate the Euclidean distance of all nine objects from the centroids of both classes A and B (you should obtain a table of 18 numbers). Verify that all objects do, indeed, belong to their respective classes.

3. An object of unknown origins has measurements (0.5, 0.3, 0.3). What is the distance from the centroids of each class, and so to which class is it more likely to belong?

4. The K nearest neighbour criterion can also be used for classification. Find the distance of the object in question 3 from the nine objects in the table above. Which are the three closest objects, and does this confirm the conclusions in question 3?

5. Is there one object in the original dataset that you might be slightly suspicious about?
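Questions 1 to 3 reduce to a few lines of Python; the sketch below (illustrative only) uses the data from the table above:

    import numpy as np

    # Data from the table: five class A objects, then four class B objects.
    X = np.array([[0.3, 0.4, 0.1], [0.5, 0.6, 0.2], [0.7, 0.5, 0.3],
                  [0.5, 0.6, 0.5], [0.2, 0.5, 0.1], [0.2, 0.1, 0.6],
                  [0.3, 0.4, 0.5], [0.1, 0.3, 0.7], [0.4, 0.5, 0.7]])
    y = np.array(['A'] * 5 + ['B'] * 4)

    cA = X[y == 'A'].mean(axis=0)           # centroid of class A
    cB = X[y == 'B'].mean(axis=0)           # centroid of class B

    unknown = np.array([0.5, 0.3, 0.3])     # the object of unknown origin
    dA = np.linalg.norm(unknown - cA)       # Euclidean distance to each centroid
    dB = np.linalg.norm(unknown - cB)
    print('A' if dA < dB else 'B')          # assign to the nearer centroid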


Problem 4.5 Certification of NIR Filters Using PC Scores Plots

Section 4.3.2 Section 4.3.5.1 Section 4.3.3.1 Section 4.3.6.4

These data were obtained by the National Institute of Standards and Technology (USA) while developing a transfer standard for verification and calibration of the x-axis of NIR spectrometers. Optical filters were prepared from two separate melts, 2035 and 2035a, of a rare earth glass. Filters from both melts provide seven well-suited absorption bands of very similar but not exactly identical location. One filter, Y, from one of the two melts was discovered to be unlabelled. Four 2035 filters and one 2035a filter were available at the time of this discovery. Six replicate spectra were taken from each filter. Band location data from these spectra are provided below, in cm−1. The expected location uncertainties range from 0.03 to 0.3 cm−1.

Type    No.    P1        P2        P3        P4        P5        P6        P7
2035    18     5138.58   6804.70   7313.49   8178.65   8681.82   9293.94   10245.45
2035    18     5138.50   6804.81   7313.49   8178.71   8681.73   9293.93   10245.49
2035    18     5138.47   6804.87   7313.43   8178.82   8681.62   9293.82   10245.52
2035    18     5138.46   6804.88   7313.67   8178.80   8681.52   9293.89   10245.54
2035    18     5138.46   6804.96   7313.54   8178.82   8681.63   9293.79   10245.51
2035    18     5138.45   6804.95   7313.59   8178.82   8681.70   9293.89   10245.53
2035    101    5138.57   6804.77   7313.54   8178.69   8681.70   9293.90   10245.48
2035    101    5138.51   6804.82   7313.57   8178.75   8681.73   9293.88   10245.53
2035    101    5138.49   6804.91   7313.57   8178.82   8681.63   9293.80   10245.55
2035    101    5138.47   6804.88   7313.50   8178.84   8681.63   9293.78   10245.55
2035    101    5138.48   6804.97   7313.57   8178.80   8681.70   9293.79   10245.50
2035    101    5138.47   6804.99   7313.59   8178.84   8681.67   9293.82   10245.52
2035    102    5138.54   6804.77   7313.49   8178.69   8681.62   9293.88   10245.49
2035    102    5138.50   6804.89   7313.45   8178.78   8681.66   9293.82   10245.54
2035    102    5138.45   6804.95   7313.49   8178.77   8681.65   9293.69   10245.53
2035    102    5138.48   6804.96   7313.55   8178.81   8681.65   9293.80   10245.52
2035    102    5138.47   6805.00   7313.53   8178.83   8681.62   9293.80   10245.52
2035    102    5138.46   6804.97   7313.54   8178.83   8681.70   9293.81   10245.52
2035    103    5138.52   6804.73   7313.42   8178.75   8681.73   9293.93   10245.48
2035    103    5138.48   6804.90   7313.53   8178.78   8681.63   9293.84   10245.48
2035    103    5138.45   6804.93   7313.52   8178.73   8681.72   9293.83   10245.56
2035    103    5138.47   6804.96   7313.53   8178.78   8681.59   9293.79   10245.51
2035    103    5138.46   6804.94   7313.51   8178.81   8681.65   9293.77   10245.52
2035    103    5138.48   6804.98   7313.57   8178.82   8681.51   9293.80   10245.51
2035a   200    5139.26   6806.45   7314.93   8180.19   8682.57   9294.46   10245.62
2035a   200    5139.22   6806.47   7315.03   8180.26   8682.52   9294.35   10245.66
2035a   200    5139.21   6806.56   7314.92   8180.26   8682.61   9294.34   10245.68
2035a   200    5139.20   6806.56   7314.90   8180.23   8682.49   9294.31   10245.69
2035a   200    5139.19   6806.58   7314.95   8180.24   8682.64   9294.32   10245.67
2035a   200    5139.20   6806.50   7314.97   8180.21   8682.58   9294.27   10245.64
Y       201    5138.53   6804.82   7313.62   8178.78   8681.78   9293.77   10245.52
Y       201    5138.49   6804.87   7313.47   8178.75   8681.66   9293.74   10245.52
Y       201    5138.48   6805.00   7313.54   8178.85   8681.67   9293.75   10245.54
Y       201    5138.48   6804.97   7313.54   8178.82   8681.70   9293.79   10245.53
Y       201    5138.47   6804.96   7313.51   8178.77   8681.52   9293.85   10245.54
Y       201    5138.48   6804.97   7313.49   8178.84   8681.66   9293.87   10245.50

1. Standardise the peak positions for the 30 known samples (exclude the samples Y).

2. Perform PCA on these data, retaining the first two PCs. Calculate the scores and eigenvalues. What will the sum of squares of the standardised data equal, and so what proportion of the variance is accounted for by the first two PCs?

3. Produce a scores plot of the first two PCs of these data, indicating the two groups using different symbols. Verify that there is good discrimination using PCA.

4. Determine the origin of Y as follows. (a) For each variable, subtract the mean and divide by the standard deviation of the 30 known samples to give a 6 × 7 matrix standX. (b) Then multiply this standardised data by the overall loadings for the first two PCs to give T = standX.P', and so predict the scores for these samples. (c) Superimpose the scores of Y on to the scores plot obtained in 3, and so determine the origin of Y. (A computational sketch follows question 5.)

5. Why is it correct to calculate T = standX.P' rather than using the pseudoinverse and calculating T = standX.P'.(P.P')−1?
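The projection in question 4 might be coded as follows. This is a sketch, not a prescribed implementation; X30 is assumed to hold the 30 × 7 matrix of known band positions, XY the 6 × 7 matrix for filter Y, and P the 2 × 7 loadings matrix from the PCA of the standardised known samples:

    import numpy as np

    mean = X30.mean(axis=0)
    std = X30.std(axis=0)                   # population standard deviation

    standX = (XY - mean) / std              # standardise Y with the KNOWN statistics
    T_Y = standX @ P.T                      # predicted scores; no pseudoinverse is
                                            # needed because P.P' is the identity
                                            # for orthonormal loadings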

Problem 4.6 Simple KNN Classification

Section 4.5.5 Section 4.4.1

The following represents five measurements on 16 samples in two classes, a and b:

Sample                                 Class
1       37     3     56    32    66    a
2       91    84     64    37    50    a
3       27    34     68    28    63    a
4       44    25     71    25    60    a
5       46    60     45    23    53    a
6       25    32     45    21    43    a
7       36    53     99    42    92    a
8       56    53     92    37    82    a
9       95    58     59    35    33    b
10      29    25     30    13    21    b
11      96    91     55    31    32    b
12      60    34     29    19    15    b
13      43    74     44    21    34    b
14      62   105     36    16    21    b
15      88    70     48    29    26    b
16      95    76     74    38    46    b


1. Calculate the 16 × 16 sample distance matrix, by computing the Euclidean distance between each pair of samples.

2. For each sample, list the classes of the three and five nearest neighbours, using the distance matrix as a guide.

3. Verify that most samples belong to their proposed class. Is there a sample that is most probably misclassified?

Problem 4.7 Classification of Swedes into Fresh and Stored using SIMCA

Section 4.5.3 Section 4.3.6 Section 4.3.5 Section 4.3.2 Section 4.3.3.1

The following consists of a training set of 14 swedes (vegetable) divided into two groups, fresh and stored (indicated by F and S in the names), with the areas of eight GC peaks (A–H) from the extracts indicated. The aim is to set up a model to classify a swede into one of these two groups.

 

      A      B      C      D      E       F      G      H
FH    0.37   0.99   1.17   6.23   2.31    3.78   0.22   0.24
FA    0.84   0.78   2.02   5.47   5.41    2.8    0.45   0.46
FB    0.41   0.74   1.64   5.15   2.82    1.83   0.37   0.37
FI    0.26   0.45   1.5    4.35   3.08    2.01   0.52   0.49
FK    0.99   0.19   2.76   3.55   3.02    0.65   0.48   0.48
FN    0.7    0.46   2.51   2.79   2.83    1.68   0.24   0.25
FM    1.27   0.54   0.90   1.24   0.02    0.02   1.18   1.22
SI    1.53   0.83   3.49   2.76   10.3    1.92   0.89   0.86
SH    1.5    0.53   3.72   3.2    9.02    1.85   1.01   0.96
SA    1.55   0.82   3.25   3.23   7.69    1.99   0.85   0.87
SK    1.87   0.25   4.59   1.4    6.01    0.67   1.12   1.06
SB    0.8    0.46   3.58   3.95   4.7     2.05   0.75   0.75
SM    1.63   1.09   2.93   6.04   4.01    2.93   1.05   1.05
SN    3.45   1.09   5.56   3.3    3.47    1.52   1.74   1.71

In addition, two test set samples, X and Y, each belonging to one of the groups F and S, have also been analysed by GC:

 

      A      B      C      D      E       F      G      H
FX    0.62   0.72   1.48   4.14   2.69    2.08   0.45   0.45
SY    1.55   0.78   3.32   3.2    5.75    1.77   1.04   1.02

1. Transform the data first by taking logarithms and then standardising over the 14 training set samples (use the population standard deviation). Why are these transformations used?

2. Perform PCA on the transformed data for the 14 objects in the training set, and retain the first two PCs. What are the eigenvalues of these PCs, and to what percentage of the overall variance do they correspond?
