
Brereton Chemometrics


CHEMOMETRICS

 

 

Exactly the same principles can be employed for calculating the coefficients as in Section 2.1.2, but in this case b is a vector rather than a scalar, and X is a matrix rather than a vector, so that

$\mathbf{b} \approx (\mathbf{X}'.\mathbf{X})^{-1}.\mathbf{X}'.\mathbf{c}$

or

$\hat{c} = -0.173 + 4.227x$

Note that the coefficients are different to those of Section 5.2.2. One reason is that there are still a number of interferents, from the other PAHs, in the spectrum at 335 nm, and these are modelled partly by the intercept term. The models of the previous sections force the best fit straight line to pass through the origin. A better fit can be obtained if this condition is not required. The new best fit straight line is presented in Figure 5.6 and results, visually, in a much better fit to the data.

The predicted concentrations are fairly easy to obtain, the easiest approach involving the use of matrix based methods, so that

$\hat{\mathbf{c}} = \mathbf{X}.\mathbf{b}$

the root mean square error being given by

$E = \sqrt{0.229/23} = 0.100 \text{ mg l}^{-1}$

representing an E% of 21.8 % relative to the mean. Note that the error term should be divided by 23 (the number of degrees of freedom) rather than 25, to reflect the two parameters used in the model.
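The matrix calculation above can be sketched in Python with NumPy. This is only a minimal illustration under stated assumptions: the function name `fit_with_intercept` and the five data points are hypothetical and are not the PAH data. A column of ones is added to X so that the first coefficient models the intercept, and the residual sum of squares is divided by I − 2 degrees of freedom.

```python
import numpy as np

def fit_with_intercept(x, c):
    """Inverse calibration c ~ b0 + b1*x using b = (X'.X)^-1.X'.c."""
    x = np.asarray(x, dtype=float)
    c = np.asarray(c, dtype=float)
    # First column of ones models the intercept, second column is the response
    X = np.column_stack([np.ones_like(x), x])
    # Solve the normal equations rather than forming the inverse explicitly
    b = np.linalg.solve(X.T @ X, X.T @ c)
    c_hat = X @ b
    # Two parameters (b0 and b1) are fitted, so I - 2 degrees of freedom remain
    dof = len(c) - X.shape[1]
    rmse = np.sqrt(np.sum((c - c_hat) ** 2) / dof)
    return b, c_hat, rmse

# Hypothetical absorbances and concentrations, for illustration only
x = np.array([0.05, 0.10, 0.15, 0.20, 0.25])
c = np.array([0.10, 0.28, 0.52, 0.69, 0.91])

b, c_hat, rmse = fit_with_intercept(x, c)
print("b0, b1 =", b)
print("RMSE   =", rmse)
```

Dropping the column of ones from X would give the model forced through the origin, so the same function body covers both cases with a one-line change.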

One interesting and important consideration is that the apparent root mean square error in Sections 5.2.2 and 5.2.3 is only reduced by a small amount, yet the best fit straight line appears much worse if we neglect the intercept. The reason for this is that there is still a considerable replicate error, and this cannot readily be modelled using a single compound model. If this contribution were removed the error would be reduced dramatically.

Figure 5.6

Best fit straight line using inverse calibration: data of Figure 5.5 and an intercept term (vertical axis: absorbance, AU; horizontal axis: concentration, mg l−1)

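An alternative, and common, method for including the intercept is to mean centre both the x and the c variables to fit the equation

$\mathbf{c} - \bar{c} \approx (\mathbf{x} - \bar{x})b$

or

${}^{cen}\mathbf{c} \approx {}^{cen}\mathbf{x}\,b$

or

$b \approx ({}^{cen}\mathbf{x}'.{}^{cen}\mathbf{x})^{-1}.{}^{cen}\mathbf{x}'.{}^{cen}\mathbf{c} = \dfrac{\sum_{i=1}^{I} (x_i - \bar{x})(c_i - \bar{c})}{\sum_{i=1}^{I} (x_i - \bar{x})^2}$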

It is easy to show algebraically that

the value of b when both variables have been centred is identical with the value of b1 obtained when the data are modelled including an intercept term (=4.227 in this example);

the value of b0 (intercept term for uncentred data) is given by $\bar{c} - b\bar{x} = 0.456 - 4.227 \times 0.149 = -0.173$, so the two methods are related.

It is common to centre both sets of variables for this reason, the calculations being mathematically simpler than including an intercept term. Note that both blocks must be centred, and the predictions are of the concentrations minus their mean, so the mean concentration must be added back to return to the original physical values.
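The equivalence between centring and fitting an explicit intercept can be checked numerically. The sketch below assumes NumPy and uses made-up data (not the PAH data); `fit_centred` is a hypothetical helper name. It computes the slope from the centred variables, recovers the intercept as the mean of c minus b times the mean of x, and compares both with the uncentred two-parameter fit.

```python
import numpy as np

def fit_centred(x, c):
    """Slope from mean-centred data; intercept recovered as c_mean - b*x_mean."""
    x = np.asarray(x, dtype=float)
    c = np.asarray(c, dtype=float)
    cen_x = x - x.mean()
    cen_c = c - c.mean()
    b1 = (cen_x @ cen_c) / (cen_x @ cen_x)
    b0 = c.mean() - b1 * x.mean()
    return b0, b1

# Hypothetical absorbances and concentrations, for illustration only
x = np.array([0.06, 0.09, 0.13, 0.18, 0.22])
c = np.array([0.15, 0.30, 0.46, 0.61, 0.76])

b0, b1 = fit_centred(x, c)

# Reference: uncentred model with an explicit intercept column
X = np.column_stack([np.ones_like(x), x])
b_ref = np.linalg.solve(X.T @ X, X.T @ c)

# Predictions from the centred model need the mean concentration added back
c_hat = c.mean() + (x - x.mean()) * b1

print("centred route  :", b0, b1)
print("intercept route:", b_ref)
```

Both routes should print the same pair of coefficients to within rounding, which is the algebraic result stated above.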

 

Figure 5.7

Predicted (vertical) versus known (horizontal) concentrations using the methods in Section 5.2.3 (both axes in mg l−1)


 

 

In calibration it is common to plot a graph of predicted versus observed concentrations as presented in Figure 5.7. This looks superficially similar to that in the previous figure, but the vertical scale is different and the graph goes through the origin (providing the data have been mean centred). There is a variety of potential graphical output and it is important not to be confused, but to distinguish each type of information carefully.

It is important to realise that the predictions for the method described in this section differ from those obtained for the uncentred data. It is also useful to realise that similar methods can be applied to classical calibration, the details being omitted for brevity, as it is recommended that inverse calibration is performed in normal circumstances.

5.3 Multiple Linear Regression

5.3.1 Multidetector Advantage

Multiple linear regression (MLR) is an extension when more than one response is employed. There are two principal reasons for this. The first is that there may be more than one component in a mixture. Under such circumstances it is usual to employ more than one response (the exception being if the concentrations of some of the components are known to be correlated): for N components, at least N wavelengths should normally be used. The second is that each detector contains extra, and often complementary, information: some individual wavelengths in a spectrum may be influenced by noise or unknown interferents. Using, for example, 100 wavelengths averages out the information, and will often provide a better result than relying on a single wavelength.

5.3.2 Multiwavelength Equations

In certain applications, equations can be developed that are used to predict the concentrations of compounds by monitoring at a finite number of wavelengths. A classical area is in pigment analysis by electronic absorption spectroscopy, for example in the area of chlorophyll chemistry. In order to determine the concentration of four pigments in a mixture, investigators recommend monitoring at four different wavelengths, and to use an equation that links absorbance at each wavelength to concentration of the pigments.

In the PAH case study, only certain compounds absorb above 330 nm, the main ones being pyrene, fluoranthene, acenaphthylene and benzanthracene (note that the small absorbance due to a fifth component may be regarded as an interferent, although adding this to the model will, of course, result in better predictions). It is possible to choose four wavelengths, preferably ones in which the absorbance ratios of these four compounds differ. The absorbances at 330, 335, 340 and 345 nm are indicated in Figure 5.8. Of course, it is not necessary to select four sequential wavelengths; any four wavelengths would be sufficient, provided that the four compounds are the main ones represented by these variables. This gives an X matrix with four columns and 25 rows.

Calibration equations can be obtained, as follows, using inverse methods.

First, select the absorbances of the 25 spectra at these four wavelengths.

Second, obtain the corresponding C matrix consisting of the relevant concentrations. These new (reduced) matrices are presented in Table 5.5.

The aim is to find coefficients B relating X and C by $\mathbf{C} \approx \mathbf{X}.\mathbf{B}$, where B is a 4 × 4 matrix, each column representing a compound and each row a wavelength.


 

 

Figure 5.8

Absorbances of pure Pyr, Fluor, Benz and Ace between 330 and 345 nm (vertical axis: absorbance; horizontal axis: wavelength, nm)

Table 5.5 Matrices for four components.

X (absorbance at wavelength, nm)C (concentration, mg l−1)
330     335     340     345     Py      Ace     Benz    Fluora
0.127   0.165   0.110   0.075   0.456   0.120   1.620   0.120
0.150   0.178   0.140   0.105   0.456   0.040   2.700   0.120
0.095   0.102   0.089   0.068   0.152   0.200   1.620   0.080
0.134   0.191   0.107   0.060   0.760   0.200   1.080   0.160
0.170   0.239   0.146   0.094   0.760   0.160   2.160   0.160
0.135   0.178   0.115   0.078   0.608   0.200   2.160   0.040
0.129   0.193   0.089   0.041   0.760   0.120   0.540   0.160
0.127   0.164   0.113   0.078   0.456   0.080   2.160   0.120
0.104   0.129   0.098   0.074   0.304   0.160   1.620   0.200
0.157   0.193   0.134   0.093   0.608   0.160   2.700   0.040
0.100   0.154   0.071   0.030   0.608   0.040   0.540   0.040
0.056   0.065   0.053   0.036   0.152   0.160   0.540   0.080
0.094   0.144   0.078   0.043   0.608   0.120   1.080   0.040
0.079   0.114   0.064   0.040   0.456   0.200   0.540   0.120
0.143   0.211   0.114   0.067   0.760   0.040   1.620   0.160
0.081   0.087   0.081   0.069   0.152   0.040   2.160   0.080
0.071   0.077   0.061   0.045   0.152   0.080   1.080   0.080
0.081   0.106   0.072   0.047   0.304   0.040   1.080   0.200
0.114   0.119   0.115   0.096   0.152   0.120   2.700   0.080
0.098   0.130   0.080   0.051   0.456   0.160   1.080   0.120
0.133   0.182   0.105   0.059   0.608   0.080   1.620   0.040
0.070   0.095   0.064   0.042   0.304   0.080   0.540   0.200
0.124   0.138   0.118   0.093   0.304   0.200   2.700   0.200
0.163   0.219   0.145   0.101   0.760   0.080   2.700   0.160
0.128   0.147   0.116   0.086   0.304   0.120   2.160   0.200

 

 

 

 

 

 

 

 

 


 

 

 

 

 

Table 5.6 Matrix B for Section 5.3.2.

Wavelength (nm)   Py        Ace       Benz      Fluor
330               −3.870    2.697     14.812    −4.192
335               8.609     −2.391    3.033     0.489
340               −5.098    4.594     −49.076   7.221
345               1.848     −4.404    65.255    −2.910

 

This equation can be solved using the regression methods in Section 5.2.2, changing vectors and scalars to matrices, so that $\mathbf{B} = (\mathbf{X}'.\mathbf{X})^{-1}.\mathbf{X}'.\mathbf{C}$, giving the matrix in Table 5.6.

If desired, represent in equation form; for example, the first column of B suggests that

estimated [pyrene] $= -3.870A_{330} + 8.609A_{335} - 5.098A_{340} + 1.848A_{345}$

In many areas of optical spectroscopy, these types of equations are very common. Note, though, that changing the wavelengths can have a radical influence on the coefficients, and slight wavelength irreproducibility between spectrometers can lead to equations that are not easily transferred.

Finally, estimate the concentrations by

$\hat{\mathbf{C}} = \mathbf{X}.\mathbf{B}$

as indicated in Table 5.7.
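The steps above can be sketched with NumPy as follows. The data here are hypothetical and noise free (random absorbances and a made-up coefficient matrix `B_true`), chosen only so that the recovered B can be checked against a known answer; they are not the values of Table 5.5.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for Table 5.5: 25 samples, 4 wavelengths, 4 compounds
X = rng.uniform(0.03, 0.25, size=(25, 4))   # absorbances
B_true = rng.normal(size=(4, 4))            # rows: wavelengths, columns: compounds
C = X @ B_true                              # concentrations implied by the model

# B = (X'.X)^-1.X'.C, solved without forming the inverse explicitly
B = np.linalg.solve(X.T @ X, X.T @ C)

# Estimated concentrations for all compounds at once
C_hat = X @ B

# The first column of B is the calibration equation for the first compound
first_compound = X @ B[:, 0]

print(np.round(B, 3))
```

With real, noisy spectra the recovered B will not reproduce the concentrations exactly, and the residuals feed into the root mean square error discussed next.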

The estimates by this approach are very much better than the univariate approaches in this particular example. Figure 5.9 shows the predicted versus known concentrations for pyrene. The root mean square error of prediction is now

$E = \sqrt{\sum_{i=1}^{I} (c_i - \hat{c}_i)^2 / 21}$

(note that the divisor is 21, not 25, as four degrees of freedom are lost because there are four compounds in the model), equal to 0.042 mg l−1 or 9.13 % of the average concentration, a significant improvement. Further improvement could be obtained by including the intercept (usually performed by centring the data) and including the concentrations of more compounds. However, the number of wavelengths must be increased if more compounds are used in the model.
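A small helper makes the role of the divisor explicit. The function name `rmse_with_dof` and the six-point example are assumptions made for illustration; the only thing taken from the text is that the sum of squared residuals is divided by the number of samples minus the number of fitted parameters, and that the percentage error is expressed relative to the mean concentration.

```python
import numpy as np

def rmse_with_dof(c, c_hat, n_params):
    """Root mean square error using I - n_params degrees of freedom.

    Works on a single vector or column-wise on an I x N matrix.
    Returns the error and the error as a percentage of the mean of c.
    """
    c = np.asarray(c, dtype=float)
    c_hat = np.asarray(c_hat, dtype=float)
    dof = c.shape[0] - n_params
    if dof <= 0:
        raise ValueError("need more samples than fitted parameters")
    E = np.sqrt(np.sum((c - c_hat) ** 2, axis=0) / dof)
    E_percent = 100.0 * E / c.mean(axis=0)
    return E, E_percent

# Hypothetical true and predicted concentrations
c = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
c_hat = c + np.array([0.1, -0.1, 0.1, -0.1, 0.1, -0.1])

E, E_percent = rmse_with_dof(c, c_hat, n_params=2)
print(E, E_percent)
```

For the four component model above one would call it with `n_params=4`, which gives the divisor of 21 for 25 samples.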

It is possible also to employ classical methods. For the single detector, single wavelength model in Section 2.1.1,

$\hat{\mathbf{c}} = \mathbf{x}(1/s)$

where s is a scalar and x and c are vectors corresponding to the absorbances and concentrations for each of the I samples. Where there are several components in the mixture, this becomes

$\hat{\mathbf{C}} = \mathbf{X}.\mathbf{S}'.(\mathbf{S}.\mathbf{S}')^{-1}$

 


 

 

 

 

Table 5.7 Estimated concentrations (mg l−1) for four components as described in Section 5.3.2.

Py      Ace     Benz    Fluor
0.507   0.123   1.877   0.124
0.432   0.160   2.743   0.164
0.182   0.122   1.786   0.096
0.691   0.132   1.228   0.130
0.829   0.144   2.212   0.185
0.568   0.123   1.986   0.125
0.784   0.115   0.804   0.077
0.488   0.126   1.923   0.137
0.345   0.096   1.951   0.119
0.543   0.168   2.403   0.133
0.632   0.096   0.421   0.081
0.139   0.081   0.775   0.075
0.558   0.078   0.807   0.114
0.423   0.058   0.985   0.070
0.806   0.110   1.535   0.132
0.150   0.079   1.991   0.087
0.160   0.089   1.228   0.050
0.319   0.089   1.055   0.095
0.174   0.128   2.670   0.131
0.426   0.096   1.248   0.082
0.626   0.146   1.219   0.118
0.298   0.071   0.925   0.093
0.278   0.137   2.533   0.129
0.702   0.137   2.553   0.177
0.338   0.148   2.261   0.123

 

 

 

 

 

 

 

 

Figure 5.9

Predicted versus known concentration of pyrene, using a four component model and the wavelengths 330, 335, 340 and 345 nm (uncentred); vertical axis: predicted concentration, mg l−1; horizontal axis: true concentration, mg l−1


 

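and the trick is to estimate S, which can be done in one of two ways: (1) by knowledge of the true spectra or (2) by regression since $\mathbf{C}.\mathbf{S} \approx \mathbf{X}$, so $\hat{\mathbf{S}} = (\mathbf{C}'.\mathbf{C})^{-1}.\mathbf{C}'.\mathbf{X}$. Note that

$\mathbf{B} \approx \mathbf{S}'.(\mathbf{S}.\mathbf{S}')^{-1}$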

 

 

 

 

 

 

 

 

 

 

However, as in univariate calibration, the coefficients obtained using both approaches may not be exactly equal, as each method makes different assumptions about error structure.

Such equations make assumptions that the concentrations of the significant analytes are all known, and work well only if this is true. Application to mixtures where there are unknown interferents can result in serious estimation errors.
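The relationship between the classical and inverse coefficients can be illustrated numerically. In this sketch everything is hypothetical: the concentrations, the made-up spectral matrix `S_true` and the noise level are assumptions chosen for illustration. In the noise free case the two routes give the same B; once noise is added to X they generally differ, reflecting the different assumptions about error structure.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical system: 25 mixtures, 4 compounds, 4 wavelengths
C = rng.uniform(0.1, 1.0, size=(25, 4))               # known concentrations
S_true = np.eye(4) + 0.2 * rng.uniform(size=(4, 4))   # well-conditioned made-up spectra
X = C @ S_true                                         # noise-free absorbances

# Classical route: regress X on C to estimate S, then invert the spectral model
S_hat = np.linalg.solve(C.T @ C, C.T @ X)              # S = (C'.C)^-1.C'.X
B_classical = S_hat.T @ np.linalg.inv(S_hat @ S_hat.T)

# Inverse route: regress C on X directly
B_inverse = np.linalg.solve(X.T @ X, X.T @ C)

# With noise on the absorbances the two sets of coefficients generally differ
X_noisy = X + rng.normal(scale=0.01, size=X.shape)
S_noisy = np.linalg.solve(C.T @ C, C.T @ X_noisy)
B_classical_noisy = S_noisy.T @ np.linalg.inv(S_noisy @ S_noisy.T)
B_inverse_noisy = np.linalg.solve(X_noisy.T @ X_noisy, X_noisy.T @ C)
max_difference = np.max(np.abs(B_classical_noisy - B_inverse_noisy))

print("noise-free routes agree:", np.allclose(B_classical, B_inverse))
print("largest difference with noise:", max_difference)
```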

5.3.3 Multivariate Approaches

The methods in Section 5.3.2 could be extended to all 10 PAHs, and with appropriate choice of 10 wavelengths may give reasonable estimates of concentrations. However, all the wavelengths contain some information and there is no reason why most of the spectrum cannot be employed.

There is a fairly confusing literature on the use of multiple linear regression for calibration in chemometrics, primarily because many workers present their arguments in a very formalised manner. However, the choice and applicability of any method depends on three main factors:

1. the number of compounds in the mixture (N = 10 in this case) or responses to be estimated;

2. the number of experiments (I = 25 in this case), often spectra or chromatograms;

3. the number of variables (J = 27 wavelengths in this case).

In order to have a sensible model, the number of compounds must be less than or equal to the smaller of the number of experiments or number of variables. In certain specialised cases this limitation can be infringed if it is known that there are correlations between concentrations of different compounds. This may happen, for example, in environmental chemistry, where there could be tens or hundreds of compounds in a sample, but the presence of one (e.g. a homologous series) indicates the presence of another, so, in practice there are only a few independent factors or groups of compounds. Also, correlations can be built into the design. In most real world situations there definitely will be correlations in complex multicomponent mixtures. However, the methods described below are for the case where the number of compounds is smaller than the number of experiments or number of detectors.

The X data matrix is ideally related to the concentration and spectral matrices by

$\mathbf{X} \approx \mathbf{C}.\mathbf{S}$

where X is a 25 × 27 matrix, C a 25 × 10 matrix and S a 10 × 27 matrix in the example discussed here. In calibration it is assumed that a series of experiments is performed in which C is known (e.g. a set of mixtures of compounds with known concentrations is recorded spectroscopically). An estimate of S can then be obtained by

$\hat{\mathbf{S}} = (\mathbf{C}'.\mathbf{C})^{-1}.\mathbf{C}'.\mathbf{X}$


 

 

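and then the concentrations can be predicted using

$\hat{\mathbf{C}} = \mathbf{X}.\hat{\mathbf{S}}'.(\hat{\mathbf{S}}.\hat{\mathbf{S}}')^{-1}$

exactly as above. This can be extended to estimating the concentrations in any unknown spectrum by

$\hat{\mathbf{c}} = \mathbf{x}.\hat{\mathbf{S}}'.(\hat{\mathbf{S}}.\hat{\mathbf{S}}')^{-1} = \mathbf{x}.\mathbf{B}$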

 

Unless the number of experiments is exactly equal to the number of compounds, the prediction will not completely model the data. This approach works because the matrices $(\mathbf{C}'.\mathbf{C})$ and $(\hat{\mathbf{S}}.\hat{\mathbf{S}}')$ are square matrices whose dimensions equal the number of compounds in the mixture (10 × 10) and have inverses, provided that experiments have been suitably designed and the concentrations of the compounds are not correlated. The predicted concentrations, using this approach, are given in Table 5.8, together with the percentage root mean square prediction error; note that there are only 15 degrees of freedom (= 25 experiments − 10 compounds). Had the data been centred, the number of degrees of freedom would be reduced further. The predicted concentrations are reasonably good for most compounds apart from acenaphthylene.
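The full-spectrum calculation has the same shape as the four wavelength case, only with larger matrices. The sketch below uses the dimensions quoted in the text (I = 25, N = 10, J = 27) but entirely hypothetical, noise-free data, so it shows the mechanics rather than reproducing Table 5.8. The unknown spectrum `x_unknown` is likewise made up.

```python
import numpy as np

rng = np.random.default_rng(2)
I, N, J = 25, 10, 27   # experiments, compounds, wavelengths

# Hypothetical calibration set standing in for the case study
C = rng.uniform(0.05, 1.0, size=(I, N))
S_true = rng.uniform(0.0, 1.0, size=(N, J))
X = C @ S_true                                   # noise free for clarity

# Estimate the spectra: (C'.C) is only N x N
S_hat = np.linalg.solve(C.T @ C, C.T @ X)

# Coefficients for prediction: (S.S') is also N x N, B is J x N
B = S_hat.T @ np.linalg.inv(S_hat @ S_hat.T)

# Predict the concentrations in an unknown spectrum x (a row vector of length J)
c_unknown = rng.uniform(0.05, 1.0, size=N)
x_unknown = c_unknown @ S_true
c_pred = x_unknown @ B

# Degrees of freedom for the uncentred model
dof = I - N

print(np.round(c_pred - c_unknown, 10))
print("degrees of freedom:", dof)
```

Both matrices that are inverted are 10 × 10, which is why this route stays well behaved even though there are 27 wavelengths.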

Table 5.8 Estimated concentrations for the case study using uncentred MLR and all wavelengths.

Spectrum No.   PAH concentration (mg l−1)
               Py      Ace     Anth    Acy     Chry    Benz    Fluora  Fluore  Nap     Phen
1              0.509   0.092   0.200   0.151   0.369   1.731   0.121   0.654   0.090   0.433
2              0.438   0.100   0.297   0.095   0.488   2.688   0.148   0.276   0.151   0.744
3              0.177   0.150   0.303   0.217   0.540   1.667   0.068   0.896   0.174   0.128
4              0.685   0.177   0.234   0.150   0.369   1.099   0.128   0.691   0.026   0.728
5              0.836   0.137   0.304   0.155   0.224   2.146   0.159   0.272   0.194   0.453
6              0.593   0.232   0.154   0.042   0.435   2.185   0.071   0.883   0.146   1.030
7              0.777   0.164   0.107   0.129   0.497   0.439   0.189   0.390   0.158   0.206
8              0.419   0.040   0.198   0.284   0.044   2.251   0.143   1.280   0.088   0.299
9              0.323   0.141   0.247   0.037   0.462   1.621   0.196   0.101   0.003   0.298
10             0.578   0.236   0.020   0.107   0.358   2.659   0.093   0.036   0.070   0.305
11             0.621   0.051   0.214   0.111   0.571   0.458   0.062   0.428   0.022   0.587
12             0.166   0.187   0.170   0.142   0.087   0.542   0.100   0.343   0.103   0.748
13             0.580   0.077   0.248   0.133   0.051   1.120   0.042   0.689   0.176   0.447
14             0.468   0.248   0.057   0.006   0.237   0.558   0.157   0.712   0.103   0.351
15             0.770   0.016   0.066   0.119   0.094   1.680   0.187   0.450   0.080   0.920
16             0.101   0.026   0.100   0.041   0.338   2.230   0.102   0.401   0.201   0.381
17             0.169   0.115   0.063   0.069   0.478   1.054   0.125   0.829   0.068   0.523
18             0.271   0.079   0.142   0.106   0.222   1.086   0.211   0.254   0.151   0.261
19             0.171   0.152   0.216   0.059   0.274   2.587   0.081   0.285   0.013   0.925
20             0.399   0.116   0.095   0.170   0.514   1.133   0.101   0.321   0.243   1.023
21             0.651   0.025   0.146   0.232   0.230   1.610   0.013   0.940   0.184   0.616
22             0.295   0.135   0.256   0.052   0.349   0.502   0.237   0.970   0.161   1.037
23             0.296   0.214   0.116   0.069   0.144   2.589   0.202   0.785   0.162   0.588
24             0.774   0.085   0.187   0.026   0.547   2.671   0.128   1.107   0.108   0.329
25             0.324   0.035   0.036   0.361   0.472   2.217   0.094   0.918   0.128   0.779
E%             9.79    44.87   15.58   69.43   13.67   4.71    40.82   31.38   29.22   16.26

 

 

 

 

 

 

 

 

 

 

 


 

 

Figure 5.10

Normalised spectra of the 10 PAHs estimated by MLR, pyrene in bold (horizontal axis: wavelength, 220 to 340 nm)

The predicted spectra are presented in Figure 5.10, and are not nearly as well predicted as the concentrations. In fact, it would be remarkable if 10 spectra could be reconstructed well for such a complex mixture, given that there is a great deal of overlap. Pyrene, which is indicated in bold, exhibits most of the main peak maxima of the known pure data (compare with Figure 5.3). Often, other knowledge of the system is required to produce better reconstructions of individual spectra. The reason why concentration predictions appear to work significantly better than spectral reconstruction is that, for most compounds, there are characteristic regions of the spectrum containing prominent features. These parts of the spectra for individual compounds will be predicted well, and will disproportionately influence the effectiveness of the method for determining concentrations. However, MLR as described in this section is not an effective method for determining spectra in complex mixtures, and should be employed primarily as a way of determining concentrations.

MLR predicts concentrations well in this case because all significant compounds are included in the model, and so the data are almost completely modelled. If we knew of only a few compounds, there would be much poorer predictions. Consider the situation in which only pyrene, acenaphthene and anthracene are known. The C matrix now has only three columns, and the predicted concentrations are given in Table 5.9. The errors are, as expected, much larger than those in Table 5.8. The absorbances of the remaining seven compounds are mixed up with those of the three modelled components. This problem could be overcome if some characteristic wavelengths or regions of the spectrum at which the selected compounds absorb most strongly are identified, or if the experiments were designed so that there are correlations in the data, or even by a number of methods for weighted regression, but the need to provide information about all significant compounds is a major limitation of MLR.
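The effect of leaving compounds out of the model can be reproduced on synthetic data. The sketch below is an illustration only: the concentrations and spectra are random, hypothetical values with the same dimensions as the case study, and `classical_mlr_predict` and `rms_per_compound` are made-up helper names. It compares predictions for the first three compounds when all 10 are modelled with predictions when only those three are included in C.

```python
import numpy as np

def classical_mlr_predict(C_known, X):
    """Estimate S from the known concentrations, then predict C from X."""
    S_hat = np.linalg.solve(C_known.T @ C_known, C_known.T @ X)
    return X @ S_hat.T @ np.linalg.inv(S_hat @ S_hat.T)

def rms_per_compound(C_true, C_hat, n_params):
    """Column-wise root mean square error with I - n_params degrees of freedom."""
    dof = C_true.shape[0] - n_params
    return np.sqrt(np.sum((C_true - C_hat) ** 2, axis=0) / dof)

rng = np.random.default_rng(3)

# Hypothetical mixture set: 25 spectra, 10 compounds, 27 wavelengths, no noise
C = rng.uniform(0.05, 1.0, size=(25, 10))
S = rng.uniform(0.0, 1.0, size=(10, 27))
X = C @ S

# Model with all 10 compounds versus a model that knows only the first three
C_hat_full = classical_mlr_predict(C, X)
C_hat_partial = classical_mlr_predict(C[:, :3], X)

err_full = rms_per_compound(C[:, :3], C_hat_full[:, :3], n_params=10)
err_partial = rms_per_compound(C[:, :3], C_hat_partial, n_params=3)

print("all compounds modelled :", err_full)
print("three compounds only   :", err_partial)
```

In the reduced model the absorbance of the seven unmodelled compounds is absorbed into the three estimated spectra, which is the mixing up of contributions described above.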

The approach described above is a form of classical calibration, and it is also possible to envisage an inverse calibration model since

$\hat{\mathbf{C}} = \mathbf{X}.\mathbf{B}$


 

 

 

 

Table 5.9 Estimates for three PAHs using the full dataset and MLR but including only three compounds in the model.

Spectrum No.   PAH concentration (mg l−1)
               Py         Ace        Anth
1              0.539      0.146      0.156
2              0.403      0.173      0.345
3              0.199      0.270      0.138
4              0.749      0.015      0.231
5              0.747      0.103      0.211
6              0.489      0.165      0.282
7              0.865      0.060      0.004
8              0.459      0.259      0.080
9              0.362      0.121      0.211
10             0.512      0.351      0.049
11             0.742      0.082      0.230
12             0.209      0.023      0.218
13             0.441      0.006      0.202
14             0.419      0.095      0.051
15             0.822      0.010      0.192
16             0.040      0.255      0.151
17             0.259      0.162      0.122
18             0.323      0.117      0.104
19             0.122      0.179      0.346
20             0.502      0.085      0.219
21             0.639      0.109      0.130
22             0.375      0.062      0.412
23             0.196      0.316      0.147
24             0.638      0.218      0.179
25             0.545      0.317      0.048
E%             22.04986   105.7827   52.40897

 

However, unlike in Section 2.2.2, there are now more wavelengths than samples or components in the mixture. The matrix B is given by

$\mathbf{B} = (\mathbf{X}'.\mathbf{X})^{-1}.\mathbf{X}'.\mathbf{C}$

 

 

 

 

 

 

 

 

 

as above. A problem with this approach is that the matrix $(\mathbf{X}'.\mathbf{X})$ is now a large matrix, with 27 rows and 27 columns, compared with the matrices used above which have 10 rows and 10 columns only. If there are only 10 components in a mixture, in a noise free experiment, the matrix $\mathbf{X}'.\mathbf{X}$ would only have 10 degrees of freedom and no inverse. In practice, a numerical inverse can be computed but it will be largely a function of noise, and often contain some very large (and meaningless) numbers, because many of the columns of the matrix will be correlated, so the determinant of the matrix $\mathbf{X}'.\mathbf{X}$ will be very small. This use of the inverse is only practicable if

1. the number of experiments and wavelengths is at least equal to the number of components in the mixture, and

2. the number of experiments is at least equal to the number of wavelengths.
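The rank argument can be checked directly. This sketch again uses hypothetical noise-free data with the dimensions of the case study; the tolerance passed to `matrix_rank` is an arbitrary choice made for the illustration. It shows that X built from 10 components has rank 10, so the 27 × 27 matrix X'.X is numerically singular, whereas the 10 × 10 matrix C'.C used in the classical approach is well conditioned.

```python
import numpy as np

rng = np.random.default_rng(4)
I, N, J = 25, 10, 27   # experiments, components, wavelengths

# Hypothetical noise-free mixture spectra
C = rng.uniform(0.05, 1.0, size=(I, N))
S = rng.uniform(0.0, 1.0, size=(N, J))
X = C @ S              # rank(X) cannot exceed N

XtX = X.T @ X          # J x J, i.e. 27 x 27

# Only N independent directions are present in the data
rank_X = np.linalg.matrix_rank(X, tol=1e-10)

# Ratio of largest to smallest singular value: enormous for a singular matrix
cond_XtX = np.linalg.cond(XtX)

# The matrix inverted in the classical approach is only N x N and well behaved
cond_CtC = np.linalg.cond(C.T @ C)

print("rank of X            :", rank_X)
print("condition of X'.X    :", cond_XtX)
print("condition of C'.C    :", cond_CtC)
```

Because the rank of X can never exceed the smaller of I and J, X'.X remains rank deficient with 25 experiments and 27 wavelengths even when noise is present, which is why condition 2 is needed.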

Condition 2 either requires a large number of extra experiments to be performed or a reduction to 25 wavelengths. There have been a number of algorithms developed for
