Brereton Chemometrics


302					CHEMOMETRICS

Table 5.14 Magnitudes of first 15 PLS1 components (centred data) for pyrene.

Component	Magnitude	Component	Magnitude
1	7.944	9	0.004
2	1.178	10	0.007
3	0.484	11	0.001
4	0.405	12	0.002
5	0.048	13	0.002
6	0.158	14	0.003
7	0.066	15	0.001
8	0.01

mean centred spectra is 10.313, hence the first two components account for 100 × (7.944 + 1.178)/10.313 = 88.4 % of the overall variance, so the root mean square error after two PLS components have been calculated is $\sqrt{1.191/(27 \times 25)} = 0.042$ (since 1.191 is the residual error) or, expressed as a percentage of the mean centred data, $E\% = 100 \times 0.042/\sqrt{10.313/(27 \times 25)} = 34.0\,\%$. This could also be expressed as a percentage of the mean of the raw data, 100 × 0.042/0.430 = 9.76 %. The latter appears much lower and is a consequence of the fact that the mean of the data is considerably higher than the standard deviation of the mean centred data. It is probably best simply to determine the percentage residual sum of squares error (= 100 − 88.4 = 11.6 %) as more components are computed, but it is important to be aware that there are several approaches for the determination of errors.
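The arithmetic in this paragraph can be checked with a few lines of Python. This is only a sketch of the calculation described in the text: the variable names are hypothetical, and 27 × 25 is taken, as in the text, to be the number of elements in the x block.

```python
import math

total_ss = 10.313             # sum of squares of the mean centred spectra
magnitudes = [7.944, 1.178]   # first two PLS1 components (Table 5.14)
n_values = 27 * 25            # number of elements in the x block
raw_mean = 0.430              # mean of the raw data

# Percentage of the overall variance explained by two components
explained_pct = 100 * sum(magnitudes) / total_ss

# Residual sum of squares and root mean square error after two components
residual_ss = total_ss - sum(magnitudes)
rms_error = math.sqrt(residual_ss / n_values)

# The same error relative to the RMS of the centred data and to the raw mean
rms_centred = math.sqrt(total_ss / n_values)
pct_of_centred = 100 * rms_error / rms_centred
pct_of_raw_mean = 100 * rms_error / raw_mean

print(f"explained: {explained_pct:.1f} %, residual SS: {residual_ss:.3f}")
print(f"RMS error: {rms_error:.3f}")
print(f"relative to centred data: {pct_of_centred:.1f} %")
print(f"relative to raw mean: {pct_of_raw_mean:.1f} %")
```

Note that the error relative to the centred data equals the square root of the residual fraction of the sum of squares, so it is always larger than the residual percentage itself whenever that percentage is below 100.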

The error in concentration predictions for pyrene using two PLS components can be computed from Table 5.13:

• the sum of squares of the errors is 0.385;

• dividing this by 22 and taking the square root leads to a root mean square error of 0.128 mg l−1;

• the average concentration of pyrene is 0.456 mg l−1;

• hence the percentage root mean square error (compared with the raw data) is 28.25 %.

Relative to the standard deviation of the centred data it is even higher. Hence the ‘x’ and ‘c’ blocks are modelled in different ways and it is important to recognise that the percentage error of prediction in concentration may diverge considerably from the percentage error of prediction of the spectra. It is sometimes possible to reconstruct spectral blocks fairly well but still not predict concentrations very effectively. It is best practice to look at errors in both blocks simultaneously to gain an understanding of the quality of predictions.

The root mean square errors for modelling both blocks of data, as successive numbers of PLS components are calculated, are illustrated in Figure 5.13 for pyrene and in Figure 5.14 for acenaphthene. Several observations can be made. First, the shape of the graph of residuals for the two blocks is often very different; see especially acenaphthene. Second, the graph of c residuals tends to change much more dramatically than that for x residuals, according to compound, as might be expected. Third, tests for numbers of significant PLS components might give different answers according to which block is used for the test.

CALIBRATION	303

Figure 5.13

Root mean square errors in x and c blocks, PLS1 centred and pyrene. [Panels: (a) x block; (b) c block. Axes: RMS error (logarithmic scale) against component number, 1 to 15.]

The errors using 10 PLS components are summarised in Table 5.15, and are better than PCR in this case. It is important, however, not to get too excited about the improved quality of predictions. The c or concentration variables may in themselves contain errors, and what has been shown is that PLS forces the solution to model the apparent c block better, but it does not necessarily imply that the other methods are worse at discovering the truth. If, however, we have a lot of confidence in the experimental procedure for determining c (e.g. weighing, dilution, etc.), PLS will result in a more faithful reconstruction.

Figure 5.14

Root mean square errors in x and c blocks, PLS1 centred and acenaphthene. [Panels: (a) x block; (b) c block. Axes: RMS error (logarithmic scale) against component number, 1 to 15.]

5.5.2 PLS2

An extension to PLS1 was suggested some 15 years ago, often called PLS2. In fact there is little conceptual difference, except that the latter allows the use of a concentration matrix, C, rather than concentration vectors for each individual compound in a mixture, and the algorithm is iterative. The equations above alter slightly in that Q is now a matrix not a vector, so that

X = T.P + E

C = T.Q + F

The numbers of columns in C and in Q are each equal to the number of compounds of interest. In PLS1 one compound is modelled at a time, whereas in PLS2 all known compounds can be included in the model simultaneously. This is illustrated in Figure 5.15.


Table 5.15 Concentration estimates of the PAHs using PLS1 and 10 components (centred).

Spectrum No.				PAH concentration (mg l−1)
	Py	Ace	Anth	Acy	Chry	Benz	Fluora	Fluore	Nap	Phen

1	0.462	0.112	0.170	0.147	0.341	1.697	0.130	0.718	0.110	0.553
2	0.445	0.065	0.280	0.175	0.440	2.758	0.138	0.408	0.177	0.772
3	0.147	0.199	0.285	0.162	0.562	1.635	0.111	0.784	0.159	0.188
4	0.700	0.174	0.212	0.199	0.333	1.097	0.132	0.812	0.054	0.785
5	0.791	0.167	0.285	0.111	0.223	2.118	0.171	0.211	0.176	0.519
6	0.616	0.226	0.176	0.040	0.467	2.172	0.068	0.752	0.116	0.928
7	0.767	0.119	0.108	0.180	0.452	0.522	0.153	0.577	0.177	0.202
8	0.476	0.085	0.228	0.157	0.109	2.155	0.129	0.967	0.046	0.184
9	0.317	0.145	0.232	0.042	0.440	1.576	0.171	0.187	0.009	0.367
10	0.614	0.178	0.046	0.154	0.334	2.702	0.039	0.174	0.084	0.219
11	0.625	0.029	0.237	0.121	0.574	0.543	0.042	0.423	0.039	0.516
12	0.179	0.161	0.185	0.175	0.091	0.560	0.098	0.363	0.110	0.709
13	0.579	0.119	0.262	0.061	0.118	1.074	0.012	0.522	0.149	0.428
14	0.463	0.198	0.067	0.054	0.226	0.561	0.134	0.788	0.110	0.330
15	0.752	0.041	0.062	0.075	0.113	1.646	0.193	0.401	0.072	0.943
16	0.149	0.017	0.115	0.037	0.338	2.186	0.062	0.474	0.196	0.349
17	0.148	0.106	0.050	0.096	0.453	1.044	0.112	0.974	0.092	0.585
18	0.274	0.075	0.149	0.119	0.223	1.098	0.199	0.280	0.147	0.256
19	0.151	0.119	0.213	0.109	0.236	2.664	0.075	0.536	0.050	0.953
20	0.458	0.140	0.114	0.095	0.555	1.067	0.100	0.220	0.198	0.944
21	0.615	0.080	0.120	0.189	0.226	1.581	0.040	1.024	0.198	0.738
22	0.318	0.091	0.267	0.097	0.329	0.523	0.187	0.942	0.157	0.967
23	0.295	0.160	0.124	0.160	0.122	2.669	0.182	0.826	0.171	0.531
24	0.761	0.072	0.167	0.047	0.541	2.687	0.153	1.049	0.122	0.378
25	0.296	0.120	0.047	0.197	0.555	2.166	0.170	0.590	0.082	0.758
E%	5.47	19.06	7.85	22.48	3.55	2.46	21.96	12.96	16.48	7.02

Figure 5.15

Principles of PLS2. [Schematic of the x block decomposition: X (I × J) = T (I × A) . P (A × J) + E (I × J).]


Table 5.16 Concentration estimates of the PAHs using PLS2 and 10 components (centred).

Spectrum No.				PAH concentration mg l−1
	Py	Ace	Anth	Acy	Chry	Benz	Fluora	Fluore	Nap	Phen

1	0.505	0.110	0.193	0.132	0.365	1.725	0.125	0.665	0.089	0.459
2	0.460	0.116	0.285	0.105	0.453	2.693	0.144	0.363	0.150	0.760
3	0.162	0.180	0.294	0.173	0.563	1.647	0.094	0.787	0.161	0.157
4	0.679	0.173	0.224	0.164	0.343	1.134	0.123	0.752	0.038	0.748
5	0.811	0.135	0.294	0.149	0.230	2.152	0.162	0.221	0.183	0.475
6	0.575	0.182	0.156	0.108	0.442	2.228	0.077	0.827	0.153	1.002
7	0.779	0.151	0.107	0.143	0.469	0.453	0.167	0.484	0.156	0.199
8	0.397	0.100	0.198	0.183	0.093	2.165	0.181	1.035	0.070	0.306
9	0.295	0.089	0.238	0.108	0.433	1.665	0.158	0.238	0.032	0.341
10	0.581	0.203	0.029	0.148	0.327	2.690	0.079	0.191	0.088	0.287
11	0.609	0.070	0.207	0.108	0.559	0.453	0.079	0.484	0.049	0.636
12	0.190	0.176	0.186	0.144	0.086	0.549	0.083	0.411	0.105	0.709
13	0.565	0.107	0.249	0.095	0.092	1.088	0.000	0.595	0.173	0.478
14	0.468	0.173	0.067	0.089	0.214	0.610	0.108	0.830	0.124	0.322
15	0.771	0.018	0.073	0.096	0.112	1.668	0.175	0.415	0.077	0.906
16	0.119	0.030	0.110	0.037	0.345	2.189	0.101	0.442	0.192	0.369
17	0.181	0.098	0.070	0.090	0.468	1.061	0.106	0.903	0.076	0.510
18	0.278	0.067	0.151	0.102	0.226	1.073	0.178	0.292	0.147	0.249
19	0.184	0.131	0.218	0.102	0.245	2.617	0.071	0.434	0.034	0.925
20	0.410	0.120	0.111	0.134	0.543	1.117	0.115	0.243	0.215	0.963
21	0.663	0.100	0.147	0.129	0.262	1.558	0.040	0.845	0.152	0.630
22	0.308	0.108	0.257	0.093	0.335	0.509	0.209	0.954	0.156	0.998
23	0.320	0.179	0.123	0.114	0.129	2.610	0.164	0.817	0.157	0.537
24	0.763	0.072	0.165	0.038	0.524	2.696	0.120	1.123	0.130	0.390
25	0.327	0.110	0.049	0.216	0.544	2.150	0.139	0.650	0.091	0.746
E%	10.25	34.11	13.66	44.56	6.99	4.26	33.41	18.62	25.83	14.77

It is a simple extension to predict all the concentrations simultaneously; the PLS2 predictions, together with root mean square errors, are given in Table 5.16. Note that there is now only one set of scores and loadings for the x (spectroscopic) dataset, and one set of ga common to all 10 compounds. However, the concentration estimates are different when using PLS2 compared with PLS1. In this way PLS differs from PCR, where it does not matter if each variable is modelled separately or all together. The reasons are rather complex but relate to the fact that for PCR the principal components are calculated independently of the c variables, whereas the PLS components are influenced by both blocks of variables.
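The contrast between PCR and PLS can be explored numerically. The following is a minimal NumPy sketch, not the algorithm of the book's appendix: it uses a common NIPALS-style formulation of PLS1 and PLS2, random synthetic data in place of the PAH spectra, and hypothetical function names. It shows that PCR estimates are the same whether the c variables are modelled one at a time or together, and lets the PLS1 and PLS2 estimates be compared.

```python
import numpy as np

def pcr_fit_predict(X, C, n_comp):
    """PCR: scores come from X alone; C is then regressed on the scores."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    T = U[:, :n_comp] * s[:n_comp]               # PCA scores
    Q, *_ = np.linalg.lstsq(T, C, rcond=None)    # least-squares fit of C on T
    return T @ Q

def pls1_fit_predict(X, c, n_comp):
    """PLS1 for one centred c vector, with deflation of both blocks."""
    X, c = X.copy(), c.copy()
    c_hat = np.zeros_like(c)
    for _ in range(n_comp):
        w = X.T @ c
        w /= np.linalg.norm(w)                   # weight vector
        t = X @ w                                # scores
        p = X.T @ t / (t @ t)                    # x loadings
        q = c @ t / (t @ t)                      # c loading (a scalar)
        X -= np.outer(t, p)
        c -= q * t
        c_hat += q * t
    return c_hat

def pls2_fit_predict(X, C, n_comp, n_iter=500, tol=1e-12):
    """PLS2 for a centred C matrix; each component is found iteratively."""
    X, C = X.copy(), C.copy()
    C_hat = np.zeros_like(C)
    for _ in range(n_comp):
        u = C[:, 0].copy()                       # starting guess for c scores
        t_old = np.zeros(X.shape[0])
        for _ in range(n_iter):
            w = X.T @ u
            w /= np.linalg.norm(w)
            t = X @ w
            q = C.T @ t / (t @ t)                # one loading per compound
            u = C @ q / (q @ q)
            if np.linalg.norm(t - t_old) < tol * np.linalg.norm(t):
                break
            t_old = t
        p = X.T @ t / (t @ t)
        X -= np.outer(t, p)
        C -= np.outer(t, q)
        C_hat += np.outer(t, q)
    return C_hat

# Synthetic centred data: 25 samples, 27 x variables, 3 c variables (assumed sizes)
rng = np.random.default_rng(0)
X = rng.normal(size=(25, 27)); X -= X.mean(axis=0)
C = rng.normal(size=(25, 3)); C -= C.mean(axis=0)

pcr_joint = pcr_fit_predict(X, C, 2)
pcr_single = np.column_stack([pcr_fit_predict(X, C[:, [k]], 2)[:, 0] for k in range(3)])
pls2_joint = pls2_fit_predict(X, C, 2)
pls1_single = np.column_stack([pls1_fit_predict(X, C[:, k], 2) for k in range(3)])

print("PCR joint equals PCR one-at-a-time:", np.allclose(pcr_joint, pcr_single))
print("largest |PLS2 - PLS1| difference:", np.abs(pls2_joint - pls1_single).max())
```

In this formulation PLS2 applied to a single c column reduces to PLS1, which is a convenient sanity check on the sketch.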

In some cases PLS2 is helpful, especially since it is easier to perform computationally if there are several c variables compared with PLS1. Instead of obtaining 10 independent models, one for each PAH, in this example, we can analyse all the data in one go. However, in many situations PLS2 concentration estimates are, in fact, worse than PLS1 estimates, so a good strategy might be to perform PLS2 as a first step, which could provide further information such as which wavelengths are significant and which concentrations can be determined with a high degree of confidence, and then perform PLS1 individually for the most appropriate compounds.


5.5.3 Multiway PLS

Two-way data such as HPLC–DAD, LC–MS and LC–NMR are increasingly common in chemistry, especially with the growth of coupled chromatography. Conventionally either a univariate parameter (e.g. a peak area at a given wavelength) (methods in Section 5.2) or a chromatographic elution profile at a single wavelength (methods in Sections 5.3 to 5.5.2) is used for calibration, allowing the use of normal regression techniques described above. However, additional information has been recorded for each sample, often involving both an elution profile and a spectrum. A series of two-way chromatograms is available, and can be organised into a three-way array, often visualised as a box, sometimes denoted by $\underline{X}$, where the line underneath the array name indicates a third dimension. Each level of the box consists of a single chromatogram. Sometimes these three-way arrays are called ‘tensors’, but tensors often have special properties in physics which are unnecessarily complex and confusing to the chemometrician. We will use the notation of tensors only where it helps in understanding the existing methods.

Enhancements of the standard methods for multivariate calibration are required. Although it is possible to use methods such as three-way MLR, most chemometricians have concentrated on developing approaches based on PLS, to which we will be restricted below. Theoreticians have extended these methods to cases where there are several dimensions in both the ‘x’ and ‘c’ blocks, but the most complex practical case is where there are three dimensions in the ‘x’ block, as happens for a series of coupled chromatograms or in fluorescence excitation–emission spectroscopy, for example. A simple simulated numerical example is presented in Table 5.17, in which the x block consists of four two-way chromatograms, each of dimensions 5 × 6. There are three components in the mixture, the c block consisting of a 4 × 3 matrix. We will restrict the discussion to the case where each column of c is to be estimated independently (analogous to PLS1) rather than all in one go. Note that although PLS is by far the most popular approach for multiway calibration, it is possible to envisage methods analogous to MLR or PCR, but they are rarely used.

5.5.3.1 Unfolding

One of the simplest methods is to create a single, long data matrix from the original three-way tensor. In the case of Table 5.17, we have four samples, which could be arranged as a 4 × 5 × 6 tensor (or ‘box’). The three dimensions will be denoted I, J and K. It is possible to change the shape so that any binary combination of variables is converted to a new variable, for example, the intensity of the variable at J = 2 and K = 3. The data can now be represented by 5 × 6 = 30 variables for each of the four samples; this 4 × 30 matrix is the unfolded form of the original data. This operation is illustrated in Figure 5.16.

It is now a simple task to perform PLS (or indeed any other multivariate approach), as discussed above. The 30 variables are centred and the concentrations are predicted as an increasing number of components is used (note that three is the maximum permitted for column centred data in this case, so this example is somewhat simple). All the methods described above can be applied.
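In array terms, unfolding is a reshape. The sketch below uses NumPy and the X block of Table 5.17; the variable names and the choice to place the K index fastest within each row are assumptions made for illustration.

```python
import numpy as np

# X block of Table 5.17: four samples stacked, each a 5 x 6 (J x K) matrix
rows = np.array([
    [390, 421, 871, 940, 610, 525], [635, 357, 952, 710, 910, 380],
    [300, 334, 694, 700, 460, 390], [65, 125, 234, 238, 102, 134],
    [835, 308, 1003, 630, 1180, 325],
    [488, 433, 971, 870, 722, 479], [1015, 633, 1682, 928, 1382, 484],
    [564, 538, 1234, 804, 772, 434], [269, 317, 708, 364, 342, 194],
    [1041, 380, 1253, 734, 1460, 375],
    [186, 276, 540, 546, 288, 306], [420, 396, 930, 498, 552, 264],
    [328, 396, 860, 552, 440, 300], [228, 264, 594, 294, 288, 156],
    [222, 120, 330, 216, 312, 114],
    [205, 231, 479, 481, 314, 268], [400, 282, 713, 427, 548, 226],
    [240, 264, 576, 424, 336, 232], [120, 150, 327, 189, 156, 102],
    [385, 153, 482, 298, 542, 154],
], dtype=float)

I, J, K = 4, 5, 6
X3 = rows.reshape(I, J, K)          # three-way array, samples first
X_unf = X3.reshape(I, J * K)        # unfolded 4 x 30 matrix

# The variable at J = 2, K = 3 (1-based) becomes unfolded column (2-1)*K + (3-1)
col = (2 - 1) * K + (3 - 1)
print(X_unf.shape, X_unf[0, col])   # value for sample 1 is X3[0, 1, 2]

# Column centring removes one degree of freedom, so at most I - 1 = 3
# components can be extracted from four samples
X_centred = X_unf - X_unf.mean(axis=0)
print(np.linalg.matrix_rank(X_centred))
```

Any consistent ordering of the J × K combinations would serve equally well, provided the same ordering is used for every sample.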

An important aspect of three-way calibration involves scaling, which can be rather complex. There are four fundamental ways in which the data can be treated:

1. no centring;

2. centre the columns in each J × K plane and then unfold with no further centring, so, for example, $x_{1,1,1}$ becomes 390 − (390 + 635 + 300 + 65 + 835)/5 = −55;

3. unfold the raw data and centre afterwards, so, for example, $x_{1,1,1}$ becomes 390 − (390 + 488 + 186 + 205)/4 = 72.75;

4. combine methods 2 and 3: start with centring as in method 2, then unfold and recentre a second time.

Table 5.17 Three-way calibration dataset.

(a) X block, each of the 4 (= I) samples gives a two-way 5 × 6 (= J × K) matrix

Sample 1
390	421	871	940	610	525
635	357	952	710	910	380
300	334	694	700	460	390
65	125	234	238	102	134
835	308	1003	630	1180	325

Sample 2
488	433	971	870	722	479
1015	633	1682	928	1382	484
564	538	1234	804	772	434
269	317	708	364	342	194
1041	380	1253	734	1460	375

Sample 3
186	276	540	546	288	306
420	396	930	498	552	264
328	396	860	552	440	300
228	264	594	294	288	156
222	120	330	216	312	114

Sample 4
205	231	479	481	314	268
400	282	713	427	548	226
240	264	576	424	336	232
120	150	327	189	156	102
385	153	482	298	542	154

(b) C block, concentrations of three compounds in each of the four samples

1	9	10
7	11	8
6	2	6
3	4	5

Figure 5.16

Unfolding a data matrix. [Schematic: the I × J × K array is rearranged into an I × (J.K) matrix by placing the J slices, each of width K, side by side.]

These four methods are illustrated in Table 5.18 for the case of the $x_{i,1,1}$, the variables in the top left-hand corner of each of the four two-way datasets. Note that methods 3 and 4 provide radically different answers; for example, sample 2 has the highest value (= 170.75) using method 3, but the lowest using method 4 (= −87.85).

Table 5.18 Four methods of mean centring the data in Table 5.17, illustrated by the variable $x_{i,1,1}$ as discussed in Section 5.5.3.1.

Sample	Method 1	Method 2	Method 3	Method 4
1	390	−55	72.75	44.55
2	488	−187.4	170.75	−87.85
3	186	−90.8	−131.25	8.75
4	205	−65	−112.25	34.55
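The entries of Table 5.18 can be reproduced with a short NumPy sketch. Only the first column (k = 1) of each sample's matrix in Table 5.17 is needed, since the variable $x_{i,1,1}$ is centred either over j within a sample or over the samples i. The array layout and variable names are illustrative assumptions.

```python
import numpy as np

# x[i, j] holds x_{i,j,1}: the first column of each sample's 5 x 6 matrix (Table 5.17)
x = np.array([
    [390, 635, 300, 65, 835],
    [488, 1015, 564, 269, 1041],
    [186, 420, 328, 228, 222],
    [205, 400, 240, 120, 385],
], dtype=float)

# Method 1: no centring
method1 = x[:, 0]

# Method 2: centre the column within each sample's J x K plane (mean over j)
plane_centred = x - x.mean(axis=1, keepdims=True)
method2 = plane_centred[:, 0]

# Method 3: unfold the raw data, then centre the variable over the samples
method3 = x[:, 0] - x[:, 0].mean()

# Method 4: method 2 followed by a second centring over the samples
method4 = method2 - method2.mean()

for name, values in [("1", method1), ("2", method2), ("3", method3), ("4", method4)]:
    print(f"Method {name}:", np.round(values, 2))
```

The ordering of the samples changes between methods 3 and 4 because method 4 first removes each sample's own column mean, which differs strongly from sample to sample.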

Standardisation is also sometimes employed, but must be done before unfolding for meaningful results; an example might be in the GC–MS of a series of samples, each mass being of different absolute intensity. A sensible strategy might be as follows:

1. standardise each mass in each individual chromatogram, to provide I standardised matrices of dimensions J × K;

2. unfold;

3. centre each of the variables.
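The three steps above can be sketched in NumPy as follows. The data are random placeholders, J is assumed to be the elution time and K the mass, and the population standard deviation is used; all of these are assumptions for illustration rather than details given in the text.

```python
import numpy as np

rng = np.random.default_rng(0)
I, J, K = 4, 5, 6                            # samples, time points, masses (assumed)
X = rng.uniform(50, 1000, size=(I, J, K))    # synthetic intensities

# Step 1: standardise each mass (column k) over time (j) within each sample
mean_jk = X.mean(axis=1, keepdims=True)
std_jk = X.std(axis=1, keepdims=True)        # population standard deviation
X_std = (X - mean_jk) / std_jk

# Step 2: unfold each standardised J x K matrix into a row of J*K variables
X_unf = X_std.reshape(I, J * K)

# Step 3: centre each of the unfolded variables over the samples
X_final = X_unf - X_unf.mean(axis=0)

print(X_final.shape)
```

Reversing the order, for example standardising the unfolded columns over only four samples, would scale every variable by its between-sample spread instead, which is a different and often less meaningful operation.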

Standardising at the wrong stage of the analysis can result in meaningless data, so it is always essential to think carefully about the physical (and numerical) consequences of any preprocessing, which is far more complex and has far more options than for simple two-way data.

After this preprocessing, all the normal multivariate calibration methods can be employed.

5.5.3.2 Trilinear PLS1

Some of the most interesting theoretical developments in chemometrics over the past few years have been in so-called ‘multiway’ or ‘multimode’ data analysis. Many such methods have been available for some years, especially in the area of psychometrics, and a few do have relevance to chemistry. It is important, though, not to get too carried away with the excitement of these novel theoretical approaches. We will restrict the discussion here to trilinear PLS1, involving a three-way x block and a single c variable. If there are several known calibrants, the simplest approach is to perform trilinear PLS1 individually on each variable.

Since centring can be fairly complex for three-way data, and there is no inherent reason to do this, for simplicity it is assumed that data are not centred, so raw concentrations and chromatographic/spectroscopic measurements are employed. The data in Table 5.17 can be considered to be arranged in the form of a cube, with three dimensions, I for the number of samples and J and K for the measurements.

Trilinear PLS1 attempts to model both the ‘x’ and ‘c’ blocks simultaneously. Here we will illustrate the use with the algorithm of Appendix A.2.4, based on methods proposed by de Jong and Bro.

Superficially, trilinear PLS1 has many of the same objectives as normal PLS1, and the method as applied to the x block is often represented diagrammatically as in Figure 5.17, replacing ‘squares’ or matrices by ‘boxes’ or tensors, and replacing, where necessary, the dot product (‘.’) by something called a tensor product (‘⊗’). The ‘c’ block decomposition can be represented as per PLS1 and is omitted from the diagram for brevity. In fact, as we shall see, this is an oversimplification, and is not an entirely accurate description of the method.

Figure 5.17

Representation of trilinear PLS1

In trilinear PLS1, for each component it is possible to determine

• a scores vector (t), of length I or 4 in this example;

• a weight vector, which has analogy to a loadings vector (${}^{j}p$), of length J or 5 in this example, referring to one of the dimensions (e.g. time), whose sum of squares equals 1;

• another weight vector, which has analogy to a loadings vector (${}^{k}p$), of length K or 6 in this example, referring to the other one of the dimensions (e.g. wavelength), whose sum of squares also equals 1.

Superficially these vectors are related to scores and loadings in normal PLS, but in practice they are completely different, a key reason being that these vectors are not orthogonal in trilinear PLS1, which influences the additivity of successive components. Here, we keep the terms scores and loadings simply to retain familiarity with the terminology usually used in two-way data analysis.

In addition, a vector q is determined after each new component, by

$q = (T'.T)^{-1}.T'.c$

so that

$\hat{c} = T.q$

or

$c = T.q + f$

where T is the scores matrix, the columns of which consist of the individual scores vectors for each component and has dimensions I × A or 4 × 3 in this example if three PLS components are computed, and q is a column vector of dimensions A × 1 or 3 × 1 in our example.

A key difference from bilinear PLS1 as described in Section 5.5.1 is that the elements of q have to be recalculated afresh as new components are computed, whereas for two-way PLS, the first element of q, for example, is the same no matter how many components are calculated. This limitation is a consequence of nonorthogonality of individual columns of matrix T.
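The effect of nonorthogonal scores on q can be seen in a small NumPy sketch. The scores matrices and the c vector below are made-up numbers chosen so that the result is easy to verify by hand, and `q_vector` is a hypothetical helper that solves the same least-squares problem as $(T'.T)^{-1}.T'.c$.

```python
import numpy as np

def q_vector(T, c):
    """Least-squares solution, equal to (T'.T)^-1 . T'.c when T has full column rank."""
    q, *_ = np.linalg.lstsq(T, c, rcond=None)
    return q

c = np.array([1.0, 2.0, 0.0, 0.0])

# Scores with nonorthogonal columns (as in trilinear PLS1) and with orthogonal columns
T_nonorth = np.array([[1, 1], [0, 1], [0, 0], [0, 0]], dtype=float)
T_orth = np.array([[1, 0], [0, 1], [0, 0], [0, 0]], dtype=float)

q_nonorth_1 = q_vector(T_nonorth[:, :1], c)   # one component
q_nonorth_2 = q_vector(T_nonorth, c)          # two components
q_orth_1 = q_vector(T_orth[:, :1], c)
q_orth_2 = q_vector(T_orth, c)

print("nonorthogonal:", q_nonorth_1, q_nonorth_2)   # first element changes
print("orthogonal:   ", q_orth_1, q_orth_2)         # first element is unchanged
```

With the nonorthogonal scores the first element of q moves from 1 to −1 when the second component is added, whereas with orthogonal scores it stays at 1.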

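The x block residuals after each component are often computed conventionally by

${}^{\mathrm{resid},a}x_{ijk} = {}^{\mathrm{resid},a-1}x_{ijk} - t_{ia}\,{}^{j}p_{ja}\,{}^{k}p_{ka}$

where ${}^{\mathrm{resid},a}x_{ijk}$ is the residual after a components are calculated, and $t_{ia}$, ${}^{j}p_{ja}$ and ${}^{k}p_{ka}$ are the elements of the scores and weight vectors for component a, which would lead to a model

$\hat{x}_{ijk} = \sum_{a=1}^{A} t_{ia}\,{}^{j}p_{ja}\,{}^{k}p_{ka}$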

Sometimes these equations are written as tensor products, but there are numerous ways of multiplying tensors together, so this notation can be confusing and it is often conceptually more convenient to deal directly with vectors and matrices, just as in Section 5.5.3.1 by unfolding the data. This procedure can be called matricisation.

In mathematical terms, we can state that

${}^{\mathrm{unfolded}}\hat{X} = \sum_{a=1}^{A} t_a . {}^{\mathrm{unfolded}}p_a$

where ${}^{\mathrm{unfolded}}p_a$ is simply a row vector of length J.K. Where trilinear PLS1 differs from unfolded PLS described in Section 5.5.3.1 is that a matrix $P_a$ of dimensions J × K can be obtained for each PLS component given by

$P_a = {}^{j}p_a . {}^{k}p_a$

and $P_a$ is unfolded to give ${}^{\mathrm{unfolded}}p_a$.
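The equivalence between the element-wise trilinear model and its unfolded matrix form can be checked numerically. In this NumPy sketch the scores and weight vectors are random placeholders (they are not the output of a trilinear PLS1 calculation) and all names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
I, J, K, A = 4, 5, 6, 3

T = rng.normal(size=(I, A))                    # scores, one column per component
jP = rng.normal(size=(J, A))
jP /= np.linalg.norm(jP, axis=0)               # each weight vector has sum of squares 1
kP = rng.normal(size=(K, A))
kP /= np.linalg.norm(kP, axis=0)

# Unfolded form: sum over components of t_a . unfolded p_a
X_hat_unf = np.zeros((I, J * K))
for a in range(A):
    P_a = np.outer(jP[:, a], kP[:, a])         # J x K matrix for component a
    p_unf = P_a.reshape(1, J * K)              # unfolded to a row vector of length J.K
    X_hat_unf += T[:, [a]] @ p_unf             # (I x 1) . (1 x J.K)

# Element-wise form: x_hat[i, j, k] = sum_a t[i, a] * jp[j, a] * kp[k, a]
X_hat = np.einsum("ia,ja,ka->ijk", T, jP, kP)

print(np.allclose(X_hat_unf.reshape(I, J, K), X_hat))
```

The constraint that each unfolded loading vector must factor into a J × K rank-one matrix is what distinguishes this model from PLS applied to the unfolded data, where the 30 loadings are unconstrained.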

Figure 5.18 represents this procedure, avoiding tensor multiplication, using conventional matrices and vectors together with unfolding. A key problem with the common implementation of trilinear PLS1 is that, since the scores and loadings of successive components are not orthogonal, the methods for determining residual errors are simply an approximation. Hence the x block residual is not modelled very well, and the error matrices (or tensors) do not have an easily understood physical meaning. It also means that there are no obvious analogies to eigenvalues. This means that it is not easy to determine the size of the components or the modelling power using the x scores and loadings, but, nevertheless, the main aim is to predict the concentration (or c block),
