Thursday, May 29, 2014

It is easier to avoid overfitting the data by using PCR rather than PLS, particularly if there is considerable error in the lab method; a related rule of thumb from the discussion below is SECV <= 1.05*SEC.

From NIR Forum discussion

Posted on Tuesday, February 08, 2011 - 4:42 am:   

Dear all, 

First, I want to say thank you in advance for every answer. This is my first time on this forum. I have read a lot of things here and I think it is really useful. I have a question regarding the prediction of wood properties from NIR spectra.

I have a set of spectra from wood samples (a calibration set and a test set) and I would like to develop the best model for wood properties (e.g. wood density). However, I get a higher error for cross-validation (SECV) than for the calibration set (SEC) and the test set (SEP). Maybe I am making a mistake in the application of cross-validation. I use Unscrambler software. Can anyone tell me how to use the cross-validation option in Unscrambler?


Thank you and best regards,
 
Nebojsa
Posted on Tuesday, February 08, 2011 - 5:44 am:   

Hi,

How large is the difference between SEC and SECV? Before calibrating, you have to know the SEL (the error of the reference method) and the SD of the calibration samples. How many samples do you have?

The gap between SEC and SECV is due to:
- too few samples, and/or
- a parameter that is difficult to predict, and/or
- noise in the reference method.

SECV is always higher than SEC. My rule is to have SECV <= 1.05*SEC with 2 groups of CV. Then I am pretty sure the model is robust.

Pierre
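
To make these quantities concrete, here is a minimal Python sketch (not from the forum thread) of how SEC, SECV and the 1.05 rule could be computed. The arrays y_ref, y_fit and y_cv are hypothetical placeholders for the reference values, the fitted calibration values and the cross-validation predictions, and the degrees-of-freedom conventions shown are only one common choice.

```python
import numpy as np

def sec(y_ref, y_fit, n_factors):
    """Standard error of calibration: RMS of the calibration residuals,
    with degrees of freedom reduced by the number of model factors
    (one common convention; software packages differ)."""
    resid = np.asarray(y_ref) - np.asarray(y_fit)
    return np.sqrt(np.sum(resid**2) / (len(resid) - n_factors - 1))

def secv(y_ref, y_cv):
    """Standard error of cross-validation: RMS of the residuals of the
    cross-validation predictions."""
    resid = np.asarray(y_ref) - np.asarray(y_cv)
    return np.sqrt(np.mean(resid**2))

def passes_pierres_rule(sec_value, secv_value, factor=1.05):
    """Pierre's rule of thumb: the model looks robust if SECV <= 1.05 * SEC."""
    return secv_value <= factor * sec_value
```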
Posted on Tuesday, February 08, 2011 - 7:22 am:   

Dear Pierre,

thank you for your answer.

How large is the difference between SEC and SECV?
- The difference is very large: SECV = 0.043, SEC = 0.020.

How many samples?
- 74 for calibration and 20 for the test set.
- The SD for calibration is 0.049 (mean 0.698) and for the test set 0.047 (mean 0.713).
- For cross-validation I use the random setup, with 10 segments and 7 samples per segment.

Nebojsa
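
For readers who are not working in Unscrambler, a comparable random segmented cross-validation (10 segments of roughly 7 samples out of 74) could be set up in Python with scikit-learn. This is only a sketch: X_cal and y_cal are hypothetical stand-ins for the wood spectra and densities, and the 5-component PLS model is an arbitrary illustrative choice.

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import KFold, cross_val_predict

# Hypothetical calibration data: 74 spectra x 700 wavelengths, 74 densities.
rng = np.random.default_rng(0)
X_cal = rng.normal(size=(74, 700))
y_cal = rng.normal(loc=0.698, scale=0.049, size=74)

# Random cross-validation with 10 segments (~7 samples per segment),
# similar in spirit to Unscrambler's "random" CV setup.
cv = KFold(n_splits=10, shuffle=True, random_state=0)
model = PLSRegression(n_components=5)  # illustrative number of factors

y_cv = cross_val_predict(model, X_cal, y_cal, cv=cv)
secv = np.sqrt(np.mean((y_cal - np.ravel(y_cv)) ** 2))
print(f"SECV = {secv:.3f}")
```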
Posted on Wednesday, February 09, 2011 - 4:35 am:   

Bonjour, 

It means your R²cv is close to zero: the SECV (0.043) is almost as large as the SD of your reference values (0.049), so the model explains very little of the variation. You have a serious problem.

What is your SEL? It is likely too large, and/or your SDy is too low.

Several papers report wood density calibrations. Refer to the literature to compare with your samples (mean, SD and SEL) and the way the samples are scanned.

Pierre
Posted on Wednesday, February 09, 2011 - 6:17 am:   

Dear Pierre, 

thank you. Of course, I will check my samples; the problem is probably in their selection. Is there any reference for the rule SECV <= 1.05*SEC, or for the difference between SEC (SECV) and SEP? Is it more important than R²?

Nebojsa
Posted on Wednesday, February 09, 2011 - 6:52 am:   

Nebojsa, 

As I said, it's my rule. You can calculate an F-test on the ratio SECV/SEP, but the result depends on the number of samples in the calibration and validation sets. SEP is more important than R², but if SEP equals the SD of the reference values (SDy) there is no point in the analysis (except in process control, when predictions of the "standard" product will give the same predicted values over time).

Pierre
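
One common way to run such an F-test (shown here as a sketch, not as the exact procedure Pierre has in mind) is to compare the squared ratio of the two errors with an F distribution, using roughly n - 1 degrees of freedom for each sample set. SciPy is assumed, and the example values below are hypothetical.

```python
from scipy.stats import f

def f_test_errors(se1, n1, se2, n2, alpha=0.05):
    """Two-sided F-test on whether two standard errors (e.g. SECV and SEP)
    differ significantly, with n - 1 degrees of freedom per set."""
    # Put the larger error in the numerator so the ratio is >= 1.
    if se1 >= se2:
        ratio, df_num, df_den = (se1 / se2) ** 2, n1 - 1, n2 - 1
    else:
        ratio, df_num, df_den = (se2 / se1) ** 2, n2 - 1, n1 - 1
    critical = f.ppf(1 - alpha / 2, df_num, df_den)
    return ratio, critical, ratio > critical

# Hypothetical example: SECV = 0.043 from 74 calibration samples,
# SEP = 0.040 from 20 test samples (the SEP value is made up).
print(f_test_errors(0.043, 74, 0.040, 20))
```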
Posted on Friday, February 18, 2011 - 11:00 am:   

Nebojsa, 

It appears that the SEP values and the SECV value you find are fairly comparable, and that the SEC is quite low. That suggests to me that you are using too many factors in your model. How many factors have you selected?

You have not mentioned what calibration method you are using. Since you are using Unscrambler, I assume you are using PLS? Have you tried PCR? In my experience, it is easier to avoid overfitting the data by using PCR, particularly if there is considerable error in the lab method. Have you evaluated the reproducibility of the lab method (the SEL that Pierre mentioned)? The results of your validation are limited by the SEL, because the RMSEP must always be >= SEL.

Another way to see if you have used too many factors (or overfit your data) is to look at the factors themselves. They should look like smooth, spectrum-like curves, with a minimum of high-frequency noise.

Best wishes,
 
Dave
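
As an illustration of Dave's suggestion (a sketch with hypothetical data, using scikit-learn rather than Unscrambler), PCR can be built as PCA followed by ordinary least squares and compared with PLS over a range of component counts, picking the model with the lowest cross-validation error:

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_predict
from sklearn.pipeline import make_pipeline

def rmsecv(model, X, y, cv):
    """Root mean square error of cross-validation for a given model."""
    y_cv = cross_val_predict(model, X, y, cv=cv)
    return np.sqrt(np.mean((y - np.ravel(y_cv)) ** 2))

# Hypothetical data standing in for the wood spectra and densities.
rng = np.random.default_rng(1)
X_cal = rng.normal(size=(74, 700))
y_cal = rng.normal(loc=0.698, scale=0.049, size=74)

cv = KFold(n_splits=10, shuffle=True, random_state=0)
for k in range(1, 11):
    pcr = make_pipeline(PCA(n_components=k), LinearRegression())
    pls = PLSRegression(n_components=k)
    print(f"{k:2d} components: RMSECV(PCR) = {rmsecv(pcr, X_cal, y_cal, cv):.4f}, "
          f"RMSECV(PLS) = {rmsecv(pls, X_cal, y_cal, cv):.4f}")
```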

Standard error of prediction (SEP) should not be greater than 1.3 times the standard error of calibration (SEC)

"As recommended by the instrument/software vendor, generally, standard error of prediction (SEP) should not be greater than 1.3 times the standard error of calibration (SEC) and the bias should not be greater than 0.6 times the SEC (50). High values of SEP or bias indicate that the errors are significantly larger for the new
cross-validation samples and that the calibration data may not include all the necessary variability or be over fit."

(P. 265 in: Stuart L. Cantor, Stephen W. Hoag, Christopher D. Ellison, Mansoor A. Khan, and Robbe C. Lyon (2011). NIR Spectroscopy Applications in the Development of a Compacted Multiparticulate System for Modified Release. AAPS PharmSciTech, Vol. 12, No. 1, March 2011)
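
A small sketch (a hypothetical helper, not from the paper) of how the quoted rules of thumb could be checked in Python. Here y_ref and y_pred stand for the reference and predicted values of the validation samples, and SEP is computed as the bias-corrected standard deviation of the prediction residuals, which is the usual NIR convention.

```python
import numpy as np

def check_validation(y_ref, y_pred, sec_value):
    """Check the quoted rules of thumb: SEP <= 1.3 * SEC and |bias| <= 0.6 * SEC."""
    resid = np.asarray(y_pred) - np.asarray(y_ref)
    bias = resid.mean()
    # SEP as the bias-corrected standard deviation of the residuals.
    sep = np.sqrt(np.sum((resid - bias) ** 2) / (len(resid) - 1))
    return {
        "SEP": sep,
        "bias": bias,
        "SEP <= 1.3*SEC": sep <= 1.3 * sec_value,
        "|bias| <= 0.6*SEC": abs(bias) <= 0.6 * sec_value,
    }
```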

"A large difference indicates that too many latent variables are used in the model and noise is modeled. "

(P.318
In Li et al., (2007) Nondestructive measurement and fingerprint analysis of soluble solid content of tea soft drink based on Vis/NIR spectroscopy, J. of Food Eng, 82, 316-323.)

Standard error of cross-validation (SECV or SEP)

P.318
In Li et al., (2007) Nondestructive measurement and fingerprint analysis of soluble solid content of tea soft drink based on Vis/NIR spectroscopy, J. of Food Eng, 82, 316-323.

A large difference indicates that too many latent variables are used in the model and noise is modeled.

P.318
In Li et al., (2007) Nondestructive measurement and fingerprint analysis of soluble solid content of tea soft drink based on Vis/NIR spectroscopy, J. of Food Eng, 82, 316-323.
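
To see the criterion in action, the sketch below (hypothetical data, scikit-learn assumed; SEC computed without a degrees-of-freedom correction for simplicity) tabulates SEC and SECV as the number of PLS latent variables grows. The point where SECV levels off or rises while SEC keeps shrinking is where noise starts being modelled.

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import KFold, cross_val_predict

# Hypothetical spectra and reference values.
rng = np.random.default_rng(2)
X = rng.normal(size=(74, 700))
y = rng.normal(loc=0.698, scale=0.049, size=74)

cv = KFold(n_splits=10, shuffle=True, random_state=0)
for lv in range(1, 16):
    pls = PLSRegression(n_components=lv).fit(X, y)
    sec = np.sqrt(np.mean((y - pls.predict(X).ravel()) ** 2))
    y_cv = cross_val_predict(pls, X, y, cv=cv)
    secv = np.sqrt(np.mean((y - np.ravel(y_cv)) ** 2))
    print(f"LV = {lv:2d}   SEC = {sec:.4f}   SECV = {secv:.4f}   SECV/SEC = {secv/sec:.2f}")
```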

PLS2 regression gives better results than PLS1 regression only if the Y variables are strongly correlated

"When several dependent data are available for calibration, two approaches can be used in PLS regression: either properties are calibrated for one at a time (PLS1), or properties are calibrated at once (PLS2). In PLS1 model, the Y response consists of a single variable. When there is more than one Y response a separated model must be constructed for each Y response. In PLS2 model, responses are multivariate. PLS1 and PLS2 models provide different prediction set and PLS2 regression give better results than PLS1 regression only if Y variables are strongly correlated."

(Page 134 in: O. Galtier, O. Abbas, Y. Le Dréau, C. Rebufa, J. Kister, J. Artaud, N. Dupuy 2011. Comparison of PLS1-DA, PLS2-DA and SIMCA for classification by origin of crude petroleum oils by MIR and virgin olive oils by NIR for different spectral regions. Vibrational Spectroscopy 55 (2011) 132–140)
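
In scikit-learn terms (a sketch with hypothetical data; the paper itself does not describe this code), PLS1 amounts to fitting one PLSRegression model per response column, while PLS2 fits all response columns together. Comparing the two only makes sense if the columns of Y are correlated, which can be checked first:

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression

# Hypothetical spectra and a block of three response properties.
rng = np.random.default_rng(3)
X = rng.normal(size=(60, 200))
Y = rng.normal(size=(60, 3))

# How correlated are the Y variables? PLS2 is claimed to help only
# when these correlations are strong.
print(np.corrcoef(Y, rowvar=False))

# PLS1: one model per response column.
pls1_models = [PLSRegression(n_components=5).fit(X, Y[:, j]) for j in range(Y.shape[1])]
Y_pred_pls1 = np.column_stack([m.predict(X).ravel() for m in pls1_models])

# PLS2: a single model for the whole response matrix.
pls2 = PLSRegression(n_components=5).fit(X, Y)
Y_pred_pls2 = pls2.predict(X)
```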