Thursday, May 29, 2014

It is easier to avoid overfitting the data by using PCR over PLS, particularly if there is considerable error in the lab method (SECV <= 1.05*SEC)

From NIR Forum discussion

Posted on Tuesday, February 08, 2011 - 4:42 am:   

Dear all, 

First, I want to say thank you in advance for every answer. This is my first time on this forum.

I have read a lot of things here and I think that it is really useful. I have a question regarding the prediction of wood properties with NIR spectra.
 

I have a set of spectra from wood samples (calibration and test sets), and I would like to develop the best model for wood properties (e.g. wood density). However, I get a higher error for cross-validation (SECV) than for the calibration set (SEC) and the test set (SEP). Maybe I am making a mistake in applying cross-validation. I use the Unscrambler software. Can anyone tell me how to use the cross-validation option in Unscrambler?
 

Thank you and best regards,
 
Nebojsa
Posted on Tuesday, February 08, 2011 - 5:44 am:   

Hi,

How large is the difference between SEC and SECV?
 
Before calibrating, you have to know the SEL (the error of the reference method) and the SD of the calibration samples.
 
How many samples?
 
The gap between SEC and SECV is due to:
 
- Too few samples and/or
 
- Parameter difficult to predict and/or
 
- Noise in the reference method.
 
SECV is always higher than SEC. My rule is to have SECV<=1.05*SEC with 2 groups of CV. Then I am pretty sure the model is robust.
 

Pierre
Posted on Tuesday, February 08, 2011 - 7:22 am:   

Dear Pierre, 

Thank you for your answer.
 

How large is the difference between SEC and SECV?
 
- The difference is very large: SECV = 0.043, SEC = 0.020.
 
How many samples?
 
- 74 for calibration and 20 for test set.
 
- SD for the calibration set is 0.049 (mean 0.698) and for the test set is 0.047 (mean 0.713).

- For cross-validation I use the random setup, with 10 segments and 7 samples per segment.
 

Nebojsa
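
Pierre's personal rule (SECV <= 1.05*SEC) can be checked against the values Nebojsa reports with a few lines of Python. This is only a minimal sketch: the function names and the `max_ratio` parameter are made up for illustration, and the 1.05 threshold is Pierre's own rule of thumb, not a published standard.

```python
def secv_to_sec_ratio(sec, secv):
    """Ratio of the cross-validation error to the calibration error."""
    return secv / sec


def passes_secv_rule(sec, secv, max_ratio=1.05):
    """True when SECV <= max_ratio * SEC (Pierre's rule of thumb)."""
    return secv <= max_ratio * sec


# Values reported in the thread for the wood density calibration.
sec, secv = 0.020, 0.043

ratio = secv_to_sec_ratio(sec, secv)
print(f"SECV/SEC = {ratio:.2f}")                      # SECV/SEC = 2.15
print("rule satisfied:", passes_secv_rule(sec, secv))  # rule satisfied: False
```

With SECV more than twice SEC, the model is far outside the 5% margin the rule allows.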
Posted on Wednesday, February 09, 2011 - 4:35 am:   

Bonjour, 

It means R2CV is approximately 0.0. You have a serious problem.
 
What is your SEL? It is likely too large, and/or SDy is too low.
 
Several papers mention wood density calibrations. Do refer to the literature to compare with your samples (mean, SD and SEL) and the way the samples are scanned.
 

Pierre
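
The link between a standard error and R2 that Pierre relies on can be sketched with the common approximation R2 = 1 - (SE/SDy)^2. This is an assumption made here for illustration: it ignores degrees-of-freedom corrections, and the function name is hypothetical. The inputs are the SEC, SECV and calibration SD reported earlier in the thread.

```python
def approx_r2(standard_error, sd_y):
    """Approximate R2 as 1 - (SE / SDy)^2, ignoring degrees-of-freedom corrections."""
    return 1.0 - (standard_error / sd_y) ** 2


sd_y = 0.049  # SD of the calibration reference values (from the thread)

r2_cal = approx_r2(0.020, sd_y)  # from SEC
r2_cv = approx_r2(0.043, sd_y)   # from SECV

print(f"R2 (calibration)      ~ {r2_cal:.2f}")  # ~ 0.83
print(f"R2 (cross-validation) ~ {r2_cv:.2f}")   # ~ 0.23
```

Under this approximation the cross-validated R2 drops to roughly 0.23 against roughly 0.83 in calibration: because SECV is close to SDy, the cross-validated model explains little of the variation in the samples.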
Posted on Wednesday, February 09, 2011 - 6:17 am:   

Dear Pierre, 

Thank you. Of course, I will check my samples; there is probably a problem in how they were chosen. Is there any reference for the rule SECV<=1.05*SEC, or for the difference between SEC (SECV) and SEP? Is it more important than R2?
 

Nebojsa
Posted on Wednesday, February 09, 2011 - 6:52 am:   

Nebojsa, 

As I said, it's my rule. You can calculate an F-test on the ratio SECV/SEP, but the result depends on the number of samples in the calibration and validation sets. SEP is more important than R2, but with SEP = SDy there is no need for analyses (except in process control, when predictions of the "standard" product will give the same predicted values over time).
 

Pierre
Posted on Friday, February 18, 2011 - 11:00 am:   

Nebojsa, 

It appears that the SEP values and the SECV value you find are fairly comparable, and that the SEC is quite low. That suggests to me that you are using too many factors in your model. How many factors have you selected?
 

You have not mentioned what calibration method you are using. Since you are using Unscrambler, I assume you are using PLS? Have you tried PCR? In my experience, it is easier to avoid overfitting the data by using PCR, particularly if there is considerable error in the lab method. Have you evaluated the reproducibility of the lab method (the SEL that Pierre mentioned)? The results of your validation are limited by the SEL, because the RMSEP must always be >= SEL.
 

Another way to see if you have used too many factors (or over-fit your data) is to observe the factors. They should look like smooth spectra-like curves, with a minimum of high frequency noise.
 

Best wishes,
 
Dave
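
Dave's suggestion of checking the number of factors in a PCR model can be illustrated by tracking SEC and SECV as factors are added. The sketch below is not Unscrambler's procedure and does not use Nebojsa's data: it assumes NumPy, uses synthetic "spectra" invented for the example, computes SEC and SECV as plain root-mean-square errors without degrees-of-freedom corrections, and all function names are hypothetical.

```python
import numpy as np


def pcr_fit_predict(X_train, y_train, X_test, n_factors):
    """Fit principal component regression on the training set and predict X_test."""
    x_mean = X_train.mean(axis=0)
    y_mean = y_train.mean()
    Xc = X_train - x_mean
    # Loadings are the leading right singular vectors of the centered spectra.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    loadings = Vt[:n_factors].T
    scores = Xc @ loadings
    coef, *_ = np.linalg.lstsq(scores, y_train - y_mean, rcond=None)
    return (X_test - x_mean) @ loadings @ coef + y_mean


def sec_and_secv(X, y, n_factors, n_segments=10, seed=0):
    """Return (SEC, SECV) as RMSE values using random-segment cross-validation."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(y))
    cv_pred = np.empty(len(y))
    for segment in np.array_split(order, n_segments):
        train = np.setdiff1d(order, segment)  # all samples outside this segment
        cv_pred[segment] = pcr_fit_predict(X[train], y[train], X[segment], n_factors)
    fit_pred = pcr_fit_predict(X, y, X, n_factors)
    sec = np.sqrt(np.mean((y - fit_pred) ** 2))
    secv = np.sqrt(np.mean((y - cv_pred) ** 2))
    return sec, secv


# Synthetic data (an assumption): 74 samples, 100 wavelengths, 3 smooth components,
# and a noisy "lab" value that depends on the first component only.
rng = np.random.default_rng(1)
n_samples, n_wavelengths = 74, 100
grid = np.linspace(0.0, 1.0, n_wavelengths)
components = np.vstack([np.sin(2 * np.pi * grid), np.cos(2 * np.pi * grid), grid])
latent = rng.normal(size=(n_samples, 3))
X = latent @ components + 0.05 * rng.normal(size=(n_samples, n_wavelengths))
y = 0.70 + 0.04 * latent[:, 0] + 0.02 * rng.normal(size=n_samples)

results = {k: sec_and_secv(X, y, k) for k in range(1, 16)}
for k, (sec, secv) in results.items():
    print(f"{k:2d} factors: SEC={sec:.4f}  SECV={secv:.4f}  SECV/SEC={secv / sec:.2f}")
```

SEC can only decrease as factors are added, so a low SEC on its own says little. What matters is the point where SECV stops improving while SEC keeps falling, since factors added beyond it are mostly fitting noise.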
