Introduction
For Origin 2017, a Principal Component Analysis for Spectroscopy App is available. The App is specifically designed to perform principal component analysis on spectra (IR, Fluorescence, UV-Vis, Raman, etc.). In chemometric analysis, researchers want to know which variables (frequency, wavelength or time) are important in distinguishing samples by their spectra, which samples can be allocated to a group, and which samples are outliers. This App can help answer these questions.
Before using the Principal Component Analysis for Spectroscopy App, spectra data for samples must be arranged in a worksheet, each column representing a sample spectrum. Frequency, wavelength or time for the spectra can be in the X column. Sample names and group info for samples can be set in column headers of samples, e.g. Long Name or Comments in each Y column.
Example
This App provides a built-in sample. Once you install the App, right-click on the Principal Component Analysis for Spectroscopy icon in the Apps Gallery window, and choose Show Samples Folder from the short-cut menu. This opens the folder for the sample project file. Open the project file PCASpecEx.opj in the folder. You will see that it includes a Workbook and a Notes window. Book1, Sheet1 contains the input data, and the Notes window shows the input data’s source.
Input Data
The input data consists of 20 samples taken from the 120 samples in the original source data. The 20 samples consist of 10 olive oil samples, 5 non-olive vegetable oil samples and 5 samples of non-olive vegetable oils mixed with olive oil. The first column in the sheet, A(X), holds time data for the spectra. The other columns, B(Y) to U(Y), are spectra data, and group info for the 20 samples is saved in the Comments of each Y column. When plotted as a line plot, the spectra for the 20 samples look like the following:
Steps
- Open the sample project file PCASpecEx.opj. Click the Principal Component Analysis for Spectroscopy icon in the Apps Gallery window to open the dialog.
- In the dialog’s Input tab, select the (first) X column in Sheet1 as Frequency/Wavelength. Select the other Y columns as Spectra Data. Set Long Name for Spectra Names, and use Comments as Group Info.
- In the Settings tab, choose Covariance Matrix for the Analyze option. If the Correlation Matrix option were chosen instead, each row (one variable across the 20 samples) would be standardized to z-scores rather than only mean-centered.
- In the Plots tab, choose Sample 6 for the Reference Spectrum in Loading with Reference Spectrum Plot, and check the Loading Plot and Score Plot options.
- Click the OK button and a report sheet, a result sheet and a plot data sheet are created.
Results
- In the Report Sheet, the Eigenvalues table shows that the first four principal components explain 96% of the total variance.
- In the Loading with Reference Spectrum Plot (note that you can double-click on the plot to pop up the embedded graph), the first layer shows the sixth (reference) sample’s spectrum; the second layer illustrates loading for the first component; and the third layer for the second component. The graph below shows that times 7.95 and 8.47 are important variables in PC1, while times 3.96 and 5.92 have more influence in PC2. The vertical annotation lines in the graph were added using the Vertical Cursor gadget in Origin (Gadgets: Vertical Cursor).
- The Loading Plot shows the coefficients of each variable (time) in PC1 and PC2. You can use the Data Reader (Tools toolbar) in Origin to find variables with larger coefficients (important times) in PC1 and PC2. Note that the sign of a principal component's loadings is arbitrary: the loadings can be multiplied by -1 without changing the interpretation.
- The Score Plot illustrates scores of 20 samples in PC1 and PC2. The 20 samples were divided into three groups as specified in Group Info. It is clear from the graph that olive oils and non-olive oils can be divided easily in principal component space, while mixed oils intersect with the other two. Confidence ellipses of scores for the three groups of samples are also shown, and some extreme points are labelled. If the number of samples is large, you can turn off labels by unchecking the Enable option in the Plot Details dialog’s Label tab (to open Plot Details, double-click on the pop-up plot or choose Format: Plot Properties).
- This App can create 3D component plots. Click the green lock in the upper-left corner of the graph and choose Change Parameters. In the Plots tab, choose 3 for Number of Components to Plot (you may have to go to Settings and increase Number of Components to Extract to 3 or more). You can also change the Reference Spectrum on the Plots tab to see other samples.
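The "percentage of total variance" reported in the Eigenvalues table can be reproduced by hand from the eigenvalues. The following is a minimal sketch in Python with NumPy, not the App's own code; the function name and the eigenvalues are hypothetical and chosen only for illustration (they are not taken from PCASpecEx.opj).

```python
import numpy as np

def cumulative_explained_variance(eigenvalues):
    """Return the cumulative fraction of total variance, largest eigenvalue first."""
    ev = np.sort(np.asarray(eigenvalues, dtype=float))[::-1]
    return np.cumsum(ev) / ev.sum()

# Hypothetical eigenvalues, for illustration only.
example_eigenvalues = [5.0, 3.0, 1.0, 0.5, 0.3, 0.2]
cumulative = cumulative_explained_variance(example_eigenvalues)

# Smallest number of components whose cumulative share reaches 85%.
n_components = int(np.searchsorted(cumulative, 0.85) + 1)
```

A threshold like the 85% used here is a judgment call; the same calculation applies to whatever cutoff you choose when deciding how many components to extract.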
Conclusion
Principal component analysis is an effective method for finding important frequency or wavelength regions in a group of samples, and it helps classify samples in the principal component space. It can also be used to determine the number of compounds from the spectra of mixtures, and it can be combined with the partial least squares method to solve quantitative problems. The results can also be used for further classification, e.g. hierarchical cluster analysis and discriminant analysis.
Is it possible to share the algorithm or a step-by-step procedure for how the App works? In particular, I need to understand all the preprocessing steps so that I can report my analysis.
1. Normalize each row of the original spectra (along samples). If you analyze the correlation matrix, each row is normalized to z-scores. If you analyze the covariance matrix, each row is centered by subtracting its mean (x - mean).
2. For each column in the normalized data, multiply it element-wise with each column of the loadings matrix and sum the products. The resulting sums form the scores vector for that column.
3. Repeat step 2 for all columns in the normalized data to get the scores for all samples.
4. The variance of the scores for each principal component equals its eigenvalue. If you want to Standardize Scores (unit variance), divide the scores of each principal component by the square root of its eigenvalue.
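The four steps above can be sketched in Python with NumPy. This is an illustrative reconstruction under stated assumptions, not the App's source code: the function name `pca_spectra` and its parameters are hypothetical, the data are assumed to be laid out as in the worksheet (rows are variables, columns are samples), and the sample (n - 1) normalization is an assumption.

```python
import numpy as np

def pca_spectra(spectra, use_correlation=False, n_components=2):
    """PCA for spectra stored as rows = variables, columns = samples."""
    x = np.asarray(spectra, dtype=float)
    n_samples = x.shape[1]

    # Step 1: normalize each row (one variable across all samples).
    normalized = x - x.mean(axis=1, keepdims=True)
    if use_correlation:
        normalized = normalized / x.std(axis=1, ddof=1, keepdims=True)

    # Covariance (or correlation) matrix between variables.
    matrix = normalized @ normalized.T / (n_samples - 1)

    # eigh returns eigenvalues in ascending order, so pick the largest ones.
    eigenvalues, eigenvectors = np.linalg.eigh(matrix)
    order = np.argsort(eigenvalues)[::-1][:n_components]
    eigenvalues = eigenvalues[order]
    loadings = eigenvectors[:, order]      # (n_variables, n_components)

    # Steps 2-3: multiply every sample column with every loading column and sum.
    scores = normalized.T @ loadings       # (n_samples, n_components)

    # Step 4 (optional): divide by sqrt(eigenvalue) for unit-variance scores.
    standardized_scores = scores / np.sqrt(eigenvalues)
    return eigenvalues, loadings, scores, standardized_scores
```

With this layout, the sample variance of each column of `scores` equals the corresponding eigenvalue, and each column of `standardized_scores` has variance 1, which matches step 4.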
Thank you greatly!
Hi, I have two questions and I would be grateful for your answers.
1. Before the PCA, are the absorbance values mean-centered and standardized?
2. Does ticking Standardize Scores imply that every unit on the score plot is one standard deviation of the principal component?
Hi Sam,
Absorbance values are always mean-centered in the App. If you choose Analyze: Correlation Matrix in the Settings tab, they are also standardized; otherwise they are not.
If you tick Standardize Scores, the variance of the scores for each principal component will be 1. Otherwise, the variance will be the eigenvalue of that component.
Thanks.
Sam Fang
Hi Sam. When the scores are calculated, in what form are the original spectra when they are multiplied by the loadings?
What processing is performed on the data before projection onto the PCs?
Hi,
Could you please tell me how to denoise a signal using PCA?
I mean, how do I reconstruct the signal using only the most important PCs?
Thanks
The PCA App in Origin doesn't support PCA denoising. Please share your data with us using the link below, and we can look into it and provide a solution.
https://www.originlab.com/restricted/support/index.aspx?c=3
Originlab
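For readers who want to try the idea outside Origin, reconstruction from only the leading PCs can be sketched in Python with NumPy. This is not a feature of the App and not an OriginLab-provided solution; the function name `pca_denoise` is hypothetical, and the data are assumed to be arranged with rows as variables and columns as samples.

```python
import numpy as np

def pca_denoise(spectra, n_components):
    """Rebuild spectra (rows = variables, columns = samples) from the top PCs."""
    x = np.asarray(spectra, dtype=float)
    row_means = x.mean(axis=1, keepdims=True)
    centered = x - row_means

    # SVD of the centered data; the columns of u play the role of loadings.
    u, s, vt = np.linalg.svd(centered, full_matrices=False)

    # Keep only the k largest components, then add the row means back.
    k = n_components
    return (u[:, :k] * s[:k]) @ vt[:k, :] + row_means
```

Choosing `n_components` is the key decision: too few components remove real signal, while too many leave the noise in place.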
Hi, could you please tell me how I can cluster different gamma-ray spectra using the PCA App in Origin, and plot the result after clustering? Or how to show the fractional abundance of each of them?
Hi,
Sorry for the late reply. Can you email tech@originlab.com and provide us with more information, perhaps including some sample data? Please also include your Origin version and the last 7 digits of your serial number. You can find these by going to the Help menu and choosing About Origin.
Hello,
Do you know how to calculate the derivative and Savitzky-Golay for all columns at once, not one by one?
Hello, do you mean calculating derivatives using the Savitzky-Golay method for all columns?
Suppose all your data are in a worksheet, with each column set as X or Y. Select a Y column and choose Analysis: Mathematics: Differentiate from the Origin menu. Then click the green lock icon on the output column and select Repeat this for All Y Columns from the menu. This calculates derivatives for all Y columns.
Thanks.
Sam
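If the same operation is needed outside Origin, a Savitzky-Golay first derivative can be applied to every Y column at once. The following is a minimal NumPy sketch, not Origin's implementation: both function names are hypothetical, the X spacing is assumed to be uniform (`delta`), and the edge rows where the window does not fit are simply dropped.

```python
import math
import numpy as np

def savgol_derivative_weights(window, polyorder, deriv=1, delta=1.0):
    """Least-squares Savitzky-Golay weights evaluated at the window center."""
    half = window // 2
    offsets = np.arange(-half, half + 1, dtype=float)
    # Design matrix: column j holds offsets**j.
    design = np.vander(offsets, polyorder + 1, increasing=True)
    # Row `deriv` of the pseudo-inverse yields the fitted coefficient of offsets**deriv.
    weights = np.linalg.pinv(design)[deriv]
    return weights * math.factorial(deriv) / delta ** deriv

def differentiate_all_columns(y, window=5, polyorder=2, delta=1.0):
    """First derivative of every column of y; output has window - 1 fewer rows."""
    y = np.asarray(y, dtype=float)
    w = savgol_derivative_weights(window, polyorder, 1, delta)
    # np.convolve flips its second argument, so reverse the weights first.
    columns = [np.convolve(col, w[::-1], mode="valid") for col in y.T]
    return np.column_stack(columns)
```

Row i of the result corresponds to the original row i + window // 2, so the matching X values are the original X column with window // 2 rows trimmed from each end.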