### Introduction

For Origin 2017, a Principal Component Analysis for Spectroscopy App is available. The App is specifically designed to perform principal component analysis on spectra (IR, Fluorescence, UV-Vis, Raman, etc.). In chemometric analysis, researchers want to know which variables (frequency, wavelength or time) are important in distinguishing samples by their spectra, which samples can be allocated to a group, and which samples are outliers. This App can help answer these questions.

Before using the **Principal Component Analysis for Spectroscopy** App, the spectral data for the samples must be arranged in a worksheet, with each column representing one sample spectrum. Frequency, wavelength or time for the spectra can be in the X column. Sample names and group info can be set in the column headers of the samples, e.g. Long Name or Comments in each Y column.

### Example

This App provides a built-in sample. Once you install the App, right-click on the **Principal Component Analysis for Spectroscopy** icon in the **Apps Gallery** window, and choose **Show Samples Folder** from the shortcut menu. This opens the folder for the sample project file. Open the project file *PCASpecEx.opj* in the folder. You will see that it includes a Workbook and a Notes window. Book1, Sheet1 contains the input data, and the Notes window shows the input data’s source.

#### Input Data

The input data consists of 20 samples taken from the 120 samples in the original source data. The 20 samples consist of 10 olive oil samples, 5 non-olive vegetable oil samples and 5 samples of non-olive vegetable oils mixed with olive oil. The first column in the sheet (A(X)) holds the time data for the spectra. The other columns (B(Y) – U(Y)) hold the spectra, and the group info for the 20 samples is saved in Comments for each Y column. When plotted as a line plot, the spectra for the 20 samples look like the following:

#### Steps

- Open the sample project file *PCASpecEx.opj*.
- Click the **Principal Component Analysis for Spectroscopy** icon in the **Apps Gallery** window to open the dialog.
- In the dialog’s **Input** tab, select the (first) X column in Sheet1 as **Frequency/Wavelength**. Select the other Y columns as **Spectra Data**. Set **Long Name** for **Spectra Names**, and use **Comments** as **Group Info**.
- In the **Settings** tab, choose **Covariance Matrix** for the **Analyze** option. If the **Correlation Matrix** option is chosen instead, each row (one variable across the 20 samples) is standardized before the analysis.
- In the **Plots** tab, choose Sample 6 for the **Reference Spectrum** in **Loading with Reference Spectrum Plot**, and check the **Loading Plot** and **Score Plot** options.
- Click the **OK** button. A report sheet, a result sheet and a plot data sheet are created.

#### Results

- Looking at the Report Sheet, the **Eigenvalues** table shows that the first four principal components explain 96% of the total variance.
- In the **Loading with Reference Spectrum Plot** (note that you can double-click on the plot to pop up the embedded graph), the first layer shows the sixth (reference) sample’s spectrum, the second layer shows the loading for the first component, and the third layer shows the loading for the second component. The graph below shows that times 7.95 and 8.47 are important variables in PC1, while times 3.96 and 5.92 have more influence in PC2. The vertical annotation lines in the graph were added using the **Vertical Cursor** gadget in Origin (**Gadgets: Vertical Cursor**).
- The **Loading Plot** shows the coefficients of each variable (time) in PC1 and PC2. You can use the **Data Reader** (**Tools** toolbar) in Origin to find the variables with larger coefficients (important times) in PC1 and PC2. Note that the sign of a principal component’s loading doesn’t matter; the loading can be multiplied by -1.
- The **Score Plot** illustrates the scores of the 20 samples in PC1 and PC2. The 20 samples are divided into three groups as specified in **Group Info**. It is clear from the graph that olive oils and non-olive oils can be separated easily in principal component space, while mixed oils overlap with the other two groups. Confidence ellipses of the scores for the three groups are also shown, and some extreme points are labelled. If the number of samples is large, you can turn off labels by unchecking the **Enable** option in the **Plot Details** dialog’s **Label** tab (to open Plot Details, double-click on the pop-up plot or choose **Format: Plot Properties**).
- This App can create 3D component plots. Click the green lock in the upper-left corner of the graph and choose **Change Parameters**. In the **Plots** tab, choose 3 for **Number of Components to Plot** (you may have to go to **Settings** and increase **Number of Components to Extract** to 3 or more). You can also change the **Reference Spectrum** on the **Plots** tab to see other samples.
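
The cumulative percentage read from an eigenvalue table can be reproduced by hand. The following is a minimal Python sketch, not the App's own code; the function name and the eigenvalues are made up for illustration and are not taken from the sample project.

```python
import numpy as np

def cumulative_explained_variance(eigenvalues):
    """Return the cumulative fraction of total variance for each component."""
    ev = np.asarray(eigenvalues, dtype=float)
    return np.cumsum(ev) / ev.sum()

# Hypothetical eigenvalues, sorted from largest to smallest (illustration only).
eigenvalues = [5.0, 3.0, 1.0, 0.6, 0.4]
cumulative = cumulative_explained_variance(eigenvalues)

# With these made-up values, the first four components cover about 96%.
print(np.round(cumulative, 2))
```

A common use of this curve is to pick the smallest number of components whose cumulative fraction passes a chosen threshold.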

### Conclusion

Principal component analysis is an effective method for finding important frequency or wavelength regions in a group of samples, and for classifying samples in the principal component space. It can also be used to determine the number of compounds from the spectra of mixtures, and it can be combined with the partial least squares method to solve quantitative problems. The result can also be used for further classification, e.g. hierarchical cluster analysis and discriminant analysis.

Is it possible to share the algorithm or a step-by-step procedure for how the App works? I particularly need to understand all the preprocessing steps to help in reporting my analysis.

1. Normalize each row of the original spectra (across samples). If you analyze the correlation matrix, each row is converted to z-scores. If you analyze the covariance matrix, each row is centered by subtracting its mean.

2. For each column of the normalized data, multiply it element by element with each column of the loadings matrix and sum the products. The sums form the scores vector for that column (sample).

3. Repeat step 2 for all columns of the normalized data to get the scores for all samples.

4. The variance of the scores for each principal component equals its eigenvalue. If you want to standardize the scores (unit variance), divide the scores of each principal component by the square root of its eigenvalue.
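
The four steps above can be sketched in NumPy as follows. This is only an illustration under stated assumptions, not the App's actual implementation: the function name `pca_scores` is hypothetical, the loadings are taken from an eigendecomposition of the covariance matrix of the normalized rows, and the sample standard deviation (n - 1 in the denominator) is assumed for the z-scores.

```python
import numpy as np

def pca_scores(spectra, use_correlation=False, n_components=2):
    """spectra has shape (n_variables, n_samples): one column per sample."""
    X = np.asarray(spectra, dtype=float)
    # Step 1: center each row across samples; z-score it for correlation mode.
    Z = X - X.mean(axis=1, keepdims=True)
    if use_correlation:
        Z = Z / X.std(axis=1, ddof=1, keepdims=True)
    n_samples = X.shape[1]
    # Assumed route to the loadings: eigenvectors of the covariance matrix.
    cov = Z @ Z.T / (n_samples - 1)
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1][:n_components]  # largest first
    eigenvalues = eigvals[order]
    loadings = eigvecs[:, order]          # (n_variables, n_components)
    # Steps 2 and 3: project every sample column onto every loading column.
    scores = Z.T @ loadings               # (n_samples, n_components)
    return eigenvalues, loadings, scores

# Random demo data standing in for real spectra (50 variables, 20 samples).
rng = np.random.default_rng(0)
demo = rng.normal(size=(50, 20))
eigenvalues, loadings, scores = pca_scores(demo)

# Step 4: dividing by sqrt(eigenvalue) gives scores with unit variance.
standardized = scores / np.sqrt(eigenvalues)
```

In this sketch the variance of each score column (computed with n - 1) matches the corresponding eigenvalue, which is the relationship step 4 describes.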

Thank you greatly!

Hi, I have two questions and I would be grateful for your answers.

1. Before the PCA, are the absorbance values mean centred and standardized?

2. Ticking the standardize scores, does it imply every unit on the score plot is a unit of standard deviation of the principal component?

Hi Sam,

Absorbance values are always mean centered in the app. If you choose **Analyze: Correlation Matrix** in the **Settings** tab, they will also be standardized; otherwise they will not.

If you tick Standardize Scores, the variance of the scores for each principal component will be 1. Otherwise, the variance will be the eigenvalue of that component.

Thanks.

Sam Fang

Hi Sam. When the scores are calculated, in what form is the original spectra when it is multiplied by the loadings?

What processing is performed on the data before projection onto the PCs?

Hi,

could you please tell me how to Denoise a signal using PCA?

I mean how to reconstruct the signal using only the most important PCs.

Thanks

The PCA app in Origin doesn’t support PCA denoising. Please share your data with us using the link below and we can look into it and provide a solution.

https://www.originlab.com/restricted/support/index.aspx?c=3

Originlab
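
For readers who want to try the idea outside Origin, here is a minimal NumPy sketch of rebuilding spectra from only the leading components. It is not a feature of the App; the function name `pca_denoise`, the synthetic data and the choice of one component are all assumptions for illustration.

```python
import numpy as np

def pca_denoise(spectra, n_components):
    """Rebuild spectra (n_variables, n_samples) from the leading components."""
    X = np.asarray(spectra, dtype=float)
    mean = X.mean(axis=1, keepdims=True)      # row means across samples
    U, s, Vt = np.linalg.svd(X - mean, full_matrices=False)
    k = n_components
    # Keep the first k singular triplets and add the mean back.
    return mean + (U[:, :k] * s[:k]) @ Vt[:k]

# Synthetic example: one Gaussian peak with varying amplitude plus noise.
rng = np.random.default_rng(1)
x = np.linspace(0.0, 1.0, 200)
peak = np.exp(-((x - 0.5) / 0.05) ** 2)
clean = np.outer(peak, np.linspace(1.0, 2.0, 20))   # 200 variables, 20 samples
noisy = clean + 0.05 * rng.normal(size=clean.shape)

denoised = pca_denoise(noisy, n_components=1)
```

How many components to keep is a judgment call; the eigenvalue table or a scree plot is one common guide.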

Hi, could you please tell me how I can do clustering between different gamma-ray spectra using the PCA app in Origin, and plot the result after clustering or show the fractional abundance of each of them?

Hi,

Sorry for the late reply. Can you email tech@originlab.com and provide us with more information, perhaps some sample data? Please don’t forget to include your Origin version and the last 7 digits of your serial number. You can find them by going to the Help menu and choosing About Origin.

Hello,

Do you know how to calculate derivatives with Savitzky-Golay for all columns at once, not one by one?

Hello, do you mean to calculate derivatives using Savitzky-Golay method for all columns?

Suppose all your data is in a worksheet. Set each column as X or Y. Choose a Y column and select **Analysis: Mathematics: Differentiate** from the Origin menu. Then click on the green lock icon for the output column and select **Repeat this for All Y Columns** from the menu. It will calculate derivatives for all Y columns.

Thanks.

Sam
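
Outside Origin, the same batch operation can be sketched with SciPy's `savgol_filter`, which accepts a 2D array and an `axis` argument. This is an illustration only: the helper name, the window length and the polynomial order are assumptions, and the X values are assumed to be evenly spaced.

```python
import numpy as np
from scipy.signal import savgol_filter

def savgol_derivative_all_columns(x, spectra, window_length=11, polyorder=2):
    """First derivative of every column of spectra (rows follow x)."""
    x = np.asarray(x, dtype=float)
    Y = np.asarray(spectra, dtype=float)
    delta = x[1] - x[0]   # assumes evenly spaced x values
    # axis=0 filters down each column, so all spectra are handled in one call.
    return savgol_filter(Y, window_length, polyorder,
                         deriv=1, delta=delta, axis=0)

# Two simple test columns whose derivatives are known: 2x and 3.
x = np.linspace(0.0, 10.0, 101)
spectra = np.column_stack([x ** 2, 3.0 * x])
derivs = savgol_derivative_all_columns(x, spectra)
```

A wider window smooths more but blurs narrow peaks, so the window length should be chosen relative to the peak widths in the data.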