Origin has an abundance of machine-learning tools for users to perform regression and classification. You may check this page for details. In this blog, we are going to show an example of using the Neural Network Regression app for prediction. The data we are using is the health insurance payout data from Kaggle. We will show step by step how to examine the data, train the model, and perform the prediction.
Examine the data
1. Open the attached project in Origin and go to the Data folder with the training data. Each row of the worksheet represents a person with his/her personal characteristics like age, sex, etc. The last column is the expenditure on insurance for this person. By studying the relationship between insurance charges and personal information, we hope to build a prediction model to estimate the insurance expense for a potential customer.
2. Detailed information about each feature is stored in column comments. Select menu View: Column List View (hotkey Ctrl + W) to view it. Press Ctrl + W again to go back to the normal view.
Note: Col(F) through Col(I) hold the person's location, which has already been one-hot encoded.
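As a minimal sketch of what one-hot encoding does to a location column, the snippet below turns a region name into four 0/1 values. The region names are an assumption based on the public Kaggle dataset, and the order in which they map to Col(F) through Col(I) is hypothetical; check the column comments in the project for the actual layout.

```python
# Region names and their order are assumptions, not read from the project file.
REGIONS = ["northeast", "northwest", "southeast", "southwest"]


def one_hot(region, categories=REGIONS):
    """Return a list with 1 at the position of `region` and 0 elsewhere."""
    if region not in categories:
        raise ValueError(f"unknown region: {region!r}")
    return [1 if region == c else 0 for c in categories]


# Made-up rows for illustration only.
rows = ["southwest", "northeast", "southeast"]
encoded = [one_hot(r) for r in rows]

for name, vec in zip(rows, encoded):
    print(name, vec)
```

Each encoded row contains exactly one 1, so the four columns together carry the same information as the original text column.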
3. First, we examine the correlations between features. Go to this page to download the Correlation Plot app. Drag and drop the .opx file into Origin to install the app. Then click the Correlation Plot app icon in the app panel. Select the features in the left panel and choose the settings as shown below. The correlation plot shows a strong linear correlation between smoker and charges (0.79). The correlations of age and bmi with charges are relatively high, while the correlations of sex and the number of children with charges are weak.
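For readers who want to see what a linear correlation coefficient measures, here is a small sketch of the Pearson coefficient in plain Python. We assume this is the kind of coefficient the plot reports; the numbers below are made up for illustration and are not taken from the insurance data.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Made-up sample: 0 = non-smoker, 1 = smoker.
smoker = [0, 0, 1, 0, 1, 1, 0, 0]
charges = [2100.0, 3400.0, 19800.0, 4100.0, 23500.0, 17200.0, 6200.0, 2900.0]

r = pearson(smoker, charges)
print(f"correlation between smoker and charges: {r:.2f}")
```

A value near 1 or -1 indicates a strong linear relationship, and a value near 0 indicates a weak one.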
4. Next, we make some statistics plots to study the data further. Select menu Plot: Graph Maker… to launch the dialog. At the top of the dialog, first select a plot type. Let’s select Box Normal under the Box Plot category. Drag the charges column to the Single Y entry and drag the sex column to the Horizontal Panel entry. Now we get two box charts side by side for comparison. They show that the insurance charge has similar distributions for male (0) and female (1).
5. Drag the smoker column to the Horizontal Panel entry this time. From the graph, we can tell that the insurance charge is higher for smokers than non-smokers. And the charge for smokers has a larger variation than for non-smokers. This tells us that smokers not only pay more for insurance, but the amount also varies a lot with other factors like age, bmi, etc. Let’s find out just how much these factors affect it next.
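The comparison a box chart makes can be sketched numerically: compute the quartiles for each group and compare the medians and the interquartile ranges. The values below are made up, and the quartile method used here (Python's "inclusive" method) is an assumption that may differ from the one Origin uses.

```python
from statistics import quantiles


def box_stats(values):
    """Quartiles and interquartile range, as drawn by the box of a box chart."""
    q1, q2, q3 = quantiles(values, n=4, method="inclusive")
    return {"q1": q1, "median": q2, "q3": q3, "iqr": q3 - q1}


# Made-up charges for illustration only.
groups = {
    "non-smokers": [2100, 2900, 3400, 4100, 6200],
    "smokers": [17200, 19800, 23500, 31000, 38900],
}

stats = {name: box_stats(vals) for name, vals in groups.items()}
for name, s in stats.items():
    print(name, s)
```

A higher median corresponds to a box that sits higher on the chart, and a larger interquartile range corresponds to a taller box.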
6. Change the plot type to scatter. Drag the age column to the X entry and the charges column to the Y entry. Then drag Col(K) to the Symbol Color entry. On the graph, the charge for both smokers and non-smokers increases with age. Other patterns are visible on the graph as well.
Note: Col(K) is equivalent to Col(E) but replaces 0 by Non-Smokers and 1 by Smokers.
7. Drag the bmi column to the X entry to check the relation between charges and bmi. The graph shows a positive correlation for smokers but a much weaker one for non-smokers. We can conclude that for a smoker with a high bmi (body mass index), the insurance charge can be extremely high.
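One way to put a number on "charges rise with bmi for smokers but hardly for non-smokers" is to fit a straight line to each group separately and compare the slopes. This is only a sketch with made-up values; it is not something the Graph Maker does for us.

```python
def slope(xs, ys):
    """Least-squares slope of y against x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


# Made-up (bmi, charges) values per group, for illustration only.
bmi = [22, 26, 30, 34, 38]
charges_by_group = {
    "Smokers": [15000, 19000, 26000, 38000, 45000],
    "Non-Smokers": [3000, 4200, 3500, 5000, 4100],
}

slopes = {name: slope(bmi, ys) for name, ys in charges_by_group.items()}
for name, s in slopes.items():
    print(f"{name}: about {s:.0f} per bmi unit")
```

A much larger slope in one group than in the other suggests that the effect of bmi depends on smoking status, which is the kind of interaction a non-linear model can pick up.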
Train the prediction model
8. With the Graph Maker tool, we can study the data further and extract more insight from it. After that, we can move on to the next step and train a regression model on the data. In this example, we use the Neural Network Regression app, which is a good choice for a non-linear regression model. You first need to go to the app center to search for the app and install it (see the app page for more details on using the app). The installation will download the necessary Python packages automatically. Once it has finished, launch the app from the app panel. Under the Input Data tab, set Col(A) to Col(I) as the Independent Data and set Col(J) as the Dependent Data.
9. Under the Options tab, make the following settings. Here we use one hidden layer of 10 neurons in the model. Set the Learning Rate to 0.01 and set the Activation Function to ReLU. Click OK to start the fitting.
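To give a feel for what these settings mean, here is a minimal pure-Python sketch of a network with one hidden layer of 10 ReLU neurons, trained by plain gradient descent with a learning rate of 0.01. This is an illustration only: the app's actual solver and internals are not described here, and the data, the function names, and the number of epochs are all made up.

```python
import random


def forward(net, x):
    """Return the hidden ReLU activations and the scalar output for one sample."""
    hidden = [max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
              for row, b in zip(net["w1"], net["b1"])]
    output = sum(w * h for w, h in zip(net["w2"], hidden)) + net["b2"]
    return hidden, output


def train_epoch(net, xs, ys, lr):
    """One full-batch gradient-descent step on mean squared error.

    Returns the loss measured before the weights are updated.
    """
    n, n_hidden, n_in = len(xs), len(net["w1"]), len(xs[0])
    g_w1 = [[0.0] * n_in for _ in range(n_hidden)]
    g_b1 = [0.0] * n_hidden
    g_w2 = [0.0] * n_hidden
    g_b2 = 0.0
    loss = 0.0
    for x, y in zip(xs, ys):
        hidden, pred = forward(net, x)
        err = pred - y
        loss += err * err / n
        d_out = 2.0 * err / n
        g_b2 += d_out
        for j, h in enumerate(hidden):
            g_w2[j] += d_out * h
            if h > 0.0:  # ReLU passes gradient only where the unit is active
                d_hid = d_out * net["w2"][j]
                g_b1[j] += d_hid
                for i, xi in enumerate(x):
                    g_w1[j][i] += d_hid * xi
    for j in range(n_hidden):
        net["w2"][j] -= lr * g_w2[j]
        net["b1"][j] -= lr * g_b1[j]
        for i in range(n_in):
            net["w1"][j][i] -= lr * g_w1[j][i]
    net["b2"] -= lr * g_b2
    return loss


rng = random.Random(0)
n_in, n_hidden = 2, 10  # 10 hidden neurons, as in the Options tab
net = {
    "w1": [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_hidden)],
    "b1": [0.0] * n_hidden,
    "w2": [rng.uniform(-0.5, 0.5) for _ in range(n_hidden)],
    "b2": 0.0,
}

# Made-up non-linear target: y depends on x0 and on the product x0 * x1.
xs = [[rng.random(), rng.random()] for _ in range(40)]
ys = [x[0] + 2.0 * x[0] * x[1] for x in xs]

losses = [train_epoch(net, xs, ys, lr=0.01) for _ in range(200)]
print(f"loss: {losses[0]:.4f} -> {losses[-1]:.4f}")
```

The learning rate controls how far the weights move on each step, and the ReLU activation is what lets the model bend away from a straight line.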
10. Once the fit succeeds, a result workbook with four sheets is generated. The fitting result can be found in the NNR sheet. It shows that the fit converged after 28249 iterations and reached an R-square value of 0.86. Other information, such as the weights for each neuron and the fitted data, can be found in the other worksheets.
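As a reminder of what an R-square value summarizes, the sketch below computes it from observed and fitted values using the standard definition, 1 minus the ratio of the residual sum of squares to the total sum of squares. We assume the app reports the same quantity; the numbers are made up.

```python
def r_squared(observed, fitted):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - f) ** 2 for o, f in zip(observed, fitted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


# Made-up observed and fitted charges, for illustration only.
observed = [3000.0, 5000.0, 12000.0, 20000.0]
fitted = [3500.0, 4500.0, 13000.0, 19000.0]

r2 = r_squared(observed, fitted)
print(f"R-square: {r2:.3f}")
```

A value of 1 means the fitted values match the observations exactly, and a value of 0 means the model does no better than predicting the mean.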
11. With the fitted data, we can make the 3D graph below to display the original data points on top of the nonlinear fitted surface. We will not go into the details of making the graph. Please open the Fitted Response Surface folder of the project to inspect it.
Note: To visualize the data and the result in three dimensions (age, bmi, charges), we fix sex to male and location to southwest in the worksheet. To view the same graph for females and other locations, please change the column filter settings on the relevant worksheets.
12. Next, we can use the trained model for prediction. Activate the report sheet (NNR) under the Data folder and click the Predict button. In the dialog that opens, set Col(A) to Col(I) in the Prediction workbook as X to Predict and set Col(J) as Predicted Y. Click OK.
13. Check the predicted values.
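Conceptually, prediction with a trained network is just a forward pass: each new row is multiplied by the stored weights, passed through ReLU, and combined into one output. The sketch below shows this with two hidden neurons and hand-picked weights; the weights, their layout, and the function name are hypothetical and do not reflect how the app stores its results.

```python
def predict(x, w1, b1, w2, b2):
    """Forward pass through one hidden ReLU layer and a linear output."""
    hidden = [max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
              for row, b in zip(w1, b1)]
    return sum(w * h for w, h in zip(w2, hidden)) + b2


# Hand-picked weights for a tiny network with 2 inputs and 2 hidden neurons.
w1 = [[1.0, -1.0], [0.5, 0.5]]
b1 = [0.0, -1.0]
w2 = [2.0, 3.0]
b2 = 1.0

new_rows = [[3.0, 1.0], [0.0, 1.0]]
predictions = [predict(row, w1, b1, w2, b2) for row in new_rows]
print(predictions)
```

For the second row both hidden neurons are switched off by ReLU, so the prediction falls back to the output bias alone.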
In this blog, we have shown an example of supervised learning in which a neural network regression model is trained on the source data and then used for prediction. We introduced the Correlation Plot app and the Graph Maker tool to help examine the data. With 10 neurons in the model and the ReLU activation function, we got a fairly high R-square value of 0.86 on the training data. For more machine learning tools supported in Origin, please refer to this page.