One of the most popular datasets in Machine Learning is the Boston Housing dataset (source: https://archive.ics.uci.edu/ml/machine-learning-databases/housing/). This dataset is used in practically every Machine Learning lecture or book. So why don't we take a look at how we can predict housing prices within Oracle Data Visualization (DV) as well?
Exploratory Data Analysis
- CRIM: per capita crime rate by town
- ZN: proportion of residential land zoned for lots over 25,000 sq.ft.
- INDUS: proportion of non-retail business acres per town
- CHAS: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
- NOX: nitric oxides concentration (parts per 10 million)
- RM: average number of rooms per dwelling
- AGE: proportion of owner-occupied units built prior to 1940
- DIS: weighted distances to five Boston employment centres
- RAD: index of accessibility to radial highways
- TAX: full-value property-tax rate per $10,000
- PT: pupil-teacher ratio by town
- B: 1000 (Bk - 0.63)^2 where Bk is the proportion of blacks by town
- LSTAT: % lower status of the population
- MV: Median value of owner-occupied homes in $1000's
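To get a feel for the attributes listed above outside of Oracle DV, the raw file can be read into named columns. The following is a minimal sketch, assuming the file is whitespace-separated with one record per line and no header row; the helper names `parse_housing` and `summarize` are hypothetical, and the two sample rows are only illustrative values in that layout.

```python
# Column names as listed above (assumed: the raw file has no header row).
COLUMNS = ["CRIM", "ZN", "INDUS", "CHAS", "NOX", "RM", "AGE",
           "DIS", "RAD", "TAX", "PT", "B", "LSTAT", "MV"]


def parse_housing(text):
    """Parse whitespace-separated rows into a list of dicts keyed by column name."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if len(fields) != len(COLUMNS):
            raise ValueError(f"expected {len(COLUMNS)} fields, got {len(fields)}")
        rows.append(dict(zip(COLUMNS, map(float, fields))))
    return rows


def summarize(rows, column):
    """Return (min, mean, max) for one numeric column."""
    values = [row[column] for row in rows]
    return min(values), sum(values) / len(values), max(values)


# Two sample rows in the assumed layout (illustrative values only).
SAMPLE = """
 0.00632  18.00   2.310  0  0.5380  6.5750  65.20  4.0900   1  296.0  15.30 396.90   4.98  24.00
 0.02731   0.00   7.070  0  0.4690  6.4210  78.90  4.9671   2  242.0  17.80 396.90   9.14  21.60
"""

rows = parse_housing(SAMPLE)
print(summarize(rows, "MV"))
```

The same summary can be run for any other column, for example `summarize(rows, "RM")`, which is a quick way to check value ranges before building a model.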
Data Preparation using Data Flows
Building a new Machine Learning model
The second step is the creation of the Machine Learning model. Housing price prediction is an example of supervised learning on a continuous variable, hence it is a regression problem. In Data Flows, we can use the Train Numeric Prediction step to train the model.
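To illustrate what a regression on a continuous target means, here is a minimal sketch of ordinary least squares with a single predictor in plain Python. It is not how the Train Numeric Prediction step is implemented in Oracle DV; the function names and the toy room-count data are hypothetical and chosen only so the arithmetic is easy to follow.

```python
def fit_simple_linear_regression(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x, and of cross deviations of x and y.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if sxx == 0:
        raise ValueError("all x values are identical; slope is undefined")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(model, x):
    """Apply a fitted (slope, intercept) pair to a new x value."""
    slope, intercept = model
    return slope * x + intercept


# Hypothetical toy data: average rooms vs. median value in $1000's.
rooms = [4.0, 5.0, 6.0, 7.0, 8.0]
values = [12.0, 17.0, 22.0, 27.0, 32.0]

model = fit_simple_linear_regression(rooms, values)
print(model, predict(model, 6.5))
```

A real model for this dataset would use several of the attributes at once, but the principle is the same: the training step estimates coefficients from known prices, and the fitted model is then applied to new rows.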
Machine Learning Model Evaluation
You can now repeat these last few steps using a different algorithm or different parameter settings and compare the resulting models with one another.
For example, you can compare Random Forest regression results with the results of the Linear Regression model which we created before. This can be done directly in Oracle DV.
We can see that Random Forest is slightly better.
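A comparison like this usually comes down to a few error metrics computed on the same set of actual values. The sketch below assumes two common regression metrics, RMSE and R²; the text does not state which metrics Oracle DV reports. The prediction lists are hypothetical numbers and not results from the models above.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error: lower is better."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(actual))


def r_squared(actual, predicted):
    """Coefficient of determination: closer to 1 is better."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


# Hypothetical actual values and predictions from two models.
actual = [20.0, 25.0, 30.0, 35.0]
predictions_a = [22.0, 23.0, 32.0, 33.0]
predictions_b = [21.0, 24.0, 31.0, 34.0]

for name, predicted in [("model A", predictions_a), ("model B", predictions_b)]:
    print(name, rmse(actual, predicted), r_squared(actual, predicted))
```

In this toy example, model B has the lower RMSE and the higher R², so it would be the one judged slightly better.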
Machine Learning Model Deployment
Option 1: Creating a new predicted data set
While building the Data Flow, you can already observe what the result will look like. There are three new columns created:
- Predicted Value,
- Prediction Confidence Percentage and
- Prediction Group or Segment, which explains the decision rule applied for the Predicted Value.
- Oracle DV gives us the option to set up a new Sequence, which runs all three Data Flows one after another.
- We can use the Scheduler to schedule a job that executes this Sequence daily at a specific time.