The Oracle Data Lakehouse series highlights examples of using elements of an Oracle Data Lakehouse and presents the tools available as part of Oracle Cloud Infrastructure.
The first blog in the series, Using OCI Object Storage based Raw Data in Oracle Analytics, describes how to use Object Storage as a “file cabinet” for retail transaction CSV files. These files are then “registered” with Oracle Autonomous Database and used in Oracle Analytics analyses. Any new files added to that “file cabinet” are immediately included in any analysis.
Of course, there are several ways to perform the following exercise. In this second example, I take a similar approach to loading and using the data, but with two additions: first, we use Oracle Analytics Data Flows to prepare the data for the machine learning exercise, and second, we use the Oracle Machine Learning AutoML feature to build a prediction model that is then registered with and used in Oracle Analytics.
This time, we are using Kaggle’s Telecom Churn Case Study Hackathon dataset.
In this step of the process, data is made accessible from Oracle Database and transformations are applied by using Oracle Analytics Data Flows.
Creating an external table
The procedure for creating an external table is the same as described in my previous blog. The only difference is that this time we need to create two external tables: one for training data (churntrain.csv) and one for new data (churntest.csv). In the previous example, we referred to all CSV files in a folder.
The script to create an external table could look like this (the CSV file has 100+ columns, which must be listed in the right order):
BEGIN
  DBMS_CLOUD.CREATE_EXTERNAL_TABLE(
    table_name    => 'CHURN_TRAIN',
    file_uri_list => 'https://objectstorage.eu-frankfurt-1.oraclecloud.com/n/<bucket_namespace>/b/telecom-churn-data/o/churntrain.csv',
    format        => json_object('type' value 'csv', 'skipheaders' value '1', 'delimiter' value ',', 'rejectlimit' value '100'),
    column_list   => 'ID NUMBER,
                      ...'  -- remaining 100+ columns omitted here
  );
END;
/
The file_uri_list parameter points to the file location in a bucket. A careful reader might spot that there is no credential_name parameter in the CREATE_EXTERNAL_TABLE call. It is not required because we are using a Pre-Authenticated Request (PAR).
Using OAC Data Flows for data preprocessing and preparation
Data Flows are a very useful and convenient self-service data preparation and augmentation tool that is part of Oracle Analytics. We could have used other tools for the same task, but the idea here is to use Data Flows to perform the data preparation and to store the results in the database at the end.
As mentioned, the same steps could easily be achieved using SQL, Python, or any other ETL tool.
Working with datasets in Oracle Analytics is user-friendly, and most end users can learn it quickly.
It all starts with a new dataset based on a database table or even an SQL query. In its simplest form, it is just a drag-and-drop exercise.
In our example, we will use the Convert to Date function, which converts text dates into a proper date format. Most of the transformations could be performed here; however, I found it a bit easier to use a Data Flow. And since there are two datasets requiring exactly the same transformations, I decided to do most of them in Data Flows.
The Data Flow consists of several steps, which are quite intuitive and enable the user to perform the necessary changes. For example, several columns contain only a single value and can be dropped, while date columns hold anywhere from one to many distinct values. That is why additional columns, such as Day of Week and Day of Month, are created for all date columns. Finally, a new, clean dataset is stored.
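As noted earlier, the same preparation could also be done in SQL or Python. Purely as an illustration, here is a minimal pure-Python sketch of the two Data Flow steps just described: dropping single-valued columns and deriving Day of Week and Day of Month from a date column. The column names and date format are made up for the example, not taken from the actual Kaggle dataset.

```python
import csv
import io
from datetime import datetime

def prepare_rows(csv_text, date_columns):
    """Sketch of the Data Flow steps: drop columns holding only one
    distinct value and derive Day of Week / Day of Month columns."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    # Columns that contain only a single distinct value carry no signal.
    constant = {c for c in rows[0] if len({r[c] for r in rows}) == 1}
    cleaned = []
    for r in rows:
        out = {c: v for c, v in r.items() if c not in constant}
        for col in date_columns:
            d = datetime.strptime(r[col], "%m/%d/%Y")
            out[col + "_DAY_OF_WEEK"] = d.strftime("%A")
            out[col + "_DAY_OF_MONTH"] = d.day
        cleaned.append(out)
    return cleaned

# Illustrative rows: CIRCLE is constant, so it gets dropped.
sample = "ID,CIRCLE,LAST_RECHARGE_DATE\n1,A,06/30/2014\n2,A,06/29/2014\n"
prepared = prepare_rows(sample, ["LAST_RECHARGE_DATE"])
```

In the real exercise, Data Flows does all of this visually; the point is only that each step maps to a simple, repeatable transformation.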
To apply the same Data Flow to the new dataset (the one used for prediction), only the source database table has to be changed in the first step. All other steps remain unchanged from the original Data Flow that prepares the data for model training.
So, after both Data Flows, or a sequence (we could create a sequence that executes both Data Flows one after another), have completed, we end up with two new database tables: CHURN_TRAIN_CLEAN and CHURN_TEST_CLEAN.
We will use these two tables in the next steps. If the data had been clean from the beginning, we could have performed all of the following steps directly on the original CSV files stored in Object Storage.
Predict & Analyze
In this step of the process, I apply AutoML to train models and predict churn; the best resulting model is then registered with and used in Oracle Analytics.
Using AutoML UI
Oracle defines the AutoML User Interface (AutoML UI) as “an Oracle Machine Learning interface that provides users no-code automated machine learning. Business users without an extensive data science background can use AutoML UI to create and deploy machine learning models.”
We will use AutoML UI to train several machine learning models and identify the one most appropriate for predicting churn on the new data.
As the definition says, using AutoML UI is codeless. Users provide some basic information about their data and adjust a few settings to start a process that automatically and autonomously performs the following activities:
- algorithm selection,
- adaptive sampling,
- feature selection and
- hyperparameter tuning.
All of these steps are usually done by a developer; here, the whole process is automated.
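Under the hood, these activities amount to a search over candidate algorithms and their settings, keeping whichever maximizes the chosen metric. The following deliberately toy, pure-Python sketch (invented data and invented “algorithms”, nothing Oracle-specific) illustrates the shape of that loop:

```python
# Toy sketch of the search AutoML automates: try candidate algorithms,
# tune their hyperparameters, keep the model with the best metric.

def majority(train_y):
    """Baseline "algorithm": always predict the most common label."""
    label = max(set(train_y), key=train_y.count)
    return lambda x: label

def threshold_model(feature, cutoff):
    """Candidate "algorithm" with two hyperparameters: predict churn (1)
    when the given feature exceeds the cutoff."""
    return lambda x: 1 if x[feature] > cutoff else 0

def accuracy(model, X, y):
    return sum(model(x) == t for x, t in zip(X, y)) / len(y)

# Invented mini-dataset: label depends only on the first feature.
X = [(0.1, 5), (0.2, 3), (0.8, 4), (0.9, 1)]
y = [0, 0, 1, 1]

candidates = [("majority", majority(y))]
# "Hyperparameter tuning": a grid over feature index and cutoff.
for feature in (0, 1):
    for cutoff in (0.5, 2, 4):
        candidates.append((f"thr f{feature}>{cutoff}",
                           threshold_model(feature, cutoff)))

# "Algorithm selection": keep the candidate with the best metric.
best_name, best_model = max(candidates, key=lambda c: accuracy(c[1], X, y))
```

A real AutoML run additionally samples the data adaptively and prunes features before tuning, but the select-and-score loop is the same idea at scale.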
Each AutoML process is called an experiment. The user is required to set some basic parameters, like name and comments.
The data source, a database table, is selected from the list of available database tables. Once Predict (target column) and Case ID are selected, Prediction Type is automatically detected.
Additional Settings come pre-populated, but they can be adjusted as needed. Users can run and optimize model training against one of the standard model metrics, such as Accuracy, Precision, Recall, or ROC AUC.
The last decision before starting is how AutoML will be run. There are two options to choose from: Faster Results or Better Accuracy. As the names suggest, Faster Results is expected to return a “best model” sooner, whereas Better Accuracy searches longer for the maximum metric value.
One remark: this is obviously linked to the duration parameter set in Additional Settings. If it is not set adequately, one might even get better results with Faster Results selected.
AutoML then runs on its own. A Running pop-up window is displayed, and progress can be tracked alongside a graphical presentation of the current best model metric.
After the process is completed, algorithms are sorted by the value of the observed metric. More details are available, such as which columns impact the prediction the most:
or you can observe the confusion matrix:
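For readers less familiar with this view: a confusion matrix simply cross-tabulates actual versus predicted labels. A minimal pure-Python sketch with made-up labels (not output from the actual churn model):

```python
from collections import Counter

# Invented actual vs. predicted churn labels (1 = churn, 0 = no churn).
actual    = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 0, 1, 1, 0]

# Count each (actual, predicted) pair to fill the four matrix cells.
counts = Counter(zip(actual, predicted))
tp, fn = counts[(1, 1)], counts[(1, 0)]
fp, tn = counts[(0, 1)], counts[(0, 0)]

print("            pred churn  pred no-churn")
print(f"churn       {tp:10d}  {fn:13d}")
print(f"no churn    {fp:10d}  {tn:13d}")

# Accuracy is the share of predictions on the matrix diagonal.
acc = (tp + tn) / len(actual)
```

The same counts also feed Precision (tp / (tp + fp)) and Recall (tp / (tp + fn)), the metrics offered in Additional Settings.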
Registering and using the best ML model with Oracle Analytics
In the last section of this blog post, let’s focus on how to use the model created with AutoML UI. One place to use it is Oracle Analytics, so let’s take a look at how this can be done.
First, the model we just created needs to be registered with Oracle Analytics. As you can see, all information related to a model is accessible before it is registered.
When the model is successfully registered, it can be used in Data Flows. The Apply Model step applies a machine learning model to the selected dataset; in this case, the new, clean data that requires a churn prediction.
The output of this step is two new columns: Prediction and Prediction Probability.
In the end, the Data Flow stores a new dataset with predictions in the database or in dataset storage. The “prediction” dataset can now be used in a workbook and visualized.
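To make the shape of the Apply Model output concrete, here is a hedged pure-Python sketch. The scoring function and column values are invented stand-ins for the registered OML model, but the two added columns mirror the Prediction and Prediction Probability output just described.

```python
def score(row):
    """Stand-in for the registered model: returns a predicted label
    and its probability. The rule and threshold here are invented."""
    p = 0.9 if row["DAYS_SINCE_LAST_RECHARGE"] > 30 else 0.2
    return ("churn" if p >= 0.5 else "no churn", p)

# Invented rows standing in for the clean scoring table.
new_data = [{"ID": 1, "DAYS_SINCE_LAST_RECHARGE": 45},
            {"ID": 2, "DAYS_SINCE_LAST_RECHARGE": 5}]

# Apply Model: each input row gains two columns alongside its originals.
scored = [{**row, "Prediction": pred, "Prediction Probability": prob}
          for row in new_data
          for pred, prob in [score(row)]]
```

In the real Data Flow no code is written; the Apply Model step performs this enrichment and the result is persisted for use in workbooks.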
This blog post discusses one possible approach to using predictive analytics in an Oracle Data Lakehouse alongside Oracle Machine Learning and Oracle Analytics. It is definitely not the only alternative, and the selected dataset is perhaps not the best example. However, I think the discussion above highlights the possibilities the Oracle Data Lakehouse platform offers.
In the next blog in this series, we will take a look at the other side of the Data Lakehouse and explore how we could capture and process Streaming data.