
Data preparation in our example essentially means bringing all four training datasets together (and the new-data datasets too). The final result of this exercise should be a table with 5,298 rows, one row per customer. The data should be ready for training the model, which means performing a series of data transformations before we eventually start building the model.

In my original post from 2+ years ago, I simply brought the four files together and did some minor transformations (for example, an Onehot transformation for the Services file), but other transformations were left out. So I am revisiting this process with the goal of automating the whole preparation cycle, including model training and deployment. As the title of this post suggests, we will look into data preparation first, and we will try to make the most of the Data Flow functionality.
 
 
We analysed the data source files in the previous post, Telecom Churn Prediction Case Revisited: Data Analysis.
 
The starting point for data preparation is one step back: we start with the original files as they were uploaded, without any transformations.
 
 
We can see that there are 4 datasets for training (Original Train) and 4 datasets of new data (Original Test). Based on the findings from the previous post, we will perform the required transformations. To do that, I will create a series of data flows, which will be pulled together into one single execution sequence at the end. And not to forget: the same transformations must be performed on both datasets, train and new data.
 

Dataflow: Customers 

This dataflow is the simplest one: it just takes the original dataset and copies it to, let's call it, a working or temporary dataset.
As you can see, there are no transformations at all. The point is to keep the original files in their original form, ready to be used for another use case.
 

Dataflow: Account Info

Preparing the Account Info part of the dataset is a bit more complex than the first part. It involves several transformations, including date transformations, Min-Max normalisations and Onehot transformations.
 
The data flow has two parallel branches:
 
 
The first branch includes transformations of the original columns:
 
 
  • Conversion of TotalCharges to Numeric.
  • Transformation of DOE into separate Year and Month features, followed by Min-Max normalisation of Year and an Onehot transformation of Month. As a result, one new column called Year and 11 Month_xxx columns are created. 
  • Onehot transformations for ElectronicBilling, ContractType and PaymentMethod are also performed.
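Outside the tool, the first-branch steps could be sketched in pandas roughly as below. The sample values and the DOE date format are hypothetical; only the column names TotalCharges, DOE and ContractType come from the source files.

```python
import pandas as pd

# Hypothetical sample mirroring a few AccountInfo columns
df = pd.DataFrame({
    "TotalCharges": ["108.15", "151.65"],          # stored as text in the source
    "DOE": ["2014-03-15", "2016-11-02"],           # assumed ISO date format
    "ContractType": ["Month-to-month", "Two year"],
})

# Conversion of TotalCharges to numeric
df["TotalCharges"] = pd.to_numeric(df["TotalCharges"], errors="coerce")

# Transformation of DOE into separate Year and Month features
doe = pd.to_datetime(df["DOE"])
df["Year"] = doe.dt.year
df["Month"] = doe.dt.month

# Min-Max normalisation of Year
df["Year"] = (df["Year"] - df["Year"].min()) / (df["Year"].max() - df["Year"].min())

# Onehot transformations for Month and ContractType (one indicator column per value)
df = pd.get_dummies(df, columns=["Month", "ContractType"],
                    prefix=["Month", "ContractType"])
```

With all 12 months present in the real data, the Month step would yield the Month_xxx columns mentioned above.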
The parallel branch also starts by importing the TCP Original Train AccountInfo DS dataset. 
 
 
The same transformation on TotalCharges is made as in the first branch. Then, all columns except BaseCharges and TotalCharges are removed, because we need the MIN and MAX values of these two measures. To achieve that, an additional column is added:
This new column acts like a total level in the hierarchy of customers, hence the value Total Customers.
 
Now we can apply Aggregate step:
 
You can also observe that only one row, containing the MIN and MAX values, is calculated.
 
The next step is joining the two branches using the new All Customers attribute. 
 
 
Once joined, we can now apply additional Min-Max transformation on BaseCharges and TotalCharges in the joined dataset.
 
We could apply several normalisations or standardisations. In this case I am using Min-Max normalisation, which is defined as x' = (x − min(x)) / (max(x) − min(x)).
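The aggregate-and-join pattern of the second branch can be mimicked in pandas as follows. This is a sketch with made-up sample values, not the tool's internals; the constant All Customers column plays the same "total level" role as in the data flow.

```python
import pandas as pd

charges = pd.DataFrame({
    "CustomerID": ["C1", "C2", "C3"],
    "BaseCharges": [20.0, 50.0, 80.0],
    "TotalCharges": [100.0, 400.0, 700.0],
})

# Branch 2: add the constant total-level column and aggregate MIN/MAX into one row
agg = charges.assign(AllCustomers="Total Customers")
agg = agg.groupby("AllCustomers").agg(
    BaseMin=("BaseCharges", "min"), BaseMax=("BaseCharges", "max"),
    TotalMin=("TotalCharges", "min"), TotalMax=("TotalCharges", "max"),
).reset_index()

# Join the branches on the constant key, then apply x' = (x - min) / (max - min)
joined = charges.assign(AllCustomers="Total Customers").merge(agg, on="AllCustomers")
joined["BaseCharges"] = (joined["BaseCharges"] - joined["BaseMin"]) / (joined["BaseMax"] - joined["BaseMin"])
joined["TotalCharges"] = (joined["TotalCharges"] - joined["TotalMin"]) / (joined["TotalMax"] - joined["TotalMin"])

# As the last step, the helper columns are no longer needed
joined = joined.drop(columns=["AllCustomers", "BaseMin", "BaseMax",
                              "TotalMin", "TotalMax"])
```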
 
As the last step, all columns that are no longer needed are removed.
 

Dataflow: Demographics

The Demographics dataset is quite simple to prepare:
 
 
  • For two columns we need to subtract 1 to get values 0 and 1. 
  • Onehot transformation is done for two columns.
  • Two columns are removed (Country, State).
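These three steps could be sketched like this. The 1/2-coded column names (HasPartner, HasDependents) and the Gender column are assumptions for illustration; only Country and State are named in the source.

```python
import pandas as pd

demo = pd.DataFrame({
    "CustomerID": ["C1", "C2"],
    "HasPartner": [1, 2],        # hypothetical column coded 1/2
    "HasDependents": [2, 1],     # hypothetical column coded 1/2
    "Gender": ["Male", "Female"],
    "Country": ["US", "US"],
    "State": ["CA", "NY"],
})

# Subtract 1 from the 1/2-coded columns to get 0 and 1
for col in ["HasPartner", "HasDependents"]:
    demo[col] = demo[col] - 1

# Onehot transformation for the categorical column(s)
demo = pd.get_dummies(demo, columns=["Gender"])

# Remove the two columns that carry no information here
demo = demo.drop(columns=["Country", "State"])
```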

Dataflow: Service

The last of the original datasets is Services. We have found that this dataset contains 47,682 rows, which is 9 times more than in the other three datasets. Further investigation has shown that there are 9 different types of services available: each customer has 9 rows, one per service, with the corresponding details. The requirement is to have one row per customer, therefore we need to perform several transformations. 
 
Complete dataflow is as follows:
 
There are 9 branches, each filtered on one of the services. Each branch has the same steps and transformations:
  • import original dataset TCP Original Train Service DS,
  • filter on selected TypeOfService,
  • perform onehot transformation on selected TypeOfService using ServiceDetails,
  • remove original column for TypeOfService and ServiceDetails.
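The per-branch steps, and the final join back to one row per customer, can be sketched as below. The sample shows only two of the nine service types, and the ServiceDetails values are illustrative; the column names TypeOfService and ServiceDetails are from the source.

```python
import pandas as pd

# Hypothetical long-format Services data (9 rows per customer in the real set)
services = pd.DataFrame({
    "CustomerID": ["C1", "C1", "C2", "C2"],
    "TypeOfService": ["InternetServiceCategory", "PhoneService",
                      "InternetServiceCategory", "PhoneService"],
    "ServiceDetails": ["DSL", "Yes", "Fiber optic", "No"],
})

branches = []
for service in services["TypeOfService"].unique():
    # filter on the selected TypeOfService
    branch = services[services["TypeOfService"] == service].copy()
    # Onehot transformation on ServiceDetails, prefixed with the service name
    branch = pd.get_dummies(branch, columns=["ServiceDetails"], prefix=service)
    # remove the original TypeOfService column
    branch = branch.drop(columns=["TypeOfService"])
    branches.append(branch)

# join all branches into one row per customer
wide = branches[0]
for branch in branches[1:]:
    wide = wide.merge(branch, on="CustomerID")
```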

Onehot transformation

We have been mentioning the Onehot transformation, but haven't really explained it until now. So let's take a closer look at it.
 
The main idea of the Onehot transformation is the "binarisation" of categorical values. Let's take a look at one example implemented in our data flow.
 
For TypeOfService = InternetServiceCategory, we can see that customers can have three different values for ServiceDetails:
 
  • Fiber optic,
  • DSL and
  • No.
In order to transform these three values into binary values, 2 digits are enough. 
 
For example, we could define the following transformation:
 
 ServiceDetails  FiberOptic  DSL
 Fiber optic     1           0
 DSL             0           1
 No              0           0
 
There is no need for a 3rd column in which we would define No as 0 0 1; the combination 0 0 already identifies it. And exactly this is what we need to set up in our data flow. We need two additional columns:
  • InternetServiceCategory_FiberOptic
  • InternetServiceCategory_DSL
This way we can define all Onehot transformations we need in our data flows.
 
Once all transformations are done, all branches are joined and the unnecessary columns are removed before the dataset is saved. 
 

Dataflow: Bringing them all together

Up until now, we have been working on the individual parts of the complete dataset. Now it is time to bring it all together. This is a rather simple step: the four datasets are joined using CustomerID, and the redundant "customerid" columns are removed after the joins are done. 
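The final join can be sketched as chained merges on CustomerID. The four frames and their non-key columns below are hypothetical stand-ins for the prepared datasets; with merge, the key column appears only once, so there is nothing redundant left to drop.

```python
import pandas as pd

# Four hypothetical prepared parts, one row per customer in each
customers = pd.DataFrame({"CustomerID": ["C1", "C2"], "Name": ["A", "B"]})
account = pd.DataFrame({"CustomerID": ["C1", "C2"], "TotalCharges": [0.2, 0.9]})
demographics = pd.DataFrame({"CustomerID": ["C1", "C2"], "HasPartner": [0, 1]})
services = pd.DataFrame({"CustomerID": ["C1", "C2"], "PhoneService_Yes": [1, 0]})

# Join the four datasets using CustomerID
final = (customers
         .merge(account, on="CustomerID")
         .merge(demographics, on="CustomerID")
         .merge(services, on="CustomerID"))
```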
 
 
The final dataset is saved. We will use this dataset for the model creation.
 

Setting up sequences and automating the whole thing

We have created several data flows to prepare the data for the machine learning algorithms. It is important that these data flows are executed in the right order; to avoid confusion, sequences can be used.
 
I have created three sequences:
  • a sequence to prepare the training data,
  • a sequence to prepare the new (testing) data, and
  • a sequence which executes the two sequences above together.
Defining sequences is rather straightforward. The data flows in a sequence are executed one after another; there is no option to run them in parallel. 


Finally, the sequences can now be scheduled to run periodically. Every morning, for example.
 

Wrap up and what is next?

Data preparation in Oracle Data Visualization can be done using Data Flows. In general, the whole process can be done within the tool; however, for some transformations you need to find workarounds to achieve what you'd like to do. But, as we have seen, it is possible. 
 
And again, we need to understand that Oracle Data Visualization targets business users, not data scientists. So as long as you understand the process and some "design rules", you will be fine. You will be able to prepare the data in a way that uses as much of the information in your datasets as possible. You might not achieve the same results as an experienced data scientist using Tensorflow and Keras, but you might still achieve satisfactory results. I think there is some value in that.
 
In the process described above, I am also deliberately going the long way. We will see in the next post on creating models that some transformations are actually available within the algorithms. For example, the Onehot transformation can be defined in every algorithm we will use, so strictly speaking you don't need to prepare the data as we did. Nevertheless, this exercise gives you a basic understanding of what those transformations are about. 
 
In the next post, we will look into model training and the application of different machine learning algorithms. We will also take a look at how to prepare a project in which we can compare different models among themselves.


Author

Ziga Vaupot