Share

Let’s start this blog with a definition of Object Storage. Object Storage is an internet-scale, high-performance storage platform that offers reliable and cost-efficient data durability. Object Storage can store an unlimited amount of unstructured data of any content type, including analytic data.

Object Storage can also be used as a cold storage layer for the data warehouse by storing data that is used infrequently and then joining it seamlessly with the most recent data by using hybrid tables in Oracle Autonomous Data Warehouse. More on object storage can be found here.

It seems that Object Storage can be considered as a key element of Oracle Data Lakehouse architecture.

In this blog post, we are looking at setting up an Object Storage bucket with several data files and using those data for analysis in Oracle Analytics. One might argue some better and more reasonable approaches, but we are focusing on particular elements in the first place and we are not looking for the absolute best possible solution.

Setting up Object Storage

Buckets are containers that are used to store objects. These can be any, from text files to images and audio. Besides the content, each object contains metadata about itself. Each object is stored in a bucket.

os001
 

In our example, we create a new bucket called retail_data:

os002

The bucket is created as a Standard storage type and currently, it is still Private:

os003

We are almost good to continue with data upload, however, we need to do a thing or two before we start the load process.

os004

First, let’s edit visibility. By default, the bucket is set to Private. We will not set any particular access rules or networks in this example, therefore, let’s just change the visibility to Public.

os005

Before we start loading the data files, let’s create a folder for these files. Later, we will use a reference to all the files in this folder.

os021

A new folder is called transactions. In order to start loading the data, navigate to the folder first. In our example, we will load data files manually, however, in some other scenarios, these could be done by using REST API or some streaming method, which we will also see in this series.

We can start the load. Our example (Dunnhumby’s Let’s Get Sort-of-Real dataset) is intentionally using a large number of not so small data files (.CSV). There is one file for (almost) each month in 2006 and 2007.

os051

The load runs pretty fast and approx. 100 files are uploaded within a couple of minutes.

os022

Once uploaded, you can review the uploaded files.

os023


The last step in Object Storage setup is optional but recommended. It is to create a Pre-Authenticated Request (PAR) to access the objects in a bucket. As you can see from the picture below, this can be done by a whole bucket, by a specific object in a bucket or group of objects with prefix. In our case, the prefix is the name of the folder we created earlier, transactions/.

os024

Using PAR is convenient as the administrator can set the expiration date, hence having strong control over the access to the objects in the bucket.

Having done that, we have successfully created a bucket, uploaded files and given access to them. Any other transaction data files added later will inherit these settings automatically.

Creating an external table in ADB to access Object Storage files

If we want to access and analyse Object Storage data, there is one way of doing it using Oracle Autonomous Database. 

What we need to do is to set an external database table in Oracle ADB which will read data from files stored in Object Storage. Data can be queried using SQL as any other database table.

In the OCI console navigate to Oracle Autonomous Datawarehouse instance and from there open Database Actions.

os016

In the Development section, open SQL (web-based SQL Developer):

os017

The following statement creates external table:

BEGIN

DBMS_CLOUD.CREATE_EXTERNAL_TABLE(
table_name => 'RETAIL_TRANSACTIONS',
file_uri_list => 
            'https://objectstorage.eu-frankfurt-1.oraclecloud.com/p/.../b/retail_data/o/*.csv',
format => json_object('type' value 'csv', 'skipheaders' value '1', 'delimiter' value ',’),
column_list => ' SHOP_WEEK VARCHAR2 (6) ,
                 SHOP_DATE VARCHAR2 (8) ,
                 SHOP_WEEKDAY NUMBER ,
                 SHOP_HOUR NUMBER ,
                 QUANTITY NUMBER ,
                 SPEND NUMBER ,
                 PROD_CODE VARCHAR2 (10) ,
                 PROD_CODE_10 VARCHAR2 (7) ,
                 PROD_CODE_20 VARCHAR2 (8) ,
                 PROD_CODE_30 VARCHAR2 (6) ,
                 PROD_CODE_40 VARCHAR2 (6) ,
                 CUST_CODE VARCHAR2 (14) ,
                 CUST_PRICE_SENSITIVITY VARCHAR2 (2) ,
                 CUST_LIFESTAGE VARCHAR2 (2) ,
                 BASKET_ID NUMBER ,
                 BASKET_SIZE VARCHAR2 (1) ,
                 BASKET_PRICE_SENSITIVITY VARCHAR2 (2) ,
                 BASKET_TYPE VARCHAR2 (20) ,
                 BASKET_DOMINANT_MISSION VARCHAR2 (20) ,
                 STORE_CODE VARCHAR2 (10) ,
                 STORE_FORMAT VARCHAR2 (2) ,
                 STORE_REGION VARCHAR2 (3) ');
END;
/
  

Let’s take a look a bit closer and review the SQL statement above:

DBMS_CLOUD.CREATE_EXTERNAL_TABLE procedure is used to create an external table. A new table RETAIL_TRANSACTIONS will be created based on the specification in file_uri_list. This list contains the URL link to the files in object storage. URL consists of two parts:

  • https://objectstorage.eu-frankfurt-1.oraclecloud.com/p/.../b/retail_data/o is obtained when PAR is created. This link is pointing to transactions/ folder.
  • /*.csv is added in order to indicate the reference is for all .CSV files in that folder.

The format part defines the format of files stored in Object Storage, which is in this case CSV. It also defines delimiter and how lines skip from the top of the files. 

The last part is column_list, which lists all table columns and their formats.

When created, an external table can be queried like any other database table. Data retrieval might be longer, compering to “normal” database tables, which is due to the location of data and characteristics of Object Storage which is usually slower than the database.

os026

Using Object Storage data for analyses in Oracle Analytics Cloud

Bringing a new (external) database table into Oracle Analytics seems to be relatively straightforward. Unexpectedly, I ran into some issues, which in the end resolved (it seems to me) by themselves. 

As mentioned, the process is as usual and starts with a new dataset creation:

os031

And then continues with data import and profiling (done automatically). But that didn’t go very well. I was able to choose a database table from the database and import started. 

os032

Then, just before the operation finished, I got the following error message:

os033

After a while and after I finally decided to read the message properly, I noticed that the “reject limit” parameter was not set high enough. It is true, I didn’t verify the cause (possible rows with NULL values), but I had a lot of errors in the dataset and consequently, rows were rejected, and once the limit is reached, the general error is thrown and operation was cancelled. 

In order to avoid this kind of issue, consider adding the following parameter ’rejectlimit’ value ‘1000’ to the format  clause CREATE_EXTERNAL_TABLE command. format should then look like this:

format => json_object('type' value 'csv', 'skipheaders' value '1', 'delimiter' value ‘,’,                         'rejectlimit' value '1000')

Then also profiling would work and the dataset could be used in Oracle Analytics. Of course, a detailed investigation of rejected rows is to be performed.

All dataset operations can be also performed without any issues. For example, use recommendation for the SHOP_DATE column.

os036

And of course, once the dataset is prepared, creating a workbook is also without any issues:

os043

Conclusion

This short demo shows how easy is to use other data sources than just standard database tables and views in analyses. Oracle database package DBMS_CLOUD is really opening Oracle database to any other sources. And once the database can connect to external data sources, these can be used in tools like Oracle Analytics as well. At the moment, Oracle Analytics cannot connect to Objects Storage.

This blog is meant to be only the first in the series of blog posts that expand into Oracle Data Lakehouse. In the forthcoming posts, we will look at products such as Oracle Streaming, Data Flows, Data Catalog and others to cover the flow of data from different data sources to Oracle Analytics and Oracle Machine Learning. Stay tuned!

Become a member

Newsletter

Receive our newsletter to stay on top of the latest posts.


Ziga Vaupot

Author

Ziga Vaupot