Let’s start this blog with a definition of Object Storage. Object Storage is an internet-scale, high-performance storage platform that offers reliable and cost-efficient data durability. Object Storage can store an unlimited amount of unstructured data of any content type, including analytic data.
Object Storage can also be used as a cold storage layer for the data warehouse by storing data that is used infrequently and then joining it seamlessly with the most recent data by using hybrid tables in Oracle Autonomous Data Warehouse. More on object storage can be found here.
It seems that Object Storage can be considered as a key element of Oracle Data Lakehouse architecture.
In this blog post, we are looking at setting up an Object Storage bucket with several data files and using those data for analysis in Oracle Analytics. One might argue some better and more reasonable approaches, but we are focusing on particular elements in the first place and we are not looking for the absolute best possible solution.
Setting up Object Storage
In our example, we create a new bucket called retail_data:
The bucket is created as a Standard storage type and currently, it is still Private:
We are almost good to continue with data upload, however, we need to do a thing or two before we start the load process.
First, let’s edit visibility. By default, the bucket is set to Private. We will not set any particular access rules or networks in this example, therefore, let’s just change the visibility to Public.
Before we start loading the data files, let’s create a folder for these files. Later, we will use a reference to all the files in this folder.
A new folder is called transactions. In order to start loading the data, navigate to the folder first. In our example, we will load data files manually, however, in some other scenarios, these could be done by using REST API or some streaming method, which we will also see in this series.
We can start the load. Our example (Dunnhumby’s Let’s Get Sort-of-Real dataset) is intentionally using a large number of not so small data files (.CSV). There is one file for (almost) each month in 2006 and 2007.
The load runs pretty fast and approx. 100 files are uploaded within a couple of minutes.
Once uploaded, you can review the uploaded files.
The last step in Object Storage setup is optional but recommended. It is to create a Pre-Authenticated Request (PAR) to access the objects in a bucket. As you can see from the picture below, this can be done by a whole bucket, by a specific object in a bucket or group of objects with prefix. In our case, the prefix is the name of the folder we created earlier, transactions/.
Using PAR is convenient as the administrator can set the expiration date, hence having strong control over the access to the objects in the bucket.
Having done that, we have successfully created a bucket, uploaded files and given access to them. Any other transaction data files added later will inherit these settings automatically.
Creating an external table in ADB to access Object Storage files
If we want to access and analyse Object Storage data, there is one way of doing it using Oracle Autonomous Database.
What we need to do is to set an external database table in Oracle ADB which will read data from files stored in Object Storage. Data can be queried using SQL as any other database table.
In the OCI console navigate to Oracle Autonomous Datawarehouse instance and from there open Database Actions.
In the Development section, open SQL (web-based SQL Developer):
The following statement creates external table:
Let’s take a look a bit closer and review the SQL statement above:
DBMS_CLOUD.CREATE_EXTERNAL_TABLE procedure is used to create an external table. A new table RETAIL_TRANSACTIONS will be created based on the specification in file_uri_list. This list contains the URL link to the files in object storage. URL consists of two parts:
- https://objectstorage.eu-frankfurt-1.oraclecloud.com/p/.../b/retail_data/o is obtained when PAR is created. This link is pointing to transactions/ folder.
- /*.csv is added in order to indicate the reference is for all .CSV files in that folder.
The format part defines the format of files stored in Object Storage, which is in this case CSV. It also defines delimiter and how lines skip from the top of the files.
The last part is column_list, which lists all table columns and their formats.
When created, an external table can be queried like any other database table. Data retrieval might be longer, compering to “normal” database tables, which is due to the location of data and characteristics of Object Storage which is usually slower than the database.
Using Object Storage data for analyses in Oracle Analytics Cloud
Bringing a new (external) database table into Oracle Analytics seems to be relatively straightforward. Unexpectedly, I ran into some issues, which in the end resolved (it seems to me) by themselves.
As mentioned, the process is as usual and starts with a new dataset creation:
And then continues with data import and profiling (done automatically). But that didn’t go very well. I was able to choose a database table from the database and import started.
Then, just before the operation finished, I got the following error message:
After a while and after I finally decided to read the message properly, I noticed that the “reject limit” parameter was not set high enough. It is true, I didn’t verify the cause (possible rows with NULL values), but I had a lot of errors in the dataset and consequently, rows were rejected, and once the limit is reached, the general error is thrown and operation was cancelled.
In order to avoid this kind of issue, consider adding the following parameter ’rejectlimit’ value ‘1000’ to the format clause CREATE_EXTERNAL_TABLE command. format should then look like this:
format => json_object('type' value 'csv', 'skipheaders' value '1', 'delimiter' value ‘,’, 'rejectlimit' value '1000')
Then also profiling would work and the dataset could be used in Oracle Analytics. Of course, a detailed investigation of rejected rows is to be performed.
All dataset operations can be also performed without any issues. For example, use recommendation for the SHOP_DATE column.
And of course, once the dataset is prepared, creating a workbook is also without any issues:
This short demo shows how easy is to use other data sources than just standard database tables and views in analyses. Oracle database package DBMS_CLOUD is really opening Oracle database to any other sources. And once the database can connect to external data sources, these can be used in tools like Oracle Analytics as well. At the moment, Oracle Analytics cannot connect to Objects Storage.
This blog is meant to be only the first in the series of blog posts that expand into Oracle Data Lakehouse. In the forthcoming posts, we will look at products such as Oracle Streaming, Data Flows, Data Catalog and others to cover the flow of data from different data sources to Oracle Analytics and Oracle Machine Learning. Stay tuned!