Building High Value, Low-Cost, Data Lakes with Oracle Object Storage
Published by: Industrial Automation
Samir Shah elaborates on a solution pattern from the reference architecture that builds a data lake centred on Oracle Object Storage.
Before you build a data lake, evaluate your business needs and the impact the data lake has on your business. Data lake projects are costly in terms of people, time, and money. In the absence of meaningful value drivers and cross-organisational buy-in, it might become a data swamp! What ramifications does it have on the business scorecard? How does it help you deal with competitive pressure? Ask yourself these questions, and then start thinking about how you’ll achieve a better time-to-value ratio, lower cost, and future-proof your investment with flexibility.
Time to Value
Data lakes manage several tasks, such as ingesting the data from many sources, processing and transforming raw data, establishing data governance practices, and making it accessible to different roles, such as data scientists, stewards, engineers, and analysts. Building a data lake can take a significant amount of time without proper planning and architecture.
To accelerate building a data lake, use fully managed services as much as possible, such as Oracle Data Flow, Oracle Data Catalog, and Oracle Autonomous Database. These hands-off services let organisations focus on building their data pipelines without getting distracted by operations. With massive amounts of data in a lake, searching for and discovering data can be a challenge and can consume significant time for end users. A data catalogue makes it easy to find and tag data and to create business glossaries, so data scientists and analysts don’t spend a lot of time hunting for what they’re looking for.
Low Cost
You have many options for where to store the lake’s data, from a Hadoop cluster to cloud storage. Keeping the data in cloud storage, such as Oracle Object Storage, is a low-cost option to consider when you’re storing petabytes of information. Historically, the cost of cloud storage has kept falling. For example, Oracle Object Storage costs around $0.0255 per GB per month, one of the most competitive rates in the market today.
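To put that rate in perspective, a quick back-of-the-envelope calculation shows what a petabyte-scale lake would cost per month. The rate below is the figure quoted above; actual pricing varies by region and over time, so treat this as an illustration only.

```python
# Rough monthly storage cost at the Object Storage rate quoted in this
# article. Actual pricing varies by region and over time; this is an
# illustration, not a quote.

RATE_PER_GB_MONTH = 0.0255  # USD per GB per month, as quoted above

def monthly_storage_cost(size_gb: float) -> float:
    """Return the estimated monthly cost in USD for `size_gb` of storage."""
    return size_gb * RATE_PER_GB_MONTH

# One petabyte, using the decimal convention (1 PB = 1,000,000 GB):
print(f"${monthly_storage_cost(1_000_000):,.2f}/month")  # $25,500.00/month
```

At roughly $25,500 a month for a full petabyte, object storage is an order of magnitude cheaper than keeping the same data in a dedicated database or Hadoop cluster.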
Also, serverless computing technology, such as Data Flow, further minimises your infrastructure investment. Pricing is based on the actual resources an application consumes during execution, rather than on pre-purchased units of capacity.
Flexibility
Usage patterns of a data lake constantly change, and an inflexible architecture makes it difficult to support new business use cases. The architecture needs to be adaptable and flexible: it has to accommodate different data formats, data types, volumes, batch and real-time processing, and data movement and integration requirements. It also needs to support a rich ecosystem of both open source and commercial tools and technologies, as well as SaaS applications.
Building a data lake
You can build a data lake using several approaches and patterns. Oracle recently published Enterprise Data Warehousing – An Integrated Data Lake Example (https://docs.oracle.com/en/solutions/oci-curated-analysis/index.html#GUID-3A382EF2-409C-458D-B623-FBF4C97F84CE). The artefact includes a reference architecture that can help you design a solution pattern and build the data pipeline for your specific use case.
In this blog, we focus on one of the solution patterns from the reference architecture: a data lake centred on Oracle Object Storage. All the data is ingested into Oracle Object Storage and then processed or transformed, catalogued, governed, and accessed. This is one way to build a high-value, low-cost data lake using Oracle Cloud services. The following high-level architecture describes an end-to-end data lake design for a customer who wanted to build a data pipeline that may look familiar to you. For your specific use case and detailed stages, refer to the reference architecture. Overall, we broke the pipeline into five major steps.
Discover and ingest data
This step involves ingesting data from multiple sources, such as Oracle and non-Oracle databases; XML, JSON, Avro, and Parquet files; and multimedia content like images and videos. Depending on the source, type, volume, and frequency of the data (real time or batch), you can select the technologies that best fit your use case.
Oracle GoldenGate is a comprehensive software package for real-time data integration and replication in heterogeneous IT environments. Oracle Data Integrator is a comprehensive data integration platform that covers all data integration requirements, from high-volume, high-performance batch loads to event-driven, trickle-feed integration processes to service-oriented architecture-enabled data services.
The Oracle Cloud Infrastructure Streaming service is a publish-subscribe messaging service that provides fully managed, scalable, and durable storage for ingesting continuous, high-volume streams of data that you can consume and process in real time.
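The Streaming service expects message keys and values to be base64-encoded. The sketch below builds such a payload with only the standard library; the stream name, keys, and field names are hypothetical, and the actual publish call would go through the OCI SDK (`oci.streaming.StreamClient.put_messages`), which needs credentials and a stream OCID that are not shown here.

```python
import base64
import json

# OCI Streaming requires message keys and values to be base64-encoded.
# Building the payload needs only the standard library; the real publish
# call would use oci.streaming.StreamClient.put_messages with credentials
# and a stream OCID (omitted here). The sensor names are made up.

def build_put_messages_payload(records):
    """records: iterable of (key, value) string pairs -> payload dict."""
    return {
        "messages": [
            {
                "key": base64.b64encode(k.encode("utf-8")).decode("ascii"),
                "value": base64.b64encode(v.encode("utf-8")).decode("ascii"),
            }
            for k, v in records
        ]
    }

payload = build_put_messages_payload(
    [("sensor-1", json.dumps({"temp": 21.5}))]
)
print(len(payload["messages"]))  # 1
```

Keeping the encoding step in one helper makes it easy to batch many records into a single `put_messages` call, which is how the service is meant to be used for high-volume ingestion.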
Move data to raw storage
In this step, you consume or move the data captured by GoldenGate, Oracle Data Integrator, the Streaming service, or Kafka into Oracle Object Storage. The Object Storage service is an internet-scale, high-performance storage platform that offers reliable and cost-efficient data durability. Oracle Object Storage enables near-infinite storage capacity for large amounts of data or rich content, such as images and videos.
Unlike most cloud storage providers, Oracle Object Storage also provides strong consistency, which means you always get the most recent copy of data written to the system. Object Storage is a great way to store large amounts of data at a lower cost. Data moves from the Oracle Streaming service to Oracle Object Storage through the HDFS/S3 connector, which you must set up and configure yourself, so some development work is required.
As the data arrives, you might want to catalogue it. The Oracle Data Catalog service is a metadata management service that helps organisations find and govern data using an organised inventory of data assets across the enterprise. With this self-service solution, data consumers can easily find, understand, govern, tag, and track Oracle Cloud data assets. You can use the Data Catalog service to harvest data from Oracle Object Storage, among many other asset types. A data asset represents a data source, such as a database, an object store, a file or document store, a message queue, or an application. Harvesting is a process that extracts technical metadata from your data assets into your data catalogue.
Transform, standardise, or process data
The Oracle Data Flow (serverless) service can process the data from this raw storage in batch. Data Flow is a fully managed service that lets you run Apache Spark applications with no infrastructure to deploy or manage. The Spark application can process the raw data, partially conform it, and write it back to Object Storage in Parquet or another format. For real-time processing, you might also consider Oracle Stream Analytics, which allows you to process and analyse large-scale, real-time information using sophisticated correlation patterns, enrichment, and machine learning. It offers real-time, actionable business insights on streaming data and automates action. Through its interactive designer, users can explore real-time data with live charts, maps, and visualisations, and graphically build streaming pipelines without any hand coding.
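On Data Flow, the conform step would run as a PySpark application at scale. The snippet below is a minimal local stand-in for the same idea, written in plain Python: parse raw JSON lines, skip malformed records, and normalise the fields you keep. The field names (`order_id`, `amount`) are invented for illustration.

```python
import json

# A Data Flow job would do this with PySpark at scale; this is a minimal
# local stand-in for the "partially conform" step: parse raw JSON lines,
# drop malformed records, and normalise the kept fields. The field names
# (order_id, amount) are hypothetical.

def conform(raw_lines):
    """Parse JSON lines, keep well-formed records, normalise field types."""
    rows = []
    for line in raw_lines:
        try:
            rec = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip (or quarantine) malformed input from the raw zone
        if "order_id" not in rec:
            continue  # records without a key field are not conformable
        rows.append({
            "order_id": str(rec["order_id"]),
            "amount": float(rec.get("amount", 0.0)),
        })
    return rows

raw = ['{"order_id": 7, "amount": "19.99"}', 'not json', '{"x": 1}']
print(conform(raw))  # [{'order_id': '7', 'amount': 19.99}]
```

In the real pipeline, the input would be `oci://bucket@namespace/...` paths read by Spark, and the conformed rows would be written back to Object Storage as Parquet rather than returned in memory.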
As in the previous step, you can harvest the processed data with Oracle Data Catalog Service.
Accessing and consuming data: Analyse, Learn, and Predict
Data scientists can start working with the dataset from Object Storage in notebooks, such as Jupyter. Oracle provides the Data Science service to manage the entire life cycle of a model, from building, training, deploying, and sharing to retraining. Oracle also provides the Oracle Accelerated Data Science SDK, client libraries that connect to Object Storage and other Oracle and non-Oracle sources. Accelerated Data Science offers a friendly user interface with objects and methods that cover the steps in the machine learning model lifecycle, from data acquisition to model evaluation and interpretation.
The Oracle Autonomous Database (https://www.oracle.com/database/what-is-autonomous-database.html) can provide a SQL interface on top of the Object Store as a processing layer. Conceptually, it’s similar to using Hive tables with HDFS, but better, because it’s simpler, managed, and uses database features such as security and self-driving capabilities. With Oracle Autonomous Database external tables, you can create a table structure over your Object Storage files and run SQL queries directly on top of them.
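In Autonomous Database, external tables over Object Storage are created with the `DBMS_CLOUD.CREATE_EXTERNAL_TABLE` package. The sketch below builds the PL/SQL as a Python string; the credential name, namespace, bucket, and table name are all hypothetical, and you would execute the statement against the database with a driver such as python-oracledb (the connection itself is omitted).

```python
# A sketch of the DBMS_CLOUD.CREATE_EXTERNAL_TABLE call that exposes
# Object Storage Parquet files as an external table in Autonomous
# Database. The credential, namespace, bucket, and table names are
# hypothetical; execute the PL/SQL with a driver such as python-oracledb.

FILE_URI = ("https://objectstorage.us-ashburn-1.oraclecloud.com"
            "/n/mynamespace/b/datalake/o/sales/*.parquet")

plsql = f"""
BEGIN
  DBMS_CLOUD.CREATE_EXTERNAL_TABLE(
    table_name      => 'SALES_EXT',
    credential_name => 'OBJ_STORE_CRED',
    file_uri_list   => '{FILE_URI}',
    format          => json_object('type' value 'parquet',
                                   'schema' value 'first')
  );
END;"""

print("DBMS_CLOUD.CREATE_EXTERNAL_TABLE" in plsql)  # True
# with oracledb.connect(...) as conn:   # requires ADB credentials
#     conn.cursor().execute(plsql)
```

For Parquet files, the `'schema' value 'first'` format option lets the database derive the column list from the first file, so you don’t have to spell out every column by hand.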
With external partitioned data, you can query multiple data files stored in the Object Store as a single external table, with the files represented as multiple logical partitions. You can also query data from Oracle Autonomous Database tables and multiple data files stored in Object Storage as a single logical table, using a hybrid partitioned table to represent the data as a single object. In addition, you can create materialised views on external tables so that detailed data stays in Oracle Object Storage while summarised data lives in Oracle Autonomous Database. Data and business analysts can use Oracle Analytics Cloud, which empowers them with modern, AI-powered, self-service analytics capabilities for data preparation, visualisation, enterprise reporting, augmented analysis, and natural language processing and generation.
Build performance layer: Curated information layer
You can create the performance layer before accessing and consuming data, or add it later when the need for higher speed and concurrency, real-time response, or complex query processing arises. For demanding workloads, I highly recommend this layer. It involves moving data from Oracle Object Storage to a performance layer in Oracle Autonomous Database, and it can be added at any time if necessary. Using Oracle Data Integrator or the Data Integration service, together with Enterprise Data Quality, move the data from Object Storage into Oracle Autonomous Database.
You can query the data from Oracle Autonomous Database and multiple data files stored in the Object Store as a single logical table, using a hybrid partitioned table to represent the data as a single object. This option allows the hot data to reside in the performance layer while cold data is still accessed from Oracle Object Storage. You might also look into Oracle Cloud SQL, which supports queries against non-relational data stored in multiple big data sources, including Apache Hive, HDFS, Oracle NoSQL Database, Apache Kafka, Apache HBase, and other object stores, such as Oracle Object Storage and S3.
Bringing it all together
Now that we have walked through all five steps, let’s watch the demo in action! This video (https://www.youtube.com/embed/ORtEUZsjE7c?rel=0) shows the Oracle Object Storage, Data Flow, Data Catalog, Autonomous Database, and Data Science services working together. For details, see the sample data and source code used in the video (https://github.com/sssshah/datalakeblog).
Resources
Enterprise Data Warehousing – An Integrated Data Lake Example: https://docs.oracle.com/en/solutions/oci-curated-analysis/index.html#GUID-3A382EF2-409C-458D-B623-FBF4C97F84CE
Hybrid Partitioned Data
External Partitioned Data
Oracle Cloud Infrastructure Products by Category
Using Oracle GoldenGate File Handlers
Data Science Platform
Oracle Stream Analytics
About Cloud SQL
Samir Shah is Account CTO and Enterprise Architect at Oracle Corporation. He advises CxOs and top-level management on aligning IT with business strategy, and has over 15 years of experience spanning multiple verticals, such as financial services, manufacturing, technology, and consumer products.