Following on from our introduction of CF.Cumulus.Ingest and the wider CF.Cumulus product, we’re revealing the next component of the framework: CF.Cumulus.Transform! Here is a logical (Azure Resource-agnostic) view of how Control, Ingest and Transform fit together:
Why use CF.Cumulus.Transform?
CF.Cumulus.Transform gives users the flexibility to develop notebooks for any data transformation activities they require. This includes unmanaged notebooks, whose sole purpose is to execute user-defined logic, and managed notebooks, which are integrated with the metadata-driven process used across all components of Cumulus.
Within Transform, Spark Compute executes notebooks created by users to deliver value-focused data assets that are ready to be integrated with downstream reporting tools, Machine Learning models and applications.
Benefits of CF.Cumulus.Transform
Business User Data
CF.Cumulus.Transform brings your data into a consumer-focused format as it moves through the data curation process.
User-friendly Development Process
Combining business requirements and code can be a challenging task. Transform simplifies this by letting your developers work in the languages they know best, rather than requiring them to upskill in a new one. This saves development time and gets your data into users’ hands more quickly.
One Structure
By reusing Data Pipeline structures and metadata designs defined in CF.Cumulus.Ingest, Transform is a natural extension with a shallow learning curve for organisations and users to adopt the additional functionality.
Key Features
Business logic can be described in a developer’s preferred language
Cumulus.Transform is designed to support users writing business logic in a variety of languages, removing the need to learn a new language and speeding up the development of your data tables.
Currently, Transform supports the following (a brief sketch follows the list):
SQL
PySpark
Scala
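As an illustration, a PySpark transformation cell might look like the sketch below; the table and column names are invented for this example and are not part of the framework itself. The same logic could equally be expressed in SQL or Scala.

```python
# Hypothetical PySpark transformation cell: all table and column names
# are illustrative, not Cumulus internals.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Read two cleansed-layer tables (names assumed for the example).
orders = spark.read.table("cleansed.sales_orders")
customers = spark.read.table("cleansed.customers")

# Business logic: total order value per customer segment.
curated = (
    orders.join(customers, on="CustomerId", how="inner")
          .groupBy("CustomerSegment")
          .agg(F.sum("OrderValue").alias("TotalOrderValue"))
)

# Persist the curated output for downstream consumers.
curated.write.mode("overwrite").saveAsTable("curated.segment_order_totals")
```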
Source Control your curated dataset definitions
Utilising Repos in Databricks, Synapse or Fabric, combined with a read-only path that the compute target executes against, protects your data definitions from ad-hoc changes.
Support for a variety of notebook types
Transform gives you the power to create the data outputs you need. Whether that means simply joining a few tables together or building assets for downstream reporting, users have complete control over what these table definitions look like, all orchestrated through an easy-to-adopt framework. You can (a short fact-table sketch follows the list):
Run user-defined data transformations
Create star schema dimension tables
Create star schema fact tables
Create Feature Store tables for Machine Learning
Clean, format and populate datasets for custom LLM inputs
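As a flavour of the star-schema items above, the hypothetical PySpark sketch below builds a simple fact table by resolving surrogate keys from an existing dimension; every table and column name here is assumed for illustration only.

```python
# Illustrative fact-table build: joins cleansed transactions to a
# dimension to resolve surrogate keys. All names are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

orders = spark.read.table("cleansed.sales_orders")
dim_customer = spark.read.table("curated.dim_customer")

fact_sales = (
    orders.join(
        dim_customer,
        orders["CustomerId"] == dim_customer["CustomerBusinessKey"],
        "left",
    )
    .select(
        dim_customer["CustomerKey"],  # surrogate key from the dimension
        orders["OrderDate"],
        orders["OrderValue"],
    )
)

fact_sales.write.mode("overwrite").saveAsTable("curated.fact_sales")
```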
Key Column validation
Transform ensures that Surrogate Keys are generated for tables created through managed notebooks, relying on business keys from source tables in the Cleansed layer for efficient merge actions.
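A minimal sketch of what such a business-key merge can look like, assuming Delta Lake tables; all names are illustrative rather than the managed-notebook implementation shipped with Cumulus.

```python
# Hedged sketch of an incremental merge keyed on business keys,
# assuming Delta Lake tables; names are illustrative only.
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Latest cleansed-layer rows carrying the business key.
updates = spark.read.table("cleansed.customers")

target = DeltaTable.forName(spark, "curated.dim_customer")

# Matching on the business key keeps existing surrogate keys stable;
# surrogate key assignment for brand-new rows is assumed to happen
# in a prior step so the source carries every target column.
(
    target.alias("t")
    .merge(updates.alias("s"),
           "t.CustomerBusinessKey = s.CustomerBusinessKey")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute()
)
```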
Custom Compute Selection
Whether loading small tables built from a couple of joined datasets or million-row Fact tables driven by complex business logic, users have control over the compute used through the metadata design. Users can point large operations to larger compute clusters in the metadata configuration, ensuring resources are optimally aligned.
This is paired with seamless integration for pointing development workloads to interactive/all-purpose clusters suited for debugging, and production workloads to job clusters that are more cost-effective and suitable for unsupervised operations. This is all achievable through a simple metadata configuration in the underlying database.
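To make the idea concrete, the hypothetical snippet below shows how a metadata record might route a notebook to an interactive cluster in development and a job cluster in production; the structure and field names are our own invention for this example, not the actual Cumulus schema.

```python
# Invented metadata record illustrating environment-based compute
# selection; field names are NOT the Cumulus metadata schema.
notebook_metadata = {
    "NotebookPath": "/Repos/transform/dim_customer",
    "Environment": "prod",
    "ComputeConfig": {
        # Development runs target an interactive cluster for debugging.
        "dev": {"existing_cluster_id": "1234-567890-abcde123"},
        # Production runs target a cheaper, ephemeral job cluster.
        "prod": {"new_cluster": {"spark_version": "15.4.x-scala2.12",
                                 "node_type_id": "Standard_DS3_v2",
                                 "num_workers": 8}},
    },
}

def resolve_compute(meta: dict) -> dict:
    """Pick the cluster definition matching the environment flag."""
    return meta["ComputeConfig"][meta["Environment"]]

print(resolve_compute(notebook_metadata))
```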
Try it for yourself!
Our landing page contains further information for you to explore all components of Cumulus: Cumulus | Cloud Formations
For any questions about the product or how Cloud Formations can help you, please reach out to the team via our booking system: Speak to an expert | Cloud Formations