A technical comparison of a subset of capabilities.
Context
Over the many years I’ve been working in the data/IT industry, Microsoft SQL Server and Azure Databricks have easily become my two favourite data processing tools. When Databricks became a first-class resource in Microsoft Azure, it was a big moment for the evolution of the data platform architectures I’ve designed and built (but architecture isn’t the focus of this blog). That said, rather than considering the tooling and technology as an evolution, I find a lot of people drawing comparisons between the products. This often leads to confusion and friction, as they ultimately offer a lot of different capabilities, with only some common areas where comparisons can be made. Use cases also play a key part in this understanding, but for now, to gain some clarity, I thought it would be a useful starting point to explore where my favourite tools offer similarities to support a potential evolution in platforms. Evolution is the key word here: evolution, not replacement. For additional context, my career started with SQL Server 2000, and I’ve had hands-on experience with every version since. On the Databricks side, I’ve worked with the resource ever since it became available in Azure, including previous experience with Apache Spark, HDInsight, Hadoop and MapReduce (for my sins!).
Scope
To avoid this blog digressing into lots of different areas, let me add some scope to the comparison. SQL Server and Databricks are both popular data platforms that offer different features and benefits for data professionals. SQL Server is a relational database management system (RDBMS) that supports a transactional structured query language (SQL) dialect, T-SQL. This is foundational in the product for data manipulation and data definition. On the other hand, Azure Databricks is a unified data platform built around Apache Spark, a distributed computing framework that enables large-scale data processing and machine learning. I’m sure most of you already know this, but I’m including it for completeness, and maybe for people new to the industry.
To further limit the scope of the thinking here, we are only talking about a SQL Server instance and database, not SSRS, SSIS or SSAS.
That said, in this blog, I want to explore and compare where each product offers similar technical capabilities, with a focus on three main areas: T-SQL vs Spark SQL, the developer experience (IDEs), and job scheduling.
I’m also going to avoid calling out any particular version of the SQL Server family, as that would add another dimension of complexity to the explanations. References here to Azure SQL vs on-premises can remain implicit without bloating the narrative.
T-SQL vs Spark SQL
T-SQL and Spark SQL are two dialects of SQL that can be used to query data in SQL Server and Azure Databricks, respectively. Both dialects support the ANSI SQL standard, which defines the syntax and semantics of the statements. However, there are some differences, features, quirks and limitations between T-SQL and Spark SQL that might affect the performance and functionality of your queries, especially if your muscle memory (like mine) is used to writing things in a certain way! To call out a few differences:
T-SQL supports more built-in functions and operators than Spark SQL, while Spark SQL has some functions that are not currently available in T-SQL. Playing spot the difference is tricky, as variations in syntax are common. For example, T-SQL’s CROSS APPLY vs Spark SQL’s LATERAL VIEW, shown in the sketch below.
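To make that concrete, here is a minimal sketch of the two syntaxes side by side. The table and column names are hypothetical, and the two statements are analogues rather than exact equivalents: CROSS APPLY invokes a correlated table expression per row, while LATERAL VIEW is typically paired with a generator function like explode() to flatten arrays.

    -- T-SQL: CROSS APPLY against a correlated table expression
    SELECT o.OrderId, t.ItemId
    FROM dbo.Orders AS o
    CROSS APPLY (
        SELECT TOP (3) i.ItemId
        FROM dbo.OrderItems AS i
        WHERE i.OrderId = o.OrderId
        ORDER BY i.Price DESC
    ) AS t;

    -- Spark SQL: LATERAL VIEW with explode() to flatten an array column
    SELECT o.order_id, item_id
    FROM orders AS o
    LATERAL VIEW explode(o.item_ids) AS item_id;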
T-SQL supports user-defined functions (UDFs) and stored procedures, which allow you to encapsulate complex logic and reuse it in multiple queries. Spark SQL also supports UDFs, but not stored procedures. However, you could break out into Python or Scala to define functions and call them from Spark SQL. A comparison of simple scalar functions is sketched below.
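As a rough illustration, here is a scalar function in each dialect. The T-SQL version is standard; the Spark SQL version assumes a runtime that supports SQL UDFs via CREATE FUNCTION … RETURN (available in recent Databricks runtimes). All object names here are hypothetical.

    -- T-SQL: a scalar user-defined function
    CREATE FUNCTION dbo.TrimAndUpper (@Value NVARCHAR(100))
    RETURNS NVARCHAR(100)
    AS
    BEGIN
        RETURN UPPER(LTRIM(RTRIM(@Value)));
    END;
    GO
    SELECT dbo.TrimAndUpper(CustomerName) FROM dbo.Customers;

    -- Spark SQL: a SQL UDF (assumes CREATE FUNCTION ... RETURN support)
    CREATE FUNCTION trim_and_upper(value STRING)
    RETURNS STRING
    RETURN upper(trim(value));

    SELECT trim_and_upper(customer_name) FROM customers;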
T-SQL supports transactions, which allow you to group multiple statements into a single unit of work and ensure data consistency and integrity. Spark SQL does not support transactions in the same way during execution, but it does support ACID (atomicity, consistency, isolation and durability) guarantees when writing and reading data to Delta Lake tables, as sketched below. See delta.io for more details on storing entities in a data lake using Delta Lake.
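To illustrate the difference, a minimal sketch with hypothetical tables: T-SQL groups statements explicitly, while on Databricks each individual statement against a Delta Lake table is atomic in its own right.

    -- T-SQL: an explicit transaction spanning two statements
    BEGIN TRANSACTION;
        UPDATE dbo.Accounts SET Balance = Balance - 100 WHERE AccountId = 1;
        UPDATE dbo.Accounts SET Balance = Balance + 100 WHERE AccountId = 2;
    COMMIT TRANSACTION;

    -- Spark SQL: no BEGIN/COMMIT, but each statement against a Delta table
    -- is ACID on its own, e.g. this MERGE succeeds or fails as one unit
    MERGE INTO accounts AS t
    USING account_updates AS s
        ON t.account_id = s.account_id
    WHEN MATCHED THEN UPDATE SET t.balance = s.balance
    WHEN NOT MATCHED THEN INSERT *;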
T-SQL supports indexes to improve the speed and efficiency of data retrieval across tables. Spark SQL does not implement indexes in the same way, but it supports partitioning and bucketing, which are useful techniques to optimize the physical layout of data for faster processing (see the sketch below). Additionally, it could be argued that Apache Parquet, as an open-standard columnar storage format, offers compression and encoding that assist query execution.
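Again as a sketch with hypothetical tables, compare an index in T-SQL with partitioning and bucketing in Spark SQL. Note that bucketing applies to non-Delta table formats such as Parquet; Delta Lake offers its own layout optimisations instead.

    -- T-SQL: a nonclustered index to speed up date-range queries
    CREATE NONCLUSTERED INDEX IX_Sales_OrderDate
        ON dbo.Sales (OrderDate) INCLUDE (Amount);

    -- Spark SQL: shape the physical layout at table-creation time
    CREATE TABLE sales (
        order_id   BIGINT,
        amount     DECIMAL(18, 2),
        order_date DATE
    )
    USING DELTA
    PARTITIONED BY (order_date);

    -- Bucketing, for non-Delta formats such as Parquet
    CREATE TABLE sales_bucketed (
        order_id BIGINT,
        amount   DECIMAL(18, 2)
    )
    USING PARQUET
    CLUSTERED BY (order_id) INTO 16 BUCKETS;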
There are many other cosmetic differences if we went into the weeds of the syntax, but in general, we could conclude by saying T-SQL is more suitable for complex analytical queries that require inline functions, operators, transactions and indexes, whereas Spark SQL is more suitable for large-scale data processing and machine learning workloads that require distributed computing and storage. In either case, there is no right or wrong in the context of the code. However, picking the right technology to execute the code against wider challenges around data volume, velocity and variety is where requirements need to be looked at. SQL is great in either form.
Developer Experience (IDEs)
Moving on, another aspect to consider when comparing SQL Server and Azure Databricks is the developer experience; keeping creative teams happy is important for productivity. For both technologies, let us think about the integrated development environment (IDE), or to clarify, the tools and interfaces that are available for writing, executing, testing, debugging and deploying your code. Both SQL Server and Databricks offer several options depending on the preferences and needs of the engineer. Let us call out a few differences again:
Firstly, we need to think about queries vs notebooks. In SQL Server, a single procedural query file would be a common place to start, executed by my muscle memory pressing F5. In Databricks, on the other hand, the common place to start would be a multi-cell, multi-language interactive notebook, offering support for markdown syntax, visualisations, images, and interactive collaboration with other developers through a browser; a browser that will likely refresh the webpage if you press F5 to execute the code! Bad muscle memory!
For our queries, SQL Server offers several tools for developing and managing our T-SQL code, such as SQL Server Management Studio (SSMS), which has always been my go-to. You could also use Azure Data Studio, or Visual Studio with the SQL Server Data Tools (SSDT) extensions installed.
For our notebooks, Azure Databricks offers a web-based interactive workspace that allows you to create and run notebooks. To be explicit, you can use the notebooks to write and execute Spark SQL, Python, Scala and R, changing between languages in each cell if required using the “magic commands,” for example: %python, %sql, %fs.
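For example, in a Python notebook, a cell starting with the %sql magic command is interpreted as Spark SQL (a minimal sketch; the sales table is hypothetical):

    %sql
    SELECT region, SUM(amount) AS total_sales
    FROM sales
    GROUP BY region;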
In addition to notebooks, you can also use the Databricks Command Line Interface (CLI), REST API or SDKs to interact with the platform programmatically. Also, if you want to work locally, the Databricks extension for Visual Studio Code (VS Code), built on Databricks Connect, can be a convenient option if working on an unstable internet connection, maybe while travelling.
SQL Server and Azure Databricks also support integration with other tools and frameworks, such as Microsoft Fabric, Power BI, Azure Synapse Analytics, Azure Data Factory, Azure DevOps and GitHub. You can use these tools to enhance your data analysis, orchestration, collaboration and deployment capabilities.
The developer experience of SQL Server and Azure Databricks depends largely on your personal preference and familiarity. Both software vendors have gone to great lengths to meet developers on their terms, especially considering other factors like operating systems, personas, and environment governance. As with point one, there is not a right or wrong conclusion here, just different tools for different needs/preferences.
Job Scheduling
The last aspect I want to compare and think about in this post is focused on job scheduling, by which I mean the ability to run your code at specific times or intervals, or based on a trigger or some custom condition. Scheduling is really important for the automation of data workloads; ensuring that outputs are always up to date and consistent is a key requirement for any data solution.
SQL Server offers this ability in the form of the SQL Server Agent, which is a service that allows you to create and manage jobs, controlled through the instance’s MSDB database. Each job holds a collection of steps that can perform specific tasks, such as executing stored procedures, running scripts and running SSIS packages. You can schedule jobs to run on a regular or irregular basis, or based on events, such as the completion of another job, the modification of a file, or if an error occurs. A sketch of creating a job follows below.
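As a rough sketch of what that looks like in T-SQL (the job, schedule, database and procedure names are all hypothetical), a daily 02:00 job can be created with the MSDB stored procedures:

    USE msdb;
    EXEC dbo.sp_add_job @job_name = N'Nightly Warehouse Load';
    EXEC dbo.sp_add_jobstep
        @job_name = N'Nightly Warehouse Load',
        @step_name = N'Run load procedure',
        @subsystem = N'TSQL',
        @database_name = N'MyWarehouse',
        @command = N'EXEC dbo.usp_LoadWarehouse;';
    EXEC dbo.sp_add_schedule
        @schedule_name = N'Daily at 2am',
        @freq_type = 4,              -- daily
        @freq_interval = 1,          -- every 1 day
        @active_start_time = 020000; -- HHMMSS
    EXEC dbo.sp_attach_schedule
        @job_name = N'Nightly Warehouse Load',
        @schedule_name = N'Daily at 2am';
    EXEC dbo.sp_add_jobserver @job_name = N'Nightly Warehouse Load';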
Azure Databricks offers Databricks Workflows, which allow you to create and run pipelines of tasks such as notebooks or compiled JAR files. Pipelines can be scheduled to run based on a cron expression (Databricks uses the Quartz syntax, for example 0 0 2 * * ? for 02:00 daily), which specifies the frequency and time of execution, or based on triggers, such as the completion of another workflow, the arrival of new data, or an invocation via an API call.
Both products also support integration with other Azure tools that can help schedule and monitor executions, such as Microsoft Fabric Data Integration, Azure Logic Apps, Azure Functions, and my personal favourite, Azure Data Factory. You can use these services and tools to create complex and dynamic workflows that span across multiple platforms and sources with a bunch of dependency handling options for workloads.
The scheduling capabilities of SQL Server and Azure Databricks are similar in terms of functionality and flexibility, but they differ in terms of implementation and configuration. SQL Server Agent is more familiar and accessible for SQL developers, while Databricks Workflows are more native and integrated for Spark developers. You can also use both platforms together, by triggering jobs or workflows from one platform to another, or by using a third-party product like our very own CF.Cumulus to orchestrate your workflows.
Conclusion
Firstly, apologies, I did not intend for this blog to be so long! The detail seemed to keep flowing to make the relevant points.
To recap and conclude, we have explored and compared only a subset of the technical capabilities in SQL Server and Azure Databricks, focusing on three initial areas of the products where we can derive differences and meaningful overlaps in functionality.
Hopefully, I have given you an informed view on the strengths and weaknesses of both sides, which often relates more to platform requirements and preference than capabilities.
There is certainly no definitive answer to which platform is better or worse, as each platform might be more suitable for different scenarios and use cases.
To quote the Matrix films: “the problem is choice.”
Many thanks for reading.