Databricks: First Impression

Databricks is fairly new to me. I have worked with or around many ETL and data warehouse tools, including one I helped create at Acxiom, but I only started using Databricks in January of this year.

So far, most of what I have seen is as expected: the UI, Python notebooks, the Catalog. It's a good tool.

But what I'm starting to see is how the concept of Delta Live objects works. From what I can tell, everything is a view and you're essentially rebuilding your entire pipeline every time. You can't really even do any CDC unless you are piping data out of Databricks. Otherwise, it's all just a string of Spark views that you essentially rebuild every time you call the end view or rematerialize a view.
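To illustrate what I mean by rebuilding versus CDC, here is a toy sketch in plain Python. It does not use Spark or any Databricks API; the function names and sample rows are made up for illustration, and it only models the behavior as I described it above.

```python
# Toy model only: plain Python, no Spark or Databricks APIs involved.
# All names and rows here are hypothetical.
calls = {"clean": 0, "totals": 0}

source_rows = [
    {"id": 1, "amount": 10},
    {"id": 2, "amount": -5},
    {"id": 3, "amount": 7},
]


def clean_view():
    """A 'view': recomputed from the full source every time it is read."""
    calls["clean"] += 1
    return [r for r in source_rows if r["amount"] > 0]


def totals_view():
    """The 'end view': reading it re-runs everything upstream of it."""
    calls["totals"] += 1
    return sum(r["amount"] for r in clean_view())


# View-chain style: every read rebuilds the whole chain from scratch.
first = totals_view()
source_rows.append({"id": 4, "amount": 3})
second = totals_view()

# CDC-style alternative: keep the previous result and apply only the change.
change = {"id": 5, "amount": 4}
source_rows.append(change)
incremental_total = second + (change["amount"] if change["amount"] > 0 else 0)

# A full rebuild gives the same answer, but rescans every source row to get it.
rebuilt_total = totals_view()
```

Both approaches end up with the same total. The difference is that the view chain touches every source row on each read, while the incremental version only touches the one changed row.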

This is fine. Databricks is pretty nice for ingestion. Personally, I'm not a fan of Spark or that style of functional API for data. I'm not a big fan of using pandas DataFrames or similar for transformations either, except as an interface into or out of a SQL database, or maybe for some minimal record-level transformations on incoming data.

But I would not use that style for things like writing the aggregates behind a SQL view. For that, and mainly for reporting marts and things like that, I think staying in BigQuery or Snowflake is probably preferable at the moment.
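To make the contrast concrete, here is a small sketch of the same aggregate written both ways. It uses Python's built-in sqlite3 purely as a stand-in for a warehouse like BigQuery or Snowflake; the table, view, and sample rows are hypothetical.

```python
import sqlite3
from collections import defaultdict

# Made-up sample data: (region, amount)
orders = [("east", 10.0), ("west", 4.0), ("east", 6.0)]

# Code-side style: the aggregation logic lives in application code.
totals_py = defaultdict(float)
for region, amount in orders:
    totals_py[region] += amount

# SQL style: the aggregation is declared once as a view inside the database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (region TEXT, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)", orders)
conn.execute(
    "CREATE VIEW region_totals AS "
    "SELECT region, SUM(amount) AS total FROM orders GROUP BY region"
)

# Any reporting tool can now read the view without knowing how it is built.
totals_sql = dict(conn.execute("SELECT region, total FROM region_totals"))
conn.close()
```

The results are identical. My preference for the SQL version is that the definition sits next to the data, where a reporting mart can use it directly.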

Author: Marcus

Post Date: 2024-08-24
