As data teams shift toward modern cloud data platforms, transforming raw data into reliable, analytics-ready datasets has become a core challenge. That is where dbt fits in. Instead of handling data transformation within complex ETL pipelines, dbt enables analysts and engineers to transform data directly in the data warehouse using SQL, making workflows more transparent, testable, and easier to manage.
The need for tools like dbt is driven by the scale and complexity of modern data. Industry research shows that data professionals spend up to 60% of their time on data preparation and transformation, often dealing with inconsistent or poorly structured data. At the same time, organizations are under pressure to deliver insights faster, making reliable, maintainable transformation workflows critical. dbt addresses this by introducing testing, documentation, and version control into the transformation layer, reducing errors and improving trust in data.
In this blog, we explore what dbt is, how it works, its key features, and why it has become an essential tool for modern data engineering and analytics teams.
Key Takeaways
- dbt enables teams to transform data directly within the data warehouse using SQL, making workflows more transparent, testable, and easier to manage compared to traditional ETL approaches.
- It focuses only on the transformation layer in the ELT stack, working alongside ingestion and BI tools rather than replacing them, which keeps its role clear and predictable.
- By introducing version control, testing, and documentation into data workflows, dbt improves data quality, reduces errors, and builds trust in analytics across teams.
- dbt allows analysts and engineers to collaborate using Git-based workflows, enabling faster development, consistent metric definitions, and reduced dependency on data engineering teams.
- While powerful, dbt has limitations, such as no native data ingestion, orchestration, or real-time processing, which means it must be combined with other tools in a modern data stack.
What Is dbt?
dbt is an open-source transformation framework built by dbt Labs. It lets data engineers and analysts write, test, and document data transformations using SQL, all running directly inside a data warehouse. It treats SQL models as code, meaning transformations can be versioned, reviewed, and tested the same way software teams handle application code.
In this post, dbt refers to dbt Core unless stated otherwise. dbt Core is open-source (Apache 2.0 licensed); dbt Cloud is the commercial managed version that adds scheduling, a browser-based IDE, and hosted documentation on top of the same transformation engine.
dbt handles only the T in ELT. It does not extract data from sources or load it into a warehouse. That distinction matters because it keeps dbt scoped and predictable.
dbt vs Traditional ETL
In a traditional ETL setup, data is extracted from sources, transformed in an external compute layer, then loaded into a destination. That external layer requires its own infrastructure, maintenance, and compute costs.
dbt follows the ELT model instead. Raw data is loaded into the warehouse first, and dbt transforms it there using the warehouse’s own compute engine. There is no separate transformation server. The warehouse runs the SQL, and dbt manages the logic and execution order.
With ETL, transformations happen before data reaches the warehouse. With dbt, raw data lands first, and models define how it gets cleaned, joined, and structured for downstream use.
dbt Scope: What It Handles and What It Does Not
dbt is scoped entirely to transformation. Understanding its boundaries upfront prevents common integration mistakes when building a data stack.
What dbt handles:
- Defines transformation logic as SQL SELECT statements organized into models
- Compiles models and executes them in the warehouse
- Manages model dependencies automatically via the ref() function
- Runs data quality tests before downstream models consume data
- Generates documentation and lineage graphs from the project structure
- Integrates with Git for version control and code review workflows
What dbt does not handle:
- Extracting data from source systems
- Loading data into a warehouse
- Scheduling or orchestrating pipeline runs natively (in dbt Core)
- Handling streaming or real-time data processing
- Replacing ingestion tools like Fivetran or Airbyte, or orchestrators like Airflow or Dagster
How dbt Works in a Data Pipeline
dbt in the ELT Stack: Between Ingestion and BI
A modern data stack has three distinct layers. An ingestion tool (Fivetran, Airbyte, or Stitch) extracts data from sources and loads it into raw tables in the warehouse. dbt sits in the middle, transforming those raw tables into clean, analytics-ready models. A BI tool (such as Looker, Tableau, or Power BI) then reads from those models to power dashboards and reports.
dbt operates entirely within the warehouse. It reads from raw source tables, applies transformation logic, and writes results back as views or tables. Because of this, dbt only needs warehouse credentials and does not move data across environments.
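Those warehouse credentials live in a profiles.yml file that dbt reads at runtime. A minimal sketch for a hypothetical Snowflake connection (all values are placeholders):

```yaml
# profiles.yml (illustrative; values are placeholders)
my_project:
  target: dev
  outputs:
    dev:
      type: snowflake
      account: my_account
      user: my_user
      password: "{{ env_var('SNOWFLAKE_PASSWORD') }}"
      database: analytics
      warehouse: transforming
      schema: dbt_dev
      threads: 4
```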
dbt Project Structure
A dbt project is a directory of SQL and YAML files organized by function. Understanding the folder structure is important before working with any existing dbt project or starting a new one.
Running dbt init my_project scaffolds a new project. A typical project directory looks like this:
```
my_project/
├── dbt_project.yml   # Project config: name, version, model paths, variables
├── profiles.yml      # Warehouse connection credentials (usually outside project root)
├── models/           # SQL transformation logic, organized by layer
│   ├── staging/      # One model per source table, light cleaning only
│   ├── intermediate/ # Joins and business logic across staging models
│   └── marts/        # Final output models, scoped to business domains
├── tests/            # Singular (custom) SQL tests
├── macros/           # Reusable Jinja functions
├── seeds/            # CSV files loaded into the warehouse as static tables
├── snapshots/        # SCD Type 2 snapshot definitions
├── analyses/         # Ad hoc SQL that runs in the warehouse but is not a model
└── packages.yml      # Third-party package declarations
```
The dbt_project.yml file is the root configuration. It defines the project name, model directories, default materializations per folder, and project-level variables.
```yaml
# dbt_project.yml
name: my_project
version: '1.0.0'

models:
  my_project:
    staging:
      +materialized: view
    marts:
      +materialized: table
```

How dbt Compiles SQL and Builds a DAG
When a dbt project runs, dbt reads every .sql model file and scans for ref() calls that indicate dependencies between models. From those calls, it constructs a Directed Acyclic Graph (DAG) that automatically maps the complete execution order.
If model B depends on model A, dbt runs model A first. If model C depends on both A and B, dbt waits for both to complete before running C. Engineers do not manually define execution order. dbt infers it from how models reference each other.
After building the DAG, dbt compiles each model by resolving ref() calls into actual schema and table names for the target environment, then submits the compiled SQL to the warehouse. The warehouse executes it, and dbt writes the result as a table or view.
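As a quick illustration (the database and schema names here are hypothetical), a ref() call in a model compiles to a fully qualified relation for the active target:

```sql
-- Model as written
select * from {{ ref('stg_orders') }}

-- Compiled SQL submitted to the warehouse for a dev target (illustrative)
select * from analytics.dbt_dev.stg_orders
```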
Here is a simple example of two connected models:
```sql
-- models/staging/stg_orders.sql
select
    order_id,
    customer_id,
    order_date,
    amount
from {{ source('raw', 'orders') }}
where status != 'test'
```
```sql
-- models/marts/fct_revenue.sql
select
    customer_id,
    date_trunc('month', order_date) as month,
    sum(amount) as total_revenue
from {{ ref('stg_orders') }}
group by 1, 2
```

fct_revenue references stg_orders via ref(). dbt infers that stg_orders must run first and builds the DAG accordingly.

Core dbt Commands
Most day-to-day dbt work relies on a small set of CLI commands. These are the ones data engineers use most often:
```bash
dbt run            # Compile and execute all models in the project
dbt test           # Run all schema, data, and unit tests
dbt build          # Run models, tests, seeds, and snapshots in DAG order
dbt seed           # Load CSV files from the seeds/ directory into the warehouse
dbt snapshot       # Execute snapshot definitions to capture slowly changing dimensions
dbt docs generate  # Build the documentation site from project metadata
dbt docs serve     # Preview the documentation site locally in a browser
dbt deps           # Install packages declared in packages.yml
dbt compile        # Compile SQL without executing (useful for debugging Jinja)
```

Selective execution is common in larger projects, since running the full DAG on every change is inefficient:
```bash
dbt run --select staging        # Run only models in the staging folder
dbt run --select +fct_revenue   # Run fct_revenue and all its upstream dependencies
dbt test --select stg_orders    # Test only the stg_orders model
dbt build --select tag:finance  # Build all models tagged 'finance'
```

Core Features of dbt
1. SQL Models and Transformation Logic
A dbt model is a .sql file containing a single SELECT statement. Each model defines one transformation: it reads from source tables or other models and produces a clean output. Models are organized in a folder structure that corresponds to their role in the pipeline.
dbt handles all the DDL. Engineers write only the SELECT logic, and dbt creates or replaces the underlying table or view. The config() block at the top of a model controls how it materializes:
```sql
{{ config(
    materialized='table',
    schema='marts',
    tags=['finance']
) }}

select
    customer_id,
    count(order_id) as total_orders,
    sum(amount) as lifetime_value
from {{ ref('stg_orders') }}
group by 1
```

Materialization options include table, view, incremental, and ephemeral. Choosing the right one per model directly affects warehouse compute costs and runtime.
2. The ref() Function and Dependency Management
ref() is how dbt models reference each other. Instead of hardcoding schema and table names, a model uses ref('model_name'), and dbt resolves the correct schema and table name for the active environment at compile time.
Two things follow from this. First, models become environment-agnostic: the same code runs in dev, staging, and prod without modification. Second, every ref() call is an explicit dependency declaration that dbt uses to build the DAG.
```sql
-- Without dbt (hardcoded, breaks across environments)
select * from prod_schema.stg_customers

-- With dbt (resolves correctly in any environment)
select * from {{ ref('stg_customers') }}
```

The second form resolves to the correct schema in any environment and automatically contributes to lineage tracking. Because every consumer reads through ref(), updating a model's logic in one place propagates to all downstream consumers.
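Alongside ref(), the source() calls shown in earlier examples rely on a YAML source definition. A minimal sketch (the schema name is hypothetical):

```yaml
# models/staging/sources.yml (illustrative)
version: 2

sources:
  - name: raw
    schema: raw_data
    tables:
      - name: orders
      - name: customers
```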
3. Schema Tests, Unit Tests, and Data Validation
dbt ships with a built-in testing framework. Tests run assertions against model output before downstream models consume it. Starting from dbt Core v1.8, running dbt test executes both data tests and unit tests.
Data tests validate model outputs. Generic data tests that ship with dbt include:
- unique: ensures no duplicate values in a column
- not_null: ensures no null values in a column
- accepted_values: ensures column values match a defined list
- relationships: validates foreign key integrity between models
```yaml
# models/staging/schema.yml
models:
  - name: stg_orders
    columns:
      - name: order_id
        tests:
          - unique
          - not_null
      - name: status
        tests:
          - accepted_values:
              values: ['placed', 'shipped', 'returned']
```

Unit tests, introduced in v1.8, validate SQL modeling logic against a small set of static inputs before the full model materializes in production. They support a test-driven development approach, where engineers write tests for transformation logic before running it at scale.
```yaml
unit_tests:
  - name: test_revenue_calculation
    model: fct_revenue
    given:
      - input: ref('stg_orders')
        rows:
          - {order_id: 1, customer_id: 101, order_date: '2024-01-10', amount: 100}
          - {order_id: 2, customer_id: 101, order_date: '2024-01-20', amount: 200}
    expect:
      rows:
        - {customer_id: 101, month: '2024-01-01', total_revenue: 300}
```

Beyond these, singular tests are custom SQL queries that return rows when a condition fails. The dbt-expectations package adds an extended assertion library for more complex validations.
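As a sketch of a singular test, the following query fails if it returns any rows (the business rule itself is hypothetical):

```sql
-- tests/assert_no_negative_amounts.sql
-- Fails when any order carries a negative amount
select
    order_id,
    amount
from {{ ref('stg_orders') }}
where amount < 0
```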
4. Documentation, Lineage, and dbt Docs
dbt generates a documentation site directly from YAML files and SQL model definitions. Descriptions written alongside models compile into a searchable web interface. To build and preview it locally:
```bash
dbt docs generate
dbt docs serve   # Local preview only, not intended for production hosting
```

Note that dbt docs serve is designed for local development previews. For sharing documentation with the wider team, the dbt platform hosts and deploys documentation on a schedule, making it accessible across the organization via a URL.
The documentation site includes a lineage graph: a visual DAG showing how every model connects to upstream sources and downstream dependents. When a source changes or a model breaks, the lineage graph shows all affected downstream models instantly, eliminating the need to trace dependencies manually.
5. Jinja Templating, Macros, and dbt Packages
dbt uses Jinja templating to add logic and reusability to SQL. Macros are reusable Jinja functions stored in the macros/ directory. A macro might standardize unit conversion, generate repetitive CASE WHEN blocks, or apply a consistent cleaning pattern across multiple models.
```sql
-- macros/cents_to_dollars.sql
{% macro cents_to_dollars(column_name) %}
    ({{ column_name }} / 100.0)::numeric(10, 2)
{% endmacro %}

-- Used in a model
select
    order_id,
    {{ cents_to_dollars('amount_cents') }} as amount_dollars
from {{ ref('stg_payments') }}
```

dbt packages extend this further. Packages are shared collections of models, macros, and tests published to dbt Hub or GitHub. Declare them in packages.yml and install them with dbt deps:
```yaml
# packages.yml
packages:
  - package: dbt-labs/dbt_utils
    version: 1.1.1
  - package: calogica/dbt_expectations
    version: 0.10.1
```

dbt-utils is the most widely used package, providing helper macros for common transformation patterns. dbt-expectations adds extended data validation beyond what generic tests cover.
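For example, dbt-utils provides generate_surrogate_key for building hashed keys. A minimal sketch reusing the stg_orders model from earlier:

```sql
-- Illustrative use of a dbt_utils macro
select
    {{ dbt_utils.generate_surrogate_key(['customer_id', 'order_date']) }} as order_key,
    customer_id,
    order_date
from {{ ref('stg_orders') }}
```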
How dbt Improves Data Engineering Workflows
1. Git-Based Version Control for Transformation Logic
dbt projects consist of plain text files (SQL and YAML) that live in a Git repository. As a result, every transformation change produces a commit with a full history of what changed, when, and by whom. Teams can review diffs, roll back broken changes, and run CI checks before anything reaches production.
This replaces stored procedures or ad hoc SQL scripts that leave no audit trail. When a metric breaks, the Git history shows exactly what changed and points to who changed it.
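A typical change cycle, sketched with hypothetical branch and model names, looks like this:

```bash
git checkout -b fix-revenue-filter   # Branch for the change
# ...edit models/marts/fct_revenue.sql...
dbt run --select fct_revenue         # Build the model locally
dbt test --select fct_revenue        # Validate before committing
git commit -am "Exclude test orders from revenue"
git push origin fix-revenue-filter   # Open a pull request for review
```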
2. Pull Request Reviews on Data Models
Because dbt projects live in Git, pull requests become the standard review mechanism for transformation changes. A data engineer modifying a mart model submits a PR, a colleague reviews the SQL and YAML test definitions, and the change merges only after approval.
This brings the same review discipline to data work that software teams have applied to application code for years. As a result, logic errors, naming inconsistencies, and missing tests get caught before they reach production rather than after.
3. Consistent Metric Definitions Across Teams
Metric inconsistency is a common problem in data organizations. Two teams applying slightly different filtering logic to the same base table arrive at different revenue numbers. Both are defensible, and neither is authoritative.
dbt creates a single source of truth for transformation logic. When revenue is defined in a mart model, every downstream consumer reads from that model. Changing the definition in one place propagates it everywhere. Teams no longer argue about whose number is correct because there is only one definition.
4. Analyst-Led Transformations Without Engineering Dependency
In traditional setups, analysts write queries but do not control the transformation layer. New metrics require a data engineering ticket and a wait. dbt changes this dynamic because models are SQL SELECT statements that any analyst familiar with SQL can write, test, and deploy.
Analysts work in the same Git workflow as engineers, submit pull requests for review, and take ownership of the models that power their reporting. Consequently, engineering bottlenecks on transformation work shrink considerably, and data teams move faster.
dbt Use Cases in Production
1. Staging Layer for Cleaning and Standardizing Data
The staging layer applies the first round of transformation to raw source data. Staging models sit directly on top of source tables and handle light cleaning: renaming columns to a consistent convention, casting data types, filtering deleted records, and standardizing categorical values.
Staging models have a one-to-one relationship with source tables, meaning one staging model per source table. They do not join across tables or apply business logic. Their purpose is to produce a clean, consistent representation of each source that every downstream model can rely on.
```sql
-- models/staging/stg_customers.sql
select
    id as customer_id,
    lower(email) as email,
    created_at::timestamp as created_at,
    status
from {{ source('raw', 'customers') }}
where is_deleted = false
```

2. Intermediate and Mart Layer Modeling
Intermediate models join and aggregate data from multiple staging models into reusable business entities. For example, a customer activity model might join CRM, orders, and support ticket staging models into a single unified view.
Mart models form the final output layer, scoped to specific business domains. Finance, marketing, and product each have their own mart models structured for direct use by BI tools. Mart models implement business logic, such as revenue recognition rules or cohort definitions, and dashboards query them directly.
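A minimal sketch of an intermediate model that joins two staging models into a reusable customer entity (column names are illustrative):

```sql
-- models/intermediate/int_customer_orders.sql (illustrative)
select
    c.customer_id,
    c.email,
    count(o.order_id) as order_count,
    sum(o.amount) as total_spent
from {{ ref('stg_customers') }} c
left join {{ ref('stg_orders') }} o
    on c.customer_id = o.customer_id
group by 1, 2
```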
3. Incremental Models for Large Datasets
For event or log tables that grow by millions of rows daily, rebuilding the full table on every run is impractical. Incremental models address this by processing only new or updated rows since the last run.
```sql
-- models/marts/fct_events.sql
{{ config(materialized='incremental', unique_key='event_id') }}

select
    event_id,
    user_id,
    event_type,
    occurred_at
from {{ ref('stg_events') }}

{% if is_incremental() %}
where occurred_at > (select max(occurred_at) from {{ this }})
{% endif %}
```

On the first run, dbt builds the full table. On subsequent runs, dbt processes only rows that pass the filter and merges them in. This approach cuts warehouse compute costs and shortens run times for high-volume pipelines.
4. Snapshot Models for Slowly Changing Dimensions
Source systems store only the current state of a record. When a customer changes their subscription tier, the previous tier disappears, and the history is lost. dbt snapshots solve this by periodically capturing point-in-time records from source tables and tracking changes using dbt_valid_from and dbt_valid_to timestamps.
```sql
-- snapshots/snp_customer_plans.sql
{% snapshot snp_customer_plans %}

{{ config(
    target_schema='snapshots',
    unique_key='customer_id',
    strategy='timestamp',
    updated_at='updated_at'
) }}

select * from {{ source('raw', 'customers') }}

{% endsnapshot %}
```

Snapshots work for any entity that changes state over time: customer plans, employee roles, product pricing, and account status.
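Once a snapshot exists, point-in-time state can be reconstructed by filtering on the validity columns. A sketch, assuming the source table carries a plan column (both the column and the date are hypothetical):

```sql
-- Illustrative: each customer's plan as of 2024-06-01
select customer_id, plan
from snapshots.snp_customer_plans
where dbt_valid_from <= '2024-06-01'
  and (dbt_valid_to > '2024-06-01' or dbt_valid_to is null)
```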
5. Seeds for Static Reference Data
Seeds are CSV files stored in the seeds/ directory and loaded into the warehouse as tables using dbt seed. They suit small, static datasets that change infrequently and do not originate from a source system.
Common uses for seeds include:
- Country or region code mappings
- Cost center or department reference tables
- Static exchange rate lookups
- Category classification tables
The seed file itself is plain CSV, for example seeds/country_codes.csv:

```csv
country_code,country_name,region
US,United States,North America
GB,United Kingdom,Europe
IN,India,Asia Pacific
```

Once loaded, seeds are referenced in models the same way as any other model using ref('country_codes'). This approach keeps static lookup data version-controlled alongside transformation logic, rather than managing it separately in a spreadsheet or an external system.
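Joining a seed works like any other ref(). A sketch, assuming stg_orders carries a country_code column (hypothetical):

```sql
-- Illustrative model enriching orders with seed data
select
    o.order_id,
    c.country_name,
    c.region
from {{ ref('stg_orders') }} o
left join {{ ref('country_codes') }} c
    on o.country_code = c.country_code
```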
6. dbt with Snowflake, BigQuery, Redshift, and Databricks
dbt connects to warehouses through adapters that handle SQL dialect differences and DDL generation. The core project code stays the same regardless of the warehouse. As a result, teams switching warehouses need minimal changes to their model logic.
- Snowflake: most widely used adapter; supports clone-based dev environments and dynamic tables
- BigQuery: native GCP integration; supports partition and clustering directives
- Redshift: supports Redshift-specific distribution keys and sort keys
- Databricks: uses the dbt-databricks adapter; supports Delta Lake tables and Unity Catalog
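Each adapter ships as its own Python package installed alongside dbt Core; a minimal sketch:

```bash
# Install dbt Core with the adapter for your warehouse
pip install dbt-snowflake   # or: dbt-bigquery, dbt-redshift, dbt-databricks
```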
Limitations of dbt
1. No Native Data Ingestion
dbt only transforms data already in the warehouse. It cannot pull from APIs, connect to source databases, or load files from object storage. A separate ingestion tool is always required, and the handoff between ingestion and transformation needs coordination around run order and data freshness.
2. No Built-in Orchestration or Scheduling
dbt Core does not include a scheduler. Running dbt on a schedule requires an external orchestration tool, such as Apache Airflow, Dagster, Prefect, or Astronomer. These tools trigger dbt runs, handle retries, and manage dependencies between dbt jobs and other pipeline steps.
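As a minimal sketch of external scheduling (paths and timing are hypothetical), even a cron entry can trigger nightly runs, though a full orchestrator adds retries and cross-tool dependency handling:

```bash
# Illustrative crontab entry: build and test the project at 2 AM daily
0 2 * * * cd /opt/dbt/my_project && dbt build --target prod >> /var/log/dbt_build.log 2>&1
```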
The dbt platform adds built-in scheduling, but teams already running Airflow or Dagster generally find it redundant. The orchestration gap is the most common reason teams add another tool to their dbt Core stack.
3. No Support for Real-Time or Streaming Data
dbt is designed for batch transformations. It runs SQL against warehouse tables that refresh on a schedule. It does not process events as they arrive and does not integrate with streaming platforms like Apache Kafka, Apache Flink, or Spark Streaming.
Teams with real-time requirements typically process events in a streaming layer and land the results in the warehouse. dbt then handles further batch-level modeling on top of those landed tables. For sub-minute latency requirements, however, dbt is not the right tool.
4. YAML and Jinja Complexity at Scale
dbt projects start simple. As they grow, the volume of YAML required to define sources, models, tests, and documentation increases. Large projects can accumulate hundreds of YAML files that must stay in sync with the SQL models they describe. When a model changes and its YAML does not update, tests break, and documentation goes stale.
Jinja adds a debugging layer on top of SQL. Jinja errors in model files can be cryptic, and the compiled SQL that dbt submits to the warehouse is not always easy to inspect when something goes wrong.
5. Governance Gaps in Multi-Team Projects
dbt works cleanly for a single team managing one project. When multiple teams share a project, governance becomes harder to enforce. There are no built-in access controls on who can modify which models. Naming conventions, testing standards, and documentation requirements rely on code review discipline rather than tooling to enforce them.
Some teams split into multiple dbt projects to give each team autonomy. However, ref() does not work natively across separate projects. dbt platform Enterprise adds cross-project ref support, though it is a paid feature that requires organizational coordination to implement.
dbt Core vs dbt Cloud
Both options run on the same core transformation engine. The difference lies in what surrounds it: scheduling, IDE, documentation hosting, and collaboration tooling. Note that dbt Labs now refers to the managed product as the dbt platform (formerly dbt Cloud), though the name dbt Cloud remains widely used.
| Feature | dbt Core | dbt platform (dbt Cloud) |
|---|---|---|
| Cost | Free (Apache 2.0 open-source) | Free individual tier; paid team and enterprise plans |
| Engine | dbt Core engine (Python-based) | dbt Core engine or dbt Fusion engine (up to 30x faster) |
| Setup | CLI, local or self-hosted | Managed, browser-based |
| Scheduling | Not included (needs Airflow, Dagster, etc.) | Built-in with cron and trigger-based runs |
| IDE | Command line or VS Code extension | Web-based Studio IDE + dbt Canvas (drag-and-drop) |
| Documentation hosting | Self-hosted | Hosted and shared via dbt platform URL |
| Logging and alerting | Manual setup required | Built-in job history, run logs, Slack/email alerts |
| Environments | Managed via profiles.yml | Managed via UI with separate credentials per environment |
| Semantic layer | Not included | Included; integrates with BI tools for consistent metrics |
| AI assistance | Not included | dbt Copilot for generating docs, tests, and SQL |
| Cross-project ref() | Not supported | Supported (Enterprise tier only) |
| Best for | Teams with existing orchestration and CLI workflows | Teams without orchestration or with non-technical users |
dbt Core is the right choice when:
- The team already runs Airflow, Dagster, or Prefect for orchestration
- Engineers are comfortable with CLI and Git-based workflows
- Keeping the stack lean without per-seat costs is a priority
dbt platform makes sense when:
- The team has no existing orchestration infrastructure
- Analysts or non-engineering users need a browser-based IDE and hosted documentation
- Centralized job monitoring, alerting, and environment management are a priority
- The Fusion engine's performance improvement (up to 30x, per dbt Labs) is worth the managed cost
- Cross-project ref() is needed across multiple dbt projects
Pricing for dbt platform has changed multiple times. The Developer plan is free for individuals. Team plans are priced per seat with per-model-run charges. Enterprise pricing is negotiated directly with dbt Labs.
Kanerika: Transforming Data Management with AI-Powered Solutions
Kanerika helps businesses solve complex data challenges and turn them into actionable insights using advanced AI and data management solutions. Our expertise spans data integration, analytics, AI/ML, and cloud management, enabling organizations to build scalable systems that improve decision-making and operational efficiency.
We maintain rigorous security and compliance standards, including ISO 27701, ISO 27001, ISO 9001, SOC 2, GDPR, and a CMMI Level 3 appraisal. These certifications ensure that every solution we deliver is secure, reliable, and enterprise-ready. Our partnerships with Microsoft, AWS, and Informatica allow us to combine modern technology with agile practices, giving clients access to innovative solutions built on a proven foundation. Our mission is to help organizations use data to drive growth through AI-powered solutions.
To deliver on that mission, Kanerika has built a portfolio of specialized AI agents, including DokGPT, Jennifer, Alan, Susan, KARL, and Mike Jarvis. Each agent automates a specific category of work, from document processing and risk scoring to customer analytics and voice data analysis. They operate across structured and unstructured data environments and integrate directly into enterprise workflows, helping teams move from raw data to faster insights with less manual effort.
FAQs
1. What is dbt (data build tool) used for?
A data build tool like dbt is used to transform raw data into clean, structured datasets for analysis. It works within your data warehouse and allows teams to write SQL-based transformations, test data quality, and manage workflows efficiently. dbt is widely used in modern data stacks for analytics engineering.
2. How does dbt work in a data pipeline?
dbt sits in the transformation layer of a data pipeline. After data is extracted and loaded into a warehouse (like Snowflake or BigQuery), dbt transforms that data using SQL models. It also enables testing, documentation, and version control, making the pipeline more reliable and maintainable.
3. Why is dbt important for data teams?
dbt helps data teams standardize transformations, improve data quality, and collaborate better. It introduces software engineering practices like version control, testing, and modular code into analytics workflows, which leads to more scalable and trustworthy data systems.
4. What are the key features of dbt?
dbt offers features like modular SQL models, built-in data testing, documentation generation, and dependency management. It also integrates with cloud data warehouses and supports scheduling through orchestration tools, making it a powerful solution for managing data transformations.
5. Who should use a data build tool like dbt?
dbt is best suited for data analysts, analytics engineers, and data teams working with modern cloud warehouses. It’s ideal for organizations that want to scale their data operations, improve data reliability, and adopt a more structured approach to data transformation.



