TL;DR: Data Vault is a hub-link-satellite warehouse modeling pattern built for auditability and source-system change: hubs hold business keys, links capture relationships, and satellites store descriptive history with full lineage. It fits enterprises with many source systems, frequent mergers, and strict regulatory audit trails, and works best when paired with a Kimball-style downstream layer for BI consumption.
Enterprise data warehouses keep failing the same way. A merger adds two new source systems, a regulator asks for a five-year audit trail, and the dimensional model that was clean a year ago turns into a tangle of late-arriving fact tables and Type 2 dimensions nobody trusts. The teams ship slower, the reports show different numbers, and the warehouse becomes a question of who got there first rather than what the data says.
Data vault modeling exists because Dan Linstedt watched this play out across U.S. Department of Defense and corporate warehouses through the 1990s and built a pattern that treats source-system change and audit history as first-class requirements, not edge cases. The approach separates the unchanging business keys, the relationships between them, and the descriptive context into three table types: hubs, links, and satellites. That separation is what lets you add a new source on Monday without rebuilding what shipped on Friday.
Key Takeaways
Data vault modeling is a hub-link-satellite pattern for the enterprise warehouse layer that absorbs source-system change, preserves a complete audit trail, and loads in parallel.
Hubs hold unique business keys, links capture relationships, and satellites carry descriptive history; each table type is intentionally narrow so the structure ages well.
Data Vault 2.0’s hash keys replace sequence-based surrogates, which unlocks full parallel loading across Snowflake, Databricks, and Microsoft Fabric.
Pick data vault when you have three or more major sources, regulatory audit requirements, or a warehouse expected to last more than five years; otherwise a Kimball star ships faster.
Information marts on top of the vault are not optional; the vault feeds dimensional marts that BI tools query, while the vault itself stays clean.
Kanerika designs, builds, and operates data vault warehouses across Snowflake, Databricks, and Fabric, with automation accelerators that compress build cycles from six months to six weeks.
This guide walks through what data vault modeling is, how the three core entities relate, how it differs from Kimball star schemas and Inmon third-normal form, when it earns its complexity, and how Kanerika delivers it on Snowflake, Databricks, and Microsoft Fabric. We cover the introductory patterns most data vault articles already explain and add the implementation context they usually skip.
What Is Data Vault Modeling?
Data vault modeling is a database modeling method designed for the enterprise data warehouse core layer. It stores raw data from multiple operational systems in a way that preserves history, supports parallel loads, and absorbs schema changes without rework. Dan Linstedt published the pattern publicly in 2000 and released Data Vault 2.0 in 2013, adding hash keys, big-data integration, and the methodology around agile delivery.
The pattern splits enterprise data into three table types. Hubs store the unique business keys that identify a real-world entity, such as a customer number or a product SKU. Links capture the relationships between those keys, such as a customer placing an order. Satellites carry the descriptive attributes and history, time-stamped by load date.
That structural choice is the whole point. When a source system adds a column, you add a satellite. When a new source arrives with its own customer master, you add a satellite to the existing customer hub. The hub never changes, the link never changes, and the queries against history keep returning the same answers as before.
Data vault sits between source systems and the consumption layer. Most modern stacks land raw data in a staging zone, model it in the data vault, and serve curated dimensional marts on top for BI tools. The vault is where the truth lives; the marts are where the speed lives. See Kanerika’s wider cloud data warehouse primer for how the vault layer fits the full stack.
The history Dan Linstedt set out to solve
Linstedt began the work in 1990 and spent a decade testing it against real warehouses before publishing. The pattern was built for environments where source systems came and went, regulators demanded full audit trails, and dimensional models had to keep responding to change without monthly redesigns. Those constraints have only intensified since.
Data vault 2.0, the current standard, replaced sequence-based surrogate keys with hash keys derived from the business key. That single change unlocked parallel loading across the entire vault, because every table can compute its keys independently without waiting for a central key generator. It also made cross-platform integration cleaner, because the same hash function produces the same key whether the table lives in Snowflake, Databricks, or on-premises SQL Server.
Hubs, Links, and Satellites: The Three Core Entities
Every data vault is built from three table types. Understanding what each one stores and what it deliberately leaves out is the difference between a vault that ages well and one that becomes another tangle of join paths.
Hubs: the unique business keys
A hub stores the unique list of business keys for one entity. A customer hub holds one row per customer number, regardless of how many source systems track that customer. A product hub holds one row per SKU. The hub has four columns: the hash key, the original business key, the load date, and the record source.
Hubs do not store descriptions, statuses, or any attribute that might change. They are intentionally thin. That thinness is what lets a hub absorb new sources without ever altering its structure.
Links: the relationships
A link stores the many-to-many relationship between two or more hubs. A customer-order link records that customer C123 placed order O456. A product-supplier link records that product P789 is sourced from supplier S001. Links carry the hash keys of the participating hubs, their own hash key, the load date, and the record source.
Like hubs, links carry no descriptive content. The fact that the relationship exists is the data. Any attribute of the relationship, such as the order date or the unit price, lives in a satellite attached to the link.
Satellites: history and context
A satellite stores the descriptive attributes and their history for a hub or a link. A customer hub might have one satellite for name and address, another for credit-rating data, and a third for marketing segmentation. Each satellite carries the hash key of its parent, the load date, the load end date, the record source, and the descriptive columns.
When a customer’s address changes, the vault writes a new satellite row with a fresh load date. The old row stays in place; its end date is either stamped when the newer row arrives or, in a strictly insert-only vault, derived at query time from the next row’s load date. Five years of history is just five years of rows, queried by date range. That history is the audit trail compliance teams have been asking warehouses to produce since SOX.
Satellites also let you separate data by rate of change or by source. Name changes rarely, so it lives in one satellite. Account balance changes daily, so it lives in another. Splitting them keeps small, frequent updates from rewriting large, stable records.
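To make the three table types concrete, here is a minimal sketch of the structures described above, run against an in-memory SQLite database so it works anywhere. The table and column names (hub_customer, customer_hk, sat_customer_balance, and so on) are hypothetical, and the load end date is left out on the assumption that it is derived at query time; a real vault on Snowflake, Databricks, or Fabric would use that platform's own types.

```python
import sqlite3

# Hypothetical table and column names, chosen for illustration only.
DDL = """
CREATE TABLE hub_customer (
    customer_hk   TEXT PRIMARY KEY,   -- hash of the business key
    customer_bk   TEXT NOT NULL,      -- original business key
    load_date     TEXT NOT NULL,
    record_source TEXT NOT NULL
);
CREATE TABLE hub_order (
    order_hk      TEXT PRIMARY KEY,
    order_bk      TEXT NOT NULL,
    load_date     TEXT NOT NULL,
    record_source TEXT NOT NULL
);
-- The link holds only keys: the relationship itself is the data.
CREATE TABLE link_customer_order (
    customer_order_hk TEXT PRIMARY KEY,
    customer_hk   TEXT NOT NULL REFERENCES hub_customer (customer_hk),
    order_hk      TEXT NOT NULL REFERENCES hub_order (order_hk),
    load_date     TEXT NOT NULL,
    record_source TEXT NOT NULL
);
-- Slow-changing attributes live in one satellite ...
CREATE TABLE sat_customer_details (
    customer_hk   TEXT NOT NULL REFERENCES hub_customer (customer_hk),
    load_date     TEXT NOT NULL,
    record_source TEXT NOT NULL,
    name          TEXT,
    address       TEXT,
    PRIMARY KEY (customer_hk, load_date)
);
-- ... and fast-changing ones in another, so daily updates stay narrow.
CREATE TABLE sat_customer_balance (
    customer_hk   TEXT NOT NULL REFERENCES hub_customer (customer_hk),
    load_date     TEXT NOT NULL,
    record_source TEXT NOT NULL,
    balance       REAL,
    PRIMARY KEY (customer_hk, load_date)
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(DDL)
tables = [row[0] for row in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]
print(tables)
```

Neither the hubs nor the link carry a descriptive column, so a new attribute or a new source only ever adds a satellite table.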
How a Data Vault Actually Works End to End
A working data vault has four logical zones, each with a defined job. The clean separation is what lets multiple teams load and consume the vault in parallel without coordination overhead.
The staging layer lands raw source data with minimal transformation. Source columns are preserved, types are standardised, and load timestamps are added. Nothing is filtered, deduplicated, or interpreted here. Teams running modern data transformation stacks usually pair the staging zone with dbt or a similar tool.
The raw vault holds the hubs, links, and satellites that mirror the source systems’ business keys and attributes one-to-one. Loads are insert-only and idempotent. Every row carries its load date and record source, which means a re-run never corrupts the history. The same pattern shows up in Iceberg tables and other open table formats, where insert-only patterns enable time travel.
The business vault sits above the raw vault and adds derived structures: computed business rules, point-in-time tables that snapshot the state of the vault for fast querying, and bridge tables that materialise common join paths. The business vault is optional; small warehouses skip it.
The information mart is the consumption layer that BI tools and analysts query. It is dimensional, denormalised, and tuned for read performance. Marts get rebuilt from the vault whenever the business logic changes, which means the vault stays stable while the reporting layer evolves. Teams use data warehouse automation tooling to keep the mart-build pipelines repeatable.
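As a rough illustration of a mart sitting on top of the vault, the sketch below defines a customer dimension as a view over a hub and its satellite, again using in-memory SQLite. The names (dim_customer, hub_customer, sat_customer_details) and the sample rows are assumptions made for the example, not part of any particular implementation.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE hub_customer (
    customer_hk TEXT PRIMARY KEY, customer_bk TEXT,
    load_date TEXT, record_source TEXT);
CREATE TABLE sat_customer_details (
    customer_hk TEXT, load_date TEXT, record_source TEXT,
    name TEXT, address TEXT,
    PRIMARY KEY (customer_hk, load_date));

-- Hypothetical sample data: one customer with two address versions.
INSERT INTO hub_customer VALUES ('hk1', 'C123', '2023-01-01', 'crm');
INSERT INTO sat_customer_details VALUES
    ('hk1', '2023-01-01', 'crm', 'Ada Lovelace', '1 Old Road'),
    ('hk1', '2024-06-01', 'crm', 'Ada Lovelace', '2 New Street');

-- The mart is only a view over the vault, so it can be dropped and
-- rebuilt whenever the reporting logic changes.
CREATE VIEW dim_customer AS
SELECT h.customer_bk AS customer_number, s.name, s.address
FROM hub_customer AS h
JOIN sat_customer_details AS s ON s.customer_hk = h.customer_hk
WHERE s.load_date = (
    SELECT MAX(s2.load_date)
    FROM sat_customer_details AS s2
    WHERE s2.customer_hk = h.customer_hk);
""")

dim_rows = conn.execute("SELECT * FROM dim_customer").fetchall()
print(dim_rows)
```

Analysts see one denormalised row per customer, while both address versions remain untouched in the satellite underneath.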
Reads against the vault use point-in-time queries that pick the satellite row valid at a given timestamp. That pattern, which Linstedt calls the “as-of” query, is what makes auditing trivial. You ask the vault what it knew on March 15 last year, and it answers from the rows valid that day.
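A point-in-time read can be sketched in a few lines. The example below assumes a hypothetical sat_customer_details satellite with ISO-8601 load dates stored as text, and picks the most recent row loaded on or before the requested date; the function name address_as_of is invented for the illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sat_customer_details (
    customer_hk TEXT, load_date TEXT, record_source TEXT, address TEXT,
    PRIMARY KEY (customer_hk, load_date));
-- Hypothetical history: three address versions for one customer.
INSERT INTO sat_customer_details VALUES
    ('hk1', '2022-01-10', 'crm', '1 Old Road'),
    ('hk1', '2023-03-01', 'crm', '2 New Street'),
    ('hk1', '2024-09-20', 'crm', '3 Latest Avenue');
""")


def address_as_of(conn, customer_hk, as_of):
    """Return the address valid at `as_of`, or None if nothing was loaded yet."""
    row = conn.execute(
        """SELECT address FROM sat_customer_details
           WHERE customer_hk = ? AND load_date <= ?
           ORDER BY load_date DESC LIMIT 1""",
        (customer_hk, as_of),
    ).fetchone()
    return row[0] if row else None


print(address_as_of(conn, "hk1", "2023-03-15"))
```

Because ISO-8601 dates sort correctly as strings, the comparison needs no date parsing; on a production warehouse the same query would normally use a native timestamp column.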
Data Vault vs Kimball vs Inmon: When to Use Which
The three established data warehouse patterns solve different problems. Picking the wrong one is more expensive than picking a sub-optimal one, because the cost of changing the core modeling pattern after launch is usually a full rebuild.
Kimball dimensional modeling, also called the star schema, optimises for query performance and business-user clarity. Facts hold measurements, dimensions hold context, and slowly changing dimensions track history through Type 1, 2, and 3 patterns. It is fast for analysts to query, but absorbing new sources or rebuilding history is expensive once the schema is set.
Inmon’s corporate information factory uses third-normal form for the enterprise warehouse and feeds dimensional marts downstream. The 3NF core gives a single version of the truth but is brittle when source systems change frequently. Schema redesign cycles can run months for mature warehouses.
Data vault sits between them. The vault uses the hub-link-satellite structure for the enterprise core, which absorbs change without redesign, and serves dimensional marts downstream. The trade-off is more tables and more joins for the same query, which is why every vault project ships information marts on top.
The decision rule most enterprise teams use: pick data vault when you have three or more major source systems, regulatory audit requirements, or a warehouse expected to last more than five years. Pick Kimball for departmental marts or BI-first warehouses with stable sources. Pick Inmon when your reference data rarely changes and you want strict normalisation.
Data Vault 2.0 in Practice: Hash Keys, Insert-Only Loads, and Parallelism
Data Vault 2.0 made three changes that turned the pattern into something the cloud warehouses could run at scale. Understanding these moves is the difference between a working vault and one that bottlenecks at load time.
Hash keys replaced sequence-based surrogates. A SHA-256 or MD5 hash of the business key becomes the join column. Two tables computing the same hash on the same key always agree, which means loads no longer need a central key generator. Snowflake, Databricks, and Fabric all expose the hash functions natively.
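A minimal sketch of hash-key derivation follows, using Python's standard hashlib. The normalisation rules (trim, upper-case) and the "||" delimiter are assumed conventions for this example rather than anything the pattern mandates; what matters is that every loader applies the same rules.

```python
import hashlib


def hash_key(*business_keys: str, delimiter: str = "||") -> str:
    """Derive a deterministic hash key from one or more business keys."""
    # Normalise before hashing so 'c123 ' and 'C123' yield the same key.
    # Trimming and upper-casing are assumed conventions, not a standard.
    normalised = [key.strip().upper() for key in business_keys]
    joined = delimiter.join(normalised)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


customer_hk = hash_key("C123")            # hub key
same_customer_hk = hash_key("  c123 ")    # same key from a messier source
link_hk = hash_key("C123", "O456")        # link key from both business keys

print(customer_hk == same_customer_hk)
```

The delimiter keeps composite keys unambiguous: without it, the pairs ("AB", "C") and ("A", "BC") would hash to the same value.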
Inserts are the only write operation. Updates and deletes never touch the raw vault. New satellite rows record changes; the old rows stay in place. That insert-only discipline is what makes the vault idempotent and replayable, which matters when a source system re-sends a day of data.
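The replay behaviour can be sketched as follows. This example assumes a hash_diff column, a hash of the descriptive payload that many Data Vault 2.0 implementations add to satellites but that the text above does not mention, and compares it with the latest stored row before inserting. Table and function names are hypothetical.

```python
import hashlib
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE sat_customer_details (
    customer_hk TEXT, load_date TEXT, record_source TEXT,
    hash_diff TEXT, address TEXT,
    PRIMARY KEY (customer_hk, load_date))""")


def load_satellite(conn, customer_hk, address, load_date, record_source="crm"):
    """Insert a satellite row only if the payload changed; never update."""
    hash_diff = hashlib.sha256(address.encode("utf-8")).hexdigest()
    latest = conn.execute(
        """SELECT hash_diff FROM sat_customer_details
           WHERE customer_hk = ? ORDER BY load_date DESC LIMIT 1""",
        (customer_hk,),
    ).fetchone()
    if latest and latest[0] == hash_diff:
        return False  # same payload as the current row: the replay is a no-op
    conn.execute(
        "INSERT INTO sat_customer_details VALUES (?, ?, ?, ?, ?)",
        (customer_hk, load_date, record_source, hash_diff, address),
    )
    return True


results = [
    load_satellite(conn, "hk1", "1 Old Road", "2024-01-01"),
    load_satellite(conn, "hk1", "1 Old Road", "2024-01-02"),    # source re-sends
    load_satellite(conn, "hk1", "2 New Street", "2024-01-03"),  # real change
]
row_count = conn.execute("SELECT COUNT(*) FROM sat_customer_details").fetchone()[0]
print(results, row_count)
```

This sketch only compares against the most recent row, so it does not handle out-of-order or late-arriving loads, which need additional logic.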
Loads run in parallel across hubs, links, and satellites because none depend on each other’s keys. A team can load 200 satellites simultaneously without waiting for a sequential key cascade. On Snowflake or Databricks, that parallelism translates directly into shorter load windows and lower compute spend.
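The independence argument can be shown in miniature. In the sketch below, two hub loaders and one link loader run concurrently over the same hypothetical staged rows; the link computes its parents' hash keys directly from the business keys, so it never waits on a hub lookup. The thread pool merely stands in for the warehouse's own parallel tasks, and all names are illustrative.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor


def hash_key(*business_keys):
    joined = "||".join(key.strip().upper() for key in business_keys)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


# Hypothetical staged rows: (customer number, order number).
STAGED = [("C123", "O456"), ("C123", "O457"), ("C999", "O458")]


def build_hub_customer(rows):
    return {hash_key(customer): customer for customer, _ in rows}


def build_hub_order(rows):
    return {hash_key(order): order for _, order in rows}


def build_link_customer_order(rows):
    # Parent keys are recomputed from the business keys, not looked up,
    # which is what removes the load-order dependency on the hubs.
    return {hash_key(c, o): (hash_key(c), hash_key(o)) for c, o in rows}


loaders = [build_hub_customer, build_hub_order, build_link_customer_order]
with ThreadPoolExecutor(max_workers=len(loaders)) as pool:
    futures = [pool.submit(loader, STAGED) for loader in loaders]
    hub_customer, hub_order, link_customer_order = [f.result() for f in futures]

print(len(hub_customer), len(hub_order), len(link_customer_order))
```

With sequence-based surrogates, the link loader would have had to wait for both hubs to assign their keys before it could start.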
Data Vault 2.0 also formalised the agile delivery method around it. Sprints deliver new hubs, links, and satellites as discrete units of work, which means the warehouse grows in two-week increments rather than year-long redesigns. The methodology piece is as load-bearing as the modeling piece for projects that need to ship in production.
Common Mistakes to Avoid With Data Vault Modeling
Most failed data vault projects fail for the same handful of reasons. Each of them is preventable with a hard rule applied at design time.
Modeling consumption queries into the vault. Analysts want denormalised tables they can query directly. Putting them in the vault breaks the entire pattern. Build marts on top; keep the vault clean.
Using sequence keys instead of hash keys. Sequence keys force central key generation and serialise loads. Hash keys are non-negotiable in Data Vault 2.0. Reach for them on day one.
Skipping the staging layer. Loading source data straight into the vault leaks transformation logic into the raw layer. The staging layer takes a few extra hours to build and saves months of confusion later.
Letting satellites get too wide. One satellite per source system per logical group is the rule. A 60-column satellite that mixes name, address, credit, and segment data rewrites the world every time one column changes.
Adding business rules to the raw vault. Business rules belong in the business vault or the marts. The raw vault mirrors the source system, full stop. Mixing the two breaks the audit trail.
Treating the vault as the consumption layer. Analysts who query the vault directly write expensive multi-join queries that compete with the load workloads. The vault feeds marts; users query marts.
Case Study
Distributed-Source Consolidation on Snowflake
A global tech consulting firm replaced manual reconciliation across regional systems with governed, centralized Snowflake data; reconciliation effort fell 60% and distributed teams gained real-time operational visibility, the same multi-source pattern data vault targets.
How Cloud Warehouses Implement Data Vault
Snowflake, Databricks, and Microsoft Fabric all support data vault patterns, with different strengths that influence which platform a project picks. Knowing the platform shape is the difference between a vault that scales and one that fights the engine.
Snowflake runs data vault natively. Hash functions, micro-partitions, and clustering keys give the vault the pruning it needs in place of conventional indexes. Snowflake’s Time Travel feature complements satellite history at the storage layer, and zero-copy clones let teams test vault changes against a full production copy. Snowflake’s developer guide covers building a Data Vault 2.0 model on the platform end to end, including raw vault loading, point-in-time queries, and information mart construction.
Databricks runs data vault on top of Delta Lake, which provides ACID guarantees, time travel, and schema evolution on the storage layer. The lakehouse pattern lets teams load both structured and semi-structured data into the same vault, with Delta’s MERGE handling the insert-only logic efficiently. Databricks’ reference architecture covers data vault on the lakehouse, including DLT pipelines for incremental satellite loads.
| Dimension | Data Vault | Kimball Star Schema | Inmon 3NF |
| --- | --- | --- | --- |
| Core structure | Hubs, links, satellites | Facts and dimensions | Normalised entity tables |
| Best for | Multi-source enterprise warehouses with frequent change | Business intelligence and reporting marts | Stable enterprise reference data |
| Schema flexibility | High; add satellites without redesign | Low; fact and dimension changes are expensive | Low; 3NF refactors cascade |
| Audit trail | Built in via satellite history | Type 2 SCDs, partial | Requires separate audit tables |
| Load parallelism | Full parallel with hash keys | Limited by surrogate key dependency | Limited by referential cascades |
| Query complexity | High; consumed through marts | Low; designed for SQL | Medium |
Microsoft Fabric supports data vault through its data warehouse experience, with Lakehouse storage feeding the vault and Direct Lake serving marts to Power BI. Fabric’s OneLake unifies storage across the vault layers, which removes most of the cross-engine copy work that on-premises vaults used to need. Teams already running parallel platforms can use Kanerika’s Databricks vs Snowflake comparison and the data warehouse to data lake migration guide to decide where the vault should live.
On all three platforms, the vault loads insert-only, queries through marts, and runs the same SQL patterns. The platform choice is usually driven by existing cloud commitments and team skills rather than vault-specific capability. Migration tools and automation accelerators determine project speed more than the engine itself.
When Data Vault Is the Wrong Choice
Data vault is not a default. The complexity earns its keep only when specific conditions are present, and a vault built without those conditions becomes a maintenance burden that ships nothing.
Skip data vault when you have fewer than three significant source systems. The integration value of hubs and links is what makes the pattern pay off. With one or two sources, a Kimball star schema ships faster and runs cheaper.
Skip data vault when your sources are stable and rarely change schema. The pattern’s flexibility is its main lever. A warehouse fed by mature, slow-moving systems does not need the absorption capacity of satellites.
Skip data vault when the warehouse is BI-first and analysts query it directly. The vault is a foundation for marts, not a query target. Teams that try to use the vault as the consumption layer end up with the worst of both worlds: dimensional query patterns running against a normalised core.
Skip data vault when your team has no prior data warehouse experience. The pattern has a learning curve, and the wrong calls in the first sprint cascade into rework. Bring in vault-certified consultants or train the team properly before starting; do not learn it on a live enterprise project. Vault also presumes named ownership of hubs, links, and satellites, so the operating model leans on the same data stewardship roles your warehouse already needs.
Building a Production Data Vault with Kanerika
Kanerika designs, builds, and operates data vault warehouses for enterprises across financial services, healthcare, manufacturing, retail, and logistics. Our data engineering teams have shipped vaults on Snowflake, Databricks, Microsoft Fabric, and on-premises SQL Server, and the work usually follows the same five stages.
The first stage is the source assessment. We catalogue every operational system feeding the warehouse, map the business keys across them, and identify the conformity rules that determine which keys collapse into a single hub. This stage outputs the hub list, the link map, and the satellite inventory before any code is written.
The second stage is the vault design. Hubs, links, and satellites get named, hashed, and documented. Load patterns get defined for full extracts, incremental change data capture, and late-arriving records. The design stage produces the dbt models, the data dictionary, and the load orchestration plan. Teams running on Databricks can pair this with Kanerika’s Databricks dbt reference setup.
The third stage is the build. Kanerika uses automation accelerators that generate the hub, link, and satellite DDL from the design metadata, which compresses what used to be a six-month hand-build into a six-week assembly. The accelerators are platform-specific: Snowflake gets one toolkit, Databricks another, Fabric a third. Where teams are migrating off legacy stacks, our DataStage migration and Informatica to Fabric guides cover the source-system reload patterns the vault depends on.
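The general idea of generating DDL from design metadata can be sketched in a few lines. This is not Kanerika's accelerator: the DESIGN dictionary, the hub_ddl and satellite_ddl helpers, and the naming scheme are all invented for illustration, and SQLite stands in for the target platform so the generated statements can be checked.

```python
import sqlite3

# Hypothetical design metadata; a real project would read this from the
# hub list and satellite inventory produced during source assessment.
DESIGN = {
    "hubs": ["customer", "product"],
    "satellites": [
        {"parent": "customer", "name": "details", "columns": ["name", "address"]},
        {"parent": "product", "name": "pricing", "columns": ["unit_price"]},
    ],
}


def hub_ddl(entity):
    """Every hub gets the same four columns, so only the name varies."""
    return (
        f"CREATE TABLE hub_{entity} ("
        f"{entity}_hk TEXT PRIMARY KEY, {entity}_bk TEXT NOT NULL, "
        "load_date TEXT NOT NULL, record_source TEXT NOT NULL)"
    )


def satellite_ddl(parent, name, columns):
    """Standard satellite columns plus the descriptive payload."""
    payload = ", ".join(f"{column} TEXT" for column in columns)
    return (
        f"CREATE TABLE sat_{parent}_{name} ("
        f"{parent}_hk TEXT NOT NULL, load_date TEXT NOT NULL, "
        f"record_source TEXT NOT NULL, {payload}, "
        f"PRIMARY KEY ({parent}_hk, load_date))"
    )


statements = [hub_ddl(hub) for hub in DESIGN["hubs"]]
statements += [satellite_ddl(**sat) for sat in DESIGN["satellites"]]

# Run the generated DDL to confirm it is valid SQL.
conn = sqlite3.connect(":memory:")
for statement in statements:
    conn.execute(statement)
generated = sorted(
    row[0] for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table'"))
print(generated)
```

Because hubs, links, and satellites each follow a fixed column template, most of the build reduces to filling in names, which is why this layer lends itself to generation.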
The fourth stage is the governance and operate phase. The vault gets data quality checks, lineage tracking, and the audit reporting that compliance teams expect. Kanerika integrates data vaults with our data engineering practice so the audit trail and the policy enforcement live in the same place.
The fifth stage is the mart enablement. Information marts get built on top of the vault for finance, operations, sales, and supply chain teams. Each mart pulls from the same vault, which means cross-functional reports finally agree on the underlying numbers.
Three guardrails matter for every project. We never let analysts query the raw vault directly because the join cost is too high and the maintenance signal gets lost. We never let business rules drift into the raw vault because the audit trail breaks the moment they do. We never let satellites grow past 30 columns because narrow satellites are the difference between a vault that runs in 20 minutes and one that runs in 4 hours.
For an enterprise client running multiple disconnected systems, Kanerika consolidated the source landscape into a single integrated warehouse layer that supports operations, finance, and analytics in parallel. The team now ships new sources in two-week sprints rather than the quarterly redesigns the previous warehouse required. The complete operational data integration case study documents how the multi-source consolidation pattern played out in production.
Kanerika’s data engineers are certified across Snowflake, Databricks, and Fabric, and our consulting practice has been recognised by Forbes, Inc. 5000, and Great Place to Work. We also partner with the leading data platform vendors, which gives us early access to the automation tooling that compresses vault delivery timelines.
Frequently Asked Questions
What is data vault modeling?
Data vault modeling is a database modeling pattern for the enterprise data warehouse core layer. It uses three table types (hubs for unique business keys, links for relationships between hubs, and satellites for descriptive history) so the warehouse can absorb new sources, preserve a complete audit trail, and load in parallel without redesign. Dan Linstedt published the pattern in 2000 and released Data Vault 2.0 in 2013.
What is the difference between data vault and Kimball star schema?
Data vault separates business keys, relationships, and descriptive attributes into hubs, links, and satellites for the enterprise core, while Kimball star schemas combine measurements and context into facts and dimensions for fast BI queries. Data vault absorbs change and history; Kimball delivers query speed. Most modern warehouses use data vault for the core and Kimball marts on top for consumption.
When should you use data vault modeling?
Use data vault when you have three or more major source systems, regulatory audit requirements like SOX or HIPAA, or a warehouse expected to last more than five years. Skip it when you have one or two stable sources, when analysts query the warehouse directly, or when the team has no prior warehouse experience. The complexity earns its keep only when one of the first set of conditions holds.
What are hubs, links, and satellites?
A hub stores the unique list of business keys for one entity, such as a customer number or product SKU, with no descriptive attributes. A link captures the many-to-many relationship between two or more hubs, like a customer placing an order. A satellite carries the descriptive attributes and their history, time-stamped by load date, attached to a hub or a link. All three are insert-only.
What is Data Vault 2.0?
Data Vault 2.0, released by Dan Linstedt in 2013, is the current standard. It replaced sequence-based surrogate keys with hash keys derived from the business key, which lets every table compute its keys independently and unlocks full parallel loading. It also formalised the agile delivery methodology and added patterns for big-data and cloud-warehouse integration on Snowflake, Databricks, and Microsoft Fabric.
Can data vault run on Snowflake, Databricks, and Microsoft Fabric?
Yes. Snowflake exposes native hash functions, micro-partitions, and zero-copy clones that complement the vault’s insert-only pattern. Databricks runs vault on Delta Lake with ACID, time travel, and DLT pipelines for incremental satellite loads. Microsoft Fabric pairs Lakehouse storage with Direct Lake to Power BI marts. The same hash-key SQL patterns run on all three platforms with no architectural changes, although the hash function names differ by SQL dialect.
Does data vault replace a Kimball data warehouse?
No, it sits underneath one. Data vault is the enterprise core layer that absorbs source-system change and preserves history. Information marts built on top of the vault use Kimball star schemas tuned for BI tools and analyst SQL. The vault is where the truth lives and the marts are where the speed lives, so the two patterns work together rather than competing.
What are common mistakes in data vault projects?
The most common failure modes are letting analysts query the raw vault directly, using sequence keys instead of hash keys, skipping the staging layer, letting satellites grow past 30 columns, adding business rules to the raw vault, and treating the vault as the consumption layer. Each one is preventable with a hard rule applied at design time and an architecture review before code ships.