Liquid Clustering in Databricks: Keys, Speed & vs Partitioning

TL;DR

Liquid clustering in Databricks organizes Delta table data by the columns queries actually filter on, replacing Hive-style partitioning and ZORDER with a single adaptive method that keeps data laid out efficiently as it’s ingested.

Most slow Databricks queries are not a compute problem. They are a data layout problem.

When a Delta table scatters the rows you filter on across thousands of files, every query reads far more data than it needs. Adding a bigger cluster only rents more horsepower to read the same wasted bytes.

Liquid clustering in Databricks fixes the layout itself. It is a data layout optimization technique that groups related rows into the same files based on the columns you actually query, so the engine skips the files that cannot match and scans a fraction of the table.

Liquid clustering replaces both Hive-style partitioning and ZORDER with a single, adaptive approach. You set clustering keys once, change them later without rewriting history, and let Databricks keep newly ingested data organized as part of routine maintenance.

This guide explains what liquid clustering does, how clustering keys work, when it beats partitioning and Z-ordering, and how to roll it out on real Delta tables without surprises. It stays focused on table data layout and query performance. How you schedule and orchestrate the jobs that load those tables is a separate topic.

Key Takeaways

Liquid clustering is a Delta Lake data layout technique that groups rows by clustering keys so Databricks reads only the files a query can match.
It replaces both Hive-style partitioning and ZORDER, and unlike them you can change clustering keys later without rewriting historical data.
Clustering keys should be the one to four high-cardinality columns your queries filter and join on most, not low-cardinality partition-style columns.
CLUSTER BY AUTO lets Databricks analyze query history on Unity Catalog managed tables and choose and update the keys for you.
Liquid clustering needs routine OPTIMIZE to keep newly ingested data clustered, with OPTIMIZE FULL to re-cluster a legacy table once.
Kanerika, a Databricks partner, tunes Delta table layouts so query performance is engineered in, with results like 78% less data latency and 45% revenue growth.

What Is Liquid Clustering in Databricks?

Liquid clustering is a data layout optimization technique for Delta Lake tables that automatically organizes data on disk based on columns you choose, called clustering keys. Instead of writing rows in arrival order or forcing them into rigid folder structures, Databricks groups rows with similar key values into the same set of files.

When a query filters on a clustering key, the engine reads file statistics, skips every file that cannot contain matching rows, and scans only the files that can. That is data skipping, and a good layout is what makes it effective.
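
As a rough illustration, the query below is the kind of filter data skipping helps. It assumes a hypothetical orders table clustered on customer_id and order_date; the table name, columns, and literal values are placeholders, not a real schema.

```sql
-- Hypothetical table and values, for illustration only.
-- Both predicates are on clustering keys, so Databricks can compare them
-- against per-file min/max statistics and skip files that cannot match.
SELECT id, amount
FROM orders
WHERE customer_id = 'C-1001'
  AND order_date >= DATE'2025-01-01';
```

A filter on a column that is not a clustering key gains far less from the layout, because matching rows are not grouped into the same files.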

The feature is generally available in Databricks Runtime 15.2 and above and works on Delta tables in the Databricks lakehouse architecture. It is a property of the table, not of a single query, so once a table is clustered, every reader benefits from the improved layout. This is the same disk-level organization that drives much of Databricks performance optimization.

Three properties make liquid clustering different from the layout techniques it replaces. First, the layout is adaptive rather than fixed, so it handles skew and changing data volumes instead of breaking when one value dominates. Second, you can change clustering keys at any time and Databricks reorganizes data incrementally, with no full table rewrite required.

Third, on Unity Catalog managed tables you can hand key selection to Databricks entirely with automatic clustering. These combine into a layout that you set once and rarely revisit, which is a real shift for teams used to babysitting partitions inside a data lakehouse.

How Clustering Keys Work

A clustering key is one or more columns that Databricks uses to decide which rows land in which files. You pick the columns your queries filter and join on most often, and Databricks colocates rows that share key values. The result is that a filter like a date range or a specific customer touches only a handful of files instead of the whole table. Choosing keys well is the single most important decision in liquid clustering, because the keys are what the engine prunes against.

You declare keys at table creation or add them to an existing table. Enabling clustering is a one-line change, and unlike partitioning it does not lock your design in:

-- Define keys when you create the table
CREATE TABLE orders (id INT, customer_id STRING, order_date DATE, amount DECIMAL(10,2))
CLUSTER BY (customer_id, order_date);

-- Add or change keys on an existing table, no rewrite of history
ALTER TABLE orders CLUSTER BY (order_date, customer_id);

-- Let Databricks choose and update keys for you (Unity Catalog managed tables)
CREATE TABLE events (id INT, user_id STRING, event_type STRING)
CLUSTER BY AUTO;

A few rules keep clustering effective. Keep the key list to one through four columns, because each extra key dilutes data skipping and slows single-key lookups.

Clustering columns must have statistics collected, and by default Delta collects statistics on the first 32 columns of a table, so order matters or you adjust the statistics configuration. High-cardinality columns such as customer_id or transaction_id are excellent keys because they let the engine isolate point lookups, which is the opposite of the guidance you would follow for partition columns.
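
If a clustering key sits past the 32-column boundary, one way to adjust the statistics configuration is through Delta table properties. The sketch below assumes the orders table from the earlier example; the value 40 is arbitrary, and the two options are alternatives rather than steps to run together.

```sql
-- Option 1: raise the number of leading columns that get statistics.
-- The value 40 is an arbitrary example, not a recommendation.
ALTER TABLE orders
SET TBLPROPERTIES ('delta.dataSkippingNumIndexedCols' = '40');

-- Option 2: name the statistics columns explicitly instead of relying on
-- column position (available on recent Databricks Runtime versions).
ALTER TABLE orders
SET TBLPROPERTIES ('delta.dataSkippingStatsColumns' = 'customer_id,order_date');
```

Changing these properties affects files written afterwards. Statistics for files that already exist are not recomputed automatically, so check the Databricks documentation for the recompute step that applies to your runtime.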

Automatic clustering, enabled with CLUSTER BY AUTO, lets Databricks analyze historical query patterns on a managed table and choose, then update, the clustering keys for you. It leans on predictive optimization to apply the right keys as workloads shift, which removes the manual tuning loop entirely. For teams that cannot constantly review query logs, this is the lowest-effort path to a well-organized table, and it fits the kind of governed, self-maintaining estate that Unity Catalog is built to support.
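
For a table that already exists, a minimal sketch of switching to automatic key selection looks like this. It assumes the events table from the example above is a Unity Catalog managed table and that predictive optimization is enabled for it.

```sql
-- Hand key selection to Databricks on an existing managed table
ALTER TABLE events CLUSTER BY AUTO;

-- Inspect the table details, including the clustering columns in effect
DESCRIBE DETAIL events;
```

Checking the table details from time to time is a simple way to see which keys Databricks has chosen as the workload changes.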

Liquid Clustering vs Partitioning vs Z-Ordering

For years, Databricks teams reached for two layout tools: Hive-style partitioning to split a table into folders, and ZORDER to colocate related data within files. Both work, and both carry real drawbacks that liquid clustering was designed to remove.

Partitioning forces you to commit to a fixed column up front, and a poor choice produces either too many tiny files or a handful of giant ones, the over-partitioning and under-partitioning problem. ZORDER improves clustering inside files but has to be re-run as a full rewrite, and ordinary insert, update, and delete operations gradually break the order it created.

Liquid clustering combines what each did well while dropping the rigidity. It organizes data like ZORDER, adapts to skew like a good partition scheme should but rarely does, and lets you change keys without the rewrite that both older methods demand. It uses a Z-order curve for single-column layouts and a Hilbert curve for two or more columns, which clusters multi-column data more effectively than ZORDER alone. The table below lays out the practical differences.

Aspect	Partitioning	Z-Ordering	Liquid Clustering
Layout structure	Fixed folders per partition value	Sorted within files, no folders	Adaptive file groups by key
Change the key later	Requires full table rewrite	Re-run ZORDER as a rewrite	No rewrite, incremental
Skew handling	Breaks on uneven values	Partial, degrades over writes	Adapts to skew automatically
High-cardinality columns	Poor, causes small files	Reasonable	Strong, ideal for point lookups
Maintenance	Manual repartitioning	Manual periodic ZORDER	Routine OPTIMIZE, can be automatic
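
To make the comparison concrete, here is how each approach is declared. The three sales tables are hypothetical and exist only to show the syntax side by side.

```sql
-- 1. Hive-style partitioning: one folder per sale_date value, fixed at creation
CREATE TABLE sales_partitioned (id INT, customer_id STRING, sale_date DATE)
PARTITIONED BY (sale_date);

-- 2. Z-ordering: applied by a maintenance command on an unpartitioned table
CREATE TABLE sales_zordered (id INT, customer_id STRING, sale_date DATE);
OPTIMIZE sales_zordered ZORDER BY (customer_id);

-- 3. Liquid clustering: keys declared on the table and changeable later
CREATE TABLE sales_clustered (id INT, customer_id STRING, sale_date DATE)
CLUSTER BY (customer_id, sale_date);
```

Only the third form keeps the layout decision on the table itself while still letting you change it with ALTER TABLE.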

The practical takeaway is that Databricks recommends liquid clustering for new Delta tables, while partitioning still has narrow uses. Databricks also recommends not partitioning tables under roughly one terabyte, and liquid clustering scales well above that size. The same disk-level thinking shows up across the platform, from Databricks real-time analytics to a tuned data intelligence platform. If you are weighing platforms more broadly, our Databricks vs Snowflake and Microsoft Fabric vs Databricks comparisons cover the wider tradeoffs, and Fabric exposes the same idea through Microsoft Fabric lakehouse tables.

When to Use Liquid Clustering, and When Not To

Liquid clustering is the right default for most Delta tables, but a few workload shapes still favor partitioning, and knowing the boundary saves rework. Reach for liquid clustering when your situation matches the patterns below, all of which the older methods handle badly.

Query patterns change over time. When the columns teams filter on keep shifting, the no-rewrite key change is worth the switch on its own.
High-cardinality filters. Point lookups on columns like customer_id or transaction_id cluster cleanly, where partitioning would explode into tiny files.
Skewed data. When one value dominates, adaptive file groups stay balanced instead of producing one oversized partition.
Concurrent writes. Tables receiving frequent overlapping inserts and updates stay organized without manual repartitioning.
Small-file risk. Datasets that would fragment under traditional partitioning are far less prone to the small-file problem.

There are still cases to stay with Hive-style partitioning. Keep partitioning when downstream systems explicitly require a partitioned folder layout, when workloads depend heavily on metadata-only aggregations over partition values, or when you run highly selective single-partition queries such as reading only today’s data. Liquid clustering is not compatible with partitioning or ZORDER on the same table, so a table uses one approach, not a blend. This decision sits at the same level as other data layout calls across the lakehouse, including how you organize a data lake versus a lakehouse and where governed tables live under Databricks data lineage.

If you want the decision in one view, the matrix below maps common table situations to the layout that fits, so you can match your own tables to a row before you commit a key.

Your table situation	Use liquid clustering	Stay with partitioning	Why
Filters and joins on high-cardinality columns like customer_id	Yes	No	Clustering isolates point lookups, while partitioning on these columns fragments into tiny files
The columns teams filter on keep shifting quarter to quarter	Yes	No	You can change clustering keys with no rewrite, so the layout follows the workload
Table is under roughly one terabyte	Yes	No	Databricks advises against partitioning small tables and recommends clustering for new Delta tables
A downstream system reads a fixed partitioned folder layout	No	Yes	That tool expects the folder structure partitioning produces, which clustering does not create
Workload leans on metadata-only counts over partition values	No	Yes	Partition metadata answers these without scanning files, an edge clustering cannot match
Queries almost always read a single partition, such as only today’s data	No	Yes	A highly selective single-partition read is already as fast as it gets on a partitioned table

How to Enable and Maintain Liquid Clustering

Rolling out liquid clustering is mostly about two things: picking keys and keeping new data clustered. Enabling it is a single clause, but the layout only stays healthy if OPTIMIZE runs on a schedule, because newly written files are clustered incrementally rather than instantly.

The maintenance model is simple. Writes cluster new data only on a best-effort basis, so some of it lands unclustered and is organized when OPTIMIZE runs, which makes a regular OPTIMIZE job part of owning a clustered table. If you enable liquid clustering on a legacy table that already holds data and you want to re-cluster the existing history, run OPTIMIZE FULL once to reorganize everything, then return to routine OPTIMIZE for incremental upkeep.

-- Cluster newly written data incrementally (run on a schedule)
OPTIMIZE orders;

-- Re-cluster all historical data after first enabling clustering on a legacy table
-- (the FULL keyword follows the table name)
OPTIMIZE orders FULL;

A short rollout sequence keeps the change low-risk. Profile your real query logs to find the columns that actually appear in filters and joins, since those are your candidate keys, not the columns you assume matter. Enable clustering with one to four of those columns, or use CLUSTER BY AUTO on managed tables to let Databricks decide. Schedule OPTIMIZE so new data stays organized, then measure data scanned per query before and after to confirm the layout is pruning files. This profiling-first discipline is the same one behind a healthy data pipeline optimization effort and shows up whenever teams work through a Databricks troubleshooting guide. Teams moving off older stacks usually fold this into a wider legacy systems to Databricks migration or an Informatica to Databricks migration.
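
Two inspection commands help confirm the rollout took effect. This is a minimal sketch that assumes the orders table from the earlier examples.

```sql
-- Confirm which clustering keys are set on the table
DESCRIBE DETAIL orders;

-- Review recent operations to confirm OPTIMIZE is actually running on schedule
DESCRIBE HISTORY orders LIMIT 20;
```

If no OPTIMIZE operations appear in the history, newly written data is probably drifting out of layout.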

Common mistakes are easy to avoid once you know them. Three trip teams up most often:

Picking low-cardinality keys. Treating clustering keys like partition columns wastes most of the benefit.
Skipping OPTIMIZE. New data drifts out of layout until queries slow down again.
Adding too many keys. Databricks allows at most four, and every extra key spreads data more thinly, so single-key skipping suffers.

Each of these traces back to data layout discipline, the same discipline that keeps a Databricks deployment fast as it grows and supports reliable Databricks security and governance on top.

Migrating an Existing Partitioned Table to Liquid Clustering

Most teams meet liquid clustering on a table that is already partitioned or already running ZORDER, not on a greenfield table. You cannot blend the approaches, since liquid clustering is not compatible with partitioning or ZORDER on the same table, so migration means switching the table over rather than layering one on top of the other. The path is short, but the order of steps matters so you do not leave the table half-organized.

Databricks supports converting a partitioned table by recreating it with clustering keys and rewriting the data into the new layout, after which you drop the old partition scheme. The official Databricks documentation on migrating from partitioning or Z-order walks through the supported conversion, and the practical sequence on a real table looks like this.

Pick the new keys from query logs, not from the old partition column. The column you partitioned on is often a low-cardinality date that makes a poor clustering key, so profile filters and joins fresh instead of carrying the old choice forward.
Create the clustered table and backfill. Define CLUSTER BY on the new table, then write the historical data in. New tables take clustering cleanly because every file is written under the new layout.
Run OPTIMIZE FULL once. On a table that already holds data, OPTIMIZE FULL reorganizes the entire history into clustered files, after which routine OPTIMIZE keeps new data in line.
Verify before you cut over. Compare data scanned per query on the old and new tables on the same workload, so the switch is backed by a number rather than a hope.
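
The steps above can be sketched as follows. The names orders_partitioned and orders_clustered are hypothetical, the keys stand in for whatever your query logs point to, and the FULL keyword is assumed to be available on your runtime (Databricks documents it for Databricks Runtime 16.0 and above).

```sql
-- Create the clustered table and backfill it from the legacy table
CREATE TABLE orders_clustered
CLUSTER BY (customer_id, order_date)
AS SELECT * FROM orders_partitioned;

-- Reorganize all existing data into clustered files
OPTIMIZE orders_clustered FULL;

-- Sanity-check the backfill before comparing query profiles
SELECT
  (SELECT COUNT(*) FROM orders_partitioned) AS legacy_rows,
  (SELECT COUNT(*) FROM orders_clustered)   AS clustered_rows;
```

Matching row counts only show that the backfill is complete. The before-and-after comparison of data scanned per query is still what justifies the cutover.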

This conversion usually rides along with a larger modernization, which is why it shows up inside a legacy systems to Databricks migration or an Informatica to Databricks migration rather than as a standalone task. Treat it as a layout decision made once, with the same care you would give any change to the Databricks lakehouse architecture underneath your tables.

Limitations and How to Confirm Clustering Is Actually Working

Liquid clustering is the right default for most tables, but it is not universal, and a clustered table can still underperform if you never check that it is pruning files. Two questions settle most rollouts: what the feature does not support, and how you prove it is helping.

On the support side, a handful of restrictions are worth knowing before you commit a table. Per the Azure Databricks documentation on liquid clustering limitations, streaming tables and materialized views created from a declarative pipeline are not supported, and tables that use Delta Sharing with partition filtering are not supported either.

Clustering keys also need statistics, and since Delta collects statistics on the first 32 columns of a table by default, a key that sits past that boundary will not prune well unless you adjust the statistics configuration. None of these block the common case, but they decide a few tables that should stay on another layout.

The more frequent problem is a table that is technically clustered yet still slow, almost always because the keys were chosen like partition columns or because OPTIMIZE never runs. Practitioners who have hit this describe the same trap: treat liquid clustering like ZORDER, pick low-cardinality columns, skip OPTIMIZE, and queries get slower rather than faster, a pattern the Databricks community guidance on when to use liquid clustering calls out directly. The fix is to verify, not assume.

Confirming the layout works is a measurement loop, not a feeling. Capture data scanned per query on a representative workload before clustering, enable the keys and run OPTIMIZE, then re-run the same queries and read the bytes scanned and files pruned from the query profile.

A clustered table on a good key should read a fraction of the files it did before. If the number barely moves, your keys do not match how the table is queried, and you re-profile rather than add more keys. This proof loop is the same discipline that keeps Databricks performance optimization grounded in evidence and that surfaces in any honest Databricks troubleshooting guide.
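
One way to pull those numbers without opening each query profile is the query history system table. The sketch below assumes your workspace exposes system.query.history, that you have access to it, and that it carries the read_bytes, read_files, and pruned_files columns. Verify the column names against your workspace before relying on them.

```sql
-- Assumes the workload queries mention the table name 'orders' in their text
SELECT
  statement_text,
  start_time,
  read_bytes,
  read_files,
  pruned_files
FROM system.query.history
WHERE statement_text ILIKE '%orders%'
  AND start_time >= current_timestamp() - INTERVAL 7 DAYS
ORDER BY start_time DESC
LIMIT 50;
```

Run the same workload before and after clustering and compare the rows. A well-chosen key should show read_bytes falling and pruned_files rising for the same statements.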

How Kanerika Helps Teams Get Data Layout Right

Choosing clustering keys, deciding where partitioning still earns its place, and proving the gain in data scanned per query is work that rewards experience with real workloads. As a Databricks partner, Kanerika designs and tunes Delta table layouts as part of building and modernizing lakehouse platforms, so query performance is engineered in rather than chased after launch.

Our delivery follows a repeatable layout-tuning path on each engagement:

Assess. Profile real query logs and the current file layout to find the tables where data skipping is failing and bytes scanned are highest.
Design. Choose one to four clustering keys per table from actual filter and join patterns, and decide where partitioning still earns its place.
Build. Enable clustering, backfill with OPTIMIZE FULL where needed, and fold legacy tables in through migration accelerators like FLIP.
Govern. Wire the clustered estate into Unity Catalog so lineage, access, and predictive optimization stay consistent.
Enable. Schedule OPTIMIZE, set CLUSTER BY AUTO where it fits, and hand teams the before-and-after numbers so they can own the layout.

That same data-layout discipline carries into platform selection across Databricks, Snowflake, and Fabric and into governed environments under Unity Catalog, Purview, and Collibra.

The results show up where it counts. A centralized analytics platform modernization cut data latency by 78%, and a real-time analytics build helped drive 45% revenue growth.

Both were grounded in getting the data layout and pipeline design right rather than over-provisioning compute. The recurring pitfall we watch for is teams treating clustering keys like partition columns or skipping OPTIMIZE, so we set the keys and cadence and measure the before-and-after, which keeps the layout decision backed by numbers, not guesswork.

Talk to Kanerika

Want Your Delta Tables Tuned the Right Way?

Kanerika scopes which tables benefit from liquid clustering, sets the keys and OPTIMIZE cadence, and measures the before-and-after. A short working session turns the layout question into a plan.

Schedule a Demo →

For deeper architecture choices, our guides on data lake vs data warehouse and Databricks Mosaic AI round out the picture.

Frequently Asked Questions

What is liquid clustering in Databricks?

Liquid clustering is a Delta Lake data layout optimization technique that organizes rows on disk based on columns you choose, called clustering keys. When a query filters on a clustering key, Databricks reads file statistics and skips every file that cannot contain matching rows, so it scans only a fraction of the table. It is a property of the table, so once a table is clustered, every reader benefits from the improved layout.

How is liquid clustering different from partitioning and Z-ordering?

Liquid clustering replaces both Hive-style partitioning and ZORDER with a single adaptive approach. Partitioning forces a fixed column choice up front and can produce too many tiny files or a few giant ones, while ZORDER must be re-run as a full rewrite and breaks down as rows are inserted, updated, and deleted. Liquid clustering adapts to skew, handles high-cardinality columns well, and lets you change keys without rewriting historical data.

How do I choose clustering keys?

Pick the one to four columns your queries filter and join on most often, found by profiling real query logs rather than assuming which columns matter. High-cardinality columns like customer_id or transaction_id make excellent keys because they isolate point lookups, which is the opposite of how you would pick partition columns. Keep the list short, because each extra key dilutes data skipping and slows single-key lookups.

What is CLUSTER BY AUTO in Databricks?

CLUSTER BY AUTO is automatic clustering for Unity Catalog managed tables. Databricks analyzes the table’s historical query patterns and chooses, then updates, the clustering keys for you, leaning on predictive optimization to apply the right keys as workloads shift. It removes the manual tuning loop and is the lowest-effort way to keep a table well organized when you cannot constantly review query logs.

Does liquid clustering replace OPTIMIZE?

No. Liquid clustering relies on OPTIMIZE to keep new data clustered. Writes cluster data only on a best-effort basis, so newly written files can land unclustered until OPTIMIZE organizes them, which makes a regular OPTIMIZE job part of owning a clustered table. If you enable clustering on a legacy table that already holds data, run OPTIMIZE FULL once to re-cluster the existing history, then return to routine OPTIMIZE for incremental upkeep.

When should I not use liquid clustering?

Stay with Hive-style partitioning when downstream systems explicitly require a partitioned folder layout, when workloads depend heavily on metadata-only aggregations over partition values, or when you run highly selective single-partition queries such as reading only today’s data. Liquid clustering is not compatible with partitioning or ZORDER on the same table, so a table uses one approach, not a blend.

How many clustering columns can a table have?

You can specify up to four clustering columns, and the columns must have statistics collected. By default Delta collects statistics on the first 32 columns of a table, so either keep clustering columns within that range or adjust the statistics configuration. In practice, a few well-chosen keys give the best results, since each extra key spreads data more thinly and weakens single-key data skipping.

Is liquid clustering good for large tables?

Yes. Databricks recommends liquid clustering for new Delta tables and advises against partitioning tables under roughly one terabyte, and liquid clustering scales well above that size. Large tables with skew, high-cardinality filters, concurrent writes, or query patterns that change over time benefit most, because adaptive file groups stay balanced where fixed partitioning would break or fragment into small files.

Authored by

Gaurav Verma | Chief Marketing Officer

Gaurav Verma brings 25+ years of B2B SaaS marketing expertise, helping brands sharpen positioning, build demand, and drive measurable growth in competitive markets.


Reviewed by

Shaurya Chauhan | Lead Software Engineer

Databricks Certified Data Engineer Professional and Lead Software Engineer at Kanerika, specializing in data engineering and analytics across Azure, Microsoft Fabric, Databricks, and Snowflake.
