
Databricks is a unified analytics platform built on Apache Spark, offering a Lakehouse architecture that combines data lake flexibility with data warehouse performance. Unity Catalog provides centralized governance for all data assets.

[Architecture diagram: Tenzir writes Parquet files to a staging area in cloud object storage, triggers COPY INTO via the SQL Statement API, and the SQL Warehouse reads the staged files and writes Delta data to managed tables governed by Unity Catalog.]

Tenzir sends events to Databricks using the to_databricks operator, which writes to managed Delta tables with full Unity Catalog governance.

The operator stages optimized Parquet files in a Unity Catalog Volume and commits them via COPY INTO using the SQL Statement API.
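Conceptually, the commit is a call to the SQL Statement Execution API (`POST /api/2.0/sql/statements`). The following Python sketch shows the shape of such a request; the helper name, the `wait_timeout` value, and the placeholder statement are illustrative, not the operator's actual internals:

```python
# Sketch of a SQL Statement Execution API request. The helper name and the
# wait_timeout value are illustrative assumptions, not operator internals.
def build_statement_request(workspace: str, warehouse_id: str, statement: str) -> dict:
    """Assemble URL and JSON body for POST /api/2.0/sql/statements."""
    return {
        "url": f"{workspace}/api/2.0/sql/statements",
        "json": {
            "warehouse_id": warehouse_id,
            "statement": statement,
            "wait_timeout": "30s",  # assumption: wait briefly for completion
        },
    }

req = build_statement_request(
    "https://adb-1234567890.azuredatabricks.net",
    "abc123def456",
    "COPY INTO analytics.bronze.app_events FROM ... FILEFORMAT = PARQUET",
)
```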

| Step | Who | Action |
|------|-----|--------|
| 1 | Tenzir | Write Parquet to staging |
| 2 | Tenzir → SQL API | POST COPY INTO command |
| 3 | SQL Warehouse | Read staged files (via Unity Catalog permissions) |
| 4 | SQL Warehouse | Write to table and commit _delta_log |
| 5 | Tenzir | Delete staging files (not shown in diagram) |
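The COPY INTO command the operator POSTs corresponds roughly to the following statement (catalog, schema, table, and path names are placeholders, and the exact options may differ):

```sql
COPY INTO my_catalog.my_schema.my_table
FROM '/Volumes/my_catalog/my_schema/staging/<batch-directory>'
FILEFORMAT = PARQUET
FORMAT_OPTIONS ('mergeSchema' = 'true')
COPY_OPTIONS ('mergeSchema' = 'true');
```

COPY INTO records which files it has already loaded, which is what makes the commits idempotent.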

Key characteristics:

  • Produces fully managed Delta tables with complete Unity Catalog governance
  • Automatic schema evolution via mergeSchema
  • Idempotent commits (COPY INTO tracks processed files)
  • Data is queryable immediately after each commit

Tenzir uses OAuth machine-to-machine (M2M) authentication with a Databricks service principal. Use a secret store to manage credentials securely.
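Under the hood, OAuth M2M exchanges the service principal's client ID and secret for a short-lived workspace access token via the client_credentials grant. A minimal sketch of that token request (the helper name is hypothetical):

```python
import base64

# Hypothetical helper: assemble the OAuth M2M token request for a Databricks
# workspace. The client_credentials grant with scope "all-apis" yields a
# short-lived access token used for subsequent API calls.
def build_token_request(workspace: str, client_id: str, client_secret: str) -> dict:
    credentials = base64.b64encode(f"{client_id}:{client_secret}".encode()).decode()
    return {
        "url": f"{workspace}/oidc/v1/token",
        "headers": {
            "Authorization": f"Basic {credentials}",
            "Content-Type": "application/x-www-form-urlencoded",
        },
        "data": {"grant_type": "client_credentials", "scope": "all-apis"},
    }
```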

  1. Create a service principal

    Navigate to Account Console → User management → Service principals and create a new service principal. Note the Application (client) ID.

  2. Generate an OAuth secret

    In the service principal’s Secrets tab, generate a new secret. Copy the secret immediately—it’s only shown once.

  3. Grant Unity Catalog permissions

    -- Catalog access
    GRANT USE CATALOG ON CATALOG my_catalog TO `my-service-principal`;
    -- Schema access and table creation
    GRANT USE SCHEMA ON SCHEMA my_catalog.my_schema TO `my-service-principal`;
    GRANT CREATE TABLE ON SCHEMA my_catalog.my_schema TO `my-service-principal`;
    -- Staging volume access
    GRANT READ VOLUME, WRITE VOLUME
    ON VOLUME my_catalog.my_schema.staging
    TO `my-service-principal`;
  4. Configure a SQL Warehouse

    Create or identify a SQL Warehouse. Serverless warehouses are recommended for ingestion due to fast startup and automatic scaling. Note the warehouse ID from the connection details.

Create a Unity Catalog Volume to stage Parquet files:

CREATE VOLUME IF NOT EXISTS my_catalog.my_schema.tenzir_staging;

The operator writes files to this volume and deletes them after successful COPY INTO commits.

Delta tables use Hive-style partitioning, creating a directory structure like day=2025-01-15/class_uid=4001/. The partition_by parameter accepts a list of column names—columns must exist in the events being written.

To partition by derived values (e.g., daily buckets from a timestamp), add the partition column to your events before writing:

subscribe "events"
day = time.round(1d)
to_databricks
  ...
  partition_by=[day, class_uid]

If the target table already exists, the operator queries Unity Catalog for the existing partition scheme and aligns staging files accordingly. The partition_by parameter is ignored for existing tables since partition columns are immutable after table creation.
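To see which partition scheme an existing table uses, for example before wiring up a pipeline, you can inspect it yourself (names are placeholders):

```sql
DESCRIBE TABLE EXTENDED my_catalog.my_schema.my_table;
```

Partition columns appear in the output under the partition information section.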

Tenzir applies optimizations that produce query-efficient files:

  • Partition-aligned files: Each output file contains data for exactly one partition value, enabling partition pruning—queries filtering on partition columns skip irrelevant directories without opening files.
  • Sorted rows: Tight min/max statistics enable aggressive data skipping.
  • 1GB file targets: Optimal size for scan efficiency, reduces need for OPTIMIZE.
  • Zstd compression: High compression ratio with fast decompression.
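The payoff shows up in queries that filter on partition and sort columns. A query like the following (names are illustrative) prunes directories via the day filter and skips files via min/max statistics on the sorted IP columns:

```sql
SELECT count(*)
FROM security.silver.security_events
WHERE day = DATE'2025-01-15'
  AND src_endpoint.ip = '10.0.0.1';
```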
Send events from a JSON file to a managed Delta table:

let $client_id = secret("databricks-client-id")
let $client_secret = secret("databricks-client-secret")
from_file "/var/log/app/events.json"
to_databricks
  workspace="https://adb-1234567890.azuredatabricks.net",
  catalog="analytics",
  schema="bronze",
  table="app_events",
  client_id=$client_id,
  client_secret=$client_secret,
  warehouse_id="abc123def456"
Partition and sort OCSF events for efficient security analytics:

let $client_id = secret("databricks-client-id")
let $client_secret = secret("databricks-client-secret")
subscribe "ocsf"
day = time.round(1d)
to_databricks
  workspace="https://dbc-a1b2c3d4-e5f6.cloud.databricks.com",
  catalog="security",
  schema="silver",
  table="security_events",
  client_id=$client_id,
  client_secret=$client_secret,
  warehouse_id="abc123def456",
  partition_by=[day, class_uid],
  sort_by=[src_endpoint.ip, dst_endpoint.ip]

Optimize for cost with larger batches and files:

let $client_id = secret("databricks-client-id")
let $client_secret = secret("databricks-client-secret")
subscribe "netflow"
day = time.round(1d)
to_databricks
  workspace="https://1234567890123456.7.gcp.databricks.com",
  catalog="network",
  schema="bronze",
  table="flows",
  client_id=$client_id,
  client_secret=$client_secret,
  warehouse_id="abc123def456",
  partition_by=[day],
  sort_by=[src_ip, dst_ip],
  flush_interval=15m,
  file_size=1Gi

Cribl Stream writes files to Unity Catalog Volumes but does not commit them to Delta tables. Customers must separately configure Autoloader or run COPY INTO to make data queryable.

Tenzir’s to_databricks commits directly to managed Delta tables—data is queryable immediately after each flush with no additional setup required.

Autoloader monitors cloud storage for new files and incrementally loads them. It requires files to already exist in cloud storage, making it a “pull” model.

Tenzir’s to_databricks is a “push” model—data flows directly from pipelines to Databricks without intermediate storage management. Use Autoloader when data already lands in cloud storage from other sources.

DLT provides managed ETL pipelines within Databricks. Tenzir complements DLT by handling data collection and initial ingestion, while DLT handles downstream transformations within the lakehouse.

The Kafka connector requires managing Kafka infrastructure. Tenzir can ingest directly from sources, eliminating the intermediate message broker for simpler architectures.
