Databricks is a unified analytics platform built on Apache Spark, offering a Lakehouse architecture that combines data lake flexibility with data warehouse performance. Unity Catalog provides centralized governance for all data assets.
Tenzir sends events to Databricks using the `to_databricks` operator, which writes to managed Delta tables with full Unity Catalog governance.
## Architecture

The operator stages optimized Parquet files in a Unity Catalog Volume and
commits them via `COPY INTO` using the SQL Statement API.
| Step | Who | Action | DBU Cost |
|---|---|---|---|
| ① | Tenzir | Write Parquet to staging | ❌ |
| ② | Tenzir → SQL API | POST COPY INTO command | ❌ |
| ③ | SQL Warehouse | Read staged files (via UC perms) | ✅ |
| ④ | SQL Warehouse | Write to table + commit _delta_log | ✅ |
| ⑤ | Tenzir | Delete staging files (not shown) | ❌ |
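The commit step (②) can be sketched as a request to the Databricks SQL Statement Execution API. The following Python sketch only assembles the statement and request payload; the table name, staging path, and warehouse ID are hypothetical placeholders, and the exact `COPY INTO` options Tenzir uses are an assumption for illustration:

```python
def build_copy_into_statement(table: str, staging_path: str) -> str:
    """Build a COPY INTO statement that commits staged Parquet files
    into a Delta table (illustrative options, not Tenzir's internals)."""
    return (
        f"COPY INTO {table} "
        f"FROM '{staging_path}' "
        f"FILEFORMAT = PARQUET "
        f"COPY_OPTIONS ('mergeSchema' = 'true')"
    )

def build_statement_request(warehouse_id: str, statement: str) -> dict:
    """Payload for POST /api/2.0/sql/statements on the workspace host."""
    return {
        "warehouse_id": warehouse_id,
        "statement": statement,
        "wait_timeout": "30s",
    }

# Hypothetical identifiers for illustration:
stmt = build_copy_into_statement(
    "my_catalog.my_schema.app_events",
    "/Volumes/my_catalog/my_schema/staging/batch-0001/",
)
payload = build_statement_request("abc123def456", stmt)
```

Because the warehouse only runs for the `COPY INTO` itself, DBUs accrue in steps ③ and ④ only.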
Key characteristics:

- Produces fully managed Delta tables with complete Unity Catalog governance
- Automatic schema evolution via `mergeSchema`
- Idempotent commits (`COPY INTO` tracks processed files)
- Data is queryable immediately after each commit
## Configuration

### Authentication

Tenzir uses OAuth machine-to-machine (M2M) authentication with a Databricks service principal. Use a secret store to manage credentials securely.
1. **Create a service principal**

   Navigate to Account Console → User management → Service principals and create a new service principal. Note the Application (client) ID.

2. **Generate an OAuth secret**

   In the service principal’s Secrets tab, generate a new secret. Copy the secret immediately; it is only shown once.

3. **Grant Unity Catalog permissions**

   ```sql
   -- Catalog access
   GRANT USE CATALOG ON CATALOG my_catalog TO `my-service-principal`;

   -- Schema access and table creation
   GRANT USE SCHEMA ON SCHEMA my_catalog.my_schema TO `my-service-principal`;
   GRANT CREATE TABLE ON SCHEMA my_catalog.my_schema TO `my-service-principal`;

   -- Staging volume access
   GRANT READ VOLUME, WRITE VOLUME
   ON VOLUME my_catalog.my_schema.staging
   TO `my-service-principal`;
   ```

4. **Configure a SQL Warehouse**

   Create or identify a SQL Warehouse. Serverless warehouses are recommended for ingestion due to fast startup and automatic scaling. Note the warehouse ID from the connection details.
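The M2M flow behind these credentials is a standard OAuth client-credentials exchange. The sketch below only assembles the token request; it is not the operator's implementation. The `/oidc/v1/token` path and `all-apis` scope follow Databricks' documented OAuth M2M flow, and the workspace URL and credentials are placeholders:

```python
import base64

def build_token_request(workspace: str, client_id: str, client_secret: str) -> dict:
    """Assemble an OAuth client-credentials request for a Databricks
    service principal. A client would POST this form to the token
    endpoint and cache the returned access token until it expires."""
    credentials = f"{client_id}:{client_secret}".encode()
    return {
        "url": f"{workspace}/oidc/v1/token",
        "headers": {
            "Authorization": "Basic " + base64.b64encode(credentials).decode(),
            "Content-Type": "application/x-www-form-urlencoded",
        },
        "body": "grant_type=client_credentials&scope=all-apis",
    }

# Placeholder workspace and credentials:
req = build_token_request(
    "https://adb-1234567890.azuredatabricks.net",
    "my-client-id",
    "my-client-secret",
)
```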
## Staging Volume Setup

Create a Unity Catalog Volume to stage Parquet files:

```sql
CREATE VOLUME IF NOT EXISTS my_catalog.my_schema.tenzir_staging;
```

The operator writes files to this volume and deletes them after successful
`COPY INTO` commits.
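Unity Catalog volumes are addressed as paths under `/Volumes/<catalog>/<schema>/<volume>`. A minimal sketch of how a writer might name staged files in that namespace (the naming scheme is an illustrative assumption, not the operator's actual convention):

```python
import uuid
from datetime import datetime, timezone

def staging_file_path(catalog: str, schema: str, volume: str) -> str:
    """Build a unique staging path inside a Unity Catalog volume.
    A timestamp plus random suffix avoids collisions between concurrent
    writers; COPY INTO tracks each file name, keeping commits idempotent."""
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%S")
    return f"/Volumes/{catalog}/{schema}/{volume}/{stamp}-{uuid.uuid4().hex}.parquet"

path = staging_file_path("my_catalog", "my_schema", "tenzir_staging")
```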
## Partitioning

Delta tables use Hive-style partitioning, creating a directory structure like
`day=2025-01-15/class_uid=4001/`. The `partition_by` parameter accepts a list
of column names; the columns must exist in the events being written.

To partition by derived values (e.g., daily buckets from a timestamp), add the partition column to your events before writing:

```tql
subscribe "events"
day = time.round(1d)
to_databricks ... partition_by=[day, class_uid]
```

If the target table already exists, the operator queries Unity Catalog for the
existing partition scheme and aligns staging files accordingly. The
`partition_by` parameter is ignored for existing tables since partition columns
are immutable after table creation.
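The Hive-style directory naming is mechanical: each partition column becomes a `key=value` path segment. A minimal sketch:

```python
def hive_partition_dir(partitions: dict) -> str:
    """Render Hive-style partition directories from column -> value pairs,
    e.g. {'day': '2025-01-15', 'class_uid': 4001}
    becomes 'day=2025-01-15/class_uid=4001'."""
    return "/".join(f"{key}={value}" for key, value in partitions.items())

print(hive_partition_dir({"day": "2025-01-15", "class_uid": 4001}))
# → day=2025-01-15/class_uid=4001
```

A query filtering on `day` or `class_uid` can then skip entire directories without reading any file in them.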
## Write-Time Optimizations

Tenzir applies optimizations that produce query-efficient files:

- Partition-aligned files: Each output file contains data for exactly one partition value, enabling partition pruning: queries filtering on partition columns skip irrelevant directories without opening files.
- Sorted rows: Tight min/max statistics enable aggressive data skipping.
- 1GB file targets: Optimal size for scan efficiency; reduces the need for `OPTIMIZE`.
- Zstd compression: High compression ratio with fast decompression.
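The benefit of sorted rows can be illustrated in miniature. Each Parquet file carries min/max statistics per column; when rows are sorted, those ranges are tight and disjoint, so a query engine can discard most files without opening them. A simplified sketch of that pruning logic (the file names and value ranges are made up for illustration):

```python
def files_to_scan(file_stats, lo, hi):
    """file_stats: list of (filename, min_val, max_val) per Parquet file.
    Keep only files whose [min, max] range overlaps the query range
    [lo, hi]; everything else is skipped without any I/O."""
    return [name for name, mn, mx in file_stats if mx >= lo and mn <= hi]

# Sorted writes produce tight, non-overlapping ranges like these:
stats = [
    ("part-0.parquet", 0, 999),
    ("part-1.parquet", 1000, 1999),
    ("part-2.parquet", 2000, 2999),
]
print(files_to_scan(stats, 1200, 1300))
# → ['part-1.parquet']
```

With unsorted writes, every file's range would typically span the whole domain, and no file could be skipped.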
## Examples

### Basic ingestion to a bronze table

```tql
let $client_id = secret("databricks-client-id")
let $client_secret = secret("databricks-client-secret")

from_file "/var/log/app/events.json"
to_databricks workspace="https://adb-1234567890.azuredatabricks.net",
  catalog="analytics",
  schema="bronze",
  table="app_events",
  client_id=$client_id,
  client_secret=$client_secret,
  warehouse_id="abc123def456"
```

### OCSF security events with partitioning

```tql
let $client_id = secret("databricks-client-id")
let $client_secret = secret("databricks-client-secret")

subscribe "ocsf"
day = time.round(1d)
to_databricks workspace="https://dbc-a1b2c3d4-e5f6.cloud.databricks.com",
  catalog="security",
  schema="silver",
  table="security_events",
  client_id=$client_id,
  client_secret=$client_secret,
  warehouse_id="abc123def456",
  partition_by=[day, class_uid],
  sort_by=[src_endpoint.ip, dst_endpoint.ip]
```

### High-volume network telemetry

Optimize for cost with larger batches and files:

```tql
let $client_id = secret("databricks-client-id")
let $client_secret = secret("databricks-client-secret")

subscribe "netflow"
day = time.round(1d)
to_databricks workspace="https://1234567890123456.7.gcp.databricks.com",
  catalog="network",
  schema="bronze",
  table="flows",
  client_id=$client_id,
  client_secret=$client_secret,
  warehouse_id="abc123def456",
  partition_by=[day],
  sort_by=[src_ip, dst_ip],
  flush_interval=15m,
  file_size=1Gi
```

## Comparison with Other Approaches
### vs. Cribl Stream

Cribl Stream writes files to Unity Catalog Volumes but does not commit them to
Delta tables. Customers must separately configure Autoloader or run `COPY INTO`
to make data queryable.

Tenzir’s `to_databricks` commits directly to managed Delta tables; data is
queryable immediately after each flush with no additional setup required.
### vs. Databricks Autoloader

Autoloader monitors cloud storage for new files and incrementally loads them. It requires files to already exist in cloud storage, making it a “pull” model.

Tenzir’s `to_databricks` is a “push” model: data flows directly from pipelines
to Databricks without intermediate storage management. Use Autoloader when data
already lands in cloud storage from other sources.
### vs. Delta Live Tables (DLT)

DLT provides managed ETL pipelines within Databricks. Tenzir complements DLT by handling data collection and initial ingestion, while DLT handles downstream transformations within the lakehouse.
### vs. Kafka + Databricks Connector

The Kafka connector requires managing Kafka infrastructure. Tenzir can ingest directly from sources, eliminating the intermediate message broker for simpler architectures.