No Lake Without a Stream: The Pipeline for your Security Data Lake
Security teams today are drowning in data. Logs pour in from endpoints, cloud services, firewalls, and applications—each in a different format, each requiring tedious transformation before it can be analyzed. The promise of a Security Data Lake (SDL) is simple: centralize all security telemetry in one place and make it easy to query. But a major challenge in building the lake is getting your data in, shaping it into a structured format like OCSF, and delivering it to the lake's object storage in the right table format.
This is where Tenzir comes in. Tenzir is a Data Pipeline Management (DPM) solution purpose-built for security teams. Instead of struggling with brittle ETL pipelines and ad-hoc data wrangling, Tenzir lets you ingest, transform, normalize, and route security data at scale—without investing in data engineering.
Why build your Security Data Lake with Tenzir?
Here are three reasons why Tenzir is the best choice for building and managing a security data lake.
1. Easy, Yet Powerful: One ETL Tool for your Lake
Before data lands in your lake, you need to transform messy, semi-structured raw and JSON logs into a structured schema. OCSF gives you a destination here, but you still need to walk the path: your job is to normalize the data so that it's OCSF-compliant. That's a lot of data wrangling. The good news is that we designed Tenzir specifically to streamline this process so that you don't have to build custom tools for it.

After you've normalized your security telemetry, you then need to deliver that data to your lake. This is also non-trivial, because it involves many decisions around batching, partitioning, and more. Your lake often dictates most of the parameters, but you still need to write the data efficiently and register it in the lake's catalog. We've sketched out what these two phases, normalization and delivery, typically look like:

We've designed our Tenzir Query Language (TQL) to make the first phase, normalization, a breeze, providing all the building blocks to quickly translate your raw data into OCSF events. But we didn't stop there. Our pipeline engine also comes with the machinery needed to deliver your data to the lake's doorstep. One example is our to_asl operator, which takes care of the second phase in a single operator, letting you point a normalized event stream at your Amazon Security Lake. Sending data to Snowflake tables works similarly.
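To make this concrete, here is a minimal sketch of such a pipeline. The topic name, field names, OCSF attributes, and the to_asl argument are illustrative assumptions for this example; the operator documentation has the exact parameters your lake requires.

// Normalize a raw firewall stream into OCSF Network Activity and hand it to
// Amazon Security Lake. All names and the S3 URI below are placeholders.
subscribe "firewall-raw"
class_uid = 4001                                  // OCSF Network Activity
activity_id = 6                                   // Traffic
src_endpoint = {ip: src_ip, port: src_port}
dst_endpoint = {ip: dest_ip, port: dest_port}
drop src_ip, src_port, dest_ip, dest_port
to_asl "s3://my-security-lake-bucket/ext/tenzir/" // assumed argument; see docs

A real mapping covers all required OCSF attributes, but the shape of the work stays the same: rename, restructure, enrich, and drop what you no longer need.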
We are working hard to deliver the same turn-key experience for other lakes, with a current focus on Databricks and Iceberg-based lakes. Until Tenzir writes native Iceberg and Delta tables, you can use our to_hive operator with format="parquet" and a blob store of your choice to get 90% of the way there. What remains is refreshing your catalog to register the new table partitions.
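As an illustration, such a delivery pipeline might look like the following sketch. The bucket URI and partition column are assumptions made for this example; check the to_hive documentation for the exact options.

// Write Hive-partitioned Parquet files to a blob store of your choice.
// The URI and the partition field are illustrative placeholders.
subscribe "ocsf"
to_hive "s3://my-lake-bucket/ocsf/", partition_by=[class_uid], format="parquet"

Afterwards, a catalog refresh (for example MSCK REPAIR TABLE, or your query engine's equivalent) makes the new partitions visible to your analysts.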
2. Flexible Input & Output: Control at Both Pipeline Ends
Tenzir’s pipelines have a lot of power at the input and output operators—the "ends" of a pipeline.

Because you can simply swap out a pipeline input operator, you can seamlessly switch data sources and keep the data processing the same. And if the source supports predicate or limit pushdown, you even get an optimized I/O access path. This comes in handy when reading, say, remote Parquet files from S3 buckets. Similarly, you can swap out the pipeline output operator to route the data to a different destination. Testing both Snowflake and Databricks in parallel? Just send a copy to both.
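As a sketch of that fan-out, one approach is to normalize once, publish the result to a topic, and attach one lightweight delivery pipeline per destination. The topic name and operator arguments below are illustrative assumptions:

// Pipeline 1: normalize once and publish the OCSF stream to a topic.
subscribe "raw-logs"
// ...OCSF mapping as shown earlier...
publish "ocsf"

// Pipeline 2: one subscriber delivers to Amazon Security Lake.
subscribe "ocsf"
to_asl "s3://my-security-lake-bucket/ext/tenzir/"  // assumed argument; see docs

// Pipeline 3: another subscriber writes Hive-partitioned Parquet in parallel.
subscribe "ocsf"
to_hive "s3://my-lake-bucket/ocsf/", partition_by=[class_uid], format="parquet"

Swapping or adding a destination means touching only the final operator of one small pipeline; the normalization logic stays untouched.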
Whether you’re running batch analytics on historical logs or pushing real-time threat signals to an alerting system, Tenzir can power both workflows by mixing and matching inputs and outputs.
3. Streaming Execution: Power in the Pipeline Middle
Tenzir’s pipelines also have a lot of power between the input and output operators—the "middle" of a pipeline.
Unlike batch-oriented ETL tools, Tenzir’s streaming-first architecture ensures low-latency processing while keeping workloads stable. We've engineered a multi-schema, volcano-style executor that combines structured query efficiency with document-store flexibility. It works on data frames in the form of Arrow record batches, but each batch can have a different schema. That's not possible with the structured engines out there, yet we wanted high-throughput streaming on messy security data, so we had to innovate beyond the state of the art and take a leap. And it works remarkably well, including backpressure that throttles the source to prevent overload and out-of-memory conditions.
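In practice, this means a single pipeline can carry events of several schemas side by side. A small sketch, with hypothetical topic and schema names:

// One pipeline, many schemas: select events by their schema name metadata.
subscribe "raw-logs"
where @name == "suricata.alert" or @name == "suricata.dns"
publish "suricata-selected"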

The result? A scalable, high-throughput pipeline that can handle the unpredictable nature of security logs—whether it’s bursty threat intelligence feeds or continuous cloud event streams.
Why You Shouldn’t Build This Yourself 🙅
Security Teams Should Focus on Security—Not Data Plumbing
Engineering a robust data pipeline from scratch requires significant time, expertise, and maintenance. Security teams should be spending their time hunting threats, not fixing brittle ingestion scripts.
Standardized Security Data is the Future
With OCSF and other standardized schemas gaining traction, teams need a way to map disparate logs into a common format optimized for security analytics. Tenzir has support for OCSF mapping out of the box, eliminating custom transformation headaches.
Data Engineering is a Long-Term Drain on Resources
Pipelines need constant updates as new log sources are added. Schema changes create ongoing maintenance challenges. Homegrown solutions often lack performance optimizations for high-throughput security data. Tenzir abstracts away these complexities, giving you a scalable, future-proof pipeline solution. Moreover, the community-based approach with an open source library of mappings allows everyone to partake in improving the quality of our collectively curated security data.
Conclusion
If you’re serious about building a security data lake, Tenzir provides the fastest and most flexible way to get your data in and standardized. It eliminates the biggest headache—ingesting, shaping, and delivering data—so that you can focus on analytics, threat detection, and security insights.
No more fragile schema mappings. No more custom ETL tools. Let Tenzir handle the plumbing while you focus on your core mission.