Microsoft Sentinel Data Lake - FAQ

Answering some common questions people might have

The data lake is here, rejoice. It also brings up a bunch of questions, like: do I still need Microsoft Sentinel? Yes. Is this just auxiliary logging done well, without the complications like not being able to use the "new" Azure Monitor Agent and having to lean on Logstash instead? Sort of.

A bunch of these FAQs have already been posted by companies, but I'm going to take a whirl at the questions I had myself and add some new ones as well.

1. I’m living under a rock, please, what is going on?

On July 22nd, Microsoft announced their Security Data Lake.

It’s a big deal!

Why? Well, to keep it brief:

  1. The Security Data Lake removes the need to move data to other solutions for warm/cold storage
  2. It reduces the complexity of setting up log tiering between hot/warm/cold
  3. KQL jobs replace summary rules - they're an easier-to-use version of summary rules, but they run less frequently

All in all, it makes everything a bit cheaper and easier with a lot less management overhead (in theory).

EDIT: As someone on LinkedIn kindly pointed out, KQL jobs currently run daily at the fastest, while summary rules run every 20 minutes. Thus, their use cases are different as of now:

  1. Summary rules can be used for summarizing data into analytics tables for use in detections, while that is more of a stretch with the current limitations on KQL jobs.
  2. KQL jobs are better suited for surfacing limited amounts of data from the data lake tier for use cases that require data from a longer period of time (see the sketch right after this list).
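
To make the overlap concrete, here's a minimal sketch of the kind of aggregation query either feature would run. CommonSecurityLog and its columns are standard Sentinel schema, but the bin size and threshold are made-up numbers for illustration:

```kql
// Condense verbose firewall logs into hourly per-source rollups -
// the "summarize before you store" pattern both features share.
CommonSecurityLog
| where TimeGenerated > ago(1d)
| summarize EventCount = count(), DistinctDestinations = dcount(DestinationIP)
    by SourceIP, DeviceVendor, bin(TimeGenerated, 1h)
| where EventCount > 1000   // arbitrary noise threshold, tune to taste
```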

Looking into my 🔮 I have some predictions though! I suspect summary rules and KQL jobs will merge into a single capability that can run more frequently, but perhaps with a cost attached based on the amount of data queried (as is currently the case with KQL jobs) and/or the run frequency. Speculation on my side - but it makes sense that components from Sentinel/Defender with similar capabilities will merge into one to avoid confusion in the future.

2. What is a Security Data Lake?

In essence, it's a data lake with a security label slapped on it.

“A data lake is a centralized repository that ingests and stores large volumes of data in its original form. The data can then be processed and used as a basis for a variety of analytic needs.”

As I understand it (I'm often wrong), the idea here is that the data is stored in a single open format in what I will refer to as a "cheap tier" - the data lake storage tier.

Combine that with deduplication - ensuring you only store a single copy of any piece of data - and you get cost-effective storage, which, again, makes a lot of sense.

Now, for our security purposes we need to be able to query the data effectively. The data lake has full KQL support via KQL interactive queries (these cost money to run, unlike normal KQL queries, which are free), and you can bring your own analytics engine, use notebooks, etc. I'm not sure yet how this works if you ingest unstructured data through something that isn't a native connector (maybe it's not supported).

Add the ability to promote data on demand to higher tiers (similar to summary rules) using KQL jobs, and that's the basis of what it does.
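
As a sketch, an interactive query against lake data is just regular KQL - you point it at the lake tier and pay per GB analyzed. SigninLogs and its columns are standard Sentinel schema; the account and lookback here are placeholders:

```kql
// Illustrative interactive query against the data lake tier: dig up a
// year of sign-in history for a single account. Billed per GB analyzed.
SigninLogs
| where TimeGenerated > ago(365d)
| where UserPrincipalName == "user@example.com"   // placeholder account
| project TimeGenerated, AppDisplayName, IPAddress, ResultType
```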

3. What is the difference between Security Data Lake and my existing combination of auxiliary logging for verbose logs and ADX/storage accounts for cold storage?

That’s a long question, how did you come up with that?

The short answer is ease of use. The data lake handles a lot of the complexity for you at a reasonable price, while making sure everything lives in the security portal context so it's easy for the SecOps team to use.

The long answer: in essence, this is a monetary answer to the need for cheaper storage for data we don't want in the analytics tier (for multiple reasons), while also serving as a financially viable alternative to storage accounts. The difference between the Sentinel Data Lake and a combination of auxiliary logs and Azure Data Explorer/storage accounts is ease of use (and the fact that the data lake is a purpose-built, cloud-native security data lake).

Maintaining a complex log-ingestion infrastructure often means relying on third-party options such as Cribl or Tenzir to filter logs and push them to the correct destination. Working with logs in ADX in any capacity is fine, but the management overhead is a big cost/investment. Working with logs from storage accounts can be trying at times if you have low patience - the logs are there, but it takes a while to work with them. All in all, it works, but it requires a lot of configuration, maintenance, and development on your part.

This is also Microsoft's answer to all of the third-party solutions that popped up in the gap beneath Microsoft Sentinel - a gap some filled with Azure Data Explorer and storage accounts.

4. What data tiers does Sentinel Data Lake have?

  1. Analytics tier - stays the same as always; the change with SDL is that all data is mirrored in the lake at no extra cost.
  2. Data lake tier - unified, low-cost storage and advanced analytics for all of your data.

Let’s map this up in a table.

| Tier | Examples of use | Ingestion cost | Retention included in base cost | Other costs |
|------|-----------------|----------------|--------------------------------|-------------|
| Analytics | EDR, antivirus, authentication logs, threat intelligence | Yes | 90 days (can be extended for extra cost) | Long-term retention cost |
| Data lake | Netflow, TLS/SSL certificates, firewall, proxy | Yes | 30 days (can be extended for extra cost) | Federation cost, compute cost, long-term retention cost |

5. So what are the actual costs here?

So far, this article is what we have to go on, together with the updated Sentinel billing Learn page.

In summary, we pay for ingested GB as usual, and for storage beyond the included retention per GB as usual. In addition, we pay per queried GB, similar to basic/auxiliary logs. The new thing is that we pay compute costs when running advanced data insights (this is the payment for running Jupyter notebook sessions/jobs).

The Microsoft Sentinel pricing calculator hasn't been updated yet, so the prices are as follows (in USD, based on East US pricing):

| Capability | Measured in | Price per measurement |
|------------|-------------|-----------------------|
| Data lake ingestion | Data processed (per GB) | $0.05 |
| Data processing | Data processed (per GB) | $0.10 |
| Data lake storage | Data stored (per GB/month) | $0.026 |
| Data lake query | Data analyzed (per GB) | $0.005 |
| Advanced data insights | Compute hour | $0.15 |
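
To put those numbers in perspective, here's a back-of-the-napkin sketch with entirely made-up volumes (it deliberately ignores the included 30-day retention and any data processing charges - check the Learn page before trusting any of this). You can paste it into any KQL window:

```kql
// Hypothetical scenario: 100 GB/day ingested straight to the lake tier,
// a year of data resident, 500 GB analyzed interactively per month.
let IngestGBPerDay = 100.0;
let ResidentGB = IngestGBPerDay * 365;   // simplification: full year stored
let AnalyzedGBPerMonth = 500.0;
print IngestionPerMonth = IngestGBPerDay * 30 * 0.05,   // ~$150
      StoragePerMonth   = ResidentGB * 0.026,           // ~$949
      QueryPerMonth     = AnalyzedGBPerMonth * 0.005    // ~$2.50
```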

6. Can I create analytic rules using data in the SDL?

Not directly. SDL introduces two "new" concepts: KQL interactive queries (which are just KQL queries running against data in the lake, at the aforementioned cost) and KQL jobs.

KQL jobs essentially allow you to run KQL queries as scheduled tasks and promote the results to the analytics tier.

Sounds similar to something? Yes, it's an easier way of doing the aux/basic promotion to the analytics tier via summary rules!
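
As a sketch of what a job's query might look like (the schedule itself is configured in the portal, not in the KQL) - distill a month of lake-tier DNS logs down to just the hits on a known indicator, so only those rows land in the analytics tier. DnsEvents and its columns are standard Sentinel schema; the domain is a placeholder:

```kql
// Illustrative KQL job query: promote only DNS lookups that touched a
// known-bad domain from the lake tier into the analytics tier.
DnsEvents
| where TimeGenerated > ago(30d)
| where Name has "evil-domain.example"   // placeholder indicator
| project TimeGenerated, Computer, ClientIP, Name, IPAddresses
```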

7. How will onboarding this impact my current Microsoft Sentinel and Defender XDR implementation?

Aside from added cost, there should be no major changes.

8. How do I get my data into the data lake tier?

The short answer: the same way you were getting data into Microsoft Sentinel. All data is mirrored by default, or you can choose to ingest into the data lake tier only (this works for all but a few tables).

The long answer: one of the boons of the data lake is that it mirrors your existing workspace(s) for free, and it allows you to use the existing 350+ Microsoft Sentinel data connectors to ingest data into it.

The UI experience: go to the new "Table management" page in the Defender portal and choose whether each table's destination is the analytics or data lake tier.

Once SDL is enabled, auxiliary log tables are no longer visible in Microsoft Defender's advanced hunting or in the Microsoft Sentinel Azure portal. The auxiliary table data is available in the data lake and can be queried using KQL queries or Jupyter notebooks. Not all tables can be switched over to the data lake tier - notable mentions are the Defender XDR tables and some Microsoft Sentinel solution tables. By default, all data sent to the analytics tier is mirrored to the data lake at no extra cost, with the same retention as the analytics table it's mirroring.

If you switch from analytics + mirroring (the default) to ingesting into the data lake tier only, any new data will stop flowing into the analytics-tier table.

Microsoft's documentation includes a diagram showing the retention components of the analytics, data lake, and XDR default tiers, and which table types apply to each tier.

If you want to compare the analytics tier and the data lake tier further, check out this table, as it does a good job of explaining the differences.

9. What does this mean for the archive function in Microsoft Sentinel?

You should switch over once SDL becomes GA; right now it's in public preview.

10. Why am I not allowed to onboard the data lake right now?

Two reasons. One: there are some region limitations for the SDL while it's in preview. If your tenant is in Norway, for example, it defaults to storing data in Norway East datacenters and is thus not eligible.

Two: there are capacity issues in certain regions. These are being fixed as of writing this, and you should be able to enroll in all supported regions come August 2025.

11. Does this have anything to do with AI?

You can bet it probably does. Having a large dataset to reason over might finally make me believe the AI hype. We'll see - the jury's still out.

12. Jupyter notebooks?

Yes, they work. You can run them against the data lake at the price mentioned above, straight from VS Code using the Microsoft Sentinel extension.

You will need a role that allows for reading tables, writing to tables and managing jobs - either Security Operator, Security Administrator or Global Administrator.

You also need the managed identity of the data lake to have the Log Analytics Contributor role on the log analytics workspace(s) connected to the lake.

13. SDL vs Azure Data Explorer

I usually never advocate for using ADX unless you use something like ADXFlowmater or build IaC integrations that let you keep it up to date with table schemas and logs without too much manual intervention.

Now SDL will be the easier option. It's hard to say which way pricing will land without comparing the two, but my guess is that once you account for time saved, you'll want to go for SDL.

14. How do I onboard?

If you’re eligible, go to https://security.microsoft.com/sentinelsettings/data-lake and follow the wizard.

15. Does it only mirror the first workspace, similar to how Defender only enables full correlation for the first workspace?

No, it works for all of them - if they’re all in the same region.

“Your primary and other workspaces connected to Microsoft Defender that are located in the same region as your Microsoft Entra tenant home region are attached to your Microsoft Sentinel data lake. Unconnected workspaces won’t be attached to the data lake.”

Some caveats here:

  • Data is mirrored from the workspace only from the date SDL is onboarded and forward - so it does not currently work for historic data. This is one of the things most likely to be addressed in the future.
  • The exact wording is: “Mirrored data in the data lake with the same retention as the analytics tier doesn’t incur additional billing charges. Preexisting data in the tables isn’t mirrored.”

16. What are the main use cases for SDL?

More data is always better - but at some point, more data becomes expensive. Handling this extra data in cheaper storage used to give you three options:

  1. Stay native in Sentinel at great cost
  2. Use ADX/storage accounts and maintain it yourself at great cost to one's sanity
  3. Look to third-party options for either data engineering pipelines or better security storage at medium cost and medium loss of sanity

SDL takes the ideas behind auxiliary logs, summary rules, and Azure Data Explorer/storage accounts for warm/cold data, and puts them into a cloud-native data lake at a reasonable cost and amount of work.

This allows you to retain more data for threat hunting and store data longer for any compliance and incident response purposes you might have. The SDL is built for use with external analytics engines and Jupyter notebooks to enable more advanced analytics, and it's also built with AI usage in mind - particularly agents reasoning over the data within certain guardrails.

17. Is everything you wrote here correct?

Yes? Maybe? It’s correct according to the sources I could find as of writing this, 29th of July 2025. It’s still a preview feature (even if it’s public preview), so some of it is still subject to change!

Tags: Microsoft Sentinel, Defender XDR, Graph API, Azure Lighthouse, Custom Detection Rules