calgarysoli.blogg.se - Data lake architecture


I have had a lot of conversations with customers to help them understand how to design a data lake. I touched on this in my blog Data lake details, but that was written a long time ago so I wanted to update it.

I often find customers do not spend enough time designing a data lake and many times have to go back and redo their design and data lake build-out because they did not think through all their use cases for data. So make sure you think through all the sources of data you will use now and in the future, understanding the size, type, and speed of the data. Then absorb all the information you can find on data lake architecture and choose the appropriate design for your situation.

A data lake should have layers (also called zones) such as:

- Raw data layer: Raw events are stored for historical reference, usually kept forever (immutable). Think of the raw layer as a reservoir that stores data in its natural and original state. Advantages are auditability, discovery, and recovery. A typical example is if you need to rerun an ETL job because of a bug, you can get the data from the raw layer instead of going back to the source. Also called bronze layer, staging layer or landing area. Sometimes there is a separate conformed layer (also called base layer or standardized layer) that is used after the raw layer to make all the file types the same, usually parquet.
- Cleansed data layer: Raw events are transformed (cleaned and mastered) into directly consumable data sets. The aim is to make uniform the way files are stored in terms of encoding, format, data types and content (i.e. …). Think of the cleansed layer as a filtration layer. It removes impurities and can also involve enrichment. Also called silver, transformed, integrated, or enriched layer.
- Presentation data layer: Business logic is applied to the cleansed data to produce data ready to be consumed by applications (e.g. data warehouse application, advanced analysis process, etc). The data is joined and/or aggregated, and can be stored in de-normalized data marts or star schemas. Also called application, workspace, trusted, gold, secure, production ready, governed, curated, or consumption layer.
- Sandbox data layer: Optional layer to be used to "play" in, usually for data scientists. Also called exploration layer, development layer or data science workspace.

Within each layer there will be a folder structure, which is designed based upon reasons such as subject matter, security, or performance (i.e. …).
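As a rough illustration of the layers and per-layer folder structure described above, here is a minimal Python sketch. The layer names come from the post; everything else (the `/datalake` root, the `subject/dataset/year/month/day` folder scheme, and the helper names `layer_path` and `rerun_source`) is a hypothetical choice for illustration, not a recommended or official layout.

```python
from datetime import date
from pathlib import PurePosixPath

# Layer (zone) names as used in the post.
LAYERS = ("raw", "cleansed", "presentation", "sandbox")


def layer_path(layer: str, subject: str, dataset: str, day: date,
               root: str = "/datalake") -> PurePosixPath:
    """Build a folder path inside one layer.

    Assumed scheme: <root>/<layer>/<subject>/<dataset>/<yyyy>/<mm>/<dd>.
    Grouping by subject matter and date is one possible design; security or
    performance needs may call for a different structure.
    """
    if layer not in LAYERS:
        raise ValueError(f"unknown layer: {layer!r}")
    return (PurePosixPath(root) / layer / subject / dataset
            / f"{day:%Y}" / f"{day:%m}" / f"{day:%d}")


def rerun_source(subject: str, dataset: str, day: date) -> PurePosixPath:
    """Where a rerun ETL job would read its input: the raw layer,
    rather than going back to the source system."""
    return layer_path("raw", subject, dataset, day)


if __name__ == "__main__":
    day = date(2024, 1, 5)
    for layer in LAYERS:
        print(layer_path(layer, "sales", "orders", day))
```

Keeping the same subject, dataset, and date segments under every layer means a file's location in one layer can be derived from its location in another, which makes reruns from the raw layer straightforward to script.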