Data Lake: a cheap, dimensionless, and low-resistance design

A significant problem in any analytics platform is that the volume and type of data are ever changing. The data can be complex and incomparable. In analytics terms, a high degree of data variance is a persistent problem. An analytics system must be flexible enough to integrate new variations in the data and derive value from them.

Photo by dirk von loen-wagner on Unsplash

Simply put, we needed an architecture for a data lake, one that decouples data ingestion from serving and consumption.

Analytics systems face challenges here because they need to handle data of high velocity, high variety, and high volume. Solutions exist for this, but they do not scale and are expensive.

Traditional analytics systems, on the other hand, are outclassed by newer technologies such as Hadoop and Spark. Hence we needed a cost-effective solution to a problem that traditional analytics can't solve. A data lake stores data in a form that provides flexibility for an evolving data schema and encourages multiple analysts to derive business insights from the same seed data.

An unstructured data lake serves the purpose of archiving your data. The data lake concept is that you keep all your raw and near-raw data in the first few layers and then allow an analytics solution to navigate, query, and traverse these layers to derive clusters of intelligence.

A typical problem of any analytics system is that the variance and volume of data are high. It is not always clear or certain what kind of analytic metrics we will need to derive from this data six months from now. Likewise, there is a high probability that the format and dimensionality of the data will evolve over time.

In popular column-based analytics solutions such as a Cassandra or Bigtable cluster, such input data can create many challenges in database schema design and query engineering. More precisely, the design can become so complex that it is just not worth approaching the problem from this angle.

The data lake is the approach that comes to the rescue. The overall idea is that data is stored as semi-structured blobs, typically JSON, BSON, CSV, TSV, or Parquet files, on a distributed filesystem or cloud storage such as S3 / Glacier.

1. No schema enforcement in input data.

2. Build several analytics modules in parallel to serve various use cases based on changing priorities.

3. It is cheaper to store data than, for example, to process financial transactions in real time.

4. Consider the data lake more as insurance for your business than as critical infrastructure.
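
As a rough illustration of points 1 and 2, the sketch below runs two independent analytics modules over the same schemaless records. The record fields (user, amount, country) and the function names are made up for this example and do not come from any particular system.

```python
import json
from collections import Counter

# Hypothetical raw events; nothing forces them to share a schema.
RAW_LINES = [
    '{"user": "a", "amount": 10.0}',
    '{"user": "b", "amount": 5.5, "country": "DE"}',
    '{"user": "a", "country": "US"}',
]

def load(lines):
    """Parse each line as JSON without validating its shape."""
    return [json.loads(line) for line in lines]

def revenue_per_user(records):
    """Module 1: sum 'amount' per user, skipping records without it."""
    totals = Counter()
    for r in records:
        if "amount" in r:
            totals[r["user"]] += r["amount"]
    return dict(totals)

def events_per_country(records):
    """Module 2: count events per country, with a fallback bucket."""
    return dict(Counter(r.get("country", "unknown") for r in records))

records = load(RAW_LINES)
print(revenue_per_user(records))
print(events_per_country(records))
```

Each module reads only the fields it cares about, so a new module can be added later without touching the stored data or the existing modules.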

The path of least resistance is to design an ingestion pipeline that moves data from multiple sources, in multiple formats, into the data lake storage, leaving the schema to be applied on read.
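
A minimal sketch of such an ingestion step might look like the following. It assumes a local directory standing in for S3 or a distributed filesystem, and the raw/source/dt=... path layout, the ingest function, and the sample payloads are all hypothetical choices for illustration.

```python
import tempfile
from datetime import date
from pathlib import Path

def ingest(lake_root, source, fmt, payload, day):
    """Store a raw payload untouched under source and date partitions."""
    target_dir = Path(lake_root) / "raw" / source / f"dt={day.isoformat()}"
    target_dir.mkdir(parents=True, exist_ok=True)
    # Number files by how many already exist in the partition.
    index = len(list(target_dir.iterdir()))
    target = target_dir / f"part-{index:05d}.{fmt}"
    # No parsing and no validation: bytes go in exactly as received.
    target.write_bytes(payload)
    return target

lake = tempfile.mkdtemp()  # stand-in for the lake's storage root
day = date(2020, 1, 15)
p1 = ingest(lake, "orders", "json", b'{"id": 1, "total": 9.5}\n', day)
p2 = ingest(lake, "clicks", "csv", b"user,page\na,/home\n", day)
print(p1.relative_to(lake))
print(p2.relative_to(lake))
```

The pipeline never inspects the payload, so a source can change its format or add fields without breaking ingestion; interpretation is deferred to whoever reads the files.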

Data lake design is by nature a schema-on-read design. For example, if the raw data is stored in JSON format, we can later convert it to a columnar format when we read it for analysis.
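
To make schema on read concrete, here is a small sketch, under assumed field names, that leaves the stored JSON lines as they are and applies a schema only while reading, returning the result as columns. READ_SCHEMA and read_columnar are illustrative names, and a real setup would more likely write Parquet through a dedicated library.

```python
import json

# Raw JSON lines as they might sit in the lake; fields vary per record.
RAW = [
    '{"id": 1, "price": "9.50", "tags": ["a"]}',
    '{"id": 2, "price": 3}',
    '{"id": 3, "note": "no price yet"}',
]

# The schema lives with the reader, not with the stored data.
READ_SCHEMA = {"id": int, "price": float}

def read_columnar(lines, schema):
    """Apply `schema` while reading and return one list per column."""
    columns = {name: [] for name in schema}
    for line in lines:
        record = json.loads(line)
        for name, cast in schema.items():
            value = record.get(name)
            # Missing fields become None; extra fields are ignored.
            columns[name].append(cast(value) if value is not None else None)
    return columns

print(read_columnar(RAW, READ_SCHEMA))
```

A different reader can apply a different schema to the same files, which is what lets the requirements change without re-ingesting anything.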

The general approach to designing a data lake is:

1. Understand your input data and identify one-dimensional properties of the data

2. Design data lake around these dimensions and create a schema using dimensional modeling techniques

3. Process current data to fit the read-defined schema
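
The three steps above can be sketched roughly as follows, assuming a tiny in-memory dataset in which country and product are the identified dimensions and amount is the measure. The build_star helper and the surrogate-key scheme are illustrative assumptions, not a prescribed method.

```python
# Step 1: hypothetical raw records; 'country' and 'product' are the dimensions.
RAW = [
    {"country": "US", "product": "book", "amount": 12.0},
    {"country": "DE", "product": "book", "amount": 8.0},
    {"country": "US", "product": "pen", "amount": 1.5},
]
DIMENSIONS = ("country", "product")

def build_star(records, dimensions, measure):
    """Steps 2 and 3: build dimension tables and a fact table."""
    dim_tables = {d: {} for d in dimensions}  # value -> surrogate key
    facts = []
    for r in records:
        row = {}
        for d in dimensions:
            table = dim_tables[d]
            # Assign the next integer key the first time a value is seen.
            row[f"{d}_key"] = table.setdefault(r[d], len(table) + 1)
        row[measure] = r[measure]
        facts.append(row)
    return dim_tables, facts

dims, facts = build_star(RAW, DIMENSIONS, "amount")
print(dims)
print(facts)
```

The fact rows carry only keys and the measure, while each dimension's values live in their own table, which is the basic shape that dimensional modeling aims for.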

