
DP-900

Data Formats & Data Stores

Data Formats

  • Structured: Tables, Rows and Columns (Relational)
  • Semi-structured: Key-Value, Document (JSON), Graph, etc.
  • Unstructured: Video, audio, image, text files, binary files, etc.

Data Stores

  • Relational databases
  • NoSQL databases
  • Analytical databases
  • Object/Block/File storage

Relational Database

OLTP (Online Transaction Processing)

Applications where a large number of users perform a large number (millions) of transactions
Transaction: a small, discrete unit of work

Azure Managed Services:

  • Azure SQL Database: Managed Microsoft SQL Server
  • Azure Database for MySQL: Managed MySQL
  • Azure Database for PostgreSQL: Managed PostgreSQL

OLAP (Online Analytical Processing)

Applications allowing users to analyze petabytes of data
Data is consolidated from multiple (typically transactional) databases

Azure Managed Service: Azure Synapse Analytics

  • Petabyte-scale distributed data warehouse
  • Unified experience for developing end-to-end analytics solutions
  • Data integration + Data warehouse + Data analytics
  • Run complex queries across petabytes of data
  • Formerly called Azure SQL Data Warehouse

OLAP vs OLTP

  • OLTP databases use row storage
    • Each table row is stored together
    • Efficient for processing small transactions
  • OLAP databases use columnar storage (see the sketch after this list)
    • Each table column is stored together
    • High compression - store petabytes of data efficiently
    • Distribute data - one table across multiple cluster nodes
    • Execute a single query across multiple nodes - complex queries can be executed efficiently
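
A minimal T-SQL sketch of columnar storage (SQL Server syntax; Azure Synapse Analytics dedicated SQL pools use clustered columnstore by default), assuming a hypothetical FactSales table:

    -- Hypothetical fact table used only for illustration
    CREATE TABLE FactSales (
        SaleId    BIGINT        NOT NULL,
        ProductId INT           NOT NULL,
        SaleDate  DATE          NOT NULL,
        Amount    DECIMAL(18,2) NOT NULL
    );

    -- Store the table column by column: high compression, fast analytical scans
    CREATE CLUSTERED COLUMNSTORE INDEX cci_FactSales ON FactSales;

    -- Typical OLAP query: an aggregate over a few columns reads only those columns
    SELECT ProductId, SUM(Amount) AS TotalAmount
    FROM FactSales
    GROUP BY ProductId;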

Semi-Structured Data

Managed Service: Azure Cosmos DB

Document

Managed Service: Azure Cosmos DB SQL API & MongoDB API

Key Value

  • Similar to a HashMap
  • Key - Unique identifier to retrieve a specific value
  • Value - Number or a String or a complex object, like a JSON file
  • Supports simple lookups - query by keys
    • NOT optimized for query by values
    • Typically, no other indexes allowed
  • Use cases: Session Store, Caching Data
  • Managed Services: Azure Cosmos DB Table API, Azure Table Storage

Graph

  • Contains nodes and edges (relationships)
  • Use cases: People and relationships, Organizational charts, Fraud Detection
  • Managed Service: Azure Cosmos DB Gremlin API

Column Family

  • Data organized into rows and columns
  • Can appear similar to a relational database
  • IMPORTANT FEATURE: Columns are divided into groups called column families
  • Rows can be sparse (a row does NOT need to have a value for every column)
  • Use cases: IoT streams and real-time analytics, financial data - transaction histories, stock prices etc.
  • Managed Service: Azure Cosmos DB Cassandra API

Unstructured Data

  • Block Storage (Azure Managed Service: Azure Disks)
  • File Storage (Azure Managed Service: Azure Files)
  • Object Storage (Azure Managed Service: Azure Blob Storage)

Relational Databases

  • Recommended when strong transactional consistency guarantees are needed
  • Database schema is mandatory
  • Azure Managed Services:
    • Azure SQL Database
    • Azure SQL Managed Instance
    • Azure Database for PostgreSQL
    • Azure Database for MySQL
    • Azure Database for MariaDB

SQL(Structured Query Language)

  • SQL: Language used to perform operations on relational databases (see the sketch after this list)
  • Data Definition Language (DDL): Create and modify the structure of database objects
    • Create: Create a database or its constituent objects (Table, View, Index etc)
    • Drop: Delete objects (Table, View, Index) from the database
    • Alter: Alter the structure of the database
    • Rename: Rename an existing object
  • Data Query Language (DQL): Perform queries on the data
    • Select: Read rows from a table
  • Data Manipulation Language (DML): Insert, update or delete data
    • Insert: Insert new rows into a table
    • Update: Modify data in existing rows
    • Delete: Delete existing rows
  • Data Control Language (DCL): Manage permissions and other controls
    • Grant: Grant permission to perform specific actions
    • Deny: Deny permission to perform specific actions
    • Revoke: Remove a previously granted permission
    • Example: Grant and revoke user access - GRANT SELECT ON course TO user1
  • Transaction Control Language (TCL): Control transactions within a database
    • Commit - commits a transaction
    • Rollback - rolls back a transaction (used in case of an error)
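
A minimal T-SQL sketch tying the categories together, reusing the course table and user1 from the DCL example above:

    -- DDL: create an object
    CREATE TABLE course (
        id   INT PRIMARY KEY,
        name VARCHAR(100) NOT NULL
    );

    -- DML: insert and modify rows
    INSERT INTO course (id, name) VALUES (1, 'DP-900');
    UPDATE course SET name = 'DP-900: Azure Data Fundamentals' WHERE id = 1;

    -- DQL: query rows
    SELECT id, name FROM course;

    -- DCL: manage permissions
    GRANT SELECT ON course TO user1;
    REVOKE SELECT ON course FROM user1;

    -- DDL: remove the object
    DROP TABLE course;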

Index

  • Allows efficient data retrieval from a database
  • Combination of one or more columns
  • An index is automatically created with the primary key
  • A table can have more than one index
  • Two Types of Indexes:
  • Clustered: Data in table is stored in the order of the index key values
    • Only one clustered index per table (Why? Data rows can only be sorted in one way)
  • Non-clustered: Index stored separately with pointers to the data rows
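
A minimal T-SQL sketch, assuming a hypothetical course table; the primary key gives the clustered index, and a separate non-clustered index speeds up lookups by name:

    CREATE TABLE course (
        id   INT PRIMARY KEY,      -- clustered index created automatically; rows stored in id order
        name VARCHAR(100) NOT NULL
    );

    -- Non-clustered index: stored separately, with pointers back to the data rows
    CREATE NONCLUSTERED INDEX ix_course_name ON course (name);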

View

  • View: Virtual table mapped to a query
  • Can be used just like a table in SQL queries
  • Use cases: Add calculated columns, join multiple tables, filter unnecessary columns
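
A minimal T-SQL sketch, assuming hypothetical course and enrollment tables; the view joins them and adds a calculated column:

    CREATE VIEW course_summary AS
    SELECT c.id,
           c.name,
           COUNT(e.student_id) AS enrolled_students   -- calculated column
    FROM course AS c
    LEFT JOIN enrollment AS e ON e.course_id = c.id
    GROUP BY c.id, c.name;

    -- Used just like a table
    SELECT * FROM course_summary WHERE enrolled_students > 10;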

Normalization

  • Goals in designing relational databases:
    • High data integrity
    • Minimum data redundancy (or duplication)
  • Database Normalization: "Process of restructuring a relational database to reduce data redundancy and improve data integrity"
  • First Normal Form (1NF): Single (atomic) valued columns
  • Second Normal Form (2NF): Eliminate redundant data
  • Third Normal Form (3NF): Move columns not directly dependent on the primary key
  • There are other normal forms (4NF, 5NF, ...) but 3NF is considered good enough for most relational data
  • A simple definition for practical purposes (see the sketch after this list):
    • Separate each entity into its own table.
    • Separate each discrete attribute into its own column.
    • Uniquely identify each entity instance (row) using a primary key.
    • Use foreign key columns to link related entities.
  • Advantages of Normalization:
    • Avoid the same data being duplicated in multiple tables
    • Reduce disk space wastage
    • Avoid data inconsistencies
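
A minimal T-SQL sketch of the practical definition above, assuming hypothetical customer and order entities:

    -- Each entity gets its own table, each attribute its own column,
    -- and each row a primary key
    CREATE TABLE customer (
        customer_id INT PRIMARY KEY,
        name        VARCHAR(100) NOT NULL,
        email       VARCHAR(100) NOT NULL
    );

    -- Related entities are linked with a foreign key instead of duplicating
    -- customer details on every order
    CREATE TABLE customer_order (
        order_id    INT PRIMARY KEY,
        customer_id INT  NOT NULL REFERENCES customer (customer_id),
        order_date  DATE NOT NULL,
        amount      DECIMAL(18,2) NOT NULL
    );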

Transaction

Transaction: Sequence of operations that need to be atomic
All operations are successful (commit) OR NONE are successful (rollback)

  • Properties: ACID (Atomicity, Consistency, Isolation, Durability)
  • Atomicity: Each transaction is atomic (either succeeds completely, or fails completely)
  • Consistency: Database must be consistent before and after the transaction
  • Isolation: Multiple Transactions occur independently
  • Durability: Once a transaction is committed, it remains committed even if there are system failures (a power outage, for example)
  • Supported in all Relational Databases
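
A minimal T-SQL sketch, assuming a hypothetical account table; either both updates are committed or, on any error, both are rolled back:

    BEGIN TRANSACTION;
    BEGIN TRY
        -- Transfer 100 from account 1 to account 2 as one atomic unit of work
        UPDATE account SET balance = balance - 100 WHERE account_id = 1;
        UPDATE account SET balance = balance + 100 WHERE account_id = 2;
        COMMIT TRANSACTION;    -- both operations succeed together
    END TRY
    BEGIN CATCH
        ROLLBACK TRANSACTION;  -- an error undoes both operations
    END CATCH;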

Azure SQL Database

  • Fully Managed Service for Microsoft SQL Server
  • Recommended for cloud-born applications
  • 99.99% availability
  • Built-in high availability, automatic updates and backups
  • Flexible and responsive serverless compute
  • Hyperscale (up to 100 TB) storage
  • Transparent data encryption (TDE) - Data is automatically encrypted at rest
  • Authentication: SQL Server authentication or Azure Active Directory (and MFA)

Azure SQL managed instance

  • Another fully managed service for Microsoft SQL Server
  • Near 100% SQL Server feature compatibility
  • Recommended when migrating on-premises SQL Servers to Azure
  • Azure SQL Managed Instance features NOT in Azure SQL Database:
    • Cross-database queries (and transactions) within a single SQL Server instance (see the sketch after this list)
    • Database Mail
    • Built-in SQL Server Agent
      • Service to execute scheduled administrative tasks - jobs in SQL Server
    • Native virtual network support
  • Supports only the vCore-based purchasing model
  • SQL Server Analysis Services (SSAS), SQL Server Reporting Services (SSRS), PolyBase: NOT supported by either Azure SQL Database or SQL Managed Instance
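
A minimal T-SQL sketch of a cross-database query using three-part names, assuming hypothetical SalesDb and HrDb databases on the same managed instance:

    -- Join tables that live in two different databases of the same instance
    SELECT o.order_id, e.employee_name
    FROM SalesDb.dbo.orders AS o
    JOIN HrDb.dbo.employees AS e
        ON e.employee_id = o.employee_id;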

Azure Database for MySQL

  • Fully managed, scalable MySQL database
  • Supports 5.6, 5.7 and 8.0 community editions of MySQL
  • 99.99% availability
  • Choose single zone or zone redundant high availability
  • Automatic updates and backups

Azure Database for MariaDB

MariaDB: community-developed, commercially supported fork of MySQL

  • MariaDB offers compatibility with Oracle Database
  • One notable feature of MariaDB is its built-in support for temporal data.
  • A table can hold several versions of data, enabling an application to query the data as it appeared at some point in the past.
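
A minimal sketch of MariaDB's temporal (system-versioned) tables, assuming a hypothetical product table and timestamp:

    -- MariaDB keeps previous row versions automatically
    CREATE TABLE product (
        id    INT PRIMARY KEY,
        price DECIMAL(10,2) NOT NULL
    ) WITH SYSTEM VERSIONING;

    UPDATE product SET price = 19.99 WHERE id = 1;

    -- Query the data as it appeared at an earlier point in time
    SELECT * FROM product
    FOR SYSTEM_TIME AS OF TIMESTAMP '2024-01-01 00:00:00'
    WHERE id = 1;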

Azure Database for PostgreSQL

PostgreSQL is a hybrid relational-object database.
You can store data in relational tables, but a PostgreSQL database also enables you to store custom data types, with their own non-relational properties.
PostgreSQL has its own procedural language, PL/pgSQL (often shortened to pgsql). It is a variant of the standard relational query language, SQL, with features that enable you to write stored procedures that run inside the database.

  • Fully managed, intelligent and scalable PostgreSQL
  • 99.99% availability
  • Choose single zone or zone redundant high availability
  • Automatic updates and backups
  • Single Server and Hyperscale Options
  • Hyperscale: Scale to hundreds of nodes and execute queries across multiple nodes
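
A minimal sketch of the Hyperscale (Citus) model, assuming a hypothetical events table; the Citus create_distributed_table function shards the table across worker nodes so queries run in parallel:

    -- Ordinary PostgreSQL table
    CREATE TABLE events (
        device_id  BIGINT      NOT NULL,
        event_time TIMESTAMPTZ NOT NULL,
        payload    JSONB
    );

    -- Distribute (shard) the table across the cluster on device_id
    SELECT create_distributed_table('events', 'device_id');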

Azure Cosmos DB

  • Fully managed NoSQL database service
  • Single-digit millisecond response times
  • 99.999% availability
  • Automatic scaling (serverless) - Storage and Compute
  • Global database: Automatically replicates data across multiple Azure regions
  • Multi-region writes
  • Data distribution to any Azure region with the click of a button
  • Your app doesn't need to be paused or redeployed to add or remove a region
  • Structure: Azure Cosmos account(s) > database(s) > container(s) > item(s)

Azure Cosmos DB APIs

  • Core (SQL): SQL-based API for working with documents (see the query sketch after this list)
  • MongoDB: Document database exposed through the MongoDB API
    • Move existing MongoDB workloads
  • Table: Key-value
    • Ideal for moving existing Azure Table storage workloads
  • Gremlin: Graph
    • Store complex relationships between data
  • Cassandra: Column family
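
A minimal sketch of a Core (SQL) API query; it runs over the JSON items of one container, here assuming hypothetical items with category and price properties:

    SELECT c.id, c.name, c.price
    FROM c
    WHERE c.category = 'road-bikes'
    ORDER BY c.price DESC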

Tips

You need a separate Cosmos DB account for each type of API

Entity    | SQL       | Cassandra | MongoDB    | Gremlin      | Table
Database  | Database  | Keyspace  | Database   | Database     | NA
Container | Container | Table     | Collection | Graph        | Table
Item      | Item      | Row       | Document   | Node or edge | Item

Azure Storage

  • Managed Cloud Storage Solution
  • Highly available, durable and massively scalable (up to a few petabytes)
  • Core Storage Services:
    • Azure Disks: Block storage (hard disks) for Azure VMs
    • Azure Files: File shares for cloud and on-premises
    • Azure Blobs: Object store for text and binary data
    • Azure Queues: Decouple applications using messaging
    • Azure Tables: NoSQL store (Very Basic, Prefer Azure Cosmos DB for NoSQL)
  • (PRE-REQUISITE) A Storage Account is needed for Azure Files, Azure Blobs, Azure Queues and Azure Tables

Data Redundancy

  • LRS (Locally redundant storage): Three synchronous copies in the same data center
  • ZRS (Zone-redundant storage): Three synchronous copies in three AZs in the primary region
  • GRS (Geo-redundant storage): LRS + asynchronous copy to a secondary region (three more copies using LRS)
  • GZRS (Geo-zone-redundant storage): ZRS + asynchronous copy to a secondary region (three more copies using LRS)

Azure Disks Storage

Disk storage: Disks for Azure VMs

Types:

  • Standard HDD: Recommended for Backup, non-critical, infrequent access
  • Standard SSD: Recommended for Web servers, lightly used enterprise applications and dev/test environments
  • Premium SSD disks: Recommended for production and performance sensitive workloads
  • Ultra disks (SSD): Recommended for IO-intensive workloads such as SAP HANA, top tier databases (for example, SQL, Oracle), and other transaction-heavy workloads

Tips

Premium and Ultra provide very high availability

Azure Files

  • Managed File Shares
  • Connect from multiple devices concurrently:
    • From cloud or on-premises
    • From different OS: Windows, Linux, and macOS
  • Supports two common network file sharing protocols:
    • Server Message Block (SMB) file sharing is commonly used across multiple operating systems (Windows, Linux, macOS).
    • Network File System (NFS) shares are used by some Linux and macOS versions. To create an NFS share, you must use a premium tier storage account and create and configure a virtual network through which access to the share can be controlled.
  • Use case: Shared files between multiple VMs (example: configuration files)

Azure Blob Storage

Azure Blob Storage: Object storage in Azure

Tips

Blobs: Binary Large Objects

  • Structure: Storage Account > Container(s) > Blob(s)
  • Store massive volumes of unstructured data
  • Store all file types - text, binary, backups & archives:
    • Media files and archives, application packages and logs
    • Backups of your databases or storage devices
  • Three types of blobs:
    • Block Blobs: Store text or binary files (videos, archives etc)
    • Append Blobs: Store log files (ideal for append operations)
    • Page Blobs: Foundation for Azure IaaS disks (512-byte pages, up to 8 TB)
  • Azure Data Lake Storage Gen2: Azure Blob Storage, enhanced
    • Designed for enterprise big data analytics (exabytes, hierarchical)
    • Low-cost, tiered storage, with high availability/disaster recovery
  • Access tiers:
    • Hot (default): Store frequently accessed data
    • Cool: Infrequently accessed data stored for a minimum of 30 days
    • Archive: Rarely accessed data stored for a minimum of 180 days
      • Lowest storage cost BUT highest access cost
      • Access latency: in hours
      • To access: Rehydrate (change the access tier to Hot or Cool) OR copy to another blob with access tier Hot or Cool
  • You can change the access tier of an object at any point in time

Azure Queues & Azure Tables

  • Azure Queues: Decouple applications using messaging
  • Azure Tables: NoSQL store (Very Basic)
    • A key/value store
    • Store and retrieve values by key
    • Supports simple query, insert, and delete operations
    • Cosmos DB Table API is recommended as the key/value store for newer use cases (supports multi-master writes in multiple regions)
    • Azure Tables only supports read replicas in other regions:
      • GRS or GZRS: Data in the secondary region is generally NOT available for read or write access
      • Available for read or write only in case of a failover to the secondary region
      • To enable round-the-clock read access: use read-access geo-redundant storage (RA-GRS) or read-access geo-zone-redundant storage (RA-GZRS)

Data Analytics

Data Analytics Workflow

  • Data Ingestion: Capture raw data
    • From various sources (stream or batch)
  • Data Processing: Process data
    • Raw data is not suitable for querying
    • Clean (remove duplicates), filter (remove anomalies) and/or aggregate data
    • Transform data to the required format (Transformation)
  • Data Storage: Store to a data warehouse or data lake
  • Data Querying: Run queries to analyze data
  • Data Visualization: Create visualizations to make it easier to understand data and make better decisions
    • Create dashboards, charts and reports (capture trends)
    • Help business spot trends, outliers, and hidden patterns in data

Data Analysis Categories

  • Descriptive analytics: What’s happening?
    • Based on historical/current data
    • Monitor status (of KPIs) and generate alerts
    • Example: Generating reports (current vs planned)
  • Diagnostic analytics: Why is something happening?
    • Take findings from descriptive analytics and dig deeper
    • Example: Why did sales increase last month?
  • Predictive analytics: What will happen?
    • Predict probability based on historical data
    • Mitigate risk and identify opportunities
    • Example: What will be the future demand?
  • Prescriptive analytics: What actions should we take?
    • Use insights from predictive analytics to make data-driven, informed decisions
    • Still in early stages
  • Cognitive analytics: Make analytics tools that think like humans
    • Combine traditional analytics techniques with AI and ML features
    • Examples: Speech to text (transcription or subtitles), text to speech, video analysis, image analysis, semantic analysis of text (analyze reviews)

Big Data - Terminology and Evolution

  • 3Vs of Big Data
    • Volume: Terabytes to petabytes to exabytes
    • Variety: Structured, semi-structured, unstructured
    • Velocity: Batch, streaming, ...
  • Terminology: Data warehouse vs Data lake
    • Data warehouse: PBs of storage + compute (typically)
      • A data warehouse is a relational database in which the data is stored in a schema that is optimized for data analytics rather than transactional workloads
      • Optimized for read operations
      • Data stored in a format ready for specific analysis (processed data)
      • Examples: Teradata, BigQuery (GCP), Redshift (AWS), Azure Synapse Analytics
      • Typically uses specialized hardware
    • Data lake: Typically retains all raw data (compressed)
      • A data lake is a file store, usually on a distributed file system for high performance data access
      • Typically object storage is used as the data lake
      • Amazon S3, Google Cloud Storage, Azure Data Lake Storage Gen2 etc.
      • Flexibility while saving cost
      • Perform ad-hoc analysis on demand
      • Analytics & intelligence services (even data warehouses) can directly read from the data lake (see the sketch after this list)
      • Azure Synapse Analytics, BigQuery (GCP) etc.
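
For example, a Synapse serverless SQL pool can query files in the data lake in place with OPENROWSET; a minimal sketch, assuming a hypothetical folder of Parquet sales files (placeholder storage URL):

    -- Query Parquet files directly, without loading them into a warehouse first
    SELECT TOP 10 *
    FROM OPENROWSET(
        BULK 'https://<storage-account>.dfs.core.windows.net/data/sales/*.parquet',
        FORMAT = 'PARQUET'
    ) AS sales;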

Azure Services

  • Azure Synapse Analytics: End-to-end analytics solutions
    • Data integration + Enterprise data warehouse + Data analytics
    • Create SQL and Spark pools to analyze data
  • Azure Data Factory: Fully managed serverless service to build complex data pipelines
    • Extract-transform-load (ETL), extract-load-transform (ELT) and data integration
  • Power BI: Create visualizations around data
    • Unify data and create BI reports & dashboards

Massively Parallel Processing (MPP)

  • Split processing across multiple compute nodes
  • Typically separate storage and compute
  • Use Data lake as storage
  • Scale compute on demand
  • Examples: Spark, Azure Synapse Analytics

Batch Pipelines

  • Buffering and processing data in groups
  • Define condition - how often to run?
  • Advantages: Process huge volumes of data during off-peak hours
  • Typically takes longer to run (minutes to hours to days)
  • Example: Read from storage (Azure Data Lake Store), process, and write to Relational Database or NoSQL Database or Data warehouse

Streaming Pipelines

  • Real-time data processing
  • Processing data as it arrives (in seconds or milliseconds)
  • Examples: Stock market data, telemetry from IoT devices, user action metrics from websites

Azure Data Factory

  • Fully managed serverless service to build complex data pipelines:
    • Extract-transform-load (ETL), extract-load-transform (ELT) and data integration
    • 90+ built-in connectors
    • Ingest data from:
      • Big Data sources like Amazon Redshift, Google BigQuery
      • Enterprise data warehouses like Oracle Exadata, Teradata
      • All Azure data services
  • Build data flows to transform data
    • Integrate with services like Azure HDInsight, Azure Databricks, Azure Synapse Analytics for data processing
  • Move SQL Server Integration Services (SSIS) packages to the cloud
  • CI/CD support with Azure DevOps
  • Components:
    • Pipeline: Logical group of activities that can be scheduled
      • You can chain activities in a pipeline
      • You can run activities sequentially or in parallel
      • A pipeline can execute other pipelines
    • Activity: Represents a step in a pipeline (an action to be performed)
      • Copy Activity: Copy data from one store to another store
        • Example: Copy a CSV from Blob Storage to a table in SQL Database
      • Three types of activities: Data movement, data transformation, and control activities
    • Data Flow: Create and manage data transformation logic
      • Build a reusable library of data transformation routines
      • Executes logic on a Spark cluster - you don't need to manage the cluster (it is spun up and down automatically as needed)
    • Control flow: Orchestrate pipeline activities - for example, run an activity based on the output of another pipeline activity
    • Linked Service: Used to connect to an external source
      • Connect to different sources like Azure Blob Storage, SQL databases etc
    • Dataset: Representation of data structures within data stores
    • Integration Runtime: Compute infrastructure used by Azure Data Factory to perform data movement, activity dispatch and data flow execution
    • Triggers: Start a pipeline run at specific times or in response to events

Azure Data Lake Storage(Gen2)

  • Blob storage + hierarchical directory structure
  • Configure permissions at file and directory level (POSIX-style ACLs, in addition to RBAC)
  • Fully compatible with Hadoop Distributed File System (HDFS)
    • Apache Hadoop workloads can directly access data in Azure Data Lake Storage
  • Three main elements:
    • Data Lake Store: Azure Data Factory, Azure Databricks, Azure HDInsight, Azure Data Lake Analytics, and Azure Stream Analytics can read from it directly
    • Data Lake Analytics: Run analytics jobs using U-SQL
    • HDInsight: Run Hadoop jobs

Power BI

Unify data and create BI reports & dashboards

  • Integrates with all Azure analytics services
    • From Azure Synapse Analytics to Azure Data Lake Storage
  • Power BI Components:
    • Power BI Service: Online SaaS (Software as a Service) offering
      • Power BI online - app.powerbi.com
      • Create/share reports and dashboards
    • Power BI Desktop: Windows desktop application to create and share reports
      • More data sources, complex modeling and transformations
    • Power BI Report Builder: Standalone tool to author paginated reports
    • Power BI Mobile Apps: Apps for Windows, iOS, and Android devices
  • Typical Power BI workflow:
    • 1: Create a report with Power BI Desktop or the Power BI service (or a paginated report with Power BI Report Builder)
    • 2: Publish it to the Power BI service
    • 3: View and interact with the report (and create dashboards) using the Power BI service
      • Reports can also be accessed from Power BI Mobile