DP-900
Data Formats & Data Stores
Data Formats
- Structured: Tables, Rows and Columns (Relational)
- Semi-Structured: Key-Value, Document (JSON), Graph, etc.
- Unstructured: Video, Audio, Image, Text files, Binary files ...
Data Stores
- Relational databases
- NoSQL databases
- Analytical databases
- Object/Block/File storage
Relational Database
OLTP (Online Transaction Processing)
Applications where a large number of users make a large number (millions) of transactions
Transaction - a small, discrete unit of work
Azure Managed Services:
- Azure SQL Database: Managed Microsoft SQL Server
- Azure Database for MySQL: Managed MySQL
- Azure Database for PostgreSQL: Managed PostgreSQL
OLAP (Online Analytical Processing)
Applications allowing users to analyze petabytes of data
Data is consolidated from multiple (typically transactional) databases
Azure Managed Service: Azure Synapse Analytics
- Petabyte-scale distributed data warehouse
- Unified experience for developing end-to-end analytics solutions
- Data integration + Data warehouse + Data analytics
- Run complex queries across petabytes of data
- Earlier called Azure SQL Data Warehouse
OLAP vs OLTP
- OLTP databases use row storage
- Each table row is stored together
- Efficient for processing small transactions
- OLAP databases use columnar storage
- Each table column is stored together
- High compression - store petabytes of data efficiently
- Distribute data - one table across multiple cluster nodes
- Execute a single query across multiple nodes - complex queries can be executed efficiently (see the sketch below)
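The sketch below illustrates the columnar + distributed storage idea using Azure Synapse Analytics dedicated SQL pool syntax; the FactSales table and its columns are made up for illustration.

```sql
-- Hypothetical fact table in a Synapse dedicated SQL pool.
-- CLUSTERED COLUMNSTORE INDEX => columnar storage (high compression).
-- DISTRIBUTION = HASH(CustomerId) => rows are spread across compute nodes.
CREATE TABLE dbo.FactSales
(
    SaleId     BIGINT        NOT NULL,
    CustomerId INT           NOT NULL,
    SaleDate   DATE          NOT NULL,
    Amount     DECIMAL(18,2) NOT NULL
)
WITH
(
    DISTRIBUTION = HASH(CustomerId),
    CLUSTERED COLUMNSTORE INDEX
);

-- An analytical query reading only two columns benefits from columnar storage
-- and runs in parallel across the nodes holding the distributed data.
SELECT SaleDate, SUM(Amount) AS TotalSales
FROM dbo.FactSales
GROUP BY SaleDate;
```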
Semi-Structured Data
Managed Service: Azure Cosmos DB
Document
Managed Service: Azure Cosmos DB SQL API & MongoDB API
Key Value
- Similar to a HashMap
- Key - Unique identifier to retrieve a specific value
- Value - A number, a string, or a complex object such as a JSON document
- Supports simple lookups - query by keys
- NOT optimized for query by values
- Typically, no other indexes allowed
- Use cases: Session Store, Caching Data
- Managed Services: Azure Cosmos DB Table API, Azure Table Storage
Graph
- Contains nodes and edges (relationships)
- Use cases: People and relationships, Organizational charts, Fraud Detection
- Managed Service: Azure Cosmos DB Gremlin API
Column Family
- Data organized into rows and columns
- Can appear similar to a relational database
- IMPORTANT FEATURE: Columns are divided into groups called column-family
- Rows can be sparse (does NOT need to have value for every column)
- Use cases: IoT streams and real-time analytics, financial data - transaction histories, stock prices, etc.
- Managed Service: Azure Cosmos DB Cassandra API
Unstructured Data
- Block Storage (Azure Managed Service: Azure Disks)
- File Storage (Azure Managed Service: Azure Files)
- Object Storage (Azure Managed Service: Azure Blob Storage)
Relational Databases
- Recommended when strong transactional consistency guarantees are needed
- Database schema is mandatory
- Azure Managed Services:
- Azure SQL Database
- Azure SQL Managed Instance
- Azure Database for PostgreSQL
- Azure Database for MySQL
- Azure Database for MariaDB
SQL (Structured Query Language)
- SQL: Language used to perform operations on relational databases
- Data Definition Language (DDL): Create and modify structure of database objects
- Create: Create a database or its constituent objects (Table, View, Index etc)
- Drop: Delete objects (Table, View, Index) from database
- Alter: Alter structure of the database
- Rename: Rename an existing object
- Data Query Language (DQL): Perform queries on the data
- Select: Read rows from a table
- Data Manipulation Language (DML): Insert, update or delete data
- Insert: Insert new rows into a table
- Update: Modify data in existing rows
- Delete: Delete existing rows
- Data Control Language (DCL): Manage permissions and other controls
- Grant: Grant permission to perform specific actions
- Deny: Deny permission to perform specific actions
- Revoke: Remove a previously granted permission
- Example:
Grant and revoke user access - GRANT SELECT ON course TO user1
- Transaction Control Language (TCL): Control transactions within a database
- Commit - commits a transaction
- Rollback - rolls back a transaction (used in case of an error)
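A small illustrative script showing one or two statements from each category above; the course table, its columns, and user1 are made-up names:

```sql
-- DDL: create a table
CREATE TABLE course (
    id    INT PRIMARY KEY,
    name  VARCHAR(100) NOT NULL,
    price DECIMAL(10,2)
);

-- DML: insert and update rows
INSERT INTO course (id, name, price) VALUES (1, 'DP-900', 10.00);
UPDATE course SET price = 12.00 WHERE id = 1;

-- DQL: query the data
SELECT id, name, price FROM course WHERE price > 10;

-- DCL: grant and revoke access
GRANT SELECT ON course TO user1;
REVOKE SELECT ON course FROM user1;

-- TCL: commit (or roll back) a group of statements as one transaction
BEGIN TRANSACTION;
    DELETE FROM course WHERE id = 1;
COMMIT;
```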
Index
- Allows efficient data retrieval from a database
- Combination of one or more columns
- An index is automatically created with the primary key
- A table can have more than one index
- Two Types of Indexes:
- Clustered: Data in table is stored in the order of the index key values
- Only one clustered index per table (Why? Data rows can only be sorted in one way)
- Non-clustered indexes: Index stored separately with pointers to the data rows
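A minimal sketch using the hypothetical course table from the earlier example: the primary key already gives a clustered index, and an extra non-clustered index (T-SQL syntax) speeds up lookups by name.

```sql
-- The PRIMARY KEY on course(id) already created a clustered index,
-- so the table's rows are physically ordered by id.

-- A non-clustered index is stored separately, with pointers back to the rows:
CREATE NONCLUSTERED INDEX ix_course_name
    ON course (name);

-- Lookups by name can now use the index instead of scanning the whole table:
SELECT id, price FROM course WHERE name = 'DP-900';
```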
View
- View: Virtual table mapped to a query
- Can be used just like a table in SQL queries
- Use cases: Add calculated columns, join multiple tables, filter unnecessary columns
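For example, a view over the hypothetical course table that filters rows and adds a calculated column (the 18% tax rate is made up):

```sql
-- A view is a virtual table mapped to a query.
CREATE VIEW course_catalog AS
SELECT
    id,
    name,
    price,
    price * 1.18 AS price_with_tax  -- calculated column (assumed tax rate)
FROM course
WHERE price IS NOT NULL;            -- filter out rows without a price

-- The view is queried just like a table:
SELECT name, price_with_tax FROM course_catalog;
```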
Normalization
- Goals in designing relational databases:
- High Data Integrity
- Minimum Data Redundancy (or Duplication)
- Database Normalization: "Process of restructuring a relational database to reduce data redundancy and improve data integrity"
- First Normal Form (1NF): Each column holds a single (atomic) value - no repeating groups
- Second Normal Form (2NF): Eliminate redundant data - every non-key column depends on the whole primary key
- Third Normal Form (3NF): Move columns not directly dependent on the primary key into separate tables
- There are other normal forms (4NF, 5NF, ...) but 3NF is considered good enough for most relational data
- A simple definition for practical purposes is:
- Separate each entity into its own table.
- Separate each discrete attribute into its own column.
- Uniquely identify each entity instance (row) using a primary key.
- Use foreign key columns to link related entities.
- Advantages of Normalization
- Avoid same data being duplicated in multiple tables
- Reduce disk space wastage
- Avoid data inconsistencies
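A minimal sketch of the practical rules above, using made-up customer and order entities: one table per entity, a primary key per row, and a foreign key linking the related entities so customer details are stored only once.

```sql
-- One table per entity, one column per attribute, a primary key per row.
CREATE TABLE customer (
    customer_id INT PRIMARY KEY,
    name        VARCHAR(100),
    email       VARCHAR(200)
);

CREATE TABLE customer_order (
    order_id    INT PRIMARY KEY,
    order_date  DATE,
    amount      DECIMAL(10,2),
    customer_id INT REFERENCES customer (customer_id)  -- foreign key link
);

-- Customer details are not duplicated on every order; join when needed:
SELECT c.name, o.order_id, o.amount
FROM customer c
JOIN customer_order o ON o.customer_id = c.customer_id;
```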
Transaction
Transaction: Sequence of operations that need to be atomic
All operations are successful (commit) OR NONE are successful (rollback)
- Properties: ACID (Atomicity, Consistency, Isolation, Durability)
- Atomicity: Each transaction is atomic (either succeeds completely, or fails completely)
- Consistency: Database must be consistent before and after the transaction
- Isolation: Multiple Transactions occur independently
- Durability: Once a transaction is committed, it remains committed even if there are system failures (a power outage, for example)
- Supported in all Relational Databases
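The classic illustration of atomicity is a money transfer (a T-SQL sketch with a hypothetical account table): both updates succeed together, or neither is kept.

```sql
BEGIN TRY
    BEGIN TRANSACTION;

    -- Move 100 from account 1 to account 2: both statements form one unit of work.
    UPDATE account SET balance = balance - 100 WHERE account_id = 1;
    UPDATE account SET balance = balance + 100 WHERE account_id = 2;

    COMMIT TRANSACTION;      -- both changes become durable together
END TRY
BEGIN CATCH
    ROLLBACK TRANSACTION;    -- on any error, neither change is kept
END CATCH;
```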
Azure SQL Database
- Fully Managed Service for Microsoft SQL Server
- Recommended for cloud-born applications
- 99.99% availability
- Built-in high availability, automatic updates and backups
- Flexible and responsive serverless compute
- Hyperscale (up to 100 TB) storage
- Transparent data encryption (TDE) - Data is automatically encrypted at rest
- Authentication: SQL Server authentication or Azure Active Directory (and MFA)
Azure SQL Managed Instance
- Another Fully Managed Service for Microsoft SQL Server
- Near 100% SQL Server feature compatibility
- Recommended when migrating on-premises SQL Servers to Azure
- Azure SQL Managed Instance features NOT in Azure SQL Database
- Cross-database queries (and transactions) within a single SQL Server instance (example after this list)
- Database Mail
- Built in SQL Server Agent
- Service to execute scheduled administrative tasks - jobs in SQL Server
- Native virtual network support
- Supports only vCore-based purchasing model
- SQL Server Analysis Services (SSAS), SQL Server Reporting Services (SSRS), Polybase: NOT supported by either Azure SQL Database or SQL Managed Instance
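For instance, a cross-database query uses three-part names (database.schema.table); the database and table names below are made up. This works within a SQL Managed Instance but is not supported between databases in Azure SQL Database.

```sql
-- Join tables that live in two different databases on the same instance.
SELECT o.order_id, c.name
FROM SalesDb.dbo.customer_order AS o
JOIN CrmDb.dbo.customer AS c
    ON c.customer_id = o.customer_id;
```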
Azure Database for MySQL
- Fully managed, scalable MySQL database
- Supports 5.6, 5.7 and 8.0 community editions of MySQL
- 99.99% availability
- Choose single zone or zone redundant high availability
- Automatic updates and backups
Azure Database for MariaDB
MariaDB: community-developed, commercially supported fork of MySQL
- MariaDB offers compatibility with Oracle Database
- One notable feature of MariaDB is its built-in support for temporal data.
- A table can hold several versions of data, enabling an application to query the data as it appeared at some point in the past.
Azure Database for PostgreSQL
PostgreSQL is a hybrid relational-object database.
You can store data in relational tables, but a PostgreSQL database also enables you to store custom data types, with their own non-relational properties.
PostgreSQL has its own procedural query dialect, pgsql (PL/pgSQL). It is a variant of the standard relational query language, SQL, with features that enable you to write stored procedures that run inside the database.
- Fully managed, intelligent and scalable PostgreSQL
- 99.99% availability
- Choose single zone or zone redundant high availability
- Automatic updates and backups
- Single Server and Hyperscale Options
- Hyperscale: Scale to hundreds of nodes and execute queries across multiple nodes
Azure Cosmos DB
- Fully managed NoSQL database service
- Single-digit millisecond response times
- 99.999% availability
- Automatic scaling (serverless) - Storage and Compute
- Global database: Automatically replicates data across multiple Azure regions
- Multi-region writes
- Data distribution to any Azure region with the click of a button
- Your app doesn't need to be paused or redeployed to add or remove a region
- Structure: Azure Cosmos account(s) > database(s) > container(s) > item(s)
Azure Cosmos DB APIs
- Core (SQL): SQL-based API for working with documents (query example below)
- MongoDB: Document with MongoDB API
- Move existing MongoDB workloads
- Table: Key Value
- Ideal for moving existing Azure Table storage workloads
- Gremlin: Graph
- Store complex relationships between data
- Cassandra: Column Family
Tips
You need a separate Cosmos DB account for each type of API
| Entity    | SQL       | Cassandra | MongoDB    | Gremlin      | Table |
|-----------|-----------|-----------|------------|--------------|-------|
| Database  | Database  | Keyspace  | Database   | Database     | NA    |
| Container | Container | Table     | Collection | Graph        | Table |
| Item      | Item      | Row       | Document   | Node or edge | Item  |
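The Core (SQL) API queries JSON items with a SQL-like syntax; a hedged sketch against a hypothetical products container:

```sql
-- Cosmos DB Core (SQL) API query; 'c' is an alias for each JSON item in the container.
SELECT c.id, c.name, c.price
FROM c
WHERE c.category = 'books' AND c.price < 20
ORDER BY c.price ASC
```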
Azure Storage
- Managed Cloud Storage Solution
- Highly available, durable and massively scalable (up to a few petabytes)
- Core Storage Services:
- Azure Disks: Block storage (hard disks) for Azure VMs
- Azure Files: File shares for cloud and on-premises
- Azure Blobs: Object store for text and binary data
- Azure Queues: Decouple applications using messaging
- Azure Tables: NoSQL store (Very Basic, Prefer Azure Cosmos DB for NoSQL)
- (PREREQUISITE) A Storage Account is needed for Azure Files, Azure Blobs, Azure Queues, and Azure Tables
Data Redundancy
- LRS (Locally redundant storage): Three synchronous copies in the same data center
- ZRS (Zone-redundant storage): Three synchronous copies in three AZs in the primary region
- GRS (Geo-redundant storage): LRS + asynchronous copy to a secondary region (three more copies using LRS)
- GZRS (Geo-zone-redundant storage): ZRS + asynchronous copy to a secondary region (three more copies using LRS)
Azure Disks Storage
Disk storage: Disks for Azure VMs
Types:
- Standard HDD: Recommended for Backup, non-critical, infrequent access
- Standard SSD: Recommended for Web servers, lightly used enterprise applications and dev/test environments
- Premium SSD disks: Recommended for production and performance sensitive workloads
- Ultra disks (SSD): Recommended for IO-intensive workloads such as SAP HANA, top tier databases (for example, SQL, Oracle), and other transaction-heavy workloads
Tips
Premium SSD and Ultra disks provide very high performance and availability
Azure Files
- Managed File Shares
- Connect from multiple devices concurrently:
- From cloud or on-premises
- From different OS: Windows, Linux, and macOS
- Supports two common network file sharing protocols:
- Server Message Block (SMB) file sharing is commonly used across multiple operating systems (Windows, Linux, macOS).
- Network File System (NFS) shares are used by some Linux and macOS versions. To create an NFS share, you must use a premium tier storage account and create and configure a virtual network through which access to the share can be controlled.
- Use case: Shared files between multiple VMs (example: configuration files)
Azure Blob Storage
Azure Blob Storage: Object storage in Azure
Tips
Blobs: Binary Large Objects
- Structure: Storage Account > Container(s) > Blob(s)
- Store massive volumes of unstructured data
- Store all file types - text, binary, backup & archives:
- Media files and archives, Application packages and logs
- Backups of your databases or storage devices
- Three Types of Blobs
- Block Blobs: Store text or binary files (videos, archives etc)
- Append Blobs: Store log files (Ideal for append operations)
- Page Blobs: Foundation for Azure IaaS Disks (512-byte pages up to 8 TB)
- Azure Data Lake Storage Gen2: Azure Blob Storage Enhanced
- Designed for enterprise big data analytics (exabytes, hierarchical)
- Low-cost, tiered storage, with high availability/disaster recovery
- Access tiers
- Hot(default): Store frequently accessed data
- Cool: Infrequently accessed data stored for min. 30 days
- Archive: Rarely accessed data stored for min. 180 days
- Lowest storage cost BUT Highest access cost
- Access latency: In hours
- To access: Rehydrate (Change access tier to hot or cool) OR Copy to another blob with access tier hot or cool
- You can change access tiers of an object at any point in time
Azure Queues & Azure Tables
- Azure Queues: Decouple applications using messaging
- Azure Tables: NoSQL store (Very Basic)
- A key/value store
- Store and retrieve values by key
- Supports simple query, insert, and delete operations
- Cosmos DB Table API is recommended as a key/value store for newer use cases (supports multi-master writes in multiple regions)
- Azure Tables only supports read replicas in other regions
- GRS or GZRS: Data in secondary region is generally NOT available for read or write access
- Available for read or write only in case of failover to the secondary region
- To enable round the clock read access:
- Use read-access geo-redundant storage (RA-GRS) or read-access geo-zone-redundant storage (RA-GZRS)
Data Analytics
Data Analytics Work Flow
- Data Ingestion: Capture raw data
- From various sources (stream or batch)
- Data Processing: Process data
- Raw data is not suitable for querying
- Clean (remove duplicates), filter (remove anomalies) and/or aggregate data
- Transform data to required format (Transformation)
- Data Storage: Store data in a data warehouse or data lake
- Data Querying: Run queries to analyze data
- Data Visualization: Create visualizations to make it easier to understand data and make better decisions
- Create dashboards, charts and reports (capture trends)
- Help business spot trends, outliers, and hidden patterns in data
Data Analysis Categories
- Descriptive analytics: What’s happening?
- Based on historical/current data
- Monitor status (of KPIs) and generate alerts
- Example: Generating reports (current vs planned)
- Diagnostic analytics: Why is something happening?
- Take findings from descriptive analytics and dig deeper
- Example: Why did sales increase last month?
- Predictive analytics: What will happen?
- Predict probability based on historical data
- Mitigate risk and identify opportunities
- Example: What will be the future demand?
- Prescriptive analytics: What actions should we take?
- Use insights from predictive analytics and make data-driven informed decisions
- Still in early stages
- Cognitive analytics: Make analytics tools that "think" like humans
- Combine traditional analytics techniques with AI and ML features
- Examples: Speech to text (transcription or subtitles), text to speech, Video Analysis, Image Analysis, Semantic Analysis of Text (Analyze reviews)
Big Data - Terminology and Evolution
- 3Vs of Big Data
- Volume: Terabytes to Petabytes to Exabytes
- Variety: Structured, Semi structured, Unstructured
- Velocity: Batch, Streaming, ...
- Terminology: Data warehouse vs Data lake
- Data warehouse: PBs of Storage + Compute (Typically)
- A data warehouse is a relational database in which the data is stored in a schema that is optimized for data analytics rather than transactional workloads
- Optimized for read operations
- Data stored in a format ready for specific analysis! (processed data)
- Examples: Teradata, BigQuery(GCP), Redshift(AWS), Azure Synapse Analytics
- Typically uses specialized hardware
- Data lake: Typically retains all raw data (compressed)
- A data lake is a file store, usually on a distributed file system for high performance data access.
- Typically object storage is used as data lake
- Amazon S3, Google Cloud Storage, Azure Data Lake Storage Gen2 etc..
- Flexibility while saving cost
- Perform ad-hoc analysis on demand
- Analytics & intelligence services (even data warehouses) can directly read from data lake
- Azure Synapse Analytics, BigQuery(GCP) etc..
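As an example of reading directly from a data lake, a hedged sketch of an Azure Synapse serverless SQL pool query over Parquet files; the storage account, container, and path are made up:

```sql
-- Query Parquet files sitting in the data lake directly,
-- without loading them into a data warehouse first.
SELECT TOP 100
    result.ProductId,
    result.Amount
FROM OPENROWSET(
    BULK 'https://mydatalake.dfs.core.windows.net/raw/sales/2023/*.parquet',
    FORMAT = 'PARQUET'
) AS result;
```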
Azure Services
- Azure Synapse Analytics: End-to-end analytics solutions
- Data integration + Enterprise data warehouse + Data analytics
- Create SQL and Spark pools to analyze data
- Azure Data Factory: Fully managed serverless service to build complex data pipelines
- Extract-transform-load (ETL), extract-load-transform (ELT) and data integration
- Power BI: Create visualization around data
- Unify data and create BI reports & dashboards
Massively Parallel Processing (MPP)
- Split processing across multiple compute nodes
- Typically separate storage and compute
- Use Data lake as storage
- Scale compute on demand
- Examples: Spark, Azure Synapse Analytics
Batch Pipelines
- Buffering and processing data in groups
- Define condition - how often to run?
- Advantages: Process huge volumes of data during off-peak hours
- Typically takes longer to run (minutes to hours to days)
- Example: Read from storage (Azure Data Lake Store), process, and write to Relational Database or NoSQL Database or Data warehouse
Streaming Pipelines
- Real-time data processing
- Processing data as it arrives (in seconds or milliseconds)
- Examples: Stock market data, telemetry from IoT devices, user action metrics from websites
Azure Data Factory
- Fully managed serverless service to build complex data pipelines:
- Extract-transform-load (ETL), extract-load-transform (ELT) and data integration
- 90+ built-in connectors
- Ingest data from:
- Big Data sources like Amazon Redshift, Google BigQuery
- Enterprise data warehouses like Oracle Exadata, Teradata
- All Azure data services
- Build data flows to transform data
- Integrate with services like Azure HDInsight, Azure Databricks, Azure Synapse Analytics for data processing
- Move SQL Server Integration Services (SSIS) packages to the cloud
- CI/CD support with Azure DevOps
- Components
- Pipeline: Logical group of activities that can be scheduled
- You can chain activities in a pipeline
- You can run activities sequentially or in parallel
- A pipeline can execute other pipelines
- Activity: Represents a step in a pipeline (an action to be performed)
- Copy Activity: Copy data from one store to another store
- Example: Copy CSV from Blob Storage to a Table in SQL Database
- Three types of activities: Data movement, Data transformation, Control activities
- Data Flow: Create and manage data transformation logic
- Build reusable library of data transformation routines
- Executes logic on a Spark cluster:
- You don't need to manage the cluster (it is spun up and down automatically as needed)
- Control flow: Orchestrate pipeline activity based on output of another pipeline activity
- Linked Service: Used to connect to an external source
- Connect to different sources like Azure Storage Blob, SQL Databases etc
- Dataset: Representation of data structures within data stores
- Integration Runtime: Compute infrastructure used by Azure Data Factory to run activities (data movement, data flow execution, and SSIS package execution)
- Triggers: Start a pipeline run at specific times or in response to events
Azure Data Lake Storage (Gen2)
- Blob storage + Hierarchical directory structure
- Configure permissions at file and directory level (POSIX-style ACLs, in addition to RBAC)
- Fully compatible with Hadoop Distributed File System (HDFS)
- Apache Hadoop workloads can directly access data in Azure Data Lake Storage
- Three main elements:
- Data Lake Store: Azure Data Factory, Azure Databricks, Azure HDInsight, Azure Data Lake Analytics, and Azure Stream Analytics can read directly
- Data Lake Analytics: Run analytics jobs using U-SQL
- HDInsight: Run Hadoop jobs
Power BI
Unify data and create BI reports & dashboards
- Integrates with all Azure analytics services
- From Azure Synapse Analytics to Azure Data Lake Storage
- Power BI Components
- Power BI Service: Online SaaS (Software as a Service) service
- Power BI online - app.powerbi.com
- Create/share reports and dashboards
- Power BI Desktop: Windows desktop application to create and share reports
- More data sources, Complex modeling and transformations
- Power BI Report Builder: Standalone tool to author paginated reports
- Power BI Mobile Apps: Apps for Windows, iOS, and Android devices
- Typical Power BI Workflow:
- 1: Create a report with Power BI Service/Desktop (or paginated report with Power BI Report Builder)
- 2: Share it to the Power BI service
- 3: View and interact with report (and create dashboards) using Power BI service
- Reports can also be accessed from Power BI mobile