Sep 14, 2012

You can count on one of three things failing: hardware, software, or people. One of the most important considerations when moving applications into the public cloud is how to plan for – and mitigate – these failures.

Certainly there are best practices in building any application that help you to handle failures, but what are the practices when your applications run in the public cloud? In this presentation, Wade Wegner will draw upon his years of experience with cloud applications in Windows Azure to share proven practices for handling failure in cloud applications.

Presented by Wade Wegner

Disclaimer: These are conference session notes I compiled during various sessions at Microsoft Tech Ed 2012, September 11-14, 2012.  The majority of the content comprises notes taken from the presentation slides accompanied, occasionally, by my own narration.  Some of the content may be free hand style.  Enjoy… Rob

Introduction

  • Architectural options for designing highly-available, fault tolerant applications
  • Best practices for these options
  • Multi-availability Zones (AZ)

Cloud Outages

  • AWS 21/4/2011
  • Azure 29/2/2012
  • AWS 14/6/2012
  • AWS 29/6/2012
  • Azure 26/7/2012

The outages listed above had quite a range of causes: a leap-year date-parsing bug, lightning strikes and other infrastructure issues.  In essence, failures can and will occur.

What do we need to consider?

  • Fault Tolerance
  • High availability
  • Disaster recovery

Read your SLAs!

Windows Azure has monthly SLAs, for example.  Keep in mind that SLAs rarely reimburse revenue lost to outages.

‘Compounding SLAs’

  • If different systems have different SLAs, e.g.
  • Azure Compute = 99.95%
  • SQL Azure = 99.9%
  • Azure Storage = 99.9%

Allowed downtime per year: 4.38h + 8.76h + 8.76h
Total potential outage: 21.9 hours per year (worst case, if the outages never overlap)
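The arithmetic behind the 21.9 hours can be checked with a short sketch. It assumes an 8,760-hour year (the hour figures in the notes match that, even though the SLAs themselves are monthly) and compares two views: simply adding each service's allowed downtime, and multiplying the availabilities together, which treats the services as independent. The function names are illustrative only.

```python
HOURS_PER_YEAR = 365 * 24  # 8760, the period the notes' hour figures imply

# SLA percentages as quoted in the session notes.
slas = {"Azure Compute": 0.9995, "SQL Azure": 0.999, "Azure Storage": 0.999}


def allowed_downtime_hours(availability, period_hours=HOURS_PER_YEAR):
    """Hours of downtime an SLA permits over the period."""
    return (1.0 - availability) * period_hours


def composite_availability(availabilities):
    """Availability of a system that needs every service up at once,
    assuming the services fail independently."""
    result = 1.0
    for availability in availabilities:
        result *= availability
    return result


# View 1: add up each service's allowance (outages never overlap).
worst_case = sum(allowed_downtime_hours(a) for a in slas.values())

# View 2: multiply availabilities, then convert back to hours.
composite = composite_availability(slas.values())
composite_hours = allowed_downtime_hours(composite)

print(f"summed allowance:       {worst_case:.2f} h/year")
print(f"composite availability: {composite:.4%}")
print(f"composite allowance:    {composite_hours:.2f} h/year")
```

Either way the combined figure is noticeably worse than any single service's SLA, which is the point of the slide.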

Let’s define ‘Cloud’

  • Physical data centre behind an API
  • Cloud is a ‘resource pool’ behind an API
  • A cloud is not
    • Azure
    • AWS
  • A cloud is defined by the isolation of resources
  • Sometimes might need to go across Cloud platforms
    • e.g. Azure, AWS, different data centres, different geo-locations

A cloud is a specific data centre (rather than the platform itself)

Define High Availability?

  • Remove all single points of failure
    • Multiple hosts, load balancers, data replication
  • Graceful failover
    • Platforms might provide functionality to support this
    • Sometimes you need to build it

Define Disaster Recovery?

  • Processes or procedures to recover from a failure
    • Network, hardware, software etc
  • Practice and test DR strategies (this takes a lot of time)
    • Document, train, rehearse
  • Disaster can occur anywhere

Typical Approach

  • Duplication of infrastructure
  • Identical spec
  • Cold failover
  • Typically under-provisioned or over-provisioned

DR with Cloud

  • Consider the advantages/features of each platform
    • to support migration, durability, restoration of data
  • Scale up as needed
  • Geo-located
    • Azure: Regions & Fault domains
    • AWS: Regions & availability zones
  • Move applications into separate fault domains

Design for Failure

  • Large scale failures are rare, but happen
  • Applications need to be fault-aware and able to recover
  • Balance cost of tolerance against cost/risk
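One common way to make an application "fault-aware" is to retry transient failures with exponential backoff. The sketch below is not from the session; `TransientError` and `call_with_retry` are hypothetical names, and the delays and attempt count are arbitrary defaults chosen for illustration.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for failures that are worth retrying."""


def call_with_retry(operation, max_attempts=5, base_delay=0.5,
                    max_delay=8.0, sleep=time.sleep, rng=random.random):
    """Run operation(), retrying TransientError with exponential backoff.

    The wait doubles on each attempt (capped at max_delay) and is
    jittered to 50-100% of its value so that many clients do not
    retry in lockstep. The final failure is re-raised to the caller.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                raise
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(delay * (0.5 + rng() / 2))


# Demo: an operation that fails twice and then succeeds.
attempts = []


def flaky():
    attempts.append(1)
    if len(attempts) < 3:
        raise TransientError("simulated outage")
    return "ok"


delays = []  # record the waits instead of actually sleeping
result = call_with_retry(flaky, sleep=delays.append)
print(result, len(attempts), delays)
```

Passing `sleep` and `rng` as parameters keeps the sketch testable; real code would also decide carefully which errors count as transient, since retrying everything can hide genuine faults.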

API Endpoint Differences

  • APIs differ
  • Different resources, billing
  • Network architectures vary (VLANs, security groups)
  • Different storage architecture
  • Abstractions and management vary
  • Each Cloud is unique in various ways

Overcoming Multi-Cloud Pain

  • Design using generic concepts
  • Have tools which translate generic concepts into cloud-specific implementations
  • How to share resources across clouds
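As a rough illustration of "design using generic concepts", the sketch below describes a server in cloud-neutral terms and translates it through a per-cloud lookup table. Everything here is hypothetical: `cloud_a`, `cloud_b` and the size and region names are placeholders, not real Azure or AWS identifiers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServerSpec:
    """Cloud-neutral description of a server."""
    role: str
    size: str    # generic size: "small" or "large"
    region: str  # generic region: "us" or "europe"


# Hypothetical translation tables; a real tool would hold the
# provider's actual instance sizes and region names here.
TRANSLATIONS = {
    "cloud_a": {
        "size": {"small": "a-size-1", "large": "a-size-4"},
        "region": {"us": "a-us-east", "europe": "a-eu-west"},
    },
    "cloud_b": {
        "size": {"small": "b.s", "large": "b.xl"},
        "region": {"us": "b-region-1", "europe": "b-region-7"},
    },
}


def translate(spec, cloud):
    """Turn a generic spec into the settings one specific cloud expects."""
    table = TRANSLATIONS[cloud]
    return {
        "role": spec.role,
        "size": table["size"][spec.size],
        "region": table["region"][spec.region],
    }


web = ServerSpec(role="web", size="small", region="europe")
for cloud in TRANSLATIONS:
    print(cloud, translate(web, cloud))
```

The design is written once against `ServerSpec`; only the tables change when another cloud is added.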

Infrastructure Abstraction/Automation

  • Simplify deployments across multiple regions/zones
  • Automate deployments
    • Reproducible, consistent
  • Advanced server and deployment monitoring
    • Some API support, e.g. custom performance counters
    • Azure aggregates a lot of data, performance counters etc
    • Still maturing
  • Automatic scaling and operations (and throttling)
  • Third party services/apps/tools can help
  • Make use of diagnostic information

Reduced cost of maintenance.  ScaleXtreme works across clouds.

HA/DR Checklist for Risk Mitigation

  • Determine who owns the design, processes, testing
    • Who will support, and operate the application(s)
  • Develop in-house expertise (or bring help in)
  • Conduct a risk assessment
  • Specify recovery time objectives/recovery point objectives
  • Design for failure (start with application design)
  • Implement HA best practices
    • Balance cost/risk/complexity
    • automate/abstract infrastructure
    • It can be costly to support referential integrity across zones
  • Document operational processes/automations & test them
  • Test the failover and recoveries
  • Unleash the Chaos Monkey!
    • Acknowledge that things do fail

General HA Best Practices

  • Avoid single point of failure (again)
  • Place at least one of each component in different fault domains
  • Maintain sufficient capacity to absorb faults
  • Replicate data across fault domains
  • Monitoring and alerts to automate problem resolution
  • Design stateless applications (to support failover/reboot/relaunch)
    • Avoid internal instance dependencies
  • Make use of platform specific monitoring features
  • Framework services can be slow to respond
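The "at least one of each component in different fault domains" rule can be checked mechanically. The sketch below is a toy model, assuming instances are spread round-robin over named fault domains; the component names, counts and domain labels are made up for illustration.

```python
from collections import defaultdict
from itertools import cycle


def place_round_robin(components, fault_domains):
    """Spread each component's instances over the fault domains in turn."""
    placement = defaultdict(list)
    for name, count in components.items():
        domains = cycle(fault_domains)
        for _ in range(count):
            placement[name].append(next(domains))
    return dict(placement)


def single_points_of_failure(placement):
    """Components whose instances all live in a single fault domain."""
    return sorted(name for name, domains in placement.items()
                  if len(set(domains)) < 2)


def survives_loss_of(placement, failed_domain):
    """True if every component keeps an instance outside the failed domain."""
    return all(any(d != failed_domain for d in domains)
               for domains in placement.values())


components = {"web": 3, "worker": 2, "cache": 1}  # instance counts
placement = place_round_robin(components, ["fd0", "fd1"])

print(placement)
print("single points of failure:", single_points_of_failure(placement))
print("survives loss of fd0:", survives_loss_of(placement, "fd0"))
print("survives loss of fd1:", survives_loss_of(placement, "fd1"))
```

Here the single `cache` instance is flagged: losing `fd0` takes it out, however well the other tiers are spread.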

Some General DR Scenarios

  • Backup/restore
  • Simple Recovery
  • Warm standby
  • Multi-site
  • Multi-cloud

Consider the cost, complexity and risk implications.  Each scenario defines a different level of availability and recovery time.

Multi-Cloud: Cold DR

It takes time to spin up the cold DR site.  DNS switching can be time-sensitive, even if fully automated.  On the plus side, running costs are reduced.

Multi-Cloud: Warm DR

A slightly better approach: data/exports can be replicated.  The data tier doesn’t need to spin up, just the other tiers.  Storage can be partitioned into a separate fault domain, etc.  Still fairly minimal cost, with the same DNS timeframe issues.  The DB could be put into read-only mode for reporting, etc.

Azure SQL Database: Multi-tenant service.  Export can be put into Azure Storage BLOB and can be replicated to other regions.

Multi-Cloud: Hot DR

Apps are already spun up.  Much higher cost.  DNS would still need to fail over.

Multi-Cloud: HA

For designs which can tolerate NO downtime.  Route DNS traffic to different clouds.  Data consistency becomes an issue as real-time production data is being captured in two completely separate clouds.  Is real-time synchronization something which is entirely necessary in this configuration?  High cost.

How do I make my service immortal?

  • Hope for the best, plan for the worst
    • Failures do occur, design for them
  • Embrace the cloud mentality
  • Fit for purpose – no one design suits all
    • Analyse requirements, appetite for risk
    • Costs
  • Start easy – build HA first, then expand
    • Start at process and procedures
    • Automation

Open Source/Standards: needs community push to garner some attention.

Sep 14, 2012

Most developers are familiar with the concept of scaling out their application tier; with SQL Azure Federations it is now possible to scale out the data tier as well. In this session we will deep dive on building large scale solutions on SQL Azure.

In this session we will cover patterns and techniques for building scalability into your relational databases. SQL Azure Federations allow databases to be spread over 100s of nodes in the Azure datacentre with databases paid for by the day. This presents a unique avenue for dealing with particularly massive volumes of data, of user load, or both.

This session will discuss how to design a schema for federation scale-out while still maintaining the value afforded by a true relational (SQL) database. We’ll look at approaches for minimizing cross-federation queries, as well as approaches to fan-out queries when necessary. We will examine approaches for dealing with elastically scaling applications and other high-load scenarios.

Introduction

Kiwi Chris Auld is back to go into more detail on SQL Federations.

Agenda

  • Overview
  • Tips and Tricks
    • Design and Development
      • Picking a Federation Model
      • Picking Reference Tables
      • Generating a key without bottlenecks
      • Coding fan-out queries
    • Administration
      • Configuring layout
      • Where and when to split
  • vNext
    • Improvements

SQL Azure has been renamed.  Keeping it “simple”: SQL Database = SQL Azure

Scalability model for the Cloud

  • Cloud Apps allow massive scale
    • Orders of magnitude more than burst
  • Cloud Apps demand the best economics
    • Best Price/Performance
    • Elasticity + Pay-as-you-go

Data Scale Challenges

  • For small scenarios scale up is cheaper & easier
  • For larger scenarios scale out is the only solution
    • Massive diseconomies of scale
      • One 64-way server costs far more ($$$) than 64 one-way servers
    • Reach a limit (can’t get a big enough box)
    • Shared resource contention
  • Scale out differs

Federations in SQL Database

  • Canonical 3 tier app scales by
    • Adding and removing nodes
    • Buying a huge DB server
  • Federations extend the model to the DB tier
    • Add and remove Azure nodes with federations
    • Scale on demand to your traffic without downtime

Sort of like a load balancer for databases.  Database horizontal partitioning under transactional load.
Federation is essentially a database sharding strategy.

Why use Federations?

  • Scale beyond single DB to almost unlimited scale
    • Many nodes
  • Best economics
    • elastic tier that can repartition
    • Always on
  • Simplified multi-tenancy
  • Simplified development and admin
    • Platform supports sharding
    • Proper RDBMS

Concepts

Sharding/federation: A named dimension over which data is sliced.
CREATE FEDERATION fed_name(key_label, distribution_type)

Atomic unit: the smallest unit (record) to shard on, e.g. a Customer record, which can’t be split across two shards.

Repartitioning without downtime: 

  • SPLIT members to spread workloads over more nodes, e.g. ALTER FEDERATION <name> SPLIT AT(key=value)
  • DROP members to shrink back to fewer nodes
  • MERGE not yet supported

Built-in Data Dependent Routing

  • DDR ensures apps can discover where data is just-in-time (JIT)
    • Apps don’t need to cache a ‘shard map’
    • No cache coherency issues even with partitioning
  • Prevents connection pool fragmentation issues

USE FEDERATION <name>(<key>=value);
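To make the idea concrete, here is a toy model of range-based routing and SPLIT. It is only a conceptual sketch, not how SQL Database implements federations: the `Federation` class, its member naming and the integer key range starting at 0 are all assumptions for illustration. The point is that the routing table lives with the platform, so the application only ever supplies a key.

```python
import bisect


class Federation:
    """Toy range-partitioned federation.

    Member i holds the keys from lows[i] up to (not including) lows[i + 1];
    the last member holds everything from its lower bound upwards.
    """

    def __init__(self, name, min_key=0):
        self.name = name
        self._lows = [min_key]                # sorted lower bounds
        self._members = [f"{name}_member_0"]  # parallel list of members
        self._next_id = 1

    def split_at(self, key):
        """Split the member covering key so a new member starts at key."""
        if key <= self._lows[0] or key in self._lows:
            raise ValueError(f"cannot split at {key}")
        index = bisect.bisect_right(self._lows, key)
        self._lows.insert(index, key)
        self._members.insert(index, f"{self.name}_member_{self._next_id}")
        self._next_id += 1

    def route(self, key):
        """Return the member that holds the given federation key."""
        if key < self._lows[0]:
            raise KeyError(key)
        return self._members[bisect.bisect_right(self._lows, key) - 1]


fed = Federation("customers")
before = fed.route(250)   # only one member exists so far
fed.split_at(100)         # keys >= 100 move to a new member
fed.split_at(50)          # keys 50..99 move to another new member
print(before, fed.route(42), fed.route(75), fed.route(250))
```

Because callers ask `route()` each time rather than caching a shard map, a split changes where a key lands without any change to application code.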

[Demo]

Working with Federations

  • Create federation (shard)
  • Create federation member
  • Insert data (turn filtering off for multiple records)
  • Supports standard T-SQL
  • Use FEDERATED ON to assign a column to the federation scope
  • Needs to be applied consistently when sharding a table (on the atomic unit, e.g. customerId)
  • Can only do one split at a time
  • Can only split on discrete values (int, varbinary, uniqueidentifier)
  • Can use a dynamic management view to track progress

Concepts – 3 kinds of tables

  • Federated Tables
    • Contain data that is distributed by the federation
    • Contains a slice of data in each database
    • Optimized for read/write at scale
    • Bounded by atomic unit/filtering
  • Reference Tables
    • Duplicate copy of data in each federation member
    • Must manually modify data in each federation member, thus eventually consistent
    • Optimised for read
    • Used for referential integrity
    • No cross database transactions in SQL Azure
  • Central Tables
    • Tables created in the federation root (database) for low-traffic objects such as metadata
    • Single point of load (use sparingly, or access through a cache)

Picking Federations

  • Normalize the data model to 3NF (or more), sort of: the federated column needs to be carried across the related data entities, rather like adding a “federation key” throughout the data model
  • Apply scale-first DB design principles
  • Pick Federations, i.e. “Table Groups” that need to scale out

Picking Reference Tables

  • Look up table
  • Joined in queries

Generating Unique Keys

  • Identity not available (because the data is sharded)
  • Identity Generation can be expensive for large apps
    • provides linearly increasing values
    • provides no gaps in generation
    • only generated at the data tier
    • creates bottleneck, must have some sort of shared counter
  • Benefits of GUID
    • no centralized id generation
    • generated anywhere in the tiers
    • random distribution over an enormous address space
  • Can also use varbinary
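A minimal sketch of the GUID approach: keys are generated in the application tier with no shared counter, so no round trip to the database is needed. The function name `new_customer_id` is illustrative only.

```python
import uuid


def new_customer_id():
    """Generate a key anywhere in the tiers, with no central counter."""
    return uuid.uuid4()  # random (version 4) GUID


# Many "app servers" could run this at once without coordinating.
ids = [new_customer_id() for _ in range(1000)]

print(ids[0])                 # canonical text form
print(len(ids[0].bytes))      # 16 raw bytes, usable as a binary key
print(len(set(ids)) == len(ids))
```

The trade-off against identity columns is that the values are not sequential, so they carry no ordering and spread randomly over the key space.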

Data Dependent Routing

Based on ‘atomic units’.  DDR connects to the correct database federation member. 

FILTERING=ON – connects to a single atomic unit (based on the federation key), good for:

  • Management tasks
  • CRUD

FILTERING=OFF – for fan-out queries (data across multiple federation members)

Used for:

  • Reporting queries
    • e.g. Union or aggregate queries
  • Unaligned queries
  • fan-out queries (union all, additive/non-additive aggregates)

Member query: part sent to each member
Summary query: post processing query for member query results
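The member/summary split can be sketched with in-memory SQLite databases standing in for federation members. This is an assumption-heavy illustration: the `orders` table and its data are invented, SQLite is not SQL Database, and the summary step is done in Python rather than T-SQL. It also shows why a non-additive aggregate such as an average needs care: each member returns an additive count and sum, and the average is only computed in the summary step.

```python
import sqlite3


def make_member(rows):
    """Create one stand-in federation member holding a slice of the data."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE orders (customer_id INTEGER, amount REAL)")
    db.executemany("INSERT INTO orders VALUES (?, ?)", rows)
    return db


members = [
    make_member([(1, 10.0), (2, 30.0)]),
    make_member([(150, 20.0)]),
    make_member([(900, 40.0), (901, 50.0), (902, 60.0)]),
]

# Member query: the part sent to every member (additive pieces only).
MEMBER_QUERY = "SELECT COUNT(*), COALESCE(SUM(amount), 0) FROM orders"


def fan_out(members, member_query):
    """Run the member query against each member and collect the partials."""
    return [db.execute(member_query).fetchone() for db in members]


def summarise(partials):
    """Summary query: post-process the member results into one answer."""
    total_count = sum(count for count, _ in partials)
    total_amount = sum(amount for _, amount in partials)
    average = total_amount / total_count if total_count else None
    return total_count, total_amount, average


partials = fan_out(members, MEMBER_QUERY)
count, total, average = summarise(partials)

# Averaging the per-member averages gives a different, wrong answer.
naive = sum(amount / n for n, amount in partials) / len(partials)
print(count, total, average, naive)
```

With this data the correct average is 35.0, while the naive average of averages is 30.0, because the members hold different numbers of rows.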

[Demo] A number of different T-SQL queries which demonstrate querying across multiple and single federations.

Well, there’s obviously a lot to cover here.  I’ll be coming back to this topic myself to run a few sample proofs of concept, so stay tuned here at Sanders Technology for more on SQL Federation.

Sep 14, 2012

Casablanca is a Microsoft incubation effort to support cloud-based client-server communication in native code using a modern asynchronous C++ API design. Think of it as Node.js, but using C++. Casablanca gives you the power to use existing native C++ libraries and code to do awesome things on the server. Come and watch John Azariah and Mahesh Krishnan show you how it is done.

Presented by John Azariah and Mahesh Krishnan

Arrived about ten minutes late due to delays in checking out of the hotel.  John’s now in the process of compiling and running a Hello World sample.

The Node Influence on Casablanca

  • Asynchronous, non-blocking I/O
  • Powerful libraries (many external modules)
  • Simplicity

A Node.js hello-world HTTP response looks very similar in C++.
The http namespace used in the C++ version comes from Casablanca.

Node.js sample:

var http = require('http');
http.createServer(function (req, res) {
  res.writeHead(200, {'Content-Type': 'text/plain'});
  res.end('Hello World\n');
}).listen(1337, '127.0.0.1');
console.log('Server running at http://127.0.0.1:1337/');

..and a C++ Casablanca version:

http_listener listener =
    http_listener::create("http://localhost/path_1",
        [](http_request request)
        {
            request.reply(status_codes::OK, "Hello, World!");
        });
listener.open();

Some Inclusions:

  • Web
    • RESTful services
    • JSON
  • The Cloud
    • SDK for accessing Azure storage

[Demo] Serializing JSON, implementing a RESTful interface and hooking to Azure

  • Mapping HTTP verbs to actions, e.g.
    • Read = GET
    • Update = PUT
    • Insert = POST
    • Delete = DELETE

Possibly using the following syntax:

MyListener::MyListener(const http::uri& url) :
    m_listener(http_listener::create(url))
{
    m_listener.support(methods::GET,
                       std::tr1::bind(&MyListener::handle_get,
                                      this,
                                      std::tr1::placeholders::_1));
    m_listener.support(methods::PUT,
                       std::tr1::bind(&MyListener::handle_put,
                                      this,
                                      std::tr1::placeholders::_1));
    m_listener.support(methods::POST,
                       std::tr1::bind(&MyListener::handle_post,
                                      this,
                                      std::tr1::placeholders::_1));
    m_listener.support(methods::DEL,
                       std::tr1::bind(&MyListener::handle_delete,
                                      this,
                                      std::tr1::placeholders::_1));
}

Demonstration of all of the above shows that JSON serialization in/out is quite seamless (at this stage).

[Slides] Handling JSON & Using Cloud Storage

C++ Specific Advantages

[Demo] Text to Speech Demo

Wow, I haven’t seen some of those pre-processor and #pragma statements in a long time. 

[Demo] PhotoSynth Web Service

Photosynth has a web service implementation now.  Uploading multiple photos and getting back the panorama is quite cool.

Here’s the panorama taken at the session.

[Image: session panorama]

Casablanca Task Libraries

[Demo] Mandelbrot example rendered locally via Casablanca.

The next big thing coming out of the Casablanca Team is GPU processing.

Summary

  • Casablanca is an incubation effort
  • Allows you to write end-to-end Azure apps in C++
  • If you are already programming in C++ you can migrate your apps to Azure