May 05 2013
 

File this one away for further analysis.  Live or delayed (on demand) content delivered via Windows Azure Media Services. 

It takes advantage of local CDNs, provides a platform for targeted advertising, and draws on Azure cloud capabilities (e.g. caching, high availability, etc.).

Azure Media

There’s also a high resolution copy available from the Microsoft downloads site if you’d like to “zoom in” on the salient details. 

What I found to be eye catching?

  • Codecs (H.264, MPEG)
  • Device support (HTML5, Flash, Set Top/Smart TV)
  • Platform support (Windows, iOS, Android)

Sounds interesting?  There are ways to find out more.

Check back here soon as I’ll likely do a more comprehensive write-up in a few weeks.

Sep 14 2012
 

You can count on one of three things failing: hardware, software, or people. One of the most important considerations when moving applications into the public cloud is how to plan for – and mitigate – these failures.

Certainly there are best practices in building any application that help you to handle failures, but what are the practices when your applications run in the public cloud? In this presentation, Wade Wegner will draw upon his years of experience with cloud applications in Windows Azure to share proven practices for handling failure in cloud applications.

Presented by Wade Wegner

Disclaimer: These are conference session notes I compiled during various sessions at Microsoft Tech Ed 2012, September 11-14, 2012.  The majority of the content comprises notes taken from the presentation slides accompanied, occasionally, by my own narration.  Some of the content may be free hand style.  Enjoy… Rob

Introduction

  • Architectural options for designing highly-available, fault tolerant applications
  • Best practices for these options
  • Multi-availability Zones (AZ)

Cloud Outages

  • AWS 21/4/2011
  • Azure 29/2/2012
  • AWS 14/6/2012
  • AWS 29/6/2012
  • Azure 26/7/2012

The list above covers quite a range of outages.  A leap-year date handling bug was behind the 29/2/2012 outage; others were caused by lightning and infrastructure issues.  In essence, failures can and will occur.

What do we need to consider?

  • Fault Tolerance
  • High availability
  • Disaster recovery

Read your SLAs!

Windows Azure has monthly SLAs, for example.  Keep in mind that most SLAs rarely reimburse for revenue lost due to outages.

‘Compounding SLAs’

  • If different systems have different SLAs, e.g.
    • Azure Compute = 99.95%
    • SQL Azure = 99.9%
    • Azure Storage = 99.9%

Allowed downtime per year: 4.38h + 8.76h + 8.76h
Total potential outage: 21.9 hours per year
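
The arithmetic behind those figures can be sketched in a few lines of Python. This is my own illustration rather than anything from the slides: the function names are made up, and the composite figure assumes the three services fail independently, which real outages often do not.

```python
from typing import Sequence

HOURS_PER_YEAR = 365 * 24  # 8760


def allowed_downtime_hours(sla_percent: float) -> float:
    """Hours per year a service may be down while still meeting its SLA."""
    return (1 - sla_percent / 100) * HOURS_PER_YEAR


def composite_availability(sla_percents: Sequence[float]) -> float:
    """Availability (as a percentage) of a chain where every service must be up.

    Assumes failures are independent of each other.
    """
    availability = 1.0
    for pct in sla_percents:
        availability *= pct / 100
    return availability * 100


slas = {"Azure Compute": 99.95, "SQL Azure": 99.9, "Azure Storage": 99.9}

# Simple sum of each service's yearly downtime budget (the figure in the notes).
summed = sum(allowed_downtime_hours(p) for p in slas.values())
# Multiplying the availabilities gives the composite SLA of the whole chain.
composite = composite_availability(list(slas.values()))

print(f"Summed downtime budget: {summed:.2f} h/year")
print(f"Composite availability: {composite:.3f}%")
print(f"Composite downtime:     {allowed_downtime_hours(composite):.2f} h/year")
```

Summing the budgets gives the 21.9 hours above; multiplying the availabilities gives a composite of roughly 99.75%, which works out to almost the same yearly downtime.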

Let’s define ‘Cloud’

  • Physical data centre behind an API
  • Cloud is a ‘resource pool’ behind an API
  • A cloud is not
    • Azure
    • AWS
  • A cloud is defined by the isolation of resources
  • Sometimes might need to go across Cloud platforms
    • e.g. Azure, AWS, different data centres, different geo-locations

A cloud is a specific data centre (rather than the platform itself)

Define High Availability?

  • Remove all single points of failure
    • Multiple hosts, load balancers, data replication
  • Graceful failover
    • Platforms might provide functionality to support this
    • Sometimes you need to build it
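
Where the platform does not provide failover, a minimal "build it yourself" version might look like the sketch below. It is only an illustration: the endpoint names, `fake_request` and the `AllEndpointsFailed` exception are all hypothetical, and a real client would catch narrower error types.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


class AllEndpointsFailed(Exception):
    """Raised when every endpoint in the failover list has failed."""


def call_with_failover(endpoints: Sequence[str], request: Callable[[str], T]) -> T:
    """Try each endpoint in priority order and return the first success."""
    errors: List[str] = []
    for endpoint in endpoints:
        try:
            return request(endpoint)
        except Exception as exc:  # a real client would catch narrower errors
            errors.append(f"{endpoint}: {exc}")
    raise AllEndpointsFailed("; ".join(errors))


# Hypothetical stand-in for a network call: the primary region is "down".
def fake_request(endpoint: str) -> str:
    if endpoint == "primary.example.com":
        raise ConnectionError("region unavailable")
    return f"served by {endpoint}"


result = call_with_failover(
    ["primary.example.com", "secondary.example.com"], fake_request
)
print(result)
```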

Define Disaster Recovery?

  • Processes or procedures to recover from a failure
    • Network, hardware, software etc
  • Practice and test DR strategies (this takes a lot of time)
    • Document, train, rehearse
  • Disaster can occur anywhere

Typical Approach

  • Duplication of infrastructure
  • identical spec
  • cold failover
  • typically under-provisioned or over-provisioned

DR with Cloud

  • Consider the advantages/features of each platform
    • to support migration, durability, restoration of data
  • Scale up as needed
  • Geo-located
    • Azure: Regions & Fault domains
    • AWS: Regions & availability zones
  • Move applications into separate fault domains

Design for Failure

  • Large scale failures are rare, but happen
  • Applications need to be fault-aware and able to recover
  • Balance cost of tolerance against cost/risk

API Endpoint Differences

  • APIs differ
  • Different resources, billing
  • Network architectures vary (VLANs, security groups)
  • Different storage architecture
  • Abstractions and management vary
  • Each Cloud is unique in various ways

Overcoming Multi-Cloud Pain

  • Design using generic concepts
  • Have tools which translate generic concepts to cloud-specific equivalents
  • How to share resources across clouds

Infrastructure Abstraction/Automation

  • Simplify deployments across multiple regions/zones
  • Automate deployments
    • Reproducible, consistent
  • Advanced server and deployment monitoring
    • Some API support, e.g. custom performance counters
    • Azure aggregates a lot of data, performance counters etc
    • Still maturing
  • Automatic scaling and operations (and throttling)
  • Third party services/apps/tools can help
  • Make use of diagnostic information

Abstraction and automation reduce the cost of maintenance.  ScaleExtreme works across clouds.

HA/DR Checklist for Risk Mitigation

  • Determine who owns the design, processes, testing
    • Who will support, and operate the application(s)
  • Develop in-house expertise (or bring help in)
  • Conduct a risk assessment
  • Specify recovery time objectives/recovery point objectives
  • Design for failure (start with application design)
  • Implement HA best practices
    • Balance cost/risk/complexity
    • automate/abstract infrastructure
    • It can be costly to support referential integrity across zones
  • Document operational processes/automations & test them
  • Test the failover and recoveries
  • Unleash the Chaos Monkey!
    • Acknowledge that things do fail

General HA Best Practices

  • Avoid single point of failure (again)
  • Place at least one of each component in different fault domains
  • Maintain sufficient capacity to absorb faults
  • Replicate data across fault domains
  • Monitoring and alerts to automate problem resolution
  • Design stateless applications (to support failover/reboot/relaunch)
    • Avoid internal instance dependencies
  • Make use of platform specific monitoring features
  • Framework services can be slow to respond
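
"Sufficient capacity to absorb faults" can be put into numbers. A rough sketch, assuming instances are spread evenly across fault domains and that you want to survive the loss of any one domain (the function name is my own):

```python
import math


def instances_per_fault_domain(instances_needed: int, fault_domains: int) -> int:
    """Instances to run in each fault domain so that losing any one domain
    still leaves at least `instances_needed` running."""
    if fault_domains < 2:
        raise ValueError("need at least two fault domains to absorb a failure")
    # The surviving (fault_domains - 1) domains must carry the full load.
    return math.ceil(instances_needed / (fault_domains - 1))


per_domain = instances_per_fault_domain(instances_needed=4, fault_domains=3)
total = per_domain * 3
print(f"{per_domain} per domain, {total} in total")
```

With a load that needs 4 instances and 3 fault domains, this comes to 2 per domain (6 in total), so 4 remain if one domain is lost.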

Some General DR Scenarios

  • Backup/restore
  • Simple Recovery
  • Warm standby
  • Multi-site
  • Multi-cloud

IMG_2319 IMG_2318

Consider the cost, complexity and risk implications.  Each scenario defines a different level of availability and recovery time.

Multi-Cloud:Cold DR:

IMG_2320

Takes time to spin up the cold DR environment.  DNS switching can be time sensitive, even if fully automated.  Reduced running costs.
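
As a back-of-the-envelope sketch of why spin-up and DNS matter here, the worst-case failover time can be estimated by adding the steps together. All of the durations below are hypothetical. The model assumes the steps happen one after another, with clients caching the old DNS record for the full TTL.

```python
from datetime import timedelta


def worst_case_failover(
    detection: timedelta, spin_up: timedelta, dns_ttl: timedelta
) -> timedelta:
    """Detect the failure, spin up the standby, then wait for cached DNS
    records to expire (assumes the steps run strictly in sequence)."""
    return detection + spin_up + dns_ttl


# Hypothetical figures for a cold standby.
cold = worst_case_failover(
    detection=timedelta(minutes=5),
    spin_up=timedelta(minutes=30),
    dns_ttl=timedelta(minutes=5),
)
print(cold)
```

For an environment that is already running, `spin_up` drops towards zero and the DNS TTL becomes the dominant term.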

Multi-Cloud:Warm DR:

IMG_2321

Slightly better approach, can replicate data/exports.  Data tier doesn’t need to spin up, just the other tiers.  Storage can be partitioned into a separate fault domain, etc.  Still fairly minimal cost, same DNS timeframe issues.  DB could be put into read-only mode for reporting etc.

Azure SQL Database: Multi-tenant service.  Export can be put into Azure Storage BLOB and can be replicated to other regions.

Multi-Cloud:Hot DR:

IMG_2322

Apps are already spun up.  Much higher cost.  DNS would need to fail over.

Multi-Cloud-HA:

IMG_2323

For designs which can tolerate NO downtime.  Route DNS traffic to different clouds.  Data consistency becomes an issue as real-time production data is being captured in two completely separate clouds.  Is real-time synchronization something which is entirely necessary in this configuration?  High cost.

How do I make my service immortal?

  • Hope for the best, plan for the worst
    • Failures do occur, design for them
  • Embrace the cloud mentality
  • Fit for purpose – no one design suits all
    • Analyse requirements, appetite for risk
    • Costs
  • Start easy – build HA first, then expand
    • Start at process and procedures
    • Automation

Open Source/Standards: needs community push to garner some attention.

Sep 13 2012
 

The Enterprise Library Integration Pack for Windows Azure from the p&p group provides some useful application blocks such as the Transient Fault Handling Application Block and the Windows Azure Scaling Application Block (WASABi).

The Transient Fault Handling Application Block can make access to storage in the cloud more resilient to temporary errors by helping you add retry logic to your code, and WASABi allows you to automatically scale your application in response to changes in demand. Come and hear how you can make your Azure applications more robust by using these application blocks.

Presented by Mahesh Krishnan

Disclaimer: These are conference session notes I compiled during various sessions at Microsoft Tech Ed 2012, September 11-14, 2012.  The majority of the content comprises notes taken from the presentation slides accompanied, occasionally, by my own narration.  Some of the content may be free hand style.  Enjoy… Rob

Follow me on Twitter @ausrob

Introduction

Building applications for the cloud.  Resource sharing and economies of scale drive lower cost.  Sharing CPU, memory and storage comes at a cost, though: resource contention et cetera.

How do you design your Azure applications to avoid contention issues and other factors?

  • Auto Scaling
    • Patterns and Practices Application Block (WASABi)
  • Transient errors
    • Basics
    • Addressing using TOPAZ Application Block
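
To give a feel for the retry logic involved in handling transient errors, here is a minimal retry-with-exponential-backoff sketch in Python. It is not the Transient Fault Handling Application Block's API (that is a .NET library). `TransientError`, `retry` and `flaky_read` are hypothetical names of my own.

```python
import time
from typing import Callable, Dict, List, Tuple, Type, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical marker for errors worth retrying (timeouts, throttling)."""


def retry(
    operation: Callable[[], T],
    max_attempts: int = 4,
    base_delay: float = 0.5,
    transient: Tuple[Type[BaseException], ...] = (TransientError,),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run `operation`, retrying transient failures with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except transient:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** (attempt - 1))  # 0.5s, 1s, 2s, ...
    raise ValueError("max_attempts must be at least 1")


# Simulated storage call that fails twice before succeeding.
calls: Dict[str, int] = {"count": 0}


def flaky_read() -> str:
    calls["count"] += 1
    if calls["count"] < 3:
        raise TransientError("storage busy")
    return "blob contents"


delays: List[float] = []
value = retry(flaky_read, sleep=delays.append)  # record delays instead of sleeping
print(value, delays)
```

Non-transient errors are deliberately not caught, so genuine bugs still surface immediately.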

Cloud Benefits

  • Zero or low upfront cost
  • Lower on-going cost
  • Near-infinite resources
  • Elasticity on demand

Scaling – In Brief

  • Helps balance running cost with load and performance
  • Vertical scaling (increase or decrease VM size: memory, CPU cores, storage)
  • Horizontal scaling (increase or decrease instances) – scale out for elasticity

Manual Scaling

  • Useful for once-off scaling (but not recommended in most situations)
  • Manual scaling can lead to errors/mistakes

Types of Scaling

  • Proactive (set up in advance)
  • Reactive (respond to demand)

Auto Scaling Wish List

  • Built into Azure [not available]
  • Scale out/in based on time
  • Scale out/in based on perf counters, queue sizes
  • Work within SLAs
  • Work within budget
  • Configuration not hard coded
  • On heavy load, cut back on CPU/tasks/features
  • Make use of billing cycles
  • Host in Azure on a worker role (no need to worry about hosting)
  • Cover multiple sites with one app (one role to rule them all)

Options..

  • Use a SaaS provider (e.g. Azure Watch)
  • Build your own
  • Leverage Patterns and Practices guidance and existing framework
    • Windows Azure Scaling Application Block (WASABi) part of the Enterprise Library

WASABi in brief

  • Supports auto-scaling
  • Throttling
  • Proactive and reactive scaling
  • Hosting:
    • In Azure worker role
    • On premise, as a Windows Service/standalone app
  • Obtain as a NuGet package
  • Recommendation: Install EntLib Library Configuration Editor
    • Can hand code.. but.. why?

Configuration

  • Service Configuration (App/Web.config)
  • Additionals – rules, services
  • Rules and Service can be stored in BLOBS

Code to support:

IMG_2223

Proactive Scaling

  • Uses Constraint Rules
    • Time tables
    • Budget
    • Ranking of rules
  • Overrides reactive rules (so.. source of truth for baseline rules)
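
The idea of ranked, time-based constraint rules can be sketched in Python. This is my own illustration of the concept, not WASABi's rule schema (WASABi rules are XML). The rule names, ranks and instance counts are hypothetical, and budget constraints are left out.

```python
from dataclasses import dataclass
from datetime import time
from typing import Sequence, Tuple


@dataclass(frozen=True)
class ConstraintRule:
    name: str
    rank: int            # higher rank wins when rules overlap
    start: time          # active from start (inclusive)...
    end: time            # ...to end (exclusive), same day
    min_instances: int
    max_instances: int

    def is_active(self, now: time) -> bool:
        return self.start <= now < self.end


def active_bounds(
    rules: Sequence[ConstraintRule], now: time, default: Tuple[int, int]
) -> Tuple[int, int]:
    """Return (min, max) instances from the highest-ranked active rule."""
    candidates = [rule for rule in rules if rule.is_active(now)]
    if not candidates:
        return default
    winner = max(candidates, key=lambda rule: rule.rank)
    return (winner.min_instances, winner.max_instances)


RULES = [
    ConstraintRule("business hours", 10, time(9, 0), time(17, 0), 4, 10),
    ConstraintRule("lunch peak", 20, time(12, 0), time(14, 0), 6, 12),
]
DEFAULT_BOUNDS = (2, 4)

print(active_bounds(RULES, time(10, 30), DEFAULT_BOUNDS))
print(active_bounds(RULES, time(12, 30), DEFAULT_BOUNDS))
print(active_bounds(RULES, time(20, 0), DEFAULT_BOUNDS))
```

The resulting bounds act as the baseline that any reactive adjustment has to stay within.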

[Example XML Rule]

IMG_2225 

Reactive Scaling

  • Conditional rules to change instance count
  • Can respond to performance counters
    • Uses <operands> (performanceCounter, queueLength, instanceCount)
  • Instances can take a few minutes to initialize
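
A reactive rule boils down to "if a metric crosses a threshold, change the instance count, but stay within the constraint bounds". A minimal Python sketch of that idea follows. The thresholds are hypothetical and this is not WASABi's rule syntax.

```python
from typing import Tuple


def reactive_target(
    current: int, avg_cpu: float, queue_length: int, bounds: Tuple[int, int]
) -> int:
    """Suggest a new instance count from current load, clamped to the bounds."""
    target = current
    if avg_cpu > 80 or queue_length > 100:      # hypothetical scale-out thresholds
        target = current + 1
    elif avg_cpu < 30 and queue_length == 0:    # hypothetical scale-in thresholds
        target = current - 1
    low, high = bounds
    return max(low, min(high, target))


print(reactive_target(current=4, avg_cpu=90, queue_length=0, bounds=(2, 6)))
print(reactive_target(current=6, avg_cpu=95, queue_length=500, bounds=(2, 6)))
print(reactive_target(current=3, avg_cpu=10, queue_length=0, bounds=(2, 6)))
```

Scaling one instance at a time is a deliberate choice in this sketch. Since new instances take minutes to come online, larger steps may suit spikier loads.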

[Example Reactive Rules]

IMG_2226

Throttling

  • Use config settings to cut back on features (specifically CPU intensive features)
  • Throttling is faster than creating new instances
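
Throttling via configuration can be as simple as switching optional features off as load rises. A small sketch, with hypothetical feature names and CPU limits:

```python
from typing import Dict, Set

# Hypothetical optional features and the average CPU % above which each is switched off.
FEATURE_CPU_LIMITS: Dict[str, float] = {
    "thumbnail_generation": 70.0,
    "search_suggestions": 80.0,
    "activity_feed": 90.0,
}


def enabled_features(
    avg_cpu: float, limits: Dict[str, float] = FEATURE_CPU_LIMITS
) -> Set[str]:
    """Return the optional features that stay on at the given CPU load."""
    return {name for name, limit in limits.items() if avg_cpu <= limit}


print(sorted(enabled_features(50)))
print(sorted(enabled_features(85)))
```

Because this only flips settings, it takes effect immediately, whereas a new instance needs minutes to start.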

Stabilization

  • Cool down settings
  • Allow scale handling (up/down) by time
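
The cool down idea can be sketched as a gate that refuses a new scaling action until enough time has passed since the last one. `CooldownGate` is a hypothetical name and the 20-minute window is an arbitrary example.

```python
from datetime import datetime, timedelta
from typing import Optional


class CooldownGate:
    """Allow a scaling action only if the cool down period has elapsed."""

    def __init__(self, cooldown: timedelta) -> None:
        self.cooldown = cooldown
        self.last_action: Optional[datetime] = None

    def allow(self, now: datetime) -> bool:
        if self.last_action is not None and now - self.last_action < self.cooldown:
            return False  # still cooling down: skip this scaling action
        self.last_action = now
        return True


gate = CooldownGate(cooldown=timedelta(minutes=20))
start = datetime(2012, 9, 13, 10, 0)

print(gate.allow(start))                           # first action goes through
print(gate.allow(start + timedelta(minutes=5)))    # too soon after the first
print(gate.allow(start + timedelta(minutes=25)))   # cool down has elapsed
```

This stops the instance count from oscillating while newly started instances are still warming up.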

[Demo WASABi in action]

IMG_2227
WASABi settings in the Ent Lib Configuration Editor

Apologies – I have to leave now so I can make my helicopter flight.  Mahesh is still demonstrating the WASABi capability for proactive and reactive scaling.

/R