
CCP1 Storage 1 (Ephemeral, Cinder)

6 min read · tagged ccp1, cloud, storage

Cloud Storage

Block Storage

Data is represented as logical blocks on top of a complex physical layout; the concept of files is one further level of abstraction on top of blocks.

Storage Area Network (SAN)

  • Dedicated network to access consolidated storage
  • Array of storage devices is made available over high-speed network
  • Only block-level, no file abstraction (block IO)
  • Protocols: iSCSI, Fibre Channel, encapsulated SCSI
  • Not made of commodity hardware
  • Only available as large units (unit level scalability)
  • Single point of failure
  • Not software-defined
  • No native support for object storage

Network Attached Storage (NAS)

  • Node connected to a network which has direct access to disk
  • Offers file-based access
  • Can be mounted locally as network drive
  • Protocols: AFS, AFP, CIFS, FTP, HTTP, NFS
  • No simple scale-out
  • Slow
  • Enterprise grade
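
A minimal sketch of the file-based access model (server name and export path below are placeholders, not from the notes): a NAS export is mounted like a local directory over NFS

sudo mkdir -p /mnt/shared
sudo mount -t nfs nas.example.com:/export/shared /mnt/shared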

Directly Attached Storage (DAS)

  • Use disks in servers and provide as service
  • Storage and compute scale together
  • Block-level performance, fully distributed
  • Requires middleware (Cinder, Swift)

SAN vs NAS

  • SAN appears as a block level device that can be mounted and formatted
  • SAN scales well in large deployments
  • NAS usually used to directly expose a filesystem
  • NAS usually as independent single boxes
  • NAS is simpler for simple data sharing/file access
  • NAS is less flexible/powerful as it already includes abstraction layers over block storage

Cloud storage: common requirements and design principles

Requirements

  • Flexible volume management
  • Multi-tenancy
  • Thin-provisioning
  • Accounting and user policies
  • No operational down-time
  • Privacy and security

Data Placement

  • Placement of data is handled transparently
  • Data is allocated, migrated, rebalanced transparently
  • Policies can be applied to increase usage efficiency of resources (same data on less components)
  • Policies can be applied to increase I/O performance (same data on multiple components)

Data Striping

  • Segment sequential data into stripes
  • Store stripes on separate disks/nodes
  • R/W operations on stripes can be parallel
  • Throughput increases by a factor of N (N = number of stripes stored on different storage units); the upper bound is the network bandwidth
  • Failure of any of the N storage units implies data loss
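
e.g. with illustrative numbers: 4 stripes on 4 disks of 150 MB/s each give up to 4*150 MB/s = 600 MB/s of aggregate throughput, but over a 1 Gbit/s link the effective rate is capped at roughly 125 MB/s.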

Data Replication

  • Data is replicated across failure zones
  • Replicas: equivalent full copies of the data, minimum 2 - two copies already limit raw (uncompressed) storage efficiency to 50%
  • Erasure coding: Algorithms to replicate data more flexibly and still provide fault tolerance (e.g. RAID-Z2 allows 2 disks in a pool to fail)
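
e.g. with illustrative numbers: a RAID-Z2 layout over 6 disks uses 4 for data and 2 for parity, so any 2 disks may fail while the raw storage efficiency is 4/6 ≈ 67%, compared to 33% for keeping 3 full replicas.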

Data Durability

Industrial metric to determine the quality of a cloud storage system

  • Independent from cluster size but not from architecture
  • Obtained by determining the annual expected data loss
  • Likelihood that customer data get permanently lost

e.g. an annual loss of 0.01% = durability of 99.99% => storing 10,000 objects, we expect to lose 0.0001*10,000 = 1 object per year

Data Availability

Likelihood that customer data will not be accessible at a particular point in time

  • Durability loss implies availability loss (not the other way)
  • Availability is uptime

e.g. take 100% and subtract the average of the error rates measured over each X-minute interval of a billing cycle
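
With illustrative numbers: if the average error rate over all 5-minute intervals of a 30-day billing cycle is 0.05%, the reported availability for that cycle is 100% - 0.05% = 99.95%.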

Both durability and availability depend on data placement, replication, MTTR (mean time to repair) and MTTF (mean time to failure).

Design Principles

Availability: Always-available, failures of individual components must not affect the overall health

Scalability/Efficiency: Distribute to enable scalability, data deduplication, compression

Consistency: Ensure data in a distributed system remain consistent, copy-on-write

CAP Theorem

Can’t have them all.

Consistency: at a certain time, all nodes see the same data

Availability: system is able to process and reply to client requests

Partition Tolerance: system remains operational even when messages between nodes are lost or a component such as a network link fails

AC: can only be achieved on a single machine (e.g. non-distributed DB)

PA: always replies but data might be outdated (e.g. DNS, Amazon web store)

PC: may not always reply but provided data is always consistent (e.g. flight booking system)

CAP: ongoing research aims to soften the PA/PC trade-off (e.g. with Conflict-free Replicated Data Types)

Spanner

Spanner prioritizes consistency and partition tolerance (CP), sacrificing availability when necessary. On a stable, redundant internal network it delivers five nines of availability, so in practice it comes very close to offering all three CAP properties.

OpenStack Storage

Cinder is a block storage service that provides persistent volumes to Nova compute instances

  • Software infrastructure for volume management
  • Outside of the data path
  • Supports DAS on the local compute node
  • Supports remote SAN/NAS via cinder backend drivers
  • Data persistence beyond compute instance lifetime
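
A minimal sketch of the workflow (volume size and names are placeholders): create a volume through Cinder and attach it to a running Nova instance, where it shows up as an additional block device

openstack volume create --size 10 data-vol
openstack server add volume my-instance data-vol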

Compute with local storage

  • Simple configuration
  • Can deliver higher performance to VMs (e.g. SSD)
  • Large footprint on compute hosts
  • Requires local storage on compute hosts
  • Can be less resilient

Compute with remote storage

  • Nova supports multiple remote storage backends
  • NAS: NFS
  • SAN: iSCSI, Ceph RBD (widely used, good qemu support, scales well)
  • DFS: NFS (simple, proven, bottleneck)
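
For the Ceph RBD backend, a minimal nova.conf sketch (pool and user names are placeholders) points the libvirt driver at an RBD pool:

[libvirt]
images_type = rbd
images_rbd_pool = vms
images_rbd_ceph_conf = /etc/ceph/ceph.conf
rbd_user = cinder
rbd_secret_uuid = <libvirt secret uuid>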

iSCSI

Internet SCSI (iSCSI) is a protocol to access block devices over a network

  • Different storage solutions can sit behind the same interface
  • Built on top of SCSI
  • Supports non-SCSI disks through an iSCSI gateway
  • Supports daisy chaining
  • Encapsulates SCSI-3 instruction set (longer device names, timeouts, authentication)

iSCSI target: the server exposing storage resources

iSCSI initiator: the client accessing exported storage resources

iSCSI TPG: Target Portal Group from which block devices are exposed

iSCSI LUN: Logical Unit Number identifying a block device on a TPG
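
A minimal sketch of the initiator side with open-iscsi (portal address and target IQN are placeholders): discover the targets behind a portal and log in, after which the exported LUNs appear as local block devices

sudo iscsiadm -m discovery -t sendtargets -p 192.168.1.50
sudo iscsiadm -m node -T iqn.2003-01.org.linux-iscsi:target1 -p 192.168.1.50 --login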

ZFS

Very powerful, mature, trusted local file system with advanced features

  • Copy-on-write: the on-disk state is always valid
  • Transactional: operations either fully succeed or fail
  • Checksummed: detects silent corruption and bad metadata
  • Snapshots
  • Cloning
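
As a sketch of how the checksums are used (pool name is a placeholder): a scrub reads all data in the pool, verifies every checksum and repairs bad blocks from redundancy where possible

zpool scrub <poolname>
zpool status <poolname>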

Software-defined storage resources

  • Organize physical disks into vdevs
  • Organize vdevs into pools
  • Configure pool features and allocate volumes
  • Expose volumes at block level or file level

Virtual devices (vdevs)

  • Vdev configurations provide redundancy
  • Mirror (1:1 replica)
  • Erasure codes: raidz1, raidz2, raidz3
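
A minimal sketch of a mirrored vdev (pool name and devices are placeholders):

zpool create <poolname> mirror /dev/sda /dev/sdb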

zpool

  • collection of vdevs

datasets

  • can be created over zpools
  • volume: block devices
  • filesystem: storage unit for files
  • snapshots: read-only point-in-time copies of a volume or filesystem
  • clones: writable volume or filesystem created from existing one

Usage

Create zpool

zpool create <poolname> <vdev> <vdev-resources>

Create a zpool over 4 disks with raidz2: two disks can fail, storage efficiency is 50% (raidz3 with 8 disks has an efficiency of 62.5%)

zpool create whirl raidz2 /dev/sda /dev/sdb /dev/sdc /dev/sdd

Query zpool status

zpool status -v whirl

Configure pool compression to lz4

zfs set compression=lz4 whirl

Add spare device to pool for increased availability

zpool add whirl spare /dev/sde

Create volume

sudo zfs create -V 30MB whirl/vol

Check existing datasets

sudo zfs list
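
On Linux the volume created above typically appears as a block device under /dev/zvol/whirl/vol. A minimal sketch of snapshots and clones on it (snapshot and clone names are placeholders):

sudo zfs snapshot whirl/vol@snap1
sudo zfs clone whirl/vol@snap1 whirl/volclone
sudo zfs list -t all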
