Responsible for coordination between client and storage system
Stateless design and implementation
Supports high-availability and high-performance configuration
Storage Servers
Management entities: Account and container server
Namespace partitioning
Implemented as SQLite DB
Holds management related metadata
Data storage: Object storage server
Pool of storage nodes
On-disk storage
Large files are divided into segments and stored together with a manifest (sketch below)
Object related metadata stored in file system attributes
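As a hedged sketch of the segment/manifest idea (Swift's Static Large Objects): the manifest is a JSON list describing already-uploaded segments; the paths, etags, and sizes below are placeholders.

    import json

    # Placeholder segment descriptions; each segment was uploaded as its
    # own object into a "segments" container beforehand.
    manifest = [
        {"path": "/segments/bigfile/000001", "etag": "d41d8cd9...", "size_bytes": 1073741824},
        {"path": "/segments/bigfile/000002", "etag": "9e107d9d...", "size_bytes": 1073741824},
    ]

    # The JSON body is PUT to the object's own path with
    # "?multipart-manifest=put"; on GET, Swift serves the concatenation
    # of the listed segments.
    body = json.dumps(manifest)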
Ring Concept
Ring represents a mapping between data entities and physical location in the cluster. Maintains the mapping between data, partitions, zones, devices.
Account server: uses the account ring to maintain a list of containers
Container server: uses container ring to maintain a list of objects
Object server: uses the object ring to maintain a list of object locations, one ring per storage policy
Failure domains
Region: Geographical, WAN
Zone: Local, power/network separation
Partitions
Across single device: speed
Across multiple devices: recover from failures
Replicas
Replica count (3 by default)
Balance: Achieve capacity distribution
Dispersion: Isolate replicas across failure domains
Device weight: Balance the distribution of partitions according to storage capacity and speed
Consistent hashing for location
Stateless algorithm to select correct partition
C = partition power: number of leading bits of the MD5 hash of the item path used as the partition index
N = 2^C = total number of partitions in the ring
partition index = top C bits of the MD5 hash, e.g. (first 32 bits of MD5(item path)) >> (32 - C) (sketch below)
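A minimal Python sketch of this lookup, mirroring how Swift's ring derives the partition from the top bits of the MD5 digest; the partition power and object path are illustrative, and the cluster-wide hash prefix/suffix Swift mixes in is omitted.

    import hashlib
    import struct

    def get_partition(account, container, obj, part_power=10):
        # Hash the item path with MD5 and keep the top part_power bits
        # of the first 4 bytes as the partition index.
        path = "/%s/%s/%s" % (account, container, obj)
        digest = hashlib.md5(path.encode("utf-8")).digest()
        top32 = struct.unpack(">I", digest[:4])[0]
        return top32 >> (32 - part_power)

    # 2**part_power partitions in total; the same path always maps to
    # the same partition, so no central lookup table is needed.
    print(get_partition("AUTH_test", "photos", "cat.jpg"))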
Storage Policies
One object ring per storage policy to segregate object storage within a single cluster
Levels of durability: Multiple replication policies can be set up for the same cluster
Performance: SSD-only object ring can be created for low-latency, high performance policy
Grouping nodes: Object rings may span different physical servers
Storage implementations: Nodes can use a different DiskFile implementation, with a policy directing traffic only to those nodes (example below)
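An illustrative swift.conf excerpt with two replication policies (names are placeholders); each policy index corresponds to its own object ring file (object.ring.gz, object-1.ring.gz).

    [storage-policy:0]
    name = standard-3x
    policy_type = replication
    default = yes

    [storage-policy:1]
    name = ssd-low-latency
    policy_type = replication

A container is bound to a policy at creation time via the X-Storage-Policy header.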
CAP
Designed as AP (Availability, Partition Tolerance) system with EC (eventual consistency)
A: Always returns a reply, even if the data may be stale
P: Keeps operating despite network partitions, which enables scale-out, resilience and performance
EC: may provide outdated data
Request Flow
Resource creation
HTTP PUT request to proxy server
Proxy server determines rings and partitions for provided data
Success is reported once a majority (quorum) of the replica nodes respond
Resource retrieval
HTTP GET request to proxy server
Proxy server determines rings and partitions for requested data
Requests the data from the replica nodes holding the partition
Returns the object data from the first successful response
A Keystone token is mandatory, passed in the X-Auth-Token header (request sketch below)
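A hedged sketch of both flows with python-requests; the storage URL and token are placeholders that would normally come from Keystone.

    import requests

    STORAGE_URL = "https://swift.example.com/v1/AUTH_demo"  # placeholder endpoint
    HEADERS = {"X-Auth-Token": "gAAAAAB..."}                # placeholder Keystone token

    # Create a container, then PUT an object; the proxy picks the ring,
    # computes the partition and writes to the replica nodes.
    requests.put(STORAGE_URL + "/photos", headers=HEADERS)
    with open("cat.jpg", "rb") as f:
        requests.put(STORAGE_URL + "/photos/cat.jpg", headers=HEADERS, data=f)

    # GET goes to the proxy as well, which reads from the nodes
    # responsible for the partition and streams the object back.
    resp = requests.get(STORAGE_URL + "/photos/cat.jpg", headers=HEADERS)
    print(resp.status_code, len(resp.content))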
Extensibility
DiskFile API enables pluggable backends
Storage backends
Swift-on-File: POSIX filesystem
swift-ceph-backend: Ceph RADOS
kinetic-swift: Seagate Kinetic Drive
swift-scality-backend: Scality (sproxyd)
OpenStack Ceph
Unified object, block and file store on top of an object-based system. All data is stored as objects in a flat namespace.
RADOS: Reliable Autonomic Distributed Object Store
Integrates with OpenStack services via the Python library librbd
Architecture
Monitor daemon (MON):
Deploy in small, odd number of instances
Does not serve stored objects to client
Maintains state and cluster membership
Provides consensus for distributed decision making
Object storage device daemon (OSD):
Minimum three in a cluster
One per disk or RAID group
Serves stored objects to clients
Intelligently peer to perform replication tasks
Meta Data Service (MDS):
Only required for CephFS / shared filesystems
Does not serve file data to clients
Manages metadata for POSIX-compliant shared filesystems
Stores metadata in RADOS
Interfaces
Native
Provides direct access to RADOS for apps
librados: native support for C, C++, Python, Java, etc.
Local, no HTTP overhead
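A minimal librados sketch using the Python binding; the pool name and ceph.conf path are assumptions.

    import rados

    # Connect to the cluster described in the local ceph.conf (assumed path).
    cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
    cluster.connect()

    # I/O context for an existing pool (name is a placeholder).
    ioctx = cluster.open_ioctx("mypool")

    # Write and read an object directly against the OSDs, no HTTP gateway.
    ioctx.write_full("hello.txt", b"hello from librados")
    print(ioctx.read("hello.txt"))

    ioctx.close()
    cluster.shutdown()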
Object
REST-based interface to RADOSGW daemon
Supports buckets and user management
Provides S3 and Swift compatible interfaces
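For the S3-compatible side, any standard S3 client can point at the RADOSGW endpoint; the endpoint and credentials below are placeholders.

    import boto3

    s3 = boto3.client(
        "s3",
        endpoint_url="http://radosgw.example.com:8080",  # RADOSGW, placeholder
        aws_access_key_id="ACCESS",                      # placeholder
        aws_secret_access_key="SECRET",                  # placeholder
    )

    s3.create_bucket(Bucket="demo-bucket")
    s3.put_object(Bucket="demo-bucket", Key="hello.txt", Body=b"via radosgw")
    print(s3.get_object(Bucket="demo-bucket", Key="hello.txt")["Body"].read())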
Block
RADOS Block Device (RBD, librbd)
Storage of virtual disks in RADOS
Images are striped across cluster
Supported by QEMU/KVM, OpenStack
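A short librbd sketch via the Python binding; pool name, image name, and conf path are placeholders.

    import rados
    import rbd

    cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")  # assumed conf path
    cluster.connect()
    ioctx = cluster.open_ioctx("rbd")  # pool name, placeholder

    # Create a 1 GiB image; its data is striped over objects in the pool.
    rbd.RBD().create(ioctx, "vm-disk-1", 1024 ** 3)

    # Open the image and write at an offset, roughly what a hypervisor does.
    image = rbd.Image(ioctx, "vm-disk-1")
    image.write(b"boot sector data", 0)
    image.close()

    ioctx.close()
    cluster.shutdown()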
File
Provided by CephFS (libcephfs)
Linux kernel client (mount -t ceph 192.168.1.5:/ /mnt/tank)
Can be exported with NFS or samba
CephFS library for apps (libcephfs.so)
Concepts
No central gateway for data client operations
Interaction with MONs is only needed to obtain the latest version of the cluster map
Direct data exchange with OSDs
OSDs handle replication of objects to other nodes for availability
Object placement is dynamically computed
No central lookup table unlike Swift rings
Data localisation with CRUSH
Theoretically unlimited scalability
Devices: Physical location of objects
Placement groups (PG): Logical grouping of objects; each PG maps to a set of OSDs
Pools: Logical partitions with specific settings (ownership, replicas, PG number, CRUSH ruleset per PG, contains PGs)
CRUSH rulesets: Rules for the algorithm that selects the physical OSDs where data is stored, using the configured hierarchy of DCs/buckets in the map
Buckets: Physical hardware element in the hierarchy
The client runs the CRUSH algorithm to determine where to write the object. The primary OSD runs the algorithm as well to determine the OSDs for the replicas. The same applies to reading and rebalancing.
CRUSH
Controlled Replication Under Scalable Hashing (CRUSH)
Decides where to store and retrieve data in a RADOS cluster.
Enables clients to communicate directly with OSDs
Two parts:
CRUSH Algorithm
Stateless, only needs CRUSH Map
Statistically uniform distribution (hash-based)
Rule-based configuration
Similar to the consistent hashing of Swift rings
Operates upon CRUSH Map, aware of cluster topology
CRUSH Map
Location and topology of OSDs
defines devices
defines bucket types
defines buckets
defines rules
Input to the CRUSH algorithm
Written as a plain-text file, freely defined by the operator
Can be retrieved and edited
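An illustrative excerpt of such a decompiled CRUSH map with one host bucket, a root, and a replicated rule; names, ids, and weights are placeholders, and the exact syntax varies by Ceph release.

    # devices
    device 0 osd.0
    device 1 osd.1

    # buckets
    host node1 {
        id -2
        alg straw2
        hash 0              # rjenkins1
        item osd.0 weight 1.000
        item osd.1 weight 1.000
    }
    root default {
        id -1
        alg straw2
        hash 0
        item node1 weight 2.000
    }

    # rules
    rule replicated_rule {
        id 0
        type replicated
        step take default
        step chooseleaf firstn 0 type host
        step emit
    }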
File -> Objects -> Placement Groups -> OSDs
File is broken into objects
Objects are mapped into PGs using a simple hash function with an adjustable bit mask that controls the number of PGs (hash(objId) & mask = pgid, toy sketch below)
PGs are assigned to OSDs using the CRUSH algorithm/map (CRUSH(pgid, CRUSHMAP) = [osd_x, osd_y])
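A toy Python illustration of the first step only (hash-and-mask); zlib.crc32 stands in for Ceph's real object hash, and the CRUSH step that turns the pgid into OSDs is not shown.

    import zlib

    def object_to_pg(obj_name, pg_num=64):
        # pg_num is assumed to be a power of two, so the mask picks
        # one of pg_num placement groups: hash(objId) & mask = pgid
        mask = pg_num - 1
        return zlib.crc32(obj_name.encode("utf-8")) & mask

    # Step 2 (not shown): CRUSH(pgid, crush_map) -> [osd_x, osd_y, ...]
    print(object_to_pg("rbd_data.1234.0000000000000000"))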