MinIO Architecture & Object Storage Deep Dive
MinIO does one thing: object storage for the private cloud.
This document provides a comprehensive overview of MinIO’s architecture: how it stores objects, distributes data across servers, and retrieves objects.
Table of Contents
- Core Principles
- High-Level Architecture
- System Components
- Object Storage Internals
- Erasure Coding
- PUT Operation (Storing Objects)
- GET Operation (Retrieving Objects)
- Distributed Architecture
- Server Pools
- Code Architecture
- Healing & Self-Recovery
- Advanced Features
- Deprecated Gateway
- Real-World Metadata Examples
Core Principles
MinIO is built on several fundamental design principles:
- Metadata-Free Design: No centralized metadata database. Object metadata is stored locally on each disk alongside the data (in xl.meta files). This eliminates the metadata bottleneck and prevents cluster-wide failures.
- Shared-Nothing Architecture: Each node operates independently. Data is distributed and scattered across multiple nodes and disks, with no single point of failure.
- S3 Compatibility: Full S3 API compatibility allows seamless migration from AWS S3 or other S3-compatible systems.
- Erasure Coding + Bitrot Protection: Multi-level data protection using Reed-Solomon erasure coding and HighwayHash checksums. Even if you lose up to half of your drives, you can still recover data; in distributed mode, up to (N/2)-1 node failures are tolerated for writes and N/2 for reads.
- Rebalance-Free Expansion: Add new pools without rebalancing existing data.
- No Master Node: All nodes are peers in a decentralized architecture using distributed locking (dsync).
How MinIO Compares to Legacy Storage
- MinIO adopts a metadata-free design for high performance, which prevents a metadata database from becoming a bottleneck for the entire system and confines failures to a single cluster, so no other clusters are affected.
- MinIO is also fully compatible with the S3 interface, so it can act as a gateway that provides S3 access to the outside world.
- MinIO uses erasure coding together with checksums to protect against hardware failures: you can lose up to half of your drives and still recover the data, and (N/2)-1 node failures are tolerated in distributed mode.
Legacy Object Storage Architecture
High-Level Architecture
┌─────────────────────────────────────────────────────────────────┐
│ Client (S3 API) │
│ PUT/GET/DELETE/LIST Requests │
└────────────────────────┬────────────────────────────────────────┘
│
┌────────────────────────▼────────────────────────────────────────┐
│ HTTP Server & Router │
│ (Authentication, Throttling, Compression) │
└────────────────────────┬────────────────────────────────────────┘
│
┌────────────────────────▼────────────────────────────────────────┐
│ API Handlers (GET/PUT/DELETE/LIST) │
└────────────────────────┬────────────────────────────────────────┘
│
┌────────────────────────▼────────────────────────────────────────┐
│ ObjectLayer Interface (Abstraction) │
└────────────────────────┬────────────────────────────────────────┘
│
┌────────────────────────▼────────────────────────────────────────┐
│ erasureServerPools (Multi-Pool Manager) │
│ - Weighted pool selection based on available space │
│ - Pool expansion & decommissioning │
└────────────────────────┬────────────────────────────────────────┘
│
┌────────────────┼────────────────┐
│ │ │
┌───────▼────┐ ┌───────▼────┐ ┌──────▼──────┐
│ Pool 1 │ │ Pool 2 │ │ Pool N │
│ erasureSets│ │ erasureSets│ │ erasureSets │
└───────┬────┘ └───────┬────┘ └──────┬──────┘
│ │ │
┌──┴──┐ ┌──┴──┐ ┌──┴──┐
│Set1 │ │Set1 │ │Set1 │
└──┬──┘ └──┬──┘ └──┬──┘
│
┌───────▼──────────────────────────────────────────────────────┐
│ erasureObjects (Single Set - Erasure Coding Logic) │
│ - Reed-Solomon EC:M+N encoding/decoding │
│ - Quorum-based read/write │
│ - Object-level healing │
└───────┬──────────────────────────────────────────────────────┘
│
┌───────▼──────────────────────────────────────────────────────┐
│ StorageAPI Interface (Local & Remote Disk I/O) │
│ - xlStorage (local), storageRESTClient (remote) │
└───────┬──────────────────────────────────────────────────────┘
│
┌────┴─────────────────────┬──────────────────┐
│ │ │
┌──▼──┐ ┌──────┐ ┌──────────┐ │ ┌──────────────┐
│Disk1│ │Disk2 │ │ Disk.. │ │ │ Disk16 │
│ │ │ │ │ │ │ │ │
│ │ │ │ │ │ │ │ │
└─────┘ └──────┘ └──────────┘ │ └──────────────┘
└─ (Parallel I/O)
System Components
1. HTTP Server Layer
- Handles incoming S3 API requests
- Middleware chain: Auth (Signature V4), Tracing, Throttling, GZIP compression
- Routes requests to appropriate handlers
2. API Handler Layer
- GetObjectHandler: Retrieves objects
- PutObjectHandler: Stores objects
- DeleteObjectHandler: Deletes objects
- ListObjectsHandler: Lists bucket contents
- Multipart upload handlers
3. ObjectLayer Interface
Abstract interface implemented by erasureServerPools, erasureSets, and erasureObjects. Provides unified S3 operations.
4. Erasure Server Pools
- Manages multiple independent pools
- Selects pool based on available space (weighted random)
- Enables non-disruptive expansion
- Each pool has its own erasure sets
5. Erasure Sets
- Routes objects to correct set using consistent hash (SipHash)
- Each set is a fixed group of disks (e.g., 16); an object lives entirely within one set
- Uses deterministic placement: same object name → same set always
6. Erasure Objects
- Core logic for encoding/decoding
- Manages read/write quorum
- Handles object-level healing
- Coordinates with StorageAPI
7. Storage Layer
- xlStorage: Local disk I/O, metadata, file operations
- storageRESTClient: Remote disk via REST API
- xlStorageDiskIDCheck: Health wrapper for disk monitoring
Object Storage Internals
On-Disk Layout
Each disk in a MinIO cluster stores data in the following structure:
disk1/
├── .minio.sys/
│ ├── format.json # Cluster configuration
│ ├── config/ # Server configuration
│ ├── buckets/ # Bucket metadata
│ └── tmp/ # Temporary files during writes
│
├── bucket1/
│ ├── object1/
│ │ ├── xl.meta # Metadata (MessagePack serialized)
│ │ └── a1b2c3d4-e5f6.../ # DataDir UUID (contains data shard)
│ │ └── part.1 # Actual shard data
│ └── object2/
│ └── ...
└── bucket2/
└── ...
xl.meta File Format
The xl.meta file contains critical metadata in MessagePack binary format:
Header:
- Magic: "XL2 "
- Version: 1.3
Versions[] (Version History):
├── Type: ObjectType, DeleteType, or LegacyType
├── ObjectV2 (if ObjectType):
│ ├── VersionID: UUID (unique version identifier)
│ ├── DataDir: UUID (data directory on disk)
│ ├── ErasureAlgorithm: ReedSolomon
│ ├── ErasureM: Number of data blocks (e.g., 12)
│ ├── ErasureN: Number of parity blocks (e.g., 4)
│ ├── ErasureBlockSize: Block size for encoding (1MB default)
│ ├── ErasureIndex: This disk's shard index (1-16, 1-based)
│ ├── ErasureDist: Distribution array [disk_index_0, disk_index_1, ...]
│ ├── BitrotChecksumAlgo: HighwayHash (for integrity)
│ ├── PartNumbers: Part IDs (multipart uploads)
│ ├── PartSizes: Size of each part
│ ├── Size: Total object size
│ ├── ModTime: Modification timestamp (Unix nanoseconds)
│ ├── MetaSys: System metadata (inline data flag, etc.)
│ └── MetaUsr: User metadata (Content-Type, custom headers)
format.json - Cluster Configuration
Located at .minio.sys/format.json on each disk:
{
"version": "1",
"format": "xl",
"id": "deployment-uuid",
"xl": {
"version": "3",
"this": "disk-uuid",
"sets": [
["disk-0", "disk-1", "disk-2", ..., "disk-15"],
["disk-16", "disk-17", "disk-18", ..., "disk-31"]
],
"distributionAlgo": "SIPMOD"
}
}
Key Fields:
- id: Cluster deployment ID (shared by all disks)
- this: UUID of the current disk
- sets: Array of erasure sets, each containing disk UUIDs
- distributionAlgo: Algorithm used for object placement (SipHash with parity consideration)
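For illustration, a minimal Go sketch that parses this layout; the struct fields follow the example above, and the type name formatXL is made up for this snippet, not MinIO's internal type:

package main

import (
	"encoding/json"
	"fmt"
	"os"
)

// formatXL mirrors the format.json fields shown above (illustrative name).
type formatXL struct {
	Version string `json:"version"`
	Format  string `json:"format"`
	ID      string `json:"id"`
	XL      struct {
		Version          string     `json:"version"`
		This             string     `json:"this"`
		Sets             [][]string `json:"sets"`
		DistributionAlgo string     `json:"distributionAlgo"`
	} `json:"xl"`
}

func main() {
	raw, err := os.ReadFile(".minio.sys/format.json")
	if err != nil {
		panic(err)
	}
	var f formatXL
	if err := json.Unmarshal(raw, &f); err != nil {
		panic(err)
	}
	fmt.Printf("deployment %s: %d erasure set(s), %d disks in set 0\n",
		f.ID, len(f.XL.Sets), len(f.XL.Sets[0]))
}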
Erasure Coding
Reed-Solomon Encoding
MinIO uses Reed-Solomon erasure coding (via klauspost/reedsolomon library):
Example Configuration (16 disks):
- EC:12+4 (12 data shards + 4 parity shards)
- Block size: 1MB default
- Can tolerate: Up to 4 disk failures per erasure set
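To make EC:12+4 concrete, here is a self-contained sketch using the klauspost/reedsolomon library mentioned above; the sample data and the choice of which four shards to drop are arbitrary:

package main

import (
	"bytes"
	"fmt"

	"github.com/klauspost/reedsolomon"
)

func main() {
	// EC:12+4, matching the example configuration above.
	enc, err := reedsolomon.New(12, 4)
	if err != nil {
		panic(err)
	}

	data := bytes.Repeat([]byte("minio"), 1<<18) // ~1.25 MB of synthetic data

	// Split slices (and zero-pads) the input into 12 data shards and
	// allocates 4 empty parity shards.
	shards, err := enc.Split(data)
	if err != nil {
		panic(err)
	}
	// Encode fills the 4 parity shards from the 12 data shards.
	if err := enc.Encode(shards); err != nil {
		panic(err)
	}

	// Simulate losing any 4 drives: nil out their shards.
	shards[0], shards[5], shards[12], shards[15] = nil, nil, nil, nil

	// Reconstruct recomputes the missing shards from the 12 survivors.
	if err := enc.Reconstruct(shards); err != nil {
		panic(err)
	}
	ok, err := enc.Verify(shards)
	if err != nil {
		panic(err)
	}
	fmt.Println("all 16 shards verified after reconstruction:", ok)
}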
How It Works
Encoding (Write):
Original File (10MB)
│
▼
┌──────────────────────────────────┐
│ Split into 1MB blocks (10 blocks)│
└──────────────────────────────────┘
│
▼
┌──────────────────────────────────────────────────────┐
│ For each 1MB block: │
│ ├─ Split into 12 data shards (~85KB each) │
│ └─ Generate 4 parity shards (~85KB each) │
│ │
│ Result: 16 shards per block (12 data + 4 parity) │
└──────────────────────────────────────────────────────┘
│
▼
┌──────────────────────────────────────────────────────┐
│ Write to 16 disks in parallel: │
│ ├─ Disk 0: shard_0 (all blocks) │
│ ├─ Disk 1: shard_1 (all blocks) │
│ ├─ Disk 12-15: parity shards │
│ └─ All disks: xl.meta (metadata) │
└──────────────────────────────────────────────────────┘
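As a sanity check on the "~85KB" figure: each 1MB block split into 12 data shards gives ceil(1,048,576 / 12) = 87,382 bytes per shard, roughly 85.3KB.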
Decoding (Read):
Read Request for 10MB object
│
▼
┌──────────────────────────────────┐
│ Scenario 1: All 16 disks healthy │
│ ├─ Read 12 data shards │
│ ├─ Ignore 4 parity shards │
│ └─ Reconstruct original data │
└──────────────────────────────────┘
┌──────────────────────────────────┐
│ Scenario 2: 2 disks dead │
│ ├─ Read 14 available shards │
│ ├─ Use Reed-Solomon to recover │
│ │ missing 2 shards │
│ └─ Reconstruct original data │
└──────────────────────────────────┘
┌──────────────────────────────────┐
│ Scenario 3: 5+ disks dead │
│ └─ READ FAILS (quorum lost) │
└──────────────────────────────────┘
Read/Write Quorum
Read Quorum: M (data blocks)
- Need M shards available to reconstruct object
- Example: EC:12+4 → need 12 out of 16 disks
Write Quorum:
- If parity < 50% of drives: Write Quorum = M (data blocks)
- If parity = 50% of drives: Write Quorum = M + 1 (avoid split-brain)
- Example: EC:12+4 (16 disks) → Write Quorum = 12
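As a compact sketch of the rule above (the function name is illustrative, not MinIO's):

// writeQuorum computes the write quorum described above.
func writeQuorum(totalDrives, parityDrives int) int {
	quorum := totalDrives - parityDrives // the data-drive count M
	if parityDrives == totalDrives/2 {
		quorum++ // parity at exactly half: require a strict majority
	}
	return quorum
}

// writeQuorum(16, 4) == 12  (EC:12+4)
// writeQuorum(16, 8) == 9   (EC:8+8, parity is half, so M+1)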
Erasure Coding Visual
The data-block count M (written K in some MinIO references) constitutes the read quorum for the deployment. The erasure set must therefore have at least M healthy drives to support read operations.
For a small object with only one part (part.1), this example has 2 data blocks and 2 parity blocks:
Ref: https://blog.min.io/erasure-coding-vs-raid/
MinIO's erasure coding not only protects against drive and node failures; it also heals at the object level:
- Heals one object at a time, whereas RAID heals at the volume level
- A corrupt object is restored in seconds, versus hours in RAID
Read Request Flow
Write Request Flow
Two cases for write quorum:
- Case 1: Parity < 50% of drives → Write Quorum = M (the data-drive count)
- Case 2: Parity = 50% of drives → Write Quorum = Parity + 1
If parity equals half the number of erasure-set drives, write quorum is parity + 1 to avoid data inconsistency due to 'split brain' scenarios.
Bitrot Protection
MinIO protects against silent data corruption (BitRot):
- HighwayHash Algorithm: Computes 256-bit hash per block
- Verification: Hash checked on every read
- Performance: >10 GB/sec hashing on single Intel CPU core
- Storage Format: [hash | data | hash | data | ...]
- Detection: Hash mismatch → disk marked bad → reconstruction
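A minimal sketch of the checksum-on-write, verify-on-read cycle using the minio/highwayhash package; the random key here is purely illustrative:

package main

import (
	"bytes"
	"crypto/rand"
	"fmt"

	"github.com/minio/highwayhash"
)

func main() {
	// HighwayHash takes a 256-bit key; a random key is used only for this demo.
	key := make([]byte, 32)
	if _, err := rand.Read(key); err != nil {
		panic(err)
	}

	h, err := highwayhash.New(key) // 256-bit variant
	if err != nil {
		panic(err)
	}
	block := []byte("contents of one erasure shard block")
	h.Write(block)
	stored := h.Sum(nil) // checksum persisted next to the block on disk

	// On every read the hash is recomputed; any mismatch flags bitrot.
	h.Reset()
	h.Write(block)
	fmt.Println("block intact:", bytes.Equal(h.Sum(nil), stored))
}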
PUT Operation
Step-by-Step Flow (Storing a 10MB Object)
1. CLIENT REQUEST
│
├─ PUT /bucket/photos/vacation.jpg (10MB)
│
▼
2. HTTP HANDLER
│
├─ Parse request, extract bucket/object name
├─ Verify authentication (Signature V4)
├─ Create hash verifier (for bitrot)
│
▼
3. POOL SELECTION
│
├─ If object already exists:
│ └─ Use same pool as existing version
│
└─ If new object:
├─ Calculate available space for each pool
├─ Filter: skip suspended/rebalancing pools
├─ Weighted random selection (prefer pool with most space)
└─ Select Pool 0
│
▼
4. SET SELECTION (Consistent Hashing)
│
├─ Hash object name using SipHash:
│ setIndex = sipHashMod("photos/vacation.jpg", numSets, deploymentID)
│
├─ Result: Always same set for same object name (deterministic)
└─ Select Erasure Set 3
│
▼
5. CREATE METADATA
│
├─ Generate UUIDs:
│ ├─ VersionID: Unique identifier for this version
│ └─ DataDir: Directory to store data shards
│
├─ Calculate distribution order:
│ └─ hashOrder(objectName, diskCount) = [3, 1, 4, 2, 5, ...]
│
├─ Set erasure parameters:
│ ├─ ErasureM = 12 (data blocks)
│ ├─ ErasureN = 4 (parity blocks)
│ └─ BlockSize = 1MB
│
▼
6. ERASURE ENCODING
│
├─ Read data in 1MB blocks (10 blocks total)
│
├─ For each block:
│ ├─ Split into 12 data shards (~85KB each)
│ ├─ Compute 4 parity shards using Reed-Solomon
│ ├─ Add HighwayHash checksum to each shard
│ └─ Result: 16 shards per block
│
▼
7. DISK ORDERING (Distribution)
│
├─ Take 16 disks from Set 3
├─ Shuffle according to distribution order
└─ Map shards: shard_i → disk_i
│
▼
8. PARALLEL WRITES
│
├─ For each of 16 disks (in parallel):
│ │
│ ├─ Write to temporary location:
│ │ .minio.sys/tmp/{VersionID}/{DataDir}/part.1
│ │
│ ├─ Format: [block1_hash|block1_data|block2_hash|block2_data|...]
│ │
│ └─ Verify write success
│
├─ Check write quorum: Need ≥12 successful writes
│ └─ If <12 succeed: WRITE FAILS, cleanup
│
▼
9. ATOMIC RENAME
│
├─ Once quorum reached:
│ └─ Rename all temp files to final location:
│ bucket/object/{DataDir}/part.1
│
▼
10. METADATA PERSISTENCE
│
├─ Create xl.meta with all object metadata
│ ├─ Version history
│ ├─ Erasure config
│ ├─ Distribution array
│ └─ Part sizes
│
├─ Write xl.meta to all 16 disks (in parallel)
│
├─ Verify metadata quorum (≥12 successful)
│
▼
11. SUCCESS
│
└─ Return 200 OK + ETag to client
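For contrast with the server-side flow above, here is the same operation from a client's point of view, sketched with the minio-go v7 SDK; the endpoint, credentials, bucket, and file name are placeholders:

package main

import (
	"context"
	"log"
	"os"

	"github.com/minio/minio-go/v7"
	"github.com/minio/minio-go/v7/pkg/credentials"
)

func main() {
	// Placeholder endpoint and credentials; substitute your deployment's values.
	client, err := minio.New("localhost:9000", &minio.Options{
		Creds: credentials.NewStaticV4("minioadmin", "minioadmin", ""),
	})
	if err != nil {
		log.Fatal(err)
	}

	f, err := os.Open("vacation.jpg")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()
	st, err := f.Stat()
	if err != nil {
		log.Fatal(err)
	}

	// One call triggers the whole pipeline above: pool selection, set hashing,
	// erasure encoding, parallel shard writes, and the final xl.meta quorum.
	info, err := client.PutObject(context.Background(), "bucket",
		"photos/vacation.jpg", f, st.Size(),
		minio.PutObjectOptions{ContentType: "image/jpeg"})
	if err != nil {
		log.Fatal(err)
	}
	log.Println("stored, ETag:", info.ETag)
}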
PUT Request Overview
For example, with 5 data blocks and 3 parity blocks:
PUT Request Sequence
sequenceDiagram
participant Client
participant MinIO as MinIO Server
participant PoolMgr as Pool Manager
participant SetMgr as Erasure Set Manager
participant EC as Erasure Coder
participant Disk1 as Drive 1
participant Disk2 as Drive 2
participant DiskN as Drive N
Client->>MinIO: PUT /bucket/object
MinIO->>PoolMgr: getPoolIdx(bucket, object, size)
alt Object Already Exists
PoolMgr->>PoolMgr: Query all pools in parallel
PoolMgr->>PoolMgr: GetObjectInfo() on each pool
PoolMgr->>PoolMgr: Object found in Pool 2
PoolMgr-->>MinIO: Use Pool 2 (existing)
else New Object
PoolMgr->>PoolMgr: getServerPoolsAvailableSpace()
PoolMgr->>PoolMgr: Filter: skip suspended/rebalancing
PoolMgr->>PoolMgr: Weighted random selection
PoolMgr-->>MinIO: Use Pool 1 (most space)
end
MinIO->>SetMgr: Hash object name
SetMgr->>SetMgr: sipHashMod(objectName, numSets)
SetMgr-->>MinIO: Erasure Set 3
MinIO->>EC: Create FileInfo metadata
EC->>EC: hashOrder(objectName, drives)
EC->>EC: Generate distribution array
Note over EC: Distribution: [3,1,4,2,5,...]
MinIO->>EC: Encode object (M data + N parity)
EC->>EC: Split into data blocks
EC->>EC: Calculate parity using Reed-Solomon
EC-->>MinIO: Data + Parity shards
MinIO->>SetMgr: Shuffle disks by distribution
SetMgr-->>MinIO: Ordered disk list
par Write to all drives in parallel
MinIO->>Disk1: Write shard 1 + xl.meta
MinIO->>Disk2: Write shard 2 + xl.meta
MinIO->>DiskN: Write shard N + xl.meta
end
Disk1-->>MinIO: Success
Disk2-->>MinIO: Success
DiskN-->>MinIO: Success
MinIO->>MinIO: Check write quorum (M drives)
MinIO-->>Client: 200 OK
PUT Layer-by-Layer Graph
graph TB
subgraph "Client"
C[PUT /mybucket/image.jpg<br/>10 MB]
end
subgraph "Layer 1: HTTP Server"
H[HTTP Router<br/>Match route]
end
subgraph "Layer 2: API Handler"
A[PutObjectHandler<br/>Parse request<br/>Create PutObjReader]
end
subgraph "Layer 3: Server Pools"
SP[erasureServerPools<br/>Select Pool 0<br/>based on space]
end
subgraph "Layer 4: Erasure Sets"
ES[erasureSets<br/>Hash object name<br/>Select Set 3]
end
subgraph "Layer 5: Erasure Objects"
EO[erasureObjects<br/>Setup EC:12+4<br/>WriteQuorum: 12]
end
subgraph "Layer 6: Encoding Loop"
L1[Read Block 1<br/>1 MB]
L2[Encode to<br/>16 shards]
L3[Write to<br/>16 disks]
L4[Check<br/>quorum]
L5{More<br/>blocks?}
end
subgraph "Layer 7: Reed-Solomon"
RS[Split 1 MB into<br/>12 data shards<br/>Generate 4 parity]
end
subgraph "Layer 8: Parallel Writes"
W1[Disk 1<br/>Write D1]
W2[Disk 2<br/>Write D2]
W3[Disk 12<br/>Write D12]
W4[Disk 13<br/>Write P1]
W5[Disk 16<br/>Write P4]
end
subgraph "Metadata Write"
M1[Create xl.meta<br/>with FileInfo]
M2[Write to all<br/>16 disks]
M3[Check quorum<br/>12/16]
M4{Success?}
end
subgraph "Final"
F1[✅ Return ObjectInfo]
F2[❌ Revert & Error]
end
C --> H --> A --> SP --> ES --> EO --> L1
L1 --> L2 --> RS --> L3
L3 --> W1 & W2 & W3 & W4 & W5
W1 & W2 & W3 & W4 & W5 --> L4
L4 --> L5
L5 -->|Yes| L1
L5 -->|No| M1
M1 --> M2 --> M3 --> M4
M4 -->|Yes| F1
M4 -->|No| F2
style C fill:#e1f5ff
style RS fill:#fff3cd
style F1 fill:#d4edda
style F2 fill:#f8d7da
Key Decisions
Pool Selection Algorithm (Weighted Random):
totalFreeSpace = sum of free space in all pools
choose = random(0, totalFreeSpace)
for each pool:
    if pool.freeSpace >= choose:
        select this pool
        break
    choose -= pool.freeSpace
Set Selection Algorithm (Consistent Hash):
func sipHashMod(key string, cardinality int, id [16]byte) int {
k0, k1 := binary.LittleEndian.Uint64(id[0:8]),
binary.LittleEndian.Uint64(id[8:16])
sum64 := siphash.Hash(k0, k1, []byte(key))
return int(sum64 % uint64(cardinality))
}
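Hypothetical usage of sipHashMod, illustrating why placement needs no lookup table:

// Illustrative only: deploymentID in a real cluster comes from format.json's "id".
var deploymentID [16]byte
setIdx := sipHashMod("photos/vacation.jpg", 4, deploymentID)
// The result is identical on every node and on every call, so the object
// name itself encodes the location; no placement table is needed.
_ = setIdx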
GET Operation
Step-by-Step Flow (Retrieving a 10MB Object)
1. CLIENT REQUEST
│
├─ GET /bucket/photos/vacation.jpg
│ Optional: Range header (e.g., bytes=2097152-10485760)
│
▼
2. HTTP HANDLER
│
├─ Parse request
├─ Verify authentication
├─ Check preconditions (If-Match, If-Modified-Since)
│
▼
3. SET LOOKUP (Same Hash as Write)
│
├─ Hash object name using same SipHash
│ → Deterministically routes to same set as original write
│
└─ Select Erasure Set 3
│
▼
4. METADATA READING
│
├─ Read xl.meta from ALL 16 disks (in parallel)
│
├─ Verify quorum: Need ≥12 successful reads
│ └─ If <12: READ FAILS
│
├─ Select latest version (by ModTime)
│
├─ Extract:
│ ├─ Erasure config (M, N, block size)
│ ├─ Part sizes and ETags
│ ├─ Distribution order
│ └─ Shard indices
│
▼
5. PARALLEL SHARD READING
│
├─ Create readers for all 16 disks
│
├─ Read in parallel:
│ ├─ Each disk returns its shard blocks
│ ├─ Verify HighwayHash per block
│ │ └─ Hash mismatch → mark disk as bad
│ └─ Stop reading once we have ≥12 good shards
│
▼
6. RECONSTRUCTION (If Needed)
│
├─ If all 16 disks healthy:
│ └─ Use 12 data shards directly
│
└─ If some disks failed/corrupted:
├─ Use Reed-Solomon decoder
├─ Reconstruct missing shards from available ones
└─ Need at least M (12) shards to reconstruct
│
▼
7. RANGE EXTRACTION (If Range Header Present)
│
├─ If range requested (e.g., bytes 2-10MB):
│ ├─ Extract only requested byte range
│ └─ Efficient: Don't read entire object
│
├─ Apply decompression (if S2 compression used)
│
├─ Apply decryption (if AES-256-GCM encryption used)
│
▼
8. STREAM TO CLIENT
│
├─ Set Content-Length header
├─ Set Content-Range (if range request)
├─ Stream data directly to HTTP response body
│
▼
9. SUCCESS
│
└─ Return 200 OK (or 206 Partial Content for range)
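The range read in step 7, seen from a client and sketched with the minio-go v7 SDK; endpoint and credentials are placeholders:

package main

import (
	"context"
	"io"
	"log"
	"os"

	"github.com/minio/minio-go/v7"
	"github.com/minio/minio-go/v7/pkg/credentials"
)

func main() {
	// Placeholder endpoint and credentials; substitute your deployment's values.
	client, err := minio.New("localhost:9000", &minio.Options{
		Creds: credentials.NewStaticV4("minioadmin", "minioadmin", ""),
	})
	if err != nil {
		log.Fatal(err)
	}

	opts := minio.GetObjectOptions{}
	// Ask for bytes 2 MiB through 10 MiB, as in the Range example above.
	if err := opts.SetRange(2*1024*1024, 10*1024*1024); err != nil {
		log.Fatal(err)
	}

	obj, err := client.GetObject(context.Background(), "bucket",
		"photos/vacation.jpg", opts)
	if err != nil {
		log.Fatal(err)
	}
	defer obj.Close()
	// The server answers 206 Partial Content; only the requested range flows.
	if _, err := io.Copy(os.Stdout, obj); err != nil {
		log.Fatal(err)
	}
}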
Failure Scenarios
| Scenario | Disks Available | Status | Action |
|---|---|---|---|
| All healthy | 16/16 | ✅ Success | Read 12 data shards |
| 1 disk dead | 15/16 | ✅ Success | Read 12 data shards from remaining |
| 2 disks dead | 14/16 | ✅ Success | Read 12+ shards, reconstruct if needed |
| 4 disks dead | 12/16 | ✅ Success | Read 12 available shards (at limit) |
| 5+ disks dead | <12/16 | ❌ FAIL | Cannot read (quorum lost) |
Distributed Architecture
All nodes in a distributed MinIO deployment should be homogeneous: same operating system, same number of drives, and same network interconnects.
- No master server, no metadata server
Ref: https://github.com/minio/minio/blob/master/docs/distributed/README.md
MinIO adopts a decentralized, shared-nothing architecture: object data is scattered across multiple hard disks on different nodes, all nodes serve the same unified namespace, and client traffic is distributed across servers via a load balancer or DNS round-robin.
Erasure Set Organization (4 Servers × 4 Disks Each = 16 Disks Total)
Server 1: [D1] [D2] [D3] [D4]
Server 2: [D5] [D6] [D7] [D8]
Server 3: [D9] [D10][D11][D12]
Server 4: [D13][D14][D15][D16]
↓ Round-Robin Assignment ↓
Erasure Set 0: [D1, D5, D9, D13, D2, D6, D10, D14, ...]
S1 S2 S3 S4 S1 S2 S3 S4
Key: Each set has disks from ALL servers
Fault Tolerance
If Server 3 Dies:
Set contains: [D1(S1), D5(S2), D9(S3), D13(S4), ...]
After S3 failure:
├─ Available: D1(S1) ✓, D5(S2) ✓, D13(S4) ✓
├─ Dead: D9(S3) ✗
├─ Tolerance: EC:12+4 can lose up to 4 disks
│
└─ Result: SAFE - Can still read and recover
If Any 4 Disks Die:
Available shards: 12 (exactly at read quorum)
Parity tolerance: 4
Result: Still readable but no fault tolerance left
If 5+ Disks Die:
Available shards: <12 (below read quorum)
Result: UNRECOVERABLE - READ FAILS
Server Pools
A server pool is a set of MinIO server nodes that pool their drives and resources, forming a unit of expansion. Each pool is an independent group of erasure sets, while the cluster as a whole presents a single namespace.
The other important point is rebalance-free, non-disruptive expansion: with MinIO's server-pool approach, no rebalancing is required to expand. Ref: https://blog.min.io/no-rebalancing-object-storage/
A MinIO cluster is built on server pools, and server pools are built on erasure sets.
Multi-Pool Architecture
MinIO can have multiple independent pools for expansion:
Cluster
├── Pool 1 (16 disks, 4 nodes)
│ └─ Erasure Set 0
│
├── Pool 2 (32 disks, 4 nodes)
│ ├─ Erasure Set 1
│ └─ Erasure Set 2
│
└── Pool 3 (48 disks, 4 nodes)
├─ Erasure Set 3
├─ Erasure Set 4
└─ Erasure Set 5
Pool Expansion (Rebalance-Free)
- Add new pool: MinIO detects new endpoints at startup
- Update format files: Cluster configuration updated
- New objects: Distributed across all pools by available space
- Existing objects: Stay in original pool (no rebalancing)
- Decommission: Background migration copies objects to other pools
Weighted Random Selection
When adding new object to new pool:
- Calculate available space: Pool1=500GB, Pool2=200GB, Pool3=300GB (total=1TB)
- Generate random number: 0-1000GB
- If 0-500: Pool1, if 500-700: Pool2, if 700-1000: Pool3
- Result: Pools filled proportionally to their capacity
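A runnable sketch of this weighted selection; the pool names and free-space figures are the hypothetical ones from the example above:

package main

import (
	"fmt"
	"math/rand"
)

type pool struct {
	name string
	free int64 // free space in GB
}

// pickPool chooses a pool with probability proportional to its free space.
func pickPool(pools []pool) pool {
	var total int64
	for _, p := range pools {
		total += p.free
	}
	choose := rand.Int63n(total) // uniform in [0, totalFreeSpace)
	for _, p := range pools {
		if choose < p.free {
			return p // pools with more free space own a larger slice of the range
		}
		choose -= p.free
	}
	return pools[len(pools)-1] // unreachable when total is consistent
}

func main() {
	pools := []pool{{"Pool1", 500}, {"Pool2", 200}, {"Pool3", 300}}
	hits := map[string]int{}
	for i := 0; i < 10000; i++ {
		hits[pickPool(pools).name]++
	}
	fmt.Println(hits) // roughly proportional: ~50% / ~20% / ~30%
}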
Code Architecture
Layer Hierarchy
graph TB
subgraph "1. Entry Point"
A[Application Entry<br/>& Initialization]
end
subgraph "2. HTTP Server Layer"
B[HTTP Server<br/>Router & Middleware]
end
subgraph "3. API Handler Layer"
C[S3 API Handlers<br/>Request Processing]
end
subgraph "4. Object Layer Interface"
D[ObjectLayer Interface<br/>Abstraction Layer]
end
subgraph "5. Erasure Server Pools"
E[Pool Manager<br/>Multi-Pool Coordination]
end
subgraph "6. Erasure Sets"
F[Set Manager<br/>Consistent Hashing]
end
subgraph "7. Erasure Objects"
G[Erasure Logic<br/>Quorum & Operations]
end
subgraph "8. Storage Layer"
H[Disk I/O<br/>File Operations]
end
subgraph "9. Metadata Layer"
I[xl.meta Management<br/>Version Control]
end
subgraph "10. Healing Layer"
J[Self-Repair<br/>Background Scanner]
end
subgraph "11. Erasure Coding"
K[Reed-Solomon<br/>Data Protection]
end
subgraph "12. Bitrot Protection"
L[Hash Verification<br/>Integrity Checks]
end
A --> B
B --> C
C --> D
D --> E
E --> F
F --> G
G --> H
H --> I
G -.Healing.-> J
G -.Encoding.-> K
H -.Verification.-> L
style A fill:#e1f5ff,stroke:#01579b,stroke-width:3px,color:#000
style B fill:#f3e5f5,stroke:#4a148c,stroke-width:3px,color:#000
style C fill:#e8f5e9,stroke:#1b5e20,stroke-width:3px,color:#000
style D fill:#fff3e0,stroke:#e65100,stroke-width:3px,color:#000
style E fill:#fce4ec,stroke:#880e4f,stroke-width:3px,color:#000
style F fill:#e0f2f1,stroke:#004d40,stroke-width:3px,color:#000
style G fill:#f1f8e9,stroke:#33691e,stroke-width:3px,color:#000
style H fill:#e3f2fd,stroke:#0d47a1,stroke-width:3px,color:#000
style I fill:#fef5e7,stroke:#f39c12,stroke-width:3px,color:#000
style J fill:#fadbd8,stroke:#c0392b,stroke-width:3px,color:#000
style K fill:#d5f4e6,stroke:#117a65,stroke-width:3px,color:#000
style L fill:#ebdef0,stroke:#6c3483,stroke-width:3px,color:#000
Detailed Layer Flow
graph TB
subgraph "1. HTTP Server Layer"
HTTP[HTTP Server<br/>xhttp.NewServer]
Router[Mux Router<br/>mux.NewRouter]
end
subgraph "2. Middleware Layer"
Auth[Authentication<br/>Signature V4]
Trace[HTTP Tracing]
Throttle[Request Throttling<br/>maxClients]
GZIP[GZIP Compression]
end
subgraph "3. API Handler Layer"
APIHandlers[objectAPIHandlers]
GetObj[GetObjectHandler]
PutObj[PutObjectHandler]
DelObj[DeleteObjectHandler]
ListObj[ListObjectsHandler]
end
subgraph "4. ObjectLayer Interface"
ObjInterface["<b>ObjectLayer Interface</b><br/>• GetObjectNInfo<br/>• PutObject<br/>• DeleteObject<br/>• ListObjects<br/>• GetObjectInfo<br/>• Multipart Operations"]
end
subgraph "5. Erasure Server Pools"
ESP[erasureServerPools<br/>implements ObjectLayer]
Pool1[Pool 1<br/>erasureSets]
Pool2[Pool 2<br/>erasureSets]
PoolN[Pool N<br/>erasureSets]
end
subgraph "6. Erasure Sets"
Set1[Set 1<br/>erasureObjects]
Set2[Set 2<br/>erasureObjects]
SetN[Set N<br/>erasureObjects]
end
subgraph "7. Erasure Objects Layer"
ErasureObj[erasureObjects<br/>implements ObjectLayer]
ECLogic[Erasure Coding Logic<br/>Reed-Solomon]
Quorum[Read/Write Quorum]
Healing[Self-Healing]
end
subgraph "8. StorageAPI Interface"
StorageInterface["<b>StorageAPI Interface</b><br/>• ReadVersion<br/>• WriteMetadata<br/>• DeleteVersion<br/>• ReadFile/WriteAll<br/>• Volume Operations"]
end
subgraph "9. Storage Implementation"
XLStorage[xlStorage<br/>implements StorageAPI]
Remote[storageRESTClient<br/>Remote Disks]
DiskCheck[xlStorageDiskIDCheck<br/>Health Wrapper]
end
subgraph "10. Disk Layer"
LocalDisk[Local Disk I/O<br/>xl.meta files]
RemoteDisk[Remote Disk via REST]
Metadata[xl.meta<br/>Object Metadata]
end
HTTP --> Router
Router --> Auth
Auth --> Trace
Trace --> Throttle
Throttle --> GZIP
GZIP --> APIHandlers
APIHandlers --> GetObj
APIHandlers --> PutObj
APIHandlers --> DelObj
APIHandlers --> ListObj
GetObj --> ObjInterface
PutObj --> ObjInterface
DelObj --> ObjInterface
ListObj --> ObjInterface
ObjInterface --> ESP
ESP --> Pool1
ESP --> Pool2
ESP --> PoolN
Pool1 --> Set1
Pool1 --> Set2
Pool1 --> SetN
Set1 --> ErasureObj
ErasureObj --> ECLogic
ErasureObj --> Quorum
ErasureObj --> Healing
ErasureObj --> StorageInterface
StorageInterface --> XLStorage
StorageInterface --> Remote
StorageInterface --> DiskCheck
XLStorage --> LocalDisk
Remote --> RemoteDisk
LocalDisk --> Metadata
RemoteDisk --> Metadata
style ObjInterface fill:#e1f5ff,stroke:#01579b,stroke-width:3px,color:#000
style StorageInterface fill:#e1f5ff,stroke:#01579b,stroke-width:3px,color:#000
style ESP fill:#fff3e0,stroke:#e65100,stroke-width:2px,color:#000
style ErasureObj fill:#fff3e0,stroke:#e65100,stroke-width:2px,color:#000
style XLStorage fill:#f1f8e9,stroke:#33691e,stroke-width:2px,color:#000
Interface Class Diagram
classDiagram
class ObjectLayer {
<<interface>>
+GetObjectNInfo() GetObjectReader
+PutObject() ObjectInfo
+DeleteObject() ObjectInfo
+GetObjectInfo() ObjectInfo
+ListObjects() ListObjectsInfo
+MakeBucket() error
+NewMultipartUpload() NewMultipartUploadResult
+GetDisks() []StorageAPI
}
class erasureServerPools {
-poolMeta poolMeta
-serverPools []*erasureSets
-deploymentID [16]byte
+PutObject() ObjectInfo
+GetObjectNInfo() GetObjectReader
+getPoolIdx() int
}
class erasureSets {
-sets []*erasureObjects
-format *formatErasureV3
-erasureDisks [][]StorageAPI
-setCount int
-setDriveCount int
+PutObject() ObjectInfo
+getHashedSet() int
}
class erasureObjects {
-setDriveCount int
-defaultParityCount int
-getDisks func()[]StorageAPI
-nsMutex *nsLockMap
+PutObject() ObjectInfo
+putObject() ObjectInfo
+defaultWQuorum() int
+defaultRQuorum() int
}
class StorageAPI {
<<interface>>
+ReadVersion() FileInfo
+WriteMetadata() error
+DeleteVersion() error
+CreateFile() error
+ReadFile() int64
+MakeVol() error
+ListVols() []VolInfo
+GetDiskID() string
+IsOnline() bool
}
class xlStorage {
-diskPath string
-endpoint Endpoint
-diskID string
-formatFile string
+WriteMetadata() error
+ReadVersion() FileInfo
+CreateFile() error
+ReadFile() int64
}
class storageRESTClient {
-endpoint Endpoint
-restClient *rest.Client
-diskID string
+WriteMetadata() error
+ReadVersion() FileInfo
+CreateFile() error
}
class Erasure {
-encoder func()Encoder
-dataBlocks int
-parityBlocks int
-blockSize int64
+EncodeData() [][]byte
+DecodeDataBlocks() error
+ShardSize() int64
}
ObjectLayer <|.. erasureServerPools : implements
ObjectLayer <|.. erasureSets : implements
ObjectLayer <|.. erasureObjects : implements
erasureServerPools *-- erasureSets : contains
erasureSets *-- erasureObjects : contains
erasureObjects --> StorageAPI : uses
erasureObjects --> Erasure : uses
StorageAPI <|.. xlStorage : implements
StorageAPI <|.. storageRESTClient : implements
style ObjectLayer fill:#e1f5ff,stroke:#01579b,stroke-width:3px,color:#000
style StorageAPI fill:#e1f5ff,stroke:#01579b,stroke-width:3px,color:#000
style erasureServerPools fill:#fff3e0,stroke:#e65100,stroke-width:2px,color:#000
style erasureSets fill:#fff3e0,stroke:#e65100,stroke-width:2px,color:#000
style erasureObjects fill:#fff3e0,stroke:#e65100,stroke-width:2px,color:#000
style xlStorage fill:#f1f8e9,stroke:#33691e,stroke-width:2px,color:#000
style Erasure fill:#e8f5e9,stroke:#2e7d32,stroke-width:2px,color:#000
Key Interfaces
ObjectLayer Interface:
type ObjectLayer interface {
// Bucket operations
MakeBucket(ctx, bucket, opts) error
GetBucketInfo(ctx, bucket, opts) (BucketInfo, error)
ListBuckets(ctx, opts) ([]BucketInfo, error)
DeleteBucket(ctx, bucket, opts) error
ListObjects(...) (ListObjectsInfo, error)
ListObjectVersions(...) (ListObjectVersionsInfo, error)
// Object operations
GetObjectNInfo(ctx, bucket, object, rangeSpec, headers, opts) (*GetObjectReader, error)
GetObjectInfo(ctx, bucket, object, opts) (ObjectInfo, error)
PutObject(ctx, bucket, object, data, opts) (ObjectInfo, error)
CopyObject(ctx, srcBucket, srcObject, dstBucket, dstObject, ...) (ObjectInfo, error)
DeleteObject(ctx, bucket, object, opts) (ObjectInfo, error)
DeleteObjects(ctx, bucket, objects, opts) ([]DeletedObject, []error)
// Multipart operations
NewMultipartUpload(ctx, bucket, object, opts) (*NewMultipartUploadResult, error)
PutObjectPart(ctx, bucket, object, uploadID, partID, data, opts) (PartInfo, error)
CompleteMultipartUpload(ctx, bucket, object, uploadID, parts, opts) (ObjectInfo, error)
AbortMultipartUpload(ctx, bucket, object, uploadID, opts) error
// Healing & Info
HealFormat(ctx, dryRun) (HealResultItem, error)
HealBucket(ctx, bucket, opts) (HealResultItem, error)
HealObject(ctx, bucket, object, versionID, opts) (HealResultItem, error)
StorageInfo(ctx, metrics bool) StorageInfo
}
StorageAPI Interface:
type StorageAPI interface {
// Metadata
ReadVersion(ctx, origvolume, volume, path, versionID, opts) (FileInfo, error)
WriteMetadata(ctx, origvolume, volume, path, fi) error
DeleteVersion(ctx, volume, path, fi, ...) error
// File operations
ReadFile(ctx, volume, path, offset, buf, verifier) (n, error)
CreateFile(ctx, origvolume, volume, path, size, reader) error
ReadFileStream(ctx, volume, path, offset, length) (io.ReadCloser, error)
AppendFile(ctx, volume, path, buf) error
Delete(ctx, volume, path, opts) error
// Volume operations
MakeVol(ctx, volume) error
ListVols(ctx) ([]VolInfo, error)
StatVol(ctx, volume) (VolInfo, error)
DeleteVol(ctx, volume, forceDelete bool) error
// Disk info
IsOnline() bool
GetDiskID() (string, error)
DiskInfo(ctx, opts) (DiskInfo, error)
}
Implementations
| Component | Location | Role |
|---|---|---|
| erasureServerPools | cmd/erasure-server-pool.go | Pool orchestration, weighted selection |
| erasureSets | cmd/erasure-sets.go | Set routing, consistent hashing |
| erasureObjects | cmd/erasure-object.go | Core put/get/delete with EC |
| xlStorage | cmd/xl-storage.go | Local disk I/O |
| storageRESTClient | cmd/storage-rest-client.go | Remote disk via REST |
| Erasure | cmd/erasure-coding.go | Reed-Solomon encode/decode |
Healing & Self-Recovery
MinIO performs automatic background healing to detect and repair corrupted objects:
Healing Mechanisms
- Bitrot Detection: HighwayHash checksum verification on every read
- Bad Disk Detection: Continuous health monitoring of all disks
- Object-Level Healing: Corrupted objects repaired in seconds (vs RAID hours)
- Background Scanner: Periodic scan of all objects to detect bitrot proactively
Healing Flow
Bad block detected (hash mismatch)
│
▼
Mark disk as bad
│
▼
Read remaining 15 shards (12+ available)
│
▼
Use Reed-Solomon to reconstruct missing shard
│
▼
Repair disk by writing reconstructed shard
│
▼
Verify repair with new hash
│
▼
Continue serving object (healed)
Gateway Mode (Deprecated)
MinIO introduced gateway mode early on to provide S3 API compatibility to legacy systems:
Why Deprecated:
- Critical S3 features (versioning, replication, locking, encryption) couldn’t work in gateway mode without proprietary formats
- Would defeat the purpose of direct backend access
- Better to run MinIO in server mode than as a stateless proxy
- S3 API now ubiquitous (partly due to MinIO Gateway work)
Lessons Learned:
- S3 API evolved significantly since gateway inception
- Inline translation is insufficient for modern S3 capabilities
- Backends become mere storage media, which is essentially running MinIO anyway
Reference: Gateway Migration and Deprecation Details
Advanced Features
Versioning
- Keep multiple versions of an object
- Each version has a separate xl.meta entry
- Access previous versions without data loss
Object Locking (WORM)
- Write Once, Read Many protection
- Objects immutable for set retention period
- Compliance and audit requirements
Lifecycle Management
- Automatic object deletion/transition after time period
- Move to different storage classes
- Cost optimization
Replication
- Automatic cross-cluster replication
- Disaster recovery and high availability
- Real-time synchronization
IAM & Access Control
- User authentication (basic, LDAP, OAuth)
- Bucket policies (similar to AWS S3)
- Access key/secret pairs
Encryption
Server-side encryption (SSE-S3, SSE-KMS) and client-side encryption support with master key rotation.
Distributed Locking (dsync)
MinIO avoids consistency issues using distributed locking:
How dsync Works
- Lock Request: Any node broadcasts lock request to all nodes
- Quorum: If N/2+1 nodes approve → lock acquired
- No Master: Every node is peer; no single authority
- Stale Detection: Between-node heartbeats detect offline nodes
dsync is MinIO’s distributed RW mutex (internal/dsync/). Every operation that mutates or reads object state acquires a lock through the nsLockMap abstraction, which routes to either a distributed DRWMutex (multi-node) or a local mutex (single-node).
Design goals
Ref: https://blog.min.io/minio-dsync-a-distributed-locking-and-syncing-package-for-go/
Architecture
Application (erasureObjects.PutObject, etc.)
↓
nsLockMap (local wrapper — distributed or single-node)
↓
distLockInstance (DRWMutex) OR localLockInstance
↓
DRWMutex — quorum-based distributed lock
↓
NetLocker interface — REST calls to lock server on each node
↓
localLocker — lock server running on each MinIO node
Which operations lock what
| Operation | Lock Type | Lock Key |
|---|---|---|
| GetObjectNInfo | RLock | bucket/object |
| GetObjectInfo | RLock | bucket/object |
| PutObject | Lock (write) | bucket/object |
| CopyObject | Lock (write) | bucket/object |
| DeleteObject | Lock (write) | bucket/object |
| DeleteObjects | Lock (write) | [bucket/obj1, bucket/obj2, ...] sorted |
| NewMultipartUpload | Lock (write) | bucket/object |
| PutObjectPart | Lock (write) | bucket/object/uploadID |
| CompleteMultipartUpload | Lock (write) | bucket/object/uploadID |
| AbortMultipartUpload | RLock | bucket/object |
| PutObjectTags / Metadata | Lock (write) | bucket/object |
| MakeBucket | Lock (write) | .minio.sys/bucket.lck |
| DeleteBucket | Lock (write) | .minio.sys/bucket.lck |
Lock key structure
Keys are pathJoin(bucket, object):
"my-bucket/photos/vacation.jpg" ← regular object
"my-bucket/large-file/abc123-upload-id" ← multipart upload
".minio.sys/new-bucket.lck" ← bucket creation
For DeleteObjects, all paths are sorted before acquiring — prevents deadlock when two concurrent requests delete overlapping object sets.
Quorum protocol
4-node cluster, lock servers on all 4 nodes
Write lock (PutObject):
→ Contact all 4 lock servers in parallel
→ Need majority: ⌈4/2⌉ + 1 = 3 approvals
→ If only 2 respond → lock DENIED
Read lock (GetObject):
→ Contact all 4 lock servers
→ Need fewer approvals (multiple readers allowed concurrently)
Split-brain guard:
if quorum == tolerance → quorum++ (ensures strict majority for writes)
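The same rule as a compact sketch (illustrative names, not dsync's actual API):

// dsyncQuorum returns the number of peer approvals required for a lock.
func dsyncQuorum(nodes int, readLock bool) int {
	tolerance := nodes / 2      // how many peers may be unreachable
	quorum := nodes - tolerance // approvals required
	if !readLock && quorum == tolerance {
		quorum++ // split-brain guard: write locks need a strict majority
	}
	return quorum
}

// dsyncQuorum(4, false) == 3, matching the write-lock example above
// dsyncQuorum(4, true)  == 2, read locks need fewer approvals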
Lock lifecycle
1. ACQUIRE
nsLockMap.NewNSLock(bucket, object)
↓
DRWMutex.GetLock(ctx, timeout)
↓
Contacts all peer nodes via REST in parallel
↓
Waits for quorum approvals
2. REFRESH (background goroutine, every 10 seconds)
Keeps lock alive by pinging all peers
If quorum lost → force unlock + notify caller
3. RELEASE
DRWMutex.Unlock() → fires async goroutine
Retries with backoff if nodes unreachable
Stale locks auto-expire after 1 minute
What is nsLockMap?
nsLockMap is the namespace lock manager — the single abstraction layer that sits between MinIO’s S3 operations and the actual lock implementation (distributed or local).
When any operation wants to lock an object, it goes through nsLockMap.NewNSLock(), which decides which lock implementation to use:
PutObject("my-bucket", "photo.jpg")
↓
nsLockMap.NewNSLock("my-bucket", "photo.jpg")
↓
isDistErasure?
┌────YES────┐ ┌────NO────┐
↓ ↓
distLockInstance localLockInstance
(DRWMutex — contacts (in-memory map of
all peer nodes) RW mutexes)
In single-node mode: maintains an in-memory map[string]*nsLock keyed by lock path (bucket/object). Uses reference counting — when two operations lock the same object, the same nsLock entry is reused and ref is incremented. When the last holder unlocks, ref hits 0 and the entry is removed (cleanup).
In distributed mode: skips the in-memory map entirely and creates a DRWMutex that contacts all peer nodes over REST to acquire a quorum-based distributed lock.
Why it exists: lets the rest of MinIO’s code be completely unaware of whether it’s running single-node or distributed. erasureObjects.PutObject() just calls nsLockMap.NewNSLock() and gets back a RWLocker — same interface regardless of mode.
nsLockMap internals
// Two modes depending on cluster type:
// Distributed mode (isDistErasure = true):
type distLockInstance struct {
rwMutex *dsync.DRWMutex // Distributed RW Mutex
opsID string
}
// Single-node mode (isDistErasure = false):
type localLockInstance struct {
ns *nsLockMap
volume string
paths []string
opsID string
}
// Routing in NewNSLock():
if n.isDistErasure {
drwmutex := dsync.NewDRWMutex(&dsync.Dsync{
GetLockers: lockers, // returns remote lock server clients
Timeouts: dsync.DefaultTimeouts,
}, pathsJoinPrefix(volume, paths...)...)
return &distLockInstance{drwmutex, opsID}
}
// else: local in-memory map
Key optimizations
Early read lock release — GetObjectNInfo releases the read lock as soon as metadata quorum is confirmed. For small inline objects, the lock is dropped before data is streamed to the client, minimizing lock hold time.
Async unlock — DRWMutex.Unlock() fires a background goroutine that retries until all peers acknowledge, so the caller is never blocked waiting on the network.
Overload protection — if a lock server has >1000 queued lock requests it rejects new ones immediately (fail-fast) rather than queuing indefinitely, preventing resource exhaustion.
Granularity
Locks are per-object, not per-disk or per-erasure-set. Multiple objects in the same bucket can be locked concurrently and independently — the erasure layer underneath coordinates parallel disk I/O without needing its own locks. Two requests only contend if they target the exact same object path.
Limitations
- Supports up to 32 nodes (design limitation of the quorum broadcast approach)
- Lock throughput decreases as cluster grows (more peers to contact per lock)
- Lock loss possible if nodes fail mid-refresh (acceptable — locks auto-expire after 1 minute)
Quick Reference: Key Concepts
| Concept | Definition |
|---|---|
| Erasure Set | Group of disks where objects are erasure coded |
| Server Pool | Collection of erasure sets, independent expansion unit |
| Consistent Hash | SipHash used to deterministically place objects |
| Read Quorum | Minimum shards needed to reconstruct object (M data shards) |
| Write Quorum | Minimum disks that must acknowledge write (M or M+1) |
| BitRot | Silent data corruption, detected via HighwayHash |
| xl.meta | Metadata file containing object info, stored on all disks |
| DataDir | UUID-named directory storing object’s data shard |
| Reed-Solomon | Erasure coding algorithm enabling data reconstruction |
Performance Characteristics
- Throughput: GB/sec performance (limited by network)
- Latency: Milliseconds for PUT/GET operations
- Scalability: Supports petabyte-scale deployments
- Fault Tolerance: Up to N parity disks can fail per set
- Healing: Object-level healing in seconds
- Bitrot Hashing: >10 GB/sec on single CPU core
Real-World Metadata Examples
format.json from Actual Cluster
A 12-disk single erasure set cluster:
{
"version": "1",
"format": "xl",
"id": "f9a7a6ba-39d9-4483-bb47-fe86518bdc67",
"xl": {
"version": "3",
"this": "9ae64de8-1c75-46df-b09d-ad8b97f95313",
"sets": [
[
"4199dbce-78ba-4176-846d-7423ab6cfcd9",
"22b83b76-f883-49c8-abc8-a3cf84eb92f4",
"9ae64de8-1c75-46df-b09d-ad8b97f95313",
"fc1a7dde-1da7-44cc-9380-3ae3063c415c",
"48d7881f-6e93-42ab-9d89-f27bf0648b0d",
"b8cfec44-f88b-4193-9575-368d92eefb16",
"ef66b6f7-3c15-45fa-aca8-52286f4750f4",
"02b3aa13-ff62-4e46-a196-f40b6f531c23",
"f5dd8d65-56d7-40f2-9035-b4b37e3018a5",
"ae4e30fd-db65-4c0e-a9c1-44f50191ba20",
"d4cf829c-b96f-4687-845c-8884a43a6397",
"2efc58b9-253a-4ac6-ba92-a316811f896c"
]
],
"distributionAlgo": "SIPMOD+PARITY"
}
}
Explanation:
- 1 erasure set with 12 disks (UUIDs in the sets array)
- Deployment ID: f9a7a6ba-39d9-4483-bb47-fe86518bdc67 (shared by all disks)
- Distribution Algorithm: SIPMOD+PARITY
- This disk ("this": 9ae64de8-1c75-46df-b09d-ad8b97f95313) sits at index 2 in the set, i.e., the third entry
Real xl.meta Example from Small File (65 bytes)
File content: This is test data for xl.meta debugging with erasure coding EC:4
{
"Versions": [
{
"Header": {
"EcM": 8,
"EcN": 4,
"Flags": 6,
"ModTime": "2026-01-14T16:53:55.923264863+05:30",
"Signature": "b9f71a0b",
"Type": 1,
"VersionID": "00000000000000000000000000000000"
},
"Idx": 0,
"Metadata": {
"Type": 1,
"V2Obj": {
"CSumAlgo": 1,
"DDir": "NhHND1OVRfWzQYC/GFqfGA==",
"EcAlgo": 1,
"EcBSize": 1048576,
"EcDist": [1,2,3,4,5,6,7,8,9,10,11,12],
"EcIndex": 3,
"EcM": 8,
"EcN": 4,
"ID": "AAAAAAAAAAAAAAAAAAAAAA==",
"MTime": 1768389835923264863,
"MetaSys": {
"x-minio-internal-inline-data": "dHJ1ZQ=="
},
"MetaUsr": {
"content-type": "text/plain",
"etag": "eeb5a84d38f5dac272eb0d3f772c8a59"
},
"PartASizes": [ 65 ],
"PartETags": null,
"PartNums": [ 1],
"PartSizes": [ 65 ],
"Size": 65
},
"v": 1740736516
}
}
]
}
xl.meta from a Different Disk (EcIndex=7)
Decoding xl.meta from another disk in the same erasure set shows the same object but a different shard:
{
"Versions": [
{
"Header": { "EcM": 8, "EcN": 4, "Type": 1, "VersionID": "00000000000000000000000000000000" },
"Metadata": {
"V2Obj": {
"EcDist": [1,2,3,4,5,6,7,8,9,10,11,12],
"EcIndex": 7,
"EcM": 8,
"EcN": 4,
"MetaSys": { "x-minio-internal-inline-data": "dHJ1ZQ==" },
"MetaUsr": { "content-type": "text/plain", "etag": "eeb5a84d38f5dac272eb0d3f772c8a59" },
"Size": 65
}
}
}
]
}
--- INLINE DATA ---
{
"null": {
"bitrot_valid": true,
"bytes": 41,
"data_base64": "b2RpbmcgRUM6",
"data_string": "oding EC:"
}
}
Notice EcIndex: 7 (vs EcIndex: 3 on the other disk) — each disk holds a different shard of the same object. The data_string differs ("oding EC:" vs "for xl.me") confirming each disk stores its own slice.
Data Distribution Visualization
For the 65-byte file above split into EC:8+4:
┌────────────────────────────────────────────────────────────────┐
│ Original File (~65 bytes) │
│ "...for xl.me...oding EC:..." │
├────────────────────────────────────────────────────────────────┤
│ │
│ Erasure Split into 8 Data Shards + 4 Parity Shards: │
│ │
│ EcDist: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12] │
│ ├─────────────────────┤ ├──────────────┤ │
│ 8 DATA shards 4 PARITY shards │
│ │
│ Disk EcIndex=3: Contains data shard 3 → "for xl.me" │
│ Disk EcIndex=7: Contains data shard 7 → "oding EC:" │
│ │
│ Disks 9-12 (EcIndex 9,10,11,12): Parity shards (for recovery) │
└────────────────────────────────────────────────────────────────┘
Storing a Test File
# Create test bucket and upload file
kubectl exec minio-0 -- sh -c '
mc alias set local http://localhost:9000 $MINIO_ROOT_USER $MINIO_ROOT_PASSWORD --insecure 2>/dev/null
mc mb local/debug-bucket --insecure 2>/dev/null || true
echo "This is test data for xl.meta debugging with erasure coding EC:4" > /tmp/test-file.txt
mc cp /tmp/test-file.txt local/debug-bucket/test-file.txt --insecure
mc stat local/debug-bucket/test-file.txt --insecure
'
Expected Output:
Added `local` successfully.
Bucket created successfully `local/debug-bucket`.
`/tmp/test-file.txt` -> `local/debug-bucket/test-file.txt`
┌───────┬─────────────┬──────────┬────────────┐
│ Total │ Transferred │ Duration │ Speed │
│ 65 B │ 65 B │ 00m00s │ 1.13 KiB/s │
└───────┴─────────────┴──────────┴────────────┘
Name : test-file.txt
Date : 2026-01-14 11:23:55 UTC
Size : 65 B
ETag : eeb5a84d38f5dac272eb0d3f772c8a59
Type : file
Metadata :
Content-Type: text/plain
On-Disk Structure
For small files (≤128KB), data is inlined in xl.meta:
debug-bucket/
└── test-file.txt/
└── xl.meta # Contains metadata + inline data
For larger files (>128KB):
/data1/testbucket/test-large-file.txt/
├── xl.meta # Metadata (on all 12 disks)
└── <DDir-UUID>/ # Data directory
└── part.1 # Actual data shard for this disk
Healing Example
Ref: https://minio-docs.tf.fo/operations/concepts/healing
Ref: https://minio-docs.tf.fo/operations/data-recovery
Replication & Site-to-Site
Distributed Locking (dsync) in Action
Ref: https://blog.min.io/minio-dsync-a-distributed-locking-and-syncing-package-for-go/