MinIO Architecture & Object Storage Deep Dive

MinIO does one thing well: object storage for the private cloud.


This document provides a comprehensive overview of MinIO’s architecture, how it stores objects, distributes data across servers, and retrieves objects.

Table of Contents

  1. Core Principles
  2. High-Level Architecture
  3. System Components
  4. Object Storage Internals
  5. Erasure Coding
  6. PUT Operation (Storing Objects)
  7. GET Operation (Retrieving Objects)
  8. Distributed Architecture
  9. Server Pools
  10. Code Architecture
  11. Healing & Self-Recovery
  12. Gateway Mode (Deprecated)
  13. Advanced Features
  14. Distributed Locking (dsync)

Core Principles

MinIO is built on several fundamental design principles:

  - S3-native: a single, consistent S3 API surface for every deployment
  - Erasure-coded by default: Reed-Solomon data protection instead of RAID
  - Shared-nothing: every node is a peer; there is no metadata master
  - Object-level healing: corruption is repaired per object, not per drive

How MinIO Compares to Legacy Storage


Legacy Object Storage Architecture



High-Level Architecture

┌─────────────────────────────────────────────────────────────────┐
│                      Client (S3 API)                            │
│              PUT/GET/DELETE/LIST Requests                       │
└────────────────────────┬────────────────────────────────────────┘
                         │
┌────────────────────────▼────────────────────────────────────────┐
│                  HTTP Server & Router                           │
│         (Authentication, Throttling, Compression)               │
└────────────────────────┬────────────────────────────────────────┘
                         │
┌────────────────────────▼────────────────────────────────────────┐
│              API Handlers (GET/PUT/DELETE/LIST)                 │
└────────────────────────┬────────────────────────────────────────┘
                         │
┌────────────────────────▼────────────────────────────────────────┐
│            ObjectLayer Interface (Abstraction)                  │
└────────────────────────┬────────────────────────────────────────┘
                         │
┌────────────────────────▼────────────────────────────────────────┐
│         erasureServerPools (Multi-Pool Manager)                 │
│  - Weighted pool selection based on available space             │
│  - Pool expansion & decommissioning                             │
└────────────────────────┬────────────────────────────────────────┘
                         │
        ┌────────────────┼────────────────┐
        │                │                │
┌───────▼────┐   ┌───────▼────┐    ┌──────▼──────┐
│  Pool 1    │   │  Pool 2    │    │  Pool N     │
│ erasureSets│   │ erasureSets│    │ erasureSets │
└───────┬────┘   └───────┬────┘    └──────┬──────┘
        │                │                │
     ┌──┴──┐          ┌──┴──┐          ┌──┴──┐
     │Set1 │          │Set1 │          │Set1 │
     └──┬──┘          └──┬──┘          └──┬──┘
        │
┌───────▼──────────────────────────────────────────────────────┐
│  erasureObjects (Single Set - Erasure Coding Logic)          │
│  - Reed-Solomon EC:M+N encoding/decoding                     │
│  - Quorum-based read/write                                   │
│  - Object-level healing                                      │
└───────┬──────────────────────────────────────────────────────┘
        │
┌───────▼──────────────────────────────────────────────────────┐
│  StorageAPI Interface (Local & Remote Disk I/O)              │
│  - xlStorage (local), storageRESTClient (remote)             │
└───────┬──────────────────────────────────────────────────────┘
        │
   ┌────┴─────────────────────┬──────────────────┐
   │                          │                  │
┌──▼──┐ ┌──────┐ ┌──────────┐ │        ┌──────────────┐
│Disk1│ │Disk2 │ │  Disk..  │ │        │  Disk16      │
│     │ │      │ │          │ │        │              │
│     │ │      │ │          │ │        │              │
└─────┘ └──────┘ └──────────┘ │        └──────────────┘
                              └─ (Parallel I/O)

System Components

1. HTTP Server Layer

Accepts S3 requests and applies middleware: authentication (Signature V4), tracing, request throttling, and compression.

2. API Handler Layer

Implements S3 semantics for each verb (GetObjectHandler, PutObjectHandler, DeleteObjectHandler, ListObjectsHandler).

3. ObjectLayer Interface

Abstract interface implemented by erasureServerPools, erasureSets, and erasureObjects. Provides unified S3 operations.

4. Erasure Server Pools

Coordinates multiple pools: weighted pool selection by free space, pool expansion, and decommissioning.

5. Erasure Sets

Routes each object to an erasure set via consistent hashing (SipHash of the object name).

6. Erasure Objects

Core erasure-coding logic for a single set: Reed-Solomon encode/decode, read/write quorum enforcement, object-level healing.

7. Storage Layer

Performs the actual disk I/O: xlStorage for local drives, storageRESTClient for remote drives.


Object Storage Internals

On-Disk Layout

Each disk in a MinIO cluster stores data in the following structure:

disk1/
├── .minio.sys/
│   ├── format.json              # Cluster configuration
│   ├── config/                  # Server configuration
│   ├── buckets/                 # Bucket metadata
│   └── tmp/                     # Temporary files during writes
│
├── bucket1/
│   ├── object1/
│   │   ├── xl.meta              # Metadata (MessagePack serialized)
│   │   └── a1b2c3d4-e5f6.../   # DataDir UUID (contains data shard)
│   │       └── part.1           # Actual shard data
│   └── object2/
│       └── ...
└── bucket2/
    └── ...

xl.meta File Format

The xl.meta file contains critical metadata in MessagePack binary format:

Header:
- Magic: "XL2 "
- Version: 1.3

Versions[] (Version History):
├── Type: ObjectType, DeleteType, or LegacyType
├── ObjectV2 (if ObjectType):
│   ├── VersionID: UUID (unique version identifier)
│   ├── DataDir: UUID (data directory on disk)
│   ├── ErasureAlgorithm: ReedSolomon
│   ├── ErasureM: Number of data blocks (e.g., 12)
│   ├── ErasureN: Number of parity blocks (e.g., 4)
│   ├── ErasureBlockSize: Block size for encoding (1MB default)
│   ├── ErasureIndex: This disk's shard index (0-15)
│   ├── ErasureDist: Distribution array [disk_index_0, disk_index_1, ...]
│   ├── BitrotChecksumAlgo: HighwayHash (for integrity)
│   ├── PartNumbers: Part IDs (multipart uploads)
│   ├── PartSizes: Size of each part
│   ├── Size: Total object size
│   ├── ModTime: Modification timestamp (Unix nanoseconds)
│   ├── MetaSys: System metadata (inline data flag, etc.)
│   └── MetaUsr: User metadata (Content-Type, custom headers)

format.json - Cluster Configuration

Located at .minio.sys/format.json on each disk:

{
  "version": "1",
  "format": "xl",
  "id": "deployment-uuid",
  "xl": {
    "version": "3",
    "this": "disk-uuid",
    "sets": [
      ["disk-0", "disk-1", "disk-2", ..., "disk-15"],
      ["disk-16", "disk-17", "disk-18", ..., "disk-31"]
    ],
    "distributionAlgo": "SIPMOD"
  }
}

Key Fields:

  - id: deployment UUID, identical on every disk in the cluster
  - this: UUID of the disk this format.json lives on
  - sets: disk UUIDs grouped into erasure sets
  - distributionAlgo: hash algorithm used to route objects to sets


Erasure Coding

Reed-Solomon Encoding

MinIO uses Reed-Solomon erasure coding (via klauspost/reedsolomon library):

Example Configuration (16 disks):

  - EC:12+4: 12 data blocks + 4 parity blocks per object
  - Tolerates up to 4 failed drives with no data loss
  - Storage efficiency: 12/16 = 75% usable capacity

How It Works

Encoding (Write):

Original File (10MB)
        │
        ▼
┌──────────────────────────────────┐
│ Split into 1MB blocks (10 blocks)│
└──────────────────────────────────┘
        │
        ▼
┌──────────────────────────────────────────────────────┐
│ For each 1MB block:                                  │
│ ├─ Split into 12 data shards (~85KB each)           │
│ └─ Generate 4 parity shards (~85KB each)            │
│                                                      │
│ Result: 16 shards per block (12 data + 4 parity)   │
└──────────────────────────────────────────────────────┘
        │
        ▼
┌──────────────────────────────────────────────────────┐
│ Write to 16 disks in parallel:                       │
│ ├─ Disk 0: shard_0 (all blocks)                     │
│ ├─ Disk 1: shard_1 (all blocks)                     │
│ ├─ Disk 12-15: parity shards                        │
│ └─ All disks: xl.meta (metadata)                    │
└──────────────────────────────────────────────────────┘
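
The per-shard arithmetic above is a ceiling division: each 1 MiB block is split across the data shards, rounding up so no bytes are dropped. A minimal sketch (the function name is illustrative, not MinIO's API):

```go
package main

import "fmt"

// shardSize computes the bytes each data shard holds for one erasure
// block: ceil(blockSize / dataBlocks), expressed with integer math.
func shardSize(blockSize, dataBlocks int64) int64 {
	return (blockSize + dataBlocks - 1) / dataBlocks
}

func main() {
	// 1 MiB block split across 12 data shards ≈ 85.3 KiB per shard
	fmt.Println(shardSize(1<<20, 12)) // 87382
}
```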

Decoding (Read):

Read Request for 10MB object
        │
        ▼
┌──────────────────────────────────┐
│ Scenario 1: All 16 disks healthy │
│ ├─ Read 12 data shards           │
│ ├─ Ignore 4 parity shards        │
│ └─ Reconstruct original data     │
└──────────────────────────────────┘

┌──────────────────────────────────┐
│ Scenario 2: 2 disks dead         │
│ ├─ Read 14 available shards      │
│ ├─ Use Reed-Solomon to recover   │
│ │  missing 2 shards              │
│ └─ Reconstruct original data     │
└──────────────────────────────────┘

┌──────────────────────────────────┐
│ Scenario 3: 5+ disks dead        │
│ └─ READ FAILS (quorum lost)      │
└──────────────────────────────────┘

Read/Write Quorum

Read Quorum: M (the number of data blocks). At least M healthy shards are required to serve a read.

Write Quorum: M when parity is less than half the drives in the set; M + 1 when parity equals exactly half, to avoid split-brain inconsistency.

Erasure Coding Visual


The number of data blocks (K) constitutes the read quorum for the deployment. The erasure set must therefore have at least K healthy drives to support read operations.

For a small object with only 1 part (part.1), here we have 2 data blocks and 2 parity blocks:


Ref: https://blog.min.io/erasure-coding-vs-raid/

Not only does MinIO erasure coding protect against drive and node failures, MinIO also heals at the object level:

Read Request Flow


Write Request Flow

Two cases for write quorum:

  1. If parity is less than half the number of erasure set drives, write quorum equals the number of data drives.
  2. If parity equals half the number of erasure set drives, write quorum equals parity + 1 to avoid data inconsistency due to 'split brain' scenarios.


Bitrot Protection

MinIO protects against silent data corruption (bitrot) by checksumming every block it writes and verifying the checksum on every read.
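
A minimal sketch of the per-block framing, assuming the [block_hash|block_data] layout described later in the PUT flow. MinIO uses HighwayHash-256 for speed; SHA-256 stands in here so the example needs only the standard library.

```go
package main

import (
	"bytes"
	"crypto/sha256"
	"fmt"
)

// frameBlock prepends a checksum to a shard block before it is written.
func frameBlock(data []byte) []byte {
	sum := sha256.Sum256(data)
	return append(sum[:], data...)
}

// verifyBlock recomputes the checksum on read; a mismatch means the
// block has rotted and the drive is marked bad for this object.
func verifyBlock(framed []byte) ([]byte, bool) {
	sum, data := framed[:sha256.Size], framed[sha256.Size:]
	fresh := sha256.Sum256(data)
	return data, bytes.Equal(sum, fresh[:])
}

func main() {
	framed := frameBlock([]byte("shard-bytes"))
	if _, ok := verifyBlock(framed); ok {
		fmt.Println("intact")
	}
	framed[len(framed)-1] ^= 0xFF // flip a bit: simulated bitrot
	if _, ok := verifyBlock(framed); !ok {
		fmt.Println("bitrot detected")
	}
}
```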


PUT Operation

Step-by-Step Flow (Storing a 10MB Object)

1. CLIENT REQUEST
   │
   ├─ PUT /bucket/photos/vacation.jpg (10MB)
   │
   ▼

2. HTTP HANDLER
   │
   ├─ Parse request, extract bucket/object name
   ├─ Verify authentication (Signature V4)
   ├─ Create hash verifier (for bitrot)
   │
   ▼

3. POOL SELECTION
   │
   ├─ If object already exists:
   │  └─ Use same pool as existing version
   │
   └─ If new object:
      ├─ Calculate available space for each pool
      ├─ Filter: skip suspended/rebalancing pools
      ├─ Weighted random selection (prefer pool with most space)
      └─ Select Pool 0
   │
   ▼

4. SET SELECTION (Consistent Hashing)
   │
   ├─ Hash object name using SipHash:
   │  setIndex = sipHashMod("photos/vacation.jpg", numSets, deploymentID)
   │
   ├─ Result: Always same set for same object name (deterministic)
   └─ Select Erasure Set 3
   │
   ▼

5. CREATE METADATA
   │
   ├─ Generate UUIDs:
   │  ├─ VersionID: Unique identifier for this version
   │  └─ DataDir: Directory to store data shards
   │
   ├─ Calculate distribution order:
   │  └─ hashOrder(objectName, diskCount) = [3, 1, 4, 2, 5, ...]
   │
   ├─ Set erasure parameters:
   │  ├─ ErasureM = 12 (data blocks)
   │  ├─ ErasureN = 4 (parity blocks)
   │  └─ BlockSize = 1MB
   │
   ▼

6. ERASURE ENCODING
   │
   ├─ Read data in 1MB blocks (10 blocks total)
   │
   ├─ For each block:
   │  ├─ Split into 12 data shards (~85KB each)
   │  ├─ Compute 4 parity shards using Reed-Solomon
   │  ├─ Add HighwayHash checksum to each shard
   │  └─ Result: 16 shards per block
   │
   ▼

7. DISK ORDERING (Distribution)
   │
   ├─ Take 16 disks from Set 3
   ├─ Shuffle according to distribution order
   └─ Map shards: shard_i → disk_i
   │
   ▼

8. PARALLEL WRITES
   │
   ├─ For each of 16 disks (in parallel):
   │  │
   │  ├─ Write to temporary location:
   │  │  .minio.sys/tmp/{VersionID}/{DataDir}/part.1
   │  │
   │  ├─ Format: [block1_hash|block1_data|block2_hash|block2_data|...]
   │  │
   │  └─ Verify write success
   │
   ├─ Check write quorum: Need ≥12 successful writes
   │  └─ If <12 succeed: WRITE FAILS, cleanup
   │
   ▼

9. ATOMIC RENAME
   │
   ├─ Once quorum reached:
   │  └─ Rename all temp files to final location:
   │     bucket/object/{DataDir}/part.1
   │
   ▼

10. METADATA PERSISTENCE
    │
    ├─ Create xl.meta with all object metadata
    │ ├─ Version history
    │ ├─ Erasure config
    │ ├─ Distribution array
    │ └─ Part sizes
    │
    ├─ Write xl.meta to all 16 disks (in parallel)
    │
    ├─ Verify metadata quorum (≥12 successful)
    │
    ▼

11. SUCCESS
    │
    └─ Return 200 OK + ETag to client
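
The distribution calculation in step 5 can be sketched as follows: a CRC of the object name picks a starting drive, and the result is a rotation of 1..driveCount, so the same object name always maps shards to the same drives. This is modeled on MinIO's hashOrder helper; the exact implementation may differ between versions.

```go
package main

import (
	"fmt"
	"hash/crc32"
)

// hashOrder returns a deterministic rotation of [1..cardinality],
// keyed by a CRC32 of the object name.
func hashOrder(key string, cardinality int) []int {
	if cardinality <= 0 {
		return nil
	}
	nums := make([]int, cardinality)
	start := int(crc32.ChecksumIEEE([]byte(key)) % uint32(cardinality))
	for i := 1; i <= cardinality; i++ {
		nums[i-1] = 1 + ((start + i) % cardinality)
	}
	return nums
}

func main() {
	// Same object name always yields the same shard-to-drive mapping.
	fmt.Println(hashOrder("photos/vacation.jpg", 16))
}
```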

PUT Request Overview


For example, with 5 data blocks and 3 parity blocks:


PUT Request Sequence

sequenceDiagram
    participant Client
    participant MinIO as MinIO Server
    participant PoolMgr as Pool Manager
    participant SetMgr as Erasure Set Manager
    participant EC as Erasure Coder
    participant Disk1 as Drive 1
    participant Disk2 as Drive 2
    participant DiskN as Drive N

    Client->>MinIO: PUT /bucket/object
    MinIO->>PoolMgr: getPoolIdx(bucket, object, size)

    alt Object Already Exists
        PoolMgr->>PoolMgr: Query all pools in parallel
        PoolMgr->>PoolMgr: GetObjectInfo() on each pool
        PoolMgr->>PoolMgr: Object found in Pool 2
        PoolMgr-->>MinIO: Use Pool 2 (existing)
    else New Object
        PoolMgr->>PoolMgr: getServerPoolsAvailableSpace()
        PoolMgr->>PoolMgr: Filter: skip suspended/rebalancing
        PoolMgr->>PoolMgr: Weighted random selection
        PoolMgr-->>MinIO: Use Pool 1 (most space)
    end

    MinIO->>SetMgr: Hash object name
    SetMgr->>SetMgr: sipHashMod(objectName, numSets)
    SetMgr-->>MinIO: Erasure Set 3

    MinIO->>EC: Create FileInfo metadata
    EC->>EC: hashOrder(objectName, drives)
    EC->>EC: Generate distribution array
    Note over EC: Distribution: [3,1,4,2,5,...]

    MinIO->>EC: Encode object (M data + N parity)
    EC->>EC: Split into data blocks
    EC->>EC: Calculate parity using Reed-Solomon
    EC-->>MinIO: Data + Parity shards

    MinIO->>SetMgr: Shuffle disks by distribution
    SetMgr-->>MinIO: Ordered disk list

    par Write to all drives in parallel
        MinIO->>Disk1: Write shard 1 + xl.meta
        MinIO->>Disk2: Write shard 2 + xl.meta
        MinIO->>DiskN: Write shard N + xl.meta
    end

    Disk1-->>MinIO: Success
    Disk2-->>MinIO: Success
    DiskN-->>MinIO: Success

    MinIO->>MinIO: Check write quorum (M drives)
    MinIO-->>Client: 200 OK

PUT Layer-by-Layer Graph

graph TB
    subgraph "Client"
        C[PUT /mybucket/image.jpg<br/>10 MB]
    end

    subgraph "Layer 1: HTTP Server"
        H[HTTP Router<br/>Match route]
    end

    subgraph "Layer 2: API Handler"
        A[PutObjectHandler<br/>Parse request<br/>Create PutObjReader]
    end

    subgraph "Layer 3: Server Pools"
        SP[erasureServerPools<br/>Select Pool 0<br/>based on space]
    end

    subgraph "Layer 4: Erasure Sets"
        ES[erasureSets<br/>Hash object name<br/>Select Set 3]
    end

    subgraph "Layer 5: Erasure Objects"
        EO[erasureObjects<br/>Setup EC:12+4<br/>WriteQuorum: 12]
    end

    subgraph "Layer 6: Encoding Loop"
        L1[Read Block 1<br/>1 MB]
        L2[Encode to<br/>16 shards]
        L3[Write to<br/>16 disks]
        L4[Check<br/>quorum]
        L5{More<br/>blocks?}
    end

    subgraph "Layer 7: Reed-Solomon"
        RS[Split 1 MB into<br/>12 data shards<br/>Generate 4 parity]
    end

    subgraph "Layer 8: Parallel Writes"
        W1[Disk 1<br/>Write D1]
        W2[Disk 2<br/>Write D2]
        W3[Disk 12<br/>Write D12]
        W4[Disk 13<br/>Write P1]
        W5[Disk 16<br/>Write P4]
    end

    subgraph "Metadata Write"
        M1[Create xl.meta<br/>with FileInfo]
        M2[Write to all<br/>16 disks]
        M3[Check quorum<br/>12/16]
        M4{Success?}
    end

    subgraph "Final"
        F1[✅ Return ObjectInfo]
        F2[❌ Revert & Error]
    end

    C --> H --> A --> SP --> ES --> EO --> L1
    L1 --> L2 --> RS --> L3
    L3 --> W1 & W2 & W3 & W4 & W5
    W1 & W2 & W3 & W4 & W5 --> L4
    L4 --> L5
    L5 -->|Yes| L1
    L5 -->|No| M1
    M1 --> M2 --> M3 --> M4
    M4 -->|Yes| F1
    M4 -->|No| F2

    style C fill:#e1f5ff
    style RS fill:#fff3cd
    style F1 fill:#d4edda
    style F2 fill:#f8d7da

Key Decisions

Pool Selection Algorithm (Weighted Random):

totalFreeSpace = sum of free space in all pools
choose = random(0, totalFreeSpace)
for each pool:
    if choose < pool.freeSpace:
        select this pool
        break
    choose -= pool.freeSpace
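
A runnable sketch of this weighted-random choice. The random draw r (in [0, totalFree)) is passed in as a parameter so the function stays deterministic and testable; MinIO draws it internally.

```go
package main

import "fmt"

// selectPool picks a pool index weighted by free space: a pool's
// chance of selection is proportional to its share of total free space.
func selectPool(freeSpace []uint64, r uint64) int {
	for i, free := range freeSpace {
		if r < free {
			return i
		}
		r -= free
	}
	return len(freeSpace) - 1 // r >= total: fall back to the last pool
}

func main() {
	free := []uint64{100, 300, 600} // pool free space; total = 1000
	fmt.Println(selectPool(free, 50))  // 0
	fmt.Println(selectPool(free, 250)) // 1
	fmt.Println(selectPool(free, 999)) // 2
}
```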

Set Selection Algorithm (Consistent Hash):

func sipHashMod(key string, cardinality int, id [16]byte) int {
    k0, k1 := binary.LittleEndian.Uint64(id[0:8]),
             binary.LittleEndian.Uint64(id[8:16])
    sum64 := siphash.Hash(k0, k1, []byte(key))
    return int(sum64 % uint64(cardinality))
}

GET Operation

Step-by-Step Flow (Retrieving a 10MB Object)

1. CLIENT REQUEST
   │
   ├─ GET /bucket/photos/vacation.jpg
   │ Optional: Range header (e.g., bytes=2097152-10485759)
   │
   ▼

2. HTTP HANDLER
   │
   ├─ Parse request
   ├─ Verify authentication
   ├─ Check preconditions (If-Match, If-Modified-Since)
   │
   ▼

3. SET LOOKUP (Same Hash as Write)
   │
   ├─ Hash object name using same SipHash
   │  → Deterministically routes to same set as original write
   │
   └─ Select Erasure Set 3
   │
   ▼

4. METADATA READING
   │
   ├─ Read xl.meta from ALL 16 disks (in parallel)
   │
   ├─ Verify quorum: Need ≥12 successful reads
   │  └─ If <12: READ FAILS
   │
   ├─ Select latest version (by ModTime)
   │
   ├─ Extract:
   │  ├─ Erasure config (M, N, block size)
   │  ├─ Part sizes and ETags
   │  ├─ Distribution order
   │  └─ Shard indices
   │
   ▼

5. PARALLEL SHARD READING
   │
   ├─ Create readers for all 16 disks
   │
   ├─ Read in parallel:
   │  ├─ Each disk returns its shard blocks
   │  ├─ Verify HighwayHash per block
   │  │  └─ Hash mismatch → mark disk as bad
   │  └─ Stop reading once we have ≥12 good shards
   │
   ▼

6. RECONSTRUCTION (If Needed)
   │
   ├─ If all 16 disks healthy:
   │  └─ Use 12 data shards directly
   │
   └─ If some disks failed/corrupted:
      ├─ Use Reed-Solomon decoder
      ├─ Reconstruct missing shards from available ones
      └─ Need at least M (12) shards to reconstruct
   │
   ▼

7. RANGE EXTRACTION (If Range Header Present)
   │
   ├─ If range requested (e.g., bytes 2-10MB):
   │  ├─ Extract only requested byte range
   │  └─ Efficient: Don't read entire object
   │
   ├─ Apply decompression (if S2 compression used)
   │
   ├─ Apply decryption (if AES-256-GCM encryption used)
   │
   ▼

8. STREAM TO CLIENT
   │
   ├─ Set Content-Length header
   ├─ Set Content-Range (if range request)
   ├─ Stream data directly to HTTP response body
   │
   ▼

9. SUCCESS
    │
    └─ Return 200 OK (or 206 Partial Content for range)

Failure Scenarios

Scenario        Disks Available   Status       Action
All healthy     16/16             ✅ Success    Read 12 data shards
1 disk dead     15/16             ✅ Success    Read 12 data shards from remaining
2 disks dead    14/16             ✅ Success    Read 12+ shards, reconstruct if needed
4 disks dead    12/16             ✅ Success    Read 12 available shards (at limit)
5+ disks dead   <12/16            ❌ FAIL       Cannot read (quorum lost)
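
The table reduces to a single rule: a read succeeds as long as at least M (the data-block count) shards survive out of M+N. A small sketch:

```go
package main

import "fmt"

// canRead reports whether enough shards survive to serve a read:
// the read quorum is the data-block count M.
func canRead(healthyDisks, dataBlocks int) bool {
	return healthyDisks >= dataBlocks
}

func main() {
	const m, n = 12, 4 // EC:12+4
	for dead := 0; dead <= 5; dead++ {
		fmt.Printf("%d dead: readable=%v\n", dead, canRead(m+n-dead, m))
	}
}
```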

Distributed Architecture

All nodes in a distributed MinIO deployment should be homogeneous: same operating system, same number of drives, and same network interconnects.


Ref: https://github.com/minio/minio/blob/master/docs/distributed/README.md

MinIO adopts a decentralized shared-nothing architecture: object data is scattered across multiple drives on different nodes and presented as a unified namespace, with requests spread across servers via a load balancer or DNS round-robin.

Erasure Set Organization (4 Servers × 4 Disks Each = 16 Disks Total)

Server 1: [D1] [D2] [D3] [D4]
Server 2: [D5] [D6] [D7] [D8]
Server 3: [D9] [D10][D11][D12]
Server 4: [D13][D14][D15][D16]

            ↓ Round-Robin Assignment ↓

Erasure Set 0: [D1, D5, D9, D13, D2, D6, D10, D14, ...]
                S1   S2   S3   S4   S1   S2   S3    S4

Key: Each set has disks from ALL servers
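
The striping above can be reproduced with a few lines of Go: with S servers of D drives each, drives are dealt to erasure sets column by column so every set spans all servers. Drive names like "D1".."D16" match the diagram; the helper itself is illustrative, not MinIO's code.

```go
package main

import "fmt"

// roundRobinSets deals drives to erasure sets so consecutive slots
// alternate across servers, then chops the sequence into sets.
func roundRobinSets(servers, drivesPerServer, setSize int) [][]string {
	var order []string
	for d := 0; d < drivesPerServer; d++ {
		for s := 0; s < servers; s++ {
			order = append(order, fmt.Sprintf("D%d", s*drivesPerServer+d+1))
		}
	}
	sets := make([][]string, 0, len(order)/setSize)
	for i := 0; i+setSize <= len(order); i += setSize {
		sets = append(sets, order[i:i+setSize])
	}
	return sets
}

func main() {
	sets := roundRobinSets(4, 4, 16)
	fmt.Println(sets[0][:4]) // [D1 D5 D9 D13]
}
```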

Fault Tolerance

If Server 3 Dies:

Set contains: [D1(S1), D5(S2), D9(S3), D13(S4), ...]

After S3 failure:
├─ Available: D1(S1) ✓, D5(S2) ✓, D13(S4) ✓
├─ Dead: D9(S3) ✗
├─ Tolerance: EC:12+4 can lose up to 4 disks
│
└─ Result: SAFE - Can still read and recover

If Any 4 Disks Die:

Available shards: 12 (exactly at read quorum)
Parity tolerance: 4
Result: Still readable but no fault tolerance left

If 5+ Disks Die:

Available shards: <12 (below read quorum)
Result: UNRECOVERABLE - READ FAILS

Server Pools


A server pool is a set of MinIO server nodes which pool their drives and resources, creating a unit of expansion. All nodes in a server pool share their hardware resources in an isolated namespace.

Another important property is rebalance-free, non-disruptive expansion: with MinIO's server-pool approach, no rebalancing is required to expand. Ref: https://blog.min.io/no-rebalancing-object-storage/

A MinIO cluster is built on server pools, and server pools are built on erasure sets.


Multi-Pool Architecture

MinIO can have multiple independent pools for expansion:

Cluster
├── Pool 1 (16 disks, 4 nodes)
│   └─ Erasure Set 0
│
│
├── Pool 2 (32 disks, 4 nodes)
│   ├─ Erasure Set 1
│   └─ Erasure Set 2
│
└── Pool 3 (48 disks, 4 nodes)
    ├─ Erasure Set 3
    ├─ Erasure Set 4
    └─ Erasure Set 5

Pool Expansion (Rebalance-Free)

  1. Add new pool: MinIO detects new endpoints at startup
  2. Update format files: Cluster configuration updated
  3. New objects: Distributed across all pools by available space
  4. Existing objects: Stay in original pool (no rebalancing)
  5. Decommission: Background migration copies objects to other pools

Weighted Random Selection

When a new object lands in a multi-pool cluster, the weighted-random algorithm shown under Key Decisions applies: pools with more free space receive proportionally more new objects.


Code Architecture

Layer Hierarchy

graph TB
    subgraph "1. Entry Point"
        A[Application Entry<br/>& Initialization]
    end

    subgraph "2. HTTP Server Layer"
        B[HTTP Server<br/>Router & Middleware]
    end

    subgraph "3. API Handler Layer"
        C[S3 API Handlers<br/>Request Processing]
    end

    subgraph "4. Object Layer Interface"
        D[ObjectLayer Interface<br/>Abstraction Layer]
    end

    subgraph "5. Erasure Server Pools"
        E[Pool Manager<br/>Multi-Pool Coordination]
    end

    subgraph "6. Erasure Sets"
        F[Set Manager<br/>Consistent Hashing]
    end

    subgraph "7. Erasure Objects"
        G[Erasure Logic<br/>Quorum & Operations]
    end

    subgraph "8. Storage Layer"
        H[Disk I/O<br/>File Operations]
    end

    subgraph "9. Metadata Layer"
        I[xl.meta Management<br/>Version Control]
    end

    subgraph "10. Healing Layer"
        J[Self-Repair<br/>Background Scanner]
    end

    subgraph "11. Erasure Coding"
        K[Reed-Solomon<br/>Data Protection]
    end

    subgraph "12. Bitrot Protection"
        L[Hash Verification<br/>Integrity Checks]
    end

    A --> B
    B --> C
    C --> D
    D --> E
    E --> F
    F --> G
    G --> H
    H --> I

    G -.Healing.-> J
    G -.Encoding.-> K
    H -.Verification.-> L

    style A fill:#e1f5ff,stroke:#01579b,stroke-width:3px,color:#000
    style B fill:#f3e5f5,stroke:#4a148c,stroke-width:3px,color:#000
    style C fill:#e8f5e9,stroke:#1b5e20,stroke-width:3px,color:#000
    style D fill:#fff3e0,stroke:#e65100,stroke-width:3px,color:#000
    style E fill:#fce4ec,stroke:#880e4f,stroke-width:3px,color:#000
    style F fill:#e0f2f1,stroke:#004d40,stroke-width:3px,color:#000
    style G fill:#f1f8e9,stroke:#33691e,stroke-width:3px,color:#000
    style H fill:#e3f2fd,stroke:#0d47a1,stroke-width:3px,color:#000
    style I fill:#fef5e7,stroke:#f39c12,stroke-width:3px,color:#000
    style J fill:#fadbd8,stroke:#c0392b,stroke-width:3px,color:#000
    style K fill:#d5f4e6,stroke:#117a65,stroke-width:3px,color:#000
    style L fill:#ebdef0,stroke:#6c3483,stroke-width:3px,color:#000

Detailed Layer Flow

graph TB
    subgraph "1. HTTP Server Layer"
        HTTP[HTTP Server<br/>xhttp.NewServer]
        Router[Mux Router<br/>mux.NewRouter]
    end

    subgraph "2. Middleware Layer"
        Auth[Authentication<br/>Signature V4]
        Trace[HTTP Tracing]
        Throttle[Request Throttling<br/>maxClients]
        GZIP[GZIP Compression]
    end

    subgraph "3. API Handler Layer"
        APIHandlers[objectAPIHandlers]
        GetObj[GetObjectHandler]
        PutObj[PutObjectHandler]
        DelObj[DeleteObjectHandler]
        ListObj[ListObjectsHandler]
    end

    subgraph "4. ObjectLayer Interface"
        ObjInterface["<b>ObjectLayer Interface</b><br/>• GetObjectNInfo<br/>• PutObject<br/>• DeleteObject<br/>• ListObjects<br/>• GetObjectInfo<br/>• Multipart Operations"]
    end

    subgraph "5. Erasure Server Pools"
        ESP[erasureServerPools<br/>implements ObjectLayer]
        Pool1[Pool 1<br/>erasureSets]
        Pool2[Pool 2<br/>erasureSets]
        PoolN[Pool N<br/>erasureSets]
    end

    subgraph "6. Erasure Sets"
        Set1[Set 1<br/>erasureObjects]
        Set2[Set 2<br/>erasureObjects]
        SetN[Set N<br/>erasureObjects]
    end

    subgraph "7. Erasure Objects Layer"
        ErasureObj[erasureObjects<br/>implements ObjectLayer]
        ECLogic[Erasure Coding Logic<br/>Reed-Solomon]
        Quorum[Read/Write Quorum]
        Healing[Self-Healing]
    end

    subgraph "8. StorageAPI Interface"
        StorageInterface["<b>StorageAPI Interface</b><br/>• ReadVersion<br/>• WriteMetadata<br/>• DeleteVersion<br/>• ReadFile/WriteAll<br/>• Volume Operations"]
    end

    subgraph "9. Storage Implementation"
        XLStorage[xlStorage<br/>implements StorageAPI]
        Remote[storageRESTClient<br/>Remote Disks]
        DiskCheck[xlStorageDiskIDCheck<br/>Health Wrapper]
    end

    subgraph "10. Disk Layer"
        LocalDisk[Local Disk I/O<br/>xl.meta files]
        RemoteDisk[Remote Disk via REST]
        Metadata[xl.meta<br/>Object Metadata]
    end

    HTTP --> Router
    Router --> Auth
    Auth --> Trace
    Trace --> Throttle
    Throttle --> GZIP
    GZIP --> APIHandlers
    APIHandlers --> GetObj
    APIHandlers --> PutObj
    APIHandlers --> DelObj
    APIHandlers --> ListObj

    GetObj --> ObjInterface
    PutObj --> ObjInterface
    DelObj --> ObjInterface
    ListObj --> ObjInterface

    ObjInterface --> ESP
    ESP --> Pool1
    ESP --> Pool2
    ESP --> PoolN

    Pool1 --> Set1
    Pool1 --> Set2
    Pool1 --> SetN

    Set1 --> ErasureObj
    ErasureObj --> ECLogic
    ErasureObj --> Quorum
    ErasureObj --> Healing

    ErasureObj --> StorageInterface

    StorageInterface --> XLStorage
    StorageInterface --> Remote
    StorageInterface --> DiskCheck

    XLStorage --> LocalDisk
    Remote --> RemoteDisk
    LocalDisk --> Metadata
    RemoteDisk --> Metadata

    style ObjInterface fill:#e1f5ff,stroke:#01579b,stroke-width:3px,color:#000
    style StorageInterface fill:#e1f5ff,stroke:#01579b,stroke-width:3px,color:#000
    style ESP fill:#fff3e0,stroke:#e65100,stroke-width:2px,color:#000
    style ErasureObj fill:#fff3e0,stroke:#e65100,stroke-width:2px,color:#000
    style XLStorage fill:#f1f8e9,stroke:#33691e,stroke-width:2px,color:#000

Interface Class Diagram

classDiagram
    class ObjectLayer {
        <<interface>>
        +GetObjectNInfo() GetObjectReader
        +PutObject() ObjectInfo
        +DeleteObject() ObjectInfo
        +GetObjectInfo() ObjectInfo
        +ListObjects() ListObjectsInfo
        +MakeBucket() error
        +NewMultipartUpload() NewMultipartUploadResult
        +GetDisks() []StorageAPI
    }

    class erasureServerPools {
        -poolMeta poolMeta
        -serverPools []*erasureSets
        -deploymentID [16]byte
        +PutObject() ObjectInfo
        +GetObjectNInfo() GetObjectReader
        +getPoolIdx() int
    }

    class erasureSets {
        -sets []*erasureObjects
        -format *formatErasureV3
        -erasureDisks [][]StorageAPI
        -setCount int
        -setDriveCount int
        +PutObject() ObjectInfo
        +getHashedSet() int
    }

    class erasureObjects {
        -setDriveCount int
        -defaultParityCount int
        -getDisks func()[]StorageAPI
        -nsMutex *nsLockMap
        +PutObject() ObjectInfo
        +putObject() ObjectInfo
        +defaultWQuorum() int
        +defaultRQuorum() int
    }

    class StorageAPI {
        <<interface>>
        +ReadVersion() FileInfo
        +WriteMetadata() error
        +DeleteVersion() error
        +CreateFile() error
        +ReadFile() int64
        +MakeVol() error
        +ListVols() []VolInfo
        +GetDiskID() string
        +IsOnline() bool
    }

    class xlStorage {
        -diskPath string
        -endpoint Endpoint
        -diskID string
        -formatFile string
        +WriteMetadata() error
        +ReadVersion() FileInfo
        +CreateFile() error
        +ReadFile() int64
    }

    class storageRESTClient {
        -endpoint Endpoint
        -restClient *rest.Client
        -diskID string
        +WriteMetadata() error
        +ReadVersion() FileInfo
        +CreateFile() error
    }

    class Erasure {
        -encoder func()Encoder
        -dataBlocks int
        -parityBlocks int
        -blockSize int64
        +EncodeData() [][]byte
        +DecodeDataBlocks() error
        +ShardSize() int64
    }

    ObjectLayer <|.. erasureServerPools : implements
    ObjectLayer <|.. erasureSets : implements
    ObjectLayer <|.. erasureObjects : implements

    erasureServerPools *-- erasureSets : contains
    erasureSets *-- erasureObjects : contains
    erasureObjects --> StorageAPI : uses
    erasureObjects --> Erasure : uses

    StorageAPI <|.. xlStorage : implements
    StorageAPI <|.. storageRESTClient : implements

    style ObjectLayer fill:#e1f5ff,stroke:#01579b,stroke-width:3px,color:#000
    style StorageAPI fill:#e1f5ff,stroke:#01579b,stroke-width:3px,color:#000
    style erasureServerPools fill:#fff3e0,stroke:#e65100,stroke-width:2px,color:#000
    style erasureSets fill:#fff3e0,stroke:#e65100,stroke-width:2px,color:#000
    style erasureObjects fill:#fff3e0,stroke:#e65100,stroke-width:2px,color:#000
    style xlStorage fill:#f1f8e9,stroke:#33691e,stroke-width:2px,color:#000
    style Erasure fill:#e8f5e9,stroke:#2e7d32,stroke-width:2px,color:#000

Key Interfaces

ObjectLayer Interface:

type ObjectLayer interface {
    // Bucket operations
    MakeBucket(ctx, bucket, opts) error
    GetBucketInfo(ctx, bucket, opts) (BucketInfo, error)
    ListBuckets(ctx, opts) ([]BucketInfo, error)
    DeleteBucket(ctx, bucket, opts) error
    ListObjects(...) (ListObjectsInfo, error)
    ListObjectVersions(...) (ListObjectVersionsInfo, error)

    // Object operations
    GetObjectNInfo(ctx, bucket, object, rangeSpec, headers, opts) (*GetObjectReader, error)
    GetObjectInfo(ctx, bucket, object, opts) (ObjectInfo, error)
    PutObject(ctx, bucket, object, data, opts) (ObjectInfo, error)
    CopyObject(ctx, srcBucket, srcObject, dstBucket, dstObject, ...) (ObjectInfo, error)
    DeleteObject(ctx, bucket, object, opts) (ObjectInfo, error)
    DeleteObjects(ctx, bucket, objects, opts) ([]DeletedObject, []error)

    // Multipart operations
    NewMultipartUpload(ctx, bucket, object, opts) (*NewMultipartUploadResult, error)
    PutObjectPart(ctx, bucket, object, uploadID, partID, data, opts) (PartInfo, error)
    CompleteMultipartUpload(ctx, bucket, object, uploadID, parts, opts) (ObjectInfo, error)
    AbortMultipartUpload(ctx, bucket, object, uploadID, opts) error

    // Healing & Info
    HealFormat(ctx, dryRun) (HealResultItem, error)
    HealBucket(ctx, bucket, opts) (HealResultItem, error)
    HealObject(ctx, bucket, object, versionID, opts) (HealResultItem, error)
    StorageInfo(ctx, metrics bool) StorageInfo
}

StorageAPI Interface:

type StorageAPI interface {
    // Metadata
    ReadVersion(ctx, origvolume, volume, path, versionID, opts) (FileInfo, error)
    WriteMetadata(ctx, origvolume, volume, path, fi) error
    DeleteVersion(ctx, volume, path, fi, ...) error

    // File operations
    ReadFile(ctx, volume, path, offset, buf, verifier) (n, error)
    CreateFile(ctx, origvolume, volume, path, size, reader) error
    ReadFileStream(ctx, volume, path, offset, length) (io.ReadCloser, error)
    AppendFile(ctx, volume, path, buf) error
    Delete(ctx, volume, path, opts) error

    // Volume operations
    MakeVol(ctx, volume) error
    ListVols(ctx) ([]VolInfo, error)
    StatVol(ctx, volume) (VolInfo, error)
    DeleteVol(ctx, volume, forceDelete bool) error

    // Disk info
    IsOnline() bool
    GetDiskID() (string, error)
    DiskInfo(ctx, opts) (DiskInfo, error)
}

Implementations

| Component | Location | Role |
|---|---|---|
| erasureServerPools | cmd/erasure-server-pool.go | Pool orchestration, weighted selection |
| erasureSets | cmd/erasure-sets.go | Set routing, consistent hashing |
| erasureObjects | cmd/erasure-object.go | Core put/get/delete with EC |
| xlStorage | cmd/xl-storage.go | Local disk I/O |
| storageRESTClient | cmd/storage-rest-client.go | Remote disk via REST |
| Erasure | cmd/erasure-coding.go | Reed-Solomon encode/decode |
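The set-routing role of erasureSets above boils down to deterministic hashing of the object name. The sketch below is illustrative only: real MinIO uses SipHash keyed by the deployment ID (the SIPMOD+PARITY algorithm), while crc32 here keeps the example stdlib-only.

```go
package main

import (
	"fmt"
	"hash/crc32"
)

// setIndex is a hypothetical helper: hash the object name, then take it
// modulo the number of erasure sets. Any node computes the same answer,
// so no lookup table or coordination is needed.
func setIndex(object string, numSets int) int {
	return int(crc32.ChecksumIEEE([]byte(object))) % numSets
}

func main() {
	for _, obj := range []string{"photos/vacation.jpg", "logs/2024.txt"} {
		fmt.Printf("%s -> set %d\n", obj, setIndex(obj, 4))
	}
	// The same name always routes to the same set.
	fmt.Println(setIndex("photos/vacation.jpg", 4) == setIndex("photos/vacation.jpg", 4))
}
```

Because placement is pure arithmetic, GET requests go straight to the right set with no metadata lookup.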

Healing & Self-Recovery

MinIO performs automatic background healing to detect and repair corrupted objects:

Healing Mechanisms

  1. Bitrot Detection: HighwayHash checksum verification on every read
  2. Bad Disk Detection: Continuous health monitoring of all disks
  3. Object-Level Healing: Corrupted objects repaired in seconds, versus hours for a full RAID rebuild
  4. Background Scanner: Periodic scan of all objects to detect bitrot proactively

Healing Flow

Bad block detected (hash mismatch)
        │
        ▼
Mark disk as bad
        │
        ▼
Read remaining 11 shards (any 8 suffice for EC:8+4)
        │
        ▼
Use Reed-Solomon to reconstruct missing shard
        │
        ▼
Repair disk by writing reconstructed shard
        │
        ▼
Verify repair with new hash
        │
        ▼
Continue serving object (healed)

Gateway Mode (Deprecated)

MinIO introduced gateway mode early on to provide S3 API compatibility in front of legacy storage systems.


Why Deprecated:

Lessons Learned:

Reference: Gateway Migration and Deprecation Details


Advanced Features

Versioning

Object Locking (WORM)

Lifecycle Management

Replication

IAM & Access Control

Encryption

Server-side encryption (SSE-S3, SSE-KMS) and client-side encryption support with master key rotation.



Distributed Locking (dsync)

MinIO avoids consistency issues using distributed locking:

How dsync Works

  1. Lock Request: Any node broadcasts lock request to all nodes
  2. Quorum: If N/2+1 nodes approve → lock acquired
  3. No Master: Every node is peer; no single authority
  4. Stale Detection: Heartbeats between nodes detect offline lock holders so stale locks can be expired

dsync is MinIO’s distributed RW mutex (internal/dsync/). Every operation that mutates or reads object state acquires a lock through the nsLockMap abstraction, which routes to either a distributed DRWMutex (multi-node) or a local mutex (single-node).

Design goals


Ref: https://blog.min.io/minio-dsync-a-distributed-locking-and-syncing-package-for-go/

Architecture

Application (erasureObjects.PutObject, etc.)
         ↓
nsLockMap  (local wrapper — distributed or single-node)
         ↓
distLockInstance (DRWMutex)   OR   localLockInstance
         ↓
DRWMutex — quorum-based distributed lock
         ↓
NetLocker interface — REST calls to lock server on each node
         ↓
localLocker — lock server running on each MinIO node

Which operations lock what

| Operation | Lock Type | Lock Key |
|---|---|---|
| GetObjectNInfo | RLock | bucket/object |
| GetObjectInfo | RLock | bucket/object |
| PutObject | Lock (write) | bucket/object |
| CopyObject | Lock (write) | bucket/object |
| DeleteObject | Lock (write) | bucket/object |
| DeleteObjects | Lock (write) | [bucket/obj1, bucket/obj2, ...] sorted |
| NewMultipartUpload | Lock (write) | bucket/object |
| PutObjectPart | Lock (write) | bucket/object/uploadID |
| CompleteMultipartUpload | Lock (write) | bucket/object/uploadID |
| AbortMultipartUpload | RLock | bucket/object |
| PutObjectTags / Metadata | Lock (write) | bucket/object |
| MakeBucket | Lock (write) | .minio.sys/bucket.lck |
| DeleteBucket | Lock (write) | .minio.sys/bucket.lck |

Lock key structure

Keys are pathJoin(bucket, object):

"my-bucket/photos/vacation.jpg"           ← regular object
"my-bucket/large-file/abc123-upload-id"   ← multipart upload
".minio.sys/new-bucket.lck"               ← bucket creation

For DeleteObjects, all paths are sorted before acquiring — prevents deadlock when two concurrent requests delete overlapping object sets.
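A minimal sketch of that sorted-acquisition rule (lockOrder is a hypothetical helper, not MinIO's API):

```go
package main

import (
	"fmt"
	"sort"
)

// lockOrder builds the lock keys for a multi-object delete and sorts
// them, so every concurrent caller acquires locks in the same global
// order. Without the sort, request A could hold obj1 while waiting on
// obj2, and request B the reverse: a classic deadlock.
func lockOrder(bucket string, objects []string) []string {
	keys := make([]string, 0, len(objects))
	for _, o := range objects {
		keys = append(keys, bucket+"/"+o) // pathJoin(bucket, object)
	}
	sort.Strings(keys)
	return keys
}

func main() {
	// Two requests deleting overlapping sets end up with identical order.
	fmt.Println(lockOrder("my-bucket", []string{"b.txt", "a.txt"}))
	fmt.Println(lockOrder("my-bucket", []string{"a.txt", "b.txt"}))
}
```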

Quorum protocol

4-node cluster, lock servers on all 4 nodes

Write lock (PutObject):
  → Contact all 4 lock servers in parallel
  → Need majority: ⌈4/2⌉ + 1 = 3 approvals
  → If only 2 respond → lock DENIED

Read lock (GetObject):
  → Contact all 4 lock servers
  → Need fewer approvals (multiple readers allowed concurrently)

Split-brain guard:
  if quorum == tolerance → quorum++ (ensures strict majority for writes)
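The quorum arithmetic above, including the even-cluster guard, as a runnable sketch (writeQuorum is an illustrative helper mirroring dsync's logic, not its actual API):

```go
package main

import "fmt"

// writeQuorum computes the approvals needed for a write lock on an
// n-node cluster: a strict majority. When n is even, tolerance equals
// quorum, so quorum is bumped by one; otherwise two disjoint halves of
// a partitioned cluster could each grant the same write lock.
func writeQuorum(n int) int {
	tolerance := n / 2
	quorum := n - tolerance
	if quorum == tolerance { // split-brain guard for even clusters
		quorum++
	}
	return quorum
}

func main() {
	for _, n := range []int{4, 5, 8} {
		fmt.Printf("%d nodes -> write quorum %d\n", n, writeQuorum(n))
	}
}
```

For the 4-node example above this yields 3, matching the "only 2 respond, lock DENIED" case.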

Lock lifecycle

1. ACQUIRE
   nsLockMap.NewNSLock(bucket, object)
        ↓
   DRWMutex.GetLock(ctx, timeout)
        ↓
   Contacts all peer nodes via REST in parallel
        ↓
   Waits for quorum approvals

2. REFRESH (background goroutine, every 10 seconds)
   Keeps lock alive by pinging all peers
   If quorum lost → force unlock + notify caller

3. RELEASE
   DRWMutex.Unlock() → fires async goroutine
   Retries with backoff if nodes unreachable
   Stale locks auto-expire after 1 minute

What is nsLockMap?

nsLockMap is the namespace lock manager — the single abstraction layer that sits between MinIO’s S3 operations and the actual lock implementation (distributed or local).

When any operation wants to lock an object, it goes through nsLockMap.NewNSLock(), which decides which lock implementation to use:

PutObject("my-bucket", "photo.jpg")
         ↓
nsLockMap.NewNSLock("my-bucket", "photo.jpg")
         ↓
    isDistErasure?
    ┌────YES────┐          ┌────NO────┐
    ↓                       ↓
distLockInstance         localLockInstance
(DRWMutex — contacts     (in-memory map of
 all peer nodes)          RW mutexes)

In single-node mode: maintains an in-memory map[string]*nsLock keyed by lock path (bucket/object). Uses reference counting — when two operations lock the same object, the same nsLock entry is reused and ref is incremented. When the last holder unlocks, ref hits 0 and the entry is removed (cleanup).

In distributed mode: skips the in-memory map entirely and creates a DRWMutex that contacts all peer nodes over REST to acquire a quorum-based distributed lock.

Why it exists: lets the rest of MinIO’s code be completely unaware of whether it’s running single-node or distributed. erasureObjects.PutObject() just calls nsLockMap.NewNSLock() and gets back a RWLocker — same interface regardless of mode.

nsLockMap internals

// Two modes depending on cluster type:

// Distributed mode (isDistErasure = true):
type distLockInstance struct {
    rwMutex *dsync.DRWMutex  // Distributed RW Mutex
    opsID   string
}

// Single-node mode (isDistErasure = false):
type localLockInstance struct {
    ns     *nsLockMap
    volume string
    paths  []string
    opsID  string
}

// Routing in NewNSLock():
if n.isDistErasure {
    drwmutex := dsync.NewDRWMutex(&dsync.Dsync{
        GetLockers: lockers,   // returns remote lock server clients
        Timeouts:   dsync.DefaultTimeouts,
    }, pathsJoinPrefix(volume, paths...)...)
    return &distLockInstance{drwmutex, opsID}
}
// else: local in-memory map
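The single-node ref-counting described earlier can be sketched as a map of ref-counted mutexes keyed by "bucket/object". Types are simplified for illustration; the real nsLockMap also tracks opsID and supports read locks.

```go
package main

import (
	"fmt"
	"sync"
)

type nsEntry struct {
	mu  sync.Mutex
	ref int
}

type lockMap struct {
	mu sync.Mutex
	m  map[string]*nsEntry
}

// acquire reuses the entry for a key if one exists (incrementing ref),
// then blocks on that entry's mutex.
func (l *lockMap) acquire(key string) *nsEntry {
	l.mu.Lock()
	e, ok := l.m[key]
	if !ok {
		e = &nsEntry{}
		l.m[key] = e
	}
	e.ref++
	l.mu.Unlock()
	e.mu.Lock()
	return e
}

// release unlocks and deletes the entry once the last holder is gone.
func (l *lockMap) release(key string, e *nsEntry) {
	e.mu.Unlock()
	l.mu.Lock()
	e.ref--
	if e.ref == 0 {
		delete(l.m, key) // cleanup: no holders left
	}
	l.mu.Unlock()
}

func main() {
	lm := &lockMap{m: map[string]*nsEntry{}}
	e := lm.acquire("my-bucket/photo.jpg")
	fmt.Println("entries while held:", len(lm.m))
	lm.release("my-bucket/photo.jpg", e)
	fmt.Println("entries after release:", len(lm.m))
}
```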

Key optimizations

Early read lock release — GetObjectNInfo releases the read lock as soon as metadata quorum is confirmed. For small inline objects, the lock is dropped before data is streamed to the client, minimizing lock hold time.

Async unlock — DRWMutex.Unlock() fires a background goroutine that retries until all peers acknowledge, so the caller is never blocked waiting on the network.

Overload protection — if a lock server has >1000 queued lock requests it rejects new ones immediately (fail-fast) rather than queuing indefinitely, preventing resource exhaustion.

Granularity

Locks are per-object, not per-disk or per-erasure-set. Multiple objects in the same bucket can be locked concurrently and independently — the erasure layer underneath coordinates parallel disk I/O without needing its own locks. Two requests only contend if they target the exact same object path.

Limitations


Quick Reference: Key Concepts

| Concept | Definition |
|---|---|
| Erasure Set | Group of disks across which objects are erasure coded |
| Server Pool | Collection of erasure sets; independent expansion unit |
| Consistent Hash | SipHash used to deterministically place objects |
| Read Quorum | Minimum shards needed to reconstruct an object (M data shards) |
| Write Quorum | Minimum disks that must acknowledge a write (M or M+1) |
| BitRot | Silent data corruption, detected via HighwayHash |
| xl.meta | Metadata file containing object info, stored on all disks |
| DataDir | UUID-named directory storing the object's data shard |
| Reed-Solomon | Erasure coding algorithm enabling data reconstruction |

Performance Characteristics


Real-World Metadata Examples

format.json from Actual Cluster

A 12-disk single erasure set cluster:

{
  "version": "1",
  "format": "xl",
  "id": "f9a7a6ba-39d9-4483-bb47-fe86518bdc67",
  "xl": {
    "version": "3",
    "this": "9ae64de8-1c75-46df-b09d-ad8b97f95313",
    "sets": [
      [
        "4199dbce-78ba-4176-846d-7423ab6cfcd9",
        "22b83b76-f883-49c8-abc8-a3cf84eb92f4",
        "9ae64de8-1c75-46df-b09d-ad8b97f95313",
        "fc1a7dde-1da7-44cc-9380-3ae3063c415c",
        "48d7881f-6e93-42ab-9d89-f27bf0648b0d",
        "b8cfec44-f88b-4193-9575-368d92eefb16",
        "ef66b6f7-3c15-45fa-aca8-52286f4750f4",
        "02b3aa13-ff62-4e46-a196-f40b6f531c23",
        "f5dd8d65-56d7-40f2-9035-b4b37e3018a5",
        "ae4e30fd-db65-4c0e-a9c1-44f50191ba20",
        "d4cf829c-b96f-4687-845c-8884a43a6397",
        "2efc58b9-253a-4ac6-ba92-a316811f896c"
      ]
    ],
    "distributionAlgo": "SIPMOD+PARITY"
  }
}

Explanation:

  - version / format: on-disk format version; "xl" is MinIO's erasure-coded backend
  - id: deployment ID, identical on every disk in the cluster
  - this: UUID of the disk holding this copy of format.json
  - sets: a single erasure set of 12 disk UUIDs
  - distributionAlgo: "SIPMOD+PARITY", i.e. SipHash of the object name modulo the number of sets selects the erasure set

Real xl.meta Example from Small File (65 bytes)

File content: "This is test data for xl.meta debugging with erasure coding EC:4"

{
  "Versions": [
    {
      "Header": {
        "EcM": 8,
        "EcN": 4,
        "Flags": 6,
        "ModTime": "2026-01-14T16:53:55.923264863+05:30",
        "Signature": "b9f71a0b",
        "Type": 1,
        "VersionID": "00000000000000000000000000000000"
      },
      "Idx": 0,
      "Metadata": {
        "Type": 1,
        "V2Obj": {
          "CSumAlgo": 1,
          "DDir": "NhHND1OVRfWzQYC/GFqfGA==",
          "EcAlgo": 1,
          "EcBSize": 1048576,
          "EcDist": [1,2,3,4,5,6,7,8,9,10,11,12],
          "EcIndex": 3,
          "EcM": 8,
          "EcN": 4,
          "ID": "AAAAAAAAAAAAAAAAAAAAAA==",
          "MTime": 1768389835923264863,
          "MetaSys": {
            "x-minio-internal-inline-data": "dHJ1ZQ=="
          },
          "MetaUsr": {
            "content-type": "text/plain",
            "etag": "eeb5a84d38f5dac272eb0d3f772c8a59"
          },
          "PartASizes": [65],
          "PartETags": null,
          "PartNums": [1],
          "PartSizes": [65],
          "Size": 65
        },
        "v": 1740736516
      }
    }
  ]
}

xl.meta from a Different Disk (EcIndex=7)

Decoding xl.meta from another disk in the same erasure set shows the same object but a different shard:

{
  "Versions": [
    {
      "Header": { "EcM": 8, "EcN": 4, "Type": 1, "VersionID": "00000000000000000000000000000000" },
      "Metadata": {
        "V2Obj": {
          "EcDist": [1,2,3,4,5,6,7,8,9,10,11,12],
          "EcIndex": 7,
          "EcM": 8,
          "EcN": 4,
          "MetaSys": { "x-minio-internal-inline-data": "dHJ1ZQ==" },
          "MetaUsr": { "content-type": "text/plain", "etag": "eeb5a84d38f5dac272eb0d3f772c8a59" },
          "Size": 65
        }
      }
    }
  ]
}
--- INLINE DATA ---
{
  "null": {
    "bitrot_valid": true,
    "bytes": 41,
    "data_base64": "b2RpbmcgRUM6",
    "data_string": "oding EC:"
  }
}

Notice EcIndex: 7 here versus EcIndex: 3 on the other disk: each disk holds a different shard of the same object. The data_string also differs ("oding EC:" vs "for xl.me"), confirming that each disk stores only its own slice.


Data Distribution Visualization

For the 65-byte file above split into EC:8+4:

┌────────────────────────────────────────────────────────────────┐
│                    Original File (~65 bytes)                   │
│                   "...for xl.me...oding EC:..."                │
├────────────────────────────────────────────────────────────────┤
│                                                                │
│  Erasure Split into 8 Data Shards + 4 Parity Shards:           │
│                                                                │
│  EcDist: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]               │
│           ├─────────────────────┤ ├──────────────┤             │
│           8 DATA shards          4 PARITY shards               │
│                                                                │
│  Disk EcIndex=3: Contains data shard 3 → "for xl.me"           │
│  Disk EcIndex=7: Contains data shard 7 → "oding EC:"           │
│                                                                │
│  Disks 9-12 (EcIndex 9,10,11,12): Parity shards (for recovery) │
└────────────────────────────────────────────────────────────────┘
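As a sanity check on the picture above, the per-shard size can be computed directly (shardSize is a hypothetical helper; the real layout also rounds to the Reed-Solomon block structure):

```go
package main

import "fmt"

// shardSize: each of the 8 data shards holds ceil(objectSize / dataShards)
// bytes, with the last shard zero-padded to full length.
func shardSize(objectSize, dataShards int) int {
	return (objectSize + dataShards - 1) / dataShards // ceiling division
}

func main() {
	const size, data, parity = 65, 8, 4
	per := shardSize(size, data) // 9 bytes, matching the 9-char slices above
	fmt.Printf("%d-byte object -> %d bytes per shard\n", size, per)
	fmt.Printf("stored: %d shards x %d B = %d B (%.1fx raw overhead)\n",
		data+parity, per, (data+parity)*per, float64(data+parity)/float64(data))
}
```

The 9-byte result agrees with the inline data seen on disk: "for xl.me" and "oding EC:" are both 9-character slices.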

Storing a Test File

# Create test bucket and upload file
kubectl exec minio-0 -- sh -c '
mc alias set local http://localhost:9000 $MINIO_ROOT_USER $MINIO_ROOT_PASSWORD --insecure 2>/dev/null
mc mb local/debug-bucket --insecure 2>/dev/null || true
echo "This is test data for xl.meta debugging with erasure coding EC:4" > /tmp/test-file.txt
mc cp /tmp/test-file.txt local/debug-bucket/test-file.txt --insecure
mc stat local/debug-bucket/test-file.txt --insecure
'

Expected Output:

Added `local` successfully.
Bucket created successfully `local/debug-bucket`.
`/tmp/test-file.txt` -> `local/debug-bucket/test-file.txt`
┌───────┬─────────────┬──────────┬────────────┐
│ Total │ Transferred │ Duration │ Speed      │
│ 65 B  │ 65 B        │ 00m00s   │ 1.13 KiB/s │
└───────┴─────────────┴──────────┴────────────┘
Name      : test-file.txt
Date      : 2026-01-14 11:23:55 UTC
Size      : 65 B
ETag      : eeb5a84d38f5dac272eb0d3f772c8a59
Type      : file
Metadata  :
  Content-Type: text/plain

On-Disk Structure

For small files (≤128KB), data is inlined in xl.meta:

debug-bucket/
└── test-file.txt/
    └── xl.meta              # Contains metadata + inline data

For larger files (>128KB):

/data1/testbucket/test-large-file.txt/
├── xl.meta                          # Metadata (on all 12 disks)
└── <DDir-UUID>/                     # Data directory
    └── part.1                       # Actual data shard for this disk
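The layout decision above can be sketched as a size check against the 128 KiB inline threshold. This is a simplification: the real check in xl-storage also considers versioning and the per-shard size, so treat the constant and helper as illustrative.

```go
package main

import "fmt"

// inlineThreshold mirrors the documented small-object cutoff.
const inlineThreshold = 128 << 10 // 128 KiB

// layout is a hypothetical helper returning the on-disk shape chosen
// for an object of the given size.
func layout(size int64) string {
	if size <= inlineThreshold {
		return "xl.meta (inline data)" // one file per disk, no DDir
	}
	return "xl.meta + <DDir-UUID>/part.N" // metadata plus data directory
}

func main() {
	fmt.Println(layout(65))       // the 65-byte test file: inlined
	fmt.Println(layout(10 << 20)) // 10 MiB: separate data directory
}
```

Inlining small objects avoids a second disk I/O per shard on both PUT and GET, since the data rides along with the metadata read.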

Healing Example


Ref: https://minio-docs.tf.fo/operations/concepts/healing


Ref: https://minio-docs.tf.fo/operations/data-recovery

Replication & Site-to-Site


Distributed Locking (dsync) in Action


Ref: https://blog.min.io/minio-dsync-a-distributed-locking-and-syncing-package-for-go/


References