Components
Ref: https://gear.hermygong.com/p/seaweeds/
Blob storage
Other Blobstore operations
Write
Read
File Storage
Filer Architecture
Ref: https://www.a-programmer.top/2021/06/19/SeaweedFS%E5%88%9D%E6%8E%A2/
Filer Store Data Model
Volume Server
Volume
In SeaweedFS, a volume is one large file that packs many small files. When a master server starts, it sets the maximum volume file size, 30GB by default (see: -volumeSizeLimitMB). On the volume server side, the -max flag (default 8) caps how many volumes that server can hold.
Each volume has its own TTL and replication.
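Each small file inside a volume is addressed by a file id such as 3,01637037d6: the volume id, then a hex needle key followed by a fixed 4-byte cookie (the example fid is from the SeaweedFS README). A minimal parsing sketch, assuming that layout; parseFid is our own helper, not the weed source:

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseFid splits a SeaweedFS file id such as "3,01637037d6" into its
// volume id, needle key, and 32-bit cookie. The hex part after the comma
// is the key bytes followed by a fixed 8-hex-digit cookie.
func parseFid(fid string) (volumeID uint32, key uint64, cookie uint32, err error) {
	parts := strings.SplitN(fid, ",", 2)
	if len(parts) != 2 || len(parts[1]) <= 8 {
		return 0, 0, 0, fmt.Errorf("malformed fid: %q", fid)
	}
	vid, err := strconv.ParseUint(parts[0], 10, 32)
	if err != nil {
		return 0, 0, 0, err
	}
	hexPart := parts[1]
	keyHex, cookieHex := hexPart[:len(hexPart)-8], hexPart[len(hexPart)-8:]
	k, err := strconv.ParseUint(keyHex, 16, 64)
	if err != nil {
		return 0, 0, 0, err
	}
	c, err := strconv.ParseUint(cookieHex, 16, 32)
	if err != nil {
		return 0, 0, 0, err
	}
	return uint32(vid), k, uint32(c), nil
}

func main() {
	vid, key, cookie, _ := parseFid("3,01637037d6")
	fmt.Printf("volume=%d key=%d cookie=%x\n", vid, key, cookie)
	// prints: volume=3 key=1 cookie=637037d6
}
```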
Ref: https://github.com/seaweedfs/seaweedfs/wiki/Components
Volume Files Structure
Ref: https://github.com/seaweedfs/seaweedfs/wiki/Volume-Files-Structure
UI
Architecture
Design Philosophy
High Availability
How Raft Is Used in the Master
- Leader Election: Multiple master servers form a Raft cluster to elect a leader. Only the leader can assign new volume IDs.
- Volume ID Assignment: When a new volume needs to be created, the leader:
- Gets the current max volume ID
- Increments it
- Replicates this new max via Raft to ensure all masters agree
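The increment-then-replicate scheme above can be sketched as an in-memory toy. All names are ours, and the Raft AppendEntries round is collapsed into a direct apply call:

```go
package main

import (
	"fmt"
	"sync"
)

// master holds the one piece of Raft-replicated state we care about here:
// the current maximum volume id.
type master struct {
	mu    sync.Mutex
	maxID uint32
}

// apply is a greatly simplified stand-in for applying a Raft log entry.
func (m *master) apply(id uint32) {
	m.mu.Lock()
	defer m.mu.Unlock()
	if id > m.maxID {
		m.maxID = id
	}
}

// nextVolumeID runs on the leader: bump the counter, then replicate the
// new maximum to every peer before handing the id to the caller, so all
// masters agree on the max even if the leader crashes afterwards.
func nextVolumeID(leader *master, followers []*master) uint32 {
	leader.mu.Lock()
	leader.maxID++
	id := leader.maxID
	leader.mu.Unlock()
	for _, f := range followers {
		f.apply(id) // stands in for a Raft replication round
	}
	return id
}

func main() {
	leader := &master{}
	followers := []*master{{}, {}}
	for i := 0; i < 3; i++ {
		fmt.Println(nextVolumeID(leader, followers))
	}
	fmt.Println(followers[0].maxID) // followers agree with the leader: 3
}
```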
The master also manages all of these:
- Assign file ID: leader only (proxied if needed)
- Volume creation: leader only
- NextVolumeId (Raft write): leader only + barrier
- Volume lookup: leader uses local topology; non-leaders can answer too
- Client connections: any master, but redirected to the leader
Replication
Erasure Coding
- SeaweedFS uses Reed-Solomon erasure coding with a default 10+4 scheme (10 data shards + 4 parity shards = 14 total shards).
- This allows you to lose up to 4 volume servers and still recover your data.
- Only volumes with a fullness ratio of 80% or higher are erasure coded (the threshold is configurable)
/*
Steps to apply erasure coding to .dat .idx files
0. ensure the volume is readonly
1. client call VolumeEcShardsGenerate to generate the .ecx and .ec00 ~ .ec13 files
2. client ask master for possible servers to hold the ec files
3. client call VolumeEcShardsCopy on above target servers to copy ec files from the source server
4. target servers report the new ec files to the master
5. master stores vid -> [14]*DataNode
6. client checks master. If all 14 slices are ready, delete the original .dat, .idx files
*/
Ref: seaweedfs/weed/server/volume_grpc_erasure_coding.go
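Steps 5 and 6 above can be sketched as a completeness check over the vid -> [14]*DataNode mapping. Types and names here are simplified stand-ins, not the weed source:

```go
package main

import "fmt"

// DataNode is a simplified stand-in for the master's per-server record.
type DataNode struct{ addr string }

const totalShards = 14 // 10 data + 4 parity

// allShardsReady reports whether every one of the 14 ec shard slots has a
// known holder; only then is it safe to delete the original .dat/.idx files.
func allShardsReady(locations [totalShards]*DataNode) bool {
	for _, dn := range locations {
		if dn == nil {
			return false
		}
	}
	return true
}

func main() {
	var locs [totalShards]*DataNode
	for i := 0; i < totalShards-1; i++ {
		locs[i] = &DataNode{addr: fmt.Sprintf("vol-server-%d:8080", i%5)}
	}
	fmt.Println(allShardsReady(locs)) // prints false: one shard still missing
	locs[totalShards-1] = &DataNode{addr: "vol-server-4:8080"}
	fmt.Println(allShardsReady(locs)) // prints true
}
```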
S3 changes
Ref: SeaweedFS S3 API in 2025: Enterprise‑grade security and control - Chris Lu, SeaweedFS KubeCon
This change lets the S3 data path skip the filer: https://github.com/seaweedfs/seaweedfs/pull/7481
- Check weed/s3api/s3api_object_handlers_put.go: previously the S3 API proxied uploads through the filer (proxyReq, err := http.NewRequest(http.MethodPut, uploadUrl, body)); now it talks to the volume server directly.
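A minimal sketch of that before/after difference. The helper names, URLs, and fid are hypothetical illustrations; only the http.NewRequest call pattern comes from the actual file:

```go
package main

import (
	"bytes"
	"fmt"
	"net/http"
)

// putViaFiler: the old path, where every PUT body was proxied through the
// filer's HTTP endpoint.
func putViaFiler(filerURL, path string, body []byte) (*http.Request, error) {
	return http.NewRequest(http.MethodPut, filerURL+path, bytes.NewReader(body))
}

// putDirect: the new path, where the S3 server asks the filer (via gRPC,
// not shown) only for a fid + volume URL, then sends the bytes straight
// to the volume server.
func putDirect(volumeURL, fid string, body []byte) (*http.Request, error) {
	return http.NewRequest(http.MethodPost, volumeURL+"/"+fid, bytes.NewReader(body))
}

func main() {
	r1, _ := putViaFiler("http://filer:8888", "/bucket/key", []byte("data"))
	r2, _ := putDirect("http://volume:8080", "3,01637037d6", []byte("data"))
	fmt.Println(r1.URL, r2.URL)
	// prints: http://filer:8888/bucket/key http://volume:8080/3,01637037d6
}
```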
Change
flowchart TB
subgraph OLD["Before v4.01"]
direction TB
C1[S3 Client] --> S1[S3 API Server]
S1 -->|"HTTP proxy<br/>ALL data + metadata"| F1[Filer]
F1 -->|"Read/Write data"| V1[Volume Server]
style F1 fill:#cc4444,color:#fff
style S1 fill:#4466aa,color:#fff
end
subgraph NEW["NEW Architecture (v4.01+)"]
direction TB
C2[S3 Client] --> S2[S3 API Server]
S2 -->|"gRPC<br/>metadata only"| F2[Filer]
S2 -->|"HTTP direct<br/>data streaming"| V2[Volume Server]
style F2 fill:#44aa66,color:#fff
style S2 fill:#4466aa,color:#fff
style V2 fill:#44aa66,color:#fff
end
Write Path
sequenceDiagram
participant Client as S3 Client
participant S3API as S3 API Server
participant Filer as Filer (gRPC)
participant Volume as Volume Server
Note over Client,Volume: PUT Object - Direct Volume Access
Client->>S3API: PUT /bucket/key (data)
rect rgba(144, 238, 144, 0.3)
Note right of S3API: Step 1: Get Volume Assignment
S3API->>Filer: AssignVolume (gRPC)
Filer-->>S3API: {volumeId, fileId, url, JWT}
end
rect rgba(173, 216, 230, 0.3)
Note right of S3API: Step 2: Upload Data DIRECTLY
loop For each 8MB chunk
S3API->>Volume: POST http://volume:8080/{fid} (chunk data + JWT)
Volume-->>S3API: {size, eTag, fid}
end
end
rect rgba(255, 182, 193, 0.3)
Note right of S3API: Step 3: Save Metadata Only
S3API->>Filer: CreateEntry (gRPC)
Note over Filer: Stores: chunks[], size,<br/>ETag, SSE metadata,<br/>user metadata, etc.
Filer-->>S3API: OK
end
S3API-->>Client: 200 OK + ETag
Read Path
sequenceDiagram
participant Client as S3 Client
participant S3API as S3 API Server
participant Filer as Filer (gRPC)
participant Volume as Volume Server
Note over Client,Volume: GET Object - Direct Volume Access
Client->>S3API: GET /bucket/key
rect rgba(255, 182, 193, 0.3)
Note right of S3API: Step 1: Fetch Metadata Only
S3API->>Filer: LookupDirectoryEntry (gRPC)
Filer-->>S3API: Entry{chunks[], size, attrs, extended}
end
rect rgba(144, 238, 144, 0.3)
Note right of S3API: Step 2: Resolve Volume URLs
Note over S3API: Uses FilerClient's<br/>cached vidMap<br/>(no gRPC per chunk!)
S3API->>S3API: lookupFileIdFn(volumeId)
end
rect rgba(173, 216, 230, 0.3)
Note right of S3API: Step 3: Stream Data DIRECTLY
S3API->>Volume: GET http://volume:8080/{fid} + JWT
Volume-->>S3API: chunk data (streaming)
S3API-->>Client: data (streaming passthrough)
end
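The cached-vidMap idea in step 2 can be sketched as a plain map with a stubbed fallback lookup. The real FilerClient cache is more involved (replicas, refresh), so this only shows the cache-miss accounting that avoids a gRPC call per chunk:

```go
package main

import "fmt"

// vidMap caches volume-id -> volume-server-URL resolutions; lookup is a
// stand-in for the gRPC filer/master round trip taken only on a miss.
type vidMap struct {
	cache  map[uint32]string
	lookup func(uint32) string
	misses int
}

func (m *vidMap) urlFor(vid uint32) string {
	if url, ok := m.cache[vid]; ok {
		return url // cache hit: no network round trip
	}
	m.misses++
	url := m.lookup(vid)
	m.cache[vid] = url
	return url
}

func main() {
	m := &vidMap{
		cache:  map[uint32]string{},
		lookup: func(vid uint32) string { return fmt.Sprintf("http://volume-%d:8080", vid%3) },
	}
	// Five chunk reads touching two distinct volumes.
	for _, vid := range []uint32{3, 3, 7, 3, 7} {
		_ = m.urlFor(vid)
	}
	fmt.Println(m.misses) // prints 2: one lookup per distinct volume
}
```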
How Is a Large File Written to S3?
sequenceDiagram
autonumber
participant C as Client
participant S3 as S3 API Server
participant F as Filer
participant M as Master
participant V1 as Volume Server 1
participant V2 as Volume Server 2
Note over V1,M: Periodically send heartbeats to Master
C->>S3: PUT /bucket/object (4GB stream)
Note over S3: Stream data, chunk in 8MB buffers<br/>(max 4 buffers = 32MB)
rect rgb(240, 248, 255)
Note over S3,V2: Repeat for each 8MB chunk (streaming)
S3->>F: AssignVolume (gRPC)
F->>M: Assign request
M-->>F: Return Fid + Volume URL
F-->>S3: Return Fid + Volume URL
S3->>V1: POST chunk data (HTTP)
V1->>V2: Replicate
Note over V1,V2: Strong consistency:<br/>response after replication completes
V2-->>V1: Ack
V1-->>S3: Return chunk size + ETag
end
Note over S3: All chunks uploaded,<br/>collect FileChunk metadata
S3->>F: CreateEntry (gRPC)<br/>(path, chunks[], attributes)
F-->>S3: Success
S3-->>C: 200 OK + ETag