实现一个跨区域高可用的 Weaviate 架构在 DigitalOcean 上的 CAP 约束实践

分布式架构

文章字数: 3.6k

阅读时长: 15 分

部署一个向量数据库集群的挑战，并不仅仅是启动几个实例。真正的难题在于，当业务要求服务必须跨越多个地理区域、容忍单点故障、并为全球用户提供低延迟访问时，我们必须直面分布式系统最核心的权衡：CAP 定理。我们的目标是为一套AI驱动的个性化推荐系统构建向量索引后端，该系统需要在 DigitalOcean 的北美（NYC3）、西海岸（SFO3）和欧洲（AMS3）三个区域同时提供服务。任何一个区域的云服务中断，都不能影响到全局服务的读写可用性。

这就把我们直接推到了 CAP 定理的十字路口。在一个存在网络分区（Partition Tolerance, P）的分布式系统中——这在跨区域部署中是必然的——我们必须在一致性（Consistency, C）和可用性（Availability, A）之间做出选择。

方案权衡：CP vs. AP 在向量搜索场景下的抉择

在设计阶段，我们评估了两种截然不同的架构策略，它们直接对应着 CAP 定理中的 CP 和 AP 路径。

方案 A：强一致性优先 (CP - Consistency/Partition Tolerance)

这个方案的核心是保证任何时刻从任何区域读取到的数据都是最新的。在 Weaviate 中，这可以通过将写操作的复制因子（Replication Factor）设置为 ALL 来实现。当一个写请求到达时，协调节点必须等待所有 N 个副本（在我们的案例中 N=3）都成功写入后，才能向客户端返回成功。

优势:
- 数据一致性得到强保证。对于金融交易或库存管理这类场景，这是必须的。
- 客户端逻辑相对简单，无需处理数据可能存在的滞后问题。
劣势:
- 可用性极低: 只要三个区域中的任何一个节点失联或写入失败，整个集群的写操作就会被阻塞。对于跨越大洋的广域网来说，网络抖动和延迟是常态，这意味着写失败的概率会显著增加。
- 写延迟极高: 一次写操作的延迟取决于最慢的那个副本，即跨大西洋网络往返的延迟（RTT）将成为常态。

在我们的推荐系统场景中，一个用户无法看到刚刚发布的、但相关性不高的商品，其业务影响远小于用户刷新页面时看到服务错误。因此，牺牲可用性来换取这种程度的强一致性，是得不偿వ失的。方案 A 被否决。

方案 B：高可用性优先 (AP - Availability/Partition Tolerance)

该方案以保证服务的最大可用性为目标。即使部分节点或整个区域不可用，系统仍应能正常处理读写请求。在 Weaviate 中，这可以通过将写操作的复制因子设置为 QUORUM (法定人数) 来实现。对于一个有3个副本的集群，QUORUM 意味着协调节点只需等待 (3/2)+1=2 个副本写入成功，即可向客户端返回成功。

优势:
- 高可用性: 只要至少有两个区域的节点是健康的，写操作就能成功。系统可以容忍单个区域的完全故障。
- 较低的写延迟: 写操作的延迟取决于最快的两个副本的响应时间，而不是最慢的那个。
劣势:
- 最终一致性: 在一个写操作被确认后，剩下的那个副本（可能位于网络延迟最高的区域）的数据更新会存在一个时间窗口。在此期间，如果读请求路由到了这个尚未同步的节点，客户端就会读到旧数据。
- 客户端复杂性: 客户端或上游业务逻辑需要能够容忍或处理这种潜在的数据不一致。

对于个性化推荐，用户对秒级的数据延迟并不敏感。因此，方案 B (AP) 完美契合我们的业务需求。我们选择了这条路径，并着手设计具体的实现。

核心实现：基于 Terraform 和 Docker Compose 的三区域部署

我们的目标是自动化、可重复地构建这套基础设施。Terraform 是基础设施即代码（IaC）的不二之选，而 Docker Compose 则简化了在每个 Droplet 上对 Weaviate 服务的管理。

1. 基础设施定义 (Terraform)

我们将基础设施分为几个模块：网络、计算节点、防火墙和负载均衡器。

# main.tf

terraform {
  required_providers {
    digitalocean = {
      source  = "digitalocean/digitalocean"
      version = "~> 2.0"
    }
  }
}

variable "do_token" {}
variable "regions" {
  type    = list(string)
  default = ["nyc3", "sfo3", "ams3"]
}

provider "digitalocean" {
  token = var.do_token
}

# 为每个区域创建一个VPC，实现网络隔离
resource "digitalocean_vpc" "weaviate_vpc" {
  for_each = toset(var.regions)
  name     = "weaviate-vpc-${each.key}"
  region   = each.key
}

# 在每个VPC中创建Droplet
resource "digitalocean_droplet" "weaviate_node" {
  for_each = toset(var.regions)
  image    = "docker-20-04"
  name     = "weaviate-node-${each.key}"
  region   = each.key
  size     = "s-4vcpu-8gb" # 生产环境应选择内存优化型
  vpc_uuid = digitalocean_vpc.weaviate_vpc[each.key].id
  ssh_keys = [data.digitalocean_ssh_key.main.id]

  # 初始化的 User Data 脚本，用于安装Docker Compose
  user_data = <<-EOF
              #!/bin/bash
              apt-get update
              apt-get install -y docker-compose
              EOF
}

# 定义防火墙规则，只允许必要的端口
resource "digitalocean_firewall" "weaviate_fw" {
  name = "weaviate-cluster-firewall"

  # 将防火墙应用到所有节点
  droplet_ids = [for droplet in digitalocean_droplet.weaviate_node : droplet.id]

  # Weaviate 节点间通信
  inbound_rule {
    protocol         = "tcp"
    port_range       = "7070-7071" # Gossip ports for clustering
    source_addresses = [for droplet in digitalocean_droplet.weaviate_node : droplet.ipv4_address]
  }

  # Weaviate gRPC 端口
  inbound_rule {
    protocol         = "tcp"
    port_range       = "50051"
    source_addresses = ["0.0.0.0/0", "::/0"] # 生产环境应限制为LB的IP
  }
  
  # Weaviate REST API 端口
  inbound_rule {
    protocol         = "tcp"
    port_range       = "8080"
    source_addresses = ["0.0.0.0/0", "::/0"] # 生产环境应限制为LB的IP
  }
  
  # 允许 SSH
  inbound_rule {
    protocol         = "tcp"
    port_range       = "22"
    source_addresses = ["YOUR_OFFICE_IP/32"] # 限制SSH访问源
  }

  outbound_rule {
    protocol              = "tcp"
    port_range            = "1-65535"
    destination_addresses = ["0.0.0.0/0", "::/0"]
  }

  outbound_rule {
    protocol              = "udp"
    port_range            = "1-65535"
    destination_addresses = ["0.0.0.0/0", "::/0"]
  }
}

# 全局负载均衡器
resource "digitalocean_loadbalancer" "public_lb" {
  name   = "weaviate-global-lb"
  region = "nyc3" # LB本身需要一个区域，但其流量会转发到所有健康检查通过的区域

  forwarding_rule {
    entry_port     = 8080
    entry_protocol = "tcp"
    target_port    = 8080
    target_protocol= "tcp"
  }

  healthcheck {
    port     = 8080
    protocol = "http"
    path     = "/v1/.well-known/ready"
  }

  droplet_ids = [for droplet in digitalocean_droplet.weaviate_node : droplet.id]
}

data "digitalocean_ssh_key" "main" {
  name = "your-ssh-key-name"
}

# 输出负载均衡器IP和各节点IP
output "loadbalancer_ip" {
  value = digitalocean_loadbalancer.public_lb.ip
}

output "node_ips" {
  value = { for key, droplet in digitalocean_droplet.weaviate_node : key => droplet.ipv4_address }
}

这段 Terraform 代码会创建三个位于不同区域的 Droplet，每个都在自己的 VPC 中，并通过防火墙规则允许它们之间进行集群通信。全局负载均衡器会根据健康检查将流量路由到最近的健康节点。

2. Weaviate 集群配置 (Docker Compose)

在每个 Droplet 上，我们使用以下 docker-compose.yml 来启动 Weaviate。这里的关键在于环境变量的配置，特别是 CLUSTER_JOIN。

# docker-compose.yml
version: '3.4'
services:
  weaviate:
    image: semitechnologies/weaviate:1.23.7
    restart: on-failure:0
    ports:
     - "8080:8080"
     - "50051:50051"
    environment:
      QUERY_DEFAULTS_LIMIT: 25
      AUTHENTICATION_ANONYMOUS_ACCESS_ENABLED: 'true'
      PERSISTENCE_DATA_PATH: '/var/lib/weaviate'
      DEFAULT_VECTORIZER_MODULE: 'none'
      ENABLE_MODULES: ''
      # --- 集群配置关键部分 ---
      CLUSTER_HOSTNAME: 'node1' # 在每个节点上需要是唯一的: node1, node2, node3
      CLUSTER_GOSSIP_BIND_PORT: '7070'
      CLUSTER_DATA_BIND_PORT: '7071'
      # CLUSTER_JOIN: '10.10.1.2:7070,10.10.2.3:7070' # 这里需要填写其他节点的私有IP
      # 例如, 在 nyc3 节点上，CLUSTER_JOIN 应为 sfo3 和 ams3 节点的IP
      # 生产中，这应该通过配置管理工具（如Ansible）动态生成
      AUTOSCHEMA_ENABLED: 'false'
      CLUSTER_RAFT_INTERNAL_RPC_PORT: '7072'
      CLUSTER_RAFT_VOTER: 'true' # 确保所有节点都是投票成员
    volumes:
      - ./weaviate_data:/var/lib/weaviate

手动管理 CLUSTER_JOIN 环境变量在生产中是不可靠的。一个更健壮的方法是使用 Ansible 或类似的工具，在 Terraform 应用之后，获取所有节点的 IP 地址，然后为每个节点生成并分发定制化的 docker-compose.yml 文件。

3. 架构可视化

部署后的架构可以用 Mermaid 图清晰地表示出来。

graph TD
    subgraph "Internet"
        User[Global Users]
    end

    subgraph "DigitalOcean Global Network"
        LB[Global Load Balancer
lb.yourdomain.com]
    end

    User --> LB

    subgraph "Region: NYC3"
        VPC1[VPC NYC3]
        Node1[Weaviate Node 1
Docker]
        LB -- Health Check & Traffic --> Node1
        VPC1 -- Peering/Firewall --> VPC2
        VPC1 -- Peering/Firewall --> VPC3
        VPC1 --- Node1
    end

    subgraph "Region: SFO3"
        VPC2[VPC SFO3]
        Node2[Weaviate Node 2
Docker]
        LB -- Health Check & Traffic --> Node2
        VPC2 -- Peering/Firewall --> VPC1
        VPC2 -- Peering/Firewall --> VPC3
        VPC2 --- Node2
    end
    
    subgraph "Region: AMS3"
        VPC3[VPC AMS3]
        Node3[Weaviate Node 3
Docker]
        LB -- Health Check & Traffic --> Node3
        VPC3 -- Peering/Firewall --> VPC1
        VPC3 -- Peering/Firewall --> VPC2
        VPC3 --- Node3
    end

    Node1 <-->|Gossip Protocol
7070/7071| Node2
    Node1 <-->|Gossip Protocol
7070/7071| Node3
    Node2 <-->|Gossip Protocol
7070/7071| Node3

    style LB fill:#f9f,stroke:#333,stroke-width:2px

客户端鲁棒性设计：处理最终一致性

选择了 AP 架构，就意味着必须在客户端层面为最终一致性做好准备。我们使用 Python 客户端，并实现了一套健壮的连接和操作逻辑。

import weaviate
import weaviate.classes as wvc
from weaviate.exceptions import UnexpectedStatusCodeException
import time
import logging
import uuid
import os

# 配置日志
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')

# --- 核心配置 ---
# 生产环境中，这个地址应指向我们的全局负载均衡器
WEAVIATE_URL = os.getenv("WEAVIATE_URL", "http://your-load-balancer-ip:8080")
COLLECTION_NAME = "ProductRecommendations"

def get_client() -> weaviate.WeaviateClient:
    """
    创建一个健壮的Weaviate客户端。
    - 包含重试机制。
    - 设定合理的超时。
    """
    try:
        client = weaviate.connect_to_local(
            host=WEAVIATE_URL.replace("http://", "").split(":")[0],
            port=int(WEAVIATE_URL.split(":")[2]),
            # 超时设置至关重要，特别是对于跨区域操作
            startup_period=10, 
        )
        client.connect()
        logging.info(f"Successfully connected to Weaviate at {WEAVIATE_URL}")
        return client
    except Exception as e:
        logging.error(f"Failed to connect to Weaviate: {e}")
        raise

def setup_collection(client: weaviate.WeaviateClient):
    """
    创建或验证集合存在，并采用高可用性复制策略。
    """
    if client.collections.exists(COLLECTION_NAME):
        logging.warning(f"Collection '{COLLECTION_NAME}' already exists. Skipping creation.")
        return

    logging.info("Creating collection with QUORUM replication...")
    client.collections.create(
        name=COLLECTION_NAME,
        vectorizer_config=wvc.config.Configure.Vectorizer.none(),
        properties=[
            wvc.config.Property(name="product_id", data_type=wvc.config.DataType.TEXT),
            wvc.config.Property(name="description", data_type=wvc.config.DataType.TEXT),
        ],
        # --- CAP 权衡的核心在此 ---
        replication_config=wvc.config.Configure.replication(factor=3) # 我们有3个节点
    )
    logging.info("Collection created successfully.")


def resilient_insert(client: weaviate.WeaviateClient, data: dict, vector: list):
    """
    带有重试和错误处理的插入逻辑。
    使用 QUORUM 一致性级别进行写入。
    """
    collection = client.collections.get(COLLECTION_NAME)
    max_retries = 3
    retry_delay = 2  # seconds

    for attempt in range(max_retries):
        try:
            obj_uuid = collection.with_consistency_level(wvc.ConsistencyLevel.QUORUM).data.insert(
                properties=data,
                vector=vector,
                uuid=uuid.uuid4()
            )
            logging.info(f"Successfully inserted object with UUID: {obj_uuid}")
            return obj_uuid
        except UnexpectedStatusCodeException as e:
            # 5xx 错误码通常是服务端临时问题，值得重试
            if 500 <= e.status_code < 600:
                logging.warning(f"Attempt {attempt + 1}/{max_retries} failed with status {e.status_code}. Retrying in {retry_delay}s...")
                time.sleep(retry_delay)
            else:
                # 4xx 错误通常是客户端问题，无需重试
                logging.error(f"Client-side error inserting data: {e}")
                raise
        except Exception as e:
            logging.error(f"An unexpected error occurred during insertion: {e}")
            time.sleep(retry_delay)
    
    raise Exception("Failed to insert data after multiple retries.")


def read_with_consistency_awareness(client: weaviate.WeaviateClient, obj_uuid: uuid.UUID):
    """
    读取数据并演示如何处理潜在的陈旧读。
    """
    collection = client.collections.get(COLLECTION_NAME)

    # 场景1: 默认读取 (ONE or QUORUM,取决于服务器配置)，可能读到旧数据
    logging.info(f"Attempting a standard read for {obj_uuid} (potentially stale)...")
    try:
        # 默认级别是 ONE，最快但最可能读到旧数据
        obj = collection.with_consistency_level(wvc.ConsistencyLevel.ONE).query.fetch_object_by_id(obj_uuid)
        if obj:
            logging.info(f"Standard read successful: {obj.properties}")
        else:
            # 这是一个关键场景：写入成功(QUORUM)，但读(ONE)的节点还没同步到
            logging.warning(f"Standard read could not find object {obj_uuid}. This might be due to replication lag.")
    except Exception as e:
        logging.error(f"Error during standard read: {e}")

    # 场景2: 需要强一致性读 (例如，在更新操作后立即验证)
    logging.info(f"Attempting a strong consistency read for {obj_uuid} (ALL)...")
    try:
        # 使用 ALL 会强制查询所有副本，确保数据最新，但延迟更高且可用性更低
        obj_strong = collection.with_consistency_level(wvc.ConsistencyLevel.ALL).query.fetch_object_by_id(obj_uuid)
        if obj_strong:
            logging.info(f"Strong read successful: {obj_strong.properties}")
        else:
            # 如果连 ALL 都读不到，说明写入确实可能失败了
            logging.error(f"CRITICAL: Strong read failed to find object {obj_uuid}. Investigate write operation.")

    except Exception as e:
        # 在一个节点宕机时，ALL 级别的读取必然会失败
        logging.error(f"Error during strong read (this is expected if a node is down): {e}")


if __name__ == "__main__":
    # 单元测试思路:
    # 1. MOCK `weaviate.connect_to_local` 来模拟连接成功/失败。
    # 2. MOCK `collection.data.insert`，使其在第一次调用时抛出 503 错误，验证 `resilient_insert` 的重试逻辑。
    # 3. MOCK `collection.query.fetch_object_by_id`，模拟 ONE 级别返回 None，而 ALL 级别返回对象，以测试 `read_with_consistency_awareness` 的逻辑。
    
    try:
        client = get_client()
        setup_collection(client)
        
        # 模拟一个产品向量
        sample_data = {"product_id": "SKU12345", "description": "A high-performance gadget."}
        sample_vector = [0.1] * 1536 # 假设使用1536维的向量

        new_uuid = resilient_insert(client, sample_data, sample_vector)
        
        # 在插入后立即读取，这是最能体现最终一致性影响的时刻
        if new_uuid:
            logging.info("--- Waiting 5 seconds for replication to propagate ---")
            time.sleep(5)
            read_with_consistency_awareness(client, new_uuid)

        client.close()
    except Exception as e:
        logging.critical(f"Main execution failed: {e}")

这段代码展示了一个生产级的客户端应该具备的特性：明确的连接管理、错误处理、重试逻辑，以及最重要的——对一致性级别的显式控制。resilient_insert 函数使用 QUORUM 保证写入的高可用性，而 read_with_consistency_awareness 函数则演示了如何在需要时通过 ALL 级别请求强一致性读，或如何理解 ONE 级别读可能遇到的数据滞后。

当前方案的局限性与未来迭代路径

这套基于 DigitalOcean Droplets 和 Weaviate 原生集群的架构，虽然实现了跨区域高可用的核心目标，但并非没有局限性。首先，节点的扩缩容是手动过程，需要调整 Terraform 配置并重新应用，无法根据负载自动伸缩。其次，服务发现依赖于静态 IP 地址配置，节点的替换和维护流程较为繁琐。

未来的迭代方向很明确。第一步是将此架构迁移到 DigitalOcean Kubernetes (DOKS)。通过为 Weaviate 构建一个 Kubernetes Operator，我们可以实现状态的声明式管理、自动化故障恢复和节点替换。结合 KEDA (Kubernetes Event-driven Autoscaling)，可以根据查询 QPS 或 CPU 使用率等指标自动水平扩展 Weaviate 节点，从而构建一个真正弹性的向量搜索平台。此外，在 DOKS 环境中，可以利用 CoreDNS 等内置服务发现机制，彻底摆脱静态 IP 的束缚。这种演进将进一步提高系统的自动化程度和运维效率，但其底层的 CAP 权衡与高可用设计原则将保持不变。

Weaviate CAP Theorem Terraform DigitalOcean 分布式数据库

使用 Tekton 与 gRPC 实现一个管理 Flink 分布式事务的 Kubernetes Operator

2023-10-27 分布式系统

Two-Phase Commit (2PC) Tekton Apache Flink NATS gRPC

构建基于 ISR 保证消息必达的云原生 Server-Sent Events 推送网关

2023-10-27 分布式架构

Server-Sent Events (SSE) API Gateway SkyWalking ISR Terraform