SolrCloud (Solr 4) Search Engine System Architecture
New SolrCloud Design
This was a scratchpad for ideas and was not used as is - this page does not describe the SolrCloud design as it was implemented and this page is only around for historical reasons.
Even so, it explains the underlying design ideas quite clearly.
What is SolrCloud?
SolrCloud is an enhancement to the existing Solr to manage and operate Solr as a search service in a cloud.
Glossary
- Cluster : A cluster is a set of Solr nodes managed as a single unit. The entire cluster must have a single schema and solrconfig.
- Node : A JVM instance running Solr.
- Partition : A partition is a subset of the entire document collection. A partition is created in such a way that all its documents can be contained in a single index.
- Shard : A partition needs to be stored in multiple nodes as specified by the replication factor. All these nodes collectively form a shard. A node may be a part of multiple shards.
- Leader : Each shard has one node identified as its leader. All the writes for documents belonging to a partition should be routed through the leader.
- Replication Factor : Minimum number of copies of a document maintained by the cluster.
- Transaction Log : An append-only log of write operations maintained by each node.
- Partition version : A counter maintained by the leader of each shard, incremented on each write operation and sent to the peers.
- Cluster Lock : A global lock which must be acquired in order to change the range -> partition or the partition -> node mappings.
Guiding Principles
- Any operation can be invoked on any node in the cluster.
- No non-recoverable single points of failure
- Cluster should be elastic
- Writes must never be lost, i.e. durability is guaranteed
- Order of writes should be preserved
- If two clients send document "A" to two different replicas at the same time, one should consistently "win" on all replicas.
- Cluster configuration should be managed centrally and can be updated through any node in the cluster. No per-node configuration should be required other than local values such as the port and index/log storage locations
- Automatic failover of reads
- Automatic failover of writes
- Automatically honour the replication factor in the event of a node failure
Zookeeper
A ZooKeeper cluster is used as:
- The central configuration store for the cluster
- A co-ordinator for operations requiring distributed synchronization
- The system-of-record for cluster topology
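As a rough illustration of the first two roles, the sketch below uses the standard ZooKeeper Java client to read the shared configuration and watch the cluster topology. The znode paths (/solrcloud/config/..., /solrcloud/nodes) and the session timeout are assumptions for illustration, not the layout SolrCloud actually uses.

import java.util.List;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

public class ClusterStateReader {
    private final ZooKeeper zk;

    public ClusterStateReader(String zkHosts) throws Exception {
        // Session timeout and the no-op connection watcher are illustrative values only.
        this.zk = new ZooKeeper(zkHosts, 30_000, event -> { });
    }

    // Read the centrally stored configuration blob (hypothetical znode path).
    public byte[] readSharedConfig() throws KeeperException, InterruptedException {
        return zk.getData("/solrcloud/config/solrconfig.xml", false, new Stat());
    }

    // List live nodes and register a watcher so topology changes trigger a callback.
    public List<String> liveNodes(Watcher onChange) throws KeeperException, InterruptedException {
        return zk.getChildren("/solrcloud/nodes", onChange);
    }
}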
Partitioning
The cluster is configured with a fixed max_hash_value ‘N’ (which is set to a fairly large value, say 1000). Each document’s hash is calculated as:
hash = hash_function(doc.getKey()) % N
Ranges of hash values are assigned to partitions and stored in Zookeeper. For example, we may have a range to partition mapping as follows:
range     : partition
------      ---------
0   - 99  : 1
100 - 199 : 2
200 - 299 : 3
The hash is added as an indexed field in the doc and it is immutable. This may also be used during an index split.
The hash function is pluggable. It can accept a document and return a consistent & positive integer hash value. The system provides a default hash function which uses the content of a configured, required & immutable field (default is unique_key field) to calculate hash values.
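A minimal sketch of this scheme, assuming the default unique-key based hash and the example range table above (in the real design the hash function is pluggable and the mapping lives in ZooKeeper):

import java.util.NavigableMap;
import java.util.TreeMap;

public class HashPartitioner {
    private static final int MAX_HASH_VALUE = 1000;           // the configured 'N'
    // Range start -> partition id; stored in ZooKeeper in the real design.
    private final NavigableMap<Integer, Integer> rangeToPartition = new TreeMap<>();

    public HashPartitioner() {
        rangeToPartition.put(0, 1);    // 0   - 99  -> partition 1
        rangeToPartition.put(100, 2);  // 100 - 199 -> partition 2
        rangeToPartition.put(200, 3);  // 200 - 299 -> partition 3
    }

    // Default hash: derived from the unique key, consistent and non-negative.
    public int hash(String uniqueKey) {
        return Math.floorMod(uniqueKey.hashCode(), MAX_HASH_VALUE);
    }

    // Look up the partition whose range covers the document's hash.
    public int partitionFor(String uniqueKey) {
        return rangeToPartition.floorEntry(hash(uniqueKey)).getValue();
    }
}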
Using full hash range
Alternatively, there need not be any max_hash_value - the full 32 bits of the hash can be used since each shard will have a range of hash values anyway. Avoiding a configurable max_hash_value makes things easier for clients wanting related hash values next to each other. For example, in an email search application, one could construct a hashcode as follows:
(hash(user_id)<<24) | (hash(message_id)>>>8)
By deriving the top 8 bits of the hashcode from the user_id, it guarantees that any user's emails are in the same 1/256th portion of the cluster. At search time, this information can be used to only query that portion of the cluster.
We probably also want to view the hash space as a ring (as is done with consistent hashing) in order to express ranges that wrap (cross from the maximum value to the minimum value).
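A sketch of such a composite hash, following the email example above; String.hashCode() merely stands in for the pluggable hash function:

public final class CompositeHash {
    // Top 8 bits come from the user, remaining 24 bits from the message,
    // mirroring (hash(user_id)<<24) | (hash(message_id)>>>8).
    public static int emailHash(String userId, String messageId) {
        return (hash(userId) << 24) | (hash(messageId) >>> 8);
    }

    // At query time, restrict the search to the user's 1/256th slice of the hash space
    // (values are plain 32-bit ints; the space is treated as a ring, so they may be negative).
    public static int[] userHashRange(String userId) {
        int start = hash(userId) << 24;
        return new int[] { start, start | 0x00FFFFFF };
    }

    // Placeholder for the pluggable hash function.
    private static int hash(String s) {
        return s.hashCode();
    }
}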
Shard naming
When partitioning is by hash code, rather than maintaining a separate mapping from shard to hash range, the shard name could actually be the hash range (i.e. shard "1-1000" would contain docs with a hashcode between 1 and 1000).
The current convention for solr-core naming is that it matches the collection name (assuming a single core in a Solr server for the collection). We still need a good naming scheme for when there are multiple cores for the same collection.
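For illustration, under that convention a shard's covered hash range can be recovered straight from its name; the helper below is hypothetical and assumes non-negative bounds in a "start-end" name:

// Parses a shard name such as "1-1000" back into the hash range it covers.
public final class ShardName {
    public static int[] hashRange(String shardName) {
        String[] parts = shardName.split("-");
        return new int[] { Integer.parseInt(parts[0]), Integer.parseInt(parts[1]) };
    }

    public static boolean covers(String shardName, int hash) {
        int[] range = hashRange(shardName);
        return hash >= range[0] && hash <= range[1];
    }
}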
Shard Assignment
The node -> partition mapping can only be changed by a node which has acquired the Cluster Lock in ZooKeeper. So when a node comes up, it first attempts to acquire the cluster lock, waits for it to be acquired and then identifies the partition to which it can subscribe.
Node to shard assignment
The node which is trying to find a new node should acquire the cluster lock first. First of all, the leader is identified for the shard. Out of all the available nodes, the node with the least number of shards is selected. If there is a tie, the node which is a leader of the least number of shards is chosen. If there is still a tie, a random node is chosen.
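A rough sketch of that selection rule; the Node interface and its counters are invented for illustration, and the real load information would come from ZooKeeper:

import java.util.Comparator;
import java.util.List;
import java.util.Random;

public class ShardAssigner {
    // Hypothetical view of a node's current load; not a real Solr type.
    public interface Node {
        int shardCount();    // shards this node already hosts
        int leaderCount();   // shards this node currently leads
    }

    private static final Random RANDOM = new Random();

    // Pick the node hosting the fewest shards; break ties by fewest leaderships, then randomly.
    public static Node pickNode(List<Node> candidates) {
        Node best = candidates.stream()
                .min(Comparator.comparingInt(Node::shardCount)
                        .thenComparingInt(Node::leaderCount))
                .orElseThrow();
        List<Node> tied = candidates.stream()
                .filter(n -> n.shardCount() == best.shardCount()
                          && n.leaderCount() == best.leaderCount())
                .toList();
        return tied.get(RANDOM.nextInt(tied.size()));
    }
}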
Transaction Log
- A transaction log records all operations performed on an Index between two commits
- Each commit starts a new transaction log because a commit guarantees durability of operations performed before it
- The sync can be tunable, e.g. flush vs fsync: the default can protect against JVM crashes but not against power failure, and can be much faster (see the sketch below)
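A minimal sketch of such an append-only log, with the flush-vs-fsync trade-off exposed as a flag; the class and its record format are invented for illustration, not Solr's actual transaction log:

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class TransactionLog implements AutoCloseable {
    private final FileChannel channel;
    private final boolean fsyncEachWrite;   // durability vs speed trade-off

    public TransactionLog(Path file, boolean fsyncEachWrite) throws IOException {
        // Append-only: the log only ever grows between two commits.
        this.channel = FileChannel.open(file,
                StandardOpenOption.CREATE, StandardOpenOption.WRITE, StandardOpenOption.APPEND);
        this.fsyncEachWrite = fsyncEachWrite;
    }

    // Append one write operation; a fresh log file would be started at each commit.
    public void append(String operation) throws IOException {
        channel.write(ByteBuffer.wrap((operation + "\n").getBytes(StandardCharsets.UTF_8)));
        if (fsyncEachWrite) {
            channel.force(false);   // fsync: survives power failure, much slower
        }
        // Otherwise the OS page cache is trusted: survives a JVM crash, not a power loss.
    }

    @Override
    public void close() throws IOException {
        channel.force(true);
        channel.close();
    }
}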
Recovery
A recovery can be triggered during:
- Bootstrap
- Partition splits
- Cluster re-balancing
The node starts by setting its status as ‘recovering’. During this phase, the node will not receive any read requests but it will receive all new write requests, which shall be written to a separate transaction log. The node looks up the version of the index it has and queries the ‘leader’ for the latest version of the partition. The leader responds with the set of operations to be performed before the node can be in sync with the rest of the nodes in the shard.
This may involve copying the index first and then replaying the transaction log, depending on how far behind the node is. If an index copy is required, the index files are replicated first to the local index and then the transaction logs are replayed. The replay of the transaction log is nothing but a stream of regular write requests. During this time, the node may have accumulated new writes, which should then be played back on the index. The moment the node catches up with the latest commit point, it marks itself as “ready”. At this point, read requests can be handled by the node.
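The recovery flow described above could be sketched roughly as follows; every type, method name and threshold here is a hypothetical placeholder for the corresponding step, not a real Solr API:

public class RecoveryHandler {
    // Hypothetical collaborators standing in for the leader, the local index and buffered writes.
    interface Leader { long latestVersion(); Iterable<String> operationsSince(long version); }
    interface LocalIndex { long version(); void apply(String writeOp); void copyFrom(Leader leader); }
    interface BufferedWrites { Iterable<String> drain(); }   // writes received while recovering

    enum Status { RECOVERING, READY }
    private volatile Status status = Status.RECOVERING;

    private static final long REPLAY_THRESHOLD = 10_000;     // illustrative cut-off only

    public void recover(Leader leader, LocalIndex index, BufferedWrites buffered) {
        status = Status.RECOVERING;                     // reads are rejected in this state
        if (leader.latestVersion() - index.version() > REPLAY_THRESHOLD) {
            index.copyFrom(leader);                     // copy the index first if too far behind
        }
        for (String op : leader.operationsSince(index.version())) {
            index.apply(op);                            // replay the leader's transaction log
        }
        for (String op : buffered.drain()) {
            index.apply(op);                            // then the writes buffered during recovery
        }
        status = Status.READY;                          // the node may now serve reads
    }
}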
Handling Node Failures
There may be temporary network partitions between some nodes or between a node and ZooKeeper. The cluster should wait for some time before re-balancing data.
Leader failure
If a node fails and it is the leader of any of the shards, the other members will initiate a leader election process. Writes to this partition are not accepted until the new leader is elected. It then follows the steps described under non-leader failure.
Non-Leader failure
The leader would wait for the min_reaction_time before identifying a new node to be a part of the shard. The leader acquires the Cluster Lock and uses the node-shard assignment algorithm to identify a node as the new member of the shard. The node -> partition mapping is updated in ZooKeeper and the cluster lock is released. The new node is then instructed to force reload the node -> partition mapping from ZooKeeper.
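A sketch of this non-leader failure path, reusing the pickNode rule from the node-to-shard assignment sketch above; the ClusterLock/ClusterState services and the min_reaction_time value are assumptions:

import java.util.List;

public class FailureHandler {
    // Hypothetical cluster services used by the leader of the affected shard.
    interface ClusterLock extends AutoCloseable { @Override void close(); }
    interface ClusterState {
        ClusterLock acquireClusterLock();
        List<ShardAssigner.Node> liveNodes();
        void assignNodeToShard(ShardAssigner.Node node, String shard);   // updates ZooKeeper
        void forceReloadMapping(ShardAssigner.Node node);                // node re-reads node -> partition map
    }

    private static final long MIN_REACTION_TIME_MS = 30_000;   // illustrative value only

    // Run by the shard leader after it notices that a replica has gone away.
    public void onReplicaFailure(ClusterState cluster, String shard) throws InterruptedException {
        Thread.sleep(MIN_REACTION_TIME_MS);            // wait: the failure may be a transient partition
        try (ClusterLock lock = cluster.acquireClusterLock()) {
            ShardAssigner.Node replacement = ShardAssigner.pickNode(cluster.liveNodes());
            cluster.assignNodeToShard(replacement, shard);   // node -> partition mapping in ZooKeeper
            cluster.forceReloadMapping(replacement);         // new member reloads the mapping
        }
    }
}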
Splitting partitions
A partition can be split either by an explicit cluster admin command or automatically by splitting strategies provided by Solr. An explicit split command may specify target partition(s) for the split.
Assume the partition ‘X’ with hash range 100 - 199 is identified to be split into X (100 - 149) and a new partition Y (150 - 199). The leader of X records the split action in ZooKeeper with the new desired range values of X as well as Y. No nodes are notified of this split action or the existence of the new partition.
The leader of X acquires the Cluster Lock, identifies nodes which can be assigned to partition Y (algorithm TBD), informs them of the new partition and updates the partition -> node mapping. The leader of X waits for the nodes to respond and once it determines that the new partition is ready to accept commands, it proceeds as follows:
1. The leader of X suspends all commits until the split is complete.
2. The leader of X opens an IndexReader on the latest commit point (say version V) and instructs its peers to do the same.
3. The leader of X starts streaming the transaction log after version V for the hash range 150 - 199 to the leader of Y.
4. The leader of Y records the requests sent in #3 in its transaction log only, i.e. it is not played on the index.
5. The leader of X initiates an index split on the IndexReader opened in step #2.
6. The index created in #5 is sent to the leader of Y and is installed.
7. The leader of Y instructs its peers to start the recovery process. At the same time, it starts playing its transaction log on the index.
8. Once all peers of partition Y have reached at least version V:
   a. The leader of Y asks the leader of X to prepare a FilteredIndexReader on top of the reader created in step #2 which will have documents belonging to hash range 100 - 149 only.
   b. Once the leader of X acknowledges the completion of the request in #8a, the leader of Y acquires the Cluster Lock and modifies the range -> partition mapping to start receiving regular search/write requests from the whole cluster.
   c. The leader of Y asks the leader of X to start using the FilteredIndexReader created in #8a for search requests.
   d. The leader of Y asks the leader of X to force refresh the range -> partition mapping from ZooKeeper. At this point, it is guaranteed that the transaction log streaming which started in #3 will be stopped.
9. The leader of X will delete all documents with hash values not belonging to its partitions, commit and re-open the searcher on the latest commit point.
10. At this point, the split is considered complete and the leader of X resumes commits according to the commit_within parameters.
Notes:
- The partition being split does not honor the commit_within parameter until the split operation completes
- Any distributed search operation performed from the time of #8b until the end of #8c can return inconsistent results, i.e. the number of search results may be wrong.
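For the range arithmetic involved, a small sketch that splits a hash range at its midpoint, matching the example of X: 100 - 199 becoming X: 100 - 149 plus a new Y: 150 - 199 (the HashRange type is hypothetical):

public final class PartitionSplitter {
    // Inclusive hash range owned by a partition.
    public record HashRange(int start, int end) {
        boolean covers(int hash) { return hash >= start && hash <= end; }
    }

    // Split a range at its midpoint: 100-199 becomes 100-149 (kept by X) and 150-199 (new Y).
    public static HashRange[] split(HashRange original) {
        int mid = original.start() + (original.end() - original.start()) / 2;
        return new HashRange[] {
            new HashRange(original.start(), mid),      // stays with the original partition X
            new HashRange(mid + 1, original.end())     // handed to the new partition Y
        };
    }

    public static void main(String[] args) {
        HashRange[] parts = split(new HashRange(100, 199));
        // Prints: HashRange[start=100, end=149] / HashRange[start=150, end=199]
        System.out.println(parts[0] + " / " + parts[1]);
    }
}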