Distributed ID solution


The so-called ID is simply a value that can serve as a unique identifier.
In everyday design, a monolithic architecture generally uses the database's auto-increment ID as the primary key of a table. In a distributed system, however, independently generated IDs will conflict, so a distributed ID generator must also satisfy the characteristics expected of a distributed system: high concurrency, high availability, and high performance.

1. Distributed ID implementation solution

Let’s first take a look at common distributed ID solutions and a comparison of their respective characteristics.

  • 1. UUID: the simplest solution, but it hurts storage space and indexing performance.
  • 2. Auto-increment primary key of a single database used as the distributed ID generator: moderate complexity and IDs shorter than UUIDs, but limited by the performance of that single database, so it is not optimal under heavy concurrency.
  • 3. Redis or ZooKeeper features, such as Redis's auto-increment command (INCR) or ZooKeeper's sequential nodes: better performance than a single database (MySQL), and a reasonable choice in the right scenario.
  • 4. Snowflake algorithm: if the problem can be solved purely by an algorithm, that is the most convenient option. The snowflake algorithm generates distributed IDs by incrementing a sequence number on a given machine within a given millisecond. It also guarantees that IDs are unique across a distributed architecture, but the IDs are only trend-increasing, not strictly monotonic.

Let’s take a closer look at the comparison of these solutions:

  • UUID
    Description: UUID is short for Universally Unique Identifier. It lets every element in a distributed system carry a unique identifier without a central controller assigning one.
    Advantages: 1. no pressure on a central node, so primary keys are generated quickly; 2. the generated key is globally unique; 3. data from different servers is easy to merge.
    Shortcomings: 1. a UUID occupies 16 bytes (36 characters as a string) and wastes space; 2. it is not an ascending, ordered number, so data writes produce random I/O and index efficiency drops.
  • Database primary key auto-increment
    Description: a MySQL table whose primary key is set to auto-increment.
    Advantages: 1. INT and BIGINT types take little space; 2. the key grows monotonically, so write I/O stays sequential; 3. numeric keys are faster to query than strings.
    Shortcomings: 1. concurrency is limited by database performance; 2. splitting databases and tables requires modification and is complicated; 3. auto-increment leaks the volume of data.
  • Redis auto-increment
    Description: a Redis counter, incremented atomically.
    Advantages: runs in memory and has good concurrency performance.
    Shortcomings: 1. data can be lost; 2. auto-increment leaks the volume of data.
  • Number segment mode
    Description: depends on the database, but works differently from plain primary key auto-increment.
    Advantages: performance is significantly better than simple auto-increment IDs.
    Shortcomings: still limited by database performance.
  • Snowflake algorithm (snowflake)
    Description: Twitter's well-known Snowflake algorithm, a classic solution for distributed IDs.
    Advantages: 1. does not rely on external components; 2. good performance.
    Shortcomings: problems when the clock is set back (clock rollback).

Five solutions are mentioned above. There are currently two popular distributed ID solutions: “Number Segment Mode” and “Snowflake Algorithm”.

1.1 UUID

This method is very simple: each time new data needs to be inserted, first generate a UUID and use it as the ID.

String id=UUID.randomUUID().toString();

1.2 Database primary key auto-increment

We need to create a table specifically to store IDs.
The table can be designed as follows:

CREATE TABLE SEQID.SEQUENCE_ID ( 
    id bigint(20) unsigned NOT NULL auto_increment, 
    stub char(10) NOT NULL default '', 
    PRIMARY KEY (id), 
    UNIQUE KEY stub (stub) 
);

Each time a record needs to be added, first insert a row into this table, then use the returned auto-generated primary key as the ID of the record to be inserted. The following statements generate and fetch an auto-increment ID:

begin; 
replace into SEQUENCE_ID (stub) VALUES ('anyword'); 
select last_insert_id(); 
commit;

Many people may be confused at this point: what is the stub for?
The stub field has no special meaning here; it exists only so that a row can be inserted, because an auto-increment ID is generated only when a row is inserted.
For the insert we use REPLACE. REPLACE first checks whether a row with the same stub value already exists; if so, it deletes that row and then inserts the new one, otherwise it inserts directly.
If that still feels abstract, let's try it out.

Executing the REPLACE statement returns the newly generated primary key ID. Executing it several more times (five, say) pushes the auto-increment ID up to 6, yet if we open the table we can see that it still contains only one row.
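In application code, fetching an ID in this way might look like the following JDBC sketch; the connection URL, credentials, and class name are illustrative assumptions, not part of the original scheme.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class SequenceIdDemo {
    // Returns a new auto-increment ID by inserting into the SEQUENCE_ID table.
    public static long nextId() throws Exception {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:mysql://localhost:3306/SEQID", "root", "password")) {
            conn.setAutoCommit(false);
            try (Statement stmt = conn.createStatement()) {
                // REPLACE either inserts a new row or replaces the existing row with the
                // same stub; either way a new auto-increment value is consumed.
                stmt.executeUpdate("REPLACE INTO SEQUENCE_ID (stub) VALUES ('anyword')");
                try (ResultSet rs = stmt.executeQuery("SELECT LAST_INSERT_ID()")) {
                    rs.next();
                    long id = rs.getLong(1);
                    conn.commit();
                    return id;
                }
            }
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(nextId());
    }
}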

This way of generating distributed IDs requires a dedicated MySQL instance. It works, but it falls short on performance and reliability: every time the business system needs an ID it must query the database, which is slow, and if this database instance goes down, every business system that depends on it is affected.

1.3 Redis auto-increment

Because Redis executes commands in a single thread, unique IDs can be generated with the INCR and INCRBY commands.
In production, Redis is usually a cluster. Suppose there are 5 Redis instances: set their initial values to 1, 2, 3, 4, 5 respectively, and use an increment step of 5.
The IDs generated by each Redis are:
A:1,6,11,16,21
B:2,7,12,17,22
C:3,8,13,18,23
D:4,9,14,19,24
E:5,10,15,20,25

In this case, no matter which Redis the request goes to, you can get a different ID.
Let's try it in practice to get a more intuitive feel.
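A minimal sketch of this approach with the Jedis client; the key name, host, and step are illustrative assumptions.

import redis.clients.jedis.Jedis;

public class RedisIdDemo {
    public static void main(String[] args) {
        try (Jedis jedis = new Jedis("localhost", 6379)) {
            // On instance A the counter would be seeded at 1; incrementing by 5
            // then yields 6, 11, 16, ... so the five instances never collide.
            long id = jedis.incrBy("distributed:id", 5);
            System.out.println("next id = " + id);
        }
    }
}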

Using Redis is very efficient, but persistence must be considered. Redis supports two persistence modes: RDB and AOF.

  • RDB persistence takes a snapshot periodically. If Redis crashes after the counter has been incremented several times since the last snapshot, duplicate IDs will be handed out after Redis restarts.
  • AOF persistence logs every write command, so a crash does not cause duplicate IDs, but replaying the large number of INCR commands makes restart and recovery slow.

1.4 Number segment mode

We can also obtain auto-increment IDs in number segments. A number segment is essentially a batch: if DistributIdService can fetch a whole batch of IDs from the database at once and cache them locally, the efficiency with which business applications obtain IDs improves greatly.

For example, each time DistributIdService goes to the database it fetches a number segment such as (1, 1000], a range representing 1,000 IDs. When a business application asks DistributIdService for an ID, DistributIdService simply increments a local counter starting from 1 and returns the value instead of hitting the database for every request; it only fetches the next segment from the database once the local counter reaches 1000, i.e. when the current segment is used up.
Let's try to implement this in more detail:

CREATE TABLE id_generator ( 
    id int(10) NOT NULL, 
    current_max_id bigint(20) NOT NULL COMMENT 'Current maximum id', 
    increment_step int(10) NOT NULL COMMENT 'length of number section', 
    PRIMARY KEY (`id`) 
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

This table records the increment step and the current maximum auto-increment ID (that is, the upper bound of the segment most recently handed out). Because the increment logic has been moved into DistributIdService, the database itself no longer needs it.

This solution no longer depends so heavily on the database: even if the database becomes unavailable, DistributIdService can keep serving IDs for a while. However, if DistributIdService is restarted, the unused part of its current segment is lost, leaving holes in the ID sequence.

To make DistributIdService highly available, it must be deployed as a cluster. When a business application requests an ID from the cluster, it picks a DistributIdService node at random. Since every node connects to the same database, multiple nodes may request a new number segment at the same time, so optimistic locking is needed: add a version field to the table and obtain a segment with SQL like the following:

UPDATE id_generator
SET current_max_id = #{newMaxId},
    version = version + 1
WHERE version = #{version}

Because newMaxId is computed by DistributIdService as oldMaxId + step, a successful update means the number segment has been obtained successfully.

To make the database layer highly available as well, the database must be deployed in multi-master mode, and each database instance must be guaranteed to generate non-overlapping segments. This reuses the earlier idea: add a starting value and a step to the table above. For example, with two MySQL instances, mysql1 generates the segment (1, 1001] and its values increase by 2, giving 1, 3, 5, 7, ...; mysql2 generates the segment (2, 1002] and its values are 2, 4, 6, 8, 10, ...

TinyId adds one more optimization on top of this to improve efficiency. In the implementation above, the ID increment logic lives inside DistributIdService; it can instead be moved into the business application itself, so the application only needs to fetch a number segment and no longer has to call DistributIdService for every single increment, as sketched below.
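A minimal, simplified sketch of the number segment idea on the application side; fetchSegmentFromDb() is a hypothetical stand-in for the optimistic-lock UPDATE shown above and is not an actual DistributIdService or TinyId API.

public class SegmentIdAllocator {
    private static final long STEP = 1000; // length of each number segment
    private long maxId = 0;                // upper bound (inclusive) of the current segment
    private long current = 0;              // last ID handed out

    public synchronized long nextId() {
        if (current >= maxId) {
            // Current segment is exhausted (or none was fetched yet):
            // fetch the next segment (maxId, maxId + STEP] from the database.
            maxId = fetchSegmentFromDb(STEP);
            current = maxId - STEP; // IDs in the new segment are (current, maxId]
        }
        return ++current;
    }

    // Hypothetical: would run "UPDATE id_generator SET current_max_id = ..., version = version + 1
    // WHERE version = ..." and, on success, return the new current_max_id.
    private long fetchSegmentFromDb(long step) {
        return maxId + step;
    }
}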

1.5 Snowflake algorithm (snowflake)

The last three methods above are all based on the idea of auto-increment. Next we introduce the better-known snowflake algorithm.
We can look at distributed IDs from a different angle: it is enough that every machine responsible for generating IDs can produce a different ID within every millisecond.

Snowflake is Twitter's open-source distributed ID generation algorithm. It is a pure algorithm, so unlike the mechanisms above it does not rely on a database.

The core idea: the distributed ID is a single 64-bit long (8 bytes). The original snowflake algorithm allocates those 64 bits as follows:

  • Sign bit (1 bit): fixed at 0, because 0 represents a positive number and IDs are positive.
  • Timestamp (41 bits): the time in milliseconds. Implementations generally store not the current timestamp itself but the difference between the current time and a fixed start time, which lets generated IDs start from a smaller value.
  • Worker machine ID (10 bits): usually split into 5 bits for the data center (machine room) and 5 bits for the machine inside it, which is flexible and allows up to 1024 nodes to be deployed.
  • Sequence number (12 bits): auto-incremented within the same millisecond.

How large a range can the snowflake algorithm cover?

  • Time range: 2^41 / (1000L * 60 * 60 * 24 * 365) = 69 years
  • Worker process range: 2^10 = 1024
  • Serial number range: 2^12 = 4096, which means 4096 IDs can be generated in 1ms.

Following this logic, you only need to implement the algorithm in Java and wrap it as a utility method; every business application can then call the utility directly to obtain distributed IDs. All that must be ensured is that each business application instance has its own worker machine ID, and there is no need to build a separate service just for generating distributed IDs. Below is a Java implementation of the snowflake algorithm:

public class SnowFlake {

    /**
     * Starting timestamp
     */
    private final static long START_STMP = 1480166465631L;

    /**
     * Number of bits occupied by each part
     */
    private final static long SEQUENCE_BIT = 12; //The number of digits occupied by the sequence number
    private final static long MACHINE_BIT = 5; //The number of bits occupied by the machine identification
    private final static long DATACENTER_BIT = 5;//The number of bits occupied by the data center

    /**
     * Maximum value of each part
     */
    private final static long MAX_DATACENTER_NUM = -1L ^ (-1L << DATACENTER_BIT);
    private final static long MAX_MACHINE_NUM = -1L ^ (-1L << MACHINE_BIT);
    private final static long MAX_SEQUENCE = -1L ^ (-1L << SEQUENCE_BIT);

    /**
     * Displacement of each part to the left
     */
    private final static long MACHINE_LEFT = SEQUENCE_BIT;
    private final static long DATACENTER_LEFT = SEQUENCE_BIT + MACHINE_BIT;
    private final static long TIMESTMP_LEFT = DATACENTER_LEFT + DATACENTER_BIT;

    private long datacenterId; //data center
    private long machineId; //machine identification
    private long sequence = 0L; //sequence number
    private long lastStmp = -1L;//Last timestamp

    public SnowFlake(long datacenterId, long machineId) {
        if (datacenterId > MAX_DATACENTER_NUM || datacenterId < 0) {
            throw new IllegalArgumentException("datacenterId can't be greater than MAX_DATACENTER_NUM or less than 0");
        }
        if (machineId > MAX_MACHINE_NUM || machineId < 0) {
            throw new IllegalArgumentException("machineId can't be greater than MAX_MACHINE_NUM or less than 0");
        }
        this.datacenterId = datacenterId;
        this.machineId = machineId;
    }

    /**
     * Generate next ID
     *
     * @return
     */
    public synchronized long nextId() {
        long currStmp = getNewstmp();
        if (currStmp < lastStmp) {
            throw new RuntimeException("Clock moved backwards.  Refusing to generate id");
        }

        if (currStmp == lastStmp) {
            //In the same millisecond, the sequence number increases automatically
            sequence = (sequence + 1) & MAX_SEQUENCE;
            //The number of sequences in the same millisecond has reached the maximum
            if (sequence == 0L) {
                currStmp = getNextMill();
            }
        } else {
            // Within different milliseconds, the sequence number is set to 0
            sequence = 0L;
        }

        lastStmp = currStmp;

        return (currStmp-START_STMP) << TIMESTMP_LEFT // timestamp part
                | datacenterId << DATACENTER_LEFT // Data center section
                | machineId << MACHINE_LEFT // Machine identification part
                | sequence; //serial number part
    }

    private long getNextMill() {
        long mill = getNewstmp();
        while (mill <= lastStmp) {
            mill = getNewstmp();
        }
        return mill;
    }

    private long getNewstmp() {
        return System.currentTimeMillis();
    }

    public static void main(String[] args) {
        SnowFlake snowFlake = new SnowFlake(2, 3);

        for (int i = 0; i < (1 << 12); i++) {
            System.out.println(snowFlake.nextId());
        }

    }
}

In real production scenarios snowflake is rarely used as-is; it is usually adapted, because the hardest part to operate is the worker machine ID. The original snowflake algorithm requires someone to manually assign a machine ID to every machine and configure it somewhere so that snowflake can read it at startup.
When there are many machines the labor cost is high and errors are easy to make, which is why many large companies have adapted snowflake.
Let's look at how they do it.

1.5.1 Baidu (uid-generator)

uid-generator is based on snowflake but differs in how the machine ID, called workId, is produced.

The workId in uid-generator is generated by uid-generator itself. Taking applications deployed on Docker into account, users can define their own workId generation strategy; the default strategy is to have the database assign one when the application starts.

To put it simply: when the application starts, it inserts a row into a database table (uid-generator requires an extra WORKER_NODE table), and the auto-increment ID returned for that row becomes this machine's workId; the row itself consists of host and port.

In uid-generator the workId occupies 22 bits, the timestamp occupies 28 bits, and the sequence occupies 13 bits. Note that, unlike the original snowflake, the time unit is seconds rather than milliseconds, and the workId behaves differently as well: the same application consumes a new workId every time it restarts.

1.5.2 Meituan (Leaf)

Meituan's Leaf is also a distributed ID generation framework. It is quite comprehensive, supporting both the number segment mode and the snowflake mode. The name comes from the German philosopher and mathematician Leibniz: "There are no two identical leaves in the world." Leaf offers high reliability, low latency, and global uniqueness, and is widely used inside Meituan, including Meituan Finance, Meituan Food Delivery, and Meituan Hotel & Travel.
The snowflake mode in Leaf differs from the original snowflake algorithm mainly in how the workId is generated: Leaf derives the workId from a ZooKeeper sequential node. When an application using Leaf-snowflake starts, it creates a sequential node in ZooKeeper, so each machine corresponds to one sequential node, that is, one workId.
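The following is a minimal sketch of how a workId could be derived from a ZooKeeper sequential node; it is not Leaf's actual code, and the connection string and node paths are illustrative assumptions.

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZkWorkIdDemo {
    public static void main(String[] args) throws Exception {
        // Connection string and paths are assumptions; the parent node /snowflake must already exist.
        ZooKeeper zk = new ZooKeeper("127.0.0.1:2181", 15000, event -> {});
        // Each instance creates a persistent sequential node; ZooKeeper appends a
        // monotonically increasing numeric suffix, which is used as the workId.
        String path = zk.create("/snowflake/worker-", "host:port".getBytes(),
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT_SEQUENTIAL);
        long workId = Long.parseLong(path.substring(path.lastIndexOf('-') + 1));
        System.out.println("workId = " + workId);
        zk.close();
    }
}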
Leaf features are as follows:

  • It is globally unique, there will never be duplicate IDs, and the overall trend of IDs is increasing.
  • Highly available, the service is completely based on a distributed architecture. Even if MySQL goes down, it can tolerate a period of database unavailability.
  • High concurrency and low latency: on a CentOS 4C8G virtual machine, remote-call QPS can reach 50,000+ with TP99 within 1 ms.
  • Access is simple and can be accessed directly through the company’s RPC service or HTTP call.

A more detailed introduction to Meituan Leaf is available on Meituan's technical blog.
How do you use Meituan Leaf? If you are interested, you can read my other blog post: "Meituan Leaf in Action".
