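Spring Boot Actuator, part 2: Actuator's monitoring and management metrics features

Service registration and discovery with Consul, part 2: using Consul in Spring Cloud for service registration and discovery

Consul: service health monitoring

Consul: commonly used API endpoints

Spring Cloud + Consul graceful shutdown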

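This article covers three topics:

1. Types of health checks

2. Interactions between Consul and microservices

3. A production health-check problem

Service registration: a service process registers its location with the registry. It usually registers its host and port, and sometimes also authentication information, protocol, version number, and details about its runtime environment.

Service discovery: a client application queries the registry to obtain the location of a service. One important role of service discovery is to provide a list of available services.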

1. Types of health checks

A service definition looks roughly like this:

    {
      "service": {
        "id": "jetty",
        "name": "jetty",
        "address": "192.168.1.200",
        "port": 8080,
        "tags": ["dev"],
        "checks": [
          {
            "http": "http://192.168.1.200:8080/health",
            "interval": "5s"
          }
        ]
      }
    }

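In this definition, checks perform the health checks for the service. A service may define several checks or none at all, and several check types are supported. A check is defined in a configuration file or added at runtime through the HTTP interface. Checks created through the HTTP interface persist with the node.

There are five kinds of check: Script, HTTP, TCP, TTL, and Docker.

The two basic styles are script and TTL. A script check must provide the script and interval fields; a TTL check must provide the ttl field.

With a script check, Consul actively probes the health of the service; with a TTL check, the service actively reports its own health to Consul.

The configuration for each type follows.

Apart from the Docker variant described further down, a check must be one of four types: Script, HTTP, TCP, or TTL. A Script check requires script and interval. An HTTP check requires http and interval. A TCP check requires tcp and interval. A TTL check only requires ttl. The check name is generated automatically as service:<service-id>; if a service registers several checks, the names are generated as service:<service-id>:<num>.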

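Script check (script + interval)

This check runs an external program that returns an exit code and may produce some output. The script is invoked at a preset interval (for example, every 30 seconds), similar to the Nagios plugin system. Output is limited to 4K; anything larger is truncated. By default a script times out after 30 seconds, which can be changed with the timeout field.

    {
      "check": {
        "id": "mem-util",
        "name": "Memory utilization",
        "script": "/usr/local/bin/check_mem.py",
        "interval": "10s",
        "timeout": "1s"
      }
    }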

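HTTP check (http + interval)

This check issues an HTTP GET request at a preset interval. The HTTP response code indicates the state of the service: any 2xx code is passing, 429 (Too Many Requests) is a warning, and any other code is a failure.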
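The status-code rule above is small enough to write down directly. The following is only a sketch of that rule as stated in this article; the class and enum names are made up for illustration and are not part of Consul.

```java
/** Hypothetical helper (not Consul code): maps an HTTP status code to a check state. */
final class HttpCheckStatus {
    enum State { PASSING, WARNING, CRITICAL }

    static State fromStatusCode(int statusCode) {
        if (statusCode >= 200 && statusCode <= 299) {
            return State.PASSING;   // any 2xx counts as healthy
        }
        if (statusCode == 429) {
            return State.WARNING;   // 429 Too Many Requests
        }
        return State.CRITICAL;      // everything else is a failure
    }

    public static void main(String[] args) {
        for (int code : new int[] {200, 204, 429, 404, 503}) {
            System.out.println(code + " -> " + fromStatusCode(code));
        }
    }
}
```

Note that under this rule a redirect (3xx) counts as a failure, so the health URL should answer directly with a 2xx code.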

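This type of check is preferable to a script that uses curl or another external program to perform a simple HTTP operation. By default, the request timeout of an HTTP check equals the check interval, capped at 10 seconds. A custom timeout can be set in the check definition. Output is limited to 4K; anything larger is truncated.

    {
      "check": {
        "id": "api",
        "name": "HTTP API on port 5000",
        "http": "http://localhost:5000/health",
        "interval": "10s",
        "timeout": "1s"
      }
    }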

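TCP check (tcp + interval)

This check opens a TCP connection to the given IP/hostname and port at a preset interval. The state of the service depends on whether the connection succeeds: if it does, the status is "success"; otherwise the status is "critical". If a hostname resolves to both an IPv4 and an IPv6 address, both are tried, and the first successful connection makes the status "success".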
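To make the connect-and-report behaviour concrete, here is a minimal sketch of such a probe using only the JDK. It is an illustration of the idea, not Consul's code; the class name TcpCheck and the method isReachable are hypothetical.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.Socket;

/** Hypothetical TCP probe (not Consul code): succeeds if any resolved address accepts a connection. */
final class TcpCheck {
    static boolean isReachable(String host, int port, int timeoutMillis) {
        InetAddress[] addresses;
        try {
            // A hostname may resolve to several addresses, for example one IPv4 and one IPv6.
            addresses = InetAddress.getAllByName(host);
        } catch (IOException e) {
            return false; // the name could not be resolved
        }
        for (InetAddress address : addresses) {
            try (Socket socket = new Socket()) {
                socket.connect(new InetSocketAddress(address, port), timeoutMillis);
                return true; // the first successful connection is enough
            } catch (IOException e) {
                // refused or timed out: try the next address
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(isReachable("localhost", 22, 1000) ? "success" : "critical");
    }
}
```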

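This type of check is preferable to a script that uses netcat or another external program to perform a simple socket operation.

By default, the request timeout of a TCP check equals the check interval, capped at 10 seconds. It can also be configured freely.

    {
      "check": {
        "id": "ssh",
        "name": "SSH TCP on port 22",
        "tcp": "localhost:22",
        "interval": "10s",
        "timeout": "1s"
      }
    }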

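TTL check (Time to Live)

This check retains its last known state for the given TTL. The state must be updated periodically through the HTTP interface; if no external system updates it within the TTL, the check is considered unhealthy.

Conceptually this mechanism resembles a "dead man's switch": the service has to report its health periodically. For example, a healthy app can periodically PUT its status to the HTTP endpoint; if the app fails, the TTL expires and the health check enters the critical state. The endpoints used to update the health information of a given check are the pass endpoint and the fail endpoint (see the agent HTTP endpoints).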
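The dead man's switch can be modelled in a few lines. The sketch below is an in-memory illustration only, with hypothetical names; it assumes that a check which has never received a report is critical, and it takes the current time as a parameter so the behaviour is easy to follow.

```java
import java.time.Duration;
import java.time.Instant;

/** Hypothetical in-memory model of a TTL check; it is not Consul's implementation. */
final class TtlCheck {
    enum State { PASSING, CRITICAL }

    private final Duration ttl;
    private Instant lastPass; // null until the service reports for the first time

    TtlCheck(Duration ttl) {
        this.ttl = ttl;
    }

    /** The service reports that it is healthy (the "pass" update). */
    void pass(Instant now) {
        lastPass = now;
    }

    /** The service reports that it is unhealthy (the "fail" update). */
    void fail() {
        lastPass = null;
    }

    /** PASSING only while the last pass report is younger than the TTL. */
    State stateAt(Instant now) {
        if (lastPass == null) {
            return State.CRITICAL;
        }
        return now.isBefore(lastPass.plus(ttl)) ? State.PASSING : State.CRITICAL;
    }
}
```

With a TTL of 30 seconds, a service that reports every 10 seconds stays passing, and one that stops reporting turns critical once 30 seconds have elapsed since its last report.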

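TTL checks also persist their last known state to disk, which lets the agent restore that state after a restart. A persisted state remains valid until the TTL counted from the last check runs out.

    {
      "check": {
        "id": "web-app",
        "name": "Web App Status",
        "notes": "Web app does a curl internally every 10 seconds",
        "ttl": "30s"
      }
    }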

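Docker check (docker + interval)

This check relies on invoking an external program that is packaged inside a Docker container. The program is triggered in the running container through the Docker Exec API.

The user running the Consul agent is expected to have access to the Docker HTTP API or the UNIX socket. Consul uses $DOCKER_HOST to determine the Docker API endpoint. The program is expected to run, perform a health check of the service running inside the container, and exit with an appropriate exit code. The check is invoked at the specified interval.

If the host has more than one shell, the shell parameter should be configured as well.

Output is limited to 4K; anything larger is truncated.

    {
      "check": {
        "id": "mem-util",
        "name": "Memory utilization",
        "docker_container_id": "f972c95ebf0e",
        "shell": "/bin/bash",
        "script": "/usr/local/bin/check_mem.py",
        "interval": "10s"
      }
    }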

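Every check must include a name; id and notes are both optional. If no id is provided, it is set to the name. Check IDs must be unique within a node, so if names might conflict, IDs should be set explicitly.

The notes field mainly makes checks easier for people to read. For a script check, the notes field can be populated from the script's output. Likewise, an external program that updates a TTL check through the HTTP interface can set the notes field.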

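Check scripts

A check script is free to do anything to determine the status of the check. The only restriction is that its exit code must follow this convention:

  • Exit code 0: passing
  • Exit code 1: warning
  • Any other code: failing

Consul relies on this convention. Any other output of the script is saved in the notes field for people to read.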
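As an illustration of the exit-code convention, the sketch below runs a command and turns its exit code into a state. It is a hypothetical runner, not Consul code; treating a timeout or a command that cannot be started as critical is an assumption of this sketch, and the command output is simply discarded.

```java
import java.io.IOException;
import java.util.List;
import java.util.concurrent.TimeUnit;

/** Hypothetical runner (not Consul code) that applies the exit-code convention to a command. */
final class ScriptCheck {
    enum State { PASSING, WARNING, CRITICAL }

    static State fromExitCode(int exitCode) {
        switch (exitCode) {
            case 0:  return State.PASSING;
            case 1:  return State.WARNING;
            default: return State.CRITICAL;
        }
    }

    /** Runs the command; a timeout or a failure to start counts as CRITICAL in this sketch. */
    static State run(List<String> command, long timeoutSeconds) {
        try {
            Process process = new ProcessBuilder(command)
                    .redirectErrorStream(true)
                    .redirectOutput(ProcessBuilder.Redirect.DISCARD) // output is ignored here
                    .start();
            if (!process.waitFor(timeoutSeconds, TimeUnit.SECONDS)) {
                process.destroyForcibly();
                return State.CRITICAL;
            }
            return fromExitCode(process.exitValue());
        } catch (IOException e) {
            return State.CRITICAL; // the command could not be started
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return State.CRITICAL;
        }
    }
}
```

A call such as ScriptCheck.run(List.of("/bin/check_mem"), 30) would then yield the state for one invocation of the script.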

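Initial health status

By default, when a check is registered with the Consul agent, its status is immediately set to "critical". This prevents a service from being registered as "passing" and entering the service pool before it has been confirmed healthy. In some cases a check may need to specify its initial state, which can be done with the "status" field.

    {
      "check": {
        "id": "mem",
        "script": "/bin/check_mem",
        "interval": "10s",
        "status": "passing"
      }
    }

Here the initial status is set to passing.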

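Service-bound checks

A health check may optionally be bound to a specific service. This ensures that the state of the check affects only that service and not the whole node. A service-bound check needs a service_id field.

    {
      "check": {
        "id": "web-app",
        "name": "Web App Status",
        "service_id": "web-app",
        "ttl": "30s"
      }
    }

In this example, a failure of the web-app check only affects the availability of the web-app service; the other services on the node are not affected.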

Multiple check definitions

To define several checks, use the "checks" field, for example:

    {
      "checks": [
        {
          "id": "chk1",
          "name": "mem",
          "script": "/bin/check_mem",
          "interval": "5s"
        },
        {
          "id": "chk2",
          "name": "/health",
          "http": "http://localhost:5000/health",
          "interval": "15s"
        },
        {
          "id": "chk3",
          "name": "cpu",
          "script": "/bin/check_cpu",
          "interval": "10s"
        }
      ]
    }

Note: in our own tests, check scripts written in Python did not work; the scripts had to be shell scripts.

2. Interactions between Consul and microservices

With debug logging enabled on the Consul side, the following requests show up:

2.1 The microservice sends two kinds of request to Consul:

  • /catalog/services: lists the services in a given datacenter
  • /health/service/: lists the available nodes of a service

    2021/01/05 19:08:27 [DEBUG] http: Request GET /v1/catalog/services?wait=2s&index=45 (2.077s) from=10.200.110.100:28159
    2021/01/05 19:08:32 [DEBUG] http: Request GET /v1/health/service/10.200.140.19?token=<hidden> (0s) from=10.200.110.100:28159
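The first request carries wait and index parameters, which make it a long-polling query: the call returns when the catalog index moves past the given value or when the wait time is up. As a rough sketch, the two request paths from the log can be rebuilt as follows. The helper class is hypothetical, and leaving out index when it is negative is an assumption based on the logs later in this article, where the first request after a restart carries no index.

```java
/** Hypothetical helper that rebuilds the request paths seen in the Consul debug log. */
final class ConsulPaths {
    /** A negative index means "no index known yet", so the parameter is left out. */
    static String catalogServices(long waitSeconds, long index) {
        StringBuilder path = new StringBuilder("/v1/catalog/services");
        path.append("?wait=").append(waitSeconds).append('s');
        if (index >= 0) {
            path.append("&index=").append(index);
        }
        return path.toString();
    }

    /** Path that lists the nodes of one service together with their health. */
    static String healthService(String serviceName) {
        return "/v1/health/service/" + serviceName;
    }

    public static void main(String[] args) {
        System.out.println(catalogServices(2, 45));
        System.out.println(healthService("10.200.140.19"));
    }
}
```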

Next, let's trace where in the microservice (Spring Boot, Spring Cloud) these requests originate:

    2021/01/05 19:08:27 [DEBUG] http: Request GET /v1/catalog/services?wait=2s&index=45 (2.077s) from=10.200.110.100:28159

Searching the source for this path turns up the following fragment:

    package com.ecwid.consul.v1.catalog;

    public final class CatalogConsulClient implements CatalogClient {
        // Excerpt: fields (such as rawClient) and the other methods are omitted.

        @Override
        public Response<Map<String, List<String>>> getCatalogServices(QueryParams queryParams, String token) {
            // The ACL token is sent as a URL parameter only when one is configured.
            UrlParameters tokenParam = token != null ? new SingleUrlParameters("token", token) : null;
            RawResponse rawResponse = rawClient.makeGetRequest("/v1/catalog/services", queryParams, tokenParam);
            if (rawResponse.getStatusCode() == 200) {
                Map<String, List<String>> value = GsonFactory.getGson().fromJson(rawResponse.getContent(),
                        new TypeToken<Map<String, List<String>>>() {
                        }.getType());
                return new Response<Map<String, List<String>>>(value, rawResponse);
            } else {
                throw new OperationException(rawResponse);
            }
        }
    }


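The call hierarchy shows several callers:

1. DiscoveryClientHealthIndicator, invoked when the /health endpoint runs its health check

2. ConsulCatalogWatch, a scheduled task in the consul-discovery package

3. DiscoveryClientRouteLocator, invoked when the Zuul gateway maintains its routes (only relevant in Zuul projects)

Let's look at the source for item 2 to see what the scheduled task does once it has fetched the service information from Consul.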

    package org.springframework.cloud.consul.discovery;

    @Slf4j
    public class ConsulCatalogWatch implements ApplicationEventPublisherAware {
        // Excerpt: fields (consul, properties, publisher, catalogServicesIndex) are omitted.

        @Scheduled(fixedDelayString = "${spring.cloud.consul.discovery.catalogServicesWatchDelay:30000}")
        public void catalogServicesWatch() {
            try {
                // -1 means no index has been seen yet.
                long index = -1;
                if (catalogServicesIndex.get() != null) {
                    index = catalogServicesIndex.get().longValue();
                }

                Response<Map<String, List<String>>> response = consul
                        .getCatalogServices(new QueryParams(properties
                                .getCatalogServicesWatchTimeout(), index));
                Long consulIndex = response.getConsulIndex();
                if (consulIndex != null) {
                    catalogServicesIndex.set(BigInteger.valueOf(consulIndex));
                }

                log.trace("Received services update from consul: {}, index: {}",
                        response.getValue(), consulIndex);
                publisher.publishEvent(new HeartbeatEvent(this, consulIndex));
            }
            catch (Exception e) {
                log.error("Error watching Consul CatalogServices", e);
            }
        }
    }


The result is published as a Spring application event, HeartbeatEvent. Following that event further:

    package org.springframework.cloud.config.client;

    @ConditionalOnProperty(value = "spring.cloud.config.discovery.enabled", matchIfMissing = false)
    @Configuration
    @Import({ UtilAutoConfiguration.class })
    @EnableDiscoveryClient
    public class DiscoveryClientConfigServiceBootstrapConfiguration {
        // Excerpt: fields (monitor, config, instanceProvider, logger) and getHomePage() are omitted.

        @EventListener(HeartbeatEvent.class)
        public void heartbeat(HeartbeatEvent event) {
            if (monitor.update(event.getValue())) {
                refresh();
            }
        }

        private void refresh() {
            try {
                String serviceId = this.config.getDiscovery().getServiceId();
                ServiceInstance server = this.instanceProvider
                        .getConfigServerInstance(serviceId);
                String url = getHomePage(server);
                if (server.getMetadata().containsKey("password")) {
                    String user = server.getMetadata().get("user");
                    user = user == null ? "user" : user;
                    this.config.setUsername(user);
                    String password = server.getMetadata().get("password");
                    this.config.setPassword(password);
                }
                if (server.getMetadata().containsKey("configPath")) {
                    String path = server.getMetadata().get("configPath");
                    // Avoid a double slash when joining the URL and the path.
                    if (url.endsWith("/") && path.startsWith("/")) {
                        url = url.substring(0, url.length() - 1);
                    }
                    url = url + path;
                }
                this.config.setUri(url);
            }
            catch (Exception ex) {
                if (config.isFailFast()) {
                    throw ex;
                }
                else {
                    logger.warn("Could not locate configserver via discovery", ex);
                }
            }
        }
    }


There is also:

    package org.springframework.cloud.netflix.zuul;

    @Configuration
    @Import({ RibbonCommandFactoryConfiguration.RestClientRibbonConfiguration.class,
            RibbonCommandFactoryConfiguration.OkHttpRibbonConfiguration.class,
            RibbonCommandFactoryConfiguration.HttpClientRibbonConfiguration.class,
            HttpClientConfiguration.class })
    @ConditionalOnBean(ZuulProxyMarkerConfiguration.Marker.class)
    public class ZuulProxyAutoConfiguration extends ZuulServerAutoConfiguration {
        // Excerpt: only the refresh listener is shown; reset() and resetIfNeeded() are omitted.

        private static class ZuulDiscoveryRefreshListener
                implements ApplicationListener<ApplicationEvent> {
            private HeartbeatMonitor monitor = new HeartbeatMonitor();

            @Autowired
            private ZuulHandlerMapping zuulHandlerMapping;

            @Override
            public void onApplicationEvent(ApplicationEvent event) {
                if (event instanceof InstanceRegisteredEvent) {
                    reset();
                }
                else if (event instanceof ParentHeartbeatEvent) {
                    ParentHeartbeatEvent e = (ParentHeartbeatEvent) event;
                    resetIfNeeded(e.getValue());
                }
                else if (event instanceof HeartbeatEvent) {
                    HeartbeatEvent e = (HeartbeatEvent) event;
                    resetIfNeeded(e.getValue());
                }
            }
        }
    }


For the role of Spring container events in Zuul, see: https://blog.csdn.net/weixin_34341229/article/details/90222469
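Both listeners above hand the heartbeat value to a monitor before doing any work, which suggests that a refresh only happens when the Consul index has actually changed. The sketch below shows one way such a monitor could behave. It is an assumption written for illustration, with a hypothetical class name, and should not be read as the source of Spring Cloud's HeartbeatMonitor.

```java
import java.util.concurrent.atomic.AtomicReference;

/** Hypothetical change detector: reports true only when a new, different value arrives. */
final class IndexChangeMonitor {
    private final AtomicReference<Object> latest = new AtomicReference<>();

    boolean update(Object value) {
        Object last = latest.get();
        if (value != null && !value.equals(last)) {
            // compareAndSet keeps concurrent callers from both reporting the same change.
            return latest.compareAndSet(last, value);
        }
        return false;
    }

    public static void main(String[] args) {
        IndexChangeMonitor monitor = new IndexChangeMonitor();
        for (long index : new long[] {45L, 45L, 46L}) {
            System.out.println("index " + index + " changed: " + monitor.update(index));
        }
    }
}
```

Under this reading, a catalog watch that keeps returning the same index every 30 seconds would not trigger a route reset or a configuration refresh.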

Analysis of: 2021/01/05 19:08:32 [DEBUG] http: Request GET /v1/health/service/10.200.140.19?token=<hidden> (0s) from=10.200.110.100:28159

    package com.ecwid.consul.v1.health;

    public final class HealthConsulClient implements HealthClient {
        // Excerpt: fields (such as rawClient) and the other methods are omitted.

        @Override
        public Response<List<com.ecwid.consul.v1.health.model.HealthService>> getHealthServices(String serviceName, String tag, boolean onlyPassing, QueryParams queryParams, String token) {
            // Each optional filter becomes a URL parameter only when it is set.
            UrlParameters tokenParam = token != null ? new SingleUrlParameters("token", token) : null;
            UrlParameters tagParams = tag != null ? new SingleUrlParameters("tag", tag) : null;
            UrlParameters passingParams = onlyPassing ? new SingleUrlParameters("passing") : null;
            RawResponse rawResponse = rawClient.makeGetRequest("/v1/health/service/" + serviceName, tagParams, passingParams, queryParams, tokenParam);
            if (rawResponse.getStatusCode() == 200) {
                List<com.ecwid.consul.v1.health.model.HealthService> value = GsonFactory.getGson().fromJson(rawResponse.getContent(),
                        new TypeToken<List<com.ecwid.consul.v1.health.model.HealthService>>() {
                        }.getType());
                return new Response<List<com.ecwid.consul.v1.health.model.HealthService>>(value, rawResponse);
            } else {
                throw new OperationException(rawResponse);
            }
        }
    }


The stack trace leads to the following:

    package com.netflix.loadbalancer;

    public class PollingServerListUpdater implements ServerListUpdater {
        // Excerpt: fields (isActive, scheduledFuture, lastUpdated, logger) and other members are omitted.

        @Override
        public synchronized void start(final UpdateAction updateAction) {
            if (isActive.compareAndSet(false, true)) {
                final Runnable wrapperRunnable = new Runnable() {
                    @Override
                    public void run() {
                        if (!isActive.get()) {
                            if (scheduledFuture != null) {
                                scheduledFuture.cancel(true);
                            }
                            return;
                        }
                        try {
                            updateAction.doUpdate();
                            lastUpdated = System.currentTimeMillis();
                        } catch (Exception e) {
                            logger.warn("Failed one update cycle", e);
                        }
                    }
                };

                scheduledFuture = getRefreshExecutor().scheduleWithFixedDelay(
                        wrapperRunnable,
                        initialDelayMs,
                        refreshIntervalMs,
                        TimeUnit.MILLISECONDS
                );
            } else {
                logger.info("Already active, no-op");
            }
        }

        private static long LISTOFSERVERS_CACHE_UPDATE_DELAY = 1000; // msecs;
        private static int LISTOFSERVERS_CACHE_REPEAT_INTERVAL = 30 * 1000; // msecs;

        public PollingServerListUpdater() {
            this(LISTOFSERVERS_CACHE_UPDATE_DELAY, LISTOFSERVERS_CACHE_REPEAT_INTERVAL);
        }
    }



Reference: https://blog.csdn.net/chengqiuming/article/details/81207985

2.2 Requests from Consul to the application services:

Consul calls the health endpoint of every registered application service.

Several failure cases:

1. Connection refused:
    2021/01/05 17:01:06 [WARN] agent: Check "service:tag-10-200-110-100-8778" HTTP request failed: Get http://10.200.110.100:8778/myhealth: dial tcp 10.200.110.100:8778: connectex: No connection could be made because the target machine actively refused it.
2. Request timeout:
    2020/12/26 01:37:45 [WARN] agent: Check "service:sms-172-30-x-x-18184" HTTP request failed: Get http://172.30.x.x:18184/health: net/http: request canceled (Client.Timeout exceeded while awaiting headers)

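3. A production health-check problem: HTTP request failed: Get http://172.30.x.x:18184/health: net/http: request canceled (Client.Timeout exceeded while awaiting headers)

The error message:

        2020/12/26 01:37:45 [WARN] agent: Check "service:sms-172-30-x-x-18184" HTTP request failed: Get http://172.30.x.x:18184/health: net/http: request canceled (Client.Timeout exceeded while awaiting headers)

Judging from the error log, both Consul and the application were up and Consul's request had been sent; it was the response that timed out (the default timeout is 10 s).

Spring Boot Actuator enables many health indicators by default. They can be switched off with the properties below; see "Spring Boot Actuator, part 2: Actuator's monitoring and management metrics features".

    # Switches for Spring Boot Actuator's default health indicators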
    management.health.cassandra.enabled: false
    management.health.couchbase.enabled: false
    management.health.ldap.enabled: false
    management.health.rabbit.enabled: false
    management.health.solr.enabled: false
    management.health.jms.enabled: false
    management.health.mongo.enabled: false
    management.health.diskspace.enabled: false
    management.health.redis.enabled: false
    management.health.elasticsearch.enabled: false
    management.health.db.enabled: false
    management.health.hystrix.enabled: false
    management.health.consul.enabled: false
    #discoveryComposite
    spring.cloud.discovery.client.composite-indicator.enabled: false 
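After this change, /health returns the following JSON:

    {
      "status": "UP",
      "application": {
        "status": "UP"
      }
    }

Alternatively, the application can change its default health path when it registers with Consul. This is done in the Consul client configuration; see "Service registration and discovery with Consul, part 2: using Consul in Spring Cloud for service registration and discovery":

1. Set spring.cloud.consul.discovery.health-check-path: /myhealth to change the default check path.
2. Write a myhealth endpoint with no logic that directly returns the "UP" JSON shown above.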
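To show how little such an endpoint needs to do, here is a dependency-free sketch built on the JDK's HttpServer. In a Spring Boot application this would normally be an ordinary controller method; the JDK server is used here only so the example runs on its own. The class name, the shortened response body, and the port 8778 (taken from the logs in this article) are illustrative assumptions.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;

/** Hypothetical stand-alone /myhealth endpoint that always answers 200 with an "UP" body. */
final class MyHealthServer {
    static final String BODY = "{\"status\":\"UP\"}";

    static HttpServer start(int port) throws IOException {
        HttpServer server = HttpServer.create(new InetSocketAddress(port), 0);
        server.createContext("/myhealth", exchange -> {
            byte[] body = BODY.getBytes(StandardCharsets.UTF_8);
            exchange.getResponseHeaders().set("Content-Type", "application/json");
            exchange.sendResponseHeaders(200, body.length); // no dependency checks, always 200
            try (OutputStream out = exchange.getResponseBody()) {
                out.write(body);
            }
        });
        server.start();
        return server;
    }

    public static void main(String[] args) throws IOException {
        HttpServer server = start(8778);
        System.out.println("Serving /myhealth on port " + server.getAddress().getPort());
    }
}
```

The trade-off is that such an endpoint only proves the process is accepting HTTP requests. It no longer tells Consul whether the database, Redis, or other dependencies are reachable.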

Consul error log while the service is down:

    2021/01/05 16:57:58 [WARN] agent: Check "service:tag-10-200-110-100-8778" HTTP request failed: Get http://10.200.110.100:8778/myhealth: dial tcp 10.200.110.100:8778: connectex: No connection could be made because the target machine actively refused it.

After the service restarts:

        2021/01/05 17:01:06 [WARN] agent: Check "service:tag-10-200-110-100-8778" HTTP request failed: Get http://10.200.110.100:8778/myhealth: dial tcp 10.200.110.100:8778: connectex: No connection could be made because the target machine actively refused it.
        2021/01/05 17:01:11 [DEBUG] http: Request GET /v1/catalog/services?wait=2s (0s) from=10.200.110.100:19143
        2021/01/05 17:01:14 [INFO] agent: Synced service "tag-10-200-110-100-8778"
        2021/01/05 17:01:14 [DEBUG] agent: Check "service:tag-10-200-110-100-8778" in sync
        2021/01/05 17:01:14 [DEBUG] agent: Node info in sync
        2021/01/05 17:01:14 [DEBUG] http: Request PUT /v1/agent/service/register?token=<hidden> (2ms) from=10.200.110.100:19143
        2021/01/05 17:01:14 [DEBUG] agent: Service "tag-10-200-110-100-8778" in sync
        2021/01/05 17:01:14 [DEBUG] agent: Check "service:tag-10-200-110-100-8778" in sync
        2021/01/05 17:01:14 [DEBUG] agent: Node info in sync
        2021/01/05 17:01:14 [DEBUG] agent: Service "tag-10-200-110-100-8778" in sync
        2021/01/05 17:01:14 [DEBUG] agent: Check "service:tag-10-200-110-100-8778" in sync
        2021/01/05 17:01:14 [DEBUG] agent: Node info in sync
        2021/01/05 17:01:14 [DEBUG] http: Request GET /v1/health/service/10.200.140.19?token=<hidden> (1ms) from=10.200.110.100:19143
        2021/01/05 17:01:15 [DEBUG] http: Request GET /v1/health/service/10.200.140.19?token=<hidden> (0s) from=10.200.110.100:19143
        2021/01/05 17:01:22 [DEBUG] agent: Check "service:tag-10-200-110-100-8778" is passing
        2021/01/05 17:01:22 [DEBUG] agent: Service "tag-10-200-110-100-8778" in sync