
We're getting an error in one environment when deploying an ML endpoint: it states that xgboost cannot be found, even though xgboost is installed in the Dockerfile. We do not see this issue in three other environments, where the model deploys fine without this package error.

Dockerfile:

    FROM mcr.microsoft.com/azureml/openmpi3.1.2-cuda10.0-cudnn7-ubuntu16.04:20210220.v1
    USER root
    RUN mkdir -p /etc/OpenCL/vendors && echo "libnvidia-opencl.so.1" > /etc/OpenCL/vendors/nvidia.icd
    RUN apt-get update && echo 'success updated apt-get!'
    RUN apt-get install -y --no-install-recommends cmake libboost-dev libboost-system-dev libboost-filesystem-dev
    RUN conda create -n gpuexp python=3.6.2 -y
    ###############################
    # Pre-build LightGBM
    ###############################
    RUN cd /usr/local/src && mkdir lightgbm && cd lightgbm && \
        git clone --recursive --branch v2.3.0 --depth 1 https://github.com/microsoft/LightGBM && \
        cd LightGBM && mkdir build && cd build && \
        cmake -DUSE_GPU=1 -DOpenCL_LIBRARY=/usr/local/cuda/lib64/libOpenCL.so -DOpenCL_INCLUDE_DIR=/usr/local/cuda/include/ .. && \
        make -j4
    ###############################
    # Install GPU LightGBM and XGBoost
    ###############################
    RUN /bin/bash -c "source activate gpuexp && \
        cd /usr/local/src/lightgbm/LightGBM/python-package && python setup.py install --precompile && \
        pip install --upgrade --force-reinstall xgboost==1.1.1 && \
        source deactivate"

Conda:

    channels:
      - anaconda
      - conda-forge
      - pytorch
    dependencies:
      - python=3.6.2
      - pip=20.2.4
      - pip:
          - azureml-core==1.27.0
          - azureml-pipeline-core==1.27.0
          - azureml-telemetry==1.27.0
          - azureml-defaults==1.27.0
          - azureml-interpret==1.27.0
          - azureml-automl-core==1.27.0
          - azureml-automl-runtime==1.27.0.post2
          - azureml-train-automl-client==1.27.0
          - azureml-train-automl-runtime==1.27.0.post1
          - azureml-dataset-runtime==1.27.0
          - azureml-mlflow==1.27.0
          - inference-schema
          - py-cpuinfo==5.0.0
          - boto3==1.15.18
          - botocore==1.18.18
          - azure-storage-file-datalake
          - azure-identity<1.5.0
          - azure-keyvault
          - azure-servicebus
      - numpy~=1.18.0
      - scikit-learn==0.22.1
      - pandas~=0.25.0
      - fbprophet==0.5
      - holidays==0.9.11
      - setuptools-git
      - 'psutil>5.0.0,<6.0.0'
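Note that xgboost itself does not appear anywhere in this conda specification; it is only installed into the `gpuexp` env in the Dockerfile. The tracebacks below show the scoring server running from an `/azureml-envs/azureml_<hash>/` interpreter, which AzureML builds from this YAML, so a package installed into a separate conda env in the Dockerfile may not be visible to it. As a sketch of one possible fix (assuming the model was trained against the xgboost 1.1.1 pinned in the Dockerfile), the same version could be pinned in the pip section of this spec:

```yaml
# Sketch: add xgboost to the pip section of the AzureML conda spec,
# matching the 1.1.1 version the Dockerfile installs.
dependencies:
  - pip:
      - xgboost==1.1.1
```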

I have intentionally omitted the name field from the conda file.

Is there something we're missing in the container setup that could cause this to fail in one environment but not the others?

We can see the model under the Endpoints section in Azure Machine Learning Studio, but this error appears in the deployment logs and the endpoint is in a Failed state.

In our three other environments, the endpoint is visible and in a healthy state.
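Because the failure only surfaces inside the worker's init, a defensive pre-check can make the root cause obvious in the deployment logs. This is a sketch, not your actual score.py: the helper name and module list are assumptions; the idea is to fail with an explicit message before `joblib.load` triggers the opaque pickle-time import.

```python
import importlib.util

def ensure_importable(module_names):
    """Raise early, with a clear message, if any required module is
    missing from the environment the scoring server is running in."""
    missing = [m for m in module_names if importlib.util.find_spec(m) is None]
    if missing:
        raise ModuleNotFoundError(
            f"Scoring environment is missing required packages: {missing}"
        )

# e.g. at the top of score.py's init(), before joblib.load(model_path):
# ensure_importable(["xgboost", "lightgbm", "sklearn"])
```

Logging `sys.executable` alongside this check also confirms which environment (the Dockerfile's conda env or an `/azureml-envs/...` env) is actually serving the model.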

Full error message:

    2022-01-11T19:46:08,279016451+00:00 - rsyslog/run 
    2022-01-11T19:46:08,277445539+00:00 - gunicorn/run 
    2022-01-11T19:46:08,280042359+00:00 - iot-server/run 
    /usr/sbin/nginx: /azureml-envs/azureml_5ea1391fd04105b52a0d9fc3d6d367ac/lib/libcrypto.so.1.0.0: no version information available (required by /usr/sbin/nginx)
    2022-01-11T19:46:08,285741101+00:00 - nginx/run 
    /usr/sbin/nginx: /azureml-envs/azureml_5ea1391fd04105b52a0d9fc3d6d367ac/lib/libcrypto.so.1.0.0: no version information available (required by /usr/sbin/nginx)
    /usr/sbin/nginx: /azureml-envs/azureml_5ea1391fd04105b52a0d9fc3d6d367ac/lib/libssl.so.1.0.0: no version information available (required by /usr/sbin/nginx)
    /usr/sbin/nginx: /azureml-envs/azureml_5ea1391fd04105b52a0d9fc3d6d367ac/lib/libssl.so.1.0.0: no version information available (required by /usr/sbin/nginx)
    /usr/sbin/nginx: /azureml-envs/azureml_5ea1391fd04105b52a0d9fc3d6d367ac/lib/libssl.so.1.0.0: no version information available (required by /usr/sbin/nginx)
    rsyslogd: /azureml-envs/azureml_5ea1391fd04105b52a0d9fc3d6d367ac/lib/libuuid.so.1: no version information available (required by rsyslogd)
    EdgeHubConnectionString and IOTEDGE_IOTHUBHOSTNAME are not set. Exiting...
    2022-01-11T19:46:08,407862719+00:00 - iot-server/finish 1 0
    2022-01-11T19:46:08,409832434+00:00 - Exit code 1 is normal. Not restarting iot-server.
    Starting gunicorn 19.9.0
    Listening at: http://127.0.0.1:31311 (11)
    Using worker: sync
    worker timeout is set to 300
    Booting worker with pid: 37
    SPARK_HOME not set. Skipping PySpark Initialization.
    Generating new fontManager, this may take some time...
    Initializing logger
    2022-01-11 19:46:09,674 | root | INFO | Starting up app insights client
    2022-01-11 19:46:09,675 | root | INFO | Starting up request id generator
    2022-01-11 19:46:09,675 | root | INFO | Starting up app insight hooks
    2022-01-11 19:46:09,675 | root | INFO | Invoking user's init function
    Loading model from path.
    2022-01-11 19:46:11,728 | azureml.core | WARNING | Failure while loading azureml_run_type_providers. Failed to load entrypoint automl = azureml.train.automl.run:AutoMLRun._from_run_dto with exception cannot import name 'RunType'.
    Failure while loading azureml_run_type_providers. Failed to load entrypoint automl = azureml.train.automl.run:AutoMLRun._from_run_dto with exception cannot import name 'RunType'.
    2022-01-11 19:46:12,132 | root | ERROR | User's init function failed
    2022-01-11 19:46:12,133 | root | ERROR | Encountered Exception Traceback (most recent call last):
      File "/var/azureml-server/aml_blueprint.py", line 182, in register
        main.init()
      File "/var/azureml-app/main.py", line 35, in init
        driver_module.init()
      File "/structure/azureml-app/scripts/inference/score.py", line 67, in init
        raise e
      File "/structure/azureml-app/scripts/inference/score.py", line 64, in init
        model = joblib.load(model_path)
      File "/azureml-envs/azureml_5ea1391fd04105b52a0d9fc3d6d367ac/lib/python3.6/site-packages/joblib/numpy_pickle.py", line 605, in load
        obj = _unpickle(fobj, filename, mmap_mode)
      File "/azureml-envs/azureml_5ea1391fd04105b52a0d9fc3d6d367ac/lib/python3.6/site-packages/joblib/numpy_pickle.py", line 529, in _unpickle
        obj = unpickler.load()
      File "/azureml-envs/azureml_5ea1391fd04105b52a0d9fc3d6d367ac/lib/python3.6/pickle.py", line 1050, in load
        dispatch[key[0]](self)
      File "/azureml-envs/azureml_5ea1391fd04105b52a0d9fc3d6d367ac/lib/python3.6/pickle.py", line 1347, in load_stack_global
        self.append(self.find_class(module, name))
      File "/azureml-envs/azureml_5ea1391fd04105b52a0d9fc3d6d367ac/lib/python3.6/pickle.py", line 1388, in find_class
        __import__(module, level=0)
    ModuleNotFoundError: No module named 'xgboost'
    2022-01-11 19:46:12,134 | root | INFO | Waiting for logs to be sent to Application Insights before exit.
    2022-01-11 19:46:12,137 | root | INFO | Waiting 30 seconds for upload.
    Worker exiting (pid: 37)
    Shutting down: Master
    Reason: Worker failed to boot.
    2022-01-11T19:46:42,562394399+00:00 - gunicorn/finish 3 0
    2022-01-11T19:46:42,563843910+00:00 - Exit code 3 is not normal. Killing image.

Partial deployment logs for a successfully deployed endpoint using the same .pkl file:

    2022-01-10T20:02:28,608154878+00:00 - rsyslog/run 
    2022-01-10T20:02:28,608160978+00:00 - iot-server/run 
    2022-01-10T20:02:28,609567614+00:00 - gunicorn/run 
    2022-01-10T20:02:28,619823782+00:00 - nginx/run 
    /usr/sbin/nginx: /azureml-envs/azureml_5ea1391fd04105b52a0d9fc3d6d367ac/lib/libcrypto.so.1.0.0: no version information available (required by /usr/sbin/nginx)
    /usr/sbin/nginx: /azureml-envs/azureml_5ea1391fd04105b52a0d9fc3d6d367ac/lib/libcrypto.so.1.0.0: no version information available (required by /usr/sbin/nginx)
    /usr/sbin/nginx: /azureml-envs/azureml_5ea1391fd04105b52a0d9fc3d6d367ac/lib/libssl.so.1.0.0: no version information available (required by /usr/sbin/nginx)
    /usr/sbin/nginx: /azureml-envs/azureml_5ea1391fd04105b52a0d9fc3d6d367ac/lib/libssl.so.1.0.0: no version information available (required by /usr/sbin/nginx)
    /usr/sbin/nginx: /azureml-envs/azureml_5ea1391fd04105b52a0d9fc3d6d367ac/lib/libssl.so.1.0.0: no version information available (required by /usr/sbin/nginx)
    rsyslogd: /azureml-envs/azureml_5ea1391fd04105b52a0d9fc3d6d367ac/lib/libuuid.so.1: no version information available (required by rsyslogd)
    EdgeHubConnectionString and IOTEDGE_IOTHUBHOSTNAME are not set. Exiting...
    2022-01-10T20:02:28,789369303+00:00 - iot-server/finish 1 0
    2022-01-10T20:02:28,791654562+00:00 - Exit code 1 is normal. Not restarting iot-server.
    Starting gunicorn 19.9.0
    Listening at: http://127.0.0.1:31311 (14)
    Using worker: sync
    worker timeout is set to 300
    Booting worker with pid: 40
    SPARK_HOME not set. Skipping PySpark Initialization.
    Generating new fontManager, this may take some time...
    Initializing logger
    2022-01-10 20:02:30,434 | root | INFO | Starting up app insights client
    2022-01-10 20:02:30,435 | root | INFO | Starting up request id generator
    2022-01-10 20:02:30,435 | root | INFO | Starting up app insight hooks
    2022-01-10 20:02:30,435 | root | INFO | Invoking user's init function
    Loading model from path.
    2022-01-10 20:02:32,892 | azureml.core | WARNING | Failure while loading azureml_run_type_providers. Failed to load entrypoint automl = azureml.train.automl.run:AutoMLRun._from_run_dto with exception cannot import name 'RunType'.
    Failure while loading azureml_run_type_providers. Failed to load entrypoint automl = azureml.train.automl.run:AutoMLRun._from_run_dto with exception cannot import name 'RunType'.
    Model loaded succesfully.
    ManagedIdentityCredential will use IMDS

I have also tried using py-xgboost in the conda file and updating packages; however, I then get the following error:

    Traceback (most recent call last):
      File "/var/azureml-server/aml_blueprint.py", line 182, in register
        main.init()
      File "/var/azureml-app/main.py", line 35, in init
        driver_module.init()
      File "/structure/azureml-app/scripts/inference/score.py", line 67, in init
        raise e
      File "/structure/azureml-app/scripts/inference/score.py", line 64, in init
        model = joblib.load(model_path)
      File "/azureml-envs/azureml_a6a4caa8ade8fc5dac7282e2e275c022/lib/python3.6/site-packages/joblib/numpy_pickle.py", line 605, in load
        obj = _unpickle(fobj, filename, mmap_mode)
      File "/azureml-envs/azureml_a6a4caa8ade8fc5dac7282e2e275c022/lib/python3.6/site-packages/joblib/numpy_pickle.py", line 529, in _unpickle
        obj = unpickler.load()
      File "/azureml-envs/azureml_a6a4caa8ade8fc5dac7282e2e275c022/lib/python3.6/pickle.py", line 1050, in load
        dispatch[key[0]](self)
      File "/azureml-envs/azureml_a6a4caa8ade8fc5dac7282e2e275c022/lib/python3.6/pickle.py", line 1347, in load_stack_global
        self.append(self.find_class(module, name))
      File "/azureml-envs/azureml_a6a4caa8ade8fc5dac7282e2e275c022/lib/python3.6/pickle.py", line 1390, in find_class
        return _getattribute(sys.modules[module], name)[0]
      File "/azureml-envs/azureml_a6a4caa8ade8fc5dac7282e2e275c022/lib/python3.6/pickle.py", line 272, in _getattribute
        .format(name, obj))
    AttributeError: Can't get attribute 'XGBoostLabelEncoder' on <module 'xgboost.compat' from '/azureml-envs/azureml_a6a4caa8ade8fc5dac7282e2e275c022/lib/python3.6/site-packages/xgboost/compat.py'>
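Both failure modes point at a mismatch between the environment that pickled the model and the one loading it: a pickle stores classes by module path as text, so `xgboost.compat.XGBoostLabelEncoder` must exist, at exactly that location, at load time. If the py-xgboost version that conda resolves differs from the training-time 1.1.1, the import can succeed yet the attribute lookup still fail. A small stdlib-only sketch of how pickles record import paths by name:

```python
import pickle
import datetime

# A pickle does not embed class code; it records the import path
# ("datetime", "datetime") as a string and re-imports it at load time.
payload = pickle.dumps(datetime.datetime(2022, 1, 11))
assert b"datetime" in payload

# Loading therefore fails with ModuleNotFoundError when the module is
# absent, or AttributeError when the attribute moved between versions:
# exactly the two errors in the tracebacks above.
obj = pickle.loads(payload)
assert obj.year == 2022
```

This suggests pinning the serving environment to the same xgboost version the model was trained with, rather than letting conda pick a py-xgboost release.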

The hyperparameters of the model created by Azure AutoML reference an XGBoost wrapper from Azure ML:

    "spec_class": "sklearn",
    "class_name": "XGBoostClassifier",
    "module": "automl.client.core.common.model_wrappers",
    "param_args": [],
    "param_kwargs": {
        "tree_method": "auto"
    },
    "prepared_kwargs": {}

Hey @Quinn, Katie,

Have you found a solution to this issue? I'm running into the same problem with LightGBM.

Best,
Max