We're getting an error in one environment when deploying an ML endpoint: it states that xgboost cannot be found, although xgboost is included in the Dockerfile. We do not see this issue in three other environments, where the model deploys fine without this package error.
Dockerfile:
FROM mcr.microsoft.com/azureml/openmpi3.1.2-cuda10.0-cudnn7-ubuntu16.04:20210220.v1
USER root
RUN mkdir -p /etc/OpenCL/vendors && echo "libnvidia-opencl.so.1" > /etc/OpenCL/vendors/nvidia.icd
RUN apt-get update && echo 'success updated apt-get!'
RUN apt-get install -y --no-install-recommends cmake libboost-dev libboost-system-dev libboost-filesystem-dev
RUN conda create -n gpuexp python=3.6.2 -y
###############################
# Pre-Build LightGBM
###############################
RUN cd /usr/local/src && mkdir lightgbm && cd lightgbm && \
git clone --recursive --branch v2.3.0 --depth 1 https://github.com/microsoft/LightGBM && \
cd LightGBM && mkdir build && cd build && \
cmake -DUSE_GPU=1 -DOpenCL_LIBRARY=/usr/local/cuda/lib64/libOpenCL.so -DOpenCL_INCLUDE_DIR=/usr/local/cuda/include/ .. && \
make -j4
###############################
# Install GPU LightGBM and XGBoost
###############################
RUN /bin/bash -c "source activate gpuexp && \
cd /usr/local/src/lightgbm/LightGBM/python-package && python setup.py install --precompile && \
pip install --upgrade --force-reinstall xgboost==1.1.1 && \
source deactivate"
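To confirm at build time that xgboost and LightGBM actually land in the gpuexp environment, a sanity-check step along these lines could be appended to the Dockerfile (a sketch only, assuming gpuexp is the environment used at serving time):
RUN /bin/bash -c "source activate gpuexp && python -c 'import xgboost, lightgbm; print(xgboost.__version__, lightgbm.__version__)'"
If this step fails, the image build fails instead of the endpoint failing later in init().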
Conda:
channels:
- anaconda
- conda-forge
- pytorch
dependencies:
- python=3.6.2
- pip=20.2.4
- pip:
  - azureml-core==1.27.0
  - azureml-pipeline-core==1.27.0
  - azureml-telemetry==1.27.0
  - azureml-defaults==1.27.0
  - azureml-interpret==1.27.0
  - azureml-automl-core==1.27.0
  - azureml-automl-runtime==1.27.0.post2
  - azureml-train-automl-client==1.27.0
  - azureml-train-automl-runtime==1.27.0.post1
  - azureml-dataset-runtime==1.27.0
  - azureml-mlflow==1.27.0
  - inference-schema
  - py-cpuinfo==5.0.0
  - boto3==1.15.18
  - botocore==1.18.18
  - azure-storage-file-datalake
  - azure-identity<1.5.0
  - azure-keyvault
  - azure-servicebus
  - numpy~=1.18.0
  - scikit-learn==0.22.1
  - pandas~=0.25.0
  - fbprophet==0.5
  - holidays==0.9.11
  - setuptools-git
  - 'psutil>5.0.0,<6.0.0'
I have intentionally omitted the name field from the conda file.
Is there something we're missing in the container setup that could cause this to fail in one environment but not the others?
We can see the model under the Endpoints section in Azure Machine Learning studio, but this error appears in the deployment logs and the endpoint is in a Failed state.
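Note that xgboost is only installed into the Dockerfile's gpuexp environment and is not pinned anywhere in the conda specification above. A hedged sketch of an explicit pin, added under the existing pip section, assuming the serving environment is built from this conda file and the pin does not conflict with the azureml-automl-runtime requirements:
  - pip:
    - xgboost==1.1.1
(The version shown mirrors the Dockerfile; the correct pin would be whatever version the model was trained with.)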
In our three other environments, the endpoint is visible and in a healthy state.
Full error message:
2022-01-11T19:46:08,279016451+00:00 - rsyslog/run
2022-01-11T19:46:08,277445539+00:00 - gunicorn/run
2022-01-11T19:46:08,280042359+00:00 - iot-server/run
/usr/sbin/nginx: /azureml-envs/azureml_5ea1391fd04105b52a0d9fc3d6d367ac/lib/libcrypto.so.1.0.0: no version information available (required by /usr/sbin/nginx)
2022-01-11T19:46:08,285741101+00:00 - nginx/run
/usr/sbin/nginx: /azureml-envs/azureml_5ea1391fd04105b52a0d9fc3d6d367ac/lib/libcrypto.so.1.0.0: no version information available (required by /usr/sbin/nginx)
/usr/sbin/nginx: /azureml-envs/azureml_5ea1391fd04105b52a0d9fc3d6d367ac/lib/libssl.so.1.0.0: no version information available (required by /usr/sbin/nginx)
/usr/sbin/nginx: /azureml-envs/azureml_5ea1391fd04105b52a0d9fc3d6d367ac/lib/libssl.so.1.0.0: no version information available (required by /usr/sbin/nginx)
/usr/sbin/nginx: /azureml-envs/azureml_5ea1391fd04105b52a0d9fc3d6d367ac/lib/libssl.so.1.0.0: no version information available (required by /usr/sbin/nginx)
rsyslogd: /azureml-envs/azureml_5ea1391fd04105b52a0d9fc3d6d367ac/lib/libuuid.so.1: no version information available (required by rsyslogd)
EdgeHubConnectionString and IOTEDGE_IOTHUBHOSTNAME are not set. Exiting...
2022-01-11T19:46:08,407862719+00:00 - iot-server/finish 1 0
2022-01-11T19:46:08,409832434+00:00 - Exit code 1 is normal. Not restarting iot-server.
Starting gunicorn 19.9.0
Listening at: http://127.0.0.1:31311 (11)
Using worker: sync
worker timeout is set to 300
Booting worker with pid: 37
SPARK_HOME not set. Skipping PySpark Initialization.
Generating new fontManager, this may take some time...
Initializing logger
2022-01-11 19:46:09,674 | root | INFO | Starting up app insights client
2022-01-11 19:46:09,675 | root | INFO | Starting up request id generator
2022-01-11 19:46:09,675 | root | INFO | Starting up app insight hooks
2022-01-11 19:46:09,675 | root | INFO | Invoking user's init function
Loading model from path.
2022-01-11 19:46:11,728 | azureml.core | WARNING | Failure while loading azureml_run_type_providers. Failed to load entrypoint automl = azureml.train.automl.run:AutoMLRun._from_run_dto with exception cannot import name 'RunType'.
Failure while loading azureml_run_type_providers. Failed to load entrypoint automl = azureml.train.automl.run:AutoMLRun._from_run_dto with exception cannot import name 'RunType'.
2022-01-11 19:46:12,132 | root | ERROR | User's init function failed
2022-01-11 19:46:12,133 | root | ERROR | Encountered Exception Traceback (most recent call last):
File "/var/azureml-server/aml_blueprint.py", line 182, in register
main.init()
File "/var/azureml-app/main.py", line 35, in init
driver_module.init()
File "/structure/azureml-app/scripts/inference/score.py", line 67, in init
raise e
File "/structure/azureml-app/scripts/inference/score.py", line 64, in init
model = joblib.load(model_path)
File "/azureml-envs/azureml_5ea1391fd04105b52a0d9fc3d6d367ac/lib/python3.6/site-packages/joblib/numpy_pickle.py", line 605, in load
obj = _unpickle(fobj, filename, mmap_mode)
File "/azureml-envs/azureml_5ea1391fd04105b52a0d9fc3d6d367ac/lib/python3.6/site-packages/joblib/numpy_pickle.py", line 529, in _unpickle
obj = unpickler.load()
File "/azureml-envs/azureml_5ea1391fd04105b52a0d9fc3d6d367ac/lib/python3.6/pickle.py", line 1050, in load
dispatch[key[0]](self)
File "/azureml-envs/azureml_5ea1391fd04105b52a0d9fc3d6d367ac/lib/python3.6/pickle.py", line 1347, in load_stack_global
self.append(self.find_class(module, name))
File "/azureml-envs/azureml_5ea1391fd04105b52a0d9fc3d6d367ac/lib/python3.6/pickle.py", line 1388, in find_class
__import__(module, level=0)
ModuleNotFoundError: No module named 'xgboost'
2022-01-11 19:46:12,134 | root | INFO | Waiting for logs to be sent to Application Insights before exit.
2022-01-11 19:46:12,137 | root | INFO | Waiting 30 seconds for upload.
Worker exiting (pid: 37)
Shutting down: Master
Reason: Worker failed to boot.
2022-01-11T19:46:42,562394399+00:00 - gunicorn/finish 3 0
2022-01-11T19:46:42,563843910+00:00 - Exit code 3 is not normal. Killing image.
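Every path in the failing traceback points at /azureml-envs/azureml_5ea1391fd04105b52a0d9fc3d6d367ac/lib/python3.6, not at the gpuexp environment created in the Dockerfile. A hypothetical diagnostic that could be placed at the start of score.py's init() to log which interpreter and packages actually serve requests (standard library only, nothing Azure-specific):
import importlib.util
import sys

# Log the serving interpreter and whether xgboost is importable in that environment.
print("python executable:", sys.executable)
print("xgboost available:", importlib.util.find_spec("xgboost") is not None)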
Partial deployment logs for a successfully deployed endpoint using the same pkl file:
2022-01-10T20:02:28,608154878+00:00 - rsyslog/run
2022-01-10T20:02:28,608160978+00:00 - iot-server/run
2022-01-10T20:02:28,609567614+00:00 - gunicorn/run
2022-01-10T20:02:28,619823782+00:00 - nginx/run
/usr/sbin/nginx: /azureml-envs/azureml_5ea1391fd04105b52a0d9fc3d6d367ac/lib/libcrypto.so.1.0.0: no version information available (required by /usr/sbin/nginx)
/usr/sbin/nginx: /azureml-envs/azureml_5ea1391fd04105b52a0d9fc3d6d367ac/lib/libcrypto.so.1.0.0: no version information available (required by /usr/sbin/nginx)
/usr/sbin/nginx: /azureml-envs/azureml_5ea1391fd04105b52a0d9fc3d6d367ac/lib/libssl.so.1.0.0: no version information available (required by /usr/sbin/nginx)
/usr/sbin/nginx: /azureml-envs/azureml_5ea1391fd04105b52a0d9fc3d6d367ac/lib/libssl.so.1.0.0: no version information available (required by /usr/sbin/nginx)
/usr/sbin/nginx: /azureml-envs/azureml_5ea1391fd04105b52a0d9fc3d6d367ac/lib/libssl.so.1.0.0: no version information available (required by /usr/sbin/nginx)
rsyslogd: /azureml-envs/azureml_5ea1391fd04105b52a0d9fc3d6d367ac/lib/libuuid.so.1: no version information available (required by rsyslogd)
EdgeHubConnectionString and IOTEDGE_IOTHUBHOSTNAME are not set. Exiting...
2022-01-10T20:02:28,789369303+00:00 - iot-server/finish 1 0
2022-01-10T20:02:28,791654562+00:00 - Exit code 1 is normal. Not restarting iot-server.
Starting gunicorn 19.9.0
Listening at: http://127.0.0.1:31311 (14)
Using worker: sync
worker timeout is set to 300
Booting worker with pid: 40
SPARK_HOME not set. Skipping PySpark Initialization.
Generating new fontManager, this may take some time...
Initializing logger
2022-01-10 20:02:30,434 | root | INFO | Starting up app insights client
2022-01-10 20:02:30,435 | root | INFO | Starting up request id generator
2022-01-10 20:02:30,435 | root | INFO | Starting up app insight hooks
2022-01-10 20:02:30,435 | root | INFO | Invoking user's init function
Loading model from path.
2022-01-10 20:02:32,892 | azureml.core | WARNING | Failure while loading azureml_run_type_providers. Failed to load entrypoint automl = azureml.train.automl.run:AutoMLRun._from_run_dto with exception cannot import name 'RunType'.
Failure while loading azureml_run_type_providers. Failed to load entrypoint automl = azureml.train.automl.run:AutoMLRun._from_run_dto with exception cannot import name 'RunType'.
Model loaded succesfully.
ManagedIdentityCredential will use IMDS
I have tried using py-xgboost in the conda file and updating packages; however, I then get the following error message:
Traceback (most recent call last):
File "/var/azureml-server/aml_blueprint.py", line 182, in register
main.init()
File "/var/azureml-app/main.py", line 35, in init
driver_module.init()
File "/structure/azureml-app/scripts/inference/score.py", line 67, in init
raise e
File "/structure/azureml-app/scripts/inference/score.py", line 64, in init
model = joblib.load(model_path)
File "/azureml-envs/azureml_a6a4caa8ade8fc5dac7282e2e275c022/lib/python3.6/site-packages/joblib/numpy_pickle.py", line 605, in load
obj = _unpickle(fobj, filename, mmap_mode)
File "/azureml-envs/azureml_a6a4caa8ade8fc5dac7282e2e275c022/lib/python3.6/site-packages/joblib/numpy_pickle.py", line 529, in _unpickle
obj = unpickler.load()
File "/azureml-envs/azureml_a6a4caa8ade8fc5dac7282e2e275c022/lib/python3.6/pickle.py", line 1050, in load
dispatch[key[0]](self)
File "/azureml-envs/azureml_a6a4caa8ade8fc5dac7282e2e275c022/lib/python3.6/pickle.py", line 1347, in load_stack_global
self.append(self.find_class(module, name))
File "/azureml-envs/azureml_a6a4caa8ade8fc5dac7282e2e275c022/lib/python3.6/pickle.py", line 1390, in find_class
return _getattribute(sys.modules[module], name)[0]
File "/azureml-envs/azureml_a6a4caa8ade8fc5dac7282e2e275c022/lib/python3.6/pickle.py", line 272, in _getattribute
.format(name, obj))
AttributeError: Can't get attribute 'XGBoostLabelEncoder' on <module 'xgboost.compat' from '/azureml-envs/azureml_a6a4caa8ade8fc5dac7282e2e275c022/lib/python3.6/site-packages/xgboost/compat.py'>
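This pickle expects the class xgboost.compat.XGBoostLabelEncoder, which only some xgboost releases ship, so the installed version matters as much as the package being present. A hypothetical check that reproduces the failure outside the endpoint:
import xgboost

print("installed xgboost:", xgboost.__version__)
# Raises ImportError if the installed release does not provide the class the pickle references.
from xgboost.compat import XGBoostLabelEncoder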
The hyperparameters within the model created by Azure AutoML reference an XGBoost wrapper class from Azure ML:
{
  "spec_class": "sklearn",
  "class_name": "XGBoostClassifier",
  "module": "automl.client.core.common.model_wrappers",
  "param_args": [],
  "param_kwargs": {
    "tree_method": "auto"
  },
  "prepared_kwargs": {}
}
Hey @Quinn, Katie,
Have you found a solution to your issue? I have the same problem with LightGBM.
Best,
Max