Projects

An MLflow project is a format for packaging data science code in a reusable and reproducible way. The MLflow Projects component includes an API and command-line tools for running projects, which also integrate with the Tracking component to automatically record the parameters and Git commit of your source code for reproducibility. This document describes the steps required to run MLflow projects on Oracle Cloud Infrastructure.

Data Science Jobs

The examples in this section show how to run MLflow projects as OCI Data Science jobs using the different runtimes supported by the service. All examples are taken from the official MLflow GitHub repository.

Prerequisites

  • Based on the General Machine Learning for CPUs on Python 3.8 (generalml_p38_cpu_v1) environment, create and publish a custom conda environment with the following additional libraries:

    • mlflow

    • oci-mlflow

Data Science Config

The OCI Data Science config is a JSON file that contains the authentication information as well as the path to the job template YAML file.

  • json
{
  "oci_job_template_path": "{work_dir}/oci-datascience-template.yaml",
  "oci_auth": "api_key",
  "oci_config_path": "~/.oci/config",
  "oci_profile": "DEFAULT"
}

The {work_dir} placeholder indicates that the YAML template is located inside the project directory. It is automatically replaced with the absolute path to the project. If the YAML template cannot be placed in the project folder, an absolute or relative path can be used instead.
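The placeholder substitution can be pictured with a short Python sketch. It only illustrates the documented behavior and is not the oci-mlflow implementation; the helper name resolve_template_path is hypothetical.

```python
import json
from pathlib import Path


def resolve_template_path(config: dict, project_dir: str) -> str:
    """Return the job template path with {work_dir} expanded.

    Hypothetical helper: it mirrors the documented behavior, where
    {work_dir} is replaced with the absolute path to the project.
    """
    template_path = config["oci_job_template_path"]
    work_dir = str(Path(project_dir).expanduser().resolve())
    return template_path.replace("{work_dir}", work_dir)


# The same config as in the example above.
config = json.loads(
    """
    {
      "oci_job_template_path": "{work_dir}/oci-datascience-template.yaml",
      "oci_auth": "api_key",
      "oci_config_path": "~/.oci/config",
      "oci_profile": "DEFAULT"
    }
    """
)

resolved = resolve_template_path(config, ".")
print(resolved)
```

A path without the placeholder passes through the helper unchanged, which corresponds to the case where an absolute or relative path is given directly.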

Supported authentication types:

Data Science Job Template

The template file describes the infrastructure on which a Data Science job runs, as well as the runtime. It is divided into two main sections: infrastructure and runtime. The template can also be generated with the ads opctl init command. More details can be found in the ADS documentation.

Data Science Job Infrastructure

The Data Science job infrastructure section specifies the configuration of the job instance. It includes the following information:

  • Compartment ID

  • Project ID

  • Subnet ID

  • Compute Shape

  • Block Storage Size

  • Log Group ID

  • Log ID

More details about job infrastructure can be found in the ADS documentation.

  • yaml
infrastructure:
  kind: infrastructure
  spec:
    blockStorageSize: 50
    subnetId: ocid1.subnet.oc1.iad..<unique_ID>
    compartmentId: ocid1.compartment.oc1..<unique_ID>
    projectId: ocid1.datascienceproject.oc1.iad..<unique_ID>
    logGroupId: ocid1.loggroup.oc1.iad..<unique_ID>
    logId: ocid1.log.oc1.iad..<unique_ID>
    shapeConfigDetails:
      memoryInGBs: 20
      ocpus: 2
    shapeName: VM.Standard.E3.Flex
  type: dataScienceJob
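As a quick sanity check before submitting a job, the listed fields can be compared against the infrastructure spec. The following Python sketch works on a plain dictionary that mirrors the YAML above. The helper name and the idea of treating every listed field as expected are assumptions made for illustration; which fields the service actually requires is not stated here.

```python
# Mapping from the fields listed above to the YAML keys used in the example.
EXPECTED_KEYS = {
    "Compartment ID": "compartmentId",
    "Project ID": "projectId",
    "Subnet ID": "subnetId",
    "Compute Shape": "shapeName",
    "Block Storage Size": "blockStorageSize",
    "Log Group ID": "logGroupId",
    "Log ID": "logId",
}


def missing_infrastructure_fields(infrastructure: dict) -> list:
    """Return the labels of listed fields that are absent from the spec.

    Hypothetical helper for illustration only.
    """
    spec = infrastructure.get("spec", {})
    return [label for label, key in EXPECTED_KEYS.items() if key not in spec]


# Dictionary form of the infrastructure section shown above.
infrastructure = {
    "kind": "infrastructure",
    "type": "dataScienceJob",
    "spec": {
        "blockStorageSize": 50,
        "subnetId": "ocid1.subnet.oc1.iad..<unique_ID>",
        "compartmentId": "ocid1.compartment.oc1..<unique_ID>",
        "projectId": "ocid1.datascienceproject.oc1.iad..<unique_ID>",
        "logGroupId": "ocid1.loggroup.oc1.iad..<unique_ID>",
        "logId": "ocid1.log.oc1.iad..<unique_ID>",
        "shapeConfigDetails": {"memoryInGBs": 20, "ocpus": 2},
        "shapeName": "VM.Standard.E3.Flex",
    },
}

print(missing_infrastructure_fields(infrastructure))  # prints []
```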

Data Science Job Runtime

The runtime of a job defines the source code of your workload, environment variables, CLI arguments, and other settings for the environment in which the workload runs. You do not work with the runtimes directly; instead, you specify a YAML definition of the runtime to run an MLflow project as a Data Science job.

Depending on the source code, different runtime types are available for defining a Data Science job:

  • PythonRuntime for Python code stored locally, in OCI Object Storage, or in another remote location supported by fsspec.

  • NotebookRuntime for a single Jupyter notebook stored locally, in OCI Object Storage, or in another remote location supported by fsspec.

  • ContainerRuntime for container images.

  • PythonRuntime
  runtime:
    kind: runtime
    spec:
      args: []
      conda:
        type: published
        uri: <oci://bucket@namespace/prefix>
      env:
      - name: http_proxy
        value: <http://ip:port>
      entrypoint: "{Entry point script. For the MLflow will be replaced with the CMD}"
      scriptPathURI: "{Path to the script. For the MLflow will be replaced with path to the project}"
    type: python
  • NotebookRuntime
  runtime:
    kind: runtime
    spec:
      args: []
      conda:
        type: published
        uri: <oci://bucket@namespace/prefix>
      env:
      - name: http_proxy
        value: <http://ip:port>
      entrypoint: "{Entry point notebook. For MLflow, it will be replaced with the CMD}"
      source: "{Path to the source code directory. For MLflow, it will be replaced with path to the project}"
      notebookEncoding: utf-8
    type: notebook
  • ContainerRuntime
  runtime:
    kind: runtime
    spec:
      image: <iad.ocir.io/namespace/image_name:version>
      cmd: "{Container CMD. For MLflow, it will be replaced with the Project CMD}"
      entrypoint:
      - bash
      - --login
      - -c
    type: container
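The choice between the three runtime types can be summarized in a few lines of Python. This is a minimal sketch of the decision described above, not part of ADS or oci-mlflow: the function name and its arguments are hypothetical, and the returned strings are the type values used in the YAML definitions.

```python
from pathlib import PurePosixPath
from typing import Optional


def choose_runtime_type(source: Optional[str] = None, image: Optional[str] = None) -> str:
    """Pick the runtime `type` value for a job template.

    Hypothetical helper: a container image selects the container runtime,
    a Jupyter notebook selects the notebook runtime, and any other source
    is treated as Python code.
    """
    if image:
        return "container"
    if source is None:
        raise ValueError("either a source path or a container image is needed")
    if PurePosixPath(source).suffix.lower() == ".ipynb":
        return "notebook"
    return "python"


print(choose_runtime_type(source="train.py"))
print(choose_runtime_type(source="train.ipynb"))
print(choose_runtime_type(image="<iad.ocir.io/namespace/image_name:version>"))
```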

Running MLflow project within PythonRuntime

This example demonstrates an MLflow project that trains a linear regression model on the UC Irvine Wine Quality Dataset. To run this example as a Data Science job, the custom conda environment needs to be prepared and published to the Object Storage bucket. The project can be run from source or from a Git URI.

  • To run the project from source, pull the sklearn_elasticnet_wine project from the GitHub repository. If you want to run the project with a Git URI, create a sklearn_elasticnet_wine folder.

  • Prepare and publish a custom conda environment. The libraries listed below need to be installed in your custom conda environment. This step can be skipped if you have already prepared the custom conda environment following the prerequisites section at the beginning of this document.

    • scikit-learn

    • mlflow

    • pandas

    • oci-mlflow

  • Prepare an oci-datascience-config.json file containing the authentication information and path to the job configuration YAML file.

    • json
    {
      "oci_auth": "api_key",
      "oci_job_template_path": "oci-datascience-template.yaml"
    }
    

    Copy the oci-datascience-config.json file to the sklearn_elasticnet_wine folder.

  • Prepare an oci-datascience-template.yaml job configuration file.

    • yaml
    kind: job
    name: "{Job name. For the MLflow will be replaced with the Project name}"
    spec:
      infrastructure:
        kind: infrastructure
        spec:
          blockStorageSize: 50
          subnetId: ocid1.subnet.oc1.iad..<unique_ID>
          compartmentId: ocid1.compartment.oc1..<unique_ID>
          projectId: ocid1.datascienceproject.oc1.iad..<unique_ID>
          logGroupId: ocid1.loggroup.oc1.iad..<unique_ID>
          logId: ocid1.log.oc1.iad..<unique_ID>
          shapeConfigDetails:
            memoryInGBs: 20
            ocpus: 2
          shapeName: VM.Standard.E3.Flex
        type: dataScienceJob
      runtime:
        kind: runtime
        spec:
          args: []
          conda:
            type: published
            uri: <oci://bucket@namespace/prefix>
          entrypoint: "{Entry point script. For the MLflow will be replaced with the CMD}"
          scriptPathURI: "{Path to the script. For the MLflow will be replaced with path to the project}"
        type: python
    

    Copy the oci-datascience-template.yaml file to the sklearn_elasticnet_wine folder.

  • Run the project from the source

    • CLI
      cd ~/sklearn_elasticnet_wine
      export MLFLOW_TRACKING_URI=<tracking_uri>
      mlflow run . --experiment-name My_Experiment --backend oci-datascience --backend-config ./oci-datascience-config.json
    
    • SDK
      import mlflow
    
      mlflow.set_tracking_uri("<tracking_uri>")
    
      mlflow.run(
          ".",
          parameters={"alpha": 0.7, "l1_ratio": 0.06},
          experiment_name="My_Experiment",
          backend="oci-datascience",
          backend_config="oci-datascience-config.json",
      )
    
  • Run the project with a Git URI

    • CLI
      cd ~/sklearn_elasticnet_wine
      export MLFLOW_TRACKING_URI=<tracking_uri>
      mlflow run https://github.com/mlflow/mlflow#examples/sklearn_elasticnet_wine --experiment-name My_Experiment --backend oci-datascience --backend-config ./oci-datascience-config.json
    
    • SDK
      import mlflow
    
      mlflow.set_tracking_uri("<tracking_uri>")
    
    
      mlflow.run(
          "https://github.com/mlflow/mlflow#examples/sklearn_elasticnet_wine",
          experiment_name="My_Experiment",
          backend="oci-datascience",
          backend_config="oci-datascience-config.json",
      )
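On the command line, MLflow accepts project parameters through repeated -P name=value options, which is the CLI counterpart of the parameters argument of mlflow.run. The sketch below assembles such a command as an argument list. The helper name build_mlflow_run_command is hypothetical, and the parameter value is only an example.

```python
import shlex


def build_mlflow_run_command(project_uri, experiment_name, backend_config, parameters=None):
    """Assemble an `mlflow run` command for the oci-datascience backend.

    Hypothetical helper: each project parameter becomes one
    `-P name=value` option.
    """
    command = [
        "mlflow", "run", project_uri,
        "--experiment-name", experiment_name,
        "--backend", "oci-datascience",
        "--backend-config", backend_config,
    ]
    for name, value in (parameters or {}).items():
        command += ["-P", f"{name}={value}"]
    return command


command = build_mlflow_run_command(
    "https://github.com/mlflow/mlflow#examples/sklearn_elasticnet_wine",
    "My_Experiment",
    "./oci-datascience-config.json",
    parameters={"alpha": 0.7},
)

# Print a shell-quoted form of the command for inspection.
print(shlex.join(command))
```

The list can be passed to subprocess.run(command, check=True) once MLFLOW_TRACKING_URI is set in the environment.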
    

Running MLflow project within NotebookRuntime

This example demonstrates an MLflow project that trains a linear regression model on the UC Irvine Wine Quality Dataset. To run this example on the Data Science job, the custom conda environment needs to be prepared and published to the Object Storage bucket.

  • Download the sklearn_elasticnet_wine project from the GitHub repository.

  • Prepare and publish a custom conda environment. The libraries listed below need to be installed in your custom conda environment.

    • scikit-learn

    • mlflow

    • pandas

    • oci-mlflow

  • Prepare an oci-datascience-config.json file containing the authentication information and path to the job configuration YAML file.

    • json
    {
      "oci_auth": "api_key",
      "oci_job_template_path": "{work_dir}/oci-datascience-template.yaml"
    }
    

    Copy the oci-datascience-config.json file to the sklearn_elasticnet_wine folder.

  • Prepare an oci-datascience-template.yaml job configuration file.

    • yaml
    kind: job
    name: "{Job name. For the MLflow will be replaced with the Project name}"
    spec:
      infrastructure:
        kind: infrastructure
        spec:
          blockStorageSize: 50
          subnetId: ocid1.subnet.oc1.iad..<unique_ID>
          compartmentId: ocid1.compartment.oc1..<unique_ID>
          projectId: ocid1.datascienceproject.oc1.iad..<unique_ID>
          logGroupId: ocid1.loggroup.oc1.iad..<unique_ID>
          logId: ocid1.log.oc1.iad..<unique_ID>
          shapeConfigDetails:
            memoryInGBs: 20
            ocpus: 2
          shapeName: VM.Standard.E3.Flex
        type: dataScienceJob
      runtime:
        kind: runtime
        spec:
          args: []
          conda:
            type: published
            uri: <oci://bucket@namespace/prefix>
          entrypoint: "{Entry point notebook. For MLflow, it will be replaced with the CMD}"
          source: "{Path to the source code directory. For MLflow, it will be replaced with path to the project}"
          notebookEncoding: utf-8
        type: notebook
    

    Copy the oci-datascience-template.yaml file to the sklearn_elasticnet_wine folder.

  • Update the MLproject file with the content provided below

    • yaml
    name: tutorial
    
    entry_points:
      main:
        command: "train.ipynb"
    
  • Run the project

    • CLI
      cd ~/sklearn_elasticnet_wine
      export MLFLOW_TRACKING_URI=<tracking_uri>
      mlflow run . --experiment-name My_Experiment --backend oci-datascience --backend-config ./oci-datascience-config.json
    
    • SDK
      import mlflow
      mlflow.set_tracking_uri("<tracking_uri>")
    
      mlflow.run(".",
        experiment_name="My_Experiment",
        backend="oci-datascience",
        backend_config="oci-datascience-config.json"
      )
    

Running MLflow project within ContainerRuntime

This example demonstrates an MLflow project that trains a linear regression model on the UC Irvine Wine Quality Dataset. First, download the docker example from the official MLflow GitHub repository and go through the README.rst document provided with the project. The project uses a Docker image to capture the dependencies needed to run the training code. Running a project in a Docker environment (as opposed to conda) allows for capturing non-Python dependencies, e.g. Java libraries. Once you have completed all steps in README.rst and the project runs in the local environment, follow the steps below to run the project as an OCI Data Science job.

  • Download the docker project from the GitHub repository and place the code in the sklearn_elasticnet_wine folder.

  • Prepare a Docker image following the steps in README.rst. Add the oci-mlflow library to the Dockerfile.

    • shell
    FROM python:3.8
    
    RUN pip install mlflow \
        && pip install oci \
        && pip install oracle-ads \
        && pip install numpy \
        && pip install scipy \
        && pip install pandas \
        && pip install scikit-learn \
        && pip install cloudpickle \
        && pip install oci-mlflow
    
  • Build and publish the image to the OCI container registry

    • shell
    docker tag mlflow-docker-example:<your_tag> <registry_path>/mlflow-docker-example:latest && \
    docker push <registry_path>/mlflow-docker-example:latest
    
  • Prepare an oci-datascience-config.json file containing the authentication information and path to the job configuration YAML file.

    • json
    {
      "oci_auth": "api_key",
      "oci_job_template_path": "{work_dir}/oci-datascience-template.yaml"
    }
    

    Copy the oci-datascience-config.json file to the sklearn_elasticnet_wine folder.

  • Prepare an oci-datascience-template.yaml job configuration file.

    • yaml
    kind: job
    spec:
      name: "{Job name. For the MLflow will be replaced with the Project name}"
      infrastructure:
        kind: infrastructure
        spec:
          blockStorageSize: 50
          subnetId: ocid1.subnet.oc1.iad..<unique_ID>
          compartmentId: ocid1.compartment.oc1..<unique_ID>
          projectId: ocid1.datascienceproject.oc1.iad..<unique_ID>
          logGroupId: ocid1.loggroup.oc1.iad..<unique_ID>
          logId: ocid1.log.oc1.iad..<unique_ID>
          shapeName: VM.Standard.E3.Flex
          shapeConfigDetails:
            memoryInGBs: 20
            ocpus: 2
        type: dataScienceJob
      runtime:
        type: container
        kind: runtime
        spec:
          image: <iad.ocir.io/realm/container:tag>
          cmd: "{Container CMD. For the MLflow will be replaced with the Project CMD}"
          entrypoint:
          - bash
          - --login
          - -c
    

    Copy the oci-datascience-template.yaml file to the sklearn_elasticnet_wine folder.

  • Run the project

    • CLI
      cd ~/sklearn_elasticnet_wine
      export MLFLOW_TRACKING_URI=<tracking_uri>
      mlflow run . --experiment-name My_Experiment --backend oci-datascience --backend-config ./oci-datascience-config.json
    
    • SDK
      import mlflow
      mlflow.set_tracking_uri("<tracking_uri>")
    
      mlflow.run(".",
        experiment_name="My_Experiment",
        parameters={"alpha": 0.7},
        backend="oci-datascience",
        backend_config="oci-datascience-config.json"
      )
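The image value in the job template has to name the same image that was pushed to the registry. One way to keep the two in sync is to derive both from a single image reference, as in the sketch below. The helper names and the namespace value my-namespace are hypothetical; the registry host follows the iad.ocir.io form used in the template above.

```python
def image_reference(registry, namespace, repository, tag="latest"):
    """Compose an image reference of the form registry/namespace/repository:tag."""
    return f"{registry}/{namespace}/{repository}:{tag}"


def tag_and_push_commands(local_image, remote_image):
    """Return the docker commands that tag a local image and push it."""
    return [
        ["docker", "tag", local_image, remote_image],
        ["docker", "push", remote_image],
    ]


# "my-namespace" is a placeholder for illustration.
remote_image = image_reference("iad.ocir.io", "my-namespace", "mlflow-docker-example")

for command in tag_and_push_commands("mlflow-docker-example:latest", remote_image):
    print(" ".join(command))

# The same string goes into the template under runtime.spec.image.
runtime_spec = {"image": remote_image}
print(runtime_spec)
```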
    

Data Flow Applications

The examples demonstrated in this section show how to run MLflow projects on a Data Flow remote Spark cluster. All examples were taken from the MLflow official repository.

Prerequisites

  • Based on the PySpark 3.2 and Data Flow (pyspark32_p38_cpu_v2) environment, create and publish a custom conda environment with the following additional libraries:

    • mlflow

    • oci-mlflow

Running MLflow project within DataflowRuntime

This example demonstrates an MLflow project that trains a logistic regression model on the Iris dataset. To run this example on the Data Flow cluster, the custom conda environment needs to be prepared and published to the Object Storage bucket.

  • Download the pyspark_ml_autologging project from the GitHub repository.

  • Prepare an oci-datascience-config.json file containing the authentication information and path to the job configuration YAML file.

    • json
    {
      "oci_auth": "api_key",
      "oci_job_template_path": "{work_dir}/oci-datascience-template.yaml"
    }
    

    Copy the oci-datascience-config.json file to the pyspark_ml_autologging folder.

  • Prepare an oci-datascience-template.yaml job configuration file. The template can be generated using the ads opctl init command. More details can be found in the ADS documentation.

    • yaml
    kind: job
    name: "{DataFlow application name. For the MLflow will be replaced with the Project name}"
    spec:
      infrastructure:
        kind: infrastructure
        spec:
          compartmentId: ocid1.compartment.oc1..<unique_ID>
          driverShape: VM.Standard.E4.Flex
          driverShapeConfig:
            memory_in_gbs: 32
            ocpus: 2
          executorShape: VM.Standard.E4.Flex
          executorShapeConfig:
            memory_in_gbs: 32
            ocpus: 2
          language: PYTHON
          logsBucketUri: <oci://bucket@namespace>
          numExecutors: 1
          sparkVersion: 3.2.1
          privateEndpointId: ocid1.dataflowprivateendpoint.oc1.iad..<unique_ID>
        type: dataFlow
      runtime:
        kind: runtime
        spec:
          configuration:
            spark.driverEnv.MLFLOW_TRACKING_URI: <http://FQDN-address-of-the-container-instance:5000>
          conda:
            type: published
            uri: <oci://bucket@namespace/prefix>
          condaAuthType: resource_principal
          scriptBucket: <oci://bucket@namespace/prefix>
          scriptPathURI: "{Path to the executable script. For the MLflow will be replaced with the CMD}"
          overwrite: True
        type: dataFlow
    

    The template also specifies a private endpoint (privateEndpointId), which allows the Data Flow cluster to reach the tracking server URI when the tracking server is deployed in a private network. The private endpoint is not required when the tracking server has a public IP address. More details about private endpoints can be found in the official documentation. The spark.driverEnv.MLFLOW_TRACKING_URI property is only required when a private endpoint is used, and should point to the FQDN of the container instance.

    Copy the oci-datascience-template.yaml file to the pyspark_ml_autologging folder.

  • Create an MLproject file in the pyspark_ml_autologging folder.

    • yaml
    name: mlflow-project-dataflow-application
    
    entry_points:
      main:
        command: "logistic_regression.py"
    
  • Run the example project

    • CLI
    cd ~/pyspark_ml_autologging
    export MLFLOW_TRACKING_URI=<tracking_uri>
    mlflow run . --experiment-name My_Experiment --backend oci-datascience --backend-config ./oci-datascience-config.json
    
    • SDK
    import mlflow
    
    mlflow.set_tracking_uri("<tracking_uri>")
    
    mlflow.run(".",
      experiment_name="My_Experiment",
      backend="oci-datascience",
      backend_config="oci-datascience-config.json"
    )
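The template for this example sets both privateEndpointId and spark.driverEnv.MLFLOW_TRACKING_URI, which are only needed when the tracking server sits in a private network. The sketch below captures that rule as plain dictionaries that mirror the YAML keys. The function name is hypothetical, and the sketch is an illustration of the rule rather than code from oci-mlflow.

```python
def dataflow_network_settings(tracking_uri, private_endpoint_id=None):
    """Return (infrastructure spec entries, runtime configuration entries).

    Hypothetical helper: with a private endpoint, both the endpoint OCID and
    the driver environment variable are set; with a publicly reachable
    tracking server, neither is needed.
    """
    infrastructure_spec = {}
    runtime_configuration = {}
    if private_endpoint_id:
        infrastructure_spec["privateEndpointId"] = private_endpoint_id
        runtime_configuration["spark.driverEnv.MLFLOW_TRACKING_URI"] = tracking_uri
    return infrastructure_spec, runtime_configuration


# Tracking server in a private network: both settings are produced.
private = dataflow_network_settings(
    "<http://FQDN-address-of-the-container-instance:5000>",
    private_endpoint_id="ocid1.dataflowprivateendpoint.oc1.iad..<unique_ID>",
)
print(private)

# Tracking server with a public IP address: nothing extra is needed.
public = dataflow_network_settings("<tracking_uri>")
print(public)
```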