Project HIBASTIMAM Part 9: Training a model

Brad Genereaux
7 min read · Apr 12, 2020

In the ninth article of my series How I Built a Space to Train and Infer on Medical Imaging AI Models (HIBASTIMAM), I will cover how I take a prepared COVID-19 dataset along with the Clara Train SDK to fine-tune a medical AI model. Check out Part 1 for what this series is about, which also has links to the other parts.

Train a Model using a Prepared Dataset

Photo by Franck V. on Unsplash

In this post, we’re going to take the COVID-19 dataset we prepared in Step 8, along with the chest classification model we previously downloaded, and then fine-tune it using Clara Train SDK we set up in Step 6. We’ll follow the guide posted here.

Before continuing, please review the important disclaimer in Part 1.

The model we create here is a proof of concept: it is meant to demonstrate the process, not to produce the best, most well-rounded AI model. With the training parameters I’ve used and the limited input data, it won’t generalize well.

This step is more time-consuming once training begins, but it is thrilling to watch it run. Put on some music; for this blog post, my playlist was 8-bit chiptunes, my favorite when I am making something great.

Anyway, let’s do this!

Stop Clara Deploy if it is Running

Clara Deploy may be using a lot of GPU memory, so, to ensure that we have enough for a training run, let’s stop Deploy. In Terminal, run the following commands, one at a time:

clara console stop
clara dicom stop
clara render stop
clara monitor stop
clara platform stop

Create a New Model Folder

We need a place to store our model. Create a directory in the following location:

mkdir /etc/clara/experiments/classification_covidxray_v1/

Let’s Jump into the Clara Train SDK container

Now, run the following command:

docker run -it --rm --shm-size=1G --gpus all --ulimit memlock=-1 --ulimit stack=67108864 --ipc=host --net=host --mount type=bind,source=/etc/clara/experiments,target=/workspace/data nvcr.io/nvidia/clara-train-sdk:v3.0 /bin/bash

You will now be in the Clara Train Docker container, and can pass commands directly to the container.

Download Model MMAR and Clone

You’ll recall we already loaded a chest X-ray model in an earlier post. Because we jumped back into a fresh Docker instance, we’ll need to re-download it quickly and use it as our seed to train with our new labels. Run these commands in the Docker shell:

MODEL_NAME=clara_xray_classification_chest_no_amp
VERSION=1
ngc registry model download-version nvidia/med/$MODEL_NAME:$VERSION --dest /workspace/data/

It will take a couple of minutes to download, as it is about 300 MB.

Next, let’s clone the MMAR.

cp -r /workspace/data/clara_xray_classification_chest_no_amp_v1/* /workspace/data/classification_covidxray_v1/

We can clear out the Readme (leaving a placeholder to fill in later) and drop a config file that isn’t used:

rm /workspace/data/classification_covidxray_v1/docs/Readme.md
echo ADD CONTENT >> /workspace/data/classification_covidxray_v1/docs/Readme.md
rm /workspace/data/classification_covidxray_v1/config/plco.json

Edit Permissions

Let’s fix the permissions, since these files were copied over as “root”. Open a temporary new Terminal window (CTRL-ALT-T) and run the following command:

sudo chown <userID> -R /etc/clara/experiments/classification_covidxray_v1

Enter your password and let it run, then exit this terminal (leaving the original one running).

Edit Training Script

Next, let’s jump into the text editor and edit the training configuration. The steps below use vi, but you might want to try Sublime (a great text editor) and use its “Open Folder” functionality on “/etc/clara/experiments/classification_covidxray_v1”.

First, the training command:

vi /workspace/data/classification_covidxray_v1/commands/train_finetune.sh

We need to adjust three things:

  1. Replace the DATASET_JSON value — it should now read “/workspace/data/covid-training-set/training-images/datalist.json”.
  2. We also need to reduce the learning rate, since we have such a small training set; change the value from 0.0002 to 0.00002.
  3. We need to increase the number of epochs, from 40 to 1000.
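If you prefer to script these edits (handy if you ever have to redo them in a fresh container), they can be sketched in Python. Note the shell variable names DATASET_JSON, LEARNING_RATE, and EPOCHS below are assumptions, not taken from the SDK; open your copy of train_finetune.sh and adjust to match what it actually contains.

```python
import re

def patch_finetune_script(text):
    """Apply the three edits to the fine-tune script.

    The variable names below (DATASET_JSON, LEARNING_RATE, EPOCHS) are
    assumptions -- check them against your copy of train_finetune.sh.
    """
    # 1. Point DATASET_JSON at our prepared COVID-19 data list
    text = re.sub(r'DATASET_JSON=\S+',
                  'DATASET_JSON=/workspace/data/covid-training-set/'
                  'training-images/datalist.json', text)
    # 2. Reduce the learning rate tenfold for the small training set
    text = text.replace('0.0002', '0.00002')
    # 3. Train for many more epochs
    text = text.replace('EPOCHS=40', 'EPOCHS=1000')
    return text

# To apply in place inside the container:
# path = ('/workspace/data/classification_covidxray_v1/'
#         'commands/train_finetune.sh')
# with open(path) as f:
#     script = f.read()
# with open(path, 'w') as f:
#     f.write(patch_finetune_script(script))
```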

Edit Training Configuration

Next, let’s edit our training configuration.

vi /workspace/data/classification_covidxray_v1/config/config_train.json

We need to fix six things.

  1. Change the epochs from 40 to 500.
  2. Change the learning rate from 2e-4 to 2e-5.
  3. Update the “subtrahend” and “divisor” parameters from the CenterData transform, in both the “train” and “validate” sections of the file, with the following values:
          "subtrahend": [128, 128, 128],
"divisor": [128, 128, 128]

  4. Change the image pipeline, since we don’t have very much data to train with. In the “image_pipeline” section under “train”, change the “name” to read “ClassificationKerasImagePipeline”. In the “args” section, add the following parameter:

          "sampling": "automatic",

  5. Make the same pipeline change in the “validate” section (in its “image_pipeline” section, change the “name” to read “ClassificationKerasImagePipeline”), but do NOT add the sampling parameter in args.

  6. Finally, fix up the metrics section:

  • The sections for ComputeAUC with a “class_index” have entries for 15 labels. In our exercise we only have 6, so remove the remainder. Remember that the array is 0-based! The average AUC can be left alone.
  • Update the “name” of each label to match our dataset. For reference, the labels we used in the previous part were: ARDS, COVID-19, Legionella, Pneumocystis, SARS, and Streptococcus.
  • The key metric for this training run is the COVID-19 one. Move the “is_key_metric”: true parameter from the “Average_AUC” element down to the “COVID-19” element.
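The CenterData change appears in both the “train” and “validate” sections, so it is the part most worth scripting. The sketch below assumes transforms are dicts of the shape {"name": ..., "args": ...}, as in the MMAR configs I have seen; confirm that against your own config_train.json before trusting it.

```python
import json

def fix_centerdata(node):
    """Walk the config and set subtrahend/divisor on every CenterData
    transform (covers both the "train" and "validate" sections).

    The {"name": ..., "args": ...} transform shape is an assumption
    based on the MMAR config layout; verify it matches your file.
    """
    if isinstance(node, dict):
        if node.get('name') == 'CenterData':
            node.setdefault('args', {})
            node['args']['subtrahend'] = [128, 128, 128]
            node['args']['divisor'] = [128, 128, 128]
        for value in node.values():
            fix_centerdata(value)
    elif isinstance(node, list):
        for item in node:
            fix_centerdata(item)

# Usage inside the container (writes the file back in place):
# path = '/workspace/data/classification_covidxray_v1/config/config_train.json'
# with open(path) as f:
#     cfg = json.load(f)
# fix_centerdata(cfg)
# with open(path, 'w') as f:
#     json.dump(cfg, f, indent=2)
```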

Edit Validation Configuration

Let’s make some similar changes to the validation configuration file:

vi /workspace/data/classification_covidxray_v1/config/config_validation.json

That file doesn’t have the pipeline configuration, so you’ll only need to do steps 3 and 6 above.

Edit Environment Configuration

Next, let’s edit our environment configuration.

vi /workspace/data/classification_covidxray_v1/config/environment.json

We need to edit the following variables.

"DATASET_JSON": "/workspace/data/covid-training-set/training-images/datalist.json","DATA_ROOT":"/workspace/data/covid-training-set/training-images",

Fine-Tune The Model!

First, let’s navigate to the correct directory to run the command.

cd /workspace/data/classification_covidxray_v1/commands/

The script is not executable, so let’s change the permissions on it. Run this command:

chmod 700 train_finetune.sh

Now, let’s kick this off:

./train_finetune.sh

You will likely have some errors that you will need to correct; you’ll probably need to restart a number of times as you troubleshoot issues with the training data.

Once you have it training, where you see epochs increasing, congratulations!

Now We Wait

This could take quite some time; on my workstation it took just over 30 minutes to complete 1000 epochs of 10 iterations each. Fingers crossed! Watch the statistics at each step.

The best row to watch is “This epoch”, which shows the progress of each iteration and provides statistics for each label. When mine finished, this is what row 1000 looked like:

Epoch: 1000/1000, train_accuracy: 1.0000  train_loss: 0.0000  ARDS: 0.0000  Average_AUC: 0.0000  COVID-19: 0.6021  Legionella: 0.0588  Pneumocystis: 0.8776  SARS: 0.9740  Streptococcus: 0.4082  mean_accuracy: 0.9570  val_time: 0.15s
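If you want to track per-label AUCs across a long run, lines like the one above are easy to parse. A small sketch, based only on the log format shown here:

```python
import re

def parse_epoch_line(line):
    """Extract the epoch number and the name/value metric pairs from a
    Clara training log line in the format shown above."""
    epoch = int(re.search(r'Epoch: (\d+)/', line).group(1))
    metrics = {name: float(value)
               for name, value in re.findall(r'([\w-]+): ([\d.]+)', line)
               if name != 'Epoch'}
    return epoch, metrics
```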

Once complete, we get to see overall details for the run, including the time it takes:

Saved final model checkpoint at: /workspace/data/classification_covidxray_v1/commands/../models/model_final.ckpt
Total time for fitting: 2341.49s
Best validation metric: 0.6583333333333333 at epoch 143
2020-04-29 01:00:02,579 - nvmidl.utils.train_conf - INFO - Total Training Time 2400.649418592453

Check the Tensorboard

We can see how the model training went using Tensorboard, which has a great web GUI to visualize the results. Execute the following command:

python3 -m tensorboard.main --logdir "/workspace/data/classification_covidxray_v1"

That creates a web server, and it will tell you the URL to launch. Wait about two minutes; it takes a bit of time to load the new labels (you’ll see a message that says “Found more than one graph event per run, or there was a metagraph containing a graph_def, as well as one or more graph events. Overwriting the graph with the newest event.”).

Open a web browser to http://localhost:6006/. You can now browse the results of the model training.

When we’re done, press CTRL-C to exit from the web server.

Export the Model

Now, let’s export the model. The export script is also not executable, so let’s change the permissions on it. Run this command:

chmod 700 ./export.sh

Now, let’s run it:

./export.sh

This will save the model into the directory. Let’s exit out of the Docker container:

exit

Lastly, let us bask in the glory of the model we just created. Ownership is granted only to root (because we used sudo to launch the Docker container in interactive mode), so first we have to adjust that (replace <userID> with your user ID):

sudo chown -R <userID> /etc/clara/experiments/classification_covidxray_v1

Then, check out the directory:

ls /etc/clara/experiments/classification_covidxray_v1/

You will see the MMAR structure of the model we have trained.

Congratulations — you made it!

What’s Next

We have done some amazing things — we’ve set up our medical imaging AI environment, prepared a dataset, and now we’ve trained a proof-of-concept model for classification of COVID-19. Awesome! Next, we’ll take that model and connect it to an inference pipeline. We’ll cover this in Part 10.

Thanks for reading. Stay safe, all.
