To completely understand the ByteTrack paper, we will go through it step by step in this post.
Multi-object tracking (MOT) aims at estimating bounding boxes and identities of objects in videos. A simple tracking algorithm can involve the following steps.
First, we detect the bounding boxes for all the objects (here we are only detecting persons) in all the frames. In the image below, we have three frames with bounding boxes for all the persons highlighted in yellow, and the corresponding confidence score mentioned at the top of each box. Here we use an object detection algorithm to get the bounding box coordinates in every frame.
Next, we use an algorithm like ByteTrack to associate these detection boxes across frames. In the image below, after applying the tracking algorithm, we assign a tracking id to each object (person). Here, objects with the same tracking id are shown in the same color.
Note: Different tracking algorithms vary in how they implement the process described above. In general, any algorithm that performs detection first and then uses the detection results to get tracking ids is called tracking-by-detection. We will discuss some of those variations later.
I am not going into the details of detection. Let's assume we are using one of the detection models (YOLOv5, YOLOX, RetinaNet, and so on) to get the bounding boxes from each frame. Note that the official implementation of ByteTrack uses the YOLOX model as its object detector.
Now, for bounding box association, we can use one of two approaches:
1. IOU Trackers / Location-based trackers
Here, we assume that the video is captured at a high FPS, so between two consecutive frames there is minimal movement of any object. To associate objects across frames, we can simply calculate the IoU (intersection over union) between detections of two consecutive frames. In the image above, the bounding boxes of two objects are shown across frames (the same object keeps the same color across frames).
Now if we calculate the IoU of the detected bounding boxes, the same object will have a higher overlap across frames than two different objects. In the above image, if we take any two consecutive frames, the IoU for blue-blue or red-red boxes will be higher than for blue-red boxes. So if two detected boxes have a high IoU in two consecutive frames, we can give them the same tracking id (in the image above, objects with the same tracking id are in the same color).
Note: Detected boxes and bounding boxes are used interchangeably.
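To make the idea concrete, here is a minimal sketch of IoU computation and a greedy frame-to-frame association. This is my own illustration, not ByteTrack's code; real trackers replace the greedy loop with Hungarian (linear assignment) matching so that each previous box is used at most once.

import numpy as np

def iou(box_a, box_b):
    """IoU of two boxes in [x1, y1, x2, y2] format."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def associate_by_iou(prev_boxes, prev_ids, curr_boxes, iou_thresh=0.3, next_id=0):
    """Assign each current box the id of its best-overlapping previous box."""
    curr_ids = []
    for box in curr_boxes:
        overlaps = [iou(box, p) for p in prev_boxes]
        best = int(np.argmax(overlaps)) if overlaps else -1
        if best != -1 and overlaps[best] > iou_thresh:
            curr_ids.append(prev_ids[best])  # same object -> same tracking id
        else:
            next_id += 1                     # no good overlap -> new track
            curr_ids.append(next_id)
    return curr_ids, next_id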
2. Feature-based trackers
Here instead of using location information (IoU), we use the features of the detected bounding boxes. We first find the bounding boxes in two frames. Then we calculate the features for each of the bounding boxes.
Then we can use cosine similarity to calculate the similarity of all the boxes from frame 1 against all the boxes from frame 2. We give two detected boxes the same tracking id if their similarity is higher than a threshold and neither has a higher-similarity match with some other bounding box.
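A minimal sketch of this matching, assuming we already have an appearance feature vector for each box (for example, from a re-identification network):

import numpy as np

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def match_by_features(prev_feats, prev_ids, curr_feats, sim_thresh=0.7):
    """Give each current box the id of its most similar previous box,
    provided the similarity clears the threshold."""
    curr_ids = []
    for feat in curr_feats:
        sims = [cosine_similarity(p, feat) for p in prev_feats]
        best = int(np.argmax(sims)) if sims else -1
        if best != -1 and sims[best] > sim_thresh:
            curr_ids.append(prev_ids[best])  # same appearance -> same tracking id
        else:
            curr_ids.append(None)            # unmatched -> start a new track
    return curr_ids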
A location-based tracker fails when the camera itself moves: the relative movement of an object across frames can then be so large that the IoU between its detections drops to 0. In this case, a feature-based method still works because relative position does not matter.
If the detection algorithm has low recall, some objects will be missed by the detector. If the detector keeps failing on the same object for several frames, then when the object is finally detected again, it may not have a positive IoU with its last detected instance from an earlier frame.
Feature-based methods cope better with low recall. Instead of checking only the immediately previous frame for matches, we can check the last N frames, so that even if the model misses an object in some frames, we still have enough history to assign the same tracking id instead of a new one when the object is detected again.
The feature-based method fails when there is very little visual distinction between objects, for example when tracking people with a high-altitude thermal camera. There, every person is a white blob (or a black blob, depending on whether the camera is in white-hot or black-hot mode) and has very similar features. The location-based method works fine here.
Location-based methods are also simpler. A feature-based method has to extract features for all the detected boxes and then calculate a similarity metric, whereas a location-based method just calculates the IoU. A location-based method will therefore generally be faster than a feature-based one.
Location-based methods assume that the camera has a high FPS. Since most recent cameras record at more than 30 FPS, this should not cause any issues.
ByteTrack is an IoU-based association algorithm. Most methods obtain identities by associating only the detection boxes whose scores are higher than a threshold. Objects with low detection scores, e.g. occluded objects, are simply thrown away, which brings non-negligible missed true objects and fragmented trajectories. To solve this problem, ByteTrack uses both high- and low-confidence bounding boxes.
Let’s understand the algorithm step by step:
Let's assume a few things to understand the pseudo-code. The inputs are: a video sequence V; an object detector Det (here, YoloX); and a detection score threshold τ. The output is the set of tracks T of the video. In the beginning, we start with empty tracks.

For each frame in the video, we predict the detection boxes and scores using the YoloX detector. We separate all the detection boxes into two parts, D_high and D_low, according to the detection score threshold τ. Detection boxes whose scores are higher than τ go into the high-score set D_high; detection boxes whose scores are lower than τ go into the low-score set D_low.

After separating the low-score and high-score detection boxes, we adopt a Kalman filter to predict the new location in the current frame of each track in T.

The first association is performed between the high-score detection boxes D_high and all the tracks T (including the lost tracks T_lost). We keep the unmatched detections in D_remain and the unmatched tracks in T_remain.

The second association is performed between the low-score detection boxes D_low and the remaining tracks T_remain after the first association. We keep the unmatched tracks in T_re-remain and simply delete all the unmatched low-score detection boxes, since we view them as background.

The unmatched tracks T_re-remain after the second association are put into T_lost. A track in T_lost is deleted from T only after it has stayed lost for more than a certain number of frames (30 frames in the paper); otherwise, the lost tracks T_lost remain in T.

Finally, we initialize new tracks from the unmatched high-score detection boxes D_remain after the first association.
Note: for the association, we can use either the location-based or the feature-based method, depending on the problem statement. The main addition of ByteTrack is using both low- and high-confidence bounding boxes.
Here, let's check the important parts of the official implementation. The full implementation of the ByteTrack code is here.
First, we initialize a few lists to keep the tracking history:

tracked_stracks: tracks we are currently tracking that are present in the current frame.
lost_stracks: tracks we are currently tracking but which are missing in the current frame; they are kept around based on buffer_size (saved in self.max_time_lost).
removed_stracks: tracks that we tracked before but have removed.

class BYTETracker(object):
    def __init__(self, args, frame_rate=30):
        self.tracked_stracks = []  # type: list[STrack]  # this is T
        self.lost_stracks = []  # type: list[STrack]  # this is T_lost
        self.removed_stracks = []  # type: list[STrack]  # helps remove lost tracks once their buffer duration is over
        self.frame_id = 0  # current frame id
        self.args = args
        #self.det_thresh = args.track_thresh
        self.det_thresh = args.track_thresh + 0.1
        self.buffer_size = int(frame_rate / 30.0 * args.track_buffer)  # how many frames to keep lost tracks
        self.max_time_lost = self.buffer_size  # buffer size
        self.kalman_filter = KalmanFilter()  # Kalman filter; we will look at this in detail later
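The post does not return to the Kalman filter in detail, so for intuition, here is a minimal sketch of the constant-velocity prediction step such trackers use. This is a simplification, not the official KalmanFilter class (which, among other things, scales its noise matrices by box height):

import numpy as np

def kf_predict(mean, covariance, dt=1.0, q=1e-2):
    """Constant-velocity prediction step.
    State: [x, y, a, h, vx, vy, va, vh] = box center, aspect ratio,
    height, and their velocities."""
    ndim = 4
    F = np.eye(2 * ndim)  # state transition matrix
    for i in range(ndim):
        F[i, ndim + i] = dt  # position += velocity * dt
    Q = np.eye(2 * ndim) * q  # process noise (an assumed constant here)
    mean = F @ mean  # project the state ahead
    covariance = F @ covariance @ F.T + Q  # project the uncertainty ahead
    return mean, covariance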
The update function takes care of associating tracks.
First, we rescale the bounding boxes to the original image size. The original image size is saved in the img_info variable and the test image size in the img_size variable. We calculate the scale ratio and then rescale the bounding boxes to the original image size.
def update(self, output_results, img_info, img_size):
    """
    Update function of the BYTETracker class.
    Args:
        output_results: object detection predictions, either [N, (x1, y1, x2, y2, score)] or [N, (x1, y1, x2, y2, cls_conf, obj_conf)]
        img_info: original image size info
        img_size: test image size info
    """
    self.frame_id += 1
    activated_starcks = []
    refind_stracks = []
    lost_stracks = []
    removed_stracks = []

    if output_results.shape[1] == 5:  # [N, x1, y1, x2, y2, score]
        scores = output_results[:, 4]
        bboxes = output_results[:, :4]
    else:  # [N, x1, y1, x2, y2, cls_conf, obj_conf]
        output_results = output_results.cpu().numpy()
        scores = output_results[:, 4] * output_results[:, 5]  # final score = cls_conf * obj_conf
        bboxes = output_results[:, :4]  # x1y1x2y2

    img_h, img_w = img_info[0], img_info[1]
    scale = min(img_size[0] / float(img_h), img_size[1] / float(img_w))
    bboxes /= scale  # rescale bboxes back to the original image size
Next, we separate the bounding boxes into high-confidence and low-confidence bounding boxes. Boxes with confidence less than 0.1 are discarded. Low-confidence bounding boxes have confidence between 0.1 and self.args.track_thresh; high-confidence bounding boxes have confidence higher than self.args.track_thresh.
remain_inds = scores > self.args.track_thresh      # high confidence detections
inds_low = scores > 0.1
inds_high = scores < self.args.track_thresh
inds_second = np.logical_and(inds_low, inds_high)  # 0.1 < score < track_thresh
dets_second = bboxes[inds_second]                  # low confidence boxes (D_low)
scores_second = scores[inds_second]                # used in the second association below
dets = bboxes[remain_inds]                         # high confidence boxes (D_high)
scores_keep = scores[remain_inds]
Next, we convert each detection into a track object. Then we create a track pool by merging the lost tracks with the currently tracked tracks. We update the track-pool tracks with the Kalman filter to predict their locations in the current frame.
if len(dets) > 0:
    '''Detections'''
    detections = [STrack(STrack.tlbr_to_tlwh(tlbr), s) for
                  (tlbr, s) in zip(dets, scores_keep)]
else:
    detections = []

''' Add newly detected tracklets to tracked_stracks'''
unconfirmed = []
tracked_stracks = []  # type: list[STrack]
for track in self.tracked_stracks:
    if not track.is_activated:
        unconfirmed.append(track)
    else:
        tracked_stracks.append(track)

''' Step 2: First association, with high score detection boxes'''
strack_pool = joint_stracks(tracked_stracks, self.lost_stracks)
# Predict the current location with KF
STrack.multi_predict(strack_pool)
Then we calculate the IoU distance between the track-pool tracks and the detections, and assign detections to tracks. We update the matched tracks with the current detection boxes.
dists = matching.iou_distance(strack_pool, detections)
if not self.args.mot20:
    dists = matching.fuse_score(dists, detections)
matches, u_track, u_detection = matching.linear_assignment(dists, thresh=self.args.match_thresh)

for itracked, idet in matches:
    track = strack_pool[itracked]
    det = detections[idet]
    if track.state == TrackState.Tracked:
        track.update(detections[idet], self.frame_id)
        activated_starcks.append(track)
    else:
        track.re_activate(det, self.frame_id, new_id=False)
        refind_stracks.append(track)
Similarly, we take the low-confidence bounding boxes and update the tracks (only those that were not matched during the first association) with detection information from the current frame.
''' Step 3: Second association, with low score detection boxes'''
# associate the unmatched tracks to the low score detections
if len(dets_second) > 0:
    '''Detections'''
    detections_second = [STrack(STrack.tlbr_to_tlwh(tlbr), s) for
                         (tlbr, s) in zip(dets_second, scores_second)]
else:
    detections_second = []
r_tracked_stracks = [strack_pool[i] for i in u_track if strack_pool[i].state == TrackState.Tracked]
dists = matching.iou_distance(r_tracked_stracks, detections_second)
matches, u_track, u_detection_second = matching.linear_assignment(dists, thresh=0.5)
for itracked, idet in matches:
    track = r_tracked_stracks[itracked]
    det = detections_second[idet]
    if track.state == TrackState.Tracked:
        track.update(det, self.frame_id)
        activated_starcks.append(track)
    else:
        track.re_activate(det, self.frame_id, new_id=False)
        refind_stracks.append(track)
The remaining unmatched tracks will be added to the lost track list.
for it in u_track:
    track = r_tracked_stracks[it]
    if not track.state == TrackState.Lost:
        track.mark_lost()
        lost_stracks.append(track)
For the unmatched high-confidence detection boxes, if the detection confidence is higher than det_thresh, we add them as new tracks.
""" Step 4: Init new stracks"""
for inew in u_detection:
track = detections[inew]
if track.score < self.det_thresh:
continue
track.activate(self.kalman_filter, self.frame_id)
activated_starcks.append(track)
For all the tracks in the lost track list, if the time since the track was last seen is higher than max_time_lost, we discard those tracks.
""" Step 5: Update state"""
for track in self.lost_stracks:
if self.frame_id - track.end_frame > self.max_time_lost:
track.mark_removed()
removed_stracks.append(track)
ByteTrack outperforms other tracking methods on the MOT17, MOT20, and HiEve test datasets by a large margin.
Note: There are many benchmark tests done by the authors of the paper and details are in the paper.
Hope you enjoyed this paper. Have a nice day.
This blog aims to cover the basics of what you need to publish a Python package successfully.
To initialize a folder as a Python package, we need to create a file called __init__.py inside the folder. __init__.py runs every time we import the package in our project.
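For reference, a minimal project layout might look like this (all names are illustrative):

my_project/
├── my_package/
│   ├── __init__.py
│   └── core.py
├── README.md
├── VERSION
├── requirements.txt
└── setup.py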
setup.py is crucial: it contains the metadata of our package and binds together everything our module has. It allows us to run python setup.py install and pip install.
# Always prefer setuptools over distutils
from setuptools import setup, find_packages
# To use a consistent encoding
from codecs import open
from os import path

here = path.abspath(path.dirname(__file__))

# Get the long description from the README file
with open(path.join(here, 'README.md'), encoding='utf-8') as f:
    long_description = f.read()

version = open('VERSION').read().strip()
requirements = open('requirements.txt').read().split('\n')

setup(
    name='package_name',
    version=version,  # read from the VERSION file above
    description='what does the package do',
    long_description=long_description,
    author='author name',
    author_email='author email',
    packages=find_packages(),
    install_requires=requirements,
    python_requires=">=3.6",
)
name - the name of our package
version - the version number visible on the PyPI website
description - a short description of the package, displayed on the PyPI website
long_description - a detailed description of the package and its usage
author - the name of the author, displayed on the PyPI website
author_email - the author's email, displayed on the PyPI website
packages - the packages to be built and uploaded to PyPI
install_requires - the package's dependencies
python_requires - the minimum Python version requirement
There are multiple types of Python package builds: source distributions (the .tar.gz source tars) and built distributions (wheels).
PyPI is where we can store all the built packages and source tars. We need to first upload the package to the PyPI website before we can do pip install.
To start the upload, first sign up on the PyPI website and install the twine Python package, which is used to upload packages:
python3 -m pip install --upgrade twine
Next, we build our package for distribution:
python3 -m build
After the build command, we will see a dist folder in the root directory of our project. These distribution files are created by Python and can be installed on any system.
The final step is to upload the distribution files to the website.
python3 -m twine upload --repository testpypi dist/*
This will prompt for username and password after which the package will be uploaded to the website.
Super simple instructions to write tests for a Python package:
mkdir tests
cd tests
ln -s ../your_package_name .
touch test.py # write your tests here
py.test
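For example, a minimal test could look like the following. It assumes your package exposes an add function; replace it with your actual API:

# tests/test.py
from your_package_name import add  # hypothetical import, use your own package

def test_add():
    assert add(2, 3) == 5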
You have now successfully created your own custom Python package and uploaded it to the PyPI website. You can now simply pip install packagename and use it in any project.
]]>In this post, let’s see how to spin up a fluent servers using docker and forward logs from one fluent server to another. We’ll push the logs using fluent-logger
python package. As we care about security, we’ll setup TLS encryption and authentication.
Complete code is available on github.
Let's quickly spin up client and server fluentd instances using docker-compose. Put the following in docker-compose.yml:
version: "3"
services:
clientfluent:
image: fluent/fluentd
volumes:
- ./client_fluentd.conf:/fluentd/etc/fluent.conf
ports:
- 24224:24224
serverfluent:
image: fluent/fluentd
volumes:
- ./server_fluentd.conf:/fluentd/etc/fluent.conf
And put the following in both client_fluentd.conf and server_fluentd.conf:
<source>
@type forward
port 24224
bind 0.0.0.0
</source>
<match *.*>
@type stdout
</match>
What this config does is very simple: both client and server fluentds listen to port number 24224 and print the logs (to stdout). Start the containers using:
$ docker-compose up
Let's send a simple log from Python using fluent-logger. Install the package using pip install fluent-logger. Put the following code in test_fluent.py:
from fluent import sender
logger = sender.FluentSender('app', host='localhost', port=24224)
logger.emit('follow', {'from': 'userA', 'to': 'userB'})
Run this using
$ python test_fluent.py
You should see the following line in docker-compose's logs:
clientfluent_1 | 2020-11-24 11:19:17.000000000 +0000 app.follow: {"from":"userA","to":"userB"}
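If the line doesn't show up, fluent-logger lets you check whether the emit succeeded. A small sketch based on the package's documented API:

from fluent import sender

logger = sender.FluentSender('app', host='localhost', port=24224)
if not logger.emit('follow', {'from': 'userA', 'to': 'userB'}):
    print(logger.last_error)   # inspect why the emit failed
    logger.clear_last_error()  # reset the stored error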
Let's forward the logs from the client fluentd to the server fluentd. We'll make the client fluentd both print the logs and forward them. We just have to modify the <match *.*> section in client_fluentd.conf:
<match *.*>
@type copy
<store>
@type stdout
</store>
<store>
@type forward
<server>
host serverfluent
port 24224
</server>
</store>
</match>
Notice how serverfluent is used as the host name. This works because docker-compose sets up a common network between the containers and allows service names to be used as host names. Kill the previous containers and bring them up again:
$ docker-compose up
And in a different terminal do
$ python test_fluent.py
You should see the log in both clientfluent and serverfluent. Note that there will be a time delay because fluentd uses buffering.
clientfluent_1 | 2020-11-27 06:14:12.000000000 +0000 app.follow: {"from":"userA","to":"userB"}
serverfluent_1 | 2020-11-27 06:13:58.189592100 +0000 fluent.info: {"worker":0,"message":"fluentd worker is now running worker=0"}
serverfluent_1 | 2020-11-27 06:14:12.000000000 +0000 app.follow: {"from":"userA","to":"userB"}
What about security? What if clientfluent connects to serverfluent over the internet? Right now there is neither encryption nor authentication of the communication between the fluentds. Encryption means the communication is not readable by a third party. Authentication means communication is limited to certain trusted parties. Note how these are two distinct concepts: for example, https is all about encryption, while your login to Facebook or Google is about authentication. We want to set up both encryption like https (i.e., TLS) and password-based authentication.
Let's start with encryption. Create a certificate and private key for TLS encryption. You'll be prompted for a passphrase; I used sasank for illustrative purposes. Use a better password.
$ openssl req -new -x509 -sha256 -days 1095 -newkey rsa:2048 \
-keyout fluentd.key -out fluentd.crt
Generating a RSA private key
..+++++
............+++++
writing new private key to 'fluentd.key'
Enter PEM pass phrase:
Verifying - Enter PEM pass phrase:
-----
You are about to be asked to enter information that will be incorporated
into your certificate request.
What you are about to enter is what is called a Distinguished Name or a DN.
There are quite a few fields but you can leave some blank
For some fields there will be a default value,
If you enter '.', the field will be left blank.
-----
Country Name (2 letter code) [AU]:
State or Province Name (full name) [Some-State]:
Locality Name (eg, city) []:
Organization Name (eg, company) [Internet Widgits Pty Ltd]:
Organizational Unit Name (eg, section) []:
Common Name (e.g. server FQDN or YOUR name) []:
Email Address []:
Set permissions on the generated certificate and key:
chmod 700 fluentd.crt
chmod 400 fluentd.key
We want to mount these files into the docker container. Add the following lines to the volumes section of the serverfluent service:
- ./fluentd.crt:/etc/certs/fluentd.crt
- ./fluentd.key:/etc/certs/fluentd.key
And add the following configuration to server_fluentd.conf:
<source>
@type forward
port 24224
bind 0.0.0.0
<transport tls>
cert_path /etc/certs/fluentd.crt
private_key_path /etc/certs/fluentd.key
private_key_passphrase sasank
</transport>
</source>
<match *.*>
@type stdout
</match>
That's it: our target fluentd is now TLS-ready. Now let's configure clientfluent to send TLS-encrypted data. Since we generated a self-signed certificate, we have to mount it into the client fluentd container as well. Make sure the following line is present in the volumes section of the clientfluent service in the docker-compose file:
- ./fluentd.crt:/etc/certs/fluentd.crt
Now adjust the configuration of the source fluentd in client_fluentd.conf:
<source>
@type forward
port 24224
bind 0.0.0.0
</source>
<match *.*>
@type copy
<store>
@type stdout
</store>
<store>
@type forward
transport tls
tls_cert_path /etc/certs/fluentd.crt
tls_verify_hostname false # Set false to ignore cert hostname.
<server>
host serverfluent
port 24224
</server>
</store>
</match>
That's it! We've set up TLS encryption for all packets sent from clientfluent to serverfluent. Note that this is not authentication: we still need to set up a password so that only people with that password are able to send logs to serverfluent.
Just add the following lines to the forward sections of the source and target configurations respectively:
<security>
self_hostname clientfluent
shared_key my_secure_password
</security>
<security>
self_hostname serverfluent
shared_key my_secure_password
</security>
Of course, replace my_secure_password with a secure password that you can share with the client.
Let’s test the setup.
docker-compose up
and
python test_fluent.py
You should see:
clientfluent_1 | 2020-11-27 06:49:08.000000000 +0000 app.follow: {"from":"userA","to":"userB"}
serverfluent_1 | 2020-11-27 06:49:06.928711600 +0000 fluent.info: {"worker":0,"message":"fluentd worker is now running worker=0"}
serverfluent_1 | 2020-11-27 06:49:08.000000000 +0000 app.follow: {"from":"userA","to":"userB"}
That's it! We're done. The logs are now forwarded securely using TLS. You're good to deploy a server fluentd on AWS somewhere and a client fluentd on an edge device!
]]>In this blog, we will unbox qCT-Lung – our latest AI powered product that analyses Chest CT scans for lung cancer. At Qure.ai, we have always taken a holistic approach towards building solutions for lung health. qXR provides automated interpretation of chest X-rays and is complemented by qTrack, a disease & care pathway management platform with AI at its core. qCT-Lung augments our lung health suite with the ability to detect lung nodules & emphysema on chest CTs and analyze their malignancy. It can quantify & track nodules over subsequent scans. qCT-Lung is a CE certified product.
Medical Imaging has seen the biggest healthcare advancements in artificial intelligence (AI) and lung health has been at the forefront of these improvements. Lung health has also been a key domain of our product portfolio. We’ve built AI algorithms like qXR, which provides automated interpretation of chest X-rays. We augmented its capabilities with qTrack – our AI powered disease management platform, which solves for active case finding & tracking patients in care pathways. These applications have empowered healthcare practitioners at all stages of the patient journey in TB, Covid-19 & lung cancer screenings.
We’re adding a new member to our lung health suite: qCT-Lung. Its AI-powered algorithms can interpret chest CTs for findings like lung nodules & emphysema, and analyze their malignancy. It empowers clinicians to detect lung cancer in both screening programs as well as opportunistic screening settings.
qXR & qCT-Lung’s abilities to support clinicians with detection of lung cancer on chest X-rays & CTs complement qTrack’s disease management & patient tracking capability. Together, they round up our lung health portfolio to make it a comprehensive, powerful & unique offering.
Lung cancer is the second most common cancer in both men & women. 2.2 million people were diagnosed with lung cancer worldwide in 2020 [1]. With 1.74 million deaths in 2020, lung cancer is also the leading cause of cancer related deaths (18.4%) resulting in more deaths than the second and third deadliest cancers combined (colorectal - 9.2% & stomach - 8.2%).
Future projections don't look good either. Lung cancer incidence is projected to rise by 38% and mortality by 39% by 2030 [2].
There are two main types of lung cancer:
Non-small cell lung cancer (NSCLC): NSCLC comprises 80-85% of all lung cancer cases. Its major subtypes are adenocarcinoma, squamous cell carcinoma, and large cell carcinoma. They are grouped together because of similarities in treatment & prognosis.
Small cell lung cancer (SCLC): SCLC tends to grow and spread faster than NSCLC. 10-15% of all lung cancers are SCLC.
There are also cancers that start in other organs (like the breast) and spread to the lungs, but these are not classified as lung cancer.
The 5-year survival is a measure of what percent of people live at least 5 years after the cancer is found. The 5-year survival rates for both NSCLC & SCLC look as follows [4]:
The data shows that lung cancer mortality can be reduced significantly if detected & treated early.
Data from England shows that the chances of surviving for at least a year decrease from 90% to 20% from the earliest to the most advanced stage of lung cancer [5]. WHO elaborates on two components for early detection [6]:
Early identification of cancer results in a better response to treatment, greater chances of survival, less morbidity & less expensive treatment. It comprises three components:
Screening is aimed at identifying individuals with findings suggestive of lung cancer before they have developed symptoms. Further tests are conducted to establish if the diagnosis should be followed or referral for treatments should be made. They’re effective because symptoms of lung cancer do not appear until the disease is already at an advanced stage.
Screening programs use regular chest X-rays and low-dose CT (LDCT) scans to study people at higher risk of getting lung cancer. CT scans have proven to be more effective than X-rays: they resulted in a 20% reduction in lung cancer-specific deaths compared to X-rays [2]. However, X-rays are more accessible and cheaper and thus remain important for low-income settings.
The U.S. Preventive Services Task Force (USPSTF) recommends yearly lung cancer screening with LDCT for people who [9]:
Chest CTs are comparatively more accurate than chest X-rays for identification of thoracic abnormalities. This is because of lack of superimposition, greater contrast, and spatial resolution. However, there are many challenges in identifying & reporting lung cancer on Chest CTs. These challenges can be divided into the following categories:
A study revealed that 42.5% of malpractice suits against radiologists are because of failure to diagnose lung cancer [14]. These lawsuits can cost as much as $10M [15]. Misdiagnosis can occur for two reasons [11]:
Lesion characteristics: small dimension, poor conspicuousness, ill-defined margins, and central location are the most common lesion characteristics that lead to missed lung cancers.
Observer Error: There are multiple sources of observer error like:
After a lesion is detected, a major challenge is to analyse its characteristics and determine malignancy. Even when a lesion's malignancy is determined correctly, tracking it over subsequent scans is challenging for screening programs due to the lack of appropriate CADs & tools.
Structured reporting helps to categorize results and recommend follow-ups based on the likelihood of malignancy, considering the size, appearance, and growth of the lesion. Further, volume measurement & volume doubling time (VDT) have been proposed in the management protocol of the NELSON lung cancer screening trial [13]. All these metrics are challenging to calculate & report in the absence of appropriate tools, which makes it hard to standardize follow-up recommendations based on guidelines like the Fleischner Society recommendations or Lung-RADS scores.
Certain other pulmonary findings like COPD (chronic obstructive pulmonary disease) are an independent risk factor for lung cancer. Lung cancer screening subjects have a high prevalence of COPD which accounts for significant morbidity and mortality. One of the major benefits of emphysema (a type of COPD) quantification in lung cancer screening patients is an earlier diagnosis and therapy of COPD with smoking cessation strategies. It can potentially lead to less COPD-related hospitalizations.
Interpreting CT scans is a time intensive process. A CT scan can have 16 to 320 slices compared to one or two images in an X-ray. Radiologists spend 5-10 minutes to interpret & report each CT scan.
For chest CTs, detecting small nodules through hundreds of slices consumes a lot of time. There are tools that help with some of these issues but none of them solve for lung cancer screening comprehensively.
qCT-Lung empowers lung cancer screening programs and facilitates opportunistic screening by detecting malignant lesions using AI. It is aimed at helping clinicians with all the issues discussed in the previous section - misdiagnosis, analysis, reporting, detection of co-findings & reducing time constraints. The algorithm is trained on more than 200k chest CTs and can detect, analyze, monitor and auto-report lung nodules. This is how qCT-Lung assists clinicians in interpreting chest CTs for lung nodules:
qCT-Lung can distinguish lung lesions from complex anatomical structures on lung CTs and minimize instances of lung cancers going undetected by preventing nodules from being overlooked on scans. Faster and more accurate detection helps decrease time to treatment and improves patient outcomes.
qCT analyzes nodule characteristics to determine malignancy. The algorithm also assigns a malignancy risk score for each of the nodules that helps clinicians plan treatments.
qCT-Lung utilizes pre-populated results to offer clinicians faster reporting, which reduces time to treatment and further diagnosis. It can also recommend timelines for follow-up scans.
qCT-Lung also offers a lung nodule reporting platform that is designed for screening programs. It enables clinicians to choose which nodules to include in the report and also to add new nodules. The platform pre-populates the image viewer with nodules identified by qCT-Lung. Clinicians can then exclude nodules from this list or add new ones. The final list, after these changes, is sent to the RIS.
The platform empowers physicians to modify the results generated by qCT-Lung and report on what’s profoundly important for them.
We have built an end-to-end portfolio for managing lung cancer screening in all kinds of resource settings. Lung cancer screening has many challenges. While CT is the recommended imaging modality, resource-limited settings must depend on X-rays for their cost benefit and easy availability. Patient tracking, disease management, and long-term follow-up for high-risk individuals are also a challenge. Our comprehensive lung health suite takes care of these challenges.
Together, these solutions can help in active case screening, monitoring disease progression, reducing turn-around-time, linking care to treatment, & improving care pathways.
Write to us at qct-lung@qure.ai to integrate qCT-Lung in your lung nodule management pathway.
National Lung Screening Trial Research Team, Aberle DR, Berg CD, et al. “The National Lung Screening Trial: overview and study design.” Radiology. 2011;258(1):243–253.
del Ciello A, et al. “Missed lung cancer: when, where, and why? Diagnos.” Intervent. Radiol. 2017;23:118–126. doi: 10.5152/dir.2016.16187.
Widmann, G. “Challenges in implementation of lung cancer screening—radiology requirements.” memo 12, 166–170 (2019).
Dong Ming Xu, Hester Gietema, Harry de Koning, René Vernhout, Kristiaan Nackaerts, Mathias Prokop, Carla Weenink, Jan-Willem Lammers, Harry Groen, Matthijs Oudkerk, Rob van Klaveren, “Nodule management protocol of the NELSON randomised lung cancer screening trial”, Lung Cancer, Volume 54, Issue 2, 2006, Pages 177-184, ISSN 0169-5002
Baker SR, Patel RH, Yang L, Lelkes VM, Castro A 3rd. “Malpractice suits in chest radiology: an evaluation of the histories of 8265 radiologists.” J Thorac Imaging. 2013 Nov;28(6):388-91.
HealthImaging: Lung cancer missed on CT prompts $10M lawsuit against U.S. government
Stroke is a leading cause of death. Stroke care is limited by the availability of specialized medical professionals. In this post, we describe a physician-led stroke unit model established at Baptist Christian Hospital (BCH) in Assam, India. Enabled with qER, Qure’s AI driven automated CT Brain interpretation tool, BCH can quickly and easily determine next steps in terms of treatment and examine the implications for clinical outcomes.
Across the world, stroke is a leading cause of death, second only to ischemic heart disease. According to the World Stroke Organization (WSO), 13.7 million new strokes occur each year and there are about 80 million stroke survivors globally. In India, as per the Health of the Nation's States report, the incidence rate is 119 to 152 per 100,000, with a case fatality rate of 19 to 42% across the country.
Catering to tea plantation workers in and around the town of Tezpur, the Baptist Christian Hospital, Tezpur (BCH) is a 130-bed secondary care hospital in the North eastern state of Assam in India. This hospital is a unit of the Emmanuel Hospital Association, New Delhi. From humble beginnings, offering basic dispensary services, the hospital grew to become one of the best healthcare providers in Assam, being heavily involved in academic and research work at both national and international levels.
Nestled below the Himalayas, interspersed with large tea plantations, Assam's indigenous population and tea garden workers show a prevalence of hypertension, the largest single risk factor for stroke, reportedly between 33% and 60.8%. Anecdotal reports and hospital-based studies indicate a huge burden of stroke in Assam - a significant portion of which is addressed by Baptist Hospital. A recent study showed that hemorrhagic strokes account for close to 50% of the cases here, compared to only about 20% of the strokes in the rest of India.
One of the biggest obstacles in Stroke Care is the lack of awareness of stroke symptoms and the late arrival of the patient, often at smaller peripheral hospitals, which are not equipped with the necessary scanning facilities and the specialists, leading to a delay in effective treatment.
The doctors and nurses of the Stroke Unit at BCH, Tezpur were trained online by specialist neurologists, who in turn trained the rest of the team on a protocol that included Stroke Clinical Assessment, monitoring of risk factors and vital parameters, and other supportive measures like management of Swallow assessment in addition to starting the rehabilitation process and advising on long term care at home. A study done at Tezpur indicated that post establishment of Stroke Unit, there was significant improvement in the quality of life along with reduction in deaths compared to the pre-Stroke Unit phase.
This is a crucial development in Stroke care especially in the low and middle income countries(LMIC) like India, to strengthen the peripheral smaller hospitals which lack specialists and are almost always the first stop for patients in emergencies like Stroke.
The guidelines for management of acute ischemic stroke involve capturing a non-contrast CT (NCCT) study of the brain along with CT or MRI angiography and perfusion, and thrombolysis (administration of rTPA, tissue plasminogen activator) within 4.5 hours of symptom onset. Equipped with a CT machine and teleradiology reporting, the physicians at BCH provide primary intervention for these stroke cases after a basic NCCT and may refer them to a tertiary facility, as applicable. They follow a telestroke model: in cases where thrombolysis is required, the ER doctors consult with neurologists at a more specialized center, and the decision making is done by sharing the NCCT images via phone-based mediums like WhatsApp, while severe cases of head trauma are referred for further management to faraway tertiary facilities. Studies of the physician-based stroke unit model in Tezpur have shown an improvement in treatment outcomes.
BCH and Qure have worked closely since the onset of the COVID-19 pandemic, especially at a time when confirmatory RT-PCR kits were limiting. qXR, Qure’s AI aided chest X-ray solution had proved to be a beneficial addition for identification of especially asymptomatic COVID-19 suspects and their treatment and management, beyond its role in comprehensive chest screening.
In efforts to improve the workflow of stroke management and care at the Baptist hospital, qER, an FDA-approved and CE-certified software that can detect 12 abnormalities, was deployed. The abnormalities, which include five types of intracranial hemorrhage, cranial fractures, mass effect, midline shift, infarcts, hydrocephalus, and atrophy, are detected in less than 1-2 minutes of the CT being taken. qER has been trained on CT scans from more than 22 different CT machine models, making it hardware agnostic. In addition to offering a pre-populated radiology report, the HIPAA-compliant qER solution is also able to label and annotate the abnormalities in the key slices.
Since qER integrates seamlessly with the site's existing technical framework, the deployment of the software was completed in less than an hour, along with setting up a messaging group for the site. Soon after, within minutes of taking a head CT, qER analyses were available in the PACS worklist, along with messaging alerts for the physicians' and medical team's review on their mobile phones.
The aim of this pilot project was to evaluate how qER could add value to a secondary care center where the responsibility for determining medical intervention falls on the physicians, based on a teleradiology report available to them within 15-60 minutes. As is established in stroke care, every minute saved is precious.
At the outset, there were apprehensions amongst the medical team about the performance of the software and its efficacy in improving the workflow, however, this is what they have to say about qER after 2 months of operation:
“qER is good as it alerts the physicians in a busy casualty room even without having to open the workstation. We know if there are any critical issues with the patient” - Dr. Jemin Webster, a physician at Tezpur.
He goes on to explain how qER helps grab the attention of the emergency room doctors and nurses to critical cases that need intervention, or in some instances, referral. It helps in boosting the confidence of the treating doctors in making the right judgement in the clinical decision-making process. It also helps in seeking the teleradiology support’s attention into the notified critical scans, as well as the scans of the stroke cases that are in the window period for thrombolysis. Dr. Jemin also sees the potential of qER in the workflow of high volume, multi-specialty referral centers, where coordination between multiple departments are required.
A technology solution like qER can reduce the time to diagnosis in case of emergencies like Stroke or trauma and boosts the confidence of Stroke Unit, even in the absence of specialists. The qER platform can help Stroke neurologists in the Telestroke settings access great quality scans even on their smartphones and guide the treating doctors for thrombolysis and further management. Scaling up this technology to Stroke units and MSUs can empower peripheral hospitals to manage acute Stroke especially in LMICs.
We intend to conduct an observational time-motion study to analyze the door-to-needle time with qER intervention via instant reports and phone alerts as we work through the required approvals. Also in the pipeline is a performance comparison of qER reporting against the radiologist report as ground truth, along with a comparison of clinical outcomes and these parameters before and after the introduction of qER into the workflow. We also plan to extend the pilot project to Padhar Mission Hospital, MP and the Shanthibhavan Medical Center, Simdega, Jharkhand.
Qure team is also working on creating a comprehensive stroke platform which is aimed at improving stroke workflows in LMICs and low-resource settings.
]]>Why an agile monitoring and management system is the need of the hour
The world is seeing unprecedented times. Two relentless years of a pandemic are enough to break even the strongest healthcare systems, and the current efforts to ramp up vaccinations and clinical care have left most countries' public health systems strained or barely functional. Such accelerated development and rollout of multiple vaccines is a first for any disease. It is important that each of these vaccines and its effects are monitored for substantial periods of time to understand the short-term and long-term effects on varying demographic and risk-factor profiles of vaccine recipients.
The traditional surveillance systems in most countries rely heavily on healthcare providers to notify adverse events. This is a passive surveillance system that helps in detecting unsolicited adverse events. A vaccine survey is another conventional method, but its disadvantage is that it is usually a cross-sectional survey with only a one-time follow-up. One way to augment these traditional surveillance systems is to empower the vaccine recipient using smartphone-based digital tools. AVSS, or Active Vaccine Safety Surveillance, systems help by proactively enrolling many vaccine recipients who are followed up for all minor and major adverse events. This can significantly take the burden off frontline workers, while capturing a large amount of data, frequently and in a timely fashion. Besides allowing healthcare systems to address immediate or delayed adverse events, this has the potential to monitor the health of the community in the long term as well.
Policy formulation has also been extremely difficult for governments and world organisations given the novelty of the disease. This solution could allow for faster data driven decision making empowering governments and policy makers in a way that only technology can.
Post-vaccine monitoring: Before COVID-19, vaccines used to be licensed after 4-15 years of rigorous clinical trials. With the fast-tracked development of COVID-19 vaccines, there is a likelihood that some rare and long-term adverse events may have gone undetected in the clinical trials. Through AVSS via phone, vaccine recipients can be monitored for a period ranging from 7 days up to 12 months, with real-time alerts for any serious adverse events following immunization (AEFI) or adverse events of special interest (AESI). By automating this process, we have successfully tracked symptoms fast enough to be of actionable value, with the healthcare worker getting involved only if necessary.
Large data collection and analysis: We need interoperable systems that can harmonise data from multiple sites, with a validated AI algorithm to measure the risk of AEFIs and their early indicators. The system needs to be agile and scalable to work in varying resource settings.
Country-level surveillance: There must be a centralised dashboard for policy makers and regulatory authorities to visualize community vaccine uptake statistics, AEFI patterns, and efficacies.
qScout is Qure.ai's artificial intelligence and NLP-powered solution that improves the vaccine recipient's experience while augmenting traditional surveillance systems for individual health monitoring. It has a smartphone-based component for easy interaction between the recipient and public health professionals.
How can qScout be used for active surveillance and monitoring of vaccinees?
Step 1: Walk-in/registered individuals at COVID-19 vaccination sites will be enrolled using qScout EMR by recording the following details:
Step 2: Once the enrollment is completed, the vaccinated person receives a message on their mobile for consent. Follow-up messages are sent for a set period to check for any adverse or unexpected symptoms (AEFIs or AESIs). The person is also reminded about the second dose. Every enrolled individual is monitored for a predefined period, as per the guidelines of the proposed project.
Step 3: Public health officials who have access to the data can see the analysis of the AEFIs or AESIs on a real-time dashboard. The information is segregated based on demographics, type of vaccine administered, count of individuals administered dose 1 and/or dose 2, as well as the percentage drop-out between the doses.
Benefits of Real-time remote patient monitoring after vaccination
During the first wave of the COVID-19 pandemic, the qScout platform was adopted as the national contact tracing and management mechanism by the Ministry of Health in Oman. Within a span of a few weeks, qScout was integrated with Tarassud Plus, the country's ICT platform for surveillance and monitoring. qScout used an AI chatbot customised to the local languages that engaged with confirmed cases, capturing their primary and secondary symptoms. The AI engine analysed the information and provided insights, enabling virtual triaging and timely escalation of medical requirements. Over a span of 8 months, approximately 400,000 COVID-19 patients under quarantine in Oman regularly interacted with the software over thousands of sessions, taking a significant proportion of the burden off healthcare workers. All the while, the health authorities and government kept an active central watch to monitor hotspot regions, areas needing additional resources, and so on. Having qScout enabled with multilingual support in English as well as Arabic helped increase the ease of interaction for various users.
The software was deployed with a gadget that relayed instant reports to the competent authorities about the movements and locations that a quarantined or infected person visited. It also had the capability to send alerts if the person left their location or tried taking the gadget off. This level of data collection allowed sharing relevant insights with the Ministry of Health about population-level statistics vital for resource planning. This, coupled with Qure.ai's qScout, was a true exemplar of the use of technology to tackle the pandemic.
There are multiple studies that are ongoing with regional and state governments as well as non-governmental organizations. qScout is designed as a platform for monitoring safety and efficacy of all adult and pediatric vaccines and medications.
]]>vRad, a large US teleradiology practice and Qure.ai have been colloborating for more than an year for a large scale radiology AI deployment. In this blog post, we describe the engineering that goes into scaling radiology AI. We discuss adapting AI for extreme data diversity, DICOM protocol and software engineering.
vRad and Qure.ai have been collaborating on a large-scale prospective validation of qER, Qure.ai's model for detecting intracranial hemorrhages (ICH), for more than a year. vRad is a large teleradiology practice – 500+ radiologists serving over 2,000 facilities in the United States – representing patients from nearly all states. vRad uses an in-house built RIS and PACS that processes over 1 million studies a month, with the majority of those studies being XR or CT. Of these, about 70,000 CT studies a month get processed by Qure.ai's algorithms. This collaboration has produced interesting insights into the challenges of implementing AI on such a large scale. Our earlier work together is published elsewhere at Imaging Wire and vRad's blog.
Before we discuss the accuracy of models, we have to start with how we actually measure it at scale. In this respect, we have leveraged our experience from prior AI endeavors. vRad runs the imaging models during validation in parallel with production flows. As an imaging study is ingested into the PACS, it is sent directly to validation models for processing. In turn, as soon as the radiologist on the platform completes their report for the scan, we use it to establish the ground truth. We used our Natural Language Processing (NLP) algorithms to automatically read these reports to assign whether the current scan is positive or negative for ICH. Thus, the sensitivity and specificity of a model can be measured in real-time this way on real-world data.
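For illustration, here is a minimal sketch of how such a real-time measurement can be kept as (model prediction, NLP-derived ground truth) pairs stream in. This is my own simplification, not vRad's actual system:

class RunningMetrics:
    """Accumulates a confusion matrix and reports sensitivity/specificity."""

    def __init__(self):
        self.tp = self.fp = self.tn = self.fn = 0

    def update(self, predicted_positive, actual_positive):
        if predicted_positive and actual_positive:
            self.tp += 1      # model and report agree on ICH
        elif predicted_positive:
            self.fp += 1      # model flagged ICH, report says negative
        elif actual_positive:
            self.fn += 1      # model missed an ICH the report found
        else:
            self.tn += 1      # both negative

    @property
    def sensitivity(self):
        return self.tp / (self.tp + self.fn) if (self.tp + self.fn) else None

    @property
    def specificity(self):
        return self.tn / (self.tn + self.fp) if (self.tn + self.fp) else None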
AI models often perform well in the lab but do not live up to expectations when tried in a real-world clinical workflow. This is a combination of problems. The idea of a diverse, heterogeneous cohort of patients is well discussed in the space of medical imaging. In this case, Qure.ai's model was measured on a cohort of patients representative of the entire US population – with studies from all 50 states flowing through the model and being reported against.
Less commonly discussed are the challenges with the uniqueness of data that is a hospital or even imaging device-specific. vRad receives images from over 150,000 unique imaging devices in over 2,000 facilities. At a study level, different facilities can have many different study protocols – varying amounts of contrast, varying radiation dosages, varying slice thicknesses, and other considerations can change how well a human radiologist can evaluate a study, let alone the AI model.
Just like human radiologists, AI models do their best if they see consistent images at pixel level despite the data diversity. Nobody would want to recalibrate their decision process just because different manufacturers chose to use different post-processing techniques. For example, image characteristics of a thin slice CT scan are quite different from a 5mm thick scan with the former being considerably noisier. Both AI and doctors are sure to be confused if asked to decide whether those subtle hyperdense dots that they see on a thin slice scan are just noise or symptoms of diffuse axonal injury. Therefore, we invested considerably in making sure the diverse data is pre-processed into highly consistent raw pixel data. We discuss more in the following section.
Dealing with patient and data diversity is a major component of AI models. The AI model not only has to be generalizable at the pixel level, it also must make sure the right pixels are fed into it. The first problem is well documented in the AI literature, but the second one, not so much. Traditional AI imaging models are trained to work on natural images (think cat photos) and deal with simplistic data formats like PNG or JPEG. However, medical imaging is highly structured and complex and contains orders of magnitude more data compared to natural images. DICOM is the file format and standard used for storing and transferring medical images.
While DICOM is a robust and well-adopted standard, implementation details vary. Often DICOM tags differ greatly from facility to facility, private tags vary from manufacturer to manufacturer, encodings and other imaging-device specific differences in DICOM require that any piece of software, including an AI model, be robust and good at error handling. After a decade of receiving DICOM from all over the U.S., the vRad PACS still runs into new unique configurations and implementations a few times a year, so we are uniquely sensitive to the challenges.
We realized that we need another machine learning model to solve this interoperability problem itself. How do we recognize that this particular CT image is not a brain image even if the description of images says so? How do we make sure the complete brain is present in the image before we decide there is a bleed in it? Variability of DICOM metadata doesn’t allow us to write simple rules which can work at scale. So, we have trained another AI model based on metadata and pixels which can make the above decisions for us.
These challenges harken back to classic healthcare interoperability problems. In a survey by Philips, the majority of younger healthcare professionals indicated that improved interoperability between software platforms and healthcare practices is important for their workplace satisfaction. Interestingly, these are the exact challenges medical imaging AI has to solve for it to work well. So, AI generalizability is just another name for healthcare interoperability. Given how we used machine learning and computer vision to solve the interoperability problems for our AI model, it might be that solving wider interoperability problems might involve AI itself.
But even after those generalizability/interoperability challenges are overcome, a model must be hosted in some manner, often in a docker-based solution, frequently written in Python. And like the model, this wrapper must scale the solution. It must handle calls to the model and returning results, as well as logging information for the health of the system just like any other piece of software. As a model goes live on a platform like vRad’s, common problems that we see happen are memory overflows, underperforming throughput, and other “typical” software problems.
Although these problems look quite similar to traditional “software problems”, the root cause is quite different. For the scalability and the reliability of traditional software, the bottleneck usually boils down to database transactions. Take Slack, an enterprise messaging platform, for example. What’s the most compute-intensive thing Slack app does? It looks up the chat typed previously by your colleague from a database and shows it to you. Basically, a database transaction. The scalability of Slack usually means scalability and reliability of these database transactions. Given how databases have been around for years, this problem is fairly well solved with off-the-shelf solutions.
For AI-enabled software, the most compute-intensive task is not a database transaction but running an AI model. And this is arguably more intensive than a database lookup. Given how new deep learning is, the ecosystem around it is not yet well developed. This makes AI model deployment and engineering hard, and it is being tackled by big names like Google (TensorFlow), Facebook (Torch), and Microsoft (ONNX). Because these are open source, we actively contribute to them and make them better as we come across problems.
As different is the root cause of the engineering challenges, the process to tackle them is surprisingly similar. After all, engineers’ approach to building bridges and rockets is not all that different, they just require different tools. To make our AI scale to vRad, we followed traditional software engineering best practices including highly tested code and frequent updates. As soon as we identify an issue, we patch it up and write a regression test to make sure we never come across it again. Docker has made deployment and updates easy and consistent.
Another significant engineering challenge we solved was bending clinical software to our will. DICOM is a messy communication standard and lacks some important features. For example, DICOM features no acknowledgement signal that the complete study has been sent over the network. Another great example is the lack of standardization in how a given study is described – what fields are used and what phrases describe what the study represents. The work Qure.ai and vRad collaborated on required intelligent mapping of study descriptions and modality information throughout the platform – from the vRad PACS through the Inference Engine running the models to the actual logic in the model containers themselves.
Many AI imaging models and solutions on the market today integrate with PACS and worklists, but one unique aspect of Qure.ai and vRad's work is the sheer scale of the undertaking. vRad's PACS ingests millions of studies a year, around 1 billion individual images annually. The vRad platform, including the PACS, RIS, and AI Inference Engine, routes those studies to the right AI models and the right radiologists; radiologists perform thousands of reads each night; and NLP helps them report, and analyzes those reports for continual feedback to radiologists as well as to AI models and monitoring. Qure.ai's ICH model plugged into the platform and demonstrated robustness as well as impressive sensitivity and specificity.
During vRad and Qure.ai's validation, we ran hundreds of thousands of studies in parallel with our production workloads, validating that the model, and the solution hosting it, could not only generalize in terms of sensitivity and specificity but also overcome all the other technical challenges that so often trip up large-scale deployments of AI solutions.
When the COVID-19 pandemic hit Mumbai, one of the most densely populated cities in the world, the Municipal Corporation of Greater Mumbai (MCGM) promptly embraced newer technologies while creatively utilising available resources. Here is a deeper dive into how the versatility of chest X-rays and artificial intelligence helped the financial capital of India in its efforts to contain the pandemic.
The COVID-19 pandemic is one of the most demanding adversities that the present generation has had to witness and endure. The highly virulent novel coronavirus has posed a challenge like no other to the most sophisticated healthcare systems the world over. Given the brisk transmission, it was only a matter of time before the virus spread to Mumbai, the busiest city of India, with a population more than 1.5 times that of New York.
The resilient Municipal Corporation of Greater Mumbai (MCGM) swiftly sprang into action, devising multiple strategies to test, isolate, and treat in an attempt to contain the pandemic and avoid significant damage. Given their availability and effectiveness, chest X-rays were identified as an excellent tool to rule in cases that needed further testing, ensuring that no suspected case was missed. Though Mumbai saw a steeper rise in cases than any other city in India, MCGM's efforts across various touchpoints in the city were augmented by Qure's AI-based X-ray interpretation tool, qXR, and the extension of its capabilities and benefits.
In the latter half of June, MCGM launched the MISSION ZERO initiative, a public-private partnership supported by the Bill & Melinda Gates Foundation, Bharatiya Jain Sanghatana (BJS), Desh Apnayen, and CREDAI-MCHI. Mobile vans carrying digital X-ray systems with qXR installed were stationed outside various quarantine centers in the city. Individuals identified by on-site physicians from various camps as being at high risk of COVID-19 infection were directed to these vans for further examination. Based on their clinical and radiological indications, screened individuals were requested to proceed for isolation, RT-PCR testing, or continued isolation in the quarantine facility. Our objective was to reduce the load on the centers by continuously monitoring patients and discharging those who had recovered, making room for new patients to be admitted and ensuring optimal utilization of resources.
The approach adopted by MCGM was multi-pronged to ascertain that no step of the pandemic management process was overlooked:
Learn more about Qure.ai qXR COVID in our detailed blog here
Kasturba Hospital and HBT Trauma Center were among the first few COVID-19 testing centers in Mumbai. However, due to the overwhelming caseload, it was essential that they triage individuals flowing into fever clinics for optimal utilization of testing kits. The two centers used conventional analog film-based X-ray machines: one for the standard OPD setting, and another, portable system for the COVID isolation wards.
From early March, both these hospitals adopted qXR through Qure's qTrack mobile app.
The qTrack mobile app is a simple, easy to use tool that interfaces qXR results with the user. The qTrack app digitizes film-based X-rays and provides real-time interpretation using deep learning models. The x-ray technician simply clicks a picture of the x-ray against a view box via the app to receive the AI reading corresponding to the x-ray uploaded. The app is a complete workflow management tool, with the provision to register patients and capture all relevant information along with the x-ray. The attending physicians and the hospital Deans were provided separate access to the Qure portal so that they could instantly access AI analyses of the x-rays from their respective sites, from the convenience of their desktops/mobile phones.
When the city went into lockdown along with the rest of the world as a measure to contain the spread of infection, social distancing guidelines were imposed across the globe. However, this is not a luxury that the second-most densely populated city in the world could always afford: it is not uncommon to have several families living in close quarters within various communities, easily making them high-risk areas and, soon, containment zones. With more than 50% of COVID-19 positive cases being asymptomatic, it was imperative to test aggressively, especially in densely populated areas, to identify individuals at high risk of infection so that they could be institutionally quarantined to prevent and contain community transmission.
As the global situation worsened, the commercial capital of the country saw a steady rise in the number of positive cases. MCGM very creatively and promptly revived previously closed-down hospitals and converted large open grounds in the city into dedicated COVID-19 centers, in record time, with their own critical patient units. The BKC MMRDA grounds, NESCO grounds, NSCI (National Sports Club of India) Dome, and SevenHills Hospital are a few such centers.
NESCO COVID Center
The COVID-19 center at NESCO is a 3,000-bed facility with 100+ ICU beds, catering primarily to patients from Mumbai's slums. With several critical patients admitted, it was important for Dr. Neelam Andrade, the facility head, and her team to monitor patients closely, track their disease progression, and act quickly. qXR helped Dr. Andrade's team by providing instant automated reporting of chest X-rays. It also captured all clinical information, enabling the center to make its process completely paperless.
“Since the patients admitted here are confirmed cases, we take frequent X-rays to monitor their condition. qXR gives instant results and this has been very helpful for us to make decisions quickly for the patient on their treatment and management.”
- Dr Neelam Andrade, Dean, NESCO COVID centre
SevenHills Hospital, Andheri
Located in the heart of the city's suburbs, SevenHills Hospital was one of the first hospitals revived by MCGM as part of its COVID-19 response measures.
The center played a critical role on two accounts:
As with all COVID-19 cases, chest X-rays of the admitted patients were taken periodically to ascertain their lung condition and monitor the progress of the disease. All X-rays were then read the next day by the head radiologist, Dr. Bhujang Pai, and released to the patient only after his review and approval. This meant that on most mornings, Dr. Pai was tasked with reading and reporting 200-250 X-rays, if not more. This is where qXR simplified his work.
Initially, we deployed the software on one of the two chest X-ray systems. After stellar feedback from Dr. Pai, however, our technology was installed on both machines. In this manner, an AI pre-read was available for all chest X-rays of COVID-19 patients from the center.
Where qXR adds most value:
“At SevenHills hospital, we have a daily load of ~220 Chest X-rays from the admitted COVID-19 cases, sometimes going up to 300 films per day. Having qXR has helped me immensely in reading them in a much shorter amount of time and helps me utilise my time more efficiently. The findings from the software are useful to quickly pickup the indications and we have been able to work with the team, and make suitable modifications in the reporting pattern, to make the findings more accurate. qXR pre-fills the report which I review and edit, and this facilitates releasing the patient reports in a much faster and efficient manner. This obviously translates into better patient care and treatment outcomes. The percentage of lung involvement which qXR analyses enhances the Radiologist’s report and is an excellent tool in reporting Chest radiographs of patients diagnosed with COVID infection.”
– Dr Bhujang Pai, Radiology Head, SevenHills Hospital
During the course of the pandemic, Qure has assisted MCGM by providing AI analyses for thousands of chest X-rays of COVID-19 suspects and patients. This has been possible through continued collaboration with key stakeholders within MCGM, who have been happy to assist in the process and provide the necessary approvals and documentation to initiate work. However, the sites posed different challenges owing to their varied nature and the limitations that came with them.
We had to navigate various technical challenges, such as interrupted network connections and the lack of an IT team, especially at the makeshift COVID centers. We crossed these hurdles repeatedly to ensure that X-rays from these centers were processed seamlessly within the stipulated timeframe, and that the X-ray systems in use were serviced and functioning without interruption. Close coordination with, and cooperation from, the on-ground team was crucial to keeping the engagement smooth.
This pandemic has been a revelation in many ways. In addition to reiterating that a virus sees no class or creed, it forced us to move beyond our comfort zones and take our blinders off. Owing to the limitations posed by the pandemic and the subsequent movement restrictions, every single deployment of qXR was done entirely remotely. This included end-to-end activities: coordination with key stakeholders, planning and execution of the software deployment, training of on-ground staff and of physicians using the portal/mobile app, and continuous operations support.
Robust, smart technology truly made it possible to implement what we had conceived and hoped for, proving yet again that if we are to move ahead, it has to be through a healthy partnership between technology and humanity.
Qure is supported by ACT Grants and India Health Fund for joining MCGM’s efforts for the pandemic response using qXR for COVID-19 management.
When the pandemic hit without discretion, it caused health systems to crumble across the world. While much of the focus was on strengthening them in urban cities, rural areas were struggling to cope. In this blog, we highlight our experience working with some of the best healthcare centers in rural India that deliver healthcare to the last mile. We describe how they embraced AI technology during this pandemic, and the difference it made to their workflow and patient outcomes.
2020 will be remembered as the year of the COVID-19 pandemic. Affecting every corner of the world without discretion, it has caused unprecedented chaos and put healthcare systems under enormous stress. The majority of COVID-19 transmissions take place due to asymptomatic or mildly symptomatic cases. While global public health programs have steadily created evolving strategies for integrative technologies for improved case detection, there is a critical need for consistent and rigorous testing. It is at this juncture that the impact of Qure’s AI-powered chest X-ray screening tool, qXR, was felt across large testing sites such as hospital networks and government-led initiatives.
In India, Qure joined forces with the Indian Government to combat COVID-19, and qXR found its value in diagnostic aid and critical care management. With the assistance of investor groups like ACT Grants and India Health Fund, we extended support to a number of sites, strengthening the urban systems fighting the virus in hotspots and containment zones. Unfortunately, by this time the virus had already moved to rural areas, overwhelming primary healthcare systems that were already overburdened and resource-constrained.
Technologies are meant to improve the quality of human lives, and access to quality healthcare is one of the most basic necessities. To further our work with hospitals and testing centers across the world, we asked ourselves whether more hospitals could benefit from the software in optimising testing capacity. Through our physicians, we reached out to healthcare provider networks and social impact organisations that could potentially use the software for triaging and optimisation. During this process, we discovered an entirely new segment, very different from the well-equipped urban hospitals where we had been operating so far, and interacted with a few physicians dedicated to delivering quality, affordable healthcare through these hospitals.
Working closely with community public health systems, these secondary care hospitals act as a vital referral link to tertiary hospitals. Some are located in isolated tribal areas and address the needs of large catchment populations, hosting close to 100,000 OPD visits annually. They already faced a significant TB burden and now had to cope with the COVID-19 crisis as well. With testing facilities often located far away, diagnosis is delayed by days; chest X-rays therefore became crucial as the primary investigation prior to confirmatory tests, mainly due to limitations in testing capacity. Sufficient testing kits have not reached many parts of rural India even now!
“I have just finished referring a 25-year-old who came in respiratory distress, flagged positive on X-ray with positive rapid antigen test to Silchar Medical College and Hospital (SMCH), which is 162kms away from here. The number of cases here in Assam is increasing”
– Dr. Roshine Koshy, Makunda Christian Leprosy and General Hospital in Assam.
When we first reached out to these hospitals, we were struck by the heroic vigour with which they were already handling the COVID-19 crisis despite their limited resources. We spoke to doctors, caregivers, and IT experts across these hospitals, and they had the utmost clarity from the very beginning on how the technology could help them.
Patients regularly present with no symptoms or atypical ones, and conceal their travel history due to the stigma associated with COVID-19. Owing to the ambiguous nature of COVID-19 presentation, subtle findings can be missed. Apart from the risk of direct contact with the patient, a missed case puts the healthcare team, their families, and other vulnerable patients at risk.
qXR bridges underlying gaps in these remote, isolated, and resource-constrained regions around the world. Perhaps the most revolutionary, life-saving aspect is that, in under a minute, qXR generates an AI analysis of whether an X-ray is normal or abnormal, along with a list of 27+ abnormalities including COVID-19 and TB. With qXR's assistance, X-rays suggestive of a high risk of COVID-19 are flagged, enabling quick triaging and isolation of suspects until negative RT-PCR confirmatory results are received. As the prognosis changes with co-morbidities, alerting the referring physician by phone about life-threatening findings like pneumothorax is an added advantage.
Due to the lack of radiologists and other specialists in their own or neighbouring cities, clinicians often play multiple roles (physician, obstetrician, surgeon, intensivist, anaesthetist), which is normal in these hospitals that investigate, treat, and perform surgeries for those in need. Detecting any at-risk case prior to a surgical procedure is important, necessitating RT-PCR confirmation and further action.
These hospitals have been in the service of local communities, with a mix of healthcare and community outreach services, for decades now. Heavily dependent on funding, these setups often have to navigate severe financial crises in their mission to keep catering to people at the bottom of the pyramid. Amidst the tribal belt in Jharkhand, Dr. George Mathew (former Principal, CMC Vellore), Medical Director of Shanti Bhavan Medical Center in Simdega, faced the herculean task of providing food and accommodation for all his healthcare and non-healthcare staff, who were ostracised by their families owing to the stigma attached to COVID-19 care. The lack of PPE kits and other protective gear also pushed these sites to innovate and produce them in-house.
qXR was introduced to these passionate professionals, and other staff were sensitized to the technology. After their buy-in, we on-boarded 11 of these hospitals, working closely with their IT teams on secure protocols, deployment, and staff training, all in a span of 2 weeks. A glimpse of the hospitals is given below:
Location | Hospital Name | Setting |
---|---|---|
Betul District, rural Madhya Pradesh | Padhar Hospital | A 200-bed multi-speciality charitable hospital that engages in a host of community outreach activities in nearby villages, involving education, nutrition, maternal and child health programs, mental health, and cancer screening |
Nandurbar, Maharashtra | Chinchpada Mission Hospital | A secondary care hospital serving the Bhil tribal community. Patients travel up to 200 km from the interiors of Maharashtra to avail themselves of affordable, high-quality care. |
Tezpur, Assam | The Baptist Christian Hospital | A 200-bed secondary care hospital in the north-eastern state of Assam |
Bazaricherra, Assam | Makunda Christian Leprosy & General Hospital | Caters to tribal regions, in a district with a Maternal Mortality Rate (MMR) as high as 284 per 100,000 live births and an Infant Mortality Rate (IMR) of 69 per 1,000 live births. They conduct 6,000 deliveries and perform 3,000 surgeries annually. |
Simdega, Jharkhand | Shanti Bhavan Medical Center | A secondary hospital catering to a remote tribal district, managed entirely by 3-4 doctors who actively multitask to ensure the highest quality care for their patients. The nearest tertiary care hospital is approximately 100 km away. Currently a designated COVID-19 center; they actively see many TB cases as well. |
Others include hospitals in Khariar, Odisha; Dimapur, Nagaland; Raxaul, Bihar and so on.
Initially, qXR was used to process X-rays of cases with COVID-19-like symptoms, with results interpreted and updated within a minute. Soon the doctors found it useful in the OPD as well, and the solution's capability was extended to all patients presenting with ailments that required chest X-ray diagnosis. Alerts on every suspect are provided immediately, based on the likelihood of disease predicted by qXR, along with information on other suggestive findings. The reports are compiled and integrated on our patient workflow management solution, qTrack. Given the resource constraints around viewing X-rays on dedicated workstations, the results are also made available in real time via the qTrack mobile application.
“It is a handy tool for our junior medical officers in the emergency department, as it helps in quick clinical decision making. The uniqueness of the system being speed, accuracy, and the details of the report. We get the report moment the x rays are uploaded on the server. The dashboard is very friendly to use. It is a perfect tool for screening asymptomatic patients for RT PCR testing, as it calculates the COVID-19 risk score. This also helps us to isolate suspected patients early and thereby helping in infection control. In this pandemic, this AI system would be a valuable tool in the battleground”
– Dr Jemin Webster, Tezpur Baptist Hospital
Once the preliminary chest X-ray screening is done, hospitals equipped with COVID-19 rapid tests run them right away, while the others send samples to the closest testing facility, which may be more than 30 miles away, with results available only after 4-5 days or more. None of these hospitals has an RT-PCR testing facility yet!
At Makunda Hospital, Assam, qXR is used as an additional input in the diagnostic workflow for managing a patient as a COVID-19 patient. They have streamlined their workflow so that X-ray technicians take digital X-rays, upload the images to qXR, and follow up and alert the doctors. Meanwhile, physicians can access reports, review images, and make clinical corroboration wherever they are through qTrack, managing patients without undue delay.
“One of our objectives as a clinical team has been to ensure that care for non-COVID-19 patients is not affected as much as possible as there are no other healthcare facilities providing similar care. We are seeing atypical presentations of the illness, patients without fever, with vague complaints. We had one patient admitted in the main hospital who was flagged positive on the qXR system and subsequently tested positive and referred to a higher center. All the symptomatic patients who tested positive on the rapid antigen test have been flagged positive by qXR and some of them were alerted because of the qXR input. Being a high volume center and the main service provider in the district, using Qure.ai as a triaging tool will have enormous benefits in rural areas especially where there are no well-trained doctors”
- Dr. Roshine Koshy, Makunda Christian Leprosy and General Hospital in Assam.
Our users experienced a number of changes in the short span since qXR was introduced into their existing workflow, including:
In Padhar Hospital, Madhya Pradesh, in addition to triaging suspected COVID cases, qXR assists doctors in managing pre-operative patients, as their medicine department takes care of pre-anaesthesia checkups as well. qXR helps them identify and flag suspected cases who are scheduled for procedures; these are deferred until diagnosis, or handled with appropriate additional safety measures in case of an emergency.
“We are finding it quite useful since we get a variety of patients, both outpatients and inpatients. And anyone who has a short history of illness and has history suggestive of SARI, we quickly do the chest X-ray and if the Qure app shows a high COVID-19 score, we immediately refer the patient to the nearby district hospital for RT-PCR for further management. Through the app we are also able to pick up asymptomatic suspects who hides their travel history or positive cases who have come for second opinion, to confirm and/or guide them to the proper place for further testing and isolation”
– Dr Mahima Sonwani, Padhar Hospital, Betul, Madhya Pradesh
In some of the high TB burden settings like Simdega in Jharkhand, qXR is used as a surveillance tool for screening and triaging Tuberculosis cases in addition to COVID-19 and other lung ailments.
“We are dependent on chest X-rays to make the preliminary diagnosis in both these conditions before we perform any confirmatory test. There are no trained radiologists available in our district or our neighbouring district and struggle frequently to make accurate diagnosis without help of a trained radiologist. The AI solution provided by Qure, is a perfect answer for our problem in this remote and isolated region. I strongly feel that the adoption of AI for Chest X-ray and other radiological investigation is the ideal solution for isolated and human resource deprived regions of the world”
- Dr. George Mathew, Medical Director, Shanti Bhavan Medical Centre
Currently, qXR processes close to 150 chest X-rays a day from these hospitals, enabling quick diagnostic decisions for lung diseases.
Challenges: Several hospitals had very basic technological infrastructure, with poor internet connectivity and limited IT systems for running supporting software. They were anxious about potential viruses or about crashing the computer on which our software was installed. Most of these teams also had limited exposure to working with such software. However, they were extremely keen to learn, adapt, and even propose solutions to overcome these infrastructural limitations. Engineers from Qure's customer success team deployed the software gateways carefully, ensuring no interruption to existing operations.
At Qure, we have worked closely with public health stakeholders over recent years. It is rewarding to hear the experiences and stories of impact from these physicians. To strengthen their armour in the fight against the pandemic, even in such resource-limited settings, we will continue to expand our software solutions, making qXR available across primary, secondary, and tertiary hospitals. Meetings, deployments, and training will be done remotely, providing a seamless experience. It is reassuring to hear these words:
“Qure’s solution is particularly attractive because it is cutting edge technology that directly impacts care for those sections of our society who are deprived of many advances in science and technology simply because they never reach them! We hope that this and many more such innovative initiatives would be encouraged so that we can include the forgotten masses of our dear people in rural India in the progress enjoyed by those in the cities, where most of the health infrastructure and manpower is concentrated”
– Dr. Ashita Waghmare, Chinchpada Hospital
Democratizing healthcare through innovations! We will be publishing a detailed study soon.
In March 2020, we re-purposed our chest X-ray AI tool, qXR, to detect signs of COVID-19. We validated it on a test set of 11479 CXRs with 515 PCR-confirmed COVID-19 positives. The algorithm performs at an AUC of 0.9 (95% CI: 0.88 - 0.92) on this test set. At our most common operating threshold for this version, sensitivity is 0.912 (95% CI: 0.88 - 0.93) and specificity is 0.775 (95% CI: 0.77 - 0.78). qXR for COVID-19 is used at over 28 sites across the world to triage suspected COVID-19 patients and to monitor the progress of infection in patients admitted to hospital.
The emergence of the COVID-19 pandemic has already caused a great deal of disruption around the world. Healthcare systems are overwhelmed as we speak, in the face of WHO guidance to ‘test, test, test’ [1]. Many countries are facing a severe shortage of Reverse Transcription Polymerase Chain Reaction (RT-PCR) tests. There has been a lot of debate around the role of radiology — both chest X-rays (CXRs) and chest CT scans — as an alternative or supplement to RT-PCR in triage and diagnosis. Opinions on the subject range from ‘Radiology is fundamental in this process’ [2] to ‘framing CT as pivotal for COVID-19 diagnosis is a distraction during a pandemic, and possibly dangerous’ [3].
The humble chest X-ray has emerged as a frontline screening and diagnostic tool for COVID-19 infection in a few countries, used in conjunction with clinical history and key blood markers such as the C-Reactive Protein (CRP) test and lymphopenia [4]. Ground-glass opacities and consolidations that are peripheral and bilateral in nature are attributed as the most common findings of COVID-related infection on CXRs and chest CTs. CXRs can help identify COVID-19 related infections and can be used as a triage tool in most cases. In fact, Italian and British hospitals are employing CXR as a first-line triage tool due to high RT-PCR turnaround times. A recent study [5] that examined CXRs of 64 patients found that in 9% of cases, the initial RT-PCR was negative while the CXR showed abnormalities; all these cases subsequently tested positive on RT-PCR within 48 hours. The American College of Radiology recommends considering portable chest X-rays [6] to avoid bringing patients to radiography rooms. The Canadian Association of Radiologists suggests the use of mobile chest X-ray units for preliminary diagnosis of suspected cases [7] and for monitoring critically ill patients, but reports that no abnormalities are seen on CXRs in the initial stages of the infection.
As of today, despite calls for opening up imaging data on COVID-19 and outstanding efforts from physicians on the front lines, there are limited X-ray or CT datasets in the public domain pertaining specifically to COVID. These datasets remain insufficient to train an AI model for COVID-19 triage or diagnosis, but they are potentially useful for evaluating a model, provided the model hasn't been trained on the same data sources.
Over the last month, customers, collaborators, healthcare providers, NGOs, state and national governments have reached out to us for help with COVID detection on chest X-rays and CTs.
In response, we have adapted our tried-and-tested chest X-ray AI tool, qXR, to identify findings related to COVID-19 infection. qXR is trained on a dataset of 2.5 million chest X-rays (including bacterial and viral pneumonia and many other chest X-ray findings) and is currently deployed in over 28 countries. qXR detects the following findings that are indicative of COVID-19: opacities and consolidation with bilateral and peripheral distribution; and the following findings that are contra-indicative of COVID-19: hilar enlargement, discrete pulmonary nodule, calcification, cavity, and pleural effusion.
These CE-marked capabilities have been leveraged to build a COVID-19 triage product that is highly sensitive to COVID-19 related findings. This version of qXR outputs the likelihood of a CXR being positive for COVID-19, called the COVID-19 Risk. COVID-19 Risk is computed using a post-processing algorithm that combines the model outputs for the above-mentioned findings. The algorithm is tuned on a set of 300 COVID-19 positives and 300 COVID-19 negatives collected from India and Europe.
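The post-processing itself isn't spelled out here; purely as an illustration, one plausible form combines the per-finding probabilities with signed weights. The findings echo those listed above, but the weights and squashing below are hypothetical, not qXR's tuned values:

```python
import math

# Illustrative weights only; the actual post-processing is tuned on labeled data.
INDICATIVE = {"opacity": 1.0, "consolidation": 1.0}
CONTRA = {"hilar_enlargement": 0.7, "pulmonary_nodule": 0.7,
          "calcification": 0.5, "cavity": 0.7, "pleural_effusion": 0.5}

def covid19_risk(finding_probs: dict[str, float]) -> float:
    """Combine per-finding model probabilities into a single COVID-19 risk score."""
    score = sum(w * finding_probs.get(f, 0.0) for f, w in INDICATIVE.items())
    score -= sum(w * finding_probs.get(f, 0.0) for f, w in CONTRA.items())
    return 1.0 / (1.0 + math.exp(-score))  # squash into (0, 1)
```

Tuning on a labeled set, as described above, then amounts to fitting these weights and choosing an operating threshold on the resulting score.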
Most new qXR users for COVID-19 are using it as a triage tool, often in settings with limited diagnostic resources. This version of qXR also localizes and quantifies the affected region. This capability is being used to monitor the progression of infection and to evaluate response to treatment in new clinical studies.
We have created an independent test set of 11479 CXRs to evaluate our algorithm. The WHO [10] recommends confirmatory diagnosis of COVID-19 using Reverse-Transcriptase Polymerase Chain Reaction (RT-PCR), a specialised Nucleic Acid Amplification Test (NAAT) which looks for unique signatures using primers designed for the COVID-19 RNA sequence. Positives in this test set are defined as any CXR acquired while the patient had tested positive on an RT-PCR test based on sputum/lower respiratory and/or upper respiratory aspirate/throat swab samples for COVID-19. Negatives are defined as any CXR acquired before the first case of COVID-19 was discovered.
The size of the negative set relative to the positive set was chosen to match the prevalence reported in the literature [11]. The test set has 515 positives and 10964 negatives. Negatives are sampled from an independent set of 250,000 CXRs. The negative set contains 1609 cases of bilateral opacity and 547 cases of pulmonary consolidation (findings which are indicative of COVID-19 on a CXR) where the final diagnosis was not COVID-19, as well as 355 non-opacity related abnormalities. This allowed us to evaluate the algorithm's ability to detect non-COVID-19 opacities and findings, which is used to suggest alternative possible etiologies and rule out COVID-19. We used the Area Under the Receiver Operating Characteristic Curve (AUC), along with sensitivity and specificity at the operating point, to evaluate the performance of our algorithm.
Characteristic | Value |
---|---|
Number of scans | 11479 |
Positives | 515 |
Negatives | 10964 |
Normals | 9000 |
Consolidation | 547 |
Opacities | 1609 |
Other Abnormalities | 355 |
A subset (1000 cases) of this test set was independently reviewed by radiologists to create pixel-level annotations localizing opacity and consolidation. The localization and progression-monitoring capability of qXR is validated by computing the Jaccard index between the algorithm output and the radiologist annotations.
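For reference, the Jaccard index of two binary masks is simply intersection over union; a minimal numpy sketch (a standard definition, not our internal code):

```python
import numpy as np

def jaccard_index(pred: np.ndarray, truth: np.ndarray) -> float:
    """Intersection over union of two binary masks of the same shape."""
    pred, truth = pred.astype(bool), truth.astype(bool)
    union = np.logical_or(pred, truth).sum()
    if union == 0:
        return 1.0  # both masks empty: perfect agreement by convention
    return float(np.logical_and(pred, truth).sum() / union)
```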
For detecting signs of COVID-19, we observed an AUC of 0.9 (95% CI: 0.88 - 0.92) on this test set. At the operating threshold, we observed a sensitivity of 0.912 (95% CI: 0.88 - 0.93) and a specificity of 0.775 (95% CI: 0.77 - 0.78). While there are no WHO guidelines yet for an imaging-based COVID-19 triage tool, the WHO recommends a minimum sensitivity of 0.9 and specificity of 0.7 for community screening tests for tuberculosis [12], a deadly infectious disease in its own right. We observed a Jaccard index of 0.88 between qXR's output and the experts' annotations.
qXR is available as a web API and can be deployed within minutes. Built on our experience of deploying globally and remotely, it can interface with a variety of PACS and RIS systems and is very intuitive to interpret. qXR can be used to triage suspect patients in resource-constrained countries to make effective use of RT-PCR test kits, and it is being used for screening and triage at multiple hospitals in India and Mexico.
San Raffaele Hospital in Milan, Italy has deployed qXR to monitor patients and evaluate their response to treatment. In Karachi, qXR-powered mobile vans are being used at multiple sites to identify potential suspects early, reducing the burden on the healthcare system.
In the UK, all suspected COVID-19 patients presenting to the emergency department undergo blood tests and a CXR [4]. This puts a tremendous workload on already burdened radiologists, as it becomes critical for them to report these CXRs urgently. qXR, with its ability to handle huge workloads, provides significant value in such a scenario and thus reduces the burden on radiologists.
qXR can also be scaled for rapid and extensive population screening. Frontline clinicians increasingly rely on chest X-rays to triage the sickest patients while they await RT-PCR results. When there is high clinical suspicion of COVID-19 infection, a patient with a positive chest X-ray will conceivably need hospital admission. qXR can help solve this problem at scale.
With new evidence published every day, and guidance and protocols for COVID-19 evolving to suit, national responses globally remain fluid. Singapore, Taiwan, and South Korea have shown that aggressive and proactive testing plays a crucial role in containing the spread of the disease. We believe qXR can play an important role in expanding screening in the community and help reduce the burden on healthcare systems. If you want to use qXR, please reach out to us.
Our deep learning models have become very good at recognizing hemorrhages in head CT scans. Real-world performance, however, is sometimes hampered by external factors, both hardware-related and human-related. In this blog post, we analyze how acquisition artifacts are responsible for performance degradation and introduce two methods we tried to solve this problem.
Medical Imaging is often accompanied by acquisition artifacts which can be subject related or hardware related. These artifacts make confident diagnostic evaluation difficult in two ways:
Some common examples of artifacts are
Here we investigate motion artifacts that look like subdural hemorrhage (SDH) in head CT scans. These artifacts increase the false positive (FP) predictions of subdural hemorrhage models. We confirmed this by quantitatively analyzing the FPs of our AI model deployed at an urban outpatient center: the FP rates for this data were higher than on our internal test dataset. These false positive predictions arise from the lack of variety of artifact-ridden data in the training set. It is practically difficult to acquire and include scans containing all varieties of artifacts in the training set.
We tried to solve this problem in the following two ways.
We reasoned that the artifacts were misclassified as bleeds because the model had not seen enough artifact scans during training. The number of images containing artifacts is relatively small in our annotated training dataset, but we have access to many unannotated artifact scans acquired from various centers with older CT scanners. (Motion artifacts are more prevalent on older CT scanners with poor in-plane temporal resolution.) If we could generate artifact-ridden versions of all the annotated images in our training dataset, we could effectively augment the training set and make the model invariant to artifacts. We decided to use a Cycle GAN to generate new training data containing artifacts.
Cycle GAN [1] is a generative adversarial network used for unpaired image-to-image translation. It serves our purpose because we have an unpaired image translation problem: domain X contains our artifact-free training set CT images and domain Y contains artifact-ridden CT images.
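For orientation, here is a minimal PyTorch sketch of the Cycle GAN generator objective (least-squares adversarial terms plus cycle consistency, following the paper); `G`, `F_gen`, `D_X`, and `D_Y` are assumed to be the two generators and two discriminators, not our actual training code:

```python
import torch
import torch.nn.functional as F

def generator_loss(G, F_gen, D_X, D_Y, real_x, real_y, lam=10.0):
    """Generator objective: adversarial terms plus cycle consistency.

    G: X -> Y (adds artifacts); F_gen: Y -> X (removes them);
    D_X, D_Y: discriminators for the two domains.
    """
    fake_y, fake_x = G(real_x), F_gen(real_y)
    # Least-squares adversarial losses: each generator tries to fool its discriminator.
    adv = F.mse_loss(D_Y(fake_y), torch.ones_like(D_Y(fake_y))) \
        + F.mse_loss(D_X(fake_x), torch.ones_like(D_X(fake_x)))
    # Cycle consistency: X -> Y -> X and Y -> X -> Y should reconstruct the inputs.
    cyc = F.l1_loss(F_gen(fake_y), real_x) + F.l1_loss(G(fake_x), real_y)
    return adv + lam * cyc
```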
We curated an X-domain dataset of 5,000 images without artifacts and a Y-domain dataset of 4,000 images with artifacts, and used these to train the Cycle GAN.
Unfortunately, the quality of the generated images was not very good (see fig 6). The generator was unable to capture all the variety in the CT dataset, and meanwhile introduced artifacts of its own, rendering it useless for augmenting the dataset. The Cycle GAN authors state that the generator performs worse when the transformation involves geometric changes (e.g., dog to cat, apples to oranges) than when it involves color or style changes. Introducing acquisition artifacts is more complex than a color or style change because it must distort existing geometry; this could be one reason the generated images contain extra artifacts.
In this method, we trained a model to identify slices with artifacts and show that discounting these slices made the AI model identifying subdural hemorrhage (SDH) robust to artifacts. A manually annotated dataset was used to train a convolutional neural network (CNN) to detect whether a CT slice has artifacts. The original SDH model was also a CNN, predicting whether a slice contains SDH. The probabilities from the artifact model were used to discount slices containing artifacts, so that only the artifact-free slices of a scan contribute to the score for the presence of a bleed (see fig 7).
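The discounting can be as simple as multiplying each slice's SDH probability by one minus its artifact probability before aggregating over the scan; a sketch of this idea (illustrative, not our production code):

```python
import numpy as np

def scan_sdh_score(sdh_probs: np.ndarray, artifact_probs: np.ndarray) -> float:
    """Scan-level SDH score from per-slice probabilities.

    Slices that look like artifacts contribute less to the scan-level score.
    """
    discounted = sdh_probs * (1.0 - artifact_probs)
    return float(discounted.max())  # flag the scan if any clean slice looks like SDH
```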
Our validation dataset contained 712 head CT scans, of which 42 contained SDH. The original SDH model predicted 35 false positives and no false negatives. Quantitative analysis of the FPs confirmed that 17 (48%) of them were due to CT artifacts. Our trained artifact model had a slice-wise AUC of 96%. The proposed modification to the SDH model reduced the FPs to 18 (a decrease of 48%) without introducing any false negatives. Thus, using method 2, all scan-wise FPs due to artifacts were corrected.
In summary, using method 2, we improved the precision of SDH detection from 54.5% to 70% while maintaining a sensitivity of 100%.
See fig 9. for model predictions on a representative scan.
A drawback of method 2 is that if SDH and an artifact are present in the same slice, it is probable that the SDH could be missed.
Using a Cycle GAN to augment the dataset with artifact-ridden scans would solve the problem by enriching the dataset with both SDH-positive and SDH-negative scans with artifacts on top of them, but our current experiments do not yield realistic-looking synthesized images. The alternative we used meanwhile reduces the problem of high false positives due to artifacts while maintaining the same sensitivity.
We have been deploying our deep learning based solutions across the globe. qXR, our product for automated chest X-ray reads, is being widely used for Tuberculosis screening. In this blog, we will understand the scale of the threat that TB presents. Thereafter, taking one of our deployments as a case study, we will explain how artificial intelligence can help us in fighting TB.
Qure.ai's deep learning solutions are actively reading radiology images at over 82 sites spread across 12 countries. We have processed more than 50 thousand scans to date. One of the major use cases of our solutions is fast-tracking tuberculosis (TB) screening.
TB is caused by bacteria called Mycobacterium tuberculosis and it mostly affects the lungs. About one-fourth of the world’s population is infected by the bacteria inactively – a condition called latent TB. TB infection occurs when a person breathes in droplets produced due to an active TB person’s coughing, sneezing or spitting.
TB is a curable and preventable disease. Despite that, WHO reports that it is one of the top 10 causes of deaths worldwide. In 2017, 10 million people fell ill with TB, out of which 1.6 million lost their lives. 1 million children got affected by it, with 230,000 fatalities. It is also the leading cause of death among HIV patients.
There are many tests to detect TB. Some of them are as follows:
Chest X-ray: Typically used to screen for signs of TB in the lungs. They are a sensitive and inexpensive screening test, but may pick up other lung diseases too. So chest X-rays are not used for a final TB diagnosis. The presence of TB bacteria is confirmed using a bacteriological or molecular test of sputum or other biological sample.
Sputum tests: The older AFB sputum tests (samples manually viewed through a microscope for signs of bacteria) are still used in low-income countries to confirm TB. A more sensitive sputum test that uses DNA amplification technology to detect traces of the bacteria is now in wide use to confirm TB; it is not only more sensitive but can also test for drug resistance. Tests like GeneXpert and TrueNat fall under this category. These are fairly expensive tests.
Molecular tests have shown excellent results in South Africa and are generally considered the go-to test for TB. However, their high costs make it impossible to conduct them for every TB suspect.
Due to the high costs of molecular tests, chest X-rays are generally preferred as a pre-test for TB suspects, after which sputum or molecular tests are performed for confirmation. In regions where these confirmatory tests are not available, chest X-rays are used for final diagnosis.
Having understood the X-ray's key role in TB diagnosis, it is important to note that there is a huge dearth of radiologists to read these X-rays. In India alone, 80 million chest X-rays are captured every year; there aren't enough radiologists to read them within acceptable timelines. Depending on the extent of the shortage of radiology expertise, a report can take anywhere between 2 and 15 days to arrive. As a result, critical time is lost, preventing early detection. A failure to detect TB early is not only hazardous for the patient but also increases the risk of transmission to others.
Moreover, error rates in reading these X-rays are around 25-30%. Such errors can prove fatal for the patient.
This large gap between the number of TB incidences and the number of timely and accurately reported cases is a major reason why many lives are lost to this curable disease. It can be bridged by a solution that requires little manual intervention. This is precisely how Qure's qXR solution, trained on more than a million chest X-rays, attacks the heart of the problem. The artificial intelligence (AI) inside qXR automates reading chest X-rays and generates reports within seconds, reducing the waiting time for TB confirmatory tests from weeks to a couple of hours and enrolling confirmed cases in treatment the same day!
While bacteriological confirmatory tests on presumptive cases are preferred in a screening setting, they increase the cost burden. Sputum culture testing takes weeks to report, which can result in dropouts during report collection and treatment enrolment. Additionally, shortages in sourcing the Cartridge Based Nucleic Acid Amplification Test (CB-NAAT) delay the testing process. Qure.ai's qXR helps cut the time and costs incurred by reducing the number of individuals required to go through these tests. The whole program workflow is depicted in the following picture.
While upscaling our solutions in the last 2 years, it has become evident that Qure.ai can play a vital role in humanity’s war against TB. We deployed qXR with ACCESS TB Project in Philippines in their TB screening program. During the deployment, we learned the operational dynamics of deploying Artificial Intelligence (AI) at health centers.
The ACCESS TB program has mobile vans equipped with X-ray machines and staffed by trained radiographers and health workers. The program is intended to screen presumptive cases and individuals with high risk factors for TB by running the vans through different cities in the Philippines. Screening camps are either announced in conjunction with a nearby nursing home, or health workers identify and invite at-risk individuals on program days.
The vans leave the office on Monday morning for remote villages on a predefined schedule. These villages are situated around 100 km from Manila. Two radiology technicians accompany each van. Once they reach the designated health center in a village, they start capturing X-rays for each individual. The X-ray machines are connected to a computer which stores these X-rays locally; the DICOM (radiology image) information in each X-ray can also be edited from this computer.
Individuals are seen inside the van on a first-come, first-served basis. They are given a receipt containing their patient ID, name, etc., and their X-ray is marked with the same ID using the computer. This approach to mass screening for TB is similar to the approach adopted in the USA from the 1930s to the 1960s, as depicted in the following picture.
Once all the X-rays have been captured, the vans return to their stay in the same village, visiting a new village/health center on each subsequent weekday. On Friday evening, all the vans return to Manila. Thereafter, all the X-rays captured by the 4 vans over the week are sent to a radiologist for review. The lead time for the radiologist's report is 3 working days and can extend to 2 weeks. This delay in reporting leads to delays in diagnosis and treatment, which can prove fatal for the patient and their neighborhood.
Our team arrived in Manila during the second week of September 2018 with the deep learning solution sitting nice and cozy on the cloud. The major challenges in front of us were two-fold:
To ensure smooth upload of images to our cloud server: This was a challenge because some of the villages and towns being visited were really remote, with no guarantee of an internet connection sufficient for uploads to work properly. We had to make sure that everything worked even without internet connectivity. To deal with this, we built an application, installed on their computer, to upload images to our cloud. Whenever connectivity dropped, it would store all the information and wait; as soon as connectivity became available, the app would start processing the deferred uploads. (A sketch of this store-and-forward pattern appears a little further below.)
To enable end-to-end patient management on a single platform: This was the biggest concern, and we designed the software to minimize manual intervention at the various stages.
We built a portal where radiology assistants could register patients, radiologists could report on them, and patient history could be maintained. The diagnoses from the radiologist, qXR, and CB-NAAT tests are all accumulated in a single place.
Features that ease the workflow were added to the software, enabling field staff to filter patients by name, date, site, or health center. Such features helped the staff capture the progress of screening for a patient with simple sorting and searching.
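Here is the promised sketch of the store-and-forward upload pattern described above; the spool directory and endpoint are hypothetical stand-ins, not the actual application:

```python
import os
import shutil

import requests

QUEUE_DIR = "/data/upload_queue"                # hypothetical local spool directory
UPLOAD_URL = "https://example.com/api/upload"   # hypothetical cloud endpoint

def enqueue(image_path: str) -> None:
    """Always spool the image locally first; uploads never block acquisition."""
    shutil.copy(image_path, QUEUE_DIR)

def flush_queue() -> None:
    """Try to upload every spooled image; leave failures in place for the next attempt."""
    for name in os.listdir(QUEUE_DIR):
        path = os.path.join(QUEUE_DIR, name)
        try:
            with open(path, "rb") as f:
                resp = requests.post(UPLOAD_URL, files={"file": f}, timeout=30)
            if resp.status_code == 200:
                os.remove(path)  # uploaded successfully; drop from the queue
        except requests.ConnectionError:
            return  # still offline; retry on the next flush
```

Calling `flush_queue` periodically (for example, from a background timer) gives exactly the behavior described: nothing is lost offline, and deferred uploads drain as soon as connectivity returns.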
At Qure, we deliver our products and solutions understanding the customer needs and designing workflows to fit into their existing processes. Especially when it comes to mass screening programs, we understand that each one of them is uniquely designed by program managers & strategists, and requires specific customizations to deliver a seamless experience.
After understanding the existing workflow at AccessTB, we designed our software to automate some of the existing processes. Thereafter, the software was built, tested, packaged, and stored in a secure cloud. We figured out the best way to integrate with their existing X-ray consoles and completed the integration on all the vans in 2 working days.
A field visit was arranged after the deployment to assess the software’s performance in areas with limited network connectivity and its ease of usage for the radiology staff. Based on our on-field learnings, we further customized the software’s workflow for the staff.
The implementation process ended with a classroom training program for the field staff, technicians, and program managers. With deployment, software adaptability assessment, and training complete, we handed the software over to the program in 5 days, before we left Manila.
Quoting Preetham Srinivas (AI scientist at Qure) on qXR, “With qXR at the heart of it, we’ve developed a solution that is end to end. As in, with individual registrations, and then qXR doing the automated analysis and flagging individuals who need microbiological confirmation. Radiologists can verify in the same portal and then, an organization doing the microbiological tests can plug in their data in the same portal. And once you have a dashboard which collates all this information in one place, it becomes very powerful. The loss itself can be minimized. It becomes that much easier to track the person and make sure he is receiving the treatment.”
The WHO has declared TB an epidemic. It adopted the END TB strategy in 2014, aimed at reducing TB deaths by 90% and cutting new cases by 80% between 2015 and 2030. Ending TB by 2030 is one of the health targets of the Sustainable Development Goals.
The scale of this epidemic cries out for technology to intervene. Technologies like AI, if incorporated into the existing TB care ecosystem, can not only assist healthcare practitioners massively but also enrich that ecosystem with the supplied data and feedback. And this is not mere speculation: with qXR, we are experiencing first-hand how AI can accelerate our efforts to eradicate TB. Jerome Trinona, account coordinator for the AccessTB project, says "Qure.ai's complete TB software is very helpful in maximizing our time – now we can keep track of the entire patient workflow in one place."
Successful deployments like AccessTB show that Qure.ai is leading the battle against TB on the technology and innovation fronts. In the wake of World TB Day, let us all embrace AI as our newest ammunition against TB.
Let's join hands to end TB by 2030. Reach out to us at partner@qure.ai.
This post is Part 1 of a series that uses large datasets (15,000+) coupled with deep learning segmentation methods to review and maybe re-establish what we know about normal brain anatomy and pathology. Subsequent posts will tackle intra-cranial bleeds, their typical volumes and locations across similarly sized datasets.
Brain ventricular volume has been quantified by post-mortem studies [1] and by pneumoencephalography. When CT and subsequently MRI became available, facilitating non-invasive observation of the ventricular system, larger datasets could be used to study these volumes. Typical subject numbers in recent studies have ranged from 50 to 150 [2-6].
Now that deep learning segmentation methods have enabled automated, precise measurement of ventricular volume, we can re-establish these reference ranges using datasets that are two orders of magnitude larger. This is likely to be especially useful at the age-group extremes: in children, where very limited reference data exist, and in the elderly, where the effects of age-related atrophy may co-exist with pathological neurodegenerative processes.
To date, no standard has been established regarding the normal ventricular volume of the human brain. The Evans index and the bicaudate index are linear measurements currently being used as surrogates to provide some indication that there is abnormal ventricular enlargement [1]. True volumetric measures are preferable to these indices for a number of reasons [7, 8] but have not been adopted so far, largely because of the time required for manual segmentation of images. Now that automated precise quantification is feasible with deep learning, it is possible to upgrade to a more precise volumetric measure.
Such quantitative measures will be useful in the monitoring of patients with hydrocephalus, and as an aid to diagnosing normal pressure hydrocephalus. In the future, automated measurements of ventricular, brain and cranial volumes could be displayed alongside established age- and gender-adjusted normal ranges as a standard part of radiology head CT and MRI reports.
To train our deep learning model, lateral ventricles were manually annotated in 103 scans. We split these scans randomly in a 4:1 ratio for training and validation respectively. We trained a U-Net to segment the lateral ventricles in each slice. Another U-Net model was trained to segment the cranial vault using a similar process. Models were validated against the annotations using the DICE score metric.
Anatomy | DICE Score |
---|---|
Lateral Ventricles | 0.909 |
Cranial Vault | 0.983 |
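Given the per-slice masks these models produce, converting a segmentation into a volume in millilitres needs only the voxel spacing from the scan metadata; a minimal sketch, assuming the masks are stacked into a binary (z, y, x) array:

```python
import numpy as np

def mask_volume_ml(mask: np.ndarray, spacing_mm: tuple[float, float, float]) -> float:
    """Volume of a binary (z, y, x) mask, given (dz, dy, dx) voxel spacing in mm."""
    voxel_mm3 = float(np.prod(spacing_mm))
    return float(mask.sum()) * voxel_mm3 / 1000.0  # 1 ml = 1000 mm^3
```

With slice thickness and pixel spacing read from the DICOM headers, the same function covers both the ventricular and the cranial vault masks.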
A validation set of about 20 scans might not represent all the anatomical/pathological variation in the population. Therefore, we visually verified that the resulting models worked despite pathologies like hemorrhage/infarcts or surgical implants such as shunts. We show some representative scans and model outputs below.
To study lateral ventricular and cranial vault volume variation across the population, we randomly selected 14153 scans from our database. This selection contained only 208 scans with hydrocephalus reported by the radiologist; since we wanted to study ventricular volume variation in patients with hydrocephalus, we added 1314 additional scans reported with 'hydrocephalus'. We excluded scans for which age/gender metadata were not available. In total, our analysis dataset contained 15223 scans, whose demographic characteristics are shown in the table below.
Characteristic | Value |
---|---|
Number of scans | 15223 |
Females | 6317 (41.5%) |
Age: median (interquartile range) | 40 (24 - 56) years |
Scans reported with cerebral atrophy | 1999 (13.1%) |
Scans reported with hydrocephalus | 1404 (9.2%) |
A histogram of the age distribution is shown below. It can be observed that there are reasonable numbers of subjects (>200) for all age and sex groups, which ensures that our analysis is generalizable.
We ran the trained deep learning models and measured lateral ventricular and cranial vault volumes for each of the 15223 scans in our database. Below is a scatter plot of all the analyzed scans.
In this scatter plot, the x-axis is lateral ventricular volume and the y-axis is cranial vault volume. Scans with atrophy are marked in orange, while scans with hydrocephalus are marked in green. Patients with atrophy lie to the right of the majority of individuals, indicating larger ventricles in these subjects. Patients with hydrocephalus lie at the extreme right, with ventricular volumes even higher than those with atrophy.
To make this relationship clearer, we have plotted the distribution of ventricular volume for patients without hydrocephalus or atrophy and for patients with one of these conditions.
Interestingly, the hydrocephalus distribution has a very long tail, while the distribution for patients with neither hydrocephalus nor atrophy has a narrower peak.
Next, let us observe cranial vault volume variation with age and sex. Bands around the solid lines indicate the interquartile range of cranial vault volume for each group.
An obvious feature of this plot is that the cranial vault increases in size until the age of 10-20, after which it plateaus. The cranial vault of males is approximately 13% larger than that of females. Another interesting point is that the male cranial vault grows until the 15-20 age group, while in females it stabilizes at ages 10-15.
Now, let’s plot variation of lateral ventricles with age and sex. As before, bands indicate interquartile range for a particular age group.
This plot shows that the ventricles grow in size as one ages. This may be explained by the fact that the brain naturally atrophies with age, leading to relative enlargement of the ventricles. This information can serve as a normal range of ventricle volume for a given age and sex; ventricle volume outside this range can be indicative of hydrocephalus or a neurodegenerative disease.
While the above plot showed the variation of lateral ventricle volumes across age and sex, it might be easier to visualize the relative proportion of lateral ventricular volume to cranial vault volume. This also has a normalizing effect across sexes, since the difference in ventricular volumes between sexes might simply be due to the difference in cranial vault sizes.
This plot looks similar to the previous one, with the ratio of ventricular volume to cranial vault increasing with age. Until the age of 30-35, males and females have relatively similar relative ventricular volumes. After that age, however, males tend to have larger relative ventricular size compared to females. This is in line with prior research which found that males are more susceptible to atrophy than females[10].
We can incorporate all this analysis into our automated report. For example, the following is the automated report for a CT scan of a 75-year-old patient.
qER Analysis Report
===================

Patient ID: KSA18458
Patient Age: 75Y
Patient Sex: M

Preliminary Findings by Automated Analysis:

- Infarct of 0.86 ml in left occipital region.
- Dilated lateral ventricles. This might indicate neurodegenerative
  disease/hydrocephalus. Lateral ventricular volume = 88 ml.
  Interquartile range for male >=75Y patients is 28 - 54 ml.

This is a report of preliminary findings by automated analysis.
Other significant abnormalities may be present.
Please refer to final report.
The question of how to establish the ground truth for these measurements still remains to be answered. For this study, we use Dice scores versus manually outlined ventricles as an indicator of segmentation accuracy. Ventricle volumes annotated slice-wise by experts are an insufficient gold standard, not only because of scale but also because of their limited precision. The places where these algorithms are most likely to fail (and therefore need more testing) are anatomical variants and pathologies that alter the structure of the ventricles. We have tested some common co-occurring pathologies (hemorrhage), but it would be interesting to see how well the method performs on scans with congenital anomalies and other conditions such as subarachnoid cysts (which caused an earlier machine-learning-based algorithm to fail [9]).
The first challenge we faced in the development process is that CT scans are three-dimensional (3D). There is a plethora of research on two-dimensional (2D) images, but far less on 3D images. You might ask: why not simply use 3D convolutional neural networks (CNNs) in place of 2D CNNs? Notwithstanding the computational and memory requirements of 3D CNNs, they have been shown to be inferior to 2D CNN based approaches on a similar problem (action recognition).
So how do we solve it? We need not reinvent the wheel when there is a lot of literature on a similar problem: action recognition, i.e. classifying the action present in a given video. Why is action recognition similar to 3D volume classification? The temporal dimension in videos is analogous to the Z dimension in a CT.
We took a foundational work from the action recognition literature and modified it for our purposes. Our modification was to incorporate slice-level (frame-level, in video terms) labels into the network, because the action recognition literature has the luxury of pretrained 2D CNNs, which we do not share.
The second challenge was that CT is high resolution both spatially and in bit depth. Spatially, we simply downsample the CT to a standard pixel spacing. How about bit depth? Deep learning doesn't work well with data that is not normalized to [-1, 1] or [0, 1]. We solved this with what a radiologist would use: windowing. Windowing restricts the dynamic range to a certain interval (e.g. [0, 80]) and then normalizes it. We applied three windows and passed them as channels to the CNNs.
Windows: brain, blood/subdural and bone
This approach allows the model to account for effects that span windows. For example, a large scalp hematoma visible in the brain window might indicate a fracture underneath it. Conversely, a fracture visible in the bone window is usually correlated with an extra-axial bleed.
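To make this concrete, here is a minimal sketch of such windowing in NumPy. The exact window centers and widths are not given in this post, so the (center, width) values below are illustrative assumptions only:

```python
import numpy as np

# Hypothetical (center, width) pairs in Hounsfield units; illustrative only.
WINDOWS = {
    "brain": (40, 80),
    "blood/subdural": (80, 200),
    "bone": (600, 2800),
}

def apply_window(hu_slice, center, width):
    """Clip a CT slice (in HU) to the window and normalize to [0, 1]."""
    low, high = center - width / 2, center + width / 2
    return (np.clip(hu_slice, low, high) - low) / (high - low)

def windowed_channels(hu_slice):
    """Stack the three windowed views as input channels for a 2D CNN."""
    return np.stack(
        [apply_window(hu_slice, c, w) for c, w in WINDOWS.values()],
        axis=0,  # channels-first layout
    )
```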
There are a few other challenges that deserve mention as well:
Once the algorithms were developed, validation was not without its challenges either. Here are the key questions we started with: do our algorithms generalize well to CT scans not in the development dataset? Do they also generalize to CT scans from a different source altogether? How do they compare to radiologists without access to clinical history?
The validation looks simple enough: just acquire scans (from a different source), get them read by radiologists, and compare their reads with the algorithms'. But the statistical design is a challenge! This is because the prevalence of abnormalities tends to be low; it can be as low as 1% for some abnormalities. Our key metrics for evaluating the algorithms are sensitivity, specificity, and AUC, which depends on both. Sensitivity is the troublemaker: we have to ensure there are enough positives in the dataset to get narrow enough 95% confidence intervals (CI). The required number of positive scans turns out to be ~80 for a CI of +/- 10% at an expected sensitivity of 0.7.
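As a sanity check, the ~80 figure follows from the usual normal-approximation confidence interval for a proportion; a small sketch (the helper name is ours, not from the post):

```python
from math import ceil
from scipy.stats import norm

def positives_needed(sensitivity, half_width, confidence=0.95):
    """Normal-approximation sample size for a proportion's CI."""
    z = norm.ppf(1 - (1 - confidence) / 2)  # ~1.96 for a 95% CI
    return ceil(z**2 * sensitivity * (1 - sensitivity) / half_width**2)

print(positives_needed(0.7, 0.10))  # -> 81, consistent with the ~80 above
```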
If we were to choose a randomly sampled dataset, the number of scans to be read would be ~80/prevalence = 8000. With three readers per scan, the total number of reads is 8k * 3 = 24k. This is a prohibitively large dataset to get read by radiologists. We therefore cannot use a randomly sampled dataset; we have to somehow enrich the number of positives in it.
To enrich a dataset with positives, we have to find the positives from all the scans available. It’s like searching for a needle in a haystack. Fortunately, all the scans usually have a clinical report associated with them. So we just have to read the reports and choose the positive reports. Even better, have an NLP algorithm parse the reports and randomly sample the required number of positives. We chose this path.
We collected the dataset in two batches, B1 and B2. B1 comprised all the head CT scans acquired in a month, while B2 was the algorithmically selected dataset. So B1 mostly contained negatives while B2 contained a lot of positives. This approach removed any selection bias that might have been present if the scans were picked manually. For example, if positive scans were picked by cursory manual glances at the scans themselves, subtle positive findings would have been missing from the dataset.
We called this enriched dataset the CQ500 dataset (C for CARING and Q for Qure.ai). The dataset contained 491 scans after exclusions. Three radiologists independently read each scan, and the majority vote was considered the gold standard. We randomized the order of the reads to minimize recall of follow-up scans and to blind the readers to the batches of the dataset.
We have made this dataset and the radiologists' reads public under a CC-BY-NC-SA license. Other researchers can use it to benchmark their algorithms. I think it can also be used for clinical research, such as measuring the concordance of radiologists on various tasks.
In addition to the CQ500 dataset, we validated the algorithms on a much larger randomly sampled dataset, the Qure25k dataset, containing 21095 scans. Ground truths were clinical radiology reports; we used the NLP algorithm to get structured data from them. This dataset satisfies the statistical requirements, but each scan was read by only a single radiologist, who had access to clinical history.
Finding | CQ500 (95% CI) | Qure25k (95% CI) |
---|---|---|
Intracranial hemorrhage | 0.9419 (0.9187-0.9651) | 0.9194 (0.9119-0.9269) |
Intraparenchymal | 0.9544 (0.9293-0.9795) | 0.8977 (0.8884-0.9069) |
Intraventricular | 0.9310 (0.8654-0.9965) | 0.9559 (0.9424-0.9694) |
Subdural | 0.9521 (0.9117-0.9925) | 0.9161 (0.9001-0.9321) |
Extradural | 0.9731 (0.9113-1.0000) | 0.9288 (0.9083-0.9494) |
Subarachnoid | 0.9574 (0.9214-0.9934) | 0.9044 (0.8882-0.9205) |
Calvarial fracture | 0.9624 (0.9204-1.0000) | 0.9244 (0.9130-0.9359) |
Midline Shift | 0.9697 (0.9403-0.9991) | 0.9276 (0.9139-0.9413) |
Mass Effect | 0.9216 (0.8883-0.9548) | 0.8583 (0.8462-0.8703) |
The table above shows the AUCs of the algorithms on the two datasets. Note that the AUCs are directly comparable because AUC is prevalence independent. AUCs on the CQ500 dataset are generally better than those on the Qure25k dataset. This might be because:
Shown above are the ROC curves on both datasets, with the readers' TPRs and FPRs also plotted. We observe that radiologists tend to be either highly sensitive or highly specific for a particular finding. The algorithms are yet to beat radiologists, on this task at least! But they should nonetheless be useful to triage or notify physicians.
In this post, I summarize the literature on action recognition from videos. The post is organized into three sections -
Action recognition task involves the identification of different actions from video clips (a sequence of 2D frames) where the action may or may not be performed throughout the entire duration of the video. This seems like a natural extension of image classification tasks to multiple frames and then aggregating the predictions from each frame. Despite the stratospheric success of deep learning architectures in image classification (ImageNet), progress in architectures for video classification and representation learning has been slower.
What made this task tough?
Huge computational cost: A simple 2D convolutional net for classifying 101 classes has just ~5M parameters, whereas the same architecture inflated to a 3D structure results in ~33M parameters. It takes 3 to 4 days to train a 3D ConvNet on UCF101 and about two months on Sports-1M, which makes extensive architecture search difficult and overfitting likely[1].
Capturing long context: Action recognition involves capturing spatiotemporal context across frames. Additionally, the spatial information captured has to compensate for camera movement. Even strong spatial object detection doesn't suffice, as the motion information also carries finer details. There is a local as well as a global context w.r.t. motion information that needs to be captured for robust predictions. For example, consider the video representations shown in Figure 2. A strong image classifier can identify a human and a water body in both videos, but it is the nature of the temporal periodic action that differentiates the front crawl from the breast stroke.
Fig 2: Left: Front crawl. Right: Breast stroke. Capturing temporal motion is critical to differentiate these two seemingly similar cases. Also notice how the camera angle suddenly changes in the middle of the front crawl video.
No standard benchmark: The most popular benchmark datasets have long been UCF101 and Sports1M. Searching for a reasonable architecture on Sports1M can be extremely expensive. For UCF101, although the number of frames is comparable to ImageNet, the high spatial correlation among the videos makes the actual diversity in training much lower. Also, given the similar theme (sports) across both datasets, generalization of benchmarked architectures to other tasks remained a problem. This has lately been addressed with the introduction of the Kinetics dataset[2].
Sample illustration of UCF-101. Source.
It must be noted here that abnormality detection from 3D medical images doesn't involve all the challenges mentioned here. The major differences between action recognition and abnormality detection from medical images are as follows:
In medical imaging, the temporal context may not be as important as in action recognition. For example, detecting hemorrhage in a head CT scan involves much less temporal context across slices; an intracranial hemorrhage can often be detected from a single slice alone. In contrast, detecting a lung nodule in a chest CT involves capturing 3D context, as nodules, bronchi, and vessels all look like circular objects in 2D slices. It is only when 3D context is captured that nodules can be seen as spherical objects, as opposed to cylindrical objects like vessels.
In action recognition, most research ideas resort to using pre-trained 2D CNNs as a starting point for drastically better convergence. For medical images, such pre-trained networks are unavailable.
Before deep learning came along, most of the traditional CV algorithm variants for action recognition can be broken down into the following 3 broad steps:
Of the algorithms that use shallow hand-crafted features in Step 1, improved Dense Trajectories (iDT) [6], which uses densely sampled trajectory features, was the state-of-the-art. Simultaneously, 3D convolutions were used as-is for action recognition in 2013, without much success[7]. Soon after, in 2014, two breakthrough papers were released which form the backbone for all the papers we are going to discuss in this post. The major difference between them was the design choice around combining spatiotemporal information.
In this work [June 2014], the authors - Karpathy et al. - explore multiple ways to fuse temporal information from consecutive frames using 2D pre-trained convolutions.
Fig 3: Fusion Ideas Source.
As can be seen in Fig 3, consecutive frames of the video are presented as input in all setups. Single frame uses a single architecture that fuses information from all frames at the last stage. Late fusion uses two nets with shared parameters, spaced 15 frames apart, and also combines predictions at the end. Early fusion combines information in the first layer by convolving over 10 frames. Slow fusion fuses at multiple stages, a balance between early and late fusion. For final predictions, multiple clips were sampled from the entire video and their prediction scores were averaged.
Despite extensive experimentation, the authors found the results significantly worse than those of state-of-the-art hand-crafted feature based algorithms. There were multiple reasons attributed to this failure:
In this pioneering work [June 2014] by Simonyan and Zisserman, the authors build on the failures of the previous work by Karpathy et al. Given the difficulty deep architectures have in learning motion features, the authors explicitly modeled motion features in the form of stacked optical flow vectors. So instead of a single network for spatial context, this architecture has two separate networks: one for spatial context (pre-trained) and one for motion context. The input to the spatial net is a single frame of the video. The authors experimented with the input to the temporal net and found that bi-directional optical flow stacked across 10 successive frames performed best. The two streams were trained separately and combined using an SVM. The final prediction was obtained as in the previous paper, i.e. by averaging across sampled frames.
Fig 4: Two stream architecture Source.
Though this method improved on the single stream method by explicitly capturing local temporal movement, there were still a few drawbacks:
The following papers are, in a way, evolutions of these two papers (single stream and two stream); they are summarized below:
The recurrent theme across these papers can be summarized as follows. All of them are improvements on top of these basic ideas.
Recurrent theme across papers. Source.
For each of these papers, I list down their key contributions and explain them. I also show their benchmark scores on UCF101-split1.
Key Contributions:
Explanation:
In a previous work by Ng et al.[9], the authors had explored the idea of using LSTMs on separately trained feature maps to see if they could capture temporal information from clips. Sadly, they concluded that temporal pooling of convolutional features proved more effective than an LSTM stacked after trained feature maps. In the current paper, the authors build on the same idea of using LSTM blocks (decoder) after convolution blocks (encoder), but with end-to-end training of the entire architecture. They also compared RGB and optical flow as input choices and found that a weighted scoring of predictions based on both inputs was best.
Fig 5: Left: LRCN for action recognition. Right: Generic LRCN architecture for all tasks Source.
Algorithm:
During training, 16-frame clips are sampled from the video. The architecture is trained end-to-end with the RGB or optical flow of a 16-frame clip as input. The prediction for each clip is the average of the predictions across its time steps, and the final video-level prediction is the average of the predictions from each clip.
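A minimal PyTorch sketch of this encoder-decoder pattern follows; the tiny conv encoder and all dimensions are illustrative stand-ins, not the paper's exact architecture:

```python
import torch
import torch.nn as nn

class LRCN(nn.Module):
    """Per-frame CNN encoder followed by an LSTM decoder (LRCN-style sketch)."""

    def __init__(self, num_classes, feat_dim=512, hidden=256):
        super().__init__()
        self.encoder = nn.Sequential(          # stand-in for a real conv encoder
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, feat_dim))
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, num_classes)

    def forward(self, clip):                   # clip: (B, T, 3, H, W)
        b, t = clip.shape[:2]
        feats = self.encoder(clip.flatten(0, 1)).view(b, t, -1)
        out, _ = self.lstm(feats)              # hidden state per time step
        return self.head(out).mean(dim=1)      # average predictions over time
```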
Benchmarks (UCF101-split1):
Score | Comment |
---|---|
82.92 | Weighted score of flow and RGB inputs |
71.1 | Score with just RGB |
My comments:
Even though the authors suggested end-to-end training frameworks, there were still a few drawbacks:
Varol et al. in their work[10] tried to compensate for the restricted temporal range by using a lower spatial resolution of the video and longer clips (60 frames), which led to significantly better performance.
Key Contributions:
Explanation:
In this work the authors built upon the work of Karpathy et al. However, instead of using 2D convolutions across frames, they used 3D convolutions on the video volume. The idea was to train these vast networks on Sports1M and then use them (or an ensemble of nets with different temporal depths) as feature extractors for other datasets. Their finding was that a simple linear classifier like an SVM on top of the ensemble of extracted features worked better than the state-of-the-art algorithms. The model performed even better if hand-crafted features like iDT were used additionally.
Differences in C3D paper and single stream paper Source.
The other interesting part of the work was using deconvolutional layers (explained here) to interpret the decisions. Their finding was that the net focused on spatial appearance in the first few frames and tracked the motion in the subsequent frames.
Algorithm:
During training, five random 2-second clips are extracted from each video, with the ground truth being the action reported for the entire video. At test time, 10 clips are randomly sampled and their predictions are averaged for the final prediction.
3D convolution where convolution is applied on a spatiotemporal cube.
Benchmarks (UCF101-split1):
Score | Comment |
---|---|
82.3 | C3D (1 net) + linear SVM |
85.2 | C3D (3 nets) + linear SVM |
90.4 | C3D (3 nets) + iDT + linear SVM |
My comments:
Long range temporal modeling was still a problem. Moreover, training such huge networks is computationally expensive, especially for medical imaging where pre-training on natural images doesn't help a lot.
Note: Around the same time, Sun et al.[11] introduced the concept of factorized 3D conv networks (FSTCN), exploring the idea of breaking 3D convolutions into spatial 2D convolutions followed by temporal 1D convolutions. The 1D convolution, placed after the 2D conv layer, was implemented as a 2D convolution over the temporal and channel dimensions. FSTCN had comparable results on the UCF101 split.
FSTCN paper and the factorization of 3D convolution Source.
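The factorization idea itself is compact; here is a sketch, expressed with Conv3d kernels for clarity (channel sizes are illustrative assumptions, and this is not the paper's exact implementation):

```python
import torch.nn as nn

# A "3D" convolution factorized into a spatial conv (1 x k x k)
# followed by a temporal conv (k x 1 x 1), in the spirit of FSTCN.
factorized_conv = nn.Sequential(
    nn.Conv3d(3, 64, kernel_size=(1, 3, 3), padding=(0, 1, 1)),   # spatial
    nn.ReLU(),
    nn.Conv3d(64, 64, kernel_size=(3, 1, 1), padding=(1, 0, 0)),  # temporal
)
```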
Key Contributions:
Explanation:
Although this work is not directly related to action recognition, it was a landmark work for video representations. In this paper the authors use a 3D CNN + LSTM as the base architecture for a video description task. On top of this base, they use a pre-trained 3D CNN for improved results.
Algorithm:
The setup is almost the same as the encoder-decoder architecture described in LRCN, with two differences:
Attention mechanism for action recognition. Source.
Benchmarks:
Score | Comment |
---|---|
– | Network used for video description prediction |
My comments:
This was one of the landmark works of 2015, introducing an attention mechanism for video representations for the first time.
Key Contributions:
Explanation:
In this work, the authors use the base two stream architecture with two novel approaches and demonstrate improved performance without any significant increase in the number of parameters. They explore the efficacy of two major ideas.
Fusion of spatial and temporal streams (how and when): For a task discriminating between brushing hair and brushing teeth, the spatial net can capture the spatial dependency in a video (whether it is hair or teeth) while the temporal net can capture the presence of periodic motion for each spatial location. Hence it is important to map the spatial feature map of, say, a particular facial region to the temporal feature map of the corresponding region. To achieve this, the nets need to be fused at an early level, such that responses at the same pixel position are put in correspondence, rather than fusing at the end (as in the base two stream architecture).
Combining temporal net output across time frames so that long term dependency is also modeled.
Algorithm:
Everything from the two stream architecture remains almost the same, except:
Possible strategies for fusing the spatial and temporal streams. The one on the right performed better. Source.
Two stream fusion architecture. There are two paths, one for step 1 and the other for step 2. Source.
Benchmarks (UCF101-split1):
Score | Comment |
---|---|
92.5 | TwoStreamfusion |
94.2 | TwoStreamfusion + iDT |
My comments: The authors established the supremacy of the TwoStreamFusion method, as it improved performance over C3D without the extra parameters that C3D uses.
Key Contributions:
Explanation:
In this work the authors improved on the two stream architecture to produce state-of-the-art results. There were two major differences from the original paper:
For the final video-level prediction, the authors explored multiple strategies. The best strategy was:
The other important part of the work was establishing the problem of overfitting (due to small dataset sizes) and demonstrating the usage of now-prevalent techniques like batch normalization, dropout and pre-training to counter it. The authors also evaluated two new input modalities as alternatives to optical flow, namely warped optical flow and RGB difference.
Algorithm:
During training and prediction, a video is divided into K segments of equal duration. Snippets are then sampled randomly from each of the K segments. The rest of the steps remain similar to the two stream architecture, with the changes mentioned above.
Temporal Segment Network architecture. Source.
Benchmarks (UCF101-split1):
Score | Comment |
---|---|
94.0 | TSN (input RGB + Flow ) |
94.2 | TSN (input RGB + Flow + Warped flow) |
My comments:
The work tackled two big challenges in action recognition, overfitting due to small dataset sizes and long range temporal modeling, and the results were really strong. However, the need to pre-compute optical flow and the related input modalities remained a problem at large.
Key Contributions:
Explanation:
In this work, the most notable contribution by the authors is the usage of learnable feature aggregation (VLAD), as compared to normal aggregation using maxpool or avgpool. The aggregation technique is akin to a bag of visual words. There is a vocabulary of multiple learned anchor points (say c1, ..., ck) representing k typical action (or sub-action) related spatiotemporal features. The output from each stream in the two stream architecture is encoded in terms of k-space "action word" features, each feature being the difference of the output from the corresponding anchor point for any given spatial or temporal location.
ActionVLAD - Bag of action based visual "words". Source.
Average or max pooling represents the entire distribution of points as a single descriptor, which can be sub-optimal for representing an entire video composed of multiple sub-actions. In contrast, the proposed video aggregation represents an entire distribution of descriptors with multiple sub-actions by splitting the descriptor space into k cells and pooling inside each cell.
While max or average pooling works well for similar features, it does not adequately capture the complete distribution of features. ActionVLAD clusters the appearance and motion features and aggregates their residuals from the nearest cluster centers. Source.
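For intuition, here is a minimal NetVLAD-style pooling layer in PyTorch with learned anchor points and soft assignments. This is a sketch of the idea under our own simplifying assumptions, not the authors' implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VLADPool(nn.Module):
    """Soft-assignment VLAD pooling over spatiotemporal features.

    Input: (batch, dim, T, H, W). The k anchor points c_1..c_k are learned.
    """

    def __init__(self, dim, k):
        super().__init__()
        self.centers = nn.Parameter(torch.randn(k, dim))
        self.assign = nn.Conv3d(dim, k, kernel_size=1)  # soft-assignment logits

    def forward(self, x):
        a = F.softmax(self.assign(x), dim=1)            # (b, k, T, H, W)
        x, a = x.flatten(2), a.flatten(2)               # (b, d, N), (b, k, N)
        # sum of assignment-weighted residuals (x - c_k) over all locations
        vlad = torch.einsum('bkn,bdn->bkd', a, x) \
             - a.sum(-1).unsqueeze(-1) * self.centers
        return F.normalize(vlad.flatten(1), dim=1)      # (b, k * dim)
```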
Algorithm:
Everything from the two stream architecture remains almost the same, except for the ActionVLAD layer. The authors experimented with multiple locations for the ActionVLAD layer; late fusion after the conv layers worked out to be the best strategy.
Benchmarks (UCF101-split1):
Score | Comment |
---|---|
92.7 | ActionVLAD |
93.6 | ActionVLAD + iDT |
My comments: VLAD had long been proven an effective way of pooling. Extending it into an end-to-end trainable framework made this technique extremely robust and state-of-the-art for most action recognition tasks in early 2017.
Key Contributions:
Explanation:
The usage of optical flow in the two stream architecture made it mandatory to pre-compute optical flow for each sampled frame beforehand, adversely affecting storage and speed. This paper advocates an unsupervised architecture to generate optical flow for a stack of frames.
Optical flow can be regarded as an image reconstruction problem. Given a pair of adjacent frames I1 and I2 as input, the CNN generates a flow field V. Then, using the predicted flow field V and I2, I1 can be reconstructed as I1' by inverse warping, such that the difference between I1 and its reconstruction is minimized.
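A minimal sketch of this reconstruction objective in PyTorch is shown below. The actual MotionNet objective is a multi-level loss with additional terms; this shows only the basic photometric term, with helper names of our own choosing:

```python
import torch
import torch.nn.functional as F

def warp(img2, flow):
    """Inverse-warp I2 with flow V to reconstruct I1.

    img2: (b, c, h, w); flow: (b, 2, h, w) in pixel units, (x, y) order.
    """
    b, _, h, w = img2.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing='ij')
    base = torch.stack((xs, ys)).float().to(img2.device)  # (2, h, w)
    coords = base.unsqueeze(0) + flow                     # sample I2 at x + V(x)
    # normalize coordinates to [-1, 1] as grid_sample expects
    gx = 2 * coords[:, 0] / (w - 1) - 1
    gy = 2 * coords[:, 1] / (h - 1) - 1
    grid = torch.stack((gx, gy), dim=-1)                  # (b, h, w, 2)
    return F.grid_sample(img2, grid, align_corners=True)

def photometric_loss(img1, img2, flow):
    """Unsupervised objective: difference between I1 and its reconstruction."""
    return (img1 - warp(img2, flow)).abs().mean()
```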
Algorithm:
The authors explored multiple strategies and architectures to generate optical flow with the highest fps and fewest parameters without hurting accuracy much. The final architecture was the same as the two stream architecture, with the following changes:
The temporal stream now has the optical flow generation net (MotionNet) stacked on top of the usual temporal stream architecture. The input to the temporal stream is thus consecutive frames instead of precomputed optical flow.
There is an additional multi-level loss for the unsupervised training of MotionNet.
The authors also demonstrate improvement in performance using TSN based fusion instead of conventional architecture for two stream approach.
HiddenTwoStream - MotionNet generates optical flow on-the-fly. Source.
Benchmarks (UCF101-split1):
Score | Comment |
---|---|
89.8 | Hidden Two Stream |
92.5 | Hidden Two Stream + TSN |
My comments: The major contribution of the paper was improving the speed and associated cost of prediction. With automated flow generation, the authors removed the dependency on slower traditional methods of generating optical flow.
Key Contributions:
Explanation:
This paper takes off from where C3D left off. Instead of a single 3D network, the authors use two different 3D networks, one for each stream in the two stream architecture. Also, to take advantage of pre-trained 2D models, the authors repeat the 2D pre-trained weights along the 3rd dimension. The spatial stream input now consists of frames stacked along the time dimension instead of the single frames used in the basic two stream architecture.
Algorithm:
Same as the basic two stream architecture, but with 3D nets for each stream.
Benchmarks (UCF101-split1):
Score | Comment |
---|---|
93.4 | Two Stream I3D |
98.0 | Imagenet + Kinetics pre-training |
My comments:
The major contribution of the paper was demonstrating the benefit of using pre-trained 2D conv nets. The Kinetics dataset, open-sourced along with the paper, was the other crucial contribution.
Key Contributions:
Explanation:
The authors extend the work done on I3D but suggest a single stream 3D DenseNet based architecture with a multi-depth temporal pooling layer (Temporal Transition Layer) stacked after the dense blocks to capture different temporal depths. The multi-depth pooling is achieved by pooling with kernels of varying temporal sizes.
TTL Layer along with rest of DenseNet architecture. Source.
Apart from the above, the authors also devise a new technique for supervising transfer learning between pre-trained 2D conv nets and T3D. The 2D pre-trained net and T3D are both presented with frames and clips, which may or may not come from the same video. The architecture is trained to predict 0/1 accordingly, and the error from this prediction is back-propagated through the T3D net so as to effectively transfer knowledge.
Transfer learning supervision. Source.
Algorithm:
The architecture is basically a 3D modification of DenseNet [12] with added variable temporal pooling.
Benchmarks (UCF101-split1):
Score | Comment |
---|---|
90.3 | T3D |
91.7 | T3D + Transfer |
93.2 | T3D + TSN |
My comments:
Although the results don't improve on I3D, that can mostly be attributed to the much lower model footprint compared to I3D. The most novel contribution of the paper was the supervised transfer learning technique.
For now, we use radiologist reports as the gold standard as we train deep learning algorithms to recognize abnormalities on radiology images. While this is not ideal for many reasons (see this), it is currently the most scalable way to supply classification algorithms with the millions of images that they need in order to achieve high accuracy.
These reports are usually written in free form text rather than in a structured format. So, we have designed a rule based Natural Language Processing (NLP) system to extract findings automatically from these unstructured reports.
CT SCAN BRAIN - PLAIN STUDY

Axial CT sections of the brain were performed from the level of the base of skull. 5mm sections were done for the posterior fossa and 5mm sections for the supra sellar region without contrast.

OBSERVATIONS:

- Area of intracerebral haemorrhage measuring 16x15mm seen in left gangliocapsular region and left corona radiata.
- Minimal squashing of left lateral ventricle noted without any appreciable midline shift.
- Lacunar infarcts seen in both gangliocapsular regions.
- Cerebellar parenchyma is normal.
- Fourth ventricle is normal in position and caliber.
- The cerebellopontine cisterns, basal cisterns and sylvian cisterns appear normal.
- Midbrain and pontine structures are normal.
- Sella and para sellar regions appear normal.
- The grey-white matter attenuation pattern is normal.
- Calvarium appears normal.
- Ethmoid and right maxillary sinusitis noted.

IMPRESSION:

- INTRACEREBRAL HAEMORRHAGE IN LEFT GANGLIOCAPSULAR REGION AND LEFT CORONA RADIATA
- LACUNAR INFARCTS IN BOTH GANGLIOCAPSULAR REGIONS
⇩
{
"intracerebral hemorrhage": true,
"lacunar infarct": true,
"mass effect": true,
"midline shift": false,
"maxillary sinusitis": true
}
An example clinical radiology report and the automatically extracted findings
Rule based NLP systems use a list of manually created rules to parse the unorganized content and structure it. Machine Learning (ML) based NLP systems, on the other hand, automatically generate the rules when trained on a large annotated dataset.
Rule based approaches have multiple advantages when compared to ML based ones:
As reports were collected from multiple centers, there were multiple reporting standards. Therefore, we constructed a set of rules to capture these variations after manually reading a large number of reports. Of these, I illustrate two common types of rules below.
In reports, the same finding can be noted in several different formats, including the definition of the finding itself or its synonyms. For example, the finding blunted CP angle could be reported in either of the following ways:
We collected all the wordings that can be used to report findings and created a rule for each finding. As an illustration, the following is the rule for blunted CP angle.
((angle & (blunt | obscur | oblitera | haz | opaci)) | (effusio & thicken))
Visualization of the blunted CP angle rule
This rule will be positive if a sentence contains the word angle together with blunted or one of its synonyms. Alternatively, it will also be positive if a sentence contains the words effusion and thickening.
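To make this concrete, here is a toy evaluator for such rules in Python. The rule representation and the `matches` helper are hypothetical; the actual rule engine is not described in this post:

```python
import re

# Hypothetical encoding of the rule above: nested ('and' | 'or', ...) tuples,
# with string leaves treated as word-stem matches within a sentence.
BLUNTED_CP_ANGLE = ('or',
    ('and', 'angle', ('or', 'blunt', 'obscur', 'oblitera', 'haz', 'opaci')),
    ('and', 'effusio', 'thicken'))

def matches(rule, sentence):
    """Recursively evaluate a rule against one sentence."""
    if isinstance(rule, str):                  # leaf: case-insensitive stem match
        return re.search(rule, sentence, re.IGNORECASE) is not None
    op, *args = rule
    combine = all if op == 'and' else any
    return combine(matches(arg, sentence) for arg in args)

print(matches(BLUNTED_CP_ANGLE, "CP angle appears blunted."))          # True
print(matches(BLUNTED_CP_ANGLE, "Pleural effusion with thickening."))  # True
```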
In addition, there can be a hierarchical structure in findings. For example, opacity is considered positive if any of edema, groundglass, consolidation etc. are positive. We therefore created an ontology of findings and rules to deal with this hierarchy.
[opacity]
rule = ((opacit & !(/ & collapse)) | infiltrate | hyperdensit)
hierarchy = (edema | groundglass | consolidation | ... )
Rule and hierarchy for opacity
The above mentioned rules are used to detect findings in a report, but they are not sufficient to understand it. For example, consider the following sentences.
1. Intracerebral hemorrhage is absent.
2. Contusions are ruled out.
3. No evidence of intracranial hemorrhages in the brain.
Although the findings intracerebral hemorrhage, contusion and intracranial hemorrhage are mentioned in the above sentences, it is their absence rather than their presence that is noted. Therefore, we need to detect negations in a sentence in addition to findings.
We manually read several sentences that indicate negation of findings and grouped these sentences according to their structures. Rules to detect negation were created based on these groups. One of these is illustrated below:
(<finding>) & ( is | are | was | were ) & (absent | ruled out | unlikely | negative)
Negation detection structure
We can see that the first and second sentences of the above example match this rule, and therefore we can infer that the findings are negative:
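As a toy illustration, this negation structure can be approximated with a regular expression; the finding vocabulary and pattern below are illustrative, not the production rules:

```python
import re

# One negation pattern mirroring the structure above (hypothetical vocabulary).
NEGATION = re.compile(
    r'(?P<finding>intracerebral hemorrhage|contusion|intracranial hemorrhage)s?'
    r'\b.*\b(is|are|was|were)\b.*\b(absent|ruled out|unlikely|negative)',
    re.IGNORECASE)

for sentence in ["Intracerebral hemorrhage is absent.",
                 "Contusions are ruled out."]:
    match = NEGATION.search(sentence)
    if match:
        print(f"{match.group('finding').lower()}: negative")
```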
- intracerebral hemorrhage: negative
- contusion: negative

We have tested our algorithm on a dataset containing 1878 clinical radiology reports of head CT scans. We manually read all the reports to create the gold standard, and used sensitivity and specificity as evaluation metrics. The results are given in the table below.
Findings | #Positives | Sensitivity (95% CI) | Specificity (95% CI) |
---|---|---|---|
Intracranial Hemorrhage | 207 | 0.9807 (0.9513-0.9947) | 0.9873 (0.9804-0.9922) |
Intraparenchymal Hemorrhage | 157 | 0.9809 (0.9452-0.9960) | 0.9883 (0.9818-0.9929) |
Intraventricular Hemorrhage | 44 | 1.0000 (0.9196-1.0000) | 1.0000 (0.9979-1.0000) |
Subdural Hemorrhage | 44 | 0.9318 (0.8134-0.9857) | 0.9965 (0.9925-0.9987) |
Extradural Hemorrhage | 27 | 1.0000 (0.8723-1.0000) | 0.9983 (0.9950-0.9996) |
Subarachnoid Hemorrhage | 51 | 1.0000 (0.9302-1.0000) | 0.9971 (0.9933-0.9991) |
Fracture | 143 | 1.0000 (0.9745-1.0000) | 1.0000 (0.9977-1.0000) |
Calvarial Fracture | 89 | 0.9888 (0.9390-0.9997) | 0.9947 (0.9899-0.9976) |
Midline Shift | 54 | 0.9815 (0.9011-0.9995) | 1.0000 (0.9979-1.0000) |
Mass Effect | 132 | 0.9773 (0.9350-0.9953) | 0.9933 (0.9881-0.9967) |
In this paper[1], the authors used an ML based NLP model (bag of words with unigrams, bigrams, and trigrams, plus an average word embedding vector) to extract findings from head CT clinical radiology reports. They reported an average sensitivity and specificity of 0.9025 and 0.9172 across findings. The same metrics across our target findings turn out to be 0.9841 and 0.9956 respectively. So, on clinical reports at least, rule based NLP algorithms can perform better than ML based ones.
Qure.ai is deploying deep learning for radiology across the globe. This blog is the first in a series where we will talk about our learnings from deploying deep learning solutions at radiology centers. We will cover the technical aspects of the challenges and solutions here; the operational hurdles will be covered in the next part of this series.
The dawn of an AI revolution is upon us. Deep learning or deep neural networks have crawled into our daily lives transforming how we type, write emails, search for photos etc. It is revolutionizing major fields like healthcare, banking, driving etc. At Qure.ai, we have been working for the past couple of years on our mission of making healthcare more affordable and accessible through the power of deep learning.
Since our journey began more than two years ago, we have seen excellent progress in development and visualization of deep learning models. With Nvidia leading the advancements in GPUs and the release of Pytorch, Tensorflow, MXNet etc leading the war on deep learning frameworks, training deep learning models has become faster and easier than ever.
However, deploying these deep learning models at scale has become a different beast altogether. Let’s discuss some of the major problems that Qure.ai has tackled/is tackling in deploying deep learning for hospitals and radiologists across the globe.
Let us start by understanding how the challenges in deploying deep learning models differ from those in training them. During training, the focus is mainly on prediction accuracy, while deployment focuses on the speed and reliability of predictions. Models can be trained on local servers, but in deployment they need to be capable of scaling up or down depending on the volume of API requests. Companies like Algorithmia and EnvoyAI are trying to solve this problem by providing a layer over AI to serve end users. We are already working with EnvoyAI to explore this route of deploying deep learning.
Caffe was the first framework built with a focus on production. Initially, our research team used both Torch (flexible, imperative) and Lasagne/Keras (python!) for training. The release of Pytorch in late 2016 settled the framework debate within our team.
Deep learning frameworks (source)
Thankfully, this happened before we started looking into deployment. Once we finalized Pytorch for training and tweaking our models, we started looking into best practices for deploying the same. Meanwhile, Facebook released Caffe2 for easier deployment, especially into mobile devices.
The AI community, including Facebook, Microsoft and Amazon, came together to release the Open Neural Network Exchange (ONNX), making it easier to switch between tools as needed. For example, it enables you to train your model in Pytorch and then export it into Caffe2/MXNet/CNTK (Cognitive Toolkit) for deployment. This approach is worth looking into when the load on our servers increases, but for our present needs, deploying models in Pytorch has sufficed.
We use the following components to build our Linux servers, keeping our pythonic deep learning framework in mind.
Docker: For operating system level virtualization
Anaconda: For creating python3 virtual environments and supervising package installations
Django: For building and serving RESTful APIs
Pytorch: As deep learning framework
Nginx: As webserver and load balancer
uWSGI: For serving multiple requests at a time
Celery: As distributed task queue
Most of these tools can be replaced as per requirements. The following diagram represents our present stack.
Server architecture
We use Amazon EC2 P2 instances as our cloud GPU servers, primarily due to our team's familiarity with AWS, although Microsoft Azure and Google Cloud are also excellent options.
Our servers are built from small components performing specific services, and it was important to have them on the same host for easy configuration. Moreover, we handle large dicom images (each between 10 and 50 MB) that get transferred between the components. It made sense to have all the components on the same host; otherwise, the network bandwidth might get choked by these transfers. The following diagram illustrates the software components comprising a typical qure deployment.
Software Components
We started by launching qXR (our chest X-ray product) on a P2 instance, but as the load on our servers rose, managing GPU memory became an overhead. We were also planning to launch qER (our head CT product), which had even higher GPU memory requirements.
Initially, we simply bought new P2 instances. Optimizing their usage, and making sure a few instances were not bogged down by the incoming load while others remained comparatively free, became a challenge. It became clear that we needed auto-scaling for our containers.
Load balancing improves the distribution of workloads across instances (source)
That was when we started looking into solutions for managing our containerized applications. We decided to go with Kubernetes (Amazon ECS is also an excellent alternative), mainly because it runs independently of any specific provider (ECS has to be deployed on the Amazon cloud). Since many hospitals and radiology centers prefer on-premise deployment, Kubernetes is clearly better suited for such needs. It makes life easier with automatic bin-packing of containers based on resource requirements, simpler horizontal scaling, and load balancing.
Initially, when qXR was deployed, it dealt with fewer abnormalities. So, for an incoming request, loading the models into memory, processing the images through them, and then releasing the memory worked fine. But as the number of abnormalities (and thereby models) increased, loading all the models sequentially for each incoming request became an overhead.
We considered accumulating incoming requests and processing images in batches periodically. This could have been a decent solution, except that time is critical when dealing with medical images, even more so in emergencies. It is especially critical for qER where, in cases of stroke, one has less than an hour to make a diagnostic decision. This ruled out the batch processing approach.
Beware of GPUs !! (warning at Qure's Mumbai office)
Moreover, our models for qER were even larger, requiring approximately 10x the GPU memory of the qXR models. Another thought was to keep the models loaded in memory and process images through them as requests arrive. This is a good solution when you need to run your models every second or even millisecond (think of AI models running on the millions of images uploaded to Facebook or Google Photos). However, this is not a typical scenario in the medical domain; radiology centers do not see patients at that scale. Even if the servers send back results within a couple of minutes, that is roughly a 30x improvement on the time a radiologist would take to report the scan, and that assumes a radiologist is immediately available. Otherwise, the average turnaround for a chest X-ray varies from 1 to 2 days (700-1400x of what we currently take).
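As an illustration of the keep-models-loaded approach, here is a minimal sketch of a Celery task that lazily loads a Pytorch model once per worker process. The names, file paths, and broker URL are hypothetical, and this is not our production code:

```python
# tasks.py -- an illustrative sketch only.
import torch
from celery import Celery

app = Celery('qxr', broker='redis://localhost:6379/0')  # hypothetical broker

_model = None  # loaded lazily, once per worker process

def get_model():
    global _model
    if _model is None:
        _model = torch.jit.load('qxr_model.pt')  # hypothetical model artifact
        _model.eval()
    return _model

@app.task
def predict(pixels):
    """Run inference on a preprocessed scan passed as nested lists."""
    with torch.no_grad():
        batch = torch.tensor(pixels).unsqueeze(0)   # add a batch dimension
        return get_model()(batch).sigmoid().tolist()
```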
For now, auto-scaling with Kubernetes solves our problems, but we will definitely revisit this in the future. The solution lies somewhere between the two approaches (think of a caching mechanism for deep learning models).
Training deep learning models, especially in healthcare, is only one part of building a successful AI product. Bringing it to healthcare practitioners is a formidable and interesting challenge in itself. There are other operational hurdles like convincing doctors to embrace AI, offline working style at some hospitals (using radiographic films), lack of modern infrastructure at radiology centers (operating systems, bandwidth, RAM, disk space, GPU), varying procedures for scan acquisition etc. We will talk about them in detail in the next part of this series.
Note
For a free trial of qXR and qER, please visit us at scan.qure.ai
In the previous post we looked at methods to visualize and interpret the decisions made by deep learning models using perturbation based techniques. To summarize: perturbation based methods do a good job of explaining decisions, but they suffer from expensive computation and instability to surprise artifacts. In this post, we give a brief overview of the various gradient-based algorithms for deep learning based classification models, along with their drawbacks.
We would be discussing the following types of algorithms in this post:
In gradient-based algorithms, the gradient of the output with respect to the input is used for constructing the saliency maps. The algorithms in this class differ in the way the gradients are modified during backpropagation. Relevance score based algorithms try to attribute the relevance of each input pixel by backpropagating the probability score instead of the gradient. However, all of these methods involve a single forward and backward pass through the net to generate heatmaps as opposed to multiple forward passes for the perturbation based methods. Evidently, all of these methods are computationally cheaper as well as free of artifacts originating from perturbation techniques.
To illustrate each algorithm, we consider a chest X-ray (image below) of a patient diagnosed with pulmonary consolidation. Pulmonary consolidation is simply a "solidification" of the lung tissue due to the accumulation of solid and liquid material in the air spaces that would normally have been filled by gas [1]. The dense material deposition in the airways could be caused by infection or pneumonia (deposition of pus), lung cancer (deposition of malignant cells), pulmonary hemorrhage (airways filled with blood), etc. An easy way to diagnose consolidation is to look for dense abnormal regions with ill-defined borders in the X-ray image.
Chest X-ray with consolidation.
We would be considering this X-ray and one of our models trained for detecting consolidation for demonstration purposes. For this patient, our consolidation model predicts a possible consolidation with 98.2% confidence.
Explanation: Measure the relative importance of input features by calculating the gradient of the output decision with respect to those input features.
There were two very similar papers that pioneered the idea in 2013. In these papers, Saliency features [2] by Simonyan et al. and DeconvNet [3] by Zeiler et al., the authors directly used the gradient of the majority class prediction with respect to the input to observe saliency features. The main difference between the papers was how they handled the backpropagation of gradients through non-linear layers like ReLU. In the saliency features paper, the gradients of neurons with negative input were suppressed while propagating through ReLU layers; in the DeconvNet paper, the gradients of neurons with incoming negative gradients were suppressed.
Algorithm: Given an image I0, a class c, and a classification ConvNet with class score function Sc(I), the heatmap is calculated as the absolute value of the gradient of Sc with respect to I at I0: \[\left| \frac{\partial S_c}{\partial I} \Big|_{I_0} \right|\]
It is to be noted here that the DeepLIFT paper (which we'll discuss later) also explores gradient * input as an alternative indicator, as it leverages the strength and signal of the input: \[\frac{\partial S_c}{\partial I} \Big|_{I_0} \odot I_0\]
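A minimal PyTorch sketch of both variants, assuming a `model` that returns class scores for a batch:

```python
import torch

def vanilla_saliency(model, image, target_class):
    """Gradient of the class score w.r.t. the input (Simonyan et al. style).

    image: tensor of shape (1, C, H, W); returns an (H, W) heatmap.
    """
    image = image.clone().requires_grad_(True)
    score = model(image)[0, target_class]           # S_c(I_0)
    score.backward()
    return image.grad.abs().max(dim=1).values[0]    # max |gradient| over channels

def grad_times_input(model, image, target_class):
    """The gradient * input variant noted in the DeepLIFT paper."""
    image = image.clone().requires_grad_(True)
    model(image)[0, target_class].backward()
    return (image.grad * image.detach()).sum(dim=1)[0]
```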
Heatmap by GradInput against original annotation.
Shortcomings: The problem with such a simple algorithm arises from non-linear activation functions like ReLU, ELU etc. Being non-differentiable at certain locations, such functions have discontinuous gradients. Since these methods measure partial derivatives with respect to each pixel, the gradient heatmap is inherently discontinuous over the entire image and produces artifacts if viewed as-is. Some of this can be overcome by convolving with a Gaussian kernel. The gradient flow also suffers with renormalization layers like BatchNorm or max pooling.
Explanation: The next paper [4], by Springenberg et al., released in 2014, introduces GuidedBackprop, which suppresses the flow of gradients through neurons where either the input or the incoming gradient is negative. Springenberg et al. showed the difference among these methods through a beautiful illustration, given below. As discussed, this paper combined the gradient handling of both Simonyan et al. and Zeiler et al.
Schematic of visualizing the activations of high layer neurons. a) Given an input image, we perform the forward pass to the layer we are interested in, then set to zero all activations except one and propagate back to the image to get a reconstruction. b) Different methods of propagating back through a ReLU nonlinearity. c) Formal definition of different methods for propagating a output activation out back through a ReLU unit in layer l; note that the ’deconvnet’ approach and guided backpropagation do not compute a true gradient but rather an imputed version. Source.
Heatmap by GuidedBackprop against original annotation.
Shortcomings: The problem of gradient flow through ReLU layers still remained at large. Handling renormalization layers also remained unresolved, as most of the papers so far (including this one) proposed fully convolutional architectures (without max pool layers), and batch normalization was yet to be 'alchemised' in 2014. Another such fully-convolutional architecture paper was CAM [6].
Explanation: An effective way to circumnavigate the backpropagation problems was explored in GradCAM [5] by Selvaraju et al. This paper is a generalization of the CAM [6] algorithm by Zhou et al., which derived attribution scores using fully connected layers. The idea is that, instead of propagating the gradients back to the input, the activation maps of the final convolutional layer can be used directly to infer a downsampled relevance map of the input pixels. The downsampled heatmap is then upsampled to obtain a coarse relevance heatmap.
Algorithm:
Let the feature maps in the final convolutional layers be F1, F2 … ,Fn. Like before assume image I0, a class c, and a classification ConvNet with the class score function Sc(I).
Weights (w1, w2, ..., wn) for the feature maps are calculated from the gradients of the class score w.r.t. each feature map, averaged over its spatial locations: \(w_i = \frac{1}{Z}\sum_{u,v} \frac{\partial S_c}{\partial F_i^{(u,v)}} \ \forall i = 1 \dots n\)
The weights and the corresponding activations of the feature maps are multiplied to compute the weighted activations (A1,A2, … , An) of each pixel in the feature maps. \(A_i = w_i * F_i \ \forall i = 1 \dots n \)
Steps 1-4 make up the GradCAM method; including step 5 constitutes the Guided GradCAM method. Below is how a heatmap generated by the GradCAM method looks. The best contribution of the paper was generalizing CAM to work in the presence of fully-connected layers.
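A minimal sketch of the GradCAM computation in PyTorch, assuming the last-conv-layer activations were captured during the forward pass (e.g. via a forward hook); the 224x224 output size is an assumption for illustration:

```python
import torch
import torch.nn.functional as F

def grad_cam(feature_maps, score, out_size=(224, 224)):
    """Coarse GradCAM heatmap.

    feature_maps: activations F_1..F_n of the last conv layer, shape (1, n, h, w),
    still attached to the computation graph; score: the scalar class score S_c.
    """
    grads = torch.autograd.grad(score, feature_maps)[0]   # dS_c / dF_i
    weights = grads.mean(dim=(2, 3), keepdim=True)        # pooled weight w_i per map
    cam = F.relu((weights * feature_maps).sum(dim=1, keepdim=True))
    # upsample the coarse h x w map to the input resolution
    return F.interpolate(cam, size=out_size, mode='bilinear',
                         align_corners=False)[0, 0]
```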
Heatmap by GradCAM against original annotation.
Shortcomings: The algorithm manages to steer clear of backpropagating gradients all the way to the input by propagating them only up to the final convolutional layer. The major problem with GradCAM is its limitation to architectures that use an average pooling layer to connect the convolutional layers to the fully connected layers. The other major drawback is that upsampling the coarse heatmap introduces artifacts and a loss of signal.
There are a couple of major problems with gradient-based methods, which can be summarised as follows:
Saturation problems of gradient based methods Source.
Explanation: To counter these issues, a relevance score based attribution technique was first discussed by Bach et al. in 2015 in this [7] paper. The authors suggested a simple yet strong technique: propagating relevance scores and redistributing them in proportion to the activations of the previous layers. Redistribution based on activations means we steer clear of the difficulties that arise with non-linear activation layers.
Algorithm:
This implementation follows epsilon-LRP[8], where a small epsilon is added to the denominator to propagate relevance with numerical stability. As before, assume an image I0, a class c, and a classification ConvNet with class score function Sc(I).
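For reference, the epsilon-LRP redistribution rule from layer l+1 to layer l (for activations a and weights w, as given in [8]) can be written as:

\[ R_i^{(l)} = \sum_j \frac{a_i w_{ij}}{z_j + \epsilon \cdot \mathrm{sign}(z_j)} R_j^{(l+1)}, \qquad z_j = \sum_{i'} a_{i'} w_{i'j} + b_j \]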
Heatmap by Epsilon LRP against original annotation.
Explanation: The last paper[9] we cover in this series is also based on layer-wise relevance. However, instead of directly explaining the output prediction as the previous models do, the authors explain the difference between the output prediction and the prediction on a baseline reference image. The concept is similar to Integrated Gradients, which we discussed in the previous post. The authors raise a valid concern with the gradient-based methods described above: gradients don't use a reference, which limits the inference. Gradient-based methods only describe the local behavior of the output at the specific input value, without considering how the output behaves over a range of inputs.
Algorithm: The reference image (IR) is chosen as a neutral image suitable for the problem at hand. For a class c and a classification ConvNet with class score function Sc(I), let SRc be the score for the reference image IR. The relevance score to be propagated is not Sc but Sc - SRc.
We have so far covered both perturbation based and gradient-based methods. Computationally and practically, perturbation based methods are not much of a win, although their performance is relatively uniform and consistent with an underlying concept of interpretability. Gradient-based methods are computationally cheaper and measure the contribution of the pixels in the neighborhood of the original image, but they are plagued by difficulties in propagating gradients back through non-linear and renormalization layers. The layer-wise relevance techniques go a step further and directly redistribute relevance in proportion to the activations, thereby steering clear of the problems of propagating through non-linear layers. Finally, to capture the relative importance of pixels beyond the local neighborhood of pixel intensities, DeepLIFT redistributes the difference between the activations of an image and a baseline image.
We’ll be following up with a final post on the performance of all the methods discussed in the current and previous post and detailed analysis of their performance.
With close to 30 years of radiology experience, Dr Biviji is an eminent radiologist based in Nagpur. He is an authority on developing deep learning solutions to radiology problems and works closely with early-stage healthcare technology innovators.
Q&A with Dr Mustafa Biviji on artificial intelligence in radiology.
In the future, radiologists and radiographers could be replaced by intelligent machines. CT and MRI machines of the future would be embedded with AI programs capable of modifying scanning protocols on the fly, depending on the disease process initially identified. Highly accurate automated reports would be produced almost instantly. Machines would prognosticate, identify as yet unknown imaging patterns associated with diseases and may also uncover new diseases.
There will be objectivity to the radiology reports with personal bias of the radiologist no longer a factor. Remote and isolated areas of the world will have an equal access to the best diagnostic information. Coupled with this would be better machine navigation during surgeries or probably even complete robotic surgery based on the imaging patterns identified with AI. Through it all, I believe that radiologists will continue to reinvent themselves.
These are initial days and the role of AI in Radiology is currently restricted to assistance. While most solutions talk about simplifying workflows, Qure to the best of my knowledge is the only one talking about automated reports with a remarkable degree of accuracy, thereby opening up exciting new prospects for the future. While the perfect radiology AI may be far in the future, at least a promising beginning has been made.
Qure.ai solutions in radiology now include automated head CT reports particularly for trauma and strokes. Reporting for these conditions would earlier have either necessitated a sleepless night or a delay in reporting. Automated reports can now be used to assist residents and help can be sought in case of a doubt or discrepancy. Delayed radiology reports will soon be a thing of the past.
Qure’s chest X-ray solution presently is best targeted to a general practitioner in a remote or rural location interpreting his own chest radiographs. Qure CXR could help provide radiologist-level accuracy, previously only available at the larger centers in the bigger cities. Better radiology would lead to better treatment outcomes and obviate the need for patients to travel long distances to seek a diagnosis.
AI in the future will radically modify the role of a radiologist. I predict a significant blurring of the roles of a diagnostic radiologist, surgeon or a physician. The radiologist of the future will have to stop behaving like an unseen backroom doctor and reinvent to participate actively in patient management. Image assisted robotic surgeries and integrated patient care are not too far off in the future.
Dr. Bharat Aggarwal is the Director of Radiology Services at Max Healthcare. A distinguished third generation Radiologist, he was previously the promoter and lead Radiologist at Diwan Chand Aggarwal Imaging Research Centre, New Delhi. Dr. Aggarwal is an alumnus of Tata Memorial Hospital, Mumbai, and UCMS, Delhi.
Q&A with Dr. Bharat Aggarwal on artificial intelligence in radiology.
There is going to be a significant role of AI in the field of imaging, and it will form a critical part of service delivery. There are many gaps in the existing model of service offerings. Some examples where AI will be commonly used include triaging and highlighting critical cases (reporting is done sequentially and a diagnosis requiring urgent intervention could be “at the bottom of the pile”); early diagnosis (pixel resolution of AI vs the human eye); pre-reading to take care of resource crunch, automation in comparisons, objectivization of disease & response to treatment; quality assurance etc.
10-15 years.
Triaging normal from abnormal; building efficiency; quality assurance.
Yes, adopting AI is a must. Radiologists will not be irrelevant in the world of machines. The role of the radiologists will be to direct research towards clinical gaps, validate AI diagnosis and focus on new problems that will emerge in the AI world. They need to treat AI with healthy competitiveness and build their careers with AI on their team. The opposition is the disease. The goal is health for all.
Dr Shalini Govil is the Lead Abdominal Radiologist, Senior Advisor and Quality Controller at the Columbia Asia Radiology Group. Through her years as Associate Professor at CMC Vellore, Dr Shalini Govil has taught and mentored countless radiologists and medical students and continues to do so. Nowadays, she is busy training a new student – an Artificial Intelligence system that is learning to read Chest X-rays. Dr Govil is an accomplished researcher having published 30+ papers, and has won numerous awards for her contributions to Radiology.
Q&A with Dr Shalini Govil on artificial intelligence in radiology.
Given the accuracy levels being reported across the world for deep learning algorithm diagnosis on imaging, I am sure AI has the potential to emerge as a strong diagnostic tool in the clinical armamentarium.
The only factor that could stand in the way of this progress is the very human fear of being “replaced”, “overtaken” or “made redundant”.
I feel that any crossroad like this in the practice of Medicine is best approached from the point of view of the patient and not from the viewpoint of commerce or market forces. Medicine is not a “job”…Medicine is “healing”… Medicine…is a patient trusting you at a vulnerable moment in his/her life.
From that standpoint, it is very simple - if AI is as accurate as a Senior Radiology Resident or even more accurate, let the patient have the benefit of a timely and accurate DRAFT report that can be validated by a physician or radiologist. This would certainly be better than the current practice in many parts of the world where the x-ray is not formally reported by a trained Radiologist or even a Trainee Radiologist.
Even as researchers are racing to study AI performance in increasingly complex pathology, widespread and parallel clinical testing is the need of the hour, to build confidence in Radiology AI and to attain the critical mass that will allow the threshold of human fear to be crossed.
Qure.ai has come up with a way to “see through the computer’s eyes”. I think this will be a game changer on the road to building confidence in AI. Whenever I have discussed the work I am doing on the use of AI in chest x-ray diagnosis with doctors, they tend to get a glassy look that says, “This is impractical…it’s never going to come into clinical use…”
But the minute they see a chest x-ray with the Qure.ai heatmap shading the abnormality that the AI actually “picked up”…the glassy look turns into one of wonder…because it is exactly what the doctor sees himself! I find this happens with lay people as well, even high school kids!
Once the algorithm has been trained on a large number of chest x-rays and robust clinical testing has demonstrated a low false negative rate, I think the best use of the Qure.ai CXR solution would be to run all chest x-rays in our practice through the algorithm and obtain a DRAFT report to ease validation by a Radiologist.
I would tell young Radiologists that help is on the way…that the days of struggling without a mentor when viewing a difficult case are over…that very soon, an “App” will help them derive a keyword tag to the image that has confounded them and that this keyword will then enable them to research and read and provide an articulate and lucid differential diagnosis.
What should they learn? They should learn Radiology of course…as in-depth and in-breadth as has ever been done…and they possibly can…. But they should also learn the basics of neural networks, deep learning algorithms and keep abreast of evolving AI. Oh! and another thing - it might be a good idea to brush up on their 12th grade calculus!
Dr Bhavin Jankharia is one of India's leading radiologists, former president of the IRIA as well as a renowned educator, speaker and writer. Dr Jankharia's radiology and imaging practice, "Picture This by Jankharia", is known for being an early adopter of innovation and for pioneering new technologies in radiology.
Q&A with Dr Jankharia on artificial intelligence in radiology.
AI is here to stay and will be a major factor to shape radiology over the next 10 years. It will be incorporated in some form or the other in protocols and workflows across the spectrum of radiology work and across the globe.
It is about questions that need to be answered. At present, AI is good at answering specific questions or extracting numerical data from CT scans of the abdomen and pelvis with respect to bone density and aortic size, etc. Wherever there is a need for such issues to be addressed, AI should be incorporated into those specific workflows. We still haven't gotten to the stage where AI can report every detail in every scan, and that may actually never happen.
Its basic value addition will be to take away drudge work. Automated measurements, automated checking of densities, enhancement patterns, perhaps even automated follow-ups for measurements of abnormal areas already marked out on the first scans and the like.
AI learns much faster and the basic approach is different. To the end user though, it matters not, does it, how we get the answer we want…
At present, all of AI is problem-solving based. And since each company deals with different problems based on the doctors they work with, this approach is fine. The company that figures out a way to handle a non-problem based approach to basic interpretation of scans, the way radiologists do, will have a head-start.
They are slowly saving time and helping radiologists work smarter and better.
I don’t think radiologists per se have to do anything about AI, unless they want to change track and work in the field of AI from a technical perspective. AI incorporation into workflow will happen anyway and like all changes to radiology workflow over the decades, it will become routine and a way of life. They don’t really need to do anything different, except be willing to accept change.
As Associate Director at Mahajan Imaging, Dr Vidur Mahajan oversees scientific and clinical research, and pioneers the application of new techniques in radiology. Having collaborated with multiple healthcare tech companies, he is a thought leader in the AI-radiology space. Both doctor and entrepreneur, Dr Vidur Mahajan is passionate about improving access and affordability of high-end medical care across the developing world.
Q&A with Dr Vidur Mahajan on artificial intelligence in radiology.
AI will keep playing a more important role in radiology as time progresses. The benefits, the way I see them, would be around 2 dimensions: quality and efficiency.
We work with several AI companies today and are covered by a plethora of non-disclosure agreements, so I won't be able to comment on how Qure compares to other solutions. That said, Qure's pragmatic and user-oriented approach to developing algorithms is a definite plus. Given that Qure is backed by one of India's best analytics companies, I would be surprised if they don't end up taking the radiology AI world by storm! On the industry, I strongly feel that it will not be a "sudden" transition. The industry will enter into this "AI future" in a step-by-step incremental way - we may not even notice and suddenly we're surrounded by AI!
Qure's chest X-Ray algorithm has the potential to change the entire paradigm of diagnostics in the developing world. The chest X-Ray is the most commonly prescribed radiology investigation, and every day, thousands of X-Rays go unreported in the developing world. Qure has the potential to give these patients a proper report and hence impact their treatment outcomes in a very positive way.
I think Qure has this nailed down completely. The strategy of integrating with Osirix (the world’s most widely used dicom viewer) through an extremely user-friendly and straightforward plugin enables instant reach to radiologists all over the world. Additionally, Qure’s ability to automatically email the AI system’s report to its users should also increase its usability across the world.
While I think I have answered this in Q-1, the primary and most important expectation from AI in radiology is a product that works. Simple. I would be very guarded about promoting a product with less than 90% accuracy, since I would not want potential users to form a negative opinion about Qure's product based on their initial experience. The stakes in healthcare are too high and the reliability of an algorithm is paramount. Qure is definitely a leader in the AI space globally and I am very proud of the fact that an Indian company is taking the lead in this.
Interpretability of deep learning models is very much an active area of research and it becomes an even more crucial part of solutions in medical imaging.
The prevalent visualization methods can be broadly classified into 2 categories:

1. Perturbation based methods
2. Backpropagation/gradient based methods

In this post, I'll be giving a brief overview of the different perturbation based techniques for deep learning based classification models and their drawbacks. We will follow up with backpropagation based visualisation methods in the next part of the series.
Chest X-ray with pleural effusion.
For context, we will be considering the chest X-ray (image above) of a patient diagnosed with pleural effusion. A pleural effusion is a clinical condition in which excess fluid accumulates in the pleural space around the lungs. A visual cue for such an accumulation is the blunting of the costophrenic (CP) angle, as shown in the X-ray here. As is evident, the left CP angle (the one on the right of the image) is sharp, whereas the right CP angle is blunted, indicating pleural effusion.
We will be considering this X-ray and one of our models trained for detecting pleural effusion for demonstration purposes. For this patient, our pleural effusion algorithm predicts a possible pleural effusion with 97.62% probability.
This broad category of techniques involves perturbing the pixel intensities of the input image with minimal noise and observing the change in prediction probability. The underlying principle is that the pixels which contribute maximally to the prediction, once altered, will drop the probability by the maximum amount. Let's take a quick look at some of these methods - I've linked the papers for further reading.
In the paper Visualizing and Understanding Convolutional Networks, published in 2013, Zeiler et al used deconvolutional layers - one of the earliest applications of such layers - to visualize the activity maps of each layer for different inputs. This helped the authors understand which object categories were responsible for activations in a given feature map. The authors also explored the technique of occluding patches of the input and monitoring the prediction and the activations of the feature map in the last layer that was maximally activated for the unoccluded image.
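Conceptually, the occlusion experiment is easy to sketch in code. Below is a minimal Python version; the `predict` function is a stand-in for any classifier returning the probability of the class of interest, and the patch/stride sizes are illustrative assumptions, not our production implementation.

```python
import numpy as np

def occlusion_heatmap(image, predict, patch=16, stride=8, fill=0.0):
    # baseline probability on the unperturbed image
    base = predict(image)
    h, w = image.shape
    heat = np.zeros(((h - patch) // stride + 1, (w - patch) // stride + 1))
    for i, y in enumerate(range(0, h - patch + 1, stride)):
        for j, x in enumerate(range(0, w - patch + 1, stride)):
            occluded = image.copy()
            occluded[y:y + patch, x:x + patch] = fill  # black out one patch
            # a large probability drop marks an important region
            heat[i, j] = base - predict(occluded)
    return heat

# stand-in classifier: responds to intensity in a fixed region of interest
predict = lambda img: img[80:120, 60:100].mean()
heatmap = occlusion_heatmap(np.random.rand(224, 224), predict)
```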
Here’s a small demo of how perturbation by occlusion works for the demo X-ray.
The leftmost image is the original X-ray image, the middle one is the perturbed image as the black occluding patch moves across the image, and the rightmost image is the plot of the probability of pleural effusion as different parts of the X-ray get occluded.
As is evident, the probability of pleural effusion drops suddenly as soon as the right CP angle and the accumulated fluid region of the X-ray are occluded from the network. This signals the blunt CP angle along with the fluid accumulation as the attributing factor for the pleural effusion diagnosis for this patient.
The same idea was explored in depth by Samek et al in the 2015 paper Evaluating the visualization of what a Deep Neural Network has learned, where the authors suggest that we select the top k pixels by attribution, randomly vary their intensities, and then measure the drop in score. If the attribution method is good, then the drop in score should be large.
Here's what the heatmap generated via occlusion looks like:
Heatmap generated by occlusion
But there's a slight problem with occluding patches systematically in regular grids. Often the object that is to be identified gets occluded only in parts, resulting in inappropriate decisions by the network.
These sorts of situations were better tackled in the LIME paper that came out in 2016. LIME isn't specific to computer vision; it applies to any classifier. I'll explain how LIME works for vision explicitly and leave the rest for your reading. Instead of occluding systematic patches at regular intervals, the input image is divided into component superpixels. A superpixel is a grouping of adjacent pixels with similar intensities. Grouping by superpixels thus ensures an object, composed of similar pixel intensities, is a single superpixel component in itself.
The algorithm for generating the heatmap goes as follows (a code sketch follows this list):

1. Divide the input image into superpixels.
2. Generate k perturbed samples by randomly switching a subset of superpixels on or off (switched-off superpixels are replaced with a constant intensity).
3. Run the model on each perturbed sample to obtain prediction probabilities.
4. Fit a linear surrogate model with the superpixel on/off vectors as inputs and the probabilities as targets.
5. Use the learned coefficients as the attribution of each superpixel to build the heatmap.
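Here is a minimal sketch of this superpixel-based procedure, assuming scikit-image for SLIC superpixels and scikit-learn for the linear surrogate; the RGB toy image and the `predict` stand-in are illustrative assumptions, not the LIME library itself.

```python
import numpy as np
from skimage.segmentation import slic
from sklearn.linear_model import Ridge

def lime_heatmap(image, predict, n_segments=50, n_samples=500, fill=0.0):
    segments = slic(image, n_segments=n_segments)  # superpixel id per pixel
    ids = np.unique(segments)
    # each sample randomly switches superpixels on (1) or off (0)
    masks = np.random.randint(0, 2, size=(n_samples, len(ids)))
    probs = np.empty(n_samples)
    for k in range(n_samples):
        sample = image.copy()
        for s, on in zip(ids, masks[k]):
            if not on:
                sample[segments == s] = fill  # blank out switched-off superpixels
        probs[k] = predict(sample)
    # linear surrogate: coefficients estimate each superpixel's contribution
    coef = Ridge(alpha=1.0).fit(masks, probs).coef_
    heat = np.zeros(segments.shape, dtype=float)
    for s, w in zip(ids, coef):
        heat[segments == s] = w
    return heat

# toy RGB image and a stand-in classifier sensitive to a central region
heat = lime_heatmap(np.random.rand(64, 64, 3), lambda im: im[20:40, 20:40].mean())
```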
Here’s a demo of how superpixel based perturbation (LIME model) works for the demo X-ray.
The leftmost image is the original X-ray, the center plot shows perturbed images (out of the k samples) with different superpixels being activated, and the rightmost one is a scatter plot of the probability of pleural effusion vs the number of activated superpixels in the sample.
Here's what the heatmap generated through superpixel based perturbation looks like:
Heatmap generated by LIME Model using superpixel based perturbation
However, these techniques still have some downsides. Occlusion of patches, systematic or superpixel-wise, can drastically affect the prediction of networks. For example, at Qure we had trained nets for diagnosing abnormalities from chest X-rays. Chest X-rays are generally grayscale images, and abnormalities could include anything like an unexpected opacity at any place, an enlarged heart, and so on. With partial occlusion, the resultant images would themselves look abnormal, since a sudden black patch in the middle of an X-ray is very likely to be read as an abnormal case.
Instead of occluding discretely, another way to perturb images - over a continuous spectrum - was explored in a recent paper, Axiomatic Attribution for Deep Networks. This method is in a way a hybrid of gradient based and perturbation based methods. Here, the images are perturbed over a continuous domain from a baseline image (all zeroes) to the current image, and the sensitivity of each pixel with respect to the prediction is integrated over the spectrum to give an approximate attribution score for each pixel.
The algorithm for generating the heatmap for an input image \(X\) with pixel intensities \(x_{ij}\) goes as follows: with a baseline image \(X'\) and prediction function \(F\), the attribution of pixel \(x_{ij}\) is \[ IG_{ij}(X) = (x_{ij} - x'_{ij}) \int_0^1 \frac{\partial F(X' + \alpha(X - X'))}{\partial x_{ij}} \, d\alpha, \] which in practice is approximated by a Riemann sum of the gradients over a few dozen images interpolated between the baseline and the input.
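A minimal numpy sketch of this Riemann-sum approximation is below; the logistic model with an analytic gradient is a toy stand-in for a network, an assumption for illustration only.

```python
import numpy as np

def integrated_gradients(x, baseline, grad_fn, steps=50):
    # average the gradient along the straight path from baseline to x
    alphas = np.linspace(0.0, 1.0, steps)
    grads = [grad_fn(baseline + a * (x - baseline)) for a in alphas]
    return (x - baseline) * np.mean(grads, axis=0)

# toy model F(x) = sigmoid(w . x) with an analytic gradient
w = np.array([1.0, -2.0, 0.5])
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
grad_fn = lambda x: sigmoid(w @ x) * (1 - sigmoid(w @ x)) * w

x = np.ones(3)
attr = integrated_gradients(x, np.zeros_like(x), grad_fn)
# completeness axiom: attributions sum to F(x) - F(baseline) (approximately)
print(attr.sum(), sigmoid(w @ x) - sigmoid(0.0))
```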
Here’s a demo of how integrated gradients model works for the demo X-ray.
The leftmost image is the original X-ray, the center plot shows images as intensities are varied linearly from 0 to original intensity. The rightmost plot displays the sensitivity maps for each of the perturbed images as the intensities vary.
As you can observe, the sensitivity map is random and dispersed across the entire image in the beginning, when the samples are closer to the baseline image. As the samples get closer to the original image, the sensitivity maps become more localised, indicating the strong attribution of the CP angle and fluid-filled areas to the final prediction.
Here's what the heatmap generated through Integrated Gradients based perturbation looks like:
Heatmap generated by Integrated Gradients
Finally, we briefly discuss the most recent work of Fong et al in the paper Interpretable Explanations of Black Boxes by Meaningful Perturbation. In this paper the authors refine the heatmap mask of images, generated by sensitivity maps or otherwise, to find the minimal mask that describes the saliency. The goal of such a technique is to find the smallest subset of the image that preserves the prediction score. The method perturbs the sensitivity heatmap and monitors the probability drop, refining the heatmap to the minimum set of pixels that preserves the prediction score.
While most of these methods do a decent job of producing relevant heatmaps, there are a couple of drawbacks to perturbation based heatmaps which make them unsuitable for real time deployment.
Computationally expensive: most of these methods require multiple feed-forward passes to compute a single heatmap for a given input image. This makes the algorithms slow and expensive, and thereby unfit for deployment.
Unstable to surprise artifacts: as discussed above, a sudden perturbation in the form of a blurred or occluded patch is something the net is not familiar with from its training set. The predictions for such a perturbed image become heavily skewed, making the inferences from such a technique uninterpretable. A screening model trained to separate abnormal from normal X-rays would predict abnormality whenever such a perturbed image is presented to it.
The drawbacks around unstable artifacts are mostly overcome by Integrated Gradients, resulting in much more stable heatmaps.
The backpropagation based methods are much cheaper computationally than perturbation based methods and will be discussed in the next part of this blog post.
In this post, I review the literature on semantic segmentation. Most research on semantic segmentation uses natural/real world image datasets. Although the results are not directly applicable to medical images, I review these papers because research on natural images is much more mature than that on medical images.
This post is organized as follows: I first explain the semantic segmentation problem, give an overview of the approaches, and summarize a few interesting papers.
In a later post, I’ll explain why medical images are different from natural images and examine how the approaches from this review fare on a dataset representative of medical images.
Semantic segmentation is understanding an image at pixel level i.e, we want to assign each pixel in the image an object class. For example, check out the following images.
Left: Input image. Right: Its semantic segmentation. Source.
Apart from recognizing the bike and the person riding it, we also have to delineate the boundaries of each object. Therefore, unlike classification, we need dense pixel-wise predictions from our models.
VOC2012 and MSCOCO are the most important datasets for semantic segmentation.
Before deep learning took over computer vision, people used approaches like TextonForest and Random Forest based classifiers for semantic segmentation. As with image classification, convolutional neural networks (CNN) have had enormous success on segmentation problems.
One of the popular initial deep learning approaches was patch classification, where each pixel was separately classified into classes using a patch of the image around it. The main reason to use patches was that classification networks usually have fully connected layers and therefore require fixed size images.
In 2014, Fully Convolutional Networks (FCN) by Long et al. from Berkeley popularized CNN architectures for dense predictions without any fully connected layers. This allowed segmentation maps to be generated for images of any size and was also much faster than the patch classification approach. Almost all subsequent state of the art approaches on semantic segmentation adopted this paradigm.
Apart from fully connected layers, one of the main problems with using CNNs for segmentation is pooling layers. Pooling layers increase the field of view and are able to aggregate the context while discarding the ‘where’ information. However, semantic segmentation requires the exact alignment of class maps and thus, needs the ‘where’ information to be preserved. Two different classes of architectures evolved in the literature to tackle this issue.
The first is the encoder-decoder architecture. The encoder gradually reduces the spatial dimension with pooling layers, while the decoder gradually recovers the object details and spatial dimension. There are usually shortcut connections from encoder to decoder to help the decoder recover the object details better. U-Net is a popular architecture from this class.
U-Net: An encoder-decoder architecture. Source.
Architectures in the second class use what are called dilated/atrous convolutions and do away with pooling layers.
Dilated/atrous convolutions. rate=1 is same as normal convolutions.
Source.
Conditional Random Field (CRF) postprocessing is usually used to improve the segmentation. CRFs are graphical models which 'smooth' the segmentation based on the underlying image intensities. They work on the observation that similar intensity pixels tend to be labeled as the same class. CRFs can boost scores by 1-2%.
CRF illustration. (b) Unary classifier output is the segmentation input to the CRF. (c, d, e) are variants of CRF, with (e) being the widely used one.
Source.
In the next section, I’ll summarize a few papers that represent the evolution of segmentation architectures starting from FCN. All these architectures are benchmarked on VOC2012 evaluation server.
Following papers are summarized (in chronological order):
For each of these papers, I list down their key contributions and explain them. I also show their benchmark scores (mean IOU) on VOC2012 test dataset.
Key Contributions:
Explanation:
The key observation is that fully connected layers in classification networks can be viewed as convolutions with kernels that cover their entire input regions. This is equivalent to evaluating the original classification network on overlapping input patches, but is much more efficient because computation is shared over the overlapping regions of patches. Although this observation is not unique to this paper (see overfeat, this post), it improved the state of the art on VOC2012 significantly.
Fully connected layers as a convolution. Source.
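A small PyTorch sketch of this 'convolutionalization' trick follows; the layer sizes are illustrative assumptions, chosen to mimic VGG's 7 x 7 x 512 final feature map.

```python
import torch
import torch.nn as nn

# an FC layer over a 7x7x512 feature map is equivalent to a 7x7 convolution:
# copying the FC weights into the conv kernel lets the network slide over
# inputs of any size instead of requiring a fixed-size image
fc = nn.Linear(512 * 7 * 7, 1000)
conv = nn.Conv2d(512, 1000, kernel_size=7)
with torch.no_grad():
    conv.weight.copy_(fc.weight.view(1000, 512, 7, 7))
    conv.bias.copy_(fc.bias)

feat = torch.randn(1, 512, 14, 14)  # a larger input than the FC head allows
print(conv(feat).shape)             # 1 x 1000 x 8 x 8: a coarse map of class scores
```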
After convolutionalizing the fully connected layers in an imagenet pretrained network like VGG, feature maps still need to be upsampled because of the pooling operations in CNNs. Instead of using simple bilinear interpolation, deconvolutional layers can learn the interpolation. This layer is also known as upconvolution, full convolution, transposed convolution or fractionally-strided convolution.
However, upsampling (even with deconvolutional layers) produces coarse segmentation maps because of loss of information during pooling. Therefore, shortcut/skip connections are introduced from higher resolution feature maps.
Benchmarks (VOC2012):
Score | Comment | Source |
---|---|---|
62.2 | - | leaderboard |
67.2 | More momentum. Not described in paper | leaderboard |
My Comments:
Key Contributions:
Explanation:
FCN, despite its upconvolutional layers and a few shortcut connections, produces coarse segmentation maps. Therefore, more shortcut connections are introduced. However, instead of copying the encoder features as in FCN, indices from maxpooling are copied. This makes SegNet more memory efficient than FCN.
Segnet Architecture. Source.
Benchmarks (VOC2012):
Score | Comment | Source |
---|---|---|
59.9 | - | leaderboard |
My comments:
Key Contributions:
Explanation:
Pooling helps in classification networks because the receptive field increases. But this is not the best thing to do for segmentation because pooling decreases the resolution. Therefore, the authors use dilated convolution layers, which work like this:
Dilated/Atrous Convolutions. Source
A dilated convolutional layer (also called atrous convolution in DeepLab) allows for an exponential increase in the field of view without a decrease in spatial dimensions.
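To make this concrete, here is a small PyTorch sketch; the channel counts are illustrative assumptions.

```python
import torch
import torch.nn as nn

# three stacked 3x3 convolutions with dilations 1, 2, 4: the receptive
# field grows to 15x15 while spatial resolution is preserved (padding = dilation)
layers = nn.Sequential(
    nn.Conv2d(1, 8, kernel_size=3, padding=1, dilation=1),
    nn.Conv2d(8, 8, kernel_size=3, padding=2, dilation=2),
    nn.Conv2d(8, 8, kernel_size=3, padding=4, dilation=4),
)
x = torch.randn(1, 1, 64, 64)
print(layers(x).shape)  # torch.Size([1, 8, 64, 64]) -- no downsampling
```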
Last two pooling layers from pretrained classification network (here, VGG) are removed and subsequent convolutional layers are replaced with dilated convolutions. In particular, convolutions between the pool-3 and pool-4 have dilation 2 and convolutions after pool-4 have dilation 4. With this module (called frontend module in the paper), dense predictions are obtained without any increase in number of parameters.
A module (called context module in the paper) is trained separately with the outputs of frontend module as inputs. This module is a cascade of dilated convolutions of different dilations so that multi scale context is aggregated and predictions from frontend are improved.
Benchmarks (VOC2012):
Score | Comment | Source |
---|---|---|
71.3 | frontend | reported in the paper |
73.5 | frontend + context | reported in the paper |
74.7 | frontend + context + CRF | reported in the paper |
75.3 | frontend + context + CRF-RNN | reported in the paper |
My comments:
Key Contributions:
Explanation:
Atrous/dilated convolutions increase the field of view without increasing the number of parameters. The network is modified as in the dilated convolutions paper.
Multiscale processing is achieved either by passing multiple rescaled versions of the original image to parallel CNN branches (image pyramid) and/or by using multiple parallel atrous convolutional layers with different sampling rates (ASPP); a rough sketch of the ASPP idea follows.
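Below is a rough PyTorch sketch of ASPP: parallel 3 x 3 atrous convolutions with different rates whose outputs are fused, here by summation as in DeepLabv2. Channel sizes and rates are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ASPP(nn.Module):
    def __init__(self, in_ch, out_ch, rates=(6, 12, 18, 24)):
        super().__init__()
        # padding = dilation keeps the spatial size unchanged in each branch
        self.branches = nn.ModuleList(
            [nn.Conv2d(in_ch, out_ch, 3, padding=r, dilation=r) for r in rates]
        )

    def forward(self, x):
        # fuse the multi-rate responses by summing the branch outputs
        return sum(b(x) for b in self.branches)

x = torch.randn(1, 256, 32, 32)
print(ASPP(256, 21)(x).shape)  # 1 x 21 x 32 x 32
```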
Structured prediction is done by fully connected CRF. CRF is trained/tuned separately as a post processing step.
DeepLab2 Pipeline. Source.
Benchmarks (VOC2012):
Score | Comment | Source |
---|---|---|
79.7 | ResNet-101 + atrous Convolutions + ASPP + CRF | leaderboard |
Key Contributions:
Explanation:
The approach of using dilated/atrous convolutions is not without downsides. Dilated convolutions are computationally expensive and take a lot of memory because they have to be applied on a large number of high resolution feature maps. This hampers the computation of high-res predictions. DeepLab's predictions, for example, are 1/8th the size of the original input.
So, the paper proposes to use an encoder-decoder architecture. The encoder part consists of ResNet-101 blocks. The decoder has RefineNet blocks which concatenate/fuse high resolution features from the encoder and low resolution features from the previous RefineNet block.
RefineNet Architecture. Source.
Each RefineNet block has a component to fuse the multi resolution features by upsampling the lower resolution features, and a component to capture context based on repeated 5 x 5 stride 1 pool layers. Each of these components employs residual connections following the identity mapping idea.
RefineNet Block. Source.
Benchmarks (VOC2012):
Score | Comment | Source |
---|---|---|
84.2 | Uses CRF, Multiscale inputs, COCO pretraining | leaderboard |
Key Contributions:
Explanation:
Global scene categories matter because they provide clues on the distribution of the segmentation classes. The pyramid pooling module captures this information by applying large kernel pooling layers.
Dilated convolutions are used as in the dilated convolutions paper to modify ResNet, and a pyramid pooling module is added to it. This module concatenates the feature maps from ResNet with upsampled outputs of parallel pooling layers with kernels covering the whole, half of, and small portions of the image.
An auxiliary loss, in addition to the loss on the main branch, is applied after the fourth stage of ResNet (i.e. the input to the pyramid pooling module). This idea has also been called intermediate supervision elsewhere.
PSPNet Architecture. Source.
Benchmarks (VOC2012):
Score | Comment | Source |
---|---|---|
85.4 | MSCOCO pretraining, multi scale input, no CRF | leaderboard |
82.6 | no MSCOCO pretraining, multi scale input, no CRF | reported in the paper |
Key Contributions:
Explanation:
Semantic segmentation requires both segmentation and classification of the segmented objects. Since fully connected layers cannot be present in a segmentation architecture, convolutions with very large kernels are adopted instead.
Another reason to adopt large kernels is that although deeper networks like ResNet have a very large receptive field, studies show that the network tends to gather information from a much smaller region (the valid receptive field).
Larger kernels are computationally expensive and have a lot of parameters. Therefore, the k x k convolution is approximated with the sum of two separable branches: a 1 x k convolution followed by a k x 1, and a k x 1 convolution followed by a 1 x k (a sketch follows). This module is called Global Convolutional Network (GCN) in the paper.
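The factorization is easy to express in PyTorch; channel counts and kernel size below are illustrative assumptions.

```python
import torch
import torch.nn as nn

class GCNBlock(nn.Module):
    """k x k convolution approximated by two separable branches
    (1 x k -> k x 1) + (k x 1 -> 1 x k), as described in the GCN paper."""
    def __init__(self, in_ch, out_ch, k=15):
        super().__init__()
        p = k // 2  # padding that preserves spatial size
        self.a = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, (1, k), padding=(0, p)),
            nn.Conv2d(out_ch, out_ch, (k, 1), padding=(p, 0)))
        self.b = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, (k, 1), padding=(p, 0)),
            nn.Conv2d(out_ch, out_ch, (1, k), padding=(0, p)))

    def forward(self, x):
        return self.a(x) + self.b(x)  # sum of the two separable branches

x = torch.randn(1, 64, 32, 32)
print(GCNBlock(64, 21, k=15)(x).shape)  # 1 x 21 x 32 x 32
```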
Coming to the architecture, ResNet (without any dilated convolutions) forms the encoder part while GCNs and deconvolutions form the decoder. A simple residual block called Boundary Refinement (BR) is also used.
GCN Architecture. Source.
Benchmarks (VOC2012):
Score | Comment | Source |
---|---|---|
82.2 | - | reported in the paper |
83.6 | Improved training, not described in the paper | leaderboard |
Key Contributions:
Explanation:
The ResNet model is modified to use dilated/atrous convolutions, as in DeepLabv2 and the dilated convolutions paper. The improved ASPP involves concatenation of image-level features, a 1x1 convolution and three 3x3 atrous convolutions with different rates. Batch normalization is used after each of the parallel convolutional layers.
The cascaded module is a ResNet block, except that its component convolution layers are made atrous with different rates. This module is similar to the context module used in the dilated convolutions paper, but it is applied directly on intermediate feature maps instead of belief maps (belief maps are final CNN feature maps with channels equal to the number of classes).
Both proposed models were evaluated independently, and an attempt to combine them did not improve the performance. Both performed very similarly on the val set, with ASPP performing slightly better. CRF is not used.
Both these models outperform the best model from DeepLabv2. The authors note that the improvement comes from batch normalization and the better way of encoding multi scale context.
DeepLabv3 ASPP (used for submission). Source.
Benchmarks (VOC2012):
Score | Comment | Source |
---|---|---|
85.7 | used ASPP (no cascaded modules) | leaderboard |
Knowing the state of the cells and the amount of blood flow in the brain can reveal a huge amount of information about it. Tumor cells have a higher consumption of blood metabolites (like glucose) than normal cells. This information can contribute significantly towards diagnosis of the brain. Perfusion imaging allows us to measure all of these quantities. Combine this with machine learning, and we can build a system that takes perfusion images and gives out a complete report about the brain. This is the future.
Brain tumors are the leading cause of cancer-related deaths in children (males and females) ages 1-19[1]. It is estimated that as many as 5.1 million Americans may have Alzheimer's disease[2]. And there are many other such diseases, such as Parkinson's, autism and schizophrenia, that affect thousands and millions of people amongst us. Neuroimaging or brain imaging is the use of various techniques to either directly or indirectly image the structure and function/pharmacology of the nervous system[3]. One of the techniques of neuroimaging is perfusion imaging.
Perfusion imaging captures qualitative and quantitative information about the blood flow and various kinetics related to blood flow in the brain. Technically, perfusion is defined as the passage of fluid through the lymphatic system or blood vessels to an organ or a tissue[4]. If you know the blood flow of the affected and the normal brain, it can be helpful in finding abnormalities. Perfusion imaging helps in the measurement of various parameters such as cerebral blood volume, cerebral blood flow, mean transit time and the volume transfer coefficient.
For Brain Tumors, the World Health Organization (WHO), has developed a histological classification system that focuses on the tumor’s biological behavior. The system classifies tumors into grades I to IV. Grade IV are the most malignant primary brain tumors. Histopathological analysis or analysing a biopsy of brain serves as a final test to decide the grade. This is an invasive procedure and requires the availability of an expert for the analysis.
Glioblastoma, a form of Brain Tumor, exhibits increased rCBV values. Source :
Neurooncology - Newer Developments
Recent papers[5-7] have found a strong correlation between perfusion parameters such as relative cerebral blood volume (abbr. rCBV) and volume transfer coefficient (abbr. kTrans), and the grade of the tumor. Higher perfusion values in marked RoIs (regions of interest) suggested higher grades. Taking a step further, another paper[8] also suggested the use of perfusion to measure prognosis; thus it can be a great indicator to quickly measure the effects of the treatment or medication the subject is undergoing.
Alzheimer’s is the cause of 60%-70% cases of dementia[9] and it has no cure. Globally, dementia affects 47.5 million people[9]. About 3% of people between the ages of 65–74 have dementia, 19% between 75 and 84 and nearly half of those over 85 years of age[10].
The normal brain (on the left) shows normal blood perfusion, denoted by an abundance of yellow color. The scan on the right, of a person suffering from Alzheimer’s, shows pervasive low perfusion all around, denoted by blues and greens. Source :
The Physiological and Neurological basis of Cerebra TurboBrain
A 2001 paper[11] in the American Journal of Neuroradiology showed that perfusion, or rCBV values in particular, can be used to replace nuclear medical imaging techniques for the evaluation of patients with Alzheimer's disease. Another paper[12] published in 2014 suggests closely linked mechanisms of neurodegeneration mediating the evolution of dementia in both Alzheimer's and Parkinson's. Many other papers[13,14] suggest a strong linkage between early Alzheimer's and cerebral blood flow, which can thus help in detection at an earlier stage.
A lot of work in last decade has also been done to try to develop autonomous/semi-autonomous process of decision making in solving various problems mentioned above. Some papers[15-17] have shown promise in developing semi-autonomous systems using Support Vector Machines (SVM) and other ML techniques for brain tumor grade classification with accuracies as high as 80%. In the domain of neurodegenerative diseases, accuracies as high as 85% have been achieved in classification of MRIs[18,19] using perfusion and ML, and a recent article[20] suggested that Alzheimer’s early detection might be possible using AI which could predict onset of Alzheimer’s with accuracy of 85% to 90%.
Even though perfusion imaging looks promising, there are some major hurdles due to which it has not yet spread into hospitals as a go-to method for analysis. Standardisation is the biggest problem that needs to be tackled first. This paper[21] highlights the various methods used in brain perfusion imaging - not one or two different methods, but seven. Another paper[22] published in the Journal of Magnetic Resonance Imaging (JMRI) gives a deeper insight into two successful approaches being used. Measurements from different methods have different accuracies, and demand different expertise from the doctors performing them.
Before perfusion moves from research-based imaging to a more mainstream technique, the question of standardisation has to be answered. Also, the inclusion of any major change into an industry as big as healthcare requires time. However, at a small scale, perfusion imaging has been showing many signs of being a forefront technology. It can be used alongside current advances in ML to do automated diagnosis and prognosis of various brain related diseases and disorders.
With inputs from Dr. Vasantha Kumar
Brain anatomy segmentation will also allow us to do quantitative and morphometric research in neuroimaging. For example, some studies [4, 5] have found that abnormal shapes/volumes of certain anatomical regions are associated with brain disorders like Alzheimer’s disease and Parkinson’s. In fact, there’s a whole subfield called brain morphometry which is concerned with quantifying anatomical features to understand the brain development, aging and diseases.
Brain anatomy segmentation is a well-studied problem by now. A MICCAI challenge was held in 2012 to assess algorithms on whole brain labeling. The challenge provided 15 T1-weighted structural MRI images and associated manually labeled volumes with one label per voxel. These manually marked volumes are called 'atlases'. 134 anatomical regions were marked in total. The top performer in the challenge [1] achieved a mean dice score of 0.782.
Almost all brain segmentation algorithms use what is called the Multi-atlas Segmentation algorithm. I briefly describe the algorithm [2] below; an illustration follows the description.
Given training image-atlas pairs \( (X_i, Y_i), i = 1,2,…,n \) and an unseen test image \( X_{test} \), do:

1. Register each training image \( X_i \) to \( X_{test} \) to obtain a transformation \( T_i \).
2. Warp the corresponding atlas with this transformation to get \( \hat{Y}_i = T_i(Y_i) \).
3. Fuse the warped atlases \( \hat{Y}_i \) into the final prediction \( Y_{test} \), e.g. by taking the most frequent label at each voxel.
Multi-atlas Segmentation Algorithm
Illustration of Multi-atlas Segmentation (taken from here)
Key steps in Multi-atlas Segmentation are Registration and Label/Decision Fusion. Multiple algorithms are available to perform these steps. Registration is usually non-rigid/deformable. Meaning, small local deformations are made apart from global transformations like rotation and scaling (illustration below).
Illustration of Deformable Registration (smileys taken from here)
Rather than selecting the most frequent label at a voxel (majority voting), an intensity based joint label fusion algorithm [1] can be used to improve the segmentation results. A survey of Multi-atlas Segmentation can be found in [2]. Mature open-source libraries are available for registration and multi-atlas segmentation; a small sketch of the majority voting baseline follows. In the Appendix, I'll illustrate the use of one such library called ANTs on a subset of the MICCAI12 dataset.
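Here is a minimal numpy sketch of the majority voting baseline; the toy volumes stand in for atlases already warped to the test image.

```python
import numpy as np

def majority_vote(warped_atlases):
    """Fuse warped atlas label volumes by picking the most
    frequent label at each voxel (simple majority voting)."""
    stacked = np.stack(warped_atlases)           # shape: (n_atlases, D, H, W)
    n_labels = int(stacked.max()) + 1
    votes = np.zeros((n_labels,) + stacked.shape[1:], dtype=np.int32)
    for label in range(n_labels):
        votes[label] = (stacked == label).sum(axis=0)
    return votes.argmax(axis=0)                  # fused label volume

# toy example: five 'warped atlases' over an 8x8x8 volume with 4 labels
atlases = [np.random.randint(0, 4, size=(8, 8, 8)) for _ in range(5)]
fused = majority_vote(atlases)
```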
A big caveat of Multi-atlas Segmentation is its computational cost. Deformable registration is computationally very expensive, and for a given test image, we have to register each training image to it. Label fusion algorithms can also be quite expensive. All of this adds up to 2 or 3 hours per test image on a high-end machine.
This computational cost makes it difficult to deploy anatomical segmentations for clinical applications. Further research should focus on improving the speed of the segmentation.
One promising research direction is to use deep learning. A deep learning approach to brain anatomy segmentation [3] was published recently. As noted in [3], the main problem is the size of the image and GPU memory. We cannot simply fit a whole MRI image and use a 3D CNN to segment it, because it exceeds the memory available in any single GPU. [3] uses a '2.5D' patch-based segmentation approach; apart from a small 3D patch, they also capture the global context with three orthogonal patches, each extracted from the sagittal, coronal and transverse planes.
This model achieves a mean dice score of 0.725. This is comparable to the best performer [1] in MICCAI12 (0.782), but not as good. This model is also not the easiest to implement. We are working on a fully convolutional 3D CNN approach like U-Net [6] to segment the brain anatomical regions. One of our approaches is to employ model parallelism as in Alex Krizhevsky's One weird trick paper [7].
I've created a repo with a script and MRI images. To run it, build the ANTs library on your Linux/Mac machine:
```bash
git clone git://github.com/stnava/ANTs.git
mkdir ANTs/bin
cd ANTs/bin
cmake ..
make -j 4 # This might take a while
cp ../Scripts/* bin
# Mac users change bashrc to bash_profile
echo export ANTSPATH=`pwd`/bin/ >> ~/.bashrc
source ~/.bashrc
```
Once ANTs is installed, run the following commands:
```bash
git clone https://github.com/qureai/Multi-Atlas-Segmentation.git
cd Multi-Atlas-Segmentation
bash multi-atlas-segmentation.sh
```
This should create a file `predictions/1004_predLabels.nii.gz`, which contains the predicted anatomical labels for `test/1004_3.nii`.
When we first set out to apply deep learning to genomics, we asked ourselves what the current state of the art is. What problems are researchers working on and what approaches are they using? This post contains a summary of what we found — an overview of popular network architectures in genomics, the types of data used to train deep models, and the outcomes predicted or inferred.
Despite being able to sequence the genome at nucleotide-level resolution, and the abundance of publicly available labeled datasets from sources like the 1000-genome project, ENCODE and GEO, we are still far from bridging the genotype-phenotype divide or predicting disease from genome sequences. This talk by Brendan Frey puts the deep learning-and-genomics problem in context, explaining why sequencing more genomes may not be the answer. The genome is complex and contains many interacting information layers. Most current approaches involve developing a system to interpret the genomic code or a part of it, rather than directly training a network that predicts phenotype from sequence. Below are some of the ways that deep learning has been used for genomics, with emphasis on implementations for the human genome or transcriptome.
Some of the first applications of neural networks in genomics involved training single-layer fully connected neural networks on gene expression data, typically after using principal component analysis to reduce the dimensions of the input. These networks were used to distinguish between tumor types, predict tumor grade, and predict patient survival from gene expression patterns. Improvements to this included developing feature selection techniques for neural networks that could identify subsets of genes or ‘signatures’ that were most predictive.
Similarly, simple ANNs have been used to predict the tumor grade of colorectal tumors based on microRNA expression patterns and to identify those microRNAs which can predict tumor status, as described in this paper.
Autoencoders have been used in genomics to reduce feature space as well as to construct useful features from gene expression data. One example is this early work that demonstrates the use of autoencoders to learn a concise feature representation from unlabeled gene expression data, and subsequently use the representation to learn a classifier for a number of labeled tumor datasets. More recent examples include this paper, which uses denoising autoencoders to extract features from breast cancer gene expression data, and ADAGE (paper and repository), a similar approach on bacterial gene expression data. Features are either nodes in the encoded layer, or sets of genes whose weights most greatly influence a certain node. Both these demonstrate the ability of autoencoders to pick up individual features which can identify tumor subtypes, estrogen-receptor status, and predict patient survival. This success in classifying tumors of different types based on features learned from a single dataset suggests that gene expression features may be shared across the human transcriptome. Would training a single autoencoder with a variety of expression profiles generate a universal common set of features that are predictive in all tissue types? Given that the method seems to work for gene expression data, a second question that arises is - are there any useful features to be learned from using an auto-encoder on DNA sequence data?
Since genes are expressed in a coordinated manner, levels of expression of different genes are highly correlated. This means that expression levels of all genes could be inferred or predicted from the profiles of a subset of genes, potentially reducing the cost and complexity of gene expression profiling. A method called D-GEX has been developed that uses a multi-task deep neural network trained on the publicly available CMAP dataset, to predict the expression of all genes, given the expression of ~1000 genes. A recent topcoder challenge also focuses on the same task.
A related, much harder task is predicting the expression level of an exon or transcript from DNA sequence data. Expression level depends not only on the sequence but also on the cellular context. This paper, titled ‘Deep learning of the tissue-regulated splicing code’ describes a model that predicts the percent of transcripts with exon spliced in (PSI), given the DNA sequence surrounding the exon. Hand-generated genomic features are used to train a model that can predict splicing patterns based on genomic features in specific mouse cell types. After reducing feature space with an autoencoder, the encoded features with additional inputs representing cell type are used to train a multilayer fully connected network. Based on this method, the authors developed and validated a tool that can score the effect of a single-nucleotide-variant (or mutation) on splicing.
A large proportion of recent work is focused on using convolutional neural networks to answer epigenomic questions such as predicting transcription factor binding sites, enhancer regions and chromatin accessibility from gene sequence. Since this often involves training on more than one data type, custom CNN architectures are used, such as training the same network to predict 2 different targets, or combining different kinds of input via independent convolution modules.
DeepBind is a method that can predict the specificity of DNA- or RNA-binding proteins given a sequence. In order to do this, a CNN was trained using data from a large number of different high-throughput epigenomic experiments like protein-binding microarray and (ChIP)-seq. The convolution stage of the network scans a set of 'motif detector' matrices across the one-hot encoded sequence (see the encoding sketch below). The learned filters are akin to the position weight matrices used in genomics to understand and depict DNA sequence motifs. The network has been able to learn both known and previously unknown motifs. DeepMotif is a deeper model on similar lines, trained to classify transcription factor binding (yes or no) to a sequence, and focuses on extracting motifs. DeepSEA is a model trained with a variety of epigenomic data from ENCODE, that can predict the effects of mutations on transcription factor binding and DNAse sensitivity with single-nucleotide sensitivity. Key features of this model are the use of a hierarchical architecture that learns sequence features at different scales, the consequent ability to scan a wide sequence context, and multitask joint learning of diverse chromatin factors sharing predictive features. Other examples of convolutional networks used in epigenomics include Basset, which predicts chromatin accessibility code based on DNA sequence; DeepCpG, which models DNA methylation; DEEP, an ensemble framework predicting enhancers or regions of DNA where transcription factors bind to increase the transcription of a gene; and a CNN-based method to attenuate noise and enhance signal from ChIP-seq experiments.
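For readers unfamiliar with the representation, here is a minimal sketch of the one-hot encoding these sequence CNNs consume; the row order (A, C, G, T) is a common convention, assumed here for illustration.

```python
import numpy as np

def one_hot(seq):
    """Encode a DNA string as a 4 x L binary matrix (rows: A, C, G, T),
    the standard input representation for sequence CNNs like DeepBind."""
    index = {'A': 0, 'C': 1, 'G': 2, 'T': 3}
    mat = np.zeros((4, len(seq)), dtype=np.float32)
    for i, base in enumerate(seq.upper()):
        if base in index:        # ambiguous bases such as 'N' stay all-zero
            mat[index[base], i] = 1.0
    return mat

print(one_hot("ACGTN"))
```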
Parallels are often made between language processing and genome interpretation, suggesting that the methods used in language analysis may be useful ways to understand genomics.
Recurrent neural networks have been used to capture long‐range interactions in DNA sequences. DanQ is a hybrid convolutional and bi-directional long short-term memory recurrent neural network where the convolution layer captures regulatory motifs, while the recurrent layer captures long-term dependencies between the motifs in order to learn a regulatory ‘grammar’. Other examples of RNNs include DNA-level splice junction prediction where an RNN is trained to detect splice junctions or the boundaries between introns and exons; and detection of protein coding regions in viral genomes.
A project titled 'Gene2vec: Neural word embeddings of genetic data' uses the original google word2vec implementation on genome sequences, splitting the genome into 27-bp 'words' (a splitting sketch follows). It would be interesting to see if this can be applied for practical use in the human genome, and if a similar word2vec-style model can be meaningfully trained on gene expression data.
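The preprocessing step itself is a one-liner; the word lists it produces could then be fed to any word2vec implementation. This is a sketch of the idea, not the project's actual code.

```python
def genome_to_words(seq, k=27):
    # non-overlapping k-bp 'words', as in the Gene2vec preprocessing
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, k)]

sentence = genome_to_words("ACGT" * 20)  # toy 80-bp 'sentence'
print(sentence[:3])
```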
The list of applications of neural networks in genomics is growing rapidly, with a number of new research papers published this year. Many of the recently-evolved deep learning paradigms have been applied to questions in genomics or epigenomics. It remains to be seen how these will translate to real-world applications in biology or medicine.
Unsupervised learning is about inferring hidden structure from unlabelled data. There is no supervision in the form of labels, so the model has to figure out how to represent the data and find patterns in it. When we have large quantities of data and few or no labels, the most effective technique is to use unsupervised methods to learn from the data.
Deep learning methods which are popular today require large quantities of labelled data for training. Labelling is a very time consuming task, and in the case of medical images it requires significant time commitment from highly trained physicians with specialized skill-sets, such as radiologists and pathologists. The lack of large labelled datasets is by far the biggest challenge in applying deep learning techniques in the healthcare domain. One way this can be mitigated is by using unsupervised methods to train on data without labels. These methods can learn patterns in the data which can then be clustered or used for supervised learning with small datasets.
The volume of medical data is growing at a rapid pace, as was mentioned here. The data are usually from different modalities (as in the case of MRI) or from different environments (machines, patients). Building a supervised model to handle all these situations means painstakingly labelling or annotating data from every scenario. To tackle this problem, we need an approach which uses a mix of supervised and unsupervised learning.
Several deep learning methods are used for unsupervised learning; in this post we focus on one of them, the Variational Autoencoder.
VAE stands for Variational AutoEncoder. It is a type of generative model which was introduced in the paper Auto-Encoding Variational Bayes.
Architecture of the VAE. The left and right images represent the same VAE
The image illustrated above shows the architecture of a VAE. The encoder part of the model, Q, models Q(z|X) (z is the latent representation and X the data); it is the part of the network that maps the data to the latent variables. The decoder part of the network, P, learns to regenerate the data from the latent variables as P(X|z). So far there is no difference between an autoencoder and a VAE. The difference is the constraint applied on z: its distribution is forced to be as close to the Normal distribution as possible (the KL divergence term).
Using a VAE we are able to fit a parametric distribution (in this case a Gaussian). This is what differentiates a VAE from a conventional autoencoder, which relies only on the reconstruction cost. It means that at run time, when we want to draw samples from the network, all we have to do is generate random samples from the Normal distribution and feed them to the decoder P(X|z), which will generate the samples. This is shown in the figure below.
VAE as a graphical model and how to use it at runtime to generate samples
Our goal here is to use the VAE to learn the hidden or latent representations of the data. Since this is an unsupervised approach we will only be using the data and not the labels. We will be using the VAE to map the data to the hidden or latent variables. We will then visualize these features to see if the model has learnt to differentiate between data from different labels.
Our first run will be on the well known MNIST dataset. We will run the network on a dataset of two digits from MNIST and visualize the features the network has learnt. After this we will proceed to run the network on the ISLES 2015 dataset.
We will be using the Keras library for running our example. Keras has an example implementation of a VAE in their repository, which we will use as the basis of our implementation; a condensed sketch appears below.
MNIST dataset consists of 10 digits from 0-9. For our run we will choose only two digits (1 & 0).
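A condensed sketch along the lines of the Keras VAE example follows; the layer sizes, epoch count and the Keras 2 `add_loss` pattern are our assumptions for illustration, not the exact example code.

```python
import numpy as np
from keras import backend as K
from keras.datasets import mnist
from keras.layers import Dense, Input, Lambda
from keras.models import Model

original_dim, intermediate_dim, latent_dim = 784, 256, 2

# encoder Q(z|X): map the image to the mean and (log) variance of z
x = Input(shape=(original_dim,))
h = Dense(intermediate_dim, activation='relu')(x)
z_mean, z_log_var = Dense(latent_dim)(h), Dense(latent_dim)(h)

def sampling(args):
    m, lv = args  # reparameterization trick: z = mean + sigma * epsilon
    eps = K.random_normal(shape=(K.shape(m)[0], latent_dim))
    return m + K.exp(0.5 * lv) * eps

z = Lambda(sampling)([z_mean, z_log_var])

# decoder P(X|z): regenerate the image from the latent code
h_dec = Dense(intermediate_dim, activation='relu')(z)
x_dec = Dense(original_dim, activation='sigmoid')(h_dec)

vae = Model(x, x_dec)
recon = original_dim * K.mean(K.binary_crossentropy(x, x_dec), axis=-1)
kl = -0.5 * K.sum(1 + z_log_var - K.square(z_mean) - K.exp(z_log_var), axis=-1)
vae.add_loss(K.mean(recon + kl))  # reconstruction cost + KL constraint on z
vae.compile(optimizer='adam')

# keep only the digits 0 and 1, as described above
(x_train, y_train), _ = mnist.load_data()
keep = (y_train == 0) | (y_train == 1)
x_train = x_train[keep].reshape(-1, original_dim).astype('float32') / 255.
vae.fit(x_train, epochs=10, batch_size=128)

encoder = Model(x, z_mean)          # maps data to the latent space
latent = encoder.predict(x_train)   # 2-D points for the scatter plot below
```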
scatter plot of the latent representation
What we see here are two clusters, one belonging to the digit 1 (red) and the other to the digit 0 (blue). The VAE has mapped the two different digits to different clusters in the latent variable space. The clusters are very well defined because the digits are structurally different.
For running the VAE on a medical dataset we will use the ischemic lesion dataset. The dataset contains 28 scans of brains with ischemic lesions, along with the manually segmented masks of the lesion regions. There are four MRI modalities (T1, T2, DWI, FLAIR).
We will only be using the DWI modality, because ischemic lesions are visible in the DWI channel. For the preprocessing step we will use SimpleITK to normalize the images, ensuring the pixel values are scaled down to the (0, 1) range.
The next step is to extract 28x28 non-overlapping patches from the 3D images; while doing so, we keep track of which patches contain lesion regions and which are healthy. We reuse the same type of network as in the MNIST example. A sketch of this preprocessing follows.
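Below is a minimal sketch of this normalization and patch extraction step; the file names are placeholders and the helper functions are illustrative assumptions, not our pipeline code.

```python
import numpy as np
import SimpleITK as sitk  # assumed available, as mentioned above

def load_normalized(path):
    # read a volume and scale intensities to the (0, 1) range
    vol = sitk.GetArrayFromImage(sitk.ReadImage(path)).astype('float32')
    return (vol - vol.min()) / (vol.max() - vol.min() + 1e-8)

def extract_patches(volume, mask, size=28):
    # cut each axial slice into non-overlapping size x size patches and
    # mark a patch as 'lesion' if its mask region contains any lesion voxels
    patches, labels = [], []
    for img_slice, mask_slice in zip(volume, mask):
        for y in range(0, img_slice.shape[0] - size + 1, size):
            for x in range(0, img_slice.shape[1] - size + 1, size):
                patches.append(img_slice[y:y + size, x:x + size])
                labels.append(int(mask_slice[y:y + size, x:x + size].any()))
    return np.array(patches), np.array(labels)

# 'dwi.nii' and 'lesion_mask.nii' are placeholder file names
patches, labels = extract_patches(load_normalized('dwi.nii'),
                                  load_normalized('lesion_mask.nii') > 0.5)
```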
TSNE plot of the latent representation. Blue colored dots are lesion patches; red colored dots are healthy patches.
The image illustrated above is the TSNE plot of the latent representations (the dimension of z is 5) of the data. We can clearly see the data clustered into two groups. The central blue cluster, which is free of any red dots, consists of lesion patches. The outer regions consist of a mixture of red and blue dots. This can be explained by the fact that many lesion patches have only a few pixels of lesion in them, making it difficult to differentiate between them.
In the next part we will explore how to improve clustering.
We believe that deep learning has the ability to revolutionize healthcare. Read our post on deep learning in healthcare to understand where we are headed. Reaching this goal will require contributions from many people, both within and outside of Qure.ai. In this spirit, we want to open-source our work, wherever possible, so that the super-talented global deep learning community can build upon our solutions. In this post, we share our work on segmentation of nerves from the ultrasound images.
Kaggle ultrasound nerve segmentation challenge is one of the high profile challenges hosted on Kaggle. We have used U-net neural network architecture and torchnet package for tackling the challenge and achieved some remarkable results. The repository can be found here.
The challenge in itself is a great learning experience for segmentation problems. The figure below shows an example of the image and the mask to predict.
We assume the following are installed on your system:
```bash
git clone https://github.com/qureai/ultrasound-nerve-segmentation-using-torchnet.git
cd ultrasound-nerve-segmentation-using-torchnet
```
The dataset consists of 5635 training images and their masks, and 5508 testing images. The images are in tiff format, and to be able to load them into Lua, we convert them to png format. So first we need to set up the dataset so that the data is arranged as follows:
```
/path/to/train/data/images
/path/to/train/data/masks
/path/to/test/data
```
Now, go to each folder and run the following command; it will generate a .png file for each .tif file in the folder. Be patient, the procedure takes time.
```bash
mogrify -format png *.tif
```
Now we have all images in png format. To create the datasets, run the following command:
```bash
th create_dataset.lua -train /path/to/train/data/images -trainOutput /path/to/train/data.h5 -test /path/to/test/data -testOutput /path/to/test/data.h5
```
This will package the dataset into HDF5 format, such that the train images and masks of patient number `N` are in the paths `/images_N` and `/masks_N` of the train HDF5 file respectively. The test images are in the `/images` path of the generated test HDF5 file.
## Model
We are using a slightly modified [U-Net](https://arxiv.org/abs/1505.04597) with [Kaiming-He](https://arxiv.org/abs/1502.01852) initialization. The structure of the U-Net generated using nngraph can be found [here](/assets/images/ultrasound_torchnet/U-Net.svg).
The source code to create this model is at `models/unet.lua`.
<p align="center">
<img src="/assets/images/ultrasound_torchnet/u-net-architecture.png" alt="U-Net Architecture">
<br>
<small> U-Net architecture </small>
</p>
## Training
You can start training right away by running:

```bash
th main.lua [OPTIONS]
```
| Option | Default value | Description |
|---|---|---|
| `-dataset` | `data/train.h5` | Path to the training dataset to be used |
| `-model` | `models/unet.lua` | Path of the model to be used |
| `-trainSize` | 100 | Amount of data used for training; -1 to use the complete train data |
| `-valSize` | 25 | Amount of data used for validation; -1 to use the complete validation data |
| `-trainBatchSize` | 64 | Batch size used for training |
| `-valBatchSize` | 32 | Batch size used for validation |
| `-savePath` | `data/saved_models` | Path where models are saved |
| `-optimMethod` | `sgd` | Optimization method; can be `sgd` or `adam` |
| `-maxepoch` | 250 | Maximum number of epochs to train for |
| `-cvparam` | 2 | Cross-validation parameter |
The images are given per patient: the dataset has 47 patients, each with 119 or 120 images. To assess the real performance of our model, we divide the dataset into train and validation sets by patient, using an 80-20 split. The question then arises: which patients should be used for validation and which for training?
To solve this, we keep a parameter `-cvparam`, such that all patients with `patient_id % 5 == cvparam` are used for validation and the rest for training. Out of these images, we select `-trainSize` images for training and `-valSize` images for validation. This allows us to do cross-validation easily.
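A minimal sketch of this patient-level split (`numPatients` and the table names here are illustrative, not from the repository):

```lua
-- Split patients into train/validation by patient id modulo 5
local trainIds, valIds = {}, {}
for id = 1, numPatients do
   if id % 5 == cvparam then
      table.insert(valIds, id)   -- held out for validation
   else
      table.insert(trainIds, id) -- used for training
   end
end
```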
Data augmentation plays a vital role in any segmentation problem with a limited dataset. Here we do on-the-fly data augmentation using a modified version of the transformation file from Facebook's ResNet implementation. The image goes through a series of augmentation transforms, after which we resize it to `imgWidth x imgHeight` and pass it to our model.
For creating segmentation masks, we consider a pixel from the output to be part of the mask if `prob_pixel > baseSegmentationProb`, where `prob_pixel` is the predicted probability that the pixel belongs to the nerve. These values can be defined in the `constants.lua` file.
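In Torch this thresholding is a one-liner; a minimal sketch, assuming `output` holds the network's per-pixel nerve probabilities:

```lua
-- 1 where the predicted probability exceeds the threshold, 0 elsewhere
local mask = output:gt(baseSegmentationProb)
```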
While your model is training, you can look into how torchnet was used to create the training pipeline.
Torchnet was introduced in the second half of June 2016 to enable code reuse and to make writing code in Torch much simpler. It is basically a well-structured implementation of the boilerplate code, such as batch permutation and the training loop, collected into a single library. In this project, we used four major tools provided by torchnet.
Torchnet provides an abstract class `tnt.Dataset` and implementations of it to easily concatenate, split, batch, resample, etc. datasets. We use two of these implementations:

- `tnt.ListDataset`: given a `list` and a `load()` closure, the i-th sample of the dataset is returned by `load(list[i])`.
- `tnt.ShuffleDataset`: given a dataset like the above, it creates a new dataset by shuffling it.

For our model to generalize as it converges, providing a shuffled dataset on every epoch is an important strategy. So we load the data with `tnt.ListDataset` and then wrap it with `tnt.ShuffleDataset`:
```lua
local dataset = tnt.ShuffleDataset{
   dataset = tnt.ListDataset{
      -- the list of sample indices
      list = torch.range(1, #images):long(),
      -- load() returns the i-th (image, mask) pair
      load = function(idx)
         return { input = images[idx], target = masks[idx] }
      end,
   },
   size = size
}
```
This ensures that whenever you query the dataset for the i-th sample using `dataset:get(i)`, you get an image chosen at random from the dataset without replacement.
Illustration of dataset
While it is easy to iterate over datasets using `dataset:get(i)` and a for loop, we can do on-the-fly, threaded data augmentation using `tnt.DatasetIterator`. We call the iterator every epoch, and it returns the batches over which training must be done. Before a batch is put up for training, the data augmentation transforms are applied, and then a batch of the given size is formed. Using the shuffled dataset ensures that we get a new order of data every epoch, so batches are non-uniform across epochs. `tnt.BatchDataset` ensures that batches are formed from the underlying images.
```lua
return tnt.ParallelDatasetIterator{
   nthread = 1,
   transform = GetTransforms(mode), -- transforms for data augmentation
   init = function()
      tnt = require 'torchnet'
   end,
   closure = function()
      return tnt.BatchDataset{
         batchsize = batchSize,
         dataset = ds
      }
   end
}
```
We use `tnt.ParallelDatasetIterator` with transforms, which ensures that while training is running on batch `n`, the transforms are applied to batch `n+1` in parallel, thus reducing training time.
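Outside of the engine, the same iterator can be consumed directly; a sketch, assuming `getIterator` is the wrapper shown being passed to the engine below:

```lua
-- Iterate over augmented batches for one epoch
for sample in getIterator('train', trainDataset, trainBatchSize)() do
   -- sample.input  : a batch of images
   -- sample.target : the corresponding batch of masks
end
```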
From the torchnet documentation:

> In experimenting with different models and datasets, the underlying training procedure is often the same. The Engine module provides the boilerplate logic necessary for the training and testing of models. This might include conducting the interaction between model (nn.Module), tnt.DatasetIterators, nn.Criterions, and tnt.Meters.
The engine is the main running core that puts your model into training. We use the optim engine, which wraps the optimization functions of torch's optim package. The engine has hooks attached to the different events of training. We can define callback functions and attach them to these hooks; the hooks ensure that the functions are called at the end of the events they are attached to. We use these hooks to update our meters, save the model, and print training statistics.
```lua
self.engine:train{
   network = self.model,
   iterator = getIterator('train', self.trainDataset, self.trainBatchSize),
   criterion = self.criterion,
   optimMethod = self.optimMethod,
   config = self.optimConfig,
   maxepoch = self.maxepoch
}
```
Below is an example of the hook that we attach to the end-of-epoch event. It validates the model, prints the meters, and saves the model.
```lua
local onEndEpochHook = function(state)
   state.t = 0            -- reset the iteration counter
   self:test()            -- validate the model
   self:PrintMeters()     -- print the meter values
   self:saveModels(state) -- save the current model
end
```
The `state` supplied to the hook function stores the current information about the training process, such as the number of epochs done, the model, the criterion, etc.
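For completeness, such a hook is attached through the engine's hooks table (a sketch using the standard torchnet hook API):

```lua
-- Called automatically by the engine at the end of every epoch
self.engine.hooks.onEndEpoch = onEndEpochHook
```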
Again from torchnet's documentation:

> When training a model, you generally would like to measure how the model is performing. Specifically, you may want to measure the average processing time required per batch of data, the classification error or AUC of a classifier on a validation set, or the precision@k of a retrieval model.
>
> Meters provide a standardized way to measure a range of different measures, which makes it easy to measure a wide range of properties of your models.
We use `tnt.AverageValueMeter` for all quantities we want to observe, such as validation dice score, validation loss, training loss, and training dice score. The meters are reset to zero at the beginning of every epoch, updated at the end of each iteration within an epoch, and printed at the end of every epoch.
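That lifecycle maps directly onto the meter API; a minimal sketch, where `diceScore` stands for the value computed in one iteration:

```lua
local tnt = require 'torchnet'

local meter = tnt.AverageValueMeter()
meter:reset()               -- at the beginning of an epoch
meter:add(diceScore)        -- at the end of each iteration
local mean = meter:value()  -- running average, printed at the end of the epoch
```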
## Generating the submission

Once training is done, you can generate a submission file by running:

```bash
th generate_submission.lua [OPTIONS]
```
| Option | Default value | Description |
|---|---|---|
| `-dataset` | `data/test.h5` | Path to the test dataset to be used |
| `-model` | `models/unet.t7` | Path of the trained model to be used |
| `-csv` | `submission.csv` | Path of the csv file to be generated |
| `-testSize` | 5508 | Number of test images used for generating the submission; must be ≤ 5508 |
The model takes about 3 minutes per epoch on a Titan X GPU. Training with adam gave us a leaderboard score greater than 0.620, while SGD takes it above 0.628. Generating the submission file takes about 7 minutes. Rock on!
Let us know if this was helpful and feel free to reach out to us through the forum.
On the other hand, healthcare is going through a revolution of its own. The amount of medical data generated is growing exponentially, from medical images to connected devices. The use of diagnostic imaging has increased significantly and continues to grow: CT scan volumes grow at 7.8% annually and MRIs at 10%. This has an impact on the workload of healthcare practitioners: in 1999, radiologists were expected to interpret 2.8 images per minute; by 2010, that had increased to 19 images per minute. Additionally, new kinds of healthcare data (such as that generated by genome sequencing and biosensors) are large and complex, rendering traditional diagnostic methods obsolete. These sources of data also continue to grow: 200,000 genomes had been sequenced by 2014, and the number is expected to reach 1.7 million by 2017. Unfortunately, the number of trained physicians who can analyze this data is growing at a much slower rate; for example, the number of radiologists is growing at only half the rate of medical images. As organized healthcare reaches parts of the developing world that previously had no healthcare access, this ratio suffers further.
We believe that Artificial Intelligence (AI) will be critical in ensuring that healthcare practitioners can focus on cases that truly matter, letting machines diagnose or treat the easier ones. AI will also contribute in combining various data sources from a patient’s history (medical records, radiology imaging, pathology imaging, genome sequences, fitness band data, heart monitor data, etc.) to generate a personalized diagnosis or a personalized treatment plan. At Qure.ai, we use deep learning to diagnose disease from radiology and pathology imaging, and to create personalized cancer treatment plans from histopathology imaging and genome sequences.
We are a team of computer scientists, medical practitioners and bioinformaticians who believe that our work can have an enormous impact on human lives. We are hiring! If deep learning is an area that excites you, come join us!
Our research is done in collaboration with several hospital chains, universities and research institutions. If you are a medical institution and would like to collaborate with us, please reach out to us.