Google's AI Health Screening Tool Claimed 90 Percent Accuracy, but Failed to Deliver in Real World Tests

A team of Google researchers is working to improve how its artificial intelligence (AI) performs in reality after one healthcare project fell short of expectations.

In an academic paper published this week, the team detailed how a deep learning tool that showed great promise under lab conditions had at times sparked frustration and unnecessary delays when rolled out into real-world clinical conditions.

The project took place between November 2018 and August 2019, with fieldwork being conducted at 11 clinics in the provinces of Pathum Thani and Chiang Mai, Thailand. Its aim was to use the technology to detect diabetic retinopathy (DR), a condition that can lead to vision distortion or loss, while helping the workflow of nursing staff.

Google said its AI has "specialist-level accuracy" of over 90 percent for detecting referable cases of DR. In the field, however, the system quickly ran into a series of unforeseen challenges.

Researchers found the tool needed high-quality images to work, which staff could not always provide. Eye-screening processes varied significantly between the clinics, and not all locations had reliable, high-speed internet connections. In some cases, the system actually appeared to slow down clinic workflows that were already strained.

The Google researchers wrote in the final paper: "We discovered several factors that influenced model use and performance. Poor lighting conditions had always been a factor for nurses taking photos, but only through using the deep learning system did it present a real problem, leading to ungradable images and user frustration.

"Despite being designed to reduce the time needed for patients to receive care, the deployment of the system occasionally caused unnecessary delays for patients."

The analysis added: "Finally, concerns for potential patient hardship (time, cost, and travel) as a result of on-the-spot referral recommendations from the system, led some nurses to discourage patient participation in the prospective study altogether."

The study was still worthwhile, the team said. It was the first to analyze how nurses can use AI to screen patients for DR, and the findings will be used to improve the systems in the future, Google suggested in a blog post.

Without the AI, nurses take a photo of a patient's retina before sending the image to an ophthalmologist for review. The process can take up to 10 weeks. Google set out to test if using the algorithm could speed things up and provide instantaneous results.

But that did not prove to be easy, the researchers said.

Researchers soon learned some nurses were dissuading patients from participating in the prospective study over fears it would cause them "unnecessary hardship" as they would potentially have to travel to another hospital should they be referred.

"Through observation and interviews, we found a tension between the ability to know the results immediately and risk the need to travel, versus receiving a delayed referral notification and risk not receiving prompt treatment," the paper said.

It added: "Patients had to consider their means and desire to be potentially referred to a far-away hospital. Nurses had to consider their willingness to follow the study protocol, their trust in the deep learning system's results, and whether or not they felt the system's referral recommendations would unnecessarily burden the patient."

On top of that, researchers soon realized the deep learning system was not designed to work with low-quality, dark, or blurry images. That restriction was intended to decrease the chance the tool would make an incorrect assessment, but it caused issues in practice, Google said.

"Out of 1838 images that were put through the system in the first six months of usage, 393 (21%) didn't meet the system's high standards for grading," the team said.

The paper added: "The system's high standards for image quality are at odds with the consistency and quality of images that the nurses were routinely capturing under the constraints of the clinic, and this mismatch caused frustration and added work."

Nursing staff voiced similar complaints, particularly about slow internet speeds. One staff member told the team: "Patients like the instant results but the internet is slow and patients complain. They've been waiting here since 6 a.m. and for the first two hours we could only screen 10 patients."

Google said its work is not done. It has held design workshops with nurses, potential camera operators and retinal specialists at future deployment sites.

"These studies were successful in their intended purpose: to uncover the factors that can affect AI performance in real world environments and learn how people benefit from the tech, and refine the tech accordingly," researcher Emma Beede told Newsweek.

"A failure would have been to fully deploy technology without studying how people would actually use and become affected by it.

"A properly conducted study is designed to reveal impacts, both positive and negative; if we hadn't observed challenges, that would be the failure.

"The goal of publishing this work is to set an example of how AI technologies should be fielded with extreme care and involvement with the people who will use it.

"Now that we've published this, we hope people will follow our lead, and understand users' needs and study real-world environments closely before introducing AI. That is the ultimate goal, to set an example on the importance of careful studies like this."

This article has been updated with comment from Google researcher Emma Beede.

The Google sign is pictured at the Mobile World Congress (MWC), the world's biggest mobile fair, on February 26, 2018, in Barcelona. PAU BARRENA/AFP/Getty