GPT-4 performed better than junior physicians without specialist training and edged out ophthalmology trainees.
As development of large language models (LLMs) continues, many questions are being raised about how they can be of use to society in fields such as medicine. According to the Financial Times, a recent study conducted by the School of Clinical Medicine at the University of Cambridge found that OpenAI's GPT-4 performed nearly as well as specialists in the field on an ophthalmology assessment.
The LLM, its predecessor GPT-3.5, Google's PaLM 2, and Meta's LLaMA were all given a series of 87 multiple-choice questions in the study, which was published in the journal PLOS Digital Health. Five expert ophthalmologists, three trainee ophthalmologists, and two junior doctors with no specialist training took the same mock exam. The questions, which are used to test trainees, were drawn from a textbook covering topics ranging from light sensitivity to lesions. Because the textbook's contents are not publicly available, the researchers believe it is unlikely the LLMs had been trained on them. ChatGPT, equipped with either GPT-4 or GPT-3.5, was given three chances to provide a definitive answer; otherwise, its response was marked as null.
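As a rough illustration of that three-attempt rule, the sketch below shows one way such a grading loop might look. It is not the researchers' actual code; the query_model wrapper, the answer-letter parser, and the A-to-E choice format are assumptions for the example.

```python
import re

def grade_question(query_model, question: str, correct_choice: str, max_attempts: int = 3):
    """Ask the model up to max_attempts times.

    Returns True/False if a definitive single-letter answer was given,
    or None if no definitive answer emerged (the "null" designation).
    """
    for _ in range(max_attempts):
        reply = query_model(question)  # hypothetical wrapper around a chat API
        match = re.search(r"\b([A-E])\b", reply.strip())  # look for one answer letter
        if match:
            return match.group(1) == correct_choice
    return None  # marked null after three non-definitive replies
```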
GPT-4 answered 60 of the 87 questions correctly, scoring higher than the trainees and junior doctors. While that was well above the junior doctors' average of 37 correct answers, it only narrowly beat the three trainees' average of 59.7. Although one expert ophthalmologist answered only 56 questions correctly, the five specialists averaged 66.4 correct answers, outscoring the machine. PaLM 2 scored 49 and GPT-3.5 scored 42. LLaMA scored lowest at 28, placing it below even the junior doctors. Notably, these trials took place in mid-2023.
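For scale, the reported scores translate into the following approximate accuracies out of 87 questions. This is a quick tabulation of the figures above, not additional data from the paper.

```python
# Reported scores out of 87 questions (figures as given in the article).
scores = {
    "Expert ophthalmologists (avg)": 66.4,
    "GPT-4": 60,
    "Trainee ophthalmologists (avg)": 59.7,
    "PaLM 2": 49,
    "GPT-3.5": 42,
    "Junior doctors (avg)": 37,
    "LLaMA": 28,
}

TOTAL = 87
for name, score in scores.items():
    print(f"{name:32s} {score:5.1f}/{TOTAL}  ({score / TOTAL:.0%})")
```

Run as-is, this prints roughly 76% for the specialists, 69% for both GPT-4 and the trainees, and 43% for the junior doctors.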
Although these results could prove beneficial, they come with significant hazards and concerns. The researchers pointed out that the study included only a limited number of questions, particularly in some categories, so real-world performance could differ. LLMs also have a propensity to "hallucinate," or make up things that are not true. That is one thing when the fact in question is inconsequential, but quite another when it is a claim about a cataract or cancer. As in many other applications of LLMs, the systems also lack nuance, which creates additional room for inaccuracy.