General limitations

Because of the way large language models (LLMs) work, there are certain limitations to their output that should be taken into account when working with these kinds of tools.

‘An artificial intelligence that has limitations when being worked with’, Microsoft Designer

Below we discuss some of these limitations.

Lack of truthfulness

LLMs are trained on ‘next-word’ prediction: they use statistics to estimate the most plausible next word in a sentence. These plausible and convincing responses can nevertheless be incorrect or entirely made up (‘hallucinating’ AI).
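
To illustrate the idea, here is a small toy sketch in Python (the probability table is invented and not taken from any real model): it simply draws a next word weighted by how plausible each option is, and the most plausible continuation is not necessarily the true one.

    import random

    # Toy next-word table (invented numbers, not from any real model):
    # relative probabilities of the word following "The capital of Australia is".
    next_word_probs = {
        "Sydney": 0.45,    # plausible but wrong
        "Canberra": 0.40,  # correct
        "Melbourne": 0.15, # plausible but wrong
    }

    def sample_next_word(probs):
        """Pick the next word at random, weighted by its estimated probability."""
        words = list(probs)
        weights = list(probs.values())
        return random.choices(words, weights=weights, k=1)[0]

    prompt = "The capital of Australia is"
    print(prompt, sample_next_word(next_word_probs))
    # The model optimises for plausibility, not truth: here the single most
    # likely continuation, "Sydney", happens to be the wrong answer.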

The model cannot verify its own output or assess its reliability. Always use your critical thinking skills when reading the output and always verify it against other sources of information.

In the example below, ChatGPT gives an erroneous answer: the third female Prime Minister of the United Kingdom was Liz Truss, who was in office in 2022. Moreover, since this prompt was entered in June 2024, ChatGPT is also providing information about a fictional future date.

Lack of sources

GenAI systems generally do not cite the specific sources on which they base their answers, which makes their claims difficult to verify; they can also hallucinate references. GenAI systems are not search engines such as Google, so you should not use them in the same way when looking for scientific information. Although some models, such as ChatGPT Plus and Microsoft Copilot, have begun to include links to internet sources in their output, you should always critically evaluate and cross-reference all AI-generated output.

Below is an example of ChatGPT hallucinating references. Even though the references appear plausible, neither of these articles seems to exist, and the DOIs given for them are also made up.
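
One practical way to catch fabricated references is to check whether a cited DOI actually resolves. The sketch below (Python, assuming the requests package is available; the example DOI is made up) asks the public doi.org resolver, which redirects for registered DOIs and returns a 404 error for unknown ones.

    import requests

    def doi_exists(doi: str) -> bool:
        """Return True if doi.org knows this DOI, i.e. it redirects to a landing page."""
        response = requests.head(
            f"https://doi.org/{doi}", allow_redirects=False, timeout=10
        )
        # Registered DOIs answer with a redirect (3xx); unknown DOIs give 404.
        return 300 <= response.status_code < 400

    # Hypothetical DOI copied from an AI-generated reference list:
    print(doi_exists("10.1234/made-up-example-doi"))  # very likely False

A check like this only tells you whether the reference exists; you still need to read the source itself to confirm that it supports the claim it is cited for.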

Generic or reductive output

The output of GenAI can be quite generic or reductive. Especially when the prompt is very short, simple, or unspecific, the language used in the response can be rather bland, formulaic, and uninspiring.

In the following example, the prompt is very basic and generic, and the output is equally dull:

To increase the quality of the output, try to make your prompts as specific as possible. In the example below, the prompt is much more precise: it specifies who the intended audience is, it clearly states the aim of the message, and it offers instructions on how to deliver the message. This information is reflected in the output, as the language is more engaging and dynamic.
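
As a rough programmatic analogue of this advice, the sketch below (plain Python, with invented wording and no particular chatbot service assumed) builds a prompt from an explicit audience, aim, and delivery instructions instead of a single vague sentence.

    # A vague prompt typically yields vague output.
    generic_prompt = "Write a message about the library closing early."

    # Spelling out the audience, the aim, and delivery instructions tends to
    # produce more specific, engaging output. (All wording here is invented.)
    audience = "undergraduate students who study in the library in the evenings"
    aim = "announce that the library will close at 18:00 during the summer weeks"
    instructions = (
        "keep it under 100 words, use a friendly tone, "
        "and end with a link to the opening hours page"
    )

    specific_prompt = (
        f"You are writing for {audience}. "
        f"Your goal is to {aim}. "
        f"Please {instructions}."
    )

    print(specific_prompt)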

Inconsistent output reproducibility

Because text-based AI models generate their responses word by word from probability estimates, their output is not consistent: given the same input (e.g. a prompt), the model can return a different result each time. This makes it difficult to consistently reproduce content from such models.
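
The toy sketch below (plain Python, not a real language model; the word probabilities are invented) illustrates why this happens: each next word is drawn at random from the model's probability estimates, so two runs with an identical prompt can diverge, while fixing a random seed, where a tool exposes one, makes a run repeatable.

    import random

    # Invented next-word probabilities for a toy "model".
    next_word_probs = {"gold": 0.5, "silver": 0.3, "bronze": 0.2}

    def generate(prompt, probs, seed=None):
        """Append one sampled word to the prompt; identical prompts can still diverge."""
        rng = random.Random(seed)  # no seed: a fresh, unpredictable choice each run
        word = rng.choices(list(probs), weights=list(probs.values()), k=1)[0]
        return f"{prompt} {word}"

    prompt = "In her final race she won"
    print(generate(prompt, next_word_probs))           # may differ between runs
    print(generate(prompt, next_word_probs))           # may differ from the line above
    print(generate(prompt, next_word_probs, seed=42))  # fixed seed: same result every run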

The example below illustrates that two identical prompts entered on the same day still produce different results. Not only do the language and structure of the output differ in several ways, but the content also differs: for example, the first answer states that Michael Phelps competed in five Olympic Games, whereas the second mentions only four.

Since reproducibility is often a key aspect of academic research, it has been suggested that using AI models that generate unreproducible output for research purposes may lead to a ‘reproducibility crisis’ in science. The inconsistency of the output, and its impact on reproducibility, should therefore always be taken into account when considering the use of GenAI for academic purposes.