"Publish-or-perish" and ChatGPT: a dangerous mix
Academia is vulnerable to LLM-driven disruption because everything that is measured revolves around text, and what gets measured determines whether you survive as a scientist. It's time to do better.
ChatGPT and other AI-powered text-generating tools are here to stay, and scientific publishers have started to acknowledge this. “Science” and other AAAS journals forbid the use of ChatGPT-generated text in submitted manuscripts altogether. Springer Nature requires authors to disclose the use of ChatGPT and similar tools in the “Methods” or “Acknowledgments” section. While such clarity and transparency are important, I think that the consequences of AI-tool proliferation in academia reach much further and require much deeper, systemic change.
Post-ChatGPT academia: the good parts
Both ChatGPT and Galactica show clear and serious limitations as tools for scientific writing: they casually hallucinate plausible-sounding fake science. This doesn't mean, however, that ChatGPT or similar tools will have no impact on scientific work. If you treat a chatbot as an assistant with a specific task to perform, you can give it a few bullet points outlining a paragraph you need to write, describe the target audience, and ask it to draft the text. ChatGPT will then produce high-quality prose in a fraction of the time a human needs. And while such text still needs to be extensively fact-checked, other solutions such as WebGPT are already under development. Such models could potentially be connected to a knowledge base like Wikipedia or Wolfram Alpha and support their claims with traceable sources.
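To make that workflow concrete, here is a minimal sketch of how such a writing assistant could be driven through an API instead of the chat window. It assumes the OpenAI Python SDK and an API key in the environment; the model name and the bullet points are invented placeholders, not a recommendation, and the resulting draft would of course still need fact-checking.

```python
# Minimal sketch: using an LLM API as a writing assistant.
# Assumes the OpenAI Python SDK (`pip install openai`) and an API key in the
# OPENAI_API_KEY environment variable; model name and bullets are illustrative.
from openai import OpenAI

client = OpenAI()

bullet_points = """
- our method speeds up sample screening roughly 10x (placeholder claim)
- validated on three public benchmark datasets (placeholder claim)
- limitation: accuracy drops for very large inputs (placeholder claim)
"""

response = client.chat.completions.create(
    model="gpt-4o-mini",  # illustrative model name
    messages=[
        {"role": "system",
         "content": "You are a scientific writing assistant. Write clear, "
                    "formal prose for an audience of grant reviewers."},
        {"role": "user",
         "content": f"Turn these bullet points into one paragraph:\n{bullet_points}"},
    ],
)

print(response.choices[0].message.content)  # the draft still needs fact-checking
```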
For academics, this means that an AI assistant could soon help with much of the daily work that doesn't require much creativity or scientific insight: writing large parts of a grant proposal, co-creating syllabi and teaching material, drafting sections of scientific publications, or summarizing and pre-screening the literature.
The story typically told is that the time gained through automation will be reinvested where human input is irreplaceable, increasing the overall quality of work. In academia, scientists could thus have more time for actual research and pursue topics that are riskier or demand more rigor. They could finally focus on crucial but neglected aspects of their work as well, such as leadership, team building and public outreach.
Just as importantly, AI assistants have the potential to act as a great equalizer toward fairer treatment, especially for researchers who are not as skilled at writing complex texts in English as they are at doing research.
This is the part of the story that AI proponents want us to believe. But it is only a wish. I deeply believe that in the long run, incentives are stronger than intentions. And in the current scientific publishing and research assessment system (unfortunately, two sides of the same coin), the core assumptions and incentives, coupled with the possibilities offered by AI, may strain the system to the point where the current model of doing science becomes meaningless.
Proxy for “scientificness” and proxy for quality
When we detach from the actual content of science (or from the philosophical definition of science) and strip the current practice down to its technical aspects, we may notice a peculiar pattern in what gets treated as “science” and what doesn't:
A result of research activity is usually perceived as “legitimate” only when it comes in the form of a PDF with text and figures that has been read and accepted by at least two other humans and published in a particular kind of outlet.
Other types of artifacts produced by research activity are deemed less important and are usually recognized neither in formal research assessment nor among colleagues. Artifacts such as published datasets, code, and preprints are at best captured by altmetrics and treated as a nice addition rather than a core output. Other artifacts are ignored altogether as not “legitimate” enough to qualify as scholarly work: popular science articles, YouTube videos, discussions under a blog post, but also lab notebooks, personal knowledge bases, etc. Some aspects of scientific work (such as coaching, teamwork or hands-on lore) leave no analyzable digital trace whatsoever.
This means that in mainstream scholarly communication, text has become the main (or only) artifact used as a proxy for the whole body of diverse research activity.
The second part of the mix is that the volume or popularity of such “legitimate” text (quantified through publications and their citations) has become conflated with the quality of the research itself. One reason this assumption arose is that text (journal articles, book chapters, etc.) has been the most easily available artifact, and what is available gets measured. Another reason is that academia is currently an environment with a very narrow career funnel and scarce resources, which creates a strong need for a clear hierarchy and therefore extreme competitiveness. “Publish or perish” means that the ability to write good text in large quantities becomes crucial for survival. Despite initiatives like DORA, the Agreement on Reforming Research Assessment, or narrative CVs, academic career progress still too often depends on the sheer amount of published work: more publications mean more citations, hence higher bibliometric indicators and better chances for a permanent position. Equally important, simply being able to submit more grant applications statistically raises the chances of obtaining the funding necessary to keep working.
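To show just how count-based these indicators are, here is a toy calculation of the h-index (the largest h such that an author has h papers with at least h citations each). The citation counts are made up; the point is that the metric only ever sees counts, never the content, rigor or origin of the papers behind them.

```python
def h_index(citations: list[int]) -> int:
    """Largest h such that the author has h papers with at least h citations each."""
    counts = sorted(citations, reverse=True)
    h = 0
    for rank, cites in enumerate(counts, start=1):
        if cites >= rank:
            h = rank
        else:
            break
    return h

# Made-up citation counts: the indicator is blind to what the papers contain
# or how the text behind them was produced.
print(h_index([42, 18, 9, 7, 3, 1]))  # -> 4
```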
Is academia on its way to becoming a cargo cult?
So let's imagine what may happen to such a system once it becomes possible to write an arbitrarily large volume of high-quality text almost for free.
Gaming the metrics and evolutionary selection: Even without AI, publication metrics have long been gamed in the spirit of Goodhart's law. Given that “publish or perish” is currently a dominant incentive, it is likely that many will use AI tools not to move into riskier, more resource-intensive projects, but to avoid “perishing” in the first place: to publish even more papers and submit even more grant applications. Since the incentives are systemic, it is hard to blame individual actors for doing this. And because the system is set up to remove researchers who don't publish “enough”, it will soon be selecting for researchers who use AI for authoring in one way or another.
Ethical tension: What may be particularly difficult with the wider adoption of AI-powered tools is the ethical tension faced by individual researchers trying to succeed under the current rules. On one hand, reaching for AI text-generation tools can be seen as a blow to scientific integrity. On the other hand, not using such tools will put researchers at a publication and funding disadvantage, as they will not be able to generate new publications at the same pace as their colleagues. This tension may add to the temptation to use other questionable research practices like p-hacking. Even the countermeasure of fully disclosing the use of an AI tool does not change the fact that these tools are often not fully ethical at their very core. They are usually created using undisclosed training data, even if the training code is open-sourced, so they may produce overly convincing text distorted by all sorts of biases and amplify some scientific or political views over others in an uncontrolled fashion. Their development and operation are CO2-intensive. And the data curation process necessary for these tools to function is commonly tied to the exploitation of workers and other questionable practices.
Positive feedback loop backfiring onto peer reviewers: An incentive to produce even more publications, combined with tools that allow for the automatic generation of high-quality text, may quickly create a strong positive feedback loop that impacts other players. If more manuscripts are authored and submitted for publication, journal editors will have to pre-screen many more well-written texts than before. Unable to differentiate or desk-reject by the proxy of writing quality, editors may need content-based assessment more frequently to make an informed decision. As a result, peer-review pipelines may become clogged. If it takes one day of work to generate an article and one day to review it, researchers may suddenly end up spending most of their time just reviewing the semi-automatically generated publications of their peers.
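A back-of-envelope sketch of this scaling, with entirely made-up numbers (reviews per paper, effort per review, workdays per year), shows how quickly reviewing can eat into a researcher's year once submission volume grows; it is an illustration, not a forecast.

```python
# Back-of-envelope sketch with made-up numbers: how reviewing load scales
# with submission volume when writing gets cheap. Not a forecast.

REVIEWS_PER_PAPER = 2      # assumed reviewers per submission
DAYS_PER_REVIEW = 1        # assumed effort per review
WORKDAYS_PER_YEAR = 220    # assumed available workdays per researcher

def review_share(papers_per_researcher_per_year: int) -> float:
    """Fraction of a researcher's year spent reviewing, if the community
    reviews as much as it submits (load spread evenly across peers)."""
    review_days = papers_per_researcher_per_year * REVIEWS_PER_PAPER * DAYS_PER_REVIEW
    return review_days / WORKDAYS_PER_YEAR

for n in (5, 20, 50):
    print(f"{n} submissions/year -> {review_share(n):.0%} of the year spent reviewing")
# 5 -> ~5%, 20 -> ~18%, 50 -> ~45%
```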
Algorithmic “peer review”: In reaction to this, and in a move to scale up publishing efficiency, publishers (not wanting to lose revenue) may not only rely increasingly on less fair proxies such as the authors' credentials. They may also decide to use a tool that semi-automatically assesses whether a manuscript is publication-worthy, or use another AI model to generate a review, judge novelty, or pick manuscripts fulfilling an arbitrary black-box metric. Big publishers do not have a particularly strong reputation for transparent processes. Critiques of algorithmic hiring (and hiring is much more regulated than scientific publishing) show how quickly such automated solutions can go wrong. Eventually, this could break the initial assumption that a “legitimate” research output is a human-authored text judged by other humans. Instead, the core scientific output would consist of AI-authored papers, peer-reviewed and gatekept by other, often fully opaque, AI models.
Hyperinflation of scientific content and the breakdown of scientific publishing as a means of communication: Making publications easier to write may also backfire in another way: with over 6.5 million publications in 2022 alone, it is slowly becoming impossible for an individual to sift through all the recent and relevant literature. The market for AI-powered literature-discovery applications keeps growing, and tools increasingly use AI to assess and rank relevance. ChatGPT and other LLMs could help tame this inflation by providing smart text summarization, but that would change the nature of the process. Instead of using scholarly publishing to communicate with peers, scientists would use AI to generate more text more quickly when writing, only to use AI again to reduce the amount of text when reading, essentially interacting primarily not with human- but with machine-generated content.
Predatory practices: And then there is the problem of scientific misconduct and predatory publishing. When high-quality text generation comes for free, it will become increasingly difficult to distinguish real journals from hijacked, predatory and alt-science outlets, as the latter will be able to generate legitimate-looking content at scale. The same will be true for paper mills and faked publications. Science is self-correcting at its core, but the process is too inefficient. Credibly debunking forgery requires so many resources, even in small cases, that relying on the self-correcting nature of the system alone will never counteract the potential threat of automated fake science, especially in an environment that so strongly rewards novelty over reproducibility.
The collapse of predictable bibliometrics: In the end, even traditional bibliometric indicators may collapse under a sudden inflation of publication volume. We have already experienced this with the Covid-related surge in publications and citations, which significantly distorted even normalized bibliometric indicators. Many processes still rely heavily on quantitative indicators, often used inappropriately, yet those indicators remain in play. The processes whose outcomes depend on them may become deeply disturbed for the next couple of years, adding to the lack of transparency and to the chaos, and the traditional interpretation of the indicators may lose its meaning without their users noticing in time.
How to move forward?
As I said earlier, the core assumption of the current publishing system is that the text describing a scientific result is a good proxy for the actual research. Tools like ChatGPT make this core assumption obsolete and may drive the system to the point of absurdity once they become mainstream.
However, this may turn out to be a good thing after all. We are not helpless. Such a disruption is also an opportunity to boldly experiment with novel and potentially better ways of scholarly communication and to rethink the values we want to embed in research assessment.
So I would like to throw out some questions that could serve as an opening for such a discussion:
How might we move away from a reliance on text-based reports towards rich, multi-faceted research artifacts published on diverse types of platforms?
How to bring the human aspect back into scholarly communication?
How could we harness technological tools at our disposal to make structured knowledge more accessible?
How might we make research assessment better aligned with the actual research process, yet without introducing 360-degree surveillance of all kinds of data and metrics?
How to ensure oversight and transparency of AI tools which are part of the scientific discovery process?
How to design a fair system that is efficient at fact-checking, has a functioning quality check, and promotes replicability, while avoiding a scenario where a black-box AI decides which manuscript is publication-worthy?
How to manage a default level of trust vs. vigilance toward new manuscripts?
How to balance the value of an open discussion about scientific results on one hand and the value of strict curation on the other?
It will be a difficult and complex discussion. But which path we take is up to us.