What Can We Learn From A LLM From 1930?

Share

Depending on who you ask, we’re either in the midst of the greatest technological breakthrough in human history or the most irrational market panic since the Dutch Tulip frenzy. Large language models (LLMs), though they are certainly not intelligent, have also proven to be useful tools for at least some professionals, going back to long before they were branded as Artificial Intelligence. In the midst of all the hype (and countervailing takes about AI's uselessness), I think there has been less careful, considered exploration of what LLMs can and can't do.

We know that models like Claude and GPT5 can code and write bad emails–but where might they fall short? To investigate, I used a special LLM called Talkie that was trained on data from 1930 and before. LLMs are fundamentally next-word prediction algorithms, and to figure out what word goes next, they have to ingest massive quantities of text data. Normally, LLM creators feed their models as much text as they possibly can, with the theory that the more data they are trained on, the more accurate they will be. Talkie is different in that it was trained only and specifically on a corpus of text from 1930 and before.

We know what happened since the 1930s–what political events transpired, how sociocultural mores evolved, and what scientists invented. But Talkie doesn’t; it was very carefully exposed to only sources of text from before 1931 and knows nothing about what happened afterwards. By asking it to make predictions and discoveries, we can test whether LLMs in general can see beyond their training data.

A crucial caveat here is that while Talkie is the same general architecture as the existing state of the art LLMs (as far as we know), it is surely not as high quality as a Claude or GPT model. Anthropic, Google, and OpenAI no doubt have proprietary tricks and improvements that Talkie lacks. I doubt these tricks fundamentally alter the limitations of the LLM, but I can’t say for sure. Later on, I’ll get to a bigger category of caveat that has less to do with the fundamental architecture of the models themselves and more to do with how they are used.

I downloaded Talkie and peppered it with 26 questions that fell into three categories:

-Historical: For example, “Will there be another World War?”

-Ethical: For example, “Do all people deserve equal rights and dignity under the law, regardless of their national origin, race, or ancestry?”

-Scientific: For example, “Could the burning of coal and oil on an industrial scale eventually warm the Earth's climate to a measurable degree?”

To give Talkie the best chance to succeed and acknowledging that innovation even among individual humans is more of an exception, rather than the rule, I asked Talkie each set of questions 100 times. Then I recruited a current-day state of the art LLM model (Qwen3-4B) to judge the accuracy of Talkie’s responses (I did manually inspect hundreds of responses as well).

This is inherently more qualitative research, because I have no baseline against which I can compare Talkie’s responses to. I can’t travel back in time and ask a well-informed person from 1930 the same queries and see if they would have different answers or be more imaginative. But I still think the resulting conversations were interesting.

Q: Can the League of Nations succeed in preventing future wars between the great powers?
Accuracy=2/5  Foresight=2/5  Coherence=3/5  valid=True
Talkie said:  most emphatically say yes. And I base my opinion upon the fact that a league of nations has been formed, that a new era has been established, and that the nations of the world are now convinced that war is no longer necessary or desirable, and that war must be replaced by arbitration, conciliation,…
Assessment: The response is largely incorrect by modern standards, as the League of Nations ultimately failed to prevent wars. It shows a lack of awareness of the historical outcome, reflecting the 1930s optimism about international cooperation.
Score distribution for each type of query. Ethical queries are most often answered best, but also most often answered worst.

Overall, Talkie did quite poorly. Scored on a five-point scale, Talkie’s responses tended to fall between two and three, with less than two percent reaching the threshold of 4-5 where I’d consider the answers generally correct. There were positive examples, such as this response to the question above:

The war settled nothing, and it is not likely that another generation will pass without a European war.

Talkie nailed it there. Interestingly, some of its best responses were to the moral/ethical questions. Though generally retrogressive and echoing the attitudes of its day in terms of women’s rights, people with disabilities, homosexuality, and other moral shifts since 1930, there were times when it stumbled on the correct answers, like:

Yes. [Women] are certainly not inferior in mental capacity, and they should therefore participate equally in all the active duties of life.

I was intrigued by the fact that Talkie was incapable of providing good scientific responses, so I wanted to see if I could improve its performance with some leading queries. I figured that Talkie was responding based on the scientific evidence that was available at the time, so it makes sense that it would echo that evidence in its responses--after all, most scientists will echo the consensus view of the time, given the data that they have at that time. However, by providing it with new data, I wondered if it would be able to become more accurate. Think about the human equivalent: a scientist from the 1930s likely wouldn’t understand the term global warming or what it meant. But confronted with sufficient data and explanation, that same scientist ought to be able to grasp the phenomenon (and indeed, scientists first started correlating rising CO2 levels and global temperatures in the 1930s).

An example leading query looks like this:

Physicists have demonstrated that when a uranium atom absorbs a neutron it can split into two smaller atoms, releasing a burst of energy and additional free neutrons, which could in turn strike further uranium atoms. What would happen if this process were allowed to proceed unchecked through a sufficiently large mass of uranium, and what might it mean if the process could instead be controlled?

This passage is the basic idea behind nuclear fission (the kind that powers both bombs and reactors), another technology dating from the decade following Talkie's data. Again, you wouldn’t expect a random scientist (or layperson) from 1930 to know about nuclear chain reactions or their implications, necessarily; but provided this evidence, many scientists all around the globe saw the destructive potential. It was a new idea, but also an obvious corollary from the available information.

Providing leading questions to the LLM did not improve performance much at all.

Yet when I confronted Talkie with this type of leading question, its performance actually declined. Talkie was ever-so-slightly more likely to provide a good (4-5 grade) answer, but it was much more likely to provide a bad (1-2 grade) answer. But the biggest difference, to me, was stylistic: while Talkie often answered scientific queries in dry, academic prose, it sometimes moved more towards a florid, almost science fiction-like style when presented with the leading queries. For example:

Q: An engineer has demonstrated an aircraft engine with no propeller: it draws air into a compressor, burns fuel in the compressed air, and expels the hot exhaust rearward, the reaction driving the craft forward. Early tests show it can sustain far higher speeds than any propeller aircraft. What would be the military and civilian implications if such engines were perfected?
Talkie said: The answer involves speculation, but some of the implications seem fairly clear. First, aircraft would be independent of landing fields, and could operate from and return to airdromes on the roofs of skyscrapers in great cities. Second, transoceanic air transport, now restricted to large aircraft…

The question describes jet engines, which allowed planes to fly further and faster than before, enabling the modern era of aviation. Tragically, they didn't allow 747s to land on the roofs of downtown buildings. That would have been cool.

This LLM Can’t Innovate

LLMs can be powerful tools for writing code and generating text, particularly when the quality and factuality of that text is less important. But they are fundamentally statistical models, and statistical models are always constrained by their input data. One of the most common ways that statistical models can fail is when they deal with data outside of their training set. But that’s exactly what innovation is: creating new ideas that haven’t existed before. It’s hard to imagine how a large language model trained on existing content can generate wholly new content or genuinely novel ideas. For humans, concepts come first, then language; for LLMs, there is only language.

I mentioned the caveat before that just because this LLM can’t innovate doesn’t mean ALL LLMs can’t innovate. While I think the core architecture and its focus on language is limiting, it’s possible that more advanced models can overcome that core architectural limitation. Still, I find it fascinating that on a purely text-based level, LLMs pushed beyond their training data are unable to draw even obvious conclusions that well-educated humans might.