
Before and After Superintelligence Part II


Almost a year ago (Before and After Superintelligence Part I), I committed to some life changes before the dawn of Artificial Superintelligence (AI smarter than the smartest human for every evaluable dimension of intelligence).

This one's different. It's a bat-signal for long-term problems that I think are interesting and tractable.

1. Data Ownership

2. Automating Scientific Hypothesis Testing

3. Preventing Plutocracy

1. Data Ownership

Two things go into making AI smarter: compute and data. Most public and private market investment is focused on the compute side. But scaling laws suggest that a marginal improvement in compute should be matched by a proportional increase in data. Specifically, the Chinchilla scaling laws relate model loss L to parameters N and data D as L(N, D) = E + A/N^α + B/D^β, where A, B, α, and β are empirical constants capturing how hard it is to reduce loss via model size versus data size, and E is the irreducible loss. For compute-optimal training, the conclusion is that model size and data should be scaled roughly in equal proportion (the fitted α and β are close, so N ∝ D), meaning we are hitting both compute and data walls. This is despite the staggering amount of data already consumed, such as the roughly 300 trillion (3 × 10^14) high-quality tokens on the surface internet that have been collected for training.
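To make the trade-off concrete, here is a minimal sketch of that parametric loss in Python. The constants are close to those fitted in Hoffmann et al. (2022), but the exact values depend on the parameterisation, so treat them as illustrative rather than authoritative.

```python
# Minimal sketch of the Chinchilla parametric loss, L(N, D) = E + A/N^alpha + B/D^beta.
# Constants roughly follow Hoffmann et al. (2022); they are illustrative, not definitive.
E, A, B, alpha, beta = 1.69, 406.4, 410.7, 0.34, 0.28

def chinchilla_loss(n_params: float, n_tokens: float) -> float:
    """Predicted training loss for a model with n_params parameters trained on n_tokens tokens."""
    return E + A / n_params**alpha + B / n_tokens**beta

# A 70B-parameter model trained on 1.4T tokens (roughly the compute-optimal ratio).
print(chinchilla_loss(70e9, 1.4e12))

# Doubling data while holding parameters fixed buys less and less: the data term decays as D^-beta.
print(chinchilla_loss(70e9, 2.8e12))
```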

I think I hold a couple of unorthodox views about what 'valuable data' will mean going into the future:

  • i) All data is useful, including the long tail of 'garbage'.
    • The shift in synthetic data has so far been towards more specialised data, through post-training processes like GRPO or Monte Carlo Tree Search, the intuition being that data with a higher signal-to-noise ratio can be properly 'juiced'. However, in the same way mass pretraining was critical for LLMs, I think mass pretraining on the world will be important for world models and robotics.
    • Besides, the garbage is being recycled! LLMs themselves are really good at cleaning and processing data, both through more efficient data engineering pipelines and through manual spot-checking. This gets dirty data into cleaner, usable states, increasing the effective quantity of data.
  • ii) Data factories will form.
    • As much as I hope for a data marketplace with efficient price discovery, it seems unlikely one will form. Unlike commodities, which are rivalrous, data has a marginal cost of resale of essentially zero, so the supply is effectively unbounded.
    • However, just as oil refineries are gargantuan facilities of tremendous complexity that distill crude oil into products of multiple tiers/tranches, I believe a tremendous amount of compute and effort will go into cleaning 'useless data' and applying different transformations to get it into a more 'useful' state.
    • Also, metrics for an additional data point's contribution to model performance are a critical gap in the academic literature at the moment. Essentially, it would be pretty awesome to have 'data Shapley values', analogous to the Shapley values used with gradient-boosted models, for transformers and other billion-parameter-plus deep learning models (see the sketch after this list).
  • iii) Foundation model pre-training is theft.
    • I firmly believe AI models can do a lot of good. For example, discovering new drugs and progressing scientific research are, in the right hands, generally good. And these models have democratised knowledge in a way that is deeply powerful - maybe even enabling a von Neumann-like polymath to resurface in the 21st century. But we can't ignore that the way they're trained is very, very wrong.
    • First off, note that for the entire existence of the internet, the terms and conditions under which people handed over their data have rested on false consent. Nobody understands where their data is going, how it is used, or in most cases what their data even means, and yet, to access services, companies have been able to harvest it while hiding behind a wall of text. We have signed thousands of false contracts over our lifespans.
    • This hasn't been a serious problem so far. But right now, we have a model that (from one perspective) has simply compressed human knowledge and bundled it into a black box that makes it very hard to prosecute as theft. And when it takes most human jobs, there will be absolutely no redress for the collective intellectual and emotional labour of billions of people.
    • One cynical way of viewing mass pre-training is as a legal arbitrage: the speed at which litigation or grassroots activism mobilizes is lower than the rate of progress in AI.
    • That is why I view Mercor's model of paying specialized workers as better than nothing, since there is clearer attribution for their data contribution. But a math PhD being compensated for specialised knowledge is a far cry from fair compensation for all. Some ideas include clear regulations requiring model providers to train only on consented data, or at the least publicly transparent databases of training data composition. An aside: it's very cool that X open-sourced their recommendation algorithm today: https://github.com/twitter/the-algorithm.
    • There is the age-old question of tech governance and geopolitics - would we rather China do it? It is a genuinely fair argument. But I can't help but think: what a perfect bogeyman for Sam Altman to justify doing anything to hoard power.
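On the data-valuation gap in point ii) above: here is a toy Monte Carlo sketch of 'data Shapley' values in the spirit of Ghorbani & Zou (2019), estimating each training point's average marginal contribution to validation accuracy for a small scikit-learn model. It is a sketch under simplifying assumptions, not a production method; scaling anything like this to billion-parameter models is exactly the open problem.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Monte Carlo estimate of 'data Shapley' values: the average marginal contribution of each
# training point to validation accuracy, averaged over random orderings of the training set.
rng = np.random.default_rng(0)
X, y = make_classification(n_samples=100, n_features=5, random_state=0)
X_train, y_train, X_val, y_val = X[:60], y[:60], X[60:], y[60:]

def utility(idx: np.ndarray) -> float:
    """Validation accuracy of a model trained on subset idx (0.5 baseline if degenerate)."""
    if len(idx) < 2 or len(np.unique(y_train[idx])) < 2:
        return 0.5
    model = LogisticRegression(max_iter=1000).fit(X_train[idx], y_train[idx])
    return model.score(X_val, y_val)

n = len(X_train)
shapley = np.zeros(n)
n_permutations = 25  # more permutations -> lower variance, higher cost
for _ in range(n_permutations):
    order = rng.permutation(n)
    prev = utility(np.array([], dtype=int))
    for k in range(n):
        cur = utility(order[: k + 1])
        shapley[order[k]] += (cur - prev) / n_permutations
        prev = cur

print("most valuable training points:", np.argsort(shapley)[-5:])
```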

2. Automating Scientific Hypothesis Testing

According to Karl Popper, science is about i) constructing hypotheses and ii) failing to falsify the hypothesis (the anti-inductivist view). But hypothesis generation was something that Popper viewed as a mysterious part of human creativity.

I'm not so sure it's so mysterious. My belief is that LLMs can enable an end-to-end pipeline of automated science, where hypothesis construction comes from their internal world model and the context of experimental results. I'm quite excited about what companies like Project Prometheus and Periodic Labs will do.

Automated scientists will give us the ability to:

  1. Produce high-quality hypotheses at mass scale
  2. Run pipelines in which LLMs attempt to falsify them or run statistical tests (a rough sketch of such a loop follows below)
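To make the second point concrete, here is a minimal sketch of one iteration of such a hypothesise-then-falsify loop. propose_hypotheses and run_experiment are stand-ins I have invented for an LLM call and a simulation or lab backend; they are assumptions about the shape of the pipeline, not real APIs.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def propose_hypotheses(context: str, k: int) -> list[str]:
    # Placeholder: an LLM would generate k candidate claims from its world model plus prior results.
    return [f"compound {i} raises the response vs. control" for i in range(k)]

def run_experiment(hypothesis: str, n: int = 50) -> tuple[np.ndarray, np.ndarray]:
    # Placeholder backend: most candidate effects are null; a few are real.
    effect = 0.8 if rng.random() < 0.2 else 0.0
    return rng.normal(effect, 1.0, n), rng.normal(0.0, 1.0, n)

def falsification_pass(context: str, k: int = 10) -> list[tuple[str, float]]:
    results = []
    for h in propose_hypotheses(context, k):
        treated, control = run_experiment(h)
        # Popper-style step: attempt to reject the null; survivors are only 'not yet falsified'.
        _, p_value = stats.ttest_ind(treated, control)
        results.append((h, p_value))
    return results

for claim, p in falsification_pass("prior experimental results...", k=10):
    print(f"p={p:.3f}  {claim}")
```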

I wonder about three challenges:

  1. How do we experiment in this automated way without having the False Discovery Rate (FDR) explode?
    • Standard p-value < 0.05 thresholds become meaningless at this scale; with stringent corrections, each experiment needs higher power, i.e. more data, i.e. more resources. This could lead to industrialized p-hacking at levels we've never seen.
    • Multiple hypothesis testing is going to become even more critical. In fields like genomics and finance it has become a focus of academic statistical research, with tools like e-values becoming more powerful - but it's hard to make them a meaningful part of a research pipeline (a small FDR-control sketch follows after this list).
  2. Can AI scientists perform paradigm shifts (Mode 1), rather than Kuhnian normal science which is puzzle solving within a paradigm (Mode 2)?
    • So far, this is a deep open question about Transformers. Why haven't they produced new scientific discoveries or insights from the huge amount of data they have stored within them?
    • A worldview update for me in early 2026 was Terence Tao posting about the first non-trivial discovery in mathematics made by AI. Pure mathematics, as opposed to applied experimental work, leans more Mode 1; however, according to Tao, the discovery itself was more Mode 2 in flavour.
  3. How do we get AI scientists to know what experiments to run or what ideas to use?
    • The best scientists have multi-month planning horizons that evolve through Bayesian updates. Models struggle to incorporate this knowledge in a deep way yet.
    • There are also open statistical questions! Very exciting areas include post-selection inference, high-dimensional simultaneous testing with power, and Bayesian experimental design.
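On the false-discovery point above, here is a small sketch of one standard baseline, the Benjamini-Hochberg step-up procedure, which controls the expected false discovery rate across many simultaneous tests (under independence or positive dependence). It is shown only as a baseline an automated-science pipeline could build on; e-value methods are the more flexible alternative the text alludes to.

```python
import numpy as np

def benjamini_hochberg(p_values: np.ndarray, q: float = 0.05) -> np.ndarray:
    """Return a boolean mask of rejected (discovered) hypotheses at FDR level q."""
    m = len(p_values)
    order = np.argsort(p_values)
    thresholds = q * (np.arange(1, m + 1) / m)   # BH step-up thresholds q*k/m
    below = p_values[order] <= thresholds
    k = np.max(np.nonzero(below)[0]) + 1 if below.any() else 0
    rejected = np.zeros(m, dtype=bool)
    rejected[order[:k]] = True                   # reject the k smallest p-values
    return rejected

# 10,000 automated experiments: 9,800 true nulls and 200 real effects.
rng = np.random.default_rng(1)
p_null = rng.uniform(size=9800)
p_alt = rng.beta(0.1, 5.0, size=200)             # real effects tend to give small p-values
p = np.concatenate([p_null, p_alt])

print("naive p < 0.05 'discoveries':", int((p < 0.05).sum()))
print("BH discoveries at FDR 5%:    ", int(benjamini_hochberg(p, q=0.05).sum()))
```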

I think the world would look very different with automated science in two ways:

  1. We get 'science on demand'
    • This would be a continuation, but an acceleration nonetheless, of the tightening relationship between 'funding spent on science' and 'scientific progress'. After all, Gregor Mendel was a monk who pioneered genetics and key ideas in statistics, and academics have famously required open, free, and collaborative environments to thrive. This would signal a departure from that 'craftsman' approach to science towards a scalable one of spending more compute and churning through more grid searches of ideas.
    • If true, science would become a more explicit geopolitical lever for 'empire competition'. This is the Cold War and the 2022 CHIPS Act on steroids. I believe some countries will attempt to achieve 'civilisational victory' by pouring resources into AGI in pursuit of scientific victory.
    • Science could also become closed. If the open spirit is needed less, and the sums of money become so astronomical that you are simply paying cash for knowledge and edge over others, the incentive shifts towards hoarding knowledge. We already see this in frontier AI companies refusing to share model internals (with exceptions such as xAI and DeepSeek).
  2. We might be able to help end the replicability crisis
    • A world of AI scientists could definitely exacerbate the replicability crisis.
    • An optimistic view is that without the fear of null results (publish or perish), AI authors might lead to a world where failed experiments are clearly documented, eliminating a huge source of selection bias in science.

3. Preventing Plutocracy

Through a friend, I recently learned of a mind-blowing statistic regarding Apple in China.

... Demand from China’s 1.4 billion people indirectly supports, across all industries, between 1 million and 2.6 million jobs in America; whereas, by Tim Cook’s estimate, Apple alone supports 5 million jobs in China—3 million in manufacturing and another 1.8 million in app development. ... one super-corporation has more of an impact on job creation in China than all of China has on America.

How do we ensure that all AI wealth does not get concentrated in the prosperous American city of San Francisco? Until now, as we see above, wealth has had real trickle-down effects on a global scale. But there is no reason for that to remain the case in a world with AI.

Increasing wealth inequality is the strongest single prediction I have regarding AI progress - both domestically and (forgotten, but more crucial) internationally. That's because the invisible hand rests on a false assumption: that there is always some comparative advantage people can gravitate towards. AI will eventually not just automate tasks - it will eliminate the concept of human comparative advantage.

I have a couple of ideas on what we can do to try and fight inequality in this new era.

  1. Improve Scientific Literacy
    • AI puts information at our fingertips, much as Gutenberg's printing press did. It has democratised academic knowledge that was hidden away in papers, turning it into an easy conversation.
    • At its best, we could have an educated citizenry able to debate and discuss with numbers and achieve better consensus.
    • However, I fear education (which needs to be reimagined) will be valued less by students because there will be no relationship between it and employment.
    • This is also crucial so that politicians cannot distract people, and so that people come together to vote for things like Universal Basic Income if employment plummets.
  2. Rewrite loss function of social welfare
    • So much of social welfare's KPIs revolve around employment numbers and 'getting people back on their feet'.
    • This model is going to be desperately obsolete in a world where over half of humans are systematically unemployed. Most voting citizens will not pay tax, and many adults globally are going to lose a sense of pride, self, and purpose as they realise they have no way of contributing towards the future. Interestingly, studies on happiness show that people can rebound (return to equilibrium) from major life events like divorce or injury, but not from long periods without employment.
    • So what should the goal of social welfare be? I'm not sure. I don't think happiness works as a loss function, since it can be 'wireheaded' as people live as 'happy' consumers of AI TikTok. But people deserve lives of agency, optionality, and purpose.
  3. Prevent political capture
    • Rockefeller had an enormous amount of influence in politics. We are entering the age of trillionaires, and without further intervention, we get techno-feudalism where AI corporations lobby and consolidate power.
    • This also involves making people robust to AI-generated propaganda, and not letting the initial shock of the 2016 election return.