Supposing we just accept that IQ tests are no better at measuring long-term change in abilities than any other type of examination?
Then it would not be surprising that the 'Flynn effect' - of rising raw IQ test scores over the twentieth century - seems to have no real-world validity; and is contradicted by slowing simple reaction times over the same timescale.
But why should we suppose, why should we assume (without proof) in the first place that the raw scores of IQ tests are any better at tracking longitudinal changes of general intelligence than are the raw scores of examinations of (for instance) Latin vocabulary, arithmetic, or historical knowledge?
Everybody knows that academic exams in Latin, Maths, History or any other substantive field will depend on a multitude of factors - what is taught, how big is the curriculum, how it is taught, how the teaching relates to the exam, how much practice of exams and of what type, the conditions of the exam (including possibilities for cheating), how the exam is marked (including possibilities of cheating), and the proportion and nature of the population or sample to whom the exam is administered.
In a cross-sectional use - this type of exam is good at predicting relative future performance on the basis of rank order in the results (not on the basis of absolute percentage scores) when applied to same age groups having been taught a common curriculum etc. - and in this respect academic exams resemble IQ tests (IQ tests being, of course, marked and interpreted as age-specific, rank-order exams).
All of which means the raw score of academic exams - the percentage correct - means nothing (or not necessarily anything) when looked at longitudinally. Different percentage scores among different groups at different times are what we expect from academic exams.
Cross-sectionally, performances in different academic exams correlate with each other; and with 'g' as calculated from IQ tests, or with sub-tests of IQ tests.
But just because differential performance in an IQ test (a specific test, in a specific group, at a specific time) is a valid predictor, it does not follow that IQ testing over time is a valid measure of change in general intelligence.
The two things are utterly different.
Cross-sectional use of IQ testing measures relative differences now to predict relative differences in future; but longitudinal use of IQ data uses relative differences at various time-points to try to measure objective change over time. The two uses are incommensurable.
So, what advantage do IQ tests have over academic exams? Well, mainly the advantage is that good IQ tests are less dependent on prior educational experience. Also (which is not exactly the same thing) their components are 'g-loaded'.
Historically, IQ tests were mainly used to pick out intelligent children from poor and deprived backgrounds - whose social and educational experience had led to them under-performing on, say, Latin, arithmetic and History exams - because they had never been taught these subjects - or because their teaching was insufficient or inadequate in some way.
It was found that a high rank-order score in IQ testing was usefully predictive of high rank-order performance in future educational exams (assuming that the requisite educational inputs were sufficient: high IQ does not lead to high scores in Latin vocabulary unless the child has actually studied Latin).
But IQ tests were done cross-sectionally - to put test-takers in rank order; they were not developed to measure longitudinal change within or between age cohorts. Indeed, since IQ tests are rank-order tests, they have no reference point to anchor them against: 100 is the average IQ (for England, as the reference population) but that number of 100 is not anchored or referenced to anything else - it is merely an average. '100' does not mean anything at all as a measure of intelligence; just as an average score of 50% in a Latin vocabulary exam is not an absolute measure of Latin ability - the test score number 50 does not mean anything at all in terms of an absolute measure of Latin ability.
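To make the point about the unanchored '100' concrete, here is a minimal sketch in Python. It is an illustration only, not how any actual IQ test is normed: the raw scores are invented, the function name norm_to_iq is hypothetical, and the standard deviation of 15 is an assumed convention (only the average of 100 is taken from the argument above). It rescales each cohort's raw scores against that cohort's own average, which is the sense in which the resulting numbers are rank-order numbers.

```python
from statistics import mean, pstdev

def norm_to_iq(raw_scores, center=100.0, spread=15.0):
    """Rescale one cohort's raw scores so the cohort average is `center`.

    `spread` (the standard deviation of the rescaled scores) is an
    assumed convention, not something taken from the essay.
    """
    m = mean(raw_scores)
    s = pstdev(raw_scores)
    return [center + spread * (x - m) / s for x in raw_scores]

# Invented raw scores (number of items correct) for two cohorts.
# The later cohort gets 10 more items right across the board.
cohort_then = [20, 25, 30, 35, 40]
cohort_now = [x + 10 for x in cohort_then]

iq_then = norm_to_iq(cohort_then)
iq_now = norm_to_iq(cohort_now)

# Raw averages differ between cohorts, but each cohort's normed
# average is 100 by construction, and the rank order is unchanged.
print("raw averages:", mean(cohort_then), mean(cohort_now))
print("normed scores then:", [round(v, 1) for v in iq_then])
print("normed scores now: ", [round(v, 1) for v in iq_now])
```

Within each cohort the normed scores are identical, because norming only records where each person stands relative to their own group. Any claim about change between the cohorts therefore has to rest on the raw scores, which are exactly the numbers subject to all the influences listed above for academic exams.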
What applies to the academic exam or IQ test as a whole, also applies to each of the individual items of the test. The ability to answer any specific individual test item correctly, or wrongly, depends on those things I mentioned before: "what is taught, how big is the curriculum, how it is taught, how the teaching relates to the exam, how much practice of exams and of what type, the conditions of the exam" etc. etc...
My point is that we have been too ready to assume that IQ testing (in particular raw average scores and specific item scores) is immune to the limitations, variations and problems of all other types of academic exams - problems which render them more-or-less meaningless when raw average scores or specific item scores are used, decontextualized, in the attempt to track long-term changes in cognitive ability.
It is entirely conjectural to suppose, to assume, that IQ tests can function in a way that other cognitive ability tests (such as academic exams) cannot. And once this is understood, it can be seen that the Flynn effect, far from being a mystery, leaves nothing to explain.
If longitudinal raw average or test item IQ scores have zero expected predictive validity as a measure of intelligence change, then there is no mystery to solve regarding why they might change, at such and such a rate, or stop changing, or anything else!
The Flynn effect might show IQ raw scores or specific item responses going up, down, or round in circles - and it would not necessarily mean anything at all!