 +====== Evidence on good forecasting practices from the Good Judgment Project ======
 +
 +// Published 07 February, 2019; last updated 17 July, 2020 //
 +
 +<HTML>
 +<p><span style="font-weight: 400;">According to experience and data from the Good Judgment Project, the following are associated with successful forecasting, in rough decreasing order of combined importance and confidence:</span></p>
 +</HTML>
 +
 +
 +<HTML>
 +<ul>
 +<li><div class="li"><span style="font-weight: 400;">Past performance in the same broad domain</span></div></li>
 +<li><div class="li"><span style="font-weight: 400;">Making more predictions on the same question</span></div></li>
 +<li><div class="li"><span style="font-weight: 400;">Deliberation time</span></div></li>
 +<li><div class="li"><span style="font-weight: 400;">Collaboration on teams</span></div></li>
 +<li><div class="li"><span style="font-weight: 400;">Intelligence</span></div></li>
 +<li><div class="li"><span style="font-weight: 400;">Domain expertise</span></div></li>
 +<li><div class="li"><span style="font-weight: 400;">Having taken a one-hour training module on these topics</span></div></li>
 +<li><div class="li"><span style="font-weight: 400;">‘Cognitive reflection’ test scores</span></div></li>
 +<li><div class="li"><span style="font-weight: 400;">‘Active open-mindedness’</span></div></li>
 +<li><div class="li"><span style="font-weight: 400;">Aggregation of individual judgments</span></div></li>
 +<li><div class="li"><span style="font-weight: 400;">Use of precise probabilistic predictions</span></div></li>
 +<li><div class="li"><span style="font-weight: 400;">Use of ‘the outside view’</span></div></li>
 +<li><div class="li"><span style="font-weight: 400;">‘Fermi-izing’</span></div></li>
 +<li><div class="li"><span style="font-weight: 400;">‘Bayesian reasoning’</span></div></li>
 +<li><div class="li"><span style="font-weight: 400;">Practice</span></div></li>
 +</ul>
 +</HTML>
 +
 +
 +
 +===== Details =====
 +
 +
 +==== 1.1. Process ====
 +
 +
 +<HTML>
 +<p><span style="font-weight: 400;">The Good Judgment Project (GJP) was the winning team in IARPA’s 2011-2015 forecasting tournament. In the tournament, six teams assigned probabilistic answers to hundreds of questions about geopolitical events months to a year in the future. Each competing team used a different method for coming up with their guesses, so the tournament helps us to evaluate different forecasting methods.</span></p>
 +</HTML>
 +
 +
 +<HTML>
 +<p><span style="font-weight: 400;">The GJP team, led by Philip Tetlock and Barbara Mellers, gathered thousands of online volunteers and had them answer the tournament questions. They then made their official forecasts by aggregating these answers. In the process, the team collected data about the patterns of performance in their volunteers, and experimented with aggregation methods and improvement interventions. For example, they ran an RCT to test the effect of a short training program on forecasting accuracy. They especially focused on identifying and making use of the most successful two percent of forecasters, dubbed ‘superforecasters’.</span></p>
 +</HTML>
 +
 +
 +<HTML>
 +<p><span style="font-weight: 400;">Tetlock’s book</span> <i><span style="font-weight: 400;">Superforecasting</span></i> <span style="font-weight: 400;">describes this process and Tetlock’s resulting understanding of how to forecast well.</span></p>
 +</HTML>
 +
 +
 +
 +
 +==== 1.2. Correlates of successful forecasting ====
 +
 +
 +=== 1.2.1. Past performance ===
 +
 +
 +<HTML>
 +<p><span style="font-weight: 400;">Roughly 70% of the superforecasters maintained their status from one year to the next <span class="easy-footnote-margin-adjust" id="easy-footnote-1-1283"></span><span class="easy-footnote"><a href="#easy-footnote-bottom-1-1283" title=" &lt;i&gt;Superforecasting &lt;/i&gt;p104 "><sup>1</sup></a></span>.</span> <span style="font-weight: 400;">Across all the forecasters, the correlation between performance in one year and performance in the next year was 0.65 <span class="easy-footnote-margin-adjust" id="easy-footnote-2-1283"></span><span class="easy-footnote"><a href="#easy-footnote-bottom-2-1283" title=" &lt;i&gt;Superforecasting &lt;/i&gt;p104 "><sup>2</sup></a></span>.</span> <span style="font-weight: 400;">These high correlations are particularly impressive because the forecasters were online volunteers; presumably substantial variance year-to-year came from forecasters throttling down their engagement due to fatigue or changing life circumstances <span class="easy-footnote-margin-adjust" id="easy-footnote-3-1283"></span><span class="easy-footnote"><a href="#easy-footnote-bottom-3-1283" title=" Technically the forecasters were paid, up to $250 per season. (&lt;i&gt;Superforecasting &lt;/i&gt;p72) However their payments did not depend on how accurate they were or how much effort they put in, beyond the minimum.&amp;nbsp;"><sup>3</sup></a></span>.</span></p>
 +</HTML>
 +
 +
 +
 +
 +=== 1.2.2. Behavioral and dispositional variables ===
 +
 +
 +<HTML>
 +<p><span style="font-weight: 400;">Table 2  depicts the correlations between measured variables amongst GJP’s volunteers in the first two years of the tournament <span class="easy-footnote-margin-adjust" id="easy-footnote-4-1283"></span><span class="easy-footnote"><a href="#easy-footnote-bottom-4-1283" title=' The table is from &lt;a href="https://www.apa.org/pubs/journals/releases/xap-0000040.pdf"&gt;Mellers &lt;i&gt;et al&lt;/i&gt; 2015&lt;/a&gt;. “Del time” is deliberation time.&amp;nbsp;'><sup>4</sup></a></span>.</span> <span style="font-weight: 400;"> Each is described in more detail below.</span></p>
 +</HTML>
 +
 +
 +
 +
 +<HTML>
 +<p><span style="font-weight: 400;">The first column shows the relationship between each variable and standardized</span> <a href="https://en.wikipedia.org/wiki/Brier_score"><span style="font-weight: 400;">Brier score</span></a><span style="font-weight: 400;">, which is a measure of inaccuracy: higher Brier scores mean less accuracy, so negative correlations are good. “Ravens” is an IQ test; “Del time” is deliberation time, and “teams” is whether or not the forecaster was assigned to a team. “Actively open-minded thinking” is an attempt to measure “the tendency to evaluate arguments and evidence without undue bias from one’s own prior beliefs—and with recognition of the fallibility of one’s judgment.” <span class="easy-footnote-margin-adjust" id="easy-footnote-5-1283"></span><span class="easy-footnote"><a href="#easy-footnote-bottom-5-1283" title=' “Nonetheless, as we saw in the structural model, and confirm here, the best model uses dispositional, situational, and behavioral variables. The combination produced a multiple correlation of .64.” This is from &lt;a href="https://www.apa.org/pubs/journals/releases/xap-0000040.pdf"&gt;Mellers &lt;i&gt;et al&lt;/i&gt; 2015&lt;/a&gt;.&amp;nbsp;'><sup>5</sup></a></span></span></p>
 +</HTML>
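
To make the scoring rule concrete, here is a minimal sketch of a binary Brier score calculation. It is illustrative only: the tournament itself scored multi-outcome questions with the original Brier score (which ranges from 0 to 2), and the function and forecasts below are invented for the example.

<code python>
def brier_score(probability: float, outcome: int) -> float:
    """Squared error between a forecast probability and the 0/1 outcome.

    0.0 is a perfect forecast; a constant 50% forecast always scores 0.25;
    higher scores mean less accuracy.
    """
    return (probability - outcome) ** 2

# Example: two correct "yes" forecasts of varying confidence and one correct "no".
forecasts = [(0.9, 1), (0.6, 1), (0.3, 0)]
mean_brier = sum(brier_score(p, o) for p, o in forecasts) / len(forecasts)
print(round(mean_brier, 3))  # 0.087 -- better (lower) than always saying 50%
</code>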
 +
 +
 +<HTML>
 +<p><span style="font-weight: 400;">The authors conducted various statistical analyses to explore the relationships between these variables. They computed a structural equation model to predict a forecaster’s accuracy:</span></p>
 +</HTML>
 +
 +
 +
 +
 +<HTML>
 +<p><span style="font-weight: 400;">Yellow ovals are latent dispositional variables, yellow rectangles are observed dispositional variables, pink rectangles are experimentally manipulated situational variables, and green rectangles are observed behavioral variables. This model has a <a href="https://en.wikipedia.org/wiki/Multiple_correlation">multiple correlation</a> of 0.64.</span><span style="font-weight: 400;"><span class="easy-footnote-margin-adjust" id="easy-footnote-6-1283"></span><span class="easy-footnote"><a href="#easy-footnote-bottom-6-1283" title=' This is from &lt;a href="https://www.apa.org/pubs/journals/releases/xap-0000040.pdf"&gt;Mellers &lt;i&gt;et al&lt;/i&gt; 2015&lt;/a&gt;.&amp;nbsp;'><sup>6</sup></a></span></span></p>
 +</HTML>
 +
 +
 +<HTML>
 +<p><span style="font-weight: 400;">As these data indicate, domain knowledge, intelligence, active open-mindedness, and working in teams each contribute substantially to accuracy. We can also conclude that effort helps, because deliberation time and number of predictions made per question (“belief updating”) both improved accuracy. Finally, training also helps. This is especially surprising because the training module lasted only an hour and its effects persisted for at least a year. The module included content about probabilistic reasoning, using the outside view, avoiding biases, and more.</span></p>
 +</HTML>
 +
 +
 +
 +
 +==== 1.3. Aggregation algorithms ====
 +
 +
 +<HTML>
 +<p><span style="font-weight: 400;">GJP made their official predictions by aggregating and extremizing the predictions of their volunteers.</span> <span style="font-weight: 400;">The aggregation algorithm was elitist, meaning that it weighted more heavily people who were better on various metrics.</span> <span style="font-weight: 400;"><span class="easy-footnote-margin-adjust" id="easy-footnote-7-1283"></span><span class="easy-footnote"><a href="#easy-footnote-bottom-7-1283" title=' On the &lt;a href="https://goodjudgment.com/science.html"&gt;webpage&lt;/a&gt;, it says forecasters with better track-records and those who update more frequently get weighted more. In &lt;a href="https://www.edge.org/conversation/philip_tetlock-edge-master-class-2015-a-short-course-in-superforecasting-class-ii"&gt;these slides,&amp;nbsp;&lt;/a&gt;Tetlock describes the elitism differently: He says it gives weight to higher-IQ, more open-minded forecasters. '><sup>7</sup></a></span> The extremizing step pushes the aggregated judgment closer to 1 or 0, to make it more confident. The degree to which they extremize depends on how diverse and sophisticated the pool of forecasters is.</span> <span style="font-weight: 400;"><span class="easy-footnote-margin-adjust" id="easy-footnote-8-1283"></span><span class="easy-footnote"><a href="#easy-footnote-bottom-8-1283" title=' The academic papers on this topic are &lt;a href="https://www.sciencedirect.com/science/article/pii/S0169207013001635"&gt;Satopaa et al 2013&lt;/a&gt; and &lt;a href="http://pubsonline.informs.org/doi/abs/10.1287/deca.2014.0293"&gt;Baron et al 2014&lt;/a&gt;. '><sup>8</sup></a></span> Whether extremizing is a good idea is still controversial.  <span class="easy-footnote-margin-adjust" id="easy-footnote-9-1283"></span><span class="easy-footnote"><a href="#easy-footnote-bottom-9-1283" title=" According to one expert I interviewed, more recent data suggests that the successes of the extremizing algorithm during the forecasting tournament were a fluke. After all, &lt;i&gt;a priori &lt;/i&gt;one would expect extremizing to lead to small improvements in accuracy most of the time, but big losses in accuracy some of the time.&amp;nbsp;"><sup>9</sup></a></span></span></p>
 +</HTML>
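
As a rough sketch of what an aggregate-and-extremize step can look like (this is not GJP's actual algorithm; the weights and the extremizing exponent below are invented for illustration), one can average forecasts in log-odds space, weight better-performing forecasters more heavily, and then push the pooled probability away from 0.5:

<code python>
import math

def logit(p: float) -> float:
    return math.log(p / (1 - p))

def inv_logit(x: float) -> float:
    return 1 / (1 + math.exp(-x))

def aggregate(probs, weights, extremize_a=2.0):
    """Weighted average of forecasts in log-odds space, then extremized.

    extremize_a > 1 pushes the pooled probability toward 0 or 1; how much to
    extremize is supposed to depend on how diverse and sophisticated the
    forecaster pool is.
    """
    pooled = sum(w * logit(p) for p, w in zip(probs, weights)) / sum(weights)
    return inv_logit(extremize_a * pooled)

# Three forecasters; the first has the best track record, so it gets more weight.
print(round(aggregate([0.70, 0.65, 0.55], weights=[2.0, 1.0, 1.0]), 2))  # ~0.78
</code>

The pooled forecast (about 0.78) ends up more confident than any individual input, which is the intended effect of extremizing when forecasters hold partly independent information.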
 +
 +
 +<HTML>
 +<p><span style="font-weight: 400;">GJP beat all of the other teams.</span> <span style="font-weight: 400;">They consistently beat the control group—which was a forecast made by averaging ordinary forecasters—by more than 60%.</span> <span style="font-weight: 400;"><span class="easy-footnote-margin-adjust" id="easy-footnote-10-1283"></span><span class="easy-footnote"><a href="#easy-footnote-bottom-10-1283" title=" &lt;i&gt;Superforecasting &lt;/i&gt;p18. "><sup>10</sup></a></span> They  also beat a prediction market inside the intelligence community—populated by professional analysts with access to classified information—by 25-30%. <span class="easy-footnote-margin-adjust" id="easy-footnote-11-1283"></span><span class="easy-footnote"><a href="#easy-footnote-bottom-11-1283" title=' This is from &lt;a href="https://www.edge.org/conversation/philip_tetlock-edge-master-class-2015-a-short-course-in-superforecasting-class-ii"&gt;this seminar&lt;/a&gt;.&amp;nbsp;'><sup>11</sup></a></span></span></p>
 +</HTML>
 +
 +
 +<HTML>
 +<p><span style="font-weight: 400;">That said, individual superforecasters did almost as well, so the elitism of the algorithm may account for a lot of its success.<span class="easy-footnote-margin-adjust" id="easy-footnote-12-1283"></span><span class="easy-footnote"><a href="#easy-footnote-bottom-12-1283" title=' For example, in year 2 one superforecaster beat the extremizing algorithm. More generally, as discussed in &lt;a href="https://www.edge.org/conversation/philip_tetlock-edge-master-class-2015-a-short-course-in-superforecasting-class-ii"&gt;this seminar&lt;/a&gt;, the aggregation algorithm produces the greatest improvement with ordinary forecasters; the superforecasters were good enough that it didn’t help much.&amp;nbsp;'><sup>12</sup></a></span></span></p>
 +</HTML>
 +
 +
 +
 +
 +==== 1.4. Outside View ====
 +
 +
 +<HTML>
 +<p><span style="font-weight: 400;">The forecasters who received training were asked to record, for each prediction, which parts of the training they used to make it. Some parts of the training—e.g. “Post-mortem analysis”—were correlated with inaccuracy, but others—most notably “Comparison classes”—were correlated with accuracy.</span> <span style="font-weight: 400;"><span class="easy-footnote-margin-adjust" id="easy-footnote-13-1283"></span><span class="easy-footnote"><a href="#easy-footnote-bottom-13-1283" title=' This is from &lt;a href="http://journal.sjdm.org/16/16511/jdm16511.pdf"&gt;Chang et al 2016&lt;/a&gt;. The average brier score of answers tagged “comparison classes” was 0.17, while the next-best tag averaged 0.26.'><sup>13</sup></a></span>  ‘Comparison classes’ is another term for</span> <a href="https://en.wikipedia.org/wiki/Reference_class_forecasting"><span style="font-weight: 400;">reference-class forecasting</span></a><span style="font-weight: 400;">, also known as ‘the outside view’. It is the method of assigning a probability by straightforward extrapolation from similar past situations and their outcomes.</span></p>
 +</HTML>
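
For illustration, here is a minimal sketch of the outside-view step under invented data: collect past cases judged similar to the question at hand and use the observed frequency of the outcome as the starting probability.

<code python>
# Hypothetical reference class: outcomes of past situations judged similar
# to the question being forecast (True = the event of interest occurred).
reference_class = [True, False, False, True, False,
                   False, False, True, False, False]

# Outside-view base rate, with Laplace smoothing so a small sample never
# yields a probability of exactly 0 or 1.
successes = sum(reference_class)
base_rate = (successes + 1) / (len(reference_class) + 2)
print(round(base_rate, 2))  # 0.33 -- the anchor before any inside-view adjustment
</code>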
 +
 +
 +
 +
 +==== 1.5. Tetlock’s “Portrait of the modal superforecaster” ====
 +
 +
 +<HTML>
 +<p><span style="font-weight: 400;">This subsection and those that follow will lay out some more qualitative results, things that Tetlock recommends on the basis of his research and interviews with superforecasters. Here is Tetlock’s “portrait of the modal superforecaster:” <span class="easy-footnote-margin-adjust" id="easy-footnote-14-1283"></span><span class="easy-footnote"><a href="#easy-footnote-bottom-14-1283" title=" &lt;i&gt;Superforecasting &lt;/i&gt;p191 "><sup>14</sup></a></span></span></p>
 +</HTML>
 +
 +
 +<HTML>
 +<p><b>Philosophic outlook:</b></p>
 +</HTML>
 +
 +
 +<HTML>
 +<ul>
 +<li><div class="li"><b>Cautious:</b> <span style="font-weight: 400;">Nothing is certain.</span></div></li>
 +<li><div class="li"><b>Humble:</b> <span style="font-weight: 400;">Reality is infinitely complex.</span></div></li>
 +<li><div class="li"><b>Nondeterministic:</b> <span style="font-weight: 400;">Whatever happens is not meant to be and does not have to happen.</span></div></li>
 +</ul>
 +</HTML>
 +
 +
 +<HTML>
 +<p><b>Abilities &amp; thinking styles:</b></p>
 +</HTML>
 +
 +
 +<HTML>
 +<ul>
 +<li><div class="li"><b>Actively open-minded:</b> <span style="font-weight: 400;">Beliefs are hypotheses to be tested, not treasures to be protected.</span></div></li>
 +<li><div class="li"><b>Intelligent and knowledgeable, with a “Need for Cognition”:</b> <span style="font-weight: 400;">Intellectually curious, enjoy puzzles and mental challenges.</span></div></li>
 +<li><div class="li"><b>Reflective:</b> <span style="font-weight: 400;">Introspective and self-critical</span></div></li>
 +<li><div class="li"><b>Numerate:</b> <span style="font-weight: 400;">Comfortable with numbers</span></div></li>
 +</ul>
 +</HTML>
 +
 +
 +<HTML>
 +<p><b>Methods of forecasting:</b></p>
 +</HTML>
 +
 +
 +<HTML>
 +<ul>
 +<li><div class="li"><b>Pragmatic:</b> <span style="font-weight: 400;">Not wedded to any idea or agenda</span></div></li>
 +<li><div class="li"><b>Analytical:</b> <span style="font-weight: 400;">Capable of stepping back from the tip-of-your-nose perspective and considering other views</span></div></li>
 +<li><div class="li"><b>Dragonfly-eyed:</b> <span style="font-weight: 400;">Value diverse views and synthesize them into their own</span></div></li>
 +<li><div class="li"><b>Probabilistic:</b> <span style="font-weight: 400;">Judge using many grades of maybe</span></div></li>
 +<li><div class="li"><b>Thoughtful updaters:</b> <span style="font-weight: 400;">When facts change, they change their minds</span></div></li>
 +<li><div class="li"><b>Good intuitive psychologists:</b> <span style="font-weight: 400;">Aware of the value of checking thinking for cognitive and emotional biases <span class="easy-footnote-margin-adjust" id="easy-footnote-15-1283"></span><span class="easy-footnote"><a href="#easy-footnote-bottom-15-1283" title=' There is experimental evidence that superforecasters are less prone to standard cognitive science biases than ordinary people. From &lt;a href="https://www.edge.org/conversation/philip_tetlock-edge-master-class-2015-a-short-course-in-superforecasting-class-iv"&gt;edge.org&lt;/a&gt;: &lt;i&gt;Mellers: &lt;/i&gt;“We have given them lots of Kahneman and Tversky-like problems to see if they fall prey to the same sorts of biases and errors. The answer is sort of, some of them do, but not as many. It’s not nearly as frequent as you see with the rest of us ordinary mortals. The other thing that’s interesting is they don’t make the kinds of mistakes that regular people make instead of the right answer. They do something that’s a little bit more thoughtful. They integrate base rates with case-specific information a little bit more.” &amp;nbsp;'><sup>15</sup></a></span></span></div></li>
 +</ul>
 +</HTML>
 +
 +
 +<HTML>
 +<p><b>Work ethic:</b></p>
 +</HTML>
 +
 +
 +<HTML>
 +<ul>
 +<li><div class="li"><b>Growth mindset:</b> <span style="font-weight: 400;">Believe it’s possible to get better</span></div></li>
 +<li><div class="li"><b>Grit:</b> <span style="font-weight: 400;">Determined to keep at it however long it takes</span></div></li>
 +</ul>
 +</HTML>
 +
 +
 +
 +
 +==== 1.6. Tetlock’s “Ten Commandments for Aspiring Superforecasters” ====
 +
 +
 +<HTML>
 +<p><span style="font-weight: 400;">This advice is given at the end of the book, and may make less sense to someone who hasn’t read the book. A full transcript of these commandments can be found</span> <a href="https://www.lesswrong.com/posts/dvYeSKDRd68GcrWoe/ten-commandments-for-aspiring-superforecasters"><span style="font-weight: 400;">here</span></a><span style="font-weight: 400;">; this is a summary:</span></p>
 +</HTML>
 +
 +
 +<HTML>
 +<p><b>(1) Triage:</b> <span style="font-weight: 400;">Don’t waste time on questions that are “clocklike” where a rule of thumb can get you pretty close to the correct answer, or “cloudlike” where even fancy models can’t beat a dart-throwing chimp.</span></p>
 +</HTML>
 +
 +
 +<HTML>
 +<p><b>(2) Break seemingly intractable problems into tractable sub-problems:</b> <span style="font-weight: 400;">This is how Fermi estimation works. One related piece of advice is “be wary of accidentally substituting an easy question for a hard one,” e.g. substituting “Would Israel be willing to assassinate Yasser Arafat?” for “Will at least one of the tests for polonium in Arafat’s body turn up positive?”</span></p>
 +</HTML>
 +
 +
 +<HTML>
 +<p><b>(3) Strike the right balance between inside and outside views:</b> <span style="font-weight: 400;">In particular,</span> <i><span style="font-weight: 400;">first</span></i> <span style="font-weight: 400;">anchor with the outside view and</span> <i><span style="font-weight: 400;">then</span></i> <span style="font-weight: 400;">adjust using the inside view.</span></p>
 +</HTML>
 +
 +
 +<HTML>
 +<p><b>(4) Strike the right balance between under- and overreacting to evidence:</b> <span style="font-weight: 400;">Usually do many small updates, but occasionally do big updates when the situation calls for it. Remember to think about P(E|H)/P(E|~H); remember to avoid the base-rate fallacy. “Superforecasters aren’t perfect Bayesian predictors but they are much better than most of us.” <span class="easy-footnote-margin-adjust" id="easy-footnote-16-1283"></span><span class="easy-footnote"><a href="#easy-footnote-bottom-16-1283" title="&amp;nbsp;&lt;i&gt;Superforecasting &lt;/i&gt;p281 "><sup>16</sup></a></span></span></p>
 +</HTML>
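
As a worked sketch of the update this commandment describes (the numbers are invented), the odds form of Bayes’ rule multiplies the prior odds by the likelihood ratio P(E|H)/P(E|~H):

<code python>
def bayes_update(prior: float, p_e_given_h: float, p_e_given_not_h: float) -> float:
    """Posterior P(H|E) via odds form: posterior odds = prior odds * likelihood ratio."""
    prior_odds = prior / (1 - prior)
    posterior_odds = prior_odds * (p_e_given_h / p_e_given_not_h)
    return posterior_odds / (1 + posterior_odds)

# Example: a 30% prior and evidence twice as likely if H is true than if it is false.
print(round(bayes_update(0.30, p_e_given_h=0.6, p_e_given_not_h=0.3), 2))  # 0.46
</code>

Many such pieces of evidence, each producing a modest shift like this, add up to the pattern of frequent small updates described above.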
 +
 +
 +<HTML>
 +<p><b>(5) Look for the clashing causal forces at work in each problem:</b> <span style="font-weight: 400;">This is the “dragonfly eye perspective,” which is where you attempt to do a sort of mental wisdom of the crowds: Have tons of different causal models and aggregate their judgments. Use “Devil’s advocate” reasoning. If you think that P, try hard to convince yourself that not-P. You should find yourself saying “On the one hand… on the other hand… on the third hand…” a lot.</span></p>
 +</HTML>
 +
 +
 +<HTML>
 +<p><b>(6) Strive to distinguish as many degrees of doubt as the problem permits but no more.</b></p>
 +</HTML>
 +
 +
 +<HTML>
 +<p><b>(7) Strike the right balance between under- and overconfidence, between prudence and decisiveness.</b></p>
 +</HTML>
 +
 +
 +<HTML>
 +<p><b>(8) Look for the errors behind your mistakes but beware of rearview-mirror hindsight biases.</b></p>
 +</HTML>
 +
 +
 +<HTML>
 +<p><b>(9) Bring out the best in others and let others bring out the best in you.</b> <span style="font-weight: 400;">The book spent a whole chapter on this, using the Wehrmacht as an extended case study on good team organization. One pervasive guiding principle is “Don’t tell people how to do things; tell them what you want accomplished, and they’ll surprise you with their ingenuity in doing it.” The other pervasive guiding principle is “Cultivate a culture in which people—even subordinates—are encouraged to dissent and give counterarguments.” <span class="easy-footnote-margin-adjust" id="easy-footnote-17-1283"></span><span class="easy-footnote"><a href="#easy-footnote-bottom-17-1283" title="&amp;nbsp;See e.g. page 284 of &lt;i&gt;Superforecasting&lt;/i&gt;, and the entirety of chapter 9. "><sup>17</sup></a></span></span></p>
 +</HTML>
 +
 +
 +<HTML>
 +<p><b>(10) Master the error-balancing bicycle:</b> <span style="font-weight: 400;">This one should have been called practice, practice, practice. Tetlock says that reading the news and generating probabilities isn’t enough; you need to actually score your predictions so that you know how wrong you were.</span></p>
 +</HTML>
 +
 +
 +<HTML>
 +<p><b>(11) Don’t treat commandments as commandments:</b> <span style="font-weight: 400;">Tetlock’s point here is simply that you should use your judgment about whether to follow a commandment or not; sometimes they should be overridden.</span></p>
 +</HTML>
 +
 +
 +
 +
 +==== 1.7. Recipe for Making Predictions ====
 +
 +
 +<HTML>
 +<p><span style="font-weight: 400;">Tetlock describes how superforecasters go about making their predictions.</span><span style="font-weight: 400;"> <span class="easy-footnote-margin-adjust" id="easy-footnote-18-1283"></span><span class="easy-footnote"><a href="#easy-footnote-bottom-18-1283" title=" See Chapter 5: “Ultimately, it’s not the number crunching power that counts. It’s how you use it. … You’ve Fermi-ized the question, consulted the outside view, and now, finally, you can consult the inside view … So you have an outside view and an inside view. Now they have to be merged. …” "><sup>18</sup></a></span> Here is an attempt at a summary:</span></p>
 +</HTML>
 +
 +
 +<HTML>
 +<ol>
 +<li><div class="li"><span style="font-weight: 400;">Sometimes a question can be answered more rigorously if it is first “Fermi-ized,” i.e. broken down into sub-questions for which more rigorous methods can be applied.</span></div></li>
 +<li><div class="li"><span style="font-weight: 400;">Next, use the outside view on the sub-questions (and/or the main question, if possible). You may then adjust your estimates using other considerations (‘the inside view’), but do this cautiously.</span></div></li>
 +<li><div class="li"><span style="font-weight: 400;">Seek out other perspectives, both on the sub-questions and on how to Fermi-ize the main question. You can also generate other perspectives yourself.</span></div></li>
 +<li><div class="li"><span style="font-weight: 400;">Repeat steps 1 – 3 until you hit diminishing returns.</span></div></li>
 +<li><div class="li"><span style="font-weight: 400;">Your final prediction should be based on an aggregation of various models, reference classes, other experts, etc.</span></div></li>
 +</ol>
 +</HTML>
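
A toy sketch of steps 1, 2, and 5 with invented numbers: the question is Fermi-ized into two sub-questions, each sub-question gets an outside-view estimate, and the resulting model is then averaged with a second, independent way of looking at the question.

<code python>
# Model A: Fermi-ized estimate -- the event happens only if both sub-events do.
# (Sub-probabilities are invented, and independence is assumed for simplicity.)
p_sub_1 = 0.8                # outside-view estimate for sub-question 1
p_sub_2 = 0.5                # outside-view estimate for sub-question 2
model_a = p_sub_1 * p_sub_2  # 0.40

# Model B: a direct reference-class estimate for the question as a whole.
model_b = 0.30

# Step 5: the final prediction aggregates the different models/perspectives.
final_forecast = (model_a + model_b) / 2
print(final_forecast)  # 0.35
</code>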
 +
 +
 +
 +
 +==== 1.8. Bayesian reasoning & precise probabilistic forecasts ====
 +
 +
 +<HTML>
 +<p><span style="font-weight: 400;">Humans normally express uncertainty with terms like “maybe” and “almost certainly” and “a significant chance.” Tetlock advocates for thinking and speaking in probabilities instead. He recounts many anecdotes of misunderstandings that might have been avoided this way. For example:</span></p>
 +</HTML>
 +
 +
 +<HTML>
 +<p><span style="font-weight: 400;">In 1961, when the CIA was planning to topple the Castro government by landing a small army of Cuban expatriates at the Bay of Pigs, President John F. Kennedy turned to the military for an unbiased assessment. The Joint Chiefs of Staff concluded that the plan had a “fair chance” of success. The man who wrote the words “fair chance” later said he had in mind odds of 3 to 1 against success. But Kennedy was never told precisely what “fair chance” meant and, not unreasonably, he took it to be a much more positive assessment. <span class="easy-footnote-margin-adjust" id="easy-footnote-19-1283"></span><span class="easy-footnote"><a href="#easy-footnote-bottom-19-1283" title="&amp;nbsp;&lt;i&gt;Superforecasting&lt;/i&gt; 44 "><sup>19</sup></a></span></span></p>
 +</HTML>
 +
 +
 +<HTML>
 +<p><span style="font-weight: 400;">This example hints at another advantage of probabilistic judgments: It’s harder to weasel out of them afterwards, and therefore easier to keep score. Keeping score is crucial for getting feedback from reality, which is crucial for building up expertise.</span></p>
 +</HTML>
 +
 +
 +<HTML>
 +<p><span style="font-weight: 400;">A standard criticism of using probabilities is that they merely conceal uncertainty rather than quantify it—after all, the numbers you pick are themselves guesses. This may be true for people who haven’t practiced much, but it isn’t true for superforecasters, who are impressively well-calibrated and whose accuracy scores decrease when you round their predictions to the nearest 0.05. (EDIT: This should be 0.1)<span class="easy-footnote-margin-adjust" id="easy-footnote-20-1283"></span><span class="easy-footnote"><a href="#easy-footnote-bottom-20-1283" title='&amp;nbsp;The superforecasters had a calibration of 0.01, which means that the average difference between a probability they use and the true frequency of occurrence is 0.01. This is from &lt;a href="https://www.researchgate.net/publication/277087515_Identifying_and_Cultivating_Superforecasters_as_a_Method_of_Improving_Probabilistic_Predictions"&gt;Mellers et al 2015&lt;/a&gt;. The fact about rounding their predictions is from &lt;a href="https://academic.oup.com/isq/article-abstract/62/2/410/4944059?redirectedFrom=fulltext"&gt;Friedman et al 2018&lt;/a&gt;. EDIT: Seems I was wrong, thanks to this commenter for noticing.https://www.metaculus.com/questions/4166/the-lightning-round-tournament-comparing-metaculus-forecasters-to-infectious-disease-experts/#comment-28756'><sup>20</sup></a></span></span></p>
 +</HTML>
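
To make the calibration claim concrete, here is a minimal sketch with invented forecasts (and a simpler metric than the one Mellers et al 2015 report): group forecasts by stated probability and compare each group's stated probability with its observed frequency.

<code python>
# Invented forecasts as (stated probability, outcome). A well-calibrated
# forecaster's stated probabilities match the observed frequencies.
forecasts = [(0.9, 1)] * 9 + [(0.9, 0)] + [(0.2, 0)] * 4 + [(0.2, 1)]

def calibration_gap(forecasts):
    """Average absolute gap between stated probability and observed frequency,
    grouped by probability level (one simple way to measure calibration)."""
    by_level = {}
    for p, outcome in forecasts:
        by_level.setdefault(p, []).append(outcome)
    gaps = [abs(p - sum(outs) / len(outs)) for p, outs in by_level.items()]
    return sum(gaps) / len(gaps)

print(calibration_gap(forecasts))  # 0.0 -- stated 90% and 20% match 9/10 and 1/5 observed
</code>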
 +
 +
 +<HTML>
 +<p><span style="font-weight: 400;">Bayesian reasoning is a natural next step once you are thinking and talking probabilities—it is the theoretical ideal in several important ways <span class="easy-footnote-margin-adjust" id="easy-footnote-21-1283"></span><span class="easy-footnote"><a href="#easy-footnote-bottom-21-1283" title='&amp;nbsp;For an excellent introduction to Bayesian reasoning and its theoretical foundations, see Strevens’ textbook-like &lt;a href="http://www.strevens.org/bct/"&gt;lecture notes&lt;/a&gt;. Some of the facts summarized in this paragraph about Superforecasters and Bayesianism can be found on pages 169-172, 281, and 314 of &lt;i&gt;Superforecasting&lt;/i&gt;. '><sup>21</sup></a></span> </span><span style="font-weight: 400;">—and Tetlock’s experience and interviews with superforecasters seems to bear this out. Superforecasters seem to do many small updates, with occasional big updates, just as Bayesianism would predict. They recommend thinking in the Bayesian way, and often explicitly make Bayesian calculations. They are good at breaking down difficult questions into more manageable parts and chaining the probabilities together properly.</span></p>
 +</HTML>
 +
 +
 +
 +
 +===== 2. Discussion: Relevance to AI Forecasting =====
 +
 +
 +==== 2.1. Limitations ====
 +
 +
 +<HTML>
 +<p><span style="font-weight: 400;">A major limitation is that the forecasts were mainly on geopolitical events only a few years in the future at most. (Uncertain geopolitical events seem to be somewhat predictable up to two years out but much more difficult to predict five years out.) <span class="easy-footnote-margin-adjust" id="easy-footnote-22-1283"></span><span class="easy-footnote"><a href="#easy-footnote-bottom-22-1283" title="&amp;nbsp;Tetlock admits that &amp;#8220;there is no evidence that geopolitical or economic forecasters can predict anything ten years out beyond the excruciatingly obvious&amp;#8230; These limits on predictability are the predictable results of the butterfly dynamics of nonlinear systems. In my EPJ research, the accuracy of expert predictions declined toward chance five years out.&amp;#8221; (&lt;i&gt;Superforecasting&lt;/i&gt; p243) "><sup>22</sup></a></span></span><span style="font-weight: 400;"> So evidence from the GJP may not generalize to forecasting other types of events (e.g. technological progress and social  consequences) or events further in the future.</span></p>
 +</HTML>
 +
 +
 +<HTML>
 +<p><span style="font-weight: 400;">That said, the forecasting best practices discovered by this research are not overtly specific to geopolitics or near-term events.  Also, geopolitical questions are diverse and accuracy on some was highly correlated with accuracy on others. <span class="easy-footnote-margin-adjust" id="easy-footnote-23-1283"></span><span class="easy-footnote"><a href="#easy-footnote-bottom-23-1283" title='&amp;nbsp;&amp;#8220;There are several ways to look for individual consistency across questions. We sorted questions on the basis of response format (binary, multinomial, conditional, ordered), region (Eurzone, Latin America, China, etc.), and duration of question (short, medium, and long). We computed accuracy scores for each individual on each variable within each set (e.g., binary, multinomial, conditional, and ordered) and then constructed correlation matrices. For all three question types, correlations were positive&amp;#8230; Then we conducted factor analyses. For each question type, a large proportion of the variance was captured by a single factor, consistent with the hypothesis that one underlying dimension was necessary to capture correlations among response formats, regions, and question duration.&amp;#8221; From &lt;a href="https://www.apa.org/pubs/journals/releases/xap-0000040.pdf"&gt;Mellers et al 2015&lt;/a&gt;. '><sup>23</sup></a></span></span></p>
 +</HTML>
 +
 +
 +<HTML>
 +<p><span style="font-weight: 400;">Tetlock has ideas for how to handle longer-term, nebulous questions. He calls it “Bayesian Question Clustering.” (</span><i><span style="font-weight: 400;">Superforecasting</span></i> <span style="font-weight: 400;">263) The idea is to take the question you really want to answer and look for more precise questions that are evidentially relevant to the question you care about. Tetlock intends to test the effectiveness of this idea in future research.</span></p>
 +</HTML>
 +
 +
 +
 +
 +==== 2.2. Value ====
 +
 +
 +<HTML>
 +<p><span style="font-weight: 400;">The benefits of following these best practices (including identifying and aggregating the best forecasters) appear to be substantial: Superforecasters predicting events 300 days in the future were more accurate than regular forecasters predicting events 100 days in the future, and the GJP did even better. <span class="easy-footnote-margin-adjust" id="easy-footnote-24-1283"></span><span class="easy-footnote"><a href="#easy-footnote-bottom-24-1283" title='&amp;nbsp; &lt;i&gt;Superforecasting &lt;/i&gt;p94. Later, in the &lt;a href="https://www.edge.org/conversation/philip_tetlock-edge-master-class-2015-a-short-course-in-superforecasting-class-ii"&gt;edge.org seminar&lt;/a&gt;, Tetlock says “In some other ROC curves—receiver operator characteristic curves, from signal detection theory—that Mark Steyvers at UCSD constructed—superforecasters could assign probabilities 400 days out about as well as regular people could about eighty days out.” The quote is accompanied by a &lt;a href="https://www.edge.org/3rd_culture/Master%20Class%202015/Slide040.jpg"&gt;graph&lt;/a&gt;; unfortunately, it’s hard to interpret. '><sup>24</sup></a></span></span> <span style="font-weight: 400;">If these benefits generalize beyond the short-term and beyond geopolitics—e.g. to long-term technological and societal development—then this research is highly useful to almost everyone. Even if the benefits do not generalize beyond the near-term, these best practices may still be well worth adopting. For example, it would be extremely useful to have 300 days of warning before strategically important AI milestones are reached, rather than 100.</span></p>
 +</HTML>
 +
 +
 +
 +
 +===== 3. Contributions =====
 +
 +
 +<HTML>
 +<p><i><span style="font-weight: 400;">Research, analysis, and writing were done by Daniel Kokotajlo. Katja Grace and Justis Mills contributed feedback and editing. Tegan McCaslin, Carl Shulman, and Jacob Lagerros contributed feedback.</span></i></p>
 +</HTML>
 +
 +
 +
 +
 +===== 4. Footnotes =====
 +
 +
 +
 +
 +<HTML>
 +<ol class="easy-footnotes-wrapper">
 +<li><div class="li">
 +<span class="easy-footnote-margin-adjust" id="easy-footnote-bottom-1-1283"></span> <i>Superforecasting</i> p104 <a class="easy-footnote-to-top" href="#easy-footnote-1-1283"></a>
 +</div></li>
 +<li><div class="li">
 +<span class="easy-footnote-margin-adjust" id="easy-footnote-bottom-2-1283"></span> <i>Superforecasting</i> p104 <a class="easy-footnote-to-top" href="#easy-footnote-2-1283"></a>
 +</div></li>
 +<li><div class="li">
 +<span class="easy-footnote-margin-adjust" id="easy-footnote-bottom-3-1283"></span> Technically the forecasters were paid, up to $250 per season. (<i>Superforecasting</i> p72) However their payments did not depend on how accurate they were or how much effort they put in, beyond the minimum. <a class="easy-footnote-to-top" href="#easy-footnote-3-1283"></a>
 +</div></li>
 +<li><div class="li">
 +<span class="easy-footnote-margin-adjust" id="easy-footnote-bottom-4-1283"></span> The table is from <a href="https://www.apa.org/pubs/journals/releases/xap-0000040.pdf">Mellers <i>et al</i> 2015</a>. “Del time” is deliberation time. <a class="easy-footnote-to-top" href="#easy-footnote-4-1283"></a>
 +</div></li>
 +<li><div class="li">
 +<span class="easy-footnote-margin-adjust" id="easy-footnote-bottom-5-1283"></span> “Nonetheless, as we saw in the structural model, and confirm here, the best model uses dispositional, situational, and behavioral variables. The combination produced a multiple correlation of .64.” This is from <a href="https://www.apa.org/pubs/journals/releases/xap-0000040.pdf">Mellers <i>et al</i> 2015</a>. <a class="easy-footnote-to-top" href="#easy-footnote-5-1283"></a>
 +</div></li>
 +<li><div class="li">
 +<span class="easy-footnote-margin-adjust" id="easy-footnote-bottom-6-1283"></span> This is from <a href="https://www.apa.org/pubs/journals/releases/xap-0000040.pdf">Mellers <i>et al</i> 2015</a>. <a class="easy-footnote-to-top" href="#easy-footnote-6-1283"></a>
 +</div></li>
 +<li><div class="li">
 +<span class="easy-footnote-margin-adjust" id="easy-footnote-bottom-7-1283"></span> On the <a href="https://goodjudgment.com/science.html">webpage</a>, it says forecasters with better track-records and those who update more frequently get weighted more. In <a href="https://www.edge.org/conversation/philip_tetlock-edge-master-class-2015-a-short-course-in-superforecasting-class-ii">these slides, </a>Tetlock describes the elitism differently: He says it gives weight to higher-IQ, more open-minded forecasters. <a class="easy-footnote-to-top" href="#easy-footnote-7-1283"></a>
 +</div></li>
 +<li><div class="li">
 +<span class="easy-footnote-margin-adjust" id="easy-footnote-bottom-8-1283"></span> The academic papers on this topic are <a href="https://www.sciencedirect.com/science/article/pii/S0169207013001635">Satopaa et al 2013</a> and <a href="http://pubsonline.informs.org/doi/abs/10.1287/deca.2014.0293">Baron et al 2014</a>. <a class="easy-footnote-to-top" href="#easy-footnote-8-1283"></a>
 +</div></li>
 +<li><div class="li">
 +<span class="easy-footnote-margin-adjust" id="easy-footnote-bottom-9-1283"></span> According to one expert I interviewed, more recent data suggests that the successes of the extremizing algorithm during the forecasting tournament were a fluke. After all, <i>a priori</i> one would expect extremizing to lead to small improvements in accuracy most of the time, but big losses in accuracy some of the time. <a class="easy-footnote-to-top" href="#easy-footnote-9-1283"></a>
 +</div></li>
 +<li><div class="li">
 +<span class="easy-footnote-margin-adjust" id="easy-footnote-bottom-10-1283"></span> <i>Superforecasting</i> p18. <a class="easy-footnote-to-top" href="#easy-footnote-10-1283"></a>
 +</div></li>
 +<li><div class="li">
 +<span class="easy-footnote-margin-adjust" id="easy-footnote-bottom-11-1283"></span> This is from <a href="https://www.edge.org/conversation/philip_tetlock-edge-master-class-2015-a-short-course-in-superforecasting-class-ii">this seminar</a>. <a class="easy-footnote-to-top" href="#easy-footnote-11-1283"></a>
 +</div></li>
 +<li><div class="li">
 +<span class="easy-footnote-margin-adjust" id="easy-footnote-bottom-12-1283"></span> For example, in year 2 one superforecaster beat the extremizing algorithm. More generally, as discussed in <a href="https://www.edge.org/conversation/philip_tetlock-edge-master-class-2015-a-short-course-in-superforecasting-class-ii">this seminar</a>, the aggregation algorithm produces the greatest improvement with ordinary forecasters; the superforecasters were good enough that it didn’t help much. <a class="easy-footnote-to-top" href="#easy-footnote-12-1283"></a>
 +</div></li>
 +<li><div class="li">
 +<span class="easy-footnote-margin-adjust" id="easy-footnote-bottom-13-1283"></span> This is from <a href="http://journal.sjdm.org/16/16511/jdm16511.pdf">Chang et al 2016</a>. The average brier score of answers tagged “comparison classes” was 0.17, while the next-best tag averaged 0.26.<a class="easy-footnote-to-top" href="#easy-footnote-13-1283"></a>
 +</div></li>
 +<li><div class="li">
 +<span class="easy-footnote-margin-adjust" id="easy-footnote-bottom-14-1283"></span> <i>Superforecasting</i> p191 <a class="easy-footnote-to-top" href="#easy-footnote-14-1283"></a>
 +</div></li>
 +<li><div class="li">
 +<span class="easy-footnote-margin-adjust" id="easy-footnote-bottom-15-1283"></span> There is experimental evidence that superforecasters are less prone to standard cognitive science biases than ordinary people. From <a href="https://www.edge.org/conversation/philip_tetlock-edge-master-class-2015-a-short-course-in-superforecasting-class-iv">edge.org</a>: <i>Mellers:</i> “We have given them lots of Kahneman and Tversky-like problems to see if they fall prey to the same sorts of biases and errors. The answer is sort of, some of them do, but not as many. It’s not nearly as frequent as you see with the rest of us ordinary mortals. The other thing that’s interesting is they don’t make the kinds of mistakes that regular people make instead of the right answer. They do something that’s a little bit more thoughtful. They integrate base rates with case-specific information a little bit more.”  <a class="easy-footnote-to-top" href="#easy-footnote-15-1283"></a>
 +</div></li>
 +<li><div class="li">
 +<span class="easy-footnote-margin-adjust" id="easy-footnote-bottom-16-1283"></span> <i>Superforecasting</i> p281 <a class="easy-footnote-to-top" href="#easy-footnote-16-1283"></a>
 +</div></li>
 +<li><div class="li">
 +<span class="easy-footnote-margin-adjust" id="easy-footnote-bottom-17-1283"></span> See e.g. page 284 of <i>Superforecasting</i>, and the entirety of chapter 9. <a class="easy-footnote-to-top" href="#easy-footnote-17-1283"></a>
 +</div></li>
 +<li><div class="li">
 +<span class="easy-footnote-margin-adjust" id="easy-footnote-bottom-18-1283"></span> See Chapter 5: “Ultimately, it’s not the number crunching power that counts. It’s how you use it. … You’ve Fermi-ized the question, consulted the outside view, and now, finally, you can consult the inside view … So you have an outside view and an inside view. Now they have to be merged. …” <a class="easy-footnote-to-top" href="#easy-footnote-18-1283"></a>
 +</div></li>
 +<li><div class="li">
 +<span class="easy-footnote-margin-adjust" id="easy-footnote-bottom-19-1283"></span> <i>Superforecasting</i> 44 <a class="easy-footnote-to-top" href="#easy-footnote-19-1283"></a>
 +</div></li>
 +<li><div class="li">
 +<span class="easy-footnote-margin-adjust" id="easy-footnote-bottom-20-1283"></span> The superforecasters had a calibration of 0.01, which means that the average difference between a probability they use and the true frequency of occurrence is 0.01. This is from <a href="https://www.researchgate.net/publication/277087515_Identifying_and_Cultivating_Superforecasters_as_a_Method_of_Improving_Probabilistic_Predictions">Mellers et al 2015</a>. The fact about rounding their predictions is from <a href="https://academic.oup.com/isq/article-abstract/62/2/410/4944059?redirectedFrom=fulltext">Friedman et al 2018</a>. EDIT: Seems I was wrong, thanks to this commenter for noticing.https://www.metaculus.com/questions/4166/the-lightning-round-tournament-comparing-metaculus-forecasters-to-infectious-disease-experts/#comment-28756<a class="easy-footnote-to-top" href="#easy-footnote-20-1283"></a>
 +</div></li>
 +<li><div class="li">
 +<span class="easy-footnote-margin-adjust" id="easy-footnote-bottom-21-1283"></span> For an excellent introduction to Bayesian reasoning and its theoretical foundations, see Strevens’ textbook-like <a href="http://www.strevens.org/bct/">lecture notes</a>. Some of the facts summarized in this paragraph about Superforecasters and Bayesianism can be found on pages 169-172, 281, and 314 of <i>Superforecasting</i>. <a class="easy-footnote-to-top" href="#easy-footnote-21-1283"></a>
 +</div></li>
 +<li><div class="li">
 +<span class="easy-footnote-margin-adjust" id="easy-footnote-bottom-22-1283"></span> Tetlock admits that “there is no evidence that geopolitical or economic forecasters can predict anything ten years out beyond the excruciatingly obvious… These limits on predictability are the predictable results of the butterfly dynamics of nonlinear systems. In my EPJ research, the accuracy of expert predictions declined toward chance five years out.” (<i>Superforecasting</i> p243) <a class="easy-footnote-to-top" href="#easy-footnote-22-1283"></a>
 +</div></li>
 +<li><div class="li">
 +<span class="easy-footnote-margin-adjust" id="easy-footnote-bottom-23-1283"></span> “There are several ways to look for individual consistency across questions. We sorted questions on the basis of response format (binary, multinomial, conditional, ordered), region (Eurzone, Latin America, China, etc.), and duration of question (short, medium, and long). We computed accuracy scores for each individual on each variable within each set (e.g., binary, multinomial, conditional, and ordered) and then constructed correlation matrices. For all three question types, correlations were positive… Then we conducted factor analyses. For each question type, a large proportion of the variance was captured by a single factor, consistent with the hypothesis that one underlying dimension was necessary to capture correlations among response formats, regions, and question duration.” From <a href="https://www.apa.org/pubs/journals/releases/xap-0000040.pdf">Mellers et al 2015</a>. <a class="easy-footnote-to-top" href="#easy-footnote-23-1283"></a>
 +</div></li>
 +<li><div class="li">
 +<span class="easy-footnote-margin-adjust" id="easy-footnote-bottom-24-1283"></span>  <i>Superforecasting</i> p94. Later, in the <a href="https://www.edge.org/conversation/philip_tetlock-edge-master-class-2015-a-short-course-in-superforecasting-class-ii">edge.org seminar</a>, Tetlock says “In some other ROC curves—receiver operator characteristic curves, from signal detection theory—that Mark Steyvers at UCSD constructed—superforecasters could assign probabilities 400 days out about as well as regular people could about eighty days out.” The quote is accompanied by a <a href="https://www.edge.org/3rd_culture/Master%20Class%202015/Slide040.jpg">graph</a>; unfortunately, it’s hard to interpret. <a class="easy-footnote-to-top" href="#easy-footnote-24-1283"></a>
 +</div></li>
 +</ol>
 +</HTML>
 +
 +
  