class: center, middle, inverse, title-slide

.title[
# Artificial Intelligence: The Rise of Computational Social Science
]
.subtitle[
## Introduction to Sociology: Lecture 10
]
.author[
### 李代
]
.institute[
### School of Sociology, China University of Political Science and Law
]
.date[
### 2024-11-05
]

---
class: center, middle, inverse
<!-- background-image: url("images/cool.png") -->

# Artificial Intelligence: The Rise of Computational Social Science

## What Is Computational Social Science?

## Case Studies

---

# What Is Computational Social Science?

## Computational social science (CSS)

[Lazer et al. 2009](https://pubmed.ncbi.nlm.nih.gov/19197046/); [Lazer et al. 2020](https://pubmed.ncbi.nlm.nih.gov/32855329/)

What is computational social science?

> We define CSS as the development and application of computational methods to complex, typically large-scale, human (sometimes simulated) behavioral data.

Precursors: research on spatial data, social networks, and human coding of text and images.

---

# What Is Computational Social Science?

## Predicting the flu with a search engine?

[What We Can Learn From the Epic Failure of Google Flu Trends](https://www.wired.com/2015/10/can-learn-epic-failure-google-flu-trends/)

In 2008, Google published a paper in Nature claiming that the spread of influenza could be monitored in real time from people's search queries.

In 2013, the system's predictions were off by as much as 140%, and Google later quietly shut it down.

---

# What Is Computational Social Science?

## Predicting poverty from night lights

[Jean et al. 2016](https://forum.stanford.edu/events/posterslides/CombiningSatelliteImageryandMachineLearningtoPredictPoverty.pdf)

<img src="image/light.jpg" width="80%" />

---

# What Is Computational Social Science?

## Predicting home prices from transaction data?

[‘Powerful Tailwinds,’ iBuying Deliver Strong Earnings For Zillow](https://rezillafl.com/powerful-tailwinds-ibuying-deliver-strong-earnings-for-zillow/)

> The brightest spot for the company this quarter was its Homes segment, which includes iBuyer Zillow Offers. The Home segment alone brought in $454.3 million during the second quarter, beating Zillow’s own estimate that the segment would bring in a maximum of $350 million. The earnings report further reveals that during the quarter Zillow bought 86 homes, sold 1,437 and ended up with 440 in inventory — the lowest number since the third quarter of 2018.

[Why the iBuying algorithms failed Zillow, and what it says about the business world’s love affair with AI](https://www.geekwire.com/2021/ibuying-algorithms-failed-zillow-says-business-worlds-love-affair-ai/)

> That’s a key takeaway after Zillow Group made the unexpected decision on Tuesday to shutter its home buying business — a painful move that will result in 2,000 employees losing their jobs, a $304 million third quarter write-down, a spiraling stock price (shares are down more than 18% today), and egg on the face of co-founder and CEO Rich Barton.

---

# What Is Computational Social Science?

## “Artificial intelligence”

The so-called “artificial intelligence” that has risen in recent years is, in essence, statistical learning on big data in order to produce predictions automatically.

Predict: [from prae "before" (see pre-) + dicere "to say"](https://www.etymonline.com/word/predict#etymonline_v_19392)

Prediction here does not mean speaking before an event happens; it means speaking before the answer is known.

Each machine learning algorithm has its own assumptions, weaknesses, and appropriate use cases, and should not be misused.

“Machine learning” = “scientific fortune-telling”

---

# What Is Computational Social Science?

## Prerequisites of computational social science

1. Big data (a product of the age of datafication)
1. The capacity to compute on and store data
1. Advances in statistical tools (“machine learning”)

Big data and artificial intelligence are not necessary and sufficient conditions for computational social science.

---

# What Is Computational Social Science?

## Where is social science headed?

Social scientists do not understand algorithms.

Algorithm engineers do not understand research design.

What is to be done?

Replace one another? Collaborate? Develop all-round skills?

---

# What Is Computational Social Science?

## Will “artificial intelligence” replace lawyers?

[秘塔科技](https://metasota.ai/)

1. 秘塔翻译 (translation): focused on legal translation, built on machine learning technology
1. 秘塔检索 (search): marketed as an intelligent legal search engine that truly understands legal professionals

[香侬科技](https://www.shannonai.com/)

> Efficient, accurate, convenient. 香侬科技 provides institutions with artificial intelligence solutions for processing unstructured data. Undaunted by the flood of information, it wields cutting-edge natural language processing (NLP) technology to reduce complexity and clearly present the most valuable information. 香侬科技 is trusted by government agencies, central state-owned enterprises, banks, insurers, funds, securities firms, rating agencies, and large corporations, and uses leading-edge technology to drive the continued expansion and service upgrading of businesses such as financial asset management, risk control and rating, industry research, and investment decision-making.

---
class: center, middle, inverse
<!-- background-image: url("images/cool.png") -->

# Case Studies

---

# Case Studies

## [Large teams develop and small teams disrupt science and technology](https://www.nature.com/articles/s41586-019-0941-9.epdf?author_access_token=MRQHpGwilhO_ezUVTm_podRgN0jAjWel9jnR3ZoTv0PMG6AccJlxzpvw8-b9iOEwUXPSD06h5wOaBF2mJQIs2FcprxiP9gst0zwn6_WT9jwoiUKvK5aU4BMOmM4fFmRH0ZEdi5gqLBLeQIbp3eZrnw%3D%3D)

## Authors

### [Lingfei Wu](http://lingfeiwu.github.io/)

+ Assistant Professor at the University of Pittsburgh
+ Postdoc at the University of Chicago & Arizona State University
+ Ph.D. Communication, City University of Hong Kong, 2013
+ M.A. Communication, Peking University, 2009
+ B.A. Political Science, China University of Political Science and Law, 2006

---

# Case Studies

## Authors

### [Dashun Wang](https://www.dashunwang.com/)

+ Professor, Kellogg School of Management, Northwestern University
+ Ph.D. Physics, Northeastern University, 2013
+ M.Sc. Physics, Northeastern University, 2009
+ B.Sc. Physics, Fudan University, 2007

> His current research focus is on Science of Science, a quest to turn the scientific methods and curiosities upon ourselves, hoping to use and develop tools from complexity sciences and artificial intelligence to broadly explore the opportunities for innovation and promises of prosperity offered by the recent data explosion in science.

---

# Case Studies

## Research Question

> [Research teams are getting bigger.] This shift in team size raises the question of whether and how the character of the science and technology produced by large teams differs from that of small teams.

...

> These results demonstrate that both small and large teams are essential to a flourishing ecology of science and technology, and suggest that, to achieve this, science policies should aim to support a diversity of team sizes.

---

# Case Studies

## Data

Citation networks:

1. the Web of Science (WOS) database that contains more than 42 million articles published between 1954 and 2014, and 611 million citations among them
1. 5 million patents granted by the US Patent and Trademark Office from 1976 to 2014, and 65 million citations added by patent applicants
1. 16 million software projects and 9 million forks to them on GitHub (2011–2014), a popular web platform that allows users to collaborate on the same code repository and ‘cite’ other repositories by copying and building on their code.
---

# Case Studies

## Measure

Disruption (Funk & Owen-Smith 2017): this measure varies between −1 and 1, which corresponds to science and technology that develops or disrupts, respectively.

<img src="image/disruption.png" width="60%" />

Three citation networks comprising focal papers (blue diamonds), references (grey circles) and subsequent work (rectangles). Subsequent work may cite the focal work (i, green), both the focal work and its references (j, red) or just its references (k, black).

---

# Case Studies

### Validation

<img src="image/validation.png" width="100%" />

We also find that, on average, Nobel-prize-winning papers register among the 2% most disruptive articles. Review articles are developmental with a negative mean of disruption (bottom 46%), whereas the original research works that they review have a positive mean (top 23%).

---

# Case Studies

### Results

<img src="image/trend-disruption.png" width="100%" />

---

# Case Studies

### Results

1. Work by small teams will be substantially more disruptive than work by large teams.
1. High-impact papers produced by small teams are the most disruptive, and high-impact papers produced by large teams are the most developmental.
1. We find that solo authors and small teams much more often build on older, less popular ideas.
1. Large teams receive more of their citations rapidly, as their work is immediately relevant to more contemporaries whose ideas they develop and audiences primed to appreciate them. Conversely, smaller teams experience a much longer citation delay.
1. Whereas larger teams facilitate broader search, small teams search deeper.
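---

# Case Studies

### Sketch: computing the disruption measure

The Measure slide distinguishes three kinds of subsequent work: papers that cite only the focal work (i), papers that cite both the focal work and its references (j), and papers that cite only its references (k). Below is a minimal Python sketch of the index, assuming the usual form D = (n_i - n_j) / (n_i + n_j + n_k). The function name, the toy paper ids, and the citation sets are hypothetical and only for illustration; this is not the authors' code or data.

```python
def disruption_index(focal, focal_refs, later_citations):
    """Disruption of `focal`, assuming D = (n_i - n_j) / (n_i + n_j + n_k).

    focal_refs: set of works cited by the focal paper.
    later_citations: dict mapping each subsequent work to the set of works it cites.
    Returns None when no subsequent work cites the focal paper or its references.
    """
    n_i = n_j = n_k = 0
    for cited in later_citations.values():
        cites_focal = focal in cited
        cites_refs = bool(cited & focal_refs)
        if cites_focal and not cites_refs:
            n_i += 1  # type i: cites only the focal work
        elif cites_focal and cites_refs:
            n_j += 1  # type j: cites the focal work and its references
        elif cites_refs:
            n_k += 1  # type k: cites only the references
    total = n_i + n_j + n_k
    return (n_i - n_j) / total if total else None


# Hypothetical toy network: focal paper "F" cites "R1" and "R2".
toy_refs = {"R1", "R2"}
toy_later = {
    "A": {"F"},        # type i
    "B": {"F"},        # type i
    "C": {"F", "R1"},  # type j
    "D": {"R2"},       # type k
    "E": {"X"},        # unrelated work, ignored
}
d = disruption_index("F", toy_refs, toy_later)
print(d)  # 0.25
```

In this toy case two later papers cite only the focal work, one cites both, and one cites only a reference, so D = (2 - 1) / 4 = 0.25, slightly on the disruptive side.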
---

# Case Studies

### Results

Our findings are consistent with field research on teams in other domains, which demonstrate that small groups with more to gain and less to lose are more likely to undertake new and untested opportunities that have the potential for high growth and failure.

Nobel-prize-winning articles significantly oversample small disruptive teams, whereas those that acknowledge US National Science Foundation funding oversample large developmental teams.

Regardless of the dominant driver, these results paint a unified portrait of underfunded solo investigators and small teams who disrupt science and technology by generating new directions on the basis of deeper and wider information search.

---

# Case Studies

## [Belief Network Analysis: A Relational Approach to Understanding the Structure of Attitudes](https://www.ocf.berkeley.edu/~andrei/downloads/bna.pdf)

## Authors

### [Andrei Boutyline](https://lsa.umich.edu/soc/people/faculty/aboutyl.html)

+ Assistant Professor at University of Michigan
+ Ph.D. University of California - Berkeley, 2017

Andrei Boutyline's research focuses on culture, cognition, methodology, and public opinion. He examines the supra-individual aspects of attitudes, tastes, and cognitive representations, with a special focus on political views. He is broadly interested in the society-wide distribution of these cultural elements, and the social and cognitive processes that give rise to this distribution. He draws on network analysis, statistics, and computer science to develop novel methods for these investigations. In a separate research stream, he studies the effects of political disagreement on social network structure.

---

# Case Studies

## Authors

### [Stephen Vaisey](https://stephenvaisey.com/)

+ Professor of Sociology and Political Science at Duke University
+ Ph.D. Sociology, University of North Carolina at Chapel Hill

The main goal of my research is to understand moral and political beliefs: what they are, where they come from, and what they do.
I am the founding director of the Worldview Lab and the PI of the Measuring Morality project. I also think a lot about (and teach a lot about!) statistical methods — especially using panel data. These days I have been thinking a lot about how to use simple patterns in repeated cross-section and panel data to help adjudicate between competing theories of cultural change and socialization.

---

# Case Studies

## Research Question

How some cultural elements can organize and structure others within a cultural system

1. Belief Network Analysis: centrality of beliefs
1. Revisiting Lakoff (2002): two common parenting styles, nurturant and strict, become the “deep structures” underlying the liberal and conservative political worldviews
1. Heterogeneity: differences among 44 subpopulations

---

# Case Studies

## Moral Politics

Lakoff (2002): political cognition is also fundamentally metaphorical: the “nation is a family” metaphor.

Ideological divisions stem from the fact that “liberals and conservatives have different models of how to raise children”.

The “strict father” model used by conservatives emphasizes authority, strict discipline, and “tough love” as ways to lead the child to self-reliance.

The “nurturant parent” model used by liberals emphasizes caring, protection, and respect as the best ways to help children grow up to be fulfilled and happy adults.

Liberals thus support environmental protection and generous welfare policies because they are metaphorically understood as forms of parental caring. Conservatives oppose abortion and support mandatory sentencing for drug possession because their morality stresses personal accountability.
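---

# Case Studies

### Sketch: finding the central belief in a network

The research question asks which belief sits at the center of a network of beliefs. One common notion of network centrality is shortest-path betweenness (Freeman 1978): how often a node lies on the shortest paths between other nodes. The sketch below illustrates that idea under strong simplifying assumptions. The belief names and correlations are invented, the 0.2 cut-off is arbitrary, and the network is treated as unweighted. It is not the authors' procedure or data.

```python
from collections import deque


def build_network(corr, threshold=0.2):
    """Link two beliefs when |r| >= threshold (an assumed cut-off for this sketch)."""
    adj = {}
    for (a, b), r in corr.items():
        adj.setdefault(a, set())
        adj.setdefault(b, set())
        if abs(r) >= threshold:
            adj[a].add(b)
            adj[b].add(a)
    return adj


def betweenness(adj):
    """Unnormalized shortest-path betweenness on an unweighted, undirected graph
    (Brandes' algorithm: one breadth-first search per source node)."""
    score = {v: 0.0 for v in adj}
    for s in adj:
        dist = {s: 0}
        sigma = {v: 0 for v in adj}   # number of shortest paths from s
        sigma[s] = 1
        preds = {v: [] for v in adj}  # predecessors on shortest paths
        order = []
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Accumulate path dependencies, starting from the farthest nodes.
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                score[w] += delta[w]
    # Each unordered pair was counted once from each endpoint.
    return {v: c / 2 for v, c in score.items()}


# Invented correlations between hypothetical survey items.
toy_corr = {
    ("party_id", "abortion"): 0.35,
    ("party_id", "welfare"): 0.40,
    ("party_id", "environment"): 0.30,
    ("abortion", "religiosity"): 0.45,
    ("welfare", "environment"): 0.25,
    ("welfare", "abortion"): 0.05,  # below the cut-off, so no tie
}
scores = betweenness(build_network(toy_corr))
central = max(scores, key=scores.get)
print(central, scores[central])  # party_id 4.0
```

In this toy network `party_id` lies on the most shortest paths because it is the only bridge between the welfare and environment items and the abortion and religiosity items.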
---

# Case Studies

## Social Constraint

People use political identity as a heuristic for acquiring further political beliefs via the flow of information from opinion leaders, including politicians, journalists, and activists.

Once a person acquires such an identity—by, for example, imitating their parents or following widely known cultural stereotypes (Green, Palmquist, and Schickler 2002)—he or she can replace the abstract question of “what should I believe?” with the social question “which team am I on?”

"Identity politics"

---

# Case Studies

## Methods

### Belief Network Analysis

1. Individuals start with a single central belief (parenting model or political identity).
1. This central belief is used to produce a number of broad stances (moral views or political heuristics), which are then used to stochastically produce further beliefs.
1. Newly added sets of beliefs then form the basis for yet newer and more specific beliefs, repeating recursively to yield a center-periphery structure.

---

# Case Studies

## Methods

### Belief Network

Problem: the central belief may not be the most "central" in the sense of having the most ties.

Solution: the central belief connects otherwise disconnected domains.

Translation: the central belief (CB) has the highest "shortest-path betweenness" (Freeman 1978).

Significant? A nonparametric bootstrap is used to produce estimates.

---

# Case Studies

## Data

[2000 ANES](https://electionstudies.org/data-center/2000-time-series-study/) American National Election Studies

1. Time Series Study
1. Completions: 1,807 pre-election; 1,555 post-election
1. Sample: all fresh cross-section
1. Modes used: face-to-face, telephone
1. Weights: V000002, V000002a

---

# Case Studies

## Correlation Network

<img src="image/fig2.png" width="40%" />

---

# Case Studies

## Centrality of Nodes

<img src="image/tb2.png" width="40%" />

---

# Case Studies

## Population Heterogeneity

44 subgroups:

1. gender, class, parents foreign born, number of children, black, hispanic, age group
1. education, income
1. southeastern, religion, occupation, type of place, church attendance
1. cross-pressures (higher-income church attendees & lower-income non-attendees), political knowledge

---

# Case Studies

## Population Heterogeneity

1. Compare correlations between subgroups and see how many change signs: very robust (90% unchanged).
1. Amount of organization: subgroups are similar in terms of mean constraint and network centralization.
1. Low-information subsample: differed on 12.6% of the signs. Religiosity and biblical literalism have the highest centralities. Very wide confidence interval.
1. Black subsample: differed on 10.5% of the signs. Religiosity is the most central. Very wide confidence interval.

---

# Case Studies

## Discussion

1. We developed BNA, a correlation network-based method. (Is it really novel?)
1. Little support for heterogeneity of belief structures across social groups.
1. Social constraint is supported, moral politics is not.
1. Low-information and black groups may have religion as the central belief; however, this is not statistically reliable.

---

# Case Studies

## Limitations

1. Simplifying assumptions
1. A test of structure rather than causality

---

# Case Studies

## [Machine Learning Approaches to Facial and Text Analysis: Discovering CEO Oral Communication Styles](https://onlinelibrary.wiley.com/doi/abs/10.1002/smj.3067)

## Authors

### [Prithwiraj "Raj" Choudhury](https://www.hbs.edu/faculty/Pages/profile.aspx?facId=327154)

+ Lumry Family Associate Professor at the Harvard Business School
+ Doctorate from Harvard; degrees from the Indian Institute of Technology and the Indian Institute of Management. Prior to academia, he worked at McKinsey & Company, Microsoft and IBM.
+ Focus: the Future of Work, especially the changing Geography of Work
+ In particular, he studies the productivity effects of geographic mobility of workers, causes of geographic immobility and productivity effects of remote work practices such as ‘Work from anywhere’ and ‘All-remote’.
---

# Case Studies

## Authors

### [Dan Wang](https://www8.gsb.columbia.edu/cbs-directory/detail/djw2104)

+ Associate Professor of Business and (by courtesy) Sociology at Columbia Business School
+ BA from Columbia University (Columbia College) and PhD from Stanford University
+ Focus: How social networks shape opportunities for entrepreneurship, innovation, and large-scale economic and social transformation

---

# Case Studies

## Authors

### [Natalie A. Carlson](http://www.natalieannecarlson.com/)

+ Assistant Professor of Management at the Wharton School at the University of Pennsylvania
+ Ph.D. at Columbia Business School
+ Focus: entrepreneurship, digital platform work, and other forms of nontraditional employment, with a particular focus on emerging economies

### [Tarun Khanna](https://www.hbs.edu/faculty/Pages/profile.aspx?facId=6491)

+ Jorge Paulo Lemann Professor at the Harvard Business School
+ Degrees from Princeton and Harvard
+ Focus: entrepreneurship as a means to social and economic development in emerging markets

---

# Case Studies

## Research Question

1. Discover *five* distinct *communication styles* that incorporate both verbal and nonverbal aspects of communication
1. A proof-of-concept analysis, correlating CEO communication styles to M&A outcomes
1. Contribution is mainly methodological

---

# Case Studies

## Methods

1. [Topic Modeling](https://springerplus.springeropen.com/articles/10.1186/s40064-016-3252-8): unsupervised topic modeling of text data to generate new measures of textual variance
1. [Sentiment Analysis](https://www.sciencedirect.com/topics/social-sciences/sentiment-analysis): sentiment analysis of text data
1. [Convolutional Neural Network](https://docs.paperspace.com/machine-learning/wiki/convolutional-neural-network-cnn): supervised ML coding of facial images with a cutting-edge convolutional neural network algorithm

---

# Case Studies

## Data

[Creating Emerging Markets](https://www.hbs.edu/creating-emerging-markets/Pages/default.aspx)

+ An archive of video interviews with CEOs and founders conducted as part of Harvard Business School’s “Creating Emerging Markets” project.
+ The archive consists of a collection of oral history transcripts—as well as their corresponding video recordings—of interviews with the CEOs of 69 unique organizations; the interviews were conducted from 2008 to 2018. CEOs came from a diverse set of countries, representing Asia, Africa, the Middle East, and Latin America.

---

# Case Studies

## Literature: why is this question important?

1. CEOs are important to firms (Hambrick and Mason, 1984)
1. Communication is important to CEOs' job (Bandiera et al., 2013, 2018)
    + The communication style of top managers in general, and the way in which they communicate a vision for the organization in particular, can inspire workers, encourage initiative, and drive entrepreneurial growth (Baum, Locke, and Kirkpatrick, 1998; Westley and Mintzberg, 1989).
    + Reconfiguring (Helfat et al., 2007)
1. Non-verbal expressions are important to communication (Helfat and Peteraf, 2015)
1. Literature has studied verbal expressions (e.g., Yadav et al., 2007) using text data and text analysis (e.g., Watzlawick et al., 1967; Salancik and Meindl, 1984)
1. Recent research has started looking at videos (e.g., Petrenko et al., 2016)
1. Human coding restricts the use of video data.

---

# Case Studies

## Our synthetic method

1. We bring verbal/text and non-verbal/facial expressions together
1. Our method can be generalized to study other videos

<img src="image/choudhury2019-1.png" width="60%" />

---

# Case Studies

## Coding Part 1: Text data

1. Video: question-and-answer design. Each response (translated into English) is treated as a segment for LDA (Topic Modeling).
1. Average Answer Length
1. Limitation: the CEOs are talking to scholars, which may constrain their oral style
1. Topic modeling: 100 topics (`topicmodels` and `ldatuning` packages in R)
1. Results: Online Supplement Figure A3
1. Topic proportions are collapsed back to each interview as its features.
1. Topic Entropy: tendency toward concentration or diversity of topics. The bigger the value, the more diverse the responses.

`$$-\sum_i p_i \log_2 p_i$$`

---

# Case Studies

## Coding Part 2: Sentiment Analysis

1. Use crowdsourced lexicons (Mohammad & Turney, 2013)
1. Each term has a binary sentiment value: +1 / -1
1. Computed with the `syuzhet` R package
1. Sum the positive and the negative sentiment values within a segment and express each as a proportion (the positive and negative proportions sum to 1). For example, with 2 positive words and 5 negative words, the negative sentiment value is 5/7 ≈ 0.71.
1. Text Sentiment Variance: tendency to vary in sentiment, measured as the standard deviation of the negative sentiment values

---

# Case Studies

## Coding Part 3: CNN

1. 8 facial expressions (FE): Anger, Contempt, Disgust, Fear, Happiness, Neutral, Sadness, and Surprise (Ekman and Friesen 1971)
1. Tool: Microsoft Azure Computer Vision REST API
1. Input: frames of videos
1. Output: weights of 8 FE

<img src="image/choudhury2019-2.png" width="80%" />

---

# Case Studies

## Synthesizing coded data: Factor Analysis

1. Input: 12 variables
    + the net negative text sentiment measure (Negative Text Sentiment)
    + the text sentiment variance measure (Text Sentiment Variance)
    + the average word length of each response (Average Answer Length)
    + the topic entropy measure (Topic Entropy)
    + the eight facial emotion measures (Anger, Contempt, Disgust, Fear, Happiness, Neutral, Sadness, and Surprise).
1. Output: 5 styles
    + Excitable
    + Stern
    + Dramatic
    + Rambling
    + Melancholy

---

# Case Studies

## Correlation: Communication styles and firm outcomes

1. Model: OLS regression
    + Table 2
1. Firm outcomes: acquisition
    + data: [SDC Platinum](https://gsb-research-help.stanford.edu/library/faq/298720) database from Thomson Reuters
    + Restricting our data
    + The number of completed acquisitions ranges from zero to six, with mean values of 0.22, 0.39, and 0.59 transactions within 1/3/5-year windows around the interview.
1. Covariates
    + gender
    + region: Asia, Africa, Latin America (Middle East omitted)
    + year

---

# Case Studies

## Correlation: Communication styles and firm outcomes

Coefficients: how many more acquisitions are associated with a 1 SD increase in a style factor score

<img src="image/choudhury2019-3.png" width="60%" />

---

# Case Studies

## Contributions

1. Synthesized features outperform individual features
    + Table 3
1. Greater flexibility than traditional methods
1. Integrates video data
1. Good for replication because we use software
1. Precise measures

---

# Case Studies

## Limitations

1. Emerging markets only
1. Only talking to scholars
1. Correlational analysis is expositional, not inductive
1. Retrospective bias (Kaplan 2008) suffered by oral interview data
1. Generalization is problematic

---

# Case Studies

## Comments

1. Interpretation every step of the way. Bias is concealed, not avoided.
1. Not replicable. Topic modeling is notorious for its multimodality: the global optimum is difficult to find. CNN also depends on parameter tuning.
1. Synthesized features look better because the model is not a good model.

---

# Case Studies

## [The Geometry of Culture: Analyzing the Meanings of Class through Word Embeddings](https://journals.sagepub.com/doi/pdf/10.1177/0003122419877135)

## Authors

### [Austin C. Kozlowski](https://austinkozlowski.com/about-me/)

I am a doctoral candidate in the sociology department at the University of Chicago. My research lies at the intersection of culture and politics and explores how political ideas link together to form belief systems and ideologies.
I use a wide array of methods, including computational text analysis, survey research, and qualitative in-depth interview analysis.

---

# Case Studies

## Authors

### [Matt Taddy](http://taddylab.com/)

1. AMAZON
    + Vice President of Economic Technology and Chief Economist for North America (since 2018)
1. UNIVERSITY OF CHICAGO BOOTH SCHOOL OF BUSINESS
    + Professor of Econometrics and Statistics, 2016-2018
    + Assistant/Associate Professor, 2008-2016
1. MICROSOFT
    + Head of Economics and Data Science for Business AI, 2017
    + Principal Researcher at Microsoft Research, 2016-2017
    + EBAY, Research Fellow, 2014-2016

---

# Case Studies

## Authors

### [James A. Evans](https://sociology.uchicago.edu/directory/james-evans)

1. Professor, The University of Chicago
    + Director, Knowledge Lab;
    + Faculty Director, Masters Program in Computational Social Science;
    + External Professor, Santa Fe Institute
1. B.A. Brigham Young University, 1994
1. M.A. Stanford University, 1999
1. Ph.D. Stanford University, 2004

---

# Case Studies

## Research Question

### Method

Word embedding models are a useful tool for the study of culture.

### Substance

Chart the cultural dimensions of social class and their evolution over the twentieth century.

7 dimensions: affluence, employment, status, education, cultivation, morality, gender.

---

# Case Studies

## Class is multidimensional

Affluence: Simmel

Socio-cultural position and relation to capital: Marx, Gramsci, Wright

Education: Fischer and Hout

Symbolic manifestation: Weber

Taste: Veblen, Elias, Bourdieu

Moral classification: Fourcade and Healy, Zelizer

Gender: Hochschild, Salzinger

---

# Case Studies

## Change in 20th Century

Large organizations; mass education; gender composition of the workforce changed.

Death of class: social structural positions are replaced by identities and life-styles. Clark, Hunter, Pakulski and Waters

Continued economic class: occupation and position in class structure continue to play key roles.
Weeden and Grusky, Wright

Symbolic factors always important: Accominotti, Kahn, and Storer

How did the common understanding of class change in the 20th century?

---

# Case Studies

## Formal text analysis

1. Semantic network analysis: fails to distinguish between concepts that are close or distant when the corpus is too large.
1. Topic modeling: yields discrete clusters, not continuous relationships.

Word embedding is the answer.

---

# Case Studies

## Word embedding

In a word embedding model, each word is represented as a vector in shared vector space.

Words sharing similar contexts within the text will be positioned nearby in the space, whereas words that appear only in distinct and disconnected contexts will be positioned farther apart.

*Word2vec*, the most widely used word embedding algorithm and the primary approach we apply in the following analyses, uses a shallow, two-layered neural network architecture that optimizes the prediction of words based on shared context with other words.

---

# Case Studies

## Dimensional approach

$$ \overrightarrow{king} + \overrightarrow{woman} - \overrightarrow{man} \approx \overrightarrow{queen} $$

Start from king and take 1 step towards woman on the gender dimension.

$$ \overrightarrow{hockey} + \overrightarrow{affluence} - \overrightarrow{poverty} \approx \overrightarrow{lacrosse} $$

Start from hockey and take 1 step towards affluence on the wealth dimension.

---

# Case Studies

## Dimensional approach

<img src="image/fig2.jpeg" width="40%" />

---

# Case Studies

## Data (validation): Amazon Mechanical Turk survey

398 respondents, weighted. Respondents rate 59 items on scales representing association along class, race, and gender lines.

## Semantic differential researchers' dataset from the 1950s

Jenkins, Russell, and Suci (1958) had 30 college students rate 360 common terms on 20 semantic dimensions, such as hard-soft and good-bad, and published a table reporting the average rating for every word on each semantic dimension.

---

# Case Studies

## Data (Word embedding)

1. Google Ngram texts (5-grams)
    + Problem: not representative of publications or of the population

Train 10 independent models for the 10 decades from 1900 to 1999. Use 2000-2012 as validation.

---

# Case Studies

## Validation

Class: moderate. Gender: good. Race: bad.

We refrain from analyses of race in our subsequent analyses of class associations over time.

With survey: good. With historical data: good.

---

# Case Studies

## Meanings of Class

### Affluence

<img src="image/affluence.jpeg" width="60%" />

Women are associated with affluence. Employment is associated weakly. Education and affluence are increasingly synonymous.

---

# Case Studies

## Meanings of Class

### Other 6

<img src="image/all.jpeg" width="60%" />

---

# Case Studies

## Meanings of Class

### Other 6

Taken together, these results demonstrate a remarkably stable and complex structure among the cultural dimensions of class, with dimensions most closely associated with social distinction—morality, cultivation, and education—clustered on one end, employment position on the other, and status and affluence mediating these otherwise unrelated domains.

---

# Case Studies

## Discussion

Collectively, these findings suggest that many of the basic dimensions through which class is understood were robust against the twentieth century’s tectonic shifts in the organization of economy, industry, and employment.

What evolved were symbols used to signify locations in the multidimensional architecture of class.

---

# Case Studies

## Limitations

1. Requires large corpora: analogy tests can only be reliably solved when input text comprises several million words or more (Hill et al. 2014).
1. The exact algorithmic processes undergirding the training of word embedding models can be highly complex and therefore elude theoretically parsimonious description.
1. Word embeddings are not able to adjudicate the suitability of a given corpus for an investigation.
1. The voices and worldviews published in books digitized by Google are not a random sample of U.S. culture.
1. A set of texts should not be taken as a pure or complete reflection of the culture that produced it.
1. Word embeddings cannot identify the cultural dimensions most important for a given semantic system or social process.
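---

# Case Studies

### Sketch: projecting words onto a cultural dimension

The dimensional approach treats a cultural dimension such as affluence as a direction in the embedding space. The sketch below assumes that the direction is built by averaging the differences between antonym pairs, and that a word's position on the dimension is its cosine similarity with that direction. The three-dimensional vectors are invented for illustration and are not trained embeddings. The helper names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def dimension(vectors, pairs):
    """Average the difference vectors (positive pole minus negative pole)."""
    diffs = [[a - b for a, b in zip(vectors[pos], vectors[neg])]
             for pos, neg in pairs]
    return [sum(col) / len(diffs) for col in zip(*diffs)]


# Invented toy vectors; a real model would have hundreds of dimensions.
toy = {
    "rich":      [0.9, 0.1, 0.0],
    "poor":      [-0.9, 0.1, 0.0],
    "affluence": [0.8, 0.0, 0.2],
    "poverty":   [-0.8, 0.0, 0.2],
    "lacrosse":  [0.5, 0.8, 0.1],
    "hockey":    [-0.4, 0.8, 0.1],
}

affluence_dim = dimension(toy, [("rich", "poor"), ("affluence", "poverty")])
projection = {w: cosine(toy[w], affluence_dim) for w in ("lacrosse", "hockey")}

for word, value in projection.items():
    print(word, round(value, 2))
```

With these made-up vectors, lacrosse projects onto the affluent side of the dimension and hockey onto the other side, mirroring the analogy on the dimensional approach slide.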