nep-big New Economics Papers
on Big Data
Issue of 2019‒04‒15
seventeen papers chosen by
Tom Coupé
University of Canterbury

  1. Empirical Asset Pricing via Machine Learning By Shihao Gu; Bryan T. Kelly; Dacheng Xiu
  2. The Race against the Robots and the Fallacy of the Giant Cheesecake: Immediate and Imagined Impacts of Artificial Intelligence By Naudé, Wim
  3. New Digital Technologies and Heterogeneous Employment and Wage Dynamics in the United States: Evidence from Individual-Level Data By Fossen, Frank M.; Sorgner, Alina
  4. Feature Engineering for Mid-Price Prediction Forecasting with Deep Learning By Adamantios Ntakaris; Giorgio Mirone; Juho Kanniainen; Moncef Gabbouj; Alexandros Iosifidis
  5. The Enabling Technologies of Industry 4.0: Examining the Seeds of the Fourth Industrial Revolution By Arianna Martinelli; Andrea Mina; Massimo Moggi
  6. Classifying occupations using web-based job advertisements: an application to STEM and creative occupations By Antonio Lima; Hasan Bakhshi
  7. Cascading Logistic Regression Onto Gradient Boosted Decision Trees to Predict Stock Market Changes Using Technical Analysis By Feng Zhou; Zhang Qun; Didier Sornette; Liu Jiang
  8. Identifying effects of farm subsidies on structural change using neural networks By Storm, Hugo; Heckelei, Thomas; Baylis, Kathy; Mittenzwei, Klaus
  9. Text Data Analysis Using Latent Dirichlet Allocation: An Application to FOMC Transcripts By Hali Edison; Hector Carcel
  10. Enhancing Time Series Momentum Strategies Using Deep Neural Networks By Bryan Lim; Stefan Zohren; Stephen Roberts
  11. Opening Internet Monopolies to Competition with Data Sharing Mandates By Claudia Biancotti; Paolo Ciocca
  12. The Production of Information in an Online World: Is Copy Right? By Julia Cage; Nicolas Hervé; Marie-Luce Viaud
  13. The Production of Information in an Online World: Is Copy Right? By Julia Cage; Nicolas Hervé; Marie-Luce Viaud
  14. 25 Years of European Merger Control By Pauline Affeldt; Tomaso Duso; Florian Szücs
  15. Welcoming Remarks: at the Sixth Annual Community Banking in the 21st Century Research and Policy Conference, Federal Reserve System, Conference of State Bank Supervisors (CSBS) and Federal Deposit Insurance Corp. (FDIC), St. Louis, Mo. By Bullard, James B.
  16. (Martingale) Optimal Transport And Anomaly Detection With Neural Networks: A Primal-dual Algorithm By Pierre Henry-Labordère
  17. The Impact of Global Warming on Rural-Urban Migrations: Evidence from Global Big Data By Giovanni Peri; Akira Sasahara

  1. By: Shihao Gu (University of Chicago - Booth School of Business); Bryan T. Kelly (Yale SOM; AQR Capital Management, LLC; National Bureau of Economic Research (NBER)); Dacheng Xiu (University of Chicago - Booth School of Business)
    Abstract: We synthesize the field of machine learning with the canonical problem of empirical asset pricing: measuring asset risk premia. In the familiar empirical setting of cross section and time series stock return prediction, we perform a comparative analysis of methods in the machine learning repertoire, including generalized linear models, dimension reduction, boosted regression trees, random forests, and neural networks. At the broadest level, we find that machine learning offers an improved description of expected return behavior relative to traditional forecasting methods. Our implementation establishes a new standard for accuracy in measuring risk premia summarized by an unprecedented out-of-sample return prediction R2. We identify the best performing methods (trees and neural nets) and trace their predictive gains to allowance of nonlinear predictor interactions that are missed by other methods. Lastly, we find that all methods agree on the same small set of dominant predictive signals that includes variations on momentum, liquidity, and volatility. Improved risk premia measurement through machine learning can simplify the investigation into economic mechanisms of asset pricing and justifies its growing role in innovative financial technologies.
    Keywords: Machine Learning, Big Data, Return Prediction, Cross-Section of Returns, Ridge Regression, (Group) Lasso, Elastic Net, Random Forest, Gradient Boosting, (Deep) Neural Networks, Fintech
    Date: 2018–11
    URL: http://d.repec.org/n?u=RePEc:chf:rpseri:rp1871&r=all
  2. By: Naudé, Wim (Maastricht University)
    Abstract: After a number of AI-winters, AI is back with a boom. There are concerns that it will disrupt society. The immediate concern is whether labor can win a 'race against the robots' and the longer-term concern is whether an artificial general intelligence (super-intelligence) can be controlled. This paper describes the nature and context of these concerns, reviews the current state of the empirical and theoretical literature in economics on the impact of AI on jobs and inequality, and discusses the challenge of AI arms races. It is concluded that despite the media hype neither massive job losses nor a 'Singularity' are imminent. In part, this is because current AI, based on deep learning, is expensive and difficult for (especially small) businesses to adopt, can create new jobs, and is an unlikely route to the invention of a super-intelligence. Even though AI is unlikely to have either utopian or apocalyptic impacts, it will challenge economists in coming years. The challenges include regulation of data and algorithms; the (mis-)measurement of value added; market failures, anti-competitive behaviour and abuse of market power; surveillance, censorship, cybercrime; labor market discrimination, declining job quality; and AI in emerging economies.
    Keywords: technology, artificial intelligence, productivity, labor demand, innovation, inequality
    JEL: O47 O33 J24 E21 E25
    Date: 2019–03
    URL: http://d.repec.org/n?u=RePEc:iza:izadps:dp12218&r=all
  3. By: Fossen, Frank M. (University of Nevada, Reno); Sorgner, Alina (John Cabot University)
    Abstract: We investigate heterogeneous effects of new digital technologies on individual-level employment and wage dynamics in the U.S. labor market over the period 2011–2018. We employ three measures that reflect different aspects of the impact of new digital technologies on occupations. The first measure, developed by Frey and Osborne (2017), assesses the computerization risk of occupations; the second measure, developed by Felten et al. (2018), provides an estimate of recent advances in artificial intelligence (AI); and the third measure assesses the suitability of occupations for machine learning (Brynjolfsson et al., 2018), which is a subfield of AI. Our empirical analysis is based on large representative panel data, the matched monthly Current Population Survey (CPS) and its Annual Social and Economic Supplement (ASEC). The results suggest that the effects of new digital technologies on employment stability and wage growth are already observable at the individual level. High computerization risk is associated with a high likelihood of switching one's occupation or becoming non-employed, as well as a decrease in wage growth. However, advances in AI are likely to improve an individual's job stability and wage growth. We further document that the effects are heterogeneous. In particular, individuals with high levels of formal education and older workers are most affected by new digital technologies.
    Keywords: digitalization, artificial intelligence, machine learning, employment stability, unemployment, wage dynamics
    JEL: J22 J23 O33
    Date: 2019–03
    URL: http://d.repec.org/n?u=RePEc:iza:izadps:dp12242&r=all
  4. By: Adamantios Ntakaris; Giorgio Mirone; Juho Kanniainen; Moncef Gabbouj; Alexandros Iosifidis
    Abstract: Mid-price movement prediction based on limit order book (LOB) data is a challenging task due to the complexity and dynamics of the LOB. So far, there have been very few attempts at extracting relevant features based on LOB data. In this paper, we address this problem by designing a new set of handcrafted features and performing an extensive experimental evaluation on both liquid and illiquid stocks. More specifically, we implement a new set of econometric features that capture statistical properties of the underlying securities for the task of mid-price prediction. Moreover, we develop a new experimental protocol for online learning that treats the task as a multi-objective optimization problem and predicts i) the direction of the next price movement and ii) the number of order book events that occur until the change takes place. In order to predict the mid-price movement, the features are fed into nine different deep learning models based on multi-layer perceptrons (MLP), convolutional neural networks (CNN) and long short-term memory (LSTM) neural networks. The performance of the proposed method is then evaluated on liquid and illiquid stocks, which are based on TotalView-ITCH US and Nordic stocks, respectively. For some stocks, results suggest that the correct choice of a feature set and a model can lead to the successful prediction of how long it takes to have a stock price movement.
    Date: 2019–04
    URL: http://d.repec.org/n?u=RePEc:arx:papers:1904.05384&r=all
  5. By: Arianna Martinelli; Andrea Mina; Massimo Moggi
    Abstract: Technological revolutions mark profound transformations in socio-economic systems. They are associated with the diffusion of general purpose technologies that display very high degrees of pervasiveness, dynamism and complementarity. This paper provides an in-depth examination of the technologies underpinning the “factory of the future” as profiled by the Industry 4.0 paradigm. It contains an exploratory comparative analysis of the technological bases and the emergent patterns of development of Internet of Things (IoT), big data, cloud, robotics, artificial intelligence and additive manufacturing. By qualifying the “enabling” nature of these technologies, it explores to what extent their diffusion and convergence can be configured as the trigger of a fourth industrial revolution, and identifies key themes for future research on this topic from the viewpoint of industrial and corporate change.
    Keywords: Industry 4.0; technological paradigm; enabling technology; general purpose technology; disruptive innovation.
    Date: 2019–04–11
    URL: http://d.repec.org/n?u=RePEc:ssa:lemwps:2019/09&r=all
  6. By: Antonio Lima; Hasan Bakhshi
    Abstract: Rapid technological, social and economic change is having significant impacts on the nature of jobs. In fast-changing environments it is crucial that policymakers have a clear and timely picture of the labour market. Policymakers use standardised occupational classifications, such as the Office for National Statistics’ Standard Occupational Classification (SOC) in the UK, to analyse the labour market. These permit the occupational composition of the workforce to be tracked on a consistent and transparent basis over time and across industrial sectors. However, such systems are by their nature costly to maintain, slow to adapt and not very flexible. For that reason, additional tools are needed. At the same time, policymakers around the world are revisiting how active skills development policies can be used to equip workers with the capabilities needed to meet the new labour market realities. There is in parallel a desire for more granular understandings of what skills combinations are required of occupations, in part so that policymakers are better sighted on how individuals can redeploy these skills as and when employer demands change further. In this paper, we investigate the possibility of complementing traditional occupational classifications with more flexible methods centred around employers’ characterisations of the skills and knowledge requirements of occupations as presented in job advertisements. We use data science methods to classify job advertisements as STEM or non-STEM (Science, Technology, Engineering and Mathematics) and creative or non-creative, based on the content of ads in a database of UK job ads posted online belonging to Boston-based job market analytics company, Burning Glass Technologies. In doing so, we first characterise each SOC code in terms of its skill make-up; this step allows us to describe each SOC skillset as a mathematical object that can be compared with other skillsets. Then we develop a classifier that predicts the SOC code of a job based on its required skills. Finally, we develop two classifiers that decide whether a job vacancy is STEM/non-STEM and creative/non-creative, based again on its skill requirements.
    Keywords: labour demand, occupational classification, online job adverts, big data, machine learning, STEM, STEAM, creative economy
    JEL: C18 J23 J24
    Date: 2018–07
    URL: http://d.repec.org/n?u=RePEc:nsr:escoed:escoe-dp-2018-08&r=all
  7. By: Feng Zhou (Guangdong University of Finance and Economics); Zhang Qun (Guangdong University of Foreign Studies); Didier Sornette (ETH Zurich and Swiss Finance Institute); Liu Jiang (University of Surrey)
    Abstract: In the data mining and machine learning fields, forecasting the direction of price change can be generally formulated as a supervised classification. This paper attempts to predict the direction of daily changes of the Nasdaq Composite Index (NCI) and of the Standard & Poor's 500 Composite Stock Price Index (S&P 500) covering the period from January 3, 2012 to December 23, 2016, and of the Shanghai Stock Exchange Composite Index (SSEC) from January 4, 2010 to December 31, 2014. Due to the complexity of stock index data, we carefully combine raw price data and eleven technical indicators with a cascaded learning technique to improve the performance of the classification. The proposed learning architecture LR2GBDT is obtained by cascading the logistic regression (LR) model onto the gradient boosted decision trees (GBDT) model. Given the same test conditions, the experimental results show that the LR2GBDT model performs better than the baseline LR and GBDT models for these stock indices, according to the performance metrics Hit ratio, Precision, Recall and F-measure. Furthermore, we use these models to develop simple trading strategies and assess their performance in terms of their Average Annual Return, Maximum Drawdown, Sharpe Ratio and Average Annualized Return/Maximum Drawdown. When transaction costs and buy-sell thresholds are taken into account, the best trading strategy derived from the LR2GBDT model still reaches the highest Sharpe Ratio and clearly beats the buy-and-hold strategy. The performances are found to be both statistically and economically significant.
    Keywords: Ensemble learning; gradient boosted decision trees; logistic regression; price prediction; transaction costs, technical analysis
    JEL: C45 C53 C60 G17
    Date: 2018–07
    URL: http://d.repec.org/n?u=RePEc:chf:rpseri:rp1850&r=all
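A minimal sketch of the generic GBDT-to-LR cascade the abstract describes: trees discover nonlinear partitions of the feature space, and a logistic regression is then fit on the one-hot leaf indicators. The data, tree count, and features below are toy stand-ins, not the paper's LR2GBDT specification or its eleven technical indicators.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import OneHotEncoder

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 6))                  # stand-in for price features
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)  # synthetic up/down label

gbdt = GradientBoostingClassifier(n_estimators=30, max_depth=3, random_state=0)
gbdt.fit(X, y)

# Re-encode each sample by the index of the leaf it falls into in every
# tree, then train the logistic regression on the one-hot leaf indicators,
# i.e. on the nonlinear partitions the boosted trees discovered.
leaves = gbdt.apply(X)[:, :, 0]                # shape (n_samples, n_trees)
enc = OneHotEncoder(handle_unknown="ignore")
lr = LogisticRegression(max_iter=1000)
lr.fit(enc.fit_transform(leaves), y)

acc = lr.score(enc.transform(gbdt.apply(X)[:, :, 0]), y)
print(acc)
```

In-sample accuracy here only illustrates the plumbing; a real evaluation would, as in the paper, compare against the standalone LR and GBDT baselines on held-out data.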
  8. By: Storm, Hugo; Heckelei, Thomas; Baylis, Kathy; Mittenzwei, Klaus
    Abstract: Farm subsidies are commonly motivated by their promise to help keep families in agriculture and reduce farm structural change. Many of these subsidies are designed to be targeted to smaller farms, and include production caps or more generous funding for smaller levels of activity. Agricultural economists have long studied how such subsidies affect production choices, and resulting farm structure. Traditional econometric models are typically restricted to detecting average effects of subsidies on certain farm types or regions and cannot easily incorporate complex subsidy design or the multi-output, heterogeneous nature of many farming activities. Programming approaches may help address the broad scope of agricultural production but have fewer empirical measures for behavioral and technological parameters. This paper uses a recurrent neural network and detailed panel data to estimate the effect of subsidies on the structure of Norwegian farming. Specifically, we use the model to determine how the varying marginal subsidies have affected the distribution of Norwegian farms and their range of agricultural activities. We use the predictive capacity of this flexible, multi-output machine learning model to identify the effects of agricultural subsidies on farm activity and structure, as well as their detailed distributional effects.
    Keywords: Agricultural and Food Policy, Farm Management, Land Economics/Use, Research Methods/ Statistical Methods
    Date: 2019–04–10
    URL: http://d.repec.org/n?u=RePEc:ags:ubfred:287343&r=all
  9. By: Hali Edison (Williams College); Hector Carcel (Bank of Lithuania)
    Abstract: This paper applies Latent Dirichlet Allocation (LDA), a machine learning algorithm, to analyze the transcripts of the U.S. Federal Open Market Committee (FOMC) covering the period 2003 – 2012, including 45,346 passages. The goal is to detect the evolution of the different topics discussed by the members of the FOMC. The results of this exercise show that discussions on economic modelling were dominant during the Global Financial Crisis (GFC), with an increase in discussion of the banking system in the years following the GFC. Discussions on communication gained relevance toward the end of the sample as the Federal Reserve adopted a more transparent approach. The paper suggests that LDA analysis could be further exploited by researchers at central banks and institutions to identify topic priorities in relevant documents such as FOMC transcripts.
    Keywords: FOMC, Text data analysis, Transcripts, Latent Dirichlet Allocation
    JEL: E52 E58 D78
    Date: 2019–04–05
    URL: http://d.repec.org/n?u=RePEc:lie:dpaper:11&r=all
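A minimal sketch of LDA topic extraction on short text passages, analogous in spirit to (not a reproduction of) the FOMC-transcript analysis; the corpus, topic count, and preprocessing below are toy assumptions.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Toy stand-ins for transcript passages.
passages = [
    "inflation outlook and interest rate policy",
    "bank capital and banking system stress",
    "rate policy and inflation expectations anchored",
    "banking supervision and bank liquidity",
    "communication of the policy outlook to markets",
    "transparent communication and forward guidance",
]

vec = CountVectorizer(stop_words="english")
dtm = vec.fit_transform(passages)          # document-term matrix

lda = LatentDirichletAllocation(n_components=3, random_state=0)
doc_topics = lda.fit_transform(dtm)        # per-passage topic weights

# Each row is a distribution over the 3 latent topics; tracking these
# weights over time is what reveals how discussion themes evolve.
for row in doc_topics:
    print([round(w, 2) for w in row])
```

At the paper's scale (45,346 passages), the same fit yields topic weights per passage per meeting, which can then be aggregated by year to trace the shifting emphasis the authors report.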
  10. By: Bryan Lim; Stefan Zohren; Stephen Roberts
    Abstract: While time series momentum is a well-studied phenomenon in finance, common strategies require the explicit definition of both a trend estimator and a position sizing rule. In this paper, we introduce Deep Momentum Networks -- a hybrid approach which injects deep learning based trading rules into the volatility scaling framework of time series momentum. The model simultaneously learns both trend estimation and position sizing in a data-driven manner, with networks directly trained by optimising the Sharpe ratio of the signal. Backtesting on a portfolio of 88 continuous futures contracts, we demonstrate that the Sharpe-optimised LSTM outperformed traditional methods by more than a factor of two in the absence of transaction costs, and continues to outperform when considering transaction costs up to 2-3 basis points. To account for more illiquid assets, we also propose a turnover regularisation term which trains the network to factor in costs at run-time.
    Date: 2019–04
    URL: http://d.repec.org/n?u=RePEc:arx:papers:1904.04912&r=all
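The classical rule-based baseline the paper starts from can be sketched in a few lines: a sign-of-trend estimator and an inverse-volatility position sizing rule, judged by the Sharpe ratio. The lookback, volatility target, and toy return series are illustrative assumptions; the paper's contribution is to replace both hand-specified pieces with a network trained directly on this ratio.

```python
import statistics

def tsmom_positions(returns, lookback=5, vol_target=0.02):
    """Volatility-scaled time series momentum: go long (short) when the
    trailing mean return is positive (negative), sized inversely to
    recent realised volatility."""
    positions = []
    for t in range(lookback, len(returns)):
        window = returns[t - lookback:t]
        trend = statistics.fmean(window)           # trend estimator
        vol = statistics.pstdev(window) or 1e-8    # realised volatility
        sign = 1.0 if trend > 0 else -1.0
        positions.append(sign * vol_target / vol)  # position sizing rule
    return positions

def sharpe(strategy_returns):
    """Mean over volatility of strategy returns (annualisation omitted);
    the networks in the paper maximise this directly instead of a
    forecasting loss."""
    mu = statistics.fmean(strategy_returns)
    sd = statistics.pstdev(strategy_returns) or 1e-8
    return mu / sd

# Toy upward-trending daily return series.
rets = [0.01, 0.012, 0.008, 0.011, 0.009, 0.013, 0.010, 0.012]
pos = tsmom_positions(rets)
pnl = [p * r for p, r in zip(pos, rets[5:])]
print(sharpe(pnl))
```

Optimising the Sharpe ratio end-to-end, rather than predicting returns and sizing positions separately, is what lets the network trade off signal strength against volatility in one step.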
  11. By: Claudia Biancotti (Peterson Institute for International Economics); Paolo Ciocca (Consob)
    Abstract: Over the past few years, it has become apparent that a small number of technology companies have assembled detailed datasets on the characteristics, preferences, and behavior of billions of individuals. This concentration of data is at the root of a worrying power imbalance between dominant internet firms and the rest of society, reflecting negatively on collective security, consumer rights, and competition. Introducing data sharing mandates, or requirements for market leaders to share user data with other firms and academia, would have a positive effect on competition. As data are a key input for artificial intelligence (AI), more widely available information would help spread the benefits of AI through the economy. On the other hand, data sharing could worsen existing risks to consumer privacy and collective security. Policymakers intending to implement a data sharing mandate should carefully evaluate this tradeoff.
    Date: 2019–04
    URL: http://d.repec.org/n?u=RePEc:iie:pbrief:pb19-3&r=all
  12. By: Julia Cage (Département d'économie); Nicolas Hervé (Institut national de l'audiovisuel); Marie-Luce Viaud (Institut national de l'audiovisuel)
    Abstract: This paper documents the extent of copying and estimates the returns to originality in online news production. We build a unique dataset combining all the online content produced by French news media during the year 2013 with new micro audience data. We develop a topic detection algorithm that identifies each news event, trace the timeline of each story, and study news propagation. We unravel new evidence on online news production. First, we document high reactivity of online media: one quarter of the news stories are reproduced online in under 4 minutes. Second, we show that this comes with extensive copying: only 33% of the online content is original. Third, we investigate the cost of copying for original news producers. Using article-level variations and media-level daily audience combined with article-level social media statistics, we find that readers partly switch to the original producers, thereby mitigating the newsgathering incentive problem raised by copying.
    Keywords: Internet; Information spreading; Copyright; Social media; Reputation
    JEL: L11 L15 L82 L86
    Date: 2019–04
    URL: http://d.repec.org/n?u=RePEc:spo:wpecon:info:hdl:2441/3tcpvf3sd399op9sgtn8tq5bhd&r=all
  13. By: Julia Cage (Département d'économie); Nicolas Hervé (Institut national de l'audiovisuel); Marie-Luce Viaud (Institut national de l'audiovisuel)
    Abstract: This paper documents the extent of copying and estimates the returns to originality in online news production. We build a unique dataset combining all the online content produced by French news media during the year 2013 with new micro audience data. We develop a topic detection algorithm that identifies each news event, trace the timeline of each story, and study news propagation. We unravel new evidence on online news production. First, we document high reactivity of online media: one quarter of the news stories are reproduced online in under 4 minutes. Second, we show that this comes with extensive copying: only 33% of the online content is original. Third, we investigate the cost of copying for original news producers. Using article-level variations and media-level daily audience combined with article-level social media statistics, we find that readers partly switch to the original producers, thereby mitigating the newsgathering incentive problem raised by copying.
    Keywords: Internet; Information spreading; Copyright; Social media; Reputation
    JEL: L11 L15 L82 L86
    Date: 2019–04
    URL: http://d.repec.org/n?u=RePEc:spo:wpmain:info:hdl:2441/3tcpvf3sd399op9sgtn8tq5bhd&r=all
  14. By: Pauline Affeldt; Tomaso Duso; Florian Szücs
    Abstract: We study the evolution of the EC’s merger decision procedure over the first 25 years of European competition policy. Using a novel dataset constructed at the level of the relevant markets and containing all merger cases over the 1990-2014 period, we evaluate how consistently arguments related to structural market parameters were applied over time. Using non-parametric machine learning techniques, we find that the importance of market shares and concentration measures has declined while the importance of barriers to entry and the risk of foreclosure has increased in the EC’s merger assessment following the 2004 merger policy reform.
    Keywords: Merger policy, DG competition, causal forests
    JEL: K21 L40
    Date: 2019
    URL: http://d.repec.org/n?u=RePEc:diw:diwwpp:dp1797&r=all
  15. By: Bullard, James B. (Federal Reserve Bank of St. Louis)
    Abstract: St. Louis Fed President James Bullard welcomed community bankers, regulators and researchers to the Community Banking in the 21st Century research and policy conference. He also welcomed a third sponsor for the conference: The Federal Deposit Insurance Corp. has joined the Federal Reserve System and the Conference of State Bank Supervisors in presenting this sixth annual conference. Bullard also discussed a handful of topics related to technology, including “fintech,” artificial intelligence and innovation hubs. “The changing landscape of financial services is an important reason for this conference,” he said.
    Date: 2018–10–03
    URL: http://d.repec.org/n?u=RePEc:fip:fedlps:322&r=all
  16. By: Pierre Henry-Labordère (Société Générale)
    Abstract: In this paper, we introduce a primal-dual algorithm for solving (martingale) optimal transportation problems, with cost functions satisfying the twist condition, close to the one that has been used recently for training generative adversarial networks. As some additional applications, we consider anomaly detection and automatic generation of financial data.
    Date: 2019–04
    URL: http://d.repec.org/n?u=RePEc:arx:papers:1904.04546&r=all
  17. By: Giovanni Peri; Akira Sasahara
    Abstract: This paper examines the impact of temperature changes on rural-urban migration using a 56km×56km grid cell level dataset covering the whole world at 10-year frequency during the period 1970-2000. We find that rising temperatures reduce rural-urban migration in poor countries and increase such migration in middle-income countries. These asymmetric migration responses are consistent with a simple model where rural-urban earnings differentials and liquidity constraints interact to determine rural-to-urban migration flows. We also confirm these temperature effects using country-level observations constructed by aggregating the grid cell level data. We project that expected warming in the next century will encourage further urbanization in middle-income countries such as Argentina, but it will slow down urban transition in poor countries like Malawi and Niger.
    JEL: J61 O13 R23
    Date: 2019–04
    URL: http://d.repec.org/n?u=RePEc:nbr:nberwo:25728&r=all

This nep-big issue is ©2019 by Tom Coupé. It is provided as is without any express or implied warranty. It may be freely redistributed in whole or in part for any purpose. If distributed in part, please include this notice.
General information on the NEP project can be found at http://nep.repec.org. For comments please write to the director of NEP, Marco Novarese at <director@nep.repec.org>. Put “NEP” in the subject, otherwise your mail may be rejected.
NEP’s infrastructure is sponsored by the School of Economics and Finance of Massey University in New Zealand.