nep-big New Economics Papers
on Big Data
Issue of 2022‒08‒22
twenty-two papers chosen by
Tom Coupé
University of Canterbury

  1. Machine Learning: An Introduction for Economists By Zarak Jamal Khan
  2. Comparative Effectiveness of Machine Learning Methods for Causal Inference in Agricultural Economics By Badruddoza, Syed; Fuad, Syed M.; Amin, Modhurima
  3. Program Targeting with Machine Learning and Mobile Phone Data: Evidence from an Anti-Poverty Intervention in Afghanistan By Emily Aiken; Guadalupe Bedoya; Joshua Blumenstock; Aidan Coville
  4. Estimating Continuous Treatment Effects in Panel Data using Machine Learning with an Agricultural Application By Sylvia Klosin; Max Vilgalys
  5. AlphaMLDigger: A Novel Machine Learning Solution to Explore Excess Return on Investment By Jimei Shen; Zhehu Yuan; Yifan Jin
  6. Supervised similarity learning for corporate bonds using Random Forest proximities By Jerinsh Jeyapaulraj; Dhruv Desai; Peter Chu; Dhagash Mehta; Stefano Pasquali; Philip Sommer
  7. Sitting Next to a Dropout: Academic Success of Students with More Educated Peers By Goller, Daniel; Diem, Andrea; Wolter, Stefan C.
  8. Changing Electricity Markets: Quantifying the Price Effects of Greening the Energy Matrix By Emanuel Kohlscheen; Richhild Moessner
  9. Cyclical and Trend Variation in Demand Elasticity: Big data evidence from US grocery stores By Gafarov, Bulat; Gong, Tengda; Hilscher, Jens
  10. Modeling clusters from the ground up: a web data approach By Stich, Christoph; Tranos, Emmanouil; Nathan, Max
  11. The dynamics of the prices of the companies of the STOXX Europe 600 Index through the logit model and neural network By Federico Mecchia; Marcellino Gaudenzi
  12. Estimating Inequality with Missing Incomes By Paolo Brunori; Pedro Salas-Rojo; Paolo Verme
  13. A multi-task network approach for calculating discrimination-free insurance prices By Mathias Lindholm; Ronald Richman; Andreas Tsanakas; Mario V. Wüthrich
  14. Estimating Inequality with Missing Incomes By Paolo Brunori; Pedro Salas-Rojo; Paolo Verme
  15. Uncover Drivers Influencing Consumers' WTP Using Machine Learning: Case of Organic Coffee in Taiwan By Wang, Zuyi; Takagi, Chifumi; Kim, Man-Keun; Chung, Anh
  16. Reinforcement Learning Portfolio Manager Framework with Monte Carlo Simulation By Jungyu Ahn; Sungwoo Park; Jiwoon Kim; Ju-hong Lee
  17. Balancing Profit, Risk, and Sustainability for Portfolio Management By Charl Maree; Christian W. Omlin
  18. Learning Mutual Fund Categorization using Natural Language Processing By Dimitrios Vamvourellis; Mate Attila Toth; Dhruv Desai; Dhagash Mehta; Stefano Pasquali
  19. The value of data in digital-based business models: Measurement and economic policy implications By Carol Corrado; Jonathan Haskel; Massimiliano Iommi; Cecilia Jona-Lasinio
  20. Combining Survey and Geospatial Data Can Significantly Improve Gender-Disaggregated Estimates of Labor Market Outcomes By Merfeld, Joshua D.; Newhouse, David; Weber, Michael; Lahiri, Partha
  21. When are large female-led firms more resilient against shocks? Learnings from Indian enterprises during COVID-19 with diff-in-diff and causal forests By Merlin Stein
  22. PayTech and the D(ata) N(etwork) A(ctivities) of BigTech Platforms By Jonathan Chiu; Thorsten Koeppl

  1. By: Zarak Jamal Khan (M.Phil Scholar, PIDE)
    Abstract: The objective of this webinar is to provide a brief and non-technical overview of what machine learning is and of its recent applications in the economics literature. The webinar also discusses why machine learning tools need to be incorporated in academic and policy-relevant research in Pakistan.
    Keywords: Machine Learning
    Date: 2021
    URL: http://d.repec.org/n?u=RePEc:pid:wbrief:2021:62&r=
  2. By: Badruddoza, Syed; Fuad, Syed M.; Amin, Modhurima
    Keywords: Research Methods/Statistical Methods, Food Consumption/Nutrition/Food Safety, Agricultural and Food Policy
    Date: 2022–08
    URL: http://d.repec.org/n?u=RePEc:ags:aaea22:322452&r=
  3. By: Emily Aiken; Guadalupe Bedoya; Joshua Blumenstock; Aidan Coville
    Abstract: Can mobile phone data improve program targeting? By combining rich survey data from a "big push" anti-poverty program in Afghanistan with detailed mobile phone logs from program beneficiaries, we study the extent to which machine learning methods can accurately differentiate ultra-poor households eligible for program benefits from ineligible households. We show that machine learning methods leveraging mobile phone data can identify ultra-poor households nearly as accurately as survey-based measures of consumption and wealth; and that combining survey-based measures with mobile phone data produces classifications more accurate than those based on a single data source.
    Date: 2022–06
    URL: http://d.repec.org/n?u=RePEc:arx:papers:2206.11400&r=
  4. By: Sylvia Klosin; Max Vilgalys
    Abstract: This paper introduces and proves asymptotic normality for a new semi-parametric estimator of continuous treatment effects in panel data. Specifically, we estimate an average derivative of the regression function. Our estimator uses the panel structure of data to account for unobservable time-invariant heterogeneity and machine learning methods to flexibly estimate functions of high-dimensional inputs. We construct our estimator using tools from the double de-biased machine learning (DML) literature. We show the performance of our method in Monte Carlo simulations and also apply our estimator to real-world data to measure the impact of extreme heat on United States (U.S.) agriculture. We use the estimator on a county-level dataset of corn yields and weather variation, measuring the elasticity of yield with respect to a marginal increase in extreme heat exposure. In our preferred specification, the difference between the estimates from OLS and our method is both statistically and economically significant. We find a significantly higher degree of impact, corresponding to an additional $1.18 billion in annual damages by the year 2050 under median climate scenarios. We find little evidence that this elasticity is changing over time.
    Date: 2022–07
    URL: http://d.repec.org/n?u=RePEc:arx:papers:2207.08789&r=
  5. By: Jimei Shen; Zhehu Yuan; Yifan Jin
    Abstract: How to quickly and automatically mine effective information to serve investment decisions has attracted growing attention from academia and industry, and the global pandemic has raised new challenges. This paper proposes a two-phase AlphaMLDigger that effectively finds excess returns in a highly fluctuating market. In phase 1, a deep sequential NLP model is proposed to convert blogs on Sina Microblog into market sentiment. In phase 2, the predicted market sentiment is combined with social network indicator features and stock market history features to predict stock movements with different machine learning models and optimizers. The results show that AlphaMLDigger achieves higher accuracy on the test set than previous works and is robust to the negative impact of COVID-19 to some extent.
    Date: 2022–06
    URL: http://d.repec.org/n?u=RePEc:arx:papers:2206.11072&r=
  6. By: Jerinsh Jeyapaulraj; Dhruv Desai; Peter Chu; Dhagash Mehta; Stefano Pasquali; Philip Sommer
    Abstract: The financial literature contains ample research on the similarity and comparison of financial assets and securities such as stocks, bonds, mutual funds, etc. However, going beyond correlations or aggregate statistics has been arduous since financial datasets are noisy, lack useful features, have missing data and often lack ground truth or annotated labels. Although similarity measures extrapolated heuristically from these traditional models may work well at an aggregate level, such as risk management for large portfolios, they often fail when used for portfolio construction and trading, which require a local and dynamic measure of similarity on top of a global measure. In this paper we propose a supervised similarity framework for corporate bonds which allows for inference based on both local and global measures. From a machine learning perspective, this paper emphasises that random forest (RF), which is usually viewed as a supervised learning algorithm, can also be used as a similarity learning (more specifically, a distance metric learning) algorithm. In addition, this framework proposes a novel metric to evaluate similarities, and analyses other metrics which further demonstrate that RF outperforms all other methods experimented with in this work.
    Date: 2022–07
    URL: http://d.repec.org/n?u=RePEc:arx:papers:2207.04368&r=
  7. By: Goller, Daniel (University of St. Gallen); Diem, Andrea (University of St. Gallen); Wolter, Stefan C. (University of Bern)
    Abstract: We investigate the impact of the presence of university dropouts on the academic success of first-time students. Our identification strategy relies on quasi-random variation in the proportion of returning dropouts. The estimated average zero effect of dropouts on first-time students' success masks treatment heterogeneity and non-linearities. First, we find negative effects on the academic success of their new peers from dropouts re-enrolling in the same subject and, conversely, positive effects of dropouts changing subjects. Second, using causal machine learning methods, we find that the effects vary nonlinearly with different treatment intensities and prevailing treatment levels.
    Keywords: university dropouts, peer effects, better prepared students, causal machine learning
    JEL: A23 C14 I23
    Date: 2022–06
    URL: http://d.repec.org/n?u=RePEc:iza:izadps:dp15378&r=
  8. By: Emanuel Kohlscheen; Richhild Moessner
    Abstract: We analyse the drivers of European Power Exchange (EPEX) retail electricity prices between 2012 and early 2022 using machine learning. The agnostic random forest approach that we use is able to reduce in-sample root mean square errors (RMSEs) by around 50% when compared to a standard linear least squares model, indicating that non-linearities and interaction effects are key in retail electricity markets. Out-of-sample prediction errors using machine learning are (slightly) lower than even the in-sample errors of a least squares model. The effects of efforts to limit power consumption and green the energy matrix on retail electricity prices are first order. CO2 permit prices strongly impact electricity prices, as do the prices of source energy commodities. And carbon permit prices’ impact has clearly increased post-2021 (particularly for baseload prices). Among energy sources, natural gas has the largest effect on electricity prices. Importantly, the role of wind energy feed-in has slowly risen over time, and its impact is now roughly on par with that of coal.
    Keywords: carbon permit, CO2 emissions, commodities, electricity market, energy, EPEX, machine learning, natural gas, oil, wind energy
    JEL: C54 D40 L70 Q02 Q20 Q40
    Date: 2022
    URL: http://d.repec.org/n?u=RePEc:ces:ceswps:_9807&r=
  9. By: Gafarov, Bulat; Gong, Tengda; Hilscher, Jens
    Keywords: Agribusiness, Research Methods/Statistical Methods, Marketing
    Date: 2022–08
    URL: http://d.repec.org/n?u=RePEc:ags:aaea22:322430&r=
  10. By: Stich, Christoph; Tranos, Emmanouil; Nathan, Max
    Abstract: This paper proposes a new methodological framework to identify economic clusters over space and time. We employ a unique open source dataset of geolocated and archived business webpages and interrogate them using Natural Language Processing to build bottom-up classifications of economic activities. We validate our method on an iconic UK tech cluster – Shoreditch, East London. We benchmark our results against existing case studies and administrative data, replicating the main features of the cluster and providing fresh insights. As well as overcoming limitations in conventional industrial classification, our method addresses some of the spatial and temporal limitations of the clustering literature.
    Keywords: cities; clusters; machine learning; technology industry; Consumer Data Research Centre (CDRC) and Engineering and Physical Sciences Research Council (ESRC)
    JEL: J1 N0
    Date: 2022–06–17
    URL: http://d.repec.org/n?u=RePEc:ehl:lserod:115565&r=
  11. By: Federico Mecchia; Marcellino Gaudenzi
    Abstract: The aim of the present work is to analyse and understand the dynamics of the prices of companies, depending on whether they are included in or excluded from the STOXX Europe 600 Index. To this end, data on the companies of the Index were collected and analysed using logit models and neural networks in order to find the independent variables that affect the changes in prices and thus determine the dynamics over time.
    Date: 2022–06
    URL: http://d.repec.org/n?u=RePEc:arx:papers:2206.09899&r=
  12. By: Paolo Brunori (III LSE & University of Florence); Pedro Salas-Rojo (III LSE); Paolo Verme (World Bank)
    Abstract: The measurement of income inequality is affected by missing observations, especially if they are concentrated on the tails of an income distribution. This paper conducts an experiment to test how the different correction methods proposed by the statistical, econometric and machine learning literature address measurement biases of inequality due to item non-response. We take a baseline survey and artificially corrupt the data employing several alternative non-linear functions that simulate patterns of income non-response, and show how biased inequality statistics can be when item non-responses are ignored. The comparative assessment of correction methods indicates that most methods are able to partially correct for missing data biases. Sample reweighting based on probabilities of non-response produces inequality estimates quite close to true values in most simulated missing data patterns. Matching and Pareto corrections can also be effective to correct for selected missing data patterns. Other methods, such as Single and Multiple imputations and Machine Learning methods, are less effective. A final discussion provides some elements that help explain these findings.
    Keywords: Inequality, item non-response, missing, prediction
    JEL: D63 C83 C01
    URL: http://d.repec.org/n?u=RePEc:inq:inqwps:ecineq-&r=
  13. By: Mathias Lindholm; Ronald Richman; Andreas Tsanakas; Mario V. Wüthrich
    Abstract: In applications of predictive modeling, such as insurance pricing, indirect or proxy discrimination is an issue of major concern. Namely, there exists the possibility that protected policyholder characteristics are implicitly inferred from non-protected ones by predictive models, and are thus having an undesirable (or illegal) impact on prices. A technical solution to this problem relies on building a best-estimate model using all policyholder characteristics (including protected ones) and then averaging out the protected characteristics for calculating individual prices. However, such approaches require full knowledge of policyholders' protected characteristics, which may in itself be problematic. Here, we address this issue by using a multi-task neural network architecture for claim predictions, which can be trained using only partial information on protected characteristics, and it produces prices that are free from proxy discrimination. We demonstrate the use of the proposed model and we find that its predictive accuracy is comparable to a conventional feedforward neural network (on full information). However, this multi-task network has clearly superior performance in the case of partially missing policyholder information.
    Date: 2022–07
    URL: http://d.repec.org/n?u=RePEc:arx:papers:2207.02799&r=
  14. By: Paolo Brunori (LSE (III) & University of Florence); Pedro Salas-Rojo (LSE (III) & Complutense University); Paolo Verme (World Bank)
    Abstract: The measurement of income inequality is affected by missing observations, especially if they are concentrated on the tails of an income distribution. This paper conducts an experiment to test how the different correction methods proposed by the statistical, econometric and machine learning literature address measurement biases of inequality due to item non-response. We take a baseline survey and artificially corrupt the data employing several alternative non-linear functions that simulate patterns of income non-response, and show how biased inequality statistics can be when item non-responses are ignored. The comparative assessment of correction methods indicates that most methods are able to partially correct for missing data biases. Sample reweighting based on probabilities of non-response produces inequality estimates quite close to true values in most simulated missing data patterns. Matching and Pareto corrections can also be effective to correct for selected missing data patterns. Other methods, such as Single and Multiple imputations and Machine Learning methods, are less effective. A final discussion provides some elements that help explain these findings.
    JEL: D31 D63 E64 O15
    Date: 2022–07
    URL: http://d.repec.org/n?u=RePEc:inq:inqwps:ecineq2022-616&r=
  15. By: Wang, Zuyi; Takagi, Chifumi; Kim, Man-Keun; Chung, Anh
    Keywords: Agribusiness, Marketing, Research Methods/Statistical Methods
    Date: 2022–08
    URL: http://d.repec.org/n?u=RePEc:ags:aaea22:322150&r=
  16. By: Jungyu Ahn; Sungwoo Park; Jiwoon Kim; Ju-hong Lee
    Abstract: Asset allocation using reinforcement learning has advantages such as flexibility in goal setting and utilization of various information. However, existing reinforcement learning asset allocation methods have three shortcomings. First, their state design does not consider portfolio management or financial market characteristics. Second, the models overfit. Third, their training design does not consider the statistical structure of financial time series data. To address these problems, we propose a new reinforcement learning asset allocation method. First, the state of the portfolio managed by the model is considered as the state of the reinforcement learning agent. Second, Monte Carlo simulation data are used to increase training data complexity and so prevent model overfitting; the simulated data can follow different patterns, which increases the complexity of the data. Third, Monte Carlo simulation data are created considering various statistical structures of financial markets. We define the statistical structure of the financial market as the correlation matrix of the assets constituting the financial market. We show experimentally that our method outperforms the benchmark at several test intervals.
    Date: 2022–07
    URL: http://d.repec.org/n?u=RePEc:arx:papers:2207.02458&r=
  17. By: Charl Maree; Christian W. Omlin
    Abstract: Stock portfolio optimization is the process of continuous reallocation of funds to a selection of stocks. This is a particularly well-suited problem for reinforcement learning, as daily rewards are compounding and objective functions may include more than just profit, e.g., risk and sustainability. We developed a novel utility function with the Sharpe ratio representing risk and the environmental, social, and governance score (ESG) representing sustainability. We show that a state-of-the-art policy gradient method - multi-agent deep deterministic policy gradients (MADDPG) - fails to find the optimum policy due to flat policy gradients and we therefore replaced gradient descent with a genetic algorithm for parameter optimization. We show that our system outperforms MADDPG while improving on deep Q-learning approaches by allowing for continuous action spaces. Crucially, by incorporating risk and sustainability criteria in the utility function, we improve on the state-of-the-art in reinforcement learning for portfolio optimization; risk and sustainability are essential in any modern trading strategy and we propose a system that does not merely report these metrics, but that actively optimizes the portfolio to improve on them.
    Date: 2022–06
    URL: http://d.repec.org/n?u=RePEc:arx:papers:2207.02134&r=
  18. By: Dimitrios Vamvourellis; Mate Attila Toth; Dhruv Desai; Dhagash Mehta; Stefano Pasquali
    Abstract: Categorization of mutual funds or Exchange-Traded-funds (ETFs) has long served financial analysts in performing peer analysis for various purposes, from competitor analysis to quantifying portfolio diversification. The categorization methodology usually relies on fund composition data in the structured format extracted from the Form N-1A. Here, we initiate a study to learn the categorization system directly from the unstructured data as depicted in the forms using natural language processing (NLP). Posing this as a multi-class classification problem, with the input data being only the investment strategy description as reported in the form and the target variable being the Lipper Global categories, and using various NLP models, we show that the categorization system can indeed be learned with high accuracy. We discuss implications and applications of our findings as well as limitations of existing pre-trained architectures in applying them to learn fund categorization.
    Date: 2022–07
    URL: http://d.repec.org/n?u=RePEc:arx:papers:2207.04959&r=
  19. By: Carol Corrado; Jonathan Haskel; Massimiliano Iommi; Cecilia Jona-Lasinio
    Abstract: A defining aspect of the digital age is data and its business use. Data have become an important input for firms (e.g., to train artificial intelligence algorithms) but data use is neither accounted for in macroeconomic statistics nor part of business contracts for goods and services provided to customers. This paper puts data and data investments in a framework amenable to measurement and policy analysis aimed at sharpening our understanding of modern economies. Data is conceptualized as an intangible asset: a storable, nonrival (yet excludable) factor input that is only partially captured in existing macroeconomic and financial statistics. We provide experimental estimates of data investment designed to encompass data and data intelligence for six major European countries (France, Germany, Italy, Spain, and the United Kingdom) and we found an average value of 5 to 6.5 percent of market sector gross value added in 2010-2018 (Corrado et al, 2022). We also develop a simulation exercise to test the potential growth contribution of data capital, and we find that even limited diffusion of data capital could raise labor productivity growth as much as ½ percentage point per year, but outcomes are highly dependent on factors influenced by policy settings.
    Keywords: data, innovation, intangible capital, productivity growth
    JEL: E22 O47 E01
    Date: 2022–08–08
    URL: http://d.repec.org/n?u=RePEc:oec:ecoaaa:1723-en&r=
  20. By: Merfeld, Joshua D. (KDI School of Public Policy and Management); Newhouse, David (World Bank); Weber, Michael (University of Chicago); Lahiri, Partha (University of Maryland)
    Abstract: Better understanding the geography of women's labor market outcomes within countries is important to inform targeted efforts to increase women's economic empowerment. This paper assesses the extent to which a method that combines simulated survey data from urban areas in Mexico with broadly available geospatial indicators from Google Earth Engine and OpenStreetMap can significantly improve estimates of labor force participation and unemployment rates. Incorporating geospatial information substantially increases the accuracy of male and female labor force participation and unemployment rates at the state level, reducing mean absolute deviation by 50 to 62 percent for labor force participation and 25 to 52 percent for unemployment. Small area estimation using a nested error conditional random effect model also greatly improves municipal estimates of labor force participation, as the mean absolute error falls by approximately half, while the mean squared error falls by almost 75 percent when holding coverage rates constant. In contrast, the results for municipal unemployment rate estimates are not reliable because values of unemployment rates are low and therefore poorly suited for linear models. The municipal results hold in repeated simulations of alternative samples. Models utilizing Basic Geo-Statistical Area (AGEB)–level auxiliary information generate more accurate predictions than area-level models specified using the same auxiliary data. Overall, integrating survey data and publicly available geospatial indicators is feasible and can greatly improve state-level estimates of male and female labor force participation and unemployment rates, as well as municipal estimates of male and female labor force participation.
    Keywords: small area estimation, data integration, geospatial data, labor force participation, unemployment, Mexico
    JEL: J21 C13
    Date: 2022–06
    URL: http://d.repec.org/n?u=RePEc:iza:izadps:dp15390&r=
  21. By: Merlin Stein
    Abstract: In which kind of companies did the prevalence of women on corporate boards matter during the first wave of Covid-19? 2500 large Indian firms, of which only 78% initially complied with an exogenous gender quota, enable a quantitative evaluation. By comparing their quarterly revenues with a Difference-in-Differences analysis, this research initially finds a significantly positive relationship between the compliance with the one-female-board-member policy and the change in revenues during the first economic shock of the Covid-19 crisis in 2020. A Triple-Difference and Causal Forest analysis indicates that this is likely endogenously driven by (self-)selection based on sectors, capital dynamics, size, independence of directors and further firm characteristics. There is no simple association of female directors and crisis revenues: The spectrum encompasses a mix of firms with a positive, a neutral and a negative association. With rare causal context for evaluating board gender diversity or other corporate governance and ESG dynamics, this piece of research illustrates the value and limitations of applying adjusted Random Forests to overcome linearity and dimensionality limitations for deeply understanding heterogeneities-within-heterogeneities as indicators of relevant distinctions.
    Keywords: Generalized Random Forest, Causal Machine Learning in Double and Triple Difference, Covid-19, large firms, women on corporate boards, female leadership, quota
    JEL: C21 C53 D22 G30 J16 K22
    Date: 2022–01–04
    URL: http://d.repec.org/n?u=RePEc:csa:wpaper:2022-01&r=
  22. By: Jonathan Chiu; Thorsten Koeppl
    Abstract: Why do BigTech platforms introduce payment services? Digital platforms often run business models where activities on the platform generate data that can be monetized off the platform. There is a trade-off between the value of such data and the privacy concerns of users, since platforms need to compensate users for their privacy loss by subsidizing activities. The nature of complementarities between data and payments determines whether and how payment services are provided. When data help to provide better payments (data-driven payments), platforms have too little incentive to adopt. When payments generate additional data (payments-driven data), platforms may adopt payments inefficiently.
    Keywords: Digital currencies and fintech; Payment clearing and settlement systems
    JEL: D8 E42 L1
    Date: 2022–08
    URL: http://d.repec.org/n?u=RePEc:bca:bocawp:22-35&r=

This nep-big issue is ©2022 by Tom Coupé. It is provided as is without any express or implied warranty. It may be freely redistributed in whole or in part for any purpose. If distributed in part, please include this notice.
General information on the NEP project can be found at http://nep.repec.org. For comments please write to the director of NEP, Marco Novarese at <director@nep.repec.org>. Put “NEP” in the subject, otherwise your mail may be rejected.
NEP’s infrastructure is sponsored by the School of Economics and Finance of Massey University in New Zealand.