nep-big New Economics Papers
on Big Data
Issue of 2024–11–04
sixteen papers chosen by
Tom Coupé, University of Canterbury


  1. Explaining Machine Learning by Bootstrapping Partial Marginal Effects and Shapley Values By Thomas R. Cook; Zach Modig; Nathan M. Palmer
  2. Enhancing literature review with NLP methods Algorithmic investment strategies case By Stanisław Łaniewski; Robert Ślepaczuk
  3. Leveraging RNNs and LSTMs for Synchronization Analysis in the Indian Stock Market: A Threshold-Based Classification Approach By Sanjay Sathish; Charu C Sharma
  4. Transfer learning for financial data predictions: a systematic review By V. Lanzetta
  5. Dynamics of REIT Returns and Volatility: Analyzing Time-Varying Drivers Using an Explainable Machine Learning Approach By Hendrik Jenett; Maximilian Nagl; Cathrine Nagl; McKay Price; Wolfgang Schäfers
  6. Uncovering the SDG content of Human Security Policies through a Machine Learning web application By Koundouri, Phoebe; Aslanidis, Panagiotis-Stavros; Dellis, Konstantinos; Feretzakis, Georgios; Plataniotis, Angelos
  7. The Credibility Transformer By Ronald Richman; Salvatore Scognamiglio; Mario V. Wüthrich
  8. Identifying Money Laundering Subgraphs on the Blockchain By Kiwhan Song; Mohamed Ali Dhraief; Muhua Xu; Locke Cai; Xuhao Chen; Arvind; Jie Chen
  9. Predicting Distance matrix with large language models By Jiaxing Yang
  10. Mamba Meets Financial Markets: A Graph-Mamba Approach for Stock Price Prediction By Ali Mehrabian; Ehsan Hoseinzade; Mahdi Mazloum; Xiaohong Chen
  11. Forecasting 2024 US Presidential Election by States Using County Level Data: Too Close to Call By Pesaran, M. H.; Song, H.
  12. Predictive Power of Biological Sex and Gender Identity on Economic Behavior By Stefano Piasenti; Süer Müge
  13. Trading Volume Alpha By Ruslan Goyenko; Bryan T. Kelly; Tobias J. Moskowitz; Yinan Su; Chao Zhang
  14. VickreyFeedback: Cost-efficient Data Construction for Reinforcement Learning from Human Feedback By Guoxi Zhang; Jiuding Duan
  15. Field-scale crop water consumption estimates reveal potential water savings in California agriculture By Boser, Anna; Caylor, Kelly; Larsen, Ashley; Pascolini-Campbell, Madeleine; Reager, John T; Carleton, Tamma
  16. Exchange Rate Narratives By Vito Cormun; Kim Ristolainen

  1. By: Thomas R. Cook; Zach Modig; Nathan M. Palmer
    Abstract: Machine learning and artificial intelligence are often described as “black boxes.” Traditional linear regression is interpreted through its marginal relationships as captured by regression coefficients. We show that the same marginal relationship can be described rigorously for any machine learning model by calculating the slope of the partial dependence functions, which we call the partial marginal effect (PME). We prove that the PME of OLS is analytically equivalent to the OLS regression coefficient. Bootstrapping provides standard errors and confidence intervals around the point estimates of the PMEs. We apply the PME to a hedonic house pricing example and demonstrate that the PMEs of neural networks, support vector machines, random forests, and gradient boosting models reveal the non-linear relationships discovered by the machine learning models and allow direct comparison between those models and a traditional linear regression. Finally we extend PME to a Shapley value decomposition and explore how it can be used to further explain model outputs.
    Keywords: Machine learning; House prices; Statistical inference
    JEL: C14 C18 C15 C45 C52
    Date: 2024–09–20
    URL: https://d.repec.org/n?u=RePEc:fip:fedgfe:2024-75
  2. By: Stanisław Łaniewski (University of Warsaw, Faculty of Economic Sciences, Department of Quantitative Finance and Machine Learning); Robert Ślepaczuk (University of Warsaw, Faculty of Economic Sciences, Department of Quantitative Finance and Machine Learning)
    Abstract: This study utilizes machine learning algorithms to analyze and organize knowledge in the field of algorithmic trading, based on filtering 136 million research papers to 14,342 articles ranging from 1956 to Q1 2020. We compare previously used practices such as keyword-based algorithms and embedding techniques with state-of-the-art dimension reduction and clustering for topic modeling method (BERTopic) to compare the popularity and evolution of different approaches and themes. We show new possibilities created by the last iteration of Large Language Models (LLM) like ChatGPT. The analysis reveals that the number of research articles on algorithmic trading is increasing faster than the overall number of papers. The stocks and main indices comprise more than half of all assets considered, but the growing trend in some classes is much stronger (e.g. cryptocurrencies). Machine learning models have become the most popular methods nowadays, but they are often flawed compared to seemingly simpler techniques. The study demonstrates the usefulness of Natural Language Processing in asking intricate questions about analyzed articles, like comparing the efficiency of different models. We demonstrate the efficiency of LLMs in refining datasets. Our research shows that by breaking tasks into smaller ones and adding reasoning steps, we can effectively address complex questions supported by case analyses.
    Keywords: trading, quantitative finance, neural networks, literature review, knowledge representation, natural language processing (NLP), topic modeling, model comparison, artificial intelligence
    JEL: C4 C15 C22 C45 C53 C58 C61 G11 G14 G15 G17
    Date: 2024
    URL: https://d.repec.org/n?u=RePEc:war:wpaper:2024-16
  3. By: Sanjay Sathish; Charu C Sharma
    Abstract: Our research presents a new approach for forecasting the synchronization of stock prices using machine learning and non-linear time-series analysis. To capture the complex non-linear relationships between stock prices, we utilize recurrence plots (RP) and cross-recurrence quantification analysis (CRQA). By transforming Cross Recurrence Plot (CRP) data into a time-series format, we enable the use of Recurrent Neural Networks (RNN) and Long Short-Term Memory (LSTM) networks for predicting stock price synchronization through both regression and classification. We apply this methodology to a dataset of 20 highly capitalized stocks from the Indian market over a 21-year period. The findings reveal that our approach can predict stock price synchronization with an accuracy of 0.98 and an F1 score of 0.83, offering valuable insights for developing effective trading strategies and risk management tools.
    Date: 2024–08
    URL: https://d.repec.org/n?u=RePEc:arx:papers:2409.06728
  4. By: V. Lanzetta
    Abstract: The literature has highlighted that financial time series data pose significant challenges for accurate stock price prediction, because these data are characterized by noise and susceptibility to news; traditional statistical methodologies make assumptions, such as linearity and normality, which are not suitable for the non-linear nature of financial time series; on the other hand, machine learning methodologies are able to capture non-linear relationships in the data. To date, the neural network is considered the main machine learning tool for financial price prediction. Transfer Learning, as a method aimed at transferring knowledge from source tasks to target tasks, can represent a very useful methodological tool for achieving better financial prediction capability. Current reviews of the above body of knowledge are mainly focused on neural network architectures for financial prediction, with very little emphasis on the transfer learning methodology; thus, this paper aims to go deeper into this topic by developing a systematic review of the application of Transfer Learning to financial market predictions and of the challenges and potential future directions of transfer learning methodologies for stock market predictions.
    Date: 2024–09
    URL: https://d.repec.org/n?u=RePEc:arx:papers:2409.17183
  5. By: Hendrik Jenett; Maximilian Nagl; Cathrine Nagl; McKay Price; Wolfgang Schäfers
    Abstract: In the current context of heightened market tensions driven by rising interest rates, there is vital interest for both researchers and practitioners to understand the dynamics of Real Estate Investment Trust (REIT) returns and their accompanying uncertainties. To address this concern, we examine the drivers of REIT returns and volatility in a time-varying framework, spanning the modern REIT era (1991 to 2022). Our study is the first to simultaneously forecast both REIT returns and their associated volatility using an artificial neural network. We contribute to the literature by opening the black-box character of neural networks, enabling the identification of individual feature impacts on predictions and their evolution over time. The key focus revolves around understanding how the influence of accounting and macroeconomic variables changes during periods of financial crises compared to non-crisis periods. The results showcase superior predictive capabilities of the neural network compared to conventional regression models. We shed light on the intricate interplay of diverse variables influencing the performance of REITs. Our findings hold implications for investors, policymakers and researchers navigating the complex landscape of real estate investments in a dynamically evolving market environment.
    Keywords: Machine Learning; Neural Network; REIT Return; Volatility
    JEL: R3
    Date: 2024–01–01
    URL: https://d.repec.org/n?u=RePEc:arz:wpaper:eres2024-107
  6. By: Koundouri, Phoebe; Aslanidis, Panagiotis-Stavros; Dellis, Konstantinos; Feretzakis, Georgios; Plataniotis, Angelos
    Abstract: This paper introduces a machine learning (ML) based approach for integrating Human Security (HS) and Sustainable Development Goals (SDGs). Originating in the 1990s, HS focuses on strategic, people-centric interventions for ensuring comprehensive welfare and resilience. It closely aligns with the SDGs, together forming the foundation for global sustainable development initiatives. Our methodology involves mapping 44 reports to the 17 SDGs using expert-annotated keywords and advanced ML techniques, resulting in a web-based SDG mapping tool. This tool is specifically tailored for the HS-SDG nexus, enabling the analysis of 13 new reports and their connections to the SDGs. Through this, we uncover detailed insights and establish strong links between the reports and global objectives, offering a nuanced understanding of the interplay between HS and sustainable development. This research provides a scalable framework to explore the relationship between HS and the Paris Agenda, offering a practical, efficient resource for scholars and policymakers.
    Keywords: Artificial Intelligence in Policy Making, Data Mining, Human-Centric Governance Strategies, Human Security, Machine Learning, Sustainable Development Goals
    JEL: C65 O15
    Date: 2024–02–20
    URL: https://d.repec.org/n?u=RePEc:pra:mprapa:121972
  7. By: Ronald Richman; Salvatore Scognamiglio; Mario V. Wüthrich
    Abstract: Inspired by the large success of Transformers in Large Language Models, these architectures are increasingly applied to tabular data. This is achieved by embedding tabular data into low-dimensional Euclidean spaces resulting in similar structures as time-series data. We introduce a novel credibility mechanism to this Transformer architecture. This credibility mechanism is based on a special token that should be seen as an encoder that consists of a credibility weighted average of prior information and observation based information. We demonstrate that this novel credibility mechanism is very beneficial to stabilize training, and our Credibility Transformer leads to predictive models that are superior to state-of-the-art deep learning models.
    Date: 2024–09
    URL: https://d.repec.org/n?u=RePEc:arx:papers:2409.16653
  8. By: Kiwhan Song; Mohamed Ali Dhraief; Muhua Xu; Locke Cai; Xuhao Chen; Arvind; Jie Chen
    Abstract: Anti-Money Laundering (AML) involves the identification of money laundering crimes in financial activities, such as cryptocurrency transactions. Recent studies advanced AML through the lens of graph-based machine learning, modeling the web of financial transactions as a graph and developing graph methods to identify suspicious activities. For instance, a recent effort on opensourcing datasets and benchmarks, Elliptic2, treats a set of Bitcoin addresses, considered to be controlled by the same entity, as a graph node and transactions among entities as graph edges. This modeling reveals the "shape" of a money laundering scheme - a subgraph on the blockchain. Despite the attractive subgraph classification results benchmarked by the paper, competitive methods remain expensive to apply due to the massive size of the graph; moreover, existing methods require candidate subgraphs as inputs which may not be available in practice. In this work, we introduce RevTrack, a graph-based framework that enables large-scale AML analysis with a lower cost and a higher accuracy. The key idea is to track the initial senders and the final receivers of funds; these entities offer a strong indication of the nature (licit vs. suspicious) of their respective subgraph. Based on this framework, we propose RevClassify, which is a neural network model for subgraph classification. Additionally, we address the practical problem where subgraph candidates are not given, by proposing RevFilter. This method identifies new suspicious subgraphs by iteratively filtering licit transactions, using RevClassify. Benchmarking these methods on Elliptic2, a new standard for AML, we show that RevClassify outperforms state-of-the-art subgraph classification techniques in both cost and accuracy. Furthermore, we demonstrate the effectiveness of RevFilter in discovering new suspicious subgraphs, confirming its utility for practical AML.
    Date: 2024–10
    URL: https://d.repec.org/n?u=RePEc:arx:papers:2410.08394
  9. By: Jiaxing Yang
    Abstract: Structural prediction has long been considered critical in RNA research, especially following the success of AlphaFold2 in protein studies, which has drawn significant attention to the field. While recent advances in machine learning and data accumulation have effectively addressed many biological tasks, particularly in protein-related research, RNA structure prediction remains a significant challenge due to data limitations. Obtaining RNA structural data is difficult because traditional methods such as nuclear magnetic resonance spectroscopy, X-ray crystallography, and electron microscopy are expensive and time-consuming. Although several RNA 3D structure prediction methods have been proposed, their accuracy is still limited. Predicting RNA structural information at another level, such as distance maps, remains highly valuable. Distance maps provide a simplified representation of spatial constraints between nucleotides, capturing essential relationships without requiring a full 3D model. This intermediate level of structural information can guide more accurate 3D modeling and is computationally less intensive, making it a useful tool for improving structural predictions. In this work, we demonstrate that using only primary sequence information, we can accurately infer the distances between RNA bases by utilizing a large pretrained RNA language model coupled with a well-trained downstream transformer.
    Date: 2024–09
    URL: https://d.repec.org/n?u=RePEc:arx:papers:2409.16333
  10. By: Ali Mehrabian; Ehsan Hoseinzade; Mahdi Mazloum; Xiaohong Chen
    Abstract: Stock markets play an important role in the global economy, where accurate stock price predictions can lead to significant financial returns. While existing transformer-based models have outperformed long short-term memory networks and convolutional neural networks in financial time series prediction, their high computational complexity and memory requirements limit their practicality for real-time trading and long-sequence data processing. To address these challenges, we propose SAMBA, an innovative framework for stock return prediction that builds on the Mamba architecture and integrates graph neural networks. SAMBA achieves near-linear computational complexity by utilizing a bidirectional Mamba block to capture long-term dependencies in historical price data and employing adaptive graph convolution to model dependencies between daily stock features. Our experimental results demonstrate that SAMBA significantly outperforms state-of-the-art baseline models in prediction accuracy, maintaining low computational complexity. The code and datasets are available at github.com/Ali-Meh619/SAMBA.
    Date: 2024–09
    URL: https://d.repec.org/n?u=RePEc:arx:papers:2410.03707
  11. By: Pesaran, M. H.; Song, H.
    Abstract: This document is a follow up to the paper by Ahmed and Pesaran (2020, AP) and reports state-level forecasts for the 2024 US presidential election. It updates the 3,107 county level data used by AP and uses the same machine learning techniques as before to select the variables used in forecasting voter turnout and the Republican vote shares by states for 2024. The models forecast the non-swing states correctly but give mixed results for the swing states (Nevada, Arizona, Wisconsin, Michigan, Pennsylvania, North Carolina, and Georgia). Our forecasts for the swing states do not make use of any polling data but confirm the very close nature of the 2024 election, much closer than AP’s predictions for 2020. The forecasts are too close to call.
    Keywords: Voter Turnout, Popular and Electoral College Votes, Simultaneity and Recursive Identification, High Dimensional Forecasting Models, Lasso, OCMT
    JEL: C53 C55 D72
    Date: 2024–10–21
    URL: https://d.repec.org/n?u=RePEc:cam:camdae:2464
  12. By: Stefano Piasenti (University of Bologna); Süer Müge (HU Berlin)
    Abstract: Behavioral differences by biological sex are still not fully understood, suggesting that studying gender differences in behavioral traits through the lens of continuous identity might be a promising avenue to understand the remaining observed gender gaps. Using a large U.S. online sample (N=2017) and machine learning, we develop and validate a new continuous gender identity measure consisting of separate femininity and masculinity scores. In a first study, we identify ninety attributes from prior research and conduct an experiment to classify them as feminine and masculine. In a subsequent study, a different group of participants completes tasks designed to elicit behavioral traits that have been previously documented in the behavioral economics literature to exhibit binary gender differences. Data for the second study are collected in two waves; the first wave serves as a training sample, allowing us to identify key attributes predicting behavioral traits, create candidate identity measures, and select the most effective one, comprising sixteen attributes, based on predictive power. Finally, we use the second wave (test sample) to validate our gender identity measure, which outperforms existing ones in explaining gender differences in economic decision-making. We show that confidence, competition, and risk are associated with masculinity, while altruism, equality, and efficiency are associated with femininity, providing new possibilities for targeted policymaking.
    Keywords: Biological sex; Gender identity; Machine learning; Online experiment
    JEL: D91 J16 J62 C91
    Date: 2024–10–11
    URL: https://d.repec.org/n?u=RePEc:rco:dpaper:513
  13. By: Ruslan Goyenko; Bryan T. Kelly; Tobias J. Moskowitz; Yinan Su; Chao Zhang
    Abstract: Portfolio optimization focuses on risk and return prediction, yet implementation costs critically matter. Predicting trading costs is challenging because costs depend on trade size and trader identity, thus impeding a generic solution. We focus on a component of trading costs that applies universally – trading volume. Individual stock trading volume is highly predictable, especially with machine learning. We model the economic benefits of predicting volume through a portfolio framework that trades off tracking error versus net-of-cost performance – translating volume prediction into net-of-cost alpha. The economic benefits of predicting individual stock volume are as large as those from stock return predictability.
    JEL: C45 C53 C55 G00 G11 G12 G17
    Date: 2024–10
    URL: https://d.repec.org/n?u=RePEc:nbr:nberwo:33037
  14. By: Guoxi Zhang; Jiuding Duan
    Abstract: This paper addresses the cost-efficiency aspect of Reinforcement Learning from Human Feedback (RLHF). RLHF leverages datasets of human preferences over outputs of large language models (LLM) to instill human expectations into LLMs. While preference annotation comes with a monetized cost, the economic utility of a preference dataset has not been considered so far. What exacerbates this situation is that, given complex intransitive or cyclic relationships in preference datasets, existing algorithms for fine-tuning LLMs are still far from capturing comprehensive preferences. This raises severe cost-efficiency concerns in production environments, where preference data accumulate over time. In this paper, we see the fine-tuning of LLMs as a monetized economy and introduce an auction mechanism to improve the efficiency of the preference data collection in dollar terms. We show that introducing an auction mechanism can play an essential role in enhancing the cost-efficiency of RLHF while maintaining satisfactory model performance. Experimental results demonstrate that our proposed auction-based protocol is cost-efficient for fine-tuning LLMs by concentrating on high-quality feedback.
    Date: 2024–09
    URL: https://d.repec.org/n?u=RePEc:arx:papers:2409.18417
  15. By: Boser, Anna; Caylor, Kelly; Larsen, Ashley; Pascolini-Campbell, Madeleine; Reager, John T; Carleton, Tamma
    Abstract: Efficiently managing agricultural irrigation is vital for food security today and into the future under climate change. Yet, evaluating agriculture's hydrological impacts and strategies to reduce them remains challenging due to a lack of field-scale data on crop water consumption. Here, we develop a method to fill this gap using remote sensing and machine learning, and leverage it to assess water saving strategies in California's Central Valley. We find that switching to lower water intensity crops can reduce consumption by up to 93%, but this requires adopting uncommon crop types. Northern counties have substantially lower irrigation efficiencies than southern counties, suggesting another potential source of water savings. Other practices that do not alter land cover can save up to 11% of water consumption. These results reveal diverse approaches for achieving sustainable water use, emphasizing the potential of sub-field scale crop water consumption maps to guide water management in California and beyond.
    Keywords: Hydrology, Environmental Sciences, Earth Sciences, Zero Hunger
    Date: 2024–01–01
    URL: https://d.repec.org/n?u=RePEc:cdl:agrebk:qt81j397nv
  16. By: Vito Cormun (Santa Clara University, USA); Kim Ristolainen (Turku School of Economics, University of Turku, Finland)
    Abstract: Leveraging Wall Street Journal news, recent developments in textual analysis, and generative AI, we estimate a narrative decomposition of the dollar exchange rate. Our findings shed light on the connection between economic fundamentals and the exchange rate, as well as on its absence. From the late 1970s onwards, we identify six distinct narratives that explain changes in the exchange rate, each largely non-overlapping. U.S. fiscal and monetary policies play a significant role in the early part of the sample, while financial market news becomes more dominant in the second half. Notably, news on technological change predicts the exchange rate throughout the entire sample period. Finally, using text-augmented regressions, we find evidence that media coverage explains the unstable relationship between exchange rates and macroeconomic indicators.
    Keywords: Exchange rates, big data, textual analysis, macroeconomic news, Wall Street Journal, narrative retrieval, scapegoat
    JEL: C3 C5 F3
    Date: 2024–10
    URL: https://d.repec.org/n?u=RePEc:tkk:dpaper:dp167
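
Illustration for item 1 (not part of the paper): the abstract describes the partial marginal effect (PME) as the slope of a model's partial dependence function and states that, for OLS, it equals the regression coefficient. The sketch below checks that claim numerically on simulated data. The function names, the central finite difference, and the choice to evaluate the slope at the sample mean of the feature are assumptions made for this example; the abstract does not specify how the authors compute or aggregate the slope, and bootstrapped standard errors are omitted.

```python
import numpy as np


def partial_dependence(predict, X, j, grid):
    """Average prediction when feature j is forced to each value in grid."""
    values = []
    for v in grid:
        X_mod = X.copy()
        X_mod[:, j] = v  # overwrite feature j for every observation
        values.append(predict(X_mod).mean())
    return np.array(values)


def partial_marginal_effect(predict, X, j, eps=1e-4):
    """Central finite-difference slope of the partial dependence function.

    Assumption: the slope is evaluated at the sample mean of feature j.
    """
    x0 = X[:, j].mean()
    lo, hi = partial_dependence(predict, X, j, [x0 - eps, x0 + eps])
    return (hi - lo) / (2 * eps)


# Simulated data (hypothetical, for illustration only).
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
y = 1.0 + 2.0 * X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.1, size=500)

# Fit OLS with an intercept via least squares.
A = np.column_stack([np.ones(len(X)), X])
beta, *_ = np.linalg.lstsq(A, y, rcond=None)


def ols_predict(X_new):
    """Prediction function of the fitted linear model."""
    return beta[0] + X_new @ beta[1:]


pme = np.array([partial_marginal_effect(ols_predict, X, j) for j in range(2)])
print("OLS coefficients:", beta[1:])
print("PMEs:            ", pme)
```

Because `partial_marginal_effect` only needs a `predict` callable, the same function could be pointed at any fitted model that maps a feature matrix to predictions; for a non-linear model the slope would generally vary with the evaluation point.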

This nep-big issue is ©2024 by Tom Coupé. It is provided as is without any express or implied warranty. It may be freely redistributed in whole or in part for any purpose. If distributed in part, please include this notice.
General information on the NEP project can be found at https://nep.repec.org. For comments please write to the director of NEP, Marco Novarese at <director@nep.repec.org>. Put “NEP” in the subject, otherwise your mail may be rejected.
NEP’s infrastructure is sponsored by the School of Economics and Finance of Massey University in New Zealand.