publications 📚 | Lefteris Loukas

2023

Patent Office

System and Method for Automatically Tagging Documents (Patent)

Lefteris Loukas, Eirini Spyropoulou, Prodromos Malakasiotis, Emmanouil Fergadiotis, Ilias Chalkidis, Ion Androutsopoulos, and Georgios Paliouras

World Intellectual Property Organization 2023

Assigned to Ernst & Young Global Limited

System and methods (100) for automatically tagging electronic documents are disclosed. An input module receives (102) an electronic document to be tagged. A preprocessing module then preprocesses (104) the electronic document to be tagged. The preprocessing of the electronic document comprises extracting a text from the electronic document to be tagged, replacing a number or a date in the extracted text with a predetermined symbol, and tokenizing the extracted text with the predetermined symbol into a plurality of tokens. After the preprocessing (104), a deep learning module determines (106) a tag for at least one of the plurality of tokens. The determined tag for the at least one token is then output (108) by an output module.
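The preprocessing step described in the abstract can be sketched roughly as follows. This is only an illustration under stated assumptions: the placeholder symbols `[NUM]` and `[DATE]`, the regular expressions, and the function name `preprocess` are hypothetical and are not taken from the patent, which only speaks of "a predetermined symbol".

```python
import re

# Hypothetical placeholder symbols (the abstract does not name them).
NUM_SYMBOL = "[NUM]"
DATE_SYMBOL = "[DATE]"

# Illustrative patterns only: ISO-style dates and plain or comma-grouped decimals.
DATE_RE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")
NUM_RE = re.compile(r"\b\d+(?:,\d{3})*(?:\.\d+)?\b")

# Tokens are: a placeholder symbol, a run of word characters, or one punctuation mark.
TOKEN_RE = re.compile(
    re.escape(NUM_SYMBOL) + "|" + re.escape(DATE_SYMBOL) + r"|\w+|[^\w\s]"
)


def preprocess(text):
    """Replace dates and numbers with placeholder symbols, then tokenize."""
    text = DATE_RE.sub(DATE_SYMBOL, text)  # dates first, so their digits are not split
    text = NUM_RE.sub(NUM_SYMBOL, text)
    return TOKEN_RE.findall(text)


tokens = preprocess("Revenue was 1,234.5 on 2021-12-31.")
```

In this sketch the resulting token list would then be passed to a tagging model, which is outside the scope of the example.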

2022

ACL 2022

FiNER: Financial Numeric Entity Recognition for XBRL Tagging

Lefteris Loukas, Manos Fergadiotis, Ilias Chalkidis, Eirini Spyropoulou, Prodromos Malakasiotis, Ion Androutsopoulos, and Georgios Paliouras

In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) 2022

Publicly traded companies are required to submit periodic reports with eXtensible Business Reporting Language (XBRL) word-level tags. Manually tagging the reports is tedious and costly. We, therefore, introduce XBRL tagging as a new entity extraction task for the financial domain and release FiNER-139, a dataset of 1.1M sentences with gold XBRL tags. Unlike typical entity extraction datasets, FiNER-139 uses a much larger label set of 139 entity types. Most annotated tokens are numeric, with the correct tag per token depending mostly on context, rather than the token itself. We show that subword fragmentation of numeric expressions harms BERT’s performance, allowing word-level BiLSTMs to perform better. To improve BERT’s performance, we propose two simple and effective solutions that replace numeric expressions with pseudo-tokens reflecting original token shapes and numeric magnitudes. We also experiment with FIN-BERT, an existing BERT model for the financial domain, and release our own BERT (SEC-BERT), pre-trained on financial filings, which performs best. Through data and error analysis, we finally identify possible limitations to inspire future work on XBRL tagging.
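As a rough illustration of the pseudo-token idea mentioned in the abstract, the sketch below maps numeric tokens either to a shape pseudo-token or to a magnitude pseudo-token. The function names, the regular expression, and the `[NUM_k]` format are assumptions made for this example; they are not claimed to match the paper's exact implementation.

```python
import re

# Illustrative definition of a "numeric token": digits with optional
# comma grouping and an optional decimal part.
NUMERIC_RE = re.compile(r"^\d[\d,]*(?:\.\d+)?$")


def shape_token(token):
    """Replace every digit with X so only the token's shape remains."""
    if not NUMERIC_RE.match(token):
        return token  # non-numeric tokens pass through unchanged
    return "[" + re.sub(r"\d", "X", token) + "]"


def magnitude_token(token):
    """Hypothetical magnitude pseudo-token: number of integer digits."""
    if not NUMERIC_RE.match(token):
        return token
    integer_part = token.split(".")[0].replace(",", "")
    return f"[NUM_{len(integer_part)}]"


sentence = ["Dividends", "of", "40,200.5", "and", "53.2"]
shapes = [shape_token(t) for t in sentence]
magnitudes = [magnitude_token(t) for t in sentence]
```

Either mapping keeps each number as a single vocabulary item, so a subword tokenizer that knows these pseudo-tokens does not fragment the numeric expression.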

2021

EcoNLP 2021

EDGAR-CORPUS: Billions of Tokens Make The World Go Round

Lefteris Loukas, Manos Fergadiotis, Ion Androutsopoulos, and Prodromos Malakasiotis

In Proceedings of the Third Workshop on Economics and Natural Language Processing 2021

We release EDGAR-CORPUS, a novel corpus comprising annual reports from all the publicly traded companies in the US spanning a period of more than 25 years. To the best of our knowledge, EDGAR-CORPUS is the largest financial NLP corpus available to date. All the reports are downloaded, split into their corresponding items (sections), and provided in a clean, easy-to-use JSON format. We use EDGAR-CORPUS to train and release EDGAR-W2V, which are WORD2VEC embeddings for the financial domain. We employ these embeddings in a battery of financial NLP tasks and showcase their superiority over generic GloVe embeddings and other existing financial word embeddings. We also open-source EDGAR-CRAWLER, a toolkit that facilitates downloading and extracting future annual reports.

FinNLP 2021

DICoE@FinSim-3: Financial Hypernym Detection using Augmented Terms and Distance-based Features

Lefteris Loukas, Konstantinos Bougiatiotis, Manos Fergadiotis, Dimitris Mavroeidis, and Elias Zavitsanos

In Proceedings of the Third Workshop on Financial Technology and Natural Language Processing 2021

ICAIF 2021

Financial Fraud Detection: A Realistic Evaluation

Elias Zavitsanos, Dimitris Mavroeidis, Konstantinos Bougiatiotis, Eirini Spyropoulou, Lefteris Loukas, and Georgios Paliouras

In Proceedings of the Second ACM International Conference on AI in Finance 2021

In this work, we examine the evaluation process for the task of detecting financial reports with a high risk of containing a misstatement. This task is often referred to in the literature as "misstatement detection in financial reports". We provide an extensive review of the related literature. We propose a new, realistic evaluation framework for the task which, unlike a large part of the previous work: (a) focuses on the misstatement class and its rarity, (b) considers the dimension of time when splitting data into training and test sets, and (c) considers the fact that misstatements can take a long time to detect. Most importantly, we show that the evaluation process significantly affects system performance, and we analyze the performance of different models and feature types in the new realistic framework.
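Points (b) and (c) of the abstract can be illustrated with a small time-aware split. This is a minimal sketch under assumptions, not the paper's evaluation framework: the `Report` record, its field names, and the `temporal_split` function are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Report:
    year: int                     # fiscal year of the report
    misstated: bool               # ground-truth label, possibly known only later
    detected_year: Optional[int]  # year the misstatement became known, if ever


def temporal_split(
    reports: List[Report], cutoff_year: int
) -> Tuple[List[Tuple[Report, bool]], List[Tuple[Report, bool]]]:
    """Train on reports before the cutoff, test on reports from the cutoff on.

    A training label is positive only if the misstatement had already been
    detected before the cutoff, which mimics the detection delay.
    """
    train, test = [], []
    for r in reports:
        if r.year < cutoff_year:
            known = (
                r.misstated
                and r.detected_year is not None
                and r.detected_year < cutoff_year
            )
            train.append((r, known))
        else:
            test.append((r, r.misstated))
    return train, test


reports = [
    Report(2015, True, 2017),   # detected before the cutoff: visible as positive
    Report(2016, True, 2020),   # detected after the cutoff: looks clean in training
    Report(2016, False, None),
    Report(2019, True, 2021),
]
train, test = temporal_split(reports, cutoff_year=2018)
```

Compared with a random split, this arrangement prevents future information from leaking into training, and it exposes the label noise caused by late detection.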

2019

MPS ’19

A Machine Learning Approach for NILM based on Odd Harmonic Current Vectors

Lefteris Loukas, Klajdi Bodurri, Panagiotis Evangelopoulos, Aggelos S. Bouhouras, Nikolay Poulakis, Giorgos C. Christoforidis, Ioannis Panapakidis, and Konstantinos Ch. Chatzisavvas

8th International Conference on Modern Power Systems (MPS) 2019

This paper examines the application of machine learning techniques to NILM methodologies that use the first three odd harmonic order current vectors as the only attributes of the appliances. A proper formulation of the measured current waveform for combinations of appliances is also presented. We apply our methodology to measurements of typical low-voltage residential installations, using harmonic order currents as the input features for both the training and the disaggregation scheme. Our results support the hypothesis that identification performance improves when higher harmonic currents are included in the NILM methodology.
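To make the notion of harmonic current vectors concrete, the sketch below extracts phasors for selected harmonic orders from one period of a sampled current waveform using a direct discrete Fourier sum. It assumes that "the first three odd harmonic orders" means orders 1, 3, and 5, and that the samples cover exactly one fundamental period; the function names are hypothetical and the paper's own feature formulation may differ.

```python
import cmath
import math


def harmonic_phasor(samples, order):
    """Complex amplitude of one harmonic, given samples over exactly one period."""
    n = len(samples)
    acc = sum(
        x * cmath.exp(-2j * math.pi * order * k / n) for k, x in enumerate(samples)
    )
    return 2 * acc / n  # scaled so a cosine of amplitude A yields magnitude A


def odd_harmonic_features(samples, orders=(1, 3, 5)):
    """Flatten the real and imaginary parts of each harmonic phasor into one vector."""
    features = []
    for h in orders:
        p = harmonic_phasor(samples, h)
        features.extend([p.real, p.imag])
    return features


# Synthetic current: fundamental of amplitude 10 plus a phase-shifted 3rd harmonic.
N = 200
current = [
    10 * math.cos(2 * math.pi * k / N) + 2 * math.cos(2 * math.pi * 3 * k / N + 0.5)
    for k in range(N)
]
features = odd_harmonic_features(current)
```

Because the transform is linear, the phasor of a sum of appliance currents equals the sum of the individual phasors, which is why such vectors are a convenient representation for combinations of appliances.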