Speech and Language Technologies for Low-Resource Languages : Second International Conference, SPELLL 2023, Perundurai, Erode, India, December 6-8, 2023, Revised Selected Papers


Author:	Chakravarthi Bharathi Raja
Title:	Speech and Language Technologies for Low-Resource Languages : Second International Conference, SPELLL 2023, Perundurai, Erode, India, December 6-8, 2023, Revised Selected Papers
Publication:	Cham : Springer International Publishing AG, 2024
	©2024
Edition:	1st ed.
Physical description:	1 online resource (470 pages)
Other authors:	B, Bharathi ; García Cumbreras, Miguel Ángel ; Jiménez Zafra, Salud María ; Subramanian, Malliga ; Shanmugavadivel, Kogilavani ; Nakov, Preslav
Contents note:	Intro -- Preface -- Organization -- Contents -- Language Resources -- PolitiKweli: A Swahili-English Code-Switched Twitter Political Misinformation Classification Dataset -- 1 Introduction -- 2 Code-Switching on Social Media -- 3 Political Misinformation on Social Media -- 3.1 Case of Kenyan Politics on Twitter -- 4 Related Work -- 5 Methodology -- 5.1 Data Collection -- 5.2 Data Processing -- 5.3 Data Annotation -- 5.4 Data Analysis -- 6 Experimentation and Results -- 7 Conclusion and Future Work -- References -- Telugu Meme Dataset and Baseline System for Automatic Identification of Domain, and Troll in Memes -- 1 Introduction -- 2 Literature Review -- 3 Meme Data-Set Discussion -- 3.1 Dataset Creation -- 3.2 Backend Server for Dataset -- 3.3 Data Cleaning -- 3.4 Dataset Analysis -- 4 Meme Classification -- 4.1 Methodology -- 4.2 Models and Their Features -- 5 Results -- 5.1 Troll and Non-troll Classification -- 5.2 Domain Classification -- 5.3 Emotion Classification -- 6 Conclusions and Future Trends -- References -- SamPar: A Marathi Hate Speech Dataset for Homophobia, Transphobia -- 1 Introduction -- 2 Related Work -- 3 Dataset Creation -- 3.1 Search Criteria -- 3.2 Data Annotation -- 3.3 Annotation Process -- 3.4 Dataset Statistics -- 3.5 Privacy Concerns -- 4 Experiments -- 4.1 TF-IDF and NLTK Baseline Experiment -- 4.2 Data Compression-Based Classifier Using Normalized Compression Distance (NCD) -- 4.3 Deep Learning Models -- 4.4 Discussion -- 5 Conclusion and Future Work -- References -- L3Cube-MahaNews: News-Based Short Text and Long Document Classification Datasets in Marathi -- 1 Introduction -- 2 Related Work -- 3 Curating the Dataset -- 3.1 Data Collection -- 3.2 Data Statistics -- 4 Evaluation -- 4.1 Experiment Setup -- 4.2 Results -- 5 Conclusion -- References.
	Creation and Classification of Kannada Meme Dataset: Exploring Domain and Troll Categories -- 1 Introduction -- 2 Related Works -- 3 Kannada Dataset -- 3.1 Data Gathering and Refinement -- 3.2 Dataset Annotation and Inter-Annotator Agreement -- 3.3 Data Statistics and Analysis -- 3.4 Text Extraction and Data Preprocessing -- 4 Kannada Meme Classification -- 4.1 Models and Their Features -- 4.2 Image Modality -- 4.3 Text Modality -- 4.4 Multimodal Approach -- 5 Results -- 5.1 Domain Classification -- 5.2 Troll Classification -- 6 Conclusions and Future Trends -- References -- The Impact of Tamil Python Programming in Wikisource -- 1 Introduction -- 2 Wikisource and Tamil Wikisource -- 3 Contribution of Wikimedians -- 3.1 Python Automation for Other Projects Using Wikisource Data -- 3.2 Tamil Wikisource Contributor Info-Farmer -- 3.3 Automation in Wikimedia -- 4 Tamil Python Programming -- 4.1 Python Tags and PyTamil Tags -- 4.2 Wikisource Development for Tamil Python Programming -- 4.3 PAWS -- 5 Conclusion and Future Scope -- References -- Language Technologies -- Natural Language Processing for Tulu: Challenges, Review and Future Scope -- 1 Introduction -- 1.1 Introduction to Tulu -- 1.2 Importance of NLP Technologies for Low-Resource Languages -- 1.3 Scope and Objectives of the Paper -- 2 Present Challenges -- 2.1 Limited Data Availability -- 2.2 Lack of Digital Resources -- 2.3 Code-Mixing and Language Variation -- 2.4 Morphological Complexity -- 2.5 Speech Recognition Challenges -- 2.6 Machine Translation Challenges -- 2.7 Character Recognition -- 3 Existing Digital Resources -- 4 Morphological Analysis and Generation -- 4.1 Morphological Complexity of Tulu -- 4.2 Overview of the Rule-Based Approach -- 4.3 Model Description -- 4.4 System Performance -- 5 Automatic Speech Recognition -- 5.1 Existing Corpus -- 5.2 Corpus Creation and Analysis.
	5.3 Model Overview -- 5.4 System Performance -- 6 Sentiment Analysis -- 6.1 Nature of Corpus Created -- 6.2 Methods for Corpus Creation -- 6.3 Corpus Annotation and Annotator Agreement -- 6.4 Baseline Models -- 7 Kannada-Tulu Machine Translation -- 7.1 Corpus Creation -- 7.2 Model Description -- 7.3 System Performance -- 8 English-Tulu Machine Translation -- 8.1 Overview of the Rule-Based Approach -- 8.2 Model Description -- 8.3 System Performance -- 9 Character Recognition -- 9.1 Existing Corpus -- 9.2 Corpora Creation -- 9.3 Pre-processing Approaches -- 9.4 Feature Extraction -- 9.5 Model Description -- 9.6 System Performance -- 10 Future Scope -- 11 Conclusion -- References -- DepBoost-TransNet: Boosted Transformer Network for Depression Classification -- 1 Introduction -- 2 Motivation -- 3 Literature Review -- 4 Dataset -- 4.1 Dataset Analysis -- 5 Methodology -- 5.1 Implemented Architecture -- 5.2 Feature Extraction -- 5.3 Key Modules -- 6 Experiments -- 6.1 Without Data Augmentation -- 6.2 With Data Augmentation -- 7 Results -- 8 Conclusion -- 8.1 Limitations -- 8.2 Future Scope -- References -- Optimized BERT Model for Question Answering System on Mobile Platform -- 1 Introduction -- 2 Related Works -- 3 Optimized BERT Based Question-Answering System -- 3.1 Post-training Quantization -- 4 Results and Discussion -- 4.1 Accuracy Score, Model Size and Inference Time -- 4.2 Question Answering Results and Mobile App -- 5 Conclusion -- References -- A Comparative Analysis of Pretrained Models for Sentiment Analysis on Restaurant Customer Reviews (CAPM-SARCR) -- 1 Introduction -- 2 Related Work -- 3 Data Description -- 4 Proposed System -- 4.1 Data Preprocessing -- 4.2 BERT Tokenizer -- 4.3 RoBERTa -- 4.4 mBERT -- 4.5 DeBERTa -- 4.6 XLNet -- 5 Experimental Results and Discussion -- 6 Conclusion -- References.
	Lightweight Language Agnostic Data Sanitization Pipeline for Dealing with Homoglyphs in Code-Mixed Languages -- 1 Introduction -- 2 Related Works -- 2.1 Homoglyphs in NLP -- 2.2 Homoglyph Detection -- 2.3 Code-Mixed Language Models -- 2.4 Candidate Word Generation -- 2.5 N-grams for Information Retrieval -- 3 Proposed Methodology -- 3.1 Datasets -- 3.2 Pipeline -- 4 Experiment -- 4.1 Effects of Homoglyphs on MuRIL -- 4.2 OCR -- 4.3 N-grams and Symspell Training -- 5 Results -- 6 Limitations -- 7 Conclusion -- References -- TextGram: Towards a Better Domain-Adaptive Pretraining -- 1 Introduction -- 2 Motivation -- 3 Related Work -- 4 Experimentation Setup -- 4.1 Datasets -- 4.2 Model Architecture -- 4.3 Data Selection Techniques -- 5 Proposed Technique - TextGram -- 6 Evaluation Results -- 6.1 Fine-Tuning Without Data Selection -- 6.2 Fine-Tuning with Data Selection -- 7 Conclusion and Future Work -- References -- Abusive Social Media Comments Detection for Tamil and Telugu -- 1 Introduction -- 2 Related Work -- 3 Dataset Description -- 4 Methodology -- 5 Results and Discussion -- 5.1 Results -- 5.2 Discussions -- 6 Conclusion -- References -- Sales Forecasting from Group Conversation Using Natural Language Processing -- 1 Introduction -- 2 Literature Survey -- 3 Existing System -- 3.1 Drawbacks -- 4 Proposed System -- 5 Algorithm -- 5.1 Text Blob -- 5.2 Vader -- 6 Module Description -- 6.1 Data Retrieval and Pre-processing -- 6.2 Sentiment Analysis -- 6.3 Exploratory Data Analysis -- 6.4 Data Visualization -- 6.5 Sales Analysis -- 7 Results and Discussion -- 8 Conclusion and Future Work -- References -- Hands in Harmony: Empowering Communication Through Translation -- 1 Introduction -- 2 Related Work -- 3 Proposed Methodology -- 3.1 Module I: Speech-to-Sign Language & Telugu Text Translation.
	3.2 Module II: Translation of Sign Language to Text -- 4 Results and Discussion -- 4.1 Speech-to-Sign Language & Text Translation -- 4.2 Sign Language to Text Translation -- 5 Conclusion -- References -- Offensive Text Detection for Tamil Language -- 1 Introduction -- 2 Background and Related Work -- 3 System Architecture -- 4 Methodology -- 4.1 Dataset -- 4.2 Data Preprocessing -- 4.3 Feature Extraction and Word Embeddings -- 4.4 Building the Model -- 4.5 Detection of Offensive in Tweets -- 5 Discussion and Results -- 6 Conclusion and Future Works -- References -- Telugu-English Abusive Comment Detection Using XLMRoBERTa and mBERT -- 1 Introduction -- 2 Related Work -- 3 Dataset Description -- 4 Proposed Approach -- 5 Results -- 6 Conclusion -- References -- A Knowledge Engineering Framework Addressing High Incidence of Farmer Suicides -- 1 Introduction -- 2 Methodology -- 2.1 Cause-and-Effect Sentences -- 3 Data, Experiments and Results -- 3.1 Extraction of Causal Sentences -- 3.2 Extraction of Causal Words -- 3.3 Probabilities and Grouping -- 4 Conclusion -- References -- Event Categorization from News Articles Using Machine Learning Techniques -- 1 Introduction -- 2 Literature Review -- 3 Proposed System -- 3.1 Pre-processing -- 3.2 Machine Learning Techniques -- 4 Performance Evaluation -- 4.1 Dataset Description -- 5 Conclusion -- References -- From Words to Emotions: Identifying Depression Through Social Media Insights -- 1 Introduction -- 2 Literature Survey -- 3 Methods and Materials -- 3.1 Dataset Description -- 3.2 Preprocessing -- 3.3 Feature Extraction and Word Embedding Technique -- 3.4 Proposed Classifiers -- 4 Results and Discussion -- 4.1 Performance Metrics -- 5 Conclusion and Future Work -- References -- Text Summarisation for Low-Resourced Languages, A Review -- 1 Introduction -- 2 Datasets Generation and Curation.
	3 Summarisation Techniques for African Languages.
Authorized title:	Speech and Language Technologies for Low-Resource Languages
ISBN:	3-031-58495-3
Format:	Printed material
Bibliographic level:	Monograph
Language of publication:	English
Record no.:	9910851998303321
Find it here:	Univ. Federico II

Series: Communications in Computer and Information Science Series