📚 Technology Stack
Python, Stopword, Tokenizer, Streamlit, scikit-learn
📃 Overview
Fake news is a pressing issue in society, especially in an age of information overload. In this project you will work with Vietnamese text data, build a fake news prediction model, and deploy the model on a simple website.
Tasks
- Preprocessing Vietnamese text
- EDA
- Building a machine learning model
- Deploying the model
Goals
To become familiar with the workflow of a data science problem on raw text data, and to learn how to use some popular libraries.
Details
- Data source: VNFD Dataset
- Dataset of 223 Vietnamese news articles, with 2 labels: 1 (fake news) and 0 (real news)
- Data description: Description of VNFD dataset
- Vietnamese text pre-processing
- Basic text pre-processing steps include: lowercasing, stopword removal, stemming, normalization appropriate to the domain, noise removal (HTML tags, special symbols such as @, #, ...), and punctuation removal
- Reference sources: Overview of text pre-processing
- In this project, you need to inspect the dataset, then choose and experiment with pre-processing methods on your own.
Note 1: for stopword removal and tokenization, you can refer to some Vietnamese sources:
Note 2: if you use a stopword list, a tokenizer, or another normalization tool for Vietnamese, please cite the reference link. If you use one of the suggested sources above, just name the tool used in the paper.
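The basic steps listed above can be sketched with the Python standard library alone. This is a minimal illustration, not a recommended pipeline: the function name `preprocess`, the tiny `STOPWORDS` set, and the regular expressions are assumptions made for the example, and whitespace splitting stands in for a real Vietnamese word segmenter. Stemming is left out of the sketch.

```python
import html
import re
import unicodedata

# Hypothetical, deliberately tiny stopword set for illustration only;
# replace it with a cited Vietnamese stopword list.
STOPWORDS = {
    unicodedata.normalize("NFC", word)
    for word in ("là", "và", "của", "thì", "những", "các")
}

TAG_RE = re.compile(r"<[^>]+>")        # HTML tags
URL_RE = re.compile(r"https?://\S+")   # URLs
NON_WORD_RE = re.compile(r"[^\w\s]")   # punctuation and symbols such as @, #


def preprocess(text: str) -> str:
    """Clean one raw document and return space-separated tokens."""
    text = html.unescape(text)                  # "&amp;" -> "&"
    text = unicodedata.normalize("NFC", text)   # one canonical form for diacritics
    text = TAG_RE.sub(" ", text)                # noise removal: HTML tags
    text = URL_RE.sub(" ", text)                # noise removal: URLs
    text = text.lower()                         # lowercasing
    text = NON_WORD_RE.sub(" ", text)           # punctuation and special symbols
    # Whitespace split is a placeholder for a Vietnamese tokenizer.
    tokens = [token for token in text.split() if token not in STOPWORDS]
    return " ".join(tokens)


print(preprocess("<p>Đây LÀ tin GIẢ!!! @user #hot</p>"))
```

Unicode normalization is included because Vietnamese diacritics can be encoded in more than one way, and a stopword lookup only matches when both sides use the same form.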
Exploratory Data Analysis (EDA)
- Check for missing data, incorrect data types (if any)
- Check whether the label distribution is imbalanced
- Obtain statistical information about the text (average length of each record, etc.)
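The three checks above could look roughly like this with pandas. The toy rows, the column names `text` and `label`, and the helper `summarize` are assumptions for the sketch; the actual layout of the VNFD file should be checked before reusing it.

```python
import pandas as pd

# Toy placeholder rows; in the project, load the VNFD file instead.
# The column names "text" and "label" are assumptions about its layout.
df = pd.DataFrame(
    {
        "text": ["tin thật số một", "tin giả số hai rất dài", None, "tin thật số ba"],
        "label": [0, 1, 1, 0],
    }
)


def summarize(frame: pd.DataFrame) -> dict:
    """Collect missing values, dtypes, label balance, and length statistics."""
    # Word counts per record, based on whitespace splitting.
    lengths = frame["text"].dropna().str.split().str.len()
    return {
        "n_rows": len(frame),
        "missing_per_column": frame.isna().sum().to_dict(),
        "dtypes": frame.dtypes.astype(str).to_dict(),
        "label_counts": frame["label"].value_counts().to_dict(),
        "mean_words": float(lengths.mean()),
        "max_words": int(lengths.max()),
    }


summary = summarize(df)
for key, value in summary.items():
    print(f"{key}: {value}")
```

Counting words by whitespace is only a rough measure for Vietnamese, since one word can span several syllables; a word segmenter would give different lengths.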
Model Training
- Recommended library: scikit-learn
- Linear or non-linear models can be selected
- Choose at least two machine learning models (but not too many), train them on the pre-processed dataset, and evaluate them
- After training, the models need to be saved for inference in the Model Deployment step
Note: The input data for machine learning models needs to be numerical, so feature extraction of the text is required before feeding it into the learning model.
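As one possible way to cover feature extraction, training, evaluation, and saving, the sketch below wraps a TF-IDF vectorizer and a classifier in a scikit-learn pipeline. Everything specific here is an assumption for illustration: the toy sentences (invented placeholders, not VNFD data), the choice of logistic regression and random forest, the `models` directory, and the `.joblib` file names.

```python
from pathlib import Path

import joblib
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Invented toy sentences, used only so the sketch runs; use the pre-processed
# VNFD texts and labels in the project.
TOY_REAL = [
    "chính phủ công bố báo cáo kinh tế quý một",
    "bộ y tế thông báo lịch tiêm chủng mới",
    "thành phố khai trương tuyến xe buýt mới",
    "trường đại học công bố điểm chuẩn năm nay",
    "ngân hàng nhà nước điều chỉnh lãi suất",
    "đội tuyển bóng đá thắng trận giao hữu",
]
TOY_FAKE = [
    "uống nước chanh chữa khỏi mọi bệnh ung thư",
    "người ngoài hành tinh hạ cánh xuống hà nội đêm qua",
    "chia sẻ bài này để nhận ngay mười triệu đồng",
    "bác sĩ giấu bí mật giúp giảm cân sau một đêm",
    "ăn tỏi sống giúp miễn nhiễm mọi loại virus",
    "tin sốc ngày mai cả nước mất điện một tháng",
]
texts = TOY_REAL + TOY_FAKE
labels = [0] * len(TOY_REAL) + [1] * len(TOY_FAKE)  # 0 = real, 1 = fake

MODEL_DIR = Path("models")  # assumed output directory


def train_and_save(texts, labels, model_dir):
    """Train two pipelines, score them on a held-out split, and save both."""
    x_train, x_test, y_train, y_test = train_test_split(
        texts, labels, test_size=0.25, stratify=labels, random_state=42
    )
    # Each pipeline turns text into TF-IDF features, then classifies them.
    candidates = {
        "logreg": make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000)),
        "random_forest": make_pipeline(
            TfidfVectorizer(), RandomForestClassifier(random_state=42)
        ),
    }
    model_dir.mkdir(parents=True, exist_ok=True)
    scores = {}
    for name, model in candidates.items():
        model.fit(x_train, y_train)
        predicted = model.predict(x_test)
        scores[name] = {
            "accuracy": accuracy_score(y_test, predicted),
            "f1": f1_score(y_test, predicted, zero_division=0),
        }
        # The vectorizer is saved together with the classifier.
        joblib.dump(model, model_dir / f"{name}.joblib")
    return scores


scores = train_and_save(texts, labels, MODEL_DIR)
for name, result in scores.items():
    print(name, result)
```

Keeping the vectorizer inside the pipeline means the saved file can take raw text at inference time, so the vocabulary learned during training does not have to be stored separately.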
Model Deployment
- Recommended library: Streamlit. Streamlit lets you code a simple interface in Python and supports free web deployment of machine learning models for demo purposes. The library is quite easy to use, and you can refer to the Streamlit gallery for interface ideas.
- When running the demo, I want to input a text directly, select one of the trained models, and see whether the text is fake or real news. Therefore, you need to load the models saved in the Model Training step.
- After testing the web demo locally, refer to the Deploy Streamlit guide and follow the instructions to deploy the website on the Internet.
Note 1: You need a GitHub account to host this website.
Note 2: Because the dataset is small, high prediction accuracy is not required when testing the demo.
Note 3: The interface only needs to be visually appealing and let users input text, select the model type, and see whether the text is fake or real news.
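A minimal sketch of such a demo is shown below. It assumes the trained models were saved as `.joblib` files holding full scikit-learn pipelines in a `models` directory; that layout, the helper names `find_models` and `predict_label`, and the label wording are all assumptions for illustration.

```python
from pathlib import Path

import joblib

MODEL_DIR = Path("models")  # assumed location of the saved models
LABEL_NAMES = {0: "Real news", 1: "Fake news"}


def find_models(model_dir: Path) -> dict:
    """Map a display name to each saved model file (assumed *.joblib)."""
    return {path.stem: path for path in sorted(model_dir.glob("*.joblib"))}


def predict_label(model, text: str) -> str:
    """Run one text through a fitted pipeline and return a readable label."""
    prediction = int(model.predict([text])[0])
    return LABEL_NAMES[prediction]


def main() -> None:
    # Imported here so the helpers above can be reused without Streamlit.
    import streamlit as st

    st.title("Vietnamese Fake News Detection")
    available = find_models(MODEL_DIR)
    if not available:
        st.error(f"No saved models found in '{MODEL_DIR}'. Train and save models first.")
        return

    model_name = st.selectbox("Model", list(available))
    text = st.text_area("News text")
    if st.button("Predict"):
        if not text.strip():
            st.warning("Please enter some text.")
        else:
            # Only load model files you created yourself.
            model = joblib.load(available[model_name])
            st.success(predict_label(model, text))


if __name__ == "__main__":
    main()
```

Saved as, say, `app.py` (a hypothetical file name), the demo can be started locally with `streamlit run app.py`. Any pre-processing applied before training should also be applied to the input text here, otherwise the model sees text in a different form from what it was trained on.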