<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title><![CDATA[Maximilien Kpizingui Blog]]></title><description><![CDATA[Generative AI & IoT Engineer. Founder of DocQuest. Exploring the intersection of AI and embedded systems]]></description><link>https://maximilien.docquest.io</link><image><url>https://cdn.hashnode.com/uploads/logos/60f1884fcbcd625a5d31fc5b/beda299f-6da4-435d-9abd-1289ac651752.png</url><title>Maximilien Kpizingui Blog</title><link>https://maximilien.docquest.io</link></image><generator>RSS for Node</generator><lastBuildDate>Thu, 16 Apr 2026 22:56:14 GMT</lastBuildDate><atom:link href="https://maximilien.docquest.io/rss.xml" rel="self" type="application/rss+xml"/><language><![CDATA[en]]></language><ttl>60</ttl><item><title><![CDATA[From Document Overload to Instant Insights: Meet DocQuest]]></title><description><![CDATA[If you're like me, your browser probably has 15 tabs open right now.
There's a research PDF you need to finish. A lecture recording you haven't listened to. A meeting transcript from yesterday. And a ]]></description><link>https://maximilien.docquest.io/from-document-overload-to-instant-insights-meet-docquest</link><guid isPermaLink="true">https://maximilien.docquest.io/from-document-overload-to-instant-insights-meet-docquest</guid><category><![CDATA[automation]]></category><category><![CDATA[Build In Public]]></category><category><![CDATA[Productivity]]></category><category><![CDATA[innovation]]></category><category><![CDATA[#ai-tools]]></category><category><![CDATA[SaaS]]></category><category><![CDATA[edutech]]></category><category><![CDATA[learning]]></category><category><![CDATA[university]]></category><category><![CDATA[student]]></category><category><![CDATA[professional]]></category><category><![CDATA[startup]]></category><category><![CDATA[indiehackers]]></category><category><![CDATA[llm]]></category><category><![CDATA[ai agents]]></category><category><![CDATA[education]]></category><category><![CDATA[Applications]]></category><category><![CDATA[Developer]]></category><dc:creator><![CDATA[Maximilien]]></dc:creator><pubDate>Sat, 21 Mar 2026 00:30:15 GMT</pubDate><enclosure url="https://cdn.hashnode.com/uploads/covers/60f1884fcbcd625a5d31fc5b/aad593d5-b9c7-43e9-b2a6-bc040f5b98bd.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>If you're like me, your browser probably has 15 tabs open right now.</p>
<p>There's a research PDF you need to finish. A lecture recording you haven't listened to. A meeting transcript from yesterday. And a dozen other files scattered across Google Drive, Dropbox, and your local downloads folder.</p>
<p>We live in an age of abundant information but scarce understanding.</p>
<p>We consume more content than ever, yet retain less. We switch contexts constantly. We spend more time organizing information than acting on it.</p>
<p>I'm Maximilien Kpizingui, a Generative AI Engineer, and I got tired of drowning in documents. So I built <strong>DocQuest</strong>.</p>
<p>Today, I want to share what it is, why I built it, and how it can help you work smarter—not harder.</p>
<hr />
<h2>🎧 What is DocQuest?</h2>
<p><strong>DocQuest</strong> is the unified AI platform that transforms PDFs, audio, video, and documents into intelligent, actionable podcasts and answers your questions about them instantly.</p>
<p>Think of it as a bridge between your static files and your busy life. Instead of forcing yourself to read every word or watch every minute, DocQuest converts your content into an engaging audio format you can listen to anywhere—on your commute, at the gym, or while cooking.</p>
<p>But it's not just text-to-speech. It's <strong>intelligent understanding</strong>.</p>
<img src="https://cdn.hashnode.com/uploads/covers/60f1884fcbcd625a5d31fc5b/6cee2b9d-1e05-4c2e-87ec-4b509162b5ef.png" alt="The podcast player interface showing waveform and chapters" style="display:block;margin:0 auto" />

<hr />
<h2>❓ Ask Questions. Get Answers. From Any Format.</h2>
<p>This is the feature that changes everything.</p>
<p><strong>Upload a DOCX, PNG, PDF, or TXT file—and just ask.</strong></p>
<blockquote>
<p>"What were the key findings in this research paper?"<br />"Summarize the action items from this meeting transcript."<br />"What does clause 4.2 say about termination?"<br />"Extract all dates and names from this scanned document."</p>
</blockquote>
<p>DocQuest doesn't just read your files. It <strong>understands</strong> them.</p>
<img src="https://cdn.hashnode.com/uploads/covers/60f1884fcbcd625a5d31fc5b/11da27a3-8e5a-4860-a663-217a9b0cf596.png" alt="Q&amp;A interface showing a question being asked about a PDF with the answer highlighted" style="display:block;margin:0 auto" />

<h3>How It Works Across Formats:</h3>
<table>
<thead>
<tr>
<th>Format</th>
<th>What DocQuest Extracts</th>
<th>Example Question</th>
</tr>
</thead>
<tbody><tr>
<td><strong>📄 PDF</strong></td>
<td>Text, tables, figures, metadata</td>
<td>"What's the methodology used in this study?"</td>
</tr>
<tr>
<td><strong>📝 DOCX</strong></td>
<td>Structured text, headings, comments</td>
<td>"List all the deliverables mentioned in this proposal."</td>
</tr>
<tr>
<td><strong>🖼️ PNG/JPG</strong></td>
<td>OCR text extraction from images/scans</td>
<td>"What's the total amount on this invoice screenshot?"</td>
</tr>
<tr>
<td><strong>📄 TXT</strong></td>
<td>Raw text analysis and summarization</td>
<td>"What are the three main arguments in this essay?"</td>
</tr>
<tr>
<td><strong>🔗 URLs</strong></td>
<td>Web content scraping + analysis</td>
<td>"What are the key updates in this blog post?"</td>
</tr>
</tbody></table>
<h3>The Magic: Cross-Document Q&amp;A</h3>
<p>Here's where it gets powerful.</p>
<p><strong>Upload up to 10 files at once</strong>—a PDF research paper, a DOCX project brief, a PNG of a whiteboard sketch, and a TXT of meeting notes.</p>
<p>Then ask:</p>
<blockquote>
<p><em>"What are the common themes across all these documents?"</em></p>
</blockquote>
<p>DocQuest analyzes them <strong>together</strong>, finds connections, and gives you a unified answer.</p>
<p>No more tab-switching. No more manual synthesis. Just ask. Get answers.</p>
<hr />
<h2>🛠️ How It Works (The Tech Behind the Magic)</h2>
<p>I built DocQuest to solve the fragmentation problem. Most tools handle <em>one</em> format. ChatPDF handles PDFs. Otter handles audio. Descript handles video.</p>
<p>DocQuest handles <strong>all of them together</strong>.</p>
<h3>1. Unified Ingestion</h3>
<p>You can upload <strong>PDFs, DOCX, PNG, TXT, Audio, Video, or even URLs</strong>. You can also ingest files directly from Google Drive.</p>
<img src="https://cdn.hashnode.com/uploads/covers/60f1884fcbcd625a5d31fc5b/1e5cc3a4-2c76-4e45-a805-036bdf39565e.png" alt="The upload dashboard showing multiple file types" style="display:block;margin:0 auto" />

<h3>2. Intelligent Parsing &amp; Chunking</h3>
<ul>
<li><p><strong>PDF/DOCX/TXT</strong>: Extract text + structure (headings, tables, lists)</p>
</li>
<li><p><strong>PNG/JPG</strong>: OCR pipeline to extract text from images/scans</p>
</li>
<li><p><strong>Audio/Video</strong>: Transcribe with speaker diarization + timestamping</p>
</li>
<li><p><strong>Proprietary Chunking</strong>: For long content, we split it into 10-minute segments with 5-second overlaps, so context isn't lost at segment boundaries (see the sketch after this list)</p>
</li>
</ul>
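<p>To make the overlap idea concrete, here is a minimal sketch. The function name and parameters are illustrative, not DocQuest's actual implementation:</p>
<pre><code class="lang-python">def chunk_windows(duration_s, chunk_s=600, overlap_s=5):
    """Yield (start, end) windows: 10-minute chunks with a 5-second overlap."""
    start = 0.0
    while start &lt; duration_s:
        end = min(start + chunk_s, duration_s)
        yield (start, end)
        if end == duration_s:
            break
        start = end - overlap_s  # boundary content appears in both chunks

print(list(chunk_windows(1500)))  # a 25-minute recording yields 3 overlapping windows
</code></pre>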
<h3>3. Cross-Format Semantic Embeddings</h3>
<p>This is the technical moat.</p>
<p>We don't just store text. We map:</p>
<ul>
<li><p>PDF text → vector embeddings</p>
</li>
<li><p>Audio transcripts → vector embeddings</p>
</li>
<li><p>Video captions → vector embeddings</p>
</li>
<li><p>Image OCR text → vector embeddings</p>
</li>
</ul>
<p><strong>All into one unified semantic space.</strong></p>
<p>Result? You can ask a question, and DocQuest finds answers across <em>all</em> your files, regardless of format.</p>
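<p>Conceptually, the pipeline looks like the following sketch. The embedding model and the sample chunks are assumptions for illustration, not DocQuest internals:</p>
<pre><code class="lang-python">from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

# Text extracted from different formats all lands in one list of chunks
chunks = [
    {"source": "paper.pdf",   "text": "The study reports improved accuracy."},
    {"source": "meeting.mp3", "text": "Action item: ship the beta on Friday."},   # transcript
    {"source": "board.png",   "text": "Q3 roadmap: tutor, agents, integrations."},  # OCR output
]
vectors = model.encode([c["text"] for c in chunks], normalize_embeddings=True)

query = model.encode(["What are the action items?"], normalize_embeddings=True)[0]
scores = vectors @ query                       # cosine similarity in the shared space
print(chunks[int(scores.argmax())]["source"])  # meeting.mp3
</code></pre>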
<h3>4. Intelligent Output</h3>
<p>Once processed, you get:</p>
<ul>
<li><p><strong>Chaptered Podcasts</strong>: Navigate key insights easily</p>
</li>
<li><p><strong>Cross-Referenced Answers</strong>: Ask once, get insights from 10 files</p>
</li>
<li><p><strong>Actionable Drafts</strong>: Ready-to-use notes, briefs, and summaries</p>
</li>
</ul>
<hr />
<h2>🎓 Beyond Consumption: The AI Tutor</h2>
<p>Here's the feature I'm most proud of.</p>
<p>Most AI tools let you <strong>passively consume</strong> information. You read a summary, you nod, you forget.</p>
<p>DocQuest includes an <strong>Adaptive AI Tutor</strong>.</p>
<p>After you listen to your generated podcast or read an answer, the Tutor quizzes you on key concepts. It explains complex ideas simply and personalizes the learning path to your style.</p>
<p><strong>Why?</strong> Because learning science tells us that <strong>active recall</strong> beats passive reading every time.</p>
<img src="https://cdn.hashnode.com/uploads/covers/60f1884fcbcd625a5d31fc5b/b07b159d-c725-4747-b532-3019a305b01f.png" alt="The AI Tutor chat interface showing a quiz question" style="display:block;margin:0 auto" />

<hr />
<h2>🤖 Automation with AI Agents</h2>
<p>For the professionals out there: time is your most valuable asset.</p>
<p>DocQuest isn't just a reader; it's a worker. We've deployed <strong>6 specialized AI Agents (with 100+ sub-agents)</strong> that work 24/7 to:</p>
<ul>
<li><p>Extract specific data points from contracts (PDF/DOCX)</p>
</li>
<li><p>Generate show notes from meeting recordings (Audio/Video)</p>
</li>
<li><p>Draft emails based on research findings (TXT/PDF)</p>
</li>
<li><p>Summarize scanned documents (PNG) into actionable briefs</p>
</li>
</ul>
<p>You set the task once. The agents handle the rest.</p>
<img src="https://cdn.hashnode.com/uploads/covers/60f1884fcbcd625a5d31fc5b/6b4b022f-aee3-4390-a625-c646a1fcd7fe.png" alt="" style="display:block;margin:0 auto" />

<hr />
<h2>🎯 Who Is This For?</h2>
<p>I built DocQuest with three main personas in mind:</p>
<h3>1. Students &amp; Researchers</h3>
<ul>
<li><p><strong>Problem:</strong> Hundreds of papers (PDF/DOCX) + lecture recordings to synthesize</p>
</li>
<li><p><strong>DocQuest Solution:</strong> Upload all files → ask "What are the key debates in this field?" → get a unified answer + podcast + quiz</p>
</li>
</ul>
<h3>2. Knowledge Workers &amp; Consultants</h3>
<ul>
<li><p><strong>Problem:</strong> Client reports (PDF), meeting notes (TXT), whiteboard photos (PNG) scattered everywhere</p>
</li>
<li><p><strong>DocQuest Solution:</strong> Upload all 10 → ask "What are the top 3 risks?" → get cross-referenced insights + automated brief</p>
</li>
</ul>
<h3>3. HR &amp; L&amp;D Teams</h3>
<ul>
<li><p><strong>Problem:</strong> Onboarding docs (PDF/DOCX), training videos, policy scans (PNG) go unread</p>
</li>
<li><p><strong>DocQuest Solution:</strong> Convert to audio modules + quiz employees on retention → ensure compliance</p>
</li>
</ul>
<hr />
<h2>🛡️ Why DocQuest? (The Differentiators)</h2>
<p>You might be asking: <em>"Isn't this like NotebookLM?"</em></p>
<p>Great question. NotebookLM is fantastic for research notes. But DocQuest is built for <strong>action, learning, and cross-format understanding</strong>.</p>
<table>
<thead>
<tr>
<th>Feature</th>
<th>NotebookLM</th>
<th>DocQuest</th>
</tr>
</thead>
<tbody><tr>
<td><strong>Input Formats</strong></td>
<td>Text-heavy (PDF, Docs)</td>
<td><strong>PDF + DOCX + PNG + TXT + Audio + Video + URLs</strong></td>
</tr>
<tr>
<td><strong>Q&amp;A Scope</strong></td>
<td>Single notebook</td>
<td><strong>Cross-document + cross-format questions</strong></td>
</tr>
<tr>
<td><strong>Output</strong></td>
<td>Summary + Audio Overview</td>
<td><strong>Intelligent Podcast + AI Tutor + Automated Drafts</strong></td>
</tr>
<tr>
<td><strong>Learning</strong></td>
<td>Passive</td>
<td><strong>Active (Quizzes + Retention Tracking)</strong></td>
</tr>
<tr>
<td><strong>Automation</strong></td>
<td>Chat-based</td>
<td><strong>24/7 Autonomous AI Agents</strong></td>
</tr>
<tr>
<td><strong>Privacy</strong></td>
<td>Google Ecosystem</td>
<td><strong>Zero-Training Policy: Your data stays yours</strong></td>
</tr>
</tbody></table>
<p><strong>Privacy is non-negotiable.</strong> Your documents never train our models. Your insights stay yours.</p>
<hr />
<h2>🚀 Try It Free (No Credit Card)</h2>
<p>I launched DocQuest on Product Hunt last week, and the response has been humbling. We saw a <strong>13% free-to-paid conversion rate</strong> in the first week (2-3x the SaaS benchmark) with <strong>zero churn</strong>.</p>
<p>I want you to experience it yourself.</p>
<ul>
<li><p><strong>Free Tier:</strong> Always available</p>
</li>
<li><p><strong>Launch Bonus:</strong> <strong>10,000 free tokens</strong> to test premium features</p>
</li>
<li><p><strong>No Credit Card Required:</strong> Just sign up and start asking questions</p>
</li>
</ul>
<p><strong>👉 Try DocQuest Free:</strong> <a href="https://app.docquest.io">https://app.docquest.io</a></p>
<hr />
<p>It's a solo founder journey, and every line of code is written with the goal of reducing your cognitive load.</p>
<hr />
<h2>👋 Let's Build Together</h2>
<p>DocQuest is still early. We have amazing early users, 100+ inbound emails, and a roadmap full of features (music analysis is coming soon!).</p>
<p>I'd love your honest feedback. Break it. Test it. Tell me what sucks. Tell me what shines.</p>
<p><strong>👉 Try DocQuest Free:</strong> <a href="https://app.docquest.io">https://app.docquest.io</a><br /><strong>🌐 Website:</strong> <a href="https://www.docquest.io">www.docquest.io</a></p>
<p><strong>Stop reading. Start understanding. Start asking.</strong></p>
<hr />
<p><em>Did you find this useful? What format do you wish DocQuest supported next? Let me know in the comments below! 👇</em></p>
<p><strong>Tags:</strong> #AI #Productivity #BuildInPublic #EdTech #SaaS #React #MachineLearning #Startup #DocumentAnalysis #QandA</p>
]]></content:encoded></item><item><title><![CDATA[Revolutionizing Healthcare Conversations: Building a Medical Chatbot Using LlamaIndex and DeepLake On Custom Dataset]]></title><description><![CDATA[Introduction
In today's rapidly evolving society, telemedicine is redefining the contours of patient care. As healthcare providers migrate to virtual consultations, patients seek clarity and relevance in their digital interactions. However, building ...]]></description><link>https://maximilien.docquest.io/revolutionizing-healthcare-conversations-building-a-medical-chatbot-using-llamaindex-and-deeplake-on-custom-dataset</link><guid isPermaLink="true">https://maximilien.docquest.io/revolutionizing-healthcare-conversations-building-a-medical-chatbot-using-llamaindex-and-deeplake-on-custom-dataset</guid><category><![CDATA[LlamaIndex]]></category><category><![CDATA[chatbot]]></category><category><![CDATA[DeepLake]]></category><dc:creator><![CDATA[Maximilien]]></dc:creator><pubDate>Tue, 24 Oct 2023 17:12:25 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1698145936150/f7d0326c-50d3-40f3-b408-0384fc3a25d3.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><strong><mark>Introduction</mark></strong></p>
<p>In today's rapidly evolving society, telemedicine is redefining the contours of patient care. As healthcare providers migrate to virtual consultations, patients seek clarity and relevance in their digital interactions. However, building a chatbot that transcends the typical to offer truly intuitive telemedical interactions is not an easy feat for developers. In this blog, we leverage the transformative potential of LlamaIndex and DeepLake to build a chatbot with unparalleled precision and responsiveness.</p>
<hr />
<h3 id="heading-contents"><strong><mark>Contents</mark></strong></h3>
<ul>
<li><p>LlamaIndex</p>
</li>
<li><p>Deep Lake</p>
</li>
<li><p>Power and Limitations of LLMs</p>
</li>
<li><p>Application Integration: Data Indexing</p>
</li>
<li><p>Application Integration: Query Stage</p>
</li>
<li><p>Code Implementation</p>
</li>
</ul>
<hr />
<h3 id="heading-llamaindex"><mark>LlamaIndex</mark></h3>
<p>LlamaIndex is a simple, flexible data framework for connecting custom data sources to large language models. It offers a suite of essential tools designed to streamline the process of leveraging private or domain-specific data within LLM-powered applications:</p>
<ul>
<li><p>Data Connectors: These versatile components are responsible for ingesting data from their native sources and formats. LlamaIndex supports a wide range of data sources, including APIs, PDF documents, SQL databases, and more.</p>
</li>
<li><p>Data Indexes: LlamaIndex employs data indexing to structure information into intermediate representations that are not only easy for LLMs to understand but also highly performant. These structured data representations serve as a bridge between raw data and natural language understanding.</p>
</li>
<li><p>Engines: LlamaIndex features different types of engines that provide natural language access to your structured data:</p>
<ul>
<li><p>Query Engines: These engines serve as robust retrieval interfaces, enabling knowledge-augmented output. They are ideal for information retrieval tasks and quick access to relevant data.</p>
</li>
<li><p>Chat Engines: For applications requiring interactive and conversational experiences, LlamaIndex provides chat engines that support multi-message, "back and forth" interactions with the data, making it suitable for dynamic conversational interfaces (see the sketch after this list).</p>
</li>
</ul>
</li>
<li><p>Data Agents: LlamaIndex empowers knowledge workers by integrating Large Language Models with various tools, ranging from simple helper functions to API integrations and more. These agents can assist with data-related tasks and augment human decision-making.</p>
</li>
</ul>
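<p>As a quick illustration of the two engine types, here is a hedged sketch (assuming an <code>index</code> like the one built in the implementation below):</p>
<pre><code class="lang-python">query_engine = index.as_query_engine()  # one-shot, retrieval-augmented Q&amp;A
chat_engine = index.as_chat_engine()    # multi-turn, keeps conversation state

print(query_engine.query("What are the symptoms of malaria?"))
print(chat_engine.chat("And what precautions should I take?"))
</code></pre>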
<p>Application Integrations: LlamaIndex integrates seamlessly with the broader ecosystem of applications through two key stages:</p>
<ul>
<li><p>Data indexing Stage</p>
</li>
<li><p>Query stage</p>
</li>
</ul>
<h3 id="heading-deep-lake"><mark>Deep Lake</mark></h3>
<p>Deep Lake is a database optimized for deep learning and AI applications, powered by a specialized storage format. Deep Lake can be used for:</p>
<ul>
<li><p>Storing data and vectors while building applications</p>
</li>
<li><p>Managing datasets while training deep learning models</p>
</li>
</ul>
<p>Deep Lake simplifies the deployment of enterprise-grade LLM-based products by offering storage for all data types (embeddings, audio, text, videos, images, pdfs, annotations, etc.), querying and vector search, data streaming while training models at scale, data versioning and lineage, and integrations with popular tools such as LangChain, LlamaIndex, Weights &amp; Biases, and many more.</p>
<p>Deep Lake works with data of any size, is serverless, and enables you to store all of your data in your own cloud. Deep Lake is used by Intel, Airbus, Matterport, ZERO Systems, Red Cross, Yale, &amp; Oxford. [Read more about <a target="_blank" href="https://docs.activeloop.ai/">Deep Lake</a>.]</p>
<h3 id="heading-power-and-limitations-of-large-language-models"><mark>Power and Limitations of Large Language Models</mark></h3>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1698174526988/e363500b-f108-4ea9-a01f-aeeb1079f540.png" alt class="image--center mx-auto" /></p>
<p>Large Language Models (LLMs) are trained on vast text volumes to learn the word distribution in a language, allowing them to generate meaningful content without direct data memorization. They can recall widespread information like historical events. However, their knowledge is limited to their training data, leading them to potentially "hallucinate" or fabricate details about events or facts after their last training update. This is a concern for applications needing high reliability.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1698184738660/32bc3708-c36d-48ee-aa31-e4e847513cb9.png" alt class="image--center mx-auto" /></p>
<p>A solution to this is using retrievers alongside LLMs. Retrievers fetch accurate information from trusted databases which the LLM uses without adding fictional details.</p>
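<p>In pseudocode, the retriever-augmented flow is simply the following (the <code>retriever</code> and <code>llm</code> objects are hypothetical interfaces, not a specific library API):</p>
<pre><code class="lang-python">def answer(question, retriever, llm):
    passages = retriever.search(question, top_k=3)  # fetch trusted context
    prompt = f"Answer using only this context:\n{passages}\n\nQuestion: {question}"
    return llm.generate(prompt)                     # generation grounded in retrieved facts
</code></pre>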
<h3 id="heading-application-integration-data-indexing-stage"><mark>Application Integration: Data indexing stage</mark></h3>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1698168072117/bd382e5f-4008-4612-9e16-982ad5b47184.jpeg" alt class="image--center mx-auto" /></p>
<p>During this stage, a knowledge base is prepared. This involves organizing and structuring the custom data to make it easily retrievable and accessible. The knowledge base acts as a source of information for the LLM.</p>
<ul>
<li><p>Data Source: This is the external data, which can come in any form (CSV, PDF, Word, Excel, web-based). Depending on the data source, relevant data loaders are used to process the data.</p>
</li>
<li><p>Documents / Nodes: A Document/Node represents a fundamental unit of data in LlamaIndex, containing a chunk of a source document with comprehensive metadata and inter-node relationships for precise retrieval actions.</p>
</li>
<li><p>Data Indexes (VectorStoreIndex): LlamaIndex streamlines data indexing by converting raw documents into intermediary representations, generating vector embeddings, and deducing metadata, with the VectorStoreIndex being a prevalent index format facilitating efficient data retrieval.</p>
</li>
</ul>
<h3 id="heading-application-integration-query-stage"><mark>Application Integration: Query stage</mark></h3>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1698169128533/0fd8eece-4ebf-4374-8bf6-e94dffffd81e.jpeg" alt class="image--center mx-auto" /></p>
<p>In this stage, the system retrieves relevant context from the knowledge base based on a query. This retrieved context is then used to augment the LLM's understanding and generation capabilities. The LLM can utilize this additional information to formulate more informed and accurate responses to user queries.</p>
<h3 id="heading-code-implementation"><mark>Code implementation</mark></h3>
<ul>
<li><p>Installing DeepLake and llama_index</p>
<pre><code class="lang-python">  !pip install deeplake llama_index
</code></pre>
<ul>
<li><p>Importing the required dependencies</p>
<pre><code class="lang-python">  <span class="hljs-keyword">from</span> llama_index <span class="hljs-keyword">import</span> GPTVectorStoreIndex, SimpleDirectoryReader, Document, StorageContext
  <span class="hljs-keyword">from</span> llama_index.vector_stores <span class="hljs-keyword">import</span> DeepLakeVectorStore
  <span class="hljs-keyword">import</span> textwrap
  <span class="hljs-keyword">import</span> getpass
  <span class="hljs-keyword">import</span> os
</code></pre>
</li>
</ul>
</li>
<li><p>Defining the OpenAI API key and Activeloop token using getpass to hide the credentials from public view</p>
</li>
</ul>
<pre><code class="lang-python">os.environ[<span class="hljs-string">"OPENAI_API_KEY"</span>] = getpass.getpass(prompt=<span class="hljs-string">'Enter your OPENAI_API_KEY: '</span>)
os.environ[<span class="hljs-string">"ACTIVELOOP_TOKEN"</span>] = getpass.getpass(prompt=<span class="hljs-string">'Enter your ACTIVELOOP_TOKEN: '</span>)
</code></pre>
<p>When you run this, you'll be prompted to enter the values for each key. The values won't be displayed as you type, which helps maintain security, especially when working in shared or public environments.</p>
<p><mark>Loading the dataset</mark></p>
<p>The datasets used in this code are <em><mark>symptom precautions</mark></em> in <em>XLS format</em> and <em><mark>symptom descriptions</mark></em> in <em>DOC format</em> for various illnesses, as shown below</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1698149083294/e21fb147-c78c-424a-a9e7-664ba0c7d104.png" alt class="image--center mx-auto" /></p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1698149278970/e59add80-130b-46bf-a6e0-01be69538a67.png" alt class="image--center mx-auto" /></p>
<ul>
<li><p>Create a folder in the project directory called data containing the two datasets. Link to download the dataset <a target="_blank" href="https://drive.google.com/drive/folders/1Q5oCv2a3GQa90HnUbxxsLEgSo--09C0I?usp=sharing">here</a></p>
</li>
<li><p><mark>Loading the dataset</mark></p>
</li>
</ul>
<pre><code class="lang-python">path_document = <span class="hljs-string">'./data'</span>  <span class="hljs-comment"># path to the data folder created above</span>
documents = SimpleDirectoryReader(path_document).load_data()
</code></pre>
<ul>
<li><p><mark>Vectorizing and indexing the dataset</mark></p>
<pre><code class="lang-python">  dataset_path = <span class="hljs-string">"hub://activeloop_username/text_embedding"</span>
  vector_store = DeepLakeVectorStore(dataset_path=dataset_path, overwrite=<span class="hljs-literal">True</span>)
  storage_context = StorageContext.from_defaults(vector_store=vector_store)
  index = GPTVectorStoreIndex.from_documents(documents, storage_context=storage_context)
</code></pre>
<p>  This code stores and indexes the documents in a vectorized form. It starts by defining where and how the vectors will be stored (DeepLakeVectorStore), sets up a storage context (StorageContext), and then creates an index (GPTVectorStoreIndex) for the documents using that storage context.</p>
</li>
<li><p>You should see an output like this stating that the dataset has been created successfully in DeepLake</p>
<p>  <img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1698151868688/8c3ca5fd-4a23-4582-814a-60c2262b0d1d.png" alt class="image--center mx-auto" /></p>
</li>
</ul>
<p>Alternatively, if you don't want to store the embeddings in the DeepLake cloud, create the <code>DeepLakeVectorStore</code> without a <code>dataset_path</code> to store them locally</p>
<pre><code class="lang-python">vector_store = DeepLakeVectorStore(overwrite=<span class="hljs-literal">True</span>)
storage_context = StorageContext.from_defaults(vector_store=vector_store)
index = GPTVectorStoreIndex.from_documents(documents, storage_context=storage_context)
</code></pre>
<p>The output of this code creates a folder <em>llama_index</em> containing the tensors</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1698152776522/c3bfc03a-f720-4722-ba9e-0eae58a6d84b.png" alt class="image--center mx-auto" /></p>
<ul>
<li><p><mark>Initializing the query engine</mark></p>
<p>  The query engine is a generic interface that allows you to ask questions about your data.</p>
</li>
</ul>
<pre><code class="lang-python">query_engine = index.as_query_engine()
</code></pre>
<ul>
<li><p><mark>Query the bot about symptoms of illness, precautions to take, cure etc.</mark></p>
<pre><code class="lang-python">  response = query_engine.query(<span class="hljs-string">"What are the symptoms of malaria?"</span>)
</code></pre>
</li>
<li><p><mark>Displaying the output</mark></p>
</li>
</ul>
<pre><code class="lang-python">print(textwrap.fill(str(response), <span class="hljs-number">50</span>))
</code></pre>
<ul>
<li><mark>Output</mark></li>
</ul>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1698155320122/862c3ca8-6a3b-4d94-904f-c8c139d7103f.png" alt class="image--center mx-auto" /></p>
<p>Let's play around it</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1698155480678/58c013c7-29f8-4f3c-bd65-1abbf5c6d4b2.png" alt class="image--center mx-auto" /></p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1698155620740/a675b404-f9bb-49f3-8185-e7899812bead.png" alt class="image--center mx-auto" /></p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1698156266973/ca0d8fc5-5f32-4b18-a17e-7bb0bb1103a3.png" alt class="image--center mx-auto" /></p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1698156494001/a1b14dc5-7327-4220-a4ba-d821a2da040e.png" alt class="image--center mx-auto" /></p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1698165289452/6b70d63e-9f44-4cf8-bb4a-3e1a3485bfb0.png" alt class="image--center mx-auto" /></p>
<p>Bingo, you have made it to the end of this post. You can leverage this technology to build a Q&amp;A system on your company data, so give it a try with a few lines of code. In the upcoming post, we'll dive into building a complex chatbot with memory using Retrieval Augmented Generation (RAG)</p>
<p><strong>Conclusion:</strong></p>
<p>The fusion of LlamaIndex and DeepLake showcases the boundless possibilities of reshaping healthcare communication. Chatbots powered by such sophisticated AI-driven tools not only bridge the gap between patients and providers but also redefine the standards of timely and accurate medical assistance. This synergy of technology and healthcare demonstrates that the future of patient care lies in AI-driven conversations. Whether it's addressing patients' immediate concerns, guiding them through their health journey, or simply offering a comforting digital presence, this new wave of chatbot integration is a testament to how innovative solutions can revolutionize age-old healthcare practices. We strongly believe that, regardless of location or background, people should be able to access reliable health advice at their fingertips.</p>
<hr />
<p>If you want to contribute or you find any errors in this article please do leave me a comment.</p>
<p>You can reach out to me on any of the matrix decentralized servers. My element messenger ID is <a class="user-mention" href="https://hashnode.com/@maximilien">@maximilien</a><mark>:</mark><a target="_blank" href="http://matrix.org">matrix.org</a></p>
<p>If you are in one of the mastodon decentralized servers, here is my ID <a class="user-mention" href="https://hashnode.com/@maximilien">@</a><a target="_blank" href="mailto:maximilien@qoto.org"><strong>maximilien@qoto.org</strong></a></p>
<p>If you are on linkedIn, you can reach me <a target="_blank" href="http://www.linkedin.com/in/maximilien-kpizingui">here</a></p>
<p>If you want to contact me via email <a target="_blank" href="mailto:maximilien@maxtekai.tech"><strong><mark>maximilien@maxtekai.tech</mark></strong></a></p>
<p>If you want to hire me to work on machine learning, data science, IoT and AI-related projects, please reach out to me <a target="_blank" href="http://www.maxtekai.tech">here</a></p>
<p><code>Warm regards,</code></p>
<p><code>Maximilien.</code></p>
]]></content:encoded></item><item><title><![CDATA[Large Language Models: A Dive Into Three Distinct Architectures]]></title><description><![CDATA[Introduction
The rise of large language models has changed the landscape of Natural Language Processing (NLP) dramatically. Their ability to comprehend, generate, and interact using human language has unlocked numerous applications, from chatbots to ...]]></description><link>https://maximilien.docquest.io/large-language-models-a-dive-into-three-distinct-architectures</link><guid isPermaLink="true">https://maximilien.docquest.io/large-language-models-a-dive-into-three-distinct-architectures</guid><category><![CDATA[large language models]]></category><category><![CDATA[nlp transformers]]></category><dc:creator><![CDATA[Maximilien]]></dc:creator><pubDate>Mon, 16 Oct 2023 11:23:30 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1699742376756/1ed79f10-7814-4a8f-b3fb-35d66d1a4547.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<hr />
<p><strong>Introduction</strong></p>
<p>The rise of large language models has changed the landscape of Natural Language Processing (NLP) dramatically. Their ability to comprehend, generate, and interact using human language has unlocked numerous applications, from chatbots to content creation. In this post, we'll journey through three foundational architectures powering these behemoths: Masked Language Models, Causal Language Models, and Sequence-to-Sequence Language Models.</p>
<hr />
<p><strong>1. Masked Language Model (MLM) - Encoding the Unknown</strong></p>
<h4 id="heading-architecture"><strong>Architecture:</strong></h4>
<p>MLM is designed to predict a missing word in a sentence. During training, random words in a sentence are replaced with a '[MASK]' token, and the model learns to predict these masked words.</p>
<p><strong>Example:</strong></p>
<p>Sentence: "I love [MASK] ice cream."</p>
<p>Prediction: "chocolate"</p>
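<p>A minimal sketch of this behavior using the Hugging Face <code>fill-mask</code> pipeline (the model choice is just an example):</p>
<pre><code class="lang-python">from transformers import pipeline

fill = pipeline("fill-mask", model="bert-base-uncased")
for pred in fill("I love [MASK] ice cream."):
    print(pred["token_str"], round(pred["score"], 3))
</code></pre>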
<p><strong>Prominent models</strong></p>
<ul>
<li><p><strong>BERT (Bidirectional Encoder Representations from Transformers):</strong> Developed by Google, BERT revolutionized many NLP tasks by pre-training on large amounts of text and then fine-tuning on specific tasks.</p>
</li>
<li><p><strong>RoBERTa (A Robustly Optimized BERT Pretraining Approach):</strong> A variation of BERT, RoBERTa tweaks the training process and methodology to achieve even better performance.</p>
</li>
<li><p><strong>DistilBERT:</strong> A distilled version of BERT, it maintains most of the performance while being faster and smaller.</p>
</li>
<li><p><strong>ALBERT (A Lite BERT):</strong> It reduces the number of parameters in BERT without a significant drop in performance.</p>
</li>
</ul>
<h4 id="heading-applications"><strong>Applications:</strong></h4>
<ul>
<li><p><strong>Sentiment Analysis</strong>: Determining if a review is positive or negative using models like BERT.</p>
</li>
<li><p><strong>Named Entity Recognition</strong>: Identifying entities such as names, places, and organizations in a sentence with models like DistilBERT.</p>
</li>
<li><p><strong>Question Answering</strong>: Extracting specific answers from large texts, as seen with models like RoBERTa on the SQuAD dataset.</p>
</li>
<li><p><strong>Text Classification</strong>: Categorizing text into predefined groups using models like ALBERT.</p>
</li>
</ul>
<hr />
<p><strong>2. Causal Language Model (CLM) - Decoding the Sequence</strong></p>
<h4 id="heading-architecture-1"><strong>Architecture:</strong></h4>
<p>CLMs, or autoregressive models, generate text by predicting the next word in a sequence based on the previous words. They're "causal" because the prediction at time 't' is only affected by words from time 't-1' and before.</p>
<p><strong>Example:</strong></p>
<p>Seed: "Once upon a time,"</p>
<p>Generated continuation: "... in a land far away, there was a brave knight."</p>
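<p>A minimal sketch using the <code>text-generation</code> pipeline (GPT-2 is just an accessible example of a causal LM):</p>
<pre><code class="lang-python">from transformers import pipeline

generate = pipeline("text-generation", model="gpt2")
out = generate("Once upon a time,", max_new_tokens=20)
print(out[0]["generated_text"])
</code></pre>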
<p><strong>Prominent</strong> <strong>models</strong></p>
<ul>
<li><p><strong>GPT (Generative Pre-trained Transformer):</strong> OpenAI's model that is pre-trained on large corpora and can generate coherent paragraphs of text. Its iterations include GPT-2 and the more recent GPT-3.</p>
</li>
<li><p><strong>CTRL (Conditional Transformer Language Model):</strong> Developed by Salesforce, CTRL can generate content conditioned on control codes, allowing for more specific text generation.</p>
</li>
<li><p><strong>XLNet:</strong> It combines the strengths of both BERT and GPT by predicting tokens in permuted orders rather than strictly left-to-right.</p>
</li>
</ul>
<h4 id="heading-applications-1"><strong>Applications:</strong></h4>
<ul>
<li><p><strong>Text Generation:</strong> Producing coherent paragraphs of text or completing prompts using GPT series.</p>
</li>
<li><p><strong>Storytelling:</strong> Given a starting point, generating a story or narrative as seen with models like CTRL.</p>
</li>
<li><p><strong>Code Generation:</strong> Producing programming code based on prompts, often explored using GPT models.</p>
</li>
<li><p><strong>Creative Writing:</strong> Assisting writers in generating poems, song lyrics, and more.</p>
</li>
</ul>
<hr />
<p><strong>3. Sequence-to-Sequence (Seq2Seq) Model - Encoding to Decoding</strong></p>
<h4 id="heading-architecture-2"><strong>Architecture:</strong></h4>
<p>Seq2Seq models consist of two main parts: an encoder and a decoder. The encoder processes the input sequence and compresses the information into a 'context vector'. The decoder then uses this vector to produce the output sequence.</p>
<p><strong>Example:</strong></p>
<p>Input (Encoder): "Bonjour"</p>
<p>Output (Decoder): "Hello"</p>
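<p>A minimal sketch using a translation pipeline (the MarianMT checkpoint below is one example of a Seq2Seq model):</p>
<pre><code class="lang-python">from transformers import pipeline

translate = pipeline("translation", model="Helsinki-NLP/opus-mt-fr-en")
print(translate("Bonjour")[0]["translation_text"])  # e.g. "Hello"
</code></pre>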
<p><strong>Prominent</strong> <strong>models</strong></p>
<ul>
<li><p><strong>T5 (Text-to-Text Transfer Transformer):</strong> Introduced by Google, T5 treats every NLP problem as a text-to-text problem, making it highly versatile.</p>
</li>
<li><p><strong>BART (Bidirectional and Auto-Regressive Transformers):</strong> Introduced by Facebook AI, BART is trained to auto-encode (with some noise in the text) and has achieved strong performance in tasks like summarization.</p>
</li>
<li><p><strong>MarianMT:</strong> A state-of-the-art Seq2Seq model specifically designed for neural machine translation.</p>
</li>
<li><p><strong>TransformerXL:</strong> While not strictly a Seq2Seq model, TransformerXL introduced mechanisms to remember longer sequences, making it relevant for tasks that benefit from understanding over extended contexts.</p>
</li>
</ul>
<h4 id="heading-applications-2"><strong>Applications:</strong></h4>
<ul>
<li><p><strong>Machine Translation:</strong> Translating a sentence from one language to another, as done by MarianMT.</p>
</li>
<li><p><strong>Text Summarization:</strong> Shortening a lengthy article into a concise summary, a strong suit of BART.</p>
</li>
<li><p><strong>Conversational Agents:</strong> Building chatbots that can have back-and-forth interactions using models like T5.</p>
</li>
<li><p><strong>Text Simplification:</strong> Converting complex sentences into simpler versions for better understanding.</p>
</li>
</ul>
<hr />
<p><strong>Conclusion</strong></p>
<p>The diverse architectures of large language models showcase the breadth and depth of possibilities in NLP. From filling in the gaps with MLMs, spinning tales with CLMs, or translating and summarizing with Seq2Seq, these models have transformed the way machines understand and generate human language. As we continue to push the boundaries of NLP, it's exciting to envision where these foundational architectures will take us next.</p>
<hr />
]]></content:encoded></item><item><title><![CDATA[Parameter Efficient FineTuning In Action:  Finetuning LLMs Using PEFT & LoRA For Causal Language Modeling Task]]></title><description><![CDATA[Hands-on Code Generation Implementation using Codegen pre-trained model- Parameter Efficient Fine-Tuning — LoRA - CausalLM
Introduction
In our ever-evolving AI landscape, the excitement around Language Models is palpable. Yet, as models grow in size,...]]></description><link>https://maximilien.docquest.io/parameter-efficient-finetuning-in-action-finetuning-llms-using-peft-lora-for-causal-language-modeling-task</link><guid isPermaLink="true">https://maximilien.docquest.io/parameter-efficient-finetuning-in-action-finetuning-llms-using-peft-lora-for-causal-language-modeling-task</guid><category><![CDATA[LoRA]]></category><category><![CDATA[#openai #LLMs #langchain #promtTemplate #PromptEngineering #python ]]></category><category><![CDATA[nlp transformers]]></category><category><![CDATA[finetuning]]></category><category><![CDATA[PEFT]]></category><dc:creator><![CDATA[Maximilien]]></dc:creator><pubDate>Sat, 14 Oct 2023 19:02:09 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1697304421581/55608860-a443-4856-8471-e5cccf6d6f0d.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Hands-on Code Generation Implementation using Codegen pre-trained model- Parameter Efficient Fine-Tuning — LoRA - CausalLM</p>
<p><mark>Introduction</mark></p>
<p>In our ever-evolving AI landscape, the excitement around Language Models is palpable. Yet, as models grow in size, so do the challenges tied to fine-tuning them. How do we efficiently adapt colossal models to specific tasks without extensive computational costs? Welcome to a deep dive into the world of Parameter-Efficient Fine-Tuning! Today, we'll unravel the mystique around fine-tuning Large Language Models (LLMs) using cutting-edge techniques like PEFT and LoRA. If code generation and language modeling intrigue you, strap in for a hands-on walkthrough with the Codegen pre-trained model. By the end of this journey, not only will you grasp the nuances of these techniques, but you'll also have a clear road map to implement them in your projects.</p>
<hr />
<p><mark>Workflow of the code</mark></p>
<ul>
<li><p>LoRA</p>
</li>
<li><p>PEFT</p>
</li>
<li><p>Causal Language Modeling</p>
</li>
<li><p>Codegen</p>
</li>
<li><p>Installing dependencies</p>
</li>
<li><p>Loading the required libraries</p>
</li>
<li><p>Loading the base pre-trained model for causal language modeling</p>
</li>
<li><p>Loading the tokenizer</p>
</li>
<li><p>Initializing the LoRA configuration</p>
</li>
<li><p>Loading the dataset from hugging face</p>
</li>
<li><p>Splitting the dataset into train and val</p>
</li>
<li><p>Defining function to Tokenize and process prompt template</p>
</li>
<li><p>Tokenizing train and val dataset into tensor acceptable by the trainer</p>
</li>
<li><p>Defining the metric function</p>
</li>
<li><p>Initializing seed for reproducibility</p>
</li>
<li><p>Initializing the trainer's arguments and the trainer</p>
</li>
<li><p>Training the base pre-trained model</p>
</li>
<li><p>Saving the finetuned model and its tokenizer</p>
</li>
<li><p>Loading the finetuned model</p>
</li>
<li><p>Defining inference function</p>
</li>
<li><p>Crafting 3 prompt templates</p>
</li>
<li><p>Testing</p>
</li>
</ul>
<hr />
<p><mark>LoRA</mark></p>
<p>LoRA stands for Low-Rank Adaptation of Large Language Models. It is a technique that accelerates the fine-tuning of large models while consuming less memory. To make fine-tuning more efficient, LoRA represents the weight updates with two smaller matrices (called update matrices) obtained through low-rank decomposition. These new matrices can be trained to adapt to the new data while keeping the overall number of changes low. The original weight matrix remains frozen and doesn't receive any further adjustments. To produce the final results, both the original and the adapted weights are combined, as in the sketch below.</p>
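<p>A minimal numeric sketch of this decomposition (dimensions and initialization are illustrative):</p>
<pre><code class="lang-python">import torch

d, k, r = 768, 768, 8          # frozen weight is d x k; the adapters have rank r
W = torch.randn(d, k)          # pre-trained weight, stays frozen
A = torch.randn(r, k) * 0.01   # trainable update matrix A (r x k) ...
B = torch.zeros(d, r)          # ... and B (d x r); B starts at zero, so the update starts at zero
alpha = 16                     # scaling factor (lora_alpha)

def forward(x):                # x: (batch, k)
    return x @ W.T + (alpha / r) * (x @ A.T @ B.T)  # original path + scaled low-rank update

x = torch.randn(2, k)
print(forward(x).shape)        # torch.Size([2, 768])
# Trainable parameters: r * (k + d) = 12,288 vs. d * k = 589,824 in the frozen weight
</code></pre>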
<p>This approach has several advantages:</p>
<ul>
<li><p>LoRA makes fine-tuning more efficient by drastically reducing the number of trainable parameters.</p>
</li>
<li><p>The original pre-trained weights are kept frozen, which means you can have multiple lightweight and portable LoRA models for various downstream tasks built on top of them.</p>
</li>
<li><p>LoRA is orthogonal to many other parameter-efficient methods and can be combined with many of them.</p>
</li>
<li><p>The performance of models fine-tuned using LoRA is comparable to the performance of fully fine-tuned models.</p>
</li>
<li><p>LoRA does not add any inference latency because adapter weights can be merged with the base model.</p>
</li>
</ul>
<p>LoRA is implemented in the Hugging Face Parameter Efficient Fine-Tuning (PEFT) library. To fine-tune a model using LoRA, you need to:</p>
<ul>
<li><p>Instantiate a base model.</p>
</li>
<li><p>Create a configuration (LoraConfig) where you define LoRA-specific parameters.</p>
</li>
<li><p>Wrap the base model with <mark>get_peft_model()</mark> to get a trainable PeftModel.</p>
</li>
<li><p>Train the PeftModel as you normally would train the base model.</p>
</li>
</ul>
<p><mark>Parameter-Efficient Fine Tuning (PEFT)</mark></p>
<p>PEFT is a method used to freeze the pre-trained model parameters during fine-tuning and add a small number of trainable parameters (the adapters) on top of it. The adapters are trained to learn task-specific information. This approach is very memory-efficient with lower compute usage while producing results comparable to a fully fine-tuned model.</p>
<p><mark>Causal Language Model</mark></p>
<p>Causal language modeling predicts the next token in a sequence of tokens, and the model can only attend to tokens on the left. This means the model cannot see future tokens. GPT-2 is an example of a causal language model. They are frequently used for text generation. You can use these models for creative applications like choosing your text adventure or an intelligent coding assistant like Copilot.</p>
<p><mark>What is Codegen</mark></p>
<p>CodeGen is an autoregressive language model for program synthesis trained sequentially on The Pile, BigQuery, and BigPython. The CodeGen model was proposed in A Conversational Paradigm for Program Synthesis by Erik Nijkamp, Bo Pang, Hiroaki Hayashi, Lifu Tu, Huan Wang, Yingbo Zhou, Silvio Savarese, and Caiming Xiong. CodeGen model checkpoints are available on different pre-training data with variable sizes. The format is: <code>Salesforce/codegen-{size}-{data}</code>, where:</p>
<ul>
<li><p><code>size</code>: <code>350M</code>, <code>2B</code>, <code>6B</code>, <code>16B</code></p>
</li>
<li><p><code>data</code>:</p>
<ul>
<li><p><code>nl</code>: Pre-trained on the Pile</p>
<p>  <code>multi</code>: Initialized with <code>nl</code>, then further pre-trained on multiple programming languages data</p>
<p>  <code>mono</code>: Initialized with <code>multi</code>, then further pre-trained on Python data</p>
</li>
</ul>
</li>
<li><p>For example, <code>Salesforce/codegen-350M-mono</code> used in this tutorial offers a 350 million-parameter checkpoint pre-trained sequentially on the Pile, multiple programming languages, and Python.</p>
</li>
</ul>
<hr />
<ul>
<li><p><mark>Installing dependencies</mark></p>
<p>  The dependencies needed in this tutorial are bitsandbytes, datasets, accelerate, loralib, peft, and transformers. We install them using pip as shown below. To run this code you need to change your Colab runtime to a T4 GPU and enable it. Besides, we use bitsandbytes because it supports 8-bit and 4-bit precision data types, which are useful for loading large models while saving memory.</p>
</li>
</ul>
<pre><code class="lang-python">!pip install -q bitsandbytes datasets accelerate loralib
!pip install -q git+https://github.com/huggingface/peft.git git+https://github.com/huggingface/transformers.git
</code></pre>
<ul>
<li><mark>Importing the libraries</mark></li>
</ul>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> transformers <span class="hljs-keyword">import</span> AutoTokenizer, AutoConfig, AutoModelForCausalLM
<span class="hljs-keyword">from</span> peft <span class="hljs-keyword">import</span> LoraConfig, get_peft_model
<span class="hljs-keyword">from</span> datasets <span class="hljs-keyword">import</span> load_dataset
<span class="hljs-keyword">import</span> bitsandbytes <span class="hljs-keyword">as</span> bnb
<span class="hljs-keyword">from</span> datasets <span class="hljs-keyword">import</span> Dataset
<span class="hljs-keyword">from</span> datasets <span class="hljs-keyword">import</span> load_metric
<span class="hljs-keyword">import</span> transformers
<span class="hljs-keyword">import</span> torch
<span class="hljs-keyword">import</span> numpy <span class="hljs-keyword">as</span> np
<span class="hljs-keyword">import</span> random
<span class="hljs-keyword">import</span> torch.nn <span class="hljs-keyword">as</span> nn
<span class="hljs-keyword">import</span> os
os.environ[<span class="hljs-string">"CUDA_VISIBLE_DEVICES"</span>]=<span class="hljs-string">"0"</span>
</code></pre>
<ul>
<li><mark>Initializing the base pre-trained model</mark></li>
</ul>
<pre><code class="lang-python">model = AutoModelForCausalLM.from_pretrained(
    <span class="hljs-string">"Salesforce/codegen-350M-mono"</span>,
    torch_dtype=torch.float16,
    device_map=<span class="hljs-string">'auto'</span>,
    load_in_8bit=<span class="hljs-literal">True</span>
)
</code></pre>
<p>Let's break down each line:</p>
<ul>
<li><p>AutoModelForCausalLM.from_pretrained("Salesforce/codegen-350M-mono"): This function is responsible for loading a pre-trained model. Here's what each part means:</p>
</li>
<li><p>AutoModelForCausalLM: Refers to the general architecture of the model being loaded. In this case, it's a "causal language model", which is a type of model designed for tasks like text generation. "Causal" here means that the model predicts the next word in a sequence based only on previous words, not future words.</p>
</li>
<li><p>from_pretrained: This function tells the library to load a model that has already been trained (i.e., pre-trained) rather than starting from scratch.</p>
</li>
<li><p>"Salesforce/codegen-350M-mono": This is the identifier for the specific pre-trained model you want to load. In this case, you're loading a model from Salesforce with the identifier codegen-350M-mono. The naming often provides hints about the model; here, it suggests the model may be designed for code generation (codegen) and has around 350 million parameters (350M). mono might hint at it being monolingual, but without further details, this is speculative.</p>
</li>
<li><p>torch_dtype=torch.float16: This sets the data type of the model's parameters. By using a torch.float16 (also known as "half precision"), the model will consume less memory and potentially run faster than using the default torch.float32. However, using half-precision can sometimes result in a slight decrease in model accuracy. It's a trade-off between speed/memory and accuracy.</p>
</li>
<li><p>device_map='auto': This is directing the model to be loaded on the most appropriate computational device available. If you have a GPU available, the library will automatically use it, which can greatly accelerate model computations. If no GPU is available, the model will default to CPU.</p>
</li>
<li><p><mark>Initializing the tokenizer</mark></p>
</li>
</ul>
<pre><code class="lang-python">tokenizer = AutoTokenizer.from_pretrained(<span class="hljs-string">"Salesforce/codegen-350M-mono"</span>)
</code></pre>
<ul>
<li>AutoTokenizer: This is a class in the transformers library that can automatically load the appropriate tokenizer for a given pre-trained model. A tokenizer is responsible for converting human-readable text into a format that the model can understand (typically a sequence of integers) and vice-versa.</li>
</ul>
<pre><code class="lang-python">tokenizer.add_eos_token = <span class="hljs-literal">True</span>
tokenizer.pad_token_id = <span class="hljs-number">0</span>
tokenizer.padding_side = <span class="hljs-string">"left"</span>
</code></pre>
<ol>
<li><p><code>tokenizer.add_eos_token = True</code>:</p>
<ul>
<li>This line tells the tokenizer to automatically add an "end of sentence" (EOS) token at the end of every sequence it tokenizes. In many transformer models, the EOS token is used to signal the conclusion of an input sequence. By setting <code>add_eos_token</code> to <code>True</code>, it ensures that the EOS token is added whenever you tokenize a piece of text using this tokenizer.</li>
</ul>
</li>
<li><p><code>tokenizer.pad_token_id = 0</code>:</p>
<ul>
<li><p>Padding is used in machine learning models, especially in sequence models like transformers, to ensure that all sequences (e.g., sentences) in a batch have the same length. This is important because the underlying computations in neural networks usually require consistent input shapes.</p>
</li>
<li><p>This line sets the identifier of the padding token to <code>0</code>. This means when the tokenizer adds padding tokens to a sequence to make it match the desired length, it will use the token with ID <code>0</code> as the padding token.</p>
</li>
</ul>
</li>
<li><p><code>tokenizer.padding_side = "left"</code>:</p>
<ul>
<li>When adding padding tokens, we can either add them to the start (left) or the end (right) of a sequence. This line specifies that the padding should be added to the start (left) of each sequence. This can be important in certain models or applications where the positioning of padding might influence the model's understanding of the sequence. A quick demonstration follows this list.</li>
</ul>
</li>
</ol>
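<p>A quick way to see the effect of these settings (a hedged sketch; the exact token ids depend on the tokenizer):</p>
<pre><code class="lang-python">batch = tokenizer(["def add(a, b):", "print('hi')"], padding=True, return_tensors="pt")
print(batch["input_ids"])  # the shorter sequence is left-padded with pad_token_id 0
</code></pre>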
<p>    <mark>Freezing the model parameters</mark></p>
<pre><code class="lang-python"><span class="hljs-keyword">for</span> param <span class="hljs-keyword">in</span> model.parameters():
  param.requires_grad = <span class="hljs-literal">False</span>  
  <span class="hljs-keyword">if</span> param.ndim == <span class="hljs-number">1</span>:

    param.data = param.data.to(torch.float32)

model.gradient_checkpointing_enable() 
model.enable_input_require_grads()

<span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">CastOutputToFloat</span>(<span class="hljs-params">nn.Sequential</span>):</span>
  <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">forward</span>(<span class="hljs-params">self, x</span>):</span> <span class="hljs-keyword">return</span> super().forward(x).to(torch.float32)
model.lm_head = CastOutputToFloat(model.lm_head)
</code></pre>
<p>Breaking the code down</p>
<ul>
<li><pre><code class="lang-python">      <span class="hljs-keyword">for</span> param <span class="hljs-keyword">in</span> model.parameters():
          param.requires_grad = <span class="hljs-literal">False</span>
</code></pre>
</li>
</ul>
<p>This loop iterates through all the parameters of the model and sets their <code>requires_grad</code> attribute to <code>False</code>. When a parameter's <code>requires_grad</code> attribute is set to <code>False</code>, it will not update during the backward pass, meaning it remains "frozen" during training. This is useful when you only want to train certain parts of a model or when fine-tuning a new dataset.</p>
<ul>
<li><p>Cast 1-dimensional Parameters to <code>float32</code></p>
<pre><code class="lang-python">  <span class="hljs-keyword">if</span> param.ndim == <span class="hljs-number">1</span>:
      param.data = param.data.to(torch.float32)
</code></pre>
</li>
</ul>
<p>For 1-dimensional parameters (often biases or parameters in normalization layers), the code changes its data type to <code>float32</code>. This can be helpful for stability in training since smaller data types (like <code>float16</code>) can sometimes cause numerical issues, especially for parameters like biases.</p>
<ul>
<li><p>Enabling Gradient Checkpointing:</p>
<pre><code class="lang-python">  model.gradient_checkpointing_enable()
</code></pre>
</li>
</ul>
<p>Gradient checkpointing is a technique used to save memory when training very deep models. Instead of storing all intermediate activations in memory for the backward pass, it recomputes them, trading off computation time for memory. This can be particularly useful when training models on GPUs with limited memory.</p>
<ul>
<li><p>Enabling Input Requirement for Gradients:</p>
<pre><code class="lang-python">  model.enable_input_require_grads()
</code></pre>
</li>
</ul>
<p>This ensures that the model's inputs (more precisely, the embedding outputs) require gradients, which is needed for gradient checkpointing to propagate gradients when the base weights are frozen. It is also useful when you want gradients with respect to the input, such as in adversarial training.</p>
<ul>
<li><p>Creating a Custom Module to Cast Output to <code>float32</code>:</p>
<pre><code class="lang-python">  <span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">CastOutputToFloat</span>(<span class="hljs-params">nn.Sequential</span>):</span>
      <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">forward</span>(<span class="hljs-params">self, x</span>):</span> <span class="hljs-keyword">return</span> super().forward(x).to(torch.float32)
  model.lm_head = CastOutputToFloat(model.lm_head)
</code></pre>
</li>
</ul>
<p>This custom module, <code>CastOutputToFloat</code>, is derived from PyTorch's <code>nn.Sequential</code> class. It overrides the <code>forward</code> method to cast its output to <code>float32</code>. The final line replaces the <code>lm_head</code> of the model with this custom module wrapping the original <code>lm_head</code>. The purpose is likely to ensure that the final predictions (logits) from the model are in the <code>float32</code> data type, which can be helpful for precision and stability reasons, especially if other parts of the model or training process utilize lower precision formats like <code>float16</code>.</p>
<ul>
<li><mark>Printing the number of trainable parameters in the model.</mark></li>
</ul>
<pre><code class="lang-python"><span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">print_trainable_parameters</span>(<span class="hljs-params">model</span>):</span>

    trainable_params = <span class="hljs-number">0</span>
    all_param = <span class="hljs-number">0</span>
    <span class="hljs-keyword">for</span> _, param <span class="hljs-keyword">in</span> model.named_parameters():
        all_param += param.numel()
        <span class="hljs-keyword">if</span> param.requires_grad:
            trainable_params += param.numel()
    print(
        <span class="hljs-string">f"trainable params: <span class="hljs-subst">{trainable_params}</span> || all params: <span class="hljs-subst">{all_param}</span> || trainable%: <span class="hljs-subst">{<span class="hljs-number">100</span> * trainable_params / all_param}</span>"</span>
    )
</code></pre>
<ul>
<li><mark>Initializing the LoRa configuration</mark></li>
</ul>
<pre><code class="lang-python">config = LoraConfig(
    r=<span class="hljs-number">8</span>,
    lora_alpha=<span class="hljs-number">16</span>,
    target_modules=[<span class="hljs-string">"fc_in"</span>],
    lora_dropout=<span class="hljs-number">0.05</span>,
    bias=<span class="hljs-string">"none"</span>,
    task_type=<span class="hljs-string">"CAUSAL_LM"</span>
)

model = get_peft_model(model, config)
print_trainable_parameters(model)
</code></pre>
<p><em>trainable params: 819200 || all params: 357531648 || trainable%: 0.2291265695170012</em></p>
<ol>
<li><p><code>r</code>: This parameter sets the rank of the low-rank adaptation. Essentially, it determines the size of the adaptation parameters. A smaller <code>r</code> means fewer parameters, making the adaptation more parameter-efficient. In this case, it's set to 8 (a worked check of the resulting parameter count follows this list).</p>
</li>
<li><p><code>lora_alpha</code>: This parameter is a scaling factor. LoRA adds a low-rank update to the existing weights, and that update is multiplied by <code>lora_alpha / r</code> before being applied. With <code>lora_alpha</code> set to 16 and <code>r</code> set to 8, the adapter's contribution is scaled by a factor of 2 relative to the raw low-rank product.</p>
</li>
<li><p><code>target_modules</code>: This is a list indicating which modules (or layers) of the model should be adapted using LoRA. In this case, only the module named "fc_in" is set to be adapted.</p>
</li>
<li><p><code>lora_dropout</code>: Specifies the dropout rate applied within the LoRA branch (on its inputs) during training. Dropout is a regularization technique where random activations are "dropped out" or set to zero. Here, a rate of 0.05 indicates that 5% of the activations will be zeroed during each training forward pass.</p>
</li>
<li><p><code>bias</code>: Defines how biases should be handled in the LoRA-adapted layers. Here, it's set to "none", meaning the adapter adds no bias parameters and none of the existing bias parameters are made trainable.</p>
</li>
<li><p><code>task_type</code>: Specifies the type of task the model is intended for. In this case, it's set to "CAUSAL_LM", indicating a causal language modeling task. Causal Language Modeling (CLM) is where the model predicts the next token in a sequence based solely on the previous tokens, as opposed to masked language modeling where the model predicts masked-out tokens based on their context.</p>
</li>
</ol>
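<p>As promised above, the printed figure of 819,200 trainable parameters can be checked by hand. Each adapted layer gains two low-rank matrices: <code>A</code> of shape <code>r × in_features</code> and <code>B</code> of shape <code>out_features × r</code>. The layer dimensions below are an assumption inferred from the reported totals (they match a 350M CodeGen checkpoint with hidden size 1024, MLP inner size 4096, and 20 transformer blocks):</p>
<pre><code class="lang-python">r = 8
n_layers = 20
in_features, out_features = 1024, 4096          # fc_in: hidden size -> MLP inner size
per_layer = r * (in_features + out_features)    # A: r x 1024, B: 4096 x r  -> 40,960
print(n_layers * per_layer)                     # 819200, matching the printout above
</code></pre>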
<ul>
<li><mark>Loading the dataset from Hugging Face</mark></li>
</ul>
<pre><code class="lang-python">dataset = load_dataset(<span class="hljs-string">"theblackcat102/evol-codealpaca-v1"</span>)
</code></pre>
<p>The code downloads the evol-codealpaca-v1 dataset from the Hugging Face datasets hub and stores it in the dataset variable for further use.</p>
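<p>Before splitting, it helps to peek at what was downloaded; each record in this dataset carries an <code>instruction</code> and an <code>output</code> field (a quick inspection, with exact row counts depending on the dataset version):</p>
<pre><code class="lang-python">print(dataset)               # shows the available splits and row counts
print(dataset["train"][0])   # one record: {'instruction': ..., 'output': ...}
</code></pre>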
<ul>
<li><mark>Splitting the dataset into train and validation</mark></li>
</ul>
<pre><code class="lang-python"><span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">split_dataset</span>(<span class="hljs-params">dataset</span>):</span>
    n = int(<span class="hljs-number">0.8</span> * len(dataset[<span class="hljs-string">'train'</span>]))
    train_data = dataset[<span class="hljs-string">'train'</span>][:n]
    val_data = dataset[<span class="hljs-string">'train'</span>][n:]

    dataset[<span class="hljs-string">'train'</span>] = train_data
    dataset[<span class="hljs-string">'validation'</span>] = val_data

    <span class="hljs-keyword">return</span> dataset[<span class="hljs-string">'train'</span>], dataset[<span class="hljs-string">'validation'</span>]
</code></pre>
<p>The <code>split_dataset</code> function splits the training data of a given dataset into training (80%) and validation (20%) sets. This is a common practice in machine learning, ensuring that a separate set of data is available to validate the model's performance after training.</p>
<pre><code class="lang-python">train_data, val_data = split_dataset(dataset)
</code></pre>
<p>The returned tuple from the split_dataset function is unpacked into two separate variables. The first value (the training set) is assigned to the train_data variable, and the second value (the validation set) is assigned to the val_data variable.</p>
<p>After this line of code executes, you'll have:</p>
<ul>
<li><p>train_data: This contains 80% of the original training data from the dataset.</p>
</li>
<li><p>val_data: This contains the remaining 20% of the original training data, and it will be used for validation purposes.</p>
</li>
</ul>
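<p>As an aside, the <code>datasets</code> library ships a built-in helper that produces the same 80/20 split with shuffling, which can be preferable to a positional slice if the source data has any ordering (a sketch of the alternative, not the code used above):</p>
<pre><code class="lang-python">split = dataset["train"].train_test_split(test_size=0.2, seed=42)
train_data, val_data = split["train"], split["test"]
</code></pre>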
<ul>
<li><p><mark>Function definition to tokenize and preprocess prompt template</mark></p>
</li>
</ul>
<pre><code class="lang-python"><span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">tokenize_function</span>(<span class="hljs-params">samples</span>):</span>
    output_str = samples[<span class="hljs-string">'output'</span>] <span class="hljs-keyword">if</span> samples[<span class="hljs-string">'output'</span>] <span class="hljs-keyword">else</span> <span class="hljs-string">"Cannot Find Answer"</span>
    prompt_template = <span class="hljs-string">f"### INSTRUCTION\n<span class="hljs-subst">{samples[<span class="hljs-string">'instruction'</span>]}</span>\n\n### OUTPUT\n<span class="hljs-subst">{output_str}</span>&lt;/s&gt;"</span>
    <span class="hljs-keyword">return</span> tokenizer(prompt_template, truncation=<span class="hljs-literal">True</span>, padding=<span class="hljs-string">'max_length'</span>, max_length=<span class="hljs-number">2048</span>)
</code></pre>
<p>The function accepts a single argument, samples, which is expected to be a dictionary containing at least two keys: instruction and output.</p>
<pre><code class="lang-python">output_str = samples[<span class="hljs-string">'output'</span>] <span class="hljs-keyword">if</span> samples[<span class="hljs-string">'output'</span>] <span class="hljs-keyword">else</span> <span class="hljs-string">"Cannot Find Answer"</span>
</code></pre>
<p>This line checks whether <code>samples['output']</code> exists and is not empty or <code>None</code>. If it has a value, that value is assigned to the <code>output_str</code> variable; if not, the string "Cannot Find Answer" is assigned. This conditional assignment ensures that <code>output_str</code> always has some value.</p>
<pre><code class="lang-python"> prompt_template = <span class="hljs-string">f"### INSTRUCTION\n<span class="hljs-subst">{samples[<span class="hljs-string">'instruction'</span>]}</span>\n\n### OUTPUT\n<span class="hljs-subst">{output_str}</span>&lt;/s&gt;"</span>
</code></pre>
<p>This line creates a formatted string, <code>prompt_template</code>, using the values from the <code>samples</code> dictionary. It follows a specific format where the instruction and output are both clearly labeled. Notice the <code>&lt;/s&gt;</code> token at the end; it is often used as an end-of-sequence token in certain tokenization schemes.</p>
<pre><code class="lang-python"><span class="hljs-keyword">return</span> tokenizer(prompt_template, truncation=<span class="hljs-literal">True</span>, padding=<span class="hljs-string">'max_length'</span>, max_length=<span class="hljs-number">2048</span>)
</code></pre>
<p>This line utilizes the tokenizer (which is expected to be available in the outer scope) to tokenize the prompt_template. The tokenizer is set to truncate sequences if they exceed 2048 tokens and pad shorter sequences to this length.</p>
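<p>To make the behaviour concrete, here is what the function returns for a single record (the sample below is hypothetical, invented for illustration): the tokenizer produces the usual <code>input_ids</code>/<code>attention_mask</code> pair, padded out to the maximum length.</p>
<pre><code class="lang-python">sample = {"instruction": "Write a function that adds two numbers.",
          "output": "def add(a, b):\n    return a + b"}
encoded = tokenize_function(sample)
print(list(encoded.keys()))        # ['input_ids', 'attention_mask']
print(len(encoded["input_ids"]))   # 2048, because of padding='max_length'
</code></pre>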
<ul>
<li><mark>Formatting the train and validation dataset into tensor data acceptable by the trainer</mark></li>
</ul>
<pre><code class="lang-python">train_dataset = Dataset.from_dict(train_data)
mapped_train_dataset = train_dataset.map(tokenize_function, batched=<span class="hljs-literal">False</span>, remove_columns=[<span class="hljs-string">'instruction'</span>, <span class="hljs-string">'output'</span>])
val_dataset = Dataset.from_dict(val_data)
mapped_val_dataset = val_dataset.map(tokenize_function, batched=<span class="hljs-literal">False</span>, remove_columns=[<span class="hljs-string">'instruction'</span>, <span class="hljs-string">'output'</span>])
</code></pre>
<p>Here's a step-by-step breakdown:</p>
<ul>
<li>Dataset Initialization:</li>
</ul>
<pre><code class="lang-python">train_dataset = Dataset.from_dict(train_data)
</code></pre>
<p>This line creates a Dataset object from the train_data dictionary. The Dataset object is a data structure provided by the datasets library that is optimized for large-scale datasets and ML tasks. It enables efficient data processing methods and various utilities.</p>
<ul>
<li>Mapping and Tokenizing Train Dataset:</li>
</ul>
<pre><code class="lang-python">mapped_train_dataset = train_dataset.map(tokenize_function, batched=<span class="hljs-literal">False</span>, remove_columns=[<span class="hljs-string">'instruction'</span>, <span class="hljs-string">'output'</span>])
</code></pre>
<p>The map method applies a given function (in this case, tokenize_function) to each sample in the dataset.</p>
<ul>
<li><p>batched=False: This means the tokenize_function will be applied to individual samples rather than batches of samples.</p>
</li>
<li><p>remove_columns=['instruction', 'output']: After processing each sample with tokenize_function, the original columns 'instruction' and 'output' are removed, since they've been tokenized and formatted and are no longer needed in their raw form.</p>
</li>
<li><p>Dataset Initialization for Validation Data:</p>
</li>
</ul>
<pre><code class="lang-python">val_dataset = Dataset.from_dict(val_data)
</code></pre>
<p>Similarly, a Dataset object for validation data (val_data) is created.</p>
<ul>
<li>Mapping and Tokenizing Validation Dataset:</li>
</ul>
<pre><code class="lang-python">mapped_val_dataset = val_dataset.map(tokenize_function, batched=<span class="hljs-literal">False</span>, remove_columns=[<span class="hljs-string">'instruction'</span>, <span class="hljs-string">'output'</span>])
</code></pre>
<p>Just like with the training data, the validation data is also processed using the tokenize_function. The processed and tokenized data is stored in the mapped_val_dataset object.</p>
<ul>
<li><mark>Defining the function to compute the metrics</mark></li>
</ul>
<pre><code class="lang-python">transformers.logging.set_verbosity_info()
bleu_metric = load_metric(<span class="hljs-string">"bleu"</span>)

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">compute_metrics</span>(<span class="hljs-params">eval_pred</span>):</span>
    predictions, labels = eval_pred
    decoded_preds = [tokenizer.decode(pred, skip_special_tokens=<span class="hljs-literal">True</span>) <span class="hljs-keyword">for</span> pred <span class="hljs-keyword">in</span> predictions]
    decoded_labels = [tokenizer.decode(label, skip_special_tokens=<span class="hljs-literal">True</span>) <span class="hljs-keyword">for</span> label <span class="hljs-keyword">in</span> labels]
    bleu_score = bleu_metric.compute(predictions=decoded_preds, references=decoded_labels)

    <span class="hljs-keyword">return</span> {<span class="hljs-string">"bleu"</span>: bleu_score[<span class="hljs-string">"bleu"</span>]}
</code></pre>
<p>This function is designed to be used during or after the evaluation of a model to compute the BLEU score:</p>
<ul>
<li><p>eval_pred is a tuple containing the predictions from the model and the true labels.</p>
</li>
<li><p>The predictions and labels, which are tokenized sequences, are first decoded into human-readable text using the tokenizer.decode() function. This is necessary because the BLEU metric works on actual text, not tokenized sequences.</p>
</li>
<li><p>The BLEU score is then computed using bleu_metric.compute(), and the result is returned as a dictionary with a single key-value pair: "bleu": bleu_score["bleu"].</p>
</li>
</ul>
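<p>One practical caveat, which depends on how the <code>Trainer</code> is configured (treat the snippet below as a hedged sketch rather than part of the original notebook): by default the <code>Trainer</code> hands <code>compute_metrics</code> raw logits, not token IDs, so <code>tokenizer.decode</code> would fail on float arrays. A common fix is to reduce the logits to IDs first via the <code>preprocess_logits_for_metrics</code> hook:</p>
<pre><code class="lang-python">def preprocess_logits_for_metrics(logits, labels):
    # Some models return a tuple (logits, past_key_values, ...); keep the logits.
    if isinstance(logits, tuple):
        logits = logits[0]
    # Reduce to the most likely token ID at each position so that
    # compute_metrics receives integer sequences it can decode.
    return logits.argmax(dim=-1)

# Wire the hook in when constructing the trainer:
# trainer = transformers.Trainer(..., preprocess_logits_for_metrics=preprocess_logits_for_metrics)
# Inside compute_metrics, labels padded with -100 should also be mapped back to a
# real token ID (e.g. tokenizer.pad_token_id) before decoding.
</code></pre>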
<ul>
<li><p><mark>Initializing seeding for model reproducibility</mark></p>
</li>
</ul>
<pre><code class="lang-python">SEED = <span class="hljs-number">42</span>
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
torch.backends.cudnn.deterministic = <span class="hljs-literal">True</span>
torch.backends.cudnn.benchmark = <span class="hljs-literal">False</span>
<span class="hljs-keyword">if</span> torch.cuda.is_available():
    torch.cuda.manual_seed_all(SEED)
</code></pre>
<p>This part of the code sets various random seeds to ensure reproducibility. When training neural networks, many operations have a random component. By fixing the random seed, the same sequence of random numbers will be generated every time, leading to consistent results across runs. This is important if you want to ensure that someone else running your code, or you running your code at a later time, will get the same results.</p>
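<p>For what it's worth, the <code>transformers</code> library bundles most of these calls into a single helper, which can be used instead of (or alongside) the manual seeding above:</p>
<pre><code class="lang-python">from transformers import set_seed

set_seed(42)  # seeds Python's random, NumPy and PyTorch (including CUDA) in one call
</code></pre>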
<ul>
<li><mark>Trainer initialization</mark></li>
</ul>
<pre><code class="lang-python">trainer = transformers.Trainer(
    model=model,
    train_dataset=mapped_train_dataset,
    compute_metrics=compute_metrics,
    args=transformers.TrainingArguments(
        per_device_train_batch_size=<span class="hljs-number">4</span>,
        gradient_accumulation_steps=<span class="hljs-number">4</span>,
        warmup_steps=<span class="hljs-number">100</span>,
        max_steps=<span class="hljs-number">100</span>,
        learning_rate=<span class="hljs-number">1e-3</span>,
        fp16=<span class="hljs-literal">True</span>,
        logging_steps=<span class="hljs-number">1</span>,
        output_dir=<span class="hljs-string">'outputs'</span>,
    ),
    data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=<span class="hljs-literal">False</span>)
)
model.config.use_cache = <span class="hljs-literal">False</span>
</code></pre>
<p>The Trainer class from the Transformers library is being initialized. It's designed to simplify the process of training, evaluating, and testing transformer models.</p>
<p>Arguments to the Trainer:</p>
<ul>
<li><p>model: The model instance you intend to train. This is typically a pre-initialized or pre-trained instance of a transformer model from the library.</p>
</li>
<li><p>train_dataset: The dataset the model will be trained on. mapped_train_dataset is a processed dataset, likely transformed into a format suitable for training, such as tokenized text.</p>
</li>
<li><p>compute_metrics: A function that computes metrics after an evaluation. This function is typically defined earlier in your code. It calculates metrics (like accuracy, BLEU score, etc.) on the evaluation dataset.</p>
</li>
<li><p>args: This specifies various training-related configurations using the TrainingArguments class:</p>
</li>
<li><p>per_device_train_batch_size: The batch size for training on each device (e.g., each GPU). Here it's set to 4.</p>
</li>
<li><p>gradient_accumulation_steps: Number of forward passes (batches) the model will see before an update (backpropagation) is performed. Here, the model will see 4 batches before an update.</p>
</li>
<li><p>warmup_steps: The number of steps for the learning rate warmup. The learning rate will gradually increase over these many steps at the beginning of training.</p>
</li>
<li><p>max_steps: Maximum number of training steps. Training will stop after 100 steps irrespective of epochs.</p>
</li>
<li><p>learning_rate: Specifies the learning rate for the optimizer. It's set to 0.001.</p>
</li>
<li><p>fp16: Indicates the use of 16-bit (also known as half-precision) floating point numbers during training. Using fp16 can accelerate training.</p>
</li>
<li><p>logging_steps: Interval at which logging will occur. Here, logs will be generated at every step.</p>
</li>
<li><p>output_dir: Directory where training-related outputs (like model checkpoints) will be saved. Here, they will be saved in a folder named 'outputs'.</p>
</li>
<li><p>data_collator: This is responsible for preparing and collating data samples into batched tensors before feeding them into the model. transformers.DataCollatorForLanguageModeling is used here, suited for causal language modeling tasks.</p>
</li>
<li><p>The argument mlm=False indicates that masked language modeling is not used (which makes sense since it's for causal language modeling).</p>
</li>
<li><p>model.config.use_cache = False: This disables the caching mechanism within the transformer model. In transformers, caching can store certain intermediate outputs to speed up the sequential processing of tokens. However, it might be disabled to save memory, especially if the sequences being processed are long.</p>
</li>
</ul>
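<p>Two consequences of these settings are worth spelling out. First, the effective batch size is <code>per_device_train_batch_size × gradient_accumulation_steps = 4 × 4 = 16</code> sequences per optimizer update (per device), and training performs exactly <code>max_steps = 100</code> such updates. Second, <code>warmup_steps</code> equals <code>max_steps</code> here, so the learning rate is still ramping up when training stops; for longer runs you would normally keep the warmup well below the total step count.</p>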
<ul>
<li><p><mark>Training the model</mark></p>
</li>
</ul>
<pre><code class="lang-python">trainer.train()
</code></pre>
<ul>
<li><p>trainer.train(): When this method is called, the model begins training on the dataset specified during the initialization of the Trainer object. The training process will use all the configurations, hyperparameters, and specifications you provided when you created the Trainer instance. It goes through the data in mini-batches as specified by the batch size. For each batch, it feeds the data through the model, computes the loss (difference between the model's predictions and the actual values), and then updates the model's weights using backpropagation. This process is repeated for the number of epochs or steps specified. An epoch is a complete pass through the entire training dataset.</p>
</li>
<li><p><mark>Saving the finetuned model</mark></p>
</li>
</ul>
<pre><code class="lang-python">model.save_pretrained(<span class="hljs-string">"./my_model"</span>)
tokenizer.save_pretrained(<span class="hljs-string">"./my_model"</span>)
</code></pre>
<ul>
<li><mark>Loading the finetuned model</mark></li>
</ul>
<pre><code class="lang-python">loaded_model = AutoModelForCausalLM.from_pretrained(<span class="hljs-string">"./my_model"</span>)
loaded_tokenizer = AutoTokenizer.from_pretrained(<span class="hljs-string">"./my_model"</span>)
</code></pre>
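<p>A subtlety worth flagging here (this depends on your <code>peft</code> and <code>transformers</code> versions, so treat the snippet below as a hedged sketch): because the model was wrapped with <code>get_peft_model</code>, <code>save_pretrained</code> stores only the LoRA adapter weights, not the full model. Older <code>transformers</code> releases cannot load such a folder with <code>AutoModelForCausalLM</code> directly; the explicit route is to reload the base checkpoint and attach the adapter with <code>PeftModel.from_pretrained</code>. The base checkpoint name below is an assumption inferred from the parameter counts reported earlier:</p>
<pre><code class="lang-python">from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

# Reload the frozen base model, then attach the saved LoRA adapter on top.
base_model = AutoModelForCausalLM.from_pretrained("Salesforce/codegen-350M-mono")  # assumed base checkpoint
loaded_model = PeftModel.from_pretrained(base_model, "./my_model")
loaded_tokenizer = AutoTokenizer.from_pretrained("./my_model")
</code></pre>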
<ul>
<li><mark>Inference function</mark></li>
</ul>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> IPython.display <span class="hljs-keyword">import</span> display, Markdown

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">generate_completion</span>(<span class="hljs-params">model, tokenizer, prompt_text, max_length=<span class="hljs-number">100</span></span>):</span>

    model.config.use_cache = <span class="hljs-literal">True</span>
    model.eval()
    input_ids = tokenizer.encode(prompt_text, return_tensors=<span class="hljs-string">"pt"</span>)
    <span class="hljs-keyword">with</span> torch.no_grad():
        output = model.generate(input_ids, max_length=max_length, num_return_sequences=<span class="hljs-number">1</span>,
                                pad_token_id=tokenizer.eos_token_id, eos_token_id=tokenizer.eos_token_id,
                                temperature=<span class="hljs-number">0.1</span>,
                                top_k=<span class="hljs-number">10</span>,
                                top_p=<span class="hljs-number">0.1</span>,
                                do_sample=<span class="hljs-literal">True</span>
                                )
    display(Markdown(tokenizer.decode(output[<span class="hljs-number">0</span>], skip_special_tokens=<span class="hljs-literal">True</span>)))
</code></pre>
<p>The function is defined with the following parameters:</p>
<ul>
<li><p>model: The pre-trained model that you want to use for generating the text.</p>
</li>
<li><p>tokenizer: The tokenizer associated with the model that is responsible for converting the text into tokens (and vice-versa). prompt_text: The initial text (or prompt) you want to expand upon. max_length: Maximum length of the generated output. The default is set to 100 tokens.</p>
</li>
</ul>
<p>Setting Model for Generation:</p>
<pre><code class="lang-python">model.config.use_cache = <span class="hljs-literal">True</span>
model.eval()
</code></pre>
<p>The use_cache is set to True to allow the model to use past computations for faster generations. model.eval() sets the model to evaluation mode. This is essential as certain layers like dropout behave differently during training and evaluation.</p>
<p>Tokenization of the Prompt:</p>
<pre><code class="lang-python">input_ids = tokenizer.encode(prompt_text, return_tensors=<span class="hljs-string">"pt"</span>)
</code></pre>
<p>The prompt text is tokenized into a format the model understands (input_ids). The return_tensors="pt" ensures the result is a PyTorch tensor.</p>
<p>Generating the Completion:</p>
<pre><code class="lang-python"><span class="hljs-keyword">with</span> torch.no_grad():
    output = model.generate(...)  # full argument list shown in the function definition above
</code></pre>
<p><code>with torch.no_grad()</code> ensures that no gradients are computed during this operation, saving memory and computational power. <code>model.generate()</code> is the method that produces the generated completion. Here's a brief on the parameters:</p>
<ul>
<li><p><code>input_ids</code>: The tokenized version of the <code>prompt_text</code>.</p>
</li>
<li><p><code>max_length</code>: The maximum length of the generated text.</p>
</li>
<li><p><code>num_return_sequences</code>: The number of sequences to return. It's set to 1, so only one completion is generated.</p>
</li>
<li><p><code>pad_token_id</code>, <code>eos_token_id</code>: The padding and end-of-sequence token IDs, ensuring the generated text is appropriately padded and terminated.</p>
</li>
<li><p><code>temperature</code>: Controls the randomness of the output. Lower values make the output more deterministic.</p>
</li>
<li><p><code>top_k</code>, <code>top_p</code>: Restrict sampling to the <code>k</code> most likely tokens, or to the smallest set of tokens whose cumulative probability reaches <code>p</code>.</p>
</li>
<li><p><code>do_sample</code>: Enables sampling, so the model considers multiple possible next tokens rather than always taking the most probable one.</p>
</li>
</ul>
<p>Displaying the Generated Text:</p>
<pre><code class="lang-python">display(Markdown(tokenizer.decode(output[<span class="hljs-number">0</span>], skip_special_tokens=<span class="hljs-literal">True</span>)))
</code></pre>
<p>This decodes the generated token IDs back to human-readable text and then displays it in the Jupyter Notebook in a formatted manner.</p>
<p><mark>Testing</mark></p>
<pre><code class="lang-python">prompt_text = <span class="hljs-string">"### INSTRUCTION\nWrite a function to find the area of a triangle:\n\n### OUTPUT\n"</span>
</code></pre>
<pre><code class="lang-python">prompt_text1 = <span class="hljs-string">"### INSTRUCTION\nWrite a function to find if a number is odd:\n\n### OUTPUT\n"</span>
</code></pre>
<pre><code class="lang-python">prompt_text2 = <span class="hljs-string">"### INSTRUCTION\nWrite a code to find factorial of a number:\n\n### OUTPUT\n"</span>
</code></pre>
<pre><code class="lang-python">print(generate_completion(loaded_model, loaded_tokenizer, prompt_text))
</code></pre>
<p>Generate config GenerationConfig { "bos_token_id": 1, "eos_token_id": 50256 }</p>
<p>INSTRUCTION</p>
<p>Write a function to find the area of a triangle:</p>
<p>OUTPUT</p>
<p>Here is a Python function that calculates the area of a triangle:</p>
<pre><code class="lang-python"><span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">area_of_triangle</span>(<span class="hljs-params">a, b, c</span>):</span>
    <span class="hljs-keyword">return</span> (a * b) / <span class="hljs-number">2</span>

print(area_of_triangle(<span class="hljs-number">3</span>, <span class="hljs-number">4</span>, <span class="hljs-number">5</span>))
</code></pre>
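<p>It's worth noting that this generated answer is only partially correct: it ignores the third side <code>c</code> (a general triangle would need Heron's formula), a useful reminder that the output of a briefly fine-tuned model still needs human review.</p>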
<pre><code class="lang-python">print(generate_completion(loaded_model, loaded_tokenizer, prompt_text1))
</code></pre>
<p>INSTRUCTION</p>
<p>Write a function to find if a number is odd:</p>
<p>OUTPUT</p>
<p>Here is a Python function that takes a number as input and returns True if the number is odd, and False otherwise.</p>
<pre><code class="lang-python"><span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">is_odd</span>(<span class="hljs-params">n</span>):</span>
    <span class="hljs-keyword">return</span> n % <span class="hljs-number">2</span> == <span class="hljs-number">1</span>

print(is_odd(<span class="hljs-number">5</span>))
This function takes a number <span class="hljs-keyword">as</span> an input <span class="hljs-keyword">and</span> returns <span class="hljs-literal">True</span> <span class="hljs-keyword">if</span> the number <span class="hljs-keyword">is</span> odd,
</code></pre>
<pre><code class="lang-python">print(generate_completion(loaded_model, loaded_tokenizer, prompt_text2))
</code></pre>
<p><strong>INSTRUCTION</strong></p>
<p>Write a code to find factorial of a number:</p>
<p><strong>OUTPUT</strong></p>
<p>Here is a Python code that solves the problem:</p>
<pre><code class="lang-python"><span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">factorial</span>(<span class="hljs-params">n</span>):</span>
    <span class="hljs-keyword">if</span> n == <span class="hljs-number">0</span>:
        <span class="hljs-keyword">return</span> <span class="hljs-number">1</span>
    <span class="hljs-keyword">else</span>:
        <span class="hljs-keyword">return</span> n * factorial(n - <span class="hljs-number">1</span>)

print(factorial(<span class="hljs-number">5</span>))
</code></pre>
<p><mark>Conclusion</mark></p>
<p>To wrap up, we've delved into the intricate process of optimizing language models with Parameter-Efficient Fine-Tuning, namely through PEFT and LoRA. This approach not only fine-tunes models at a fraction of the computational expense but also delivers robust performance on specialized tasks like code generation. The hands-on implementation on a CodeGen-based pre-trained model illustrates the practicality and potential of such methods. For enthusiasts and professionals aiming to harness the power of large language models without the associated computational overhead, this exploration shines a light on the path forward.</p>
<hr />
<p>If you want to contribute or you find any errors in this article please do leave me a comment.</p>
<p>You can reach out to me on any of the matrix decentralized servers. My element messenger ID is <a class="user-mention" href="https://hashnode.com/@maximilien">@maximilien</a><mark>:</mark><a target="_blank" href="http://matrix.org">matrix.org</a></p>
<p>If you are in one of the mastodon decentralized servers, here is my ID <a class="user-mention" href="https://hashnode.com/@maximilien">@</a><a target="_blank" href="mailto:maximilien@qoto.org"><strong>maximilien@qoto.org</strong></a></p>
<p>If you are on linkedIn, you can reach me <a target="_blank" href="http://www.linkedin.com/in/maximilien-kpizingui">here</a></p>
<p>If you want to contact me via email <a target="_blank" href="mailto:maximilien@maxtekai.tech"><strong><mark>maximilien@maxtekai.tech</mark></strong></a></p>
<p>If you want to hire me to work on machine learning, data science, IoT and AI-related projects, please reach out to me <a target="_blank" href="http://www.maxtekai.tech">here</a></p>
<p><code>Warm regards,</code></p>
<p><code>Maximilien.</code></p>
]]></content:encoded></item><item><title><![CDATA[Langchain Meets GPT-3.5: Crafting the Ultimate Multilingual News Articles Summarizer In English And French]]></title><description><![CDATA[Introduction
In our modern, rapidly evolving society, staying abreast of current news and updates is crucial. Yet, sifting through numerous articles can be a tedious task. To streamline this process and provide you with succinct insights, we're intro...]]></description><link>https://maximilien.docquest.io/langchain-meets-gpt-35-crafting-the-ultimate-multilingual-news-articles-summarizer-in-english-and-french</link><guid isPermaLink="true">https://maximilien.docquest.io/langchain-meets-gpt-35-crafting-the-ultimate-multilingual-news-articles-summarizer-in-english-and-french</guid><category><![CDATA[openai]]></category><category><![CDATA[langchain]]></category><category><![CDATA[News Summarization]]></category><category><![CDATA[Article Analysis]]></category><category><![CDATA[News Automation]]></category><dc:creator><![CDATA[Maximilien]]></dc:creator><pubDate>Tue, 26 Sep 2023 14:14:11 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1695729072438/660bdfbb-d5cd-459a-9d02-c4afff8efa68.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h3 id="heading-introduction"><strong>Introduction</strong></h3>
<p>In our modern, rapidly evolving society, staying abreast of current news and updates is crucial. Yet, sifting through numerous articles can be a tedious task. To streamline this process and provide you with succinct insights, we're introducing a News Articles Summarizer built with GPT-3.5 and LangChain. This robust tool efficiently scrapes web articles, captures their headlines and content, and produces sharp summaries. In this guide, we'll delve into the step-by-step creation of this summarizer.</p>
<h3 id="heading-workflow-for-building-a-news-articles-summarizer"><strong>Workflow for Building a News Articles Summarizer</strong></h3>
<ol>
<li><p><strong><mark>Installing required libraries</mark></strong><mark>:</mark> To get started, ensure you have the necessary libraries installed: <code>requests</code>, <code>newspaper3k</code>, and <code>langchain</code>.</p>
</li>
<li><p><strong><mark>Scraping articles</mark></strong>: Use the <code>requests</code> library to scrape the content of the target news articles from their respective URLs.</p>
</li>
<li><p><strong><mark>Extracting titles and text</mark></strong>: Employ the <code>newspaper3k</code> library (imported as <code>newspaper</code>) to parse the scraped HTML and extract the titles and text of the articles.</p>
</li>
<li><p><strong><mark>Preprocessing the text</mark></strong>: Clean and preprocess the extracted texts to make them suitable for input to the GPT-3.5 model.</p>
</li>
<li><p><strong><mark>Generating summaries</mark></strong>: Utilize GPT-3.5 model to summarize the extracted articles</p>
</li>
<li><p><strong><mark>Outputting the results</mark></strong>: Present the summaries along with the original titles, allowing users to grasp the main points of each article quickly.</p>
</li>
</ol>
<hr />
<ol>
<li><strong><mark>Installing dependencies</mark></strong></li>
</ol>
<pre><code class="lang-plaintext">!pip install -q openai langchain newspaper3k python-dotenv  requests
</code></pre>
<p>Create a <code>.env</code> file in your project root directory and add your OpenAI environment variable:</p>
<pre><code class="lang-plaintext">from dotenv import load_dotenv

!echo "OPENAI_API_KEY='&lt;OPENAI_API_KEY&gt;'" &gt; .env

load_dotenv()
</code></pre>
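<p>Behind the scenes, <code>load_dotenv()</code> copies the variables from <code>.env</code> into the process environment, where LangChain's OpenAI wrappers pick up <code>OPENAI_API_KEY</code> automatically; you can confirm it loaded with <code>os.getenv("OPENAI_API_KEY")</code>.</p>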
<ol>
<li><strong><mark>Scraping &amp; extracting the title and the text of the article using requests and newspaper libraries</mark></strong></li>
</ol>
<pre><code class="lang-plaintext">import requests
from newspaper import Article

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.82 Safari/537.36'
}

article_urls = "https://www.wired.com/story/fast-forward-chatgpt-my-new-chatbot-friend-get-things-done/"

session = requests.Session()

try:
    response = session.get(article_urls, headers=headers, timeout=10)

    if response.status_code == 200:
        article = Article(article_urls)
        article.download()
        article.parse()

        print(f"Title: {article.title}")
        print(f"Text: {article.text}")

    else:
        print(f"Failed to fetch article at {article_urls}")
except Exception as e:
    print(f"Error occurred while fetching article at {article_urls}: {e}")
</code></pre>
<details><summary>output of the above code</summary><div data-type="detailsContent"><em>Title: Enough Talk, ChatGPT—My New Chatbot Friend Can Get Things Done Text: I recently needed to contact the CEO of a startup called Lindy, a company developing personal assistants powered by artificial intelligence. Instead of looking for it myself, I turned to an AI helper of my own, an open source program called Auto-GPT, typing in “Find me the email address of the CEO of Lindy AI.” Like a delightfully enthusiastic intern, Auto-GPT began furiously Googling and browsing the web for answers, providing a running commentary designed to explain its actions as it went. “A web search is a good starting point to gather information about the CEO and their email address,” it told me.</em> <em>When given a task like finding a startup CEO's email address, the open source Auto-GPT suggests a plan for approval and can attempt to put it into action. Auto-GPT via Will Knight “I found several sources mentioning Flo Crivello as the CEO of </em><a target="_blank" href="http://Lindy.ai"><em>Lindy.ai</em></a><em>, but I haven't found their email address yet,” Auto-GPT reported. “I will now check Flo Crivello’s LinkedIn profile for their email address,” it said. That didn’t work either, so the program then suggested it could guess Crivello’s email address based on commonly used formats. After I gave it permission to go ahead, Auto-GPT used a series of different email verification services it found online to check if any of its guesses might be valid. None provided a clear answer, but the program saved the addresses to a file on my computer, suggesting I might want to try emailing them all.</em> <em>Who am I to question a friendly chatbot? I tried them all, but every email bounced back. Eventually, I made my own guess at Crivello’s email address based on past experience, and I got it right the first time. Auto-GPT failed me, but it got close enough to illustrate a coming shift in how we use computers and the web. The ability of bots like ChatGPT to answer an incredible variety of questions means they can correctly describe how to perform a wide range of sophisticated tasks. Connect that with software that can put those descriptions into action and you have an AI helper that can get a lot done. Of course, just as ChatGPT will sometimes produce confused messages, agents built that way will occasionally—or often—go haywire. As I wrote this week, while searching for an email address is relatively low-risk, in the future agents might be tasked with riskier business, like booking flights or contacting people on your behalf. Making agents that are safe as well as smart is a major preoccupation of projects and companies working on this next phase of the ChatGPT era. When I finally spoke to Crivello of Lindy, he seemed utterly convinced that AI agents will be able to wholly replace some office workers, such as executive assistants. He envisions many professions simply disappearing.</em></div></details>

<ol>
<li><strong><mark>Generating the summaries of the article using gpt-3.5-turbo</mark></strong></li>
</ol>
<p>The next code imports essential classes and functions from LangChain and sets up a <code>ChatOpenAI</code> instance with a temperature of 0 for controlled response generation. Additionally, it imports chat-related message schema classes, which enable the smooth handling of chat-based tasks. The code below starts by setting up the prompt and filling it with the article's content.</p>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> langchain.schema <span class="hljs-keyword">import</span> HumanMessage
<span class="hljs-keyword">from</span> langchain.chat_models <span class="hljs-keyword">import</span> ChatOpenAI

article_title = article.title

template = <span class="hljs-string">"""You are a very good assistant that summarizes online articles.

Here's the article you want to summarize.

==================
Title: {article_title}

{article_text}
==================

Write a summary of the previous article.
"""</span>

prompt = template.format(article_title=article.title, article_text=article.text)

messages = [HumanMessage(content=prompt)]

chat = ChatOpenAI(model_name=<span class="hljs-string">"gpt-3.5-turbo"</span>, temperature=<span class="hljs-number">0</span>)  

summary = chat(messages)
print(summary.content)
</code></pre>
<details><summary>output of the code below</summary><div data-type="detailsContent"><em>The article discusses the capabilities of AI chatbots, specifically Auto-GPT, in performing tasks and getting things done. The author shares their experience using Auto-GPT to find the email address of the CEO of a startup called Lindy. Although Auto-GPT was not successful in finding the email address, it demonstrated the potential of AI chatbots to perform a wide range of tasks. The article also highlights the importance of ensuring the safety and reliability of AI agents as they take on more complex and risky tasks in the future. The CEO of Lindy believes that AI agents have the potential to replace certain office workers and transform various professions.</em></div></details>

<p>If we want a bulleted list, we can modify the prompt as shown below.</p>
<pre><code class="lang-python">
template = <span class="hljs-string">"""You are an advanced AI assistant that summarizes online articles into bulleted lists.

Here's the article you need to summarize.

==================
Title: {article_title}

{article_text}
==================

Now, provide a summarized version of the article in a bulleted list format.
"""</span>


prompt = template.format(article_title=article.title, article_text=article.text)

summary = chat([HumanMessage(content=prompt)])
print(summary.content)
</code></pre>
<p><mark>The output of the code is shown below</mark></p>
<blockquote>
<pre><code class="lang-plaintext">- The author used an open source program called Auto-GPT to find the email address of the CEO of Lindy AI.
- Auto-GPT suggested a plan and attempted to find the email address through web searches and checking the CEO's LinkedIn profile.
- The program also tried guessing the email address based on commonly used formats and used email verification services to check its guesses.
- None of the attempts were successful, but the program saved the addresses for the author to try emailing them.
- The author eventually made their own guess and found the correct email address.
- The experience with Auto-GPT highlights the potential of AI assistants like ChatGPT to perform a wide range of tasks.
- However, there are concerns about the safety and reliability of AI agents when handling riskier tasks.
- The CEO of Lindy AI believes that AI agents could replace certain office workers and lead to the disappearance of some professions.
</code></pre>
</blockquote>
<p>To obtain a summary in French, we can guide the model to produce it in the French language. However, keep in mind that GPT-3's primary training data is in English. Although it possesses multilingual abilities, the output's accuracy might be inconsistent for non-English languages. Here's a way to adjust the prompt.</p>
<pre><code class="lang-python">
template = <span class="hljs-string">"""You are an advanced AI assistant that summarizes online articles into bulleted lists in French.

Here's the article you need to summarize.

==================
Title: {article_title}

{article_text}
==================

Now, provide a summarized version of the article in a bulleted list format, in French.
"""</span>

prompt = template.format(article_title=article.title, article_text=article.text)

summary = chat([HumanMessage(content=prompt)])
print(summary.content)
</code></pre>
<p><mark>The output of the code is shown below</mark></p>
<blockquote>
<pre><code class="lang-plaintext">- Auto-GPT est un programme open source qui peut aider à trouver des informations en ligne, comme l'adresse e-mail du PDG d'une startup appelée Lindy AI.
- Auto-GPT effectue une recherche sur le web pour trouver l'adresse e-mail du PDG de Lindy AI, mais ne parvient pas à la trouver.
- Le programme suggère ensuite de deviner l'adresse e-mail en se basant sur des formats couramment utilisés.
- Auto-GPT utilise différents services de vérification d'adresses e-mail pour vérifier ses suppositions, mais aucune ne s'avère valide.
- Auto-GPT enregistre les adresses dans un fichier sur l'ordinateur de l'utilisateur et suggère d'essayer de les contacter par e-mail.
- L'article souligne que les chatbots comme ChatGPT peuvent accomplir une grande variété de tâches sophistiquées grâce à leur capacité à répondre à de nombreuses questions.
- Cependant, il est important de développer des agents intelligents qui soient également sûrs pour éviter les problèmes potentiels.
- Le PDG de Lindy AI pense que les agents d'intelligence artificielle pourraient remplacer certains employés de bureau à l'avenir et prédit la disparition de certaines professions.
</code></pre>
</blockquote>
<p>Behind the scenes, the code first gathers article details like the title and content. A conversational prompt is then crafted, positioning the AI as a sophisticated assistant tasked with summarizing the article in French bullet points. The GPT-3.5 model is loaded with specific settings to regulate output randomness, and the prompt is populated with the article's data. The core of the process is passing the formatted prompt to the model, which parses it, understands the task and generates a summary accordingly.</p>
<h3 id="heading-conclusion"><strong><mark>Conclusion</mark></strong></h3>
<p>To wrap up, we've demystified the journey of crafting a proficient News Article Summarizer through the synergy of LangChain and GPT-3.5. This tool, with its ability to present AI summaries as bullet points, not only distills complex articles for easy consumption but also embraces a global audience by offering translations, with French as an example. The step-by-step guide serves as a blueprint for anyone aiming to optimize their news-reading experience, staying informed without wasting time on long articles online.</p>
<hr />
<p>If you want to contribute or you find any errors in this article please do leave me a comment.</p>
<p>You can reach out to me on any of the matrix decentralized servers. My element messenger ID is <a class="user-mention" href="https://hashnode.com/@maximilien">@maximilien</a><mark>:</mark><a target="_blank" href="http://matrix.org">matrix.org</a></p>
<p>If you are in one of the mastodon decentralized servers, here is my ID <a class="user-mention" href="https://hashnode.com/@maximilien">@</a><a target="_blank" href="mailto:maximilien@qoto.org"><strong><mark>maximilien@qoto.org</mark></strong></a></p>
<p>If you are on linkedIn, you can reach me <a target="_blank" href="http://www.linkedin.com/in/maximilien-kpizingui">here</a></p>
<p>If you want to contact me via email <a target="_blank" href="mailto:maximilien@maxtekai.tech"><strong><mark>maximilien@maxtekai.tech</mark></strong></a></p>
<p>If you want to hire me to work on machine learning, data science, IoT and AI-related projects, please reach out to me <a target="_blank" href="http://www.maxtekai.tech">here</a></p>
<p><code>Warm regards,</code></p>
<p><code>Maximilien.</code></p>
]]></content:encoded></item><item><title><![CDATA[Unleashing the Power of LLMs, OpenAI API, and LangChain for Personalized City-Specific Recipe Recommendations]]></title><description><![CDATA[Cooking is an art, and having a knowledgeable cooking assistant can greatly enhance your culinary journey. In this blog post, we will explore how the combination of LLMs (Large Language Models) the OpenAI API, and LangChain  can be leveraged to build...]]></description><link>https://maximilien.docquest.io/unleashing-the-power-of-llms-openai-api-and-langchain-for-personalized-city-specific-recipe-recommendations</link><guid isPermaLink="true">https://maximilien.docquest.io/unleashing-the-power-of-llms-openai-api-and-langchain-for-personalized-city-specific-recipe-recommendations</guid><category><![CDATA[#openai #LLMs #langchain #promtTemplate #PromptEngineering #python ]]></category><dc:creator><![CDATA[Maximilien]]></dc:creator><pubDate>Sun, 14 May 2023 11:10:44 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1683906895622/b3ddb0df-2c49-40e2-b28f-50677821cb12.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Cooking is an art, and having a knowledgeable cooking assistant can greatly enhance your culinary journey. In this blog post, we will explore how the combination of <mark>LLMs</mark> (Large Language Models) the <mark>OpenAI API</mark>, and <mark>LangChain </mark> can be leveraged to build an intelligent Cook Bot. This bot will provide recipes based on the location provided by the user, cooking tips, and even real-time translation for global culinary adventures.</p>
<hr />
<h3 id="heading-table-of-contents">Table of contents</h3>
<ol>
<li><p><mark>Introduction</mark></p>
</li>
<li><p><mark>Setting up the Environment Variables for OpenAI API</mark></p>
</li>
<li><p><mark>Installing and loading dependencies</mark></p>
</li>
<li><p><mark>Creating a new Python script</mark></p>
</li>
<li><p><mark>Invoking the method to load the environment variable and creating a new instance of the language model</mark></p>
</li>
<li><p><mark>Creating Location and Meal Chains</mark></p>
</li>
<li><p><mark>Building the Overall Chain</mark></p>
</li>
<li><p><mark>Streamlit User Interface</mark></p>
</li>
<li><p><mark>Conclusion</mark></p>
</li>
</ol>
<hr />
<ul>
<li><mark>Introduction</mark></li>
</ul>
<p>Cooking is an art that brings people together through delicious flavors and culinary experiences. Before delving into the code for a cutting-edge cooking assistant that will elevate your cooking skills to new heights, we are going to walk through some keywords that may be unfamiliar to non-specialists but are key to the implementation of the cook assistant, namely <mark>OpenAI</mark>, <mark>Langchain</mark>, <mark>LLMs,</mark> <mark>Prompt</mark> and <mark>Prompt template.</mark></p>
<ul>
<li><p><mark>OpenAI</mark> : OpenAI is a leading artificial intelligence research organization that has developed state-of-the-art language models capable of understanding and generating human-like text. In the context of our cook assistant, OpenAI's technology is utilized to power the language generation capabilities of the assistant. The cooking assistant relies on an OpenAI language model, specifically the OpenAI API, to generate responses to user inputs. The language model has been trained on vast amounts of text data, allowing it to understand the nuances of human language and provide contextually relevant and coherent responses. By integrating the OpenAI API into the cook assistant, we leverage the power of natural language processing and generation. The cooking assistant can understand and process user queries related to cooking, recipe recommendations, and cooking tips. It then utilizes the OpenAI language model to generate informative and engaging responses conversationally.</p>
</li>
<li><p><mark>Langchain</mark>: Python library that provides an easy-to-use interface for building conversational agents using large language models (LLMs). In the context of the cook assistant, Langchain is used to build a conversational interface that allows users to ask for cooking advice and recipe recommendations.</p>
<p>  At the core of Langchain is the concept of a "chain," which is essentially a sequence of LLMs that are used to generate responses to user inputs. In the cook assistant, Langchain is used to create two separate chains: one for location-based recipe recommendations and another for meal-specific recipe recommendations.</p>
</li>
<li><p><mark>LLMs</mark>(Large Language Models) in the realm of the cooking assistant are like culinary encyclopedias on steroids. These linguistic powerhouses possess a vast understanding of recipes, ingredients, cooking techniques, and culinary knowledge. LLMs are trained on massive amounts of text data, allowing them to absorb a wealth of culinary information. When it comes to cooking assistance, LLMs serve as the brilliant minds behind the scenes. They can comprehend and generate text-based responses to user queries, providing valuable cooking advice, recipe suggestions, and even personalized recommendations.</p>
</li>
<li><p><mark>Prompt</mark>: In the context of LLMs and the Cook Assistant, a prompt refers to the initial instruction or input provided to the language model to generate a response. The prompt serves as a guiding context for the model, helping it understand the user's request and generate relevant and meaningful output. When interacting with the Cook Assistant, the user provides prompts related to the city. For example, a user might ask, "What is a delicious recipe for a particular city?" In this case, the prompt is the user's question itself. The prompt is then passed to the LLM, such as OpenAI's GPT-3 model, which processes the input and generates a response based on its understanding of the given context. The model analyzes the prompt and utilizes its vast knowledge of cooking techniques, ingredients, and recipes to generate a helpful and informative response to the user's query.</p>
</li>
<li><p><mark>Prompt template: </mark> In the context of Large Language Models (LLMs) and the Cook Assistant, a prompt template is a structured framework that incorporates variables and placeholders to create dynamic prompts. Prompt templates allow for flexible and customizable input that can be easily adapted to different user queries and contexts. The Cook Assistant utilizes prompt templates to generate prompts tailored to specific user inputs, such as location or meal preferences. Instead of providing a static prompt, a prompt template includes placeholders that will be replaced with actual values provided by the user. For example, a prompt template for location-based recommendations might look like this: "Tell me the best food in {user_location}." Here, <code>{user_location}</code> is the placeholder that will be replaced with the actual user input, such as "New York" or "Paris." The prompt template allows the Cook Assistant to dynamically generate prompts based on the user's location, ensuring relevant and personalized responses. By employing prompt templates, the Cook Assistant can handle various user inputs and adapt its prompts accordingly. Prompt templates enable a more interactive and conversational experience, allowing users to provide specific details or preferences and receive targeted recommendations or recipes in response. In closing, prompt templates enhance the flexibility and interactivity of the Cook Assistant, enabling users to engage in meaningful conversations and receive customized cooking assistance based on their specific needs and preferences.</p>
</li>
<li><p><mark>Setting up the Environment Variables for OpenAI API</mark></p>
<p>  Before we start building the Cook Bot, we need to set up the environment variables for the OpenAI API. The API key is a sensitive piece of information that should not be hard-coded into your codebase. Instead, it should be stored as an environment variable. Here are the steps to set up the environment variable:</p>
</li>
<li><ul>
<li><p>Log in to the OpenAI dashboard and copy your API key.</p>
</li>
<li><p>Open the terminal, create a project directory using <code>sudo mkdir cookAssistant</code>, and go into that directory using <code>cd cookAssistant</code>.</p>
</li>
<li><p>Create a new environment file using <code>sudo nano .env</code> and add the following line to the <code>.env</code> file:</p>
</li>
</ul>
</li>
</ul>
<pre><code class="lang-bash">OPENAI_API_KEY = <span class="hljs-string">"your api key here"</span>
</code></pre>
<ul>
<li><mark>Installing and loading dependencies</mark></li>
</ul>
<p>In this section, we are going to install the required dependencies using pip and import them into our Python script</p>
<pre><code class="lang-bash">pip install openai langchain streamlit streamlit_chat dotenv
</code></pre>
<ul>
<li><p><mark>Creating a new Python script using </mark> <code>nano app.py</code> <mark>in the project</mark> <code>cookAssistant</code> <mark>directory where the .env file is and paste the following code as shown step by step.</mark></p>
<pre><code class="lang-python">
  <span class="hljs-keyword">from</span> langchain.llms <span class="hljs-keyword">import</span> OpenAI
  <span class="hljs-keyword">from</span> langchain.chains <span class="hljs-keyword">import</span> LLMChain
  <span class="hljs-keyword">from</span> langchain.prompts <span class="hljs-keyword">import</span> PromptTemplate
  <span class="hljs-keyword">from</span> langchain.chains <span class="hljs-keyword">import</span> SimpleSequentialChain
  <span class="hljs-keyword">import</span> streamlit <span class="hljs-keyword">as</span> st
  <span class="hljs-keyword">from</span> streamlit_chat <span class="hljs-keyword">import</span> message
  <span class="hljs-keyword">from</span> langchain.chains <span class="hljs-keyword">import</span> ConversationChain
  <span class="hljs-keyword">from</span> langchain.llms <span class="hljs-keyword">import</span> OpenAI
  <span class="hljs-keyword">from</span> dotenv <span class="hljs-keyword">import</span> load_dotenv
  <span class="hljs-keyword">import</span> os
</code></pre>
</li>
<li><p><mark>Invoking the method to load the environment variable and creating a new instance of the language model</mark></p>
</li>
</ul>
<pre><code class="lang-python">load_dotenv()
token = os.environ.get(<span class="hljs-string">"OPENAI_API_KEY"</span>)
llm = OpenAI(temperature=<span class="hljs-number">1</span>, openai_api_key=token)
</code></pre>
<p>In the above code, the load_dotenv() function loads the environment variables from the .env file. The token variable retrieves the OpenAI API key stored in the OPENAI_API_KEY environment variable. Next, an instance of the OpenAI class is created with the provided API key (token) and a temperature of 1. The temperature parameter determines the randomness of the responses generated by the model.</p>
<ul>
<li><mark>Creating Location and Meal Chains</mark></li>
</ul>
<p>To provide location-based recommendations and meal-specific recipes, we create separate chains for location and meal inputs. We define prompt templates that incorporate the user's location and desired meal to generate contextually relevant responses. By utilizing LLMChain and PromptTemplate from LangChain, we establish the foundation for our cook assistant's knowledge and response generation as shown below.</p>
<pre><code class="lang-python"><span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">load_chain_loc</span>():</span>
    template = <span class="hljs-string">"""Your job is to come up with a classic dish from the area that the user suggests.
    % USER LOCATION
    {user_location}

    YOUR RESPONSE:
    """</span>
    prompt_template = PromptTemplate(input_variables=[<span class="hljs-string">"user_location"</span>], template=template)

    location_chain = LLMChain(llm=llm, prompt=prompt_template)
    <span class="hljs-keyword">return</span> location_chain
</code></pre>
<p>In the above code, The <code>load_chain_loc</code> function defines a template for generating prompts based on the user's location input. It uses the <code>PromptTemplate</code> class to create a template with the variable <code>{user_location}</code>. An instance of the <code>LLMChain</code> class is created, passing the OpenAI instance (<code>llm</code>) and the prompt template (<code>prompt_template</code>). The function returns the <code>location_chain</code> instance.</p>
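<p>Before wiring the chains together, you can exercise this one on its own; a minimal check is shown below (the printed dish is illustrative, as actual output varies from run to run):</p>
<pre><code class="lang-python">loc_chain = load_chain_loc()
print(loc_chain.run("Tokyo"))   # e.g. "Sushi" -- the model's answer varies
</code></pre>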
<pre><code class="lang-python"><span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">load_chain_meal</span>():</span>
    template = <span class="hljs-string">"""Given a meal, give a short and simple recipe on how to make that dish at home.
    % MEAL
    {user_meal}

    YOUR RESPONSE:
    """</span>
    prompt_template = PromptTemplate(input_variables=[<span class="hljs-string">"user_meal"</span>], template=template)

    meal_chain = LLMChain(llm=llm, prompt=prompt_template)
    <span class="hljs-keyword">return</span> meal_chain
</code></pre>
<p>In the above code, The <code>load_chain_meal</code> function defines a template for generating prompts based on the user's meal input. It uses the <code>PromptTemplate</code> class to create a template with the variable <code>{user_meal}</code>. An instance of the <code>LLMChain</code> class is created, passing the OpenAI instance (<code>llm</code>) and the prompt template (<code>prompt_template</code>). The function returns the <code>meal_chain</code> instance.</p>
<ul>
<li><mark>Building the Overall Chain</mark></li>
</ul>
<p>To integrate the location and meal chains, we create an overall chain using SimpleSequentialChain. This allows us to connect the chains and execute them sequentially, ensuring a smooth flow of information and generating cohesive responses. We configure the overall chain to provide verbose output for debugging and testing purposes.</p>
<pre><code class="lang-python">loc_chain = load_chain_loc()
chain_meal = load_chain_meal()
overall_chain = SimpleSequentialChain(chains=[loc_chain,chain_meal], verbose=<span class="hljs-literal">True</span>)
</code></pre>
<p>In the above code, the <code>load_chain_loc</code> function is called to create the <code>loc_chain</code> instance. The <code>load_chain_meal</code> function is called to create the <code>chain_meal</code> instance. Finally, an instance of the <code>SimpleSequentialChain</code> class is created, passing the <code>loc_chain</code> and <code>chain_meal</code> instances as a list; <code>verbose=True</code> prints each intermediate step, which is handy for debugging.</p>
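<p>Before adding the user interface, a quick sanity check makes the sequential flow visible: the location chain turns a city into a dish, and SimpleSequentialChain feeds that dish straight into the meal chain, which returns a recipe. A minimal sketch, with an illustrative city as input:</p>
<pre><code class="lang-python"># loc_chain maps the city to a classic dish; chain_meal turns that dish into a recipe.
# verbose=True prints both intermediate steps to the console.
recipe = overall_chain.run("Rome")
print(recipe)
</code></pre>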
<ul>
<li><p><mark>Streamlit User Interface</mark></p>
<p>  In this section, we implement the user interface using Streamlit. We set the page configuration, display the cook assistant's title, and prompt the user to enter the desired city for location-based food recommendations. As the user inputs their preferences, the cook assistant generates responses using the overall chain, and the conversation history is displayed using the Streamlit Chat component as shown below</p>
</li>
</ul>
<pre><code class="lang-python">st.set_page_config(page_title=<span class="hljs-string">" Cook bot"</span>, page_icon=<span class="hljs-string">":robot:"</span>)
st.title(<span class="hljs-string">"Cook bot powered with LLMs"</span>)
st.write(<span class="hljs-string">"By Maximilien "</span>)

<span class="hljs-keyword">if</span> <span class="hljs-string">"generated"</span> <span class="hljs-keyword">not</span> <span class="hljs-keyword">in</span> st.session_state:
    st.session_state[<span class="hljs-string">"generated"</span>] = []

<span class="hljs-keyword">if</span> <span class="hljs-string">"past"</span> <span class="hljs-keyword">not</span> <span class="hljs-keyword">in</span> st.session_state:
    st.session_state[<span class="hljs-string">"past"</span>] = []


<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">get_text</span>():</span>
    st.header(<span class="hljs-string">"enter the city you want to know the best food"</span>)
    input_text = st.text_input(<span class="hljs-string">""</span>, key=<span class="hljs-string">"input"</span>)
    <span class="hljs-keyword">return</span> input_text


user_input = get_text()

<span class="hljs-keyword">if</span> user_input:
    output = overall_chain.run(input=user_input)

    st.session_state.past.append(user_input)
    st.session_state.generated.append(output)
    st.write(output)
<span class="hljs-keyword">if</span> st.session_state[<span class="hljs-string">"generated"</span>]:

    <span class="hljs-keyword">for</span> i <span class="hljs-keyword">in</span> range(len(st.session_state[<span class="hljs-string">"generated"</span>]) - <span class="hljs-number">1</span>, <span class="hljs-number">-1</span>, <span class="hljs-number">-1</span>):
        message(st.session_state[<span class="hljs-string">"generated"</span>][i], key=str(i))
        message(st.session_state[<span class="hljs-string">"past"</span>][i], is_user=<span class="hljs-literal">True</span>, key=str(i) + <span class="hljs-string">"_user"</span>)
</code></pre>
<p>Let's break down this part of the code step by step:</p>
<ol>
<li><p><code>st.set_page_config(page_title=" Cook bot", page_icon=":robot:")</code>: This line configures the page settings for the Streamlit application. It sets the page title as "Cook bot" and assigns a robot icon to the page.</p>
</li>
<li><p><code>st.title("Cook bot powered with LLMs")</code>: This line displays a title on the Streamlit app interface, indicating that it is a Cook bot powered by LLMs.</p>
</li>
<li><p><code>st.write("By Maximilien ")</code>: This line displays the name or attribution "By Maximilien" on the Streamlit app interface.</p>
</li>
<li><p><code>if "generated" not in st.session_state: st.session_state["generated"] = []</code>: This code block checks if the "generated" key is present in the Streamlit session state. If not, it initializes an empty list assigned to the "generated" key. This list will store the generated responses.</p>
</li>
<li><p><code>if "past" not in st.session_state: st.session_state["past"] = []</code>: Similar to the previous code block, this block checks if the "past" key is present in the Streamlit session state. If not, it initializes an empty list assigned to the "past" key. This list will store the past user inputs.</p>
</li>
<li><p><code>def get_text(): ...</code>: This is a function definition for <code>get_text()</code>. It displays a header asking the user to enter the city for which they want to know the best food. It then uses <code>st.text_input()</code> to retrieve the user's input as a text string.</p>
</li>
<li><p><code>user_input = get_text()</code>: This line calls the <code>get_text()</code> function and assigns the returned user input to the <code>user_input</code> variable.</p>
</li>
<li><p><code>if user_input: ...</code>: This code block checks if the <code>user_input</code> variable has a non-empty value. If there is user input, it proceeds with the following steps.</p>
</li>
<li><p><code>output = overall_chain.run(input=user_input)</code>: This line executes the <code>run()</code> method of the <code>overall_chain</code> object, passing the user input as the <code>input</code> argument. It generates a response based on the user input using the chained LLMs.</p>
</li>
<li><p><code>st.session_state.past.append(user_input)</code>: This appends the user input to the "past" list stored in the Streamlit session state.</p>
</li>
<li><p><code>st.session_state.generated.append(output)</code>: This appends the generated output to the "generated" list stored in the Streamlit session state.</p>
</li>
<li><p><code>st.write(output)</code>: This line displays the generated output on the Streamlit app interface.</p>
</li>
<li><p><code>if st.session_state["generated"]: ...</code>: This code block checks if the "generated" list in the Streamlit session state is not empty. If it is not empty, it proceeds with the following steps.</p>
</li>
<li><p><code>for i in range(len(st.session_state["generated"]) - 1, -1, -1): ...</code>: This loop iterates over the "generated" list in reverse order using the <code>range()</code> function. It retrieves each generated response and its corresponding user input from the Streamlit session state.</p>
</li>
<li><p><code>message(st.session_state["generated"][i], key=str(i))</code>: This line displays the generated response as a message on the Streamlit app interface, using the <code>message()</code> function. Each message is assigned a unique key based on the loop index.</p>
</li>
<li><p><code>message(st.session_state["past"][i], is_user=True, key=str(i) + "_user")</code>: This line displays the corresponding user input as a user message (indicating that it was input by the user) on the Streamlit app interface. Each user message is also.</p>
<ul>
<li><mark>Conclusion</mark></li>
</ul>
</li>
</ol>
<p>    With the implementation of the cook assistant using LLMs, the OpenAI API, and LangChain, we have harnessed the power of language models and intelligent conversation chains to create a valuable tool for cooking enthusiasts. The cooking assistant provides personalized recommendations and recipes based on user inputs, empowering users to explore new cuisines and enhance their culinary skills. By combining advanced technologies, we are revolutionizing the cooking experience and paving the way for future innovations in the realm of intelligent kitchen assistants.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1684019960749/086a5463-90c7-45d6-a650-a8cab82f480e.png" alt class="image--center mx-auto" /></p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1684020011552/aa486f60-c781-4aaa-a404-0cea4bbb828d.png" alt class="image--center mx-auto" /></p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1684020050370/6c638d16-cc97-4d8b-9e88-f5eaf25dd9b3.png" alt class="image--center mx-auto" /></p>
<hr />
<p>The source code can be found on my github repository <a target="_blank" href="https://github.com/flexil/cook_assistant">here</a></p>
<hr />
<p>If you want to contribute or you find any errors in this article please do leave me a comment.</p>
<p>You can reach out to me on any of the matrix decentralized servers. My element messenger ID is <mark> <a href="https://hashnode.com/@maximilien" class="user-mention" target="_blank">@maximilien</a>:</mark><a target="_blank" href="http://matrix.org"><mark>matrix.org</mark></a></p>
<p>If you are in one of the mastodon decentralized servers, here is my ID <strong><mark>@maximilien@qoto.org</mark></strong></p>
<p>If you are on linkedIn, you can reach me <a target="_blank" href="https://www.linkedin.com/in/maximilien-kpizingui-48222775/">here</a></p>
<p>If you want to contact me via email for freelance <strong><mark>maximilien@tutanota.de</mark></strong></p>
<p>If you want to hire me to work on machine learning, data science, IoT and AI related projects, please reach out to me <a target="_blank" href="https://flexil.github.io/freelance/">here</a></p>
<p><code>Warm regards,</code></p>
<p><code>Maximilien.</code></p>
]]></content:encoded></item><item><title><![CDATA[@WeRateDogs Data Wrangling  Analysis And Visualization]]></title><description><![CDATA[Welcome back to this blog post in advanced data analytics nanodegree scholarship sponsored by Udacity. The project II WeRateDogs, a twitter account which has 9.3 million followers across the world at the point this blog is published contains three da...]]></description><link>https://maximilien.docquest.io/weratedogs-data-wrangling-analysis-and-visualization</link><guid isPermaLink="true">https://maximilien.docquest.io/weratedogs-data-wrangling-analysis-and-visualization</guid><category><![CDATA[dataanalytics]]></category><category><![CDATA[Data wrangling]]></category><category><![CDATA[#data visualisation]]></category><category><![CDATA[data analysis]]></category><category><![CDATA[Python]]></category><dc:creator><![CDATA[Maximilien]]></dc:creator><pubDate>Mon, 18 Jul 2022 07:43:36 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1657458915455/-2kio68FL.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Welcome back to this blog post in the advanced data analytics nanodegree scholarship sponsored by Udacity. Project II covers <strong>WeRateDogs</strong>, a twitter account with 9.3 million followers worldwide at the time this blog is published. The project involves three datasets, namely twitter-archive-enhanced.csv, image_predictions.tsv and tweet_json.txt, which we have to download programmatically from three different sources: two from website URLs, while the third requires the learner to sign up for a twitter developer account to collect additional tweets through the twitter API. The objective of this project is to challenge the learner to wrangle the three datasets, combine the three cleaned datasets into a twitter master archive, and provide at least four insights and two visualizations. Without delay, let's get into the data wrangling.</p>
<hr />
<p><strong> Table of Contents</strong><br /></p>
<ol>
<li>Loading the libraries<br /></li>
<li>Data gathering<br /></li>
<li>Reading twitter-archive-enhanced dataset into a dataframe<br /></li>
<li>Downloading image prediction dataframe using request<br /></li>
<li>Reading image prediction into a dataframe<br /></li>
<li>Query tweet_json dataset using twitter API<br /></li>
<li>Reading tweet_json into a dataframe<br /></li>
<li>Assessing data<br /></li>
<li>Objectives<br /></li>
<li>Methodology<br /></li>
<li>Visual assessment of twitter_archive_df<br /></li>
<li>Programmatic assessment of twitter_archive_df<br /></li>
<li>Visual assessment of image_predictions_df<br /></li>
<li>Programmatic assessment of image_predictions_df<br /></li>
<li>Visual assessment of tweet_status<br /></li>
<li>Programmatic assessment of tweet_status dataframe<br /></li>
<li>Project scope<br /></li>
<li>Quality issues<br /></li>
<li>Tidiness issues<br /></li>
<li>Cleaning data<br /></li>
<li>Making the copy of the dataframes<br /></li>
<li>Quality issue #8: Convert tweet_id column in image_predictions_clean from int to str<br /></li>
<li>Quality Issue #6: Inconsistent use of lowercase, uppercase and underscores in the p1, p2, p3 columns<br /></li>
<li>Quality Issue #7: Removing duplicated jpg_url values from image_predictions_df<br /></li>
<li>Quality Issue #4: Incorrect values and incorrect type for timestamp<br /></li>
<li>Quality Issue #1: Invalid and missing dog names in the column name<br /></li>
<li>Tidiness Issue #1: Column source contains HTML tags and hyperlinks<br /></li>
<li>Quality Issue #10: Removing RT @ in column text of twitter_archive_clean<br /></li>
<li>Quality Issue #5: Dropping columns in_reply_to_status_id, in_reply_to_user_id, retweeted_status_id, retweeted_status_user_id, retweeted_status_timestamp<br /></li>
<li>Tidiness Issue #2: Merging columns doggo, floofer, pupper, and puppo<br /></li>
<li>Tidiness Issue #4: Two pieces of information in the single text column: URL link and text<br /></li>
<li>Quality Issue #2 and #3: Invalid ratings, with values varying from 1776 to 0; the data type must be converted from int to float. Invalid denominator, where a fixed base of 10 is expected; the data type must be converted from int to float.<br /></li>
<li>Tidiness Issue #5: Merging twitter_archive_clean and image_predictions_clean<br /></li>
<li>Quality Issue #9: Incorrect data type in column id in tweet_status_clean<br /></li>
<li>Tidiness issue #6: Creating a dataframe with columns: id, favorite_count, retweet_count<br /></li>
<li>Tidiness issue #7: Merge tweet_status_clean and twitter_archive_merged<br /></li>
<li>Storing the master data into csv file<br /></li>
<li>ANALYSING AND VISUALIZING THE DATA<br /></li>
<li>Reading the twitter_archive_master.csv into a dataframe<br /></li>
<li>Insight about the clean master dataset<br /></li>
<li>Visualizing the hidden pattern in the dataset<br /></li>
<li>Function to plot the average count of tweets<br /></li>
<li>Visualizing the distribution of average favorite count of tweets based on the dog category<br /></li>
<li>Visualizing the distribution of average retweet count based on the dog category<br /></li>
<li>Visualizing Likes vs Retweets<br /></li>
<li>Visualizing the most used devices by WeRateDogs users<br /></li>
<li>Visualizing the most popular name of the dogs<br /></li>
<li>Visualizing 20 dog breeds from prediction P2 on WeRateDogs<br /></li>
<li>Visualizing 20 dog breeds from prediction P3 on WeRateDogs<br /></li>
<li>Visualizing the dog category with the highest score<br /></li>
<li>Visualizing the dog category with maximum favorite count<br /></li>
<li>Visualizing the dog category with minimum retweet count<br /></li>
<li>Visualizing total tweets made by WeRateDogs per month between 2015 and 2017<br /></li>
<li>Visualizing some of the dogs image prediction p1</li>
</ol>
<hr />
<p><strong>1. Loading the required libraries</strong></p>
<pre><code><span class="hljs-title">from</span> timeit <span class="hljs-keyword">import</span> default_timer <span class="hljs-keyword">as</span> timer
<span class="hljs-title">from</span> <span class="hljs-type">IPython</span>.display <span class="hljs-keyword">import</span> Image
<span class="hljs-title">from</span> tweepy <span class="hljs-keyword">import</span> OAuthHandler
<span class="hljs-keyword">import</span> matplotlib.pyplot <span class="hljs-keyword">as</span> plt
<span class="hljs-keyword">import</span> seaborn <span class="hljs-keyword">as</span> sns
<span class="hljs-keyword">import</span> datetime <span class="hljs-keyword">as</span> dt
<span class="hljs-keyword">import</span> pandas <span class="hljs-keyword">as</span> pd
<span class="hljs-keyword">import</span> numpy <span class="hljs-keyword">as</span> np
%matplotlib inline
<span class="hljs-keyword">import</span> requests
<span class="hljs-keyword">import</span> tweepy
<span class="hljs-keyword">import</span> json
<span class="hljs-keyword">import</span> re
</code></pre><p><strong>2. Data Gathering</strong></p>
<ul>
<li>Reading the twitter-archive-enhanced.csv</li>
</ul>
<pre><code>twitter_archive_df <span class="hljs-operator">=</span> pd.read_csv(<span class="hljs-string">'twitter-archive-enhanced.csv'</span>)
</code></pre><ul>
<li>Use the Requests library to download the tweet image prediction (image_predictions.tsv)</li>
</ul>
<pre><code><span class="hljs-comment"># Use Requests library to programmatically download tsv file from a website</span>
<span class="hljs-attribute">url</span> = 'https://d<span class="hljs-number">17</span>h<span class="hljs-number">27</span>t<span class="hljs-number">6</span>h<span class="hljs-number">515</span>a<span class="hljs-number">5</span>.cloudfront.net/topher/<span class="hljs-number">2017</span>/August/<span class="hljs-number">599</span>fd<span class="hljs-number">2</span>ad_image-predictions/image-predictions.tsv'
<span class="hljs-attribute">response</span> = requests.get(url)
<span class="hljs-comment"># Save tsv to file</span>
<span class="hljs-attribute">with</span> open('image_predictions.tsv', mode='wb') as file:
 <span class="hljs-attribute">file</span>.write(response.content)
</code></pre><ul>
<li>Reading image_predictions.tsv in dataframe</li>
</ul>
<pre><code>image_predictions_df <span class="hljs-operator">=</span> pd.read_csv(<span class="hljs-string">'image_predictions.tsv'</span>, sep<span class="hljs-operator">=</span><span class="hljs-string">'\t'</span>)
</code></pre><ul>
<li>Use the Tweepy library to query additional data via the Twitter API (tweet_json.txt).
One of the project requirements is to access the Twitter API to create tweet_json.txt, completing some missing or wrong values. I will use the tweepy package (a client library) to access the Twitter API.</li>
<li>Authentication Details: load personal API keys (replaced with placeholders)</li>
</ul>
<pre><code>consumer_key = ''
consumer_secret = ''
access_token = ''
access_secret = ''
<span class="hljs-comment"># variables for Twitter API connection</span>
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_secret)
api = tweepy.API(auth, wait_on_rate_limit = True)
</code></pre><ul>
<li>Collecting tweet data using API</li>
</ul>
<pre><code>tweet_ids = twitter_archive_df.tweet_id.values
len(tweet_ids)

<span class="hljs-comment"># Query Twitter's API for JSON data for each tweet ID in the Twitter archive</span>
count = <span class="hljs-number">0</span>
fails_dict = {}
start = timer()
<span class="hljs-comment"># Save each tweet's returned JSON as a new line in a .txt file</span>
<span class="hljs-keyword">with</span> open(<span class="hljs-string">'tweet_json.txt'</span>, <span class="hljs-string">'w'</span>) <span class="hljs-keyword">as</span> outfile:
    <span class="hljs-comment"># This loop will likely take 20-30 minutes to run because of Twitter's rate limit</span>
    <span class="hljs-keyword">for</span> tweet_id <span class="hljs-keyword">in</span> tweet_ids:
        count += <span class="hljs-number">1</span>
        print(str(count) + <span class="hljs-string">": "</span> + str(tweet_id))
        <span class="hljs-keyword">try</span>:
            tweet = api.get_status(tweet_id, tweet_mode=<span class="hljs-string">'extended'</span>)
            print(<span class="hljs-string">"Success"</span>)
            json.dump(tweet._json, outfile)
            outfile.write(<span class="hljs-string">'\n'</span>)
        <span class="hljs-keyword">except</span> Exception <span class="hljs-keyword">as</span> e:
            print(<span class="hljs-string">"Fail"</span>)
            fails_dict[tweet_id] = e
            <span class="hljs-keyword">pass</span>
end = timer()
print(end - start)
print(fails_dict)
</code></pre><ul>
<li>Reading tweet JSON content as pandas dataframe</li>
</ul>
<pre><code>tweet_status_df <span class="hljs-operator">=</span> pd.read_json(<span class="hljs-string">'tweet_json.txt'</span>, lines <span class="hljs-operator">=</span> True,encoding<span class="hljs-operator">=</span><span class="hljs-string">'utf-8'</span>)
</code></pre><p><strong>Assessing Data</strong></p>
<p><strong>Objectives</strong><br />
In this section, I detect and document at least eight (8) quality issues and two (2) tidiness issues using visual and programmatic assessment.
The issues fall into two types:</p>
<ul>
<li>Quality issues or dirty data: missing, duplicated, or incorrect data</li>
<li>Tidiness issues: messy or unstructured data.</li>
</ul>
<p><strong>Methodology</strong><br />
I use visual and programmatic assessment on each dataframe to detect the issues and document them.</p>
<p><strong>Visual assessment of twitter_archive_df</strong></p>
<ul>
<li>Displaying a random sample of three rows from twitter_archive_df</li>
</ul>
<pre><code>twitter_archive_df.sample(<span class="hljs-number">3</span>)
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1657490722565/Tz02X9UQB.png" alt="sam1.png" /></p>
<ul>
<li>We notice a quality issue in the name column because it contains invalid and missing names, which are not accurate</li>
<li>We notice a tidiness issue in the source column: HTML tags, URL, and content all in a single column.</li>
</ul>
<p><strong>Programmatic assessment of twitter_archive_df</strong></p>
<p>Through visual assessment, we found invalid names in the name column of the dataframe.</p>
<ul>
<li>Displaying the first 20 dog names</li>
</ul>
<pre><code>twitter<span class="hljs-emphasis">_archive_</span>df[<span class="hljs-string">'name'</span>][<span class="hljs-symbol">:20</span>]
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1657491036615/cbUqpiIOo.png" alt="sam1.png" /></p>
<ul>
<li>Counting the occurrences of the unique dog names</li>
</ul>
<pre><code><span class="hljs-selector-tag">twitter_archive_df</span><span class="hljs-selector-attr">[<span class="hljs-string">'name'</span>]</span><span class="hljs-selector-class">.value_counts</span>()<span class="hljs-selector-attr">[:20]</span>
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1657491291937/zs2U7riQy.png" alt="sam1.png" /></p>
<ul>
<li><p>We notice that 745 dogs in the dataframe have the name "None" and 55 dogs have the name "a"</p>
</li>
<li><p>Displaying the value counts of the rating_denominator column</p>
</li>
</ul>
<pre><code><span class="hljs-selector-tag">twitter_archive_df</span><span class="hljs-selector-attr">[<span class="hljs-string">'rating_denominator'</span>]</span><span class="hljs-selector-class">.value_counts</span>()<span class="hljs-selector-attr">[:20]</span>
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1657492401958/rBJRfO7WL.png" alt="sam1.png" /></p>
<ul>
<li><p>We notice incorrect values in the rating_denominator column: it should always have the same base, 10, so this raises an accuracy quality issue</p>
</li>
<li><p>Displaying the rating_numerator column</p>
</li>
</ul>
<pre><code><span class="hljs-selector-tag">twitter_archive_df</span><span class="hljs-selector-attr">[<span class="hljs-string">'rating_numerator'</span>]</span>
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1657896260500/YHK2ZIFEQ.png" alt="Screenshot (6).png" /></p>
<pre><code><span class="hljs-selector-tag">twitter_archive_df</span><span class="hljs-selector-attr">[<span class="hljs-string">'rating_numerator'</span>]</span><span class="hljs-selector-class">.value_counts</span>()<span class="hljs-selector-attr">[:20]</span>
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1657896465290/OhYx8xCX7.png" alt="Screenshot7.png" /></p>
<ul>
<li>The numerator values should range from 0 to 10, which is not the case, raising an accuracy quality issue</li>
</ul>
<p><strong>Displaying descriptive information about the twitter_archive_df</strong></p>
<pre><code>twitter_archive_df.info()
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1657896694052/D3BOhd_CG.png" alt="Screenshot (8).png" /></p>
<p>From the programmatic assessment we notice some quality issues:<br />- the timestamp column needs to be converted to datetime</p>
<pre><code><span class="hljs-selector-tag">twitter_archive_df</span><span class="hljs-selector-class">.pupper</span>
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1657896877045/UyzTJov4s.png" alt="Screenshot (9).png" /></p>
<pre><code><span class="hljs-selector-tag">twitter_archive_df</span><span class="hljs-selector-class">.doggo</span>
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1657897059306/HV9ANZNoy.png" alt="Screenshot (10).png" /></p>
<pre><code><span class="hljs-selector-tag">twitter_archive_df</span><span class="hljs-selector-class">.floofer</span>
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1657897228631/fRhmSBo1kI.png" alt="Screenshot (11).png" /></p>
<pre><code><span class="hljs-selector-tag">twitter_archive_df</span><span class="hljs-selector-class">.puppo</span>
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1657897469148/9N1boh6zx.png" alt="Screenshot (12).png" /></p>
<ul>
<li>From programmatic assessment we find a tidiness issue in the columns doggo, floofer, pupper, and puppo: they all encode the same kind of information and can be merged into one column</li>
</ul>
<p><strong>Checking for null values</strong> </p>
<pre><code>twitter_archive_df.isnull().sum()
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1657897687689/WGL279AKL.png" alt="Screenshot (15).png" /></p>
<ul>
<li>The in_reply_to_status_id, in_reply_to_user_id, retweeted_status_id, retweeted_status_user_id, and retweeted_status_timestamp columns have mostly null values, which brings about a quality issue; we need to drop them</li>
</ul>
<pre><code><span class="hljs-selector-tag">twitter_archive_df</span><span class="hljs-selector-class">.text</span><span class="hljs-selector-attr">[:20]</span>
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1657897833182/hst3BWY9D.png" alt="Screenshot (16).png" /></p>
<ul>
<li>Checking the occurrence of RT in the text column, as retweets must be removed per the project requirements</li>
</ul>
<pre><code>display(twitter_archive_df[twitter_archive_df[<span class="hljs-string">'text'</span>].str.contains(<span class="hljs-string">'RT @'</span>)])
print(<span class="hljs-string">'the number of RT in the text is:'</span>, sum(twitter_archive_df[<span class="hljs-string">'text'</span>].str.contains(<span class="hljs-string">'RT @'</span>)))
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1657898029061/Wvx-kbk4t.png" alt="Screenshot (17).png" /></p>
<ul>
<li><p>The number of RT occurrences in the text is 181</p>
</li>
<li><p>Programmatically, this is a quality issue because the text column contains both text and URLs, plus 181 retweets</p>
</li>
</ul>
<p><strong>Visual assessment of image_predictions_df</strong></p>
<pre><code>image_predictions_df.sample(<span class="hljs-number">4</span>)
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1657898384613/aHO-lvyua.png" alt="Screenshot (18).png" /></p>
<ul>
<li>From visual assessment we notice a quality issue in columns p1, p2 and p3: inconsistent spelling of the breed names, mixing uppercase, lowercase, and underscores</li>
</ul>
<p><strong>Programmatic assessment of image_prediction dataframe</strong></p>
<ul>
<li>Displaying the description of the image_predictions.tsv</li>
</ul>
<pre><code>image_predictions_df.info()
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1657898731324/mRAWiNiSI.png" alt="Screenshot (19).png" /></p>
<ul>
<li>From the descriptive information of the image_predictions dataframe, tweet_id needs to be converted to a string data type, since it is an identifier rather than a number used in calculations, so the conversion does not affect our analysis. We therefore raise a quality issue on the validity of the tweet_id data type</li>
</ul>
<p><strong>Checking for duplicated value</strong></p>
<pre><code><span class="hljs-selector-tag">sum</span>(<span class="hljs-selector-tag">image_predictions_df</span><span class="hljs-selector-attr">[<span class="hljs-string">'tweet_id'</span>]</span><span class="hljs-selector-class">.duplicated</span>())
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1657899007659/TiWxRU_Yf.png" alt="Screenshot (20).png" /></p>
<pre><code><span class="hljs-selector-tag">sum</span>(<span class="hljs-selector-tag">image_predictions_df</span><span class="hljs-selector-class">.jpg_url</span><span class="hljs-selector-class">.duplicated</span>())
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1657899266869/kqZ_0wvLZ.png" alt="Screenshot (20).png" /></p>
<ul>
<li>There are 66 duplicated jpg_url entries, bringing about a quality issue on the validity of the data</li>
</ul>
<p><strong>Visual assessment of tweet_status</strong></p>
<pre><code>tweet_status_df.sample(<span class="hljs-number">4</span>)
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1657906391336/7neWwhEfj.png" alt="Screenshot (22).png" /></p>
<ul>
<li>We notice a quality issue in the columns in_reply_to_status_id, retweeted_status, quoted_status_id, quoted_status_id_str, and quoted_status, which hold NaN values</li>
<li>We notice a tidiness issue in the columns source, entities, and extended_entities, which contain HTML tags and URLs;
full_text contains RT @, which needs to be removed per the project requirements</li>
</ul>
<p><strong>Programmatic assessment of tweet_status_df</strong></p>
<pre><code>tweet_status_df.isnull().sum()
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1658060325293/rz8he3kzE.png" alt="Screenshot (23).png" /></p>
<ul>
<li>We have a quality issue on the validity of the data in the columns listed below, which have a high number of NaN values; we therefore need to drop them:<br />
in_reply_to_status_id<br />
in_reply_to_status_id_str<br />
in_reply_to_user_id<br />
in_reply_to_user_id_str<br />
in_reply_to_screen_name<br />
geo<br />
coordinates<br />
place<br />
contributors<br />
retweeted_status<br />
quoted_status_id<br />
quoted_status_id_str<br />
quoted_status<br /></li>
</ul>
<p><strong>Project scope</strong></p>
<p>Based on the tidiness concept, twitter_archive_enhanced.csv, tweet_status_df, and image_predictions.tsv should be merged using tweet_id as the mapping key</p>
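<p>Conceptually, that merge looks like the sketch below. This is only a sketch: the actual joins are performed step by step in the cleaning section, after the data types are fixed and the id column of tweet_status_df is renamed to tweet_id.</p>
<pre><code># Sketch only: join the three raw tables on tweet_id
master_sketch = (twitter_archive_df
                 .merge(image_predictions_df, on='tweet_id', how='inner')
                 .merge(tweet_status_df.rename(columns={'id': 'tweet_id'}),
                        on='tweet_id', how='inner'))
</code></pre>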
<p><strong>Quality issues</strong></p>
<ul>
<li><p>Invalid and missing dog names in the name column of twitter_archive_df</p>
</li>
<li><p>Invalid ratings in the rating_numerator column of twitter_archive_df: values vary from 1776 to 0, and the data type needs to be converted from int to float</p>
</li>
<li><p>Invalid denominator in the rating_denominator column of twitter_archive_df: it should be a fixed base of 10, and the data type needs to be converted from int to float</p>
</li>
<li><p>The timestamp column in twitter_archive_df needs to be converted into the datetime data type</p>
</li>
<li><p>in_reply_to_status_id, in_reply_to_user_id, retweeted_status_id, retweeted_status_user_id, and retweeted_status_timestamp have missing data</p>
</li>
<li><p>Spelling inconsistency (uppercase, lowercase, underscores) in columns p1, p2 and p3 of image_predictions_df</p>
</li>
<li><p>The jpg_url column in image_predictions_df has 66 duplicated images; for the accuracy of the data we need to drop them</p>
</li>
<li><p>tweet_id in image_predictions_df needs to be converted to string</p>
</li>
<li><p>The id column in tweet_status_df needs to be converted from int to string</p>
</li>
<li><p>The id column needs to be renamed to tweet_id</p>
</li>
</ul>
<p><strong>Tidiness issues</strong></p>
<ul>
<li>HTML tags, URL, and content mixed in the source column</li>
<li>The columns doggo, floofer, pupper, and puppo all encode the same kind of information and can be merged into one column</li>
<li>twitter_archive_df, image_predictions_df, and tweet_status_df can be merged</li>
<li>There are two pieces of information in the single text column, plus ampersand escapes and \n characters</li>
<li>Only 3 useful columns in tweet_status_df: id, favorite_count, retweet_count</li>
<li>Retweets need to be removed from the text and full_text columns</li>
</ul>
<p><strong>Cleaning Data</strong></p>
<p>In this section, we clean all of the issues listed above while assessing the data.</p>
<p><strong>Making copies of the original dataframes</strong></p>
<pre><code>twitter_archive_clean <span class="hljs-operator">=</span> twitter_archive_df.copy()
image_predictions_clean <span class="hljs-operator">=</span> image_predictions_df.copy()
tweet_status_clean <span class="hljs-operator">=</span> tweet_status_df.copy()
</code></pre><p><strong>Quality Issue #8:</strong></p>
<p>Convert tweet_id column in image_predictions_clean from int to str</p>
<p><strong>Define:</strong></p>
<p> Converting tweet_id column from int to str</p>
<p><strong>Code</strong></p>
<pre><code>image_predictions_clean.tweet_id <span class="hljs-operator">=</span> image_predictions_clean.tweet_id.astype(str)
</code></pre><p><strong>Test</strong></p>
<pre><code><span class="hljs-keyword">type</span>(image_predictions_clean.tweet_id[<span class="hljs-number">0</span>])
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1658063412603/wPSJ3FVIM.png" alt="Screenshot (25).png" /></p>
<p><strong>Quality Issue #6:</strong></p>
<p>Inconsistent use of lowercase and uppercase and underscores in p1, p2,p3 columns</p>
<p><strong>Define:</strong></p>
<p> Replacing underscores ('_') with spaces and capitalizing the first letter.</p>
<p><strong>Code</strong></p>
<pre><code>image_predictions_clean.p1 <span class="hljs-operator">=</span> image_predictions_clean.p1.str.replace(<span class="hljs-string">'_'</span>, <span class="hljs-string">" "</span>).str.capitalize()
image_predictions_clean.p2 <span class="hljs-operator">=</span> image_predictions_clean.p2.str.replace(<span class="hljs-string">'_'</span>, <span class="hljs-string">" "</span>).str.capitalize()
image_predictions_clean.p3 <span class="hljs-operator">=</span> image_predictions_clean.p3.str.replace(<span class="hljs-string">'_'</span>, <span class="hljs-string">" "</span>).str.capitalize()
</code></pre><p><strong>Test </strong></p>
<pre><code><span class="hljs-selector-tag">image_predictions_clean</span><span class="hljs-selector-attr">[[<span class="hljs-string">'p1'</span>,<span class="hljs-string">'p2'</span>,<span class="hljs-string">'p3'</span>]</span>]<span class="hljs-selector-class">.sample</span>(6)
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1658063592519/I1TkhO99M.png" alt="Screenshot (26).png" /></p>
<p><strong>Quality Issue #7</strong></p>
<p>Removing duplicated jpg_url values from image_predictions_df</p>
<p><strong>Define:</strong></p>
<p>Indexing all the duplicated values in the jpg_url column, selecting the non-duplicated rows, and assigning them back to image_predictions_clean</p>
<p><strong>Code</strong></p>
<pre><code>indexing <span class="hljs-operator">=</span> image_predictions_df.jpg_url.duplicated()
indexing <span class="hljs-operator">=</span> np.logical_not(indexing)
image_predictions_clean<span class="hljs-operator">=</span> image_predictions_clean[indexing]
</code></pre><p><strong>Test</strong></p>
<pre><code><span class="hljs-selector-tag">print</span>("<span class="hljs-selector-tag">Before</span> <span class="hljs-selector-tag">cleaning</span>: {} <span class="hljs-selector-tag">rows</span>.\<span class="hljs-selector-tag">nAfter</span> <span class="hljs-selector-tag">cleaning</span>: {} <span class="hljs-selector-tag">rows</span>."<span class="hljs-selector-class">.format</span>(<span class="hljs-selector-tag">image_predictions_df</span><span class="hljs-selector-class">.shape</span><span class="hljs-selector-attr">[0]</span>,<span class="hljs-selector-tag">image_predictions_clean</span><span class="hljs-selector-class">.shape</span><span class="hljs-selector-attr">[0]</span>))
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1658063783520/qWPQK8ylb.png" alt="Screenshot (27).png" /></p>
<pre><code>print(<span class="hljs-string">"{} duplicated."</span>.format(sum(image_predictions_clean.jpg_url.duplicated())))
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1658063873421/9Tbr2oPoR.png" alt="Screenshot (28).png" /></p>
<p><strong>Quality Issue #4:</strong></p>
<p>Incorrect values and incorrect type for timestamp</p>
<p><strong>Define:</strong> </p>
<p>Converting timestamp column from object to datetime series</p>
<p><strong>Code</strong></p>
<pre><code>twitter_archive_clean[<span class="hljs-string">'timestamp'</span>] <span class="hljs-operator">=</span> twitter_archive_clean[<span class="hljs-string">'timestamp'</span>].astype(<span class="hljs-string">'datetime64[ns]'</span>)
</code></pre><p><strong>Test</strong></p>
<pre><code><span class="hljs-selector-tag">twitter_archive_clean</span><span class="hljs-selector-class">.timestamp</span><span class="hljs-selector-attr">[0]</span>
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1658064080946/MWSQff9Il.png" alt="Screenshot (29).png" /></p>
<p><strong>Quality Issue #1</strong></p>
<p>Invalid and missing dog names in the name column</p>
<p><strong>Define:</strong></p>
<p> Replacing invalid (lowercase) names and the placeholder "None" with null values</p>
<p><strong>Code</strong></p>
<pre><code>invalid_names<span class="hljs-operator">=</span>list(twitter_archive_clean[twitter_archive_clean.<span class="hljs-built_in">name</span>.
                                           str.contains(<span class="hljs-string">'^[a-z].*'</span>)].
                   <span class="hljs-built_in">name</span>.value_counts().index) <span class="hljs-operator">+</span> [<span class="hljs-string">'None'</span>]
twitter_archive_clean.loc[twitter_archive_clean.<span class="hljs-built_in">name</span>.apply(lambda x: x in invalid_names),<span class="hljs-string">'name'</span>]<span class="hljs-operator">=</span>None
</code></pre><p><strong>Test</strong></p>
<pre><code>(twitter_archive_clean.<span class="hljs-built_in">name</span>=<span class="hljs-operator">=</span><span class="hljs-string">'None'</span>).sum()
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1658064301785/wvobZ-_j-.png" alt="Screenshot (20).png" /></p>
<pre><code>(twitter_archive_clean.<span class="hljs-built_in">name</span>=<span class="hljs-operator">=</span><span class="hljs-string">'a'</span>).sum()
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1658064301785/wvobZ-_j-.png" alt="Screenshot (20).png" /></p>
<pre><code>(twitter_archive_clean.<span class="hljs-built_in">name</span>.apply(lambda x: x in invalid_names)).sum()
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1658064301785/wvobZ-_j-.png" alt="Screenshot (20).png" /></p>
<p><strong>Quality Issue #10:</strong></p>
<p>twitter_archive_clean contains retweets (RT @) in the text column</p>
<p><strong>Define</strong></p>
<p>As per the project specification, we only want original dog ratings, so we need to remove retweets (rows whose text starts with RT @). To fix this quality issue, we create boolean vectors indexing all the non-retweets, i.e. rows where retweeted_status_id, retweeted_status_user_id, and retweeted_status_timestamp are null, and keep only that subset.</p>
<p><strong>Code</strong></p>
<pre><code><span class="hljs-attr">indexing_retweeted_status_id</span> =twitter_archive_clean.retweeted_status_id.isnull()
<span class="hljs-attr">twitter_archive_clean</span> = twitter_archive_clean[indexing_retweeted_status_id]
<span class="hljs-attr">indexing_retweeted_status_user_id</span> = twitter_archive_clean.retweeted_status_user_id.isnull()
<span class="hljs-attr">twitter_archive_clean</span> = twitter_archive_clean[indexing_retweeted_status_user_id]
<span class="hljs-attr">indexing_retweeted_status_timestamp</span> = twitter_archive_clean.retweeted_status_timestamp.isnull()
<span class="hljs-attr">twitter_archive_clean</span> = twitter_archive_clean[indexing_retweeted_status_timestamp ]
</code></pre><p><strong>Test</strong></p>
<pre><code>print("Number of rows <span class="hljs-keyword">with</span> <span class="hljs-literal">true</span> <span class="hljs-keyword">in</span> retweeted_status_id:<span class="hljs-string">", sum(twitter_archive_clean.retweeted_status_id.isnull()))
print("</span><span class="hljs-built_in">Number</span> <span class="hljs-keyword">of</span> <span class="hljs-keyword">rows</span> <span class="hljs-keyword">with</span> <span class="hljs-literal">true</span> <span class="hljs-keyword">in</span> retweeted_status_timestamp:<span class="hljs-string">", sum(twitter_archive_clean.retweeted_status_timestamp.isnull()))
print("</span><span class="hljs-built_in">Number</span> <span class="hljs-keyword">of</span> <span class="hljs-keyword">rows</span> <span class="hljs-keyword">with</span> <span class="hljs-literal">true</span> <span class="hljs-keyword">in</span> retweeted_status_user_id:<span class="hljs-string">", sum(twitter_archive_clean.retweeted_status_user_id.isnull()))
print("</span><span class="hljs-built_in">Number</span> <span class="hljs-keyword">of</span> <span class="hljs-keyword">rows</span> <span class="hljs-keyword">of</span> twitter_archive_clean:<span class="hljs-string">",twitter_archive_clean.shape[0])</span>
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1658064619183/iW-U3y1bu.png" alt="Screenshot (32).png" /></p>
<pre><code>display(twitter_archive_clean[twitter_archive_clean[<span class="hljs-string">'text'</span>].str.contains(<span class="hljs-string">'RT @'</span>)])
print(<span class="hljs-string">'the number of RT in the text is:'</span>, sum(twitter_archive_clean[<span class="hljs-string">'text'</span>].str.contains(<span class="hljs-string">'RT @'</span>)))
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1658064731941/AMfHvY2cN.png" alt="Screenshot (33).png" /></p>
<pre><code><span class="hljs-selector-tag">twitter_archive_clean</span><span class="hljs-selector-attr">[<span class="hljs-string">'text'</span>]</span><span class="hljs-selector-class">.sample</span>(45)
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1658064839747/ybv3HeuFz.png" alt="Screenshot (34).png" /></p>
<p><strong>Tidiness Issue #1</strong></p>
<p>The source column contains HTML tags and hyperlinks</p>
<p><strong>Define:</strong></p>
<ul>
<li><p>Extracting the content between opening and closing tag using regular expressions.
Extracting the link.</p>
</li>
<li><p>Replacing source variable in the dataset with just the source name</p>
</li>
<li><p>Creating additional table with source link that we could use as a lookup table.</p>
</li>
</ul>
<p><strong>Code</strong></p>
<pre><code>source_link = twitter_archive_clean.source.str.extract(<span class="hljs-string">r'&lt;a href="(.+)" .+&gt;'</span>, expand=<span class="hljs-literal">False</span>)
source = twitter_archive_clean.source.str.extract(<span class="hljs-string">r'&gt;([A-z -]+)&lt;'</span>, expand=<span class="hljs-literal">False</span>)
twitter_archive_clean.source = source
</code></pre><pre><code>sources <span class="hljs-operator">=</span> pd.DataFrame({<span class="hljs-string">'source'</span>: source, <span class="hljs-string">'source_link'</span>: source_link})
sources.drop_duplicates(inplace<span class="hljs-operator">=</span>True)
</code></pre><pre><code>sources
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1658065027684/g2__ng8yu.png" alt="Screenshot (35).png" /></p>
<p><strong>Test</strong></p>
<pre><code><span class="hljs-selector-tag">twitter_archive_clean</span><span class="hljs-selector-class">.source</span><span class="hljs-selector-class">.value_counts</span>()
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1658065146375/a3xtWZuI8.png" alt="Screenshot (37).png" /></p>
<p><strong>Quality Issue #5</strong></p>
<p>Dropping columns in_reply_to_status_id, in_reply_to_user_id, retweeted_status_id,retweeted_status_user_id, retweeted_status_timestamp and expanded_urls</p>
<p><strong>Define:</strong></p>
<p>Removing the columns in_reply_to_status_id, in_reply_to_user_id ,retweeted_status_id,retweeted_status_user_id, retweeted_status_timestamp</p>
<p><strong>Code</strong></p>
<pre><code>columns_to_remove <span class="hljs-operator">=</span> [<span class="hljs-string">'in_reply_to_user_id'</span>, <span class="hljs-string">'in_reply_to_status_id'</span>,
                    <span class="hljs-string">'retweeted_status_id'</span>, <span class="hljs-string">'retweeted_status_user_id'</span>,
                    <span class="hljs-string">'retweeted_status_timestamp'</span>,<span class="hljs-string">'expanded_urls'</span>]
twitter_archive_clean.drop(columns_to_remove, axis<span class="hljs-operator">=</span><span class="hljs-number">1</span>, inplace<span class="hljs-operator">=</span>True)
</code></pre><p><strong>Test</strong></p>
<pre><code>twitter_archive_clean.<span class="hljs-keyword">columns</span>
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1658065279955/cTHemKCXc.png" alt="Screenshot (38).png" /></p>
<p><strong>Tidiness Issue #2</strong></p>
<p>Merging columns doggo, floofer,pupper, and puppo</p>
<p><strong>Define</strong></p>
<p>The columns doggo, floofer, pupper, and puppo hold the same kind of values and can therefore be merged into one feature</p>
<p><strong>Code</strong></p>
<pre><code>dog_cols <span class="hljs-operator">=</span> twitter_archive_clean[[<span class="hljs-string">'doggo'</span>,<span class="hljs-string">'floofer'</span>,<span class="hljs-string">'pupper'</span>,<span class="hljs-string">'puppo'</span>]]
dog_cols <span class="hljs-operator">=</span> dog_cols.replace(<span class="hljs-string">'None'</span>, <span class="hljs-string">''</span>) 
dog_category <span class="hljs-operator">=</span> np.array(dog_cols[<span class="hljs-string">'doggo'</span>]) <span class="hljs-operator">+</span> np.array(dog_cols[<span class="hljs-string">'floofer'</span>]) <span class="hljs-operator">+</span> np.array(dog_cols[<span class="hljs-string">'pupper'</span>]) <span class="hljs-operator">+</span> np.array(dog_cols[<span class="hljs-string">'puppo'</span>])
pd.DataFrame(dog_category, columns <span class="hljs-operator">=</span> [<span class="hljs-string">'dog_category'</span>]).dog_category.value_counts()
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1658065414369/FVK47B7gn.png" alt="Screenshot (39).png" /></p>
<ul>
<li>Appending this new column, called dog_category, to twitter_archive_clean</li>
</ul>
<pre><code>twitter_archive_clean.reset_index(drop<span class="hljs-operator">=</span>True, inplace<span class="hljs-operator">=</span>True)
twitter_archive_clean <span class="hljs-operator">=</span> pd.concat([twitter_archive_clean, pd.DataFrame(dog_category, columns <span class="hljs-operator">=</span> [<span class="hljs-string">'dog_category'</span>])], axis <span class="hljs-operator">=</span> <span class="hljs-number">1</span>)
</code></pre><ul>
<li>Dropping the individual columns after merging them</li>
</ul>
<pre><code>columns_to_remove <span class="hljs-operator">=</span> [<span class="hljs-string">'doggo'</span>,<span class="hljs-string">'floofer'</span>,<span class="hljs-string">'pupper'</span>,<span class="hljs-string">'puppo'</span>]
twitter_archive_clean.drop(columns_to_remove, axis<span class="hljs-operator">=</span><span class="hljs-number">1</span>, inplace<span class="hljs-operator">=</span>True)
</code></pre><p><strong>Test</strong></p>
<pre><code><span class="hljs-selector-tag">twitter_archive_clean</span><span class="hljs-selector-class">.dog_category</span><span class="hljs-selector-class">.value_counts</span>()
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1658065595280/kkxgjHL1z.png" alt="Screenshot (40).png" /></p>
<pre><code>twitter_archive_clean.<span class="hljs-keyword">columns</span>
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1658065716536/TMNJkEoOp.png" alt="Screenshot (42).png" /></p>
<p><strong>Tidiness Issue #4</strong></p>
<p>There are two pieces of information in the single text column.</p>
<p><strong>Define</strong></p>
<p>Removing the URL at the end of the text column, along with ampersand escapes and newline characters</p>
<p><strong>Code</strong></p>
<pre><code>twitter_archive_clean[<span class="hljs-string">'text'</span>] = twitter_archive_clean.text.str.replace(<span class="hljs-string">r'[(https://.+),(\&amp;amp;)|(\n)]'</span>,<span class="hljs-string">''</span>,regex=<span class="hljs-literal">True</span>)
</code></pre><p><strong>Test</strong></p>
<pre><code><span class="hljs-selector-tag">twitter_archive_clean</span><span class="hljs-selector-attr">[<span class="hljs-string">'text'</span>]</span><span class="hljs-selector-class">.sample</span>(45)
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1658065873974/YVFeMLuf3.png" alt="Screenshot (43).png" /></p>
<p><strong>Quality Issue #2 and #3:</strong></p>
<p>Invalid ratings: values vary from 1776 to 0, and the data type must be converted from int to float. Invalid denominator: a fixed base of 10 is expected, and the data type must be converted from int to float.</p>
<p><strong>Define</strong></p>
<ul>
<li>Convert rating_numerator and rating_denominator to float because @dog_rates uses float rating numbers.</li>
<li>Remove the extreme values (1776, 420, etc.) of rating_numerator; and</li>
<li>Remove unexpected denominator values, i.e. anything different from 10.</li>
</ul>
<p><strong>Code</strong></p>
<ul>
<li>Converting the rating_numerator and rating_denominator to float.</li>
</ul>
<pre><code>twitter_archive_clean.rating_numerator <span class="hljs-operator">=</span> twitter_archive_clean.rating_numerator.astype(float)
twitter_archive_clean.rating_denominator <span class="hljs-operator">=</span> twitter_archive_clean.rating_denominator.astype(float)
</code></pre><ul>
<li>From the visual assessment of rating_numerator done earlier, we need to drop the rows with the invalid ratings 1776, 420, and 204.</li>
</ul>
<pre><code>invalid_ratings_1776_420_204<span class="hljs-operator">=</span> twitter_archive_clean[(twitter_archive_clean.rating_numerator=<span class="hljs-operator">=</span><span class="hljs-number">1776</span>)<span class="hljs-operator">|</span>(twitter_archive_clean.rating_numerator=<span class="hljs-operator">=</span><span class="hljs-number">204</span>)<span class="hljs-operator">|</span>(twitter_archive_clean.rating_numerator=<span class="hljs-operator">=</span><span class="hljs-number">420</span>)].index
twitter_archive_clean.drop(invalid_ratings_1776_420_204, inplace<span class="hljs-operator">=</span>True)
</code></pre><ul>
<li>Removing tweet_ids with a 0/10 rating</li>
<li>Listing the tweet_ids having 0 as numerator and removing them from the dataframe</li>
</ul>
<pre><code>list(twitter_archive_clean.query(<span class="hljs-string">'rating_numerator == 0'</span>).tweet_id)
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1658066184662/P7ySl7Y5h.png" alt="Screenshot (44).png" /></p>
<pre><code><span class="hljs-attr">rm_list</span> = list(twitter_archive_clean.query(<span class="hljs-string">'rating_numerator == 0'</span>).tweet_id)
<span class="hljs-comment"># Creating a vector to subset twitter_archive_clean and remove the tweet_id from the rm_list.</span>
<span class="hljs-attr">indexing</span> = np.logical_not(twitter_archive_clean.tweet_id.isin(rm_list))
<span class="hljs-comment"># Updating the twitter_archive_clean data frame.</span>
<span class="hljs-attr">twitter_archive_clean</span> = twitter_archive_clean[indexing]
</code></pre><p><strong>Test</strong></p>
<pre><code>twitter_archive_clean.query(<span class="hljs-string">'rating_numerator == 0'</span>)
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1658066307417/xsQuud7KF.png" alt="Screenshot (45).png" /></p>
<ul>
<li>Querying denominators other than 10</li>
</ul>
<pre><code>list(twitter_archive_clean.query(<span class="hljs-string">'rating_denominator !=10'</span>).tweet_id)
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1658066402673/iVEftCQM_.png" alt="Screenshot (46).png" /></p>
<ul>
<li>Removing those three tweet_ids</li>
</ul>
<pre><code><span class="hljs-attr">rm_list</span> = list(twitter_archive_clean.query(<span class="hljs-string">'rating_denominator !=10'</span>).tweet_id)
<span class="hljs-comment"># Creating a vector to subset twitter_archive_clean and remove the tweet_id from the rm_list.</span>
<span class="hljs-attr">indexing</span> = np.logical_not(twitter_archive_clean.tweet_id.isin(rm_list))
<span class="hljs-comment"># Updating the twitter_archive_clean data frame.</span>
<span class="hljs-attr">twitter_archive_clean</span> = twitter_archive_clean[indexing]
</code></pre><p><strong>Test</strong></p>
<pre><code>twitter_archive_clean.query(<span class="hljs-string">'rating_denominator &lt; 10'</span>)
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1658066542568/k4qIgG2Gy.png" alt="Screenshot (48).png" /></p>
<pre><code>twitter_archive_clean.query(<span class="hljs-string">'rating_denominator &gt; 10'</span>)
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1658066542568/k4qIgG2Gy.png" alt="Screenshot (48).png" /></p>
<pre><code><span class="hljs-selector-tag">twitter_archive_clean</span><span class="hljs-selector-class">.rating_denominator</span><span class="hljs-selector-class">.value_counts</span>()
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1658066678538/T6XX9SQHY.png" alt="Screenshot (49).png" /></p>
<pre><code>twitter_archive_clean.tweet_id <span class="hljs-operator">=</span>twitter_archive_clean.tweet_id.astype(str)
</code></pre><p><strong>Tidiness Issue #5: Merging twitter_archive_clean and image_predictions_clean</strong></p>
<pre><code>twitter_archive_merged <span class="hljs-operator">=</span> twitter_archive_clean.copy()
twitter_archive_merged_df <span class="hljs-operator">=</span> twitter_archive_merged.merge(image_predictions_clean, on<span class="hljs-operator">=</span><span class="hljs-string">'tweet_id'</span>, how<span class="hljs-operator">=</span><span class="hljs-string">'inner'</span>)
</code></pre><pre><code>print(<span class="hljs-string">"Shape df_twitter_archive_clean: "</span> <span class="hljs-operator">+</span> str(twitter_archive_merged.shape))
print(<span class="hljs-string">"Shape image_predictions_clean: "</span> <span class="hljs-operator">+</span> str(image_predictions_clean.shape))
print(<span class="hljs-string">"Shape df_twitter_combined "</span> <span class="hljs-operator">+</span> str(twitter_archive_merged_df.shape) <span class="hljs-operator">+</span> <span class="hljs-string">" after joining df_twitter_archive_cleaned, df_tweet_performance and df_image_predictions_cleaned."</span>)
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1658066868022/7VC2f9QEjK.png" alt="Screenshot (50).png" /></p>
<p><strong>Test</strong></p>
<pre><code>twitter_archive_merged_df.info()
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1658066977181/Es7ZRSvrd.png" alt="Screenshot (51).png" /></p>
<p><strong>Quality Issue #9:</strong></p>
<p>Incorrect data type in column id in tweet_status_clean</p>
<p><strong>Define:</strong></p>
<p>Convert id from int to str</p>
<p><strong>Code</strong></p>
<pre><code>tweet_status_clean.id <span class="hljs-operator">=</span> tweet_status_clean.id.astype(<span class="hljs-string">'str'</span>)
tweet_status_clean.rename(columns<span class="hljs-operator">=</span>{<span class="hljs-string">'id'</span>:<span class="hljs-string">'tweet_id'</span>}, inplace<span class="hljs-operator">=</span>True)
</code></pre><p><strong>Test</strong></p>
<pre><code><span class="hljs-keyword">type</span>(tweet_status_clean.tweet_id[<span class="hljs-number">0</span>])
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1658067190316/vy3LpaU4g.png" alt="Screenshot (52).png" /></p>
<p><strong> Tidiness Issue #11:</strong></p>
<p>Removing RT @ in full_text column</p>
<p><strong>Define</strong></p>
<p>Remove the 'RT @' marker from the full_text column, as required by the project specification</p>
<p><strong>Code</strong></p>
<pre><code>display(tweet_status_clean[tweet_status_clean[<span class="hljs-string">'full_text'</span>].str.contains(<span class="hljs-string">'RT @'</span>)])
print(<span class="hljs-string">'the number of RT in the text is:'</span>, sum(tweet_status_clean[<span class="hljs-string">'full_text'</span>].str.contains(<span class="hljs-string">'RT @'</span>)))
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1658067410620/V0H2DAVQq.png" alt="Screenshot (54).png" /></p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1658067477378/AbAcSzni4.png" alt="Screenshot (55).png" /></p>
<ul>
<li>Removing the 160 'RT @' markers from the full_text column</li>
</ul>
<pre><code>tweet_status_clean[<span class="hljs-string">'full_text'</span>]= tweet_status_clean[<span class="hljs-string">'full_text'</span>].apply(<span class="hljs-keyword">lambda</span> x: re.sub(<span class="hljs-string">r'\bRT @\b'</span>,<span class="hljs-string">''</span>,x).strip())
</code></pre><p><strong>Test</strong></p>
<pre><code>sum(tweet_status_clean.full_text.str.contains(<span class="hljs-string">'RT @'</span>))
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1658067571181/FaUb-PwAh.png" alt="Screenshot (20).png" /></p>
<ul>
<li>All 'RT @' markers have been removed</li>
</ul>
<p><strong>Tidiness issue #6</strong></p>
<p>Creating a dataframe with columns: id, favorite_count, retweet_count</p>
<p><strong>Define</strong>
Keep only the useful columns: tweet_id, favorite_count and retweet_count</p>
<p><strong>Code</strong></p>
<pre><code><span class="hljs-attr">tweet_status_clean</span> = tweet_status_clean[[<span class="hljs-string">'tweet_id'</span>, <span class="hljs-string">'favorite_count'</span>, <span class="hljs-string">'retweet_count'</span> ]]
</code></pre><p><strong>Test</strong></p>
<pre><code>tweet_status_clean.sample(<span class="hljs-number">3</span>)
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1658067758641/7OUZl6BOx.png" alt="Screenshot (56).png" /></p>
<p><strong>Tidiness issue #7</strong></p>
<p>Merge tweet_status_clean and twitter_archive_merged_df</p>
<p><strong>Define</strong></p>
<p>Merging tweet_status_clean with twitter_archive_merged_df on tweet_id</p>
<p><strong>Code</strong></p>
<pre><code>twitter_archive_master <span class="hljs-operator">=</span> twitter_archive_merged_df.merge(tweet_status_clean, on<span class="hljs-operator">=</span><span class="hljs-string">'tweet_id'</span>, how<span class="hljs-operator">=</span><span class="hljs-string">'inner'</span>)
</code></pre><p><strong>Test</strong></p>
<pre><code>twitter_archive_master.info()
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1658068001369/UV2LMZ4JX.png" alt="Screenshot (57).png" /></p>
<p><strong>Storing Data</strong></p>
<p>Saving the clean merged dataframe into "twitter_archive_master.csv".</p>
<pre><code>twitter_archive_master.to_csv(<span class="hljs-string">'twitter_archive_master.csv'</span>, index<span class="hljs-operator">=</span>False)
</code></pre><p><strong>Analyzing and Visualizing Data</strong></p>
<p>In this section, we analyze the twitter archive master dataset and visualize the data</p>
<p>Reading and assessing the twitter_archive_master.csv into a dataframe</p>
<pre><code>master_df <span class="hljs-operator">=</span>pd.read_csv(<span class="hljs-string">"twitter_archive_master.csv"</span>)
</code></pre><pre><code>master_df.sample(<span class="hljs-number">3</span>)
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1658068594089/yyWofqEAx.png" alt="Screenshot (58).png" /></p>
<p><strong>Insights:</strong>
In this section, we are interested in finding hidden patterns in the clean twitter archive master dataset.</p>
<ul>
<li>Visualizing the distribution of dog category based on the favorite tweet count</li>
<li>Visualizing the distribution of dog category based on the retweet count</li>
<li>Visualizing the most used devices by WeRateDogs users</li>
<li>Visualizing the most common dog names</li>
<li>Line plot of likes and retweets on the WeRateDogs account</li>
<li>Visualizing the top 20 dog breeds from prediction p2</li>
<li>Visualizing the top 20 dog breeds from prediction p3</li>
<li>Visualizing the dog category with the highest rating score</li>
<li>Visualizing the dog category with the maximum favorite count</li>
<li>Visualizing the dog category with the minimum retweet count</li>
<li>Visualizing total tweets made by WeRateDogs per month</li>
<li>Visualizing some of the dog images from prediction p1</li>
</ul>
<p><strong>Visualization</strong></p>
<p>Define a function to plot the average of a numeric feature per dog category</p>
<pre><code>def distributionPlot(feature, ylabel, title):
    # Average the chosen feature per dog category
    agg = master_df[['dog_category', feature]].groupby('dog_category', as_index=False).mean()
    print(agg)
    # Bar plot of the aggregated values
    f, ax = plt.subplots(1, 1, figsize=(12, 4))
    sns.barplot(data=agg, x='dog_category', y=feature, color='green', ax=ax)
    ax.set_ylabel(ylabel)
    ax.set_xlabel("Dog category")
    ax.set_title(title)
</code></pre><p><strong>Plotting the distribution of average favorite count based on the dog category</strong></p>
<pre><code>distributionPlot('favorite_count', 'average favorite count', "Distribution of average favorite count based on the dog category")
plt.savefig('av_favcounte.png')
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1658068868122/jQEjRAOjT.png" alt="Screenshot (59).png" /></p>
<p>Among the dog categories:<br /></p>
<ul>
<li>puppo has the highest average favorite count at 19573.545455</li>
<li>followed by doggo with 17599.225806</li>
<li>multiclass has 15008.909091</li>
<li>floofer has 11223.857143</li>
<li>pupper has the lowest at 6204.975369</li>
</ul>
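<p>To read this ranking directly from the data, the per-category averages can be sorted in one line (a quick sketch over the same dataframe):</p>
<pre><code>master_df.groupby('dog_category')['favorite_count'].mean().sort_values(ascending=False)
</code></pre>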
<p><strong>Plotting the distribution of average retweet count based on the dog category</strong></p>
<pre><code><span class="hljs-selector-tag">distributionPlot</span>(<span class="hljs-string">'retweet_count'</span>,<span class="hljs-string">'average retweet count'</span>,<span class="hljs-string">' Distribution of average retweet count based on the dog category'</span>)
<span class="hljs-selector-tag">plt</span><span class="hljs-selector-class">.savefig</span>(<span class="hljs-string">'av_retweet_count.png'</span>)
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1658069036256/KCOVCQxr3.png" alt="Screenshot (60).png" /></p>
<p>Among the dog categories:<br /></p>
<ul>
<li>doggo has the highest average retweet count at 5972.709677</li>
<li>followed by puppo with 5325.318182</li>
<li>multiclass has 4548.272727</li>
<li>pupper has 5325.318182</li>
<li>floofer has the lowest at 1909.453202</li>
</ul>
<p><strong>Plotting likes vs retweets</strong></p>
<pre><code>likes <span class="hljs-operator">=</span> pd.Series(data<span class="hljs-operator">=</span>master_df[<span class="hljs-string">'favorite_count'</span>].values, index<span class="hljs-operator">=</span>master_df[<span class="hljs-string">'timestamp'</span>])
retweets <span class="hljs-operator">=</span> pd.Series(data<span class="hljs-operator">=</span>master_df[<span class="hljs-string">'retweet_count'</span>].values, index<span class="hljs-operator">=</span>master_df[<span class="hljs-string">'timestamp'</span>])
</code></pre><pre><code>likes.plot(figsize<span class="hljs-operator">=</span>(<span class="hljs-number">16</span>,<span class="hljs-number">4</span>), label<span class="hljs-operator">=</span><span class="hljs-string">'Favorites'</span>, color<span class="hljs-operator">=</span><span class="hljs-string">'orange'</span>, legend<span class="hljs-operator">=</span>True)
retweets.plot(figsize<span class="hljs-operator">=</span>(<span class="hljs-number">16</span>,<span class="hljs-number">4</span>), label<span class="hljs-operator">=</span><span class="hljs-string">'Retweets'</span>, color<span class="hljs-operator">=</span><span class="hljs-string">'maroon'</span>, legend<span class="hljs-operator">=</span>True);
plt.title('No. of Favorites and Retweets Over Time')
plt.savefig('retvs.png')   # save before plt.show() so the saved figure is not blank
plt.show()
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1658069209272/wVKXn2uOC.png" alt="Screenshot (61).png" /></p>
<p>From the plot of likes vs retweets, we notice that the WeRateDogs Twitter account started on December 11, 2015 with zero likes and retweets. On June 23, 2016, over 65k users retweeted a post and about 140k marked posts as favorites. On February 16, 2017 the account gained more popularity: over 100k users favorited its posts while about 25k retweeted them. By August 1, 2017 retweets had dwindled and only about 58k users favorited the dog posts.</p>
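<p>Because master_df was re-read from CSV, its timestamp column comes back as plain strings; parsing the index and resampling to monthly averages gives a smoother view of the same trend. A minimal sketch reusing the two series defined above:</p>
<pre><code># parse the string index back into datetimes, then average per month
likes.index = pd.to_datetime(likes.index)
retweets.index = pd.to_datetime(retweets.index)
likes.resample('M').mean().plot(figsize=(16,4), label='Favorites (monthly avg)', legend=True)
retweets.resample('M').mean().plot(label='Retweets (monthly avg)', legend=True)
</code></pre>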
<p><strong>Visualizing the most used devices by WeRateDogs users</strong></p>
<pre><code>print(twitter_archive_clean.source.value_counts())
twitter_archive_clean.source.value_counts().plot(kind='bar', figsize=(11,5), title='Most used Twitter source').set_ylabel("Number of Tweets")
plt.savefig('twitter_source.png')
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1658069386321/qerZrhETY.png" alt="Screenshot (62).png" /></p>
<ul>
<li>From the bar plot above, most of the WeRateDogs users used Twitter for iPhone</li>
</ul>
<p><strong>Displaying the most common names among the dogs</strong></p>
<pre><code><span class="hljs-selector-tag">master_df</span><span class="hljs-selector-attr">[<span class="hljs-string">'name'</span>]</span><span class="hljs-selector-class">.value_counts</span>()<span class="hljs-selector-attr">[1:13]</span>
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1658069542704/fC9VyCEDd.png" alt="Screenshot (63).png" /></p>
<p><strong>Visualizing the most popular name of the dogs</strong></p>
<pre><code># Take the 12 most frequent names, skipping index 0 (likely the placeholder/missing-name entry)
sorted_order = master_df['name'].value_counts()[1:13].index
display(sorted_order)
plt.figure(figsize=(10,4))
sns.countplot(data=master_df, x='name', order=sorted_order)
plt.xlabel('Name', fontsize=16)
plt.ylabel('Count', fontsize=16)
plt.title('Most popular dog names')
plt.savefig('popular_dog_name.png')
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1658069669417/ttndSzBGR.png" alt="Screenshot (64).png" /></p>
<ul>
<li>From the above plot, the most common names among the dogs are Charlie, Oliver, Lucy, Tucker, Penny, Winston, Sadie, Toby, Daisy, Lola, Koda and Bo</li>
</ul>
<p><strong>Visualizing 20 dogs breed P2 predicted by twitter user on WeRateDogs</strong></p>
<pre><code>breeds<span class="hljs-operator">=</span> master_df.p2.value_counts().head(<span class="hljs-number">20</span>)
display(breeds)
plt.barh(breeds.index , breeds,color<span class="hljs-operator">=</span><span class="hljs-string">'red'</span>)
plt.xlabel(<span class="hljs-string">'Count'</span>)
plt.ylabel(<span class="hljs-string">'Dog Breed'</span>)
plt.title(<span class="hljs-string">'Top 20 Dog Breeds prediction p2 on tweet'</span>)
plt.gca().invert_yaxis()
plt.savefig('20p2_dog_pred.png')   # save before plt.show() so the saved figure is not blank
plt.show()
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1658228989613/NxRQGpoKL.png" alt="update.png" /></p>
<ul>
<li>From the above plot, the most frequently predicted breed (p2) is Labrador retriever</li>
</ul>
<p><strong>Visualizing 20 dogs breed P3 predicted by twitter user on WeRateDogs</strong></p>
<pre><code>breeds<span class="hljs-operator">=</span> master_df.p3.value_counts().head(<span class="hljs-number">20</span>)
display(breeds)
plt.barh(breeds.index , breeds,color<span class="hljs-operator">=</span><span class="hljs-string">'purple'</span>)
plt.xlabel(<span class="hljs-string">'Count'</span>)
plt.ylabel(<span class="hljs-string">'Dog Breed'</span>)
plt.title('Top 20 Dog Breeds prediction p3 on tweet')
plt.gca().invert_yaxis()
plt.savefig('20d0gs_p3.png')   # save before plt.show() so the saved figure is not blank
plt.show()
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1658070198799/7BLHSav4k.png" alt="Screenshot (67).png" /></p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1658070279206/G4-l_7Mmg.png" alt="Screenshot (68).png" /></p>
<ul>
<li>From the above plot, the most frequently predicted breed (p3) is Labrador retriever</li>
</ul>
<p><strong>Visualizing the dog category with the highest score</strong></p>
<ul>
<li>Creating a new column called rating in master_df</li>
</ul>
<pre><code>master_df[<span class="hljs-string">'rating'</span>] <span class="hljs-operator">=</span> master_df[<span class="hljs-string">'rating_numerator'</span>]<span class="hljs-operator">/</span>master_df[<span class="hljs-string">'rating_denominator'</span>]
</code></pre><pre><code>dog_rating<span class="hljs-operator">=</span>master_df.groupby(<span class="hljs-string">'dog_category'</span>)[<span class="hljs-string">'rating'</span>].<span class="hljs-built_in">max</span>()
print(dog_rating)
dog_rating.plot.bar(figsize<span class="hljs-operator">=</span>(<span class="hljs-number">10</span>,<span class="hljs-number">5</span>))
plt.ylim(top<span class="hljs-operator">=</span><span class="hljs-number">10</span>)
plt.title(<span class="hljs-string">"Dog category with the highest rating score"</span>)
plt.xlabel(<span class="hljs-string">"Dog category"</span>)
plt.legend([<span class="hljs-string">" Rating"</span>])
plt.savefig(<span class="hljs-string">'rating.png'</span>)
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1658070475735/_W2nNH4zu.png" alt="Screenshot (69).png" /></p>
<ul>
<li>From the plot above, pupper is the dog category with the highest rating score</li>
</ul>
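<p>The same conclusion can be read programmatically from the aggregate computed above; a one-line sketch:</p>
<pre><code>dog_rating.idxmax()   # the category whose maximum rating is highest, e.g. 'pupper'
</code></pre>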
<p><strong>Visualizing the dog category with maximum favorite count</strong></p>
<pre><code>dog_favorite_count<span class="hljs-operator">=</span>master_df.groupby(<span class="hljs-string">'dog_category'</span>)[<span class="hljs-string">'favorite_count'</span>].<span class="hljs-built_in">max</span>()
display(dog_favorite_count)
dog_favorite_count.plot.bar(figsize<span class="hljs-operator">=</span>(<span class="hljs-number">10</span>,<span class="hljs-number">5</span>))
plt.title(<span class="hljs-string">"Dog category favorite count"</span>)
plt.xlabel(<span class="hljs-string">"Dog category"</span>)
plt.legend([<span class="hljs-string">"favorite_count"</span>])
plt.savefig(<span class="hljs-string">'favorite.png'</span>)
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1658070651495/kSSKEngVy.png" alt="Screenshot (70).png" /></p>
<ul>
<li>From the above plot doggo is the dog with maximum twitter favorite count</li>
</ul>
<p><strong>Visualizing the dog category with minimum retweet count</strong></p>
<pre><code>dog_retweet_count<span class="hljs-operator">=</span>master_df.groupby(<span class="hljs-string">'dog_category'</span>)[<span class="hljs-string">'retweet_count'</span>].<span class="hljs-built_in">min</span>()
display(dog_retweet_count)
dog_retweet_count.plot.bar(figsize<span class="hljs-operator">=</span>(<span class="hljs-number">10</span>,<span class="hljs-number">5</span>))
plt.title("Dog category with minimum retweet count")
plt.xlabel("Dog category")
plt.legend(["retweet_count"])
plt.savefig(<span class="hljs-string">'retweet.png'</span>)
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1658070798138/jOs4EBHR-.png" alt="Screenshot (71).png" /></p>
<ul>
<li>From the above plot pupper has the least retweet count</li>
</ul>
<p><strong>Visualizing Total Tweets made by WeRateDogs per month between 2015 and 2017</strong></p>
<pre><code>twitter_archive_master[<span class="hljs-string">'year_month_date'</span>] <span class="hljs-operator">=</span> twitter_archive_master[<span class="hljs-string">'timestamp'</span>].dt.year.astype(str) <span class="hljs-operator">+</span> <span class="hljs-string">'-'</span> <span class="hljs-operator">+</span> \
                                            twitter_archive_master[<span class="hljs-string">'timestamp'</span>].dt.month.astype(str).str.pad(<span class="hljs-number">2</span>, fillchar<span class="hljs-operator">=</span><span class="hljs-string">'0'</span>)
twitter_archive_master[<span class="hljs-string">'is_tweet'</span>] <span class="hljs-operator">=</span> np.where(twitter_archive_master.tweet_id.notnull(), <span class="hljs-number">1</span>, <span class="hljs-number">0</span>)
twitter_archive_monthly_tweets <span class="hljs-operator">=</span> twitter_archive_master.groupby(<span class="hljs-string">'year_month_date'</span>).is_tweet.sum().reset_index()
plt.xticks(rotation<span class="hljs-operator">=</span><span class="hljs-number">45</span>)
ax <span class="hljs-operator">=</span> sns.lineplot(x<span class="hljs-operator">=</span><span class="hljs-string">'year_month_date'</span>, y<span class="hljs-operator">=</span><span class="hljs-string">'is_tweet'</span>, data<span class="hljs-operator">=</span>twitter_archive_monthly_tweets)
ax.set_title('Total Tweets made by WeRateDogs per month between 2015 and 2017')
ax.set_xlabel(<span class="hljs-string">'Date'</span>)
ax.set_ylabel(<span class="hljs-string">'Number of Tweets'</span>)
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1658070920488/3rRUwoWeH.png" alt="Screenshot (72).png" /></p>
<ul>
<li>From the plot above, we notice that in November 2015 WeRateDogs posted around 350 tweets in a single month. The volume then gradually decreased, and by August 2017 the monthly tweet count had dropped to nearly zero.</li>
</ul>
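<p>As a side note, the year-month key built above with string concatenation can also be derived in one step using pandas period support; an equivalent sketch:</p>
<pre><code># equivalent monthly grouping using a PeriodIndex instead of string keys
monthly_counts = twitter_archive_master.groupby(
    twitter_archive_master['timestamp'].dt.to_period('M')).tweet_id.count()
</code></pre>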
<p><strong>Visualizing some of the dog images from prediction p1</strong></p>
<pre><code><span class="hljs-selector-tag">twitter_archive_master</span><span class="hljs-selector-attr">[[<span class="hljs-string">'jpg_url'</span>,<span class="hljs-string">'p1'</span>]</span>]
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1658071053726/vmnkMVRNN.png" alt="Screenshot (73).png" /></p>
<p><strong> Visualizing Miniature_pinscher</strong></p>
<pre><code><span class="hljs-selector-tag">Image</span>(<span class="hljs-string">"https://pbs.twimg.com/media/CT5IQmsXIAAKY4A.jpg"</span>)
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1658071331444/95ue3DYFJ.png" alt="Miniature_pinscher.png" /></p>
<ul>
<li>Programmatically download Miniature_pinscher image from the url </li>
</ul>
<pre><code>response = requests.get("https://pbs.twimg.com/media/CT5IQmsXIAAKY4A.jpg")
# write the downloaded bytes to disk; the context manager closes the file automatically
with open("Miniature_pinscher.png", "wb") as file:
    file.write(response.content)
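
# The same steps can be wrapped in a small reusable helper (a sketch;
# download_image is a hypothetical name, not part of the original notebook):
def download_image(url, filename):
    # fetch the image and fail loudly on a bad HTTP status
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    with open(filename, "wb") as f:
        f.write(response.content)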
</code></pre><p><strong> Visualizing  Rhodesian ridgeback</strong></p>
<pre><code><span class="hljs-selector-tag">Image</span>(<span class="hljs-string">'https://pbs.twimg.com/media/CT5Dr8HUEAA-lEu.jpg'</span>)
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1658071571186/LPQrpMEXV.png" alt="Rhodesian ridgeback.png" /></p>
<ul>
<li>Programmatically download Rhodesian ridgeback image from the url </li>
</ul>
<pre><code>response = requests.get("https://pbs.twimg.com/media/CT5Dr8HUEAA-lEu.jpg")
# write the downloaded bytes to disk; the context manager closes the file automatically
with open("Rhodesian ridgeback.png", "wb") as file:
    file.write(response.content)
</code></pre><p><strong>Visualizing German Shepherd</strong></p>
<pre><code><span class="hljs-selector-tag">Image</span>(<span class="hljs-string">'https://pbs.twimg.com/media/CT4521TWwAEvMyu.jpg'</span>)
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1658072168982/gaOtKI3AX.png" alt="german_sheperd.png" /></p>
<ul>
<li>Programmatically download the German Shepherd image from the url</li>
</ul>
<pre><code>response = requests.get("https://pbs.twimg.com/media/CT4521TWwAEvMyu.jpg")
# write the downloaded bytes to disk; the context manager closes the file automatically
with open("german_sheperd.png", "wb") as file:
    file.write(response.content)
</code></pre><p><strong>Conclusion</strong><br />
We have reached the end of this data wrangling and visualization journey.
Through our analysis, we found the following among the dog categories:</p>
<p><strong>Tweet favorite counts</strong></p>
<ul>
<li>puppo has the highest at 19573.545455</li>
<li>doggo has 17599.225806</li>
<li>multiclass has 15008.909091</li>
<li>floofer has 11223.857143</li>
<li>pupper has the lowest at 6204.975369</li>
</ul>
<p><strong>Retweet counts</strong></p>
<ul>
<li>doggo has the highest at 5972.709677</li>
<li>followed by puppo with 5325.318182</li>
<li>multiclass has 4548.272727</li>
<li>pupper has 5325.318182</li>
<li>floofer has the lowest at 1909.453202</li>
</ul>
<p><strong>Most used devices by WeRateDogs users</strong></p>
<ul>
<li>Twitter for iPhone: 2016</li>
<li>Vine - Make a Scene: 91</li>
<li>Twitter Web Client: 31</li>
<li>TweetDeck: 10</li>
</ul>
<p><strong>Dog category with the highest rating score</strong></p>
<ul>
<li>pupper        2.7</li>
<li>doggo         1.4</li>
<li>puppo         1.4</li>
<li>floofer       1.3</li>
<li>multiclass    1.3</li>
</ul>
<p><strong>Dog category with maximum favorite count</strong></p>
<ul>
<li>doggo         144774</li>
<li>puppo         124103</li>
<li>pupper        108900</li>
<li>multiclass     49401</li>
<li>floofer          28112</li>
</ul>
<p>Finally, people preferred to <strong>favorite</strong> a dog post over a <strong>retweet</strong>, and both actions declined toward the end of the 2015 to 2017 period.</p>
<p>The GitHub repository with the datasets, the wrangling report and the Jupyter notebook can be found <a target="_blank" href="https://github.com/flexil/-WeRateDogs">here</a>.</p>
<hr />
<p>If you want to contribute or you find any errors in this article, please do leave me a comment.</p>
<p>You can reach me on any of the Matrix decentralized servers. My Element messenger ID is <strong>@maximilien:matrix.org</strong></p>
<p>If you are on one of the Mastodon decentralized servers, here is my ID <strong>@maximilien@qoto.org</strong></p>
<p>If you are on LinkedIn, you can reach me <a target="_blank" href="https://www.linkedin.com/in/ephrem-maximilien-kpizingui-48222775/">here</a></p>
<p>If you want to contact me via email for freelance work: <strong>maximilien@tutanota.de</strong></p>
<p>If you want to hire me to work on machine learning, data science, IoT, or AI related projects, please reach out to me <a target="_blank" href="https://flexil.github.io/freelance/">here</a></p>
<blockquote>
<p>Warm regards,<br />
Maximilien.</p>
</blockquote>
]]></content:encoded></item><item><title><![CDATA[Medical Appointments No-Show Data Analysis And Visualization From A-Z]]></title><description><![CDATA[Welcome back to this blog post on medical appointment No- show project which consists of 100k medical appointments in Brazil  as part of the Advanced Data Analytics Nanodegree Scholarship by Udacity  which seeks to  equip and to train young Africans ...]]></description><link>https://maximilien.docquest.io/medical-appointments-no-show-data-analysis-and-visualization-from-a-z</link><guid isPermaLink="true">https://maximilien.docquest.io/medical-appointments-no-show-data-analysis-and-visualization-from-a-z</guid><category><![CDATA[dataanalytics]]></category><category><![CDATA[#data visualisation]]></category><category><![CDATA[Data wrangling]]></category><category><![CDATA[doctors]]></category><category><![CDATA[hospitals]]></category><dc:creator><![CDATA[Maximilien]]></dc:creator><pubDate>Thu, 16 Jun 2022 15:32:25 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1655338782211/KGFPATjK7.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Welcome back! This blog post covers the medical appointment no-show project, based on a dataset of over 100k medical appointments in Brazil. The project is part of the Advanced Data Analytics Nanodegree Scholarship by Udacity, which seeks to equip and train young Africans with digital skills for remote work and local market opportunities. The dataset collects information from more than 100k medical appointments in Brazil and focuses on whether or not patients show up for their appointments. Without further delay, let us get into the data wrangling.</p>
<hr />
<p><strong>Table of Contents</strong></p>
<p><strong> INTRODUCTION</strong></p>
<p><strong>1- Research  questions</strong></p>
<p><strong>2- DATA WRANGLING</strong></p>
<p><strong>3-  Section objectives</strong></p>
<p><strong>4- Loading libraries</strong></p>
<p><strong>5- Loading the dataset</strong></p>
<p><strong>6- Exploring data</strong></p>
<p><strong>7- Descriptive information about the dataset</strong></p>
<p><strong>8- Shape of the dataset</strong></p>
<p><strong>9- Statistical data</strong></p>
<p><strong>10- DATA CLEANING</strong></p>
<p><strong>11- Renaming columns</strong></p>
<p><strong>12- Converting date</strong></p>
<p><strong>13- Filtering row Age with -1</strong></p>
<p><strong>14- Dropping negative Age</strong></p>
<p><strong>15- Checking negative Age</strong></p>
<p><strong>16- Converting PatientId and AppointmentId to object data type</strong></p>
<p><strong>17- Displaying minimum value of age</strong></p>
<p><strong>18- Filtering and displaying row with Age==0</strong></p>
<p><strong>19- Dropping Age==0</strong></p>
<p><strong>20- Checking missing values</strong></p>
<p><strong>21- Checking duplicated values</strong></p>
<p><strong>22- Cleaned dataset</strong></p>
<p><strong>23- Exploratory Data Analysis</strong></p>
<p><strong>24- Pie plot of No-show patients</strong></p>
<p><strong>25- Pie plot of gender, hypertension and No-show</strong></p>
<p><strong>26- Pie plot of gender, diabetes and No-show</strong></p>
<p><strong>27- Pie plot of gender, SMS_received and No-show</strong></p>
<p><strong>28- Pie plot of gender and No-show</strong></p>
<p><strong>29- Function to plot the distribution in the research question</strong></p>
<p><strong>30- No-show gender distribution</strong></p>
<p><strong>31- Diabetes No-show gender distribution</strong></p>
<p><strong>32- Hypertension No-show gender distribution</strong></p>
<p><strong>33- SMS_received No-show gender distribution</strong></p>
<p><strong>34- Age No-show gender distribution</strong></p>
<p><strong>Conclusion</strong></p>
<p><strong>Limitations</strong></p>
<hr />
<p><strong>Introduction</strong></p>
<p>The dataset subject to our analysis contains information recorded from a hospital in Brazil. It has 110,527 data entries (indexed from 0 to 110526) and 14 columns.
The description of each feature variable is shown below.</p>
<hr />
<p><strong>PatientId</strong>: Identification of a patient<br />
<strong>AppointmentID</strong>: Identification of each appointment<br />
<strong>Gender</strong>: Male or Female<br />
<strong>ScheduledDay</strong>: The day when the patient scheduled their appointment<br />
<strong>AppointmentDay</strong>: The day of the appointment<br />
<strong>Age</strong>: Age of the patient<br />
<strong>Neighbourhood</strong>: Address of the hospital where the appointment takes place<br />
<strong>Scholarship</strong>: Boolean, 1 if the patient is enrolled in the Brazilian welfare program Bolsa Familia, 0 otherwise<br />
<strong>Hipertension</strong>: Boolean, 1 if the patient has hypertension, 0 otherwise<br />
<strong>Diabetes</strong>: Boolean, 1 if the patient has diabetes, 0 otherwise<br />
<strong>Alcoholism</strong>: Boolean, 1 if the patient drinks alcohol, 0 otherwise<br />
<strong>Handcap</strong>: Boolean, 1 if the patient is handicapped, 0 otherwise<br />
<strong>SMS_received</strong>: Boolean, 1 if the patient received an SMS before the appointment, 0 otherwise<br />
<strong>No-show</strong>: Boolean, Yes if the patient showed up on the booking day, No otherwise<br /></p>
<hr />
<p>From the description above, the aim of this project is to find out which patient populations, health conditions and disability statuses are associated with showing up (or not showing up) to their appointments.</p>
<p><strong>1- Research questions</strong></p>
<p>The research questions raised during the brainstorming phase of our analysis are:<br />
1- What is the distribution of the patient that showed up and did not show up during the appointment?<br />
2- What is the distribution of the patients having or not having Hypertension showed up and did not show up during the appointment?<br />
3- What is the distribution of the patients having or not having diabetes showed up and did not show up during the appointment?<br />
4- What is the distribution of the patients who (received or did not a SMS) showed up and did not show up during the appointment?<br />
<strong>2- Data Wrangling</strong><br />
<strong>3- Section objectives</strong><br />
In this section:<br />We load the data<br />We explore the data<br />We clean the dataset<br />We preprocess the dataset for visualization and further analysis.<br />
<strong>4- Loading the required libraries</strong></p>
<pre><code><span class="hljs-keyword">import</span> pandas <span class="hljs-keyword">as</span> pd 
<span class="hljs-keyword">import</span> numpy <span class="hljs-keyword">as</span> np 
<span class="hljs-keyword">import</span> matplotlib.pyplot <span class="hljs-keyword">as</span> plt
%matplotlib inline
</code></pre><p><strong>5- Loading the dataset</strong><br /></p>
<pre><code>df <span class="hljs-operator">=</span> pd.read_csv(<span class="hljs-string">"noshowappointments-kagglev2-may-2016.csv"</span>)
</code></pre><p><strong>6- Exploring data</strong><br />
Displaying the last five observations of the dataset</p>
<pre><code>df.tail()
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1655332364699/AXdYOgjk3.png" alt="tail.png" /></p>
<p><strong>7- Descriptive information about the dataset</strong></p>
<pre><code>df.info()
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1655332862884/nalL9mkDg.png" alt="info.png" /></p>
<ul>
<li>The dataset has 14 non-null features, containing respectively:</li>
<li>1 data type float: PatientId</li>
<li>8 data type integer: AppointmentID,Age,Scholarship,Hipertension,Diabetes,Alcoholism,Handcap,SMS_received</li>
<li>5 data type object: Gender,ScheduledDay,AppointmentDay, Neighbourhood,No-show</li>
</ul>
<p><strong>8- Shape of the dataset</strong></p>
<pre><code><span class="hljs-selector-tag">df</span><span class="hljs-selector-class">.shape</span>
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1655333446863/IfLZDHG_n.png" alt="shape.png" /></p>
<ul>
<li>The dataset has 110527 rows which represent the number of observations and 14 columns which represent the number of feature variables</li>
</ul>
<p><strong>9- Displaying the statistical data of the dataset</strong></p>
<pre><code>df.describe()
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1655333666810/4TCOlY6-L.png" alt="stat.png" /></p>
<ul>
<li>From the statistical data above we notice a lot of discrepancies in the dataset.<br /> We need to convert the summary to integers for readability</li>
</ul>
<pre><code>df.describe().astype(<span class="hljs-string">'int64'</span>)
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1655333949708/cupbPKH8H.png" alt="Screenshot from 2022-06-15 22-59-52.png" /></p>
<ul>
<li>We notice there is a negative value in the Age column</li>
<li>Besides, PatientId is a float data type; we need to convert it to a string</li>
<li>Some column names are misspelled, such as Hipertension and Handcap</li>
<li>PatientId and AppointmentID are irrelevant to our analysis, so we convert them to the object (string) data type</li>
</ul>
<p><strong>Exploring the Age of the patient closely</strong></p>
<pre><code><span class="hljs-selector-tag">df</span><span class="hljs-selector-class">.Age</span><span class="hljs-selector-class">.describe</span>()
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1655334354124/f_kGvuco8.png" alt="Screenshot from 2022-06-15 23-07-22.png" />
We observe that the mean age of the patients is 38, and the eldest patient is 115 years old.</p>
<p><strong>10- DATA CLEANING</strong><br />
From the dataset above, we notice that:<br /></p>
<ul>
<li>Hipertension and Handcap are misspelled</li>
<li>ScheduledDay and AppointmentDay are not in the correct format; we need to convert them to date format</li>
<li>There is a negative value in the feature variable Age</li>
<li>There are zeros in the feature variable Age</li>
</ul>
<p><strong>11- Renaming columns</strong></p>
<ul>
<li>Renaming the columns hipertension and handcap </li>
</ul>
<pre><code>df <span class="hljs-operator">=</span> df.rename(columns<span class="hljs-operator">=</span>{<span class="hljs-string">'Hipertension'</span>: <span class="hljs-string">'Hypertension'</span>, <span class="hljs-string">'Handcap'</span>: <span class="hljs-string">'Handicap'</span>})
</code></pre><p><strong>12- Converting to date</strong></p>
<ul>
<li>From the dataset above, we need to convert ScheduledDay and AppointmentDay from String data type to datetime64 format yyyy-mm-dd.</li>
</ul>
<pre><code>df[<span class="hljs-string">'ScheduledDay'</span>] <span class="hljs-operator">=</span> pd.to_datetime(df[<span class="hljs-string">'ScheduledDay'</span>]).dt.date.astype(<span class="hljs-string">'datetime64[ns]'</span>)
df[<span class="hljs-string">'AppointmentDay'</span>] <span class="hljs-operator">=</span> pd.to_datetime(df[<span class="hljs-string">'AppointmentDay'</span>]).dt.date.astype(<span class="hljs-string">'datetime64[ns]'</span>)
</code></pre><pre><code>df.head()
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1655381361303/dOPHMu6ft.png" alt="table.png" /></p>
<p><strong>13- Filtering the row with negative Age</strong></p>
<pre><code>rowAgeNegative<span class="hljs-operator">=</span> (df.Age=<span class="hljs-operator">=</span><span class="hljs-number">-1</span>)
dfAgeNegative <span class="hljs-operator">=</span> df[rowAgeNegative]
dfAgeNegative.head()
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1655381749426/RRHsD5Yy_.png" alt="clean.png" /></p>
<ul>
<li>We have one row containing -1 </li>
</ul>
<p><strong>14- Dropping negative Age</strong></p>
<pre><code>df <span class="hljs-operator">=</span> df.drop(dfAgeNegative.index)
</code></pre><p><strong>15- Filtering if there still exist a negative Age in the dataset row</strong></p>
<pre><code>ages = [-1]   # use an integer: Age is an integer column, so the string '-1' would never match
age_dataset = df[df['Age'].isin(ages)]
age_dataset.head()
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1655382102584/ot0V66Veb.png" alt="ok.png" />
The row has been dropped successfully<br />
<strong>16- Converting PatientId and AppointmentID to object data type</strong></p>
<pre><code>df[<span class="hljs-string">'PatientId'</span>] <span class="hljs-operator">=</span> df[<span class="hljs-string">'PatientId'</span>].astype(<span class="hljs-string">'object'</span>)
df[<span class="hljs-string">'AppointmentID'</span>] <span class="hljs-operator">=</span> df[<span class="hljs-string">'AppointmentID'</span>].astype(<span class="hljs-string">'object'</span>)
</code></pre><ul>
<li>Displaying the minimum value of Age</li>
</ul>
<pre><code>df.Age.<span class="hljs-built_in">min</span>()
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1655382451678/BpkAqluXL.png" alt="zero.png" /></p>
<ul>
<li>We notice the min value of Age is 0. This is wrong, and we may consider it a data entry error</li>
</ul>
<p><strong>17- Filtering rows with Age==0 and displaying the head of the dataset and the number of such rows</strong></p>
<pre><code>ages <span class="hljs-operator">=</span> [<span class="hljs-number">0</span>]  
age_dataset <span class="hljs-operator">=</span> df[df[<span class="hljs-string">'Age'</span>].isin(ages)]  
age_dataset.head()
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1655382720918/Ev9-GyXDx.png" alt="cool.png" /></p>
<pre><code>rowAgeZero<span class="hljs-operator">=</span> (df.Age=<span class="hljs-operator">=</span><span class="hljs-number">0</span>)
dfAgeZero <span class="hljs-operator">=</span> df[rowAgeZero]
dfAgeZero.head(<span class="hljs-number">10</span>)
print(<span class="hljs-string">"The number of rows containing zero in the Age columns are "</span>,len(dfAgeZero))
</code></pre><ul>
<li>The number of rows containing zero in the Age column is 3539</li>
</ul>
<p><strong>18- Dropping row with Age==0</strong></p>
<pre><code>df <span class="hljs-operator">=</span> df.drop(dfAgeZero.index)
</code></pre><ul>
<li>Checking if any zeros row remained</li>
</ul>
<pre><code>ages <span class="hljs-operator">=</span> [<span class="hljs-number">0</span>]  
age_dataset <span class="hljs-operator">=</span> df[df[<span class="hljs-string">'Age'</span>].isin(ages)]  
age_dataset.head()
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1655383088888/I83jv0ryE.png" alt="k.png" /></p>
<ul>
<li>Data dropped successfully</li>
</ul>
<p><strong>19- Checking for missing values</strong></p>
<pre><code>df.isnull().sum()
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1655383404026/0pQe4aTJU.png" alt="good.png" /></p>
<ul>
<li>There is no missing value in the dataset</li>
</ul>
<p><strong>20- Checking for duplicated values</strong></p>
<pre><code>sum(df.duplicated())
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1655383547485/jejiKjLRm.png" alt="zero.png" /></p>
<ul>
<li>There are no duplicated values</li>
</ul>
<p><strong>21- Displaying preprocessed dataset</strong></p>
<pre><code>df.head()
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1655383960204/NI14s3jGQ.png" alt="tr.png" /></p>
<p><strong>22- Exploratory Data Analysis</strong></p>
<p><strong>23- Defining a function to plot the pie plot taking three arguments namely the columns, the label and the title of the plot</strong></p>
<pre><code>def myplot(features, label, title):
    # Group by the given feature(s) and plot the group sizes as a pie chart
    plt.figure(figsize=(8,8))
    plt.pie(df.groupby(features).size(), autopct='%.2f')
    plt.title(title)
    plt.legend(label, loc="lower left")
    plt.show()
</code></pre><p><strong>24- Pie plot of No-show patients</strong></p>
<pre><code><span class="hljs-selector-tag">myplot</span>(<span class="hljs-string">'No-show'</span>,[<span class="hljs-string">'Not Show'</span>,<span class="hljs-string">'Show'</span>],<span class="hljs-string">"No-show distribution"</span> )
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1655384505233/alTaipJqB.png" alt="pie.png" /></p>
<p>From the above pie plot, we conclude that:</p>
<ul>
<li>79.74% of the patients did not come to the appointment</li>
<li>Only 20.26% of the patients came to the appointment</li>
</ul>
<p><strong>25- Pie plot of gender, hypertension and No-show</strong></p>
<pre><code><span class="hljs-selector-tag">myplot</span>([<span class="hljs-string">"No-show"</span>,<span class="hljs-string">"Gender"</span>, <span class="hljs-string">"Hypertension"</span>],[<span class="hljs-string">'Not show,Female,Not hypertension'</span>,<span class="hljs-string">'Not show,Female,hypertension'</span>,<span class="hljs-string">'Not show,Male,Not hypertension'</span>,<span class="hljs-string">'Not show,Male,hypertension'</span>, <span class="hljs-string">'Show,Female,Not hypertension'</span>, <span class="hljs-string">'Show,Female,hypertension'</span>,<span class="hljs-string">'Show,Male,Not hypertension'</span>,<span class="hljs-string">'Show,Male,hypertension'</span>],<span class="hljs-string">"Pie plot of gender, hypertenstion and No-show"</span> )
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1655384955904/zeQMJwcJQ.png" alt="ds.png" /></p>
<ul>
<li><p>Male patients</p>
<ul>
<li>22.54% without hypertension did not come to the appointment</li>
<li>5.88% without hypertension came to the appointment</li>
<li>5% with hypertension did not come to the appointment</li>
<li>1.04% with hypertension came to the appointment</li>
</ul>
</li>
<li><p>Female patients</p>
<ul>
<li>40.34% without hypertension did not come to the appointment</li>
<li>10.86% without hypertension came to the appointment</li>
<li>2.48% with hypertension came to the appointment</li>
<li>11.85% with hypertension did not come to the appointment</li>
</ul>
</li>
</ul>
<p><strong>26- Pie plot of gender, diabetes and No-show</strong></p>
<pre><code>myplot(["No-show","Gender", "Diabetes"],['Not show,Female,Not diabetes','Not show,Female,diabetes','Not show,Male,Not diabetes','Not show,Male,diabetes', 'Show,Female,Not diabetes', 'Show,Female,diabetes','Show,Male,Not diabetes','Show,Male,diabetes'],"Pie plot of gender, diabetes and No-show")
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1655386027653/YcWZC9ESd.png" alt="a.png" /></p>
<p><strong>Male patients</strong></p>
<ul>
<li>25.74% without diabetes did not come to the appointment</li>
<li>6.54% without diabetes came to the appointment</li>
<li>1.8% with diabetes did not come to the appointment</li>
<li>0.39% with diabetes came to the appointment</li>
</ul>
<p><strong>Female patients</strong></p>
<ul>
<li>47.91% without diabetes did not come to the appointment</li>
<li>12.39% without diabetes came to the appointment</li>
<li>0.95% with diabetes came to the appointment</li>
<li>4.29% with diabetes did not come to the appointment</li>
</ul>
<p><strong>27- Pie plot of gender, SMS_received and No-show</strong></p>
<pre><code><span class="hljs-selector-tag">myplot</span>([<span class="hljs-string">"No-show"</span>,<span class="hljs-string">"Gender"</span>, <span class="hljs-string">"SMS_received"</span>], [<span class="hljs-string">'Not show,Female,Not SMS_received'</span>,<span class="hljs-string">'Not show,Female,SMS_received'</span>,<span class="hljs-string">'Not show,Male,Not SMS_received'</span>,<span class="hljs-string">'Not show,Male,SMS_received'</span>, <span class="hljs-string">'Show,Female,Not SMS_received'</span>, <span class="hljs-string">'Show,Female,SMS_received'</span>,<span class="hljs-string">'Show,Male,Not SMS_received'</span>,<span class="hljs-string">'Show,Male,SMS_received'</span>],<span class="hljs-string">"Pie plot of gender, SMS_received and No-show"</span>)
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1655385780443/zVLtZCw6S.png" alt="w.png" /></p>
<p><strong>Male patients</strong></p>
<ul>
<li>20.18% did not receive an SMS and did not come to the appointment</li>
<li>4.16% did not receive an SMS but came to the appointment</li>
<li>7.36% received an SMS but did not come to the appointment</li>
<li>2.76% received an SMS and came to the appointment</li>
</ul>
<p><strong>Female patients</strong></p>
<ul>
<li>36.17% did not receive an SMS and did not come to the appointment</li>
<li>7.16% did not receive an SMS but came to the appointment</li>
<li>6.18% received an SMS and came to the appointment</li>
<li>16.03% received an SMS but did not come to the appointment</li>
</ul>
<p><strong>28- Pie plot of gender and No-show</strong></p>
<pre><code>myplot(["No-show","Gender"],['Not show,Female','Not show,Male', 'Show,Female','Show,Male'],"Pie plot of gender and No-show")
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1655386714874/TeY5JDYts.png" alt="z.png" /></p>
<p>Male patients</p>
<ul>
<li>Out of the 34.46% of patients who are male, only 6.92% came to the appointment</li>
</ul>
<p>Female patients</p>
<ul>
<li>Out of the 65.54% of patients who are female, only 13.34% came to the appointment</li>
</ul>
<p><strong>29- Function to plot the distribution in the research question</strong></p>
<pre><code>def distributionPlot(feature, titleplot):
    # Bar plot of No-show counts broken down by the given feature and gender
    gender_column = 'Gender'
    df.groupby(['No-show', feature, gender_column]).size().unstack(level=1).plot(kind='bar', title=titleplot, ylabel='count')
    # Print the underlying counts per group
    print(pd.DataFrame(df.groupby(['No-show', feature]).count().PatientId))
</code></pre><p><strong>30- No-show gender distribution</strong></p>
<pre><code><span class="hljs-selector-tag">distributionPlot</span>(<span class="hljs-string">'Gender'</span>,<span class="hljs-string">'No-show gender distribution'</span>)
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1655387125781/tnnmhmyKO.png" alt="n.png" /></p>
<p>From the bar plot above we notice:</p>
<ul>
<li>14275 female patients showed up to the appointment</li>
<li>7405 male patients showed up to the appointment</li>
<li>55843 female patients did not show up to the appointment</li>
<li>29464 male patients did not show up to the appointment</li>
</ul>
<p><strong>31- Diabetes No-show gender distribution</strong></p>
<pre><code><span class="hljs-selector-tag">distributionPlot</span>(<span class="hljs-string">'Diabetes'</span>,<span class="hljs-string">'Diabetes No-show gender distribution'</span>)
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1655387470029/KCGe-VWfz.png" alt="g.png" /></p>
<p>We notice that 1,200 female patients with diabetes showed up to the appointment and 230 male patients with diabetes showed up. Besides, 15k female patients and 6k male patients without diabetes showed up to the appointment. Moreover, 52k female patients without diabetes and 4k female patients with diabetes did not come to the appointment. Finally, 28k male patients without diabetes and 2k male patients with diabetes did not come to the appointment.</p>
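<p>The exact counts behind these observations can be tabulated directly with a crosstab; a minimal sketch using the same columns:</p>
<pre><code># counts of show/no-show broken down by diabetes status and gender
pd.crosstab([df['No-show'], df['Diabetes']], df['Gender'])
</code></pre>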
<p><strong>32- Hypertension No-show gender distribution</strong></p>
<pre><code><span class="hljs-selector-tag">distributionPlot</span>(<span class="hljs-string">'Hypertension'</span>,<span class="hljs-string">' Hypertension No-show gender distribution'</span>)
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1655387794657/uHafjBKa9.png" alt="l.png" /></p>
<p>We notice that almost 2,500 female patients with hypertension showed up to the appointment, while 1,272 male patients with hypertension did not show up. Besides, 12k female patients and 6k male patients without hypertension showed up to the appointment.</p>
<p><strong>33- SMS_received No-show gender distribution</strong></p>
<pre><code><span class="hljs-selector-tag">distributionPlot</span>(<span class="hljs-string">'SMS_received'</span>,<span class="hljs-string">' SMS_received No-show gender distribution'</span>)
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1655388105828/Ivp58v6_X.png" alt="we.png" /></p>
<p>We notice that 8k female patients who did not receive an SMS showed up to the appointment, and 7k female patients who received an SMS showed up. Likewise, 4k male patients who did not receive an SMS showed up, and 3k male patients who received an SMS showed up. On the no-show side, 38k female patients who did not receive an SMS and 17k female patients who received an SMS did not come to the appointment, while 22k male patients who did not receive an SMS and 7k male patients who received an SMS did not come to the appointment.</p>
<p><strong>34- Age No-show gender distribution</strong></p>
<pre><code>boxplot <span class="hljs-operator">=</span> df.boxplot(column<span class="hljs-operator">=</span>[<span class="hljs-string">'Age'</span>] , by <span class="hljs-operator">=</span> [<span class="hljs-string">'No-show'</span>] , notch <span class="hljs-operator">=</span> True, labels<span class="hljs-operator">=</span>[<span class="hljs-string">'No-show'</span>,<span class="hljs-string">'Age'</span>])
pd.DataFrame(df.groupby([<span class="hljs-string">'No-show'</span>])[<span class="hljs-string">'Age'</span>].describe().loc[:,[<span class="hljs-string">'mean'</span>,<span class="hljs-string">'std'</span>]])
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1655388300754/zLCeCwdUE.png" alt="ui.png" /></p>
<p>We notice that the average age of patients who did not show up is 39.07, while the average age of patients who showed up is 35.32. Looking at the quartiles, 25% of the patients who did not show up are aged around 20 or below, versus 19 for those who showed up; at the 75th percentile, the ages are 58 and 47 respectively.</p>
<p><strong>Conclusion</strong></p>
<p>We notice that few patients responded to the appointments given by physicians in Brazil.</p>
<p><strong>The gender distribution revealed that</strong></p>
<ul>
<li>14275 female patients showed up to the appointment</li>
<li>7405 male patients showed up to the appointment</li>
<li>55843 female patients did not show up to the appointment</li>
<li>29464 male patients did not show up to the appointment</li>
</ul>
<p><strong>The hypertension patient distribution revealed that</strong></p>
<ul>
<li>2500 female patients with hypertension showed up to the appointment</li>
<li>1272 male patients with hypertension did not show up to the appointment</li>
<li>12k female patients without hypertension showed up to the appointment</li>
<li>6k male patients without hypertension showed up to the appointment</li>
</ul>
<p><strong>The diabetes patient distribution revealed that</strong></p>
<ul>
<li>1200 female patients with diabetes showed up to the appointment</li>
<li>230 male patients with diabetes showed up to the appointment</li>
<li>15k female patients without diabetes showed up to the appointment</li>
<li>6k male patients without diabetes showed up to the appointment</li>
<li>52k female patients without diabetes did not come to the appointment</li>
<li>4k female patients with diabetes did not come to the appointment</li>
</ul>
<p><strong> The SMS_received patient distribution revealed that</strong></p>
<ul>
<li>7k female patients who received an SMS showed up to the appointment</li>
<li>4k male patients who did not receive an SMS showed up to the appointment</li>
<li>3k male patients who received an SMS showed up to the appointment</li>
<li>38k female patients who did not receive an SMS did not come to the appointment</li>
<li>17k female patients who received an SMS did not come to the appointment</li>
<li>22k male patients who did not receive an SMS did not come to the appointment</li>
<li>7k male patients who received an SMS did not come to the appointment</li>
</ul>
<p>From the research questions above, we conclude that<br /> SMS_received influenced patients to show up to their appointments more than the other feature variables.<br />The summary above does not fully reflect the actual data entry from the hospital because the data contained some discrepancies: 3,540 observations were dropped during the analysis to arrive at the conclusion above. Additional information would have been handy to explain why we have zero and negative one in the independent variable Age.</p>
<p><strong>Limitations</strong></p>
<p>The dataset submitted to our analysis in this project contains some data entry errors which affect the outcome of our analysis. Out of the 110,527 observations, we found a negative Age at row index 99832. Moreover, we found 3,539 observations with an Age value of 0. We assumed those discrepancies were errors, so we dropped 3,540 observations in total. Our analysis was therefore carried out on 106,987 observations, which does not represent the full population of patients and might change the outcome of our results. We would further need to know the distance from each patient to the hospital where the appointment was booked to understand why so few patients respond to their medical appointments. We would also need to know why the hospital prefers text messages over phone calls, since many people do not read their SMS often.</p>
<hr />
<p>If you want to contribute or you find any errors in this article, please do leave me a comment.</p>
<p>You can reach me on any of the Matrix decentralized servers. My Element messenger ID is @maximilien:matrix.org</p>
<p>If you are on one of the Mastodon decentralized servers, here is my ID @maximilien@qoto.org</p>
<p>If you are on LinkedIn, you can reach me <a target="_blank" href="https://www.linkedin.com/in/ephrem-maximilien-kpizingui-48222775/">here</a></p>
<p>If you want to contact me via email for freelance work: <em>maximilien@tutanota.de</em></p>
<blockquote>
<p>Warm regards,<br />
    Maximilien.</p>
</blockquote>
]]></content:encoded></item><item><title><![CDATA[Qr Code Dynamic Certificate Generator and Authentication (QCDCGA) Application]]></title><description><![CDATA[Since the outbreak of corona virus, the learning system across the world has been shifted to an online learning mode which brings about the digital risk of E-learning certificate.  Moreover, many students in Africa who go to further their study abroa...]]></description><link>https://maximilien.docquest.io/qr-code-dynamic-certificate-generator-and-authentication-qcdcga-application</link><guid isPermaLink="true">https://maximilien.docquest.io/qr-code-dynamic-certificate-generator-and-authentication-qcdcga-application</guid><category><![CDATA[web application]]></category><dc:creator><![CDATA[Maximilien]]></dc:creator><pubDate>Mon, 11 Apr 2022 21:00:40 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1649685314308/OQs-CPVOR.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Since the outbreak of the coronavirus, learning systems across the world have shifted to an online mode, which brings with it the digital risk of E-learning certificate forgery. Moreover, many students in Africa who further their studies abroad with non-digital educational certificates and documents often face authentication issues: they are asked to redo the same program as freshers, or they even lose their job or cannot find one because of authentication problems. In this light, to mitigate certificate counterfeiting, <strong>maxtekAI</strong> opts for a digital solution by developing <strong>QCDCGA</strong> (Qr Code Dynamic Certificate Generator And Authentication) across Africa, empowering certificate issuing bodies with a two-step QR code technology encrypted in a database to prevent forgery and to ease the authentication and validation process for any organization across the world.</p>
<h3 id="heading-i-project-description">I.) Project Description</h3>
<p>The  <strong>QCDCGA</strong> system will include the following key features:</p>
<ul>
<li>Secure Login for admin</li>
<li>Dashboard statistics display.</li>
<li>Creating and managing certificate.</li>
<li>Creating and managing Administrator.</li>
<li>Creating and managing Administrator roles.</li>
<li>Profile update.</li>
<li>Password Reset</li>
<li>Menu conversion feature that provides easy-to-navigate, optimized menus into which you can integrate any information you would like</li>
</ul>
<p>The application will give to certificate issuing bodies a strong sense of satisfaction by accomplishing the following goals:</p>
<ol>
<li>Generating of certificate easily</li>
<li>Updating of already generated certificate</li>
<li>Managing system administrators by setting rules on each account for security</li>
<li>Validation of the uniqueness of the certificates (see the sketch after this list)</li>
<li>Managing the database and the certificates</li>
<li>Managing the license issues of the institutions</li>
</ol>
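<p>To make the validation goal concrete, here is a minimal illustration of how a certificate could be signed and embedded in a QR code. This is a sketch under stated assumptions, not the actual QCDCGA implementation: it assumes Python with the <code>qrcode</code> package installed, and the secret key, certificate ID, and verification URL are all hypothetical placeholders.</p>
<pre><code>import hashlib
import hmac
import qrcode

SECRET_KEY = b'issuing-body-secret'   # hypothetical key kept server-side

def make_certificate_qr(cert_id, holder):
    # Sign the certificate payload so a verifier can detect tampering
    payload = f'{cert_id}|{holder}'
    signature = hmac.new(SECRET_KEY, payload.encode(), hashlib.sha256).hexdigest()
    # Step 1: the QR code carries a verification URL.
    # Step 2: the server recomputes the signature against its database record.
    url = f'https://example.org/verify?id={cert_id}&amp;sig={signature}'
    qrcode.make(url).save(f'{cert_id}.png')

make_certificate_qr('CERT-2022-0001', 'Jane Doe')
</code></pre>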
<h3 id="heading-ii-development-package">II.) Development Package</h3>
<p>The sections below outline each development package and how it relates to your system.</p>
<p><strong>1. Domain name</strong></p>
<p>A domain name is an identification string that defines a realm of administrative autonomy,
authority, or control within the Internet.</p>
<p><strong>2. SSL Certificate</strong></p>
<p>SSL provides a secure channel between two machines or devices operating over the internet or an
internal network. One common example is when SSL is used to secure communication between a
web browser and a web server. This turns a website's address from HTTP to HTTPS, the 'S'
standing for 'secure'.</p>
<p><strong>3. VPC acquisition, installation, and security configuration</strong></p>
<p>A virtual private cloud (VPC) enables you to launch resources into a virtual network that you've defined. This virtual network closely resembles a traditional network that you'd operate in your own data center or office, with the benefit of scalable infrastructure.</p>
<p><strong>4. Technical support</strong></p>
<p>A technical support representative is focused on resolving your issue as quickly as possible.
Technical support reps listen to symptoms, try to reproduce the issue, and quickly provide a
solution to the issue.</p>
<p><strong>5. Maintenance and update</strong></p>
<ul>
<li>System maintenance is an umbrella term that encompasses various forms of computer maintenance needed to keep a system running.</li>
<li>Regular maintenance helps your systems run more smoothly and reduces the risk of them breaking down. A well-maintained application ensures your staff and business have no technology roadblocks that hamper productivity, and it also leads to a reduction in support costs.</li>
</ul>
<p><strong>- NOTE:</strong> This development includes integration of third-party software.</p>
<h3 id="heading-iii-security-measure-considered">III.) Security Measure Considered</h3>
<p>There are a few security measures we are going to implement on the system to ensure you have a secure and well-functioning application.</p>
<ul>
<li>Server login access secure</li>
</ul>
<p>Using SSH key authentication instead of regular passwords</p>
<ul>
<li>Secure Sockets Layer Certificates</li>
</ul>
<p>Secure web administration areas and forms with Secure Sockets Layer (SSL), which guards information passed between two systems via the internet. SSL can be used both in server-client and in server-server communication.</p>
<ul>
<li>Regular Server Update, Upgrade and Backup</li>
</ul>
<p>This will help keep the system up to date with new fixes for vulnerabilities and weaknesses in the server.</p>
<ul>
<li>Monitoring Login Attempts</li>
</ul>
<p>Using intrusion prevention software to monitor login attempts is a way to protect your server
against brute force attacks. These automated attacks use a trial-and-error method, attempting
every possible combination of letters and numbers to gain access to the system.</p>
<ul>
<li>Firewall Restrictions</li>
</ul>
<p>We will set up and maintain a firewall, restricting access to the server to stop unwanted requests made to it.</p>
<ul>
<li>Implementing Fail2Ban</li>
</ul>
<p>Fail2Ban is an intrusion prevention software framework that protects computer servers from brute-force attacks. In other words, it blocks IP addresses it finds suspicious.</p>
<h3 id="heading-iv-technical-support">IV.) Technical Support</h3>
<p>We will provide paid technical support after launching your application. We will answer your questions regarding application management, technical details, or anything about operating your own server; we can provide this through email or by phone.</p>
<h3 id="heading-v-payment-plan">V.) Payment Plan</h3>
<p>At the start of the development process for your company's application, you need to pay us 60% of the total development cost; the remaining balance of 40% should be paid after launching your site on the web (www). We will start the development immediately upon receiving the initial payment.</p>
<h3 id="heading-vi-mode-of-payment">VI.) Mode Of Payment</h3>
<p>Our preferred mode of payment is Bank transfer. When payment is ready to be made, we will
communicate the bank details to you.</p>
<h3 id="heading-vii-payment-plan">VII.) Payment Plan</h3>
<p>At the start of the development process for your company Application, you need to pay us 60% of the
total development costs, the remaining balance of 40% will be given after launching your site on the web
(www). We will start the development immediately upon receiving of the initial payment.</p>
<p>NOTE: Our quotations are valid for 7 days from the date of this quotation. If you have any other queries regarding this quotation, please email us at maximilien@tutanota.de or call us at <strong>+233207863123</strong></p>
<p>Feel free to mail us to request the demo link and the password for testing. Once you're ready to move forward with the development of your custom application with your company logo, simply sign this proposal and email it to us. We'll be notified and will begin the initial stages of app development.</p>
<p>Regards,</p>
<p>Maximilien - <strong>QCDCGA</strong> Project Leader</p>
]]></content:encoded></item><item><title><![CDATA[Deploying Company Predictive Marketing Application using RFM Behavioral Clustering Algorithm To Heroku (part 2)]]></title><description><![CDATA[Welcome to the part 2 of this blog post series. Please if you are reading this article for the first time like many other here please read the part one before testing the application. 

Train and test csv files can be downloaded here

The link to tes...]]></description><link>https://maximilien.docquest.io/deploying-company-predictive-marketing-application-using-rfm-behavioral-clustering-algorithm-to-heroku-part-2</link><guid isPermaLink="true">https://maximilien.docquest.io/deploying-company-predictive-marketing-application-using-rfm-behavioral-clustering-algorithm-to-heroku-part-2</guid><category><![CDATA[deployment]]></category><category><![CDATA[Machine Learning]]></category><category><![CDATA[Data Science]]></category><category><![CDATA[Heroku]]></category><category><![CDATA[marketing]]></category><dc:creator><![CDATA[Maximilien]]></dc:creator><pubDate>Mon, 21 Mar 2022 20:43:23 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1647858788640/Ty3u9Cyvv.jpg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Welcome to part 2 of this blog post series. If you are reading this article for the first time, please read part one before testing the application.</p>
<ul>
<li><p>Train and test csv files can be downloaded <a target="_blank" href="https://drive.google.com/drive/folders/1o8wlCp_KZ_-rh0SiruaWsyvh5M-GNBDq?usp=sharing">here</a></p>
</li>
<li><p>The link to test the unsupervised machine learning web application on heroku can be found <a target="_blank" href="https://marketing-analytics-ml.herokuapp.com/">here</a></p>
</li>
<li><p>Part 3 of this blog post series will be released soon with the end-to-end model deployment code.</p>
</li>
</ul>
<p>Until then, stay tuned!</p>
<p>Regards,</p>
<p>Maximilien. </p>
]]></content:encoded></item><item><title><![CDATA[End To End Company Predictive Marketing Using RFM (Recency Frequency Monetary) Behavioral Based Clustering Algorithm(part 1).]]></title><description><![CDATA[Digital transformation including web, email, mobile, social, location technologies combined with technologies to store, process, and extract information has significantly changed our world. Nowadays, every entrepreneur in the process of business deve...]]></description><link>https://maximilien.docquest.io/end-to-end-company-predictive-marketing-using-rfm-recency-frequency-monetary-behavioral-based-clustering-algorithmpart-1</link><guid isPermaLink="true">https://maximilien.docquest.io/end-to-end-company-predictive-marketing-using-rfm-recency-frequency-monetary-behavioral-based-clustering-algorithmpart-1</guid><category><![CDATA[Machine Learning]]></category><category><![CDATA[Data Science]]></category><category><![CDATA[Artificial Intelligence]]></category><category><![CDATA[marketing]]></category><dc:creator><![CDATA[Maximilien]]></dc:creator><pubDate>Tue, 15 Mar 2022 10:20:42 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1647342726559/YcTTInEOy.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Digital transformation, including web, email, mobile, social, and location technologies combined with technologies to store, process, and extract information, has significantly changed our world. Nowadays, every entrepreneur in the process of business development faces the question of how to make clients more loyal and keep them from leaving for a competitor. In this light, predictive marketing is an approach that restores the bridge by bringing human sensibility into our digital world, focusing on consumers to understand what they did, what they will do next, and which products they are likely to buy. In the following, we are going to apply predictive marketing to segment and cluster customer behavior on Shopify using a recency-frequency-monetary (RFM) clustering algorithm.</p>
<hr />
<p><strong>  Contents </strong></p>
<p><strong> 1 - What is predictive marketing </strong></p>
<p><strong> 2 - What is RFM analysis and why it is useful</strong></p>
<p><strong> 3 - What is the difference between Clustering and segmentation </strong></p>
<p><strong> 4- Different types of clustering </strong></p>
<p><strong> 5- Diving into the algorithm with ML object oriented programming in python  </strong></p>
<hr />
<p><strong> 1 - What is predictive marketing </strong></p>
<p>To understand predictive marketing, in my humble opinion it is better to first know what predictive analytics is. In machine learning or big data terms, predictive analytics is a combination of mathematical and statistical techniques used to recognize patterns in data or to make predictions about the future. In this sense, when predictive analytics is applied to marketing, it can predict customer behavior, classify customers into segments, recommend a set of products to customers, and so on. So predictive marketing, under the hood of predictive analytics, helps companies optimize their marketing strategy to acquire new customers, grow customer lifetime value (the revenue generated by purchased products), and retain more customers over time. However, someone reading this blog may ask: is predictive marketing going to replace marketers with robots? The answer is no; the purpose of predictive marketing is to empower human intelligence with machine learning to increase customer lifetime value. For instance, Amazon has been using predictive analytics for a long time. Pay close attention to the recommendations that appear under a product you are thinking of adding to your cart: they are part of what makes Amazon such a successful e-commerce platform today.</p>
<p><strong> 2 - What is RFM analysis and why it is useful?</strong></p>
<p>To grasp RFM analysis, let us first consider customer segmentation. Customer segmentation is the process of dividing customers into groups based on common characteristics so companies can market to each group effectively and appropriately. With this understanding, recency-frequency-monetary (RFM) analysis is a behavior-based approach that groups customers into segments based on their previous purchase transactions: how recently (recency), how often (frequency), and how much (monetary) a customer bought products or services. Basically, the RFM model ranks each customer on a scale of 1 to 5. The higher the customer's ranking, the more likely they are to do business again with the firm. Furthermore, it gives organizations a sense of how much revenue comes from returning customers. It therefore helps marketers leverage their strategy to keep high-value and medium-value customers loyal, and to move targeted low-value customer segments into high-value ones through promotions, ads, and discounts on products.</p>
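<p>As a quick illustration of the scoring idea (a minimal sketch, separate from the full class built later in this post; all column names and numbers below are made up), the R, F and M columns can each be binned into quartile scores with pandas:</p>
<pre><code>import pandas as pd

rfm = pd.DataFrame({'Recency': [5, 40, 200, 12],       # days since last purchase
                    'Frequency': [20, 3, 1, 9],        # number of purchases
                    'Monetary': [900, 120, 30, 400]})  # total spend

# Lower recency is better, so its quartile labels are reversed
rfm['R'] = pd.qcut(rfm['Recency'], 4, labels=[4, 3, 2, 1])
rfm['F'] = pd.qcut(rfm['Frequency'], 4, labels=[1, 2, 3, 4])
rfm['M'] = pd.qcut(rfm['Monetary'], 4, labels=[1, 2, 3, 4])
print(rfm)
</code></pre>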
<p><strong> 3 - What is the difference between Clustering and segmentation </strong></p>
<p>Clustering is the automated machine learning powered version of segmentation. Clustering is a powerful tool that allows us to discover personas or communities within your customer base.  Segmentation on the other hand, is the process in which you segment customers to identify homogeneous groups that exist within your customer base which can be used to optimize and differentiate marketing actions or product strategy.</p>
<p><strong> 4- Different types of clustering </strong></p>
<p>The most frequent types of clustering used by data analysts are product-based clusters, brand-based clusters, and behavior-based clusters.</p>
<ul>
<li>Product based Clusters</li>
</ul>
<p>Product-based or category-based clustering models group customers based on what types or categories of products they tend to prefer and what types of products they tend to buy together.</p>
<ul>
<li>Brand based Clusters</li>
</ul>
<p>Brand-based clusters tell you what brands people are most likely to buy. They group together customers who prefer a group of brands more than others. For instance, you will be able to identify which customers are likely to be interested when a specific brand releases new products.</p>
<ul>
<li>Behavior based Clusters</li>
</ul>
<p>A behavior based clustering model groups customers based on how they will behave while purchasing. Do they use the website or the call center? Are they discount addicts? How frequently do they buy? How much do they spend? How much time will pass before they purchase again?</p>
<p><strong> 5- Diving into the algorithm with ML OOP in python  </strong></p>
<p>The use of OOP (Object Oriented Programming) is entirely optional in machine learning, as we already have libraries like scikit-learn and TensorFlow from which we can easily use algorithms. If you are new to Python and reading this article, don't worry: pause at this point, look up OOP in Python, and come back to follow along.</p>
<ul>
<li>a) Importing the libraries</li>
</ul>
<pre><code><span class="hljs-keyword">from</span> sklearn.preprocessing <span class="hljs-keyword">import</span> MinMaxScaler
<span class="hljs-keyword">from</span> yellowbrick.<span class="hljs-keyword">cluster</span> <span class="hljs-keyword">import</span> KElbowVisualizer
<span class="hljs-keyword">from</span> matplotlib.gridspec <span class="hljs-keyword">import</span> GridSpec
<span class="hljs-keyword">from</span> sklearn.<span class="hljs-keyword">cluster</span> <span class="hljs-keyword">import</span> KMeans
<span class="hljs-keyword">import</span> matplotlib.pyplot <span class="hljs-keyword">as</span> plt
<span class="hljs-keyword">import</span> plotly.express <span class="hljs-keyword">as</span> px
<span class="hljs-keyword">import</span> seaborn <span class="hljs-keyword">as</span> sns
<span class="hljs-keyword">import</span> datetime <span class="hljs-keyword">as</span> dt
<span class="hljs-keyword">import</span> pandas <span class="hljs-keyword">as</span> pd
<span class="hljs-keyword">import</span> numpy <span class="hljs-keyword">as</span> np
<span class="hljs-keyword">import</span> os
</code></pre><ul>
<li>b) Defining a class, containing a function to preprocess the dataset</li>
</ul>
<pre><code><span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">SomeModel</span>():</span>
        <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">__init__</span><span class="hljs-params">(<span class="hljs-keyword">self</span>)</span></span>:
                pass
        <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">get_preprocessing</span><span class="hljs-params">(<span class="hljs-keyword">self</span>, a, b , c,d)</span></span>:
        <span class="hljs-comment">#       removing duplicated index and dropping nan values</span>
                X= pd.read_csv(d).drop_duplicates(keep=<span class="hljs-string">"first"</span>)
                X=X[pd.notnull(X[a])]
                X=X[pd.notnull(X[b])]
                X=X[pd.notnull(X[c])]
                <span class="hljs-keyword">return</span> X
</code></pre><ul>
<li>Checking the output of this call</li>
</ul>
<pre><code><span class="hljs-keyword">if</span> __name__ <span class="hljs-operator">=</span><span class="hljs-operator">=</span> <span class="hljs-string">'__main__'</span>:
....model_instance <span class="hljs-operator">=</span> SomeModel()        
....print(model_instance.get_preprocessing(<span class="hljs-string">'location_country'</span>,<span class="hljs-string">'referrer_source'</span>,<span class="hljs-string">'referrer_name'</span>,<span class="hljs-string">'shopify_dataseller1.csv'</span>))
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1647262942238/a099up2OW.png" alt="func1.png" /></p>
<ul>
<li>c) Defining  a function to get RFM modelling</li>
</ul>
<pre><code>def get_rfm_modeling(self, a, b, c, d):
    # function to return the RFM dataframe
    preprocessed_df = self.get_preprocessing(a, b, c, d)
    df_recency = preprocessed_df.groupby(by='location_country', as_index=False)['total_sessions'].sum()
    df_recency.columns = ['location_country', 'Recency']
    frequency_df = preprocessed_df.drop_duplicates().groupby(by=['location_country'], as_index=False)['total_conversion'].count()
    frequency_df.columns = ['location_country', 'Frequency']
    preprocessed_df['Total'] = preprocessed_df['total_conversion'] * preprocessed_df['total_carts']
    monetary_df = preprocessed_df.groupby(by='location_country', as_index=False)['Total'].sum()
    monetary_df.columns = ['location_country', 'Monetary']
    rf_df = df_recency.merge(frequency_df, on='location_country')
    rfm_df = rf_df.merge(monetary_df, on='location_country')
    return rfm_df
</code></pre><ul>
<li>Checking the output</li>
</ul>
<pre><code><span class="hljs-keyword">if</span> __name__ <span class="hljs-operator">=</span><span class="hljs-operator">=</span> <span class="hljs-string">'__main__'</span>:
....model_instance <span class="hljs-operator">=</span> SomeModel()      
....print(model_instance.get_rfm_modeling(<span class="hljs-string">"location_country"</span>,<span class="hljs-string">"referrer_source"</span>,<span class="hljs-string">"referrer_name"</span>,<span class="hljs-string">'shopify_dat  a_seller1.csv'</span>))
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1647264533610/Z6c1RjIUd.png" alt="score.png" /></p>
<ul>
<li>d) Defining a function to get the R_score</li>
</ul>
<pre><code><span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">R_score</span>(<span class="hljs-params">self,var,p,d</span>):</span>
        <span class="hljs-comment"># recency score on 2h activity high value, more logs on the platform</span>
                <span class="hljs-keyword">if</span> var &lt;= d[p][<span class="hljs-number">0.25</span>]:
                        <span class="hljs-keyword">return</span> <span class="hljs-number">1</span>
                <span class="hljs-keyword">elif</span> var &lt;= d[p][<span class="hljs-number">0.50</span>]:
                        <span class="hljs-keyword">return</span> <span class="hljs-number">2</span>
                <span class="hljs-keyword">elif</span> var &lt;= d[p][<span class="hljs-number">0.75</span>]:
                        <span class="hljs-keyword">return</span> <span class="hljs-number">3</span>
                <span class="hljs-keyword">else</span>:
                        <span class="hljs-keyword">return</span> <span class="hljs-number">4</span>
</code></pre><ul>
<li>e) Defining a function to get FM_score</li>
</ul>
<pre><code>   <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">FM_score</span>(<span class="hljs-params">self,var,p,d</span>):</span>
<span class="hljs-comment">#Frequency and Monetary score (Positive Impact : Higher the value, better the customer)   </span>
                <span class="hljs-keyword">if</span> var &lt;= d[p][<span class="hljs-number">0.25</span>]:
                        <span class="hljs-keyword">return</span> <span class="hljs-number">4</span>
                <span class="hljs-keyword">elif</span> var &lt;= d[p][<span class="hljs-number">0.50</span>]:
                        <span class="hljs-keyword">return</span> <span class="hljs-number">3</span>
                <span class="hljs-keyword">elif</span> var &lt;= d[p][<span class="hljs-number">0.75</span>]:
                        <span class="hljs-keyword">return</span> <span class="hljs-number">2</span>
                <span class="hljs-keyword">else</span>:
                        <span class="hljs-keyword">return</span> <span class="hljs-number">1</span>
</code></pre><ul>
<li>f) Defining a function to get the RFM score </li>
</ul>
<pre><code>     def get_rfmscore(<span class="hljs-built_in">self</span>,a,b,c,d):
<span class="hljs-comment">#Segmentation: Here, we will divide the data set into 4 parts based on the quantiles.</span>
                rfm_df = <span class="hljs-built_in">self</span>.get_rfm_modeling(a,b,c,d)
                quantiles =rfm_df.drop(<span class="hljs-string">'location_country'</span>,axis = <span class="hljs-number">1</span>).quantile(q = [<span class="hljs-number">0.25</span>,<span class="hljs-number">0.5</span>,<span class="hljs-number">0.75</span>])
                rfm_df[<span class="hljs-string">'R_score'</span>] = rfm_df[<span class="hljs-string">'Recency'</span>].apply(<span class="hljs-built_in">self</span>.R_score,args = (<span class="hljs-string">'Recency'</span>,quantiles,))
                rfm_df[<span class="hljs-string">'F_score'</span>] = rfm_df[<span class="hljs-string">'Frequency'</span>].apply(<span class="hljs-built_in">self</span>.FM_score,args = (<span class="hljs-string">'Frequency'</span>,quantiles,))
                rfm_df[<span class="hljs-string">'M_score'</span>] = rfm_df[<span class="hljs-string">'Monetary'</span>].apply(<span class="hljs-built_in">self</span>.FM_score,args = (<span class="hljs-string">'Monetary'</span>,quantiles,))
        <span class="hljs-comment">#Now we will create : RFMGroup and RFMScore</span>
                rfm_df[<span class="hljs-string">'RFM_Group'</span>] = rfm_df[<span class="hljs-string">'R_score'</span>].astype(str) + rfm_df[<span class="hljs-string">'F_score'</span>].astype(str) + rfm_df[<span class="hljs-string">'M_score'</span>].astype(str)
        <span class="hljs-comment">#Score</span>
                rfm_df[<span class="hljs-string">'RFM_Score'</span>] = rfm_df[[<span class="hljs-string">'R_score'</span>,<span class="hljs-string">'F_score'</span>,<span class="hljs-string">'M_score'</span>]].sum(axis = <span class="hljs-number">1</span>)
                rfm_df[<span class="hljs-string">'R_rank'</span>] = rfm_df[<span class="hljs-string">'Recency'</span>].rank(ascending=<span class="hljs-literal">False</span>)
                rfm_df[<span class="hljs-string">'F_rank'</span>] = rfm_df[<span class="hljs-string">'Frequency'</span>].rank(ascending=<span class="hljs-literal">True</span>)
                rfm_df[<span class="hljs-string">'M_rank'</span>] = rfm_df[<span class="hljs-string">'Monetary'</span>].rank(ascending=<span class="hljs-literal">True</span>)
        <span class="hljs-comment"># normalizing the rank of the customers</span>
                rfm_df[<span class="hljs-string">'R_rank_norm'</span>] = (rfm_df[<span class="hljs-string">'R_rank'</span>]/rfm_df[<span class="hljs-string">'R_rank'</span>].max())*<span class="hljs-number">100</span>
                rfm_df[<span class="hljs-string">'F_rank_norm'</span>] = (rfm_df[<span class="hljs-string">'F_rank'</span>]/rfm_df[<span class="hljs-string">'F_rank'</span>].max())*<span class="hljs-number">100</span>
                rfm_df[<span class="hljs-string">'M_rank_norm'</span>] = (rfm_df[<span class="hljs-string">'F_rank'</span>]/rfm_df[<span class="hljs-string">'M_rank'</span>].max())*<span class="hljs-number">100</span>
                rfm_df.drop(columns=[<span class="hljs-string">'R_rank'</span>, <span class="hljs-string">'F_rank'</span>, <span class="hljs-string">'M_rank'</span>], inplace=<span class="hljs-literal">True</span>)
                rfm_df[<span class="hljs-string">'RFM_Score'</span>] = <span class="hljs-number">0.15</span>*rfm_df[<span class="hljs-string">'R_rank_norm'</span>]+<span class="hljs-number">0.28</span> * rfm_df[<span class="hljs-string">'F_rank_norm'</span>]+<span class="hljs-number">0.57</span>*rfm_df[<span class="hljs-string">'M_rank_norm'</span>]
                rfm_df[<span class="hljs-string">'RFM_Score'</span>] *= <span class="hljs-number">0.05</span>
                rfm_df = rfm_df.round(<span class="hljs-number">2</span>)
                <span class="hljs-keyword">return</span> rfm_df
</code></pre><ul>
<li>Checking the output</li>
</ul>
<pre><code><span class="hljs-keyword">if</span> __name__ <span class="hljs-operator">=</span><span class="hljs-operator">=</span> <span class="hljs-string">'__main__'</span>:
....model_instance <span class="hljs-operator">=</span> SomeModel()      
....print(model_instance.get_rfmscore(<span class="hljs-string">"location_country"</span>, <span class="hljs-string">"referrer_source"</span>, <span class="hljs-string">"referrer_name"</span>,<span class="hljs-string">'shopify_data_seller1.csv'</span>))
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1647265495972/zS_t4bL42.png" alt="fm.png" /></p>
<ul>
<li>g) Function to perform customer segmentation</li>
</ul>
<pre><code> def get_customerSegment(<span class="hljs-built_in">self</span>,a,b,c,d):
                rfm_df <span class="hljs-operator">=</span> <span class="hljs-built_in">self</span>.get_rfmscore(a,b,c,d)
                rfm_df[<span class="hljs-string">"Customer_segment"</span>] <span class="hljs-operator">=</span> np.where(rfm_df[<span class="hljs-string">'RFM_Score'</span>] <span class="hljs-operator">&gt;</span> <span class="hljs-number">4.5</span>, <span class="hljs-string">"Top Customers"</span>,
                             (np.where(rfm_df[<span class="hljs-string">'RFM_Score'</span>] <span class="hljs-operator">&gt;</span> <span class="hljs-number">4</span>,<span class="hljs-string">"High value Customer"</span>,
                             (np.where(rfm_df[<span class="hljs-string">'RFM_Score'</span>] <span class="hljs-operator">&gt;</span><span class="hljs-operator">=</span> <span class="hljs-number">3</span>,<span class="hljs-string">"Medium Value Customer"</span>,
                             np.where(rfm_df[<span class="hljs-string">'RFM_Score'</span>] <span class="hljs-operator">&gt;</span> <span class="hljs-number">1.6</span>,<span class="hljs-string">'Low Value Customers'</span>, <span class="hljs-string">'Low Customers'</span>))))))
                <span class="hljs-keyword">return</span> rfm_df
</code></pre><ul>
<li>Getting the output</li>
</ul>
<pre><code><span class="hljs-keyword">if</span> __name__ <span class="hljs-operator">=</span><span class="hljs-operator">=</span> <span class="hljs-string">'__main__'</span>:
....model_instance <span class="hljs-operator">=</span> SomeModel()        
....print(model_instance.get_customerSegment(<span class="hljs-string">"location_country"</span>, <span class="hljs-string">"referrer_source"</span>, <span class="hljs-string">"referrer_name"</span>,<span class="hljs-string">'shopify_data_seller1.csv'</span>))
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1647266141725/9NClGO_wH.png" alt="segm.png" /></p>
<ul>
<li>h) Functions to treat negative and zero values and to reduce skewness in the data</li>
</ul>
<pre><code> <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">right_treat</span><span class="hljs-params">(<span class="hljs-keyword">self</span>,var)</span></span>:
        <span class="hljs-comment"># First will focus on the negative and zero from the dataset before the transformation.</span>
                <span class="hljs-keyword">if</span> var &lt;= <span class="hljs-number">0</span>:
                        <span class="hljs-keyword">return</span> <span class="hljs-number">1</span>
                <span class="hljs-symbol">else:</span>
                        <span class="hljs-keyword">return</span> var
        <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">get_screwLogTransform</span><span class="hljs-params">(<span class="hljs-keyword">self</span>,a,b,c,d)</span></span>:
                rfm_df = <span class="hljs-keyword">self</span>.get_customerSegment(a,b,c,d)
<span class="hljs-comment">#skewness transform</span>
                rfm_df[<span class="hljs-string">'Recency'</span>] = rfm_df[<span class="hljs-string">'Recency'</span>].apply(lambda x : <span class="hljs-keyword">self</span>.right_treat(x))
                rfm_df[<span class="hljs-string">'Monetary'</span>] = rfm_df[<span class="hljs-string">'Monetary'</span>].apply(lambda x : <span class="hljs-keyword">self</span>.right_treat(x))
<span class="hljs-comment">#Log Transformation</span>
                log_RFM_data = rfm_df[[<span class="hljs-string">'Recency'</span>,<span class="hljs-string">'Frequency'</span>,<span class="hljs-string">'Monetary'</span>]].apply(np.log,axis = <span class="hljs-number">1</span>).round(<span class="hljs-number">4</span>)
                <span class="hljs-keyword">return</span> log_RFM_data
</code></pre><ul>
<li>i) Function to find the optimal number of clusters using the elbow technique</li>
</ul>
<pre><code><span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">plotClusteringElbow</span><span class="hljs-params">()</span></span>:
        <span class="hljs-comment"># After plotting, we found elbow at k=3. We will use this value in training our model</span>
                x = scaledLogTransform()
                model = KMeans()
                visualizer =KElbowVisualizer(model, k=(<span class="hljs-number">1</span>,<span class="hljs-number">9</span>))
                visualizer.fit(x)
                <span class="hljs-keyword">return</span> visualizer.show()
</code></pre><ul>
<li>Output the plot </li>
</ul>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1647269747528/b6vXZL-bl.png" alt="elbow.png" /></p>
<ul>
<li>j) Training the model using the K-means clustering algorithm</li>
</ul>
<pre><code>def train(<span class="hljs-built_in">self</span>,a,b,c,d):
                scaled_data <span class="hljs-operator">=</span> <span class="hljs-built_in">self</span>.scaledLogTransform(a,b,c,d)
                KM_clust <span class="hljs-operator">=</span> KMeans(n_clusters<span class="hljs-operator">=</span> <span class="hljs-number">3</span>, init <span class="hljs-operator">=</span> <span class="hljs-string">'k-means++'</span>,max_iter <span class="hljs-operator">=</span> <span class="hljs-number">1000</span>)
                KM_clust.fit(scaled_data)
                <span class="hljs-keyword">return</span> KM_clust
</code></pre><ul>
<li>k) Defining a function to display the results</li>
</ul>
<pre><code>def get_results(<span class="hljs-built_in">self</span>,a,b,c,d):
                model<span class="hljs-operator">=</span><span class="hljs-built_in">self</span>.train(a,b,c,d)
                rfm_df <span class="hljs-operator">=</span> <span class="hljs-built_in">self</span>.get_customerSegment(a,b,c,d)
                rfm_df[<span class="hljs-string">'Cluster'</span>] <span class="hljs-operator">=</span> model.labels_
                rfm_df[<span class="hljs-string">'Cluster'</span>] <span class="hljs-operator">=</span> <span class="hljs-string">'Cluster'</span> <span class="hljs-operator">+</span> rfm_df[<span class="hljs-string">'Cluster'</span>].astype(str)
                new_rfm_df <span class="hljs-operator">=</span>  rfm_df[[<span class="hljs-string">"location_country"</span>,<span class="hljs-string">"Customer_segment"</span>, <span class="hljs-string">"Cluster"</span>]]
                <span class="hljs-keyword">return</span> new_rfm_df.tail(<span class="hljs-number">25</span>)
</code></pre><ul>
<li>Displaying the last 25 rows of the dataframe</li>
</ul>
<pre><code><span class="hljs-keyword">if</span> __name__ <span class="hljs-operator">=</span><span class="hljs-operator">=</span> <span class="hljs-string">'__main__'</span>:
....model_instance <span class="hljs-operator">=</span> SomeModel()
....print(model_instance.train(<span class="hljs-string">"location_country"</span>,<span class="hljs-string">"referrer_source"</span>, <span class="hljs-string">"referrer_name"</span>,<span class="hljs-string">'shopify_data_seller1.csv'</span>))
....print(model_instance.get_results(<span class="hljs-string">"location_country"</span>, <span class="hljs-string">"referrer_source"</span>,<span class="hljs-string">"referrer_name"</span>,<span class="hljs-string">'shopify_data_seller1.csv'</span>))
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1647269205384/8Y8nPsx56p.png" alt="result.png" /></p>
<p><strong>Conclusion</strong></p>
<p>We have reached the end of this post. From the above data, you can think of your customer clusters as a physical swimming pool. The pool is filled with money spent by active customers of your brands. High-value customers are those who have spent money with you frequently over a period of time; they spend more money and fill the pool faster than medium-value customers. Low-value customers and low customers are seasonal customers: their purchasing power is small and it takes them years to fill the pool. On the other hand, water drains away as customers leave you or stop spending money with you. Therefore, marketers should implement a different strategy to retain each segment. For instance, high-value customers can be contacted by call centers, whereas medium-value customers receive an email and low-value customers a text message. Beyond this, promotions can be rolled out to move low-value customers and low customers into the medium pool, and special discounts can be offered to high- and medium-value customers on some products or services to retain their loyalty to your brand.</p>
<p>If you want to contribute or you find any errors in this article, please leave me a comment.</p>
<p>You can reach me on any of the Matrix decentralized servers. Here is my Element ID <strong>@maximilien:matrix.org</strong></p>
<p>If you are on one of the Mastodon decentralized servers, here is my ID <strong>@maximilien@qoto.org</strong></p>
]]></content:encoded></item><item><title><![CDATA[End-To-End Breast Cancer Model Explainability using SHAP and Random Forest Algorithm.]]></title><description><![CDATA[Model explainability and interpretability are one of the major concerns in the field of machine learning and artificial intelligence nowadays. We are gradually moving from traditional machine learning 'Black Box' where we preprocess data and feed int...]]></description><link>https://maximilien.docquest.io/end-to-end-breast-cancer-model-explainability-using-shap-and-random-forest-algorithm</link><guid isPermaLink="true">https://maximilien.docquest.io/end-to-end-breast-cancer-model-explainability-using-shap-and-random-forest-algorithm</guid><category><![CDATA[Artificial Intelligence]]></category><category><![CDATA[Machine Learning]]></category><dc:creator><![CDATA[Maximilien]]></dc:creator><pubDate>Thu, 16 Dec 2021 09:42:57 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1639604269041/khl5MsGrz.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Model explainability and interpretability are among the major concerns in the field of machine learning and artificial intelligence nowadays. We are gradually moving from the traditional machine learning 'Black Box', where we preprocess data and feed it into our training algorithm, relying only on the accuracy score and classification report to explain how well our model performs on the training and validation datasets, to a 'White Box': a nearly explainable and interpretable model. In the following, we are going to use SHAP (SHapley Additive exPlanations) to explain a breast cancer model built with a random forest classifier.</p>
<hr />
<p><strong> Contents </strong></p>
<p><strong>  1- What is SHAP </strong></p>
<p><strong>  2- Goal of SHAP </strong></p>
<p><strong> 3- SHAP value </strong></p>
<p><strong> 4- SHAP Explainer </strong></p>
<p><strong> 5- List of SHAP charts</strong></p>
<p><strong> 6-  Model-Agnostic Method  and advantages </strong></p>
<p><strong> 7-  Model-Agnostic Method layers</strong></p>
<p><strong> 8- Implementation of breast cancer model explainability using SHAP and random forest classifier algorithm </strong></p>
<hr />
<p><strong> 1- What is SHAP </strong></p>
<p>SHAP (SHapley Additive exPlanations) is a unified approach to explain the output of any machine learning model. It has optimized functions for interpreting tree-based models and a model agnostic explainer function for interpreting any black-box models for which the predictions are known.</p>
<p><strong>  2- Goal of SHAP </strong></p>
<p>The goal of SHAP is to explain the prediction of an instance x by computing the contribution of each feature to the prediction. The SHAP explanation method computes Shapley values from coalitional game theory. The feature values of a data instance act as players in a coalition. Shapley values tell us how to fairly distribute the prediction among the features.</p>
<p><strong> 3- SHAP value </strong></p>
<p>Shapley values are a widely used approach from cooperative game theory. The essence of the Shapley value is to measure the contribution of each player to the final outcome separately within the coalition, while preserving the property that the sum of the contributions equals the final outcome. Though there are other techniques used to explain models, like permutation importance and partial dependence plots, below are some benefits of using SHAP values over those techniques:</p>
<ul>
<li><strong>Global interpretability</strong>: SHAP values not only show feature importance but also show whether a feature has a positive or negative impact on predictions.</li>
<li><strong>Local interpretability</strong>: We can calculate SHAP values for each individual prediction and know how the features contribute to that single prediction. Other techniques only show aggregated results over the whole dataset.</li>
<li><strong>Model coverage</strong>: SHAP values can be used to explain a large variety of models, including linear models (e.g. linear regression), tree-based models (e.g. XGBoost) and neural networks, while other techniques can only explain limited model types.</li>
</ul>
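<p>To make the "fair distribution" idea tangible, here is a toy brute-force Shapley computation in Python. It is purely illustrative: the three "features" and the value function are made up, and real SHAP implementations use far more efficient algorithms.</p>
<pre><code>from itertools import combinations
from math import factorial

players = ['age', 'bmi', 'glucose']   # hypothetical features acting as players

# Made-up value function: the "model output" when only a coalition of features is present
payoffs = {(): 0, ('age',): 10, ('bmi',): 20, ('glucose',): 25,
           ('age', 'bmi'): 35, ('age', 'glucose'): 40,
           ('bmi', 'glucose'): 50, ('age', 'bmi', 'glucose'): 60}

def v(coalition):
    return payoffs[tuple(sorted(coalition))]

n = len(players)
for i in players:
    others = [p for p in players if p != i]
    phi = 0.0
    for r in range(n):
        for S in combinations(others, r):
            # Shapley weight for a coalition of size |S|
            weight = factorial(len(S)) * factorial(n - len(S) - 1) / factorial(n)
            phi += weight * (v(list(S) + [i]) - v(list(S)))
    # The three printed values sum to v(all players) - v(empty) = 60
    print(i, round(phi, 2))
</code></pre>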
<p><strong> 4- SHAP Explainer </strong></p>
<p>SHAP has a list of classes which can help us understand different kinds of machine learning models from many Python libraries. These classes are commonly referred to as explainers. An explainer generally takes the ML model and data as input and returns an explainer object holding SHAP values, which are used to plot the various charts explained later on. Below is a list of available explainers in SHAP; a minimal usage sketch follows the list.</p>
<ul>
<li>AdditiveExplainer: This explainer is used to explain Generalized Additive Models.</li>
<li>This explainer uses the brute-force approach to find shap values, trying all possible parameter sequences.</li>
<li>DeepExplainer: This explainer is designed for deep learning models created using Keras, TensorFlow, and PyTorch. It’s an enhanced version of the DeepLIFT algorithm, where we measure conditional expectations of SHAP values based on a number of background samples. It’s advisable to keep a reasonable number of samples as background, because more samples give more accurate results but take a lot of time to compute SHAP values. Generally, 100 random samples are a good choice.</li>
<li>GradientExplainer: This explainer is used for differentiable models which are based on the concept of expected gradients which itself is an extension of the integrated gradients method.</li>
<li>KernelExplainer: This explainer uses special weighted linear regression to compute the importance of each feature and the same values are used as SHAP values.</li>
<li>LinearExplainer: This explainer is used for linear models available from sklearn. It can account for the relationship between features as well.</li>
<li>PartitionExplainer: This explainer calculates shap values recursively through trying a hierarchy of feature combinations. It can capture the relationship between a group of related features.</li>
<li>PermutationExplainer: This explainer iterates through all permutation of features in both forward and reverses directions. This explainer can take more time if tried with many samples.</li>
<li>SamplingExplainer: This explainer generates shap values based on assumption that features are independent and is an extension of an algorithm proposed in the paper "An Efficient Explanation of Individual Classifications using Game Theory".</li>
<li>TreeExplainer: This explainer is used for models that are based on a tree-like decision tree, random forest, gradient boosting.</li>
<li>CoefficentExplainer: This explainer returns model coefficients as shap values. It does not do any actual shap values calculation.</li>
<li>LimeTabularExplainer: This explainer simply wraps LimeTabularExplainer from the lime library.</li>
<li>MapleExplainer: This explainer simply wraps MAPLE into shap interface.</li>
<li>RandomExplainer:  This explainer simply returns random feature shap values.</li>
<li>TreeGainExplainer : This explainer returns global gain/Gini feature importances for tree models as shap values.</li>
<li>TreeMapleExplainer : This explainer provides a wrapper around tree MAPLE into shap interface.</li>
</ul>
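<p>Whichever explainer you pick, the usage pattern is the same. Here is a minimal sketch (assuming an already fitted tree-based sklearn model named <code>model</code> and a feature dataframe <code>X</code>; both are placeholders, and section 8 below builds the real thing):</p>
<pre><code>import shap

# TreeExplainer suits tree ensembles such as the random forest used later in this post
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)   # one contribution per feature per sample
shap.summary_plot(shap_values, X)        # beeswarm plot of global feature impact
</code></pre>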
<p><strong> 5- List of SHAP charts </strong></p>
<ul>
<li>summary_plot creates a beeswarm plot of shap values distribution of each feature of the dataset.</li>
<li>decision_plot  shows the path of how the model reached a particular decision based on shap values of individual features. The individual plotted line represents one sample of data and how it reached a particular prediction.</li>
<li>multioutput_decision_plot shows decision plot for multi output models.</li>
<li>dependence_plot  shows relationship between feature value (X-axis) and its shape values (Y-axis).</li>
<li>force_plot  plots shap values using additive force layout. It can help us see which features most positively or negatively contributed to prediction.</li>
<li>image_plot  plots shape values for images.</li>
<li>monitoring_plot  helps in monitoring the behavior of the model over time. It monitors the loss of model overtime.</li>
<li>embedding_plot  projects shap values using PCA for 2D visualization.</li>
<li>partial_dependence_plot  shows a basic partial dependence plot for a feature.</li>
<li>bar_plot  shows a bar plot of shap values impact on the prediction of a particular sample.</li>
<li>waterfall_plot  shows a waterfall plot explaining a particular prediction of the model based on shap values. It kind of shows the path of how shap values were added to the base value to come to a particular prediction.</li>
<li>text_plot plots an explanation of text samples coloring text based on their shap values.</li>
</ul>
<p><strong> 6-  Model-Agnostic Method  and advantages</strong></p>
<p>Interpretation methods that separate the explanations from the machine learning model are called model-agnostic interpretation methods. The advantages of applying a model-agnostic explanation system are:</p>
<ul>
<li><p>Model flexibility: The interpretation method can work with any machine learning model, such as random forests, linear model and deep neural networks.</p>
</li>
<li><p>Explanation flexibility: You are not limited to a certain form of explanation. In some cases it might be useful to have a linear formula, in other cases a graphic with feature importance.</p>
</li>
<li><p>Representation flexibility: The explanation system should be able to use a different feature representation as the model being explained.</p>
</li>
</ul>
<p><strong> 7-  Model-Agnostic Method layers</strong></p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1638862149283/22Ivv4BS9.png" alt="Word.png" /></p>
<p>Let us have a look at model-agnostic interpretability. We capture the world by collecting data and abstract it further by learning to predict the data with a machine learning model. </p>
<p>The World layer: It contains everything that can be observed, which we aim to learn something about and interact with.</p>
<p>Data layer: We have to digitize the World in order to make it processable for computers and to store information. The Data layer contains anything from images to texts, etc.</p>
<p>Black box model layer: We fit the preprocessed data into the machine learning model and predict the outcome on unseen test data.</p>
<p>Interpretability methods layer: It deals with the opacity of machine learning models. What were the most important features for a particular diagnosis? Why was a financial transaction classified as fraud?</p>
<p>The last layer is occupied by a human, where all the explanation takes place.</p>
<p><strong> 8- Implementation of breast cancer model explainability using SHAP and random forest classifier algorithm </strong></p>
<p><em>Problem statement:</em> Breast cancer is a type of cancer that starts in the breast. Cancer starts when cells begin to grow out of control. A benign tumor is a tumor that does not invade its surrounding tissue or spread around the body, whereas a malignant tumor may do both. We are required to use SHAP to explain our model's prediction of whether a cancer is malignant or benign, using a random forest.</p>
<p>The dataset used below can be downloaded from the DPhi GitHub repository <a target="_blank" href="https://raw.githubusercontent.com/dphi-official/Datasets/master/breast_cancer/Training_set_breastcancer.csv">here</a></p>
<ul>
<li><strong>  Importing Necessary Libraries </strong></li>
</ul>
<pre><code><span class="hljs-keyword">from</span> sklearn.ensemble <span class="hljs-keyword">import</span> <span class="hljs-title">RandomForestClassifier</span>
<span class="hljs-title"><span class="hljs-keyword">from</span></span> <span class="hljs-title">sklearn</span>.<span class="hljs-title">model_selection</span> <span class="hljs-title"><span class="hljs-keyword">import</span></span> <span class="hljs-title">train_test_split</span>
<span class="hljs-title"><span class="hljs-keyword">from</span></span> <span class="hljs-title">matplotlib</span>.<span class="hljs-title">colors</span> <span class="hljs-title"><span class="hljs-keyword">import</span></span> <span class="hljs-title">ListedColormap</span>
<span class="hljs-title"><span class="hljs-keyword">import</span></span> <span class="hljs-title">matplotlib</span>.<span class="hljs-title">pyplot</span> <span class="hljs-title"><span class="hljs-keyword">as</span></span> <span class="hljs-title">plt</span>
<span class="hljs-title"><span class="hljs-keyword">import</span></span> <span class="hljs-title">seaborn</span> <span class="hljs-title"><span class="hljs-keyword">as</span></span> <span class="hljs-title">sns</span>
<span class="hljs-title"><span class="hljs-keyword">import</span></span> <span class="hljs-title">pandas</span> <span class="hljs-title"><span class="hljs-keyword">as</span></span> <span class="hljs-title">pd</span>
<span class="hljs-title"><span class="hljs-keyword">import</span></span> <span class="hljs-title">numpy</span> <span class="hljs-title"><span class="hljs-keyword">as</span></span> <span class="hljs-title">np</span>
<span class="hljs-title"><span class="hljs-keyword">import</span></span> <span class="hljs-title">warnings</span>
<span class="hljs-title">warnings</span>.<span class="hljs-title">filterwarnings</span>(<span class="hljs-string">'ignore'</span>)
<span class="hljs-title"><span class="hljs-keyword">import</span></span> <span class="hljs-title">shap</span>
<span class="hljs-title">shap</span>.<span class="hljs-title">initjs</span>()
</code></pre><ul>
<li><strong>  Loading the first five row of the data</strong></li>
</ul>
<pre><code>df<span class="hljs-operator">=</span>pd.read_csv(<span class="hljs-string">"https://raw.githubusercontent.com/dphi-official/Datasets/master/breast_cancer/Training_set_breastcancer.csv"</span>)
df.head()
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1639334109078/1hXVLcQuw.png" alt="head.png" /></p>
<ul>
<li><strong> Perform Basic Exploratory Data Analysis </strong></li>
</ul>
<p>This section displays summary statistics that quantitatively describe or summarize features of a collection of information: the process of condensing key characteristics of the dataset into simple numeric metrics. Some of the common metrics used are the mean, standard deviation, and correlation.</p>
<ul>
<li>Checking the dimensionality of the dataframe</li>
</ul>
<pre><code><span class="hljs-selector-tag">df</span><span class="hljs-selector-class">.shape</span>
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1639404827541/-n7XVQqRX.png" alt="shape.png" /></p>
<ul>
<li>Getting a concise summary of the dataframe</li>
</ul>
<pre><code>df.info()
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1639405201874/1bJftIOsI.png" alt="info.png" /></p>
<ul>
<li>Descriptive statistics</li>
</ul>
<pre><code>df.describe().transpose()
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1639576948406/RMIJDgHJJV.png" alt="sta.png" /></p>
<p>From the difference between the median and the mean in the figure above, it appears some features are skewed.</p>
<p>Based on the diagnosis class, data can be categorized using the mean value as follows.</p>
<pre><code>df.groupby(<span class="hljs-string">'diagnosis'</span>).mean()
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1639579152936/ZgE2lJXGB.png" alt="mean.png" /></p>
<ul>
<li>Grouping the labels into classes and displaying the number of elements per class</li>
</ul>
<pre><code>print(df.groupby(<span class="hljs-string">"diagnosis"</span>).size())
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1639405905830/mJNrVwsqG.png" alt="size.png" /></p>
<ul>
<li>Displaying the distribution of elements per class</li>
</ul>
<pre><code>sns.countplot(df[<span class="hljs-string">'diagnosis'</span>], label<span class="hljs-operator">=</span><span class="hljs-string">"Count"</span>, palette<span class="hljs-operator">=</span>sns.color_palette([<span class="hljs-string">'blue'</span>, <span class="hljs-string">'red'</span>]),
              order<span class="hljs-operator">=</span>pd.value_counts(df[<span class="hljs-string">'diagnosis'</span>]).iloc[:<span class="hljs-number">398</span>].index)
plt.show()
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1639406348711/e8QGg2SpM.png" alt="distr.png" /></p>
<p>The count plot above clearly shows there is a greater number of benign (B) stage cancer tumors in the dataset, which can be cured.</p>
<ul>
<li>Dropping the id column because it does not affect our analysis</li>
</ul>
<pre><code>df <span class="hljs-operator">=</span> df.drop(<span class="hljs-string">'id'</span>, axis<span class="hljs-operator">=</span><span class="hljs-number">1</span>)
</code></pre><ul>
<li>Plotting the correlation among the features and the target variable</li>
</ul>
<pre><code>df_corr <span class="hljs-operator">=</span> df.corr()
plt.figure(figsize<span class="hljs-operator">=</span>(<span class="hljs-number">20</span>, <span class="hljs-number">12</span>))
sns.heatmap(df_corr, cbar<span class="hljs-operator">=</span>True, annot<span class="hljs-operator">=</span>True, yticklabels<span class="hljs-operator">=</span>df.columns,
            xticklabels<span class="hljs-operator">=</span>df.columns)
plt.show()
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1639410862251/TuEf373IP.png" alt="corr.png" /></p>
<p>Each square shows the correlation between the variables on its two axes. Correlation ranges from -1 to +1. Values close to zero mean there is no linear trend between the two features. The closer the correlation is to +1, the more positively correlated the features are: as one increases, so does the other, and the closer to 1, the stronger the relationship. A correlation close to -1 is similar, except that one variable decreases as the other increases. The diagonal is all 1s because each variable there is correlated with itself, a perfect correlation. For the rest, the larger the number and the lighter the color, the higher the correlation between the two variables. The plot is symmetrical about the diagonal, since the same pairs of variables appear on both sides.</p>
<ul>
<li>Printing the feature pairs with the highest correlation (the slice skips the leading entries, which are the perfect self-correlation pairs from the diagonal)</li>
</ul>
<pre><code>high_correlation <span class="hljs-operator">=</span>df_corr.abs()
high_correlation_unstack<span class="hljs-operator">=</span>high_correlation.unstack()
high_correlation_sort <span class="hljs-operator">=</span> high_correlation_unstack.sort_values(ascending<span class="hljs-operator">=</span>False)
print(high_correlation_sort[<span class="hljs-number">30</span>:<span class="hljs-number">35</span>])
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1639413603613/xcWV9MxT2.png" alt="highcorr.png" /></p>
<ul>
<li>Plotting distribution of features with highest correlation "radius_mean and perimeter_mean"</li>
</ul>
<pre><code>sns.jointplot(x="radius_mean", y="perimeter_mean", data=df, kind="scatter", space=0, color="blue", height=9, ratio=3)
plt.show()
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1639413791434/kHlTi9FcO.png" alt="distcorr.png" /></p>
<ul>
<li>Plotting distribution of features with highest correlation "radius_worst and  perimeter_worst"</li>
</ul>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1639414004193/L3PY7HigBL.png" alt="Screenshot from 2021-12-13 16-47-08.png" /></p>
<ul>
<li>Splitting the data into Train and Test sets</li>
</ul>
<pre><code>X = df.drop("diagnosis", axis=1)
y = df.diagnosis.map({'B': 0, 'M': 1}).astype(int)  # np.int is removed in recent NumPy versions
</code></pre><ul>
<li>The train to test ratio should be 80:20 and the random_state should be 0</li>
</ul>
<pre><code>X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
</code></pre><ul>
<li>Use Random Forest Machine Learning Model for prediction</li>
</ul>
<pre><code>model <span class="hljs-operator">=</span> RandomForestClassifier(n_estimators <span class="hljs-operator">=</span><span class="hljs-number">400</span>, criterion<span class="hljs-operator">=</span><span class="hljs-string">'entropy'</span>,random_state<span class="hljs-operator">=</span><span class="hljs-number">1</span>,n_jobs<span class="hljs-operator">=</span><span class="hljs-number">-1</span>,max_depth<span class="hljs-operator">=</span><span class="hljs-number">5</span>)
</code></pre><ul>
<li>Fitting the model</li>
</ul>
<pre><code>model.fit(X_train, y_train)
</code></pre><ul>
<li>Predicting on X_test set</li>
</ul>
<pre><code>y_pred <span class="hljs-operator">=</span> model.predict(X_test)
</code></pre><ul>
<li>Evaluate the model using Accuracy Score</li>
</ul>
<pre><code><span class="hljs-keyword">from</span> sklearn.metrics <span class="hljs-keyword">import</span>  <span class="hljs-title">accuracy_score</span>
<span class="hljs-title">score</span><span class="hljs-operator">=</span> <span class="hljs-title">accuracy_score</span>(<span class="hljs-title">y_test</span>,<span class="hljs-title">y_pred</span>)
<span class="hljs-title">print</span>(<span class="hljs-string">"Accuracy:"</span>,<span class="hljs-title">score</span>)
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1639415187108/WGBK0oTFj.png" alt="Screenshot from 2021-12-13 17-06-51.png" /></p>
<p>Although we obtained an accuracy of 95% (very good), the score alone does not tell us which features push a breast cancer prediction towards benign or malignant. We need to explain what goes into the model to produce a specific predicted class. Other questions that may arise are:</p>
<ul>
<li>How do different features affect the prediction results?</li>
<li>What are the top features that influence the prediction results?</li>
<li>The model performance metrics look great, but should I trust the results?</li>
</ul>
<p><strong>Using the SHAP explainer to derive SHAP values for the random forest model.</strong></p>
<ul>
<li>Creating an object of the TreeExplainer class, which takes our model as a parameter.</li>
</ul>
<pre><code>explainer <span class="hljs-operator">=</span> shap.TreeExplainer(model)
</code></pre><ul>
<li>Calculating the SHAP value</li>
</ul>
<pre><code>shap_values <span class="hljs-operator">=</span> explainer.shap_values(X_test)
</code></pre><ul>
<li>Displaying the expected value</li>
</ul>
<pre><code>print(<span class="hljs-string">"Expected Value:"</span>, explainer.expected_value)
</code></pre><p><strong> Expected Value: [0.66426696 0.33573304] </strong></p>
<p>The lines of code above calculate the Shapley values.</p>
<ul>
<li>In our case, a classification problem, shap_values is a list of arrays whose length equals the number of classes, 2 (benign and malignant); the same holds for expected_value. We must therefore choose which label we are trying to explain and use the corresponding shap_value and expected_value in further plots. Depending on the prediction of an instance, we can choose the corresponding SHAP values and plot them as shown below.</li>
</ul>
<p>NB: for regression (out of the scope of this article), shap_values returns a single array.</p>
<pre><code>row<span class="hljs-operator">=</span><span class="hljs-number">0</span>
<span class="hljs-keyword">for</span> which_class in y.unique():
    display(shap.waterfall_plot(shap.Explanation(values<span class="hljs-operator">=</span>shap_values[<span class="hljs-keyword">int</span>(which_class)][row], base_values<span class="hljs-operator">=</span>explainer.expected_value[<span class="hljs-keyword">int</span>(which_class)], data<span class="hljs-operator">=</span>X_test.iloc[row],feature_names<span class="hljs-operator">=</span>X_test.columns.tolist())))
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1639574053376/-3no3R1Er.png" alt="negative.png" /></p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1639574168664/3Ht79laPZ.png" alt="positive.png" /></p>
<p>In the plot above, f(x) is the prediction after considering all the features, and E[f(x)] is the mean prediction (the base value).</p>
<ul>
<li>The blue bar shows how much a particular feature decreases the value of the prediction.</li>
<li><p>The red bar shows how much a particular feature increases the value of the prediction.</p>
</li>
<li><p>Plotting  SHAP force plot for the first row of test data.</p>
</li>
</ul>
<pre><code>shap.initjs()
shap_values_first_row <span class="hljs-operator">=</span> explainer.shap_values(X_test.iloc[<span class="hljs-number">0</span>])
shap.force_plot(explainer.expected_value[<span class="hljs-number">0</span>], shap_values_first_row[<span class="hljs-number">0</span>], X_test.iloc[<span class="hljs-number">0</span>])
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1639580140583/m8LQig9RD.png" alt="Screenshot from 2021-12-15 14-54-06.png" /></p>
<p>The force plot above depicts the weight of each feature's contribution, centered around the baseline SHAP value of 0.6423, which each feature either increases or decreases. Red depicts features with a positive weight on the model output and blue depicts features with a negative weight. Here, perimeter_worst, concave points_mean, concave points_worst, concavity_worst, concavity_mean and texture_worst decrease the model prediction; therefore the first test sample has a low risk of developing breast cancer (benign tumor).</p>
<ul>
<li>Shap summary_plot</li>
</ul>
<pre><code><span class="hljs-selector-tag">shap</span><span class="hljs-selector-class">.summary_plot</span>(<span class="hljs-selector-tag">shap_values</span><span class="hljs-selector-attr">[0]</span>,<span class="hljs-selector-tag">X_test</span>)
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1639607310571/8Vs2XbQuZ.png" alt="Screenshot from 2021-12-15 22-28-16.png" /></p>
<p>This summary plot aggregates the SHAP values for the benign class (shap_values[0]) across all test samples and ranks the features with the biggest impact on the model output. Each point is one sample; red indicates a high feature value and blue a low one.</p>
<pre><code><span class="hljs-selector-tag">shap</span><span class="hljs-selector-class">.summary_plot</span>(<span class="hljs-selector-tag">shap_values</span><span class="hljs-selector-attr">[1]</span>,<span class="hljs-selector-tag">X_test</span>)
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1639608490212/LFgU-ZwDB.png" alt="summary1.png" /></p>
<p>For the malignant class (shap_values[1]), red again depicts features with a positive weight and blue features with a negative weight on the model output. High values of perimeter_worst, concave points_mean, concave points_worst, concavity_worst, concavity_mean and texture_worst increase the model output, pushing a sample towards a high risk of developing breast cancer (malignant tumor).</p>
<p><strong>There are other SHAP plots we could explore, but for lack of time I would like to introduce you to an amazing Python library that explains a SHAP model in just a few lines of code.</strong></p>
<ul>
<li><strong> Explainerdashboard </strong></li>
</ul>
<p>explainerdashboard is a library for quickly building interactive dashboards that analyze and explain the predictions and workings of (scikit-learn compatible) machine learning models, including xgboost, catboost and lightgbm. It makes your model transparent and explainable with just a couple of lines of code. It lets you investigate SHAP values, permutation importances, interaction effects, partial dependence plots, all kinds of performance plots, and even the individual decision trees inside a random forest. Moreover, explainerdashboard helps any data scientist create an interactive, explainable-AI web app in minutes, without having to know anything about web development or deployment.</p>
<p>Let's get into the code now.</p>
<ul>
<li>Installing explainerdashboard (the installation can take a few minutes)</li>
</ul>
<pre><code><span class="hljs-addition">!pip install explainerdashboard</span>
</code></pre><ul>
<li>Importing the libraries </li>
</ul>
<pre><code><span class="hljs-keyword">from</span> explainerdashboard <span class="hljs-keyword">import</span> ClassifierExplainer
<span class="hljs-keyword">from</span> dash <span class="hljs-keyword">import</span> html
</code></pre><ul>
<li>Creating an object of the ClassifierExplainer class, passing the model, X_test and y_test as arguments</li>
</ul>
<pre><code><span class="hljs-attr">explainer</span> = ClassifierExplainer(model, X_test, y_test)
</code></pre><ul>
<li>Launching the dashboard</li>
</ul>
<pre><code><span class="hljs-keyword">from</span> explainerdashboard <span class="hljs-keyword">import</span> ExplainerDashboard
ExplainerDashboard(explainer).run()
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1639600325692/tbvaFkzOq.png" alt="Screenshot from 2021-12-15 15-33-31.png" />
After executing the above, the Flask web server address should be displayed as shown in the image. Copy it and paste it into your web browser.</p>
<p>http://0.0.0.0:8050/</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1639600442357/8F9OfxuUh.png" alt="Model Explainer.png" /></p>
<p>As shown above, this summary plot highlights the features that have the biggest impact on the predicted malignant cancer, based on SHAP values.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1639602379911/Ie-BU7XN5.png" alt="report.png" /></p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1639602715569/MMH4_JZsg.png" alt="ex.png" /></p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1639603126624/e-27oB9N1.png" alt="dep.png" /></p>
<p>Kudos for making it to the end of this article.</p>
<p><strong> Conclusion </strong></p>
<p>We started with a classification problem, using a random forest on a breast cancer dataset from a hospital in the USA. Exploring the dataset, we found that 250 people had benign tumors and 148 had malignant ones. Next, we fed our training set into the black-box model and evaluated its performance on unseen data, obtaining a 95% accuracy score. Finally, we used SHAP to explain and interpret our black-box model.</p>
<p>If you want to contribute, or if you find any errors in this article, please leave me a comment.</p>
<p>You can reach out to me on any of the Matrix decentralized servers. My Element messenger ID is @maximilien:matrix.org</p>
<p>If you are on one of the Mastodon decentralized servers, my ID is @maximilien@qoto.org</p>
<p>If you are on LinkedIn, you can reach me <a target="_blank" href="https://www.linkedin.com/in/ephrem-maximilien-kpizingui-48222775/">here</a>.</p>
<blockquote>
<p>Warm regards,</p>
<p>Maximilien.</p>
</blockquote>
]]></content:encoded></item><item><title><![CDATA[End-To-End Amazon Product Rating Using Multinomial Naive Bayes Algorithm and CountVectorizer]]></title><description><![CDATA[In this pandemic time where every body has to cover his nose and turn to online shopping, your product is not what you say about it but it is what google says about it. Let me ask you this question as a consumer of an end product. When you go online ...]]></description><link>https://maximilien.docquest.io/end-to-end-amazon-product-rating-using-multinomial-naive-bayes-algorithm-and-countvectorizer</link><guid isPermaLink="true">https://maximilien.docquest.io/end-to-end-amazon-product-rating-using-multinomial-naive-bayes-algorithm-and-countvectorizer</guid><category><![CDATA[Machine Learning]]></category><category><![CDATA[nlp]]></category><category><![CDATA[ML]]></category><category><![CDATA[Amazon]]></category><dc:creator><![CDATA[Maximilien]]></dc:creator><pubDate>Sat, 06 Nov 2021 19:50:40 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1635431920924/uYv3tvEJS.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>In this pandemic era, where everybody has to mask up and turn to online shopping, your product is not what you say about it; it is what Google says about it. Let me ask you a question as a consumer: when you go online to purchase a product, what is the first thing you do? Do you buy an item you cannot physically interact with right away, or do you take the time to read through the comments and experiences of other end users of that same product before purchasing? If you belong to the latter category, you are on the winning side in this digital age. With this in mind, we are going to use a mathematical model to predict a product's rating out of 5, based on sample product information collected from end consumers on Amazon, using machine learning and natural language processing with RMSE as the evaluation metric.</p>
<p><strong>Contents</strong></p>
<p><strong>1. What is product review</strong></p>
<p><strong>2. Importance of  product review</strong></p>
<p><strong>3. Top product review platforms</strong></p>
<p><strong>4. What is multinomial Naive Bayes algorithm</strong></p>
<p><strong>5. What is CountVectorizer</strong></p>
<p><strong>6. Code implementation of CountVectorizer</strong></p>
<p><strong>7. Implementation of product rating using multinomial Naive Bayes algorithm</strong></p>
<p><strong>1. What is a product review</strong></p>
<p>In electronic commerce, a product review is a section on a shopping website that gives customers the opportunity to rate and comment on a product they have purchased, which other consumers then read to decide whether to purchase the same item for their own needs.</p>
<p><strong>2. Importance of  product review</strong></p>
<ul>
<li><p>Drive sales</p>
</li>
<li><p>Build trust</p>
</li>
<li><p>Aid customer decision making</p>
</li>
<li><p>Credibility and social proof</p>
</li>
</ul>
<p><strong>3. Top product review platforms</strong></p>
<ul>
<li><p>Bazaarvoice</p>
</li>
<li><p>Yotpo</p>
</li>
<li><p>Trustpilot</p>
</li>
<li><p>PowerReviews</p>
</li>
<li><p>Reevoo</p>
</li>
<li><p>Feefo</p>
</li>
</ul>
<p><strong>4. What is multinomial Naive Bayes algorithm</strong></p>
<p>The multinomial Naive Bayes algorithm belongs to the family of probabilistic algorithms that apply Bayes' theorem with the "naive" assumption of conditional independence between every pair of features. It is designed to handle text corpora, using word counts as its method of calculating probability, given by:</p>
<p>P(c|x) = P(x|c) * P(c) / P(x)</p>
<p>Where c is the class among the possible outcomes and x is the given instance to be classified, represented by certain features.</p>
<p>If you want to go deeper into the mathematics, I refer you to <a target="_blank" href="https://dphi.tech/blog/naive-bayes-algorithm-everything-you-need-to-know/">Nagesh Singh Chauhan's post on the DPhi platform</a>.</p>
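<p>To make the formula concrete, here is a minimal sketch of the multinomial computation done by hand. The classes, word counts and review below are hypothetical, invented purely for illustration; since P(x) is the same for every class, it cancels out when we compare posteriors.</p>
<pre><code># Hypothetical word counts observed in training reviews for two classes
counts = {
    "positive": {"great": 8, "poor": 1, "battery": 3},
    "negative": {"great": 1, "poor": 7, "battery": 4},
}
priors = {"positive": 0.5, "negative": 0.5}

def posterior(words, cls, alpha=1.0):
    """P(c) * product of P(w|c), with Laplace smoothing; P(x) is omitted as it cancels."""
    total = sum(counts[cls].values())
    vocab = {w for c in counts.values() for w in c}
    p = priors[cls]
    for w in words:
        p *= (counts[cls].get(w, 0) + alpha) / (total + alpha * len(vocab))
    return p

review = ["poor", "battery"]
scores = {c: posterior(review, c) for c in counts}
print(scores)                       # unnormalized posteriors
print(max(scores, key=scores.get))  # -&gt; negative
</code></pre>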
<p><strong>5. What is CountVectorizer </strong></p>
<p>CountVectorizer is a great tool provided by the scikit-learn library in Python. It converts a given collection of text documents into numerical vectors based on the frequency of each word occurring in the text. This transformation is a key early stage of a machine learning pipeline: it provides the feature representation of raw text for tasks such as text classification and clustering, because machine learning algorithms can only compute on numerical features, whatever input data is fed into the model.</p>
<p><strong>6. Code implementation of CountVectorizer</strong></p>
<p>Considering a few sample texts from a corpus of my IoT startup as a list element:</p>
<blockquote>
<p>corpus= ["maxtek helps startup",
                    "maxtek is into computer vision and IoT",
                    "maxtek provides Deep learning and IoT solutions"]</p>
</blockquote>
<p>CountVectorizer creates a matrix in which each unique word is represented by a column of the matrix, and each text sample from the corpus is a row in the matrix. The value of each cell is nothing but the count of the word in that particular text sample as shown below</p>
<pre><code><span class="hljs-keyword">from</span> sklearn.feature_extraction.text <span class="hljs-keyword">import</span> CountVectorizer

corpus = [<span class="hljs-string">"maxtek helps startup"</span>,
            <span class="hljs-string">"maxtek is into computer vision and IoT"</span>,
            <span class="hljs-string">"maxtek provides Deep learning and IoT solutions"</span>]

<span class="hljs-comment"># Create a Vectorizer Object</span>
vectorizer = CountVectorizer()

vectorizer.fit(corpus)

<span class="hljs-comment"># Printing the identified Unique words along with their indices</span>
print(<span class="hljs-string">"Vocabulary: "</span>, vectorizer.vocabulary_)

<span class="hljs-comment"># Encode the corpus</span>
vector = vectorizer.transform(corpus)

<span class="hljs-comment"># Summarizing the Encoded word in the corpus</span>
print(<span class="hljs-string">"Encoded corpus is:"</span>)
print(vector.toarray())
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1636217949592/LrdusUt3z.png" alt="doc.png" />
Because most of its cells are zero, this kind of matrix is known as a sparse matrix.</p>
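<p>As a quick check of that claim (assuming the vectorizer code above has been run), printing the encoded object shows that scikit-learn stores only the non-zero entries in compressed sparse row format rather than the full table:</p>
<pre><code># The encoded corpus is a SciPy sparse matrix, not a dense array
print(type(vector))  # a scipy.sparse CSR matrix; the exact class name varies by SciPy version
print(vector)        # lists only the non-zero (row, column) count entries
</code></pre>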
<p>Key observations:</p>
<ul>
<li><p>There are 13 unique words in the corpus forming the vocabulary, represented as columns of the table.</p>
</li>
<li><p>There are 3 sentences in the corpus each represented as rows of the table.</p>
</li>
<li><p>Every cell contains a number, that represents the count of the word in that particular text.</p>
</li>
<li><p>All words have been converted to lowercase.</p>
</li>
<li><p>The words in columns have been arranged alphabetically.</p>
</li>
</ul>
<p><strong>7. Implementation of product rating using multinomial Naive Bayes algorithm</strong></p>
<ul>
<li>Importing all the libraries required to run this code. </li>
</ul>
<pre><code>from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import mean_squared_error, roc_auc_score, accuracy_score
from sklearn.naive_bayes import MultinomialNB
from sklearn import preprocessing
from sklearn import metrics
from sklearn.model_selection import train_test_split
from nltk import sent_tokenize, word_tokenize, pos_tag
from bs4 import BeautifulSoup
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from nltk.stem import SnowballStemmer, WordNetLemmatizer
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
import contractions
import nltk
import re
</code></pre><p>If you run into a package-not-found error with the imports above, install the missing library as shown below:</p>
<pre><code><span class="hljs-addition">!pip3 install name_of_the_library</span>
</code></pre><ul>
<li>Loading train and test dataset</li>
</ul>
<pre><code><span class="hljs-attr">train_df</span> = pd.read_csv(<span class="hljs-string">"Train_Data.csv"</span>)
<span class="hljs-attr">test_df</span> = pd.read_csv(<span class="hljs-string">'Test_Data.csv'</span>)
</code></pre><ul>
<li>Displaying 5 rows of the train_df</li>
</ul>
<pre><code><span class="hljs-selector-tag">train_df</span><span class="hljs-selector-class">.head</span>()
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1636209055674/5sfSIUzh_.png" alt="head.png" /></p>
<ul>
<li>Displaying information about the features sets in the train dataframe</li>
</ul>
<pre><code>train_df.<span class="hljs-keyword">info</span>()
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1636209351416/jN-lEBZ2f.png" alt="info.png" /></p>
<ul>
<li>Visualizing the distribution of average_review_rating</li>
</ul>
<pre><code><span class="hljs-selector-tag">plt</span><span class="hljs-selector-class">.figure</span>(figsize=(<span class="hljs-number">12</span>,<span class="hljs-number">8</span>))

<span class="hljs-selector-tag">train_df</span><span class="hljs-selector-attr">['average_review_rating']</span><span class="hljs-selector-class">.value_counts</span>()<span class="hljs-selector-class">.sort_index</span>()<span class="hljs-selector-class">.plot</span>(kind=<span class="hljs-string">'bar'</span>)
<span class="hljs-selector-tag">plt</span><span class="hljs-selector-class">.title</span>(<span class="hljs-string">'Distribution of average_review_rating'</span>)
<span class="hljs-selector-tag">plt</span><span class="hljs-selector-class">.xlabel</span>(<span class="hljs-string">'average_review_rating'</span>)
<span class="hljs-selector-tag">plt</span><span class="hljs-selector-class">.ylabel</span>(<span class="hljs-string">'Count'</span>)
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1636209520296/neSbiWCTO.png" alt="dist.png" /></p>
<ul>
<li>Visualizing the distribution of reviews for the top 50 products</li>
</ul>
<pre><code><span class="hljs-attribute">products</span> = train_df[<span class="hljs-string">"product_name"</span>].value_counts()
<span class="hljs-attribute">plt</span>.figure(figsize=(<span class="hljs-number">12</span>,<span class="hljs-number">8</span>))
<span class="hljs-attribute">products</span>[:<span class="hljs-number">50</span>].plot(kind='bar')
<span class="hljs-attribute">plt</span>.title(<span class="hljs-string">"Number of Reviews for Top 50 products"</span>)
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1636209901988/0nScoTVZP.png" alt="dist.png" /></p>
<ul>
<li>Visualizing the distribution of reviews for the top 50 manufacturers</li>
</ul>
<pre><code><span class="hljs-attribute">brands</span> = train_df['manufacturer'].value_counts()
<span class="hljs-attribute">plt</span>.figure(figsize=(<span class="hljs-number">12</span>,<span class="hljs-number">8</span>))
<span class="hljs-attribute">brands</span>[:<span class="hljs-number">50</span>].plot(kind='bar')
<span class="hljs-attribute">plt</span>.title(<span class="hljs-string">"Number of Reviews for Top 50 manufacturers"</span>)
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1636210132047/3sr8f4v7w.png" alt="distribu.png" /></p>
<ul>
<li>Visualizing the distribution of the length of the reviews</li>
</ul>
<pre><code>review_length = train_df[<span class="hljs-string">"customer_reviews"</span>].dropna().<span class="hljs-keyword">map</span>(lambda x: <span class="hljs-built_in">len</span>(x))
plt.figure(figsize=(<span class="hljs-number">12</span>,<span class="hljs-number">8</span>))
review_length.loc[review_length &lt; <span class="hljs-number">1500</span>].hist()
plt.title(<span class="hljs-string">"Distribution of customer review Length"</span>)
plt.xlabel(<span class="hljs-string">'Review length (Number of character)'</span>)
plt.ylabel(<span class="hljs-string">'Count'</span>)
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1636210277052/ZDj2PkxtnN.png" alt="leng.png" /></p>
<ul>
<li>Checking for NaN value in the train dataframe</li>
</ul>
<pre><code>train_df.<span class="hljs-keyword">isnull</span>().sum()
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1636210401104/FlCpIGohT.png" alt="nan.png" /></p>
<ul>
<li>As shown above, there are many NaN values in the dataframe, so we remove them using the dropna() method; keeping them would add noise to our data.</li>
</ul>
<pre><code>train_df.dropna(inplace=<span class="hljs-literal">True</span>)
</code></pre><ul>
<li>Displaying customer_reviews in the dataframe, since we will use it as the input to our model. Recall that the system rates products based on customer reviews.</li>
</ul>
<pre><code><span class="hljs-selector-tag">train_df</span><span class="hljs-selector-attr">[<span class="hljs-string">'customer_reviews'</span>]</span>
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1636213008164/nzT_nths1f.png" alt="rev.png" /></p>
<ul>
<li>Checking that there are no NaN values left in the feature</li>
</ul>
<pre><code><span class="hljs-selector-tag">train_df</span><span class="hljs-selector-attr">[<span class="hljs-string">'customer_reviews'</span>]</span><span class="hljs-selector-class">.isnull</span>()<span class="hljs-selector-class">.sum</span>()
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1636213145293/WpPXRNqRn.png" alt="zero.png" /></p>
<p>That looks good. Let's proceed!</p>
<ul>
<li>Defining a function cleanText() to remove special characters, HTML tags, etc. from our feature</li>
</ul>
<pre><code><span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">cleanText</span>(<span class="hljs-params">raw_text, remove_stopwords=False, stemming=False, split_text=False </span>):</span>
    <span class="hljs-string">'''
    Convert a raw review to a cleaned review
    '''</span>
    text = BeautifulSoup(raw_text, <span class="hljs-string">'lxml'</span>).get_text()  <span class="hljs-comment">#remove html</span>
    letters_only = re.sub(<span class="hljs-string">"[^a-zA-Z]"</span>, <span class="hljs-string">" "</span>, text)  <span class="hljs-comment"># remove non-character</span>
    words = letters_only.lower().split() <span class="hljs-comment"># convert to lower case </span>

    <span class="hljs-keyword">if</span> remove_stopwords: <span class="hljs-comment"># remove stopword</span>
        stops = set(stopwords.words(<span class="hljs-string">"english"</span>))
        words = [w <span class="hljs-keyword">for</span> w <span class="hljs-keyword">in</span> words <span class="hljs-keyword">if</span> <span class="hljs-keyword">not</span> w <span class="hljs-keyword">in</span> stops]

    <span class="hljs-keyword">if</span> stemming==<span class="hljs-literal">True</span>: <span class="hljs-comment"># stemming</span>
<span class="hljs-comment">#         stemmer = PorterStemmer()</span>
        stemmer = SnowballStemmer(<span class="hljs-string">'english'</span>) 
        words = [stemmer.stem(w) <span class="hljs-keyword">for</span> w <span class="hljs-keyword">in</span> words]

    <span class="hljs-keyword">if</span> split_text==<span class="hljs-literal">True</span>:  <span class="hljs-comment"># split text</span>
        <span class="hljs-keyword">return</span> (words)

    <span class="hljs-keyword">return</span>( <span class="hljs-string">" "</span>.join(words))
</code></pre><ul>
<li>Splitting the dataset into training and validation sets to train our model and evaluate its performance, using the train_test_split() method with the feature and target variable as arguments.</li>
</ul>
<pre><code><span class="hljs-attribute">X_train</span>, X_test, y_train, y_test = train_test_split(train_df['customer_reviews'],train_df['average_review_rating'],test_size=<span class="hljs-number">0</span>.<span class="hljs-number">2</span>, random_state=<span class="hljs-number">42</span>)
</code></pre><ul>
<li>Cleaning the training feature set.</li>
</ul>
<pre><code>X_train_cleaned = []
X_test_cleaned = []

<span class="hljs-keyword">for</span> d in X_train:
    X_train_cleaned.<span class="hljs-built_in">append</span>(cleanText(d))
<span class="hljs-built_in">print</span>(<span class="hljs-string">'Show a cleaned review in the training set : \n'</span>,  X_train_cleaned[<span class="hljs-number">10</span>])

<span class="hljs-keyword">for</span> d in X_test:
    X_test_cleaned.<span class="hljs-built_in">append</span>(cleanText(d))
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1636213702953/0xh29C-Mh.png" alt="clean.png" /></p>
<ul>
<li>Printing the identified Unique words along with their indices</li>
</ul>
<pre><code>countVect = CountVectorizer()
X_train_countVect = countVect.fit(X_train_cleaned)

print("Vocabulary: ", X_train_countVect.vocabulary_)
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1636214299003/suKx4ARoP.png" alt="ind.png" /></p>
<ul>
<li>Applying CountVectorizer to X_train_clean</li>
</ul>
<pre><code>countVect = CountVectorizer()
X_train_countVect = countVect.fit_transform(X_train_cleaned)
# On scikit-learn &gt;= 1.0, prefer get_feature_names_out() over get_feature_names()
print("Number of features : %d \n" % len(countVect.get_feature_names())) #6378
print("Show some feature names : \n", countVect.get_feature_names()[::100])
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1636218909089/ydO5sTI51.png" alt="feat.png" /></p>
<ul>
<li>Creating an object of the LabelEncoder class. Our target variable is of float data type, while the target fed to our model must be of integer data type.</li>
</ul>
<pre><code><span class="hljs-attr">lb</span>=preprocessing.LabelEncoder()
</code></pre><ul>
<li>Displaying y_train before encoding</li>
</ul>
<pre><code>print(y_train)
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1636219508675/6ElVwL_qY.png" alt="tra.png" /></p>
<ul>
<li>Encoding y_train</li>
</ul>
<pre><code><span class="hljs-attr">y_train_encoded</span>=lb.fit_transform(y_train)
</code></pre><ul>
<li>Displaying the encoded y_train</li>
</ul>
<pre><code>print(y_train_encoded)
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1636219667435/zh8iGFsIg.png" alt="y.png" /></p>
<ul>
<li>Encoding y_test</li>
</ul>
<pre><code><span class="hljs-attr">y_test_encoded</span> = lb.transform(y_test)
</code></pre><p>y_test_encoded</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1636219915832/ndvmYrt7g.png" alt="y.png" /></p>
<ul>
<li>Defining our model and fitting with X_train_countVect and y_train_encoded</li>
</ul>
<pre><code>mnb = MultinomialNB()
mnb.fit(X_train_countVect, y_train_encoded)
</code></pre><ul>
<li>Predicting on unseen validation test features</li>
</ul>
<pre><code><span class="hljs-attr">predictions</span> = mnb.predict(countVect.transform(X_test_cleaned))
</code></pre><ul>
<li>Displaying predicted value. </li>
</ul>
<pre><code>print(predictions)
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1636220364555/OfEiT-ehI.png" alt="pred.png" /></p>
<p>We notice that the predicted values are not on the rating scale from the problem statement, simply because we encoded our target variable before fitting the model. We need to inverse-transform the predictions to recover the actual ratings, as shown below:</p>
<pre><code><span class="hljs-attr">prediction</span> =lb.inverse_transform(predictions )
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1636220790409/rac8XwPUG.png" alt="pred.png" /></p>
<ul>
<li>Defining a function to evaluate the model using RMSE (Root Mean Square Error) as the metric score.</li>
</ul>
<pre><code><span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">modelEvaluation</span>(<span class="hljs-params">y_tst,pred</span>):</span>
    <span class="hljs-string">'''
    Print model evaluation to predicted result 
    '''</span>
    <span class="hljs-keyword">print</span> (<span class="hljs-string">"\nAccuracy on validation set: {:.4f}"</span>.format(np.sqrt(mean_squared_error(y_tst, pred))))
</code></pre><ul>
<li>Evaluating the model performance</li>
</ul>
<pre><code><span class="hljs-selector-tag">modelEvaluation</span>(lb.inverse_transform(y_test_encoded),predictions)
</code></pre><p>RMSE on validation set: 3.1841</p>
<p>The lower the RMSE, the better the model is performing. We have come halfway to our final destination; let us now process the test dataframe and run predictions on the unseen feature set.</p>
<ul>
<li>Displaying the first 5 rows  of the test dataframe</li>
</ul>
<pre><code><span class="hljs-selector-tag">test_df</span><span class="hljs-selector-class">.head</span>()
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1636221327271/ySlcRBQ1V.png" alt="head.png" /></p>
<ul>
<li>Displaying information about the test dataframe</li>
</ul>
<pre><code>test_df.<span class="hljs-keyword">info</span>()
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1636221502570/gIKwqfVgL.png" alt="info.png" /></p>
<ul>
<li>Checking for null values in the test dataframe</li>
</ul>
<pre><code><span class="hljs-selector-tag">test_df</span><span class="hljs-selector-class">.customer_reviews</span><span class="hljs-selector-class">.isnull</span>()
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1636221635980/HNJtWooLj.png" alt="null.png" /></p>
<ul>
<li>Cleaning the test_df['customer_reviews']</li>
</ul>
<pre><code>test_df_cleaned = []

for d in test_df['customer_reviews']:
    test_df_cleaned.append(cleanText(d))
print('Show a cleaned review in the test set : \n', test_df_cleaned[10])
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1636222855971/lWP_eyd6m.png" alt="pic.png" /></p>
<ul>
<li>Predicting on the unseen clean customer reviews</li>
</ul>
<pre><code><span class="hljs-attr">predictions_test_df</span> = mnb.predict(countVect.transform(test_df_cleaned ))
<span class="hljs-attr">predictions_test</span> = lb.inverse_transform(predictions_test_df)
</code></pre><ul>
<li>Displaying the predicted values </li>
</ul>
<pre><code>print(predictions_test)
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1636223167747/I2CBZ9TML.png" alt="pr.png" /></p>
<ul>
<li>Putting the predicted values into a dataframe </li>
</ul>
<pre><code><span class="hljs-attr">predictions_test</span> = pd.DataFrame(predictions_test)
</code></pre><ul>
<li>Displaying the prediction</li>
</ul>
<pre><code>print(predictions_test)
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1636223394350/eEfYbMw5i.png" alt="df.png" /></p>
<ul>
<li>Converting the dataframe into csv format for submission.</li>
</ul>
<pre><code>predictions_test.<span class="hljs-keyword">index</span> = pd.DataFrame(predictions_test).<span class="hljs-keyword">index</span>
predictions_test.<span class="hljs-keyword">columns</span> = ["prediction"]
predictions_test.to_csv("submission.csv", <span class="hljs-keyword">index</span> = <span class="hljs-keyword">False</span>)
</code></pre><p>Check the project folder in Jupyter Notebook; you should have a file named submission.csv, as shown below:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1636224057383/JxEP8o6qx.png" alt="pred.png" /></p>
<p>Conclusion:</p>
<p>We were able to rate customer reviews of Amazon products with an RMSE score of 0.318 using the multinomial Naive Bayes algorithm and CountVectorizer. We could achieve a lower RMSE with TF-IDF, which tends to perform better than CountVectorizer, or by exploring other machine learning algorithms such as support vector machines and deep learning models like RNNs and LSTMs.</p>
<p>Please let me know if you find any errors. You can reach out to me on any of the Matrix decentralized servers. My Element messenger ID is <em>@maximilien:matrix.org</em></p>
<p>If you are on LinkedIn, you can reach me <a target="_blank" href="https://www.linkedin.com/in/ephrem-maximilien-kpizingui-48222775/">here</a>.</p>
<blockquote>
<p>Warm regards,</p>
<p>Maximilien.</p>
</blockquote>
]]></content:encoded></item><item><title><![CDATA[End-To-End Gender Determination by Morphometry of Eyes using CNN (Convolutional Neural Network)]]></title><description><![CDATA[I did recall the first time my mum escorted me to nursery school. I was that little boy who used to learn from images written on flash cards hanging on the blackboard "this is a car", "this is an elephant" :-). Similarly in the following, we are goin...]]></description><link>https://maximilien.docquest.io/end-to-end-gender-determination-by-morphometry-of-eyes-using-cnn-convolutional-neural-network</link><guid isPermaLink="true">https://maximilien.docquest.io/end-to-end-gender-determination-by-morphometry-of-eyes-using-cnn-convolutional-neural-network</guid><dc:creator><![CDATA[Maximilien]]></dc:creator><pubDate>Tue, 26 Oct 2021 16:04:52 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1635167582075/6ifQGS5gq.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>I still recall the first time my mum escorted me to nursery school. I was that little boy who learned from images on flash cards hanging on the blackboard: "this is a car", "this is an elephant" :-). In a similar way, we are going to build mathematical models that mimic the functions of the human eye and brain, empowering a machine to classify the image of a patient's eye and determine whether the patient is male or female.</p>
<hr />
<p><strong>Contents</strong>:</p>
<p><strong>1. Computer vision</strong></p>
<p><strong>2. Convolutional Neural Network</strong></p>
<p><strong>3. Architecture of CNN</strong></p>
<p><strong>4. Image augmentation</strong></p>
<p><strong>5. Implementation of gender determination by morphometry of eye</strong></p>
<hr />
<p><strong>1. Computer Vision</strong></p>
<p>To understand computer vision, let us first discuss human vision: the ability of the human eye and brain to see and recognize objects. It is quite simple for the human eye to identify precisely whether a person is male or female, but it takes a lot of training for a computer system to distinguish such objects. Computer vision is the process of giving a machine a similar ability to see and identify objects in the real world. In this light, computer vision can be defined as building mathematical models that mimic the function of the human eye and brain. Basically, it is about training computers to understand and process images and videos.</p>
<p><strong>2. Convolutional Neural Network</strong></p>
<p>CNN is a class of deep neural network mostly used in the fields of computer vision and imaging. CNNs are used to identify images, cluster them by similarity and implement object recognition. The word convolution refers to the filtering process that takes place in this type of network. A CNN consists of different layers, namely the input layer, the output layer, and multiple hidden layers; the hidden layers comprise fully connected layers, convolutional layers, ReLU activation layers, normalization layers and pooling layers.</p>
<p><strong>3. Architecture of CNN</strong></p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1635177100813/oBm-AnQi0.png" alt="CNN-architecture-image-courtesy-AtheroPoint-TM.png" />
The main components of CNN architecture are as follows:</p>
<p>• Input image: An input image forms the first component of a CNN architecture. An image can be of any type: a human, an animal, scenery, a medical X-ray image, etc. Each image is converted into a numerical matrix, shown here as zeros and ones:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1635175108743/nOcxxlrQz.png" alt="Screenshot from 2021-10-25 15-16-12.png" /></p>
<p>• Convolutional layer: The convolution layer is where the image processing, or filtering, starts. A convolution layer consists of two parts:</p>
<p>• Feature detector, filter or kernel: a matrix, typically 3x3 for a 2D image, that is slid over the image to transform it into a feature map</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1635175725969/LAcHyjwBP.png" alt="map.png" />
• Feature map: the reduced image produced by convolving the image with the feature detector. We place the detector at every possible location of the original image and derive a smaller image from it; each cell of the feature map is the dot product of the corresponding patch of the input image with the kernel matrix.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1635175993175/LGDtWiBc6.png" alt="Screenshot from 2021-10-25 15-31-38.png" /></p>
<p>NB: the feature detector (kernel) is the filter and the feature map is the reduced image; some information is lost while reducing the image. The feature map above is obtained by moving the orange frame over the whole input and taking the dot product with the kernel, as shown below (a code sketch follows the image).</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1635176376980/x2cbKU2sH.png" alt="Screenshot from 2021-10-25 15-38-52.png" /></p>
<p>• Pooling layer: The pooling layer helps us ignore the less important data in the image and reduces the image further while preserving its important features. The feature map derived from the convolution layer is passed through a pooling layer, which shrinks the image while keeping its most relevant parts. Pooling functions include max pooling, min pooling, and average pooling. In max pooling, we select a window size, say 2x2, scan the feature map, and keep the maximum number from each 2x2 block. The following image gives a clear idea of how max pooling works; a short code sketch follows it.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1635177025896/PK1m1guFo_.png" alt="Screenshot from 2021-10-25 15-48-36.png" /></p>
<p>• Flattening: Flattening is the part of a CNN where the image is made ready to use as the input to an artificial neural network. The pooled image is flattened and converted into a single column: each row is made into a column and stacked one over another. Here, we convert a 3x3 matrix into a 1xn vector, where n in our case is 9 (a one-line sketch follows the image).</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1635177455501/ISGCtUIeWL.png" alt="Screenshot from 2021-10-25 15-57-18.png" /></p>
<p>Now, let's look at the overall structure of a CNN
<img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1635177698665/VJWsxleIX.png" alt="Screenshot from 2021-10-25 16-02-10.png" /></p>
<p><strong>4. Image augmentation</strong>
Image or data augmentation creates many batches of our images, then applies random transformations to random images inside the batches: rotating, shifting, flipping them, and so on. By applying these transformations we get more diverse images inside the batches, and we also have much more data than we had originally, as shown below for an image of a football.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1635178131275/agcbLBqMR.png" alt="Screenshot from 2021-10-25 16-08-48.png" /></p>
<p><strong>5. Implementation of gender determination by morphometry of eye</strong></p>
<p>Let's get into the coding part. The dataset used in this code can be downloaded from DPhi platform  <a target="_blank" href="https://drive.google.com/file/d/1f7uslI-ZHidriQFZR966_aILjlkgDN76/view?usp=sharing">here</a> .</p>
<ul>
<li>Installing tensorflow framework (skip this part if you are using Google colab)</li>
</ul>
<pre><code><span class="hljs-comment"># Requires the latest pip</span>
 pip <span class="hljs-keyword">install</span> <span class="hljs-comment">--upgrade pip</span>
</code></pre><pre><code># <span class="hljs-keyword">Current</span> <span class="hljs-keyword">stable</span> <span class="hljs-keyword">release</span> <span class="hljs-keyword">for</span> CPU <span class="hljs-keyword">and</span> GPU
pip3 install tensorflow
</code></pre><pre><code><span class="hljs-comment">#installing open computer vision </span>
<span class="hljs-attribute">pip3</span> install opencv-contrib-python
</code></pre><ul>
<li><p>Importing the libraries</p>
<pre><code>from tensorflow.keras.preprocessing.image <span class="hljs-keyword">import</span> ImageDataGenerator,array_to_img, img_to_array, load_img
from tensorflow.keras.layers import Dense, Activation, Flatten, Dropout, BatchNormalization
from tensorflow.keras.layers import Conv2D, MaxPooling2D
from tensorflow.keras import regularizers, optimizers
from tensorflow.keras.preprocessing import image
from tensorflow.keras.models import Sequential
from tensorflow.keras import layers, models
import matplotlib.pyplot as plt
from os import listdir, makedirs
from os.path import isfile, join
import pandas as pd
import numpy as np
import pathlib
import PIL
import cv2
import os
</code></pre></li>
<li><p>Loading the train and test dataset</p>
</li>
</ul>
<pre><code><span class="hljs-attr">train_df</span>=pd.read_csv(<span class="hljs-string">"Training_set.csv"</span>,dtype=str)
<span class="hljs-attr">test_df</span>=pd.read_csv(<span class="hljs-string">"Testing_set.csv"</span>,dtype=str)
</code></pre><ul>
<li>Defining the path for the various directories<pre><code><span class="hljs-attr">src_path_train</span>=<span class="hljs-string">"/root/Desktop/Deep learning dphi/eye_gender_data/train/"</span>
<span class="hljs-attr">src_path_test</span>=<span class="hljs-string">"/root/Desktop/Deep learning dphi/eye_gender_data/test/"</span>
<span class="hljs-attr">src_path_validation</span>=<span class="hljs-string">"/root/Desktop/Deep learning dphi/eye_gender_data/train/validation/"</span>
<span class="hljs-attr">src_path_train_gray</span>=<span class="hljs-string">"/root/Desktop/Deep learning dphi/eye_gender_data/train_grayscale/"</span>
<span class="hljs-attr">src_path_test_gray</span>=<span class="hljs-string">"/root/Desktop/Deep learning dphi/eye_gender_data/test_grayscale/"</span>
</code></pre>Run this code only once to create validation folder<pre><code>base_dir=<span class="hljs-string">'/root/Desktop/Deep learning dphi/eye_gender_data/'</span>
validation_dir = os.path.<span class="hljs-keyword">join</span>(base_dir, <span class="hljs-string">'validation'</span>)
os.mkdir(validation_dir)
</code></pre>Creating the train_grayscale folder<pre><code>os.mkdir(src_path_train_gray)
</code></pre>Creating the test_grayscale folder<pre><code>os.mkdir(src_path_test_gray)
</code></pre></li>
</ul>
<p>Applying data augmentation to Image_6 in the train dataset to visualize what it looks like</p>
<pre><code><span class="hljs-attribute">datagen</span> = ImageDataGenerator(
        <span class="hljs-attribute">rotation_range</span>=<span class="hljs-number">40</span>,
        <span class="hljs-attribute">width_shift_range</span>=<span class="hljs-number">0</span>.<span class="hljs-number">2</span>,
        <span class="hljs-attribute">height_shift_range</span>=<span class="hljs-number">0</span>.<span class="hljs-number">2</span>,
        <span class="hljs-attribute">shear_range</span>=<span class="hljs-number">0</span>.<span class="hljs-number">2</span>,
        <span class="hljs-attribute">zoom_range</span>=<span class="hljs-number">0</span>.<span class="hljs-number">2</span>,
        <span class="hljs-attribute">horizontal_flip</span>=True,
        <span class="hljs-attribute">fill_mode</span>='nearest')

<span class="hljs-attribute">img</span> = load_img(<span class="hljs-string">"/root/Desktop/Deep learning dphi/eye_gender_data/train/Image_6.jpg"</span>)  # this is a PIL image
<span class="hljs-attribute">x</span> = img_to_array(img)  # this is a Numpy array with shape (<span class="hljs-number">3</span>, <span class="hljs-number">150</span>, <span class="hljs-number">150</span>)
<span class="hljs-attribute">x</span> = x.reshape((<span class="hljs-number">1</span>,) + x.shape)  # this is a Numpy array with shape (<span class="hljs-number">1</span>, <span class="hljs-number">3</span>, <span class="hljs-number">150</span>, <span class="hljs-number">150</span>)

<span class="hljs-comment"># the .flow() command below generates batches of randomly transformed images</span>
<span class="hljs-comment"># and saves the results to the `preview/` directory</span>
<span class="hljs-attribute">i</span> = <span class="hljs-number">0</span>
<span class="hljs-attribute">for</span> batch in datagen.flow(x, batch_size=<span class="hljs-number">1</span>,
                          <span class="hljs-attribute">save_to_dir</span>='preview', save_prefix='cat', save_format='jpeg'):
    <span class="hljs-attribute">i</span> += <span class="hljs-number">1</span>
    <span class="hljs-attribute">if</span> i &gt; <span class="hljs-number">12</span>:
        <span class="hljs-attribute">break</span>  # otherwise the generator would loop indefinitely
</code></pre><p>Creating a function to visualize 12 augmented samples of a real image</p>
<pre><code>def visualizeImg(path, color):
    """Display up to 12 images from a folder in a 3x4 grid."""
    sub_class = os.listdir(path)
    plt.figure(figsize=(10, 7))
    for e in range(len(sub_class[:12])):
        plt.subplot(3, 4, e + 1)
        img = plt.imread(os.path.join(path, sub_class[e]))
        plt.imshow(img, cmap=plt.get_cmap(color))
</code></pre><ul>
<li>Visualizing the  augmented images </li>
</ul>
<pre><code><span class="hljs-selector-tag">visualizeImg</span>(<span class="hljs-string">"/root/Desktop/Deep learning dphi/eye_gender_data/preview/"</span>,<span class="hljs-string">'CMRmap'</span>)
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1635194589586/pYzUKkOLX.png" alt="aug.png" /></p>
<ul>
<li>Displaying sample of the train dataset</li>
</ul>
<pre><code><span class="hljs-selector-tag">visualizeImg</span>(src_path_train,<span class="hljs-string">'CMRmap'</span>)
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1635194791240/eyZ5f2GUe.png" alt="train.png" /></p>
<ul>
<li>Displaying sample of test dataset</li>
</ul>
<pre><code><span class="hljs-selector-tag">visualizeImg</span>(src_path_test,<span class="hljs-string">'CMRmap'</span>)
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1635194954566/-8lAfaraz.png" alt="Screenshot from 2021-10-25 20-48-46.png" /></p>
<ul>
<li>Converting the train dataset to grayscale</li>
</ul>
<pre><code>path = "/root/Desktop/Deep learning dphi/eye_gender_data/train/"
# create a folder named train_grayscale in the eye_gender_data directory
dstpath = "/root/Desktop/Deep learning dphi/eye_gender_data/train_grayscale/"

try:
    makedirs(dstpath)
except FileExistsError:
    print("Directory already exists, images will be written to the same folder")

# keep files only; sub-folders are skipped
files = [f for f in listdir(path) if isfile(join(path, f))]

for image in files:
    try:
        img = cv2.imread(os.path.join(path, image))
        gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
        dstPath = join(dstpath, image)
        cv2.imwrite(dstPath, gray)
    except Exception:
        print("{} was not converted".format(image))
</code></pre><ul>
<li>Displaying a sample of the train_grayscale dataset</li>
</ul>
<pre><code><span class="hljs-selector-tag">visualizeImg</span>(src_path_train_gray,<span class="hljs-string">'gray'</span>)
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1635195215388/UjAeOckDBg.png" alt="Screenshot from 2021-10-25 20-53-37.png" /></p>
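<p>The same conversion is repeated below for the test set and, later, for an unseen image, so you may prefer to wrap it in a small helper. A minimal sketch; convert_to_gray is a name introduced here for illustration:</p>
<pre><code>def convert_to_gray(path, dstpath):
    # convert every image file in `path` to grayscale and write it to `dstpath`
    try:
        makedirs(dstpath)
    except FileExistsError:
        print("Directory already exists, images will be written to the same folder")
    for fname in [f for f in listdir(path) if isfile(join(path, f))]:
        img = cv2.imread(os.path.join(path, fname))
        if img is None:
            print("{} was not converted".format(fname))
            continue
        cv2.imwrite(join(dstpath, fname), cv2.cvtColor(img, cv2.COLOR_BGR2GRAY))
</code></pre>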
<ul>
<li>Converting the test dataset to grayscale</li>
</ul>
<pre><code>path = "/root/Desktop/Deep learning dphi/eye_gender_data/test/"
# create a folder named test_grayscale in the eye_gender_data directory
dstpath = "/root/Desktop/Deep learning dphi/eye_gender_data/test_grayscale/"

try:
    makedirs(dstpath)
except FileExistsError:
    print("Directory already exists, images will be written to the same folder")

# keep files only; sub-folders are skipped
files = [f for f in listdir(path) if isfile(join(path, f))]

for image in files:
    try:
        img = cv2.imread(os.path.join(path, image))
        gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
        dstPath = join(dstpath, image)
        cv2.imwrite(dstPath, gray)
    except Exception:
        print("{} was not converted".format(image))
</code></pre><ul>
<li>Displaying sample of test_grayscale dataset</li>
</ul>
<pre><code><span class="hljs-selector-tag">visualizeImg</span>(src_path_test_gray,<span class="hljs-string">'gray'</span>)
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1635198157531/g-cYiKmX1.png" alt="Screenshot from 2021-10-25 21-42-41.png" /></p>
<ul>
<li>Defining the augmentation configuration we will use for training </li>
</ul>
<pre><code><span class="hljs-attr">train_datagen</span> = ImageDataGenerator(
        <span class="hljs-attr">rescale</span>=<span class="hljs-number">1</span> / <span class="hljs-number">255.0</span>,
        <span class="hljs-attr">rotation_range</span>=<span class="hljs-number">2</span>,
        <span class="hljs-attr">zoom_range</span>=<span class="hljs-number">0.05</span>,
        <span class="hljs-attr">width_shift_range</span>=<span class="hljs-number">0.05</span>,
        <span class="hljs-attr">height_shift_range</span>=<span class="hljs-number">0.05</span>,
        <span class="hljs-attr">shear_range</span>=<span class="hljs-number">0.05</span>,
        <span class="hljs-attr">horizontal_flip</span>=<span class="hljs-literal">True</span>,
        <span class="hljs-attr">fill_mode</span>=<span class="hljs-string">"nearest"</span>,
        <span class="hljs-attr">validation_split</span>=<span class="hljs-number">0.20</span>)
</code></pre><ul>
<li>Defining the augmentation configuration we will use for testing</li>
</ul>
<pre><code><span class="hljs-attr">test_datagen</span> = ImageDataGenerator(rescale=<span class="hljs-number">1</span>./<span class="hljs-number">255</span>)
</code></pre><p>This is a generator that reads the pictures listed in train_df from 'eye_gender_data/train_grayscale' and indefinitely generates batches of augmented image data</p>
<pre><code><span class="hljs-attr">train_generator</span> = train_datagen.flow_from_dataframe(
    <span class="hljs-attr">dataframe</span>=train_df,
    <span class="hljs-attr">directory</span>=src_path_train_gray,
    <span class="hljs-attr">x_col</span>=<span class="hljs-string">"filename"</span>,
    <span class="hljs-attr">y_col</span>=<span class="hljs-string">"label"</span>,
     <span class="hljs-attr">subset</span>=<span class="hljs-string">'training'</span>,
    <span class="hljs-attr">target_size</span>=(<span class="hljs-number">71</span>, <span class="hljs-number">71</span>),  <span class="hljs-comment"># all images will be resized to 71*71</span>
    <span class="hljs-attr">batch_size</span>=<span class="hljs-number">400</span>,
    <span class="hljs-attr">seed</span>=<span class="hljs-number">60</span>,
    <span class="hljs-attr">shuffle</span>=<span class="hljs-literal">True</span>,
   <span class="hljs-attr">class_mode</span>=<span class="hljs-string">'categorical'</span>)
</code></pre><p>Found 7376 validated image filenames belonging to 2 classes.</p>
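<p>The generator above only yields the training subset. The fitting step below also needs a matching <strong>valid_generator</strong> built from the same train_datagen, which holds out 20% of the data via validation_split. A minimal sketch:</p>
<pre><code>valid_generator = train_datagen.flow_from_dataframe(
    dataframe=train_df,
    directory=src_path_train_gray,
    x_col="filename",
    y_col="label",
    subset='validation',  # the 20% held out by validation_split
    target_size=(32, 32),
    batch_size=400,
    seed=60,
    shuffle=True,
    class_mode='categorical')
</code></pre>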
<ul>
<li>This is a generator that reads the pictures listed in test_df from 'eye_gender_data/test_grayscale' and generates batches of rescaled (not augmented) image data</li>
</ul>
<pre><code>test_generator = test_datagen.flow_from_dataframe(
    dataframe=test_df,
    directory=<span class="hljs-string">"/root/Desktop/Deep learning dphi/eye_gender_data/test_grayscale/"</span>,
    x_col=<span class="hljs-string">"filename"</span>,
    target_size=(32, 32),  # match the model's 32x32 input
    batch_size=<span class="hljs-number">1</span>,
    class_mode=<span class="hljs-literal">None</span>,
    seed=<span class="hljs-number">60</span>,
    shuffle=<span class="hljs-literal">False</span>,
)
</code></pre><ul>
<li>Defining the base_model function, which builds the sequential model and all of its layers</li>
</ul>
<pre><code><span class="hljs-attribute">def</span> base_model():
    <span class="hljs-attribute">model</span> = models.Sequential()
    <span class="hljs-attribute">model</span>.add(layers.Conv<span class="hljs-number">2</span>D(<span class="hljs-number">32</span>, (<span class="hljs-number">3</span>, <span class="hljs-number">3</span>),padding='same', activation='relu',input_shape=(<span class="hljs-number">32</span>, <span class="hljs-number">32</span>, <span class="hljs-number">3</span>)))
    <span class="hljs-attribute">model</span>.add(layers.MaxPooling<span class="hljs-number">2</span>D((<span class="hljs-number">2</span>, <span class="hljs-number">2</span>)))
    <span class="hljs-attribute">model</span>.add(Dropout(<span class="hljs-number">0</span>.<span class="hljs-number">25</span>))
    <span class="hljs-attribute">model</span>.add(layers.Conv<span class="hljs-number">2</span>D(<span class="hljs-number">64</span>, (<span class="hljs-number">3</span>, <span class="hljs-number">3</span>), padding='same', activation='relu'))
    <span class="hljs-attribute">model</span>.add(layers.MaxPooling<span class="hljs-number">2</span>D((<span class="hljs-number">2</span>, <span class="hljs-number">2</span>)))
    <span class="hljs-attribute">model</span>.add(Dropout(<span class="hljs-number">0</span>.<span class="hljs-number">25</span>))
    <span class="hljs-attribute">model</span>.add(layers.Conv<span class="hljs-number">2</span>D(<span class="hljs-number">64</span>,(<span class="hljs-number">3</span>, <span class="hljs-number">3</span>),padding='same' , activation='relu'))
    <span class="hljs-attribute">model</span>.add(layers.MaxPooling<span class="hljs-number">2</span>D((<span class="hljs-number">2</span>, <span class="hljs-number">2</span>)))
    <span class="hljs-attribute">model</span>.add(Dropout(<span class="hljs-number">0</span>.<span class="hljs-number">25</span>))
    <span class="hljs-attribute">model</span>.add(layers.Conv<span class="hljs-number">2</span>D(<span class="hljs-number">64</span>, (<span class="hljs-number">3</span>, <span class="hljs-number">3</span>), padding='same',activation='relu'))
    <span class="hljs-attribute">model</span>.add(layers.MaxPooling<span class="hljs-number">2</span>D((<span class="hljs-number">2</span>, <span class="hljs-number">2</span>)))
    <span class="hljs-attribute">model</span>.add(Dropout(<span class="hljs-number">0</span>.<span class="hljs-number">25</span>))
    <span class="hljs-attribute">model</span>.add(layers.Flatten())
    <span class="hljs-attribute">model</span>.add(layers.Dense(<span class="hljs-number">512</span>, activation='relu'))
    <span class="hljs-attribute">model</span>.add(Dropout(<span class="hljs-number">0</span>.<span class="hljs-number">5</span>))
    <span class="hljs-attribute">model</span>.add(layers.Dense(<span class="hljs-number">2</span>, activation='softmax'))
    <span class="hljs-attribute">return</span> model
</code></pre><ul>
<li>Compiling the baseline model</li>
</ul>
<pre><code>baseline=base_model()
baseline.compile(optimizer=<span class="hljs-string">'adam'</span>,loss=<span class="hljs-string">"categorical_crossentropy"</span>,metrics=[<span class="hljs-string">"accuracy"</span>])
</code></pre><ul>
<li>Fitting the baseline model for 5 epochs</li>
</ul>
<pre><code><span class="hljs-attr">STEP_SIZE_TRAIN</span>=train_generator.n//train_generator.batch_size
<span class="hljs-attr">STEP_SIZE_VALID</span>=valid_generator.n//valid_generator.batch_size
<span class="hljs-attr">STEP_SIZE_TEST</span>=test_generator.n//test_generator.batch_size
<span class="hljs-attr">history</span>= baseline.fit(train_generator,
                    <span class="hljs-attr">steps_per_epoch</span>=STEP_SIZE_TRAIN,
                    <span class="hljs-attr">validation_data</span>=valid_generator,
                    <span class="hljs-attr">validation_steps</span>=STEP_SIZE_VALID,
                   <span class="hljs-attr">epochs</span>=<span class="hljs-number">5</span>)
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1635200803984/1Qe0pP2mJ.png" alt="Screenshot from 2021-10-25 22-27-13.png" /></p>
<ul>
<li>Creating a folder to save the model</li>
</ul>
<pre><code>!<span class="hljs-keyword">mkdir</span> -p saved_model
</code></pre><pre><code><span class="hljs-selector-tag">baseline</span><span class="hljs-selector-class">.save</span>(<span class="hljs-string">'saved_model/my_model'</span>)
</code></pre><ul>
<li>Evaluating the model on the validation dataset</li>
</ul>
<pre><code>score = baseline.evaluate(valid_generator, steps=STEP_SIZE_VALID)  # evaluate on the validation split
print('Validation loss:', score[0])
print('Validation accuracy:', score[1])
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1635201203474/N6oKDSssw.png" alt="Screenshot from 2021-10-25 22-33-31.png" /></p>
<ul>
<li>Plotting Training and validation accuracy &amp; Training and validation loss</li>
</ul>
<pre><code><span class="hljs-keyword">import</span> matplotlib.pyplot <span class="hljs-keyword">as</span> plt
acc = history.history[<span class="hljs-string">'accuracy'</span>]
val_acc = history.history[<span class="hljs-string">'val_accuracy'</span>]
loss = history.history[<span class="hljs-string">'loss'</span>]
val_loss = history.history[<span class="hljs-string">'val_loss'</span>]
epochs = range(<span class="hljs-number">1</span>, len(acc) + <span class="hljs-number">1</span>)
plt.plot(epochs, acc, <span class="hljs-string">'bo'</span>, label=<span class="hljs-string">'Training acc'</span>)
plt.plot(epochs, val_acc, <span class="hljs-string">'b'</span>, label=<span class="hljs-string">'Validation acc'</span>)
plt.title(<span class="hljs-string">'Training and validation accuracy'</span>)
plt.legend()
plt.figure()
plt.plot(epochs, loss, <span class="hljs-string">'bo'</span>, label=<span class="hljs-string">'Training loss'</span>)
plt.plot(epochs, val_loss, <span class="hljs-string">'b'</span>, label=<span class="hljs-string">'Validation loss'</span>)
plt.title(<span class="hljs-string">'Training and validation loss'</span>)
plt.legend()
plt.<span class="hljs-keyword">show</span>()
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1635201848693/4OnWtsC0t.png" alt="Screenshot from 2021-10-25 22-43-00.png" /></p>
<ul>
<li>Loading the saved model</li>
</ul>
<pre><code><span class="hljs-attr">new_model</span> = models.load_model(<span class="hljs-string">'saved_model/my_model'</span>)
</code></pre><ul>
<li>Predicting the test dataset</li>
</ul>
<pre><code>STEP_SIZE_TEST=test_generator.n//test_generator.batch_size
test_generator.<span class="hljs-keyword">reset</span>()
prediction=new_model.predict(test_generator,steps=STEP_SIZE_TEST,<span class="hljs-keyword">verbose</span>=<span class="hljs-number">1</span>)
</code></pre><ul>
<li><p>Mapping the predicted class indices back to their label names.</p>
<pre><code><span class="hljs-attr">predicted_class_indices</span>=np.argmax(prediction,axis=<span class="hljs-number">1</span>)
<span class="hljs-attr">labels</span> = (train_generator.class_indices)
<span class="hljs-attr">labels</span> = dict((v,k) for k,v in labels.items())
<span class="hljs-attr">predictions</span> = [labels[k] for k in predicted_class_indices]
<span class="hljs-attr">filenames</span>=test_generator.filenames
</code></pre></li>
<li><p>Creating a dataframe containing the predictions</p>
</li>
</ul>
<pre><code><span class="hljs-attr">results</span>=pd.DataFrame({ <span class="hljs-string">"label"</span>:predictions})
</code></pre><ul>
<li>Displaying the prediction in a dataframe</li>
</ul>
<pre><code>results
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1635202564196/j1u9NgC2W.png" alt="Screenshot from 2021-10-25 22-55-33.png" /></p>
<ul>
<li>Saving the predictions to a CSV file</li>
</ul>
<pre><code>results.to_csv("submission.csv",<span class="hljs-keyword">index</span>=<span class="hljs-keyword">False</span>)
</code></pre><p>Kudos for reaching this point! You can now submit the CSV file to platforms such as Kaggle, DPhi, HackerRank, or HackerEarth for scoring; a submission variant that includes the image filenames is sketched below.</p>
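<p>Most grading platforms tie each prediction to its image. A sketch that also writes the filenames collected earlier from the test generator (check your platform's required column names):</p>
<pre><code>results = pd.DataFrame({"filename": filenames, "label": predictions})
results.to_csv("submission.csv", index=False)
</code></pre>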
<ul>
<li>Classifying a new unseen image</li>
</ul>
<p>Your deep learning model is now up and running, so you can go ahead, take a picture of someone's eyes with your phone, and run a prediction to determine the person's gender. How do you go about it? Check the following steps.</p>
<ul>
<li>Loading the new image saved in the directory of your choice, as below</li>
</ul>
<pre><code><span class="hljs-attr">new_image</span> = image.load_img(<span class="hljs-string">'/root/Desktop/Deep learning dphi/eye_gender_data/unseen_test/download.jpeg'</span>, target_size = (<span class="hljs-number">71</span>, <span class="hljs-number">71</span>))
</code></pre><ul>
<li>Converting the image to grayscale.</li>
</ul>
<pre><code><span class="hljs-comment">#converting train to grascale</span>
path =<span class="hljs-string">"/root/Desktop/Deep learning dphi/eye_gender_data/unseen_test"</span>  
<span class="hljs-comment">#create a folder named  train_grayscale in the eye_gender data directory </span>
dstpath =<span class="hljs-string">"/root/Desktop/Deep learning dphi/eye_gender_data/unseen_test_grayscale/"</span>

<span class="hljs-keyword">try</span>:
    makedirs(dstpath)
<span class="hljs-keyword">except</span>:
    <span class="hljs-keyword">print</span> (<span class="hljs-string">"Directory already exist, images will be written in same folder"</span>)

<span class="hljs-comment"># Folder won't used</span>
files = [f <span class="hljs-keyword">for</span> f <span class="hljs-keyword">in</span> listdir(path) <span class="hljs-keyword">if</span> isfile(join(path,f))] 

<span class="hljs-keyword">for</span> image <span class="hljs-keyword">in</span> files:
    <span class="hljs-keyword">try</span>:
        img = cv2.imread(os.path.join(path,image))
        gray = cv2.cvtColor(img,cv2.COLOR_BGR2GRAY)
        dstPath = join(dstpath,image)
        cv2.imwrite(dstPath,gray)
    <span class="hljs-keyword">except</span>:
        <span class="hljs-keyword">print</span> (<span class="hljs-string">"{} is not converted"</span>.format(image))
</code></pre><ul>
<li>Visualizing the image</li>
</ul>
<pre><code><span class="hljs-selector-tag">plt</span><span class="hljs-selector-class">.imshow</span>(gray,cmap=<span class="hljs-string">"gray"</span>)
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1635262674914/pR5NebiOI.png" alt="Screenshot from 2021-10-26 15-33-17.png" /></p>
<ul>
<li>Checking the shape of the image because the input image to our sequential model is 32 by 32</li>
</ul>
<pre><code><span class="hljs-selector-tag">gray</span><span class="hljs-selector-class">.shape</span>
</code></pre><p>(617, 926)</p>
<ul>
<li>Resizing the image to (32,32)</li>
</ul>
<pre><code><span class="hljs-keyword">from</span> keras.preprocessing <span class="hljs-keyword">import</span> image
<span class="hljs-keyword">from</span> keras.preprocessing.image <span class="hljs-keyword">import</span> img_to_array
</code></pre><pre><code><span class="hljs-attr">image</span> = image.load_img(<span class="hljs-string">"/root/Desktop/Deep learning dphi/eye_gender_data/unseen_test_grayscale/download.jpeg"</span>)
<span class="hljs-attr">image</span> = image.resize((<span class="hljs-number">32</span>,<span class="hljs-number">32</span>))
</code></pre><pre><code><span class="hljs-selector-tag">plt</span><span class="hljs-selector-class">.imshow</span>(image)
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1635262777736/SDEfpEfg4.png" alt="Screenshot from 2021-10-26 15-40-04.png" /></p>
<ul>
<li>Converting the image to an array</li>
</ul>
<pre><code><span class="hljs-attr">new_image</span> =img_to_array(image, dtype=<span class="hljs-string">'uint8'</span>)
<span class="hljs-attr">new_image</span> = np.expand_dims(new_image, axis = <span class="hljs-number">0</span>)
</code></pre><ul>
<li>Running the prediction</li>
</ul>
<pre><code><span class="hljs-attr">STEP_SIZE_TEST</span>=test_generator.n//test_generator.batch_size
<span class="hljs-attr">prediction</span>=new_model.predict(new_image)
</code></pre><ul>
<li>Viewing the class labels</li>
</ul>
<pre><code><span class="hljs-selector-tag">train_generator</span><span class="hljs-selector-class">.class_indices</span>
</code></pre><p>{'female': 0, 'male': 1}</p>
<ul>
<li>Checking the label of the unseen image </li>
</ul>
<pre><code>pred_class = np.argmax(prediction[0])  # index of the highest-probability class
new_image_label = 'This is a {}'.format(labels[pred_class])  # labels maps 0 to 'female', 1 to 'male'
print(new_image_label)
</code></pre><p>This is a female</p>
<p>Conclusion: </p>
<p>We have reached the end of our learning journey in convolutional neural networks, determining gender from the morphometry of eyes with 93% accuracy. We could push the accuracy to 98% or 99% with a few lines of code by using pre-trained models such as ResNet50, VGG16, Xception, DenseNet, or Inception.</p>
<p>Please subscribe to my newsletter so you never miss an upcoming article, and leave me a comment if you have any questions or find this post interesting.</p>
<blockquote>
<p>My regards,</p>
<p>Maximilien.</p>
</blockquote>
]]></content:encoded></item><item><title><![CDATA[End-To-End Consumer Complaint Multiclass Classification Using Term Frequency - Inverse Document Frequency (TF-IDF) & Support Vector Machine Algorithm]]></title><description><![CDATA[You are about to embark on an exciting journey in NLP to implement end to end solution to solve business problem. In the end of this tutorial, the reader should be able to understand the concept of NLP, TF-IDF  and  to implement multiclass classifica...]]></description><link>https://maximilien.docquest.io/end-to-end-consumer-complaint-multiclass-classification-using-term-frequency-inverse-document-frequency-tf-idf-and-support-vector-machine-algorithm</link><guid isPermaLink="true">https://maximilien.docquest.io/end-to-end-consumer-complaint-multiclass-classification-using-term-frequency-inverse-document-frequency-tf-idf-and-support-vector-machine-algorithm</guid><category><![CDATA[nlp]]></category><category><![CDATA[Machine Learning]]></category><category><![CDATA[Data Science]]></category><dc:creator><![CDATA[Maximilien]]></dc:creator><pubDate>Sat, 23 Oct 2021 15:52:03 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1635011134710/k6Sg1gagy.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>You are about to embark on an exciting journey in NLP, implementing an end-to-end solution to a business problem. By the end of this tutorial, you should understand the concepts of <strong>NLP</strong> and <strong>TF-IDF</strong> and be able to implement a <strong>multiclass classification algorithm</strong> using the Python programming language.</p>
<hr />
<p><strong>Contents</strong></p>
<ol>
<li><p><strong>What is NLP</strong> </p>
</li>
<li><p><strong>Application of NLP</strong></p>
</li>
<li><p><strong>What is TF-IDF</strong></p>
</li>
<li><p><strong>What is multiclass classification</strong></p>
</li>
<li><p><strong> Importance of multiclass classification</strong></p>
</li>
<li><p><strong> Installing  libraries</strong></p>
</li>
<li><p><strong> Objective of the project</strong></p>
</li>
<li><p><strong>Implementing Multiclass Classification</strong></p>
</li>
</ol>
<hr />
<p> <strong>1. What is Natural Language Processing</strong></p>
<p> Natural language processing is the term used to describe the process of using computer algorithms to identify key elements in human language and extract meaning from unstructured spoken or written text. In other words,  NLP is a set of AI techniques designed to process human language. These techniques enable applications to recognize, process, analyze, and even generate natural human language. </p>
<p><strong>2.  Application of NLP</strong></p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1635004099168/1AhqtPaz_.png" alt="nlp.png" /></p>
<p> <strong>3.  What is TF-IDF</strong></p>
<p> TF-IDF, or term frequency-inverse document frequency, is a statistical weighting scheme, used by search engines such as Google, to score and rank a piece of content’s relevance to any given search query. It checks the occurrence of a keyword in a document and allocates importance to that keyword based on the number of times it appears in the document. It also checks how relevant that keyword is across the whole collection of documents.</p>
<p>Mathematically speaking, in a context of term and document, TF is defined as the number of times a term appears in a document. Term and Document are independent variables and TF is dependent on these. Let us denote TF as a function of term (t) and document (d) : TF(t,d).</p>
<p>Moreover, in the context of a term and all the documents in the corpus, DF is defined as the number of documents that contain the term. Term and document corpus are independent variables, and DF depends on them. Let us denote DF as a function of term (t) and document corpus (D): DF(t,D).</p>
<p>NB: When the requirement is to calculate the importance of a term to a document in the corpus, TF denotes how important the term is to that document, but it does not address the context of the corpus. DF addresses how important a term is in the context of the whole document corpus. If a term appears across all documents, TF overemphasizes it. So the inverse of DF (IDF) can be used to project the actual importance of a term, by calculating the product of TF and IDF.</p>
<p>  Inverse of DF (IDF) formula:
  <img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1634971653058/leMhmN8Dd.png" alt="idf.png" />
  with the base n of the logarithm greater than 1.</p>
<p>Product of TF and IDF formula:
<img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1634972000847/dFA3tC5vP.png" alt="tfidf.png" /></p>
<p>Considering the following text corpus containing three documents below.</p>
<blockquote>
<p>document1 : Welcome to maxtekIoT. There are many tutorials covering various fields of technology. </p>
<p>document2 : Technology has advanced a lot with the invention of semi-conductor transistor. Technology is changing our way of living.</p>
<p>document3 : You may find this tutorial on transistor technology interesting.</p>
</blockquote>
<p>TF-IDF(technology, document2, corpus)</p>
<p>TF(technology, document2) = 2</p>
<p>IDF(technology, corpus) = log((3+1)/(3+1)) = 0, since the term appears in all 3 documents</p>
<p>TF-IDF(technology, document2, corpus) = TF(technology, document2) × IDF(technology, corpus) = 2 × 0 = 0</p>
<p>Conclusion: Though the term ‘technology’ appears twice in document2, it occurs in all the documents, so it carries no discriminating importance for document2 within the corpus.</p>
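<p>To make the arithmetic concrete, here is a minimal sketch that computes TF, IDF, and their product on the three-document corpus above, using the smoothed formula log((N + 1) / (DF + 1)); the tf and idf helpers are illustrative names:</p>
<pre><code>import math

corpus = [
    "Welcome to maxtekIoT. There are many tutorials covering various fields of technology.",
    "Technology has advanced a lot with the invention of semi-conductor transistor. Technology is changing our way of living.",
    "You may find this tutorial on transistor technology interesting.",
]

def tf(term, document):
    # number of times the term appears in the document
    return sum(word.strip(".,").lower() == term for word in document.split())

def idf(term, docs):
    # smoothed inverse document frequency: log((N + 1) / (DF + 1))
    df = sum(tf(term, d) &gt; 0 for d in docs)
    return math.log((len(docs) + 1) / (df + 1))

term = "technology"
print(tf(term, corpus[1]))                      # 2
print(idf(term, corpus))                        # log(4/4) = 0.0
print(tf(term, corpus[1]) * idf(term, corpus))  # 0.0
</code></pre>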
<p><strong>4. What is multiclass classification</strong></p>
<p> In machine learning, multiclass classification is a task in which each sample in the dataset belongs to one of three or more classes. The aim of the algorithm is to construct a function that, given a new unseen data point, precisely predicts the class to which that point belongs.</p>
<p> <strong>5. Importance of multiclass classification</strong></p>
<p> Multiclass classification enables a business analyst to predict which product a customer will purchase next from several options allowing the business to estimate expected revenue and adjust business practices and resources accordingly.</p>
<p><strong> 6. Installing  libraries</strong></p>
<p>From the following section to the end, all the code should be written in a Jupyter notebook, Spyder, or Google Colab. The code below was written in JupyterLab running on Debian 10.</p>
<p>If you are using JupyterLab, various packages will need to be installed for the following code to execute.
To avoid this headache, newcomers are encouraged to use Google Colab, where most of the libraries and packages come preinstalled.</p>
<p>If you run into package or library issues, just type</p>
<pre><code><span class="hljs-addition">!pip3 install name_of_the missing_package_pointing_to</span>
</code></pre><p><strong>7.  Objective of the project</strong></p>
<p>The goal of the project is to classify consumers’ complaints about financial products and services so that companies can respond. Since there are multiple categories, this becomes a multiclass classification problem that can be solved with many machine learning algorithms. Once the algorithm is in place,
whenever a new complaint arrives we can easily categorize it and redirect it to the person concerned. This will save a lot of time because we minimize the human intervention needed to decide where each complaint should go.</p>
<ul>
<li>Step 1 Importing the libraries</li>
</ul>
<pre><code><span class="hljs-keyword">from</span> sklearn <span class="hljs-keyword">import</span> model_selection, preprocessing, metrics
<span class="hljs-keyword">from</span> sklearn.feature_extraction.text <span class="hljs-keyword">import</span> TfidfVectorizer
<span class="hljs-keyword">from</span> sklearn.model_selection <span class="hljs-keyword">import</span> StratifiedShuffleSplit
<span class="hljs-keyword">from</span> sklearn.metrics <span class="hljs-keyword">import</span> confusion_matrix
<span class="hljs-keyword">from</span> nltk.corpus <span class="hljs-keyword">import</span> stopwords
<span class="hljs-keyword">from</span> collections <span class="hljs-keyword">import</span> Counter
<span class="hljs-keyword">import</span> matplotlib.pyplot <span class="hljs-keyword">as</span> plt
<span class="hljs-keyword">from</span> sklearn.svm <span class="hljs-keyword">import</span> SVC
<span class="hljs-keyword">from</span> textblob <span class="hljs-keyword">import</span> Word
<span class="hljs-keyword">import</span> seaborn <span class="hljs-keyword">as</span> sns
<span class="hljs-keyword">import</span> pandas <span class="hljs-keyword">as</span> pd
<span class="hljs-keyword">import</span> numpy <span class="hljs-keyword">as</span> np
</code></pre><ul>
<li>Step 2 Loading the dataset with the read_csv() method from the pandas library</li>
</ul>
<pre><code><span class="hljs-attr">df</span> = pd.read_csv(<span class="hljs-string">"consumer_complaints.csv"</span>)
</code></pre><ul>
<li>Step 3 Exploratory data analysis
Displaying the first five rows of the dataset</li>
</ul>
<pre><code><span class="hljs-selector-tag">df</span><span class="hljs-selector-class">.head</span>()
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1634985248268/gw5Ck4cDo.png" alt="head.png" /></p>
<p>Displaying information about the data type and name of all the variables in the feature sets</p>
<pre><code>df.<span class="hljs-keyword">info</span>()
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1634985906897/xuKTGG_bM.png" alt="info.png" /></p>
<p>Adding category_id to the dataframe (category_id shows the class each complaint belongs to)</p>
<pre><code>df[<span class="hljs-string">'category_id'</span>] = df[<span class="hljs-string">'product'</span>].factorize()[<span class="hljs-number">0</span>]
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1634986266035/caD4oi4bR.png" alt="head1.png" /></p>
<p>We are only interested in product and consumer_complaint_narrative; all the rest of the feature sets are additional information which does not affect our analysis, therefore they can be ignored.</p>
<pre><code><span class="hljs-attr">df</span> = df[[<span class="hljs-string">'product'</span>, <span class="hljs-string">'consumer_complaint_narrative'</span>]]
</code></pre><p>Checking for NaN values in the dataframe</p>
<pre><code>df.<span class="hljs-keyword">isnull</span>().sum()
</code></pre><p>Keeping only non-null values in the dataframe</p>
<pre><code><span class="hljs-attr">df</span> = df[pd.notnull(df[<span class="hljs-string">'consumer_complaint_narrative'</span>])]
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1634987061547/r-uH9I7k6.png" alt="sum.png" /></p>
<p>Grouping the product based on consumer_complaint_narrative and displaying the distribution </p>
<pre><code><span class="hljs-selector-tag">df</span><span class="hljs-selector-class">.groupby</span>(<span class="hljs-string">'product'</span>)<span class="hljs-selector-class">.consumer_complaint_narrative</span><span class="hljs-selector-class">.count</span>()
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1634987605250/K5PoTMgUi.png" alt="group.png" /></p>
<pre><code>fig = plt.figure(figsize=(<span class="hljs-number">16</span>,<span class="hljs-number">12</span>))
df.groupby(<span class="hljs-string">'product'</span>).consumer_complaint_narrative.count()
df.groupby(<span class="hljs-string">'product'</span>).consumer_complaint_narrative.count().plot.bar(ylim=<span class="hljs-number">0</span>)
plt.<span class="hljs-keyword">show</span>()
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1634988185684/Y_V1uuJ_D.png" alt="dist.png" /></p>
<ul>
<li>Step 4 Feature engineering using TF-IDF</li>
</ul>
<p>Splitting the features into independent and dependent variables.</p>
<pre><code><span class="hljs-attr">X</span>=df[<span class="hljs-string">'consumer_complaint_narrative'</span>]
<span class="hljs-attr">y</span>=df[<span class="hljs-string">'product'</span>]
</code></pre><p>We did not use train_test_split to split the features and target into train and test sets because we noticed that, with train_test_split, the classes were not proportionally distributed between y_train and y_test, i.e. some classes appeared in y_test only and not in y_train. To fix that issue, we switched from train_test_split to StratifiedShuffleSplit.</p>
<pre><code>n_splits = 1 <span class="hljs-comment"># We only want a single split in this case</span>
sss = StratifiedShuffleSplit(n_splits=n_splits, test_size=0.25, random_state=0)
for train_index, test_index in sss.split(X, y):
    X_train,X_test =X.iloc[train_index],X.iloc[test_index]
    y_train,y_test =y.iloc[train_index],y.iloc[test_index]
</code></pre><p>Checking the distribution of y_train and y_test</p>
<pre><code><span class="hljs-selector-tag">Counter</span>(y_train)
</code></pre><pre><code><span class="hljs-selector-tag">Counter</span>(y_test)
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1634989450117/Hpphuo0Q7.png" alt="check.png" /></p>
<p>Encoding target variable. We fit_trainform() y_train and transform y_test</p>
<pre><code><span class="hljs-attr">encoder</span> = preprocessing.LabelEncoder()
<span class="hljs-attr">y_train_encoded</span> = encoder.fit_transform(y_train)
<span class="hljs-attr">y_test_encoded</span> = encoder.transform(y_test)
</code></pre><p>We initialize the TfidfVectorizer we previously imported, then pass our data to it. The vectorizer transforms the text into a TF-IDF matrix, the mathematical representation of the corpus. This is done with the fit_transform and transform methods.</p>
<pre><code>vectorizer = TfidfVectorizer(analyzer='word', token_pattern=r'\w{1,}', max_features=5000)
tfidf_matrix = vectorizer.fit_transform(X)   # note: fitting on the full X leaks test vocabulary; fitting on X_train only is stricter
tfidf_train = vectorizer.transform(X_train)
tfidf_test = vectorizer.transform(X_test)    # used by model.predict below
</code></pre><p>If we want to observe the mathematical representation of our text, i.e. the TF-IDF representation, we have to convert the sparse matrix to a dense matrix. This is done by using the toarray() method</p>
<pre><code>feature_array = tfidf_matrix.toarray()
feature_array
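
# Each column of feature_array corresponds to one vocabulary term; to inspect the mapping:
terms = vectorizer.get_feature_names_out()  # on older scikit-learn versions: vectorizer.get_feature_names()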
</code></pre><ul>
<li>Step 5 Model building and evaluation</li>
</ul>
<pre><code>model = SVC(C=0.1, kernel='linear', gamma=1)  # gamma is ignored by the linear kernel
model.fit(tfidf_train, y_train_encoded )
y_prediction=model.predict(tfidf_test)
accuracy = metrics.accuracy_score(y_prediction,y_test_encoded)
print (<span class="hljs-string">"Accuracy: "</span>, accuracy)
</code></pre><p>Accuracy:  0.8164890432283559</p>
<p>Classification report</p>
<pre><code>print(metrics.classification_report(y_test_encoded, y_prediction, target_names=encoder.classes_))  # encoder.classes_ lists the labels in encoded order
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1634996870701/0kAq9fcTx.png" alt="report.png" /></p>
<p>Confusion matrix</p>
<pre><code>conf_mat = confusion_matrix(y_test_encoded, y_prediction)
category_id_df = df[[<span class="hljs-string">'product'</span>, <span class="hljs-string">'category_id'</span>]].drop_duplicates().sort_values(<span class="hljs-string">'category_id'</span>)
category_to_id = dict(category_id_df.<span class="hljs-keyword">values</span>)
id_to_category = dict(category_id_df[[<span class="hljs-string">'category_id'</span>,<span class="hljs-string">'product'</span>]].<span class="hljs-keyword">values</span>)
fig, ax = plt.subplots(figsize=(<span class="hljs-number">12</span>,<span class="hljs-number">12</span>))
sns.heatmap(conf_mat, annot=<span class="hljs-keyword">True</span>, fmt=<span class="hljs-string">'d'</span>, cmap="BuPu",xticklabels=category_id_df[[<span class="hljs-string">'product'</span>]].<span class="hljs-keyword">values</span>,yticklabels=category_id_df[[<span class="hljs-string">'product'</span>]].<span class="hljs-keyword">values</span>)
plt.ylabel(<span class="hljs-string">'Actual'</span>)
plt.xlabel(<span class="hljs-string">'Predicted'</span>)
plt.<span class="hljs-keyword">show</span>()
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1634996667556/0H8IkAA5V.png" alt="conf.png" /></p>
<p>The accuracy of 82% is good for a baseline model.
Precision and recall look pretty good across the categories except for “Payday loan.” If you look through Payday loan, most of the wrong predictions are Debt collection and Credit card, which might be because of the smaller number of samples in that category; it looks like the model treats it as a subcategory of credit card. We could merge these samples into another group to make the model more stable, as sketched below. </p>
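<p>A minimal sketch of that regrouping, applied to the dataframe before splitting; the target class here is an illustrative choice, not a conclusion of the analysis:</p>
<pre><code># fold the under-represented 'Payday loan' class into a larger, related class
df['product'] = df['product'].replace({'Payday loan': 'Consumer Loan'})
</code></pre>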
<ul>
<li>Step 6 Predicting an unseen consumer complaint</li>
</ul>
<pre><code>corpus= [<span class="hljs-string">"This company refuses to pay my interest to my bank account"</span>]
corpus_features = vectorizer.transform(corpus)
predictions = model.predict(corpus_features)
print(corpus)
print(<span class="hljs-string">"  - Predicted as: '{}'"</span>.format(id_to_category[predictions[0]]))
</code></pre><p>['This company refuses to pay my interest to my bank account']</p>
<p> Predicted as: 'Debt collection'</p>
<pre><code>corpus = [<span class="hljs-string">"This company refuses to provide me verification and validation of debt per my right under the FDCPA. I do not believe this debt is mine."</span>]
corpus_features = vectorizer.transform(corpus)
predictions = model.predict(corpus_features)
print(corpus)
print(<span class="hljs-string">"  \n Predicted as: '{}'"</span>.format(id_to_category[predictions[0]]))
</code></pre><p>['This company refuses to provide me verification and validation of debt per my right under the FDCPA. I do not believe this debt is mine.']</p>
<p> Predicted as: 'Credit reporting'</p>
<ul>
<li>Conclusion:
We achieved our objective, classifying consumer complaints using TF-IDF and the Support Vector Machine algorithm with 82% accuracy. This model can serve as a baseline. To increase the accuracy, we can reiterate the process with different algorithms like Random Forest, GBM, and Naive Bayes, or with deep learning techniques like RNN and LSTM. Besides, other techniques such as hyper-parameter tuning can be used to improve the model accuracy, which are out of the scope of this tutorial. </li>
</ul>
<p>Please let me know if you find any errors.
I can be contacted via LinkedIn <a target="_blank" href="https://www.linkedin.com/in/ephrem-maximilien-kpizingui-48222775/">here</a> </p>
<blockquote>
<p>My regards,</p>
<p>Maximilien.</p>
</blockquote>
]]></content:encoded></item></channel></rss>