<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title><![CDATA[Maximilien Kpizingui Blog]]></title><description><![CDATA[Generative AI & IoT Engineer. Founder of DocQuest. Exploring the intersection of AI and embedded systems]]></description><link>https://maximilien.docquest.io</link><image><url>https://cdn.hashnode.com/uploads/logos/60f1884fcbcd625a5d31fc5b/beda299f-6da4-435d-9abd-1289ac651752.png</url><title>Maximilien Kpizingui Blog</title><link>https://maximilien.docquest.io</link></image><generator>RSS for Node</generator><lastBuildDate>Thu, 16 Apr 2026 22:56:14 GMT</lastBuildDate><atom:link href="https://maximilien.docquest.io/rss.xml" rel="self" type="application/rss+xml"/><language><![CDATA[en]]></language><ttl>60</ttl><item><title><![CDATA[From Document Overload to Instant Insights: Meet DocQuest]]></title><description><![CDATA[If you're like me, your browser probably has 15 tabs open right now.
There's a research PDF you need to finish. A lecture recording you haven't listened to. A meeting transcript from yesterday. And a ]]></description><link>https://maximilien.docquest.io/from-document-overload-to-instant-insights-meet-docquest</link><guid isPermaLink="true">https://maximilien.docquest.io/from-document-overload-to-instant-insights-meet-docquest</guid><category><![CDATA[automation]]></category><category><![CDATA[Build In Public]]></category><category><![CDATA[Productivity]]></category><category><![CDATA[innovation]]></category><category><![CDATA[#ai-tools]]></category><category><![CDATA[SaaS]]></category><category><![CDATA[edutech]]></category><category><![CDATA[learning]]></category><category><![CDATA[university]]></category><category><![CDATA[student]]></category><category><![CDATA[professional]]></category><category><![CDATA[startup]]></category><category><![CDATA[indiehackers]]></category><category><![CDATA[llm]]></category><category><![CDATA[ai agents]]></category><category><![CDATA[education]]></category><category><![CDATA[Applications]]></category><category><![CDATA[Developer]]></category><dc:creator><![CDATA[Maximilien]]></dc:creator><pubDate>Sat, 21 Mar 2026 00:30:15 GMT</pubDate><enclosure url="https://cdn.hashnode.com/uploads/covers/60f1884fcbcd625a5d31fc5b/aad593d5-b9c7-43e9-b2a6-bc040f5b98bd.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>If you're like me, your browser probably has 15 tabs open right now.</p>
<p>There's a research PDF you need to finish. A lecture recording you haven't listened to. A meeting transcript from yesterday. And a dozen other files scattered across Google Drive, Dropbox, and your local downloads folder.</p>
<p>We live in an age of abundant information but scarce understanding.</p>
<p>We consume more content than ever, yet retain less. We switch contexts constantly. We spend more time organizing information than acting on it.</p>
<p>I'm Maximilien Kpizingui, a Generative AI Engineer, and I got tired of drowning in documents. So I built <strong>DocQuest</strong>.</p>
<p>Today, I want to share what it is, why I built it, and how it can help you work smarter—not harder.</p>
<hr />
<h2>🎧 What is DocQuest?</h2>
<p><strong>DocQuest</strong> is the unified AI platform that transforms PDFs, audio, video, and documents into intelligent, actionable podcasts and answers your questions about them instantly.</p>
<p>Think of it as a bridge between your static files and your busy life. Instead of forcing yourself to read every word or watch every minute, DocQuest converts your content into an engaging audio format you can listen to anywhere—on your commute, at the gym, or while cooking.</p>
<p>But it's not just text-to-speech. It's <strong>intelligent understanding</strong>.</p>
<img src="https://cdn.hashnode.com/uploads/covers/60f1884fcbcd625a5d31fc5b/6cee2b9d-1e05-4c2e-87ec-4b509162b5ef.png" alt="The podcast player interface showing waveform and chapters" style="display:block;margin:0 auto" />

<hr />
<h2>❓ Ask Questions. Get Answers. From Any Format.</h2>
<p>This is the feature that changes everything.</p>
<p><strong>Upload a DOCX, PNG, PDF, or TXT file—and just ask.</strong></p>
<blockquote>
<p>"What were the key findings in this research paper?"<br />"Summarize the action items from this meeting transcript."<br />"What does clause 4.2 say about termination?"<br />"Extract all dates and names from this scanned document."</p>
</blockquote>
<p>DocQuest doesn't just read your files. It <strong>understands</strong> them.</p>
<img src="https://cdn.hashnode.com/uploads/covers/60f1884fcbcd625a5d31fc5b/11da27a3-8e5a-4860-a663-217a9b0cf596.png" alt="Q&amp;A interface showing a question being asked about a PDF with the answer highlighted" style="display:block;margin:0 auto" />

<h3>How It Works Across Formats:</h3>
<table>
<thead>
<tr>
<th>Format</th>
<th>What DocQuest Extracts</th>
<th>Example Question</th>
</tr>
</thead>
<tbody><tr>
<td><strong>📄 PDF</strong></td>
<td>Text, tables, figures, metadata</td>
<td>"What's the methodology used in this study?"</td>
</tr>
<tr>
<td><strong>📝 DOCX</strong></td>
<td>Structured text, headings, comments</td>
<td>"List all the deliverables mentioned in this proposal."</td>
</tr>
<tr>
<td><strong>🖼️ PNG/JPG</strong></td>
<td>OCR text extraction from images/scans</td>
<td>"What's the total amount on this invoice screenshot?"</td>
</tr>
<tr>
<td><strong>📄 TXT</strong></td>
<td>Raw text analysis and summarization</td>
<td>"What are the three main arguments in this essay?"</td>
</tr>
<tr>
<td><strong>🔗 URLs</strong></td>
<td>Web content scraping + analysis</td>
<td>"What are the key updates in this blog post?"</td>
</tr>
</tbody></table>
<h3>The Magic: Cross-Document Q&amp;A</h3>
<p>Here's where it gets powerful.</p>
<p><strong>Upload up to 10 files at once</strong>—a PDF research paper, a DOCX project brief, a PNG of a whiteboard sketch, and a TXT of meeting notes.</p>
<p>Then ask:</p>
<blockquote>
<p><em>"What are the common themes across all these documents?"</em></p>
</blockquote>
<p>DocQuest analyzes them <strong>together</strong>, finds connections, and gives you a unified answer.</p>
<p>No more tab-switching. No more manual synthesis. Just ask. Get answers.</p>
<hr />
<h2>🛠️ How It Works (The Tech Behind the Magic)</h2>
<p>I built DocQuest to solve the fragmentation problem. Most tools handle <em>one</em> format. ChatPDF handles PDFs. Otter handles audio. Descript handles video.</p>
<p>DocQuest handles <strong>all of them together</strong>.</p>
<h3>1. Unified Ingestion</h3>
<p>You can upload <strong>PDFs, DOCX, PNG, TXT, Audio, Video, or even URLs</strong>. You can also ingest files directly from Google Drive.</p>
<img src="https://cdn.hashnode.com/uploads/covers/60f1884fcbcd625a5d31fc5b/1e5cc3a4-2c76-4e45-a805-036bdf39565e.png" alt="The upload dashboard showing multiple file types" style="display:block;margin:0 auto" />

<h3>2. Intelligent Parsing &amp; Chunking</h3>
<ul>
<li><p><strong>PDF/DOCX/TXT</strong>: Extract text + structure (headings, tables, lists)</p>
</li>
<li><p><strong>PNG/JPG</strong>: OCR pipeline to extract text from images/scans</p>
</li>
<li><p><strong>Audio/Video</strong>: Transcribe with speaker diarization + timestamping</p>
</li>
<li><p><strong>Proprietary Chunking</strong>: For long content, we split it into 10-minute segments with 5-second overlaps, so context isn't lost at segment boundaries (see the sketch after this list)</p>
</li>
</ul>
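<p>To make the overlap idea concrete, here is a minimal sketch. The function name and parameters are illustrative, not DocQuest's actual implementation:</p>
<pre><code class="lang-python">def chunk_windows(duration_s, chunk_s=600, overlap_s=5):
    """Yield (start, end) windows: 10-minute chunks with a 5-second overlap."""
    start = 0.0
    while start &lt; duration_s:
        end = min(start + chunk_s, duration_s)
        yield (start, end)
        if end == duration_s:
            break
        start = end - overlap_s  # boundary content appears in both chunks

print(list(chunk_windows(1500)))  # a 25-minute recording yields 3 overlapping windows
</code></pre>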
<h3>3. Cross-Format Semantic Embeddings</h3>
<p>This is the technical moat.</p>
<p>We don't just store text. We map:</p>
<ul>
<li><p>PDF text → vector embeddings</p>
</li>
<li><p>Audio transcripts → vector embeddings</p>
</li>
<li><p>Video captions → vector embeddings</p>
</li>
<li><p>Image OCR text → vector embeddings</p>
</li>
</ul>
<p><strong>All into one unified semantic space.</strong></p>
<p>Result? You can ask a question, and DocQuest finds answers across <em>all</em> your files, regardless of format.</p>
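<p>Conceptually, the pipeline looks like the following sketch. The embedding model and the sample chunks are assumptions for illustration, not DocQuest internals:</p>
<pre><code class="lang-python">from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

# Text extracted from different formats all lands in one list of chunks
chunks = [
    {"source": "paper.pdf",   "text": "The study reports improved accuracy."},
    {"source": "meeting.mp3", "text": "Action item: ship the beta on Friday."},   # transcript
    {"source": "board.png",   "text": "Q3 roadmap: tutor, agents, integrations."},  # OCR output
]
vectors = model.encode([c["text"] for c in chunks], normalize_embeddings=True)

query = model.encode(["What are the action items?"], normalize_embeddings=True)[0]
scores = vectors @ query                       # cosine similarity in the shared space
print(chunks[int(scores.argmax())]["source"])  # meeting.mp3
</code></pre>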
<h3>4. Intelligent Output</h3>
<p>Once processed, you get:</p>
<ul>
<li><p><strong>Chaptered Podcasts</strong>: Navigate key insights easily</p>
</li>
<li><p><strong>Cross-Referenced Answers</strong>: Ask once, get insights from 10 files</p>
</li>
<li><p><strong>Actionable Drafts</strong>: Ready-to-use notes, briefs, and summaries</p>
</li>
</ul>
<hr />
<h2>🎓 Beyond Consumption: The AI Tutor</h2>
<p>Here's the feature I'm most proud of.</p>
<p>Most AI tools let you <strong>passively consume</strong> information. You read a summary, you nod, you forget.</p>
<p>DocQuest includes an <strong>Adaptive AI Tutor</strong>.</p>
<p>After you listen to your generated podcast or read an answer, the Tutor quizzes you on key concepts. It explains complex ideas simply and personalizes the learning path to your style.</p>
<p><strong>Why?</strong> Because learning science tells us that <strong>active recall</strong> beats passive reading every time.</p>
<img src="https://cdn.hashnode.com/uploads/covers/60f1884fcbcd625a5d31fc5b/b07b159d-c725-4747-b532-3019a305b01f.png" alt="The AI Tutor chat interface showing a quiz question" style="display:block;margin:0 auto" />

<hr />
<h2>🤖 Automation with AI Agents</h2>
<p>For the professionals out there: time is your most valuable asset.</p>
<p>DocQuest isn't just a reader; it's a worker. We've deployed <strong>6 specialized AI Agents (with 100+ sub-agents)</strong> that work 24/7 to:</p>
<ul>
<li><p>Extract specific data points from contracts (PDF/DOCX)</p>
</li>
<li><p>Generate show notes from meeting recordings (Audio/Video)</p>
</li>
<li><p>Draft emails based on research findings (TXT/PDF)</p>
</li>
<li><p>Summarize scanned documents (PNG) into actionable briefs</p>
</li>
</ul>
<p>You set the task once. The agents handle the rest.</p>
<img src="https://cdn.hashnode.com/uploads/covers/60f1884fcbcd625a5d31fc5b/6b4b022f-aee3-4390-a625-c646a1fcd7fe.png" alt="" style="display:block;margin:0 auto" />

<hr />
<h2>🎯 Who Is This For?</h2>
<p>I built DocQuest with three main personas in mind:</p>
<h3>1. Students &amp; Researchers</h3>
<ul>
<li><p><strong>Problem:</strong> Hundreds of papers (PDF/DOCX) + lecture recordings to synthesize</p>
</li>
<li><p><strong>DocQuest Solution:</strong> Upload all files → ask "What are the key debates in this field?" → get a unified answer + podcast + quiz</p>
</li>
</ul>
<h3>2. Knowledge Workers &amp; Consultants</h3>
<ul>
<li><p><strong>Problem:</strong> Client reports (PDF), meeting notes (TXT), whiteboard photos (PNG) scattered everywhere</p>
</li>
<li><p><strong>DocQuest Solution:</strong> Upload all 10 → ask "What are the top 3 risks?" → get cross-referenced insights + automated brief</p>
</li>
</ul>
<h3>3. HR &amp; L&amp;D Teams</h3>
<ul>
<li><p><strong>Problem:</strong> Onboarding docs (PDF/DOCX), training videos, policy scans (PNG) go unread</p>
</li>
<li><p><strong>DocQuest Solution:</strong> Convert to audio modules + quiz employees on retention → ensure compliance</p>
</li>
</ul>
<hr />
<h2>🛡️ Why DocQuest? (The Differentiators)</h2>
<p>You might be asking: <em>"Isn't this like NotebookLM?"</em></p>
<p>Great question. NotebookLM is fantastic for research notes. But DocQuest is built for <strong>action, learning, and cross-format understanding</strong>.</p>
<table>
<thead>
<tr>
<th>Feature</th>
<th>NotebookLM</th>
<th>DocQuest</th>
</tr>
</thead>
<tbody><tr>
<td><strong>Input Formats</strong></td>
<td>Text-heavy (PDF, Docs)</td>
<td><strong>PDF + DOCX + PNG + TXT + Audio + Video + URLs</strong></td>
</tr>
<tr>
<td><strong>Q&amp;A Scope</strong></td>
<td>Single notebook</td>
<td><strong>Cross-document + cross-format questions</strong></td>
</tr>
<tr>
<td><strong>Output</strong></td>
<td>Summary + Audio Overview</td>
<td><strong>Intelligent Podcast + AI Tutor + Automated Drafts</strong></td>
</tr>
<tr>
<td><strong>Learning</strong></td>
<td>Passive</td>
<td><strong>Active (Quizzes + Retention Tracking)</strong></td>
</tr>
<tr>
<td><strong>Automation</strong></td>
<td>Chat-based</td>
<td><strong>24/7 Autonomous AI Agents</strong></td>
</tr>
<tr>
<td><strong>Privacy</strong></td>
<td>Google Ecosystem</td>
<td><strong>Zero-Training Policy: Your data stays yours</strong></td>
</tr>
</tbody></table>
<p><strong>Privacy is non-negotiable.</strong> Your documents never train our models. Your insights stay yours.</p>
<hr />
<h2>🚀 Try It Free (No Credit Card)</h2>
<p>I launched DocQuest on Product Hunt last week, and the response has been humbling. We saw a <strong>13% free-to-paid conversion rate</strong> in the first week (2-3x the SaaS benchmark) with <strong>zero churn</strong>.</p>
<p>I want you to experience it yourself.</p>
<ul>
<li><p><strong>Free Tier:</strong> Always available</p>
</li>
<li><p><strong>Launch Bonus:</strong> <strong>10,000 free tokens</strong> to test premium features</p>
</li>
<li><p><strong>No Credit Card Required:</strong> Just sign up and start asking questions</p>
</li>
</ul>
<p><strong>👉 Try DocQuest Free:</strong> <a href="https://app.docquest.io">https://app.docquest.io</a></p>
<hr />
<p>It's a solo founder journey, and every line of code is written with the goal of reducing your cognitive load.</p>
<hr />
<h2>👋 Let's Build Together</h2>
<p>DocQuest is still early. We have amazing early users, 100+ inbound emails, and a roadmap full of features (music analysis is coming soon!).</p>
<p>I'd love your honest feedback. Break it. Test it. Tell me what sucks. Tell me what shines.</p>
<p><strong>👉 Try DocQuest Free:</strong> <a href="https://app.docquest.io">https://app.docquest.io</a><br /><strong>🌐 Website:</strong> <a href="https://www.docquest.io">www.docquest.io</a></p>
<p><strong>Stop reading. Start understanding. Start asking.</strong></p>
<hr />
<p><em>Did you find this useful? What format do you wish DocQuest supported next? Let me know in the comments below! 👇</em></p>
<p><strong>Tags:</strong> #AI #Productivity #BuildInPublic #EdTech #SaaS #React #MachineLearning #Startup #DocumentAnalysis #QandA</p>
]]></content:encoded></item><item><title><![CDATA[Revolutionizing Healthcare Conversations: Building a Medical Chatbot Using LlamaIndex and DeepLake On Custom Dataset]]></title><description><![CDATA[Introduction
In today's rapidly evolving society, telemedicine is redefining the contours of patient care. As healthcare providers migrate to virtual consultations, patients seek clarity and relevance in their digital interactions. However, building ...]]></description><link>https://maximilien.docquest.io/revolutionizing-healthcare-conversations-building-a-medical-chatbot-using-llamaindex-and-deeplake-on-custom-dataset</link><guid isPermaLink="true">https://maximilien.docquest.io/revolutionizing-healthcare-conversations-building-a-medical-chatbot-using-llamaindex-and-deeplake-on-custom-dataset</guid><category><![CDATA[LlamaIndex]]></category><category><![CDATA[chatbot]]></category><category><![CDATA[DeepLake]]></category><dc:creator><![CDATA[Maximilien]]></dc:creator><pubDate>Tue, 24 Oct 2023 17:12:25 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1698145936150/f7d0326c-50d3-40f3-b408-0384fc3a25d3.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><strong><mark>Introduction</mark></strong></p>
<p>In today's rapidly evolving society, telemedicine is redefining the contours of patient care. As healthcare providers migrate to virtual consultations, patients seek clarity and relevance in their digital interactions. However, building a chatbot that transcends the typical to offer truly intuitive telemedical interactions is not an easy feat for developers. In this blog, we leverage the transformative potential of LlamaIndex and DeepLake to build a chatbot with unparalleled precision and responsiveness.</p>
<hr />
<h3 id="heading-contents"><strong><mark>Contents</mark></strong></h3>
<ul>
<li><p>LlamaIndex</p>
</li>
<li><p>Deep Lake</p>
</li>
<li><p>Power and Limitations of LLMs</p>
</li>
<li><p>Application Integration: Data Indexing</p>
</li>
<li><p>Application Integration: Query Stage</p>
</li>
<li><p>Code Implementation</p>
</li>
</ul>
<hr />
<h3 id="heading-llamaindex"><mark>LlamaIndex</mark></h3>
<p>LlamaIndex is a simple, flexible data framework for connecting custom data sources to large language models. It offers a suite of essential tools designed to streamline the process of leveraging private or domain-specific data within LLM-powered applications:</p>
<ul>
<li><p>Data Connectors: These versatile components are responsible for ingesting data from their native sources and formats. LlamaIndex supports a wide range of data sources, including APIs, PDF documents, SQL databases, and more.</p>
</li>
<li><p>Data Indexes: LlamaIndex employs data indexing to structure information into intermediate representations that are not only easy for LLMs to understand but also highly performant. These structured data representations serve as a bridge between raw data and natural language understanding.</p>
</li>
<li><p>Engines: LlamaIndex features different types of engines that provide natural language access to your structured data:</p>
<ul>
<li><p>Query Engines: These engines serve as robust retrieval interfaces, enabling knowledge-augmented output. They are ideal for information retrieval tasks and quick access to relevant data.</p>
</li>
<li><p>Chat Engines: For applications requiring interactive and conversational experiences, LlamaIndex provides chat engines that support multi-message, "back and forth" interactions with the data, making it suitable for dynamic conversational interfaces (see the sketch after this list).</p>
</li>
</ul>
</li>
<li><p>Data Agents: LlamaIndex empowers knowledge workers by integrating Large Language Models with various tools, ranging from simple helper functions to API integrations and more. These agents can assist with data-related tasks and augment human decision-making.</p>
</li>
</ul>
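<p>As a quick illustration of the two engine types, here is a hedged sketch (assuming an <code>index</code> like the one built in the implementation below):</p>
<pre><code class="lang-python">query_engine = index.as_query_engine()  # one-shot, retrieval-augmented Q&amp;A
chat_engine = index.as_chat_engine()    # multi-turn, keeps conversation state

print(query_engine.query("What are the symptoms of malaria?"))
print(chat_engine.chat("And what precautions should I take?"))
</code></pre>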
<p>Application Integrations: LlamaIndex integrates seamlessly with the broader ecosystem of applications through two key stages:</p>
<ul>
<li><p>Data indexing Stage</p>
</li>
<li><p>Query stage</p>
</li>
</ul>
<h3 id="heading-deep-lake"><mark>Deep Lake</mark></h3>
<p>Deep Lake is a database optimized for deep learning and AI applications, powered by a specialized storage format. Deep Lake can be used for:</p>
<ul>
<li><p>Storing data and vectors while building applications</p>
</li>
<li><p>Managing datasets while training deep learning models</p>
</li>
</ul>
<p>Deep Lake simplifies the deployment of enterprise-grade LLM-based products by offering storage for all data types (embeddings, audio, text, videos, images, pdfs, annotations, etc.), querying and vector search, data streaming while training models at scale, data versioning and lineage, and integrations with popular tools such as LangChain, LlamaIndex, Weights &amp; Biases, and many more.</p>
<p>Deep Lake works with data of any size, is serverless, and enables you to store all of your data in your own cloud. Deep Lake is used by Intel, Airbus, Matterport, ZERO Systems, Red Cross, Yale, &amp; Oxford. [Read more about <a target="_blank" href="https://docs.activeloop.ai/">Deep Lake</a>.]</p>
<h3 id="heading-power-and-limitations-of-large-language-models"><mark>Power and Limitations of Large Language Models</mark></h3>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1698174526988/e363500b-f108-4ea9-a01f-aeeb1079f540.png" alt class="image--center mx-auto" /></p>
<p>Large Language Models (LLMs) are trained on vast text volumes to learn the word distribution in a language, allowing them to generate meaningful content without direct data memorization. They can recall widespread information like historical events. However, their knowledge is limited to their training data, leading them to potentially "hallucinate" or fabricate details about events or facts after their last training update. This is a concern for applications needing high reliability.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1698184738660/32bc3708-c36d-48ee-aa31-e4e847513cb9.png" alt class="image--center mx-auto" /></p>
<p>A solution to this is using retrievers alongside LLMs. Retrievers fetch accurate information from trusted databases which the LLM uses without adding fictional details.</p>
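<p>In pseudocode, the retriever-augmented flow is simply the following (the <code>retriever</code> and <code>llm</code> objects are hypothetical interfaces, not a specific library API):</p>
<pre><code class="lang-python">def answer(question, retriever, llm):
    passages = retriever.search(question, top_k=3)  # fetch trusted context
    prompt = f"Answer using only this context:\n{passages}\n\nQuestion: {question}"
    return llm.generate(prompt)                     # generation grounded in retrieved facts
</code></pre>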
<h3 id="heading-application-integration-data-indexing-stage"><mark>Application Integration: Data indexing stage</mark></h3>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1698168072117/bd382e5f-4008-4612-9e16-982ad5b47184.jpeg" alt class="image--center mx-auto" /></p>
<p>During this stage, a knowledge base is prepared. This involves organizing and structuring the custom data to make it easily retrievable and accessible. The knowledge base acts as a source of information for the LLM.</p>
<ul>
<li><p>Data Source: This is the external data, which can come in any form (CSV, PDF, Word, Excel, web-based). Depending on the data source, relevant data loaders are used to process the data.</p>
</li>
<li><p>Documents / Nodes: A Document/Node represents a fundamental unit of data in LlamaIndex, containing a chunk of a source document with comprehensive metadata and inter-node relationships for precise retrieval actions.</p>
</li>
<li><p>Data Indexes (VectorStoreIndex): LlamaIndex streamlines data indexing by converting raw documents into intermediary representations, generating vector embeddings, and deducing metadata, with the VectorStoreIndex being a prevalent index format facilitating efficient data retrieval.</p>
</li>
</ul>
<h3 id="heading-application-integration-query-stage"><mark>Application Integration: Query stage</mark></h3>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1698169128533/0fd8eece-4ebf-4374-8bf6-e94dffffd81e.jpeg" alt class="image--center mx-auto" /></p>
<p>In this stage, the system retrieves relevant context from the knowledge base based on a query. This retrieved context is then used to augment the LLM's understanding and generation capabilities. The LLM can utilize this additional information to formulate more informed and accurate responses to user queries.</p>
<h3 id="heading-code-implementation"><mark>Code implementation</mark></h3>
<ul>
<li><p>Installing DeepLake and llama_index</p>
<pre><code class="lang-python">  !pip install deeplake llama_index
</code></pre>
<ul>
<li><p>Importing the required dependencies</p>
<pre><code class="lang-python">  <span class="hljs-keyword">from</span> llama_index <span class="hljs-keyword">import</span> GPTVectorStoreIndex, SimpleDirectoryReader, Document, StorageContext
  <span class="hljs-keyword">from</span> llama_index.vector_stores <span class="hljs-keyword">import</span> DeepLakeVectorStore
  <span class="hljs-keyword">import</span> textwrap
  <span class="hljs-keyword">import</span> getpass
  <span class="hljs-keyword">import</span> os
</code></pre>
</li>
</ul>
</li>
<li><p>Defining the OpenAI API key and Activeloop token using getpass to hide the credentials from public view</p>
</li>
</ul>
<pre><code class="lang-python">os.environ[<span class="hljs-string">"OPENAI_API_KEY"</span>] = getpass.getpass(prompt=<span class="hljs-string">'Enter your OPENAI_API_KEY: '</span>)
os.environ[<span class="hljs-string">"ACTIVELOOP_TOKEN"</span>] = getpass.getpass(prompt=<span class="hljs-string">'Enter your ACTIVELOOP_TOKEN: '</span>)
</code></pre>
<p>When you run this, you'll be prompted to enter the values for each key. The values won't be displayed as you type, which helps maintain security, especially when working in shared or public environments.</p>
<p><mark>Loading the dataset</mark></p>
<p>The datasets used in this code are <em><mark>symptom precautions</mark></em> in <em>XLS format</em> and <em><mark>symptom descriptions</mark></em> in <em>DOC format</em> for various illnesses, as shown below</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1698149083294/e21fb147-c78c-424a-a9e7-664ba0c7d104.png" alt class="image--center mx-auto" /></p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1698149278970/e59add80-130b-46bf-a6e0-01be69538a67.png" alt class="image--center mx-auto" /></p>
<ul>
<li><p>Create a folder in the project directory called data containing the two datasets. Link to download the dataset <a target="_blank" href="https://drive.google.com/drive/folders/1Q5oCv2a3GQa90HnUbxxsLEgSo--09C0I?usp=sharing">here</a></p>
</li>
<li><p><mark>Loading the dataset</mark></p>
</li>
</ul>
<pre><code class="lang-python">path_document = <span class="hljs-string">'./data'</span>  <span class="hljs-comment"># path to the data folder created above</span>
documents = SimpleDirectoryReader(path_document).load_data()
</code></pre>
<ul>
<li><p><mark>Vectorizing and indexing the dataset</mark></p>
<pre><code class="lang-python">  dataset_path = <span class="hljs-string">"hub://activeloop_username/text_embedding"</span>
  vector_store = DeepLakeVectorStore(dataset_path=dataset_path, overwrite=<span class="hljs-literal">True</span>)
  storage_context = StorageContext.from_defaults(vector_store=vector_store)
  index = GPTVectorStoreIndex.from_documents(documents, storage_context=storage_context)
</code></pre>
<p>  This code stores and indexes the documents in a vectorized form. It starts by defining where and how the vectors will be stored (DeepLakeVectorStore), sets up a storage context (StorageContext), and then creates an index (GPTVectorStoreIndex) for the documents using that storage context.</p>
</li>
<li><p>You should see an output like this stating that the dataset has been created successfully in DeepLake</p>
<p>  <img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1698151868688/8c3ca5fd-4a23-4582-814a-60c2262b0d1d.png" alt class="image--center mx-auto" /></p>
</li>
</ul>
<p>Alternatively, if you don't want to store the embeddings in the DeepLake cloud, create the <code>DeepLakeVectorStore</code> without a <code>dataset_path</code> to store them locally</p>
<pre><code class="lang-python">vector_store = DeepLakeVectorStore(overwrite=<span class="hljs-literal">True</span>)
storage_context = StorageContext.from_defaults(vector_store=vector_store)
index = GPTVectorStoreIndex.from_documents(documents, storage_context=storage_context)
</code></pre>
<p>The output of this code creates a folder <em>llama_index</em> containing the tensors</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1698152776522/c3bfc03a-f720-4722-ba9e-0eae58a6d84b.png" alt class="image--center mx-auto" /></p>
<ul>
<li><p><mark>Initializing the query engine</mark></p>
<p>  The query engine is a generic interface that allows you to ask questions about your data.</p>
</li>
</ul>
<pre><code class="lang-python">query_engine = index.as_query_engine()
</code></pre>
<ul>
<li><p><mark>Query the bot about symptoms of illness, precautions to take, cure etc.</mark></p>
<pre><code class="lang-python">  response = query_engine.query(<span class="hljs-string">"What are the symptoms of malaria?"</span>)
</code></pre>
</li>
<li><p><mark>Displaying the output</mark></p>
</li>
</ul>
<pre><code class="lang-python">print(textwrap.fill(str(response), <span class="hljs-number">50</span>))
</code></pre>
<ul>
<li><mark>Output</mark></li>
</ul>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1698155320122/862c3ca8-6a3b-4d94-904f-c8c139d7103f.png" alt class="image--center mx-auto" /></p>
<p>Let's play around it</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1698155480678/58c013c7-29f8-4f3c-bd65-1abbf5c6d4b2.png" alt class="image--center mx-auto" /></p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1698155620740/a675b404-f9bb-49f3-8185-e7899812bead.png" alt class="image--center mx-auto" /></p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1698156266973/ca0d8fc5-5f32-4b18-a17e-7bb0bb1103a3.png" alt class="image--center mx-auto" /></p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1698156494001/a1b14dc5-7327-4220-a4ba-d821a2da040e.png" alt class="image--center mx-auto" /></p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1698165289452/6b70d63e-9f44-4cf8-bb4a-3e1a3485bfb0.png" alt class="image--center mx-auto" /></p>
<p>Bingo, you have made it to the end of this post. You can leverage this technology to build a Q&amp;A system on your company data, so give it a try with a few lines of code. In the upcoming post, we'll dive into building a complex chatbot with memory using Retrieval Augmented Generation (RAG)</p>
<p><strong>Conclusion:</strong></p>
<p>The fusion of LlamaIndex and DeepLake showcases the boundless possibilities of reshaping healthcare communication. Chatbots powered by such sophisticated AI-driven tools not only bridge the gap between patients and providers but also redefine the standards of timely and accurate medical assistance. This synergy of technology and healthcare demonstrates that the future of patient care lies in AI-driven conversations. Whether it's addressing patients' immediate concerns, guiding them through their health journey, or simply offering a comforting digital presence, this new wave of chatbot integration is a testament to how innovative solutions can revolutionize age-old healthcare practices. We strongly believe that, regardless of location or background, people should be able to access reliable health advice at their fingertips.</p>
<hr />
<p>If you want to contribute or you find any errors in this article please do leave me a comment.</p>
<p>You can reach out to me on any of the matrix decentralized servers. My element messenger ID is <a class="user-mention" href="https://hashnode.com/@maximilien">@maximilien</a><mark>:</mark><a target="_blank" href="http://matrix.org">matrix.org</a></p>
<p>If you are in one of the mastodon decentralized servers, here is my ID <a class="user-mention" href="https://hashnode.com/@maximilien">@</a><a target="_blank" href="mailto:maximilien@qoto.org"><strong>maximilien@qoto.org</strong></a></p>
<p>If you are on linkedIn, you can reach me <a target="_blank" href="http://www.linkedin.com/in/maximilien-kpizingui">here</a></p>
<p>If you want to contact me via email <a target="_blank" href="mailto:maximilien@maxtekai.tech"><strong><mark>maximilien@maxtekai.tech</mark></strong></a></p>
<p>If you want to hire me to work on machine learning, data science, IoT and AI-related projects, please reach out to me <a target="_blank" href="http://www.maxtekai.tech">here</a></p>
<p><code>Warm regards,</code></p>
<p><code>Maximilien.</code></p>
]]></content:encoded></item><item><title><![CDATA[Large Language Models: A Dive Into Three Distinct Architectures]]></title><description><![CDATA[Introduction
The rise of large language models has changed the landscape of Natural Language Processing (NLP) dramatically. Their ability to comprehend, generate, and interact using human language has unlocked numerous applications, from chatbots to ...]]></description><link>https://maximilien.docquest.io/large-language-models-a-dive-into-three-distinct-architectures</link><guid isPermaLink="true">https://maximilien.docquest.io/large-language-models-a-dive-into-three-distinct-architectures</guid><category><![CDATA[large language models]]></category><category><![CDATA[nlp transformers]]></category><dc:creator><![CDATA[Maximilien]]></dc:creator><pubDate>Mon, 16 Oct 2023 11:23:30 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1699742376756/1ed79f10-7814-4a8f-b3fb-35d66d1a4547.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<hr />
<p><strong>Introduction</strong></p>
<p>The rise of large language models has changed the landscape of Natural Language Processing (NLP) dramatically. Their ability to comprehend, generate, and interact using human language has unlocked numerous applications, from chatbots to content creation. In this post, we'll journey through three foundational architectures powering these behemoths: Masked Language Models, Causal Language Models, and Sequence-to-Sequence Language Models.</p>
<hr />
<p><strong>1. Masked Language Model (MLM) - Encoding the Unknown</strong></p>
<h4 id="heading-architecture"><strong>Architecture:</strong></h4>
<p>MLM is designed to predict a missing word in a sentence. During training, random words in a sentence are replaced with a '[MASK]' token, and the model learns to predict these masked words.</p>
<p><strong>Example:</strong></p>
<p>Sentence: "I love [MASK] ice cream."</p>
<p>Prediction: "chocolate"</p>
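<p>A minimal sketch of this behavior using the Hugging Face <code>fill-mask</code> pipeline (the model choice is just an example):</p>
<pre><code class="lang-python">from transformers import pipeline

fill = pipeline("fill-mask", model="bert-base-uncased")
for pred in fill("I love [MASK] ice cream."):
    print(pred["token_str"], round(pred["score"], 3))
</code></pre>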
<p><strong>Prominent models</strong></p>
<ul>
<li><p><strong>BERT (Bidirectional Encoder Representations from Transformers):</strong> Developed by Google, BERT revolutionized many NLP tasks by pre-training on large amounts of text and then fine-tuning on specific tasks.</p>
</li>
<li><p><strong>RoBERTa (A Robustly Optimized BERT Pretraining Approach):</strong> A variation of BERT, RoBERTa tweaks the training process and methodology to achieve even better performance.</p>
</li>
<li><p><strong>DistilBERT:</strong> A distilled version of BERT, it maintains most of the performance while being faster and smaller.</p>
</li>
<li><p><strong>ALBERT (A Lite BERT):</strong> It reduces the number of parameters in BERT without a significant drop in performance.</p>
</li>
</ul>
<h4 id="heading-applications"><strong>Applications:</strong></h4>
<ul>
<li><p><strong>Sentiment Analysis</strong>: Determining if a review is positive or negative using models like BERT.</p>
</li>
<li><p><strong>Named Entity Recognition</strong>: Identifying entities such as names, places, and organizations in a sentence with models like DistilBERT.</p>
</li>
<li><p><strong>Question Answering</strong>: Extracting specific answers from large texts, as seen with models like RoBERTa on the SQuAD dataset.</p>
</li>
<li><p><strong>Text Classification</strong>: Categorizing text into predefined groups using models like ALBERT.</p>
</li>
</ul>
<hr />
<p><strong>2. Causal Language Model (CLM) - Decoding the Sequence</strong></p>
<h4 id="heading-architecture-1"><strong>Architecture:</strong></h4>
<p>CLMs, or autoregressive models, generate text by predicting the next word in a sequence based on the previous words. They're "causal" because the prediction at time 't' is only affected by words from time 't-1' and before.</p>
<p><strong>Example:</strong></p>
<p>Seed: "Once upon a time,"</p>
<p>Generated continuation: "... in a land far away, there was a brave knight."</p>
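<p>A minimal sketch using the <code>text-generation</code> pipeline (GPT-2 is just an accessible example of a causal LM):</p>
<pre><code class="lang-python">from transformers import pipeline

generate = pipeline("text-generation", model="gpt2")
out = generate("Once upon a time,", max_new_tokens=20)
print(out[0]["generated_text"])
</code></pre>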
<p><strong>Prominent</strong> <strong>models</strong></p>
<ul>
<li><p><strong>GPT (Generative Pre-trained Transformer):</strong> OpenAI's model that is pre-trained on large corpora and can generate coherent paragraphs of text. Its iterations include GPT-2 and the more recent GPT-3.</p>
</li>
<li><p><strong>CTRL (Conditional Transformer Language Model):</strong> Developed by Salesforce, CTRL can generate content conditioned on control codes, allowing for more specific text generation.</p>
</li>
<li><p><strong>XLNet:</strong> It combines the strengths of both BERT and GPT by predicting tokens in permuted orders rather than strictly left-to-right.</p>
</li>
</ul>
<h4 id="heading-applications-1"><strong>Applications:</strong></h4>
<ul>
<li><p><strong>Text Generation:</strong> Producing coherent paragraphs of text or completing prompts using GPT series.</p>
</li>
<li><p><strong>Storytelling:</strong> Given a starting point, generating a story or narrative as seen with models like CTRL.</p>
</li>
<li><p><strong>Code Generation:</strong> Producing programming code based on prompts, often explored using GPT models.</p>
</li>
<li><p><strong>Creative Writing:</strong> Assisting writers in generating poems, song lyrics, and more.</p>
</li>
</ul>
<hr />
<p><strong>3. Sequence-to-Sequence (Seq2Seq) Model - Encoding to Decoding</strong></p>
<h4 id="heading-architecture-2"><strong>Architecture:</strong></h4>
<p>Seq2Seq models consist of two main parts: an encoder and a decoder. The encoder processes the input sequence and compresses the information into a 'context vector'. The decoder then uses this vector to produce the output sequence.</p>
<p><strong>Example:</strong></p>
<p>Input (Encoder): "Bonjour"</p>
<p>Output (Decoder): "Hello"</p>
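<p>A minimal sketch using a translation pipeline (the MarianMT checkpoint below is one example of a Seq2Seq model):</p>
<pre><code class="lang-python">from transformers import pipeline

translate = pipeline("translation", model="Helsinki-NLP/opus-mt-fr-en")
print(translate("Bonjour")[0]["translation_text"])  # e.g. "Hello"
</code></pre>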
<p><strong>Prominent</strong> <strong>models</strong></p>
<ul>
<li><p><strong>T5 (Text-to-Text Transfer Transformer):</strong> Introduced by Google, T5 treats every NLP problem as a text-to-text problem, making it highly versatile.</p>
</li>
<li><p><strong>BART (Bidirectional and Auto-Regressive Transformers):</strong> Introduced by Facebook AI, BART is trained to auto-encode (with some noise in the text) and has achieved strong performance in tasks like summarization.</p>
</li>
<li><p><strong>MarianMT:</strong> A state-of-the-art Seq2Seq model specifically designed for neural machine translation.</p>
</li>
<li><p><strong>TransformerXL:</strong> While not strictly a Seq2Seq model, TransformerXL introduced mechanisms to remember longer sequences, making it relevant for tasks that benefit from understanding over extended contexts.</p>
</li>
</ul>
<h4 id="heading-applications-2"><strong>Applications:</strong></h4>
<ul>
<li><p><strong>Machine Translation:</strong> Translating a sentence from one language to another, as done by MarianMT.</p>
</li>
<li><p><strong>Text Summarization:</strong> Shortening a lengthy article into a concise summary, a strong suit of BART.</p>
</li>
<li><p><strong>Conversational Agents:</strong> Building chatbots that can have back-and-forth interactions using models like T5.</p>
</li>
<li><p><strong>Text Simplification:</strong> Converting complex sentences into simpler versions for better understanding.</p>
</li>
</ul>
<hr />
<p><strong>Conclusion</strong></p>
<p>The diverse architectures of large language models showcase the breadth and depth of possibilities in NLP. From filling in the gaps with MLMs, spinning tales with CLMs, or translating and summarizing with Seq2Seq, these models have transformed the way machines understand and generate human language. As we continue to push the boundaries of NLP, it's exciting to envision where these foundational architectures will take us next.</p>
<hr />
]]></content:encoded></item><item><title><![CDATA[Parameter Efficient FineTuning In Action:  Finetuning LLMs Using PEFT & LoRA For Causal Language Modeling Task]]></title><description><![CDATA[Hands-on Code Generation Implementation using Codegen pre-trained model- Parameter Efficient Fine-Tuning — LoRA - CausalLM
Introduction
In our ever-evolving AI landscape, the excitement around Language Models is palpable. Yet, as models grow in size,...]]></description><link>https://maximilien.docquest.io/parameter-efficient-finetuning-in-action-finetuning-llms-using-peft-lora-for-causal-language-modeling-task</link><guid isPermaLink="true">https://maximilien.docquest.io/parameter-efficient-finetuning-in-action-finetuning-llms-using-peft-lora-for-causal-language-modeling-task</guid><category><![CDATA[LoRA]]></category><category><![CDATA[#openai #LLMs #langchain #promtTemplate #PromptEngineering #python ]]></category><category><![CDATA[nlp transformers]]></category><category><![CDATA[finetuning]]></category><category><![CDATA[PEFT]]></category><dc:creator><![CDATA[Maximilien]]></dc:creator><pubDate>Sat, 14 Oct 2023 19:02:09 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1697304421581/55608860-a443-4856-8471-e5cccf6d6f0d.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Hands-on Code Generation Implementation using Codegen pre-trained model- Parameter Efficient Fine-Tuning — LoRA - CausalLM</p>
<p><mark>Introduction</mark></p>
<p>In our ever-evolving AI landscape, the excitement around Language Models is palpable. Yet, as models grow in size, so do the challenges tied to fine-tuning them. How do we efficiently adapt colossal models to specific tasks without extensive computational costs? Welcome to a deep dive into the world of Parameter-Efficient Fine-Tuning! Today, we'll unravel the mystique around fine-tuning Large Language Models (LLMs) using cutting-edge techniques like PEFT and LoRA. If code generation and language modeling intrigue you, strap in for a hands-on walkthrough with the Codegen pre-trained model. By the end of this journey, not only will you grasp the nuances of these techniques, but you'll also have a clear road map to implement them in your projects.</p>
<hr />
<p><mark>Workflow of the code</mark></p>
<ul>
<li><p>LoRA</p>
</li>
<li><p>PEFT</p>
</li>
<li><p>Causal Language Modeling</p>
</li>
<li><p>Codegen</p>
</li>
<li><p>Installing dependencies</p>
</li>
<li><p>Loading the required libraries</p>
</li>
<li><p>Loading the base pre-trained model for causal language modeling</p>
</li>
<li><p>Loading the tokenizer</p>
</li>
<li><p>Initializing the LoRA configuration</p>
</li>
<li><p>Loading the dataset from hugging face</p>
</li>
<li><p>Splitting the dataset into train and val</p>
</li>
<li><p>Defining function to Tokenize and process prompt template</p>
</li>
<li><p>Tokenizing train and val dataset into tensor acceptable by the trainer</p>
</li>
<li><p>Defining the metric function</p>
</li>
<li><p>Initializing seed for reproducibility</p>
</li>
<li><p>Initializing the trainer's arguments and the trainer</p>
</li>
<li><p>Training the base pre-trained model</p>
</li>
<li><p>Saving the finetuned model and its tokenizer</p>
</li>
<li><p>Loading the finetuned model</p>
</li>
<li><p>Defining inference function</p>
</li>
<li><p>Crafting 3 prompt templates</p>
</li>
<li><p>Testing</p>
</li>
</ul>
<hr />
<p><mark>LoRA</mark></p>
<p>LoRA stands for Low-Rank Adaptation of Large Language Models. It is a technique that accelerates the fine-tuning of large models while consuming less memory. To make fine-tuning more efficient, LoRA represents the weight updates with two smaller matrices (called update matrices) obtained through low-rank decomposition. These new matrices can be trained to adapt to the new data while keeping the overall number of changes low. The original weight matrix remains frozen and doesn't receive any further adjustments. To produce the final results, both the original and the adapted weights are combined, as in the sketch below.</p>
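<p>A minimal numeric sketch of this decomposition (dimensions and initialization are illustrative):</p>
<pre><code class="lang-python">import torch

d, k, r = 768, 768, 8          # frozen weight is d x k; the adapters have rank r
W = torch.randn(d, k)          # pre-trained weight, stays frozen
A = torch.randn(r, k) * 0.01   # trainable update matrix A (r x k) ...
B = torch.zeros(d, r)          # ... and B (d x r); B starts at zero, so the update starts at zero
alpha = 16                     # scaling factor (lora_alpha)

def forward(x):                # x: (batch, k)
    return x @ W.T + (alpha / r) * (x @ A.T @ B.T)  # original path + scaled low-rank update

x = torch.randn(2, k)
print(forward(x).shape)        # torch.Size([2, 768])
# Trainable parameters: r * (k + d) = 12,288 vs. d * k = 589,824 in the frozen weight
</code></pre>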
<p>This approach has several advantages:</p>
<ul>
<li><p>LoRA makes fine-tuning more efficient by drastically reducing the number of trainable parameters.</p>
</li>
<li><p>The original pre-trained weights are kept frozen, which means you can have multiple lightweight and portable LoRA models for various downstream tasks built on top of them.</p>
</li>
<li><p>LoRA is orthogonal to many other parameter-efficient methods and can be combined with many of them.</p>
</li>
<li><p>The performance of models fine-tuned using LoRA is comparable to the performance of fully fine-tuned models.</p>
</li>
<li><p>LoRA does not add any inference latency because adapter weights can be merged with the base model.</p>
</li>
</ul>
<p>LoRA is implemented in the Hugging Face Parameter Efficient Fine-Tuning (PEFT) library. To fine-tune a model using LoRA, you need to:</p>
<ul>
<li><p>Instantiate a base model.</p>
</li>
<li><p>Create a configuration (LoraConfig) where you define LoRA-specific parameters.</p>
</li>
<li><p>Wrap the base model with <mark>get_peft_model()</mark> to get a trainable PeftModel.</p>
</li>
<li><p>Train the PeftModel as you normally would train the base model.</p>
</li>
</ul>
<p><mark>Parameter-Efficient Fine Tuning (PEFT)</mark></p>
<p>PEFT is a method used to freeze the pre-trained model parameters during fine-tuning and add a small number of trainable parameters (the adapters) on top of it. The adapters are trained to learn task-specific information. This approach is very memory-efficient with lower compute usage while producing results comparable to a fully fine-tuned model.</p>
<p><mark>Causal Language Model</mark></p>
<p>Causal language modeling predicts the next token in a sequence of tokens, and the model can only attend to tokens on the left. This means the model cannot see future tokens. GPT-2 is an example of a causal language model. They are frequently used for text generation. You can use these models for creative applications like choosing your text adventure or an intelligent coding assistant like Copilot.</p>
<p><mark>What is Codegen</mark></p>
<p>CodeGen is an autoregressive language model for program synthesis trained sequentially on The Pile, BigQuery, and BigPython. The CodeGen model was proposed in A Conversational Paradigm for Program Synthesis by Erik Nijkamp, Bo Pang, Hiroaki Hayashi, Lifu Tu, Huan Wang, Yingbo Zhou, Silvio Savarese, and Caiming Xiong. CodeGen model checkpoints are available on different pre-training data with variable sizes. The format is: <code>Salesforce/codegen-{size}-{data}</code>, where:</p>
<ul>
<li><p><code>size</code>: <code>350M</code>, <code>2B</code>, <code>6B</code>, <code>16B</code></p>
</li>
<li><p><code>data</code>:</p>
<ul>
<li><p><code>nl</code>: Pre-trained on the Pile</p>
<p>  <code>multi</code>: Initialized with <code>nl</code>, then further pre-trained on multiple programming languages data</p>
<p>  <code>mono</code>: Initialized with <code>multi</code>, then further pre-trained on Python data</p>
</li>
</ul>
</li>
<li><p>For example, <code>Salesforce/codegen-350M-mono</code> used in this tutorial offers a 350 million-parameter checkpoint pre-trained sequentially on the Pile, multiple programming languages, and Python.</p>
</li>
</ul>
<hr />
<ul>
<li><p><mark>Installing dependencies</mark></p>
<p>  The dependencies needed in this tutorial are bitsandbytes, datasets, accelerate, loralib, peft, and transformers. We install them using pip as shown below. To run this code you need to change your Colab runtime to a T4 GPU and enable it. Besides, we use bitsandbytes because it supports 8-bit and 4-bit precision data types, which are useful for loading large models while saving memory.</p>
</li>
</ul>
<pre><code class="lang-python">!pip install -q bitsandbytes datasets accelerate loralib
!pip install -q git+https://github.com/huggingface/peft.git git+https://github.com/huggingface/transformers.git
</code></pre>
<ul>
<li><mark>Importing the libraries</mark></li>
</ul>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> transformers <span class="hljs-keyword">import</span> AutoTokenizer, AutoConfig, AutoModelForCausalLM
<span class="hljs-keyword">from</span> peft <span class="hljs-keyword">import</span> LoraConfig, get_peft_model
<span class="hljs-keyword">from</span> datasets <span class="hljs-keyword">import</span> load_dataset
<span class="hljs-keyword">import</span> bitsandbytes <span class="hljs-keyword">as</span> bnb
<span class="hljs-keyword">from</span> datasets <span class="hljs-keyword">import</span> Dataset
<span class="hljs-keyword">from</span> datasets <span class="hljs-keyword">import</span> load_metric
<span class="hljs-keyword">import</span> transformers
<span class="hljs-keyword">import</span> torch
<span class="hljs-keyword">import</span> numpy <span class="hljs-keyword">as</span> np
<span class="hljs-keyword">import</span> random
<span class="hljs-keyword">import</span> torch.nn <span class="hljs-keyword">as</span> nn
<span class="hljs-keyword">import</span> os
os.environ[<span class="hljs-string">"CUDA_VISIBLE_DEVICES"</span>]=<span class="hljs-string">"0"</span>
</code></pre>
<ul>
<li><mark>Initializing the base pre-trained model</mark></li>
</ul>
<pre><code class="lang-python">model = AutoModelForCausalLM.from_pretrained(
    <span class="hljs-string">"Salesforce/codegen-350M-mono"</span>,
    torch_dtype=torch.float16,
    device_map=<span class="hljs-string">'auto'</span>,
    load_in_8bit=<span class="hljs-literal">True</span>
)
</code></pre>
<p>Let's break down each line:</p>
<ul>
<li><p>AutoModelForCausalLM.from_pretrained("Salesforce/codegen-350M-mono"): This function is responsible for loading a pre-trained model. Here's what each part means:</p>
</li>
<li><p>AutoModelForCausalLM: Refers to the general architecture of the model being loaded. In this case, it's a "causal language model", which is a type of model designed for tasks like text generation. "Causal" here means that the model predicts the next word in a sequence based only on previous words, not future words.</p>
</li>
<li><p>from_pretrained: This function tells the library to load a model that has already been trained (i.e., pre-trained) rather than starting from scratch.</p>
</li>
<li><p>"Salesforce/codegen-350M-mono": This is the identifier for the specific pre-trained model you want to load. In this case, you're loading a model from Salesforce with the identifier codegen-350M-mono. The naming often provides hints about the model; here, it suggests the model may be designed for code generation (codegen) and has around 350 million parameters (350M). mono might hint at it being monolingual, but without further details, this is speculative.</p>
</li>
<li><p>torch_dtype=torch.float16: This sets the data type of the model's parameters. By using a torch.float16 (also known as "half precision"), the model will consume less memory and potentially run faster than using the default torch.float32. However, using half-precision can sometimes result in a slight decrease in model accuracy. It's a trade-off between speed/memory and accuracy.</p>
</li>
<li><p>device_map='auto': This is directing the model to be loaded on the most appropriate computational device available. If you have a GPU available, the library will automatically use it, which can greatly accelerate model computations. If no GPU is available, the model will default to CPU.</p>
</li>
<li><p><mark>Initializing the tokenizer</mark></p>
</li>
</ul>
<pre><code class="lang-python">tokenizer = AutoTokenizer.from_pretrained(<span class="hljs-string">"Salesforce/codegen-350M-mono"</span>)
</code></pre>
<ul>
<li>AutoTokenizer: This is a class in the transformers library that can automatically load the appropriate tokenizer for a given pre-trained model. A tokenizer is responsible for converting human-readable text into a format that the model can understand (typically a sequence of integers) and vice-versa.</li>
</ul>
<pre><code class="lang-python">tokenizer.add_eos_token = <span class="hljs-literal">True</span>
tokenizer.pad_token_id = <span class="hljs-number">0</span>
tokenizer.padding_side = <span class="hljs-string">"left"</span>
</code></pre>
<ol>
<li><p><code>tokenizer.add_eos_token = True</code>:</p>
<ul>
<li>This line tells the tokenizer to automatically add an "end of sentence" (EOS) token at the end of every sequence it tokenizes. In many transformer models, the EOS token is used to signal the conclusion of an input sequence. By setting <code>add_eos_token</code> to <code>True</code>, it ensures that the EOS token is added whenever you tokenize a piece of text using this tokenizer.</li>
</ul>
</li>
<li><p><code>tokenizer.pad_token_id = 0</code>:</p>
<ul>
<li><p>Padding is used in machine learning models, especially in sequence models like transformers, to ensure that all sequences (e.g., sentences) in a batch have the same length. This is important because the underlying computations in neural networks usually require consistent input shapes.</p>
</li>
<li><p>This line sets the identifier of the padding token to <code>0</code>. This means when the tokenizer adds padding tokens to a sequence to make it match the desired length, it will use the token with ID <code>0</code> as the padding token.</p>
</li>
</ul>
</li>
<li><p><code>tokenizer.padding_side = "left"</code>:</p>
<ul>
<li>When adding padding tokens, we can either add them to the start (left) or the end (right) of a sequence. This line specifies that the padding should be added to the start (left) of each sequence. This can be important in certain models or applications where the positioning of padding might influence the model's understanding of the sequence. A quick demonstration follows this list.</li>
</ul>
</li>
</ol>
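<p>A quick way to see the effect of these settings (a hedged sketch; the exact token ids depend on the tokenizer):</p>
<pre><code class="lang-python">batch = tokenizer(["def add(a, b):", "print('hi')"], padding=True, return_tensors="pt")
print(batch["input_ids"])  # the shorter sequence is left-padded with pad_token_id 0
</code></pre>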
<p>    <mark>Freezing the model parameters</mark></p>
<pre><code class="lang-python"><span class="hljs-keyword">for</span> param <span class="hljs-keyword">in</span> model.parameters():
  param.requires_grad = <span class="hljs-literal">False</span>  
  <span class="hljs-keyword">if</span> param.ndim == <span class="hljs-number">1</span>:

    param.data = param.data.to(torch.float32)

model.gradient_checkpointing_enable() 
model.enable_input_require_grads()

<span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">CastOutputToFloat</span>(<span class="hljs-params">nn.Sequential</span>):</span>
  <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">forward</span>(<span class="hljs-params">self, x</span>):</span> <span class="hljs-keyword">return</span> super().forward(x).to(torch.float32)
model.lm_head = CastOutputToFloat(model.lm_head)
</code></pre>
<p>Breaking the code down</p>
<ul>
<li><pre><code class="lang-python">      <span class="hljs-keyword">for</span> param <span class="hljs-keyword">in</span> model.parameters():
          param.requires_grad = <span class="hljs-literal">False</span>
</code></pre>
</li>
</ul>
<p>This loop iterates through all the parameters of the model and sets their <code>requires_grad</code> attribute to <code>False</code>. When a parameter's <code>requires_grad</code> attribute is set to <code>False</code>, it will not update during the backward pass, meaning it remains "frozen" during training. This is useful when you only want to train certain parts of a model or when fine-tuning a new dataset.</p>
<ul>
<li><p>Cast 1-dimensional Parameters to <code>float32</code></p>
<pre><code class="lang-python">  <span class="hljs-keyword">if</span> param.ndim == <span class="hljs-number">1</span>:
      param.data = param.data.to(torch.float32)
</code></pre>
</li>
</ul>
<p>For 1-dimensional parameters (often biases or parameters in normalization layers), the code changes its data type to <code>float32</code>. This can be helpful for stability in training since smaller data types (like <code>float16</code>) can sometimes cause numerical issues, especially for parameters like biases.</p>
<ul>
<li><p>Enabling Gradient Checkpointing:</p>
<pre><code class="lang-python">  model.gradient_checkpointing_enable()
</code></pre>
</li>
</ul>
<p>Gradient checkpointing is a technique used to save memory when training very deep models. Instead of storing all intermediate activations in memory for the backward pass, it recomputes them, trading off computation time for memory. This can be particularly useful when training models on GPUs with limited memory.</p>
<ul>
<li><p>Enabling Input Requirement for Gradients:</p>
<pre><code class="lang-python">  model.enable_input_require_grads()
</code></pre>
</li>
</ul>
<p>This ensures that the model's inputs (more precisely, the embedding outputs) require gradients, which is needed for gradient checkpointing to propagate gradients when the base weights are frozen. It is also useful when you want gradients with respect to the input, such as in adversarial training.</p>
<ul>
<li><p>Creating a Custom Module to Cast Output to <code>float32</code>:</p>
<pre><code class="lang-python">  <span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">CastOutputToFloat</span>(<span class="hljs-params">nn.Sequential</span>):</span>
      <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">forward</span>(<span class="hljs-params">self, x</span>):</span> <span class="hljs-keyword">return</span> super().forward(x).to(torch.float32)
  model.lm_head = CastOutputToFloat(model.lm_head)
</code></pre>
</li>
</ul>
<p>This custom module, <code>CastOutputToFloat</code>, is derived from PyTorch's <code>nn.Sequential</code> class. It overrides the <code>forward</code> method to cast its output to <code>float32</code>. The final line replaces the <code>lm_head</code> of the model with this custom module wrapping the original <code>lm_head</code>. The purpose is likely to ensure that the final predictions (logits) from the model are in the <code>float32</code> data type, which can be helpful for precision and stability reasons, especially if other parts of the model or training process utilize lower precision formats like <code>float16</code>.</p>
<ul>
<li><mark>Printing the number of trainable parameters in the model.</mark></li>
</ul>
<pre><code class="lang-python"><span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">print_trainable_parameters</span>(<span class="hljs-params">model</span>):</span>

    trainable_params = <span class="hljs-number">0</span>
    all_param = <span class="hljs-number">0</span>
    <span class="hljs-keyword">for</span> _, param <span class="hljs-keyword">in</span> model.named_parameters():
        all_param += param.numel()
        <span class="hljs-keyword">if</span> param.requires_grad:
            trainable_params += param.numel()
    print(
        <span class="hljs-string">f"trainable params: <span class="hljs-subst">{trainable_params}</span> || all params: <span class="hljs-subst">{all_param}</span> || trainable%: <span class="hljs-subst">{<span class="hljs-number">100</span> * trainable_params / all_param}</span>"</span>
    )
</code></pre>
<ul>
<li><mark>Initializing the LoRa configuration</mark></li>
</ul>
<pre><code class="lang-python">config = LoraConfig(
    r=<span class="hljs-number">8</span>,
    lora_alpha=<span class="hljs-number">16</span>,
    target_modules=[<span class="hljs-string">"fc_in"</span>],
    lora_dropout=<span class="hljs-number">0.05</span>,
    bias=<span class="hljs-string">"none"</span>,
    task_type=<span class="hljs-string">"CAUSAL_LM"</span>
)

model = get_peft_model(model, config)
print_trainable_parameters(model)
</code></pre>
<p><em>trainable params: 819200 || all params: 357531648 || trainable%: 0.2291265695170012</em></p>
<ol>
<li><p><code>r</code>: This parameter sets the rank of the low-rank adaptation. Essentially, it determines the size of the adaptation parameters. A smaller <code>r</code> means fewer parameters, making the adaptation more parameter-efficient. In this case, it's set to 8 (a worked check of the resulting parameter count follows this list).</p>
</li>
<li><p><code>lora_alpha</code>: This parameter is a scaling factor. LoRA adds a low-rank update to the existing weights, and that update is multiplied by <code>lora_alpha / r</code> before being applied. With <code>lora_alpha</code> set to 16 and <code>r</code> set to 8, the adapter's contribution is scaled by a factor of 2 relative to the raw low-rank product.</p>
</li>
<li><p><code>target_modules</code>: This is a list indicating which modules (or layers) of the model should be adapted using LoRA. In this case, only the module named "fc_in" is set to be adapted.</p>
</li>
<li><p><code>lora_dropout</code>: Specifies the dropout rate applied within the LoRA branch (on its inputs) during training. Dropout is a regularization technique where random activations are "dropped out" or set to zero. Here, a rate of 0.05 indicates that 5% of the activations will be zeroed during each training forward pass.</p>
</li>
<li><p><code>bias</code>: Defines how biases should be handled in the LoRA-adapted layers. Here, it's set to "none", meaning the adapter adds no bias parameters and none of the existing bias parameters are made trainable.</p>
</li>
<li><p><code>task_type</code>: Specifies the type of task the model is intended for. In this case, it's set to "CAUSAL_LM", indicating a causal language modeling task. Causal Language Modeling (CLM) is where the model predicts the next token in a sequence based solely on the previous tokens, as opposed to masked language modeling where the model predicts masked-out tokens based on their context.</p>
</li>
</ol>
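<p>As promised above, the printed figure of 819,200 trainable parameters can be checked by hand. Each adapted layer gains two low-rank matrices: <code>A</code> of shape <code>r × in_features</code> and <code>B</code> of shape <code>out_features × r</code>. The layer dimensions below are an assumption inferred from the reported totals (they match a 350M CodeGen checkpoint with hidden size 1024, MLP inner size 4096, and 20 transformer blocks):</p>
<pre><code class="lang-python">r = 8
n_layers = 20
in_features, out_features = 1024, 4096          # fc_in: hidden size -> MLP inner size
per_layer = r * (in_features + out_features)    # A: r x 1024, B: 4096 x r  -> 40,960
print(n_layers * per_layer)                     # 819200, matching the printout above
</code></pre>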
<ul>
<li><mark>Loading the dataset from Hugging Face</mark></li>
</ul>
<pre><code class="lang-python">dataset = load_dataset(<span class="hljs-string">"theblackcat102/evol-codealpaca-v1"</span>)
</code></pre>
<p>The code downloads the evol-codealpaca-v1 dataset from the Hugging Face datasets hub and stores it in the dataset variable for further use.</p>
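<p>Before splitting, it helps to peek at what was downloaded; each record in this dataset carries an <code>instruction</code> and an <code>output</code> field (a quick inspection, with exact row counts depending on the dataset version):</p>
<pre><code class="lang-python">print(dataset)               # shows the available splits and row counts
print(dataset["train"][0])   # one record: {'instruction': ..., 'output': ...}
</code></pre>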
<ul>
<li><mark>Splitting the dataset into train and validation</mark></li>
</ul>
<pre><code class="lang-python"><span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">split_dataset</span>(<span class="hljs-params">dataset</span>):</span>
    n = int(<span class="hljs-number">0.8</span> * len(dataset[<span class="hljs-string">'train'</span>]))
    train_data = dataset[<span class="hljs-string">'train'</span>][:n]
    val_data = dataset[<span class="hljs-string">'train'</span>][n:]

    dataset[<span class="hljs-string">'train'</span>] = train_data
    dataset[<span class="hljs-string">'validation'</span>] = val_data

    <span class="hljs-keyword">return</span> dataset[<span class="hljs-string">'train'</span>], dataset[<span class="hljs-string">'validation'</span>]
</code></pre>
<p>The <code>split_dataset</code> function splits the training data of a given dataset into training (80%) and validation (20%) sets. This is a common practice in machine learning, ensuring that a separate set of data is available to validate the model's performance after training.</p>
<pre><code class="lang-python">train_data, val_data = split_dataset(dataset)
</code></pre>
<p>The returned tuple from the split_dataset function is unpacked into two separate variables. The first value (the training set) is assigned to the train_data variable, and the second value (the validation set) is assigned to the val_data variable.</p>
<p>After this line of code executes, you'll have:</p>
<ul>
<li><p>train_data: This contains 80% of the original training data from the dataset.</p>
</li>
<li><p>val_data: This contains the remaining 20% of the original training data, and it will be used for validation purposes.</p>
</li>
</ul>
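<p>As an aside, the <code>datasets</code> library ships a built-in helper that produces the same 80/20 split with shuffling, which can be preferable to a positional slice if the source data has any ordering (a sketch of the alternative, not the code used above):</p>
<pre><code class="lang-python">split = dataset["train"].train_test_split(test_size=0.2, seed=42)
train_data, val_data = split["train"], split["test"]
</code></pre>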
<ul>
<li><p><mark>Function definition to tokenize and preprocess prompt template</mark></p>
</li>
</ul>
<pre><code class="lang-python"><span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">tokenize_function</span>(<span class="hljs-params">samples</span>):</span>
    output_str = samples[<span class="hljs-string">'output'</span>] <span class="hljs-keyword">if</span> samples[<span class="hljs-string">'output'</span>] <span class="hljs-keyword">else</span> <span class="hljs-string">"Cannot Find Answer"</span>
    prompt_template = <span class="hljs-string">f"### INSTRUCTION\n<span class="hljs-subst">{samples[<span class="hljs-string">'instruction'</span>]}</span>\n\n### OUTPUT\n<span class="hljs-subst">{output_str}</span>&lt;/s&gt;"</span>
    <span class="hljs-keyword">return</span> tokenizer(prompt_template, truncation=<span class="hljs-literal">True</span>, padding=<span class="hljs-string">'max_length'</span>, max_length=<span class="hljs-number">2048</span>)
</code></pre>
<p>The function accepts a single argument, samples, which is expected to be a dictionary containing at least two keys: instruction and output.</p>
<pre><code class="lang-python">output_str = samples[<span class="hljs-string">'output'</span>] <span class="hljs-keyword">if</span> samples[<span class="hljs-string">'output'</span>] <span class="hljs-keyword">else</span> <span class="hljs-string">"Cannot Find Answer"</span>
</code></pre>
<p>This line checks whether <code>samples['output']</code> exists and is not empty or <code>None</code>. If it has a value, that value is assigned to the <code>output_str</code> variable; if not, the string "Cannot Find Answer" is assigned. This conditional assignment ensures that <code>output_str</code> always has some value.</p>
<pre><code class="lang-python"> prompt_template = <span class="hljs-string">f"### INSTRUCTION\n<span class="hljs-subst">{samples[<span class="hljs-string">'instruction'</span>]}</span>\n\n### OUTPUT\n<span class="hljs-subst">{output_str}</span>&lt;/s&gt;"</span>
</code></pre>
<p>This line creates a formatted string, <code>prompt_template</code>, using the values from the <code>samples</code> dictionary. It follows a specific format where the instruction and output are both clearly labeled. Notice the <code>&lt;/s&gt;</code> token at the end; it is often used as an end-of-sequence token in certain tokenization schemes.</p>
<pre><code class="lang-python"><span class="hljs-keyword">return</span> tokenizer(prompt_template, truncation=<span class="hljs-literal">True</span>, padding=<span class="hljs-string">'max_length'</span>, max_length=<span class="hljs-number">2048</span>)
</code></pre>
<p>This line utilizes the tokenizer (which is expected to be available in the outer scope) to tokenize the prompt_template. The tokenizer is set to truncate sequences if they exceed 2048 tokens and pad shorter sequences to this length.</p>
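<p>To make the behaviour concrete, here is what the function returns for a single record (the sample below is hypothetical, invented for illustration): the tokenizer produces the usual <code>input_ids</code>/<code>attention_mask</code> pair, padded out to the maximum length.</p>
<pre><code class="lang-python">sample = {"instruction": "Write a function that adds two numbers.",
          "output": "def add(a, b):\n    return a + b"}
encoded = tokenize_function(sample)
print(list(encoded.keys()))        # ['input_ids', 'attention_mask']
print(len(encoded["input_ids"]))   # 2048, because of padding='max_length'
</code></pre>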
<ul>
<li><mark>Formatting the train and validation dataset into tensor data acceptable by the trainer</mark></li>
</ul>
<pre><code class="lang-python">train_dataset = Dataset.from_dict(train_data)
mapped_train_dataset = train_dataset.map(tokenize_function, batched=<span class="hljs-literal">False</span>, remove_columns=[<span class="hljs-string">'instruction'</span>, <span class="hljs-string">'output'</span>])
val_dataset = Dataset.from_dict(val_data)
mapped_val_dataset = val_dataset.map(tokenize_function, batched=<span class="hljs-literal">False</span>, remove_columns=[<span class="hljs-string">'instruction'</span>, <span class="hljs-string">'output'</span>])
</code></pre>
<p>Here's a step-by-step breakdown:</p>
<ul>
<li>Dataset Initialization:</li>
</ul>
<pre><code class="lang-python">train_dataset = Dataset.from_dict(train_data)
</code></pre>
<p>This line creates a Dataset object from the train_data dictionary. The Dataset object is a data structure provided by the datasets library that is optimized for large-scale datasets and ML tasks. It enables efficient data processing methods and various utilities.</p>
<ul>
<li>Mapping and Tokenizing Train Dataset:</li>
</ul>
<pre><code class="lang-python">mapped_train_dataset = train_dataset.map(tokenize_function, batched=<span class="hljs-literal">False</span>, remove_columns=[<span class="hljs-string">'instruction'</span>, <span class="hljs-string">'output'</span>])
</code></pre>
<p>The map method applies a given function (in this case, tokenize_function) to each sample in the dataset.</p>
<ul>
<li><p>batched=False: This means the tokenize_function will be applied to individual samples rather than batches of samples.</p>
</li>
<li><p>remove_columns=['instruction', 'output']: After processing each sample with tokenize_function, the original columns 'instruction' and 'output' are removed, since they've been tokenized and formatted and are no longer needed in their raw form.</p>
</li>
<li><p>Dataset Initialization for Validation Data:</p>
</li>
</ul>
<pre><code class="lang-python">val_dataset = Dataset.from_dict(val_data)
</code></pre>
<p>Similarly, a Dataset object for validation data (val_data) is created.</p>
<ul>
<li>Mapping and Tokenizing Validation Dataset:</li>
</ul>
<pre><code class="lang-python">mapped_val_dataset = val_dataset.map(tokenize_function, batched=<span class="hljs-literal">False</span>, remove_columns=[<span class="hljs-string">'instruction'</span>, <span class="hljs-string">'output'</span>])
</code></pre>
<p>Just like with the training data, the validation data is also processed using the tokenize_function. The processed and tokenized data is stored in the mapped_val_dataset object.</p>
<ul>
<li><mark>Defining the function to compute the metrics</mark></li>
</ul>
<pre><code class="lang-python">transformers.logging.set_verbosity_info()
bleu_metric = load_metric(<span class="hljs-string">"bleu"</span>)

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">compute_metrics</span>(<span class="hljs-params">eval_pred</span>):</span>
    predictions, labels = eval_pred
    decoded_preds = [tokenizer.decode(pred, skip_special_tokens=<span class="hljs-literal">True</span>) <span class="hljs-keyword">for</span> pred <span class="hljs-keyword">in</span> predictions]
    decoded_labels = [tokenizer.decode(label, skip_special_tokens=<span class="hljs-literal">True</span>) <span class="hljs-keyword">for</span> label <span class="hljs-keyword">in</span> labels]
    bleu_score = bleu_metric.compute(predictions=decoded_preds, references=decoded_labels)

    <span class="hljs-keyword">return</span> {<span class="hljs-string">"bleu"</span>: bleu_score[<span class="hljs-string">"bleu"</span>]}
</code></pre>
<p>This function is designed to be used during or after the evaluation of a model to compute the BLEU score:</p>
<ul>
<li><p>eval_pred is a tuple containing the predictions from the model and the true labels.</p>
</li>
<li><p>The predictions and labels, which are tokenized sequences, are first decoded into human-readable text using the tokenizer.decode() function. This is necessary because the BLEU metric works on actual text, not tokenized sequences.</p>
</li>
<li><p>The BLEU score is then computed using bleu_metric.compute(), and the result is returned as a dictionary with a single key-value pair: "bleu": bleu_score["bleu"].</p>
</li>
</ul>
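<p>One practical caveat, which depends on how the <code>Trainer</code> is configured (treat the snippet below as a hedged sketch rather than part of the original notebook): by default the <code>Trainer</code> hands <code>compute_metrics</code> raw logits, not token IDs, so <code>tokenizer.decode</code> would fail on float arrays. A common fix is to reduce the logits to IDs first via the <code>preprocess_logits_for_metrics</code> hook:</p>
<pre><code class="lang-python">def preprocess_logits_for_metrics(logits, labels):
    # Some models return a tuple (logits, past_key_values, ...); keep the logits.
    if isinstance(logits, tuple):
        logits = logits[0]
    # Reduce to the most likely token ID at each position so that
    # compute_metrics receives integer sequences it can decode.
    return logits.argmax(dim=-1)

# Wire the hook in when constructing the trainer:
# trainer = transformers.Trainer(..., preprocess_logits_for_metrics=preprocess_logits_for_metrics)
# Inside compute_metrics, labels padded with -100 should also be mapped back to a
# real token ID (e.g. tokenizer.pad_token_id) before decoding.
</code></pre>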
<ul>
<li><p><mark>Initializing seeding for model reproducibility</mark></p>
</li>
</ul>
<pre><code class="lang-python">SEED = <span class="hljs-number">42</span>
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
torch.backends.cudnn.deterministic = <span class="hljs-literal">True</span>
torch.backends.cudnn.benchmark = <span class="hljs-literal">False</span>
<span class="hljs-keyword">if</span> torch.cuda.is_available():
    torch.cuda.manual_seed_all(SEED)
</code></pre>
<p>This part of the code sets various random seeds to ensure reproducibility. When training neural networks, many operations have a random component. By fixing the random seed, the same sequence of random numbers will be generated every time, leading to consistent results across runs. This is important if you want to ensure that someone else running your code, or you running your code at a later time, will get the same results.</p>
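<p>For what it's worth, the <code>transformers</code> library bundles most of these calls into a single helper, which can be used instead of (or alongside) the manual seeding above:</p>
<pre><code class="lang-python">from transformers import set_seed

set_seed(42)  # seeds Python's random, NumPy and PyTorch (including CUDA) in one call
</code></pre>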
<ul>
<li><mark>Trainer initialization</mark></li>
</ul>
<pre><code class="lang-python">trainer = transformers.Trainer(
    model=model,
    train_dataset=mapped_train_dataset,
    compute_metrics=compute_metrics,
    args=transformers.TrainingArguments(
        per_device_train_batch_size=<span class="hljs-number">4</span>,
        gradient_accumulation_steps=<span class="hljs-number">4</span>,
        warmup_steps=<span class="hljs-number">100</span>,
        max_steps=<span class="hljs-number">100</span>,
        learning_rate=<span class="hljs-number">1e-3</span>,
        fp16=<span class="hljs-literal">True</span>,
        logging_steps=<span class="hljs-number">1</span>,
        output_dir=<span class="hljs-string">'outputs'</span>,
    ),
    data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=<span class="hljs-literal">False</span>)
)
model.config.use_cache = <span class="hljs-literal">False</span>
</code></pre>
<p>The Trainer class from the Transformers library is being initialized. It's designed to simplify the process of training, evaluating, and testing transformer models.</p>
<p>Arguments to the Trainer:</p>
<ul>
<li><p>model: The model instance you intend to train. This is typically a pre-initialized or pre-trained instance of a transformer model from the library.</p>
</li>
<li><p>train_dataset: The dataset the model will be trained on. mapped_train_dataset is a processed dataset, likely transformed into a format suitable for training, such as tokenized text.</p>
</li>
<li><p>compute_metrics: A function that computes metrics after an evaluation. This function is typically defined earlier in your code. It calculates metrics (like accuracy, BLEU score, etc.) on the evaluation dataset.</p>
</li>
<li><p>args: This specifies various training-related configurations using the TrainingArguments class:</p>
</li>
<li><p>per_device_train_batch_size: The batch size for training on each device (e.g., each GPU). Here it's set to 4.</p>
</li>
<li><p>gradient_accumulation_steps: Number of forward passes (batches) the model will see before an update (backpropagation) is performed. Here, the model will see 4 batches before an update.</p>
</li>
<li><p>warmup_steps: The number of steps for the learning rate warmup. The learning rate will gradually increase over these many steps at the beginning of training.</p>
</li>
<li><p>max_steps: Maximum number of training steps. Training will stop after 100 steps irrespective of epochs.</p>
</li>
<li><p>learning_rate: Specifies the learning rate for the optimizer. It's set to 0.001.</p>
</li>
<li><p>fp16: Indicates the use of 16-bit (also known as half-precision) floating point numbers during training. Using fp16 can accelerate training.</p>
</li>
<li><p>logging_steps: Interval at which logging will occur. Here, logs will be generated at every step.</p>
</li>
<li><p>output_dir: Directory where training-related outputs (like model checkpoints) will be saved. Here, they will be saved in a folder named 'outputs'.</p>
</li>
<li><p>data_collator: This is responsible for preparing and collating data samples into batched tensors before feeding them into the model. transformers.DataCollatorForLanguageModeling is used here, suited for causal language modeling tasks.</p>
</li>
<li><p>The argument mlm=False indicates that masked language modeling is not used (which makes sense since it's for causal language modeling).</p>
</li>
<li><p>model.config.use_cache = False: This disables the caching mechanism within the transformer model. In transformers, caching can store certain intermediate outputs to speed up the sequential processing of tokens. However, it might be disabled to save memory, especially if the sequences being processed are long.</p>
</li>
</ul>
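<p>Two consequences of these settings are worth spelling out. First, the effective batch size is <code>per_device_train_batch_size × gradient_accumulation_steps = 4 × 4 = 16</code> sequences per optimizer update (per device), and training performs exactly <code>max_steps = 100</code> such updates. Second, <code>warmup_steps</code> equals <code>max_steps</code> here, so the learning rate is still ramping up when training stops; for longer runs you would normally keep the warmup well below the total step count.</p>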
<ul>
<li><p><mark>Training the model</mark></p>
</li>
</ul>
<pre><code class="lang-python">trainer.train()
</code></pre>
<ul>
<li><p>trainer.train(): When this method is called, the model begins training on the dataset specified during the initialization of the Trainer object. The training process will use all the configurations, hyperparameters, and specifications you provided when you created the Trainer instance. It goes through the data in mini-batches as specified by the batch size. For each batch, it feeds the data through the model, computes the loss (difference between the model's predictions and the actual values), and then updates the model's weights using backpropagation. This process is repeated for the number of epochs or steps specified. An epoch is a complete pass through the entire training dataset.</p>
</li>
<li><p><mark>Saving the finetuned model</mark></p>
</li>
</ul>
<pre><code class="lang-python">model.save_pretrained(<span class="hljs-string">"./my_model"</span>)
tokenizer.save_pretrained(<span class="hljs-string">"./my_model"</span>)
</code></pre>
<ul>
<li><mark>Loading the finetuned model</mark></li>
</ul>
<pre><code class="lang-python">loaded_model = AutoModelForCausalLM.from_pretrained(<span class="hljs-string">"./my_model"</span>)
loaded_tokenizer = AutoTokenizer.from_pretrained(<span class="hljs-string">"./my_model"</span>)
</code></pre>
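<p>A subtlety worth flagging here (this depends on your <code>peft</code> and <code>transformers</code> versions, so treat the snippet below as a hedged sketch): because the model was wrapped with <code>get_peft_model</code>, <code>save_pretrained</code> stores only the LoRA adapter weights, not the full model. Older <code>transformers</code> releases cannot load such a folder with <code>AutoModelForCausalLM</code> directly; the explicit route is to reload the base checkpoint and attach the adapter with <code>PeftModel.from_pretrained</code>. The base checkpoint name below is an assumption inferred from the parameter counts reported earlier:</p>
<pre><code class="lang-python">from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

# Reload the frozen base model, then attach the saved LoRA adapter on top.
base_model = AutoModelForCausalLM.from_pretrained("Salesforce/codegen-350M-mono")  # assumed base checkpoint
loaded_model = PeftModel.from_pretrained(base_model, "./my_model")
loaded_tokenizer = AutoTokenizer.from_pretrained("./my_model")
</code></pre>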
<ul>
<li><mark>Inference function</mark></li>
</ul>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> IPython.display <span class="hljs-keyword">import</span> display, Markdown

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">generate_completion</span>(<span class="hljs-params">model, tokenizer, prompt_text, max_length=<span class="hljs-number">100</span></span>):</span>

    model.config.use_cache = <span class="hljs-literal">True</span>
    model.eval()
    input_ids = tokenizer.encode(prompt_text, return_tensors=<span class="hljs-string">"pt"</span>)
    <span class="hljs-keyword">with</span> torch.no_grad():
        output = model.generate(input_ids, max_length=max_length, num_return_sequences=<span class="hljs-number">1</span>,
                                pad_token_id=tokenizer.eos_token_id, eos_token_id=tokenizer.eos_token_id,
                                temperature=<span class="hljs-number">0.1</span>,
                                top_k=<span class="hljs-number">10</span>,
                                top_p=<span class="hljs-number">0.1</span>,
                                do_sample=<span class="hljs-literal">True</span>
                                )
    display(Markdown(tokenizer.decode(output[<span class="hljs-number">0</span>], skip_special_tokens=<span class="hljs-literal">True</span>)))
</code></pre>
<p>The function is defined with the following parameters:</p>
<ul>
<li><p>model: The pre-trained model that you want to use for generating the text.</p>
</li>
<li><p>tokenizer: The tokenizer associated with the model that is responsible for converting the text into tokens (and vice-versa). prompt_text: The initial text (or prompt) you want to expand upon. max_length: Maximum length of the generated output. The default is set to 100 tokens.</p>
</li>
</ul>
<p>Setting Model for Generation:</p>
<pre><code class="lang-python">model.config.use_cache = <span class="hljs-literal">True</span>
model.eval()
</code></pre>
<p>The use_cache is set to True to allow the model to use past computations for faster generations. model.eval() sets the model to evaluation mode. This is essential as certain layers like dropout behave differently during training and evaluation.</p>
<p>Tokenization of the Prompt:</p>
<pre><code class="lang-python">input_ids = tokenizer.encode(prompt_text, return_tensors=<span class="hljs-string">"pt"</span>)
</code></pre>
<p>The prompt text is tokenized into a format the model understands (input_ids). The return_tensors="pt" ensures the result is a PyTorch tensor.</p>
<p>Generating the Completion:</p>
<pre><code class="lang-python"><span class="hljs-keyword">with</span> torch.no_grad():
    output = model.generate(...)  # full argument list shown in the function definition above
</code></pre>
<p><code>with torch.no_grad()</code> ensures that no gradients are computed during this operation, saving memory and computational power. <code>model.generate()</code> is the method that produces the generated completion. Here's a brief on the parameters:</p>
<ul>
<li><p><code>input_ids</code>: The tokenized version of the <code>prompt_text</code>.</p>
</li>
<li><p><code>max_length</code>: The maximum length of the generated text.</p>
</li>
<li><p><code>num_return_sequences</code>: The number of sequences to return. It's set to 1, so only one completion is generated.</p>
</li>
<li><p><code>pad_token_id</code>, <code>eos_token_id</code>: The padding and end-of-sequence token IDs, ensuring the generated text is appropriately padded and terminated.</p>
</li>
<li><p><code>temperature</code>: Controls the randomness of the output. Lower values make the output more deterministic.</p>
</li>
<li><p><code>top_k</code>, <code>top_p</code>: Restrict sampling to the <code>k</code> most likely tokens, or to the smallest set of tokens whose cumulative probability reaches <code>p</code>.</p>
</li>
<li><p><code>do_sample</code>: Enables sampling, so the model considers multiple possible next tokens rather than always taking the most probable one.</p>
</li>
</ul>
<p>Displaying the Generated Text:</p>
<pre><code class="lang-python">display(Markdown(tokenizer.decode(output[<span class="hljs-number">0</span>], skip_special_tokens=<span class="hljs-literal">True</span>)))
</code></pre>
<p>This decodes the generated token IDs back to human-readable text and then displays it in the Jupyter Notebook in a formatted manner.</p>
<p><mark>Testing</mark></p>
<pre><code class="lang-python">prompt_text = <span class="hljs-string">"### INSTRUCTION\nWrite a function to find the area of a triangle:\n\n### OUTPUT\n"</span>
</code></pre>
<pre><code class="lang-python">prompt_text1 = <span class="hljs-string">"### INSTRUCTION\nWrite a function to find if a number is odd:\n\n### OUTPUT\n"</span>
</code></pre>
<pre><code class="lang-python">prompt_text2 = <span class="hljs-string">"### INSTRUCTION\nWrite a code to find factorial of a number:\n\n### OUTPUT\n"</span>
</code></pre>
<pre><code class="lang-python">print(generate_completion(loaded_model, loaded_tokenizer, prompt_text))
</code></pre>
<p>Generate config GenerationConfig { "bos_token_id": 1, "eos_token_id": 50256 }</p>
<p>INSTRUCTION</p>
<p>Write a function to find the area of a triangle:</p>
<p>OUTPUT</p>
<p>Here is a Python function that calculates the area of a triangle:</p>
<pre><code class="lang-python"><span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">area_of_triangle</span>(<span class="hljs-params">a, b, c</span>):</span>
    <span class="hljs-keyword">return</span> (a * b) / <span class="hljs-number">2</span>

print(area_of_triangle(<span class="hljs-number">3</span>, <span class="hljs-number">4</span>, <span class="hljs-number">5</span>))
</code></pre>
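<p>It's worth noting that this generated answer is only partially correct: it ignores the third side <code>c</code> (a general triangle would need Heron's formula), a useful reminder that the output of a briefly fine-tuned model still needs human review.</p>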
<pre><code class="lang-python">print(generate_completion(loaded_model, loaded_tokenizer, prompt_text1))
</code></pre>
<p>INSTRUCTION</p>
<p>Write a function to find if a number is odd:</p>
<p>OUTPUT</p>
<p>Here is a Python function that takes a number as input and returns True if the number is odd, and False otherwise.</p>
<pre><code class="lang-python"><span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">is_odd</span>(<span class="hljs-params">n</span>):</span>
    <span class="hljs-keyword">return</span> n % <span class="hljs-number">2</span> == <span class="hljs-number">1</span>

print(is_odd(<span class="hljs-number">5</span>))
This function takes a number <span class="hljs-keyword">as</span> an input <span class="hljs-keyword">and</span> returns <span class="hljs-literal">True</span> <span class="hljs-keyword">if</span> the number <span class="hljs-keyword">is</span> odd,
</code></pre>
<pre><code class="lang-python">print(generate_completion(loaded_model, loaded_tokenizer, prompt_text2))
</code></pre>
<p><strong>INSTRUCTION</strong></p>
<p>Write a code to find factorial of a number:</p>
<p><strong>OUTPUT</strong></p>
<p>Here is a Python code that solves the problem:</p>
<pre><code class="lang-python"><span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">factorial</span>(<span class="hljs-params">n</span>):</span>
    <span class="hljs-keyword">if</span> n == <span class="hljs-number">0</span>:
        <span class="hljs-keyword">return</span> <span class="hljs-number">1</span>
    <span class="hljs-keyword">else</span>:
        <span class="hljs-keyword">return</span> n * factorial(n - <span class="hljs-number">1</span>)

print(factorial(<span class="hljs-number">5</span>))
</code></pre>
<p><mark>Conclusion</mark></p>
<p>To wrap up, we've delved into the intricate process of optimizing language models with Parameter-Efficient Fine-Tuning, namely through PEFT and LoRA. This approach not only fine-tunes models at a fraction of the computational expense but also delivers robust performance on specialized tasks like code generation. The hands-on implementation on a CodeGen-based pre-trained model illustrates the practicality and potential of such methods. For enthusiasts and professionals aiming to harness the power of large language models without the associated computational overhead, this exploration shines a light on the path forward.</p>
<hr />
<p>If you want to contribute or you find any errors in this article please do leave me a comment.</p>
<p>You can reach out to me on any of the matrix decentralized servers. My element messenger ID is <a class="user-mention" href="https://hashnode.com/@maximilien">@maximilien</a><mark>:</mark><a target="_blank" href="http://matrix.org">matrix.org</a></p>
<p>If you are in one of the mastodon decentralized servers, here is my ID <a class="user-mention" href="https://hashnode.com/@maximilien">@</a><a target="_blank" href="mailto:maximilien@qoto.org"><strong>maximilien@qoto.org</strong></a></p>
<p>If you are on linkedIn, you can reach me <a target="_blank" href="http://www.linkedin.com/in/maximilien-kpizingui">here</a></p>
<p>If you want to contact me via email <a target="_blank" href="mailto:maximilien@maxtekai.tech"><strong><mark>maximilien@maxtekai.tech</mark></strong></a></p>
<p>If you want to hire me to work on machine learning, data science, IoT and AI-related projects, please reach out to me <a target="_blank" href="http://www.maxtekai.tech">here</a></p>
<p><code>Warm regards,</code></p>
<p><code>Maximilien.</code></p>
]]></content:encoded></item><item><title><![CDATA[Langchain Meets GPT-3.5: Crafting the Ultimate Multilingual News Articles Summarizer In English And French]]></title><description><![CDATA[Introduction
In our modern, rapidly evolving society, staying abreast of current news and updates is crucial. Yet, sifting through numerous articles can be a tedious task. To streamline this process and provide you with succinct insights, we're intro...]]></description><link>https://maximilien.docquest.io/langchain-meets-gpt-35-crafting-the-ultimate-multilingual-news-articles-summarizer-in-english-and-french</link><guid isPermaLink="true">https://maximilien.docquest.io/langchain-meets-gpt-35-crafting-the-ultimate-multilingual-news-articles-summarizer-in-english-and-french</guid><category><![CDATA[openai]]></category><category><![CDATA[langchain]]></category><category><![CDATA[News Summarization]]></category><category><![CDATA[Article Analysis]]></category><category><![CDATA[News Automation]]></category><dc:creator><![CDATA[Maximilien]]></dc:creator><pubDate>Tue, 26 Sep 2023 14:14:11 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1695729072438/660bdfbb-d5cd-459a-9d02-c4afff8efa68.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h3 id="heading-introduction"><strong>Introduction</strong></h3>
<p>In our modern, rapidly evolving society, staying abreast of current news and updates is crucial. Yet, sifting through numerous articles can be a tedious task. To streamline this process and provide you with succinct insights, we're introducing a News Articles Summarizer built with GPT-3.5 and LangChain. This robust tool efficiently scrapes web articles, captures their headlines and content, and produces sharp summaries. In this guide, we'll delve into the step-by-step creation of this summarizer.</p>
<h3 id="heading-workflow-for-building-a-news-articles-summarizer"><strong>Workflow for Building a News Articles Summarizer</strong></h3>
<ol>
<li><p><strong><mark>Installing required libraries</mark></strong><mark>:</mark> To get started, ensure you have the necessary libraries installed: <code>requests</code>, <code>newspaper3k</code>, and <code>langchain</code>.</p>
</li>
<li><p><strong><mark>Scraping articles</mark></strong>: Use the <code>requests</code> library to scrape the content of the target news articles from their respective URLs.</p>
</li>
<li><p><strong><mark>Extracting titles and text</mark></strong>: Employ the <code>newspaper3k</code> library (imported as <code>newspaper</code>) to parse the scraped HTML and extract the titles and text of the articles.</p>
</li>
<li><p><strong><mark>Preprocessing the text</mark></strong>: Clean and preprocess the extracted texts to make them suitable for input to the GPT-3.5 model.</p>
</li>
<li><p><strong><mark>Generating summaries</mark></strong>: Utilize GPT-3.5 model to summarize the extracted articles</p>
</li>
<li><p><strong><mark>Outputting the results</mark></strong>: Present the summaries along with the original titles, allowing users to grasp the main points of each article quickly.</p>
</li>
</ol>
<hr />
<ol>
<li><strong><mark>Installing dependencies</mark></strong></li>
</ol>
<pre><code class="lang-plaintext">!pip install -q openai langchain newspaper3k python-dotenv  requests
</code></pre>
<p>Create a <code>.env</code> file in your project root directory and add your OpenAI environment variable:</p>
<pre><code class="lang-plaintext">from dotenv import load_dotenv

!echo "OPENAI_API_KEY='&lt;OPENAI_API_KEY&gt;'" &gt; .env

load_dotenv()
</code></pre>
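<p>Behind the scenes, <code>load_dotenv()</code> copies the variables from <code>.env</code> into the process environment, where LangChain's OpenAI wrappers pick up <code>OPENAI_API_KEY</code> automatically; you can confirm it loaded with <code>os.getenv("OPENAI_API_KEY")</code>.</p>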
<ol>
<li><strong><mark>Scraping &amp; extracting the title and the text of the article using requests and newspaper libraries</mark></strong></li>
</ol>
<pre><code class="lang-plaintext">import requests
from newspaper import Article

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.82 Safari/537.36'
}

article_urls = "https://www.wired.com/story/fast-forward-chatgpt-my-new-chatbot-friend-get-things-done/"

session = requests.Session()

try:
    response = session.get(article_urls, headers=headers, timeout=10)

    if response.status_code == 200:
        article = Article(article_urls)
        article.download()
        article.parse()

        print(f"Title: {article.title}")
        print(f"Text: {article.text}")

    else:
        print(f"Failed to fetch article at {article_urls}")
except Exception as e:
    print(f"Error occurred while fetching article at {article_urls}: {e}")
</code></pre>
<details><summary>output of the above code</summary><div data-type="detailsContent"><em>Title: Enough Talk, ChatGPT—My New Chatbot Friend Can Get Things Done Text: I recently needed to contact the CEO of a startup called Lindy, a company developing personal assistants powered by artificial intelligence. Instead of looking for it myself, I turned to an AI helper of my own, an open source program called Auto-GPT, typing in “Find me the email address of the CEO of Lindy AI.” Like a delightfully enthusiastic intern, Auto-GPT began furiously Googling and browsing the web for answers, providing a running commentary designed to explain its actions as it went. “A web search is a good starting point to gather information about the CEO and their email address,” it told me.</em> <em>When given a task like finding a startup CEO's email address, the open source Auto-GPT suggests a plan for approval and can attempt to put it into action. Auto-GPT via Will Knight “I found several sources mentioning Flo Crivello as the CEO of </em><a target="_blank" href="http://Lindy.ai"><em>Lindy.ai</em></a><em>, but I haven't found their email address yet,” Auto-GPT reported. “I will now check Flo Crivello’s LinkedIn profile for their email address,” it said. That didn’t work either, so the program then suggested it could guess Crivello’s email address based on commonly used formats. After I gave it permission to go ahead, Auto-GPT used a series of different email verification services it found online to check if any of its guesses might be valid. None provided a clear answer, but the program saved the addresses to a file on my computer, suggesting I might want to try emailing them all.</em> <em>Who am I to question a friendly chatbot? I tried them all, but every email bounced back. Eventually, I made my own guess at Crivello’s email address based on past experience, and I got it right the first time. Auto-GPT failed me, but it got close enough to illustrate a coming shift in how we use computers and the web. The ability of bots like ChatGPT to answer an incredible variety of questions means they can correctly describe how to perform a wide range of sophisticated tasks. Connect that with software that can put those descriptions into action and you have an AI helper that can get a lot done. Of course, just as ChatGPT will sometimes produce confused messages, agents built that way will occasionally—or often—go haywire. As I wrote this week, while searching for an email address is relatively low-risk, in the future agents might be tasked with riskier business, like booking flights or contacting people on your behalf. Making agents that are safe as well as smart is a major preoccupation of projects and companies working on this next phase of the ChatGPT era. When I finally spoke to Crivello of Lindy, he seemed utterly convinced that AI agents will be able to wholly replace some office workers, such as executive assistants. He envisions many professions simply disappearing.</em></div></details>

<ol>
<li><strong><mark>Generating the summaries of the article using gpt-3.5-turbo</mark></strong></li>
</ol>
<p>The next code imports essential classes and functions from LangChain and sets up a <code>ChatOpenAI</code> instance with a temperature of 0 for controlled response generation. Additionally, it imports chat-related message schema classes, which enable the smooth handling of chat-based tasks. The code below starts by setting up the prompt and filling it with the article's content.</p>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> langchain.schema <span class="hljs-keyword">import</span> HumanMessage
<span class="hljs-keyword">from</span> langchain.chat_models <span class="hljs-keyword">import</span> ChatOpenAI

article_title = article.title

template = <span class="hljs-string">"""You are a very good assistant that summarizes online articles.

Here's the article you want to summarize.

==================
Title: {article_title}

{article_text}
==================

Write a summary of the previous article.
"""</span>

prompt = template.format(article_title=article.title, article_text=article.text)

messages = [HumanMessage(content=prompt)]

chat = ChatOpenAI(model_name=<span class="hljs-string">"gpt-3.5-turbo"</span>, temperature=<span class="hljs-number">0</span>)  

summary = chat(messages)
print(summary.content)
</code></pre>
<details><summary>output of the code below</summary><div data-type="detailsContent"><em>The article discusses the capabilities of AI chatbots, specifically Auto-GPT, in performing tasks and getting things done. The author shares their experience using Auto-GPT to find the email address of the CEO of a startup called Lindy. Although Auto-GPT was not successful in finding the email address, it demonstrated the potential of AI chatbots to perform a wide range of tasks. The article also highlights the importance of ensuring the safety and reliability of AI agents as they take on more complex and risky tasks in the future. The CEO of Lindy believes that AI agents have the potential to replace certain office workers and transform various professions.</em></div></details>

<p>If we want a bulleted list, we can modify the prompt as shown below.</p>
<pre><code class="lang-python">
template = <span class="hljs-string">"""You are an advanced AI assistant that summarizes online articles into bulleted lists.

Here's the article you need to summarize.

==================
Title: {article_title}

{article_text}
==================

Now, provide a summarized version of the article in a bulleted list format.
"""</span>


prompt = template.format(article_title=article.title, article_text=article.text)

summary = chat([HumanMessage(content=prompt)])
print(summary.content)
</code></pre>
<p><mark>The output of the code is shown below</mark></p>
<blockquote>
<pre><code class="lang-plaintext">- The author used an open source program called Auto-GPT to find the email address of the CEO of Lindy AI.
- Auto-GPT suggested a plan and attempted to find the email address through web searches and checking the CEO's LinkedIn profile.
- The program also tried guessing the email address based on commonly used formats and used email verification services to check its guesses.
- None of the attempts were successful, but the program saved the addresses for the author to try emailing them.
- The author eventually made their own guess and found the correct email address.
- The experience with Auto-GPT highlights the potential of AI assistants like ChatGPT to perform a wide range of tasks.
- However, there are concerns about the safety and reliability of AI agents when handling riskier tasks.
- The CEO of Lindy AI believes that AI agents could replace certain office workers and lead to the disappearance of some professions.
</code></pre>
</blockquote>
<p>To obtain a summary in French, we can guide the model to produce it in the French language. However, keep in mind that GPT-3's primary training data is in English. Although it possesses multilingual abilities, the output's accuracy might be inconsistent for non-English languages. Here's a way to adjust the prompt.</p>
<pre><code class="lang-python">
template = <span class="hljs-string">"""You are an advanced AI assistant that summarizes online articles into bulleted lists in French.

Here's the article you need to summarize.

==================
Title: {article_title}

{article_text}
==================

Now, provide a summarized version of the article in a bulleted list format, in French.
"""</span>

prompt = template.format(article_title=article.title, article_text=article.text)

summary = chat([HumanMessage(content=prompt)])
print(summary.content)
</code></pre>
<p><mark>The output of the code is shown below</mark></p>
<blockquote>
<pre><code class="lang-plaintext">- Auto-GPT est un programme open source qui peut aider à trouver des informations en ligne, comme l'adresse e-mail du PDG d'une startup appelée Lindy AI.
- Auto-GPT effectue une recherche sur le web pour trouver l'adresse e-mail du PDG de Lindy AI, mais ne parvient pas à la trouver.
- Le programme suggère ensuite de deviner l'adresse e-mail en se basant sur des formats couramment utilisés.
- Auto-GPT utilise différents services de vérification d'adresses e-mail pour vérifier ses suppositions, mais aucune ne s'avère valide.
- Auto-GPT enregistre les adresses dans un fichier sur l'ordinateur de l'utilisateur et suggère d'essayer de les contacter par e-mail.
- L'article souligne que les chatbots comme ChatGPT peuvent accomplir une grande variété de tâches sophistiquées grâce à leur capacité à répondre à de nombreuses questions.
- Cependant, il est important de développer des agents intelligents qui soient également sûrs pour éviter les problèmes potentiels.
- Le PDG de Lindy AI pense que les agents d'intelligence artificielle pourraient remplacer certains employés de bureau à l'avenir et prédit la disparition de certaines professions.
</code></pre>
</blockquote>
<p>Behind the scenes, the code first gathers article details like the title and content. A conversational prompt is then crafted, positioning the AI as a sophisticated assistant tasked with summarizing the article in French bullet points. The GPT-3.5 model is loaded with specific settings to regulate output randomness, and the prompt is populated with the article's data. The core of the process is passing the formatted prompt to the model, which parses it, understands the task and generates a summary accordingly.</p>
<h3 id="heading-conclusion"><strong><mark>Conclusion</mark></strong></h3>
<p>To wrap up, we've demystified the journey of crafting a proficient News Article Summarizer through the synergy of LangChain and GPT-3.5. This tool, with its ability to present AI summaries as bullet points, not only distills complex articles for easy consumption but also embraces a global audience by offering translations, with French as an example. The step-by-step guide serves as a blueprint for anyone aiming to optimize their news-reading experience, staying informed without wasting time on long articles online.</p>
<hr />
<p>If you want to contribute or you find any errors in this article please do leave me a comment.</p>
<p>You can reach out to me on any of the matrix decentralized servers. My element messenger ID is <a class="user-mention" href="https://hashnode.com/@maximilien">@maximilien</a><mark>:</mark><a target="_blank" href="http://matrix.org">matrix.org</a></p>
<p>If you are in one of the mastodon decentralized servers, here is my ID <a class="user-mention" href="https://hashnode.com/@maximilien">@</a><a target="_blank" href="mailto:maximilien@qoto.org"><strong><mark>maximilien@qoto.org</mark></strong></a></p>
<p>If you are on linkedIn, you can reach me <a target="_blank" href="http://www.linkedin.com/in/maximilien-kpizingui">here</a></p>
<p>If you want to contact me via email <a target="_blank" href="mailto:maximilien@maxtekai.tech"><strong><mark>maximilien@maxtekai.tech</mark></strong></a></p>
<p>If you want to hire me to work on machine learning, data science, IoT and AI-related projects, please reach out to me <a target="_blank" href="http://www.maxtekai.tech">here</a></p>
<p><code>Warm regards,</code></p>
<p><code>Maximilien.</code></p>
]]></content:encoded></item><item><title><![CDATA[Unleashing the Power of LLMs, OpenAI API, and LangChain for Personalized City-Specific Recipe Recommendations]]></title><description><![CDATA[Cooking is an art, and having a knowledgeable cooking assistant can greatly enhance your culinary journey. In this blog post, we will explore how the combination of LLMs (Large Language Models) the OpenAI API, and LangChain  can be leveraged to build...]]></description><link>https://maximilien.docquest.io/unleashing-the-power-of-llms-openai-api-and-langchain-for-personalized-city-specific-recipe-recommendations</link><guid isPermaLink="true">https://maximilien.docquest.io/unleashing-the-power-of-llms-openai-api-and-langchain-for-personalized-city-specific-recipe-recommendations</guid><category><![CDATA[#openai #LLMs #langchain #promtTemplate #PromptEngineering #python ]]></category><dc:creator><![CDATA[Maximilien]]></dc:creator><pubDate>Sun, 14 May 2023 11:10:44 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1683906895622/b3ddb0df-2c49-40e2-b28f-50677821cb12.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Cooking is an art, and having a knowledgeable cooking assistant can greatly enhance your culinary journey. In this blog post, we will explore how the combination of <mark>LLMs</mark> (Large Language Models) the <mark>OpenAI API</mark>, and <mark>LangChain </mark> can be leveraged to build an intelligent Cook Bot. This bot will provide recipes based on the location provided by the user, cooking tips, and even real-time translation for global culinary adventures.</p>
<hr />
<h3 id="heading-table-of-contents">Table of contents</h3>
<ol>
<li><p><mark>Introduction</mark></p>
</li>
<li><p><mark>Setting up the Environment Variables for OpenAI API</mark></p>
</li>
<li><p><mark>Installing and loading dependencies</mark></p>
</li>
<li><p><mark>Creating a new Python script</mark></p>
</li>
<li><p><mark>Invoking the method to load the environment variable and creating a new instance of the language model</mark></p>
</li>
<li><p><mark>Creating Location and Meal Chains</mark></p>
</li>
<li><p><mark>Building the Overall Chain</mark></p>
</li>
<li><p><mark>Streamlit User Interface</mark></p>
</li>
<li><p><mark>Conclusion</mark></p>
</li>
</ol>
<hr />
<ul>
<li><mark>Introduction</mark></li>
</ul>
<p>Cooking is an art that brings people together through delicious flavors and culinary experiences. Before delving into the code for a cutting-edge cooking assistant that will elevate your cooking skills to new heights, we are going to walk through some keywords that may be unfamiliar to non-specialists but are key to the implementation of the cook assistant, namely <mark>OpenAI</mark>, <mark>Langchain</mark>, <mark>LLMs,</mark> <mark>Prompt</mark> and <mark>Prompt template.</mark></p>
<ul>
<li><p><mark>OpenAI</mark> : OpenAI is a leading artificial intelligence research organization that has developed state-of-the-art language models capable of understanding and generating human-like text. In the context of our cook assistant, OpenAI's technology is utilized to power the language generation capabilities of the assistant. The cooking assistant relies on an OpenAI language model, specifically the OpenAI API, to generate responses to user inputs. The language model has been trained on vast amounts of text data, allowing it to understand the nuances of human language and provide contextually relevant and coherent responses. By integrating the OpenAI API into the cook assistant, we leverage the power of natural language processing and generation. The cooking assistant can understand and process user queries related to cooking, recipe recommendations, and cooking tips. It then utilizes the OpenAI language model to generate informative and engaging responses conversationally.</p>
</li>
<li><p><mark>Langchain</mark>: Python library that provides an easy-to-use interface for building conversational agents using large language models (LLMs). In the context of the cook assistant, Langchain is used to build a conversational interface that allows users to ask for cooking advice and recipe recommendations.</p>
<p>  At the core of Langchain is the concept of a "chain," which is essentially a sequence of LLMs that are used to generate responses to user inputs. In the cook assistant, Langchain is used to create two separate chains: one for location-based recipe recommendations and another for meal-specific recipe recommendations.</p>
</li>
<li><p><mark>LLMs</mark>(Large Language Models) in the realm of the cooking assistant are like culinary encyclopedias on steroids. These linguistic powerhouses possess a vast understanding of recipes, ingredients, cooking techniques, and culinary knowledge. LLMs are trained on massive amounts of text data, allowing them to absorb a wealth of culinary information. When it comes to cooking assistance, LLMs serve as the brilliant minds behind the scenes. They can comprehend and generate text-based responses to user queries, providing valuable cooking advice, recipe suggestions, and even personalized recommendations.</p>
</li>
<li><p><mark>Prompt</mark>: In the context of LLMs and the Cook Assistant, a prompt refers to the initial instruction or input provided to the language model to generate a response. The prompt serves as a guiding context for the model, helping it understand the user's request and generate relevant and meaningful output. When interacting with the Cook Assistant, the user provides prompts related to the city. For example, a user might ask, "What is a delicious recipe for a particular city?" In this case, the prompt is the user's question itself. The prompt is then passed to the LLM, such as OpenAI's GPT-3 model, which processes the input and generates a response based on its understanding of the given context. The model analyzes the prompt and utilizes its vast knowledge of cooking techniques, ingredients, and recipes to generate a helpful and informative response to the user's query.</p>
</li>
<li><p><mark>Prompt template: </mark> In the context of Large Language Models (LLMs) and the Cook Assistant, a prompt template is a structured framework that incorporates variables and placeholders to create dynamic prompts. Prompt templates allow for flexible and customizable input that can be easily adapted to different user queries and contexts. The Cook Assistant utilizes prompt templates to generate prompts tailored to specific user inputs, such as location or meal preferences. Instead of providing a static prompt, a prompt template includes placeholders that will be replaced with actual values provided by the user. For example, a prompt template for location-based recommendations might look like this: "Tell me the best food in {user_location}." Here, <code>{user_location}</code> is the placeholder that will be replaced with the actual user input, such as "New York" or "Paris." The prompt template allows the Cook Assistant to dynamically generate prompts based on the user's location, ensuring relevant and personalized responses. By employing prompt templates, the Cook Assistant can handle various user inputs and adapt its prompts accordingly. Prompt templates enable a more interactive and conversational experience, allowing users to provide specific details or preferences and receive targeted recommendations or recipes in response. In closing, prompt templates enhance the flexibility and interactivity of the Cook Assistant, enabling users to engage in meaningful conversations and receive customized cooking assistance based on their specific needs and preferences.</p>
</li>
<li><p><mark>Setting up the Environment Variables for OpenAI API</mark></p>
<p>  Before we start building the Cook Bot, we need to set up the environment variables for the OpenAI API. The API key is a sensitive piece of information that should not be hard-coded into your codebase. Instead, it should be stored as an environment variable. Here are the steps to set up the environment variable:</p>
</li>
<li><ul>
<li><p>Log in to the OpenAI dashboard and copy your API key.</p>
</li>
<li><p>Open the terminal, create a project directory using <code>sudo mkdir cookAssistant</code>, and go into that directory using <code>cd cookAssistant</code>.</p>
</li>
<li><p>Create a new environment file using <code>sudo nano .env</code> and add the following line to the <code>.env</code> file:</p>
</li>
</ul>
</li>
</ul>
<pre><code class="lang-bash">OPENAI_API_KEY = <span class="hljs-string">"your api key here"</span>
</code></pre>
<ul>
<li><mark>Installing and loading dependencies</mark></li>
</ul>
<p>In this section, we are going to install the required dependencies using pip and import them into our Python script</p>
<pre><code class="lang-bash">pip install openai langchain streamlit streamlit_chat dotenv
</code></pre>
<ul>
<li><p><mark>Creating a new Python script using </mark> <code>nano app.py</code> <mark>in the project</mark> <code>cookAssistant</code> <mark>directory where the .env file is and paste the following code as shown step by step.</mark></p>
<pre><code class="lang-python">
  <span class="hljs-keyword">from</span> langchain.llms <span class="hljs-keyword">import</span> OpenAI
  <span class="hljs-keyword">from</span> langchain.chains <span class="hljs-keyword">import</span> LLMChain
  <span class="hljs-keyword">from</span> langchain.prompts <span class="hljs-keyword">import</span> PromptTemplate
  <span class="hljs-keyword">from</span> langchain.chains <span class="hljs-keyword">import</span> SimpleSequentialChain
  <span class="hljs-keyword">import</span> streamlit <span class="hljs-keyword">as</span> st
  <span class="hljs-keyword">from</span> streamlit_chat <span class="hljs-keyword">import</span> message
  <span class="hljs-keyword">from</span> langchain.chains <span class="hljs-keyword">import</span> ConversationChain
  <span class="hljs-keyword">from</span> langchain.llms <span class="hljs-keyword">import</span> OpenAI
  <span class="hljs-keyword">from</span> dotenv <span class="hljs-keyword">import</span> load_dotenv
  <span class="hljs-keyword">import</span> os
</code></pre>
</li>
<li><p><mark>Invoking the method to load the environment variable and creating a new instance of the language model</mark></p>
</li>
</ul>
<pre><code class="lang-python">load_dotenv()
token = os.environ.get(<span class="hljs-string">"OPENAI_API_KEY"</span>)
llm = OpenAI(temperature=<span class="hljs-number">1</span>, openai_api_key=token)
</code></pre>
<p>In the above code, the load_dotenv() function loads the environment variables from the .env file. The token variable retrieves the OpenAI API key stored in the OPENAI_API_KEY environment variable. Next, an instance of the OpenAI class is created with the provided API key (token) and a temperature of 1. The temperature parameter determines the randomness of the responses generated by the model.</p>
<ul>
<li><mark>Creating Location and Meal Chains</mark></li>
</ul>
<p>To provide location-based recommendations and meal-specific recipes, we create separate chains for location and meal inputs. We define prompt templates that incorporate the user's location and desired meal to generate contextually relevant responses. By utilizing LLMChain and PromptTemplate from LangChain, we establish the foundation for our cook assistant's knowledge and response generation as shown below.</p>
<pre><code class="lang-python"><span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">load_chain_loc</span>():</span>
    template = <span class="hljs-string">"""Your job is to come up with a classic dish from the area that the user suggests.
    % USER LOCATION
    {user_location}

    YOUR RESPONSE:
    """</span>
    prompt_template = PromptTemplate(input_variables=[<span class="hljs-string">"user_location"</span>], template=template)

    location_chain = LLMChain(llm=llm, prompt=prompt_template)
    <span class="hljs-keyword">return</span> location_chain
</code></pre>
<p>In the above code, The <code>load_chain_loc</code> function defines a template for generating prompts based on the user's location input. It uses the <code>PromptTemplate</code> class to create a template with the variable <code>{user_location}</code>. An instance of the <code>LLMChain</code> class is created, passing the OpenAI instance (<code>llm</code>) and the prompt template (<code>prompt_template</code>). The function returns the <code>location_chain</code> instance.</p>
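<p>Before wiring the chains together, you can exercise this one on its own; a minimal check is shown below (the printed dish is illustrative, as actual output varies from run to run):</p>
<pre><code class="lang-python">loc_chain = load_chain_loc()
print(loc_chain.run("Tokyo"))   # e.g. "Sushi" -- the model's answer varies
</code></pre>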
<pre><code class="lang-python"><span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">load_chain_meal</span>():</span>
    template = <span class="hljs-string">"""Given a meal, give a short and simple recipe on how to make that dish at home.
    % MEAL
    {user_meal}

    YOUR RESPONSE:
    """</span>
    prompt_template = PromptTemplate(input_variables=[<span class="hljs-string">"user_meal"</span>], template=template)

    meal_chain = LLMChain(llm=llm, prompt=prompt_template)
    <span class="hljs-keyword">return</span> meal_chain
</code></pre>
<p>In the above code, The <code>load_chain_meal</code> function defines a template for generating prompts based on the user's meal input. It uses the <code>PromptTemplate</code> class to create a template with the variable <code>{user_meal}</code>. An instance of the <code>LLMChain</code> class is created, passing the OpenAI instance (<code>llm</code>) and the prompt template (<code>prompt_template</code>). The function returns the <code>meal_chain</code> instance.</p>
<ul>
<li><mark>Building the Overall Chain</mark></li>
</ul>
<p>To integrate the location and meal chains, we create an overall chain using SimpleSequentialChain. This allows us to connect the chains and execute them sequentially, ensuring a smooth flow of information and generating cohesive responses. We configure the overall chain to provide verbose output for debugging and testing purposes.</p>
<pre><code class="lang-python">loc_chain = load_chain_loc()
chain_meal = load_chain_meal()
overall_chain = SimpleSequentialChain(chains=[loc_chain,chain_meal], verbose=<span class="hljs-literal">True</span>)
</code></pre>
<p>In the above code, the <code>load_chain_loc</code> function is called to create the <code>loc_chain</code> instance. The <code>load_chain_meal</code> function is called to create the <code>chain_meal</code> instance. Finally, an instance of the <code>SimpleSequentialChain</code> class is created, passing the <code>loc_chain</code> and <code>chain_meal</code> instances as a list; <code>verbose=True</code> prints each intermediate step, which is handy for debugging.</p>
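<p>Before adding the user interface, a quick sanity check makes the sequential flow visible: the location chain turns a city into a dish, and SimpleSequentialChain feeds that dish straight into the meal chain, which returns a recipe. A minimal sketch, with an illustrative city as input:</p>
<pre><code class="lang-python"># loc_chain maps the city to a classic dish; chain_meal turns that dish into a recipe.
# verbose=True prints both intermediate steps to the console.
recipe = overall_chain.run("Rome")
print(recipe)
</code></pre>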
<ul>
<li><p><mark>Streamlit User Interface</mark></p>
<p>  In this section, we implement the user interface using Streamlit. We set the page configuration, display the cook assistant's title, and prompt the user to enter the desired city for location-based food recommendations. As the user inputs their preferences, the cook assistant generates responses using the overall chain, and the conversation history is displayed using the Streamlit Chat component as shown below</p>
</li>
</ul>
<pre><code class="lang-python">st.set_page_config(page_title=<span class="hljs-string">" Cook bot"</span>, page_icon=<span class="hljs-string">":robot:"</span>)
st.title(<span class="hljs-string">"Cook bot powered with LLMs"</span>)
st.write(<span class="hljs-string">"By Maximilien "</span>)

<span class="hljs-keyword">if</span> <span class="hljs-string">"generated"</span> <span class="hljs-keyword">not</span> <span class="hljs-keyword">in</span> st.session_state:
    st.session_state[<span class="hljs-string">"generated"</span>] = []

<span class="hljs-keyword">if</span> <span class="hljs-string">"past"</span> <span class="hljs-keyword">not</span> <span class="hljs-keyword">in</span> st.session_state:
    st.session_state[<span class="hljs-string">"past"</span>] = []


<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">get_text</span>():</span>
    st.header(<span class="hljs-string">"enter the city you want to know the best food"</span>)
    input_text = st.text_input(<span class="hljs-string">""</span>, key=<span class="hljs-string">"input"</span>)
    <span class="hljs-keyword">return</span> input_text


user_input = get_text()

<span class="hljs-keyword">if</span> user_input:
    output = overall_chain.run(input=user_input)

    st.session_state.past.append(user_input)
    st.session_state.generated.append(output)
    st.write(output)
<span class="hljs-keyword">if</span> st.session_state[<span class="hljs-string">"generated"</span>]:

    <span class="hljs-keyword">for</span> i <span class="hljs-keyword">in</span> range(len(st.session_state[<span class="hljs-string">"generated"</span>]) - <span class="hljs-number">1</span>, <span class="hljs-number">-1</span>, <span class="hljs-number">-1</span>):
        message(st.session_state[<span class="hljs-string">"generated"</span>][i], key=str(i))
        message(st.session_state[<span class="hljs-string">"past"</span>][i], is_user=<span class="hljs-literal">True</span>, key=str(i) + <span class="hljs-string">"_user"</span>)
</code></pre>
<p>Let's break down this part of the code step by step:</p>
<ol>
<li><p><code>st.set_page_config(page_title=" Cook bot", page_icon=":robot:")</code>: This line configures the page settings for the Streamlit application. It sets the page title as "Cook bot" and assigns a robot icon to the page.</p>
</li>
<li><p><code>st.title("Cook bot powered with LLMs")</code>: This line displays a title on the Streamlit app interface, indicating that it is a Cook bot powered by LLMs.</p>
</li>
<li><p><code>st.write("By Maximilien ")</code>: This line displays the name or attribution "By Maximilien" on the Streamlit app interface.</p>
</li>
<li><p><code>if "generated" not in st.session_state: st.session_state["generated"] = []</code>: This code block checks if the "generated" key is present in the Streamlit session state. If not, it initializes an empty list assigned to the "generated" key. This list will store the generated responses.</p>
</li>
<li><p><code>if "past" not in st.session_state: st.session_state["past"] = []</code>: Similar to the previous code block, this block checks if the "past" key is present in the Streamlit session state. If not, it initializes an empty list assigned to the "past" key. This list will store the past user inputs.</p>
</li>
<li><p><code>def get_text(): ...</code>: This is a function definition for <code>get_text()</code>. It displays a header asking the user to enter the city for which they want to know the best food. It then uses <code>st.text_input()</code> to retrieve the user's input as a text string.</p>
</li>
<li><p><code>user_input = get_text()</code>: This line calls the <code>get_text()</code> function and assigns the returned user input to the <code>user_input</code> variable.</p>
</li>
<li><p><code>if user_input: ...</code>: This code block checks if the <code>user_input</code> variable has a non-empty value. If there is user input, it proceeds with the following steps.</p>
</li>
<li><p><code>output = overall_chain.run(input=user_input)</code>: This line executes the <code>run()</code> method of the <code>overall_chain</code> object, passing the user input as the <code>input</code> argument. It generates a response based on the user input using the chained LLMs.</p>
</li>
<li><p><code>st.session_state.past.append(user_input)</code>: This appends the user input to the "past" list stored in the Streamlit session state.</p>
</li>
<li><p><code>st.session_state.generated.append(output)</code>: This appends the generated output to the "generated" list stored in the Streamlit session state.</p>
</li>
<li><p><code>st.write(output)</code>: This line displays the generated output on the Streamlit app interface.</p>
</li>
<li><p><code>if st.session_state["generated"]: ...</code>: This code block checks if the "generated" list in the Streamlit session state is not empty. If it is not empty, it proceeds with the following steps.</p>
</li>
<li><p><code>for i in range(len(st.session_state["generated"]) - 1, -1, -1): ...</code>: This loop iterates over the "generated" list in reverse order using the <code>range()</code> function. It retrieves each generated response and its corresponding user input from the Streamlit session state.</p>
</li>
<li><p><code>message(st.session_state["generated"][i], key=str(i))</code>: This line displays the generated response as a message on the Streamlit app interface, using the <code>message()</code> function. Each message is assigned a unique key based on the loop index.</p>
</li>
<li><p><code>message(st.session_state["past"][i], is_user=True, key=str(i) + "_user")</code>: This line displays the corresponding user input as a user message (indicating that it was input by the user) on the Streamlit app interface. Each user message is also.</p>
<ul>
<li><mark>Conclusion</mark></li>
</ul>
</li>
</ol>
<p>    With the implementation of the cook assistant using LLMs, the OpenAI API, and LangChain, we have harnessed the power of language models and intelligent conversation chains to create a valuable tool for cooking enthusiasts. The cooking assistant provides personalized recommendations and recipes based on user inputs, empowering users to explore new cuisines and enhance their culinary skills. By combining advanced technologies, we are revolutionizing the cooking experience and paving the way for future innovations in the realm of intelligent kitchen assistants.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1684019960749/086a5463-90c7-45d6-a650-a8cab82f480e.png" alt class="image--center mx-auto" /></p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1684020011552/aa486f60-c781-4aaa-a404-0cea4bbb828d.png" alt class="image--center mx-auto" /></p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1684020050370/6c638d16-cc97-4d8b-9e88-f5eaf25dd9b3.png" alt class="image--center mx-auto" /></p>
<hr />
<p>The source code can be found on my github repository <a target="_blank" href="https://github.com/flexil/cook_assistant">here</a></p>
<hr />
<p>If you want to contribute or you find any errors in this article please do leave me a comment.</p>
<p>You can reach out to me on any of the matrix decentralized servers. My element messenger ID is <mark> <a href="https://hashnode.com/@maximilien" class="user-mention" target="_blank">@maximilien</a>:</mark><a target="_blank" href="http://matrix.org"><mark>matrix.org</mark></a></p>
<p>If you are in one of the mastodon decentralized servers, here is my ID <strong><mark>@maximilien@qoto.org</mark></strong></p>
<p>If you are on linkedIn, you can reach me <a target="_blank" href="https://www.linkedin.com/in/maximilien-kpizingui-48222775/">here</a></p>
<p>If you want to contact me via email for freelance <strong><mark>maximilien@tutanota.de</mark></strong></p>
<p>If you want to hire me to work on machine learning, data science, IoT and AI related projects, please reach out to me <a target="_blank" href="https://flexil.github.io/freelance/">here</a></p>
<p><code>Warm regards,</code></p>
<p><code>Maximilien.</code></p>
]]></content:encoded></item><item><title><![CDATA[@WeRateDogs Data Wrangling  Analysis And Visualization]]></title><description><![CDATA[Welcome back to this blog post in advanced data analytics nanodegree scholarship sponsored by Udacity. The project II WeRateDogs, a twitter account which has 9.3 million followers across the world at the point this blog is published contains three da...]]></description><link>https://maximilien.docquest.io/weratedogs-data-wrangling-analysis-and-visualization</link><guid isPermaLink="true">https://maximilien.docquest.io/weratedogs-data-wrangling-analysis-and-visualization</guid><category><![CDATA[dataanalytics]]></category><category><![CDATA[Data wrangling]]></category><category><![CDATA[#data visualisation]]></category><category><![CDATA[data analysis]]></category><category><![CDATA[Python]]></category><dc:creator><![CDATA[Maximilien]]></dc:creator><pubDate>Mon, 18 Jul 2022 07:43:36 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1657458915455/-2kio68FL.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Welcome back to this blog post in the advanced data analytics nanodegree scholarship sponsored by Udacity. Project II covers <strong>WeRateDogs</strong>, a twitter account with 9.3 million followers worldwide at the time this blog is published. The project involves three datasets, namely twitter-archive-enhanced.csv, image_predictions.tsv and tweet_json.txt, which we have to download programmatically from three different sources: two from website URLs, while the third requires the learner to sign up for a twitter developer account to collect additional tweets through the twitter API. The objective of this project is to challenge the learner to wrangle the three datasets, combine the three cleaned datasets into a twitter master archive, and provide at least four insights and two visualizations. Without delay, let's get into the data wrangling.</p>
<hr />
<p><strong> Table of Contents</strong><br /></p>
<ol>
<li>Loading the libraries<br /></li>
<li>Data gathering<br /></li>
<li>Reading twitter-archive-enhanced dataset into a dataframe<br /></li>
<li>Downloading image prediction dataframe using request<br /></li>
<li>Reading image prediction into a dataframe<br /></li>
<li>Query tweet_json dataset using twitter API<br /></li>
<li>Reading tweet_json into a dataframe<br /></li>
<li>Assessing data<br /></li>
<li>Objectives<br /></li>
<li>Methodology<br /></li>
<li>Visual assessment of twitter_archive_df<br /></li>
<li>Programmatic assessment of twitter_archive_df<br /></li>
<li>Visual assessment of image_predictions_df<br /></li>
<li>Programmatic assessment of image_predictions_df<br /></li>
<li>Visual assessment of tweet_status<br /></li>
<li>Programmatic assessment of tweet_status dataframe<br /></li>
<li>Project scope<br /></li>
<li>Quality issues<br /></li>
<li>Tidiness issues<br /></li>
<li>Cleaning data<br /></li>
<li>Making the copy of the dataframes<br /></li>
<li>Quality issue #8: Convert tweet_id column in image_predictions_clean from int to str<br /></li>
<li>Quality Issue #6: Inconsistent use of lowercase, uppercase and underscores in the p1, p2, p3 columns<br /></li>
<li>Quality Issue #7: Removing duplicated jpg_url values from image_predictions_df<br /></li>
<li>Quality Issue #4: Incorrect values and incorrect type for timestamp<br /></li>
<li>Quality Issue #1: Invalid and missing dog names in the column name<br /></li>
<li>Tidiness Issue #1: Column source contains HTML tags and hyperlinks<br /></li>
<li>Quality Issue #10: Removing RT @ in column text of twitter_archive_clean<br /></li>
<li>Quality Issue #5: Dropping columns in_reply_to_status_id, in_reply_to_user_id, retweeted_status_id, retweeted_status_user_id, retweeted_status_timestamp<br /></li>
<li>Tidiness Issue #2: Merging columns doggo, floofer, pupper, and puppo<br /></li>
<li>Tidiness Issue #4: Two pieces of information in the single text column: URL link and text<br /></li>
<li>Quality Issue #2 and #3: Invalid ratings, with values varying from 1776 to 0; the data type must be converted from int to float. Invalid denominator, where a fixed base of 10 is expected; the data type must be converted from int to float.<br /></li>
<li>Tidiness Issue #5: Merging twitter_archive_clean and image_predictions_clean<br /></li>
<li>Quality Issue #9: Incorrect data type in column id in tweet_status_clean<br /></li>
<li>Tidiness issue #6: Creating a dataframe with columns: id, favorite_count, retweet_count<br /></li>
<li>Tidiness issue #7: Merge tweet_status_clean and twitter_archive_merged<br /></li>
<li>Storing the master data into csv file<br /></li>
<li>ANALYSING AND VISUALIZING THE DATA<br /></li>
<li>Reading the twitter_archive_master.csv into a dataframe<br /></li>
<li>Insight about the clean master dataset<br /></li>
<li>Visualizing the hidden pattern in the dataset<br /></li>
<li>Function to plot the average count of tweets<br /></li>
<li>Visualizing the distribution of average favorite count of tweets based on the dog category<br /></li>
<li>Visualizing the distribution of average retweet count based on the dog category<br /></li>
<li>Visualizing Likes vs Retweets<br /></li>
<li>Visualizing the most used devices by WeRateDogs users<br /></li>
<li>Visualizing the most popular name of the dogs<br /></li>
<li>Visualizing 20 dog breeds from prediction P2 on WeRateDogs<br /></li>
<li>Visualizing 20 dog breeds from prediction P3 on WeRateDogs<br /></li>
<li>Visualizing the dog category with the highest score<br /></li>
<li>Visualizing the dog category with maximum favorite count<br /></li>
<li>Visualizing the dog category with minimum retweet count<br /></li>
<li>Visualizing total tweets made by WeRateDogs per month between 2015 and 2017<br /></li>
<li>Visualizing some of the dogs image prediction p1</li>
</ol>
<hr />
<p><strong>1. Loading the required libraries</strong></p>
<pre><code><span class="hljs-title">from</span> timeit <span class="hljs-keyword">import</span> default_timer <span class="hljs-keyword">as</span> timer
<span class="hljs-title">from</span> <span class="hljs-type">IPython</span>.display <span class="hljs-keyword">import</span> Image
<span class="hljs-title">from</span> tweepy <span class="hljs-keyword">import</span> OAuthHandler
<span class="hljs-keyword">import</span> matplotlib.pyplot <span class="hljs-keyword">as</span> plt
<span class="hljs-keyword">import</span> seaborn <span class="hljs-keyword">as</span> sns
<span class="hljs-keyword">import</span> datetime <span class="hljs-keyword">as</span> dt
<span class="hljs-keyword">import</span> pandas <span class="hljs-keyword">as</span> pd
<span class="hljs-keyword">import</span> numpy <span class="hljs-keyword">as</span> np
%matplotlib inline
<span class="hljs-keyword">import</span> requests
<span class="hljs-keyword">import</span> tweepy
<span class="hljs-keyword">import</span> json
<span class="hljs-keyword">import</span> re
</code></pre><p><strong>2. Data Gathering</strong></p>
<ul>
<li>Reading the twitter-archive-enhanced.csv</li>
</ul>
<pre><code>twitter_archive_df <span class="hljs-operator">=</span> pd.read_csv(<span class="hljs-string">'twitter-archive-enhanced.csv'</span>)
</code></pre><ul>
<li>Use the Requests library to download the tweet image prediction (image_predictions.tsv)</li>
</ul>
<pre><code><span class="hljs-comment"># Use Requests library to programmatically download tsv file from a website</span>
<span class="hljs-attribute">url</span> = 'https://d<span class="hljs-number">17</span>h<span class="hljs-number">27</span>t<span class="hljs-number">6</span>h<span class="hljs-number">515</span>a<span class="hljs-number">5</span>.cloudfront.net/topher/<span class="hljs-number">2017</span>/August/<span class="hljs-number">599</span>fd<span class="hljs-number">2</span>ad_image-predictions/image-predictions.tsv'
<span class="hljs-attribute">response</span> = requests.get(url)
<span class="hljs-comment"># Save tsv to file</span>
<span class="hljs-attribute">with</span> open('image_predictions.tsv', mode='wb') as file:
 <span class="hljs-attribute">file</span>.write(response.content)
</code></pre><ul>
<li>Reading image_predictions.tsv in dataframe</li>
</ul>
<pre><code>image_predictions_df <span class="hljs-operator">=</span> pd.read_csv(<span class="hljs-string">'image_predictions.tsv'</span>, sep<span class="hljs-operator">=</span><span class="hljs-string">'\t'</span>)
</code></pre><ul>
<li>Use the Tweepy library to query additional data via the Twitter API (tweet_json.txt).
One of the project requirements is to access the Twitter API to create tweet_json.txt, completing some missing or wrong values. I will use the tweepy package (a client library) to access the Twitter API.</li>
<li>Authentication Details: load personal API keys (replaced with placeholders)</li>
</ul>
<pre><code>consumer_key = ''
consumer_secret = ''
access_token = ''
access_secret = ''
<span class="hljs-comment"># variables for Twitter API connection</span>
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_secret)
api = tweepy.API(auth, wait_on_rate_limit = True)
</code></pre><ul>
<li>Collecting tweet data using API</li>
</ul>
<pre><code>tweet_ids = twitter_archive_df.tweet_id.values
len(tweet_ids)

<span class="hljs-comment"># Query Twitter's API for JSON data for each tweet ID in the Twitter archive</span>
count = <span class="hljs-number">0</span>
fails_dict = {}
start = timer()
<span class="hljs-comment"># Save each tweet's returned JSON as a new line in a .txt file</span>
<span class="hljs-keyword">with</span> open(<span class="hljs-string">'tweet_json.txt'</span>, <span class="hljs-string">'w'</span>) <span class="hljs-keyword">as</span> outfile:
    <span class="hljs-comment"># This loop will likely take 20-30 minutes to run because of Twitter's rate limit</span>
    <span class="hljs-keyword">for</span> tweet_id <span class="hljs-keyword">in</span> tweet_ids:
        count += <span class="hljs-number">1</span>
        print(str(count) + <span class="hljs-string">": "</span> + str(tweet_id))
        <span class="hljs-keyword">try</span>:
            tweet = api.get_status(tweet_id, tweet_mode=<span class="hljs-string">'extended'</span>)
            print(<span class="hljs-string">"Success"</span>)
            json.dump(tweet._json, outfile)
            outfile.write(<span class="hljs-string">'\n'</span>)
        <span class="hljs-keyword">except</span> Exception <span class="hljs-keyword">as</span> e:
            print(<span class="hljs-string">"Fail"</span>)
            fails_dict[tweet_id] = e
            <span class="hljs-keyword">pass</span>
end = timer()
print(end - start)
print(fails_dict)
</code></pre><ul>
<li>Reading tweet JSON content as pandas dataframe</li>
</ul>
<pre><code>tweet_status_df <span class="hljs-operator">=</span> pd.read_json(<span class="hljs-string">'tweet_json.txt'</span>, lines <span class="hljs-operator">=</span> True,encoding<span class="hljs-operator">=</span><span class="hljs-string">'utf-8'</span>)
</code></pre><p><strong>Assessing Data</strong></p>
<p><strong>Objectives</strong><br />
In this section, I detect and document at least eight (8) quality issues and two (2) tidiness issues using visual and programmatic assessment.
The issues fall into two types:</p>
<ul>
<li>Quality issues or dirty data: missing, duplicated, or incorrect data</li>
<li>Tidiness issues: messy or unstructured data.</li>
</ul>
<p><strong>Methodology</strong><br />
I use visual and programmatic assessment on each dataframe to detect the issues and document them.</p>
<p><strong>Visual assessment of twitter_archive_df</strong></p>
<ul>
<li>Displaying a random sample of three rows from twitter_archive_df</li>
</ul>
<pre><code>twitter_archive_df.sample(<span class="hljs-number">3</span>)
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1657490722565/Tz02X9UQB.png" alt="sam1.png" /></p>
<ul>
<li>We notice a quality issue in the name column because it contains invalid and missing names, which are not accurate</li>
<li>We notice a tidiness issue in the source column: HTML tags, URL, and content all in a single column.</li>
</ul>
<p><strong>Programmatic assessment of twitter_archive_df</strong></p>
<p>Through visual assessment, we found invalid names in the name column of the dataframe.</p>
<ul>
<li>Displaying the first 20 dog names</li>
</ul>
<pre><code>twitter<span class="hljs-emphasis">_archive_</span>df[<span class="hljs-string">'name'</span>][<span class="hljs-symbol">:20</span>]
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1657491036615/cbUqpiIOo.png" alt="sam1.png" /></p>
<ul>
<li>Counting the occurrences of the unique dog names</li>
</ul>
<pre><code><span class="hljs-selector-tag">twitter_archive_df</span><span class="hljs-selector-attr">[<span class="hljs-string">'name'</span>]</span><span class="hljs-selector-class">.value_counts</span>()<span class="hljs-selector-attr">[:20]</span>
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1657491291937/zs2U7riQy.png" alt="sam1.png" /></p>
<ul>
<li><p>We notice that 745 dogs in the dataframe have the name "None" and 55 dogs have the name "a"</p>
</li>
<li><p>Displaying the value counts of the rating_denominator column</p>
</li>
</ul>
<pre><code><span class="hljs-selector-tag">twitter_archive_df</span><span class="hljs-selector-attr">[<span class="hljs-string">'rating_denominator'</span>]</span><span class="hljs-selector-class">.value_counts</span>()<span class="hljs-selector-attr">[:20]</span>
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1657492401958/rBJRfO7WL.png" alt="sam1.png" /></p>
<ul>
<li><p>We notice incorrect values in the rating_denominator column: it should always have the same base, 10, so this raises an accuracy quality issue</p>
</li>
<li><p>Displaying the rating_numerator column</p>
</li>
</ul>
<pre><code><span class="hljs-selector-tag">twitter_archive_df</span><span class="hljs-selector-attr">[<span class="hljs-string">'rating_numerator'</span>]</span>
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1657896260500/YHK2ZIFEQ.png" alt="Screenshot (6).png" /></p>
<pre><code><span class="hljs-selector-tag">twitter_archive_df</span><span class="hljs-selector-attr">[<span class="hljs-string">'rating_numerator'</span>]</span><span class="hljs-selector-class">.value_counts</span>()<span class="hljs-selector-attr">[:20]</span>
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1657896465290/OhYx8xCX7.png" alt="Screenshot7.png" /></p>
<ul>
<li>The numerator values should range from 0 to 10, which is not the case, raising an accuracy quality issue</li>
</ul>
<p><strong>Displaying descriptive information about the twitter_archive_df</strong></p>
<pre><code>twitter_archive_df.info()
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1657896694052/D3BOhd_CG.png" alt="Screenshot (8).png" /></p>
<p>From the programmatic assessment we notice some quality issues:<br />- the timestamp column needs to be converted to datetime</p>
<pre><code><span class="hljs-selector-tag">twitter_archive_df</span><span class="hljs-selector-class">.pupper</span>
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1657896877045/UyzTJov4s.png" alt="Screenshot (9).png" /></p>
<pre><code><span class="hljs-selector-tag">twitter_archive_df</span><span class="hljs-selector-class">.doggo</span>
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1657897059306/HV9ANZNoy.png" alt="Screenshot (10).png" /></p>
<pre><code><span class="hljs-selector-tag">twitter_archive_df</span><span class="hljs-selector-class">.floofer</span>
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1657897228631/fRhmSBo1kI.png" alt="Screenshot (11).png" /></p>
<pre><code><span class="hljs-selector-tag">twitter_archive_df</span><span class="hljs-selector-class">.puppo</span>
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1657897469148/9N1boh6zx.png" alt="Screenshot (12).png" /></p>
<ul>
<li>From programmatic assessment we find a tidiness issue in the columns doggo, floofer, pupper, and puppo: they all encode the same kind of information and can be merged into one column</li>
</ul>
<p><strong>Checking for null values</strong> </p>
<pre><code>twitter_archive_df.isnull().sum()
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1657897687689/WGL279AKL.png" alt="Screenshot (15).png" /></p>
<ul>
<li>The in_reply_to_status_id, in_reply_to_user_id, retweeted_status_id, retweeted_status_user_id, and retweeted_status_timestamp columns have mostly null values, which brings about a quality issue; we need to drop them</li>
</ul>
<pre><code><span class="hljs-selector-tag">twitter_archive_df</span><span class="hljs-selector-class">.text</span><span class="hljs-selector-attr">[:20]</span>
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1657897833182/hst3BWY9D.png" alt="Screenshot (16).png" /></p>
<ul>
<li>Checking the occurrence of RT in the text column, as retweets must be removed per the project requirements</li>
</ul>
<pre><code>display(twitter_archive_df[twitter_archive_df[<span class="hljs-string">'text'</span>].str.contains(<span class="hljs-string">'RT @'</span>)])
print(<span class="hljs-string">'the number of RT in the text is:'</span>, sum(twitter_archive_df[<span class="hljs-string">'text'</span>].str.contains(<span class="hljs-string">'RT @'</span>)))
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1657898029061/Wvx-kbk4t.png" alt="Screenshot (17).png" /></p>
<ul>
<li><p>The number of RT occurrences in the text is 181</p>
</li>
<li><p>Programmatically, this is a quality issue because the text column contains both text and URLs, plus 181 retweets</p>
</li>
</ul>
<p><strong>Visual assessment of image_predictions_df</strong></p>
<pre><code>image_predictions_df.sample(<span class="hljs-number">4</span>)
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1657898384613/aHO-lvyua.png" alt="Screenshot (18).png" /></p>
<ul>
<li>From visual assessment we notice a quality issue in columns p1, p2 and p3: inconsistent spelling of the breed names, mixing uppercase, lowercase, and underscores</li>
</ul>
<p><strong>Programmatic assessment of image_prediction dataframe</strong></p>
<ul>
<li>Displaying the description of the image_predictions.tsv</li>
</ul>
<pre><code>image_predictions_df.info()
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1657898731324/mRAWiNiSI.png" alt="Screenshot (19).png" /></p>
<ul>
<li>From the descriptive information of the image_predictions dataframe, tweet_id needs to be converted to a string data type, since it is an identifier rather than a number used in calculations, so the conversion does not affect our analysis. We therefore raise a quality issue on the validity of the tweet_id data type</li>
</ul>
<p><strong>Checking for duplicated value</strong></p>
<pre><code><span class="hljs-selector-tag">sum</span>(<span class="hljs-selector-tag">image_predictions_df</span><span class="hljs-selector-attr">[<span class="hljs-string">'tweet_id'</span>]</span><span class="hljs-selector-class">.duplicated</span>())
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1657899007659/TiWxRU_Yf.png" alt="Screenshot (20).png" /></p>
<pre><code><span class="hljs-selector-tag">sum</span>(<span class="hljs-selector-tag">image_predictions_df</span><span class="hljs-selector-class">.jpg_url</span><span class="hljs-selector-class">.duplicated</span>())
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1657899266869/kqZ_0wvLZ.png" alt="Screenshot (20).png" /></p>
<ul>
<li>There are 66 duplicated jpg_url entries, bringing about a quality issue on the validity of the data</li>
</ul>
<p><strong>Visual assessment of tweet_status</strong></p>
<pre><code>tweet_status_df.sample(<span class="hljs-number">4</span>)
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1657906391336/7neWwhEfj.png" alt="Screenshot (22).png" /></p>
<ul>
<li>We notice a quality issue in the columns in_reply_to_status_id, retweeted_status, quoted_status_id, quoted_status_id_str, and quoted_status, which hold NaN values</li>
<li>We notice a tidiness issue in the columns source, entities, and extended_entities, which contain HTML tags and URLs;
full_text contains RT @, which needs to be removed per the project requirements</li>
</ul>
<p><strong>Programmatic assessment of tweet_status_df</strong></p>
<pre><code>tweet_status_df.isnull().sum()
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1658060325293/rz8he3kzE.png" alt="Screenshot (23).png" /></p>
<ul>
<li>We have a quality issue on the validity of the data in the columns listed below, which have a high number of NaN values; we therefore need to drop them:<br />
in_reply_to_status_id<br />
in_reply_to_status_id_str<br />
in_reply_to_user_id<br />
in_reply_to_user_id_str<br />
in_reply_to_screen_name<br />
geo<br />
coordinates<br />
place<br />
contributors<br />
retweeted_status<br />
quoted_status_id<br />
quoted_status_id_str<br />
quoted_status<br /></li>
</ul>
<p><strong>Project scope</strong></p>
<p>Based on the tidiness concept, twitter_archive_enhanced.csv, tweet_status_df, and image_predictions.tsv should be merged using tweet_id as the mapping key</p>
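<p>Conceptually, that merge looks like the sketch below. This is only a sketch: the actual joins are performed step by step in the cleaning section, after the data types are fixed and the id column of tweet_status_df is renamed to tweet_id.</p>
<pre><code># Sketch only: join the three raw tables on tweet_id
master_sketch = (twitter_archive_df
                 .merge(image_predictions_df, on='tweet_id', how='inner')
                 .merge(tweet_status_df.rename(columns={'id': 'tweet_id'}),
                        on='tweet_id', how='inner'))
</code></pre>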
<p><strong>Quality issues</strong></p>
<ul>
<li><p>Invalid and missing dog names in the name column of twitter_archive_df</p>
</li>
<li><p>Invalid ratings in the rating_numerator column of twitter_archive_df: values vary from 1776 to 0, and the data type needs to be converted from int to float</p>
</li>
<li><p>Invalid denominator in the rating_denominator column of twitter_archive_df: it should be a fixed base of 10, and the data type needs to be converted from int to float</p>
</li>
<li><p>The timestamp column in twitter_archive_df needs to be converted into the datetime data type</p>
</li>
<li><p>in_reply_to_status_id, in_reply_to_user_id, retweeted_status_id, retweeted_status_user_id, and retweeted_status_timestamp have missing data</p>
</li>
<li><p>Spelling inconsistency (uppercase, lowercase, underscores) in columns p1, p2 and p3 of image_predictions_df</p>
</li>
<li><p>The jpg_url column in image_predictions_df has 66 duplicated images; for the accuracy of the data we need to drop them</p>
</li>
<li><p>tweet_id in image_predictions_df needs to be converted to string</p>
</li>
<li><p>The id column in tweet_status_df needs to be converted from int to string</p>
</li>
<li><p>The id column needs to be renamed to tweet_id</p>
</li>
</ul>
<p><strong>Tidiness issues</strong></p>
<ul>
<li>HTML tags, URL, and content mixed in the source column</li>
<li>The columns doggo, floofer, pupper, and puppo all encode the same kind of information and can be merged into one column</li>
<li>twitter_archive_df, image_predictions_df, and tweet_status_df can be merged</li>
<li>There are two pieces of information in the single text column, plus ampersand escapes and \n characters</li>
<li>Only 3 useful columns in tweet_status_df: id, favorite_count, retweet_count</li>
<li>Retweets need to be removed from the text and full_text columns</li>
</ul>
<p><strong>Cleaning Data</strong></p>
<p>In this section, we clean all of the issues listed above while assessing the data.</p>
<p><strong>Making copies of the original dataframes</strong></p>
<pre><code>twitter_archive_clean <span class="hljs-operator">=</span> twitter_archive_df.copy()
image_predictions_clean <span class="hljs-operator">=</span> image_predictions_df.copy()
tweet_status_clean <span class="hljs-operator">=</span> tweet_status_df.copy()
</code></pre><p><strong>Quality Issue #8:</strong></p>
<p>Convert tweet_id column in image_predictions_clean from int to str</p>
<p><strong>Define:</strong></p>
<p> Converting tweet_id column from int to str</p>
<p><strong>Code</strong></p>
<pre><code>image_predictions_clean.tweet_id <span class="hljs-operator">=</span> image_predictions_clean.tweet_id.astype(str)
</code></pre><p><strong>Test</strong></p>
<pre><code><span class="hljs-keyword">type</span>(image_predictions_clean.tweet_id[<span class="hljs-number">0</span>])
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1658063412603/wPSJ3FVIM.png" alt="Screenshot (25).png" /></p>
<p><strong>Quality Issue #6:</strong></p>
<p>Inconsistent use of lowercase and uppercase and underscores in p1, p2,p3 columns</p>
<p><strong>Define:</strong></p>
<p> Replacing underscores ('_') with spaces and capitalizing the first letter.</p>
<p><strong>Code</strong></p>
<pre><code>image_predictions_clean.p1 <span class="hljs-operator">=</span> image_predictions_clean.p1.str.replace(<span class="hljs-string">'_'</span>, <span class="hljs-string">" "</span>).str.capitalize()
image_predictions_clean.p2 <span class="hljs-operator">=</span> image_predictions_clean.p2.str.replace(<span class="hljs-string">'_'</span>, <span class="hljs-string">" "</span>).str.capitalize()
image_predictions_clean.p3 <span class="hljs-operator">=</span> image_predictions_clean.p3.str.replace(<span class="hljs-string">'_'</span>, <span class="hljs-string">" "</span>).str.capitalize()
</code></pre><p><strong>Test </strong></p>
<pre><code><span class="hljs-selector-tag">image_predictions_clean</span><span class="hljs-selector-attr">[[<span class="hljs-string">'p1'</span>,<span class="hljs-string">'p2'</span>,<span class="hljs-string">'p3'</span>]</span>]<span class="hljs-selector-class">.sample</span>(6)
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1658063592519/I1TkhO99M.png" alt="Screenshot (26).png" /></p>
<p><strong>Quality Issue #7</strong></p>
<p>Removing duplicated jpg_url values from image_predictions_df</p>
<p><strong>Define:</strong></p>
<p>Indexing all the duplicated values in the jpg_url column, selecting the non-duplicated rows, and assigning them back to image_predictions_clean</p>
<p><strong>Code</strong></p>
<pre><code>indexing <span class="hljs-operator">=</span> image_predictions_df.jpg_url.duplicated()
indexing <span class="hljs-operator">=</span> np.logical_not(indexing)
image_predictions_clean<span class="hljs-operator">=</span> image_predictions_clean[indexing]
</code></pre><p><strong>Test</strong></p>
<pre><code><span class="hljs-selector-tag">print</span>("<span class="hljs-selector-tag">Before</span> <span class="hljs-selector-tag">cleaning</span>: {} <span class="hljs-selector-tag">rows</span>.\<span class="hljs-selector-tag">nAfter</span> <span class="hljs-selector-tag">cleaning</span>: {} <span class="hljs-selector-tag">rows</span>."<span class="hljs-selector-class">.format</span>(<span class="hljs-selector-tag">image_predictions_df</span><span class="hljs-selector-class">.shape</span><span class="hljs-selector-attr">[0]</span>,<span class="hljs-selector-tag">image_predictions_clean</span><span class="hljs-selector-class">.shape</span><span class="hljs-selector-attr">[0]</span>))
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1658063783520/qWPQK8ylb.png" alt="Screenshot (27).png" /></p>
<pre><code>print(<span class="hljs-string">"{} duplicated."</span>.format(sum(image_predictions_clean.jpg_url.duplicated())))
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1658063873421/9Tbr2oPoR.png" alt="Screenshot (28).png" /></p>
<p><strong>Quality Issue #4:</strong></p>
<p>Incorrect values and incorrect type for timestamp</p>
<p><strong>Define:</strong> </p>
<p>Converting timestamp column from object to datetime series</p>
<p><strong>Code</strong></p>
<pre><code>twitter_archive_clean[<span class="hljs-string">'timestamp'</span>] <span class="hljs-operator">=</span> twitter_archive_clean[<span class="hljs-string">'timestamp'</span>].astype(<span class="hljs-string">'datetime64[ns]'</span>)
</code></pre><p><strong>Test</strong></p>
<pre><code><span class="hljs-selector-tag">twitter_archive_clean</span><span class="hljs-selector-class">.timestamp</span><span class="hljs-selector-attr">[0]</span>
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1658064080946/MWSQff9Il.png" alt="Screenshot (29).png" /></p>
<p><strong>Quality Issue #1</strong></p>
<p>Invalid and missing dog names in the name column</p>
<p><strong>Define:</strong></p>
<p> Replacing invalid (lowercase) names and the placeholder "None" with null values</p>
<p><strong>Code</strong></p>
<pre><code>invalid_names<span class="hljs-operator">=</span>list(twitter_archive_clean[twitter_archive_clean.<span class="hljs-built_in">name</span>.
                                           str.contains(<span class="hljs-string">'^[a-z].*'</span>)].
                   <span class="hljs-built_in">name</span>.value_counts().index) <span class="hljs-operator">+</span> [<span class="hljs-string">'None'</span>]
twitter_archive_clean.loc[twitter_archive_clean.<span class="hljs-built_in">name</span>.apply(lambda x: x in invalid_names),<span class="hljs-string">'name'</span>]<span class="hljs-operator">=</span>None
</code></pre><p><strong>Test</strong></p>
<pre><code>(twitter_archive_clean.<span class="hljs-built_in">name</span>=<span class="hljs-operator">=</span><span class="hljs-string">'None'</span>).sum()
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1658064301785/wvobZ-_j-.png" alt="Screenshot (20).png" /></p>
<pre><code>(twitter_archive_clean.<span class="hljs-built_in">name</span>=<span class="hljs-operator">=</span><span class="hljs-string">'a'</span>).sum()
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1658064301785/wvobZ-_j-.png" alt="Screenshot (20).png" /></p>
<pre><code>(twitter_archive_clean.<span class="hljs-built_in">name</span>.apply(lambda x: x in invalid_names)).sum()
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1658064301785/wvobZ-_j-.png" alt="Screenshot (20).png" /></p>
<p><strong>Quality Issue #10:</strong></p>
<p>twitter_archive_clean contains retweets (RT @) in the text column</p>
<p><strong>Define</strong></p>
<p>As per the project specification, we only want original dog ratings, so we need to remove retweets (rows whose text starts with RT @). To fix this quality issue, we create boolean vectors indexing all the non-retweets, i.e. rows where retweeted_status_id, retweeted_status_user_id, and retweeted_status_timestamp are null, and keep only that subset.</p>
<p><strong>Code</strong></p>
<pre><code><span class="hljs-attr">indexing_retweeted_status_id</span> =twitter_archive_clean.retweeted_status_id.isnull()
<span class="hljs-attr">twitter_archive_clean</span> = twitter_archive_clean[indexing_retweeted_status_id]
<span class="hljs-attr">indexing_retweeted_status_user_id</span> = twitter_archive_clean.retweeted_status_user_id.isnull()
<span class="hljs-attr">twitter_archive_clean</span> = twitter_archive_clean[indexing_retweeted_status_user_id]
<span class="hljs-attr">indexing_retweeted_status_timestamp</span> = twitter_archive_clean.retweeted_status_timestamp.isnull()
<span class="hljs-attr">twitter_archive_clean</span> = twitter_archive_clean[indexing_retweeted_status_timestamp ]
</code></pre><p><strong>Test</strong></p>
<pre><code>print("Number of rows <span class="hljs-keyword">with</span> <span class="hljs-literal">true</span> <span class="hljs-keyword">in</span> retweeted_status_id:<span class="hljs-string">", sum(twitter_archive_clean.retweeted_status_id.isnull()))
print("</span><span class="hljs-built_in">Number</span> <span class="hljs-keyword">of</span> <span class="hljs-keyword">rows</span> <span class="hljs-keyword">with</span> <span class="hljs-literal">true</span> <span class="hljs-keyword">in</span> retweeted_status_timestamp:<span class="hljs-string">", sum(twitter_archive_clean.retweeted_status_timestamp.isnull()))
print("</span><span class="hljs-built_in">Number</span> <span class="hljs-keyword">of</span> <span class="hljs-keyword">rows</span> <span class="hljs-keyword">with</span> <span class="hljs-literal">true</span> <span class="hljs-keyword">in</span> retweeted_status_user_id:<span class="hljs-string">", sum(twitter_archive_clean.retweeted_status_user_id.isnull()))
print("</span><span class="hljs-built_in">Number</span> <span class="hljs-keyword">of</span> <span class="hljs-keyword">rows</span> <span class="hljs-keyword">of</span> twitter_archive_clean:<span class="hljs-string">",twitter_archive_clean.shape[0])</span>
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1658064619183/iW-U3y1bu.png" alt="Screenshot (32).png" /></p>
<pre><code>display(twitter_archive_clean[twitter_archive_clean[<span class="hljs-string">'text'</span>].str.contains(<span class="hljs-string">'RT @'</span>)])
print(<span class="hljs-string">'the number of RT in the text is:'</span>, sum(twitter_archive_clean[<span class="hljs-string">'text'</span>].str.contains(<span class="hljs-string">'RT @'</span>)))
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1658064731941/AMfHvY2cN.png" alt="Screenshot (33).png" /></p>
<pre><code><span class="hljs-selector-tag">twitter_archive_clean</span><span class="hljs-selector-attr">[<span class="hljs-string">'text'</span>]</span><span class="hljs-selector-class">.sample</span>(45)
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1658064839747/ybv3HeuFz.png" alt="Screenshot (34).png" /></p>
<p><strong>Tidiness Issue #1</strong></p>
<p>The source column contains HTML tags and hyperlinks</p>
<p><strong>Define:</strong></p>
<ul>
<li><p>Extracting the content between opening and closing tag using regular expressions.
Extracting the link.</p>
</li>
<li><p>Replacing source variable in the dataset with just the source name</p>
</li>
<li><p>Creating additional table with source link that we could use as a lookup table.</p>
</li>
</ul>
<p><strong>Code</strong></p>
<pre><code>source_link = twitter_archive_clean.source.str.extract(<span class="hljs-string">r'&lt;a href="(.+)" .+&gt;'</span>, expand=<span class="hljs-literal">False</span>)
source = twitter_archive_clean.source.str.extract(<span class="hljs-string">r'&gt;([A-z -]+)&lt;'</span>, expand=<span class="hljs-literal">False</span>)
twitter_archive_clean.source = source
</code></pre><pre><code>sources <span class="hljs-operator">=</span> pd.DataFrame({<span class="hljs-string">'source'</span>: source, <span class="hljs-string">'source_link'</span>: source_link})
sources.drop_duplicates(inplace<span class="hljs-operator">=</span>True)
</code></pre><pre><code>sources
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1658065027684/g2__ng8yu.png" alt="Screenshot (35).png" /></p>
<p><strong>Test</strong></p>
<pre><code><span class="hljs-selector-tag">twitter_archive_clean</span><span class="hljs-selector-class">.source</span><span class="hljs-selector-class">.value_counts</span>()
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1658065146375/a3xtWZuI8.png" alt="Screenshot (37).png" /></p>
<p><strong>Quality Issue #5</strong></p>
<p>Dropping columns in_reply_to_status_id, in_reply_to_user_id, retweeted_status_id,retweeted_status_user_id, retweeted_status_timestamp and expanded_urls</p>
<p><strong>Define:</strong></p>
<p>Removing the columns in_reply_to_status_id, in_reply_to_user_id ,retweeted_status_id,retweeted_status_user_id, retweeted_status_timestamp</p>
<p><strong>Code</strong></p>
<pre><code>columns_to_remove <span class="hljs-operator">=</span> [<span class="hljs-string">'in_reply_to_user_id'</span>, <span class="hljs-string">'in_reply_to_status_id'</span>,
                    <span class="hljs-string">'retweeted_status_id'</span>, <span class="hljs-string">'retweeted_status_user_id'</span>,
                    <span class="hljs-string">'retweeted_status_timestamp'</span>,<span class="hljs-string">'expanded_urls'</span>]
twitter_archive_clean.drop(columns_to_remove, axis<span class="hljs-operator">=</span><span class="hljs-number">1</span>, inplace<span class="hljs-operator">=</span>True)
</code></pre><p><strong>Test</strong></p>
<pre><code>twitter_archive_clean.<span class="hljs-keyword">columns</span>
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1658065279955/cTHemKCXc.png" alt="Screenshot (38).png" /></p>
<p><strong>Tidiness Issue #2</strong></p>
<p>Merging columns doggo, floofer,pupper, and puppo</p>
<p><strong>Define</strong></p>
<p>The columns doggo, floofer, pupper, and puppo hold the same kind of values and can therefore be merged into one feature</p>
<p><strong>Code</strong></p>
<pre><code>dog_cols <span class="hljs-operator">=</span> twitter_archive_clean[[<span class="hljs-string">'doggo'</span>,<span class="hljs-string">'floofer'</span>,<span class="hljs-string">'pupper'</span>,<span class="hljs-string">'puppo'</span>]]
dog_cols <span class="hljs-operator">=</span> dog_cols.replace(<span class="hljs-string">'None'</span>, <span class="hljs-string">''</span>) 
dog_category <span class="hljs-operator">=</span> np.array(dog_cols[<span class="hljs-string">'doggo'</span>]) <span class="hljs-operator">+</span> np.array(dog_cols[<span class="hljs-string">'floofer'</span>]) <span class="hljs-operator">+</span> np.array(dog_cols[<span class="hljs-string">'pupper'</span>]) <span class="hljs-operator">+</span> np.array(dog_cols[<span class="hljs-string">'puppo'</span>])
pd.DataFrame(dog_category, columns <span class="hljs-operator">=</span> [<span class="hljs-string">'dog_category'</span>]).dog_category.value_counts()
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1658065414369/FVK47B7gn.png" alt="Screenshot (39).png" /></p>
<ul>
<li>Appending this new column, called dog_category, to twitter_archive_clean</li>
</ul>
<pre><code>twitter_archive_clean.reset_index(drop<span class="hljs-operator">=</span>True, inplace<span class="hljs-operator">=</span>True)
twitter_archive_clean <span class="hljs-operator">=</span> pd.concat([twitter_archive_clean, pd.DataFrame(dog_category, columns <span class="hljs-operator">=</span> [<span class="hljs-string">'dog_category'</span>])], axis <span class="hljs-operator">=</span> <span class="hljs-number">1</span>)
</code></pre><ul>
<li>Dropping the individual columns after merging them</li>
</ul>
<pre><code>columns_to_remove <span class="hljs-operator">=</span> [<span class="hljs-string">'doggo'</span>,<span class="hljs-string">'floofer'</span>,<span class="hljs-string">'pupper'</span>,<span class="hljs-string">'puppo'</span>]
twitter_archive_clean.drop(columns_to_remove, axis<span class="hljs-operator">=</span><span class="hljs-number">1</span>, inplace<span class="hljs-operator">=</span>True)
</code></pre><p><strong>Test</strong></p>
<pre><code><span class="hljs-selector-tag">twitter_archive_clean</span><span class="hljs-selector-class">.dog_category</span><span class="hljs-selector-class">.value_counts</span>()
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1658065595280/kkxgjHL1z.png" alt="Screenshot (40).png" /></p>
<pre><code>twitter_archive_clean.<span class="hljs-keyword">columns</span>
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1658065716536/TMNJkEoOp.png" alt="Screenshot (42).png" /></p>
<p><strong>Tidiness Issue #4</strong></p>
<p>There are two pieces of information in the single text column.</p>
<p><strong>Define</strong></p>
<p>Removing the URL at the end of the text column, along with ampersand escapes and newline characters</p>
<p><strong>Code</strong></p>
<pre><code>twitter_archive_clean[<span class="hljs-string">'text'</span>] = twitter_archive_clean.text.str.replace(<span class="hljs-string">r'[(https://.+),(\&amp;amp;)|(\n)]'</span>,<span class="hljs-string">''</span>,regex=<span class="hljs-literal">True</span>)
</code></pre><p><strong>Test</strong></p>
<pre><code><span class="hljs-selector-tag">twitter_archive_clean</span><span class="hljs-selector-attr">[<span class="hljs-string">'text'</span>]</span><span class="hljs-selector-class">.sample</span>(45)
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1658065873974/YVFeMLuf3.png" alt="Screenshot (43).png" /></p>
<p><strong>Quality Issue #2 and #3:</strong></p>
<p>Invalid ratings: values vary from 1776 to 0, and the data type must be converted from int to float. Invalid denominator: a fixed base of 10 is expected, and the data type must be converted from int to float.</p>
<p><strong>Define</strong></p>
<ul>
<li>Convert rating_numerator and rating_denominator to float because @dog_rates uses float rating numbers.</li>
<li>Remove the extreme values (1776, 420, etc.) of rating_numerator; and</li>
<li>Remove unexpected denominator values, i.e. anything different from 10.</li>
</ul>
<p><strong>Code</strong></p>
<ul>
<li>Converting the rating_numerator and rating_denominator to float.</li>
</ul>
<pre><code>twitter_archive_clean.rating_numerator <span class="hljs-operator">=</span> twitter_archive_clean.rating_numerator.astype(float)
twitter_archive_clean.rating_denominator <span class="hljs-operator">=</span> twitter_archive_clean.rating_denominator.astype(float)
</code></pre><ul>
<li>From the visual assessment of rating_numerator done earlier, we need to drop the rows with the invalid ratings 1776, 420, and 204.</li>
</ul>
<pre><code>invalid_ratings_1776_420_204<span class="hljs-operator">=</span> twitter_archive_clean[(twitter_archive_clean.rating_numerator=<span class="hljs-operator">=</span><span class="hljs-number">1776</span>)<span class="hljs-operator">|</span>(twitter_archive_clean.rating_numerator=<span class="hljs-operator">=</span><span class="hljs-number">204</span>)<span class="hljs-operator">|</span>(twitter_archive_clean.rating_numerator=<span class="hljs-operator">=</span><span class="hljs-number">420</span>)].index
twitter_archive_clean.drop(invalid_ratings_1776_420_204, inplace<span class="hljs-operator">=</span>True)
</code></pre><ul>
<li>Removing tweet_ids with a 0/10 rating</li>
<li>Listing the tweet_ids having 0 as numerator and removing them from the dataframe</li>
</ul>
<pre><code>list(twitter_archive_clean.query(<span class="hljs-string">'rating_numerator == 0'</span>).tweet_id)
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1658066184662/P7ySl7Y5h.png" alt="Screenshot (44).png" /></p>
<pre><code><span class="hljs-attr">rm_list</span> = list(twitter_archive_clean.query(<span class="hljs-string">'rating_numerator == 0'</span>).tweet_id)
<span class="hljs-comment"># Creating a vector to subset twitter_archive_clean and remove the tweet_id from the rm_list.</span>
<span class="hljs-attr">indexing</span> = np.logical_not(twitter_archive_clean.tweet_id.isin(rm_list))
<span class="hljs-comment"># Updating the twitter_archive_clean data frame.</span>
<span class="hljs-attr">twitter_archive_clean</span> = twitter_archive_clean[indexing]
</code></pre><p><strong>Test</strong></p>
<pre><code>twitter_archive_clean.query(<span class="hljs-string">'rating_numerator == 0'</span>)
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1658066307417/xsQuud7KF.png" alt="Screenshot (45).png" /></p>
<ul>
<li>Querying denominators other than 10</li>
</ul>
<pre><code>list(twitter_archive_clean.query(<span class="hljs-string">'rating_denominator !=10'</span>).tweet_id)
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1658066402673/iVEftCQM_.png" alt="Screenshot (46).png" /></p>
<ul>
<li>Removing those three tweet_ids</li>
</ul>
<pre><code><span class="hljs-attr">rm_list</span> = list(twitter_archive_clean.query(<span class="hljs-string">'rating_denominator !=10'</span>).tweet_id)
<span class="hljs-comment"># Creating a vector to subset twitter_archive_clean and remove the tweet_id from the rm_list.</span>
<span class="hljs-attr">indexing</span> = np.logical_not(twitter_archive_clean.tweet_id.isin(rm_list))
<span class="hljs-comment"># Updating the twitter_archive_clean data frame.</span>
<span class="hljs-attr">twitter_archive_clean</span> = twitter_archive_clean[indexing]
</code></pre><p><strong>Test</strong></p>
<pre><code>twitter_archive_clean.query(<span class="hljs-string">'rating_denominator &lt; 10'</span>)
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1658066542568/k4qIgG2Gy.png" alt="Screenshot (48).png" /></p>
<pre><code>twitter_archive_clean.query(<span class="hljs-string">'rating_denominator &gt; 10'</span>)
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1658066542568/k4qIgG2Gy.png" alt="Screenshot (48).png" /></p>
<pre><code><span class="hljs-selector-tag">twitter_archive_clean</span><span class="hljs-selector-class">.rating_denominator</span><span class="hljs-selector-class">.value_counts</span>()
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1658066678538/T6XX9SQHY.png" alt="Screenshot (49).png" /></p>
<pre><code>twitter_archive_clean.tweet_id <span class="hljs-operator">=</span>twitter_archive_clean.tweet_id.astype(str)
</code></pre><p><strong>Tidiness Issue #5: Merging twitter_archive_clean and image_predictions_clean</strong></p>
<pre><code>twitter_archive_merged <span class="hljs-operator">=</span> twitter_archive_clean.copy()
twitter_archive_merged_df <span class="hljs-operator">=</span> twitter_archive_merged.merge(image_predictions_clean, on<span class="hljs-operator">=</span><span class="hljs-string">'tweet_id'</span>, how<span class="hljs-operator">=</span><span class="hljs-string">'inner'</span>)
</code></pre><pre><code>print(<span class="hljs-string">"Shape df_twitter_archive_clean: "</span> <span class="hljs-operator">+</span> str(twitter_archive_merged.shape))
print(<span class="hljs-string">"Shape image_predictions_clean: "</span> <span class="hljs-operator">+</span> str(image_predictions_clean.shape))
print(<span class="hljs-string">"Shape df_twitter_combined "</span> <span class="hljs-operator">+</span> str(twitter_archive_merged_df.shape) <span class="hljs-operator">+</span> <span class="hljs-string">" after joining df_twitter_archive_cleaned, df_tweet_performance and df_image_predictions_cleaned."</span>)
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1658066868022/7VC2f9QEjK.png" alt="Screenshot (50).png" /></p>
<p><strong>Test</strong></p>
<pre><code>twitter_archive_merged_df.info()
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1658066977181/Es7ZRSvrd.png" alt="Screenshot (51).png" /></p>
<p><strong>Quality Issue #9:</strong></p>
<p>Incorrect data type in column id in tweet_status_clean</p>
<p><strong>Define:</strong></p>
<p>Convert id from int to str</p>
<p><strong>Code</strong></p>
<pre><code>tweet_status_clean.id <span class="hljs-operator">=</span> tweet_status_clean.id.astype(<span class="hljs-string">'str'</span>)
tweet_status_clean.rename(columns<span class="hljs-operator">=</span>{<span class="hljs-string">'id'</span>:<span class="hljs-string">'tweet_id'</span>}, inplace<span class="hljs-operator">=</span>True)
</code></pre><p><strong>Test</strong></p>
<pre><code><span class="hljs-keyword">type</span>(tweet_status_clean.tweet_id[<span class="hljs-number">0</span>])
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1658067190316/vy3LpaU4g.png" alt="Screenshot (52).png" /></p>
<p><strong> Tidiness Issue #11:</strong></p>
<p>Removing RT @ in full_text column</p>
<p><strong>Define</strong></p>
<p>Remove the 'RT @' marker from the full_text column, as required by the project specification</p>
<p><strong>Code</strong></p>
<pre><code>display(tweet_status_clean[tweet_status_clean[<span class="hljs-string">'full_text'</span>].str.contains(<span class="hljs-string">'RT @'</span>)])
print(<span class="hljs-string">'the number of RT in the text is:'</span>, sum(tweet_status_clean[<span class="hljs-string">'full_text'</span>].str.contains(<span class="hljs-string">'RT @'</span>)))
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1658067410620/V0H2DAVQq.png" alt="Screenshot (54).png" /></p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1658067477378/AbAcSzni4.png" alt="Screenshot (55).png" /></p>
<ul>
<li>Removing the 160 'RT @' markers from the full_text column</li>
</ul>
<pre><code>tweet_status_clean[<span class="hljs-string">'full_text'</span>]= tweet_status_clean[<span class="hljs-string">'full_text'</span>].apply(<span class="hljs-keyword">lambda</span> x: re.sub(<span class="hljs-string">r'\bRT @\b'</span>,<span class="hljs-string">''</span>,x).strip())
</code></pre><p><strong>Test</strong></p>
<pre><code>sum(tweet_status_clean.full_text.str.contains(<span class="hljs-string">'RT @'</span>))
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1658067571181/FaUb-PwAh.png" alt="Screenshot (20).png" /></p>
<ul>
<li>All 'RT @' markers have been removed</li>
</ul>
<p><strong>Tidiness issue #6</strong></p>
<p>Creating a dataframe with columns: id, favorite_count, retweet_count</p>
<p><strong>Define</strong>
Keep only the useful columns: tweet_id, favorite_count and retweet_count</p>
<p><strong>Code</strong></p>
<pre><code><span class="hljs-attr">tweet_status_clean</span> = tweet_status_clean[[<span class="hljs-string">'tweet_id'</span>, <span class="hljs-string">'favorite_count'</span>, <span class="hljs-string">'retweet_count'</span> ]]
</code></pre><p><strong>Test</strong></p>
<pre><code>tweet_status_clean.sample(<span class="hljs-number">3</span>)
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1658067758641/7OUZl6BOx.png" alt="Screenshot (56).png" /></p>
<p><strong>Tidiness issue #7</strong></p>
<p>Merge tweet_status_clean and twitter_archive_merged_df</p>
<p><strong>Define</strong></p>
<p>Merging tweet_status_clean with twitter_archive_merged_df on tweet_id</p>
<p><strong>Code</strong></p>
<pre><code>twitter_archive_master <span class="hljs-operator">=</span> twitter_archive_merged_df.merge(tweet_status_clean, on<span class="hljs-operator">=</span><span class="hljs-string">'tweet_id'</span>, how<span class="hljs-operator">=</span><span class="hljs-string">'inner'</span>)
</code></pre><p><strong>Test</strong></p>
<pre><code>twitter_archive_master.info()
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1658068001369/UV2LMZ4JX.png" alt="Screenshot (57).png" /></p>
<p><strong>Storing Data</strong></p>
<p>Saving the clean merged dataframe into "twitter_archive_master.csv".</p>
<pre><code>twitter_archive_master.to_csv(<span class="hljs-string">'twitter_archive_master.csv'</span>, index<span class="hljs-operator">=</span>False)
</code></pre><p><strong>Analyzing and Visualizing Data</strong></p>
<p>In this section, we analyze the twitter archive master dataset and visualize the data</p>
<p>Reading and assessing the twitter_archive_master.csv into a dataframe</p>
<pre><code>master_df <span class="hljs-operator">=</span>pd.read_csv(<span class="hljs-string">"twitter_archive_master.csv"</span>)
</code></pre><pre><code>master_df.sample(<span class="hljs-number">3</span>)
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1658068594089/yyWofqEAx.png" alt="Screenshot (58).png" /></p>
<p><strong>Insights:</strong>
In this section, we are interested in finding hidden patterns in the clean twitter archive master dataset.</p>
<ul>
<li>Visualizing the distribution of dog category based on the favorite tweet count</li>
<li>Visualizing the distribution of dog category based on the retweet count</li>
<li>Visualizing the most used devices by WeRateDogs users</li>
<li>Visualizing the most common dog names</li>
<li>Line plot of likes and retweets on the WeRateDogs account</li>
<li>Visualizing the top 20 dog breeds from prediction p2</li>
<li>Visualizing the top 20 dog breeds from prediction p3</li>
<li>Visualizing the dog category with the highest rating score</li>
<li>Visualizing the dog category with the maximum favorite count</li>
<li>Visualizing the dog category with the minimum retweet count</li>
<li>Visualizing total tweets made by WeRateDogs per month</li>
<li>Visualizing some of the dog images from prediction p1</li>
</ul>
<p><strong>Visualization</strong></p>
<p>Define a function to plot the average of a numeric feature per dog category</p>
<pre><code>def distributionPlot(feature, ylabel, title):
    # Average the chosen feature per dog category
    agg = master_df[['dog_category', feature]].groupby('dog_category', as_index=False).mean()
    print(agg)
    # Bar plot of the aggregated values
    f, ax = plt.subplots(1, 1, figsize=(12, 4))
    sns.barplot(data=agg, x='dog_category', y=feature, color='green', ax=ax)
    ax.set_ylabel(ylabel)
    ax.set_xlabel("Dog category")
    ax.set_title(title)
</code></pre><p><strong>Plotting the distribution of average favorite count based on the dog category</strong></p>
<pre><code>distributionPlot('favorite_count', 'average favorite count', "Distribution of average favorite count based on the dog category")
plt.savefig('av_favcounte.png')
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1658068868122/jQEjRAOjT.png" alt="Screenshot (59).png" /></p>
<p>Among the dog categories:<br /></p>
<ul>
<li>puppo has the highest average favorite count at 19573.545455</li>
<li>followed by doggo with 17599.225806</li>
<li>multiclass has 15008.909091</li>
<li>floofer has 11223.857143</li>
<li>pupper has the lowest at 6204.975369</li>
</ul>
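<p>To read this ranking directly from the data, the per-category averages can be sorted in one line (a quick sketch over the same dataframe):</p>
<pre><code>master_df.groupby('dog_category')['favorite_count'].mean().sort_values(ascending=False)
</code></pre>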
<p><strong>Plotting the distribution of average retweet count based on the dog category</strong></p>
<pre><code><span class="hljs-selector-tag">distributionPlot</span>(<span class="hljs-string">'retweet_count'</span>,<span class="hljs-string">'average retweet count'</span>,<span class="hljs-string">' Distribution of average retweet count based on the dog category'</span>)
<span class="hljs-selector-tag">plt</span><span class="hljs-selector-class">.savefig</span>(<span class="hljs-string">'av_retweet_count.png'</span>)
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1658069036256/KCOVCQxr3.png" alt="Screenshot (60).png" /></p>
<p>Among the dog categories:<br /></p>
<ul>
<li>doggo has the highest average retweet count at 5972.709677</li>
<li>followed by puppo with 5325.318182</li>
<li>multiclass has 4548.272727</li>
<li>pupper has 5325.318182</li>
<li>floofer has the lowest at 1909.453202</li>
</ul>
<p><strong>Plotting likes vs retweets</strong></p>
<pre><code>likes <span class="hljs-operator">=</span> pd.Series(data<span class="hljs-operator">=</span>master_df[<span class="hljs-string">'favorite_count'</span>].values, index<span class="hljs-operator">=</span>master_df[<span class="hljs-string">'timestamp'</span>])
retweets <span class="hljs-operator">=</span> pd.Series(data<span class="hljs-operator">=</span>master_df[<span class="hljs-string">'retweet_count'</span>].values, index<span class="hljs-operator">=</span>master_df[<span class="hljs-string">'timestamp'</span>])
</code></pre><pre><code>likes.plot(figsize<span class="hljs-operator">=</span>(<span class="hljs-number">16</span>,<span class="hljs-number">4</span>), label<span class="hljs-operator">=</span><span class="hljs-string">'Favorites'</span>, color<span class="hljs-operator">=</span><span class="hljs-string">'orange'</span>, legend<span class="hljs-operator">=</span>True)
retweets.plot(figsize<span class="hljs-operator">=</span>(<span class="hljs-number">16</span>,<span class="hljs-number">4</span>), label<span class="hljs-operator">=</span><span class="hljs-string">'Retweets'</span>, color<span class="hljs-operator">=</span><span class="hljs-string">'maroon'</span>, legend<span class="hljs-operator">=</span>True);
plt.title('No. of Favorites and Retweets Over Time')
plt.savefig('retvs.png')   # save before plt.show() so the saved figure is not blank
plt.show()
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1658069209272/wVKXn2uOC.png" alt="Screenshot (61).png" /></p>
<p>From the plot of likes vs retweets, we notice that the WeRateDogs Twitter account started on December 11, 2015 with zero likes and retweets. On June 23, 2016, over 65k users retweeted a post and about 140k marked posts as favorites. On February 16, 2017 the account gained more popularity: over 100k users favorited its posts while about 25k retweeted them. By August 1, 2017 retweets had dwindled and only about 58k users favorited the dog posts.</p>
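<p>Because master_df was re-read from CSV, its timestamp column comes back as plain strings; parsing the index and resampling to monthly averages gives a smoother view of the same trend. A minimal sketch reusing the two series defined above:</p>
<pre><code># parse the string index back into datetimes, then average per month
likes.index = pd.to_datetime(likes.index)
retweets.index = pd.to_datetime(retweets.index)
likes.resample('M').mean().plot(figsize=(16,4), label='Favorites (monthly avg)', legend=True)
retweets.resample('M').mean().plot(label='Retweets (monthly avg)', legend=True)
</code></pre>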
<p><strong>Visualizing the most used devices by WeRateDogs users</strong></p>
<pre><code>print(twitter_archive_clean.source.value_counts())
twitter_archive_clean.source.value_counts().plot(kind='bar', figsize=(11,5), title='Most used Twitter source').set_ylabel("Number of Tweets")
plt.savefig('twitter_source.png')
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1658069386321/qerZrhETY.png" alt="Screenshot (62).png" /></p>
<ul>
<li>From the bar plot above, most of the WeRateDogs users used Twitter for iPhone</li>
</ul>
<p><strong>Displaying the most common names among the dogs</strong></p>
<pre><code><span class="hljs-selector-tag">master_df</span><span class="hljs-selector-attr">[<span class="hljs-string">'name'</span>]</span><span class="hljs-selector-class">.value_counts</span>()<span class="hljs-selector-attr">[1:13]</span>
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1658069542704/fC9VyCEDd.png" alt="Screenshot (63).png" /></p>
<p><strong>Visualizing the most popular name of the dogs</strong></p>
<pre><code># Take the 12 most frequent names, skipping index 0 (likely the placeholder/missing-name entry)
sorted_order = master_df['name'].value_counts()[1:13].index
display(sorted_order)
plt.figure(figsize=(10,4))
sns.countplot(data=master_df, x='name', order=sorted_order)
plt.xlabel('Name', fontsize=16)
plt.ylabel('Count', fontsize=16)
plt.title('Most popular dog names')
plt.savefig('popular_dog_name.png')
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1658069669417/ttndSzBGR.png" alt="Screenshot (64).png" /></p>
<ul>
<li>From the above plot, the most common names among the dogs are Charlie, Oliver, Lucy, Tucker, Penny, Winston, Sadie, Toby, Daisy, Lola, Koda and Bo</li>
</ul>
<p><strong>Visualizing 20 dogs breed P2 predicted by twitter user on WeRateDogs</strong></p>
<pre><code>breeds<span class="hljs-operator">=</span> master_df.p2.value_counts().head(<span class="hljs-number">20</span>)
display(breeds)
plt.barh(breeds.index , breeds,color<span class="hljs-operator">=</span><span class="hljs-string">'red'</span>)
plt.xlabel(<span class="hljs-string">'Count'</span>)
plt.ylabel(<span class="hljs-string">'Dog Breed'</span>)
plt.title(<span class="hljs-string">'Top 20 Dog Breeds prediction p2 on tweet'</span>)
plt.gca().invert_yaxis()
plt.savefig('20p2_dog_pred.png')   # save before plt.show() so the saved figure is not blank
plt.show()
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1658228989613/NxRQGpoKL.png" alt="update.png" /></p>
<ul>
<li>From the above plot, the most frequently predicted breed (p2) is Labrador retriever</li>
</ul>
<p><strong>Visualizing 20 dogs breed P3 predicted by twitter user on WeRateDogs</strong></p>
<pre><code>breeds<span class="hljs-operator">=</span> master_df.p3.value_counts().head(<span class="hljs-number">20</span>)
display(breeds)
plt.barh(breeds.index , breeds,color<span class="hljs-operator">=</span><span class="hljs-string">'purple'</span>)
plt.xlabel(<span class="hljs-string">'Count'</span>)
plt.ylabel(<span class="hljs-string">'Dog Breed'</span>)
plt.title('Top 20 Dog Breeds prediction p3 on tweet')
plt.gca().invert_yaxis()
plt.savefig('20d0gs_p3.png')   # save before plt.show() so the saved figure is not blank
plt.show()
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1658070198799/7BLHSav4k.png" alt="Screenshot (67).png" /></p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1658070279206/G4-l_7Mmg.png" alt="Screenshot (68).png" /></p>
<ul>
<li>From the above plot, the most frequently predicted breed (p3) is Labrador retriever</li>
</ul>
<p><strong>Visualizing the dog category with the highest score</strong></p>
<ul>
<li>Creating a new column called rating in master_df</li>
</ul>
<pre><code>master_df[<span class="hljs-string">'rating'</span>] <span class="hljs-operator">=</span> master_df[<span class="hljs-string">'rating_numerator'</span>]<span class="hljs-operator">/</span>master_df[<span class="hljs-string">'rating_denominator'</span>]
</code></pre><pre><code>dog_rating<span class="hljs-operator">=</span>master_df.groupby(<span class="hljs-string">'dog_category'</span>)[<span class="hljs-string">'rating'</span>].<span class="hljs-built_in">max</span>()
print(dog_rating)
dog_rating.plot.bar(figsize<span class="hljs-operator">=</span>(<span class="hljs-number">10</span>,<span class="hljs-number">5</span>))
plt.ylim(top<span class="hljs-operator">=</span><span class="hljs-number">10</span>)
plt.title(<span class="hljs-string">"Dog category with the highest rating score"</span>)
plt.xlabel(<span class="hljs-string">"Dog category"</span>)
plt.legend([<span class="hljs-string">" Rating"</span>])
plt.savefig(<span class="hljs-string">'rating.png'</span>)
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1658070475735/_W2nNH4zu.png" alt="Screenshot (69).png" /></p>
<ul>
<li>From the plot above, pupper is the dog category with the highest rating score</li>
</ul>
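<p>The same conclusion can be read programmatically from the aggregate computed above; a one-line sketch:</p>
<pre><code>dog_rating.idxmax()   # the category whose maximum rating is highest, e.g. 'pupper'
</code></pre>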
<p><strong>Visualizing the dog category with maximum favorite count</strong></p>
<pre><code>dog_favorite_count<span class="hljs-operator">=</span>master_df.groupby(<span class="hljs-string">'dog_category'</span>)[<span class="hljs-string">'favorite_count'</span>].<span class="hljs-built_in">max</span>()
display(dog_favorite_count)
dog_favorite_count.plot.bar(figsize<span class="hljs-operator">=</span>(<span class="hljs-number">10</span>,<span class="hljs-number">5</span>))
plt.title(<span class="hljs-string">"Dog category favorite count"</span>)
plt.xlabel(<span class="hljs-string">"Dog category"</span>)
plt.legend([<span class="hljs-string">"favorite_count"</span>])
plt.savefig(<span class="hljs-string">'favorite.png'</span>)
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1658070651495/kSSKEngVy.png" alt="Screenshot (70).png" /></p>
<ul>
<li>From the above plot doggo is the dog with maximum twitter favorite count</li>
</ul>
<p><strong>Visualizing the dog category with minimum retweet count</strong></p>
<pre><code>dog_retweet_count<span class="hljs-operator">=</span>master_df.groupby(<span class="hljs-string">'dog_category'</span>)[<span class="hljs-string">'retweet_count'</span>].<span class="hljs-built_in">min</span>()
display(dog_retweet_count)
dog_retweet_count.plot.bar(figsize<span class="hljs-operator">=</span>(<span class="hljs-number">10</span>,<span class="hljs-number">5</span>))
plt.title("Dog category with minimum retweet count")
plt.xlabel("Dog category")
plt.legend(["retweet_count"])
plt.savefig(<span class="hljs-string">'retweet.png'</span>)
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1658070798138/jOs4EBHR-.png" alt="Screenshot (71).png" /></p>
<ul>
<li>From the above plot pupper has the least retweet count</li>
</ul>
<p><strong>Visualizing Total Tweets made by WeRateDogs per month between 2015 and 2017</strong></p>
<pre><code>twitter_archive_master[<span class="hljs-string">'year_month_date'</span>] <span class="hljs-operator">=</span> twitter_archive_master[<span class="hljs-string">'timestamp'</span>].dt.year.astype(str) <span class="hljs-operator">+</span> <span class="hljs-string">'-'</span> <span class="hljs-operator">+</span> \
                                            twitter_archive_master[<span class="hljs-string">'timestamp'</span>].dt.month.astype(str).str.pad(<span class="hljs-number">2</span>, fillchar<span class="hljs-operator">=</span><span class="hljs-string">'0'</span>)
twitter_archive_master[<span class="hljs-string">'is_tweet'</span>] <span class="hljs-operator">=</span> np.where(twitter_archive_master.tweet_id.notnull(), <span class="hljs-number">1</span>, <span class="hljs-number">0</span>)
twitter_archive_monthly_tweets <span class="hljs-operator">=</span> twitter_archive_master.groupby(<span class="hljs-string">'year_month_date'</span>).is_tweet.sum().reset_index()
plt.xticks(rotation<span class="hljs-operator">=</span><span class="hljs-number">45</span>)
ax <span class="hljs-operator">=</span> sns.lineplot(x<span class="hljs-operator">=</span><span class="hljs-string">'year_month_date'</span>, y<span class="hljs-operator">=</span><span class="hljs-string">'is_tweet'</span>, data<span class="hljs-operator">=</span>twitter_archive_monthly_tweets)
ax.set_title('Total Tweets made by WeRateDogs per month between 2015 and 2017')
ax.set_xlabel(<span class="hljs-string">'Date'</span>)
ax.set_ylabel(<span class="hljs-string">'Number of Tweets'</span>)
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1658070920488/3rRUwoWeH.png" alt="Screenshot (72).png" /></p>
<ul>
<li>From the plot above, we notice that in November 2015 WeRateDogs posted around 350 tweets in a single month. The volume then gradually decreased, and by August 2017 the monthly tweet count had dropped to nearly zero.</li>
</ul>
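<p>As a side note, the year-month key built above with string concatenation can also be derived in one step using pandas period support; an equivalent sketch:</p>
<pre><code># equivalent monthly grouping using a PeriodIndex instead of string keys
monthly_counts = twitter_archive_master.groupby(
    twitter_archive_master['timestamp'].dt.to_period('M')).tweet_id.count()
</code></pre>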
<p><strong>Visualizing some of the dog images from prediction p1</strong></p>
<pre><code><span class="hljs-selector-tag">twitter_archive_master</span><span class="hljs-selector-attr">[[<span class="hljs-string">'jpg_url'</span>,<span class="hljs-string">'p1'</span>]</span>]
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1658071053726/vmnkMVRNN.png" alt="Screenshot (73).png" /></p>
<p><strong> Visualizing Miniature_pinscher</strong></p>
<pre><code><span class="hljs-selector-tag">Image</span>(<span class="hljs-string">"https://pbs.twimg.com/media/CT5IQmsXIAAKY4A.jpg"</span>)
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1658071331444/95ue3DYFJ.png" alt="Miniature_pinscher.png" /></p>
<ul>
<li>Programmatically download Miniature_pinscher image from the url </li>
</ul>
<pre><code>response = requests.get("https://pbs.twimg.com/media/CT5IQmsXIAAKY4A.jpg")
# write the downloaded bytes to disk; the context manager closes the file automatically
with open("Miniature_pinscher.png", "wb") as file:
    file.write(response.content)
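
# The same steps can be wrapped in a small reusable helper (a sketch;
# download_image is a hypothetical name, not part of the original notebook):
def download_image(url, filename):
    # fetch the image and fail loudly on a bad HTTP status
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    with open(filename, "wb") as f:
        f.write(response.content)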
</code></pre><p><strong> Visualizing  Rhodesian ridgeback</strong></p>
<pre><code><span class="hljs-selector-tag">Image</span>(<span class="hljs-string">'https://pbs.twimg.com/media/CT5Dr8HUEAA-lEu.jpg'</span>)
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1658071571186/LPQrpMEXV.png" alt="Rhodesian ridgeback.png" /></p>
<ul>
<li>Programmatically download Rhodesian ridgeback image from the url </li>
</ul>
<pre><code>response = requests.get("https://pbs.twimg.com/media/CT5Dr8HUEAA-lEu.jpg")
# write the downloaded bytes to disk; the context manager closes the file automatically
with open("Rhodesian ridgeback.png", "wb") as file:
    file.write(response.content)
</code></pre><p><strong>Visualizing German Shepherd</strong></p>
<pre><code><span class="hljs-selector-tag">Image</span>(<span class="hljs-string">'https://pbs.twimg.com/media/CT4521TWwAEvMyu.jpg'</span>)
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1658072168982/gaOtKI3AX.png" alt="german_sheperd.png" /></p>
<ul>
<li>Programmatically download the German Shepherd image from the url</li>
</ul>
<pre><code>response = requests.get("https://pbs.twimg.com/media/CT4521TWwAEvMyu.jpg")
# write the downloaded bytes to disk; the context manager closes the file automatically
with open("german_sheperd.png", "wb") as file:
    file.write(response.content)
</code></pre><p><strong>Conclusion</strong><br />
We have reached the end of this data wrangling and visualization journey.
Through our analysis, we found the following among the dog categories:</p>
<p><strong>Tweet favorite counts</strong></p>
<ul>
<li>puppo has the highest at 19573.545455</li>
<li>doggo has 17599.225806</li>
<li>multiclass has 15008.909091</li>
<li>floofer has 11223.857143</li>
<li>pupper has the lowest at 6204.975369</li>
</ul>
<p><strong>Retweet counts</strong></p>
<ul>
<li>doggo has the highest at 5972.709677</li>
<li>followed by puppo with 5325.318182</li>
<li>multiclass has 4548.272727</li>
<li>pupper has 5325.318182</li>
<li>floofer has the lowest at 1909.453202</li>
</ul>
<p><strong>Most used devices by WeRateDogs users</strong></p>
<ul>
<li>Twitter for iPhone: 2016</li>
<li>Vine - Make a Scene: 91</li>
<li>Twitter Web Client: 31</li>
<li>TweetDeck: 10</li>
</ul>
<p><strong>Dog category with the highest rating score</strong></p>
<ul>
<li>pupper        2.7</li>
<li>doggo         1.4</li>
<li>puppo         1.4</li>
<li>floofer       1.3</li>
<li>multiclass    1.3</li>
</ul>
<p><strong>Dog category with maximum favorite count</strong></p>
<ul>
<li>doggo         144774</li>
<li>puppo         124103</li>
<li>pupper        108900</li>
<li>multiclass     49401</li>
<li>floofer          28112</li>
</ul>
<p>Finally, people preferred to <strong>favorite</strong> a dog post over a <strong>retweet</strong>, and both actions declined toward the end of the 2015 to 2017 period.</p>
<p>The GitHub repository with the datasets, the wrangling report and the Jupyter notebook can be found <a target="_blank" href="https://github.com/flexil/-WeRateDogs">here</a>.</p>
<hr />
<p>If you want to contribute or you find any errors in this article, please do leave me a comment.</p>
<p>You can reach me on any of the Matrix decentralized servers. My Element messenger ID is <strong>@maximilien:matrix.org</strong></p>
<p>If you are on one of the Mastodon decentralized servers, here is my ID <strong>@maximilien@qoto.org</strong></p>
<p>If you are on LinkedIn, you can reach me <a target="_blank" href="https://www.linkedin.com/in/ephrem-maximilien-kpizingui-48222775/">here</a></p>
<p>If you want to contact me via email for freelance work: <strong>maximilien@tutanota.de</strong></p>
<p>If you want to hire me to work on machine learning, data science, IoT, or AI related projects, please reach out to me <a target="_blank" href="https://flexil.github.io/freelance/">here</a></p>
<blockquote>
<p>Warm regards,<br />
Maximilien.</p>
</blockquote>
]]></content:encoded></item><item><title><![CDATA[Medical Appointments No-Show Data Analysis And Visualization From A-Z]]></title><description><![CDATA[Welcome back to this blog post on medical appointment No- show project which consists of 100k medical appointments in Brazil  as part of the Advanced Data Analytics Nanodegree Scholarship by Udacity  which seeks to  equip and to train young Africans ...]]></description><link>https://maximilien.docquest.io/medical-appointments-no-show-data-analysis-and-visualization-from-a-z</link><guid isPermaLink="true">https://maximilien.docquest.io/medical-appointments-no-show-data-analysis-and-visualization-from-a-z</guid><category><![CDATA[dataanalytics]]></category><category><![CDATA[#data visualisation]]></category><category><![CDATA[Data wrangling]]></category><category><![CDATA[doctors]]></category><category><![CDATA[hospitals]]></category><dc:creator><![CDATA[Maximilien]]></dc:creator><pubDate>Thu, 16 Jun 2022 15:32:25 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1655338782211/KGFPATjK7.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Welcome back! This blog post covers the medical appointment no-show project, based on a dataset of over 100k medical appointments in Brazil. The project is part of the Advanced Data Analytics Nanodegree Scholarship by Udacity, which seeks to equip and train young Africans with digital skills for remote work and local market opportunities. The dataset collects information from more than 100k medical appointments in Brazil and focuses on whether or not patients show up for their appointments. Without further delay, let us get into the data wrangling.</p>
<hr />
<p><strong>Table of Contents</strong></p>
<p><strong> INTRODUCTION</strong></p>
<p><strong>1- Research  questions</strong></p>
<p><strong>2- DATA WRANGLING</strong></p>
<p><strong>3-  Section objectives</strong></p>
<p><strong>4- Loading libraries</strong></p>
<p><strong>5- Loading the dataset</strong></p>
<p><strong>6- Exploring data</strong></p>
<p><strong>7- Descriptive information about the dataset</strong></p>
<p><strong>8- Shape of the dataset</strong></p>
<p><strong>9- Statistical data</strong></p>
<p><strong>10- DATA CLEANING</strong></p>
<p><strong>11- Renaming columns</strong></p>
<p><strong>12- Converting date</strong></p>
<p><strong>13- Filtering row Age with -1</strong></p>
<p><strong>14- Dropping negative Age</strong></p>
<p><strong>15- Checking negative Age</strong></p>
<p><strong>16- Converting PatientId and AppointmentId to object data type</strong></p>
<p><strong>17- Displaying minimum value of age</strong></p>
<p><strong>18- Filtering and displaying row with Age==0</strong></p>
<p><strong>19- Dropping Age==0</strong></p>
<p><strong>20- Checking missing values</strong></p>
<p><strong>21- Checking duplicated values</strong></p>
<p><strong>22- Cleaned dataset</strong></p>
<p><strong>23- Exploratory Data Analysis</strong></p>
<p><strong>24- Pie plot of No-show patients</strong></p>
<p><strong>25- Pie plot of gender, hypertension and No-show</strong></p>
<p><strong>26- Pie plot of gender, diabetes and No-show</strong></p>
<p><strong>27- Pie plot of gender, SMS_received and No-show</strong></p>
<p><strong>28- Pie plot of gender and No-show</strong></p>
<p><strong>29- Function to plot the distribution in the research question</strong></p>
<p><strong>30- No-show gender distribution</strong></p>
<p><strong>31- Diabetes No-show gender distribution</strong></p>
<p><strong>32- Hypertension No-show gender distribution</strong></p>
<p><strong>33- SMS_received No-show gender distribution</strong></p>
<p><strong>34- Age No-show gender distribution</strong></p>
<p><strong>Conclusion</strong></p>
<p><strong>Limitations</strong></p>
<hr />
<p><strong>Introduction</strong></p>
<p>The dataset subject to our analysis contains information recorded from a hospital in Brazil. It has 110,527 data entries (indexed from 0 to 110526) and 14 columns.
The description of each feature variable is shown below.</p>
<hr />
<p><strong>PatientId</strong>: Identification of a patient<br />
<strong>AppointmentID</strong>: Identification of each appointment<br />
<strong>Gender</strong>: Male or Female<br />
<strong>ScheduledDay</strong>: The day when the patient scheduled their appointment<br />
<strong>AppointmentDay</strong>: The day of the appointment<br />
<strong>Age</strong>: Age of the patient<br />
<strong>Neighbourhood</strong>: Address of the hospital where the appointment takes place<br />
<strong>Scholarship</strong>: Boolean, 1 if the patient is enrolled in the Brazilian welfare program Bolsa Familia, 0 otherwise<br />
<strong>Hipertension</strong>: Boolean, 1 if the patient has hypertension, 0 otherwise<br />
<strong>Diabetes</strong>: Boolean, 1 if the patient has diabetes, 0 otherwise<br />
<strong>Alcoholism</strong>: Boolean, 1 if the patient drinks alcohol, 0 otherwise<br />
<strong>Handcap</strong>: Boolean, 1 if the patient is handicapped, 0 otherwise<br />
<strong>SMS_received</strong>: Boolean, 1 if the patient received an SMS before the appointment, 0 otherwise<br />
<strong>No-show</strong>: Boolean, Yes if the patient showed up on the booking day, No otherwise<br /></p>
<hr />
<p>From the description above, the aim of this project is to find out which patient populations, health conditions and disability statuses are associated with showing up (or not showing up) to their appointments.</p>
<p><strong>1- Research questions</strong></p>
<p>The research questions raised during the brainstorming phase of our analysis are:<br />
1- What is the distribution of the patient that showed up and did not show up during the appointment?<br />
2- What is the distribution of the patients having or not having Hypertension showed up and did not show up during the appointment?<br />
3- What is the distribution of the patients having or not having diabetes showed up and did not show up during the appointment?<br />
4- What is the distribution of the patients who (received or did not a SMS) showed up and did not show up during the appointment?<br />
<strong>2- Data Wrangling</strong><br />
<strong>3- Section objectives</strong><br />
In this section:<br />We load the data<br />We explore the data<br />We clean the dataset<br />We preprocess the dataset for visualization and further analysis.<br />
<strong>4- Loading the required libraries</strong></p>
<pre><code><span class="hljs-keyword">import</span> pandas <span class="hljs-keyword">as</span> pd 
<span class="hljs-keyword">import</span> numpy <span class="hljs-keyword">as</span> np 
<span class="hljs-keyword">import</span> matplotlib.pyplot <span class="hljs-keyword">as</span> plt
%matplotlib inline
</code></pre><p><strong>5- Loading the dataset</strong><br /></p>
<pre><code>df <span class="hljs-operator">=</span> pd.read_csv(<span class="hljs-string">"noshowappointments-kagglev2-may-2016.csv"</span>)
</code></pre><p><strong>6- Exploring data</strong><br />
Displaying the last five observations of the dataset</p>
<pre><code>df.tail()
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1655332364699/AXdYOgjk3.png" alt="tail.png" /></p>
<p><strong>7- Descriptive information about the dataset</strong></p>
<pre><code>df.info()
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1655332862884/nalL9mkDg.png" alt="info.png" /></p>
<ul>
<li>The dataset has 14 non-null features, containing respectively:</li>
<li>1 data type float: PatientId</li>
<li>8 data type integer: AppointmentID,Age,Scholarship,Hipertension,Diabetes,Alcoholism,Handcap,SMS_received</li>
<li>5 data type object: Gender,ScheduledDay,AppointmentDay, Neighbourhood,No-show</li>
</ul>
<p><strong>8- Shape of the dataset</strong></p>
<pre><code><span class="hljs-selector-tag">df</span><span class="hljs-selector-class">.shape</span>
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1655333446863/IfLZDHG_n.png" alt="shape.png" /></p>
<ul>
<li>The dataset has 110527 rows which represent the number of observations and 14 columns which represent the number of feature variables</li>
</ul>
<p><strong>9- Displaying the statistical data of the dataset</strong></p>
<pre><code>df.describe()
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1655333666810/4TCOlY6-L.png" alt="stat.png" /></p>
<ul>
<li>From the statistical data above we notice a lot of discrepancies in the dataset.<br /> We need to convert the summary to integers for readability</li>
</ul>
<pre><code>df.describe().astype(<span class="hljs-string">'int64'</span>)
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1655333949708/cupbPKH8H.png" alt="Screenshot from 2022-06-15 22-59-52.png" /></p>
<ul>
<li>We notice there is a negative value in the Age column</li>
<li>Besides, PatientId is a float data type; we need to convert it to a string</li>
<li>Some column names are misspelled, such as Hipertension and Handcap</li>
<li>PatientId and AppointmentID are irrelevant to our analysis, so we convert them to the object (string) data type</li>
</ul>
<p><strong>Exploring the Age of the patient closely</strong></p>
<pre><code><span class="hljs-selector-tag">df</span><span class="hljs-selector-class">.Age</span><span class="hljs-selector-class">.describe</span>()
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1655334354124/f_kGvuco8.png" alt="Screenshot from 2022-06-15 23-07-22.png" />
We observe that the mean age of the patients is 38, and the eldest patient is 115 years old.</p>
<p><strong>10- DATA CLEANING</strong><br />
From the dataset above, we notice that:<br /></p>
<ul>
<li>Hipertension and Handcap are misspelled</li>
<li>ScheduledDay and AppointmentDay are not in the correct format; we need to convert them to date format</li>
<li>There is a negative value in the feature variable Age</li>
<li>There are zeros in the feature variable Age</li>
</ul>
<p><strong>11- Renaming columns</strong></p>
<ul>
<li>Renaming the columns hipertension and handcap </li>
</ul>
<pre><code>df <span class="hljs-operator">=</span> df.rename(columns<span class="hljs-operator">=</span>{<span class="hljs-string">'Hipertension'</span>: <span class="hljs-string">'Hypertension'</span>, <span class="hljs-string">'Handcap'</span>: <span class="hljs-string">'Handicap'</span>})
</code></pre><p><strong>12- Converting to date</strong></p>
<ul>
<li>From the dataset above, we need to convert ScheduledDay and AppointmentDay from String data type to datetime64 format yyyy-mm-dd.</li>
</ul>
<pre><code>df[<span class="hljs-string">'ScheduledDay'</span>] <span class="hljs-operator">=</span> pd.to_datetime(df[<span class="hljs-string">'ScheduledDay'</span>]).dt.date.astype(<span class="hljs-string">'datetime64[ns]'</span>)
df[<span class="hljs-string">'AppointmentDay'</span>] <span class="hljs-operator">=</span> pd.to_datetime(df[<span class="hljs-string">'AppointmentDay'</span>]).dt.date.astype(<span class="hljs-string">'datetime64[ns]'</span>)
</code></pre><pre><code>df.head()
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1655381361303/dOPHMu6ft.png" alt="table.png" /></p>
<p><strong>13- Filtering the row with negative Age</strong></p>
<pre><code>rowAgeNegative<span class="hljs-operator">=</span> (df.Age=<span class="hljs-operator">=</span><span class="hljs-number">-1</span>)
dfAgeNegative <span class="hljs-operator">=</span> df[rowAgeNegative]
dfAgeNegative.head()
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1655381749426/RRHsD5Yy_.png" alt="clean.png" /></p>
<ul>
<li>We have one row containing -1 </li>
</ul>
<p><strong>14- Dropping negative Age</strong></p>
<pre><code>df <span class="hljs-operator">=</span> df.drop(dfAgeNegative.index)
</code></pre><p><strong>15- Filtering if there still exist a negative Age in the dataset row</strong></p>
<pre><code>ages = [-1]   # use an integer: Age is an integer column, so the string '-1' would never match
age_dataset = df[df['Age'].isin(ages)]
age_dataset.head()
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1655382102584/ot0V66Veb.png" alt="ok.png" />
The row has been dropped successfully<br />
<strong>16- Converting PatientId and AppointmentID to object data type</strong></p>
<pre><code>df[<span class="hljs-string">'PatientId'</span>] <span class="hljs-operator">=</span> df[<span class="hljs-string">'PatientId'</span>].astype(<span class="hljs-string">'object'</span>)
df[<span class="hljs-string">'AppointmentID'</span>] <span class="hljs-operator">=</span> df[<span class="hljs-string">'AppointmentID'</span>].astype(<span class="hljs-string">'object'</span>)
</code></pre><ul>
<li>Displaying the minimum value of Age</li>
</ul>
<pre><code>df.Age.<span class="hljs-built_in">min</span>()
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1655382451678/BpkAqluXL.png" alt="zero.png" /></p>
<ul>
<li>We notice the min value of Age is 0. This is wrong, and we may consider it a data entry error</li>
</ul>
<p><strong>17- Filtering rows with Age==0 and displaying the head of the dataset and the number of such rows</strong></p>
<pre><code>ages <span class="hljs-operator">=</span> [<span class="hljs-number">0</span>]  
age_dataset <span class="hljs-operator">=</span> df[df[<span class="hljs-string">'Age'</span>].isin(ages)]  
age_dataset.head()
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1655382720918/Ev9-GyXDx.png" alt="cool.png" /></p>
<pre><code>rowAgeZero<span class="hljs-operator">=</span> (df.Age=<span class="hljs-operator">=</span><span class="hljs-number">0</span>)
dfAgeZero <span class="hljs-operator">=</span> df[rowAgeZero]
dfAgeZero.head(<span class="hljs-number">10</span>)
print(<span class="hljs-string">"The number of rows containing zero in the Age columns are "</span>,len(dfAgeZero))
</code></pre><ul>
<li>The number of rows containing zero in the Age column is 3539</li>
</ul>
<p><strong>18- Dropping row with Age==0</strong></p>
<pre><code>df <span class="hljs-operator">=</span> df.drop(dfAgeZero.index)
</code></pre><ul>
<li>Checking if any zeros row remained</li>
</ul>
<pre><code>ages <span class="hljs-operator">=</span> [<span class="hljs-number">0</span>]  
age_dataset <span class="hljs-operator">=</span> df[df[<span class="hljs-string">'Age'</span>].isin(ages)]  
age_dataset.head()
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1655383088888/I83jv0ryE.png" alt="k.png" /></p>
<ul>
<li>Data dropped successfully</li>
</ul>
<p><strong>19- Checking for missing values</strong></p>
<pre><code>df.isnull().sum()
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1655383404026/0pQe4aTJU.png" alt="good.png" /></p>
<ul>
<li>There is no missing value in the dataset</li>
</ul>
<p><strong>20- Checking for duplicated values</strong></p>
<pre><code>sum(df.duplicated())
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1655383547485/jejiKjLRm.png" alt="zero.png" /></p>
<ul>
<li>There are no duplicated values</li>
</ul>
<p><strong>21- Displaying preprocessed dataset</strong></p>
<pre><code>df.head()
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1655383960204/NI14s3jGQ.png" alt="tr.png" /></p>
<p><strong>22- Exploratory Data Analysis</strong></p>
<p><strong>23- Defining a function to plot the pie plot taking three arguments namely the columns, the label and the title of the plot</strong></p>
<pre><code>def myplot(features, label, title):
    # Group by the given feature(s) and plot the group sizes as a pie chart
    plt.figure(figsize=(8,8))
    plt.pie(df.groupby(features).size(), autopct='%.2f')
    plt.title(title)
    plt.legend(label, loc="lower left")
    plt.show()
</code></pre><p><strong>24- Pie plot of No-show patients</strong></p>
<pre><code><span class="hljs-selector-tag">myplot</span>(<span class="hljs-string">'No-show'</span>,[<span class="hljs-string">'Not Show'</span>,<span class="hljs-string">'Show'</span>],<span class="hljs-string">"No-show distribution"</span> )
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1655384505233/alTaipJqB.png" alt="pie.png" /></p>
<p>From the above pie plot, we conclude that:</p>
<ul>
<li>79.74% of the patients did not come to the appointment</li>
<li>Only 20.26% of the patients came to the appointment</li>
</ul>
<p><strong>25- Pie plot of gender, hypertension and No-show</strong></p>
<pre><code><span class="hljs-selector-tag">myplot</span>([<span class="hljs-string">"No-show"</span>,<span class="hljs-string">"Gender"</span>, <span class="hljs-string">"Hypertension"</span>],[<span class="hljs-string">'Not show,Female,Not hypertension'</span>,<span class="hljs-string">'Not show,Female,hypertension'</span>,<span class="hljs-string">'Not show,Male,Not hypertension'</span>,<span class="hljs-string">'Not show,Male,hypertension'</span>, <span class="hljs-string">'Show,Female,Not hypertension'</span>, <span class="hljs-string">'Show,Female,hypertension'</span>,<span class="hljs-string">'Show,Male,Not hypertension'</span>,<span class="hljs-string">'Show,Male,hypertension'</span>],<span class="hljs-string">"Pie plot of gender, hypertenstion and No-show"</span> )
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1655384955904/zeQMJwcJQ.png" alt="ds.png" /></p>
<ul>
<li><p>Male patients</p>
<ul>
<li>22.54% without hypertension did not come to the appointment</li>
<li>5.88% without hypertension came to the appointment</li>
<li>5% with hypertension did not come to the appointment</li>
<li>1.04% with hypertension came to the appointment</li>
</ul>
</li>
<li><p>Female patients</p>
<ul>
<li>40.34% without hypertension did not come to the appointment</li>
<li>10.86% without hypertension came to the appointment</li>
<li>2.48% with hypertension came to the appointment</li>
<li>11.85% with hypertension did not come to the appointment</li>
</ul>
</li>
</ul>
<p><strong>26- Pie plot of gender, diabetes and No-show</strong></p>
<pre><code>myplot(["No-show","Gender", "Diabetes"],['Not show,Female,Not diabetes','Not show,Female,diabetes','Not show,Male,Not diabetes','Not show,Male,diabetes', 'Show,Female,Not diabetes', 'Show,Female,diabetes','Show,Male,Not diabetes','Show,Male,diabetes'],"Pie plot of gender, diabetes and No-show")
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1655386027653/YcWZC9ESd.png" alt="a.png" /></p>
<p><strong>Male patients</strong></p>
<ul>
<li>25.74% without diabetes did not come to the appointment</li>
<li>6.54% without diabetes came to the appointment</li>
<li>1.8% with diabetes did not come to the appointment</li>
<li>0.39% with diabetes came to the appointment</li>
</ul>
<p><strong>Female patients</strong></p>
<ul>
<li>47.91% without diabetes did not come to the appointment</li>
<li>12.39% without diabetes came to the appointment</li>
<li>0.95% with diabetes came to the appointment</li>
<li>4.29% with diabetes did not come to the appointment</li>
</ul>
<p><strong>27- Pie plot of gender, SMS_received and No-show</strong></p>
<pre><code><span class="hljs-selector-tag">myplot</span>([<span class="hljs-string">"No-show"</span>,<span class="hljs-string">"Gender"</span>, <span class="hljs-string">"SMS_received"</span>], [<span class="hljs-string">'Not show,Female,Not SMS_received'</span>,<span class="hljs-string">'Not show,Female,SMS_received'</span>,<span class="hljs-string">'Not show,Male,Not SMS_received'</span>,<span class="hljs-string">'Not show,Male,SMS_received'</span>, <span class="hljs-string">'Show,Female,Not SMS_received'</span>, <span class="hljs-string">'Show,Female,SMS_received'</span>,<span class="hljs-string">'Show,Male,Not SMS_received'</span>,<span class="hljs-string">'Show,Male,SMS_received'</span>],<span class="hljs-string">"Pie plot of gender, SMS_received and No-show"</span>)
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1655385780443/zVLtZCw6S.png" alt="w.png" /></p>
<p><strong>Male patients</strong></p>
<ul>
<li>20.18% did not receive an SMS and did not come to the appointment</li>
<li>4.16% did not receive an SMS but came to the appointment</li>
<li>7.36% received an SMS but did not come to the appointment</li>
<li>2.76% received an SMS and came to the appointment</li>
</ul>
<p><strong>Female patients</strong></p>
<ul>
<li>36.17% did not receive an SMS and did not come to the appointment</li>
<li>7.16% did not receive an SMS but came to the appointment</li>
<li>6.18% received an SMS and came to the appointment</li>
<li>16.03% received an SMS but did not come to the appointment</li>
</ul>
<p><strong>28- Pie plot of gender and No-show</strong></p>
<pre><code>myplot(["No-show","Gender"],['Not show,Female','Not show,Male', 'Show,Female','Show,Male'],"Pie plot of gender and No-show")
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1655386714874/TeY5JDYts.png" alt="z.png" /></p>
<p>Male patients</p>
<ul>
<li>Out of the 34.46% of patients who are male, only 6.92% came to the appointment</li>
</ul>
<p>Female patients</p>
<ul>
<li>Out of the 65.54% of patients who are female, only 13.34% came to the appointment</li>
</ul>
<p><strong>29- Function to plot the distribution in the research question</strong></p>
<pre><code>def distributionPlot(feature, titleplot):
    # Bar plot of No-show counts broken down by the given feature and gender
    gender_column = 'Gender'
    df.groupby(['No-show', feature, gender_column]).size().unstack(level=1).plot(kind='bar', title=titleplot, ylabel='count')
    # Print the underlying counts per group
    print(pd.DataFrame(df.groupby(['No-show', feature]).count().PatientId))
</code></pre><p><strong>30- No-show gender distribution</strong></p>
<pre><code><span class="hljs-selector-tag">distributionPlot</span>(<span class="hljs-string">'Gender'</span>,<span class="hljs-string">'No-show gender distribution'</span>)
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1655387125781/tnnmhmyKO.png" alt="n.png" /></p>
<p>From the bar plot above we notice:</p>
<ul>
<li>14275 female patients showed up to the appointment</li>
<li>7405 male patients showed up to the appointment</li>
<li>55843 female patients did not show up to the appointment</li>
<li>29464 male patients did not show up to the appointment</li>
</ul>
<p><strong>31- Diabetes No-show gender distribution</strong></p>
<pre><code><span class="hljs-selector-tag">distributionPlot</span>(<span class="hljs-string">'Diabetes'</span>,<span class="hljs-string">'Diabetes No-show gender distribution'</span>)
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1655387470029/KCGe-VWfz.png" alt="g.png" /></p>
<p>We notice that 1,200 female patients with diabetes showed up to the appointment and 230 male patients with diabetes showed up. Besides, 15k female patients and 6k male patients without diabetes showed up to the appointment. Moreover, 52k female patients without diabetes and 4k female patients with diabetes did not come to the appointment. Finally, 28k male patients without diabetes and 2k male patients with diabetes did not come to the appointment.</p>
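<p>The exact counts behind these observations can be tabulated directly with a crosstab; a minimal sketch using the same columns:</p>
<pre><code># counts of show/no-show broken down by diabetes status and gender
pd.crosstab([df['No-show'], df['Diabetes']], df['Gender'])
</code></pre>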
<p><strong>32- Hypertension No-show gender distribution</strong></p>
<pre><code><span class="hljs-selector-tag">distributionPlot</span>(<span class="hljs-string">'Hypertension'</span>,<span class="hljs-string">' Hypertension No-show gender distribution'</span>)
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1655387794657/uHafjBKa9.png" alt="l.png" /></p>
<p>We notice that almost 2,500 female patients with hypertension showed up to the appointment, while 1,272 male patients with hypertension did not show up. Besides, 12k female patients and 6k male patients without hypertension showed up to the appointment.</p>
<p><strong>33- SMS_received No-show gender distribution</strong></p>
<pre><code><span class="hljs-selector-tag">distributionPlot</span>(<span class="hljs-string">'SMS_received'</span>,<span class="hljs-string">' SMS_received No-show gender distribution'</span>)
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1655388105828/Ivp58v6_X.png" alt="we.png" /></p>
<p>We notice that 8k female patients who did not receive an SMS showed up to the appointment, and 7k female patients who received an SMS showed up. Likewise, 4k male patients who did not receive an SMS showed up, and 3k male patients who received an SMS showed up. On the no-show side, 38k female patients who did not receive an SMS and 17k female patients who received an SMS did not come to the appointment, while 22k male patients who did not receive an SMS and 7k male patients who received an SMS did not come to the appointment.</p>
<p><strong>34- Age No-show gender distribution</strong></p>
<pre><code>boxplot <span class="hljs-operator">=</span> df.boxplot(column<span class="hljs-operator">=</span>[<span class="hljs-string">'Age'</span>] , by <span class="hljs-operator">=</span> [<span class="hljs-string">'No-show'</span>] , notch <span class="hljs-operator">=</span> True, labels<span class="hljs-operator">=</span>[<span class="hljs-string">'No-show'</span>,<span class="hljs-string">'Age'</span>])
pd.DataFrame(df.groupby([<span class="hljs-string">'No-show'</span>])[<span class="hljs-string">'Age'</span>].describe().loc[:,[<span class="hljs-string">'mean'</span>,<span class="hljs-string">'std'</span>]])
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1655388300754/zLCeCwdUE.png" alt="ui.png" /></p>
<p>We notice that the average age of patients who did not show up is 39.07, while the average age of patients who showed up is 35.32. Looking at the quartiles, 25% of the patients who did not show up are aged around 20 or below, versus 19 for those who showed up; at the 75th percentile, the ages are 58 and 47 respectively.</p>
<p><strong>Conclusion</strong></p>
<p>We notice that few patients responded to the appointments given by physicians in Brazil.</p>
<p><strong>The gender distribution revealed that</strong></p>
<ul>
<li>14275 female patients showed up to the appointment</li>
<li>7405 male patients showed up to the appointment</li>
<li>55843 female patients did not show up to the appointment</li>
<li>29464 male patients did not show up to the appointment</li>
</ul>
<p><strong>The hypertension patient distribution revealed that</strong></p>
<ul>
<li>2500 female patients with hypertension showed up to the appointment</li>
<li>1272 male patients with hypertension did not show up to the appointment</li>
<li>12k female patients without hypertension showed up to the appointment</li>
<li>6k male patients without hypertension showed up to the appointment</li>
</ul>
<p><strong>The diabetes patient distribution revealed that</strong></p>
<ul>
<li>1200 female patients with diabetes showed up to the appointment</li>
<li>230 male patients with diabetes showed up to the appointment</li>
<li>15k female patients without diabetes showed up to the appointment</li>
<li>6k male patients without diabetes showed up to the appointment</li>
<li>52k female patients without diabetes did not come to the appointment</li>
<li>4k female patients with diabetes did not come to the appointment</li>
</ul>
<p><strong> The SMS_received patient distribution revealed that</strong></p>
<ul>
<li>7k female patients who received an SMS showed up to the appointment</li>
<li>4k male patients who did not receive an SMS showed up to the appointment</li>
<li>3k male patients who received an SMS showed up to the appointment</li>
<li>38k female patients who did not receive an SMS did not come to the appointment</li>
<li>17k female patients who received an SMS did not come to the appointment</li>
<li>22k male patients who did not receive an SMS did not come to the appointment</li>
<li>7k male patients who received an SMS did not come to the appointment</li>
</ul>
<p>From the research questions above, we conclude that<br /> SMS_received influenced patients to show up to their appointments more than the other feature variables.<br />The summary above does not fully reflect the actual data entry from the hospital because the data contained some discrepancies: 3,540 observations were dropped during the analysis to arrive at the conclusion above. Additional information would have been handy to explain why we have zero and negative one in the independent variable Age.</p>
<p><strong>Limitations</strong></p>
<p>The dataset submitted to our analysis in this project contains some data entry errors which affect the outcome of our analysis. Out of the 110,527 observations, we found a negative Age at row index 99832. Moreover, we found 3,539 observations with an Age value of 0. We assumed those discrepancies were errors, so we dropped 3,540 observations in total. Our analysis was therefore carried out on 106,987 observations, which does not represent the full population of patients and might change the outcome of our results. We would further need to know the distance from each patient to the hospital where the appointment was booked to understand why so few patients respond to their medical appointments. We would also need to know why the hospital prefers text messages over phone calls, since many people do not read their SMS often.</p>
<hr />
<p>If you want to contribute or you find any errors in this article, please do leave me a comment.</p>
<p>You can reach me on any of the Matrix decentralized servers. My Element messenger ID is @maximilien:matrix.org</p>
<p>If you are on one of the Mastodon decentralized servers, here is my ID @maximilien@qoto.org</p>
<p>If you are on LinkedIn, you can reach me <a target="_blank" href="https://www.linkedin.com/in/ephrem-maximilien-kpizingui-48222775/">here</a></p>
<p>If you want to contact me via email for freelance work: <em>maximilien@tutanota.de</em></p>
<blockquote>
<p>Warm regards,<br />
    Maximilien.</p>
</blockquote>
]]></content:encoded></item><item><title><![CDATA[Qr Code Dynamic Certificate Generator and Authentication (QCDCGA) Application]]></title><description><![CDATA[Since the outbreak of corona virus, the learning system across the world has been shifted to an online learning mode which brings about the digital risk of E-learning certificate.  Moreover, many students in Africa who go to further their study abroa...]]></description><link>https://maximilien.docquest.io/qr-code-dynamic-certificate-generator-and-authentication-qcdcga-application</link><guid isPermaLink="true">https://maximilien.docquest.io/qr-code-dynamic-certificate-generator-and-authentication-qcdcga-application</guid><category><![CDATA[web application]]></category><dc:creator><![CDATA[Maximilien]]></dc:creator><pubDate>Mon, 11 Apr 2022 21:00:40 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1649685314308/OQs-CPVOR.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Since the outbreak of the coronavirus, learning systems across the world have shifted to an online mode, which brings with it the digital risk of E-learning certificate forgery. Moreover, many students in Africa who further their studies abroad with non-digital educational certificates and documents often face authentication issues: they are asked to redo the same program as freshers, or they even lose their job or cannot find one because of authentication problems. In this light, to mitigate certificate counterfeiting, <strong>maxtekAI</strong> opts for a digital solution by developing <strong>QCDCGA</strong> (Qr Code Dynamic Certificate Generator And Authentication) across Africa, empowering certificate issuing bodies with a two-step QR code technology encrypted in a database to prevent forgery and to ease the authentication and validation process for any organization across the world.</p>
<h3 id="heading-i-project-description">I.) Project Description</h3>
<p>The  <strong>QCDCGA</strong> system will include the following key features:</p>
<ul>
<li>Secure Login for admin</li>
<li>Dashboard statistics display.</li>
<li>Creating and managing certificate.</li>
<li>Creating and managing Administrator.</li>
<li>Creating and managing Administrator roles.</li>
<li>Profile update.</li>
<li>Password Reset</li>
<li>Menu conversion feature that provides easy-to-navigate, optimized menus into which you can integrate any information you would like</li>
</ul>
<p>The application will give to certificate issuing bodies a strong sense of satisfaction by accomplishing the following goals:</p>
<ol>
<li>Generating of certificate easily</li>
<li>Updating of already generated certificate</li>
<li>Managing system administrators by setting rules on each account for security</li>
<li>Validation of the uniqueness of the certificates (see the sketch after this list)</li>
<li>Managing the database and the certificates</li>
<li>Managing the license issues of the institutions</li>
</ol>
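<p>To make the validation goal concrete, here is a minimal illustration of how a certificate could be signed and embedded in a QR code. This is a sketch under stated assumptions, not the actual QCDCGA implementation: it assumes Python with the <code>qrcode</code> package installed, and the secret key, certificate ID, and verification URL are all hypothetical placeholders.</p>
<pre><code>import hashlib
import hmac
import qrcode

SECRET_KEY = b'issuing-body-secret'   # hypothetical key kept server-side

def make_certificate_qr(cert_id, holder):
    # Sign the certificate payload so a verifier can detect tampering
    payload = f'{cert_id}|{holder}'
    signature = hmac.new(SECRET_KEY, payload.encode(), hashlib.sha256).hexdigest()
    # Step 1: the QR code carries a verification URL.
    # Step 2: the server recomputes the signature against its database record.
    url = f'https://example.org/verify?id={cert_id}&amp;sig={signature}'
    qrcode.make(url).save(f'{cert_id}.png')

make_certificate_qr('CERT-2022-0001', 'Jane Doe')
</code></pre>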
<h3 id="heading-ii-development-package">II.) Development Package</h3>
<p>The sections below outline each development package and how it relates to your system.</p>
<p><strong>1. Domain name</strong></p>
<p>A domain name is an identification string that defines a realm of administrative autonomy,
authority, or control within the Internet.</p>
<p><strong>2. SSL Certificate</strong></p>
<p>SSL provides a secure channel between two machines or devices operating over the internet or an
internal network. One common example is when SSL is used to secure communication between a
web browser and a web server. This turns a website's address from HTTP to HTTPS, the 'S'
standing for 'secure'.</p>
<p><strong>3. VPC acquisition, installation, and security configuration</strong></p>
<p>A virtual private cloud (VPC) enables you to launch resources into a virtual network that you've defined. This virtual network closely resembles a traditional network that you'd operate in your own data center or office, with the benefit of scalable infrastructure.</p>
<p><strong>4. Technical support</strong></p>
<p>A technical support representative is focused on resolving your issue as quickly as possible.
Technical support reps listen to symptoms, try to reproduce the issue, and quickly provide a
solution to the issue.</p>
<p><strong>5. Maintenance and update</strong></p>
<ul>
<li>System maintenance is an umbrella term that encompasses various forms of computer maintenance needed to keep a system running.</li>
<li>Regular maintenance helps your systems run more smoothly and reduces the risk of them breaking down. A well-maintained application ensures your staff and business have no technology roadblocks that hamper productivity, and it also leads to a reduction in support costs.</li>
</ul>
<p><strong>- NOTE:</strong> This development includes integration of third-party software.</p>
<h3 id="heading-iii-security-measure-considered">III.) Security Measure Considered</h3>
<p>There are a few security measures we are going to implement on the system to ensure you have a secure and well-functioning application.</p>
<ul>
<li>Server login access secure</li>
</ul>
<p>Using SSH key authentication instead of regular passwords</p>
<ul>
<li>Secure Sockets Layer Certificates</li>
</ul>
<p>Secure web administration areas and forms with Secure Sockets Layer (SSL), which guards information passed between two systems via the internet. SSL can be used both in server-client and in server-server communication.</p>
<ul>
<li>Regular Server Update, Upgrade and Backup</li>
</ul>
<p>This will help keep the system up to date with new fixes for vulnerabilities and weaknesses in the server.</p>
<ul>
<li>Monitoring Login Attempts</li>
</ul>
<p>Using intrusion prevention software to monitor login attempts is a way to protect your server
against brute force attacks. These automated attacks use a trial-and-error method, attempting
every possible combination of letters and numbers to gain access to the system.</p>
<ul>
<li>Firewall Restrictions</li>
</ul>
<p>We will set up and maintain a firewall, restricting access to the server to stop unwanted requests made to it.</p>
<ul>
<li>Implementing Fail2Ban</li>
</ul>
<p>Fail2Ban is an intrusion prevention software framework that protects computer servers from brute-force attacks. In other words, it blocks IP addresses it finds suspicious.</p>
<h3 id="heading-iv-technical-support">IV.) Technical Support</h3>
<p>We will provide paid technical support after launching your application. We will answer your questions regarding application management, technical details, or anything about operating your own server; we can provide this through email or by phone.</p>
<h3 id="heading-v-payment-plan">V.) Payment Plan</h3>
<p>At the start of the development process for your company's application, you need to pay us 60% of the total development cost; the remaining balance of 40% should be paid after launching your site on the web (www). We will start the development immediately upon receiving the initial payment.</p>
<h3 id="heading-vi-mode-of-payment">VI.) Mode Of Payment</h3>
<p>Our preferred mode of payment is Bank transfer. When payment is ready to be made, we will
communicate the bank details to you.</p>
<h3 id="heading-vii-payment-plan">VII.) Payment Plan</h3>
<p>At the start of the development process for your company Application, you need to pay us 60% of the
total development costs, the remaining balance of 40% will be given after launching your site on the web
(www). We will start the development immediately upon receiving of the initial payment.</p>
<p>NOTE: Our quotations are valid for 7 days from the date of this quotation. If you have any other queries regarding this quotation, please email us at maximilien@tutanota.de or call us at <strong>+233207863123</strong></p>
<p>Feel free to mail us to request the demo link and the password for testing. Once you're ready to move forward with the development of your custom application with your company logo, simply sign this proposal and email it to us. We'll be notified and will begin the initial stages of app development.</p>
<p>Regards,</p>
<p>Maximilien - <strong>QCDCGA</strong> Project Leader</p>
]]></content:encoded></item><item><title><![CDATA[Deploying Company Predictive Marketing Application using RFM Behavioral Clustering Algorithm To Heroku (part 2)]]></title><description><![CDATA[Welcome to the part 2 of this blog post series. Please if you are reading this article for the first time like many other here please read the part one before testing the application. 

Train and test csv files can be downloaded here

The link to tes...]]></description><link>https://maximilien.docquest.io/deploying-company-predictive-marketing-application-using-rfm-behavioral-clustering-algorithm-to-heroku-part-2</link><guid isPermaLink="true">https://maximilien.docquest.io/deploying-company-predictive-marketing-application-using-rfm-behavioral-clustering-algorithm-to-heroku-part-2</guid><category><![CDATA[deployment]]></category><category><![CDATA[Machine Learning]]></category><category><![CDATA[Data Science]]></category><category><![CDATA[Heroku]]></category><category><![CDATA[marketing]]></category><dc:creator><![CDATA[Maximilien]]></dc:creator><pubDate>Mon, 21 Mar 2022 20:43:23 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1647858788640/Ty3u9Cyvv.jpg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Welcome to part 2 of this blog post series. If you are reading this article for the first time, please read part one before testing the application.</p>
<ul>
<li><p>Train and test csv files can be downloaded <a target="_blank" href="https://drive.google.com/drive/folders/1o8wlCp_KZ_-rh0SiruaWsyvh5M-GNBDq?usp=sharing">here</a></p>
</li>
<li><p>The link to test the unsupervised machine learning web application on heroku can be found <a target="_blank" href="https://marketing-analytics-ml.herokuapp.com/">here</a></p>
</li>
<li><p>Part 3 of this blog post series will be released soon with the end-to-end model deployment code.</p>
</li>
</ul>
<p>Until then, stay tuned!</p>
<p>Regards,</p>
<p>Maximilien. </p>
]]></content:encoded></item><item><title><![CDATA[End To End Company Predictive Marketing Using RFM (Recency Frequency Monetary) Behavioral Based Clustering Algorithm(part 1).]]></title><description><![CDATA[Digital transformation including web, email, mobile, social, location technologies combined with technologies to store, process, and extract information has significantly changed our world. Nowadays, every entrepreneur in the process of business deve...]]></description><link>https://maximilien.docquest.io/end-to-end-company-predictive-marketing-using-rfm-recency-frequency-monetary-behavioral-based-clustering-algorithmpart-1</link><guid isPermaLink="true">https://maximilien.docquest.io/end-to-end-company-predictive-marketing-using-rfm-recency-frequency-monetary-behavioral-based-clustering-algorithmpart-1</guid><category><![CDATA[Machine Learning]]></category><category><![CDATA[Data Science]]></category><category><![CDATA[Artificial Intelligence]]></category><category><![CDATA[marketing]]></category><dc:creator><![CDATA[Maximilien]]></dc:creator><pubDate>Tue, 15 Mar 2022 10:20:42 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1647342726559/YcTTInEOy.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Digital transformation, including web, email, mobile, social, and location technologies combined with technologies to store, process, and extract information, has significantly changed our world. Nowadays, every entrepreneur in the process of business development faces the question of how to make clients more loyal and keep them from leaving for a competitor. In this light, predictive marketing is an approach that restores the bridge by bringing human sensibility into our digital world, focusing on consumers to understand what they did, what they will do next, and which products they are likely to buy. In the following, we are going to apply predictive marketing to segment and cluster customer behavior on Shopify using a recency-frequency-monetary (RFM) clustering algorithm.</p>
<hr />
<p><strong>  Contents </strong></p>
<p><strong> 1 - What is predictive marketing </strong></p>
<p><strong> 2 - What is RFM analysis and why it is useful</strong></p>
<p><strong> 3 - What is the difference between Clustering and segmentation </strong></p>
<p><strong> 4- Different types of clustering </strong></p>
<p><strong> 5- Diving into the algorithm with ML object oriented programming in python  </strong></p>
<hr />
<p><strong> 1 - What is predictive marketing </strong></p>
<p>To understand predictive marketing, in my humble opinion it is better to first know what predictive analytics is. In machine learning or big data terms, predictive analytics is a combination of mathematical and statistical techniques used to recognize patterns in data or to make predictions about the future. In this sense, when predictive analytics is applied to marketing, it can predict customer behavior, classify customers into segments, recommend a set of products to customers, and so on. So predictive marketing, under the hood of predictive analytics, helps companies optimize their marketing strategy to acquire new customers, grow customer lifetime value (the revenue generated by purchased products), and retain more customers over time. However, someone reading this blog may ask: is predictive marketing going to replace marketers with robots? The answer is no; the purpose of predictive marketing is to empower human intelligence with machine learning to increase customer lifetime value. For instance, Amazon has been using predictive analytics for a long time. Pay close attention to the recommendations that appear under a product you are thinking of adding to your cart: they are part of what makes Amazon such a successful e-commerce platform today.</p>
<p><strong> 2 - What is RFM analysis and why it is useful?</strong></p>
<p>To grasp RFM analysis, let us first consider customer segmentation. Customer segmentation is the process of dividing customers into groups based on common characteristics so companies can market to each group effectively and appropriately. With this understanding, recency-frequency-monetary (RFM) analysis is a behavior-based approach that groups customers into segments based on their previous purchase transactions: how recently (recency), how often (frequency), and how much (monetary) a customer bought products or services. Basically, the RFM model ranks each customer on a scale of 1 to 5. The higher the customer's ranking, the more likely they are to do business again with the firm. Furthermore, it gives organizations a sense of how much revenue comes from returning customers. It therefore helps marketers leverage their strategy to keep high-value and medium-value customers loyal, and to move targeted low-value customer segments into high-value ones through promotions, ads, and discounts on products.</p>
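<p>As a quick illustration of the scoring idea (a minimal sketch, separate from the full class built later in this post; all column names and numbers below are made up), the R, F and M columns can each be binned into quartile scores with pandas:</p>
<pre><code>import pandas as pd

rfm = pd.DataFrame({'Recency': [5, 40, 200, 12],       # days since last purchase
                    'Frequency': [20, 3, 1, 9],        # number of purchases
                    'Monetary': [900, 120, 30, 400]})  # total spend

# Lower recency is better, so its quartile labels are reversed
rfm['R'] = pd.qcut(rfm['Recency'], 4, labels=[4, 3, 2, 1])
rfm['F'] = pd.qcut(rfm['Frequency'], 4, labels=[1, 2, 3, 4])
rfm['M'] = pd.qcut(rfm['Monetary'], 4, labels=[1, 2, 3, 4])
print(rfm)
</code></pre>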
<p><strong> 3 - What is the difference between Clustering and segmentation </strong></p>
<p>Clustering is the automated machine learning powered version of segmentation. Clustering is a powerful tool that allows us to discover personas or communities within your customer base.  Segmentation on the other hand, is the process in which you segment customers to identify homogeneous groups that exist within your customer base which can be used to optimize and differentiate marketing actions or product strategy.</p>
<p><strong> 4- Different types of clustering </strong></p>
<p>The most frequent types of clustering used by data analysts are product-based clusters, brand-based clusters, and behavior-based clusters.</p>
<ul>
<li>Product based Clusters</li>
</ul>
<p>Product-based or category-based clustering models group customers based on what types or categories of products they tend to prefer and what types of products they tend to buy together.</p>
<ul>
<li>Brand based Clusters</li>
</ul>
<p>Brand-based clusters tell you what brands people are most likely to buy. They group together customers who prefer a group of brands more than others. For instance, you will be able to identify which customers are likely to be interested when a specific brand releases new products.</p>
<ul>
<li>Behavior based Clusters</li>
</ul>
<p>A behavior based clustering model groups customers based on how they will behave while purchasing. Do they use the website or the call center? Are they discount addicts? How frequently do they buy? How much do they spend? How much time will pass before they purchase again?</p>
<p><strong> 5- Diving into the algorithm with ML OOP in python  </strong></p>
<p>The use of OOP (Object Oriented Programming) is entirely optional in machine learning, as we already have libraries like scikit-learn and TensorFlow from which we can easily use algorithms. If you are new to Python and reading this article, don't worry: pause at this point, look up OOP in Python, and come back to follow along.</p>
<ul>
<li>a) Importing the libraries</li>
</ul>
<pre><code><span class="hljs-keyword">from</span> sklearn.preprocessing <span class="hljs-keyword">import</span> MinMaxScaler
<span class="hljs-keyword">from</span> yellowbrick.<span class="hljs-keyword">cluster</span> <span class="hljs-keyword">import</span> KElbowVisualizer
<span class="hljs-keyword">from</span> matplotlib.gridspec <span class="hljs-keyword">import</span> GridSpec
<span class="hljs-keyword">from</span> sklearn.<span class="hljs-keyword">cluster</span> <span class="hljs-keyword">import</span> KMeans
<span class="hljs-keyword">import</span> matplotlib.pyplot <span class="hljs-keyword">as</span> plt
<span class="hljs-keyword">import</span> plotly.express <span class="hljs-keyword">as</span> px
<span class="hljs-keyword">import</span> seaborn <span class="hljs-keyword">as</span> sns
<span class="hljs-keyword">import</span> datetime <span class="hljs-keyword">as</span> dt
<span class="hljs-keyword">import</span> pandas <span class="hljs-keyword">as</span> pd
<span class="hljs-keyword">import</span> numpy <span class="hljs-keyword">as</span> np
<span class="hljs-keyword">import</span> os
</code></pre><ul>
<li>b) Defining a class, containing a function to preprocess the dataset</li>
</ul>
<pre><code><span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">SomeModel</span>():</span>
        <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">__init__</span><span class="hljs-params">(<span class="hljs-keyword">self</span>)</span></span>:
                pass
        <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">get_preprocessing</span><span class="hljs-params">(<span class="hljs-keyword">self</span>, a, b , c,d)</span></span>:
        <span class="hljs-comment">#       removing duplicated index and dropping nan values</span>
                X= pd.read_csv(d).drop_duplicates(keep=<span class="hljs-string">"first"</span>)
                X=X[pd.notnull(X[a])]
                X=X[pd.notnull(X[b])]
                X=X[pd.notnull(X[c])]
                <span class="hljs-keyword">return</span> X
</code></pre><ul>
<li>Checking the output of this call</li>
</ul>
<pre><code><span class="hljs-keyword">if</span> __name__ <span class="hljs-operator">=</span><span class="hljs-operator">=</span> <span class="hljs-string">'__main__'</span>:
....model_instance <span class="hljs-operator">=</span> SomeModel()        
....print(model_instance.get_preprocessing(<span class="hljs-string">'location_country'</span>,<span class="hljs-string">'referrer_source'</span>,<span class="hljs-string">'referrer_name'</span>,<span class="hljs-string">'shopify_dataseller1.csv'</span>))
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1647262942238/a099up2OW.png" alt="func1.png" /></p>
<ul>
<li>c) Defining  a function to get RFM modelling</li>
</ul>
<pre><code>def get_rfm_modeling(self, a, b, c, d):
    # function to return the RFM dataframe
    preprocessed_df = self.get_preprocessing(a, b, c, d)
    df_recency = preprocessed_df.groupby(by='location_country', as_index=False)['total_sessions'].sum()
    df_recency.columns = ['location_country', 'Recency']
    frequency_df = preprocessed_df.drop_duplicates().groupby(by=['location_country'], as_index=False)['total_conversion'].count()
    frequency_df.columns = ['location_country', 'Frequency']
    preprocessed_df['Total'] = preprocessed_df['total_conversion'] * preprocessed_df['total_carts']
    monetary_df = preprocessed_df.groupby(by='location_country', as_index=False)['Total'].sum()
    monetary_df.columns = ['location_country', 'Monetary']
    rf_df = df_recency.merge(frequency_df, on='location_country')
    rfm_df = rf_df.merge(monetary_df, on='location_country')
    return rfm_df
</code></pre><ul>
<li>Checking the output</li>
</ul>
<pre><code><span class="hljs-keyword">if</span> __name__ <span class="hljs-operator">=</span><span class="hljs-operator">=</span> <span class="hljs-string">'__main__'</span>:
....model_instance <span class="hljs-operator">=</span> SomeModel()      
....print(model_instance.get_rfm_modeling(<span class="hljs-string">"location_country"</span>,<span class="hljs-string">"referrer_source"</span>,<span class="hljs-string">"referrer_name"</span>,<span class="hljs-string">'shopify_dat  a_seller1.csv'</span>))
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1647264533610/Z6c1RjIUd.png" alt="score.png" /></p>
<ul>
<li>d) Defining a function to get the R_score</li>
</ul>
<pre><code><span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">R_score</span>(<span class="hljs-params">self,var,p,d</span>):</span>
        <span class="hljs-comment"># recency score on 2h activity high value, more logs on the platform</span>
                <span class="hljs-keyword">if</span> var &lt;= d[p][<span class="hljs-number">0.25</span>]:
                        <span class="hljs-keyword">return</span> <span class="hljs-number">1</span>
                <span class="hljs-keyword">elif</span> var &lt;= d[p][<span class="hljs-number">0.50</span>]:
                        <span class="hljs-keyword">return</span> <span class="hljs-number">2</span>
                <span class="hljs-keyword">elif</span> var &lt;= d[p][<span class="hljs-number">0.75</span>]:
                        <span class="hljs-keyword">return</span> <span class="hljs-number">3</span>
                <span class="hljs-keyword">else</span>:
                        <span class="hljs-keyword">return</span> <span class="hljs-number">4</span>
</code></pre><ul>
<li>e) Defining a function to get FM_score</li>
</ul>
<pre><code>   <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">FM_score</span>(<span class="hljs-params">self,var,p,d</span>):</span>
<span class="hljs-comment">#Frequency and Monetary score (Positive Impact : Higher the value, better the customer)   </span>
                <span class="hljs-keyword">if</span> var &lt;= d[p][<span class="hljs-number">0.25</span>]:
                        <span class="hljs-keyword">return</span> <span class="hljs-number">4</span>
                <span class="hljs-keyword">elif</span> var &lt;= d[p][<span class="hljs-number">0.50</span>]:
                        <span class="hljs-keyword">return</span> <span class="hljs-number">3</span>
                <span class="hljs-keyword">elif</span> var &lt;= d[p][<span class="hljs-number">0.75</span>]:
                        <span class="hljs-keyword">return</span> <span class="hljs-number">2</span>
                <span class="hljs-keyword">else</span>:
                        <span class="hljs-keyword">return</span> <span class="hljs-number">1</span>
</code></pre><ul>
<li>f) Defining a function to get the RFM score </li>
</ul>
<pre><code>     def get_rfmscore(<span class="hljs-built_in">self</span>,a,b,c,d):
<span class="hljs-comment">#Segmentation: Here, we will divide the data set into 4 parts based on the quantiles.</span>
                rfm_df = <span class="hljs-built_in">self</span>.get_rfm_modeling(a,b,c,d)
                quantiles =rfm_df.drop(<span class="hljs-string">'location_country'</span>,axis = <span class="hljs-number">1</span>).quantile(q = [<span class="hljs-number">0.25</span>,<span class="hljs-number">0.5</span>,<span class="hljs-number">0.75</span>])
                rfm_df[<span class="hljs-string">'R_score'</span>] = rfm_df[<span class="hljs-string">'Recency'</span>].apply(<span class="hljs-built_in">self</span>.R_score,args = (<span class="hljs-string">'Recency'</span>,quantiles,))
                rfm_df[<span class="hljs-string">'F_score'</span>] = rfm_df[<span class="hljs-string">'Frequency'</span>].apply(<span class="hljs-built_in">self</span>.FM_score,args = (<span class="hljs-string">'Frequency'</span>,quantiles,))
                rfm_df[<span class="hljs-string">'M_score'</span>] = rfm_df[<span class="hljs-string">'Monetary'</span>].apply(<span class="hljs-built_in">self</span>.FM_score,args = (<span class="hljs-string">'Monetary'</span>,quantiles,))
        <span class="hljs-comment">#Now we will create : RFMGroup and RFMScore</span>
                rfm_df[<span class="hljs-string">'RFM_Group'</span>] = rfm_df[<span class="hljs-string">'R_score'</span>].astype(str) + rfm_df[<span class="hljs-string">'F_score'</span>].astype(str) + rfm_df[<span class="hljs-string">'M_score'</span>].astype(str)
        <span class="hljs-comment">#Score</span>
                rfm_df[<span class="hljs-string">'RFM_Score'</span>] = rfm_df[[<span class="hljs-string">'R_score'</span>,<span class="hljs-string">'F_score'</span>,<span class="hljs-string">'M_score'</span>]].sum(axis = <span class="hljs-number">1</span>)
                rfm_df[<span class="hljs-string">'R_rank'</span>] = rfm_df[<span class="hljs-string">'Recency'</span>].rank(ascending=<span class="hljs-literal">False</span>)
                rfm_df[<span class="hljs-string">'F_rank'</span>] = rfm_df[<span class="hljs-string">'Frequency'</span>].rank(ascending=<span class="hljs-literal">True</span>)
                rfm_df[<span class="hljs-string">'M_rank'</span>] = rfm_df[<span class="hljs-string">'Monetary'</span>].rank(ascending=<span class="hljs-literal">True</span>)
        <span class="hljs-comment"># normalizing the rank of the customers</span>
                rfm_df[<span class="hljs-string">'R_rank_norm'</span>] = (rfm_df[<span class="hljs-string">'R_rank'</span>]/rfm_df[<span class="hljs-string">'R_rank'</span>].max())*<span class="hljs-number">100</span>
                rfm_df[<span class="hljs-string">'F_rank_norm'</span>] = (rfm_df[<span class="hljs-string">'F_rank'</span>]/rfm_df[<span class="hljs-string">'F_rank'</span>].max())*<span class="hljs-number">100</span>
                rfm_df[<span class="hljs-string">'M_rank_norm'</span>] = (rfm_df[<span class="hljs-string">'F_rank'</span>]/rfm_df[<span class="hljs-string">'M_rank'</span>].max())*<span class="hljs-number">100</span>
                rfm_df.drop(columns=[<span class="hljs-string">'R_rank'</span>, <span class="hljs-string">'F_rank'</span>, <span class="hljs-string">'M_rank'</span>], inplace=<span class="hljs-literal">True</span>)
                rfm_df[<span class="hljs-string">'RFM_Score'</span>] = <span class="hljs-number">0.15</span>*rfm_df[<span class="hljs-string">'R_rank_norm'</span>]+<span class="hljs-number">0.28</span> * rfm_df[<span class="hljs-string">'F_rank_norm'</span>]+<span class="hljs-number">0.57</span>*rfm_df[<span class="hljs-string">'M_rank_norm'</span>]
                rfm_df[<span class="hljs-string">'RFM_Score'</span>] *= <span class="hljs-number">0.05</span>
                rfm_df = rfm_df.round(<span class="hljs-number">2</span>)
                <span class="hljs-keyword">return</span> rfm_df
</code></pre><ul>
<li>Checking the output</li>
</ul>
<pre><code><span class="hljs-keyword">if</span> __name__ <span class="hljs-operator">=</span><span class="hljs-operator">=</span> <span class="hljs-string">'__main__'</span>:
....model_instance <span class="hljs-operator">=</span> SomeModel()      
....print(model_instance.get_rfmscore(<span class="hljs-string">"location_country"</span>, <span class="hljs-string">"referrer_source"</span>, <span class="hljs-string">"referrer_name"</span>,<span class="hljs-string">'shopify_data_seller1.csv'</span>))
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1647265495972/zS_t4bL42.png" alt="fm.png" /></p>
<ul>
<li>g) Function to perform customer segmentation</li>
</ul>
<pre><code> def get_customerSegment(<span class="hljs-built_in">self</span>,a,b,c,d):
                rfm_df <span class="hljs-operator">=</span> <span class="hljs-built_in">self</span>.get_rfmscore(a,b,c,d)
                rfm_df[<span class="hljs-string">"Customer_segment"</span>] <span class="hljs-operator">=</span> np.where(rfm_df[<span class="hljs-string">'RFM_Score'</span>] <span class="hljs-operator">&gt;</span> <span class="hljs-number">4.5</span>, <span class="hljs-string">"Top Customers"</span>,
                             (np.where(rfm_df[<span class="hljs-string">'RFM_Score'</span>] <span class="hljs-operator">&gt;</span> <span class="hljs-number">4</span>,<span class="hljs-string">"High value Customer"</span>,
                             (np.where(rfm_df[<span class="hljs-string">'RFM_Score'</span>] <span class="hljs-operator">&gt;</span><span class="hljs-operator">=</span> <span class="hljs-number">3</span>,<span class="hljs-string">"Medium Value Customer"</span>,
                             np.where(rfm_df[<span class="hljs-string">'RFM_Score'</span>] <span class="hljs-operator">&gt;</span> <span class="hljs-number">1.6</span>,<span class="hljs-string">'Low Value Customers'</span>, <span class="hljs-string">'Low Customers'</span>))))))
                <span class="hljs-keyword">return</span> rfm_df
</code></pre><ul>
<li>Getting the output</li>
</ul>
<pre><code><span class="hljs-keyword">if</span> __name__ <span class="hljs-operator">=</span><span class="hljs-operator">=</span> <span class="hljs-string">'__main__'</span>:
....model_instance <span class="hljs-operator">=</span> SomeModel()        
....print(model_instance.get_customerSegment(<span class="hljs-string">"location_country"</span>, <span class="hljs-string">"referrer_source"</span>, <span class="hljs-string">"referrer_name"</span>,<span class="hljs-string">'shopify_data_seller1.csv'</span>))
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1647266141725/9NClGO_wH.png" alt="segm.png" /></p>
<ul>
<li>h) Functions to treat negative and zero values and to reduce skewness in the data</li>
</ul>
<pre><code> <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">right_treat</span><span class="hljs-params">(<span class="hljs-keyword">self</span>,var)</span></span>:
        <span class="hljs-comment"># First will focus on the negative and zero from the dataset before the transformation.</span>
                <span class="hljs-keyword">if</span> var &lt;= <span class="hljs-number">0</span>:
                        <span class="hljs-keyword">return</span> <span class="hljs-number">1</span>
                <span class="hljs-symbol">else:</span>
                        <span class="hljs-keyword">return</span> var
        <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">get_screwLogTransform</span><span class="hljs-params">(<span class="hljs-keyword">self</span>,a,b,c,d)</span></span>:
                rfm_df = <span class="hljs-keyword">self</span>.get_customerSegment(a,b,c,d)
<span class="hljs-comment">#skewness transform</span>
                rfm_df[<span class="hljs-string">'Recency'</span>] = rfm_df[<span class="hljs-string">'Recency'</span>].apply(lambda x : <span class="hljs-keyword">self</span>.right_treat(x))
                rfm_df[<span class="hljs-string">'Monetary'</span>] = rfm_df[<span class="hljs-string">'Monetary'</span>].apply(lambda x : <span class="hljs-keyword">self</span>.right_treat(x))
<span class="hljs-comment">#Log Transformation</span>
                log_RFM_data = rfm_df[[<span class="hljs-string">'Recency'</span>,<span class="hljs-string">'Frequency'</span>,<span class="hljs-string">'Monetary'</span>]].apply(np.log,axis = <span class="hljs-number">1</span>).round(<span class="hljs-number">4</span>)
                <span class="hljs-keyword">return</span> log_RFM_data
</code></pre><ul>
<li>i) Function to find the optimal number of clusters using the elbow technique</li>
</ul>
<pre><code><span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">plotClusteringElbow</span><span class="hljs-params">()</span></span>:
        <span class="hljs-comment"># After plotting, we found elbow at k=3. We will use this value in training our model</span>
                x = scaledLogTransform()
                model = KMeans()
                visualizer =KElbowVisualizer(model, k=(<span class="hljs-number">1</span>,<span class="hljs-number">9</span>))
                visualizer.fit(x)
                <span class="hljs-keyword">return</span> visualizer.show()
</code></pre><ul>
<li>Output the plot </li>
</ul>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1647269747528/b6vXZL-bl.png" alt="elbow.png" /></p>
<ul>
<li>j) Training the model using the K-means clustering algorithm</li>
</ul>
<pre><code>def train(<span class="hljs-built_in">self</span>,a,b,c,d):
                scaled_data <span class="hljs-operator">=</span> <span class="hljs-built_in">self</span>.scaledLogTransform(a,b,c,d)
                KM_clust <span class="hljs-operator">=</span> KMeans(n_clusters<span class="hljs-operator">=</span> <span class="hljs-number">3</span>, init <span class="hljs-operator">=</span> <span class="hljs-string">'k-means++'</span>,max_iter <span class="hljs-operator">=</span> <span class="hljs-number">1000</span>)
                KM_clust.fit(scaled_data)
                <span class="hljs-keyword">return</span> KM_clust
</code></pre><ul>
<li>k) Defining a function to display the results</li>
</ul>
<pre><code>def get_results(<span class="hljs-built_in">self</span>,a,b,c,d):
                model<span class="hljs-operator">=</span><span class="hljs-built_in">self</span>.train(a,b,c,d)
                rfm_df <span class="hljs-operator">=</span> <span class="hljs-built_in">self</span>.get_customerSegment(a,b,c,d)
                rfm_df[<span class="hljs-string">'Cluster'</span>] <span class="hljs-operator">=</span> model.labels_
                rfm_df[<span class="hljs-string">'Cluster'</span>] <span class="hljs-operator">=</span> <span class="hljs-string">'Cluster'</span> <span class="hljs-operator">+</span> rfm_df[<span class="hljs-string">'Cluster'</span>].astype(str)
                new_rfm_df <span class="hljs-operator">=</span>  rfm_df[[<span class="hljs-string">"location_country"</span>,<span class="hljs-string">"Customer_segment"</span>, <span class="hljs-string">"Cluster"</span>]]
                <span class="hljs-keyword">return</span> new_rfm_df.tail(<span class="hljs-number">25</span>)
</code></pre><ul>
<li>Displaying the last 25 rows of the dataframe</li>
</ul>
<pre><code><span class="hljs-keyword">if</span> __name__ <span class="hljs-operator">=</span><span class="hljs-operator">=</span> <span class="hljs-string">'__main__'</span>:
....model_instance <span class="hljs-operator">=</span> SomeModel()
....print(model_instance.train(<span class="hljs-string">"location_country"</span>,<span class="hljs-string">"referrer_source"</span>, <span class="hljs-string">"referrer_name"</span>,<span class="hljs-string">'shopify_data_seller1.csv'</span>))
....print(model_instance.get_results(<span class="hljs-string">"location_country"</span>, <span class="hljs-string">"referrer_source"</span>,<span class="hljs-string">"referrer_name"</span>,<span class="hljs-string">'shopify_data_seller1.csv'</span>))
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1647269205384/8Y8nPsx56p.png" alt="result.png" /></p>
<p><strong>Conclusion</strong></p>
<p>We have reached the end of this post. From the above data, you can think of your customer clusters as a physical swimming pool. The pool is filled with money spent by active customers of your brands. High-value customers are those who have spent money with you frequently over a period of time; they spend more money and fill the pool faster than medium-value customers. Low-value customers and low customers are seasonal customers: their purchasing power is small and it takes them years to fill the pool. On the other hand, water drains away as customers leave you or stop spending money with you. Therefore, marketers should implement a different strategy to retain each segment. For instance, high-value customers can be contacted by call centers, whereas medium-value customers receive an email and low-value customers a text message. Beyond this, promotions can be rolled out to move low-value customers and low customers into the medium pool, and special discounts can be offered to high- and medium-value customers on some products or services to retain their loyalty to your brand.</p>
<p>If you want to contribute or you find any errors in this article, please leave me a comment.</p>
<p>You can reach me on any of the Matrix decentralized servers. Here is my Element ID <strong>@maximilien:matrix.org</strong></p>
<p>If you are on one of the Mastodon decentralized servers, here is my ID <strong>@maximilien@qoto.org</strong></p>
]]></content:encoded></item><item><title><![CDATA[End-To-End Breast Cancer Model Explainability using SHAP and Random Forest Algorithm.]]></title><description><![CDATA[Model explainability and interpretability are one of the major concerns in the field of machine learning and artificial intelligence nowadays. We are gradually moving from traditional machine learning 'Black Box' where we preprocess data and feed int...]]></description><link>https://maximilien.docquest.io/end-to-end-breast-cancer-model-explainability-using-shap-and-random-forest-algorithm</link><guid isPermaLink="true">https://maximilien.docquest.io/end-to-end-breast-cancer-model-explainability-using-shap-and-random-forest-algorithm</guid><category><![CDATA[Artificial Intelligence]]></category><category><![CDATA[Machine Learning]]></category><dc:creator><![CDATA[Maximilien]]></dc:creator><pubDate>Thu, 16 Dec 2021 09:42:57 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1639604269041/khl5MsGrz.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Model explainability and interpretability are among the major concerns in the field of machine learning and artificial intelligence nowadays. We are gradually moving from the traditional machine learning 'Black Box', where we preprocess data and feed it into our training algorithm, relying only on the accuracy score and classification report to explain how well our model performs on the training and validation datasets, to a 'White Box': a nearly explainable and interpretable model. In the following, we are going to use SHAP (SHapley Additive exPlanations) to explain a breast cancer model built with a random forest classifier.</p>
<hr />
<p><strong> Contents </strong></p>
<p><strong>  1- What is SHAP </strong></p>
<p><strong>  2- Goal of SHAP </strong></p>
<p><strong> 3- SHAP value </strong></p>
<p><strong> 4- SHAP Explainer </strong></p>
<p><strong> 5- List of SHAP charts</strong></p>
<p><strong> 6-  Model-Agnostic Method  and advantages </strong></p>
<p><strong> 7-  Model-Agnostic Method layers</strong></p>
<p><strong> 8- Implementation of breast cancer model explainability using SHAP and random forest classifier algorithm </strong></p>
<hr />
<p><strong> 1- What is SHAP </strong></p>
<p>SHAP (SHapley Additive exPlanations) is a unified approach to explain the output of any machine learning model. It has optimized functions for interpreting tree-based models and a model agnostic explainer function for interpreting any black-box models for which the predictions are known.</p>
<p><strong>  2- Goal of SHAP </strong></p>
<p>The goal of SHAP is to explain the prediction of an instance x by computing the contribution of each feature to the prediction. The SHAP explanation method computes Shapley values from coalitional game theory. The feature values of a data instance act as players in a coalition. Shapley values tell us how to fairly distribute the prediction among the features.</p>
<p><strong> 3- SHAP value </strong></p>
<p>Shapley values are a widely used approach from cooperative game theory. The essence of the Shapley value is to measure the contribution of each player to the final outcome separately within the coalition, while preserving the property that the sum of the contributions equals the final outcome. Though there are other techniques used to explain models, like permutation importance and partial dependence plots, below are some benefits of using SHAP values over those techniques:</p>
<ul>
<li><strong>Global interpretability</strong>: SHAP values not only show feature importance but also show whether a feature has a positive or negative impact on predictions.</li>
<li><strong>Local interpretability</strong>: We can calculate SHAP values for each individual prediction and know how the features contribute to that single prediction. Other techniques only show aggregated results over the whole dataset.</li>
<li><strong>Model coverage</strong>: SHAP values can be used to explain a large variety of models, including linear models (e.g. linear regression), tree-based models (e.g. XGBoost) and neural networks, while other techniques can only explain limited model types.</li>
</ul>
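<p>To make the "fair distribution" idea tangible, here is a toy brute-force Shapley computation in Python. It is purely illustrative: the three "features" and the value function are made up, and real SHAP implementations use far more efficient algorithms.</p>
<pre><code>from itertools import combinations
from math import factorial

players = ['age', 'bmi', 'glucose']   # hypothetical features acting as players

# Made-up value function: the "model output" when only a coalition of features is present
payoffs = {(): 0, ('age',): 10, ('bmi',): 20, ('glucose',): 25,
           ('age', 'bmi'): 35, ('age', 'glucose'): 40,
           ('bmi', 'glucose'): 50, ('age', 'bmi', 'glucose'): 60}

def v(coalition):
    return payoffs[tuple(sorted(coalition))]

n = len(players)
for i in players:
    others = [p for p in players if p != i]
    phi = 0.0
    for r in range(n):
        for S in combinations(others, r):
            # Shapley weight for a coalition of size |S|
            weight = factorial(len(S)) * factorial(n - len(S) - 1) / factorial(n)
            phi += weight * (v(list(S) + [i]) - v(list(S)))
    # The three printed values sum to v(all players) - v(empty) = 60
    print(i, round(phi, 2))
</code></pre>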
<p><strong> 4- SHAP Explainer </strong></p>
<p>SHAP has a list of classes which can help us understand different kinds of machine learning models from many Python libraries. These classes are commonly referred to as explainers. An explainer generally takes the ML model and data as input and returns an explainer object holding SHAP values, which are used to plot the various charts explained later on. Below is a list of available explainers in SHAP; a minimal usage sketch follows the list.</p>
<ul>
<li>AdditiveExplainer: This explainer is used to explain Generalized Additive Models.</li>
<li>This explainer uses the brute-force approach to find shap values, trying all possible parameter sequences.</li>
<li>DeepExplainer: This explainer is designed for deep learning models created using Keras, TensorFlow, and PyTorch. It’s an enhanced version of the DeepLIFT algorithm, where we measure conditional expectations of SHAP values based on a number of background samples. It’s advisable to keep a reasonable number of samples as background, because more samples give more accurate results but take a lot of time to compute SHAP values. Generally, 100 random samples are a good choice.</li>
<li>GradientExplainer: This explainer is used for differentiable models which are based on the concept of expected gradients which itself is an extension of the integrated gradients method.</li>
<li>KernelExplainer: This explainer uses special weighted linear regression to compute the importance of each feature and the same values are used as SHAP values.</li>
<li>LinearExplainer: This explainer is used for linear models available from sklearn. It can account for the relationship between features as well.</li>
<li>PartitionExplainer: This explainer calculates shap values recursively through trying a hierarchy of feature combinations. It can capture the relationship between a group of related features.</li>
<li>PermutationExplainer: This explainer iterates through all permutation of features in both forward and reverses directions. This explainer can take more time if tried with many samples.</li>
<li>SamplingExplainer: This explainer generates shap values based on assumption that features are independent and is an extension of an algorithm proposed in the paper "An Efficient Explanation of Individual Classifications using Game Theory".</li>
<li>TreeExplainer: This explainer is used for models that are based on a tree-like decision tree, random forest, gradient boosting.</li>
<li>CoefficentExplainer: This explainer returns model coefficients as shap values. It does not do any actual shap values calculation.</li>
<li>LimeTabularExplainer: This explainer simply wraps LimeTabularExplainer from the lime library.</li>
<li>MapleExplainer: This explainer simply wraps MAPLE into shap interface.</li>
<li>RandomExplainer:  This explainer simply returns random feature shap values.</li>
<li>TreeGainExplainer : This explainer returns global gain/Gini feature importances for tree models as shap values.</li>
<li>TreeMapleExplainer : This explainer provides a wrapper around tree MAPLE into shap interface.</li>
</ul>
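<p>Whichever explainer you pick, the usage pattern is the same. Here is a minimal sketch (assuming an already fitted tree-based sklearn model named <code>model</code> and a feature dataframe <code>X</code>; both are placeholders, and section 8 below builds the real thing):</p>
<pre><code>import shap

# TreeExplainer suits tree ensembles such as the random forest used later in this post
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)   # one contribution per feature per sample
shap.summary_plot(shap_values, X)        # beeswarm plot of global feature impact
</code></pre>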
<p><strong> 5- List of SHAP charts </strong></p>
<ul>
<li>summary_plot creates a beeswarm plot of shap values distribution of each feature of the dataset.</li>
<li>decision_plot  shows the path of how the model reached a particular decision based on shap values of individual features. The individual plotted line represents one sample of data and how it reached a particular prediction.</li>
<li>multioutput_decision_plot shows decision plot for multi output models.</li>
<li>dependence_plot  shows relationship between feature value (X-axis) and its shape values (Y-axis).</li>
<li>force_plot  plots shap values using additive force layout. It can help us see which features most positively or negatively contributed to prediction.</li>
<li>image_plot  plots shape values for images.</li>
<li>monitoring_plot  helps in monitoring the behavior of the model over time. It monitors the loss of model overtime.</li>
<li>embedding_plot  projects shap values using PCA for 2D visualization.</li>
<li>partial_dependence_plot  shows a basic partial dependence plot for a feature.</li>
<li>bar_plot  shows a bar plot of shap values impact on the prediction of a particular sample.</li>
<li>waterfall_plot  shows a waterfall plot explaining a particular prediction of the model based on shap values. It kind of shows the path of how shap values were added to the base value to come to a particular prediction.</li>
<li>text_plot plots an explanation of text samples coloring text based on their shap values.</li>
</ul>
<p><strong> 6-  Model-Agnostic Method  and advantages</strong></p>
<p>Interpretation methods that separate the explanations from the machine learning model are called model-agnostic interpretation methods. The advantages of applying a model-agnostic explanation system are:</p>
<ul>
<li><p>Model flexibility: The interpretation method can work with any machine learning model, such as random forests, linear model and deep neural networks.</p>
</li>
<li><p>Explanation flexibility: You are not limited to a certain form of explanation. In some cases it might be useful to have a linear formula, in other cases a graphic with feature importance.</p>
</li>
<li><p>Representation flexibility: The explanation system should be able to use a different feature representation as the model being explained.</p>
</li>
</ul>
<p><strong> 7-  Model-Agnostic Method layers</strong></p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1638862149283/22Ivv4BS9.png" alt="Word.png" /></p>
<p>Let us have a look at model-agnostic interpretability. We capture the world by collecting data and abstract it further by learning to predict the data with a machine learning model. </p>
<p>The World layer: It contains everything that can be observed, which we aim to learn something about and interact with.</p>
<p>Data layer: We have to digitize the World in order to make it processable for computers and to store information. The Data layer contains anything from images to texts, etc.</p>
<p>Black box model layer: We fit the preprocessed data into the machine learning model and predict the outcome on unseen test data.</p>
<p>Interpretability methods layer: It deals with the opacity of machine learning models. What were the most important features for a particular diagnosis? Why was a financial transaction classified as fraud?</p>
<p>The last layer is occupied by a human, where all the explanation takes place.</p>
<p><strong> 8- Implementation of breast cancer model explainability using SHAP and random forest classifier algorithm </strong></p>
<p><em>Problem statement:</em> Breast cancer is a type of cancer that starts in the breast. Cancer starts when cells begin to grow out of control. A benign tumor is a tumor that does not invade its surrounding tissue or spread around the body, whereas a malignant tumor may do both. We are required to use SHAP to explain our model's prediction of whether a cancer is malignant or benign, using a random forest.</p>
<p>The dataset used below can be downloaded from the DPhi GitHub repository <a target="_blank" href="https://raw.githubusercontent.com/dphi-official/Datasets/master/breast_cancer/Training_set_breastcancer.csv">here</a></p>
<ul>
<li><strong>  Importing Necessary Libraries </strong></li>
</ul>
<pre><code><span class="hljs-keyword">from</span> sklearn.ensemble <span class="hljs-keyword">import</span> <span class="hljs-title">RandomForestClassifier</span>
<span class="hljs-title"><span class="hljs-keyword">from</span></span> <span class="hljs-title">sklearn</span>.<span class="hljs-title">model_selection</span> <span class="hljs-title"><span class="hljs-keyword">import</span></span> <span class="hljs-title">train_test_split</span>
<span class="hljs-title"><span class="hljs-keyword">from</span></span> <span class="hljs-title">matplotlib</span>.<span class="hljs-title">colors</span> <span class="hljs-title"><span class="hljs-keyword">import</span></span> <span class="hljs-title">ListedColormap</span>
<span class="hljs-title"><span class="hljs-keyword">import</span></span> <span class="hljs-title">matplotlib</span>.<span class="hljs-title">pyplot</span> <span class="hljs-title"><span class="hljs-keyword">as</span></span> <span class="hljs-title">plt</span>
<span class="hljs-title"><span class="hljs-keyword">import</span></span> <span class="hljs-title">seaborn</span> <span class="hljs-title"><span class="hljs-keyword">as</span></span> <span class="hljs-title">sns</span>
<span class="hljs-title"><span class="hljs-keyword">import</span></span> <span class="hljs-title">pandas</span> <span class="hljs-title"><span class="hljs-keyword">as</span></span> <span class="hljs-title">pd</span>
<span class="hljs-title"><span class="hljs-keyword">import</span></span> <span class="hljs-title">numpy</span> <span class="hljs-title"><span class="hljs-keyword">as</span></span> <span class="hljs-title">np</span>
<span class="hljs-title"><span class="hljs-keyword">import</span></span> <span class="hljs-title">warnings</span>
<span class="hljs-title">warnings</span>.<span class="hljs-title">filterwarnings</span>(<span class="hljs-string">'ignore'</span>)
<span class="hljs-title"><span class="hljs-keyword">import</span></span> <span class="hljs-title">shap</span>
<span class="hljs-title">shap</span>.<span class="hljs-title">initjs</span>()
</code></pre><ul>
<li><strong>  Loading the first five row of the data</strong></li>
</ul>
<pre><code>df<span class="hljs-operator">=</span>pd.read_csv(<span class="hljs-string">"https://raw.githubusercontent.com/dphi-official/Datasets/master/breast_cancer/Training_set_breastcancer.csv"</span>)
df.head()
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1639334109078/1hXVLcQuw.png" alt="head.png" /></p>
<ul>
<li><strong> Perform Basic Exploratory Data Analysis </strong></li>
</ul>
<p>This section displays summary statistics that quantitatively describe or summarize features of a collection of information: the process of condensing key characteristics of the dataset into simple numeric metrics. Some of the common metrics used are the mean, standard deviation, and correlation.</p>
<ul>
<li>Checking the dimensionality of the dataframe</li>
</ul>
<pre><code><span class="hljs-selector-tag">df</span><span class="hljs-selector-class">.shape</span>
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1639404827541/-n7XVQqRX.png" alt="shape.png" /></p>
<ul>
<li>Getting a concise summary of the dataframe</li>
</ul>
<pre><code>df.info()
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1639405201874/1bJftIOsI.png" alt="info.png" /></p>
<ul>
<li>Descriptive statistics</li>
</ul>
<pre><code>df.describe().transpose()
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1639576948406/RMIJDgHJJV.png" alt="sta.png" /></p>
<p>From the difference between the median and the mean in the figure above, it appears some features are skewed.</p>
<p>Based on the diagnosis class, data can be categorized using the mean value as follows.</p>
<pre><code>df.groupby(<span class="hljs-string">'diagnosis'</span>).mean()
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1639579152936/ZgE2lJXGB.png" alt="mean.png" /></p>
<ul>
<li>Grouping the labels into classes and displaying the number of elements per class</li>
</ul>
<pre><code>print(df.groupby(<span class="hljs-string">"diagnosis"</span>).size())
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1639405905830/mJNrVwsqG.png" alt="size.png" /></p>
<ul>
<li>Displaying the distribution of elements per class</li>
</ul>
<pre><code>sns.countplot(df[<span class="hljs-string">'diagnosis'</span>], label<span class="hljs-operator">=</span><span class="hljs-string">"Count"</span>, palette<span class="hljs-operator">=</span>sns.color_palette([<span class="hljs-string">'blue'</span>, <span class="hljs-string">'red'</span>]),
              order<span class="hljs-operator">=</span>pd.value_counts(df[<span class="hljs-string">'diagnosis'</span>]).iloc[:<span class="hljs-number">398</span>].index)
plt.show()
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1639406348711/e8QGg2SpM.png" alt="distr.png" /></p>
<p>The count plot above clearly shows there is a greater number of benign (B) stage cancer tumors in the dataset, which can be cured.</p>
<ul>
<li>Dropping the id column because it does not affect our analysis</li>
</ul>
<pre><code>df <span class="hljs-operator">=</span> df.drop(<span class="hljs-string">'id'</span>, axis<span class="hljs-operator">=</span><span class="hljs-number">1</span>)
</code></pre><ul>
<li>Plotting the correlation among the features and the target variable</li>
</ul>
<pre><code>df_corr <span class="hljs-operator">=</span> df.corr()
plt.figure(figsize<span class="hljs-operator">=</span>(<span class="hljs-number">20</span>, <span class="hljs-number">12</span>))
sns.heatmap(df_corr, cbar<span class="hljs-operator">=</span>True, annot<span class="hljs-operator">=</span>True, yticklabels<span class="hljs-operator">=</span>df.columns,
            xticklabels<span class="hljs-operator">=</span>df.columns)
plt.show()
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1639410862251/TuEf373IP.png" alt="corr.png" /></p>
<p>Each square shows the correlation between the variables on its two axes. Correlation ranges from -1 to +1. Values close to zero mean there is no linear trend between the two features. The closer the correlation is to +1, the more positively correlated the features are: as one increases, so does the other, and the closer to 1, the stronger the relationship. A correlation close to -1 is similar, except that one variable decreases as the other increases. The diagonal is all 1s because each variable there is correlated with itself, a perfect correlation. For the rest, the larger the number and the lighter the color, the higher the correlation between the two variables. The plot is symmetrical about the diagonal, since the same pairs of variables appear on both sides.</p>
<ul>
<li>Printing the feature pairs with the highest correlation (the slice skips the leading entries, which are the perfect self-correlation pairs from the diagonal)</li>
</ul>
<pre><code>high_correlation <span class="hljs-operator">=</span>df_corr.abs()
high_correlation_unstack<span class="hljs-operator">=</span>high_correlation.unstack()
high_correlation_sort <span class="hljs-operator">=</span> high_correlation_unstack.sort_values(ascending<span class="hljs-operator">=</span>False)
print(high_correlation_sort[<span class="hljs-number">30</span>:<span class="hljs-number">35</span>])
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1639413603613/xcWV9MxT2.png" alt="highcorr.png" /></p>
<ul>
<li>Plotting distribution of features with highest correlation "radius_mean and perimeter_mean"</li>
</ul>
<pre><code>sns.jointplot(x="radius_mean", y="perimeter_mean", data=df, kind="scatter", space=0, color="blue", height=9, ratio=3)
plt.show()
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1639413791434/kHlTi9FcO.png" alt="distcorr.png" /></p>
<ul>
<li>Plotting distribution of features with highest correlation "radius_worst and  perimeter_worst"</li>
</ul>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1639414004193/L3PY7HigBL.png" alt="Screenshot from 2021-12-13 16-47-08.png" /></p>
<ul>
<li>Splitting the data into Train and Test sets</li>
</ul>
<pre><code>X = df.drop("diagnosis", axis=1)
y = df.diagnosis.map({'B': 0, 'M': 1}).astype(int)  # np.int is removed in recent NumPy versions
</code></pre><ul>
<li>The train to test ratio should be 80:20 and the random_state should be 0</li>
</ul>
<pre><code>X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
</code></pre><ul>
<li>Use Random Forest Machine Learning Model for prediction</li>
</ul>
<pre><code>model <span class="hljs-operator">=</span> RandomForestClassifier(n_estimators <span class="hljs-operator">=</span><span class="hljs-number">400</span>, criterion<span class="hljs-operator">=</span><span class="hljs-string">'entropy'</span>,random_state<span class="hljs-operator">=</span><span class="hljs-number">1</span>,n_jobs<span class="hljs-operator">=</span><span class="hljs-number">-1</span>,max_depth<span class="hljs-operator">=</span><span class="hljs-number">5</span>)
</code></pre><ul>
<li>Fitting the model</li>
</ul>
<pre><code>model.fit(X_train, y_train)
</code></pre><ul>
<li>Predicting on X_test set</li>
</ul>
<pre><code>y_pred <span class="hljs-operator">=</span> model.predict(X_test)
</code></pre><ul>
<li>Evaluate the model using Accuracy Score</li>
</ul>
<pre><code><span class="hljs-keyword">from</span> sklearn.metrics <span class="hljs-keyword">import</span>  <span class="hljs-title">accuracy_score</span>
<span class="hljs-title">score</span><span class="hljs-operator">=</span> <span class="hljs-title">accuracy_score</span>(<span class="hljs-title">y_test</span>,<span class="hljs-title">y_pred</span>)
<span class="hljs-title">print</span>(<span class="hljs-string">"Accuracy:"</span>,<span class="hljs-title">score</span>)
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1639415187108/WGBK0oTFj.png" alt="Screenshot from 2021-12-13 17-06-51.png" /></p>
<p>Although we obtained an accuracy of 95% (very good), the score alone does not tell us which features push a breast cancer prediction towards benign or malignant. We need to explain what goes into the model to produce a specific predicted class. Other questions that may arise are:</p>
<ul>
<li>How do different features affect the prediction results?</li>
<li>What are the top features that influence the prediction results?</li>
<li>The model performance metrics look great, but should I trust the results?</li>
</ul>
<p><strong>Using the SHAP explainer to derive SHAP values for the random forest model.</strong></p>
<ul>
<li>Creating an object of the TreeExplainer class, which takes our model as a parameter.</li>
</ul>
<pre><code>explainer <span class="hljs-operator">=</span> shap.TreeExplainer(model)
</code></pre><ul>
<li>Calculating the SHAP value</li>
</ul>
<pre><code>shap_values <span class="hljs-operator">=</span> explainer.shap_values(X_test)
</code></pre><ul>
<li>Displaying the expected value</li>
</ul>
<pre><code>print(<span class="hljs-string">"Expected Value:"</span>, explainer.expected_value)
</code></pre><p><strong> Expected Value: [0.66426696 0.33573304] </strong></p>
<p>The lines of code above calculate the Shapley values.</p>
<ul>
<li>In our case, a classification problem, shap_values is a list of arrays whose length equals the number of classes, 2 (benign and malignant); the same holds for expected_value. We must therefore choose which label we are trying to explain and use the corresponding shap_value and expected_value in further plots. Depending on the prediction of an instance, we can choose the corresponding SHAP values and plot them as shown below.</li>
</ul>
<p>NB: for regression (out of the scope of this article), shap_values returns a single array.</p>
<pre><code>row<span class="hljs-operator">=</span><span class="hljs-number">0</span>
<span class="hljs-keyword">for</span> which_class in y.unique():
    display(shap.waterfall_plot(shap.Explanation(values<span class="hljs-operator">=</span>shap_values[<span class="hljs-keyword">int</span>(which_class)][row], base_values<span class="hljs-operator">=</span>explainer.expected_value[<span class="hljs-keyword">int</span>(which_class)], data<span class="hljs-operator">=</span>X_test.iloc[row],feature_names<span class="hljs-operator">=</span>X_test.columns.tolist())))
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1639574053376/-3no3R1Er.png" alt="negative.png" /></p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1639574168664/3Ht79laPZ.png" alt="positive.png" /></p>
<p>In the plot above, f(x) is the prediction after considering all the features, and E[f(x)] is the mean prediction (the base value).</p>
<ul>
<li>The blue bar shows how much a particular feature decreases the value of the prediction.</li>
<li><p>The red bar shows how much a particular feature increases the value of the prediction.</p>
</li>
<li><p>Plotting  SHAP force plot for the first row of test data.</p>
</li>
</ul>
<pre><code>shap.initjs()
shap_values_first_row <span class="hljs-operator">=</span> explainer.shap_values(X_test.iloc[<span class="hljs-number">0</span>])
shap.force_plot(explainer.expected_value[<span class="hljs-number">0</span>], shap_values_first_row[<span class="hljs-number">0</span>], X_test.iloc[<span class="hljs-number">0</span>])
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1639580140583/m8LQig9RD.png" alt="Screenshot from 2021-12-15 14-54-06.png" /></p>
<p>The force plot above depicts the weight of each feature's contribution, centered around the baseline SHAP value of 0.6423, which each feature either increases or decreases. Red depicts features with a positive weight on the model output and blue depicts features with a negative weight. Here, perimeter_worst, concave points_mean, concave points_worst, concavity_worst, concavity_mean and texture_worst decrease the model prediction; therefore the first test sample has a low risk of developing breast cancer (benign tumor).</p>
<ul>
<li>Shap summary_plot</li>
</ul>
<pre><code><span class="hljs-selector-tag">shap</span><span class="hljs-selector-class">.summary_plot</span>(<span class="hljs-selector-tag">shap_values</span><span class="hljs-selector-attr">[0]</span>,<span class="hljs-selector-tag">X_test</span>)
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1639607310571/8Vs2XbQuZ.png" alt="Screenshot from 2021-12-15 22-28-16.png" /></p>
<p>This summary plot aggregates the SHAP values for the benign class (shap_values[0]) across all test samples and ranks the features with the biggest impact on the model output. Each point is one sample; red indicates a high feature value and blue a low one.</p>
<pre><code><span class="hljs-selector-tag">shap</span><span class="hljs-selector-class">.summary_plot</span>(<span class="hljs-selector-tag">shap_values</span><span class="hljs-selector-attr">[1]</span>,<span class="hljs-selector-tag">X_test</span>)
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1639608490212/LFgU-ZwDB.png" alt="summary1.png" /></p>
<p>For the malignant class (shap_values[1]), red again depicts features with a positive weight and blue features with a negative weight on the model output. High values of perimeter_worst, concave points_mean, concave points_worst, concavity_worst, concavity_mean and texture_worst increase the model output, pushing a sample towards a high risk of developing breast cancer (malignant tumor).</p>
<p><strong>There are other SHAP plots we could explore, but for lack of time I would like to introduce you to an amazing Python library that explains a SHAP model in just a few lines of code.</strong></p>
<ul>
<li><strong> Explainerdashboard </strong></li>
</ul>
<p>explainerdashboard is a library for quickly building interactive dashboards that analyze and explain the predictions and workings of (scikit-learn compatible) machine learning models, including xgboost, catboost and lightgbm. It makes your model transparent and explainable with just a couple of lines of code. It lets you investigate SHAP values, permutation importances, interaction effects, partial dependence plots, all kinds of performance plots, and even the individual decision trees inside a random forest. Moreover, explainerdashboard helps any data scientist create an interactive, explainable-AI web app in minutes, without having to know anything about web development or deployment.</p>
<p>Let's get into the code now.</p>
<ul>
<li>Installing explainerdashboard (the installation can take a few minutes)</li>
</ul>
<pre><code><span class="hljs-addition">!pip install explainerdashboard</span>
</code></pre><ul>
<li>Importing the libraries </li>
</ul>
<pre><code><span class="hljs-keyword">from</span> explainerdashboard <span class="hljs-keyword">import</span> ClassifierExplainer
<span class="hljs-keyword">from</span> dash <span class="hljs-keyword">import</span> html
</code></pre><ul>
<li>Creating an object of the ClassifierExplainer class, passing the model, X_test and y_test as arguments</li>
</ul>
<pre><code><span class="hljs-attr">explainer</span> = ClassifierExplainer(model, X_test, y_test)
</code></pre><ul>
<li>Launching the dashboard</li>
</ul>
<pre><code><span class="hljs-keyword">from</span> explainerdashboard <span class="hljs-keyword">import</span> ExplainerDashboard
ExplainerDashboard(explainer).run()
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1639600325692/tbvaFkzOq.png" alt="Screenshot from 2021-12-15 15-33-31.png" />
After executing the above, the Flask web server address should be displayed as shown in the image. Copy it and paste it into your web browser.</p>
<p>http://0.0.0.0:8050/</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1639600442357/8F9OfxuUh.png" alt="Model Explainer.png" /></p>
<p>As shown above, this summary plot highlights the features that have the biggest impact on the predicted malignant cancer, based on SHAP values.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1639602379911/Ie-BU7XN5.png" alt="report.png" /></p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1639602715569/MMH4_JZsg.png" alt="ex.png" /></p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1639603126624/e-27oB9N1.png" alt="dep.png" /></p>
<p>Kudos for making it to the end of this article.</p>
<p><strong> Conclusion </strong></p>
<p>We started with a classification problem, using a random forest on a breast cancer dataset from a hospital in the USA. Exploring the dataset, we found that 250 people had benign tumors and 148 had malignant ones. Next, we fed our training set into the black-box model and evaluated its performance on unseen data, obtaining a 95% accuracy score. Finally, we used SHAP to explain and interpret our black-box model.</p>
<p>If you want to contribute, or if you find any errors in this article, please leave me a comment.</p>
<p>You can reach out to me on any of the Matrix decentralized servers. My Element messenger ID is @maximilien:matrix.org</p>
<p>If you are on one of the Mastodon decentralized servers, my ID is @maximilien@qoto.org</p>
<p>If you are on LinkedIn, you can reach me <a target="_blank" href="https://www.linkedin.com/in/ephrem-maximilien-kpizingui-48222775/">here</a>.</p>
<blockquote>
<p>Warm regards,</p>
<p>Maximilien.</p>
</blockquote>
]]></content:encoded></item><item><title><![CDATA[End-To-End Amazon Product Rating Using Multinomial Naive Bayes Algorithm and CountVectorizer]]></title><description><![CDATA[In this pandemic time where every body has to cover his nose and turn to online shopping, your product is not what you say about it but it is what google says about it. Let me ask you this question as a consumer of an end product. When you go online ...]]></description><link>https://maximilien.docquest.io/end-to-end-amazon-product-rating-using-multinomial-naive-bayes-algorithm-and-countvectorizer</link><guid isPermaLink="true">https://maximilien.docquest.io/end-to-end-amazon-product-rating-using-multinomial-naive-bayes-algorithm-and-countvectorizer</guid><category><![CDATA[Machine Learning]]></category><category><![CDATA[nlp]]></category><category><![CDATA[ML]]></category><category><![CDATA[Amazon]]></category><dc:creator><![CDATA[Maximilien]]></dc:creator><pubDate>Sat, 06 Nov 2021 19:50:40 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1635431920924/uYv3tvEJS.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>In this pandemic era, where everybody has to mask up and turn to online shopping, your product is not what you say about it; it is what Google says about it. Let me ask you a question as a consumer: when you go online to purchase a product, what is the first thing you do? Do you buy an item you cannot physically interact with right away, or do you take the time to read through the comments and experiences of other end users of that same product before purchasing? If you belong to the latter category, you are on the winning side in this digital age. With this in mind, we are going to use a mathematical model to predict a product's rating out of 5, based on sample product information collected from end consumers on Amazon, using machine learning and natural language processing with RMSE as the evaluation metric.</p>
<p><strong>Contents</strong></p>
<p><strong>1. What is product review</strong></p>
<p><strong>2. Importance of  product review</strong></p>
<p><strong>3. Top product review platforms</strong></p>
<p><strong>4. What is multinomial Naive Bayes algorithm</strong></p>
<p><strong>5. What is CountVectorizer</strong></p>
<p><strong>6. Code implementation of CountVectorizer</strong></p>
<p><strong>7. Implementation of product rating using multinomial Naive Bayes algorithm</strong></p>
<p><strong>1. What is a product review</strong></p>
<p>In electronic commerce, a product review is a section on a shopping website that gives customers the opportunity to rate and comment on a product they have purchased, which other consumers then read to decide whether to purchase the same item for their own needs.</p>
<p><strong>2. Importance of  product review</strong></p>
<ul>
<li><p>Drive sales</p>
</li>
<li><p>Build trust</p>
</li>
<li><p>Aid customer decision making</p>
</li>
<li><p>Credibility and social proof</p>
</li>
</ul>
<p><strong>3. Top product review platforms</strong></p>
<ul>
<li><p>Bazaarvoice</p>
</li>
<li><p>Yotpo</p>
</li>
<li><p>Trustpilot</p>
</li>
<li><p>PowerReviews</p>
</li>
<li><p>Reevoo</p>
</li>
<li><p>Feefo</p>
</li>
</ul>
<p><strong>4. What is multinomial Naive Bayes algorithm</strong></p>
<p>The multinomial Naive Bayes algorithm belongs to the family of probabilistic algorithms that apply Bayes' theorem with the "naive" assumption of conditional independence between every pair of features. It is designed to handle text corpora, using word counts as its method of calculating probability, given by:</p>
<p>P(c|x) = P(x|c) * P(c) / P(x)</p>
<p>Where c is the class among the possible outcomes and x is the given instance to be classified, represented by certain features.</p>
<p>If you want to go deeper into the mathematics, I refer you to <a target="_blank" href="https://dphi.tech/blog/naive-bayes-algorithm-everything-you-need-to-know/">Nagesh Singh Chauhan's post on the DPhi platform</a>.</p>
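<p>To make the formula concrete, here is a minimal sketch of the multinomial computation done by hand. The classes, word counts and review below are hypothetical, invented purely for illustration; since P(x) is the same for every class, it cancels out when we compare posteriors.</p>
<pre><code># Hypothetical word counts observed in training reviews for two classes
counts = {
    "positive": {"great": 8, "poor": 1, "battery": 3},
    "negative": {"great": 1, "poor": 7, "battery": 4},
}
priors = {"positive": 0.5, "negative": 0.5}

def posterior(words, cls, alpha=1.0):
    """P(c) * product of P(w|c), with Laplace smoothing; P(x) is omitted as it cancels."""
    total = sum(counts[cls].values())
    vocab = {w for c in counts.values() for w in c}
    p = priors[cls]
    for w in words:
        p *= (counts[cls].get(w, 0) + alpha) / (total + alpha * len(vocab))
    return p

review = ["poor", "battery"]
scores = {c: posterior(review, c) for c in counts}
print(scores)                       # unnormalized posteriors
print(max(scores, key=scores.get))  # -&gt; negative
</code></pre>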
<p><strong>5. What is CountVectorizer </strong></p>
<p>CountVectorizer is a great tool provided by the scikit-learn library in Python. It converts a given collection of text documents into numerical vectors based on the frequency of each word occurring in the text. This transformation is a key early stage of a machine learning pipeline: it provides the feature representation of raw text for tasks such as text classification and clustering, because machine learning algorithms can only compute on numerical features, whatever input data is fed into the model.</p>
<p><strong>6. Code implementation of CountVectorizer</strong></p>
<p>Considering a few sample texts from a corpus of my IoT startup as a list element:</p>
<blockquote>
<p>corpus= ["maxtek helps startup",
                    "maxtek is into computer vision and IoT",
                    "maxtek provides Deep learning and IoT solutions"]</p>
</blockquote>
<p>CountVectorizer creates a matrix in which each unique word is represented by a column of the matrix, and each text sample from the corpus is a row in the matrix. The value of each cell is nothing but the count of the word in that particular text sample as shown below</p>
<pre><code><span class="hljs-keyword">from</span> sklearn.feature_extraction.text <span class="hljs-keyword">import</span> CountVectorizer

corpus = [<span class="hljs-string">"maxtek helps startup"</span>,
            <span class="hljs-string">"maxtek is into computer vision and IoT"</span>,
            <span class="hljs-string">"maxtek provides Deep learning and IoT solutions"</span>]

<span class="hljs-comment"># Create a Vectorizer Object</span>
vectorizer = CountVectorizer()

vectorizer.fit(corpus)

<span class="hljs-comment"># Printing the identified Unique words along with their indices</span>
print(<span class="hljs-string">"Vocabulary: "</span>, vectorizer.vocabulary_)

<span class="hljs-comment"># Encode the corpus</span>
vector = vectorizer.transform(corpus)

<span class="hljs-comment"># Summarizing the Encoded word in the corpus</span>
print(<span class="hljs-string">"Encoded corpus is:"</span>)
print(vector.toarray())
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1636217949592/LrdusUt3z.png" alt="doc.png" />
Because most of its cells are zero, this kind of matrix is known as a sparse matrix.</p>
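<p>As a quick check of that claim (assuming the vectorizer code above has been run), printing the encoded object shows that scikit-learn stores only the non-zero entries in compressed sparse row format rather than the full table:</p>
<pre><code># The encoded corpus is a SciPy sparse matrix, not a dense array
print(type(vector))  # a scipy.sparse CSR matrix; the exact class name varies by SciPy version
print(vector)        # lists only the non-zero (row, column) count entries
</code></pre>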
<p>Key observations:</p>
<ul>
<li><p>There are 13 unique words in the corpus forming the vocabulary, represented as columns of the table.</p>
</li>
<li><p>There are 3 sentences in the corpus each represented as rows of the table.</p>
</li>
<li><p>Every cell contains a number, that represents the count of the word in that particular text.</p>
</li>
<li><p>All words have been converted to lowercase.</p>
</li>
<li><p>The words in columns have been arranged alphabetically.</p>
</li>
</ul>
<p><strong>7. Implementation of product rating using multinomial Naive Bayes algorithm</strong></p>
<ul>
<li>Importing all the libraries required to run this code. </li>
</ul>
<pre><code>from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import mean_squared_error, roc_auc_score, accuracy_score
from sklearn.naive_bayes import MultinomialNB
from sklearn import preprocessing
from sklearn import metrics
from sklearn.model_selection import train_test_split
from nltk import sent_tokenize, word_tokenize, pos_tag
from bs4 import BeautifulSoup
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from nltk.stem import SnowballStemmer, WordNetLemmatizer
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
import contractions
import nltk
import re
</code></pre><p>If you run into a package-not-found error with the imports above, install the missing library as shown below:</p>
<pre><code><span class="hljs-addition">!pip3 install name_of_the_library</span>
</code></pre><ul>
<li>Loading train and test dataset</li>
</ul>
<pre><code><span class="hljs-attr">train_df</span> = pd.read_csv(<span class="hljs-string">"Train_Data.csv"</span>)
<span class="hljs-attr">test_df</span> = pd.read_csv(<span class="hljs-string">'Test_Data.csv'</span>)
</code></pre><ul>
<li>Displaying 5 rows of the train_df</li>
</ul>
<pre><code><span class="hljs-selector-tag">train_df</span><span class="hljs-selector-class">.head</span>()
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1636209055674/5sfSIUzh_.png" alt="head.png" /></p>
<ul>
<li>Displaying information about the features sets in the train dataframe</li>
</ul>
<pre><code>train_df.<span class="hljs-keyword">info</span>()
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1636209351416/jN-lEBZ2f.png" alt="info.png" /></p>
<ul>
<li>Visualizing the distribution of average_review_rating</li>
</ul>
<pre><code><span class="hljs-selector-tag">plt</span><span class="hljs-selector-class">.figure</span>(figsize=(<span class="hljs-number">12</span>,<span class="hljs-number">8</span>))

<span class="hljs-selector-tag">train_df</span><span class="hljs-selector-attr">['average_review_rating']</span><span class="hljs-selector-class">.value_counts</span>()<span class="hljs-selector-class">.sort_index</span>()<span class="hljs-selector-class">.plot</span>(kind=<span class="hljs-string">'bar'</span>)
<span class="hljs-selector-tag">plt</span><span class="hljs-selector-class">.title</span>(<span class="hljs-string">'Distribution of average_review_rating'</span>)
<span class="hljs-selector-tag">plt</span><span class="hljs-selector-class">.xlabel</span>(<span class="hljs-string">'average_review_rating'</span>)
<span class="hljs-selector-tag">plt</span><span class="hljs-selector-class">.ylabel</span>(<span class="hljs-string">'Count'</span>)
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1636209520296/neSbiWCTO.png" alt="dist.png" /></p>
<ul>
<li>Visualizing the distribution of reviews for the top 50 products</li>
</ul>
<pre><code><span class="hljs-attribute">products</span> = train_df[<span class="hljs-string">"product_name"</span>].value_counts()
<span class="hljs-attribute">plt</span>.figure(figsize=(<span class="hljs-number">12</span>,<span class="hljs-number">8</span>))
<span class="hljs-attribute">products</span>[:<span class="hljs-number">50</span>].plot(kind='bar')
<span class="hljs-attribute">plt</span>.title(<span class="hljs-string">"Number of Reviews for Top 50 products"</span>)
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1636209901988/0nScoTVZP.png" alt="dist.png" /></p>
<ul>
<li>Visualizing the distribution of reviews for the top 50 manufacturers</li>
</ul>
<pre><code><span class="hljs-attribute">brands</span> = train_df['manufacturer'].value_counts()
<span class="hljs-attribute">plt</span>.figure(figsize=(<span class="hljs-number">12</span>,<span class="hljs-number">8</span>))
<span class="hljs-attribute">brands</span>[:<span class="hljs-number">50</span>].plot(kind='bar')
<span class="hljs-attribute">plt</span>.title(<span class="hljs-string">"Number of Reviews for Top 50 manufacturers"</span>)
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1636210132047/3sr8f4v7w.png" alt="distribu.png" /></p>
<ul>
<li>Visualizing the distribution of the length of the reviews</li>
</ul>
<pre><code>review_length = train_df[<span class="hljs-string">"customer_reviews"</span>].dropna().<span class="hljs-keyword">map</span>(lambda x: <span class="hljs-built_in">len</span>(x))
plt.figure(figsize=(<span class="hljs-number">12</span>,<span class="hljs-number">8</span>))
review_length.loc[review_length &lt; <span class="hljs-number">1500</span>].hist()
plt.title(<span class="hljs-string">"Distribution of customer review Length"</span>)
plt.xlabel(<span class="hljs-string">'Review length (Number of character)'</span>)
plt.ylabel(<span class="hljs-string">'Count'</span>)
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1636210277052/ZDj2PkxtnN.png" alt="leng.png" /></p>
<ul>
<li>Checking for NaN value in the train dataframe</li>
</ul>
<pre><code>train_df.<span class="hljs-keyword">isnull</span>().sum()
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1636210401104/FlCpIGohT.png" alt="nan.png" /></p>
<ul>
<li>As shown above, there are many NaN values in the dataframe, so we remove them using the dropna() method; keeping them would add noise to our data.</li>
</ul>
<pre><code>train_df.dropna(inplace=<span class="hljs-literal">True</span>)
</code></pre><ul>
<li>Displaying customer_reviews in the dataframe, since we will use it as the input to our model. Recall that the system rates products based on customer reviews.</li>
</ul>
<pre><code><span class="hljs-selector-tag">train_df</span><span class="hljs-selector-attr">[<span class="hljs-string">'customer_reviews'</span>]</span>
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1636213008164/nzT_nths1f.png" alt="rev.png" /></p>
<ul>
<li>Checking that there are no NaN values left in the feature</li>
</ul>
<pre><code><span class="hljs-selector-tag">train_df</span><span class="hljs-selector-attr">[<span class="hljs-string">'customer_reviews'</span>]</span><span class="hljs-selector-class">.isnull</span>()<span class="hljs-selector-class">.sum</span>()
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1636213145293/WpPXRNqRn.png" alt="zero.png" /></p>
<p>That looks good. Let's proceed!</p>
<ul>
<li>Defining a function cleanText() to remove special characters, HTML tags, etc. from our feature</li>
</ul>
<pre><code><span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">cleanText</span>(<span class="hljs-params">raw_text, remove_stopwords=False, stemming=False, split_text=False </span>):</span>
    <span class="hljs-string">'''
    Convert a raw review to a cleaned review
    '''</span>
    text = BeautifulSoup(raw_text, <span class="hljs-string">'lxml'</span>).get_text()  <span class="hljs-comment">#remove html</span>
    letters_only = re.sub(<span class="hljs-string">"[^a-zA-Z]"</span>, <span class="hljs-string">" "</span>, text)  <span class="hljs-comment"># remove non-character</span>
    words = letters_only.lower().split() <span class="hljs-comment"># convert to lower case </span>

    <span class="hljs-keyword">if</span> remove_stopwords: <span class="hljs-comment"># remove stopword</span>
        stops = set(stopwords.words(<span class="hljs-string">"english"</span>))
        words = [w <span class="hljs-keyword">for</span> w <span class="hljs-keyword">in</span> words <span class="hljs-keyword">if</span> <span class="hljs-keyword">not</span> w <span class="hljs-keyword">in</span> stops]

    <span class="hljs-keyword">if</span> stemming==<span class="hljs-literal">True</span>: <span class="hljs-comment"># stemming</span>
<span class="hljs-comment">#         stemmer = PorterStemmer()</span>
        stemmer = SnowballStemmer(<span class="hljs-string">'english'</span>) 
        words = [stemmer.stem(w) <span class="hljs-keyword">for</span> w <span class="hljs-keyword">in</span> words]

    <span class="hljs-keyword">if</span> split_text==<span class="hljs-literal">True</span>:  <span class="hljs-comment"># split text</span>
        <span class="hljs-keyword">return</span> (words)

    <span class="hljs-keyword">return</span>( <span class="hljs-string">" "</span>.join(words))
</code></pre><ul>
<li>Splitting the dataset into training and validation sets to train our model and evaluate its performance, using the train_test_split() method with the feature and target variable as arguments.</li>
</ul>
<pre><code><span class="hljs-attribute">X_train</span>, X_test, y_train, y_test = train_test_split(train_df['customer_reviews'],train_df['average_review_rating'],test_size=<span class="hljs-number">0</span>.<span class="hljs-number">2</span>, random_state=<span class="hljs-number">42</span>)
</code></pre><ul>
<li>Cleaning the training feature set.</li>
</ul>
<pre><code>X_train_cleaned = []
X_test_cleaned = []

<span class="hljs-keyword">for</span> d in X_train:
    X_train_cleaned.<span class="hljs-built_in">append</span>(cleanText(d))
<span class="hljs-built_in">print</span>(<span class="hljs-string">'Show a cleaned review in the training set : \n'</span>,  X_train_cleaned[<span class="hljs-number">10</span>])

<span class="hljs-keyword">for</span> d in X_test:
    X_test_cleaned.<span class="hljs-built_in">append</span>(cleanText(d))
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1636213702953/0xh29C-Mh.png" alt="clean.png" /></p>
<ul>
<li>Printing the identified Unique words along with their indices</li>
</ul>
<pre><code>countVect = CountVectorizer()
X_train_countVect = countVect.fit(X_train_cleaned)

print("Vocabulary: ", X_train_countVect.vocabulary_)
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1636214299003/suKx4ARoP.png" alt="ind.png" /></p>
<ul>
<li>Applying CountVectorizer to X_train_clean</li>
</ul>
<pre><code>countVect = CountVectorizer()
X_train_countVect = countVect.fit_transform(X_train_cleaned)
# On scikit-learn &gt;= 1.0, prefer get_feature_names_out() over get_feature_names()
print("Number of features : %d \n" % len(countVect.get_feature_names())) #6378
print("Show some feature names : \n", countVect.get_feature_names()[::100])
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1636218909089/ydO5sTI51.png" alt="feat.png" /></p>
<ul>
<li>Creating an object of the LabelEncoder class. Our target variable is of float data type, while the target fed to our model must be of integer data type.</li>
</ul>
<pre><code><span class="hljs-attr">lb</span>=preprocessing.LabelEncoder()
</code></pre><ul>
<li>Displaying y_train before encoding</li>
</ul>
<pre><code>print(y_train)
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1636219508675/6ElVwL_qY.png" alt="tra.png" /></p>
<ul>
<li>Encoding y_train</li>
</ul>
<pre><code><span class="hljs-attr">y_train_encoded</span>=lb.fit_transform(y_train)
</code></pre><ul>
<li>Displaying the encoded y_train</li>
</ul>
<pre><code>print(y_train_encoded)
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1636219667435/zh8iGFsIg.png" alt="y.png" /></p>
<ul>
<li>Encoding y_test</li>
</ul>
<pre><code><span class="hljs-attr">y_test_encoded</span> = lb.transform(y_test)
</code></pre><p>y_test_encoded</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1636219915832/ndvmYrt7g.png" alt="y.png" /></p>
<ul>
<li>Defining our model and fitting with X_train_countVect and y_train_encoded</li>
</ul>
<pre><code>mnb = MultinomialNB()
mnb.fit(X_train_countVect, y_train_encoded)
</code></pre><ul>
<li>Predicting on unseen validation test features</li>
</ul>
<pre><code><span class="hljs-attr">predictions</span> = mnb.predict(countVect.transform(X_test_cleaned))
</code></pre><ul>
<li>Displaying predicted value. </li>
</ul>
<pre><code>print(predictions)
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1636220364555/OfEiT-ehI.png" alt="pred.png" /></p>
<p>We notice that the predicted values are not on the rating scale from the problem statement, simply because we encoded our target variable before fitting the model. We need to inverse-transform the predictions to recover the actual ratings, as shown below:</p>
<pre><code><span class="hljs-attr">prediction</span> =lb.inverse_transform(predictions )
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1636220790409/rac8XwPUG.png" alt="pred.png" /></p>
<ul>
<li>Defining a function to evaluate the model using RMSE (Root Mean Square Error) as the metric score.</li>
</ul>
<pre><code><span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">modelEvaluation</span>(<span class="hljs-params">y_tst,pred</span>):</span>
    <span class="hljs-string">'''
    Print model evaluation to predicted result 
    '''</span>
    <span class="hljs-keyword">print</span> (<span class="hljs-string">"\nAccuracy on validation set: {:.4f}"</span>.format(np.sqrt(mean_squared_error(y_tst, pred))))
</code></pre><ul>
<li>Evaluating the model performance</li>
</ul>
<pre><code><span class="hljs-selector-tag">modelEvaluation</span>(lb.inverse_transform(y_test_encoded),predictions)
</code></pre><p>RMSE on validation set: 3.1841</p>
<p>The lower the RMSE, the better the model is performing. We have come halfway to our final destination; let us now process the test dataframe and run predictions on the unseen feature set.</p>
<ul>
<li>Displaying the first 5 rows  of the test dataframe</li>
</ul>
<pre><code><span class="hljs-selector-tag">test_df</span><span class="hljs-selector-class">.head</span>()
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1636221327271/ySlcRBQ1V.png" alt="head.png" /></p>
<ul>
<li>Displaying information about the test dataframe</li>
</ul>
<pre><code>test_df.<span class="hljs-keyword">info</span>()
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1636221502570/gIKwqfVgL.png" alt="info.png" /></p>
<ul>
<li>Checking for null values in the test dataframe</li>
</ul>
<pre><code><span class="hljs-selector-tag">test_df</span><span class="hljs-selector-class">.customer_reviews</span><span class="hljs-selector-class">.isnull</span>()
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1636221635980/HNJtWooLj.png" alt="null.png" /></p>
<ul>
<li>Cleaning the test_df['customer_reviews']</li>
</ul>
<pre><code>test_df_cleaned = []

for d in test_df['customer_reviews']:
    test_df_cleaned.append(cleanText(d))
print('Show a cleaned review in the test set : \n', test_df_cleaned[10])
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1636222855971/lWP_eyd6m.png" alt="pic.png" /></p>
<ul>
<li>Predicting on the unseen clean customer reviews</li>
</ul>
<pre><code><span class="hljs-attr">predictions_test_df</span> = mnb.predict(countVect.transform(test_df_cleaned ))
<span class="hljs-attr">predictions_test</span> = lb.inverse_transform(predictions_test_df)
</code></pre><ul>
<li>Displaying the predicted values </li>
</ul>
<pre><code>print(predictions_test)
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1636223167747/I2CBZ9TML.png" alt="pr.png" /></p>
<ul>
<li>Putting the predicted values into a dataframe </li>
</ul>
<pre><code><span class="hljs-attr">predictions_test</span> = pd.DataFrame(predictions_test)
</code></pre><ul>
<li>Displaying the prediction</li>
</ul>
<pre><code>print(predictions_test)
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1636223394350/eEfYbMw5i.png" alt="df.png" /></p>
<ul>
<li>Converting the dataframe into csv format for submission.</li>
</ul>
<pre><code>predictions_test.<span class="hljs-keyword">index</span> = pd.DataFrame(predictions_test).<span class="hljs-keyword">index</span>
predictions_test.<span class="hljs-keyword">columns</span> = ["prediction"]
predictions_test.to_csv("submission.csv", <span class="hljs-keyword">index</span> = <span class="hljs-keyword">False</span>)
</code></pre><p>Check the project folder in Jupyter Notebook; you should have a file named submission.csv, as shown below:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1636224057383/JxEP8o6qx.png" alt="pred.png" /></p>
<p>Conclusion:</p>
<p>We were able to rate customer reviews of Amazon products with an RMSE score of 0.318 using the multinomial Naive Bayes algorithm and CountVectorizer. We could achieve a lower RMSE with TF-IDF, which tends to perform better than CountVectorizer, or by exploring other machine learning algorithms such as support vector machines and deep learning models like RNNs and LSTMs.</p>
<p>Please let me know if you find any errors. You can reach out to me on any of the Matrix decentralized servers. My Element messenger ID is <em>@maximilien:matrix.org</em></p>
<p>If you are on LinkedIn, you can reach me <a target="_blank" href="https://www.linkedin.com/in/ephrem-maximilien-kpizingui-48222775/">here</a>.</p>
<blockquote>
<p>Warm regards,</p>
<p>Maximilien.</p>
</blockquote>
]]></content:encoded></item><item><title><![CDATA[End-To-End Gender Determination by Morphometry of Eyes using CNN (Convolutional Neural Network)]]></title><description><![CDATA[I did recall the first time my mum escorted me to nursery school. I was that little boy who used to learn from images written on flash cards hanging on the blackboard "this is a car", "this is an elephant" :-). Similarly in the following, we are goin...]]></description><link>https://maximilien.docquest.io/end-to-end-gender-determination-by-morphometry-of-eyes-using-cnn-convolutional-neural-network</link><guid isPermaLink="true">https://maximilien.docquest.io/end-to-end-gender-determination-by-morphometry-of-eyes-using-cnn-convolutional-neural-network</guid><dc:creator><![CDATA[Maximilien]]></dc:creator><pubDate>Tue, 26 Oct 2021 16:04:52 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1635167582075/6ifQGS5gq.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>I still recall the first time my mum escorted me to nursery school. I was that little boy who learned from images on flash cards hanging on the blackboard: "this is a car", "this is an elephant" :-). In a similar way, we are going to build mathematical models that mimic the functions of the human eye and brain, empowering a machine to classify the image of a patient's eye and determine whether the patient is male or female.</p>
<hr />
<p><strong>Contents</strong>:</p>
<p><strong>1. Computer vision</strong></p>
<p><strong>2. Convolutional Neural Network</strong></p>
<p><strong>3. Architecture of CNN</strong></p>
<p><strong>4. Image augmentation</strong></p>
<p><strong>5. Implementation of gender determination by morphometry of eye</strong></p>
<hr />
<p><strong>1. Computer Vision</strong></p>
<p>To understand computer vision, let us first discuss human vision: the ability of the human eye and brain to see and recognize objects. It is quite simple for the human eye to identify precisely whether a person is male or female, but it takes a lot of training for a computer system to distinguish such objects. Computer vision is the process of giving a machine a similar ability to see and identify objects in the real world. In this light, computer vision can be defined as building mathematical models that mimic the function of the human eye and brain. Basically, it is about training computers to understand and process images and videos.</p>
<p><strong>2. Convolutional Neural Network</strong></p>
<p>CNN is a class of deep neural network mostly used in the fields of computer vision and imaging. CNNs are used to identify images, cluster them by similarity and implement object recognition. The word convolution refers to the filtering process that takes place in this type of network. A CNN consists of different layers, namely the input layer, the output layer, and multiple hidden layers; the hidden layers comprise fully connected layers, convolutional layers, ReLU activation layers, normalization layers and pooling layers.</p>
<p><strong>3. Architecture of CNN</strong></p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1635177100813/oBm-AnQi0.png" alt="CNN-architecture-image-courtesy-AtheroPoint-TM.png" />
The main components of CNN architecture are as follows:</p>
<p>• Input image: An input image forms the first component of a CNN architecture. An image can be of any type: a human, an animal, scenery, a medical X-ray image, etc. Each image is converted into a numerical matrix, shown here as zeros and ones:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1635175108743/nOcxxlrQz.png" alt="Screenshot from 2021-10-25 15-16-12.png" /></p>
<p>• Convolutional layer: The convolution layer is where the image processing, or filtering, starts. A convolution layer consists of two parts:</p>
<p>• Feature detector, filter or kernel: a matrix, typically 3x3 for a 2D image, that is slid over the image to transform it into a feature map</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1635175725969/LAcHyjwBP.png" alt="map.png" />
• Feature map: the reduced image produced by convolving the image with the feature detector. We place the detector at every possible location of the original image and derive a smaller image from it; each cell of the feature map is the dot product of the corresponding patch of the input image with the kernel matrix.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1635175993175/LGDtWiBc6.png" alt="Screenshot from 2021-10-25 15-31-38.png" /></p>
<p>NB: the feature detector (kernel) is the filter and the feature map is the reduced image; some information is lost while reducing the image. The feature map above is obtained by moving the orange frame over the whole input and taking the dot product with the kernel, as shown below (a code sketch follows the image).</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1635176376980/x2cbKU2sH.png" alt="Screenshot from 2021-10-25 15-38-52.png" /></p>
<p>• Pooling layer: The pooling layer helps us ignore the less important data in the image and reduces the image further while preserving its important features. The feature map derived from the convolution layer is passed through a pooling layer, which shrinks the image while keeping its most relevant parts. Pooling functions include max pooling, min pooling, and average pooling. In max pooling, we select a window size, say 2x2, scan the feature map, and keep the maximum number from each 2x2 block. The following image gives a clear idea of how max pooling works; a short code sketch follows it.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1635177025896/PK1m1guFo_.png" alt="Screenshot from 2021-10-25 15-48-36.png" /></p>
<p>• Flattening: Flattening is the part of a CNN where the image is made ready to use as the input to an artificial neural network. The pooled image is flattened and converted into a single column: each row is made into a column and stacked one over another. Here, we convert a 3x3 matrix into a 1xn vector, where n in our case is 9 (a one-line sketch follows the image).</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1635177455501/ISGCtUIeWL.png" alt="Screenshot from 2021-10-25 15-57-18.png" /></p>
<p>Now, let's look at the overall structure of a CNN
<img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1635177698665/VJWsxleIX.png" alt="Screenshot from 2021-10-25 16-02-10.png" /></p>
<p><strong>4. Image augmentation</strong>
Image or data augmentation creates many batches of our images, then applies random transformations to random images inside the batches: rotating, shifting, flipping them, and so on. By applying these transformations we get more diverse images inside the batches, and we also have much more data than we had originally, as shown below for an image of a football.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1635178131275/agcbLBqMR.png" alt="Screenshot from 2021-10-25 16-08-48.png" /></p>
<p><strong>5. Implementation of gender determination by morphometry of eye</strong></p>
<p>Let's get into the coding part. The dataset used in this code can be downloaded from DPhi platform  <a target="_blank" href="https://drive.google.com/file/d/1f7uslI-ZHidriQFZR966_aILjlkgDN76/view?usp=sharing">here</a> .</p>
<ul>
<li>Installing tensorflow framework (skip this part if you are using Google colab)</li>
</ul>
<pre><code><span class="hljs-comment"># Requires the latest pip</span>
 pip <span class="hljs-keyword">install</span> <span class="hljs-comment">--upgrade pip</span>
</code></pre><pre><code># <span class="hljs-keyword">Current</span> <span class="hljs-keyword">stable</span> <span class="hljs-keyword">release</span> <span class="hljs-keyword">for</span> CPU <span class="hljs-keyword">and</span> GPU
pip3 install tensorflow
</code></pre><pre><code><span class="hljs-comment">#installing open computer vision </span>
<span class="hljs-attribute">pip3</span> install opencv-contrib-python
</code></pre><ul>
<li><p>Importing the libraries</p>
<pre><code>from tensorflow.keras.preprocessing.image <span class="hljs-keyword">import</span> ImageDataGenerator,array_to_img, img_to_array, load_img
from tensorflow.keras.layers import Dense, Activation, Flatten, Dropout, BatchNormalization
from tensorflow.keras.layers import Conv2D, MaxPooling2D
from tensorflow.keras import regularizers, optimizers
from tensorflow.keras.preprocessing import image
from tensorflow.keras.models import Sequential
from tensorflow.keras import layers, models
import matplotlib.pyplot as plt
from os import listdir, makedirs
from os.path import isfile, join
import pandas as pd
import numpy as np
import pathlib
import PIL
import cv2
import os
</code></pre></li>
<li><p>Loading the train and test dataset</p>
</li>
</ul>
<pre><code><span class="hljs-attr">train_df</span>=pd.read_csv(<span class="hljs-string">"Training_set.csv"</span>,dtype=str)
<span class="hljs-attr">test_df</span>=pd.read_csv(<span class="hljs-string">"Testing_set.csv"</span>,dtype=str)
</code></pre><ul>
<li>Defining the path for the various directories<pre><code><span class="hljs-attr">src_path_train</span>=<span class="hljs-string">"/root/Desktop/Deep learning dphi/eye_gender_data/train/"</span>
<span class="hljs-attr">src_path_test</span>=<span class="hljs-string">"/root/Desktop/Deep learning dphi/eye_gender_data/test/"</span>
<span class="hljs-attr">src_path_validation</span>=<span class="hljs-string">"/root/Desktop/Deep learning dphi/eye_gender_data/train/validation/"</span>
<span class="hljs-attr">src_path_train_gray</span>=<span class="hljs-string">"/root/Desktop/Deep learning dphi/eye_gender_data/train_grayscale/"</span>
<span class="hljs-attr">src_path_test_gray</span>=<span class="hljs-string">"/root/Desktop/Deep learning dphi/eye_gender_data/test_grayscale/"</span>
</code></pre>Run this code only once to create validation folder<pre><code>base_dir=<span class="hljs-string">'/root/Desktop/Deep learning dphi/eye_gender_data/'</span>
validation_dir = os.path.<span class="hljs-keyword">join</span>(base_dir, <span class="hljs-string">'validation'</span>)
os.mkdir(validation_dir)
</code></pre>Creating the train_grayscale folder<pre><code>os.mkdir(src_path_train_gray)
</code></pre>Creating the test_grayscale folder<pre><code>os.mkdir(src_path_test_gray)
</code></pre></li>
</ul>
<p>Applying data augmentation to Image_6 in the train dataset to visualize what it looks like</p>
<pre><code><span class="hljs-attribute">datagen</span> = ImageDataGenerator(
        <span class="hljs-attribute">rotation_range</span>=<span class="hljs-number">40</span>,
        <span class="hljs-attribute">width_shift_range</span>=<span class="hljs-number">0</span>.<span class="hljs-number">2</span>,
        <span class="hljs-attribute">height_shift_range</span>=<span class="hljs-number">0</span>.<span class="hljs-number">2</span>,
        <span class="hljs-attribute">shear_range</span>=<span class="hljs-number">0</span>.<span class="hljs-number">2</span>,
        <span class="hljs-attribute">zoom_range</span>=<span class="hljs-number">0</span>.<span class="hljs-number">2</span>,
        <span class="hljs-attribute">horizontal_flip</span>=True,
        <span class="hljs-attribute">fill_mode</span>='nearest')

<span class="hljs-attribute">img</span> = load_img(<span class="hljs-string">"/root/Desktop/Deep learning dphi/eye_gender_data/train/Image_6.jpg"</span>)  # this is a PIL image
<span class="hljs-attribute">x</span> = img_to_array(img)  # this is a Numpy array with shape (<span class="hljs-number">3</span>, <span class="hljs-number">150</span>, <span class="hljs-number">150</span>)
<span class="hljs-attribute">x</span> = x.reshape((<span class="hljs-number">1</span>,) + x.shape)  # this is a Numpy array with shape (<span class="hljs-number">1</span>, <span class="hljs-number">3</span>, <span class="hljs-number">150</span>, <span class="hljs-number">150</span>)

<span class="hljs-comment"># the .flow() command below generates batches of randomly transformed images</span>
<span class="hljs-comment"># and saves the results to the `preview/` directory</span>
<span class="hljs-attribute">i</span> = <span class="hljs-number">0</span>
<span class="hljs-attribute">for</span> batch in datagen.flow(x, batch_size=<span class="hljs-number">1</span>,
                          <span class="hljs-attribute">save_to_dir</span>='preview', save_prefix='cat', save_format='jpeg'):
    <span class="hljs-attribute">i</span> += <span class="hljs-number">1</span>
    <span class="hljs-attribute">if</span> i &gt; <span class="hljs-number">12</span>:
        <span class="hljs-attribute">break</span>  # otherwise the generator would loop indefinitely
</code></pre><p>Creating a function to visualize 12 augmented samples of a real image</p>
<pre><code>def visualizeImg(path, color):
    """Display up to 12 images from a folder in a 3x4 grid."""
    sub_class = os.listdir(path)
    plt.figure(figsize=(10, 7))
    for e in range(len(sub_class[:12])):
        plt.subplot(3, 4, e + 1)
        img = plt.imread(os.path.join(path, sub_class[e]))
        plt.imshow(img, cmap=plt.get_cmap(color))
</code></pre><ul>
<li>Visualizing the  augmented images </li>
</ul>
<pre><code><span class="hljs-selector-tag">visualizeImg</span>(<span class="hljs-string">"/root/Desktop/Deep learning dphi/eye_gender_data/preview/"</span>,<span class="hljs-string">'CMRmap'</span>)
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1635194589586/pYzUKkOLX.png" alt="aug.png" /></p>
<ul>
<li>Displaying sample of the train dataset</li>
</ul>
<pre><code><span class="hljs-selector-tag">visualizeImg</span>(src_path_train,<span class="hljs-string">'CMRmap'</span>)
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1635194791240/eyZ5f2GUe.png" alt="train.png" /></p>
<ul>
<li>Displaying sample of test dataset</li>
</ul>
<pre><code><span class="hljs-selector-tag">visualizeImg</span>(src_path_test,<span class="hljs-string">'CMRmap'</span>)
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1635194954566/-8lAfaraz.png" alt="Screenshot from 2021-10-25 20-48-46.png" /></p>
<ul>
<li>Converting the train dataset to grayscale</li>
</ul>
<pre><code>path = "/root/Desktop/Deep learning dphi/eye_gender_data/train/"
# create a folder named train_grayscale in the eye_gender_data directory
dstpath = "/root/Desktop/Deep learning dphi/eye_gender_data/train_grayscale/"

try:
    makedirs(dstpath)
except FileExistsError:
    print("Directory already exists, images will be written to the same folder")

# keep files only; sub-folders are skipped
files = [f for f in listdir(path) if isfile(join(path, f))]

for image in files:
    try:
        img = cv2.imread(os.path.join(path, image))
        gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
        dstPath = join(dstpath, image)
        cv2.imwrite(dstPath, gray)
    except Exception:
        print("{} was not converted".format(image))
</code></pre><ul>
<li>Displaying a sample of the train_grayscale dataset</li>
</ul>
<pre><code><span class="hljs-selector-tag">visualizeImg</span>(src_path_train_gray,<span class="hljs-string">'gray'</span>)
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1635195215388/UjAeOckDBg.png" alt="Screenshot from 2021-10-25 20-53-37.png" /></p>
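<p>The same conversion is repeated below for the test set and, later, for an unseen image, so you may prefer to wrap it in a small helper. A minimal sketch; convert_to_gray is a name introduced here for illustration:</p>
<pre><code>def convert_to_gray(path, dstpath):
    # convert every image file in `path` to grayscale and write it to `dstpath`
    try:
        makedirs(dstpath)
    except FileExistsError:
        print("Directory already exists, images will be written to the same folder")
    for fname in [f for f in listdir(path) if isfile(join(path, f))]:
        img = cv2.imread(os.path.join(path, fname))
        if img is None:
            print("{} was not converted".format(fname))
            continue
        cv2.imwrite(join(dstpath, fname), cv2.cvtColor(img, cv2.COLOR_BGR2GRAY))
</code></pre>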
<ul>
<li>Converting the test dataset to grayscale</li>
</ul>
<pre><code>path = "/root/Desktop/Deep learning dphi/eye_gender_data/test/"
# create a folder named test_grayscale in the eye_gender_data directory
dstpath = "/root/Desktop/Deep learning dphi/eye_gender_data/test_grayscale/"

try:
    makedirs(dstpath)
except FileExistsError:
    print("Directory already exists, images will be written to the same folder")

# keep files only; sub-folders are skipped
files = [f for f in listdir(path) if isfile(join(path, f))]

for image in files:
    try:
        img = cv2.imread(os.path.join(path, image))
        gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
        dstPath = join(dstpath, image)
        cv2.imwrite(dstPath, gray)
    except Exception:
        print("{} was not converted".format(image))
</code></pre><ul>
<li>Displaying sample of test_grayscale dataset</li>
</ul>
<pre><code><span class="hljs-selector-tag">visualizeImg</span>(src_path_test_gray,<span class="hljs-string">'gray'</span>)
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1635198157531/g-cYiKmX1.png" alt="Screenshot from 2021-10-25 21-42-41.png" /></p>
<ul>
<li>Defining the augmentation configuration we will use for training </li>
</ul>
<pre><code><span class="hljs-attr">train_datagen</span> = ImageDataGenerator(
        <span class="hljs-attr">rescale</span>=<span class="hljs-number">1</span> / <span class="hljs-number">255.0</span>,
        <span class="hljs-attr">rotation_range</span>=<span class="hljs-number">2</span>,
        <span class="hljs-attr">zoom_range</span>=<span class="hljs-number">0.05</span>,
        <span class="hljs-attr">width_shift_range</span>=<span class="hljs-number">0.05</span>,
        <span class="hljs-attr">height_shift_range</span>=<span class="hljs-number">0.05</span>,
        <span class="hljs-attr">shear_range</span>=<span class="hljs-number">0.05</span>,
        <span class="hljs-attr">horizontal_flip</span>=<span class="hljs-literal">True</span>,
        <span class="hljs-attr">fill_mode</span>=<span class="hljs-string">"nearest"</span>,
        <span class="hljs-attr">validation_split</span>=<span class="hljs-number">0.20</span>)
</code></pre><ul>
<li>Defining the augmentation configuration we will use for testing</li>
</ul>
<pre><code><span class="hljs-attr">test_datagen</span> = ImageDataGenerator(rescale=<span class="hljs-number">1</span>./<span class="hljs-number">255</span>)
</code></pre><p>This is a generator that reads the pictures listed in train_df from 'eye_gender_data/train_grayscale' and indefinitely generates batches of augmented image data</p>
<pre><code><span class="hljs-attr">train_generator</span> = train_datagen.flow_from_dataframe(
    <span class="hljs-attr">dataframe</span>=train_df,
    <span class="hljs-attr">directory</span>=src_path_train_gray,
    <span class="hljs-attr">x_col</span>=<span class="hljs-string">"filename"</span>,
    <span class="hljs-attr">y_col</span>=<span class="hljs-string">"label"</span>,
     <span class="hljs-attr">subset</span>=<span class="hljs-string">'training'</span>,
    <span class="hljs-attr">target_size</span>=(<span class="hljs-number">71</span>, <span class="hljs-number">71</span>),  <span class="hljs-comment"># all images will be resized to 71*71</span>
    <span class="hljs-attr">batch_size</span>=<span class="hljs-number">400</span>,
    <span class="hljs-attr">seed</span>=<span class="hljs-number">60</span>,
    <span class="hljs-attr">shuffle</span>=<span class="hljs-literal">True</span>,
   <span class="hljs-attr">class_mode</span>=<span class="hljs-string">'categorical'</span>)
</code></pre><p>Found 7376 validated image filenames belonging to 2 classes.</p>
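<p>The generator above only yields the training subset. The fitting step below also needs a matching <strong>valid_generator</strong> built from the same train_datagen, which holds out 20% of the data via validation_split. A minimal sketch:</p>
<pre><code>valid_generator = train_datagen.flow_from_dataframe(
    dataframe=train_df,
    directory=src_path_train_gray,
    x_col="filename",
    y_col="label",
    subset='validation',  # the 20% held out by validation_split
    target_size=(32, 32),
    batch_size=400,
    seed=60,
    shuffle=True,
    class_mode='categorical')
</code></pre>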
<ul>
<li>This is a generator that reads the pictures listed in test_df from 'eye_gender_data/test_grayscale' and generates batches of rescaled (not augmented) image data</li>
</ul>
<pre><code>test_generator = test_datagen.flow_from_dataframe(
    dataframe=test_df,
    directory=<span class="hljs-string">"/root/Desktop/Deep learning dphi/eye_gender_data/test_grayscale/"</span>,
    x_col=<span class="hljs-string">"filename"</span>,
    target_size=(32, 32),  # match the model's 32x32 input
    batch_size=<span class="hljs-number">1</span>,
    class_mode=<span class="hljs-literal">None</span>,
    seed=<span class="hljs-number">60</span>,
    shuffle=<span class="hljs-literal">False</span>,
)
</code></pre><ul>
<li>Defining the base_model function, which builds the sequential model and all of its layers</li>
</ul>
<pre><code><span class="hljs-attribute">def</span> base_model():
    <span class="hljs-attribute">model</span> = models.Sequential()
    <span class="hljs-attribute">model</span>.add(layers.Conv<span class="hljs-number">2</span>D(<span class="hljs-number">32</span>, (<span class="hljs-number">3</span>, <span class="hljs-number">3</span>),padding='same', activation='relu',input_shape=(<span class="hljs-number">32</span>, <span class="hljs-number">32</span>, <span class="hljs-number">3</span>)))
    <span class="hljs-attribute">model</span>.add(layers.MaxPooling<span class="hljs-number">2</span>D((<span class="hljs-number">2</span>, <span class="hljs-number">2</span>)))
    <span class="hljs-attribute">model</span>.add(Dropout(<span class="hljs-number">0</span>.<span class="hljs-number">25</span>))
    <span class="hljs-attribute">model</span>.add(layers.Conv<span class="hljs-number">2</span>D(<span class="hljs-number">64</span>, (<span class="hljs-number">3</span>, <span class="hljs-number">3</span>), padding='same', activation='relu'))
    <span class="hljs-attribute">model</span>.add(layers.MaxPooling<span class="hljs-number">2</span>D((<span class="hljs-number">2</span>, <span class="hljs-number">2</span>)))
    <span class="hljs-attribute">model</span>.add(Dropout(<span class="hljs-number">0</span>.<span class="hljs-number">25</span>))
    <span class="hljs-attribute">model</span>.add(layers.Conv<span class="hljs-number">2</span>D(<span class="hljs-number">64</span>,(<span class="hljs-number">3</span>, <span class="hljs-number">3</span>),padding='same' , activation='relu'))
    <span class="hljs-attribute">model</span>.add(layers.MaxPooling<span class="hljs-number">2</span>D((<span class="hljs-number">2</span>, <span class="hljs-number">2</span>)))
    <span class="hljs-attribute">model</span>.add(Dropout(<span class="hljs-number">0</span>.<span class="hljs-number">25</span>))
    <span class="hljs-attribute">model</span>.add(layers.Conv<span class="hljs-number">2</span>D(<span class="hljs-number">64</span>, (<span class="hljs-number">3</span>, <span class="hljs-number">3</span>), padding='same',activation='relu'))
    <span class="hljs-attribute">model</span>.add(layers.MaxPooling<span class="hljs-number">2</span>D((<span class="hljs-number">2</span>, <span class="hljs-number">2</span>)))
    <span class="hljs-attribute">model</span>.add(Dropout(<span class="hljs-number">0</span>.<span class="hljs-number">25</span>))
    <span class="hljs-attribute">model</span>.add(layers.Flatten())
    <span class="hljs-attribute">model</span>.add(layers.Dense(<span class="hljs-number">512</span>, activation='relu'))
    <span class="hljs-attribute">model</span>.add(Dropout(<span class="hljs-number">0</span>.<span class="hljs-number">5</span>))
    <span class="hljs-attribute">model</span>.add(layers.Dense(<span class="hljs-number">2</span>, activation='softmax'))
    <span class="hljs-attribute">return</span> model
</code></pre><ul>
<li>Compiling the baseline model</li>
</ul>
<pre><code>baseline=base_model()
baseline.compile(optimizer=<span class="hljs-string">'adam'</span>,loss=<span class="hljs-string">"categorical_crossentropy"</span>,metrics=[<span class="hljs-string">"accuracy"</span>])
</code></pre><ul>
<li>Fitting the baseline model for 5 epochs</li>
</ul>
<pre><code><span class="hljs-attr">STEP_SIZE_TRAIN</span>=train_generator.n//train_generator.batch_size
<span class="hljs-attr">STEP_SIZE_VALID</span>=valid_generator.n//valid_generator.batch_size
<span class="hljs-attr">STEP_SIZE_TEST</span>=test_generator.n//test_generator.batch_size
<span class="hljs-attr">history</span>= baseline.fit(train_generator,
                    <span class="hljs-attr">steps_per_epoch</span>=STEP_SIZE_TRAIN,
                    <span class="hljs-attr">validation_data</span>=valid_generator,
                    <span class="hljs-attr">validation_steps</span>=STEP_SIZE_VALID,
                   <span class="hljs-attr">epochs</span>=<span class="hljs-number">5</span>)
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1635200803984/1Qe0pP2mJ.png" alt="Screenshot from 2021-10-25 22-27-13.png" /></p>
<ul>
<li>Creating a folder to save the model</li>
</ul>
<pre><code>!<span class="hljs-keyword">mkdir</span> -p saved_model
</code></pre><pre><code><span class="hljs-selector-tag">baseline</span><span class="hljs-selector-class">.save</span>(<span class="hljs-string">'saved_model/my_model'</span>)
</code></pre><ul>
<li>Evaluating the model on the validation dataset</li>
</ul>
<pre><code>score = baseline.evaluate(valid_generator, steps=STEP_SIZE_VALID)  # evaluate on the validation split
print('Validation loss:', score[0])
print('Validation accuracy:', score[1])
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1635201203474/N6oKDSssw.png" alt="Screenshot from 2021-10-25 22-33-31.png" /></p>
<ul>
<li>Plotting Training and validation accuracy &amp; Training and validation loss</li>
</ul>
<pre><code><span class="hljs-keyword">import</span> matplotlib.pyplot <span class="hljs-keyword">as</span> plt
acc = history.history[<span class="hljs-string">'accuracy'</span>]
val_acc = history.history[<span class="hljs-string">'val_accuracy'</span>]
loss = history.history[<span class="hljs-string">'loss'</span>]
val_loss = history.history[<span class="hljs-string">'val_loss'</span>]
epochs = range(<span class="hljs-number">1</span>, len(acc) + <span class="hljs-number">1</span>)
plt.plot(epochs, acc, <span class="hljs-string">'bo'</span>, label=<span class="hljs-string">'Training acc'</span>)
plt.plot(epochs, val_acc, <span class="hljs-string">'b'</span>, label=<span class="hljs-string">'Validation acc'</span>)
plt.title(<span class="hljs-string">'Training and validation accuracy'</span>)
plt.legend()
plt.figure()
plt.plot(epochs, loss, <span class="hljs-string">'bo'</span>, label=<span class="hljs-string">'Training loss'</span>)
plt.plot(epochs, val_loss, <span class="hljs-string">'b'</span>, label=<span class="hljs-string">'Validation loss'</span>)
plt.title(<span class="hljs-string">'Training and validation loss'</span>)
plt.legend()
plt.<span class="hljs-keyword">show</span>()
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1635201848693/4OnWtsC0t.png" alt="Screenshot from 2021-10-25 22-43-00.png" /></p>
<ul>
<li>Loading the saved model</li>
</ul>
<pre><code><span class="hljs-attr">new_model</span> = models.load_model(<span class="hljs-string">'saved_model/my_model'</span>)
</code></pre><ul>
<li>Predicting the test dataset</li>
</ul>
<pre><code>STEP_SIZE_TEST=test_generator.n//test_generator.batch_size
test_generator.<span class="hljs-keyword">reset</span>()
prediction=new_model.predict(test_generator,steps=STEP_SIZE_TEST,<span class="hljs-keyword">verbose</span>=<span class="hljs-number">1</span>)
</code></pre><ul>
<li><p>Mapping the predicted class indices back to their label names.</p>
<pre><code><span class="hljs-attr">predicted_class_indices</span>=np.argmax(prediction,axis=<span class="hljs-number">1</span>)
<span class="hljs-attr">labels</span> = (train_generator.class_indices)
<span class="hljs-attr">labels</span> = dict((v,k) for k,v in labels.items())
<span class="hljs-attr">predictions</span> = [labels[k] for k in predicted_class_indices]
<span class="hljs-attr">filenames</span>=test_generator.filenames
</code></pre></li>
<li><p>Creating a dataframe containing the predictions</p>
</li>
</ul>
<pre><code><span class="hljs-attr">results</span>=pd.DataFrame({ <span class="hljs-string">"label"</span>:predictions})
</code></pre><ul>
<li>Displaying the prediction in a dataframe</li>
</ul>
<pre><code>results
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1635202564196/j1u9NgC2W.png" alt="Screenshot from 2021-10-25 22-55-33.png" /></p>
<ul>
<li>Saving the predictions to a CSV file</li>
</ul>
<pre><code>results.to_csv("submission.csv",<span class="hljs-keyword">index</span>=<span class="hljs-keyword">False</span>)
</code></pre><p>Kudos for reaching this point! You can now submit the CSV file to platforms such as Kaggle, DPhi, HackerRank, or HackerEarth for scoring; a submission variant that includes the image filenames is sketched below.</p>
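<p>Most grading platforms tie each prediction to its image. A sketch that also writes the filenames collected earlier from the test generator (check your platform's required column names):</p>
<pre><code>results = pd.DataFrame({"filename": filenames, "label": predictions})
results.to_csv("submission.csv", index=False)
</code></pre>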
<ul>
<li>Classifying a new unseen image</li>
</ul>
<p>Your deep learning model is now up and running, so you can go ahead, take a picture of someone's eyes with your phone, and run a prediction to determine the person's gender. How do you go about it? Check the following steps.</p>
<ul>
<li>Loading the new image saved in the directory of your choice, as below</li>
</ul>
<pre><code><span class="hljs-attr">new_image</span> = image.load_img(<span class="hljs-string">'/root/Desktop/Deep learning dphi/eye_gender_data/unseen_test/download.jpeg'</span>, target_size = (<span class="hljs-number">71</span>, <span class="hljs-number">71</span>))
</code></pre><ul>
<li>Converting the image to grayscale.</li>
</ul>
<pre><code><span class="hljs-comment">#converting train to grascale</span>
path =<span class="hljs-string">"/root/Desktop/Deep learning dphi/eye_gender_data/unseen_test"</span>  
<span class="hljs-comment">#create a folder named  train_grayscale in the eye_gender data directory </span>
dstpath =<span class="hljs-string">"/root/Desktop/Deep learning dphi/eye_gender_data/unseen_test_grayscale/"</span>

<span class="hljs-keyword">try</span>:
    makedirs(dstpath)
<span class="hljs-keyword">except</span>:
    <span class="hljs-keyword">print</span> (<span class="hljs-string">"Directory already exist, images will be written in same folder"</span>)

<span class="hljs-comment"># Folder won't used</span>
files = [f <span class="hljs-keyword">for</span> f <span class="hljs-keyword">in</span> listdir(path) <span class="hljs-keyword">if</span> isfile(join(path,f))] 

<span class="hljs-keyword">for</span> image <span class="hljs-keyword">in</span> files:
    <span class="hljs-keyword">try</span>:
        img = cv2.imread(os.path.join(path,image))
        gray = cv2.cvtColor(img,cv2.COLOR_BGR2GRAY)
        dstPath = join(dstpath,image)
        cv2.imwrite(dstPath,gray)
    <span class="hljs-keyword">except</span>:
        <span class="hljs-keyword">print</span> (<span class="hljs-string">"{} is not converted"</span>.format(image))
</code></pre><ul>
<li>Visualizing the image</li>
</ul>
<pre><code><span class="hljs-selector-tag">plt</span><span class="hljs-selector-class">.imshow</span>(gray,cmap=<span class="hljs-string">"gray"</span>)
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1635262674914/pR5NebiOI.png" alt="Screenshot from 2021-10-26 15-33-17.png" /></p>
<ul>
<li>Checking the shape of the image because the input image to our sequential model is 32 by 32</li>
</ul>
<pre><code><span class="hljs-selector-tag">gray</span><span class="hljs-selector-class">.shape</span>
</code></pre><p>(617, 926)</p>
<ul>
<li>Resizing the image to (32,32)</li>
</ul>
<pre><code><span class="hljs-keyword">from</span> keras.preprocessing <span class="hljs-keyword">import</span> image
<span class="hljs-keyword">from</span> keras.preprocessing.image <span class="hljs-keyword">import</span> img_to_array
</code></pre><pre><code><span class="hljs-attr">image</span> = image.load_img(<span class="hljs-string">"/root/Desktop/Deep learning dphi/eye_gender_data/unseen_test_grayscale/download.jpeg"</span>)
<span class="hljs-attr">image</span> = image.resize((<span class="hljs-number">32</span>,<span class="hljs-number">32</span>))
</code></pre><pre><code><span class="hljs-selector-tag">plt</span><span class="hljs-selector-class">.imshow</span>(image)
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1635262777736/SDEfpEfg4.png" alt="Screenshot from 2021-10-26 15-40-04.png" /></p>
<ul>
<li>Converting the image to an array</li>
</ul>
<pre><code><span class="hljs-attr">new_image</span> =img_to_array(image, dtype=<span class="hljs-string">'uint8'</span>)
<span class="hljs-attr">new_image</span> = np.expand_dims(new_image, axis = <span class="hljs-number">0</span>)
</code></pre><ul>
<li>Running the prediction</li>
</ul>
<pre><code><span class="hljs-attr">STEP_SIZE_TEST</span>=test_generator.n//test_generator.batch_size
<span class="hljs-attr">prediction</span>=new_model.predict(new_image)
</code></pre><ul>
<li>Viewing the class labels</li>
</ul>
<pre><code><span class="hljs-selector-tag">train_generator</span><span class="hljs-selector-class">.class_indices</span>
</code></pre><p>{'female': 0, 'male': 1}</p>
<ul>
<li>Checking the label of the unseen image </li>
</ul>
<pre><code>pred_class = np.argmax(prediction[0])  # index of the highest-probability class
new_image_label = 'This is a {}'.format(labels[pred_class])  # labels maps 0 to 'female', 1 to 'male'
print(new_image_label)
</code></pre><p>This is a female</p>
<p>Conclusion: </p>
<p>We have reached the end of our learning journey in convolutional neural networks, determining gender from the morphometry of eyes with 93% accuracy. We could push the accuracy to 98% or 99% with a few lines of code by using pre-trained models such as ResNet50, VGG16, Xception, DenseNet, or Inception.</p>
<p>Please subscribe to my newsletter so you never miss an upcoming article, and leave me a comment if you have any questions or find this post interesting.</p>
<blockquote>
<p>My regards,</p>
<p>Maximilien.</p>
</blockquote>
]]></content:encoded></item><item><title><![CDATA[End-To-End Consumer Complaint Multiclass Classification Using Term Frequency - Inverse Document Frequency (TF-IDF) & Support Vector Machine Algorithm]]></title><description><![CDATA[You are about to embark on an exciting journey in NLP to implement end to end solution to solve business problem. In the end of this tutorial, the reader should be able to understand the concept of NLP, TF-IDF  and  to implement multiclass classifica...]]></description><link>https://maximilien.docquest.io/end-to-end-consumer-complaint-multiclass-classification-using-term-frequency-inverse-document-frequency-tf-idf-and-support-vector-machine-algorithm</link><guid isPermaLink="true">https://maximilien.docquest.io/end-to-end-consumer-complaint-multiclass-classification-using-term-frequency-inverse-document-frequency-tf-idf-and-support-vector-machine-algorithm</guid><category><![CDATA[nlp]]></category><category><![CDATA[Machine Learning]]></category><category><![CDATA[Data Science]]></category><dc:creator><![CDATA[Maximilien]]></dc:creator><pubDate>Sat, 23 Oct 2021 15:52:03 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1635011134710/k6Sg1gagy.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>You are about to embark on an exciting journey in NLP, implementing an end-to-end solution to a business problem. By the end of this tutorial, you should understand the concepts of <strong>NLP</strong> and <strong>TF-IDF</strong> and be able to implement a <strong>multiclass classification algorithm</strong> using the Python programming language.</p>
<hr />
<p><strong>Contents</strong></p>
<ol>
<li><p><strong>What is NLP</strong> </p>
</li>
<li><p><strong>Application of NLP</strong></p>
</li>
<li><p><strong>What is TF-IDF</strong></p>
</li>
<li><p><strong>What is multiclass classification</strong></p>
</li>
<li><p><strong> Importance of multiclass classification</strong></p>
</li>
<li><p><strong> Installing  libraries</strong></p>
</li>
<li><p><strong> Objective of the project</strong></p>
</li>
<li><p><strong>Implementing Multiclass Classification</strong></p>
</li>
</ol>
<hr />
<p> <strong>1. What is Natural Language Processing</strong></p>
<p> Natural language processing is the term used to describe the process of using computer algorithms to identify key elements in human language and extract meaning from unstructured spoken or written text. In other words,  NLP is a set of AI techniques designed to process human language. These techniques enable applications to recognize, process, analyze, and even generate natural human language. </p>
<p><strong>2.  Application of NLP</strong></p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1635004099168/1AhqtPaz_.png" alt="nlp.png" /></p>
<p> <strong>3.  What is TF-IDF</strong></p>
<p> TF-IDF, or term frequency-inverse document frequency, is a statistical weighting scheme, used by search engines such as Google, to score and rank a piece of content’s relevance to any given search query. It checks the occurrence of a keyword in a document and allocates importance to that keyword based on the number of times it appears in the document. It also checks how relevant that keyword is across the whole collection of documents.</p>
<p>Mathematically speaking, in a context of term and document, TF is defined as the number of times a term appears in a document. Term and Document are independent variables and TF is dependent on these. Let us denote TF as a function of term (t) and document (d) : TF(t,d).</p>
<p>Moreover, in the context of a term and all the documents in the corpus, DF is defined as the number of documents that contain the term. Term and document corpus are independent variables, and DF depends on them. Let us denote DF as a function of term (t) and document corpus (D): DF(t,D).</p>
<p>NB: When the requirement is to calculate the importance of a term to a document in the corpus, TF denotes how important the term is to that document, but it does not address the context of the corpus. DF addresses how important a term is in the context of the whole document corpus. If a term appears across all documents, TF overemphasizes it. So the inverse of DF (IDF) can be used to project the actual importance of a term, by calculating the product of TF and IDF.</p>
<p>  Inverse of DF (IDF) formula:
  <img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1634971653058/leMhmN8Dd.png" alt="idf.png" />
  with the base n of the logarithm greater than 1.</p>
<p>Product of TF and IDF formula:
<img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1634972000847/dFA3tC5vP.png" alt="tfidf.png" /></p>
<p>Considering the following text corpus containing three documents below.</p>
<blockquote>
<p>document1 : Welcome to maxtekIoT. There are many tutorials covering various fields of technology. </p>
<p>document2 : Technology has advanced a lot with the invention of semi-conductor transistor. Technology is changing our way of living.</p>
<p>document3 : You may find this tutorial on transistor technology interesting.</p>
</blockquote>
<p>TF-IDF(technology, document2, corpus)</p>
<p>TF(technology, document2) = 2</p>
<p>IDF(technology, corpus) = log((3+1)/(3+1)) = 0, since the term appears in all 3 documents</p>
<p>TF-IDF(technology, document2, corpus) = TF(technology, document2) × IDF(technology, corpus) = 2 × 0 = 0</p>
<p>Conclusion: Though the term ‘technology’ appears twice in document2, it occurs in all the documents, so it carries no discriminating importance for document2 within the corpus.</p>
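<p>To make the arithmetic concrete, here is a minimal sketch that computes TF, IDF, and their product on the three-document corpus above, using the smoothed formula log((N + 1) / (DF + 1)); the tf and idf helpers are illustrative names:</p>
<pre><code>import math

corpus = [
    "Welcome to maxtekIoT. There are many tutorials covering various fields of technology.",
    "Technology has advanced a lot with the invention of semi-conductor transistor. Technology is changing our way of living.",
    "You may find this tutorial on transistor technology interesting.",
]

def tf(term, document):
    # number of times the term appears in the document
    return sum(word.strip(".,").lower() == term for word in document.split())

def idf(term, docs):
    # smoothed inverse document frequency: log((N + 1) / (DF + 1))
    df = sum(tf(term, d) &gt; 0 for d in docs)
    return math.log((len(docs) + 1) / (df + 1))

term = "technology"
print(tf(term, corpus[1]))                      # 2
print(idf(term, corpus))                        # log(4/4) = 0.0
print(tf(term, corpus[1]) * idf(term, corpus))  # 0.0
</code></pre>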
<p><strong>4. What is multiclass classification</strong></p>
<p> In machine learning, multiclass classification is a task in which each sample in the dataset belongs to one of three or more classes. The aim of the algorithm is to construct a function that, given a new unseen data point, precisely predicts the class to which that point belongs.</p>
<p> <strong>5. Importance of multiclass classification</strong></p>
<p> Multiclass classification enables a business analyst to predict which product a customer will purchase next from several options allowing the business to estimate expected revenue and adjust business practices and resources accordingly.</p>
<p><strong> 6. Installing  libraries</strong></p>
<p>From the following section to the end, all the code should be written in a Jupyter notebook, Spyder, or Google Colab. The code below was written in JupyterLab running on Debian 10.</p>
<p>If you are using JupyterLab, various packages will need to be installed for the following code to execute.
To avoid this headache, newcomers are encouraged to use Google Colab, where most of the libraries and packages come preinstalled.</p>
<p>If you run into package or library issues, just type</p>
<pre><code><span class="hljs-addition">!pip3 install name_of_the missing_package_pointing_to</span>
</code></pre><p><strong>7.  Objective of the project</strong></p>
<p>The goal of the project is to classify consumers’ complaints about financial products and services so that companies can respond. Since there are multiple categories, this becomes a multiclass classification problem that can be solved with many machine learning algorithms. Once the algorithm is in place,
whenever a new complaint arrives we can easily categorize it and redirect it to the person concerned. This will save a lot of time because we minimize the human intervention needed to decide where each complaint should go.</p>
<ul>
<li>Step 1 Importing the libraries</li>
</ul>
<pre><code><span class="hljs-keyword">from</span> sklearn <span class="hljs-keyword">import</span> model_selection, preprocessing, metrics
<span class="hljs-keyword">from</span> sklearn.feature_extraction.text <span class="hljs-keyword">import</span> TfidfVectorizer
<span class="hljs-keyword">from</span> sklearn.model_selection <span class="hljs-keyword">import</span> StratifiedShuffleSplit
<span class="hljs-keyword">from</span> sklearn.metrics <span class="hljs-keyword">import</span> confusion_matrix
<span class="hljs-keyword">from</span> nltk.corpus <span class="hljs-keyword">import</span> stopwords
<span class="hljs-keyword">from</span> collections <span class="hljs-keyword">import</span> Counter
<span class="hljs-keyword">import</span> matplotlib.pyplot <span class="hljs-keyword">as</span> plt
<span class="hljs-keyword">from</span> sklearn.svm <span class="hljs-keyword">import</span> SVC
<span class="hljs-keyword">from</span> textblob <span class="hljs-keyword">import</span> Word
<span class="hljs-keyword">import</span> seaborn <span class="hljs-keyword">as</span> sns
<span class="hljs-keyword">import</span> pandas <span class="hljs-keyword">as</span> pd
<span class="hljs-keyword">import</span> numpy <span class="hljs-keyword">as</span> np
</code></pre><ul>
<li>Step 2 Loading the dataset with the read_csv() method from the pandas library</li>
</ul>
<pre><code><span class="hljs-attr">df</span> = pd.read_csv(<span class="hljs-string">"consumer_complaints.csv"</span>)
</code></pre><ul>
<li>Step 3 Exploratory data analysis
Displaying the first five rows of the dataset</li>
</ul>
<pre><code><span class="hljs-selector-tag">df</span><span class="hljs-selector-class">.head</span>()
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1634985248268/gw5Ck4cDo.png" alt="head.png" /></p>
<p>Displaying information about the data type and name of all the variables in the feature sets</p>
<pre><code>df.<span class="hljs-keyword">info</span>()
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1634985906897/xuKTGG_bM.png" alt="info.png" /></p>
<p>Adding category_id to the dataframe (category_id shows the class each complaint belongs to)</p>
<pre><code>df[<span class="hljs-string">'category_id'</span>] = df[<span class="hljs-string">'product'</span>].factorize()[<span class="hljs-number">0</span>]
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1634986266035/caD4oi4bR.png" alt="head1.png" /></p>
<p>We are only interested in product and consumer_complaint_narrative; all the rest of the feature sets are additional information which does not affect our analysis, therefore they can be ignored.</p>
<pre><code><span class="hljs-attr">df</span> = df[[<span class="hljs-string">'product'</span>, <span class="hljs-string">'consumer_complaint_narrative'</span>]]
</code></pre><p>Checking for NaN values in the dataframe</p>
<pre><code>df.<span class="hljs-keyword">isnull</span>().sum()
</code></pre><p>Keeping only non-null values in the dataframe</p>
<pre><code><span class="hljs-attr">df</span> = df[pd.notnull(df[<span class="hljs-string">'consumer_complaint_narrative'</span>])]
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1634987061547/r-uH9I7k6.png" alt="sum.png" /></p>
<p>Grouping the product based on consumer_complaint_narrative and displaying the distribution </p>
<pre><code><span class="hljs-selector-tag">df</span><span class="hljs-selector-class">.groupby</span>(<span class="hljs-string">'product'</span>)<span class="hljs-selector-class">.consumer_complaint_narrative</span><span class="hljs-selector-class">.count</span>()
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1634987605250/K5PoTMgUi.png" alt="group.png" /></p>
<pre><code>fig = plt.figure(figsize=(<span class="hljs-number">16</span>,<span class="hljs-number">12</span>))
df.groupby(<span class="hljs-string">'product'</span>).consumer_complaint_narrative.count()
df.groupby(<span class="hljs-string">'product'</span>).consumer_complaint_narrative.count().plot.bar(ylim=<span class="hljs-number">0</span>)
plt.<span class="hljs-keyword">show</span>()
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1634988185684/Y_V1uuJ_D.png" alt="dist.png" /></p>
<ul>
<li>Step 4 Feature engineering using TF-IDF</li>
</ul>
<p>Splitting the features into independent and dependent variables.</p>
<pre><code><span class="hljs-attr">X</span>=df[<span class="hljs-string">'consumer_complaint_narrative'</span>]
<span class="hljs-attr">y</span>=df[<span class="hljs-string">'product'</span>]
</code></pre><p>We did not use train_test_split to split the features and target into train and test sets because we noticed that, with train_test_split, the classes were not proportionally distributed between y_train and y_test, i.e. some classes appeared in y_test only and not in y_train. To fix that issue, we switched from train_test_split to StratifiedShuffleSplit.</p>
<pre><code>n_splits = 1 <span class="hljs-comment"># We only want a single split in this case</span>
sss = StratifiedShuffleSplit(n_splits=n_splits, test_size=0.25, random_state=0)
for train_index, test_index in sss.split(X, y):
    X_train,X_test =X.iloc[train_index],X.iloc[test_index]
    y_train,y_test =y.iloc[train_index],y.iloc[test_index]
</code></pre><p>Checking the distribution of y_train and y_test</p>
<pre><code><span class="hljs-selector-tag">Counter</span>(y_train)
</code></pre><pre><code><span class="hljs-selector-tag">Counter</span>(y_test)
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1634989450117/Hpphuo0Q7.png" alt="check.png" /></p>
<p>Encoding target variable. We fit_trainform() y_train and transform y_test</p>
<pre><code><span class="hljs-attr">encoder</span> = preprocessing.LabelEncoder()
<span class="hljs-attr">y_train_encoded</span> = encoder.fit_transform(y_train)
<span class="hljs-attr">y_test_encoded</span> = encoder.transform(y_test)
</code></pre><p>We initialize the TfidfVectorizer we previously imported, then pass our data to it. The vectorizer transforms the text into a TF-IDF matrix, the mathematical representation of the corpus. This is done with the fit_transform and transform methods.</p>
<pre><code>vectorizer = TfidfVectorizer(analyzer='word', token_pattern=r'\w{1,}', max_features=5000)
tfidf_matrix = vectorizer.fit_transform(X)   # note: fitting on the full X leaks test vocabulary; fitting on X_train only is stricter
tfidf_train = vectorizer.transform(X_train)
tfidf_test = vectorizer.transform(X_test)    # used by model.predict below
</code></pre><p>If we want to observe the mathematical representation of our text, i.e. the TF-IDF representation, we have to convert the sparse matrix to a dense matrix. This is done by using the toarray() method</p>
<pre><code>feature_array = tfidf_matrix.toarray()
feature_array
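
# Each column of feature_array corresponds to one vocabulary term; to inspect the mapping:
terms = vectorizer.get_feature_names_out()  # on older scikit-learn versions: vectorizer.get_feature_names()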
</code></pre><ul>
<li>Step 5 Model building and evaluation</li>
</ul>
<pre><code>model = SVC(C=0.1, kernel='linear', gamma=1)  # gamma is ignored by the linear kernel
model.fit(tfidf_train, y_train_encoded )
y_prediction=model.predict(tfidf_test)
accuracy = metrics.accuracy_score(y_prediction,y_test_encoded)
print (<span class="hljs-string">"Accuracy: "</span>, accuracy)
</code></pre><p>Accuracy:  0.8164890432283559</p>
<p>Classification report</p>
<pre><code>print(metrics.classification_report(y_test_encoded, y_prediction, target_names=encoder.classes_))  # encoder.classes_ lists the labels in encoded order
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1634996870701/0kAq9fcTx.png" alt="report.png" /></p>
<p>Confusion matrix</p>
<pre><code>conf_mat = confusion_matrix(y_test_encoded, y_prediction)
category_id_df = df[[<span class="hljs-string">'product'</span>, <span class="hljs-string">'category_id'</span>]].drop_duplicates().sort_values(<span class="hljs-string">'category_id'</span>)
category_to_id = dict(category_id_df.<span class="hljs-keyword">values</span>)
id_to_category = dict(category_id_df[[<span class="hljs-string">'category_id'</span>,<span class="hljs-string">'product'</span>]].<span class="hljs-keyword">values</span>)
fig, ax = plt.subplots(figsize=(<span class="hljs-number">12</span>,<span class="hljs-number">12</span>))
sns.heatmap(conf_mat, annot=<span class="hljs-keyword">True</span>, fmt=<span class="hljs-string">'d'</span>, cmap="BuPu",xticklabels=category_id_df[[<span class="hljs-string">'product'</span>]].<span class="hljs-keyword">values</span>,yticklabels=category_id_df[[<span class="hljs-string">'product'</span>]].<span class="hljs-keyword">values</span>)
plt.ylabel(<span class="hljs-string">'Actual'</span>)
plt.xlabel(<span class="hljs-string">'Predicted'</span>)
plt.<span class="hljs-keyword">show</span>()
</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1634996667556/0H8IkAA5V.png" alt="conf.png" /></p>
<p>The accuracy of 82% is good for a baseline model.
Precision and recall look pretty good across the categories except for “Payday loan.” If you look through Payday loan, most of the wrong predictions are Debt collection and Credit card, which might be because of the smaller number of samples in that category; it looks like the model treats it as a subcategory of credit card. We could merge these samples into another group to make the model more stable, as sketched below. </p>
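<p>A minimal sketch of that regrouping, applied to the dataframe before splitting; the target class here is an illustrative choice, not a conclusion of the analysis:</p>
<pre><code># fold the under-represented 'Payday loan' class into a larger, related class
df['product'] = df['product'].replace({'Payday loan': 'Consumer Loan'})
</code></pre>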
<ul>
<li>Step 6 Predicting an unseen consumer complaint</li>
</ul>
<pre><code>corpus= [<span class="hljs-string">"This company refuses to pay my interest to my bank account"</span>]
corpus_features = vectorizer.transform(corpus)
predictions = model.predict(corpus_features)
print(corpus)
print(<span class="hljs-string">"  - Predicted as: '{}'"</span>.format(id_to_category[predictions[0]]))
</code></pre><p>['This company refuses to pay my interest to my bank account']</p>
<p> Predicted as: 'Debt collection'</p>
<pre><code>corpus = [<span class="hljs-string">"This company refuses to provide me verification and validation of debt per my right under the FDCPA. I do not believe this debt is mine."</span>]
corpus_features = vectorizer.transform(corpus)
predictions = model.predict(corpus_features)
print(corpus)
print(<span class="hljs-string">"  \n Predicted as: '{}'"</span>.format(id_to_category[predictions[0]]))
</code></pre><p>['This company refuses to provide me verification and validation of debt per my right under the FDCPA. I do not believe this debt is mine.']</p>
<p> Predicted as: 'Credit reporting'</p>
<ul>
<li>Conclusion:
We achieved our objective, classifying consumer complaints using TF-IDF and the Support Vector Machine algorithm with 82% accuracy. This model can serve as a baseline. To increase the accuracy, we can reiterate the process with different algorithms like Random Forest, GBM, and Naive Bayes, or with deep learning techniques like RNN and LSTM. Besides, other techniques such as hyper-parameter tuning can be used to improve the model accuracy, which are out of the scope of this tutorial. </li>
</ul>
<p>Please let me know if you find any errors.
I can be contacted via LinkedIn <a target="_blank" href="https://www.linkedin.com/in/ephrem-maximilien-kpizingui-48222775/">here</a> </p>
<blockquote>
<p>My regards,</p>
<p>Maximilien.</p>
</blockquote>
]]></content:encoded></item></channel></rss>