So, my GF’s (girlfriend’s) birthday was near and I, being a mediocre engineer, was confused about what to plan for it. She had a constant habit of breathing through her mouth. Even after I pointed this out to her, she kept doing it subconsciously.
Seeing her birthday as a perfect opportunity, I, along with my best buddy “AI”, decided to end this for good. The video attached shows my journey from the R&D stage to delivering an AI solution for this emerging problem of people subconsciously breathing through their mouth. I thought it would be a good idea to share my findings in the form of a technical blog too. I’ll divide the blog into the following sections:
- Defining the problem statement
- Steps required for the solution
- Facial Landmark detection (Available models, pros and cons)
- Locating the Lip region
- Detect if mouth is open or not
- Robust solution and packaging
- Conclusion
So, hey! There we go :)
1. Defining the problem statement
We have all breathed through our mouth at least once. One reason can be blocked nostrils, especially in winter. But sometimes it becomes a bad habit, and we start breathing through our mouth subconsciously even when our nostrils are clear. Mouth breathing all the time isn’t good for our health. I would like to attach an article that talks more about mouth breathing, its symptoms, and complications.
Coming from a computer science background, we spend a lot of our time in front of a desktop or laptop, and many of my friends, along with my girlfriend, subconsciously picked up this habit of breathing through their mouth. I’m not a doctor or a medical expert who can help them in any medical regard; I’m just a casual AI enthusiast. So I thought: what if there were an app that runs on our workstation as soon as we start working and alerts us whenever we start breathing through our mouth unknowingly? This way, one can easily track the habit and work on improving it. Since the whole application will be running in the background analysing our lip region, it should be highly accurate and at the same time consume little computation power.
And with this motivation, I started this project.
2. Steps required for the solution
In order to get to the solution, we need to address the following sub-problems:
2.1 Facial Landmark detection (Available models, pros and cons)
The first step is detecting the face and facial key-points in the video frame. Facial landmarks are used to localise and represent important regions of the face like the mouth, eyes, eyebrows, nose, jawline, etc.
After some Google searches, I found a few algorithms/libraries that deal with the problem we are facing here:
- HOG + Linear SVM (DLIB)
- Haar Cascade (OpenCV)
- HRNET (Deep Learning solution)
- Practical Facial Landmark Detector — PFLD (Deep Learning solution)
- Mediapipe Face Mesh
We’ll go with one of them based on the following constraints:
- Accuracy: Our solution should be accurate and not prone to false positives
- Low computation power: Our solution will be running all the time in the background, so CPU and RAM usage should be as low as possible
Considering the constraints, I tried all the algorithms one by one.
2.1.1 HOG + Linear SVM (DLIB)
This approach is accurate and fast but has some limitations:
- Not invariant to changes in rotation and viewing angles
- Can’t detect small faces
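For reference, here’s a minimal sketch of how this detector is typically used together with dlib’s 68-point shape predictor (the image path and the predictor file location here are illustrative):

```python
# Sketch: dlib's HOG + Linear SVM face detector plus the 68-point
# landmark predictor. The .dat file is downloaded separately from dlib.net.
import cv2
import dlib

detector = dlib.get_frontal_face_detector()  # HOG + Linear SVM
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

frame = cv2.imread("face.jpg")
gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)

for face in detector(gray, 1):  # 1 = upsample once, helps with smaller faces
    landmarks = predictor(gray, face)
    # Points 48-67 cover the mouth region in the 68-point annotation scheme
    mouth = [(landmarks.part(i).x, landmarks.part(i).y) for i in range(48, 68)]
```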
2.1.2 Haar Cascade (OpenCV)
It is an old algorithm and produces lots of false positives.
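For comparison, here’s a quick sketch of the OpenCV Haar Cascade detector (the image path is illustrative; the cascade XML ships with opencv-python):

```python
# Sketch: OpenCV's Haar Cascade face detector.
import cv2

cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml"
)
frame = cv2.imread("face.jpg")
gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
# Tuning scaleFactor/minNeighbors trades speed against the false
# positives mentioned above.
faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
```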
I would also love to share another article named “Face Detection Models: Which to Use and Why?”. It contains very detailed comparisons between the DLIB and OpenCV models.
After that, I looked for Deep Learning based solutions and found these two models:
- HRNET (Deep Learning solution)
- Practical Facial Landmark Detector — PFLD (Deep Learning solution)
2.1.3 HRNET (Deep Learning solution)
I found a thesis, “FACIAL LANDMARK DETECTION ON MOBILE DEVICES” (http://jultika.oulu.fi/files/nbnfioulu-202012183424.pdf), where they did a lot of benchmarking and found that HRNET is slow and won’t be good for real-time applications.
2.1.4 Practical Face Landmark Detector — PFLD (Deep Learning solution)
This model was accurate, but its CPU usage was pretty high, failing one of our constraints.
2.1.5 Mediapipe Face Mesh
MediaPipe Face Mesh estimates 468 3D face landmarks in real time, even on mobile devices. It uses BlazeFace as its face detection model, which uses a lightweight feature extraction network inspired by, but distinct from, MobileNetV1/V2, delivering better real-time performance.
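Here’s a minimal sketch of running Face Mesh on a webcam feed, roughly the shape of what my pipeline does (the confidence values are illustrative):

```python
# Sketch: MediaPipe Face Mesh on live webcam frames.
import cv2
import mediapipe as mp

mp_face_mesh = mp.solutions.face_mesh
cap = cv2.VideoCapture(0)

with mp_face_mesh.FaceMesh(max_num_faces=1,
                           min_detection_confidence=0.5,
                           min_tracking_confidence=0.5) as face_mesh:
    while cap.isOpened():
        ok, frame = cap.read()
        if not ok:
            break
        # MediaPipe expects RGB; OpenCV delivers BGR
        results = face_mesh.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        if results.multi_face_landmarks:
            landmarks = results.multi_face_landmarks[0].landmark  # 468 points
cap.release()
```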
2.2 Locating the Lip region
Since I was using MediaPipe Face Mesh, I used the canonical face mesh mapping in order to get the lip region points.
In the MediaPipe source code, the FACEMESH_TESSELATION frozenset defines the point-to-point connection pairs between all points.
I used the following tuples to get the ROI (collapsed into an ordered index list in the sketch below).
Lips upper inner: (78, 191), (191, 80), (80, 81), (81, 82), (82, 13), (13, 312), (312, 311), (311, 310), (310, 415), (415, 308), (308, 324),
Lips lower: (324, 318), (318, 402), (402, 317), (317, 14), (14, 87), (87, 178), (178, 88),(88, 95), (95, 78)
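These pairs chain into each other (the second index of each tuple is the first index of the next), so for the actual ROI we only need the unique indices in order. A small sketch of collapsing them and converting MediaPipe’s normalized landmarks to pixel coordinates:

```python
# Sketch: derive an ordered inner-lip contour from the connection tuples
# above, then map normalized face-mesh landmarks to pixel coordinates.
LIPS_INNER = [
    (78, 191), (191, 80), (80, 81), (81, 82), (82, 13), (13, 312), (312, 311),
    (311, 310), (310, 415), (415, 308), (308, 324), (324, 318), (318, 402),
    (402, 317), (317, 14), (14, 87), (87, 178), (178, 88), (88, 95), (95, 78),
]
# Taking the first index of every pair walks the inner lip contour in order.
INNER_LIP_IDX = [pair[0] for pair in LIPS_INNER]

def lip_points(landmarks, width, height):
    """landmarks: the 468-point list from Face Mesh; returns pixel coords."""
    return [(int(landmarks[i].x * width), int(landmarks[i].y * height))
            for i in INNER_LIP_IDX]
```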
2.3 Detect if mouth is open or not
Our problem is different from yawn detection, since one won’t open their mouth a lot while breathing through it. Here, I simply calculated the contour area inside the inner lip region: if the calculated contour area is more than a minimum threshold, it means the mouth is open.
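A minimal sketch of that check (min_area is the threshold, which we’ll make user-tunable later; the default value here is purely illustrative):

```python
# Sketch: the mouth counts as "open" when the inner-lip contour area
# exceeds a minimum threshold.
import cv2
import numpy as np

min_area = 300  # illustrative default; exposed via a slider later

def mouth_open(points, min_area):
    """points: inner-lip contour as [(x, y), ...] in pixel coordinates."""
    contour = np.array(points, dtype=np.int32)
    return cv2.contourArea(contour) > min_area
```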
Now, you may point out that this hack could fail: when we speak, we open our mouth, and according to this logic our solution would falsely classify speaking as mouth breathing.
Well, in order to solve this, I initially thought of using VAD, i.e., a Voice Activity Detector, with the intention that:
- VAD says voice is there and our solution says the mouth is open = Speaking
- VAD says no voice is there and our solution says the mouth is open = Breathing through the mouth
But then I came up with a small trick: if the contour area stays above the threshold for some “X” consecutive frames, it’s a “breathing through the mouth” case. While speaking, in order to pronounce any letter or word, one needs to keep closing their mouth, so the contour area changes frequently from frame to frame. In the case of subconscious mouth breathing, the contour area barely changes, i.e., its variance stays low.
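A rough sketch of this trick; X_FRAMES and MAX_VARIANCE are illustrative values I picked for the example, not tuned constants from the actual app:

```python
# Sketch: flag mouth breathing only when the contour area stays above
# min_area for X consecutive frames with low variance.
from collections import deque
import numpy as np

X_FRAMES = 60         # e.g. ~2 seconds at 30 fps
MAX_VARIANCE = 500.0  # speaking makes the area jump around far more

areas = deque(maxlen=X_FRAMES)

def is_mouth_breathing(contour_area, min_area):
    areas.append(contour_area)
    if len(areas) < X_FRAMES:
        return False
    window = np.array(areas)
    # All recent frames open, and the opening barely changes frame to frame
    return (window > min_area).all() and window.var() < MAX_VARIANCE
```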
Then I added a small notification popup that alerts the user.
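The exact popup code isn’t shown here; as one cross-platform option (an assumption on my part, not necessarily what the shipped app uses), the plyer package can raise a desktop notification:

```python
# Sketch: desktop notification via plyer (assumed library, not
# necessarily the one the original app uses).
from plyer import notification

notification.notify(
    title="Mouth breathing detected",
    message="You've been breathing through your mouth — close it :)",
    timeout=5,  # seconds
)
```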
2.4 Robust solution and packaging
Now, the question is what the minimum threshold should be (the min_area variable in the code above). For this, I added a “slider” in the solution through which the user can define the threshold as per their camera position and sitting angle. I then made a small Tkinter application for the solution, used PyInstaller to convert my files into an executable, and NSIS to create an installer for Windows.
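A minimal sketch of the slider idea in Tkinter (the range and default value are illustrative):

```python
# Sketch: a Tkinter slider that lets the user tune min_area at runtime.
import tkinter as tk

root = tk.Tk()
root.title("Mouth Breathing Alert")

min_area_var = tk.IntVar(value=300)  # illustrative default
slider = tk.Scale(root, from_=50, to=2000, orient=tk.HORIZONTAL,
                  label="Min contour area (threshold)",
                  variable=min_area_var)
slider.pack(fill=tk.X, padx=10, pady=10)

# The detection loop would read min_area_var.get() on every frame.
root.mainloop()
```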
3. Conclusion
So, this is my overall project. The video and the title are meant for entertainment purposes only. I’m not trying to hurt anyone’s sentiments, but if I did so unintentionally, my apologies.
I’m an ML enthusiast and I thought of sharing my journey and experiences with everyone by filming some devlogs (developer vlogs). If you’d like to be part of this journey too, feel free to check out my YouTube channel. Always open to constructive criticism :)