Myth or Not?
Sounds unbelievable?
Used to be the case for me too. A tough customer bug would take me days, sometimes weeks, to resolve. Please note: I am talking about the tough P0s here, not the ones I could shrug off with a small config change, or close with an exception handler or a minor logic change.
For those nagging ones, and you know the kind I am talking about, I would endlessly
- wait for support teams to share the exact steps to reproduce the bug.
- dread my PM asking for a status update at EOD every day until it was resolved.
- go through lines of code repeatedly till I was convinced my code was nearly flawless.
- scan every possible log and tool, praying something got logged somewhere.
- guess at trial-and-error solutions, even though I had studied binary.
- as a last resort, convince support to send a test build with extra logs, if the client was okay with it, and sometimes this seemed the easiest way out.
This story may sound familiar to developers who get assigned P0 issues. I have been in the office over weekends trying to solve that one evasive bug that occurs only in the customer's environment, for a critical customer sitting thousands of nautical miles away, with the whole dev team in the office with me, trying to understand the issue, searching through petabytes of logs in the system, and hunting for errors in our error monitoring tool. All this while my partner is fuming back at home over another weekend spent in the office. (On a related note, read our blog on the different ways GitHub Copilot is helping developers save time.)
Typical bug stories in the life of a developer
So how do you think I would have reacted?
- I swear it was working for me, and it still is.
- I can’t reproduce it; there must be something wrong at the customer's end.
- I think it’s a backend issue (of course I wrote flawless JS code).
- No way, I think the frontend team should look at the data they are sending. (I switched gears a few years back to write REST APIs.)
- It’s working for all other customers; I believe this customer has a wrong config.
And I so wish Docker could solve it all. But life ain’t that simple.
Toughest P0 bugs - what are they?
Which bugs are usually the tough ones that developers and customer support teams struggle with? They are the ones that involve a lot of guesswork.
- Ones which cannot be reproduced.
- Ones for which logs are missing.
- Ones for which stacktrace is missing.
- Ones where the customer environment details are missing.
- Ones where customer hasn’t shared enough data points in the support tickets.
- Ones which look like they happened due to a race condition, and now the customer can't reproduce them either.
To summarise, tough P0 bugs are the ones you are absolutely clueless about.
What scenarios lead to these P0 bugs?
Let us dig a little deeper here and run through a few examples of why P0 bugs are logged.
- The workflow was never tested and got missed in QA. Example: a new payment gateway was added in the latest release and was never tested for slow-network scenarios in QA. For some reason, payment requests are timing out.
- The customer environment was different from the test setup. Example: the customer used your product on a browser and OS combination that breaks the product.
- The customer used the product in an unexpected new way. Example: the customer closed a tab before it had fully loaded and signed out. On signing back in, the customer observes weird behaviour in the app.
- The data set or values at the customer's end were different from the test setup. Example: the customer enters an integer where you have only tested and configured for a string (see the sketch after this list).
- An unexpected network or infra problem caused it suddenly. Example: there is a sudden outage in a third-party notification service you are using, and your customers stop getting urgent notifications.
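To make the data-mismatch scenario concrete, here is a minimal sketch of the kind of defensive check that would have caught it early. The payload shape, field names, and error handling are hypothetical, purely for illustration:

```typescript
// Hypothetical payload shape; the field names are made up for illustration.
interface OrderPayload {
  customerId: string;
  couponCode: string; // only ever tested with string values
}

function parseOrderPayload(raw: unknown): OrderPayload {
  const data = raw as Record<string, unknown>;

  // A customer sending 12345 (number) instead of "12345" (string) is exactly
  // the kind of mismatch that never shows up in a test suite seeded with strings.
  if (typeof data.couponCode !== "string") {
    // Log enough context so the failure is diagnosable later, instead of failing silently.
    console.error("Unexpected couponCode type", {
      receivedType: typeof data.couponCode,
      customerId: data.customerId,
    });
    throw new TypeError(`couponCode must be a string, got ${typeof data.couponCode}`);
  }

  if (typeof data.customerId !== "string") {
    throw new TypeError("customerId must be a string");
  }

  return { customerId: data.customerId, couponCode: data.couponCode };
}
```

None of this removes the need for debugging tools, but explicit checks at least turn a silent, unreproducible failure into a loggable one.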
How do we solve them today, and how much time does it usually take developers?
Here are some of the actions developers resort to while debugging these issues.
- Try to reproduce the issue based on the ticket reported by the customer
- Try to replicate the customer's environment
- Search for user logs across multiple tools like Mixpanel, Sentry, and CloudWatch to recreate the scenario
- Search for errors logged by internal tools or dashboards that could have thrown an error for that user at that time
- Add new logs and send a test build to the customer to reproduce the issue (a quick sketch of this follows the list)
- Get on hour-long calls with support and customers to understand the issue better
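For the "add new logs and ship a test build" step, the extra instrumentation often looks something like the sketch below. The types and function names here are hypothetical stand-ins, not from any particular codebase; the point is simply to capture the inputs, timings, and failure paths you wish you had logged in the first place:

```typescript
// Hypothetical types; real request and response shapes will differ.
interface PaymentRequest { customerId: string; amount: number; }
interface PaymentResponse { status: "ok" | "failed"; }

// Stand-in for the real payment call in the product.
async function submitPayment(request: PaymentRequest): Promise<PaymentResponse> {
  return { status: "ok" };
}

// Wrapper added only to the customer test build, with extra diagnostic logs.
async function submitPaymentWithDebugLogs(request: PaymentRequest): Promise<PaymentResponse> {
  const startedAt = Date.now();
  console.info("[debug-build] payment request", {
    customerId: request.customerId,
    amount: request.amount,
    startedAt: new Date(startedAt).toISOString(),
  });

  try {
    const response = await submitPayment(request);
    console.info("[debug-build] payment response", {
      status: response.status,
      elapsedMs: Date.now() - startedAt,
    });
    return response;
  } catch (error) {
    // The failure path is usually the one that was never logged in production.
    console.error("[debug-build] payment failed", {
      elapsedMs: Date.now() - startedAt,
      error,
    });
    throw error;
  }
}
```

It works, but it is slow: every extra log means another build, another customer hand-off, and another wait for the bug to show up again.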
Time a developer spends on critical issues today
Here is an example break-up for you (not statistical data, just from one of those really frustrating bugs in the life of a developer):
- Speaking to support/the customer to understand the issue ~ 1 hour.
- Trying to reproduce the exact issue, with little success ~ 2 hours.
- Going back to QA and support for help reproducing the issue ~ 1 hour.
- QA + support + developer all trying to reproduce the issue ~ 3 hours.
- Giving up on reproducing, and searching logs and tools for the exact error for that customer ~ 2 hours.
- Finding some error, then searching the stack trace for possible causes ~ 2 hours.
- Fixing the issue, testing it, and releasing a hotfix to the customer ~ 8 hours.
- Only to realise it isn't fixing the customer-reported issue ~ repeat the above.
Guess what: that already adds up to around 19 hours, and one repeat of the fix-and-release cycle takes a single difficult bug past 20 hours.
And you know how much that can cost you in dev money? Okay, I probably don't need to tell you that 🙂
The bigger problem: all this time, your customer is waiting for you to fix the issue.
Even bigger: you don't know how many other customers may be facing the same issue.
What if we said you can fix all of this in minutes?
Here is how. What if we said we could build the complete customer bug picture for you? Think of it as an X-ray of your bugs, or even better, an MRI: a complete picture that helps you diagnose the issue, and in minutes at that.
Here is how you will save more than 50% of developer time.
With Zipy you can now eliminate all of the below steps:
- No need to reproduce the issue: you can simply replay it in minutes. Time saving: 7 hours.
- The error shows right in the dashboard, so there is no guesswork required. Time saving: 2 hours.
- A detailed stacktrace is available on the same screen. Time saving: 2 hours.
- Connect the timestamp in Zipy with your backend logs to trace the issue end to end (a rough sketch of this correlation follows below). Time saving: (not quantified).
Overall time saved: more than 9 hours on each issue.
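For the backend-log correlation step, a rough sketch of the idea is below: take the error timestamp and the user identifier surfaced by the session replay, and filter your backend logs to a small window around that moment. The log entry shape, field names, and window size here are assumptions for illustration, not Zipy's API:

```typescript
// Hypothetical backend log entry; real log formats will differ.
interface LogEntry {
  timestamp: string; // ISO 8601
  userId?: string;
  level: "info" | "warn" | "error";
  message: string;
}

// Given the error timestamp and user id from the session replay,
// pull the backend log lines from a window around that moment.
function correlateLogs(
  logs: LogEntry[],
  errorTimestamp: string,
  userId: string,
  windowMs = 5 * 60 * 1000, // look five minutes on either side
): LogEntry[] {
  const center = new Date(errorTimestamp).getTime();
  return logs.filter((entry) => {
    const t = new Date(entry.timestamp).getTime();
    return entry.userId === userId && Math.abs(t - center) <= windowMs;
  });
}

// Example: narrow thousands of lines down to the handful that matter.
// const suspects = correlateLogs(allLogs, "2024-05-01T10:42:13Z", "cust_1234");
```

However your backend logging is set up, the timestamp plus user identity is usually enough to turn an open-ended log search into a targeted one.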
Conclusion
So here is the takeaway: there won't be a single developer out there who hasn't handled a tough customer issue. Some of us have stories to share about customer bugs that gave us nightmares.
So you know the pain. There is a way out when you have the complete picture of who faced the issue, how it really happened, and where in your code things are going wrong. Isn't it almost heavenly when you never have to go back to customers to reproduce issues, and you can simply replay the issue in the form of a video? Best of all, instead of spending weekends firefighting, you can sit back, relax, and get to those nagging critical issues in minutes.
With session replay, logs, stacktraces, network traces, user identification, customer identification, advanced search, and surge alerts all built into one single tool, you don't have to worry about reproducing customer issues anymore. You get to the root cause of any customer-reported issue (and the not-reported ones as well) with Zipy in a jiffy. Why don't you try it out yourself?
Some commonly occurring customer bugs
- Logical Bugs
- UX Bugs
- Environment related Bugs
- Frontend Code Bugs
- Network Errors
- Memory, CPU Issues
- Slow Page Issues
- Server down Issues
- DB Issues