Over the past year, I've been increasingly using AI to help search logs and diagnose production issues.
At first, my thinking was simple: since AI can read code and summarize text, if I paste error logs to it, shouldn't it be able to quickly identify the problem?
In practice, the answer is: yes, but not if you just paste logs directly.
If you simply throw a large chunk of logs at AI, it can summarize things and pick out obvious errors. But production issue investigation isn't just about "spotting an ERROR log." The real challenge is: is this ERROR actually the root cause? Which service did it occur in? Is it a re-wrapped error from a downstream service? How do you correlate the user's taskId with the requestId in the logs? How many systems does a failed task actually pass through?
These problems can't be solved by just feeding logs to AI.
So recently, I consolidated my year of experience using AI for log investigation into a Skill. The project is here: Mercer-Lee/loghound.
Why I Built This Skill
The old workflow for investigating production issues was roughly:
- Customer support or a colleague sends over a taskId, uid, requestId, or in extreme cases, just a user screenshot.
- Open the log platform, use experience to decide which service to check first.
- Search with keywords, look for ERROR or WARN.
- If the entry service only shows "downstream call failed," go find the downstream service's logs.
- If the downstream calls another service, keep chasing.
- Finally, organize the understandable parts into a response: what's the cause, can it be retried, is it a user asset issue, does it need engineering intervention.
This process relies heavily on experience.
For example, with task failures: some errors are entry service parameter validation failures, some are async queue execution failures, some are rendering service failures, some are file download failures, some are third-party callback failures. The first error you see in the log platform is often just a symptom, not the root cause.
Moreover, many projects don't live on a single cloud platform. Some logs are on Alibaba Cloud SLS, some on Tencent Cloud CLS, some on Volcano Engine TLS, and some workflow systems can only be queried through Webhooks or APIs. Add in the need to reverse-query users, tasks, and asset records in MongoDB or SQL, and the whole chain becomes fragmented.
Humans can slowly work through it, but doing this every time is exhausting.
AI has great value here, but it needs a stable workbench — not ad-hoc copy-pasting of log dumps every time.
Problems with Letting AI Read Logs Directly
When I first started using AI for log investigation, I ran into several classic pitfalls.
The first problem: AI easily confuses symptoms with causes.
For example, if the entry service has a log saying "task execution failed," AI might directly summarize that the cause is "execution failed." But this statement has little investigative value — it tells you it failed, but not why.
The real useful clues might be in downstream services: download asset 403, unsupported file format, callback URL timeout, parameter parsing failure, or a clear error from a third-party API. Only when you trace to this level is the conclusion reliable.
The second problem: too many logs dilute the context.
A single production investigation might surface dozens or even hundreds of log entries. If you dump them all to AI, it consumes many tokens and easily gets distracted by duplicate logs, INFO flow logs, and irrelevant WARNs. The output looks comprehensive but might miss the most critical entry.
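One way to cut that noise before handing logs to a model is to collapse near-duplicate entries into templates and keep only the levels that matter. The sketch below is an illustration, not loghound's actual code: the entry shape (`level`, `message`) and the masking rules are assumptions made for the example.

```python
import re
from collections import Counter

# Hypothetical masking rules: replace volatile tokens (UUIDs, numbers)
# so that repeated messages collapse into one template.
_MASKS = [
    (re.compile(r"\b[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}\b", re.I), "<uuid>"),
    (re.compile(r"\b\d+\b"), "<num>"),
]


def template_of(message: str) -> str:
    """Return the message with volatile tokens replaced by placeholders."""
    for pattern, placeholder in _MASKS:
        message = pattern.sub(placeholder, message)
    return message


def cluster(entries, keep_levels=("ERROR", "WARN")):
    """Group entries by (level, template); most frequent groups come first."""
    counts = Counter(
        (entry["level"], template_of(entry["message"]))
        for entry in entries
        if entry["level"] in keep_levels
    )
    return counts.most_common()
```

Fed a few hundred raw entries, this returns a short list of distinct error shapes with counts, which is usually a better starting context than the raw dump.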
The third problem: AI doesn't know our system topology.
When humans investigate logs, there's an implicit map in their head: this service handles the entry, that one handles the queue, a certain taskId prefix means a certain type of task, a particular error usually requires checking another service. Without telling AI this experience, it can only guess from a single input.
So I eventually realized the key problem isn't "having AI read logs" — it's structuring the investigation experience so AI follows a fixed process.
What I Want to Capture Isn't Scripts, But Investigation Methodology
The loghound project looks like a log query tool on the surface, but what I really want to capture is the investigation methodology.
I divided it into two layers:
- Script layer: responsible for querying logs, querying databases, normalizing results, extracting error signals, clustering and deduplicating.
- Analysis layer: responsible for determining problem type, tracing along service chains, distinguishing symptoms from root causes, generating conclusions for internal use or customer-facing responses.
The script layer solves "how to collect evidence."
It supports Alibaba Cloud SLS, Tencent Cloud CLS, Volcano Engine TLS, Webhook workflow engines, and also supports reverse-querying records from MongoDB and SQL by user ID, task ID, etc. Different platforms have different query methods and return structures, so results need to be unified into a format that AI can more easily understand.
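A minimal sketch of what "unifying into one format" could look like. The two raw shapes below (`platform_a`, `platform_b`) are invented for illustration and do not reflect the real response structures of SLS, CLS, or TLS; the `LogEntry` fields are likewise an assumption, not loghound's schema.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict


@dataclass(frozen=True)
class LogEntry:
    source: str       # which log platform the entry came from
    service: str
    timestamp: float  # epoch seconds
    level: str
    message: str


def _from_platform_a(raw: Dict[str, Any]) -> LogEntry:
    # Hypothetical shape: flat keys, timestamp in seconds.
    return LogEntry("platform_a", raw["svc"], float(raw["ts"]),
                    raw["lvl"].upper(), raw["msg"])


def _from_platform_b(raw: Dict[str, Any]) -> LogEntry:
    # Hypothetical shape: nested fields, timestamp in milliseconds.
    fields = raw["fields"]
    return LogEntry("platform_b", fields["service"], raw["time_ms"] / 1000.0,
                    fields["level"].upper(), fields["message"])


ADAPTERS: Dict[str, Callable[[Dict[str, Any]], LogEntry]] = {
    "platform_a": _from_platform_a,
    "platform_b": _from_platform_b,
}


def normalize(source: str, raw: Dict[str, Any]) -> LogEntry:
    """Convert one raw record into the shared LogEntry shape."""
    return ADAPTERS[source](raw)
```

With one adapter per platform, everything after the query step (clustering, tracing, prompting) only has to deal with a single record type.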
The analysis layer solves "how to judge after getting evidence."
This is where the Skill's value lies. It requires AI to first classify the user's feedback into a problem type: incident investigation, quality anomaly, status query, vague feedback, batch issue, or audit. Different problems require different investigation approaches — you can't treat everything as "find the ERROR."
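The six problem types can be pictured as an enum that selects an investigation plan. This is only a sketch: in the Skill the classification is done by the model, and the mapping from type to steps below is an assumption made for illustration (only the "status query does not need an incident conclusion" rule comes from the text).

```python
from enum import Enum
from typing import List


class ProblemType(Enum):
    INCIDENT = "incident investigation"
    QUALITY = "quality anomaly"
    STATUS = "status query"
    VAGUE = "vague feedback"
    BATCH = "batch issue"
    AUDIT = "audit"


# Assumption: these types call for root cause tracing.
NEEDS_ROOT_CAUSE = {ProblemType.INCIDENT, ProblemType.QUALITY, ProblemType.BATCH}


def plan_for(problem_type: ProblemType) -> List[str]:
    """Return hypothetical investigation steps for a given problem type."""
    if problem_type is ProblemType.STATUS:
        return ["look up current task status", "report status only"]
    if problem_type is ProblemType.VAGUE:
        return ["ask for or reverse-query an identifier", "re-classify"]
    steps = ["query entry service by strongest identifier"]
    if problem_type in NEEDS_ROOT_CAUSE:
        steps += ["trace downstream until a hard error", "write conclusion with evidence"]
    else:
        steps.append("summarize matching records")
    return steps
```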
Then it queries by identifier priority:
traceId / requestId > taskId > uid / userId > user-side ID
If the entry service returns nothing, or the logs don't match the user's description, it tries to reverse-query recent anomalies using the uid. If logs show downstream call failures, it continues extracting downstream taskIds, traceIds, or requestIds, then follows the service topology to the next hop.
This process sounds simple, but it solves an important problem: AI is no longer "glancing at logs and guessing." Its reasoning is constrained to a defined investigation workflow.
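The identifier priority and the fallback order can be sketched as a small helper. The dictionary keys, including `userSideId`, are hypothetical names chosen for this example; traceId and requestId share a priority tier in the text, so the order between those two here is arbitrary.

```python
from typing import Dict, List, Optional, Tuple

# Strongest identifier first, following the priority described above.
PRIORITY = ("traceId", "requestId", "taskId", "uid", "userId", "userSideId")


def query_plan(ids: Dict[str, str]) -> List[Tuple[str, str]]:
    """Return the available identifiers in the order they should be tried."""
    return [(key, ids[key]) for key in PRIORITY if ids.get(key)]


def strongest_identifier(ids: Dict[str, str]) -> Optional[Tuple[str, str]]:
    """Return the single best identifier, or None if nothing usable was given."""
    plan = query_plan(ids)
    return plan[0] if plan else None
```

If the first entry in the plan returns nothing, the caller moves on to the next one, which is how the "fall back to uid" behavior falls out of the same list.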
A Typical Investigation Chain
Assume a user reports a task failure with only a taskId.
Step 1: Don't jump to conclusions — first confirm the problem type. If the user is just asking "what's the status of this task," it might just be a status query that doesn't need an incident conclusion. Only if the user clearly says "the task failed" or "the result is abnormal" do you enter incident investigation or quality investigation mode.
Step 2: Use the taskId to query ERROR and WARN in the entry service. What you find might be task failure, or just task receipt, task creation, or downstream call failure.
Step 3: If the entry service's logs contain downstream service task IDs or requestIds, continue investigating downstream. Service topology is critical here because AI needs to know which services the current service calls and what each service is responsible for.
Step 4: Keep tracing until a service shows a clear hard error — file download failure, parameter format error, network timeout, third-party API failure, media parsing failure, etc. These logs are usually much closer to the root cause than "task execution failed."
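Steps 2 through 4 amount to a bounded walk over the service topology that stops at the first hard error. The sketch below assumes a topology given as a plain dict and two caller-supplied functions, `fetch_errors` and `is_hard_error`; all of these names are hypothetical and stand in for real log queries and real judgment rules.

```python
from typing import Callable, Dict, List, Optional, Tuple

Hit = Tuple[str, str]  # (service, error message)


def trace(
    entry_service: str,
    topology: Dict[str, List[str]],
    fetch_errors: Callable[[str], List[str]],
    is_hard_error: Callable[[str], bool],
    max_hops: int = 5,
) -> Tuple[List[Hit], Optional[Hit]]:
    """Walk downstream hop by hop; return (evidence path, first hard error or None)."""
    path: List[Hit] = []
    seen = set()
    frontier = [entry_service]
    for _ in range(max_hops):
        next_frontier: List[str] = []
        for service in frontier:
            if service in seen:
                continue  # guard against cycles in the call graph
            seen.add(service)
            for message in fetch_errors(service):
                path.append((service, message))
                if is_hard_error(message):
                    return path, (service, message)
            next_frontier.extend(topology.get(service, []))
        if not next_frontier:
            break
        frontier = next_frontier
    return path, None
```

The `max_hops` limit and the `seen` set keep the walk from looping forever; when no hard error turns up, the collected path is still useful as evidence of where the trail went cold.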
Step 5: Organize the conclusion into a fixed format:
- Problem conclusion
- Key evidence
- Impact scope
- Handling recommendations
- Customer support or user response script
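The fixed format above maps naturally onto a small record type. This is a sketch with hypothetical field names, not the Skill's real output schema; the one rule it enforces is that a conclusion cannot be rendered without evidence.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Conclusion:
    problem: str
    evidence: List[str]
    impact: str
    recommendations: List[str]
    reply_script: str

    def render(self) -> str:
        """Render the five sections as plain text; refuse if evidence is missing."""
        if not self.evidence:
            raise ValueError("a conclusion needs at least one piece of evidence")
        lines = ["Problem conclusion: " + self.problem, "Key evidence:"]
        lines += ["  - " + item for item in self.evidence]
        lines.append("Impact scope: " + self.impact)
        lines.append("Handling recommendations:")
        lines += ["  - " + item for item in self.recommendations]
        lines.append("Response script: " + self.reply_script)
        return "\n".join(lines)
```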
The benefit of this approach is that both the investigation process and the output become much more stable. Even if someone else takes over, or a similar issue comes up days later, it won't depend entirely on the investigator's intuition at that moment.
Key Points I Care About in the Skill
First: Must distinguish between "status query" and "incident investigation."
Often, someone just wants to know if a task is done — they're not asking you to analyze an incident. If AI starts writing root cause analysis and customer scripts right away, it feels odd. That's why loghound requires problem type classification at the very beginning.
Second: Log queries are evidence gathering, not conclusions.
Script results only show which logs were found at a certain point in time, so they can't be treated as the final root cause. AI must combine the upstream/downstream chain, error location, log level, and failure stage to make a judgment.
Third: Don't stop at business-wrapped errors.
Many systems wrap downstream errors into unified business errors like "task failed," "generation failed," or "processing exception." These logs might be ERRORs, but they're not necessarily the root cause. As long as there are still downstream clues in the logs, you should keep tracing.
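A rough sketch of how "wrapped" and "hard" errors might be told apart with pattern lists. The patterns are assumptions drawn from the examples in this post; a real project would maintain its own lists, and anything that matches neither stays "unknown" rather than being guessed at.

```python
import re

# Hypothetical patterns for generic business wrappers.
WRAPPER_PATTERNS = [re.compile(p, re.I) for p in (
    r"task (execution )?failed",
    r"generation failed",
    r"processing exception",
    r"downstream call failed",
)]

# Hypothetical patterns for concrete, specific failures.
HARD_PATTERNS = [re.compile(p, re.I) for p in (
    r"\b(403|404|5\d\d)\b",
    r"timed? ?out",
    r"unsupported (file )?format",
    r"pars(e|ing) (error|failure)",
)]


def classify(message: str) -> str:
    """Return 'hard', 'wrapped', or 'unknown' for one error message."""
    # Hard patterns win: a wrapper that embeds the real cause is still useful.
    if any(p.search(message) for p in HARD_PATTERNS):
        return "hard"
    if any(p.search(message) for p in WRAPPER_PATTERNS):
        return "wrapped"
    return "unknown"
```

A "wrapped" result is a signal to keep tracing downstream, not a conclusion.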
Fourth: Output must be directly usable by people.
The final step of investigation isn't displaying a pile of logs — it's telling the other party: what caused this issue, is there a user-side factor, can it be retried, does it need engineering intervention. Especially for conclusions shown to customer support, they shouldn't read like a stack trace analysis report.
What It Can Do Now
Currently loghound covers these scenarios:
- Query the same taskId or requestId across multiple cloud log platforms.
- Identify different services' log sources and environments based on project configuration.
- Normalize, cluster, and extract error signals from logs.
- Trace downstream services based on service topology.
- Convert user-side IDs to internal IDs via MongoDB or SQL.
- Query Webhook-based workflow task status and error details.
- Have AI generate root cause analysis and response scripts following a fixed Skill process.
It's not an out-of-the-box universal incident bot. Because every company's service topology, log format, and task ID rules are different, it requires configuring projects, log sources, and call relationships.
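To make "configuring projects, log sources, and call relationships" concrete, here is a hypothetical configuration with a small validator. The schema, the service names, and the source labels are all invented for this sketch and are not loghound's real configuration format.

```python
from typing import Any, Dict, List

# Hypothetical project configuration: each service names its log source,
# its environment, and the services it calls.
CONFIG: Dict[str, Any] = {
    "project": "example-project",
    "services": {
        "entry-api": {"log_source": "sls", "env": "prod", "calls": ["task-worker"]},
        "task-worker": {"log_source": "cls", "env": "prod", "calls": ["renderer"]},
        "renderer": {"log_source": "tls", "env": "prod", "calls": []},
    },
}

KNOWN_SOURCES = {"sls", "cls", "tls", "webhook"}


def validate(config: Dict[str, Any]) -> List[str]:
    """Return a list of human-readable problems; an empty list means the config is consistent."""
    problems: List[str] = []
    services = config.get("services", {})
    for name, spec in services.items():
        source = spec.get("log_source")
        if source not in KNOWN_SOURCES:
            problems.append(f"{name}: unknown log source {source!r}")
        for callee in spec.get("calls", []):
            if callee not in services:
                problems.append(f"{name}: calls undefined service {callee!r}")
    return problems
```

Checking the call relationships up front matters because a dangling downstream reference would otherwise only surface in the middle of an investigation.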
But that's exactly why I think it's valuable.
If a tool doesn't understand your system at all, it can at best do generic log summarization. Only when you feed it project topology, log rules, and investigation experience can it become an assistant that truly participates in investigation.
Reflections After Building This Skill
I've always considered log investigation a classic experience-driven task.
Experienced engineers can look at an error and know whether to trust the log, whether to keep tracing downstream, which database an ID should be reverse-queried in, which errors can be explained to users directly, and which need engineering escalation.
After AI appeared, my understanding of this changed somewhat.
Experience-driven tasks can be augmented by AI — the key is extracting the experience. Don't just tell AI "help me look at these logs." Instead, tell it:
- Classify the problem type first.
- Then query by the strongest identifier available.
- How to fall back when queries return nothing.
- Keep tracing when downstream failures appear.
- Don't treat wrapped errors as root causes.
- Conclusions must include evidence.
Once these rules are codified, AI's performance becomes much more stable.
So for me, loghound isn't just a tool — it's a consolidation of my experience using AI to investigate production issues over the past year. It takes judgment pathways that were previously hidden in my head and turns them into a Skill, so AI doesn't just "read logs" but works as closely as possible to how an engineer actually investigates problems.
Conclusion
My biggest takeaway from using AI for log investigation this past year: AI doesn't inherently understand your system, and it doesn't inherently know what the root cause is.
But if we're willing to organize our system topology, log query methods, error judgment rules, and output formats, it can take on much of the repetitive, tedious, and fatigue-prone work.
Building loghound felt like re-examining my own log investigation habits. Which judgments are evidence-based, which are just gut feelings; which processes can be automated, which require human confirmation. Once these things are clear, the tool itself is almost a byproduct.
If I continue iterating, I hope to refine it across more log platforms, more project topologies, and more real incident scenarios. After all, production issues won't disappear just because we don't want to look at logs — but at least we can make log investigation a bit less painful.
