After a Year of Using AI to Search Logs, I Built a Skill

Can You Just Throw Logs at AI?

Over the past year, I have been using AI more and more at work to help search logs and diagnose production issues.

At first, my thinking was simple: since AI can read code and summarize text, if I paste error logs into it, shouldn't it be able to quickly tell me where the problem is?

In practice, the answer is: yes, but not if you use it that naively.

If you simply paste a large chunk of logs into AI, it can summarize something and pick out obvious errors. But production issue investigation is not just about "spotting an ERROR log." The harder questions are usually: is this ERROR actually the root cause? Which service did it happen in? Is it a downstream error wrapped by an upstream service? How do you correlate the taskId from the user with the requestId in the logs? How many systems did one failed task pass through?

Those questions cannot be solved by just throwing logs at AI.

So I recently organized my experience using AI for log investigation over the past year into a Skill. The project is here: Mercer-Lee/loghound.

Why I Thought of Building This Skill

The old workflow for investigating production issues was roughly:

Customer support or a colleague sends a taskId, uid, requestId, or in the more extreme case, only a user screenshot.
Open the log platform and rely on experience to decide which service to check first.
Search by keyword and look for ERROR or WARN.
If the entry service only says "downstream call failed," go find the downstream service logs.
If that downstream service calls another service, keep tracing.
Finally, organize the understandable parts into a response explaining the cause, whether it can be retried, whether it is related to user materials, and whether engineering needs to get involved.

This process depends heavily on experience.

For the same kind of task failure, some errors come from parameter validation in the entry service, some from async queue execution, some from the rendering service, some from file download failures, and some from third-party callbacks. The first error you see in the log platform is often only a symptom, not the final cause.

Many projects also do not live on a single cloud platform. Some logs are in Alibaba Cloud SLS, some in Tencent Cloud CLS, some in Volcano Engine TLS, and some workflow systems can only be checked through Webhooks or APIs. On top of that, you sometimes need to reverse-query users, tasks, or asset records from MongoDB or SQL. The whole chain becomes fragmented very quickly.

Of course, humans can work through it slowly, but doing that every time is tiring.

AI is valuable here, but it needs a stable workbench instead of ad hoc copy-pasting a pile of logs every time.

Problems with Letting AI Read Logs Directly

When I first started using AI for log investigation, I ran into a few typical pitfalls.

The first problem is that AI easily treats "symptoms" as "causes."

For example, if the entry service has a log saying "task execution failed," AI may directly summarize that the cause is task execution failure. But that is not very useful for investigation. It only tells you that the task failed, not why.

The useful clue may be in a downstream service: asset download 403, unsupported file format, callback URL timeout, parameter parsing failure, or a clear error from a third-party API. Only when you trace to that layer does the conclusion become more reliable.

The second problem is that too many logs dilute the context.

A single production investigation often surfaces dozens or even hundreds of log entries. If you throw all of them at AI, it consumes a lot of tokens and is easily distracted by duplicate logs, INFO-level flow logs, and irrelevant WARNs. The final output may look rich, but it may miss the most important entry.
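One way to reduce that noise before anything reaches the model is to drop flow-level logs and collapse near-duplicate messages into templates with counts. The following is only a minimal sketch of that idea, not loghound's actual implementation; the entry shape (`level`, `message`), the placeholder patterns, and the sample messages are all assumptions made up for illustration.

```python
import re
from collections import Counter

# Hypothetical patterns for the variable parts of a message (long hex IDs, numbers).
_VARIABLE_PARTS = [
    (re.compile(r"\b[0-9a-f]{8,}\b"), "<hex>"),
    (re.compile(r"\d+"), "<n>"),
]


def template_of(message: str) -> str:
    """Replace variable tokens so that similar messages share one template."""
    for pattern, placeholder in _VARIABLE_PARTS:
        message = pattern.sub(placeholder, message)
    return message


def cluster(entries, keep_levels=("ERROR", "WARN")):
    """Count entries per (level, template), ignoring levels outside keep_levels."""
    counts = Counter(
        (entry["level"], template_of(entry["message"]))
        for entry in entries
        if entry["level"] in keep_levels
    )
    return counts.most_common()


# Made-up sample entries, standing in for a real query result.
sample = [
    {"level": "INFO", "message": "task 1001 received"},
    {"level": "ERROR", "message": "download failed for asset 42: status 403"},
    {"level": "ERROR", "message": "download failed for asset 43: status 403"},
    {"level": "WARN", "message": "retry 2 for task 1001"},
]

for (level, template), count in cluster(sample):
    print(count, level, template)
```

With this kind of pre-processing, the two download failures reach the model as one template with a count of 2 instead of two separate lines, and the INFO entry is not sent at all.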

The third problem is that AI does not know your system topology.

When a human searches logs, there is usually an implicit map in their head: this service handles the entry point, that service handles the queue, a certain taskId prefix represents a certain kind of task, and a certain error usually means you should continue into another service. If you do not give AI this experience, it can only guess from the current input.

So I eventually realized that the key problem is not "letting AI search logs," but structuring the investigation experience so AI follows a fixed process.

What I Wanted to Capture Was Not Scripts, But Investigation Methodology

On the surface, loghound is a log query tool. But what I really wanted to capture is the investigation methodology.

I split it into two layers:

Script layer: queries logs, queries databases, normalizes results, extracts error signals, and clusters or deduplicates entries.
Analysis layer: determines problem type, traces along service chains, distinguishes symptoms from root causes, and generates conclusions for internal teams or customers.

The script layer solves the question of "how to collect evidence."

It supports Alibaba Cloud SLS, Tencent Cloud CLS, Volcano Engine TLS, Webhook workflow engines, and reverse-querying records from MongoDB and SQL by user ID, task ID, and similar identifiers. Different platforms have different query methods and response structures, so the results need to be normalized into a format AI can understand more easily.
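As a rough illustration of what "normalizing" can mean here, the sketch below maps raw records from different sources onto one common shape. The field names in `FIELD_MAPS` (`ts`, `lvl`, `svc`, and so on) are invented placeholders, not the real response fields of SLS, CLS, or TLS, and `LogEntry` is a hypothetical structure rather than loghound's own.

```python
from dataclasses import dataclass


@dataclass
class LogEntry:
    """Hypothetical common shape handed to the analysis layer."""
    source: str
    timestamp: str
    level: str
    service: str
    message: str


# Assumed per-source field names. Real field names depend on each platform
# and on how your own logs are indexed.
FIELD_MAPS = {
    "sls": {"timestamp": "ts", "level": "lvl", "service": "svc", "message": "msg"},
    "cls": {"timestamp": "time", "level": "severity", "service": "app", "message": "text"},
    "tls": {"timestamp": "t", "level": "log_level", "service": "service", "message": "body"},
}


def normalize(source: str, raw: dict) -> LogEntry:
    """Convert one raw record from the given source into a LogEntry."""
    mapping = FIELD_MAPS[source]
    return LogEntry(
        source=source,
        timestamp=str(raw.get(mapping["timestamp"], "")),
        level=str(raw.get(mapping["level"], "UNKNOWN")).upper(),
        service=str(raw.get(mapping["service"], "")),
        message=str(raw.get(mapping["message"], "")),
    )


entry = normalize("cls", {"time": "2024-01-01T00:00:00Z", "severity": "error",
                          "app": "render", "text": "asset download failed"})
print(entry)
```

Once every source produces the same shape, the later steps (clustering, tracing, summarizing) no longer need to know which platform a log came from.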

The analysis layer solves the question of "how to judge after getting evidence."

This is where the Skill becomes valuable. It asks AI to first classify the user's feedback: incident investigation, quality anomaly, status query, vague feedback, batch issue, or audit. Different problem types require different investigation paths. You cannot treat everything as "find the ERROR."

Then it follows an identifier priority:

traceId / requestId > taskId > uid / userId > user-side ID

If the entry service returns nothing, or the logs do not match the user's description, it tries to use uid to reverse-query recent anomalies. If the logs show a downstream call failure, it continues extracting downstream taskIds, traceIds, or requestIds, then follows the service topology to the next service.
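The priority order and the fallback can be sketched as a small ranking function. This is an assumption-laden illustration: the key names (in particular `userSideId`) are placeholders, and listing `traceId` before `requestId` within the same tier is an arbitrary choice.

```python
# Strongest identifier first, following the priority above.
# "userSideId" is a hypothetical key name for the user-side ID.
PRIORITY = ["traceId", "requestId", "taskId", "uid", "userId", "userSideId"]


def ranked_identifiers(found: dict) -> list:
    """Return the (name, value) pairs that are present, strongest first."""
    return [(name, found[name]) for name in PRIORITY if found.get(name)]


# Made-up input: what was extracted from a user's report.
candidates = ranked_identifiers({"uid": "u-88", "taskId": "task-1"})
print(candidates)
```

The caller queries with the first pair and, if the logs come back empty or do not match the user's description, moves on to the next pair, which here is the `uid` fallback.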

The process sounds simple, but it solves an important problem: AI is no longer "glancing at logs and guessing." It is constrained to reason inside an investigation workflow.

A Typical Investigation Chain

Suppose a user reports a failed task and only provides a taskId.

First, do not jump to a conclusion. Confirm the problem type. If the user is only asking "what is the current status of this task?", it may just be a status query and does not need an incident conclusion. If the user clearly says "the task failed" or "the result is abnormal," then it enters incident investigation or quality investigation.

Second, use the taskId to search ERROR and WARN logs in the entry project. The result may show a real task failure, or it may only show that the task was received, created, or that a downstream call failed.

Third, if the entry project's logs contain a downstream task ID or requestId, continue downstream. Service topology matters here because AI needs to know which services the current service calls and what each service is responsible for.

Fourth, keep tracing until a service shows a clear hard error, such as file download failure, parameter format error, network timeout, third-party API failure, or media parsing failure. These logs are usually closer to the root cause than "task execution failed."

Fifth, organize the conclusion into a fixed format:

Problem conclusion
Key evidence
Impact scope
Handling recommendation
Customer support or user-facing response

The benefit is that both the investigation process and the output wording become much more stable. Even if another person takes over, or a similar issue appears a few days later, the result will not depend entirely on the investigator's intuition at that moment.
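The tracing steps above can be sketched as a loop over a service topology. Everything concrete below is hypothetical: the three services, the `downstreamId=` log convention, and the in-memory `FAKE_LOGS` that stands in for real log platform queries. It also takes the simplifying assumption that each service has a single downstream to follow.

```python
import re

# Hypothetical topology: which services each service calls.
TOPOLOGY = {"entry": ["queue"], "queue": ["render"], "render": []}

# Made-up logs keyed by (service, identifier), replacing real queries.
FAKE_LOGS = {
    ("entry", "task-1"): ["ERROR task execution failed, downstreamId=job-7"],
    ("queue", "job-7"): ["ERROR downstream call failed, downstreamId=render-3"],
    ("render", "render-3"): ["ERROR asset download failed: status 403"],
}

# Assumed convention for how a downstream ID appears in a log line.
DOWNSTREAM_ID = re.compile(r"downstreamId=([\w-]+)")


def search(service: str, identifier: str) -> list:
    """Stand-in for a log platform query."""
    return FAKE_LOGS.get((service, identifier), [])


def trace(service: str, identifier: str, max_hops: int = 5) -> list:
    """Collect evidence hop by hop until no downstream clue remains."""
    evidence = []
    for _ in range(max_hops):
        lines = search(service, identifier)
        evidence.extend((service, line) for line in lines)
        next_id = None
        for line in lines:
            match = DOWNSTREAM_ID.search(line)
            if match:
                next_id = match.group(1)
                break
        downstream = TOPOLOGY.get(service, [])
        if next_id is None or not downstream:
            break
        # Simplification: follow only the first configured downstream service.
        service, identifier = downstream[0], next_id
    return evidence


for service, line in trace("entry", "task-1"):
    print(service, "|", line)
```

In this made-up chain, the last piece of evidence is the 403 in the render service, which is a far better root-cause candidate than the "task execution failed" line in the entry service.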

Details I Care About in the Skill

The first point is that it must distinguish between "status query" and "incident investigation."

Many times, someone only wants to know whether a task has completed. They are not asking you to analyze an incident. If AI immediately starts writing root causes, responsible services, and customer-facing scripts, it feels strange. That is why loghound asks for problem type classification at the very beginning.

The second point is that querying logs is evidence collection, not the conclusion.

Script results only show which logs were found at a certain point in time. They cannot be directly treated as the final root cause. AI must judge by combining upstream and downstream chains, error location, log level, and failure stage.

The third point is that it must not stop at business-wrapped errors.

Many systems wrap downstream errors into unified business errors such as "task failed," "generation failed," or "processing exception." These logs may be ERROR-level, but they are not necessarily the root cause. As long as the logs still contain downstream clues, the investigation should continue.
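A crude heuristic version of this rule might look like the sketch below. The phrase list is taken from the examples in this article plus a few assumed variants. In the Skill this judgment is made by the AI following instructions, so treat this purely as an illustration of the decision, not as how loghound implements it.

```python
# Generic wrapper phrases. Extend this list for your own systems.
WRAPPER_PHRASES = (
    "task failed",
    "task execution failed",
    "generation failed",
    "processing exception",
    "downstream call failed",
)


def looks_wrapped(message: str) -> bool:
    """True if the message is only a generic business-level wrapper."""
    text = message.lower()
    return any(phrase in text for phrase in WRAPPER_PHRASES)


def assess(message: str, downstream_ids: list) -> str:
    """Decide what to do with one ERROR line."""
    if downstream_ids:
        # Downstream clues remain, so this line is not the end of the chain.
        return "keep_tracing"
    if looks_wrapped(message):
        # A wrapper with no clue: not a root cause, fall back to another identifier.
        return "wrapped_no_clue"
    return "candidate_root_cause"


print(assess("Task failed", ["job-7"]))
print(assess("generation failed", []))
print(assess("asset download failed: status 403", []))
```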

The fourth point is that the output must be directly usable by people.

The final step of investigation is not to show a pile of logs, but to tell the other person what caused the issue, whether there is a user-side factor, whether it can be retried, and whether engineering needs to handle it. Especially for conclusions shown to customer support, the output should not read like a stack trace analysis report.
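To make the fixed five-part conclusion format concrete, here is a minimal sketch of it as a data structure that refuses to render a conclusion without evidence. The class name, field names, and sample content are all illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass
class Conclusion:
    """Hypothetical container for the five-part conclusion format."""
    problem_conclusion: str
    key_evidence: list
    impact_scope: str
    handling_recommendation: str
    customer_response: str

    def __post_init__(self):
        # A conclusion without evidence is only a guess.
        if not self.key_evidence:
            raise ValueError("a conclusion must include evidence")

    def render(self) -> str:
        """Format the conclusion as plain text for people to read."""
        evidence = "\n".join(f"  - {item}" for item in self.key_evidence)
        return "\n".join([
            f"Problem conclusion: {self.problem_conclusion}",
            "Key evidence:",
            evidence,
            f"Impact scope: {self.impact_scope}",
            f"Handling recommendation: {self.handling_recommendation}",
            f"Customer-facing response: {self.customer_response}",
        ])


# Made-up example content.
report = Conclusion(
    problem_conclusion="Asset download returned 403 in the render service",
    key_evidence=["render: asset download failed: status 403"],
    impact_scope="Single task",
    handling_recommendation="Ask the user to re-upload the asset, then retry",
    customer_response="The file could not be accessed. Please upload it again.",
)
print(report.render())
```

Keeping the customer-facing response as its own field makes it harder for stack-trace language to leak into the text that customer support forwards.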

What It Can Do Now

Currently, loghound can roughly cover these scenarios:

Query the same taskId or requestId across multiple cloud log platforms.
Identify log sources and environments for different services based on project configuration.
Normalize logs, cluster them, and extract error signals.
Trace downstream services based on service topology.
Convert user-side IDs to internal IDs through MongoDB or SQL.
Query the status and error details of Webhook-style workflow tasks.
Let AI generate root cause analysis and response wording by following a fixed Skill process.

It is not an out-of-the-box universal incident bot. Every company's service topology, log format, and task ID rules are different, so it necessarily requires configuring projects, log sources, and call relationships.
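The kind of configuration this implies might look like the sketch below. The schema is invented for illustration and is not loghound's actual config format: the keys `log_source`, `env`, and `calls` are assumptions, as are the three project names. The small check only verifies that every call relationship points at a configured project.

```python
# Hypothetical configuration: projects, their log sources, and call relationships.
CONFIG = {
    "projects": {
        "entry": {"log_source": "sls", "env": "prod", "calls": ["queue"]},
        "queue": {"log_source": "cls", "env": "prod", "calls": ["render"]},
        "render": {"log_source": "tls", "env": "prod", "calls": []},
    }
}


def validate(config: dict) -> list:
    """Return a list of problems: call targets that are not configured projects."""
    projects = config["projects"]
    problems = []
    for name, project in projects.items():
        for target in project.get("calls", []):
            if target not in projects:
                problems.append(f"{name} calls unknown project {target}")
    return problems


print(validate(CONFIG))
```

A check like this is cheap, and it catches the most common configuration mistake early: a topology that points at a service nobody has told the tool how to query.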

But that is exactly why I think it is valuable.

If a tool does not understand your system at all, it can at most do generic log summarization. Only when project topology, log rules, and investigation experience are fed into it does it have a chance to truly participate in investigation.

Some Thoughts After Building This Skill

I used to think of log investigation as a classic experience-heavy task.

Experienced engineers can look at an error and know whether to trust that log, whether to keep tracing downstream, which database to reverse-query an ID in, which errors can be explained to users directly, and which ones require engineering escalation.

After AI appeared, my understanding changed a bit.

Experience-heavy work can be assisted by AI. The key is to extract the experience. Do not just tell AI, "help me look at these logs." Tell it:

First classify the problem type.
Then find the strongest identifier.
How to fall back when nothing is found.
Continue tracing when downstream failures appear.
Do not treat wrapped errors as root causes.
The conclusion must include evidence.

Once these rules are captured, AI's behavior becomes much more stable.

So for me, loghound is not just a tool. It is also a summary of my experience using AI to investigate production issues over the past year. It turns judgment paths that used to live in my head into a Skill, so AI does not merely "read logs," but tries to work in the way an engineer actually investigates a problem.

Conclusion

My biggest takeaway from using AI to search logs this year is that AI does not naturally understand your system, and it does not naturally know what the root cause is.

But if we are willing to organize system topology, log query methods, error judgment rules, and output formats, it can take over a lot of repetitive, fragmented, and fatigue-prone work.

Writing loghound also felt like re-examining my own log investigation habits: which judgments are evidence-based, which are only guesses based on experience, which processes can be automated, and where human confirmation must remain. Once these things are clear, the tool itself is almost a by-product.

If I continue iterating on it, I hope it can be refined across more log platforms, more project topologies, and more real incident scenarios. Production issues will not disappear just because we do not want to look at logs, but at least we can make log investigation a little less painful.
