80 points | by mohsen1 | 9 hours ago
It outputs a file tree of your repo, a list of the dependencies, and a selected list of files you want included in your prompt for the LLM, all in a single XML file. The first time you run it, it generates a .project-context.toml config file in your repo with all your files commented out, and you just uncomment the ones you want written out in full in the context file. I've found this helps when iterating on a specific part of the codebase, while keeping the full file tree gives the LLM the broader context; I always ask the LLM to request more files if needed, as it can see the full list.
The files are not sorted by priority in the output, though. I'm curious what the impact would be and how much room to leave for manual config (you might want a different order depending on the objective of the prompt).
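For illustration, the generated config might look roughly like this (a hypothetical sketch; the actual format may differ):

    # .project-context.toml -- uncomment a file to include its full contents
    [files]
    # "src/main.py" = true
    # "src/utils.py" = true
    "src/parser.py" = true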
The way I currently do this: I wrote a small Python script that I can start with
llmcode.py /path/to/repo
which then offers a simple web interface at localhost:8080 where I can select the files to serialize and describe a task. It then creates a prompt like this:
Look at the code files below and do the following:
{task_description}
Output all files that you need to change in full again,
including your changes. In the same format as I provide
the files below, that means each file starts with
filename: and ends with :filename
Under no circumstances output any other text, no additional
infos, no code formatting chars. Only the code in the
given format.
Here are the files:
somefile.py:
...code of somefile.py...
:somefile.py
someotherfile.py:
...code of someotherfile.py...
:someotherfile.py
assets/css/somestyles.css:
...code of somestyles.css...
:assets/css/somestyles.css
etc
Then llmcode.py sends it to an LLM, parses the output, and writes the files back to disk. I then look at the changes via "git diff".
It's quite fascinating. I often only make minor changes before accepting the "pull request" the LLM made. Sometimes I don't have to make any changes at all.
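As a rough illustration, parsing the response back into files could look something like this (a minimal sketch assuming the delimiter format above, not the actual llmcode.py code):

    import re
    from pathlib import Path

    # Match blocks of the form "path:\n<code>\n:path".
    BLOCK_RE = re.compile(
        r"^(?P<name>\S+):\n(?P<body>.*?)\n:(?P=name)\s*$",
        re.DOTALL | re.MULTILINE,
    )

    def write_files(llm_output: str, repo_root: str) -> None:
        for m in BLOCK_RE.finditer(llm_output):
            path = Path(repo_root, m.group("name"))
            path.parent.mkdir(parents=True, exist_ok=True)
            path.write_text(m.group("body") + "\n")

After that, "git diff" shows the LLM's changes as usual.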
I'm personally torn about the future of LLMs in this regard. Right now, even with Copilot, the benefit they give fundamentally depends on the coder who directs them, as you have noted.
What if that's no longer true in a couple of years? How would that even be different from, say, no-code tools or website builders today? In other words, will handwritten code stay valuable?
I personally enjoy coding so I can always keep doing it for entertainment, even if I am vastly surpassed by the machine eventually.
This might sound silly, but I feel like it has the potential of resulting in more readable code.
There have been times when I split up a 300-line function just so it's easier to feed into an LLM. Same for extracting things into smaller files and classes that individually do more limited things, so they're easier to change.
There have been times when I paid more attention to the grouping of code blocks, or even left a few comments along the way explaining the intent, so that LLM autocomplete would work better.
I also pay more attention to naming (which does sometimes end up more Java-like but is clear, even if verbose) and try to make the code simple enough to generate tests with less manual input.
Somehow, as long as you can understand the code yourself, and so can your colleagues (for the most part), a lot of people won't care that much. But when the AI tools stumble and actually start slowing you down instead of speeding you up, and the readability of your code results in a (subjectively) more positive experience, then suddenly it's a no-brainer.
But now, to actually improve my own productivity a lot? I’ll dig in more often, even in messy legacy code. Of course, if some convoluted LoginView breaks due to refactoring gone wrong, that is still my responsibility.
Here is a benchmark comparing it to Repomix, serializing the Next.js project:
time yek
Executed in 5.19 secs fish external
usr time 2.85 secs 54.00 micros 2.85 secs
sys time 6.31 secs 629.00 micros 6.31 secs
time repomix
Executed in 22.24 mins fish external
usr time 21.99 mins 0.18 millis 21.99 mins
sys time 0.23 mins 1.72 millis 0.23 mins
yek is 230x faster than repomix.

I think "the part of it" is key here. For packaging a codebase, I'll select a collection of files using rg/fzf and then concatenate them into a markdown document: # headers for paths, ```filetype fenced blocks for the contents.
The selection of files is key to letting the LLM focus on what is important for the immediate task. I'll also give it the full file list and have the LLM request files as needed.
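A minimal sketch of that concatenation step (the rg/fzf selection is left out, and pack_files is a made-up name):

    from pathlib import Path

    def pack_files(paths, out="context.md"):
        # One "# path" header per file, then a fenced block tagged
        # with the file extension.
        parts = []
        for p in map(Path, paths):
            lang = p.suffix.lstrip(".")
            parts.append(f"# {p}\n```{lang}\n{p.read_text()}\n```\n")
        Path(out).write_text("\n".join(parts))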
The idea would be to make it more token-efficient (and lower accidental perplexity), e.g. by renaming variables, fixing typos, and shortening comments.
It should probably run after a normal formatter like Black.
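One cheap pass in that direction would be stripping comments with Python's tokenize module (just a sketch of the idea; safe renaming would need real scope analysis):

    import io
    import tokenize

    def strip_comments(source: str) -> str:
        tokens = tokenize.generate_tokens(io.StringIO(source).readline)
        kept = [t for t in tokens if t.type != tokenize.COMMENT]
        # untokenize pads the gaps left by removed tokens with spaces
        return tokenize.untokenize(kept)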
[chimeracat](https://github.com/scottvr/chimeracat)
It took the shape it has because it started as a tool to concatenate a library I had been working on into a single ipynb file, so that I didn't need to install the library on the remote Colab; thus the dependency graph was born (as was the ASCII graph plotter 'phart' that it uses). Then, as I realized this could be useful for sharing code with an LLM, I started adding the summarization capabilities, and in some sort of meta-recursive irony, worked with Claude to do so. :-)
I've put a collection of ancillary tools I use to aid in the LLM-pairing process up at https://github.com/scottvr/LLMental
All these other ways seem unnecessarily complicated...
Can you share your function, please?
import openai
from pydantic import BaseModel

class FileListFormat(BaseModel):  # assumed shape of the structured output
    file_list: list[str]

def get_important_files(self, file_tree):
    # file_tree = "api/backend/main.py api.py"
    # Send the prompt to Azure OpenAI for processing
    response = openai.beta.chat.completions.parse(
        model=self.model_name,
        messages=[
            # Initial system prompt
            {"role": "system", "content": (
                "Can you give the list of up to 10 most important file paths "
                "in this file tree to understand the code architecture and "
                "high-level decisions and overall what the repository is "
                "about, to include in the podcast I am creating, as a list. "
                "Do not write any unknown file paths not listed below."
            )},
            {"role": "user", "content": file_tree},
        ],
        response_format=FileListFormat,
    )
    try:
        parsed = response.choices[0].message.parsed
        return parsed.file_list
    except Exception as e:
        print("Error processing file tree:", e)
        return []
1. https://gitpodcast.com - Convert any GitHub repo into a podcast.

Removed it and tried again; this was the result. Is the SHA256 mismatch a security concern?
Edit: Working on a fix here https://github.com/bodo-run/yek/pull/14
You can use the bash installer on macOS for now. You can read the installer file before executing it if you're not sure whether it's safe.
The prioritization mentioned in the README is especially interesting.
Is it purpose-built for code, or would any text (e.g., an Obsidian vault) work?
Note: I understand why code context is important for LLMs. I don't understand what this chunking is or how it helps me get better code context.
Chunking is useful because in chat mode you can feed more than the max context size if you split the input across multiple USER messages.
LLMs pay more attention to the last part of a conversation/message. This is why sorting is very important: the last sentence in a very long prompt matters much more than the first.
Use case: I use this to run an "AI loop" with Deepseek to fix bugs or implement features. The loop steers the LLM by not letting it stray into various rabbit holes; every prompt reiterates what the objective is. By loop I mean: serialize the repo, run the tests, feed the test failure and repo to the LLM, get a diff, apply the diff, and repeat until the objective is achieved.
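Sketched in Python, such a loop might look like this (assumptions: yek is on PATH and prints to stdout, tests run via cargo test, and ask_llm is a callable wrapping the Deepseek API):

    import subprocess

    def serialize_repo() -> str:
        # Assumes yek prints the serialized repo to stdout when piped.
        return subprocess.run(["yek"], capture_output=True, text=True).stdout

    def ai_loop(objective: str, ask_llm, max_iters: int = 10) -> bool:
        for _ in range(max_iters):
            test = subprocess.run(["cargo", "test"], capture_output=True, text=True)
            if test.returncode == 0:
                return True  # objective achieved
            prompt = (
                f"Objective: {objective}\n\n"  # reiterated on every iteration
                f"Test failure:\n{test.stdout}{test.stderr}\n\n"
                f"Repository:\n{serialize_repo()}\n\n"
                "Reply with a unified diff only."
            )
            diff = ask_llm(prompt)  # e.g. a Deepseek chat-completion call
            subprocess.run(["git", "apply"], input=diff, text=True, check=True)
        return False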
> in chat mode you can feed more than the max context size if you split the input across multiple USER messages

Just so you know, this is false. You might be using a system that automatically deletes or summarizes older messages, which would make it feel that way, and would also explain why you feel the sorting is so important. (It is important! But possibly not critically important.)
For future work, you might be interested in seeing how tools like Aider do their "repo serializing" (they call it a repomap), which tries to be more intelligent by only including "important lines" (like function definitions but not bodies).
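A rough illustration of that idea, keeping only definition lines (Aider itself uses tree-sitter plus a ranking step; this only shows the flavor):

    import ast

    def repo_map_entry(path: str) -> str:
        source = open(path).read()
        entries = [f"{path}:"]
        for node in ast.walk(ast.parse(source)):
            if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
                # Keep only the signature line, e.g. "def foo(x, y):"
                entries.append("  " + source.splitlines()[node.lineno - 1].strip())
        return "\n".join(entries)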
(I’ve been using RepoPrompt for this sort of thing lately.)