80 points | by mohsen1 | 9 hours ago
It outputs a file tree of your repo, a list of the dependencies, and a selected list of files you want included in your prompt for the LLM, all in a single XML file. The first time you run it, it generates a .project-context.toml config file in your repo with all your files commented out, and you just uncomment the ones you want written out in full in the context file. I've found this helps when iterating on a specific part of the codebase, while keeping the full file tree gives the LLM the broader context; I always ask the LLM to request more files if needed, as it can see the full list.
The files are not sorted by priority in the output, though. I'm curious what the impact would be and how much room to leave for manual config (you might want a different order depending on the objective of the prompt).
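For illustration, the generated config might look roughly like this (a hypothetical sketch; the actual format may differ):

    # .project-context.toml -- uncomment a file to include its full contents
    [files]
    # "src/main.py" = true
    # "src/utils.py" = true
    "src/parser.py" = true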
The way I currently do this: I wrote a small Python script that I can start with
llmcode.py /path/to/repo
which then offers a simple web interface at localhost:8080 where I can select the files to serialize and describe a task. It then creates a prompt like this:
Look at the code files below and do the following:
{task_description}
Output all files that you need to change in full again,
including your changes. In the same format as I provide
the files below, that means each file starts with
filename: and ends with :filename
Under no circumstances output any other text, no additional
infos, no code formatting chars. Only the code in the
given format.
Here are the files:
somefile.py:
...code of somefile.py...
:somefile.py
someotherfile.py:
...code of someotherfile.py...
:someotherfile.py
assets/css/somestyles.css:
...code of somestyles.css...
:assets/css/somestyles.css
etc
Then llmcode.py sends it to an LLM, parses the output, and writes the files back to disk. I then look at the changes via "git diff".
It's quite fascinating. I often only make minor changes before accepting the "pull request" the LLM made. Sometimes I don't have to make any changes at all.
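As a rough illustration, parsing the response back into files could look something like this (a minimal sketch assuming the delimiter format above, not the actual llmcode.py code):

    import re
    from pathlib import Path

    # Match blocks of the form "path:\n<code>\n:path".
    BLOCK_RE = re.compile(
        r"^(?P<name>\S+):\n(?P<body>.*?)\n:(?P=name)\s*$",
        re.DOTALL | re.MULTILINE,
    )

    def write_files(llm_output: str, repo_root: str) -> None:
        for m in BLOCK_RE.finditer(llm_output):
            path = Path(repo_root, m.group("name"))
            path.parent.mkdir(parents=True, exist_ok=True)
            path.write_text(m.group("body") + "\n")

After that, "git diff" shows the LLM's changes as usual.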
I'm personally torn about the future of LLMs in this regard. Right now, even with Copilot, the benefit they give fundamentally depends on the coder who directs them, as you have noted.
What if that's no longer true in a couple of years? How would that even be different from, say, no-code tools or website builders today? In other words, will handwritten code stay valuable?
I personally enjoy coding so I can always keep doing it for entertainment, even if I am vastly surpassed by the machine eventually.
This might sound silly, but I feel like it has the potential of resulting in more readable code.
There have been times when I split up a 300-line function just so it's easier to feed into an LLM. Same for extracting things into smaller files and classes that individually do more limited things, so they're easier to change.
There have been times when I paid more attention to the grouping of code blocks, or even left a few comments along the way explaining the intent, so that LLM autocomplete would work better.
I also pay more attention to naming (which does sometimes end up more Java-like but is clear, even if verbose) and try to make the code simple enough to generate tests with less manual input.
Somehow, as long as you can understand the code yourself, and so can your colleagues (for the most part), a lot of people won't care that much. But when the AI tools stumble and actually start slowing you down instead of speeding you up, and the readability of your code results in a (subjectively) more positive experience, then suddenly it's a no-brainer.
But now, to actually improve my own productivity a lot? I’ll dig in more often, even in messy legacy code. Of course, if some convoluted LoginView breaks due to refactoring gone wrong, that is still my responsibility.
Here is a benchmark comparing it to Repomix, serializing the Next.js project:
time yek
Executed in 5.19 secs fish external
usr time 2.85 secs 54.00 micros 2.85 secs
sys time 6.31 secs 629.00 micros 6.31 secs
time repomix
Executed in 22.24 mins fish external
usr time 21.99 mins 0.18 millis 21.99 mins
sys time 0.23 mins 1.72 millis 0.23 mins
yek is 230x faster than repomix.

I think "the part of it" is key here. For packaging a codebase, I'll select a collection of files using rg/fzf and then concatenate them into a markdown document: # headers for paths, ```filetype fenced blocks for the contents.
The selection of files is key to letting the LLM focus on what is important for the immediate task. I'll also give it the full file list and have the LLM request files as needed.
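A minimal sketch of that concatenation step (the rg/fzf selection is left out, and pack_files is a made-up name):

    from pathlib import Path

    def pack_files(paths, out="context.md"):
        # One "# path" header per file, then a fenced block tagged
        # with the file extension.
        parts = []
        for p in map(Path, paths):
            lang = p.suffix.lstrip(".")
            parts.append(f"# {p}\n```{lang}\n{p.read_text()}\n```\n")
        Path(out).write_text("\n".join(parts))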
The idea would be to make it more token-efficient (and lower accidental perplexity), e.g. by renaming variables, fixing typos, and shortening comments.
It should probably run after a normal formatter like Black.
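One cheap pass in that direction would be stripping comments with Python's tokenize module (just a sketch of the idea; safe renaming would need real scope analysis):

    import io
    import tokenize

    def strip_comments(source: str) -> str:
        tokens = tokenize.generate_tokens(io.StringIO(source).readline)
        kept = [t for t in tokens if t.type != tokenize.COMMENT]
        # untokenize pads the gaps left by removed tokens with spaces
        return tokenize.untokenize(kept)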
[chimeracat](https://github.com/scottvr/chimeracat)
It took the shape it has because it started as a tool to concatenate a library I had been working on into a single ipynb file, so that I didn't need to install the library on the remote Colab; thus the dependency graph was born (as was the ASCII graph plotter 'phart' that it uses). Then, as I realized this could be useful for sharing code with an LLM, I started adding the summarization capabilities, and in some sort of meta-recursive irony, worked with Claude to do so. :-)
I've put a collection of ancillary tools I use to aid in the LLM-pairing process up at https://github.com/scottvr/LLMental
All these other ways seem unnecessarily complicated...
Can you share your function, please?
import openai
from pydantic import BaseModel

class FileListFormat(BaseModel):  # assumed shape of the structured output
    file_list: list[str]

def get_important_files(self, file_tree):
    # file_tree = "api/backend/main.py api.py"
    # Send the prompt to Azure OpenAI for processing
    response = openai.beta.chat.completions.parse(
        model=self.model_name,
        messages=[
            # Initial system prompt
            {"role": "system", "content": (
                "Can you give the list of up to 10 most important file paths "
                "in this file tree to understand the code architecture and "
                "high-level decisions and overall what the repository is "
                "about, to include in the podcast I am creating, as a list. "
                "Do not write any unknown file paths not listed below."
            )},
            {"role": "user", "content": file_tree},
        ],
        response_format=FileListFormat,
    )
    try:
        parsed = response.choices[0].message.parsed
        return parsed.file_list
    except Exception as e:
        print("Error processing file tree:", e)
        return []
1. https://gitpodcast.com - Convert any GitHub repo into a podcast.

Removed it and tried again; this was the result. Is the SHA256 mismatch a security concern?
Edit: Working on a fix here https://github.com/bodo-run/yek/pull/14
You can use the bash installer on macOS for now. You can read the installer file before executing it if you're not sure whether it's safe.
The prioritization mentioned in the README is especially interesting.
Is it purpose-built for code, or would any text (e.g., an Obsidian vault) work?
Note: I understand why code context is important for LLMs. I don't understand what this chunking is or how it helps me get better code context.
Chunking is useful because in chat mode you can feed more than the max context size if you split the input across multiple USER messages.
LLMs pay more attention to the last part of a conversation/message. This is why sorting is very important: the last sentence in a very long prompt matters much more than the first.
Use case: I use this to run an "AI loop" with Deepseek to fix bugs or implement features. The loop steers the LLM by not letting it stray into various rabbit holes; every prompt reiterates what the objective is. By loop I mean: serialize the repo, run the tests, feed the test failure and repo to the LLM, get a diff, apply the diff, and repeat until the objective is achieved.
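Sketched in Python, such a loop might look like this (assumptions: yek is on PATH and prints to stdout, tests run via cargo test, and ask_llm is a callable wrapping the Deepseek API):

    import subprocess

    def serialize_repo() -> str:
        # Assumes yek prints the serialized repo to stdout when piped.
        return subprocess.run(["yek"], capture_output=True, text=True).stdout

    def ai_loop(objective: str, ask_llm, max_iters: int = 10) -> bool:
        for _ in range(max_iters):
            test = subprocess.run(["cargo", "test"], capture_output=True, text=True)
            if test.returncode == 0:
                return True  # objective achieved
            prompt = (
                f"Objective: {objective}\n\n"  # reiterated on every iteration
                f"Test failure:\n{test.stdout}{test.stderr}\n\n"
                f"Repository:\n{serialize_repo()}\n\n"
                "Reply with a unified diff only."
            )
            diff = ask_llm(prompt)  # e.g. a Deepseek chat-completion call
            subprocess.run(["git", "apply"], input=diff, text=True, check=True)
        return False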
> in chat mode you can feed more than the max context size if you split the input across multiple USER messages

Just so you know, this is false. You might be using a system that automatically deletes or summarizes older messages, which would make it feel that way, and would also explain why you feel the sorting is so important. (It is important! But possibly not critically important.)
For future work, you might be interested in seeing how tools like Aider do their "repo serializing" (they call it a repomap), which tries to be more intelligent by only including "important lines" (like function definitions but not bodies).
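A rough illustration of that idea, keeping only definition lines (Aider itself uses tree-sitter plus a ranking step; this only shows the flavor):

    import ast

    def repo_map_entry(path: str) -> str:
        source = open(path).read()
        entries = [f"{path}:"]
        for node in ast.walk(ast.parse(source)):
            if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
                # Keep only the signature line, e.g. "def foo(x, y):"
                entries.append("  " + source.splitlines()[node.lineno - 1].strip())
        return "\n".join(entries)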
(I’ve been using RepoPrompt for this sort of thing lately.)