89 points | by S0y | 11 hours ago
1. We take a lot of care to make sure the AI recommendations are safe and have a high quality bar (regular monitoring, code provenance tracking, adversarial testing, and more).
2. We also run regular A/B tests and randomized controlled trials to ensure these features are improving SWE productivity and throughput.
3. We see similar efficiencies across all programming languages and frameworks used internally at Google, and engineers across all tenure and experience cohorts show similar gains in productivity.
You can read more on our approach here:
https://research.google/blog/ai-in-software-engineering-at-g...
Will AI be able to detect bugs and back doors that require multiple pieces of code working together rather than being in a single piece of code? Humans have a hard time with this.
- Hypothetical example: an authentication bug in sshd that requires a flaw in systemd, which in turn requires a flaw in udev or nss or PAM or some underlying library ... but looking at each individual library or daemon, there are no bugs that a professional penetration-testing organization such as NCC Group or Google's Project Zero would find. In other words, will AI soon be able to find more complex bugs in a year than Tavis has found in his career? Will AIs start to compete with one another, find all the state-sponsored complex bugs, and ultimately be able to create a map suggesting a common set of developers who may need to be notified? Will there be a table that logs where AI found things that professional human penetration testers could not?
For teams, you can measure meaningful outcomes and improve team metrics.
You shouldn't really compare teams against each other, but it is possible if you know what the teams are actually doing.
If you're a disconnected manager who thinks he can make decisions or drive improvements by reducing things to single numbers - yeah, that's not possible.
How? Which metrics?
None of this works to evaluate individuals or even teams. But it can be effective at evaluating tools.
edit: typo
I don't think this is a bad thing - if this can be accompanied by an increase in software quality, which is possible. Right now it's very hit and miss, and everyone has examples of LLMs producing buggy or ridiculous code. But once the tooling improves to:
1. align produced code better to existing patterns and architecture
2. fix the feedback loop - with TDD, other LLM agents reviewing code, feeding in compile errors, letting other LLM agents interact with the produced code, etc.
Then we will definitely start seeing more and more code produced by LLMs. Don't look at the state of the art now, look at the direction of travel.
That’s a huge “if”, and by your own admission not what’s happening now.
> other LLM agents reviewing code, feeding in compile errors, letting other LLM agents interact with the produced code, etc.
What a stupid future. Machines which make errors being “corrected” by machines which make errors in a death spiral. An unbelievable waste of figurative and literal energy.
> Then we will definitely start seeing more and more code produced by LLMs.
We’re already there. And there’s a lot of bad code being pumped out. Which will in turn be fed back to the LLMs.
> Don't look at the state of the art now, look at the direction of travel.
That’s what leads to the eternal “in five years” which eventually sinks everyone’s trust.
I’m far from impressed with the output of GPT/Claude; all they’ve done is weight against Stack Overflow - which is still low-quality code relative to Google's.
What is the probability Google makes this a real product, or is it too likely to autocomplete trade secrets?
Most of the code must be what could be snippets (opening files and handling errors with absl::, and moving data from proto to proto). One thing that doesn't help here is that, when writing for many engineers on different teams to read, most teams seem to prefer spelling out simple code rather than depending on too many abstractions.
I guess that LLMs do provide smarter snippets that I don't need to fill out in detail, and when the tool understands types and whether things compile, it gets quite good and "smart" at writing down boilerplate.
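As a rough illustration of the kind of spelled-out code being described - not Google's internal API; UserRecord, UserSummary, and LoadUser are made-up stand-ins - here is a minimal sketch of absl::Status error handling plus field-by-field copying between message types:

```cpp
// Sketch of "spelled-out" boilerplate: handle an error with absl::Status,
// then copy fields between two message types. The structs stand in for
// proto-generated classes; LoadUser is a hypothetical lookup that can fail.
#include <iostream>
#include <string>

#include "absl/status/status.h"
#include "absl/status/statusor.h"
#include "absl/strings/str_cat.h"

struct UserRecord {
  std::string id;
  std::string email;
};

struct UserSummary {
  std::string id;
  std::string display;
};

absl::StatusOr<UserRecord> LoadUser(const std::string& id) {
  if (id.empty()) return absl::InvalidArgumentError("empty user id");
  return UserRecord{id, absl::StrCat(id, "@example.com")};
}

absl::StatusOr<UserSummary> SummarizeUser(const std::string& id) {
  absl::StatusOr<UserRecord> record = LoadUser(id);
  if (!record.ok()) {
    // The usual spelled-out error handling: attach context, bubble it up.
    return absl::Status(
        record.status().code(),
        absl::StrCat("loading user ", id, ": ", record.status().message()));
  }
  // "Moving data from proto to proto": field-by-field copying.
  UserSummary summary;
  summary.id = record->id;
  summary.display = absl::StrCat(record->email, " (", record->id, ")");
  return summary;
}

int main() {
  auto summary = SummarizeUser("s0y");
  if (summary.ok()) std::cout << summary->display << "\n";
}
```

This is exactly the sort of low-entropy filler an autocomplete model can finish from a few keystrokes.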
When someone who uses a product says it, there's a 50% chance of it being true; but when someone far removed from the users says it, it's 100% product promotion and a setup for building trust for a future sale.
Although, if we were to ignore all this for a second, you could also make similar estimates with, e.g., gzip: the higher the compression ratio attained, the more "verbose"/"fluffy" the code is.
Fun tangent: there are a lot of researchers who believe that compression and intelligence are equivalent or at least very tightly linked.
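A toy sketch of that estimate (zlib's DEFLATE standing in for gzip; the sample inputs are made up): compress a blob of source and compare sizes - repetitive boilerplate compresses far better than dense logic.

```cpp
// Toy "verbosity" estimate: ratio of raw size to DEFLATE-compressed size.
// Higher ratio = more redundancy/boilerplate. Build with: -lz
#include <zlib.h>

#include <iostream>
#include <string>
#include <vector>

double CompressionRatio(const std::string& source) {
  uLongf dest_len = compressBound(source.size());
  std::vector<Bytef> dest(dest_len);
  int rc = compress2(dest.data(), &dest_len,
                     reinterpret_cast<const Bytef*>(source.data()),
                     source.size(), Z_BEST_COMPRESSION);
  if (rc != Z_OK || dest_len == 0) return 0.0;
  return static_cast<double>(source.size()) / static_cast<double>(dest_len);
}

int main() {
  // Repetitive boilerplate vs. a dense expression (both made-up samples).
  std::string boilerplate;
  for (int i = 0; i < 50; ++i) {
    boilerplate += "if (!status.ok()) { return status; }\n";
  }
  std::string dense = "int f(int a,int b){return a^b*(a%7)+b/3;}";
  std::cout << "boilerplate ratio: " << CompressionRatio(boilerplate) << "\n";
  std::cout << "dense ratio:       " << CompressionRatio(dense) << "\n";
}
```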
I'm not sure though. If it's copied a bunch of times, and it actually doesn't matter because each use case of the copying is linearly independent, does it matter that it was copied?
Over time, you'd still see the copies being changed independently, which would show up as increased entropy.
I’m aware that the difference is that AI-generated code can be read and modified by humans. But that quantity is a problem precisely because humans still have to understand it all in order to read or modify it.
What’s the point of shorter code if you can’t trust it to do what it’s supposed to?
I’ll take 20 lines of code that do what they should consistently over 1 line that may or may not do the task depending on the direction of the wind.
I understand more code as meaning more edge cases.
Fun story: today I had an LLM write me a non-trivial Perl one-liner. It tried to be verbose, but I insisted and it gave me one tight line.
These statements are brilliant.
Alphabet ($GOOG) 2024 Q3 earnings release
I'm not sure this stat is as important as people make it out to be. If I start off with `for` and the AI auto-completes `for(int i=0; i<args.length; i++) {`, then a lot more than 25% of the code is AI-written, but it's also not significant. I could've figured out how to write the for loop, and it's not a meaningful amount of time saved, because most of the time goes into figuring things out and testing, which the AI doesn't do.
More lines == more shit to maintain. Complex lines == the shit is unmanageable.
But Wall Street investors love simplistic narratives such as More X == More revenue. So here we are. Pretty clever marketing, imo.
Not that I'm really discounting the value of AI here. For example, I've found a ton of value and saved time getting AI to write CDKTF (basically, Terraform in TypeScript) config scripts for me. I don't write Terraform that often, there are a ton of options I always forget, etc. So asking ChatGPT to write a Terraform config for, say, a new scheduled task saves me a lot of manual lookup.
But at the same time, the AI isn't really writing the complicated logic pieces for me. I think that comes down to the fact that when I do need to write complicated logic, I'm a decent enough programmer that it's probably faster for me to write it out in a high-level programming language than write it in English first.
IMO it's only really an issue if a competent human wasn't involved in the process - basically a person who could have written it if needed, who then does the work of connecting it to the useful stuff, and who has appropriate QA/testing in place... the latter often taking far more effort than the actual writing-the-code time itself, even when a human does it.
That said, I've seen even higher ratios. But never in any place that survived for long.
Not the developer who has written the same effective stanza 10 times before.
Architecturally, it sounds like components map somewhere close to 1:1 to teams, rather than teams hacking components to be more tightly coupled to each other because they share ownership.
I'd see too much boilerplate as an organization/management issue rather than a code architecture issue.
Combine that with generic functions, framework boilerplate, OS/browser stuff, or explicit x-y-z code, and your 'boilerplate' (i.e. repetitive, easily reproducible code) easily gets to 25% of the code your programmers write every month. If your job is >75% pure human-cognition problem solving, you're probably in a higher tier of jobs than the vast majority of programmers on the planet.
Everything is getting forced into a scalable, general-purpose shape, to the point that most apps have to add a ridiculous amount of boilerplate.
With g3's immense amount of context, LLMs can vastly help you discover how other people are using existing libraries.
In regards to how others are using libraries, that's where the technology will excel - rewriting code. Once it has a stable AST to work with, the mathematical equation it is solving is a refactor.
Until it has an AST that solves the business need, the game is just prompt spaghetti until it hits the altitude to be able to refactor.
Google could be writing the same amount of code with fewer developers (they have had multiple layoffs lately), or their developers could be focusing more of their time and attention on the code they do write.
> New tool bypasses Google Chrome’s new cookie encryption system
That may explain why Google search has, in the past couple of months, become so unusable for me that I switched (happily) to Kagi.
- You know which 20K lines need changing
- You have perfect QA
- Nothing ever goes wrong in deployment.
I think there's a tendency in our industry to only take the tangent of curves at the steepest point.
It's like companies paying for all those todo-list and tutorial apps left running on AWS EC2 instances circa 2007.
I'd be worried if I were a Google investor, lol.