84 comments

  • modeless2 周前
    https://aistudio.google.com/live is by far the coolest thing here. You can just go there and share your screen or camera and have a running live voice conversation with Gemini about anything you're looking at. As much as you want, for free.

    I just tried having it teach me how to use Blender. It seems like it could actually be super helpful for beginners, as it has decent knowledge of the toolbars and keyboard shortcuts and can give you advice based on what it sees you doing on your screen. It also watched me play Indiana Jones and the Great Circle, and it successfully identified some of the characters and told me some information about them.

    You can enable "Grounding" in the sidebar to let it use Google Search even in voice mode. The video streaming and integrated search make it far more useful than ChatGPT Advanced Voice mode is currently.

    • I got so hopeful from your comment that I showed it the bug I'm currently working on. I even prepared everything first: my github issue, the relevant code, the terminal with the failing tests. I pasted in the full contents of the file and explained carefully what I wanted to achieve. As I was doing this, it repeated back to me everything I said, saying things like "if I understand correctly you're showing me a file called foo dot see see pee" and "I see you have a github issue open called extraneous spaces in frobnicator issue number sixty six" and "I see you have shared some extensive code", and after some more of this "validation"-speak it started repeating the full contents of the file, like "backquote backquote backquote forward slash star .. import ess tee dee colon colon .." and so on.

      Not quite up to excited junior-level programmer standards yet. But maybe good for other things who knows.

      • You just rediscovered that LLMs become much stupider when using images as input. I think that has already been shown for GPT-4 as well.
        • dtquad2 周前
          Even when using the web search tool GPT-4 becomes stupider.

          Do they use a dumber model for tool/vision?

          • dwaltrip2 周前
            I’m guessing that it’s just a much harder problem. Images often contain more information, but that information is far less structured and refined than language.

            The transformation process that occurs when people speak and write is incredibly rich and complex. Images, by comparison, are essentially just the outputs of cameras or screen captures; there isn’t an “intelligent” transformation process occurring.

            • justsid2 周前
              I think images also have much higher information density than words, or at least they can. There is a reason a picture is worth 1000 words.
        • tbruckner2 周前
          This is news to me. Any good examples of this outside of the above?
          • Vision language models are blind (192 comments) https://news.ycombinator.com/item?id=40926734
            • fnordpiglet2 周前
              On the other hand, if I take pictures of circuits, boards, electronic components, etc., GPT-4o is pretty reliably able to explain to me the pinouts, the board layouts, and the reference material in the data sheets, and to provide pretty reasonable advice about how to use it (i.e., where to put resistors and why, what pins to use for the component on an ESP32, etc.). As a beginner in electronics this is fabulously helpful. Its ability to pass vision tests seems like a pretty dumb utility metric when most people judge utility by how useful things are.
      • zamadatix2 周前
        > foo dot see see pee

        Well there's your problem!

        • lovasoa2 周前
          ccp is the Chinese version of c++. Or maybe they meant ссср, the Soviet version.
        • The ccp extension is just another c++ flavour /i
      • jug2 周前
        Not sure this is an AI limitation. I think you'd be better off here with the Gemini Code Assist plugin in VS Code. It sounds like the AI is being given unstructured information compared to an actual code base.
    • sky22242 周前
      THIS is the thing I'm excited for with AI.

      I'm someone that becomes about 5x more productive when I have a person watching or just checking in on me (even if they're just hovering there).

      Having AI to basically be that "parent" to kick me into gear would be so helpful. 90% of the time my problems are because I need someone to help keep the gears turning for me, but there isn't always someone available. This has the potential to be a person that's always available.

      • Huppie2 周前
        Just as an FYI: I recently learned (here on HN) that this is called Body Doubling[0]. There are some services around (at least one by someone who hangs around here) that can do this too.

        [0] https://en.m.wikipedia.org/wiki/Body_doubling

        • Also, there are co-working spaces in VRChat, which works wonders for me.

          I went to the Glass Office co-working space to study for exams this summer and it worked out really well. I also met some nice people there.

          A standalone Quest 3 is enough to get you started.

        • Do they support WFH setups?
          • The parent might be referring to us: https://workmode.net/. Most of our clients work from home. Do you have a specific concern about body doubling and working from home?
            • biztos1 周前
              That’s interesting, I never considered something like that.

              But at that low price, surely you have a bunch of customers being watched by each employee, and then talking to only one at a time — isn’t it distracting to see your “double” chatting away with the sound off?

              • No, nobody has ever complained about it (and yes, we did ask). When we first started, we were really concerned about it, so we tried to move as little as possible, avoid hand gestures, and so on. However, it turned out to be a non-issue.

                Fun fact: I’d estimate that 50% of users don’t even look at their Productivity Partner while they work. WorkMode runs in another tab, and users rarely switch back to it. They don’t need to see us - they just need to know we’re watching. I’m in that group.

            • kirubakaran2 周前
              Some unsolicited feedback (please feel free to ignore):

              When I click on "Pricing" in the nav bar, it scrolls down, and the first thing that catches my eye is "$2100 / month". This time I happened to notice that that figure is the benefit you're projecting, and the actual price is $2.50/hour. On previous visits to your website from your HN comments, I always thought $2100/month was what you were going to charge me, and I closed the tab.

              I've been frustrated myself that people don't read what's right there on the page when they come to my startup app's landing page. Turns out I do the same. Hope this helps you improve the layout / font sizes and such "information hierarchy" so the correct information is conveyed at a glance.

              IMHO $2.50/hour is great value, and stands on its own. I know how much my time is worth, so perhaps the page doesn't really have to shout that to convince me.

              Again, please feel free to ignore this as it is quite possible that it is just me with the attention span of a goldfish with CTE while clicking around new websites.

              • Thank you! I hadn’t thought of it that way, but what you wrote makes total sense and explains the engagement issues we’re seeing with the calculator and the pricing section.

                > Again, please feel free to ignore this as it is quite possible that it is just me with the attention span of a goldfish with CTE while clicking around new websites.

                Most of our clients have issues with attention span, so your feedback is gold :-) Again, thank you!

                • kirubakaran1 周前
                  Welcome! btw this is how it looked: https://i.imgur.com/qg8gNJF.png

                  I understand that if the window were taller, I'd have seen the actual price cards. I think it's just that when you click "Pricing", you expect the next obvious number you see to be the price.

            • 11235813211 周前
              Clever service! I assume your employees watch several people at once. Is it engaging enough work for them?
              • Yes, they monitor several people simultaneously. Most clients ask us to check in on their progress every 15–30 minutes, and these interactions can last anywhere from a few seconds to three minutes, depending on the client and the challenges they're facing. It might be boring when working with a single person, but it gets more challenging as more people connect.

                Also, we do more than just body doubling. Some clients need to follow a morning ritual before starting their work (think meditation, a quick house cleanup, etc.). Sometimes, we perform sanity checks on their to-do lists (people often create tasks that are too vague or vastly underestimate the time needed to complete them). We ask them to apply the 2-minute rule, and so on. It all depends on the client's needs.

      • Interesting! I see how this could work for inattentive procrastinators. By "inattentive procrastinators", I mean people who are easily distracted and forget that they need to work on their tasks. Once reminded, they return to their tasks without much fuss.

        However, I doubt it would work for hedonistic procrastinators. When body doubling, hedonistic procrastinators rely on social pressure to be productive. Using AI likely won't work unless the person perceives the AI as a human.

        • losvedir2 周前
          You don't necessarily need to believe the AI is a human for it to tickle the ingrained social instincts you're looking for. For example, I'm quite aware that AI's are just tools, and yet I still feel a strong need to be "polite" in my requests to ChatGPT. "Please do ...." or "Can you...?" and even "Thanks, that worked! Now can you..." etc.
          • I do the same, but I think it's because we were taught to be polite and to conduct conversations in a certain way.

            Do you put effort into being polite when ChatGPT makes a mistake and you correct it? Do you try to soften the blow to avoid hurting its "feelings"? Do you feel bad if you respond impolitely? I don't.

          • You only do the politeness thing as a novice.

            My questions to copilot.ms.com today are more like the following, and it still works like a charm...

            "I have cpp code: <enter><code snippet><enter> and i get error <piece of compilation output>. Is this wrong smart ponitor?"

            [elaborate answer with nice examples]

            "Works. <Next question>"

          • ishtanbul2 周前
            I don't feel this at all. I treat ChatGPT like an investment banking intern.
      • player12341 周前
        So why not fire 3 of your colleagues and have another one whose new job is watching over/checking in on you? By your own account, productivity would be about the same. Save your company some money; it will be appreciated!

        On an unrelated note, I believe people need to start quantifying their outrageous AI productivity claims or shut up.

      • Jeff_Brown2 周前
        I'm intrigued to know whether that actually ends up working. I am something like that myself, but I don't know whether it is an effect of getting feedback or of having a person behind the feedback.
        • sky22242 周前
          There's definitely an ideal setup that's needed in order for it to work. I'm also not quite sure what part of the other person being present causes me to focus better (i.e., whether it's the presence vs good ideas and feedback).

          I'm leaning toward saying that the main issue for me is that I need to keep my focus on things that involve active engagement rather than passive engagement; taking notes versus just reading a passage, for example.

      • mycall2 周前
        Your "parent" kicked you into gear because you have an emotional bond with them. A stranger might cause your guards to go up if you do not respect them as with wisdom. So too may go an AI.
        • sky22242 周前
          I used the term "parent" here because it was the descriptor I thought people would understand best.

          For me personally, I was awful at working when my parents were hovering over me.

          I used to work with a professor on a project, and we'd spend significant amounts of time working on Zoom calls (this was during COVID). The professor wouldn't even be helping me the entire time, but as soon as I was blocked, I'd start talking, the ideas would bounce back and forth, and I'd find a solution significantly quicker.

      • Shameless plug, I'm working on something like this https://myaipal.kit.com/prerelease
        • sky22241 周前
          So I watched the demo video on your site, and honestly I'm not sure how this is really all that much better than what can already be done with ChatGPT.

          The key is, I don't want to have to initiate the contact. Hand holding the AI myself defeats the purpose. The ideal AI assistant is one that behaves as if it's a person that's sitting next to me.

          Imagine you're a junior who gets on a Teams call to get help via pair programming with your boss. For anything more than just a quick fix, pair programming on calls tends to turn into the junior working on something, hitting a roadblock, and the boss stepping in to provide input.

          Here's the really important part that I've realized: very rarely will the input that the boss provides be something that is leaps and bounds outside of the ability of the junior. A lot of it will just be asking questions or talking the problem through until it turns the gears enough for the junior to continue on their own. THAT right there. That's the gear turning AI agent I'm looking for.

          If someone could develop a tool that "knows" the right time to jump in and talk with you, then I think we'd see huge jumps in productivity for people.

    • At least you can theoretically stop sharing with this one. Microsoft was essentially trying to do this, but doing it for everything on your PC, with zero transparency.

      Here's Google doing essentially the same thing (even more so, in that it's explicitly shipping your activity to the cloud), and yet the response is so different from the "we're sticking this on your machine and you can't turn it off" version Microsoft was attempting to land. This is what Microsoft should have done.

    • chefandy2 周前
      This is great! I viscerally dislike the "we're going to do art for you so you don't have to... even if you want to..." side of AI, but learning to use the tools to get the satisfaction of making it yourself is not easy! After 2 decades of working with 2D art and code separately, learning 3D stuff (if you include things like the complex and counterintuitive data flow of simulations in Houdini and the like) was as difficult as learning to code, or more so. Beyond that, taking classes is f'ing expensive, and more of that money goes to educational institutions than to the teachers themselves. Obviously, getting beyond the basics for things that require experienced critique is just going to need human understanding, but for the base technical stuff, this is fantastic.
    • cryptozeus2 周前
      This comment is better than the entire ad Google just showed. Who still points a camera at a building and asks what the building is?
      • kridsdale12 周前
        I do that in Manhattan. I also do it for yonder mountains.
    • Brotkrumen2 周前
      Sounds interesting, but voice input isn't working for me there. I guess I'm too niche with my Mac and Firefox setup.
      • mentalgear2 周前
        Actually, plenty of tech people are using Mac & Firefox.
        • portaouflop2 周前
          Irony detectors malfunctioning perhaps?
          • baq2 周前
            'irony' meant 'something made of metal' last time I checked
            • socksy2 周前
              Right, and Macs are made out of aluminium
      • This isn't entirely surprising, as Google has been artificially breaking things on Firefox for years now (Google Maps and YouTube at least). Maybe try spoofing Chrome's user agent.
      • SkyPuncher2 周前
        Console is throwing an error: "Connecting AudioNodes from AudioContexts with different sample-rate is currently not supported."

        Quick research suggests this is part of Firefox's anti-fingerprinting functionality.

    • icelancer2 周前
      I tried this, shared a terminal, asked it to talk about it, and it guessed that it was Google Chrome with some webUI stuff. Immediately closed the window and bailed.
      • kridsdale12 周前
        Which terminal? Was it chromium based?
        • icelancer2 周前
          Nope. Just KiTTY on Windows.
    • selvan2 周前
      Get started documentation on Multimodal Live API : https://ai.google.dev/api/multimodal-live
    • Zababa2 周前
      I don't know what's not working but I get "Has a large language model. I don't have the capability to see your screen or any other visual input. My interactions are purely based on the text that you provide"
    • moffkalast2 周前
      This'll be so fantastic once local models can do it, because nobody in their right mind would stream their voice and everything they do on their machine to Google, right? Right?

      Oh who am I kidding, people upload literally everything to drive lmao.

  • simonw2 周前
    I released a new llm-gemini plugin with support for the Gemini 2.0 Flash model, here's how to use that in the terminal:

        llm install -U llm-gemini
        llm -m gemini-2.0-flash-exp 'prompt goes here'
    
    LLM installation: https://llm.datasette.io/en/stable/setup.html

    Worth noting that the Gemini models have the ability to write and then execute Python code. I tried that like this:

        llm -m gemini-2.0-flash-exp -o code_execution 1 \
          'write and execute python to generate a 80x40 ascii art fractal'
    
    Here's the result: https://gist.github.com/simonw/0d8225d62e8d87ce843fde471d143...

    It can't make outbound network calls though, so this fails:

        llm -m gemini-2.0-flash-exp  -o code_execution 1 \
          'write python code to retrieve https://simonwillison.net/ and use a regex to extract the title, run that code'
    
    Amusingly Gemini itself doesn't know that it can't make network calls, so it tries several different approaches before giving up: https://gist.github.com/simonw/2ccfdc68290b5ced24e5e0909563c...

    The new model seems very good at vision:

        llm -m gemini-2.0-flash-exp describe -a https://static.simonwillison.net/static/2024/pelicans.jpg
    
    I got back a solid description, see here: https://gist.github.com/simonw/32172b6f8bcf8e55e489f10979f8f...
    • simonw2 周前
      Published some more detailed notes on my explorations of Gemini 2.0 here https://simonwillison.net/2024/Dec/11/gemini-2/
    • pcwelder2 周前
      Code execution is okay, but soon runs into the problem of missing packages that it can't install.

      Practically, sandboxing hasn't been super important for me. Running Claude with MCP-based shell access has been working fine, as long as you instruct it to use a venv, a temporary directory, etc.

    • bravura2 周前
      Question: Have you tried using this for video?

      Alternately, if I wanted to pipe a bunch of screencaps into it and get one grand response, how would I do that?

      e.g. "Does the user perform a thumbs up gesture in any of these stills?"

      [edit: also, do you know the vision pricing? I couldn't find it easily]

      • simonw2 周前
        Previous Gemini models worked really well for video, and this one can even handle streaming video: https://simonwillison.net/2024/Dec/11/gemini-2/#the-streamin...
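
        For the "bunch of screencaps" case: the llm CLI lets you repeat -a to attach multiple images to a single prompt, so something along these lines should work (a sketch; the frame filenames are just placeholders):

            llm -m gemini-2.0-flash-exp \
              'Does the user perform a thumbs up gesture in any of these stills?' \
              -a frame1.png -a frame2.png -a frame3.png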
        • bravura2 周前
          Wow this is amazing. It just gave me critique on my bodyweight squat form.

          But I also found it hard to prompt it to tutor me in French or Portuguese; the accents were gruesomely bad.

          • og_kalu2 周前
            For some reason, the realtime API is using TTS for speech output. Not sure if that's temporary.
        • andy_ppp2 周前
          I can't wait until these models are able to do video <-> video, just so that I can be in ten standup meetings at once :^)
    • rafram2 周前
      > Some pelicans have white on their heads, suggesting that some of them are older birds.

      Interesting theory!

      • smackay2 周前
        Brown Pelican (Pelecanus occidentalis) heads are white in the breeding season. Birds start breeding aged three to five. So technically the statement is correct but I wonder if Gemini didn't get its pelicans and cormorants in a muddle. The mainland European Great Cormorant (Phalacrocorax carbo sinensis) has a head that gets progressively whiter as birds age.
  • crowcroft2 周前
    Big companies can be slow to pivot, and Google has been famously bad at getting people aligned and driving in one direction.

    But, once they do get moving in the right direction, they can achieve things that smaller companies can't. Google has an insane amount of talent in this space, and seems to be getting the right results from that now.

    It remains to be seen how well they will be able to productize and market, but it's hard to deny that their LLM models are really, really good.

    • StableAlkyne2 周前
      > Remains to be seen how well they will be able to productize and market

      The challenge is trust.

      Google is one of the leaders in AI and is home to incredibly talented developers. But they also have an incredibly bad track record of supporting their products.

      It's hard to justify committing developers and money to a product when there's a good chance you'll just have to pivot again once they get bored. Say what you will about Microsoft, but at least I can rely on their obsession with supporting outdated products.

      • egeozcan2 周前
        > they also have an incredibly bad track record of supporting their products

        Incredibly bad track record of supporting products that don't grow. I'm not saying this to defend Google (I'm still, perhaps unreasonably, angry about Reader); it's just that there is a pattern, and AI isn't likely to fit it for a long while.

        • I’m sad about Reader, but it was a somewhat niche product. Inbox I can’t forgive. It was insanely good and was killed because it was a threat to Gmail.

          My main issue with Google is that internal politics affect users all the time. See the debacle of anything built on top of Android being treated as a second-class citizen.

          You can’t trust a company which can’t shield users from its internal politics. It means nothing is aligned correctly for users to be taken seriously.

          • kanzenryu22 周前
            Google reader was killed because there was only one guy who knew how to support it
            • If anything, this adds weight to the idea that it's understandable people don't trust Google with product longevity.

              Bus factor et al. is literally CS 101.

            • troupo2 周前
              And that is a problem for a company with 20k employees?

              No, Reader was killed because it:

              - was free

              - didn't contribute to revenue growth from ads

              • creesch2 周前
                > And that is a problem for a company with 20k employees?

                It is for a company where the promotion culture rewards new initiatives and products and doesn't reward people maintaining products. Which was most certainly the company culture around the time reader was killed.

          • Would you be willing to pay for something that is essentially Google Inbox, but as a separate web view client with more personalization/better search?
        • The_Colonel2 周前
          > Incredibly bad track record of supporting products that don't grow.

          That's irrelevant to me as a user if I've already invested my time into the product.

          To an extent, all companies do it. Google just does it much more, to the degree that I tend to ignore most of Google's launches because of this uncertainty.

        • makeitdouble2 周前
          > products that don't grow.

          I think we all acknowledge this.

          The question is seldom "why" they kill it (I'd argue ultimately it doesn't matter), it's about how fast and what they offer as a migration path for those who boarded the train.

          That also means the minute Gemini stops looking like a growing product, it's gone from this world, whereas Microsoft-backed alternatives have a fighting chance to get some leeway to recover or pivot.

          • mnau2 周前
            Yeah, MS Azure DevOps is still alive, though stagnant. I thought everyone would be moved to GitHub within a few years after MS acquired GitHub. Yet here we are, 6 years later.
            • aryonoco2 周前
              I can guarantee that DevOps will still be around and functional in 2030. It won't have new features but it will still be supported.

              Microsoft bought FoxBase in 1992. FoxPro never took the world by storm, but it had a dedicated group of devs and ISVs who used it and it solved their needs. The last version was released in 2004, long after Microsoft had released .NET and C# and SQL Server. Microsoft officially ended support for it in 2015.

              Google? If the product doesn't become an instant #1 or #2 in its market and doesn't directly contribute to their bottom line in a way which can be itemised in their earnings call, it's gone in less than 3 years guaranteed.

        • Teever2 周前
          It's more nuanced than that.

          Like how many different instant messengers did they make at the same time only to abandon them all instead of just making one and supporting it?

        • michaelt2 周前
          > Incredibly bad track record of supporting products that don't grow. [...] AI isn't likely to fit that for a long while.

          Have you seen Google Bard anywhere recently? Me neither :)

          • egeozcan1 周前
            As far as I know, they just renamed it to Gemini.

            Now I'm not sure if you are arguing that a name change means not supporting a product, or that Gemini is a different product with a different feature set?

        • msabalau2 周前
          Yeah, either AI is significant, in which case Google isn't going to kill it, or AI is a bubble, in which case any of the alternatives one might pick can easily crash and die long before Google end-of-lifes anything.

          This isn't some minor consumer play, like a random tablet or Stadia. Anyone who has been paying attention would have noticed that AI has been an important, consistent, long-term strategic interest of Google's for a very long time. They've been killing off the failed/minor products to invest in this.

        • esafak2 周前
          Why would they grow if they don't vocally support them? Launch and hope for the best does not work; it's not the wild west on the Internet any more.
        • not going to miss the opportunity to upvote on the grief of having lost Reader
        • nanna2 周前
          Please can we just get over Reader. Please. Yes it was devastating for RSS, but the debacle took place eleven years ago. Enough.
          • aryonoco2 周前
            No, we can't, because it was absolutely a turning point in Google's trajectory.

            After Reader, it was Currents, Google TV, Picasa, Google Now, Spaces, Chromecast Audio, Inbox, GCM, Nest, Fusion Tables, Google Cloud Print, Google Play Music, Google Bookmarks, Chrome Apps, G Suite....

            Reader keeps coming up because after Reader, Google's motto turned into "Do be Evil"

          • egeozcan2 周前
            You are not wrong, but my irrational mind is unlikely to take your advice. Try treating it as a meme, rather than anything belonging to a sane discussion. For me, I don't think I'll ever get over it.

            I'm sorry.

      • TIPSIO2 周前
        Yes. Imagine Google banning your entire Google account / Gmail because you violated their gray area AI terms ([1] or [2]). Or, one of your users did via an app you made using an API key and their models.

        With that being said, I am extremely bullish on Google AI for a long time. I imagine they land at being the best and cheapest for the foreseeable future.

        [1] https://policies.google.com/terms/generative-ai

        [2] https://policies.google.com/terms/generative-ai/use-policy

        • estebarb2 周前
          For me, that is a reason not to touch anything from Google when building stuff. I can afford losing my Amazon account, but losing my Google one would be too much. At the least, they should be clear in their terms that getting banned on Cloud doesn't mean getting banned from Gmail/Docs/Photos...
          • bippihippi12 周前
            why not just make a business / project account?
            • rtsil2 周前
              That won't help. Their TOS and policies are vague enough that they can terminate all accounts you own (under "Use of multiple accounts for abuse" for instance).
              • TIPSIO2 周前
                To be fair, I believe this is reserved for things like fighting fraud.
                • dbdoskey2 周前
                  There have been a few cases of people who had a Google Play app banned where the personal account got banned as well.

                  https://www.xda-developers.com/google-developer-account-ban-...

                • dudeinjapan2 周前
                  I used to be "thepimp@hotmail.com" in the early days of Hotmail, of course I was also a 6th grader (true story). One day they unceremoniously closed my account without any possibility to recover mails.

                  That day I learned an important lesson: pimpin' ain't easy.

                • My buddy lost his Gmail account because of a heart attack followed by a string of events that Google's ‘AI’ considered too risky to allow the account to live.

                  If their fraud AI is wrong, there is no human to talk to.

                • Even if it is warranted on their part, the 1% of false positives will be detrimental to those affected. And we all know there is no way to reach out to them in case the account is automatically flagged.
            • estebarb2 周前
              I asked Gemini about banning risks, and it answered:

              Gemini: Yes, there is a potential risk of your Google account being suspended if your SaaS is used to process inappropriate content, even if you use Gemini to reject the request. While Gemini can help you filter and identify harmful content, it's not a foolproof solution.

              Here are some additional measures you can take to protect your account:

              * Content moderation: Implement a robust content moderation system to filter out inappropriate content before it reaches Gemini. This can include keyword-based filtering, machine learning models, and human review.

              ...

              * Regularly review usage: Monitor your usage of Gemini to identify any suspicious activity.

              * Follow Google's terms of service: Make sure that your use of Gemini complies with Google's terms of service.

              By taking these steps, you can minimize the risk of your account being suspended and ensure that your SaaS is used responsibly.

              ---

              In a follow up question I asked about how to implement robust content moderation and it suggested humans reviewing each message...

              • So a convenient blah blah blah about all the nice things you can do to avoid Google's brainless algorithmic wrath, which may simply not work anyhow, because even when following all the rules in good faith, you can still get banned one day, as has happened to many, many people with zero recourse.
                • estebarb2 周前
                  Yes, exactly. So, this is a huge security gap.

                  As an attacker, instead of DDoSing a service we could just upload a bunch of NSFW text so Google kills their infra for us.

                  Other providers, like OpenAI, at least provide a free moderation API. Google has a moderation API, but after the free 50k requests it is more expensive than Gemini 1.5 Flash (the Moderation API costs $0.0005/100 characters vs. Gemini 1.5 Flash at $0.000001875/100 characters).
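
                  To put those two quoted prices in perspective, a quick sketch (just dividing the per-100-character figures above):

                      # ratio of the quoted per-100-character prices
                      print(0.0005 / 0.000001875)  # ~266.7, i.e. the moderation API costs roughly 267x more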

      • > But they also have an incredibly bad track record of supporting their products.

        I don't know about that: my wife built her first SME on Google Workspace / GSuite / Google Apps for domain (this thing changed names so many times I lost track). She's now running her second company on Google tools, again.

        All she needs is a browser. At one point I switched her from Windows to OS X. Then from OS X to Ubuntu.

        Now I just installed Debian GNU/Linux on her desktop: she fires up a browser and opens up Google's GMail / GSuite / spreadsheets and does everything from there.

        She's been a happy paying customer of Google products for a great many years, and there's actually phone support for paying customers.

        I honestly don't have many bad things to say. It works fine. 2FA is top notch.

        It's a much better experience than being stuck in the Windows "Updating... 35%" "here's an ad on your taskbar" "your computer is now slow for no reason" world.

        I don't think they'll pull the plug on GSuite: it's powering millions and millions of paying SMEs around the world.

      • dotancohen2 周前

          > Google is one of the leaders in AI and are home to incredibly talented developers. But they also have an incredibly bad track record of supporting their products.
        
        This is why we've stayed with Anthropic. Every single person I work with on my current project is sore at Google for discontinuing one product or another - and not a single one of them mentioned Reader.

        We do run some non-customer facing assets in Google Cloud. But the website and API are on AWS.

        • dotancohen1 周前
          Just to drive the point home. I made the parent comment five days ago. Today I searched Google for "google-cloud-skd vs google-cloud-cli", the top result was this page from two years ago:

          https://www.reddit.com/r/googlecloud/comments/wpq0eg/what_is...

          The top comment in that page is:

            > CLI is the new name for the SDK. The reasoning and strategy was explained in great detail in this podcast:
            > https://podcasts.google.com/feed/aHR0cHM6Ly9mZWVkcy5mZWVkYnVybmVyLmNvbS9HY3BQb2RjYXN0/episode/NTI5ZTM5ODAtYjYzOC00ODQxLWI3NDAtODJiMTQyMDMxNThj?ep=14
          
          So I click that link, and I'm greeted with:

            > Google Podcasts is no longer available
            > Listen to podcasts and build your library in the YouTube Music app.
          
          This is why AWS and Anthropic are getting our money. We cannot trust that Google projects will survive for as long as our business needs them.
      • Putting your trust in Google is a fool's errand. I don't know anyone who doesn't have a story.
        • meta_x_ai2 周前
          Google has 4 Billion users. It's delusional to think that you don't know anyone or you live in an incredibly small bubble
          • owlninja2 周前
            Yea the only stories I ever see are ones that bubble up to HN. Often they are very one-sided as well. Not saying it hasn't happened, but let's not pretend it's rampant.
      • fluoridation2 周前
        >Say what you will about Microsoft, but at least I can rely on their obsession with supporting outdated products.

        Eh... I don't know about that. Their tech graveyard isn't as populous as Google's, but it's hardly empty. A few that come to mind: ATL, MFC, Silverlight, UWP.

        • bri3d2 周前
          Besides Silverlight (which was supported all the way until the end of 2021!), you can still not only run but _write new applications_ using all of the listed technologies.
          • fluoridation2 周前
            That doesn't constitute support when it comes to development platforms. They've not received any updates in years or decades. What they've done is simply not remove the build capability from the toolchains; that is, they haven't even done the work that would be required to no longer support them in any way. Compare that to C#, which has evolved rapidly over the same time period.
            • Fidelix2 周前
              That's different from "killing" the product / technology, which is what Google does.
              • fluoridation2 周前
                Only because they operate different businesses. Google is primarily a service provider. They have few software products that are not designed to integrate with their servers. Many of Microsoft's businesses work fundamentally differently. There's nothing Microsoft could do to Windows to disable all MFC applications and only MFC applications, and if there was it would involve more work than simply not doing anything else with MFC.
                • pjmlp2 周前
                  Not only has MFC been recently updated to support HiDPI, it is still the best Microsoft GUI C++ development experience on Visual C++.

                  And even if C++/CX and C++/WinRT aren't that great, with worse tooling than MFC and in maintenance mode, you can easily create an application with them today.

                  Hardly the same can be told of most Google technologies.

                  • fluoridation2 周前
                    Fair enough, I retract MFC as an example.
                • px19992 周前
                  The business model doesn't matter.

                  I can write something with Microsoft tech and expect it with reasonable likelihood to work in 10 years (even their service-based stuff), but can't say the same about anything from Google.

                  That alone stops me/my org buying stuff from Google.

                  • fluoridation2 周前
                    I'm not contending that Microsoft and Google are equivalent in this regard, I'm saying that Microsoft does have a history of releasing technologies and then letting them stagnate.
        • codebolt2 周前
          .NET Framework is the most egregious of the last few years.
          • fluoridation2 周前
            You're absolutely right. I wasn't thinking about it when I wrote the comment, but I really should have included it. I'm still pissed about that, and I don't like how Core releases are deprecated in 1-2 years. For my personal projects that's a breakneck pace.
      • boringg2 周前
        Can I add that they have a bad track record of supporting new products? Gmail, Google Search, and GSuite seem to be well supported.
      • Surface Duo would like to have a word
    • panabee2 周前
      With many research areas converging to comparable levels, the most critical piece is arguably vertical integration and forgoing the Nvidia tax.

      They haven't wielded this advantage as powerfully as possible, but changes here could signal how committed they are to slaying the search cash cow.

      Nadella deservedly earned acclaim for transitioning Microsoft from the Windows era to cloud and mobile.

      It will be far more impressive if Google can defy the odds and conquer the innovator's dilemma with search.

      Regardless, congratulations to Google on an amazing release and pushing the frontiers of innovation.

      • They have to avoid getting blindsided by Sora, while at the same time fighting the cloud war against MS/Amazon.

        Weirdly, Google is THE AI play. If AI is not set to change everything and truly is a hype cycle, then Google's stock withstands and grows. If AI is the real deal, then Google still withstands because of how much bigger the pie will get.

        • AH4oFVbPT4f82 周前
          Why is Google THE AI play as you put it? I don't agree or disagree, just wanting to understand your perspective.
          • Brotkrumen2 周前
            That conversation reads like two consultant LLMs talking past each other.
            • AH4oFVbPT4f82 周前
              I actually wanted to know, real person proof user: AH4oFVbPT4f8 created: August 12, 2013

              I'm going back and forth between the different models seeing which works best for me but I'm trying to learn how to read and use other people's feedback in making their decisions.

        • whimsicalism2 周前
          sora is not a big factor in this
        • kranke1552 周前
          Sora is a toy. No understanding of physics. Fairly expensive. Hard to tell stories with it. I work in video production. The industry is not big enough for you to invest billions and billions into. It’s currently in a total state of crisis.

          Video production is just not big enough of a market to make a difference in the AI race. I don’t understand why any AI company would spend significant amount of resources matching Sora when I don’t really think it will be a 10 billion dollar product (yet).

          Plus Google is well positioned to match it anyway, since they have YouTube data they can probably license to their AI gen video training.

      • > Nadella deservedly earned acclaim for transitioning Microsoft from the Windows era to cloud and mobile.

        You mean by shifting away from Windows for mobile and focusing on iOS and Android?

      • crowcroft2 周前
        They need an iPod to iPhone like transition. If they can pull it off it will be incredible for the business.
    • crazygringo2 周前
      > and Google has been famously bad at getting people aligned and driving in one direction.

      To be fair, it's not that they're bad at it -- it's that they generally have an explicit philosophy against it. It's a choice.

      Google management doesn't want to "pick winners". It prefers to let multiple products (like messaging apps, famously) compete and let the market decide. According to this way of thinking, you come out ahead in the long run because you increase your chances of having the winning product.

      Gemini is a great example of when they do choose to focus on a single strategy, however. Cloud was another great example.

      • xnx2 周前
        I definitely agree that multiple competing products is a deliberate choice, but it was foolish to pursue it for so long in a space like messaging apps that has network effects.

        As a user, I still wish that there were fewer apps, combining the best features of each. Google's 2(!) apps for AI podcasts are a recent example: https://notebooklm.google.com/ and https://illuminate.google.com/home

      • tbarbugli2 周前
        Google is not winning on cloud; AWS is winning and MS is gaining ground.
        • surajrmal2 周前
          Parent didn't claim Google is winning. Only that there is a cohesive push and investment in a single product/platform.
        • rrdharan2 周前
          That was 2023; more recently Microsoft is losing ground to Google (in 2024).
    • talldayo2 周前
      BERT and Gemma 2B were both some of the highest-performing edge models of their time. Google does really well - in terms of pushing efficiency in the community they're second to none. They also don't need to rely on inordinate amounts of compute because Google's differentiating factor is the products they own and how they integrate it. OpenAI is API-minded, Google is laser-focused on the big-picture experience.

      For example: those little AI-generated YouTube summaries that have been rolling out are wonderful. They don't require heavyweight LLMs to generate, and can create pretty effective summaries using nothing but a transcript. They're not only more useful than the other AI "features" I interact with regularly, they don't demand AGI or chain-of-thought.

      • closewith2 周前
        > Google is laser-focused on the big-picture experience.

        This doesn't match my experience of any Google product.

        • talldayo2 周前
          I disagree - another way you could phrase this is that Google is presbyopic. They're very capable of thinking long-term (e.g. Google DeepMind and AI as a whole, cloud, video, Drive/GSuite, etc.), but as a result they struggle to respond to quick market changes. AdSense is the perfect example of Google "going long" on a product and reaping the rewards to monopolistic ends. They can corner a market when they set their sights on it.

          I don't think Google (or really any of FAANG) makes "good" products anymore. But I do think there are things to appreciate in each org, and compared to the way Apple and Microsoft are flailing helplessly I think Google has proven themselves in software here.

          • lxgr2 周前
            Google does software/features relatively well, but they are completely lost when it comes to marketing, shipping, and continuing to support products.

            Or how would you describe their handling of Stadia, or their weird obsession about shipping and cancelling about a dozen instant messengers?

            • talldayo2 周前
              Stadia was a failure from the start. Microsoft and Nvidia are also laser-focused on this game streaming business, but I seriously doubt it will pan out ever. At least not in a profitable sense. In that regard, I think Google planned to be first-to-market, failed early, and killed their darling before anyone got a chance to love it.

              The IMs post-Hangouts are less explainable, but I do empathize with Google for wanting to find some form of SMS replacement standard. The RCS we have today is flawed and was rushed out of the door just to have a serious option for the DOJ to endorse. This is an area where I believe the United States government has been negligent in allowing competing OEMs to refuse cooperation in creating an SMS successor. I agree it's silly, and it needs to stop eventually.

              • lxgr2 周前
                > Stadia was a failure from the start. Microsoft and Nvidia are also laser-focused on this game streaming business

                Have you actually compared these services first hand? Stadia was miles ahead of the competition. The experience was unbelievably good and ubiquitous (Desktop, phone, TV, Chromecast...), and both mouse and gamepad felt like first class input methods.

                Microsoft's Xbox game streaming is a complete joke in comparison. Last time I tried, I had to use my mouse to operate a virtual gamepad to operate a virtual cursor to click instruments in MSFS. Four levels of nesting. Development progress is also extremely slow. Not sure where you're seeing laser focus there.

                > I do empathize with Google for wanting to find some form of SMS replacement standard

                Why did Google out of all companies have to come up with an SMS replacement? Absolutely nobody asked for this! They started out with XMPP, which was federated and had world-class open source implementations, and after what feels like a double-digit number of failed attempts they arrived at SMS over SIP from hell that nobody other than themselves actually knows how to implement and only telcos can federate with (theoretically; practically, they just outsource to Google).

                I find it really hard to believe that this is anything other than a thinly veiled marketing plot to be able to point at an "open standard" that Google is almost exclusively running via Jibe (not sure if they provide that for free or are charging carriers for it).

                The contortions they went through to decouple their "Allo" and "Duo" brands from Google accounts (something almost everybody has anyway to send email!) for absolutely no benefit and even more significant customer confusion...

                And now look at Gemini. It looks like the exact same story to me from the beginning: Amazing technology backed by a great team (they literally invented transformers), yet completely kneecapped by completely confused product development. It's unreal how much better it is queried through the API, but that's unfortunately not what people see when they go to gemini.google.com.

    • > but it's hard to deny that their LLM models are really, really good

      Although I do still pay for ChatGPT, I find it dog slow. ChatGPT is simply way too slow to generate answers. It feels like --even though of course it's not doing the same thing-- I'm back in the 80s with my 8-bit computer printing things line by line.

      Gemini OTOH doesn't feel like that: answers are super fast.

      To me low latency is going to be the killer feature. People won't keep paying for models that are dog slow to answer.

      I'll probably be cancelling my ChatGPT subscription soon.

    • pelorat2 周前
      Well, compared to github copilot (paid), I think Gemini Free is actually better at writing non-archaic code.
      • rafaelmn2 周前
        Using Claude 3.5 sonnet ?
      • jacooper2 周前
        Gemini is coming to copilot soon anyway.
    • manishsharan2 周前
      >> it's hard to deny that their LLM models are really, really good.

      The context window of Gemini 1.5 pro is incredibly large and it retains the memory of things in the middle of the window well. It is quite a game changer for RAG applications.

      • caeril2 周前
        Bear in mind that a "1 million token" context window isn't actually that. You're being sold a sparse attention model, which is guaranteed to drop critical context. Google TPUs aren't running inference on a TERABYTE of fp8 query-key inputs, let alone TWO of fp16.

        Google's marketing wins again, I guess.
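
        For a sense of scale, a back-of-envelope sketch (assuming naive dense attention, i.e. one full score matrix where every token attends to every other token; this is per head, per layer):

            tokens = 1_000_000
            scores = tokens * tokens              # 1e12 query-key scores
            print(scores * 1 / 1e12, "TB at fp8 (1 byte per score)")    # 1.0 TB
            print(scores * 2 / 1e12, "TB at fp16 (2 bytes per score)")  # 2.0 TB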

      • It looks like long context degraded from 1.5 to 2.0 according to the 2.0 launch benchmarks.
        • verdverm2 周前
          Those benchmarks are for the Flash 2.0 model, showing it improved over the 1.5 Pro model except for that one benchmark.
    • sky22242 周前
      Maybe I've just hit a streak of good outputs, but I've also noticed that the automatic Gemini results that show up when doing Google searches have been significantly more useful than they were previously.

      About a year ago, I was saying that Google was potentially walking toward its own grave due to not having any pivots that rivaled OpenAI. Now, I'm starting to think they've found the first few steps toward an incredible stride.

    • bushbaba2 周前
      Yet Google continues to show that it'll deprecate its APIs, services, and functionality to the detriment of your own business. I'm not sure enterprises will trust Google's LLM over the alternatives. Too many have been burned throughout the years, including GCP customers.

      The fact that GCP needs to have these pages, and that the lists are not 100% comprehensive, is telling enough. https://cloud.google.com/compute/docs/deprecations https://cloud.google.com/chronicle/docs/deprecations https://developers.google.com/maps/deprecations

      Steve Yegge rightfully called this out, and yet no change has been made. https://medium.com/@steve.yegge/dear-google-cloud-your-depre...

      • verdverm2 周前
        At least GCP makes them easy to find

        Some guy had to do it for Azure, then he went to work for them and it is now deprecated itself

        https://blog.tomkerkhove.be/2023/03/29/sunsetting-azure-depr...

      • weatherlite2 周前
        GCP grew 35% last quarter, just saying...
        • Jabbles2 周前
          "just saying" things that are false.

          Google Cloud grew 35% year over year, when comparing the 3 months ending September 30th 2024 with 2023.

          https://abc.xyz/assets/94/93/52071fba4229a93331939f9bc31c/go... page 12

          • surajrmal2 周前
            Isn't that the typical interpretation of what the parent comment said? How is it false?
            • mattmerr2 周前
              I read parent comment "grew 35% last quarter" as (income on 2024-09-30) is 1.35 * (income on 2024-07-01)

              The balance sheet shows (income on days from 2024-07-01 through 09-30) is 1.35 * (income on days from 2023-07-01 through 09-30)

              These are different because with heavily handwavey math the first is growing 35% in a single quarter and the second is growing 35% annually (by comparing like-for-like quarters)
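
              To make the gap concrete, a quick sketch with made-up compounding (not figures from the filing): 35% in a single quarter would imply far more than 35% over a year:

                  quarterly_growth = 0.35
                  annualized = (1 + quarterly_growth) ** 4 - 1   # compounding four quarters
                  print(f"{annualized:.0%}")                     # ~232%, versus the 35% YoY actually reported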

            • weatherlite2 周前
              It's indeed the typical interpretation of what I said. I could have written YoY growth, but it's so common that that's what everyone means, and people often omit it.
            • superq2 周前
              35% over 12 months != 35% over 3 months.
              • weatherlite2 周前
                In the financial industry you almost always compare over 12 months (year over year growth) to avoid noise like seasonality. It's so prevalent I didn't think I need to explain it but yeah I meant GCP grew 35% YOY.
          • freedomben2 周前
            Great point, although 35% yoy is still impressive, and numbers they are surely pleased with
        • bushbaba2 周前
          And AWS + Azure are 4.5x the size of GCP. AWS alone is 2.6x the size of GCP. So your point?
          • weatherlite2 周前
            Not sure why you're comparing both AWS + Azure to GCP - did they become one company? My point is that GCP is growing very fast; AWS, while being much bigger, has slowed considerably to around a 20% growth rate while GCP is at 35%. It has very good growth, and many people seem to be adequately happy with it, in contrast to what many commenters here seem to think.
            • bushbaba2 周前
              AWS is 2.5x the size of GCP. In fact, AWS is growing faster than GCP based on gross revenue added over the last 12 months.

              I used both Azure and AWS to show that GCP has lost significant market share because of its deprecation policy. Enterprises don't trust that GCP won't deprecate their services.

    • bwb2 周前
      So far, for my tests, it has performed terribly compared to ChatGPT and Claude. I hope this version is better.
    • aerhardt2 周前
      > seems to be getting the right results

      > hard to deny that their LLM models are really, really good

      I'm so scarred by how much their first Gemini releases sucked that the thought of trying it again doesn't even cross my mind.

      Are you telling us you're buying this press release wholesale, or you've tried the tech they're talking about and love it, or you have some additional knowledge not immediately evident here? Because it's not clear from your comment where you are getting that their LLM models are really good.

      • MaxDPS2 周前
        I’ve been using Gemini 1.5 Pro for coding and it’s been great.
  • serjester2 周前
    Buried in the announcement is the real gem — they’re releasing a new SDK that actually looks like it follows modern best practices. Could be a game-changer for usability.

    They’ve had OpenAI-compatible endpoints for a while, but it’s never been clear how serious they were about supporting them long-term. Nice to see another option showing up. For reference, their main repo (not kidding) recommends setting up a Kubernetes cluster and a GCP bucket to submit batch requests.

    [1]https://github.com/googleapis/python-genai

    • redrix2 周前
      Oh wow, it supports directly specifying a Pydantic model as an output schema that it will adhere to for structured JSON output. That’s fantastic!

      https://github.com/googleapis/python-genai?tab=readme-ov-fil...
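
      Roughly what that looks like, going by the README (a sketch, not verified against the released SDK; the model name and schema fields here are just examples):

          from pydantic import BaseModel
          from google import genai

          class Recipe(BaseModel):
              name: str
              ingredients: list[str]

          # assumes an API key is available; the client can also pick one up from the environment
          client = genai.Client(api_key="...")
          response = client.models.generate_content(
              model="gemini-2.0-flash-exp",
              contents="Give me a simple cookie recipe.",
              config={
                  "response_mime_type": "application/json",
                  "response_schema": Recipe,  # output is constrained/validated against this Pydantic model
              },
          )
          print(response.text)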

      • jcheng2 周前
        FYI all the LLM Python SDKs that support structured output can use Pydantic for the schema—at least all the ones I can think of.
    • pkkkzip2 周前
      It's interesting that just as the LLM hype appears to be simmering down, DeepMind is making big strides. I'm more excited by this than by any of OpenAI's announcements.
    • I looked carefully at the SDK earlier today - it does look very nice, but it is also a work in progress.
  • bradhilton2 周前
    Beats Gemini 1.5 Pro at all but two of the listed benchmarks. Google DeepMind is starting to get their bearings in the LLM era. These are the minds behind AlphaGo/Zero/Fold. They control their own hardware destiny with TPUs. Bullish.
    • VirusNewbie2 周前
      If you look at where talent is going, it's Anthropic that is the real competitor to Google, not OpenAI.
    • p1esk2 周前
      Are these benchmarks still meaningful?
      • maeil2 周前
        No, and they haven't been for at least half a year. Utterly optimized for by the providers. Nowadays if a model would be SotA for general use but not #1 on any of these benchmarks, I doubt they'd even release it.
      • CamperBob22 周前
        I've started keeping an eye out for original brainteasers, just for that reason. GCHQ's Christmas puzzle just came out [1], and o1-pro got 6 out of 7 of them right. It took about 20 minutes in total.

        I wasn't going to bother trying those because I was pretty sure it wouldn't get any of them, but decided to give it an easy one (#4) and was impressed at the CoT.

        Meanwhile, Google's newest 2.0 Flash model went 0 for 7.

        1: https://metro.co.uk/2024/12/11/gchq-christmas-puzzle-2024-re...

        • iamdelirium2 周前
          Why are you comparing flash vs o1-pro, wouldn't a more fair comparison be flash vs mini?
          • iamdelirium2 周前
            I just asked o1-mini the first two questions and it got them wrong.
          • 2 周前
            undefined
          • CamperBob22 周前
            It's the only Google model that my account has access to that accepts .PNG files. I assumed it was the latest/greatest experimental 2.0 release.

            If they want a rematch, they'll need to bring their 'A' game next time, because o1-pro is crazy good.

        • nrvn2 周前
          Did it get the 8th one right? The linked article provides the wrong answer, btw.
          • CamperBob22 周前
            I didn't see a straightforward way to submit the final problem, because I used different contexts for each of the 7 subproblems.

            Given the right prompt, though, I'm sure it could handle the 'find the corresponding letter from the landmarks to form an anagram' part. That's easier than most of the other problems.

            You're saying the ultimate answer isn't 'PROTECTING THE UNITED KINGDOM'?

            • nrvn2 周前
              If you follow the sleigh's Morse path starting from the robin, it spells 'united in protecting the kingdom'.
        • p1esk2 周前
          Wow! That’s all I need to know about Google’s model.
          • Workaccount22 周前
            What is impressive about this new model is that it is the lightweight version (flash).

            There will probably be a 2.0 pro (which will be 4o/sonnet class) and maybe an ultra (o1(?)/Opus).

          • danpalmer2 周前
            That's a comparison of multiple GPT-4 models working together... against a single GPT-4 mini style model.
            • p1esk2 周前
              > multiple GPT-4 models working together

              What do you mean? Is o1 not a single model?

    • dagmx2 周前
      Regarding TPUs, sure, for the stuff that's running in the cloud.

      However their on device TPUs lag behind the competition and Google still seem to struggle to move significant parts of Gemini to run on device as a result.

      Of course, Gemini is provided as a subscription service as well so perhaps they’re not incentivized to move things locally.

      I am curious if they’ll introduce something like Apple’s private cloud compute.

      • whimsicalism2 周前
        i don’t think they need to win the on device market.

        we need to separate inference and training - the real winners are those who have the training compute. you can always have other companies help with inference

        • maeil2 周前
          > i don’t think they need to win the on device market.

          The second Apple comes out with strong on-device AI - and it very much looks like they will - Google will have to respond on Android. They can't just sit and pray that e.g. Samsung makes a competitive chip for this purpose.

          • SimianSci2 周前
            I think Apple is uniquely disadvantaged in the AI race to a degree people don't realize. They have less training data to use, having famously focused on privacy for their users, and thus no particular advantage in this space from customer data to train on. They have little to no cloud business, and while they operate a couple of services for their users, they do not have the infrastructure scale to compete with hyperscaler cloud vendors such as Google and Microsoft. Much of what they would need to spend on training new models would require handing over lots of money to the very companies that already have their own models, supercharging their competition.

            While there is a chance that Apple might come out with a very sophisticated on-device model, the problem is that it would only be able to compete with other on-device models. The magnitude of compute needed to keep pace with SOTA models is not achievable on a single device. It will take many generations of Apple silicon to compete with the compute of existing datacenters.

            Google also already has competitive silicon in this space with the Tensor series processors, which are being fabbed at Samsung plants today. There is no sitting and praying necessary on their part as they already compete.

            Apple is a very distant competitor in the AI space, and I see no reason to assume this will change; they are uniquely disadvantaged by several of the choices they made on their way to mobile supremacy. The only thing they currently have going for them is their own ARM silicon, which may give them the ability to compete with Google's TPU chips, but far more is needed to be competitive here than the ability to avoid the Nvidia tax.

            • dagmx2 周前
              There's an easy solution here: Apple isn't trying to compete with the big models everyone else is running. They're betting in the opposite direction, that many small models are a better value add for their customers. And they can call out to other services as needed for the larger stuff.

              I’m in the camp that this is the right call for consumers, instead of trying to compete on the large model side. They’ve yet to deliver on their full promise, but if they can, it’s the place where I think more of the industry will go (for consumers)

              And regarding Google’s mobile tensor chips, they are infamously behind all other players in the market space for the same generation of processor. They don’t share the same advantages they do in the server space.

              • whimsicalism2 周前
                training bigger models gets you small models for free plus a higher upper bound in capabilities.

                Apple just isn’t very capable in this space, not sure what’s so hard to accept

              • mike_hearn2 周前
                Apple have trained their own foundation LLM.
                • whimsicalism2 周前
                  hardly even qualifies for ‘fast follow’, more like ‘surprisingly slow follow’

                  their models aren’t even that good. sorry apple fanboys but the talent isn’t there

            • simonw2 周前
              "having famously been focused on privacy for its users and thus having no particular advantage in this space due to not having customer data to train on"

              That may not be as big a disadvantage as you think.

              Anthropic claim that they did not use any data from their users when they trained Claude 3.5 Sonnet.

              • whimsicalism2 周前
                sure but they certainly acquired data from mass scraping (including of data produced by their users) and/or data brokering aka paying someone to do the same.
            • It is likely Apple can get additional data by creating synthetic data for user interactions.

              About 7 years ago I trained GAN models to generate synthetic data, and it worked so well. The state of the art has increased a lot in 7 years, so Apple will be fine.

              • SimianSci2 周前
                For a while there I would have been in agreement with you, but the idea that models can be trained purely on synthetic data has been shown to be wrong on multiple levels. Synthetic data needs to be reviewed by people to ensure quality, significantly reducing the speed at which an organization can adopt it for training. Reasonable engineers would suggest having other language models review the synthetic data, but we have seen that this is what leads to model collapse due to compounding issues around hallucinations.

                At best, synthetic data is a "slow follow" for training a model due to the need for human review, but a competitive model it does not make.

            • whimsicalism2 周前
              yeah i’ve never understood the outsized optimism for apple’s ai strategy, especially on hn.

              they’re a little bit less of a nobody than they used to be, but they’re basically a nobody when it comes to frontier research/scaling. and the best model matters way more than on-device which can always just be distilled later and find some random startup/chipco to do inference

              • msabalau2 周前
                Theory: Apple's lifestyle branding is quite important to the identity of many in the community here. I mean, look at the buy-in at launch for Apple Vision Pro by so many people on HN--it made actual Apple communities and publications look like jaded skeptics.
                • dagmx2 周前
                  Oh please, this is the classic “everyone who chooses differently than myself is <superficial/dumb/misinformed>” argument that a lot of people use when it comes to tech nerd identity politics.

                  Is it really that hard to imagine that people have different viewpoints and make different decisions than you, without painting them as vapid airheads?

                  • whimsicalism2 周前
                    I work in this industry, been working professionally on transformers since 2018.

                    The level of optimism for Apple AI capabilities on here is wrong. I can imagine people having wrong viewpoints, but it is wrong.

            • maeil2 周前
              For clarity, I was only talking about the hardware side, not the software one. I don't think the models matter too much, by the time the hardware is ready there will be open models that Apple can take and modify to their liking.

              Besides, did Anthropic and e.g. Mistral inherently have such troves of data to train on that Apple doesn't? For the last 6 months, Anthropic has had the SOTA model for the average production usecase.

              > Google also already has competitive silicon in this space with the Tensor series processors, which are being fabbed at Samsung plants today. There is no sitting and praying necessary on their part as they already compete.

              Intel had a much bigger advantage with x86, and look where we are now. I find it hard to believe that creating a good AI chip is anywhere near as big a challenge as Apple Silicon was. The upcoming SE uses their in-house 5G modem, another huge hardware achievement that no one else has pulled off.

              With that in mind, how can you bet against Apple when it comes to designing chips at this point? It's not like Amazon et al. aren't producing their own AI chips too, let alone all of the startups like Cerebras. That indicates the moat and barriers are likely much lower than for Apple Silicon or the 5G modem.

              If I'm talking nonsense, do correct me.

          • reportingsjr2 周前
            The Android on-device AI is, and has been, leagues better than what is available on iOS.

            If anything, I think the upcoming iOS AI update will bring them to a similar level as android/google.

          • petra2 周前
            But given inference-time compute, giving a strong reply reasonably fast requires a lot of compute that sits idle most of the time.

            Economically this fits the cloud much better.

        • dagmx2 周前
          At what point does the on device stuff eat into their market share though? As on device gets better, who will pay for cloud compute? Other than enterprise use.

          I’m not saying on device will ever truly compete at quality, but I believe it’ll be good enough that most people don’t care to pay for cloud services.

          • whimsicalism2 周前
            You're still focused on inference :)

            inference basically does not matter, it is a commodity

            • dagmx2 周前
              You're still focused on training :)

              training doesn’t matter if inference costs are high and people don’t pay for them

              • whimsicalism2 周前
                but inference costs aren't high as it is, and there are tons of hardware companies that can do relatively cheap LLM inference
                • dagmx2 周前
                  Inference costs per invocation aren’t high. Scale it out to billions of users and it’s a different story.

                  Training is amortized over each inference, so the cost of inference also needs to include the cost of training to break even unless made up elsewhere

            • rowanG0772 周前
              That makes no sense. Inference costs dwarf training costs pretty quickly if you have a successful product. AFAIK there is no commodity hardware that can run state-of-the-art models like ChatGPT o1.
              • whimsicalism2 周前
                > Afaik there is no commodity hardware that can run state of the art models like chatgpt-o1.

                Stack enough GPUs and any of them can run o1. Building a chip to infer LLMs is much easier than building a training chip.

                Just because one cost dwarfs another does not mean that this is where the most marginal value from developing a better chip will be, especially if other people are just doing it for you. Google gets a good model, inference providers will be begging to be able to run it on their platform, or to just sell google their chips - and as I said, inference chips are much easier.

                • 2 周前
                  undefined
                • rowanG0772 周前
                  The chip level is only a tiny part of the story. Training can happen with a big-boy variant of "it works on my machine"; inference requires a worldwide network of GPUs. The chip level is the last thing you will be worrying about.
                • menaerus2 周前
                  Each GPU costs ~$50k. You need at least 8 of them to run mid-sized models. Then you need a server to plug those GPUs into. That's not commodity hardware.
                  • whimsicalism2 周前
                    more like ~$16k for 16 3090s. AMD chips can also run these models. The parts are expensive but there is a competitive market in processors that can do LLM inference. Less so in training.
                    • menaerus2 周前
                      > more like ~$16k for 16 3090s

                      I don't know where you got that price from, but one RTX 3090 is $1,900; 16 of them is ~$30,000.

                      > The parts are expensive

                      Now that we've invested ~$30k in GPUs, we only need to find a motherboard that can accommodate 16 PCIe 4.0 x16 GPUs, right? And we also need a CPU that can drive that many PCIe 4.0 x16 lanes?

                      Well, neither exists, not even in the server-parts sector, let alone client commodity hardware. In any case, you'd need two CPUs, so even with this imaginary motherboard we are already entering server-rack design space. And that costs hundreds of thousands of dollars.

                      > but there is a competitive market in processors that can do LLM inference

                      Nothing but the smallest and smallish models. If such a market existed, why would you set out to build a 16x RTX 3090 machine?

                      Sorry, but you're just spitting out nonsense.

                      • 2 周前
                        undefined
        • vineyardmike2 周前
          I don’t think the AI market will ever really be a healthy one until inference vastly outnumbers training. What does it say about AI if training is done more than inference?

          I agree that the on-device inference market is not important yet.

          • whimsicalism2 周前
            done more != where the value is at

            inference hardware is a commodity in a way that training is not

      • mupuff12342 周前
        The majority of people want better performance; running locally is just a nice-to-have feature.
        • dagmx2 周前
          They’ll care though when they have to pay for it, or when they’re in an area with poor reception.
          • mupuff12342 周前
            They pay to run it locally as well (more expensive hardware)

            And sure, poor reception will be an issue, but most people would still absolutely take a helpful remote assistant over a dumb local assistant.

            And you don't exactly see people complaining that they can't run Google/YouTube/etc locally.

            • dagmx2 周前
              Your first sentence contains a fallacy: you're attributing the cost of the whole device to a single feature and weighing it against the cost of that feature alone.

              Most people are unlikely to buy the device for the AI features alone. It’s a value add to the device they’d buy anyway.

              So you need the paid-for option to be significantly better than the free one that comes with the device.

              Your second sentence assumes the local one is dumb. What happens when local ones get better? Again, how much better does the cloud one have to be to compete on cost?

              To your last sentence: it assumes data fetched from the cloud, which is valid, but a lot of data is local too. Are people really going to pay for what Google Search gives them for free?

              • mupuff12342 周前
                I think the more likely assumption is that on-device performance will trail off-device models by a significant margin for at least the next few years - of course, if you could magically make it work locally with the same level of performance, that would be better.

                Plus a lot of the "agentic" stuff is interaction with the outside world, connectivity is a must regardless.

                • dagmx2 周前
                  My point is that you do NOT need the same level of performance. You need an adequate level of performance that the cost to get more performance isn’t worth it to most people.
                  • mupuff12342 周前
                    And my point is that it's way too early to try to optimize for running locally; if performance really stabilizes and plateaus (which may well happen), then it makes more sense to optimize.

                    Plus once you start with on device features you start limiting your development speed and flexibility.

                • jsight2 周前
                  It isn't really hypothetical. Lots of good models run well on a modern MacBook Pro.
          • vineyardmike2 周前
            Poor reception is rapidly becoming a non-issue for most of the developed world. I can’t think of the last time I had poor reception (in America) and wasn’t on an airplane.

            As the global human population increasingly urbanizes, it’ll become increasingly easy to blanket it with cell towers. Poor(er) regions of the world will increase reception more slowly, but they’re also more likely to have devices that don’t support on-device models.

            Also, Gemini Flash is basically positioned as a free model, (nearly) free API, free in GUI, free in Search Results, Free in a variety of Google products, etc. No one will be paying for it.

            • dagmx2 周前
              Many major cities have significant dead spots for coverage. It’s not just for developing areas.

              Flash is free for API use at a low rate limit. Gemini as a whole is not free to Android users (free right now, with subscription costs for advanced features after a time period) and isn't free for Google to run without some monetary incentive. Hence why I originally asked about private-cloud-compute alternatives from Google.

            • michaelmrose2 周前
              I ride a ferry from a city of 50k to a city of 700k in the US and work in a building with apartments upstairs that is basically a concrete cave.

              I see poor reception in both areas and only one has WiFi.

          • You can run a model >100x faster in the cloud compared to on-device with DDR RAM. That would make up for the reception issues.
            • dagmx2 周前
              And you can’t run the cloud model at all if you can’t talk to the cloud.
              • Yes, but I can't imagine situations where I "have" to run a model at a time when I don't have internet. My life would be more affected by losing the rest of the internet than by not being able to run a small, stupid model locally. At the very least until hallucination is completely solved, since I need the internet to verify what the models tell me.
                • dagmx2 周前
                  You’re assuming the model is purely for generation though. Several of the Gemini features are lookup of things across data available to it. A lot of that data can be local to device.

                  That is currently Apple’s path with Apple Intelligence for example.

                • michaelmrose2 周前
                  Hallucination can't be solved because bogus output is categorically the same sort of thing as useful output.

                  It has no world model. It doesn't know truth any more than it knows bullshit; it's just a statistical relationship between words.

        • griomnib2 周前
          Latency is a huge factor in performance, and local models often have a huge edge. Especially on mobile devices that could be offline entirely.
          • KoolKat232 周前
            Definitely not when it comes to LLMs; the larger, more useful local models are not that fast, and latency is not an issue in the cloud. Just look at this Google model's voice function or even OpenAI's advanced voice.
          • 2 周前
            undefined
      • If the model weights aren't open, you can't run it on device anyway.
        • kridsdale12 周前
          The Pixel 9 runs many small proprietary Gemini models on the internal TPU.
          • griomnib2 周前
            And yet these new models still haven’t reached feature parity with Google Assistant, which can turn my flashlight on, but with all the power of burning down a rainforest, Gemini still cannot interact with my actual phone.
            • I just tried asking my phone to turn on the flashlight using Gemini. It worked. https://9to5google.com/2024/11/07/gemini-utilities-extension...
              • griomnib2 周前
                Ok I tried literally last week on Pixel 7a and it didn’t work. What model do you have? Maybe it requires a phone that can do on-device models?
                • _puk2 周前
                  Works on a Pixel 4A 5G..

                  Pretty sure that's not doing any fancy on-device models!

                  That said, there was a popup today saying that assistant is now using Gemini, so I just enabled it to try. Could well have changed in the last week.

                • staticman22 周前
                  I just tried it on my Galaxy Ultra s23 and it worked. I then disconnected internet and it did not work.
          • Gemini Nano weights have leaked and Google doesn't care about the leak. Google would definitely care if the Pro weights leaked.
            • Is there any phone in the world that can realistically run pro weights?
    • JeremyNT2 周前
      Yeah they've been slow to release end-user facing stuff but it's obvious that they're just grinding away internally.

      They've ceded the fast mover advantage, but with a massive installed base of Android devices, a team of experts who basically created the entire field, a huge hardware presence (that THEY own), massive legal expertise, existing content deals, and a suite of vertically integrated services, I feel like the game is theirs to lose at this point.

      The only caution is regulation / anti-trust action, but with a Trump administration that seems far less likely.

  • airstrike2 周前
    OT: I’m not entirely sure why, but "agentic" sets my teeth on edge. I don't mind the concept, but the word itself has that hollow, buzzwordy flavor I associate with overblown LinkedIn jargon, particularly as it is not actually in the dictionary...unlike perfectly serviceable entries such as "versatile", "multifaceted" or "autonomous"
    • OutOfHere2 周前
      To play devil's advocate, the correct use of the word would be when multiple AIs are coordinating and handing off tasks to each other with limited context, such that the handoffs are dynamically decided at runtime by the AI, not by any routine code. I have yet to see a single example where this is required. Most problems can be solved with static workflows and simple rule based code. As such, I do believe that >95% of the usage of the word is marketing nonsense.
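
      To illustrate the runtime-decided handoff I mean, here's a minimal sketch (everything is hypothetical: `call_llm` is a stand-in for whatever chat API you use, and the agent names are made up):

          # Two "agents" are just two system prompts; the model itself decides
          # whether to hand off or finish, rather than any routing code.
          AGENTS = {
              "researcher": "You gather facts and output bullet points.",
              "writer": "You turn bullet points into prose.",
          }

          def call_llm(system_prompt: str, message: str) -> dict:
              # Stand-in for a real chat API call. A real implementation would return
              # the model's chosen next agent (or None) plus its output text.
              if "gather facts" in system_prompt:
                  return {"handoff_to": "writer", "content": "- fact one\n- fact two"}
              return {"handoff_to": None, "content": "Fact one and fact two, as prose."}

          def run(task: str, start: str = "researcher", max_hops: int = 5) -> str:
              agent, message = start, task
              result = ""
              for _ in range(max_hops):
                  out = call_llm(AGENTS[agent], message)
                  result = out["content"]
                  if out["handoff_to"] is None:  # the model decided it's done
                      break
                  # The next agent is chosen by the model at runtime, not hard-coded.
                  agent, message = out["handoff_to"], out["content"]
              return result

          print(run("Summarize why tankers turn slowly."))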
      • maeil2 周前
        I actually have built such a tool (two AIs, each with different capabilities), but I still cringe at calling it agentic. Might just be an instinctive reflex.
      • jasonsteving2 周前
        You nailed an interesting nuance there about agents needing to make their own decisions!

        I'm getting fairly excited about "agentic" solutions to the point that I even went out of my way to build "AgentOfCode" (https://github.com/JasonSteving99/agent-of-code) to automate solving Advent of Code puzzles by iteratively debugging executions of generated unit tests (intentionally not competing on the global leaderboard).

        And even for this, there's actually only a SINGLE place in the whole "agent" where the models themselves actually make a "decision" on what step to take next, and that's simply deciding whether to refactor the generated unit tests or the generated solution based on the given error message from a prior failure.

      • danpalmer2 周前
        I think this sort of usage is already happening, but perhaps in the internal details or uninteresting parts, such as content moderation. Most good LLM products are in fact using many LLM calls under the hood, and I would expect that results from one are influencing which others get used.
    • wepple2 周前
      Versatile is far worse. It’s so broad to the point of meaninglessness. My garden rake is fairly versatile.

      Agentic to me means that it acts somewhat under its own authority rather than a single call to an LLM. It has a small degree of agency.

    • thom2 周前
      I'm personally very glad that the word has adhered itself to a bunch of AI stuff, because people had started talking about "living more agentically" which I found much more aggravating. Now if anyone states that out loud you immediately picture them walking into doors and misunderstanding simple questions, so it will hopefully die out.
      • 2 周前
        undefined
    • ramoz2 周前
      Need a general term for autonomous intelligent decision making.
      • No, we need a scientific understanding of autonomous intelligent decision-making. The problem with “agentic AI” is the same old “Artificial Intelligence, Natural Stupidity” problem: we have no clue what “reasoning” or “intelligence” or “autonomous” actually means in animals, and trying to apply these terms to AI without understanding them (or inventing a new term without nailing down the underlying concept) is doomed to fail.
        • They don’t mean anything precise, and don’t need to. They describe a bag of behaviors with overlapping properties that we observe animals/people doing.

          ‘Intelligent’ is exactly as precise as ‘funny’ or ‘interesting’. It’s a label for a cluster of observations of another agent’s behavior. It entails almost nothing about how those behaviors are achieved.

          This is of course only an opinion, but it’s my professional opinion after thirty five years in the AI business.

      • airstrike2 周前
        Isn't that just "intelligent"?
        • ramoz2 周前
          We need something to describe a behavioral element in business processes. Something goes into it, something comes out of it - though in this case nondeterminism is involved and it may not be concrete outputs so much as further actioning.

          Intelligence is a characteristic.

          • airstrike2 周前
            Volitional, independent, spontaneous, free-willed, sovereign...
    • geodel2 周前
      Huh, all three words you mentioned as replacements are equally buzzwordy, and I see them a lot in CVs while screening candidates for job interviews.
      • lolinder2 周前
        They agree—they're saying that at least those buzzwords are in the dictionary, not that they'd be a good replacement for "agentic".
      • raincole2 周前
        Versatile implies it can do more kinds of tasks (than its predecessor or competitor). Agentic implies it requires less human intervention.

        I don't think these are necessarily buzzwords if the product really does what they imply.

      • airstrike2 周前
        At least all three of them are actually in the dictionary
        • hombre_fatal2 周前
          That's not necessarily a good thing because they are overloaded while novel jargon is specific.

          We use new words so often that we take it for granted. You've passively picked up dozens of new words over the last 5 or 10 years without questioning them.

    • 2 周前
      undefined
    • m3kw92 周前
      Yeah, I hate it when AI companies throw around words like AGI and agentic capabilities. It's nonsense to most people and ambiguous at best.
      • This is what other replies are missing - I've been following AI closely since GPT 2 and it's not immediately clear what agentic means, so to other people, the term must be even less clear. Using the word autonomous can't be worse than agentic imo.
    • dsr_2 周前
      agentic obviously means "not gentic" and gent is from the Latin for "people".

      agentic == not people.

      Quite sensible, really.

    • Take good care of your teeth, my friend. It's a new era in agentic factuality.
    • 2 周前
      undefined
    • bjackman2 周前
      [dead]
  • losvedir2 周前
    This naming is confusing...

    Anyway, I'm glad that this Google release is actually available right away! I pay for Gemini Advanced and I see "Gemini Flash 2.0" as an option in the model selector.

    I've been going through Advent of Code this year, and testing each problem with each model (GPT-4o, o1, o1 Pro, Claude Sonnet, Opus, Gemini Pro 1.5). Gemini has done decent, but is probably the weakest of the bunch. It failed (unexpectedly to me) on Day 10, but when I tried Flash 2.0 it got it! So at least in that one benchmark, the new Flash 2.0 edged out Pro 1.5.

    I look forward to seeing how it handles upcoming problems!

    I should say: Gemini Flash didn't quite get it out of the box. It actually had a syntax error in the for loop, which caused it to fail to compile, which is an unusual failure mode for these models. Maybe it was a different version of Java or something (I'm also trying to learn Java with AoC this year...). But when I gave Flash 2.0 the compilation error, it did fix it.

    For the more Java proficient, can someone explain why it may have provided this code:

         for (int[] current = queue.remove(0)) {
    
    which was a compilation error for me? The corrected code it gave me afterwards was just

         for (int[] current : queue) { 
    
    and with that one change the class ran and gave the right solution.
    • srameshc2 周前
      I use Claude and Gemini a lot for coding, and I've realized there is no single good or best model. Every model has its upsides and downsides. I was trying to get authentication working according to the newer Manifest V3 guidelines for browser extensions, and every model was terrible. It's one use case where there isn't much information or correct documentation, so every model makes up stuff. But this is my experience and I don't speak for everyone.
      • huijzer2 周前
        Relatedly, I'm starting to think more and more that AI is great for mediocre stuff. If you just need to do the 1000th website, it can do that. Do you want to build a new framework? Then there will probably be fewer useful suggestions. (Still not useless though. I do like it a lot for refactoring while building xrcf.)

        EDIT: One reason that led me to think it's better for mediocre stuff was seeing the Sora model generate videos. Yes, it can create semi-novel stuff through combinations of existing stuff, but it can't stick to a coherent "vision" throughout the video. It's not like a movie by a great director like Tarantino, where every detail is right and all details point to the same vision. Instead, Sora is just flailing around. I see the same in software: sometimes the suggestions go towards one style and the next moment into another. I guess AI currently just has far too little context. Tarantino has been refining his style for 30 years now, always tuning his model towards his vision. AI, in comparison, seems to just take everything and turn it into one mediocre blob. It's not useless, but currently good to keep in mind, I think, that you can only use it to generate mediocre stuff.

        • meiraleal2 周前
          We got to the point that AI isn't great because it is not like a Tarantino movie. What a time to be alive.
          • huijzer2 周前
            We got to the point that AI CEOs pretend AI can make a Tarantino movie and get away with it.
      • copperx2 周前
        That's when having a huge context is valuable. Dump all of the new documentation into the model along with your query and the chances of success hugely increase.
      • monkmartinez2 周前
        This is true for all newish code bases. You need to provide the context it needs to get the problem right. It has been my experience that one or two examples with new functions or new requirements will suffice for a correction.
      • xnx2 周前
        > I use a Claude and Gemini a lot for coding and I realized there is no good or best model.

        True to a point, but is anyone using GPT2 for anything still? Sometimes the better model completely supplants others.

    • notamy2 周前
      > For the more Java proficient, can someone explain why it may have provided this code:

      To me that reads like it was trying to accomplish something like

          int[] current;
          while ((current = queue.pop()) != null) {
              // ...process current...
          }
    • rybosome2 周前
      I can't comment on why the model gave you that code, but I can tell you why it was not correct.

      `queue.remove(0)` gives you an `int[]`, which is also what you were assigning to `current`. So logically it's a single element, not an iterable. If you had wanted to iterate over each item in the array, it would need to be:

          for (int[] current : queue) {
              for (int c : current) {
                  // ...do stuff...
              }
          }

      Alternatively, if you wanted to iterate over each element in the queue and treat the int array as a single element, the revised solution is the correct one.

    • ianmcgowan2 周前
      A tangent, but is there a clear best choice amongst those models for AOC type questions?
  • og_kalu2 周前
    The Gemini 2 models support native audio and image generation, but the latter won't be generally available until January. Really excited for that, as well as 4o's image generation (whenever that comes out). Steerability has lagged behind aesthetics in image generation for a while now, and it'd be great to see a big advance in that.

    Also a whole lot of computer vision tasks (via LLMs) could be unlocked with this. Think Inpainting, Style Transfer, Text Editing in the wild, Segmentation, Edge detection etc

    They have a demo: https://www.youtube.com/watch?v=7RqFLp0TqV0

    • kthartic2 周前
      I asked Gemini 2.0 Flash (with my voice) whether it natively understands audio or is converting my voice to text. It replied:

      "That's an insightful question. My understanding of your speech involves a pipeline first. Your voice is converted to text and then I process the text to understand what you're saying. So I don't understand your voice directly but rather through a text representation of it."

      Unsure if this is a hallucination, but is disappointing if true.

      Edit: Looking at the video you linked, they say "native audio output", so I assume this means the input isn't native? :(

      • og_kalu2 周前
        Native audio output won't be in general availability until early next year.

        If you're using Gemini in AI Studio (not sure about the real-time API, but everything else), then it has native audio input.

    • jncfhnb2 周前
      These are not computer vision tasks…
      • newfocogi2 周前
        Maybe some of these tasks are arguably not aligned with the traditional applications of CV, but Segmentation and Edge detection are definitely computer vision in every definition I've come across - before and after NNs took over.
      • Jabrov2 周前
        What are they, then…?
        • 85392_school2 周前
          The first two are tasks which involve making images. They could be called image generation or image editing.
  • siliconc0w2 周前
    What's everyone's favorite LLM leaderboard? Gemini 2 seems to be edging out 4o on chatbot arena(https://lmarena.ai/?leaderboard)
    • manishsharan2 周前
      Leaderboards are not that useful for measuring the real-life effectiveness of the models, at least in my day-to-day usage.

      I am currently struggling to diagnose an IPv6 misconfiguration in my enormous AWS CloudFormation YAML. I gave the same input to Claude Opus, Gemini, and ChatGPT (o1 and 4o).

      4o was the worst: verbose and a waste of my time.

      Claude went completely off on a tangent and began recommending fixes for IPv4 while I specifically asked about IPv6 issues.

      o1 made a suggestion which I tried out and it fixed it. It literally found a needle in the haystack. The solution is working well now.

      Gemini made a suggestion which almost got it right but it was not a full solution.

      I must clarify diagnosing network issues on AWS VPC is not my expertise and I use the LLMs to supplement my knowledge.

      • blastbking2 周前
        Sonnet 3.5 as of today is superior to Opus; curious whether Sonnet could have solved your problem.
        • manishsharan2 周前
          I too was puzzled by the response from Claude. I am using the Anthropic workbench with claude-3-5-sonnet-20241022 (latest)

          But I think it has more to do with the freshness of the training data.

          AWS IPv6 egress is a technology that AWS introduced only recently. Previously, we had to deploy a NAT gateway, which supported IPv4. I am assuming claude-3-5-sonnet-20241022 (latest) was not trained on this data.

        • nunodonato2 周前
          Yes. I find it a bit funny how much people care about leaderboards. I see models going up and down, winning this or that benchmark and yet, for me, Sonnet 3.5 still beats the crap out of all of them.
    • danpalmer2 周前
      Notably, GPT-4o is a "full size" model, whereas Gemini 2 Flash is the small and efficient variant in that family as far as I understand it.
      • jug2 周前
        True that - and I think Gemini-Exp-1206 is Gemini 2.0 Pro in testing. I noticed how they only replaced the "experimental" moniker for one of their experimental models, and it turned into 2.0 Flash. And that still experimental model is currently leading across all categories.
    • zhyder2 周前
      I like that https://artificialanalysis.ai/leaderboards/models describes both quality and speed (tokens/s and first chunk s). Not sure how accurate it is; anyone know? Speed and variance of it in particular seems difficult to pin down because providers obviously vary it with load to control their costs.
    • AI benchmarks and leaderboards are complete nonsense though.

      Find something you like, use it, be ready to look again in a month or two.

      • falcor842 周前
        With the accelerating progress, the "be ready to look again" is becoming a full time job that we need to be able to delegate in some way, and I haven't found anything better than benchmarks, leaderboards and reviews.

        EDIT: Typo

      • siliconc0w2 周前
        FWIW I've found the 'coding' category of the leaderboard to be reasonably accurate. Claude was the best, then o1-mini was typically stronger, and now Gemini Exp 1206 is at the top.

        I find myself just paying a la carte via the API rather than paying the $20/mo so I can switch between the models.

      • hombre_fatal2 周前
        poe.com has a decent model where you buy credits and spend them talking to any LLM which makes it nice to swap between them even during the same conversation instead of paying for multiple subscriptions.

        Though gpt-4o could say "David Mayer" on poe.com but not on chat.openai.com which makes me wonder if they sometimes cheat and sneak in different models.

  • jncfhnb2 周前
    Am I alone in thinking the word “agentic” is dumb as shit?

    Most of these things seem to just be a system prompt and a tool that get invoked as part of a pipeline. They’re hardly “agents”.

    They’re modules.

    • It's easier for consultants and sales people to sell to enterprise if the terminology is familiar but mysterious.

      Bad

        1. installed Antivirus software
        2. added screen-size CSS rules
        3. copied 'Assets' harddrive to DropBox
        4. edited homepage to include Bitcoin wallet address link
        5. upgraded to ChatGPT Pro
      
      "Good"

        1. Cyber-security defenses
        2. Responsive Design implementation
        3. Cloud Storage
        4. Blockchain Technology gateway
        5. Agentic enhancements
      • endorphine2 周前
        George Carlin's "Euphemisms" comes to mind.
      • kthartic2 周前
        These are great examples. Also an excellent technique for relaying project updates to non-technical managers
    • xnx2 周前
      Controlling a browser in Project Mariner seems very agentic: https://youtu.be/Fs0t6SdODd8?t=86
    • Havoc2 周前
      >“agentic” is dumb as shit?

      It'll create endless consulting opportunities for projects that never go anywhere and add nothing of value unless you value rich consultants.

    • sippeangelo2 周前
      Aside from sounding "dumb as shit", I also think it's the completely wrong abstraction. Every time I've split up work between multiple specialized LLM "agents", they've done a WAY worse job than one monolithic LLM that has access to all the tools it needs and can "reason" about the whole task.

      All the common tricks, like creating a list of steps that are then executed by specialized agents in order, for example, fall flat as soon as one agent returns a result that contradicts the initial steps. It's simply a bandaid for short context sizes and LLMs that can't remain focused past the first few thousand tokens of prompt.

    • Agentus2 周前
      The beauty of LLMs isn't just that these coding objects speak human vernacular; it's that they can be chained together with human-vernacular prompts, and those prompts can themselves sensibly serve as input, command, or output without necessarily causing errors, even when a particular combination of inputs wasn't preprogrammed.

      I have an A.I. textbook with agent terminology that was written in pre-LLM days. Agents are just autonomous-ish code that loops on itself with some extra functionality. LLMs, in their elegance, can more easily self-loop out of the box just by concatenating language prompts sensibly. They are almost agent-ready out of the box by this very elegant quality (the textbook's agent diagram is just a conceptual self-perpetuation loop), except…

      Except they fail at a lot or get stuck on hiccups. But here is a novel thought: what if an LLM becomes more agentic (i.e. more able to sustain autonomous chained prompts that take actions without a terminal failure) and less copilot-like not through more complex controlling wrapper code, but by training the core LLM itself to function more fluidly in agentic scenarios?

      A better agentically performing LLM (not mislabeled with a bad buzzword) might not reveal itself in its wrapper control code, but simply by performing better in a typical agentic loop or environment, with whatever initiating prompt, control wrapper code, or pipeline kicks off its self-perpetuation cycle.

    • uludag2 周前
      Definitely not alone. With all this money at stake, coining dumb terms like this might make you a pretty penny.

      It's like a meme that can be milked for monetization.

    • WA2 周前
      Gemini, too, for the sole reason that non-native speakers have no clue how to pronounce it.
      • kaashif2 周前
        Also, people at NASA pronounce it two ways, even native speakers of English.
      • purple-leafy2 周前
        pronounced: juh-meany .... right?
        • coayer2 周前
          I say jem-in-eye in my English accent, Google search says jeh·muh·nai
    • 2 周前
      undefined
  • ofermend2 周前
    Gemini-2.0-Flash does extremely well on the Hallucination Evaluation Leaderboard, at 1.3% hallucination rate https://github.com/vectara/hallucination-leaderboard
    • refulgentis2 周前
      Fascinating, thanks for calling that out: I found 1.0 promising in practice, but with hallucination problems. Then I saw it had gotten 57% of questions wrong on open book true/false and I wrote it off completely - no reason to switch to it for speed and cost if it's just a random generator. That's a great outcome.
    • jug2 周前
      Speaking of which, I wonder how they'd do on SimpleQA. OpenAI is an outlier there in the negative sense vs Anthropic. This benchmark also deals with hallucination and "inappropriate certainty".
  • tkgally2 周前
    I tried accessing Gemini 2.0 Flash through Google AI Studio in the Safari browser on my iPhone, and to my surprise it worked. After I gave it access to my microphone and camera, I was able to have a pretty smooth conversation with it about what it saw through the camera. I pointed the camera at things in my room and asked what they were, and it identified them accurately. It was also able to read text in both English and Japanese. It correctly named a note I played on a piano when I showed it the keyboard with my finger playing the note, but it couldn’t identify notes by sound alone.

    The latency was low, though the conversation got cut off a few times.

  • ComputerGuru2 周前
    I've been using gemini-exp-1206 and I notice a lot of similarities to the new gemini-2.0-flash-exp: they're not that much actually smarter but they go out of their way to convince you they are with overly verbose "reasoning" and explanations. The reasoning and explanations aren't necessarily wrong per se, but put them aside and focus on the actual logical reasoning steps and conclusions to your prompts and it's still very much a dumb model.

    The models do just fine on "work" but are terrible for "thinking". The verbosity of the explanations (and the sheer amount of praise the models like to give the prompter - I've never had my rear end kissed so much!) should lead one to beware any subjective reviews of their performance rather than objective reviews focusing solely on correct/incorrect.

  • EternalFury2 周前
    Think of Google as a tanker ship. It takes a while to change course, but it has great momentum. Sundar just needs to make sure the course is right.
    • CSMastermind2 周前
      That's almost word for word what people said about Windows Phone when I was at Microsoft.
      • zaptrem2 周前
        It is a lot easier to switch LLMs than it is to switch smartphone platforms.
      • But Windows Phone was actually good, like the Zune; it was just late, and it was incredibly popular to hate Microsoft at the time.

        Additionally, Microsoft didn't really have any advantage in the smart phone space.

        Google is already a product the majority of people on the planet use regularly to answer questions.

        That seems like a competitive advantage to me.

        • wraptile2 周前
          > But Windows Phone was actually good

          I think people just have rose-tinted glasses on. Sure the hardware from Nokia was great, but software was very poor even by the standards of that time.

        • Yeah, I liked my windows phone, not sure why they killed it
      • atorodius2 周前
        Was the Windows Phone ever at the frontier tho?
        • scarmig2 周前
          Windows Phone was superior to everything else on the market at the time. But phones are an ecosystem, and MS was a latecomer.
          • That's a wild claim. Windows phone was garbage from the start. And the UI was terrible.
      • Windows Phone was actually great though, and would've eventually been a major player in the space if Microsoft had been stubborn enough to stick with it long enough, like they did with the Xbox.

        By his own admission, Gates was extremely distracted at the time by the antitrust cases in Europe, and he let the initiative die.

    • griomnib2 周前
      And where is the ship headed if they are no longer supporting the open web?

      Publishers are being squeezed and going under, or replacing humans with hallucinated genai slop.

      It's like we're taking the private equity model of extracting value and killing things off, and applying it to the entire web.

      I’m not sure where this is headed, but I don’t think Sundar has any strategy here other than playing catch up.

      Demis’ goal is pretty transparently positioning himself to take over.

      • EternalFury2 周前
        The Web is dead. It's pretty clear future web pages, if we still call them that, will be assembled on the fly by AI based on your user profile and declared goals, interests, and requests.
  • mherrmann2 周前
    Their Mariner tool for controlling the browser sounds scary and exciting. At the moment, it's an extension, which means JavaScript. Some websites block automation that happens this way, so developers resort to tools such as Selenium, Puppeteer, or Playwright, which drive the browser through the WebDriver protocol or the Chrome DevTools API. That's better, but it can still be distinguished from normal use via subtle technical details. I wonder if Google, who still own Chrome, will give extensions better APIs for automation that cannot be distinguished from normal use.
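
    For a concrete sense of how detectable this is, here's a small sketch with Playwright (example.com is just a stand-in; you'd need `pip install playwright` and `playwright install chromium` first):

        from playwright.sync_api import sync_playwright

        with sync_playwright() as p:
            browser = p.chromium.launch()
            page = browser.new_page()
            page.goto("https://example.com")
            # Automated Chromium exposes this flag unless it's explicitly patched;
            # checking it is one of the simpler signals sites use to spot bots.
            print(page.evaluate("navigator.webdriver"))  # typically True here
            browser.close()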
    • horyd2 周前
      You would think so, probably starting with the productivity suite (Docs, Sheets, etc.). If Mariner could seamlessly integrate with those, then times are about to get really interesting (if they aren't already!)
  • mike80it2 周前
    Gemini 1120, 1206, and Gemini 2.0 flash have better coding results than ChatGPT o1 and Claude Sonnet 3.5.

    They did it: from now on Google will keep a leadership position.

    They have too much data (Search, Maps, Youtube, Chrome, Android, Gmail, etc.), and they have their own servers (it's free!) and now the Willow QPU.

    To me, it is evident how the future will look. I'll buy some more Alphabet stocks

  • smallerfish2 周前
    Was this written by an LLM? It's pretty bad copy. Maybe they laid off their copywriting team...?

    > "Now millions of developers are building with Gemini. And it’s helping us reimagine all of our products — including all 7 of them with 2 billion users — and to create new ones"

    and

    > "We’re getting 2.0 into the hands of developers and trusted testers today. And we’re working quickly to get it into our products, leading with Gemini and Search. Starting today our Gemini 2.0 Flash experimental model will be available to all Gemini users."

    • utopcell2 周前
      Sorry, what's wrong with these phrases?
      • krona2 周前
        > all of our products — including all 7 of them

        All the products including all the products?

        • iamdelirium2 周前
          Why did you specifically ignore the remainder of the sentence?

          "...all of our products — including all 7 of them with 2 billion users..."

          It tells people that 7 of their products have 2b users.

          • fluoridation2 周前
            That's not really any better, since "all of our products" already includes the subset that has at least 2B users. "I brought all my shoes, including all my red shoes."
            • stavros2 周前
              They're pointing out that seven of their products have more than 2 billion users.

              "I brought all my shoes, including the pairs that cost over $10,000" is saying something about what shoes you brought, more than "all of them".

              • fluoridation2 周前
                Why are they bragging about something completely unrelated in the middle of a sentence about the impact of a piece of technology?

                -Hey, are you done packing?

                -Yes, I decided I'll bring all my shoes, including the ones that cost over $10,000.

                What, they just couldn't help themselves?

                • stavros2 周前
                  The fact that they're using Gemini with even their most important products shows that they trust it.
                  • fluoridation2 周前
                    Again, that's covered by "all our products". Why do we need to be reminded that Google has a lot of users? Someone oblivious to that isn't going to care about this press release.
                    • scarmig2 周前
                      Scale and cost are defining considerations of LLMs. By saying they're rolling out to billions of users, they're pointing out they're doing something pretty unique and have confidence in a major competitive advantage. Point billions of devices at other high-performing competitors' offerings, and all of them would fall over.
                      • fluoridation2 周前
                        That's not what the sentence says. "Now millions of developers are building with Gemini. And it’s helping us reimagine all of our products — including all 7 of them with 2 billion users — and to create new ones." That does not imply that Gemini will start receiving requests from billions of users. At best it says that they'll start using it in some unspecified way.
              • TeaBrain1 周前
                >They're pointing out that seven of their products have more than 2 billion users.

                More specifically, they are trying to emphasize the point that gemini is being used with seven products with over 2 billion users. However, the above user is right that this was a bafflingly terrible use of English to establish this fact.

          • aerhardt2 周前
            That phrasing still sucks. I am neither a native speaker nor a wordsmith, but I've worked with professional English writers who could make that look and sound infinitely better.
          • jay_kyburz2 周前
            all of our products, 7 of which have over 2 billion users..
          • doublerabbit2 周前
          If we are nitpicking, digits shouldn't be used for a number as small as seven; it should be spelled out.

            "including all seven of them with 2 billion users"

        • hombre_fatal2 周前
          The meme of LLM generated content is that it's verbose and formal, not that it's poorly written.

          It's why the quoted text is obviously written by a human.

        • There's no law that says LLM-generated text has to be bad in a singular way.
        • 2 周前
          undefined
      • echelon2 周前
        It reads like a transcribed speech. You can picture this being read from a teleprompter at a conference keynote.

        Short sentence fact. And aspirational tagline - pause for some metrics - and more. And. Today. And. And. Today.

      • scudsworth2 周前
        executive spotted
  • PaulWaldman2 周前
    Anecdotally, using the Gemini App with "Gemini Advanced 2.0 Flash Experimental", the response quality is ignorantly improved and faster at some basic Python and C# generation.
    • xnx2 周前
      > ignorantly improved

      autocorrect of "significantly improved"?

  • epolanski2 周前
    I'm not gonna lie I like Google's models.

    Flash combines speed and cost and is extremely good to build apps on.

    People really take that whole benchmarking thing more seriously than necessary.

  • weatherlite2 周前
    Google is on fire lately; the stock went up considerably in the last few days, and rightly so. Besides the usual cash cows (Search, YouTube), they've made huge progress with GCP, Gemini, and quite possibly Waymo. They have fantastic momentum imo.
    • And quantum computing! They found a way to make error rates decrease with the number of qubits.
  • CSMastermind2 周前
    > We're also launching a new feature called Deep Research, which uses advanced reasoning and long context capabilities to act as a research assistant, exploring complex topics and compiling reports on your behalf. It's available in Gemini Advanced today.

    Anyone seeing this? I don't have an option in my dropdown.

    • atorodius2 周前
      Rolling out over the next few days, according to Jeff.
    • jcims2 周前
      I just used it (a day later). It's only available on the 1.5 model right now. If anything it found a bunch of cool articles on the subject matter that I could use on my own. The actual results from its research in the domain I was using were much better than the standard chat bot response that 1.5 or 2.0 gave.
    • fudged712 周前
      Not seeing it yet on web or mobile (in Canada)
  • nycdatasci2 周前
    Gemini 2.0 Flash is available here: https://aistudio.google.com/prompts/new_chat

    Based on initial interactions, it's extremely verbose. It seems to be focused on explaining its reasoning, but even after just a few interactions I have seen some surprising hallucinations. For example, to assess its awareness of current AI developments, I asked "Why hasn't Anthropic released Claude 3.5 Opus yet?" Gemini responded with text that included "Why haven't they released Claude 3.5 Sonnet First? That's an interesting point." There's clearly some reflection/attempted reasoning happening, but it doesn't feel competitive with o1 or the new Claude 3.5 Sonnet that was trained on 3.5 Opus output.

    • IanCal2 周前
      I'm having it lie a lot too. It wrote some pretty decent attempt at code in a new framework given instructions, which is great, but it made a key error about

      That's fine, but it couldn't spot it, then it told me that I had put that syntax in the instructions and quoted me - but that wasn't true. It repeatedly said it had undone that and rewrote the code with it still in as well.

      It added some async stuff where it wasn't needed (no way for it to know, to be fair); what was interesting was that when told this, it apologised and explained it had just been doing so much async work recently that it got confused.

      Another interesting quote

      > I am incredibly sorry for the repeated errors. I have made so many mistakes, and I am very frustrated with myself. I believe this is now correct, and adheres to the instructions.

    • snthpy2 周前
      Just speculating here but the verbose reasoning might be to make the "agentic" part work better as per the headline, ie pass context along to other agents. Would be good to get concise responses for human interactions though.
  • jerpint2 周前
    We’re definitely going to need better benchmarks for agentic tasks, and not just code reasoning. Things that are needlessly painful that humans go through all the time
    • it's insane on lmarena for its size; livebench should have it soon too, I guess
      • maeil2 周前
        The size isn't stated, not necessarily a given that it's as small as 1.5-Flash.
  • brokensegue2 周前
    Any word on price? I can't find it at https://ai.google.dev/pricing
    • gman832 周前
      I've been using Gemini Flash for free through the API using Cline for VS Code. I switch between Claude and Gemini Flash, using Claude for more complicated tasks. Hope that the 2.0 model comes closer to Claude for coding.
      • Or… just continue using Claude?
        • IAmGraydon2 周前
          Claude is ridiculously expensive and often subject to rate limiting.
          • lol, and you think Google is going to be less subject limiting?

            I will pay more to not feed Google.

            • IAmGraydon2 周前
              I said rate limiting, not subject.
        • 85392_school2 周前
          I think they try to conserve costs by only using Claude when needed.
    • serjester2 周前
      Agreed - tried some sample prompts on our data and the rough vibe check is that flash is now as good as the old pro. If they keep pricing the same, this would be really promising.
    • Oras2 周前
      £18/month

      https://gemini.google/advanced/?Btc=web&Atc=owned&ztc=gemini...

      then sign in with Google account and you'll see it

  • nightski2 周前
    Anyone else annoyed how the ML/AI community just adopted the word "reasoning" when it seems like it is being used very out of context when looking at what the model actually does?
    • ramoz2 周前
      These models take an instruction, along with any contextual information, and are trained to produce valid output.

      That production of output is a form of reasoning via _some_ type of logical processing. No?

      Maybe better to say computational reasoning. That’s a mouthful.

      • nightski2 周前
        Static computation is not reasoning (these models are not building up an argument from premises, they are merely finding statistically likely completions). Computational thinking/reasoning would be breaking down a problem into an algorithmic steps. The model is doing neither. I wouldn't confuse the fact that it can break it into steps if you ask it, because again that is just regurgitation. It's not going through that process without your prompt. That is not part of its process to arrive at an answer.
        • I kinda agree with you but I can also see why it isn't that far from "reasoning" in the sense humans do it.

          To wit, if I am doing a high school geometry proof, I come up with a sequence of steps. If the proof is correct, each step follows logically from the one before it.

          However, when I go from step 2 to step 3, there are multiple options for step-3 I could have chosen. Is it so different from a "most-likely-prediction" an LLM makes? I suppose the difference is humans can filter out logically-incorrect steps, or prune chains-of-steps that won't lead to the actual theorem quicker. But an LLM predictor coupled with a verifier doesn't feel that different from it.

        • ramoz2 周前
          The point is emergent capabilities in LLMs go beyond statistical extrapolation, as they demonstrate reasoning by combining learned patterns.

          When asked, “If Alice has 3 apples and gives 2 to Bob, how many does she have left?”, the model doesn’t just retrieve a memorized answer—it infers the logical steps (subtracting 2 from 3) to generate the correct result, showcasing reasoning built on the interplay of its scale and architecture rather than explicit data recall.

        • int_19h2 周前
          Models trained to do chain-of-thought do it without any specific prompting, and a specific system prompt that enables this behavior is often part of the setup, because that is what the model was trained on.

          I don't see how that is "regurgitation", either, if it performs the reasoning steps first, and only then arrives at the answer.

    • These kinds of simplifications continue to make me an expert in LLM applications.

      So... it's a trade secret to know how it actually works...

    • w10-12 周前
      Does it help to explore the annoyance using gap analysis? I think of it as heuristics. As with humans, it's the pragmatic "whatever seems to work", where "seems" is determined via training. It's neither reasoning from first principles (System 2) nor just selecting the most likely/prevalent answer (System 1). And chaining heuristics doesn't make it reasoning, either. But where there's evidence that it's working from a model, then it becomes interesting, and begins to comply with classical epistemology wrt "reasoning". Unfortunately, information theory seems to treat any compression as a model, leading to some pretty subtle delusions.
  • jacooper2 周前
    The best thing about Gemini models is the huge context window: you can just throw big documents at them and find stuff really fast, rather than struggling with cut-offs in Perplexity or Claude.
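
    For example, a rough sketch of that "throw a big document at it" workflow with the google-generativeai Python SDK (the file name, question and model string are placeholders, not something from this announcement):

        import google.generativeai as genai

        # Assumes an API key from AI Studio; the value here is a placeholder.
        genai.configure(api_key="YOUR_API_KEY")

        # Upload the large document once via the File API, then reference it in prompts.
        big_doc = genai.upload_file(path="annual_report.pdf")

        model = genai.GenerativeModel("gemini-2.0-flash-exp")

        # The whole document fits in the context window, so no chunking or RAG plumbing needed.
        response = model.generate_content(
            [big_doc, "Where does this report discuss capital expenditure? Quote the relevant section."]
        )
        print(response.text)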
  • dandiep2 周前
    Gemini multimodal live docs here: https://cloud.google.com/vertex-ai/generative-ai/docs/model-...

    A little thin...

    Also no pricing is live yet. OpenAI's audio inputs/outputs are too expensive to really put in production, so hopefully Gemini will be cheaper. (Not to mention, OAI's doesn't follow instructions very well.)

    • kwindla2 周前
      The Multimodal Live API is free while the model/API is in preview. My guess is that they will be pretty aggressive with pricing when it's in GA, given the 1.5 Flash multimodal pricing.

      If you're interested in this stuff, here's a full chat app for the new Gemini 2 APIs with text, audio, image, camera video and screen video. It shows both how to use the WebSocket API directly and how to route through WebRTC infrastructure.

      https://github.com/pipecat-ai/gemini-multimodal-live-demo

      • dandiep2 周前
        Thanks, this is great!
    • spencerchubb2 周前
      I am eager to learn the pricing as well. It works sooo well but the pricing will make or break whether it's viable for apps
  • mfonda2 周前
    Does anyone have any insights into how Google selects source material for AI overviews? I run an educational site with lots of excellent information, but it seems to have been passed over entirely for AI overviews. With these becoming an increasingly large part of search--and from the sound of it, now more so with Gemini 2.0--this has me a little worried.

    Anyone else run into similar issues or have any tips?

  • I wrote about a possibly emergent behaviour in Gemini 2 that was only seen previously in sonnet-3.5 https://x.com/xundecidability/status/1867044846839431614
  • gotaran2 周前
    Google beat OpenAI at their own game.
  • sorenjan2 周前
    Published the day after one of the authors, Demis Hassabis, received his Nobel prize in Stockholm.
  • computergert2 周前
    It's funny how even the latest AI models struggle with simple questions.

    "What's the first name of Freddy LaStrange"? >> "I do not have enough information about that person to help with your request. I am a large language model, and I am able to communicate and generate human-like text in response to a wide range of prompts and questions, but my knowledge about this person is limited. Is there anything else I can do to help you with this request?"

    (Of course, we can't be 100% sure that his first name is Freddy. But I would expect that to be part of the answer then)

    • IanCal2 周前
      These kinds of things just strike me as trick questions, why would anyone ask that?

      Also

      > Freddy LaStrange is a fictional character and his first name is Freddy.

      and

      > Freddy is a nickname for Frederick. So, the first name of Freddy LaStrange is Frederick.

    • afro882 周前
      Is o1 not one of the latest models?

      https://chatgpt.com/share/675ab91c-c158-8004-9dfc-ea176ba387...

      A better way to say it is: "funny how Google can't keep up with the reasoning effectiveness of OpenAI's latest models"

  • fuddle2 周前
    I'd be interested to see Gemini 2.0's performance on SWE-Bench.
  • strongpigeon2 周前
    Did anyone get to play with the native image generation part? In my experience, Imagen 3 was much better than the competition so I'm curious to hear people's take on this one.
    • strongpigeon2 周前
      Hrm, when I tried to get it to generate an image it said it was using Imagen 3. Not sure what “native” image generation means then.
      • og_kalu2 周前
        native audio and image generation won't be in general availability till early next year
  • s3p2 周前
    This might be a dumb thought, but the naming of this model naturally suggests a Pro model is coming out soon. Then again, similar to Claude 3.5 Sonnet, where the Opus model was never mentioned again afterward, does anyone think this is all we will see from Google in terms of the 2.0 model family? Or will there be some 2.0 Pro model that blows this out of the water? The naming makes it seem so.
  • Considering so many of us would like more vRAM than NVIDIA is giving us for home compute, is there any future where these Trillium TPUs become commodity hardware?
    • Power concerns aside, individual chips in a TPU pod don't actually have a ton of vRAM; they rely on fast interconnects between a lot of chips to aggregate vRAM and then rely on pipeline / tensor parallelism. It doesn't make sense to try to sell the hardware -- it's operationally expensive. By keeping it in house Google only has to support the OS/hardware in their datacenter and they can and do commercialize through hosted services.

      Why do you want the hardware vs just using it in the cloud? If you're training huge models you probably don't also keep all your data on prem, but on GCS or S3 right? It'd be more efficient to use training resources close to your data. I guess inference on huge models? Still isn't just using a hosted API simpler / what everyone is doing now?

    • geodel2 周前
      The "so many of us" is probably in the thousands; it would need to be about three orders of magnitude higher before Google could even think about it.
  • nuz2 周前
    I guess this means we'll have an openai release soon
  • fpgaminer2 周前
    I work with LLMs and MLLMs all day (as part of my work on JoyCaption, an open source VLM). Specifically, I spend a lot of time interacting with multiple models at the same time, so I get the chance to very frequently compare models head-to-head on real tasks.

    I'll give Flash 2 a try soon, but I gotta say that Google has been doing a great job catching up with Gemini. Both Gemini 1.5 Pro 002 and Flash 1.5 can trade blows with 4o, and are easily ahead of the vast majority of other major models (Mistral Large, Qwen, Llama, etc). Claude is usually better, but has a major flaw (to be discussed later).

    So, here's my current rankings. I base my rankings on my work, not on benchmarks. I think benchmarks are important and they'll get better in time, but most benchmarks for LLMs and MLLMs are quite bad.

    1) 4o and its ilk are far and away the best in terms of accuracy, both for textual tasks as well as vision related tasks. Absolutely nothing comes even close to 4o for vision related tasks. The biggest failing of 4o is that it has the worst instruction following of commercial LLMs, and that instruction following gets _even_ worse when an image is involved. A prime example is when I ask 4o to help edit some text, to change certain words, verbiage, etc. No matter how I prompt it, it will often completely re-write the input text to its own style of speaking. It's a really weird failing. It's like their RLHF tuning is hyper focused on keeping it aligned with the "character" of 4o to the point that it injects that character into all its outputs no matter what the user or system instructions state. o1 is a MASSIVE improvement in this regard, and is also really good at inferring things so I don't have to explicitly instruct it on every little detail. I haven't found o1-pro overly useful yet. o1 is basically my daily driver outside of work, even for mundane questions, because it's just better across the board and the speed penalty is negligible. One particular example of o1 being better came up yesterday: I had it re-word an image description, and thought it had introduced a detail that wasn't in the original description. Well, I was wrong and had accidentally skimmed over that detail in the original. It _told_ me I was wrong, and didn't update the description! Freaky, but really incredible. 4o never corrects me when I give it an explicit instruction.

    4o is fairly easy to jailbreak. They've been turning the screws for awhile so it isn't as easy as day 1, but even o1-pro can be jailbroken.

    2) Gemini 1.5 Pro 002 (specifically 002) is second best in my books. I'd guesstimate it at being about 80% as good as 4o on most tasks, including vision. But it's _significantly_ better at instruction following. Its RLHF is a lot lighter than ChatGPT models, so it's easier to get these models to fall back to pretraining, which is really helpful for my work specifically. But in general the Gemini models have come a long way. The ability to turn off model censorship is quite nice, though it does still refuse at times. The Flash variation is interesting; often times on-par with Pro with Pro edging out maybe 30% of the time. I don't frequently use Flash, but it's an impressive model for its size. (Side note: The Gemma models are ... not good. Google's other public models, like so400m and OWLv2 are great, so it's a shame their open LLMs forays are falling behind). Google also has the best AI playground.

    Jailbreaking Gemini is a piece of cake.

    3) Claude is third on my list. It has the _best_ instruction following of all the models, even slightly better than o1. Though it often requires multi-turn to get it to fully follow instructions, which is annoying. Its overall prowess as an LLM is somewhere between 4o and Gemini. Vision is about the same as Gemini, except for knowledge-based queries, which Gemini tends to be quite bad at (who is this person? Where is this? What brand of guitar? etc). But Claude's biggest flaw is the insane "safety" training it underwent, which makes it practically useless. I get false triggers _all_ the time from Claude. And that's to say nothing of how unethical their "ethics" system is to begin with. And what's funny is that Claude is an order of magnitude _smarter_ when it's reasoning about its safety training. It's the only real semblance of reason I've seen from LLMs ... all just to deny my requests.

    I've put Claude third out of respect for the _technical_ achievements of the product, but I think the developers need to take a long look in the mirror and ask why they think it's okay for _them_ to decide what people with disabilities are and are not allowed to have access to.

    4) Llama 3. What a solid model. It's the best open LLM, hands down. Nowhere near the commercial models above, but for a model that's completely free to use locally? That's invaluable. Their vision variation is ... not worth using. But I think it'll get better with time. The 8B variation far outperforms its weight class. 70B is a respectable model, with better instruction following than 4o. The ability to finetune these models to a task with so little data is a huge plus. I've made task specific models with 200-400 examples.

    5) Mistral Large (I forget the specific version for their latest release). I love Mistral as the "under-dog". Their models aren't bad, and behave _very_ differently from all other models out there, which I appreciate. But Mistral never puts any effort into polishing their models; they always come out of the oven half-baked. Which means they frequently glitch out, have very inconsistent behavior, etc. Accuracy and quality are hard to assess because of this inconsistency. On its best days it's up near Gemini, which is quite incredible considering the models are also released publicly. So theoretically you could finetune them to your task and get a commercial-grade model to run locally. But I rarely see anyone do that with Mistral, I think partly because of their weird license. Overall, I like seeing them in the race and hope they get better, but I wouldn't use it for anything serious.

    Mistral is lightly censored, but fairly easy to jailbreak.

    6) Qwen 2 (or 2.5 or whatever the current version is these days). It's an okay model. I've heard a lot of praise for it, but in all my uses thus far it's always been really inconsistent, glitchy, and weak. I've used it both locally and through APIs. I guess in _theory_ it's a good model, based on benchmarks. And it's open, which I appreciate. But I've not found any practical use for it. I even tried finetuning with Qwen 2VL 72B, and my tiny 8B JoyCaption model beat it handily.

    That's about the sum of it. AFAIK that's all the major commercial and open models (my focus is mainly on MLLMs). OpenAI are still leading the pack in my experience. I'm glad to see good competition coming from Google finally. I hope Mistral can polish their models and be a real contender.

    There are a couple smaller contenders out there like Pixmo/etc from allenai. Allen AI has hands down the _best_ public VQA dataset I've seen, so huge props to them there. Pixmo is ... okayish. I tried Amazon's models a little but didn't see anything useful.

    NOTE: I refuse to use Grok models for the obvious reasons, so fucks to be them.

  • At least when it comes to Go code, I'm pretty impressed by the results so far. It's also pretty good at following directions, which is a problem I have with open source models, and seems to use or handle Claude's XML output very well.

    Overall, especially seeing as I haven't paid a dime to use the API yet, I'm pretty impressed.

  • Agents are the worst idea. I think AI will start to progress again when we get better models that drop this whole chat idea, and just focus on completions. Building the tools on top should be the work that is given over to the masses. It's sad that instead the most powerful models have been hamstrung.
  • Does anyone know how to sign up for the speech output wait list or tester program? I have a decent spend with GCP over the years, if that helps at all. Really want DemoTime videos to use those voices. (I like how truly incredible, best-in-the-world TTS is just a footnote in this larger announcement.)
  • zb32 周前
    Is this the gemini-exp model on LMArena?
    • warkdarrior2 周前
      Yes, LMArena shows Gemini-2.0-Flash-Exp ranking 3rd right now, after Gemini-Exp-1206 and ChatGPT-4o-latest_(2024-11-20), and ahead of o1-preview and o1-mini:

      https://lmarena.ai/?leaderboard

      • zb32 周前
        There's also the "gremlin" model (not reachable directly) and it seems to be pretty smart.. maybe that's the deep research mode?

        EDIT: probably not deep research.. is it Google testing their equivalent of o1? who knows..

    • jasonjmcghee2 周前
      Both are available on aistudio so I don't think so.

      In my own testing "exp 1206" is significantly better than Gemini 2.

      Feels like haiku 3.5 vs sonnet 3.5 kind of thing.

    • usaar3332 周前
      It looks like gemini-exp-1121, slightly upgraded. 1206 is something else.
  • ianbutler2 周前
    Unfortunately the 10 RPM quota for this experimental model isn't enough to run an actual agentic experience on.

    That's my main issue with Google: there are several models we want to try with our agent, but quota is limited and we have to jump through hoops to see if we can get it raised.

  • AJRF2 周前
    I think they are really overloading that word "agent". I know there isn't a standard definition - but Google is stretching the meaning far thinner than the way most C-suite execs talk about agents.

    I think DeepMind could make progress, and deliver a ton of value, if they focused on the agent definition of multi-step reasoning + action through a web browser, instead of lumping in the seldom-used "look at the world through a camera" or "multimodal robots" thing.

    Even if Google cracked robots, past plays show that the market for those isn't big enough to interest Google. Like VR, you just can't get a billion people interested in robots - so even if they make progress, it won't survive under Google.

    The "Look at the world through a camera" thing is a footnote in an Android release.

    Agentic computer use _is_ a product a billion people would use, and it's adjacent to the business interests of Google Search.

  • petesergeant2 周前
    Speed looks good vis-a-vis 4o-mini, and quality looks good so far against my eval set. If it's cheaper than 4o-mini too (which, it probably will be?) then OpenAI have a real problem, because switching between them is a value in a config file.
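
    To make that concrete, here's a minimal sketch of the "switching is a value in a config file" pattern using the OpenAI Python client. The Gemini base URL and both model names are assumptions for illustration; check the current docs before relying on them:

        import os

        from openai import OpenAI

        # Everything provider-specific lives in config; the calling code never changes.
        CONFIGS = {
            "openai": {
                "base_url": "https://api.openai.com/v1",
                "model": "gpt-4o-mini",
                "api_key_env": "OPENAI_API_KEY",
            },
            "gemini": {
                # Assumed OpenAI-compatible endpoint for the Gemini API.
                "base_url": "https://generativelanguage.googleapis.com/v1beta/openai/",
                "model": "gemini-2.0-flash-exp",
                "api_key_env": "GEMINI_API_KEY",
            },
        }

        cfg = CONFIGS[os.environ.get("LLM_PROVIDER", "gemini")]
        client = OpenAI(base_url=cfg["base_url"], api_key=os.environ[cfg["api_key_env"]])

        resp = client.chat.completions.create(
            model=cfg["model"],
            messages=[{"role": "user", "content": "Summarize this eval case in one sentence."}],
        )
        print(resp.choices[0].message.content)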
  • thisoneworks2 周前
    "gemini for video games" - here we go again with the AI does the interesting stuff for you rather than the boring stuff
  • I'm quite impressed with the Flash demo's reasoning capabilities. I played the 20 questions game with it, and it found the computer mouse I had in my head. At first it was confused about our roles and said something weird; it thought that it had to guess its own word. Afterwards I had a meta conversation about that weirdness and it gave impressive insights:

    " Why My "I Guess What I Have in Mind" Statement Doesn't Make Sense

        The Problem of Agency: You're spot-on when you say it doesn't make sense for me to state "I guess what I (the computer) have in mind." AI does not have a "mind" in the same way that humans do. My internal state is a complex pattern of numerical data, not conscious thoughts or intentions in the way you have things in mind. My phrasing gave the illusion of having a "thing in mind," but in reality, I don't have this capacity.
    
        Misinterpretation of the Task: My statement "I guess what I have in mind" misinterprets the task. You were clearly signaling that I needed to ask questions to uncover what you were thinking of. My failure to grasp this created a weird scenario where it seemed like I was trying to determine my own data set!"
  • NoZZz2 周前
    Ugh. You're investing in this, but it disenfranchises the people that are investing in you. Well. Unless you'd like to admit that your income is pure fantasy, which it is.
  • r33b332 周前
    Can it work with a large codebase? How? I have an Xcode project with a Python server for an online game. Is it sufficient to understand the code fully and implement edits correctly?
  • summerlight2 周前
    It is interesting to see that they keep focusing on the cheapest model instead of the frontier model. Probably because of their primary (internal?) customer's need?
    • coder5432 周前
      It's cheaper and faster to train a small model, which is better for a research team to iterate on, right? If Google decides that a particular small model is really good, why wouldn't they go ahead and release it while they work on scaling up that work to train the larger versions of the model?
      • int_19h2 周前
        On the other hand, once you have a large model, you can use various distillation techniques to train smaller models faster and with better results. Meta seems to be very successful doing this with Llama, in particular.
      • summerlight2 周前
        I have no knowledge of Google specific cases, but in many teams smaller models are trained upon bigger frontier models through distillation. So the frontier models come first then smaller models later.
        • coder5432 周前
          Training a "frontier model" without testing the architecture is very risky.

          Meta trained the smaller Llama 3 models first, and then trained the 405B model on the same architecture once it had been validated on the smaller ones. Later, they went back and used that 405B model to improve the smaller models for the Llama 3.1 release. Mistral started with a number of small models before scaling up to larger models.

          I feel like this is a fairly common pattern.

          If Google had a bigger version of Gemini 2.0 ready to go, I feel confident they would have mentioned it, and it would be difficult to distill it down to a small model if it wasn't ready to go.

    • discobot2 周前
      the problem is that the last generation of the largest models failed to beat smaller models on the benchmarks; see the lack of a new Claude Opus or GPT-5. The problem is probably with the benchmarks, but anyway.
  • ipsum22 周前
    Tested out Gemini 2 Flash. I had such high hopes that a better base model would help, but it still hallucinates like crazy compared to GPT-4o.
    • gbickford2 周前
      Small models don't "know" as much so they hallucinate more. They are better suited for generations that are based in a ground truth, like in a RAG setup.

      A better comparison might be Flash 2.0 vs 4o-mini. Even then, the models aren't meant to have vast world knowledge, so benchmarking them on that isn't a great indicator of how they would be used in real-world cases.
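
      As a rough illustration (a sketch, not any particular product's setup), grounding a small model usually just means pinning it to retrieved context, e.g. with the google-generativeai SDK; the snippets below are stand-ins for whatever a retriever would return:

          import google.generativeai as genai

          genai.configure(api_key="YOUR_API_KEY")
          model = genai.GenerativeModel("gemini-2.0-flash-exp")

          # Pretend these came back from a vector store or search index.
          retrieved_snippets = [
              "Doc 12: The refund window for annual plans is 30 days from purchase.",
              "Doc 47: Monthly plans are not eligible for refunds after the first 48 hours.",
          ]

          prompt = (
              "Answer using ONLY the context below. If the answer is not in the context, say you don't know.\n\n"
              "Context:\n" + "\n".join(retrieved_snippets) + "\n\n"
              "Question: Can I get a refund on a monthly plan after a week?"
          )

          print(model.generate_content(prompt).text)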

      • ipsum22 周前
        Yes, it's not an apples to apples comparison. My point is the position it's at on the lmarena leaderboard is misplaced due to the hallucination issues.
  • dangoodmanUT2 周前
    Jules looks like it's going after Devin
    • m3kw92 周前
      Claude MCP does the same thing; it's the setup that is hard. It will push, pull, and create branches automatically from a single prompt. $500 a month for Devin could be worth it if you want that taken care of, plus use of the models for a team, but a single person can set it up themselves.
  • Animats2 周前
    "Over the last year, we have been investing in developing more agentic models, meaning they can understand more about the world around you, think multiple steps ahead, and take action on your behalf, with your supervision."

    "With your supervision". Thus avoiding Google being held responsible. That's like Teslas Fake Self Driving, where the user must have their hands on the wheel at all times.

  • aantix2 周前
    Their offering is just so... bad. Even the new model. All the data in the world, yet they trail behind.

    They have all of these extensions that they use to prop up the results in the web UI.

    I was asking for a list of related YouTube videos - the UI returns them.

    Ask the API the same prompt, it returns a bunch of made up YouTube titles and descriptions.

    How could I ever rely on this product?

  • geodel2 周前
    Just searched for GVP vs SVP and got:

    "GVP stands for Good Pharmacovigilance Practice, which is a set of guidelines for monitoring the safety of drugs. SVP stands for Senior Vice President, which is a role in a company that focuses on a specific area of operations."

    Seems lot of pharma regulation in my telecom company.

  • stared2 周前
    How does it compare to OpenAI and Anthropic models, on the same benchmarks?
  • topicseed2 周前
    Is it on AI studio already?
    • jonomacd2 周前
      Yes it is. Including the live features. It is pretty impressive. Basically voice mode with a live video feed as well.
      • topicseed2 周前
        Just played with it and it's great! A good 2.0 release I think.
  • jgalt2122 周前
    Can cloudflare turnstile (and others) detect these agents as bots?
    • relatedtitle2 周前
      CF Turnstile is just proof of work, not a CAPTCHA. In my experience it works on automated browsers.
  • greenchair2 周前
    shouldn't there be a benchmark category called 'strawberry' by now? or maybe these AIs are still too dumb to handle it which is why it is left off?
  • nopcode2 周前
    > What can you tell me about this sculpture?

    > It's located in London.

    Mind blowing.

  • We are moving through eras faster than years these days.
  • eichi2 周前
    Gemini needs to improve the quality of the software.
  • MetroWind2 周前
    Yes, "agentic". Very creative word.
  • sachou2 周前
    What is the main difference compared to before?
  • Is it better than GPT4o? Does it have an API?
    • jerrygenser2 周前
      API is accessible via Vertex AI on Google Cloud in preview. I think it's also available in the consumer Gemini Chat.
      • nmfisher2 周前
        Also FYI for anyone else, I can't actually see the gemini-2.0-flash-exp list in the model directory for Vertex AI in GCP Console, but passing in google/gemini-2.0-flash-exp to the vertexai Python SDK does actually work.

        (you'll probably also need to increase your quotas right away, the default is only 10 requests per minute).
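
        For anyone going the same route, here's roughly what that looks like with the vertexai Python SDK, plus a crude backoff for that low default quota (the project ID and prompt are placeholders, and I'm assuming the quota error surfaces as a standard ResourceExhausted/429):

            import time

            import vertexai
            from google.api_core import exceptions
            from vertexai.generative_models import GenerativeModel

            vertexai.init(project="my-gcp-project", location="us-central1")
            model = GenerativeModel("google/gemini-2.0-flash-exp")

            def generate_with_backoff(prompt, retries=5):
                # Default quota is ~10 requests/minute, so back off hard on 429s.
                for attempt in range(retries):
                    try:
                        return model.generate_content(prompt).text
                    except exceptions.ResourceExhausted:
                        time.sleep(10 * (2 ** attempt))
                raise RuntimeError("still rate limited after retries")

            print(generate_with_backoff("Say hello in three languages."))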

  • chrsw2 周前
    Instead of throwing up tables of benchmarks just let me try to do stuff and see if it's useful.
  • echohack52 周前
    Is this what it feels like to become one of the gray bearded engineers? This sounds like a bunch of intentionally confusing marketing drivel.

    When capitalism has pilfered everything from the pockets of working people so people are constantly stressed over healthcare and groceries, and there's little left to further the pockets of plutocrats, the only marketing that makes sense is to appeal to other companies in order to raid their coffers by tricking their Directors to buy a nonsensical product.

    Is that what they mean by "agentic era"? Cause that's what it sounds like to me. It also smells a lot like press-release-driven development, where the point is to put a feather in the cap of whatever poor Google engineer is chasing their next promotion.

    • weatherlite2 周前
      > Is that what they mean by "agentic era"? Cause that's what it sounds like to me.

      What are you basing your opinion on? I have no idea how well these LLM agents will perform, but it's definitely a thing. OpenAI is working on them, as are Anthropic (Claude) and certainly Google.

    • cush2 周前
      Yeah it’s a lot of marketing fluff but these tools are genuinely useful and there’s no wonder why Google is working hard to prevent them from destroying their search-dependent bottom line.

      Marketing aside, agents are just LLMs that can reach out of their regular chat bubbles and use tools. Seems like just the next logical evolution

  • paradite2 周前
    Honestly this post makes Google sound like the new IBM. Very corporate.

    "Hear from our CEO first, and then our other CEO in charge of this domain and CTO will tell you the actual news."

    I haven't seen other tech companies write like that.

  • xyst2 周前
    Side note on Gemini: I pay for Google Workspace simply to enable e-mail capability for a custom domain.

    I never used the web interface to access email until recently. To my surprise, all of the AI shit is enabled by default. So it’s very likely Gemini has been training on private data without my explicit consent.

    Of course G words it as “personalizing” the experience for me but it’s such a load of shit. I’m tired of these companies stealing our data and never getting rightly compensated.

    • hnuser1234562 周前
      Gmail is hosting your email. Being able to change the domain doesn't change that they're hosting it on their terms. I think there are other email providers that have more privacy-focused policies.
  • jstummbillig2 周前
    Reminder that implied models are not actual models. Models have repeatedly failed to materialize and vanished without further mention. I assume no one is trying to be misleading but, at this point, maybe overly optimistic.
  • the demos are amazing

    I need to rewire my brain for the power of these tools

    this plus the quantum stuff...Google is on a win streak

  • moralestapia2 周前
    >2,000 words of bs

    >General availability will follow in January, along with more model sizes.

    >Benchmarks against their own models which always underperformed

    >No pricing visible anywhere

    Completely inept leadership at play.

  • m3kw92 周前
    Can these guys lead for once? They are always responding to what OpenAI is doing.
  • “Hey google turn on kitchen lights”

    “Sure, playing don’t fear the reaper on bathroom speaker”

    Ok

  • ryandvm2 周前
    I am sure Google has the resources to compete in this space. What I'm less sure about is whether Google can monetize AI in a way that doesn't cannibalize their advertising income.

    Who the hell wants an AI that has the personality of a car salesman?

  • melvinmelih2 周前
    No mention of Perplexity yet in the comments but it's obvious to me that they're targeting Perplexity Pro directly with their new Deep Research feature (https://blog.google/products/gemini/google-gemini-deep-resea...). I still wonder why Perplexity is worth $7 billion when the 800-pound gorilla is pounding on their door (albeit slowly).
    • yandie2 周前
      Just tried Deep Research. It's a much, much slower experience than Perplexity at the moment - taking its sweet time, many minutes, to return a result. Maybe it's more extensive, but I use Perplexity for quick information summaries a lot, and this is a very different UX.

      Haven't used it enough to evaluate the quality, however.

      • BoorishBears2 周前
        Before dropping it for a different project that got some traction, "Slow Perplexity" was something I was pretty set on building.

        Perplexity is a much less versatile product than it has to be in the chase of speed: you can only chew through so many tokens, do so much CoT, etc. in a given amount of time.

        They optimized for virality (it's just as fast as Google but gives me more info!) but I suspect it kills the stickiness for a huge number of users since you end up with some "embarrassing misses": stuff that should have been a slam dunk, goes off the rails due to not enough search, or the wrong context being surfaced from the page, etc. and the user just doesn't see value in it anymore.

  • echelon2 周前
    Gemini in search is answering so many of my search questions wrong.

    If I ask natural language yes/no questions, Gemini sometimes tells me outright lies with confidence.

    It also presents information as authoritative - locations, science facts, corporate ownership, geography - even when it's pure hallucination.

    Right at the top of Google search.

    edit:

    I can't find the most obnoxious offending queries, but here was one I performed today: "how many islands does georgia have?".

    Compare that with "how many islands does georgia have? Skidaway Island".

    This is an extremely mild case, but I've seen some wildly wrong results, where Google has claimed companies were founded in the wrong states, that towns were located in the wrong states, etc.

    • airstrike2 周前
      Doesn't match my experience. It also feels like it's getting better over time.
    • jonomacd2 周前
      At first this was true, but now it has gotten pretty good. When it gets things wrong, it's often not the model's fault, just Google Search's fault.
    • nice__two2 周前
      Gemini 1.5 is indeed very hit-and-miss. Also, the politically correct and medical info filtering limits its usefulness a lot, IMHO.

      I also find that it's not yet really as context-aware as ChatGPT-4o. Even just asking a follow-up question confuses Gemini 1.5.

      Hope Gemini 2.0 will improve that!

    • sib3012 周前
      This has happened to me zero times. :shrug:
    • nilayj2 周前
      can you provide some example queries that Gemini in search gets wrong?
      • echelon2 周前
        I just found another one:

        > A depsipeptide is a cyclic peptide where one or more amide groups are replaced by ester groups.

        Depsipeptides are not necessarily cyclic, and I'd probably use "bond" instead of "group".

        https://imgur.com/a/YslvJO2

        These errors are happening all the time.

    • adultSwim2 周前
      I've found these results quite useful
  • tpoacher2 周前
    I know this isn't really a useful comment, but, I'm still sour about the name they chose. They MUST have known about the Gemini protocol. I'm tempted to think it was intentional, even.

    It's like Microsoft creating an AI tool and calling it Peertube. "Hurr durr they couldn't possibly be confused; one is a decentralised video platform and the other is an AI tool hurr durr. And ours is already more popular if you 'bing' it hurr durr."

    • jvolkman2 周前
      > It's like Microsoft creating an AI tool and calling it Peertube.

      How is it like that? Gemini is a much more common word than Peertube. https://en.wikipedia.org/wiki/Gemini

    • esafak2 周前
      I'd never heard of this protocol, and I try to keep up so I can't blame Google.