292 points | by Philpax 1 day ago
I'd be very interested to see what the output of their 'big model' is that feeds into the small model. I presume the small model gets a bunch of environmental input plus some input from the big model, and we know that the big model's input only updates every 30 or 40 of the small model's frames.
Like, do they just output arbitrary control tokens from the big model, embed those in the small model, and do gradient descent to find a good control 'language'? Do they train the small model on English tokens and have the big model output those? Custom coordinate tokens? (Probably.) Lots of interesting possibilities here.
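To make the first of those possibilities concrete, here is a minimal sketch assuming a discrete control vocabulary whose embeddings are learned end-to-end; the token count, dimensions, and 35-DoF action size are my guesses, not anything from Figure:

```python
# Hypothetical sketch of the "learned control language" idea: the big model
# emits discrete control-token IDs, the small policy embeds them, and training
# end-to-end lets a useful "language" emerge in the embedding table.
# Token count, dims, and the 35-DoF action size are guesses, not Figure's.
import torch
import torch.nn as nn

NUM_CONTROL_TOKENS = 256   # size of the invented control vocabulary
CONTROL_DIM = 128

class SmallPolicy(nn.Module):
    def __init__(self, proprio_dim=48, action_dim=35):
        super().__init__()
        self.control_embed = nn.Embedding(NUM_CONTROL_TOKENS, CONTROL_DIM)
        self.net = nn.Sequential(
            nn.Linear(CONTROL_DIM + proprio_dim, 256),
            nn.ReLU(),
            nn.Linear(256, action_dim),
        )

    def forward(self, control_token_ids, proprio):
        # Pool the embedded control tokens into one conditioning vector;
        # the big model would only refresh these every 30-40 frames.
        cond = self.control_embed(control_token_ids).mean(dim=1)
        return self.net(torch.cat([cond, proprio], dim=-1))

policy = SmallPolicy()
tokens = torch.randint(0, NUM_CONTROL_TOKENS, (1, 4))  # "words" from the big model
proprio = torch.randn(1, 48)                            # joint state, etc.
action = policy(tokens, proprio)                        # continuous command
```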
By the way, the dataset they describe was generated by a large (presumably much larger) vision model tasked with writing task descriptions from successful videos.
So the pipeline is roughly (sketched in code below):
* Video of the robot doing something
* (o1 or some other high-end model): "describe very precisely the task the robot was given"
* o1 output -> 7B model -> small model -> loss
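A minimal sketch of what that labeling step might look like, assuming an OpenAI-style chat client for the big captioning model; the prompt wording, "o1" model choice, function names, and episode fields are all hypothetical:

```python
# Rough sketch of the hindsight-labeling step described above, assuming an
# OpenAI-style chat client; nothing here is Figure's actual pipeline.
def label_episode(frame_urls, vlm_client):
    """Ask a large VLM to write the instruction the robot was 'given'."""
    content = [{"type": "text",
                "text": "Describe very precisely, as a single imperative "
                        "instruction, the task the robot in these frames "
                        "was performing."}]
    content += [{"type": "image_url", "image_url": {"url": u}} for u in frame_urls]
    response = vlm_client.chat.completions.create(
        model="o1",  # or any strong vision-capable model
        messages=[{"role": "user", "content": content}],
    )
    return response.choices[0].message.content

def to_training_example(episode, vlm_client):
    # The generated instruction becomes the prompt for the 7B model;
    # the recorded teleop actions supervise the small policy.
    return {
        "prompt": label_episode(episode["frame_urls"], vlm_client),
        "observations": episode["frames"],
        "targets": episode["actions"],
    }
```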
The demo space is so sterile and empty that I think we're still a long way off from the Coffee Test happening. One big thing I see is that they don't have to rearrange other items; they have nice open bins/drawers/shelves/etc. to drop the items into. That kind of multistep planning has been a thorn in autonomous robotics for decades.
You put a keyring with a bunch of different keys in front of a robot and instruct it to pick it up and open a lock while you describe which key is the correct one. Something like "Use the key with the black plastic head, and you need to put it in teeth facing down."
I have low hopes of this being possible in the next 20 years. I hope I am still alive to witness it if it ever happens.
This is basically safety-critical stuff but with LLMs. Hallucinating wrong answers in text is bad, hallucinating that your chest is a drawer to pull open is very bad.
The term I see a lot is co-robotics or corobots. At least that's what Kuka calls them.
There cannot be a safety system of this type for a generalist platform like a humanoid robot. Its possibility space is just too large.
I think the safety governor in this case would have to be a neural network that is at least as complex as the robots network, if not more so.
Which begs the question: what system checks that one for safety?
It's easy to take your able body for granted, but reality comes to meet all of us eventually.
When I hire a plumber or a mechanic or an electrician, I'm not just paying for muscle. Most of the value these professionals bring is experience and understanding. If a video-capable AI model is able to assume that experience, then either I can do the job myself or hire some 20 year old kid at roughly minimum wage. If capabilities like this come about, it will be very disruptive, for better and for worse.
Case in point: I remember about ten years ago our washing machine started making noise from the drum bearing. Found a YouTube tutorial for bearing replacement on the exact same model, but 3 years older. Followed it just fine until it was time to split the drum. Then it turned out that in the newer units like mine, some rent-seeking MBA fuckers had decided more profits could be had if they plastic-welded the entire drum assembly shut. Which was then a $300 replacement part for a $400 machine.
An AI doesn't help with this type of shit. It can't know the unknown.
If you can't get a tutorial on your exact case you learn about the problem domain and intuit from there. Usually it works out if you're careful, unlike software.
Or you could wear it while you cook and it could give you nutrition information for whatever it is you cooked. Armed with that it could make recommendations about what nutrients you're likely deficient in based on your recent meals and suggest recipes to remedy the gap--recipes based on what it knows is already in the cupboard.
But none of the kitchen stuff we learned had anything to do with making sure this week's shopping list gets you enough zinc next week, or with the kind of prep that uses the other half of yesterday's cauliflower in tomorrow's dinner so that it doesn't go bad.
These aren't hard problems to solve if you've got time to plan, but they are hard to solve if you are currently at the grocery store and can't remember that you've got half a cauliflower that needs an associated recipe.
Presumably, they won't, as this is still a tech demo. One can take this simple demonstration and think about some future use cases that aren't too different. How far away is something that'll do the dishes, cook a meal, or fold the laundry, etc.? That's a very different value prop, and one that might attract a few buyers.
It's similar to losing the calluses on your hands if you don't do manual labor or go to the gym.
I think the key point why this "reverse cyborg" idea is not as dystopian as, say, being a worker drone in a large warehouse where the AI does not let you go to the toilet is that the AI is under your own control, so you decide on the high level goal "sort the stuff away", the AI does the intermediate planning and you do the execution.
We already have systems like that: every time you use your nav, you tell it where you want to go, it plans the route and gives you primitive commands like "at the next intersection, turn right". So why not have those for cooking, doing the laundry, etc.?
Heck, even a paper calendar is already kinda this, as in separating the planning phase from the execution phase.
For "stuff" I think a bigger draw is having it so it can let me know "hey you already have 3 of those spices at locations x, y, and z, so don't get another" or "hey you won't be able to fit that in your freezer"
> Manna told employees what to do simply by talking to them. Employees each put on a headset when they punched in. Manna had a voice synthesizer, and with its synthesized voice Manna told everyone exactly what to do through their headsets. Constantly. Manna micro-managed minimum wage employees to create perfect performance.
I'd totally use that to clean my garage so that later I can ask it where the heck I put the thing or ask it if I already have something before I buy one...
In other words: I'm sorry, but that's how reality turned out. Robots are better at thinking, humans better at laboring. Why fight against nature?
(Just joking... I think.)
This can be done, of course; in your statement the phrase "just figure out" is doing a lot more heavy lifting than you allude to.
https://www.aliexpress.com/w/wholesale-clothes-folding-machi...
Vision+language multimodal models seem to solve some of the hard problems.
Do you really expect the oligarchs to put up with the environmental degradation of 8 billion humans when they can have a pristine planet to themselves with their whims served by the AI and these robots?
I fully anticipate that when these things mature enough we'll see an "accidental" pandemic sweep and kill off 90% of us. At least 90%.
Fortunately, robotic capability like that basically becomes the equivalent of Nuclear MAD.
Unfortunately, the virus approach probably looks fantastic to extremist bad actors with visions of an afterlife.
I'd rather have less waged labour and more time for chores with the family.
The article mentions that the system in each robot uses two AI models.
> S2 is built on a 7B-parameter open-source, open-weight VLM pretrained on internet-scale data
and the other S1, an 80M parameter cross-attention encoder-decoder transformer, handles low-level [motor?] control.
It feels like, although the article is quite openly technical, they are leaving out the secret sauce? So they use an open-source VLM to identify the objects on the counter, and another model to generate the mechanical motions of the robot. What part of this system understands the 3-dimensional space of that kitchen?
How does the robot closest to the refrigerator know to pass the cookies to the robot on the left?
How is this kind of speech to text, visual identification, decision making, motor control, multi-robot coordination and navigation of 3d space possible locally?
> Figure robots, each equipped with dual low-power-consumption embedded GPUs
Is anyone skeptical? How much of this is possible vs. a staged tech demo to raise funding?

1. S2 is a 7B VLM. It is responsible for taking in the camera streams (however many there are), running prompt-guided text generation, and the latent encoding is taken directly from just before the lm_head (or a few layers leading into it);
2. S1 is where they collected a few hundred hours of teleoperation data, retrospectively came up with the prompts for 1, and then trained from scratch;
Whether S2 is finetuned along with S1 is an open question; at minimum there is an MLP adapter that is finetuned, but it could be that the whole 7B VLM is finetuned too.
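For what it's worth, here is a minimal sketch of how I read that S2 -> S1 hand-off, with a trainable MLP adapter in between; all dimensions, rates, and module names are guesses, not Figure's:

```python
# Minimal sketch of the S2 -> S1 hand-off as I read points 1-2: the 7B VLM's
# pre-lm_head latent goes through an MLP adapter, and a small fast policy
# cross-attends to it while the latent stays fixed for many control steps.
import torch
import torch.nn as nn

class LatentAdapter(nn.Module):
    """MLP mapping the VLM's hidden states into the policy's space."""
    def __init__(self, vlm_dim=3584, policy_dim=512):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(vlm_dim, policy_dim), nn.GELU(),
            nn.Linear(policy_dim, policy_dim),
        )

    def forward(self, vlm_hidden):               # (B, T, vlm_dim)
        return self.mlp(vlm_hidden)

class S1Policy(nn.Module):
    """Fast cross-attention policy conditioned on the (stale) S2 latent."""
    def __init__(self, obs_dim=160, policy_dim=512, action_dim=35):
        super().__init__()
        self.obs_proj = nn.Linear(obs_dim, policy_dim)
        self.cross_attn = nn.MultiheadAttention(policy_dim, 8, batch_first=True)
        self.head = nn.Linear(policy_dim, action_dim)

    def forward(self, obs, s2_latent):
        q = self.obs_proj(obs).unsqueeze(1)                 # (B, 1, D)
        ctx, _ = self.cross_attn(q, s2_latent, s2_latent)   # attend to S2 tokens
        return self.head(ctx.squeeze(1))                    # (B, action_dim)

adapter, policy = LatentAdapter(), S1Policy()
s2_latent = adapter(torch.randn(1, 16, 3584))   # refreshed a few times per second
for _ in range(25):                             # fast inner loop reusing the latent
    action = policy(torch.randn(1, 160), s2_latent)
```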
It looks plausible, but I am still skeptical about the generalization claim, given that it is all fine-tuned on household tasks. But nowadays it is really difficult to understand how these models generalize.
> What part of this system understands the 3-dimensional space of that kitchen?
The visual model "understands" it most readily, I'd say -- like a traditional Waymo CNN "understands" the 3D space of the road. I don't think they've explicitly given the models a pre-generated point cloud of the space, if that's what you're asking. But maybe I'm misunderstanding?

> How does the robot closest to the refrigerator know to pass the cookies to the robot on the left?
It appears that the robot is being fed plain-English instructions, just like any VLM would be -- instead of the very common `text+av => text` paradigm (classifiers, perception models, etc.) or the less common `text+av => av` paradigm (segmenters, art generators, etc.), this is `text+av => movements`. Feeding the robots the appropriate instructions at the appropriate time is a higher-level task than is covered by this demo, but I think it's pretty clearly doable with existing AI techniques (/a loop).
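Something like this toy loop is all I mean -- the coordinator is scripted here, but it could just as well be another LLM; every interface shown is a stub, not Figure's API:

```python
# A toy version of the "loop": a coordinator hands each robot a plain-English
# instruction and lets the text+av => movements policy do the rest.
class StubRobot:
    def __init__(self):
        self.steps = 0

    def observe(self):                # camera frames would come from here
        return "frames"

    def act(self, actions):
        self.steps += 1

    def task_done(self):
        return self.steps >= 3        # pretend the task finishes quickly

def stub_policy(instruction, frames):
    return f"motor targets for: {instruction}"

TASKS = [
    ("robot_near_fridge", "Hand the bag of cookies to the robot on your left."),
    ("robot_on_left", "Take the cookies and place them in the open drawer."),
]

robots = {name: StubRobot() for name, _ in TASKS}
for name, instruction in TASKS:
    robot = robots[name]
    while not robot.task_done():
        actions = stub_policy(instruction, robot.observe())  # text+av in ...
        robot.act(actions)                                   # ... movements out
```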
> How is this kind of speech to text, visual identification, decision making, motor control, multi-robot coordination and navigation of 3d space possible locally?
If your question is "where's the GPUs", their "AI" marketing page[1] pretty clearly implies that compute is offloaded, and that only images and instructions are meaningfully "on board" each robot. I could see this violating the understanding of "totally local" that you mentioned up top, but IMHO those claims are just clarifying that the individual figures aren't controlled as one robot -- even if they ultimately employ the same hardware.

> Each period (7Hz?) two sets of instructions are generated. What possible combo of model types are they stringing together? Or is this something novel?
Again, I don't work in robotics at all, but have spent quite a while cataloguing all the available foundational models, and I wouldn't describe anything here as "totally novel" on the model level. Certainly impressive, but not, like, a theoretical breakthrough. Would love for an expert to correct me if I'm wrong, tho!

EDIT: Oh, and finally:
> Is anyone skeptical? How much of this is possible vs. a staged tech demo to raise funding?
Surely they are downplaying the difficulties of getting this setup perfectly, and don't show us how many bad runs it took to get these flawless clips.

They are seeking to raise their valuation from ~$3B to ~$40B this month, sooooooo take that as you will ;)
https://www.reuters.com/technology/artificial-intelligence/r...
their "AI" marketing page[1] pretty clearly implies that compute is offloaded
I think that answers most of my questions. I am also not in robotics, so this demo does seem quite impressive to me, but I think they could have been clearer on exactly what technologies they are demonstrating. Overall still very cool.
Thanks for your reply
I know they claim there's no special coding, but did they practice this task? Was there special training?
Even if this video is totally legit, I'm burned out by all the hype videos in general.
EDIT: Let alone chop an onion. Let me tell you, having a robot manipulate onions is the worst. Dealing with loose onion skins is very hard.
What is the interface from the top level to the motors?
I feel like it can't just be a neural network all the way down, right?
Huh, an interesting approach. I wonder if something like this could be used for other things as well, like "computer use", with the same concept of a "large" model handling the goals and a "small" model handling clicking and such at much higher rates -- useful for games and things like that.
200Hz is barely enough to control a motor, but it is good enough to send a reference signal to a motor controller. Usually what is done is that you have a neural network learn complex high-level behaviour and use that to produce a high-level trajectory; then you have a whole-body robot controller based on quadratic programming that does things like balancing, maintaining contacts when holding objects, or pressing against things. This requires a model of the robot dynamics so that you know the relationship between torques and accelerations. Then after that you need a motor controller that accepts reference acceleration/torque, velocity, and position commands, which are turned into 10kHz to 100kHz pulse-width-modulated signals by the motor controller. The motor controller itself is driving MOSFETs, so it can only turn them on or off, unless you are using expensive sinusoidal drivers.
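To make the rate hierarchy concrete, here is a toy simulation where a 200 Hz policy writes joint references and a 1 kHz PD loop (standing in for the QP whole-body controller) tracks them; gains, rates, and the unit-inertia dynamics are purely illustrative:

```python
# Toy simulation of the rate hierarchy above: a 200 Hz policy writes joint
# references, a 1 kHz PD loop tracks them, and the 10-100 kHz PWM stage is
# left to the motor driver (not modeled here).
import numpy as np

POLICY_HZ, INNER_HZ = 200, 1000
KP, KD = 40.0, 12.0                # toy, roughly critically damped PD gains
DT = 1.0 / INNER_HZ

def dummy_policy(q):
    # Stand-in for the learned high-level policy: a fixed joint target.
    return np.full_like(q, 0.5)

def simulate(seconds=1.0):
    q = np.zeros(7)                # joint positions
    qd = np.zeros(7)               # joint velocities
    for _ in range(int(seconds * POLICY_HZ)):
        q_ref = dummy_policy(q)                    # 200 Hz reference update
        for _ in range(INNER_HZ // POLICY_HZ):     # 5 inner ticks per reference
            torque = KP * (q_ref - q) - KD * qd    # PD tracking of the reference
            qd += torque * DT                      # unit-inertia toy dynamics
            q += qd * DT
    return q

print(simulate())                  # joints approach the 0.5 rad target
```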
Stop hosting your videos as MP4s on your web-server. Either publish to a CDN or use a platform like YouTube. Your bandwidth cannot handle serving high resolution MP4s.
/rant
The official Figure YouTube release video
I could also imagine a lot of safety constraints around leaving things outside the current task alone, so you might have to bend over backwards to get it to work on new objects.
These models are trained such that the given conditions (the visual input and the text prompt) will be continued with a desirable continuation (motor function over time).
The only dimension accuracy can apply to is desirability.
However, as it was trained on generic text data like a normal LLM, it knows what an apple is supposed to look like.
Similar to a kid who has never seen a banana but has had one described to him by his parents.
I think we're at an inflection point now where AI and robotics can be used in warfare, and we need to start having that conversation.
I have said for years that the only thing keeping us from "stabby the robot" is solving the power problem. If you can keep a drone going for a week, you have a killing machine. Use blades to avoid running out of ammo. Use IR detection to find the jugular. Stab, stab and move on. I'm guessing "traditional" vision algorithms are also sufficient to, say, identify an ethnicity and conduct ethnic cleansing. We are "solving the power problem" away from a new class of WMDs that are accessible to smaller states/groups/individuals.
And we have already reached the peak here. Small drones that are cheaply mass-produced, fly on SIM cards alone, and explode when they reach a target. That's all there is to it. You don't need a gun mounted on a Spot or a humanoid robot carrying a gun. Exploding swarms are enough.
I'm actually fairly impressed with this because it's one neural net, which is the goal, and the two-system paradigm is really cool. I don't know much about robotics, but this seems like the right direction.
If they can do that, why aren't they selling picking systems to Amazon by the tens of thousands?
If they can find suckers who accept that valuation, it's much easier to exit as a billionaire than actually make it work.
Visualize making it work. You build or buy a robot that has enough operating envelope for an Amazon picking station, provide it with an end-effector, and use this claimed general purpose software to control it. Probably just arms; it doesn't need to move around. Movement is handled by Amazon's Kiva-type AGV units.
You set up a test station with a supply of Amazon products and put it to work. It's measured on the basis of picks per minute, failed picks, and mean time before failure. You spend months to years debugging standard robotics problems such as tendon wear, gripper wear, and products being damaged during picking and placing. Once it's working, Amazon buys some units and puts them to work in real distribution centers. More problems are found and solved.
Now you have a unit that replaces one human, and costs maybe $20,000 to make in quantity. Amazon beats you down in price so you get to sell it for maybe $25,000 in quantity. You have to build manufacturing facilities and service depots. Success is Amazon buying 50,000 of them, for total income of $0.25 billion. This probably becomes profitable about five years from now, if it all works.
By which time someone in China, Japan, or Taiwan is doing it cheaper and better.
Perhaps it's possible to grip them but not to pack them?
The latest models seemed to be fluidly tied in with generating voice; even singing and laughing.
It seems like it would be possible to train a multimodal model that can do that with low-level actuator commands.
If that sounds like a cheat, neuroscientists tell us this is how the human brain works.
The article clearly spells out that it's an end-to-end LLM: text and video in, motor function out.
Technically, the text model probably has a few copies of the laws in its training data, but they are nothing more than Asimov's narrative. Laws don't (and can't) exist in a model.
That’s how I feel about LLMs and code.
Why make such sinister-looking robots though...?
But it did seem like the title of their mood board must have been "Black Mirror".
Very uncanny valley, the glossy facelessness. It somehow looks neither purely utilitarian/industrial nor 'friendly'. I could see it being based on the aesthetic of laptops and phones, i.e. consumer tech, but the effect is so different when transposed onto a very humanoid form.
> A fast reactive visuomotor policy that translates the latent semantic representations produced by S2 into precise continuous robot actions at 200 Hz
Why 200Hz...? Any robotics experts in here? Because to this layman that seems like a really high rate for updating motor controls.

Does anyone know if this trained model would work on a different robot at all, or would it need retraining?