11 comments

  • ed 5 hours ago
    Elegant architecture, trained from scratch, excels at image editing. This looks very interesting!

    From https://arxiv.org/html/2409.11340v1

    > Unlike popular diffusion models, OmniGen features a very concise structure, comprising only two main components: a VAE and a transformer model, without any additional encoders.

    > OmniGen supports arbitrarily interleaved text and image inputs as conditions to guide image generation, rather than text-only or image-only conditions.

    > Additionally, we incorporate several classic computer vision tasks such as human pose estimation, edge detection, and image deblurring, thereby extending the model’s capability boundaries and enhancing its proficiency in complex image generation tasks.

    This enables prompts for edits like: "|image_1| Put a smile face on the note." or "The canny edge of the generated picture should look like: |image_1|". A rough sketch of the corresponding API call is below.

    > To train a robust unified model, we construct the first large-scale unified image generation dataset X2I, which unifies various tasks into one format.
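
    A minimal sketch of what that looks like with the pipeline from the project's README (VectorSpaceLab/OmniGen) -- the model ID, argument names, and the <img><|image_1|></img> placeholder syntax are taken from the README, so treat the details as approximate:

      # Setup per the README:
      #   git clone https://github.com/VectorSpaceLab/OmniGen
      #   cd OmniGen && pip install -e .
      from OmniGen import OmniGenPipeline

      pipe = OmniGenPipeline.from_pretrained("Shitao/OmniGen-v1")

      # Interleaved text + image condition; <img><|image_1|></img> refers
      # to the first entry of input_images.
      images = pipe(
          prompt="Put a smile face on the note in <img><|image_1|></img>.",
          input_images=["note.png"],
          height=1024,
          width=1024,
          guidance_scale=2.5,
          img_guidance_scale=1.6,
          seed=0,
      )
      images[0].save("note_edited.png")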

    • nairoz 2 hours ago
      > trained from scratch

      Not exactly. They mention starting from the VAE from Stable Diffusion XL and the transformer from Phi-3.

      Looks like these LLMs can really be used for anything

  • lelandfe 5 hours ago
    I left all the defaults as-is, uploaded a small image, typed in "cafe," and 15 minutes later I'm still waiting for it to finish.
  • block_dagger 2 hours ago
    This looks promising. I love how you can reference uploaded images with markup - this is exactly what the field needs more of. After spending the last two weeks generating thousands of album cover images using DALL-E and being generally disappointed with the results (especially with the variations feature of DALL-E 2), I'm excited to give this a try.
  • 101008 2 hours ago
    I'm working on an API to generate avatars/profile pics from a prompt. I looked into training my own model, but it seems like a titanic task, impossible to do on my own. Is my best option to use an external API and then crop the face out of whatever it generates?
    • ncoronges 2 hours ago
      The simplest commercial product for fine-tuning your own model is probably Adobe Firefly, although there's no API access yet. There are cheaper and only slightly more involved options like Replicate or Civit.ai, and Replicate has solid API support.

      Check out:

      https://replicate.com/blog/fine-tune-flux
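
      To the cropping question upthread: with Replicate's Python client, generating from a fine-tuned model and cropping the face out is only a few lines. A rough, untested sketch -- "yourname/your-avatar-model" is a placeholder for whatever you fine-tune, and the crop uses OpenCV's bundled Haar cascade:

        # Requires REPLICATE_API_TOKEN in the environment.
        import urllib.request

        import cv2
        import replicate

        # Placeholder slug for a model you fine-tuned (e.g. via the FLUX
        # fine-tuning flow in the blog post above).
        output = replicate.run(
            "yourname/your-avatar-model",
            input={"prompt": "portrait avatar, studio lighting"},
        )

        # Depending on the model and client version, items come back as
        # URLs or file-like objects.
        item = output[0] if isinstance(output, list) else output
        data = item.read() if hasattr(item, "read") else urllib.request.urlopen(str(item)).read()
        with open("avatar_raw.png", "wb") as f:
            f.write(data)

        # Crop the face with OpenCV's stock frontal-face Haar cascade.
        img = cv2.imread("avatar_raw.png")
        gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
        cascade = cv2.CascadeClassifier(
            cv2.data.haarcascades + "haarcascade_frontalface_default.xml"
        )
        faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
        if len(faces):
            x, y, w, h = faces[0]
            pad = int(0.25 * w)  # arbitrary margin so the crop isn't too tight
            crop = img[max(0, y - pad):y + h + pad, max(0, x - pad):x + w + pad]
            cv2.imwrite("avatar.png", crop)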

  • wwwtyro 3 hours ago
    With consistent representation of characters, are we now on the precipice of a Cambrian explosion of manga/graphic novels/comics?
    • Multicomp 2 hours ago
      I sure hope so - at the very least I will use it for tabletop illustrations instead of having to describe a party's scenario result - I can give them a character-accurate image showing their success (or epic lack thereof).
    • fullstackwife 3 hours ago
      not yet, still can't generate transparent images
  • ilaksh 5 hours ago
    I think this type of capability will make a lot of image generation stuff obsolete eventually. In a year or two, 75%+ of what people do with ComfyUI workflows might be built into models.
  • KerryJones 4 hours ago
    Love this idea -- you have a typo in the tools list: "Satble Diffusion"
  • oatsandsugar 8 hours ago
    I mean, I struggle to even get DALL-E to iterate on one image without changing everything, so this is pretty cool
  • anyi0988 14 hours ago
    Curious what's the actual cost for each edit? Will this infra always be reliable?
    • CamperBob2 1 hour ago
      I was able to clone the repo and run it locally, even on a Windows machine, with only minimal Python dependency grief. Takes about a minute to create or edit an image on a 4090.

      It's pretty impressive so far. Image quality isn't mind-blowing, but the multi-modal aspects are almost disturbingly powerful.

      Not a lot of guardrails, either.

  • empath75 6 hours ago
    It seems like there's a lot of potential for abuse if you can get it to reliably generate AI images of real people.
  • kazishariar 6 hours ago
    Hrmm, so this is how it's gonna be moving forward then? Use a smidgen of truth, to tell the whole falsehood, and nuttin' but the falsehoods. Sheesh- but, at least the subject is real? And that's that- nuttin' else doh.
    • illumanaughty 5 hours ago
      We've been manipulating photos as long as we've been taking them.