Show HN: ArXiv-txt, LLM-friendly ArXiv papers

(arxiv-txt.org)

20 points | by jerpint 2 days ago

5 comments

  • lgas 1 day ago
    It just extracts the abstracts?
    • jerpint 1 day ago
      For now, yes - abstracts and other metadata
      • rrekaf 1 day ago
        do you plan on adding descriptions of figures and tables?
        • jerpint 1 day ago
          will probably focus on getting the text out of the papers first, figures might be a good next step after that
  • cchance 12 hours ago
    Was super excited that it was going to be the actual papers. Kinda cool, but just being abstracts doesn't go very far. Good luck getting the papers working, that's gonna be pretty cool once it's working, then to feed it all into a vector db XD
  • sbpost 1 day ago
    The example you give doesn't seem to work - the raw txt does not have authors.
    • jerpint 1 day ago
      you're right - I hadn't noticed! I fixed it now, thanks for pointing it out
  • jmartin2683 1 day ago
    This would be awesome wrapped in an MCP server/tool call :)
    • jerpint 1 day ago
      whoa - I haven't played with MCP yet - might be a good first project!
  • westurner 1 day ago
    If you train an LLM on only formally verified code, it should not be expected to generate formally verified code.

    Similarly, if you train an LLM on only published ScholarlyArticles ['s abstracts], it should not be expected to generate publishable or true text.

    Traceability for Retraction would be necessary to prevent lossy feedback.