Search ArXiv Fluidly

(searchthearxiv.com)

48 points | by Exorust 1 day ago

6 comments

  • Syzygies 1 day ago
    Wow! My first test query immediately found me a paper relevant to research code I'm writing. Down that rabbit hole, I almost forgot to come back and praise this project.
  • sitkack 22 hours ago
    It is pretty good, https://searchthearxiv.com/?q=https%3A%2F%2Farxiv.org%2Fabs%...

    If this is your project, please talk about how you made it.

  • elashri 1 day ago
    This seems interesting. I always thought of doing something similar, but for a specific topic in two specific arXiv categories. For the ~300K papers indexed here, does anyone have an estimate of how much this would cost using OpenAI's ada embeddings model?
    • stephantul 1 day ago
      Ada is deprecated: use the text embedding models instead.

      It depends on whether you do full-text search or abstract only. If you do full text, I’d guess about 1K tokens per page and 10 pages per paper. That would be 3B tokens, which would cost you $60 with the cheapest embedder.

      If you just do abstracts, the costs will be negligible.
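The estimate above can be checked in a few lines, using the thread's assumptions (300K papers, 10 pages per paper, ~1K tokens per page) and the per-token rate implied by the $60 figure, which matches OpenAI's published $0.02 per 1M tokens for text-embedding-3-small:

```python
# Back-of-envelope full-text embedding cost, per the numbers above.
papers = 300_000
tokens = papers * 10 * 1_000          # 10 pages/paper, ~1K tokens/page -> 3B tokens
cost = tokens / 1_000_000 * 0.02      # $0.02 per 1M tokens (cheapest embedder)
print(f"{tokens:,} tokens -> ${cost:.0f}")  # 3,000,000,000 tokens -> $60
```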

      • eden-u4 1 day ago
        This project only uses the Kaggle metadata and abstracts from arXiv. Moreover, it is "focused" on only 5-6 arXiv categories, so the costs are marginal.

        Plus you could use a mixed system: first search the abstract index for the 50 most relevant papers, then embed the full text of those 50 to assess which are truly relevant and/or meaningful.
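A minimal sketch of that mixed system, assuming abstract embeddings already exist and some `embed_full_text` function is available (both names are hypothetical stand-ins, not part of the project):

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def two_stage_search(query_vec, abstract_vecs, embed_full_text, papers, k=50, top=10):
    """Stage 1: rank every paper by its cheap abstract embedding, keep the top k.
    Stage 2: embed the full text of just those k papers and re-rank them."""
    scores = [cosine_sim(query_vec, v) for v in abstract_vecs]
    shortlist = np.argsort(scores)[::-1][:k]          # best k by abstract
    reranked = sorted(
        shortlist,
        key=lambda i: cosine_sim(query_vec, embed_full_text(papers[i])),
        reverse=True,
    )
    return [int(i) for i in reranked[:top]]
```

This keeps the expensive full-text embedding calls to k papers per query instead of the whole corpus.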

  • Euphorbium 22 hours ago
    Did they calculate embeddings for the entire archive? That must have cost a fortune.
    • hskalin 20 hours ago
      arXiv has about 2.6M articles; assuming about 10 pages per article, that's 26M pages. According to OpenAI, their cheapest embedding model (text-embedding-3-small) costs a dollar per 62.5K pages, so calculating embeddings for the whole of arXiv would cost about $416.

      I think doing it locally with an open-source model would be a lot cheaper as well, especially because they wouldn't have to keep calling OpenAI's API for each new query.

      Edit: I overlooked the about page (https://searchthearxiv.com/about); it seems they *are* using OpenAI's API, but they only have 300K papers indexed, use an older embedding model, and only calculate embeddings on the abstracts. So this should be pretty cheap.
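Spelling out that whole-of-arXiv estimate with the commenter's own numbers (2.6M articles, 10 pages each, $1 per 62.5K pages):

```python
# Whole-corpus embedding cost at the rates quoted above.
articles = 2_600_000
pages = articles * 10        # ~10 pages per article -> 26M pages
cost = pages / 62_500        # $1 buys embeddings for 62.5K pages
print(f"{pages:,} pages -> ${cost:.0f}")  # 26,000,000 pages -> $416
```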

    • machiaweliczny 20 hours ago
      Embeddings are very cheap to generate
    • HeatrayEnjoyer 22 hours ago
      What are embeddings and why are they expensive?
      • Euphorbium 21 hours ago
        Embeddings are vectors computed from chunks of documents: lists of, say, 1024 floats (the length depends on the model) that represent a short snippet of text. This kind of search works by finding the most similar vectors. Computing each one costs a fraction of a cent, but when you need to do it billions to trillions of times, it adds up.
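A toy illustration of that idea, with tiny hand-made "embeddings" and cosine similarity (real models produce hundreds to thousands of dimensions, and the vectors come from a neural network rather than by hand):

```python
import numpy as np

def cosine_similarity(a, b):
    """Similarity of two embedding vectors; 1.0 means the same direction."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical 4-dimensional embeddings for three words.
cat  = np.array([0.9, 0.1, 0.0, 0.2])
dog  = np.array([0.8, 0.2, 0.1, 0.3])
bond = np.array([0.0, 0.9, 0.8, 0.1])

print(cosine_similarity(cat, dog))   # high: related concepts point the same way
print(cosine_similarity(cat, bond))  # low: unrelated concepts point elsewhere
```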
        • orf 21 hours ago
          You could likely calculate them all on a modern MacBook easily enough.

          Searching the embeddings is a different problem, but there are lots of specialised databases that can make it efficient.
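A sketch of what brute-force search over locally computed embeddings looks like, using random stand-in vectors in place of a real model's output (a local model such as sentence-transformers' MiniLM produces 384-dimensional vectors like these):

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in corpus: 50K random unit vectors instead of real abstract embeddings.
dim = 384
corpus = rng.normal(size=(50_000, dim)).astype(np.float32)
corpus /= np.linalg.norm(corpus, axis=1, keepdims=True)

# Query: a slightly perturbed copy of document 42, so we know the right answer.
query = corpus[42] + 0.01 * rng.normal(size=dim).astype(np.float32)
query /= np.linalg.norm(query)

# Brute-force cosine search is a single matrix-vector product; the
# specialised databases mentioned above replace this with an approximate
# index (HNSW, IVF, ...) once the corpus gets much larger.
scores = corpus @ query
top = int(np.argsort(scores)[-1])
print(top)  # 42
```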

          • Euphorbium 21 hours ago
            You can, but it's a scale problem: at this size it would take an unreasonable amount of time.
  • basedrum 18 hours ago
    I asked for papers from 2024, got two, and then one from 2018.
  • canadiantim 20 hours ago
    Now do the same with BioRxiv and you'd be a hero