304 points | by Quizzical4230 | 1 day ago
1. The option to somehow see _how_ the paper was reviewed and/or cited, if at all. There are things like OpenReview, see example [1]
2. The ability to "tell me a story to get up to speed" about a collection of papers. Generative models could help here -- but essentially, I want this thing to be able to write a paragraph for what one might find in the literature review / related work of a paper, with citations. :-)
All the best!
2. This is good feedback: having models write the Introduction section! I was planning to keep this search engine a little more traditional, but if the results are good, then that should be the way forward.
Thank you, Happy Holidays! :D
If you expand beyond arXiv, keep in mind that coverage matters for lit reviews; unfortunately, the big publishers (Elsevier and Springer) are forcing other indices like OpenAlex to remove abstracts, so they're harder to get.
Have you checked out other tools like undermind.ai, scite.ai, and elicit.org?
You might consider what else a dedicated product workflow for lit reviews includes besides search.
(used to work at scite.ai)
| If you expand beyond arxiv, keep in mind since coverage matters for lit reviews,
I do have PaperMatchBio [^1] for bioRxiv and PaperMatchMed [^2] for medRxiv, though I agree that having separate sites per domain isn't ideal. I have also yet to create a synchronization pipeline for these two, so their results may be a little stale.
| unfortunately the big publishers (Elsevier and Springer) are forcing other indices like OpenAlex, etc. to remove abstracts so they're harder to get.
This sounds like a real issue in expanding the coverage.
| Have you checked out other tools like undermind.ai, scite.ai, and elicit.org?
I did, but maybe not thoroughly enough. I will check these out and add complementary features.
| You might consider what else a dedicated product workflow for lit reviews includes besides search
Do you mean a reference management system like Mendeley/Zotero?
[^1]: https://papermatchbio.mitanshu.tech/
[^2]: https://papermatchmed.mitanshu.tech/
We have users with very similar use cases to yours. Want to email me? dylan@fixpoint.co. I'm one of the founders :)
The Cloudflare challenge screen at the beginning is a dealbreaker.
Random question - does anyone know why so many papers are missing from arXiv? Do they need to be submitted manually, perhaps by their authors? I'll often find papers on mathematics, physics, and computer science, but papers on biology, chemistry, and medicine are usually missing.
I think a database of all paper ids in existence and where they're posted or missing could be at least as useful as this. Because no papers written with any level of public funding (meaning most of them) should ever be missing.
I understand your concern, however, I do not have the know-how to properly combat bots that keep spamming the server and this seemed the easiest way for me to have a functional site. I would love to know some resources for beginners in this regard, if you have them.
>Random question...
arXiv is generally for submitting CS, maths, and physics papers. There are alternative preprint repositories like biorxiv.org, chemrxiv.org, and medrxiv.org for those fields. Note: arXiv is the largest of these in terms of papers hosted.
DOIs are the primary identifiers, and preprint servers are now issuing them too.
Crossref has papers by DOI. OpenAlex and SemanticScholar also have records, with different id types supported (doi, pmid, etc).
2. how much efficiency gain did you see binarising embeddings/using hamming distance?
3. why milvus over other vector stores?
4. did you automate the weekly metadata pull? just a simple cron job? anything else you need orchestrated?
User thoughts on searching for "transformers on byte level not token level": it was good, but it didn't turn up https://arxiv.org/abs/2412.09871, which is more recent and which more people might want.
Also, you might want more result density - perhaps a UI option to collapse the abstracts and display more at first glance.
2. Close to 500ms. See [^1].
3. This [^2] was the reason I went with milvus. I also assumed that more stars would result in a bigger community and hence faster bug discovery and fixes. And better feature support.
4. Yes, I automated the weekly pull here [^3]. Since I am constrained on resources, I used Hugging Face Spaces to do the automation for me :) The space keeps going to sleep, though; to avoid that, I am planning to keep calling the same space via the API/gradio_client. Let's see how that goes.
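For anyone who would rather run the weekly pull on their own box than on a Space, a plain crontab would do it. This is a hypothetical sketch; the paths, script name, and Space URL below are placeholders, not PaperMatch's actual setup:

```shell
# Hypothetical crontab entries (edit with `crontab -e`).
# Weekly metadata pull + re-embed, every Monday at 03:00.
0 3 * * 1 /usr/bin/python3 /opt/papermatch/update_embeddings.py >> /var/log/papermatch.log 2>&1

# Ping the Hugging Face Space every 30 minutes so it never sleeps
# (replace with the real Space URL).
*/30 * * * * curl -fsS https://example-space.hf.space/ > /dev/null
```

Cron fields are minute, hour, day-of-month, month, day-of-week; `*/30` means "every 30 minutes".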
| which is more recent, more people might want
Absolutely agree. I am planning to add a 'Recency' sorting option; it should balance similarity against the date published.
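A sort that balances similarity against the publication date could be as simple as a weighted blend with an exponential recency decay. A minimal sketch; the half-life and weight are illustrative knobs, not anything PaperMatch actually uses:

```python
from datetime import date

def blended_score(similarity: float, published: date, today: date,
                  half_life_days: float = 365.0,
                  recency_weight: float = 0.3) -> float:
    """Blend semantic similarity with an exponential recency decay.

    recency is 1.0 for a paper published today and halves every
    half_life_days; the final score is a weighted mix of the two.
    """
    age_days = (today - published).days
    recency = 0.5 ** (age_days / half_life_days)
    return (1 - recency_weight) * similarity + recency_weight * recency

# An older, slightly more similar paper vs. a fresh one:
old = blended_score(0.90, date(2020, 1, 1), date(2025, 1, 1))
new = blended_score(0.85, date(2024, 12, 1), date(2025, 1, 1))
```

With these knobs, the fresher paper wins despite its lower raw similarity; tuning `recency_weight` moves the balance.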
| also you might want more result density - so perhaps a UI option to collapse the abstracts and display more in the first glance.
Oh, I will surely look into it. Thank you so much for a detailed response. :D
[^1]: https://news.ycombinator.com/item?id=42507116#42509636
[^2]: https://benchmark.vectorview.ai/vectordbs.html
[^3]: https://huggingface.co/spaces/bluuebunny/update_arxiv_embedd...
Some feedback:
I tried searching for "wave function collapse algorithm", "gumin wave function collapse", "wfc" and "model synthesis" without any relevant hits to the area of research I was interested in. I got a lot of quantum computing and other physics related papers.
The "WFC algorithm" overloads the term (and has nothing to do with quantum mechanics), so it's kind of a bad case for this type of search. "Model synthesis" is way too generic, so again, it might be a bad case for this.
The first page of results using "wave function collapse algorithm" from arXiv itself gives relevant results.
Compare that to Google Scholar, which returns all relevant results:
https://scholar.google.com/scholar?hl=en&as_sdt=0%2C5&q=leak...
You might want to retrain/finetune your own embedding model instead of using a general-purpose one.
Google Scholar is a keyword-based search engine: it looks for the words as-is in the text. PaperMatch tries to find papers that are closer in meaning.
Here is an alternative approach: Take one paper that you like, copy the abstract from Google Scholar and paste it in PaperMatch. This should help you find similar papers.
Subjectively, yes. I sent this around my peers and they said it helped them find new authors/papers in the field while preparing their manuscripts.
| Is this more useful in certain domains?
I don't think I have the capacity to comment on this.
Some of the current ideas I had:
1. Online ads search for marketers: embed and index video + image ads, and allow natural-language search to find marketing inspiration.
2. Multi-platform e-commerce search for shopping: find products across Sephora, Zara, H&M, etc.
I don't know if either is a good enough business problem worth solving, though.
4. Quick lookup into the code to find relevant parts even when the wording in comments is different.
That way you’d cover what the human thinks the block is for vs what an LLM “thinks” it’s for. Should cover some amount of drift in names and comments that any codebase sees.
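The idea of covering both the human-written comment and an LLM's summary can be sketched as scoring a query against each description and keeping the better match. Here `difflib` stands in for a real embedding model, purely for illustration; all names are hypothetical:

```python
import difflib

def _sim(a: str, b: str) -> float:
    # Stand-in for embedding similarity: a real system would compare
    # vectors from an embedding model, not character sequences.
    return difflib.SequenceMatcher(None, a.lower(), b.lower()).ratio()

def score_block(query: str, human_comment: str, llm_summary: str) -> float:
    """Score a code block against a query using both the human-written
    comment and an LLM-generated summary, keeping whichever matches best.
    This tolerates drift: either description can carry the match."""
    return max(_sim(query, human_comment), _sim(query, llm_summary))

# The comment says one thing, the LLM summary another; either can hit.
s = score_block("retry failed http requests",
                "loop until the call succeeds",
                "retries an HTTP request with backoff")
```

Taking the max (rather than averaging) is what lets a stale comment coexist with an up-to-date LLM summary without dragging the score down.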
Something along similar lines that many may like: Research Rabbit - https://www.researchrabbit.ai/
I wanted PaperMatch to be open-source so that the users can understand the workflow behind it and hack it to their advantage instead of grumbling away when the results aren't to their liking.
Add a "similar papers" link to each paper, that will make this the obvious way to discover topics by clicking along the similar papers.
If you search for "UPC high performance computing evaluation", you'll see a paper with buggy characters in the author's name (the second result for that search).
https://www.youtube.com/watch?v=bq1Plo2RhYI
I'm not an expert, but I'll do it for the learning, then open-source it if it works. As far as I understand, this approach requires a vector database and an LLM, which doesn't have to be big. Technically it can be implemented as a local web server. It should be easy to use: just type and get a list sorted by relevance.
Although, atm I am only using retrieval without any LLM involved. Might try integrating if it significantly improves UX without compromising speeds.
Thanks for trying out the site!
Yes, I did binarize them for a faster search experience. However, I think the search quality degrades significantly after the first 10 results, which are the same as the fp32 search results but in a shuffled order. I am planning to add a reranking strategy to boost better results upwards.
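The binarize-then-rerank plan can be illustrated in plain Python: a cheap Hamming pass over sign-binarized codes produces a shortlist, which an fp32 cosine pass then reorders. This is a toy sketch of the general technique, not PaperMatch's actual pipeline:

```python
import math

def binarize(vec) -> int:
    """Pack a float vector into an integer bitmask: 1 where the value is positive."""
    bits = 0
    for x in vec:
        bits = (bits << 1) | (1 if x > 0 else 0)
    return bits

def hamming(a: int, b: int) -> int:
    # Number of differing bits between two binary codes.
    return bin(a ^ b).count("1")

def cosine(a, b) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def search(query, corpus, shortlist=10, topk=3):
    """Cheap Hamming pass over binary codes, then an fp32 cosine rerank.

    The coarse pass only needs XOR + popcount per candidate; the exact
    cosine is computed only for the shortlist, which is what recovers
    the ordering lost by binarization."""
    qbits = binarize(query)
    coarse = sorted(corpus, key=lambda v: hamming(qbits, binarize(v)))[:shortlist]
    return sorted(coarse, key=lambda v: cosine(query, v), reverse=True)[:topk]

q = [1.0, 1.0, -1.0, 0.5]
corpus = [[1.0, 1.0, -1.0, 0.5], [-1.0, -1.0, 1.0, -0.5], [1.0, 0.9, -1.0, 0.4]]
top = search(q, corpus, shortlist=3, topk=2)
```

In a real deployment the Hamming pass would be the vector store's binary index (e.g. Milvus's binary metric) and the rerank would read the stored fp32 vectors for just the shortlist.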
At the moment, this is plain search with no special prompts.
Here is a graph showing the difference. [^1]
A 'Known ID' is an arXiv ID already in the vector database; 'Unknown IDs' need their metadata fetched via the API. 'Text' is embedded via the model's API.
FLAT and IVF_FLAT are different indexes used for the search. [^2]
[1]: https://raw.githubusercontent.com/mitanshu7/dumpyard/refs/he...
[2]: https://zilliz.com/learn/how-to-pick-a-vector-index-in-milvu...
MixedBread supports matryoshka embeddings too so that’s another option to explore on the latency-recall curve.
Will explore it thoroughly then!
> MixedBread supports matryoshka embeddings too so that’s another option to explore on the latency-recall curve.
Yes, exactly why I went with this model!
Also, this site is not Reddit. You don't have to reply to every comment.
I am so conflicted whether to reply to this comment or not Xp
Jokes apart, the Mxbai model + Milvus gives fantastic results in fp32; it's the latency that is the issue here. I could try chopping the fp32 vectors in half, without binarizing, to see. Thanks!
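Chopping the fp32 vectors in half is exactly what Matryoshka embeddings are trained to tolerate: keep the leading dimensions and re-normalise so cosine similarity stays meaningful. A minimal sketch of that truncation step:

```python
import math

def truncate_matryoshka(vec, dim: int):
    """Keep the first `dim` dimensions of a Matryoshka embedding and
    re-normalise to unit length. Matryoshka-trained models front-load
    information, so the prefix remains a usable embedding."""
    head = vec[:dim]
    norm = math.sqrt(sum(x * x for x in head)) or 1.0
    return [x / norm for x in head]

v = [0.5, 0.5, 0.5, 0.5]        # toy 4-d unit vector
half = truncate_matryoshka(v, 2)  # keep the first 2 dims
```

Halving the dimension roughly halves distance-computation cost and index size, sitting between full fp32 and binary codes on the latency-recall curve.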
However I can give you the heads-up that the abstracts don't render well because (La)TeX is interpreted as markdown so that
Paper~1 shows something and Paper~2 shows something else
will strike through the text between the tildes (whereas they are meant to be non-breaking spaces). Similarly for the backtick, which makes text monospaced in the rendered output but is simply supposed to be the opening quote.
I will fix the LaTeX rendering ASAP.
Thank you for trying out the site! Happy Holidays :D
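One way to tackle the tilde/backtick problem described above is to translate the TeX punctuation before an abstract reaches the markdown renderer. A minimal sketch of such a preprocessing step (not PaperMatch's actual fix):

```python
def prep_abstract(text: str) -> str:
    """Convert (La)TeX punctuation so a markdown renderer doesn't
    misread it: '~' is a TeX non-breaking space (markdown would see
    strikethrough) and backticks are TeX quotes (markdown would see
    inline code). Double quotes must be handled before single ones."""
    text = text.replace("``", "\u201c").replace("''", "\u201d")  # TeX “ ”
    text = text.replace("`", "\u2018")    # TeX opening single quote ‘
    text = text.replace("~", "\u00a0")    # non-breaking space, as TeX intends
    return text

fixed = prep_abstract("Paper~1 shows X and Paper~2 shows Y, see ``below''.")
```

A fuller fix would skip math segments (`$...$`), where `~` and quotes have yet other meanings, but this covers the prose case raised above.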
As mentioned in another comment, I recently put together an embeddings database using the arXiv dataset (https://huggingface.co/NeuML/txtai-arxiv).
For those interested in the literature search space, here are a couple of other projects I've worked on that may be of interest.
annotateai (https://github.com/neuml/annotateai) - Annotates papers with LLMs. Supports searching the arxiv database mentioned above.
paperai (https://github.com/neuml/paperai) - Semantic search and workflows for medical/scientific papers. Built on txtai (https://github.com/neuml/txtai)
paperetl (https://github.com/neuml/paperetl) - ETL processes for medical and scientific papers. Supports full PDF docs.
These look like great projects, I will surely check them out :D