Six things I learned at SIGIR22
This year’s SIGIR was a hybrid event with the in-person part taking place in Madrid. Here's a recap.
It's always nice to meet people who share a passion for information retrieval (IR), and the SIGIR conference is a great place to do so. This year's edition was hybrid, with the in-person part taking place in Madrid (11-15 July). My main motivation for attending was to learn about state-of-the-art evaluation in IR and about current challenges in the IR community. I wanted to be there in person because, at virtual events, I find it harder to get into conversations with topic experts or to stumble upon interesting related topics by coincidence.
Demo Sessions Rock, Remote Talks Not So Much…
Two of the biggest strengths of this conference were the Short Paper and Demo Sessions. It was really easy to get in touch with the authors, as talks were held in person, and they were super motivated to explain their work. Also, you could ask as many questions as you wanted, dive into nice discussions, and just connect. I had several nice conversations about frozen phrases in Question Answering and explaining dense retrieval results. I also had a chance to chat with the authors of frameworks on IR evaluation (pyTerrier, Cherche, ir_metadata).
Where there are strengths, there are naturally also weaknesses. There were only two 90-minute Paper and Demo Session slots, and the second one overlapped to a large degree with the Industry Track panel discussion (on dense retrieval methods). Regarding the other sessions, please don't get me wrong: the paper talks were generally very good when they were held on-site. However, many remote talks didn't quite match the quality of their in-person counterparts. What makes this troublesome is not the existence of remote talks per se but their sheer number: I'd estimate about 40% of the talks were held remotely, even though there was already a dedicated remote day on July 7th.
In conclusion, I'd say there's good potential to learn from and improve on this hybrid conference setting. A ratio of 80% on-site to 20% remote talks (and maybe two dedicated remote days instead of one) could already be a huge improvement. On the other hand, I have to admit that this might make it harder for people who are unable to travel to the venue to get the same amount of attention from the community.
The Venue Fully Compensated for the Incredibly Hot Temperatures
After almost three years of fully remote events, the organizers demonstrated that they really cared about the social program. It started with a welcome reception on one of the most beautiful rooftop terraces in Madrid on Monday, continued with a very entertaining concert on Tuesday, and wrapped up with a banquet held in a pretty remote but beautiful finca on Wednesday.
The welcome reception and the concert were great occasions to connect with the people I hadn't had a chance to speak with during the sessions (even in the incredibly hot 40 °C weather). I had a pleasant talk with someone from the University of Helsinki whose research asks whether there is an upper bound on how much context data can further improve search results. Regarding Google and Bing, the tendency is to believe "the more the better," but his work suggests otherwise (see here).
At the welcome reception, after some beers and a very good amount of tapas, I had one of the craziest discussions. I met a guy from Jerusalem studying in Duisburg and we ended up discussing the origin of religions. Honestly, I have no idea how we managed to transition from IR to that topic. This is why I love in-person conferences.
When it comes to the venue and organization, I have to mention that there were some small downsides, too. The weakest part for me, someone who runs best on a decent amount of caffeine, was the lunch breaks: coffee was only served during the two dedicated coffee breaks, not right after lunch. ;-)
High Research Focus, Lower Industry Focus
There was also an industry track (SIRIP), but unfortunately its talks often ran in parallel to the regular sessions. SIRIP was also a bit separated from the main conference: you first had to exit the building and then re-enter it through another door.
Even though most presentations didn't target my core interests (evaluation, ranking, and question answering), the SIRIP track, and especially its panel discussion, was fun. In general, there were a lot of attendees from industry. Besides companies with IR at the core of their business model, like Yext, there was also a good number of e-commerce vendors (mainly focused on recommender systems). For this very reason, I wish the tutorial I attended on the first day had also targeted industry practitioners, not just researchers.
Sponsors had nice booths, but not all of them brought their IR experts, so it sometimes felt hard to start a conversation on actual IR topics. In my opinion, Bloomberg had the best booth concept: they showed their analytics system hands-on, and the people there were very eager to dive deep into the topic. I was super happy to learn that they also knew Haystack, and we had a brief discussion about dense retrieval and how to handle no-answer situations.
Will Sparse Retrieval Be the Future?
Dense retrieval methods often show better results but still have some drawbacks compared to their sparse counterparts, for example when it comes to operating these systems. Despite recent advances in efficient vector search (for example, HNSW), keeping latency and resource usage reasonably low at query and index time for systems at scale is still challenging. Updating embedding models for online systems or running multi-embedding systems is even harder to get right.
Of course, one could argue that this is an engineering task and not a research aspect, but the industry track (SIRIP) panel discussion showed that these challenges are real. Some more work on efficiency instead of effectiveness could be a great opportunity for future research.
To drive the point home: Nils Reimers, who is well known for his work on dense retrieval and for releasing many open-source dense retrieval models, answered the question "What will be the future: dense or sparse retrieval?" with the latter (the other panelists leaned more toward dense).
This, of course, doesn't mean that there won't be any semantic neural models involved. Approaches like SPLADE have emerged that fit into a conventional sparse retrieval setting while using deep neural networks, like BERT, to create richer sparse representations. Dense representations, on the other hand, have additional advantages that the sparse ones lack, like easily incorporating external data (for example, feedback) by simply adding vectors.
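To make the distinction concrete, here is a toy sketch in plain Python (not tied to any particular library, and deliberately simplistic) contrasting a sparse bag-of-words representation, where only the few non-zero term weights are stored, with a dense vector representation, where every dimension carries weight. Approaches like SPLADE keep the sparse form but let a neural network learn the term weights.

```python
import math

# Sparse representation: most dimensions are zero, so we store only the
# non-zero term weights (here: raw term counts as a stand-in for learned weights).
def sparse_repr(text):
    weights = {}
    for term in text.lower().split():
        weights[term] = weights.get(term, 0) + 1
    return weights

def sparse_score(query, doc):
    # Dot product over the (few) overlapping terms.
    return sum(w * doc.get(t, 0) for t, w in query.items())

# Dense representation: every dimension carries some weight; real systems
# get these vectors from a neural encoder such as BERT.
def dense_score(query_vec, doc_vec):
    dot = sum(q * d for q, d in zip(query_vec, doc_vec))
    norms = math.sqrt(sum(q * q for q in query_vec)) * math.sqrt(sum(d * d for d in doc_vec))
    return dot / norms

q = sparse_repr("sparse retrieval")
d = sparse_repr("will sparse retrieval be the future of retrieval")
print(sparse_score(q, d))  # → 3: exact term overlap drives the score
```

The inverted-index-friendly sparse form is what makes these systems cheap to serve at scale, while the dense form requires approximate nearest-neighbor search over full vectors.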
Nonetheless, I should add that regardless of which approach wins the race, we'd be happy to support both with Haystack.
What About Haystack? Let's Do a Demo
In total, not many open-source frameworks or tools were showcased in the demo sessions. I think this is still a big opportunity to drive the field forward. Yes, most papers used Hugging Face's Transformers library for model implementations, but everything else around the information retrieval stack seems to be primarily home-grown, and the majority didn't release any models or code.
Exceptions were the evaluation frameworks I already mentioned in the first section. I really liked the hands-on presentations of PyTerrier and Cherche. Both teams were eager to showcase their frameworks in notebooks and didn't hesitate to run their code live. It would have been a nice addition to see more production-focused frameworks there, too.
I have to stress that I had a particularly pleasant conversation with Raphael from Cherche. Not only did he know Haystack very well, but he also gave some great tips on how to improve it further.
Situations like these were also great opportunities for me to explain and demo Haystack to others. In case you're interested too: Haystack enables anyone, not just ML researchers, to build powerful, production-ready NLP pipelines for different search use cases. You can choose from a huge number of models on Hugging Face's Model Hub, fine-tune them on your own data, and extend them with your own architectures and approaches. Haystack also comes with powerful evaluation functionality, supports common data formats like SQuAD, and fully integrates the BEIR benchmarks for information retrieval.
My Favorite Papers
There were a couple of very interesting papers, ranging from narrow topics like improvements in pre-training reader models to higher-level topics like Retrieval-Enhanced Machine Learning. SIGIR prioritized higher-level perspective papers this year, but let's start with the narrower topics. Here are my three favorites:
- Implicit Feedback for Dense Passage Retrieval: A Counterfactual Approach. An easy-to-implement yet flexible approach to incorporating clicks into dense retrieval without retraining the model.
- Pre-train a Discriminative Text Encoder for Dense Retrieval via Contrastive Span Prediction. A novel, powerful, but intuitive pre-training method for dense retrieval models (big kudos for releasing code and models).
- Offline Retrieval Evaluation Without Evaluation Metrics. Addresses label efficiency in evaluation by introducing a novel preference-based evaluation method called recall-paired preference.
Then there were the perspective papers, one of which drew by far the largest audience. Google's Donald Metzler, who once co-authored one of the must-reads in IR from the early 2010s, Search Engines: Information Retrieval in Practice, is among its authors. The paper is called Retrieval-Enhanced Machine Learning (REML).
It basically states that combining a separate prediction model with an information retrieval system (which is also responsible for storing the data) offers more advantages than using one incredibly large language model (LLM) that stores all the data implicitly in its parameters. One notable example of a REML system is our well-known Retriever-Reader pipeline for question answering: the prediction model (Reader) first queries the information retrieval system (Retriever) and then uses its results to make the final prediction. Retrieval Augmented Generation (as implemented in our RAGenerator) is another such system.
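The retrieve-then-read pattern can be sketched in a few lines of plain Python. This is a toy stand-in, not the actual Haystack or REML implementation: the retriever here ranks by simple term overlap, and the "reader" just returns the top context where a real system would run an extractive QA model.

```python
# Toy retrieve-then-read sketch: the "prediction model" (reader) queries an
# information retrieval system (retriever) and predicts from its results.

DOCS = [
    "SIGIR 2022 took place in Madrid.",
    "SPLADE creates sparse representations with BERT.",
    "Dense retrieval uses vector similarity search.",
]

def retrieve(query, docs, top_k=1):
    # Trivial sparse retriever: rank documents by query-term overlap.
    terms = set(query.lower().split())
    scored = sorted(docs, key=lambda d: len(terms & set(d.lower().split())), reverse=True)
    return scored[:top_k]

def read(query, contexts):
    # Trivial "reader": a real one would extract an answer span from the
    # retrieved contexts with a neural QA model instead of echoing them.
    return contexts[0]

query = "Where did SIGIR 2022 take place?"
answer = read(query, retrieve(query, DOCS))
print(answer)  # → "SIGIR 2022 took place in Madrid."
```

The key property the paper generalizes is the division of labor: the knowledge lives in an explicit, swappable document store rather than in the prediction model's parameters.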
So What's the Big Deal?
First, it's good to see that even though the authors work for big tech companies like Google (which have the resources to train and use LLMs at scale), they were motivated by the fact that “...focusing model development on the number of parameters is neither scalable nor sustainable in the long run.” Additionally, they go a bit further than our well-known QA and RAG systems, which are "retrieval-only" in their terms. They introduce “retrieval with memory,” “retrieval with feedback,” and the combination of both.
“Retrieval with memory” means the prediction system may store some information for future access. “Retrieval with feedback” means the prediction system may give the information retrieval system feedback on its retrieved results, so the IR system can learn from it. Storing the Reader's feedback should be a nice and simple way to generate training data for dense retrievers. Incorporating memory seems to be a more challenging task.
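One way to picture "retrieval with feedback" is the following toy sketch (again a hypothetical illustration, not anything from the paper or from Haystack): the prediction system reports which retrieved documents were actually useful, and the retriever folds that signal back into its ranking, here as a simple per-document boost where a real system would collect it as training data.

```python
# Toy "retrieval with feedback": the retriever keeps per-document boosts
# that the prediction system updates after seeing which results helped.

class FeedbackRetriever:
    def __init__(self, docs):
        self.docs = docs
        self.boost = {d: 0.0 for d in docs}  # signal learned from feedback

    def retrieve(self, query, top_k=2):
        terms = set(query.lower().split())

        def score(doc):
            overlap = len(terms & set(doc.lower().split()))
            return overlap + self.boost[doc]

        return sorted(self.docs, key=score, reverse=True)[:top_k]

    def feedback(self, doc, useful):
        # In a real system this would become training data for a dense
        # retriever; here it just nudges the document's future score.
        self.boost[doc] += 1.0 if useful else -1.0

retriever = FeedbackRetriever(["doc about dense retrieval", "doc about tapas in Madrid"])
results = retriever.retrieve("dense retrieval methods")
retriever.feedback(results[0], useful=True)  # Reader reports the top hit was useful
```

The design choice to illustrate is the direction of the arrow: information flows back from the prediction model into the IR system, instead of the usual one-way retrieve-then-predict flow.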
In the end, they also propose a research agenda for studying and further improving REML systems.
Also, Meta (Facebook) recently released another REML system to the public for research purposes, and it is getting a lot of attention these days. BlenderBot 3 was made available to everyone in the U.S. as a free online service. It is an LLM-powered chatbot that uses an information retrieval system to enrich its answers with information from "the Web." Early results (e.g., Meta's new chatbot has *opinions* about its CEO) show it's quite hard to filter LLM outputs for problematic or offensive answers with the public web as the primary knowledge source. Hence, selecting and filtering the knowledge source of the information retrieval system still seems to be crucial for REML systems to succeed.
While I'm somewhat skeptical about chatbots in general, and while BlenderBot with an LLM doesn't really fit the sustainability motivation of the REML paper, I'm curious to see how such a system would perform on a curated knowledge base, like academic or corporate articles.
All in all, this conference was super interesting. I learned a lot, made some new friends, and I hope I contributed my part toward making open-source tools like Haystack more prominent in the IR research community.
I'd recommend that anyone, whether from research or industry, participate in this conference, even though there's still some room for improvement in the hybrid setting. Just as we constantly do with our state-of-the-art search systems, hybrid events seem to require a bit of a learning process, too.
With that being said, I'm already looking forward to SIGIR23, which will take place in a more remote location (at least for me): Taipei.