We've got to get deeper
So, my pitch went super well :) I did a first one with execs, then a second one with a project manager. They all had a "eureka" moment when I explained the concept of reversing the "build a knowledge db, then query it" approach into "ask questions, whose answers build the knowledge db", and they wanted to try the product. ... Which I still haven't delivered to them. Shame on me! But there's a good reason for that.
As one of the people I've pitched to put it, being able to correctly analyze questions while they are being typed is the "nerf de la guerre" (French for "the sinews of war": the very core of the battle).
The whole vision of the product is that not only do you build your knowledge database by asking questions, but you also get suggested answers while you're typing, so that if your question has already been answered, you can find the answer without interrupting any of your coworkers. That means nobody will ever hesitate before asking something, and the people answering questions will only ever see new ones. That's a change from the sometimes robotic support coworkers give one another! So, how can a product do that?
This rings a bell
My initial answer was to use a collaborative filtering algorithm, like Spotify's. On Spotify, if a user "John" loves albums A, B, C, ..., Y and a user "Alice" loves albums B, C, D, ..., Z, you can expect that John will love Z and that Alice will love A. You don't have to go through painful music-style guessing, yay!
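The core of that idea can be sketched with a set-overlap measure. Here's a minimal, hypothetical illustration using Jaccard similarity (the album names and the `jaccard` helper are my own assumptions, not Spotify's actual algorithm):

```go
package main

import "fmt"

// jaccard returns the similarity between two sets of liked albums:
// |intersection| / |union|. 1.0 means identical tastes.
func jaccard(a, b map[string]bool) float64 {
	inter := 0
	for album := range a {
		if b[album] {
			inter++
		}
	}
	union := len(a) + len(b) - inter
	if union == 0 {
		return 0
	}
	return float64(inter) / float64(union)
}

func main() {
	john := map[string]bool{"A": true, "B": true, "C": true, "Y": true}
	alice := map[string]bool{"B": true, "C": true, "D": true, "Z": true}
	// John and Alice share B and C out of six distinct albums: 2/6.
	fmt.Printf("similarity: %.2f\n", jaccard(john, alice)) // prints "similarity: 0.33"
}
```

A high enough similarity would then justify recommending Z to John and A to Alice.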
Well, my problem is quite similar. Given a question, I want to suggest content that seems close. That is, I want to find synonyms for the most important words and search for them as well. And I want to do that without having to know which language each of my customers' companies uses. I certainly don't want to load (and update) dictionaries for every language around! Plus, it wouldn't help much, since different teams may treat some words as synonyms based on their company's culture rather than the usual meaning (what if they decided that "press room" and "pit" are synonyms?).
So I implemented something close to Spotify's idea. For each incoming question/answer, I parsed the list of words with more than 2 letters, removed duplicates, and stored them in the database. Then, while a question was being typed, I did the same for it and searched for the previous questions/answers with the highest number of words in common. This worked well ... with test data.
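In code, that word-overlap ranking looks roughly like this (a simplified sketch with made-up sample questions; the real implementation stores the word sets in a database rather than in memory):

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// words extracts the deduplicated set of words longer than 2 letters.
func words(s string) map[string]bool {
	set := map[string]bool{}
	for _, w := range strings.Fields(strings.ToLower(s)) {
		w = strings.Trim(w, ".,?!")
		if len(w) > 2 {
			set[w] = true
		}
	}
	return set
}

// overlap counts the words two sets have in common.
func overlap(a, b map[string]bool) int {
	n := 0
	for w := range a {
		if b[w] {
			n++
		}
	}
	return n
}

func main() {
	stored := []string{
		"Is there a way to delete a page?",
		"Where is the customer list document?",
	}
	query := "Where is the list of customers?"
	q := words(query)
	// Rank stored questions by how many words they share with the query.
	sort.Slice(stored, func(i, j int) bool {
		return overlap(q, words(stored[i])) > overlap(q, words(stored[j]))
	})
	fmt.Println("best match:", stored[0]) // prints "best match: Where is the customer list document?"
}
```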
Did I hear some noise?
But then, I started using the product to build my own knowledge database (yeah, I just can't count on my memory for that :P ). As more data went in, the suggestions seemed more and more random. Bummer. I implemented the same algorithm in OpenVoyce, and it turns out that while it always provides sentences that are indeed close, they just don't have the same meaning.
Typically, if you ask "Hey guys, I was wondering if there was somewhere a list of customers?", it will be ranked way closer to "Hey guys, I was wondering if there was a way to delete a page?" than to "where is the customer list document?". There is something in a sentence that there isn't in a list of albums: a lot of noise. I tried for a while to only store the rarest words of each sentence, but that broke synonym detection by throwing away the context. I just couldn't deliver the product as is; its coolest feature simply wasn't working.
This needs more brain power
That being said, the whole problem of guessing a context sounded familiar. It's what Google does! It's also one of the most common problems in NLP (Natural Language Processing). So I started by looking at NLP frameworks, and I kept running into the same piece of news: there is a revolution going on in NLP, with neural networks slowly replacing all other designs/algorithms. This definitely sounded like something I should learn about.
Well, it took me about a month, but I've finally got the gist of it and made a Go implementation of a simple feedforward neural network :)
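To give an idea of what "simple feedforward network" means here, this is a minimal forward pass in Go: each layer multiplies its input by a weight matrix, adds a bias, and applies a sigmoid activation. This is only an illustrative sketch with arbitrary fixed weights; a real network learns its weights through backpropagation, which is omitted:

```go
package main

import (
	"fmt"
	"math"
)

// sigmoid squashes any value into the range (0, 1).
func sigmoid(x float64) float64 { return 1 / (1 + math.Exp(-x)) }

// layer computes one fully connected layer: for each neuron, a weighted
// sum of the inputs plus a bias, passed through the activation.
func layer(input []float64, weights [][]float64, bias []float64) []float64 {
	out := make([]float64, len(weights))
	for i, neuron := range weights {
		sum := bias[i]
		for j, w := range neuron {
			sum += w * input[j]
		}
		out[i] = sigmoid(sum)
	}
	return out
}

func main() {
	// Toy 2-3-1 network with arbitrary weights, for illustration only.
	hiddenW := [][]float64{{0.5, -0.2}, {0.1, 0.8}, {-0.3, 0.4}}
	hiddenB := []float64{0.1, -0.1, 0.0}
	outW := [][]float64{{0.7, -0.5, 0.2}}
	outB := []float64{0.05}

	hidden := layer([]float64{1.0, 0.5}, hiddenW, hiddenB)
	output := layer(hidden, outW, outB)
	fmt.Printf("output: %.3f\n", output[0])
}
```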
So here is the idea: when a new company starts using Erudit, I make all searches doing a strict word search (that is, looking for other questions/answers with the same rarest words in it). As they add content, I feed it in a neural network to generate words embedding for the company (that is, to find relations between words in the context of their company). Then, I start feeding the bottom of the suggestion list with answers/questions my word vectors think are related. If one of those suggestions is clicked, I consider it a strong indicator it was a good one and feed the network back. Then, when the network results get more clicks than the strict match one, I switch to only use the network. Sounds like a plan, let's build!