The art of ranking code search results: "I am often asked the question - so how do you rank your source code results?
While the ranking of web page results has a well understood set of heuristics and algorithms, this is somewhat uncharted territory as far as ranking source code goes. For web pages most search engines use some version of link analysis to derive a static (independent of the query) score for each page, and then apply the run-time query text against the page content, inbound link anchor text, and other heuristics to come up with a final score and ranking for their hits.
But what does it mean to rank source code files? How does one say that file A which has the word ‘test’ in it should rank higher than file B which also contains the same text?
In our earlier attempts at this we tried things like just boosting files that were named ‘test’ to the top of the list - that soon got to be ridiculous, when we started seeing the top 20 results all named ‘test’.
Another approach we tried was boosting the repository - not all repositories are created equal… Well, we soon got into a state where the top results were from just one repository.
We have now settled down into something we think is more meaningful to us and our users.
The filename is taken into account, but so is the project: how active it has been, how big it is, and other project-specific details. But unlike parsing web pages as just a stream of text, we do either full code parsing or some fuzzy parsing to extract meaningful syntactic elements from source code. For example, we know whether the word ‘test’ is in a comment versus it being a function call or a function definition.
So for the Krugle source code ranking recipe, we combine repository and project-level information to generate a static code file score, and then use syntactic information to boost function definitions over function calls, function calls over comment text, and so on.
In the office we still have passionate debates about what we ought to return for a general query such as ‘language:java’ - how does one rank something so generic? That, IMHO, is a user experience issue and not a ranking problem - we either need to detect these types of queries and generate alternative, meaningful results, or we need to convince our users that they shouldn’t be doing that.
Anyway, the above represents where we are after a year of work, but I’m sure it will continue to evolve. Let us know if it isn’t (or is) working for you - thanks!
"
(Via Krugle Blog.)
.. I'm pretty sure that we will see more and more of these domain-specific search engines .. Krugle is a good example of that, as the article above proves .. they use specific domain logic to improve search results, something a generic search engine just can't do ..
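
.. to make the quoted recipe a bit more concrete, here is a minimal Python sketch of the two-stage idea: a static, query-independent score per file built from project-level information, multiplied by a syntactic boost that prefers function definitions over function calls over comment text. This is not Krugle's actual implementation. The `CodeFile` fields, the weights in `SYNTAX_BOOST`, the way activity and size are averaged, and the `(1 + static) * boost` combination are all assumptions made for illustration; the article only gives the ordering of the boosts, not numbers. The filename signal mentioned in the article is left out to keep the sketch short.

```python
from dataclasses import dataclass

# Hypothetical weights: the article only states the ordering
# definition > call > comment, not the actual values.
SYNTAX_BOOST = {"definition": 3.0, "call": 2.0, "comment": 1.0}


@dataclass
class CodeFile:
    path: str
    project_activity: float  # assumed to be normalized to 0..1
    project_size: float      # assumed to be normalized to 0..1
    # (term, syntactic kind) pairs, assumed to come from a code parser
    occurrences: list


def static_score(f: CodeFile) -> float:
    """Query-independent score from project-level information."""
    return 0.5 * f.project_activity + 0.5 * f.project_size


def syntax_boost(f: CodeFile, term: str) -> float:
    """Highest boost among the places where the term occurs, 0.0 if absent."""
    boosts = [SYNTAX_BOOST[kind] for t, kind in f.occurrences if t == term]
    return max(boosts, default=0.0)


def rank(files, term):
    """Return the paths of matching files, best first."""
    scored = []
    for f in files:
        boost = syntax_boost(f, term)
        if boost > 0:  # files that never mention the term are not hits
            scored.append(((1.0 + static_score(f)) * boost, f.path))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [path for _, path in scored]


files = [
    CodeFile("a/util.py", 0.9, 0.5, [("test", "comment")]),
    CodeFile("b/runner.py", 0.2, 0.2, [("test", "definition")]),
    CodeFile("c/main.py", 0.5, 0.5, [("test", "call")]),
    CodeFile("d/other.py", 1.0, 1.0, [("parse", "definition")]),
]
ranking = rank(files, "test")
print(ranking)
```

With these made-up numbers the file that defines a function named `test` comes out on top even though its project has the lowest static score, and the file that only mentions `test` in a comment comes last .. which is the kind of domain logic a generic text search has no access to.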