by Brian Hendricks (BrianHeM10 in BIT330, Fall 2008)
Questions and queries
Web search engines
Software-as-a-Service (SaaS) is currently the fastest growing software sector and one of the fastest growing industries in the world. I am interested in the specific reasons why this is the case. I used the web search engines to answer: "What are some of the biggest factors causing the enormous growth rates in the SaaS industry?"
For Google, Yahoo, and Windows Live, I used the query:
"software as a service" growth factors
Blog search engines
One of the most important elements for the future of the SaaS industry is the idea behind cloud computing and hosting applications "in the cloud". Currently, Amazon's Elastic Compute Cloud (EC2) is opening up new frontiers for cloud computing. I used the blog search engines to find blog posts about: "What exactly is EC2?"
For Google Blog Search, Technorati, and Bloglines, I used the query:
Amazon EC2
Data that I collected
Search engine overlap data


Search engine ranking overlap data
This table provides a measure of how much of Google's responses are reproduced by Yahoo.

This table provides a measure of how much of Yahoo's responses are reproduced by Google.


This table provides a measure of how much of Bloglines' responses are reproduced by Google Blog Search.

This table provides a measure of how much of Google Blog Search's responses are reproduced by Bloglines.

Results
Web search
Summary Statistics


The statistics above summarize the precision of results and the overlap of results between search engines. Precision measures how well a search engine returned relevant results: it is the proportion of examined results that were relevant. Results overlap tracks the percentage of results from Live (L), Google (G), and Yahoo (Y) that also appeared in the compared sets. For example, Google's average precision was 54.4%, and on average 20% of the results examined appeared in both Yahoo and Live.
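The two measures defined above can be sketched in a few lines of Python. This is only an illustration, not the procedure used to collect the data: the function names, the relevance judgments, and the example.com URLs are hypothetical, and overlap is assumed to be computed over equal-sized result lists compared by exact URL.

```python
def precision(relevance_judgments):
    """Proportion of examined results that were judged relevant."""
    if not relevance_judgments:
        raise ValueError("no results were examined")
    relevant = sum(1 for judged_relevant in relevance_judgments if judged_relevant)
    return relevant / len(relevance_judgments)


def result_overlap(results_a, results_b):
    """Proportion of engine A's examined results that also appear in engine B's.

    Assumption: two results are "the same" only if their URLs match exactly.
    """
    if not results_a:
        raise ValueError("no results were examined")
    urls_b = set(results_b)
    shared = sum(1 for url in results_a if url in urls_b)
    return shared / len(results_a)


# Hypothetical data: 4 examined results per engine.
judgments = [True, False, True, True]
engine_a = ["example.com/1", "example.com/2", "example.com/3", "example.com/4"]
engine_b = ["example.com/3", "example.com/9", "example.com/1", "example.com/8"]

print(f"precision: {precision(judgments):.1%}")
print(f"overlap:   {result_overlap(engine_a, engine_b):.1%}")
```

Averaging these per-query values over all queries would give summary figures of the kind reported above.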



The statistics above refer to the ranking overlap between Google and Yahoo. Ranking overlap counts how many of the first 5, 10, or 20 results of one search engine appear in the first 5, 10, or 20 results of the other search engine. For example, o(10,5) in the GY table is the number of top 10 Google results that appear in the top 5 Yahoo results, and o(5,20) in the GY table is the number of top 5 Google results that appear in the top 20 Yahoo results.
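A minimal sketch of the o(m,n) measure as described above, assuming results are compared by exact URL. The function name and the short sample lists are hypothetical and are not taken from the collected data.

```python
def ranking_overlap(ranked_a, ranked_b, m, n):
    """o(m, n): number of engine A's top-m results found in engine B's top-n."""
    top_n_of_b = set(ranked_b[:n])
    return sum(1 for url in ranked_a[:m] if url in top_n_of_b)


# Hypothetical ranked result lists (best result first).
google = ["a", "b", "c", "d", "e", "f"]
yahoo = ["c", "x", "a", "y", "z", "b"]

print(ranking_overlap(google, yahoo, 3, 5))  # top 3 of Google vs. top 5 of Yahoo
print(ranking_overlap(yahoo, google, 3, 5))  # the reverse direction can differ
```

Because the measure is directional, the GY and YG tables do not have to agree, which is why both directions are reported.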
Hypothesis Test: Is Google More Precise Than Live & Yahoo?
Null Hypothesis: Google(Precision) = Live(Precision), Alternative Hypothesis: Google(Precision) > Live(Precision)
Alpha: 5%
Sample Mean(Google): 54.4, Sample Mean(Live): 42.8, Std(Google): 20.1, Std(Live): 22.8
T-Statistic: 1.5266
P-Value: .0688
Decision: Fail to reject the Null Hypothesis
Conclusion: At the 5% significance level, there is not enough evidence to conclude that Google's search results are more precise than Live's search results.
Null Hypothesis: Google(Precision) = Yahoo(Precision), Alternative Hypothesis: Google(Precision) > Yahoo(Precision)
Alpha: 5%
Sample Mean(Google): 54.4, Sample Mean(Yahoo): 51.7, Std(Google): 20.1, Std(Yahoo): 22.4
T-Statistic: .3588
P-Value: .3611
Decision: Fail to reject the Null Hypothesis
Conclusion: At the 5% significance level, there is not enough evidence to conclude that Google's search results are more precise than Yahoo's search results.
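The above tests are two-sample hypothesis tests of means. The p-value is the probability of observing a difference at least as large as the one measured if the null hypothesis were true; it is not the probability that the null hypothesis itself is true. Alpha is the significance level: the null hypothesis is rejected only when the p-value falls below alpha.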
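The test statistics above can be recomputed from the reported means and standard deviations. The sketch below is an illustration under stated assumptions, not necessarily the calculation that was originally used: it assumes 16 queries per engine (the sample size mentioned under Methodological Changes), an unequal-variance (Welch) test with Welch-Satterthwaite degrees of freedom, and a p-value obtained by numerically integrating the t density. All function names are hypothetical.

```python
import math


def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    coeff = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coeff * (1 + x * x / df) ** (-(df + 1) / 2)


def upper_tail_p(t, df, steps=2000):
    """P(T > t), using Simpson's rule on [0, |t|] (steps must be even)."""
    a = abs(t)
    h = a / steps
    total = t_pdf(0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3
    return 0.5 - area if t >= 0 else 0.5 + area


def one_sided_t_test(mean1, std1, n1, mean2, std2, n2):
    """Test H0: mean1 = mean2 against H1: mean1 > mean2 from summary statistics."""
    v1, v2 = std1 ** 2 / n1, std2 ** 2 / n2
    t = (mean1 - mean2) / math.sqrt(v1 + v2)
    # Welch-Satterthwaite degrees of freedom (assumption: variances not pooled).
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, df, upper_tail_p(t, df)


ALPHA = 0.05
N = 16  # assumed number of queries per engine

# Google vs. Live, using the sample means and standard deviations reported above.
t, df, p = one_sided_t_test(54.4, 20.1, N, 42.8, 22.8, N)
decision = "Reject" if p < ALPHA else "Fail to reject"
print(f"t = {t:.4f}, df = {df:.1f}, p = {p:.4f} -> {decision} the null hypothesis")
```

With equal sample sizes, the pooled and unequal-variance forms give the same t-statistic and differ only in the degrees of freedom, so the choice affects the p-value only slightly here.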
Blog search
Summary Statistics


The above statistics represent general statistics on the precision of results and the overlap of results between blog search engines. Please refer to the "Web Search" statistics for more information on the definitions of precision and overlap of results.



The above statistics refer to the rankings overlap between Google Blog Search (G) and Bloglines (B). Please refer to the "Web Search" statistics for more information on the definitions of ranking overlap and how to read the table.
Hypothesis Test: Is Google Blog Search More Precise Than Technorati & Bloglines?
Null Hypothesis: GoogleBlog(Precision) = Technorati(Precision), Alternative Hypothesis: GoogleBlog(Precision) > Technorati(Precision)
Alpha: 5%
Sample Mean(GBlog): 52.5, Sample Mean(Technorati): 33.1, Std(GBlog): 22.2, Std(Technorati): 21.2
T-Statistic: 2.528
P-Value: .0085
Decision: Reject the Null Hypothesis
Conclusion: At the 5% significance level, there is sufficient evidence to suggest Google Blog Search produces more precise results than Technorati.
Null Hypothesis: GoogleBlog(Precision) = Bloglines(Precision), Alternative Hypothesis: GoogleBlog(Precision) > Bloglines(Precision)
Alpha: 5%
Sample Mean(GBlog): 52.5, Sample Mean(Bloglines): 44.4, Std(GBlog): 22.2, Std(Bloglines): 14.3
T-Statistic: 1.2269
P-Value: .1155
Decision: Fail to reject the Null Hypothesis
Conclusion: At the 5% significance level, there is not enough evidence to suggest Google Blog Search produces more precise results than Bloglines.
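The above tests are two-sample hypothesis tests of means. The p-value is the probability of observing a difference at least as large as the one measured if the null hypothesis were true; it is not the probability that the null hypothesis itself is true. Alpha is the significance level: the null hypothesis is rejected only when the p-value falls below alpha.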
Discussion
Web search
Comments
The results of the Web Search overlap are not very interesting, as the numbers are relatively similar across the board. It is interesting, though, that the average overlap between Google and Yahoo (20.6%) and the number of ranking overlaps (3.7/3.8) are slightly off. It was also intriguing to see that the most common overlap percentage between Google and Yahoo was 15% higher than the overlap between Live/Google and Live/Yahoo. This suggests that there may be a strong relationship between Google's and Yahoo's search methods and site indexes.
Recommendations
First, it is concerning that Live's precision was consistently lower than Google's and Yahoo's and appears to have been driven up by a few large values. Be wary of Live's ability to return precise results; nevertheless, the hypothesis test did not conclude that Live is less precise than Google. Since the overlap between these sites was low (~20%), it is still worth the extra time to use all three. The more detailed your query is, the more likely you are to get similar results across the three search engines.
Key Learnings
I was very shocked to see that the highest overlap between Google and Yahoo was 7 out of 20. My impression had been that their results were pretty much the same, with some slight ranking differences between the two. It was also surprising to see Live produce even more dissimilar results and have a consistently lower overlap with Google and Yahoo. I certainly learned to pay closer attention to the results of Google and Yahoo and to consider using both when searching.
Possible Further Investigations
Based on the data, there are five additional investigations I would like to complete:
 How does Google compare to specialized search engines?
 What is the ranking overlap for the top 50?
 Does the overlap always grow with each additional 5/10 results?
 Is there always a strong correlation between the average overlap percentage and the o(20,20) number? (Assuming we look at more than 20 results for overlap)
 Do Google and Yahoo consistently have higher overlaps than Live/Yahoo and Live/Google?
Blog search
Comments
The most striking implication of the data collected is how unrelated the results are across blog search engines. Although all three blog search engines were relatively precise, they had barely any overlap with each other. As a result, the ranking overlap was almost nonexistent. It is interesting to note that the mean overlap for G/B was 6.9% and the highest mean ranking overlap was 1.1 results, a pretty accurate ratio. However, this also implies that in most circumstances overlap between the blog search engines occurs outside of the top 10, which users rarely access. Either the algorithms or the data indexes of the blog search engines must be quite different for only 1.4 out of 20 results, on average, to appear on Technorati, Google Blog Search, and Bloglines.
Recommendations
The data suggests that when searching for blog posts it is beneficial to try multiple sources, as no single search engine is highly precise and the search engines do not return similar results. However, as the hypothesis test supports, Google Blog Search will generally produce more precise results than Technorati. It will also be beneficial to vary your queries based on each search engine's syntax options. For the blog search engines, we did not delve into each engine's advanced search options, which may have improved the precision and overlap results. Also, each blog search engine excelled at a particular need: Technorati was great at displaying news and consolidating information, Google Blog Search helped find relevant blog feeds, and Bloglines returned really strong individual blog posts.
Key Learnings
From this research, I discovered two new web tools (Technorati and Bloglines) and how difficult it can be to use blog search engines. It is very easy to type a query into these tools, but really finding new blogs to subscribe to or recent posts to read can be quite a challenge. This statement is supported by the much poorer precision and overlap results of the blog search engines compared to the web search engines.
Possible Further Investigations
Based on the data, there are three additional investigations I would like to complete:
 What is the ranking overlap between Technorati and Bloglines?
 Are there any specific syntax changes that dramatically increase precision for each blog search engine?
 Is the ranking overlap within blog search engines consistently that low?
Methodological Changes
The following changes to our research methodology affect both Web Search and Blog Search and can be used to create more realistic, accurate results:
 Increase the sample size from 16 to at least 35. Once the sample size reaches 35, the sampling distribution of the mean can be considered approximately normal.
 Require that the queries for all three Web/Blog search engines are the same (adjusting for syntax differences when necessary). Some results may have been skewed by certain queries being very different.
 Encourage more consistency between the types of queries that are entered. The results may have been affected by some queries being very specific (intitle:x inurl:y "Q" or "U") or too simple (football).
 Establish clear definitions of "relevancy" (for measuring precision) and "overlap". Some results may have the same title but come from a different source, which may cause confusion.