Working with Students to Improve Indexing in Apache Hive

In Facebook Engineering, we're always looking for ways to expand our academic involvement beyond internships and research projects, so for the 2010-2011 academic year, we sponsored a team of four undergraduates as part of the Harvey Mudd Computer Science Clinic. Clinic teams work with their faculty advisor plus an industry partner to explore an evolving area of computer science in a way that is valuable to the company. To help make that happen, Facebook engineers serve as liaisons and mentors, providing requirements and technical input. Along the way, the students learn a lot about real world software development and project management.

The student team for this clinic included Skye Berghel, Jeffrey Lym, Russell Melick, and Marquis Wang, under the guidance of professor Bob Keller. Facebook liaisons included Yongqiang He (Hive team, and original developer of the indexing feature), Jonathan Hsu (data science team, and himself a Harvey Mudd alum), and John Sichi (open source and Hive teams).

The project chosen for the clinic involved beefing up indexing support in Apache Hive, an open source data warehousing project originally developed at Facebook to deal with its Big Data challenges, and now deployed at a number of other companies (many of which also contribute to its ongoing improvement). Prior to the clinic, a first cut of indexing support had been added to Hive to address some very specific use cases, but many important aspects still remained to be developed in order for the feature to become generally applicable. For example, index usage required complicated manual querying steps.

The goal of the clinic was to add support for automatic index usage, as well as to add a new index type (bitmap indexing), useful for accelerating queries that span many different attributes. For example, Facebook's site integrity team might want to ask "how many user accounts in the age range 18-25 in Egypt were compromised between January and March 2011 due to unauthorized access from a specific IP address range?"

The student team was entirely responsible for its own project planning and management. A weekly conference call with liaisons provided a steady review cycle and easily tracked their progress.

The project kicked off with a visit to Facebook headquarters in Palo Alto, California. The students started with some simple tasks to familiarize themselves with the Hive code base and the Apache open source collaboration process before moving on to tackle the big pieces.

We're wrapping up the program now. The students performed their own small-scale tests and benchmarks with some encouraging results. We're preparing to test their work at larger scale on Facebook production clusters.

In addition, Facebook liaisons attended Poster Day at Harvey Mudd on May 3, where the students presented the following poster. The poster includes details of how indexing works, along with some of the benchmark results.

(view larger)

During the presentation, additional results were shown, demonstrating significant space savings from bitmap index compression on columns with few distinct values (when compared with other index types), as well as speedups on queries combining multiple bitmap indexes across different columns.

1% Finished

At Facebook, we believe that the best code we ship is always a work in progress, so a lot of effort still remains to further improve Hive indexing. For example, cost-based index selection would be a great followup project for anyone wanting to dig into this feature. The clinic program has helped Hive indexing come a long way, and we're very happy to have helped the team gain valuable experience through late-night hacking! Our commitment to open source collaboration made it very easy to get the students interacting with others in the Hive community, including another team from Persistent Systems working on related indexing improvements, as well as researchers like Daniel Lemire, upon whose work the bitmap indexes were built.

Leveling Up

After the presentation, the students were asked for their thoughts on the experience.

As Russell put it, "besides the visit to Facebook headquarters (with the accompanying lunch and free t-shirts), the most enjoyable moments of the project came whenever our patches were committed to the codebase. It's a very special feeling to know that your work will be used by such an influential company as Facebook. Working with the open source community was both exciting and frustrating. Since Hive is used by so many diverse groups, our work felt very worthwhile. We were also able to take advantage of other open source work very easily, which significantly helped our work on bitmap indexes, and gave us a great head start on automatic use."

Jeffrey elaborated on Russell's last statement. "A turning point in the project came when another company, Persistent Systems, started work on automatically using indexes to speed up queries with GROUP BY, a feature that was very closely related to what we were working on. We were able to see each other's progress and share ideas through our Github respositories, and this helped us to find new approaches to problems that we were stuck on."

Skye also enjoyed the community aspect of open source. "Knowing that other people were going to use my code made clinic fun. I didn't feel like I was writing code as part of an academic exercise; I was making improvements to a project that many other people were using."

Of course, there were plenty of challenges along the way. Marquis mentioned, "one thing that was frustrating was the enormous complexity of Hive. After working on one task for quite some time, we'd finally begin to feel very comfortable with some aspect of the code. Then, as soon as we switched to a new task, it seemed that we were in way over our heads once again." But, he continued, "the most satisfying moment for me was the first time I got bitmap indexes to show significant results. It became quickly clear that it was at least as good as the compact index and better in some cases. It was really vindicating knowing that the thing we'd been working on all this time actually works."

The students all agreed that one of the most important lessons they learned was to be unafraid to ask for hints quickly once it was clear they were stuck. Russell said, "our success would not have been possible without the amazing help we received from our liaisons."

Professor Keller said about the experience, "the Computer Science Clinic is grateful to have the support of Facebook this year. The HMC project team and I learned a tremendous amount about open source development and Hive. We hope that the contributions made by the team will have a lasting impact on Hive and improve response times for queries that can exploit bitmap indices. The combined support of the mentors from Facebook was extremely helpful in making the project a success. We look forward to further involvement with Facebook."

John is a software engineer on the Hive and open source teams, and served as a liaison on this project.