Book Review: The Accidental Data Scientist

This is a new book from Information Today, authored by Amy Affelt, who has deep experience in the numbers side of special librarianship.  It’s a great read for those who are interested in getting into the information aspects of big data.  I wanted to repost the book review I put on LibraryThing and add some context and opinion that’s probably irrelevant to most readers.

Here’s the first part of the review:

Big data is a buzzword common in technology, business, and information circles. Amy Affelt’s look at the roles librarians and information professionals can play in this data-driven trend is well-timed. The book is a good introduction to the general framework of what big data is and does a good job of describing how traditional library skill sets match up to the needs of data analysis.

Or rather, how they align. Affelt brings in discussion of the Special Libraries Association alignment work, tying the librarian-as-data-scientist role to providing value. Her review of Gartner’s “three Vs” and 2 additional Vs of her own creation underscores that one – value – is integral to information professional success in this area. It recalls the Outsell graphic about librarians moving up the value chain, from providing an answer to providing analysis.

A book review’s meant to be short and pithy, so I omitted the following because I think it’s probably irrelevant to most readers of the reviewed book.

The book wasn’t really what I was expecting, which was something that delved a bit deeper into the alignment of librarian skills to big data projects.  I think the connection is made, and there’s some of the what, but I like context: more of the how.  In particular, I like that big data provides an opportunity for librarians (information professionals) in two ways.  One side is the value-added context it creates (Affelt’s fourth “V”).

Which returns me to an old chart from a Quantum Dialog / Outsell whitepaper called Creating Value-Added Research and Analysis, which I can’t find online any longer.  I have my own copy, and this chart speaks volumes:

[Chart from the Quantum Dialog / Outsell whitepaper Creating Value-Added Research and Analysis]

Where do you want to be?  Big data is going to move librarians who engage with it up the arrow to the top-right corner.  Unlike with some special library research – law is a good example – where we stop at the quick answer or an aggregation of answers, there are very real opportunities to take other people’s data and generate the answers using big data tools.

This takes me to the next part of the book review:

Beyond matching competencies to concepts, the book is light on specifics of big data technology and how librarians are using it. I found chapter 4 – which walks through 6 specific big data tools – and chapter 5 – which has big data projects in context – to be the most useful. After playing around with BigML (mentioned in chapter 4), and realizing how easy it was to take a large data set (usage statistics) and start to work with it, I think the book could be improved by having some librarians walk through their own use of big data tools (like in the other Information Today title, More Library Mashups).

I had no idea how easy these big data tools were to use.  Nor how relevant they could be even to a librarian who doesn’t see themselves as preparing to deal with big data or becoming a data scientist.  Every library has chunks of data lying around.  Budgets.  Electronic usage data.  Reference desk question data.  Walk-in gates, circulation, etc.

Electronic usage was what was on my mind, so I ran 2 years of usage data (transactions by day and database) through BigML, a big data tool mentioned in the book.  The file – 35,000 transactions in a CSV file of about 3 MB – was easily ingested and I was ready to look at the data in a matter of minutes.  One of the interesting visualizations was this:

Usage data by the day of the week. This could be used to contrast usage rates of different services too.

It shows the transactions per day of the week, Monday through Friday, for the database.  Imagine if you could run your operational statistics to visualize this sort of usage pattern.  You might decide to open at different hours or on different days.  You might find that you need to staff up at particular times of day.  Perhaps it highlights that, comparatively, some services aren’t used as heavily.
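
For anyone curious what a day-of-the-week tally looks like outside a dedicated tool, here is a minimal Python sketch.  It is not how BigML does it, and it is not the file I used: the column names (date, database, transactions) and the sample rows are made up for illustration, and your own export will almost certainly be laid out differently.

```python
import csv
import io
from collections import Counter
from datetime import date

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]


def transactions_by_weekday(csv_file):
    """Sum the 'transactions' column per day of the week.

    Assumes (hypothetically) a header row with 'date' in YYYY-MM-DD
    form, a 'database' column, and an integer 'transactions' column.
    """
    totals = Counter()
    for row in csv.DictReader(csv_file):
        day = date.fromisoformat(row["date"])
        totals[WEEKDAYS[day.weekday()]] += int(row["transactions"])
    return totals


# Made-up sample rows standing in for a real usage export.
sample = io.StringIO(
    "date,database,transactions\n"
    "2015-03-02,Database A,12\n"
    "2015-03-03,Database A,7\n"
    "2015-03-09,Database B,5\n"
)

totals = transactions_by_weekday(sample)

# Crude text bar chart, one row per weekday that has any usage.
for name in WEEKDAYS:
    if totals[name]:
        print(f"{name:<10}{'#' * totals[name]} {totals[name]}")
```

To run it against a real export, you would open the file with open("usage.csv", newline="") (a placeholder file name) and pass that in place of the sample.  Grouping by the database column as well would give the service-by-service contrast mentioned in the caption.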

I manually attempted some of this visualization within Microsoft Excel, and was happy with the result.  But it was time consuming.  The primary benefit to me from this book is the potential, using big data tools, to go far more rapidly from a large data set to a visual representation of data that can become part of the story of how my library operates.

This is a strong book for new librarians and for librarians who are data-oriented and interested in using data better in their current role or in moving to a more data-centric information role.

This is a great book for people going into roles where they want to understand how to bring value, using information competencies and perspective, to large or big data projects.  For the rest of us, it can also act as a nudge to dig out some of that data that we use – to report on a monthly or even annual basis – and do longer term comparisons to see if there are new or different ways to understand our operations.