Quick data mining of my own library

Almost back to the lab. It’s been a good summer with the boys, mostly at home. I read books, papers and blog posts when I had free time, which does not happen often with children under 5, as anyone in the same situation can testify.

A lot of heated discussions are taking place online right now about open access and data mining. While some benefits are straightforward in certain domains such as genetics or chemistry, this is a brand new world to explore. I came across the fascinating comments by Philip Ball on Chematica, a network of the transformations that link chemical species. Chemistry is not really my cup of tea, and I don’t have any coding abilities, unlike prominent data miners such as Peter Murray-Rust. One thing I have, though, is a Mendeley library stuffed with papers (over 1400 as of today). Since my main focus now is on this ice-templating thing, I have a bit more than 350 papers on this topic alone.

In addition, I am fascinated by issues related to presenting data, aka the visual display of quantitative information, as described by Tufte, among many others. I’ve played with Wordle before; it’s all over the internet now. Wordles are beautiful clouds of keywords, where the size of each word relates to its occurrence in a list or a text. You have a good example with the display of keywords in the right column of the blog page.

Today, I did some quick and dirty analysis of my collection of papers. After exporting the Mendeley data to a bib file, I compiled lists of the titles of the papers in my library and fed them to the freely available Wordle website. The whole process was really fast, 15 minutes or so. The first result I got is shown below (click to enlarge).
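The title-counting step can also be scripted. The sketch below is not how the clouds in this post were made (those came from the Wordle website). It assumes a bib export in which each entry has a `title = {...}` or `title = "..."` field on its own line, and it uses a small, hypothetical stop-word list. The names `extract_titles`, `word_counts` and `STOP_WORDS` are made up for illustration.

```python
import re
from collections import Counter

# Hypothetical stop-word list (an assumption); Wordle applies its own filtering.
STOP_WORDS = {"a", "an", "and", "by", "for", "from", "in", "of", "on",
              "the", "to", "via", "with"}

# Matches a title field at the start of a line, written either as
# title = {...} (allowing one level of nested braces) or title = "...".
# Anchoring at the line start keeps "booktitle" from matching.
TITLE_RE = re.compile(
    r'^\s*title\s*=\s*(?:\{((?:[^{}]|\{[^{}]*\})*)\}|"([^"]*)")',
    re.IGNORECASE | re.MULTILINE,
)


def extract_titles(bib_text):
    """Return the list of paper titles found in the text of a bib file."""
    titles = []
    for braced, quoted in TITLE_RE.findall(bib_text):
        raw = braced or quoted
        # Drop the protective braces BibTeX uses around capitalised words.
        titles.append(re.sub(r"[{}]", "", raw).strip())
    return titles


def word_counts(texts):
    """Count lowercase words (hyphenated words kept whole) across texts."""
    counts = Counter()
    for text in texts:
        for word in re.findall(r"[a-z]+(?:-[a-z]+)*", text.lower()):
            if word not in STOP_WORDS:
                counts[word] += 1
    return counts
```

Calling `word_counts(extract_titles(open("library.bib", encoding="utf-8").read())).most_common(50)`, where `library.bib` is a placeholder file name, would give the 50 most frequent title words, which is the information a word cloud encodes as font size. The same `word_counts` function could be applied to a list of abstracts.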

Well, as you can expect, since I am interested in porous ceramic materials templated by ice crystals, these keywords obviously dominate the wordle. In the upper right you can find “zirconia”, reminiscent of my PhD on the low temperature degradation of zirconia-containing ceramics. This was in the pre-Mendeley years, so I don’t have many papers left on this topic.

Things get more interesting if I restrict the analysis to the titles of the papers related to ice-templating. I got about 340 of them. I’ve followed the ceramic domain really closely, and the polymer field much less. Polymers are thus largely under-represented in the following analysis, although ice-templated polymers came first.

The first obvious observation is the absolute domination of “freeze”, “casting”, “porous” and “ceramics”. They are in almost every title. So if you want to be original, don’t come up with a paper entitled “freeze casting of porous ceramics”. The other dominant keywords are “structure” and “properties”, which is a pretty good image of the current approach to the phenomenon. Freeze whatever you have and look at the structure and properties. Not groundbreaking, most of the time. But the underlying mechanisms are so complex that very few people are willing to tackle them. “Tissue” and “scaffolds” are pretty strong too, and tissue engineering has indeed been one of the main focuses so far in terms of potential applications. “Ice” is less prominent than “freeze”, which reflects how people currently describe the process: “freeze-casting” instead of “ice-templating”. I am not a big fan of “freeze-casting”, since it was originally used to describe the processing of dense materials. Although pretty much everyone is making porous materials, “freeze-casting” still dominates. “Ice-templating” excludes all solvents other than water, so it’s not perfect either.

I also did the same analysis compiling all the abstracts. This is much closer to mining the full text of the papers. The output is much more balanced.

“Pore”, “porous”, “structure” and “freeze” still dominate, but the relative occurrences of other keywords are much more balanced. Since people tend to report almost exclusively positive results, we get a lot of “increased”, “high”, “new”, “novel”, “potential”, “significantly” and “significant”, better represented than “low” and “decreased”. “Defects” is noticeably absent, although it remains a major issue of the process. “Control” is missing from the wordle (well, not really missing, but it’s really tiny), a fair representation of the majority of the papers, where people exert no control whatsoever. Freeze and see.
“Properties” is relatively large, although people are almost exclusively looking at mechanical properties (hence the presence of “MPa”). People became interested only very recently in other properties, such as conductivity or piezoelectricity.

Regarding materials, “silica” and “alumina” are the only ones found here. A lot of room for testing other materials, and therefore other properties. “Water” and “camphene” are of similar size, as people are equally interested in both solvents.

Missing keywords are equally interesting. “Colloids” is hardly visible, although everyone is dealing with colloidal suspensions. Ceramists are usually talking about slurries instead of colloidal suspensions, which is why we get “slurry” and “slurries” instead. Maybe. I still believe we have a lot to learn if we look at the colloid science papers.

“Interface” is the other elephant in the room. The control of the process largely depends on controlling the interface, which is something that people have largely ignored so far.

Without digging too much into the details, this quick and simple analysis is very informative. Having followed the domain very closely for the past 5 or 6 years, I find the keyword clouds obtained here very representative of the current state of the art. I’d love to extend this analysis to the full text of the papers, although I will need different tools to do it. Maybe I should get access to the Mendeley API. They are responding to over 100 million calls to their database each month, so they can surely afford a few more. In the meantime, I’ll try to apply the same analysis to different domains, using Google Scholar or Scopus and Mendeley. More later if I’m successful.

Funny coincidence: this month’s issue of Nature Materials was released today while I was playing around with this analysis. Check out the front cover.