Quick data mining of my own library

Almost back to the lab. It’s been a good summer with the boys, mostly at home. Reading books, papers and blog posts when I had free time. Which does not occur so often with children less than 5 years old, as anyone in the same situation can testify.

A lot of heated discussions are taking place online now about open access and data mining. While some benefits are straightforward in certain domains such as genetics or chemistry, this is a brand new world to explore. I came across the fascinating comments by Philip Ball on Chematica, a network of the transformations that link chemical species. Chemistry is not really my cup of tea, and I don’t have any coding abilities, unlike prominent data miners like Peter Murray-Rust. One thing I have, though, is a Mendeley library stuffed with papers (over 1400 as of today). Since my main focus now is on this ice-templating thing, I have a bit more than 350 papers on this topic alone.

In addition, I am also fascinated by issues related to presenting data, aka the visual display of quantitative information, as described by Tufte, among many others. I’ve played with Wordle before; it’s all over the internet now. Wordles are beautiful clouds of keywords, where the size of each word reflects how often it occurs in a list or a text. You have a good example with the display of keywords in the right column of the blog page.

Today, I did some quick and dirty analysis of my collection of papers. Exporting the Mendeley data to a bib file, I compiled lists of titles of the papers in my library. I used the freely available Wordle website. The whole process was really fast, like 15 minutes or so. The first result I got is shown below (click to enlarge).
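For readers who would rather script the counting step than paste titles into a website, here is a minimal sketch of the same idea in Python. It is not what I actually ran: the regular expression, the tiny stop-word list and the sample entries are all assumptions made for illustration, and the regex only handles titles that fit on a single line of the bib file.

```python
import re
from collections import Counter

# Hypothetical, deliberately short stop-word list; a real run needs a longer one.
STOP_WORDS = {"a", "an", "and", "by", "for", "from", "in", "of", "on", "the", "to", "with"}

# Matches `title = {...}` or `title = "..."` written on a single line.
# Anchoring at the start of the line keeps `booktitle = ...` from matching.
TITLE_RE = re.compile(
    r'^\s*title\s*=\s*[{"](.*?)[}"]\s*,?\s*$',
    re.IGNORECASE | re.MULTILINE,
)


def extract_titles(bibtex_text):
    """Return the title of every entry, without protective inner braces."""
    return [title.strip("{} ") for title in TITLE_RE.findall(bibtex_text)]


def word_counts(titles):
    """Count lowercase words across all titles, skipping stop words."""
    counts = Counter()
    for title in titles:
        # Hyphenated terms such as "ice-templated" are split into two words.
        for word in re.findall(r"[a-z]+", title.lower()):
            if len(word) > 2 and word not in STOP_WORDS:
                counts[word] += 1
    return counts


# Made-up entries standing in for a Mendeley export.
SAMPLE_BIB = """
@article{a,
  title = {Freeze casting of porous ceramics},
  year = {2008}
}
@inproceedings{b,
  title = {{Ice-templated porous alumina structures}},
  booktitle = {Not a paper title},
}
"""

counts = word_counts(extract_titles(SAMPLE_BIB))
print(counts.most_common(3))
```

The resulting counts play the same role as the word sizes in the cloud: the larger the count, the bigger the word.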

Well, as you can expect, since I am interested in porous ceramic materials templated by ice crystals, these keywords obviously dominate the wordle. In the upper right you can find “zirconia”, reminiscent of my PhD on the low temperature degradation of zirconia-containing ceramics. This was in the pre-Mendeley years, so I don’t have many papers left on this topic.

Things get more interesting if I restrict the analysis to the titles of the papers related to ice-templating. I got about 340 of them. I’ve followed really closely the ceramic domain, and much less the polymer field. Polymers are thus largely under-represented in the following analysis, although ice-templated polymers came first.

The first obvious observation is the absolute domination of “freeze”, “casting”, “porous” and “ceramics”. They are in almost every title. So if you want to be original, don’t come up with a paper entitled “freeze casting of porous ceramics”. The other dominant keywords are “structure” and “properties”, which is a pretty good image of the current approach to the phenomenon. Freeze whatever you have and look at the structure and properties. Not groundbreaking, most of the time. But the underlying mechanisms are so complex that very few people are willing to tackle them. “Tissue” and “scaffolds” are pretty strong too, and tissue engineering has indeed been one of the main focuses so far in terms of potential applications. “Ice” is less prominent than “freeze”, which reflects how people are currently describing the process: “freeze-casting” instead of “ice templating”. I am not a big fan of “freeze-casting”, since it was originally used to describe the processing of dense materials. Although pretty much everyone is doing porous materials, “freeze-casting” still dominates. “Ice-templating” excludes all solvents other than water, so it’s not perfect either.

I also did the same analysis compiling all the abstracts. This is much closer to mining the full text of the papers. The output is much more balanced.

“Pore”, “porous”, “structure” and “freeze” still dominate, but the relative occurrences of other keywords are much more balanced. Since people tend to report almost exclusively positive results, we got a lot of “increased”, “high”, “new”, “novel”, “potential”, “significantly” and “significant”, better represented than “low” and “decreased”. “Defects” is noticeably absent, although it remains a major issue of the process. “Control” is missing from the wordle (well, not really missing, but it’s really tiny), a fair representation of the majority of the papers, where people exert no control whatsoever. Freeze and see.
“Properties” is relatively large, although people are almost exclusively looking at mechanical properties (hence the presence of “MPa”). People became interested only very recently in other properties, such as conductivity or piezoelectricity.

Regarding materials, “silica” and “alumina” are the only ones found here. A lot of room for testing other materials, and therefore other properties. “Water” and “camphene” are of similar size, as people are equally interested in both solvents.

Missing keywords are equally interesting. “Colloids” is hardly visible, although everyone is dealing with colloidal suspensions. Ceramists are usually talking about slurries instead of colloidal suspensions, which is why we get “slurry” and “slurries” instead. Maybe. I still believe we have a lot to learn if we look at the colloid science papers.

“Interface” is the other elephant in the room. The control of the process largely depends on controlling the interface, something that people have largely ignored so far.

Without digging too much into the details, this quick and simple analysis is very informative about the current state of the art. Having followed the domain very closely for the past 5 or 6 years, I find the keyword clouds obtained here very representative of it. I’d love to extend this analysis to the full text of the papers, although I will need different tools to do it. Maybe I should get access to the Mendeley API. They are responding to over 100 million calls to their database each month, so they can surely afford a few more. In the meantime, I’ll try to apply the same analysis to different domains, using Google Scholar or Scopus and Mendeley. More later if I’m successful.

Funny coincidence, this month’s issue of Nature Materials was released today while I was playing around with this analysis. Check out the front cover.

Summer readings

It’s really hot in summer, where we live. Usually the hottest place in France, actually. From mid-day to late afternoon, it’s usually better to stay inside, where it’s a lot cooler. A good period to read books. I read three good scientific ones lately.

H2O, A biography of water, by Philip Ball. He’s probably my favorite science writer, and I enjoy his frequent columns in Nature or Nature Materials, among others. He’s the one who taught me, following our Science paper, that ice has been used as a structural material… for planes! This book is truly excellent. Philip Ball gives us a grand tour of water, through history and the various domains of science, from chemistry to biology or geophysics. I particularly enjoyed the history of water through the centuries. His style makes it a joy to read, and I could hardly put it down. Lots of gems like this one (maybe because I’m getting into antifreeze proteins lately):

If fish conducted scientific research, you might expect them to set up whole institutes devoted to studying supercooled liquids, since their very existence depends on this precarious state.

Design in Nature: How the Constructal Law Governs Evolution in Biology, Physics, Technology, and Social Organization, by Adrian Bejan and J Peder Zane.
I wasn’t aware of the constructal theory until I read that book, and it was quite a fascinating read. The constructal theory is about how design in nature arises from a simple law, the constructal law, which is basically about how stuff (mass, materials, ideas) flows. The design of things evolves towards an ever better flow. The authors are aiming high, applying their theory to pretty much everything you can think of, from lungs, rivers and trees to universities and animals. Although I don’t agree with all of their ideas, such as their claim about the very existence of trees (which are supposedly the most efficient way of moving water from the soil to the atmosphere), it was a stimulating read nonetheless.

Visual Strategies, A Practical Guide to Graphics for Scientists and Engineers, by Felice C. Frankel and Angela H. DePace.

This one is all about how to design figures or graphics to convey scientific ideas, whether it’s for a paper, a poster or a grant application. Beautiful illustrations and some interesting stories, but I found too many examples and too little theory. If you are not familiar with graphic design, it’s difficult to translate the examples provided into useful lessons you can apply. A good book, still.

New paper

Our latest paper is out in Advanced Engineering Materials. We investigated the potential of ice-templated structures for applications in catalysis. Mesoporosity is required in these applications, and we proposed two different strategies to introduce it. The first one is to use nanoparticles, in which case the mesoporosity arises from the space between the particles. The materials are quite fragile, though. Alternatively, we deposited mesoporous coatings, thus introducing mesoporosity while taking advantage of the high strength of the ice-templated supports. Preliminary steps, thus.

We are not prepared at all for this

This post at Nature News caught my attention.

But when it comes to running our labs and managing people, we have to rely on our gut feelings, our limited know-how from mentoring a few students or our observations of our previous advisers. We can often feel ill-prepared.

Ill-prepared? We are not prepared at all for this. As a young scientist trying to set up my own group, this is unfortunately not the only issue I am facing.

The number of science and PhD students is declining, and the blame is put on a few easy targets. Low salary. Long hours. Limited number of positions, if any. Etc. It is actually worse than this.

If, like me, you managed to secure a permanent or tenure position (congratulations), the most daunting part is yet to come. Besides producing good science, a skill for which you have been trained, much more is awaiting you. You have to secure funding through grants, hire people, manage your group, deal with administrative tasks (our favorite part of the job, isn’t it?), communicate, network locally and globally, make a name for yourself in your domain, and so on. And for all these things, we received basically no training whatsoever.

As far as I am concerned, it could have been much worse. I’ve been lucky to do my postdoc in a big lab where communicating results with scientists or with the public is taken very seriously. I learned a lot from my former colleagues on how to design and give a talk, design figures, entrust people and think out of the box.

But for the rest, we are pretty much on our own. Learning as things come. You learn how to prepare proposals by having your first ones rejected. You learn to appreciate which people are independent and which ones need more support and attention.

Regarding funding and financial management, I have been lucky to receive a lot of support from the CNRS for my ERC grant, both for preparing the proposal (on the budget side) and for managing it now.
Spending rules are increasingly complex and vary with funding agencies and with time. It’s crazy indeed that we can secure rather large amounts of funding, from institutions, agencies or universities, and yet no one is formally trained early on in managing these funds. This should be dealt with when we graduate or shortly after. The situation is slowly changing, at least with the CNRS, but it seems to me that the change is driven more by financial considerations (ineligible money is lost money) or the prospect of being audited by funding agencies than by increased efficiency of time and resources and better management of the labs.

If you like facing multiple challenges at once, science is the perfect job for you. I, for one, love it. It’s daunting and exciting.

Usage metrics, statistics from one paper

I am gradually becoming more and more interested in open access, and have followed the PLoS One evolution for a while. Working in materials science, PLoS One is not quite our common avenue for publishing our research. Although it is theoretically open to any domain of science, it is still strongly dominated by biology, for historical reasons.
Last year, though, we had cool and intriguing results about a compound exhibiting ice shaping properties, similar to those of antifreeze proteins. I was very interested in having these results reach biologists instead of ceramists. My first paper in PLoS One, thus.
One of the benefits of publishing there is the availability of usage metrics, updated daily. Curious to see how the paper would be perceived, or at least accessed, I tried to follow the usage over time. I did not manage to do it every day, but maybe at least twice a week or so. So here are the results, with the total views and daily views for the past 5 months or so.

What we see is a very strong initial increase of the views, which then decreases very fast. The window to catch the attention of readers is very short, less than a week, with readers coming either from the front page while the paper is still in the list of recently published papers, or through RSS or other feeds. After that, there is a long tail, with 4 or 5 daily views on average.
A second peak is also visible, shortly after the first one. It corresponds to the publication of the press release by the CNRS, which was tweeted and retweeted a couple of times, and caught the attention of a number of new readers.
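Since I only wrote down the total views a couple of times a week, the daily views have to be derived from the gaps between readings. A minimal sketch of that step, assuming each reading is a (date, cumulative views) pair and that views are spread evenly between two readings. The function name and the numbers are invented for illustration and are not the actual data of the paper.

```python
from datetime import date


def daily_rates(samples):
    """Average views per day between consecutive readings.

    `samples` is a list of (date, cumulative_views) pairs sorted by date.
    Returns a list of (start_date, end_date, views_per_day) tuples.
    """
    rates = []
    for (d0, v0), (d1, v1) in zip(samples, samples[1:]):
        days = (d1 - d0).days
        if days <= 0:
            raise ValueError("readings must be strictly increasing in date")
        rates.append((d0, d1, (v1 - v0) / days))
    return rates


# Invented readings: a sharp first week, then a long tail.
readings = [
    (date(2012, 1, 1), 0),
    (date(2012, 1, 3), 200),
    (date(2012, 1, 7), 240),
    (date(2012, 1, 14), 275),
]

for start, end, rate in daily_rates(readings):
    print(f"{start} to {end}: {rate:.1f} views/day")
```

Averaging over the interval smooths out short peaks, so a spike that lasts a single day will look lower and wider than it really was whenever the readings are several days apart.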

The other, less visible observation is the absence of a peak in the last month. I went to a conference in Germany to present these results, and apparently people did not rush to PLoS One to download the paper. Oh well.

Although these data are limited to a single paper, I suspect the general behaviour for all journals is very similar. I would be curious to see such data averaged for a journal. I wish all the journals would make such data available. I guess this is just a matter of time before they do so.

Google Scholar citations metrics

Google is now tracking the metrics of journals. They chose the h5-index, which is basically the h-index restricted to the articles published in the last 5 years. The search function works with keywords, as you can guess. So if you search for materials science journals with the keyword “materials”, you will only get journals whose names include “materials”, and skip journals like Nano Letters, ACS Nano and others.
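To make the definition concrete: a journal has an h-index of h if h of its papers have been cited at least h times each, and the h5-index applies the same rule to papers from the last 5 complete calendar years only. Here is a minimal sketch, where the (year, citations) input format, the function names and the sample numbers are my own assumptions and say nothing about how Google computes it internally.

```python
def h_index(citations):
    """Largest h such that h papers have at least h citations each."""
    h = 0
    for rank, cites in enumerate(sorted(citations, reverse=True), start=1):
        if cites >= rank:
            h = rank
        else:
            break
    return h


def h5_index(papers, current_year):
    """h-index over papers from the 5 complete years before `current_year`.

    `papers` is an iterable of (publication_year, citation_count) pairs.
    """
    first_year = current_year - 5
    recent = [cites for year, cites in papers if first_year <= year < current_year]
    return h_index(recent)


# Invented journal: the 2005 and 2012 papers fall outside the window.
papers = [(2005, 1000), (2008, 50), (2009, 10), (2010, 3), (2011, 3), (2012, 99)]
print(h5_index(papers, current_year=2012))
```

Note that a single hugely cited paper barely moves the index: it only counts as one paper, however many citations it has.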

If you click on the h5 link, you get a list of the top cited papers for that journal. Neat. The 2007 graphene paper by Geim and Novoselov in Nature Materials has already been cited >5600 times. Holy cow.

New paper

“Particles redistribution and structural defects development during ice templating”, to be published in Acta Materialia soon. You can find the final version on the publications page. Very intriguing results, I have to say. I don’t think we nailed the whole story yet, though. Suggestions are welcome.