Writing academic papers in plain text with Markdown and Jupyter notebook

July 17, 2015 § 7 Comments

TL;DR

My new workflow for writing academic papers involves Jupyter Notebook for analyzing the data and generating the figures, Markdown for writing the paper, and Pandoc for generating the final output. Works great!

Long version

As academics, writing is one of our core activities. Writing academic papers is not quite like writing blog posts or tweets. The text is structured, and includes figures, lots of maths (usually), and many citations. Everyone has their own workflow, which usually involves Word or LaTeX at some point, as well as some reference management solution. I have been rethinking my writing workflow recently, and came up with a new solution that meets a number of requirements I have:

  • future proof. I do not want to depend on a file format that might become obsolete.
  • lightweight.
  • one master file for all kinds of outputs (PDF, DOC, and eventually HTML, etc.).
  • able to deal with citation management automatically (of course).
  • able to update the paper (including plots) as revisions are required, with a minimal amount of effort (I told you I was lazy).
  • open source tools are a bonus.
  • tightly bound to my data analysis workflow (more on that later).

After playing around with a couple of tools, I experimented with a nice solution for our latest paper, and will share it here in case anyone else is interested.

This paper was particularly well suited to my new workflow. What we did was data mine 120+ papers for process parameters and properties of materials to extract trends and look at the relative influence of the various parameters on the properties of the material. The data in that case was a big CSV file, with hundreds of lines. Each data point was labelled by its bib key (e.g. Deville2006), which turned out to be super convenient later.

Data analysis

I became a big fan of the Jupyter notebook for our data analysis. The main selling points for me were the following:

  • document how the analysis was done (future proof). The mix of Markdown, LaTeX, and code is a game changer for me.
  • ability to easily change the format of the output (plots) depending on the journal requirements and my own preferences.
  • ability to instantaneously update plots in the final paper with new data. As I run the notebook, the figures are generated and saved in a folder.
  • ability to share how the analysis was done, so as to provide a reproducible paper. The notebook of our latest paper is hosted on FigShare along with the raw data, with its own DOI (you can cite it if you reuse it).
  • ability to generate the bibliography automatically. As each data point in my CSV file comes with its bib key, I can track exactly which references were used for a plot. This was particularly useful when writing that particular paper. After each plot, where the data come from many different papers, I can generate a list of the bib keys used for the plot, and copy/paste that list into the paper. Boom!

All the analysis was done in a Jupyter notebook, which I later uploaded to FigShare when the paper was published. The notebook generates the figures with a consistent style, as well as the lists of bib keys. This turned out to be the biggest time saver here. To give you a rough idea, here is the simple function that I use to generate the list of bib keys.

[Screenshot: the notebook function that generates the list of bib keys]

And here is the result when I run it for a figure. Now I just have to copy this list and paste it directly into the Markdown file of the paper. Very cool.

[Screenshot: the list of bib keys returned for one figure]
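For readers who cannot see the screenshots, a minimal sketch of such a function might look like this. It is an illustration, not the code from the notebook: the column name `bibkey`, the function name, and the sample rows are all assumptions, and the numbers are made up.

```python
import csv
import io


def bib_key_citation(rows, key_column="bibkey"):
    """Build a Pandoc citation group from the bib keys found in rows.

    rows is an iterable of dicts (one per data point); key_column is the
    hypothetical name of the column that stores the bib key.
    """
    # A set removes duplicates: many data points come from the same paper.
    keys = sorted({row[key_column] for row in rows if row.get(key_column)})
    return "[" + "; ".join("@" + key for key in keys) + "]"


# Made-up extract of a CSV file with one row per data point.
sample_csv = io.StringIO(
    "bibkey,porosity,strength\n"
    "Deville2006,0.64,65\n"
    "Wegst2010,0.70,30\n"
    "Deville2006,0.56,145\n"
)
rows = list(csv.DictReader(sample_csv))
citation = bib_key_citation(rows)
print(citation)  # [@Deville2006; @Wegst2010]
```

The string it prints is already in Pandoc's citation syntax, so it can be pasted into the Markdown file as is.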

Writing the paper

I am a big fan of LaTeX for long documents (PhD manuscript, etc.), but not so much for regular academic papers. I am not a physicist, so my papers are usually light in terms of maths. I chose to write everything in Markdown, which is something like LaTeX for dummies. It is a very, very simple markup syntax, very popular for blogging, among other uses. The files are plain text files, which is certainly the most future-proof solution that I can think of. The syntax is dead simple; you will get it in literally 5 minutes.

I do all my writing in Sublime Text, boosted with a couple of packages. Of particular interest in this case: SmartMarkdown, and PandocAcademic (not mandatory, though).

Bibliography

I use Mendeley for my reference management. My favorite function is the automatic generation of a bib file, which I can use for my LaTeX or Markdown writing later on.

Getting the final version

What do you do with the Markdown file, then? The one tool that glues everything together is Pandoc, dubbed the “Swiss army knife” document converter. It is a simple but extremely powerful command line tool. In my case, it takes the Markdown file and converts it into a Word or PDF document (or many other formats if you need them). The beauty of it is of course the generation of the bibliography and the incorporation of figures and beautifully typeset equations. You can run Pandoc from the command line directly. Here is the typical command line for what I want to do:

pandoc -s paper.md -t docx -o paper.docx --filter pandoc-citeproc --bibliography=library.bib --csl=iop-numerics.csl

Pandoc takes the paper.md file and the library.bib file for the bibliography, uses citeproc and the iop-numerics.csl file to format the bibliography, and creates the paper.docx file for me. Easy!
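If you prefer to keep the build step next to the analysis, the same command can be assembled from Python. This is only a sketch: the function names `pandoc_command` and `run_pandoc` are made up for the illustration, and the flags are copied from the command line above. Recent Pandoc releases use a built-in `--citeproc` option in place of the pandoc-citeproc filter, so adjust the list to your installation.

```python
import shutil
import subprocess


def pandoc_command(source="paper.md", output="paper.docx",
                   bibliography="library.bib", csl="iop-numerics.csl"):
    """Return the Pandoc command line from the post as an argument list."""
    return [
        "pandoc", "-s", source,
        "-t", "docx",
        "-o", output,
        # Flags as used in 2015; newer Pandoc versions use --citeproc instead.
        "--filter", "pandoc-citeproc",
        "--bibliography=" + bibliography,
        "--csl=" + csl,
    ]


def run_pandoc(command):
    """Run the command; raises CalledProcessError if the conversion fails."""
    if shutil.which(command[0]) is None:
        raise FileNotFoundError("pandoc was not found on the PATH")
    subprocess.run(command, check=True)


command = pandoc_command()
print(" ".join(command))
```

Calling `run_pandoc(command)` in the folder that contains paper.md, library.bib, and the CSL file would then perform the conversion.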

Putting everything together

So I have everything I need now. Here is how it works.

  • The Jupyter notebook generates the figures and saves them in a folder.
  • The Markdown file starts with a YAML metadata block, which I use to provide the title, authors, affiliations, and date.


---
title: A meta-analysis of the mechanical properties of ice-templated ceramics and metals
author: Sylvain Deville^1^\footnote{Corresponding author – Sylvain.Deville@saint-gobain.com}, Sylvain Meille^2^, Jordi Seuba^1^
abstract: Ice templating, also known as freeze casting, is a popular shaping route for macroporous materials. bla bla bla. We hope these results will be a helpful guide to anyone interested in such materials.
include-before: ^1^ Laboratoire de Synthèse et Fonctionnalisation des Céramiques, UMR3080 CNRS/Saint-Gobain, 84306 Cavaillon, France. \newline ^2^ Université de Lyon, INSA-Lyon, MATEIS CNRS UMR5510, F-69621 Villeurbanne, France \newline \newline Keywords 10.03 Ceramics, 20.04 Crystal growth, 30.05 Mechanical properties
date: \today
---

  • The text itself is formatted in Markdown. Note how the citations are written in the text. Markdown uses relative paths to folders and files; note how I point to the figure file.

# Introduction
Ice templating, or freeze casting [@Deville2008b], has become a popular shaping route for all kinds of macroporous materials. The process is based on the segregation of matter (particles or solute) by growing crystals in a suspension or solution (Fig. 1). After complete solidification, the solvent crystals are removed by sublimation. The porosity obtained is thus an almost direct replica of the solvent crystals.

![Principles of ice-templating. The colloidal suspension is frozen, the solvent crystals are then sublimated, and the resulting green body sintered.](../figures/ice_templating_principles.png)

Ice templating has been applied to all classes of materials, but particularly ceramics over the past 15 years. Although a few review papers [@Deville2008b; @Deville2010a; @Wegst2010; @Li2012b; @Deville2013b; @Fukushima2014; @Pawelec2014b] have been published, they mostly focus on the underlying principles. Little can be found on the range of properties that could be achieved.

Here is what the PDF looks like.

[Screenshot: first page of the resulting PDF]


  • You can build from the command line. You can also do everything from Sublime Text. Just set the user settings of the SmartMarkdown package to automatically use the bib file (generated by Mendeley, for instance) and the CSL file (which depends on the journal you submit to). You can also provide Pandoc with a LaTeX template if you want to.

"pandoc_args_pdf": ["--latex-engine=/usr/texbin/pdflatex", "-V", "--bibliography=/Users/sylvaindeville/Desktop/library.bib", "--csl=iop-numerics.csl", "--filter=/usr/local/bin/pandoc-citeproc", "--template=/Users/sylvaindeville/Documents/pandoc/templates/latex2.template"],

To build the final version, I either run Pandoc from the command line, or hit Shift+Cmd+P in Sublime Text and choose “Pandoc: render PDF”, and Pandoc generates the final document for me, with the correctly formatted bibliography and the figures in place. That’s it! I also saved the Pandoc command line arguments (as a text file) in the folder where the Markdown file is, so that I do not depend on Sublime Text in case I change my mind, and do not have to remember the exact command line to type (lazy, I told you).

Summary of the tools you need

  • A valid Python and Jupyter notebook installation, if you are doing your data analysis with it.
  • Pandoc.
  • A valid LaTeX installation.
  • A bib file for your bibliography.
  • A CSL file for each bibliography style you want to use. Get the one you need here.
  • A text editor. Many choices available.
    Total cost: $0.

Final Thoughts

It took a while to get everything in place and working, but I am happy with it now. This workflow was particularly suitable for this paper, since all the data analysis was done in the Jupyter notebook and there were many citations (in particular for each plot) that I did not want to input manually. During the review of the paper, one of the referees mentioned a couple of papers that we had not found initially. I updated the CSV file with the new data points, ran the notebook, and the figures were instantaneously updated. I rebuilt the final file from the updated Markdown file, and boom. Very little friction indeed.

A common question is how to handle co-writing and proofreading when the paper is collaborative. In this case, I wrote almost everything. The other authors just sent me their parts in plain text and I pasted them in. I used the PDF for proofreading, and everyone annotated the PDF files. If I am in charge of the paper, I choose the tools. Deal with it.

Future improvements

I still have to copy/paste the list of bib keys corresponding to the figures in the Markdown files. Ideally, the list would be automatically generated within the Markdown file, so that there is even less friction in the whole process. I am not quite sure how to do this. Any suggestion is welcome.
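One possible way to remove the copy/paste step is to leave a placeholder in the Markdown file and let a small script fill it in before Pandoc runs. The sketch below is only a suggestion: the `{{cite:NAME}}` placeholder syntax is invented for this illustration (it is not a Pandoc feature), and the `citations` dictionary stands in for whatever the notebook would export for each figure.

```python
import re


def fill_citation_placeholders(markdown_text, citations):
    """Replace each {{cite:NAME}} placeholder with the citation group for NAME."""
    def substitute(match):
        name = match.group(1)
        # Unknown placeholders are left untouched so that they stay visible.
        return citations.get(name, match.group(0))

    return re.sub(r"\{\{cite:([\w-]+)\}\}", substitute, markdown_text)


# Hypothetical mapping from figure name to the bib keys used in that figure.
citations = {"fig2": "[@Deville2008b; @Wegst2010]"}
source = "Strength data are compiled from the literature {{cite:fig2}}."
filled = fill_citation_placeholders(source, citations)
print(filled)
```

The filled text would be written to a temporary Markdown file and passed to Pandoc, so that the master file keeps its placeholders.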

If you want more control over the pagination of your output files, you can tell Pandoc to use a template (many journals provide LaTeX templates, for instance, at least in physics). I did not try, as the pagination requirements for submission are very minimal. The whole idea of a master text file is to *not* have to deal with this sort of thing.

Finally, some version control (e.g. with GitHub) would be nice.

 

Update 20/07/15

  • Added some Jupyter screenshots.
  • I forgot to mention the main limitation (for me) of this approach: Pandoc does not do cross-references, so there are no automatic references to figures and equations. That is a trade-off that I can accept for now, as I usually have a limited number of equations and figures. Overall, I prefer to save time on reference management rather than on cross-references to figures and equations. YMMV.

Snowflakes, engineered

March 10, 2015 § Leave a comment

“On the six-cornered snowflakes”. Kepler’s book on snowflakes.

One of the earliest scientific observations you performed as a kid may have been that of snowflakes. Their delicate morphology, with multiple branches, has a unique appeal to the eye and can easily be observed with a magnifying glass. No wonder that snowflakes have caught the attention of scientists and poets for centuries. As early as the 17th century, Johannes Kepler noticed their 6-fold symmetry, as well as their unique nature: no two snowflakes are alike.

Wilson Bentley’s photograph of snowflakes.

For a very long time, the only way to record the shape of snowflakes was drawing. If you have ever looked at snowflakes under a magnifying glass, you can easily imagine how difficult they are to draw, not to mention that snowflakes tend to have a very short lifespan. In the early 20th century, Wilson A. Bentley was the first to photograph snowflakes, systematically capturing thousands of unique snowflakes for over 40 years. His collection has proved to be incredibly valuable for investigating their morphology and is also a unique piece of art, if you ask me.

 

Libbrecht’s setup to photograph snowflakes in the wild.

Following on Bentley’s work, Kenneth Libbrecht, at Caltech, is dedicating his career to the study of snowflakes. Driven by both passion and science, he has developed over the years a unique setup to capture images of natural snowflakes. There is still a lot to learn from snowflakes. Or rather, there is actually not that much we understand about the growth of snowflakes and the physics behind it. One thing we know: when it comes to snowflakes, there is more than the 6-branch morphology that anyone will draw if you ask them. Depending on the conditions (temperature and supersaturation), you can get anything from needles to plates. The morphology of natural snowflakes directly depends on the conditions they encountered in the sky. As such, snowflakes can be seen as little messengers from the clouds, telling a very local climate story.

Snowflake morphology diagram. There is more to snowflakes than 6 branches!

Systematic investigations, required to understand the physics behind snowflakes, are thus notoriously difficult with natural snowflakes. Physicists have long been trying to grow artificial snowflakes in the lab, under controlled, reproducible conditions. The first attempts to grow such snowflakes used… rabbit hairs! A rabbit hair is a well-controlled (for that time), one-dimensional object, suitable for triggering the nucleation of snowflakes.

The crystal on the right was subjected to periodic temperature changes that yielded a spider’s-web pattern of ridges and ribs.

In a paper published on arXiv last week, Libbrecht describes a unique microscope, designed to grow snowflakes under controlled conditions and to record their growth in real time. The pictures are stunning, as usual. I have one of Libbrecht’s snowflake books on my desk, and peruse it every once in a while. You should, too.

Engineered snowflake with a near-perfect 6-fold symmetry.

The most interesting thing to me is the time-lapse observations reported in the paper. By varying the supersaturation and temperature conditions, Libbrecht triggers side-branching events in a controlled manner, effectively engineering the morphology of snowflakes. Increasing the supersaturation for a brief moment initiates the development of branches at the corners of the growing snowflake. Several new branches are eventually created from each corner, all of them growing in a synchronized fashion, since the conditions are homogeneous at the scale of the snowflake.

Noorduin’s microscopic flowers grown under diffusion-controlled conditions.

This behavior reminded me of the beautiful microscopic flowers reported in Science by Noorduin two years ago where, by varying the CO2 concentration, Noorduin was able to change the growth morphologies of his tiny flowers in a very controlled manner. In both cases, crystal growth occurs under diffusion-limited conditions, and the two systems may thus share more than meets the eye.

Changing the morphology of flowers with a CO2 pulse.

Because the conditions are much better controlled, the engineered snowflakes have a much better 6-fold symmetry than their natural counterparts. Growing branches while falling through the windy sky is a tough job. The conditions vary constantly. By the time a snowflake reaches the ground, its famous 6-fold symmetry is seldom preserved. By deliberately engineering the growth of his snowflakes, Libbrecht obtains new insights into the physics of snowflake growth, which may well be valuable for our understanding of crystal growth. But I can’t help but see the sheer beauty of the highly symmetrical engineered snowflakes, too. Libbrecht may very well be the very first one to grow two identical snowflakes, ruining the long-standing belief that no two snowflakes are alike.


More:
Kenneth Libbrecht’s website
Gallery of Libbrecht’s snowflake photographs.

10 writing tips for academic papers

January 29, 2013 § Leave a comment

I’m currently wrapping up a long review paper (>10k words) that should hopefully be published this September. As usual, as a non-native speaker, I ran into many common grammar and style mistakes. Luckily, I have a native speaker next door, and he’s patient enough to correct most of my mistakes. He’s my first secret weapon. The second one is this little gem, called The Elements of Style (4th Edition), by William Strunk Jr. and E. B. White. This book is probably the best money I’ve ever spent on a book.

So without further ado, here are my top ten mistakes, which I’ve learned to correct thanks to my two secret weapons:

  1. You should place a comma after abbreviations like i.e., e.g., etc.
  2. If you enumerate several terms with a single conjunction, use a comma after each term. Example: “… bla bla bla in materials science, chemistry, and life science”. Same if you enumerate with “or”.
  3. Put statements in positive forms. It is much stronger.
  4. Omit needless words. For some reason, we French people seem to use a lot of these. So go and mercilessly chase expressions like “the reason why is that”, “the question as to whether”, etc.
  5. “Due to” is a synonym for “attributable to”. Avoid using it for “owing to” or “because of”.
  6. “Interesting”. It might be interesting to you, but not to everyone else. Remove it. Just remove it.
  7. “Type” is not a synonym for “kind of”. So get it straight.
  8. “While”. Use it only if you can replace it with “during the time that”.
  9. Don’t say “very unique”. “Unique” is good enough.
  10. Split infinitive: when you put an adverb between “to” and the verb. I used this form a lot and thought it was cool. Apparently it’s not. Don’t say: “to thoroughly investigate”, say: “to investigate thoroughly”.

This is just the top ten. The entire book is full of stuff like this. Go and get it. And don’t lend it to anyone; you’d never get it back. Do you have another one? Share it in the comments.

Usage metrics, statistics from one paper

April 4, 2012 § Leave a comment

I am gradually becoming more and more interested in open access, and have followed the evolution of PLoS One for a while. For those of us working in materials science, PLoS One is not the usual venue for publishing our research. Although it is theoretically open to any domain of science, it is still strongly dominated by biology, for historical reasons.
Last year, though, we had cool and intriguing results about a compound exhibiting ice-shaping properties, similar to those of antifreeze proteins. I was very interested in having these results reach biologists instead of ceramists. Thus my first paper in PLoS One.
One of the benefits of publishing there is the availability of usage metrics, updated daily. Curious to see how the paper would be perceived, or at least accessed, I tried to follow the usage over time. I did not manage to do it every day, but maybe at least twice a week or so. So here are the results, with the total views and daily views for the past 5 months or so.
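Since the totals were not recorded every day, the daily views between two readings can only be an average over the interval. A minimal sketch of that arithmetic follows; the function name and the numbers are made up for the illustration and are not the data from the paper.

```python
from datetime import date


def daily_view_rates(samples):
    """Average views per day between consecutive (date, total_views) readings."""
    rates = []
    for (day0, views0), (day1, views1) in zip(samples, samples[1:]):
        elapsed_days = (day1 - day0).days
        # New views since the previous reading, spread over the interval.
        rates.append((day1, (views1 - views0) / elapsed_days))
    return rates


# Made-up readings of the cumulative view counter, taken a few days apart.
samples = [
    (date(2012, 1, 2), 100),
    (date(2012, 1, 5), 160),
    (date(2012, 1, 9), 180),
]
rates = daily_view_rates(samples)
for day, rate in rates:
    print(day.isoformat(), rate)
```

With these made-up readings the averages are 20 views per day over the first interval and 5 per day over the second.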

What we see is a very strong initial increase in views, which then decreases very fast. The window to catch the attention of readers is very short, less than a week, with readers coming either from the front page while the paper is still in the list of recently published papers, or through RSS or other feeds. After that, there is a long tail, with 4 or 5 daily views on average.
A second peak is also visible, shortly after the first one. It corresponds to the publication of the press release by the CNRS, which was tweeted and retweeted a couple of times, and caught the attention of a number of new readers.

The other, less visible observation is the absence of a peak in the last month. I went to a conference in Germany to present these results, and apparently people did not rush to PLoS One to download the paper. Oh well.

Although these data are limited to a single paper, I suspect the general behaviour for all journals is very similar. I would be curious to see such data averaged for a journal. I wish all the journals would make such data available. I guess this is just a matter of time before they do so.

Elsevier author artwork instruction

February 6, 2012 § Leave a comment

This is insane when you read it and think about it. We (authors) shouldn’t have to deal with that, at least not too much. I just finished submitting a revised version of a manuscript, and I went crazy with the online submission system, which just keeps crashing and forcing me to start over and over again. As usual. Not exactly a user-friendly experience.

Research news articles in CNRS International Magazine

January 24, 2012 § Leave a comment

Covering our PLoS One paper. Check it out.

TOC ROFL

October 17, 2011 § Leave a comment

aka table of contents, rolling on the floor laughing. This one is probably my favourite. Some outstanding graphical art out there. More seriously, check out the Nature Chemistry article here. When I check my RSS feeds in the browser, or most of the time on the iPad with Reeder, graphical abstracts definitely help me identify papers of interest.
