[Rp] Reproducing and replicating the OCamlP3l experiment

This article provides a full report on the effort to reproduce the work described in the article “Parallel Functional Programming with Skeletons: the OCamlP3L experiment” [1], written in 1998. It presented OCamlP3l [2], a parallel programming system written in the OCaml programming language [3]. The system described in [1] was a breakthrough in many respects: it showed that it was possible to implement parallel skeletons [4] as combinators in a functional programming language; it showed how this parallel programming style made it possible to write a single source code that produced both executables targeted for sequential execution, hence enabling the usual debugging techniques, and executables for parallel execution; and it led to the introduction in OCaml of the ability to marshal functional closures, used later on by a wealth of different applications. The article consists of two main parts, the system description and the system evaluation, so replicating the results involves the following:

1 Recovering the source code
Looking into the original paper directory turned out to be of little help, as there was no trace of the source code or any useful information. So we turned to the paper itself, and found three links to web pages:
• www.di.unipi.it/~marcod/ocamlp3l/ocamlp3l.ml, which today returns 404; looking at the archived copies on archive.org allowed us to recover some documentation, but not the source code;
• www.di.unipi.it/~susanna/p3l.ml, which is still live, but provides no useful link to the source code;
• pauillac.inria.fr/ocaml, which is also live, but the only hope of finding the source code was the link to the anonymous CVS server, which points today to the OCaml GitHub organization, where we found no trace of this 23-year-old code.
Searching the web
The links from the original paper being now useless, we resorted to searching the web, and found http://ocamlp3l.inria.fr/. We followed the link http://ocamlp3l.inria.fr/eng.htm#download to the download page, which offered an ftp link, ftp://ftp.inria.fr/INRIA/caml-light/bazar-ocaml/ocamlp3l/ocamlp3l-2.03.tgz, now dead, and a web link, http://ocamlp3l.inria.fr/ocamlp3l-2.03.tgz, that was still working. Unfortunately, this is version 2.03 of OCamlP3l, far more evolved than, and quite different from, the version 0.9 used in the original research article, and there was no trace of the version history, so the quest was far from over.

Saving version 2.03
Here we decided to pause and properly deposit this version 2.03, with extended metadata, into Software Heritage [5] via the HAL national open access archive; the result is now available as [6].
Back to searching the web
More web searches brought up a related webpage for a newer system, http://camlp3l.inria.fr/eng.htm, touting a link to a git repository on Gitorious, http://gitorious.org/camlp3l/. Unfortunately, following the link leads nowhere, as Gitorious was shut down in 2015. Luckily, Software Heritage has saved the full content of Gitorious, so we could download a full copy of the git repository. Its version history, however, only goes back to 2011, with version 1.03 of CamlP3l, not OCamlP3l, and no trace of earlier versions of the system, so we were seemingly back to square one.

To our great surprise, and satisfaction, the code compiled unchanged with the modern OCaml 4.05 installed on our machines. The only notable difference is that the modern compiler produces several new warnings that correspond to better static analysis checks introduced over the past quarter of a century.

Finding it on
This is a remarkable achievement, not just for our own code, but for OCaml itself.
3 Recovering the test suite and replicating speedup figures
Here too, looking into the original paper directory turned out to be of little help, as there was no trace of the test suite used in the article or any useful information. Web searches were of little use, as this test suite was used only for the article and never published.
A long search through old backups on tape, CD-ROMs and DVDs did not yield anything relevant either. Hence, our reproducibility journey ended here.
But we did not want to stop here: having found the original code, we could replicate the speed-up results, using a new test suite. After all, according to the article we wrote over 22 years ago, the original test suite was just producing a computational load to keep the compute nodes busy enough to take advantage of the parallelism.
As a first step, we adapted code in the Examples directory, from the SimpleFarm/simplefarm.ml [8] and the PerfTuning/pipeline.ml [9] files. The result is a simple parametric test code, shown in Figure 1, that makes it possible to test the speedup one can get from the farm parallel skeleton in configurations obtained by varying the number nproc of processing nodes and the time msecwait elapsed in each sequential computation.
The second step was to make the ocamlp3lrun driver command [10], which was using rsh and rcp back in 1997, work with the ssh and scp commands that are mainstream today.
A quick hack that works without even touching the code is to create an executable file rsh containing just two lines, and similarly for rcp. Running the parallel test on a set of n different machines is then a simple matter of issuing the commands

    ocamlp3lcc -par test-for-speedup.ml
    ocamlp3lrun test-for-speedup <machine1> <machine2> ... <machinen>

    (* compute a function over a stream of floats using a farm *)
    (* very simple code to test the speed-up of a farm skeleton *)

    let msecwait = 100;; (* time spent in sequential computation, microseconds *)
    let nproc = 1;; (* number of nodes allocated in the farm skeleton *)

    (* active wait for n microseconds *)

Figure 1: Test code for evaluating speedup of a farm skeleton, by varying the nproc and msecwait parameters.
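The comments in Figure 1 mention an active wait used to emulate a fixed sequential load per stream item. As a rough illustration only, such a wait could be sketched as below. This is not the archived test code: the names active_wait, work and seq_farm are hypothetical, the worker body (squaring a float) is an arbitrary stand-in, and the farm is emulated sequentially with List.map rather than with the OCamlP3l combinators.

```ocaml
(* Hypothetical sketch, not the archived test code. *)

(* Busy-wait until at least [usec] microseconds of CPU time have elapsed.
   Sys.time returns the processor time used by the program, in seconds. *)
let active_wait usec =
  let budget = float_of_int usec /. 1e6 in
  let start = Sys.time () in
  while Sys.time () -. start < budget do () done

(* Hypothetical per-item worker: burn [usec] microseconds, then compute. *)
let work usec x =
  active_wait usec;
  x *. x

(* Sequential reading of a farm: apply the worker to every stream item. *)
let seq_farm f stream = List.map f stream

let () =
  let results = seq_farm (work 100) [1.0; 2.0; 3.0] in
  List.iter (fun y -> Printf.printf "%g\n" y) results
```

Using CPU time rather than wall-clock time keeps the sketch within the standard library; a wall-clock variant would need the Unix library.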
The exact source of the test suite has SWH-ID swh:1:cnt:8e7f96cb82d50ea73c2d8e4bf2c832b0ada49a7e.
The third step was to run a parameter sweep experiment on a cluster available at the University of Pisa, and collect the data that was used to produce the new figures that we show in Figure 2.
The cluster is configured with 32 nodes, each equipped with dual-socket Intel(R) Xeon(R) CPU E5-2640 v4 2.40GHz processors. At the time of this experiment, 5 nodes were busy or in maintenance, and therefore our replication experiments were run with parallelism degrees n_w ∈ [1, 24]. It is worth pointing out that the cluster nodes, unlike the ones used in the original experiments, sport 20 cores with 2-way hyperthreading. Hence, in order to replicate the very same experiments dating back to the late '90s, we used only one process per node, as if the node had a single processor available.
Figure 2a and Figure 2b show the completion times and the relative speedups measured in three different experiments, processing streams of data items of different lengths.
The completion times are very close to the ideal ones, but looking at the speedup figures we can see that the larger the load, the better the speedup. Indeed, the interarrival time of tasks on the stream is negligible with respect to the time spent processing a single task, and therefore longer streams help give more work to each of the "workers" in the parallel farm. This, in turn, reduces the impact of the overheads associated with the setup and orchestration of the nodes that take part in the computation.
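The effect of stream length on speedup can be illustrated with a deliberately simple cost model. The sketch below assumes that the parallel completion time is a fixed setup overhead plus the total work divided evenly among the workers. Both the model and all numeric values are assumptions made for illustration, not the measurements reported in Figure 2, and the function names are hypothetical.

```ocaml
(* Illustrative model only: the numbers below are made up, not measurements. *)

(* Speedup: sequential time divided by parallel completion time. *)
let speedup ~t_seq ~t_par = t_seq /. t_par

(* Efficiency: speedup divided by the number of workers. *)
let efficiency ~t_seq ~t_par ~workers =
  speedup ~t_seq ~t_par /. float_of_int workers

(* Assumed model: fixed overhead plus evenly divided work. *)
let modeled_time ~overhead ~items ~task_time ~workers =
  overhead +. float_of_int items *. task_time /. float_of_int workers

(* Speedup predicted by the assumed model for a stream of [items] tasks. *)
let modeled_speedup ~overhead ~items ~task_time ~workers =
  let t_seq = float_of_int items *. task_time in
  let t_par = modeled_time ~overhead ~items ~task_time ~workers in
  speedup ~t_seq ~t_par

let () =
  List.iter
    (fun items ->
      let s =
        modeled_speedup ~overhead:1.0 ~items ~task_time:0.1 ~workers:10 in
      Printf.printf "items=%d speedup=%.2f\n" items s)
    [100; 1_000; 10_000]
```

Under this model the fixed overhead is amortized over more tasks as the stream grows, so the predicted speedup approaches the number of workers without ever reaching it.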
Figure 2c shows the result of three different runs of the same experiments, executed at three different times of the same day on the cluster. We observe the same completion times, confirming the stability results already achieved at the time OCamlP3l was developed. Finally, Figure 2d reports the scaled speedup results. For each parallelism degree n_w, we used an input stream whose length was k × n_w. The measured completion times are almost constant and close to the ideal one, which is the sequential time taken to compute a k-item stream.
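The reason the ideal completion time stays flat in the scaled experiment can be checked with a few lines of arithmetic. The sketch below assumes an overhead-free model in which the work is divided evenly among the workers. The values of k and of the per-task time are arbitrary placeholders, not the ones used in the experiments, and the function names are hypothetical.

```ocaml
(* Overhead-free model, assumed for illustration: work is split evenly. *)
let ideal_time ~items ~task_time ~workers =
  float_of_int items *. task_time /. float_of_int workers

(* Scaled experiment: the stream length grows as k * workers. *)
let scaled_ideal_time ~k ~task_time ~workers =
  ideal_time ~items:(k * workers) ~task_time ~workers

let () =
  (* Placeholder values, not those of the actual experiments. *)
  let k = 100 and task_time = 0.1 in
  List.iter
    (fun workers ->
      Printf.printf "n_w=%2d items=%4d ideal=%.3f s\n"
        workers (k * workers)
        (scaled_ideal_time ~k ~task_time ~workers))
    [1; 2; 4; 8; 16; 24]
```

Since the stream length is k × n_w, the factor n_w cancels and the ideal time reduces to k times the per-task time, independently of the parallelism degree.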
To sum up, we could replicate quite faithfully the quality of the results achieved more than 20 years ago. It is worth pointing out that this is a nontrivial achievement, as the architectures used for these experiments today and in the past are completely different, both in terms of computation power (processors) and in terms of communication bandwidth and latency (network interface cards).
In our opinion, this is clearly due to two distinct and synergistic factors:
• the clean "functional" design and implementation of OCamlP3l, which has withstood language and system evolution, and
• the algorithmic skeleton principles underlying the overall implementation of OCamlP3l, which naturally provide "portability" across different architectures, regardless of the hardware they use.
We have reported on our experience in reproducing work we did ourselves on the OCamlP3l experiment over 22 years ago [1]. Contrary to our expectations, the most difficult part has been recovering the source code. For its preservation, we had relied on institutional repositories first, and freely available collaborative development platforms later, neither of which passed the test of time.
We are delighted to report that, by leveraging the Software Heritage archive [5], we have been able to recover the full history of development of the system, and rebuild it as it likely was at the time the original article was published. Although we did not find the exact test suite used 22 years ago to test the scalability of the system, we have been able to replicate the results on modern hardware.
As a byproduct of this work, we have also safely archived in Software Heritage, and described in HAL, the stable final release 2.03 of OCamlP3l [6].
Based on this experience, we strongly suggest systematically archiving and referencing research source code following the Software Heritage guidelines [11].
[7] code contained in this directory seems to be version 1.0 [7] and is classified as follows by the sloccount utility: