Updated upstream files

author Jan Wielemaker
Fri Sep 30 10:22:42 2016 +0200
committer Jan Wielemaker
Fri Sep 30 10:22:42 2016 +0200
diff --git a/examples/iris.swinb b/examples/iris.swinb
new file mode 100644
index 0000000..9adf444
--- /dev/null
+++ b/examples/iris.swinb
@@ -0,0 +1,511 @@
+<div class="notebook">
+<div class="nb-cell markdown">
+# EM Clustering of the Iris Dataset
+The Iris dataset contains data regarding 150 iris plants. Each plant is described by four continuous features.
+The plants are divided into three classes: Iris setosa, Iris virginica and Iris versicolor, 50 plants for each class.
+The Iris dataset was introduced by Ronald Fisher in 1936  as an example of linear discriminant analysis and has become a standard dataset for data analysis.
+Here we use it as an example of clustering to showcase the data analysis capabilities of SWISH and its integration with R. We aim at grouping the data in such a way that objects in the same group (called a cluster) are more similar to each other than to those in other groups. Since the Iris dataset provides a natural grouping, we aim at rediscovering that grouping using clustering.
+A clustering algorithm that is particularly suitable for data described by continuous features is EM clustering.
+In applying such an algorithm, we assume that the data is generated by a mixture of components, where each component corresponds to a cluster and is associated with a probability distribution of the features. So we assume that the data is generated in this way: a component from a fixed and finite set is sampled and then the features are sampled from the distribution associated to the component.
+Here we assume that the features given the component are independent random variables with a Gaussian distribution.
+So the joint probability distribution of the features given the component is the product of four Gaussians.
+The aim of EM clustering is to find the probability distribution over the components (a discrete distribution) and the parameters of the Gaussians over the features given the components.
+For the Iris dataset, we assume that there are three components/clusters, so the distribution over the components has two free parameters. For the distributions over features we have two parameters (mean and variance) for each cluster and each feature, so overall 24 parameters.
+EM clustering applies the EM algorithm: it first assigns random values to the parameters and then computes, for each individual and each cluster, the probability of the individual belonging to the cluster (expectation step).
+Given these probabilities, it is possible to estimate the value of the parameters that maximize the likelihood that the data is generated by that model (maximization step, using weighted relative frequency). Then the expectation and maximization steps are repeated. It is possible to prove that during this cycle the likelihood of the data always increases. The cycle is performed a fixed number of times or until the likelihood does not increase any more. 
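The expectation/maximization cycle just described can be sketched in a few lines. The following is an illustrative Python sketch for a one-dimensional mixture of Gaussians (the notebook itself implements the four-feature, three-cluster case in Prolog; all names here are our own):

```python
import math

def gauss_pdf(x, mean, var):
    """Gaussian density at x for the given mean and variance."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def em_step(data, weights, means, variances):
    """One expectation step followed by one maximization step."""
    # Expectation: responsibility of each component for each point.
    resp = []
    for x in data:
        p = [w * gauss_pdf(x, m, v)
             for w, m, v in zip(weights, means, variances)]
        s = sum(p)
        resp.append([pi / s for pi in p])
    # Maximization: weighted relative frequency, weighted mean, weighted variance.
    n = len(data)
    new_w, new_m, new_v = [], [], []
    for k in range(len(weights)):
        rk = sum(r[k] for r in resp)
        new_w.append(rk / n)
        mk = sum(r[k] * x for r, x in zip(resp, data)) / rk
        new_m.append(mk)
        new_v.append(sum(r[k] * (x - mk) ** 2 for r, x in zip(resp, data)) / rk)
    return new_w, new_m, new_v

def log_lik(data, weights, means, variances):
    """Log likelihood of the data under the current mixture."""
    return sum(math.log(sum(w * gauss_pdf(x, m, v)
                            for w, m, v in zip(weights, means, variances)))
               for x in data)
```

Iterating `em_step` never decreases `log_lik`, which is the property the notebook prints as LL at each iteration.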
+In this notebook, we describe a Prolog implementation of EM applied to the Iris dataset.
+The predicate =em/1= performs the EM algorithm with the assumption 
+that there are 3 clusters and that the four features of the dataset are independent  given the cluster. 
+To show the quality of the clustering, =em/1= performs a Principal Component Analysis (PCA)
+of the dataset and draws a scatter plot using the first two 
+principal components (those that account for as much of the variability in
+the data as possible). The color of the points indicates their class.
+Two plots are drawn, one with the original classes of the dataset and 
+one with the clusters assigned by the EM algorithm, so that the two groupings
+can be compared.
+The two groupings should be similar.
+=em/1= prints the log-likelihood of the dataset (LL) at each iteration. The values increase, showing that
+ the clustering improves at each iteration.
+For EM clustering, see 
+Witten, Ian H., and Eibe Frank. 
+Data Mining: Practical machine learning tools and techniques. 
+Morgan Kaufmann, 2005.
+For the Iris dataset, see
+For the principal component analysis, see
+Author: Fabrizio Riguzzi
+## Running the clustering
+EM clustering is started by calling =em/1= with the number of iterations, for example 10.
+Running em(10) produces two scatter plots that should be similar, showing that EM clustering successfully recovered the natural classes of the examples.
+<div class="nb-cell query">
+em(10).
+<div class="nb-cell markdown">
+As an exercise for the brave, develop a general version of the EM algorithm that is applicable to any dataset, not limited to three clusters and four features.
+## Code Description
+The =em/1= predicate first collects all the data, then performs the PCA on it and draws a scatter plot using the first two components, with the color indicating the real class (Iris setosa, Iris virginica or Iris versicolor).
+Then it computes the mean and variance of the four features over all the data and draws samples for the initial values of the parameters: the mean of the Gaussians for each feature and cluster is sampled from a Gaussian whose mean is the mean of the feature over all the data and whose variance is 1.0. The distribution over components is initially taken to be `[c1:0.2,c2:0.5,c3:0.3]`.
+The EM cycle is entered using predicate =em_it/7=.
+Then the individuals are classified into the most probable cluster (=find_most_likely/2=), the original labels are replaced by the new ones, and a PCA is applied to the new dataset, drawing a scatter plot using the first two components and indicating the cluster with color.
+<div class="nb-cell program" data-background="true">
+:- use_module(library(clpfd), [transpose/2]).
+:- use_module(library(apply)).
+em(It):-
+  findall([A,B,C,D,Type],data(A,B,C,D,Type),LA),
+  data_stats(LA,DM1,DM2,DM3,DM4,V1,V2,V3,V4),
+  pca(LA),
+  findall(M,(between(1,3,_),gauss(DM1,1.0,M)),[M11,M21,M31]),
+  findall(M,(between(1,3,_),gauss(DM2,1.0,M)),[M12,M22,M32]),
+  findall(M,(between(1,3,_),gauss(DM3,1.0,M)),[M13,M23,M33]),
+  findall(M,(between(1,3,_),gauss(DM4,1.0,M)),[M14,M24,M34]),
+  em_it(0,It,LA,par([c1:0.2,c2:0.5,c3:0.3],
+    [M11,M12,M13,M14,V1,V2,V3,V4],[M21,M22,M23,M24,V1,V2,V3,V4],
+    [M31,M32,M33,M34,V1,V2,V3,V4]),_Par,_,LA1),
+  find_most_likely(LA1,LPostCat),
+  maplist(replace_cat,LA,LPostCat,LPost),
+  pca(LPost).
+<div class="nb-cell markdown">
+=em_it/7= takes as input the number of iterations, the data, the current weighted assignment of examples to clusters and the current value of the parameters, and returns the updated parameters and the updated weighted assignment of examples to clusters.
+=em_it/7= first performs the expectation step (=expect/4=) and then the maximization step (=maxim/2=).
+=expect/4= takes as input the data and the current parameters and returns the weighted assignments and the log likelihood.
+It first computes the weight with which each example belongs to each cluster, then it normalizes (=normal/6=) the weights so that they sum to 1 for each example and then computes the log likelihood (=log_lik/7=).
+<div class="nb-cell markdown">
+<div class="nb-cell program" data-background="true">
+em_it(It,It,_LA,Par,Par,LA,LA):- !.
+em_it(It0,It,LA,Par0,Par,_LA0,LAOut):-
+  expect(LA,Par0,LA1,LL),
+  write('LL '),write(LL),nl,
+  maxim(LA1,Par1),
+  It1 is It0+1,
+  em_it(It1,It,LA,Par1,Par,LA1,LAOut).
+expect(LA,par([c1:P1,c2:P2,c3:P3],
+  G1,G2,G3),[L1,L2,L3],LL):-
+  maplist(weight(G1,P1),LA,L01),
+  maplist(weight(G2,P2),LA,L02),
+  maplist(weight(G3,P3),LA,L03),
+  normal(L01,L02,L03,L1,L2,L3),
+  log_lik(L01,L02,L03,P1,P2,P3,LL).
+maxim([LA1,LA2,LA3],par([c1:P1,c2:P2,c3:P3],C1,C2,C3)):-
+  stats(LA1,W1,C1),
+  stats(LA2,W2,C2),
+  stats(LA3,W3,C3),
+  SW is W1+W2+W3,
+  P1 is W1/SW,
+  P2 is W2/SW,
+  P3 is W3/SW.
+<div class="nb-cell markdown">
+=normal/6= normalizes the weights so that they sum up to one for each example.
+=weight/4= takes as input the current parameters for the Gaussians of a cluster, the probability of the cluster and an example, and returns a pair of the example and its weight.
+=log_lik/7= takes as input the weighted examples for each cluster and the probability of each cluster, and returns the log likelihood of the data.
+<div class="nb-cell program" data-background="true">
+normal(L01,L02,L03,L1,L2,L3):-
+  maplist(px,L01,L02,L03,L1,L2,L3).
+px(E-W01,E-W02,E-W03,E-W1,E-W2,E-W3):-
+  S is W01+W02+W03,
+  W1 is W01/S,
+  W2 is W02/S,
+  W3 is W03/S.
+weight([M1,M2,M3,M4,V1,V2,V3,V4],P,
+       [A,B,C,D,_],[A,B,C,D]-W):-
+  gauss_density_0(M1,V1,A,W1),
+  gauss_density_0(M2,V2,B,W2),
+  gauss_density_0(M3,V3,C,W3),
+  gauss_density_0(M4,V4,D,W4),
+  W is W1*W2*W3*W4*P.
+log_lik(L1,L2,L3,P1,P2,P3,LL):-
+  foldl(combine(P1,P2,P3),L1,L2,L3,0,LL).
+combine(P1,P2,P3,_-W1,_-W2,_-W3,LL0,LL):-
+  LLs is log(P1*W1+P2*W2+P3*W3),
+  LL is LL0+LLs.
+<div class="nb-cell markdown">
+=find_most_likely/2= takes as input a list with three elements, each being the list of weighted examples for a cluster, and returns the list of clusters that are most likely for each example.
+=replace_cat/3= replaces the category (class) of each example with the given category =Cat=.
+<div class="nb-cell program" data-background="true">
+find_most_likely([L1,L2,L3],LC):-
+  maplist(classify,L1,L2,L3,LC).
+classify(_-W1,_-W2,_-W3,Cat):-
+  find_max([W1,W2,W3],Cat).
+find_max(Counts,MaxC):-
+  max_list(Counts,MV),
+  nth1(Max,Counts,MV),!,
+  concat_atom(['c',Max],MaxC).
+replace_cat([A,B,C,D,_],Cat,[A,B,C,D,Cat]).
+<div class="nb-cell markdown">
+=stats/3= takes as input the dataset and the set of weights with which the examples belong to a cluster and computes the parameters of the Gaussian distributions of the features for that cluster. It uses standard formulas for mean and variance modified to take into account the weight.
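The weighted formulas used by =stats/3= can be sketched in Python (an illustrative sketch, not the notebook's code): the mean is the weight-weighted average of the values, and the variance is the weighted average of squared deviations, with the same zero-weight fallbacks as the Prolog predicates below.

```python
def weighted_mean(pairs):
    """pairs: (value, weight) tuples; returns (mean, total_weight).
    Falls back to mean 0.0 when the total weight is zero."""
    sw = sum(w for _, w in pairs)
    if sw == 0.0:
        return 0.0, 0.0
    return sum(v * w for v, w in pairs) / sw, sw

def weighted_var(pairs, mean):
    """Weighted average of squared deviations from the given mean.
    Falls back to 1.0 when the total weight is zero."""
    sw = sum(w for _, w in pairs)
    if sw == 0.0:
        return 1.0
    return sum(w * (v - mean) ** 2 for v, w in pairs) / sw
```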
+<div class="nb-cell program" data-background="true">
+stats(LA,SW,[M1,M2,M3,M4,V1,V2,V3,V4]):-
+  maplist(component_weight,LA,CA,CB,CC,CD),
+  weighted_mean(CA,M1,SW),
+  weighted_mean(CB,M2,_),
+  weighted_mean(CC,M3,_),
+  weighted_mean(CD,M4,_),
+  weighted_var(CA,M1,V1),
+  weighted_var(CB,M2,V2),
+  weighted_var(CC,M3,V3),
+  weighted_var(CD,M4,V4).
+weighted_var(L,M,Var):-
+  foldl(agg_val_var(M),L,(0,0),(S,SW0)),
+  SW is SW0,
+  (SW=:=0.0->
+    write(zero_var),nl,
+    Var=1.0
+  ;
+    Var is S/SW
+  ).
+weighted_mean(L,M,SW):-
+  foldl(agg_val,L,(0,0),(S,SW0)),
+  SW is SW0,
+  (SW=:=0.0->
+    write(zero_mean),nl,
+    M is 0
+  ;
+    M is S/SW
+  ).
+agg_val(V -N,(S,SW),(S+V*N,SW+N)).
+agg_val_var(M,V -N,(S,SW),(S+(M-V)^2*N,SW+N)).
+<div class="nb-cell markdown">
+=data_stats/9= computes the mean and the variance of each component in the overall dataset.
+<div class="nb-cell program" data-background="true">
+data_stats(LA,M1,M2,M3,M4,V1,V2,V3,V4):-
+  maplist(component,LA,CA,CB,CC,CD),
+  mean(CA,M1),
+  mean(CB,M2),
+  mean(CC,M3),
+  mean(CD,M4),
+  variance(CA,M1,V1),
+  variance(CB,M2,V2),
+  variance(CC,M3,V3),
+  variance(CD,M4,V4).
+mean(L,M):-
+  length(L,N),
+  sum_list(L,S),
+  M is S/N.
+variance(L,M,Var):-
+  length(L,N),
+  foldl(agg_var(M),L,0,S),
+  Var is S/N.
+agg_var(M,V,S0,S):-
+  S is S0+(V-M)^2.
+<div class="nb-cell markdown">
+=gauss_density_0/4= and =gauss_density/4= compute the value of the Gaussian density of a given mean and variance at a certain point. The first treats specially the case of 0 variance.
+=gauss/3= samples a value from a Gaussian given the mean and variance.
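The same Box-Muller transform used by =gauss/3= can be sketched in Python (illustrative only): two uniform samples are turned into one standard normal sample, which is then scaled by the standard deviation and shifted by the mean.

```python
import math
import random

def gauss_sample(mean, variance, rng=random):
    """Draw one sample from N(mean, variance) via the Box-Muller transform."""
    u1, u2 = rng.random(), rng.random()
    # log1p(-u1) == log(1 - u1) avoids log(0); 1 - U is uniform like U.
    r = math.sqrt(-2.0 * math.log1p(-u1))
    theta = 2.0 * math.pi * u2
    return math.sqrt(variance) * r * math.cos(theta) + mean
```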
+<div class="nb-cell program" data-background="true">
+gauss_density_0(M,V,X,W):-
+  (V=:=0.0->
+    write(zero_var_gauss),
+    W=0.0
+  ;
+    gauss_density(M,V,X,W)
+  ).
+gauss_density(Mean,Variance,S,D):-
+  StdDev is sqrt(Variance),
+  D is 1/(StdDev*sqrt(2*pi))*exp(-(S-Mean)*(S-Mean)/(2*Variance)).
+gauss(Mean,Variance,S):-
+  number(Mean),!,
+  random(U1),
+  random(U2),
+  R is sqrt(-2*log(U1)),
+  Theta is 2*pi*U2,
+  S0 is R*cos(Theta),
+  StdDev is sqrt(Variance),
+  S is StdDev*S0+Mean.
+<div class="nb-cell markdown">
+=pca/1= computes the PCA of the dataset passed as input and draws a scatter plot of the examples using the first two principal components, with color indicating the class.
+=pca/1= uses R in two ways. First, it uses the =prcomp= function of R to compute the principal components of the dataset. Then it uses =qplot= to draw a scatter plot of the dataset.
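For readers without R at hand, the projection that =prcomp= computes can be sketched in Python with NumPy (an illustrative sketch, assuming NumPy is available): center the columns of the data matrix, then project onto the top right singular vectors.

```python
import numpy as np

def pca_scores(X, n_components=2):
    """Project the rows of X onto the first n_components principal
    components: center the columns, then use the right singular
    vectors of the centered matrix (the same scores R's prcomp returns,
    up to sign)."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T
```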
+<div class="nb-cell markdown">
+<div class="nb-cell program" data-background="true">
+:- <- library("ggplot2").
+pca(LA):-
+  length(LA,NP),
+  maplist(add_cat,LA,LCat,L),
+  L=[H|_],
+  length(H,Comp),
+  append(L,LLin),
+  D =..[c|LLin],
+  data<- matrix(D,ncol=Comp,byrow='TRUE'),
+  pc<- prcomp(data),
+  Data0<-pc["x"],
+  Data0=[Data1],
+  foldl(getn(NP),Data2,Data1,[]),!,
+  transpose(Data2,Data),
+  maplist(getx,Data,X),
+  maplist(gety,Data,Y),
+  x<- X,
+  y<-Y,
+  class<-LCat,
+  <-qplot(x, y, colour=class),
+  r_download,
+  nl. 
+add_cat([A,B,C,D,Cat],Cat,[A,B,C,D]).
+getn(N,LN,L,Rest):-
+    length(LN,N),
+    append(LN,Rest,L).
+<div class="nb-cell markdown">
+The Iris dataset is represented using =data/5=.
+<div class="nb-cell program" data-background="true">
+% Iris dataset
+<div class="nb-cell markdown">
diff --git a/examples/swish_tutorials.swinb b/examples/swish_tutorials.swinb
index 6678834..07df164 100644
--- a/examples/swish_tutorials.swinb
+++ b/examples/swish_tutorials.swinb
@@ -26,4 +26,10 @@ The notebooks below explain the basics of using R from SWISH.  You can test whet
 <- 'R.Version'().
+<div class="nb-cell markdown">
+### More elaborate examples
+  - [EM Clustering of the Iris Dataset](example/iris.swinb)
diff --git a/lib/swish/highlight.pl b/lib/swish/highlight.pl
index 092b348..2dffcf9 100644
--- a/lib/swish/highlight.pl
+++ b/lib/swish/highlight.pl
@@ -747,6 +747,8 @@ style(dict_content,	 dict_open-dict_close,             []).
 style(expanded,		 expanded,			   [text]).
 style(comment_string,	 comment_string,		   []).
 style(ext_quant,	 ext_quant,			   []).
+style(unused_import,	 unused_import,			   [text]).
+style(undefined_import,	 undefined_import,		   [text]).
 					% from library(http/html_write)
 style(html(_Element),	 html,				   []).
 style(entity(_Element),	 entity,			   []).
diff --git a/web/help/editor.html b/web/help/editor.html
index b97bf76..1801e39 100644
--- a/web/help/editor.html
+++ b/web/help/editor.html
@@ -3,8 +3,35 @@
   <title>Using the Prolog editor</title>
+  <style>
+  dl.help-editor dd { margin-left: 2em; }
+  </style>
-  <body>
+The editor is an instance of <a
+href="https://codemirror.net/">CodeMirror</a>, extended with various
+standard, contributed and dedicated plugins. Notably the <em>hover</em>
+and <em>semantic highlighting</em> plugins have been developed within
+the SWISH project.
+<h2>Find and replace</h2>
+We load the default CodeMirror plugin for find and replace.
+The keybindings are:</p>
+<dl class="help-editor">
+  <dt>Ctrl-F / Cmd-F</dt><dd>Start searching</dd>
+  <dt>Ctrl-G / Cmd-G</dt><dd>Find next</dd>
+  <dt>Shift-Ctrl-G / Shift-Cmd-G</dt><dd>Find previous</dd>
+  <dt>Shift-Ctrl-F / Cmd-Option-F</dt><dd>Replace</dd>
+  <dt>Shift-Ctrl-R / Shift-Cmd-Option-F</dt><dd>Replace all</dd>
+  <dt>Alt-F</dt><dd>Persistent search (dialog doesn't autoclose,
+  enter to find next, Shift-Enter to find previous)</dd>
+  <dt>Alt-G</dt><dd>Jump to line</dd>