Trees (3) : ltree extension of PostgreSQL

We created the mind map of the book “The pedagogy of the Ramchal” with WiseMapping. In the future, we will develop our own visual mind mapping tool bs”d.

We want now

  • to import this mind map in our database using the ltree extension of PostgreSQL, a implementation of trees with materialized paths, pre-calculated paths that describe ancestry of nodes
  • to encapsulate this ltree in RelGraph, our relational data model for graphs
  • to draw the tree with a R script

We export the mind map in text format and we remove the leading spaces in vi with :

:%le

Then we load raw data in a staging table because it is often a good idea to use the expressive power of SQL to parse and arrange raw data :

# we start psql with user postgres (needed by copy)
sudo -i -u postgres psql -d higayone

# we do a bulk copy
copy tree_staging from '/home/mchl/Desktop/Blog/map.txt';

The table tree_staging is like that :

input_line
------------------------------------------------------------------------------
1 The pedagogy of the Ramchal
1.1 The man
1.1.1 His life
1.1.1.1 Padua
1.1.1.1.1 Yeshayah Bassan
1.1.1.1.2 Isaac Cantarini
1.1.1.1.3 Aviad Sar Shalom Basilea
1.1.1.2 Amsterdam
1.1.1.2.1 Talmidim
1.1.1.2.1.1 Marranes
1.1.1.2.1.2 Modern science
1.1.1.3 Eretz Israel
...

Then we create extension ltree :

create extension if not exists ltree;

Then we create table and sequence for ltree :

drop table if exists ltree_node;
drop sequence if exists ltree_node_sequence;
create sequence ltree_node_sequence;

create table ltree_node (
 node_id integer primary key
    default nextval('ltree_node_sequence')
 ,node_name text
 ,node_path ltree
);

It is necessary to drop the table before the sequence, otherwise PostgreSQL will raise an error. Then we add a gist index on the column ltree  :

create index path_gist_idx on ltree_node using gist(node_path);

Then we populate the ltree with the data of the staging table :

insert into ltree_node(node_name, node_path) select
 substring(input_line, 
    char_length(substring(
        input_line from '[0-9.]*[ ]')) + 1) as node_name
 ,cast(substring(input_line from '[0-9.]*') as ltree) as node_path
from tree_staging;

The line

substring(input_line, 
    char_length(substring(
        input_line from '[0-9.]*[ ]')) + 1) as node_name

means that we take every character since the beginning of the line until the first white space. And the line

cast(substring(input_line from '[0-9.]*') as ltree) as node_path

means that we take every character since the first white space until the end of the line.

Let’s have now a look on a first application of ltree using the <@ operator. This query gives all the paths of the tree :

 select
 ln1.node_id
 ,array_to_string(
    array_agg(ln2.node_name order by ln2.node_path)
              ,'/') as full_path_name
from  
 ltree_node as ln1 
inner join 
 ltree_node as ln2 
on 
 ln1.node_path <@ ln2.node_path 
group by 
 ln1.node_id, ln1.node_path, ln1.node_name
order by 
 ln1.node_id, full_path_name;

The line

ln1.node_path <@ ln2.node_path

means that the left argument is a descendant of the right one. And array_agg is an aggregate function (working with group by) transforming his arguments into an array.

The output is like that :

node_id full_path_name
------- --------------
1       The pedagogy of the Ramchal
2       The pedagogy of the Ramchal/The man
3       The pedagogy of the Ramchal/The man/His life
4       The pedagogy of the Ramchal/The man/His life/Padua
5       The pedagogy of the Ramchal/The man/His life/Padua/Yeshayah Bassan
6       The pedagogy of the Ramchal/The man/His life/Padua/Isaac Cantarini
7       The pedagogy of the Ramchal/The man/His life/Padua/Aviad Sar Shalom Basilea
8       The pedagogy of the Ramchal/The man/His life/Amsterdam
9       The pedagogy of the Ramchal/The man/His life/Amsterdam/Talmidim
10      The pedagogy of the Ramchal/The man/His life/Amsterdam/Talmidim/Marranes
11      The pedagogy of the Ramchal/The man/His life/Amsterdam/Talmidim/Modern science
12      The pedagogy of the Ramchal/The man/His life/Eretz Israel
...

We populate now the tables of RelGraph, our relational model for graphs.

1. We create a new graph :

insert into graph(graph_name) values('Book');

2. We populate the table node :

insert into node(graph_id, node_name) select
 (select graph_id from graph where graph_name = 'Book')
 ,node_name
from
 ltree_node;

3. We populate the table edge :

insert into edge(from_node, to_node) select
 ln2.node_id as from_node
 ,ln1.node_id as to_node
from
 ltree_node ln1
 ,ltree_node ln2
where
 ln1.node_path = subpath(ln2.node_path
                         ,0
                         ,nlevel(ln2.node_path) - 1);

We see here two other functions of ltree :

  • subpath : subpath of tree starting at the beginning and starting at the penultimate component
  • nlevel : gives the number of components of the path

But we have to find solutions to make visible in RelGraph the insert / update / delete performed in the ltree. It will be the topic of a future post.

Having node and edge populated, we can run the following R script (which, in my private language, I call a graph slurper) to visualize the tree :

require(visNetwork)
require(RPostgreSQL) 

# establish connection 
driver ← dbDriver("PostgreSQL") 
connexion ← dbConnect(driver, dbname="higayone"
                      ,host="localhost", port=5432
                      ,user="mchl", password="secret")                              

# populates nodes and egdes
nodes ← dbGetQuery(connexion, "
 select 
   node_id as id
  ,node_name as label
 from node
 order by node_id;") 

edges ← dbGetQuery(connexion, "
 select 
   from_node as from
  ,to_node as to
 from edge
 order by from_node, to_node;") 

visNetwork(nodes, edges) %>%
  visNodes(font = '10px', shape = 'ellipse')

 

Here is a thumbnail (I am afraid I cannot do better with the free version of WordPress)  :

Book

If we want to generate a table of contents, we must think at two things :

  • node_path ‘1’ becoming the title of the book, it must be removed from the tree
  • for all other node_path, we simply remove the ‘1.’ at the beginning
select
 substring(node_path::text, 3, char_length(node_path::text))
 ,repeat('   ', nlevel(node_path)- 2) || node_name
from
 ltree_node
where node_path != '1'
order by node_path;

That gives :

1           The man
1.1            His life
1.1.1             Padua
1.1.1.1              Yeshayah Bassan
1.1.1.2              Isaac Cantarini
1.1.1.3              Aviad Sar Shalom Basilea
1.1.2             Amsterdam
1.1.2.1              Talmidim
1.1.2.1.1               Marranes
1.1.2.1.2               Modern science
1.1.3             Eretz Israel
...

Trees (2) : visualization with R

 Rplot

This mind map represents the following hierarchy of topics we used in a previous post (this one is small and inactive but with the visNetwork graph visualization R package, it will possible to develop animated and interactive mind maps) :

Root
            Relational theory
                      First-order predicate logic
                                Syntax
                                Rules of inference
                                Deductive systems
                                Semantics
                      Three-valued logic  
                      Normal forms
                                Functional dependencies
                                Multivalued dependencies
                                1NF
                                2NF
                                3NF
                                BCNF
                                4NF
                                5NF
                                DKNF
           SQL
                      Types and domains
                      PKs & FKs
                      Relations
                      Operators
                            Restriction
                            Projection
                            Joins
                               Cross join
                               Inner join
                               Left outer join
                               Right outer join
                               Full outer join
                            Union
                            Intersection
                            Difference
                            Divide
                      Aggregations
                      Constraints
                      Views

 

We changed the data structure for the implementation of Adjacency List Model for trees with relational operators.

Instead of

tree(node_id, node_up_id, topic) 
    where node_up_id is a reference (FK) to node_id

we use a two-tables representation of  a tree :

nodes(node_id, topic)

edges(from_node, to_node) 
    where from_node and to_node are references (FKs) to nodes.node_id

It’s always better to use normalized data model. It takes often a bit of effort but the queries will be easier to design.

We implement in PostgreSQL this model :

create sequence node_sequence;

create table nodes (
 node_id integer not null 
    default nextval('node_sequence') 
    primary key
 ,topic varchar(128)
);

create table edges (
 from_node integer not null 
    references nodes(node_id)
    on update cascade
 ,to_node integer not null
    references nodes(node_id)
 check (from_node != to_node)
 ,primary key(from_node, to_node)
 );

The visualization of the mind_map with the visGraph R library is now straightforward.

1. We load the libraries and we make connection to the database :

require(RPostgreSQL)
require(visNetwork)

driver ← dbDriver("PostgreSQL") 
connexion ← dbConnect(driver,
 ,dbname="higayone"         
 ,host="localhost", port=5432
 ,user="mchl", password="secret")                 

2. We populate nodes and edges, the R data structure is a direct mapping of the relational structure (a R data frame is relational table) :

nodes ← dbGetQuery(connexion, "
 select
   node_id as id
  ,topic as label
 from nodes
 order by id_topic;")

edges ← dbGetQuery(connexion, "
 select
   from_node as from
  ,to_node as to
 from edges
 order by to_node, from_node;")

3. We call the graphic interface, the visNetwork data structure is direct mapping of the R data structure (visNetwork function waits nodes identified by id and label and edges identified by from and to) :

visNetwork(nodes, edges) %>%
 visNodes(font = '10px', shape = 'ellipse') 

Remove duplicate records from a table

A well known problem : a table has an id which is the Primary Key (PK) but some records have the others fields similar. The script must remove all duplicates and keep the record with the smallest id.

For example, a table Customers having a SERIAL as PK but no UNIQUE constraint on the other fields. Two or more employees are able to create customers. And after a certain amount of time, there is a quite high probability to have duplicates.

The job is done with a single statement but we need to create a table with duplicates and to perform statistics in order to verify our script.

--
-- Create table, populates it, take a backup
--
drop table if exists dups_test;
drop table if exists dups_test_backup;

create table dups_test (
 id int primary key
 ,title text
 ,info int
);

insert into dups_test
 select i, 'Record #' || i::text, i
 from generate_series(1, 10000) i;

create table dups_test_backup as select * from dups_test;

By the way, we can see how the PostgreSQL function generate_series is used.

The first records of dups_test :

select * from dups_test order by id;

gives

id   title         info
---- ----------    -----
1    Record #1	   1
2    Record #2     2
3    Record #3	   3

Let’s now introduce some duplicates. The id must be different because id is a PK but the other fields, title and info will be taken from initial data. In order to guarantee the unicity, we will use a sequence starting at the size of the initial table + 1.

--
-- insert duplicates into the table
--
create sequence dups_seq minvalue 10001;

insert into dups_test (id, title, info)
 select nextval('dups_seq'), title, info 
 from dups_test 
 order by random()  
 limit 1000;

insert into dups_test (id, title, info)
 select nextval('dups_seq'), title, info 
 from dups_test 
 order by random()  
 limit 500;

insert into dups_test (id, title, info)
 select nextval('dups_seq'), title, info 
 from dups_test 
 order by random()  
 limit 250;

insert into dups_test (id, title, info)
 select nextval('dups_seq'), title, info 
 from dups_test 
 order by random() 
 limit 125;

drop sequence dups_seq;

By the way, we can see how to pick n random records in a table : SELECT * FROM table ORDER BY random() LIMIT n.

We see that 1875 (1000 + 500 + 250 + 125) records have been added :

select count(*) from dups_test;

gives

count
-----
11875

Some records of dups_test :

select title, info, id from dups_test order by title, info, id;

gives

title          info   id
-----------    ----   -----
...
Record #1490   1490   1490
Record #1491   1491   1491
Record #1491   1491   11491
Record #1491   1491   32491
Record #1492   1492   1492
Record #1493   1493   1493
Record #1493   1493   11493
Record #1493   1493   21493
Record #1494   1494   1494
...

In order to know the repartition of duplicates, we execute the following query

select distinct count(*) as count_records, t.countd as count_dups 
from (
 select info, count(*) as countd
 from dups_test
 group by (info)
 having count(*) > 1) as t
group by t.countd
order by t.countd;

that gives

count_records	count_dups
-------------   ----------
1498            2
167             3
13              4
1               5

It means that 1498 records have 2 times the same fields. And 167 have 3 times the same fields etc. (The same title and info but not the same id because id is the PK)

And now, we delete duplicates with this tricky query :

--
-- remove duplicates
--
delete from dups_test 
 where id in (
  select id
   from (
    select 
     id
     ,row_number() over (partition by title,info order by id ) as r
    from dups_test ) t 
   where t.r > 1);

that gives

DELETE 1875

How works this tricky query ?

partition is a window function : a function that performs a calculation across a set of table rows that are somehow related to the current row like an aggregate function. But unlike regular aggregate functions, use of a window function does not cause rows to become grouped into a single output row . The rows retain their separate identities.

row_number() gives the number of the current row within its partition, counting from 1.

Let’s execute the following query on the subset we have examined before :

select 
  id
  ,title
  ,info
  ,row_number() over (partition by title,info ORDER BY id ) as r
 from dups_test
 where info in (1490, 1491, 1492, 1493, 1494)
 order by title, info

gives

id      title          info   r
----    ------------   ----   -
1490    Record #1490   1490   1 
1491    Record #1491   1491   1 
11491   Record #1491   1491   2 
32491   Record #1491   1491   3 
1492    Record #1492   1492   1 
1493    Record #1493   1493   1 
11493   Record #1493   1493   2 
21493   Record #1493   1493   3 
1494    Record #1494   1494   1 

The deletion of the records in each partition having a ROW_NUMBER > 1 removes duplicates and keep initial record.

We have to verify that all duplicates have been deleted and only them.

select count(*) from dups_test;

gives the initial number of rows :

count
-----
10000

Good !

And we run now the script that computes the repartition of the duplicates.

select distinct count(*) as count_records, t.countd as count_dups 
from (
 select info, count(*) as countd
 from dups_test 
 group by (info) having count(*) > 1) as t 
group by t.countd order by t.countd; 

gives

count_records	count_dups
-------------   ----------
(0 rows)

Good !

And then we check the final table against the backup :

select count(*)
from dups_test a
full outer join dups_test_backup b
on a.id = b.id
where a.title = b.title and a.info = b.info;

gives

count
-----
10000

This query, comparing all fields, gives the initial number of rows.

A little explanation about OUTER JOINs :

table1 LEFT OUTER JOIN table2 ON table1.id = table2.id gives all the records of table1 even if there is no corresponding record in the table in table2.

table1 RIGHT OUTER JOIN table2 ON table1.id = table2.id gives all the records of table2 even if there is no corresponding record in the table in table1.

table1 FULL OUTER JOIN table2 ON table1.id = table2.id gives all records belonging to both table1 and table2.

Alternatively, the following script do the same job in searching fields having the same id but not the same title or info :

select count(*)
from dups_test a
full outer join dups_test_backup b
on a.id = b.id
where a.title != b.title or a.info != b.info;

gives

count
-------
(0 rows)

It seems that everything is OK.

By the way, have a look at the two last queries. The only difference is the WHERE clause

where a.title = b.title and a.info = b.info

giving all the records and

where a.title != b.title or a.info != b.info

giving no records.

It is a good example of the De Morgans’law :

The negation of a conjunction is the disjunction of the negations.
The negation of a disjunction is the conjunction of the negations.

or, in other words :

not (A AND B) = (not A) OR (not B)
not (A OR B) = (not A) AND (not B)

And now, a little bit cleaning before we leave :

drop table dups_test_backup;

And, last but not least, a visualization of the repartition of duplicates !

When we make statistics, it is always better to visualize the numbers. The R programming language is excellent for statistics and visualization ! And R has a PostgreSQL interface.

In order to produce a nice chart, a large amount of duplicates has been added.

And in order to avoid to put SQL in the front end application, as needed in client-server architecture, we create the following view that embeds the query we used before :

create or replace view stat_on_dups as
 select distinct count(*) as count_records, t.countd as count_dups 
 from (
  select info, count(*) as countd
  from dups_test
  group by (info)
  having count(*) > 1) as t
 group by t.countd
 order by t.countd;

that gives

count_records	count_dups
-------------   ----------
2027            2
1469            3
1082            4
857             5
549             6
376             7
260             8
182             9
124             10
73              11
53              12
38              13
30              14
26              15
19              16
7               17
11              18
5               19
2               20
1               21
3               22
1               26

This R script :

require("RPostgreSQL") 
# establish connection 
driver ← dbDriver("PostgreSQL") 
connexion ← dbConnect(driver, dbname="higayone"
                        ,host="localhost", port=5432
                        ,user="mchl", password="secret")                              

# query the database 
list ← dbGetQuery(connexion, "select * from stat_on_dups;") 

# create the plot 
vx ← seq(0, 2000, 100) 
vy ← seq(1, 30, 1) 
plot(list, type = "h", col = "blue", lwd = 1, 
        ,xlab = "Number of records", 
        ,ylab = "Number of duplicates", 
        ,main = "Repartion of duplicates"
        ,xaxt = "n", yaxt = "n") 
axis(side = 2, at = vy) 
axis(side = 1, at = vx) 

# clean before leave 
dbDisconnect(connexion) 
dbUnloadDriver(driver)

 

displays :

Rplot

The x-axis shows the number of records having duplicates and the y-axis shows the number of duplicates for those records.

We see at glance that we have a lot of records with few duplicates and few records with a lot of duplicates.