Grab yourself a Graph

Here is a story about graph databases and our motivation for choosing one. First, I think it’s very important to understand whether you actually need a graph datasource. It may not be right for you, and of course you shouldn’t go there just because you think it’s cool. A graph has its cost. Graphs are common when your business requirements involve social features, connections between entities, dependencies across hops (covered later), or transitive relations.

My startup’s major requirement was the ability to connect entities using relations, with properties on those relations. Assume we want to create a followers data model in which following is transitive: if user X follows user Y and user Y follows user Z, then X follows Z as well.

Let’s start simple. Relational:

We are all aware of the standard solutions for “connections between entities” in a relational datasource (One-To-Many, Many-To-One, etc.). If you go relational, you have to start thinking about a JOIN table that holds foreign keys to both participating tables, which further increases the cost of join operations:

> SELECT * FROM followers;
+-------------+--------------+
| user_id     | follower_id  |
+-------------+--------------+
| 1           | 2            |
| 1           | 9            |
| 4           | 4            |
| 9           | 5            |
+-------------+--------------+

Now, let’s take this business case: is user 1 following user 5?

First we need to find everyone user 1 (X) follows, then check whether that list (Y) contains user 5.

After that we need to iterate over everyone followed by the users in that list (Z) and check whether any of them happens to be user 5.

What’s going to happen when we double the number of records? An index won’t save you, dude (you still have to look up the values in the index tree on every hop).
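The hop-by-hop lookup described above can be sketched roughly like this. It is only an illustration using Python's built-in sqlite3 module with the sample rows from the table. It assumes a row (user_id, follower_id) reads as "user_id follows follower_id", and the helper name is_following and the max_hops limit are made up for this sketch.

```python
import sqlite3

# In-memory copy of the sample followers table from above.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE followers (user_id INTEGER, follower_id INTEGER)")
conn.executemany(
    "INSERT INTO followers VALUES (?, ?)",
    [(1, 2), (1, 9), (4, 4), (9, 5)],
)


def is_following(conn, source, target, max_hops=3):
    """Check whether source reaches target, issuing one query per hop.

    Assumption: a row (user_id, follower_id) means user_id follows follower_id.
    """
    frontier = {source}  # users whose follow lists we look up in this hop
    seen = {source}      # users already expanded or queued
    for _ in range(max_hops):
        if not frontier:
            return False
        placeholders = ",".join("?" * len(frontier))
        rows = conn.execute(
            f"SELECT follower_id FROM followers WHERE user_id IN ({placeholders})",
            tuple(frontier),
        ).fetchall()
        found = {row[0] for row in rows}
        if target in found:
            return True
        # Only expand users we have not looked at yet.
        frontier = found - seen
        seen |= found
    return False


print(is_following(conn, 1, 5))  # user 1 -> user 9 -> user 5
```

Every extra hop is another round trip over the whole table (or its index), and that is exactly the part that hurts as the number of records grows.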

Let’s try a different datasource model. Key-value? Here we go…

I tried to model our relations with Redis (known as a fast key-value store).

Things actually started to look better than relational from a performance perspective, even when we increased the number of records. BUT as soon as we added more than 2 hops to the equation, the nightmare started.

Assume the following Key->Value structure, where A->B means A follows B:

A-> B,D,C,T,R

B-> D,E

C->T,R

So in that example we can conclude that A follows B,D,C,T,R

But what happens when we modify our values? We need to make sure we iterate over all the affected keys and modify them accordingly.

What if I want C to follow K? C’s new state is:

C->T,R,K

That means we must update A as well (because A follows C):

A-> B,D,C,T,R,K

B-> D,E

C->T,R,K
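To make the update above concrete, here is a small sketch that uses a plain Python dict of sets as a stand-in for the key-value store (no real Redis calls). The helper add_follow is a hypothetical name, and the approach of scanning every key is an assumption for illustration, not the code we actually ran.

```python
# Stand-in for the key-value store: key -> everyone that key follows.
follows = {
    "A": {"B", "D", "C", "T", "R"},
    "B": {"D", "E"},
    "C": {"T", "R"},
}


def add_follow(store, src, dst):
    """Make src follow dst and push the change to every key that follows src."""
    # Everything src gains: dst itself plus whatever dst already follows.
    gained = {dst} | store.get(dst, set())
    # src, plus every key whose value list already contains src.
    affected = [src] + [key for key, values in store.items() if src in values]
    for key in affected:
        # Skip self-references so nobody ends up following themselves.
        store.setdefault(key, set()).update(gained - {key})


add_follow(follows, "C", "K")
print(sorted(follows["C"]))  # C now lists K
print(sorted(follows["A"]))  # A lists K too, because A follows C
```

Note that even this simple add has to scan every key to find who is affected. A delete is harder still: before removing K from A you have to be sure A does not also reach K through some other path.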

What about a delete action? I found myself writing crazy algorithms to handle all the edge cases. At that point it felt like we hadn’t chosen the right design. Maybe there is a convenient way to model this in a key-value datasource; if you find one, please share :)

Graph to the rescue:

A graph datasource’s building blocks are based on relations between entities from the very beginning. When we query the graph, it looks only at nodes that are directly connected. (The power of this datasource relies on the ability to iterate over and query the second and third rings of nodes connected to ours, the so-called hops.)

As long as nodes are not connected (aka related), the search will never hit them unless you add relationships between them and the queried node.
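The idea of walking outward hop by hop and never touching unrelated nodes can be sketched with a plain breadth-first search over an in-memory adjacency list. This is only an analogy for how a graph query behaves, not the API of any of the products discussed below. The sample graph stores direct follows only, and the function name hops_between is made up for this sketch.

```python
from collections import deque

# Direct "follows" edges only; X and Y form an island unrelated to A.
graph = {
    "A": ["B", "C"],
    "B": ["D", "E"],
    "C": ["T", "R"],
    "X": ["Y"],
}


def hops_between(graph, start, target):
    """Breadth-first search from start.

    Returns (number of hops to target or None, set of nodes touched).
    """
    visited = {start}
    queue = deque([(start, 0)])
    while queue:
        node, hops = queue.popleft()
        if node == target:
            return hops, visited
        for neighbour in graph.get(node, []):
            if neighbour not in visited:
                visited.add(neighbour)
                queue.append((neighbour, hops + 1))
    return None, visited


hops, touched = hops_between(graph, "A", "E")
print(hops)            # E is on A's second ring: A -> B -> E
print("X" in touched)  # the unrelated island was never visited
```

The work done depends on the neighbourhood of the starting node, not on how many records the whole datasource holds.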

To sum up our performance findings: with a low number of nodes (1,000) you probably won’t see the benefit of a graph datasource, but as soon as you have 1,000,000 records, that’s when a graph shows its power against any other datasource that has to serve a relationship-heavy business requirement like this one.

As a “bonus” section, I will share our findings about the different graph solutions and implementations currently on the market. The market currently seems to have 3 major solutions. We will go through them one by one:

1. TitanDB:

TitanDB is considered a distributed graph database with very good scaling capabilities. I ran a POC with this datasource.

As an open-source addict, the first important thing for me is community and documentation. For some reason I couldn’t find one decent place with a proper “Let’s start” guide. After a while I think I found out why: TitanDB is built on 3rd-party solutions. For example, the storage backend can be Cassandra, Berkeley DB, Amazon’s DynamoDB and some others. In addition, the indexing mechanism is also built on 3rd parties like Solr or Elasticsearch.

So it means that I actually need a DevOps team behind me just to bring this thing to life (and what about production optimizations?).

Overall: Less practical for startup teams. Hard to kick-start, but looks promising from the performance and scaling aspects.

2. Neo4j:

Neo4j has been among us for many years now. I think it may be the most popular graph solution on the market. First, the docs are very neat. The community is large. It has a great query language called Cypher. I managed to get a Neo4j instance running within 10 minutes (just clicked Next->Next->Next), including having my coffee.

Seems like we found our solution! But wait, not everything comes for free. First of all, Neo4j has its limitations. It doesn’t scale easily. You can’t shard it like you could with TitanDB. If your project requires significant writes (compared to reads), you’ll need to super-optimize and maybe redesign your architecture (but that’s something we will keep for another post). Another disadvantage is license cost. To enable Neo4j Enterprise Edition (HA, clustering, etc.) you are going to find this line: “Please call us for further information”, which means you are in trouble. I found out that you have to pay insane amounts of money per year for licenses. Not good for a startup either.

Overall: Very practical and easygoing, but expensive for startups as they grow, and you need to put in some effort to get better performance, which doesn’t come out of the box.

3. OrientDB:

That would be the last one we tried. OrientDB also promises a distributed graph solution. However, its community is not as wide as those of the other solutions. The docs seem fine, but for some reason I still sensed that this solution was less popular than the others.

Overall: Easy to get going, less popular, docs are properly in place, distributed.

In the end, I think a graph is a great solution, but it won’t come for free. We chose one of the solutions I mentioned, but that’s a topic for another post :) Stay tuned for additional updates in my next posts about graphs. Idan.
