
Building a Neo4j Graph on the Side

When the topic of graph databases — specifically Neo4j — has come up in the IRC chatrooms I lurk in, I hear a common refrain: “Neo4j is really awesome, and there’s no way my boss will ever let me use it.”

I also hear from its sibling a lot: “Neo4j looks really cool, but I have no idea what I’d use it for.”

Both are pretty common issues for developer types. They’re interested in exploring a technology or platform, but don’t think it has a place in the work they do on a daily basis. That’s where our friend the side project comes in.


The side project lets you make whatever you feel like on your own time. You get to build something substantive, and in the process it gives you opportunities to:

  1. Learn new technologies and techniques
  2. Follow a passion or curiosity
  3. Demonstrate expertise and a willingness to learn new skills. Career++!

Here are some ideas for side projects that would involve graphs:

  • Ask 20 people you know for their top 5 favorite movies. Put the friends and movies in a graph. Find interesting patterns (my friends who like this movie often also like that movie). You’re now well on your way to making a simple recommendation engine (see the Cypher sketch after this list).
  • Build a blog driven by a graph instead of a relational database. While it doesn’t sound like one of the traditional “sweet spots” for graphs, there’s no reason you can’t do it, and you could explore both relationship/discovery and the simpler, easier model of nodes and relationships vs. tables and joins.
  • Create a graph-driven version of the Open Movie DB. This is a big data source, so a lot of the work would be in conversion, but even if you got a subset you could do a lot of interesting stuff, and it would be a good examination of the power of the graph model vs relational databases.
  • Grab your last X tweets from Twitter (because everyone uses Twitter, right?) and put them into a graph. Make sure to separate entity types like users, tweets, hashtags, etc. Then you can easily examine your posting and interaction patterns using Cypher queries. Extra credit for making fancy charts and visualizations.
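
To give a flavor of the movie-recommendation idea above, here is a minimal Cypher sketch. The Person and Movie labels, the LIKES relationship and the name “Alice” are illustrative assumptions, not a prescribed schema:

// Hypothetical schema: (Person)-[:LIKES]->(Movie)
// Recommend movies liked by people who share Alice's tastes, excluding ones she already likes
MATCH (me:Person {name: 'Alice'})-[:LIKES]->(:Movie)<-[:LIKES]-(other:Person)-[:LIKES]->(rec:Movie)
WHERE other <> me AND NOT (me)-[:LIKES]->(rec)
RETURN rec.title AS recommendation, COUNT(*) AS score
ORDER BY score DESC
LIMIT 5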

Since you’re doing this on your own, you don’t have to be fancy. It doesn’t need accoutrements like a well-designed UI or an authentication system, until you decide it does. You can explore the stuff that interests you, and leave out the bits that don’t.

OmNomHub by Michelle Sanver is a great example of a side project that uses Neo4j. You can also see her talks on Neo4j, which use OmNomHub as an example of how graphs can be useful and powerful.

It can be hard to find the time for a side project, especially if you have lots of other responsibilities. Making time for one can be extremely rewarding, both personally and professionally. If you have an idea, sign up for our free trial, and give Neo4j a spin.

Comparing Graph and Relational Databases

When comparing graph databases to relational databases, one thing should be clear up front: the choice does not have to be exclusive. That is, graph databases – and other NoSQL options – will likely not replace relational databases wholesale. There are well-defined use cases that will involve relational databases for the foreseeable future.

However, there are limitations – particularly the time and risk involved in making additions or updates to a relational database – that have opened up room to use alternatives or, at least, to consider complementary data storage solutions.

In fact, there are a number of use cases where relational databases are a poor fit for the data, such as social relationships at scale or intelligent recommendation engines. Overall, the limited way a relationship can be defined within a relational database is a main reason to consider a switch to a graph database.

Also, industries of all types are seeing exponential data growth and the type of data that is growing fastest is unstructured. It doesn’t fit well in columns & rows – AKA the relational database schema.

Using the schema-less model found in graph databases is a huge benefit for applications and the developers that maintain them. If the application can drive the data model, it fits better with the development cycle and can reduce risk when making model changes later.

The relationships in a property graph, like nodes, can have their own properties, such as a weighted score. With that capability, it is relatively trivial to add a new property on a relationship – especially useful when the relationship was not defined when the application was initially created.
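
As a minimal sketch of that idea – assuming hypothetical Person and Book nodes already connected by a LIKES relationship – attaching a weighted score after the fact is a one-line change:

// Hypothetical example: add a weight to an existing relationship after the fact
MATCH (:Person {name: 'Alice'})-[r:LIKES]->(:Book {title: 'Graph Databases'})
SET r.weight = 5
RETURN r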

For applications that use a relational database, this would be done by creating a join table, as it is known in the RDBMS world. This new table joins together two other tables and allows properties to be stored about that relationship. While this is a common practice, it adds a significant layer of complexity and maintenance that does not exist in the graph database world.

Yet another reason you might consider moving to a graph database is to remove the workarounds that must be used to make an application fit within a relational database. As discussed in the previous example, a join table is created in order to have metadata that provides properties about relationships between two tables.

Often, new relationships will need to be created, which requires yet another join table. Even if it has the same properties as the other join table, it must be separate in order to ensure the integrity of the relationships.

In the case of graph databases, typed relationships can exist between more than just two types of nodes. For example, a relationship called “LIKES”, e.g. (person)-[:LIKES]->(book), can also be applied to other node types, e.g. (person)-[:LIKES]->(movie). In fact, the relationship type can be applied between any of the applicable nodes in the graph.
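
A short sketch of that flexibility, again with hypothetical Person, Book and Movie nodes – the same relationship type connects a person to both, with no schema change required:

// One LIKES type, applied to different kinds of nodes
MATCH (p:Person {name: 'Alice'}), (b:Book {title: 'Graph Databases'}), (m:Movie {title: 'The Matrix'})
MERGE (p)-[:LIKES]->(b)
MERGE (p)-[:LIKES]->(m)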

Another reason to consider graph databases over relational databases is what can be referred to as “join hell”. While a single join can be relatively trivial to create, applications very often require joins over several tables, and simple joins are also the least expressive. It is in these cases that the real expense of joins begins to show – in both development time and application performance. In addition, modifying the join query might require adding yet more join tables, adding even more complexity to development and further degrading application performance.

With a graph database, adding new relationships and the queries that traverse them happens at the application level. This removes a layer of development complexity, saves time, and offers better application performance.

For all the differences between graph and relational databases, there are a few similarities. A significant one is that both can achieve what is known as ACID compliance. This set of principles guarantees that transactions completed by the database are processed reliably, which keeps data safe and consistent.

Why use a Graph Database?

What do LinkedIn, Walmart and eBay as well as many academic and research projects have in common? They all depend upon graph databases as a core part of their technology stack.

Why have such a wide range of industries and fields found a common relationship through graph databases?

The short answer: graph databases offer superior and consistent speed when analyzing relationships in large datasets and offer a tremendously flexible data structure.

As many developers can attest, one of the most tedious pieces of applications dependent on relational databases is managing and maintaining the database schema.
While relational databases are often the right tool for the job, there are some limitations – particularly the time and risk involved in making additions or updates to the model – that have opened up
room to use alternatives or, at least, to consider complementary data storage solutions. Enter NoSQL!

When NoSQL databases, such as MongoDB and Cassandra, came along, they brought with them a simpler way to model data as well as a high degree of flexibility – or even a schema-less approach – for the model.
While document and key-value databases removed many of the time and effort hurdles, they were mainly designed to handle simple data structures.

However, the most useful, interesting and insightful applications require complex data as well as a deeper understanding of the connections and relationships between different data sets.

For example, Twitter’s graph database – FlockDB – more elegantly solves the complex problem of storing and querying billions of connections than their prior relational database solution. In addition to simplifying the structure of the connections, FlockDB also ensures extremely fast access to this complex data. Twitter is just one use case of many that demonstrate why graph databases have become a draw for many organizations that need to solve scaling issues for their data relationships.

Graph databases offer a blend of simplicity and speed, all while giving data relationships first-class status.

While fast access to complex data at scale is a primary driver of graph database adoption, another reason is that they offer the same tremendous flexibility found in so many other NoSQL options. The schema-free nature of a graph database permits the data model to evolve without sacrificing speed of access or adding significant and costly overhead to development cycles.

With the intersection of graph database capabilities, the growth of interest in them and the trend toward more connected big data, graph databases speed up both applications and the overall development cycle – which is why they will continue to grow as a leading alternative to relational databases.

Uncovering Open Source Community Stories with Neo4j

Every dataset has a story to tell — we just need the right tools to find it.

At Graph Story, we believe that graph databases are one of the best tools for finding the story in your data. Because we are also active members of several open source communities, we wanted to find interesting stories about those communities. So, we decided to look at package ecosystems used by developers.
The first one we tackled was Packagist, the community package repository for PHP. Nearly 20,000 maintainers have submitted over 60,000 packages to Packagist, which gives us a lot of interesting data to investigate.

How We Used Neo4j to Graph the Packagist Data

Collecting this data and getting it into Neo4j was relatively straightforward.

One HTTP endpoint on the Packagist site returns a JSON array of all the package names. We iterated over that, and made individual calls to another endpoint to retrieve a JSON hash for each package, which includes both base package data and information on each version of the package, including what packages a given version requires.

The data model for our initial version was pretty straightforward. We have three node labels:

  • Package
  • Maintainer
  • Version

and five relationship types:

  • HAS_VERSION
    (Package)-[:HAS_VERSION]->(Version)
  • MAINTAINED_BY
    (Package)-[:MAINTAINED_BY]->(Maintainer)
  • REQUIRES
    (Version)-[:REQUIRES]->(Package)
  • REQUIRES_DEV
    (Version)-[:REQUIRES_DEV]->(Package)
  • SUGGESTS
    (Version)-[:SUGGESTS]->(Package)
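
As a rough sketch of how one package might land in the graph under this model – the property names here are assumptions for illustration, not the exact statements our importer ran:

// Hypothetical import sketch for a single package, maintainer and version
MERGE (p:Package {name: 'acme/demo-package'})
MERGE (m:Maintainer {name: 'acme'})
MERGE (v:Version {name: '1.0.0'})
MERGE (p)-[:MAINTAINED_BY]->(m)
MERGE (p)-[:HAS_VERSION]->(v)
MERGE (dep:Package {name: 'psr/log'})
MERGE (v)-[:REQUIRES]->(dep)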

This certainly isn’t a complete schema to represent everything within the Packagist ecosystem, but it let us do some interesting analyses:

  1. What packages get required the most by other packages?
  2. What maintainers have the most packages?
  3. What maintainers have the most requires of their packages?
  4. What maintainers work together the most (packages can have multiple maintainers)?
  5. What are the shortest paths between two given packages, or two given maintainers?

Our Findings

You can see our results so far at packagist.graphstory.com.

Some of what we found was expected: certain well-known open source component libraries get required the most, like doctrine/orm and illuminate/support.

It gets more interesting when examining maintainers, though. Some are high profile folks in the PHP community, like fabpot and taylorotwell, but some are people with whom we weren’t as familiar. It certainly made us re-examine what we thought we knew about the PHP community – it’s not always folks who are speaking at conferences that are making big contributions.

The shortest path analyses were interesting as well. There were a few packages that showed up in these paths over and over to tie together maintainers and packages, such as psr/log. “Keystone packages” might be a good term for these, because they seem to join and support the PHP open source community again and again.
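
A shortest-path query of that kind can be expressed directly in Cypher. The maintainer names below are placeholders, and the relationship types and path length are left open so the path can run through packages and versions:

// Hypothetical sketch: how are two maintainers connected?
MATCH (a:Maintainer {name: 'maintainer-one'}), (b:Maintainer {name: 'maintainer-two'}),
      p = shortestPath((a)-[*..10]-(b))
RETURN p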

A Cypher Example: Finding Top Maintainers by Packages

Here’s one example Cypher query we ran to find the top Packagist maintainers by package count:

MATCH (m1:Maintainer)<-[:MAINTAINED_BY]-(:Package)
WITH m1, COUNT(*) AS count
WHERE count > 1
WITH m1, count
ORDER BY count DESC
RETURN m1.name AS name, count
LIMIT { limit }

See the results of this query and others on packagist.graphstory.com.

Why We Used a Graph Database

Much of what we’ve done would be possible with an RDBMS or a document database, so why do it in a graph database – specifically Neo4j?

We found three major upsides while working on this project:

  1. It is so much easier to map out data and relationships. Making relationships in RDBMSes work, even in simple cases, is harder, and significantly more difficult to change down the road. Compared to popular document databases, Neo4j relationships are done in the database — we don’t have to maintain them with application logic.
  2. Discovering how people and packages are connected is much easier and faster than with RDBMSes and popular document databases. Cypher and the graph model make it easy to get the data we want without complex SQL joins or a wrapper script in another language.
  3. Trying new queries to explore the data is so convenient with Neo4j’s web interface. It’s quick and easy to prototype and profile from there, and then copy and paste the Cypher into your app.

We’re obviously big believers in graph databases at Graph Story, but this is a fun project that highlights a lot of the advantages of Neo4j. We found a number of interesting stories in Packagist, and there are certainly more to uncover.

Easy Geocoding For Your Graph


At Graph Story we’re all about making the lives of developers easier, because we’re developers too. To that end, we’re announcing our newest feature to do just that: a geocoding service for your graph database. This allows Graph Story customers to efficiently geocode addresses and store them in their graph, without using a third-party service.

The service is exposed as an additional endpoint in the Neo4j HTTP API. The developer will make a POST request against that endpoint that sends the address information as JSON in the body. The service will geocode the address, create an :Address node in the Neo4j DB, and return the node data.

We can use any HTTP client tool to send these requests. With the tried-and-true curl, we’d call the endpoint like so:

curl http://0.0.0.0:7475/graphstory/geo/geocode \
  -u USERNAME:PASSWORD \
  -H "Content-Type: application/json" \
  -H "Accept: application/json" \
  --data '{ "streetAddress": "902 Cooper Street", "city": "Memphis", "state": "TN", "postalCode": "38104" }'

We can also use httpie, which makes sending JSON a lot easier, like so:

http --auth USERNAME:PASSWORD POST http://0.0.0.0:7475/graphstory/geo/geocode \
  streetAddress="902 Cooper Street" city=Memphis state=TN postalCode=38104

Either way, the request will look like this:

POST /graphstory/geo/geocode HTTP/1.1
Accept: application/json
Authorization: Basic REDACTED
Content-Length: 95
Content-Type: application/json
Host: HOSTNAME:7474

{
    "city": "Memphis",
    "postalCode": "38104",
    "state": "TN",
    "streetAddress": "902 Cooper Street"
}

The response will look like this:

HTTP/1.1 200 OK
Access-Control-Allow-Origin: *
Content-Type: application/json
Server: Jetty(9.0.5.v20130815)
Transfer-Encoding: chunked

{
    "data": {
        "node": {
            "bbox": [
                35.12103275,
                -89.990982,
                35.12103275,
                -89.990982
            ],
            "city": "Memphis",
            "formattedAddress": "902 S Cooper St, Memphis, TN 38104",
            "gtype": 1,
            "lat": 35.12103275,
            "lon": -89.990982,
            "postalCode": "38104",
            "state": "TN",
            "streetAddress": "902 Cooper Street",
            "uuid": "9f21cd5b-151f-11e5-b1a9-d71b198494e6"
        }
    },
    "status": "success"
}
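
Once the :Address node exists, it can be connected to the rest of your graph with ordinary Cypher. In this hedged sketch, the Store label and LOCATED_AT relationship are hypothetical, and the uuid comes from the sample response above:

// Hypothetical follow-up: link the geocoded Address node to one of your own nodes
MATCH (a:Address {uuid: '9f21cd5b-151f-11e5-b1a9-d71b198494e6'})
MATCH (s:Store {name: 'Midtown Location'})
MERGE (s)-[:LOCATED_AT]->(a)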

Additional notes:

  • If an :Address node created with the same request fields already exists, the existing node data will be returned – it won’t create duplicates
  • For authentication, just use the username and password for your Neo4j instance
  • Request fields are as follows:
    • streetAddress (required)
    • city (required)
    • state (required)
    • postalCode (optional)

Keep in mind that this is an alpha feature, so things can and will change. Possible changes to anticipate:

  • The return format may change (probably the bbox and gtype values will go away).
  • New arguments to the endpoint, like sending the entire address as a string and letting our service parse it. What we change will be strongly influenced by your feedback, so if you have ideas or problems, please let us know!

For now we are enabling the geocoding service only for customers who request it. Drop us a line at support@graphstory.com and let us know you’d like it turned on, and we will get right on it. All we ask is that you let us know what you think!

PHP[TEK] 2015

We’re super excited to join the PHP community at php[tek] as a sponsor for this year’s conference. Please make sure to stop by our booth in the registration area and grab a game card and t-shirt!

Just in time for tek, we’ve put together a few example apps that help demonstrate the power of graph databases used in conjunction with PHP.

Composer Graph

Ed Finkler, our Lead Dev & Head of Developer Culture, put together an application that covers the Composer ecosystem, specifically the relationships between maintainers and package dependencies. “I’m seriously blown away at how graph databases can easily and performantly solve problems that require huge, complex SQL in an RDBMS,” said Ed. “Trying to create these types of dense and complex relationships in an RDBMS is just much, much harder to do. The structure of the data is a perfect fit for a graph.” Check it out at http://packagist.graphstory.com/

php[tek] Twitter Graph

Jeremy Kendall, our CTO, has put together a simple but elegant app that analyzes the Twitter activity around the #phptek hashtag. “Tek is all about the community. With this application, we can analyze and visualize the community conversations taking place on Twitter,” said Jeremy. You can check out Jeremy’s app at http://twitter.graphstory.com/

GraphConnect Europe 2015: (graphs)–[:ARE]->(everywhere)

We couldn’t agree more with the theme for this year’s GraphConnect in London: Graphs are everywhere!

Graph databases are the future, and Neo4j is clearly leading the pack. As a Neo4j partner, we are excited about the road ahead and about building out an amazing Graph Database as a Service platform.

Neo4j database ranking chart

So to everyone at GraphConnect this week: we hope you’ll enjoy all the sessions, and make sure not to miss the closing keynote “Impossible is Nothing with Graphs” by Dr. Jim Webber. You will be sure to enjoy his insights!

Graph Story Presenting at eMerge Americas conference 2015


*** Update: We’ve been selected as one of the 10 Early Stage Finalists! – Presenting Tuesday May 5th – 9.40am!

This week we’ll be presenting in Florida at the eMerge Americas conference. eMerge Americas is a global idea exchange focusing on how technology and innovation are disrupting industries. The conference serves as a platform connecting revolutionary startups, cutting-edge ideas, and global industry leaders & investors across North America, Europe, and Latin America.

Graph Story has been selected for the Startup Showcase competition. This pitching competition, sponsored by Nasdaq and Google’s Next Wave, has a grand prize of $175,000 in cash investment. After presenting at the SXSW Accelerator competition in January, we’re excited to present again and make the case for our Graph Database as a Service business. We’ve made tremendous progress in the last few months and have enjoyed working with several enterprise clients, helping them get their Enterprise Neo4j environment up and running in no time, all while being fully secure and scalable. So we’re excited to present in front of a jury that includes judges from Palladium, Goldman Sachs and Crunch Ventures – to name a few!

If we don’t see you in Miami, maybe we can meet in Chicago in 3 weeks: Graph Story is sponsoring php[tek] (May 18th-22)!

We have a lot of exciting news in the pipeline, including several new hires – so stay tuned for more!

 

Graph Story selected for SXSW Accelerator Competition

We’re excited to share that Graph Story was selected to participate in the Enterprise and Smart Data Technologies category for the 7th annual SXSW Accelerator competition presented by Oracle.

The SXSW Accelerator competition is the marquee event of SXSW Interactive Festival’s Startup Village, where leading startups from around the world showcase some of the most impressive new technology innovations to a panel of hand-picked judges and a live audience. Hundreds of companies applied to present at SXSW Accelerator, and Graph Story was selected as one of 48 finalists across six different categories.

The two-day event will be held the first weekend of SXSW Interactive, Saturday, March 14 through Sunday, March 15, on the sixth floor of the Downtown Austin Hilton. The pitch competition will then culminate with the SXSW Accelerator Awards Ceremony on Sunday evening, March 15, where winning startups from each category will be announced and honored.

“Over the past six years of companies competing in SXSW Accelerator, more than 50 percent have gone on to receive funding in excess of $1.7 billion and 12 percent of the companies have been acquired,” said SXSW Accelerator Event Producer, Chris Valentine. “This year’s finalists have all demonstrated the capability to change our perception of technology and we expect to see them achieve similar, if not greater, success than our past finalists.”

The Accelerator competition will feature finalists across categories including Enterprise and Smart Data Technologies, Entertainment and Content Technologies, Digital Health and Life Sciences Technologies, Innovative World Technologies, Social Technologies, and Wearable Technologies. SXSW category sponsors include Rackspace (Enterprise and Smart Data Technologies), IBM (Innovative World Technologies) and Dyn (Social Technologies).

Graph Kit for Ruby Part 3: Neo4j, Spree – Engine Yard deployment

Welcome to the third and final installment of the Graph Kit for Ruby post series.  Part 1 kicked the series off with a brief look at the idea of a graph database and some description of the Spree online store I planned to enhance with a graph. Part 2 went in depth with the addition of a graph-powered product recommendation system to a Spree online store.  In this final entry we’ll learn how to tweak our Spree + Neo4j store to deploy it to a production server on Engine Yard Cloud.

Provisioning

Engine Yard deployment of the Spree application worked in three major phases: provisioning the server, configuring the server, and pushing the code.  Provisioning runs for ten minutes or so, and then you have a new server running.  Next up – SSH into the server to do the last-mile config before your first deployment.

Oops!  My new server didn’t have my SSH keys and I couldn’t figure out an easy way to get them installed after the provisioning.  Since I was still in a happy prototyping mode I just deleted the server and then uploaded my SSH keys to my Engine Yard account under Tools -> SSH Public Keys -> Add a new SSH public key.  You’ll want to do the same if you’re following along at home.  If you don’t have a key yet, I recommend GitHub’s explanation on what SSH keys are and how to get one.  Once you’ve got your keys uploaded you can safely move on to the ‘boot a new server’ part of the Engine Yard setup.

[Screenshot: Engine Yard control panel]

The Engine Yard cloud servers look to be hosted somewhere on Amazon Web Services.  Once I got my keys sorted out I created a new application in the control panel and named it graphkit_ruby. I chose some pretty standard Rails app defaults – the latest available version of Ruby, the latest available version of Postgres, and Passenger as the web server.  Engine Yard does offer SSL for real production stores but I didn’t bother since I’m not planning to sell these fake pet products.

Configuration

Using environment variables for configuration on Engine Yard

Engine Yard’s provided us with an app server and an RDBMS, which covers the basics of Spree.  To get our new graph-powered recommendation engine running we’ll also need access to a production graph database.  I signed up for a free trial database from the Graph Story front page.  To integrate our external Graph Story Neo4j database with Engine Yard we’ve got to have a nice safe way to store our database credentials and pass them to the Rails app at boot time.  I’ve gotten in the habit of using environment variables to configure my production applications so I can keep such secrets out of the codebase.  Newer versions of Rails support this practice with the addition of a secrets.yml file, but in this case I found it easiest to just use my own env.custom file with the dotenv gem.

To do the same for your app, add the dotenv gem to your Gemfile and then you’ll be able to read environment variables from a text file at run time. This wound up working well with Engine Yard – I just put the file in a shared config folder that is consistently available to the app from one deploy to the next.

We’ll force Rails to load our environment variables from config/env.custom at boot time by setting up a config/boot.rb file that preloads our variables:

[Screenshot: config/boot.rb]

I .gitignored this file full of configuration and secrets so the one I’m using locally won’t be automatically pushed to GitHub or to Engine Yard. We’ll push it up to our EY server with scp:

scp config/env.custom my-ey-server-name:/data/my-app-name/shared/config/

Note that I was able to omit the full EY login string because I have my EY server hostname and credentials set up locally in ~/.ssh/config. If you don’t do that you’ll have to spell out the connection info like scp filename deploy@ey-server-ip-address:destination-folder/ instead.  That shared config directory is automatically symlinked into the config subdirectory of each new deployment to EY.


To integrate this custom environment setup with Rails I went ahead and created a custom Neo4j initializer file for Rails that teases apart a database URL-style configuration into the sort of thing that the Neo4j gem is actually looking for.  This means that I can punch in a NEO4J_URI variable of the form https://username:password@autogenerated-hostname.do-stories.graphstory.com:portnumber and Rails will automatically connect to my remote database.  With a fallback of localhost:7474 we can seamlessly switch between local Neo4j in dev mode and our actual Graph Story hosted database in production.  Speaking of which, you’ll want a free hosted Neo4j database of your own.  You can of course sign up from the Graph Story home page. Here’s what the connection info looks like from within my Graph Story admin panel – I copied the server connection information from here into my env.custom and formatted it into a NEO4J_URI string that I configured Rails to recognize via my Neo4j initializer file.

[Screenshot: Graph Story admin panel]

Creating a production secret token to sign cookies

Rails 4.1 uses a secrets.yml file that is .gitignored much like our above env.custom to hold production secrets. I have never messed with those myself but I did notice that it was looking for ENV["SECRET_KEY_BASE"] to set a production secret token for signing sessions. Let’s go ahead and generate one of those and tack it on to the production secret file we already created and then we’re (almost) in business.

[Screenshot: generating the secret key]

Deploying the code and seeding the database

Setting the production secret was the last step in getting my EY environment to play nice with Spree!  From there I clicked the “deploy HEAD” button in my EY panel and it pulled up the latest code from the Graph Kit Ruby repository on GitHub.  Once the code was finally deployed and the app was running I went into my Spree console and ran my database seeds to get an admin user created and to gin up all of those pretend products for our sample data.  That’s RAILS_ENV=production rake db:seed from within your app’s deployment directory on the server. Mine was /data/graphkit_ruby/current as shown in the secret key screenshot above.

[Screenshot: the finished store]

Next Steps for a Real-World Project

Asynchronous Data Processing

For a high performance production application you wouldn’t really want end users to wait for the round trip between Engine Yard and Graph Story every time we log a new purchase event to the graph. It’d be much smoother to use a background job to send that data over. I’d use Sidekiq if this were a client project – it’s a great Ruby library for background job processing and it comes with a nice job status visualizer.  By offloading the graph writes to a background job you allow the web app to respond that much faster.  It’s common to do the same with transactional emails and any post-order processing in a high volume Spree site.

Richer Recommendations

Once you get started down the road of tracking purchase events you quickly realize there’s lots of other data you can start tracking to use for better recommendations. Here are a few ideas: “Users who looked at this also looked at that”, “users in your area also purchased this”, “users who bought this often buy that in the same order” (see the sketch below).  You can also look at copying more of your product and user metadata over to your graph nodes in order to query on product characteristics or user demographics.
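
For instance, a “users who bought this also bought that” query is a short Cypher pattern. This is only a sketch under assumed labels (User, Product) and a PURCHASED relationship – the actual model from Part 2 may differ, and the sku value is a placeholder:

// Hypothetical sketch: products frequently bought by people who bought this one
MATCH (p:Product {sku: 'SOME-SKU'})<-[:PURCHASED]-(:User)-[:PURCHASED]->(other:Product)
WHERE other <> p
RETURN other.name AS also_bought, COUNT(*) AS times
ORDER BY times DESC
LIMIT 5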

Now that you’ve seen how straightforward it is to model nodes and relationships with Neo4j you can imagine how you might start layering your own user location data or per-cart data into your graph for richer recommendations.  I hope you’ve had as much fun reading this series as I did writing it!
