If you’re a developer, you’ve probably heard of Apache Cassandra and are excited to start working with it. But if you don’t have experience with distributed systems or big data, the thought of getting started with Cassandra can be overwhelming. What should you do? How should you learn? Why is it so important to be able to scale with your data? To get answers to these questions and more, read this article on how to easily learn Apache Cassandra in 10 steps.
Step 1: Set Up Cassandra
First, you need to download and install Cassandra. This is a must, whether you are on Linux or Windows; see the official installation guide for details.
Note that Cassandra 4.0 and later no longer ship Windows launch scripts, so on Windows you are better off running Cassandra under WSL or Docker. You will also need to know how to access your cluster over SSH (Linux) or Remote Desktop Connection (Windows).
Make sure that your system has enough RAM and CPU power available, especially when using YCSB (Hint: check out the c_measurements_suite tool).
Also, do not forget to have your client libraries ready; there is an excellent tutorial by DataStax on installing Cassandra with their Java driver. Once installed, open a new terminal window/prompt in your Cassandra installation directory and type cqlsh to start the interactive shell; press Ctrl+D to exit when you are done.
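Once cqlsh is open, a quick sanity check is to create a throwaway keyspace and confirm it shows up. The keyspace name below (learn_cassandra) is just illustrative; SimpleStrategy with a replication factor of 1 is only appropriate for a single-node development setup:

```sql
-- Create a test keyspace (SimpleStrategy is fine for single-node dev)
CREATE KEYSPACE IF NOT EXISTS learn_cassandra
  WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};

-- Confirm it exists, then switch to it
DESCRIBE KEYSPACES;
USE learn_cassandra;
```

If the keyspace appears in the DESCRIBE output, your installation is working.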
Step 2: Basic Key/Value Pairs
Whether you’re a seasoned developer or just getting your feet wet, making sense of Cassandra can be hard.
Luckily, Cassandra uses a simple, flexible data model that makes it easy to take advantage of its full power. With everything else in place, you need only learn two basic concepts: keys and columns.
Together, they make it possible to read and write any kind of data.
Keys are used for locating data within a column family (more on these below).
Columns contain individual values for each key.
There is effectively no limit on the number of keys, and a single key can hold up to about two billion columns (cells) — but keeping the number of column families in a cluster modest (low hundreds at most) is strongly recommended, since each one carries fixed memory overhead.
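As a sketch of how keys and columns fit together, here is a hypothetical users table (the table name and columns are illustrative): user_id is the key that locates the row, and the remaining columns hold its individual values.

```sql
CREATE TABLE users (
  user_id uuid PRIMARY KEY,  -- the key: locates the row
  name text,                 -- columns: individual values for this key
  email text,
  created_at timestamp
);

-- Write a row, then read it back by key
INSERT INTO users (user_id, name, email, created_at)
VALUES (uuid(), 'Ada', 'ada@example.com', toTimestamp(now()));

SELECT name, email FROM users
  WHERE user_id = 123e4567-e89b-12d3-a456-426614174000;
```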
Step 3: Store Data in Structs
A struct groups several named fields of different types into a single value — in CQL these are called user-defined types (UDTs). If you’re coming from Java or Python, think of one as a lightweight class or dataclass: it can hold multiple different types and is extremely powerful.
Storing data in structs means you can treat related fields as one unit, which makes set-like operations on your dataset easier and even lets you use them with parallel processing systems like Spark.
There are three main things to consider when working with structs:
- The fields (what goes in your struct)
- How these fields get their values
- Performance considerations
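In CQL, a struct is declared as a user-defined type and then used as a column type. The address type and companies table below are illustrative:

```sql
-- A struct-like grouping of named fields
CREATE TYPE address (
  street text,
  city text,
  zip text
);

-- Use it as a column type; frozen means the whole value
-- is written and read as a single unit
CREATE TABLE companies (
  company_id uuid PRIMARY KEY,
  name text,
  hq frozen<address>
);
```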
Step 4: Add Secondary Indexes
A secondary index adds an additional way to query a row, based on a column that’s not part of your primary key.
Creating an index is a one-line operation, but each index adds write and storage overhead, so create them deliberately rather than on every column you might ever filter by.
With a secondary index in place, you can filter on the indexed column alongside your primary key; just be aware that index lookups that don’t also restrict the partition key may fan out to many nodes.
A simple example is indexing an email column so that users keyed by UUIDs (as we recommend) can still be looked up by email address; UUID keys also mean there will be no collisions between rows. We recommend this pattern whenever possible.
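The email-index example from above might look like this in CQL (table and index names are illustrative):

```sql
CREATE TABLE users (
  user_id uuid PRIMARY KEY,
  email text
);

-- A secondary index lets you filter on a non-key column
CREATE INDEX users_by_email ON users (email);

SELECT user_id FROM users WHERE email = 'ada@example.com';
```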
Step 5: Create Custom Compaction Strategies
Compaction is one of Apache Cassandra’s most important features, but it can also be tricky to get right.
As your data grows and takes up more disk space, you may find that your compaction strategy isn’t cutting it anymore; in the worst case, size-tiered compaction can temporarily require as much free disk as the data being compacted, so it can appear to eat up all available space.
Since many shops have different needs when it comes to time-to-compaction, compression levels, and so on, I highly recommend trying out tools such as nodetool (which ships with Cassandra) or DataStax OpsCenter, which give you a full spectrum of customization options for your overall database management.
Both are great ways to configure and monitor your database in real time.
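Compaction strategies are set per table in CQL. As a sketch, the table below (sensor_readings is a hypothetical name) switches an append-only time-series table to the time-window strategy, which is designed for that workload; SizeTieredCompactionStrategy is the default:

```sql
-- Time-window compaction groups SSTables by time bucket,
-- which suits append-only time-series data
ALTER TABLE sensor_readings
  WITH compaction = {
    'class': 'TimeWindowCompactionStrategy',
    'compaction_window_unit': 'DAYS',
    'compaction_window_size': 1
  };
```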
Step 6: Put Cassandra in Production with Docker
The name Docker has been on everyone’s lips over these past few years. But why? And why is it one of your best options for deploying Cassandra in production?
This step dives into Docker and looks at ways you can use it to deploy a production-ready Cassandra cluster.
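A minimal single-node setup might look like the compose file below; the cluster name and host volume path are assumptions to adjust for your environment, and a real production deployment would add more nodes, seeds, and resource limits:

```yaml
services:
  cassandra:
    image: cassandra:4.1
    ports:
      - "9042:9042"            # CQL native protocol port
    environment:
      - CASSANDRA_CLUSTER_NAME=demo-cluster
    volumes:
      - ./cassandra-data:/var/lib/cassandra   # persist data across restarts
```

With this running, `cqlsh localhost 9042` should connect from the host.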
Step 7: Connect to Your Datastore from Client Applications
Now that your datastore cluster is set up and reachable, you need to teach your client applications how to connect.
While there are many ways to do so, for beginners I recommend the native CQL drivers from DataStax, which are available for Java, Python, Node.js, C#, and more. (The older Thrift interface is deprecated and was removed entirely in Cassandra 4.0, so avoid it for new projects; prefer the drivers’ asynchronous APIs where available.)
The advantage of the native drivers is that each language — Java, Python, Ruby, and others — talks directly to your datastore, with built-in support for asynchronous requests, paging, and cluster-aware load balancing.
Step 8: Use Query Batching to Improve Performance
Have you noticed that your queries run slowly sometimes?
Query batching is one of many techniques for grouping work in distributed systems, and it’s very simple to use.
If you aren’t familiar with query batching, the CQL BATCH statement documentation is a good introduction. Batches are issued per request with BEGIN BATCH ... APPLY BATCH (or a BatchStatement in the drivers) rather than enabled by a server setting. One caveat: in Cassandra, batches exist primarily to make related writes atomic, and large multi-partition batches usually hurt performance rather than help — which is why cassandra.yaml warns once a batch exceeds batch_size_warn_threshold_in_kb.
Keep batches small and single-partition where you can; there are no hard rules here, so experiment to find what works best for your use case.
As always, measure after changing any setting like this! Any threshold change should be applied to all servers in your cluster.
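A minimal batch looks like the sketch below; user_emails is a hypothetical table with PRIMARY KEY (user_id, email), so both statements target the same partition and either both succeed or both fail:

```sql
-- Atomic batch: swap a user's email in one logical operation
BEGIN BATCH
  INSERT INTO user_emails (user_id, email)
    VALUES (123e4567-e89b-12d3-a456-426614174000, 'new@example.com');
  DELETE FROM user_emails
    WHERE user_id = 123e4567-e89b-12d3-a456-426614174000
      AND email = 'old@example.com';
APPLY BATCH;
```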
Step 9: Optimize for Column Families Instead of Tables and Rows
If you’re used to relational databases and looking at data as rows and tables, it may be a little difficult to wrap your head around column families (which modern CQL simply calls tables).
A better way to think about it is that when you look at a table in a relational database, there are two different dimensions for storing data: Rows are one dimension, and columns are another. But in NoSQL databases like Apache Cassandra, rows and columns tend to blend together (that’s why they’re called column families).
This makes your design more flexible because you don’t have to decide on how many columns you want for each row upfront.
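One concrete way to see this flexibility is a wide-partition time-series table; the sensor_readings schema below is a sketch. Each new reading becomes a new cell inside the partition, so rows grow over time without declaring anything up front:

```sql
-- One partition per sensor per day; readings accumulate as new
-- clustered cells, newest first
CREATE TABLE sensor_readings (
  sensor_id uuid,
  day date,
  reading_time timestamp,
  value double,
  PRIMARY KEY ((sensor_id, day), reading_time)
) WITH CLUSTERING ORDER BY (reading_time DESC);
```

Bucketing by day also keeps any single partition from growing without bound.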
Step 10: Understand When to Use Single-Partition Queries versus Cluster-Wide Scans
Querying is such a fundamental skill in Apache Cassandra that you need to know how to do it right. Broadly, there are two kinds of reads in Apache Cassandra: single-partition queries and multi-partition (cluster-wide) scans.
With single-partition queries, your WHERE clause restricts the partition key, so the coordinator only has to contact the few replicas that own that partition — fast and efficient, because you don’t have to involve all of your nodes to fetch a small subset of data.
With cluster-wide scans — a SELECT with no partition-key restriction, or one using ALLOW FILTERING — many or all nodes may be involved in processing each request, so reserve them for rare, offline-style workloads.
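In CQL, the contrast between a key-restricted query and a cluster-wide scan looks like the sketch below; sensor_readings is a hypothetical table whose partition key is (sensor_id, day):

```sql
-- Key-restricted query: routed only to the replicas
-- that own this partition
SELECT * FROM sensor_readings
  WHERE sensor_id = 123e4567-e89b-12d3-a456-426614174000
    AND day = '2024-01-15';

-- Unrestricted scan: every node may participate; use sparingly
SELECT * FROM sensor_readings
  WHERE value > 100 ALLOW FILTERING;
```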
As you can see, it’s easy to learn Apache Cassandra. Just follow these tips, and you’ll be ready to start coding your own cluster in no time. It’s worth mentioning that we didn’t even touch on some of the more advanced topics in our list—like data modelling, high availability, and fault tolerance—but those are all topics for future posts.