Proof of Concept: Benchmarking Postgres on IBM Power

By Andrew Wojnarek
Director of Capacity and Performance Management (CPM)
(ATS Innovation Center, Malvern, PA)

Concept Timeline: February – May 2018

PostgreSQL is a powerful, open source object-relational database system that uses and extends the SQL language combined with many features that safely store and scale the most complicated data workloads. The origins of PostgreSQL date back to 1986 as part of the POSTGRES project at the University of California at Berkeley, and it has more than 30 years of active development on the core platform. PostgreSQL has earned a strong reputation for its proven architecture, reliability, data integrity, robust feature set, extensibility, and the dedication of the open source community behind the software to consistently deliver performant and innovative solutions.

The goal of this Proof of Concept is to explore running Postgres on Power and to define a methodology for benchmarking workloads with various tunables and application setups. We’re using a benchmarking suite called pgbench-tools, which is a wrapper for the popular utility pgbench. The wrapper script we wrote takes this a step further and runs the pgbench-tools benchmark repeatedly using IBM Power specific tunables to determine the best way to run different kinds of workloads (selects vs inserts, etc.).

Our infrastructure environment consists of:

  • IBM Power 812L
  • Ubuntu 16.04 LTS
  • 10 CPUs (3425 MHz) with SMT off/2/4/8
  • IBM PowerKVM and IBM PowerVM
  • 64GB Memory
  • 140GB internal SAS disk
  • 100GB Flash Storage IBM V9000

Our application environment consists of:

  • PostgreSQL 9.5.12

Our tools used:

  • pgbench and pgbench-tools
  • Galileo Performance Explorer

The Setup

We chose to use Ubuntu 16.04 for a couple of reasons: ease of use and an up-to-date repository. For this proof of concept, we wanted to quickly create our infrastructure, test it, and destroy it. Ubuntu is easy to install and has a wide variety of up-to-date packages built for IBM Power. The Postgres package in the Ubuntu 16.04 repository is 9.5. Our initial research led us to a 2016 blog post about the performance differences between 9.5 and 9.6.

Below are the versions of Postgres per Ubuntu Release:

Ubuntu 16.04 Postgres 9.5
Ubuntu 17.10 Postgres 9.6
Ubuntu 18.04 Postgres 10

* Note: pgbench-tools does NOT work for Postgres 10, so you’ll need to do the tests by hand.

The great thing about this benchmarking suite is that it generates a ton of data. It will generate database statistics like transactions per second (TPS) as well as query latency. The suite also grabs critical operating system statistics from vmstat, iostat and meminfo. We’ll be using the capacity and performance management tool called Galileo, so we won’t need this data, but it is there if you want it.

So, the first thing you should do is understand exactly what pgbench is and what it does. The pgbench utility comes from the package called postgresql-client-common. pgbench is a simple program for running benchmark tests on PostgreSQL. It runs the same sequence of SQL commands over and over, possibly in multiple concurrent database sessions, and then calculates the average transaction rate (transactions per second).

Below is a simple pgbench example:

As you can see above, pgbench is super easy to use, and when paired with some automation, it is a fantastic way to measure performance.
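To give a feel for what "paired with some automation" can look like, here is a minimal sketch in Python that builds the two pgbench invocations you need (initialize at a scale factor, then run a timed test) and loops over a few client counts. The database name `pgbench`, the 60-second duration and the thread counts are assumptions for illustration, not the settings we used; the flags themselves (`-i`, `-s`, `-c`, `-j`, `-T`, `-S`) are standard pgbench options.

```python
import shlex
import subprocess


def pgbench_init_cmd(dbname, scale):
    """Command that creates and populates the pgbench tables at a scale factor."""
    return ["pgbench", "-i", "-s", str(scale), dbname]


def pgbench_run_cmd(dbname, clients, seconds, threads=1, select_only=False):
    """Command for one timed benchmark run."""
    cmd = ["pgbench", "-c", str(clients), "-j", str(threads), "-T", str(seconds)]
    if select_only:
        cmd.append("-S")  # built-in select-only workload
    cmd.append(dbname)
    return cmd


def run(cmd, execute=False):
    """Print the command; only touch the database when execute=True."""
    print(shlex.join(cmd))
    if execute:
        done = subprocess.run(cmd, check=True, capture_output=True, text=True)
        return done.stdout
    return None


if __name__ == "__main__":
    # "pgbench" is a hypothetical database name; create it first with createdb.
    run(pgbench_init_cmd("pgbench", scale=10))
    for clients in (1, 8, 32):
        # On 9.5 the client count must be a multiple of the thread count.
        threads = min(clients, 4)
        run(pgbench_run_cmd("pgbench", clients=clients, seconds=60, threads=threads))
```

By default the sketch only prints the commands, so you can review them first; pass `execute=True` to `run` once a Postgres instance is available.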

The Results

The intent of this paper is not to compare benchmark results across operating systems or architectures, but rather to establish an automated way to test different tunables and configurations. So we started our first test with the following configuration:

  • SCRIPT="insert.sql"
  • SCALES="1 10 100 1000"
  • SETCLIENTS="1 2 4 8 16 32 64 128 256"

This tells the script to run the insert test across 4 different database sizes (scale) and 9 different client counts. We want to profile the different configurations so we can determine what combination nets us the highest TPS! Let’s dive right in.

What the above chart shows is that we achieved the highest number of transactions per second using SMT2 at around 65 clients. You can see that SMT4 overtakes SMT2 after 256 clients. Another thing to note is that SMT8 beats out having SMT off until around 65 clients.

Below we can view the latency tables – where we can compare throughput vs latency.

SMT8:

SMT OFF:

One thing that jumped out at me is that SMT8 outperformed not only in throughput (TPS) but also in latency. With 32 clients, SMT8 managed 110K transactions per second, and with SMT off the system was able to almost keep up with 94K. There is a major difference in latency though! The latency with SMT8 was 0.282, and with SMT off it was 0.335, which is roughly 19% higher. The maximum latency is also orders of magnitude larger with SMT off.
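The throughput and latency figures above are tied together. With a fixed number of clients that each keep one transaction in flight, Little's law says average latency is roughly clients divided by TPS. The sketch below is a quick sanity check of the numbers quoted above, assuming the latencies are in milliseconds (pgbench's usual unit) and that 110K and 94K are rounded values; the function name is hypothetical.

```python
def expected_latency_ms(clients, tps):
    """Little's law: average latency = clients in flight / throughput."""
    return clients / tps * 1000.0


CLIENTS = 32
# (TPS, reported average latency) as quoted in the text above.
REPORTED = {"SMT8": (110_000, 0.282), "SMT off": (94_000, 0.335)}

for mode, (tps, reported) in REPORTED.items():
    estimate = expected_latency_ms(CLIENTS, tps)
    print(f"{mode}: estimated {estimate:.3f} ms, reported {reported:.3f}")
```

The estimates land within a few percent of the reported values. So at a fixed client count, higher TPS and lower average latency are two views of the same result, and the maximum latency is the number that tells you something new.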

Below is a peek at our infrastructure during these tests, powered by Galileo Performance Explorer:

You can see where we shut off SMT, and the corresponding CPU usage over time. The one thing that surprised us was how hard it was to force the system to use real memory – it almost exclusively lived in filesystem cache. We maxed out at 5 million pages per second into filesystem cache.

So with SMT in mind, let’s talk about bottlenecks. In our benchmark, the CPU was not the bottleneck. In fact, we were not able to push more than 30-40% CPU! This really shows the strength of the IBM Power hardware. We would have needed to throw much more disk at this benchmark to saturate the CPU. We were using an IBM V9000 flash LUN, 100GB in size. While we were able to push a tremendous amount of I/O, it wasn’t enough to saturate the CPU. So while benchmarking, I’d keep the following things in mind:

  1. CPU Usage (check for CPU wait)
  2. Disk Latency (check for submillisecond latency)
  3. Adapter Throughput (check for theoretical limits on I/O adapters)
  4. Memory Usage (check for disk paging)
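If you don’t have a monitoring tool handy, the first item on that list can be checked by hand. The sketch below, in Python, computes the share of CPU time spent in I/O wait between two snapshots of the aggregate `cpu` line in Linux’s `/proc/stat`. The two snapshots in the example are made-up values for illustration only, not measurements from our system, and the function names are hypothetical.

```python
def cpu_times(proc_stat_text):
    """Return the aggregate counters (user .. steal) from /proc/stat text."""
    for line in proc_stat_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "cpu":
            # user, nice, system, idle, iowait, irq, softirq, steal
            return [int(value) for value in fields[1:9]]
    raise ValueError("no aggregate 'cpu' line found")


def iowait_percent(before, after):
    """Percentage of CPU time spent waiting on I/O between two snapshots."""
    start, end = cpu_times(before), cpu_times(after)
    deltas = [e - s for s, e in zip(start, end)]
    total = sum(deltas)
    return 100.0 * deltas[4] / total if total else 0.0


# Synthetic snapshots, for illustration only.
BEFORE = "cpu  100 0 50 800 50 0 0 0 0 0\n"
AFTER = "cpu  200 0 100 1500 200 0 0 0 0 0\n"
print(f"iowait: {iowait_percent(BEFORE, AFTER):.1f}%")
```

On a real system you would read `/proc/stat` twice, a few seconds apart, while the benchmark is running. A high I/O wait share alongside low overall CPU usage points at storage rather than the processor.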

Let’s take a look at our insert test:

As you can see above, we were able to achieve 70,000 inserts per second. The best configuration was SMT off, which peaked at around 125-130 clients.

Above is a 3D chart that has three dimensions of data: database size (scaling factor), number of clients and transactions per second. The point of this chart is to show the relationship between all three metrics. The graph suggests that transactions per second are high with a medium number of clients, both for small databases and for a larger database.

The above chart shows the relationship between the size of the database and the transactions per second. It’s clear that as the database gets bigger, transactions per second drop, which makes sense.

Another thing we’ll want to emphasize is avoiding bottlenecks. Especially when running insert tests, you want to pay close attention to your disk speeds and feeds.

The graphs below show two things: I/O throughput and I/O latency. We were able to push around 130MB/s at peak and had an average latency of 0.5ms. Our lab isn’t built for speed, but these are things you’ll want to keep in mind, along with things like Fibre Channel throughput and NIC throughput.

 

The IBM Power Advantage

Powerful forces—mobile, cloud and big data & analytics—are redefining how business gets done. Leaders are leveraging these forces to deepen relationships with customers and partners, drive new efficiencies and expand business models. IBM is the right partner to help you.

IBM Power Systems are designed for big data, from operational to computational to business and cognitive Watson solutions. They are optimized for performance and can scale to support demanding and growing workloads. Capitalize on the currency of data by finding business insights faster and more efficiently. And gain the elasticity you need to handle the varying analytics initiatives your business requires.

The IBM POWER8 processors were designed with big data in mind. They’re truly remarkable, with features such as:

  • Support for DDR3 and DDR4 memory through memory buffer chips that offload the memory support from the IBM POWER8 memory controller
  • L4 cache within the memory buffer chip that reduces the memory latency for local access to memory behind the buffer chip; the operation of the L4 cache is transparent to applications running on the IBM POWER8 processor. Up to 128 MB of L4 cache can be available for each IBM POWER8 processor.
  • Hardware transactional memory.
  • On-chip accelerators, including on-chip encryption, compression, and random number generation accelerators.
  • Coherent Accelerator Processor Interface (CAPI), which allows accelerators plugged into a PCIe slot to access the processor bus using a low latency, high-speed protocol interface.
  • Adaptive IBM Power management.

As we referenced above in our benchmarking, another feature of the IBM POWER8 architecture is Simultaneous Multi-Threading (SMT). IBM POWER8 offers up to 8 threads per core. If your workload is throughput driven, SMT8 can offer incredible benefit. Running Linux on IBM Power, it’s easy enough to change the system’s SMT mode dynamically:

 

* Indicates that the thread is enabled for the processor

In the above example we’re checking which SMT mode we’re running, and you can see we’re using 80 ‘cores’ (logical CPUs) across 10 real cores. Now below let’s set SMT to two and check again:

You can see above that now we have 20 cores enabled via SMT2 across 10 real cores.
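The arithmetic behind those two checks is simply real cores times SMT threads per core. The sketch below works through it in Python and also shows how a wrapper script might build and parse the `ppc64_cpu` calls from powerpc-utils. The helper names are hypothetical, and the output strings it parses (`SMT=8`, `SMT is off`) are our assumption about what `ppc64_cpu --smt` prints, so check them against your version.

```python
import re

VALID_SMT_MODES = (1, 2, 4, 8)  # 1 means SMT off


def logical_cpus(real_cores, smt_mode):
    """Logical CPUs the OS sees: real cores times threads per core."""
    return real_cores * smt_mode


def parse_smt(output):
    """Parse the assumed output of `ppc64_cpu --smt` into threads per core."""
    text = output.strip()
    if text == "SMT is off":
        return 1
    match = re.fullmatch(r"SMT=(\d+)", text)
    if match is None:
        raise ValueError(f"unrecognized ppc64_cpu output: {text!r}")
    return int(match.group(1))


def set_smt_cmd(smt_mode):
    """Build the command that switches SMT mode (it needs root to run)."""
    if smt_mode not in VALID_SMT_MODES:
        raise ValueError(f"unsupported SMT mode: {smt_mode}")
    value = "off" if smt_mode == 1 else str(smt_mode)
    return ["ppc64_cpu", f"--smt={value}"]


for sample in ("SMT=8", "SMT=2", "SMT is off"):
    mode = parse_smt(sample)
    print(f"{sample!r}: {logical_cpus(10, mode)} logical CPUs on 10 real cores")
```

A benchmark wrapper can call something like `set_smt_cmd` between runs to step through SMT off, 2, 4 and 8 without a reboot.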

IBM POWER9 will offer up to the same number of SMT threads per core, but boasts some incredible improvements in performance:

  • ~1 TB/s BW into chip
  • 1st chip with PCIe4
  • 7 TB/s on chip BW
  • 4 GHz peak frequency
  • 8 billion transistors

It has 2x the performance per core vs x86, 6-10x better bandwidth to CPU accelerators, 2.6x more RAM and 1.8x more memory bandwidth!

The Recap

Running your database workloads in the IBM Power ecosystem is something to definitely consider. This is hardware that’s optimized for big data, transactional workloads, and is on the forefront of innovation so companies can gain faster insights for competitive business advantages. Consider using this methodology we went through in this paper to benchmark your workloads on IBM Power hardware.

One of the last recommendations I’ll leave you with is this: benchmark across different application versions, distributions and architectures. Innovation in the open source community is happening at an ever-increasing pace! Gone are the days when you could simply trust what an article, paper or blog has confirmed.

Here at the ATS Group, we have a very diverse Innovation Center where we implement Proofs of Concept for customers looking to do this very exercise. We specialize in the implementation and architecture of advanced technology. We’d love to hear your story and to talk with you about the problems you may have. Please contact us with any questions!

Since our founding in 2001, we have consulted on thousands of system implementations, upgrades, backups and recoveries. We also support customers by providing managed services, performance analysis and capacity planning. We are industry-certified professionals supporting SMBs, Fortune 500 companies, and government agencies. As experts in top technology vendors, we are experienced in virtualization, server and storage systems integration, containerized workloads, high performance computing (HPC), software defined infrastructure (SDI), devops, enterprise backup and other evolving technologies that operate mission-critical systems on premise, in the cloud, or in a hybrid environment.