Friday, October 29, 2021

Questions on ICPE 2022 Data Challenge

I am co-organizing the Data Challenge Track at ICPE-2022 with Cor-Paul Bezemer and Weiyi Shang. The data challenge dataset is a snapshot of MongoDB's performance testing results and analysis (see the CFP for more detail). Recently, a group of researchers reached out to me with questions asking for more detail about the data. I wanted to share the answers to those questions more widely, in case they are helpful to others working on the data challenge.

Q1: What configuration options are used? That is, how large is the configuration space for MongoDB? Specifically, are all of these options considered in the measurements?

A1: We test a subset of the total configuration space. We have talked about some of the configuration in our paper and in the open source version of our testing infrastructure. Unfortunately, the open source version is a point-in-time snapshot -- we don't run our production system from the public repo.

At a high level there are a few dimensions:
  • MongoDB configuration
    • We test representative versions of the major configurations (standalone, replica set, sharded cluster). We also have some smaller versions of those (1-node replica set, and shard-lite). These have essentially the same configuration, but fewer nodes in the system.
    • We have some configurations with key features turned on and off (e.g., flow control, different security features).
    • We have some configurations specifically designed for testing initial sync. They add a node to the cluster during the test.
    • We also have tests that show up in the "performance" project, as opposed to "sys-perf". These run on a single dedicated host, so they are either standalone or a single-node replica set.
  • Hardware configuration
    • Sys-perf runs predominantly on AWS c3.8xlarge instances with hyperthreading disabled, and processes pinned to a single socket.
    • We have another configuration using different instance types, and a matching MongoDB configuration. Those were tuned to be closer to our Atlas configuration, and are labelled "M60-like".
    • The "performance" project uses dedicated hardware.
All told, we currently run 28 distinct "variants" in our system. We would like to run a much larger slice of the configuration space. We continue to look into expanding the configuration space without blowing out the budget or overwhelming our ability to handle results.

Q2: How many configurations are measured?
A2: 28 in sys-perf, 4 in performance on our main branch. That may increase slightly when reviewing stable release branches (e.g., sys-perf-5.0).

Q3: Besides end-to-end metrics like throughput, what other performance events (performance traces, such as those we can measure using the perf utility) are measured?
A3: We have added client-side latency measurements (median, 90th percentile, 95th percentile). We also collect some system metrics across the runs (CPU usage, IO usage, ...). We haven't talked much about those, but they are hopefully fairly self-explanatory. If you find something completely cryptic, reach out and I'll see if I can explain it. Not all of the metrics are always interesting (e.g., Worker's Max is a client-side metric that I think should be constant across each timeseries).
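
To make the latency metrics above concrete, here is a minimal sketch of how a median, 90th percentile, and 95th percentile could be computed from raw client-side latency samples. It is an illustration only: the function name `latency_summary` and the sample data are hypothetical, and the dataset's own percentile method may differ from the "inclusive" interpolation used here.

```python
import statistics


def latency_summary(latencies_ms):
    """Summarize latency samples as median, p90, and p95 (hypothetical helper)."""
    if len(latencies_ms) < 2:
        raise ValueError("need at least two latency samples")
    # quantiles(n=100) returns the 99 cut points for percentiles 1 through 99.
    cuts = statistics.quantiles(latencies_ms, n=100, method="inclusive")
    return {
        "median": statistics.median(latencies_ms),
        "p90": cuts[89],  # 90th percentile
        "p95": cuts[94],  # 95th percentile
    }


# Made-up sample: latencies of 1.0 ms through 101.0 ms.
sample = [float(v) for v in range(1, 102)]
summary = latency_summary(sample)
print(summary)
```

With a skewed sample, the 90th and 95th percentiles move much more than the median, which is why tail latencies are worth tracking alongside throughput.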

Q4: What was the DBMS deployment topology for running the benchmarks (number of nodes, replication, sharding, client consistency settings)? Was the same topology always applied, or different ones?
A4: Everything was deployed and run using DSI. The configurations do change over time (slowly and infrequently), but they are version controlled so we can isolate exactly when they change. There are change points in the system associated with configuration changes and with workload changes.
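
For readers new to the idea of a change point, the following minimal sketch finds the single split in a timeseries that best separates it into two segments with different means. It is a deliberately simple least-squares mean-shift search, not the detection method used to produce the dataset; the function name `best_mean_shift` and the throughput values are made up for illustration.

```python
def best_mean_shift(series):
    """Return (index, cost) for the split minimizing squared error around
    the two segment means. `index` is the first element of the right segment."""
    if len(series) < 2:
        raise ValueError("need at least two points to split a series")

    def sse(values):
        # Sum of squared deviations from the segment mean.
        mean = sum(values) / len(values)
        return sum((v - mean) ** 2 for v in values)

    best_index, best_cost = None, float("inf")
    for i in range(1, len(series)):
        cost = sse(series[:i]) + sse(series[i:])
        if cost < best_cost:
            best_index, best_cost = i, cost
    return best_index, best_cost


# Made-up throughput series with a drop after the fourth run,
# as might follow a configuration or workload change.
throughput = [100.0, 101.0, 99.0, 100.0, 80.0, 81.0, 79.0, 80.0]
index, cost = best_mean_shift(throughput)
print(index, cost)
```

Because configurations are version controlled, a split found this way can be lined up against the commit history to check whether a configuration or workload change explains it.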

Q5: What were the applied resource dimensions (VM, container, cores, memory, storage, OS,...)?
A5: The largest set is c3.8xlarge AWS instances. Those were picked a while ago because they were the lowest-noise option. The c3.8xlarge is the largest of the c3 family. It's not bare-metal, but it appears to be unshared and provides access to low-level configuration.

Q6: What kind of workload has been applied (read/write ratio, complexity of queries, intensity, data set size, runtime)?
A6: A lot of workloads have been run, from very focused, to much more general. Large classes include:
  • YCSB (load, 100% read, 50% read / 50% update, ...) and a YCSB configuration with a much larger data set (YCSB_60GB)
  • Linkbench (we are running from a private fork, but it's pretty close)
  • py-tpcc
  • Genny workloads (this is our internal load generator. We run from this public repo).
  • Private javascript/mongo shell based workloads
  • Mongo-perf

Q7: Are the DBMS traces of the applied workload available, i.e. the data of the MongoDB profiler?
A7: No. We run all the above workloads directly, not by traces.

Thursday, July 1, 2021

Writing To Be Read

Why do you write? I am an engineer and (to my surprise) I find that I write a lot. I write for several reasons: I write to take notes; I write to organize my thinking; I write to learn. All of these reasons are important, but I don't share those writings with others. I also write to change things: to convey ideas; to teach; to persuade. There is one thing that I desperately need in order to have a chance of achieving those goals: I need a reader. I need someone to read what I have written and to engage with my ideas. I need someone to understand what I wrote, think about its implications, and challenge my arguments. Because I need a reader, I write to be read.

Knowing why you do something provides a guide to doing it better. Knowing that I want people to read my writing guides how I write, and guides my efforts to be a better writer. Chances are that you write as well. Maybe you would like me to read what you have written. If so, please consider my thoughts below.

How do you get someone to read your writing? And why would someone read what you wrote? One way to answer those questions is to ask the reader. This is a good thing to do, but is challenging if you don't yet have readers. A second way to answer those questions is to imagine yourself as the reader and ask yourself. This imagination requires empathy with the reader. The best way to develop empathy with the reader is to be an avid reader yourself. I am.

I know from my own reading that I read to learn, to solve problems, and to be entertained. When I am reading to learn and to solve problems, I review a lot of writing to find answers to those problems. I even read about reading. I skim blogs, papers, and books to see if they address my interests. The best writing quickly tells me if it is relevant or not. If it is, I can devote my full attention to it. If it's not, I can move on. Bad writing makes me work to figure out if it is relevant. Once I have decided that a piece of writing is relevant, I read it specifically to answer my questions. That may entail a thorough reading of the entire work (possibly multiple readings). Or it may only entail a thorough reading of the conclusion or of a particular section. God bless authors who make it that easy. For my own reading, I want writing that quickly tells me if it's relevant, writing that answers my questions, and writing that is pleasant to read along the way. Therefore, I strive to make my writing relevant, actionable, and pleasant.

Tell the reader why it's relevant

Making writing relevant is not hard, but it does require resisting the urge to be dramatic or coquettish.

Start with the title

The title is the first thing someone is going to see. Give them enough information to decide if they should keep reading or not. For example, from the title of this blog post you can already tell that the post is about 1. writing, and 2. getting that writing read.

Continue with the body

Continue that directness into the start of your text. If you are writing something long enough to have an abstract, write the abstract after the rest is done and put a lot of effort into it. Specifically craft the abstract to help a reader answer "Is this relevant to me?" Don't be dramatic by hiding some stunning twist at the end. That's great for fiction—skip it here.

End strong

Finally, end with a summary or recap. Savvy readers will go straight to the end to determine if something is worth reading or not. Writing a strong recap at the end helps the savvy reader, as well as any reader who reads your whole work (through reinforcement).

Answer the questions

Now that you have told the reader why they should read your work, make it worth their time. They are reading your writing to learn something and to answer questions. Make sure you answer those questions. Don't just state the conclusions; provide the details and arguments to back them up. Much of my technical writing is experiment based. I make sure to present the conclusions (easily and clearly), but I also include the detail that went into them: the reasons the reader should believe my conclusions. What experiments did I run? What did I expect to happen? What did happen? What are the limits of my conclusions? Include all of that. I am sure that most readers have no desire to recreate my experiments and results, but they take comfort knowing that they could. That detail makes it more likely both that someone will believe the recommendations and that a reader will be able to point out flaws or limitations in the work. Both outcomes should be a win for you (or me) as a writer.

There are limits to including the detail. You want to make it easy for the reader to get the answers and have faith in those answers. You don't want to bury the reader in detail. So, include the detail, but if it becomes too much, move some of it to appendixes or separate (linked) documents.

Make it readable

You've written your article. You've made it relevant and you have answered questions. Great! But you are not done. In fact, you have just gotten started. Now it is time to make it readable—that means editing. Put the writing down and walk away, then come back and read it again.
  • Which parts are rough or confusing? Clean those up.
  • Which parts make no sense? Remove or rewrite those.
  • Which parts don’t fit with the overall article? Remove those also.
Repeat the process until you have something that you would like to read.

For example, the first draft of this post had long lists of reasons that I write, each with 5 or 6 items. The lists were long enough that I found myself getting lost in them when re-reading. “I write to take notes; I write to organize my thinking; I write to learn; I write to review; I write to stay focused in meetings”. Those lists were shortened to 3 items each. Additionally, that first draft had a paragraph on how to read. That content largely did not fit with the theme of this post and was cut, with only small pieces preserved to motivate “make it relevant”, “answer the questions”, and “make it pleasant”.

Now consider having someone else read your writing. Have them identify what's rough, confusing, or just does not make sense. Depending on the audience and the work, there is obviously a spectrum of how much effort you should put into a particular piece of writing. I would venture that the more you care about the outcome, the more effort you should put into it. I had someone read this article, and the writing is both better and truer to its theme for the feedback.

Recap: Be Kind to Your Reader

If you take one thing away from this blog post, I hope it is "be kind to your reader." If you write to be read, the reader is the most important person in the world. Be kind to that person. Make it easy for them to determine if they should read your writing. Make it easy for them to pull out the lessons they need. And make it easy and pleasant for them to read your writing. They will appreciate it. Maybe I will be your reader, and I will definitely appreciate that effort.

Coda: Why did I write this article?

Based on the questions at the top, you might ask why I wrote this article. There's a simple answer to that: it's because I read a lot. I read a lot of technical writing in order to learn and form opinions. Many technical writers do not know why they are writing. I would like them to know two things: one, they might be writing in order to convince me of something or to teach me something; two, if they want to convince or teach me, they need to work hard to make their writing readable.

Special thanks to Rita Rodrigues for feedback on an earlier draft of this post. The post is better written and more consistent with its intent for her feedback.