Wednesday, May 22, 2013

MongoDb vs. MS SQL: how to write to journal without additional seeks

In my previous post I figured out why the single-threaded MongoDb benchmark is limited to 1000 inserts per second. But actually only the SSD can reach this limit: 980 inserts per second on SSD and only 700 on HDD. The modified multi-threaded (8 threads) benchmark gives 7900 inserts on SSD and 5200 on HDD. Why is there such a big difference if the journal is append-only storage? An HDD should perform quite well in a sequential write scenario. Can we close the gap?

If you read the first article in the series, you remember I was surprised that MS SQL does nearly the same number of writes as the number of test iterations. But MongoDb doubles this number, accessing not only the journal file but also the NTFS metadata file.

Look at how the journal file is created:

[screenshot of the CreateFile call from the mongod source; the journal file is opened with FILE_FLAG_NO_BUFFERING and FILE_FLAG_WRITE_THROUGH among the dwFlagsAndAttributes]

The purpose of FILE_FLAG_NO_BUFFERING is obvious:
In these situations, caching can be turned off. This is done at the time the file is opened by passing FILE_FLAG_NO_BUFFERING as a value for the dwFlagsAndAttributes parameter of CreateFile. When caching is disabled, all read and write operations directly access the physical disk. However, the file metadata may still be cached. 
And what does FILE_FLAG_WRITE_THROUGH do?
A write-through request via FILE_FLAG_WRITE_THROUGH also causes NTFS to flush any metadata changes, such as a time stamp update or a rename operation, that result from processing the request.
Windows provides this ability through write-through caching. A process enables write-through caching for a specific I/O operation by passing the FILE_FLAG_WRITE_THROUGH flag into its call to CreateFile. With write-through caching enabled, data is still written into the cache, but the cache manager writes the data immediately to disk rather than incurring a delay by using the lazy writer.  
In other words, this flag a) flushes NTFS metadata and b) keeps the written data available in the cache manager's memory.
Do we really need to update file attributes on every write operation, paying for an additional HDD seek?! Do we really need this data to be cached? I expect the data is read from the journal only during recovery.
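To see what this costs in isolation, here is a minimal C# sketch of my own (not MongoDb code; the class and method names are made up) that opens an append-only file the same way the journal is opened, with the write-through flag made optional:

    using System;
    using System.ComponentModel;
    using System.Runtime.InteropServices;
    using Microsoft.Win32.SafeHandles;

    static class JournalFileSketch
    {
        const uint GENERIC_WRITE = 0x40000000;
        const uint OPEN_ALWAYS = 4;
        const uint FILE_FLAG_NO_BUFFERING = 0x20000000;
        const uint FILE_FLAG_WRITE_THROUGH = 0x80000000;

        [DllImport("kernel32.dll", SetLastError = true, CharSet = CharSet.Unicode)]
        static extern SafeFileHandle CreateFile(
            string fileName, uint access, uint share, IntPtr security,
            uint creationDisposition, uint flags, IntPtr template);

        // With FILE_FLAG_NO_BUFFERING every WriteFile call must use a sector-aligned
        // buffer and a length that is a multiple of the sector size.
        public static SafeFileHandle Open(string path, bool writeThrough)
        {
            var flags = FILE_FLAG_NO_BUFFERING;
            if (writeThrough)
                flags |= FILE_FLAG_WRITE_THROUGH; // also flushes NTFS metadata on every write

            var handle = CreateFile(path, GENERIC_WRITE, 0, IntPtr.Zero,
                                    OPEN_ALWAYS, flags, IntPtr.Zero);
            if (handle.IsInvalid)
                throw new Win32Exception(Marshal.GetLastWin32Error());

            return handle;
        }
    }

Comparing writes through the two variants on an HDD is a quick way to check how much the metadata flush costs on your own hardware.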

I removed FILE_FLAG_WRITE_THROUGH and here are the numbers: 7900 inserts per second on SSD and 7800 on HDD! The gap is closed!

I expect pre-allocating the journal file may give an additional improvement, but right now the 1 ms delay we introduced earlier is the real bottleneck and should be addressed first. Maybe once journal throughput is improved there is no need to keep the default value of journalCommitInterval at 100 ms?

Let me submit an issue to JIRA accompanied by a pull request and see what the 10gen folks say about it.

UPDATED: BTW, after the modification the disk load drops from 55% to 7-10% during the benchmark. As I already said, this guy is our current bottleneck:
sleepmillis(oneThird);
UPDATED: [proof] after the modification mongod.exe doesn't double the number of writes:

[PerfView screenshot: after the change mongod no longer doubles the number of write operations; only the journal file is written]

Stay tuned!

Tuesday, May 21, 2013

MongoDb vs. MS SQL: journalCommitInterval problem

  In my previous article I got awful results for MongoDb, and some "proofs" led me the wrong way. Actually, not all of them were wrong, but first let's deal with the simpler problem.

  My benchmark results were 29 inserts per second, 34 ms per insert. I thought it was due to HDD seeks, because MongoDb updates two files per write. But when I ran the same test on an SSD I got the same results. Something is blocking my requests! And here it is:

[screenshot of the journal commit loop from the mongod source: the delay is computed as oneThird = (journalCommitInterval / 3) + 1 and applied with sleepmillis(oneThird)]

  I was surprised that the write to the journal is also delayed; it looks like the guys from 10gen try to group some disk operations together. The problem is that every write is delayed, unconditionally! I ran MongoDb without specifying journalCommitInterval, so in my case the delay is oneThird = (100 / 3) + 1 = 34!!! Every one of my insert operations is delayed by 34 ms! Yes, sir, my benchmark confirms it!
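  To make the logic above easier to follow, here is a rough paraphrase of the commit loop in C# (the real mongod code is C++; inShutdown and groupCommit are made-up placeholders for the real internals):

    using System;
    using System.Threading;

    static class JournalLoopSketch
    {
        // Rough paraphrase of the behaviour described above, not the actual mongod code.
        public static void Run(int journalCommitInterval, Func<bool> inShutdown, Action groupCommit)
        {
            // the default interval is 100 ms, which gives oneThird = (100 / 3) + 1 = 34 ms
            var interval = journalCommitInterval == 0 ? 100 : journalCommitInterval;
            var oneThird = (interval / 3) + 1;

            while (!inShutdown())
            {
                Thread.Sleep(oneThird);   // the unconditional delay every journaled insert waits for
                groupCommit();            // flush whatever writes have accumulated
            }
        }
    }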

  After reading the documentation I found out that the smallest valid journalCommitInterval is 2, which gives oneThird = (2 / 3) + 1 = 1 ms. I ran MongoDb with this value expecting to get somewhere around 1K inserts per second. But I got only 700 inserts per second. Not that much... Having learned from my previous mistakes, I ran the same benchmark on my SSD and got around 980. Much closer to what I expected, keeping in mind the 1 ms penalty delay.

  There is something fishy about why MS SQL doesn't update the 'last write time' but MongoDb does...

UPDATED: how to write to journal without additional HDD seeks

MongoDb vs. MS SQL Server in 'durable insert' benchmark

  Recently I was thinking about an append-only storage for an audit log. It seems like both MS SQL Server and MongoDb fit my needs, but I wanted to get some numbers. Here is my contest environment:

  • Windows 7 Professional SP1 x64
  • Intel Core i7-2600 @ 3.40 GHz
  • SSD Corsair Force 3 (only for OS, all database files are on HDD)
  • HDD Seagate ST320DM000 320GB @ 7200 rpm (rated by HDD Tune at 15.6 ms average access time and 64 IOPS with 4K blocks)
  • MS SQL 2008 R2 SP1
  • MongoDb v2.4.3
The benchmark is very simple - insert small records as fast as possible. The code is written in C# and available here.
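For readers who don't want to open the repository, the MongoDb side of the loop boils down to roughly this (a simplified sketch rather than the exact benchmark code; the database and collection names are made up, and it assumes the 1.x C# driver of that time):

    using System;
    using MongoDB.Bson;
    using MongoDB.Driver;

    static class MongoInsertBenchmark
    {
        // Simplified sketch of the MongoDb side of the benchmark, not the exact
        // code from the repository; 'benchmark' and 'audit' are made-up names.
        public static void Run(int iterations)
        {
            var collection = new MongoClient("mongodb://localhost")
                .GetServer()
                .GetDatabase("benchmark")
                .GetCollection<BsonDocument>("audit");

            var journaled = new WriteConcern { Journal = true };   // j:true, wait for the journal

            var started = DateTime.UtcNow;
            for (var i = 0; i < iterations; i++)
            {
                var document = new BsonDocument
                {
                    { "createdAt", DateTime.UtcNow },
                    { "payload", "audit record " + i }
                };
                collection.Insert(document, journaled);            // one durable insert per iteration
            }

            var elapsed = (DateTime.UtcNow - started).TotalSeconds;
            Console.WriteLine("{0:F0} inserts/second", iterations / elapsed);
        }
    }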

  First I ran my benchmark against MS SQL and got around 2500 inserts/second. I understand that this is just appending to the transaction log file (.ldf), and I expected to get nearly the same results for MongoDb. MongoDb was benchmarked in 'durable' mode, which means with the journal turned on (I'm not sure whether it is even possible to turn the journal off in recent versions). The first results were surprising; at that moment I was sure I had done something wrong - 29 inserts per second! Actually, to get any results at all I had to reduce the number of test iterations for MongoDb from 10000 to just 100, otherwise I couldn't wait for the test to complete.

Look at the times between writes during the MS SQL benchmark:

[screenshot: fractions of a millisecond between consecutive MS SQL writes]

And compare with MongoDb times:

[screenshot: about 34 ms between consecutive MongoDb writes]

Fractions of a millisecond between MS SQL write operations and 34 ms between MongoDb writes! It took me a few hours to figure out what was going on, but I'm going to save you the time. Did you notice that I gave detailed characteristics of my spinning drive? 64 IOPS with a 15.6 ms average access time on one side, and 29 inserts per second with 34 milliseconds between writes (1000 ms / 34 ms ≈ 29) in the benchmark on the other...

  I used PerfView to verify my theory. Look at the disk activity while MS SQL performs 10K test iterations:

[PerfView screenshot: disk write activity during the MS SQL run]

It does 10K+ writes (the additional writes may be caused by the test preparation phase), all to a single file, which is the transaction log file. And here are the MongoDb results for 100 iterations:

[PerfView screenshot: disk write activity during the MongoDb run, spread across two files]

It does twice as many I/O write operations, affecting two files! I supposed it updates some file metadata, and I quickly found "proof": MS SQL doesn't update the last write time of the InsertBenchmark_log.ldf file, but MongoDb does. That's why two files are accessed in the latter scenario. So I decided that MongoDb performance is penalized by an HDD seek on every write operation.

  At the time of writing I understand I was totally blinded by my "proofs". Later I ran the MongoDb benchmark on my SSD and got the same results! Only 29 inserts per second! Obviously it is not related to HDD seek times...

  Let's summarize my findings before continuing:
  • MongoDb is very slow in my append-only durable storage benchmark
  • It updates two files on the disk for every write operation
  • But the results are the same on SSD and HDD, which means the problem is inside the client bindings or in mongo itself.

Friday, May 17, 2013

How to persist aggregate root. Part I

    Persisting an AR (aggregate root) is not as simple a task as it may seem. In fact, how you are going to store the state (or maybe not the state) of an AR can strongly influence the entire architecture of the application.

    Let's look at the first and probably most commonly used option - saving to a relational database. You can persist the entities by hand, or with the help of an ORM (NHibernate in my example), which is what we will do. And here, by the way, are our entities:

    public class Order
    {
        public Order(int customerId)
        {
            CustomerId = customerId;
 
            OrderLines = new List<OrderLine>();
        }
 
        private Order()
        {
        }
 
        public int Id { get; private set; }

        public int Version { get; private set; }

        public int CustomerId { get; private set; }

        public double Total { get; private set; }

        public IList<OrderLine> OrderLines { get; private set; }
 
        public void BuyProduct(string productId, int quantity, double price)
        {
            OrderLines.Add(new OrderLine(this, productId, quantity, price));
 
            Total += quantity * price;
        }
    }
 
    public class OrderLine
    {
        public OrderLine(Order order, string productId, int quantity, double price)
        {
            Order = order;
            ProductId = productId;
            Quantity = quantity;
            Price = price;
        }
 
        private OrderLine()
        {
        }
 
        public int Id { get; private set; }

        public Order Order { get; private set; }

        public string ProductId { get; private set; }

        public int Quantity { get; private set; }

        public double Price { get; private set; }
    }

The complete code of the example is available here.

    Since many ORMs support the wonderful (at first glance) ability to lazily load a child collection (lazy-load), we will keep the Total value on the Order itself, for example for cases where we only need Total for display and don't need the data from OrderLines. For "bushy" ARs with many nested collections on several levels, using lazy-load simply begs to be done from a performance point of view.
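    For reference, the OrderLines collection can be mapped as a lazy bag roughly like this (a mapping-by-code sketch, not necessarily identical to the mapping in the example repository; the OrderId column name is an assumption):

    // ClassMapping comes from NHibernate.Mapping.ByCode.Conformist;
    // Generators, CollectionLazy and Cascade from NHibernate.Mapping.ByCode
    public class OrderMap : ClassMapping<Order>
    {
        public OrderMap()
        {
            Id(x => x.Id, m => m.Generator(Generators.Identity));
            Version(x => x.Version, m => { });
            Property(x => x.CustomerId);
            Property(x => x.Total);

            Bag(x => x.OrderLines,
                bag =>
                {
                    bag.Key(k => k.Column("OrderId"));   // assumed foreign key column
                    bag.Lazy(CollectionLazy.Lazy);       // loaded only on first access
                    bag.Cascade(Cascade.All);
                    bag.Inverse(true);
                },
                rel => rel.OneToMany());
        }
    }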

    And this is exactly where the catch is hidden! But first, a little theory. An AR is first and foremost a consistency boundary. It is precisely consistency considerations that should drive the decision whether to combine a zillion entities under one root, create a zillion independent ARs, or pick some intermediate solution. Defining AR boundaries is a separate, complex topic, covered quite well, for example, in Effective Aggregate Design. So we should both save and load an AR as a consistent whole, but is that what happens in our case? No, it is not!

    Let's load an order, print the value of the Total field to the console, and then recalculate that value from the OrderLines collection. Suppose that somebody updated the order between these two actions. That shouldn't affect us though, we have a consistency boundary, don't we?

       using (var session = _sessionFactory.OpenSession())
       using (var transaction = session.BeginTransaction())
       {
           var order = session.Get<Order>(orderId);
 
           Console.WriteLine(order.Total);
 
           ConcurrentWriter(orderId);
 
           // Here collection is actually loaded
           Console.WriteLine(order.OrderLines.Sum(oi => oi.Quantity * oi.Price));
 
           transaction.Commit();
       }

       private void ConcurrentWriter(int orderId)
       {
           var thread = new Thread(() =>
           {
               using (var session = _sessionFactory.OpenSession())
               using (var transaction = session.BeginTransaction())
               {
                   var order = session.Get<Order>(orderId);
 
                   order.BuyProduct("Whiskey", 1, 10.12);
 
                   transaction.Commit();
               }
           });
 
           thread.Start();
 
           // give it chance to complete
           thread.Join(1000);
       }


    As a result, we see two different numbers in the console! With lazy loading of the collection we loaded not only the OrderLines that were already there, but also what our concurrent writer added. What about data consistency?!

    There are two ways I know of to "get back" consistency. The first is to open the transaction with the RepeatableRead isolation level. But then you have to be prepared for exceptions at the point where the OrderLines collection is accessed, because our transaction will be picked as the deadlock victim. The second is to give up lazy loading and load the whole AR at once. Both options are sketched below.
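    In NHibernate terms the two options look roughly like this (same _sessionFactory, orderId and entities as above; Transformers comes from NHibernate.Transform):

       // Option 1: RepeatableRead keeps the rows we have already read stable, at the
       // price of possibly being chosen as the deadlock victim when OrderLines is touched.
       using (var session = _sessionFactory.OpenSession())
       using (var transaction = session.BeginTransaction(System.Data.IsolationLevel.RepeatableRead))
       {
           var order = session.Get<Order>(orderId);

           Console.WriteLine(order.Total);
           Console.WriteLine(order.OrderLines.Sum(oi => oi.Quantity * oi.Price));

           transaction.Commit();
       }

       // Option 2: give up lazy loading and fetch the whole aggregate in a single query.
       using (var session = _sessionFactory.OpenSession())
       using (var transaction = session.BeginTransaction())
       {
           var order = session.QueryOver<Order>()
               .Where(o => o.Id == orderId)
               .Fetch(o => o.OrderLines).Eager
               .TransformUsing(Transformers.DistinctRootEntity)   // the join returns one row per order line
               .SingleOrDefault();

           Console.WriteLine(order.Total);
           Console.WriteLine(order.OrderLines.Sum(oi => oi.Quantity * oi.Price));

           transaction.Commit();
       }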

    Bottom line: using a relational database to store ARs is convenient in terms of ease of use (ORMs with their goodies, mature tooling for the DBMS itself) and speed of implementation, but it has "some" consistency problems if you don't think about them. For high-load systems and "bushy" ARs, the need for several disk operations to load a single AR can also be counted among the drawbacks.

    Next time (if there is a next time) we will try storing our AR in a document database.