@awead

Technical notes and explications of code

Benchmark Testing With Valkyrie and Hyrax

Overview

Three tests were run against four different combinations of backends, at counts of 10, 100, 1K, 10K, and 100K. Not all tests were run at the higher counts because of performance limitations.

Backends

Valkyrie with Postgres

A Postgres database holds the metadata for every resource, and each resource is also indexed in Solr. Using Valkyrie’s indexing adapter, after one or more writes are performed on Postgres, one indexing operation is performed in Solr at the end.

Binary files are written directly to disk.
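
The write-then-index pattern described above can be sketched roughly as follows. This is not Valkyrie’s indexing adapter: FakeMetadataStore, FakeIndex, and BufferedIndexingPersister are hypothetical names, and the in-memory objects only stand in for Postgres and Solr to show how index work can be deferred until a batch of writes has finished.

```ruby
# Hypothetical stand-in for the metadata store (Postgres in this post).
class FakeMetadataStore
  attr_reader :rows

  def initialize
    @rows = []
  end

  def save(resource)
    @rows << resource
    resource
  end
end

# Hypothetical stand-in for Solr; counts how many index requests it receives.
class FakeIndex
  attr_reader :requests, :documents

  def initialize
    @requests = 0
    @documents = []
  end

  def index(batch)
    @requests += 1
    @documents.concat(batch)
  end
end

# Saves every resource to the store immediately, but sends a single
# index request for the whole batch once the block has finished.
class BufferedIndexingPersister
  def initialize(store:, index:)
    @store = store
    @index = index
  end

  def buffer
    pending = []
    yield ->(resource) { pending << @store.save(resource) }
    @index.index(pending) unless pending.empty?
    pending
  end
end

store = FakeMetadataStore.new
index = FakeIndex.new
persister = BufferedIndexingPersister.new(store: store, index: index)

persister.buffer do |save|
  3.times { |n| save.call({ id: n, title: "Work #{n}" }) }
end

puts "store rows: #{store.rows.size}, index requests: #{index.requests}"
```

In this sketch three writes reach the store while the index sees only one request, which is the behavior the Postgres and Fedora backends share.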

Valkyrie with ActiveFedora

ActiveFedora is used exactly as it is in the current Hydra stack. Valkyrie performs one write operation per resource and the ActiveFedora adapter translates this into metadata persistence in Fedora and indexing in Solr. This is as close to Hyrax as possible, but without the full Hyrax stack.

Binary files are written to Fedora.

Valkyrie with Fedora

A Fedora repository holds the metadata for every resource, persisted as RDF, and each resource is also indexed in Solr. This is the exact analog of the Postgres backend, where Valkyrie’s indexing adapter makes multiple writes to Fedora and then submits one indexing operation to Solr at the end. ActiveFedora is not used at all. The Solr process is the same as in the Valkyrie with Postgres backend, and Valkyrie performs writes to Fedora using an LDP client.

Binary files are written to Fedora.

Hyrax

This is the full Hyrax application, using the current hydra-head stack, including ActiveFedora.

Testing Apparatus

One server ran the Rails application, while another ran Fedora, Postgres, and Solr. Tests were initiated on the application server via rake tasks. Both servers had identical specifications:

  • System: Red Hat Enterprise Linux Server release 7.4 (Maipo), virtualized
  • CPU: 4 Intel® Xeon® CPU E5-2697 v2 @ 2.70GHz
  • RAM: 16268092 kB

Test Types

Collections

Multiple works are part of a collection using a member_of relationship where the work asserts its relationship to the collection.

In Valkyrie, the work has an array of collections with only one member: the uri of the collection to which it belongs. Benchmarking is performed around the time to save the work:

work.part_of_collections = [collection.id.to_uri]
bench.report { adapter.persister.save(resource: work) }

In Hyrax, the approach is similar, but we use the tools available in hydra-works, which persist the membership in Fedora using LDP indirect containers. The outcome is the same, however, and each work only has an array of one item. Benchmarking is performed around the time to assert the relationship and save the work:

bench.report do
  i.member_of_collections = [collection]
  i.save
end
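
Both snippets rely on a bench object whose report method times a block, which is the interface Ruby’s standard Benchmark library provides. A minimal, self-contained sketch of such a harness might look like the following. InMemoryPersister and benchmark_collection are hypothetical names, and plain hashes stand in for Valkyrie resources, so the timings say nothing about any real backend.

```ruby
require 'benchmark'
require 'securerandom'

# Hypothetical in-memory persister standing in for adapter.persister.
class InMemoryPersister
  attr_reader :saved

  def initialize
    @saved = {}
  end

  def save(resource:)
    @saved[resource[:id]] = resource
    resource
  end
end

# Creates `count` works that all point at one collection and returns
# one Benchmark::Tms measurement per save.
def benchmark_collection(count, persister: InMemoryPersister.new)
  collection_id = SecureRandom.uuid
  timings = []
  Benchmark.bm(10) do |bench|
    count.times do |n|
      work = { id: SecureRandom.uuid, part_of_collections: [collection_id] }
      # Only the save is timed, mirroring the Valkyrie snippet above.
      timings << bench.report("work #{n}") { persister.save(resource: work) }
    end
  end
  timings
end

timings = benchmark_collection(10)
puts format('total real time: %.6f s', timings.sum(&:real))
```

Keeping each measurement, rather than only a grand total, is what makes it possible to see whether individual saves slow down as the collection grows.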

Nested Works

One work contains many other works using a has_members relationship where the parent work asserts its relationship to all the other child works. Benchmarking is performed around the time to add the new child and save the parent.

In Valkyrie, this is accomplished with an array, appending each new work as the test progresses:

bench.report do
  parent.has_collections << child.id
  adapter.persister.save(resource: parent)
end

In Hyrax, the only difference is that the array is ordered, using the ordered_members method available from hydra-pcdm:

bench.report do
  parent.ordered_members << child
  parent.save
end

Files

Many individual works are created, each with a unique 1 MB file of random data. Data is randomized by replacing some of the data in the file with a UUID.
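
One way to produce such files is sketched below. It is an illustration rather than the code used in the tests: the exact byte size, the position of the UUID, and the helper names randomized_data and randomized_file are assumptions.

```ruby
require 'securerandom'
require 'tempfile'

FILE_SIZE = 1024 * 1024 # assumed: 1 MB taken as 1,048,576 bytes

# Random payload generated once and reused as the base for every file.
BASE_DATA = SecureRandom.random_bytes(FILE_SIZE).freeze

# Returns a copy of the base payload with its first 36 bytes replaced by a
# fresh UUID, so every result is unique while the size stays the same.
def randomized_data(base = BASE_DATA)
  data = base.dup
  uuid = SecureRandom.uuid
  data[0, uuid.bytesize] = uuid
  data
end

# Writes the randomized payload to a temporary file and rewinds it so the
# caller can read it from the start.
def randomized_file
  file = Tempfile.new(['small_file', '.bin'])
  file.binmode
  file.write(randomized_data)
  file.rewind
  file
end

file = randomized_file
puts "#{file.path}: #{file.size} bytes"
file.close!
```

Overwriting a small slice is much cheaper than generating a full megabyte of random bytes per work, which keeps file generation from dominating the measurement.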

In Valkyrie, two adapters were used: one for the metadata, and a second for the binary data. The time to persist both was benchmarked:

bench.report do
  work = Work.new
  file = storage.upload(file: randomized_file, resource: work)
  work.has_files = [file.id]
  adapter.persister.save(resource: work)
end

In Hyrax, in order to match the application’s processes as closely as possible, each file was placed in a file set and then attached to the work. This was adapted from AttachFilesToWorkJob in Hyrax and benchmarked:

bench.report do
  work = create_work(count)
  permissions = work.permissions.map(&:to_hash)
  file_set = FileSet.new
  actor = Hyrax::Actors::FileSetActor.new(file_set, user)
  actor.create_metadata(visibility: 'open')
  file_set.title = ["Small File #{count}"]
  file_set.label = "Small File #{count}"
  file_set.save
  Hydra::Works::AddFileToFileSet.call(file_set, randomized_file, :original_file)
  actor.attach_to_work(work)
  actor.file_set.permissions_attributes = permissions
  work.save
end

Collections Test Results

Valkyrie with Postgres

Valkyrie with Fedora

290.59 user 28.05 system 10:44.47 elapsed 49%CPU
(0avgtext+0avgdata 211648maxresident)k 0inputs+1136outputs
(0major+65774minor)pagefaults 0swaps
2243.38user 233.10system 1:33:21elapsed 44%CPU
(0avgtext+0avgdata 1254852maxresident)k 0inputs+11112outputs
(0major+313214minor)pagefaults 0swaps
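
The %CPU figure in this output can be reproduced from the other three values as (user + system) / elapsed. The sketch below assumes the result is truncated to a whole percent, which matches the figures reported in this post. The helper names elapsed_seconds and cpu_percent are made up for illustration.

```ruby
# Converts an elapsed value such as "10:44.47" (m:ss.ss) or "1:33:21"
# (h:mm:ss) into seconds.
def elapsed_seconds(elapsed)
  elapsed.split(':').map(&:to_f).reduce(0.0) { |total, part| total * 60 + part }
end

# CPU share as (user + system) / elapsed, truncated to a whole percent.
def cpu_percent(user, system, elapsed)
  ((user + system) / elapsed_seconds(elapsed) * 100).floor
end

puts cpu_percent(290.59, 28.05, '10:44.47')   # => 49
puts cpu_percent(2243.38, 233.10, '1:33:21')  # => 44
```

A figure below 50% means the Ruby process spent more than half of the elapsed time off the CPU, presumably waiting on responses from the other server.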

Valkyrie with ActiveFedora

We attempted to create a collection with 100K items, but the test was terminated after 17.5 hours.

4692.09user 300.16system 17:34:25elapsed 7%CPU
(0avgtext+0avgdata 112960maxresident)k 0inputs+37800outputs
(0major+29319minor)pagefaults 0swaps

Hyrax

We attempted to create a collection with 100K items, but the test was terminated after 12 hours.

real    722m11.468s
user    94m18.218s
sys     11m1.927s

Nested Works Test Results

Valkyrie with Postgres

We attempted to create 10,000 nested works within one work, but the test failed before that.

8553.42 user 61.51 system 2:35:43 elapsed 92%CPU
(0avgtext+0avgdata 15162988maxresident)k 29446289inputs+6811464outputs
(33986major+4273558minor)pagefaults 0swaps

Valkyrie with Fedora

We attempted to create 10,000 nested works within one work, but the test failed before that.

Net::ReadTimeout: Net::ReadTimeout

22931.08 user 29.03 system 10:28:05 elapsed 60%CPU
(0avgtext+0avgdata 1712764maxresident)k 0inputs+10304outputs
(0major+3901698minor)pagefaults 0swaps

Valkyrie with ActiveFedora

We attempted to create 1K nested works within one work, but the test was terminated after several hours.

Hyrax

Files Test Results

With a unique 1MB file for each work, the tests were capped at 1,000 because of disk space limitations.

Valkyrie with Postgres

Valkyrie with Fedora

real    5m46.804s
user    2m29.848s
sys     0m17.905s

Valkyrie with ActiveFedora

real    12m10.749s
user    5m11.323s
sys     0m22.615s

Hyrax

real 71m26.433s
user 23m32.859s
sys 1m33.824s
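
To put the three real times above on a common scale, they can be converted to seconds and expressed relative to the fastest run. This is plain arithmetic on the reported figures; real_seconds and REAL_TIMES are names invented for the sketch.

```ruby
# Parses a shell `time` value such as "71m26.433s" into seconds.
def real_seconds(value)
  match = value.match(/\A(\d+)m(\d+(?:\.\d+)?)s\z/)
  raise ArgumentError, "unexpected time format: #{value}" unless match

  match[1].to_i * 60 + match[2].to_f
end

# Real times copied from the results above.
REAL_TIMES = {
  'Valkyrie with Fedora' => '5m46.804s',
  'Valkyrie with ActiveFedora' => '12m10.749s',
  'Hyrax' => '71m26.433s'
}.freeze

baseline = real_seconds(REAL_TIMES['Valkyrie with Fedora'])
REAL_TIMES.each do |backend, value|
  seconds = real_seconds(value)
  puts format('%-28s %9.3f s  %5.1fx', backend, seconds, seconds / baseline)
end
```

By this measure the ActiveFedora adapter took roughly twice as long as the Fedora adapter, and full Hyrax roughly twelve times as long.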

Comparison of Backends

This compares the total time for each benchmark across all four backends.

Collections

Nested Works

Files

Further Analysis

Decreased Performance with Collections

Why does ingest time increase with inversely related collections when using ActiveFedora in Valkyrie or Hyrax? When using the Fedora and Postgres adapters in Valkyrie, performance remains flat.

Additional tests were conducted locally on a laptop because servers were no longer available. All tests were capped at 25000 works because that was large enough to show a significant decrease in performance.

Collection Performance Locally

Running the test on a laptop yielded results similar to those seen in the multi-server environment. The server environment showed a significant performance penalty, apparently due to network latency: the laptop ran slightly faster than the servers, and the variance in times was much smaller. However, there is still a clear decrease in performance over time in both environments.

Fedora and Solr Requests

If Fedora performance is degrading, we might see a similar increase in response times with the different HTTP requests sent to it.
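
One simple way to check a series of response times for this kind of degradation is to average them over fixed-size windows and see whether the averages keep rising. The sketch below uses synthetic numbers, not measurements from these tests, and the helper names windowed_means and degrading? are hypothetical.

```ruby
# Splits the samples into windows of the given size and returns each mean.
def windowed_means(samples, window)
  samples.each_slice(window).map { |slice| slice.sum / slice.size.to_f }
end

# True when every window is slower on average than the one before it.
def degrading?(samples, window)
  windowed_means(samples, window).each_cons(2).all? { |a, b| b > a }
end

# Synthetic response times in milliseconds (NOT data from the tests).
flat     = Array.new(1_000) { |n| 20.0 + (n % 5) }
climbing = Array.new(1_000) { |n| 20.0 + n * 0.05 + (n % 5) }

puts degrading?(flat, 100)      # => false
puts degrading?(climbing, 100)  # => true
```

Averaging over windows smooths out request-to-request noise, so a steady upward drift stands out even when individual requests vary.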

Fedora POST

Each time a new work is added to the collection, two POST requests are made: one for the work, and a second for the access control list resource.

Fedora GET

For each new work, there were 5 GET requests:

  • 1 for the work
  • 3 for access control list resources
  • 1 that returned a 404 for the work’s /list_source

Solr Update

The ActiveFedora adapter creates two Solr resources (documents) per work: one for the work, and a second for the ACL resource in Fedora. The total number of Solr documents came to 50002:

  • 25001 Valkyrie::Persistence::ActiveFedora::ORM::Resource (25000 works + 1 collection)
  • 25001 Hydra::AccessControl (25000 works + 1 collection)

During the work creation process, three Solr updates are made for the collection and for each work, resulting in a total of 75003 update requests.
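
The totals above follow directly from the per-resource counts. As a quick check of the arithmetic (solr_totals is a made-up helper name):

```ruby
# Returns the number of Solr documents and update requests for a collection
# test, using the per-resource counts reported above.
def solr_totals(works:, collections: 1)
  resources = works + collections # 25000 works + 1 collection
  docs_per_resource = 2           # the resource itself plus its ACL
  updates_per_resource = 3        # updates observed per work or collection

  {
    documents: resources * docs_per_resource,
    updates: resources * updates_per_resource
  }
end

totals = solr_totals(works: 25_000)
puts totals[:documents] # => 50002
puts totals[:updates]   # => 75003
```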

Graphing the response times for each request showed the exact same pattern of performance degradation. Different Solr configurations were tested, but the key factor in performance was the suggest field. When text fields were not copied to the field, as they were with other Solr configurations, response times improved dramatically.

We don’t know exactly why suggest fields have such an impact on performance. All the other Valkyrie adapters used Solr configurations that had suggest fields enabled, but the performance impact was only felt when using ActiveFedora.

The common Solr configuration for Hyrax and other Samvera-based applications uses fields with suffixes such as _tesim and _ssim to denote stored, searchable text in Solr. Additionally, a suggest suffix is used for fields that Blacklight can use to provide a type of “Did you mean…” search refinement to users.

<dynamicField name="*_tesim" type="text_en" stored="true" indexed="true" multiValued="true"/>
<dynamicField name="*_ssim" type="string" stored="true" indexed="true" multiValued="true"/>

<dynamicField name="*suggest" type="textSuggest" indexed="true" stored="false" multiValued="true" />

All text fields’ content is copied directly to a suggest field:

<copyField source="*_tesim" dest="suggest"/>
<copyField source="*_ssim" dest="suggest"/>

The difference in configuration between the fields centers around _tesim versus suggest fields and their tokenizers and filters. We would need to do more testing to verify this, but it could be that the performance impact is related to the KeywordTokenizerFactory in the suggest field and the ICUTokenizerFactory in the tesim field.

<fieldType name="string" class="solr.StrField" sortMissingLast="true" />

<fieldType name="text_en" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.ICUTokenizerFactory"/>
    <filter class="solr.ICUFoldingFilterFactory"/>
    <filter class="solr.EnglishPossessiveFilterFactory"/>
    <filter class="solr.EnglishMinimalStemFilterFactory"/>
    <filter class="solr.TrimFilterFactory"/>
  </analyzer>
</fieldType>

<fieldType class="solr.TextField" name="textSuggest" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.StandardFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>
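
To make the difference between the two analyzers more concrete, here is a very rough Ruby approximation of what each tokenizer emits for the same value. It does not reproduce Solr’s analysis chain: ICU segmentation, folding, stemming, and possessive handling are all reduced to a simple split and downcase, and both helper names are hypothetical.

```ruby
# Rough stand-in for KeywordTokenizerFactory plus LowerCaseFilterFactory:
# the whole value becomes a single lowercased token.
def keyword_tokens(value)
  [value.downcase]
end

# Rough stand-in for ICUTokenizerFactory plus folding: split on runs of
# letters and digits, then lowercase each piece.
def word_tokens(value)
  value.scan(/[[:alnum:]]+/).map(&:downcase)
end

title = 'Benchmark Testing With Valkyrie and Hyrax'
p keyword_tokens(title) # => ["benchmark testing with valkyrie and hyrax"]
p word_tokens(title)    # => ["benchmark", "testing", "with", "valkyrie", "and", "hyrax"]
```

The suggest field therefore indexes each copied value as one long term, while the tesim field indexes many short ones. Whether that difference explains the slowdown is still unverified, as noted above.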

Final Comparison

When comparing the process of creating 25000 works in a collection, the Solr configuration plays the critical role in performance.

Another dimension of comparison used early in the testing process was removing all the commits made to Solr. This showed the same performance boost as removing suggest fields. However, since removing suggest fields while retaining commits produced substantial performance increases, it’s pretty clear that, although also removing commits may give a slight further increase, the principal performance gain comes from removing the tesim to suggest field copying.