Performance Metrics with Hyrax and Valkyrie

Adam Wead

Penn State University

awead@psu.edu / @amsterdamos

Overview

Part 1: Testing and Results

  • testing apparatus
  • results
  • charts and graphs to back me up

Part 2: Performance Implications

  • performance problems with inverted collections
  • more charts and graphs to back me up
  • how we might make things better

Part 1: Testing and Results

The Scenario

Penn State's Cultural Heritage Object repository (CHO) project will migrate collections containing 300,000 items from CONTENTdm.

What would performance be like with that?

The Experiment

Mimic our large-collection use case by building collections with many thousands of works, then benchmark a default Hyrax application against a Valkyrie-based Rails application and compare their performance.

The Setup

  • Three Tests
    1. Inverted collections with 100,000 works
    2. Nested collections with 10,000 works
    3. 1,000 works with files
  • Four Environments
    1. Hyrax 1.0.0.RC1
    2. Valkyrie with Postgres + Solr
    3. Valkyrie with Fedora + Solr
    4. Valkyrie with ActiveFedora

Test Apparatus

  • Servers
    1. Rails
    2. Fedora, Postgres, and Solr
  • Red Hat Enterprise Linux Server release 7.4 (Maipo), virtualized
  • 4 Intel(R) Xeon(R) CPU E5-2697 v2 @ 2.70GHz
  • 16 GB RAM

Inverted Collections

  • one collection with 100,000 works
  • works point to the collection: collection ← work
  • benchmarked around saving a new work with the referenced collection
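Each test snippet in this deck times a single save. A minimal sketch of how such a save might be repeated across many works, recording one timing per work so that slowdowns become visible as the collection grows. FakePersister, benchmark_saves, and the hash standing in for a work are hypothetical names for illustration only; they are not part of Valkyrie, Hyrax, or the original test code.

```ruby
require 'benchmark'

# Hypothetical stand-in for adapter.persister; it only records what it saves.
class FakePersister
  attr_reader :saved

  def initialize
    @saved = []
  end

  def save(resource:)
    @saved << resource
    resource
  end
end

# Times each save separately and returns one [index, seconds] row per work.
def benchmark_saves(persister, count)
  (1..count).map do |index|
    # A plain hash stands in for a work that points at its collection.
    work = { id: index, part_of_collections: ['collection-1'] }
    seconds = Benchmark.realtime { persister.save(resource: work) }
    [index, seconds]
  end
end

persister = FakePersister.new
rows = benchmark_saves(persister, 100)

# Emit comma-separated rows that a charting tool could read.
puts 'work,seconds'
rows.each { |index, seconds| puts "#{index},#{seconds}" }
```

Swapping FakePersister for a real persister would leave the timing loop unchanged, which is the point of keeping the timed block limited to the save call.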

Valkyrie Test


    Benchmark.benchmark do |bench|
      work = Work.new
      # Point the new work at the existing collection (collection ← work)
      work.part_of_collections = [collection.id.to_uri]
      # Only the save is timed
      bench.report { adapter.persister.save(resource: work) }
    end

Hyrax Test



    Benchmark.benchmark do |bench|
      i = Image.new
      # Timed: assign the collection to the new work and save it
      bench.report do
        i.member_of_collections = [collection]
        i.save
      end
    end

Comparisons

Results

  • Postgres adapter: finished in 15 minutes
  • Fedora adapter: finished in 93 minutes
  • ActiveFedora adapter: stopped after 17 hours at ~25,000 works
  • Hyrax: stopped after 12 hours at ~12,000 works
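To put the four results on a common scale, the figures above can be converted to average throughput. This is a rough sketch: the ActiveFedora and Hyrax counts are approximate, and an average hides any slowdown over the course of a run. The names RESULTS and works_per_second are illustrative.

```ruby
# Figures taken from the results above: [works saved, elapsed seconds].
RESULTS = {
  'Postgres adapter'     => [100_000, 15 * 60],
  'Fedora adapter'       => [100_000, 93 * 60],
  'ActiveFedora adapter' => [25_000, 17 * 3600], # approximate count
  'Hyrax'                => [12_000, 12 * 3600]  # approximate count
}.freeze

# Average number of works saved per second over the whole run.
def works_per_second(works, seconds)
  works.fdiv(seconds)
end

RESULTS.each do |name, (works, seconds)|
  puts format('%-22s %8.2f works/second', name, works_per_second(works, seconds))
end
```

By this measure the Postgres adapter averaged roughly 111 works per second and the Fedora adapter roughly 18, while the ActiveFedora adapter and Hyrax both averaged well under one work per second.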

Nested Collections

  • one collection with 10,000 nested works
  • collection → work: the collection contains all the works
  • benchmarked around adding the new work to the collection and saving

Valkyrie Test


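    Benchmark.benchmark do |bench|
      # Create and save the child outside the timed block
      child = Work.new
      result = adapter.persister.save(resource: child)
      bench.report do
        # Timed: append the child's id to the collection, then save the collection
        collection_resource.has_collections << result.id
        adapter.persister.save(resource: collection_resource)
      end
    end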

Hyrax Test



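    Benchmark.benchmark do |bench|
      # Create and save the child outside the timed block
      child = Image.new
      child.save
      bench.report do
        # Timed: append the child to the parent's ordered members, then save the parent
        parent.ordered_members << child
        parent.save
      end
    end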

Results: EVERYONE FAILED!

  • Postgres adapter: system locked up at 5,800 works after 2.5 hours hitting 92% CPU
  • Fedora adapter: Net::ReadTimeout at 2,000 works after 10.5 hours
  • ActiveFedora adapter: forcibly terminated after 208 works, taking 2 hours
  • Hyrax: no testing done with 10,000 works

Comparison

Test was reduced to 1,000 works

Files

  • 1,000 works, each with a unique 1 MB file
  • benchmarked around creating the file and work, attaching them together, and saving

Valkyrie Test


    Benchmark.benchmark do |bench|
      id = SecureRandom.uuid
      # Prepare a unique file for this work (not timed)
      randomize_file(id)
      bench.report do
        # Timed: create the work, upload the file, attach it, and save
        work = Work.new(id)
        file = storage_adapter.upload(File.open('tmp/small_random.bin', 'r'), resource: work)
        work.has_files = [file.id]
        metadata_adapter.persister.save(resource: work)
      end
    end

Hyrax Test



    Benchmark.benchmark do |bench|
      # Prepare a unique file for this work (not timed)
      randomize_file
      bench.report do
        image = create_image # returns saved image with default metadata and permissions
        permissions = image.permissions.map(&:to_hash)
        # Create the file set and its metadata
        file_set = FileSet.new
        actor = Hyrax::Actors::FileSetActor.new(file_set, user)
        actor.create_metadata(visibility: 'open')
        file_set.save
        # Add the file, attach the file set to the work, and copy the work's permissions
        Hydra::Works::AddFileToFileSet.call(file_set, File.open('tmp/small_random.bin', 'r'), :original_file)
        actor.attach_to_work(image)
        actor.file_set.permissions_attributes = permissions
        image.save
      end
    end

Comparisons

Next Steps

Given our 300,000-work collection use case, Penn State opted to use Valkyrie because it demonstrated the ability to handle extremely large collections. Under this approach:
  • nested collections would be much smaller
  • files would be stored outside of Fedora
  • metadata and content (maybe) could be pushed to Fedora asynchronously

Part 2: Performance Implications

Inverted Collections (redux)

Why the difference?

Local Test Apparatus

  • moved to testing on a laptop
  • Rails, Fedora, Solr, and Postgres all running locally
  • 2013 MacBook Pro, 2.6 GHz Intel Core i5, 8 GB RAM

New Testing Strategy

  • same as before, but capped at 25,000 works
  • focused specifically on Valkyrie's ActiveFedora adapter
  • Hyrax was not tested
  • logged requests to Fedora and Solr
  • graphed response times and overall number of requests

Local Comparison

Does a laptop environment perform similarly to the server environment? Yes

Fedora and Solr Requests

  • Valkyrie's ActiveFedora adapter makes 10 total requests per work:
    • 2 POSTs to Fedora: create the work and ACL
    • 5 GETs to Fedora: 1 work, 3 ACL, 1 work's /list_source (404)
    • 3 updates to Solr
  • By comparison, the Fedora adapter makes only one each to Fedora and Solr

Fedora GET Requests

Is reading from Fedora limiting performance? No

5 GETs x 25,000 works = 125,000 requests

Fedora POST Requests

Is writing to Fedora limiting performance? Somewhat

2 POSTs x 25,000 works = 50,000 requests

Solr Updates

Are Solr updates limiting performance? Yep!

3 updates x 25,000 works = 75,000 requests
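Adding the three per-type counts together shows the overall request load for a 25,000-work run, and how it compares with the Fedora adapter's one request each to Fedora and Solr. A small worked calculation, with illustrative constant and method names:

```ruby
WORKS = 25_000

# Requests per work, as reported for each adapter in this deck.
ACTIVE_FEDORA_ADAPTER = { fedora_get: 5, fedora_post: 2, solr_update: 3 }.freeze
FEDORA_ADAPTER        = { fedora: 1, solr_update: 1 }.freeze

# Total requests issued over a run of the given size.
def total_requests(per_work, works)
  per_work.values.sum * works
end

puts "ActiveFedora adapter: #{total_requests(ACTIVE_FEDORA_ADAPTER, WORKS)} requests"
puts "Fedora adapter:       #{total_requests(FEDORA_ADAPTER, WORKS)} requests"
```

That is 250,000 requests for the ActiveFedora adapter against 50,000 for the Fedora adapter over the same 25,000 works.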

Solr Comparisons

  • PSU config with Solr 5.3.1 took ~11.5 hours
  • Hyrax config with Solr 7.1.0 took ~11.25 hours
  • Hyrax config with Solr 7.1.0 without suggest took 1 hour
  • Disabled copying text fields to suggest fields


    # schema.xml
    <copyField source="*_tesim" dest="suggest"/>
    <copyField source="*_ssim" dest="suggest"/>

            

Solr Tokenizer Slowdown?

    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.StandardFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>

    <tokenizer class="solr.ICUTokenizerFactory"/>
    <filter class="solr.ICUFoldingFilterFactory"/>
    <filter class="solr.EnglishPossessiveFilterFactory"/>
    <filter class="solr.EnglishMinimalStemFilterFactory"/>
    <filter class="solr.TrimFilterFactory"/>

Final Result

ActiveFedora adapter with different Solr configurations
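The run times listed under Solr Comparisons can be expressed as an average cost per work. This sketch assumes each run saved the full 25,000 works from the new testing strategy, which the slides imply but do not state; the constant and method names are illustrative.

```ruby
# Assumption: every run saved 25,000 works.
RUN_SIZE = 25_000

# Approximate run times in hours, from the Solr comparisons.
RUN_HOURS = {
  'PSU config, Solr 5.3.1'               => 11.5,
  'Hyrax config, Solr 7.1.0'             => 11.25,
  'Hyrax config, Solr 7.1.0, no suggest' => 1.0
}.freeze

# Average wall-clock seconds spent per work over a whole run.
def seconds_per_work(hours, works)
  hours * 3600 / works
end

RUN_HOURS.each do |config, hours|
  puts format('%-38s %.3f seconds/work', config, seconds_per_work(hours, RUN_SIZE))
end
```

Under that assumption, disabling the suggest copy fields takes the average from about 1.6 seconds per work to about 0.14 seconds per work, roughly an 11-fold reduction.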

Conclusions

  • Solr matters
  • Valkyrie is faster:
    • fewer Solr updates per work (1, versus 3 with the ActiveFedora adapter)
    • fewer Fedora calls per work (1, versus 7 with the ActiveFedora adapter)
    • retains suggest field
  • Fedora can be performance limiting, but not in all cases

Remaining Questions

lots!!!!

  • Does Hyrax 2.0 have the same performance problems? Probably?
  • I want 10,000 ordered works. What do I do?
  • What about files?

What Does This Mean?

Is Valkyrie "better" than Hyrax? No!

Should I use Valkyrie instead of Hyrax? Depends

Will Valkyr-izing Hyrax fix these problems? Potentially

I want more!

Shout Outs

  • my team & Penn State
  • Trey Pendragon & Valkyrie
  • Aaron Coburn @ Amherst

Thank You!

Adam Wead

awead@psu.edu / @amsterdamos