Tika is a Java class library available through the Apache Software Foundation. It supports media type detection based on file type signatures, metadata extraction, and text parsing and extraction.
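
To make "media type detection based on file type signatures" concrete, here is a minimal sketch of the general idea in plain Java. It is not Tika's implementation or API: the class name `SignatureSniffer`, the `detect` method, and the four-entry signature table are all hypothetical, chosen only for illustration. The byte sequences themselves are the well-known leading bytes of PDF, PNG, GIF, and ZIP files.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

/** Illustrative only: a tiny signature table, not Tika's detector. */
final class SignatureSniffer {

    // Media type -> leading byte sequence ("magic number") that identifies it.
    private static final Map<String, byte[]> SIGNATURES = new LinkedHashMap<>();
    static {
        SIGNATURES.put("application/pdf", "%PDF-".getBytes(StandardCharsets.US_ASCII));
        SIGNATURES.put("image/png", new byte[] {(byte) 0x89, 'P', 'N', 'G', 0x0D, 0x0A, 0x1A, 0x0A});
        SIGNATURES.put("image/gif", "GIF8".getBytes(StandardCharsets.US_ASCII));
        SIGNATURES.put("application/zip", new byte[] {'P', 'K', 0x03, 0x04});
    }

    /** Returns the first media type whose signature prefixes the given bytes. */
    static String detect(byte[] head) {
        for (Map.Entry<String, byte[]> entry : SIGNATURES.entrySet()) {
            if (startsWith(head, entry.getValue())) {
                return entry.getKey();
            }
        }
        // Generic fallback when no signature matches.
        return "application/octet-stream";
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }
}
```

In use, a caller would read the first few bytes of a file and pass them to `SignatureSniffer.detect`. Note that several of the formats listed below (for example OpenDocument, EPUB, and Java archives) are ZIP containers and share the ZIP signature, so a prefix check like this one cannot tell them apart; a full detector has to look inside the container as well.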

Supported Document Formats:

  • HyperText Markup Language
  • XML and derived formats
  • Microsoft Office document formats
  • OpenDocument Format
  • Apple iWork formats
  • Portable Document Format
  • Electronic Publication Format
  • Rich Text Format
  • Compression and packaging formats
  • Text formats
  • Audio formats
  • Image formats
  • Video formats
  • Java class files and archives
  • Mail formats
  • The DWG (AutoCAD) format
  • Font formats
  • Scientific formats

The Tika application can be run in either command-line mode or graphical user interface (GUI) mode. Tika is written in Java, and the class library can be used directly in other programs where needed.

Those with advanced programming skills can extend Tika to meet specific project or analysis needs not covered by the basic release. It is an open source project at the Apache Software Foundation and is available under the Apache License version 2.0 (ALv2).

    Technical Expertise Required: 
    Any platform or device that supports Java