Introduction
This tutorial shows how to use the Apache Tika facade class and how to debug it in Eclipse.
Background
Apache Tika is a library that is used for document type detection and content extraction from various file formats. Reference: https://www.tutorialspoint.com/tika/tika_overview.htm
Using the Code
In this article, I will show how to create a new project in Eclipse and run an example that detects a file's type using the Apache Tika library.
Steps
- I am using Apache Tika version 1.20, which can be downloaded from http://tika.apache.org/download.html. Download the jar file (this article uses the tika-app jar) and save it on your machine.
- Open Eclipse and create a new Java project (File -> New -> Java Project).
- Give the project a name, say 'DetectType', and select the JRE version that you are using. If a compatible JRE is not in the list, install one.
- Right click on 'src' and select New->Class. Give it a name, say 'DetectType'. Refresh the project and you should see the new file added under src.
- Add a body to the newly added file:
public class DetectType
{
    public static void main(String[] args) throws Exception
    {
        // The detection code is added in a later step.
    }
}
- Create a folder named 'lib' inside the project folder in your workspace and copy the jar file into it.
- Add the jar file to your 'DetectType' project: right click on the project and select Properties -> Java Build Path -> Libraries -> Add JARs.
- Select the newly copied jar file in your project. If you do not see the jar file, refresh your project and try again. The jar should now be listed on the Libraries tab of the properties window.
- Refresh your project. The jar file should now appear in the Project Explorer.
- Update your code body to use the 'Tika' class and detect the file type:
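import org.apache.tika.Tika;

public class DetectType
{
    public static void main(String[] args) throws Exception
    {
        // The facade class; the default constructor uses Tika's default configuration.
        Tika tika = new Tika();
        // Each program argument is treated as a file name or path.
        for (String file : args) {
            String fileType = tika.detect(file);
            System.out.println("File type of '" + file + "' is : " + fileType);
        }
    }
}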
- The project hierarchy should now show the class under 'src' and the jar file under 'lib'. You can leave the class in the default package. I have placed mine in the package 'org.apache.tika' because the next section imports the entire Tika source code, which is helpful when debugging.
- The program expects one or more file names as input. Pass them as program arguments (Run -> Run Configurations -> Arguments -> Program arguments).
- Now run the program. The result appears in the console, something like this:
File type of 'format\1.vsd' is : application/vnd.visio
The above example is a small one that detects the type of a file. Note that 'tika.detect(String)' decides from the file name alone; the overloads that take a File or an InputStream also inspect the content. Tika exposes many more APIs for extracting the metadata and even the content of a file. For the complete list, see https://tika.apache.org/1.20/api/.
Tika supports the following functionality:
- Document type detection
- Content extraction
- Metadata extraction
- Language detection
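As an illustration of the first three items, here is a minimal sketch that detects the type from the file's content, extracts its text and lists its metadata. It assumes the tika-app jar (which bundles the parsers) is on the build path as described above. The class name InspectFile and its two helper methods are made up for this example; the Tika calls used are detect(File), parseToString(InputStream, Metadata), Metadata.names() and Metadata.get(String).

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

import org.apache.tika.Tika;
import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;

// Hypothetical helper class, not part of Tika.
class InspectFile {
    private static final Tika TIKA = new Tika();

    /** Detects the media type using the file's content as well as its name. */
    static String detectByContent(Path path) throws IOException {
        return TIKA.detect(path.toFile());
    }

    /** Extracts plain text and fills 'metadata' with whatever the parser reports. */
    static String extractText(Path path, Metadata metadata) throws IOException, TikaException {
        try (InputStream in = Files.newInputStream(path)) {
            return TIKA.parseToString(in, metadata);
        }
    }

    public static void main(String[] args) throws Exception {
        // Each program argument is treated as a path to an existing file.
        for (String arg : args) {
            Path path = Paths.get(arg);
            Metadata metadata = new Metadata();
            System.out.println("Type : " + detectByContent(path));
            System.out.println("Text : " + extractText(path, metadata).trim());
            for (String name : metadata.names()) {
                System.out.println("  " + name + " = " + metadata.get(name));
            }
        }
    }
}
```

Unlike the earlier example, this one opens the file, so each argument must point to a file that actually exists. Which metadata keys are printed depends on the file format and on the parser that handles it.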
Debugging the Apache Tika Facade
If you wish to add the Apache Tika source code to your Eclipse project and debug through the facade class, follow these steps.
- Create a new package 'org.apache.tika' in your src (as shown in point 11 in the above section).
- Create a new class under 'org.apache.tika': right click 'org.apache.tika' -> New -> Class. Give it a name of your choice, say 'DetectType'.
- Download the source code 'Mirrors for tika-1.20-src.zip' from http://tika.apache.org/download.html.
- Unzip the archive. It contains the source packages that let us step into the facade class used in the code above.
- Go into 'tika-core' and copy the contents of the folder 'tika-core\src\main\java\org\apache\tika' into the folder 'DetectType\src\org\apache\tika' of your workspace. Refresh your project in Eclipse and you should see all of these as packages, for example 'org.apache.tika.detect', 'org.apache.tika.mime' and 'org.apache.tika.parser'.
- If you see errors in the project caused by 'package-info.java', remove those files. Their sole purpose is to provide a home for package-level documentation and package-level annotations.
- Start debugging. If at any level the source code cannot be found, go to the unzipped file structure from point 4 and copy the missing source into the matching location under org/apache/tika in your workspace.
For errors involving 'org.osgi.framework' or 'org.osgi.util', go to http://www.java2s.com/Code/Jar/o/Downloadorgosgicore500jar.htm and download the jar file. Add it to your project the same way you added the tika-app jar in step 8.
Similarly, the same site hosts other packages that may cause trouble, such as the one containing 'org.sqlite.SQLiteConfig'.
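When such an error appears, it can help to check which of the classes mentioned above are actually visible to the program and where they are loaded from. The following is a minimal JDK-only sketch, not part of Tika. The class name ClassOrigin and the method describe are made up for illustration, and 'org.osgi.framework.BundleActivator' is used only as a representative class of the 'org.osgi.framework' package.

```java
import java.security.CodeSource;

// Hypothetical diagnostic helper; it uses only the JDK.
class ClassOrigin {

    /** Reports where a class is loaded from, or that it cannot be found. */
    static String describe(String className) {
        try {
            // initialize=false: look the class up without running its static initializers.
            Class<?> type = Class.forName(className, false, ClassOrigin.class.getClassLoader());
            CodeSource source = type.getProtectionDomain().getCodeSource();
            // JDK classes usually have no code source.
            String origin = (source == null)
                    ? "JDK / bootstrap class loader"
                    : String.valueOf(source.getLocation());
            return className + " -> " + origin;
        } catch (ClassNotFoundException | LinkageError e) {
            return className + " -> NOT FOUND (" + e.getClass().getSimpleName() + ")";
        }
    }

    public static void main(String[] args) {
        // Defaults are the classes discussed in this article; pass other names as arguments.
        String[] names = args.length > 0 ? args : new String[] {
            "org.apache.tika.Tika",
            "org.osgi.framework.BundleActivator",
            "org.sqlite.SQLiteConfig"
        };
        for (String name : names) {
            System.out.println(describe(name));
        }
    }
}
```

Run it from the same project so that it sees the same build path. A location that points to the project's output folder (typically 'bin' in Eclipse) suggests the copied source is in use, while a location ending in '.jar' suggests the class comes from a jar on the build path.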
Points of Interest
This is the first time I have tried to debug the Tika facade class, and these are the steps I found to do so. If you feel anything is missing, please leave feedback and I will improve this article.
History
- 25th January, 2019: Initial version