Apache Tika 1.11 on Spark NoClassDefFoundError

I'm trying to use Apache Tika on top of Spark, but I'm having configuration issues. My best guess at the moment is that Tika's dependencies (of which there are a lot...) are not being bundled into the JAR that gets submitted to Spark. If that intuition is correct, I'm unsure what the best path forward is, but I'm also not certain that this is even my issue.

The following is a fairly simple Spark job that compiles but hits a runtime error when it reaches the Tika instantiation.

My pom.xml is as follows:

<project>
  <groupId>tika.test</groupId>
  <artifactId>tikaTime</artifactId>
  <modelVersion>4.0.0</modelVersion>
  <name>TikaTime</name>
  <packaging>jar</packaging>
  <version>1.0</version>
  <dependencies>
    <dependency> <!-- Spark dependency -->
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-core_2.10</artifactId>
      <version>1.5.2</version>
    </dependency>
    <dependency>
      <groupId>org.apache.tika</groupId>
      <artifactId>tika-parsers</artifactId>
      <version>1.11</version>
    </dependency>
    <dependency>
      <groupId>org.apache.tika</groupId>
      <artifactId>tika-core</artifactId>
      <version>1.11</version>
    </dependency>
  </dependencies>
</project>

My sample code is here:

/* TikaTime.java */
import org.apache.spark.api.java.*;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.function.Function;
import java.io.File;
import java.io.IOException;
import org.apache.tika.*;

public class TikaTime {
  public static void main(String[] args) throws IOException {

    String logFile = "file.txt";
    File logfile = new File("/home/file.txt"); // currently unused
    SparkConf conf = new SparkConf().setAppName("TikaTime");
    JavaSparkContext sc = new JavaSparkContext(conf);
    JavaRDD<String> logData = sc.textFile(logFile).cache();

    long numAs = logData.filter(new Function<String, Boolean>() {
      public Boolean call(String s) { return s.contains("a"); }
    }).count();

    long numBs = logData.filter(new Function<String, Boolean>() {
      public Boolean call(String s) { return s.contains("b"); }
    }).count();

    System.out.println("Lines with a: " + numAs + ", lines with b: " + numBs);

    // Instantiating the Tika facade class; this line triggers the
    // NoClassDefFoundError at runtime.
    Tika tika = new Tika();
  }
}

The stack trace of the error is as follows (the first line is the job's own console output):

Lines with a: 2, lines with b: 1
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/tika/Tika
    at TikaTime.main(TikaTime.java:32)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:672)
    at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
    at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.ClassNotFoundException: org.apache.tika.Tika
    at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
    at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
    ... 10 more
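
For what it's worth, listing the contents of the jar handed to spark-submit should confirm whether Tika's classes were bundled; with plain jar packaging I would expect this to print nothing:

jar tf target/tikaTime-1.0.jar | grep org/apache/tika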

I'm curious whether others have encountered this issue before. I rarely use Maven and am also somewhat new to Spark, so I'm not confident my intuition here is correct.

Edit: including my spark-submit syntax in case it is of interest.

~/spark151/spark-1.5.1/bin/spark-submit --class "TikaTime" --master local[4] target/tikaTime-1.0.jar
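
For context, here is a minimal, self-contained sketch of how the facade would be exercised once the classpath issue is fixed (Tika.detect(File) is the facade's MIME-type detection method; the class name and file path here are just placeholders):

/* TikaDetect.java - illustrative sketch only */
import java.io.File;
import java.io.IOException;
import org.apache.tika.Tika;

public class TikaDetect {
  public static void main(String[] args) throws IOException {
    // The facade wraps detection and parsing behind one simple API.
    Tika tika = new Tika();
    // Detect the MIME type of a file (placeholder path).
    String mimeType = tika.detect(new File("/home/file.txt"));
    System.out.println("Detected type: " + mimeType);
  }
}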

Answers


Per Gagravarr's response and my original suspicion, the issue was that Spark needs to be handed an uber-jar: Maven's default jar packaging includes only the project's own classes, so none of Tika's dependencies were on the classpath when the job ran. Bundling them in was accomplished using the maven-shade plugin. The new pom.xml is shown below.

<project>
  <groupId>tika.test</groupId>
  <artifactId>tikaTime</artifactId>
  <modelVersion>4.0.0</modelVersion>
  <name>TikaTime</name>
  <packaging>jar</packaging>
  <version>1.0</version>
  <dependencies>
    <dependency> <!-- Spark dependency -->
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-core_2.10</artifactId>
      <version>1.5.2</version>
    </dependency>
    <dependency>
      <groupId>org.apache.tika</groupId>
      <artifactId>tika-parsers</artifactId>
      <version>1.11</version>
    </dependency>
    <dependency>
      <groupId>org.apache.tika</groupId>
      <artifactId>tika-core</artifactId>
      <version>1.11</version>
    </dependency>
  </dependencies>
  <build>
    <plugins>
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-shade-plugin</artifactId>
        <version>2.4.2</version>
        <executions>
          <execution>
            <phase>package</phase>
            <goals>
              <goal>shade</goal>
            </goals>
          </execution>
        </executions>
        <configuration>
          <filters>
            <filter>
              <!-- Strip dependency signature files; stale signatures in a
                   shaded jar cause "Invalid signature file digest" errors. -->
              <artifact>*:*</artifact>
              <excludes>
                <exclude>META-INF/*.SF</exclude>
                <exclude>META-INF/*.DSA</exclude>
                <exclude>META-INF/*.RSA</exclude>
              </excludes>
            </filter>
          </filters>
          <finalName>uber-${project.artifactId}-${project.version}</finalName>
        </configuration>
      </plugin>
    </plugins>
  </build>
</project>

Note: you must also submit the uber-jar produced by this build to Spark, instead of the original jar.
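
Given the <finalName> above, the shaded artifact is written to target/ as uber-tikaTime-1.0.jar, so the earlier spark-submit invocation becomes:

~/spark151/spark-1.5.1/bin/spark-submit --class "TikaTime" --master local[4] target/uber-tikaTime-1.0.jar

As an aside, and not part of the original fix: since Spark itself is already on the classpath of a submitted job, the spark-core dependency can usually be marked <scope>provided</scope>, which keeps it (and its transitive dependencies) out of the uber-jar and makes the artifact much smaller:

<dependency> <!-- Spark dependency; provided by the cluster at runtime -->
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-core_2.10</artifactId>
  <version>1.5.2</version>
  <scope>provided</scope>
</dependency>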

