Sunday, November 16, 2014

Hive Performance Improvement - statistics and file format checks

Following are a couple of important configuration changes that can improve the performance of Hive queries (especially INSERT queries) a great deal.

hive.stats.autogather

By default 'hive.stats.autogather' is 'true'. This setting governs collection of statistics for newly created tables/partitions.
Following are the stats collected:
  • Number of rows
  • Number of files
  • Size in Bytes
For newly created tables and/or partitions (that are populated through the INSERT OVERWRITE command), statistics are automatically computed by default.
For tables with a high number of partitions, this can have a big impact on performance.
It is advisable to disable this setting with set hive.stats.autogather=false;
The stats can then be collected on demand using the following query:
ANALYZE TABLE tablename [PARTITION(partcol1[=val1], partcol2[=val2], ...)]
  COMPUTE STATISTICS 
  [FOR COLUMNS]          -- (Note: Hive 0.10.0 and later.)
  [NOSCAN];
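
For example, a minimal sketch against a hypothetical partitioned table sales(dt string):

ANALYZE TABLE sales PARTITION(dt='2014-11-16') COMPUTE STATISTICS;
ANALYZE TABLE sales PARTITION(dt='2014-11-16') COMPUTE STATISTICS NOSCAN;   -- gathers only number of files and size in bytes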

hive.fileformat.check

This property governs whether the file format is checked when loading data files.
In our case, where the format of the data files is governed by our own processes, checking the file format on every load does not add any value.
Hence, it is recommended to disable the file format check to gain some performance:
set hive.fileformat.check=false;
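
For completeness, a minimal sketch of a session applying both settings before a load (table, column, and path names here are hypothetical):

set hive.stats.autogather=false;
set hive.fileformat.check=false;

LOAD DATA INPATH '/staging/sales_2014_11_16' INTO TABLE staging_sales;

INSERT OVERWRITE TABLE sales PARTITION (dt='2014-11-16')
SELECT id, amount FROM staging_sales;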

Performance

A performance improvement of more than 50% was observed with these configuration changes: the time taken by our metering Hive queries dropped from over 100 minutes to under 50 minutes.

References

https://cwiki.apache.org/confluence/display/Hive/StatsDev
https://cwiki.apache.org/confluence/display/Hive/Configuration+Properties

Thursday, October 2, 2014

Hive Query to get the 95th-percentile ranked item

I was working on converting a complex MySQL query that computed the 95th-percentile value from a GROUP_CONCAT result.

Following is the Hive query to do the same on a simple sample table.

 

Sample Table:

hive> describe coll;

col_name     data_type
proc_id      int
status       string


Sample Data:

hive> select * from coll;

proc_id    status
53         stopped
56         stopped
1          started
2          started
52         stopped
4          started
59         stopped
29         stopped
13         stopped
55         stopped
54         stopped
63         stopped
8          stopped
9          stopped
51         stopped
61         stopped
69         stopped
6          stopped
23         stopped
57         stopped
3          started
11         stopped
7          stopped
66         stopped
12         stopped
67         stopped
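
If you want to try this yourself, the following is a minimal sketch to recreate the sample table (the original DDL is not shown in this post, and the multi-row INSERT ... VALUES syntax needs Hive 0.14 or later; on older versions, LOAD a small delimited file instead):

CREATE TABLE coll (proc_id INT, status STRING);
INSERT INTO TABLE coll VALUES
  (53,'stopped'),(56,'stopped'),(1,'started'),(2,'started'),(52,'stopped'),(4,'started'),
  (59,'stopped'),(29,'stopped'),(13,'stopped'),(55,'stopped'),(54,'stopped'),(63,'stopped'),
  (8,'stopped'),(9,'stopped'),(51,'stopped'),(61,'stopped'),(69,'stopped'),(6,'stopped'),
  (23,'stopped'),(57,'stopped'),(3,'started'),(11,'stopped'),(7,'stopped'),(66,'stopped'),
  (12,'stopped'),(67,'stopped');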

 

How does percent_rank() work?

Query: select status, proc_id, percent_rank() over (PARTITION BY status ORDER BY proc_id DESC) as proc_rank_desc from coll;

Result:

status     proc_id    proc_rank_desc
started    4          0.0
started    3          0.3333333333333333
started    2          0.6666666666666666
started    1          1.0
stopped    69         0.0
stopped    67         0.047619047619047616
stopped    66         0.09523809523809523
stopped    63         0.14285714285714285
stopped    61         0.19047619047619047
stopped    59         0.23809523809523808
stopped    57         0.2857142857142857
stopped    56         0.3333333333333333
stopped    55         0.38095238095238093
stopped    54         0.42857142857142855
stopped    53         0.47619047619047616
stopped    52         0.5238095238095238
stopped    51         0.5714285714285714
stopped    29         0.6190476190476191
stopped    23         0.6666666666666666
stopped    13         0.7142857142857143
stopped    12         0.7619047619047619
stopped    11         0.8095238095238095
stopped    9          0.8571428571428571
stopped    8          0.9047619047619048
stopped    7          0.9523809523809523
stopped    6          1.0

 

Final Query:

select q.status, min(q.proc_id) as proc_id
from (
  select * from (
    select status, proc_id,
           percent_rank() over (PARTITION BY status ORDER BY proc_id DESC) as proc_rank_desc
    from coll
  ) ranked_table
  where ranked_table.proc_rank_desc >= 0.95
) q
group by q.status;

Final Result:

status     proc_id
started    1
stopped    6

 

 

Friday, September 26, 2014

Collections in CQL3 - How they are stored


If you don't already know about collections in Cassandra CQL, the official documentation provides excellent details about them.

However, if you have been using Cassandra since pre-CQL days, you would have worked with the low-level Thrift APIs, and you may be tempted to wonder how the data looks in Cassandra's internal storage structure [which is very well articulated (exposed?) by the Thrift APIs].
I have a big hangover from my extensive work with the Thrift API, so I always get tempted to look at how my CQL data appears in the internal storage structure.

Following is a CQL column family containing different types of collections – set, list, and map – followed by how the same data looks in the internal storage structure:

CQL:

cqlsh:dummy> describe table users;

CREATE TABLE users (
  user_id text,
  emails set<text>,
  first_name text,
  last_name text,
  numbermap map<int, int>,
  numbers list<int>,
  todo map<timestamp, text>,
  top_places list<text>,
  PRIMARY KEY (user_id)
)

cqlsh:dummy> select * from users;

 user_id | emails                                 | first_name | last_name | numbers   | todo                                   | top_places             | numbermap
---------+----------------------------------------+------------+-----------+-----------+----------------------------------------+------------------------+----------------
 frodo   | {'baggins@gmail.com', 'f@baggins.com'} | Frodo      | Baggins   | [1, 2, 3] | {'2012-09-24 00:00:00-0700': 'value1'} | ['rivendell', 'rohan'] | {1: 11, 2: 12}

Internal Storage Structure: (cassandra-cli result)
RowKey: frodo
=> (name=, value=, timestamp=1411701396643000)
=> (name=emails:62616767696e7340676d61696c2e636f6d, value=, timestamp=1411701396643000)
=> (name=emails:664062616767696e732e636f6d, value=, timestamp=1411701396643000)
=> (name=first_name, value=46726f646f, timestamp=1411701396643000)
=> (name=last_name, value=42616767696e73, timestamp=1411701396643000)
=> (name=numbermap:00000001, value=0000000b, timestamp=1411703133238000)
=> (name=numbermap:00000002, value=0000000c, timestamp=1411703133238000)
=> (name=numbers:534eaca0452c11e4932561c636c97db3, value=00000001, timestamp=1411701740650000)
=> (name=numbers:534eaca1452c11e4932561c636c97db3, value=00000002, timestamp=1411701740650000)
=> (name=numbers:534eaca2452c11e4932561c636c97db3, value=00000003, timestamp=1411701740650000)
=> (name=todo:00000139f7134980, value=76616c756531, timestamp=1411702558812000)
=> (name=top_places:a3300bb0452c11e4932561c636c97db3, value=726976656e64656c6c, timestamp=1411701874667000)
=> (name=top_places:a3300bb1452c11e4932561c636c97db3, value=726f68616e, timestamp=1411701874667000)

Some important points to note:
- The 'set' field (emails) does not have any column values in the CLI result. Since a set stores unique items, the values themselves (hex-encoded in the CLI display) are stored as part of the column names only!
- On the contrary, since the 'list' fields (numbers/top_places) are allowed to contain duplicate values, the actual value of each list element is stored in the column value (the column name carries a time-based UUID), so duplicate elements do not overwrite each other!
- For the 'map' fields (numbermap/todo), the hex-encoded key goes into the column name and the hex-encoded value goes into the column value.
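
For reference, a hedged sketch of the kind of CQL statements that would produce a row like the one above (the exact statements used for the original data are not shown in this post):

INSERT INTO users (user_id, first_name, last_name, emails)
VALUES ('frodo', 'Frodo', 'Baggins', {'baggins@gmail.com', 'f@baggins.com'});
UPDATE users SET numbers = [1, 2, 3] WHERE user_id = 'frodo';
UPDATE users SET top_places = ['rivendell', 'rohan'] WHERE user_id = 'frodo';
UPDATE users SET todo = {'2012-09-24': 'value1'} WHERE user_id = 'frodo';
UPDATE users SET numbermap = {1: 11, 2: 12} WHERE user_id = 'frodo';

Each element of a collection becomes its own internal column, which is exactly what shows up in the cassandra-cli view above.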

Saturday, September 13, 2014

How to stop compaction for a specific keyspace in Cassandra

Problem statement
How to stop compaction for a specific keyspace in cassandra?

Reason behind the problem statement
We had 17 TB of data in one keyspace, and a lot of heavy compaction was running or pending on it. UAT was running against another keyspace, and during the UAT we wanted that keyspace to perform smoothly (the 17 TB keyspace was not very important during UAT). So we temporarily wanted to stop compaction on the big keyspace to ensure it did not eat up cluster resources.

Solution
Set the compaction threshold's min value high enough that compaction never gets triggered. The min threshold means that compaction will not be triggered until the SSTable count reaches this value. So if you set it to, say, 100,000, compaction will not be triggered until the SSTable count reaches 100,000 – which practically means never!
>> nodetool setcompactionthreshold <keyspace> <cfname> <minthreshold> <maxthreshold>
(I am yet to try this)

Update
Apparently setting the min/max threshold values to 0 stops compaction. Also, this is only possible from JMX and not from the CLI!
Another option is to disable automatic compaction with "nodetool disableautocompaction".

Yet to verify this.
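
For reference, a minimal sketch of the commands involved (keyspace/column-family names are placeholders; as noted, I have not verified these myself):

nodetool getcompactionthreshold my_keyspace my_cf
nodetool setcompactionthreshold my_keyspace my_cf 100000 100001
nodetool disableautocompaction my_keyspace my_cf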





Friday, September 12, 2014

Compaction Strategy in cassandra

Cassandra supports two basic compaction strategies:
  • Size-tiered compaction
  • Leveled compaction

Before talking about these two compaction strategies, let's take a look at what compaction is.

In Cassandra, each write goes to a memtable and the commit log in real time, and when the memtable fills up (or when memtables are flushed manually), the data is written to persistent files called SSTables.
The memtable, as the name suggests, is an in-memory table, and the commit log ensures that data can be recovered if a node crashes before the data is written from the memtable to SSTables.

Even for update operations, the data is written to SSTables sequentially, unlike an RDBMS where the existing rows are located in the data files and then updated in place. This is done to ensure good write performance, as the heavy operation of seeking an existing row in the data files (here, SSTables) is eliminated.

However, the flip side of this approach is that the data for a single row may span multiple SSTables. This impacts reads, because multiple SSTables may have to be accessed in order to read a single row.
To avoid this, Cassandra performs 'compaction' on each column family. During compaction, the data of each row is merged into a single SSTable. This ensures good read performance by placing the data for each row in a single file.
However, we must recognize that since compaction is an asynchronous process, in the worst case a read will still be served from multiple SSTables until the SSTables containing that row have been compacted.
This becomes a critical factor in selecting the compaction strategy for your column family.

As mentioned above, there are two strategies for the compaction process.

Size-Tiered Compaction

In this compaction process, SSTables of a fixed size are created initially, and once the number of SSTables reaches a threshold count, they are compacted into a bigger SSTable.
For example, let's consider an initial SSTable size of 5 MB and a threshold count of 5: compaction is triggered once 5 SSTables of 5 MB each have been filled, and the compaction process creates a single bigger SSTable by compacting all 5 of them.

Leveled Compaction

In this compaction process, sstables of a fixed, relatively small size (5MB by default in Cassandra’s implementation) are created and grouped into “levels”. Within each level, sstables are guaranteed to be non-overlapping. Each level is ten times as large as the previous.
For example,
L0: SSTables with size 5MB
L1: SSTables with size 50MB
L2: SSTables with size 500MB
…………………………………………….

New SSTables added at level L(n) are immediately compacted with the SSTables at level L(n+1). Once level L(n+1) gets filled, extra SSTables are promoted to level L(n+2), and subsequent SSTables are compacted with the SSTables in level L(n+2).
This strategy ensures that 90% of all reads will be satisfied from a single SSTable.

Comparison at high level

  • Size-tiered compaction is better suited for write-heavy applications. As the SSTables are not immediately compacted with the next level, write performance is better.
  • With size-tiered compaction, read performance is severely hit for column families having very wide rows with frequent updates (for example, manual index rows). I have observed a case where a single read accessed almost 800 SSTables for a column family whose wide row contained more than a million columns – a case of frequent read timeouts! (This is also a lesson for data-model design and explains why wide rows with millions of columns should be avoided!)
  • Leveled compaction is better suited for column families with heavy read operations and/or heavy update operations. (Important: note the difference between a write and an update!) The strategy can be chosen per table, as sketched below.
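
A table's compaction strategy can be switched with a plain CQL statement; a hedged sketch with placeholder keyspace/table names:

ALTER TABLE my_keyspace.my_table
  WITH compaction = { 'class' : 'SizeTieredCompactionStrategy' };

ALTER TABLE my_keyspace.my_table
  WITH compaction = { 'class' : 'LeveledCompactionStrategy' };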



Thursday, September 4, 2014

JVM Command: JInfo

In Linux, 'top' is typically used to view the details of running processes. Often you want to see the full process command line instead of just the process name; using the 'c' option in top, you can see the full command almost always.
However, the OS typically limits the visible length of the full command line to 4096 characters. So if your process has a longer command line – especially common for Java processes – you won't see all of it, neither in 'top' nor in 'ps -ef | grep <pid>'.

A good way to see the exact details of a Java process is to use 'jinfo'. Following is an example for such a process with a long command line.

<JAVA_HOME>/bin/jinfo <process-id>

Example:
[root@home ~]$ /usr/java/jdk1.6.0_45/bin/jinfo 55761
Attaching to process ID 55761, please wait...
Debugger attached successfully.
Server compiler detected.
JVM version is 20.45-b01
Java System Properties:

java.runtime.name = Java(TM) SE Runtime Environment
sun.boot.library.path = /usr/java/jdk1.6.0_45/jre/lib/amd64
java.vm.version = 20.45-b01
java.vm.vendor = Sun Microsystems Inc.
java.vendor.url = http://java.sun.com/
storm.home = /opt/storm-0.8.2
path.separator = :
java.vm.name = Java HotSpot(TM) 64-Bit Server VM
file.encoding.pkg = sun.io
sun.java.launcher = SUN_STANDARD
user.country = US
sun.os.patch.level = unknown
java.vm.specification.name = Java Virtual Machine Specification
user.dir = /home/gse
java.runtime.version = 1.6.0_45-b06
java.awt.graphicsenv = sun.awt.X11GraphicsEnvironment
java.endorsed.dirs = /usr/java/jdk1.6.0_45/jre/lib/endorsed
os.arch = amd64
java.io.tmpdir = /tmp
line.separator =

java.vm.specification.vendor = Sun Microsystems Inc.
os.name = Linux
log4j.configuration = storm.log.properties
sun.jnu.encoding = UTF-8
java.library.path = /usr/local/lib:/opt/local/lib:/usr/lib
java.specification.name = Java Platform API Specification
java.class.version = 50.0
sun.management.compiler = HotSpot 64-Bit Tiered Compilers
os.version = 2.6.32-358.11.1.el6.x86_64
user.home = /root
user.timezone = GMT
java.awt.printerjob = sun.print.PSPrinterJob
file.encoding = UTF-8
java.specification.version = 1.6
java.class.path = /opt/storm-0.8.2/storm-0.8.2.jar:/opt/storm-0.8.2/lib/jgrapht-0.8.3.jar:/opt/storm-0.8.2/lib/log4j-1.2.16.jar:/opt/storm-0.8.2/lib/kryo-2.17.jar:/opt/storm-0.8.2/lib/guava-13.0.jar:/opt/storm-0.8.2/lib/disruptor-2.10.1.jar:/opt/storm-0.8.2/lib/objenesis-1.2.jar:/opt/storm-0.8.2/lib/junit-3.8.1.jar:/opt/storm-0.8.2/lib/joda-time-2.0.jar:/opt/storm-0.8.2/lib/commons-exec-1.1.jar:/opt/storm-0.8.2/lib/tools.logging-0.2.3.jar:/opt/storm-0.8.2/lib/curator-client-1.0.1.jar:/opt/storm-0.8.2/lib/zookeeper-3.3.3.jar:/opt/storm-0.8.2/lib/ring-core-1.1.5.jar:/opt/storm-0.8.2/lib/commons-fileupload-1.2.1.jar:/opt/storm-0.8.2/lib/tools.macro-0.1.0.jar:/opt/storm-0.8.2/lib/carbonite-1.5.0.jar:/opt/storm-0.8.2/lib/core.incubator-0.1.0.jar:/opt/storm-0.8.2/lib/commons-codec-1.4.jar:/opt/storm-0.8.2/lib/ring-jetty-adapter-0.3.11.jar:/opt/storm-0.8.2/lib/minlog-1.2.jar:/opt/storm-0.8.2/lib/clojure-1.4.0.jar:/opt/storm-0.8.2/lib/clj-time-0.4.1.jar:/opt/storm-0.8.2/lib/commons-logging-1.1.1.jar:/opt/storm-0.8.2/lib/libthrift7-0.7.0.jar:/opt/storm-0.8.2/lib/clout-1.0.1.jar:/opt/storm-0.8.2/lib/jline-0.9.94.jar:/opt/storm-0.8.2/lib/slf4j-log4j12-1.5.8.jar:/opt/storm-0.8.2/lib/jzmq-2.1.0.jar:/opt/storm-0.8.2/lib/servlet-api-2.5.jar:/opt/storm-0.8.2/lib/commons-io-1.4.jar:/opt/storm-0.8.2/lib/httpcore-4.1.jar:/opt/storm-0.8.2/lib/jetty-util-6.1.26.jar:/opt/storm-0.8.2/lib/httpclient-4.1.1.jar:/opt/storm-0.8.2/lib/commons-lang-2.5.jar:/opt/storm-0.8.2/lib/math.numeric-tower-0.0.1.jar:/opt/storm-0.8.2/lib/snakeyaml-1.9.jar:/opt/storm-0.8.2/lib/json-simple-1.1.jar:/opt/storm-0.8.2/lib/compojure-1.1.3.jar:/opt/storm-0.8.2/lib/servlet-api-2.5-20081211.jar:/opt/storm-0.8.2/lib/hiccup-0.3.6.jar:/opt/storm-0.8.2/lib/jetty-6.1.26.jar:/opt/storm-0.8.2/lib/curator-framework-1.0.1.jar:/opt/storm-0.8.2/lib/ring-servlet-0.3.11.jar:/opt/storm-0.8.2/lib/asm-4.0.jar:/opt/storm-0.8.2/lib/reflectasm-1.07-shaded.jar:/opt/storm-0.8.2/lib/tools.cli-0.2.2.jar:/opt/storm-0.8.2/lib/slf4j-api-1.5.8.jar:/opt/storm-0.8.2/log4j:/opt/storm-0.8.2/conf
user.name = root
java.vm.specification.version = 1.0
sun.java.command = backtype.storm.daemon.worker xyz56-36-1406612749 d39a4666-52a2-4099-8ca5-ee9aac670783 6702 d6db40f4-032b-443b-9c7e-b0ddc6b80b31
java.home = /usr/java/jdk1.6.0_45/jre
sun.arch.data.model = 64
user.language = en
java.specification.vendor = Sun Microsystems Inc.
java.vm.info = mixed mode
java.version = 1.6.0_45
java.ext.dirs = /usr/java/jdk1.6.0_45/jre/lib/ext:/usr/java/packages/lib/ext
logfile.name = worker-6702.log
sun.boot.class.path = /usr/java/jdk1.6.0_45/jre/lib/resources.jar:/usr/java/jdk1.6.0_45/jre/lib/rt.jar:/usr/java/jdk1.6.0_45/jre/lib/sunrsasign.jar:/usr/java/jdk1.6.0_45/jre/lib/jsse.jar:/usr/java/jdk1.6.0_45/jre/lib/jce.jar:/usr/java/jdk1.6.0_45/jre/lib/charsets.jar:/usr/java/jdk1.6.0_45/jre/lib/modules/jdk.boot.jar:/usr/java/jdk1.6.0_45/jre/classes
java.vendor = Sun Microsystems Inc.
file.separator = /
java.vendor.url.bug = http://java.sun.com/cgi-bin/bugreport.cgi
sun.io.unicode.encoding = UnicodeLittle
sun.cpu.endian = little
sun.cpu.isalist =

VM Flags:

-Xmx768m -Djava.library.path=/usr/local/lib:/opt/local/lib:/usr/lib -Dlogfile.name=worker-6702.log -Dstorm.home=/opt/storm-0.8.2 -Dlog4j.configuration=storm.log.properties

Usage:
    jinfo [option] <pid>
        (to connect to running process)
    jinfo [option] <executable <core>
        (to connect to a core file)
    jinfo [option] [server_id@]<remote server IP or hostname>
        (to connect to remote debug server)

where <option> is one of:
    -flag <name>         to print the value of the named VM flag
    -flag [+|-]<name>    to enable or disable the named VM flag
    -flag <name>=<value> to set the named VM flag to the given value
    -flags               to print VM flags
    -sysprops            to print Java system properties
    <no option>          to print both of the above
    -h | -help           to print this help message
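
You can also query or toggle a single VM flag with jinfo; a minimal sketch using the PID from the example above (PrintGCDetails is just an illustrative manageable flag):

/usr/java/jdk1.6.0_45/bin/jinfo -flag MaxHeapSize 55761
/usr/java/jdk1.6.0_45/bin/jinfo -flag +PrintGCDetails 55761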
 

Wednesday, August 20, 2014

Oracle SQL Developer - Java version issue

Oracle SQL Developer (4.2) has an issue with the Java version.
If you specify an incompatible version when you start the app for the first time, it keeps complaining with an error every time you start it.


The config change suggested in the error message is a red herring; it doesn't actually fix the problem. Apparently we need to modify the configuration file %AppData%\sqldeveloper\product.conf instead.
That file contains an entry for SetJavaHome. Comment out the existing entry and add an entry pointing to a valid Java version:

#SetJavaHome C:\Program Files\Java\jdk1.6.0_37
SetJavaHome C:\Java\jdk1.7.0_51

While this solves the Java version issue, SQL Developer may still crash!
The best and quickest way to fix that is to delete the 'SQL Developer' and 'sqldeveloper' folders in the %AppData% folder and start SQL Developer afresh.


Thursday, August 14, 2014

HDFS Read Caching: cache-locality

Applications using HDFS will be able to read data up to 59x faster thanks to this new feature.

http://blog.cloudera.com/blog/2014/08/new-in-cdh-5-1-hdfs-read-caching/

 

Summary:

Centralized cache management is added to the NameNode. The NameNode instructs specific DataNodes to cache specific data, and each DataNode sends a cache report periodically.

Caching is user-driven. The user can specify which files/directories to cache, and the NameNode will ask DataNodes to cache the data by piggybacking a cache command on the DataNode heartbeat reply.
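
A minimal sketch of that user-facing workflow (pool and path names below are placeholders, not taken from the Cloudera post):

hdfs cacheadmin -addPool hot_data
hdfs cacheadmin -addDirective -path /user/hive/warehouse/dim_tables -pool hot_data
hdfs cacheadmin -listDirectives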

 

While scheduling jobs, application schedulers give priority (among the replicas) to the DataNode where the data is available in cache, to take advantage of cache locality.

 

 

Sunday, June 22, 2014

How to externalize property files of dependent libraries using Maven and package them in a tar file alongside the jar file

I was working on a module where the environment-specific property files were packaged inside the jar files. As usual, it made me itchy to have to do a new build for every environment's deployment, where the only change the build would contain is the property values for the new environment.

So the goal here was to externalize the property files and create a tar package for the build, so that the tar can be moved from one environment to another and deployed simply by modifying the values in the externalized property file.

This is essential for managing versioning as well! To me, a new build is a new version even if it only changes a property file, as it opens up the risk of some code change slipping in!

 

Now, the challenge in externalizing the property files was that not only did the main artifact of the current project contain property files, but a dependency also contained an environment-specific property file.

So following were the major challenges:

- Include the dependency without its property file (a jar without the property file)
- Get the dependency's property file and copy it into the tar file built by the current project
- Build the current project's artifact (a war file) without property files and include those property files in the tar file, alongside the property file of the dependency

To add to this, there were other projects using the dependency's original jar file (including the property file), so I didn't want to disrupt the original jar file.

 

Following is the project structure:

 

Main-Project:
- src/main/java
    - <Java Classes>
- src/main/resources
    - main-project.properties

Dep-project: (dependency-project)
- src/main/java
    - <Java Classes>
- src/main/resources
    - dep-project.properties

 

Following are the steps performed to achieve the goal:

1. Make changes in the pom.xml of 'dep-project' to perform the following actions:
   a) Create a new ext-jar file without the property file [leave the original jar (with property file) AS IS]
   b) Create a resources-jar file containing only the property files
2. Make changes in the pom.xml of 'main-project' to perform the following actions:
   a) Include the new ext-jar as a dependency instead of the original dependency jar
   b) Add the new resources-jar as a dependency with scope 'provided'
   c) Create a new ext-war file without property files (keep the existing war file as it is)
   d) Extract the resources-jar and copy the dependency's property file into the tar file
   e) Use the assembly plugin to create the tar file

 

 

Now let's look at the pom.xml changes for each of the steps mentioned above:

 

Step-1a: Create a new ext-jar file without property file [Leave original jar (with property file) AS IS]

file: dep-project/pom.xml

<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-jar-plugin</artifactId>
  <version>2.3.1</version>
  <executions>
    <execution>
      <id>ext</id>
      <goals>
        <goal>jar</goal>
      </goals>
      <phase>package</phase>
      <configuration>
        <classifier>ext</classifier>
        <excludes>
          <exclude>src/main/resources/*</exclude>
          <exclude>**/dep-project.properties</exclude>
        </excludes>
      </configuration>
    </execution>
  </executions>
</plugin>

 

Step-1b: Create a resource-jar file containing only property files

file: dep-project/pom.xml

<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-assembly-plugin</artifactId>
  <executions>
    <execution>
      <id>make shared resources</id>
      <goals>
        <goal>single</goal>
      </goals>
      <phase>package</phase>
      <configuration>
        <descriptors>
          <descriptor>src/main/assembly/resources.xml</descriptor>
        </descriptors>
      </configuration>
    </execution>
  </executions>
</plugin>

New file: src/main/assembly/resources.xml

<assembly>
  <id>resources</id>
  <formats>
    <format>jar</format>
  </formats>
  <includeBaseDirectory>false</includeBaseDirectory>
  <fileSets>
    <fileSet>
      <directory>src/main/resources</directory>
      <outputDirectory>resources</outputDirectory>
    </fileSet>
    <!-- include profile files if applicable -->
    <fileSet>
      <directory>profiles</directory>
      <outputDirectory>profiles</outputDirectory>
    </fileSet>
  </fileSets>
</assembly>

 

Step-2a: Include the new jar ext-jar as dependency instead of original dependency jar

               file: main-project/pom.xml

<dependency>
  <groupId>{AS IS}</groupId>
  <artifactId>dep-project</artifactId>
  <classifier>ext</classifier> <!-- classifier added based on step-1a -->
  <version>{AS IS}</version>
</dependency>

 

 

Step-2b: Add new resource-jar as dependency with scope ‘provided’

file: main-project/pom.xml

<dependency>
  <groupId>{AS IS}</groupId>
  <artifactId>dep-project</artifactId>
  <classifier>resources</classifier>
  <type>jar</type>
  <version>{AS IS}</version>
  <!-- Make sure this isn't included on the classpath -->
  <scope>provided</scope>
</dependency>

 

Step-2c: Create new ext-war file without properties files. (keep existing war file as it is)

<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-war-plugin</artifactId>
  <executions>
    <execution>
      <id>externalized</id>
      <goals>
        <goal>war</goal>
      </goals>
      <phase>package</phase>
      <configuration>
        <packagingExcludes>src/main/resources, **/dep-project.properties, **/main-project.properties</packagingExcludes>
        <warName>${final.war.name}</warName>
        <classifier>ext</classifier>
      </configuration>
    </execution>
  </executions>
</plugin>

 

Step-2d: Extract resource-jar and copy property file of dependency to the tar file.

<!-- unpack jars with dependency resources -->
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-dependency-plugin</artifactId>
  <executions>
    <execution>
      <id>unpack-dep-resources</id>
      <goals>
        <goal>unpack-dependencies</goal>
      </goals>
      <phase>generate-resources</phase>
      <configuration>
        <outputDirectory>${project.build.directory}/dependency-resources</outputDirectory>
        <includeGroupIds>{AS IS}</includeGroupIds>
        <includeArtifactIds>dep-project</includeArtifactIds>
        <includeClassifiers>resources</includeClassifiers>
        <includeScope>provided</includeScope>
        <excludeTransitive>true</excludeTransitive>
      </configuration>
    </execution>
  </executions>
</plugin>

 

Step-2e: Use assembly plugin to create tar file

<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-assembly-plugin</artifactId>
  <version>2.2-beta-5</version>
  <executions>
    <execution>
      <phase>package</phase>
      <goals>
        <goal>attached</goal>
      </goals>
    </execution>
  </executions>
  <configuration>
    <descriptors>
      <descriptor>src/main/assembly/binary-deployment.xml</descriptor>
    </descriptors>
  </configuration>
</plugin>

File: src/main/assembly/binary-deployment.xml

<assembly>
  <id></id>
  <formats>
    <format>tar.gz</format>
  </formats>
  <includeBaseDirectory>true</includeBaseDirectory>
  <fileSets>
    <fileSet>
      <directory>${project.build.directory}/classes</directory>
      <outputDirectory></outputDirectory>
      <includes>
        <include>main-project.properties</include>
        <!-- Include dependency resources -->
        <include>dep-project.properties</include>
      </includes>
    </fileSet>
    <fileSet>
      <directory>${project.build.directory}</directory>
      <outputDirectory />
      <includes>
        <include>${final.war.name}-ext.war</include>
      </includes>
    </fileSet>
  </fileSets>
</assembly>
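
Finally, a minimal sketch of building and inspecting the result (the exact tar file name depends on your artifactId, version, and assembly id, so treat the name below as illustrative):

mvn clean package
tar -tzf target/main-project-1.0.tar.gz    # should list the ext war plus the externalized .properties files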

 

Regards,

Sarang